
Principal components analysis

Principal Component Analysis (PCA) is a multivariate statistical technique used to form a
smaller number of uncorrelated variables from a large set of data. The goal of PCA is to
explain the maximum amount of variance with the fewest number of principal components.
PCA is commonly used in the social sciences, market research, and other industries that use
large data sets.

PCA is commonly used as one step in a series of analyses. You can use principal components
analysis to reduce the number of variables and avoid multicollinearity, or when you have too
many predictors relative to the number of observations.

Working of PCA:

1. Calculate the covariance matrix of the data points X.
2. Calculate the eigenvectors and their corresponding eigenvalues.
3. Sort the eigenvectors by their eigenvalues in decreasing order.
4. Choose the first k eigenvectors; these become the new k dimensions.
5. Transform the original n-dimensional data points into the k dimensions (see the sketch below).
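
A minimal sketch of these five steps in Python with NumPy (the toy data, sizes, and variable names are illustrative, not part of the original example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 observations, 5 variables (toy data)
Xc = X - X.mean(axis=0)              # mean-center each variable

# Step 1: covariance matrix of the data
C = np.cov(Xc, rowvar=False)

# Steps 2-3: eigendecomposition; eigh returns eigenvalues in ascending
# order for a symmetric matrix, so flip to decreasing order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the first k eigenvectors as the new dimensions
k = 2
W = eigvecs[:, :k]

# Step 5: project the n-dimensional points into the k dimensions
scores = Xc @ W                      # shape (100, 2)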

For example, a consumer products company wants to analyze customer responses to several
characteristics of a new shampoo: color, smell, texture, cleanliness, shine, volume, amount
needed to lather, and price. They conduct a principal components analysis to see if they can
form a smaller number of uncorrelated variables that are easier to interpret and analyze. The
results suggest the following patterns:

Color, smell, and texture form a "Shampoo quality" component.

Cleanliness, shine, and volume form an "Effect on hair" component.

Amount needed to lather and price form a "Value" component.


Principal components method

In PCA, one first finds the set of orthogonal eigenvectors of the correlation or covariance
matrix of the variables. The matrix of principal component scores is the product of the matrix
of observed variables and the eigenvector matrix. The first principal component accounts for
the largest percent of the total data variation, the second principal component accounts for the
second largest percent, and so on. The goal of principal components is to explain the
maximum amount of variance with the fewest number of components.
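
Since the eigenvalues are the variances of the components, the percent of total variation each component accounts for can be read directly off the eigenvalues. A short NumPy sketch (the data here is illustrative):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                    # illustrative data
C = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # component variances, decreasing

explained = eigvals / eigvals.sum()              # fraction of total variation
print(np.round(100 * explained, 1))              # percent per component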

Terminology associated with PCA:

Eigenvectors

Eigenvectors, which consist of the coefficients corresponding to each variable, are the
weights for each variable used to calculate the principal component scores.

Scores

The linear combinations of the original variables that account for the variance in the data.

Eigenvalue

The eigenvalues are the variances of the principal components.
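
A quick numerical check of this definition (toy data, not from the source): the variance of each column of scores equals the corresponding eigenvalue.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

scores = Xc @ eigvecs                                      # principal component scores
print(np.allclose(scores.var(axis=0, ddof=1), eigvals))    # True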

Principal Component Model

The method involves decomposing a data matrix X into a structure part and
a noise part. The PC model is the matrix product TP^T (the structure):

X = TP^T + E = Structure + Noise

We assume that X can be split into the sum of the matrix product TP^T and the residual matrix
E.

Scores: T
The scores are the structure part of the PCA: a summary of the original variables
in X that describes how the different rows of X (the observations) relate to each
other. In the T-matrix, column 1 (t1) contains the scores of the first PC, column 2
the scores of the second PC, and so on.

Loadings: P
The loadings are the structure part of the PCA that carries the weights (influence)
of the variables in X on the scores T. From the loadings we can see which variables
are responsible for the patterns found in the scores T, using the loadings plot.
This plot is simply the loadings of one PC plotted against the loadings of another;
it shows how the scores and loadings relate, and it can be read as a map of the
variables.

Residuals: E
The residuals (the E-matrix) are the noise part of the PCA, an n x p matrix.
E is not part of the model; it is the part of X that is not explained by
the model TP^T.
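
A sketch of the decomposition X = TP^T + E in NumPy, computing T and P for k components via the SVD (mathematically equivalent to the eigenvector route above; the data and the choice of k are illustrative):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))
Xc = X - X.mean(axis=0)

k = 2
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :k] * s[:k]            # scores, n x k
P = Vt[:k].T                    # loadings, p x k

E = Xc - T @ P.T                # residuals: the part the model does not explain
print(np.allclose(Xc, T @ P.T + E))   # X = TP^T + E holds by construction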

NIPALS algorithm

The NIPALS ("Nonlinear Iterative Partial Least Squares") algorithm is employed for
estimating the parameters of the PCA model. The steps of the algorithm are listed below:

X is a mean-centered data matrix.

E(0) = X: the E-matrix for the zeroth PC (PC0) is the mean-centered X.
t is set to a column in X; t will be the scores for PCi.
p will be the loadings for PCi.
threshold = 0.00001, a small value used for the convergence check.

Iterations (i = 1 to number of PCs):

1. Project X onto t to find the corresponding loading p:
   p = (E(i-1)^T t) / (t^T t)

2. Normalize the loading vector p to length 1:
   p = p (p^T p)^(-1/2)

3. Project X onto p to find the corresponding score vector t:
   t = (E(i-1) p) / (p^T p)

4. Check for convergence: if the difference between the eigenvalue estimates
   Tnew = t^T t and Told (from the last iteration) is larger than threshold * Tnew,
   return to step 1.

5. Remove the estimated PC component from E(i-1):
   E(i) = E(i-1) - t p^T
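
The steps above translate directly into a short NumPy implementation. This is a sketch under the stated assumption that X is mean-centered; the function name, the max_iter safeguard, and the test data are illustrative:

import numpy as np

def nipals_pca(X, n_components, threshold=0.00001, max_iter=500):
    # X must already be mean-centered; returns scores T, loadings P, residuals E
    n, p = X.shape
    T = np.zeros((n, n_components))     # scores
    P = np.zeros((p, n_components))     # loadings
    E = X.copy()                        # E(0) = X
    for i in range(n_components):
        t = E[:, 0].copy()              # t is set to a column in E
        tau_old = 0.0
        for _ in range(max_iter):
            p_vec = E.T @ t / (t @ t)            # 1. project onto t -> loading p
            p_vec /= np.sqrt(p_vec @ p_vec)      # 2. normalize p to length 1
            t = E @ p_vec                        # 3. project onto p -> score t (p^T p = 1)
            tau_new = t @ t                      # eigenvalue estimate
            if abs(tau_new - tau_old) <= threshold * tau_new:   # 4. converged?
                break
            tau_old = tau_new
        E = E - np.outer(t, p_vec)               # 5. deflate: remove this PC from E
        T[:, i], P[:, i] = t, p_vec
    return T, P, E

# usage: center the data first, then fit two components
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
T, P, E = nipals_pca(X - X.mean(axis=0), n_components=2)
print(np.allclose(X - X.mean(axis=0), T @ P.T + E))   # True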
