Principal Component Analysis: An Introduction
April 15th, 2015
Ari Paul
Overview
What is Principal Component Analysis (PCA)?
PCA is a tool for analyzing a complex data set and re-expressing it in simpler terms. The textbook definition is: a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. It replaces a set of observations with a new dataset that may better explain the underlying dynamics of the system using less data.
Before diving into the meat of PCA, I'll provide 4 slides on Linear Algebra, a necessary toolkit for understanding and performing PCA.
Outline:

Linear Algebra                                    Slide
  Terminology and Basic Matrix Math                 3
  Matrices                                          4
  Determinants and Characteristic Equation          5
  Eigenvectors and Eigenvalues                      6
PCA
  Introduction                                      7
  Step 1: Centering and Scaling                     8
  Step 2: Covariance Matrix                         8
  Step 3: Calculating Eigenvectors                  9
  Step 3 (cont): Eigenvector Explanation           10
  Step 4: Re-express the Data                      11
  Interpretation                                   12
  Assumptions and Pitfalls                         13
Appendices
  Appendix: Algebraic Support for Eigenvectors     14
The identity matrix, I, is a special matrix with 1s along the diagonal and zeroes everywhere else. It is square, but can be of any dimension. It has the interesting property that any matrix A multiplied by it will equal itself: A x I = A.
Calculating reciprocals of matrices is tricky, but all we need to know for basic PCA is that a matrix multiplied by its reciprocal (inverse) will always equal the identity matrix, I: A x A^-1 = I. For a reciprocal of a matrix to exist, the matrix must be square (m = n) and have a non-zero determinant.
For a 2x2 matrix with entries a, b (top row) and c, d (bottom row), the determinant is |A| = ad - bc.
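The deck points to software like Matlab for matrix work; purely as an illustration, here is a minimal NumPy sketch (Python, with an arbitrary matrix of my own choosing) that verifies the three properties above:

    import numpy as np

    # An arbitrary invertible 2x2 matrix, for illustration only
    A = np.array([[4.0, 7.0],
                  [2.0, 6.0]])

    I = np.eye(2)                        # the identity matrix
    print(np.allclose(A @ I, A))         # A x I = A        -> True

    print(np.linalg.det(A))              # |A| = ad - bc = 4*6 - 7*2 = 10
    A_inv = np.linalg.inv(A)             # exists because |A| is non-zero
    print(np.allclose(A @ A_inv, I))     # A x A^-1 = I     -> True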
In Linear Algebra terminology, an eigenvector is a vector that points in a direction which is invariant under the associated linear transformation. The defining equation is:
Av = λv
In this equation, λ is a real number (scalar) known as the eigenvalue, v is the eigenvector, and A is the original matrix. This equation says that a vector v is an eigenvector of matrix A if the resulting product can be restated as a scalar multiple of v. The intuition here is that the vector v is being scaled or stretched by matrix A, but is not otherwise changing (i.e. its direction is unaffected). This will be clearer with an example.
Start with a square matrix A. The characteristic equation
|A - λI| = 0
has roots λ = 1 and λ = 3; these are A's eigenvalues. To find the eigenvectors, we use the roots to solve the equation Av = λv.
For λ = 3, solving (A - 3I)v = 0 gives the corresponding eigenvector, and likewise for λ = 1. Because the example matrix is symmetric, the eigenvectors will be orthogonal.
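The slide's example matrix did not survive the conversion of this document to text, so as a stand-in, here is a minimal NumPy sketch using [[2, 1], [1, 2]], a common symmetric 2x2 example that also has eigenvalues 1 and 3:

    import numpy as np

    # Stand-in example matrix (symmetric, eigenvalues 1 and 3); the deck's
    # own matrix is not recoverable from this copy.
    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    vals, vecs = np.linalg.eigh(A)       # eigh: solver for symmetric matrices
    print(vals)                          # [1. 3.]

    v = vecs[:, 1]                       # eigenvector paired with lambda = 3
    print(np.allclose(A @ v, 3 * v))     # A v = lambda v   -> True

    # Symmetric matrix, so the eigenvectors are orthogonal
    print(np.isclose(vecs[:, 0] @ vecs[:, 1], 0.0))   # True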
PCA - Introduction
Having established a mathematical framework, we can now work through the PCA example step by step.
Our Example:
I find PCA easiest to understand via example, so I'll walk through a calculation using a very simple data set. Imagine that we measure the height, neck circumference, and armspan of 4 individuals. Intuitively, these 3 variables are likely to be correlated (e.g. a taller person is likely to have a longer armspan). Our goal is to re-express the dataset using fewer variables, to better and more simply capture the system's underlying dynamics. All measurements are in inches.
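Steps 1 and 2 of the process (centering/scaling and the covariance matrix) can be sketched in a few lines of NumPy. The measurements below are placeholders of my own, since the deck's data table is not reproduced in this copy; only the shape of the calculation matters:

    import numpy as np

    # Placeholder measurements for 4 people (inches): height, neck, armspan
    X = np.array([[70.0, 15.2, 71.0],
                  [64.0, 13.8, 63.5],
                  [72.0, 15.9, 73.5],
                  [67.0, 14.4, 66.0]])

    # Step 1: center each variable on its mean and scale by its standard
    # deviation, so all variables are in comparable units
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: the covariance matrix of standardized data is the correlation
    # matrix, with 1s on the diagonal
    C = np.cov(Z, rowvar=False)
    print(C.round(4))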
PCA Step 3
3. Calculate Eigenvectors and Eigenvalues, and Sort by Associated Eigenvalue
Realistically you'll be using a software package like Matlab to calculate the eigenvectors and eigenvalues, because doing so by hand is prohibitively difficult for larger matrices, but this example is simple enough for us to work through in detail. This follows the process laid out on slides 5 and 6.
The equation Av = λv is rewritten as (A - λI)v = 0, and we solve it by setting the determinant |A - λI| equal to zero.
The determinant of a 3x3 matrix B, with rows a, b, c; d, e, f; g, h, i, is |B| = a(ei - fh) - b(di - fg) + c(dh - eg) = aei - afh - bdi + bfg + cdh - ceg.
Using the characteristic equation of our covariance matrix, this is:
(t-1)(t-1)(t-1) - (t-1)(0.7166)^2 - (0.9676)^2(t-1) - (0.9676)(0.7166)(0.8534) - (0.8534)(0.9676)(0.7166) - (0.8534)^2(t-1) = 0
which reduces to: t^3 - 3t^2 + 0.8219t - 0.0054 = 0
This polynomial has the roots 2.6959, 0.0067, and 0.2974. These are the eigenvalues of the covariance matrix, and their relative sizes reflect their relative importance. The eigenvalues will always sum to the number of variables (in this case 3), because each standardized variable contributes a variance of 1. The eigenvalues tell us that the first eigenvector will contain 89.9% of the variance of the entire dataset (2.6959/3), the second will contain just 0.2%, and the third 9.9%. We then solve for v as shown on slide 6 and get the eigenvectors (0.6052, 0.5770, 0.5484), (0.7779, -0.5748, -0.2537), and (-0.1688, -0.5802, 0.7968) respectively. If you graphed these vectors on a 3-dimensional plot, they would all be perpendicular to one another.
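As a quick check, the quoted roots can be recovered from the reduced cubic with a NumPy one-liner (the order of the returned roots may vary):

    import numpy as np

    # Coefficients of t^3 - 3t^2 + 0.8219t - 0.0054, as quoted above
    print(np.roots([1.0, -3.0, 0.8219, -0.0054]))   # ~2.6959, ~0.2974, ~0.0067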
We then sort the eigenvectors in order of their associated eigenvalue (highest eigenvalue first); arranging the sorted eigenvectors as columns, we get the following matrix:

    0.6052   -0.1688    0.7779
    0.5770   -0.5802   -0.5748
    0.5484    0.7968   -0.2537
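An eigen-solver reproduces these numbers directly; a minimal sketch, assuming the correlation matrix implied by the characteristic equation above (column signs may come out flipped, since an eigenvector is only defined up to sign):

    import numpy as np

    # Correlation matrix reconstructed from the values quoted in the deck
    C = np.array([[1.0,    0.9676, 0.8534],
                  [0.9676, 1.0,    0.7166],
                  [0.8534, 0.7166, 1.0]])

    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]        # highest eigenvalue first
    vals, V = vals[order], vecs[:, order]

    print(vals.round(4))                  # [2.6959 0.2974 0.0067]
    print(round(vals.sum(), 4))           # 3.0: sums to the number of variables
    print(V.round(4))                     # columns are the sorted eigenvectors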
PCA Step 4
4. Multiply the Standardized Data by the Eigenvectors
The ultimate goal of PCA is to re-express the original dataset, and now we're finally ready to take that step and generate the actual principal components. We simply multiply our scaled and cleaned dataset by the matrix of eigenvectors that we generated in step 3. As a matrix operation, it's as simple as A x V. To generate a particular datapoint for Principal Component 1, we add together the first observation of each variable from the scaled/cleaned dataset, with each one weighted by its associated eigenvector loading. So, if PC1 is the first Principal Component vector, (x1, y1, z1) is the first observation of the three variables, and V represents our eigenvector matrix, then PC1_1 = x1(v1,1) + y1(v2,1) + z1(v3,1).
Principal Component 1 is uncorrelated to Principal Component 2 by construction. Conceptually, we're re-expressing the original dataset in terms of the eigenvectors. Since the eigenvectors are perpendicular to one another, the principal components are uncorrelated.
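A minimal end-to-end sketch of step 4, using random placeholder data in place of the deck's table (Z plays the role of the scaled/cleaned dataset, V the sorted eigenvector matrix):

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder standardized data: 4 observations of 3 variables
    Z = rng.normal(size=(4, 3))
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)

    # Eigenvectors of the covariance matrix, sorted by eigenvalue
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    V = vecs[:, np.argsort(vals)[::-1]]

    # Step 4: one matrix product re-expresses every observation
    PC = Z @ V

    # The per-element formula from the text, for the first observation
    pc1_first = Z[0, 0] * V[0, 0] + Z[0, 1] * V[1, 0] + Z[0, 2] * V[2, 0]
    print(np.isclose(pc1_first, PC[0, 0]))          # True

    # The principal components are uncorrelated by construction
    print(np.corrcoef(PC, rowvar=False).round(6))   # ~zero off the diagonal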
To interpret the transformed data, it's useful to look at the correlation to the original (unscaled and uncentered) variables. In the table below, we can see that the first principal component (PC1) is extremely similar to Height, but also captures most of the information in Neck and Armspan; as we learned from the eigenvalues, PC1 captures 89.9% of the total information of the dataset. PC2 seems to relate primarily to Neck and Armspan, and PC3 is not closely connected to any of the three variables and adds almost no information. We could discard the second and third principal components and still retain most of the information of the original dataset.
PCA Interpretation
PCA Assumptions and Pitfalls
This section comes directly from Jon Shlens, Tutorial on Principal Component Analysis (2003), an excellent resource.
Credits:
Tutorial on Principal Component Analysis (2003) by Jon Shlens
https://www.mathsisfun.com/ sections on linear algebra.