What Is PCA: When Should You Use PCA?

12/22/2017 Principle Component Analysis
What is PCA
Principal Component Analysis (PCA) is a dimensionality-reduction technique that is often used to
transform a high-dimensional dataset into a smaller-dimensional subspace prior to running a
machine learning algorithm on the data.
When should you use PCA?

It is often helpful to use a dimensionality-reduction technique such as PCA prior to performing
machine learning because:
Reducing the dimensionality of the dataset reduces the size of the space on which k-
nearest-neighbors (kNN) must calculate distance, which improve the performance of kNN.
Reducing the dimensionality of the dataset reduces the number of degrees of freedom of
the hypothesis, which reduces the risk of overfitting.
Most algorithms will run significantly faster if they have fewer dimensions they need to look
at.
Reducing the dimensionality via PCA can simplify the dataset, facilitating description,
visualization, and insight.
What does PCA do?

Principal Component Analysis does just what it advertises; it finds the principal components of the
dataset. PCA transforms the data into a new, lower-dimensional subspace—into a new coordinate
system—. In the new coordinate system, the first axis corresponds to the first principal component,
which is the component that explains the greatest amount of the variance in the data.
Math of PCA
In [1]: import numpy as np
http://localhost:8888/notebooks/session%2023_24/ML10/Principle%20Component%20Analysis.ipynb 1/21
In [2]: # Create (6x2) array

A = np.array([
[ 3, 7],
[-4, -6],
[ 7, 8],
[ 1, -1],
[-4, -1],
[-3, -7]
])
m,n = A.shape # m-observations, n-features
print("Array:")
print(A) # our array
print("---")
print("Dimensions:")
print(A.shape) # shape
print("---")
print("Mean across Rows:")
print(np.mean(A,axis=0))
Array:
[[ 3 7]
[-4 -6]
[ 7 8]
[ 1 -1]
[-4 -1]
[-3 -7]]
---
Dimensions:
(6, 2)
---
Mean across Rows:
[ 0. 0.]
In [3]: # Note: you can convert this easily into a DataFrame ...
import pandas as pd
df = pd.DataFrame(A, columns = ['a0', 'a1'])
print(df)
a0 a1
0 3 7
1 -4 -6
2 7 8
3 1 -1
4 -4 -1
5 -3 -7
In [4]: # ... and can go from df back to np.array

df.values
Out[4]: array([[ 3, 7],

[-4, -6],
[ 7, 8],
[ 1, -1],
[-4, -1],
[-3, -7]])
Covariance
In [5]: import matplotlib

import matplotlib.pyplot as plt
%matplotlib inline
# makes charts pretty

import seaborn as sns
sns.set(color_codes=True)
In [6]: # plots
plt.scatter(A[:,0],A[:,1])
# annotations
for i in range(m):
plt.annotate('('+str(A[i,0])+','+str(A[i,1])+')',(A[i,0]+0.2,A[i,1]+0.2))
# axes
plt.plot([-6,8],[0,0],'grey') # x-axis
plt.plot([0,0],[-8,10],'grey') # y-axis
plt.axis([-6, 8, -8, 10])
plt.axes().set_aspect('equal')
# labels
plt.xlabel("$a_0$")
plt.ylabel("$a_1$")
plt.title("Dataset $A$")
Out[6]: <matplotlib.text.Text at 0x7fd0c6baa438>
Sample covariance between a0 and a1 :

∑m−1 (ai − a¯0 )(ai − a¯1 )
cova0,a1 = i=0 0 m − 1 1
In [7]: # Calculate covariance between a0 and a1
a0 = A[:,0]
a1 = A[:,1]
prod = a0*a1 # element-wise product, ignore means as zero already
print("Length of prod equals " + str(len(prod)))
print("---")
print("Covariance:")
print(np.sum(prod)/(m-1))
Length of prod equals 6

---
Covariance:
25.0
In [8]: # Get more stuff using NumPy's covariance method

np.cov(a0,a1)
Out[8]: array([[ 20., 25.],

[ 25., 40.]])
Linear Algebra way:
Σ = (mA−A1)
T
In [9]: # Aside: What does A.T do?

A.T # or np.transpose(A)
Out[9]: array([[ 3, -4, 7, 1, -4, -3],

[ 7, -6, 8, -1, -1, -7]])
In [10]: # Matrix Multiplication, note @ operator

A.T @ A # or np.dot(A.T,A)
Out[10]: array([[100, 125],

[125, 200]])
In [11]: # Need to divide by (m-1) to yield true Sample Covariance Matrix

# Let's call this Sigma
Sigma = (A.T @ A)/(m-1) # or np.cov(A.T)
Sigma
Out[11]: array([[ 20., 25.],

[ 25., 40.]])
4. Eigen-decomposition of Σ
According to Wikipedia article on PCA
(https://en.m.wikipedia.org/wiki/Principal_component_analysis), "PCA can be done by eigenvalue
decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data
matrix." I choose the first approach.
Σ
* is a real, symmetric matrix; thus, it has 1) real eigenvalues and 2) orthogonal eigenvectors.
In [12]: l, X = np.linalg.eig(Sigma)
print("Eigenvalues:")
print(l)
print("---")
print("Eigenvectors:")
print(X)
Eigenvalues:
[ 3.07417596 56.92582404]
---
Eigenvectors:
[[-0.82806723 -0.56062881]
[ 0.56062881 -0.82806723]]
Recall from your Linear Algebra class that the following should hold:
Σx0 = λ0 x0
Σx1 = λ1 x1
In [13]: # let's check the first Eigenvalue, Eigenvector combination
print("Sigma times eigenvector:")
print(Sigma @ X[:,0]) # 2x2 times 2x1
print("Eigenvalue times eigenvector:")
print(l[0] * X[:,0]) # scalar times 2x1, ANNOYING - MUST USE * vs. @
Sigma times eigenvector:

[-2.54562438 1.72347161]
Eigenvalue times eigenvector:
[-2.54562438 1.72347161]
In [14]: # ... and the second

print("Sigma times eigenvector:")
print(Sigma @ X[:,1]) # 2x2 times 2x1
print("Eigenvalue times eigenvector:")
print(l[1] * X[:,1]) # scalar times 2x1, ANNOYING - MUST USE * vs. @
Sigma times eigenvector:

[-31.91425695 -47.13840945]
Eigenvalue times eigenvector:
[-31.91425695 -47.13840945]
In [15]: print("The first principal component is evector with largest evalue:")

print(X[:,1])
print("---")
print("Second principal component:")
print(X[:,0])
The first principal component is evector with largest evalue:

[-0.56062881 -0.82806723]
---
Second principal component:
[-0.82806723 0.56062881]
In [16]: # Orthogonal? A: Yes

X[:,1].T @ X[:,0]
Out[16]: 0.0
In [17]: # Length 1? A: Yes

print(np.sqrt(X[:,1].T @ X[:,1]))
print(np.sqrt(X[:,0].T @ X[:,0]))
1.0
1.0
In [18]: # plots
plt.scatter(A[:,0],A[:,1])
scale = 3 # increase this scaling factor to highlight these vectors
plt.plot([0,X[0,1]*scale],[0,X[1,1]*scale],'r') # First principal component
plt.plot([0,X[0,0]*scale],[0,X[1,0]*scale],'g') # Second principal component
# annotations
for i in range(m):
plt.annotate('('+str(A[i,0])+','+str(A[i,1])+')',(A[i,0]+0.2,A[i,1]+0.2))
# axes
plt.axis([-6, 8, -8, 10])
# labels
plt.xlabel("$a_0$")
plt.ylabel("$a_1$")
plt.title("Eigenvectors of $\Sigma$")
Out[18]: <matplotlib.text.Text at 0x7fd0c6a41588>
5. Dimensionality Reduction: 2D to 1D
I discovered why Prof. Ng recommended Octave/Matlab versus Python. Linear algebra

expressions are not clean in Python. Had to convert np.array to matrix object. See below.
In [19]: # change to matrix

Amat = np.asmatrix(A)
Xmat = np.asmatrix(X)
In [20]: # Choose eigenvector with highest eigenvalue as first principal component

pc1 = Xmat[:,1]
In [21]: Acomp = Amat @ pc1 # 6x2 @ 2x1 yields 6x1

print("Compressed version of A:")
print(Acomp)
Compressed version of A:
[[ -7.47835704]
[ 7.21091862]
[-10.54893951]
[ 0.26743842]
[ 3.07058247]
[ 7.47835704]]
In [22]: Arec = Acomp @ pc1.T # 6x1 @ 1x2, this breaks with np.array
print("Reconstruction from 1D compression of A:")
print(Arec)
Reconstruction from 1D compression of A:

[[ 4.1925824 6.1925824 ]
[-4.04264872 -5.97112541]
[ 5.9140394 8.73523112]
[-0.14993368 -0.22145699]
[-1.72145699 -2.54264872]
[-4.1925824 -6.1925824 ]]
In [24]: # plots
plt.scatter([Amat[:,0]], [Amat[:,1]]# Dimensionality Reduction: Principal Componen
Here we'll explore **Principal Component Analysis**, which is an extremely useful
We'll start with our standard set of initial imports:
from __future__ import print_function, division
%matplotlib inline
import numpy as np
from scipy import stats
# use seaborn plotting style defaults

import seaborn as sns; sns.set()
## Introducing Principal Component Analysis
Principal Component Analysis is a very powerful unsupervised method for *dimensio
np.random.seed(1)
X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T
plt.plot(X[:, 0], X[:, 1], 'o')
plt.axis('equal');
We can see that there is a definite trend in the data. What PCA seeks to do is to
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_)
print(pca.components_)
To see what these numbers mean, let's view them as vectors plotted on top of the d
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)

for length, vector in zip(pca.explained_variance_, pca.components_):
v = vector * 3 * np.sqrt(length)
plt.plot([0, v[0]], [0, v[1]], '-k', lw=3)
plt.axis('equal');
Notice that one vector is longer than the other. In a sense, this tells us that th
The explained variance quantifies this measure of "importance" in direction.
Another way to think of it is that the second principal component could be **comp
clf = PCA(0.95) # keep 95% of variance

X_trans = clf.fit_transform(X)
print(X.shape)
print(X_trans.shape)
By specifying that we want to throw away 5% of the variance, the data is now comp
X_new = clf.inverse_transform(X_trans)
plt.plot(X_new[:, 0], X_new[:, 1], 'ob', alpha=0.8)

plt.axis('equal');
The light points are the original data, while the dark points are the projected ve
This is the sense in which "dimensionality reduction" works: if you can approximat
### Application of PCA to Digits
The dimensionality reduction might seem a bit abstract in two dimensions, but the
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data
y = digits.target
pca = PCA(2) # project from 64 to 2 dimensions

Xproj = pca.fit_transform(X)
print(X.shape)
print(Xproj.shape)
plt.scatter(Xproj[:, 0], Xproj[:, 1], c=y, edgecolor='none', alpha=0.5,

cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar();
This gives us an idea of the relationship between the digits. Essentially, we have
### What do the Components Mean?
PCA is a very useful dimensionality reduction algorithm, because it has a very int
The input data is represented as a vector: in the case of the digits, our data is
$$
x = [x_1, x_2, x_3 \cdots]
$$
but what this really means is
$$
image(x) = x_1 \cdot{\rm (pixel~1)} + x_2 \cdot{\rm (pixel~2)} + x_3 \cdot{\rm (pi
$$
If we reduce the dimensionality in the pixel space to (say) 6, we recover only a p
from fig_code.figures import plot_image_components
sns.set_style('white')
plot_image_components(digits.data[0])
But the pixel-wise representation is not the only choice. We can also use other *b
$$
image(x) = {\rm mean} + x_1 \cdot{\rm (basis~1)} + x_2 \cdot{\rm (basis~2)} + x_3
$$
What PCA does is to choose optimal **basis functions** so that only a few are need
The low-dimensional representation of our data is the coefficients of this series
from fig_code.figures import plot_pca_interactive

plot_pca_interactive(digits.data)
Here we see that with only six PCA components, we recover a reasonable approximat
Thus we see that PCA can be viewed from two angles. It can be viewed as **dimensio
### Choosing the Number of Components
But how much information have we thrown away? We can figure this out by looking a
sns.set()
pca = PCA().fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
Here we see that our two-dimensional projection loses a lot of information (as mea
### PCA as data compression
As we mentioned, PCA can be used for is a sort of data compression. Using a small
Here's what a single digit looks like as you change the number of components:
fig, axes = plt.subplots(8, 8, figsize=(8, 8))

fig.subplots_adjust(hspace=0.1, wspace=0.1)
for i, ax in enumerate(axes.flat):
pca = PCA(i + 1).fit(X)
im = pca.inverse_transform(pca.transform(X[20:21]))
ax.imshow(im.reshape((8, 8)), cmap='binary')

ax.text(0.95, 0.05, 'n = {0}'.format(i + 1), ha='right',
transform=ax.transAxes, color='green')
ax.set_xticks([])
ax.set_yticks([])
Let's take another look at this by using IPython's ``interact`` functionality to v
from IPython.html.widgets import interact
def plot_digits(n_components):
fig = plt.figure(figsize=(8, 8))
plt.subplot(1, 1, 1, frameon=False, xticks=[], yticks=[])
nside = 10
pca = PCA(n_components).fit(X)
Xproj = pca.inverse_transform(pca.transform(X[:nside ** 2]))
Xproj = np.reshape(Xproj, (nside, nside, 8, 8))
total_var = pca.explained_variance_ratio_.sum()
im = np.vstack([np.hstack([Xproj[i, j] for j in range(nside)])

for i in range(nside)])
plt.imshow(im)
plt.grid(False)
plt.title("n = {0}, variance = {1:.2f}".format(n_components, total_var),

size=18)
plt.clim(0, 16)
interact(plot_digits, n_components=[1, 64], nside=[1, 8]);) # A in blue

plt.plot(Arec[:,0],Arec[:,1],'r', marker='o') # Arec in RED
# axes
plt.axis([-6, 8, -8, 10])
# labels
plt.xlabel("$a_0$")
plt.ylabel("$a_1$")
plt.title("Reconstructing the 1D compression of $A$")
Out[24]: <matplotlib.text.Text at 0x7fd0c67c7518>
In [25]: print(np.linalg.matrix_rank(Amat)) # originally a Rank 2 matrix

print(np.linalg.matrix_rank(Arec)) # reconstructed matrix is Rank 1
2
1
By tacking on the Rank-1 matrix related to the 2nd eigenvector you get back to the original data:
In [26]: # Add the Rank 1 matrix for the other vector to recover A completely
Amat @ Xmat[:,1] @ Xmat[:,1].T + Amat @ Xmat[:,0] @ Xmat[:,0].T
Out[26]: matrix([[ 3., 7.],

[-4., -6.],
[ 7., 8.],
[ 1., -1.],
[-4., -1.],
[-3., -7.]])
In [27]: # Why does this work? Well, recall

# X @ X.T is identity matrix as X is orthonormal
A @ Xmat @ Xmat.T
Out[27]: matrix([[ 3., 7.],

[-4., -6.],
[ 7., 8.],
[ 1., -1.],
[-4., -1.],
[-3., -7.]])
In [28]: # plots
plt.scatter(A[:,0], A[:,1]) # A in blue
plt.plot(Arec[:,0],Arec[:,1],'r', marker='o') # Arec in RED
# across observations
for i in range(m):
e = np.vstack((A[i],Arec[i]))
plt.plot(e[:,0],e[:,1],'b') # BLUE
# axes
plt.axis([-6, 8, -8, 10])
# labels
plt.xlabel("$a_0$")
plt.ylabel("$a_1$")
plt.title("Back to $A$")
Out[28]: <matplotlib.text.Text at 0x7fd0c675cdd8>
Wicked animated GIF which illustrates PCA

(http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-
eigenvectors-eigenvalues)
Magically, eigen-decomposition (or PCA) finds the line where
1. the spread of values along the black line is maximal

2. the projection error (sum of red lines) is minimal
6. Variance Retained
In [29]: # Average squared projection error using PC1

unexp_err = np.mean(np.sum(np.square(Amat - Arec),axis=1))
total_err = np.mean(np.sum(np.square(Amat),axis=1))
ret_err = 1 - (unexp_err / total_err) # percent of variance retained
print(ret_err)
0.948763733928
In [30]: # Using eigenvalues

l[1]/np.sum(l) # recall, use the 2nd eigenvalue
Out[30]: 0.94876373392787527
7. Summary of Eigen-decomposition Approach
1. A
Normalize columns of so that each feature has zero mean
2. Compute sample covariance matrix Σ = AT A/(m − 1)
3. Σ
Perform eigen-decomposition of using np.linalg.eig(Sigma)
4. k
Compress by ordering evectors according to largest evalues and compute AXk
5. Reconstruct from compressed version by computing AXk XkT
Implementation with scikit-learn
In [41]: from sklearn import datasets
Let's try going from 4D to 2D using the classical iris dataset.
In [42]: iris = datasets.load_iris() # Bunch object

print(iris.DESCR)
Iris Plants Database

====================
Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================

Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None

:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
This is a copy of UCI ML iris datasets.

http://archive.ics.uci.edu/ml/datasets/Iris (http://archive.ics.uci.edu/ml/data
sets/Iris)
The famous Iris database, first used by Sir R.A Fisher
This is perhaps the best known database to be found in the

pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
References
----------
- Fisher,R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System

Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
In [43]: A0 = iris.data # np.array
print("Dimensions:")
print(A0.shape)
print("---")
print("First 5 samples:")
print(A0[:5,:])
print("---")
print("Feature names:")
print(iris.feature_names)
Dimensions:
(150, 4)
---
First 5 samples:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]]
---
Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (c
m)']
In [44]: # Eigen-decomposition: 5-step process
# 1. Normalize columns of $A$ so that each feature has zero mean

mu = np.mean(A0,axis=0)
A = A0 - mu
print("Does A have zero mean across rows?")
print(np.mean(A,axis=0))
# 2. Compute sample covariance matrix $\Sigma = {A^TA}/{(m-1)}$

m,n = A.shape
Sigma = (A.T @ A)/(m-1)
print("---")
print("Sigma:")
print(Sigma)
# 3. Perform eigen-decomposition of $\Sigma$ using `np.linalg.eig(Sigma)`

l,X = np.linalg.eig(Sigma)
print("---")
print("Evalues:")
print(l)
print("---")
print("Evectors:")
print(X)
# 4. Compress by ordering $k$ evectors according to largest evalues and compute $A

print("---")
print("Compressed - 4D to 2D:")
Acomp = A @ X[:,:2] # first 2 evectors
print(Acomp[:5,:]) # first 5 observations
# 5. Reconstruct from compressed version by computing $A X_k X_k^T$

print("---")
print("Reconstructed version - 2D to 4D:")
Arec = A @ X[:,:2] @ X[:,:2].T # first 2 evectors
print(Arec[:5,:]+mu) # first 5 obs, adding mu to compare to original
Does A have zero mean across rows?

[ -1.12502600e-15 -6.75015599e-16 -3.23889064e-15 -6.06921920e-16]
---
Sigma:
[[ 0.68569351 -0.03926846 1.27368233 0.5169038 ]
[-0.03926846 0.18800403 -0.32171275 -0.11798121]
[ 1.27368233 -0.32171275 3.11317942 1.29638747]
[ 0.5169038 -0.11798121 1.29638747 0.58241432]]
---
Evalues:
[ 4.22484077 0.24224357 0.07852391 0.02368303]
---
Evectors:
[[ 0.36158968 -0.65653988 -0.58099728 0.31725455]
[-0.08226889 -0.72971237 0.59641809 -0.32409435]
[ 0.85657211 0.1757674 0.07252408 -0.47971899]
[ 0.35884393 0.07470647 0.54906091 0.75112056]]
---
Compressed - 4D to 2D:
[[-2.68420713 -0.32660731]
[-2.71539062 0.16955685]
[-2.88981954 0.13734561]
[-2.7464372 0.31112432]
[-2.72859298 -0.33392456]]
---
Reconstructed version - 2D to 4D:
[[ 5.08718247 3.51315614 1.4020428 0.21105556]
[ 4.75015528 3.15366444 1.46254138 0.23693223]
[ 4.70823155 3.19151946 1.30746874 0.17193308]
[ 4.64598447 3.05291508 1.46083069 0.23636736]
[ 5.07593707 3.5221472 1.36273698 0.19458132]]
In [45]: # Using sklearn.decomposition.PCA
pca = PCA(n_components=2) # two components

pca.fit(A0) # run PCA, putting in raw version for fun
print("Principal components:")
print("---")
print("Compressed - 4D to 2D:")
print(pca.transform(A0)[:5,:]) # first 5 obs
print("---")
print("Reconstructed - 2D to 4D:")
print(pca.inverse_transform(pca.transform(A0))[:5,:]) # first 5 obs
Principal components:
[[ 0.36158968 -0.08226889 0.85657211 0.35884393]
[ 0.65653988 0.72971237 -0.1757674 -0.07470647]]
---
Compressed - 4D to 2D:
[[-2.68420713 0.32660731]
[-2.71539062 -0.16955685]
[-2.88981954 -0.13734561]
[-2.7464372 -0.31112432]
[-2.72859298 0.33392456]]
---
Reconstructed - 2D to 4D:
[[ 5.08718247 3.51315614 1.4020428 0.21105556]
[ 4.75015528 3.15366444 1.46254138 0.23693223]
[ 4.70823155 3.19151946 1.30746874 0.17193308]
[ 4.64598447 3.05291508 1.46083069 0.23636736]
[ 5.07593707 3.5221472 1.36273698 0.19458132]]
Another Example
In [31]: from __future__ import print_function, division
%matplotlib inline
import numpy as np
from scipy import stats
# use seaborn plotting style defaults

import seaborn as sns; sns.set()
In [32]: np.random.seed(1)
X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T
plt.plot(X[:, 0], X[:, 1], 'o')
plt.axis('equal');
We can see that there is a definite trend in the data. What PCA seeks to do is to find the Principal
Axes in the data, and explain how important those axes are in describing the data distribution:
In [33]: from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_)
[ 0.75871884 0.01838551]
[[-0.94446029 -0.32862557]
[-0.32862557 0.94446029]]
To see what these numbers mean, let's view them as vectors plotted on top of the data:
In [34]: plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)

for length, vector in zip(pca.explained_variance_, pca.components_):
v = vector * 3 * np.sqrt(length)
plt.plot([0, v[0]], [0, v[1]], '-k', lw=3)
plt.axis('equal');
Notice that one vector is longer than the other. In a sense, this tells us that that direction in the data
is somehow more "important" than the other direction. The explained variance quantifies this
measure of "importance" in direction.
Another way to think of it is that the second principal component could be completely ignored
without much loss of information! Let's see what our data look like if we only keep 95% of the
variance:
In [35]: clf = PCA(0.95) # keep 95% of variance

X_trans = clf.fit_transform(X)
print(X.shape)
print(X_trans.shape)
(200, 2)
(200, 1)
By specifying that we want to throw away 5% of the variance, the data is now compressed by a
factor of 50%! Let's see what the data look like after this compression:
In [36]: X_new = clf.inverse_transform(X_trans)

plt.plot(X_new[:, 0], X_new[:, 1], 'ob', alpha=0.8)
plt.axis('equal');
The light points are the original data, while the dark points are the projected version. We see that
after truncating 5% of the variance of this dataset and then reprojecting it, the "most important"
features of the data are maintained, and we've compressed the data by 50%!
This is the sense in which "dimensionality reduction" works: if you can approximate a data set in a
lower dimension, you can often have an easier time visualizing it or fitting complicated models to the
data.
In [ ]:

What Is PCA: When Should You Use PCA?

Caricato da

Informazioni sul documento

Descrizione originale:

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

What Is PCA: When Should You Use PCA?

Caricato da

Copyright:

Formati disponibili

12/22/2017 Principle Component Analysis

When should you use PCA?

What does PCA do?

In [2]: # Create (6x2) array

m,n = A.shape # m-observations, n-features

In [4]: # ... and can go from df back to np.array

Out[4]: array([[ 3, 7],

In [5]: import matplotlib

# makes charts pretty

Out[6]: <matplotlib.text.Text at 0x7fd0c6baa438>

Sample covariance between a0 and a1 :

Length of prod equals 6

In [8]: # Get more stuff using NumPy's covariance method

Out[8]: array([[ 20., 25.],

Linear Algebra way:

In [9]: # Aside: What does A.T do?

Out[9]: array([[ 3, -4, 7, 1, -4, -3],

In [10]: # Matrix Multiplication, note @ operator

Out[10]: array([[100, 125],

In [11]: # Need to divide by (m-1) to yield true Sample Covariance Matrix

Out[11]: array([[ 20., 25.],

Sigma times eigenvector:

In [14]: # ... and the second

Sigma times eigenvector:

In [15]: print("The first principal component is evector with largest evalue:")

The first principal component is evector with largest evalue:

In [16]: # Orthogonal? A: Yes

In [17]: # Length 1? A: Yes

Out[18]: <matplotlib.text.Text at 0x7fd0c6a41588>

I discovered why Prof. Ng recommended Octave/Matlab versus Python. Linear algebra

In [19]: # change to matrix

In [20]: # Choose eigenvector with highest eigenvalue as first principal component

In [21]: Acomp = Amat @ pc1 # 6x2 @ 2x1 yields 6x1

Reconstruction from 1D compression of A:

Here we'll explore **Principal Component Analysis**, which is an extremely useful

We'll start with our standard set of initial imports:

from __future__ import print_function, division

# use seaborn plotting style defaults

## Introducing Principal Component Analysis

Principal Component Analysis is a very powerful unsupervised method for *dimensio

from sklearn.decomposition import PCA

plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)

clf = PCA(0.95) # keep 95% of variance

plt.plot(X_new[:, 0], X_new[:, 1], 'ob', alpha=0.8)

### Application of PCA to Digits

from sklearn.datasets import load_digits

pca = PCA(2) # project from 64 to 2 dimensions

plt.scatter(Xproj[:, 0], Xproj[:, 1], c=y, edgecolor='none', alpha=0.5,

### What do the Components Mean?

but what this really means is

If we reduce the dimensionality in the pixel space to (say) 6, we recover only a p

from fig_code.figures import plot_image_components

from fig_code.figures import plot_pca_interactive

### Choosing the Number of Components

### PCA as data compression

fig, axes = plt.subplots(8, 8, figsize=(8, 8))

ax.imshow(im.reshape((8, 8)), cmap='binary')

Let's take another look at this by using IPython's ``interact`` functionality to v

from IPython.html.widgets import interact

im = np.vstack([np.hstack([Xproj[i, j] for j in range(nside)])

plt.title("n = {0}, variance = {1:.2f}".format(n_components, total_var),

interact(plot_digits, n_components=[1, 64], nside=[1, 8]);) # A in blue

Here we'll explore Principal Component Analysis, which is an extremely useful

from future import print_function, division

In [31]: from future import print_function, division