
Principal component analysis (PCA): Explained and implemented
It is very common, in data science tasks involving a large number of features, to be advised to use PCA, aka Principal Component Analysis. We will start with a brief introduction to the what and why of PCA, and then move on to implementing it, with explanations along the way.

The what and why of PCA

When there are a lot of variables aka features (n > 10), we are often advised to do PCA. PCA is a statistical technique which reduces the dimensionality of the data and helps us understand and plot the data with fewer dimensions than the original. As the name says, PCA helps us compute the principal components of the data. Principal components are vectors that are linearly uncorrelated, each capturing part of the variance in the data. From the principal components, the top p with the most variance are picked.
PCA example

In the above plot, what PCA does is draw the axis across the data that has the maximum variance, then the axis with the second-largest variance, and so on. But wait, why is an axis with maximum variance important? Let's consider our problem to be a classification problem (similar arguments can be drawn for other problems too). Our goal is to separate the data by drawing a line (or a plane) between the classes. If we find the dimension which has the maximum variance, that solves part of the problem; now all we have to do is use a suitable algorithm to draw the line or plane which splits the data.

Let's implement PCA


Let's begin by generating random data with 3 dimensions and 40 samples. We will have two classes, with 20 samples per class.
import numpy as np
np.random.seed(1)

# Class 1: 20 samples drawn from a 3-D Gaussian centred at the origin
vec1 = np.array([0, 0, 0])
mat1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sample_for_class1 = np.random.multivariate_normal(vec1, mat1, 20).T
assert sample_for_class1.shape == (3, 20), "The dimension of the sample_for_class1 matrix is not 3x20"

# Class 2: 20 samples drawn from a 3-D Gaussian centred at (1, 1, 1)
vec2 = np.array([1, 1, 1])
mat2 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sample_for_class2 = np.random.multivariate_normal(vec2, mat2, 20).T
assert sample_for_class2.shape == (3, 20), "The dimension of the sample_for_class2 matrix is not 3x20"

# Stack the two classes column-wise: each column is one sample
all_data = np.concatenate((sample_for_class1, sample_for_class2), axis=1)
assert all_data.shape == (3, 40), "The dimension of the all_data matrix is not 3x40"

Now we will compute the scatter matrix (or, equivalently, the covariance matrix) of this data. Why a scatter matrix? The scatter matrix records how the variables vary together, which is exactly what we need to find the direction of maximum variance. More on the scatter matrix and covariance here. Either the scatter matrix or the covariance matrix can be used, since the covariance matrix is just a scaled version of the scatter matrix.
# Mean of each of the three dimensions, stacked into a column vector
mean_dim1 = np.mean(all_data[0, :])
mean_dim2 = np.mean(all_data[1, :])
mean_dim3 = np.mean(all_data[2, :])

mean_vector = np.array([[mean_dim1], [mean_dim2], [mean_dim3]])
print('The Mean Vector:\n', mean_vector)

# Scatter matrix: sum over samples of (x - mean)(x - mean)^T
scatter_matrix = np.zeros((3, 3))
for i in range(all_data.shape[1]):
    scatter_matrix += (all_data[:, i].reshape(3, 1) - mean_vector).dot(
        (all_data[:, i].reshape(3, 1) - mean_vector).T)
print('The Scatter Matrix is :\n', scatter_matrix)

Output:
('The Mean Vector:\n', array([[0.41667492],
       [0.69848315],
       [0.49242335]]))
('The Scatter Matrix is :\n', array([[38.4878051 , 10.50787213, 11.13746016],
       [10.50787213, 36.23651274, 11.96598642],
       [11.13746016, 11.96598642, 49.73596619]]))
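As a quick, optional check of the claim that the covariance matrix is just a scaled scatter matrix, the sketch below (reusing all_data and scatter_matrix from above) compares the scatter matrix divided by N − 1 with numpy's own covariance estimate:

# The covariance matrix should equal the scatter matrix divided by (N - 1)
N = all_data.shape[1]
cov_matrix = np.cov(all_data)  # np.cov treats each row as a variable by default
np.testing.assert_array_almost_equal(scatter_matrix / (N - 1), cov_matrix)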

Now we know how the variables are related to each other. We are just one step away from computing the principal components. Before that, we will review a basic concept from linear algebra: eigenvalues and eigenvectors.

Eigenvalues and Eigenvectors

Eigenvectors and eigenvalues are a property of a matrix, defined by the following equation.

A x = λ x

Here A denotes the matrix, x denotes an eigenvector, and λ (lambda) denotes the corresponding eigenvalue. Now, to understand the significance of eigenvectors and eigenvalues, let's look at the following gif. Here the original matrix is multiplied by a set of vectors (i.e. the vectors undergo a linear transformation): what we can observe is that the lines colored blue and purple do not change direction, they only scale up, while the line colored red changes direction in addition to scaling up.

An eigenvector of a matrix (which can be seen as a linear transformation) is a condensed vector that summarizes one axis of that transformation. Another property of eigenvectors is that even if we scale the vector by some amount before multiplying, we still get the same multiple of it as a result. We might also say that eigenvectors are the axes along which the linear transformation acts, stretching or compressing input vectors. They are the lines of change that represent the action of the larger matrix, the very "line" in linear transformation. Eigenvectors and eigenvalues are defined only for square matrices, and we prefer our eigenvectors to always be of unit length.
If we have an eigenvector v = (v1, v2, ..., vd), we compute its length as

||v|| = sqrt(v1^2 + v2^2 + ... + vd^2)

Then, to make the eigenvector of length one, we divide it by its length:

v_unit = v / ||v||

Since this is just a scalar operation, it does not affect the direction of the vector. A simple reason we do this is that we need all of our eigenvectors to be comparable, so that we can choose the most valuable among them (the one along which the data has the maximum variance). A matrix can have more than one eigenvector, and at most d linearly independent eigenvectors if the matrix is of dimension d × d.
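As a small, self-contained illustration of these properties (a sketch using a made-up 2 × 2 matrix, not part of the article's dataset), we can check that A·v = λ·v, that scaling v beforehand still yields the same multiple, and that normalising v only changes its length:

# Hypothetical 2x2 example, only to illustrate the eigenvector properties above
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]    # first eigenvector (eigenvectors are the columns)
lam = eigenvalues[0]      # its eigenvalue

# A.v equals lambda.v
np.testing.assert_array_almost_equal(A.dot(v), lam * v)

# Scaling the eigenvector first still gives the same multiple of it
np.testing.assert_array_almost_equal(A.dot(3.0 * v), lam * (3.0 * v))

# Normalising only changes the length, not the direction
v_unit = v / np.linalg.norm(v)
print(np.linalg.norm(v_unit))  # 1.0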

Back to PCA…

Let's continue where we left off with PCA: we have the scatter matrix, which holds the information about how each variable is related to the others.

Now we use the numpy library to compute the eigenvectors and eigenvalues:
eig_val, eig_vec = np.linalg.eig(scatter_matrix)

By default, numpy (like most numerical libraries) returns eigenvectors of unit length. Let's verify it:

# Eigenvectors are the columns of eig_vec, so we iterate over the transpose
for ev in eig_vec.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))

Okay, we could end up with up to 3 eigenvectors, since the size of our scatter matrix is 3 × 3. Choosing the k most important eigenvectors out of d is the same as dropping the d − k least important ones. We now sort the eigenvectors by their eigenvalues and drop the one with the smallest eigenvalue.

# We make a list of tuples containing (eigenvalue, eigenvector)
eig_pairs = [(np.abs(eig_val[i]), eig_vec[:, i]) for i in range(len(eig_val))]

# We then sort the list of tuples by eigenvalue, in decreasing order
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Verify that the list is correctly sorted by decreasing eigenvalue
for i in eig_pairs:
    print(i[0])

Output:
65.16936779078195
32.69471296321799
26.596203282097097

Now we choose the k = 2 eigenvectors with the largest eigenvalues:

matrix_w = np.hstack((eig_pairs[0][1].reshape(3, 1), eig_pairs[1][1].reshape(3, 1)))
print('Matrix W:\n', matrix_w)

Output:
('Matrix W:\n', array([[-0.49210223, -0.64670286],
       [-0.47927902, -0.35756937],
       [-0.72672348,  0.67373552]]))
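An optional sanity check (a sketch reusing the eig_pairs list built above) is to look at how much of the total variance the two kept eigenvectors account for; it is simply each eigenvalue divided by the sum of all eigenvalues:

# Fraction of the total variance captured by each component
total_variance = sum(pair[0] for pair in eig_pairs)
explained = [pair[0] / total_variance for pair in eig_pairs]
print(explained)  # the first two entries are the share of variance of the kept axes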

Finally, we have the new axes onto which we can project our samples: we just multiply our chosen eigenvectors with the original data and plot.

transformed = matrix_w.T.dot(all_data)
assert transformed.shape == (2, 40), "The matrix is not 2x40 dimensional."
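A minimal sketch of the plotting step (assuming matplotlib, and using the fact that the first 20 columns of all_data belong to class 1 and the remaining 20 to class 2):

# Plot the 40 samples in the new 2-dimensional space
import matplotlib.pyplot as plt

plt.scatter(transformed[0, :20], transformed[1, :20], label='class 1')
plt.scatter(transformed[0, 20:], transformed[1, 20:], label='class 2')
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.legend()
plt.show()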

Given that we generated random data, the plot may vary if you attempt to recreate it.

Comparing, for the same data, with the PCA implementation from sklearn:
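Here is a minimal cross-check sketch using scikit-learn's PCA (assuming scikit-learn is installed). Note that sklearn expects samples as rows and centres the data internally, and the signs of its components may be flipped relative to our matrix W, which is expected since eigenvectors are only defined up to sign:

# Cross-check with scikit-learn (samples as rows, hence the transpose)
from sklearn.decomposition import PCA

sklearn_pca = PCA(n_components=2)
sklearn_transformed = sklearn_pca.fit_transform(all_data.T)  # shape (40, 2)

print('sklearn components:\n', sklearn_pca.components_)  # rows are the principal axes
print('our matrix W (transposed):\n', matrix_w.T)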


Bingo!!! With this understanding by our side, we can define PCA as the process of finding the axes of our feature space from which, when the data is viewed, the samples are spread out (and hence separable) as much as possible.
