When there are a lot of variables (features, say n > 10), it is often advisable to apply PCA. PCA is a statistical technique that reduces the dimensionality of the data, helping us understand and plot it in fewer dimensions than the original. As the name suggests, PCA computes the principal components of the data. Principal components are vectors that are linearly uncorrelated, each capturing a share of the variance within the data. From these principal components, the top p with the most variance are picked.
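The steps just described can be sketched in a few lines of NumPy. This is a minimal illustration on made-up data (the array X and the choice p = 2 are assumptions for the example, not values from this article):

```python
import numpy as np

# Toy data: 200 samples with 5 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Centre the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components;
# eigenvalues measure the variance along each of them
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort by decreasing variance and keep the top p components
order = np.argsort(eigvals)[::-1]
p = 2
top_components = eigvecs[:, order[:p]]   # shape (5, p)

# Project the data onto the top-p components
X_reduced = X_centered @ top_components  # shape (200, p)
print(X_reduced.shape)  # (200, 2)
```

Projecting onto the top-p eigenvectors is exactly the "pick the components with the most variance" step described above.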
[Figure: PCA example]
In the plot above, what PCA does is draw a line through the data along the direction of maximum variance, then the direction of second-largest variance, and so on. But wait, why is an axis with maximum variance important? Let's consider a classification problem (similar arguments can be made for other problems too). Our goal is to separate the data by drawing a line (or a plane) between the classes. If we find the direction with maximum variance, that solves part of the problem; all that remains is to use a suitable algorithm to draw the line or plane that splits the data.
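The code below references two arrays, sample_for_class1 and sample_for_class2, that are generated before this point. One plausible way to produce them, hedged as an assumption since the article does not show this step, is to draw 20 three-dimensional samples per class from two Gaussians (the means and covariances here are illustrative choices, not the article's):

```python
import numpy as np

np.random.seed(1)  # for reproducibility

# Class 1: 20 samples from a 3-D Gaussian centred at the origin
mu_vec1 = np.array([0.0, 0.0, 0.0])
cov_mat1 = np.eye(3)
sample_for_class1 = np.random.multivariate_normal(mu_vec1, cov_mat1, 20).T
assert sample_for_class1.shape == (3, 20)

# Class 2: 20 samples from a 3-D Gaussian centred at (1, 1, 1)
mu_vec2 = np.array([1.0, 1.0, 1.0])
cov_mat2 = np.eye(3)
sample_for_class2 = np.random.multivariate_normal(mu_vec2, cov_mat2, 20).T
assert sample_for_class2.shape == (3, 20)
```

Each array holds the samples as columns, so concatenating the two along axis 1 yields a 3x40 matrix, which matches the shape check in the snippet that follows.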
all_data = np.concatenate((sample_for_class1, sample_for_class2), axis=1)
assert all_data.shape == (3, 40), "The dimension of the all_data matrix is not 3x40"

# Mean of each feature across all 40 samples
mean_vector = all_data.mean(axis=1).reshape(3, 1)

# Scatter matrix: sum of outer products of the centred samples
scatter_matrix = np.zeros((3, 3))
for i in range(all_data.shape[1]):
    diff = all_data[:, i].reshape(3, 1) - mean_vector
    scatter_matrix += diff.dot(diff.T)
print('The Scatter Matrix is:\n', scatter_matrix)

Output:

('The Mean Vector:\n', array([[0.41667492],
                              [0.69848315],
                              [0.49242335]]))
('The Scatter Matrix:\n', array([[38.4878051 , 10.50787213, 11.13746016],
                                 [10.50787213, 36.23651274, 11.96598642],
                                 [11.13746016, 11.96598642, 49.73596619]]))
Back to PCA…
Let's continue where we left off. We now have the scatter matrix, which captures how each variable is related to the others.
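From here, the principal components are obtained as the eigenvectors of the scatter matrix. As a sketch of that next step (using the scatter matrix printed above; keeping 2 components is an illustrative choice):

```python
import numpy as np

# The scatter matrix computed above
scatter_matrix = np.array([[38.4878051 , 10.50787213, 11.13746016],
                           [10.50787213, 36.23651274, 11.96598642],
                           [11.13746016, 11.96598642, 49.73596619]])

# Eigenvectors of the scatter matrix point along the directions of
# maximum variance; the eigenvalues measure that variance
eig_vals, eig_vecs = np.linalg.eigh(scatter_matrix)

# Sort eigenpairs by decreasing eigenvalue and keep the top 2
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
matrix_w = eig_vecs[:, :2]   # projection matrix, shape (3, 2)
print(matrix_w.shape)  # (3, 2)
```

Multiplying matrix_w.T by the 3x40 data matrix would project the samples down to two dimensions, completing the reduction described at the start of the article.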
Since the raw data is randomly generated, the plot may vary when you attempt to recreate it.