Principal component analysis
Seropédica - RJ
12/14/2008
Introduction
Principal component analysis is a multivariate statistical technique that consists of
transforming a set of original variables into another set of variables of the same dimension
called principal components. The principal components have important properties: each principal
component is a linear combination of all the original variables; the components are uncorrelated
with one another; and they are estimated so as to retain, in order of estimation, the maximum
amount of information, in terms of the total variation contained in the data. Principal component
analysis is associated with the idea of reducing the mass of data with the smallest possible loss
of information. The aim is to redistribute the variation observed on the original axes so as to
obtain a set of uncorrelated orthogonal axes. This technique can be used for index generation
and for grouping individuals. The analysis groups the individuals of a population according to
the variation of their characteristics, that is, according to their behavior within the
population, as represented by the variation of the set of characteristics that defines each
individual. According to REGAZZI (2000), although the techniques of multivariate analysis were
developed to solve specific problems, especially in Biology and Psychology, they can also be used
to solve other types of problems in several areas of knowledge.
1 Professor, Federal Rural University of Rio de Janeiro, IT-Department of Engineering, BR 465 km 7, CEP 23890-000, Seropédica, RJ. E-mail: varella@ufrrj.br.
Data matrix X
Consider the situation in which we observe $p$ characteristics of $n$ individuals of a
population $\pi$. The observed characteristics are represented by the variables
$X_1, X_2, X_3, \ldots, X_p$. The data matrix is of order $n \times p$ and is usually denoted $X$.
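$$
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2p} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{np}
\end{bmatrix}
$$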
The interdependence structure between the variables of the data matrix is represented by
the covariance matrix $S$ or by the correlation matrix $R$. Understanding this structure
through the variables $X_1, X_2, X_3, \ldots, X_p$ can be complicated in practice. The
objective of principal component analysis is therefore to transform this complicated structure,
represented by the variables $X_1, X_2, X_3, \ldots, X_p$, into another structure represented by
the variables $Y_1, Y_2, Y_3, \ldots, Y_p$, which are uncorrelated and have ordered variances, so
that the individuals can be compared using only the variables $Y_i$ with the greatest variance.
The solution is obtained from the covariance matrix $S$ or the correlation matrix $R$.
Covariance matrix S
From the data matrix $X$ of order $n \times p$ we can estimate the covariance matrix $\Sigma$
of the population $\pi$; the estimate is denoted $S$. The matrix $S$ is symmetric and of order
$p \times p$.
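$$
S = \begin{bmatrix}
\widehat{\mathrm{Var}}(x_1) & \widehat{\mathrm{Cov}}(x_1 x_2) & \widehat{\mathrm{Cov}}(x_1 x_3) & \cdots & \widehat{\mathrm{Cov}}(x_1 x_p) \\
\widehat{\mathrm{Cov}}(x_2 x_1) & \widehat{\mathrm{Var}}(x_2) & \widehat{\mathrm{Cov}}(x_2 x_3) & \cdots & \widehat{\mathrm{Cov}}(x_2 x_p) \\
\widehat{\mathrm{Cov}}(x_3 x_1) & \widehat{\mathrm{Cov}}(x_3 x_2) & \widehat{\mathrm{Var}}(x_3) & \cdots & \widehat{\mathrm{Cov}}(x_3 x_p) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\widehat{\mathrm{Cov}}(x_p x_1) & \widehat{\mathrm{Cov}}(x_p x_2) & \widehat{\mathrm{Cov}}(x_p x_3) & \cdots & \widehat{\mathrm{Var}}(x_p)
\end{bmatrix}
$$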
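The estimate of $S$ can be sketched in a few lines of plain Python. The function name `covariance_matrix` is an assumption made for this illustration; the data are the $X_1$ and $X_2$ values of Table 2 in the application example, and the divisor $n - 1$ is the one used in the variance formulas of these notes.

```python
def covariance_matrix(X):
    """Return the p x p sample covariance matrix S of an n x p data matrix X.

    Uses the divisor n - 1, so S[j][j] is the estimated variance of variable j.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    S = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            cov = sum((row[j] - means[j]) * (row[k] - means[k]) for row in X) / (n - 1)
            S[j][k] = S[k][j] = cov  # S is symmetric
    return S


# X1 and X2 for the five treatments of Table 2
X = [[102, 96], [104, 87], [101, 62], [93, 68], [100, 77]]
S = covariance_matrix(X)
for row in S:
    print([round(v, 2) for v in row])
```

The diagonal of `S` reproduces the variances 17.50 and 190.50 reported in Table 2.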
When the characteristics are measured in different units, it is convenient, according to
REGAZZI (2000), to standardize the variables $X_j$ ($j = 1, 2, 3, \ldots, p$). Standardization
can be done with mean zero and variance 1, or with variance 1 and any mean.
Standardization with zero mean and variance 1
$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s(x_j)}, \quad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, p
$$
Standardization with variance 1 and any mean
$$
z_{ij} = \frac{x_{ij}}{s(x_j)}, \quad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, p
$$
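where:

$$
\bar{x}_j = \frac{\sum_{i=1}^{n} x_{ij}}{n} \quad \text{and} \quad s(x_j) = \sqrt{\widehat{\mathrm{Var}}(x_j)}, \quad j = 1, 2, \ldots, p
$$

$$
\widehat{\mathrm{Var}}(x_j) = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n - 1}
\quad \text{or} \quad
\widehat{\mathrm{Var}}(x_j) = \frac{\sum_{i=1}^{n} x_{ij}^2 - \dfrac{\left( \sum_{i=1}^{n} x_{ij} \right)^2}{n}}{n - 1}
$$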
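Both forms of standardization can be sketched with one small function. The name `standardize` and its `center` flag are assumptions made for this illustration; the data are the $X_1$ and $X_2$ values of Table 2 in the application example.

```python
import math


def standardize(X, center=True):
    """Standardize the columns of an n x p data matrix X to variance 1.

    center=True  -> z_ij = (x_ij - mean_j) / s_j   (mean 0, variance 1)
    center=False -> z_ij = x_ij / s_j              (variance 1, any mean)
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    # Sample standard deviation with divisor n - 1
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
           for j in range(p)]
    shift = means if center else [0.0] * p
    return [[(row[j] - shift[j]) / sds[j] for j in range(p)] for row in X]


X = [[102, 96], [104, 87], [101, 62], [93, 68], [100, 77]]  # Table 2
Z_centered = standardize(X, center=True)
Z_scaled = standardize(X, center=False)
print([round(z, 4) for z in Z_scaled[0]])
```

With `center=False` the first row agrees with the standardized values listed for treatment 1 in Table 2, which suggests that the example uses the second form (variance 1, any mean).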
After the standardization we obtain a new data matrix Z:
$$
Z = \begin{bmatrix}
z_{11} & z_{12} & z_{13} & \cdots & z_{1p} \\
z_{21} & z_{22} & z_{23} & \cdots & z_{2p} \\
z_{31} & z_{32} & z_{33} & \cdots & z_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
z_{n1} & z_{n2} & z_{n3} & \cdots & z_{np}
\end{bmatrix}
$$
The covariance matrix of the standardized variables $z_j$ in $Z$ is equal to the correlation
matrix of the data matrix $X$. To determine the principal components we usually start from the
correlation matrix $R$. It is important to note that the result obtained from the matrix $S$ can
differ from the result obtained from the matrix $R$. The recommendation is that standardization
should only be done when the units of measurement of the observed characteristics are not
the same.
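The relationship between the standardized variables and the correlation matrix can be checked numerically. The sketch below is only an illustration: the function name `correlation_via_standardization` is an assumption, and the data are the $X_1$ and $X_2$ values of Table 2 in the application example.

```python
import math


def correlation_via_standardization(X):
    """Correlation matrix R of X, computed as the covariance matrix of Z."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
           for j in range(p)]
    # Standardized matrix Z (mean 0, variance 1)
    Z = [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]
    # Z has zero column means, so Cov(z_j, z_k) = sum_i z_ij * z_ik / (n - 1)
    return [[sum(z[j] * z[k] for z in Z) / (n - 1) for k in range(p)]
            for j in range(p)]


X = [[102, 96], [104, 87], [101, 62], [93, 68], [100, 77]]  # Table 2
R = correlation_via_standardization(X)
print([[round(v, 4) for v in row] for row in R])
```

The off-diagonal element agrees with the correlation 0.5456 used in the application example, and the diagonal elements are 1.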
Determination of the principal components
The principal components are determined by solving the characteristic equation of the matrix
$S$ or $R$, that is:
$$
\det[R - \lambda I] = 0 \quad \text{or} \quad |R - \lambda I| = 0
$$

$$
R = \begin{bmatrix}
1 & r(x_1 x_2) & r(x_1 x_3) & \cdots & r(x_1 x_p) \\
r(x_2 x_1) & 1 & r(x_2 x_3) & \cdots & r(x_2 x_p) \\
r(x_3 x_1) & r(x_3 x_2) & 1 & \cdots & r(x_3 x_p) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r(x_p x_1) & r(x_p x_2) & r(x_p x_3) & \cdots & 1
\end{bmatrix}
$$
If the matrix $R$ has full rank $p$, that is, none of its columns is a linear combination of
the others, the equation $|R - \lambda I| = 0$ has $p$ nonzero roots, called the eigenvalues or
characteristic roots of the matrix $R$. In the data matrix $X$, the value of $n$ (individuals,
treatments, genotypes, etc.) must be at least $p + 1$; that is, to set up an experiment to
analyze the behavior of $p$ characteristics of the individuals of a population, it is
recommended that the statistical design include at least $p + 1$ treatments.
Let $\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_p$ be the roots of the characteristic equation of the matrix $R$ or $S$, ordered so that:

$$
\lambda_1 > \lambda_2 > \lambda_3 > \cdots > \lambda_p
$$
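For $p > 2$ the characteristic equation is rarely solved by hand. As a minimal sketch, the eigenvalues of a symmetric matrix can be approximated with the cyclic Jacobi rotation method; this is a standard numerical technique swapped in for illustration, not the procedure described in these notes, and the name `jacobi_eigen` and its parameters are assumptions. The function also returns the unit eigenvectors paired with each eigenvalue.

```python
import math


def jacobi_eigen(M, tol=1e-12, max_sweeps=50):
    """Eigenvalues and unit eigenvectors of a symmetric matrix M.

    Returns (values, vectors) with values sorted in decreasing order and
    vectors[i] the eigenvector paired with values[i].
    """
    p = len(M)
    A = [list(map(float, row)) for row in M]  # working copy
    V = [[float(i == j) for j in range(p)] for i in range(p)]  # accumulated rotations
    for _ in range(max_sweeps):
        off = sum(A[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for r in range(p - 1):
            for c in range(r + 1, p):
                if A[r][c] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal element A[r][c]
                theta = 0.5 * math.atan2(2.0 * A[r][c], A[c][c] - A[r][r])
                cs, sn = math.cos(theta), math.sin(theta)
                for k in range(p):  # A <- A J (columns r and c)
                    akr, akc = A[k][r], A[k][c]
                    A[k][r], A[k][c] = cs * akr - sn * akc, sn * akr + cs * akc
                for k in range(p):  # A <- J' A (rows r and c)
                    ark, ack = A[r][k], A[c][k]
                    A[r][k], A[c][k] = cs * ark - sn * ack, sn * ark + cs * ack
                for k in range(p):  # V <- V J
                    vkr, vkc = V[k][r], V[k][c]
                    V[k][r], V[k][c] = cs * vkr - sn * vkc, sn * vkr + cs * vkc
    order = sorted(range(p), key=lambda i: A[i][i], reverse=True)
    values = [A[i][i] for i in order]
    vectors = [[V[k][i] for k in range(p)] for i in order]
    return values, vectors


# Correlation matrix of the application example (p = 2)
R = [[1.0, 0.5456], [0.5456, 1.0]]
values, vectors = jacobi_eigen(R)
print([round(v, 4) for v in values])
```

The returned eigenvalues are already in the decreasing order assumed above, and their sum equals the trace of the input matrix.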
To each eigenvalue $\lambda_i$ there corresponds an eigenvector $\tilde{a}_i$:

$$
\tilde{a}_i = \begin{bmatrix} a_{i1} \\ a_{i2} \\ \vdots \\ a_{ip} \end{bmatrix}
$$
The eigenvectors $\tilde{a}_i$ are normalized, that is, the sum of the squares of their
coefficients is equal to 1, and they are also orthogonal to each other. They therefore have the
following properties:

$$
\sum_{j=1}^{p} a_{ij}^2 = 1 \quad \left( \tilde{a}_i' \cdot \tilde{a}_i = 1 \right)
$$

and

$$
\sum_{j=1}^{p} a_{ij} \, a_{kj} = 0 \quad \left( \tilde{a}_i' \cdot \tilde{a}_k = 0 \text{ for } i \neq k \right)
$$
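The two properties are easy to verify numerically. In the sketch below the helper names `dot` and `normalize` and the two starting vectors are hypothetical, chosen only to illustrate the check for $p = 2$.

```python
import math


def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(ui * vi for ui, vi in zip(u, v))


def normalize(v):
    """Scale v so that the sum of the squares of its coefficients is 1."""
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]


# Hypothetical unnormalized eigenvectors of a 2 x 2 correlation matrix
a1 = normalize([1.0, 1.0])
a2 = normalize([-1.0, 1.0])

print(round(dot(a1, a1), 10))  # sum of squares of the coefficients of a1
print(round(dot(a2, a2), 10))  # sum of squares of the coefficients of a2
print(round(dot(a1, a2), 10))  # inner product of a1 and a2
```

After normalization each vector has unit length, and the inner product of the two different vectors is zero, as the properties require.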
If $\tilde{a}_i$ is the eigenvector corresponding to the eigenvalue $\lambda_i$, then the $i$-th principal component
is given by:

$$
Y_i = a_{i1} X_1 + a_{i2} X_2 + \cdots + a_{ip} X_p
$$
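The contribution $W_i$ of each principal component $Y_i$, expressed as a percentage of the total variance, is:

$$
W_i = \frac{\widehat{\mathrm{Var}}(Y_i)}{\sum_{i=1}^{p} \widehat{\mathrm{Var}}(Y_i)} \times 100
= \frac{\lambda_i}{\sum_{i=1}^{p} \lambda_i} \times 100
= \frac{\lambda_i}{\mathrm{trace}(S)} \times 100
$$

The number $k$ of components used is chosen so that their accumulated contribution is at least 70% of the total variance:

$$
\frac{\widehat{\mathrm{Var}}(Y_1) + \cdots + \widehat{\mathrm{Var}}(Y_k)}{\sum_{i=1}^{p} \widehat{\mathrm{Var}}(Y_i)} \times 100 \geq 70\%, \quad \text{where } k < p
$$

The correlation between the variable $X_j$ and the component $Y_1$ is:

$$
\widehat{\mathrm{Corr}}(X_j, Y_1) = r_{X_j Y_1}
= a_{1j} \, \frac{\sqrt{\widehat{\mathrm{Var}}(Y_1)}}{\sqrt{\widehat{\mathrm{Var}}(X_j)}}
= \frac{\sqrt{\lambda_1} \, a_{1j}}{\sqrt{\widehat{\mathrm{Var}}(X_j)}}
$$

$$
w_1 = \frac{a_{11}}{\sqrt{\widehat{\mathrm{Var}}(X_1)}}, \quad
w_2 = \frac{a_{12}}{\sqrt{\widehat{\mathrm{Var}}(X_2)}}, \quad \ldots, \quad
w_p = \frac{a_{1p}}{\sqrt{\widehat{\mathrm{Var}}(X_p)}}
$$

where $w_1$ is the …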
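The percentage contributions and the 70% criterion can be sketched as follows. The function names and the four eigenvalues are hypothetical, introduced only for this illustration; the threshold of 70% is the one stated above.

```python
def contributions(eigenvalues):
    """Percentage of the total variance explained by each component."""
    total = sum(eigenvalues)  # equals trace(S) or trace(R)
    return [100.0 * lam / total for lam in eigenvalues]


def components_to_retain(eigenvalues, threshold=70.0):
    """Smallest k whose accumulated contribution reaches the threshold (in %)."""
    ordered = sorted(eigenvalues, reverse=True)
    accumulated = 0.0
    for k, w in enumerate(contributions(ordered), start=1):
        accumulated += w
        if accumulated >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues of a 4 x 4 correlation matrix (their sum is p = 4)
eigenvalues = [2.5, 1.0, 0.4, 0.1]
W = contributions(eigenvalues)
k = components_to_retain(eigenvalues)
print([round(w, 1) for w in W], k)
```

In this hypothetical case the first component alone explains less than 70% of the total variance, so two components are retained.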
If the purpose of the analysis is to obtain indices, a very common practice in economics,
the analysis ends here.
If the purpose of the analysis is to compare or group individuals, the analysis continues,
and it is necessary to calculate the scores on each principal component that will be used in
the analysis.
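The scores of the first component for the $n$ treatments are therefore:

$$
\begin{aligned}
\text{Treatment } 1: \quad & Y_{11} = a_{11} X_{11} + a_{12} X_{12} + \cdots + a_{1p} X_{1p} \\
\text{Treatment } 2: \quad & Y_{21} = a_{11} X_{21} + a_{12} X_{22} + \cdots + a_{1p} X_{2p} \\
& \;\; \vdots \\
\text{Treatment } n: \quad & Y_{n1} = a_{11} X_{n1} + a_{12} X_{n2} + \cdots + a_{1p} X_{np}
\end{aligned}
$$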
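The same scores can be written compactly in matrix form. In the sketch below, the vector $\tilde{y}_1$ of first-component scores, the $n \times p$ score matrix $Y$, and the $p \times p$ matrix $A$ whose rows are the eigenvectors $\tilde{a}_i'$ are notation introduced here for illustration; they do not appear in the original notes.

$$
\tilde{y}_1 = X \tilde{a}_1, \qquad Y = X A'
$$

so that the element in row $i$ and column $k$ of $Y$ is $Y_{ik} = \sum_{j=1}^{p} a_{kj} X_{ij}$, the score of treatment $i$ on component $k$.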
Application example
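Table 2 shows the original observed values ($X_1$ and $X_2$) and the standardized values
($Z_1$ and $Z_2$) of two variables for five treatments ($n = 5$). The standardized values are
obtained by dividing each observation by the standard deviation of its variable (variance 1, any mean).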
Table 2. Original and standardized values of two variables for five treatments

Treatments    Original variables      Standardized variables
              X1         X2           Z1          Z2
1             102        96           24.3827     6.9554
2             104        87           24.8608     6.3033
3             101        62           24.1436     4.4920
4             93         68           22.2313     4.9268
5             100        77           23.9046     5.5788
Variance      17.50      190.50       1           1
Average       100.00     78.00        23.9046     5.6513
s(Xj)         √17.50     √190.50
The correlation matrix is:
$$
R = \begin{bmatrix} 1 & 0.5456 \\ 0.5456 & 1 \end{bmatrix}
$$
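The characteristic equation is:

$$
\begin{vmatrix} 1 - \lambda & 0.5456 \\ 0.5456 & 1 - \lambda \end{vmatrix} = 0
$$

$$
\lambda^2 - 2\lambda + 0.7023 = 0
$$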
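As a worked step, the quadratic can be solved with the usual formula. The numerical roots below are computed here from the equation above and are rounded to four decimal places:

$$
\lambda = \frac{2 \pm \sqrt{(-2)^2 - 4 \, (0.7023)}}{2} = 1 \pm \sqrt{1 - 0.7023} = 1 \pm \sqrt{0.2977} \approx 1 \pm 0.5456
$$

$$
\lambda_1 \approx 1.5456, \qquad \lambda_2 \approx 0.4544
$$

For a $2 \times 2$ correlation matrix the roots are always $1 \pm r$, where $r$ is the correlation between the two variables.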
The sum of λ 1 and λ 2 is equal to the trace of the matrix R. The trace of a matrix is the
sum of the elements of its principal diagonal.
$$
\mathrm{trace}(R) = 1 + 1 = 2
$$
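The normalized eigenvector associated with $\lambda_1$ is:

$$
\tilde{a}_1 = \begin{bmatrix} a_{11} \\ a_{12} \end{bmatrix}
= \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} 0.7071 \\ 0.7071 \end{bmatrix}
$$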
and the first principal component is:

$$
Y_1 = 0.7071 \, Z_1 + 0.7071 \, Z_2
$$
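The coefficients can be checked, and the second component obtained, by solving $(R - \lambda_i I) \, \tilde{a}_i = 0$ for each root of $\lambda^2 - 2\lambda + 0.7023 = 0$, namely $\lambda_1 \approx 1.5456$ and $\lambda_2 \approx 0.4544$. The following is a sketch of that step; the sign of the second eigenvector is an assumption, chosen so that it reproduces the $Y_2$ scores of Table 4:

$$
(1 - 1.5456) \, a_{11} + 0.5456 \, a_{12} = 0 \;\Rightarrow\; a_{11} = a_{12}
$$

$$
(1 - 0.4544) \, a_{21} + 0.5456 \, a_{22} = 0 \;\Rightarrow\; a_{21} = -a_{22}
$$

Normalizing each vector to unit length gives $\tilde{a}_1 = \frac{1}{\sqrt{2}} (1, 1)'$ and $\tilde{a}_2 = \frac{1}{\sqrt{2}} (-1, 1)'$, so that:

$$
Y_2 = -0.7071 \, Z_1 + 0.7071 \, Z_2
$$

For treatment 1, for example, $-0.7071 \times 24.3827 + 0.7071 \times 6.9554 \approx -12.32$. An eigenvector is defined only up to sign, so the opposite sign would be equally valid and would simply reverse the sign of all $Y_2$ scores.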
Table 4. Scores of the two principal components for the five treatments obtained from the
correlation matrix R.

Treatments    Principal component scores
              Y1         Y2
1             22.16      -12.32
2             22.04      -13.12
3             20.25      -13.90
4             19.20      -12.24
5             20.85      -12.96
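The scores in Table 4 can be reproduced from the standardized values in Table 2. In this sketch the first eigenvector is the one given in the text; the second eigenvector `a2` is an assumption (orthogonal to the first, with its sign chosen to match the $Y_2$ column), and the helper name `score` is hypothetical.

```python
import math

# Standardized values Z1, Z2 for the five treatments (Table 2)
Z = [
    [24.3827, 6.9554],
    [24.8608, 6.3033],
    [24.1436, 4.4920],
    [22.2313, 4.9268],
    [23.9046, 5.5788],
]

c = 1.0 / math.sqrt(2.0)  # 0.7071...
a1 = [c, c]    # first eigenvector, as given in the text
a2 = [-c, c]   # assumed second eigenvector (sign chosen to match Table 4)


def score(z, a):
    """Score of one treatment: linear combination of its standardized values."""
    return sum(aj * zj for aj, zj in zip(a, z))


Y1 = [round(score(z, a1), 2) for z in Z]
Y2 = [round(score(z, a2), 2) for z in Z]
for treatment, (y1, y2) in enumerate(zip(Y1, Y2), start=1):
    print(treatment, y1, y2)
```

Rounded to two decimals, both columns agree with Table 4.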
Scatter plot
Scatter plots are used to visualize the dispersion of the treatments as a function of the
principal components in two- or three-dimensional space. The dispersion of the treatments
for this example is illustrated in Figure 2.
[Figure: scatter plot of the five treatments; vertical axis: first component (Y1); horizontal axis: second component (Y2)]
Figure 2. Dispersion of the treatments as a function of the scores of the principal components.
BIBLIOGRAPHY
REGAZZI, A. J. Multivariate analysis. Lecture notes, INF 766, Department of Informatics,
Federal University of Viçosa, v. 2, 2000.
KHATTREE, R.; NAIK, D. N. Multivariate data reduction and discrimination with SAS
software. Cary, NC, USA: SAS Institute Inc., 2000. 558 p.
JOHNSON, R. A.; WICHERN, D. W. Applied multivariate statistical analysis. 4th ed. Upper
Saddle River, New Jersey: Prentice-Hall, 1999. 815 p.