Nuno Vasconcelos
ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g. a 1D subspace in 2D, or a 2D subspace in 3D

[figure: 1D subspace in 2D; 2D subspace in 3D]
Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components $\varphi_i$ are the eigenvectors of $\Sigma$
• principal lengths $\lambda_i$ are the eigenvalues of $\Sigma$

[figure: ellipse in the $(y_1, y_2)$ plane with axes along $\varphi_1, \varphi_2$ and lengths $\lambda_1, \lambda_2$]
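the connection between the covariance and the principal components can be checked numerically; a minimal sketch (the covariance matrix and sample size below are illustrative choices, not from the slides):

```python
import numpy as np

# draw samples from a 2-D Gaussian and recover the principal components
# (eigenvectors of Sigma) and principal lengths (eigenvalues of Sigma)
rng = np.random.default_rng(0)
Sigma_true = np.array([[4.0, 1.5],
                       [1.5, 1.0]])
y = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma_true, size=5000)

Sigma = np.cov(y, rowvar=False)      # sample covariance
lam, phi = np.linalg.eigh(Sigma)     # eigenvalues (ascending), columns of phi are phi_i

# phi is orthonormal and diagonalizes Sigma: Sigma = phi diag(lam) phi^T
```

the column of `phi` paired with the largest eigenvalue is the long axis of the equiprobability ellipse.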
Principal component analysis
there is an alternative manner to compute the principal components, based on the singular value decomposition (SVD):
• any real n × m matrix (n > m) can be decomposed as

$$A = M \Pi N^T$$

• where M is an n × m column-orthonormal matrix of left singular vectors (the columns of M)
• Π is an m × m diagonal matrix of singular values
• N^T is an m × m row-orthonormal matrix of right singular vectors (the columns of N)

$$M^T M = I, \qquad N^T N = I$$
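these SVD facts map directly onto NumPy's "thin" SVD; a small sketch with an arbitrary random matrix:

```python
import numpy as np

# thin SVD of a real n x m matrix (n > m): A = M Pi N^T
rng = np.random.default_rng(1)
n, m = 6, 3
A = rng.standard_normal((n, m))

M, pi, Nt = np.linalg.svd(A, full_matrices=False)  # M: n x m, pi: m, Nt: m x m
Pi = np.diag(pi)

# the decomposition reconstructs A, M is column-orthonormal, N orthonormal
assert np.allclose(M @ Pi @ Nt, A)
assert np.allclose(M.T @ M, np.eye(m))
assert np.allclose(Nt @ Nt.T, np.eye(m))
```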
PCA by SVD
to relate this to PCA, we consider the data matrix

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}$$

the sample mean is

$$\mu = \frac{1}{n}\sum_i x_i = \frac{1}{n}\begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} = \frac{1}{n} X \mathbf{1}$$
PCA by SVD
and we can center the data by subtracting the mean from each column of X; this is the centered data matrix

$$X_c = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix} - \begin{bmatrix} | & & | \\ \mu & \cdots & \mu \\ | & & | \end{bmatrix} = X - \mu \mathbf{1}^T = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T = X\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right)$$
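the centering identity above is easy to verify in code; a sketch with arbitrary sizes:

```python
import numpy as np

# X_c = X (I - (1/n) 1 1^T) equals subtracting the sample mean from each column
rng = np.random.default_rng(2)
d, n = 4, 10
X = rng.standard_normal((d, n))   # one example per column

ones = np.ones((n, 1))
Xc_matrix = X @ (np.eye(n) - ones @ ones.T / n)
Xc_direct = X - X.mean(axis=1, keepdims=True)
```

in practice one subtracts the mean directly; the matrix form is what makes the SVD derivation clean.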
PCA by SVD
the sample covariance is

$$\Sigma = \frac{1}{n}\sum_i (x_i - \mu)(x_i - \mu)^T = \frac{1}{n}\sum_i x_i^c \left(x_i^c\right)^T$$

where $x_i^c = x_i - \mu$, and in matrix form

$$\Sigma = \frac{1}{n}\begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix}\begin{bmatrix} - & (x_1^c)^T & - \\ & \vdots & \\ - & (x_n^c)^T & - \end{bmatrix} = \frac{1}{n} X_c X_c^T$$
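this matrix form of the sample covariance matches NumPy's biased estimator (`ddof=0` divides by n, as on the slide); a quick sketch:

```python
import numpy as np

# sample covariance as (1/n) X_c X_c^T, compared with np.cov
rng = np.random.default_rng(3)
d, n = 3, 50
X = rng.standard_normal((d, n))          # one example per column
Xc = X - X.mean(axis=1, keepdims=True)   # centered data matrix

Sigma = Xc @ Xc.T / n
```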
PCA by SVD
the matrix

$$X_c^T = \begin{bmatrix} - & (x_1^c)^T & - \\ & \vdots & \\ - & (x_n^c)^T & - \end{bmatrix}$$

is real n × d. assuming n > d, it has the SVD decomposition

$$X_c^T = M \Pi N^T, \qquad M^T M = I, \quad N^T N = I$$

and

$$\Sigma = \frac{1}{n} X_c X_c^T = \frac{1}{n} N \Pi M^T M \Pi N^T = \frac{1}{n} N \Pi^2 N^T$$
PCA by SVD
$$\Sigma = N \left(\frac{1}{n}\Pi^2\right) N^T$$

noting that N is d × d and orthonormal, and Π² diagonal, shows that this is just the eigenvalue decomposition of Σ

it follows that
• the eigenvectors of Σ are the columns of N
• the eigenvalues of Σ are

$$\lambda_i = \frac{1}{n}\pi_i^2$$

this gives an alternative algorithm for PCA
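the identity $\Sigma = N(\Pi^2/n)N^T$ can be checked numerically: the singular values of $X_c^T$ squared and divided by n should match the eigenvalues of the sample covariance. A sketch with random data:

```python
import numpy as np

# eigenvalues of Sigma from the SVD of the centered data matrix
rng = np.random.default_rng(4)
d, n = 3, 100
X = rng.standard_normal((d, n))
Xc = X - X.mean(axis=1, keepdims=True)

_, pi, Nt = np.linalg.svd(Xc.T, full_matrices=False)
lam_svd = pi**2 / n                        # lambda_i = pi_i^2 / n

Sigma = Xc @ Xc.T / n
lam_eig = np.linalg.eigvalsh(Sigma)[::-1]  # descending, to match SVD order
```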
PCA by SVD
computation of PCA by SVD, given X with one example per column:
• 1) create the centered data matrix

$$X_c^T = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) X^T$$

• 2) compute its SVD

$$X_c^T = M \Pi N^T$$

• 3) the principal components are the columns of N, and the eigenvalues are

$$\lambda_i = \frac{1}{n}\pi_i^2$$
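the three steps above fit in a few lines of NumPy; a minimal sketch (the function name `pca_svd` is my own choice):

```python
import numpy as np

def pca_svd(X):
    """PCA by SVD. X: d x n data matrix, one example per column.
    Returns (components, eigenvalues): columns of N, and lambda_i = pi_i^2 / n."""
    n = X.shape[1]
    # 1) centered data matrix X_c^T (subtracting the mean of each feature)
    XcT = X.T - X.T.mean(axis=0, keepdims=True)
    # 2) its SVD: X_c^T = M Pi N^T
    _, pi, Nt = np.linalg.svd(XcT, full_matrices=False)
    # 3) principal components = columns of N, eigenvalues = pi_i^2 / n
    return Nt.T, pi**2 / n

# usage on random data
rng = np.random.default_rng(5)
N, lam = pca_svd(rng.standard_normal((3, 200)))
```

the singular values come out in descending order, so the eigenvalues do too.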
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view: the discriminant dimensions could be thrown out

it is not hard to construct examples where PCA is the worst possible thing we could do
Example
consider a problem with
• two n-D Gaussian classes with covariance $\Sigma = \sigma^2 I$, $\sigma^2 = 10$

$$X \sim N(\mu_i, 10 I)$$

• we add an extra variable which is the class label itself

$$X' = [X, i]$$

• assuming that $P_Y(0) = P_Y(1) = 0.5$, the variance of this new variable is 0.125 < 10
• dimension n+1 has the smallest variance and is the first to be discarded!
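the contrived example can be reproduced in code; a sketch (the class means below are illustrative choices, not from the slides):

```python
import numpy as np

# two Gaussian classes with covariance 10*I, plus a coordinate equal to the label
rng = np.random.default_rng(6)
n_dim, n_per = 5, 10000
mu0, mu1 = np.zeros(n_dim), np.ones(n_dim)

X0 = rng.multivariate_normal(mu0, 10 * np.eye(n_dim), n_per)
X1 = rng.multivariate_normal(mu1, 10 * np.eye(n_dim), n_per)
Xp = np.vstack([np.hstack([X0, np.zeros((n_per, 1))]),   # X' = [X, 0]
                np.hstack([X1, np.ones((n_per, 1))])])   # X' = [X, 1]

var = Xp.var(axis=0)
# the perfectly discriminant label coordinate has by far the smallest
# variance, so variance-ranked selection (PCA) discards it first
```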
Example
this is
• a very contrived example
• but it shows that PCA can throw away all the discriminant info

does this mean you should never use PCA?
• no, typically it is a good method to find a suitable subset of variables, as long as you are not too greedy
• e.g. if you start with n = 100, and know that there are only 5 variables of interest, picking the top 20 PCA components is likely to keep the desired 5
• your classifier will be much better than for n = 100, and probably not much worse than the one with the best 5 features

is there a rule of thumb for finding the number of PCA components?
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• can be done by plotting the ratio $r_k$ as a function of k

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{n} \lambda_i^2}$$
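computing $r_k$ and picking the smallest k that reaches the target is a one-liner; a sketch (the eigenvalue list and the threshold p = 0.95 are illustrative choices):

```python
import numpy as np

# ratio r_k of explained variability, for eigenvalues sorted in descending order
lam = np.array([9.0, 4.0, 1.0, 0.5, 0.1])
r = np.cumsum(lam**2) / np.sum(lam**2)   # r_k for k = 1..n

# smallest k such that r_k >= 0.95
k = int(np.searchsorted(r, 0.95) + 1)
```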
Fisher’s linear discriminant
what if we really need to find the best features?
• harder question, usually impossible with simple methods
• there are better methods for finding discriminant directions

one good example is linear discriminant analysis (LDA)
• the idea is to find the line that best separates the two classes

[figure: two classes projected onto a line; a bad projection vs. a good projection]
Linear discriminant analysis
we have two classes such that

$$E_{X|Y}[X \mid Y = i] = \mu_i$$
$$E_{X|Y}\left[(X - \mu_i)(X - \mu_i)^T \mid Y = i\right] = \Sigma_i$$

and we consider the projection $Z = w^T X$. the difference between the projected class means satisfies

$$\left(E_{Z|Y}[Z \mid Y = 1] - E_{Z|Y}[Z \mid Y = 0]\right)^2 = \left(E_{X|Y}[w^T x \mid Y = 1] - E_{X|Y}[w^T x \mid Y = 0]\right)^2 = \left(w^T[\mu_1 - \mu_0]\right)^2$$
Linear discriminant analysis
however, this

$$\left(w^T[\mu_1 - \mu_0]\right)^2$$

can be made arbitrarily large simply by scaling w, so we normalize by the variance of the projections and solve

$$\max_w \frac{\left(E_{Z|Y}[Z \mid Y = 1] - E_{Z|Y}[Z \mid Y = 0]\right)^2}{\mathrm{var}[Z \mid Y = 1] + \mathrm{var}[Z \mid Y = 0]}$$
Linear discriminant analysis
we have already seen that

$$\left(E_{Z|Y}[Z \mid Y = 1] - E_{Z|Y}[Z \mid Y = 0]\right)^2 = \left(w^T[\mu_1 - \mu_0]\right)^2 = w^T[\mu_1 - \mu_0][\mu_1 - \mu_0]^T w$$

also

$$\mathrm{var}[Z \mid Y = i] = E_{Z|Y}\left\{\left(z - E_{Z|Y}[Z \mid Y = i]\right)^2 \mid Y = i\right\}$$
$$= E_{Z|Y}\left\{\left(w^T[x - \mu_i]\right)^2 \mid Y = i\right\}$$
$$= E_{Z|Y}\left\{w^T[x - \mu_i][x - \mu_i]^T w \mid Y = i\right\}$$
$$= w^T \Sigma_i w$$
Linear discriminant analysis
and

$$J(w) = \frac{\left(E_{Z|Y}[Z \mid Y = 1] - E_{Z|Y}[Z \mid Y = 0]\right)^2}{\mathrm{var}[Z \mid Y = 1] + \mathrm{var}[Z \mid Y = 0]} = \frac{w^T(\mu_1 - \mu_0)(\mu_1 - \mu_0)^T w}{w^T(\Sigma_1 + \Sigma_0)w}$$

defining the scatter matrices

$$S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T, \qquad S_W = \Sigma_1 + \Sigma_0$$

this is

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

since J(w) does not depend on the scale of w, we can equivalently solve

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

with the Lagrangian

$$L = w^T S_B w - \lambda\left(w^T S_W w - K\right)$$

• and maximize with respect to both w and λ
Linear discriminant analysis
setting the gradient of

$$L = w^T(S_B - \lambda S_W)w + \lambda K$$

with respect to w to zero leads to the generalized eigenvalue problem

$$S_B w = \lambda S_W w$$
Linear discriminant analysis
in this case, since $S_W$ is invertible,

$$S_W^{-1} S_B w = \lambda w$$

and, using the definition of $S_B$, $(\mu_1 - \mu_0)^T w$ is a scalar $\alpha$, so the solution is

$$w^* = \alpha\, S_W^{-1}(\mu_1 - \mu_0)$$
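the closed-form direction can be checked against the eigenvalue equation in a few lines; a sketch (the means and covariances below are illustrative choices):

```python
import numpy as np

# LDA direction w* = S_W^{-1}(mu_1 - mu_0) for two illustrative classes
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S0 = np.array([[1.0, 0.2], [0.2, 2.0]])
S1 = np.array([[1.5, -0.1], [-0.1, 1.0]])

SB = np.outer(mu1 - mu0, mu1 - mu0)   # between-class scatter
SW = S0 + S1                          # within-class scatter
w = np.linalg.solve(SW, mu1 - mu0)    # w* (up to the scalar alpha)

# w satisfies S_W^{-1} S_B w = lambda w, with lambda = (mu_1 - mu_0)^T w
v = np.linalg.solve(SW, SB @ w)
lam = (mu1 - mu0) @ w
```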
Linear discriminant analysis
note that we have seen this before
• for a classification problem with Gaussian classes of equal covariance $\Sigma_i = \Sigma$, the BDR boundary is the plane of normal

$$w = \Sigma^{-1}(\mu_i - \mu_j)$$

• if $\Sigma_1 = \Sigma_0$, this is also the LDA solution

[figure: two Gaussian classes of equal covariance Σ, separated by the plane through $x_0$ with normal w between $\mu_i$ and $\mu_j$]
Linear discriminant analysis
this gives two different interpretations of LDA
• it is optimal if and only if the classes are Gaussian and have equal covariance
• better than PCA, but not necessarily good enough
• a classifier on the LDA feature is equivalent to the BDR after the approximation of the data by two Gaussians with equal covariance