
PCA and LDA

Nuno Vasconcelos
ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when
viewed from the full space, e.g.
(figures: a 1D subspace in 2D; a 2D subspace in 3D)

• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids

Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ

(figure: an ellipse in the (y_1, y_2) plane, with axes along φ_1, φ_2 and lengths λ_1, λ_2)

by computing the eigenvalues we know if the data is flat

(figures: λ_1 >> λ_2: flat; λ_1 = λ_2: not flat)
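as a concrete illustration (a minimal sketch, not from the slides; the data and all names are made up), we can fit a Gaussian to nearly one-dimensional data in 2D and read the flatness off the eigenvalues of the sample covariance:

import numpy as np

rng = np.random.default_rng(0)
n = 500
t = rng.normal(size=n)                      # coordinate along the subspace
Y = np.stack([t, 0.5 * t], axis=1)          # points on the line y2 = 0.5 * y1
Y += 0.05 * rng.normal(size=Y.shape)        # small off-subspace noise

mu = Y.mean(axis=0)
Sigma = (Y - mu).T @ (Y - mu) / n           # sample covariance of the 2D data

lam, phi = np.linalg.eigh(Sigma)            # eigenvalues / eigenvectors of Sigma
lam, phi = lam[::-1], phi[:, ::-1]          # sort by decreasing eigenvalue

print(lam)   # lam[0] >> lam[1]: the data is "flat" along the direction phi[:, 0]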
Principal component analysis (learning)
Principal component analysis
there is an alternative manner to compute the principal components, based on the singular value decomposition
SVD:
• any real n x m matrix (n > m) can be decomposed as

  A = M Π N^T

• where M is an n x m column-orthonormal matrix of left singular vectors (the columns of M)
• Π is an m x m diagonal matrix of singular values
• N^T is an m x m row-orthonormal matrix of right singular vectors (the columns of N)

  M^T M = I        N^T N = I
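a quick numerical check of these conventions (a sketch using numpy, whose svd routine returns M, the diagonal of Π, and N^T):

import numpy as np

n, m = 8, 3
A = np.random.default_rng(1).normal(size=(n, m))

M, pi, Nt = np.linalg.svd(A, full_matrices=False)   # "economy" SVD: M is n x m
Pi = np.diag(pi)

assert np.allclose(A, M @ Pi @ Nt)           # A = M Pi N^T
assert np.allclose(M.T @ M, np.eye(m))       # M^T M = I  (column orthonormal)
assert np.allclose(Nt @ Nt.T, np.eye(m))     # N^T N = I  (N is m x m orthonormal)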
PCA by SVD
to relate this to PCA, we consider the data matrix

  X = [ x_1 ... x_n ]    (one example per column)

the sample mean is

  µ = (1/n) ∑_i x_i = (1/n) X 1,    where 1 = [1, ..., 1]^T is the n-vector of ones

PCA by SVD
and we can center the data by subtracting the mean from each column of X
this is the centered data matrix

  X_c = [ x_1 ... x_n ] − [ µ ... µ ]
      = X − µ 1^T = X − (1/n) X 1 1^T = X ( I − (1/n) 1 1^T )

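a small sketch (the names are mine) verifying this identity against plain column-wise mean subtraction:

import numpy as np

d, n = 4, 10
X = np.random.default_rng(2).normal(size=(d, n))     # one example per column

ones = np.ones((n, 1))
Xc_matrix = X @ (np.eye(n) - ones @ ones.T / n)      # X ( I - (1/n) 1 1^T )
Xc_direct = X - X.mean(axis=1, keepdims=True)        # subtract µ from each column

assert np.allclose(Xc_matrix, Xc_direct)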
PCA by SVD
the sample covariance is

  Σ = (1/n) ∑_i (x_i − µ)(x_i − µ)^T = (1/n) ∑_i x_i^c (x_i^c)^T

where x_i^c is the ith column of X_c

this can be written as

  Σ = (1/n) [ x_1^c ... x_n^c ] [ (x_1^c)^T ; ... ; (x_n^c)^T ] = (1/n) X_c X_c^T

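again as a sketch, this form of the covariance can be checked against numpy's biased estimator (rows of X are the variables, columns the examples):

import numpy as np

d, n = 4, 200
X = np.random.default_rng(3).normal(size=(d, n))
Xc = X - X.mean(axis=1, keepdims=True)               # centered data matrix

Sigma = Xc @ Xc.T / n                                # (1/n) X_c X_c^T
assert np.allclose(Sigma, np.cov(X, bias=True))      # same d x d matrix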
PCA by SVD
the matrix

  X_c^T = [ (x_1^c)^T ; ... ; (x_n^c)^T ]

is real n x d. Assuming n > d, it has the SVD decomposition

  X_c^T = M Π N^T,      M^T M = I,   N^T N = I

and

  Σ = (1/n) X_c X_c^T = (1/n) N Π M^T M Π N^T = (1/n) N Π^2 N^T
PCA by SVD

  Σ = N ( (1/n) Π^2 ) N^T

noting that N is d x d and orthonormal, and Π^2 diagonal, shows that this is just the eigenvalue decomposition of Σ
it follows that
• the eigenvectors of Σ are the columns of N
• the eigenvalues of Σ are

  λ_i = (1/n) π_i^2
this gives an alternative algorithm for PCA

PCA by SVD
computation of PCA by SVD
given X with one example per column
• 1) create the centered data matrix

  X_c^T = ( I − (1/n) 1 1^T ) X^T

• 2) compute its SVD

  X_c^T = M Π N^T

• 3) the principal components are the columns of N, and the eigenvalues are

  λ_i = (1/n) π_i^2
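a sketch of these three steps in numpy (the function name and return values are my own, not part of the slides):

import numpy as np

def pca_by_svd(X):
    """PCA of X (d x n, one example per column) via the SVD of the centered data."""
    d, n = X.shape
    ones = np.ones((n, 1))
    # 1) centered data matrix, transposed: X_c^T = (I - (1/n) 1 1^T) X^T
    XcT = (np.eye(n) - ones @ ones.T / n) @ X.T
    # 2) its SVD
    M, pi, Nt = np.linalg.svd(XcT, full_matrices=False)
    # 3) principal components = columns of N, eigenvalues = pi_i^2 / n
    return Nt.T, pi**2 / n

# sanity check against the covariance route
X = np.random.default_rng(4).normal(size=(5, 300))
N, lam = pca_by_svd(X)
assert np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(np.cov(X, bias=True))))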
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of
PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• certainly improves the density estimation, since space has
smaller dimension
• but could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the
worst possible thing we could do

Example
consider a problem with
• two n-D Gaussian classes with covariance Σ = σ^2 I, σ^2 = 10

  X ~ N(µ_i, 10 I)

• we add an extra variable which is the class label itself

  X' = [X, i]

• assuming that P_Y(0) = P_Y(1) = 0.5

  E[Y] = 0.5 × 0 + 0.5 × 1 = 0.5
  var[Y] = 0.5 × (0 − 0.5)^2 + 0.5 × (1 − 0.5)^2 = 0.25 << 10

• dimension n+1 has the smallest variance and is the first to be discarded!
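a sketch of this contrived setup (the means and sample sizes are arbitrary choices of mine): two Gaussian classes with Σ = 10 I plus a coordinate equal to the class label; that coordinate has by far the smallest variance, so PCA would discard it first even though it separates the classes perfectly:

import numpy as np

rng = np.random.default_rng(5)
n, dim = 1000, 4
mu0, mu1 = np.zeros(dim), np.ones(dim)               # arbitrary class means

X0 = rng.multivariate_normal(mu0, 10 * np.eye(dim), size=n)
X1 = rng.multivariate_normal(mu1, 10 * np.eye(dim), size=n)
y = np.concatenate([np.zeros(n), np.ones(n)])        # class labels

Xp = np.column_stack([np.vstack([X0, X1]), y])       # X' = [X, i]

print(Xp.var(axis=0))   # first `dim` entries ~ 10, last entry ~ 0.25:
                        # the label dimension has the smallest variance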
Example
this is
• a very contrived example
• but shows that PCA can throw away all the discriminant info
does this mean you should never use PCA?
• no, typically it is a good method to find a suitable subset of
variables, as long as you are not too greedy
• e.g. if you start with n = 100, and know that there are only 5 variables of interest
• picking the top 20 PCA components is likely to keep the desired 5
• your classifier will be much better than for n = 100, probably not much worse than the one with the best 5 features
is there a rule of thumb for finding the number of PCA
components?
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• can be done by plotting the ratio r_k as a function of k

  r_k = ( ∑_{i=1}^{k} λ_i^2 ) / ( ∑_{i=1}^{n} λ_i^2 )

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
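a sketch of the rule of thumb (the function name and the example eigenvalues are mine), returning the smallest k whose ratio r_k reaches the target fraction p:

import numpy as np

def num_components(lam, p=0.7):
    """Smallest k with r_k = sum_{i<=k} lam_i^2 / sum_{i<=n} lam_i^2 >= p."""
    lam = np.sort(np.asarray(lam))[::-1]          # eigenvalues, decreasing
    r = np.cumsum(lam**2) / np.sum(lam**2)        # r_k for k = 1, ..., n
    return int(np.argmax(r >= p)) + 1

lam = [5.0, 3.0, 2.0, 0.5, 0.3, 0.2]
print(num_components(lam, p=0.7))                 # k needed for 70% of the variability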
Fisher's linear discriminant
what if we really need to find the best features?
• harder question, usually impossible with simple methods
• there are better methods for finding discriminant directions
one good example is linear discriminant analysis (LDA)
• the idea is to find the line that best separates the two classes

(figure: a bad projection mixes the two classes; a good projection separates them)
Linear discriminant analysis
we have two classes such that

  E_{X|Y}[X | Y = i] = µ_i
  E_{X|Y}[(X − µ_i)(X − µ_i)^T | Y = i] = Σ_i

and want to find the line

  z = w^T x

that best separates them
one possibility would be to maximize

  (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])^2 = (E_{X|Y}[w^T X | Y = 1] − E_{X|Y}[w^T X | Y = 0])^2
                                              = (w^T [µ_1 − µ_0])^2
Linear discriminant analysis
however, this

  (w^T [µ_1 − µ_0])^2

can be made arbitrarily large by simply scaling w
we are only interested in the direction, not the magnitude
we need some type of normalization
Fisher suggested maximizing

  (between-class scatter) / (within-class scatter)

i.e.

  max_w  (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])^2 / ( var[Z | Y = 1] + var[Z | Y = 0] )
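a small numerical sketch of this point (the means and covariances are made up): the raw squared mean difference grows with the scale of w, while Fisher's ratio depends only on its direction:

import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S0 = S1 = np.eye(2)

def raw(w):  return (w @ (mu1 - mu0))**2                         # numerator only
def J(w):    return (w @ (mu1 - mu0))**2 / (w @ (S1 + S0) @ w)   # Fisher's ratio

w = np.array([1.0, 0.5])
print(raw(w), raw(10 * w))   # second value is 100x larger: scale-dependent
print(J(w), J(10 * w))       # identical: J depends only on the direction of w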
Linear discriminant analysis
we have already seen that

  (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])^2 = (w^T [µ_1 − µ_0])^2
                                              = w^T [µ_1 − µ_0][µ_1 − µ_0]^T w

also

  var[Z | Y = i] = E_{Z|Y}[ (z − E_{Z|Y}[Z | Y = i])^2 | Y = i ]
                 = E_{X|Y}[ (w^T [x − µ_i])^2 | Y = i ]
                 = E_{X|Y}[ w^T [x − µ_i][x − µ_i]^T w | Y = i ]
                 = w^T Σ_i w
Linear discriminant analysis
and

  J(w) = (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])^2 / ( var[Z | Y = 1] + var[Z | Y = 0] )
       = w^T (µ_1 − µ_0)(µ_1 − µ_0)^T w / w^T (Σ_1 + Σ_0) w

which can be written as

  J(w) = w^T S_B w / w^T S_W w

with the between-class scatter S_B = (µ_1 − µ_0)(µ_1 − µ_0)^T and the within-class scatter S_W = Σ_1 + Σ_0
Linear discriminant analysis
maximizing the ratio

  J(w) = w^T S_B w / w^T S_W w

• is equivalent to maximizing the numerator while keeping the denominator constant, i.e.

  max_w w^T S_B w   subject to   w^T S_W w = K

• and can be accomplished using Lagrange multipliers
• define the Lagrangian

  L = w^T S_B w − λ (w^T S_W w − K)

• and maximize with respect to both w and λ
Linear discriminant analysis
setting the gradient of

  L = w^T (S_B − λ S_W) w + λ K

with respect to w to zero we get

  ∇_w L = 2 (S_B − λ S_W) w = 0

or

  S_B w = λ S_W w

this is a generalized eigenvalue problem
the solution is easy when S_W^{-1} = (Σ_1 + Σ_0)^{-1} exists
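numerically, this generalized eigenvalue problem can be handed directly to scipy (a sketch; the means and covariances below are made up):

import numpy as np
from scipy.linalg import eigh

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S0 = np.array([[1.0, 0.2], [0.2, 1.0]])
S1 = np.array([[1.5, 0.0], [0.0, 0.5]])

S_B = np.outer(mu1 - mu0, mu1 - mu0)       # between-class scatter
S_W = S1 + S0                              # within-class scatter

lam, W = eigh(S_B, S_W)                    # solves S_B w = lambda S_W w
w = W[:, np.argmax(lam)]                   # direction with the largest lambda
print(w / np.linalg.norm(w))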
Linear discriminant analysis
in this case

  S_W^{-1} S_B w = λ w

and, using the definition of S_B,

  S_W^{-1} (µ_1 − µ_0)(µ_1 − µ_0)^T w = λ w

noting that (µ_1 − µ_0)^T w = α is a scalar, this can be written as

  S_W^{-1} (µ_1 − µ_0) = (λ/α) w

and since we don't care about the magnitude of w,

  w* = S_W^{-1} (µ_1 − µ_0) = (Σ_1 + Σ_0)^{-1} (µ_1 − µ_0)
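a sketch of this closed form estimated from data (the two Gaussian classes below are synthetic; np.linalg.solve is used instead of forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(6)
X0 = rng.multivariate_normal([0, 0], [[2.0, 0.5], [0.5, 1.0]], size=500)
X1 = rng.multivariate_normal([2, 1], [[2.0, 0.5], [0.5, 1.0]], size=500)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0, S1 = np.cov(X0.T, bias=True), np.cov(X1.T, bias=True)

w_star = np.linalg.solve(S1 + S0, mu1 - mu0)   # w* = (Sigma_1 + Sigma_0)^{-1} (mu_1 - mu_0)
z0, z1 = X0 @ w_star, X1 @ w_star              # 1-D projections z = w^T x
print(z0.mean(), z1.mean())                    # the classes separate along w*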
Linear discriminant analysis
note that we have seen this before
• for a classification problem with Gaussian classes of equal covariance Σ_i = Σ, the BDR boundary is the plane of normal

  w = Σ^{-1} (µ_i − µ_j)

• if Σ_1 = Σ_0, this is also the LDA solution

(figure: Gaussian classes with equal covariance Σ; the boundary plane passes through x_0 with normal w, between µ_i and µ_j)
Linear discriminant analysis
this gives two different interpretations of LDA
• it is optimal if and only if the classes are Gaussian and have
equal covariance
• better than PCA, but not necessarily good enough
• a classifier on the LDA feature is equivalent to the BDR after approximating the data by two Gaussians with equal covariance
