
PCA and LDA

Nuno Vasconcelos
ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when
viewed from the full space, e.g.
(figures: a 1D subspace in 2D; a 2D subspace in 3D)

• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids

Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ

(figure: an ellipse in the (y_1, y_2) plane, with axes along φ_1, φ_2 and lengths λ_1, λ_2)

by computing the eigenvalues we know if the data is flat

(figures: λ_1 >> λ_2: flat; λ_1 = λ_2: not flat)
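as a concrete illustration (a minimal sketch, not from the slides; the data and all names are made up), we can fit a Gaussian to nearly one-dimensional data in 2D and read the flatness off the eigenvalues of the sample covariance:

import numpy as np

rng = np.random.default_rng(0)
n = 500
t = rng.normal(size=n)                      # coordinate along the subspace
Y = np.stack([t, 0.5 * t], axis=1)          # points on the line y2 = 0.5 * y1
Y += 0.05 * rng.normal(size=Y.shape)        # small off-subspace noise

mu = Y.mean(axis=0)
Sigma = (Y - mu).T @ (Y - mu) / n           # sample covariance of the 2D data

lam, phi = np.linalg.eigh(Sigma)            # eigenvalues / eigenvectors of Sigma
lam, phi = lam[::-1], phi[:, ::-1]          # sort by decreasing eigenvalue

print(lam)   # lam[0] >> lam[1]: the data is "flat" along the direction phi[:, 0]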
Principal component analysis (learning)
Principal component analysis
there is an alternative manner to compute the principal components, based on the singular value decomposition
SVD:
• any real n x m matrix (n > m) can be decomposed as

  A = M Π N^T

• where M is an n x m column-orthonormal matrix of left singular vectors (the columns of M)
• Π is an m x m diagonal matrix of singular values
• N^T is an m x m row-orthonormal matrix of right singular vectors (the columns of N)

  M^T M = I        N^T N = I
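a quick numerical check of these conventions (a sketch using numpy, whose svd routine returns M, the diagonal of Π, and N^T):

import numpy as np

n, m = 8, 3
A = np.random.default_rng(1).normal(size=(n, m))

M, pi, Nt = np.linalg.svd(A, full_matrices=False)   # "economy" SVD: M is n x m
Pi = np.diag(pi)

assert np.allclose(A, M @ Pi @ Nt)           # A = M Pi N^T
assert np.allclose(M.T @ M, np.eye(m))       # M^T M = I  (column orthonormal)
assert np.allclose(Nt @ Nt.T, np.eye(m))     # N^T N = I  (N is m x m orthonormal)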
PCA by SVD
to relate this to PCA, we consider the data matrix

  X = [ x_1 ... x_n ]    (one example per column)

the sample mean is

  µ = (1/n) ∑_i x_i = (1/n) X 1,    where 1 = [1, ..., 1]^T is the n-vector of ones

PCA by SVD
and we can center the data by subtracting the mean from each column of X
this is the centered data matrix

  X_c = [ x_1 ... x_n ] − [ µ ... µ ]
      = X − µ 1^T = X − (1/n) X 1 1^T = X ( I − (1/n) 1 1^T )

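a small sketch (the names are mine) verifying this identity against plain column-wise mean subtraction:

import numpy as np

d, n = 4, 10
X = np.random.default_rng(2).normal(size=(d, n))     # one example per column

ones = np.ones((n, 1))
Xc_matrix = X @ (np.eye(n) - ones @ ones.T / n)      # X ( I - (1/n) 1 1^T )
Xc_direct = X - X.mean(axis=1, keepdims=True)        # subtract µ from each column

assert np.allclose(Xc_matrix, Xc_direct)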
PCA by SVD
the sample covariance is

  Σ = (1/n) ∑_i (x_i − µ)(x_i − µ)^T = (1/n) ∑_i x_i^c (x_i^c)^T

where x_i^c is the ith column of X_c

this can be written as

  Σ = (1/n) [ x_1^c ... x_n^c ] [ (x_1^c)^T ; ... ; (x_n^c)^T ] = (1/n) X_c X_c^T

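again as a sketch, this form of the covariance can be checked against numpy's biased estimator (rows of X are the variables, columns the examples):

import numpy as np

d, n = 4, 200
X = np.random.default_rng(3).normal(size=(d, n))
Xc = X - X.mean(axis=1, keepdims=True)               # centered data matrix

Sigma = Xc @ Xc.T / n                                # (1/n) X_c X_c^T
assert np.allclose(Sigma, np.cov(X, bias=True))      # same d x d matrix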
PCA by SVD
the matrix

  X_c^T = [ (x_1^c)^T ; ... ; (x_n^c)^T ]

is real n x d. Assuming n > d, it has the SVD decomposition

  X_c^T = M Π N^T,      M^T M = I,   N^T N = I

and

  Σ = (1/n) X_c X_c^T = (1/n) N Π M^T M Π N^T = (1/n) N Π^2 N^T
PCA by SVD

  Σ = N ( (1/n) Π^2 ) N^T

noting that N is d x d and orthonormal, and Π^2 diagonal, shows that this is just the eigenvalue decomposition of Σ
it follows that
• the eigenvectors of Σ are the columns of N
• the eigenvalues of Σ are

  λ_i = (1/n) π_i^2
this gives an alternative algorithm for PCA

PCA by SVD
computation of PCA by SVD
given X with one example per column
• 1) create the centered data matrix

  X_c^T = ( I − (1/n) 1 1^T ) X^T

• 2) compute its SVD

  X_c^T = M Π N^T

• 3) the principal components are the columns of N, and the eigenvalues are

  λ_i = (1/n) π_i^2
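a sketch of these three steps in numpy (the function name and return values are my own, not part of the slides):

import numpy as np

def pca_by_svd(X):
    """PCA of X (d x n, one example per column) via the SVD of the centered data."""
    d, n = X.shape
    ones = np.ones((n, 1))
    # 1) centered data matrix, transposed: X_c^T = (I - (1/n) 1 1^T) X^T
    XcT = (np.eye(n) - ones @ ones.T / n) @ X.T
    # 2) its SVD
    M, pi, Nt = np.linalg.svd(XcT, full_matrices=False)
    # 3) principal components = columns of N, eigenvalues = pi_i^2 / n
    return Nt.T, pi**2 / n

# sanity check against the covariance route
X = np.random.default_rng(4).normal(size=(5, 300))
N, lam = pca_by_svd(X)
assert np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(np.cov(X, bias=True))))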
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of
PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• certainly improves the density estimation, since space has
smaller dimension
• but could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the
worst possible thing we could do

Example
consider a problem with
• two n-D Gaussian classes with covariance Σ = σ^2 I, σ^2 = 10

  X ~ N(µ_i, 10 I)

• we add an extra variable which is the class label itself

  X' = [X, i]

• assuming that P_Y(0) = P_Y(1) = 0.5

  E[Y] = 0.5 × 0 + 0.5 × 1 = 0.5
  var[Y] = 0.5 × (0 − 0.5)^2 + 0.5 × (1 − 0.5)^2 = 0.25 << 10

• dimension n+1 has the smallest variance and is the first to be discarded!
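a sketch of this contrived setup (the means and sample sizes are arbitrary choices of mine): two Gaussian classes with Σ = 10 I plus a coordinate equal to the class label; that coordinate has by far the smallest variance, so PCA would discard it first even though it separates the classes perfectly:

import numpy as np

rng = np.random.default_rng(5)
n, dim = 1000, 4
mu0, mu1 = np.zeros(dim), np.ones(dim)               # arbitrary class means

X0 = rng.multivariate_normal(mu0, 10 * np.eye(dim), size=n)
X1 = rng.multivariate_normal(mu1, 10 * np.eye(dim), size=n)
y = np.concatenate([np.zeros(n), np.ones(n)])        # class labels

Xp = np.column_stack([np.vstack([X0, X1]), y])       # X' = [X, i]

print(Xp.var(axis=0))   # first `dim` entries ~ 10, last entry ~ 0.25:
                        # the label dimension has the smallest variance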
Example
this is
• a very contrived example
• but shows that PCA can throw away all the discriminant info
does this mean you should never use PCA?
• no, typically it is a good method to find a suitable subset of
variables, as long as you are not too greedy
• e.g. if you start with n = 100, and know that there are only 5 variables of interest
• picking the top 20 PCA components is likely to keep the desired 5
• your classifier will be much better than for n = 100, probably not much worse than the one with the best 5 features
is there a rule of thumb for finding the number of PCA
components?
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• can be done by plotting the ratio r_k as a function of k

  r_k = ( ∑_{i=1}^{k} λ_i^2 ) / ( ∑_{i=1}^{n} λ_i^2 )

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
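a sketch of the rule of thumb (the function name and the example eigenvalues are mine), returning the smallest k whose ratio r_k reaches the target fraction p:

import numpy as np

def num_components(lam, p=0.7):
    """Smallest k with r_k = sum_{i<=k} lam_i^2 / sum_{i<=n} lam_i^2 >= p."""
    lam = np.sort(np.asarray(lam))[::-1]          # eigenvalues, decreasing
    r = np.cumsum(lam**2) / np.sum(lam**2)        # r_k for k = 1, ..., n
    return int(np.argmax(r >= p)) + 1

lam = [5.0, 3.0, 2.0, 0.5, 0.3, 0.2]
print(num_components(lam, p=0.7))                 # k needed for 70% of the variability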
Fisher's linear discriminant
what if we really need to find the best features?
• harder question, usually impossible with simple methods
• there are better methods for finding discriminant directions
one good example is linear discriminant analysis (LDA)
• the idea is to find the line that best separates the two classes

(figure: a bad projection mixes the two classes; a good projection separates them)
Linear discriminant analysis
we have two classes such that

  E_{X|Y}[X | Y = i] = µ_i
  E_{X|Y}[(X − µ_i)(X − µ_i)^T | Y = i] = Σ_i

and want to find the line

  z = w^T x

that best separates them
one possibility would be to maximize

  (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])^2 = (E_{X|Y}[w^T X | Y = 1] − E_{X|Y}[w^T X | Y = 0])^2
                                              = (w^T [µ_1 − µ_0])^2
Linear discriminant analysis
however, this

  (w^T [µ_1 − µ_0])^2

can be made arbitrarily large by simply scaling w
we are only interested in the direction, not the magnitude
we need some type of normalization
Fisher suggested maximizing

  (between-class scatter) / (within-class scatter)

i.e.

  max_w  (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])^2 / ( var[Z | Y = 1] + var[Z | Y = 0] )
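a small numerical sketch of this point (the means and covariances are made up): the raw squared mean difference grows with the scale of w, while Fisher's ratio depends only on its direction:

import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S0 = S1 = np.eye(2)

def raw(w):  return (w @ (mu1 - mu0))**2                         # numerator only
def J(w):    return (w @ (mu1 - mu0))**2 / (w @ (S1 + S0) @ w)   # Fisher's ratio

w = np.array([1.0, 0.5])
print(raw(w), raw(10 * w))   # second value is 100x larger: scale-dependent
print(J(w), J(10 * w))       # identical: J depends only on the direction of w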
Linear discriminant analysis
we have already seen that

  (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])^2 = (w^T [µ_1 − µ_0])^2
                                              = w^T [µ_1 − µ_0][µ_1 − µ_0]^T w

also

  var[Z | Y = i] = E_{Z|Y}[ (z − E_{Z|Y}[Z | Y = i])^2 | Y = i ]
                 = E_{X|Y}[ (w^T [x − µ_i])^2 | Y = i ]
                 = E_{X|Y}[ w^T [x − µ_i][x − µ_i]^T w | Y = i ]
                 = w^T Σ_i w
Linear discriminant analysis
and

  J(w) = (E_{Z|Y}[Z | Y = 1] − E_{Z|Y}[Z | Y = 0])^2 / ( var[Z | Y = 1] + var[Z | Y = 0] )
       = w^T (µ_1 − µ_0)(µ_1 − µ_0)^T w / w^T (Σ_1 + Σ_0) w

which can be written as

  J(w) = w^T S_B w / w^T S_W w

with the between-class scatter S_B = (µ_1 − µ_0)(µ_1 − µ_0)^T and the within-class scatter S_W = Σ_1 + Σ_0
Linear discriminant analysis
maximizing the ratio

  J(w) = w^T S_B w / w^T S_W w

• is equivalent to maximizing the numerator while keeping the denominator constant, i.e.

  max_w w^T S_B w   subject to   w^T S_W w = K

• and can be accomplished using Lagrange multipliers
• define the Lagrangian

  L = w^T S_B w − λ (w^T S_W w − K)

• and maximize with respect to both w and λ
Linear discriminant analysis
setting the gradient of

  L = w^T (S_B − λ S_W) w + λ K

with respect to w to zero we get

  ∇_w L = 2 (S_B − λ S_W) w = 0

or

  S_B w = λ S_W w

this is a generalized eigenvalue problem
the solution is easy when S_W^{-1} = (Σ_1 + Σ_0)^{-1} exists
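numerically, this generalized eigenvalue problem can be handed directly to scipy (a sketch; the means and covariances below are made up):

import numpy as np
from scipy.linalg import eigh

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S0 = np.array([[1.0, 0.2], [0.2, 1.0]])
S1 = np.array([[1.5, 0.0], [0.0, 0.5]])

S_B = np.outer(mu1 - mu0, mu1 - mu0)       # between-class scatter
S_W = S1 + S0                              # within-class scatter

lam, W = eigh(S_B, S_W)                    # solves S_B w = lambda S_W w
w = W[:, np.argmax(lam)]                   # direction with the largest lambda
print(w / np.linalg.norm(w))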
Linear discriminant analysis
in this case

  S_W^{-1} S_B w = λ w

and, using the definition of S_B,

  S_W^{-1} (µ_1 − µ_0)(µ_1 − µ_0)^T w = λ w

noting that (µ_1 − µ_0)^T w = α is a scalar, this can be written as

  S_W^{-1} (µ_1 − µ_0) = (λ/α) w

and since we don't care about the magnitude of w,

  w* = S_W^{-1} (µ_1 − µ_0) = (Σ_1 + Σ_0)^{-1} (µ_1 − µ_0)
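a sketch of this closed form estimated from data (the two Gaussian classes below are synthetic; np.linalg.solve is used instead of forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(6)
X0 = rng.multivariate_normal([0, 0], [[2.0, 0.5], [0.5, 1.0]], size=500)
X1 = rng.multivariate_normal([2, 1], [[2.0, 0.5], [0.5, 1.0]], size=500)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0, S1 = np.cov(X0.T, bias=True), np.cov(X1.T, bias=True)

w_star = np.linalg.solve(S1 + S0, mu1 - mu0)   # w* = (Sigma_1 + Sigma_0)^{-1} (mu_1 - mu_0)
z0, z1 = X0 @ w_star, X1 @ w_star              # 1-D projections z = w^T x
print(z0.mean(), z1.mean())                    # the classes separate along w*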
Linear discriminant analysis
note that we have seen this before
• for a classification problem with Gaussian classes of equal covariance Σ_i = Σ, the BDR boundary is the plane of normal

  w = Σ^{-1} (µ_i − µ_j)

• if Σ_1 = Σ_0, this is also the LDA solution

(figure: Gaussian classes with equal covariance Σ; the boundary plane passes through x_0 with normal w, between µ_i and µ_j)
Linear discriminant analysis
this gives two different interpretations of LDA
• it is optimal if and only if the classes are Gaussian and have
equal covariance
• better than PCA, but not necessarily good enough
• a classifier on the LDA feature is equivalent to the BDR after approximating the data by two Gaussians with equal covariance
