Linear Discriminant

[Figures: two-class data plotted against Feature 1 and Feature 2, separated by a linear discriminant.]
Perceptrons

Precursors to neural networks.

Discriminant function:

  y(x) = f(w^T x + w_0)
       = f(w_0 + w_1 x_1 + \dots + w_D x_D)

where f is the sign function:

  sgn(y) = -1 if y < 0, +1 otherwise.

[Figure: the decision boundary in the (Feature 1, Feature 2) plane, the line x_2 = -(w_1/w_2) x_1 - w_0/w_2.]
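As a concrete illustration of the discriminant function above, here is a minimal sketch in Python/NumPy; the weight vector, bias, and input points are made-up values, not taken from the slides:

```python
import numpy as np

def perceptron_output(x, w, w0):
    """Discriminant y(x) = sgn(w^T x + w_0), mapping to -1 or +1."""
    activation = np.dot(w, x) + w0
    return 1 if activation >= 0 else -1

# Hypothetical 2-D example: weights and bias chosen arbitrarily.
w = np.array([1.0, -2.0])
w0 = 0.5
print(perceptron_output(np.array([3.0, 1.0]), w, w0))   # +1
print(perceptron_output(np.array([0.0, 2.0]), w, w0))   # -1
```

Prepending a constant input x_0 = 1 absorbs the bias into the weight vector, giving the homogeneous form sgn(w^T x) used in the in-class exercise below.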
In-class exercise

  y(x) = sgn(w^T x)
       = sgn(w_0 x_0 + w_1 x_1 + \dots + w_D x_D)
Notation

Let S = {(x^n, t^n) : n = 1, 2, ..., m} be a training set.

Output o:

  o = sgn\left( \sum_{j=0}^{d} w_j x_j \right) = sgn(w^T x)

Error of a perceptron on a single training example (x^n, t^n):

  E_n = \frac{1}{2} (t^n - o^n)^2

Example

[Figure: a perceptron with constant input +1 (weight w_0) and inputs x_1, x_2 (weights w_1, w_2), producing output o.]

What is E_1? What is E_2?
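A small sketch of the per-example error E_n = 1/2 (t^n - o^n)^2; the weights and the two training examples are hypothetical, since the slide's figure does not fix their values:

```python
import numpy as np

def output(w, x):
    """o = sgn(sum_j w_j x_j), with x_0 = 1 prepended to absorb the bias w_0."""
    x_aug = np.concatenate(([1.0], x))
    return 1.0 if np.dot(w, x_aug) >= 0 else -1.0

def example_error(w, x, t):
    """E_n = 1/2 (t^n - o^n)^2 for a single training example (x^n, t^n)."""
    return 0.5 * (t - output(w, x)) ** 2

# Hypothetical weights (w_0, w_1, w_2) and two training examples.
w = np.array([-0.5, 1.0, 1.0])
print(example_error(w, np.array([1.0, 1.0]), +1.0))  # 0.0: correctly classified
print(example_error(w, np.array([0.0, 0.0]), +1.0))  # 2.0: misclassified
```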
How do we train a perceptron?
Gradient descent
  \nabla E(w) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right)

where each weight is updated as

  \Delta w_j = -\eta \frac{\partial E}{\partial w_j}
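To make the update concrete, here is a minimal gradient-descent sketch; the quadratic error function with a known minimum w* is only a stand-in chosen so the example runs on its own:

```python
import numpy as np

def gradient_descent(grad_E, w, eta=0.1, steps=100):
    """Repeatedly apply w <- w + delta_w, with delta_w = -eta * dE/dw."""
    for _ in range(steps):
        w = w - eta * grad_E(w)
    return w

# Stand-in error E(w) = ||w - w*||^2 with known minimum w* = (1, -2);
# its gradient is 2 (w - w*).
w_star = np.array([1.0, -2.0])
grad_E = lambda w: 2.0 * (w - w_star)
print(gradient_descent(grad_E, w=np.zeros(2)))  # approaches [1, -2]
```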
Error function
  E(w) = \frac{1}{2} \sum_{n=1}^{m} (t^n - o^n)^2

E has to be differentiable, so the output function o also has to be differentiable.
Activation functions

[Figure: output as a function of activation for a linear unit and for a step (sgn) unit, which jumps from -1 to +1 at 0.]

  o = \sum_j w_j x_j + w_0                          (linear unit, differentiable)

  o = sgn\left( \sum_j w_j x_j + w_0 \right)        (step / sgn unit, not differentiable)
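A tiny sketch contrasting the two output functions on the same made-up weights and input:

```python
import numpy as np

def linear_output(w, x, w0):
    """Differentiable output: o = sum_j w_j x_j + w_0."""
    return np.dot(w, x) + w0

def sgn_output(w, x, w0):
    """Thresholded output: o = sgn(sum_j w_j x_j + w_0), not differentiable at 0."""
    return 1.0 if linear_output(w, x, w0) >= 0 else -1.0

w, w0, x = np.array([1.0, -1.0]), 0.2, np.array([0.3, 0.6])
print(linear_output(w, x, w0), sgn_output(w, x, w0))   # -0.1 and -1.0
```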
  \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_n (t^n - o^n)^2        (1)

  = \frac{1}{2} \sum_n \frac{\partial}{\partial w_i} (t^n - o^n)^2                                        (2)

  = \frac{1}{2} \sum_n 2 (t^n - o^n) \frac{\partial}{\partial w_i} (t^n - o^n)                            (3)

  = \sum_n (t^n - o^n) \frac{\partial}{\partial w_i} (t^n - w^T x^n)                                      (4)

  = \sum_n (t^n - o^n) (-x_i^n)                                                                           (5)

So,

  \Delta w_i = \eta \sum_n (t^n - o^n) x_i^n                                                              (6)

This is called the perceptron learning rule.
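A sketch of update (6) for a linear (unthresholded) unit o = w^T x; the learning rate and the three training examples are placeholders:

```python
import numpy as np

def delta_rule_update(w, X, t, eta=0.1):
    """One batch update: delta_w_i = eta * sum_n (t^n - o^n) x_i^n,
    where o^n = w . x^n is the (unthresholded) linear output."""
    o = X @ w                       # outputs for all examples, shape (m,)
    delta_w = eta * (t - o) @ X     # sum over examples of (t^n - o^n) x^n
    return w + delta_w

# Hypothetical data: each row of X is an example with x_0 = 1 prepended.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
t = np.array([-1.0, -1.0, 1.0])
w = np.zeros(3)
for _ in range(500):
    w = delta_rule_update(w, X, t)
print(w, X @ w)   # linear outputs approach the targets (-1, -1, 1)
```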
Training a perceptron

1. Start with random weights, w = (w_1, w_2, ..., w_d).

[Figure: the perceptron with constant input +1 (weight w_0), input x_1 (weight w_1), and output o.]
1960s: Rosenblatt proved that the perceptron learning rule
converges to correct weights in a finite number of steps,
provided the training examples are linearly separable.
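A minimal sketch of that learning rule with the thresholded sgn output on a small, linearly separable toy set; the data, the learning rate, and the zero initialization (rather than random weights) are simplifications for illustration:

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Update w <- w + eta (t^n - o^n) x^n, example by example, until no mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            o_n = 1.0 if np.dot(w, x_n) >= 0 else -1.0
            if o_n != t_n:
                w = w + eta * (t_n - o_n) * x_n
                mistakes += 1
        if mistakes == 0:          # converged: every example classified correctly
            break
    return w

# Linearly separable toy data (x_0 = 1 prepended for the bias weight).
X = np.array([[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w = train_perceptron(X, t)
print(w, [1 if np.dot(w, x) >= 0 else -1 for x in X])  # predictions match t
```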
Fisher's linear discriminant: project each data point onto a direction w,

  y(x) = w^T x
Assume two classes, C_1 and C_2, with N_1 and N_2 instances, respectively. The means of the two classes are:

  m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n ,   m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n

Projecting each point with

  y(x) = w^T x,

we can write the variance of the ys within class C_k as:

  s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2 ,

where m_k here denotes the projected class mean, w^T m_k.
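A short sketch of these quantities for two made-up 2-D point clouds and an arbitrary projection direction w:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))   # class C1
X2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(60, 2))   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                  # class means

w = np.array([1.0, 0.5])                                   # arbitrary direction
y1, y2 = X1 @ w, X2 @ w                                    # projections y = w^T x
s1_sq = np.sum((y1 - w @ m1) ** 2)                         # within-class variances of the ys
s2_sq = np.sum((y2 - w @ m2) ** 2)
print(m1, m2, s1_sq, s2_sq)
```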
Total within-class variance in the data set is

  s_1^2 + s_2^2 .

Define the Fisher criterion as

  J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} ,

i.e., the ratio of between-class variance to within-class variance.

To maximize this, compute the between-class covariance matrix:

  S_B = (m_1 - m_2)(m_1 - m_2)^T
In terms of S_B and the within-class covariance matrix

  S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T ,

the criterion becomes

  J(w) = \frac{w^T S_B w}{w^T S_W w}

and is maximized by

  w \propto S_W^{-1} (m_2 - m_1)
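Putting the pieces together, a sketch that computes the Fisher direction S_W^{-1}(m_2 - m_1) for the same kind of made-up data as above (repeated here so the snippet stands alone):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))   # class C1
X2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(60, 2))   # class C2
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class covariance matrix S_W, summed over both classes.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction; any positive rescaling gives the same discriminant.
w = np.linalg.solve(S_W, m2 - m1)
print(w / np.linalg.norm(w))
```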
Historical note: Who is Fisher's linear discriminant named for?

Feature selection/construction

Notion of overfitting.
Linear classification in feature space

Variance of an attribute:

  var(A_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
Covariance of two attributes:

  cov(A_1, A_2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

Covariance matrix

Suppose we have n attributes, A_1, ..., A_n. The covariance matrix is the n x n matrix whose (i, j) entry is cov(A_i, A_j).
For example, with two attributes H and M:

  \begin{pmatrix} cov(H,H) & cov(H,M) \\ cov(M,H) & cov(M,M) \end{pmatrix}
  = \begin{pmatrix} var(H) & 104.5 \\ 104.5 & var(M) \end{pmatrix}
  = \begin{pmatrix} 47.7 & 104.5 \\ 104.5 & 370 \end{pmatrix}
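A covariance matrix of this form can be computed directly; the two attribute vectors below are invented and do not reproduce the H/M numbers above:

```python
import numpy as np

# Hypothetical measurements of two attributes H and M for six items.
H = np.array([9.0, 15.0, 25.0, 14.0, 10.0, 18.0])
M = np.array([39.0, 56.0, 93.0, 61.0, 50.0, 75.0])

# cov(A1, A2) = sum_i (x_i - x_bar)(y_i - y_bar) / (n - 1)
cov_HM = np.sum((H - H.mean()) * (M - M.mean())) / (len(H) - 1)

C = np.cov(np.vstack([H, M]))   # 2x2 matrix [[var(H), cov], [cov, var(M)]]
print(cov_HM, C)
```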
Principal Components Analysis (PCA)

1. Given the original data set S = {x_1, ..., x_k}, produce the mean-adjusted data set by subtracting each attribute's mean from that attribute's values.
2. Calculate the covariance matrix of the mean-adjusted data; for two attributes x and y this is the 2 x 2 matrix

  \begin{pmatrix} cov(x,x) & cov(x,y) \\ cov(y,x) & cov(y,y) \end{pmatrix}

3. Compute the eigenvectors and eigenvalues of the covariance matrix.
4. Order the eigenvectors by eigenvalue, highest to lowest. For the example data:

  v_1 = \begin{pmatrix} -0.677873399 \\ -0.735178956 \end{pmatrix},  \lambda_1 = 1.28402771

  v_2 = \begin{pmatrix} -0.735178956 \\ 0.677873399 \end{pmatrix},  \lambda_2 = 0.0490833989

Keeping both eigenvectors, or only the first, gives

  FeatureVector_1 = \begin{pmatrix} -0.677873399 & -0.735178956 \\ -0.735178956 & 0.677873399 \end{pmatrix}

  FeatureVector_2 = \begin{pmatrix} -0.677873399 \\ -0.735178956 \end{pmatrix}
5. Derive the new data set: TransformedData = RowFeatureVector × RowDataAdjust, with

  RowFeatureVector_1 = \begin{pmatrix} -0.677873399 & -0.735178956 \\ -0.735178956 & 0.677873399 \end{pmatrix}

  RowDataAdjust = \begin{pmatrix} 0.69 & -1.31 & 0.39 & 0.09 & 1.29 & 0.49 & 0.19 & -0.81 & -0.31 & -0.71 \\ 0.49 & -1.21 & 0.99 & 0.29 & 1.09 & 0.79 & -0.31 & -0.81 & -0.31 & -1.01 \end{pmatrix}
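The five steps can be written compactly in NumPy; the 2-D data below is randomly generated rather than the 10-point example on the slides, so the eigenvalues and projections will differ:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(10, 2)) @ np.array([[1.0, 0.8], [0.0, 0.5]])  # correlated 2-D points

# Steps 1-2: subtract each attribute's mean, then form the covariance matrix.
mean = data.mean(axis=0)
adjusted = data - mean
cov = np.cov(adjusted, rowvar=False)

# Steps 3-4: eigen-decomposition, eigenvectors ordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project the adjusted data onto the chosen eigenvectors (here: both).
row_feature_vector = eigvecs.T               # eigenvectors as rows
transformed = row_feature_vector @ adjusted.T
print(eigvals)
print(transformed.T)                          # one transformed point per row
```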
Intuition: We projected the data onto new axes that capture the strongest linear trends in the data set. Each transformed data point tells us how far it is above or below those trend lines.
Reconstructing the original data

We did:

  TransformedData = RowFeatureVector × RowDataAdjust

so we can do

  RowDataAdjust = RowFeatureVector^{-1} × TransformedData
                = RowFeatureVector^T × TransformedData

and

  RowDataOriginal = RowDataAdjust + OriginalMean
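A sketch of the forward and inverse mappings together (random stand-in data again; with all eigenvectors kept, the reconstruction is exact):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(10, 2)) @ np.array([[1.0, 0.8], [0.0, 0.5]])
mean = data.mean(axis=0)
adjusted = data - mean

eigvals, eigvecs = np.linalg.eigh(np.cov(adjusted, rowvar=False))
row_feature_vector = eigvecs[:, np.argsort(eigvals)[::-1]].T   # eigenvectors as rows

transformed = row_feature_vector @ adjusted.T

# Invert: for orthonormal eigenvectors, RowFeatureVector^-1 = RowFeatureVector^T.
row_data_adjust = row_feature_vector.T @ transformed
row_data_original = row_data_adjust.T + mean
print(np.allclose(row_data_original, data))   # True: all components were kept
```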
Example: Linear discrimination using PCA for face recognition

1. Normalize intensities.
2. Raw features are pixel intensity values (2061 features).

The mean image over the M training images is

  \bar{x} = \frac{1}{M} \sum_{i=1}^{M} x_i
The eigenfaces encode the principal sources of variation
in the dataset (e.g., absence/presence of facial hair, skin tone,
glasses, etc.).
Linear discrimination
(e.g., glasses versus no glasses,
or male versus female)
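A sketch of the mean image and eigenface computation on random stand-in "images"; real face data is not reproduced here, and the use of SVD (rather than an explicit covariance matrix) is an implementation choice, not something stated on the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
M, D = 40, 2061                       # M face images, D pixel-intensity features
faces = rng.random((M, D))            # stand-in for the normalized face images

mean_face = faces.mean(axis=0)        # the 1/M sum of all training images
centered = faces - mean_face

# Eigenfaces = principal components of the centered images
# (computed via SVD, which avoids forming the D x D covariance matrix).
_, s, Vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = Vt[:10]                  # top 10 eigenfaces, each of length D
weights = centered @ eigenfaces.T     # each face described by its 10 eigenface coordinates
print(eigenfaces.shape, weights.shape)
```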