
Linear Classification

Reading: Chapter 4 of C. Bishop's book, Sections 4.1.1, 4.1.4, 4.1.7, on electronic reserve at the PSU library.

Linear Discriminant Functions

Example: Boolean functions

Notion of decision surfaces

Linear discriminant function: assuming the data is D-dimensional, define (D-1)-dimensional hyperplanes to classify the data into classes.

Linear Discriminant

[Figures (three): Feature 1 vs. Feature 2 plots illustrating linear discriminants (separating lines) for two-class data sets.]

Example where a line won't work?

[Figure: Feature 1 vs. Feature 2 plot of a two-class data set that cannot be separated by a single line.]
Perceptrons

Precursors to neural networks.

Discriminant function:

    y(x) = f(w^T x + w_0) = f(w_0 + w_1 x_1 + ... + w_D x_D)

w_0 is called the bias. (Equivalently, -w_0 is sometimes called the threshold.)

Classification:

    y(x) = sgn(w_1 x_1 + w_2 x_2 + ... + w_D x_D + w_0)

where sgn(y) = -1 if y < 0, and +1 otherwise.

Geometry of the perceptron

The decision boundary is a hyperplane. In 2-d:

    w_1 x_1 + w_2 x_2 + w_0 = 0,

i.e.,

    x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}

[Figure: the separating line plotted in Feature 1 vs. Feature 2 space.]
In-class exercise

(a) Find weights for a perceptron that separates true and false in x1 x2. Sketch the separation line defined by this discriminant.

(b) Do the same, but for x1 x2.

To simplify notation, assume a dummy coordinate (or attribute) x_0 = 1. Then we can write:

    y(x) = sgn(w^T x) = sgn(w_0 x_0 + w_1 x_1 + ... + w_D x_D)

We can generalize the perceptron to cases where we project the data points x_n into a feature space, \phi(x_n):

    y(x) = sgn(w^T \phi(x))
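To make the notation concrete, here is a minimal Python/NumPy sketch of a perceptron that classifies a point through an assumed feature map \phi. The particular map below (a dummy coordinate plus the raw and squared inputs) and the weight values are illustrative choices, not from the slides:

```python
import numpy as np

def phi(x):
    """An illustrative feature map: a dummy coordinate x0 = 1, the raw inputs, and squared terms."""
    return np.concatenate(([1.0], x, x ** 2))

def classify(w, x):
    """y(x) = sgn(w^T phi(x)); sgn returns +1 for non-negative activations, -1 otherwise."""
    return 1 if np.dot(w, phi(x)) >= 0 else -1

x = np.array([0.5, -1.0])
w = np.array([-0.5, 1.0, 1.0, 0.2, 0.2])   # one weight per feature-space dimension (5 here)
print(classify(w, x))
```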
Notation

Let S = {(x_n, t_n): n = 1, 2, ..., m} be a training set.

Note: x_n is a vector of inputs, and t_n \in \{+1, -1\} for binary classification, t_n \in \mathbb{R} for regression.

Output o:

    o = sgn\Big( \sum_{j=0}^{D} w_j x_j \Big) = sgn(w^T x)

Error of a perceptron on a single training example (x_n, t_n):

    E_n = \frac{1}{2} (t^n - o^n)^2

Example

S = {((0,0), -1), ((0,1), 1)}

Let w = (w_0, w_1, w_2) = (0.1, 0.1, -0.3).

What is E_1? What is E_2?

[Figure: a perceptron with inputs +1 (the bias input), x_1, x_2, weights w_0, w_1, w_2, and output o.]
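A minimal Python/NumPy sketch of how E_1 and E_2 could be computed for the data set and weights given above, using a dummy coordinate x_0 = 1 for the bias:

```python
import numpy as np

# Training set S = {((0,0), -1), ((0,1), +1)} and weights (w0, w1, w2) = (0.1, 0.1, -0.3)
S = [(np.array([0.0, 0.0]), -1), (np.array([0.0, 1.0]), +1)]
w = np.array([0.1, 0.1, -0.3])   # w[0] is the bias weight w0

def perceptron_output(w, x):
    """Compute o = sgn(w^T x) with a dummy coordinate x0 = 1 prepended."""
    x_aug = np.concatenate(([1.0], x))
    return 1 if np.dot(w, x_aug) >= 0 else -1

for n, (x, t) in enumerate(S, start=1):
    o = perceptron_output(w, x)
    E = 0.5 * (t - o) ** 2
    print(f"E_{n} = {E}")   # both examples are misclassified here, so each E_n = 2
```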
How do we train a perceptron?

Gradient descent in weight space.

[Figure illustrating gradient descent in weight space; from T. M. Mitchell, Machine Learning.]

Perceptron learning algorithm

Start with random weights w = (w_1, w_2, ..., w_D).

Do gradient descent in weight space in order to minimize the error E: given the error E, we want to modify the weights w so as to take a step in the direction of steepest descent.
Gradient descent

We want to find w so as to minimize the sum-squared error:

    E(w) = \frac{1}{2} \sum_{n=1}^{m} (t^n - o^n)^2

To minimize, take the derivative of E(w) with respect to w.

A vector derivative is called a gradient: \nabla E(w)

    \nabla E(w) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, ..., \frac{\partial E}{\partial w_D} \right]

Here is how we change each weight:

    w_j \leftarrow w_j + \Delta w_j

where

    \Delta w_j = -\eta \frac{\partial E}{\partial w_j}

and \eta is the learning rate.
Error function

    E(w) = \frac{1}{2} \sum_{n=1}^{m} (t^n - o^n)^2

E has to be differentiable, so the output function o also has to be differentiable.

Activation functions

[Figure: two activation functions plotted as output vs. activation: a step function jumping from -1 to +1, and a linear function.]

    o = sgn\Big( \sum_j w_j x_j + w_0 \Big)    (not differentiable)

    o = \sum_j w_j x_j + w_0    (differentiable)
    \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_n (t^n - o^n)^2    (1)

    = \frac{1}{2} \sum_n \frac{\partial}{\partial w_i} (t^n - o^n)^2    (2)

    = \frac{1}{2} \sum_n 2 (t^n - o^n) \frac{\partial}{\partial w_i} (t^n - o^n)    (3)

    = \sum_n (t^n - o^n) \frac{\partial}{\partial w_i} (t^n - w^T x^n)    (4)

    = \sum_n (t^n - o^n)(-x_i^n)    (5)

So,

    \Delta w_i = \eta \sum_n (t^n - o^n) x_i^n    (6)

This is called the perceptron learning rule.
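As a sanity check on this derivation, the analytic gradient for a linear output unit o = w^T x can be compared with a finite-difference estimate. The data, weights, and tolerances below are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 examples, 3 inputs
t = rng.choice([-1.0, 1.0], size=5)  # targets
w = rng.normal(size=3)

def E(w):
    """Sum-squared error E(w) = 1/2 * sum_n (t^n - o^n)^2 with a linear unit o = w^T x."""
    o = X @ w
    return 0.5 * np.sum((t - o) ** 2)

# Analytic gradient from the derivation: dE/dw_i = sum_n (t^n - o^n)(-x_i^n)
grad_analytic = -(t - X @ w) @ X

# Finite-difference estimate of the same gradient
eps = 1e-6
grad_numeric = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps) for e in np.eye(3)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # expected: True
```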

Problem with true gradient descent: the search process can land in a local optimum.

Common approach to this: use stochastic gradient descent. Instead of doing the weight update after all training examples have been processed, do a weight update after each training example has been processed (i.e., after each perceptron output has been calculated).

Stochastic gradient descent approximates true gradient descent increasingly well as \eta \to 0.
Training a perceptron

1. Start with random weights, w = (w_1, w_2, ..., w_D).

2. Select a training example (x^n, t^n).

3. Run the perceptron with input x^n and weights w to obtain o.

4. Let \eta be the training rate (a user-set parameter). Now update each weight:

    w_i \leftarrow w_i + \Delta w_i,   where   \Delta w_i = \eta (t^n - o^n) x_i^n

5. Go to 2.
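A minimal sketch of this training loop in Python/NumPy; the data set, learning rate, initialization scale, and fixed number of epochs are illustrative assumptions rather than anything specified on the slides:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20, seed=0):
    """Stochastic perceptron training with the update  w_i <- w_i + eta*(t^n - o^n)*x_i^n.

    X: (N, D) array of inputs; t: (N,) array of targets in {-1, +1}.
    A dummy coordinate x0 = 1 is prepended so w[0] plays the role of the bias w0.
    """
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # add x0 = 1
    w = rng.normal(scale=0.1, size=X_aug.shape[1])     # 1. start with small random weights
    for _ in range(epochs):
        for x, target in zip(X_aug, t):                # 2. select a training example
            o = 1 if np.dot(w, x) >= 0 else -1         # 3. run the perceptron
            w += eta * (target - o) * x                # 4. apply the weight update
    return w

# Example usage on a tiny linearly separable problem (illustrative data):
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([-1, -1, -1, +1])
print(train_perceptron(X, t))
```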

Example (slightly modified)

S = {((0,0), -1), ((0,1), 1)}

Let w = (w_0, w_1, w_2) = (0.1, 0.1, -0.3).

[Figure: the same perceptron as before, with inputs +1 (bias), x_1, x_2, weights w_0, w_1, w_2, and output o.]

Perceptron weight updates:

New perceptron weights:
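One way to work through these updates with the rule from the previous slide, assuming (since the slide does not give one) a learning rate \eta = 0.1 and processing the two examples in order:

```python
import numpy as np

eta = 0.1                              # assumed learning rate (not specified in the slides)
w = np.array([0.1, 0.1, -0.3])         # (w0, w1, w2)
S = [(np.array([1.0, 0.0, 0.0]), -1),  # ((0,0), -1) with dummy x0 = 1 prepended
     (np.array([1.0, 0.0, 1.0]), +1)]  # ((0,1), +1)

for x, t in S:
    o = 1 if np.dot(w, x) >= 0 else -1
    w = w + eta * (t - o) * x          # Delta w_i = eta * (t - o) * x_i
    print(f"after example with target {t:+d}: w = {w}")
```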
1960s: Rosenblatt proved that the perceptron learning rule
converges to correct weights in a finite number of steps,
provided the training examples are linearly separable.

1969: Minsky and Papert proved that perceptrons cannot represent non-linearly separable target functions.

Fisher's linear discriminant

    y(x) = w^T x

Note: this projects x to one dimension.

We want to find a projection that maximizes the separation of the means of the two classes, while also giving a small variance within each class, to minimize class overlap.
Assume two classes, C_1 and C_2, with N_1 and N_2 instances, respectively. The means of the two classes are:

    m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n, \qquad m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n

Define the projected class mean as

    m_k' = w^T m_k.

Thus we want to choose w so as to maximize

    m_2' - m_1' = w^T (m_2 - m_1)

while minimizing the within-class variance.

Given our original projection,

    y(x) = w^T x,

we can write the variance of the y's within class C_k as

    s_k^2 = \sum_{n \in C_k} (y_n - m_k')^2,

where y_n is the value of y on data point n, which is a member of class C_k.
Total within-class variance in the data set is

    s_1^2 + s_2^2.

Define the Fisher criterion as

    J(w) = \frac{(m_2' - m_1')^2}{s_1^2 + s_2^2},

i.e., the ratio of between-class variance to within-class variance. We want to maximize this.

To maximize this:

Compute the between-class covariance matrix:

    S_B = (m_1 - m_2)(m_1 - m_2)^T

Compute the total within-class covariance matrix:

    S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T

Now we can write J(w) as follows:

    J(w) = \frac{w^T S_B w}{w^T S_W w}
    J(w) = \frac{w^T S_B w}{w^T S_W w}

Now we can maximize J(w) by differentiating the above expression.

Result: J(w) is maximized when

    (w^T S_B w) S_W w = (w^T S_W w) S_B w

Note that we only care about the direction of w, not its magnitude, since we can get an equivalent decision surface by multiplying w by any scalar. This is because

    w_0 x_0 + w_1 x_1 + ... + w_D x_D = 0

defines the same surface when all of the weights are multiplied by the same nonzero constant.

Thus we can simplify

    (w^T S_B w) S_W w = (w^T S_W w) S_B w

by (1) dropping the scalar factors w^T S_B w and w^T S_W w, and (2) noting that S_B w is always in the direction of (m_2 - m_1), to get

    S_W w \propto (m_2 - m_1),

or

    w \propto S_W^{-1} (m_2 - m_1).

Fisher's linear discriminant

For classification, define a threshold y_0 such that if y(x) < y_0 then x is in class C_1; otherwise C_2.

(More later on finding an optimal threshold.)
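A short NumPy sketch of computing the Fisher direction w \propto S_W^{-1}(m_2 - m_1); the two Gaussian blobs and the midpoint threshold are illustrative assumptions (the slides defer the choice of threshold):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return w proportional to S_W^{-1} (m2 - m1) for two classes of row vectors."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Total within-class covariance (scatter) matrix S_W
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # solves S_W w = (m2 - m1)
    return w / np.linalg.norm(w)        # only the direction matters

# Illustrative data: two Gaussian blobs
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(50, 2))

w = fisher_direction(X1, X2)
y0 = w @ (X1.mean(axis=0) + X2.mean(axis=0)) / 2   # assumed threshold: midpoint of projected means
print("w =", w, " threshold y0 =", y0)
```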
Historical note: who is Fisher's linear discriminant named for?

Feature selection/construction

Adding too many features (i.e., making the problem higher-dimensional) can decrease the performance of a classifier.

Notion of overfitting.
Linear classification in feature space

Principal components analysis (PCA)

Reading: L. I. Smith, A tutorial on principal components analysis (on the class website).

PCA is used to create high-level features in order to improve classification and reduce the dimensionality of the data without much loss of information.

It is used in machine learning, and in signal processing and image compression (among other things).
Background for PCA

Suppose the attributes are A_1 and A_2, and we have n training examples. The x's denote the values of A_1 and the y's denote the values of A_2 over the training examples.

Variance of an attribute:

    var(A_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
Covariance of two attributes:

    cov(A_1, A_2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

If the covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. If zero, the two dimensions are uncorrelated with each other.

Covariance matrix

Suppose we have n attributes, A_1, ..., A_n.

Covariance matrix:

    C^{n \times n} = (c_{i,j}), where c_{i,j} = cov(A_i, A_j)
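These definitions can be checked quickly in NumPy (the sample values are arbitrary; np.cov uses the same n - 1 denominator as above):

```python
import numpy as np

# Illustrative samples of two attributes A1 (x values) and A2 (y values)
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

n = len(x)
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

C = np.cov(np.vstack([x, y]))   # 2x2 covariance matrix, rows = attributes
print(var_x, cov_xy)
print(C)                        # C[0,0] == var_x, C[0,1] == C[1,0] == cov_xy
```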
Example covariance matrix for two attributes H and M:

    \begin{pmatrix} cov(H,H) & cov(H,M) \\ cov(M,H) & cov(M,M) \end{pmatrix}
    = \begin{pmatrix} var(H) & 104.5 \\ 104.5 & var(M) \end{pmatrix}
    = \begin{pmatrix} 47.7 & 104.5 \\ 104.5 & 370 \end{pmatrix}

Review of Matrix Algebra

Eigenvectors: let M be an n \times n matrix. v is an eigenvector of M if M v = \lambda v; \lambda is called the eigenvalue associated with v.

For any eigenvector v of M and scalar a,

    M (a v) = \lambda (a v)

Thus you can always choose eigenvectors of length 1:

    v_1^2 + ... + v_n^2 = 1

If M is symmetric (as a covariance matrix is), it has n such eigenvectors, and they can be chosen to be orthogonal to one another.

Thus the eigenvectors can be used as a new basis for an n-dimensional vector space.
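A quick NumPy check of these facts on a small symmetric example matrix (np.linalg.eigh is the eigenvector routine for symmetric matrices):

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # a symmetric example matrix

eigvals, eigvecs = np.linalg.eigh(M)    # columns of eigvecs are unit-length eigenvectors
v, lam = eigvecs[:, 0], eigvals[0]

print(np.allclose(M @ v, lam * v))                    # M v = lambda v  -> True
print(np.allclose(np.linalg.norm(v), 1))              # unit length     -> True
print(np.allclose(eigvecs[:, 0] @ eigvecs[:, 1], 0))  # orthogonal      -> True
```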
Principal Components Analysis (PCA)

1. Given the original data set S = {x_1, ..., x_k}, produce a new data set by subtracting the mean of each attribute A_i from every example.

[Table: the example data before and after mean-adjustment; the original attributes have means 1.81 and 1.91, and the mean-adjusted attributes have means 0 and 0.]
2. Calculate the covariance matrix of the mean-adjusted data (here a 2 \times 2 matrix over the attributes x and y).

3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix.

[Figure: the mean-adjusted data with the eigenvectors overlaid; the eigenvector with the largest eigenvalue traces the main linear pattern in the data.]
4. Order the eigenvectors by eigenvalue, highest to lowest:

    v_1 = \begin{pmatrix} .677873399 \\ .735178956 \end{pmatrix}, \quad \lambda_1 = 1.28402771

    v_2 = \begin{pmatrix} -.735178956 \\ .677873399 \end{pmatrix}, \quad \lambda_2 = .0490833989

In general, you get n components. To reduce dimensionality to p, ignore the n - p components at the bottom of the list.

Construct the new feature vector:

    FeatureVector = (v_1, v_2, ..., v_p)

    FeatureVector1 = \begin{pmatrix} .677873399 & -.735178956 \\ .735178956 & .677873399 \end{pmatrix}

or the reduced-dimension feature vector:

    FeatureVector2 = \begin{pmatrix} .677873399 \\ .735178956 \end{pmatrix}
5. Derive the new data set:

    TransformedData = RowFeatureVector × RowDataAdjust

where RowDataAdjust is the transpose of the mean-adjusted data (one data point per column) and RowFeatureVector has the chosen eigenvectors as its rows:

    RowFeatureVector1 = \begin{pmatrix} .677873399 & .735178956 \\ -.735178956 & .677873399 \end{pmatrix}

    RowFeatureVector2 = \begin{pmatrix} .677873399 & .735178956 \end{pmatrix}

    RowDataAdjust = \begin{pmatrix} .69 & -1.31 & .39 & .09 & 1.29 & .49 & .19 & -.81 & -.31 & -.71 \\ .49 & -1.21 & .99 & .29 & 1.09 & .79 & -.31 & -.81 & -.31 & -1.01 \end{pmatrix}

This gives the original data in terms of the chosen components (eigenvectors), that is, along these axes.

Intuition: we projected the data onto new axes that capture the strongest linear trends in the data set. Each transformed data point tells us how far it is above or below those trend lines.
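A NumPy sketch of steps 1 through 5 on the mean-adjusted data shown above; the routine itself is a generic PCA sketch, and only the example data values come from the slides:

```python
import numpy as np

# Mean-adjusted data from the slides: rows are attributes x and y, columns are examples
row_data_adjust = np.array([
    [ .69, -1.31,  .39,  .09, 1.29,  .49,  .19, -.81, -.31,  -.71],
    [ .49, -1.21,  .99,  .29, 1.09,  .79, -.31, -.81, -.31, -1.01]])

# Step 2: covariance matrix of the adjusted data
C = np.cov(row_data_adjust)

# Steps 3-4: unit eigenvectors, ordered by eigenvalue (highest first)
eigvals, eigvecs = np.linalg.eigh(C)          # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: RowFeatureVector has the chosen eigenvectors as rows; keep p = 1 component here
p = 1
row_feature_vector = eigvecs[:, :p].T
transformed_data = row_feature_vector @ row_data_adjust

print(eigvals)            # approximately 1.284 and 0.049, as on the slides
print(transformed_data)   # the data expressed along the first principal axis
```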
Reconstructing the original data

We did:

    TransformedData = RowFeatureVector × RowDataAdjust

so we can do

    RowDataAdjust = RowFeatureVector^{-1} × TransformedData = RowFeatureVector^{T} × TransformedData

(when all components are kept, the rows of RowFeatureVector are orthonormal, so its inverse is just its transpose; with fewer components kept, the transpose gives an approximate reconstruction), and

    RowDataOriginal = RowDataAdjust + OriginalMean
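Continuing with values from the slides, a small self-contained sketch of reconstructing one data point from its single kept component:

```python
import numpy as np

row_feature_vector = np.array([[.677873399, .735178956]])   # one principal component kept (from the slides)
adjusted_point = np.array([[.69], [.49]])                    # first mean-adjusted data point from the slides

# Forward: TransformedData = RowFeatureVector * RowDataAdjust
transformed = row_feature_vector @ adjusted_point

# Back: RowDataAdjust ~= RowFeatureVector^T * TransformedData (inverse = transpose for orthonormal rows)
reconstructed_adjusted = row_feature_vector.T @ transformed

# RowDataOriginal = RowDataAdjust + OriginalMean (attribute means 1.81 and 1.91 from the slides)
reconstructed_original = reconstructed_adjusted + np.array([[1.81], [1.91]])
print(reconstructed_original)   # approximate, since only one of the two components was kept
```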
Example: Linear discrimination using PCA for face recognition

1. Preprocessing: normalize the faces

   Make the images the same size

   Line them up with respect to the eyes

   Normalize the intensities
2. The raw features are the pixel intensity values (2061 features).

3. Each image is encoded as a vector \Gamma_i of these features.

4. Compute the mean face in the training set:

    \Psi = \frac{1}{M} \sum_{i=1}^{M} \Gamma_i

(From W. Zhao et al., Discriminant analysis of principal components for face recognition.)

Subtract the mean face from each face vector:

    \Phi_i = \Gamma_i - \Psi

Compute the covariance matrix C.

Compute the (unit) eigenvectors v_i of C.

Keep only the first K principal components (eigenvectors).
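A rough NumPy sketch of this pipeline with a random stand-in for the face images; image loading, alignment, and the usual trick of computing eigenvectors from the smaller M × M matrix are all omitted, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_pixels, K = 40, 256, 10             # illustrative sizes (the slides' example used 2061 pixel features)

faces = rng.random((M, n_pixels))        # stand-in for the normalized face vectors (one face per row)

mean_face = faces.mean(axis=0)           # the mean face
centered = faces - mean_face             # subtract the mean face from each face vector

C = np.cov(centered, rowvar=False)       # n_pixels x n_pixels covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
eigenfaces = eigvecs[:, np.argsort(eigvals)[::-1][:K]]   # keep the first K principal components

# Represent a face as K coefficients, i.e., as a linear combination of the eigenfaces
coeffs = centered[0] @ eigenfaces
print(coeffs.shape)                      # (K,)
```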
The eigenfaces encode the principal sources of variation in the data set (e.g., absence/presence of facial hair, skin tone, glasses, etc.).

We can represent any face as a linear combination of these basis faces.

Use this representation for:

Face recognition (e.g., Euclidean distance from known faces)

Linear discrimination (e.g., glasses versus no glasses, or male versus female)


PCA versus Linear Discriminant Analysis

PCA seeks a new basis that is good for dimensionality reduction.

LDA seeks a new basis that is good for discrimination.

Can apply PCA first, then LDA.

If the original data is linearly separable, the PCA-transformed data is also linearly separable.
