
Linear Classification

Reading: Chapter 4 of C. Bishop's book, Sections 4.1.1, 4.1.4, 4.1.7, on electronic reserve at the PSU library.

Linear Discriminant Functions

Example: Boolean functions

Notion of decision surfaces

Linear discriminant function: assuming the data is D-dimensional, define (D-1)-dimensional hyperplanes to classify the data into classes.

Linear Discriminant

[Figures (three): Feature 1 vs. Feature 2 plots illustrating linear discriminants (separating lines) for two-class data sets.]

Example where a line won't work?

[Figure: Feature 1 vs. Feature 2 plot of a two-class data set that cannot be separated by a single line.]
Perceptrons

Precursors to neural networks.

Discriminant function:

    y(x) = f(w^T x + w_0) = f(w_0 + w_1 x_1 + ... + w_D x_D)

w_0 is called the bias. (Equivalently, -w_0 is sometimes called the threshold.)

Classification:

    y(x) = sgn(w_1 x_1 + w_2 x_2 + ... + w_D x_D + w_0)

where sgn(y) = -1 if y < 0, and +1 otherwise.

Geometry of the perceptron

The decision boundary is a hyperplane. In 2-d:

    w_1 x_1 + w_2 x_2 + w_0 = 0,

i.e.,

    x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}

[Figure: the separating line plotted in Feature 1 vs. Feature 2 space.]
In-class exercise

(a) Find weights for a perceptron that separates true and false in x1 x2. Sketch the separation line defined by this discriminant.

(b) Do the same, but for x1 x2.

To simplify notation, assume a dummy coordinate (or attribute) x_0 = 1. Then we can write:

    y(x) = sgn(w^T x) = sgn(w_0 x_0 + w_1 x_1 + ... + w_D x_D)

We can generalize the perceptron to cases where we project the data points x_n into a feature space, \phi(x_n):

    y(x) = sgn(w^T \phi(x))
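To make the notation concrete, here is a minimal Python/NumPy sketch of a perceptron that classifies a point through an assumed feature map \phi. The particular map below (a dummy coordinate plus the raw and squared inputs) and the weight values are illustrative choices, not from the slides:

```python
import numpy as np

def phi(x):
    """An illustrative feature map: a dummy coordinate x0 = 1, the raw inputs, and squared terms."""
    return np.concatenate(([1.0], x, x ** 2))

def classify(w, x):
    """y(x) = sgn(w^T phi(x)); sgn returns +1 for non-negative activations, -1 otherwise."""
    return 1 if np.dot(w, phi(x)) >= 0 else -1

x = np.array([0.5, -1.0])
w = np.array([-0.5, 1.0, 1.0, 0.2, 0.2])   # one weight per feature-space dimension (5 here)
print(classify(w, x))
```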
Notation

Let S = {(x_n, t_n): n = 1, 2, ..., m} be a training set.

Note: x_n is a vector of inputs, and t_n \in \{+1, -1\} for binary classification, t_n \in \mathbb{R} for regression.

Output o:

    o = sgn\Big( \sum_{j=0}^{D} w_j x_j \Big) = sgn(w^T x)

Error of a perceptron on a single training example (x_n, t_n):

    E_n = \frac{1}{2} (t^n - o^n)^2

Example

S = {((0,0), -1), ((0,1), 1)}

Let w = (w_0, w_1, w_2) = (0.1, 0.1, -0.3).

What is E_1? What is E_2?

[Figure: a perceptron with inputs +1 (the bias input), x_1, x_2, weights w_0, w_1, w_2, and output o.]
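A minimal Python/NumPy sketch of how E_1 and E_2 could be computed for the data set and weights given above, using a dummy coordinate x_0 = 1 for the bias:

```python
import numpy as np

# Training set S = {((0,0), -1), ((0,1), +1)} and weights (w0, w1, w2) = (0.1, 0.1, -0.3)
S = [(np.array([0.0, 0.0]), -1), (np.array([0.0, 1.0]), +1)]
w = np.array([0.1, 0.1, -0.3])   # w[0] is the bias weight w0

def perceptron_output(w, x):
    """Compute o = sgn(w^T x) with a dummy coordinate x0 = 1 prepended."""
    x_aug = np.concatenate(([1.0], x))
    return 1 if np.dot(w, x_aug) >= 0 else -1

for n, (x, t) in enumerate(S, start=1):
    o = perceptron_output(w, x)
    E = 0.5 * (t - o) ** 2
    print(f"E_{n} = {E}")   # both examples are misclassified here, so each E_n = 2
```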
How do we train a perceptron?

Gradient descent in weight space.

[Figure illustrating gradient descent in weight space; from T. M. Mitchell, Machine Learning.]

Perceptron learning algorithm

Start with random weights w = (w_1, w_2, ..., w_D).

Do gradient descent in weight space in order to minimize the error E: given the error E, we want to modify the weights w so as to take a step in the direction of steepest descent.
Gradient descent

We want to find w so as to minimize the sum-squared error:

    E(w) = \frac{1}{2} \sum_{n=1}^{m} (t^n - o^n)^2

To minimize, take the derivative of E(w) with respect to w.

A vector derivative is called a gradient: \nabla E(w)

    \nabla E(w) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, ..., \frac{\partial E}{\partial w_D} \right]

Here is how we change each weight:

    w_j \leftarrow w_j + \Delta w_j

where

    \Delta w_j = -\eta \frac{\partial E}{\partial w_j}

and \eta is the learning rate.
Error function

    E(w) = \frac{1}{2} \sum_{n=1}^{m} (t^n - o^n)^2

E has to be differentiable, so the output function o also has to be differentiable.

Activation functions

[Figure: two activation functions plotted as output vs. activation: a step function jumping from -1 to +1, and a linear function.]

    o = sgn\Big( \sum_j w_j x_j + w_0 \Big)    (not differentiable)

    o = \sum_j w_j x_j + w_0    (differentiable)
    \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_n (t^n - o^n)^2    (1)

    = \frac{1}{2} \sum_n \frac{\partial}{\partial w_i} (t^n - o^n)^2    (2)

    = \frac{1}{2} \sum_n 2 (t^n - o^n) \frac{\partial}{\partial w_i} (t^n - o^n)    (3)

    = \sum_n (t^n - o^n) \frac{\partial}{\partial w_i} (t^n - w^T x^n)    (4)

    = \sum_n (t^n - o^n)(-x_i^n)    (5)

So,

    \Delta w_i = \eta \sum_n (t^n - o^n) x_i^n    (6)

This is called the perceptron learning rule.
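As a sanity check on this derivation, the analytic gradient for a linear output unit o = w^T x can be compared with a finite-difference estimate. The data, weights, and tolerances below are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 examples, 3 inputs
t = rng.choice([-1.0, 1.0], size=5)  # targets
w = rng.normal(size=3)

def E(w):
    """Sum-squared error E(w) = 1/2 * sum_n (t^n - o^n)^2 with a linear unit o = w^T x."""
    o = X @ w
    return 0.5 * np.sum((t - o) ** 2)

# Analytic gradient from the derivation: dE/dw_i = sum_n (t^n - o^n)(-x_i^n)
grad_analytic = -(t - X @ w) @ X

# Finite-difference estimate of the same gradient
eps = 1e-6
grad_numeric = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps) for e in np.eye(3)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # expected: True
```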

Problem with true gradient descent: the search process can land in a local optimum.

Common approach to this: use stochastic gradient descent. Instead of doing the weight update after all training examples have been processed, do a weight update after each training example has been processed (i.e., after each perceptron output has been calculated).

Stochastic gradient descent approximates true gradient descent increasingly well as \eta \to 0.
Training a perceptron

1. Start with random weights, w = (w_1, w_2, ..., w_D).

2. Select a training example (x^n, t^n).

3. Run the perceptron with input x^n and weights w to obtain o.

4. Let \eta be the training rate (a user-set parameter). Now update each weight:

    w_i \leftarrow w_i + \Delta w_i,   where   \Delta w_i = \eta (t^n - o^n) x_i^n

5. Go to 2.
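A minimal sketch of this training loop in Python/NumPy; the data set, learning rate, initialization scale, and fixed number of epochs are illustrative assumptions rather than anything specified on the slides:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20, seed=0):
    """Stochastic perceptron training with the update  w_i <- w_i + eta*(t^n - o^n)*x_i^n.

    X: (N, D) array of inputs; t: (N,) array of targets in {-1, +1}.
    A dummy coordinate x0 = 1 is prepended so w[0] plays the role of the bias w0.
    """
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # add x0 = 1
    w = rng.normal(scale=0.1, size=X_aug.shape[1])     # 1. start with small random weights
    for _ in range(epochs):
        for x, target in zip(X_aug, t):                # 2. select a training example
            o = 1 if np.dot(w, x) >= 0 else -1         # 3. run the perceptron
            w += eta * (target - o) * x                # 4. apply the weight update
    return w

# Example usage on a tiny linearly separable problem (illustrative data):
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([-1, -1, -1, +1])
print(train_perceptron(X, t))
```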

Example (slightly modified)

S = {((0,0), -1), ((0,1), 1)}

Let w = (w_0, w_1, w_2) = (0.1, 0.1, -0.3).

[Figure: the same perceptron as before, with inputs +1 (bias), x_1, x_2, weights w_0, w_1, w_2, and output o.]

Perceptron weight updates:

New perceptron weights:
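One way to work through these updates with the rule from the previous slide, assuming (since the slide does not give one) a learning rate \eta = 0.1 and processing the two examples in order:

```python
import numpy as np

eta = 0.1                              # assumed learning rate (not specified in the slides)
w = np.array([0.1, 0.1, -0.3])         # (w0, w1, w2)
S = [(np.array([1.0, 0.0, 0.0]), -1),  # ((0,0), -1) with dummy x0 = 1 prepended
     (np.array([1.0, 0.0, 1.0]), +1)]  # ((0,1), +1)

for x, t in S:
    o = 1 if np.dot(w, x) >= 0 else -1
    w = w + eta * (t - o) * x          # Delta w_i = eta * (t - o) * x_i
    print(f"after example with target {t:+d}: w = {w}")
```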
1960s: Rosenblatt proved that the perceptron learning rule
converges to correct weights in a finite number of steps,
provided the training examples are linearly separable.

1969: Minsky and Papert proved that perceptrons cannot represent non-linearly separable target functions.

Fisher's linear discriminant

    y(x) = w^T x

Note: this projects x to one dimension.

We want to find a projection that maximizes the separation of the means of the two classes, while also giving a small variance within each class, to minimize class overlap.
Assume two classes, C_1 and C_2, with N_1 and N_2 instances, respectively. The means of the two classes are:

    m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n, \qquad m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n

Define the projected class mean as

    m_k' = w^T m_k.

Thus we want to choose w so as to maximize

    m_2' - m_1' = w^T (m_2 - m_1)

while minimizing the within-class variance.

Given our original projection,

    y(x) = w^T x,

we can write the variance of the y's within class C_k as

    s_k^2 = \sum_{n \in C_k} (y_n - m_k')^2,

where y_n is the value of y on data point n, which is a member of class C_k.
Total within-class variance in the data set is

    s_1^2 + s_2^2.

Define the Fisher criterion as

    J(w) = \frac{(m_2' - m_1')^2}{s_1^2 + s_2^2},

i.e., the ratio of between-class variance to within-class variance. We want to maximize this.

To maximize this:

Compute the between-class covariance matrix:

    S_B = (m_1 - m_2)(m_1 - m_2)^T

Compute the total within-class covariance matrix:

    S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T

Now we can write J(w) as follows:

    J(w) = \frac{w^T S_B w}{w^T S_W w}
    J(w) = \frac{w^T S_B w}{w^T S_W w}

Now we can maximize J(w) by differentiating the above expression.

Result: J(w) is maximized when

    (w^T S_B w) S_W w = (w^T S_W w) S_B w

Note that we only care about the direction of w, not its magnitude, since we can get an equivalent decision surface by multiplying w by any scalar. This is because

    w_0 x_0 + w_1 x_1 + ... + w_D x_D = 0

defines the same surface when all of the weights are multiplied by the same nonzero constant.

Thus we can simplify

    (w^T S_B w) S_W w = (w^T S_W w) S_B w

by (1) dropping the scalar factors w^T S_B w and w^T S_W w, and (2) noting that S_B w is always in the direction of (m_2 - m_1), to get

    S_W w \propto (m_2 - m_1),

or

    w \propto S_W^{-1} (m_2 - m_1).

Fisher's linear discriminant

For classification, define a threshold y_0 such that if y(x) < y_0 then x is in class C_1; otherwise C_2.

(More later on finding an optimal threshold.)
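A short NumPy sketch of computing the Fisher direction w \propto S_W^{-1}(m_2 - m_1); the two Gaussian blobs and the midpoint threshold are illustrative assumptions (the slides defer the choice of threshold):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return w proportional to S_W^{-1} (m2 - m1) for two classes of row vectors."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Total within-class covariance (scatter) matrix S_W
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # solves S_W w = (m2 - m1)
    return w / np.linalg.norm(w)        # only the direction matters

# Illustrative data: two Gaussian blobs
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(50, 2))

w = fisher_direction(X1, X2)
y0 = w @ (X1.mean(axis=0) + X2.mean(axis=0)) / 2   # assumed threshold: midpoint of projected means
print("w =", w, " threshold y0 =", y0)
```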
Historical note: who is Fisher's linear discriminant named for?

Feature selection/construction

Adding too many features (i.e., making the problem higher-dimensional) can decrease the performance of a classifier.

Notion of overfitting.
Linear classification in feature space

Principal components analysis (PCA)

Reading: L. I. Smith, A tutorial on principal components analysis (on the class website).

PCA is used to create high-level features in order to improve classification and reduce the dimensionality of the data without much loss of information.

It is used in machine learning, and in signal processing and image compression (among other things).
Background for PCA

Suppose the attributes are A_1 and A_2, and we have n training examples. The x's denote the values of A_1 and the y's denote the values of A_2 over the training examples.

Variance of an attribute:

    var(A_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
Covariance of two attributes:

    cov(A_1, A_2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

If the covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. If zero, the two dimensions are uncorrelated with each other.

Covariance matrix

Suppose we have n attributes, A_1, ..., A_n.

Covariance matrix:

    C^{n \times n} = (c_{i,j}), where c_{i,j} = cov(A_i, A_j)
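These definitions can be checked quickly in NumPy (the sample values are arbitrary; np.cov uses the same n - 1 denominator as above):

```python
import numpy as np

# Illustrative samples of two attributes A1 (x values) and A2 (y values)
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

n = len(x)
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

C = np.cov(np.vstack([x, y]))   # 2x2 covariance matrix, rows = attributes
print(var_x, cov_xy)
print(C)                        # C[0,0] == var_x, C[0,1] == C[1,0] == cov_xy
```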
Example covariance matrix for two attributes H and M:

    \begin{pmatrix} cov(H,H) & cov(H,M) \\ cov(M,H) & cov(M,M) \end{pmatrix}
    = \begin{pmatrix} var(H) & 104.5 \\ 104.5 & var(M) \end{pmatrix}
    = \begin{pmatrix} 47.7 & 104.5 \\ 104.5 & 370 \end{pmatrix}

Review of Matrix Algebra

Eigenvectors: let M be an n \times n matrix. v is an eigenvector of M if M v = \lambda v; \lambda is called the eigenvalue associated with v.

For any eigenvector v of M and scalar a,

    M (a v) = \lambda (a v)

Thus you can always choose eigenvectors of length 1:

    v_1^2 + ... + v_n^2 = 1

If M is symmetric (as a covariance matrix is), it has n such eigenvectors, and they can be chosen to be orthogonal to one another.

Thus the eigenvectors can be used as a new basis for an n-dimensional vector space.
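A quick NumPy check of these facts on a small symmetric example matrix (np.linalg.eigh is the eigenvector routine for symmetric matrices):

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # a symmetric example matrix

eigvals, eigvecs = np.linalg.eigh(M)    # columns of eigvecs are unit-length eigenvectors
v, lam = eigvecs[:, 0], eigvals[0]

print(np.allclose(M @ v, lam * v))                    # M v = lambda v  -> True
print(np.allclose(np.linalg.norm(v), 1))              # unit length     -> True
print(np.allclose(eigvecs[:, 0] @ eigvecs[:, 1], 0))  # orthogonal      -> True
```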
Principal Components Analysis (PCA)

1. Given the original data set S = {x_1, ..., x_k}, produce a new data set by subtracting the mean of each attribute A_i from every example.

[Table: the example data before and after mean-adjustment; the original attributes have means 1.81 and 1.91, and the mean-adjusted attributes have means 0 and 0.]
2. Calculate the covariance matrix of the mean-adjusted data (here a 2 \times 2 matrix over the attributes x and y).

3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix.

[Figure: the mean-adjusted data with the eigenvectors overlaid; the eigenvector with the largest eigenvalue traces the main linear pattern in the data.]
4. Order the eigenvectors by eigenvalue, highest to lowest:

    v_1 = \begin{pmatrix} .677873399 \\ .735178956 \end{pmatrix}, \quad \lambda_1 = 1.28402771

    v_2 = \begin{pmatrix} -.735178956 \\ .677873399 \end{pmatrix}, \quad \lambda_2 = .0490833989

In general, you get n components. To reduce dimensionality to p, ignore the n - p components at the bottom of the list.

Construct the new feature vector:

    FeatureVector = (v_1, v_2, ..., v_p)

    FeatureVector1 = \begin{pmatrix} .677873399 & -.735178956 \\ .735178956 & .677873399 \end{pmatrix}

or the reduced-dimension feature vector:

    FeatureVector2 = \begin{pmatrix} .677873399 \\ .735178956 \end{pmatrix}
5. Derive the new data set:

    TransformedData = RowFeatureVector × RowDataAdjust

where RowDataAdjust is the transpose of the mean-adjusted data (one data point per column) and RowFeatureVector has the chosen eigenvectors as its rows:

    RowFeatureVector1 = \begin{pmatrix} .677873399 & .735178956 \\ -.735178956 & .677873399 \end{pmatrix}

    RowFeatureVector2 = \begin{pmatrix} .677873399 & .735178956 \end{pmatrix}

    RowDataAdjust = \begin{pmatrix} .69 & -1.31 & .39 & .09 & 1.29 & .49 & .19 & -.81 & -.31 & -.71 \\ .49 & -1.21 & .99 & .29 & 1.09 & .79 & -.31 & -.81 & -.31 & -1.01 \end{pmatrix}

This gives the original data in terms of the chosen components (eigenvectors), that is, along these axes.

Intuition: we projected the data onto new axes that capture the strongest linear trends in the data set. Each transformed data point tells us how far it is above or below those trend lines.
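A NumPy sketch of steps 1 through 5 on the mean-adjusted data shown above; the routine itself is a generic PCA sketch, and only the example data values come from the slides:

```python
import numpy as np

# Mean-adjusted data from the slides: rows are attributes x and y, columns are examples
row_data_adjust = np.array([
    [ .69, -1.31,  .39,  .09, 1.29,  .49,  .19, -.81, -.31,  -.71],
    [ .49, -1.21,  .99,  .29, 1.09,  .79, -.31, -.81, -.31, -1.01]])

# Step 2: covariance matrix of the adjusted data
C = np.cov(row_data_adjust)

# Steps 3-4: unit eigenvectors, ordered by eigenvalue (highest first)
eigvals, eigvecs = np.linalg.eigh(C)          # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: RowFeatureVector has the chosen eigenvectors as rows; keep p = 1 component here
p = 1
row_feature_vector = eigvecs[:, :p].T
transformed_data = row_feature_vector @ row_data_adjust

print(eigvals)            # approximately 1.284 and 0.049, as on the slides
print(transformed_data)   # the data expressed along the first principal axis
```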
Reconstructing the original data

We did:

    TransformedData = RowFeatureVector × RowDataAdjust

so we can do

    RowDataAdjust = RowFeatureVector^{-1} × TransformedData = RowFeatureVector^{T} × TransformedData

(when all components are kept, the rows of RowFeatureVector are orthonormal, so its inverse is just its transpose; with fewer components kept, the transpose gives an approximate reconstruction), and

    RowDataOriginal = RowDataAdjust + OriginalMean
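Continuing with values from the slides, a small self-contained sketch of reconstructing one data point from its single kept component:

```python
import numpy as np

row_feature_vector = np.array([[.677873399, .735178956]])   # one principal component kept (from the slides)
adjusted_point = np.array([[.69], [.49]])                    # first mean-adjusted data point from the slides

# Forward: TransformedData = RowFeatureVector * RowDataAdjust
transformed = row_feature_vector @ adjusted_point

# Back: RowDataAdjust ~= RowFeatureVector^T * TransformedData (inverse = transpose for orthonormal rows)
reconstructed_adjusted = row_feature_vector.T @ transformed

# RowDataOriginal = RowDataAdjust + OriginalMean (attribute means 1.81 and 1.91 from the slides)
reconstructed_original = reconstructed_adjusted + np.array([[1.81], [1.91]])
print(reconstructed_original)   # approximate, since only one of the two components was kept
```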
Example: Linear discrimination using PCA for face recognition

1. Preprocessing: normalize the faces

   Make the images the same size

   Line them up with respect to the eyes

   Normalize the intensities
2. The raw features are the pixel intensity values (2061 features).

3. Each image is encoded as a vector \Gamma_i of these features.

4. Compute the mean face in the training set:

    \Psi = \frac{1}{M} \sum_{i=1}^{M} \Gamma_i

(From W. Zhao et al., Discriminant analysis of principal components for face recognition.)

Subtract the mean face from each face vector:

    \Phi_i = \Gamma_i - \Psi

Compute the covariance matrix C.

Compute the (unit) eigenvectors v_i of C.

Keep only the first K principal components (eigenvectors).
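A rough NumPy sketch of this pipeline with a random stand-in for the face images; image loading, alignment, and the usual trick of computing eigenvectors from the smaller M × M matrix are all omitted, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_pixels, K = 40, 256, 10             # illustrative sizes (the slides' example used 2061 pixel features)

faces = rng.random((M, n_pixels))        # stand-in for the normalized face vectors (one face per row)

mean_face = faces.mean(axis=0)           # the mean face
centered = faces - mean_face             # subtract the mean face from each face vector

C = np.cov(centered, rowvar=False)       # n_pixels x n_pixels covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
eigenfaces = eigvecs[:, np.argsort(eigvals)[::-1][:K]]   # keep the first K principal components

# Represent a face as K coefficients, i.e., as a linear combination of the eigenfaces
coeffs = centered[0] @ eigenfaces
print(coeffs.shape)                      # (K,)
```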
The eigenfaces encode the principal sources of variation in the data set (e.g., absence/presence of facial hair, skin tone, glasses, etc.).

We can represent any face as a linear combination of these basis faces.

Use this representation for:

Face recognition (e.g., Euclidean distance from known faces)

Linear discrimination (e.g., glasses versus no glasses, or male versus female)


PCA versus Linear Discriminant Analysis

PCA seeks a new basis that is good for dimensionality reduction.

LDA seeks a new basis that is good for discrimination.

Can apply PCA first, then LDA.

If the original data is linearly separable, the PCA-transformed data is also linearly separable.
