
Linear Classification

CSL465/603 - Fall 2016


Narayanan C Krishnan
ckn@iitrpr.ac.in

Outline
Linear Classification
Linear Regression as a Classifier
Linear Discriminant Analysis
Logistic Regression
Cost function
Maximum likelihood estimation
Multi-class logistic regression


Linear Classifier
A classifier partitions the input space into decision regions.
Linearly separable: the input space can be partitioned by a linear decision boundary.
[Figure: scatter plots over the input variables (X, X2, X3) illustrating decision regions, including a linearly separable case.]

Linear Regression as a Classifier


Start with the 2-class scenario: $y \in \{0, 1\}$.
Treat the output as if it were continuous and perform regression:
$f(\mathbf{x}) = \mathbf{x}^T \mathbf{w}$, where $\mathbf{w} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \left( \mathbf{x}_i^T \mathbf{w} - y_i \right)^2$
A new data point $\mathbf{x}$ can be classified as
$\hat{y}(\mathbf{x}) = \begin{cases} 0, & \text{if } f(\mathbf{x}) \le 0.5 \\ 1, & \text{if } f(\mathbf{x}) > 0.5 \end{cases}$
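A minimal NumPy sketch of this idea (illustrative, not from the slides): fit ordinary least squares to the 0/1 labels and threshold the fitted value at 0.5. The function names and the bias-column handling are assumptions of this sketch.

```python
import numpy as np

def fit_ls_classifier(X, y):
    """Least-squares fit to 0/1 labels. X: (N, d) inputs, y: (N,) labels in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)      # w = argmin ||Xb w - y||^2
    return w

def predict_ls_classifier(w, X):
    """Classify by thresholding the regression output at 0.5."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb @ w > 0.5).astype(int)
```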


Example
[Figure: linear regression fit to the 0/1 class labels plotted against the input X, with values on the vertical axis ranging from 0 to 1.]

Linear Regression of Indicators (1)

Extending linear regression for more than 2 classes.
Indicator matrix:
$y_{ik} = \begin{cases} 1, & \text{if } y_i = k \ (\mathbf{x}_i \text{ belongs to class } k) \\ 0, & \text{otherwise} \end{cases}$
For example (each row is the indicator vector of one data point):
$Y = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}$
$\hat{W} = (X^T X)^{-1} X^T Y$
$\hat{\mathbf{w}}_k = \arg\min_{\mathbf{w}_k} \sum_{i=1}^{N} \left( \mathbf{x}_i^T \mathbf{w}_k - y_{ik} \right)^2$
$\hat{y}(\mathbf{x}) = \arg\max_k \mathbf{x}^T \hat{\mathbf{w}}_k$

Linear Regression of Indicators (2)

Let $\mathbf{t}_k$ be the $k^{th}$ column of the identity matrix.
If a data point $\mathbf{x}_i$ belongs to class $k$, then $\mathbf{y}_i = \mathbf{t}_k$.
The linear model can be learned using the least squares approach:
$\hat{W} = \arg\min_{W} \sum_{i=1}^{N} \left\| \mathbf{x}_i^T W - \mathbf{y}_i^T \right\|^2$
For a new data point $\mathbf{x}$ the class label can be determined as
$\arg\min_k \left\| \mathbf{x}^T \hat{W} - \mathbf{t}_k^T \right\|^2$
Exercise: Show that both approaches are equivalent.
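A short NumPy sketch of the indicator-matrix approach (an illustration under the assumption that a bias column is added to X; the function names are hypothetical):

```python
import numpy as np

def fit_indicator_regression(X, y, K):
    """X: (N, d) inputs; y: (N,) integer labels in {0, ..., K-1}; K: number of classes."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add bias column
    Y = np.eye(K)[y]                                # (N, K) indicator matrix (one-hot rows)
    W = np.linalg.lstsq(Xb, Y, rcond=None)[0]       # W = (Xb^T Xb)^{-1} Xb^T Y
    return W

def predict_indicator_regression(W, X):
    """Assign each point to the class whose fitted indicator is largest."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xb @ W, axis=1)
```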



Decision Boundary in the Multi-Class Scenario

Linear regression learns $\mathbf{w}_k$ for each class $k$.
Characterization of the decision boundary between two classes $k$ and $l$:
If $\mathbf{x}$ is a data point in class $k$: $\mathbf{x}^T \mathbf{w}_k > \mathbf{x}^T \mathbf{w}_l$
If $\mathbf{x}$ is a data point in class $l$: $\mathbf{x}^T \mathbf{w}_k < \mathbf{x}^T \mathbf{w}_l$
So at the decision boundary: $\mathbf{x}^T \mathbf{w}_k = \mathbf{x}^T \mathbf{w}_l$, i.e., $\mathbf{x}^T (\mathbf{w}_k - \mathbf{w}_l) = 0$, a hyperplane in the input space.


Problem with Linear Regression as a Classifier

Consider the following 3-class dataset.


[Figure (Hastie et al., ESL Figs. 4.2 and 4.3, the masking problem): the data come from three classes in R2 and are easily separated by linear decision boundaries. The left plot shows the boundaries found by linear regression of the indicator response variables, where the middle class is completely masked (degree = 1 fit, error = 0.33); the right plot shows the boundaries found by linear discriminant analysis.]

Linear Discriminant Analysis (1)


Bayes classifier: predicted class label
$\arg\max_k P(y = k \mid \mathbf{x})$
Applying Bayes rule:
$P(y = k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid y = k)\, P(y = k)}{p(\mathbf{x})}$
Let $P(y = k) = \pi_k$.
Then the prediction task becomes
$\arg\max_k\; p(\mathbf{x} \mid y = k)\, \pi_k$


Linear Discriminant Analysis (2)


Assume that the data within each class are normally distributed:
$p(\mathbf{x} \mid y = k) \sim \mathcal{N}(\boldsymbol{\mu}_k, \Sigma_k)$
Furthermore:
Each class has a different mean $\boldsymbol{\mu}_k$
Each class has a common covariance matrix $\Sigma$
$p(\mathbf{x} \mid y = k) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right)$


Linear Discriminant Function


$\arg\max_k\; p(\mathbf{x} \mid y = k)\, \pi_k \;=\; \arg\max_k\; \log\!\big( p(\mathbf{x} \mid y = k)\, \pi_k \big)$
Substituting the Gaussian density and dropping terms that do not depend on $k$ gives the linear discriminant function $\delta_k(\mathbf{x}) = \mathbf{x}^T \Sigma^{-1} \boldsymbol{\mu}_k - \tfrac{1}{2} \boldsymbol{\mu}_k^T \Sigma^{-1} \boldsymbol{\mu}_k + \log \pi_k$ (see the next slide).


Estimating the Linear Discriminant

Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, estimate the parameters $\pi_k$, $\boldsymbol{\mu}_k$, $\Sigma$:
$\hat{\pi}_k = \dfrac{N_k}{N}$: the proportion of observations in class $k$
$\hat{\boldsymbol{\mu}}_k$: sample mean of the points in class $k$
$\hat{\Sigma}$: pooled sample covariance matrix
$\hat{\Sigma} = \dfrac{1}{N - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T$
Discriminant function:
$\delta_k(\mathbf{x}) = \mathbf{x}^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_k - \tfrac{1}{2} \hat{\boldsymbol{\mu}}_k^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_k + \log \hat{\pi}_k$
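A compact NumPy sketch of these estimates and of classification with the discriminant function (an illustrative implementation; the function names are not from the slides):

```python
import numpy as np

def fit_lda(X, y, K):
    """Estimate pi_k, mu_k and the pooled covariance. X: (N, d), y: (N,) labels in {0,...,K-1}."""
    N, d = X.shape
    pi = np.array([np.mean(y == k) for k in range(K)])            # class proportions
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])     # class means, (K, d)
    Sigma = np.zeros((d, d))
    for k in range(K):
        Xc = X[y == k] - mu[k]
        Sigma += Xc.T @ Xc
    Sigma /= (N - K)                                              # pooled covariance
    return pi, mu, Sigma

def lda_predict(X, pi, mu, Sigma):
    """Pick argmax_k of delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    scores = X @ Sinv @ mu.T - 0.5 * np.sum((mu @ Sinv) * mu, axis=1) + np.log(pi)
    return np.argmax(scores, axis=1)
```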


Linear Decision Boundaries


Characterizing the decision boundary between two classes $k$ and $l$: the set of points where $\delta_k(\mathbf{x}) = \delta_l(\mathbf{x})$, i.e.,
$\mathbf{x}^T \hat{\Sigma}^{-1} (\hat{\boldsymbol{\mu}}_k - \hat{\boldsymbol{\mu}}_l) = \tfrac{1}{2} (\hat{\boldsymbol{\mu}}_k + \hat{\boldsymbol{\mu}}_l)^T \hat{\Sigma}^{-1} (\hat{\boldsymbol{\mu}}_k - \hat{\boldsymbol{\mu}}_l) + \log \dfrac{\hat{\pi}_l}{\hat{\pi}_k}$,
which is linear in $\mathbf{x}$.


Linear Decision Boundaries Example


[Figure (Hastie et al., ESL Fig. 4.5): The left panel shows three Gaussian distributions with the same covariance and different means, including the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former).]

Linear Discriminant Analysis Extensions

Fundamental assumption: data points in each class follow a Gaussian distribution, with a common covariance.
If each class has a separate covariance: quadratic discriminant analysis (QDA).
Regularized discriminant analysis combines both LDA and QDA.
Reduced-rank linear discriminant analysis: project the data to a lower-dimensional subspace before performing LDA (Fisher's discriminant; covered after the midsem).

Logistic Regression
Linear regression results in poorly fit models for classification.
Given $y_i \in \{0, 1\}$, we want the output to also be in the range $[0, 1]$.
Use the logistic (sigmoid) function:
$h_{\mathbf{w}}(\mathbf{x}) = \dfrac{1}{1 + e^{-\mathbf{x}^T \mathbf{w}}}$


Interpretation of the Output


$h_{\mathbf{w}}(\mathbf{x})$: estimated probability that $y = 1$ on input $\mathbf{x}$ (the posterior probability).
Example: if $\mathbf{x} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$ and $h_{\mathbf{w}}(\mathbf{x}) = 0.7$, there is a 70% chance that the tumor is malignant.
The posterior probability that $y = 1$ on input $\mathbf{x}$ is parameterized by $\mathbf{w}$.
For a two-class scenario:
$P(y = 1 \mid \mathbf{x}; \mathbf{w}) + P(y = 0 \mid \mathbf{x}; \mathbf{w}) = 1$


Characterization of the Decision Boundary (1)

$h_{\mathbf{w}}(\mathbf{x}) = \dfrac{1}{1 + e^{-\mathbf{x}^T \mathbf{w}}}$
Predict class 1 if $h_{\mathbf{w}}(\mathbf{x}) \ge 0.5$, i.e., $\mathbf{x}^T \mathbf{w} \ge 0$.
Predict class 0 if $h_{\mathbf{w}}(\mathbf{x}) < 0.5$, i.e., $\mathbf{x}^T \mathbf{w} < 0$.


Characterization of the Decision Boundary (2)

$\mathbf{x}^T \mathbf{w} = w_0 + w_1 x_1 + w_2 x_2$
$\mathbf{w} = (-3, 1, 1)^T$
Predict $y = 1$ if
$-3 + x_1 + x_2 \ge 0$
How do we estimate $\mathbf{w}$?


Loss Function (1)


Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{0, 1\}$.
Least mean square loss function?


Loss Function (2)


Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{0, 1\}$.
Least mean square loss function? Non-convex in $\mathbf{w}$.
Instead use:
$\text{Cost}\big(h_{\mathbf{w}}(\mathbf{x}), y\big) = \begin{cases} -\log\big(1 - h_{\mathbf{w}}(\mathbf{x})\big), & \text{if } y = 0 \\ -\log h_{\mathbf{w}}(\mathbf{x}), & \text{if } y = 1 \end{cases}$

Logistic Regression Loss Function

$J(\mathbf{w}) = -\sum_{i=1}^{N} \left[ y_i \log h_{\mathbf{w}}(\mathbf{x}_i) + (1 - y_i) \log\big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big) \right]$
Minimize $J(\mathbf{w})$.
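A small NumPy sketch of this loss (assuming X already contains a bias column; the epsilon guard is an implementation detail, not part of the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """J(w) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ w)
    eps = 1e-12                                  # avoid log(0)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```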


Logistic Regression - Gradient Descent Parameter Update

Repeat till convergence:
$\dfrac{\partial J(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} \big( h_{\mathbf{w}}(\mathbf{x}_i) - y_i \big)\, x_{ij}$
In matrix form: $\nabla J(\mathbf{w}) = X^T \big( \mathbf{h} - \mathbf{y} \big)$, where $h_i = h_{\mathbf{w}}(\mathbf{x}_i)$
$\mathbf{w} := \mathbf{w} - \alpha\, \nabla J(\mathbf{w})$
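A minimal batch gradient-descent sketch of this update (the learning rate and iteration count are illustrative choices, and X is assumed to include a bias column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, iters=1000):
    """Repeat w := w - alpha * X^T (h - y). X: (N, d), y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y)        # sum_i (h(x_i) - y_i) x_i
        w -= alpha * grad
    return w
```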


Logistic Regression - Newton-Raphson Method (1)

Newton's method for finding a zero of a function $f$:
$\theta := \theta - \dfrac{f(\theta)}{f'(\theta)}$
This method has a natural interpretation: we approximate $f$ by the linear function that is tangent to $f$ at the current guess $\theta$, solve for where that linear function equals zero, and let the next guess for $\theta$ be that point.

[Figure: Newton's method in action, three panels of $f(x)$ versus $x$.]
In the leftmost figure we see the function $f$ plotted along with the line $y = 0$. We are trying to find $\theta$ so that $f(\theta) = 0$; the value of $\theta$ that achieves this is about 1.3. Suppose we initialized the algorithm with $\theta = 4.5$. Newton's method then fits a straight line tangent to $f$ at $\theta = 4.5$ and solves for where that line evaluates to 0 (middle figure). This gives the next guess for $\theta$, which is about 2.8. The rightmost figure shows the result of running one more iteration, which updates $\theta$ to about 1.8. After a few more iterations we rapidly approach $\theta = 1.3$.

For logistic regression: find the zero of $\nabla J(\mathbf{w})$.

Logistic Regression - Newton-Raphson Method (2)

Hessian of $J(\mathbf{w})$:
$H_{jl} = \dfrac{\partial^2 J(\mathbf{w})}{\partial w_j\, \partial w_l} = \sum_{i=1}^{N} h_{\mathbf{w}}(\mathbf{x}_i)\big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big)\, x_{ij}\, x_{il}$
In matrix form: $H = X^T R X$, where $R$ is a diagonal matrix with diagonal elements $R_{ii} = h_{\mathbf{w}}(\mathbf{x}_i)\big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big)$
Parameter update:
$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - H^{-1} \nabla J(\mathbf{w}^{(t)})$

Logistic Regression - Newton-Raphson Method (3)

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - H^{-1} \nabla J(\mathbf{w}^{(t)})$
$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (X^T R X)^{-1} X^T \big( \mathbf{h} - \mathbf{y} \big)$
Also called the Iteratively Reweighted Least Squares (IRLS) algorithm.
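A sketch of the Newton-Raphson / IRLS update in NumPy (illustrative; a real implementation would add a convergence check and avoid forming the dense diagonal matrix for large N):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    """Repeat w := w - (X^T R X)^{-1} X^T (h - y), with R_ii = h(x_i)(1 - h(x_i))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ w)
        R = np.diag(h * (1.0 - h))               # diagonal weight matrix
        grad = X.T @ (h - y)                     # gradient of J(w)
        H = X.T @ R @ X                          # Hessian of J(w)
        w -= np.linalg.solve(H, grad)            # Newton step
    return w
```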

Logistic Regression via Maximum Likelihood Estimation (1)

Linear regression: MLE using a Gaussian distribution assumption.
Logistic regression: MLE using a Bernoulli distribution assumption.
Bernoulli distribution: the probability distribution of a random variable that takes value 1 with success probability $p$ and value 0 with failure probability $1 - p$.
Example: a coin toss.
Interpret $h_{\mathbf{w}}(\mathbf{x}_i)$ as the success probability that the label $y_i$ of $\mathbf{x}_i$ takes the value 1.

Logistic Regression via Maximum Likelihood Estimation (2)

$P(y_i \mid \mathbf{x}_i; \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x}_i)^{y_i} \big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big)^{1 - y_i}$
Taking the negative log of the likelihood over all $N$ points recovers the loss $J(\mathbf{w})$ above, so maximizing the likelihood is equivalent to minimizing $J(\mathbf{w})$.

Multi-class Logistic Regression (1)

One-vs-All Approach


Multi-class Logistic Regression (2)

One-vs-All Approach
Train a logistic regression classifier $h_{\mathbf{w}_k}(\mathbf{x})$ for each class $k$:
$h_{\mathbf{w}_k}(\mathbf{x}) = P(y = k \mid \mathbf{x}; \mathbf{w}_k)$
Predicted class:
$\arg\max_k\; h_{\mathbf{w}_k}(\mathbf{x})$
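A brief one-vs-all sketch (the binary trainer is passed in, e.g. the gradient-descent or Newton routine sketched earlier; the names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ova_train(X, y, K, train_binary):
    """Train one binary classifier per class on the 0/1 targets (y == k)."""
    return np.vstack([train_binary(X, (y == k).astype(float)) for k in range(K)])

def ova_predict(W, X):
    """W: (K, d), one weight vector per class; pick the most confident classifier."""
    return np.argmax(sigmoid(X @ W.T), axis=1)
```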


Summary
Linear regression as a classifier
Masking

Linear classifiers
Linear discriminants
Logistic regression

Sigmoid function
Loss function
Iterative parameter update
Maximum likelihood estimate

Multi-class logistic regression

