
Linear Classification

CSL465/603 - Fall 2016


Narayanan C Krishnan
ckn@iitrpr.ac.in

Outline
Linear Classification
Linear Regression as a Classifier
Linear Discriminant Analysis
Logistic Regression
Cost function
Maximum likelihood estimation
Multi-class logistic regression


Linear Classifier
A classifier partitions the input space into decision regions.
Linearly separable: the input space can be partitioned by a linear decision boundary.
[Figure: scatter plots over the input variables (X, X2, X3) illustrating decision regions, including a linearly separable case.]

Linear Regression as a Classifier


Start with the 2-class scenario: $y \in \{0, 1\}$.
Treat the output as if it were continuous and perform regression:
$f(\mathbf{x}) = \mathbf{x}^T \mathbf{w}$, where $\mathbf{w} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \left( \mathbf{x}_i^T \mathbf{w} - y_i \right)^2$
A new data point $\mathbf{x}$ can be classified as
$\hat{y}(\mathbf{x}) = \begin{cases} 0, & \text{if } f(\mathbf{x}) \le 0.5 \\ 1, & \text{if } f(\mathbf{x}) > 0.5 \end{cases}$
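A minimal NumPy sketch of this idea (illustrative, not from the slides): fit ordinary least squares to the 0/1 labels and threshold the fitted value at 0.5. The function names and the bias-column handling are assumptions of this sketch.

```python
import numpy as np

def fit_ls_classifier(X, y):
    """Least-squares fit to 0/1 labels. X: (N, d) inputs, y: (N,) labels in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)      # w = argmin ||Xb w - y||^2
    return w

def predict_ls_classifier(w, X):
    """Classify by thresholding the regression output at 0.5."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb @ w > 0.5).astype(int)
```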


Example
[Figure: linear regression fit to the 0/1 class labels plotted against the input X, with values on the vertical axis ranging from 0 to 1.]

Linear Regression of Indicators (1)

Extending linear regression for more than 2 classes.
Indicator matrix:
$y_{ik} = \begin{cases} 1, & \text{if } y_i = k \ (\mathbf{x}_i \text{ belongs to class } k) \\ 0, & \text{otherwise} \end{cases}$
For example (each row is the indicator vector of one data point):
$Y = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}$
$\hat{W} = (X^T X)^{-1} X^T Y$
$\hat{\mathbf{w}}_k = \arg\min_{\mathbf{w}_k} \sum_{i=1}^{N} \left( \mathbf{x}_i^T \mathbf{w}_k - y_{ik} \right)^2$
$\hat{y}(\mathbf{x}) = \arg\max_k \mathbf{x}^T \hat{\mathbf{w}}_k$

Linear Regression of Indicators (2)

Let $\mathbf{t}_k$ be the $k^{th}$ column of the identity matrix.
If a data point $\mathbf{x}_i$ belongs to class $k$, then $\mathbf{y}_i = \mathbf{t}_k$.
The linear model can be learned using the least squares approach:
$\hat{W} = \arg\min_{W} \sum_{i=1}^{N} \left\| \mathbf{x}_i^T W - \mathbf{y}_i^T \right\|^2$
For a new data point $\mathbf{x}$ the class label can be determined as
$\arg\min_k \left\| \mathbf{x}^T \hat{W} - \mathbf{t}_k^T \right\|^2$
Exercise: Show that both approaches are equivalent.
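A short NumPy sketch of the indicator-matrix approach (an illustration under the assumption that a bias column is added to X; the function names are hypothetical):

```python
import numpy as np

def fit_indicator_regression(X, y, K):
    """X: (N, d) inputs; y: (N,) integer labels in {0, ..., K-1}; K: number of classes."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add bias column
    Y = np.eye(K)[y]                                # (N, K) indicator matrix (one-hot rows)
    W = np.linalg.lstsq(Xb, Y, rcond=None)[0]       # W = (Xb^T Xb)^{-1} Xb^T Y
    return W

def predict_indicator_regression(W, X):
    """Assign each point to the class whose fitted indicator is largest."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xb @ W, axis=1)
```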



Decision Boundary in the Multi-Class Scenario

Linear regression learns $\mathbf{w}_k$ for each class $k$.
Characterization of the decision boundary between two classes $k$ and $l$:
If $\mathbf{x}$ is a data point in class $k$: $\mathbf{x}^T \mathbf{w}_k > \mathbf{x}^T \mathbf{w}_l$
If $\mathbf{x}$ is a data point in class $l$: $\mathbf{x}^T \mathbf{w}_k < \mathbf{x}^T \mathbf{w}_l$
So at the decision boundary: $\mathbf{x}^T \mathbf{w}_k = \mathbf{x}^T \mathbf{w}_l$, i.e., $\mathbf{x}^T (\mathbf{w}_k - \mathbf{w}_l) = 0$, a hyperplane in the input space.


Problem with Linear Regression as a Classifier

Consider the following 3-class dataset.


[Figure (Hastie et al., ESL Figs. 4.2 and 4.3, the masking problem): the data come from three classes in R2 and are easily separated by linear decision boundaries. The left plot shows the boundaries found by linear regression of the indicator response variables, where the middle class is completely masked (degree = 1 fit, error = 0.33); the right plot shows the boundaries found by linear discriminant analysis.]

Linear Discriminant Analysis (1)


Bayes classifier: predicted class label
$\arg\max_k P(y = k \mid \mathbf{x})$
Applying Bayes rule:
$P(y = k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid y = k)\, P(y = k)}{p(\mathbf{x})}$
Let $P(y = k) = \pi_k$.
Then the prediction task becomes
$\arg\max_k\; p(\mathbf{x} \mid y = k)\, \pi_k$


Linear Discriminant Analysis (2)


Assume that the data within each class are normally distributed:
$p(\mathbf{x} \mid y = k) \sim \mathcal{N}(\boldsymbol{\mu}_k, \Sigma_k)$
Furthermore:
Each class has a different mean $\boldsymbol{\mu}_k$
Each class has a common covariance matrix $\Sigma$
$p(\mathbf{x} \mid y = k) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right)$


Linear Discriminant Function


$\arg\max_k\; p(\mathbf{x} \mid y = k)\, \pi_k \;=\; \arg\max_k\; \log\!\big( p(\mathbf{x} \mid y = k)\, \pi_k \big)$
Substituting the Gaussian density and dropping terms that do not depend on $k$ gives the linear discriminant function $\delta_k(\mathbf{x}) = \mathbf{x}^T \Sigma^{-1} \boldsymbol{\mu}_k - \tfrac{1}{2} \boldsymbol{\mu}_k^T \Sigma^{-1} \boldsymbol{\mu}_k + \log \pi_k$ (see the next slide).


Estimating the Linear Discriminant

Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, estimate the parameters $\pi_k$, $\boldsymbol{\mu}_k$, $\Sigma$:
$\hat{\pi}_k = \dfrac{N_k}{N}$: the proportion of observations in class $k$
$\hat{\boldsymbol{\mu}}_k$: sample mean of the points in class $k$
$\hat{\Sigma}$: pooled sample covariance matrix
$\hat{\Sigma} = \dfrac{1}{N - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T$
Discriminant function:
$\delta_k(\mathbf{x}) = \mathbf{x}^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_k - \tfrac{1}{2} \hat{\boldsymbol{\mu}}_k^T \hat{\Sigma}^{-1} \hat{\boldsymbol{\mu}}_k + \log \hat{\pi}_k$
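A compact NumPy sketch of these estimates and of classification with the discriminant function (an illustrative implementation; the function names are not from the slides):

```python
import numpy as np

def fit_lda(X, y, K):
    """Estimate pi_k, mu_k and the pooled covariance. X: (N, d), y: (N,) labels in {0,...,K-1}."""
    N, d = X.shape
    pi = np.array([np.mean(y == k) for k in range(K)])            # class proportions
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])     # class means, (K, d)
    Sigma = np.zeros((d, d))
    for k in range(K):
        Xc = X[y == k] - mu[k]
        Sigma += Xc.T @ Xc
    Sigma /= (N - K)                                              # pooled covariance
    return pi, mu, Sigma

def lda_predict(X, pi, mu, Sigma):
    """Pick argmax_k of delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    scores = X @ Sinv @ mu.T - 0.5 * np.sum((mu @ Sinv) * mu, axis=1) + np.log(pi)
    return np.argmax(scores, axis=1)
```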


Linear Decision Boundaries


Characterizing the decision boundary between two classes $k$ and $l$: the set of points where $\delta_k(\mathbf{x}) = \delta_l(\mathbf{x})$, i.e.,
$\mathbf{x}^T \hat{\Sigma}^{-1} (\hat{\boldsymbol{\mu}}_k - \hat{\boldsymbol{\mu}}_l) = \tfrac{1}{2} (\hat{\boldsymbol{\mu}}_k + \hat{\boldsymbol{\mu}}_l)^T \hat{\Sigma}^{-1} (\hat{\boldsymbol{\mu}}_k - \hat{\boldsymbol{\mu}}_l) + \log \dfrac{\hat{\pi}_l}{\hat{\pi}_k}$,
which is linear in $\mathbf{x}$.


Linear Decision Boundaries Example


[Figure (Hastie et al., ESL Fig. 4.5): The left panel shows three Gaussian distributions with the same covariance and different means, including the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former).]

Linear Discriminant Analysis Extensions

Fundamental assumption: data points in each class follow a Gaussian distribution, with a common covariance.
If each class has a separate covariance: quadratic discriminant analysis (QDA).
Regularized discriminant analysis combines both LDA and QDA.
Reduced-rank linear discriminant analysis: project the data to a lower-dimensional subspace before performing LDA (Fisher's discriminant; covered after the midsem).

Logistic Regression
Linear regression results in poorly fit models for classification.
Given $y_i \in \{0, 1\}$, we want the output to also be in the range $[0, 1]$.
Use the logistic (sigmoid) function:
$h_{\mathbf{w}}(\mathbf{x}) = \dfrac{1}{1 + e^{-\mathbf{x}^T \mathbf{w}}}$


Interpretation of the Output


$h_{\mathbf{w}}(\mathbf{x})$: estimated probability that $y = 1$ on input $\mathbf{x}$ (the posterior probability).
Example: if $\mathbf{x} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$ and $h_{\mathbf{w}}(\mathbf{x}) = 0.7$, there is a 70% chance that the tumor is malignant.
The posterior probability that $y = 1$ on input $\mathbf{x}$ is parameterized by $\mathbf{w}$.
For a two-class scenario:
$P(y = 1 \mid \mathbf{x}; \mathbf{w}) + P(y = 0 \mid \mathbf{x}; \mathbf{w}) = 1$


Characterization of the Decision Boundary (1)

$h_{\mathbf{w}}(\mathbf{x}) = \dfrac{1}{1 + e^{-\mathbf{x}^T \mathbf{w}}}$
Predict class 1 if $h_{\mathbf{w}}(\mathbf{x}) \ge 0.5$, i.e., $\mathbf{x}^T \mathbf{w} \ge 0$.
Predict class 0 if $h_{\mathbf{w}}(\mathbf{x}) < 0.5$, i.e., $\mathbf{x}^T \mathbf{w} < 0$.


Characterization of the Decision Boundary (2)

$\mathbf{x}^T \mathbf{w} = w_0 + w_1 x_1 + w_2 x_2$
$\mathbf{w} = (-3, 1, 1)^T$
Predict $y = 1$ if
$-3 + x_1 + x_2 \ge 0$
How do we estimate $\mathbf{w}$?


Loss Function (1)


Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{0, 1\}$.
Least mean square loss function?


Loss Function (2)


Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{0, 1\}$.
Least mean square loss function? Non-convex in $\mathbf{w}$.
Instead use:
$\text{Cost}\big(h_{\mathbf{w}}(\mathbf{x}), y\big) = \begin{cases} -\log\big(1 - h_{\mathbf{w}}(\mathbf{x})\big), & \text{if } y = 0 \\ -\log h_{\mathbf{w}}(\mathbf{x}), & \text{if } y = 1 \end{cases}$

Logistic Regression Loss Function

$J(\mathbf{w}) = -\sum_{i=1}^{N} \left[ y_i \log h_{\mathbf{w}}(\mathbf{x}_i) + (1 - y_i) \log\big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big) \right]$
Minimize $J(\mathbf{w})$.
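A small NumPy sketch of this loss (assuming X already contains a bias column; the epsilon guard is an implementation detail, not part of the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """J(w) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ w)
    eps = 1e-12                                  # avoid log(0)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```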


Logistic Regression - Gradient Descent Parameter Update

Repeat till convergence:
$\dfrac{\partial J(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} \big( h_{\mathbf{w}}(\mathbf{x}_i) - y_i \big)\, x_{ij}$
In matrix form: $\nabla J(\mathbf{w}) = X^T \big( \mathbf{h} - \mathbf{y} \big)$, where $h_i = h_{\mathbf{w}}(\mathbf{x}_i)$
$\mathbf{w} := \mathbf{w} - \alpha\, \nabla J(\mathbf{w})$
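A minimal batch gradient-descent sketch of this update (the learning rate and iteration count are illustrative choices, and X is assumed to include a bias column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, iters=1000):
    """Repeat w := w - alpha * X^T (h - y). X: (N, d), y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y)        # sum_i (h(x_i) - y_i) x_i
        w -= alpha * grad
    return w
```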


Logistic Regression - Newton-Raphson Method (1)

Newton's method for finding a zero of a function $f$:
$\theta := \theta - \dfrac{f(\theta)}{f'(\theta)}$
This method has a natural interpretation: we approximate $f$ by the linear function that is tangent to $f$ at the current guess $\theta$, solve for where that linear function equals zero, and let the next guess for $\theta$ be that point.

[Figure: Newton's method in action, three panels of $f(x)$ versus $x$.]
In the leftmost figure we see the function $f$ plotted along with the line $y = 0$. We are trying to find $\theta$ so that $f(\theta) = 0$; the value of $\theta$ that achieves this is about 1.3. Suppose we initialized the algorithm with $\theta = 4.5$. Newton's method then fits a straight line tangent to $f$ at $\theta = 4.5$ and solves for where that line evaluates to 0 (middle figure). This gives the next guess for $\theta$, which is about 2.8. The rightmost figure shows the result of running one more iteration, which updates $\theta$ to about 1.8. After a few more iterations we rapidly approach $\theta = 1.3$.

For logistic regression: find the zero of $\nabla J(\mathbf{w})$.

Logistic Regression - Newton-Raphson Method (2)

Hessian of $J(\mathbf{w})$:
$H_{jl} = \dfrac{\partial^2 J(\mathbf{w})}{\partial w_j\, \partial w_l} = \sum_{i=1}^{N} h_{\mathbf{w}}(\mathbf{x}_i)\big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big)\, x_{ij}\, x_{il}$
In matrix form: $H = X^T R X$, where $R$ is a diagonal matrix with diagonal elements $R_{ii} = h_{\mathbf{w}}(\mathbf{x}_i)\big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big)$
Parameter update:
$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - H^{-1} \nabla J(\mathbf{w}^{(t)})$

Logistic Regression - Newton-Raphson Method (3)

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - H^{-1} \nabla J(\mathbf{w}^{(t)})$
$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (X^T R X)^{-1} X^T \big( \mathbf{h} - \mathbf{y} \big)$
Also called the Iteratively Reweighted Least Squares (IRLS) algorithm.
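A sketch of the Newton-Raphson / IRLS update in NumPy (illustrative; a real implementation would add a convergence check and avoid forming the dense diagonal matrix for large N):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    """Repeat w := w - (X^T R X)^{-1} X^T (h - y), with R_ii = h(x_i)(1 - h(x_i))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ w)
        R = np.diag(h * (1.0 - h))               # diagonal weight matrix
        grad = X.T @ (h - y)                     # gradient of J(w)
        H = X.T @ R @ X                          # Hessian of J(w)
        w -= np.linalg.solve(H, grad)            # Newton step
    return w
```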

Logistic Regression via Maximum Likelihood Estimation (1)

Linear regression: MLE using a Gaussian distribution assumption.
Logistic regression: MLE using a Bernoulli distribution assumption.
Bernoulli distribution: the probability distribution of a random variable that takes value 1 with success probability $p$ and value 0 with failure probability $1 - p$.
Example: a coin toss.
Interpret $h_{\mathbf{w}}(\mathbf{x}_i)$ as the success probability that the label $y_i$ of $\mathbf{x}_i$ takes the value 1.

Logistic Regression via Maximum Likelihood Estimation (2)

$P(y_i \mid \mathbf{x}_i; \mathbf{w}) = h_{\mathbf{w}}(\mathbf{x}_i)^{y_i} \big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big)^{1 - y_i}$
Taking the negative log of the likelihood over all $N$ points recovers the loss $J(\mathbf{w})$ above, so maximizing the likelihood is equivalent to minimizing $J(\mathbf{w})$.

Multi-class Logistic Regression (1)

One-vs-All Approach


Multi-class Logistic Regression (2)

One-vs-All Approach
Train a logistic regression classifier $h_{\mathbf{w}_k}(\mathbf{x})$ for each class $k$:
$h_{\mathbf{w}_k}(\mathbf{x}) = P(y = k \mid \mathbf{x}; \mathbf{w}_k)$
Predicted class:
$\arg\max_k\; h_{\mathbf{w}_k}(\mathbf{x})$
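A brief one-vs-all sketch (the binary trainer is passed in, e.g. the gradient-descent or Newton routine sketched earlier; the names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ova_train(X, y, K, train_binary):
    """Train one binary classifier per class on the 0/1 targets (y == k)."""
    return np.vstack([train_binary(X, (y == k).astype(float)) for k in range(K)])

def ova_predict(W, X):
    """W: (K, d), one weight vector per class; pick the most confident classifier."""
    return np.argmax(sigmoid(X @ W.T), axis=1)
```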


Summary
Linear regression as a classifier
Masking

Linear classifiers
Linear discriminants
Logistic regression

Sigmoid function
Loss function
Iterative parameter update
Maximum likelihood estimate

Multi-class logistic regression

