
Classification

Qualitative variables take values in an unordered set C, such as:
    eye color ∈ {brown, blue, green}
    email ∈ {spam, ham}.

Given a feature vector X and a qualitative response Y taking values in the set C, the classification task is to build a function C(X) that takes as input the feature vector X and predicts its value for Y; i.e. C(X) ∈ C.

Often we are more interested in estimating the probabilities that X belongs to each category in C.

For example, it is more valuable to have an estimate of the probability that an insurance claim is fraudulent than a classification of fraudulent or not.

1 / 40

Example: Credit Card Default

[Figure: annual Income plotted against credit card Balance for the Default data, with defaulters (Yes) and non-defaulters (No) marked, together with boxplots of Balance and Income by default status.]

2 / 40

Can we use Linear Regression?

Suppose for the Default classification task that we code

    Y = 0 if No,
        1 if Yes.

Can we simply perform a linear regression of Y on X and classify as Yes if Ŷ > 0.5?

In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis, which we discuss later.

Since in the population E(Y |X = x) = Pr(Y = 1|X = x), we might think that regression is perfect for this task.

However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more appropriate. (A small sketch comparing the two fits follows below.)

3 / 40
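
A minimal R sketch of this comparison, assuming the Default data from the ISLR package (the data set these slides use); it is an illustration only, not part of the slides.

    library(ISLR)
    y <- as.numeric(Default$default == "Yes")          # code No/Yes as 0/1

    lin.fit <- lm(y ~ balance, data = Default)          # linear probability model
    log.fit <- glm(y ~ balance, data = Default, family = binomial)

    range(fitted(lin.fit))   # linear fitted values fall below 0 for small balances
    range(fitted(log.fit))   # logistic fitted probabilities stay inside (0, 1)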

Linear versus Logistic Regression

[Figure: estimated Probability of Default as a function of Balance, for linear regression (left) and logistic regression (right); the orange tick marks show the observed responses.]

The orange marks indicate the response Y, either 0 or 1. Linear regression does not estimate Pr(Y = 1|X) well. Logistic regression seems well suited to the task.

4 / 40

Linear Regression continued

Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.

    Y = 1 if stroke;
        2 if drug overdose;
        3 if epileptic seizure.

This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.

Linear regression is not appropriate here. Multiclass Logistic Regression or Discriminant Analysis are more appropriate.

5 / 40

Logistic Regression

Let's write p(X) = Pr(Y = 1|X) for short and consider using balance to predict default. Logistic regression uses the form

    p(X) = e^{β0 + β1 X} / (1 + e^{β0 + β1 X}).

(e ≈ 2.71828 is a mathematical constant [Euler's number].)

It is easy to see that no matter what values β0, β1 or X take, p(X) will have values between 0 and 1.

A bit of rearrangement gives

    log( p(X) / (1 - p(X)) ) = β0 + β1 X.

This monotone transformation is called the log odds or logit transformation of p(X).

6 / 40
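
To spell out the rearrangement: since 1 - p(X) = 1 / (1 + e^{β0 + β1 X}), the ratio p(X) / (1 - p(X)) equals e^{β0 + β1 X}, and taking logs gives the linear form above. A small R check of this identity (a sketch only; the coefficient values are the ones fitted to the Default data a couple of slides later):

    # plogis() is the logistic function e^z / (1 + e^z); qlogis() is its inverse,
    # the log odds log(p / (1 - p)).
    b0 <- -10.6513; b1 <- 0.0055     # estimates quoted later in these slides
    x  <- 1500
    p  <- plogis(b0 + b1 * x)        # p(X), guaranteed to lie in (0, 1)
    qlogis(p)                        # recovers the linear predictor ...
    b0 + b1 * x                      # ... b0 + b1 * x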

Linear versus Logistic Regression

[Figure: the same pair of plots, Probability of Default versus Balance for the linear and logistic fits.]

Logistic regression ensures that our estimate for p(X) lies between 0 and 1.

7 / 40

Maximum Likelihood

We use maximum likelihood to estimate the parameters:

    ℓ(β0, β1) = ∏_{i: yi = 1} p(xi) · ∏_{i: yi = 0} (1 - p(xi)).

This likelihood gives the probability of the observed zeros and ones in the data. We pick β0 and β1 to maximize the likelihood of the observed data.

Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we use the glm function.

                 Coefficient   Std. Error   Z-statistic    P-value
    Intercept       -10.6513       0.3612         -29.5   < 0.0001
    balance           0.0055       0.0002          24.9   < 0.0001

8 / 40
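
A minimal sketch of the corresponding R call, assuming the ISLR Default data; the table above is essentially the coefficient block of summary() for such a fit.

    library(ISLR)
    fit <- glm(default ~ balance, data = Default, family = binomial)
    summary(fit)$coefficients
    # Expect an intercept near -10.65 and a balance coefficient near 0.0055,
    # as in the table above; glm() maximizes the likelihood shown at the top.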

Making Predictions

What is our estimated probability of default for someone with a balance of $1000?

    p(X) = e^{β0 + β1 X} / (1 + e^{β0 + β1 X})
         = e^{-10.6513 + 0.0055 × 1000} / (1 + e^{-10.6513 + 0.0055 × 1000}) = 0.006

With a balance of $2000?

    p(X) = e^{β0 + β1 X} / (1 + e^{β0 + β1 X})
         = e^{-10.6513 + 0.0055 × 2000} / (1 + e^{-10.6513 + 0.0055 × 2000}) = 0.586

9 / 40
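
These numbers can be reproduced in R, either by plugging into the formula with plogis() (the inverse logit in base R) or with predict() on a fitted model; a sketch, again assuming the ISLR Default data:

    library(ISLR)
    fit <- glm(default ~ balance, data = Default, family = binomial)   # as in the previous sketch

    plogis(-10.6513 + 0.0055 * 1000)   # about 0.006
    plogis(-10.6513 + 0.0055 * 2000)   # about 0.586

    # Or, using the fitted model object directly:
    predict(fit, newdata = data.frame(balance = c(1000, 2000)), type = "response")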

Let's do it again, using student as the predictor.

                    Coefficient   Std. Error   Z-statistic    P-value
    Intercept           -3.5041       0.0707        -49.55   < 0.0001
    student[Yes]         0.4049       0.1150          3.52     0.0004

    Pr(default = Yes | student = Yes) = e^{-3.5041 + 0.4049 × 1} / (1 + e^{-3.5041 + 0.4049 × 1}) = 0.0431,

    Pr(default = Yes | student = No)  = e^{-3.5041 + 0.4049 × 0} / (1 + e^{-3.5041 + 0.4049 × 0}) = 0.0292.

10 / 40

Logistic regression with several variables

    log( p(X) / (1 - p(X)) ) = β0 + β1 X1 + ··· + βp Xp

    p(X) = e^{β0 + β1 X1 + ··· + βp Xp} / (1 + e^{β0 + β1 X1 + ··· + βp Xp})

                    Coefficient   Std. Error   Z-statistic    P-value
    Intercept          -10.8690       0.4923        -22.08   < 0.0001
    balance              0.0057       0.0002         24.74   < 0.0001
    income               0.0030       0.0082          0.37     0.7115
    student[Yes]        -0.6468       0.2362         -2.74     0.0062

Why is coefficient for student negative, while it was positive before?

11 / 40
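
A sketch of this multiple logistic regression in R (again assuming ISLR's Default data; income appears to be expressed in thousands of dollars in the table above, so it is rescaled here):

    library(ISLR)
    fit.all <- glm(default ~ balance + I(income / 1000) + student,
                   data = Default, family = binomial)
    round(summary(fit.all)$coefficients, 4)
    # Note the sign flip: with balance and income held fixed, the student
    # coefficient is negative -- the confounding discussed on the next slide.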

Confounding

[Figure: left panel, Default Rate as a function of Credit Card Balance for students and non-students, with the marginal default rates also shown; right panel, boxplots of Credit Card Balance by Student Status.]

Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students.

But for each level of balance, students default less than non-students.

Multiple logistic regression can tease this out.

12 / 40

Example: South African Heart Disease

160 cases of MI (myocardial infarction) and 302 controls (all male, in age range 15-64), from the Western Cape, South Africa, in the early 80s.

Overall prevalence very high in this region: 5.1%.

Measurements on seven predictors (risk factors), shown in scatterplot matrix.

Goal is to identify relative strengths and directions of risk factors.

This was part of an intervention study aimed at educating the public on healthier diets.

13 / 40

[Figure: scatterplot matrix of the risk factors sbp, tobacco, ldl, famhist, obesity, alcohol and age.]

Scatterplot matrix of the South African Heart Disease data. The response is color coded: the cases (MI) are red, the controls turquoise. famhist is a binary variable, with 1 indicating family history of MI.

14 / 40

> heartfit <- glm(chd ~ ., data = heart, family = binomial)
> summary(heartfit)

Call:
glm(formula = chd ~ ., family = binomial, data = heart)

Coefficients:
                  Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -4.1295997   0.9641558   -4.283  1.84e-05  ***
sbp              0.0057607   0.0056326    1.023   0.30643
tobacco          0.0795256   0.0262150    3.034   0.00242  **
ldl              0.1847793   0.0574115    3.219   0.00129  **
famhistPresent   0.9391855   0.2248691    4.177  2.96e-05  ***
obesity         -0.0345434   0.0291053   -1.187   0.23529
alcohol          0.0006065   0.0044550    0.136   0.89171
age              0.0425412   0.0101749    4.181  2.90e-05  ***

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 596.11  on 461  degrees of freedom
Residual deviance: 483.17  on 454  degrees of freedom
AIC: 499.17

15 / 40

Case-control sampling and logistic regression

In the South African data, there are 160 cases and 302 controls: π̃ = 0.35 of the sample are cases. Yet the prevalence of MI in this region is π = 0.05.

With case-control samples, we can estimate the regression parameters βj accurately (if our model is correct); the constant term β0 is incorrect.

We can correct the estimated intercept by a simple transformation:

    β0* = β0 + log( π / (1 - π) ) - log( π̃ / (1 - π̃) )

Often cases are rare and we take them all; up to five times that number of controls is sufficient. See next frame. (A sketch applying this correction follows below.)

16 / 40
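
A small R sketch applying the correction to the heart-disease fit above, taking π̃ as the fraction of cases in the sample and π as the quoted population prevalence (values rounded):

    pi.pop  <- 0.05         # population prevalence of MI in the region (approx)
    pi.samp <- 160 / 462    # proportion of cases in the case-control sample
    b0.hat  <- -4.130       # intercept from the glm output above (rounded)

    b0.star <- b0.hat + log(pi.pop / (1 - pi.pop)) - log(pi.samp / (1 - pi.samp))
    b0.star                 # roughly -6.44; the slope estimates are left unchanged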

Diminishing returns in unbalanced binary data

[Figure: coefficient variance plotted against the control/case ratio, for simulation and theoretical results.]

Sampling more controls than cases reduces the variance of the parameter estimates. But after a ratio of about 5 to 1 the variance reduction flattens out.

17 / 40

Logistic regression with more than two classes

So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form

    Pr(Y = k|X) = e^{β0k + β1k X1 + ... + βpk Xp} / Σ_{ℓ=1}^{K} e^{β0ℓ + β1ℓ X1 + ... + βpℓ Xp}

Here there is a linear function for each class.

(The mathier students will recognize that some cancellation is possible, and only K - 1 linear functions are needed, as in 2-class logistic regression.)

Multiclass logistic regression is also referred to as multinomial regression.

18 / 40
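
A minimal multiclass sketch in R: nnet::multinom fits the K - 1-function parameterization, while glmnet (with family = "multinomial") uses the symmetric form shown above. The iris data are used here purely as a convenient three-class example.

    library(nnet)
    fit.multi <- multinom(Species ~ ., data = iris)    # 3 classes, 4 predictors
    head(predict(fit.multi, type = "probs"))           # estimated Pr(Y = k | X)
    head(predict(fit.multi, type = "class"))           # predicted class labels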

Discriminant Analysis

Here the approach is to model the distribution of X in each of the classes separately, and then use Bayes' theorem to flip things around and obtain Pr(Y |X).

When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.

However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.

19 / 40

Bayes theorem for classification

Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes' theorem:

    Pr(Y = k|X = x) = Pr(X = x|Y = k) · Pr(Y = k) / Pr(X = x)

One writes this slightly differently for discriminant analysis:

    Pr(Y = k|X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x),

where

    fk(x) = Pr(X = x|Y = k) is the density for X in class k. Here we will use normal densities for these, separately in each class.

    πk = Pr(Y = k) is the marginal or prior probability for class k.

20 / 40

Classify to the highest density

[Figure: two one-dimensional class densities; left panel with priors π1 = .5, π2 = .5, right panel with π1 = .3, π2 = .7.]

We classify a new point according to which density is highest.

When the priors are different, we take them into account as well, and compare πk fk(x). On the right, we favor the pink class: the decision boundary has shifted to the left.

21 / 40

Why discriminant analysis?

When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

22 / 40

Linear Discriminant Analysis when p = 1

The Gaussian density has the form

    fk(x) = (1 / (√(2π) σk)) · exp( -(1/2) ((x - μk) / σk)^2 )

Here μk is the mean, and σk^2 the variance (in class k). We will assume that all the σk = σ are the same.

Plugging this into Bayes' formula, we get a rather complex expression for pk(x) = Pr(Y = k|X = x):

    pk(x) = πk (1/(√(2π) σ)) exp( -(1/(2σ^2)) (x - μk)^2 ) / Σ_{l=1}^{K} πl (1/(√(2π) σ)) exp( -(1/(2σ^2)) (x - μl)^2 )

Happily, there are simplifications and cancellations.

23 / 40

Discriminant functions

To classify at the value X = x, we need to see which of the pk(x) is largest. Taking logs, and discarding terms that do not depend on k, we see that this is equivalent to assigning x to the class with the largest discriminant score:

    δk(x) = x · μk / σ^2 - μk^2 / (2σ^2) + log(πk)

Note that δk(x) is a linear function of x.

If there are K = 2 classes and π1 = π2 = 0.5, then one can see that the decision boundary is at

    x = (μ1 + μ2) / 2.

(See if you can show this.)

24 / 40

[Figure: the two Gaussian class densities and the resulting decision boundary.]

Example with μ1 = -1.5, μ2 = 1.5, π1 = π2 = 0.5, and σ^2 = 1.

Typically we don't know these parameters; we just have the training data. In that case we simply estimate the parameters and plug them into the rule.

25 / 40

Estimating the parameters

    π̂k = nk / n

    μ̂k = (1/nk) Σ_{i: yi = k} xi

    σ̂^2 = (1/(n - K)) Σ_{k=1}^{K} Σ_{i: yi = k} (xi - μ̂k)^2

         = Σ_{k=1}^{K} ((nk - 1)/(n - K)) · σ̂k^2

where σ̂k^2 = (1/(nk - 1)) Σ_{i: yi = k} (xi - μ̂k)^2 is the usual formula for the estimated variance in the kth class.

26 / 40
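
These estimates are easy to compute directly. A sketch for one numeric predictor x and a factor label y; lda.est is a hypothetical helper written here for illustration, not part of the slides.

    lda.est <- function(x, y) {
      n <- length(x); K <- nlevels(y)
      pi.hat <- table(y) / n                      # prior probabilities
      mu.hat <- tapply(x, y, mean)                # class means
      ss     <- tapply(x, y, function(v) sum((v - mean(v))^2))
      sigma2 <- sum(ss) / (n - K)                 # pooled variance estimate
      list(pi = pi.hat, mu = mu.hat, sigma2 = sigma2)
    }

    # Discriminant score for a new point x0 and class k:
    #   delta_k(x0) = x0 * mu_k / sigma2 - mu_k^2 / (2 * sigma2) + log(pi_k)
    est <- lda.est(iris$Sepal.Length, iris$Species)   # quick illustration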

Linear Discriminant Analysis when p > 1

[Figure: two bivariate Gaussian densities shown over (x1, x2).]

    Density: f(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) · exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) )

    Discriminant function: δk(x) = x^T Σ^{-1} μk - (1/2) μk^T Σ^{-1} μk + log πk

Despite its complex form, δk(x) = ck0 + ck1 x1 + ck2 x2 + ... + ckp xp is a linear function.

27 / 40

Illustration: p = 2 and K = 3 classes

[Figure: three classes in the (X1, X2) plane, with dashed Bayes decision boundaries.]

Here π1 = π2 = π3 = 1/3.

The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers.

28 / 40

Fisher's Iris Data

[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, colored by species (Setosa, Versicolor, Virginica).]

4 variables
3 species
50 samples/class

LDA classifies all but 3 of the 150 training samples correctly.

29 / 40
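
A sketch reproducing the quoted training result with MASS::lda (the iris data are built into R):

    library(MASS)
    iris.lda <- lda(Species ~ ., data = iris)
    pred     <- predict(iris.lda)                      # list with $class, $posterior, $x
    table(Predicted = pred$class, True = iris$Species)
    mean(pred$class != iris$Species)                   # training error rate, 3/150 = 0.02
    # pred$x holds the K - 1 = 2 discriminant variables plotted on the next slide.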

Fisher's Discriminant Plot

[Figure: the iris training data plotted against Discriminant Variable 1 and Discriminant Variable 2, colored by species.]

When there are K classes, linear discriminant analysis can be viewed exactly in a K - 1 dimensional plot.

Why? Because it essentially classifies to the closest centroid, and they span a K - 1 dimensional plane.

Even when K > 3, we can find the "best" 2-dimensional plane for visualizing the discriminant rule.

30 / 40

From δk(x) to probabilities

Once we have estimates of the δk(x), we can turn these into estimates for the class probabilities:

    Pr(Y = k|X = x) = e^{δk(x)} / Σ_{l=1}^{K} e^{δl(x)}

So classifying to the largest δk(x) amounts to classifying to the class for which Pr(Y = k|X = x) is largest.

When K = 2, we classify to class 2 if Pr(Y = 2|X = x) ≥ 0.5, else to class 1.

31 / 40

LDA on Credit Data

                                True Default Status
                                No       Yes     Total
    Predicted        No       9644       252      9896
    Default          Yes        23        81       104
    Status           Total    9667       333     10000

(23 + 252)/10000 errors: a 2.75% misclassification rate!

Some caveats:

This is training error, and we may be overfitting. Not a big concern here since n = 10000 and p = 4!

If we classified to the prior (always to class No in this case) we would make 333/10000 errors, or only 3.33%.

Of the true No's, we make 23/9667 = 0.2% errors; of the true Yes's, we make 252/333 = 75.7% errors!

32 / 40

Types of errors

False positive rate: the fraction of negative examples that are classified as positive (0.2% in example).

False negative rate: the fraction of positive examples that are classified as negative (75.7% in example).

We produced this table by classifying to class Yes if

    Pr(Default = Yes | Balance, Student) ≥ 0.5.

We can change the two error rates by changing the threshold from 0.5 to some other value in [0, 1]:

    Pr(Default = Yes | Balance, Student) ≥ threshold,

and vary threshold. (A sketch that does this appears below.)

33 / 40
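
A sketch of this thresholding in R, assuming an LDA fit on the ISLR Default data with balance and student as predictors:

    library(ISLR); library(MASS)
    def.lda <- lda(default ~ balance + student, data = Default)
    post    <- predict(def.lda)$posterior[, "Yes"]     # estimated Pr(default = Yes | X)

    rates <- function(thr) {
      pred <- ifelse(post >= thr, "Yes", "No")
      c(threshold = thr,
        false.pos = mean(pred[Default$default == "No"]  == "Yes"),
        false.neg = mean(pred[Default$default == "Yes"] == "No"))
    }
    t(sapply(c(0.5, 0.2, 0.1), rates))   # lower threshold: more false positives, fewer false negatives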

Varying the threshold

[Figure: Overall Error, False Positive and False Negative rates plotted against the threshold, for thresholds between 0.0 and 0.5.]

In order to reduce the false negative rate, we may want to reduce the threshold to 0.1 or less.

34 / 40

ROC Curve

[Figure: ROC curve for the LDA classifier on the Default data, plotting true positive rate against false positive rate.]

The ROC plot displays both simultaneously.

Sometimes we use the AUC or area under the curve to summarize the overall performance. Higher AUC is good.

35 / 40
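
A sketch of the ROC curve and AUC for the LDA posterior probabilities, assuming the pROC package is installed:

    library(ISLR); library(MASS); library(pROC)
    def.lda <- lda(default ~ balance + student, data = Default)
    post    <- predict(def.lda)$posterior[, "Yes"]
    roc.obj <- roc(response = Default$default, predictor = post,
                   levels = c("No", "Yes"))    # "No" = controls, "Yes" = cases
    plot(roc.obj)    # true positive rate against false positive rate
    auc(roc.obj)     # area under the curve; closer to 1 is better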

Other forms of Discriminant Analysis

    Pr(Y = k|X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x)

When fk(x) are Gaussian densities, with the same covariance matrix Σ in each class, this leads to linear discriminant analysis. By altering the forms for fk(x), we get different classifiers.

With Gaussians but different Σk in each class, we get quadratic discriminant analysis.

With fk(x) = ∏_{j=1}^{p} fjk(xj) (conditional independence model) in each class we get naive Bayes. For Gaussian this means the Σk are diagonal.

Many other forms, by proposing specific density models for fk(x), including nonparametric approaches.

36 / 40

Quadratic Discriminant Analysis

[Figure: two two-class examples in the (X1, X2) plane with quadratic decision boundaries.]

    δk(x) = -(1/2) (x - μk)^T Σk^{-1} (x - μk) + log πk

Because the Σk are different, the quadratic terms matter.

37 / 40
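
A sketch of the corresponding fit with MASS::qda, which estimates a separate covariance matrix per class (ISLR Default data assumed):

    library(ISLR); library(MASS)
    def.qda  <- qda(default ~ balance + student, data = Default)   # class-specific covariances
    qda.pred <- predict(def.qda)$class
    table(Predicted = qda.pred, True = Default$default)            # training confusion matrix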

Naive Bayes

Assumes features are independent in each class.

Useful when p is large, and so multivariate methods like QDA and even LDA break down.

Gaussian naive Bayes assumes each Σk is diagonal:

    δk(x) ∝ log[ πk ∏_{j=1}^{p} fkj(xj) ] = -(1/2) Σ_{j=1}^{p} (xj - μkj)^2 / σkj^2 + log πk

Can use for mixed feature vectors (qualitative and quantitative). If Xj is qualitative, replace fkj(xj) with probability mass function (histogram) over discrete categories.

Despite strong assumptions, naive Bayes often produces good classification results.

38 / 40
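
A sketch using the naiveBayes implementation in the e1071 package (assumed installed), which treats numeric predictors as Gaussian within each class and factors via their class-conditional frequencies:

    library(ISLR); library(e1071)
    nb.fit  <- naiveBayes(default ~ balance + student, data = Default)
    nb.pred <- predict(nb.fit, Default)                # predicted classes
    table(Predicted = nb.pred, True = Default$default)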

Logistic Regression versus LDA

For a two-class problem, one can show that for LDA

    log( p1(x) / (1 - p1(x)) ) = log( p1(x) / p2(x) ) = c0 + c1 x1 + ... + cp xp

So it has the same form as logistic regression. The difference is in how the parameters are estimated.

Logistic regression uses the conditional likelihood based on Pr(Y |X) (known as discriminative learning).

LDA uses the full likelihood based on Pr(X, Y ) (known as generative learning).

Despite these differences, in practice the results are often very similar.

Footnote: logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.

39 / 40
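
A sketch of the footnote's point: adding quadratic terms to the logistic model gives a quadratic decision boundary, much as QDA does (ISLR Default data assumed).

    library(ISLR)
    quad.fit <- glm(default ~ poly(balance, 2) + poly(income, 2) + student,
                    data = Default, family = binomial)
    summary(quad.fit)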

Summary

Logistic regression is very popular for classification, especially when K = 2.

LDA is useful when n is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when K > 2.

Naive Bayes is useful when p is very large.

See Section 4.5 for some comparisons of logistic regression, LDA and KNN.

40 / 40
