
5. Discriminant Analysis
Given $k$ populations (groups) $\Pi_1, \dots, \Pi_k$, we suppose that an individual from $\Pi_j$ has p.d.f. $f_j(x)$ for a set of $p$ measurements $x$.

The purpose of discriminant analysis is to allocate an individual to one of the groups $\Pi_j$ on the basis of $x$, making as few "mistakes" as possible. For example, a patient presents at a doctor's surgery with a set of symptoms $x$. The symptoms suggest a number of possible disease groups $\Pi_j$ to which the patient might belong. What is the most likely diagnosis?

The aim initially is to find a partition of $\mathbb{R}^p$ into disjoint regions $R_1, \dots, R_k$ together with a decision rule

\[ x \in R_j \implies \text{allocate } x \text{ to } \Pi_j. \]

The decision rule will be more accurate if "$\Pi_j$ has most of its probability concentrated in $R_j$" for each $j$.
5.1 The maximum likelihood (ML) rule
Allocate $x$ to the population $\Pi_j$ that gives the largest likelihood to $x$. Choose $j$ by

\[ L_j(x) = \max_{1 \le i \le k} L_i(x) \]

(break ties arbitrarily).
Result 1

If $\Pi_i$ is the multivariate normal (MVN) population $N_p(\mu_i, \Sigma)$ for $i = 1, \dots, k$, the ML rule allocates $x$ to the population $\Pi_i$ that minimizes the Mahalanobis distance between $x$ and $\mu_i$.
Proof

\[ L_i(x) = |2\pi\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) \right\} \]

so the likelihood is maximized when the exponent is minimized.
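As a concrete illustration of Result 1, the following is a minimal sketch in Python (assuming NumPy is available; the means, common covariance matrix and the function name ml_allocate are illustrative choices, not taken from the notes):

    import numpy as np

    def ml_allocate(x, mus, Sigma):
        """Allocate x to the group whose mean minimizes the Mahalanobis
        distance (x - mu_i)^T Sigma^{-1} (x - mu_i)."""
        Sigma_inv = np.linalg.inv(Sigma)
        d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in mus]
        return int(np.argmin(d2))  # index of the allocated group

    # illustrative example with k = 3 bivariate normal groups
    mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([4.0, -1.0])]
    Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
    print(ml_allocate(np.array([1.8, 0.5]), mus, Sigma))  # allocates to the second group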
Result 2

When $k = 2$ the ML rule allocates $x$ to $\Pi_1$ if

\[ d^T (x - \mu) > 0 \qquad (5.1) \]

where $d = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \tfrac{1}{2}(\mu_1 + \mu_2)$, and to $\Pi_2$ otherwise.
Proof

For the two-group case, the ML rule is to allocate $x$ to $\Pi_1$ if

\[ (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) < (x - \mu_2)^T \Sigma^{-1} (x - \mu_2). \]

Expanding both quadratic forms and cancelling the common term $x^T \Sigma^{-1} x$, this reduces to

\[ 2 d^T x > \mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) = d^T (\mu_1 + \mu_2). \]
Hence the result. The function

\[ h(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} \left[ x - \tfrac{1}{2}(\mu_1 + \mu_2) \right] \qquad (5.2) \]

is known as the discriminant function (DF). In this case the DF is linear in $x$.
5.2 Sample ML rule
In practice $\mu_1$, $\mu_2$, $\Sigma$ are estimated by, respectively, $\bar{x}_1$, $\bar{x}_2$, $S_p$, where $S_p$ is the pooled (unbiased) estimator of the covariance matrix.
Example

The eminent statistician R.A. Fisher took four measurements on samples of size 50 from each of three species of iris. Two of the variables, $x_1$ = sepal length and $x_2$ = sepal width, gave the following data on species I and II:

\[ \bar{x}_1 = \begin{pmatrix} 5.0 \\ 3.4 \end{pmatrix} \quad \bar{x}_2 = \begin{pmatrix} 6.0 \\ 2.8 \end{pmatrix} \quad S_1 = \begin{pmatrix} 0.12 & 0.10 \\ 0.10 & 0.14 \end{pmatrix} \quad S_2 = \begin{pmatrix} 0.26 & 0.08 \\ 0.08 & 0.10 \end{pmatrix} \]

(The data have been rounded for clarity.)
\[ S_p = \frac{50 S_1 + 50 S_2}{98} = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix} \]

Hence

\[ d = S_p^{-1}(\bar{x}_1 - \bar{x}_2) = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}^{-1} \begin{pmatrix} -1.0 \\ 0.6 \end{pmatrix} = \begin{pmatrix} -11.4 \\ 14.1 \end{pmatrix} \]

\[ \hat{\mu} = \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) = \begin{pmatrix} 5.5 \\ 3.1 \end{pmatrix} \]

giving the rule: allocate $x$ to $\Pi_1$ if

\[ -11.4\,(x_1 - 5.5) + 14.1\,(x_2 - 3.1) > 0, \]

i.e.

\[ -11.4\, x_1 + 14.1\, x_2 + 19.0 > 0. \]
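The arithmetic above can be checked numerically; the following is a minimal sketch in Python (assuming NumPy, and using the rounded summary statistics quoted in the example, so the printed values differ slightly from the hand-rounded ones):

    import numpy as np

    xbar1 = np.array([5.0, 3.4])
    xbar2 = np.array([6.0, 2.8])
    S1 = np.array([[0.12, 0.10], [0.10, 0.14]])
    S2 = np.array([[0.26, 0.08], [0.08, 0.10]])

    # pooled (unbiased) covariance estimate from the two samples of size 50
    Sp = (50 * S1 + 50 * S2) / 98

    d = np.linalg.solve(Sp, xbar1 - xbar2)  # compare with (-11.4, 14.1) above
    mu_hat = 0.5 * (xbar1 + xbar2)          # (5.5, 3.1)

    def allocate(x):
        """Sample ML rule: species I if h(x) = d^T (x - mu_hat) > 0, else species II."""
        return "I" if d @ (x - mu_hat) > 0 else "II"

    print(d, mu_hat, allocate(np.array([5.1, 3.5])))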
5.3 Misclassification probabilities

The misclassification probabilities $p_{ij}$, defined as

\[ p_{ij} = \Pr[\text{allocate to } \Pi_i \text{ when in fact from } \Pi_j], \]

form a $k \times k$ matrix, of which the diagonal elements $p_{ii}$ are a measure of the classifier's accuracy.

For the case $k = 2$,

\[ p_{12} = \Pr[h(x) > 0 \mid \Pi_2]. \]

Since $h(x) = d^T(x - \mu)$ is a linear compound of $x$ it has a (univariate) normal distribution.
Given that $x$ comes from $\Pi_2$, so that $E[x] = \mu_2$,

\[ E[h(x)] = d^T \left[ \mu_2 - \tfrac{1}{2}(\mu_1 + \mu_2) \right] = \tfrac{1}{2}\, d^T (\mu_2 - \mu_1) = -\tfrac{1}{2}\Delta^2 \]

where $\Delta^2 = (\mu_2 - \mu_1)^T \Sigma^{-1} (\mu_2 - \mu_1)$ is the squared Mahalanobis distance between $\mu_2$ and $\mu_1$.
The variance of $h(x)$ is

\[ d^T \Sigma\, d = (\mu_2 - \mu_1)^T \Sigma^{-1} \Sigma\, \Sigma^{-1} (\mu_2 - \mu_1) = (\mu_2 - \mu_1)^T \Sigma^{-1} (\mu_2 - \mu_1) = \Delta^2. \]
Hence

\[ p_{12} = \Pr[h(x) > 0 \mid \Pi_2] = \Pr\left[ \frac{h(x) + \tfrac{1}{2}\Delta^2}{\Delta} > \frac{\tfrac{1}{2}\Delta^2}{\Delta} \right] = \Pr\left[ Z > \tfrac{1}{2}\Delta \right] = 1 - \Phi\!\left( \tfrac{1}{2}\Delta \right) = \Phi\!\left( -\tfrac{1}{2}\Delta \right). \qquad (5.3) \]
By symmetry, if we write $\Delta = \Delta_{12}$, the misclassification rate $p_{21} = \Phi(-\tfrac{1}{2}\Delta_{21})$, and since $\Delta_{12} = \Delta_{21}$ we have

\[ p_{12} = p_{21}. \]
Example (contd.)

We can estimate the misclassification probability from the sample Mahalanobis distance between $\bar{x}_2$ and $\bar{x}_1$:

\[ D^2 = (\bar{x}_2 - \bar{x}_1)^T S_p^{-1} (\bar{x}_2 - \bar{x}_1) = \begin{pmatrix} 1.0 & -0.6 \end{pmatrix} \begin{pmatrix} 11.4 \\ -14.1 \end{pmatrix} \approx 19.9 \]

so that

\[ \Phi\!\left( -\tfrac{1}{2} D \right) = \Phi(-2.23) = 0.013. \]

The estimated misclassification rate is about 1.3%.
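This estimate can also be reproduced directly; a minimal sketch in Python (assuming NumPy and SciPy, and continuing with the rounded summary statistics above):

    import numpy as np
    from scipy.stats import norm

    xbar1 = np.array([5.0, 3.4])
    xbar2 = np.array([6.0, 2.8])
    Sp = (50 * np.array([[0.12, 0.10], [0.10, 0.14]])
          + 50 * np.array([[0.26, 0.08], [0.08, 0.10]])) / 98

    diff = xbar2 - xbar1
    D2 = diff @ np.linalg.solve(Sp, diff)  # sample Mahalanobis distance, about 20
    p12 = norm.cdf(-0.5 * np.sqrt(D2))     # estimated misclassification probability
    print(D2, p12)                         # roughly 0.013, i.e. about 1.3%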
5.4 Optimality of ML rule
We can show that the ML rule minimizes the probability of misclassification if an individual is a priori equally likely to belong to any population.

Let $M$ be the event of a misclassification and consider a decision rule $\phi(x)$ represented as follows:

\[ \phi_i(x) = \begin{cases} 1 & \text{if } x \text{ is assigned to } \Pi_i \\ 0 & \text{otherwise} \end{cases} \]

i.e. $\phi = (\phi_1, \phi_2, \dots, \phi_k)$ is a 0-1 vector everywhere in the space of $x$, with $\phi_i(x) = 1$ for $x \in R_i$. Recall that the classifier assigns $x$ to $\Pi_i$ if $x \in R_i$.

The ML rule is represented as

\[ \phi_i^{ML}(x) = \begin{cases} 1 & \text{if } f_i(x) \ge f_j(x) \text{ for all } j \ne i \\ 0 & \text{otherwise} \end{cases} \]

Ties can be ignored, arbitrarily decided, or randomized by allowing $\phi_i = \tfrac{1}{t}$ if $t$ populations (likelihoods) are tied.
The misclassification probabilities are

\[ p_{ij} = \Pr[\text{allocate to } \Pi_i \text{ when in fact from } \Pi_j] = \Pr(\phi_i = 1 \mid x \in \Pi_j) = \int \phi_i(x)\, f_j(x)\, dx = \int_{R_i} f_j(x)\, dx, \]

where, for this argument, $x$ is assumed equally likely to come from each $\Pi_j$ $(j = 1, \dots, k)$.
The total probability of misclassification is

\[ \Pr(M) = \sum_{i=1}^{k} \Pr(M \mid x \in \Pi_i) \Pr(x \in \Pi_i) = \frac{1}{k} \sum_{i=1}^{k} (1 - p_{ii}) = 1 - \frac{1}{k} \sum_{i=1}^{k} p_{ii}. \]
Clearly we need to maximize the sum of the probabilities of correct classification, which is the trace of the misclassification matrix $(p_{ij})$:

\[ \sum_{i=1}^{k} p_{ii} = \sum_{i=1}^{k} \int \phi_i(x)\, f_i(x)\, dx = \int \left[ \sum_{i=1}^{k} \phi_i(x)\, f_i(x) \right] dx \le \int \max_i f_i(x)\, dx. \]

Since the ML rule places $\phi_i(x) = 1$ exactly where $f_i(x) = \max_j f_j(x)$, it attains this upper bound. This shows that the trace is maximized by the ML rule and therefore $\Pr(M)$ is minimized.
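As an informal check of this optimality result, a minimal Monte Carlo sketch in Python for $k = 2$ equal-prior MVN groups (assuming NumPy and SciPy; the particular means, covariance and sample size are illustrative) compares the empirical error rate of the ML rule with the theoretical value $\Phi(-\tfrac{1}{2}\Delta)$ from (5.3):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu1, mu2 = np.array([0.0, 0.0]), np.array([1.5, 1.0])
    Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])

    d = np.linalg.solve(Sigma, mu1 - mu2)        # Sigma^{-1} (mu1 - mu2)
    mu = 0.5 * (mu1 + mu2)
    Delta = np.sqrt((mu1 - mu2) @ d)             # Mahalanobis distance between the means

    n = 100_000
    x1 = rng.multivariate_normal(mu1, Sigma, n)  # draws truly from group 1
    x2 = rng.multivariate_normal(mu2, Sigma, n)  # draws truly from group 2
    err = 0.5 * (np.mean((x1 - mu) @ d <= 0) + np.mean((x2 - mu) @ d > 0))

    print(err, norm.cdf(-0.5 * Delta))           # the two values should agree closely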
5.5 Bayes Rule
The Bayes rule generalizes the ML rule by introducing a set of prior probabilities $\pi_i$, assumed known, where

\[ \pi_i = \Pr(\text{individual belongs to } \Pi_i). \]
The misclassification probability becomes

\[ \Pr(M) = \sum_{i=1}^{k} \Pr(M \mid x \in \Pi_i) \Pr(x \in \Pi_i) = \sum_{i=1}^{k} (1 - p_{ii})\, \pi_i = 1 - \sum_{i=1}^{k} p_{ii}\, \pi_i. \]
The previous analysis carries across as follows:

\[ \sum_{i=1}^{k} p_{ii}\, \pi_i = \sum_{i=1}^{k} \pi_i \int \phi_i(x)\, f_i(x)\, dx = \int \left[ \sum_{i=1}^{k} \pi_i\, \phi_i(x)\, f_i(x) \right] dx \le \int \max_i \pi_i f_i(x)\, dx. \]
The Bayes rule assigns $x$ to the $\Pi_j$ that maximizes the posterior probability $p(j \mid x) \propto \pi_j f_j(x)$. This rule also minimizes the probability of misclassification $\Pr(M)$.
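A minimal sketch of the Bayes allocation in Python for MVN groups with a common covariance matrix (assuming NumPy and SciPy; the priors, means and covariance are illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal

    def bayes_allocate(x, priors, mus, Sigma):
        """Assign x to the group maximizing pi_j * f_j(x), i.e. the posterior."""
        scores = [p * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
                  for p, mu in zip(priors, mus)]
        return int(np.argmax(scores))

    priors = [0.7, 0.3]
    mus = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
    Sigma = np.eye(2)
    # the larger prior on the first group pulls the boundary toward the second mean
    print(bayes_allocate(np.array([0.6, 0.6]), priors, mus, Sigma))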
5.6 Minimizing Expected Loss
We can also introduce unequal costs of misclassification. Let $c_{ij} = c(i \mid j)$ be the cost of assigning to $\Pi_i$ an individual $x$ that in fact comes from $\Pi_j$. Generally we suppose $c_{ii} = 0$.
Definition

The expected cost of misclassification is known as the Bayes risk:

\[ R_i(x) = \sum_{j=1}^{k} c(i \mid j)\, p(j \mid x) \]

is the risk, or expected loss, conditional on $x$ and taking action $i$, where

\[ p(j \mid x) = \frac{\pi_j f_j(x)}{f(x)} \quad \text{and} \quad f(x) = \sum_{j=1}^{k} \pi_j f_j(x). \]
Definition

The overall risk of a rule defined by $\phi$ is the expected loss at $x$:

\[ R(x) = \sum_{i=1}^{k} \phi_i(x)\, R_i(x). \]

We can show that it is optimal to take the action $i$ that minimizes the Bayes risk:

\[ E[R(x)] = \int \left[ \sum_{i=1}^{k} \phi_i(x)\, R_i(x) \right] f(x)\, dx \ge \int \left[ \min_i R_i(x) \right] f(x)\, dx. \]

Hence, to minimize the overall expected loss, it is optimal to choose at each $x$ the action $i$ with the smallest Bayes risk $R_i(x)$.
Example

Given two populations $\Pi_1, \Pi_2$, suppose $c(2 \mid 1) = 5$ and $c(1 \mid 2) = 10$. Suppose that 20% of the population belong to $\Pi_2$; then $\pi_1 = 0.8$ and $\pi_2 = 0.2$.

Given a new individual $x$, the Bayes risk of assigning $x$ to $\Pi_1$ is (dropping the common factor $f(x)$, which does not affect the comparison)

\[ R_1(x) = \sum_{j=1}^{2} c(1 \mid j)\, p(j \mid x) \propto 10 \times 0.2 \times f_2(x) = 2 f_2(x). \]

The Bayes risk of assigning $x$ to $\Pi_2$ is

\[ R_2(x) = \sum_{j=1}^{2} c(2 \mid j)\, p(j \mid x) \propto 5 \times 0.8 \times f_1(x) = 4 f_1(x). \]

Suppose that a new individual $x_0$ gives $f_1(x_0) = 0.3$ and $f_2(x_0) = 0.4$; then

\[ 2 \times 0.4 = 0.8 < 4 \times 0.3 = 1.2, \]

so we assign $x_0$ to $\Pi_1$.
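A minimal sketch of this cost calculation in Python (the density values $f_1(x_0)$, $f_2(x_0)$ are those supposed in the example, and the common factor $f(x_0)$ is again dropped since it does not affect the comparison):

    # costs c(i|j), priors, and the supposed density values at x0
    c_1_2, c_2_1 = 10, 5   # c(1|2) and c(2|1); c(1|1) = c(2|2) = 0
    pi1, pi2 = 0.8, 0.2
    f1, f2 = 0.3, 0.4

    R1 = c_1_2 * pi2 * f2  # risk of assigning x0 to group 1 -> 0.8
    R2 = c_2_1 * pi1 * f1  # risk of assigning x0 to group 2 -> 1.2
    print("assign x0 to group", 1 if R1 < R2 else 2)  # group 1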