
Discriminant Functions I

Consider a two-class problem. We need a function for the decision boundary to separate the classes.
Consider a simple linear discriminant function:

$$w_2 x_2 + w_1 x_1 + w_0 = 0$$
defined over a two-dimensional space, where
$$\bar{w} = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}, \qquad \bar{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
An example line:
$$x_2 + x_1 - 1 = 0$$
Clearly, points above the line yield
$$x_2 + x_1 - 1 > 0$$
while points below the line yield
$$x_2 + x_1 - 1 < 0$$
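As a quick numerical check of this sign rule, here is a minimal Python/NumPy sketch (the sample points are arbitrary illustrations):

```python
import numpy as np

# The line x2 + x1 - 1 = 0, written as w.x + w0 = 0 with w = [1, 1], w0 = -1
w = np.array([1.0, 1.0])
w0 = -1.0

def side_of_line(x):
    """Return +1 above the line, -1 below it, 0 exactly on it."""
    return np.sign(w @ x + w0)

print(side_of_line(np.array([1.0, 1.0])))   # +1.0 (above the line)
print(side_of_line(np.array([0.2, 0.3])))   # -1.0 (below the line)
print(side_of_line(np.array([0.5, 0.5])))   #  0.0 (on the line)
```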

Discriminant Functions II
If the points belonging to the two classes $C_1$ and $C_2$ are as shown in the Figure, they can easily be discriminated.
In general, linear discriminant functions are of the form
$$w_d x_d + w_{d-1} x_{d-1} + \dots + w_1 x_1 + w_0 = 0$$
This represents a hyperplane in a $d$-dimensional space. Alternatively, it can be written as
$$\bar{w}^t \bar{x} + w_0 = 0$$

[Figure: the line $x_1 + x_2 - 1 = 0$ in the $(x_1, x_2)$ plane, crossing the axes at $(0, 1)$ and $(1, 0)$.]

Nonlinear discriminant functions I

Consider two classes being separated by a circle, as shown in the Figure:
$$x^2 + y^2 = r^2$$
$x^2 + y^2 - r^2 = 0$ is the boundary between the two classes.
Clearly, $x^2 + y^2 - r^2 < 0$ inside the circle, while
$x^2 + y^2 - r^2 > 0$ outside the circle.
The boundary is clearly nonlinear in the input space.
Consider the transformation:
$$z_1 = x^2, \qquad z_2 = y^2, \qquad r^2 = 1$$
This yields


Nonlinear discriminant functions II

$$z_1 + z_2 - 1 = 0$$
This leads to a linear hyperplane in $z$-space that is isomorphic to the input $x$-space.
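A small Python/NumPy sketch (with randomly generated, purely illustrative points) showing that the circular boundary in $(x, y)$ becomes the linear boundary $z_1 + z_2 - 1 = 0$ after the transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D points; the true boundary is the unit circle x^2 + y^2 = 1
points = rng.uniform(-2.0, 2.0, size=(6, 2))

# Nonlinear discriminant in the input (x, y) space
g_input = points[:, 0]**2 + points[:, 1]**2 - 1.0

# Transform to z-space: z1 = x^2, z2 = y^2
z = points**2

# The same discriminant is linear in z: z1 + z2 - 1
g_z = z[:, 0] + z[:, 1] - 1.0

print(np.allclose(g_input, g_z))  # True: identical decision values
```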

[Figure: left, the two classes (marked o and x) in the $(x, y)$ plane, separated by a circle; right, the same points in the $(z_1, z_2)$ plane, where the boundary becomes a straight line.]

Assume the class-conditional density of each class is Gaussian:
$$p(\vec{x}\,|\,\omega_i) = \mathcal{N}(\vec{x}\,|\,\vec{\mu}_i, \Sigma_i) \qquad \text{where } \vec{x} = [x_1\ x_2\ \dots\ x_d]^T$$

Classification steps:
- Training Process: we estimate $\hat{\mu}_i$ and $\hat{\Sigma}_i$ using the dataset of the $i$-th class, $D = \{\vec{x}_1, \vec{x}_2, \dots, \vec{x}_N\}$.
- Development Process: we fix our hyperparameters in this process.
- Testing Process: we test our model using unseen data.

We classify the feature vector $\vec{x}$ to the class for which $P(\omega_i\,|\,\vec{x})$ is highest; the remaining posterior mass is the error. Example: if we have $M$ classes and $\omega_{\max}$ is the chosen class, then error $= 1 - P(\omega_{\max}\,|\,\vec{x})$.
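A minimal sketch of these steps in Python/NumPy, assuming two Gaussian classes with synthetic data (class sizes, means, and priors are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training sets for two hypothetical classes
D1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200)
D2 = rng.multivariate_normal([3.0, 3.0], np.eye(2), size=200)
priors = {1: 0.5, 2: 0.5}

# Training: estimate the mean and covariance of each class
params = {}
for label, D in [(1, D1), (2, D2)]:
    params[label] = (D.mean(axis=0), np.cov(D, rowvar=False))

def log_gaussian(x, mu, sigma):
    """Log of the multivariate normal density N(x | mu, sigma)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(sigma)[1]
            - 0.5 * diff @ np.linalg.inv(sigma) @ diff)

def classify(x):
    """Testing: pick the class with the largest p(x|w_i) P(w_i)."""
    scores = {label: log_gaussian(x, mu, sigma) + np.log(priors[label])
              for label, (mu, sigma) in params.items()}
    return max(scores, key=scores.get)

print(classify(np.array([0.5, 0.2])))  # expected: 1
print(classify(np.array([2.8, 3.1])))  # expected: 2
```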


Bayes' Decision Theory

[Figure: the curves $p(x\,|\,\omega_1)P(\omega_1)$ and $p(x\,|\,\omega_2)P(\omega_2)$ plotted against $x$, illustrating the posterior $P(\omega_i\,|\,x)$.]
By Bayes' rule, $P(\omega_i\,|\,x) = \dfrac{p(x\,|\,\omega_i)P(\omega_i)}{p(x)}$. As $p(x)$ does not affect the decision process,
$$P(\omega_i\,|\,x) \propto p(x\,|\,\omega_i)P(\omega_i)$$
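A tiny numerical illustration (the likelihood and prior values are made up) of why dropping $p(x)$ does not change the decision:

```python
import numpy as np

# Hypothetical class-conditional densities and priors at one value of x
likelihoods = np.array([0.30, 0.10])   # p(x|w1), p(x|w2)
priors      = np.array([0.40, 0.60])   # P(w1),  P(w2)

unnormalised = likelihoods * priors               # p(x|wi) P(wi)
posteriors   = unnormalised / unnormalised.sum()  # divide by p(x)

# Dividing by p(x) rescales both scores equally, so the decision is unchanged
print(np.argmax(unnormalised) == np.argmax(posteriors))  # True
```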

Unimodal Multivariate Gaussian Distribution

$$p(\vec{x}\,|\,\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\; e^{-\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T \Sigma_i^{-1} (\vec{x}-\vec{\mu}_i)}$$

where

$$\Sigma_i = E\big[(\vec{x}-\vec{\mu}_i)(\vec{x}-\vec{\mu}_i)^T\big], \qquad \vec{\mu}_i = E[\vec{x}_i]$$

$$\ln g_1(\vec{x}) = \ln\big[p(\vec{x}\,|\,\omega_1)\,P(\omega_1)\big]
= -\frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|C_1| - \frac{1}{2}(\vec{x}-\vec{\mu}_1)^T C_1^{-1} (\vec{x}-\vec{\mu}_1) + \ln P(\omega_1)$$
Similarly for $g_2(\vec{x}) = p(\vec{x}\,|\,\omega_2)\,P(\omega_2)$.
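To make the formula concrete, here is a short sketch that evaluates $\ln g_1(\vec{x})$ directly and cross-checks the density term against SciPy (the parameter values are arbitrary placeholders):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters for class w1
mu1 = np.array([1.0, 2.0])
C1 = np.array([[2.0, 0.3],
               [0.3, 1.0]])
prior1 = 0.5
x = np.array([0.5, 1.5])

d = len(x)
diff = x - mu1
ln_g1 = (-0.5 * d * np.log(2 * np.pi)
         - 0.5 * np.log(np.linalg.det(C1))
         - 0.5 * diff @ np.linalg.inv(C1) @ diff
         + np.log(prior1))

# Cross-check the density term against SciPy's implementation
ln_g1_scipy = multivariate_normal.logpdf(x, mean=mu1, cov=C1) + np.log(prior1)
print(np.isclose(ln_g1, ln_g1_scipy))  # True
```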


Discriminating Function: $g(\vec{x}) = \ln g_1(\vec{x}) - \ln g_2(\vec{x})$

CASE-1: $C_1 = C_2 = \sigma^2 I$ (fewer parameters $\Rightarrow$ less data required)

The quadratic term $\vec{x}^T\vec{x}$ is the same in both $g_1(\vec{x})$ and $g_2(\vec{x})$, so the discriminant reduces to a linear form. Hence we assume

$$g_i(\vec{x}) = \vec{\omega}_i^{\,t}\vec{x} + \omega_{i0}$$

Now, neglecting the terms that cancel out in $\ln g_1(\vec{x}) - \ln g_2(\vec{x})$, we get:
$$g_1(\vec{x}) = \frac{-1}{2\sigma^2}\big(\vec{\mu}_1^{\,T}\vec{\mu}_1 - 2\vec{\mu}_1^{\,T}\vec{x}\big) + \ln P(\omega_1)$$
Comparing our assumed form with this equation, we get:
$$\vec{\omega}_i = \frac{\vec{\mu}_i}{\sigma^2} \qquad \text{and} \qquad \omega_{i0} = \frac{-1}{2\sigma^2}\vec{\mu}_i^{\,T}\vec{\mu}_i + \ln P(\omega_i)$$


Decision Boundary: $g(\vec{x}) = \vec{\omega}_1^{\,T}\vec{x} + \omega_{10} - \vec{\omega}_2^{\,T}\vec{x} - \omega_{20} = 0$

So $g(\vec{x}) = \vec{\omega}^{\,T}\vec{x} + \omega_0$ (a straight line / hyperplane), where
$$\vec{\omega} = \vec{\omega}_1 - \vec{\omega}_2 = \frac{1}{\sigma^2}(\vec{\mu}_1 - \vec{\mu}_2)$$
$$\omega_0 = \omega_{10} - \omega_{20} = \frac{-1}{2\sigma^2}\big(\vec{\mu}_1^{\,T}\vec{\mu}_1 - \vec{\mu}_2^{\,T}\vec{\mu}_2\big) + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
Now, since
$$\vec{\mu}_1^{\,T}\vec{\mu}_1 - \vec{\mu}_2^{\,T}\vec{\mu}_2 = \|\vec{\mu}_1\|^2 - \|\vec{\mu}_2\|^2 = (\vec{\mu}_1 - \vec{\mu}_2)^T(\vec{\mu}_1 + \vec{\mu}_2),$$
$$g(\vec{x}) = \frac{1}{\sigma^2}(\vec{\mu}_1 - \vec{\mu}_2)^T\vec{x} - \frac{1}{2\sigma^2}(\vec{\mu}_1 - \vec{\mu}_2)^T(\vec{\mu}_1 + \vec{\mu}_2) + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
$$= \frac{1}{\sigma^2}(\vec{\mu}_1 - \vec{\mu}_2)^T\left[\vec{x} - \frac{1}{2}(\vec{\mu}_1 + \vec{\mu}_2) + \frac{\sigma^2(\vec{\mu}_1 - \vec{\mu}_2)}{\|\vec{\mu}_1 - \vec{\mu}_2\|^2}\ln\frac{P(\omega_1)}{P(\omega_2)}\right]$$
$$= \vec{\omega}^{\,T}(\vec{x} - \vec{x}_0) = 0 \qquad \text{(i.e. the separating hyperplane passes through } \vec{x}_0\text{)}$$
Now, if $P(\omega_1) = P(\omega_2)$, the boundary perpendicularly bisects the line joining $\vec{\mu}_1$ and $\vec{\mu}_2$.
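A short sketch of the Case-1 boundary in Python/NumPy, reading $\vec{x}_0$ off the bracketed expression above (means, $\sigma^2$, and priors are illustrative), confirming that the hyperplane passes through $\vec{x}_0$ and separates the two means:

```python
import numpy as np

# Illustrative means, shared covariance sigma^2 * I, and priors
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
sigma2 = 1.5
P1, P2 = 0.6, 0.4

w = (mu1 - mu2) / sigma2
x0 = 0.5 * (mu1 + mu2) - (sigma2 * np.log(P1 / P2)
                          / np.dot(mu1 - mu2, mu1 - mu2)) * (mu1 - mu2)

def g(x):
    """Case-1 discriminant w^t (x - x0); positive side -> class w1."""
    return w @ (x - x0)

print(np.isclose(g(x0), 0.0))      # True: the boundary passes through x0
print(g(mu1) > 0, g(mu2) < 0)      # True True: each mean on its own side
```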
CASE-2: $C_1 = C_2 = C$, a shared diagonal covariance $C = \begin{pmatrix}\sigma_1^2 & 0 \\ 0 & \sigma_2^2\end{pmatrix}$ (i.e. $\sigma_{jk} = 0$ for $j \neq k$)
$$g_i(\vec{x}) = \frac{-1}{2}(\vec{x} - \vec{\mu}_i)^t C^{-1}(\vec{x} - \vec{\mu}_i) + \ln P(\omega_i)$$
$$= \frac{-1}{2}\vec{x}^t C^{-1}\vec{x} + \frac{1}{2}\vec{\mu}_i^{\,t} C^{-1}\vec{x} + \frac{1}{2}\vec{x}^t C^{-1}\vec{\mu}_i - \frac{1}{2}\vec{\mu}_i^{\,t} C^{-1}\vec{\mu}_i + \ln P(\omega_i)$$
Ignoring the terms that do not depend on $i$ (they cancel out),
$$g_i(\vec{x}) = (C^{-1}\vec{\mu}_i)^t\vec{x} - \frac{1}{2}\vec{\mu}_i^{\,t} C^{-1}\vec{\mu}_i + \ln P(\omega_i) = \vec{\omega}_i^{\,t}\vec{x} + \omega_{i0}$$
Now, the discriminating boundary is given by:
$$g(\vec{x}) = (C^{-1}\vec{\mu}_1 - C^{-1}\vec{\mu}_2)^t\vec{x} - \frac{1}{2}\vec{\mu}_1^{\,t} C^{-1}\vec{\mu}_1 + \frac{1}{2}\vec{\mu}_2^{\,t} C^{-1}\vec{\mu}_2 + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
$$= \big(C^{-1}(\vec{\mu}_1 - \vec{\mu}_2)\big)^t\vec{x} - \frac{1}{2}(\vec{\mu}_1 - \vec{\mu}_2)^t C^{-1}(\vec{\mu}_1 + \vec{\mu}_2) + \ln\frac{P(\omega_1)}{P(\omega_2)}$$


On comparing with the equation of the plane $g(\vec{x}) = \vec{\omega}^{\,t}(\vec{x} - \vec{x}_0)$:
$$\vec{\omega} = C^{-1}(\vec{\mu}_1 - \vec{\mu}_2)$$
$$\vec{x}_0 = \frac{1}{2}(\vec{\mu}_1 + \vec{\mu}_2) - \frac{\ln\big(P(\omega_1)/P(\omega_2)\big)}{(\vec{\mu}_1 - \vec{\mu}_2)^t C^{-1}(\vec{\mu}_1 - \vec{\mu}_2)}\,(\vec{\mu}_1 - \vec{\mu}_2)$$

Notice that $C^{-1}$ rotates and scales $(\vec{\mu}_1 - \vec{\mu}_2)$, so $\vec{\omega}$ will in general not point in the direction of $(\vec{\mu}_1 - \vec{\mu}_2)$. Also, if the priors are equal, $\vec{x}_0 = \frac{1}{2}(\vec{\mu}_1 + \vec{\mu}_2)$, i.e. the boundary still passes through the midpoint, but its direction is transformed.
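A brief Case-2 sketch (illustrative means and a diagonal shared covariance) showing that $\vec{\omega}$ is not parallel to $(\vec{\mu}_1 - \vec{\mu}_2)$, while $\vec{x}_0$ stays at the midpoint when the priors are equal:

```python
import numpy as np

# Illustrative means and a shared (non-spherical) covariance
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
C = np.array([[2.0, 0.0],
              [0.0, 0.5]])
P1, P2 = 0.5, 0.5

C_inv = np.linalg.inv(C)
diff = mu1 - mu2

w = C_inv @ diff
x0 = 0.5 * (mu1 + mu2) - (np.log(P1 / P2)
                          / (diff @ C_inv @ diff)) * diff

# w is generally not parallel to (mu1 - mu2): compare directions
cos_angle = w @ diff / (np.linalg.norm(w) * np.linalg.norm(diff))
print(np.isclose(cos_angle, 1.0))          # False: directions differ
print(np.allclose(x0, 0.5 * (mu1 + mu2)))  # True: equal priors -> midpoint
```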

Note that the contours of a Gaussian are curves of equal probability density, symmetric about the mean.


CASE-3: $C_1 \neq C_2$ (they can be diagonal)


Again, neglecting the terms that do not affect the decision:
$$g_i(\vec{x}) = \frac{-1}{2}\vec{x}^t C_i^{-1}\vec{x} + \vec{\mu}_i^{\,t} C_i^{-1}\vec{x} - \frac{1}{2}\vec{\mu}_i^{\,t} C_i^{-1}\vec{\mu}_i + \ln P(\omega_i) - \frac{1}{2}\ln|C_i|$$

$$g(\vec{x}) = \vec{x}^t W \vec{x} + \vec{\omega}^{\,t}\vec{x} + \omega_0 = 0, \quad \text{where}$$
$$W = \frac{-1}{2}\big(C_1^{-1} - C_2^{-1}\big)$$
$$\vec{\omega} = C_1^{-1}\vec{\mu}_1 - C_2^{-1}\vec{\mu}_2$$
$$\omega_0 = \frac{-1}{2}\big(\vec{\mu}_1^{\,t} C_1^{-1}\vec{\mu}_1 - \vec{\mu}_2^{\,t} C_2^{-1}\vec{\mu}_2\big) - \frac{1}{2}\ln\frac{|C_1|}{|C_2|} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
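A minimal sketch of the Case-3 quadratic discriminant (the class parameters are illustrative), evaluating $g(\vec{x})$ at the two means:

```python
import numpy as np

# Illustrative class parameters with unequal covariances
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
C1 = np.array([[1.0, 0.0],
               [0.0, 2.0]])
C2 = np.array([[0.5, 0.0],
               [0.0, 0.5]])
P1, P2 = 0.5, 0.5

C1_inv, C2_inv = np.linalg.inv(C1), np.linalg.inv(C2)

W  = -0.5 * (C1_inv - C2_inv)
w  = C1_inv @ mu1 - C2_inv @ mu2
w0 = (-0.5 * (mu1 @ C1_inv @ mu1 - mu2 @ C2_inv @ mu2)
      - 0.5 * np.log(np.linalg.det(C1) / np.linalg.det(C2))
      + np.log(P1 / P2))

def g(x):
    """Quadratic discriminant x^t W x + w^t x + w0; positive -> class w1."""
    return x @ W @ x + w @ x + w0

print(g(mu1) > 0, g(mu2) < 0)  # True True: each mean on its own side
```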


Summary of the footprint of the density function

Covariance     | 2D      | 3D        | nD             | Eigenvectors parallel to axes?
C = σ²I        | Circle  | Sphere    | Hypersphere    | Yes
C = Diagonal   | Ellipse | Ellipsoid | Hyperellipsoid | Yes
C = Full       | Ellipse | Ellipsoid | Hyperellipsoid | No
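The last column can be checked numerically: the eigenvectors of the covariance give the principal axes of the equal-density contours. A small sketch (the example covariances are arbitrary):

```python
import numpy as np

# Eigenvectors of the covariance give the axes of the equal-density contours
covariances = {
    "spherical (sigma^2 I)": 2.0 * np.eye(2),          # circular contours
    "diagonal":              np.diag([4.0, 1.0]),      # axis-aligned ellipse
    "full":                  np.array([[4.0, 1.5],
                                       [1.5, 1.0]]),   # rotated ellipse
}

for name, C in covariances.items():
    _, vecs = np.linalg.eigh(C)
    # Axis-parallel eigenvectors have a single component of magnitude 1
    axis_parallel = all(np.isclose(np.max(np.abs(v)), 1.0) for v in vecs.T)
    print(f"{name}: eigenvectors parallel to axes? {axis_parallel}")
```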

An extended quadratic discriminant function in two dimensions is
$$Ax^2 + By^2 + Cx + Dy + E = 0$$
$A = B$ $\Rightarrow$ circle
$A$ or $B$ is zero $\Rightarrow$ parabola
$A \cdot B > 0$ $\Rightarrow$ ellipse
$A \cdot B < 0$ $\Rightarrow$ hyperbola
$A = B = 0$ $\Rightarrow$ hyperplane (straight line)
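These rules can be read directly off the quadratic coefficients; a minimal Python sketch (the function name conic_type is hypothetical):

```python
def conic_type(A, B):
    """Classify Ax^2 + By^2 + Cx + Dy + E = 0 by its quadratic coefficients."""
    if A == 0 and B == 0:
        return "straight line (hyperplane)"
    if A == B:
        return "circle"
    if A == 0 or B == 0:
        return "parabola"
    if A * B > 0:
        return "ellipse"
    return "hyperbola"          # A * B < 0

print(conic_type(1, 1))    # circle
print(conic_type(2, 0))    # parabola
print(conic_type(2, 3))    # ellipse
print(conic_type(1, -4))   # hyperbola
print(conic_type(0, 0))    # straight line (hyperplane)
```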
