
PATTERN RECOGNITION
Sergios Theodoridis
Konstantinos Koutroumbas
Version 2
Typical application areas
Machine vision
Character recognition (OCR)
Computer aided diagnosis
Speech recognition
Face recognition
Biometrics
Image Data Base retrieval
Data mining
Bioinformatics

The task: Assign unknown objects (patterns) into the correct class. This is known as classification.

Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.

Feature vectors: A number of features
$x_1, \dots, x_l$
constitute the feature vector
$\mathbf{x} = [x_1, \dots, x_l]^T \in R^l$

Feature vectors are treated as random vectors.

An example: (figure omitted)
The classifier consists of a set of functions whose values, computed at $\mathbf{x}$, determine the class to which the corresponding pattern belongs.

Classification system overview

Patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Supervised vs. unsupervised pattern recognition: the two major directions.

Supervised: Patterns whose class is known a priori are used for training.

Unsupervised: The number of classes is (in general) unknown and no training patterns are available.
CLASSIFIERS BASED ON BAYES DECISION
THEORY

Statistical nature of feature vectors


$\mathbf{x} = [x_1, x_2, \dots, x_l]^T$

Assign the pattern represented by the feature vector $\mathbf{x}$ to the most probable of the available classes
$\omega_1, \omega_2, \dots, \omega_M$

That is, $\mathbf{x} \rightarrow \omega_i$ for which $P(\omega_i \mid \mathbf{x})$ is maximum.
Computation of a-posteriori probabilities

Assume known the a-priori probabilities
$P(\omega_1), P(\omega_2), \dots, P(\omega_M)$
and the class-conditional pdfs
$p(\mathbf{x} \mid \omega_i), \quad i = 1, 2, \dots, M$

The latter is also known as the likelihood of $\mathbf{x}$ with respect to $\omega_i$.
The Bayes rule ($M = 2$):

$p(\mathbf{x}) P(\omega_i \mid \mathbf{x}) = p(\mathbf{x} \mid \omega_i) P(\omega_i)$

$P(\omega_i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid \omega_i) P(\omega_i)}{p(\mathbf{x})}$

where

$p(\mathbf{x}) = \sum_{i=1}^{2} p(\mathbf{x} \mid \omega_i) P(\omega_i)$
The Bayes classification rule (for two classes, $M = 2$):
Given $\mathbf{x}$, classify it according to the rule

If $P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x})$, then $\mathbf{x} \rightarrow \omega_1$
If $P(\omega_2 \mid \mathbf{x}) > P(\omega_1 \mid \mathbf{x})$, then $\mathbf{x} \rightarrow \omega_2$

Equivalently: classify $\mathbf{x}$ according to the rule

$p(\mathbf{x} \mid \omega_1) P(\omega_1) \gtrless p(\mathbf{x} \mid \omega_2) P(\omega_2)$

For equiprobable classes the test becomes

$p(\mathbf{x} \mid \omega_1) \gtrless p(\mathbf{x} \mid \omega_2)$
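As an illustration of this rule, here is a minimal Python sketch (not from the slides; the Gaussian likelihoods and priors are arbitrary placeholder values) that compares $p(x \mid \omega_1) P(\omega_1)$ against $p(x \mid \omega_2) P(\omega_2)$ for a one-dimensional $x$:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_decide(x, mu1, sigma1, P1, mu2, sigma2, P2):
    """Two-class Bayes rule: pick the class with the larger p(x|w_i) * P(w_i)."""
    score1 = gaussian_pdf(x, mu1, sigma1) * P1
    score2 = gaussian_pdf(x, mu2, sigma2) * P2
    return 1 if score1 > score2 else 2

# Example with arbitrary (assumed) class models.
print(bayes_decide(0.3, mu1=0.0, sigma1=1.0, P1=0.5,
                        mu2=1.0, sigma2=1.0, P2=0.5))   # -> 1
```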
(Figure: the regions $R_1$ ($\omega_1$) and $R_2$ ($\omega_2$).)
Equivalently, in words: divide the space into two regions

If $\mathbf{x} \in R_1$, then $\mathbf{x}$ is assigned to $\omega_1$
If $\mathbf{x} \in R_2$, then $\mathbf{x}$ is assigned to $\omega_2$

Probability of error (the total shaded area):

$P_e = \int_{-\infty}^{x_0} p(x \mid \omega_2)\, dx + \int_{x_0}^{+\infty} p(x \mid \omega_1)\, dx$

The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability!
Indeed: moving the threshold away from $x_0$, the total shaded area INCREASES by the extra grey area.
The Bayes classification rule for many ($M > 2$) classes:
Given $\mathbf{x}$, classify it to $\omega_i$ if:

$P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x}) \quad \forall j \neq i$

Such a choice also minimizes the classification error probability.

Minimizing the average risk
For each wrong decision a penalty term is assigned, since some decisions are more sensitive than others.
For $M = 2$, define the loss matrix

$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$

$\lambda_{12}$ is the penalty term for deciding class $\omega_2$ although the pattern belongs to $\omega_1$, etc.

Risk with respect to $\omega_1$:

$r_1 = \lambda_{11} \int_{R_1} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x} + \lambda_{12} \int_{R_2} p(\mathbf{x} \mid \omega_1)\, d\mathbf{x}$
Risk with respect to $\omega_2$:

$r_2 = \lambda_{21} \int_{R_1} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x} + \lambda_{22} \int_{R_2} p(\mathbf{x} \mid \omega_2)\, d\mathbf{x}$

These are probabilities of wrong decisions, weighted by the penalty terms.

Average risk:

$r = r_1 P(\omega_1) + r_2 P(\omega_2)$
Choose $R_1$ and $R_2$ so that $r$ is minimized.
Then assign $\mathbf{x}$ to $\omega_1$ if $l_1 < l_2$, where

$l_1 \equiv \lambda_{11}\, p(\mathbf{x} \mid \omega_1) P(\omega_1) + \lambda_{21}\, p(\mathbf{x} \mid \omega_2) P(\omega_2)$
$l_2 \equiv \lambda_{12}\, p(\mathbf{x} \mid \omega_1) P(\omega_1) + \lambda_{22}\, p(\mathbf{x} \mid \omega_2) P(\omega_2)$

Equivalently: assign $\mathbf{x}$ to $\omega_1$ ($\omega_2$) if

$l_{12} \equiv \dfrac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} \; > (<) \; \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$

$l_{12}$ is the likelihood ratio.
If $P(\omega_1) = P(\omega_2) = \frac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:

$\mathbf{x} \rightarrow \omega_1$ if $p(\mathbf{x} \mid \omega_1) > \dfrac{\lambda_{21}}{\lambda_{12}}\, p(\mathbf{x} \mid \omega_2)$
$\mathbf{x} \rightarrow \omega_2$ if $p(\mathbf{x} \mid \omega_2) > \dfrac{\lambda_{12}}{\lambda_{21}}\, p(\mathbf{x} \mid \omega_1)$

If $\lambda_{21} = \lambda_{12}$, this reduces to the minimum classification error probability rule.
An example:

$p(x \mid \omega_1) = \dfrac{1}{\sqrt{\pi}} \exp(-x^2)$

$p(x \mid \omega_2) = \dfrac{1}{\sqrt{\pi}} \exp(-(x-1)^2)$

$P(\omega_1) = P(\omega_2) = \dfrac{1}{2}$

$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$
The threshold value $x_0$ for minimum $P_e$:

$x_0: \; \exp(-x^2) = \exp(-(x-1)^2) \;\Rightarrow\; x_0 = \dfrac{1}{2}$

The threshold $\hat{x}_0$ for minimum $r$:

$\hat{x}_0: \; \exp(-x^2) = 2 \exp(-(x-1)^2) \;\Rightarrow\; \hat{x}_0 = \dfrac{1 - \ln 2}{2} < \dfrac{1}{2}$
Thus $\hat{x}_0$ moves to the left of $x_0 = \dfrac{1}{2}$ (WHY?)
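A short numerical check of the two thresholds above (a sketch; it simply evaluates the closed-form expressions of this example and verifies the crossing condition for the minimum-risk threshold):

```python
import numpy as np

# Class-conditional pdfs of the example: p(x|w1) ~ exp(-x^2), p(x|w2) ~ exp(-(x-1)^2)
x0_min_error = 0.5                      # exp(-x^2) = exp(-(x-1)^2)  =>  x0 = 1/2
x0_min_risk = (1 - np.log(2)) / 2       # exp(-x^2) = 2*exp(-(x-1)^2), since l21/l12 = 1.0/0.5 = 2

# Verify: at the minimum-risk threshold the weighted likelihoods are equal.
lhs = np.exp(-x0_min_risk**2)
rhs = 2 * np.exp(-(x0_min_risk - 1)**2)
print(x0_min_error, x0_min_risk)        # 0.5  0.1534...
print(np.isclose(lhs, rhs))             # True
```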
DISCRIMINANT FUNCTIONS AND DECISION SURFACES

If $R_i$, $R_j$ are contiguous: $g(\mathbf{x}) \equiv P(\omega_i \mid \mathbf{x}) - P(\omega_j \mid \mathbf{x}) = 0$

$R_i: P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$
$R_j: P(\omega_j \mid \mathbf{x}) > P(\omega_i \mid \mathbf{x})$

$g(\mathbf{x}) = 0$ is the surface separating the regions. On one side it is positive (+), on the other it is negative (−). It is known as a Decision Surface.
If $f(\cdot)$ is monotonic, the rule remains the same if we use:

$\mathbf{x} \rightarrow \omega_i$ if $f(P(\omega_i \mid \mathbf{x})) > f(P(\omega_j \mid \mathbf{x})) \quad \forall j \neq i$

$g_i(\mathbf{x}) \equiv f(P(\omega_i \mid \mathbf{x}))$ is a discriminant function.

In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.
BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Multivariate Gaussian pdf:

$p(\mathbf{x} \mid \omega_i) = \dfrac{1}{(2\pi)^{l/2} |\Sigma_i|^{1/2}} \exp\left(-\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)$

$\boldsymbol{\mu}_i = E[\mathbf{x}]$ is the mean vector in $\omega_i$
$\Sigma_i = E[(\mathbf{x} - \boldsymbol{\mu}_i)(\mathbf{x} - \boldsymbol{\mu}_i)^T]$ is the $l \times l$ covariance matrix
$\ln(\cdot)$ is monotonic. Define:

$g_i(\mathbf{x}) = \ln(p(\mathbf{x} \mid \omega_i) P(\omega_i)) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$

$g_i(\mathbf{x}) = -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i) + C_i$

$C_i = -\dfrac{l}{2} \ln 2\pi - \dfrac{1}{2} \ln |\Sigma_i|$

Example: $\Sigma_i = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$
$g_i(\mathbf{x}) = -\dfrac{1}{2\sigma^2} (x_1^2 + x_2^2) + \dfrac{1}{\sigma^2} (\mu_{i1} x_1 + \mu_{i2} x_2) - \dfrac{1}{2\sigma^2} (\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$

That is, $g_i(\mathbf{x})$ is quadratic and the surfaces $g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.

For example: (figure omitted)
Decision Hyperplanes

Quadratic terms: $\mathbf{x}^T \Sigma_i^{-1} \mathbf{x}$

If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest: they are not involved in the comparisons. Then, equivalently, we can write:

$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$
$\mathbf{w}_i = \Sigma^{-1} \boldsymbol{\mu}_i$
$w_{i0} = \ln P(\omega_i) - \dfrac{1}{2} \boldsymbol{\mu}_i^T \Sigma^{-1} \boldsymbol{\mu}_i$

The discriminant functions are LINEAR.
Let, in addition, $\Sigma = \sigma^2 I$. Then:

$g_i(\mathbf{x}) = \dfrac{1}{\sigma^2} \boldsymbol{\mu}_i^T \mathbf{x} + w_{i0}$

$g_{ij}(\mathbf{x}) \equiv g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0 = \mathbf{w}^T (\mathbf{x} - \mathbf{x}_0)$

$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$

$\mathbf{x}_0 = \dfrac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \sigma^2 \ln\left(\dfrac{P(\omega_i)}{P(\omega_j)}\right) \dfrac{\boldsymbol{\mu}_i - \boldsymbol{\mu}_j}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2}$
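A minimal sketch (toy means, $\sigma$ and priors assumed, not from the slides) that builds the hyperplane parameters $\mathbf{w}$ and $\mathbf{x}_0$ for the $\Sigma = \sigma^2 I$ case and uses their sign test to classify a point:

```python
import numpy as np

def hyperplane_sigma2I(mu_i, mu_j, sigma2, P_i, P_j):
    """Decision hyperplane w^T (x - x0) = 0 for equal covariance sigma^2 * I."""
    w = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - sigma2 * np.log(P_i / P_j) * w / np.dot(w, w)
    return w, x0

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
w, x0 = hyperplane_sigma2I(mu1, mu2, sigma2=1.0, P_i=0.5, P_j=0.5)

x = np.array([0.4, 0.7])
side = np.dot(w, x - x0)          # > 0 -> class w1 side, < 0 -> class w2 side
print(w, x0, side)                # side = 1.8 > 0, so x falls on the w1 side
```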
Nondiagonal case: $\Sigma \neq \sigma^2 I$

$g_{ij}(\mathbf{x}) = \mathbf{w}^T (\mathbf{x} - \mathbf{x}_0) = 0$

$\mathbf{w} = \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$

$\mathbf{x}_0 = \dfrac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \ln\left(\dfrac{P(\omega_i)}{P(\omega_j)}\right) \dfrac{\boldsymbol{\mu}_i - \boldsymbol{\mu}_j}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_{\Sigma^{-1}}^2}$

where $\|\mathbf{x}\|_{\Sigma^{-1}} \equiv (\mathbf{x}^T \Sigma^{-1} \mathbf{x})^{1/2}$

The decision hyperplane is not normal to $\boldsymbol{\mu}_i - \boldsymbol{\mu}_j$; it is normal to $\Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$.
Minimum Distance Classifiers

For equiprobable classes, $P(\omega_i) = \dfrac{1}{M}$:

$g_i(\mathbf{x}) = -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$

For $\Sigma = \sigma^2 I$: assign $\mathbf{x} \rightarrow \omega_i$ with the smaller
Euclidean distance: $d_E = \|\mathbf{x} - \boldsymbol{\mu}_i\|$

For $\Sigma \neq \sigma^2 I$: assign $\mathbf{x} \rightarrow \omega_i$ with the smaller
Mahalanobis distance: $d_m = \left((\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)^{1/2}$
Example:
Given $\omega_1$, $\omega_2$ with $P(\omega_1) = P(\omega_2)$ and $p(\mathbf{x} \mid \omega_1) = N(\boldsymbol{\mu}_1, \Sigma)$, $p(\mathbf{x} \mid \omega_2) = N(\boldsymbol{\mu}_2, \Sigma)$, where

$\boldsymbol{\mu}_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \boldsymbol{\mu}_2 = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{bmatrix}$

classify the vector $\mathbf{x} = \begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix}$ using Bayesian classification:

$\Sigma^{-1} = \begin{bmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{bmatrix}$

Compute the Mahalanobis distances from $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$:

$d_{m,1}^2 = [1.0,\ 2.2]\, \Sigma^{-1} \begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix} = 2.952, \qquad d_{m,2}^2 = [-2.0,\ -0.8]\, \Sigma^{-1} \begin{bmatrix} -2.0 \\ -0.8 \end{bmatrix} = 3.672$

Classify $\mathbf{x} \rightarrow \omega_1$. Observe that $d_{E,2} < d_{E,1}$.
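The example's numbers can be reproduced with a few lines of numpy (a verification sketch, not part of the original slides):

```python
import numpy as np

mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)          # [[0.95, -0.15], [-0.15, 0.55]]

x = np.array([1.0, 2.2])

def mahalanobis_sq(x, mu, S_inv):
    d = x - mu
    return d @ S_inv @ d

print(mahalanobis_sq(x, mu1, Sigma_inv))  # 2.952  -> smaller: assign x to w1
print(mahalanobis_sq(x, mu2, Sigma_inv))  # 3.672
print(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2))  # Euclidean: d_E,2 < d_E,1
```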
ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

Maximum Likelihood

Let $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ be known and independent.
Let $p(\mathbf{x})$ be known within an unknown vector parameter $\boldsymbol{\theta}$: $\; p(\mathbf{x}) \equiv p(\mathbf{x}; \boldsymbol{\theta})$

$X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$

$p(X; \boldsymbol{\theta}) \equiv p(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N; \boldsymbol{\theta}) = \prod_{k=1}^{N} p(\mathbf{x}_k; \boldsymbol{\theta})$

which is known as the likelihood of $\boldsymbol{\theta}$ with respect to $X$.

The method:

$\hat{\boldsymbol{\theta}}_{ML} = \arg\max_{\boldsymbol{\theta}} \prod_{k=1}^{N} p(\mathbf{x}_k; \boldsymbol{\theta})$

$L(\boldsymbol{\theta}) \equiv \ln p(X; \boldsymbol{\theta}) = \sum_{k=1}^{N} \ln p(\mathbf{x}_k; \boldsymbol{\theta})$

$\hat{\boldsymbol{\theta}}_{ML}: \quad \dfrac{\partial L(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum_{k=1}^{N} \dfrac{1}{p(\mathbf{x}_k; \boldsymbol{\theta})} \dfrac{\partial p(\mathbf{x}_k; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = 0$
If, indeed, there is a $\boldsymbol{\theta}_0$ such that $p(\mathbf{x}) = p(\mathbf{x}; \boldsymbol{\theta}_0)$, then

$\lim_{N \to \infty} E[\hat{\boldsymbol{\theta}}_{ML}] = \boldsymbol{\theta}_0$

$\lim_{N \to \infty} E\left[\|\hat{\boldsymbol{\theta}}_{ML} - \boldsymbol{\theta}_0\|^2\right] = 0$

That is, the ML estimator is asymptotically unbiased and consistent.
Example:
$p(\mathbf{x}): N(\boldsymbol{\mu}, \Sigma)$, with $\boldsymbol{\mu}$ unknown. Given $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$, $\; p(\mathbf{x}_k) \equiv p(\mathbf{x}_k; \boldsymbol{\mu})$:

$p(\mathbf{x}_k; \boldsymbol{\mu}) = \dfrac{1}{(2\pi)^{l/2} |\Sigma|^{1/2}} \exp\left(-\dfrac{1}{2} (\mathbf{x}_k - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}_k - \boldsymbol{\mu})\right)$

$L(\boldsymbol{\mu}) = \sum_{k=1}^{N} \ln p(\mathbf{x}_k; \boldsymbol{\mu}) = C - \dfrac{1}{2} \sum_{k=1}^{N} (\mathbf{x}_k - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}_k - \boldsymbol{\mu})$

$\dfrac{\partial L(\boldsymbol{\mu})}{\partial \boldsymbol{\mu}} = \sum_{k=1}^{N} \Sigma^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) = 0 \;\Rightarrow\; \hat{\boldsymbol{\mu}}_{ML} = \dfrac{1}{N} \sum_{k=1}^{N} \mathbf{x}_k$

Remember: if $A = A^T$, then $\dfrac{\partial (\mathbf{x}^T A \mathbf{x})}{\partial \mathbf{x}} = 2 A \mathbf{x}$.
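A quick numerical illustration (a sketch with synthetic data, not from the slides) that the sample mean solves the likelihood equation, i.e., the gradient $\sum_k \Sigma^{-1}(\mathbf{x}_k - \boldsymbol{\mu})$ vanishes at $\hat{\boldsymbol{\mu}}_{ML}$:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=500)   # synthetic data

mu_ml = X.mean(axis=0)                                   # ML estimate = sample mean

# Gradient of the log-likelihood w.r.t. mu, evaluated at mu_ml: should be ~0.
grad = np.linalg.inv(Sigma) @ (X - mu_ml).sum(axis=0)
print(mu_ml, grad)                                       # grad is numerically zero
```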

Maximum A Posteriori Probability (MAP) Estimation

In the ML method, $\boldsymbol{\theta}$ was considered a parameter. Here we shall look at $\boldsymbol{\theta}$ as a random vector described by a pdf $p(\boldsymbol{\theta})$, assumed to be known.

Given $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$, compute the maximum of $p(\boldsymbol{\theta} \mid X)$.

From the Bayes theorem:

$p(\boldsymbol{\theta})\, p(X \mid \boldsymbol{\theta}) = p(X)\, p(\boldsymbol{\theta} \mid X)$, or
$p(\boldsymbol{\theta} \mid X) = \dfrac{p(\boldsymbol{\theta})\, p(X \mid \boldsymbol{\theta})}{p(X)}$

The method:

$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid X)$, or
$\hat{\boldsymbol{\theta}}_{MAP}: \quad \dfrac{\partial}{\partial \boldsymbol{\theta}} \left(p(\boldsymbol{\theta})\, p(X \mid \boldsymbol{\theta})\right) = 0$

If $p(\boldsymbol{\theta})$ is uniform or broad enough, $\hat{\boldsymbol{\theta}}_{MAP} \approx \hat{\boldsymbol{\theta}}_{ML}$.
Example:
$p(\mathbf{x}): N(\boldsymbol{\mu}, \sigma^2 I)$, with $\boldsymbol{\mu}$ unknown, $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$

$p(\boldsymbol{\mu}) = \dfrac{1}{(2\pi)^{l/2} \sigma_\mu^l} \exp\left(-\dfrac{\|\boldsymbol{\mu} - \boldsymbol{\mu}_0\|^2}{2\sigma_\mu^2}\right)$

$\hat{\boldsymbol{\mu}}_{MAP}: \quad \dfrac{\partial}{\partial \boldsymbol{\mu}} \ln\left(\prod_{k=1}^{N} p(\mathbf{x}_k \mid \boldsymbol{\mu})\, p(\boldsymbol{\mu})\right) = 0$, or
$\sum_{k=1}^{N} \dfrac{1}{\sigma^2} (\mathbf{x}_k - \boldsymbol{\mu}) - \dfrac{1}{\sigma_\mu^2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) = 0$

$\hat{\boldsymbol{\mu}}_{MAP} = \dfrac{\boldsymbol{\mu}_0 + \dfrac{\sigma_\mu^2}{\sigma^2} \sum_{k=1}^{N} \mathbf{x}_k}{1 + \dfrac{\sigma_\mu^2}{\sigma^2} N}$

For $\dfrac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$:

$\hat{\boldsymbol{\mu}}_{MAP} \approx \hat{\boldsymbol{\mu}}_{ML} = \dfrac{1}{N} \sum_{k=1}^{N} \mathbf{x}_k$
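A sketch (synthetic one-dimensional data, with assumed values for $\mu_0$, $\sigma$, $\sigma_\mu$) of the closed-form MAP estimate and its convergence to the ML estimate as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, sigma_mu, mu0 = 1.0, 2.0, 0.0      # assumed likelihood/prior parameters
true_mu = 3.0

def mu_map(x, sigma, sigma_mu, mu0):
    """Closed-form MAP estimate for a Gaussian mean with a Gaussian prior."""
    ratio = sigma_mu**2 / sigma**2
    return (mu0 + ratio * x.sum()) / (1 + ratio * len(x))

for N in (5, 50, 5000):
    x = rng.normal(true_mu, sigma, size=N)
    print(N, mu_map(x, sigma, sigma_mu, mu0), x.mean())   # MAP -> ML as N grows
```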
Bayesian Inference

ML and MAP compute a single estimate for $\boldsymbol{\theta}$. Here a different route is followed.

Given: $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, $p(\mathbf{x} \mid \boldsymbol{\theta})$ and $p(\boldsymbol{\theta})$
The goal: estimate $p(\mathbf{x} \mid X)$
How?
$p(\mathbf{x} \mid X) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid X)\, d\boldsymbol{\theta}$

$p(\boldsymbol{\theta} \mid X) = \dfrac{p(X \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(X)} = \dfrac{p(X \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(X \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}$

$p(X \mid \boldsymbol{\theta}) = \prod_{k=1}^{N} p(\mathbf{x}_k \mid \boldsymbol{\theta})$

A bit more insight via an example:

Let $p(x \mid \mu)$ be $N(\mu, \sigma^2)$ and $p(\mu)$ be $N(\mu_0, \sigma_0^2)$.
It turns out that $p(\mu \mid X)$ is $N(\mu_N, \sigma_N^2)$, where

$\mu_N = \dfrac{N \sigma_0^2\, \bar{x} + \sigma^2 \mu_0}{N \sigma_0^2 + \sigma^2}, \qquad \sigma_N^2 = \dfrac{\sigma_0^2 \sigma^2}{N \sigma_0^2 + \sigma^2}, \qquad \bar{x} = \dfrac{1}{N} \sum_{k=1}^{N} x_k$

The above is a sequence of Gaussians as $N \to \infty$.
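A minimal sketch (synthetic data, assumed $\mu_0$, $\sigma_0$, $\sigma$) computing $\mu_N$ and $\sigma_N^2$ and showing that the posterior of $\mu$ narrows around the data mean as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, mu0, sigma0 = 1.0, 0.0, 3.0        # assumed likelihood noise and prior
true_mu = 2.0

def posterior_params(x, sigma, mu0, sigma0):
    """Posterior p(mu | X) = N(mu_N, sigma_N^2) for Gaussian data with a Gaussian prior."""
    N, xbar = len(x), x.mean()
    mu_N = (N * sigma0**2 * xbar + sigma**2 * mu0) / (N * sigma0**2 + sigma**2)
    var_N = (sigma0**2 * sigma**2) / (N * sigma0**2 + sigma**2)
    return mu_N, var_N

for N in (1, 10, 1000):
    x = rng.normal(true_mu, sigma, size=N)
    print(N, posterior_params(x, sigma, mu0, sigma0))     # variance shrinks with N
```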

Maximum Entropy

Entropy: $H = -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}$

Find $p(\mathbf{x})$ that maximizes $H$, subject to the available constraints.
Example: $x$ is nonzero in the interval $x_1 \leq x \leq x_2$ and zero otherwise. Compute the ME pdf.

The constraint:

$\int_{x_1}^{x_2} p(x)\, dx = 1$

Lagrange multipliers:

$H_L = H + \lambda \left(\int_{x_1}^{x_2} p(x)\, dx - 1\right)$

$\hat{p}(x) = \exp(\lambda - 1)$

$\hat{p}(x) = \begin{cases} \dfrac{1}{x_2 - x_1} & x_1 \leq x \leq x_2 \\ 0 & \text{otherwise} \end{cases}$
Mixture Models

$p(\mathbf{x}) = \sum_{j=1}^{J} p(\mathbf{x} \mid j)\, P_j$

$\sum_{j=1}^{J} P_j = 1, \qquad \int_{\mathbf{x}} p(\mathbf{x} \mid j)\, d\mathbf{x} = 1$

Assume parametric modeling, i.e., $p(\mathbf{x} \mid j; \boldsymbol{\theta})$

The goal is to estimate $\boldsymbol{\theta}$ and $P_1, P_2, \dots, P_J$, given a set $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$.

Why not ML, as before?

$\max_{\boldsymbol{\theta}, P_1, \dots, P_J} \prod_{k=1}^{N} p(\mathbf{x}_k; \boldsymbol{\theta}, P_1, \dots, P_J)$
This is a nonlinear problem, due to the missing label information. It is a typical problem with an incomplete data set.

The Expectation-Maximisation (EM) algorithm.

General formulation:
Let $\mathbf{y} \in Y \subseteq R^m$ be the complete data set, with pdf $p_y(\mathbf{y}; \boldsymbol{\theta})$, which is not observed directly.
We observe $\mathbf{x} = g(\mathbf{y}) \in X_{ob} \subseteq R^l$, $l < m$, with pdf $p_x(\mathbf{x}; \boldsymbol{\theta})$, a many-to-one transformation.
Let $Y(\mathbf{x}) \subseteq Y$ be the set of all $\mathbf{y}$'s that map to a specific $\mathbf{x}$:

$p_x(\mathbf{x}; \boldsymbol{\theta}) = \int_{Y(\mathbf{x})} p_y(\mathbf{y}; \boldsymbol{\theta})\, d\mathbf{y}$

What we need is to compute

$\hat{\boldsymbol{\theta}}_{ML}: \quad \sum_k \dfrac{\partial \ln p_y(\mathbf{y}_k; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = 0$

But the $\mathbf{y}_k$'s are not observed. Here comes the EM: maximize the expectation of the log-likelihood, conditioned on the observed samples and the current iteration estimate of $\boldsymbol{\theta}$.
The algorithm:

E-step: $\quad Q(\boldsymbol{\theta}; \boldsymbol{\theta}(t)) = E\left[\sum_k \ln p_y(\mathbf{y}_k; \boldsymbol{\theta}) \,\middle|\, X; \boldsymbol{\theta}(t)\right]$

M-step: $\quad \boldsymbol{\theta}(t+1): \; \dfrac{\partial Q(\boldsymbol{\theta}; \boldsymbol{\theta}(t))}{\partial \boldsymbol{\theta}} = 0$

Application to the mixture modeling problem:
Complete data: $(\mathbf{x}_k, j_k), \; k = 1, 2, \dots, N$
Observed data: $\mathbf{x}_k, \; k = 1, 2, \dots, N$

$p(\mathbf{x}_k, j_k; \boldsymbol{\theta}) = p(\mathbf{x}_k \mid j_k; \boldsymbol{\theta})\, P_{j_k}$

Assuming mutual independence:

$L(\boldsymbol{\theta}) = \sum_{k=1}^{N} \ln\left(p(\mathbf{x}_k \mid j_k; \boldsymbol{\theta})\, P_{j_k}\right)$
Unknown parameters:

$\boldsymbol{\Theta}^T = [\boldsymbol{\theta}^T, \mathbf{P}^T], \qquad \mathbf{P} = [P_1, P_2, \dots, P_J]^T$

E-step:

$Q(\boldsymbol{\Theta}; \boldsymbol{\Theta}(t)) = \sum_{k=1}^{N} E\left[\ln\left(p(\mathbf{x}_k \mid j_k; \boldsymbol{\theta})\, P_{j_k}\right)\right] = \sum_{k=1}^{N} \sum_{j_k=1}^{J} P(j_k \mid \mathbf{x}_k; \boldsymbol{\Theta}(t)) \ln\left(p(\mathbf{x}_k \mid j_k; \boldsymbol{\theta})\, P_{j_k}\right)$

M-step:

$\dfrac{\partial Q}{\partial \boldsymbol{\theta}} = 0, \qquad \dfrac{\partial Q}{\partial P_{j_k}} = 0, \quad j_k = 1, 2, \dots, J$

where

$P(j \mid \mathbf{x}_k; \boldsymbol{\Theta}(t)) = \dfrac{p(\mathbf{x}_k \mid j; \boldsymbol{\theta}(t))\, P_j}{p(\mathbf{x}_k; \boldsymbol{\Theta}(t))}, \qquad p(\mathbf{x}_k; \boldsymbol{\Theta}(t)) = \sum_{j=1}^{J} p(\mathbf{x}_k \mid j; \boldsymbol{\theta}(t))\, P_j$
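A compact EM sketch for a one-dimensional Gaussian mixture (synthetic data; the closed-form Gaussian M-step updates below are a standard specialization used here only to illustrate the E/M iteration above):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data from a 2-component mixture (assumed ground truth).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initial guesses for means, variances, and mixing probabilities P_j.
mu, var, P = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # E-step: responsibilities P(j | x_k; Theta(t)).
    lik = np.stack([P[j] * gauss(x, mu[j], var[j]) for j in range(2)])   # (2, N)
    gamma = lik / lik.sum(axis=0, keepdims=True)
    # M-step: re-estimate parameters from the weighted samples.
    Nj = gamma.sum(axis=1)
    mu = (gamma * x).sum(axis=1) / Nj
    var = (gamma * (x - mu[:, None]) ** 2).sum(axis=1) / Nj
    P = Nj / len(x)

print(mu, var, P)   # should approach the generating parameters
```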
Nonparametric Estimation

$k_N$: the number of points inside a bin of width $h$; $N$: the total number of points.

$P \approx \dfrac{k_N}{N}$

$\hat{p}(x) \approx \hat{p}(\hat{x}) = \dfrac{1}{h} \dfrac{k_N}{N}, \qquad |x - \hat{x}| \leq \dfrac{h}{2}$

If $p(x)$ is continuous, $\hat{p}(x) \rightarrow p(x)$ as $N \rightarrow \infty$, if

$h_N \rightarrow 0, \qquad k_N \rightarrow \infty, \qquad \dfrac{k_N}{N} \rightarrow 0$

Parzen Windows
Divide the multidimensional space into hypercubes.
Define $\phi(\mathbf{x}_i)$:

$\phi(\mathbf{x}_i) = \begin{cases} 1 & |x_{ij}| \leq \dfrac{1}{2} \\ 0 & \text{otherwise} \end{cases}$

That is, it is 1 inside a unit-side hypercube centered at 0.

$\hat{p}(\mathbf{x}) = \dfrac{1}{h^l} \dfrac{1}{N} \sum_{i=1}^{N} \phi\left(\dfrac{\mathbf{x}_i - \mathbf{x}}{h}\right)$

$= \dfrac{1}{\text{volume}} \cdot \dfrac{1}{N} \cdot \left(\text{number of points inside an } h\text{-side hypercube centered at } \mathbf{x}\right)$

The problem: $p(\mathbf{x})$ is continuous, while $\phi(\cdot)$ is discontinuous.

Parzen windows - kernels - potential functions: choose a smooth $\phi(\mathbf{x})$ with

$\phi(\mathbf{x}) \geq 0, \qquad \int_{\mathbf{x}} \phi(\mathbf{x})\, d\mathbf{x} = 1$
Mean value:

$E[\hat{p}(\mathbf{x})] = \dfrac{1}{h^l} \dfrac{1}{N} \sum_{i=1}^{N} E\left[\phi\left(\dfrac{\mathbf{x}_i - \mathbf{x}}{h}\right)\right] = \dfrac{1}{h^l} \int_{\mathbf{x}'} \phi\left(\dfrac{\mathbf{x}' - \mathbf{x}}{h}\right) p(\mathbf{x}')\, d\mathbf{x}'$

As $h \rightarrow 0$: $\dfrac{1}{h^l} \rightarrow \infty$ and the width of $\phi\left(\dfrac{\mathbf{x}' - \mathbf{x}}{h}\right) \rightarrow 0$, while

$\dfrac{1}{h^l} \int \phi\left(\dfrac{\mathbf{x}' - \mathbf{x}}{h}\right) d\mathbf{x}' = 1$

Hence, as $h \rightarrow 0$, $\; \dfrac{1}{h^l} \phi\left(\dfrac{\mathbf{x}}{h}\right) \rightarrow \delta(\mathbf{x})$ and

$E[\hat{p}(\mathbf{x})] = \int_{\mathbf{x}'} \delta(\mathbf{x}' - \mathbf{x})\, p(\mathbf{x}')\, d\mathbf{x}' = p(\mathbf{x})$

Hence the estimate is unbiased in the limit.
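A one-dimensional Parzen estimate sketch (synthetic data; the Gaussian kernel is one common smooth choice for $\phi$, used here as an assumption rather than the slides' hypercube kernel):

```python
import numpy as np

def parzen_estimate(x_grid, samples, h):
    """Parzen window estimate p_hat(x) = (1 / (N h)) * sum_i phi((x_i - x) / h)."""
    u = (samples[None, :] - x_grid[:, None]) / h          # (len(grid), N)
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)      # smooth Gaussian kernel
    return phi.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, size=1000)                 # synthetic data
x_grid = np.linspace(-4, 4, 9)
print(parzen_estimate(x_grid, samples, h=0.3))            # compare with the N(0,1) density
```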


Variance: the smaller the $h$, the higher the variance.

(Figures: Parzen estimates for $h = 0.1$, $N = 1000$ and $h = 0.8$, $N = 1000$.)
(Figure: Parzen estimate for $h = 0.1$, $N = 10000$.)

The higher the $N$, the better the accuracy.
If $h \rightarrow 0$ and $N \rightarrow \infty$, with $N h^l \rightarrow \infty$, the estimate is asymptotically unbiased.

The method (classification): remember the likelihood ratio test

$l_{12} \equiv \dfrac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} \; > (<) \; \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$

and approximate the likelihoods by their Parzen estimates:

$\dfrac{\dfrac{1}{N_1 h^l} \sum_{i=1}^{N_1} \phi\left(\dfrac{\mathbf{x}_i - \mathbf{x}}{h}\right)}{\dfrac{1}{N_2 h^l} \sum_{i=1}^{N_2} \phi\left(\dfrac{\mathbf{x}_i - \mathbf{x}}{h}\right)} \; > (<) \; \dfrac{P(\omega_2)}{P(\omega_1)} \cdot \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$
CURSE OF DIMENSIONALITY

In all the methods so far, we saw that the higher the number of points, $N$, the better the resulting estimate.

If in the one-dimensional space an interval filled with $N$ points is adequate (for good estimation), then in the two-dimensional space the corresponding square will require $N^2$ points, and in the $l$-dimensional space the $l$-dimensional cube will require $N^l$ points.

The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.
NAIVE BAYES CLASSIFIER

Let $\mathbf{x} \in R^l$, and the goal is to estimate $p(\mathbf{x} \mid \omega_i)$, $i = 1, 2, \dots, M$. For a good estimate of the pdf one would need, say, $N^l$ points.

Assume $x_1, x_2, \dots, x_l$ to be mutually independent. Then:

$p(\mathbf{x} \mid \omega_i) = \prod_{j=1}^{l} p(x_j \mid \omega_i)$

In this case, one would require, roughly, $N$ points for each one-dimensional pdf. Thus, a number of points of the order $l N$ would suffice.

It turns out that the Naive Bayes classifier works reasonably well even in cases that violate the independence assumption.
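A minimal Gaussian naive Bayes sketch (synthetic data; modeling each one-dimensional factor $p(x_j \mid \omega_i)$ by a Gaussian is one common assumption, not prescribed by the slides):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Per-class, per-feature Gaussian parameters plus class priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def predict(x, params):
    best, best_score = None, -np.inf
    for c, (mu, var, prior) in params.items():
        # log p(x|w_c) = sum_j log N(x_j; mu_j, var_j), by the independence assumption.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = log_lik + np.log(prior)
        if score > best_score:
            best, best_score = c, score
    return best

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
print(predict(np.array([1.8, 2.1, 1.7]), fit_naive_bayes(X, y)))   # -> 1
```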
K Nearest Neighbor Density Estimation

In Parzen:
The volume is constant.
The number of points in the volume is varying.

Now:
Keep the number of points $k_N = k$ constant.
Let the volume be varying.

$\hat{p}(\mathbf{x}) = \dfrac{k}{N V(\mathbf{x})}$

The likelihood ratio test then becomes:

$\dfrac{\hat{p}(\mathbf{x} \mid \omega_1)}{\hat{p}(\mathbf{x} \mid \omega_2)} = \dfrac{k / (N_1 V_1)}{k / (N_2 V_2)} = \dfrac{N_2 V_2}{N_1 V_1} \; > (<) \; \text{threshold}$
The Nearest Neighbor Rule

Among the $N$ training vectors, identify the $k$ nearest ones to $\mathbf{x}$.
Out of these $k$, identify the number $k_i$ that belong to class $\omega_i$.
Assign $\mathbf{x} \rightarrow \omega_i$ if $k_i > k_j \;\; \forall j \neq i$.

The simplest version is $k = 1$!

For large $N$ this is not bad. It can be shown that, if $P_B$ is the optimal Bayesian error probability, then:

$P_B \leq P_{NN} \leq P_B \left(2 - \dfrac{M}{M-1} P_B\right) \leq 2 P_B$

$P_B \leq P_{kNN} \leq P_B + \sqrt{\dfrac{2 P_{NN}}{k}}$

$k \rightarrow \infty \;\Rightarrow\; P_{kNN} \rightarrow P_B$

For small $P_B$:

$P_{NN} \approx 2 P_B$
$P_{3NN} \approx P_B + 3 (P_B)^2$
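A small k-NN classifier sketch implementing the rule above (Euclidean distances, synthetic data; the tie-breaking is simply "first label with the maximum count"):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the class with the largest count among its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(6)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.8]), X_train, y_train, k=5))   # -> 1
```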
Voronoi tessellation

$R_i = \{\mathbf{x}: d(\mathbf{x}, \mathbf{x}_i) < d(\mathbf{x}, \mathbf{x}_j), \;\; i \neq j\}$
BAYESIAN NETWORKS

Bayes Probability Chain Rule:

$p(x_1, x_2, \dots, x_l) = p(x_l \mid x_{l-1}, \dots, x_1)\, p(x_{l-1} \mid x_{l-2}, \dots, x_1) \cdots p(x_2 \mid x_1)\, p(x_1)$

Assume now that the conditional dependence for each $x_i$ is limited to a subset of the features appearing in each of the product terms. That is:

$p(x_1, x_2, \dots, x_l) = p(x_1) \prod_{i=2}^{l} p(x_i \mid A_i)$

where

$A_i \subseteq \{x_{i-1}, x_{i-2}, \dots, x_1\}$
For example, if $l = 6$, then we could assume:

$p(x_6 \mid x_5, \dots, x_1) = p(x_6 \mid x_5, x_4)$

Then:

$A_6 = \{x_5, x_4\} \subseteq \{x_5, \dots, x_1\}$

The above is a generalization of the Naive Bayes. For the Naive Bayes the assumption is:

$A_i = \emptyset$, for $i = 1, 2, \dots, l$
A graphical way to portray conditional dependencies is given below (DAG figure omitted). According to this figure we have that:

$x_6$ is conditionally dependent on $x_4$, $x_5$.
$x_5$ on $x_4$.
$x_4$ on $x_1$, $x_2$.
$x_3$ on $x_2$.
$x_1$, $x_2$ are conditionally independent of the other variables.

For this case:

$p(x_1, x_2, \dots, x_6) = p(x_6 \mid x_5, x_4)\, p(x_5 \mid x_4)\, p(x_4 \mid x_1, x_2)\, p(x_3 \mid x_2)\, p(x_2)\, p(x_1)$
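As a sketch of how this factorized joint could be evaluated in code; the numerical tables below are made-up placeholder values for binary variables, not taken from the slides:

```python
# Hypothetical (made-up) probability tables for binary variables x1..x6.
p_x1 = {0: 0.6, 1: 0.4}
p_x2 = {0: 0.7, 1: 0.3}
p_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}   # key: (x3, x2)
p_x4_given_x12 = {(x4, x1, x2): 0.5 for x4 in (0, 1) for x1 in (0, 1) for x2 in (0, 1)}
p_x5_given_x4 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}   # key: (x5, x4)
p_x6_given_x45 = {(x6, x5, x4): 0.5 for x6 in (0, 1) for x5 in (0, 1) for x4 in (0, 1)}

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1,...,x6) from the DAG factorization of the example."""
    return (p_x6_given_x45[(x6, x5, x4)] * p_x5_given_x4[(x5, x4)] *
            p_x4_given_x12[(x4, x1, x2)] * p_x3_given_x2[(x3, x2)] *
            p_x2[x2] * p_x1[x1])

print(joint(1, 0, 0, 1, 1, 0))
```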
Bayesian Networks

Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), $p(x_i \mid A_i)$, where $x_i$ is the variable associated with the node and $A_i$ is the set of its parents in the graph.

A Bayesian Network is specified by:
The marginal probabilities of its root nodes.
The conditional probabilities of the non-root nodes, given their parents, for ALL possible combinations.
The figure below (omitted) is an example of a Bayesian network corresponding to a paradigm from the medical applications field. This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.
Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities.

Training: Once a topology is given, probabilities are estimated via the training data set. There are also methods that learn the topology.

Probability Inference: This is the most common task that Bayesian networks help us solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities for some of the other variables, given the evidence.
Example: Consider the Bayesian network of the figure (omitted):

a) If $x$ is measured to be $x = 1$ ($x1$), compute $P(w = 0 \mid x = 1)$ [$P(w0 \mid x1)$].

b) If $w$ is measured to be $w = 1$ ($w1$), compute $P(x = 0 \mid w = 1)$ [$P(x0 \mid w1)$].
For a), a set of calculations is required that propagates from node $x$ to node $w$. It turns out that $P(w0 \mid x1) = 0.63$.

For b), the propagation is reversed in direction. It turns out that $P(x0 \mid w1) = 0.4$.

In general, the required inference information is computed via a combined process of message passing among the nodes of the DAG.

Complexity:
For singly connected graphs, message passing algorithms amount to a complexity linear in the number of nodes.
