
Sergios Theodoridis

Konstantinos Koutroumbas

Version 2

PATTERN RECOGNITION
Typical application areas
Machine vision
Character recognition (OCR)
Computer aided diagnosis
Speech recognition
Face recognition
Biometrics
Image database retrieval
Data mining
Bioinformatics
The task: Assign unknown objects (patterns) to the correct class. This is known as classification.

Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.

Feature vectors: A number of features x_1, ..., x_l constitute the feature vector

x = [x_1, ..., x_l]^T ∈ R^l

Feature vectors are treated as random vectors.

An example:

The classifier consists of a set of functions whose values, computed at x, determine the class to which the corresponding pattern belongs.
Classification system overview:

Patterns → sensor → feature generation → feature selection → classifier design → system evaluation

Supervised vs. unsupervised pattern recognition: the two major directions

Supervised: Patterns whose class is known a priori are used for training.
Unsupervised: The number of classes is (in general) unknown and no training patterns are available.

CLASSIFIERS BASED ON BAYES DECISION THEORY

Statistical nature of feature vectors:

x = [x_1, x_2, ..., x_l]^T

Assign the pattern represented by feature vector x to the most probable of the M available classes ω_1, ω_2, ..., ω_M.

That is, assign x → ω_i for which P(ω_i | x) is maximum.

Computation of a posteriori probabilities

Assume known:
the a priori probabilities P(ω_1), P(ω_2), ..., P(ω_M)
the class-conditional pdfs p(x | ω_i), i = 1, 2, ..., M; each is also known as the likelihood of ω_i with respect to x.

The Bayes rule (M = 2):

p(x) P(ω_i | x) = p(x | ω_i) P(ω_i),  hence  P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)

where

p(x) = ∑_{i=1}^{2} p(x | ω_i) P(ω_i)

The Bayes classification rule (for two classes, M = 2):

Given x, classify it according to the rule:

If P(ω_1 | x) > P(ω_2 | x), assign x → ω_1
If P(ω_2 | x) > P(ω_1 | x), assign x → ω_2

Equivalently, classify x according to the rule:

p(x | ω_1) P(ω_1)  > (<)  p(x | ω_2) P(ω_2)

For equiprobable classes the test becomes:

p(x | ω_1)  > (<)  p(x | ω_2)
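To make the rule concrete, here is a minimal sketch (not part of the original slides) of the two-class Bayes rule, assuming hypothetical one-dimensional Gaussian class-conditional densities; the means, variances and priors below are illustrative placeholders.

```python
import numpy as np
from scipy.stats import norm

def bayes_classify_2class(x, prior1, prior2, lik1, lik2):
    """Assign x to omega_1 or omega_2 by comparing the posteriors
    P(omega_i | x), which are proportional to p(x | omega_i) P(omega_i)."""
    post1 = lik1(x) * prior1
    post2 = lik2(x) * prior2
    return 1 if post1 > post2 else 2

# Illustrative class-conditional pdfs (assumed, not from the slides):
p_x_w1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
p_x_w2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)

print(bayes_classify_2class(0.3, 0.5, 0.5, p_x_w1, p_x_w2))  # -> 1
print(bayes_classify_2class(1.7, 0.5, 0.5, p_x_w1, p_x_w2))  # -> 2
```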

(Figure: the two class-conditional pdfs and the decision regions R_1 (→ ω_1) and R_2 (→ ω_2), separated by the threshold x_0.)

Equivalently, in words: divide the space into two regions:

If x ∈ R_1, decide x in ω_1
If x ∈ R_2, decide x in ω_2

Probability of error (the total shaded area in the figure):

P_e = P(ω_2) ∫_{-∞}^{x_0} p(x | ω_2) dx + P(ω_1) ∫_{x_0}^{+∞} p(x | ω_1) dx
The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability.

Indeed: moving the threshold away from x_0, the total shaded area INCREASES by the extra grey area.

The Bayes classification rule for many (M > 2) classes:

Given x, classify it to ω_i if:

P(ω_i | x) > P(ω_j | x)  ∀ j ≠ i

Such a choice also minimizes the classification error probability.

Minimizing the average risk:
For each wrong decision, a penalty term is assigned, since some decisions are more sensitive than others.

For M = 2, define the loss matrix:

L = [λ_11, λ_12; λ_21, λ_22]

λ_12 is the penalty term for deciding class ω_2 although the pattern belongs to ω_1, and so on.

Risk with respect to ω_1:

r_1 = λ_11 ∫_{R_1} p(x | ω_1) dx + λ_12 ∫_{R_2} p(x | ω_1) dx

Risk with respect to ω_2:

r_2 = λ_21 ∫_{R_1} p(x | ω_2) dx + λ_22 ∫_{R_2} p(x | ω_2) dx

These are probabilities of wrong decisions, weighted by the penalty terms.

Average risk:

r = r_1 P(ω_1) + r_2 P(ω_2)

Choose R_1 and R_2 so that r is minimized.

Then assign x to ω_1 if:

ℓ_1 ≡ λ_11 p(x | ω_1) P(ω_1) + λ_21 p(x | ω_2) P(ω_2)
   < ℓ_2 ≡ λ_12 p(x | ω_1) P(ω_1) + λ_22 p(x | ω_2) P(ω_2)

Equivalently: assign x to ω_1 (ω_2) if

ℓ_12 ≡ p(x | ω_1) / p(x | ω_2)  > (<)  (P(ω_2) / P(ω_1)) · (λ_21 − λ_22) / (λ_12 − λ_11)

ℓ_12 is known as the likelihood ratio.

If P(ω_1) = P(ω_2) = 1/2 and λ_11 = λ_22 = 0:

x → ω_1 if p(x | ω_1) > (λ_21 / λ_12) p(x | ω_2)
x → ω_2 if p(x | ω_2) > (λ_12 / λ_21) p(x | ω_1)

If λ_21 = λ_12, this reduces to the minimum classification error probability rule.
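A sketch of the resulting minimum-average-risk rule as a likelihood-ratio test; the densities, priors and loss values passed in would come from the problem at hand (the helper below is illustrative, not from the slides).

```python
def min_risk_classify(x, p1, p2, P1, P2, lam):
    """Assign x to omega_1 if l12 = p(x|w1)/p(x|w2) exceeds the threshold
    (P(w2)/P(w1)) * (lam21 - lam22) / (lam12 - lam11), else to omega_2."""
    l12 = p1(x) / p2(x)
    threshold = (P2 / P1) * (lam[1][0] - lam[1][1]) / (lam[0][1] - lam[0][0])
    return 1 if l12 > threshold else 2
```

With λ_21 = λ_12 and λ_11 = λ_22 = 0, the threshold reduces to P(ω_2)/P(ω_1), i.e., the minimum-error-probability rule.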

An example:

p(x | ω_1) = (1/√π) exp(−x²)
p(x | ω_2) = (1/√π) exp(−(x − 1)²)

P(ω_1) = P(ω_2) = 1/2

L = [0, 0.5; 1.0, 0]

Then the threshold value x_0 for minimum P_e is:

x_0 :  exp(−x²) = exp(−(x − 1)²)  ⇒  x_0 = 1/2

The threshold x̂_0 for minimum r is:

x̂_0 :  exp(−x²) = 2 exp(−(x − 1)²)  ⇒  x̂_0 = (1 − ln 2) / 2

Thus x̂_0 moves to the left of x_0 = 1/2 (WHY?)
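A quick numerical check of the two thresholds (a sketch; the densities are those of the example above, up to the common normalization constant).

```python
import math
from scipy.optimize import brentq

p1 = lambda x: math.exp(-x**2)            # proportional to p(x|w1)
p2 = lambda x: math.exp(-(x - 1)**2)      # proportional to p(x|w2)

# Minimum-error threshold: p(x|w1) = p(x|w2)
x0 = brentq(lambda x: p1(x) - p2(x), -1.0, 2.0)
# Minimum-risk threshold: p(x|w1) = 2 * p(x|w2)
x0_hat = brentq(lambda x: p1(x) - 2.0 * p2(x), -1.0, 2.0)

print(x0)                               # 0.5
print(x0_hat, (1 - math.log(2)) / 2)    # both ~0.1534, i.e. left of x0
```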

DISCRIMINANT FUNCTIONS AND DECISION SURFACES

If R_i, R_j are contiguous:

g(x) ≡ P(ω_i | x) − P(ω_j | x) = 0

R_i : P(ω_i | x) > P(ω_j | x)   (+)
R_j : P(ω_j | x) > P(ω_i | x)   (−)

g(x) = 0 is the surface separating the regions. On one side it is positive (+), on the other negative (−). It is known as a decision surface.

If f(·) is monotonic, the rule remains the same if we use:

x → ω_i if:  f(P(ω_i | x)) > f(P(ω_j | x))  ∀ j ≠ i

g_i(x) ≡ f(P(ω_i | x)) is a discriminant function.

In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.

BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

Multivariate Gaussian pdf:

p(x | ω_i) = 1 / ((2π)^(l/2) |Σ_i|^(1/2)) · exp(−(1/2) (x − μ_i)^T Σ_i^{-1} (x − μ_i))

μ_i = E[x], the mean vector of class ω_i
Σ_i = E[(x − μ_i)(x − μ_i)^T], called the covariance matrix

ln(·) is monotonic. Define:

g_i(x) ≡ ln(p(x | ω_i) P(ω_i)) = ln p(x | ω_i) + ln P(ω_i)

g_i(x) = −(1/2) (x − μ_i)^T Σ_i^{-1} (x − μ_i) + ln P(ω_i) + C_i

C_i = −(l/2) ln 2π − (1/2) ln |Σ_i|

Example (l = 2):

Σ_i = [σ_i², 0; 0, σ_i²]

g_i(x) = −(1/(2σ_i²)) (x_1² + x_2²) + (1/σ_i²) (μ_i1 x_1 + μ_i2 x_2) − (1/(2σ_i²)) (μ_i1² + μ_i2²) + ln P(ω_i) + C_i

That is, g_i(x) is quadratic and the surfaces

g_i(x) − g_j(x) = 0

are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.

(Figure: examples of such quadric decision curves.)
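As an illustration, a sketch of the Gaussian discriminant g_i(x) for a general covariance matrix; the helper is hypothetical, and the actual μ_i, Σ_i, P(ω_i) would come from the problem at hand.

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x - mu)^T Sigma^{-1} (x - mu) + ln P(w_i)
               - 0.5 ln|Sigma| - (l/2) ln(2*pi)"""
    d = x - mu
    Sinv = np.linalg.inv(Sigma)
    l = x.shape[0]
    return (-0.5 * d @ Sinv @ d + np.log(prior)
            - 0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * l * np.log(2 * np.pi))

# Classification: assign x to the class with the largest discriminant value,
# i.e. argmax_i gaussian_discriminant(x, mu_i, Sigma_i, P_i).
```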

Decision Hyperplanes

The quadratic terms are x^T Σ_i^{-1} x. If ALL Σ_i = Σ (the same), the quadratic terms are not of interest: they are common to every g_i(x) and are not involved in the comparisons. Then, equivalently, we can write:

g_i(x) = w_i^T x + w_i0

w_i = Σ^{-1} μ_i
w_i0 = ln P(ω_i) − (1/2) μ_i^T Σ^{-1} μ_i

The discriminant functions are LINEAR.

Let, in addition, Σ = σ² I. Then:

g_i(x) = (1/σ²) μ_i^T x + w_i0

g_ij(x) ≡ g_i(x) − g_j(x) = 0  ⇔  w^T (x − x_0) = 0

w = μ_i − μ_j

x_0 = (1/2)(μ_i + μ_j) − σ² ln(P(ω_i) / P(ω_j)) · (μ_i − μ_j) / ‖μ_i − μ_j‖²

Nondiagonal Σ:

g_ij(x) = w^T (x − x_0) = 0

w = Σ^{-1} (μ_i − μ_j)

x_0 = (1/2)(μ_i + μ_j) − ln(P(ω_i) / P(ω_j)) · (μ_i − μ_j) / ‖μ_i − μ_j‖²_{Σ^{-1}}

where ‖x‖_{Σ^{-1}} ≡ (x^T Σ^{-1} x)^{1/2}.

The decision hyperplane is
not normal to μ_i − μ_j
normal to Σ^{-1} (μ_i − μ_j)
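A sketch computing the linear discriminant parameters and the hyperplane quantities w, x_0 above, assuming the class means, the common covariance Σ and the priors are given (illustrative helper names).

```python
import numpy as np

def linear_discriminant_params(mu_i, Sigma, prior_i):
    """w_i = Sigma^{-1} mu_i,  w_i0 = ln P(w_i) - 0.5 mu_i^T Sigma^{-1} mu_i."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ mu_i
    w0 = np.log(prior_i) - 0.5 * mu_i @ Sinv @ mu_i
    return w, w0

def decision_hyperplane(mu_i, mu_j, Sigma, Pi, Pj):
    """g_ij(x) = w^T (x - x0) = 0 with w = Sigma^{-1}(mu_i - mu_j)."""
    Sinv = np.linalg.inv(Sigma)
    diff = mu_i - mu_j
    w = Sinv @ diff
    norm_sq = diff @ Sinv @ diff          # ||mu_i - mu_j||^2 in the Sigma^{-1} norm
    x0 = 0.5 * (mu_i + mu_j) - np.log(Pi / Pj) * diff / norm_sq
    return w, x0
```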

Minimum Distance Classifiers

For equiprobable classes, P(ω_i) = 1/M, and a common covariance matrix Σ:

g_i(x) = −(1/2) (x − μ_i)^T Σ^{-1} (x − μ_i)

Euclidean distance: if Σ = σ² I, assign x → ω_i for which the Euclidean distance

d_E = ‖x − μ_i‖

is smaller.

Mahalanobis distance: if Σ ≠ σ² I, assign x → ω_i for which the Mahalanobis distance

d_m = ((x − μ_i)^T Σ^{-1} (x − μ_i))^{1/2}

is smaller.

Example:

Given ω_1, ω_2 with P(ω_1) = P(ω_2) and p(x | ω_1) = N(μ_1, Σ), p(x | ω_2) = N(μ_2, Σ), where

μ_1 = [0, 0]^T,  μ_2 = [3, 3]^T,  Σ = [1.1, 0.3; 0.3, 1.9]

classify the vector x = [1.0, 2.2]^T using Bayesian classification.

Σ^{-1} = [0.95, −0.15; −0.15, 0.55]

Compute the Mahalanobis distances from μ_1, μ_2:

d²_m,1 = [1.0, 2.2] Σ^{-1} [1.0, 2.2]^T = 2.952
d²_m,2 = [−2.0, −0.8] Σ^{-1} [−2.0, −0.8]^T = 3.672

Classify x → ω_1. Observe that d_E,2 < d_E,1.
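A quick NumPy check of this example (a sketch; the numbers are those of the slide).

```python
import numpy as np

Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
x = np.array([1.0, 2.2])
Sinv = np.linalg.inv(Sigma)               # [[0.95, -0.15], [-0.15, 0.55]]

d2_m1 = (x - mu1) @ Sinv @ (x - mu1)      # 2.952
d2_m2 = (x - mu2) @ Sinv @ (x - mu2)      # 3.672
d2_E1, d2_E2 = np.sum((x - mu1)**2), np.sum((x - mu2)**2)   # 5.84, 4.64

print(d2_m1, d2_m2)   # Mahalanobis: x is closer to mu1 -> classify to omega_1
print(d2_E1, d2_E2)   # Euclidean:   x would be closer to mu2
```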

ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

Maximum Likelihood

Let x_1, x_2, ..., x_N be known and statistically independent samples.
Let p(x) be known within an unknown vector parameter θ:  p(x) ≡ p(x; θ).

X = {x_1, x_2, ..., x_N}

p(X; θ) ≡ p(x_1, x_2, ..., x_N; θ) = ∏_{k=1}^{N} p(x_k; θ)

which is known as the likelihood of θ with respect to X.

The method:

θ̂_ML = arg max_θ ∏_{k=1}^{N} p(x_k; θ)

L(θ) ≡ ln p(X; θ) = ∑_{k=1}^{N} ln p(x_k; θ)

θ̂_ML :  ∂L(θ)/∂θ = ∑_{k=1}^{N} (1 / p(x_k; θ)) ∂p(x_k; θ)/∂θ = 0

If, indeed, there is a θ_0 such that p(x) = p(x; θ_0), then

lim_{N→∞} E[θ̂_ML] = θ_0
lim_{N→∞} E[‖θ̂_ML − θ_0‖²] = 0

i.e., the ML estimator is asymptotically unbiased and consistent.

Example:

p(x) : N(μ, Σ), with μ unknown and samples x_1, x_2, ..., x_N:  p(x_k) ≡ p(x_k; μ), where

p(x_k; μ) = 1 / ((2π)^(l/2) |Σ|^(1/2)) · exp(−(1/2) (x_k − μ)^T Σ^{-1} (x_k − μ))

L(μ) = ln ∏_{k=1}^{N} p(x_k; μ) = C − (1/2) ∑_{k=1}^{N} (x_k − μ)^T Σ^{-1} (x_k − μ)

∂L(μ)/∂μ = ∑_{k=1}^{N} Σ^{-1} (x_k − μ) = 0  ⇒  μ̂_ML = (1/N) ∑_{k=1}^{N} x_k

Remember: if A = A^T, then ∂(a^T A a)/∂a = 2 A a.
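A minimal sketch illustrating the result on synthetic data (the true mean and covariance below are assumptions for the demonstration): the ML estimate of the Gaussian mean is simply the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

X = rng.multivariate_normal(mu_true, Sigma, size=1000)   # samples x_1, ..., x_N
mu_ML = X.mean(axis=0)                                    # (1/N) sum_k x_k
print(mu_ML)   # close to mu_true for large N (asymptotically unbiased, consistent)
```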

Maximum A Posteriori Probability (MAP) Estimation

In the ML method, θ was considered a parameter. Here we shall look at θ as a random vector described by a pdf p(θ), assumed to be known.

Given X = {x_1, x_2, ..., x_N}, compute the maximum of p(θ | X).

From the Bayes theorem:

p(θ) p(X | θ) = p(X) p(θ | X),  or  p(θ | X) = p(θ) p(X | θ) / p(X)

The method:

θ̂_MAP = arg max_θ p(θ | X),  or equivalently

θ̂_MAP :  ∂/∂θ (p(θ) p(X | θ)) = 0

If p(θ) is uniform or broad enough, θ̂_MAP ≈ θ̂_ML.

Example:

p(x) : N(μ, σ² I), with μ unknown, X = {x_1, ..., x_N}, and prior

p(μ) = 1 / ((2π)^(l/2) σ_μ^l) · exp(−‖μ − μ_0‖² / (2σ_μ²))

MAP:  ∂/∂μ ln(∏_{k=1}^{N} p(x_k | μ) p(μ)) = 0,  or  ∑_{k=1}^{N} (1/σ²)(x_k − μ) − (1/σ_μ²)(μ − μ_0) = 0

⇒  μ̂_MAP = (μ_0 + (σ_μ²/σ²) ∑_{k=1}^{N} x_k) / (1 + (σ_μ²/σ²) N)

For σ_μ² ≫ σ², or for N → ∞:

μ̂_MAP ≈ μ̂_ML = (1/N) ∑_{k=1}^{N} x_k
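A sketch of the MAP estimate above for the spherical-covariance case, assuming σ², σ_μ² and μ_0 are known (the helper is illustrative, not from the slides).

```python
import numpy as np

def map_mean_estimate(X, mu0, sigma2, sigma_mu2):
    """mu_MAP = (mu0 + (sigma_mu^2/sigma^2) * sum_k x_k)
                / (1 + (sigma_mu^2/sigma^2) * N)."""
    N = X.shape[0]
    ratio = sigma_mu2 / sigma2
    return (mu0 + ratio * X.sum(axis=0)) / (1.0 + ratio * N)

# For sigma_mu^2 >> sigma^2 (a broad prior), or for large N,
# this tends to the sample mean, i.e. the ML estimate.
```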

Bayesian Inference

ML and MAP give a single estimate for θ. Here a different route is followed.

Given: X = {x_1, ..., x_N}, p(x | θ) and p(θ)
The goal: estimate p(x | X)
How??

p(x | X) = ∫ p(x | θ) p(θ | X) dθ

p(θ | X) = p(X | θ) p(θ) / p(X) = p(X | θ) p(θ) / ∫ p(X | θ) p(θ) dθ

p(X | θ) = ∏_{k=1}^{N} p(x_k | θ)

A bit more insight via an example. Let

p(x | μ) : N(μ, σ²),   p(μ) : N(μ_0, σ_0²)

It turns out that p(μ | X) : N(μ_N, σ_N²), where

μ_N = (N σ_0² x̄ + σ² μ_0) / (N σ_0² + σ²),   σ_N² = σ² σ_0² / (N σ_0² + σ²),   x̄ = (1/N) ∑_{k=1}^{N} x_k

The above is a sequence of Gaussians as N → ∞: σ_N² → 0, so the posterior concentrates around x̄.
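A sketch of the posterior parameters μ_N, σ_N² of this example for scalar data, assuming σ², μ_0 and σ_0² are known.

```python
import numpy as np

def posterior_mean_variance(X, mu0, sigma2_0, sigma2):
    """mu_N = (N*s0^2*xbar + s^2*mu0) / (N*s0^2 + s^2),
       s_N^2 = s^2*s0^2 / (N*s0^2 + s^2)."""
    N, xbar = len(X), np.mean(X)
    denom = N * sigma2_0 + sigma2
    mu_N = (N * sigma2_0 * xbar + sigma2 * mu0) / denom
    var_N = sigma2 * sigma2_0 / denom
    return mu_N, var_N

# As N grows, var_N -> 0 and mu_N -> xbar: the posterior collapses around the sample mean.
```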

Maximum Entropy

Entropy:

H = −∫ p(x) ln p(x) dx

Estimate p(x) so that H is maximum, subject to the available constraints.

Example: x is nonzero in the interval x_1 ≤ x ≤ x_2 and zero otherwise. Compute the ME pdf.

The constraint:

∫_{x_1}^{x_2} p(x) dx = 1

Lagrange multipliers:

H_L = H + λ (∫_{x_1}^{x_2} p(x) dx − 1)

⇒  p̂(x) = exp(λ − 1)  (a constant), hence

p̂(x) = 1/(x_2 − x_1)  for x_1 ≤ x ≤ x_2,  0 otherwise

Mixture Models

p(x) = ∑_{j=1}^{J} p(x | j) P_j,   ∑_{j=1}^{J} P_j = 1,   ∫_x p(x | j) dx = 1

Assume parametric modeling, i.e., p(x | j; θ).

The goal is to estimate θ and P_1, P_2, ..., P_J, given a set X = {x_1, x_2, ..., x_N}.

Why not ML, as before?

max_{θ, P_1, ..., P_J} ∏_{k=1}^{N} p(x_k; θ, P_1, ..., P_J)

This is a nonlinear problem due to the missing label information; it is a typical problem with an incomplete data set.

The Expectation-Maximisation (EM) algorithm. General formulation:

Let y ∈ Y ⊆ R^m be the complete data, with pdf p_y(y; θ); the y's are not observed directly.

We observe x = g(y) ∈ X_ob ⊆ R^l, l < m, with pdf p_x(x; θ); g(·) is a many-to-one transformation.

Let Y(x) ⊆ Y be the set of all y's that map to a specific x. Then

p_x(x; θ) = ∫_{Y(x)} p_y(y; θ) dy

What we need is to compute

θ̂_ML :  ∑_k ∂ ln(p_y(y_k; θ))/∂θ = 0

But the y_k's are not observed. Here comes the EM: maximize the expectation of the log-likelihood, conditioned on the observed samples and the current iteration estimate of θ.

The algorithm:

E-step:

Q(θ; θ(t)) = E[ ∑_k ln(p_y(y_k; θ)) | X; θ(t) ]

M-step:

θ(t+1) :  ∂Q(θ; θ(t))/∂θ = 0

Application to the mixture modeling problem:

Complete data: (x_k, j_k), k = 1, 2, ..., N
Observed data: x_k, k = 1, 2, ..., N

p(x_k, j_k; θ) = p(x_k | j_k; θ) P_{j_k}

Assuming mutual independence:

L(θ) = ∑_{k=1}^{N} ln(p(x_k | j_k; θ) P_{j_k})

Unknown parameters:

Θ = [θ^T, P^T]^T,   P = [P_1, P_2, ..., P_J]^T

E-step:

Q(Θ; Θ(t)) = E[ ∑_{k=1}^{N} ln(p(x_k | j_k; θ) P_{j_k}) ] = ∑_{k=1}^{N} ∑_{j_k=1}^{J} P(j_k | x_k; Θ(t)) ln(p(x_k | j_k; θ) P_{j_k})

M-step:

∂Q/∂θ = 0,   ∂Q/∂P_{j_k} = 0,  j_k = 1, 2, ..., J

where

P(j | x_k; Θ(t)) = p(x_k | j; θ(t)) P_j / p(x_k; Θ(t)),   p(x_k; Θ(t)) = ∑_{j=1}^{J} p(x_k | j; θ(t)) P_j
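A compact sketch of the resulting EM iteration for a one-dimensional Gaussian mixture (means, variances and mixing weights P_j); this is an illustrative implementation under those assumptions, not the general algorithm of the slides.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, J=2, iters=100, seed=0):
    """EM for a 1-D Gaussian mixture: the E-step computes the responsibilities
    P(j | x_k; Theta(t)); the M-step re-estimates means, variances and P_j."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=J, replace=False)       # initial means
    var = np.full(J, np.var(x))
    P = np.full(J, 1.0 / J)
    for _ in range(iters):
        # E-step: gamma[k, j] = P(j | x_k; Theta(t))
        dens = np.stack([P[j] * norm.pdf(x, mu[j], np.sqrt(var[j]))
                         for j in range(J)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of the parameters
        Nj = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nj
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
        P = Nj / len(x)
    return mu, var, P

# Example usage on synthetic data drawn from two Gaussians:
# rng = np.random.default_rng(1)
# x = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 1, 500)])
# print(em_gmm_1d(x, J=2))
```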

Nonparametric Estimation

P ≈ k_N / N,  where k_N is the number of the N total points that fall inside an interval of length h centered at x.

p̂(x) ≈ (1/h) (k_N / N),  for |x' − x| ≤ h/2

If p(x) is continuous, p̂(x) → p(x) as N → ∞, provided

h_N → 0,   k_N → ∞,   k_N / N → 0

Parzen Windows

Divide the multidimensional space into hypercubes.

Define

φ(x_i) = 1  if |x_ij| ≤ 1/2 for every coordinate j,  0 otherwise

That is, φ is 1 inside a unit-side hypercube centered at 0, and

p̂(x) = (1/h^l) (1/N) ∑_{i=1}^{N} φ((x_i − x)/h)
     = (1/N) · (1/h^l) · (number of points inside an h-side hypercube centered at x)

The problem: p(x) is continuous, but p̂(x) is discontinuous, since φ(·) is discontinuous.

Parzen windows - kernels - potential functions: use a smooth φ(x) with

φ(x) ≥ 0  and  ∫_x φ(x) dx = 1
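A sketch of the Parzen estimate with the hypercube kernel φ (in practice a smooth kernel, e.g. a Gaussian, is substituted for φ); samples is assumed to be an N × l array.

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """p_hat(x) = (1/(N h^l)) * sum_i phi((x_i - x)/h),
    with phi the unit-hypercube kernel."""
    x = np.atleast_1d(x)
    N, l = samples.shape[0], x.shape[0]
    u = (samples - x) / h                       # (x_i - x)/h for every sample
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # phi = 1 inside the unit hypercube
    return inside.sum() / (N * h**l)
```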

Mean value:

E[p̂(x)] = (1/h^l) (1/N) ∑_{i=1}^{N} E[φ((x_i − x)/h)] = ∫_{x'} (1/h^l) φ((x' − x)/h) p(x') dx'

As h → 0:
the width of φ((x' − x)/h) → 0
∫ (1/h^l) φ((x' − x)/h) dx' = 1
so (1/h^l) φ(x/h) → δ(x)

Hence, in the limit,

E[p̂(x)] = ∫_{x'} δ(x' − x) p(x') dx' = p(x)

i.e., the estimate is unbiased.

Variance:
The smaller the h, the higher the variance.

(Figures: estimates for h = 0.1, N = 1000; h = 0.8, N = 1000; h = 0.1, N = 10000.)

The higher the N, the better the accuracy.

If h → 0 and N → ∞, the estimate is asymptotically unbiased.
The method:

Remember the likelihood ratio test:

ℓ_12 ≡ p(x | ω_1) / p(x | ω_2)  > (<)  (P(ω_2) / P(ω_1)) · (λ_21 − λ_22) / (λ_12 − λ_11)

Approximate the class-conditional densities by Parzen estimates over the N_1 and N_2 training points of each class:

ℓ_12 ≈ [ (1/(N_1 h^l)) ∑_{i=1}^{N_1} φ((x_i − x)/h) ] / [ (1/(N_2 h^l)) ∑_{i=1}^{N_2} φ((x_i − x)/h) ]

CURSE OF DIMENSIONALITY

In all the methods so far, we saw that the higher the number of points N, the better the resulting estimate.

If in the one-dimensional space an interval filled with N points is adequately filled (for good estimation), in the two-dimensional space the corresponding square will require N² points, and in the l-dimensional space the l-dimensional cube will require N^l points.

The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.

NAIVE BAYES CLASSIFIER

Let x ∈ R^l, and the goal is to estimate p(x | ω_i), i = 1, 2, ..., M. For a good estimate of the pdf one would need, say, N^l points.

Assume x_1, x_2, ..., x_l mutually independent. Then:

p(x | ω_i) = ∏_{j=1}^{l} p(x_j | ω_i)

In this case, one would require, roughly, N points for each one-dimensional pdf. Thus, a number of points of the order l·N would suffice.

It turns out that the Naive Bayes classifier works reasonably well even in cases that violate the independence assumption.
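A sketch of a Gaussian naive Bayes classifier built on this factorization, assuming one-dimensional Gaussian marginals per feature; the per-class means and standard deviations would be estimated from training data, and the interface below is illustrative.

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_gaussian(x, class_stats, priors):
    """p(x|w_i) is approximated by prod_j p(x_j|w_i) with 1-D Gaussian marginals;
    class_stats[i] = (means_i, stds_i), one entry per feature, for class i."""
    posts = []
    for (means, stds), P in zip(class_stats, priors):
        log_lik = np.sum(norm.logpdf(x, means, stds))   # sum of per-feature log densities
        posts.append(np.log(P) + log_lik)
    return int(np.argmax(posts))                        # index of the winning class
```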

K Nearest Neighbor Density Estimation

In Parzen:
the volume is constant
the number of points in the volume is varying

Now:
keep the number of points k_N = k constant
let the volume be varying

p̂(x) = k / (N V(x))

For classification, the likelihood ratio then becomes

p̂(x | ω_1) / p̂(x | ω_2) = (k / (N_1 V_1(x))) / (k / (N_2 V_2(x))) = N_2 V_2(x) / (N_1 V_1(x))  > (<)  threshold
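A sketch of the k-NN density estimate in one dimension, where the varying volume V(x) is taken as the length of the smallest interval centred at x that contains the k nearest samples (an assumption for the 1-D case).

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """p_hat(x) = k / (N * V(x)), with V(x) the length of the interval
    centred at x reaching out to the k-th nearest sample."""
    dists = np.sort(np.abs(samples - x))
    V = 2.0 * dists[k - 1]                 # radius = distance to the k-th neighbour
    return k / (len(samples) * V)
```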

The Nearest Neighbor Rule

Choose k out of the N training vectors and identify the k nearest ones to x.
Out of these k, identify the number k_i that belong to class ω_i.
Assign x → ω_i :  k_i > k_j  ∀ i ≠ j.

The simplest version: k = 1 !!!

For large N this is not bad. It can be shown that, if P_B is the optimal Bayesian error probability, then:

P_B ≤ P_NN ≤ P_B (2 − (M/(M−1)) P_B) ≤ 2 P_B

P_B ≤ P_kNN ≤ P_B + √(2 P_NN / k)

k → ∞  ⇒  P_kNN → P_B

For small P_B:

P_NN ≈ 2 P_B
P_3NN ≈ P_B + 3 (P_B)²
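A sketch of the k-nearest-neighbor classification rule (k = 1 gives the nearest neighbor rule); X_train and y_train are assumed to be an N × l array of training vectors and their class labels.

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=1):
    """Identify the k training vectors nearest to x and assign x to the class
    with the largest count k_i among them."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```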

Voronoi tessellation:

R_i = {x : d(x, x_i) < d(x, x_j), i ≠ j}

BAYESIAN NETWORKS

Bayes probability chain rule:

p(x_1, x_2, ..., x_l) = p(x_l | x_{l−1}, ..., x_1) p(x_{l−1} | x_{l−2}, ..., x_1) ... p(x_2 | x_1) p(x_1)

Assume now that the conditional dependence for each x_i is limited to a subset of the features appearing in each of the product terms. That is:

p(x_1, x_2, ..., x_l) = p(x_1) ∏_{i=2}^{l} p(x_i | A_i)

where

A_i ⊆ {x_{i−1}, x_{i−2}, ..., x_1}

For example, if l = 6, then we could assume:

p(x_6 | x_5, ..., x_1) = p(x_6 | x_5, x_4)

Then:

A_6 = {x_5, x_4} ⊂ {x_5, ..., x_1}

The above is a generalization of the Naive Bayes. For the Naive Bayes the assumption is:

A_i = ∅, for i = 1, 2, ..., l

A graphical way to portray conditional dependencies is given below.

(Figure: a DAG over x_1, ..., x_6.)

According to this figure we have that:
x_6 is conditionally dependent on x_4, x_5
x_5 on x_4
x_4 on x_1, x_2
x_3 on x_2
x_1, x_2 are conditionally independent of the other variables

For this case:

p(x_1, x_2, ..., x_6) = p(x_6 | x_5, x_4) p(x_5 | x_4) p(x_4 | x_1, x_2) p(x_3 | x_2) p(x_2) p(x_1)
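As an illustration, a sketch that encodes this factorization for binary variables and evaluates the joint probability; only the structure comes from the slide, while the conditional probability values below are made-up placeholders.

```python
import itertools

# Placeholder conditional probability tables for binary variables
# (the numbers are illustrative assumptions, not from the slides).
P_x1 = 0.6                                       # P(x1 = 1)
P_x2 = 0.3                                       # P(x2 = 1)
P_x3_given_x2 = {0: 0.2, 1: 0.7}                 # P(x3 = 1 | x2)
P_x4_given_x12 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}
P_x5_given_x4 = {0: 0.3, 1: 0.8}
P_x6_given_x45 = {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.95}

def bern(p, v):
    """P(variable = v) when P(variable = 1) = p."""
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1,...,x6) = p(x6|x5,x4) p(x5|x4) p(x4|x1,x2) p(x3|x2) p(x2) p(x1)."""
    return (bern(P_x6_given_x45[(x4, x5)], x6) * bern(P_x5_given_x4[x4], x5)
            * bern(P_x4_given_x12[(x1, x2)], x4) * bern(P_x3_given_x2[x2], x3)
            * bern(P_x2, x2) * bern(P_x1, x1))

# Sanity check: the joint sums to 1 over all 2^6 configurations.
print(sum(joint(*cfg) for cfg in itertools.product([0, 1], repeat=6)))  # ~1.0
```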

Bayesian Networks

Definition: A Bayesian network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), p(x_i | A_i), where x_i is the variable associated with the node and A_i is the set of its parents in the graph.

A Bayesian network is specified by:
The marginal probabilities of its root nodes.
The conditional probabilities of the non-root nodes, given their parents, for ALL possible combinations.

The figure below is an example of a Bayesian network corresponding to a paradigm from the medical applications field.

(Figure.) This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.

Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root node) and the conditional (non-root node) probabilities.

Training: Once a topology is given, probabilities are estimated via the training data set. There are also methods that learn the topology.

Probability inference: This is the most common task that Bayesian networks help us solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities for some of the other variables, given the evidence.

Example: Consider the Bayesian network of the figure:

a) If x is measured to be x = 1 (x1), compute P(w = 0 | x = 1) [P(w0 | x1)].
b) If w is measured to be w = 1 (w1), compute P(x = 0 | w = 1) [P(x0 | w1)].

For a), a set of calculations is required that propagates from node x to node w. It turns out that P(w0 | x1) = 0.63.

For b), the propagation is reversed in direction. It turns out that P(x0 | w1) = 0.4.

In general, the required inference information is computed via a combined process of message passing among the nodes of the DAG.

Complexity:
For singly connected graphs, message-passing algorithms amount to a complexity linear in the number of nodes.
