Konstantinos Koutroumbas
Version 2
PATTERN RECOGNITION
Typical application areas
Machine vision
Character recognition (OCR)
Computer aided diagnosis
Speech recognition
Face recognition
Biometrics
Image Data Base retrieval
Data mining
Bioinformatics
The task: Assign unknown objects (patterns) into the correct class. This is known as classification.
Features: These are measurable quantities obtained from the patterns, and the classification task is based on their respective values.
The classifier consists of a set of functions whose values, computed at $x$, determine the class to which the corresponding pattern belongs.
The basic stages of a pattern recognition system:
patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Supervised vs. unsupervised pattern recognition: the two major directions.
Supervised: Patterns whose class is known a-priori are used for training.
Unsupervised: The number of classes is (in general) unknown and no training patterns are available.
CLASSIFIERS BASED ON BAYES DECISION
THEORY
That is, assign $x \to \omega_i$ for which $P(\omega_i \mid x)$ is maximum.
Computation of a-posteriori probabilities
Assume known:
the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$
the densities $p(x \mid \omega_i),\ i = 1, 2, \ldots, M$; this is also known as the likelihood of $x$ w.r.t. $\omega_i$.
The Bayes rule ($M = 2$):
$$p(x)\,P(\omega_i \mid x) = p(x \mid \omega_i)\,P(\omega_i)$$
$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$
where
$$p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$$
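The rule is straightforward to evaluate numerically. Below is a minimal sketch (not part of the original slides): two hypothetical one-dimensional Gaussian class-conditional densities and the priors are assumed purely for illustration.

```python
import numpy as np

def gauss(x, mu, var):
    """1-D Gaussian pdf."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical two-class setup (assumed for this sketch)
priors = np.array([0.5, 0.5])                    # P(w1), P(w2)

def posteriors(x):
    # Bayes rule: P(wi|x) = p(x|wi) P(wi) / p(x), p(x) = sum_i p(x|wi) P(wi)
    lik = np.array([gauss(x, 0.0, 1.0), gauss(x, 1.0, 1.0)])
    joint = lik * priors
    return joint / joint.sum()

print(posteriors(0.2))  # -> approx. [0.574, 0.426]; argmax picks the class
```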
The Bayes classification rule (for two classes, $M = 2$):
Given $x$, classify it according to the rule
If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, $x \to \omega_1$
If $P(\omega_2 \mid x) > P(\omega_1 \mid x)$, $x \to \omega_2$
Equivalently:
$$p(x \mid \omega_1)\,P(\omega_1) \gtrless p(x \mid \omega_2)\,P(\omega_2)$$
For equiprobable classes the test becomes
$$p(x \mid \omega_1) \gtrless p(x \mid \omega_2)$$
[Figure: the two class-conditional pdfs and the threshold $x_0$ defining the regions $R_1$ ($\to \omega_1$) and $R_2$ ($\to \omega_2$).]
Equivalently, in words: divide the space into two regions:
If $x \in R_1$, decide $x \in \omega_1$; if $x \in R_2$, decide $x \in \omega_2$.
Probability of error (the total shaded area):
$$P_e = \int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$$
The Bayesian classifier is optimal with respect to minimizing the classification error probability. For $M$ classes, assign $x$ to $\omega_i$ if
$$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i$$
Minimizing the average risk
For each wrong decision, a penalty term is assigned, since some decisions are more critical than others.
For $M = 2$, define the loss matrix
$$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$$
Risk with respect to $\omega_1$:
$$r_1 = \lambda_{11} \int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12} \int_{R_2} p(x \mid \omega_1)\,dx$$
Risk with respect to $\omega_2$:
$$r_2 = \lambda_{21} \int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22} \int_{R_2} p(x \mid \omega_2)\,dx$$
These are the probabilities of wrong decisions, weighted by the penalty terms.
Average risk:
$$r = r_1 P(\omega_1) + r_2 P(\omega_2)$$
Choose $R_1$ and $R_2$ so that $r$ is minimized. Then assign $x$ to $\omega_1$ if
$$\ell_1 \equiv \lambda_{11}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{21}\,p(x \mid \omega_2)P(\omega_2) < \ell_2 \equiv \lambda_{12}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{22}\,p(x \mid \omega_2)P(\omega_2)$$
Equivalently: assign $x$ to $\omega_1$ ($\omega_2$) if
$$\ell_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > (<)\ \frac{P(\omega_2)}{P(\omega_1)}\,\frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$
$\ell_{12}$ is the likelihood ratio.
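A minimal sketch of this minimum-risk test (not from the slides): the Gaussian densities and loss matrix below are assumed, chosen to match the worked example that follows.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Loss matrix (lambda_ij = loss of deciding w_j when the truth is w_i)
lam = np.array([[0.0, 0.5],
                [1.0, 0.0]])
P1, P2 = 0.5, 0.5                                  # priors

def decide(x):
    """w1 iff l12 > P(w2)(lam21 - lam22) / (P(w1)(lam12 - lam11))."""
    l12 = gauss(x, 0.0, 0.5) / gauss(x, 1.0, 0.5)  # p(x|w1)/p(x|w2)
    thr = (P2 * (lam[1, 0] - lam[1, 1])) / (P1 * (lam[0, 1] - lam[0, 0]))
    return 1 if l12 > thr else 2

print(decide(0.1), decide(0.9))  # -> 1 2
```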
If $P(\omega_1) = P(\omega_2) = \frac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:
$$x \to \omega_1 \ \text{ if } \ p(x \mid \omega_1) > \frac{\lambda_{21}}{\lambda_{12}}\,p(x \mid \omega_2)$$
$$x \to \omega_2 \ \text{ if } \ p(x \mid \omega_2) > \frac{\lambda_{12}}{\lambda_{21}}\,p(x \mid \omega_1)$$
If $\lambda_{21} = \lambda_{12}$, this becomes the minimum classification error probability rule.
An example:
$$p(x \mid \omega_1) = \frac{1}{\sqrt{\pi}}\exp(-x^2), \qquad p(x \mid \omega_2) = \frac{1}{\sqrt{\pi}}\exp(-(x-1)^2)$$
$$P(\omega_1) = P(\omega_2) = \frac{1}{2}, \qquad L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$$
Then the threshold values are:
$x_0$ for minimum $P_e$:
$$x_0:\ \exp(-x^2) = \exp(-(x-1)^2) \;\Rightarrow\; x_0 = \frac{1}{2}$$
$\hat{x}_0$ for minimum $r$:
$$\hat{x}_0:\ \exp(-x^2) = 2\exp(-(x-1)^2) \;\Rightarrow\; \hat{x}_0 = \frac{1 - \ln 2}{2} < \frac{1}{2}$$
Thus $\hat{x}_0$ moves to the left of $x_0 = \frac{1}{2}$. (WHY?)
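A quick brute-force check of the two thresholds derived above (an illustrative sketch, not part of the slides):

```python
import numpy as np

x = np.linspace(-1.0, 2.0, 300001)
f1 = np.exp(-x ** 2)                 # proportional to p(x|w1)
f2 = np.exp(-(x - 1) ** 2)           # proportional to p(x|w2)

x0_pe = x[np.argmin(np.abs(f1 - f2))]        # minimum-error threshold
x0_r  = x[np.argmin(np.abs(f1 - 2 * f2))]    # minimum-risk threshold
print(x0_pe, x0_r)   # -> 0.5 and (1 - ln 2)/2 ~ 0.1534
```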
DISCRIMINANT FUNCTIONS
DECISION SURFACES
If $R_i, R_j$ are contiguous, the decision surface separating them is
$$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$$
with $g(x) > 0$ in $R_i$ (where $P(\omega_i \mid x) > P(\omega_j \mid x)$) and $g(x) < 0$ in $R_j$.
If $f(\cdot)$ is monotonic, the rule remains the same if we use $f(P(\omega_i \mid x))$ as discriminant functions.
In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.
BAYESIAN CLASSIFIER FOR NORMAL
DISTRIBUTIONS
Assume $p(x \mid \omega_i) = \mathcal{N}(\mu_i, \Sigma_i)$, where
$\mu_i = E[x \mid \omega_i]$ is the mean vector in $\omega_i$, and
$\Sigma_i = E\left[(x - \mu_i)(x - \mu_i)^T \mid \omega_i\right]$ is called the covariance matrix.
$\ln(\cdot)$ is monotonic. Define:
$$g_i(x) = \ln\left(p(x \mid \omega_i)P(\omega_i)\right) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) + \ln P(\omega_i) + C_i$$
$$C_i = -\frac{\ell}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i|$$
Example: $\Sigma_i = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$
$$g_i(x) = -\frac{1}{2\sigma^2}(x_1^2 + x_2^2) + \frac{1}{\sigma^2}(\mu_{i1}x_1 + \mu_{i2}x_2) - \frac{1}{2\sigma^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$$
That is, $g_i(x)$ is quadratic, and the surfaces $g_i(x) - g_j(x) = 0$ are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines.
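A minimal sketch of the discriminant $g_i(x)$ above (not from the slides; all class parameters are assumed, with unequal covariances so the decision surface is genuinely quadratic):

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) + ln P(w_i) + C_i."""
    d = x - mu
    C = -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return -0.5 * d @ np.linalg.inv(Sigma) @ d + np.log(prior) + C

# Assumed parameters: unequal covariances -> quadratic decision surface
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sig = [np.eye(2), np.array([[2.0, 0.0], [0.0, 0.5]])]

x = np.array([1.0, 1.0])
scores = [g(x, mu[i], Sig[i], 0.5) for i in range(2)]
print(np.argmax(scores) + 1)   # class with the largest g_i(x)
```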
Decision Hyperplanes
If $\Sigma_i = \Sigma$ (common covariance), the quadratic terms $x^T \Sigma^{-1} x$ are the same in all $g_i(x)$ and can be dropped. Then
$$g_i(x) = w_i^T x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1}\mu_i$$
The discriminant functions are LINEAR.
Let, in addition, $\Sigma = \sigma^2 I$. Then
$$g_i(x) = \frac{1}{\sigma^2}\mu_i^T x + w_{i0}$$
$$g_{ij}(x) = g_i(x) - g_j(x) = w^T(x - x_0) = 0$$
$$w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\!\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$$
Nondiagonal $\Sigma \neq \sigma^2 I$:
$$g_{ij}(x) = w^T(x - x_0) = 0$$
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\!\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}$$
where $\|x\|_{\Sigma^{-1}} \equiv \left(x^T \Sigma^{-1} x\right)^{1/2}$.
The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$, not to $\mu_i - \mu_j$.
Minimum Distance Classifiers
For equiprobable classes, $P(\omega_i) = \frac{1}{M}$, the rule reduces to maximizing
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1}(x - \mu_i)$$
$\Sigma = \sigma^2 I$: assign $x \to \omega_i$ whose Euclidean distance $d_E = \|x - \mu_i\|$ is smaller.
$\Sigma \neq \sigma^2 I$: assign $x \to \omega_i$ whose Mahalanobis distance $d_m = \left((x - \mu_i)^T \Sigma^{-1}(x - \mu_i)\right)^{1/2}$ is smaller.
Example: Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$ and
$$p(x \mid \omega_1) = \mathcal{N}(\mu_1, \Sigma), \quad p(x \mid \omega_2) = \mathcal{N}(\mu_2, \Sigma), \quad \mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$$
classify the vector $x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.
$$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$$
Compute the Mahalanobis distances from $\mu_1, \mu_2$:
$$d_{m,1}^2 = [1.0,\ 2.2]\,\Sigma^{-1}\begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952, \qquad d_{m,2}^2 = [-2.0,\ -0.8]\,\Sigma^{-1}\begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$$
Since $d_{m,1}^2 < d_{m,2}^2$, $x$ is assigned to $\omega_1$, even though its Euclidean distance from $\mu_2$ is smaller.
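The example's numbers can be reproduced directly (sketch, not from the slides):

```python
import numpy as np

Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
Sinv = np.linalg.inv(Sigma)             # -> [[0.95, -0.15], [-0.15, 0.55]]
x = np.array([1.0, 2.2])
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])

d2_1 = (x - mu1) @ Sinv @ (x - mu1)     # -> 2.952
d2_2 = (x - mu2) @ Sinv @ (x - mu2)     # -> 3.672
print(d2_1, d2_2)                       # x is assigned to w1
# Note: the squared Euclidean distance from mu2 is smaller (4.64 < 5.84)
```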
ESTIMATION OF UNKNOWN PDFs: MAXIMUM LIKELIHOOD
With $L(\theta) \equiv \ln \prod_{k=1}^{N} p(x_k; \theta)$ the log-likelihood of the (independent) samples, the ML estimate solves
$$\hat{\theta}_{ML}:\ \frac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N} \frac{1}{p(x_k; \theta)}\,\frac{\partial p(x_k; \theta)}{\partial \theta} = 0$$
If, indeed, there is a $\theta_0$ such that $p(x) = p(x; \theta_0)$, then
$$\lim_{N \to \infty} E[\hat{\theta}_{ML}] = \theta_0, \qquad \lim_{N \to \infty} E\left[\|\hat{\theta}_{ML} - \theta_0\|^2\right] = 0$$
i.e., the estimate is asymptotically unbiased and consistent.
Example: $p(x): \mathcal{N}(\mu, \Sigma)$ with $\mu$ unknown, given samples $x_1, x_2, \ldots, x_N$, $p(x_k) \equiv p(x_k; \mu)$:
$$L(\mu) = \ln \prod_{k=1}^{N} p(x_k; \mu) = C - \frac{1}{2}\sum_{k=1}^{N}(x_k - \mu)^T \Sigma^{-1}(x_k - \mu)$$
where
$$p(x_k; \mu) = \frac{1}{(2\pi)^{\ell/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x_k - \mu)^T \Sigma^{-1}(x_k - \mu)\right)$$
Setting the gradient to zero:
$$\frac{\partial L(\mu)}{\partial \mu} = \sum_{k=1}^{N} \Sigma^{-1}(x_k - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N} x_k$$
Remember: if $A = A^T$, then $\frac{\partial (x^T A x)}{\partial x} = 2Ax$.
Maximum A-Posteriori (MAP) Probability Estimation
In the ML method, $\theta$ was considered to be a parameter. Here we shall look at $\theta$ as a random vector described by a pdf $p(\theta)$, assumed to be known. Given
$$X = \{x_1, x_2, \ldots, x_N\}$$
compute the maximum of $p(\theta \mid X)$. From the Bayes theorem:
$$p(\theta)\,p(X \mid \theta) = p(X)\,p(\theta \mid X) \quad \text{or} \quad p(\theta \mid X) = \frac{p(\theta)\,p(X \mid \theta)}{p(X)}$$
The method:
$$\hat{\theta}_{MAP}:\ \frac{\partial}{\partial \theta}\left(p(\theta)\,p(X \mid \theta)\right) = 0$$
If $p(\theta)$ is uniform or broad enough, $\hat{\theta}_{MAP} \approx \hat{\theta}_{ML}$.
Example: $p(x): \mathcal{N}(\mu, \sigma^2 I)$ with $\mu$ unknown, $X = \{x_1, \ldots, x_N\}$, and prior
$$p(\mu) = \frac{1}{(2\pi)^{\ell/2}\sigma_\mu^{\ell}} \exp\!\left(-\frac{\|\mu - \mu_0\|^2}{2\sigma_\mu^2}\right)$$
MAP:
$$\frac{\partial}{\partial \mu}\ln\!\left(\prod_{k=1}^{N} p(x_k \mid \mu)\,p(\mu)\right) = 0 \quad \text{or} \quad \sum_{k=1}^{N}\frac{1}{\sigma^2}(x_k - \hat{\mu}) - \frac{1}{\sigma_\mu^2}(\hat{\mu} - \mu_0) = 0$$
$$\hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_\mu^2}{\sigma^2}\sum_{k=1}^{N} x_k}{1 + \frac{\sigma_\mu^2}{\sigma^2}N}$$
For $\frac{\sigma_\mu^2}{\sigma^2} \gg 1$ (a broad prior), $\hat{\mu}_{MAP} \approx \hat{\mu}_{ML}$.
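A small sketch contrasting the two estimates (not from the slides; the data, true mean, and hyperparameters are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, sigma_mu2, mu0 = 1.0, 0.25, 0.0         # known variances, prior mean
x = rng.normal(2.0, np.sqrt(sigma2), size=20)   # samples with true mean 2.0

mu_ml = x.mean()                                # mu_ML = (1/N) sum x_k
ratio = sigma_mu2 / sigma2
mu_map = (mu0 + ratio * x.sum()) / (1 + ratio * len(x))

print(mu_ml, mu_map)  # MAP is pulled toward mu0; they coincide as sigma_mu2 grows
```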
Bayesian Inference
$$p(x \mid X) = \int p(x \mid \theta)\,p(\theta \mid X)\,d\theta$$
$$p(\theta \mid X) = \frac{p(X \mid \theta)\,p(\theta)}{p(X)} = \frac{p(X \mid \theta)\,p(\theta)}{\int p(X \mid \theta)\,p(\theta)\,d\theta}, \qquad p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)$$
For the Gaussian example with unknown mean ($p(x \mid \mu) = \mathcal{N}(\mu, \sigma^2)$, prior $\mathcal{N}(\mu_0, \sigma_\mu^2)$), $p(\mu \mid X)$ turns out to be Gaussian with
$$\mu_N = \frac{N\sigma_\mu^2\,\bar{x} + \sigma^2\mu_0}{N\sigma_\mu^2 + \sigma^2}, \qquad \sigma_N^2 = \frac{\sigma_\mu^2\,\sigma^2}{N\sigma_\mu^2 + \sigma^2}, \qquad \bar{x} = \frac{1}{N}\sum_{k=1}^{N} x_k$$
The above is a sequence of Gaussians that become increasingly concentrated as $N \to \infty$.
Maximum Entropy
Compute the pdf that maximizes the entropy, subject to the available constraints:
$$H = -\int p(x)\ln p(x)\,dx$$
The constraint:
$$\int_{x_1}^{x_2} p(x)\,dx = 1$$
Using Lagrange multipliers:
$$H_L = H + \lambda\left(\int_{x_1}^{x_2} p(x)\,dx - 1\right) \;\Rightarrow\; p(x) = \exp(\lambda - 1), \text{ a constant}$$
$$p(x) = \begin{cases} \frac{1}{x_2 - x_1} & x_1 \leq x \leq x_2 \\ 0 & \text{otherwise} \end{cases}$$
Mixture Models
$$p(x) = \sum_{j=1}^{J} p(x \mid j)\,P_j, \qquad \sum_{j=1}^{J} P_j = 1, \qquad \int_x p(x \mid j)\,dx = 1$$
Assume parametric modeling, i.e., $p(x \mid j; \theta)$.
The goal is to estimate $\theta$ and $P_1, P_2, \ldots, P_J$, given a set $X = \{x_1, x_2, \ldots, x_N\}$.
Why not ML, as before?
$$\max_{\theta, P_1, \ldots, P_J} \prod_{k=1}^{N} p(x_k; \theta, P_1, \ldots, P_J)$$
This is a nonlinear problem due to the missing label information; it is a typical problem with an incomplete data set.
General formulation:
Let $y \in Y \subseteq R^m$ be the complete data set, with pdf $p_y(y; \theta)$, which is not observed directly.
We observe $x = g(y) \in X_{ob} \subseteq R^{\ell}$, $\ell < m$, with pdf $p_x(x; \theta)$: a many-to-one transformation.
Let $Y(x) \subseteq Y$ be the set of all $y$'s corresponding to a specific $x$. Then
$$p_x(x; \theta) = \int_{Y(x)} p_y(y; \theta)\,dy$$
The ML estimate would solve
$$\sum_k \frac{\partial \ln\left(p_y(y_k; \theta)\right)}{\partial \theta} = 0$$
but the $y_k$'s are not available.
The algorithm (EM):
E-step: $Q(\theta; \theta(t)) = E\left[\sum_k \ln\left(p_y(y_k; \theta)\right) \mid X; \theta(t)\right]$
M-step: $\theta(t+1):\ \frac{\partial Q(\theta; \theta(t))}{\partial \theta} = 0$
Application to the mixture modeling problem:
Complete data: $(x_k, j_k),\ k = 1, 2, \ldots, N$
Observed data: $x_k,\ k = 1, 2, \ldots, N$
$$p(x_k, j_k; \theta) = p(x_k \mid j_k; \theta)\,P_{j_k}$$
Assuming mutual independence:
$$L(\theta) = \sum_{k=1}^{N} \ln\left(p(x_k \mid j_k; \theta)\,P_{j_k}\right)$$
Unknown parameters: $\Theta = [\theta^T, P^T]^T$, $P = [P_1, P_2, \ldots, P_J]^T$
E-step:
$$Q(\Theta; \Theta(t)) = E\left[\sum_{k=1}^{N} \ln\left(p(x_k \mid j_k; \theta)P_{j_k}\right)\right] = \sum_{k=1}^{N}\sum_{j_k=1}^{J} P(j_k \mid x_k; \Theta(t))\,\ln\left(p(x_k \mid j_k; \theta)P_{j_k}\right)$$
M-step:
$$\frac{\partial Q}{\partial \theta} = 0, \qquad \frac{\partial Q}{\partial P_{j_k}} = 0, \quad j_k = 1, 2, \ldots, J$$
where
$$P(j \mid x_k; \Theta(t)) = \frac{p(x_k \mid j; \theta(t))\,P_j}{p(x_k; \Theta(t))}, \qquad p(x_k; \Theta(t)) = \sum_{j=1}^{J} p(x_k \mid j; \theta(t))\,P_j$$
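As an illustration of these steps, here is a minimal EM sketch for a one-dimensional two-component Gaussian mixture (not from the slides: the data, initialization, and the fixed common variance are all assumed, and only the means and mixing probabilities are re-estimated):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])

J, var = 2, 1.0                        # number of components, fixed variance
mu = np.array([-1.0, 1.0])             # initial means
P = np.full(J, 1.0 / J)                # initial mixing probabilities

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for t in range(50):
    # E-step: P(j | x_k; Theta(t)) for every sample and component
    post = np.stack([P[j] * gauss(x, mu[j], var) for j in range(J)])
    post /= post.sum(axis=0)
    # M-step: re-estimate means and mixing probabilities
    mu = (post * x).sum(axis=1) / post.sum(axis=1)
    P = post.sum(axis=1) / len(x)

print(mu, P)   # means approach 0 and 4; P approaches [0.5, 0.5]
```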
Nonparametric Estimation
Let $k_N$ be the number of points, out of $N$ in total, falling inside an interval of length $h$ centered at $x$. Then
$$p(x) \approx \hat{p}(x) = \frac{1}{h}\,\frac{k_N}{N}$$
the estimate being constant over the interval $\left[x - \frac{h}{2},\ x + \frac{h}{2}\right]$.
Define
$$\phi(x_i) = \begin{cases} 1 & |x_{ij}| \leq \frac{1}{2},\ j = 1, \ldots, \ell \\ 0 & \text{otherwise} \end{cases}$$
That is, it is 1 inside a unit-side hypercube centered at 0.
$$\hat{p}(x) = \frac{1}{h^{\ell}}\,\frac{1}{N}\sum_{i=1}^{N}\phi\!\left(\frac{x_i - x}{h}\right)$$
i.e., $\frac{1}{\text{volume}} \cdot \frac{1}{N} \cdot$ (number of points inside an $h$-side hypercube centered at $x$).
As $h \to 0$, the kernel $\frac{1}{h^{\ell}}\,\phi\!\left(\frac{x' - x}{h}\right)$ tends to a delta function $\delta(x' - x)$: its width shrinks to zero while
$$\frac{1}{h^{\ell}}\int \phi\!\left(\frac{x' - x}{h}\right)dx' = 1$$
Hence
$$E[\hat{p}(x)] = \int_{x'} \delta(x' - x)\,p(x')\,dx' = p(x)$$
[Figure: Parzen estimates for $h = 0.1$, $N = 10000$.]
If $h \to 0$ and $N \to \infty$ with $Nh^{\ell} \to \infty$, the estimate is asymptotically unbiased and consistent.
The method, applied to classification. Remember:
$$\ell_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \gtrless \frac{P(\omega_2)}{P(\omega_1)}\,\frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$
Estimating each likelihood from the training data via Parzen windows, the test (for equal priors and losses) compares
$$\frac{1}{N_1 h^{\ell}}\sum_{i=1}^{N_1}\phi\!\left(\frac{x_i - x}{h}\right) \ \gtrless\ \frac{1}{N_2 h^{\ell}}\sum_{i=1}^{N_2}\phi\!\left(\frac{x_i - x}{h}\right)$$
where the sums run over the training points of $\omega_1$ and $\omega_2$, respectively.
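A minimal sketch of the one-dimensional Parzen estimate (not from the slides; the data are assumed, with $h = 0.1$ and $N = 10000$ echoing the figure above):

```python
import numpy as np

def phi(u):
    """1 inside the unit interval centered at 0, else 0."""
    return (np.abs(u) <= 0.5).astype(float)

def parzen(x, samples, h):
    """p_hat(x) = (1/h)(1/N) sum_i phi((x_i - x)/h)."""
    return phi((samples - x) / h).sum() / (h * len(samples))

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, 10000)   # assumed data; true pdf is N(0,1)
print(parzen(0.0, samples, 0.1))        # approx. 0.3989, the N(0,1) pdf at 0
```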
CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the larger the number of points $N$, the better the resulting estimate. If, in the one-dimensional space, an interval filled with $N$ points is adequate (for good estimation), then in the two-dimensional space the corresponding square will require $N^2$ points, and in the $\ell$-dimensional space the $\ell$-dimensional cube will require $N^{\ell}$ points.
NAIVE BAYES CLASSIFIER
Assume the features to be statistically independent:
$$p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$$
Then only $\ell$ one-dimensional densities have to be estimated per class, which eases the dimensionality problem.
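A minimal naive Bayes sketch under this independence assumption (not from the slides; the per-feature Gaussian model, the training data, and the priors are all assumed):

```python
import numpy as np

def fit(X):
    return X.mean(axis=0), X.var(axis=0)     # per-feature mean and variance

def log_lik(x, mu, var):
    # sum_j ln p(x_j | w_i) for independent Gaussian features
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

rng = np.random.default_rng(3)
X1 = rng.normal([0, 0], 1.0, (100, 2))       # class w1 training data (assumed)
X2 = rng.normal([3, 3], 1.0, (100, 2))       # class w2 training data (assumed)
params = [fit(X1), fit(X2)]

x = np.array([1.0, 2.2])
scores = [log_lik(x, m, v) + np.log(0.5) for m, v in params]
print(np.argmax(scores) + 1)                 # predicted class
```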
In Parzen:
The volume is constant.
The number of points in the volume is varying.
Now ($k$-nearest-neighbor density estimation):
Keep the number of points $k_N = k$ constant.
Leave the volume to be varying:
$$\hat{p}(x) = \frac{k}{N\,V(x)}$$
For a two-class problem, use the same $k$ for both classes and compare the likelihood estimates:
$$\frac{k}{N_1 V_1(x)} \gtrless \frac{k}{N_2 V_2(x)} \quad \Leftrightarrow \quad N_2 V_2(x) \gtrless N_1 V_1(x)$$
The Nearest Neighbor Rule
Choose $k$ out of the $N$ training vectors: identify the $k$ nearest ones to $x$.
Out of these $k$, let $k_i$ be the number belonging to class $\omega_i$.
Assign $x \to \omega_i:\ k_i > k_j\ \forall i \neq j$.
The simplest version: $k = 1$!!!
For large $N$, the error probabilities are bounded in terms of the Bayesian error $P_B$:
$$P_B \leq P_{NN} \leq 2P_B, \qquad P_{3NN} \approx P_B + 3(P_B)^2, \qquad P_{kNN} \to P_B \text{ as } k \to \infty$$
Voronoi tessellation:
$$R_i = \{x : d(x, x_i) < d(x, x_j)\ \forall j \neq i\}$$
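A minimal sketch of the $k$-NN rule (not from the slides; the training data are assumed, and ties are ignored for brevity):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    d = np.linalg.norm(X_train - x, axis=1)     # distances to all N vectors
    nearest = y_train[np.argsort(d)[:k]]        # labels of the k nearest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]            # majority vote: k_i > k_j

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=3))   # likely class 2
```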
BAYESIAN NETWORKS
Bayes Probability Chain Rule:
$$p(x_1, x_2, \ldots, x_{\ell}) = p(x_{\ell} \mid x_{\ell-1}, \ldots, x_1)\,p(x_{\ell-1} \mid x_{\ell-2}, \ldots, x_1)\cdots p(x_2 \mid x_1)\,p(x_1)$$
Assume now that the conditional dependence of each $x_i$ is limited to a subset $A_i$ of the variables $\{x_{i-1}, \ldots, x_1\}$. For example, if $\ell = 6$, then we could assume:
$$p(x_6 \mid x_5, \ldots, x_1) = p(x_6 \mid x_5, x_4)$$
Then:
$$A_6 = \{x_5, x_4\} \subseteq \{x_5, \ldots, x_1\}$$
A graphical way to portray conditional dependencies is given below. According to this figure we have that:
$x_6$ is conditionally dependent on $x_4$, $x_5$;
$x_5$ on $x_4$;
$x_4$ on $x_1$, $x_2$;
$x_3$ on $x_2$;
$x_1$, $x_2$ are conditionally independent of the other variables.
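Given the dependencies listed above, the joint factorizes as $p(x_1)p(x_2)p(x_3 \mid x_2)p(x_4 \mid x_1, x_2)p(x_5 \mid x_4)p(x_6 \mid x_5, x_4)$. The sketch below evaluates it for binary variables; every CPT value is a hypothetical placeholder, not taken from the slides.

```python
from itertools import product

# Hypothetical CPTs: each value is P(node = 1 | parents)
p1 = 0.6                                                     # P(x1 = 1)
p2 = 0.3                                                     # P(x2 = 1)
p3 = {1: 0.7, 0: 0.2}                                        # P(x3 = 1 | x2)
p4 = {(1, 1): 0.9, (1, 0): 0.4, (0, 1): 0.5, (0, 0): 0.1}    # P(x4 = 1 | x1, x2)
p5 = {1: 0.8, 0: 0.2}                                        # P(x5 = 1 | x4)
p6 = {(1, 1): 0.95, (1, 0): 0.6, (0, 1): 0.5, (0, 0): 0.05}  # P(x6 = 1 | x5, x4)

def bern(p, v):
    """P(var = v) from P(var = 1)."""
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1) p(x2) p(x3|x2) p(x4|x1,x2) p(x5|x4) p(x6|x5,x4)."""
    return (bern(p1, x1) * bern(p2, x2) * bern(p3[x2], x3)
            * bern(p4[(x1, x2)], x4) * bern(p5[x4], x5)
            * bern(p6[(x5, x4)], x6))

# Sanity check: the joint sums to 1 over all 2^6 configurations
print(sum(joint(*c) for c in product([0, 1], repeat=6)))  # -> 1.0
```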
The figure below is an example of a Bayesian network corresponding to a paradigm from the medical applications field. This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.
Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities.
Training: Once a topology is given, probabilities are estimated via the training data set. There are also methods that learn the topology.
Example: Consider a Bayesian network forming a chain from node x to node w (the figure with its probability tables is not reproduced here), and the task of computing P(w0|x1). A set of calculations is required that propagates from node x to node w. It turns out that P(w0|x1) = 0.63.
In general, the required inference information is computed via a combined process of message passing among the nodes of the DAG.
Complexity: For singly connected graphs, message-passing algorithms have complexity linear in the number of nodes.
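A sketch of forward message passing on such a chain x → y → z → w. The CPT numbers below are hypothetical placeholders; the slide's 0.63 result came from the tables of its own (not reproduced) figure.

```python
import numpy as np

# Conditional probability tables: rows index the parent value, cols the child
P_y_given_x = np.array([[0.4, 0.6],
                        [0.3, 0.7]])
P_z_given_y = np.array([[0.25, 0.75],
                        [0.6, 0.4]])
P_w_given_z = np.array([[0.45, 0.55],
                        [0.7, 0.3]])

def propagate(x_value):
    """P(w | x) obtained by summing out y and z along the chain."""
    msg = P_y_given_x[x_value]      # P(y | x)
    msg = msg @ P_z_given_y         # P(z | x) = sum_y P(z|y) P(y|x)
    msg = msg @ P_w_given_z         # P(w | x) = sum_z P(w|z) P(z|x)
    return msg

print(propagate(1))                 # -> [P(w=0 | x=1), P(w=1 | x=1)]
```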