PATTERN RECOGNITION
Konstantinos Koutroumbas
Version 2
Typical application areas
Machine vision
Character recognition (OCR)
Computer-aided diagnosis
Speech recognition
Face recognition
Biometrics
Image database retrieval
Data mining
Bioinformatics
The task: assign unknown objects (patterns) to the correct class. This is known as classification.
Each pattern is described by l measured features x_1, \ldots, x_l, collected in the feature vector
x = [x_1, \ldots, x_l]^T \in \mathbb{R}^l
An example: the basic stages involved in the design of a classification system:
sensor → feature generation → feature selection → classifier design → system evaluation
Given the feature vector x = [x_1, x_2, \ldots, x_l]^T, the Bayesian classifier assigns it to the class with the maximum a posteriori probability. That is,
x \rightarrow \omega_i : \; P(\omega_i \mid x) \text{ maximum}
Given the a priori probabilities P(\omega_1), P(\omega_2), \ldots, P(\omega_M) and the class-conditional pdfs p(x \mid \omega_i), i = 1, 2, \ldots, M (the likelihood of x w.r.t. \omega_i), the Bayes rule gives
p(x)\, P(\omega_i \mid x) = p(x \mid \omega_i)\, P(\omega_i)
P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}
where
p(x) = \sum_{i=1}^{M} p(x \mid \omega_i)\, P(\omega_i)
If P(\omega_1 \mid x) > P(\omega_2 \mid x), x is classified to \omega_1.
If P(\omega_2 \mid x) > P(\omega_1 \mid x), x is classified to \omega_2.
Equivalently: classify x according to the rule
p(x \mid \omega_1)\, P(\omega_1) \gtrless p(x \mid \omega_2)\, P(\omega_2)
For equiprobable classes the test becomes
p(x \mid \omega_1) \gtrless p(x \mid \omega_2)
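A minimal Python sketch of this two-class rule; the Gaussian class-conditional pdfs and the equal priors below are illustrative assumptions, not given in the text:

```python
# decide omega_1 if p(x|omega_1) P(omega_1) > p(x|omega_2) P(omega_2), else omega_2
from scipy.stats import norm

P1, P2 = 0.5, 0.5                      # priors (assumed equiprobable)
pdf1 = norm(loc=0.0, scale=1.0).pdf    # p(x|omega_1), assumed N(0,1)
pdf2 = norm(loc=1.0, scale=1.0).pdf    # p(x|omega_2), assumed N(1,1)

def classify(x):
    """Return 1 or 2 according to the Bayes rule."""
    return 1 if pdf1(x) * P1 > pdf2(x) * P2 else 2

print(classify(0.2), classify(0.9))    # -> 1 2
```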
R_1 (\rightarrow \omega_1) and R_2 (\rightarrow \omega_2) are the corresponding decision regions:
If x \in R_1, x is classified to \omega_1.
If x \in R_2, x is classified to \omega_2.
Probability of error (for equiprobable classes, the total shaded area around the threshold x_0 in the corresponding figure):
P_e = \frac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\, dx + \frac{1}{2}\int_{x_0}^{+\infty} p(x \mid \omega_1)\, dx
In the general M-class case, classify x to \omega_i if:
P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i
The classifier can also be designed so as to minimize the average risk. For M = 2, define the loss matrix
L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}
where \lambda_{12} is the penalty (loss) for deciding \omega_2 when the pattern actually belongs to \omega_1, etc.
Risk with respect to \omega_1:
r_1 = \lambda_{11}\int_{R_1} p(x \mid \omega_1)\, dx + \lambda_{12}\int_{R_2} p(x \mid \omega_1)\, dx
Risk with respect to \omega_2:
r_2 = \lambda_{21}\int_{R_1} p(x \mid \omega_2)\, dx + \lambda_{22}\int_{R_2} p(x \mid \omega_2)\, dx
These are probabilities of wrong decisions, weighted by the penalty terms.
Average risk:
r = r_1 P(\omega_1) + r_2 P(\omega_2)
Choose R_1 and R_2 so that r is minimized. Then assign x to \omega_1 if
\ell_1 \equiv \lambda_{11}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{21}\, p(x \mid \omega_2) P(\omega_2)
< \ell_2 \equiv \lambda_{12}\, p(x \mid \omega_1) P(\omega_1) + \lambda_{22}\, p(x \mid \omega_2) P(\omega_2)
Equivalently: assign x to \omega_1 (\omega_2) if
\ell_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > (<) \frac{P(\omega_2)}{P(\omega_1)} \cdot \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}
\ell_{12} is the likelihood ratio.
If P(\omega_1) = P(\omega_2) = \frac{1}{2} and \lambda_{11} = \lambda_{22} = 0:
x \rightarrow \omega_1 \text{ if } p(x \mid \omega_1) > \frac{\lambda_{21}}{\lambda_{12}}\, p(x \mid \omega_2)
x \rightarrow \omega_2 \text{ if } p(x \mid \omega_2) > \frac{\lambda_{12}}{\lambda_{21}}\, p(x \mid \omega_1)
If \lambda_{21} = \lambda_{12}, this reduces to the minimum classification error probability rule.
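A minimal sketch of the likelihood-ratio test; the pdfs, priors, and loss values below are illustrative:

```python
from scipy.stats import norm

P1, P2 = 0.5, 0.5
l11, l12, l21, l22 = 0.0, 0.5, 1.0, 0.0      # loss matrix entries lambda_ij
pdf1 = norm(0.0, 1.0).pdf                     # assumed p(x|omega_1)
pdf2 = norm(1.0, 1.0).pdf                     # assumed p(x|omega_2)

threshold = (P2 * (l21 - l22)) / (P1 * (l12 - l11))   # right-hand side of the test

def classify(x):
    l12_ratio = pdf1(x) / pdf2(x)             # likelihood ratio l_12
    return 1 if l12_ratio > threshold else 2
```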
An example:
p(x \mid \omega_1) = \frac{1}{\sqrt{\pi}} \exp(-x^2)
p(x \mid \omega_2) = \frac{1}{\sqrt{\pi}} \exp(-(x-1)^2)
P(\omega_1) = P(\omega_2) = \frac{1}{2}
L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}
Threshold x_0 for minimum P_e:
x_0: \; \exp(-x^2) = \exp(-(x-1)^2) \;\Rightarrow\; x_0 = \frac{1}{2}
Threshold \hat{x}_0 for minimum r:
\hat{x}_0: \; \exp(-x^2) = 2\exp(-(x-1)^2) \;\Rightarrow\; \hat{x}_0 = \frac{1 - \ln 2}{2} < \frac{1}{2}
Thus \hat{x}_0 moves to the left of x_0 = \frac{1}{2}. (WHY?)
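A quick numerical check of the two thresholds, following directly from the formulas above:

```python
import numpy as np

x0_pe = 0.5                      # minimum-P_e threshold: exp(-x^2) = exp(-(x-1)^2)
x0_r = (1 - np.log(2)) / 2       # minimum-risk threshold: exp(-x^2) = 2 exp(-(x-1)^2)
print(x0_pe, x0_r)               # 0.5  0.1534...  (the risk threshold lies to the left)
```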
DISCRIMINANT FUNCTIONS AND DECISION SURFACES
If R_i, R_j are contiguous, the surface separating them is described by
g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0
with R_i: P(\omega_i \mid x) > P(\omega_j \mid x) on the (+) side and R_j: P(\omega_j \mid x) > P(\omega_i \mid x) on the (-) side.
In general, g_i(x) \equiv f(P(\omega_i \mid x)), with f(\cdot) monotonically increasing, is a discriminant function.
For Gaussian (normal) class-conditional pdfs:
p(x \mid \omega_i) = \frac{1}{(2\pi)^{l/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)
where \mu_i = E[x] is the mean vector in \omega_i and \Sigma_i = E[(x - \mu_i)(x - \mu_i)^T] the covariance matrix in \omega_i.
Since \ln(\cdot) is monotonic, define:
g_i(x) = \ln\left(p(x \mid \omega_i)\, P(\omega_i)\right) = \ln p(x \mid \omega_i) + \ln P(\omega_i)
g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) + \ln P(\omega_i) + C_i
C_i = -\frac{l}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i|
Example:
\Sigma_i = \begin{pmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{pmatrix}
g_i(x) = -\frac{1}{2\sigma_i^2}(x_1^2 + x_2^2) + \frac{1}{\sigma_i^2}(\mu_{i1} x_1 + \mu_{i2} x_2) - \frac{1}{2\sigma_i^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i
That is, g_i(x) is a quadratic function of x and the decision surfaces
g_i(x) - g_j(x) = 0
are quadrics.
Decision Hyperplanes
The quadratic terms x^T \Sigma_i^{-1} x: if ALL \Sigma_i = \Sigma (the same), the quadratic terms are not of interest, since they are not involved in the comparisons. Then, equivalently, we can write:
g_i(x) = w_i^T x + w_{i0}
w_i = \Sigma^{-1} \mu_i
w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i
The discriminant functions are LINEAR.
Let, in addition, \Sigma = \sigma^2 I. Then:
g_i(x) = \frac{1}{\sigma^2}\mu_i^T x + w_{i0}
g_{ij}(x) \equiv g_i(x) - g_j(x) = 0 \;\Rightarrow\; w^T (x - x_0) = 0
with
w = \mu_i - \mu_j
x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right) \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}
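A minimal numeric sketch of the hyperplane parameters for the \Sigma = \sigma^2 I case; the means, variance, and priors below are illustrative values, not taken from the text:

```python
import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([3.0, 3.0])
sigma2, Pi, Pj = 1.0, 0.5, 0.5

w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 * np.log(Pi / Pj) * (mu_i - mu_j) / np.dot(mu_i - mu_j, mu_i - mu_j)

def g_ij(x):                     # decision hyperplane: g_ij(x) = w^T (x - x0) = 0
    return w @ (np.asarray(x) - x0)

print(g_ij(x0))                  # 0: x0 lies on the hyperplane
```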
For nondiagonal \Sigma (\Sigma \neq \sigma^2 I):
g_{ij}(x) = w^T (x - x_0) = 0
with
w = \Sigma^{-1}(\mu_i - \mu_j)
x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right) \frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}
where \|x\|_{\Sigma^{-1}} \equiv (x^T \Sigma^{-1} x)^{1/2}.
The decision hyperplane is not normal to \mu_i - \mu_j; it is normal to \Sigma^{-1}(\mu_i - \mu_j).
For equiprobable classes, P(\omega_i) = \frac{1}{M}, and a common covariance matrix \Sigma, the rule reduces to a minimum distance classifier:
g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1}(x - \mu_i)
If \Sigma = \sigma^2 I: assign x \rightarrow \omega_i for which the Euclidean distance
d_E = \|x - \mu_i\|
is smaller.
If \Sigma \neq \sigma^2 I: assign x \rightarrow \omega_i for which the Mahalanobis distance
d_m = \left((x - \mu_i)^T \Sigma^{-1}(x - \mu_i)\right)^{1/2}
is smaller.
Example: Given \omega_1, \omega_2 with P(\omega_1) = P(\omega_2) and
p(x \mid \omega_1) = N(\mu_1, \Sigma), \quad p(x \mid \omega_2) = N(\mu_2, \Sigma)
\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}
classify the vector x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} using Bayesian classification.
\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}
Compute the squared Mahalanobis distances from the two means:
d_{m,1}^2 = [1.0,\ 2.2]\,\Sigma^{-1}\,[1.0,\ 2.2]^T = 2.952
d_{m,2}^2 = [-2.0,\ -0.8]\,\Sigma^{-1}\,[-2.0,\ -0.8]^T = 3.672
Hence, classify x to \omega_1, since its Mahalanobis distance from \mu_1 is smaller.
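The computation above can be checked numerically; a minimal sketch:

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)          # [[0.95, -0.15], [-0.15, 0.55]]
x = np.array([1.0, 2.2])

d1 = (x - mu1) @ Sigma_inv @ (x - mu1)    # squared Mahalanobis distance from mu_1
d2 = (x - mu2) @ Sigma_inv @ (x - mu2)    # squared Mahalanobis distance from mu_2
print(d1, d2)                             # 2.952, 3.672 -> assign x to omega_1
```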
ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS
Maximum Likelihood: given N independent samples x_1, x_2, \ldots, x_N drawn from p(x; \theta),
p(X; \theta) \equiv p(x_1, x_2, \ldots, x_N; \theta) = \prod_{k=1}^{N} p(x_k; \theta)
\hat{\theta}_{ML} = \arg\max_{\theta} \prod_{k=1}^{N} p(x_k; \theta)
L(\theta) \equiv \ln p(X; \theta) = \sum_{k=1}^{N} \ln p(x_k; \theta)
\hat{\theta}_{ML}: \; \frac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N} \frac{1}{p(x_k; \theta)} \frac{\partial p(x_k; \theta)}{\partial \theta} = 0
If \theta_0 is the true value of the unknown parameter, the ML estimate is asymptotically unbiased and consistent:
\lim_{N \to \infty} E[\hat{\theta}_{ML}] = \theta_0, \qquad \lim_{N \to \infty} E\left[\|\hat{\theta}_{ML} - \theta_0\|^2\right] = 0
Example: p(x) = N(\mu, \Sigma) with \mu unknown. Given x_1, x_2, \ldots, x_N,
p(x_k; \mu) = \frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_k - \mu)^T \Sigma^{-1}(x_k - \mu)\right)
L(\mu) = \sum_{k=1}^{N} \ln p(x_k; \mu) = C - \frac{1}{2}\sum_{k=1}^{N} (x_k - \mu)^T \Sigma^{-1}(x_k - \mu)
\frac{\partial L(\mu)}{\partial \mu} = \sum_{k=1}^{N} \Sigma^{-1}(x_k - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N} x_k
(Remember: if A = A^T, then \frac{\partial (x^T A x)}{\partial x} = 2Ax.)
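A minimal sketch of the result with synthetic Gaussian data (illustrative mean and covariance): the ML estimate of the mean is simply the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=np.eye(2), size=1000)  # x_1..x_N

mu_ml = X.mean(axis=0)          # (1/N) * sum_k x_k
print(mu_ml)                    # close to [1.0, -2.0]
```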
Maximum A Posteriori Probability Estimation (MAP)
In ML, \theta was treated as an unknown constant. Here, \theta is treated as a random vector with prior pdf p(\theta), and X = \{x_1, \ldots, x_N\} is the set of observations. By the Bayes theorem,
p(\theta)\, p(X \mid \theta) = p(\theta \mid X)\, p(X) \quad \text{or} \quad p(\theta \mid X) = \frac{p(\theta)\, p(X \mid \theta)}{p(X)}
The method: \hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid X), or equivalently
\hat{\theta}_{MAP}: \; \frac{\partial}{\partial \theta}\left(p(\theta)\, p(X \mid \theta)\right) = 0
Example: p(x) = N(\mu, \sigma^2 I) with \mu unknown, X = \{x_1, \ldots, x_N\}, and prior
p(\mu) = \frac{1}{(2\pi)^{l/2}\sigma_\mu^{l}} \exp\left(-\frac{\|\mu - \mu_0\|^2}{2\sigma_\mu^2}\right)
\hat{\mu}_{MAP}: \; \frac{\partial}{\partial \mu} \ln\left(\prod_{k=1}^{N} p(x_k \mid \mu)\, p(\mu)\right) = 0 \quad \text{or} \quad \sum_{k=1}^{N} \frac{1}{\sigma^2}(x_k - \mu) - \frac{1}{\sigma_\mu^2}(\mu - \mu_0) = 0
\hat{\mu}_{MAP} = \frac{\mu_0 + \frac{\sigma_\mu^2}{\sigma^2}\sum_{k=1}^{N} x_k}{1 + \frac{\sigma_\mu^2}{\sigma^2} N}
For \frac{\sigma_\mu^2}{\sigma^2} \gg 1, or for N \to \infty:
\hat{\mu}_{MAP} \simeq \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N} x_k
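A small 1-D sketch of the MAP estimate above; \sigma^2, \sigma_\mu^2, \mu_0 and the data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, sigma_mu2, mu0 = 1.0, 4.0, 0.0
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)   # observations

mu_map = (mu0 + (sigma_mu2 / sigma2) * x.sum()) / (1 + (sigma_mu2 / sigma2) * len(x))
mu_ml = x.mean()
print(mu_map, mu_ml)    # nearly equal, since sigma_mu2 >> sigma2 and N is not small
```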
Bayesian Inference
Here, no single estimate of \theta is computed; the goal is p(x \mid X) itself:
p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}
p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)
A bit more insight via an example. Let
p(x \mid \mu) = N(\mu, \sigma^2), \qquad p(\mu) = N(\mu_0, \sigma_0^2)
It turns out that:
p(\mu \mid X) = N(\mu_N, \sigma_N^2)
where
\mu_N = \frac{N\sigma_0^2\, \bar{x} + \sigma^2 \mu_0}{N\sigma_0^2 + \sigma^2}, \qquad \sigma_N^2 = \frac{\sigma^2 \sigma_0^2}{N\sigma_0^2 + \sigma^2}, \qquad \bar{x} = \frac{1}{N}\sum_{k=1}^{N} x_k
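A small numeric sketch of the posterior parameters; the observations and prior parameters below are illustrative:

```python
import numpy as np

sigma2, mu0, sigma02 = 1.0, 0.0, 2.0
x = np.array([1.8, 2.3, 2.0, 1.6, 2.4])     # assumed observations
N, xbar = len(x), x.mean()

mu_N = (N * sigma02 * xbar + sigma2 * mu0) / (N * sigma02 + sigma2)
sigma_N2 = (sigma2 * sigma02) / (N * sigma02 + sigma2)
print(mu_N, sigma_N2)    # posterior p(mu|X) = N(mu_N, sigma_N2)
```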
Maximum Entropy
Entropy:
H = -\int p(x) \ln p(x)\, dx
Assume p(x) is unknown apart from the constraint
\int_{x_1}^{x_2} p(x)\, dx = 1
Maximize H subject to this constraint, using Lagrange multipliers:
H_L = -\int_{x_1}^{x_2} p(x) \ln p(x)\, dx + \lambda\left(\int_{x_1}^{x_2} p(x)\, dx - 1\right)
\frac{\partial H_L}{\partial p(x)} = 0 \;\Rightarrow\; \hat{p}(x) = \exp(\lambda - 1)
\hat{p}(x) = \begin{cases} \frac{1}{x_2 - x_1} & x_1 \le x \le x_2 \\ 0 & \text{otherwise} \end{cases}
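A rough numerical illustration (made-up grid and densities): among densities confined to [x_1, x_2], the uniform one attains the largest entropy:

```python
import numpy as np

x1, x2 = 0.0, 2.0
x = np.linspace(x1, x2, 10001)
dx = x[1] - x[0]

p_unif = np.full_like(x, 1.0 / (x2 - x1))
p_tri = 1.0 - np.abs(x - 1.0)                  # triangular shape on [0, 2]
p_tri /= (p_tri * dx).sum()                    # renormalize so it integrates to 1

H = lambda p: -(p * np.log(p + 1e-12) * dx).sum()
print(H(p_unif), H(p_tri))                     # ~0.693 (= ln 2)  >  ~0.5
```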
Mixture Models
p(x) = \sum_{j=1}^{J} p(x \mid j)\, P_j, \qquad \sum_{j=1}^{J} P_j = 1, \qquad \int_x p(x \mid j)\, dx = 1
Assume parametric modeling of the components, i.e. p(x \mid j; \theta). The goal is to estimate \theta and P_1, P_2, \ldots, P_J, given the set X = \{x_1, x_2, \ldots, x_N\}, i.e.
\max_{\theta, P_1, \ldots, P_J} \prod_{k=1}^{N} p(x_k; \theta, P_1, \ldots, P_J)
This is a nonlinear problem; it is attacked via the EM algorithm below.
The Expectation-Maximization (EM) algorithm
Let y \in Y \subseteq R^m be the complete-data vector, with pdf p_y(y; \theta), and let
x = g(y) \in X_{ob} \subseteq R^l, \quad l < m
be the observed (incomplete) data, with pdf p_x(x; \theta), where
p_x(x; \theta) = \int_{Y(x)} p_y(y; \theta)\, dy
The ML estimate based on the complete data would be
\hat{\theta}_{ML}: \; \sum_k \frac{\partial \ln p_y(y_k; \theta)}{\partial \theta} = 0
But the y_k's are not observed. Here comes the EM: maximize the expectation of the log-likelihood, conditioned on the observed samples and the current iteration estimate of \theta.
The algorithm:
E-step: Q(\theta; \theta(t)) = E\left[\sum_k \ln p_y(y_k; \theta) \,\Big|\, X; \theta(t)\right]
M-step: \theta(t+1): \; \frac{\partial Q(\theta; \theta(t))}{\partial \theta} = 0
Application to the mixture modeling problem:
Complete data: (x_k, j_k), \; k = 1, 2, \ldots, N
Observed data: x_k, \; k = 1, 2, \ldots, N
p(x_k, j_k; \theta) = p(x_k \mid j_k; \theta)\, P_{j_k}
Assuming mutual independence:
L(\theta) = \sum_{k=1}^{N} \ln\left(p(x_k \mid j_k; \theta)\, P_{j_k}\right)
Unknown parameters: \Theta = [\theta^T, P^T]^T, \quad P = [P_1, P_2, \ldots, P_J]^T
E-step:
Q(\Theta; \Theta(t)) = \sum_{k=1}^{N} \sum_{j_k=1}^{J} P(j_k \mid x_k; \Theta(t))\, \ln\left(p(x_k \mid j_k; \theta)\, P_{j_k}\right)
M-step:
\frac{\partial Q}{\partial \theta} = 0, \qquad \frac{\partial Q}{\partial P_{j_k}} = 0, \quad j_k = 1, 2, \ldots, J
where
P(j \mid x_k; \Theta(t)) = \frac{p(x_k \mid j; \theta(t))\, P_j}{p(x_k; \Theta(t))}, \qquad p(x_k; \Theta(t)) = \sum_{j=1}^{J} p(x_k \mid j; \theta(t))\, P_j
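A minimal EM sketch for a two-component 1-D Gaussian mixture, with unit variances kept fixed just to keep the code short; the data and initial values are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 200)])   # observed x_k

mu = np.array([-1.0, 1.0])          # theta(t): component means
P = np.array([0.5, 0.5])            # mixing probabilities P_j

for t in range(100):
    # E-step: P(j | x_k; Theta(t)) for every sample and component
    resp = np.vstack([P[j] * norm(mu[j], 1.0).pdf(X) for j in range(2)])
    resp /= resp.sum(axis=0, keepdims=True)
    # M-step: maximize Q with respect to mu_j and P_j
    mu = (resp * X).sum(axis=1) / resp.sum(axis=1)
    P = resp.mean(axis=1)

print(mu, P)                        # means near 0 and 4, P near 0.6 and 0.4
```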
Nonparametric Estimation
If k_N out of N total points fall inside an interval of length h centered at x, then
P \cong \frac{k_N}{N}, \qquad \hat{p}(x) \cong \hat{p}(\hat{x}) = \frac{1}{h}\frac{k_N}{N}, \quad x - \frac{h}{2} \le \hat{x} \le x + \frac{h}{2}
If p(x) is continuous, \hat{p}(x) \to p(x) as N \to \infty, provided that
h_N \to 0, \qquad k_N \to \infty, \qquad \frac{k_N}{N} \to 0
Parzen Windows
Divide the multidimensional space into hypercubes of side h. Define the window function
\phi(x_i) = \begin{cases} 1 & |x_{ij}| \le \frac{1}{2}, \; j = 1, \ldots, l \\ 0 & \text{otherwise} \end{cases}
that is, \phi is 1 inside the unit-side hypercube centered at the origin and 0 outside. Then
\hat{p}(x) = \frac{1}{h^l}\frac{1}{N}\sum_{i=1}^{N} \phi\left(\frac{x_i - x}{h}\right)
i.e., \hat{p}(x) is \frac{1}{N} \times \frac{1}{h^l} \times (number of points inside an h-side hypercube centered at x).
The problem: p(x) is continuous, while \phi(\cdot) is discontinuous. The solution: use a smooth \phi(x), with
\phi(x) \ge 0, \qquad \int_x \phi(x)\, dx = 1
(Parzen windows, kernels, potential functions.)
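A 1-D sketch of the Parzen estimate with a smooth (Gaussian) kernel; the bandwidth h and the data are illustrative:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """p_hat(x) = (1/(N*h)) * sum_i phi((x_i - x)/h), 1-D, Gaussian phi."""
    u = (samples - x) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.sum() / (len(samples) * h)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 1000)
print(parzen_estimate(0.0, data, h=0.3))   # roughly the true N(0,1) value at 0 (~0.4)
```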
Mean value:
E[\hat{p}(x)] = \frac{1}{h^l}\frac{1}{N}\sum_{i=1}^{N} E\left[\phi\left(\frac{x_i - x}{h}\right)\right] = \frac{1}{h^l}\int_{x'} \phi\left(\frac{x' - x}{h}\right) p(x')\, dx'
As h \to 0:
the width of \phi\left(\frac{x' - x}{h}\right) tends to 0, while \frac{1}{h^l}\int \phi\left(\frac{x' - x}{h}\right) dx' = 1
hence \frac{1}{h^l}\phi\left(\frac{x' - x}{h}\right) \to \delta(x' - x), and
E[\hat{p}(x)] \to \int_{x'} \delta(x' - x)\, p(x')\, dx' = p(x)
Hence, the estimate is unbiased in the limit.
Variance: the smaller the h, the higher the variance of the estimate.
Figure: Parzen estimates for h = 0.1, N = 1000; h = 0.8, N = 1000; and h = 0.1, N = 10000.
If h \to 0 and N \to \infty, the estimate is asymptotically unbiased (and, provided also h^l N \to \infty, its variance tends to zero).
The method applied to classification: remember the likelihood-ratio test
\ell_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > (<) \frac{P(\omega_2)}{P(\omega_1)} \cdot \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}
Approximate the class-conditional pdfs by their Parzen estimates:
\ell_{12} \approx \frac{\frac{1}{N_1 h^l}\sum_{i=1}^{N_1} \phi\left(\frac{x_i - x}{h}\right)}{\frac{1}{N_2 h^l}\sum_{i=1}^{N_2} \phi\left(\frac{x_i - x}{h}\right)}
where the sums in the numerator and denominator run over the training points of \omega_1 and \omega_2, respectively.
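A sketch of this classification rule with Parzen estimates of the two class-conditional pdfs; the training data, h, and the threshold are illustrative (threshold 1 corresponds to equal priors and \lambda_{12} = \lambda_{21}):

```python
import numpy as np

def parzen(x, samples, h):                     # same Gaussian-kernel estimator as above
    u = (samples - x) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum() / (len(samples) * h)

rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 1.0, 500)                 # training samples of omega_1
X2 = rng.normal(2.0, 1.0, 500)                 # training samples of omega_2

def classify(x, h=0.5, threshold=1.0):
    return 1 if parzen(x, X1, h) / parzen(x, X2, h) > threshold else 2

print(classify(0.3), classify(1.8))            # -> 1 2
```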
CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the higher the number of points N, the better the resulting estimate.
If in the one-dimensional space an interval, filled with N points, is adequately filled (for good estimation), then in the two-dimensional space the corresponding square will require N^2 points and, in the l-dimensional space, the l-dimensional cube will require N^l points.
The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high dimensional spaces.
Let x \in R^l and the goal is to estimate p(x \mid \omega_i), i = 1, 2, \ldots, M. For a good estimate of each pdf one would need, say, N^l points.
Assume now that the features x_1, \ldots, x_l are mutually (statistically) independent within each class (the so-called naive Bayes assumption):
p(x \mid \omega_i) = \prod_{j=1}^{l} p(x_j \mid \omega_i)
In this case, one would require roughly N points for each one-dimensional pdf. Thus, a number of points of the order N \cdot l would suffice.
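A sketch of the independence-based estimate: one 1-D Parzen estimate per feature, multiplied together; the data and bandwidth are illustrative:

```python
import numpy as np

def parzen_1d(x, samples, h):
    u = (samples - x) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum() / (len(samples) * h)

rng = np.random.default_rng(0)
X_i = rng.normal(0.0, 1.0, (500, 3))           # N samples of class omega_i, l = 3 features

def p_naive(x, h=0.4):
    """p(x|omega_i) ~= prod_j p(x_j|omega_i), one 1-D estimate per feature."""
    return np.prod([parzen_1d(x[j], X_i[:, j], h) for j in range(X_i.shape[1])])

print(p_naive(np.zeros(3)))                    # ~ (0.4)^3 ~ 0.06
```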
k Nearest Neighbor density estimation: now fix the number of points, k_N = k, and let the volume V(x) around x vary:
\hat{p}(x) = \frac{k}{N\, V(x)}
With \hat{p}(x \mid \omega_1) = \frac{k}{N_1 V_1} and \hat{p}(x \mid \omega_2) = \frac{k}{N_2 V_2}, the likelihood ratio becomes
\ell_{12} \approx \frac{k / (N_1 V_1)}{k / (N_2 V_2)} = \frac{N_2 V_2}{N_1 V_1}
and the test proceeds against the same threshold as before.
61
x i : ki k j i j
PB PNN PB (2
M 1
PB ) 2 PB
62
PB PkNN
2 PNN
PB
k
k , PkNN PB
PNN 2 PB
P3 NN PB 3( PB )
63
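A minimal k-NN rule sketch; the training data and k are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y_train = np.array([1] * 100 + [2] * 100)

def knn_classify(x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)        # distances to all training points
    nearest = y_train[np.argsort(d)[:k]]           # labels of the k nearest neighbours
    return 1 if (nearest == 1).sum() > (nearest == 2).sum() else 2

print(knn_classify(np.array([0.2, 0.1])), knn_classify(np.array([2.8, 3.1])))   # -> 1 2
```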
Voronoi tessellation:
R_i = \{x : d(x, x_i) < d(x, x_j), \; \forall j \neq i\}
BAYESIAN NETWORKS
Bayes Probability Chain Rule:
p(x_1, x_2, \ldots, x_l) = p(x_1) \prod_{i=2}^{l} p(x_i \mid A_i), \qquad A_i \subseteq \{x_{i-1}, x_{i-2}, \ldots, x_1\}
where A_i is the subset of the preceding variables on which x_i is conditionally dependent (for the full chain rule, A_i = \{x_{i-1}, \ldots, x_1\}).
For example, if l = 6, we could assume that
p(x_6 \mid x_5, \ldots, x_1) = p(x_6 \mid x_5, x_4)
Then:
A_6 = \{x_5, x_4\} \subset \{x_5, \ldots, x_1\}
In this way, the joint pdf is written as a product of simpler conditional pdfs, e.g.:
p(x_1, x_2, \ldots, x_6) = p(x_6 \mid x_5, x_4)\, p(x_5 \mid x_4)\, p(x_3 \mid x_2)\, p(x_2)\, p(x_1)
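A sketch of the factorization p(x_1) \prod_i p(x_i \mid A_i) for binary variables; the parent sets and probability tables below are made up purely for illustration:

```python
import numpy as np

# parents[i] = A_i; tables[i] maps (xi, values of A_i) -> probability (made-up numbers)
parents = {1: (), 2: (), 3: (2,), 4: (), 5: (4,), 6: (5, 4)}
tables = {
    1: lambda x, a: 0.6 if x == 1 else 0.4,
    2: lambda x, a: 0.3 if x == 1 else 0.7,
    3: lambda x, a: 0.9 if x == a[0] else 0.1,        # p(x3|x2)
    4: lambda x, a: 0.5,
    5: lambda x, a: 0.8 if x == a[0] else 0.2,        # p(x5|x4)
    6: lambda x, a: 0.7 if x == a[0] else 0.3,        # p(x6|x5,x4)
}

def joint(x):                                          # x = {i: 0 or 1}
    return np.prod([tables[i](x[i], tuple(x[j] for j in parents[i])) for i in range(1, 7)])

print(joint({1: 1, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}))
```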
Bayesian Networks
Definition: a Bayesian network is a directed acyclic graph (DAG) whose nodes correspond to random variables; each node is associated with the conditional probabilities (densities) p(x_i \mid A_i) of the corresponding variable given its parents A_i in the graph.
Example: Consider the Bayesian network of the figure.
Complexity: for singly connected graphs, message-passing algorithms amount to a complexity linear in the number of nodes.