Konstantinos Koutroumbas
Version 3
PATTERN RECOGNITION
Typical application areas
Machine vision
Character recognition (OCR)
Computer aided diagnosis
Speech/Music/Audio recognition
Face recognition
Biometrics
Image database retrieval
Data mining
Social Networks
Bioinformatics
Features: These are measurable quantities obtained from
the patterns, and the classification task is based on their
respective values.
An example: (figure omitted)
The classifier consists of a set of functions whose values, computed at x, determine the class to which the corresponding pattern belongs.
The basic stages of a pattern recognition system:
patterns → sensor → feature generation → feature selection → classifier design → system evaluation
Supervised – unsupervised – semisupervised pattern recognition:
The major directions of learning are:
Supervised: Patterns whose class is known a priori are used for training.
Unsupervised: The number of classes/groups is (in general) unknown and no training patterns are available.
Semisupervised: A mixed set of patterns is available; for some of them the corresponding class is known, and for the rest it is not.
CLASSIFIERS BASED ON BAYES DECISION
THEORY
Assign a pattern $x$ to the class with the maximum a-posteriori probability; that is, $x \rightarrow \omega_i$ if $P(\omega_i \mid x)$ is maximum.
Computation of a-posteriori probabilities
Assume known:
• the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$
• the likelihoods $p(x \mid \omega_i),\ i = 1, 2, \ldots, M$
The Bayes rule (M = 2):
$$p(x)\,P(\omega_i \mid x) = p(x \mid \omega_i)\,P(\omega_i)$$
$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}, \qquad \text{where}\ \ p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$$
The Bayes classification rule (for two classes, M = 2):
Given $x$, classify it according to the rule:
If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, then $x \rightarrow \omega_1$
If $P(\omega_2 \mid x) > P(\omega_1 \mid x)$, then $x \rightarrow \omega_2$
Equivalently (using the Bayes rule): $p(x \mid \omega_1)\,P(\omega_1) \gtrless p(x \mid \omega_2)\,P(\omega_2)$
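As an illustration (not part of the original slides), a minimal Python sketch of this two-class Bayes rule; the Gaussian class-conditional densities and the priors below are arbitrary placeholder assumptions.

```python
import numpy as np

# Assumed (illustrative) class-conditional densities: two 1-D Gaussians
def p_x_given_w1(x):
    return np.exp(-(x - 0.0) ** 2 / 2.0) / np.sqrt(2 * np.pi)

def p_x_given_w2(x):
    return np.exp(-(x - 2.0) ** 2 / 2.0) / np.sqrt(2 * np.pi)

P_w1, P_w2 = 0.5, 0.5  # a-priori probabilities

def bayes_classify(x):
    # Compare p(x|w1)P(w1) with p(x|w2)P(w2); equivalent to comparing the posteriors
    return 1 if p_x_given_w1(x) * P_w1 > p_x_given_w2(x) * P_w2 else 2

print(bayes_classify(0.3))  # -> 1 (x lies closer to the class-1 mean)
print(bayes_classify(1.7))  # -> 2
```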
(Figure: the decision regions $R_1$ ($\omega_1$) and $R_2$ ($\omega_2$).)
Equivalently, in words: divide the space into two regions:
If $x \in R_1$, decide $x \in \omega_1$; if $x \in R_2$, decide $x \in \omega_2$.
Probability of error (the total shaded area, for equiprobable classes):
$$P_e = \frac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \frac{1}{2}\int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$$
For M classes: assign $x$ to $\omega_i$ if $P(\omega_i \mid x) > P(\omega_j \mid x)\ \ \forall j \neq i$.
For M = 2:
• Define the loss matrix
$$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$$
• Risk with respect to $\omega_1$:
$$r_1 = \lambda_{11}\int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12}\int_{R_2} p(x \mid \omega_1)\,dx$$
• Risk with respect to $\omega_2$:
$$r_2 = \lambda_{21}\int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22}\int_{R_2} p(x \mid \omega_2)\,dx$$
• Average risk:
$$r = r_1\,P(\omega_1) + r_2\,P(\omega_2)$$
Choose $R_1$ and $R_2$ so that $r$ is minimized. This leads to the rule: assign $x$ to $\omega_1$ ($\omega_2$) if
$$\ell_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>\; (<)\; \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$
where $\ell_{12}$ is the likelihood ratio.
If $P(\omega_1) = P(\omega_2) = \frac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:
$$x \rightarrow \omega_1 \ \ \text{if}\ \ p(x \mid \omega_1) > \frac{\lambda_{21}}{\lambda_{12}}\, p(x \mid \omega_2)$$
$$x \rightarrow \omega_2 \ \ \text{if}\ \ p(x \mid \omega_2) > \frac{\lambda_{12}}{\lambda_{21}}\, p(x \mid \omega_1)$$
If $\lambda_{21} = \lambda_{12}$, this reduces to the minimum classification error probability rule.
An example:
$$p(x \mid \omega_1) = \frac{1}{\sqrt{\pi}}\exp(-x^2), \qquad p(x \mid \omega_2) = \frac{1}{\sqrt{\pi}}\exp\big(-(x-1)^2\big)$$
$$P(\omega_1) = P(\omega_2) = \frac{1}{2}, \qquad L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$$
Then the threshold value $x_0$ for minimum $P_e$ is:
$$x_0:\ \exp(-x^2) = \exp\big(-(x-1)^2\big) \ \Rightarrow\ x_0 = \frac{1}{2}$$
Thus $\hat{x}_0$ (the threshold minimizing the average risk for the loss matrix above) moves to the left of $x_0 = \frac{1}{2}$. (WHY?)
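A short worked step (added here for completeness) makes the reason explicit: with $\lambda_{12} = 0.5$ and $\lambda_{21} = 1.0$, the minimum-risk rule assigns $x$ to $\omega_1$ when $p(x \mid \omega_1) > (\lambda_{21}/\lambda_{12})\,p(x \mid \omega_2) = 2\,p(x \mid \omega_2)$, so
$$\exp(-x^2) > 2\exp\big(-(x-1)^2\big) \ \Leftrightarrow\ -x^2 > \ln 2 - (x-1)^2 \ \Leftrightarrow\ x < \hat{x}_0 = \frac{1 - \ln 2}{2} \approx 0.15 < \frac{1}{2}.$$
Misclassifying a pattern from $\omega_2$ costs more ($\lambda_{21} > \lambda_{12}$), so region $R_2$ is enlarged at the expense of $R_1$.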
DISCRIMINANT FUNCTIONS
DECISION SURFACES
If $R_i, R_j$ are contiguous, the decision surface separating them is
$$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$$
with $g(x) > 0$ on the $R_i$ side (where $P(\omega_i \mid x) > P(\omega_j \mid x)$) and $g(x) < 0$ on the $R_j$ side (where $P(\omega_j \mid x) > P(\omega_i \mid x)$).
If $f(\cdot)$ is monotonically increasing, the rule remains the same if we use the discriminant functions $g_i(x) \equiv f\big(P(\omega_i \mid x)\big)$.
THE GAUSSIAN DISTRIBUTION
The one-dimensional case:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean value and $\sigma^2$ the variance.
The multivariate (multidimensional) case:
$$p(x) = \frac{1}{(2\pi)^{\ell/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$$
where $\mu = E[x]$ is the mean vector and $\Sigma$ the covariance matrix, with elements such as $\sigma_{12} = E[(x_1 - \mu_1)(x_2 - \mu_2)]$.
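A minimal NumPy sketch (not part of the slides) that evaluates this density; the mean vector and covariance matrix below are arbitrary illustrative values.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x) for an l-dimensional vector x."""
    l = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Illustrative parameters (assumed, not from the slides)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, -0.5]), mu, Sigma))
```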
BAYESIAN CLASSIFIER FOR NORMAL
DISTRIBUTIONS
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{\ell/2}\,|\Sigma_i|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)\right)$$
where $\mu_i = E[x]$ is an $\ell \times 1$ vector, for $x \in \omega_i$, and
$$\Sigma_i = E\big[(x-\mu_i)(x-\mu_i)^T\big]$$
is the covariance matrix.
$\ln(\cdot)$ is monotonic. Define:
$$g_i(x) = \ln\big(p(x \mid \omega_i)\,P(\omega_i)\big) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) + \ln P(\omega_i) + C_i$$
$$C_i = -\frac{\ell}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i|$$
Example: $\Sigma_i = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$
$$g_i(x) = -\frac{1}{2\sigma^2}\big(x_1^2 + x_2^2\big) + \frac{1}{\sigma^2}\big(\mu_{i1}x_1 + \mu_{i2}x_2\big) - \frac{1}{2\sigma^2}\big(\mu_{i1}^2 + \mu_{i2}^2\big) + \ln P(\omega_i) + C_i$$
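A small Python sketch (added for illustration) of the Gaussian Bayesian classifier in its general form: it evaluates $g_i(x)$ for each class and assigns $x$ to the class with the largest value. The class parameters below are assumed placeholders, not values from the slides.

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """Discriminant g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) + ln P(w_i) + C_i."""
    l = len(mu)
    diff = x - mu
    C = -0.5 * l * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma))
    return -0.5 * diff @ np.linalg.inv(Sigma) @ diff + np.log(prior) + C

# Illustrative two-class problem (assumed parameters)
classes = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),  # (mu_1, Sigma_1, P(w_1))
    (np.array([3.0, 3.0]), np.eye(2), 0.5),  # (mu_2, Sigma_2, P(w_2))
]

x = np.array([1.0, 2.2])
scores = [g(x, mu, S, P) for mu, S, P in classes]
print("assigned to class", int(np.argmax(scores)) + 1)
```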
Example 1 and Example 2: (figures omitted)
Decision Hyperplanes
If all classes share the same covariance matrix, $\Sigma_i = \Sigma$, the quadratic terms $x^T \Sigma^{-1} x$ are the same in every $g_i(x)$ and can be dropped. Then
$$g_i(x) = w_i^T x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1}\mu_i$$
and the decision surfaces $g_{ij}(x) \equiv g_i(x) - g_j(x) = 0$ are hyperplanes.
Diagonal case $\Sigma = \sigma^2 I$: $g_{ij}(x) = w^T(x - x_0) = 0$, with
$$w = \mu_i - \mu_j, \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2\,\ln\!\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$$
Remark:
• If $P(\omega_1) = P(\omega_2)$, then $x_0 = \frac{1}{2}(\mu_1 + \mu_2)$
• If $P(\omega_1) \neq P(\omega_2)$, the linear classifier (the separating hyperplane) moves towards the class with the smaller a-priori probability.
Nondiagonal case, $\Sigma \neq \sigma^2 I$:
$$g_{ij}(x) = w^T(x - x_0) = 0$$
where
$$w = \Sigma^{-1}(\mu_i - \mu_j), \qquad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\!\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2_{\Sigma^{-1}}}$$
and $\|x\|_{\Sigma^{-1}} \equiv \big(x^T \Sigma^{-1} x\big)^{1/2}$.
The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$, i.e., not normal to $\mu_i - \mu_j$.
Minimum Distance Classifiers
For equiprobable classes, $P(\omega_i) = \frac{1}{M}$, and common covariance matrix $\Sigma$:
$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1}(x-\mu_i)$$
• $\Sigma = \sigma^2 I$: assign $x \rightarrow \omega_i$ for which the Euclidean distance $d_E = \|x - \mu_i\|$ is smaller.
• $\Sigma \neq \sigma^2 I$: assign $x \rightarrow \omega_i$ for which the Mahalanobis distance $d_m = \big((x-\mu_i)^T \Sigma^{-1}(x-\mu_i)\big)^{1/2}$ is smaller.
Example:
Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$, $p(x \mid \omega_1) = N(\mu_1, \Sigma)$ and $p(x \mid \omega_2) = N(\mu_2, \Sigma)$, where
$$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad \mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix},$$
classify the vector $x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.
$$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$$
Compute the Mahalanobis distances $d_m$ from $\mu_1, \mu_2$:
$$d_{m,1}^2 = (1.0,\ 2.2)\,\Sigma^{-1}\begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952, \qquad d_{m,2}^2 = (-2.0,\ -0.8)\,\Sigma^{-1}\begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$$
Since $d_{m,1} < d_{m,2}$, classify $x$ to $\omega_1$ (note that the Euclidean distance from $\mu_2$ is actually the smaller one).
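A quick NumPy check of these numbers (added for illustration):

```python
import numpy as np

mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
x = np.array([1.0, 2.2])

Sigma_inv = np.linalg.inv(Sigma)          # [[0.95, -0.15], [-0.15, 0.55]]
d2_1 = (x - mu1) @ Sigma_inv @ (x - mu1)  # 2.952
d2_2 = (x - mu2) @ Sigma_inv @ (x - mu2)  # 3.672
print(d2_1, d2_2)  # x goes to the class with the smaller Mahalanobis distance
```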
Maximum Likelihood (ML) Parameter Estimation
Let $x_1, x_2, \ldots, x_N$ be samples drawn from $p(x;\theta)$ and $L(\theta) \equiv \ln\prod_{k=1}^{N} p(x_k;\theta)$ the log-likelihood. Then
$$\hat{\theta}_{ML}:\ \frac{\partial L(\theta)}{\partial\theta} = \sum_{k=1}^{N}\frac{1}{p(x_k;\theta)}\,\frac{\partial p(x_k;\theta)}{\partial\theta} = 0$$
If, indeed, there is a $\theta_0$ such that $p(x) = p(x;\theta_0)$, then
$$\lim_{N \to \infty} E[\hat{\theta}_{ML}] = \theta_0 \qquad \text{(asymptotically unbiased)}$$
$$\lim_{N \to \infty} E\big[\|\hat{\theta}_{ML} - \theta_0\|^2\big] = 0 \qquad \text{(asymptotically consistent)}$$
Example:
$p(x): N(\mu, \Sigma)$ with $\mu$ unknown; given samples $x_1, x_2, \ldots, x_N$, with $p(x_k) \equiv p(x_k;\mu)$ and
$$p(x_k;\mu) = \frac{1}{(2\pi)^{\ell/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x_k-\mu)^T \Sigma^{-1}(x_k-\mu)\right)$$
$$L(\mu) = \ln\prod_{k=1}^{N} p(x_k;\mu) = C - \frac{1}{2}\sum_{k=1}^{N}(x_k-\mu)^T \Sigma^{-1}(x_k-\mu)$$
$$\frac{\partial L(\mu)}{\partial\mu} = \sum_{k=1}^{N}\Sigma^{-1}(x_k-\mu) = 0 \ \Rightarrow\ \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N} x_k$$
Remember: if $A = A^T$, then $\dfrac{\partial(\theta^T A\,\theta)}{\partial\theta} = 2A\theta$.
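A tiny NumPy illustration (not from the slides): for Gaussian data the ML estimate of the mean is simply the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])

# Draw N samples from N(true_mu, Sigma) and compute the ML estimate (sample mean)
X = rng.multivariate_normal(true_mu, Sigma, size=1000)
mu_ml = X.mean(axis=0)
print(mu_ml)  # close to [1.0, -2.0] for large N (asymptotically unbiased)
```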
Maximum Aposteriori Probability Estimation
In the ML method, $\theta$ was considered an unknown but fixed parameter. Here we look at $\theta$ as a random vector described by a pdf $p(\theta)$, assumed to be known.
Given $X = \{x_1, x_2, \ldots, x_N\}$:
$$\hat{\theta}_{MAP}:\ \frac{\partial}{\partial\theta}\big(p(\theta)\,p(X \mid \theta)\big) = 0$$
If $p(\theta)$ is uniform or broad enough, $\hat{\theta}_{MAP} \simeq \hat{\theta}_{ML}$.
Example:
$p(x): N(\mu, \sigma^2 I)$ with $\mu$ unknown, $X = \{x_1, \ldots, x_N\}$, and prior
$$p(\mu) = \frac{1}{(2\pi)^{\ell/2}\sigma_\mu^{\ell}}\exp\left(-\frac{\|\mu - \mu_0\|^2}{2\sigma_\mu^2}\right)$$
MAP: $\dfrac{\partial}{\partial\mu}\ln\Big(\prod_{k=1}^{N} p(x_k \mid \mu)\,p(\mu)\Big) = 0$, or
$$\sum_{k=1}^{N}\frac{1}{\sigma^2}(x_k - \hat{\mu}) - \frac{1}{\sigma_\mu^2}(\hat{\mu} - \mu_0) = 0 \ \Rightarrow\ \hat{\mu}_{MAP} = \frac{\mu_0 + \dfrac{\sigma_\mu^2}{\sigma^2}\displaystyle\sum_{k=1}^{N}x_k}{1 + \dfrac{\sigma_\mu^2}{\sigma^2}N}$$
For $\dfrac{\sigma_\mu^2}{\sigma^2} \gg 1$, or for $N \to \infty$:
$$\hat{\mu}_{MAP} \simeq \hat{\mu}_{ML} = \frac{1}{N}\sum_{k=1}^{N}x_k$$
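A short NumPy sketch (added as an illustration, with assumed values for $\mu_0$, $\sigma_\mu^2$ and $\sigma^2$) showing how $\hat{\mu}_{MAP}$ shrinks towards the prior mean for small $N$ and approaches $\hat{\mu}_{ML}$ as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma2 = 2.0, 1.0   # data: x_k ~ N(mu_true, sigma2), 1-D case
mu0, sigma_mu2 = 0.0, 0.5    # assumed prior on mu: N(mu0, sigma_mu2)

for N in (2, 10, 1000):
    x = rng.normal(mu_true, np.sqrt(sigma2), size=N)
    ratio = sigma_mu2 / sigma2
    mu_map = (mu0 + ratio * x.sum()) / (1 + ratio * N)
    mu_ml = x.mean()
    print(N, round(mu_map, 3), round(mu_ml, 3))  # mu_map -> mu_ml as N grows
```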
Bayesian Inference
$$p(x \mid X) = \int p(x \mid \theta)\,p(\theta \mid X)\,d\theta$$
$$p(\theta \mid X) = \frac{p(X \mid \theta)\,p(\theta)}{p(X)} = \frac{p(X \mid \theta)\,p(\theta)}{\int p(X \mid \theta)\,p(\theta)\,d\theta}$$
$$p(X \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)$$
The previous formulae correspond to a sequence of Gaussians $p(\mu \mid X)$ for different values of $N$ (true mean $\mu = 2$).
Maximum Entropy Method
Compute the pdf that is maximally non-committal with respect to the unavailable information, while being constrained to respect the available information.
Entropy: $H = -\int p(x)\,\ln p(x)\,dx$
Example: suppose the only available information is that $x$ is nonzero only in the interval $x_1 \le x \le x_2$.
• The constraint: $\int_{x_1}^{x_2} p(x)\,dx = 1$
• Lagrange multipliers: $H_L = H - \lambda\left(\int_{x_1}^{x_2} p(x)\,dx - 1\right)$
• Setting the derivative with respect to $p(x)$ equal to zero gives $\hat{p}(x) = \exp(\lambda - 1)$, hence
$$\hat{p}(x) = \begin{cases}\dfrac{1}{x_2 - x_1}, & x_1 \le x \le x_2 \\ 0, & \text{otherwise}\end{cases}$$
• This is most natural: the “most random” pdf is the uniform one, which complies with the Maximum Entropy rationale.
Mixture Models
$$p(x) = \sum_{j=1}^{J} p(x \mid j)\,P_j, \qquad \sum_{j=1}^{J} P_j = 1, \qquad \int_x p(x \mid j)\,dx = 1$$
• General formulation (the setting of the Expectation–Maximization, EM, algorithm):
– $y$: the complete data set, $y \in Y \subseteq R^m$, with pdf $p_y(y;\theta)$; the $y$'s are not observed directly.
– We observe $x = g(y) \in X_{ob} \subseteq R^{\ell}$, $\ell < m$, with pdf $p_x(x;\theta)$.
– The ML estimate based on the complete data would be
$$\hat{\theta}_{ML}:\ \sum_k \frac{\partial \ln p_y(y_k;\theta)}{\partial\theta} = 0$$
but the $y_k$'s are not available, so the expectation of the log-likelihood, conditioned on the observations, is maximized instead.
The algorithm:
• E-step: $Q(\theta;\theta(t)) = E\Big[\sum_k \ln p_y(y_k;\theta)\ \Big|\ X;\theta(t)\Big]$
• M-step: $\theta(t+1):\ \dfrac{\partial Q(\theta;\theta(t))}{\partial\theta} = 0$
Application to mixture models: the complete data are $(x_k, j_k)$, where $j_k$ is the (unobserved) index of the mixture component that generated $x_k$:
• $p(x_k, j_k;\Theta) = p(x_k \mid j_k;\theta)\,P_{j_k}$, with $\Theta = [\theta^T, P^T]^T$, $P = [P_1, P_2, \ldots, P_J]^T$
• E-step:
$$Q(\Theta;\Theta(t)) = \sum_{k=1}^{N}E\big[\ln\big(p(x_k \mid j_k;\theta)\,P_{j_k}\big)\big] = \sum_{k=1}^{N}\sum_{j_k=1}^{J} P(j_k \mid x_k;\Theta(t))\,\ln\big(p(x_k \mid j_k;\theta)\,P_{j_k}\big)$$
• M-step: $\dfrac{\partial Q}{\partial\theta} = 0$, $\dfrac{\partial Q}{\partial P_{j_k}} = 0$, $j_k = 1, 2, \ldots, J$
where
$$P(j \mid x_k;\Theta(t)) = \frac{p(x_k \mid j;\theta(t))\,P_j}{p(x_k;\Theta(t))}, \qquad p(x_k;\Theta(t)) = \sum_{j=1}^{J} p(x_k \mid j;\theta(t))\,P_j$$
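As an illustrative sketch (not from the slides), the EM iterations for a one-dimensional mixture of J Gaussians, using the standard closed-form M-step updates for that particular model; the initialization and the toy data are arbitrary assumptions.

```python
import numpy as np

def em_gaussian_mixture(x, J=2, iters=100):
    """EM for a 1-D Gaussian mixture: returns mixing weights P_j, means mu_j, variances var_j."""
    rng = np.random.default_rng(0)
    N = len(x)
    P = np.full(J, 1.0 / J)                    # mixing probabilities P_j
    mu = rng.choice(x, size=J, replace=False)  # initial component means
    var = np.full(J, np.var(x))                # initial variances
    for _ in range(iters):
        # E-step: posterior P(j | x_k; theta(t)) for every sample and component
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        post = dens * P
        post /= post.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for the Gaussian mixture case
        Nj = post.sum(axis=0)
        P = Nj / N
        mu = (post * x[:, None]).sum(axis=0) / Nj
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
    return P, mu, var

# Usage: toy data drawn from two Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 700)])
print(em_gaussian_mixture(x, J=2))
```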
Nonparametric Estimation
$$P \approx \frac{k_N}{N} \qquad (k_N:\ \text{number of points inside the segment};\ N:\ \text{total number of points})$$
$$\hat{p}(x) \approx \hat{p}(\hat{x}) = \frac{1}{h}\,\frac{k_N}{N}, \qquad |x - \hat{x}| \le \frac{h}{2}$$
In words: place a segment of length $h$ at $\hat{x}$ and count the points inside it.
If $p(x)$ is continuous, $\hat{p}(x) \to p(x)$ as $N \to \infty$, provided that
$$h_N \to 0, \qquad k_N \to \infty, \qquad \frac{k_N}{N} \to 0$$
Parzen Windows
Place at $x$ a hypercube of side length $h$ and count the points inside it.
Define
$$\phi(x_i) = \begin{cases}1, & |x_{ij}| \le \frac{1}{2},\ \ j = 1, \ldots, \ell \\ 0, & \text{otherwise}\end{cases}$$
That is, it equals 1 inside a unit-side hypercube centered at 0.
• $$\hat{p}(x) = \frac{1}{h^{\ell}}\,\frac{1}{N}\sum_{i=1}^{N}\phi\!\left(\frac{x_i - x}{h}\right)$$
• That is, $\dfrac{1}{\text{volume}}\cdot\dfrac{1}{N}\cdot$ (number of points inside an $h$-side hypercube centered at $x$).
• As $h \to 0$: $\dfrac{1}{h^{\ell}} \to \infty$, while the width of $\phi\!\left(\dfrac{x' - x}{h}\right) \to 0$
• $\displaystyle\int \frac{1}{h^{\ell}}\,\phi\!\left(\frac{x' - x}{h}\right)dx' = 1$
• Hence, as $h \to 0$: $\dfrac{1}{h^{\ell}}\,\phi\!\left(\dfrac{x' - x}{h}\right) \to \delta(x' - x)$, and
$$E[\hat{p}(x)] = \int_{x'}\frac{1}{h^{\ell}}\,\phi\!\left(\frac{x' - x}{h}\right)p(x')\,dx' \xrightarrow[h \to 0]{} p(x)$$
(Figure: Parzen estimate with h = 0.1, N = 10000.)
If
• $h \to 0$,
• $N \to \infty$,
• $hN \to \infty$,
the estimate is asymptotically unbiased.
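A small NumPy sketch of the Parzen estimate with the hypercube kernel above (an illustration, not part of the slides); the value of h and the sample are arbitrary.

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen window density estimate at point x (l-dimensional), hypercube kernel."""
    samples = np.atleast_2d(samples)                            # shape (N, l)
    N, l = samples.shape
    inside = np.all(np.abs((samples - x) / h) <= 0.5, axis=1)   # phi((x_i - x)/h)
    return inside.sum() / (N * h ** l)

# Usage: estimate a 1-D standard normal density at a few points
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=10000).reshape(-1, 1)
for x0 in (0.0, 1.0, 2.0):
    print(x0, parzen_estimate(np.array([x0]), data, h=0.1))
```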
The method:
• Remember:
$$\ell_{12} = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \ \gtrless\ \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$
• Estimate the two densities by Parzen windows, using the $N_1$ training points of $\omega_1$ and the $N_2$ training points of $\omega_2$:
$$\frac{\dfrac{1}{N_1 h^{\ell}}\displaystyle\sum_{i=1}^{N_1}\phi\!\left(\dfrac{x_i - x}{h}\right)}{\dfrac{1}{N_2 h^{\ell}}\displaystyle\sum_{i=1}^{N_2}\phi\!\left(\dfrac{x_i - x}{h}\right)} \ \gtrless\ \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$
CURSE OF DIMENSIONALITY
In all the methods so far, we saw that the higher the number of points N, the better the resulting estimate. In high-dimensional spaces, however, the number of points needed for a good estimate grows very fast with the dimensionality $\ell$ (the curse of dimensionality).
An example: (figure omitted)
NAIVE – BAYES CLASSIFIER
Assume that, given the class, the features are statistically independent:
$$p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$$
K Nearest Neighbor density estimation
In Parzen:
• The volume is constant.
• The number of points inside the volume varies.
Now:
• Keep the number of points $k_N = k$ constant, and let the volume $V(x)$ centered at $x$ grow until it contains $k$ points; then $\hat{p}(x) = \dfrac{k}{N\,V(x)}$.
• Using the training points of each class, $\hat{p}(x \mid \omega_1) = \dfrac{k}{N_1 V_1}$ and $\hat{p}(x \mid \omega_2) = \dfrac{k}{N_2 V_2}$, so the likelihood ratio test becomes
$$\ell_{12} \approx \frac{k / (N_1 V_1)}{k / (N_2 V_2)} = \frac{N_2 V_2}{N_1 V_1}\ \gtrless\ \text{(the same threshold as before)}$$
The Nearest Neighbor Rule
Choose, out of the N training vectors, the k nearest ones to x. Out of these k, let $k_i$ be the number that belong to class $\omega_i$.
Assign $x \rightarrow \omega_i:\ k_i > k_j\ \ \forall j \neq i$
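A minimal Python sketch of the k-nearest-neighbor rule (added for illustration; the training data and the value of k are assumed placeholders):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the class with the most representatives among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all training vectors
    nearest = np.argsort(dists)[:k]               # indices of the k nearest ones
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Usage with toy data
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.8, 3.1]])
y_train = np.array([1, 1, 2, 2])
print(knn_classify(np.array([0.3, 0.4]), X_train, y_train, k=3))  # -> 1
```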
Voronoi tessellation
$$R_i = \{x:\ d(x, x_i) < d(x, x_j)\ \ \forall j \neq i\}$$
BAYESIAN NETWORKS
Bayes Probability Chain Rule:
$$p(x_1, x_2, \ldots, x_{\ell}) = p(x_{\ell} \mid x_{\ell-1}, \ldots, x_1)\,p(x_{\ell-1} \mid x_{\ell-2}, \ldots, x_1)\cdots p(x_2 \mid x_1)\,p(x_1)$$
For example, if ℓ = 6, then we could assume:
$$p(x_6 \mid x_5, \ldots, x_1) = p(x_6 \mid x_5, x_4)$$
Then:
$$A_6 = \{x_5, x_4\} \subset \{x_5, \ldots, x_1\}$$
A graphical way to portray conditional dependencies
is given below
According to this figure we
have that:
• x6 is conditionally dependent on
x4, x5
• x5 on x4
• x4 on x1, x2
• x3 on x2
• x1, x2 are conditionally independent of the other variables.
The figure below is an example of a Bayesian
Network corresponding to a paradigm from the
medical applications field.
This Bayesian network
models conditional
dependencies for an
example concerning
smokers (S),
tendencies to develop
cancer (C) and heart
disease (H), together
with variables
corresponding to heart
(H1, H2) and cancer
(C1, C2) medical tests.
Once a DAG has been constructed, the joint
probability can be obtained by multiplying the
marginal (root nodes) and the conditional (non-root
nodes) probabilities.
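For instance, for the DAG discussed earlier (x6 depending on x4 and x5; x5 on x4; x4 on x1 and x2; x3 on x2), the joint probability factorizes as:
$$p(x_1, \ldots, x_6) = p(x_6 \mid x_4, x_5)\,p(x_5 \mid x_4)\,p(x_4 \mid x_1, x_2)\,p(x_3 \mid x_2)\,p(x_1)\,p(x_2)$$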
Example: Consider the Bayesian network of the figure (figure omitted):
For (a), a set of calculations is required that propagates from node x to node w. It turns out that P(w0 | x1) = 0.63.
Complexity:
For singly connected graphs, message-passing algorithms have complexity linear in the number of nodes.