
Note to other teachers and users of these slides.

Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Hidden Markov Models


Andrew W. Moore Professor School of Computer Science Carnegie Mellon University
www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599

Copyright Andrew W. Moore

Slide 1

A Markov System
Has N states, called s1, s2 .. sN

s2

There are discrete timesteps, t=0, t=1,

s1
N=3 t=0

s3

Copyright Andrew W. Moore

Slide 2

A Markov System
Has N states, called s1, s2 .. sN

s2
Current State

There are discrete timesteps, t=0, t=1, … On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, .., sN}

s1
N=3 t=0 qt=q0=s3

s3

Copyright Andrew W. Moore

Slide 3

A Markov System
Current State

Has N states, called s1, s2 .. sN

s2

There are discrete timesteps, t=0, t=1, … On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, .., sN}. Between each timestep, the next state is chosen randomly.

s1
N=3 t=1 qt=q1=s2

s3

Copyright Andrew W. Moore

Slide 4

P(qt+1=s1|qt=s2) = 1/2 P(qt+1=s2|qt=s2) = 1/2 P(qt+1=s3|qt=s2) = 0 P(qt+1=s1|qt=s1) = 0 P(qt+1=s2|qt=s1) = 0 P(qt+1=s3|qt=s1) = 1

A Markov System
Has N states, called s1, s2 .. sN. There are discrete timesteps, t=0, t=1, … On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, .., sN}. Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state.
Slide 5

s2

s1
N=3 t=1 qt=q1=s2

s3
P(qt+1=s1|qt=s3) = 1/3 P(qt+1=s2|qt=s3) = 2/3 P(qt+1=s3|qt=s3) = 0

Copyright Andrew W. Moore

P(qt+1=s1|qt=s2) = 1/2 P(qt+1=s2|qt=s2) = 1/2 P(qt+1=s3|qt=s2) = 0 P(qt+1=s1|qt=s1) = 0 P(qt+1=s2|qt=s1) = 0 P(qt+1=s3|qt=s1) = 1


1/2 2/3

A Markov System
Has N states, called s1, s2 .. sN. There are discrete timesteps, t=0, t=1, … On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2, .., sN}. Between each timestep, the next state is chosen randomly. The current state determines the probability distribution for the next state.
Slide 6

s2

1/2

s1
N=3 t=1 qt=q1=s2

1/3 1

s3

P(qt+1=s1|qt=s3) = 1/3 P(qt+1=s2|qt=s3) = 2/3 P(qt+1=s3|qt=s3) = 0


Often notated with arcs between states

Copyright Andrew W. Moore

P(qt+1=s1|qt=s2) = 1/2 P(qt+1=s2|qt=s2) = 1/2 P(qt+1=s3|qt=s2) = 0 P(qt+1=s1|qt=s1) = 0 P(qt+1=s2|qt=s1) = 0 P(qt+1=s3|qt=s1) = 1


1/2 2/3

Markov Property
qt+1 is conditionally independent of {qt-1, qt-2, …, q1, q0} given qt. In other words: P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history). Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4)?

s2

1/2

s1
N=3 t=1 qt=q1=s2

1/3 1

s3

P(qt+1=s1|qt=s3) = 1/3 P(qt+1=s2|qt=s3) = 2/3 P(qt+1=s3|qt=s3) = 0

Copyright Andrew W. Moore

Slide 7

Markov Property (answer)

Answer: the chain q0 → q1 → q2 → q3 → q4.

qt+1 is conditionally independent of {qt-1, qt-2, …, q1, q0} given qt. In other words: P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history). Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4)?

(State diagram, N=3, t=1, qt=q1=s2, and transition probabilities as on the previous slides.)

Copyright Andrew W. Moore

Slide 8

Markov Property: Notation

qt+1 is conditionally independent of {qt-1, qt-2, …, q1, q0} given qt. In other words:
P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)

Notation: aij = P(qt+1 = sj | qt = si)

The transition probabilities form an N x N table (row i is the distribution over next states from si; each of these probability tables is identical at every timestep):

         P(qt+1=s1|qt=si)  P(qt+1=s2|qt=si)  ...  P(qt+1=sj|qt=si)  ...  P(qt+1=sN|qt=si)
i = 1         a11               a12          ...       a1j         ...       a1N
i = 2         a21               a22          ...       a2j         ...       a2N
i = 3         a31               a32          ...       a3j         ...       a3N
  :            :                 :                       :                     :
i = N         aN1               aN2          ...       aNj         ...       aNN

(The example state diagram, N=3, t=1, qt=q1=s2, with the same transition probabilities as before, and the answer Bayes net q0 → q1 → q2 → q3 → q4, are repeated on this slide.)
Slide 9

Copyright Andrew W. Moore
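To make the notation concrete, here is a minimal sketch (not part of the original slides) that encodes the 3-state example above as a transition matrix and samples a state sequence from it; state indices 0, 1, 2 stand for s1, s2, s3.

```python
# A minimal sketch (not from the slides): the 3-state Markov system above as a
# transition matrix, plus a sampler that draws a trajectory q0, q1, ..., qT.
import numpy as np

# A[i, j] = aij = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}), values from the earlier slides.
A = np.array([
    [0.0, 0.0, 1.0],   # from s1: always go to s3
    [0.5, 0.5, 0.0],   # from s2
    [1/3, 2/3, 0.0],   # from s3
])
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a probability distribution

def sample_path(A, q0, T, seed=0):
    """Sample q0..qT; the next state depends only on the current one (Markov property)."""
    rng = np.random.default_rng(seed)
    path = [q0]
    for _ in range(T):
        path.append(int(rng.choice(len(A), p=A[path[-1]])))
    return path

print(sample_path(A, q0=2, T=10))   # start in s3 (index 2), as on Slide 2
```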

A Blind Robot
A human and a robot wander around randomly on a grid

(Grid world figure: Robot R and Human H each occupy one cell.)

Note: N (number of states) = 18 x 18 = 324

Slide 10

STATE q = (Location of Robot, Location of Human)

Copyright Andrew W. Moore

Dynamics of System
q0 =

R H

Typical Questions: What's the expected time until the human is crushed like a bug? What's the probability that the robot will hit the left wall before it hits the human? What's the probability Robot crushes human on next time step?
Copyright Andrew W. Moore

Each timestep the human moves randomly to an adjacent cell. And Robot also moves randomly to an adjacent cell.

Slide 11

Example Question
It's currently time t, and human remains uncrushed. What's the probability of crushing occurring at time t + 1? If robot is blind: we can compute this in advance. If robot is omnipotent (i.e. if robot knows state at time t): can compute directly. If robot has some sensors, but incomplete state information: Hidden Markov Models are applicable!
Copyright Andrew W. Moore Slide 12

(Annotations on the three cases: blind, we'll do this first; omnipotent, too easy, we won't do this; sensors, main body of lecture.)

What is P(qt =s)? slow, stupid answer


Step 1: Work out how to compute P(Q) for any path Q = q1 q2 q3 .. qt.
Given we know the start state q1 (i.e. P(q1)=1):
P(q1 q2 .. qt) = P(q1 q2 .. qt-1) P(qt | q1 q2 .. qt-1)     WHY?
             = P(q1 q2 .. qt-1) P(qt | qt-1)
             = P(q2|q1) P(q3|q2) … P(qt|qt-1)
Step 2: Use this knowledge to get P(qt = s)

Computation is exponential in t.

P(qt = s) = Σ over paths Q of length t that end in s of P(Q)

Copyright Andrew W. Moore

Slide 13

What is P(qt =s) ? Clever answer


For each state si, define pt(i) = Prob. state is si at time t = P(qt = si) Easy to do inductive definition

p0(i) = ?

pt+1(j) = P(qt+1 = sj) = ?

Copyright Andrew W. Moore

Slide 14

What is P(qt =s) ? Clever answer


For each state si, define pt(i) = Prob. state is si at time t = P(qt = si). Easy to do inductive definition:

p0(i) = 1 if si is the start state, 0 otherwise

pt+1(j) = P(qt+1 = sj) = ?

Copyright Andrew W. Moore

Slide 15

What is P(qt =s) ? Clever answer


For each state si, define pt(i) = Prob. state is si at time t = P(qt = si). Easy to do inductive definition:

p0(i) = 1 if si is the start state, 0 otherwise

pt+1(j) = P(qt+1 = sj) = Σi=1..N P(qt+1 = sj ∧ qt = si) = ?

Copyright Andrew W. Moore

Slide 16

What is P(qt =s) ? Clever answer


For each state si, define pt(i) = Prob. state is si at time t = P(qt = si). Easy to do inductive definition:

p0(i) = 1 if si is the start state, 0 otherwise

pt+1(j) = P(qt+1 = sj)
        = Σi=1..N P(qt+1 = sj ∧ qt = si)
        = Σi=1..N P(qt+1 = sj | qt = si) P(qt = si)
        = Σi=1..N aij pt(i)

Remember, aij = P(qt+1 = sj | qt = si)

Copyright Andrew W. Moore
Slide 17

What is P(qt =s) ? Clever answer


For each state si, define pt(i) = Prob. state is si at time t = P(qt = si). Easy to do inductive definition:

p0(i) = 1 if si is the start state, 0 otherwise

Computation is simple. Just fill in this table in this order:

t        pt(1)   pt(2)   ...   pt(N)
0          0       1     ...     0      (row t=0 has a 1 in the start state's column, 0 elsewhere)
1
:
tfinal

pt+1(j) = P(qt+1 = sj) = Σi=1..N P(qt+1 = sj ∧ qt = si) = Σi=1..N P(qt+1 = sj | qt = si) P(qt = si) = Σi=1..N aij pt(i)

Copyright Andrew W. Moore
Slide 18

What is P(qt =s) ? Clever answer


For each state si, define pt(i) = Prob. state is si at time t = P(qt = si). Easy to do inductive definition:

p0(i) = 1 if si is the start state, 0 otherwise

pt+1(j) = P(qt+1 = sj) = Σi=1..N P(qt+1 = sj ∧ qt = si) = Σi=1..N P(qt+1 = sj | qt = si) P(qt = si) = Σi=1..N aij pt(i)

Cost of computing pt(i) for all states si is now O(t N²). The stupid way was O(N^t). This was a simple example. It was meant to warm you up to this trick, called Dynamic Programming, because HMMs do many tricks like this.

Copyright Andrew W. Moore
Slide 19
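A minimal sketch (not part of the slides) of the dynamic-programming table just described, reusing the 3-state transition matrix A from the sketch after Slide 9; row t of p holds pt(1), …, pt(N).

```python
import numpy as np

# aij = P(q_{t+1}=s_j | q_t=s_i) for the 3-state example used earlier.
A = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0],
              [1/3, 2/3, 0.0]])

def state_marginals(A, start_state, t_final):
    """Fill in the table row by row: p_{t+1}(j) = sum_i a_ij p_t(i).  Cost O(t N^2)."""
    N = len(A)
    p = np.zeros((t_final + 1, N))
    p[0, start_state] = 1.0            # p_0(i) = 1 iff s_i is the start state
    for t in range(t_final):
        p[t + 1] = A.T @ p[t]          # one O(N^2) update per timestep
    return p

p = state_marginals(A, start_state=2, t_final=5)   # start in s3 (0-indexed)
print(p[5])          # P(q_5 = s_i) for each i; every row of the table sums to 1
```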

Hidden State
It's currently time t, and human remains uncrushed. What's the probability of crushing occurring at time t + 1? If robot is blind: we can compute this in advance. If robot is omnipotent (i.e. if robot knows state at time t): can compute directly. If robot has some sensors, but incomplete state information: Hidden Markov Models are applicable!
Copyright Andrew W. Moore Slide 20

(Annotations on the three cases: blind, we'll do this first; omnipotent, too easy, we won't do this; sensors, main body of lecture.)


Hidden State
The previous example tried to estimate P(qt = si) unconditionally (using no observed evidence). Suppose we can observe something that's affected by the true state. Example: Proximity sensors. (Tell us the contents of the 8 adjacent squares.)

R0 H

W denotes WALL

True state qt

What the robot sees: Observation Ot


Slide 21

Copyright Andrew W. Moore

Noisy Hidden State


Example: Noisy Proximity sensors. (unreliably tell us the contents of the 8 adjacent squares)
R0 H H W W W

W denotes WALL

True state qt

Uncorrupted Observation
W W

H H

What the robot sees: Observation Ot

Copyright Andrew W. Moore

Slide 22


Noisy Hidden State


Example: Noisy Proximity sensors. (unreliably tell us the contents of the 8 adjacent squares)
R0 H H 2 W W W

W denotes WALL

True state qt Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, q1, q0 ,Ot-1, Ot-2, O1, O0 } given qt. In other words: P(Ot = X |qt = si ) = P(Ot = X |qt = si ,any earlier history)
Copyright Andrew W. Moore

Uncorrupted Observation
W W

H H

What the robot sees: Observation Ot

Slide 23

Noisy Hidden State


Example: Noisy Proximity sensors. (unreliably tell us the contents of the 8 adjacent squares)
R0 H H 2 W W W

W denotes WALL

True state qt. Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, …, q1, q0, Ot-1, Ot-2, …, O1, O0} given qt. In other words: P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)

Uncorrupted Observation

What the robot sees: Observation Ot

Question: what'd be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?
Slide 24

Copyright Andrew W. Moore


Answer:

The chain q0 → q1 → q2 → q3 → q4, with an arc from each state to its observation: q0 → O0, q1 → O1, q2 → O2, q3 → O3, q4 → O4.

Noisy Hidden State

Example: Noisy Proximity sensors. (Unreliably tell us the contents of the 8 adjacent squares.) W denotes WALL.

True state qt. Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, …, q1, q0, Ot-1, Ot-2, …, O1, O0} given qt. In other words: P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)

Question: what'd be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?

(Figure: true state grid with R and H; the uncorrupted observation; what the robot sees: Observation Ot.)

Copyright Andrew W. Moore

Slide 25

Noisy Hidden State: Notation

bi(k) = P(Ot = k | qt = Si)

The observation probabilities form an N x M table:

         P(Ot=1|qt=si)   P(Ot=2|qt=si)   ...   P(Ot=k|qt=si)   ...   P(Ot=M|qt=si)
i = 1        b1(1)           b1(2)       ...       b1(k)       ...       b1(M)
i = 2        b2(1)           b2(2)       ...       b2(k)       ...       b2(M)
i = 3        b3(1)           b3(2)       ...       b3(k)       ...       b3(M)
  :            :               :                     :                     :
i = N        bN(1)           bN(2)       ...       bN(k)       ...       bN(M)

(This slide repeats the answer Bayes net q0 → q1 → q2 → q3 → q4 with arcs qt → Ot, the noisy proximity-sensor example, W denotes WALL, and the assumption that Ot is conditionally independent of any earlier history given qt.)

Copyright Andrew W. Moore

Slide 26


Hidden Markov Models


Our robot with noisy sensors is a good example of an HMM.
Question 1: State Estimation. What is P(qT=Si | O1O2…OT)? It will turn out that a new cute D.P. trick will get this for us.
Question 2: Most Probable Path. Given O1O2…OT, what is the most probable path that I took? And what is that probability? Yet another famous D.P. trick, the VITERBI algorithm, gets this.
Question 3: Learning HMMs. Given O1O2…OT, what is the maximum likelihood HMM that could have produced this string of observations? Very very useful. Uses the E.M. Algorithm.
Copyright Andrew W. Moore Slide 27

Are H.M.M.s Useful?


You bet!! Robot planning + sensing when there's uncertainty (e.g. Reid Simmons / Sebastian Thrun / Sven Koenig). Speech Recognition/Understanding: Phones → Words, Signal → phones. Human Genome Project: Complicated stuff your lecturer knows nothing about. Consumer decision modeling. Economics & Finance. Plus at least 5 other things I haven't thought of.
Copyright Andrew W. Moore Slide 28


Some Famous HMM Tasks


Question 1: State Estimation What is P(qT=Si | O1O2Ot)

Copyright Andrew W. Moore

Slide 29

Some Famous HMM Tasks


Question 1: State Estimation What is P(qT=Si | O1O2Ot)

Copyright Andrew W. Moore

Slide 30


Some Famous HMM Tasks


Question 1: State Estimation What is P(qT=Si | O1O2Ot)

Copyright Andrew W. Moore

Slide 31

Some Famous HMM Tasks


Question 1: State Estimation What is P(qT=Si | O1O2Ot) Question 2: Most Probable Path Given O1O2OT , what is the most probable path that I took?

Copyright Andrew W. Moore

Slide 32


Some Famous HMM Tasks


Question 1: State Estimation What is P(qT=Si | O1O2Ot) Question 2: Most Probable Path Given O1O2OT , what is the most probable path that I took?

Copyright Andrew W. Moore

Slide 33

Some Famous HMM Tasks


Question 1: State Estimation. What is P(qT=Si | O1O2…OT)? Question 2: Most Probable Path. Given O1O2…OT (e.g. Woke up at 8.35, Got on Bus at 9.46, Sat in lecture 10.05-11.22), what is the most probable path that I took?

Copyright Andrew W. Moore

Slide 34


Some Famous HMM Tasks


Question 1: State Estimation What is P(qT=Si | O1O2Ot) Question 2: Most Probable Path Given O1O2OT , what is the most probable path that I took? Question 3: Learning HMMs: Given O1O2OT , what is the maximum likelihood HMM that could have produced this string of observations?

Copyright Andrew W. Moore

Slide 35

Some Famous HMM Tasks


Question 1: State Estimation What is P(qT=Si | O1O2OT) Question 2: Most Probable Path Given O1O2OT , what is the most probable path that I took? Question 3: Learning HMMs: Given O1O2OT , what is the maximum likelihood HMM that could have produced this string of observations?

Copyright Andrew W. Moore

Slide 36


Some Famous HMM Tasks

Question 1: State Estimation. What is P(qT=Si | O1O2…OT)?
Question 2: Most Probable Path. Given O1O2…OT, what is the most probable path that I took?
Question 3: Learning HMMs. Given O1O2…OT, what is the maximum likelihood HMM that could have produced this string of observations?

(Figure: a trellis over hidden states A, B, C across times t-1, t, t+1, with observations Ot-1, Ot, Ot+1 such as Bus, Eat, walk; transition probabilities aAA, aAB, aBA, aBB, aBC, aCB, aCC and observation probabilities bA(Ot-1), bB(Ot), bC(Ot+1).)

Copyright Andrew W. Moore

Slide 37

Basic Operations in HMMs


For an observation sequence O = O1 … OT, the three basic HMM operations are:

Problem                                             Algorithm           Complexity
Evaluation:  Calculating P(qt=Si | O1O2…Ot)         Forward-Backward    O(TN²)
Inference:   Computing Q* = argmaxQ P(Q|O)          Viterbi Decoding    O(TN²)
Learning:    Computing λ* = argmaxλ P(O|λ)          Baum-Welch (EM)     O(TN²)

T = # timesteps, N = # states
Copyright Andrew W. Moore Slide 38


HMM Notation (from Rabiners Survey)


The states are labeled S1 S2 .. SN

*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol.77, No.2, pp.257--286, 1989. Available from
http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626

For a particular trial, let T be the number of observations. T is also the number of states passed through. O = O1 O2 .. OT is the sequence of observations. Q = q1 q2 .. qT is the notation for a path of states.

λ = ⟨N, M, {πi}, {aij}, {bi(j)}⟩ is the specification of an HMM.

Copyright Andrew W. Moore


Slide 39

HMM Formal Definition


An HMM, λ, is a 5-tuple consisting of:
  N                 the number of states
  M                 the number of possible observations
  {π1, π2, .. πN}   the starting state probabilities: P(q0 = Si) = πi
                    (This is new. In our previous example, the start state was deterministic.)
  a11 a12 .. a1N
  a21 a22 .. a2N    the state transition probabilities: P(qt+1=Sj | qt=Si) = aij
   :   :       :
  aN1 aN2 .. aNN
  b1(1) b1(2) .. b1(M)
  b2(1) b2(2) .. b2(M)   the observation probabilities: P(Ot=k | qt=Si) = bi(k)
   :     :         :
  bN(1) bN(2) .. bN(M)

Copyright Andrew W. Moore

Slide 40


Here's an HMM
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random.

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

N = 3     M = 3

π1 = 1/2        π2 = 1/2        π3 = 0
a11 = 0         a12 = 1/3       a13 = 2/3
a21 = 1/3       a22 = 0         a23 = 2/3
a31 = 1/3       a32 = 1/3       a33 = 1/3
b1(X) = 1/2     b1(Y) = 1/2     b1(Z) = 0
b2(X) = 0       b2(Y) = 1/2     b2(Z) = 1/2
b3(X) = 1/2     b3(Y) = 0       b3(Z) = 1/2

Copyright Andrew W. Moore
Slide 41
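A minimal sketch (not part of the slides): the Slide 41 HMM written out as arrays so the later computations can be checked numerically. The observation symbols X, Y, Z are encoded as 0, 1, 2, and the values follow the parameter table above.

```python
import numpy as np

# lambda = (N, M, pi, A, B) for the example HMM on Slide 41.
pi = np.array([0.5, 0.5, 0.0])           # pi_i = P(q_1 = S_i)
A = np.array([[0.0, 1/3, 2/3],           # A[i, j] = a_ij = P(q_{t+1}=S_j | q_t=S_i)
              [1/3, 0.0, 2/3],
              [1/3, 1/3, 1/3]])
B = np.array([[0.5, 0.5, 0.0],           # B[i, k] = b_i(k) = P(O_t=k | q_t=S_i)
              [0.0, 0.5, 0.5],           # symbols: X=0, Y=1, Z=2
              [0.5, 0.0, 0.5]])
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```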

Here's an HMM
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: 50-50 choice between S1 and S2.

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

(HMM parameters as on Slide 41.)

Copyright Andrew W. Moore

q0= q1= q2=

__ __ __

O0= O1= O2=

__ __ __

Slide 42


Here's an HMM
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: 50-50 choice between X and Y.

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

(HMM parameters as on Slide 41.)

Copyright Andrew W. Moore

q0= q1= q2=

S1 __ __

O0= O1= O2=

__ __ __

Slide 43

Here's an HMM
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: Goto S3 with probability 2/3 or S2 with prob. 1/3.

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

(HMM parameters as on Slide 41.)

Copyright Andrew W. Moore

q0= q1= q2=

S1 __ __

O0= O1= O2=

X __ __

Slide 44


Here's an HMM
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: 50-50 choice between Z and X.

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

(HMM parameters as on Slide 41.)

Copyright Andrew W. Moore

q0= q1= q2=

S1 S3 __

O0= O1= O2=

X __ __

Slide 45

Here's an HMM
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: Each of the three next states is equally likely.

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

(HMM parameters as on Slide 41.)

Copyright Andrew W. Moore

q0= q1= q2=

S1 S3 __

O0= O1= O2=

X X __

Slide 46


Here's an HMM
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations: 50-50 choice between Z and X.

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

(HMM parameters as on Slide 41.)

Copyright Andrew W. Moore

q0= q1= q2=

S1 S3 S3

O0= O1= O2=

X X __

Slide 47

Here's an HMM
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations:

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

(HMM parameters as on Slide 41.)

Copyright Andrew W. Moore

q0= q1= q2=

S1 S3 S3

O0= O1= O2=

X X Z

Slide 48
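A minimal sketch (not part of the slides) of the generation procedure just walked through, reusing the pi, A, B arrays from the sketch after Slide 41: draw the start state from pi, emit a symbol from b, step to the next state with a, and repeat.

```python
import numpy as np

pi = np.array([0.5, 0.5, 0.0])
A = np.array([[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]])
B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
symbols = "XYZ"

def generate(T, seed=1):
    """Sample a hidden path q_1..q_T and its observations O_1..O_T from the HMM."""
    rng = np.random.default_rng(seed)
    q = [int(rng.choice(3, p=pi))]                    # start state ~ pi
    for _ in range(T - 1):
        q.append(int(rng.choice(3, p=A[q[-1]])))      # next state ~ a_{q_t, .}
    O = [symbols[rng.choice(3, p=B[s])] for s in q]   # each output ~ b_{q_t}(.)
    return [f"S{s + 1}" for s in q], O

print(generate(3))   # one possible draw, e.g. (['S1', 'S3', 'S3'], ['X', 'X', 'Z'])
```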


State Estimation
S1
1/3

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random. Let's generate a sequence of observations:

XY
1/3

1/3 2/3 2/3

ZY
1/3

ZX
S3
1/3

(HMM parameters as on Slide 41.)

Copyright Andrew W. Moore

This is what the observer has to work with


q0= q1= q2= ? ? ? O0= O1= O2= X X Z

Slide 49

Prob. of a series of observations


What is P(O) = P(O1 O2 O3) = P(O1 = X ^ O2 = X ^ O3 = Z)? Slow, stupid way:
P(O) = Σ over Q ∈ Paths of length 3 of P(O ∧ Q) = Σ over Q ∈ Paths of length 3 of P(O|Q) P(Q)

(State diagram as before: S1 emits X or Y, S2 emits Z or Y, S3 emits Z or X.)

How do we compute P(Q) for an arbitrary path Q? How do we compute P(O|Q) for an arbitrary path Q?

Copyright Andrew W. Moore

Slide 50


Prob. of a series of observations


What is P(O) = P(O1 O2 O3) = P(O1 = X ^ O2 = X ^ O3 = Z)? Slow, stupid way:
P(O) = Σ over Q ∈ Paths of length 3 of P(O ∧ Q) = Σ over Q ∈ Paths of length 3 of P(O|Q) P(Q)

How do we compute P(Q) for an arbitrary path Q?
P(Q) = P(q1, q2, q3)
     = P(q1) P(q2, q3 | q1)              (chain rule)
     = P(q1) P(q2 | q1) P(q3 | q2, q1)   (chain)
     = P(q1) P(q2 | q1) P(q3 | q2)       (why?)
Example in the case Q = S1 S3 S3:  P(Q) = 1/2 * 2/3 * 1/3 = 1/9

How do we compute P(O|Q) for an arbitrary path Q?

Copyright Andrew W. Moore

Slide 51

Prob. of a series of observations


What is P(O) = P(O1 O2 O3) = P(O1 = X ^ O2 = X ^ O3 = Z)? Slow, stupid way:
P(O) = Σ over Q ∈ Paths of length 3 of P(O ∧ Q) = Σ over Q ∈ Paths of length 3 of P(O|Q) P(Q)

How do we compute P(Q) for an arbitrary path Q?
How do we compute P(O|Q) for an arbitrary path Q?
P(O|Q) = P(O1 O2 O3 | q1 q2 q3)
       = P(O1 | q1) P(O2 | q2) P(O3 | q3)   (why?)
Example in the case Q = S1 S3 S3:
P(O|Q) = P(X | S1) P(X | S3) P(Z | S3) = 1/2 * 1/2 * 1/2 = 1/8

Copyright Andrew W. Moore

Slide 52
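A minimal sketch (not part of the slides) of the slow, stupid way: enumerate all 3³ = 27 paths and sum P(O|Q) P(Q), reusing pi, A, B from the sketch after Slide 41. For Q = S1 S3 S3 it reproduces P(Q) = 1/9 and P(O|Q) = 1/8.

```python
from itertools import product
import numpy as np

pi = np.array([0.5, 0.5, 0.0])
A = np.array([[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]])
B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])

O = [0, 0, 2]                        # the observed sequence X X Z

def p_path(Q):                       # P(Q) = pi_{q1} * prod_t a_{q_t, q_{t+1}}
    p = pi[Q[0]]
    for i, j in zip(Q, Q[1:]):
        p *= A[i, j]
    return p

def p_obs_given_path(O, Q):          # P(O|Q) = prod_t b_{q_t}(O_t)
    return float(np.prod([B[q, o] for q, o in zip(Q, O)]))

print(p_path((0, 2, 2)), p_obs_given_path(O, (0, 2, 2)))      # 1/9, 1/8

p_O = sum(p_obs_given_path(O, Q) * p_path(Q)
          for Q in product(range(3), repeat=len(O)))          # 3^T paths: exponential in T
print(p_O)                           # P(O1=X, O2=X, O3=Z)
```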


Prob. of a series of observations


What is P(O) = P(O1 O2 O3) = P(O1 = X ^ O2 = X ^ O3 = Z)? Slow, stupid way:
P(O) = Σ over Q ∈ Paths of length 3 of P(O ∧ Q) = Σ over Q ∈ Paths of length 3 of P(O|Q) P(Q)

How do we compute P(Q) for an arbitrary path Q? How do we compute P(O|Q) for an arbitrary path Q? (As on the previous two slides.)

P(O) would need 27 P(Q) computations and 27 P(O|Q) computations.
A sequence of 20 observations would need 3^20 = 3.5 billion P(Q) computations and 3.5 billion P(O|Q) computations.

So let's be smarter
Slide 53

Copyright Andrew W. Moore

The Prob. of a given series of observations, non-exponential-cost-style


Given observations O1 O2 … OT, define

αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ)     where 1 ≤ t ≤ T

αt(i) = Probability that, in a random trial, we'd have seen the first t observations and we'd have ended up in Si as the t'th state visited.

In our example, what is α2(3)?

Copyright Andrew W. Moore

Slide 54


αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ)
(αt(i) can be defined stupidly by considering all paths of length t. How?)

αt(i): easy to define recursively

α1(i) = P(O1 ∧ q1 = Si) = P(q1 = Si) P(O1 | q1 = Si) = what?

αt+1(j) = P(O1 O2 … Ot Ot+1 ∧ qt+1 = Sj) = ?

Copyright Andrew W. Moore

Slide 55

αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ)
(αt(i) can be defined stupidly by considering all paths of length t. How?)

αt(i): easy to define recursively

α1(i) = P(O1 ∧ q1 = Si) = P(q1 = Si) P(O1 | q1 = Si) = what?

αt+1(j) = P(O1 O2 … Ot Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(O1 O2 … Ot ∧ qt = Si ∧ Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(Ot+1, qt+1 = Sj | O1 O2 … Ot ∧ qt = Si) P(O1 O2 … Ot ∧ qt = Si)
        = Σi=1..N P(Ot+1, qt+1 = Sj | qt = Si) αt(i)
        = Σi=1..N P(qt+1 = Sj | qt = Si) P(Ot+1 | qt+1 = Sj) αt(i)
        = Σi=1..N aij bj(Ot+1) αt(i)

Copyright Andrew W. Moore     Slide 56


in our example
αt(i) = P(O1 O2 .. Ot ∧ qt = Si)
α1(i) = bi(O1) πi
αt+1(j) = Σi=1..N aij bj(Ot+1) αt(i)

(State diagram as before.)

WE SAW O1 O2 O3 = X X Z

α1(1) = 1/4      α1(2) = 0        α1(3) = 0
α2(1) = 0        α2(2) = 0        α2(3) = 1/12
α3(1) = 0        α3(2) = 1/72     α3(3) = 1/72

Copyright Andrew W. Moore

Slide 57
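A minimal sketch (not part of the slides) of the α recursion, again reusing pi, A, B from the sketch after Slide 41. It reproduces the α table above, and summing the last row gives P(O1 O2 O3), which answers the "easy question" on the next slide and matches the brute-force sum.

```python
import numpy as np

pi = np.array([0.5, 0.5, 0.0])
A = np.array([[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]])
B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])

def forward(O, pi, A, B):
    """alpha[t-1, i] = alpha_t(i) = P(O_1..O_t, q_t = S_i).  Cost O(T N^2)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                            # alpha_1(i) = pi_i b_i(O_1)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]    # sum_i alpha_t(i) a_ij b_j(O_{t+1})
    return alpha

alpha = forward([0, 0, 2], pi, A, B)        # O = X X Z
print(alpha)                                # [[1/4, 0, 0], [0, 0, 1/12], [0, 1/72, 1/72]]
print(alpha[-1].sum())                      # P(O1 O2 O3) = 1/36
print(alpha[-1] / alpha[-1].sum())          # P(q_3 = S_i | O1 O2 O3)
```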

Easy Question
We can cheaply compute αt(i) = P(O1 O2 … Ot ∧ qt = Si).
(How) can we cheaply compute P(O1 O2 … Ot)?
(How) can we cheaply compute P(qt = Si | O1 O2 … Ot)?

Copyright Andrew W. Moore

Slide 58


Easy Question
We can cheaply compute αt(i) = P(O1 O2 … Ot ∧ qt = Si).

(How) can we cheaply compute P(O1 O2 … Ot)?     Σi=1..N αt(i)

(How) can we cheaply compute P(qt = Si | O1 O2 … Ot)?     αt(i) / Σj=1..N αt(j)

Slide 59

Copyright Andrew W. Moore

Most probable path given observations


What's the most probable path given O1O2…OT? I.e., what is argmaxQ P(Q | O1O2…OT)?

Slow, stupid answer:

argmaxQ P(Q | O1O2…OT)
= argmaxQ P(O1O2…OT | Q) P(Q) / P(O1O2…OT)
= argmaxQ P(O1O2…OT | Q) P(Q)
Copyright Andrew W. Moore Slide 60


Efficient MPP computation


We're going to compute the following variables:

δt(i) = max over q1 q2 .. qt-1 of P(q1 q2 .. qt-1 ∧ qt = Si ∧ O1 .. Ot)

= the probability of the path of length t-1 with the maximum chance of doing all these things:
  …OCCURRING
  and ENDING UP IN STATE Si
  and PRODUCING OUTPUT O1…Ot

DEFINE: mppt(i) = that path
So: δt(i) = Prob(mppt(i))

Copyright Andrew W. Moore

Slide 61

The Viterbi Algorithm


δt(i) = max over q1 q2 … qt-1 of P(q1 q2 … qt-1 ∧ qt = Si ∧ O1 O2 .. Ot)
mppt(i) = argmax over q1 q2 … qt-1 of P(q1 q2 … qt-1 ∧ qt = Si ∧ O1 O2 .. Ot)

δ1(i) = P(q1 = Si ∧ O1) = P(q1 = Si) P(O1 | q1 = Si) = πi bi(O1)   (one choice)

Now, suppose we have all the δt(i)'s and mppt(i)'s for all i. HOW TO GET δt+1(j) and mppt+1(j)?
(Trellis from time t to time t+1: mppt(1) with Prob = δt(1), mppt(2) with Prob = δt(2), …, mppt(N) with Prob = δt(N), each feeding into Sj.)

Copyright Andrew W. Moore
Slide 62


The Viterbi Algorithm


(Trellis: states S1 … Si … SN at time t, state Sj at time t+1.)

The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.

Copyright Andrew W. Moore

Slide 63

The Viterbi Algorithm


(Trellis: states S1 … Si … SN at time t, state Sj at time t+1.)

The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.

What is the prob of that path?
δt(i) x P(Si → Sj ∧ Ot+1 | λ) = δt(i) aij bj(Ot+1)

SO the most probable path to Sj has Si* as its penultimate state, where i* = argmaxi δt(i) aij bj(Ot+1).
Copyright Andrew W. Moore Slide 64


The Viterbi Algorithm


(Trellis: states S1 … Si … SN at time t, state Sj at time t+1.)

The most prob path with last two states Si Sj is the most prob path to Si, followed by the transition Si → Sj.

What is the prob of that path?
δt(i) x P(Si → Sj ∧ Ot+1 | λ) = δt(i) aij bj(Ot+1)

SO the most probable path to Sj has Si* as its penultimate state, where i* = argmaxi δt(i) aij bj(Ot+1).

Summary:
δt+1(j) = δt(i*) ai*j bj(Ot+1)
mppt+1(j) = mppt(i*) Si*
with i* defined as above.
Copyright Andrew W. Moore Slide 65
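A minimal sketch (not part of the slides) of the Viterbi recursion just summarized, reusing pi, A, B from the sketch after Slide 41. The delta rows hold the δt(i) values and the back-pointers store each i*, so the most probable path can be read off at the end.

```python
import numpy as np

pi = np.array([0.5, 0.5, 0.0])
A = np.array([[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]])
B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])

def viterbi(O, pi, A, B):
    """Return the most probable state path given observations O, and its probability."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)               # back[t, j] = i*, the penultimate state
    delta[0] = pi * B[:, O[0]]                       # delta_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        cand = delta[t - 1][:, None] * A             # cand[i, j] = delta_t(i) a_ij
        back[t] = cand.argmax(axis=0)                # i* for each destination state j
        delta[t] = cand.max(axis=0) * B[:, O[t]]     # delta_{t+1}(j) = delta_t(i*) a_{i*j} b_j(O_{t+1})
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                    # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

print(viterbi([0, 0, 2], pi, A, B))   # most probable 0-indexed path for O = X X Z, and its probability
```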

What's Viterbi used for?


Classic Example. Speech recognition: Signal → words. HMM observable is the signal; hidden state is part of word formation. What is the most probable word given this signal?
UTTERLY GROSS SIMPLIFICATION In practice: many levels of inference; not one big jump.
Copyright Andrew W. Moore Slide 66


HMMs are used and useful


But how do you design an HMM? Occasionally, (e.g. in our robot example) it is reasonable to deduce the HMM from first principles. But usually, especially in Speech or Genetics, it is better to infer it from large amounts of data. O1 O2 .. OT with a big T.
Observations previously in lecture: O1 O2 .. OT
Observations in the next bit: O1 O2 .. OT (with a big T)
Slide 67

Copyright Andrew W. Moore

Inferring an HMM
Remember, we've been doing things like P(O1 O2 .. OT | λ). That λ is the notation for our HMM parameters. Now we have some observations and we want to estimate λ from them.

AS USUAL, we could use:
(i) MAX LIKELIHOOD:  λ = argmaxλ P(O1 .. OT | λ)
(ii) BAYES:  work out P(λ | O1 .. OT), and then take E[λ] or maxλ P(λ | O1 .. OT)

Copyright Andrew W. Moore     Slide 68


Max likelihood HMM estimation


Define
γt(i) = P(qt = Si | O1O2…OT, λ)
ξt(i,j) = P(qt = Si ∧ qt+1 = Sj | O1O2…OT, λ)

γt(i) and ξt(i,j) can be computed efficiently for all i, j, t. (Details in Rabiner paper.)

Σt=1..T-1 γt(i) = expected number of transitions out of state i during the path
Σt=1..T-1 ξt(i,j) = expected number of transitions from state i to state j during the path

Slide 69

Copyright Andrew W. Moore

γt(i) = P(qt = Si | O1O2..OT, λ)
ξt(i,j) = P(qt = Si ∧ qt+1 = Sj | O1O2..OT, λ)

Σt=1..T-1 γt(i) = expected number of transitions out of state i during path
Σt=1..T-1 ξt(i,j) = expected number of transitions out of i and into j during path

HMM estimation

Notice:

Σt=1..T-1 ξt(i,j) / Σt=1..T-1 γt(i) = estimate of Prob(Next state Sj | This state Si)

We can re-estimate
aij = (expected frequency i → j) / (expected frequency i) = Σt ξt(i,j) / Σt γt(i)

We can also re-estimate bj(Ok) … (See Rabiner)

Slide 70


Copyright Andrew W. Moore


We want

new aij = new estimate of P(qt +1 = s j | qt = si )

Copyright Andrew W. Moore

Slide 71

We want

new aij = new estimate of P(qt +1 = s j | qt = si )

= (Expected # transitions i → j | λold, O1, O2, …, OT) / (Σk=1..N Expected # transitions i → k | λold, O1, O2, …, OT)

Copyright Andrew W. Moore

Slide 72


We want

new aij = new estimate of P(qt +1 = s j | qt = si )

= (Expected # transitions i → j | λold, O1, O2, …, OT) / (Σk=1..N Expected # transitions i → k | λold, O1, O2, …, OT)

= Σt=1..T-1 P(qt+1 = sj, qt = si | λold, O1, O2, …, OT) / Σk=1..N Σt=1..T-1 P(qt+1 = sk, qt = si | λold, O1, O2, …, OT)

Copyright Andrew W. Moore

Slide 73

We want

new aij = new estimate of P(qt +1 = s j | qt = si )

= (Expected # transitions i → j | λold, O1, O2, …, OT) / (Σk=1..N Expected # transitions i → k | λold, O1, O2, …, OT)

= Σt=1..T-1 P(qt+1 = sj, qt = si | λold, O1, O2, …, OT) / Σk=1..N Σt=1..T-1 P(qt+1 = sk, qt = si | λold, O1, O2, …, OT)

= Sij / Σk=1..N Sik     where Sij = Σt=1..T-1 P(qt+1 = sj, qt = si, O1, …, OT | λold) = What?

Slide 74

Copyright Andrew W. Moore


We want

new aij = new estimate of P (qt +1 = s j | qt = si )

= (Expected # transitions i → j | λold, O1, O2, …, OT) / (Σk=1..N Expected # transitions i → k | λold, O1, O2, …, OT)

= Σt=1..T-1 P(qt+1 = sj, qt = si | λold, O1, O2, …, OT) / Σk=1..N Σt=1..T-1 P(qt+1 = sk, qt = si | λold, O1, O2, …, OT)

= Sij / Σk=1..N Sik     where Sij = Σt=1..T-1 aij αt(i) bj(Ot+1) βt+1(j)
(Here βt+1(j) is the corresponding "backward" probability, P(Ot+2 … OT | qt+1 = Sj); see Rabiner.)

Slide 75

Copyright Andrew W. Moore

We want

new aij = Sij / Σk=1..N Sik     where Sij = Σt=1..T-1 aij αt(i) bj(Ot+1) βt+1(j)

Copyright Andrew W. Moore

Slide 76


We want

new aij = Sij / Σk=1..N Sik     where Sij = Σt=1..T-1 aij αt(i) bj(Ot+1) βt+1(j)     (as on the previous slide)

Slide 77

Copyright Andrew W. Moore

EM for HMMs
If we knew λ we could estimate EXPECTATIONS of quantities such as
  Expected number of times in state i
  Expected number of transitions i → j
If we knew the quantities such as
  Expected number of times in state i
  Expected number of transitions i → j
We could compute the MAX LIKELIHOOD estimate of λ = ⟨{aij}, {bi(j)}, {πi}⟩
Roll on the EM Algorithm…
Copyright Andrew W. Moore Slide 78


EM 4 HMMs
1. Get your observations O1 … OT.
2. Guess your first estimate λ(0), k=0.
3. k = k+1.
4. Given O1 … OT and λ(k), compute γt(i), ξt(i,j) for 1 ≤ t ≤ T, 1 ≤ i ≤ N, 1 ≤ j ≤ N.
5. Compute expected freq. of state i, and expected freq. i → j.
6. Compute new estimates of aij, bj(k), πi accordingly. Call them λ(k+1).
7. Goto 3, unless converged.

Also known (for the HMM case) as the BAUM-WELCH algorithm.
Slide 79

Copyright Andrew W. Moore
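A minimal sketch (not part of the slides, and ignoring the numerical scaling tricks discussed in Rabiner) of one Baum-Welch iteration, reusing pi, A, B from the sketch after Slide 41. gamma and xi are the γt(i) and ξt(i,j) defined on Slide 69, computed with a forward and a backward pass; the re-estimates are the expected-frequency ratios from Slides 70-76.

```python
import numpy as np

pi = np.array([0.5, 0.5, 0.0])
A = np.array([[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]])
B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])

def forward_backward(O, pi, A, B):
    T, N = len(O), len(pi)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    for t in range(T - 2, -1, -1):                   # beta_t(i) = P(O_{t+1}..O_T | q_t = S_i)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(O, pi, A, B):
    """One EM update of (pi, A, B) from a single observation sequence O."""
    O = np.asarray(O)
    T, N, M = len(O), len(pi), B.shape[1]
    alpha, beta = forward_backward(O, pi, A, B)
    p_O = alpha[-1].sum()                            # P(O | lambda_old)
    gamma = alpha * beta / p_O                       # gamma[t, i] = P(q_t = S_i | O)
    xi = (alpha[:-1, :, None] * A[None, :, :] *      # xi[t, i, j] = P(q_t=S_i, q_{t+1}=S_j | O)
          (B[:, O[1:]].T * beta[1:])[:, None, :]) / p_O
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # expected i->j / expected out of i
    new_B = np.array([gamma[O == k].sum(axis=0) for k in range(M)]).T / gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

print(baum_welch_step([0, 0, 2, 1, 0, 2], pi, A, B))   # one iteration on the toy sequence X X Z Y X Z
```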

Bad News
There are lots of local minima

Good News
The local minima are usually adequate models of the data.

Notice

EM does not estimate the number of states. That must be given. Often, HMMs are forced to have some links with zero probability. This is done by setting aij=0 in the initial estimate λ(0). Easy extension of everything seen today: HMMs with real-valued outputs.
Copyright Andrew W. Moore Slide 80


Bad News
There are lots of local minima.

Good News
The local minima are usually adequate models of the data.

Notice
EM does not estimate the number of states. That must be given. (Trade-off between too few states, inadequately modeling the structure in the data, and too many, fitting the noise. Thus #states is a regularization parameter. Blah blah blah… bias variance tradeoff… blah blah… cross-validation… blah blah… AIC, BIC… blah blah (same ol' same ol').)
Often, HMMs are forced to have some links with zero probability. This is done by setting aij=0 in the initial estimate λ(0).
Easy extension of everything seen today: HMMs with real-valued outputs.
Copyright Andrew W. Moore Slide 81

What You Should Know


What is an HMM?
Computing (and defining) αt(i)
The Viterbi algorithm
Outline of the EM algorithm
To be very happy with the kind of maths and analysis needed for HMMs
Fairly thorough reading of Rabiner* up to page 266 [up to but not including IV. Types of HMMs]. DON'T PANIC: it starts on p. 257.
*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol.77, No.2, pp.257--286, 1989.
http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626
Copyright Andrew W. Moore Slide 82

