
Hidden Markov Models and Gaussian Mixture Models

Steve Renals and Peter Bell

Automatic Speech Recognition — ASR Lectures 4&5
28/31 January 2013

Overview

HMMs and GMMs
Key models and algorithms for HMM acoustic models
  Gaussians
  GMMs: Gaussian mixture models
  HMMs: Hidden Markov models
  HMM algorithms
    Likelihood computation (forward algorithm)
    Most probable state sequence (Viterbi algorithm)
    Estimating the parameters (EM algorithm)


Fundamental Equation of Statistical Speech Recognition

If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by

    W* = arg max_W P(W | X)

Applying Bayes' Theorem:

    P(W | X) = p(X | W) P(W) / p(X)
             ∝ p(X | W) P(W)

    W* = arg max_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.

Acoustic Modelling

[Block diagram: Recorded Speech → Signal Analysis → Search → Decoded Text (Transcription). The Search block uses the Acoustic Model (a Hidden Markov Model), the Lexicon and the Language Model, which are estimated from Training Data.]

Hierarchical modelling of speech

[Generative model figure: the utterance "No right" is decomposed into words (NO RIGHT), subword units (n oh r ai t), HMM states, and finally the acoustics.]

Acoustic Model: Continuous Density HMM

[HMM figure: entry state sI, emitting states s1, s2, s3 and exit state sE; transition probabilities P(s1 | sI), P(s2 | s1), P(s3 | s2), P(sE | s3); self-loop probabilities P(s1 | s1), P(s2 | s2), P(s3 | s3); output densities p(x | s1), p(x | s2), p(x | s3).]

Probabilistic finite state automaton
Parameters λ:
  Transition probabilities: a_kj = P(s_j | s_k)
  Output probability density function: b_j(x) = p(x | s_j)

Acoustic Model: Continuous Density HMM (continued)

[The same HMM shown generating an observation sequence: the emitting states s1, s2, s3 are aligned to the acoustic observations x1, x2, x3, x4, x5, x6.]

HMM Assumptions

[Graphical model: a chain of states s(t−1) → s(t) → s(t+1), each emitting the corresponding observation x(t−1), x(t), x(t+1).]

1 Observation independence: an acoustic observation x is conditionally independent of all other observations given the state that generated it
2 Markov process: a state is conditionally independent of all other states given the previous state

HMM OUTPUT DISTRIBUTION


Output distribution

[The same left-to-right HMM figure: states sI, s1, s2, s3, sE with transition probabilities and output densities p(x | s1), p(x | s2), p(x | s3).]

Single multivariate Gaussian with mean µ_j, covariance matrix Σ_j:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

M-component Gaussian mixture model:

    b_j(x) = p(x | s_j) = Σ_{m=1}^{M} c_jm N(x; µ_jm, Σ_jm)

Background: cdf

Consider a real valued random variable X
Cumulative distribution function (cdf) F(x) for X:

    F(x) = P(X ≤ x)

To obtain the probability of falling in an interval we can do the following:

    P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a)

Background: pdf

The rate of change of the cdf gives us the probability density function (pdf), p(x):

    p(x) = d/dx F(x) = F′(x)

    F(x) = ∫_{−∞}^{x} p(x) dx

p(x) is not the probability that X has value x. But the pdf is proportional to the probability that X lies in a small interval centred on x.
Notation: p for pdf, P for probability

The Gaussian distribution (univariate)

The Gaussian (or Normal) distribution is the most common (and easily analysed) continuous distribution
It is also a reasonable model in many situations (the famous "bell curve")
If a (scalar) variable has a Gaussian distribution, then it has a probability density function with this form:

    p(x | µ, σ²) = N(x; µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))

The Gaussian is described by two parameters:
  the mean µ (location)
  the variance σ² (dispersion)


Plot of Gaussian distribution

Gaussians have the same shape, with the location controlled by the mean, and the spread controlled by the variance
One-dimensional Gaussian with zero mean and unit variance (µ = 0, σ² = 1):

[Plot: pdf of a Gaussian distribution with mean 0 and variance 1.]

Properties of the Gaussian distribution

    N(x; µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))

[Plot: pdfs of Gaussian distributions with mean 0 and variances 1, 2 and 4; the larger the variance, the lower and broader the curve.]

Parameter estimation

Estimate mean and variance parameters of a Gaussian from data x^1, x^2, ..., x^n
Use sample mean and sample variance estimates:

    µ = (1/n) Σ_{i=1}^{n} x^i            (sample mean)

    σ² = (1/n) Σ_{i=1}^{n} (x^i − µ)²    (sample variance)

Exercise

Consider the log likelihood of a set of N data points {x^1, ..., x^N} being generated by a Gaussian with mean µ and variance σ²:

    L = ln p({x^1, ..., x^N} | µ, σ²) = −(1/2) Σ_{n=1}^{N} [ (x^n − µ)²/σ² + ln σ² + ln(2π) ]

      = −(1/(2σ²)) Σ_{n=1}^{N} (x^n − µ)² − (N/2) ln σ² − (N/2) ln(2π)

By maximising the log likelihood function with respect to µ show that the maximum likelihood estimate for the mean is indeed the sample mean:

    µ_ML = (1/N) Σ_{n=1}^{N} x^n .
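
A brief sketch of the maximisation step the exercise asks for: differentiate L with respect to µ and set the derivative to zero.

```latex
\frac{\partial L}{\partial \mu}
  = \frac{1}{\sigma^2}\sum_{n=1}^{N}\left(x^{n}-\mu\right) = 0
\;\Longrightarrow\;
\sum_{n=1}^{N} x^{n} = N\mu
\;\Longrightarrow\;
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x^{n}
```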


The multidimensional Gaussian distribution

The d-dimensional vector x is multivariate Gaussian if it has a probability density function of the following form:

    p(x | µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

The pdf is parameterized by the mean vector µ and the covariance matrix Σ.
The 1-dimensional Gaussian is a special case of this pdf
The argument to the exponential, −(1/2)(x − µ)^T Σ^{−1} (x − µ), is referred to as a quadratic form.

Covariance matrix

The mean vector µ is the expectation of x:

    µ = E[x]

The covariance matrix Σ is the expectation of the deviation of x from the mean:

    Σ = E[(x − µ)(x − µ)^T]

Σ is a d × d symmetric matrix:

    Σ_ij = E[(x_i − µ_i)(x_j − µ_j)] = E[(x_j − µ_j)(x_i − µ_i)] = Σ_ji

The sign of the covariance helps to determine the relationship between two components:
  If x_j is large when x_i is large, then (x_j − µ_j)(x_i − µ_i) will tend to be positive;
  If x_j is small when x_i is large, then (x_j − µ_j)(x_i − µ_i) will tend to be negative.
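
As a small illustration (not from the slides), a minimal NumPy sketch that evaluates this log-density directly from the quadratic form; the function name logpdf_gauss and the example numbers are just for illustration:

```python
import numpy as np

def logpdf_gauss(x, mu, Sigma):
    """Log of the multivariate Gaussian pdf N(x; mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve
    quad = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# Example: a 2-dimensional Gaussian with a full covariance matrix
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, -1.0],
                  [-1.0, 4.0]])
print(logpdf_gauss(np.array([0.5, -0.5]), mu, Sigma))
```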
Spherical Gaussian

[Surface and contour plots of p(x1, x2) for a spherical Gaussian:
 µ = [0 0]^T, Σ = [[1, 0], [0, 1]], ρ12 = 0; the contours are circular.]

Diagonal Covariance Gaussian

[Surface and contour plots of p(x1, x2) for a diagonal-covariance Gaussian:
 µ = [0 0]^T, Σ = [[1, 0], [0, 4]], ρ12 = 0; the contours are axis-aligned ellipses.]


Full covariance Gaussian

[Surface and contour plots of p(x1, x2) for a full-covariance Gaussian:
 µ = [0 0]^T, Σ = [[1, −1], [−1, 4]], ρ12 = −0.5; the contours are tilted ellipses.]

Parameter estimation

It is possible to show that the mean vector µ̂ and covariance matrix Σ̂ that maximize the likelihood of the training data are given by:

    µ̂ = (1/N) Σ_{n=1}^{N} x^n

    Σ̂ = (1/N) Σ_{n=1}^{N} (x^n − µ̂)(x^n − µ̂)^T

The mean of the distribution is estimated by the sample mean and the covariance by the sample covariance
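
A minimal NumPy sketch of these maximum likelihood estimates (illustrative only; X is assumed to be an N × d array of training vectors):

```python
import numpy as np

def ml_fit_gaussian(X):
    """Maximum likelihood estimates of mean and covariance for data X (N x d)."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)              # sample mean
    diff = X - mu_hat
    Sigma_hat = (diff.T @ diff) / N      # sample covariance (divide by N, not N - 1)
    return mu_hat, Sigma_hat

# Example: fit a 2-d Gaussian to synthetic data
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, -1], [-1, 4]], size=1000)
mu_hat, Sigma_hat = ml_fit_gaussian(X)
print(mu_hat, Sigma_hat, sep="\n")
```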

Example data

[Scatter plot of two-dimensional example data (X1 vs X2).]

Maximum likelihood fit to a Gaussian

[The same data with the maximum likelihood Gaussian fit overlaid.]


Data in clusters (example 1)

[Scatter plot of data drawn from two clusters: µ1 = [0 0]^T, µ2 = [1 1]^T, Σ1 = Σ2 = 0.2I.]

Example 1 fit by a Gaussian

[The same two-cluster data fit by a single Gaussian.]

k-means clustering

k-means is an automatic procedure for clustering unlabelled data
Requires a prespecified number of clusters
Clustering algorithm chooses a set of clusters with the minimum within-cluster variance
Guaranteed to converge (eventually)
Clustering solution is dependent on the initialisation
(A short sketch of the procedure follows the example data below.)

k-means example: data set

[Plot of the 14 example points:
 (4,13), (2,9), (7,8), (6,6), (7,6), (4,5), (10,5), (5,4), (8,4), (1,2), (5,2), (1,1), (3,1), (10,0).]
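
A minimal k-means sketch in plain NumPy (not from the slides), run on the example points above with K = 3 clusters and an arbitrary initialisation:

```python
import numpy as np

points = np.array([(4,13), (2,9), (7,8), (6,6), (7,6), (4,5), (10,5),
                   (5,4), (8,4), (1,2), (5,2), (1,1), (3,1), (10,0)], dtype=float)

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]      # initialise from the data
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of its assigned points
        new_centres = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centres[k] for k in range(K)])
        if np.allclose(new_centres, centres):
            break                                           # assignments stable: converged
        centres = new_centres
    return centres, assign

centres, assign = kmeans(points, K=3)
print(centres)
```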


k-means example: initialization

[The example points with three initial cluster centres chosen.]

k-means example: iteration 1 (assign points to clusters)

[Each point is assigned to its nearest centre.]

k-means example: iteration 1 (recompute centres)

[Centres are recomputed as the means of their assigned points: (4.33, 10), (3.57, 3), (8.75, 3.75).]

k-means example: iteration 2 (assign points to clusters)

[Points are re-assigned to the nearest of the new centres.]


k-means example: iteration 2 (recompute centres)

[Centres are recomputed again: (4.33, 10), (3.17, 2.5), (8.2, 4.2).]

k-means example: iteration 3 (assign points to clusters)

[Points are re-assigned: no changes, so converged.]
Mixture model

A more flexible form of density estimation is made up of a linear combination of component densities:

    p(x) = Σ_{j=1}^{M} p(x | j) P(j)

This is called a mixture model or a mixture density
  p(x | j): component densities
  P(j): mixing parameters
Generative model:
  1 Choose a mixture component based on P(j)
  2 Generate a data point x from the chosen component using p(x | j)

Component occupation probability

We can apply Bayes' theorem:

    P(j | x) = p(x | j) P(j) / p(x) = p(x | j) P(j) / Σ_{j′=1}^{M} p(x | j′) P(j′)

The posterior probabilities P(j | x) give the probability that component j was responsible for generating data point x
The P(j | x)s are called the component occupation probabilities (or sometimes called the responsibilities)
Since they are posterior probabilities:

    Σ_{j=1}^{M} P(j | x) = 1
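
As a small numerical illustration (assumed, not from the slides): computing the component occupation probabilities of a one-dimensional two-component Gaussian mixture with NumPy and SciPy; all parameter values are made up:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2-component univariate GMM: mixing weights, means, standard deviations
P = np.array([0.4, 0.6])
mu = np.array([0.0, 1.0])
sigma = np.array([0.5, 0.5])

x = 0.7
# Bayes' theorem: P(j|x) = p(x|j) P(j) / sum_j' p(x|j') P(j')
px_given_j = norm.pdf(x, loc=mu, scale=sigma)
posterior = px_given_j * P / np.sum(px_given_j * P)
print(posterior, posterior.sum())   # the responsibilities sum to 1
```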


Parameter estimation

If we knew which mixture component was responsible for a data point:
  we would be able to assign each point unambiguously to a mixture component
  and we could estimate the mean for each component Gaussian as the sample mean (just like k-means clustering)
  and we could estimate the covariance as the sample covariance
But we don't know which mixture component a data point comes from...
Maybe we could use the component occupation probabilities P(j | x)?

Gaussian mixture model

The most important mixture model is the Gaussian Mixture Model (GMM), where the component densities are Gaussians
Consider a GMM where each component Gaussian N_j(x; µ_j, σ_j²) has mean µ_j and a spherical covariance Σ = σ²I

    p(x) = Σ_{j=1}^{M} P(j) p(x | j) = Σ_{j=1}^{M} P(j) N_j(x; µ_j, σ_j²)

[Network diagram: the input dimensions x1 ... xd feed the component densities p(x | 1), p(x | 2), ..., p(x | M), which are combined with weights P(1), P(2), ..., P(M) to give p(x).]
GMM Parameter estimation when we know which component generated the data

Define the indicator variable z_jn = 1 if component j generated data point x^n (and 0 otherwise)
If z_jn wasn't hidden then we could count the number of observed data points generated by j:

    N_j = Σ_{n=1}^{N} z_jn

And estimate the mean, variance and mixing parameters as:

    µ̂_j = Σ_n z_jn x^n / N_j

    σ̂_j² = Σ_n z_jn ||x^n − µ̂_j||² / N_j

    P̂(j) = (1/N) Σ_n z_jn = N_j / N

Soft assignment

Estimate "soft counts" based on the component occupation probabilities P(j | x^n):

    N_j* = Σ_{n=1}^{N} P(j | x^n)

We can imagine assigning data points to component j weighted by the component occupation probability P(j | x^n)
So we could imagine estimating the mean, variance and prior probabilities as:

    µ̂_j = Σ_n P(j | x^n) x^n / Σ_n P(j | x^n) = Σ_n P(j | x^n) x^n / N_j*

    σ̂_j² = Σ_n P(j | x^n) ||x^n − µ̂_j||² / Σ_n P(j | x^n) = Σ_n P(j | x^n) ||x^n − µ̂_j||² / N_j*

    P̂(j) = (1/N) Σ_n P(j | x^n) = N_j* / N

EM algorithm

Problem! Recall that:

    P(j | x) = p(x | j) P(j) / p(x)

We need to know p(x | j) and P(j) to estimate the parameters of p(x | j) and to estimate P(j)....
Solution: an iterative algorithm where each iteration has two parts:
  Compute the component occupation probabilities P(j | x) using the current estimates of the GMM parameters (means, variances, mixing parameters) (E-step)
  Compute the GMM parameters using the current estimates of the component occupation probabilities (M-step)
Starting from some initialization (e.g. using k-means for the means) these steps are alternated until convergence
This is called the EM Algorithm and can be shown to maximize the likelihood
(A minimal sketch of one EM iteration follows below.)

Maximum likelihood parameter estimation

The likelihood of a data set X = {x^1, x^2, ..., x^N} is given by:

    L = Π_{n=1}^{N} p(x^n) = Π_{n=1}^{N} Σ_{j=1}^{M} p(x^n | j) P(j)

We can regard the negative log likelihood as an error function:

    E = − ln L = − Σ_{n=1}^{N} ln p(x^n)
      = − Σ_{n=1}^{N} ln ( Σ_{j=1}^{M} p(x^n | j) P(j) )

Considering the derivatives of E with respect to the parameters gives expressions like the previous slide
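
For concreteness, a minimal sketch of one EM iteration for a univariate GMM (illustrative NumPy/SciPy code, not from the slides; function and variable names are my own):

```python
import numpy as np
from scipy.stats import norm

def em_step(X, P, mu, sigma):
    """One EM iteration for a univariate GMM with weights P, means mu, std devs sigma."""
    # E-step: component occupation probabilities (responsibilities), shape (N, M)
    px_given_j = norm.pdf(X[:, None], loc=mu[None, :], scale=sigma[None, :])
    gamma = px_given_j * P[None, :]
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the soft counts
    Nj = gamma.sum(axis=0)
    mu_new = (gamma * X[:, None]).sum(axis=0) / Nj
    var_new = (gamma * (X[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / Nj
    P_new = Nj / len(X)
    return P_new, mu_new, np.sqrt(var_new)

# Example: alternate E and M steps on synthetic two-component data
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 0.5, 200)])
P, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    P, mu, sigma = em_step(X, P, mu, sigma)
print(P, mu, sigma)
```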
Example 1 fit using a GMM

[The two-cluster data of example 1 fitted with a two component GMM using EM.]

Peakily distributed data (Example 2)

[Scatter plot of peakily distributed data generated from two zero-mean components:
 µ1 = µ2 = [0 0]^T, Σ1 = 0.1I, Σ2 = 2I.]

Example 2 fit by a Gaussian

[The peaky data fit by a single Gaussian.]

Example 2 fit by a GMM

[The peaky data fitted with a two component GMM using EM.]


Example 2: component Gaussians

[The two component Gaussians of the GMM fit to example 2, plotted separately: a narrow component (Σ1 = 0.1I) and a broad component (Σ2 = 2I).]

Comments on GMMs

GMMs trained using the EM algorithm are able to self organize to fit a data set
Individual components take responsibility for parts of the data set (probabilistically)
Soft assignment to components, not hard assignment ("soft clustering")
GMMs scale very well, e.g.: large speech recognition systems can have 30,000 GMMs, each with 32 components: sometimes 1 million Gaussian components!! And the parameters all estimated from (a lot of) data by EM


Back to HMMs...

[The left-to-right HMM again: states sI, s1, s2, s3, sE with transition probabilities and output densities p(x | s1), p(x | s2), p(x | s3).]

Output distribution:
  Single multivariate Gaussian with mean µ_j, covariance matrix Σ_j:
      b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)
  M-component Gaussian mixture model:
      b_j(x) = p(x | s_j) = Σ_{m=1}^{M} c_jm N(x; µ_jm, Σ_jm)

The three problems of HMMs

Working with HMMs requires the solution of three problems:
  1 Likelihood: determine the overall likelihood of an observation sequence X = (x1, ..., xt, ..., xT) being generated by an HMM
  2 Decoding: given an observation sequence and an HMM, determine the most probable hidden state sequence
  3 Training: given an observation sequence and an HMM, learn the best HMM parameters λ = {{a_jk}, {b_j()}}
1. Likelihood: The Forward algorithm

Goal: determine p(X | λ)
Sum over all possible state sequences s1 s2 ... sT that could result in the observation sequence X
Rather than enumerating each sequence, compute the probabilities recursively (exploiting the Markov assumption)

Recursive algorithms on HMMs

Visualize the problem as a state-time trellis

[Trellis figure: states i, j, k drawn as columns at times t−1, t, t+1, with arcs between states at successive time steps.]


1. Likelihood: The Forward algorithm (continued)

Forward probability, α_t(s_j): the probability of observing the observation sequence x1 ... xt and being in state s_j at time t:

    α_t(s_j) = p(x1, ..., xt, S(t) = s_j | λ)

1. Likelihood: The Forward recursion

Initialization

    α_0(s_I) = 1
    α_0(s_j) = 0  if s_j ≠ s_I

Recursion

    α_t(s_j) = Σ_{i=1}^{N} α_{t−1}(s_i) a_ij b_j(x_t)

Termination

    p(X | λ) = α_T(s_E) = Σ_{i=1}^{N} α_T(s_i) a_iE
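
A minimal NumPy sketch of this recursion (illustrative; it assumes the emission likelihoods b_j(x_t) have already been evaluated into a T × N array B, with transition matrix A between emitting states, entry probabilities a_in from sI and exit probabilities a_out to sE):

```python
import numpy as np

def forward_likelihood(A, a_in, a_out, B):
    """p(X | lambda) by the forward recursion.
    A: (N, N) transitions between emitting states; a_in: (N,) from the entry state;
    a_out: (N,) to the exit state; B: (T, N) emission likelihoods b_j(x_t)."""
    T, N = B.shape
    alpha = a_in * B[0]                  # alpha_1(s_j) = a_Ij b_j(x_1)
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]       # alpha_t(s_j) = sum_i alpha_{t-1}(s_i) a_ij b_j(x_t)
    return alpha @ a_out                 # p(X | lambda) = sum_i alpha_T(s_i) a_iE

# Toy example with 2 emitting states and 3 frames (numbers are made up)
A = np.array([[0.7, 0.3], [0.0, 0.9]])
a_in, a_out = np.array([1.0, 0.0]), np.array([0.0, 0.1])
B = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]])
print(forward_likelihood(A, a_in, a_out, B))
```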

1. Likelihood: Forward Recursion

    α_t(s_j) = p(x1, ..., xt, S(t) = s_j | λ)

[Trellis figure: α_t(s_i) is accumulated from α_{t−1}(s_i), α_{t−1}(s_j), α_{t−1}(s_k) through the transitions a_ii, a_ji, a_ki, scaled by the emission likelihood for x_t.]

Viterbi approximation

Instead of summing over all possible state sequences, just consider the most likely
Achieve this by changing the summation to a maximisation in the recursion:

    V_t(s_j) = max_i V_{t−1}(s_i) a_ij b_j(x_t)

Changing the recursion in this way gives the likelihood of the most probable path
We need to keep track of the states that make up this path by keeping a sequence of backpointers to enable a Viterbi backtrace: the backpointer for each state at each time indicates the previous state on the most probable path

Viterbi Recursion

[Trellis figures: the likelihood V_t(s_i) of the most probable path is obtained by maximising over V_{t−1}(s_i), V_{t−1}(s_j), V_{t−1}(s_k); the backpointer bt_t(s_i) = s_j records the previous state on that path.]

2. Decoding: The Viterbi algorithm

Initialization

    V_0(s_I) = 1
    V_0(s_j) = 0  if s_j ≠ s_I
    bt_0(s_j) = 0

Recursion

    V_t(s_j) = max_{i=1}^{N} V_{t−1}(s_i) a_ij b_j(x_t)
    bt_t(s_j) = arg max_{i=1}^{N} V_{t−1}(s_i) a_ij b_j(x_t)

Termination

    P* = V_T(s_E) = max_{i=1}^{N} V_T(s_i) a_iE
    s_T* = bt_T(s_E) = arg max_{i=1}^{N} V_T(s_i) a_iE

Viterbi Backtrace

Backtrace to find the state sequence of the most probable path

[Trellis figure: starting from the final state, follow the backpointers (e.g. bt_t(s_i) = s_j, bt_{t+1}(s_k) = s_i) back through time to recover the most probable state sequence.]
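
A minimal NumPy sketch of the Viterbi recursion and backtrace, under the same assumptions as the forward sketch above (precomputed emission likelihoods B, transition matrix A, entry and exit probabilities):

```python
import numpy as np

def viterbi(A, a_in, a_out, B):
    """Most probable state path and its likelihood (illustrative sketch)."""
    T, N = B.shape
    V = a_in * B[0]                          # V_1(s_j)
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = V[:, None] * A              # scores[i, j] = V_{t-1}(s_i) a_ij
        backptr[t] = scores.argmax(axis=0)   # previous state on the best path into j
        V = scores.max(axis=0) * B[t]        # V_t(s_j) = max_i V_{t-1}(s_i) a_ij b_j(x_t)
    P_star = np.max(V * a_out)
    # Backtrace from the best final state
    path = [int(np.argmax(V * a_out))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return P_star, path[::-1]

A = np.array([[0.7, 0.3], [0.0, 0.9]])
a_in, a_out = np.array([1.0, 0.0]), np.array([0.0, 0.1])
B = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]])
print(viterbi(A, a_in, a_out, B))
```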

3. Training: Forward-Backward algorithm

Goal: efficiently estimate the parameters of an HMM λ from an observation sequence
Assume single Gaussian output probability distribution:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

Parameters λ:
  Transition probabilities a_ij, with Σ_j a_ij = 1
  Gaussian parameters for state s_j: mean vector µ_j; covariance matrix Σ_j

Viterbi Training

If we knew the state-time alignment, then each observation feature vector could be assigned to a specific state
A state-time alignment can be obtained using the most probable path obtained by Viterbi decoding
Maximum likelihood estimate of a_ij, if C(s_i → s_j) is the count of transitions from s_i to s_j:

    â_ij = C(s_i → s_j) / Σ_k C(s_i → s_k)

Likewise if Z_j is the set of observed acoustic feature vectors assigned to state j, we can use the standard maximum likelihood estimates for the mean and the covariance:

    µ̂_j = Σ_{x ∈ Z_j} x / |Z_j|

    Σ̂_j = Σ_{x ∈ Z_j} (x − µ̂_j)(x − µ̂_j)^T / |Z_j|
EM Algorithm

Viterbi training is an approximation: we would like to consider all possible paths
In this case rather than having a hard state-time alignment we estimate a probability
State occupation probability: the probability γ_t(s_j) of occupying state s_j at time t given the sequence of observations
Compare with component occupation probability in a GMM
We can use this for an iterative algorithm for HMM training: the EM algorithm
Each iteration has two steps:
  E-step  estimate the state occupation probabilities (Expectation)
  M-step  re-estimate the HMM parameters based on the estimated state occupation probabilities (Maximisation)

Backward probabilities

To estimate the state occupation probabilities it is useful to define (recursively) another set of probabilities, the Backward probabilities:

    β_t(s_j) = p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)

The probability of the future observations given that the HMM is in state s_j at time t
These can be recursively computed (going backwards in time)

Initialisation

    β_T(s_i) = a_iE

Recursion

    β_t(s_i) = Σ_{j=1}^{N} a_ij b_j(x_{t+1}) β_{t+1}(s_j)

Termination

    p(X | λ) = β_0(s_I) = Σ_{j=1}^{N} a_Ij b_j(x_1) β_1(s_j) = α_T(s_E)
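
A matching NumPy sketch of the backward recursion, under the same assumptions as the forward sketch (precomputed emission likelihoods B, transitions A, entry and exit probabilities):

```python
import numpy as np

def backward_probs(A, a_in, a_out, B):
    """Backward probabilities beta_t(s_j) for all t (illustrative sketch)."""
    T, N = B.shape
    beta = np.zeros((T, N))
    beta[T - 1] = a_out                          # beta_T(s_i) = a_iE
    for t in range(T - 2, -1, -1):
        # beta_t(s_i) = sum_j a_ij b_j(x_{t+1}) beta_{t+1}(s_j)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    # Termination: p(X | lambda) = sum_j a_Ij b_j(x_1) beta_1(s_j)
    likelihood = np.sum(a_in * B[0] * beta[0])
    return beta, likelihood

A = np.array([[0.7, 0.3], [0.0, 0.9]])
a_in, a_out = np.array([1.0, 0.0]), np.array([0.0, 0.1])
B = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]])
beta, like = backward_probs(A, a_in, a_out, B)
print(like)    # matches the forward likelihood for the same toy model
```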

Backward Recursion

    β_t(s_j) = p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)

[Trellis figure: β_t(s_i) is accumulated from β_{t+1}(s_i), β_{t+1}(s_j), β_{t+1}(s_k) through the transitions a_ii, a_ij, a_ik and the emission likelihoods b_i(x_{t+1}), b_j(x_{t+1}), b_k(x_{t+1}).]

State Occupation Probability

The state occupation probability γ_t(s_j) is the probability of occupying state s_j at time t given the sequence of observations
Express in terms of the forward and backward probabilities:

    γ_t(s_j) = P(S(t) = s_j | X, λ) = (1 / α_T(s_E)) α_t(s_j) β_t(s_j)

recalling that p(X | λ) = α_T(s_E)
Since

    α_t(s_j) β_t(s_j) = p(x1, ..., xt, S(t) = s_j | λ) p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)
                      = p(x1, ..., xt, x_{t+1}, x_{t+2}, ..., x_T, S(t) = s_j | λ)
                      = p(X, S(t) = s_j | λ)

we have

    P(S(t) = s_j | X, λ) = p(X, S(t) = s_j | λ) / p(X | λ)
Re-estimation of Gaussian parameters

The sum of state occupation probabilities through time for a state may be regarded as a "soft" count
We can use this "soft" alignment to re-estimate the HMM parameters:

    µ̂_j = Σ_{t=1}^{T} γ_t(s_j) x_t / Σ_{t=1}^{T} γ_t(s_j)

    Σ̂_j = Σ_{t=1}^{T} γ_t(s_j) (x_t − µ̂_j)(x_t − µ̂_j)^T / Σ_{t=1}^{T} γ_t(s_j)

Re-estimation of transition probabilities

Similarly to the state occupation probability, we can estimate ξ_t(s_i, s_j), the probability of being in s_i at time t and s_j at t + 1, given the observations:

    ξ_t(s_i, s_j) = P(S(t) = s_i, S(t+1) = s_j | X, λ)
                  = P(S(t) = s_i, S(t+1) = s_j, X | λ) / p(X | λ)
                  = α_t(s_i) a_ij b_j(x_{t+1}) β_{t+1}(s_j) / α_T(s_E)

We can use this to re-estimate the transition probabilities:

    â_ij = Σ_{t=1}^{T} ξ_t(s_i, s_j) / Σ_{k=1}^{N} Σ_{t=1}^{T} ξ_t(s_i, s_k)
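
Continuing the earlier sketches (illustrative only): given the full arrays of forward probabilities alpha and backward probabilities beta, both of shape (T, N), plus the transitions A, emission likelihoods B, observations X and the total likelihood, the occupation probabilities and these re-estimates could be computed as follows; covariances would be re-estimated analogously.

```python
import numpy as np

def reestimate(alpha, beta, A, B, X, like):
    """One re-estimation step from precomputed alpha, beta (both (T, N)),
    transition matrix A (N, N), emission likelihoods B (T, N),
    observations X (T, d) and total likelihood like = p(X | lambda)."""
    # State occupation probabilities gamma_t(s_j) = alpha_t(s_j) beta_t(s_j) / p(X | lambda)
    gamma = alpha * beta / like
    # Pairwise probabilities xi_t(s_i, s_j) = alpha_t(s_i) a_ij b_j(x_{t+1}) beta_{t+1}(s_j) / p(X | lambda)
    xi = alpha[:-1, :, None] * A[None, :, :] * (B[1:] * beta[1:])[:, None, :] / like
    # Re-estimates: transitions and state means ("soft" counts in the denominators)
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    occ = gamma.sum(axis=0)
    mu_new = (gamma.T @ X) / occ[:, None]
    return gamma, xi, A_new, mu_new
```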


Pulling it all together

Iterative estimation of HMM parameters using the EM algorithm. At each iteration:
  E step  For all time-state pairs
    1 Recursively compute the forward probabilities α_t(s_j) and backward probabilities β_t(s_j)
    2 Compute the state occupation probabilities γ_t(s_j) and ξ_t(s_i, s_j)
  M step  Based on the estimated state occupation probabilities, re-estimate the HMM parameters: mean vectors µ_j, covariance matrices Σ_j and transition probabilities a_ij
The application of the EM algorithm to HMM training is sometimes called the Forward-Backward algorithm

Extension to a corpus of utterances

We usually train from a large corpus of R utterances
If x_t^r is the tth frame of the rth utterance X^r then we can compute the probabilities α_t^r(j), β_t^r(j), γ_t^r(s_j) and ξ_t^r(s_i, s_j) as before
The re-estimates are as before, except we must sum over the R utterances, eg:

    µ̂_j = Σ_{r=1}^{R} Σ_{t=1}^{T} γ_t^r(s_j) x_t^r / Σ_{r=1}^{R} Σ_{t=1}^{T} γ_t^r(s_j)

Extension to Gaussian mixture model (GMM)

The assumption of a Gaussian distribution at each state is very strong; in practice the acoustic feature vectors associated with a state may be strongly non-Gaussian
In this case an M-component Gaussian mixture model is an appropriate density function:

    b_j(x) = p(x | s_j) = Σ_{m=1}^{M} c_jm N(x; µ_jm, Σ_jm)

Given enough components, this family of functions can model any distribution.
Train using the EM algorithm, in which the component occupation probabilities are estimated in the E-step

EM training of HMM/GMM

Rather than estimating the state-time alignment, we estimate the component/state-time alignment, and component-state occupation probabilities γ_t(s_j, m): the probability of occupying mixture component m of state s_j at time t
We can thus re-estimate the mean of mixture component m of state s_j as follows:

    µ̂_jm = Σ_{t=1}^{T} γ_t(s_j, m) x_t / Σ_{t=1}^{T} γ_t(s_j, m)

And likewise for the covariance matrices (mixture models often use diagonal covariance matrices)
The mixture coefficients are re-estimated in a similar way to transition probabilities:

    ĉ_jm = Σ_{t=1}^{T} γ_t(s_j, m) / Σ_{ℓ=1}^{M} Σ_{t=1}^{T} γ_t(s_j, ℓ)


Doing the computation

The forward, backward and Viterbi recursions result in a long sequence of probabilities being multiplied
This can cause floating point underflow problems
In practice computations are performed in the log domain (in which multiplies become adds)
Working in the log domain also avoids needing to perform the exponentiation when computing Gaussians
(A short log-domain sketch follows the summary below.)

Summary: HMMs

HMMs provide a generative model for statistical speech recognition
Three key problems
  1 Computing the overall likelihood: the Forward algorithm
  2 Decoding the most likely state sequence: the Viterbi algorithm
  3 Estimating the most likely parameters: the EM (Forward-Backward) algorithm
Solutions to these problems are tractable due to the two key HMM assumptions
  1 Conditional independence of observations given the current state
  2 Markov assumption on the states
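
To make the log-domain point concrete, here is a small sketch (illustrative, not from the slides) of the forward recursion computed entirely with log probabilities, using the standard log-sum-exp operation so that the sums over states are also numerically stable:

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglikelihood(logA, log_a_in, log_a_out, logB):
    """Forward recursion entirely in the log domain (multiplies become adds)."""
    T, N = logB.shape
    log_alpha = log_a_in + logB[0]
    for t in range(1, T):
        # log alpha_t(s_j) = logsumexp_i [ log alpha_{t-1}(s_i) + log a_ij ] + log b_j(x_t)
        log_alpha = logsumexp(log_alpha[:, None] + logA, axis=0) + logB[t]
    return logsumexp(log_alpha + log_a_out)

# The same toy model as in the earlier sketches, now with log parameters
with np.errstate(divide="ignore"):          # log(0) = -inf is intended here
    logA = np.log(np.array([[0.7, 0.3], [0.0, 0.9]]))
    log_a_in = np.log(np.array([1.0, 0.0]))
    log_a_out = np.log(np.array([0.0, 0.1]))
    logB = np.log(np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]]))
print(np.exp(forward_loglikelihood(logA, log_a_in, log_a_out, logB)))
```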

References: HMMs

Gales and Young (2007). "The Application of Hidden Markov Models in Speech Recognition", Foundations and Trends in Signal Processing, 1 (3), 195–304: section 2.2.
Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.): sections 6.1–6.5; 9.2; 9.4. (Errata at http://www.cs.colorado.edu/~martin/SLP/Errata/SLP2-PIEV-Errata.html)
Rabiner and Juang (1989). "An introduction to hidden Markov models", IEEE ASSP Magazine, 3 (1), 4–16.
Renals and Hain (2010). "Speech Recognition", Computational Linguistics and Natural Language Processing Handbook, Clark, Fox and Lappin (eds.), Blackwells.

