
Hidden Markov Models and Gaussian Mixture Models

Steve Renals and Peter Bell

Automatic Speech Recognition — ASR Lectures 4&5
28/31 January 2013

Overview

HMMs and GMMs
Key models and algorithms for HMM acoustic models
  Gaussians
  GMMs: Gaussian mixture models
  HMMs: Hidden Markov models
  HMM algorithms
    Likelihood computation (forward algorithm)
    Most probable state sequence (Viterbi algorithm)
    Estimating the parameters (EM algorithm)


Fundamental Equation of Statistical Speech Recognition

If X is the sequence of acoustic feature vectors (observations) and W denotes a word sequence, the most likely word sequence W* is given by

    W* = arg max_W P(W | X)

Applying Bayes' Theorem:

    P(W | X) = p(X | W) P(W) / p(X)
             ∝ p(X | W) P(W)

    W* = arg max_W p(X | W) P(W)

where p(X | W) is the acoustic model and P(W) is the language model.

Acoustic Modelling

[Block diagram: Recorded Speech → Signal Analysis → Search → Decoded Text (Transcription). The Search block uses the Acoustic Model (a Hidden Markov Model), the Lexicon and the Language Model, which are estimated from Training Data.]

Hierarchical modelling of speech

[Generative model figure: the utterance "No right" is decomposed into words (NO RIGHT), subword units (n oh r ai t), HMM states, and finally the acoustics.]

Acoustic Model: Continuous Density HMM

[HMM figure: entry state sI, emitting states s1, s2, s3 and exit state sE; transition probabilities P(s1 | sI), P(s2 | s1), P(s3 | s2), P(sE | s3); self-loop probabilities P(s1 | s1), P(s2 | s2), P(s3 | s3); output densities p(x | s1), p(x | s2), p(x | s3).]

Probabilistic finite state automaton
Parameters λ:
  Transition probabilities: a_kj = P(s_j | s_k)
  Output probability density function: b_j(x) = p(x | s_j)

Acoustic Model: Continuous Density HMM (continued)

[The same HMM shown generating an observation sequence: the emitting states s1, s2, s3 are aligned to the acoustic observations x1, x2, x3, x4, x5, x6.]

HMM Assumptions

[Graphical model: a chain of states s(t−1) → s(t) → s(t+1), each emitting the corresponding observation x(t−1), x(t), x(t+1).]

1 Observation independence: an acoustic observation x is conditionally independent of all other observations given the state that generated it
2 Markov process: a state is conditionally independent of all other states given the previous state

HMM OUTPUT DISTRIBUTION


Output distribution

[The same left-to-right HMM figure: states sI, s1, s2, s3, sE with transition probabilities and output densities p(x | s1), p(x | s2), p(x | s3).]

Single multivariate Gaussian with mean µ_j, covariance matrix Σ_j:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

M-component Gaussian mixture model:

    b_j(x) = p(x | s_j) = Σ_{m=1}^{M} c_jm N(x; µ_jm, Σ_jm)

Background: cdf

Consider a real valued random variable X
Cumulative distribution function (cdf) F(x) for X:

    F(x) = P(X ≤ x)

To obtain the probability of falling in an interval we can do the following:

    P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a)

Background: pdf

The rate of change of the cdf gives us the probability density function (pdf), p(x):

    p(x) = d/dx F(x) = F′(x)

    F(x) = ∫_{−∞}^{x} p(x) dx

p(x) is not the probability that X has value x. But the pdf is proportional to the probability that X lies in a small interval centred on x.
Notation: p for pdf, P for probability

The Gaussian distribution (univariate)

The Gaussian (or Normal) distribution is the most common (and easily analysed) continuous distribution
It is also a reasonable model in many situations (the famous "bell curve")
If a (scalar) variable has a Gaussian distribution, then it has a probability density function with this form:

    p(x | µ, σ²) = N(x; µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))

The Gaussian is described by two parameters:
  the mean µ (location)
  the variance σ² (dispersion)


Plot of Gaussian distribution

Gaussians have the same shape, with the location controlled by the mean, and the spread controlled by the variance
One-dimensional Gaussian with zero mean and unit variance (µ = 0, σ² = 1):

[Plot: pdf of a Gaussian distribution with mean 0 and variance 1.]

Properties of the Gaussian distribution

    N(x; µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))

[Plot: pdfs of Gaussian distributions with mean 0 and variances 1, 2 and 4; the larger the variance, the lower and broader the curve.]

Parameter estimation

Estimate mean and variance parameters of a Gaussian from data x^1, x^2, ..., x^n
Use sample mean and sample variance estimates:

    µ = (1/n) Σ_{i=1}^{n} x^i            (sample mean)

    σ² = (1/n) Σ_{i=1}^{n} (x^i − µ)²    (sample variance)

Exercise

Consider the log likelihood of a set of N data points {x^1, ..., x^N} being generated by a Gaussian with mean µ and variance σ²:

    L = ln p({x^1, ..., x^N} | µ, σ²) = −(1/2) Σ_{n=1}^{N} [ (x^n − µ)²/σ² + ln σ² + ln(2π) ]

      = −(1/(2σ²)) Σ_{n=1}^{N} (x^n − µ)² − (N/2) ln σ² − (N/2) ln(2π)

By maximising the log likelihood function with respect to µ show that the maximum likelihood estimate for the mean is indeed the sample mean:

    µ_ML = (1/N) Σ_{n=1}^{N} x^n .
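
A brief sketch of the maximisation step the exercise asks for: differentiate L with respect to µ and set the derivative to zero.

```latex
\frac{\partial L}{\partial \mu}
  = \frac{1}{\sigma^2}\sum_{n=1}^{N}\left(x^{n}-\mu\right) = 0
\;\Longrightarrow\;
\sum_{n=1}^{N} x^{n} = N\mu
\;\Longrightarrow\;
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x^{n}
```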


The multidimensional Gaussian distribution

The d-dimensional vector x is multivariate Gaussian if it has a probability density function of the following form:

    p(x | µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

The pdf is parameterized by the mean vector µ and the covariance matrix Σ.
The 1-dimensional Gaussian is a special case of this pdf
The argument to the exponential, −(1/2)(x − µ)^T Σ^{−1} (x − µ), is referred to as a quadratic form.

Covariance matrix

The mean vector µ is the expectation of x:

    µ = E[x]

The covariance matrix Σ is the expectation of the deviation of x from the mean:

    Σ = E[(x − µ)(x − µ)^T]

Σ is a d × d symmetric matrix:

    Σ_ij = E[(x_i − µ_i)(x_j − µ_j)] = E[(x_j − µ_j)(x_i − µ_i)] = Σ_ji

The sign of the covariance helps to determine the relationship between two components:
  If x_j is large when x_i is large, then (x_j − µ_j)(x_i − µ_i) will tend to be positive;
  If x_j is small when x_i is large, then (x_j − µ_j)(x_i − µ_i) will tend to be negative.
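
As a small illustration (not from the slides), a minimal NumPy sketch that evaluates this log-density directly from the quadratic form; the function name logpdf_gauss and the example numbers are just for illustration:

```python
import numpy as np

def logpdf_gauss(x, mu, Sigma):
    """Log of the multivariate Gaussian pdf N(x; mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve
    quad = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# Example: a 2-dimensional Gaussian with a full covariance matrix
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, -1.0],
                  [-1.0, 4.0]])
print(logpdf_gauss(np.array([0.5, -0.5]), mu, Sigma))
```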
Spherical Gaussian

[Surface and contour plots of p(x1, x2) for a spherical Gaussian:
 µ = [0 0]^T, Σ = [[1, 0], [0, 1]], ρ12 = 0; the contours are circular.]

Diagonal Covariance Gaussian

[Surface and contour plots of p(x1, x2) for a diagonal-covariance Gaussian:
 µ = [0 0]^T, Σ = [[1, 0], [0, 4]], ρ12 = 0; the contours are axis-aligned ellipses.]


Full covariance Gaussian

[Surface and contour plots of p(x1, x2) for a full-covariance Gaussian:
 µ = [0 0]^T, Σ = [[1, −1], [−1, 4]], ρ12 = −0.5; the contours are tilted ellipses.]

Parameter estimation

It is possible to show that the mean vector µ̂ and covariance matrix Σ̂ that maximize the likelihood of the training data are given by:

    µ̂ = (1/N) Σ_{n=1}^{N} x^n

    Σ̂ = (1/N) Σ_{n=1}^{N} (x^n − µ̂)(x^n − µ̂)^T

The mean of the distribution is estimated by the sample mean and the covariance by the sample covariance
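
A minimal NumPy sketch of these maximum likelihood estimates (illustrative only; X is assumed to be an N × d array of training vectors):

```python
import numpy as np

def ml_fit_gaussian(X):
    """Maximum likelihood estimates of mean and covariance for data X (N x d)."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)              # sample mean
    diff = X - mu_hat
    Sigma_hat = (diff.T @ diff) / N      # sample covariance (divide by N, not N - 1)
    return mu_hat, Sigma_hat

# Example: fit a 2-d Gaussian to synthetic data
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, -1], [-1, 4]], size=1000)
mu_hat, Sigma_hat = ml_fit_gaussian(X)
print(mu_hat, Sigma_hat, sep="\n")
```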

Example data

[Scatter plot of two-dimensional example data (X1 vs X2).]

Maximum likelihood fit to a Gaussian

[The same data with the maximum likelihood Gaussian fit overlaid.]


Data in clusters (example 1)

[Scatter plot of data drawn from two clusters: µ1 = [0 0]^T, µ2 = [1 1]^T, Σ1 = Σ2 = 0.2I.]

Example 1 fit by a Gaussian

[The same two-cluster data fit by a single Gaussian.]

k-means clustering

k-means is an automatic procedure for clustering unlabelled data
Requires a prespecified number of clusters
Clustering algorithm chooses a set of clusters with the minimum within-cluster variance
Guaranteed to converge (eventually)
Clustering solution is dependent on the initialisation
(A short sketch of the procedure follows the example data below.)

k-means example: data set

[Plot of the 14 example points:
 (4,13), (2,9), (7,8), (6,6), (7,6), (4,5), (10,5), (5,4), (8,4), (1,2), (5,2), (1,1), (3,1), (10,0).]
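
A minimal k-means sketch in plain NumPy (not from the slides), run on the example points above with K = 3 clusters and an arbitrary initialisation:

```python
import numpy as np

points = np.array([(4,13), (2,9), (7,8), (6,6), (7,6), (4,5), (10,5),
                   (5,4), (8,4), (1,2), (5,2), (1,1), (3,1), (10,0)], dtype=float)

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]      # initialise from the data
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centre becomes the mean of its assigned points
        new_centres = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centres[k] for k in range(K)])
        if np.allclose(new_centres, centres):
            break                                           # assignments stable: converged
        centres = new_centres
    return centres, assign

centres, assign = kmeans(points, K=3)
print(centres)
```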


k-means example: initialization

[The example points with three initial cluster centres chosen.]

k-means example: iteration 1 (assign points to clusters)

[Each point is assigned to its nearest centre.]

k-means example: iteration 1 (recompute centres)

[Centres are recomputed as the means of their assigned points: (4.33, 10), (3.57, 3), (8.75, 3.75).]

k-means example: iteration 2 (assign points to clusters)

[Points are re-assigned to the nearest of the new centres.]


k-means example: iteration 2 (recompute centres)

[Centres are recomputed again: (4.33, 10), (3.17, 2.5), (8.2, 4.2).]

k-means example: iteration 3 (assign points to clusters)

[Points are re-assigned: no changes, so converged.]
Mixture model

A more flexible form of density estimation is made up of a linear combination of component densities:

    p(x) = Σ_{j=1}^{M} p(x | j) P(j)

This is called a mixture model or a mixture density
  p(x | j): component densities
  P(j): mixing parameters
Generative model:
  1 Choose a mixture component based on P(j)
  2 Generate a data point x from the chosen component using p(x | j)

Component occupation probability

We can apply Bayes' theorem:

    P(j | x) = p(x | j) P(j) / p(x) = p(x | j) P(j) / Σ_{j′=1}^{M} p(x | j′) P(j′)

The posterior probabilities P(j | x) give the probability that component j was responsible for generating data point x
The P(j | x)s are called the component occupation probabilities (or sometimes called the responsibilities)
Since they are posterior probabilities:

    Σ_{j=1}^{M} P(j | x) = 1
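
As a small numerical illustration (assumed, not from the slides): computing the component occupation probabilities of a one-dimensional two-component Gaussian mixture with NumPy and SciPy; all parameter values are made up:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2-component univariate GMM: mixing weights, means, standard deviations
P = np.array([0.4, 0.6])
mu = np.array([0.0, 1.0])
sigma = np.array([0.5, 0.5])

x = 0.7
# Bayes' theorem: P(j|x) = p(x|j) P(j) / sum_j' p(x|j') P(j')
px_given_j = norm.pdf(x, loc=mu, scale=sigma)
posterior = px_given_j * P / np.sum(px_given_j * P)
print(posterior, posterior.sum())   # the responsibilities sum to 1
```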


Parameter estimation

If we knew which mixture component was responsible for a data point:
  we would be able to assign each point unambiguously to a mixture component
  and we could estimate the mean for each component Gaussian as the sample mean (just like k-means clustering)
  and we could estimate the covariance as the sample covariance
But we don't know which mixture component a data point comes from...
Maybe we could use the component occupation probabilities P(j | x)?

Gaussian mixture model

The most important mixture model is the Gaussian Mixture Model (GMM), where the component densities are Gaussians
Consider a GMM where each component Gaussian N_j(x; µ_j, σ_j²) has mean µ_j and a spherical covariance Σ = σ²I

    p(x) = Σ_{j=1}^{M} P(j) p(x | j) = Σ_{j=1}^{M} P(j) N_j(x; µ_j, σ_j²)

[Network diagram: the input dimensions x1 ... xd feed the component densities p(x | 1), p(x | 2), ..., p(x | M), which are combined with weights P(1), P(2), ..., P(M) to give p(x).]
GMM Parameter estimation when we know which component generated the data

Define the indicator variable z_jn = 1 if component j generated data point x^n (and 0 otherwise)
If z_jn wasn't hidden then we could count the number of observed data points generated by j:

    N_j = Σ_{n=1}^{N} z_jn

And estimate the mean, variance and mixing parameters as:

    µ̂_j = Σ_n z_jn x^n / N_j

    σ̂_j² = Σ_n z_jn ||x^n − µ̂_j||² / N_j

    P̂(j) = (1/N) Σ_n z_jn = N_j / N

Soft assignment

Estimate "soft counts" based on the component occupation probabilities P(j | x^n):

    N_j* = Σ_{n=1}^{N} P(j | x^n)

We can imagine assigning data points to component j weighted by the component occupation probability P(j | x^n)
So we could imagine estimating the mean, variance and prior probabilities as:

    µ̂_j = Σ_n P(j | x^n) x^n / Σ_n P(j | x^n) = Σ_n P(j | x^n) x^n / N_j*

    σ̂_j² = Σ_n P(j | x^n) ||x^n − µ̂_j||² / Σ_n P(j | x^n) = Σ_n P(j | x^n) ||x^n − µ̂_j||² / N_j*

    P̂(j) = (1/N) Σ_n P(j | x^n) = N_j* / N

EM algorithm

Problem! Recall that:

    P(j | x) = p(x | j) P(j) / p(x)

We need to know p(x | j) and P(j) to estimate the parameters of p(x | j) and to estimate P(j)....
Solution: an iterative algorithm where each iteration has two parts:
  Compute the component occupation probabilities P(j | x) using the current estimates of the GMM parameters (means, variances, mixing parameters) (E-step)
  Compute the GMM parameters using the current estimates of the component occupation probabilities (M-step)
Starting from some initialization (e.g. using k-means for the means) these steps are alternated until convergence
This is called the EM Algorithm and can be shown to maximize the likelihood
(A minimal sketch of one EM iteration follows below.)

Maximum likelihood parameter estimation

The likelihood of a data set X = {x^1, x^2, ..., x^N} is given by:

    L = Π_{n=1}^{N} p(x^n) = Π_{n=1}^{N} Σ_{j=1}^{M} p(x^n | j) P(j)

We can regard the negative log likelihood as an error function:

    E = − ln L = − Σ_{n=1}^{N} ln p(x^n)
      = − Σ_{n=1}^{N} ln ( Σ_{j=1}^{M} p(x^n | j) P(j) )

Considering the derivatives of E with respect to the parameters gives expressions like the previous slide
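
For concreteness, a minimal sketch of one EM iteration for a univariate GMM (illustrative NumPy/SciPy code, not from the slides; function and variable names are my own):

```python
import numpy as np
from scipy.stats import norm

def em_step(X, P, mu, sigma):
    """One EM iteration for a univariate GMM with weights P, means mu, std devs sigma."""
    # E-step: component occupation probabilities (responsibilities), shape (N, M)
    px_given_j = norm.pdf(X[:, None], loc=mu[None, :], scale=sigma[None, :])
    gamma = px_given_j * P[None, :]
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the soft counts
    Nj = gamma.sum(axis=0)
    mu_new = (gamma * X[:, None]).sum(axis=0) / Nj
    var_new = (gamma * (X[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / Nj
    P_new = Nj / len(X)
    return P_new, mu_new, np.sqrt(var_new)

# Example: alternate E and M steps on synthetic two-component data
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 0.5, 200)])
P, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    P, mu, sigma = em_step(X, P, mu, sigma)
print(P, mu, sigma)
```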
Example 1 fit using a GMM

[The two-cluster data of example 1 fitted with a two component GMM using EM.]

Peakily distributed data (Example 2)

[Scatter plot of peakily distributed data generated from two zero-mean components:
 µ1 = µ2 = [0 0]^T, Σ1 = 0.1I, Σ2 = 2I.]

Example 2 fit by a Gaussian

[The peaky data fit by a single Gaussian.]

Example 2 fit by a GMM

[The peaky data fitted with a two component GMM using EM.]


Example 2: component Gaussians

[The two component Gaussians of the GMM fit to example 2, plotted separately: a narrow component (Σ1 = 0.1I) and a broad component (Σ2 = 2I).]

Comments on GMMs

GMMs trained using the EM algorithm are able to self organize to fit a data set
Individual components take responsibility for parts of the data set (probabilistically)
Soft assignment to components, not hard assignment ("soft clustering")
GMMs scale very well, e.g.: large speech recognition systems can have 30,000 GMMs, each with 32 components: sometimes 1 million Gaussian components!! And the parameters all estimated from (a lot of) data by EM


Back to HMMs...

[The left-to-right HMM again: states sI, s1, s2, s3, sE with transition probabilities and output densities p(x | s1), p(x | s2), p(x | s3).]

Output distribution:
  Single multivariate Gaussian with mean µ_j, covariance matrix Σ_j:
      b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)
  M-component Gaussian mixture model:
      b_j(x) = p(x | s_j) = Σ_{m=1}^{M} c_jm N(x; µ_jm, Σ_jm)

The three problems of HMMs

Working with HMMs requires the solution of three problems:
  1 Likelihood: determine the overall likelihood of an observation sequence X = (x1, ..., xt, ..., xT) being generated by an HMM
  2 Decoding: given an observation sequence and an HMM, determine the most probable hidden state sequence
  3 Training: given an observation sequence and an HMM, learn the best HMM parameters λ = {{a_jk}, {b_j()}}
1. Likelihood: The Forward algorithm

Goal: determine p(X | λ)
Sum over all possible state sequences s1 s2 ... sT that could result in the observation sequence X
Rather than enumerating each sequence, compute the probabilities recursively (exploiting the Markov assumption)

Recursive algorithms on HMMs

Visualize the problem as a state-time trellis

[Trellis figure: states i, j, k drawn as columns at times t−1, t, t+1, with arcs between states at successive time steps.]


1. Likelihood: The Forward algorithm (continued)

Forward probability, α_t(s_j): the probability of observing the observation sequence x1 ... xt and being in state s_j at time t:

    α_t(s_j) = p(x1, ..., xt, S(t) = s_j | λ)

1. Likelihood: The Forward recursion

Initialization

    α_0(s_I) = 1
    α_0(s_j) = 0  if s_j ≠ s_I

Recursion

    α_t(s_j) = Σ_{i=1}^{N} α_{t−1}(s_i) a_ij b_j(x_t)

Termination

    p(X | λ) = α_T(s_E) = Σ_{i=1}^{N} α_T(s_i) a_iE
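
A minimal NumPy sketch of this recursion (illustrative; it assumes the emission likelihoods b_j(x_t) have already been evaluated into a T × N array B, with transition matrix A between emitting states, entry probabilities a_in from sI and exit probabilities a_out to sE):

```python
import numpy as np

def forward_likelihood(A, a_in, a_out, B):
    """p(X | lambda) by the forward recursion.
    A: (N, N) transitions between emitting states; a_in: (N,) from the entry state;
    a_out: (N,) to the exit state; B: (T, N) emission likelihoods b_j(x_t)."""
    T, N = B.shape
    alpha = a_in * B[0]                  # alpha_1(s_j) = a_Ij b_j(x_1)
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]       # alpha_t(s_j) = sum_i alpha_{t-1}(s_i) a_ij b_j(x_t)
    return alpha @ a_out                 # p(X | lambda) = sum_i alpha_T(s_i) a_iE

# Toy example with 2 emitting states and 3 frames (numbers are made up)
A = np.array([[0.7, 0.3], [0.0, 0.9]])
a_in, a_out = np.array([1.0, 0.0]), np.array([0.0, 0.1])
B = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]])
print(forward_likelihood(A, a_in, a_out, B))
```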

1. Likelihood: Forward Recursion

    α_t(s_j) = p(x1, ..., xt, S(t) = s_j | λ)

[Trellis figure: α_t(s_i) is accumulated from α_{t−1}(s_i), α_{t−1}(s_j), α_{t−1}(s_k) through the transitions a_ii, a_ji, a_ki, scaled by the emission likelihood for x_t.]

Viterbi approximation

Instead of summing over all possible state sequences, just consider the most likely
Achieve this by changing the summation to a maximisation in the recursion:

    V_t(s_j) = max_i V_{t−1}(s_i) a_ij b_j(x_t)

Changing the recursion in this way gives the likelihood of the most probable path
We need to keep track of the states that make up this path by keeping a sequence of backpointers to enable a Viterbi backtrace: the backpointer for each state at each time indicates the previous state on the most probable path

Viterbi Recursion

[Trellis figures: the likelihood V_t(s_i) of the most probable path is obtained by maximising over V_{t−1}(s_i), V_{t−1}(s_j), V_{t−1}(s_k); the backpointer bt_t(s_i) = s_j records the previous state on that path.]

2. Decoding: The Viterbi algorithm

Initialization

    V_0(s_I) = 1
    V_0(s_j) = 0  if s_j ≠ s_I
    bt_0(s_j) = 0

Recursion

    V_t(s_j) = max_{i=1}^{N} V_{t−1}(s_i) a_ij b_j(x_t)
    bt_t(s_j) = arg max_{i=1}^{N} V_{t−1}(s_i) a_ij b_j(x_t)

Termination

    P* = V_T(s_E) = max_{i=1}^{N} V_T(s_i) a_iE
    s_T* = bt_T(s_E) = arg max_{i=1}^{N} V_T(s_i) a_iE

Viterbi Backtrace

Backtrace to find the state sequence of the most probable path

[Trellis figure: starting from the final state, follow the backpointers (e.g. bt_t(s_i) = s_j, bt_{t+1}(s_k) = s_i) back through time to recover the most probable state sequence.]
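
A minimal NumPy sketch of the Viterbi recursion and backtrace, under the same assumptions as the forward sketch above (precomputed emission likelihoods B, transition matrix A, entry and exit probabilities):

```python
import numpy as np

def viterbi(A, a_in, a_out, B):
    """Most probable state path and its likelihood (illustrative sketch)."""
    T, N = B.shape
    V = a_in * B[0]                          # V_1(s_j)
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = V[:, None] * A              # scores[i, j] = V_{t-1}(s_i) a_ij
        backptr[t] = scores.argmax(axis=0)   # previous state on the best path into j
        V = scores.max(axis=0) * B[t]        # V_t(s_j) = max_i V_{t-1}(s_i) a_ij b_j(x_t)
    P_star = np.max(V * a_out)
    # Backtrace from the best final state
    path = [int(np.argmax(V * a_out))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return P_star, path[::-1]

A = np.array([[0.7, 0.3], [0.0, 0.9]])
a_in, a_out = np.array([1.0, 0.0]), np.array([0.0, 0.1])
B = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]])
print(viterbi(A, a_in, a_out, B))
```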

3. Training: Forward-Backward algorithm

Goal: efficiently estimate the parameters of an HMM λ from an observation sequence
Assume single Gaussian output probability distribution:

    b_j(x) = p(x | s_j) = N(x; µ_j, Σ_j)

Parameters λ:
  Transition probabilities a_ij, with Σ_j a_ij = 1
  Gaussian parameters for state s_j: mean vector µ_j; covariance matrix Σ_j

Viterbi Training

If we knew the state-time alignment, then each observation feature vector could be assigned to a specific state
A state-time alignment can be obtained using the most probable path obtained by Viterbi decoding
Maximum likelihood estimate of a_ij, if C(s_i → s_j) is the count of transitions from s_i to s_j:

    â_ij = C(s_i → s_j) / Σ_k C(s_i → s_k)

Likewise if Z_j is the set of observed acoustic feature vectors assigned to state j, we can use the standard maximum likelihood estimates for the mean and the covariance:

    µ̂_j = Σ_{x ∈ Z_j} x / |Z_j|

    Σ̂_j = Σ_{x ∈ Z_j} (x − µ̂_j)(x − µ̂_j)^T / |Z_j|
EM Algorithm

Viterbi training is an approximation: we would like to consider all possible paths
In this case rather than having a hard state-time alignment we estimate a probability
State occupation probability: the probability γ_t(s_j) of occupying state s_j at time t given the sequence of observations
Compare with component occupation probability in a GMM
We can use this for an iterative algorithm for HMM training: the EM algorithm
Each iteration has two steps:
  E-step  estimate the state occupation probabilities (Expectation)
  M-step  re-estimate the HMM parameters based on the estimated state occupation probabilities (Maximisation)

Backward probabilities

To estimate the state occupation probabilities it is useful to define (recursively) another set of probabilities, the Backward probabilities:

    β_t(s_j) = p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)

The probability of the future observations given that the HMM is in state s_j at time t
These can be recursively computed (going backwards in time)

Initialisation

    β_T(s_i) = a_iE

Recursion

    β_t(s_i) = Σ_{j=1}^{N} a_ij b_j(x_{t+1}) β_{t+1}(s_j)

Termination

    p(X | λ) = β_0(s_I) = Σ_{j=1}^{N} a_Ij b_j(x_1) β_1(s_j) = α_T(s_E)
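
A matching NumPy sketch of the backward recursion, under the same assumptions as the forward sketch (precomputed emission likelihoods B, transitions A, entry and exit probabilities):

```python
import numpy as np

def backward_probs(A, a_in, a_out, B):
    """Backward probabilities beta_t(s_j) for all t (illustrative sketch)."""
    T, N = B.shape
    beta = np.zeros((T, N))
    beta[T - 1] = a_out                          # beta_T(s_i) = a_iE
    for t in range(T - 2, -1, -1):
        # beta_t(s_i) = sum_j a_ij b_j(x_{t+1}) beta_{t+1}(s_j)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    # Termination: p(X | lambda) = sum_j a_Ij b_j(x_1) beta_1(s_j)
    likelihood = np.sum(a_in * B[0] * beta[0])
    return beta, likelihood

A = np.array([[0.7, 0.3], [0.0, 0.9]])
a_in, a_out = np.array([1.0, 0.0]), np.array([0.0, 0.1])
B = np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]])
beta, like = backward_probs(A, a_in, a_out, B)
print(like)    # matches the forward likelihood for the same toy model
```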

Backward Recursion

    β_t(s_j) = p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)

[Trellis figure: β_t(s_i) is accumulated from β_{t+1}(s_i), β_{t+1}(s_j), β_{t+1}(s_k) through the transitions a_ii, a_ij, a_ik and the emission likelihoods b_i(x_{t+1}), b_j(x_{t+1}), b_k(x_{t+1}).]

State Occupation Probability

The state occupation probability γ_t(s_j) is the probability of occupying state s_j at time t given the sequence of observations
Express in terms of the forward and backward probabilities:

    γ_t(s_j) = P(S(t) = s_j | X, λ) = (1 / α_T(s_E)) α_t(s_j) β_t(s_j)

recalling that p(X | λ) = α_T(s_E)
Since

    α_t(s_j) β_t(s_j) = p(x1, ..., xt, S(t) = s_j | λ) p(x_{t+1}, x_{t+2}, ..., x_T | S(t) = s_j, λ)
                      = p(x1, ..., xt, x_{t+1}, x_{t+2}, ..., x_T, S(t) = s_j | λ)
                      = p(X, S(t) = s_j | λ)

we have

    P(S(t) = s_j | X, λ) = p(X, S(t) = s_j | λ) / p(X | λ)
Re-estimation of Gaussian parameters

The sum of state occupation probabilities through time for a state may be regarded as a "soft" count
We can use this "soft" alignment to re-estimate the HMM parameters:

    µ̂_j = Σ_{t=1}^{T} γ_t(s_j) x_t / Σ_{t=1}^{T} γ_t(s_j)

    Σ̂_j = Σ_{t=1}^{T} γ_t(s_j) (x_t − µ̂_j)(x_t − µ̂_j)^T / Σ_{t=1}^{T} γ_t(s_j)

Re-estimation of transition probabilities

Similarly to the state occupation probability, we can estimate ξ_t(s_i, s_j), the probability of being in s_i at time t and s_j at t + 1, given the observations:

    ξ_t(s_i, s_j) = P(S(t) = s_i, S(t+1) = s_j | X, λ)
                  = P(S(t) = s_i, S(t+1) = s_j, X | λ) / p(X | λ)
                  = α_t(s_i) a_ij b_j(x_{t+1}) β_{t+1}(s_j) / α_T(s_E)

We can use this to re-estimate the transition probabilities:

    â_ij = Σ_{t=1}^{T} ξ_t(s_i, s_j) / Σ_{k=1}^{N} Σ_{t=1}^{T} ξ_t(s_i, s_k)
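
Continuing the earlier sketches (illustrative only): given the full arrays of forward probabilities alpha and backward probabilities beta, both of shape (T, N), plus the transitions A, emission likelihoods B, observations X and the total likelihood, the occupation probabilities and these re-estimates could be computed as follows; covariances would be re-estimated analogously.

```python
import numpy as np

def reestimate(alpha, beta, A, B, X, like):
    """One re-estimation step from precomputed alpha, beta (both (T, N)),
    transition matrix A (N, N), emission likelihoods B (T, N),
    observations X (T, d) and total likelihood like = p(X | lambda)."""
    # State occupation probabilities gamma_t(s_j) = alpha_t(s_j) beta_t(s_j) / p(X | lambda)
    gamma = alpha * beta / like
    # Pairwise probabilities xi_t(s_i, s_j) = alpha_t(s_i) a_ij b_j(x_{t+1}) beta_{t+1}(s_j) / p(X | lambda)
    xi = alpha[:-1, :, None] * A[None, :, :] * (B[1:] * beta[1:])[:, None, :] / like
    # Re-estimates: transitions and state means ("soft" counts in the denominators)
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    occ = gamma.sum(axis=0)
    mu_new = (gamma.T @ X) / occ[:, None]
    return gamma, xi, A_new, mu_new
```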


Pulling it all together

Iterative estimation of HMM parameters using the EM algorithm. At each iteration:
  E step  For all time-state pairs
    1 Recursively compute the forward probabilities α_t(s_j) and backward probabilities β_t(s_j)
    2 Compute the state occupation probabilities γ_t(s_j) and ξ_t(s_i, s_j)
  M step  Based on the estimated state occupation probabilities, re-estimate the HMM parameters: mean vectors µ_j, covariance matrices Σ_j and transition probabilities a_ij
The application of the EM algorithm to HMM training is sometimes called the Forward-Backward algorithm

Extension to a corpus of utterances

We usually train from a large corpus of R utterances
If x_t^r is the tth frame of the rth utterance X^r then we can compute the probabilities α_t^r(j), β_t^r(j), γ_t^r(s_j) and ξ_t^r(s_i, s_j) as before
The re-estimates are as before, except we must sum over the R utterances, eg:

    µ̂_j = Σ_{r=1}^{R} Σ_{t=1}^{T} γ_t^r(s_j) x_t^r / Σ_{r=1}^{R} Σ_{t=1}^{T} γ_t^r(s_j)

Extension to Gaussian mixture model (GMM)

The assumption of a Gaussian distribution at each state is very strong; in practice the acoustic feature vectors associated with a state may be strongly non-Gaussian
In this case an M-component Gaussian mixture model is an appropriate density function:

    b_j(x) = p(x | s_j) = Σ_{m=1}^{M} c_jm N(x; µ_jm, Σ_jm)

Given enough components, this family of functions can model any distribution.
Train using the EM algorithm, in which the component occupation probabilities are estimated in the E-step

EM training of HMM/GMM

Rather than estimating the state-time alignment, we estimate the component/state-time alignment, and component-state occupation probabilities γ_t(s_j, m): the probability of occupying mixture component m of state s_j at time t
We can thus re-estimate the mean of mixture component m of state s_j as follows:

    µ̂_jm = Σ_{t=1}^{T} γ_t(s_j, m) x_t / Σ_{t=1}^{T} γ_t(s_j, m)

And likewise for the covariance matrices (mixture models often use diagonal covariance matrices)
The mixture coefficients are re-estimated in a similar way to transition probabilities:

    ĉ_jm = Σ_{t=1}^{T} γ_t(s_j, m) / Σ_{ℓ=1}^{M} Σ_{t=1}^{T} γ_t(s_j, ℓ)


Doing the computation

The forward, backward and Viterbi recursions result in a long sequence of probabilities being multiplied
This can cause floating point underflow problems
In practice computations are performed in the log domain (in which multiplies become adds)
Working in the log domain also avoids needing to perform the exponentiation when computing Gaussians
(A short log-domain sketch follows the summary below.)

Summary: HMMs

HMMs provide a generative model for statistical speech recognition
Three key problems
  1 Computing the overall likelihood: the Forward algorithm
  2 Decoding the most likely state sequence: the Viterbi algorithm
  3 Estimating the most likely parameters: the EM (Forward-Backward) algorithm
Solutions to these problems are tractable due to the two key HMM assumptions
  1 Conditional independence of observations given the current state
  2 Markov assumption on the states
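
To make the log-domain point concrete, here is a small sketch (illustrative, not from the slides) of the forward recursion computed entirely with log probabilities, using the standard log-sum-exp operation so that the sums over states are also numerically stable:

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglikelihood(logA, log_a_in, log_a_out, logB):
    """Forward recursion entirely in the log domain (multiplies become adds)."""
    T, N = logB.shape
    log_alpha = log_a_in + logB[0]
    for t in range(1, T):
        # log alpha_t(s_j) = logsumexp_i [ log alpha_{t-1}(s_i) + log a_ij ] + log b_j(x_t)
        log_alpha = logsumexp(log_alpha[:, None] + logA, axis=0) + logB[t]
    return logsumexp(log_alpha + log_a_out)

# The same toy model as in the earlier sketches, now with log parameters
with np.errstate(divide="ignore"):          # log(0) = -inf is intended here
    logA = np.log(np.array([[0.7, 0.3], [0.0, 0.9]]))
    log_a_in = np.log(np.array([1.0, 0.0]))
    log_a_out = np.log(np.array([0.0, 0.1]))
    logB = np.log(np.array([[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]]))
print(np.exp(forward_loglikelihood(logA, log_a_in, log_a_out, logB)))
```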

References: HMMs

Gales and Young (2007). "The Application of Hidden Markov Models in Speech Recognition", Foundations and Trends in Signal Processing, 1 (3), 195–304: section 2.2.
Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.): sections 6.1–6.5; 9.2; 9.4. (Errata at http://www.cs.colorado.edu/~martin/SLP/Errata/SLP2-PIEV-Errata.html)
Rabiner and Juang (1989). "An introduction to hidden Markov models", IEEE ASSP Magazine, 3 (1), 4–16.
Renals and Hain (2010). "Speech Recognition", Computational Linguistics and Natural Language Processing Handbook, Clark, Fox and Lappin (eds.), Blackwells.

