Hierarchical modelling of speech

[Figure: generative model of the utterance "No right": Utterance → Words (NO, RIGHT) → Subwords (n oh r ai t) → Acoustics]

Acoustic Model: Continuous Density HMM

[Figure: probabilistic finite state automaton with states s_I, s_1, s_2, s_3, s_E; self-loop probabilities P(s_1 | s_1), P(s_2 | s_2), P(s_3 | s_3); forward transitions P(s_1 | s_I), P(s_2 | s_1), P(s_3 | s_2), P(s_E | s_3); each emitting state outputs an acoustic observation x]

Parameters λ:
  Transition probabilities: a_kj = P(s_j | s_k)
  Output probability density function: b_j(x) = p(x | s_j)
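To make these parameters concrete, here is a minimal numpy sketch of a left-to-right HMM of this shape; the particular probabilities, the two-dimensional features, and the single-Gaussian output densities are illustrative assumptions, not values from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# States: 0 = s_I (entry), 1-3 = emitting states, 4 = s_E (exit).
# Transition probabilities a_kj = P(s_j | s_k); each row sums to 1.
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # s_I -> s_1
    [0.0, 0.6, 0.4, 0.0, 0.0],   # s_1: self-loop or advance
    [0.0, 0.0, 0.7, 0.3, 0.0],   # s_2
    [0.0, 0.0, 0.0, 0.8, 0.2],   # s_3 -> s_E
    [0.0, 0.0, 0.0, 0.0, 1.0],   # s_E: absorbing
])

# Output pdfs b_j(x) = p(x | s_j): one Gaussian per emitting state.
output_pdf = {
    1: multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    2: multivariate_normal(mean=[1.0, 1.0], cov=np.eye(2)),
    3: multivariate_normal(mean=[2.0, 0.0], cov=np.eye(2)),
}

def b(j, x):
    """Output probability density of emitting state j for observation x."""
    return output_pdf[j].pdf(x)
```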
HMM Assumptions

[Figure: the same probabilistic finite state automaton (s_I, s_1, s_2, s_3, s_E) generating an observation sequence x_1 x_2 x_3 x_4 x_5 x_6]

1. Observation independence: an acoustic observation x is conditionally independent of all other observations given the state that generated it.
2. Markov process: a state is conditionally independent of all other states given the previous state.
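Together these two assumptions are what make the HMM tractable: they let the joint probability of an observation sequence and a state sequence factorise into per-frame terms. The factorisation below is implied by the assumptions rather than quoted from the slides:

  p(x_1, ..., x_T, s_1, ..., s_T | λ) = P(s_1 | s_I) b_{s_1}(x_1) [ ∏_{t=2..T} P(s_t | s_{t−1}) b_{s_t}(x_t) ] P(s_E | s_T)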
Background: pdf

The Gaussian distribution (univariate)
[Figure: plots of the univariate Gaussian pdf p(x | m, s) with mean 0; one panel covers x from −4 to 4, the other covers x from −8 to 8 with variance 4]
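For reference, the univariate Gaussian pdf plotted above has the standard form (mean µ, variance σ²):

  p(x | µ, σ²) = (1 / √(2π σ²)) exp(−(x − µ)² / (2σ²))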
Parameter estimation

Exercise
The multivariate Gaussian distribution

The pdf is parameterized by the mean vector µ and the covariance matrix Σ.
The 1-dimensional Gaussian is a special case of this pdf.
The argument to the exponential, ½ (x − µ)^T Σ^{−1} (x − µ), is referred to as a quadratic form.
Σ is a d × d symmetric matrix:

  Σ_ij = E[(x_i − µ_i)(x_j − µ_j)] = E[(x_j − µ_j)(x_i − µ_i)] = Σ_ji

The sign of the covariance helps to determine the relationship between two components:
  If x_j is large when x_i is large, then (x_j − µ_j)(x_i − µ_i) will tend to be positive.
  If x_j is small when x_i is large, then (x_j − µ_j)(x_i − µ_i) will tend to be negative.
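For reference, the d-dimensional Gaussian pdf being described has the standard form:

  p(x | µ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − µ)^T Σ^{−1} (x − µ))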
Spherical Gaussian

[Figure: surface and contour plots of p(x_1, x_2) for a spherical Gaussian]
  µ = [0, 0]^T,  Σ = [[1, 0], [0, 1]],  ρ_12 = 0

Diagonal Covariance Gaussian

[Figure: surface and contour plots of p(x_1, x_2) for a diagonal-covariance Gaussian]
  µ = [0, 0]^T,  Σ = [[1, 0], [0, 4]],  ρ_12 = 0
Full covariance Gaussian

[Figure: surface and contour plots of p(x_1, x_2) for a full-covariance Gaussian]
  µ = [0, 0]^T,  Σ = [[1, −1], [−1, 4]],  ρ_12 = −0.5

Maximum likelihood parameter estimation

Given data x^1, ..., x^N, the maximum likelihood estimates of the parameters are given by:

  µ̂ = (1/N) ∑_{n=1..N} x^n

  Σ̂ = (1/N) ∑_{n=1..N} (x^n − µ̂)(x^n − µ̂)^T

The mean of the distribution is estimated by the sample mean and the covariance by the sample covariance.
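A small numpy sketch of these two estimators; the synthetic data matrix here exists only to show the computation and is not the lecture's example data.

```python
import numpy as np

# Synthetic 2-D data, one observation per row (N x d).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.0], [0.5, 1.5]])

N = X.shape[0]
mu_hat = X.mean(axis=0)               # sample mean: (1/N) sum_n x^n
diff = X - mu_hat
Sigma_hat = (diff.T @ diff) / N       # sample covariance: (1/N) sum_n (x^n - mu)(x^n - mu)^T

print("mu_hat =", mu_hat)
print("Sigma_hat =\n", Sigma_hat)
```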
Example data

[Figure: scatter plot of 2-D example data, X1 roughly −4 to 10, X2 roughly −5 to 10]

Maximum likelihood fit to a Gaussian

[Figure: the same data with the maximum likelihood Gaussian fit superimposed]
k-means clustering

k-means alternates between two steps: assign each point to its nearest cluster centre, then recompute each centre as the mean of the points assigned to it (a minimal sketch of both steps follows the example below).

k-means example: data set

[Figure: 2-D example data set with labelled points including (4,13), (2,9), (7,8), (6,6), (7,6), (4,5), (10,5), (5,4), (8,4)]

k-means example: iteration 1 (recompute centres) and iteration 2 (assign points to clusters)

[Figure: one recomputed centre lies at (4.33, 10)]

k-means example: iteration 2 (recompute centres) and iteration 3 (assign points to clusters)

[Figure: another recomputed centre lies at (8.2, 4.2)]
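A minimal numpy sketch of the two alternating steps, run on the points read off the figures above; K = 2 and the initial centres are my own choices, not the initialisation used on the slides.

```python
import numpy as np

# Example points read off the figures above.
X = np.array([(4, 13), (2, 9), (7, 8), (6, 6), (7, 6),
              (4, 5), (10, 5), (5, 4), (8, 4)], dtype=float)

# Assumed starting centres (illustrative only).
centres = np.array([(4.0, 13.0), (5.0, 4.0)])

for _ in range(10):
    # Assignment step: each point goes to its nearest centre.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Update step: each centre becomes the mean of its assigned points.
    new_centres = np.array([X[assign == k].mean(axis=0) for k in range(len(centres))])
    if np.allclose(new_centres, centres):   # stop when the centres no longer move
        break
    centres = new_centres

print("centres:\n", centres)
print("assignments:", assign)
```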
Mixture model

A more flexible form of density estimation is made up of a linear combination of component densities:

  p(x) = ∑_{j=1..M} p(x | j) P(j)

This is called a mixture model or a mixture density.
  p(x | j): component densities
  P(j): mixing parameters
Generative model:
  1. Choose a mixture component based on P(j)
  2. Generate a data point x from the chosen component using p(x | j)

Component occupation probabilities

We can apply Bayes' theorem:

  P(j | x) = p(x | j) P(j) / p(x) = p(x | j) P(j) / ∑_{m=1..M} p(x | m) P(m)

The posterior probabilities P(j | x) give the probability that component j was responsible for generating data point x.
The P(j | x) are called the component occupation probabilities (or sometimes the responsibilities).
Since they are posterior probabilities:

  ∑_{j=1..M} P(j | x) = 1
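A short numpy/scipy sketch of the mixture density and the occupation probabilities for a two-component Gaussian mixture; the weights, means, and covariances here are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.4, 0.6])                       # mixing parameters P(j)
components = [                                       # component densities p(x | j)
    multivariate_normal(mean=[0.0, 0.0], cov=0.5 * np.eye(2)),
    multivariate_normal(mean=[2.0, 1.0], cov=np.eye(2)),
]

def mixture_density(x):
    """p(x) = sum_j p(x | j) P(j)."""
    return sum(P_j * comp.pdf(x) for P_j, comp in zip(weights, components))

def occupation_probabilities(x):
    """P(j | x) = p(x | j) P(j) / p(x)  (Bayes' theorem)."""
    joint = np.array([P_j * comp.pdf(x) for P_j, comp in zip(weights, components)])
    return joint / joint.sum()

x = np.array([1.0, 0.5])
print(mixture_density(x))
print(occupation_probabilities(x), occupation_probabilities(x).sum())  # the P(j|x) sum to 1
```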
GMM parameter estimation when we know which component generated the data

Define the indicator variable z_jn = 1 if component j generated data point x^n (and 0 otherwise).
If z_jn wasn't hidden then we could count the number of observed data points generated by j:

  N_j = ∑_{n=1..N} z_jn

Soft assignment

Estimate "soft counts" based on the component occupation probabilities P(j | x^n):

  N_j^* = ∑_{n=1..N} P(j | x^n)
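Continuing in code: given the occupation probabilities for every data point, the soft counts and the corresponding mean updates can be computed as below. The mean-update formula is the standard EM M-step for a GMM and is stated here as an assumption rather than taken from the extracted slides.

```python
import numpy as np

def soft_counts_and_means(X, resp):
    """X: (N, d) data; resp: (N, M) occupation probabilities P(j | x^n).

    Returns the soft counts N_j^* and the re-estimated component means
    mu_j = sum_n P(j | x^n) x^n / N_j^*.
    """
    N_star = resp.sum(axis=0)                  # N_j^* = sum_n P(j | x^n)
    means = (resp.T @ X) / N_star[:, None]     # responsibility-weighted means
    return N_star, means

# Toy usage with made-up responsibilities for 4 points and 2 components.
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 1.0], [2.1, 0.9]])
resp = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.05, 0.95]])
print(soft_counts_and_means(X, resp))
```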
[Figures: example data sets and fitted two-component GMMs; one of the fitted examples has µ_1 = µ_2 = [0, 0]^T, Σ_1 = 0.1 I, Σ_2 = 2 I]
Example 2: component Gaussians

[Figure: contour plots of the two component Gaussians of the fitted mixture]

Comments on GMMs

Individual components take responsibility for parts of the data set (probabilistically).
Soft assignment to components, not hard assignment ("soft clustering").
GMMs scale very well, e.g. large speech recognition systems can have 30,000 GMMs, each with 32 components: sometimes 1 million Gaussian components! And the parameters are all estimated from (a lot of) data by EM.
1. Likelihood: Forward Recursion

The forward probability is the joint probability of the observations so far and the current state:

  α_t(s_j) = p(x_1, ..., x_t, S(t) = s_j | λ)

Recursion (sum over every state that could have preceded s_j):

  α_t(s_j) = ∑_{i=1..N} α_{t−1}(s_i) a_ij b_j(x_t)

[Figure: trellis at times t−1, t, t+1; α_{t−1}(s_i), α_{t−1}(s_j), α_{t−1}(s_k) feed the state at time t through the transitions a_ii, a_ji, a_ki, scaled by the output density b(x_t)]

Termination:

  p(X | λ) = α_T(s_E) = ∑_{i=1..N} α_T(s_i) a_iE

Viterbi approximation

Instead of summing over all possible state sequences, just consider the most likely.
Achieve this by changing the summation to a maximisation in the recursion:

  V_t(s_j) = max_i V_{t−1}(s_i) a_ij b_j(x_t)

Changing the recursion in this way gives the likelihood of the most probable path.
We need to keep track of the states that make up this path by keeping a sequence of backpointers to enable a Viterbi backtrace: the backpointer for each state at each time indicates the previous state on the most probable path.
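A compact numpy sketch of the forward recursion and its termination. The interface (a transition matrix over the emitting states plus entry and exit probability vectors, and an output-density function b) is my own framing of the model above, not code from the lecture.

```python
import numpy as np

def forward_likelihood(obs, a_entry, A, a_exit, b):
    """p(X | lambda) computed with the forward algorithm.

    obs:     sequence of T observations x_1 .. x_T
    a_entry: (N,) entry probabilities P(s_j | s_I)
    A:       (N, N) transition probabilities a_ij between emitting states
    a_exit:  (N,) exit probabilities a_iE
    b:       function b(j, x) giving the output density of state j for x
    """
    T, N = len(obs), len(a_entry)
    alpha = np.zeros((T, N))
    # Initialisation: alpha_1(s_j) = P(s_j | s_I) b_j(x_1)
    alpha[0] = a_entry * np.array([b(j, obs[0]) for j in range(N)])
    # Recursion: alpha_t(s_j) = sum_i alpha_{t-1}(s_i) a_ij b_j(x_t)
    for t in range(1, T):
        for j in range(N):
            alpha[t, j] = (alpha[t - 1] @ A[:, j]) * b(j, obs[t])
    # Termination: p(X | lambda) = sum_i alpha_T(s_i) a_iE
    return alpha[-1] @ a_exit
```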
Likelihood of the most probable path

[Figure: trellis at times t−1, t, t+1; V_t(s_i) is obtained by maximising over V_{t−1}(s_i), V_{t−1}(s_j), V_{t−1}(s_k) through the transitions a_ii, a_ji, a_ki, scaled by the output density b(x_t)]

Backpointers to the previous state on the most probable path

[Figure: the same trellis with the backpointer bt_t(s_i) = s_j recorded, marking the previous state on the most probable path]
2. Decoding: The Viterbi algorithm

Initialization:
  V_0(s_I) = 1
  V_0(s_j) = 0 if s_j ≠ s_I
  bt_0(s_j) = 0

Recursion:
  V_t(s_j) = max_{i=1..N} V_{t−1}(s_i) a_ij b_j(x_t)
  bt_t(s_j) = arg max_{i=1..N} V_{t−1}(s_i) a_ij b_j(x_t)

Termination:
  P* = V_T(s_E) = max_{i=1..N} V_T(s_i) a_iE
  s_T* = bt_T(s_E) = arg max_{i=1..N} V_T(s_i) a_iE

Viterbi Backtrace

Backtrace to find the state sequence of the most probable path.

[Figure: trellis showing the backtrace following the stored backpointers, e.g. bt_t(s_i) = s_j and bt_{t+1}(s_k) = s_i]
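The same interface as the forward sketch above, now with the max/arg max recursion and the backtrace; again a minimal sketch under those interface assumptions rather than the lecture's own code.

```python
import numpy as np

def viterbi(obs, a_entry, A, a_exit, b):
    """Return (P*, most probable state path) for observation sequence obs."""
    T, N = len(obs), len(a_entry)
    V = np.zeros((T, N))
    bt = np.zeros((T, N), dtype=int)
    # Initialisation: start from the entry state.
    V[0] = a_entry * np.array([b(j, obs[0]) for j in range(N)])
    # Recursion: keep the best predecessor and a backpointer to it.
    for t in range(1, T):
        for j in range(N):
            scores = V[t - 1] * A[:, j] * b(j, obs[t])
            bt[t, j] = scores.argmax()
            V[t, j] = scores.max()
    # Termination: best final state feeding the exit state.
    final_scores = V[-1] * a_exit
    P_star = final_scores.max()
    # Backtrace: follow the backpointers from the best final state.
    path = [int(final_scores.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(bt[t, path[-1]]))
    return P_star, path[::-1]
```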
Extension to Gaussian mixture model (GMM)

The assumption of a Gaussian distribution at each state is very strong; in practice the acoustic feature vectors associated with a state may be strongly non-Gaussian.
In this case an M-component Gaussian mixture model is an appropriate density function:

  b_j(x) = p(x | s_j) = ∑_{m=1..M} c_jm N(x; µ_jm, Σ_jm)

Given enough components, this family of functions can model any distribution.
Train using the EM algorithm, in which the component occupation probabilities are estimated in the E-step.

EM training of HMM/GMM

Rather than estimating the state-time alignment, we estimate the component/state-time alignment, and component-state occupation probabilities γ_t(s_j, m): the probability of occupying mixture component m of state s_j at time t.
We can thus re-estimate the mean of mixture component m of state s_j as follows:

  µ̂_jm = ∑_{t=1..T} γ_t(s_j, m) x_t / ∑_{t=1..T} γ_t(s_j, m)

And likewise for the covariance matrices (mixture models often use diagonal covariance matrices).
The mixture coefficients are re-estimated in a similar way to transition probabilities:

  ĉ_jm = ∑_{t=1..T} γ_t(s_j, m) / ∑_{ℓ=1..M} ∑_{t=1..T} γ_t(s_j, ℓ)
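A numpy sketch of the GMM state output density and of the two re-estimation formulas above, given per-frame occupation probabilities γ_t(s_j, m). Computing the γ values themselves requires the full forward-backward (EM) pass, which is not shown here; the function names and array layout are my own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_output_density(x, c_j, means_j, covs_j):
    """b_j(x) = sum_m c_jm N(x; mu_jm, Sigma_jm) for one HMM state j."""
    return sum(c * multivariate_normal(mean=m, cov=S).pdf(x)
               for c, m, S in zip(c_j, means_j, covs_j))

def reestimate_state_gmm(X, gamma_j):
    """M-step updates for the GMM of one state.

    X:       (T, d) observation sequence
    gamma_j: (T, M) occupation probabilities gamma_t(s_j, m)
    Returns the updated means mu_jm and mixture coefficients c_jm.
    """
    occ = gamma_j.sum(axis=0)                 # sum_t gamma_t(s_j, m)
    means = (gamma_j.T @ X) / occ[:, None]    # occupation-weighted means
    weights = occ / occ.sum()                 # c_jm, normalised over components
    return means, weights
```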