
Speech Recognition

Lecture 9: Pronunciation Models.

Mehryar Mohri
Courant Institute of Mathematical Sciences
mohri@cims.nyu.edu
Speech Recognition Components
Acoustic and pronunciation model:
Pr(o | w) = Σ_{d,c,p} Pr(o | d) Pr(d | c) Pr(c | p) Pr(p | w).

• Pr(o | d): observation seq. ← distribution seq. (acoustic model)

• Pr(d | c): distribution seq. ← CD phone seq.

• Pr(c | p): CD phone seq. ← phoneme seq.

• Pr(p | w): phoneme seq. ← word seq.

Language model: Pr(w), distribution over word seq.
Mehryar Mohri - Speech Recognition page 2 Courant Institute, NYU
Terminology
Phonemes: abstract units representing sounds in
words or word sequences, e.g., /aa/, or /t/.
Phones: acoustic realizations of phonemes, e.g., [t].
Allophones: distinct realizations of the same
phoneme, typically due to a specific dialect,
phonemic context, or speaking rate. For example,
[dx] and [t] can both be realizations of /t/ in
American English. Example: [s aa dx el] or [s aa t el].



This Lecture
Decision trees
Context-dependent model



Motivation
Interpretation: explain complex data, result easy to
analyze and understand.
Adaptation: easy to update to account for new data.
Different types of variables: categorical, numerical.
Monotone transformation invariance: measuring
unit is not a concern.
Dealing with missing labels.
But: Beware of interpretation! Generalization?
Theoretical guarantees?
Common Decision Tree Induction Systems
Most commonly used tools for induction of
decision trees:

• CART (classification and regression tree)


(Breiman et al., 1984)

• C4.5 (Quinlan, 1986, 1993) and C5.0
  (RuleQuest Research), a commercial system.

The differences between the latest versions are minor.



Example
[Figure: a decision tree with splits X1 < a1, X1 < a2, X2 < a3, and X2 < a4, and the corresponding partition of the (X1, X2) plane into regions R1-R5.]



Different Types of Questions
Decision trees

• X ∈ {blue, white, red}: categorical questions

• X ≤ a: continuous variables

Binary space partition (BSP) trees:

• Σ_{i=1}^{n} α_i X_i ≤ a: partitioning with convex polyhedral regions

Sphere trees:

• ||X − a0|| ≤ a: partitioning with pieces of spheres


Decision Trees
Training data:

• Regression:
(x1 , y1 ), . . . , (xm , ym ) ∈ RN × R
• Classification (k classes):
(x1 , y1 ), . . . , (xm , ym ) ∈ RN × {1, . . . , k}
Result:

h(x) = Σ_{i=1}^{r} a_i 1_{x ∈ R_i}.



Optimization Criterion
Possible criterion: minimize the mean squared error

E = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))².

For each region R, find a such that:

a = argmin_a Σ_{i: x_i ∈ R} (y_i − a)².

Thus,

a = ( Σ_{i: x_i ∈ R} y_i ) / |{i : x_i ∈ R}| = ⟨y_i⟩_{x_i ∈ R}.



Example - Playing Golf
[Figure: a decision tree for "play golf", with splits humidity <= 40% (77/423), wind <= 12 mph (20/238), visibility > 10 mi (57/185), and barometer <= 18 in (48/113); leaves labeled play / do not play, with misclassification rates (19/236), (1/2), (9/72), (37/101), (1/12) indicated at each node.]



Learning Algorithm
Problem: the general problem of finding the partition
that minimizes the mean squared error is NP-hard.
Solution: greedy algorithm.

original region ← {x1, ..., xm}
for each region R such that Pred(R) holds do
    find the optimal splitting variable Xj and threshold t
    partition R into two regions according to (Xj ≤ t)

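The greedy procedure above can be sketched in Python. This is an illustrative sketch, not the lecture's reference implementation: `region_sse` and `best_split` are our names, and the exhaustive scan over observed thresholds is the simplest, not the fastest, way to search.

```python
def region_sse(ys):
    """Sum of squared errors of a region around its mean (the optimal constant a)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Find (squared error, feature j, threshold t) minimizing the two-region error.

    xs: list of feature vectors, ys: list of regression targets.
    Thresholds are taken at observed feature values; midpoints would also work.
    """
    n_features = len(xs[0])
    best = None  # (sse, j, t)
    for j in range(n_features):
        for t in sorted({x[j] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[j] <= t]
            right = [y for x, y in zip(xs, ys) if x[j] > t]
            if not left or not right:
                continue  # degenerate split: one side empty
            sse = region_sse(left) + region_sse(right)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best
```

Applying `best_split` recursively to each resulting region, while `Pred(R)` holds, yields the greedy tree-growing algorithm of the slide.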


Splitting Variable and Threshold
For a region R, define for j = 1, …, N, t > 0:

R1(j, t) = {x ∈ R : x_j ≤ t}  and  R2(j, t) = {x ∈ R : x_j > t}.

Optimal splitting variable X_j and threshold t:

argmin_{j,t} [ min_{a1} Σ_{x_i ∈ R1(j,t)} (y_i − a1)² + min_{a2} Σ_{x_i ∈ R2(j,t)} (y_i − a2)² ]

= argmin_{j,t} [ Σ_{x_i ∈ R1(j,t)} (y_i − ⟨y_i⟩_{x_i ∈ R1(j,t)})² + Σ_{x_i ∈ R2(j,t)} (y_i − ⟨y_i⟩_{x_i ∈ R2(j,t)})² ].



Stopping Predicates/Criteria
Problem: larger trees will overfit the training data.

Conservative splitting:

• split a tree node only if the mean squared error is reduced by at least α > 0.

• problem: a seemingly bad split may dominate useful splits.

Alternative: grow-then-prune strategy (CART):

• grow a very large tree (Pred(R): |R| > n0),

• prune the tree using a complexity criterion.
Pruning Criterion
Node impurity criterion:

C(T, j) = (1 / |{i : x_i ∈ R_j}|) Σ_{x_i ∈ R_j} (y_i − ⟨y_i⟩_{R_j})².

Complexity criterion:

C_α(T) = Σ_{j=1}^{r} |{i : x_i ∈ R_j}| C(T, j) + α|T|,

where α|T| is the regularization term.

Pruning: let T0 be the full decision tree obtained.
Find the subtree T minimizing the complexity criterion.

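Evaluating C_α(T) over a tree's leaves can be sketched as follows (a minimal sketch with names of our choosing; a real implementation would walk the tree structure rather than take the leaf regions as lists):

```python
def node_impurity(ys):
    """C(T, j): mean squared deviation of the targets in one leaf region."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def complexity(leaf_targets, alpha):
    """C_alpha(T) = sum_j |R_j| * C(T, j) + alpha * |T|.

    leaf_targets: one list of targets per leaf; |T| is the number of leaves.
    """
    cost = sum(len(ys) * node_impurity(ys) for ys in leaf_targets)
    return cost + alpha * len(leaf_targets)
```

Larger α penalizes larger trees, trading training error against tree size.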


Weakest Link Pruning
For each α, find tree T minimizing complexity
criterion:

• Collapse the internal node producing the smallest increase in squared error.

• Construct a sequence of subtrees down to the single-node tree.

• The best tree is proved to be among the elements of this sequence.
How to choose α? Cross-validation.
Cross-Validation
Estimate of prediction error:

• Partition the data S into k equal-sized subsets S1, …, Sk.

• Train the learning algorithm L on S − Si; let fi be the output.

• Test fi on Si and let error_{Si}(fi) be the error rate.

• Then, the k-fold cross-validation estimate of the error of L is (for k = |S|, the leave-one-out estimate):

E(L, k) = (1/k) Σ_{i=1}^{k} error_{Si}(fi).
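The estimate can be sketched as below. Here `train` and `error_rate` stand in for the learning algorithm L and the error measure; the initial shuffle and the interleaved fold construction are implementation choices of ours, not part of the slide.

```python
import random

def k_fold_error(train, error_rate, data, k, seed=0):
    """k-fold cross-validation estimate E(L, k) of the error of learner `train`.

    train(sample) -> model; error_rate(model, sample) -> float in [0, 1].
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal-sized subsets
    total = 0.0
    for i in range(k):
        held_out = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)          # fit f_i on S - S_i
        total += error_rate(model, held_out)  # test f_i on S_i
    return total / k
```

With `k = len(data)` this reduces to the leave-one-out estimate.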
Application - Model Selection
Assume that the output of the learning algorithm
depends on a parameter α.
Compute the cross-validation estimate:

E(L, k, α) = (1/k) Σ_{i=1}^{k} error_{Si}(fi^α).

Choose the parameter α that minimizes the cross-
validation estimate of the error.
Typical values: k = 5 or 10.



Classification
Each region Rj is labeled with its dominating class:
for j = 1, …, r and c ∈ {1, …, k},

p_{c,j} = (1 / |{x_i ∈ R_j}|) Σ_{x_i ∈ R_j} 1_{y_i = c},    c_j = argmax_{1 ≤ c ≤ k} p_{c,j}.

Three main measures of node impurity:

• misclassification:

C(T, j) = (1 / |{x_i ∈ R_j}|) Σ_{x_i ∈ R_j} 1_{y_i ≠ c_j} = 1 − p_{c_j, j}.



• Gini index:

C(T, j) = Σ_{c ≠ c'} p_{c,j} p_{c',j} = Σ_{c=1}^{k} p_{c,j} (1 − p_{c,j}).

Can be viewed as the average misclassification error. Used in CART.

• Entropy:

C(T, j) = − Σ_{c=1}^{k} p_{c,j} log p_{c,j}.

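The three impurity measures can be sketched directly on a list of class labels (function names are ours):

```python
from collections import Counter
from math import log

def class_proportions(labels):
    """Proportions p_{c,j} of each class in one leaf region."""
    counts = Counter(labels)
    m = len(labels)
    return [counts[c] / m for c in counts]

def misclassification(labels):
    """1 - p_{c_j, j}: error of predicting the dominating class."""
    return 1.0 - max(class_proportions(labels))

def gini(labels):
    """sum_c p_c (1 - p_c)."""
    return sum(p * (1 - p) for p in class_proportions(labels))

def entropy(labels):
    """- sum_c p_c log p_c (natural log)."""
    return -sum(p * log(p) for p in class_proportions(labels))
```

All three vanish on a pure node and are maximized when classes are balanced.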


Categorical Variables
Problem: for an unordered categorical variable with N
possible values, e.g., color ∈ {blue, white, red}, there
are 2^{N−1} − 1 possible binary partitions.
Solution (when there are only two possible outcomes):
sort the values according to the fraction of 1-outcomes
for each, e.g., white .9, red .45, blue .3. Split the
predictor as with ordered variables.

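The sorting trick can be sketched as follows (illustrative; assumes binary 0/1 outcomes, as the slide requires):

```python
from collections import defaultdict

def order_categories(xs, ys):
    """Order categorical values by their fraction of 1-outcomes, descending.

    With binary outcomes, splitting this ordering as if it were a numeric
    variable is optimal, avoiding the 2^(N-1) - 1 subset search.
    """
    pos, tot = defaultdict(int), defaultdict(int)
    for x, y in zip(xs, ys):
        tot[x] += 1
        pos[x] += (y == 1)
    return sorted(tot, key=lambda v: pos[v] / tot[v], reverse=True)
```

Candidate splits are then just the N − 1 prefixes of the returned ordering.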


Missing Values
Problem: points x with missing feature values, due to:

• the proper measurement not being taken,

• a source causing the absence of labels.

Solution:

• categorical case: create a new category missing;

• use surrogate variables: use only those variables
that are available for a split.



Instability
Problem: high variance.

• small changes in the data may lead to very different splits,

• the price to pay for the hierarchical nature of decision trees,

• more stable criteria could be used.



This Lecture
Decision trees
Context-dependent model



Context-Dependent Phones
(Lee, 1990; Young et al., 1994)
Idea:

• phoneme pronunciation depends on the environment (allophones, co-articulation).

• modeling phones in context → better accuracy.

Context-dependent rules:

• Context-dependent units: ae/b _ d → ae_{b,d}.

• Allophonic rules: t/V _ V → dx.

• Complex contexts: regular expressions.

CD Phones - Speech Recognition
Triphones: simplest and most widely used model.

• context: window of length three.


• example: cat, pause k ae t pause .
pause ae k t ae pause

• cross-word triphones: context spanning word


boundaries, important for accurate modeling.

• older systems: only word-internal triphones.


Extensions: quinphones (window of size five),
gender-dependent models.

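Expanding a phoneme sequence into triphone units can be sketched as follows. The `phone/left_right` notation is one common convention, used here for illustration; cross-word modeling simply means the sequence is not reset at word boundaries, so here the whole utterance is padded with a boundary symbol.

```python
def triphones(phonemes, boundary="pause"):
    """Expand a phoneme sequence into triphone units phone/left_right."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{padded[i]}/{padded[i - 1]}_{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]
```

For a quinphone model the same idea applies with two context phones on each side.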


Modeling Problems
Parameters: very large numbers for VLVR.

• Number of phones: about 50.

• Number of CD phones: possibly 50³ = 125,000, but not all of them occur (phonotactic constraints). In practice, about 60,000.

• Number of HMM parameters: with 16 mixture components, 60,000 × 3 × (39 × 16 × 2 + 16) ≈ 228M.

Data sparsity: some triphones, particularly cross-
word triphones, do not appear in the sample.
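The ≈ 228M figure above checks out: 3 states per triphone, 39-dimensional features, 16 mixture components per state, a mean and a variance per dimension per diagonal Gaussian, plus 16 mixture weights.

```python
n_cd_phones = 60_000   # triphones occurring in practice
states = 3             # HMM states per triphone
dim = 39               # feature dimensionality
mix = 16               # Gaussian mixture components per state

# per state: 16 diagonal Gaussians (mean + variance per dimension) + 16 weights
params_per_state = dim * mix * 2 + mix
total = n_cd_phones * states * params_per_state
print(total)  # 227520000, i.e. about 228M
```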
Solutions
Backing-off: use simpler models with shorter
contexts when a triphone is not available.
Interpolation: with simpler models such as
monophone or biphone models.
Parameter reduction: cluster parameters with
similar characteristics ('parameter tying'), yielding
better estimates for each state distribution.



Clustering Method
Initially, group together all triphones for the same
phoneme.
Split group according to decision tree questions
based on left or right phonetic context.
All triphones (HMM states) at the same leaf are
clustered (tied).
Advantage: even unseen triphones are assigned to
a cluster and thus a model.
Questions: which DT questions? Which criterion?
Questions
Simple discrete pre-defined binary questions.
Examples:

• is the phoneme to the left an /l/?


• is the phoneme to the right a nasal?
• is the previous phoneme an unvoiced stop?



Phonetics
[Figure: IPA charts of English vowels and of consonants, the latter grouped as plosives, affricates, fricatives, nasals, and approximants.]



Sound Features
Example: voiced sounds (the vocal cords vibrate):
nasals (e.g., /m/, /n/), approximants (e.g., /l/, /r/,
/w/, /j/), and vowels.



Clustering CD Phones
[Figure: three 3-state triphone HMMs for /ae/ in the contexts k_t, n_r, and p_n, each with state distributions d0, d1, d2. Decision tree questions such as "right = /r/?", "left = nasal?", "right = stop?", and "right = /l/?" route the corresponding states, e.g., the d2 distributions of ae/k_t, ae/n_r, and ae/p_n, to leaves; states reaching the same leaf are tied.]



Criterion
Criterion: the best question is the one that maximizes
the sample likelihood after splitting.
ML evaluation: requires single Gaussians with
diagonal covariance matrices trained on the sample.

[Figure: a question q splits the sample S, with log-likelihood L(S), into subsamples Sl and Sr, with log-likelihoods L(Sl) and L(Sr).]

Log-likelihood difference: ΔL(q) = L(Sl) + L(Sr) − L(S).

Best question: q* = argmax_q [ L(Sl) + L(Sr) ].
Log-Likelihood
Sample S = (x1, …, xm) ∈ (R^N)^m.

Diagonal covariance Gaussian:

Pr[x] = ( 1 / Π_{k=1}^{N} (2πσ_k²)^{1/2} ) exp( − Σ_{k=1}^{N} (x_k − μ_k)² / (2σ_k²) ).

Log-likelihood for the diagonal covariance Gaussian, with μ_k and σ_k² the maximum-likelihood estimates:

L(S) = −(1/2) Σ_{i=1}^{m} [ Σ_{k=1}^{N} log(2πσ_k²) + Σ_{k=1}^{N} (x_{ik} − μ_k)² / σ_k² ]

     = −(1/2) Σ_{k=1}^{N} [ m log(2πσ_k²) + m σ_k² / σ_k² ]

     = −(1/2) [ mN(1 + log(2π)) + m Σ_{k=1}^{N} log(σ_k²) ].
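As a sanity check, the closed form can be verified numerically against a direct sum of log-densities (an illustrative sketch with names of our choosing; both functions refit the ML mean and variance from the sample):

```python
from math import log, pi

def loglik_direct(sample):
    """Sum of log-densities of a diagonal-covariance Gaussian fit by ML."""
    m, n = len(sample), len(sample[0])
    mu = [sum(x[k] for x in sample) / m for k in range(n)]
    var = [sum((x[k] - mu[k]) ** 2 for x in sample) / m for k in range(n)]
    total = 0.0
    for x in sample:
        for k in range(n):
            total += -0.5 * (log(2 * pi * var[k]) + (x[k] - mu[k]) ** 2 / var[k])
    return total

def loglik_closed_form(sample):
    """L(S) = -1/2 (mN(1 + log 2pi) + m sum_k log sigma_k^2)."""
    m, n = len(sample), len(sample[0])
    mu = [sum(x[k] for x in sample) / m for k in range(n)]
    var = [sum((x[k] - mu[k]) ** 2 for x in sample) / m for k in range(n)]
    return -0.5 * (m * n * (1 + log(2 * pi)) + m * sum(log(v) for v in var))
```

The two agree because, at the ML estimates, Σ_i (x_{ik} − μ_k)² = m σ_k² for each dimension k.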
Decision Tree Split
Log-likelihood after splitting S into Sl and Sr (with m_l = |Sl|, m_r = |Sr|):

L(Sl) + L(Sr) = −(1/2) mN(1 + log(2π)) − (1/2) [ m_l Σ_{k=1}^{N} log(σ²_{lk}) + m_r Σ_{k=1}^{N} log(σ²_{rk}) ].

Best question:

q* = argmin_q [ m_l Σ_{k=1}^{N} log(σ²_{lk}) + m_r Σ_{k=1}^{N} log(σ²_{rk}) ],

with

σ²_{lk} = (1/m_l) Σ_{x ∈ Sl} x_k² − (1/m_l²) ( Σ_{x ∈ Sl} x_k )²,

σ²_{rk} = (1/m_r) Σ_{x ∈ Sr} x_k² − (1/m_r²) ( Σ_{x ∈ Sr} x_k )².

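Since the mN(1 + log 2π) term is constant across questions, selecting the best question reduces to minimizing m_l Σ_k log σ²_{lk} + m_r Σ_k log σ²_{rk}. An illustrative sketch (a real system would floor the variances to avoid log 0 on degenerate splits):

```python
from math import log

def split_score(side):
    """m * sum_k log(sigma_k^2) for one side of a split (lower is better)."""
    m, n = len(side), len(side[0])
    score = 0.0
    for k in range(n):
        s1 = sum(x[k] for x in side)
        s2 = sum(x[k] ** 2 for x in side)
        var = s2 / m - (s1 / m) ** 2  # (1/m) sum x_k^2 - (1/m^2)(sum x_k)^2
        score += log(var)
    return m * score

def best_question(sample, questions):
    """Pick the question (a predicate on x) minimizing the split criterion."""
    def criterion(q):
        left = [x for x in sample if q(x)]
        right = [x for x in sample if not q(x)]
        return split_score(left) + split_score(right)
    return min(questions, key=criterion)
```

The sufficient statistics Σ x_k and Σ x_k² per candidate cluster make each question cheap to score, which is what makes the greedy tree growth practical.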


Stopping Criteria
Grow-then-prune strategy with cross-validation
using a held-out data set.
Heuristics in VLVR:

• the question does not yield a significant increase in log-likelihood.

• insufficient data for further questions.

• computational limitations.



Full Training Process
Train CI phone HMMs with single Gaussians and
diagonal covariance.
Create triphone HMMs by replicating the CI phone
models and re-estimate the parameters.
Apply decision tree clustering to the set of
triphones representing the same phoneme.
Create a Gaussian mixture model for each cluster
using a mixture splitting technique.



CD Model Representation
(MM, Pereira, Riley, 2007)
Non-deterministic transducer representation
[Figure: a non-deterministic context-dependency transducer over the alphabet {x, y}. States are labeled by context pairs such as (ε,*), (y,x), (x,x), (x,y), (y,y), (x,ε), (y,ε); transitions carry labels of the form x:x/y_x, read "x realized in left context y and right context x".]


CD Model Representation
(MM, Pereira, Riley, 2007)
Deterministic transducer representation
[Figure: the equivalent deterministic transducer over {x, y}, with states labeled by context pairs such as (ε,ε), (ε,x), (ε,y), (x,x), (x,y), (y,x), (y,y), (x,ε), (y,ε), transitions such as y:x/x_y, and an end-of-sequence symbol $ emitting the final context-dependent phone, e.g., $:x/x_ε.]


References
• Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen. Classification and
Regression Trees. Chapman & Hall, 1984.
• Chen, F. Identification of contextual factors for pronunciation networks. In Proceedings of
ICASSP (1990), S14.9.

• Luc Devroye, Laszlo Gyorfi, Gabor Lugosi. A Probabilistic Theory of Pattern Recognition.
Springer, 1996.

• Ladefoged, P. A Course in Phonetics. New York: Harcourt, Brace, and Jovanovich, 1982.

• Kai-Fu Lee. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech
Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):
599-609, 1990.

• Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with
Weighted Finite-State Transducers. In Larry Rabiner and Fred Juang, editors, Handbook on
Speech Processing and Speech Communication, Part E: Speech recognition. volume to
appear. Springer-Verlag, Heidelberg, Germany, 2007.



References
• Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

• Quinlan, J. R. Induction of Decision Trees, in Machine Learning,Volume 1, pages 81-106, 1986.

• Randolph, M. "A data-driven method for discovering and predicting allophonic variation,"
Proc. ICASSP '90, S14.10, 1990.

• Michael Riley and Andrej Ljolje. Lexical access with a statistically-derived phonetic
network. In Proceedings of the European Conference on Speech Communication and
Technology, pages 585-588, 1991.

• M. Weintraub, H. Murveit, M. Cohen, P. Price, J. Bernstein, G. Baldwin, and D. Bell, "Linguistic
Constraints in Hidden Markov Model Based Speech Recognition," Proc. ICASSP '89, pp.
699-702, Glasgow, Scotland, May 1989.

• Steve Young, J. Odell, and Phil Woodland. Tree-Based State-Tying for High Accuracy
Acoustic Modelling. In Proceedings of ARPA Human Language Technology Workshop, Morgan
Kaufmann, San Francisco, 1994.

