
Speech Recognition

Lecture 9: Pronunciation Models.

Mehryar Mohri
Courant Institute of Mathematical Sciences
mohri@cims.nyu.edu
Speech Recognition Components
Acoustic and pronunciation model:
Pr(o | w) = Σ_{d,c,p} Pr(o | d) Pr(d | c) Pr(c | p) Pr(p | w).

• Pr(o | d): observation seq. ← distribution seq. (acoustic model)

• Pr(d | c): distribution seq. ← CD phone seq.

• Pr(c | p): CD phone seq. ← phoneme seq.

• Pr(p | w): phoneme seq. ← word seq.

Language model: Pr(w), distribution over word seq.
Mehryar Mohri - Speech Recognition page 2 Courant Institute, NYU
Terminology
Phonemes: abstract units representing sounds in
words or word sequences, e.g., /aa/, or /t/.
Phones: acoustic realizations of phonemes, e.g., [t].
Allophones: distinct realizations of the same
phoneme, typically due to a specific dialect,
phonemic context, or speaking rate. For example,
[dx] and [t] can both be realizations of /t/ in
American English. Example: [s aa dx el] or [s aa t el].



This Lecture
Decision trees
Context-dependent model



Motivation
Interpretation: explain complex data, result easy to
analyze and understand.
Adaptation: easy to update to account for new data.
Different types of variables: categorical, numerical.
Monotone transformation invariance: measuring
unit is not a concern.
Dealing with missing labels.
But: Beware of interpretation! Generalization?
Theoretical guarantees?
Common Decision Tree Induction Systems
Most commonly used tools for induction of
decision trees:

• CART (classification and regression tree)


(Breiman et al., 1984)

• C4.5 (Quinlan, 1986, 1993) and C5.0
  (RuleQuest Research), a commercial system.

The differences between the latest versions are minor.



Example
[Figure: a decision tree with splits X1 < a1, X1 < a2, X2 < a3, and X2 < a4, and the corresponding partition of the (X1, X2) plane into regions R1-R5.]



Different Types of Questions
Decision trees

• X ∈ {blue, white, red}: categorical questions

• X ≤ a: continuous variables

Binary space partition (BSP) trees:

• Σ_{i=1}^{n} α_i X_i ≤ a: partitioning with convex polyhedral regions

Sphere trees:

• ||X − a0|| ≤ a: partitioning with pieces of spheres


Decision Trees
Training data:

• Regression:
(x1 , y1 ), . . . , (xm , ym ) ∈ RN × R
• Classification (k classes):
(x1 , y1 ), . . . , (xm , ym ) ∈ RN × {1, . . . , k}
Result:

h(x) = Σ_{i=1}^{r} a_i 1_{x ∈ R_i}.



Optimization Criterion
Possible criterion: minimize the mean squared error

E = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))².

For each region R, find a such that:

a = argmin_a Σ_{i: x_i ∈ R} (y_i − a)².

Thus,

a = ( Σ_{i: x_i ∈ R} y_i ) / |{i : x_i ∈ R}| = ⟨y_i⟩_{x_i ∈ R}.



Example - Playing Golf
[Figure: a decision tree for "play golf", with splits humidity <= 40% (77/423), wind <= 12 mph (20/238), visibility > 10 mi (57/185), and barometer <= 18 in (48/113); leaves labeled play / do not play, with misclassification rates (19/236), (1/2), (9/72), (37/101), (1/12) indicated at each node.]



Learning Algorithm
Problem: the general problem of finding the partition
that minimizes the mean squared error is NP-hard.
Solution: greedy algorithm.

original region ← {x1, ..., xm}
for each region R such that Pred(R) holds do
    find the optimal splitting variable Xj and threshold t
    partition R into two regions according to (Xj ≤ t)

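The greedy procedure above can be sketched in Python. This is an illustrative sketch, not the lecture's reference implementation: `region_sse` and `best_split` are our names, and the exhaustive scan over observed thresholds is the simplest, not the fastest, way to search.

```python
def region_sse(ys):
    """Sum of squared errors of a region around its mean (the optimal constant a)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Find (squared error, feature j, threshold t) minimizing the two-region error.

    xs: list of feature vectors, ys: list of regression targets.
    Thresholds are taken at observed feature values; midpoints would also work.
    """
    n_features = len(xs[0])
    best = None  # (sse, j, t)
    for j in range(n_features):
        for t in sorted({x[j] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[j] <= t]
            right = [y for x, y in zip(xs, ys) if x[j] > t]
            if not left or not right:
                continue  # degenerate split: one side empty
            sse = region_sse(left) + region_sse(right)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best
```

Applying `best_split` recursively to each resulting region, while `Pred(R)` holds, yields the greedy tree-growing algorithm of the slide.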


Splitting Variable and Threshold
For a region R, define for j = 1, …, N, t > 0:

R1(j, t) = {x ∈ R : x_j ≤ t}  and  R2(j, t) = {x ∈ R : x_j > t}.

Optimal splitting variable X_j and threshold t:

argmin_{j,t} [ min_{a1} Σ_{x_i ∈ R1(j,t)} (y_i − a1)² + min_{a2} Σ_{x_i ∈ R2(j,t)} (y_i − a2)² ]

= argmin_{j,t} [ Σ_{x_i ∈ R1(j,t)} (y_i − ⟨y_i⟩_{x_i ∈ R1(j,t)})² + Σ_{x_i ∈ R2(j,t)} (y_i − ⟨y_i⟩_{x_i ∈ R2(j,t)})² ].



Stopping Predicates/Criteria
Problem: larger trees will overfit the training data.

Conservative splitting:

• split a tree node only if the mean squared error is reduced by at least α > 0.

• problem: a seemingly bad split may dominate useful splits.

Alternative: grow-then-prune strategy (CART):

• grow a very large tree (Pred(R): |R| > n0),

• prune the tree using a complexity criterion.
Pruning Criterion
Node impurity criterion:

C(T, j) = (1 / |{i : x_i ∈ R_j}|) Σ_{x_i ∈ R_j} (y_i − ⟨y_i⟩_{R_j})².

Complexity criterion:

C_α(T) = Σ_{j=1}^{r} |{i : x_i ∈ R_j}| C(T, j) + α|T|,

where α|T| is the regularization term.

Pruning: let T0 be the full decision tree obtained.
Find the subtree T minimizing the complexity criterion.

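Evaluating C_α(T) over a tree's leaves can be sketched as follows (a minimal sketch with names of our choosing; a real implementation would walk the tree structure rather than take the leaf regions as lists):

```python
def node_impurity(ys):
    """C(T, j): mean squared deviation of the targets in one leaf region."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def complexity(leaf_targets, alpha):
    """C_alpha(T) = sum_j |R_j| * C(T, j) + alpha * |T|.

    leaf_targets: one list of targets per leaf; |T| is the number of leaves.
    """
    cost = sum(len(ys) * node_impurity(ys) for ys in leaf_targets)
    return cost + alpha * len(leaf_targets)
```

Larger α penalizes larger trees, trading training error against tree size.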


Weakest Link Pruning
For each α, find tree T minimizing complexity
criterion:

• Collapse the internal node producing the smallest increase in squared error.

• Construct a sequence of subtrees down to the single-node tree.

• The best tree is proved to be among the elements of this sequence.
How to choose α? Cross-validation.
Cross-Validation
Estimate of prediction error:

• Partition the data S into k equal-sized subsets S1, …, Sk.

• Train the learning algorithm L on S − Si; let fi be the output.

• Test fi on Si and let error_{Si}(fi) be the error rate.

• Then, the k-fold cross-validation estimate of the error of L is (for k = |S|, the leave-one-out estimate):

E(L, k) = (1/k) Σ_{i=1}^{k} error_{Si}(fi).
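The estimate can be sketched as below. Here `train` and `error_rate` stand in for the learning algorithm L and the error measure; the initial shuffle and the interleaved fold construction are implementation choices of ours, not part of the slide.

```python
import random

def k_fold_error(train, error_rate, data, k, seed=0):
    """k-fold cross-validation estimate E(L, k) of the error of learner `train`.

    train(sample) -> model; error_rate(model, sample) -> float in [0, 1].
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal-sized subsets
    total = 0.0
    for i in range(k):
        held_out = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)          # fit f_i on S - S_i
        total += error_rate(model, held_out)  # test f_i on S_i
    return total / k
```

With `k = len(data)` this reduces to the leave-one-out estimate.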
Application - Model Selection
Assume that the output of the learning algorithm
depends on a parameter α.
Compute the cross-validation estimate:

E(L, k, α) = (1/k) Σ_{i=1}^{k} error_{Si}(fi^α).

Choose the parameter α that minimizes the cross-
validation estimate of the error.
Typical values: k = 5 or 10.



Classification
Each region Rj is labeled with its dominating class:
for j = 1, …, r and c ∈ {1, …, k},

p_{c,j} = (1 / |{x_i ∈ R_j}|) Σ_{x_i ∈ R_j} 1_{y_i = c},    c_j = argmax_{1 ≤ c ≤ k} p_{c,j}.

Three main measures of node impurity:

• misclassification:

C(T, j) = (1 / |{x_i ∈ R_j}|) Σ_{x_i ∈ R_j} 1_{y_i ≠ c_j} = 1 − p_{c_j, j}.



• Gini index:

C(T, j) = Σ_{c ≠ c'} p_{c,j} p_{c',j} = Σ_{c=1}^{k} p_{c,j} (1 − p_{c,j}).

Can be viewed as the average misclassification error. Used in CART.

• Entropy:

C(T, j) = − Σ_{c=1}^{k} p_{c,j} log p_{c,j}.

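The three impurity measures can be sketched directly on a list of class labels (function names are ours):

```python
from collections import Counter
from math import log

def class_proportions(labels):
    """Proportions p_{c,j} of each class in one leaf region."""
    counts = Counter(labels)
    m = len(labels)
    return [counts[c] / m for c in counts]

def misclassification(labels):
    """1 - p_{c_j, j}: error of predicting the dominating class."""
    return 1.0 - max(class_proportions(labels))

def gini(labels):
    """sum_c p_c (1 - p_c)."""
    return sum(p * (1 - p) for p in class_proportions(labels))

def entropy(labels):
    """- sum_c p_c log p_c (natural log)."""
    return -sum(p * log(p) for p in class_proportions(labels))
```

All three vanish on a pure node and are maximized when classes are balanced.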


Categorical Variables
Problem: for an unordered categorical variable with N
possible values, e.g., color ∈ {blue, white, red}, there
are 2^{N−1} − 1 possible binary partitions.
Solution (when there are only two possible outcomes):
sort the values according to the fraction of 1-outcomes
for each, e.g., white .9, red .45, blue .3. Split the
predictor as with ordered variables.

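The sorting trick can be sketched as follows (illustrative; assumes binary 0/1 outcomes, as the slide requires):

```python
from collections import defaultdict

def order_categories(xs, ys):
    """Order categorical values by their fraction of 1-outcomes, descending.

    With binary outcomes, splitting this ordering as if it were a numeric
    variable is optimal, avoiding the 2^(N-1) - 1 subset search.
    """
    pos, tot = defaultdict(int), defaultdict(int)
    for x, y in zip(xs, ys):
        tot[x] += 1
        pos[x] += (y == 1)
    return sorted(tot, key=lambda v: pos[v] / tot[v], reverse=True)
```

Candidate splits are then just the N − 1 prefixes of the returned ordering.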


Missing Values
Problem: points x with missing feature values, due to:

• the proper measurement not being taken,

• a source causing the absence of labels.

Solution:

• categorical case: create a new category missing;

• use surrogate variables: use only those variables
that are available for a split.



Instability
Problem: high variance.

• small changes in the data may lead to very different splits,

• the price to pay for the hierarchical nature of decision trees,

• more stable criteria could be used.



This Lecture
Decision trees
Context-dependent model



Context-Dependent Phones
(Lee, 1990; Young et al., 1994)
Idea:

• phoneme pronunciation depends on the environment (allophones, co-articulation).

• modeling phones in context → better accuracy.

Context-dependent rules:

• Context-dependent units: ae/b _ d → ae_{b,d}.

• Allophonic rules: t/V _ V → dx.

• Complex contexts: regular expressions.

CD Phones - Speech Recognition
Triphones: simplest and most widely used model.

• context: window of length three.


• example: cat, pause k ae t pause .
pause ae k t ae pause

• cross-word triphones: context spanning word


boundaries, important for accurate modeling.

• older systems: only word-internal triphones.


Extensions: quinphones (window of size five),
gender-dependent models.

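Expanding a phoneme sequence into triphone units can be sketched as follows. The `phone/left_right` notation is one common convention, used here for illustration; cross-word modeling simply means the sequence is not reset at word boundaries, so here the whole utterance is padded with a boundary symbol.

```python
def triphones(phonemes, boundary="pause"):
    """Expand a phoneme sequence into triphone units phone/left_right."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{padded[i]}/{padded[i - 1]}_{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]
```

For a quinphone model the same idea applies with two context phones on each side.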


Modeling Problems
Parameters: very large numbers for VLVR.

• Number of phones: about 50.

• Number of CD phones: possibly 50³ = 125,000, but not all of them occur (phonotactic constraints). In practice, about 60,000.

• Number of HMM parameters: with 16 mixture components, 60,000 × 3 × (39 × 16 × 2 + 16) ≈ 228M.

Data sparsity: some triphones, particularly cross-
word triphones, do not appear in the sample.
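The ≈ 228M figure above checks out: 3 states per triphone, 39-dimensional features, 16 mixture components per state, a mean and a variance per dimension per diagonal Gaussian, plus 16 mixture weights.

```python
n_cd_phones = 60_000   # triphones occurring in practice
states = 3             # HMM states per triphone
dim = 39               # feature dimensionality
mix = 16               # Gaussian mixture components per state

# per state: 16 diagonal Gaussians (mean + variance per dimension) + 16 weights
params_per_state = dim * mix * 2 + mix
total = n_cd_phones * states * params_per_state
print(total)  # 227520000, i.e. about 228M
```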
Solutions
Backing-off: use simpler models with shorter
contexts when a triphone is not available.
Interpolation: with simpler models such as
monophone or biphone models.
Parameter reduction: cluster parameters with
similar characteristics ('parameter tying'), yielding
better estimates for each state distribution.



Clustering Method
Initially, group together all triphones for the same
phoneme.
Split group according to decision tree questions
based on left or right phonetic context.
All triphones (HMM states) at the same leaf are
clustered (tied).
Advantage: even unseen triphones are assigned to
a cluster and thus a model.
Questions: which DT questions? Which criterion?
Questions
Simple discrete pre-defined binary questions.
Examples:

• is the phoneme to the left an /l/?


• is the phoneme to the right a nasal?
• is the previous phoneme an unvoiced stop?



Phonetics
[Figure: IPA charts of English vowels and of consonants, the latter grouped as plosives, affricates, fricatives, nasals, and approximants.]



Sound Features
Example: voiced sounds (the vocal cords vibrate):
nasals (e.g., /m/, /n/), approximants (e.g., /l/, /r/,
/w/, /j/), and vowels.



Clustering CD Phones
[Figure: three 3-state triphone HMMs for /ae/ in the contexts k_t, n_r, and p_n, each with state distributions d0, d1, d2. Decision tree questions such as "right = /r/?", "left = nasal?", "right = stop?", and "right = /l/?" route the corresponding states, e.g., the d2 distributions of ae/k_t, ae/n_r, and ae/p_n, to leaves; states reaching the same leaf are tied.]



Criterion
Criterion: the best question is the one that maximizes
the sample likelihood after splitting.
ML evaluation: requires single Gaussians with
diagonal covariance matrices trained on the sample.

[Figure: a question q splits the sample S, with log-likelihood L(S), into subsamples Sl and Sr, with log-likelihoods L(Sl) and L(Sr).]

Log-likelihood difference: ΔL(q) = L(Sl) + L(Sr) − L(S).

Best question: q* = argmax_q [ L(Sl) + L(Sr) ].
Log-Likelihood
Sample S = (x1, …, xm) ∈ (R^N)^m.

Diagonal covariance Gaussian:

Pr[x] = ( 1 / Π_{k=1}^{N} (2πσ_k²)^{1/2} ) exp( − Σ_{k=1}^{N} (x_k − μ_k)² / (2σ_k²) ).

Log-likelihood for the diagonal covariance Gaussian, with μ_k and σ_k² the maximum-likelihood estimates:

L(S) = −(1/2) Σ_{i=1}^{m} [ Σ_{k=1}^{N} log(2πσ_k²) + Σ_{k=1}^{N} (x_{ik} − μ_k)² / σ_k² ]

     = −(1/2) Σ_{k=1}^{N} [ m log(2πσ_k²) + m σ_k² / σ_k² ]

     = −(1/2) [ mN(1 + log(2π)) + m Σ_{k=1}^{N} log(σ_k²) ].
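As a sanity check, the closed form can be verified numerically against a direct sum of log-densities (an illustrative sketch with names of our choosing; both functions refit the ML mean and variance from the sample):

```python
from math import log, pi

def loglik_direct(sample):
    """Sum of log-densities of a diagonal-covariance Gaussian fit by ML."""
    m, n = len(sample), len(sample[0])
    mu = [sum(x[k] for x in sample) / m for k in range(n)]
    var = [sum((x[k] - mu[k]) ** 2 for x in sample) / m for k in range(n)]
    total = 0.0
    for x in sample:
        for k in range(n):
            total += -0.5 * (log(2 * pi * var[k]) + (x[k] - mu[k]) ** 2 / var[k])
    return total

def loglik_closed_form(sample):
    """L(S) = -1/2 (mN(1 + log 2pi) + m sum_k log sigma_k^2)."""
    m, n = len(sample), len(sample[0])
    mu = [sum(x[k] for x in sample) / m for k in range(n)]
    var = [sum((x[k] - mu[k]) ** 2 for x in sample) / m for k in range(n)]
    return -0.5 * (m * n * (1 + log(2 * pi)) + m * sum(log(v) for v in var))
```

The two agree because, at the ML estimates, Σ_i (x_{ik} − μ_k)² = m σ_k² for each dimension k.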
Decision Tree Split
Log-likelihood after splitting S into Sl and Sr (with m_l = |Sl|, m_r = |Sr|):

L(Sl) + L(Sr) = −(1/2) mN(1 + log(2π)) − (1/2) [ m_l Σ_{k=1}^{N} log(σ²_{lk}) + m_r Σ_{k=1}^{N} log(σ²_{rk}) ].

Best question:

q* = argmin_q [ m_l Σ_{k=1}^{N} log(σ²_{lk}) + m_r Σ_{k=1}^{N} log(σ²_{rk}) ],

with

σ²_{lk} = (1/m_l) Σ_{x ∈ Sl} x_k² − (1/m_l²) ( Σ_{x ∈ Sl} x_k )²,

σ²_{rk} = (1/m_r) Σ_{x ∈ Sr} x_k² − (1/m_r²) ( Σ_{x ∈ Sr} x_k )².

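Since the mN(1 + log 2π) term is constant across questions, selecting the best question reduces to minimizing m_l Σ_k log σ²_{lk} + m_r Σ_k log σ²_{rk}. An illustrative sketch (a real system would floor the variances to avoid log 0 on degenerate splits):

```python
from math import log

def split_score(side):
    """m * sum_k log(sigma_k^2) for one side of a split (lower is better)."""
    m, n = len(side), len(side[0])
    score = 0.0
    for k in range(n):
        s1 = sum(x[k] for x in side)
        s2 = sum(x[k] ** 2 for x in side)
        var = s2 / m - (s1 / m) ** 2  # (1/m) sum x_k^2 - (1/m^2)(sum x_k)^2
        score += log(var)
    return m * score

def best_question(sample, questions):
    """Pick the question (a predicate on x) minimizing the split criterion."""
    def criterion(q):
        left = [x for x in sample if q(x)]
        right = [x for x in sample if not q(x)]
        return split_score(left) + split_score(right)
    return min(questions, key=criterion)
```

The sufficient statistics Σ x_k and Σ x_k² per candidate cluster make each question cheap to score, which is what makes the greedy tree growth practical.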


Stopping Criteria
Grow-then-prune strategy with cross-validation
using a held-out data set.
Heuristics in VLVR:

• the question does not yield a significant increase in log-likelihood.

• insufficient data for further questions.

• computational limitations.



Full Training Process
Train CI phone HMMs with single Gaussians and
diagonal covariance.
Create triphone HMMs by replicating the CI phone
models and re-estimate the parameters.
Apply decision tree clustering to the set of
triphones representing the same phoneme.
Create a Gaussian mixture model for each cluster
using a mixture splitting technique.



CD Model Representation
(MM, Pereira, Riley, 2007)
Non-deterministic transducer representation
[Figure: a non-deterministic context-dependency transducer over the alphabet {x, y}. States are labeled by context pairs such as (ε,*), (y,x), (x,x), (x,y), (y,y), (x,ε), (y,ε); transitions carry labels of the form x:x/y_x, read "x realized in left context y and right context x".]


CD Model Representation
(MM, Pereira, Riley, 2007)
Deterministic transducer representation
[Figure: the equivalent deterministic transducer over {x, y}, with states labeled by context pairs such as (ε,ε), (ε,x), (ε,y), (x,x), (x,y), (y,x), (y,y), (x,ε), (y,ε), transitions such as y:x/x_y, and an end-of-sequence symbol $ emitting the final context-dependent phone, e.g., $:x/x_ε.]


References
• Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen. Classification and
Regression Trees. Chapman & Hall, 1984.
• Chen, F. Identification of contextual factors for pronunciation networks. In Proceedings of
ICASSP (1990), S14.9.

• Luc Devroye, Laszlo Gyorfi, Gabor Lugosi. A Probabilistic Theory of Pattern Recognition.
Springer, 1996.

• Ladefoged, P. A Course in Phonetics. New York: Harcourt, Brace, and Jovanovich, 1982.

• Kai-Fu Lee. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech
Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):
599-609, 1990.

• Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with
Weighted Finite-State Transducers. In Larry Rabiner and Fred Juang, editors, Handbook on
Speech Processing and Speech Communication, Part E: Speech recognition. volume to
appear. Springer-Verlag, Heidelberg, Germany, 2007.



References
• Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

• Quinlan, J. R. Induction of Decision Trees, in Machine Learning,Volume 1, pages 81-106, 1986.

• Randolph, M. "A data-driven method for discovering and predicting allophonic variation,"
Proc. ICASSP '90, S14.10, 1990.

• Michael Riley and Andrej Ljolje. Lexical access with a statistically-derived phonetic
network. In Proceedings of the European Conference on Speech Communication and
Technology, pages 585-588, 1991.

• M. Weintraub, H. Murveit, M. Cohen, P. Price, J. Bernstein, G. Baldwin, and D. Bell, "Linguistic
Constraints in Hidden Markov Model Based Speech Recognition," Proc. ICASSP '89, pp.
699-702, Glasgow, Scotland, May 1989.

• Steve Young, J. Odell, and Phil Woodland. Tree-Based State-Tying for High Accuracy
Acoustic Modelling. In Proceedings of ARPA Human Language Technology Workshop, Morgan
Kaufmann, San Francisco, 1994.

