Hidden Markov Models

This page is under construction.

Introduction
The following presentation is adapted from [Rabiner & Juang, 1986] and [Charniak, 1993].

Notational conventions
T = length of the sequence of observations (training set)
N = number of states (we either know or guess this number)
M = number of possible observations (from the training set)
Omega_X = {q_1,...,q_N} (finite set of possible states)
Omega_O = {v_1,...,v_M} (finite set of possible observations)
X_t = random variable denoting the state at time t (state variable)
O_t = random variable denoting the observation at time t (output variable)
sigma = o_1,...,o_T (sequence of actual observations)

Distributional parameters
A = {a_ij} s.t. a_ij = Pr(X_t+1 = q_j | X_t = q_i) (transition probabilities)
B = {b_i} s.t. b_i(k) = Pr(O_t = v_k | X_t = q_i) (observation probabilities)
pi = {pi_i} s.t. pi_i = Pr(X_0 = q_i) (initial state distribution)
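
For concreteness, these parameters can be stored as arrays. The following is a minimal sketch in Python with numpy; the two-state, three-symbol numbers are invented purely for illustration.

import numpy as np

# Toy parameters for illustration only: N = 2 states, M = 3 observation symbols.
N, M = 2, 3
A = np.array([[0.7, 0.3],            # a_ij = Pr(X_t+1 = q_j | X_t = q_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],       # b_i(k) = Pr(O_t = v_k | X_t = q_i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])            # pi_i = Pr(X_0 = q_i)

# Each row of A and B, and pi itself, must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)

# A sequence of actual observations, sigma = o_1,...,o_T, encoded as indices into Omega_O.
obs = [0, 2, 1, 0, 1, 2, 2]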

Definitions
A hidden Markov model (HMM) is a five-tuple (Omega_X, Omega_O, A, B, pi). Let lambda = {A, B, pi}
denote the parameters for a given HMM with fixed Omega_X and Omega_O.

Problems
1. Find Pr(sigma|lambda): the probability of the observations given the model.
2. Find the most likely state trajectory given the model and observations.
3. Adjust lambda = {A, B, pi} to maximize Pr(sigma|lambda).

Motivation
A discrete-time, discrete-space dynamical system governed by a Markov chain emits a sequence of observable
outputs: one output (observation) for each state in a trajectory of such states. From the observable sequence of
outputs, infer the most likely dynamical system. The result is a model for the underlying process. Alternatively,
given a sequence of outputs, infer the most likely sequence of states. We might also use the model to predict the
next observation or, more generally, a continuation of the sequence of observations.
Hidden Markov models are used in speech recognition. Suppose that we have a set W of words and a separate
training set for each word. Build an HMM for each word using the associated training set. Let lambda_w denote
the HMM parameters associated with the word w. When presented with a sequence of observations sigma,
choose the word with the most likely model, i.e.,
w* = arg max_{w in W} Pr(sigma|lambda_w)
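
A minimal sketch of this selection rule in Python; the helper name score is hypothetical and stands in for any routine that returns Pr(sigma|lambda_w), such as the forward computation described in the next section.

def recognize(obs, word_models, score):
    """Return the word w* whose model assigns the highest likelihood to obs.

    word_models: dict mapping each word w to its parameters lambda_w.
    score(obs, lambda_w): returns Pr(sigma | lambda_w), e.g. via the forward algorithm.
    """
    return max(word_models, key=lambda w: score(obs, word_models[w]))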

Forward-Backward Algorithm
Preliminaries
Define the alpha values as follows,
alpha_t(i) = Pr(O_1=o_1,...,O_t=o_t, X_t = q_i | lambda)

Note that
alpha_T(i) = Pr(O_1=o_1,...,O_T=o_T, X_T = q_i | lambda)
= Pr(sigma, X_T = q_i | lambda)

The alpha values enable us to solve Problem 1 since, marginalizing, we obtain
Pr(sigma|lambda) = sum_i=1^N Pr(o_1,...,o_T, X_T = q_i | lambda)
= sum_i=1^N alpha_T(i)

Define the beta values as follows,
beta_t(i) = Pr(O_t+1=o_t+1,...,O_T=o_T | X_t = q_i, lambda)

We will need the beta values later in the Baum-Welch algorithm.

Algorithmic Details
1. Compute the forward (alpha) values:
a. alpha_1(i) = pi_i b_i(o_1)
b. alpha_t+1(j) = [sum_i=1^N alpha_t(i) a_ij] b_j(o_t+1)

2. Compute the backward (beta) values (a code sketch of both passes follows):
a. beta_T(i) = 1
b. beta_t(i) = sum_j=1^N a_ij b_j(o_t+1) beta_t+1(j)
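
Both passes translate directly into code. A minimal Python/numpy sketch following the recurrences above, with 0-based time indices; the toy parameters from the earlier sketch can be used to exercise it.

import numpy as np

def forward(obs, A, B, pi):
    """Return alpha, where alpha[t, i] = Pr(o_1..o_{t+1}, X_{t+1} = q_i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                        # step 1a
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # step 1b: sum over predecessor states
    return alpha

def backward(obs, A, B):
    """Return beta, where beta[t, i] = Pr(o_{t+2}..o_T | X_{t+1} = q_i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                   # step 2a
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # step 2b
    return beta

# With the toy A, B, pi, obs from the earlier sketch, Problem 1 is solved by:
#   alpha = forward(obs, A, B, pi)
#   print(alpha[-1].sum())        # Pr(sigma | lambda) = sum_i alpha_T(i)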

Viterbi Algorithm
Intuition
Compute the most likely trajectory starting with the empty output sequence; use this result to compute the most
likely trajectory with an output sequence of length one; recurse until you have the most likely trajectory for the
entire sequence of outputs.

Algorithmic Details
1. Initialization:
For 1 <= i <= N,
a. delta_1(i) = pi_i b_i(o_1)
b. Phi_1(i) = 0

2. Recursion:
For 2 <= t <= T, 1 <= j <= N,
a. delta_t(j) = max_i [delta_t-1(i) a_ij] b_j(o_t)
b. Phi_t(j) = argmax_i [delta_t-1(i) a_ij]

3. Termination:
a. p* = max_i [delta_T(i)]
b. i*_T = argmax_i [delta_T(i)]

4. Reconstruction:
For t = T-1, T-2, ..., 1,
i*_t = Phi_t+1(i*_t+1)

The resulting trajectory, i*_1, ..., i*_T, solves Problem 2 (a code sketch follows).
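
A minimal Python/numpy sketch of the four steps above, again with 0-based time indices; the parameter arrays are as in the earlier sketches.

import numpy as np

def viterbi(obs, A, B, pi):
    """Return (most likely state trajectory, its probability p*) for Problem 2."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    phi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # 1. initialization
    for t in range(1, T):                             # 2. recursion
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) a_ij
        phi[t] = scores.argmax(axis=0)                # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[T - 1] = delta[T - 1].argmax()               # 3. termination
    p_star = delta[T - 1].max()
    for t in range(T - 2, -1, -1):                    # 4. reconstruction (backtrace)
        path[t] = phi[t + 1, path[t + 1]]
    return path, p_star

# Example: viterbi(obs, A, B, pi) with the toy parameters defined earlier.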

Baum-Welch Algorithm
Intuition
To solve Problem 3 we need a method of adjusting the lambda parameters to maximize the likelihood of the
training set.
Suppose that the outputs (observations) are in a 1-1 correspondence with the states so that N = M, varphi(q_i)
= v_i, and b_i(j) = 1 for j = i and 0 for j != i. Now the Markov process is not hidden at all and the HMM is just a
Markov chain. To estimate the lambda parameters for this Markov chain it is enough just to calculate the
appropriate frequencies from the observed sequence of outputs. These frequencies constitute sufficient statistics
for the underlying distributions.
In the more general case, we can't observe the states directly so we can't calculate the required frequencies. In
the hidden case, we use expectation maximization (EM) as described in [Dempster et al., 1977].
Instead of calculating the required frequencies directly from the observed outputs, we iteratively estimate the
parameters. We start by choosing arbitrary values for the parameters (just make sure that the values satisfy the
requirements for probability distributions).
We then compute the expected frequencies given the model and the observations. The expected frequencies are
obtained by weighting the observed transitions by the probabilities specified in the current model. The expected
frequencies so obtained are then substituted for the old parameters and we iterate until there is no improvement.
On each iteration we improve the probability of sigma being observed from the model until some limiting
probability is reached. This iterative procedure is guaranteed to converge on a local maximum of the cross
entropy (Kullback-Leibler) performance measure.
Preliminaries
The probability of a trajectory being in state q_i at time t and making the transition to q_j at t+1, given the
observation sequence and model:
xi_t(i,j) = Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda)

We compute these probabilities using the forward-backward variables.

             alpha_t(i) a_ij b_j(o_t+1) beta_t+1(j)
xi_t(i,j) = -----------------------------------------
                      Pr(sigma | lambda)

The probability of being in q_i at time t, given the observation sequence and model:
gamma_t(i) = Pr(X_t = q_i | sigma, lambda)

We obtain these values by marginalization:
gamma_t(i) = sum_j xi_t(i,j)

Note that
sum_t=1^T-1 gamma_t(i) = expected number of transitions from q_i

and

sum_t=1^T-1 xi_t(i,j) = expected number of transitions from q_i to q_j
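
Given the alpha and beta arrays from the forward-backward sketch above, the xi and gamma values can be computed as follows. This is only an illustrative sketch, not code from the cited sources.

import numpy as np

def xi_gamma(obs, A, B, alpha, beta):
    """Return xi and gamma, where xi[t, i, j] = Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda)
    and gamma[t, i] = Pr(X_t = q_i | sigma, lambda); xi has T-1 rows (one per transition)."""
    T, N = alpha.shape
    likelihood = alpha[-1].sum()                       # Pr(sigma | lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # alpha_t(i) a_ij b_j(o_t+1) beta_t+1(j) / Pr(sigma | lambda)
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    gamma = xi.sum(axis=2)                             # marginalize over the next state
    return xi, gamma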

Algorithmic Details
1. Choose the initial parameters, lambda = {A, B, pi}, arbitrarily.
2. Re-estimate the parameters.
a. bar{pi}_i = gamma_1(i)

                  sum_t=1^T-1 xi_t(i,j)
b. bar{a}_ij = --------------------------
                 sum_t=1^T-1 gamma_t(i)

                  sum_t=1^T gamma_t(j) 1_{o_t = v_k}
c. bar{b}_j(k) = -------------------------------------
                        sum_t=1^T gamma_t(j)

where 1_{o_t = v_k} = 1 if o_t = v_k and 0 otherwise.
3. Let bar{A} = {bar{a}_ij}, bar{B} = {bar{b}_j(k)}, and bar{pi} = {bar{pi}_i}.
4. Set bar{lambda} to be {bar{A}, bar{B}, bar{pi}}.
5. If lambda = bar{lambda} then quit, else set lambda to be bar{lambda} and return to Step 2. (One full re-estimation pass is sketched in code below.)
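
Putting the pieces together, one re-estimation pass and the outer EM loop might look like the following. This sketch builds on the forward, backward, and xi_gamma helpers above and is illustrative only; in practice the probabilities are usually scaled or kept in log space to avoid underflow (see [Rabiner, 1989]).

import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One EM re-estimation of lambda = {A, B, pi} from a single observation sequence."""
    alpha = forward(obs, A, B, pi)
    beta = backward(obs, A, B)
    xi, gamma = xi_gamma(obs, A, B, alpha, beta)
    likelihood = alpha[-1].sum()                       # Pr(sigma | lambda) under the old model
    gamma_all = alpha * beta / likelihood              # gamma_t(i) for every t = 1..T
    new_pi = gamma_all[0]                              # bar{pi}_i = gamma_1(i)
    new_A = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]    # expected i->j / expected from i
    obs = np.asarray(obs)
    new_B = np.stack([gamma_all[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
    new_B /= gamma_all.sum(axis=0)[:, None]            # expected emissions of v_k in q_j / time in q_j
    return new_A, new_B, new_pi, likelihood

def baum_welch(obs, A, B, pi, tol=1e-9, max_iter=200):
    """Iterate re-estimation until Pr(sigma | lambda) stops improving (a local maximum)."""
    prev = -np.inf
    for _ in range(max_iter):
        A, B, pi, likelihood = baum_welch_step(obs, A, B, pi)
        if likelihood - prev < tol:
            break
        prev = likelihood
    return A, B, pi

# Example: A, B, pi = baum_welch(obs, A, B, pi) with the toy parameters defined earlier.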

Bayesian Network Algorithms
The Bayesian network representation is shown in Figure 1.
   X_0     X_1     X_2     X_3             X_T-1    X_T
    o ----> o ----> o ----> o ---- ... ----> o ----> o
    |       |       |       |                |       |
    v       v       v       v                v       v
    o       o       o       o                o       o
   O_0     O_1     O_2     O_3             O_T-1    O_T

Fig. 1: Bayesian network representation for an HMM

In the description of the Baum-Welch algorithm provided above, the computation of the expected sufficient
statistics depends on computing the following term for all i and j in Omega_X.
xi_t(i,j) = Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda)

These computations in turn rely on computing the forward and backward variables (the alpha's and beta's).

             alpha_t(i) a_ij b_j(o_t+1) beta_t+1(j)
xi_t(i,j) = -----------------------------------------
                      Pr(sigma | lambda)

Generally, the forward and backward variables are computed using the forward-backward procedure, which
uses dynamic programming to compute the variables in time polynomial in |Omega_X|, |Omega_O|, and T. In the
following paragraphs, we show how the xi's can be computed using standard Bayesian network inference
algorithms in the same big-Oh complexity. One advantage of this approach is that it extends easily to the case in
which the hidden part of the model is factored into some number of state variables.
In the network shown in Figure 1 the O_t's are known. In particular, we have that O_t = o_t. If we assign the
O_t's accordingly, use the probabilities indicated by lambda, and apply a standard Bayesian network inference
algorithm, we obtain for every X_t the posterior distribution Pr(X_t | sigma, lambda). This isn't exactly what we
need since X_t and X_t+1 are clearly not independent. If they were independent, then we could obtain
Pr(X_t, X_t+1 | sigma, lambda) from the product of Pr(X_t | sigma, lambda) and Pr(X_t+1 | sigma, lambda). There
are a number of remedies, but one approach which is graphically intuitive involves adding a new state variable
(X_t, X_t+1) which is the obvious deterministic function of X_t and X_t+1. This addition results in the network
shown in Figure 2.
         X_0,X_1 X_1,X_2 X_2,X_3                 X_T-1,X_T
            o       o       o                        o
           /^      /^      /^                       /^
          / |     / |     / |                      / |
    o ----> o ----> o ----> o ---- ... ----> o ----> o
    |       |       |       |                |       |
    v       v       v       v                v       v
    o       o       o       o                o       o
   O_0     O_1     O_2     O_3             O_T-1    O_T

Fig. 2: Bayesian network with joint variables, (X_t,X_t+1)

If we update the network in Figure 2 with O_t = o_t then we obtain Pr((X_t,X_t+1) | sigma, lambda) directly. It
should be clear that we can eliminate the X_t's (except for X_0) to obtain the singly connected network shown in
Figure 3, which can be updated in time polynomial in |Omega_X|, |Omega_O|, and T.
   X_0   X_0,X_1 X_1,X_2 X_2,X_3                 X_T-1,X_T
    o ----> o ----> o ----> o ---- ... ----> o ----> o
    |       |       |       |                |       |
    v       v       v       v                v       v
    o       o       o       o                o       o
   O_0     O_1     O_2     O_3             O_T-1    O_T

Fig. 3: Bayesian network with X_t's eliminated

The extension to HMMs with factored state spaces (e.g., see Figure 4) is graphically straightforward. The
computational picture is more complicated and depends on the specifics of the update algorithm. It is important
to point out, however, that there is a wide range of update algorithms, both approximate and exact, to choose
from.
X_1  o ----> o ----> o ----> o ---- ... ----> o ----> o
      \       \       \                        \
       \       \       \                        \
        \       \       \                        \
X_2  o ----> o ----> o ----> o ---- ... ----> o ----> o
     |       |       |       |                |       |
     v       v       v       v                v       v
O    o       o       o       o                o       o

Fig. 4: Bayesian network representation for an HMM with factored state
space Omega_X = Omega_X_1 times Omega_X_2. The state variable is
two-dimensional, X_t = (X_{1,t}, X_{2,t}).

References
See [Rabiner & Juang, 1986] and [Rabiner, 1989] for a general introduction and applications in speech. See
[Charniak, 1993] for applications in natural language processing, including part-of-speech tagging. Charniak
[1993] provides many examples that offer useful insight. See [Rabiner, 1989] and [Fraser & Dimitriadis,
1994] for details regarding numerical issues that arise in implementing the above algorithms. Rabiner and Juang
[1986] also discuss variant algorithms for continuous observation spaces using multivariate Gaussian models.