Maneesh Sahani
maneesh@gatsby.ucl.ac.uk
The Mixture of Gaussians (MoG) model
Data: X = {x_1, ..., x_N}
Latent process: s_i ∼ iid Disc[π]
Component distributions: x_i | (s_i = m) ∼ P_m(x; θ_m) = N(μ_m, Σ_m)
Marginal distribution: P(x_i) = ∑_{m=1}^k π_m P_m(x_i; θ_m)
Log-likelihood:
log p(X | {μ_m}, {Σ_m}, π) = ∑_{i=1}^N log ∑_{m=1}^k π_m |2πΣ_m|^{−1/2} exp[ −½ (x_i − μ_m)^T Σ_m^{−1} (x_i − μ_m) ]
EM for MoGs
• Evaluate responsibilities:
  r_im = π_m P_m(x_i) / ∑_{m′} π_{m′} P_{m′}(x_i)
• Update parameters:
  μ_m ← ∑_i r_im x_i / ∑_i r_im
  Σ_m ← ∑_i r_im (x_i − μ_m)(x_i − μ_m)^T / ∑_i r_im
  π_m ← ∑_i r_im / N
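As a concrete illustration of these updates, here is a minimal NumPy sketch of a single EM iteration for a MoG; the function names (`log_gauss`, `em_step`) and interface are my own and not part of the notes:

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log-density log N(x; mu, Sigma) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, diff.T)                 # L^{-1}(x - mu)
    quad = np.sum(sol ** 2, axis=0)                  # (x-mu)^T Sigma^{-1} (x-mu)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))        # log |Sigma|
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def em_step(X, pi, mus, Sigmas):
    """One E + M step for a mixture of Gaussians.
    Returns updated parameters and the current log-likelihood."""
    N, d = X.shape
    k = len(pi)
    # E step: responsibilities r_im ∝ pi_m N(x_i; mu_m, Sigma_m)
    log_r = np.stack([np.log(pi[m]) + log_gauss(X, mus[m], Sigmas[m])
                      for m in range(k)], axis=1)            # N x k
    log_norm = np.logaddexp.reduce(log_r, axis=1)             # log sum over m
    r = np.exp(log_r - log_norm[:, None])
    # M step: responsibility-weighted means, covariances, mixing proportions
    Nm = r.sum(axis=0)                                        # sum_i r_im
    new_mus = (r.T @ X) / Nm[:, None]
    new_Sigmas = []
    for m in range(k):
        diff = X - new_mus[m]
        new_Sigmas.append((r[:, m, None] * diff).T @ diff / Nm[m])
    new_pi = Nm / N
    return new_pi, new_mus, np.array(new_Sigmas), log_norm.sum()
```

Iterating `em_step` from arbitrary starting values and tracking the returned log-likelihood gives the monotone increase discussed later in these notes.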
The Expectation Maximisation (EM) algorithm
The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps, an E (expectation) step and an M (maximisation) step, defined below. Its attractions include:
• Useful in models where learning would be easy if hidden variables were, in fact, observed
(e.g. MoGs).
• Decomposes difficult problems into series of tractable steps.
• No learning rate.
• Framework lends itself to principled approximations.
Jensen’s Inequality
[Figure: the concave curve log(x), with the chord from x_1 to x_2 lying below the curve at α x_1 + (1 − α) x_2.]
For α_i ≥ 0, ∑_i α_i = 1 and any {x_i > 0},
log( ∑_i α_i x_i ) ≥ ∑_i α_i log(x_i).
If the x_i are distinct, equality holds if and only if α_i = 1 for some i (and therefore all others are 0).
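A quick numerical sanity check of the inequality (purely illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.dirichlet(np.ones(5))          # alpha_i >= 0, sum alpha_i = 1
x = rng.uniform(0.1, 10.0, size=5)         # arbitrary positive x_i

lhs = np.log(np.sum(alpha * x))            # log(sum_i alpha_i x_i)
rhs = np.sum(alpha * np.log(x))            # sum_i alpha_i log(x_i)
assert lhs >= rhs                          # Jensen's inequality for the concave log
```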
The Free Energy for a Latent Variable Model
Observed data X = {xi}; Latent variables Y = {yi}; Parameters θ.
Any distribution q(Y) over the hidden variables can be used to obtain a lower bound on the log-likelihood using Jensen's inequality:
ℓ(θ) = log P(X|θ) = log ∫ P(Y, X|θ) dY = log ∫ q(Y) [ P(Y, X|θ) / q(Y) ] dY
     ≥ ∫ q(Y) log [ P(Y, X|θ) / q(Y) ] dY =: F(q, θ),
the free energy.
EM alternates between:
E step: optimise F(q, θ) wrt the distribution over hidden variables, holding the parameters fixed:
q^(k)(Y) := argmax_{q(Y)} F( q(Y), θ^(k−1) ).
M step: optimise F(q, θ) wrt the parameters, holding the hidden-variable distribution fixed:
θ^(k) := argmax_θ F( q^(k)(Y), θ ) = argmax_θ ∫ q^(k)(Y) log P(Y, X|θ) dY.
The second equality comes from the fact that the entropy of q(Y) does not depend directly on θ.
EM as Coordinate Ascent in F
The E Step
F(q, θ) = ∫ q(Y) log [ P(Y, X|θ) / q(Y) ] dY
        = ∫ q(Y) log [ P(Y|X, θ) P(X|θ) / q(Y) ] dY
        = ∫ q(Y) log P(X|θ) dY + ∫ q(Y) log [ P(Y|X, θ) / q(Y) ] dY
        = ℓ(θ) − KL[ q(Y) ‖ P(Y|X, θ) ]
This means that, for fixed θ, F is bounded above by ℓ, and achieves that bound when KL[ q(Y) ‖ P(Y|X, θ) ] = 0.
But KL[q‖p] is zero if and only if q = p. So, the E step simply sets
q^(k)(Y) = P(Y | X, θ^(k−1)),
and after an E step the free energy equals the likelihood: F(q^(k), θ^(k−1)) = ℓ(θ^(k−1)).
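The decomposition F = ℓ − KL, and the fact that the bound is tight at q = P(Y|X, θ), can be checked numerically on a tiny discrete toy model (all names below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint P(y, x) over a discrete latent y in {0,...,4}, evaluated at the observed x.
joint = rng.uniform(size=5)                   # P(y, x); sums over y to P(x)
ell = np.log(joint.sum())                     # l(theta) = log P(x)
posterior = joint / joint.sum()               # P(y | x)

def free_energy(q):
    """F(q) = sum_y q(y) log[ P(y, x) / q(y) ]."""
    return np.sum(q * (np.log(joint) - np.log(q)))

q_arbitrary = rng.dirichlet(np.ones(5))
assert free_energy(q_arbitrary) <= ell + 1e-12    # F lower-bounds the log-likelihood
assert np.isclose(free_energy(posterior), ell)    # equality at q = posterior
```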
To find the distribution q which minimizes KL[q‖p] we add a Lagrange multiplier to enforce the normalization constraint:
E := KL[q‖p] + λ( 1 − ∑_i q_i ) = ∑_i q_i log( q_i / p_i ) + λ( 1 − ∑_i q_i )
Setting the first derivative to zero,
∂E/∂q_i = log( q_i / p_i ) + 1 − λ = 0  ⇒  q_i = p_i e^{λ−1},
and enforcing ∑_i q_i = 1 then gives q_i = p_i. The second derivatives,
∂²E/∂q_i∂q_i = 1/q_i > 0,    ∂²E/∂q_i∂q_j = 0  (i ≠ j),
show that q_i = p_i is a genuine minimum.
A similar proof holds for KL[·‖·] between continuous densities, with the derivatives replaced by functional derivatives.
Coordinate Ascent in F (Demo)
s ∼ Bernoulli[π]
x | s = 0 ∼ N(−1, 1)        x | s = 1 ∼ N(1, 1)
[Figure: the resulting mixture density p(x), plotted for x ∈ [−2, 2]; the vertical axis runs from 0 to 0.4.]
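A minimal sketch of EM for this one-dimensional demo model. I assume here, since the original figure is not reproduced, that the mixing proportion and the two component means are the quantities being fit, with the variances held at 1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample data from the true model: s ~ Bernoulli(0.5), x | s ~ N(+/-1, 1).
N = 500
s_true = rng.random(N) < 0.5
x = np.where(s_true, 1.0, -1.0) + rng.standard_normal(N)

def log_normal(x, mu):                         # unit-variance Gaussian log-density
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

# Start from arbitrary parameter values and iterate E and M steps.
pi, mu0, mu1 = 0.2, -0.3, 0.3
for _ in range(50):
    # E step: responsibilities r_i = q(s_i = 1 | x_i)
    log_p1 = np.log(pi) + log_normal(x, mu1)
    log_p0 = np.log(1 - pi) + log_normal(x, mu0)
    r = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
    # M step: re-estimate pi and the two means from the responsibilities
    pi = r.mean()
    mu1 = np.sum(r * x) / np.sum(r)
    mu0 = np.sum((1 - r) * x) / np.sum(1 - r)

print(pi, mu0, mu1)   # should approach roughly 0.5, -1 and +1
```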
EM Never Decreases the Likelihood
The E and M steps together never decrease the log likelihood:
ℓ(θ^(k−1)) = F(q^(k), θ^(k−1)) ≤ F(q^(k), θ^(k)) ≤ ℓ(θ^(k)).
The equality holds because the E step sets q^(k) to the exact posterior (so F = ℓ); the first inequality holds because the M step does not decrease F; and the second inequality is the bound F ≤ ℓ.
If the M step is executed so that θ^(k) ≠ θ^(k−1) if and only if F increases, then the overall EM iteration will step to a new value of θ if and only if the likelihood increases.
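This monotonicity is easy to verify numerically; the following toy check (the same one-dimensional two-Gaussian model as above, with invented names) tracks ℓ across iterations:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-1, 1, 150), rng.normal(1, 1, 150)])

def loglik(pi, mu0, mu1):
    comp1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
    comp0 = (1 - pi) * np.exp(-0.5 * (x - mu0) ** 2)
    return np.sum(np.log((comp0 + comp1) / np.sqrt(2 * np.pi)))

pi, mu0, mu1 = 0.3, -0.2, 0.6
lls = []
for _ in range(100):
    lls.append(loglik(pi, mu0, mu1))
    # E step: responsibility of component 1 for each point
    r = 1.0 / (1.0 + np.exp(np.log(1 - pi) - np.log(pi)
                            + 0.5 * (x - mu1) ** 2 - 0.5 * (x - mu0) ** 2))
    # M step
    pi = r.mean()
    mu1 = np.sum(r * x) / np.sum(r)
    mu0 = np.sum((1 - r) * x) / np.sum(1 - r)

assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))   # ll never decreases
```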
Fixed Points of EM are Stationary Points in ℓ
Let a fixed point of EM occur with parameter θ*. Then
∂/∂θ ⟨ log P(Y, X|θ) ⟩_{P(Y|X,θ*)} |_{θ*} = 0.
Since log P(X|θ) = log P(Y, X|θ) − log P(Y|X, θ), we can write ℓ(θ) = ⟨ log P(Y, X|θ) ⟩_{P(Y|X,θ*)} − ⟨ log P(Y|X, θ) ⟩_{P(Y|X,θ*)}, so
d/dθ ℓ(θ) = d/dθ ⟨ log P(Y, X|θ) ⟩_{P(Y|X,θ*)} − d/dθ ⟨ log P(Y|X, θ) ⟩_{P(Y|X,θ*)}.
The second term is 0 at θ* if the derivative exists (it is the minimum of KL[·‖·]), and thus
d/dθ ℓ(θ) |_{θ*} = d/dθ ⟨ log P(Y, X|θ) ⟩_{P(Y|X,θ*)} |_{θ*} = 0.
Let θ* now be the parameter value at a local maximum of F (and thus at a fixed point). Then
d²/dθ² ℓ(θ) = d²/dθ² ⟨ log P(Y, X|θ) ⟩_{P(Y|X,θ*)} − d²/dθ² ⟨ log P(Y|X, θ) ⟩_{P(Y|X,θ*)}.
The first term on the right is negative (a maximum) and the second term is positive (a minimum). Thus the curvature of the likelihood at θ* is negative and
θ* is a maximum of ℓ.
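As an illustrative check of this stationarity result, one can run EM on the one-dimensional demo model until it stops moving and then confirm by finite differences that the gradient of ℓ vanishes at the fixed point (a sketch under those assumptions, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-1, 1, 200), rng.normal(1, 1, 200)])

def loglik(mu0, mu1, pi=0.5):
    p = (pi * np.exp(-0.5 * (x - mu1) ** 2) +
         (1 - pi) * np.exp(-0.5 * (x - mu0) ** 2)) / np.sqrt(2 * np.pi)
    return np.sum(np.log(p))

# Run EM on the two means only (pi fixed at 0.5) until it has converged.
mu0, mu1 = -0.3, 0.4
for _ in range(2000):
    r = 1.0 / (1.0 + np.exp(0.5 * (x - mu1) ** 2 - 0.5 * (x - mu0) ** 2))
    mu1 = np.sum(r * x) / np.sum(r)
    mu0 = np.sum((1 - r) * x) / np.sum(1 - r)

# Finite-difference gradient of the log-likelihood at the fixed point.
eps = 1e-5
g1 = (loglik(mu0, mu1 + eps) - loglik(mu0, mu1 - eps)) / (2 * eps)
g0 = (loglik(mu0 + eps, mu1) - loglik(mu0 - eps, mu1)) / (2 * eps)
print(g0, g1)   # both should be approximately 0 at the EM fixed point
```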
Partial M steps: The proof holds even if we just increase F wrt θ rather than maximize.
(Dempster, Laird and Rubin (1977) call this the generalized EM, or GEM, algorithm).
For example, sparse or online versions of the EM algorithm would compute the posterior
for a subset of the data points or as the data arrives, respectively. You can also update the
posterior over a subset of the hidden variables, while holding others fixed...
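As one concrete illustration of such a partial E step, here is a sketch of incremental EM in the spirit of Neal and Hinton (1998), in which only one data point's responsibility is refreshed per iteration while running sums of sufficient statistics are patched in place; the model and all names are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1, 1, 250), rng.normal(1, 1, 250)])
N = len(x)

def resp(xi, pi, mu0, mu1):
    """Posterior q(s_i = 1 | x_i) under unit-variance components."""
    a = np.log(pi) - 0.5 * (xi - mu1) ** 2
    b = np.log(1 - pi) - 0.5 * (xi - mu0) ** 2
    return 1.0 / (1.0 + np.exp(b - a))

# Initialise parameters, per-point responsibilities, and running sums.
pi, mu0, mu1 = 0.5, -0.2, 0.2
r = np.full(N, 0.5)
S_r, S_rx, S_x = r.sum(), np.dot(r, x), x.sum()

for t in range(20 * N):
    i = t % N                                   # visit one data point at a time
    # Partial E step: refresh the responsibility of point i only.
    S_r, S_rx = S_r - r[i], S_rx - r[i] * x[i]
    r[i] = resp(x[i], pi, mu0, mu1)
    S_r, S_rx = S_r + r[i], S_rx + r[i] * x[i]
    # M step using the running sufficient statistics.
    pi = S_r / N
    mu1 = S_rx / S_r
    mu0 = (S_x - S_rx) / (N - S_r)

print(pi, mu0, mu1)   # should approach roughly 0.5, -1 and +1
```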
The Gaussian mixture model (E and M steps)
For the MoG, the E step computes the responsibilities r_im given above, and the M step maximises F with respect to {μ_m}, {Σ_m} and π. The constraint ∑_m π_m = 1 is handled by adding a term λ( 1 − ∑_m π_m ) before differentiating, where λ is a Lagrange multiplier ensuring that the mixing proportions sum to unity.
Factor Analysis
[Graphical model: latent factors y_1, ..., y_K, each connected to the observed variables x_1, x_2, ..., x_D.]
Linear generative model: x_d = ∑_{k=1}^K Λ_dk y_k + ε_d
• the y_k are independent N(0, 1) Gaussian factors
• the ε_d are independent N(0, Ψ_dd) Gaussian noise terms
• K < D
So x is Gaussian, with
p(x|θ) = ∫ p(y|θ) p(x|y, θ) dy = N(0, ΛΛ^T + Ψ),
where Λ is a D × K matrix and Ψ is diagonal.
Model parameters: θ = {Λ, Ψ}.
E step: For each data point x_n, compute the posterior distribution of the hidden factors given the observed data: q_n(y) = p(y|x_n, θ_t) = p(y, x_n|θ_t) / p(x_n|θ_t). This posterior is Gaussian, q_n(y) = N(μ_n, Σ), with
Σ = ( I + Λ^T Ψ^{−1} Λ )^{−1}    and    μ_n = Σ Λ^T Ψ^{−1} x_n,
the covariance Σ being the same for every data point.
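A small NumPy sketch of this E step (the function name and interface are illustrative, not from the notes):

```python
import numpy as np

def fa_e_step(X, Lambda, Psi_diag):
    """Posterior over factors for each row of X under the FA model.
    Returns the posterior means mu_n (one row per data point) and the
    shared posterior covariance Sigma = (I + Lambda^T Psi^{-1} Lambda)^{-1}."""
    K = Lambda.shape[1]
    Psi_inv_Lambda = Lambda / Psi_diag[:, None]            # Psi^{-1} Lambda (Psi diagonal)
    Sigma = np.linalg.inv(np.eye(K) + Lambda.T @ Psi_inv_Lambda)
    mu = X @ Psi_inv_Lambda @ Sigma                        # mu_n = Sigma Lambda^T Psi^{-1} x_n
    return mu, Sigma
```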
M step: Find θ_{t+1} maximising F = ∑_n ∫ q_n(y) [ log p(y|θ) + log p(x_n|y, θ) ] dy + c, where
log p(y|θ) + log p(x_n|y, θ) = c − ½ y^T y − ½ log|Ψ| − ½ (x_n − Λy)^T Ψ^{−1} (x_n − Λy)
   = c′ − ½ log|Ψ| − ½ [ x_n^T Ψ^{−1} x_n − 2 x_n^T Ψ^{−1} Λ y + y^T Λ^T Ψ^{−1} Λ y ]
   = c′ − ½ log|Ψ| − ½ [ x_n^T Ψ^{−1} x_n − 2 x_n^T Ψ^{−1} Λ y + Tr( Λ^T Ψ^{−1} Λ y y^T ) ]
Taking expectations over q_n(y):
   = c′ − ½ log|Ψ| − ½ [ x_n^T Ψ^{−1} x_n − 2 x_n^T Ψ^{−1} Λ μ_n + Tr( Λ^T Ψ^{−1} Λ (μ_n μ_n^T + Σ) ) ]
Note that we don't need to know everything about q_n, just the expectations of y and y y^T under q_n (i.e. the expected sufficient statistics).
The M step for Factor Analysis (cont.)
F = c′ − (N/2) log|Ψ| − ½ ∑_n [ x_n^T Ψ^{−1} x_n − 2 x_n^T Ψ^{−1} Λ μ_n + Tr( Λ^T Ψ^{−1} Λ (μ_n μ_n^T + Σ) ) ]
Taking derivatives w.r.t. Λ and Ψ^{−1}, using ∂Tr[AB]/∂B = A^T and ∂log|A|/∂A = A^{−T}:
∂F/∂Λ = Ψ^{−1} ∑_n x_n μ_n^T − Ψ^{−1} Λ ( NΣ + ∑_n μ_n μ_n^T ) = 0
⇒  Λ̂ = ( ∑_n x_n μ_n^T ) ( NΣ + ∑_n μ_n μ_n^T )^{−1}
∂F/∂Ψ^{−1} = (N/2) Ψ − ½ ∑_n [ x_n x_n^T − Λ μ_n x_n^T − x_n μ_n^T Λ^T + Λ (μ_n μ_n^T + Σ) Λ^T ] = 0
⇒  Ψ̂ = (1/N) ∑_n [ x_n x_n^T − Λ μ_n x_n^T − x_n μ_n^T Λ^T + Λ (μ_n μ_n^T + Σ) Λ^T ]
       = Λ Σ Λ^T + (1/N) ∑_n ( x_n − Λ μ_n )( x_n − Λ μ_n )^T    (squared residuals)
Note: we should actually only take derivatives w.r.t. Ψ_dd since Ψ is diagonal, i.e. keep only the diagonal of the expression above.
When Σ → 0 these become the equations for linear regression!
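Continuing the sketch above, a NumPy version of these M-step updates (again, names and interface are illustrative). A full EM iteration alternates fa_e_step and fa_m_step, and the log-likelihood can be tracked to confirm it never decreases:

```python
import numpy as np

def fa_m_step(X, mu, Sigma):
    """FA M step from posterior means mu (N x K) and shared covariance Sigma."""
    N, D = X.shape
    Eyy = N * Sigma + mu.T @ mu                 # sum_n <y y^T> = N Sigma + sum_n mu_n mu_n^T
    Lambda = (X.T @ mu) @ np.linalg.inv(Eyy)    # Lambda_hat = (sum_n x_n mu_n^T) Eyy^{-1}
    resid = X - mu @ Lambda.T                   # x_n - Lambda mu_n
    Psi_full = Lambda @ Sigma @ Lambda.T + (resid.T @ resid) / N
    Psi_diag = np.diag(Psi_full)                # Psi is constrained to be diagonal
    return Lambda, Psi_diag
```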
Mixtures of Factor Analysers
p(x|θ) = ∑_{k=1}^K π_k N( μ_k, Λ_k Λ_k^T + Ψ ),
where π_k is the mixing proportion for FA k, μ_k is its centre, Λ_k is its “factor loading matrix”, and Ψ is a common sensor noise model. θ = {{π_k, μ_k, Λ_k}_{k=1...K}, Ψ}.
We can think of this model as having two sets of hidden latent variables:
• a discrete indicator variable s_n ∈ {1, ..., K}
• for each factor analyser, a continuous factor vector y_{n,k} ∈ R^{D_k}
p(x_n|θ) = ∑_{s_n=1}^K p(s_n|θ) ∫ p(y|s_n, θ) p(x_n|y, s_n, θ) dy
So, in the E step all we need to compute are the expected sufficient statistics under the joint posterior q_n(s_n, y).
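A sketch of the discrete part of that E step, computing p(s_n = k | x_n) ∝ π_k N(x_n; μ_k, Λ_k Λ_k^T + Ψ); all names are hypothetical and, for simplicity, every component is assumed to share the same factor dimension:

```python
import numpy as np

def mfa_responsibilities(X, pis, mus, Lambdas, Psi_diag):
    """p(s_n = k | x_n) for each data point, from the per-component marginals."""
    N, D = X.shape
    log_post = np.zeros((N, len(pis)))
    for k, (pi_k, mu_k, Lam_k) in enumerate(zip(pis, mus, Lambdas)):
        C = Lam_k @ Lam_k.T + np.diag(Psi_diag)        # marginal covariance of FA k
        diff = X - mu_k
        L = np.linalg.cholesky(C)
        sol = np.linalg.solve(L, diff.T)
        quad = np.sum(sol ** 2, axis=0)
        logdet = 2.0 * np.sum(np.log(np.diag(L)))
        log_post[:, k] = (np.log(pi_k)
                          - 0.5 * (D * np.log(2 * np.pi) + logdet + quad))
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    return np.exp(log_post)                            # rows sum to one
```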
We also have the following general result. Suppose the complete-data model is in the exponential family, p(z|θ) = f(z) exp{θ^T T(z)} / Z(θ), with z = (x, y), sufficient statistics T(z) and normaliser Z(θ) = ∫ f(z) exp{θ^T T(z)} dz. Then
∂ log Z(θ)/∂θ = (1/Z(θ)) ∂Z(θ)/∂θ = (1/Z(θ)) ∂/∂θ ∫ f(z) exp{θ^T T(z)} dz
              = ∫ [ f(z) exp{θ^T T(z)} / Z(θ) ] T(z) dz = ⟨T(z)|θ⟩,
where the bracketed term is p(z|θ).
Thus, the M step solves:  ∂F/∂θ = ⟨T(z)⟩_{q(y)} − ⟨T(z)|θ⟩ = 0.
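A quick numerical check of the identity ∂ log Z(θ)/∂θ = ⟨T(z)|θ⟩ for a toy discrete exponential family (all names invented for the example):

```python
import numpy as np

theta = 0.7
z = np.arange(5)                     # a small discrete support
f = np.ones_like(z, dtype=float)     # base measure f(z)
T = z.astype(float)                  # sufficient statistic T(z) = z

def logZ(theta):
    return np.log(np.sum(f * np.exp(theta * T)))

p = f * np.exp(theta * T) / np.exp(logZ(theta))      # p(z|theta)
expected_T = np.sum(p * T)                           # <T(z)|theta>

eps = 1e-6
dlogZ = (logZ(theta + eps) - logZ(theta - eps)) / (2 * eps)
assert np.isclose(dlogZ, expected_T, atol=1e-5)
```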