
Estimating a Dirichlet distribution

Thomas P. Minka
2000 (revised 2003, 2009, 2012)
Abstract
The Dirichlet distribution and its compound variant, the Dirichlet-multinomial, are two
of the most basic models for proportional data, such as the mix of vocabulary words in a
text document. Yet the maximum-likelihood estimate of these distributions is not available in
closed form. This paper describes simple and efficient iterative schemes for obtaining parameter
estimates in these models. In each case, a fixed-point iteration and a Newton-Raphson (or
generalized Newton-Raphson) iteration are provided.
1 The Dirichlet distribution
The Dirichlet distribution is a model of how proportions vary. Let p denote a random vector whose
elements sum to 1, so that p_k represents the proportion of item k. Under the Dirichlet model with
parameter vector \alpha, the probability density at p is

p(\mathbf{p}) \sim \mathcal{D}(\alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k p_k^{\alpha_k - 1}    (1)

where p_k > 0    (2)

\sum_k p_k = 1    (3)
The parameters \alpha can be estimated from a training set of proportions: D = \{\mathbf{p}_1, \ldots, \mathbf{p}_N\}. The
maximum-likelihood estimate of \alpha maximizes p(D|\alpha) = \prod_i p(\mathbf{p}_i|\alpha). The log-likelihood can be
written

\log p(D|\alpha) = N \log\Gamma(\sum_k \alpha_k) - N \sum_k \log\Gamma(\alpha_k) + N \sum_k (\alpha_k - 1)\,\overline{\log p_k}    (4)

where \overline{\log p_k} = \frac{1}{N} \sum_i \log p_{ik}    (5)
This objective is convex in \alpha since the Dirichlet is in the exponential family. This implies that the
likelihood is unimodal and the maximum can be found by a simple search. A direct convexity proof
has also been given by Ronning (1989). The gradient of the log-likelihood with respect to one \alpha_k is

g_k = \frac{d \log p(D|\alpha)}{d\alpha_k} = N \Psi(\sum_k \alpha_k) - N \Psi(\alpha_k) + N \overline{\log p_k}    (6)

\Psi(x) = \frac{d \log\Gamma(x)}{dx}    (7)

is known as the digamma function and is similar to the natural logarithm. As always with the
exponential family, when the gradient is zero, the expected sufficient statistics are equal to the
observed sufficient statistics. In this case, the expected sufficient statistics are

E[\log p_k] = \Psi(\alpha_k) - \Psi(\sum_k \alpha_k)    (8)

and the observed sufficient statistics are \overline{\log p_k}.
A fixed-point iteration for maximizing the likelihood can be derived as follows. Given an initial guess
for \alpha, we construct a simple lower bound on the likelihood which is tight at \alpha. The maximum of this
bound is computed in closed form and it becomes the new guess. Such an iteration is guaranteed
to converge to a stationary point of the likelihood; in fact it is the same principle behind the EM
algorithm (Minka, 1998). For the Dirichlet, the maximum is the only stationary point.
As shown in appendix A, a bound on \Gamma(\sum_k \alpha_k) leads to the following fixed-point iteration:

\Psi(\alpha_k^{new}) = \Psi(\sum_k \alpha_k^{old}) + \overline{\log p_k}    (9)
This algorithm requires inverting the \Psi function, a procedure which is described in appendix C.
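As an illustration, here is a minimal Python sketch of the fixed-point iteration (9). The helper inv_digamma follows the inversion procedure of appendix C; the function names, starting guess, and stopping tolerance are illustrative choices, not part of the paper's prescription.

import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, n_iter=5):
    # Invert the digamma function by Newton's method (the procedure of appendix C).
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(n_iter):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x

def dirichlet_mle_fixed_point(P, n_iter=1000, tol=1e-10):
    # P: (N, K) array of proportion vectors (rows sum to 1, entries > 0).
    log_p_bar = np.log(P).mean(axis=0)          # observed statistics, eq. (5)
    alpha = np.ones(P.shape[1])                 # simple starting guess (see moment matching below)
    for _ in range(n_iter):
        alpha_new = inv_digamma(digamma(alpha.sum()) + log_p_bar)   # eq. (9)
        if np.max(np.abs(alpha_new - alpha)) < tol:
            break
        alpha = alpha_new
    return alpha_new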
Another approach to finding a stationary point is Newton iteration. The second derivatives, i.e. the
Hessian matrix, of the log-likelihood are given by

\frac{d^2 \log p(D|\alpha)}{d\alpha_k^2} = N \Psi'(\sum_k \alpha_k) - N \Psi'(\alpha_k)    (10)

\frac{d^2 \log p(D|\alpha)}{d\alpha_k\, d\alpha_j} = N \Psi'(\sum_k \alpha_k)    (k \neq j)    (11)

\Psi' is known as the trigamma function. The Hessian can be written in matrix form as

H = Q + \mathbf{1}\mathbf{1}^T z    (12)

q_{jk} = -N \Psi'(\alpha_k)\, \delta(j - k)    (13)

z = N \Psi'(\sum_k \alpha_k)    (14)
One Newton step is therefore

\alpha^{new} = \alpha^{old} - H^{-1} g    (15)

H^{-1} = Q^{-1} - \frac{Q^{-1} \mathbf{1}\mathbf{1}^T Q^{-1}}{1/z + \mathbf{1}^T Q^{-1} \mathbf{1}}    (16)

(H^{-1} g)_k = \frac{g_k - b}{q_{kk}}    (17)

where b = \frac{\mathbf{1}^T Q^{-1} g}{1/z + \mathbf{1}^T Q^{-1} \mathbf{1}} = \frac{\sum_j g_j / q_{jj}}{1/z + \sum_j 1/q_{jj}}    (18)
Unlike some Newton algorithms, this one does not require storing or inverting the Hessian matrix
explicitly. The same Newton algorithm was given by Ronning (1989) and by Naryanan (1991).
Naryanan also derives a stopping rule for the iteration.
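Because (16)-(18) only use the diagonal of Q, each Newton step costs O(K). A minimal Python sketch under the same assumptions as before (proportions in an (N, K) array); the function and variable names are illustrative:

import numpy as np
from scipy.special import digamma, polygamma

def dirichlet_newton_step(alpha, log_p_bar, N):
    # One Newton step (15)-(18) using the diagonal-plus-rank-one Hessian structure.
    g = N * (digamma(alpha.sum()) - digamma(alpha)) + N * log_p_bar   # gradient, eq. (6)
    q = -N * polygamma(1, alpha)                                      # diagonal of Q, eq. (13)
    z = N * polygamma(1, alpha.sum())                                 # eq. (14)
    b = (g / q).sum() / (1.0 / z + (1.0 / q).sum())                   # eq. (18)
    return alpha - (g - b) / q                                        # eq. (15) with (17)

In practice the step may need to be damped to keep every alpha_k positive.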
An approximate MLE, useful for initialization, is given by finding the density which matches the
moments of the data. The first two moments of the density are

E[p_k] = \frac{\alpha_k}{\sum_k \alpha_k}    (19)

E[p_k^2] = E[p_k] \frac{1 + \alpha_k}{1 + \sum_k \alpha_k}    (20)

\sum_k \alpha_k = \frac{E[p_1] - E[p_1^2]}{E[p_1^2] - E[p_1]^2}    (21)
Multiplying (21) and (19) gives a formula for \alpha_k in terms of moments. Equation (21) uses p_1, but
any other p_k could also be used to estimate \sum_k \alpha_k. Ronning (1989) suggests instead using all of the
p_k's via

\mathrm{var}(p_k) = \frac{E[p_k](1 - E[p_k])}{1 + \sum_k \alpha_k}    (22)

\log \sum_k \alpha_k = \frac{1}{K-1} \sum_{k=1}^{K-1} \log\left( \frac{E[p_k](1 - E[p_k])}{\mathrm{var}(p_k)} - 1 \right)    (23)
Another approximate MLE, specifically for the case K = 2, is given by Johnson & Kotz (1970):

\alpha_1 = \frac{1}{2} \cdot \frac{1 - \bar{p}_2}{1 - \bar{p}_1 - \bar{p}_2}    (24)

\alpha_2 = \frac{1}{2} \cdot \frac{1 - \bar{p}_1}{1 - \bar{p}_1 - \bar{p}_2}    (25)
2 Estimating Dirichlet mean and precision separately
The parameters of the Dirichlet can be understood by considering the following alternative
representation:

s = \sum_k \alpha_k    (26)

m_k = E[p_k] = \alpha_k / s    (27)

Here m is the mean of the distribution for p and s can be understood as the precision. When s is
large, p is likely to be near m, and when s is small, p is distributed more diffusely. Interpretation of s
and m suggests situations, such as hierarchical modeling, in which we may want to fix one parameter
and only optimize the other. Additionally, s and m are roughly decoupled in the maximum-likelihood
objective, which means we can get simplifications and speedups by optimizing them alternately.
Thus, in this section we will reparameterize the distribution with (s, m) where

\alpha_k = s m_k    (28)

\sum_k m_k = 1    (29)
2.1 Estimating Dirichlet precision
The likelihood for s alone is

p(D|s) \propto \left[ \frac{\Gamma(s) \exp(s \sum_k m_k \overline{\log p_k})}{\prod_k \Gamma(s m_k)} \right]^N    (30)

whose derivatives are

\frac{d \log p(D|s)}{ds} = N \Psi(s) - N \sum_k m_k \Psi(s m_k) + N \sum_k m_k \overline{\log p_k}    (31)

\frac{d^2 \log p(D|s)}{ds^2} = N \Psi'(s) - N \sum_k m_k^2 \Psi'(s m_k)    (32)
A convergent fixed-point iteration for s is

(K-1)/s^{new} = (K-1)/s - \Psi(s) + \sum_k m_k \Psi(s m_k) - \sum_k m_k \overline{\log p_k}    (33)

Proof: Use the bound (see Appendix E)

\frac{\Gamma(s)}{\prod_k \Gamma(s m_k)} \geq \exp(s b + (K-1)\log(s) + c)    (34)

b = \Psi(\hat{s}) - \sum_k m_k \Psi(\hat{s} m_k) - (K-1)/\hat{s}    (35)

to get

\log p(D|s) \geq s \sum_k m_k \overline{\log p_k} + s b + (K-1)\log(s) + (\text{const.})    (36)

from which (33) follows.
This iteration is only first-order convergent because the bound only matches the first derivative
of the likelihood. We can derive a second-order method using the technique of generalized Newton
iteration (Minka, 2000). The idea is to approximate the likelihood by a simpler function, by matching
the first two derivatives at the current guess:

\frac{\Gamma(s)}{\prod_k \Gamma(s m_k)} \approx \exp(s b + a \log(s) + c)    (37)

a = -\hat{s}^2 \left( \Psi'(\hat{s}) - \sum_k m_k^2 \Psi'(\hat{s} m_k) \right)    (38)

b = \Psi(\hat{s}) - \sum_k m_k \Psi(\hat{s} m_k) - a/\hat{s}    (39)

Maximizing the approximation leads to the update

\frac{1}{s^{new}} = \frac{1}{s} + \frac{1}{s^2} \left[ \frac{d^2 \log p(D|s)}{ds^2} \right]^{-1} \left[ \frac{d \log p(D|s)}{ds} \right]    (40)
This update resembles Newton-Raphson, but converges faster.
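A minimal Python sketch of the update (40) for the precision with the mean m held fixed; the function name, argument layout, and stopping logic are illustrative assumptions rather than part of the original description:

import numpy as np
from scipy.special import digamma, polygamma

def dirichlet_precision_update(s, m, log_p_bar, N):
    # One generalized-Newton step (40) for s with m fixed.
    grad = N * (digamma(s) - (m * digamma(s * m)).sum()
                + (m * log_p_bar).sum())                                # eq. (31)
    hess = N * (polygamma(1, s) - (m**2 * polygamma(1, s * m)).sum())   # eq. (32)
    inv_s_new = 1.0 / s + grad / (s**2 * hess)                          # eq. (40)
    return 1.0 / inv_s_new

If the returned value is not positive the step has overshot and should be damped, e.g. by halving the change in 1/s.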
For initialization of s, it is useful to derive a closed-form approximate MLE. Stirling's approximation
to \Gamma gives

\frac{\Gamma(s) \exp(s \sum_k m_k \overline{\log p_k})}{\prod_k \Gamma(s m_k)} \approx \left( \frac{s}{2\pi} \right)^{(K-1)/2} \prod_k m_k^{1/2} \exp\left( s \sum_k m_k \overline{\log(p_k/m_k)} \right)    (41)

s \approx \frac{(K-1)/2}{-\sum_k m_k \overline{\log(p_k/m_k)}}    (42)
2.2 Estimating Dirichlet mean
Now suppose we fix the precision s and want to estimate the mean m. The likelihood for m alone is

p(D|m) \propto \prod_k \left[ \frac{\exp(s m_k \overline{\log p_k})}{\Gamma(s m_k)} \right]^N    (43)
Reparameterize with the unconstrained vector z to get the gradient:

m_k = \frac{z_k}{\sum_j z_j}    (44)

\frac{d \log p(D|m)}{dz_k} = \frac{N s}{\sum_j z_j} \left[ \overline{\log p_k} - \Psi(s m_k) - \sum_j m_j \left( \overline{\log p_j} - \Psi(s m_j) \right) \right]    (45)

The MLE can be computed by the fixed-point iteration

\Psi(\alpha_k) = \overline{\log p_k} - \sum_j m_j^{old} \left( \overline{\log p_j} - \Psi(s m_j^{old}) \right)    (46)

m_k^{new} = \frac{\alpha_k}{\sum_j \alpha_j}    (47)
This update converges very quickly.
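A hedged Python sketch of the mean update (46)-(47), assuming the inv_digamma helper sketched in section 1 (an assumed name, not from the paper):

import numpy as np
from scipy.special import digamma

def dirichlet_mean_update(m, s, log_p_bar):
    # One pass of the fixed-point update (46)-(47) for the mean, with s fixed.
    # Assumes inv_digamma as sketched in section 1.
    resid = log_p_bar - (m * (log_p_bar - digamma(s * m))).sum()   # right-hand side of (46)
    alpha = inv_digamma(resid)                                     # solve Psi(alpha_k) = resid_k
    return alpha / alpha.sum()                                     # eq. (47)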
3 The Dirichlet-multinomial/Polya distribution
The Dirichlet-multinomial distribution is a compound distribution where p is drawn from a Dirichlet
and then a sample of discrete outcomes x is drawn from a multinomial with probability vector p.
This compounding is essentially a Polya urn scheme, so the Dirichlet-multinomial is also called the
Polya distribution. Let n_k be the number of times the outcome was k, i.e.

n_k = \sum_j \delta(x_j - k)    (48)

Then the resulting distribution over x, a vector of outcomes, is

p(\mathbf{x}|\alpha) = \int_{\mathbf{p}} p(\mathbf{x}|\mathbf{p}) p(\mathbf{p}|\alpha)\, d\mathbf{p}    (49)

= \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(\sum_k n_k + \alpha_k)} \prod_k \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}    (50)

This distribution is also parameterized by \alpha, which can be estimated from a training set of count
vectors: D = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}. The likelihood is

n_i = \sum_k n_{ik}    (51)

p(D|\alpha) = \prod_i p(\mathbf{x}_i|\alpha)    (52)

= \prod_i \left[ \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(n_i + \sum_k \alpha_k)} \prod_k \frac{\Gamma(n_{ik} + \alpha_k)}{\Gamma(\alpha_k)} \right]    (53)
The gradient of the log-likelihood is

g_k = \frac{d \log p(D|\alpha)}{d\alpha_k} = \sum_i \Psi(\sum_k \alpha_k) - \Psi(n_i + \sum_k \alpha_k) + \Psi(n_{ik} + \alpha_k) - \Psi(\alpha_k)    (54)

The maximum can be computed via the fixed-point iteration

\alpha_k^{new} = \alpha_k \, \frac{\sum_i \Psi(n_{ik} + \alpha_k) - \Psi(\alpha_k)}{\sum_i \Psi(n_i + \sum_k \alpha_k) - \Psi(\sum_k \alpha_k)}    (55)
(see appendix B).
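A minimal Python sketch of the fixed-point iteration (55); counts is an (N, K) integer array, the names are illustrative, and a crude constant initialization is used in place of the moment-matching start suggested below:

import numpy as np
from scipy.special import digamma

def polya_mle_fixed_point(counts, n_iter=1000, tol=1e-10):
    # Fixed-point iteration (55) for the Dirichlet-multinomial (Polya) MLE.
    n_i = counts.sum(axis=1)
    alpha = counts.mean(axis=0) + 1e-2          # crude initialization (a simplification)
    for _ in range(n_iter):
        num = (digamma(counts + alpha) - digamma(alpha)).sum(axis=0)
        den = (digamma(n_i + alpha.sum()) - digamma(alpha.sum())).sum()
        alpha_new = alpha * num / den
        if np.max(np.abs(alpha_new - alpha)) < tol:
            break
        alpha = alpha_new
    return alpha_new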
Alternatively, there is a simplified Newton iteration as in the Dirichlet case. The Hessian of the
log-likelihood is

\frac{d^2 \log p(D|\alpha)}{d\alpha_k^2} = \sum_i \Psi'(\sum_k \alpha_k) - \Psi'(n_i + \sum_k \alpha_k) + \Psi'(n_{ik} + \alpha_k) - \Psi'(\alpha_k)    (56)

\frac{d^2 \log p(D|\alpha)}{d\alpha_k\, d\alpha_j} = \sum_i \Psi'(\sum_k \alpha_k) - \Psi'(n_i + \sum_k \alpha_k)    (k \neq j)    (57)

The Hessian can be written in matrix form as

H = Q + \mathbf{1}\mathbf{1}^T z    (58)

q_{jk} = \delta(j - k) \sum_i \Psi'(n_{ik} + \alpha_k) - \Psi'(\alpha_k)    (59)

z = \sum_i \Psi'(\sum_k \alpha_k) - \Psi'(n_i + \sum_k \alpha_k)    (60)

from which a Newton step can be computed as before. The search can be initialized with the moment
matching estimate where p_{ik} is approximated by n_{ik}/n_i.
Another approach is to reduce this problem to the previous one via EM; see appendix D.
A different method is to maximize the leave-one-out (LOO) likelihood instead of the true likelihood.
The LOO likelihood is the product of the probability of each sample given the remaining data and
the parameters. The LOO log-likelihood is

f(\alpha) = \sum_{ik} n_{ik} \log \frac{n_{ik} - 1 + \alpha_k}{n_i - 1 + \sum_k \alpha_k} = \sum_{ik} n_{ik} \log(n_{ik} - 1 + \alpha_k) - \sum_i n_i \log(n_i - 1 + \sum_k \alpha_k)    (61)
Note that it doesn't involve any special functions. The derivatives are

\frac{df(\alpha)}{d\alpha_k} = \sum_i \frac{n_{ik}}{n_{ik} - 1 + \alpha_k} - \frac{n_i}{n_i - 1 + \sum_k \alpha_k}    (62)

\frac{d^2 f(\alpha)}{d\alpha_k^2} = \sum_i -\frac{n_{ik}}{(n_{ik} - 1 + \alpha_k)^2} + \frac{n_i}{(n_i - 1 + \sum_k \alpha_k)^2}    (63)

\frac{d^2 f(\alpha)}{d\alpha_k\, d\alpha_j} = \sum_i \frac{n_i}{(n_i - 1 + \sum_k \alpha_k)^2}    (64)
A convergent fixed-point iteration is

\alpha_k^{new} = \alpha_k \, \frac{\sum_i n_{ik} / (n_{ik} - 1 + \alpha_k)}{\sum_i n_i / (n_i - 1 + \sum_k \alpha_k)}    (65)
Proof: Use the bounds

\log(n + x) \geq q \log x + (1 - q) \log n - q \log q - (1 - q) \log(1 - q)    (66)

q = \frac{\hat{x}}{n + \hat{x}}    (67)

\log(x) \leq a x - 1 + \log \hat{x}    (68)

a = 1/\hat{x}    (69)
to get

f(\alpha) \geq \sum_{ik} n_{ik} q_{ik} \log \alpha_k - \sum_i n_i a_i \sum_k \alpha_k + (\text{const.})    (70)

leading to (65).
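A short Python sketch of the LOO fixed-point iteration (65), which needs no special functions; counts is an (N, K) integer array and all names are illustrative:

import numpy as np

def polya_mle_loo(counts, n_iter=1000, tol=1e-10):
    # Fixed-point iteration (65) on the leave-one-out likelihood (61).
    n_i = counts.sum(axis=1, keepdims=True)
    alpha = counts.mean(axis=0) + 1e-2
    for _ in range(n_iter):
        # Terms with n_ik = 0 contribute nothing; the clipped denominator only guards those.
        num = (counts / np.maximum(counts - 1 + alpha, 1e-12)).sum(axis=0)
        den = (n_i / (n_i - 1 + alpha.sum())).sum()
        alpha_new = alpha * num / den
        if np.max(np.abs(alpha_new - alpha)) < tol:
            break
        alpha = alpha_new
    return alpha_new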
The LOO likelihood can be interpreted as the approximation

\frac{\Gamma(x + n)}{\Gamma(x)} \approx (x + n - 1)^n    (71)
4 Estimating Polya mean and precision separately
The parameters of the Polya distribution can be decomposed into mean m and precision s, just
as in the Dirichlet case, and optimization can be done separately on each part. The decomposition
also leads to an interesting interpretation of the Polya, discussed in the next subsection.
4.1 A novel interpretation of the Polya distribution
The Polya distribution can be interpreted as a multinomial distribution over a modified set of counts,
with a special normalizer. To see why, consider the log-probability of the outcomes x under the Polya
versus multinomial:

\log p(\mathbf{x}|\alpha) = \log\Gamma(s) - \log\Gamma(s + n) + \sum_k \log\Gamma(n_k + s m_k) - \log\Gamma(s m_k)    (72)

\log p(\mathbf{x}|\mathbf{p}) = \sum_k n_k \log p_k    (73)

The multinomial is an exponential family, and the Polya is not. But we can find an approximate
exponential-family representation of the Polya, by considering derivatives. In the multinomial case,
the counts can be recovered from the expression

n_k = p_k \frac{d \log p(\mathbf{x}|\mathbf{p})}{dp_k}    (74)

In the Polya case, the analogous expression is

\tilde{n}_k = m_k \frac{d \log p(\mathbf{x}|\alpha)}{dm_k}    (75)

= \alpha_k \left( \Psi(n_k + \alpha_k) - \Psi(\alpha_k) \right) \equiv \tilde{n}(n_k, \alpha_k)    (76)
The log-probability of x under the Polya can thus be approximated by

\log p(\mathbf{x}|\alpha) \approx \sum_k \tilde{n}_k \log m_k    (77)

When s \to \infty, the Dirichlet-multinomial becomes an ordinary multinomial with p = m, and therefore
the effective counts are the same as the ordinary counts:

\tilde{n}(n_k, \infty) = n_k    (78)
At the other extreme, as s \to 0, the Dirichlet-multinomial favors extreme proportions, and the
effective counts are a binarized version of the original counts:

\tilde{n}(n_k, 0) = \begin{cases} 0 & \text{if } n_k = 0, \\ 1 & \text{if } n_k > 0. \end{cases}    (79)

For intermediate values of \alpha, the mapping behaves like a logarithm, reducing the influence of large
counts on the likelihood (see figure 1). Thus the Polya can be understood as a multinomial with
damped counts.
Figure 1: The effective counts \tilde{n}(n_k, \alpha_k) for the Polya distribution, as a function of the original
count n_k and the parameter \alpha_k (curves shown for \alpha = 0.1, 1, 3, 10, 100).
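A short Python sketch of the effective-count mapping (76), which can be used to reproduce the behaviour summarized in figure 1; the function name is an illustrative choice:

import numpy as np
from scipy.special import digamma

def effective_count(n_k, alpha_k):
    # Effective count (76): alpha_k * (Psi(n_k + alpha_k) - Psi(alpha_k)).
    return alpha_k * (digamma(n_k + alpha_k) - digamma(alpha_k))

# For example, a raw count of 10 is damped heavily when alpha_k is small:
# effective_count(10, 0.1) is roughly 1.3, while effective_count(10, 100) is roughly 9.6.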
This representation of the Polya also arises in the estimation of m when s is fixed (section 5).
4.2 Estimating Polya precision
The likelihood for s alone is

p(D|s) \propto \prod_i \left[ \frac{\Gamma(s)}{\Gamma(n_i + s)} \prod_k \frac{\Gamma(n_{ik} + s m_k)}{\Gamma(s m_k)} \right]    (80)

The derivatives are

\frac{d \log p(D|s)}{ds} = \sum_i \Psi(s) - \Psi(n_i + s) + \sum_k m_k \Psi(n_{ik} + s m_k) - m_k \Psi(s m_k)    (81)

\frac{d^2 \log p(D|s)}{ds^2} = \sum_i \Psi'(s) - \Psi'(n_i + s) + \sum_k m_k^2 \Psi'(n_{ik} + s m_k) - m_k^2 \Psi'(s m_k)    (82)

A convergent fixed-point iteration is

s^{new} = s \, \frac{\sum_{ik} m_k \Psi(n_{ik} + s m_k) - m_k \Psi(s m_k)}{\sum_i \Psi(n_i + s) - \Psi(s)}    (83)
(the proof is similar to (55)). However, it is very slow. We can get a fast second-order method as
follows. When s is small, i.e. the gradient is positive, use the approximation (writing f(s) for \log p(D|s))

\log p(D|s) \approx a \log(s) + c s + k    (84)

a = -s_0^2 f''(s_0)    (85)

c = f'(s_0) - a/s_0    (86)

to get the update

s^{new} = -a/c = \frac{s}{1 + f'(s)/(s f''(s))}    (87)
except when c \geq 0, in which case the solution is s = \infty. When s is large, i.e. the gradient is negative,
use the approximation

\log p(D|s) \approx \frac{a}{2s^2} + \frac{c}{s} + k    (88)

a = s_0^3 \left( s_0 f''(s_0) + 2 f'(s_0) \right)    (89)

c = -\left( s_0^2 f'(s_0) + a/s_0 \right)    (90)

to get the update

s^{new} = -a/c = \frac{s f''(s) + 2 f'(s)}{f''(s) + 3 f'(s)/s}    (91)
For large s, the value of a tends to be numerically unstable. If s f''(s) + 2 f'(s) is within machine
epsilon of zero, then it is better to substitute the limiting value:

a \approx \sum_i \frac{n_i(n_i - 1)(2n_i - 1)}{6} - \sum_{ik} \frac{n_{ik}(n_{ik} - 1)(2n_{ik} - 1)}{6 m_k^2}    (92)
An even faster update for large s is possible by using a richer approximation:

\log p(D|s) \approx c \log\left( \frac{s}{s + b} \right) + \frac{e}{s + b}    (93)

c = \sum_{ik} \delta(n_{ik} > 0) - \sum_i \delta(n_i > 0)    (94)

e = -\frac{s_0 + b}{s_0} \left( s_0(s_0 + b) f'(s_0) - c b \right)    (95)

b = \text{RootOf}(a_2 b^2 + a_1 b + a_0)    (96)

a_2 = s_0^3 \left( s_0 f''(s_0) + 2 f'(s_0) \right)    (97)

a_1 = 2 s_0^2 \left( s_0 f''(s_0) + f'(s_0) \right)    (98)

a_0 = s_0^2 f''(s_0) + c    (99)

The approximation comes from setting c \log s to match \log p(D|s) as s \to 0 and then choosing (b, e)
to match the first two derivatives of f at the current s. The resulting update is

s^{new} = \frac{c b^2}{e - c b} = \left[ \frac{1}{s} - \frac{f'(s)(s + b)^2}{c b^2} \right]^{-1}    (100)

Note that a_2 is equivalent to a above and should be corrected for stability via the same method.
The case of large dimension
An interesting special case arises when K is very large. The precision can be estimated simply by
counting the number of singleton elements in each x. Because the precision acts like a smoothing
parameter on the estimate of m, this result is reminiscent of smoothing methods in document
modeling which are based on counting singletons.
If m_k is roughly uniform and K \gg 1, then \alpha_k \ll 1 and we can use the approximations
\Gamma(n_k + \alpha_k) \approx \Gamma(n_k)    (101)

\Gamma(\alpha_k) \approx 1/\alpha_k    (102)

p(\mathbf{x}|s) \approx \frac{\Gamma(s)}{\Gamma(n + s)} \prod_{n_k > 0} s m_k \, \Gamma(n_k)    (103)

\propto \frac{\Gamma(s)\, s^{\tilde{K}}}{\Gamma(n + s)}    (104)

where \tilde{K} is the number of unique observations in \mathbf{x}. The approximation does not hold if s is large,
which can happen when m is a good match to the data. But if the dimensionality is large enough,
the data will be too sparse for this to happen. The derivatives become
\frac{d \log p(D|s)}{ds} \approx \sum_i \Psi(s) - \Psi(n_i + s) + \tilde{K}_i / s    (105)

\frac{d^2 \log p(D|s)}{ds^2} \approx \sum_i \Psi'(s) - \Psi'(n_i + s) - \tilde{K}_i / s^2    (106)

Newton iteration can be used as long as the maximum for s is not on the boundary of (0, \infty). These
cases occur when \tilde{K} = 1 and \tilde{K} = n.
When the gradient is zero, we have

\tilde{K} = s \left( \Psi(n + s) - \Psi(s) \right) = E[\tilde{K} | s, n]    (107)
A convergent fixed-point iteration is

s^{new} = \frac{\sum_i \tilde{K}_i}{\sum_i \Psi(n_i + s^{old}) - \Psi(s^{old})}    (108)

Proof: Use the bound

\frac{\Gamma(s)}{\Gamma(n + s)} \geq \frac{\Gamma(\hat{s}) \exp((\hat{s} - s) b)}{\Gamma(n + \hat{s})}    (109)

b = \Psi(n + \hat{s}) - \Psi(\hat{s})    (110)

to get

\log p(D|s) \geq -s \sum_i b_i + \sum_i \tilde{K}_i \log s + (\text{const.})    (111)

leading to (108).
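A small Python sketch of the large-dimension fixed-point iteration (108); counts is an (N, K) array, the uniform-mean and sparse-data assumptions of this subsection are taken for granted, and the names are illustrative:

import numpy as np
from scipy.special import digamma

def polya_precision_large_k(counts, n_iter=100, s_init=1.0):
    # Fixed-point iteration (108): s is driven by the number of unique outcomes per sample.
    n_i = counts.sum(axis=1)
    k_tilde = (counts > 0).sum(axis=1)          # unique observations in each x_i
    s = s_init
    for _ in range(n_iter):
        s = k_tilde.sum() / (digamma(n_i + s) - digamma(s)).sum()
    return s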
Applying the large-K approximation to the LOO likelihood gives

t = \sum_{ik} \delta(n_{ik} = 1)    (number of singletons)    (112)

f(s) = t \log s - \sum_i n_i \log(n_i - 1 + s)    (113)

\frac{df(s)}{ds} = \frac{t}{s} - \sum_i \frac{n_i}{n_i - 1 + s}    (114)
For N = 1:

s = \frac{t(n - 1)}{n - t}    (115)

\frac{s}{s + n} = \frac{t(n - 1)}{n^2 - t} \approx \frac{t}{n}    (116)

which is the result we wanted.
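As a worked illustration of (115) under these assumptions, a single sample of n = 100 tokens in which t = 30 outcomes occur exactly once gives s = 30(99)/70, about 42.4, and s/(s + n) is about 0.30, matching t/n; in other words, roughly 30% of the predictive weight is smoothed back toward the mean m.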
5 Estimating Polya mean
The likelihood for m only is

p(D|m) \propto \prod_{ik} \frac{\Gamma(n_{ik} + s m_k)}{\Gamma(s m_k)}    (117)

The maximum can be computed by the fixed-point iteration

m_k^{new} \propto \sum_i \tilde{n}(n_{ik}, s m_k)    (118)

where \tilde{n} is defined in (76). This update can be understood intuitively as the maximum-likelihood
estimate of a multinomial distribution from effective counts \tilde{n}_{ik} = \tilde{n}(n_{ik}, s m_k). The proof of this
iteration is similar to (55).
For a Newton-Raphson iteration, reparameterize to get

m_K = 1 - \sum_{k=1}^{K-1} m_k    (119)

g_k = \frac{d \log p(D|m)}{dm_k} = s \sum_i \Psi(n_{ik} + s m_k) - \Psi(s m_k) - \Psi(n_{iK} + s m_K) + \Psi(s m_K)    (120)

\frac{d^2 \log p(D|m)}{dm_k^2} = s^2 \sum_i \Psi'(n_{ik} + s m_k) - \Psi'(s m_k) + \Psi'(n_{iK} + s m_K) - \Psi'(s m_K)    (121)

\frac{d^2 \log p(D|m)}{dm_k\, dm_j} = s^2 \sum_i \Psi'(n_{iK} + s m_K) - \Psi'(s m_K)    (122)

The search should be initialized at m_k \propto \sum_i n_{ik}, since for large s this is the exact optimum.
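A brief Python sketch of the fixed-point update (118) for the Polya mean with s held fixed, reusing the effective-count mapping (76); names are illustrative:

import numpy as np
from scipy.special import digamma

def polya_mean_update(m, s, counts):
    # One pass of update (118): renormalized effective counts from eq. (76).
    eff = (s * m) * (digamma(counts + s * m) - digamma(s * m))   # n_tilde(n_ik, s m_k)
    m_new = eff.sum(axis=0)
    return m_new / m_new.sum()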
References
Johnson, N. L., & Kotz, S. (1970). Distributions in statistics: Continuous univariate distributions.
New York: Houghton Mifflin.
Minka, T. P. (1998). Expectation-Maximization as lower bound maximization.
http://research.microsoft.com/~minka/papers/em.html.
Minka, T. P. (2000). Beyond Newton's method.
http://research.microsoft.com/~minka/papers/newton.html.
Naryanan, A. (1991). Algorithm AS 266: Maximum likelihood estimation of the parameters of the
Dirichlet distribution. Applied Statistics, 40, 365-374.
http://www.psc.edu/~burkardt/dirichlet.html.
Ronning, G. (1989). Maximum-likelihood estimation of Dirichlet distributions. Journal of
Statistical Computation and Simulation, 32, 215-221.
A Proof of (9)
Use the bound

\Gamma(x) \geq \Gamma(\hat{x}) \exp\left( (x - \hat{x}) \Psi(\hat{x}) \right)    (123)

to get

\frac{1}{N} \log p(D|\alpha) \geq \left( \sum_k \alpha_k \right) \Psi\left( \sum_k \alpha_k^{old} \right) - \sum_k \log\Gamma(\alpha_k) + \sum_k (\alpha_k - 1)\,\overline{\log p_k} + (\text{const.})    (124)

leading to (9).
B Proof of (55)
Use the bound

\frac{\Gamma(x)}{\Gamma(n + x)} \geq \frac{\Gamma(\hat{x}) \exp((\hat{x} - x) b)}{\Gamma(n + \hat{x})}    (125)

b = \Psi(n + \hat{x}) - \Psi(\hat{x})    (126)

and the bound

\frac{\Gamma(n + x)}{\Gamma(x)} \geq c x^a    \text{if } n \geq 1    (127)

a = \left( \Psi(n + \hat{x}) - \Psi(\hat{x}) \right) \hat{x}    (128)

c = \frac{\Gamma(n + \hat{x})}{\Gamma(\hat{x})} \, \hat{x}^{-a}    (129)

to get

\log p(D|\alpha) \geq -\left( \sum_k \alpha_k \right) \sum_i b_i + \sum_{ik} a_{ik} \log \alpha_k + (\text{const.})    (130)

leading to (55).
C Inverting the \Psi function
This section describes how to compute a high-accuracy solution to

\Psi(x) = y    (131)

for x given y. Given a starting guess for x, Newton's method can be used to find the root of
\Psi(x) - y = 0. The Newton update is

x^{new} = x^{old} - \frac{\Psi(x) - y}{\Psi'(x)}    (132)

To start the iteration, use the following asymptotic formulas for \Psi(x):

\Psi(x) \approx \begin{cases} \log(x - 1/2) & \text{if } x \geq 0.6 \\ -\frac{1}{x} - \gamma & \text{if } x < 0.6 \end{cases}    (133)

\gamma = -\Psi(1)    (134)

to get

\Psi^{-1}(y) \approx \begin{cases} \exp(y) + 1/2 & \text{if } y \geq -2.22 \\ -\frac{1}{y + \gamma} & \text{if } y < -2.22 \end{cases}    (135)

With this initialization, five Newton iterations are sufficient to reach fourteen digits of precision.
D EM for estimation from counts
Any algorithm for estimating \alpha from probability vectors can be turned into an algorithm for
estimation from counts, by treating the p_i as hidden variables in EM. The E-step computes a
posterior distribution over p_i:

q(\mathbf{p}_i) = \mathcal{D}(n_{ik} + \alpha_k)    (136)

and the M-step maximizes

E\left[ \sum_i \log p(\mathbf{p}_i|\alpha) \right] = N \log\Gamma(\sum_k \alpha_k) - N \sum_k \log\Gamma(\alpha_k) + N \sum_k (\alpha_k - 1)\,\overline{\log p_k}    (137)

where \overline{\log p_k} = \frac{1}{N} \sum_i E[\log p_{ik}]    (138)

= \frac{1}{N} \sum_i \Psi(n_{ik} + \alpha_k^{old}) - \Psi(n_i + \sum_k \alpha_k^{old})    (139)

This is the same optimization problem as in section 1, with a new definition for \overline{\log p}. It is not
necessary or desirable to reach the exact maximum in the M-step; a single Newton step will do.
The Newton step will end up using the old Hessian (10) but the new gradient (54). Compared to
the exact Newton algorithm, this uses half as much computation per iteration, but usually requires
more than twice the iterations.
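A short Python sketch of the E-step statistic (139), which feeds the Dirichlet updates of section 1; the function name is an illustrative assumption:

import numpy as np
from scipy.special import digamma

def expected_log_p(counts, alpha):
    # E-step statistic (139): expected log p_k under q(p_i) = D(n_ik + alpha_k).
    n_i = counts.sum(axis=1, keepdims=True)
    return np.mean(digamma(counts + alpha) - digamma(n_i + alpha.sum()), axis=0)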
E Proof of (34)
We want to prove the bound

\frac{\Gamma(s)}{\prod_k \Gamma(s m_k)} \geq \exp(s b + (K - 1)\log(s) + c)    (140)

b = \Psi(\hat{s}) - \sum_k m_k \Psi(\hat{s} m_k) - (K - 1)/\hat{s}    (141)

Define the function f(s) via

f(s) = \log\Gamma(s + 1) - \sum_k \log\Gamma(s m_k + 1)    (142)

= \log\Gamma(s) - \sum_k \log\Gamma(s m_k) - (K - 1)\log s + \text{const.}    (143)

The bound is equivalent to saying that f(s) can be lower bounded by a linear function in s at any
point \hat{s}, i.e. f(s) is convex in s. To show this, take the second derivative of f(s):

\frac{df}{ds} = \Psi(s + 1) - \sum_k \Psi(s m_k + 1) m_k    (144)

\frac{d^2 f}{ds^2} = \Psi'(s + 1) - \sum_k \Psi'(s m_k + 1) m_k^2    (145)

We need to show that (145) is always positive. Since the function g(x) = \Psi'(x + 1) x is increasing,
we know that:

g(s) \geq g(s m_k)    (146)

\sum_k m_k g(s) \geq \sum_k m_k g(s m_k)    (147)

g(s) \geq \sum_k m_k g(s m_k)    (148)

\Psi'(s + 1) \geq \sum_k \Psi'(s m_k + 1) m_k^2    (149)

thus f(s) is convex and the bound follows.