distributions
Jakub M. Tomczak
November 28, 2012
1 Notations
Let x be a random variable and consider a parametric distribution of x with parameters θ, p(x|θ). A continuous random variable x ∈ R can be modelled by the normal (Gaussian) distribution:

    p(x|θ) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{(x-\mu)^2}{2\sigma^2} \Big\} = N(x|\mu, \sigma^2),    (1)

where θ = [\mu \;\; \sigma^2]^T.
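The density in Eq. (1) is easy to evaluate directly. The short sketch below (the function name `normal_pdf` is my own, not from the note) checks numerically that the density integrates to one over a wide grid:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density N(x | mu, sigma^2) from Eq. (1)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# A crude Riemann sum over [mu - 10, mu + 10] should be very close to 1.
mu, sigma2 = 1.5, 0.8
step = 1e-3
total = sum(normal_pdf(mu - 10 + i * step, mu, sigma2) * step
            for i in range(int(20 / step)))
```

The grid width of ±10 standard-deviation-scale units and the step size are arbitrary choices; any range covering most of the probability mass behaves the same way.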
A discrete (categorical) random variable x ∈ X, where X is a finite set of K values, can be modelled by the categorical distribution (with x represented in 1-of-K coding, so x_k ∈ {0, 1} and \sum_k x_k = 1):

    p(x|θ) = \prod_{k=1}^{K} \theta_k^{x_k} = Cat(x|θ),    (2)

where 0 ≤ θ_k ≤ 1 and \sum_k \theta_k = 1.
For X = {0, 1} we get a special case of the categorical distribution, the Bernoulli distribution:

    p(x|θ) = \theta^x (1-\theta)^{1-x} = Bern(x|θ).    (3)
2.2 Example 1: Bernoulli distribution
Let us calculate the Fisher information matrix for the Bernoulli distribution (3). First, we need to take the logarithm:

    \ln p(x|\theta) = x \ln\theta + (1-x) \ln(1-\theta).

Differentiating with respect to θ, we get the following Fisher score for the Bernoulli distribution:

    g(\theta, x) = \frac{x}{\theta} - \frac{1-x}{1-\theta} = \frac{x-\theta}{\theta(1-\theta)}.    (8)

The Fisher information matrix (here it is a scalar) for the Bernoulli distribution is the expectation of the squared score. Since E_x[(x-\theta)^2] = \theta(1-\theta), it follows that

    F = E_x[g(\theta, x)^2] = \frac{\theta(1-\theta)}{\theta^2(1-\theta)^2} = \frac{1}{\theta(1-\theta)}.
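Because x takes only two values, the expectation defining F can be computed exactly by summing over {0, 1}. A minimal sketch (function names are mine) confirming F = 1/(θ(1−θ)):

```python
def bern_fisher_score(theta, x):
    # Fisher score from Eq. (8): d/dtheta log p(x|theta)
    return (x - theta) / (theta * (1.0 - theta))

def bern_fisher_info(theta):
    # F = E[g^2], expectation taken over x in {0, 1} with P(x=1) = theta
    return sum(p * bern_fisher_score(theta, x) ** 2
               for x, p in ((1, theta), (0, 1.0 - theta)))
```

For example, `bern_fisher_info(0.3)` should equal 1 / (0.3 · 0.7), and the minimum F = 4 occurs at θ = 0.5.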
2.3 Example 2: Categorical distribution
For the categorical distribution (2) the Fisher score has components g_k(\theta, x) = x_k/\theta_k. Now, let us calculate the product of the Fisher score and its transposition:

    g(\theta, x)\, g(\theta, x)^T =
    \begin{bmatrix} \frac{x_1}{\theta_1} \\ \vdots \\ \frac{x_K}{\theta_K} \end{bmatrix}
    \begin{bmatrix} \frac{x_1}{\theta_1} & \cdots & \frac{x_K}{\theta_K} \end{bmatrix}
    =
    \begin{bmatrix}
    \frac{x_1^2}{\theta_1^2} & \frac{x_1 x_2}{\theta_1 \theta_2} & \cdots & \frac{x_1 x_K}{\theta_1 \theta_K} \\
    \vdots & \vdots & \cdots & \vdots \\
    \frac{x_K x_1}{\theta_K \theta_1} & \frac{x_K x_2}{\theta_K \theta_2} & \cdots & \frac{x_K^2}{\theta_K^2}
    \end{bmatrix}
    =
    \begin{bmatrix}
    g_{11} & g_{12} & \cdots & g_{1K} \\
    \vdots & \vdots & \cdots & \vdots \\
    g_{K1} & g_{K2} & \cdots & g_{KK}
    \end{bmatrix}.    (13)

Taking the expectation, the 1-of-K coding gives E_x[x_i x_j] = \theta_i for i = j and 0 otherwise, so all off-diagonal terms vanish. Finally, we get:

    F = \mathrm{diag}\Big\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \Big\}.    (16)
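Since x ranges over only K one-hot vectors, the expectation E_x[g g^T] can again be computed exactly by enumeration. A sketch (the helper name `cat_fisher_info` is mine) confirming Eq. (16):

```python
def cat_fisher_info(theta):
    # F = E[g g^T] with score components g_k = x_k / theta_k,
    # expectation taken over the K one-hot outcomes of x.
    K = len(theta)
    F = [[0.0] * K for _ in range(K)]
    for k, p in enumerate(theta):
        # Outcome k: x is one-hot at position k, so g_i = [i == k] / theta_i.
        g = [(1.0 if i == k else 0.0) / theta[i] for i in range(K)]
        for i in range(K):
            for j in range(K):
                F[i][j] += p * g[i] * g[j]
    return F
```

For `theta = [0.2, 0.5, 0.3]` the result is diagonal with entries 1/0.2, 1/0.5, 1/0.3, matching diag{1/θ_1, …, 1/θ_K}.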
2.4 Example 3: Normal distribution
For the normal distribution (1), differentiating \ln p(x|\theta) with respect to \mu and \sigma^2 gives the Fisher score

    g(\theta, x) = \begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix}.

Now, let us calculate the product of the Fisher score and its transposition:

    \begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix}
    \begin{bmatrix} \frac{1}{\sigma^2}(x-\mu) & -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{bmatrix}
    =
    \begin{bmatrix}
    \frac{1}{\sigma^4}(x-\mu)^2 & -\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 \\
    -\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 & \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}(x-\mu)^2 + \frac{1}{4\sigma^8}(x-\mu)^4
    \end{bmatrix}    (21)
    \equiv
    \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{bmatrix},

where g_{12} = g_{21}.
In order to calculate the Fisher information matrix we need to determine the expected value of each g_{ij}. Hence,² for g_{11}:

    E_x[g_{11}] = E_x\Big[ \frac{1}{\sigma^4}(x-\mu)^2 \Big]
               = \frac{1}{\sigma^4}\big( E_x[x^2] - 2\mu^2 + \mu^2 \big)
               = \frac{1}{\sigma^4}\big( \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \big)
               = \frac{1}{\sigma^2},    (22)
and for g_{12}:

    E_x[g_{12}] = E_x\Big[ -\frac{1}{2\sigma^4}(x-\mu) + \frac{1}{2\sigma^6}(x-\mu)^3 \Big]
               = -\frac{1}{2\sigma^4}\big( E_x[x] - \mu \big) + \frac{1}{2\sigma^6} E_x\big[ x^3 - 3x^2\mu + 3x\mu^2 - \mu^3 \big]
               = \frac{1}{2\sigma^6}\big( E_x[x^3] - 3\mu E_x[x^2] + 3\mu^2 E_x[x] - \mu^3 \big)
               = \frac{1}{2\sigma^6}\big( \mu^3 + 3\mu\sigma^2 - 3\mu(\mu^2 + \sigma^2) + 3\mu^3 - \mu^3 \big)
               = 0,    (23)
and for g_{22}:

    E_x[g_{22}] = E_x\Big[ \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}(x-\mu)^2 + \frac{1}{4\sigma^8}(x-\mu)^4 \Big]
               = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6} E_x\big[ x^2 - 2x\mu + \mu^2 \big] + \frac{1}{4\sigma^8} E_x\big[ x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4 \big]
               = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\big( E_x[x^2] - 2\mu E_x[x] + \mu^2 \big) + \frac{1}{4\sigma^8}\big( E_x[x^4] - 4\mu E_x[x^3] + 6\mu^2 E_x[x^2] - 4\mu^3 E_x[x] + \mu^4 \big)
               = \frac{1}{4\sigma^4} - \frac{1}{2\sigma^6}\sigma^2 + \frac{1}{4\sigma^8} \cdot 3\sigma^4
               = \frac{1}{2\sigma^4}.    (24)
Finally, we get:

    F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}.    (25)
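The derivation of (22)–(24) works entirely through raw moments, so it can be replayed numerically: plug the raw moments of N(µ, σ²) into the expanded expectations and check that the matrix (25) comes out. A sketch (the function name is mine; `s2` stands for σ²):

```python
def normal_fisher_info(mu, s2):
    # Raw moments of N(mu, s2), as in Section 3.
    m1 = mu
    m2 = mu**2 + s2
    m3 = mu**3 + 3 * mu * s2
    m4 = mu**4 + 6 * mu**2 * s2 + 3 * s2**2
    # Central moments E[(x-mu)^k] expanded in raw moments, mirroring Eqs. (22)-(24).
    c2 = m2 - 2 * mu * m1 + mu**2
    c3 = m3 - 3 * mu * m2 + 3 * mu**2 * m1 - mu**3
    c4 = m4 - 4 * mu * m3 + 6 * mu**2 * m2 - 4 * mu**3 * m1 + mu**4
    g11 = c2 / s2**2                                        # Eq. (22)
    g12 = -(m1 - mu) / (2 * s2**2) + c3 / (2 * s2**3)       # Eq. (23)
    g22 = 1 / (4 * s2**2) - c2 / (2 * s2**3) + c4 / (4 * s2**4)  # Eq. (24)
    return [[g11, g12], [g12, g22]]
```

For any µ and σ² the result should be [[1/σ², 0], [0, 1/(2σ⁴)]]; e.g. µ = 1, σ² = 2 gives [[0.5, 0], [0, 0.125]].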
² See Section 3 for the raw moments of the univariate normal distribution.
2.5 Summary
The Fisher information matrices for the given distributions:
• Bernoulli distribution: F = \frac{1}{\theta(1-\theta)},
• Categorical distribution: F = \mathrm{diag}\Big\{ \frac{1}{\theta_1}, \ldots, \frac{1}{\theta_K} \Big\},
• Normal distribution: F = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}.
3 Raw moments of univariate normal distribution

    k    E_x[x^k]
    1    \mu
    2    \mu^2 + \sigma^2
    3    \mu^3 + 3\mu\sigma^2
    4    \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4
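These closed-form moments can be sanity-checked by simulation. The sketch below (sample size, seed, and tolerances are arbitrary choices of mine) compares seeded Monte Carlo estimates against the formulas for k = 3 and k = 4:

```python
import random

# Seeded Monte Carlo estimates of the third and fourth raw moments of N(mu, sigma^2).
random.seed(0)
mu, sigma = 0.5, 1.5
n = 200_000
xs = [random.gauss(mu, sigma) for _ in range(n)]
m3_hat = sum(x**3 for x in xs) / n
m4_hat = sum(x**4 for x in xs) / n

# Closed-form values from the table above.
m3 = mu**3 + 3 * mu * sigma**2
m4 = mu**4 + 6 * mu**2 * sigma**2 + 3 * sigma**4
```

With 200,000 samples the estimates typically land within a fraction of a percent of m3 = 3.5 and m4 = 18.625 for these parameters.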