
Fisher information matrix for Gaussian and categorical distributions
Jakub M. Tomczak
November 28, 2012

1 Notations
Let x be a random variable and consider a parametric distribution of x with parameters θ, p(x|θ). A
continuous random variable x ∈ R can be modelled by the normal (Gaussian) distribution:

    p(x|θ) = (1/√(2πσ²)) exp{−(x − µ)²/(2σ²)}
           = N(x|µ, σ²),                                          (1)

where θ = [µ, σ²]ᵀ.
A discrete (categorical) variable x ∈ X, where X is a finite set of K values, can be modelled by the
categorical distribution:¹

    p(x|θ) = ∏_{k=1}^{K} θ_k^{x_k}
           = Cat(x|θ),                                            (2)

where 0 ≤ θ_k ≤ 1 and ∑_k θ_k = 1.
For X = {0, 1} we get a special case of the categorical distribution, the Bernoulli distribution:

    p(x|θ) = θ^x (1 − θ)^(1−x)
           = Bern(x|θ).                                           (3)

2 Fisher information matrix


2.1 Definition
The Fisher score is determined as follows [1]:

g(θ, x) = ∇θ ln p(x|θ). (4)

The Fisher information matrix is defined as follows [1]:

    F = E_x[g(θ, x) g(θ, x)ᵀ].                                    (5)
¹ We use the 1-of-K encoding [1].
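Definition (5) suggests a direct Monte Carlo estimator: draw samples from p(x|θ), evaluate the Fisher score at each sample, and average the outer products. A minimal NumPy sketch (the concrete score used for illustration, (x − µ)/σ² for the mean of a Gaussian, anticipates Example 3 below; the parameter values are assumptions for the demo):

```python
import numpy as np

def empirical_fisher(scores):
    """Monte Carlo estimate of Eq. (5): `scores` holds one row g(theta, x_i) per sample."""
    n = scores.shape[0]
    return scores.T @ scores / n

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=500_000)
g = ((x - mu) / sigma**2).reshape(-1, 1)   # score w.r.t. mu only
print(empirical_fisher(g))                  # close to [[1 / sigma^2]] = [[0.25]]
```

With enough samples the estimate converges to the analytic Fisher information; the same helper is reused conceptually in the per-distribution checks below.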

2.2 Example 1: Bernoulli distribution
Let us calculate the Fisher information matrix for the Bernoulli distribution (3). First, we need to take the logarithm:

ln Bern(x|θ) = x ln θ + (1 − x) ln(1 − θ). (6)


Second, we need to calculate the derivative:
    d/dθ ln Bern(x|θ) = x/θ − (1 − x)/(1 − θ)
                      = (x − θ)/(θ(1 − θ)).                       (7)

Hence, we get the following Fisher score for the Bernoulli distribution:
    g(θ, x) = (x − θ)/(θ(1 − θ)).                                 (8)

The Fisher information matrix (here it is a scalar) for the Bernoulli distribution is as follows:

    F = E_x[g(θ, x) g(θ, x)]
      = E_x[(x − θ)²/(θ(1 − θ))²]
      = (1/(θ(1 − θ))²) E_x[x² − 2xθ + θ²]
      = (1/(θ(1 − θ))²) (E_x[x²] − 2θ E_x[x] + θ²)
      = (1/(θ(1 − θ))²) (θ − 2θ² + θ²)
      = (1/(θ(1 − θ))²) θ(1 − θ)
      = 1/(θ(1 − θ)),                                             (9)

where we used that x ∈ {0, 1} implies E_x[x²] = E_x[x] = θ.
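Result (9) is easy to verify numerically. The sketch below (NumPy; the value θ = 0.3 and the sample count are assumptions for the demo) averages the squared score (8) over Bernoulli draws:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000).astype(float)

# Fisher score of Eq. (8), evaluated at every sample.
g = (x - theta) / (theta * (1 - theta))

# Eq. (5) for a scalar parameter: F is the mean of g^2.
F_hat = np.mean(g * g)
print(F_hat, 1 / (theta * (1 - theta)))   # both close to 4.76
```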

2.3 Example 2: Categorical distribution


Let us calculate the Fisher information matrix for the categorical distribution (2). First, we need to take the logarithm:
    ln Cat(x|θ) = ∑_{k=1}^{K} x_k ln θ_k.                         (10)

Second, we need to calculate partial derivatives:


    ∂/∂θ_k ln Cat(x|θ) = x_k/θ_k.                                 (11)
Hence, we get the following Fisher score for the categorical distribution:
    g(θ, x) = [x_1/θ_1, …, x_K/θ_K]ᵀ.                             (12)

Now, let us calculate the product of the Fisher score and its transpose:

    g(θ, x) g(θ, x)ᵀ = [ x_1²/θ_1²         x_1x_2/(θ_1θ_2)   ···   x_1x_K/(θ_1θ_K) ]
                       [ ⋮                 ⋮                 ⋱     ⋮               ]
                       [ x_Kx_1/(θ_Kθ_1)   x_Kx_2/(θ_Kθ_2)   ···   x_K²/θ_K²       ]

                     = [ g_11   g_12   ···   g_1K ]
                       [ ⋮      ⋮      ⋱     ⋮    ]
                       [ g_K1   g_K2   ···   g_KK ].              (13)

Therefore, for the diagonal entries g_kk we have:

    E_x[g_kk] = E_x[x_k²/θ_k²]
              = (1/θ_k²) E_x[x_k²]
              = 1/θ_k,                                            (14)

since x_k ∈ {0, 1} implies E_x[x_k²] = E_x[x_k] = θ_k,
and for the off-diagonal entries g_ij, i ≠ j:

    E_x[g_ij] = E_x[x_ix_j/(θ_iθ_j)]
              = (1/(θ_iθ_j)) E_x[x_ix_j]
              = 0,                                                (15)

because in the 1-of-K encoding at most one component of x is non-zero, so x_ix_j = 0 whenever i ≠ j.

Finally, we get:

    F = diag(1/θ_1, …, 1/θ_K).                                    (16)
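Equation (16) can be cross-checked numerically. The sketch below (NumPy; the probability vector θ = (0.2, 0.3, 0.5) is an assumption for the demo) draws 1-of-K samples and averages the outer products of the score (12):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.3, 0.5])
n = 1_000_000
x = rng.multinomial(1, theta, size=n).astype(float)   # rows are 1-of-K vectors

g = x / theta            # Fisher scores, Eq. (12), one row per sample
F_hat = g.T @ g / n      # Eq. (5)
print(np.round(F_hat, 2))   # close to diag(1/0.2, 1/0.3, 1/0.5)
```

Note that the off-diagonal entries are exactly zero here, sample by sample, because each row of x has a single non-zero component.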

2.4 Example 3: Normal distribution


Let us calculate the Fisher information matrix for the univariate normal distribution (1). First, we need to take the
logarithm:

    ln N(x|µ, σ²) = −(1/2) ln 2π − (1/2) ln σ² − (1/(2σ²))(x − µ)².    (17)
Second, we need to calculate the partial derivatives:

    ∂/∂µ  ln N(x|µ, σ²) = (1/σ²)(x − µ),                          (18)
    ∂/∂σ² ln N(x|µ, σ²) = −1/(2σ²) + (1/(2σ⁴))(x − µ)².           (19)
Hence, we get the following Fisher score for the normal distribution:

    g(θ, x) = [ ∂/∂µ ln N(x|µ, σ²),  ∂/∂σ² ln N(x|µ, σ²) ]ᵀ
            = [ (1/σ²)(x − µ),  −1/(2σ²) + (1/(2σ⁴))(x − µ)² ]ᵀ.  (20)

Now, let us calculate the product of the Fisher score and its transpose:

    g(θ, x) g(θ, x)ᵀ
      = [ (1/σ⁴)(x − µ)²                           −(1/(2σ⁴))(x − µ) + (1/(2σ⁶))(x − µ)³             ]
        [ −(1/(2σ⁴))(x − µ) + (1/(2σ⁶))(x − µ)³    1/(4σ⁴) − (1/(2σ⁶))(x − µ)² + (1/(4σ⁸))(x − µ)⁴   ]

      = [ g_11   g_12 ]
        [ g_21   g_22 ],                                          (21)

where g_12 = g_21.
In order to calculate the Fisher information matrix we need to determine the expected value of
all g_ij. Hence,² for g_11:

    E_x[g_11] = E_x[(1/σ⁴)(x − µ)²]
              = (1/σ⁴)(E_x[x²] − 2µ E_x[x] + µ²)
              = (1/σ⁴)(µ² + σ² − 2µ² + µ²)
              = 1/σ²,                                             (22)
and for g_12:

    E_x[g_12] = E_x[−(1/(2σ⁴))(x − µ) + (1/(2σ⁶))(x − µ)³]
              = −(1/(2σ⁴))(E_x[x] − µ) + (1/(2σ⁶)) E_x[x³ − 3x²µ + 3xµ² − µ³]
              = (1/(2σ⁶))(E_x[x³] − 3µ E_x[x²] + 3µ² E_x[x] − µ³)
              = (1/(2σ⁶))(µ³ + 3µσ² − 3µ(µ² + σ²) + 3µ³ − µ³)
              = 0,                                                (23)
and for g_22:

    E_x[g_22] = E_x[1/(4σ⁴) − (1/(2σ⁶))(x − µ)² + (1/(4σ⁸))(x − µ)⁴]
              = 1/(4σ⁴) − (1/(2σ⁶)) E_x[x² − 2xµ + µ²] + (1/(4σ⁸)) E_x[x⁴ − 4x³µ + 6x²µ² − 4xµ³ + µ⁴]
              = 1/(4σ⁴) − (1/(2σ⁶))(E_x[x²] − 2µ E_x[x] + µ²) + (1/(4σ⁸))(E_x[x⁴] − 4µ E_x[x³] + 6µ² E_x[x²] − 4µ³ E_x[x] + µ⁴)
              = 1/(4σ⁴) − (1/(2σ⁶)) σ² + (1/(4σ⁸)) 3σ⁴
              = 1/(2σ⁴).                                          (24)

Finally, we get:

    F = [ 1/σ²   0       ]
        [ 0      1/(2σ⁴) ].                                       (25)

² See Section 3 for the raw moments of the univariate normal distribution.
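As a sanity check on Eq. (25), the sketch below (NumPy; the values µ = 1 and σ² = 4 are assumptions for the demo) averages the outer products of the score (20) over Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s2 = 1.0, 4.0
n = 1_000_000
x = rng.normal(mu, np.sqrt(s2), size=n)

# Fisher score of Eq. (20): one row (d/dmu, d/dsigma^2) per sample.
g = np.stack([(x - mu) / s2,
              -1.0 / (2 * s2) + (x - mu) ** 2 / (2 * s2 ** 2)], axis=1)

F_hat = g.T @ g / n      # Eq. (5)
print(np.round(F_hat, 3))   # close to [[1/s2, 0], [0, 1/(2*s2**2)]]
```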

2.5 Summary
The Fisher information matrices for the distributions considered above:

• Bernoulli distribution:

    F = 1/(θ(1 − θ)),

• Categorical distribution:

    F = diag(1/θ_1, …, 1/θ_K),

• Normal distribution:

    F = [ 1/σ²   0       ]
        [ 0      1/(2σ⁴) ].

3 Appendix: Raw moments

Table 1: The raw moments of the univariate normal distribution.

    Order   Expression   Raw moment
    1       E_x[x]       µ
    2       E_x[x²]      µ² + σ²
    3       E_x[x³]      µ³ + 3µσ²
    4       E_x[x⁴]      µ⁴ + 6µ²σ² + 3σ⁴
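The entries of Table 1 can be verified numerically. The sketch below (NumPy; the values µ = 1, σ = 2 and the sample count are assumptions for the demo) compares sample moments against the closed forms:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
s2 = sigma ** 2
x = rng.normal(mu, sigma, size=5_000_000)

# Closed-form raw moments from Table 1, orders 1..4.
expected = [mu,
            mu ** 2 + s2,
            mu ** 3 + 3 * mu * s2,
            mu ** 4 + 6 * mu ** 2 * s2 + 3 * s2 ** 2]

for k, m in enumerate(expected, start=1):
    print(k, np.mean(x ** k), m)   # sample moment vs closed form
```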

References
[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
