Solution Manual For PRML

SOLUTION MANUAL FOR
PATTERN RECOGNITION AND MACHINE LEARNING

EDITED BY
ZHENGQI GAO
the State Key Lab. of ASIC and System
School of Microelectronics
Fudan University
NOV. 2017
0.1 Introduction
Problem 1.1 Solution
We set the derivative of the error function $E$ with respect to the vector $\mathbf{w}$ to zero (i.e. $\partial E / \partial \mathbf{w} = 0$); the solution $\mathbf{w} = \{w_i\}$ of this equation minimizes $E$. To solve it, we differentiate $E$ with respect to every $w_i$ and set each derivative to 0. Based on (1.1) and (1.2) we obtain:
\[
\begin{aligned}
& \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}\, x_n^i = 0 \\
\Rightarrow\ & \sum_{n=1}^{N} y(x_n, \mathbf{w})\, x_n^i = \sum_{n=1}^{N} x_n^i t_n \\
\Rightarrow\ & \sum_{n=1}^{N} \Big( \sum_{j=0}^{M} w_j x_n^j \Big) x_n^i = \sum_{n=1}^{N} x_n^i t_n \\
\Rightarrow\ & \sum_{n=1}^{N} \sum_{j=0}^{M} w_j x_n^{(j+i)} = \sum_{n=1}^{N} x_n^i t_n \\
\Rightarrow\ & \sum_{j=0}^{M} \sum_{n=1}^{N} x_n^{(j+i)} w_j = \sum_{n=1}^{N} x_n^i t_n
\end{aligned}
\]
If we denote $A_{ij} = \sum_{n=1}^{N} x_n^{i+j}$ and $T_i = \sum_{n=1}^{N} x_n^i t_n$, the equation above can be written exactly as (1.122). Therefore the problem is solved.
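The linear system $\sum_j A_{ij} w_j = T_i$ can be checked on a small example. The following is a minimal sketch, not part of the original solution: the helper names `normal_equations` and `solve` and the toy data are assumptions made for illustration. It builds $A$ and $T$ from noise-free quadratic data and solves the system with exact rational arithmetic, so the recovered coefficients should match the generating polynomial.

```python
from fractions import Fraction

def normal_equations(xs, ts, M):
    """Build A_ij = sum_n x_n^(i+j) and T_i = sum_n x_n^i t_n for i, j = 0..M."""
    A = [[sum(Fraction(x) ** (i + j) for x in xs) for j in range(M + 1)]
         for i in range(M + 1)]
    T = [sum(Fraction(x) ** i * Fraction(t) for x, t in zip(xs, ts))
         for i in range(M + 1)]
    return A, T

def solve(A, T):
    """Solve A w = T by Gauss-Jordan elimination (exact, A assumed nonsingular)."""
    n = len(T)
    aug = [row[:] + [rhs] for row, rhs in zip(A, T)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

# Hypothetical toy data: targets generated by t = 1 + 2x + 3x^2 without noise.
xs = [0, 1, 2, 3, 4]
ts = [1 + 2 * x + 3 * x ** 2 for x in xs]
A, T = normal_equations(xs, ts, M=2)
w = solve(A, T)
print(w)
```

Because the data lie exactly on a quadratic and $M = 2$, the minimizer of $E$ reproduces the generating coefficients.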
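Problem 1.2 Solution
This problem is similar to Prob.1.1; the only difference is the last term on the right side of (1.4), the penalty term. So we proceed as in Prob.1.1:
\[
\begin{aligned}
& \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}\, x_n^i + \lambda w_i = 0 \\
\Rightarrow\ & \sum_{j=0}^{M} \sum_{n=1}^{N} x_n^{(j+i)} w_j + \lambda w_i = \sum_{n=1}^{N} x_n^i t_n \\
\Rightarrow\ & \sum_{j=0}^{M} \Big\{ \sum_{n=1}^{N} x_n^{(j+i)} + \delta_{ji} \lambda \Big\} w_j = \sum_{n=1}^{N} x_n^i t_n
\end{aligned}
\]
where
\[
\delta_{ji} = \begin{cases} 0 & j \neq i \\ 1 & j = i \end{cases}
\]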
Problem 1.3 Solution
This problem can be solved by Bayes' theorem. The probability of selecting an apple, $P(a)$, is:
\[
P(a) = P(a|r)P(r) + P(a|b)P(b) + P(a|g)P(g) = \frac{3}{10} \times 0.2 + \frac{1}{2} \times 0.2 + \frac{3}{10} \times 0.6 = 0.34
\]
Based on Bayes' theorem, the probability that a selected orange came from the green box, $P(g|o)$, is:
\[
P(g|o) = \frac{P(o|g)P(g)}{P(o)}
\]
We first calculate the probability of selecting an orange, $P(o)$:
\[
P(o) = P(o|r)P(r) + P(o|b)P(b) + P(o|g)P(g) = \frac{4}{10} \times 0.2 + \frac{1}{2} \times 0.2 + \frac{3}{10} \times 0.6 = 0.36
\]
Therefore we get:
\[
P(g|o) = \frac{P(o|g)P(g)}{P(o)} = \frac{\frac{3}{10} \times 0.6}{0.36} = 0.5
\]
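The arithmetic above can be reproduced with exact fractions. This is a small sketch using only the priors and conditional probabilities quoted in the solution; the function names `marginal` and `posterior` are illustrative choices.

```python
from fractions import Fraction as F

# Priors over boxes and per-box fruit probabilities, copied from the solution above.
prior = {"r": F(2, 10), "b": F(2, 10), "g": F(6, 10)}
p_fruit_given_box = {
    "apple":  {"r": F(3, 10), "b": F(1, 2), "g": F(3, 10)},
    "orange": {"r": F(4, 10), "b": F(1, 2), "g": F(3, 10)},
}

def marginal(fruit):
    """Sum rule: P(fruit) = sum over boxes of P(fruit | box) P(box)."""
    return sum(p_fruit_given_box[fruit][box] * prior[box] for box in prior)

def posterior(box, fruit):
    """Bayes' theorem: P(box | fruit) = P(fruit | box) P(box) / P(fruit)."""
    return p_fruit_given_box[fruit][box] * prior[box] / marginal(fruit)

p_apple = marginal("apple")
p_orange = marginal("orange")
p_green_given_orange = posterior("g", "orange")
print(p_apple, p_orange, p_green_given_orange)
```

The printed fractions correspond to 0.34, 0.36 and 0.5.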
Problem 1.4 Solution
This problem requires some calculus, in particular the chain rule. We calculate the derivative of $p_y(y)$ with respect to $y$, according to (1.27):
\[
\frac{d p_y(y)}{dy} = \frac{d \big( p_x(g(y))\, |g'(y)| \big)}{dy} = \frac{d p_x(g(y))}{dy}\, |g'(y)| + p_x(g(y))\, \frac{d |g'(y)|}{dy} \qquad (*)
\]
The first term in the above equation can be further simplified:
\[
\frac{d p_x(g(y))}{dy}\, |g'(y)| = \frac{d p_x(g(y))}{d g(y)}\, \frac{d g(y)}{dy}\, |g'(y)| \qquad (**)
\]
If $\hat{x}$ is the maximum of the density over $x$, we have:
\[
\left. \frac{d p_x(x)}{dx} \right|_{\hat{x}} = 0
\]
Therefore, when $y = \hat{y}$ such that $\hat{x} = g(\hat{y})$, the first factor on the right side of $(**)$ is 0, so the first term in $(*)$ equals 0. However, because of the second term in $(*)$, the derivative may not equal 0. When a linear transformation is applied (e.g. $x = ay + b$), $g'(y)$ is constant and the second term in $(*)$ vanishes. A simple example is:
\[
p_x(x) = 2x, \quad x \in [0, 1] \quad \Rightarrow \quad \hat{x} = 1
\]
Given that:
\[
x = \sin(y)
\]
we get $p_y(y) = 2 \sin(y)\, |\cos(y)|$, $y \in [0, \frac{\pi}{2}]$, which simplifies to:
\[
p_y(y) = \sin(2y), \quad y \in \Big[0, \frac{\pi}{2}\Big] \quad \Rightarrow \quad \hat{y} = \frac{\pi}{4}
\]
However, it is clear that:
\[
\hat{x} \neq \sin(\hat{y})
\]
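The counterexample can also be checked numerically. The sketch below evaluates the transformed density on a grid (the grid resolution is an arbitrary assumption) and locates its maximum; it then compares $\sin(\hat{y})$ with the mode $\hat{x} = 1$ of $p_x$.

```python
import math

def p_x(x):
    """Density p_x(x) = 2x on [0, 1]; its mode is x = 1."""
    return 2.0 * x

def p_y(y):
    """Change of variables x = g(y) = sin(y): p_y(y) = p_x(g(y)) |g'(y)|."""
    return p_x(math.sin(y)) * abs(math.cos(y))

n = 100_000                                   # grid resolution (arbitrary choice)
ys = [i * (math.pi / 2) / n for i in range(n + 1)]
y_hat = max(ys, key=p_y)                      # numerical mode of p_y
x_hat = 1.0                                   # mode of p_x

print(y_hat, math.pi / 4)                     # the two values should nearly agree
print(math.sin(y_hat), x_hat)                 # these differ: the mode does not transform like x = g(y)
```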
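Problem 1.5 Solution
This problem uses the linearity of expectation:
\[
\begin{aligned}
\mathrm{var}[f] &= \mathbb{E}\big[ (f(x) - \mathbb{E}[f(x)])^2 \big] \\
&= \mathbb{E}\big[ f(x)^2 - 2 f(x)\, \mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2 \big] \\
&= \mathbb{E}[f(x)^2] - 2\, \mathbb{E}[f(x)]^2 + \mathbb{E}[f(x)]^2 \\
\Rightarrow\ \mathrm{var}[f] &= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
\end{aligned}
\]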
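Problem 1.6 Solution
Based on (1.41), we only need to prove that when $x$ and $y$ are independent, $\mathbb{E}_{x,y}[xy] = \mathbb{E}[x]\,\mathbb{E}[y]$. Because $x$ and $y$ are independent, we have:
\[
p(x, y) = p_x(x)\, p_y(y)
\]
Therefore:
\[
\begin{aligned}
\iint x y\, p(x, y)\, dx\, dy &= \iint x y\, p_x(x)\, p_y(y)\, dx\, dy \\
&= \Big( \int x\, p_x(x)\, dx \Big) \Big( \int y\, p_y(y)\, dy \Big) \\
\Rightarrow\ \mathbb{E}_{x,y}[xy] &= \mathbb{E}[x]\, \mathbb{E}[y]
\end{aligned}
\]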
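Problem 1.7 Solution
This problem uses integration by substitution (a change to polar coordinates).
\[
\begin{aligned}
I^2 &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \exp\Big( -\frac{1}{2\sigma^2} x^2 - \frac{1}{2\sigma^2} y^2 \Big)\, dx\, dy \\
&= \int_{0}^{2\pi} \int_{0}^{+\infty} \exp\Big( -\frac{1}{2\sigma^2} r^2 \Big)\, r\, dr\, d\theta
\end{aligned}
\]
Here we use:
\[
x = r \cos\theta, \quad y = r \sin\theta
\]
Based on the fact:
\[
\int_{0}^{+\infty} \exp\Big( -\frac{r^2}{2\sigma^2} \Big)\, r\, dr = \left. -\sigma^2 \exp\Big( -\frac{r^2}{2\sigma^2} \Big) \right|_{0}^{+\infty} = -\sigma^2 (0 - 1) = \sigma^2
\]
we can solve for $I$:
\[
I^2 = \int_{0}^{2\pi} \sigma^2\, d\theta = 2\pi\sigma^2 \quad \Rightarrow \quad I = \sqrt{2\pi}\, \sigma
\]
Next, we show that the Gaussian distribution $\mathcal{N}(x|\mu, \sigma^2)$ is normalized, i.e. $\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx = 1$:
\[
\begin{aligned}
\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx &= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \Big\}\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy \qquad (y = x - \mu) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy \\
&= \frac{I}{\sqrt{2\pi\sigma^2}} = 1
\end{aligned}
\]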
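Problem 1.8 Solution
The first question needs the result of Prob.1.7:
\[
\begin{aligned}
\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, x\, dx &= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \Big\}\, x\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\} (y + \mu)\, dy \qquad (y = x - \mu) \\
&= \mu \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy + \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, y\, dy \\
&= \mu + 0 = \mu
\end{aligned}
\]
The second integral vanishes because its integrand is an odd function of $y$. For the second question, a hint is already given in the problem description. Given the product rule:
\[
\frac{d(fg)}{dx} = f \frac{dg}{dx} + g \frac{df}{dx}
\]
we differentiate both sides of (1.127) with respect to $\sigma^2$ and obtain:
\[
\int_{-\infty}^{+\infty} \Big( -\frac{1}{2\sigma^2} + \frac{(x - \mu)^2}{2\sigma^4} \Big)\, \mathcal{N}(x|\mu, \sigma^2)\, dx = 0
\]
Since $\sigma \neq 0$, we get:
\[
\int_{-\infty}^{+\infty} (x - \mu)^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \sigma^2 \int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx = \sigma^2
\]
This equation proves (1.51), because by definition:
\[
\mathrm{var}[x] = \int_{-\infty}^{+\infty} (x - \mathbb{E}[x])^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx
\]
where $\mathbb{E}[x] = \mu$ has already been proved. Therefore:
\[
\mathrm{var}[x] = \sigma^2
\]
Finally,
\[
\mathbb{E}[x^2] = \mathrm{var}[x] + \mathbb{E}[x]^2 = \sigma^2 + \mu^2
\]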
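The normalization and the first two moments can be confirmed by numerical integration. The sketch below uses a plain midpoint rule; the parameter values, integration window and step count are arbitrary assumptions, and `raw_moment` is an illustrative helper name.

```python
import math

def gauss(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def raw_moment(k, mu, sigma2, half_width=12.0, n=20_000):
    """Midpoint-rule estimate of the integral of x^k N(x | mu, sigma2) dx."""
    sigma = math.sqrt(sigma2)
    lo = mu - half_width * sigma          # integrate over mu +/- half_width * sigma
    h = 2 * half_width * sigma / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += x ** k * gauss(x, mu, sigma2)
    return total * h

mu, sigma2 = 1.5, 0.49                    # hypothetical parameter values
norm = raw_moment(0, mu, sigma2)          # should be close to 1
mean = raw_moment(1, mu, sigma2)          # should be close to mu
second = raw_moment(2, mu, sigma2)        # should be close to mu^2 + sigma^2
print(norm, mean, second)
```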
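Problem 1.9 Solution
Here we only focus on (1.52), because (1.52) is the general form of (1.46). By definition, the maximum of a distribution is known as its mode. Based on (1.52), we obtain:
\[
\begin{aligned}
\frac{\partial \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \mathbf{x}} &= -\frac{1}{2} \big[ \boldsymbol{\Sigma}^{-1} + (\boldsymbol{\Sigma}^{-1})^T \big] (\mathbf{x} - \boldsymbol{\mu})\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \\
&= -\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})
\end{aligned}
\]
where we use:
\[
\frac{\partial\, \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\, \mathbf{x} \quad \text{and} \quad (\boldsymbol{\Sigma}^{-1})^T = \boldsymbol{\Sigma}^{-1}
\]
Therefore,
\[
\frac{\partial \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \mathbf{x}} = 0 \quad \text{only when } \mathbf{x} = \boldsymbol{\mu}
\]
Note: Strictly speaking, one should also examine the Hessian matrix to prove that this point is a maximum. However, the first derivative has only one root, and the problem description implies that a mode exists, so this point must be the maximum.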
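Problem 1.10 Solution
We solve this problem based on the definitions of expectation, variance and independence.
\[
\begin{aligned}
\mathbb{E}[x + z] &= \iint (x + z)\, p(x, z)\, dx\, dz \\
&= \iint (x + z)\, p(x)\, p(z)\, dx\, dz \\
&= \iint x\, p(x)\, p(z)\, dx\, dz + \iint z\, p(x)\, p(z)\, dx\, dz \\
&= \Big( \int p(z)\, dz \Big) \int x\, p(x)\, dx + \Big( \int p(x)\, dx \Big) \int z\, p(z)\, dz \\
&= \int x\, p(x)\, dx + \int z\, p(z)\, dz \\
&= \mathbb{E}[x] + \mathbb{E}[z]
\end{aligned}
\]
\[
\begin{aligned}
\mathrm{var}[x + z] &= \iint (x + z - \mathbb{E}[x + z])^2\, p(x, z)\, dx\, dz \\
&= \iint \big\{ (x + z)^2 - 2 (x + z)\, \mathbb{E}[x + z] + \mathbb{E}^2[x + z] \big\}\, p(x, z)\, dx\, dz \\
&= \iint (x + z)^2\, p(x, z)\, dx\, dz - 2\, \mathbb{E}[x + z] \iint (x + z)\, p(x, z)\, dx\, dz + \mathbb{E}^2[x + z] \\
&= \iint (x + z)^2\, p(x, z)\, dx\, dz - \mathbb{E}^2[x + z] \\
&= \iint (x^2 + 2xz + z^2)\, p(x)\, p(z)\, dx\, dz - \mathbb{E}^2[x + z] \\
&= \Big( \int p(z)\, dz \Big) \int x^2 p(x)\, dx + 2 \iint xz\, p(x)\, p(z)\, dx\, dz + \Big( \int p(x)\, dx \Big) \int z^2 p(z)\, dz - \mathbb{E}^2[x + z] \\
&= \mathbb{E}[x^2] + \mathbb{E}[z^2] - \mathbb{E}^2[x + z] + 2 \iint xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathbb{E}[x^2] + \mathbb{E}[z^2] - (\mathbb{E}[x] + \mathbb{E}[z])^2 + 2 \iint xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathbb{E}[x^2] - \mathbb{E}^2[x] + \mathbb{E}[z^2] - \mathbb{E}^2[z] - 2\, \mathbb{E}[x]\, \mathbb{E}[z] + 2 \iint xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathrm{var}[x] + \mathrm{var}[z] - 2\, \mathbb{E}[x]\, \mathbb{E}[z] + 2 \Big( \int x\, p(x)\, dx \Big) \Big( \int z\, p(z)\, dz \Big) \\
&= \mathrm{var}[x] + \mathrm{var}[z]
\end{aligned}
\]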
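Problem 1.11 Solution
We use the fact that $\mu_{ML}$ and $\sigma^2_{ML}$ decouple, and first calculate $\mu_{ML}$:
\[
\frac{d\, \ln p(\mathbf{x}|\mu, \sigma^2)}{d\mu} = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu)
\]
We set:
\[
\frac{d\, \ln p(\mathbf{x}|\mu, \sigma^2)}{d\mu} = 0
\]
Therefore:
\[
\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n
\]
And because:
\[
\frac{d\, \ln p(\mathbf{x}|\mu, \sigma^2)}{d\sigma^2} = \frac{1}{2\sigma^4} \Big( \sum_{n=1}^{N} (x_n - \mu)^2 - N\sigma^2 \Big)
\]
we set:
\[
\frac{d\, \ln p(\mathbf{x}|\mu, \sigma^2)}{d\sigma^2} = 0
\]
Therefore:
\[
\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
\]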
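Problem 1.12 Solution
The result for $\mathbb{E}[\mu_{ML}]$ is straightforward, given that the $x_n$ are i.i.d. and follow the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$:
\[
\mathbb{E}[\mu_{ML}] = \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} x_n \Big] = \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big] = \mathbb{E}[x_n] = \mu
\]
For $\mathbb{E}[\sigma^2_{ML}]$, we use (1.56) and what has been given in the problem:
\[
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n^2 - 2 x_n \mu_{ML} + \mu_{ML}^2) \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n^2 \Big] - \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} 2 x_n \mu_{ML} \Big] + \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} \mu_{ML}^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big( \frac{1}{N} \sum_{m=1}^{N} x_m \Big) \Big] + \mathbb{E}[\mu_{ML}^2] \\
&= \mu^2 + \sigma^2 - \frac{2}{N^2}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big( \sum_{m=1}^{N} x_m \Big) \Big] + \mathbb{E}\Big[ \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2}{N^2}\, \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] + \frac{1}{N^2}\, \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2}\, \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2} \big[ N (N\mu^2 + \sigma^2) \big]
\end{aligned}
\]
Therefore we have:
\[
\mathbb{E}[\sigma^2_{ML}] = \Big( \frac{N - 1}{N} \Big) \sigma^2
\]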
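Problem 1.13 Solution
This problem can be solved with the same method used in Prob.1.12:
\[
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2 \Big] \qquad \text{(because here we use } \mu \text{ in place of } \mu_{ML}\text{)} \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n - \mu)^2 \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n^2 - 2 x_n \mu + \mu^2) \Big] \\
&= \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n^2 \Big] - \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} 2 x_n \mu \Big] + \frac{1}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} \mu^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2\mu}{N}\, \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big] + \mu^2 \\
&= \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \\
&= \sigma^2
\end{aligned}
\]
Note: The key difference between Prob.1.12 and Prob.1.13 is whether the mean of the Gaussian distribution is known beforehand (Prob.1.13) or not (Prob.1.12). In other words, the difference is captured by the following equations:
\[
\mathbb{E}[\mu^2] = \mu^2 \qquad \text{(} \mu \text{ is deterministic, so its expectation is itself; the same holds for } \mu^2 \text{)}
\]
\[
\mathbb{E}[\mu_{ML}^2] = \mathbb{E}\Big[ \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big)^2 \Big] = \frac{1}{N^2}\, \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] = \frac{1}{N^2}\, N (N\mu^2 + \sigma^2) = \mu^2 + \frac{\sigma^2}{N}
\]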
Problem 1.14 Solution
This problem is analogous to the fact that any function $f(x)$ can be written as the sum of an odd function and an even function. If we let:
\[
w^S_{ij} = \frac{w_{ij} + w_{ji}}{2} \quad \text{and} \quad w^A_{ij} = \frac{w_{ij} - w_{ji}}{2}
\]
they clearly satisfy the constraints described in the problem, which are:
\[
w_{ij} = w^S_{ij} + w^A_{ij}, \quad w^S_{ij} = w^S_{ji}, \quad w^A_{ij} = -w^A_{ji}
\]
To prove (1.132), we only need to simplify it:
\[
\begin{aligned}
\sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j &= \sum_{i=1}^{D} \sum_{j=1}^{D} (w^S_{ij} + w^A_{ij})\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^S_{ij} x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j
\end{aligned}
\]
Therefore, we only need to prove that the second term equals 0. We use a simple trick: we prove instead that twice the second term equals 0.
\[
\begin{aligned}
2 \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j &= \sum_{i=1}^{D} \sum_{j=1}^{D} (w^A_{ij} + w^A_{ij})\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} (w^A_{ij} - w^A_{ji})\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j - \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ji} x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j - \sum_{j=1}^{D} \sum_{i=1}^{D} w^A_{ji} x_j x_i \\
&= 0
\end{aligned}
\]
Therefore, we can choose the coefficient matrix to be symmetric, as described in the problem. Considering the symmetry, the whole matrix is determined if and only if $w_{ij}$ is given for all $i \leq j$ with $i, j = 1, 2, \ldots, D$. Hence, the number of independent parameters is given by:
\[
D + (D - 1) + \ldots + 1 = \frac{D(D + 1)}{2}
\]
Note: Intuitively, once the upper triangular part (including the diagonal) of a symmetric matrix is given, the whole matrix is determined.
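The decomposition can be checked numerically on a random matrix. In this sketch the dimension, seed and helper name `quad` are arbitrary assumptions; it confirms that the antisymmetric part contributes nothing to the quadratic form and counts the free entries of a symmetric matrix.

```python
import random

random.seed(0)
D = 4                                                   # hypothetical dimension
w = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
x = [random.uniform(-1, 1) for _ in range(D)]

# Symmetric and antisymmetric parts, as defined in the solution above.
w_sym = [[(w[i][j] + w[j][i]) / 2 for j in range(D)] for i in range(D)]
w_anti = [[(w[i][j] - w[j][i]) / 2 for j in range(D)] for i in range(D)]

def quad(m, v):
    """Quadratic form sum_i sum_j m_ij v_i v_j."""
    return sum(m[i][j] * v[i] * v[j] for i in range(len(v)) for j in range(len(v)))

# Entries on or above the diagonal determine a symmetric matrix.
n_free = sum(1 for i in range(D) for j in range(D) if i <= j)

print(quad(w_anti, x))                 # zero up to floating-point rounding
print(quad(w, x), quad(w_sym, x))      # equal up to floating-point rounding
print(n_free, D * (D + 1) // 2)
```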
Problem 1.15 Solution
This problem is a more general form of Prob.1.14, so the same method can be used here: we will find a way to use $w_{i_1 i_2 \ldots i_M}$ to represent $\tilde{w}_{i_1 i_2 \ldots i_M}$. We begin by introducing a mapping function:
\[
F(x_{i_1} x_{i_2} \ldots x_{i_M}) = x_{j_1} x_{j_2} \ldots x_{j_M}
\]
\[
\text{s.t.} \quad \bigcup_{k=1}^{M} x_{i_k} = \bigcup_{k=1}^{M} x_{j_k}, \quad \text{and} \quad j_1 \geq j_2 \geq j_3 \geq \ldots \geq j_M
\]
It is cumbersome to write $F$ in closed mathematical form. In fact, this function does something simple: it rearranges the factors in decreasing order of their subindices. Several examples are given below for $D = 5$, $M = 4$:
\[
\begin{aligned}
F(x_5 x_2 x_3 x_2) &= x_5 x_3 x_2 x_2 \\
F(x_1 x_3 x_3 x_2) &= x_3 x_3 x_2 x_1 \\
F(x_1 x_4 x_2 x_3) &= x_4 x_3 x_2 x_1 \\
F(x_1 x_1 x_5 x_2) &= x_5 x_2 x_1 x_1
\end{aligned}
\]
After introducing $F$, the solution is simple, based on the fact that $F$ does not change the value of the term but only rearranges it.
\[
\sum_{i_1=1}^{D} \sum_{i_2=1}^{D} \ldots \sum_{i_M=1}^{D} w_{i_1 i_2 \ldots i_M}\, x_{i_1} x_{i_2} \ldots x_{i_M} = \sum_{j_1=1}^{D} \sum_{j_2=1}^{j_1} \ldots \sum_{j_M=1}^{j_{M-1}} \tilde{w}_{j_1 j_2 \ldots j_M}\, x_{j_1} x_{j_2} \ldots x_{j_M}
\]
\[
\text{where} \quad \tilde{w}_{j_1 j_2 \ldots j_M} = \sum_{w \in \Omega} w
\]
\[
\Omega = \{ w_{i_1 i_2 \ldots i_M} \mid F(x_{i_1} x_{i_2} \ldots x_{i_M}) = x_{j_1} x_{j_2} \ldots x_{j_M},\ \forall\, x_{i_1} x_{i_2} \ldots x_{i_M} \}
\]
So far, we have proven (1.134). Mathematical induction will be used to prove (1.135), and we begin with the case $D = 1$, i.e. $n(1, M) = n(1, M - 1)$. When $D = 1$, (1.134) degenerates into $\tilde{w} x_1^M$, i.e. it has only one term, whose coefficient is governed by $\tilde{w}$ regardless of the value of $M$. Therefore, we have proven that $n(D, M) = 1$ when $D = 1$. Suppose (1.135) holds for $D$; let's prove that it also holds for $D + 1$, and then (1.135) follows by mathematical induction.
Let's begin with (1.134):
\[
\sum_{i_1=1}^{D+1} \sum_{i_2=1}^{i_1} \ldots \sum_{i_M=1}^{i_{M-1}} \tilde{w}_{i_1 i_2 \ldots i_M}\, x_{i_1} x_{i_2} \ldots x_{i_M} \qquad (*)
\]
We divide $(*)$ into two parts based on the first summation: the first part consists of $i_1 = 1, 2, \ldots, D$ and the second part of $i_1 = D + 1$. After the division, the first part corresponds to $n(D, M)$, and the second part corresponds to $n(D + 1, M - 1)$. Therefore we obtain:
\[
n(D + 1, M) = n(D, M) + n(D + 1, M - 1) \qquad (**)
\]
Given that (1.135) holds for $D$:
\[
n(D, M) = \sum_{i=1}^{D} n(i, M - 1)
\]
we substitute it into $(**)$:
\[
n(D + 1, M) = \sum_{i=1}^{D} n(i, M - 1) + n(D + 1, M - 1) = \sum_{i=1}^{D+1} n(i, M - 1)
\]
We will prove (1.136) in a different but simple way. We rewrite (1.136) in terms of binomial coefficients, writing $C_N^r$ for the number of ways to choose $r$ items out of $N$:
\[
\sum_{i=1}^{D} C_{i+M-2}^{M-1} = C_{D+M-1}^{M}
\]
Firstly, we expand the summation:
\[
C_{M-1}^{M-1} + C_{M}^{M-1} + \ldots + C_{D+M-2}^{M-1} = C_{D+M-1}^{M}
\]
Secondly, we rewrite the first term on the left side as $C_M^M$, because $C_{M-1}^{M-1} = C_M^M = 1$. In other words, we only need to prove:
\[
C_{M}^{M} + C_{M}^{M-1} + \ldots + C_{D+M-2}^{M-1} = C_{D+M-1}^{M}
\]
Thirdly, we use the property $C_N^r = C_{N-1}^{r} + C_{N-1}^{r-1}$. With it we can recursively combine the first and second terms on the left side, and the result ultimately equals the right side.
(1.137) gives the closed form of $n(D, M)$, and we need all the conclusions above to prove it.
Let's build some intuition by looking at $M = 0, 1, 2$. When $M = 0$, (1.134) consists of only a constant term, which means $n(D, 0) = 1$. When $M = 1$, clearly $n(D, 1) = D$, because in this case (1.134) has only $D$ terms when expanded. When $M = 2$, it reduces to Prob.1.14, so $n(D, 2) = \frac{D(D+1)}{2}$. Suppose (1.137) holds for $M - 1$; let's prove that it also holds for $M$.
\[
\begin{aligned}
n(D, M) &= \sum_{i=1}^{D} n(i, M - 1) \qquad \text{(based on (1.135))} \\
&= \sum_{i=1}^{D} C_{i+M-2}^{M-1} \qquad \text{(because (1.137) holds for } M - 1\text{)} \\
&= C_{M-1}^{M-1} + C_{M}^{M-1} + C_{M+1}^{M-1} + \ldots + C_{D+M-2}^{M-1} \\
&= (C_{M}^{M} + C_{M}^{M-1}) + C_{M+1}^{M-1} + \ldots + C_{D+M-2}^{M-1} \\
&= (C_{M+1}^{M} + C_{M+1}^{M-1}) + \ldots + C_{D+M-2}^{M-1} \\
&= C_{M+2}^{M} + \ldots + C_{D+M-2}^{M-1} \\
&\ \ \vdots \\
&= C_{D+M-1}^{M}
\end{aligned}
\]
So far, everything has been proven.
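The closed form (1.137) and the recursion (1.135) can be cross-checked by brute force. The sketch below counts the independent monomials directly as non-increasing index tuples; the function names are illustrative, not from the original.

```python
from itertools import combinations_with_replacement
from math import comb

def n_direct(D, M):
    """Count index multisets {i_1 >= i_2 >= ... >= i_M} drawn from 1..D by enumeration."""
    return sum(1 for _ in combinations_with_replacement(range(1, D + 1), M))

def n_closed(D, M):
    """Closed form (1.137): C(D + M - 1, M)."""
    return comb(D + M - 1, M)

def n_recursive(D, M):
    """Recursion (1.135): n(D, M) = sum_{i=1}^{D} n(i, M - 1), with n(D, 0) = 1."""
    if M == 0:
        return 1
    return sum(n_recursive(i, M - 1) for i in range(1, D + 1))

for D in range(1, 6):
    for M in range(0, 5):
        print(D, M, n_direct(D, M), n_closed(D, M), n_recursive(D, M))
```

Each printed row should show three identical counts.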

Problem 1.16 Solution
This problem can be solved in the same way as Prob.1.15. Firstly, we write the expression consisting of all the independent terms up to $M$th order, corresponding to $N(D, M)$. By adding a summation over the order $m$ on the left side of (1.134), we obtain:
\[
\sum_{m=0}^{M} \sum_{i_1=1}^{D} \sum_{i_2=1}^{i_1} \ldots \sum_{i_m=1}^{i_{m-1}} \tilde{w}_{i_1 i_2 \ldots i_m}\, x_{i_1} x_{i_2} \ldots x_{i_m} \qquad (*)
\]
(1.138) is quite obvious if we view $m$ as a loop variable iterating through all possible orders less than or equal to $M$; for every order $m$, the number of independent parameters is given by $n(D, m)$.
Let's prove (1.138) formally using mathematical induction. When $M = 1$, $(*)$ reduces to two terms: $m = 0$, corresponding to $n(D, 0)$, and $m = 1$, corresponding to $n(D, 1)$. Therefore $N(D, 1) = n(D, 0) + n(D, 1)$. Suppose (1.138) holds for $M$; we show that it also holds for $M + 1$. Let's begin by writing all the independent terms based on $(*)$:
\[
\sum_{m=0}^{M+1} \sum_{i_1=1}^{D} \sum_{i_2=1}^{i_1} \ldots \sum_{i_m=1}^{i_{m-1}} \tilde{w}_{i_1 i_2 \ldots i_m}\, x_{i_1} x_{i_2} \ldots x_{i_m} \qquad (**)
\]
Using the same technique as in Prob.1.15, we divide $(**)$ into two parts based on the summation over $m$: the first part consists of $m = 0, 1, \ldots, M$ and the second part of $m = M + 1$. Hence, the first part corresponds to $N(D, M)$ and the second part corresponds to $n(D, M + 1)$. So we obtain:
\[
N(D, M + 1) = N(D, M) + n(D, M + 1)
\]
Then we substitute (1.138) into the equation above:
\[
\begin{aligned}
N(D, M + 1) &= \sum_{m=0}^{M} n(D, m) + n(D, M + 1) \\
&= \sum_{m=0}^{M+1} n(D, m)
\end{aligned}
\]
To prove (1.139), we use the same technique as in Prob.1.15 instead of mathematical induction. We begin with the already proved (1.138):
\[
N(D, M) = \sum_{m=0}^{M} n(D, m)
\]
We then use (1.137):
\[
\begin{aligned}
N(D, M) &= \sum_{m=0}^{M} C_{D+m-1}^{m} \\
&= C_{D-1}^{0} + C_{D}^{1} + C_{D+1}^{2} + \ldots + C_{D+M-1}^{M} \\
&= (C_{D}^{0} + C_{D}^{1}) + C_{D+1}^{2} + \ldots + C_{D+M-1}^{M} \\
&= (C_{D+1}^{1} + C_{D+1}^{2}) + \ldots + C_{D+M-1}^{M} \\
&= \ldots \\
&= C_{D+M}^{M}
\end{aligned}
\]
As asked by the problem, we now examine how fast $N(D, M)$ grows. Note that $D$ and $M$ are symmetric in $N(D, M)$, so we only need to prove that it grows like $D^M$ when $D \gg M$; the case $M \gg D$ then follows by symmetry.
\[
\begin{aligned}
N(D, M) &= \frac{(D + M)!}{D!\, M!} \approx \frac{(D + M)^{D+M}}{D^D M^M} \\
&= \frac{1}{M^M} \Big( \frac{D + M}{D} \Big)^{D} (D + M)^M \\
&= \frac{1}{M^M} \Big[ \Big( 1 + \frac{M}{D} \Big)^{\frac{D}{M}} \Big]^{M} (D + M)^M \\
&\approx \Big( \frac{e}{M} \Big)^{M} (D + M)^M \\
&= \frac{e^M}{M^M} \Big( 1 + \frac{M}{D} \Big)^{M} D^M \\
&= \frac{e^M}{M^M} \Big[ \Big( 1 + \frac{M}{D} \Big)^{\frac{D}{M}} \Big]^{\frac{M^2}{D}} D^M \\
&\approx \frac{e^{M + \frac{M^2}{D}}}{M^M}\, D^M \approx \frac{e^M}{M^M}\, D^M
\end{aligned}
\]
where we use Stirling's approximation, $\lim_{n \to +\infty} (1 + \frac{1}{n})^n = e$, and $e^{\frac{M^2}{D}} \approx e^0 = 1$. According to the description in the problem, when $D \gg M$ we can view $\frac{e^M}{M^M}$ as a constant, so $N(D, M)$ grows like $D^M$ in this case. By symmetry, $N(D, M)$ grows like $M^D$ when $M \gg D$.
Finally, we are asked to calculate $N(10, 3)$ and $N(100, 3)$:
\[
N(10, 3) = C_{13}^{3} = 286
\]
\[
N(100, 3) = C_{103}^{3} = 176851
\]
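The two numerical values and the $D^M$ growth can be checked directly. In this sketch the function names are illustrative. The last line relies on the exact expansion $C_{D+M}^{M} = (D+M)(D+M-1)\cdots(D+1)/M!$, from which $N(D, M)\, M! / D^M \to 1$ as $D$ grows with $M$ fixed; that is consistent with the $D^M$ growth rate derived above, although the constant $1/M!$ differs from the cruder Stirling-based constant $e^M / M^M$.

```python
from math import comb, factorial

def n_terms(D, m):
    """Number of independent parameters of order m, from (1.137)."""
    return comb(D + m - 1, m)

def N_total(D, M):
    """Sum over all orders 0..M, as in (1.138)."""
    return sum(n_terms(D, m) for m in range(M + 1))

def N_closed(D, M):
    """Closed form (1.139): C(D + M, M)."""
    return comb(D + M, M)

print(N_total(10, 3), N_closed(10, 3))
print(N_total(100, 3), N_closed(100, 3))

# Growth check for D >> M: the ratio N(D, M) * M! / D^M should approach 1.
growth_ratio = N_closed(10_000, 3) * factorial(3) / 10_000 ** 3
print(growth_ratio)
```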
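Problem 1.17 Solution
\[
\begin{aligned}
\Gamma(x + 1) &= \int_{0}^{+\infty} u^x e^{-u}\, du \\
&= \int_{0}^{+\infty} -u^x\, d(e^{-u}) \\
&= \left. -u^x e^{-u} \right|_{0}^{+\infty} - \int_{0}^{+\infty} e^{-u}\, d(-u^x) \\
&= \left. -u^x e^{-u} \right|_{0}^{+\infty} + x \int_{0}^{+\infty} e^{-u} u^{x-1}\, du \\
&= \left. -u^x e^{-u} \right|_{0}^{+\infty} + x\, \Gamma(x)
\end{aligned}
\]
where we have used integration by parts. According to the equation above, we only need to prove that the first term equals 0. By L'Hospital's rule, applied repeatedly:
\[
\lim_{u \to +\infty} -\frac{u^x}{e^u} = \lim_{u \to +\infty} -\frac{x!}{e^u} = 0
\]
Also, when $u = 0$, $-u^x e^{-u} = 0$, so we have proved $\Gamma(x + 1) = x\, \Gamma(x)$. Based on the definition of $\Gamma(x)$, we can write:
\[
\Gamma(1) = \int_{0}^{+\infty} e^{-u}\, du = \left. -e^{-u} \right|_{0}^{+\infty} = -(0 - 1) = 1
\]
Therefore, when $x$ is a positive integer:
\[
\Gamma(x + 1) = x\, \Gamma(x) = x (x - 1)\, \Gamma(x - 1) = \ldots = x!\, \Gamma(1) = x!
\]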
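Both the recurrence and the factorial identity can be checked numerically. The sketch below compares Python's built-in `math.gamma` with factorials, and also evaluates the defining integral with a midpoint rule; the truncation point `upper` and the step count are assumptions chosen so that the neglected tail is negligible for the arguments used here.

```python
import math

def gamma_numeric(x, upper=60.0, n=200_000):
    """Midpoint-rule estimate of the integral of u^(x-1) e^(-u) over [0, upper]."""
    h = upper / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        total += u ** (x - 1) * math.exp(-u)
    return total * h

# Gamma(k + 1) = k! for small positive integers k.
for k in range(1, 8):
    print(k, math.gamma(k + 1), math.factorial(k))

# Recurrence Gamma(x + 1) = x Gamma(x) at a non-integer point, via the integral.
g25 = gamma_numeric(2.5)
g35 = gamma_numeric(3.5)
g5 = gamma_numeric(5.0)
print(g35, 2.5 * g25)
print(g5, math.factorial(4))
```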
Problem 1.18 Solution
Based on (1.124) and (1.126), and by substituting $x$ with $\sqrt{2}\sigma y$, it is straightforward to obtain:
\[
\int_{-\infty}^{+\infty} e^{-x_i^2}\, dx_i = \sqrt{\pi}
\]
Therefore, the left side of (1.142) equals $\pi^{\frac{D}{2}}$. For the right side of (1.142):
\[
\begin{aligned}
S_D \int_{0}^{+\infty} e^{-r^2} r^{D-1}\, dr &= S_D \int_{0}^{+\infty} e^{-u} u^{\frac{D-1}{2}}\, d\sqrt{u} \qquad (u = r^2) \\
&= \frac{S_D}{2} \int_{0}^{+\infty} e^{-u} u^{\frac{D}{2} - 1}\, du \\
&= \frac{S_D}{2}\, \Gamma\Big( \frac{D}{2} \Big)
\end{aligned}
\]
Hence, we obtain:
\[
\pi^{\frac{D}{2}} = \frac{S_D}{2}\, \Gamma\Big( \frac{D}{2} \Big) \quad \Rightarrow \quad S_D = \frac{2 \pi^{\frac{D}{2}}}{\Gamma(\frac{D}{2})}
\]
$S_D$ is the surface area of the sphere with radius 1 in dimension $D$. We can extend this conclusion: the surface area of a sphere with radius $r$ in dimension $D$ equals $S_D \cdot r^{D-1}$, which reduces to $S_D$ when $r = 1$. This follows once we observe that the surface area of a sphere in dimension $D$ is proportional to the $(D-1)$th power of its radius, i.e. $r^{D-1}$. Considering the relationship between the volume $V$ and surface area $S$ of a sphere with arbitrary radius in dimension $D$, $\frac{dV}{dr} = S$, we obtain:
\[
V = \int S\, dr = \int S_D\, r^{D-1}\, dr = \frac{S_D}{D}\, r^D
\]
The equation above gives the volume of a sphere with radius $r$ in dimension $D$, so we let $r = 1$:
\[
V_D = \frac{S_D}{D}
\]
For $D = 2$ and $D = 3$:
\[
V_2 = \frac{S_2}{2} = \frac{1}{2} \cdot \frac{2\pi}{\Gamma(1)} = \pi
\]
\[
V_3 = \frac{S_3}{3} = \frac{1}{3} \cdot \frac{2 \pi^{\frac{3}{2}}}{\Gamma(\frac{3}{2})} = \frac{1}{3} \cdot \frac{2 \pi^{\frac{3}{2}}}{\frac{\sqrt{\pi}}{2}} = \frac{4}{3} \pi
\]
Problem 1.19 Solution
We already gave a hint in the solution of Prob.1.18; here we make it explicit: the volume of a sphere with radius $r$ is $V_D \cdot r^D$. This is similar to the conclusion about the surface area obtained in Prob.1.18, except that the volume is proportional to the $D$th power of the radius, i.e. $r^D$, not $r^{D-1}$.
\[
\frac{\text{volume of sphere}}{\text{volume of cube}} = \frac{V_D\, a^D}{(2a)^D} = \frac{S_D}{2^D D} = \frac{\pi^{\frac{D}{2}}}{2^{D-1} D\, \Gamma(\frac{D}{2})} \qquad (*)
\]
where we have used the result (1.143). To show that $(*)$ converges to 0 as $D \to +\infty$, we rewrite it:
\[
(*) = \frac{2}{D} \cdot \Big( \frac{\pi}{4} \Big)^{\frac{D}{2}} \cdot \frac{1}{\Gamma(\frac{D}{2})}
\]
All three factors converge to 0 when $D \to +\infty$, so their product also converges to 0. The last question is simple:
\[
\frac{\text{center to one corner}}{\text{center to one side}} = \frac{\sqrt{a^2 \cdot D}}{a} = \sqrt{D} \quad \text{and} \quad \lim_{D \to +\infty} \sqrt{D} = +\infty
\]
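The formulas for $S_D$, $V_D$ and the sphere-to-cube volume ratio are easy to evaluate. This sketch uses `math.gamma` from the standard library; the function names are illustrative.

```python
import math

def surface_area(D):
    """Surface area S_D of the unit sphere in D dimensions, from (1.143)."""
    return 2 * math.pi ** (D / 2) / math.gamma(D / 2)

def volume(D):
    """Volume V_D = S_D / D of the unit sphere in D dimensions."""
    return surface_area(D) / D

def sphere_to_cube_ratio(D):
    """Volume of a sphere of radius a over a cube of side 2a (independent of a)."""
    return volume(D) / 2 ** D

print(volume(2), math.pi)
print(volume(3), 4 * math.pi / 3)
for D in (2, 5, 10, 20):
    # Ratio shrinks toward 0, while corner/side distance sqrt(D) grows.
    print(D, sphere_to_cube_ratio(D), math.sqrt(D))
```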

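Problem 1.20 Solution
The probability density in a thin shell with radius $r$ and thickness $\epsilon$ can be viewed as constant. Considering that a sphere in dimension $D$ with radius $r$ has surface area $S_D r^{D-1}$, as discussed in Prob.1.18:
\[
\int_{\text{shell}} p(\mathbf{x})\, d\mathbf{x} \approx p(\mathbf{x}) \int_{\text{shell}} d\mathbf{x} = \frac{\exp(-\frac{r^2}{2\sigma^2})}{(2\pi\sigma^2)^{\frac{D}{2}}} \cdot V(\text{shell}) = \frac{\exp(-\frac{r^2}{2\sigma^2})}{(2\pi\sigma^2)^{\frac{D}{2}}}\, S_D\, r^{D-1} \epsilon
\]
Thus we denote:
\[
p(r) = \frac{S_D\, r^{D-1}}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{r^2}{2\sigma^2} \Big)
\]
We calculate the derivative of (1.148) with respect to $r$:
\[
\frac{d p(r)}{dr} = \frac{S_D}{(2\pi\sigma^2)^{\frac{D}{2}}}\, r^{D-2} \exp\Big( -\frac{r^2}{2\sigma^2} \Big) \Big( D - 1 - \frac{r^2}{\sigma^2} \Big) \qquad (*)
\]
Setting the derivative to 0 gives its unique root (stationary point) $\hat{r} = \sqrt{D - 1}\, \sigma$, because $r \in [0, +\infty)$. When $r < \hat{r}$, the derivative is greater than 0 and $p(r)$ increases with $r$; when $r > \hat{r}$, the derivative is less than 0 and $p(r)$ decreases with $r$. Therefore $\hat{r}$ is the only maximum point. Clearly, when $D \gg 1$, $\hat{r} \approx \sqrt{D}\, \sigma$.
\[
\begin{aligned}
\frac{p(\hat{r} + \epsilon)}{p(\hat{r})} &= \frac{(\hat{r} + \epsilon)^{D-1} \exp\big( -\frac{(\hat{r} + \epsilon)^2}{2\sigma^2} \big)}{\hat{r}^{D-1} \exp\big( -\frac{\hat{r}^2}{2\sigma^2} \big)} \\
&= \Big( 1 + \frac{\epsilon}{\hat{r}} \Big)^{D-1} \exp\Big( -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} \Big) \\
&= \exp\Big( -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \ln\Big( 1 + \frac{\epsilon}{\hat{r}} \Big) \Big)
\end{aligned}
\]
We expand the exponent using a second-order Taylor expansion of the logarithm, together with $\hat{r}^2 = (D - 1)\sigma^2$:
\[
\begin{aligned}
-\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \ln\Big( 1 + \frac{\epsilon}{\hat{r}} \Big) &\approx -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \Big( \frac{\epsilon}{\hat{r}} - \frac{\epsilon^2}{2\hat{r}^2} \Big) \\
&= -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + \frac{2\hat{r}\epsilon - \epsilon^2}{2\sigma^2} \\
&= -\frac{\epsilon^2}{\sigma^2}
\end{aligned}
\]
Therefore, $p(\hat{r} + \epsilon) = p(\hat{r}) \exp(-\frac{\epsilon^2}{\sigma^2})$. Note: Here I draw a different conclusion compared with (1.149), but I do not think there is any mistake in my deduction.
Finally, we see from (1.147):
\[
\left. p(\mathbf{x}) \right|_{\mathbf{x} = 0} = \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}}
\]
\[
\left. p(\mathbf{x}) \right|_{\|\mathbf{x}\|^2 = \hat{r}^2} = \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{\hat{r}^2}{2\sigma^2} \Big) \approx \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{D}{2} \Big)
\]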
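The location of the maximum and the approximation $p(\hat{r} + \epsilon) \approx p(\hat{r}) \exp(-\epsilon^2 / \sigma^2)$ can be checked numerically. The values of $D$, $\sigma$, $\epsilon$ and the grid below are arbitrary assumptions; the sketch works with $\ln p(r)$ up to an additive constant, which does not affect either check.

```python
import math

def log_p_r(r, D, sigma):
    """Log of the radial density p(r), dropping terms that do not depend on r."""
    return (D - 1) * math.log(r) - r * r / (2 * sigma * sigma)

D, sigma = 20, 1.5                                   # hypothetical settings
rs = [0.001 * i for i in range(1, 20_001)]           # grid on (0, 20]
r_hat_numeric = max(rs, key=lambda r: log_p_r(r, D, sigma))
r_hat_closed = math.sqrt(D - 1) * sigma              # stationary point derived above

eps = 0.3
ratio_exact = math.exp(log_p_r(r_hat_closed + eps, D, sigma)
                       - log_p_r(r_hat_closed, D, sigma))
ratio_approx = math.exp(-eps ** 2 / sigma ** 2)      # second-order Taylor result

print(r_hat_numeric, r_hat_closed)
print(ratio_exact, ratio_approx)
```

The agreement between the two ratios improves as $\epsilon / \hat{r}$ becomes smaller, since the Taylor expansion is truncated at second order.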
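Problem 1.21 Solution
The first question is rather simple:
\[
(ab)^{\frac{1}{2}} - a = a^{\frac{1}{2}} ( b^{\frac{1}{2}} - a^{\frac{1}{2}} ) \geq 0
\]
where we have used $b \geq a \geq 0$. And based on (1.78):
\[
\begin{aligned}
p(\text{mistake}) &= p(\mathbf{x} \in R_1, C_2) + p(\mathbf{x} \in R_2, C_1) \\
&= \int_{R_1} p(\mathbf{x}, C_2)\, d\mathbf{x} + \int_{R_2} p(\mathbf{x}, C_1)\, d\mathbf{x}
\end{aligned}
\]
Recall the decision rule that minimizes misclassification: if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$ for a given value of $\mathbf{x}$, we assign that $\mathbf{x}$ to class $C_1$. Hence, in decision region $R_1$ we have $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$. Therefore, using what we have proved with $a = p(\mathbf{x}, C_2)$ and $b = p(\mathbf{x}, C_1)$, we obtain:
\[
\int_{R_1} p(\mathbf{x}, C_2)\, d\mathbf{x} \leq \int_{R_1} \{ p(\mathbf{x}, C_1)\, p(\mathbf{x}, C_2) \}^{\frac{1}{2}}\, d\mathbf{x}
\]
The same holds for decision region $R_2$. Therefore we obtain:
\[
p(\text{mistake}) \leq \int \{ p(\mathbf{x}, C_1)\, p(\mathbf{x}, C_2) \}^{\frac{1}{2}}\, d\mathbf{x}
\]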
Problem 1.22 Solution
We need to understand (1.81) thoroughly. When $L_{kj} = 1 - I_{kj}$:
\[
\sum_{k} L_{kj}\, p(C_k|\mathbf{x}) = \sum_{k} p(C_k|\mathbf{x}) - p(C_j|\mathbf{x})
\]
Given a specific $\mathbf{x}$, the first term on the right side is a constant equal to 1, no matter which class $C_j$ we assign $\mathbf{x}$ to. Therefore, to minimize the loss we maximize $p(C_j|\mathbf{x})$. Hence, we assign $\mathbf{x}$ to the class $C_j$ with the largest posterior probability $p(C_j|\mathbf{x})$.
The interpretation of the loss matrix is simple. If we label correctly, there is no loss. Otherwise, we incur a loss that is the same whichever wrong class we choose. The loss matrix is shown below for intuition:
\[
\begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & 1 & \cdots & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 1 & 1 & \cdots & 0
\end{pmatrix}
\]
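Problem 1.23 Solution
\[
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{R_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x} = \sum_{k} \sum_{j} \int_{R_j} L_{kj}\, p(C_k)\, p(\mathbf{x}|C_k)\, d\mathbf{x}
\]
If we denote a new loss matrix by $L^{\star}_{kj} = L_{kj}\, p(C_k)$, we obtain a new equation:
\[
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{R_j} L^{\star}_{kj}\, p(\mathbf{x}|C_k)\, d\mathbf{x}
\]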
Problem 1.24 Solution
The description of this problem is a little confusing. What it really means is that $\lambda$ is the parameter governing the loss, just as $\theta$ governs the posterior probability $p(C_k|\mathbf{x})$ when we introduce the reject option. Therefore the reject option can be restated in terms of $\lambda$ and the loss:
\[
\text{choice} = \begin{cases} \text{class } C_j & \text{if } \min_{l} \sum_{k} L_{kl}\, p(C_k|\mathbf{x}) < \lambda \\ \text{reject} & \text{otherwise} \end{cases}
\]
where $C_j$ is the class that attains the minimum. If $L_{kj} = 1 - I_{kj}$, according to what we have proved in Prob.1.22:
\[
\sum_{k} L_{kj}\, p(C_k|\mathbf{x}) = \sum_{k} p(C_k|\mathbf{x}) - p(C_j|\mathbf{x}) = 1 - p(C_j|\mathbf{x})
\]
Therefore, the criterion in terms of $\lambda$ above is equivalent to requiring that the largest posterior probability is larger than $1 - \lambda$:
\[
\min_{l} \sum_{k} L_{kl}\, p(C_k|\mathbf{x}) < \lambda \quad \Leftrightarrow \quad \max_{l}\, p(C_l|\mathbf{x}) > 1 - \lambda
\]
From the viewpoint of $\theta$ and the posterior probability, we assign a class to $\mathbf{x}$ (i.e. we do not reject) under the constraint:
\[
\max_{l}\, p(C_l|\mathbf{x}) > \theta
\]
Hence, from the two different views, we can see that $\lambda$ and $\theta$ are related by:
\[
\lambda + \theta = 1
\]
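Problem 1.25 Solution
We could prove this informally by dealing with one dimension at a time, following the process in (1.87) - (1.89) until all dimensions are done, because the total loss $\mathbb{E}$ can be written as a sum of the losses over the dimensions, which are moreover independent. Here we use a more formal way. In this case, the expected loss can be written:
\[
\mathbb{E}[L] = \iint \| \mathbf{y}(\mathbf{x}) - \mathbf{t} \|^2\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t}\, d\mathbf{x}
\]
Therefore, following the same process as in (1.87) - (1.89):
\[
\frac{\partial \mathbb{E}[L]}{\partial \mathbf{y}(\mathbf{x})} = 2 \int \{ \mathbf{y}(\mathbf{x}) - \mathbf{t} \}\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t} = 0
\]
\[
\Rightarrow \quad \mathbf{y}(\mathbf{x}) = \frac{\int \mathbf{t}\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t}}{p(\mathbf{x})} = \mathbb{E}_{\mathbf{t}}[\mathbf{t}|\mathbf{x}]
\]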
Problem 1.26 Solution
The process is identical to the derivation of (1.90), so we do not repeat it here. What we should emphasize is that $\mathbb{E}[\mathbf{t}|\mathbf{x}]$ is a function of $\mathbf{x}$, not $\mathbf{t}$. Thus the cross term vanishes once we integrate over $\mathbf{t}$, and that is how we obtain (1.90).
Note: There is a mistake in (1.90), i.e. the second term on the right side is wrong. You can view (3.37) on P148 for reference. It should be:
\[
\mathbb{E}[L] = \int \{ y(\mathbf{x}) - \mathbb{E}[t|\mathbf{x}] \}^2\, p(\mathbf{x})\, d\mathbf{x} + \iint \{ \mathbb{E}[t|\mathbf{x}] - t \}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt
\]
Moreover, this mistake has already been corrected in the errata.

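Problem 1.27 Solution
We deal with this problem using the calculus of variations.
\[
\begin{aligned}
& \frac{\partial \mathbb{E}[L_q]}{\partial y(\mathbf{x})} = q \int | y(\mathbf{x}) - t |^{q-1}\, \mathrm{sign}( y(\mathbf{x}) - t )\, p(\mathbf{x}, t)\, dt = 0 \\
\Rightarrow\ & \int_{-\infty}^{y(\mathbf{x})} | y(\mathbf{x}) - t |^{q-1}\, p(\mathbf{x}, t)\, dt = \int_{y(\mathbf{x})}^{+\infty} | y(\mathbf{x}) - t |^{q-1}\, p(\mathbf{x}, t)\, dt \\
\Rightarrow\ & \int_{-\infty}^{y(\mathbf{x})} | y(\mathbf{x}) - t |^{q-1}\, p(t|\mathbf{x})\, dt = \int_{y(\mathbf{x})}^{+\infty} | y(\mathbf{x}) - t |^{q-1}\, p(t|\mathbf{x})\, dt
\end{aligned}
\]
where we use $p(\mathbf{x}, t) = p(t|\mathbf{x})\, p(\mathbf{x})$ and the property of the sign function. Hence, when $q = 1$, the equation above reduces to:
\[
\int_{-\infty}^{y(\mathbf{x})} p(t|\mathbf{x})\, dt = \int_{y(\mathbf{x})}^{+\infty} p(t|\mathbf{x})\, dt
\]
In other words, when $q = 1$, the optimal $y(\mathbf{x})$ is given by the conditional median. The case $q \to 0$ is non-trivial. We need to rewrite (1.91):
\[
\begin{aligned}
\mathbb{E}[L_q] &= \int \Big\{ \int | y(\mathbf{x}) - t |^{q}\, p(t|\mathbf{x})\, p(\mathbf{x})\, dt \Big\}\, d\mathbf{x} \\
&= \int p(\mathbf{x}) \Big\{ \int | y(\mathbf{x}) - t |^{q}\, p(t|\mathbf{x})\, dt \Big\}\, d\mathbf{x} \qquad (*)
\end{aligned}
\]
To minimize $\mathbb{E}[L_q]$, we only need to minimize the inner integral of $(*)$ for each $\mathbf{x}$:
\[
\int | y(\mathbf{x}) - t |^{q}\, p(t|\mathbf{x})\, dt \qquad (**)
\]
When $q \to 0$, $| y(\mathbf{x}) - t |^{q}$ is close to 1 everywhere except in a small neighborhood around $t = y(\mathbf{x})$ (this can be seen from Fig. 1.29). Therefore:
\[
(**) \approx \int_{U} p(t|\mathbf{x})\, dt - \int_{\epsilon} ( 1 - | y(\mathbf{x}) - t |^{q} )\, p(t|\mathbf{x})\, dt \approx \int_{U} p(t|\mathbf{x})\, dt - \int_{\epsilon} p(t|\mathbf{x})\, dt
\]
where $\epsilon$ denotes the small neighborhood and $U$ denotes the whole space in which $t$ lies. Note that $y(\mathbf{x})$ does not affect the first term, only the second (because the choice of $y(\mathbf{x})$ determines the location of $\epsilon$). Hence we place $\epsilon$ where $p(t|\mathbf{x})$ achieves its largest value, i.e. the mode, because this gives the largest reduction. Therefore, it is natural to choose $y(\mathbf{x})$ equal to the $t$ that maximizes $p(t|\mathbf{x})$ for every $\mathbf{x}$.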
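The $q = 1$ result can be illustrated with an empirical stand-in for $p(t|\mathbf{x})$ at a fixed $\mathbf{x}$. In this sketch the sample values and the search grid are made-up assumptions; the grid search minimizes the average Minkowski loss and compares the minimizer with the sample median (for $q = 1$) and the sample mean (for $q = 2$).

```python
import statistics

# Hypothetical skewed sample standing in for draws of t at one fixed x.
ts = [0.1, 0.2, 0.4, 0.5, 0.9, 2.5, 7.0]

def expected_loss(y, q):
    """Empirical Minkowski loss: average of |y - t|^q over the sample."""
    return sum(abs(y - t) ** q for t in ts) / len(ts)

grid = [0.01 * i for i in range(0, 701)]            # candidate predictions in [0, 7]
y_q1 = min(grid, key=lambda y: expected_loss(y, 1))
y_q2 = min(grid, key=lambda y: expected_loss(y, 2))

print(y_q1, statistics.median(ts))                  # q = 1: minimizer matches the median
print(y_q2, statistics.mean(ts))                    # q = 2: minimizer is the grid point nearest the mean
```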
Problem 1.28 Solution
Basically this problem concerns the definition of information content, i.e. $h(x)$. We restate the problem more precisely. In information theory, $h(\cdot)$ is also called the information content and denoted $I(\cdot)$. Here we keep $h(\cdot)$ for consistency. The whole problem is about the properties of $h(x)$. Given that $h(\cdot)$ is a monotonic function of the probability $p(x)$, we can write:
\[
h(x) = f(p(x))
\]
The equation above means that the information we obtain from a specific value of a random variable $x$ is related to its probability $p(x)$, and the relationship is given by a mapping function $f(\cdot)$. Suppose $C$ is the intersection of two independent events $A$ and $B$; then the information of event $C$ occurring is the compound message of both independent events $A$ and $B$ occurring:
\[
h(C) = h(A \cap B) = h(A) + h(B) \qquad (*)
\]
Because $A$ and $B$ are independent:
\[
P(C) = P(A) \cdot P(B)
\]
We apply the function $f(\cdot)$ to both sides:
\[
f(P(C)) = f(P(A) \cdot P(B)) \qquad (**)
\]
Moreover, the left sides of $(*)$ and $(**)$ are equal by definition, so we obtain:
\[
\begin{aligned}
& h(A) + h(B) = f(P(A) \cdot P(B)) \\
\Rightarrow\ & f(P(A)) + f(P(B)) = f(P(A) \cdot P(B))
\end{aligned}
\]
We obtain an important property of the function $f(\cdot)$: $f(x \cdot y) = f(x) + f(y)$.
Note: What Prob.1.28 really asks us to prove concerns the form and property of the function $f$ in our formulation, because the description of the problem contains the sentence: "In this exercise, we derive the relation between h and p in the form of a function h(p)" (i.e. $f(\cdot)$ in our formulation is equivalent to $h(p)$ in the description).
At present, what we know is the property of the function $f(\cdot)$:
\[
f(xy) = f(x) + f(y) \qquad (*)
\]
Firstly, we choose $x = y$, and then clearly $f(x^2) = 2 f(x)$. Secondly, $f(x^n) = n f(x)$, $n \in \mathbb{N}$, clearly holds for $n = 1$ and $n = 2$. Suppose it also holds for $n$; we prove that it holds for $n + 1$:
\[
f(x^{n+1}) = f(x^n) + f(x) = n f(x) + f(x) = (n + 1) f(x)
\]
Therefore, $f(x^n) = n f(x)$, $n \in \mathbb{N}$, has been proved. For a positive integer $m$, we rewrite $x^n$ as $(x^{\frac{n}{m}})^m$, and using what we have proved, we obtain:
\[
f(x^n) = f\big( (x^{\frac{n}{m}})^m \big) = m\, f(x^{\frac{n}{m}})
\]
Because $f(x^n)$ also equals $n f(x)$, we have $n f(x) = m\, f(x^{\frac{n}{m}})$. Simplifying this equation gives:
\[
f(x^{\frac{n}{m}}) = \frac{n}{m}\, f(x)
\]
For an arbitrary positive real $x \in \mathbb{R}^+$, we can find two sequences of positive rationals $\{y_N\}$ and $\{z_N\}$ which satisfy:
\[
y_1 < y_2 < \ldots < y_N < x, \quad \text{and} \quad \lim_{N \to +\infty} y_N = x
\]
\[
z_1 > z_2 > \ldots > z_N > x, \quad \text{and} \quad \lim_{N \to +\infty} z_N = x
\]
Using the fact that $f(\cdot)$ is monotonic:
\[
y_N\, f(p) = f(p^{y_N}) \leq f(p^x) \leq f(p^{z_N}) = z_N\, f(p)
\]
When $N \to +\infty$, we obtain $f(p^x) = x f(p)$, $x \in \mathbb{R}^+$. Letting $p = e$, it can be rewritten as $f(e^x) = x f(e)$. Finally, we denote $y = e^x$:
\[
f(y) = \ln(y)\, f(e)
\]
where $f(e)$ is a constant once the function $f(\cdot)$ is fixed. Therefore $f(x) \propto \ln(x)$.
Problem 1.29 Solution
This problem is a little tricky. The entropy of an $M$-state discrete random variable $x$ can be written as:
\[
H[x] = -\sum_{i=1}^{M} \lambda_i \ln(\lambda_i)
\]
where $\lambda_i$ is the probability that $x$ takes state $i$. Here we choose the concave function $f(\cdot) = \ln(\cdot)$ and rewrite Jensen's inequality, i.e. (1.115):
\[
\ln\Big( \sum_{i=1}^{M} \lambda_i x_i \Big) \geq \sum_{i=1}^{M} \lambda_i \ln(x_i)
\]
Choosing $x_i = \frac{1}{\lambda_i}$ and simplifying the equation above, we obtain:
\[
\ln M \geq -\sum_{i=1}^{M} \lambda_i \ln(\lambda_i) = H[x]
\]
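The bound $H[x] \leq \ln M$ can be spot-checked on random distributions. This is an illustrative sketch: the seed, the number of states and the number of random draws are arbitrary assumptions, and a finite random check is of course not a proof.

```python
import math
import random

def entropy(probs):
    """H = -sum_i lambda_i ln(lambda_i), using the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def random_distribution(M):
    """Draw M positive weights and normalize them to sum to 1."""
    raw = [random.random() for _ in range(M)]
    total = sum(raw)
    return [r / total for r in raw]

random.seed(1)
M = 6                                               # hypothetical number of states
entropies = [entropy(random_distribution(M)) for _ in range(1000)]
h_max_seen = max(entropies)
h_uniform = entropy([1.0 / M] * M)                  # the uniform distribution attains ln M

print(h_max_seen, math.log(M))
print(h_uniform, math.log(M))
```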
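Problem 1.30 Solution
Based on the definition:
\[
\begin{aligned}
\ln\Big\{ \frac{p(x)}{q(x)} \Big\} &= \ln\Big( \frac{s}{\sigma} \Big) - \Big[ \frac{1}{2\sigma^2} (x - \mu)^2 - \frac{1}{2s^2} (x - m)^2 \Big] \\
&= \ln\Big( \frac{s}{\sigma} \Big) - \Big[ \Big( \frac{1}{2\sigma^2} - \frac{1}{2s^2} \Big) x^2 - \Big( \frac{\mu}{\sigma^2} - \frac{m}{s^2} \Big) x + \Big( \frac{\mu^2}{2\sigma^2} - \frac{m^2}{2s^2} \Big) \Big]
\end{aligned}
\]
We use the following equations to solve this problem:
\[
\mathbb{E}[x^2] = \int x^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \mu^2 + \sigma^2
\]
\[
\mathbb{E}[x] = \int x\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \mu
\]
\[
\int \mathcal{N}(x|\mu, \sigma^2)\, dx = 1
\]
Given the equations above, it is easy to see:
\[
\begin{aligned}
KL(p\|q) &= -\int p(x) \ln\Big\{ \frac{q(x)}{p(x)} \Big\}\, dx \\
&= \int \mathcal{N}(x|\mu, \sigma^2) \ln\Big\{ \frac{p(x)}{q(x)} \Big\}\, dx \\
&= \ln\Big( \frac{s}{\sigma} \Big) - \Big( \frac{1}{2\sigma^2} - \frac{1}{2s^2} \Big) (\mu^2 + \sigma^2) + \Big( \frac{\mu}{\sigma^2} - \frac{m}{s^2} \Big) \mu - \Big( \frac{\mu^2}{2\sigma^2} - \frac{m^2}{2s^2} \Big) \\
&= \ln\Big( \frac{s}{\sigma} \Big) + \frac{\sigma^2 + (\mu - m)^2}{2s^2} - \frac{1}{2}
\end{aligned}
\]
We now discuss this result in more detail. Firstly, if the KL distance is defined with base-2 logarithms as in information theory, the first term of the result will be $\log_2(\frac{s}{\sigma})$ instead of $\ln(\frac{s}{\sigma})$. Secondly, if we denote $x = \frac{s}{\sigma}$, the KL distance can be rewritten as:
\[
KL(p\|q) = \ln(x) + \frac{1}{2x^2} - \frac{1}{2} + a, \quad \text{where} \quad a = \frac{(\mu - m)^2}{2s^2}
\]
Treating $a$ as fixed, we calculate the derivative of $KL$ with respect to $x$ and set it to 0:
\[
\frac{d(KL)}{dx} = \frac{1}{x} - x^{-3} = 0 \quad \Rightarrow \quad x = 1 \quad (\because s, \sigma > 0)
\]
When $x < 1$ the derivative is less than 0, and when $x > 1$ it is greater than 0, which makes $x = 1$ the global minimum. When $x = 1$, $KL(p\|q) = a$. Moreover, when $\mu = m$, $a$ achieves its minimum 0. In this way, we have shown that the KL distance between two Gaussian distributions is not less than 0, and it equals 0 only when the two Gaussian distributions are identical, i.e. have the same mean and variance.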
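The closed-form result can be compared against a direct numerical integration of $\int p(x) \ln\{p(x)/q(x)\}\, dx$. In this sketch the parameter values, integration window and step count are arbitrary assumptions; log densities are used inside the integral to avoid dividing by very small numbers.

```python
import math

def kl_closed(mu, sigma, m, s):
    """Closed form derived above for KL( N(mu, sigma^2) || N(m, s^2) )."""
    return math.log(s / sigma) + (sigma ** 2 + (mu - m) ** 2) / (2 * s ** 2) - 0.5

def kl_numeric(mu, sigma, m, s, half_width=12.0, n=50_000):
    """Midpoint-rule estimate of the integral of p(x) ln(p(x) / q(x)) dx."""
    lo = mu - half_width * sigma
    h = 2 * half_width * sigma / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        log_p = -(x - mu) ** 2 / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))
        log_q = -(x - m) ** 2 / (2 * s ** 2) - math.log(s * math.sqrt(2 * math.pi))
        total += math.exp(log_p) * (log_p - log_q)
    return total * h

params = (0.5, 1.2, -0.3, 2.0)                      # hypothetical (mu, sigma, m, s)
kl_a = kl_closed(*params)
kl_b = kl_numeric(*params)
kl_same = kl_closed(1.0, 2.0, 1.0, 2.0)             # identical Gaussians

print(kl_a, kl_b)
print(kl_same)
```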
We evaluate H [x] + H [y] − H [x, y] by definition. Firstly, let’s calculate

H [x, y] :
∫ ∫
H [x, y] = − p(x, y) lnp(x, y) d x d y
∫ ∫ ∫ ∫
= − p(x, y) lnp(x) d x d y − p(x, y) lnp(y|x) d x d y
∫ ∫ ∫
= − p(x) lnp(x) d x − p(x, y) lnp(y|x) d x d y
= H [x] + H [y|x]
∫
Where we take advantage of p(x, y) = p(x) p(y|x), p(x, y) d y = p(x) and
(1.111). Therefore, we have actually solved Prob.1.37 here. We will continue
our proof for this problem, based on what we have proved:
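$$\begin{aligned}
H[\mathbf{x}] + H[\mathbf{y}] - H[\mathbf{x},\mathbf{y}] &= H[\mathbf{y}] - H[\mathbf{y}|\mathbf{x}] \\
&= -\int p(\mathbf{y})\ln p(\mathbf{y})\,d\mathbf{y} + \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y}|\mathbf{x})\,d\mathbf{x}\,d\mathbf{y} \\
&= -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} + \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y}|\mathbf{x})\,d\mathbf{x}\,d\mathbf{y} \\
&= -\iint p(\mathbf{x},\mathbf{y})\ln\Big(\frac{p(\mathbf{x})\,p(\mathbf{y})}{p(\mathbf{x},\mathbf{y})}\Big)\,d\mathbf{x}\,d\mathbf{y} \\
&= KL\big(p(\mathbf{x},\mathbf{y})\,\|\,p(\mathbf{x})\,p(\mathbf{y})\big) = I(\mathbf{x},\mathbf{y}) \ge 0
\end{aligned}$$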
Where we take advantage of the following properties:

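$$p(\mathbf{y}) = \int p(\mathbf{x},\mathbf{y})\,d\mathbf{x}$$
$$\frac{p(\mathbf{y})}{p(\mathbf{y}|\mathbf{x})} = \frac{p(\mathbf{x})\,p(\mathbf{y})}{p(\mathbf{x},\mathbf{y})}$$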
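Moreover, due to the property of the KL distance, equality holds if and only if $\mathbf{x}$ and $\mathbf{y}$ are statistically independent. You can also see this directly, since in that case $p(\mathbf{x},\mathbf{y}) = p(\mathbf{x})\,p(\mathbf{y})$:
$$\begin{aligned}
H[\mathbf{x},\mathbf{y}] &= -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x},\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \\
&= -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x})\,d\mathbf{x}\,d\mathbf{y} - \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \\
&= -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x} - \int p(\mathbf{y})\ln p(\mathbf{y})\,d\mathbf{y} \\
&= H[\mathbf{x}] + H[\mathbf{y}]
\end{aligned}$$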
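This follows directly from the definition. Note that when we change variables in an integral, we have to introduce an additional factor, the Jacobian determinant.
$$\begin{aligned}
H[\mathbf{y}] &= -\int p(\mathbf{y})\ln p(\mathbf{y})\,d\mathbf{y} \\
&= -\int \frac{p(\mathbf{x})}{|A|}\ln\frac{p(\mathbf{x})}{|A|}\,\Big|\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\Big|\,d\mathbf{x} \\
&= -\int p(\mathbf{x})\ln\frac{p(\mathbf{x})}{|A|}\,d\mathbf{x} \\
&= -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x} - \int p(\mathbf{x})\ln\frac{1}{|A|}\,d\mathbf{x} \\
&= H[\mathbf{x}] + \ln|A|
\end{aligned}$$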
Where we have taken advantage of the following equations:
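$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = A \quad\text{and}\quad p(\mathbf{x}) = p(\mathbf{y})\,\Big|\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\Big| = p(\mathbf{y})\,|A|$$
$$\int p(\mathbf{x})\,d\mathbf{x} = 1$$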
Based on the definition of Entropy, we write:

$$H[y|x] = -\sum_{x_i}\sum_{y_j} p(x_i,y_j)\ln p(y_j|x_i)$$
By the properties of probability, $0 \le p(y_j|x_i) \le 1$ and $0 \le p(x_i,y_j) \le 1$. Therefore $-p(x_i,y_j)\ln p(y_j|x_i) \ge 0$ when $0 < p(y_j|x_i) \le 1$. When $p(y_j|x_i) = 0$, using the fact that $\lim_{p\to 0} p\ln p = 0$, we see that $-p(x_i,y_j)\ln p(y_j|x_i) = -p(x_i)\,p(y_j|x_i)\ln p(y_j|x_i) = 0$ (here we view $p(x_i)$ as a constant). Hence no term in the equation above can be less than 0. In other words, $H[y|x]$ equals 0 if and only if every term of $H[y|x]$ equals 0.
Therefore, for each possible value of random variable x, denoted as x i :
$$-\sum_{y_j} p(x_i,y_j)\ln p(y_j|x_i) = 0 \qquad (\ast)$$
Suppose that, given $x = x_i$, there is more than one possible value $y_j$ of the random variable $y$ such that $p(y_j|x_i) \neq 0$ (because $x_i$ and $y_j$ are both "possible", $p(x_i,y_j)$ is also nonzero). Constrained by $0 \le p(y_j|x_i) \le 1$ and $\sum_j p(y_j|x_i) = 1$, at least two values of $y$ must then satisfy $0 < p(y_j|x_i) < 1$, which makes the left side of (∗) strictly positive, a contradiction.
Therefore, for each possible value of x, there is only one y such that $p(y|x) \neq 0$. In other words, y is determined by x. Note: this result is quite intuitive. If y is a function of x, we know the value of y as soon as we observe x. Therefore we gain no additional information by observing $y_j$ once x has already been observed.
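The argument can be illustrated with a small discrete example. The sketch below computes H[y|x] for two joint tables invented for illustration (they are assumptions, not taken from the book): one where y is a function of x, and one where it is not.

```python
import math

def conditional_entropy(joint):
    """H[y|x] = -sum_ij p(x_i, y_j) ln p(y_j | x_i) for a table joint[i][j]."""
    h = 0.0
    for row in joint:
        p_x = sum(row)  # marginal p(x_i)
        for p_xy in row:
            if p_xy > 0.0:  # terms with p = 0 contribute 0, since lim p ln p = 0
                h -= p_xy * math.log(p_xy / p_x)
    return h

# Hypothetical tables: rows index x, columns index y.
deterministic = [[0.5, 0.0], [0.0, 0.5]]   # each x allows exactly one y
noisy = [[0.25, 0.25], [0.0, 0.5]]         # x = x_0 allows two values of y

h_det = conditional_entropy(deterministic)
h_noisy = conditional_entropy(noisy)
```

Under these assumptions `h_det` is zero, while `h_noisy` equals 0.5 ln 2 because only the first row leaves y undetermined.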
This problem is complicated, so we will explain it in detail. According to Appendix D, we have the relation (D.3):
$$F[y(x) + \epsilon\eta(x)] = F[y(x)] + \epsilon\int \frac{\partial F}{\partial y}\,\eta(x)\,dx + O(\epsilon^2) \qquad (\ast\ast)$$
Here y(x) can be viewed as an operator that, for any input x, gives an output value y. Equivalently, F[y(x)] can be viewed as a functional operator that, for any input function y(x), gives an output value F[y(x)]. Then we consider the functional operator:
$$I[p(x)] = \int p(x)\,f(x)\,dx$$
Under a small variation $p(x) \to p(x) + \epsilon\eta(x)$, we obtain:
$$I[p(x) + \epsilon\eta(x)] = \int p(x)\,f(x)\,dx + \epsilon\int \eta(x)\,f(x)\,dx$$
Comparing the equation above with (∗∗), we conclude:
$$\frac{\partial I}{\partial p(x)} = f(x)$$
Similarly, let's consider another functional operator:
$$J[p(x)] = \int p(x)\ln p(x)\,dx$$
Then under a small variation $p(x) \to p(x) + \epsilon\eta(x)$:
$$\begin{aligned}
J[p(x) + \epsilon\eta(x)] &= \int \big(p(x) + \epsilon\eta(x)\big)\ln\big(p(x) + \epsilon\eta(x)\big)\,dx \\
&= \int p(x)\ln\big(p(x) + \epsilon\eta(x)\big)\,dx + \int \epsilon\eta(x)\ln\big(p(x) + \epsilon\eta(x)\big)\,dx
\end{aligned}$$
Since $\epsilon\eta(x)$ is much smaller than $p(x)$, we write the Taylor expansion of the logarithm at the point $p(x)$:
$$\ln\big(p(x) + \epsilon\eta(x)\big) = \ln p(x) + \frac{\epsilon\eta(x)}{p(x)} + O(\epsilon^2)$$
Therefore, we substitute the equation above into J [ p( x) + ϵη( x)]:

$$J[p(x) + \epsilon\eta(x)] = \int p(x)\ln p(x)\,dx + \epsilon\int \eta(x)\big(\ln p(x) + 1\big)\,dx + O(\epsilon^2)$$
Therefore, we also obtain:
$$\frac{\partial J}{\partial p(x)} = \ln p(x) + 1$$
Now we can go back to (1.108). Based on $\frac{\partial J}{\partial p(x)}$ and $\frac{\partial I}{\partial p(x)}$, we can calculate the derivative of the expression just before (1.108) and set it equal to 0:
$$-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2 = 0$$
Rearranging gives (1.108). From (1.108) we can see that p(x) should take the form of a Gaussian distribution. So we rewrite it in Gaussian form and compare it with a Gaussian distribution with mean µ and variance σ²:
$$\exp(-1+\lambda_1) = \frac{1}{(2\pi\sigma^2)^{1/2}}, \qquad \exp\big(\lambda_2 x + \lambda_3 (x-\mu)^2\big) = \exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}$$
Finally, we obtain:
$$\lambda_1 = 1 - \frac{1}{2}\ln(2\pi\sigma^2), \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2}$$
Note that there is a typo in the official solution manual about λ3. Moreover, in the following part, we will substitute p(x) back into the three constraints and analytically prove that p(x) is Gaussian. You can skip the following part. (The writer would especially like to thank Dr. Spyridon Chavlis from IMBB, FORTH for this analysis.)
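We already know:
$$p(x) = \exp\big(-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2\big)$$
where the exponent is equal to:
$$-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2 = \lambda_3 x^2 + (\lambda_2 - 2\lambda_3\mu)\,x + (\lambda_3\mu^2 + \lambda_1 - 1)$$
Completing the square, with $a = \lambda_3$, $b = \lambda_2 - 2\lambda_3\mu$ and $c = \lambda_3\mu^2 + \lambda_1 - 1$, we obtain:
$$ax^2 + bx + c = a(x-d)^2 + f, \qquad d = -\frac{b}{2a}, \qquad f = c - \frac{b^2}{4a}$$
Using this quadratic form, the constraints can be written as
1. $\int_{-\infty}^{\infty} p(x)\,dx = \int_{-\infty}^{\infty} e^{[a(x-d)^2+f]}\,dx = 1$
2. $\int_{-\infty}^{\infty} x\,p(x)\,dx = \int_{-\infty}^{\infty} x\,e^{[a(x-d)^2+f]}\,dx = \mu$
3. $\int_{-\infty}^{\infty} (x-\mu)^2 p(x)\,dx = \int_{-\infty}^{\infty} (x-\mu)^2 e^{[a(x-d)^2+f]}\,dx = \sigma^2$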
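The first constraint can be written as:
$$I_1 = \int_{-\infty}^{\infty} e^{[a(x-d)^2+f]}\,dx = e^f\int_{-\infty}^{\infty} e^{a(x-d)^2}\,dx$$
Let $u = x - d$, which gives $du = dx$, and thus:
$$I_1 = e^f\int_{-\infty}^{\infty} e^{au^2}\,du$$
Let $-w^2 = au^2 \Rightarrow w = \sqrt{-a}\,u \Rightarrow dw = \sqrt{-a}\,du$, and thus:
$$I_1 = \frac{e^f}{\sqrt{-a}}\int_{-\infty}^{\infty} e^{-w^2}\,dw$$
As $e^{-w^2}$ is an even function, the integral can be written as:
$$I_1 = \frac{2e^f}{\sqrt{-a}}\int_{0}^{\infty} e^{-w^2}\,dw$$
Let $w^2 = t \Rightarrow w = \sqrt{t} \Rightarrow dw = \frac{1}{2\sqrt{t}}\,dt$, and thus:
$$I_1 = \frac{2e^f}{\sqrt{-a}}\int_{0}^{\infty} \frac{1}{2}\,t^{-\frac{1}{2}} e^{-t}\,dt = \frac{e^f}{\sqrt{-a}}\int_{0}^{\infty} t^{\frac{1}{2}-1} e^{-t}\,dt = \frac{e^f}{\sqrt{-a}}\,\Gamma\Big(\frac{1}{2}\Big) = e^f\sqrt{\frac{\pi}{-a}}$$
Here the Gamma function is used. The Gamma function is defined as
$$\Gamma(z) = \int_{0}^{\infty} t^{z-1} e^{-t}\,dt$$
where for non-negative integer values of n, we have:
$$\Gamma\Big(\frac{1}{2} + n\Big) = \frac{(2n)!}{4^n\, n!}\sqrt{\pi}$$
Thus, the first constraint can be rewritten as:
$$e^f\sqrt{\frac{\pi}{-a}} = 1 \qquad (\ast)$$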
The second constraint can be written as:
$$I_2 = \int_{-\infty}^{\infty} x\,e^{[a(x-d)^2+f]}\,dx = e^f\int_{-\infty}^{\infty} x\,e^{a(x-d)^2}\,dx$$
Let $u = x - d \Rightarrow x = u + d \Rightarrow du = dx$, and thus:
$$I_2 = e^f\int_{-\infty}^{\infty} (u+d)\,e^{au^2}\,du$$
Using integral additivity, we have:
$$I_2 = e^f\int_{-\infty}^{\infty} u\,e^{au^2}\,du + e^f\int_{-\infty}^{\infty} d\,e^{au^2}\,du$$
We first deal with the first term on the right hand side, which we denote as $I_{21}$:
$$I_{21} = e^f\int_{-\infty}^{\infty} u\,e^{au^2}\,du = e^f\Big(\int_{-\infty}^{0} u\,e^{au^2}\,du + \int_{0}^{\infty} u\,e^{au^2}\,du\Big)$$
Substituting $u \to -u$ in the first integral, which swaps the integration limits, we obtain:
$$\begin{aligned}
I_{21} &= e^f\Big(\int_{\infty}^{0} (-u)\,e^{a(-u)^2}\,(-du) + \int_{0}^{\infty} u\,e^{au^2}\,du\Big) \\
&= e^f\Big(-\int_{0}^{\infty} u\,e^{au^2}\,du + \int_{0}^{\infty} u\,e^{au^2}\,du\Big) = 0
\end{aligned}$$
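Then we deal with the second term $I_{22}$:
$$I_{22} = e^f\int_{-\infty}^{\infty} d\,e^{au^2}\,du$$
Let $-w^2 = au^2 \Rightarrow w = \sqrt{-a}\,u \Rightarrow dw = \sqrt{-a}\,du$, and thus:
$$I_{22} = \frac{e^f d}{\sqrt{-a}}\int_{-\infty}^{\infty} e^{-w^2}\,dw$$
As $e^{-w^2}$ is an even function, the integral can be written as:
$$I_{22} = \frac{2e^f d}{\sqrt{-a}}\int_{0}^{\infty} e^{-w^2}\,dw$$
Let $w^2 = t \Rightarrow w = \sqrt{t} \Rightarrow dw = \frac{1}{2\sqrt{t}}\,dt$, and thus:
$$I_{22} = \frac{2e^f d}{\sqrt{-a}}\int_{0}^{\infty} \frac{1}{2}\,t^{-\frac{1}{2}} e^{-t}\,dt = \frac{e^f d}{\sqrt{-a}}\int_{0}^{\infty} t^{\frac{1}{2}-1} e^{-t}\,dt = \frac{e^f d}{\sqrt{-a}}\,\Gamma\Big(\frac{1}{2}\Big) = e^f d\sqrt{\frac{\pi}{-a}}$$
Thus, the second constraint can be rewritten as
$$e^f d\sqrt{\frac{\pi}{-a}} = \mu \qquad (\ast\ast)$$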
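Combining (∗) and (∗∗), we obtain $d = \mu$. Recall that:
$$d = -\frac{b}{2a} = -\frac{\lambda_2 - 2\lambda_3\mu}{2\lambda_3} = \mu \;\Rightarrow\; \lambda_2 - 2\lambda_3\mu = -2\lambda_3\mu \;\Rightarrow\; \lambda_2 = 0$$
So far, we have:
$$b = -2\lambda_3\mu$$
and
$$f = c - \frac{b^2}{4a} = \lambda_3\mu^2 + \lambda_1 - 1 - \frac{4\lambda_3^2\mu^2}{4\lambda_3} = \lambda_1 - 1$$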
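Finally, we deal with the third and last constraint. Substituting $\lambda_2 = 0$ into the last constraint, we have:
$$I_3 = \int_{-\infty}^{\infty} (x-\mu)^2 e^{[\lambda_3(x-\mu)^2 + \lambda_1 - 1]}\,dx = e^{\lambda_1-1}\int_{-\infty}^{\infty} (x-\mu)^2 e^{\lambda_3(x-\mu)^2}\,dx$$
Let $u = x - \mu \Rightarrow du = dx$, and thus:
$$I_3 = e^{\lambda_1-1}\int_{-\infty}^{\infty} u^2 e^{\lambda_3 u^2}\,du$$
Let $-w^2 = \lambda_3 u^2 \Rightarrow w = \sqrt{-\lambda_3}\,u \Rightarrow dw = \sqrt{-\lambda_3}\,du$, and thus:
$$I_3 = e^{\lambda_1-1}\int_{-\infty}^{\infty} \frac{w^2}{-\lambda_3}\,e^{-w^2}\,\frac{dw}{\sqrt{-\lambda_3}} = \frac{e^{\lambda_1-1}}{(-\lambda_3)^{3/2}}\int_{-\infty}^{\infty} w^2 e^{-w^2}\,dw$$
Because the integrand is an even function, we further obtain:
$$I_3 = \frac{2e^{\lambda_1-1}}{(-\lambda_3)^{3/2}}\int_{0}^{\infty} w^2 e^{-w^2}\,dw$$
Let $w^2 = t \Rightarrow w = \sqrt{t} \Rightarrow dw = \frac{1}{2\sqrt{t}}\,dt$, and thus:
$$\begin{aligned}
I_3 &= \frac{2e^{\lambda_1-1}}{(-\lambda_3)^{3/2}}\int_{0}^{\infty} t\,e^{-t}\,\frac{1}{2\sqrt{t}}\,dt = \frac{e^{\lambda_1-1}}{(-\lambda_3)^{3/2}}\int_{0}^{\infty} t^{1-\frac{1}{2}} e^{-t}\,dt \\
&= \frac{e^{\lambda_1-1}}{(-\lambda_3)^{3/2}}\int_{0}^{\infty} t^{\frac{3}{2}-1} e^{-t}\,dt \\
&= \frac{e^{\lambda_1-1}}{(-\lambda_3)^{3/2}}\,\Gamma\Big(\frac{3}{2}\Big) = \frac{e^{\lambda_1-1}\sqrt{\pi}}{2\,(-\lambda_3)^{3/2}}
\end{aligned}$$
Thus, the third constraint can be rewritten as
$$\frac{e^{\lambda_1-1}\sqrt{\pi}}{2\,(-\lambda_3)^{3/2}} = \sigma^2 \qquad (\ast\ast\ast)$$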
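Rewriting (∗) with $f = \lambda_1 - 1$, $d = \mu$ and $a = \lambda_3$, we obtain the following equation:
$$e^{\lambda_1-1}\sqrt{\frac{\pi}{-\lambda_3}} = 1 \qquad (\ast\ast\ast\ast)$$
Substituting the equation above back into (∗∗∗), we obtain
$$\sqrt{\frac{-\lambda_3}{\pi}}\cdot\frac{\sqrt{\pi}}{2\,(-\lambda_3)^{3/2}} = \sigma^2 \;\Leftrightarrow\; -\frac{1}{\lambda_3} = 2\sigma^2 \;\Leftrightarrow\; \lambda_3 = -\frac{1}{2\sigma^2}$$
Substituting $\lambda_3$ back into (∗∗∗∗), we obtain:
$$e^{\lambda_1-1}\sqrt{\frac{\pi}{-\lambda_3}} = 1 \;\Leftrightarrow\; e^{\lambda_1-1}\sqrt{2\pi\sigma^2} = 1 \;\Leftrightarrow\; e^{\lambda_1-1} = \frac{1}{\sqrt{2\pi\sigma^2}} \;\Leftrightarrow\; \lambda_1 - 1 = \ln\Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)$$
Thus, we obtain:
$$\lambda_1 = 1 - \frac{1}{2}\ln(2\pi\sigma^2)$$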
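So far, we have obtained $\lambda_i$, where $i = 1, 2, 3$. We substitute them back into p(x), yielding:
$$\begin{aligned}
p(x) &= \exp\Big(-1 + 1 - \frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(x-\mu)^2\Big) \\
&= \exp\Big(-\frac{1}{2}\ln(2\pi\sigma^2)\Big)\exp\Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big) \\
&= \exp\Big(\ln\Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)\Big)\exp\Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big)
\end{aligned}$$
Thus,
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big)$$
just as required.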
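The three constraints can also be checked numerically. The following minimal sketch plugs the multipliers derived above into p(x) and evaluates the three integrals with a simple Riemann sum. The function names, the grid and the sample values of µ and σ are illustrative assumptions.

```python
import math

def max_entropy_density(x, mu, sigma):
    """p(x) = exp(-1 + lam1 + lam2*x + lam3*(x-mu)^2) with the multipliers derived above."""
    lam1 = 1.0 - 0.5 * math.log(2.0 * math.pi * sigma**2)
    lam2 = 0.0
    lam3 = -1.0 / (2.0 * sigma**2)
    return math.exp(-1.0 + lam1 + lam2 * x + lam3 * (x - mu) ** 2)

def constraint_integrals(mu, sigma, n=40001, width=12.0):
    """Riemann sums for the normalization, mean and variance constraints (assumed grid)."""
    lo = mu - width * sigma
    dx = 2 * width * sigma / (n - 1)
    mass = mean = var = 0.0
    for i in range(n):
        x = lo + i * dx
        p = max_entropy_density(x, mu, sigma)
        mass += p * dx
        mean += x * p * dx
        var += (x - mu) ** 2 * p * dx
    return mass, mean, var

mass, mean, var = constraint_integrals(mu=1.5, sigma=0.7)
```

With these assumed values, the three sums should come out close to 1, µ = 1.5 and σ² = 0.49 respectively.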
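If $p(x) = \mathcal{N}(x|\mu, \sigma^2)$, we write its entropy:
$$\begin{aligned}
H[x] &= -\int p(x)\ln p(x)\,dx \\
&= -\int p(x)\ln\Big\{\frac{1}{\sqrt{2\pi\sigma^2}}\Big\}\,dx - \int p(x)\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}\,dx \\
&= -\ln\Big\{\frac{1}{\sqrt{2\pi\sigma^2}}\Big\} + \frac{\sigma^2}{2\sigma^2} \\
&= \frac{1}{2}\big\{1 + \ln(2\pi\sigma^2)\big\}
\end{aligned}$$
where we have taken advantage of the following properties of a Gaussian distribution:
$$\int p(x)\,dx = 1 \quad\text{and}\quad \int (x-\mu)^2 p(x)\,dx = \sigma^2$$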
Here we should make it clear that if the second derivative is strictly positive, the function must be strictly convex. However, the converse may not be true. For example, $f(x) = x^4$, $x \in \mathbb{R}$, is strictly convex by definition, but its second derivative at $x = 0$ is 0 (see the keyword convex function on Wikipedia or page 71 of the book Convex Optimization by Boyd and Vandenberghe for more details). Hence, more precisely, we will prove that a function is convex if and only if its second derivative is non-negative, by first considering the Taylor expansions:
$$f(x+\epsilon) = f(x) + \frac{f'(x)}{1!}\epsilon + \frac{f''(x)}{2!}\epsilon^2 + \frac{f'''(x)}{3!}\epsilon^3 + \dots$$
$$f(x-\epsilon) = f(x) - \frac{f'(x)}{1!}\epsilon + \frac{f''(x)}{2!}\epsilon^2 - \frac{f'''(x)}{3!}\epsilon^3 + \dots$$
Then we can obtain the expression of $f''(x)$:
$$f''(x) = \lim_{\epsilon\to 0}\frac{f(x+\epsilon) + f(x-\epsilon) - 2f(x)}{\epsilon^2}$$
where $O(\epsilon^4)$ is neglected. If f(x) is convex, we obtain:
$$f(x) = f\Big(\frac{1}{2}(x+\epsilon) + \frac{1}{2}(x-\epsilon)\Big) \le \frac{1}{2}f(x+\epsilon) + \frac{1}{2}f(x-\epsilon)$$
Hence $f''(x) \ge 0$. The converse is a little more involved. We use the Lagrange form of Taylor's theorem to rewrite the Taylor series expansion above:
$$f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x^\star)}{2}(x-x_0)^2$$
where $x^\star$ lies between $x$ and $x_0$. By hypothesis $f''(x) \ge 0$, so the last term is non-negative for all x. We let $x_0 = \lambda x_1 + (1-\lambda)x_2$ and $x = x_1$:
$$f(x_1) \ge f(x_0) + (1-\lambda)(x_1-x_2)\,f'(x_0) \qquad (\ast)$$
Then we let $x = x_2$:
$$f(x_2) \ge f(x_0) + \lambda(x_2-x_1)\,f'(x_0) \qquad (\ast\ast)$$
Multiplying (∗) by $\lambda$ and (∗∗) by $1-\lambda$ and adding them together, we see:
$$\lambda f(x_1) + (1-\lambda)f(x_2) \ge f\big(\lambda x_1 + (1-\lambda)x_2\big)$$
See Prob.1.31.
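When $M = 2$, (1.115) reduces to (1.114). Suppose (1.115) holds for $M$. We will prove that it also holds for $M + 1$.
$$\begin{aligned}
f\Big(\sum_{m=1}^{M+1}\lambda_m x_m\Big) &= f\Big(\lambda_{M+1}x_{M+1} + (1-\lambda_{M+1})\sum_{m=1}^{M}\frac{\lambda_m}{1-\lambda_{M+1}}x_m\Big) \\
&\le \lambda_{M+1}f(x_{M+1}) + (1-\lambda_{M+1})\,f\Big(\sum_{m=1}^{M}\frac{\lambda_m}{1-\lambda_{M+1}}x_m\Big) \\
&\le \lambda_{M+1}f(x_{M+1}) + (1-\lambda_{M+1})\sum_{m=1}^{M}\frac{\lambda_m}{1-\lambda_{M+1}}f(x_m) \\
&= \sum_{m=1}^{M+1}\lambda_m f(x_m)
\end{aligned}$$
The first inequality uses (1.114), and the second uses the induction hypothesis, since the weights $\lambda_m/(1-\lambda_{M+1})$ are non-negative and sum to 1. Hence, Jensen's Inequality, i.e. (1.115), has been proved.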
It is quite straightforward based on definition.
$$H[x] = -\sum_i p(x_i)\ln p(x_i) = -\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} = 0.6365$$
$$H[y] = -\sum_i p(y_i)\ln p(y_i) = -\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} = 0.6365$$
$$H[x,y] = -\sum_{i,j} p(x_i,y_j)\ln p(x_i,y_j) = -3\cdot\frac{1}{3}\ln\frac{1}{3} - 0 = 1.0986$$
$$H[x|y] = -\sum_{i,j} p(x_i,y_j)\ln p(x_i|y_j) = -\frac{1}{3}\ln 1 - \frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln\frac{1}{2} = 0.4621$$
$$H[y|x] = -\sum_{i,j} p(x_i,y_j)\ln p(y_j|x_i) = -\frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln 1 = 0.4621$$
$$\begin{aligned}
I[x,y] &= -\sum_{i,j} p(x_i,y_j)\ln\frac{p(x_i)\,p(y_j)}{p(x_i,y_j)} \\
&= -\frac{1}{3}\ln\frac{\frac{2}{3}\cdot\frac{1}{3}}{1/3} - \frac{1}{3}\ln\frac{\frac{2}{3}\cdot\frac{2}{3}}{1/3} - \frac{1}{3}\ln\frac{\frac{1}{3}\cdot\frac{2}{3}}{1/3} = 0.1744
\end{aligned}$$
Their relations are given below, diagrams omitted.
$$I[x,y] = H[x] - H[x|y] = H[y] - H[y|x]$$
$$H[x,y] = H[y|x] + H[x] = H[x|y] + H[y]$$
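The numbers above can be reproduced with a few lines of code. The joint table used below is an assumption: it is one table consistent with the marginals and joint probabilities appearing in the calculation (three cells of probability 1/3 and one empty cell), and the helper names are illustrative.

```python
import math

# Assumed joint distribution p(x, y), keyed by (x, y).
joint = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 0): 0.0, (1, 1): 1 / 3}

def entropy(dist):
    """Entropy in nats of a dict of probabilities; zero cells contribute nothing."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(table, axis):
    """Marginal distribution over the variable at position `axis` of the key."""
    out = {}
    for key, p in table.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

h_x = entropy(marginal(joint, 0))
h_y = entropy(marginal(joint, 1))
h_xy = entropy(joint)
h_x_given_y = h_xy - h_y          # H[x|y] = H[x,y] - H[y]
h_y_given_x = h_xy - h_x          # H[y|x] = H[x,y] - H[x]
mutual_info = h_x + h_y - h_xy    # I[x,y] = H[x] + H[y] - H[x,y]
```

Here the conditional entropies and the mutual information are computed through the relations listed above rather than from their defining sums, which makes the check independent of the hand calculation.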
$f(x) = \ln x$ is a strictly concave function, therefore we can use Jensen's Inequality (with the inequality reversed for a concave function) to obtain:
$$f\Big(\sum_{m=1}^{M}\lambda_m x_m\Big) \ge \sum_{m=1}^{M}\lambda_m f(x_m)$$
We let $\lambda_m = \frac{1}{M}$, $m = 1, 2, ..., M$. Hence we obtain:
$$\ln\Big(\frac{x_1 + x_2 + ... + x_M}{M}\Big) \ge \frac{1}{M}\big[\ln(x_1) + \ln(x_2) + ... + \ln(x_M)\big] = \frac{1}{M}\ln(x_1 x_2 ... x_M)$$
Since $f(x) = \ln x$ is strictly increasing, we then obtain:
$$\frac{x_1 + x_2 + ... + x_M}{M} \ge \sqrt[M]{x_1 x_2 ... x_M}$$
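A quick numerical illustration of the inequality between the arithmetic and geometric means follows. The helper names and the sample values are assumptions chosen only for illustration.

```python
import math

def arithmetic_mean(xs):
    """(x_1 + ... + x_M) / M."""
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """(x_1 * ... * x_M)^(1/M), computed through logarithms; requires positive inputs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

sample = [1.0, 2.0, 3.0, 4.5]      # arbitrary positive values
am, gm = arithmetic_mean(sample), geometric_mean(sample)
equal_case = [2.0, 2.0, 2.0]       # equality is expected when all values coincide
```

Computing the geometric mean through logarithms mirrors the proof, which applies Jensen's Inequality to ln x.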
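Based on the definition of $I[\mathbf{x},\mathbf{y}]$, i.e. (1.120), we obtain:
$$\begin{aligned}
I[\mathbf{x},\mathbf{y}] &= -\iint p(\mathbf{x},\mathbf{y})\ln\frac{p(\mathbf{x})\,p(\mathbf{y})}{p(\mathbf{x},\mathbf{y})}\,d\mathbf{x}\,d\mathbf{y} \\
&= -\iint p(\mathbf{x},\mathbf{y})\ln\frac{p(\mathbf{x})}{p(\mathbf{x}|\mathbf{y})}\,d\mathbf{x}\,d\mathbf{y} \\
&= -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x})\,d\mathbf{x}\,d\mathbf{y} + \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x}|\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \\
&= -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x} + \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x}|\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \\
&= H[\mathbf{x}] - H[\mathbf{x}|\mathbf{y}]
\end{aligned}$$
where we have taken advantage of the facts $p(\mathbf{x},\mathbf{y}) = p(\mathbf{y})\,p(\mathbf{x}|\mathbf{y})$ and $\int p(\mathbf{x},\mathbf{y})\,d\mathbf{y} = p(\mathbf{x})$. The same process can be used to prove $I[\mathbf{x},\mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y}|\mathbf{x}]$, if we substitute $p(\mathbf{x},\mathbf{y})$ with $p(\mathbf{x})\,p(\mathbf{y}|\mathbf{x})$ in the second step.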
0.2 Probability Distribution
Based on definition, we can obtain:
$$\sum_{x_i=0,1} p(x_i) = \mu + (1-\mu) = 1$$
$$E[x] = \sum_{x_i=0,1} x_i\,p(x_i) = 0\cdot(1-\mu) + 1\cdot\mu = \mu$$
$$\begin{aligned}
\mathrm{var}[x] &= \sum_{x_i=0,1} (x_i - E[x])^2\,p(x_i) \\
&= (0-\mu)^2(1-\mu) + (1-\mu)^2\cdot\mu \\
&= \mu(1-\mu)
\end{aligned}$$
$$H[x] = -\sum_{x_i=0,1} p(x_i)\ln p(x_i) = -\mu\ln\mu - (1-\mu)\ln(1-\mu)$$
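The proof in Prob.2.1 can also be used here.
$$\sum_{x_i=-1,1} p(x_i) = \frac{1-\mu}{2} + \frac{1+\mu}{2} = 1$$
$$E[x] = \sum_{x_i=-1,1} x_i\cdot p(x_i) = -1\cdot\frac{1-\mu}{2} + 1\cdot\frac{1+\mu}{2} = \mu$$
$$\begin{aligned}
\mathrm{var}[x] &= \sum_{x_i=-1,1} (x_i - E[x])^2\cdot p(x_i) \\
&= (-1-\mu)^2\cdot\frac{1-\mu}{2} + (1-\mu)^2\cdot\frac{1+\mu}{2} \\
&= 1 - \mu^2
\end{aligned}$$
$$H[x] = -\sum_{x_i=-1,1} p(x_i)\cdot\ln p(x_i) = -\frac{1-\mu}{2}\cdot\ln\frac{1-\mu}{2} - \frac{1+\mu}{2}\cdot\ln\frac{1+\mu}{2}$$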
(2.262) is an important property of combinations, which we have used before, such as in Prob.1.15. We will use the 'old fashioned' notation $C_N^m$ to represent choosing m objects from a total of N. With the prior knowledge:
$$C_N^m = \frac{N!}{m!\,(N-m)!}$$
we evaluate the left side of (2.262):
$$\begin{aligned}
C_N^m + C_N^{m-1} &= \frac{N!}{m!\,(N-m)!} + \frac{N!}{(m-1)!\,(N-(m-1))!} \\
&= \frac{N!}{(m-1)!\,(N-m)!}\Big(\frac{1}{m} + \frac{1}{N-m+1}\Big) \\
&= \frac{(N+1)!}{m!\,(N+1-m)!} = C_{N+1}^m
\end{aligned}$$
To prove (2.263), we will prove a more general form:
$$(x+y)^N = \sum_{m=0}^{N} C_N^m\,x^m y^{N-m} \qquad (\ast)$$
If we let $y = 1$, (∗) reduces to (2.263). We will prove it by induction. First, it is obvious that (∗) holds when $N = 1$. We assume that it holds for $N$, and prove that it also holds for $N + 1$.
$$\begin{aligned}
(x+y)^{N+1} &= (x+y)\sum_{m=0}^{N} C_N^m\,x^m y^{N-m} \\
&= x\sum_{m=0}^{N} C_N^m\,x^m y^{N-m} + y\sum_{m=0}^{N} C_N^m\,x^m y^{N-m} \\
&= \sum_{m=0}^{N} C_N^m\,x^{m+1} y^{N-m} + \sum_{m=0}^{N} C_N^m\,x^m y^{N+1-m} \\
&= \sum_{m=1}^{N+1} C_N^{m-1}\,x^m y^{N+1-m} + \sum_{m=0}^{N} C_N^m\,x^m y^{N+1-m} \\
&= \sum_{m=1}^{N} \big(C_N^{m-1} + C_N^m\big)\,x^m y^{N+1-m} + x^{N+1} + y^{N+1} \\
&= \sum_{m=1}^{N} C_{N+1}^m\,x^m y^{N+1-m} + x^{N+1} + y^{N+1} \\
&= \sum_{m=0}^{N+1} C_{N+1}^m\,x^m y^{N+1-m}
\end{aligned}$$
So far, we have proved (∗). Therefore, if we let $y = 1$ in (∗), (2.263) has been proved. If we let $x = \mu$ and $y = 1 - \mu$, (2.264) has been proved.
The solution has already been given in the problem, but we will solve it in a more intuitive way, beginning from the definition:
$$\begin{aligned}
E[m] &= \sum_{m=0}^{N} m\,C_N^m\,\mu^m (1-\mu)^{N-m} \\
&= \sum_{m=1}^{N} m\,C_N^m\,\mu^m (1-\mu)^{N-m} \\
&= \sum_{m=1}^{N} \frac{N!}{(m-1)!\,(N-m)!}\,\mu^m (1-\mu)^{N-m} \\
&= N\mu\sum_{m=1}^{N} \frac{(N-1)!}{(m-1)!\,(N-m)!}\,\mu^{m-1} (1-\mu)^{N-m} \\
&= N\mu\sum_{m=1}^{N} C_{N-1}^{m-1}\,\mu^{m-1} (1-\mu)^{N-m} \\
&= N\mu\sum_{k=0}^{N-1} C_{N-1}^{k}\,\mu^{k} (1-\mu)^{N-1-k} \\
&= N\mu\,[\mu + (1-\mu)]^{N-1} = N\mu
\end{aligned}$$
Some details should be explained here. We note that the $m = 0$ term does not affect the expectation, so we let the summation begin from $m = 1$ (this is what we have done from the first step to the second step). Moreover, in the second last step, we re-index the summation by letting $k = m - 1$. And in the last step, we have taken advantage of (2.264). The variance is straightforward once the expectation has been calculated.
$$\begin{aligned}
\mathrm{var}[m] &= E[m^2] - E[m]^2 \\
&= \sum_{m=0}^{N} m^2 C_N^m\,\mu^m (1-\mu)^{N-m} - E[m]\cdot E[m] \\
&= \sum_{m=0}^{N} m^2 C_N^m\,\mu^m (1-\mu)^{N-m} - (N\mu)\cdot\sum_{m=0}^{N} m\,C_N^m\,\mu^m (1-\mu)^{N-m} \\
&= \sum_{m=1}^{N} m^2 C_N^m\,\mu^m (1-\mu)^{N-m} - N\mu\cdot\sum_{m=1}^{N} m\,C_N^m\,\mu^m (1-\mu)^{N-m} \\
&= \sum_{m=1}^{N} m\,\frac{N!}{(m-1)!\,(N-m)!}\,\mu^m (1-\mu)^{N-m} - (N\mu)\cdot\sum_{m=1}^{N} m\,C_N^m\,\mu^m (1-\mu)^{N-m} \\
&= N\mu\sum_{m=1}^{N} m\,\frac{(N-1)!}{(m-1)!\,(N-m)!}\,\mu^{m-1} (1-\mu)^{N-m} - N\mu\cdot\sum_{m=1}^{N} m\,C_N^m\,\mu^m (1-\mu)^{N-m} \\
&= N\mu\sum_{m=1}^{N} m\,\mu^{m-1} (1-\mu)^{N-m}\big(C_{N-1}^{m-1} - \mu\,C_N^m\big)
\end{aligned}$$
Here we will use a little trick, $-\mu = -1 + (1-\mu)$, and then take advantage of the property $C_N^m = C_{N-1}^m + C_{N-1}^{m-1}$.
$$\begin{aligned}
\mathrm{var}[m] &= N\mu\sum_{m=1}^{N} m\,\mu^{m-1} (1-\mu)^{N-m}\big[C_{N-1}^{m-1} - C_N^m + (1-\mu)\,C_N^m\big] \\
&= N\mu\sum_{m=1}^{N} m\,\mu^{m-1} (1-\mu)^{N-m}\big[(1-\mu)\,C_N^m + C_{N-1}^{m-1} - C_N^m\big] \\
&= N\mu\sum_{m=1}^{N} m\,\mu^{m-1} (1-\mu)^{N-m}\big[(1-\mu)\,C_N^m - C_{N-1}^m\big] \\
&= N\mu\Big\{\sum_{m=1}^{N} m\,\mu^{m-1} (1-\mu)^{N-m+1} C_N^m - \sum_{m=1}^{N} m\,\mu^{m-1} (1-\mu)^{N-m} C_{N-1}^m\Big\} \\
&= N\mu\cdot\big\{N(1-\mu)[\mu + (1-\mu)]^{N-1} - (N-1)(1-\mu)[\mu + (1-\mu)]^{N-2}\big\} \\
&= N\mu\big\{N(1-\mu) - (N-1)(1-\mu)\big\} = N\mu(1-\mu)
\end{aligned}$$
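The mean and variance can be confirmed by summing the binomial distribution directly. The sketch below is a minimal check under assumed values of N and µ; the function name is hypothetical.

```python
import math

def binomial_moments(n, mu):
    """Return (total probability, mean, variance) of Bin(m | n, mu) by direct summation."""
    pmf = [math.comb(n, m) * mu**m * (1 - mu) ** (n - m) for m in range(n + 1)]
    total = sum(pmf)
    mean = sum(m * p for m, p in enumerate(pmf))
    second_moment = sum(m * m * p for m, p in enumerate(pmf))
    return total, mean, second_moment - mean**2

total, mean, variance = binomial_moments(n=12, mu=0.3)
```

For these assumed values, the results should match 1, Nµ = 3.6 and Nµ(1 − µ) = 2.52 up to floating-point error.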
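Hints have already been given in the description. Let's make a small improvement by introducing $t = y + x$ and $x = t\mu$ at the same time, i.e. we make the following change of variables:
$$\begin{cases} x = t\mu \\ y = t(1-\mu) \end{cases} \quad\text{and}\quad \begin{cases} t = x + y \\ \mu = \dfrac{x}{x+y} \end{cases}$$
Note that $t \in [0, +\infty)$ and $\mu \in (0, 1)$, and that when we change variables in an integral, we introduce an additional factor, the Jacobian determinant.
$$\frac{\partial(x,y)}{\partial(\mu,t)} = \begin{vmatrix} \dfrac{\partial x}{\partial\mu} & \dfrac{\partial x}{\partial t} \\ \dfrac{\partial y}{\partial\mu} & \dfrac{\partial y}{\partial t} \end{vmatrix} = \begin{vmatrix} t & \mu \\ -t & 1-\mu \end{vmatrix} = t$$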
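Now we can calculate the integral.
$$\begin{aligned}
\Gamma(a)\Gamma(b) &= \int_{0}^{+\infty} \exp(-x)\,x^{a-1}\,dx\int_{0}^{+\infty} \exp(-y)\,y^{b-1}\,dy \\
&= \int_{0}^{+\infty}\int_{0}^{+\infty} \exp(-x)\,x^{a-1}\exp(-y)\,y^{b-1}\,dy\,dx \\
&= \int_{0}^{+\infty}\int_{0}^{+\infty} \exp(-x-y)\,x^{a-1}y^{b-1}\,dy\,dx \\
&= \int_{0}^{1}\int_{0}^{+\infty} \exp(-t)\,(t\mu)^{a-1}\big(t(1-\mu)\big)^{b-1}\,t\,dt\,d\mu \\
&= \int_{0}^{+\infty} \exp(-t)\,t^{a+b-1}\,dt\cdot\int_{0}^{1} \mu^{a-1}(1-\mu)^{b-1}\,d\mu \\
&= \Gamma(a+b)\cdot\int_{0}^{1} \mu^{a-1}(1-\mu)^{b-1}\,d\mu
\end{aligned}$$
Therefore, we have obtained:
$$\int_{0}^{1} \mu^{a-1}(1-\mu)^{b-1}\,d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$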
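The identity can be checked numerically with the standard library. The sketch below compares a midpoint-rule estimate of the integral with the Gamma-function expression. The function names, the number of grid points and the values of a and b (chosen to be at least 1 so that the integrand has no singularity at the endpoints) are assumptions.

```python
import math

def beta_integral(a, b, n=20000):
    """Midpoint-rule estimate of the integral of mu^(a-1) (1-mu)^(b-1) over [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        mu = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += mu ** (a - 1) * (1 - mu) ** (b - 1)
    return total * h

def beta_from_gamma(a, b):
    """Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

numeric = beta_integral(2.5, 3.0)
exact = beta_from_gamma(2.5, 3.0)
```

The midpoint rule is used here because it never evaluates the integrand at 0 or 1.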
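We will solve this problem based on definition.
$$\begin{aligned}
E[\mu] &= \int_{0}^{1} \mu\,\mathrm{Beta}(\mu|a,b)\,d\mu \\
&= \int_{0}^{1} \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\mu^{a}(1-\mu)^{b-1}\,d\mu \\
&= \frac{\Gamma(a+b)\,\Gamma(a+1)}{\Gamma(a+1+b)\,\Gamma(a)}\int_{0}^{1} \frac{\Gamma(a+1+b)}{\Gamma(a+1)\,\Gamma(b)}\,\mu^{a}(1-\mu)^{b-1}\,d\mu \\
&= \frac{\Gamma(a+b)\,\Gamma(a+1)}{\Gamma(a+1+b)\,\Gamma(a)}\int_{0}^{1} \mathrm{Beta}(\mu|a+1,b)\,d\mu \\
&= \frac{\Gamma(a+b)}{\Gamma(a+1+b)}\cdot\frac{\Gamma(a+1)}{\Gamma(a)} \\
&= \frac{a}{a+b}
\end{aligned}$$
where we have taken advantage of the property $\Gamma(z+1) = z\,\Gamma(z)$. The variance is quite similar. We first evaluate $E[\mu^2]$.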
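$$\begin{aligned}
E[\mu^2] &= \int_{0}^{1} \mu^2\,\mathrm{Beta}(\mu|a,b)\,d\mu \\
&= \int_{0}^{1} \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\mu^{a+1}(1-\mu)^{b-1}\,d\mu \\
&= \frac{\Gamma(a+b)\,\Gamma(a+2)}{\Gamma(a+2+b)\,\Gamma(a)}\int_{0}^{1} \frac{\Gamma(a+2+b)}{\Gamma(a+2)\,\Gamma(b)}\,\mu^{a+1}(1-\mu)^{b-1}\,d\mu \\
&= \frac{\Gamma(a+b)\,\Gamma(a+2)}{\Gamma(a+2+b)\,\Gamma(a)}\int_{0}^{1} \mathrm{Beta}(\mu|a+2,b)\,d\mu \\
&= \frac{\Gamma(a+b)}{\Gamma(a+2+b)}\cdot\frac{\Gamma(a+2)}{\Gamma(a)} \\
&= \frac{a(a+1)}{(a+b)(a+b+1)}
\end{aligned}$$
Then we use the formula $\mathrm{var}[\mu] = E[\mu^2] - E[\mu]^2$.
$$\begin{aligned}
\mathrm{var}[\mu] &= \frac{a(a+1)}{(a+b)(a+b+1)} - \Big(\frac{a}{a+b}\Big)^2 \\
&= \frac{ab}{(a+b)^2(a+b+1)}
\end{aligned}$$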
The maximum likelihood estimate for µ, i.e. (2.8), can be written as:
$$\mu_{ML} = \frac{m}{m+l}$$
where m is the number of times we observe 'head' and l is the number of times we observe 'tail'. The prior mean of µ is given by (2.15), and the posterior mean of µ is given by (2.20). Therefore, we will prove that $(m+a)/(m+a+l+b)$ lies between $m/(m+l)$ and $a/(a+b)$. This follows from the fact that:
$$\lambda\,\frac{a}{a+b} + (1-\lambda)\,\frac{m}{m+l} = \frac{m+a}{m+a+l+b}, \quad\text{where } \lambda = \frac{a+b}{m+l+a+b}$$
Since $0 \le \lambda \le 1$, the posterior mean is a convex combination of the prior mean and the maximum likelihood estimate, which solves the problem. Note: you can also solve it in a simpler way by proving that:
$$\Big(\frac{m+a}{m+a+l+b} - \frac{a}{a+b}\Big)\cdot\Big(\frac{m+a}{m+a+l+b} - \frac{m}{m+l}\Big) \le 0$$
The expression above can be proved by reducing the fractions to a common denominator.
We solve it based on definition.
$$\begin{aligned}
E_y\big[E_x[x|y]\big] &= \int E_x[x|y]\,p(y)\,dy \\
&= \int\Big(\int x\,p(x|y)\,dx\Big)p(y)\,dy \\
&= \iint x\,p(x|y)\,p(y)\,dx\,dy \\
&= \iint x\,p(x,y)\,dx\,dy \\
&= \int x\,p(x)\,dx = E[x]
\end{aligned}$$
(2.271) is complicated, so we will calculate every term separately.
$$\begin{aligned}
E_y\big[\mathrm{var}_x[x|y]\big] &= \int \mathrm{var}_x[x|y]\,p(y)\,dy \\
&= \int\Big(\int (x - E_x[x|y])^2\,p(x|y)\,dx\Big)p(y)\,dy \\
&= \iint (x - E_x[x|y])^2\,p(x,y)\,dx\,dy \\
&= \iint \big(x^2 - 2x\,E_x[x|y] + E_x[x|y]^2\big)\,p(x,y)\,dx\,dy \\
&= \int x^2 p(x)\,dx - \iint 2x\,E_x[x|y]\,p(x,y)\,dx\,dy + \int E_x[x|y]^2\,p(y)\,dy
\end{aligned}$$
We further simplify the second term in the equation above:
$$\begin{aligned}
\iint 2x\,E_x[x|y]\,p(x,y)\,dx\,dy &= 2\int E_x[x|y]\Big(\int x\,p(x,y)\,dx\Big)dy \\
&= 2\int E_x[x|y]\,p(y)\Big(\int x\,p(x|y)\,dx\Big)dy \\
&= 2\int E_x[x|y]^2\,p(y)\,dy
\end{aligned}$$
Therefore, we obtain a simple expression for the first term on the right side of (2.271):
$$E_y\big[\mathrm{var}_x[x|y]\big] = \int x^2 p(x)\,dx - \int E_x[x|y]^2\,p(y)\,dy \qquad (\ast)$$
Then we proceed to the second term.
$$\begin{aligned}
\mathrm{var}_y\big[E_x[x|y]\big] &= \int \big(E_x[x|y] - E_y[E_x[x|y]]\big)^2 p(y)\,dy \\
&= \int \big(E_x[x|y] - E[x]\big)^2 p(y)\,dy \\
&= \int E_x[x|y]^2\,p(y)\,dy - 2\int E[x]\,E_x[x|y]\,p(y)\,dy + \int E[x]^2\,p(y)\,dy \\
&= \int E_x[x|y]^2\,p(y)\,dy - 2E[x]\int E_x[x|y]\,p(y)\,dy + E[x]^2
\end{aligned}$$
Following the same procedure, we deal with the second term of the equation above.
$$2E[x]\cdot\int E_x[x|y]\,p(y)\,dy = 2E[x]\cdot E_y\big[E_x[x|y]\big] = 2E[x]^2$$
Therefore, we obtain a simple expression for the second term on the right side of (2.271):
$$\mathrm{var}_y\big[E_x[x|y]\big] = \int E_x[x|y]^2\,p(y)\,dy - E[x]^2 \qquad (\ast\ast)$$
Finally, we add (∗) and (∗∗) to obtain:
$$E_y\big[\mathrm{var}_x[x|y]\big] + \mathrm{var}_y\big[E_x[x|y]\big] = E[x^2] - E[x]^2 = \mathrm{var}[x]$$
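The decomposition can be verified exactly on a small discrete distribution, where the integrals become sums. The joint table below is invented for illustration (an assumption), and exact rational arithmetic is used so that the identity holds without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(x, y), keyed by (x, y); probabilities sum to 1.
joint = {(0, 0): F(1, 6), (1, 0): F(1, 3), (1, 1): F(1, 4), (3, 1): F(1, 4)}

def variance_decomposition(table):
    """Return (var[x], E_y[var_x[x|y]], var_y[E_x[x|y]]) for a discrete joint table."""
    p_y = {}
    for (x, y), p in table.items():
        p_y[y] = p_y.get(y, 0) + p
    e_x = sum(x * p for (x, _), p in table.items())
    var_x = sum((x - e_x) ** 2 * p for (x, _), p in table.items())
    # Conditional mean and variance of x for each value of y.
    cond_mean = {y: sum(x * p for (x, yy), p in table.items() if yy == y) / p_y[y]
                 for y in p_y}
    cond_var = {y: sum((x - cond_mean[y]) ** 2 * p
                       for (x, yy), p in table.items() if yy == y) / p_y[y]
                for y in p_y}
    e_of_var = sum(cond_var[y] * p_y[y] for y in p_y)
    var_of_e = sum((cond_mean[y] - e_x) ** 2 * p_y[y] for y in p_y)
    return var_x, e_of_var, var_of_e

var_x, e_of_var, var_of_e = variance_decomposition(joint)
```

Because every quantity is a `Fraction`, the sum of the last two returned values equals the first one exactly.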
This problem is complicated, but hints have already been given in the description. Let's begin by integrating (2.272) over $\mu_{M-1}$. (Note: by integrating over $\mu_{M-1}$, we actually obtain a Dirichlet distribution with $M-1$ variables.)
$$\begin{aligned}
p_{M-1}(\mu_1, ..., \mu_{M-2}) &= \int_{0}^{1-\mu_1-...-\mu_{M-2}} C_M\prod_{k=1}^{M-1}\mu_k^{\alpha_k-1}\Big(1-\sum_{j=1}^{M-1}\mu_j\Big)^{\alpha_M-1}d\mu_{M-1} \\
&= C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\int_{0}^{1-\mu_1-...-\mu_{M-2}} \mu_{M-1}^{\alpha_{M-1}-1}\Big(1-\sum_{j=1}^{M-1}\mu_j\Big)^{\alpha_M-1}d\mu_{M-1}
\end{aligned}$$
We change variable by:
$$t = \frac{\mu_{M-1}}{1-\mu_1-...-\mu_{M-2}}$$
The reason we do so is that $\mu_{M-1} \in [0,\, 1-\mu_1-...-\mu_{M-2}]$, so after this change of variable $t \in [0, 1]$. Then we can further simplify the expression.
$$\begin{aligned}
p_{M-1} &= C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\Big(1-\sum_{j=1}^{M-2}\mu_j\Big)^{\alpha_{M-1}+\alpha_M-1}\int_{0}^{1}\frac{\mu_{M-1}^{\alpha_{M-1}-1}\big(1-\sum_{j=1}^{M-1}\mu_j\big)^{\alpha_M-1}}{(1-\mu_1-...-\mu_{M-2})^{\alpha_{M-1}+\alpha_M-2}}\,dt \\
&= C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\Big(1-\sum_{j=1}^{M-2}\mu_j\Big)^{\alpha_{M-1}+\alpha_M-1}\int_{0}^{1} t^{\alpha_{M-1}-1}(1-t)^{\alpha_M-1}\,dt \\
&= C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\Big(1-\sum_{j=1}^{M-2}\mu_j\Big)^{\alpha_{M-1}+\alpha_M-1}\frac{\Gamma(\alpha_{M-1})\,\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)}
\end{aligned}$$
Comparing the expression above with a normalized Dirichlet distribution with $M-1$ variables, and supposing that (2.272) holds for $M-1$, we obtain:
$$C_M\,\frac{\Gamma(\alpha_{M-1})\,\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)} = \frac{\Gamma(\alpha_1+\alpha_2+...+\alpha_M)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_{M-2})\,\Gamma(\alpha_{M-1}+\alpha_M)}$$
Therefore, we obtain
$$C_M = \frac{\Gamma(\alpha_1+\alpha_2+...+\alpha_M)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_{M-1})\,\Gamma(\alpha_M)}$$
as required.

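Based on the definition of expectation and (2.38), we can write:
$$\begin{aligned}
E[\mu_j] &= \int \mu_j\,\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\,d\boldsymbol{\mu} \\
&= \int \mu_j\,\frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k-1}\,d\boldsymbol{\mu} \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_K)}\int \mu_j\prod_{k=1}^{K}\mu_k^{\alpha_k-1}\,d\boldsymbol{\mu} \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_K)}\cdot\frac{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_{j-1})\,\Gamma(\alpha_j+1)\,\Gamma(\alpha_{j+1})\,...\,\Gamma(\alpha_K)}{\Gamma(\alpha_0+1)} \\
&= \frac{\Gamma(\alpha_0)\,\Gamma(\alpha_j+1)}{\Gamma(\alpha_j)\,\Gamma(\alpha_0+1)} = \frac{\alpha_j}{\alpha_0}
\end{aligned}$$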
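The variance is handled in the same way. Let's begin by calculating $E[\mu_j^2]$.
$$\begin{aligned}
E[\mu_j^2] &= \int \mu_j^2\,\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\,d\boldsymbol{\mu} \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_K)}\int \mu_j^2\prod_{k=1}^{K}\mu_k^{\alpha_k-1}\,d\boldsymbol{\mu} \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_K)}\cdot\frac{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_{j-1})\,\Gamma(\alpha_j+2)\,\Gamma(\alpha_{j+1})\,...\,\Gamma(\alpha_K)}{\Gamma(\alpha_0+2)} \\
&= \frac{\Gamma(\alpha_0)\,\Gamma(\alpha_j+2)}{\Gamma(\alpha_j)\,\Gamma(\alpha_0+2)} = \frac{\alpha_j(\alpha_j+1)}{\alpha_0(\alpha_0+1)}
\end{aligned}$$
Hence, we obtain:
$$\mathrm{var}[\mu_j] = E[\mu_j^2] - E[\mu_j]^2 = \frac{\alpha_j(\alpha_j+1)}{\alpha_0(\alpha_0+1)} - \Big(\frac{\alpha_j}{\alpha_0}\Big)^2 = \frac{\alpha_j(\alpha_0-\alpha_j)}{\alpha_0^2(\alpha_0+1)}$$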
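The covariance is handled in the same way.
$$\begin{aligned}
\mathrm{cov}[\mu_j\mu_l] &= \int (\mu_j - E[\mu_j])(\mu_l - E[\mu_l])\,\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\,d\boldsymbol{\mu} \\
&= \int \big(\mu_j\mu_l - E[\mu_j]\mu_l - E[\mu_l]\mu_j + E[\mu_j]E[\mu_l]\big)\,\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\,d\boldsymbol{\mu} \\
&= \frac{\Gamma(\alpha_0)\,\Gamma(\alpha_j+1)\,\Gamma(\alpha_l+1)}{\Gamma(\alpha_j)\,\Gamma(\alpha_l)\,\Gamma(\alpha_0+2)} - 2E[\mu_j]E[\mu_l] + E[\mu_j]E[\mu_l] \\
&= \frac{\alpha_j\alpha_l}{\alpha_0(\alpha_0+1)} - E[\mu_j]E[\mu_l] \\
&= \frac{\alpha_j\alpha_l}{\alpha_0(\alpha_0+1)} - \frac{\alpha_j\alpha_l}{\alpha_0^2} \\
&= -\frac{\alpha_j\alpha_l}{\alpha_0^2(\alpha_0+1)} \qquad (j \neq l)
\end{aligned}$$
Note: when $j = l$, $\mathrm{cov}[\mu_j\mu_l]$ reduces to $\mathrm{var}[\mu_j]$. However, we cannot simply replace $l$ with $j$ in the expression for $\mathrm{cov}[\mu_j\mu_l]$ to get the right result, because $\int \mu_j\mu_l\,\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\,d\boldsymbol{\mu}$ reduces to $\int \mu_j^2\,\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\,d\boldsymbol{\mu}$ in this case.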
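One consistency check on these formulas: since the components of µ sum to 1, the covariance of µ_j with the sum of all components must vanish, so each row of the covariance matrix sums to zero. The sketch below verifies this exactly for an assumed integer parameter vector; the function name and the values of α are hypothetical.

```python
from fractions import Fraction

def dirichlet_cov(alpha, j, l):
    """cov[mu_j, mu_l] of Dir(mu | alpha), using the variance formula when j == l."""
    a0 = sum(alpha)
    if j == l:
        return Fraction(alpha[j] * (a0 - alpha[j]), a0**2 * (a0 + 1))
    return Fraction(-alpha[j] * alpha[l], a0**2 * (a0 + 1))

alpha = [2, 3, 5]  # assumed parameters, alpha_0 = 10
row_sums = [sum(dirichlet_cov(alpha, j, l) for l in range(len(alpha)))
            for j in range(len(alpha))]
```

This also shows why the diagonal case needs its own formula: plugging j = l into the off-diagonal expression would give a negative "variance" and the rows would no longer sum to zero.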
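Based on the definition of expectation and (2.38), we first denote:
$$\frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,...\,\Gamma(\alpha_K)} = K(\boldsymbol{\alpha})$$
Then we can write:
$$\begin{aligned}
\frac{\partial\,\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})}{\partial\alpha_j} &= \frac{\partial}{\partial\alpha_j}\Big(K(\boldsymbol{\alpha})\prod_{i=1}^{K}\mu_i^{\alpha_i-1}\Big) \\
&= \frac{\partial K(\boldsymbol{\alpha})}{\partial\alpha_j}\prod_{i=1}^{K}\mu_i^{\alpha_i-1} + K(\boldsymbol{\alpha})\,\frac{\partial\prod_{i=1}^{K}\mu_i^{\alpha_i-1}}{\partial\alpha_j} \\
&= \frac{\partial K(\boldsymbol{\alpha})}{\partial\alpha_j}\prod_{i=1}^{K}\mu_i^{\alpha_i-1} + \ln\mu_j\cdot\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})
\end{aligned}$$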
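Then let us integrate both sides:
$$\int \frac{\partial\,\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})}{\partial\alpha_j}\,d\boldsymbol{\mu} = \int \frac{\partial K(\boldsymbol{\alpha})}{\partial\alpha_j}\prod_{i=1}^{K}\mu_i^{\alpha_i-1}\,d\boldsymbol{\mu} + \int \ln\mu_j\cdot\mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\,d\boldsymbol{\mu}$$
The left side can be further simplified as:
$$\text{left side} = \frac{\partial\int \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\,d\boldsymbol{\mu}}{\partial\alpha_j} = \frac{\partial 1}{\partial\alpha_j} = 0$$
The right side can be further simplified as:
$$\begin{aligned}
\text{right side} &= \frac{\partial K(\boldsymbol{\alpha})}{\partial\alpha_j}\int \prod_{i=1}^{K}\mu_i^{\alpha_i-1}\,d\boldsymbol{\mu} + E[\ln\mu_j] \\
&= \frac{\partial K(\boldsymbol{\alpha})}{\partial\alpha_j}\,\frac{1}{K(\boldsymbol{\alpha})} + E[\ln\mu_j] \\
&= \frac{\partial\ln K(\boldsymbol{\alpha})}{\partial\alpha_j} + E[\ln\mu_j]
\end{aligned}$$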
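Therefore, we obtain:
$$\begin{aligned}
E[\ln\mu_j] &= -\frac{\partial\ln K(\boldsymbol{\alpha})}{\partial\alpha_j} \\
&= -\frac{\partial\big\{\ln\Gamma(\alpha_0) - \sum_{i=1}^{K}\ln\Gamma(\alpha_i)\big\}}{\partial\alpha_j} \\
&= \frac{\partial\ln\Gamma(\alpha_j)}{\partial\alpha_j} - \frac{\partial\ln\Gamma(\alpha_0)}{\partial\alpha_j} \\
&= \frac{\partial\ln\Gamma(\alpha_j)}{\partial\alpha_j} - \frac{\partial\ln\Gamma(\alpha_0)}{\partial\alpha_0}\,\frac{\partial\alpha_0}{\partial\alpha_j} \\
&= \frac{\partial\ln\Gamma(\alpha_j)}{\partial\alpha_j} - \frac{\partial\ln\Gamma(\alpha_0)}{\partial\alpha_0} \\
&= \psi(\alpha_j) - \psi(\alpha_0)
\end{aligned}$$
Therefore, the problem has been solved.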
Since we have:
$$\int_{a}^{b}\frac{1}{b-a}\,dx = 1$$
it is straightforward that the distribution is normalized. Then we calculate its mean:
$$E[x] = \int_{a}^{b} x\,\frac{1}{b-a}\,dx = \frac{x^2}{2(b-a)}\Big|_{a}^{b} = \frac{a+b}{2}$$
Then we calculate its variance.
$$\mathrm{var}[x] = E[x^2] - E[x]^2 = \int_{a}^{b}\frac{x^2}{b-a}\,dx - \Big(\frac{a+b}{2}\Big)^2 = \frac{x^3}{3(b-a)}\Big|_{a}^{b} - \Big(\frac{a+b}{2}\Big)^2$$
Hence we obtain:
$$\mathrm{var}[x] = \frac{(b-a)^2}{12}$$
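This problem is an extension of Prob.1.30. We can follow the same procedure to solve it. Let's begin by calculating $\ln\frac{p(\mathbf{x})}{q(\mathbf{x})}$:
$$\ln\Big(\frac{p(\mathbf{x})}{q(\mathbf{x})}\Big) = \frac{1}{2}\ln\Big(\frac{|L|}{|\Sigma|}\Big) + \frac{1}{2}(\mathbf{x}-\mathbf{m})^T L^{-1}(\mathbf{x}-\mathbf{m}) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})$$
If $\mathbf{x} \sim p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\Sigma)$, we then take advantage of the following properties.
$$\int p(\mathbf{x})\,d\mathbf{x} = 1$$
$$E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x} = \boldsymbol{\mu}$$
$$E[(\mathbf{x}-\mathbf{a})^T A\,(\mathbf{x}-\mathbf{a})] = \mathrm{tr}(A\Sigma) + (\boldsymbol{\mu}-\mathbf{a})^T A\,(\boldsymbol{\mu}-\mathbf{a})$$
We obtain:
$$\begin{aligned}
KL &= \int\Big\{\frac{1}{2}\ln\frac{|L|}{|\Sigma|} - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) + \frac{1}{2}(\mathbf{x}-\mathbf{m})^T L^{-1}(\mathbf{x}-\mathbf{m})\Big\}\,p(\mathbf{x})\,d\mathbf{x} \\
&= \frac{1}{2}\ln\frac{|L|}{|\Sigma|} - \frac{1}{2}E[(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})] + \frac{1}{2}E[(\mathbf{x}-\mathbf{m})^T L^{-1}(\mathbf{x}-\mathbf{m})] \\
&= \frac{1}{2}\ln\frac{|L|}{|\Sigma|} - \frac{1}{2}\mathrm{tr}\{I_D\} + \frac{1}{2}(\boldsymbol{\mu}-\mathbf{m})^T L^{-1}(\boldsymbol{\mu}-\mathbf{m}) + \frac{1}{2}\mathrm{tr}\{L^{-1}\Sigma\} \\
&= \frac{1}{2}\Big[\ln\frac{|L|}{|\Sigma|} - D + \mathrm{tr}\{L^{-1}\Sigma\} + (\mathbf{m}-\boldsymbol{\mu})^T L^{-1}(\mathbf{m}-\boldsymbol{\mu})\Big]
\end{aligned}$$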
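The hint given in the problem is straightforward, but the calculation is somewhat difficult. Here we use a simpler method that takes advantage of a property of the Kullback-Leibler distance. Let $g(\mathbf{x})$ be a Gaussian PDF with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, and $f(\mathbf{x})$ an arbitrary PDF with the same mean and covariance.
$$0 \le KL(f\|g) = -\int f(\mathbf{x})\ln\Big\{\frac{g(\mathbf{x})}{f(\mathbf{x})}\Big\}\,d\mathbf{x} = -H(f) - \int f(\mathbf{x})\ln g(\mathbf{x})\,d\mathbf{x} \qquad (\ast)$$
Let's calculate the second term of the equation above.
$$\begin{aligned}
\int f(\mathbf{x})\ln g(\mathbf{x})\,d\mathbf{x} &= \int f(\mathbf{x})\ln\Big\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\Big[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big]\Big\}\,d\mathbf{x} \\
&= \int f(\mathbf{x})\ln\Big\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\Big\}\,d\mathbf{x} + \int f(\mathbf{x})\Big[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big]\,d\mathbf{x} \\
&= \ln\Big\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\Big\} - \frac{1}{2}E\big[(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\big] \\
&= \ln\Big\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\Big\} - \frac{1}{2}\mathrm{tr}\{I_D\} \\
&= -\Big\{\frac{1}{2}\ln|\Sigma| + \frac{D}{2}\big(1 + \ln(2\pi)\big)\Big\} \\
&= -H(g)
\end{aligned}$$
Here we use two properties of the PDF $f(\mathbf{x})$, which has mean $\boldsymbol{\mu}$ and covariance $\Sigma$, as listed below. We also use the result of Prob.2.15, which we will prove later.
$$\int f(\mathbf{x})\,d\mathbf{x} = 1 \quad\text{and}\quad E\big[(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\big] = \mathrm{tr}\{\Sigma^{-1}\Sigma\} = \mathrm{tr}\{I_D\} = D$$
Now we can further simplify (∗) to obtain:
$$H(g) \ge H(f)$$
In other words, we have proved that an arbitrary PDF $f(\mathbf{x})$ with the same mean and covariance as a Gaussian PDF $g(\mathbf{x})$ cannot have greater entropy than the Gaussian PDF.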
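We have already used the result of this problem to solve Prob.2.14, and now we will prove it. Suppose $\mathbf{x} \sim p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\Sigma)$:
$$\begin{aligned}
H[\mathbf{x}] &= -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x} \\
&= -\int p(\mathbf{x})\ln\Big\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\Big[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big]\Big\}\,d\mathbf{x} \\
&= -\int p(\mathbf{x})\ln\Big\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\Big\}\,d\mathbf{x} - \int p(\mathbf{x})\Big[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big]\,d\mathbf{x} \\
&= -\ln\Big\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\Big\} + \frac{1}{2}E\big[(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\big] \\
&= -\ln\Big\{\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\Big\} + \frac{1}{2}\mathrm{tr}\{I_D\} \\
&= \frac{1}{2}\ln|\Sigma| + \frac{D}{2}\big(1 + \ln(2\pi)\big)
\end{aligned}$$
where we have taken advantage of:
$$\int p(\mathbf{x})\,d\mathbf{x} = 1$$
Note: in Prob.2.14 we have in effect already solved this problem. You can see this by replacing the integrand $f(\mathbf{x})\ln g(\mathbf{x})$ with $g(\mathbf{x})\ln g(\mathbf{x})$; the same procedure as in Prob.2.14 still holds for calculating $\int g(\mathbf{x})\ln g(\mathbf{x})\,d\mathbf{x}$.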
Let us consider a more general conclusion about the Probability Density Function (PDF) of the sum of two independent random variables. We denote two random variables X and Y. Their sum $Z = X + Y$ is still a random variable. We also denote $f(\cdot)$ as the PDF and $F(\cdot)$ as the Cumulative Distribution Function (CDF). We can obtain:
$$F_Z(z) = P(Z \le z) = \iint_{x+y\le z} f_{X,Y}(x,y)\,dx\,dy$$
where z represents an arbitrary real number. We rewrite the double integral as an iterated integral:
$$F_Z(z) = \int_{-\infty}^{+\infty}\Big[\int_{-\infty}^{z-y} f_{X,Y}(x,y)\,dx\Big]dy$$
We fix z and y, and then make the change of variable $x = u - y$ in the inner integral.
$$F_Z(z) = \int_{-\infty}^{+\infty}\Big[\int_{-\infty}^{z-y} f_{X,Y}(x,y)\,dx\Big]dy = \int_{-\infty}^{+\infty}\Big[\int_{-\infty}^{z} f_{X,Y}(u-y,y)\,du\Big]dy$$
Note: $f_{X,Y}(\cdot)$ is the joint PDF of X and Y. Rearranging the order of integration, we obtain:
$$F_Z(z) = \int_{-\infty}^{z}\Big[\int_{-\infty}^{+\infty} f_{X,Y}(u-y,y)\,dy\Big]du$$
Compare the equation above with the definition of the CDF:
$$F_Z(z) = \int_{-\infty}^{z} f_Z(u)\,du$$
We can obtain:
$$f_Z(u) = \int_{-\infty}^{+\infty} f_{X,Y}(u-y,y)\,dy$$
And if X and Y are independent, which means $f_{X,Y}(x,y) = f_X(x)\,f_Y(y)$, we can simplify $f_Z$:
$$f_Z(u) = \int_{-\infty}^{+\infty} f_X(u-y)\,f_Y(y)\,dy \quad\text{i.e.}\quad f_Z = f_X * f_Y$$
So far we have proved that the PDF of the sum of two independent random variables is the convolution of their PDFs. Hence it is straightforward to see that in this problem, where the random variable $x$ is the sum of the random variables $x_1$ and $x_2$, the PDF of $x$ is the convolution of the PDFs of $x_1$ and $x_2$. To find the entropy of $x$, we will use a simple method, taking advantage of (2.113)-(2.117). With the knowledge:
$$p(x_2) = \mathcal{N}(x_2|\mu_2, \tau_2^{-1})$$
$$p(x|x_2) = \mathcal{N}(x|\mu_1 + x_2, \tau_1^{-1})$$
we make the following analogies: $x_2$ in this problem corresponds to $\mathbf{x}$ in (2.113), and $x$ in this problem corresponds to $\mathbf{y}$ in (2.114). Hence by using (2.115), we see that $p(x)$ is still a normal distribution, and since the entropy of a Gaussian is fully determined by its variance, there is no need to calculate the mean. Still by using (2.115), the variance of $x$ is $\tau_1^{-1} + \tau_2^{-1}$, which finally gives its entropy:
$$H[x] = \frac{1}{2}\Big[1 + \ln 2\pi(\tau_1^{-1} + \tau_2^{-1})\Big]$$
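This is an extension of Prob.1.14. The same procedure can be used here. We suppose an arbitrary precision matrix $\Lambda$ can be written as $\Lambda^S + \Lambda^A$, where they satisfy:
$$\Lambda^S_{ij} = \frac{\Lambda_{ij} + \Lambda_{ji}}{2}, \qquad \Lambda^A_{ij} = \frac{\Lambda_{ij} - \Lambda_{ji}}{2}$$
Hence it is straightforward that $\Lambda^S_{ij} = \Lambda^S_{ji}$ and $\Lambda^A_{ij} = -\Lambda^A_{ji}$. If we expand the quadratic form in the exponent, we obtain:
$$(\mathbf{x}-\boldsymbol{\mu})^T\Lambda(\mathbf{x}-\boldsymbol{\mu}) = \sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\,\Lambda_{ij}\,(x_j-\mu_j) \qquad (\ast)$$
It is then straightforward that:
$$\begin{aligned}
(\ast) &= \sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\,\Lambda^S_{ij}\,(x_j-\mu_j) + \sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\,\Lambda^A_{ij}\,(x_j-\mu_j) \\
&= \sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\,\Lambda^S_{ij}\,(x_j-\mu_j)
\end{aligned}$$
The second double sum vanishes because swapping the indices i and j changes its sign. Therefore, we can assume the precision matrix is symmetric, and so is the covariance matrix.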
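We will just follow the hint given in the problem. Firstly, we take the complex conjugate on both sides of (2.45):
$$\Sigma\mathbf{u}_i = \lambda_i\mathbf{u}_i \;\Rightarrow\; \Sigma\bar{\mathbf{u}}_i = \bar{\lambda}_i\bar{\mathbf{u}}_i$$
where we have taken advantage of the fact that $\Sigma$ is a real matrix, i.e. $\bar{\Sigma} = \Sigma$. Then, using that $\Sigma$ is symmetric, i.e. $\Sigma^T = \Sigma$:
$$\bar{\mathbf{u}}_i^T\Sigma\mathbf{u}_i = \bar{\mathbf{u}}_i^T(\Sigma\mathbf{u}_i) = \bar{\mathbf{u}}_i^T(\lambda_i\mathbf{u}_i) = \lambda_i\,\bar{\mathbf{u}}_i^T\mathbf{u}_i$$
$$\bar{\mathbf{u}}_i^T\Sigma\mathbf{u}_i = (\Sigma\bar{\mathbf{u}}_i)^T\mathbf{u}_i = (\bar{\lambda}_i\bar{\mathbf{u}}_i)^T\mathbf{u}_i = \bar{\lambda}_i\,\bar{\mathbf{u}}_i^T\mathbf{u}_i$$
Since $\mathbf{u}_i \neq 0$, we have $\bar{\mathbf{u}}_i^T\mathbf{u}_i \neq 0$. Thus $\bar{\lambda}_i = \lambda_i$, which means $\lambda_i$ is real. Next we will prove that two eigenvectors corresponding to different eigenvalues are orthogonal.
$$\lambda_i\langle\mathbf{u}_i,\mathbf{u}_j\rangle = \langle\lambda_i\mathbf{u}_i,\mathbf{u}_j\rangle = \langle\Sigma\mathbf{u}_i,\mathbf{u}_j\rangle = \langle\mathbf{u}_i,\Sigma^T\mathbf{u}_j\rangle = \lambda_j\langle\mathbf{u}_i,\mathbf{u}_j\rangle$$
where we have taken advantage of $\Sigma^T = \Sigma$ and the fact that for an arbitrary real matrix $A$ and vectors $\mathbf{x}$, $\mathbf{y}$, we have:
$$\langle A\mathbf{x},\mathbf{y}\rangle = \langle\mathbf{x},A^T\mathbf{y}\rangle$$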
Provided $\lambda_i \neq \lambda_j$, we have $\langle\mathbf{u}_i,\mathbf{u}_j\rangle = 0$, i.e. $\mathbf{u}_i$ and $\mathbf{u}_j$ are orthogonal. If we then normalize every eigenvector so that its Euclidean norm equals 1, (2.46) is straightforward. By normalization, we mean multiplying the eigenvector by a real number $a$ so that its Euclidean norm (length) equals 1. The corresponding eigenvalue is unchanged by this scaling, since $\Sigma(a\mathbf{u}_i) = \lambda_i(a\mathbf{u}_i)$.
For every $N \times N$ real symmetric matrix, the eigenvalues are real and the eigenvectors can be chosen such that they are orthogonal to each other. Thus a real symmetric matrix $\Sigma$ can be decomposed as $\Sigma = U \Lambda U^T$, where $U$ is an orthogonal matrix whose columns are the eigenvectors $u_k$, and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $\Sigma$. Hence for an arbitrary vector $x$, we have:
$$\Sigma x = U \Lambda U^T x = U \Lambda \begin{pmatrix} u_1^T x \\ \vdots \\ u_D^T x \end{pmatrix} = U \begin{pmatrix} \lambda_1 u_1^T x \\ \vdots \\ \lambda_D u_D^T x \end{pmatrix} = \left( \sum_{k=1}^{D} \lambda_k u_k u_k^T \right) x$$
And since $\Sigma^{-1} = U \Lambda^{-1} U^T$, the same procedure can be used to prove (2.49).
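As an illustration of the last two results, the sketch below uses `numpy.linalg.eigh` on a randomly generated real symmetric matrix (the test matrix is an assumption made for the example) and checks the orthonormality of the eigenvectors together with the expansions of $\Sigma$ and $\Sigma^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
Sigma = B @ B.T + np.eye(5)            # real symmetric test matrix

# eigh is designed for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors stored as the columns of U.
eigvals, U = np.linalg.eigh(Sigma)

gram = U.T @ U                          # should be the identity, cf. (2.46)
Sigma_rebuilt = (U * eigvals) @ U.T     # sum_k lambda_k u_k u_k^T
Sigma_inv_rebuilt = (U / eigvals) @ U.T # sum_k (1 / lambda_k) u_k u_k^T
```

Here `U * eigvals` scales column $k$ of `U` by $\lambda_k$, so the product with `U.T` is exactly the sum of rank-one terms in the expansion.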
Since $u_1, u_2, \ldots, u_D$ constitute a basis for $\mathbb{R}^D$, we can expand $a$ in this basis:
$$a = a_1 u_1 + a_2 u_2 + \ldots + a_D u_D$$
We substitute the expression above into $a^T \Sigma a$ and use the property that $u_i^T u_j = 1$ if $i = j$ and 0 otherwise, which gives:
$$\begin{aligned}
a^T \Sigma a &= (a_1 u_1 + a_2 u_2 + \ldots + a_D u_D)^T\, \Sigma\, (a_1 u_1 + a_2 u_2 + \ldots + a_D u_D) \\
&= (a_1 u_1^T + a_2 u_2^T + \ldots + a_D u_D^T)\, \Sigma\, (a_1 u_1 + a_2 u_2 + \ldots + a_D u_D) \\
&= (a_1 u_1^T + a_2 u_2^T + \ldots + a_D u_D^T)(a_1 \lambda_1 u_1 + a_2 \lambda_2 u_2 + \ldots + a_D \lambda_D u_D) \\
&= \lambda_1 a_1^2 + \lambda_2 a_2^2 + \ldots + \lambda_D a_D^2
\end{aligned}$$
Since $a$ is real, the expression above is strictly positive for any non-zero $a$ if all eigenvalues are strictly positive. It is also clear that if an eigenvalue $\lambda_i$ is zero or negative, there exists a vector $a$ (e.g. $a = u_i$) for which this expression is no greater than 0. Thus, a real symmetric matrix is positive definite if and only if all of its eigenvalues are strictly positive.
It is straightforward. For a symmetric matrix $\Lambda$ of size $D \times D$, once the lower triangular part (including the diagonal) is decided, the whole matrix is decided due to symmetry. Hence the number of independent parameters is $D + (D - 1) + \ldots + 1$, which equals $D(D + 1)/2$.
Suppose $A$ is a symmetric matrix, and we need to prove that $A^{-1}$ is also symmetric, i.e., $A^{-1} = (A^{-1})^T$. Since the identity matrix $I$ is also symmetric, we have:
$$A A^{-1} = (A A^{-1})^T$$
And since $(AB)^T = B^T A^T$ holds for arbitrary conformable matrices $A$ and $B$, we obtain:
$$A A^{-1} = (A^{-1})^T A^T$$
Since $A = A^T$, we substitute on the right side:
$$A A^{-1} = (A^{-1})^T A$$
Noting that $A A^{-1} = A^{-1} A = I$, we rearrange the order on the left side:
$$A^{-1} A = (A^{-1})^T A$$
Finally, multiplying both sides on the right by $A^{-1}$, we obtain:
$$A^{-1} A A^{-1} = (A^{-1})^T A A^{-1}$$
Using $A A^{-1} = I$, we get what we are asked to show:
$$A^{-1} = (A^{-1})^T$$
Let’s reformulate the problem. What we are asked to prove is that the hyperellipsoid defined by $(x - \mu)^T \Sigma^{-1} (x - \mu) = r^2$, where $r^2$ is a constant, has volume $V_D |\Sigma|^{1/2} r^D$. Note that the center of this hyperellipsoid is located at $\mu$, and a translation does not change its volume, so we only need to prove that the hyperellipsoid $x^T \Sigma^{-1} x = r^2$, centered at 0, has volume $V_D |\Sigma|^{1/2} r^D$.
This problem can be viewed in two parts. Firstly, let’s discuss $V_D$, the volume of a unit sphere in dimension $D$. The expression for $V_D$ has already been given in the solution of Prob.1.18, i.e., (1.144):
$$V_D = \frac{S_D}{D} = \frac{2\pi^{D/2}}{D\,\Gamma(D/2)} = \frac{\pi^{D/2}}{\Gamma(D/2 + 1)}$$
In that solution we also showed that a $D$-dimensional sphere of radius $r$, i.e., $y^T y = r^2$, has volume $V(r) = V_D r^D$. We now move a step forward and apply the linear transformation $x = \Sigma^{1/2} y$, so that $y = \Sigma^{-1/2} x$ and $y^T y = x^T \Sigma^{-1} x = r^2$. The image of the sphere is therefore exactly the hyperellipsoid centered at 0, and its volume is given by multiplying $V(r)$ by the determinant of the transformation matrix, $|\Sigma^{1/2}| = |\Sigma|^{1/2}$, which gives $|\Sigma|^{1/2} V_D r^D$, just as required.
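A quick numerical sanity check is sketched below. It assumes the standard closed form $V_D = \pi^{D/2} / \Gamma(D/2 + 1)$ for the unit-ball volume; the function names are hypothetical and chosen only for this example.

```python
import math

def unit_ball_volume(D):
    """Volume of the unit ball in D dimensions: pi^(D/2) / Gamma(D/2 + 1)."""
    return math.pi ** (D / 2) / math.gamma(D / 2 + 1)

def ellipsoid_volume(det_sigma, r, D):
    """Volume of {x : x^T Sigma^{-1} x <= r^2}, i.e. V_D |Sigma|^(1/2) r^D."""
    return unit_ball_volume(D) * math.sqrt(det_sigma) * r ** D

area_unit_disc = unit_ball_volume(2)      # area of the unit disc
vol_unit_ball = unit_ball_volume(3)       # volume of the unit ball in 3D
# Sigma = diag(4, 9): an ellipse with semi-axes 2 and 3.
area_ellipse = ellipsoid_volume(4.0 * 9.0, 1.0, 2)
```

For the two-dimensional example the result agrees with the familiar ellipse area $\pi \cdot 2 \cdot 3$.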
We just follow the hint, and firstly calculate:
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} \times \begin{bmatrix} M & -M B D^{-1} \\ -D^{-1} C M & D^{-1} + D^{-1} C M B D^{-1} \end{bmatrix}$$
The result can also be partitioned into four blocks. The top-left block equals:
$$A M - B D^{-1} C M = (A - B D^{-1} C)(A - B D^{-1} C)^{-1} = I$$
where we have used (2.77). The top-right block equals:
$$-A M B D^{-1} + B D^{-1} + B D^{-1} C M B D^{-1} = (I - A M + B D^{-1} C M)\, B D^{-1} = 0$$
where we have used the result for the top-left block. The bottom-left block equals:
$$C M - D D^{-1} C M = 0$$
And the bottom-right block equals:
$$-C M B D^{-1} + D D^{-1} + D D^{-1} C M B D^{-1} = I$$
So we have proved what we are asked. Note: to be more precise, one should also multiply by the block matrix on the right side of (2.76) from the other side and prove that the product is again an identity matrix. However, the procedure above can also be used there, so we omit the proof. What’s more, if two square matrices $X$ and $Y$ satisfy $X Y = I$, it can be shown that $Y X = I$ also holds.
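The block computation above can also be verified numerically. The sketch below builds a random, well-conditioned partitioned matrix (an assumption made purely for testing) and assembles the claimed inverse from $M = (A - B D^{-1} C)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2
X = rng.normal(size=(n + m, n + m))
full = X @ X.T + np.eye(n + m)          # invertible test matrix

# Partition into the four blocks A, B, C, D.
A, B = full[:n, :n], full[:n, n:]
C, D = full[n:, :n], full[n:, n:]

D_inv = np.linalg.inv(D)
M = np.linalg.inv(A - B @ D_inv @ C)    # Schur-complement inverse, cf. (2.77)

# Right-hand side of (2.76).
block_inv = np.block([
    [M, -M @ B @ D_inv],
    [-D_inv @ C @ M, D_inv + D_inv @ C @ M @ B @ D_inv],
])

left_product = full @ block_inv
right_product = block_inv @ full
```

Both products should equal the identity up to rounding, which also illustrates the closing remark that $XY = I$ implies $YX = I$ for square matrices.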
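We will take advantage of the results (2.94)-(2.98). Let’s begin by grouping $x_a$ and $x_b$ together, and then rewrite what has been given as:
$$x = \begin{pmatrix} x_{a,b} \\ x_c \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_{a,b} \\ \mu_c \end{pmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{(a,b)(a,b)} & \Sigma_{(a,b)c} \\ \Sigma_{c(a,b)} & \Sigma_{cc} \end{bmatrix}$$
Then, using (2.98), we obtain:
$$p(x_{a,b}) = \mathcal{N}(x_{a,b} \,|\, \mu_{a,b}, \Sigma_{(a,b)(a,b)})$$
where we have defined:
$$\mu_{a,b} = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad \Sigma_{(a,b)(a,b)} = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}$$
Since we have now obtained the joint distribution of $x_a$ and $x_b$, we use (2.96) and (2.97) to obtain the conditional distribution, which gives:
$$p(x_a \,|\, x_b) = \mathcal{N}(x_a \,|\, \mu_{a|b}, \Lambda_{aa}^{-1})$$
where we have defined
$$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b)$$
The expressions for $\Lambda_{aa}^{-1}$ and $\Lambda_{ab}$ can be obtained from (2.76) and (2.77) once we notice that the following relation exists:
$$\begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix} = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}^{-1}$$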
This problem is quite straightforward if we just follow the hint.
$$\begin{aligned}
&(A + BCD)\left( A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} \right) \\
&= A A^{-1} - A A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} + BCD A^{-1} - BCD A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= I - B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} + BCD A^{-1} + B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} - BCD A^{-1} \\
&= I
\end{aligned}$$
where we have used
$$\begin{aligned}
&-BCD A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= -BC\,(-C^{-1} + C^{-1} + D A^{-1} B)(C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= (-BC)(-C^{-1})(C^{-1} + D A^{-1} B)^{-1} D A^{-1} + (-BC)(C^{-1} + D A^{-1} B)(C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} - BCD A^{-1}
\end{aligned}$$
Here we also calculate the inverse matrix directly to give another solution. Let’s begin by introducing two useful formulas. The first is:
$$(I + P)^{-1} = (I + P)^{-1}(I + P - P) = I - (I + P)^{-1} P$$
And since
$$P + PQP = P(I + QP) = (I + PQ)P$$
the second formula is:
$$(I + PQ)^{-1} P = P (I + QP)^{-1}$$
Now let’s directly calculate $(A + BCD)^{-1}$:
$$\begin{aligned}
(A + BCD)^{-1} &= [A(I + A^{-1} BCD)]^{-1} \\
&= (I + A^{-1} BCD)^{-1} A^{-1} \\
&= \left[ I - (I + A^{-1} BCD)^{-1} A^{-1} BCD \right] A^{-1} \\
&= A^{-1} - (I + A^{-1} BCD)^{-1} A^{-1} BCD A^{-1}
\end{aligned}$$
where we have assumed that $A$ is invertible and used the first formula. Then we also assume that $C$ is invertible and apply the second formula repeatedly:
$$\begin{aligned}
(A + BCD)^{-1} &= A^{-1} - (I + A^{-1} BCD)^{-1} A^{-1} BCD A^{-1} \\
&= A^{-1} - A^{-1} (I + BCD A^{-1})^{-1} BCD A^{-1} \\
&= A^{-1} - A^{-1} B (I + CD A^{-1} B)^{-1} CD A^{-1} \\
&= A^{-1} - A^{-1} B \left[ C (C^{-1} + D A^{-1} B) \right]^{-1} CD A^{-1} \\
&= A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} C^{-1} C D A^{-1} \\
&= A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1}
\end{aligned}$$
Just as required.
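The identity just derived can be checked numerically on a small example. In the sketch below the matrices are random test data (an assumption for illustration), scaled so that all the inverses involved exist.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # invertible n x n matrix
B = 0.1 * rng.normal(size=(n, k))
C = np.diag(rng.uniform(1.0, 2.0, size=k))   # invertible k x k matrix
D = 0.1 * rng.normal(size=(k, n))

A_inv = np.linalg.inv(A)
C_inv = np.linalg.inv(C)

# Left-hand side: invert the n x n matrix directly.
lhs = np.linalg.inv(A + B @ C @ D)
# Right-hand side: only a k x k matrix needs to be inverted besides A and C.
rhs = A_inv - A_inv @ B @ np.linalg.inv(C_inv + D @ A_inv @ B) @ D @ A_inv
```

The practical appeal of the right-hand side is that, when $A^{-1}$ is cheap (e.g. diagonal) and $k \ll n$, only a small $k \times k$ system has to be inverted.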
The same procedure used in Prob.1.10 can be used here.
$$\begin{aligned}
E[x + z] &= \iint (x + z)\, p(x, z)\, dx\, dz \\
&= \iint (x + z)\, p(x)\, p(z)\, dx\, dz \\
&= \iint x\, p(x)\, p(z)\, dx\, dz + \iint z\, p(x)\, p(z)\, dx\, dz \\
&= \left( \int p(z)\, dz \right) \int x\, p(x)\, dx + \left( \int p(x)\, dx \right) \int z\, p(z)\, dz \\
&= \int x\, p(x)\, dx + \int z\, p(z)\, dz \\
&= E[x] + E[z]
\end{aligned}$$
And for the covariance matrix, we use the matrix integral:
$$\mathrm{cov}[x + z] = \iint (x + z - E[x + z])(x + z - E[x + z])^T\, p(x, z)\, dx\, dz$$
The same procedure can be used here as well. We omit the proof for simplicity.
It is quite straightforward when we compare the problem with (2.94)-(2.98). We treat $x$ in (2.94) as $z$ in this problem, $x_a$ in (2.94) as $x$ in this problem, and $x_b$ in (2.94) as $y$ in this problem. In other words, we rewrite the problem in the form of (2.94)-(2.98), which gives:
$$z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad E(z) = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}, \qquad \mathrm{cov}(z) = \begin{bmatrix} \Lambda^{-1} & \Lambda^{-1} A^T \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^T \end{bmatrix}$$
By using (2.98), we obtain:
$$p(x) = \mathcal{N}(x \,|\, \mu, \Lambda^{-1})$$
And by using (2.96) and (2.97), we obtain:
$$p(y \,|\, x) = \mathcal{N}(y \,|\, \mu_{y|x}, \Lambda_{yy}^{-1})$$
where $\Lambda_{yy}$ can be read off from the bottom-right block of (2.104), which gives $\Lambda_{yy} = L$, i.e. $\Lambda_{yy}^{-1} = L^{-1}$; it can also be calculated using (2.105) combined with (2.78) and (2.79). Finally, the conditional mean is given by (2.97):
$$\mu_{y|x} = A\mu + b - L^{-1}(-LA)(x - \mu) = Ax + b$$
It is straightforward. Firstly, we calculate the top-left block:
$$\text{top left} = \left[ (\Lambda + A^T L A) - (-A^T L)(L^{-1})(-L A) \right]^{-1} = \Lambda^{-1}$$
And then the top-right block:
$$\text{top right} = -\Lambda^{-1} (-A^T L) L^{-1} = \Lambda^{-1} A^T$$
And then the bottom-left block:
$$\text{bottom left} = -L^{-1} (-L A) \Lambda^{-1} = A \Lambda^{-1}$$
Finally the bottom-right block:
$$\text{bottom right} = L^{-1} + L^{-1} (-L A) \Lambda^{-1} (-A^T L) L^{-1} = L^{-1} + A \Lambda^{-1} A^T$$
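It is straightforward by multiplying (2.105) and (2.107), which gives:
$$\begin{pmatrix} \Lambda^{-1} & \Lambda^{-1} A^T \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^T \end{pmatrix} \begin{pmatrix} \Lambda \mu - A^T L b \\ L b \end{pmatrix} = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}$$
Just as required in the problem.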
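The two block results above can be checked together on random data. The sketch below is only illustrative: the helper `random_spd` and the dimensions are assumptions, and the block matrices follow (2.104), (2.105) and (2.107) as quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
dx, dy = 3, 2

def random_spd(d):
    """Random symmetric positive definite matrix (test data only)."""
    X = rng.normal(size=(d, d))
    return X @ X.T + d * np.eye(d)

Lam, L = random_spd(dx), random_spd(dy)   # precisions of p(x) and p(y|x)
A = rng.normal(size=(dy, dx))
mu = rng.normal(size=dx)
b = rng.normal(size=dy)

# Joint precision matrix R, cf. (2.104).
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A, L]])

# Claimed joint covariance, cf. (2.105).
Lam_inv = np.linalg.inv(Lam)
cov = np.block([[Lam_inv, Lam_inv @ A.T],
                [A @ Lam_inv, np.linalg.inv(L) + A @ Lam_inv @ A.T]])

# Linear-term vector, cf. (2.107), and the resulting joint mean.
linear = np.concatenate([Lam @ mu - A.T @ L @ b, L @ b])
joint_mean = cov @ linear
```

`R @ cov` should be the identity, and `joint_mean` should equal the stacked vector $(\mu, A\mu + b)$.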

According to the problem, we can write two expressions:
$$p(x) = \mathcal{N}(x \,|\, \mu_x, \Sigma_x), \qquad p(y \,|\, x) = \mathcal{N}(y \,|\, \mu_z + x, \Sigma_z)$$
By comparing the expressions above with (2.113)-(2.117), we can write the expression for $p(y)$:
$$p(y) = \mathcal{N}(y \,|\, \mu_x + \mu_z, \Sigma_x + \Sigma_z)$$
Let’s make this problem clearer. The derivation in the main text, i.e., (2.101)-(2.110), first introduces a new random variable $z$ corresponding to the joint distribution. By completing the square with respect to $z$, i.e., (2.103), it obtains the precision matrix $R$ by comparing (2.103) with the PDF of a multivariate Gaussian distribution. It then takes the inverse of the precision matrix to obtain the covariance matrix, and finally uses the linear term, i.e., (2.106), to calculate the mean.
In this problem, we are asked to solve the problem from another perspective: we need to write the joint distribution $p(x, y)$ and then integrate over $x$ to obtain the marginal distribution $p(y)$. Let’s begin by writing the quadratic form in the exponent of $p(x, y)$:
$$-\frac{1}{2}(x - \mu)^T \Lambda (x - \mu) - \frac{1}{2}(y - Ax - b)^T L (y - Ax - b)$$
We extract those terms involving $x$:
$$\begin{aligned}
&= -\frac{1}{2} x^T (\Lambda + A^T L A)\, x + x^T \left[ \Lambda \mu + A^T L (y - b) \right] + \text{const} \\
&= -\frac{1}{2} (x - m)^T (\Lambda + A^T L A)(x - m) + \frac{1}{2} m^T (\Lambda + A^T L A)\, m + \text{const}
\end{aligned}$$
where we have defined:
$$m = (\Lambda + A^T L A)^{-1} \left[ \Lambda \mu + A^T L (y - b) \right]$$
Now if we integrate over $x$, the first term integrates to a constant that does not depend on $y$. Extracting the terms that involve $y$ from the remaining parts, we obtain:
$$= -\frac{1}{2} y^T \left[ L - L A (\Lambda + A^T L A)^{-1} A^T L \right] y + y^T \left\{ \left[ L - L A (\Lambda + A^T L A)^{-1} A^T L \right] b + L A (\Lambda + A^T L A)^{-1} \Lambda \mu \right\}$$
We first look at the quadratic term to obtain the precision matrix, and then, using (2.289), we obtain (2.110). Finally, using the linear term combined with the already known covariance matrix, we obtain (2.109).

According to Bayes’ formula, we can write $p(x \,|\, y) = \frac{p(x, y)}{p(y)}$, where the joint distribution $p(x, y)$ is already known from (2.105) and (2.108), and the marginal distribution $p(y)$ from Prob.2.32. We can follow the same procedure as in Prob.2.32, i.e. first obtain the covariance matrix from the quadratic term and then obtain the mean from the linear term. The details are omitted here.
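Let’s follow the hint by first calculating the derivative of (2.118) with respect to $\Sigma$ and setting it equal to 0:
$$-\frac{N}{2} \frac{\partial}{\partial \Sigma} \ln |\Sigma| - \frac{1}{2} \frac{\partial}{\partial \Sigma} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) = 0$$
By using (C.28), the first term can be reduced to:
$$-\frac{N}{2} \frac{\partial}{\partial \Sigma} \ln |\Sigma| = -\frac{N}{2} (\Sigma^{-1})^T = -\frac{N}{2} \Sigma^{-1}$$
Anticipating the result that the optimal covariance matrix is the sample covariance, we denote the sample covariance matrix $S$ as:
$$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T$$
We rewrite the second term:
$$\begin{aligned}
\text{second term} &= -\frac{1}{2} \frac{\partial}{\partial \Sigma} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) \\
&= -\frac{N}{2} \frac{\partial}{\partial \Sigma} \mathrm{Tr}[\Sigma^{-1} S] \\
&= \frac{N}{2} \Sigma^{-1} S \Sigma^{-1}
\end{aligned}$$
where we have used the following property, combined with the fact that $S$ and $\Sigma$ are symmetric. (Note: this property can be found in The Matrix Cookbook.)
$$\frac{\partial}{\partial X} \mathrm{Tr}(A X^{-1} B) = -(X^{-1} B A X^{-1})^T = -(X^{-1})^T A^T B^T (X^{-1})^T$$
Thus we obtain:
$$-\frac{N}{2} \Sigma^{-1} + \frac{N}{2} \Sigma^{-1} S \Sigma^{-1} = 0$$
Multiplying on the left and on the right by $\Sigma$, we obtain $\Sigma = S$, just as required.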
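The proof of (2.62) is quite clear in the main text, i.e., from page 82 to page 83, and hence we won’t repeat it here. Let’s prove (2.124). We begin by proving (2.123):
$$E[\mu_{ML}] = \frac{1}{N} \sum_{n=1}^{N} E[x_n] = \frac{1}{N} \cdot N\mu = \mu$$
where we have used the fact that the $x_n$ are independently and identically distributed (i.i.d.).
Then we use the expression in (2.122):
$$\begin{aligned}
E[\Sigma_{ML}] &= E\left[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} E\left[ (x_n - \mu_{ML})(x_n - \mu_{ML})^T \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} E\left[ x_n x_n^T - 2 \mu_{ML} x_n^T + \mu_{ML} \mu_{ML}^T \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} E[x_n x_n^T] - 2\, \frac{1}{N} \sum_{n=1}^{N} E[\mu_{ML} x_n^T] + \frac{1}{N} \sum_{n=1}^{N} E[\mu_{ML} \mu_{ML}^T]
\end{aligned}$$
Here the two cross terms $x_n \mu_{ML}^T$ and $\mu_{ML} x_n^T$ have been combined, since they have the same expectation. By using (2.291), the first term equals:
$$\text{first term} = \frac{1}{N} \cdot N (\mu\mu^T + \Sigma) = \mu\mu^T + \Sigma$$
The second term equals:
$$\begin{aligned}
\text{second term} &= -2\, \frac{1}{N} \sum_{n=1}^{N} E[\mu_{ML} x_n^T] \\
&= -2\, \frac{1}{N} \sum_{n=1}^{N} E\left[ \left( \frac{1}{N} \sum_{m=1}^{N} x_m \right) x_n^T \right] \\
&= -2\, \frac{1}{N^2} \sum_{n=1}^{N} \sum_{m=1}^{N} E[x_m x_n^T] \\
&= -2\, \frac{1}{N^2} \sum_{n=1}^{N} \sum_{m=1}^{N} (\mu\mu^T + I_{nm} \Sigma) \\
&= -2\, \frac{1}{N^2} (N^2 \mu\mu^T + N \Sigma) \\
&= -2 \left( \mu\mu^T + \frac{1}{N} \Sigma \right)
\end{aligned}$$
Similarly, the third term equals:
$$\begin{aligned}
\text{third term} &= \frac{1}{N} \sum_{n=1}^{N} E[\mu_{ML} \mu_{ML}^T] \\
&= \frac{1}{N} \sum_{n=1}^{N} E\left[ \left( \frac{1}{N} \sum_{j=1}^{N} x_j \right) \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^T \right] \\
&= \frac{1}{N^3} \sum_{n=1}^{N} E\left[ \left( \sum_{j=1}^{N} x_j \right) \left( \sum_{i=1}^{N} x_i \right)^T \right] \\
&= \frac{1}{N^3} \sum_{n=1}^{N} (N^2 \mu\mu^T + N \Sigma) \\
&= \mu\mu^T + \frac{1}{N} \Sigma
\end{aligned}$$
Finally, we combine those three terms, which gives:
$$E[\Sigma_{ML}] = \frac{N - 1}{N} \Sigma$$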
Note: the same procedure used from (2.59) to (2.62) can be carried out to prove (2.291); the only difference is that we need to introduce indices $m$ and $n$ to label the samples. (2.291) is quite intuitive if we see it in this way: if $m = n$, $x_n$ and $x_m$ are actually the same sample, and (2.291) reduces to (2.62) (i.e. the correlation between different dimensions is present). If $m \neq n$, $x_n$ and $x_m$ are different, independent samples, so no correlation between them exists and we have $E[x_n x_m^T] = E[x_n] E[x_m]^T = \mu\mu^T$ in this case.
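The bias factor $(N - 1)/N$ can be illustrated with a small Monte Carlo experiment. This is only a sketch: the mean, covariance, sample size and number of trials are arbitrary choices, and the agreement holds only up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(5)
N, trials = 5, 200_000
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# x has shape (trials, N, D): many independent data sets of size N.
x = rng.multivariate_normal(mu, Sigma, size=(trials, N))

# Subtract each data set's own maximum likelihood mean.
centered = x - x.mean(axis=1, keepdims=True)

# Average of Sigma_ML over all trials: sum of outer products / (trials * N).
mean_Sigma_ml = np.einsum("tni,tnj->ij", centered, centered) / (trials * N)

expected = (N - 1) / N * Sigma
```

With these settings the empirical average `mean_Sigma_ml` should be close to `expected`, and visibly smaller than `Sigma` itself.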
Let’s follow the hint. First, however, we find the sequential expression directly from the definition, which will make it easier to identify the coefficient $a_{N-1}$ later. Suppose we have $N$ observations in total; then we can write:
$$\begin{aligned}
\sigma^{2(N)}_{ML} &= \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu^{(N)}_{ML})^2 \\
&= \frac{1}{N} \left[ \sum_{n=1}^{N-1} (x_n - \mu^{(N)}_{ML})^2 + (x_N - \mu^{(N)}_{ML})^2 \right] \\
&= \frac{N-1}{N} \cdot \frac{1}{N-1} \sum_{n=1}^{N-1} (x_n - \mu^{(N)}_{ML})^2 + \frac{1}{N} (x_N - \mu^{(N)}_{ML})^2 \\
&= \frac{N-1}{N} \sigma^{2(N-1)}_{ML} + \frac{1}{N} (x_N - \mu^{(N)}_{ML})^2 \\
&= \sigma^{2(N-1)}_{ML} + \frac{1}{N} \left[ (x_N - \mu^{(N)}_{ML})^2 - \sigma^{2(N-1)}_{ML} \right]
\end{aligned}$$
And then let us write the condition that defines $\sigma_{ML}$:
$$\frac{\partial}{\partial \sigma^2} \left\{ \frac{1}{N} \sum_{n=1}^{N} \ln p(x_n \,|\, \mu, \sigma) \right\} \Bigg|_{\sigma_{ML}} = 0$$
By exchanging the summation and the derivative, and letting $N \to +\infty$, we obtain:
$$\lim_{N \to +\infty} \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \sigma^2} \ln p(x_n \,|\, \mu, \sigma) = E_x\left[ \frac{\partial}{\partial \sigma^2} \ln p(x_n \,|\, \mu, \sigma) \right]$$
Comparing it with (2.127), we obtain the sequential formula to estimate $\sigma_{ML}$:
$$\begin{aligned}
\sigma^{2(N)}_{ML} &= \sigma^{2(N-1)}_{ML} + a_{N-1} \frac{\partial}{\partial \sigma^{2(N-1)}_{ML}} \ln p(x_N \,|\, \mu^{(N)}_{ML}, \sigma^{(N-1)}_{ML}) \qquad (*) \\
&= \sigma^{2(N-1)}_{ML} + a_{N-1} \left[ -\frac{1}{2\sigma^{2(N-1)}_{ML}} + \frac{(x_N - \mu^{(N)}_{ML})^2}{2\sigma^{4(N-1)}_{ML}} \right]
\end{aligned}$$
where we use $\sigma^{2(N)}_{ML}$ to represent the $N$th estimate of $\sigma^2_{ML}$, i.e., the estimate of $\sigma^2_{ML}$ after the $N$th observation. What’s more, if we choose:
$$a_{N-1} = \frac{2\sigma^{4(N-1)}_{ML}}{N}$$
then we obtain:
$$\sigma^{2(N)}_{ML} = \sigma^{2(N-1)}_{ML} + \frac{1}{N} \left[ -\sigma^{2(N-1)}_{ML} + (x_N - \mu^{(N)}_{ML})^2 \right]$$
We can see that the results are the same. An important point should be noticed: in maximum likelihood, when estimating the variance $\sigma^{2(N)}_{ML}$, we first estimate the mean $\mu^{(N)}_{ML}$, and then we calculate the variance $\sigma^{2(N)}_{ML}$. In other words, they are decoupled. It is the same in the sequential method. For instance, if we want to estimate both mean and variance sequentially, after observing the $N$th sample (i.e., $x_N$), we first use $\mu^{(N-1)}_{ML}$ together with (2.126) to estimate $\mu^{(N)}_{ML}$, and then use the conclusion of this problem to obtain $\sigma^{(N)}_{ML}$. That is why in $(*)$ we write $\ln p(x_N \,|\, \mu^{(N)}_{ML}, \sigma^{(N-1)}_{ML})$ instead of $\ln p(x_N \,|\, \mu^{(N-1)}_{ML}, \sigma^{(N-1)}_{ML})$.
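The sequential update can be sketched in a few lines. To keep the example exact, it assumes the mean is known and fixed (a simplification made here, not in the text above), in which case the recursion reproduces the batch estimate up to rounding; the function name is hypothetical.

```python
import numpy as np

def sequential_variance(xs, mu):
    """Sequential variance estimate with a known, fixed mean mu.

    Applies var_N = var_{N-1} + ((x_N - mu)^2 - var_{N-1}) / N
    one observation at a time.
    """
    var = 0.0
    for n, x in enumerate(xs, start=1):
        var = var + ((x - mu) ** 2 - var) / n
    return var

rng = np.random.default_rng(6)
mu_known = 0.5
xs = rng.normal(mu_known, 2.0, size=1000)

seq_estimate = sequential_variance(xs, mu_known)
batch_estimate = np.mean((xs - mu_known) ** 2)
```

Each step only needs the previous estimate and the new observation, so the data never have to be stored.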
Problem 2.37 Solution (Wait for revising)
We follow the same procedure as in Prob.2.36 to solve this problem. Firstly, we can obtain the sequential formula based on the definition.
$$\begin{aligned}
\Sigma^{(N)}_{ML} &= \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu^{(N)}_{ML})(x_n - \mu^{(N)}_{ML})^T \\
&= \frac{1}{N} \left[ \sum_{n=1}^{N-1} (x_n - \mu^{(N)}_{ML})(x_n - \mu^{(N)}_{ML})^T + (x_N - \mu^{(N)}_{ML})(x_N - \mu^{(N)}_{ML})^T \right] \\
&= \frac{N-1}{N} \Sigma^{(N-1)}_{ML} + \frac{1}{N} (x_N - \mu^{(N)}_{ML})(x_N - \mu^{(N)}_{ML})^T \\
&= \Sigma^{(N-1)}_{ML} + \frac{1}{N} \left[ (x_N - \mu^{(N)}_{ML})(x_N - \mu^{(N)}_{ML})^T - \Sigma^{(N-1)}_{ML} \right]
\end{aligned}$$
If we use the Robbins-Monro sequential estimation formula, i.e., (2.135), we obtain:
$$\begin{aligned}
\Sigma^{(N)}_{ML} &= \Sigma^{(N-1)}_{ML} + a_{N-1} \frac{\partial}{\partial \Sigma^{(N-1)}_{ML}} \ln p(x_N \,|\, \mu^{(N)}_{ML}, \Sigma^{(N-1)}_{ML}) \\
&= \Sigma^{(N-1)}_{ML} + a_{N-1} \left[ -\frac{1}{2} [\Sigma^{(N-1)}_{ML}]^{-1} + \frac{1}{2} [\Sigma^{(N-1)}_{ML}]^{-1} (x_N - \mu^{(N)}_{ML})(x_N - \mu^{(N)}_{ML})^T [\Sigma^{(N-1)}_{ML}]^{-1} \right]
\end{aligned}$$
where we have used the procedure carried out in Prob.2.34 to calculate the derivative. If we choose (formally, with one factor of $\Sigma^{(N-1)}_{ML}$ applied on each side of the bracket):
$$a_{N-1} = \frac{2}{N} \left[ \Sigma^{(N-1)}_{ML} \right]^2$$
we can see that the equation above becomes identical to our previous conclusion based on the definition.
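It is straightforward. Based on (2.137), (2.138) and (2.139), we focus on the exponential term of the posterior distribution $p(\mu \,|\, X)$, which gives (up to a constant independent of $\mu$):
$$-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 = -\frac{1}{2\sigma_N^2} (\mu - \mu_N)^2$$
We rewrite the left side in terms of $\mu$:
$$\text{quadratic term} = -\left( \frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2} \right) \mu^2$$
$$\text{linear term} = \left( \frac{\sum_{n=1}^{N} x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \mu$$
We also rewrite the right side in terms of $\mu$, and hence we obtain:
$$-\left( \frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2} \right) \mu^2 = -\frac{1}{2\sigma_N^2} \mu^2, \qquad \left( \frac{\sum_{n=1}^{N} x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \mu = \frac{\mu_N}{\sigma_N^2} \mu$$
Then we obtain:
$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$
And with the knowledge that $\sum_{n=1}^{N} x_n = N \cdot \mu_{ML}$, we can write:
$$\begin{aligned}
\mu_N &= \sigma_N^2 \cdot \left( \frac{\sum_{n=1}^{N} x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \\
&= \left( \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \right)^{-1} \cdot \left( \frac{N \mu_{ML}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \\
&= \frac{\sigma_0^2 \sigma^2}{\sigma^2 + N\sigma_0^2} \cdot \frac{N \mu_{ML} \sigma_0^2 + \mu_0 \sigma^2}{\sigma^2 \sigma_0^2} \\
&= \frac{\sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu_{ML}
\end{aligned}$$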
Let’s follow the hint.
$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} = \frac{1}{\sigma_0^2} + \frac{N-1}{\sigma^2} + \frac{1}{\sigma^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2}$$
However, it is complicated to derive a sequential formula for $\mu_N$ directly. Based on (2.142), we see that the denominator in (2.141) can be eliminated if we multiply both sides of (2.141) by $1/\sigma_N^2$. Therefore we derive a sequential formula for $\mu_N / \sigma_N^2$ instead.
$$\begin{aligned}
\frac{\mu_N}{\sigma_N^2} &= \frac{\sigma^2 + N\sigma_0^2}{\sigma_0^2 \sigma^2} \left( \frac{\sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu^{(N)}_{ML} \right) \\
&= \frac{\mu_0}{\sigma_0^2} + \frac{N \mu^{(N)}_{ML}}{\sigma^2} = \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{n=1}^{N} x_n}{\sigma^2} \\
&= \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{n=1}^{N-1} x_n}{\sigma^2} + \frac{x_N}{\sigma^2} \\
&= \frac{\mu_{N-1}}{\sigma_{N-1}^2} + \frac{x_N}{\sigma^2}
\end{aligned}$$
Another possible solution is also suggested in the problem. We solve it by completing the square.
$$-\frac{1}{2\sigma^2} (x_N - \mu)^2 - \frac{1}{2\sigma_{N-1}^2} (\mu - \mu_{N-1})^2 = -\frac{1}{2\sigma_N^2} (\mu - \mu_N)^2$$
By comparing the quadratic and linear terms in $\mu$, we obtain:
$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_{N-1}^2}$$
And:
$$\frac{\mu_N}{\sigma_N^2} = \frac{x_N}{\sigma^2} + \frac{\mu_{N-1}}{\sigma_{N-1}^2}$$
This is the same as the previous result. Note: after obtaining the $N$th observation, we first use the sequential formula to calculate $\sigma_N^2$, and then $\mu_N$. This is because the sequential formula for $\mu_N$ depends on $\sigma_N^2$.
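The equivalence between the batch posterior and the one-observation-at-a-time update can be checked directly. The sketch below uses hypothetical function names and a small made-up data set; variances are passed as `var0` (prior) and `var` (known noise variance).

```python
import numpy as np

def batch_posterior(xs, mu0, var0, var):
    """Posterior mean and variance of mu after seeing all data at once."""
    N = len(xs)
    var_N = 1.0 / (1.0 / var0 + N / var)
    mu_ml = np.mean(xs)
    mu_N = (var / (N * var0 + var)) * mu0 + (N * var0 / (N * var0 + var)) * mu_ml
    return mu_N, var_N

def sequential_posterior(xs, mu0, var0, var):
    """Same posterior, obtained by updating with one observation at a time."""
    mu_n, var_n = mu0, var0
    for x in xs:
        # First the variance: 1/var_N = 1/var_{N-1} + 1/var ...
        var_new = 1.0 / (1.0 / var_n + 1.0 / var)
        # ... then the mean: mu_N/var_N = mu_{N-1}/var_{N-1} + x_N/var.
        mu_n = var_new * (mu_n / var_n + x / var)
        var_n = var_new
    return mu_n, var_n

xs = np.array([0.2, 1.4, -0.3, 0.9])
mu_batch, var_batch = batch_posterior(xs, mu0=0.0, var0=2.0, var=0.5)
mu_seq, var_seq = sequential_posterior(xs, mu0=0.0, var0=2.0, var=0.5)
```

Note that inside the loop the variance is updated before the mean, matching the order described in the note above.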
Based on Bayes’ theorem, we can write:
$$p(\mu \,|\, X) \propto p(X \,|\, \mu)\, p(\mu)$$
We focus on the exponential term on the right side and rearrange it in terms of $\mu$.
$$\begin{aligned}
\text{right} &= \left[ \sum_{n=1}^{N} -\frac{1}{2} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) \right] - \frac{1}{2} (\mu - \mu_0)^T \Sigma_0^{-1} (\mu - \mu_0) \\
&= -\frac{1}{2} \mu^T (\Sigma_0^{-1} + N\Sigma^{-1})\, \mu + \mu^T \left( \Sigma_0^{-1} \mu_0 + \Sigma^{-1} \sum_{n=1}^{N} x_n \right) + \text{const}
\end{aligned}$$
where ’const’ represents all the terms independent of $\mu$. From the quadratic term, we obtain the posterior covariance matrix:
$$\Sigma_N^{-1} = \Sigma_0^{-1} + N\Sigma^{-1}$$
Then, using the linear term, we obtain:
$$\Sigma_N^{-1} \mu_N = \Sigma_0^{-1} \mu_0 + \Sigma^{-1} \sum_{n=1}^{N} x_n$$
Finally we obtain the posterior mean:
$$\mu_N = (\Sigma_0^{-1} + N\Sigma^{-1})^{-1} \left( \Sigma_0^{-1} \mu_0 + \Sigma^{-1} \sum_{n=1}^{N} x_n \right)$$
which can also be written as:
$$\mu_N = (\Sigma_0^{-1} + N\Sigma^{-1})^{-1} (\Sigma_0^{-1} \mu_0 + \Sigma^{-1} N \mu_{ML})$$
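Let’s compute the integral of (2.146) over $\lambda$.
$$\begin{aligned}
\int_0^{+\infty} \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)\, d\lambda &= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \lambda^{a-1} \exp(-b\lambda)\, d\lambda \\
&= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \left( \frac{u}{b} \right)^{a-1} \exp(-u)\, \frac{1}{b}\, du \\
&= \frac{1}{\Gamma(a)} \int_0^{+\infty} u^{a-1} \exp(-u)\, du \\
&= \frac{1}{\Gamma(a)} \cdot \Gamma(a) = 1
\end{aligned}$$
where we first perform the change of variable $b\lambda = u$, and then use the definition of the gamma function:
$$\Gamma(x) = \int_0^{+\infty} u^{x-1} e^{-u}\, du$$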
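We first calculate its mean.
$$\begin{aligned}
\int_0^{+\infty} \lambda\, \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)\, d\lambda &= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \lambda^{a} \exp(-b\lambda)\, d\lambda \\
&= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \left( \frac{u}{b} \right)^{a} \exp(-u)\, \frac{1}{b}\, du \\
&= \frac{1}{\Gamma(a) \cdot b} \int_0^{+\infty} u^{a} \exp(-u)\, du \\
&= \frac{1}{\Gamma(a) \cdot b} \cdot \Gamma(a + 1) = \frac{a}{b}
\end{aligned}$$
where we have used the property $\Gamma(a + 1) = a\Gamma(a)$. Then we calculate $E[\lambda^2]$.
$$\begin{aligned}
\int_0^{+\infty} \lambda^2\, \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)\, d\lambda &= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \lambda^{a+1} \exp(-b\lambda)\, d\lambda \\
&= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \left( \frac{u}{b} \right)^{a+1} \exp(-u)\, \frac{1}{b}\, du \\
&= \frac{1}{\Gamma(a) \cdot b^2} \int_0^{+\infty} u^{a+1} \exp(-u)\, du \\
&= \frac{1}{\Gamma(a) \cdot b^2} \cdot \Gamma(a + 2) = \frac{a(a + 1)}{b^2}
\end{aligned}$$
Therefore, according to $\mathrm{var}[\lambda] = E[\lambda^2] - E[\lambda]^2$, we obtain:
$$\mathrm{var}[\lambda] = E[\lambda^2] - E[\lambda]^2 = \frac{a(a + 1)}{b^2} - \left( \frac{a}{b} \right)^2 = \frac{a}{b^2}$$
For the mode of a gamma distribution, we need to find where the maximum of the PDF occurs, and hence we calculate the derivative of the gamma distribution with respect to $\lambda$.
$$\frac{d}{d\lambda} \left[ \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda) \right] = [(a - 1) - b\lambda]\, \frac{1}{\Gamma(a)} b^a \lambda^{a-2} \exp(-b\lambda)$$
The derivative is positive for $\lambda < (a - 1)/b$ and negative for $\lambda > (a - 1)/b$, so (for $a \geq 1$) $\mathrm{Gam}(\lambda \,|\, a, b)$ has its maximum at $\lambda = (a - 1)/b$. In other words, the gamma distribution $\mathrm{Gam}(\lambda \,|\, a, b)$ has mode $(a - 1)/b$.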
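These closed forms are easy to confirm by simple numerical quadrature. The sketch below picks $a = 3$, $b = 2$ (arbitrary illustrative values) and uses a hand-written trapezoidal rule on a fine grid; the grid limits are assumptions chosen so that the neglected tail is negligible.

```python
import math
import numpy as np

a, b = 3.0, 2.0
lam = np.linspace(0.0, 40.0, 400_001)     # grid over lambda
h = lam[1] - lam[0]
pdf = b ** a / math.gamma(a) * lam ** (a - 1) * np.exp(-b * lam)

def integrate(f):
    """Trapezoidal rule on the uniform grid defined above."""
    return float(np.sum((f[1:] + f[:-1]) / 2.0) * h)

total = integrate(pdf)                         # normalisation, should be 1
mean = integrate(lam * pdf)                    # should be a / b
var = integrate(lam ** 2 * pdf) - mean ** 2    # should be a / b^2
mode = float(lam[np.argmax(pdf)])              # should be (a - 1) / b
```

For these parameters the analytic values are mean $1.5$, variance $0.75$ and mode $1$.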
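Let’s first calculate the following integral.
$$\begin{aligned}
\int_{-\infty}^{+\infty} \exp\left( -\frac{|x|^q}{2\sigma^2} \right) dx &= 2 \int_{0}^{+\infty} \exp\left( -\frac{x^q}{2\sigma^2} \right) dx \\
&= 2 \int_0^{+\infty} \exp(-u)\, \frac{(2\sigma^2)^{1/q}}{q}\, u^{\frac{1}{q} - 1}\, du \\
&= 2\, \frac{(2\sigma^2)^{1/q}}{q} \int_0^{+\infty} \exp(-u)\, u^{\frac{1}{q} - 1}\, du \\
&= 2\, \frac{(2\sigma^2)^{1/q}}{q}\, \Gamma\left( \frac{1}{q} \right)
\end{aligned}$$
where we have used the change of variable $u = x^q / (2\sigma^2)$. It is then obvious that (2.293) is normalized. Next, we consider the log likelihood function. Since $\epsilon = t - y(x, w)$ and $\epsilon \sim p(\epsilon \,|\, \sigma^2, q)$, we can write:
$$\begin{aligned}
\ln p(\mathbf{t} \,|\, X, w, \sigma^2) &= \sum_{n=1}^{N} \ln p\left( y(x_n, w) - t_n \,|\, \sigma^2, q \right) \\
&= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} |y(x_n, w) - t_n|^q + N \cdot \ln\left[ \frac{q}{2 (2\sigma^2)^{1/q}\, \Gamma(1/q)} \right] \\
&= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} |y(x_n, w) - t_n|^q - \frac{N}{q} \ln(2\sigma^2) + \text{const}
\end{aligned}$$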
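Here we use a simple method to solve this problem by taking advantage of (2.152) and (2.153). By writing the prior distribution in the form of (2.153), i.e., $p(\mu, \lambda \,|\, \beta, c, d)$, we can easily obtain the posterior distribution.
$$\begin{aligned}
p(\mu, \lambda \,|\, X) &\propto p(X \,|\, \mu, \lambda) \cdot p(\mu, \lambda) \\
&\propto \left[ \lambda^{1/2} \exp\left( -\frac{\lambda\mu^2}{2} \right) \right]^{N + \beta} \exp\left[ \left( c + \sum_{n=1}^{N} x_n \right) \lambda\mu - \left( d + \sum_{n=1}^{N} \frac{x_n^2}{2} \right) \lambda \right]
\end{aligned}$$
Therefore, we can see that the posterior distribution has parameters $\beta' = \beta + N$, $c' = c + \sum_{n=1}^{N} x_n$, $d' = d + \sum_{n=1}^{N} \frac{x_n^2}{2}$. And since the prior distribution is actually the product of a Gaussian distribution and a Gamma distribution:
$$p(\mu, \lambda \,|\, \mu_0, \beta, a, b) = \mathcal{N}\left[ \mu \,|\, \mu_0, (\beta\lambda)^{-1} \right] \mathrm{Gam}(\lambda \,|\, a, b)$$
where $\mu_0 = c/\beta$, $a = 1 + \beta/2$, $b = d - c^2/2\beta$, the posterior distribution can also be written as the product of a Gaussian distribution and a Gamma distribution.
$$p(\mu, \lambda \,|\, X) = \mathcal{N}\left[ \mu \,|\, \mu_0', (\beta'\lambda)^{-1} \right] \mathrm{Gam}(\lambda \,|\, a', b')$$
$$\mu_0' = c'/\beta' = \left( c + \sum_{n=1}^{N} x_n \right) \Big/ (N + \beta)$$
$$a' = 1 + \beta'/2 = 1 + (N + \beta)/2$$
$$b' = d' - c'^2/2\beta' = d + \sum_{n=1}^{N} \frac{x_n^2}{2} - \left( c + \sum_{n=1}^{N} x_n \right)^2 \Big/ \left( 2(\beta + N) \right)$$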
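Let’s begin by writing down the dependency of the prior distribution $\mathcal{W}(\Lambda \,|\, W, v)$ and the likelihood function $p(X \,|\, \mu, \Lambda)$ on $\Lambda$.
$$p(X \,|\, \mu, \Lambda) \propto |\Lambda|^{N/2} \exp\left[ \sum_{n=1}^{N} -\frac{1}{2} (x_n - \mu)^T \Lambda (x_n - \mu) \right]$$
And if we denote
$$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T$$
then we can rewrite the equation above as:
$$p(X \,|\, \mu, \Lambda) \propto |\Lambda|^{N/2} \exp\left[ -\frac{N}{2} \mathrm{Tr}(S\Lambda) \right]$$
just as we did in Prob.2.34 (note the factor $N$, which comes from the $1/N$ in the definition of $S$). Comparing this problem with Prob.2.34, one important thing should be noticed: since $S$ and $\Lambda$ are both symmetric, we have $\mathrm{Tr}(S\Lambda) = \mathrm{Tr}\left( (S\Lambda)^T \right) = \mathrm{Tr}(\Lambda^T S^T) = \mathrm{Tr}(\Lambda S)$. We can also write down the prior distribution as:
$$\mathcal{W}(\Lambda \,|\, W, v) \propto |\Lambda|^{(v - D - 1)/2} \exp\left[ -\frac{1}{2} \mathrm{Tr}(W^{-1} \Lambda) \right]$$
Therefore, the posterior distribution can be obtained:
$$\begin{aligned}
p(\Lambda \,|\, X, W, v) &\propto p(X \,|\, \mu, \Lambda) \cdot \mathcal{W}(\Lambda \,|\, W, v) \\
&\propto |\Lambda|^{(N + v - D - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{Tr}\left[ (W^{-1} + N S)\Lambda \right] \right\}
\end{aligned}$$
Therefore, $p(\Lambda \,|\, X, W, v)$ is also a Wishart distribution, with parameters:
$$v_N = N + v$$
$$W_N = (W^{-1} + N S)^{-1}$$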
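It is quite straightforward.
$$\begin{aligned}
p(x \,|\, \mu, a, b) &= \int_0^{\infty} \mathcal{N}(x \,|\, \mu, \tau^{-1})\, \mathrm{Gam}(\tau \,|\, a, b)\, d\tau \\
&= \int_0^{\infty} \frac{b^a \exp(-b\tau)\, \tau^{a-1}}{\Gamma(a)} \left( \frac{\tau}{2\pi} \right)^{1/2} \exp\left\{ -\frac{\tau}{2} (x - \mu)^2 \right\} d\tau \\
&= \frac{b^a}{\Gamma(a)} \left( \frac{1}{2\pi} \right)^{1/2} \int_0^{\infty} \tau^{a - 1/2} \exp\left\{ -b\tau - \frac{\tau}{2} (x - \mu)^2 \right\} d\tau
\end{aligned}$$
And if we make the change of variable $z = \tau [b + (x - \mu)^2/2]$, the integral above can be written as:
$$\begin{aligned}
p(x \,|\, \mu, a, b) &= \frac{b^a}{\Gamma(a)} \left( \frac{1}{2\pi} \right)^{1/2} \int_0^{\infty} \tau^{a - 1/2} \exp\left\{ -b\tau - \frac{\tau}{2} (x - \mu)^2 \right\} d\tau \\
&= \frac{b^a}{\Gamma(a)} \left( \frac{1}{2\pi} \right)^{1/2} \int_0^{\infty} \left[ \frac{z}{b + (x - \mu)^2/2} \right]^{a - 1/2} \exp\{-z\}\, \frac{1}{b + (x - \mu)^2/2}\, dz \\
&= \frac{b^a}{\Gamma(a)} \left( \frac{1}{2\pi} \right)^{1/2} \left[ \frac{1}{b + (x - \mu)^2/2} \right]^{a + 1/2} \int_0^{\infty} z^{a - 1/2} \exp\{-z\}\, dz \\
&= \frac{b^a}{\Gamma(a)} \left( \frac{1}{2\pi} \right)^{1/2} \left[ b + \frac{(x - \mu)^2}{2} \right]^{-a - 1/2} \Gamma(a + 1/2)
\end{aligned}$$
And if we substitute $a = v/2$ and $b = v/2\lambda$, we obtain (2.159).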
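As a numerical cross-check of the marginalisation, the sketch below evaluates the integral over $\tau$ with a trapezoidal rule and compares it with the closed form. The function names, the parameter values and the integration limits are assumptions chosen only for this example.

```python
import math
import numpy as np

def student_closed_form(x, mu, a, b):
    """Closed-form result of integrating out the precision tau."""
    return (b ** a / math.gamma(a) * (1.0 / (2.0 * math.pi)) ** 0.5
            * (b + (x - mu) ** 2 / 2.0) ** (-a - 0.5) * math.gamma(a + 0.5))

def student_by_quadrature(x, mu, a, b, tau_max=100.0, n=1_000_001):
    """Trapezoidal approximation of  int N(x|mu, 1/tau) Gam(tau|a, b) dtau."""
    tau = np.linspace(0.0, tau_max, n)
    h = tau[1] - tau[0]
    gauss = np.sqrt(tau / (2.0 * np.pi)) * np.exp(-tau * (x - mu) ** 2 / 2.0)
    gam = b ** a / math.gamma(a) * tau ** (a - 1.0) * np.exp(-b * tau)
    f = gauss * gam
    return float(np.sum((f[1:] + f[:-1]) / 2.0) * h)

exact = student_closed_form(1.1, 0.3, a=2.5, b=1.5)
approx = student_by_quadrature(1.1, 0.3, a=2.5, b=1.5)
```

The upper limit `tau_max` truncates an exponentially decaying tail, so the two values should agree to many digits for these parameters.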
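We focus on the dependency of (2.159) on $x$.
$$\begin{aligned}
\mathrm{St}(x \,|\, \mu, \lambda, v) &\propto \left[ 1 + \frac{\lambda (x - \mu)^2}{v} \right]^{-v/2 - 1/2} \\
&\propto \exp\left[ \frac{-v - 1}{2} \ln\left( 1 + \frac{\lambda (x - \mu)^2}{v} \right) \right] \\
&\propto \exp\left[ \frac{-v - 1}{2} \left( \frac{\lambda (x - \mu)^2}{v} + O(v^{-2}) \right) \right] \\
&\approx \exp\left[ -\frac{\lambda (x - \mu)^2}{2} \right] \qquad (v \to \infty)
\end{aligned}$$
where we have used the Taylor expansion $\ln(1 + \epsilon) = \epsilon + O(\epsilon^2)$. We see that this, up to an overall constant, is a Gaussian distribution with mean $\mu$ and precision $\lambda$.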
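The same steps as in Prob.2.46 can be used here.
$$\begin{aligned}
\mathrm{St}(x \,|\, \mu, \Lambda, v) &= \int_0^{+\infty} \mathcal{N}(x \,|\, \mu, (\eta\Lambda)^{-1}) \cdot \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta \\
&= \int_0^{+\infty} \frac{1}{(2\pi)^{D/2}} |\eta\Lambda|^{1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T (\eta\Lambda)(x - \mu) - \frac{v\eta}{2} \right\} \frac{1}{\Gamma(v/2)} \left( \frac{v}{2} \right)^{v/2} \eta^{v/2 - 1}\, d\eta \\
&= \frac{(v/2)^{v/2} |\Lambda|^{1/2}}{(2\pi)^{D/2}\, \Gamma(v/2)} \int_0^{+\infty} \exp\left\{ -\frac{1}{2} (x - \mu)^T (\eta\Lambda)(x - \mu) - \frac{v\eta}{2} \right\} \eta^{D/2 + v/2 - 1}\, d\eta
\end{aligned}$$
where we have used the property $|\eta\Lambda| = \eta^D |\Lambda|$. If we denote:
$$\Delta^2 = (x - \mu)^T \Lambda (x - \mu) \quad \text{and} \quad z = \frac{\eta}{2} (\Delta^2 + v)$$
the expression above can be reduced to:
$$\begin{aligned}
\mathrm{St}(x \,|\, \mu, \Lambda, v) &= \frac{(v/2)^{v/2} |\Lambda|^{1/2}}{(2\pi)^{D/2}\, \Gamma(v/2)} \int_0^{+\infty} \exp(-z) \left( \frac{2z}{\Delta^2 + v} \right)^{D/2 + v/2 - 1} \cdot \frac{2}{\Delta^2 + v}\, dz \\
&= \frac{(v/2)^{v/2} |\Lambda|^{1/2}}{(2\pi)^{D/2}\, \Gamma(v/2)} \left( \frac{2}{\Delta^2 + v} \right)^{D/2 + v/2} \int_0^{+\infty} \exp(-z) \cdot z^{D/2 + v/2 - 1}\, dz \\
&= \frac{(v/2)^{v/2} |\Lambda|^{1/2}}{(2\pi)^{D/2}\, \Gamma(v/2)} \left( \frac{2}{\Delta^2 + v} \right)^{D/2 + v/2} \Gamma(D/2 + v/2)
\end{aligned}$$
And if we rearrange the expression above, we obtain (2.162), just as required.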
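Firstly, we notice that $\Delta^2$ equals 0 if and only if $x = \mu$, and this is where $\mathrm{St}(x \,|\, \mu, \Lambda, v)$ achieves its maximum. In other words, the mode of $\mathrm{St}(x \,|\, \mu, \Lambda, v)$ is $\mu$. Then we consider its mean $E[x]$.
$$\begin{aligned}
E[x] &= \int_{x \in \mathbb{R}^D} \mathrm{St}(x \,|\, \mu, \Lambda, v) \cdot x\, dx \\
&= \int_{x \in \mathbb{R}^D} \left[ \int_0^{+\infty} \mathcal{N}(x \,|\, \mu, (\eta\Lambda)^{-1}) \cdot \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta \right] x\, dx \\
&= \int_{x \in \mathbb{R}^D} \int_0^{+\infty} x\, \mathcal{N}(x \,|\, \mu, (\eta\Lambda)^{-1}) \cdot \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta\, dx \\
&= \int_0^{+\infty} \left[ \int_{x \in \mathbb{R}^D} x\, \mathcal{N}(x \,|\, \mu, (\eta\Lambda)^{-1})\, dx \right] \cdot \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta \\
&= \int_0^{+\infty} \mu \cdot \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta \\
&= \mu \int_0^{+\infty} \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta = \mu
\end{aligned}$$
where we have used the following property:
$$\int_{x \in \mathbb{R}^D} x\, \mathcal{N}(x \,|\, \mu, (\eta\Lambda)^{-1})\, dx = E[x] = \mu$$
Then we calculate $E[xx^T]$. The steps above can also be used here.
$$\begin{aligned}
E[xx^T] &= \int_{x \in \mathbb{R}^D} \mathrm{St}(x \,|\, \mu, \Lambda, v) \cdot xx^T\, dx \\
&= \int_{x \in \mathbb{R}^D} \left[ \int_0^{+\infty} \mathcal{N}(x \,|\, \mu, (\eta\Lambda)^{-1}) \cdot \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta \right] xx^T\, dx \\
&= \int_0^{+\infty} \left[ \int_{x \in \mathbb{R}^D} xx^T\, \mathcal{N}(x \,|\, \mu, (\eta\Lambda)^{-1})\, dx \right] \cdot \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta \\
&= \int_0^{+\infty} \left[ \mu\mu^T + (\eta\Lambda)^{-1} \right] \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta \\
&= \mu\mu^T + \int_0^{+\infty} (\eta\Lambda)^{-1} \cdot \mathrm{Gam}\left( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \right) d\eta \\
&= \mu\mu^T + \int_0^{+\infty} (\eta\Lambda)^{-1} \cdot \frac{1}{\Gamma(v/2)} \left( \frac{v}{2} \right)^{v/2} \eta^{v/2 - 1} \exp\left( -\frac{v}{2} \eta \right) d\eta \\
&= \mu\mu^T + \Lambda^{-1} \frac{1}{\Gamma(v/2)} \left( \frac{v}{2} \right)^{v/2} \int_0^{+\infty} \eta^{v/2 - 2} \exp\left( -\frac{v}{2} \eta \right) d\eta
\end{aligned}$$
Here the inner integral over $x$ is the second moment of the Gaussian $\mathcal{N}(x \,|\, \mu, (\eta\Lambda)^{-1})$, which equals $\mu\mu^T + (\eta\Lambda)^{-1}$. If we denote $z = \frac{v\eta}{2}$, the equation above can be reduced to:
$$\begin{aligned}
E[xx^T] &= \mu\mu^T + \Lambda^{-1} \frac{1}{\Gamma(v/2)} \left( \frac{v}{2} \right)^{v/2} \int_0^{+\infty} \left( \frac{2z}{v} \right)^{v/2 - 2} \exp(-z)\, \frac{2}{v}\, dz \\
&= \mu\mu^T + \Lambda^{-1} \frac{1}{\Gamma(v/2)} \cdot \frac{v}{2} \int_0^{+\infty} z^{v/2 - 2} \exp(-z)\, dz \\
&= \mu\mu^T + \Lambda^{-1} \frac{\Gamma(v/2 - 1)}{\Gamma(v/2)} \cdot \frac{v}{2} \\
&= \mu\mu^T + \Lambda^{-1} \frac{1}{v/2 - 1} \cdot \frac{v}{2} \\
&= \mu\mu^T + \frac{v}{v - 2} \Lambda^{-1}
\end{aligned}$$
where we have used the property $\Gamma(x + 1) = x\Gamma(x)$ and assumed $v > 2$ so that the integral converges. Since $\mathrm{cov}[x] = E\left[ (x - E[x])(x - E[x])^T \right] = E[xx^T] - E[x]E[x]^T$, together with $E[x] = \mu$, we obtain:
$$\mathrm{cov}[x] = \frac{v}{v - 2} \Lambda^{-1}$$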
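The only non-trivial scalar in the covariance is $E[\eta^{-1}]$ under $\mathrm{Gam}(\eta \,|\, v/2, v/2)$, which should equal $v/(v - 2)$. The sketch below checks this by trapezoidal quadrature; the function name and grid settings are assumptions, and the code assumes $v \geq 4$ so that the integrand is finite at $\eta = 0$.

```python
import math
import numpy as np

def expected_inverse_eta(v, eta_max=60.0, n=600_001):
    """Approximate E[1/eta] for eta ~ Gam(v/2, v/2) with the trapezoidal rule.

    The factor 1/eta is folded into the power of eta, so the integrand is
    (v/2)^(v/2) / Gamma(v/2) * eta^(v/2 - 2) * exp(-v * eta / 2).
    Assumes v >= 4 (integrand finite at eta = 0).
    """
    eta = np.linspace(0.0, eta_max, n)
    h = eta[1] - eta[0]
    coeff = (v / 2.0) ** (v / 2.0) / math.gamma(v / 2.0)
    f = coeff * eta ** (v / 2.0 - 2.0) * np.exp(-v * eta / 2.0)
    return float(np.sum((f[1:] + f[:-1]) / 2.0) * h)

factor_v6 = expected_inverse_eta(6.0)     # analytic value 6 / 4
factor_v10 = expected_inverse_eta(10.0)   # analytic value 10 / 8
```

As $v$ grows the factor tends to 1, consistent with the Student's t covariance approaching the Gaussian covariance $\Lambda^{-1}$.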
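The same steps as in Prob.2.47 can be used here.
$$\begin{aligned}
\mathrm{St}(x \,|\, \mu, \Lambda, v) &\propto \left[ 1 + \frac{\Delta^2}{v} \right]^{-D/2 - v/2} \\
&\propto \exp\left[ (-D/2 - v/2) \cdot \ln\left( 1 + \frac{\Delta^2}{v} \right) \right] \\
&\propto \exp\left[ -\frac{D + v}{2} \cdot \left( \frac{\Delta^2}{v} + O(v^{-2}) \right) \right] \\
&\approx \exp\left( -\frac{\Delta^2}{2} \right) \qquad (v \to \infty)
\end{aligned}$$
where we have used the Taylor expansion $\ln(1 + \epsilon) = \epsilon + O(\epsilon^2)$. And since $\Delta^2 = (x - \mu)^T \Lambda (x - \mu)$, we see that this, up to an overall constant, is a Gaussian distribution with mean $\mu$ and precision $\Lambda$.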
We first prove (2.177). Since $\exp(iA) \cdot \exp(-iA) = 1$ and $\exp(iA) = \cos A + i \sin A$, we obtain:
$$(\cos A + i \sin A) \cdot (\cos A - i \sin A) = 1$$
which gives $\cos^2 A + \sin^2 A = 1$. Then we prove (2.178) using the hint.
$$\begin{aligned}
\cos(A - B) &= \Re[\exp(i(A - B))] \\
&= \Re[\exp(iA) / \exp(iB)] \\
&= \Re\left[ \frac{\cos A + i \sin A}{\cos B + i \sin B} \right] \\
&= \Re\left[ \frac{(\cos A + i \sin A)(\cos B - i \sin B)}{(\cos B + i \sin B)(\cos B - i \sin B)} \right] \\
&= \Re[(\cos A + i \sin A)(\cos B - i \sin B)] \\
&= \cos A \cos B + \sin A \sin B
\end{aligned}$$
The proof of (2.183) is quite similar.
$$\begin{aligned}
\sin(A - B) &= \Im[\exp(i(A - B))] \\
&= \Im[(\cos A + i \sin A)(\cos B - i \sin B)] \\
&= \sin A \cos B - \cos A \sin B
\end{aligned}$$
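Let’s follow the hint. We first derive an approximation for $\exp[m \cos(\theta - \theta_0)]$.
$$\begin{aligned}
\exp\{m \cos(\theta - \theta_0)\} &= \exp\left\{ m \left[ 1 - \frac{(\theta - \theta_0)^2}{2} + O((\theta - \theta_0)^4) \right] \right\} \\
&= \exp\left\{ m - m \frac{(\theta - \theta_0)^2}{2} - m\, O((\theta - \theta_0)^4) \right\} \\
&= \exp(m) \cdot \exp\left\{ -m \frac{(\theta - \theta_0)^2}{2} \right\} \cdot \exp\left\{ -m\, O((\theta - \theta_0)^4) \right\}
\end{aligned}$$
It is the same for $\exp(m \cos\theta)$:
$$\exp\{m \cos\theta\} = \exp(m) \cdot \exp\left( -m \frac{\theta^2}{2} \right) \cdot \exp\left\{ -m\, O(\theta^4) \right\}$$
Now we rearrange (2.179):
$$\begin{aligned}
p(\theta \,|\, \theta_0, m) &= \frac{1}{2\pi I_0(m)} \exp\{m \cos(\theta - \theta_0)\} \\
&= \frac{1}{\int_0^{2\pi} \exp\{m \cos\theta\}\, d\theta} \exp\{m \cos(\theta - \theta_0)\} \\
&= \frac{\exp(m) \cdot \exp\left\{ -m \frac{(\theta - \theta_0)^2}{2} \right\} \cdot \exp\left\{ -m\, O((\theta - \theta_0)^4) \right\}}{\int_0^{2\pi} \exp(m) \cdot \exp\left( -m \frac{\theta^2}{2} \right) \cdot \exp\left\{ -m\, O(\theta^4) \right\} d\theta} \\
&= \frac{1}{\int_0^{2\pi} \exp\left( -m \frac{\theta^2}{2} \right) d\theta} \exp\left\{ -m \frac{(\theta - \theta_0)^2}{2} \right\}
\end{aligned}$$
where we have used the following fact:
$$\exp\left\{ -m\, O((\theta - \theta_0)^4) \right\} \approx \exp\left\{ -m\, O(\theta^4) \right\} \qquad (\text{when } m \to \infty)$$
Therefore, when $m \to \infty$, (2.179) reduces to a Gaussian distribution with mean $\theta_0$ and precision $m$.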
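The large-$m$ behaviour can be observed numerically by comparing the von Mises density with a Gaussian of precision $m$ near the mode. The sketch below uses `numpy.i0` for the modified Bessel function $I_0$; the value $m = 200$, the location $\theta_0 = 1$ and the evaluation window are arbitrary illustrative choices.

```python
import numpy as np

def von_mises_pdf(theta, theta0, m):
    """Von Mises density (2.179); np.i0 is the modified Bessel function I_0."""
    return np.exp(m * np.cos(theta - theta0)) / (2.0 * np.pi * np.i0(m))

def gaussian_pdf(theta, theta0, m):
    """Gaussian density with mean theta0 and precision m."""
    return np.sqrt(m / (2.0 * np.pi)) * np.exp(-0.5 * m * (theta - theta0) ** 2)

theta0, m = 1.0, 200.0
# Evaluate within about 1.4 standard deviations (1/sqrt(m) ~ 0.07) of the mode.
theta = np.linspace(theta0 - 0.1, theta0 + 0.1, 201)
ratio = von_mises_pdf(theta, theta0, m) / gaussian_pdf(theta, theta0, m)
```

For this $m$ the ratio stays within a fraction of a percent of 1 across the window; for small $m$ the two densities differ noticeably.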
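Let’s rearrange (2.182) according to (2.183).
$$\begin{aligned}
\sum_{n=1}^{N} \sin(\theta_n - \theta_0) &= \sum_{n=1}^{N} (\sin\theta_n \cos\theta_0 - \cos\theta_n \sin\theta_0) \\
&= \cos\theta_0 \sum_{n=1}^{N} \sin\theta_n - \sin\theta_0 \sum_{n=1}^{N} \cos\theta_n
\end{aligned}$$
where we have used (2.183). Together with (2.182), we obtain:
$$\cos\theta_0 \sum_{n=1}^{N} \sin\theta_n - \sin\theta_0 \sum_{n=1}^{N} \cos\theta_n = 0$$
which gives:
$$\theta_0^{ML} = \tan^{-1} \left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\}$$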
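A minimal sketch of this estimator is given below. It uses `numpy.arctan2` rather than a plain inverse tangent, which is an implementation choice made here so that the angle lands in the correct quadrant; the sample angles are made up for illustration.

```python
import numpy as np

def theta0_ml(thetas):
    """Maximum likelihood mean direction for angles given in radians."""
    return np.arctan2(np.sum(np.sin(thetas)), np.sum(np.cos(thetas)))

# Angles clustered around 0, some of them just below 2*pi.
thetas = np.array([0.1, 0.4, 6.2, 0.3, 5.9])
t0 = theta0_ml(thetas)

# Stationarity condition (2.182): sum_n sin(theta_n - theta0) = 0.
residual = np.sum(np.sin(thetas - t0))
naive_mean = np.mean(thetas)   # ordinary average, misleading for angles
```

The ordinary average `naive_mean` is pulled towards the middle of the interval, whereas `t0` stays close to 0, where the observations actually cluster.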
We calculate the first and second derivatives of (2.179) with respect to $\theta$.
$$p(\theta \,|\, \theta_0, m)' = \frac{1}{2\pi I_0(m)} \left[ -m \sin(\theta - \theta_0) \right] \exp\{m \cos(\theta - \theta_0)\}$$
$$p(\theta \,|\, \theta_0, m)'' = \frac{1}{2\pi I_0(m)} \left[ -m \cos(\theta - \theta_0) + (-m \sin(\theta - \theta_0))^2 \right] \exp\{m \cos(\theta - \theta_0)\}$$
If we set $p(\theta \,|\, \theta_0, m)'$ equal to 0, we obtain its roots:
$$\theta = \theta_0 + k\pi \qquad (k \in \mathbb{Z})$$
When $k \equiv 0 \pmod 2$, i.e. $\theta \equiv \theta_0 \pmod{2\pi}$, we have:
$$p(\theta \,|\, \theta_0, m)'' = \frac{-m \exp(m)}{2\pi I_0(m)} < 0$$
Therefore, when $\theta = \theta_0$, (2.179) attains its maximum. And when $k \equiv 1 \pmod 2$, i.e. $\theta \equiv \theta_0 + \pi \pmod{2\pi}$, we have:
$$p(\theta \,|\, \theta_0, m)'' = \frac{m \exp(-m)}{2\pi I_0(m)} > 0$$
Therefore, when $\theta = \theta_0 + \pi \pmod{2\pi}$, (2.179) attains its minimum.
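According to (2.185), we have:
$$A(m_{ML}) = \frac{1}{N} \sum_{n=1}^{N} \cos(\theta_n - \theta_0^{ML})$$
By using (2.178), we can write:
$$\begin{aligned}
A(m_{ML}) &= \frac{1}{N} \sum_{n=1}^{N} \cos(\theta_n - \theta_0^{ML}) \\
&= \frac{1}{N} \sum_{n=1}^{N} \left( \cos\theta_n \cos\theta_0^{ML} + \sin\theta_n \sin\theta_0^{ML} \right) \\
&= \left( \frac{1}{N} \sum_{n=1}^{N} \cos\theta_n \right) \cos\theta_0^{ML} + \left( \frac{1}{N} \sum_{n=1}^{N} \sin\theta_n \right) \sin\theta_0^{ML}
\end{aligned}$$
By using (2.168), we can further derive:
$$\begin{aligned}
A(m_{ML}) &= \left( \frac{1}{N} \sum_{n=1}^{N} \cos\theta_n \right) \cos\theta_0^{ML} + \left( \frac{1}{N} \sum_{n=1}^{N} \sin\theta_n \right) \sin\theta_0^{ML} \\
&= \bar{r} \cos\bar{\theta} \cdot \cos\theta_0^{ML} + \bar{r} \sin\bar{\theta} \cdot \sin\theta_0^{ML} \\
&= \bar{r} \cos(\bar{\theta} - \theta_0^{ML})
\end{aligned}$$
And then by using (2.169) and (2.184), it is obvious that $\bar{\theta} = \theta_0^{ML}$, and hence $A(m_{ML}) = \bar{r}$.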
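Recall that the distributions belonging to the exponential family have the form:
$$p(x \,|\, \eta) = h(x)\, g(\eta) \exp(\eta^T u(x))$$
According to (2.13), the beta distribution can be written as:
$$\begin{aligned}
\mathrm{Beta}(x \,|\, a, b) &= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1 - x)^{b-1} \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \exp\left[ (a - 1) \ln x + (b - 1) \ln(1 - x) \right] \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\exp\left[ a \ln x + b \ln(1 - x) \right]}{x(1 - x)}
\end{aligned}$$
Comparing it with the standard form of the exponential family, we obtain:
$$\begin{cases}
\eta = [a, b]^T \\
u(x) = [\ln x, \ln(1 - x)]^T \\
g(\eta) = \Gamma(\eta_1 + \eta_2) \big/ \left[ \Gamma(\eta_1)\Gamma(\eta_2) \right] \\
h(x) = 1 \big/ \left( x(1 - x) \right)
\end{cases}$$
where $\eta_1$ means the first element of $\eta$, i.e. $\eta_1 = a$, and $\eta_2$ means the second element of $\eta$, i.e. $\eta_2 = b$. According to (2.146), the Gamma distribution can be written as:
$$\mathrm{Gam}(x \,|\, a, b) = \frac{1}{\Gamma(a)} b^a x^{a-1} \exp(-bx) = \frac{b^a}{\Gamma(a)} \cdot \frac{1}{x} \exp(a \ln x - bx)$$
$$\begin{cases}
\eta = [a, b]^T \\
u(x) = [\ln x, -x]^T \\
g(\eta) = \eta_2^{\eta_1} \big/ \Gamma(\eta_1) \\
h(x) = 1/x
\end{cases}$$
Here $h(x)$ must not depend on the parameters, so the factor $x^{a-1}$ is split into $1/x$ (kept in $h$) and $\exp(a \ln x)$ (absorbed into the exponent). According to (2.179), the von Mises distribution can be written as:
$$\begin{aligned}
p(x \,|\, \theta_0, m) &= \frac{1}{2\pi I_0(m)} \exp(m \cos(x - \theta_0)) \\
&= \frac{1}{2\pi I_0(m)} \exp\left[ m (\cos x \cos\theta_0 + \sin x \sin\theta_0) \right]
\end{aligned}$$
$$\begin{cases}
\eta = [m \cos\theta_0, m \sin\theta_0]^T \\
u(x) = [\cos x, \sin x]^T \\
g(\eta) = 1 \big/ \left[ 2\pi I_0\left( \sqrt{\eta_1^2 + \eta_2^2} \right) \right] \\
h(x) = 1
\end{cases}$$
Note: a given distribution can be written in exponential family form in several ways, with different natural parameters.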
Recall that the distributions belonging to the exponential family have the
form:
p( x|η) = h( x) g(η) exp(ηT u( x))
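And the multivariate Gaussian distribution has the form:
\[ N(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\} \]
We expand the exponent with respect to µ.
\[
\begin{aligned}
N(x|\mu, \Sigma) &= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}\left( x^T\Sigma^{-1}x - 2\mu^T\Sigma^{-1}x + \mu^T\Sigma^{-1}\mu \right) \right\} \\
&= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}x^T\Sigma^{-1}x + \mu^T\Sigma^{-1}x \right\} \exp\left\{ -\frac{1}{2}\mu^T\Sigma^{-1}\mu \right\}
\end{aligned}
\]
Comparing it with the standard form of exponential family, we can obtain:
\[
\begin{cases}
\eta = [\Sigma^{-1}\mu, \; -\frac{1}{2}\mathrm{vec}(\Sigma^{-1})]^T \\
u(x) = [x, \; \mathrm{vec}(xx^T)]^T \\
g(\eta) = \exp\left( \frac{1}{4}\eta_1^T \eta_2^{-1} \eta_1 \right) \cdot |-2\eta_2|^{1/2} \\
h(x) = (2\pi)^{-D/2}
\end{cases}
\]
In g(η), η_2 is understood as the matrix −(1/2)Σ^{−1} before vectorization, so that the two factors equal exp(−(1/2)µ^T Σ^{−1} µ) and |Σ|^{−1/2} respectively.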
Where we have used η1 to denote the first element of η, and η2 to denote

the second element of η. And we also take advantage of the vectorizing oper-
ator, i.e.vec(·). The vectorization of a matrix is a linear transformation which
converts the matrix into a column vector. This can be viewed in an example :
\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \;\Rightarrow\; \mathrm{vec}(A) = [a, c, b, d]^T \]
Note: By introducing the vectorizing operator, we actually have vec(Σ^{−1}) · vec(xx^T) = x^T Σ^{−1} x.
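The vectorization identity used above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the dimension, the random seed and the helper name `vec` are illustrative choices, not part of the original solution.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3  # illustrative dimension

# Build a random symmetric positive definite covariance matrix.
B = rng.normal(size=(D, D))
Sigma = B @ B.T + D * np.eye(D)
Sigma_inv = np.linalg.inv(Sigma)
x = rng.normal(size=D)

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

lhs = vec(Sigma_inv) @ vec(np.outer(x, x))  # vec(Sigma^-1) . vec(x x^T)
rhs = x @ Sigma_inv @ x                     # x^T Sigma^-1 x
print(lhs, rhs)
```

Both quantities are the same sum over all pairs (i, j) of Σ^{−1}_{ij} x_i x_j, which is why the dot product of the two vectorized matrices reproduces the quadratic form.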
Problem 2.58 Solution (Wait for updating)

Based on (2.226), we have:
\[ -\nabla \ln g(\eta) = E[u(x)] \]
Taking the gradient of both sides with respect to η gives:
\[ -\nabla\nabla \ln g(\eta) = \nabla E[u(x)^T] \]
According to (2.225), E[u(x)^T] = g(η) ∫ h(x) exp{η^T u(x)} u(x)^T dx, so we can calculate ∇E[u(x)^T] with the product rule:
\[
\begin{aligned}
\nabla E[u(x)^T] &= \nabla g(\eta) \int h(x) \exp\{\eta^T u(x)\}\, u(x)^T dx + g(\eta) \int h(x) \exp\{\eta^T u(x)\}\, u(x) u(x)^T dx \\
&= \nabla \ln g(\eta)\, E[u(x)^T] + E[u(x) u(x)^T]
\end{aligned}
\]
Therefore, using (2.226) once more to replace ∇ ln g(η) by −E[u(x)], we obtain:
\[ -\nabla\nabla \ln g(\eta) = E[u(x) u(x)^T] - E[u(x)] E[u(x)^T] = \mathrm{cov}[u(x)] \]
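It is straightforward.
\[
\begin{aligned}
\int p(x|\sigma)\, dx &= \int \frac{1}{\sigma} f\left( \frac{x}{\sigma} \right) dx \\
&= \int \frac{1}{\sigma} f(u)\, \sigma\, du \\
&= \int f(u)\, du = 1
\end{aligned}
\]
Where we have denoted u = x/σ.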
Firstly, we write down the log likelihood function.
\[ \sum_{n=1}^{N} \ln p(x_n) = \sum_{i=1}^{M} n_i \ln(h_i) \]
Some details should be explained here. If x_n falls into region ∆_i, then p(x_n) equals h_i, and since we are given that n_i of the N observations fall into region ∆_i, we can write down the log likelihood function as in the equation above. Note that we use M to denote the number of different regions. Therefore, an implicit equation should hold:
\[ \sum_{i=1}^{M} n_i = N \]
We now need to take account of the constraint that p(x) must integrate to unity, which can be written as \sum_{j=1}^{M} h_j ∆_j = 1. We introduce a Lagrange multiplier to the expression, and then we need to maximize:
\[ \sum_{i=1}^{M} n_i \ln(h_i) + \lambda\left( \sum_{j=1}^{M} h_j \Delta_j - 1 \right) \]
We calculate its derivative with respect to h_i and set it equal to 0.
\[ \frac{n_i}{h_i} + \lambda \Delta_i = 0 \]
Multiplying both sides by h_i, summing over i and then using the constraint, we can obtain:
\[ N + \lambda = 0 \]
In other words, λ = −N. Then we substitute the result back into the stationarity condition above, which gives:
\[ h_i = \frac{n_i}{N} \frac{1}{\Delta_i} \]
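The closed-form histogram estimate can be sanity-checked numerically. The sketch below assumes NumPy; the bin edges, the sample size and the perturbation `delta` are made-up illustrative values. It verifies that the estimate integrates to one and that moving along the constraint surface lowers the log likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
edges = np.array([0.0, 0.5, 1.5, 3.0])   # three regions with unequal widths
widths = np.diff(edges)                  # Delta_i
data = rng.uniform(0.0, 3.0, size=200)
counts, _ = np.histogram(data, bins=edges)   # n_i
N = counts.sum()

# Maximum likelihood heights h_i = n_i / (N * Delta_i).
h_ml = counts / (N * widths)

def log_lik(h):
    """Log likelihood sum_i n_i ln h_i."""
    return np.sum(counts * np.log(h))

# A small perturbation that keeps sum_i h_i * Delta_i = 1.
delta = np.array([1.0, -1.0, 0.0]) / widths * 1e-2
h_alt = h_ml + delta

print(np.sum(h_ml * widths), log_lik(h_ml), log_lik(h_alt))
```

Because the log likelihood is strictly concave in h and the constraint is linear, any feasible perturbation of the stationary point can only decrease the objective.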
It is straightforward. In K nearest neighbours (KNN), when we want to

estimate probability density at a point x i , we will consider a small sphere
centered on x i and then allow the radius to grow until it contains K data
points, and then p( x i ) will equal to K /( NVi ), where N is total observations
and Vi is the volume of the sphere centered on x i . We can assume that Vi is
small enough that p( x i ) is roughly constant in it. In this way, We can write
down the integral:
\[ \int p(x)\, dx \approx \sum_{i=1}^{N} p(x_i) \cdot V_i = \sum_{i=1}^{N} \frac{K}{N V_i} \cdot V_i = K \neq 1 \]
We also see that if we use "1NN" (K = 1), the probability density will be
well normalized. Note that if and only if the volume of all the spheres are
small enough and N is large enough, the equation above will hold. Fortu-
nately, these two conditions can be satisfied in KNN.
0.3 Linear Models for Regression
Based on (3.6), we can write:
\[ 2\sigma(2a) - 1 = \frac{2}{1 + \exp(-2a)} - 1 = \frac{1 - \exp(-2a)}{1 + \exp(-2a)} = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} \]
Which is exactly tanh(a). Then we will find the relation between u_i and w_i in (3.101) and (3.102). Let's start from (3.101), using σ(a) = (tanh(a/2) + 1)/2.
\[
\begin{aligned}
y(x, w) &= w_0 + \sum_{j=1}^{M} w_j \sigma\left( \frac{x - \mu_j}{s} \right) \\
&= w_0 + \sum_{j=1}^{M} w_j \frac{\tanh\left( \frac{x - \mu_j}{2s} \right) + 1}{2} \\
&= w_0 + \frac{1}{2}\sum_{j=1}^{M} w_j + \sum_{j=1}^{M} \frac{w_j}{2} \tanh\left( \frac{x - \mu_j}{2s} \right)
\end{aligned}
\]
Hence the relation is given by:
\[ u_0 = w_0 + \frac{1}{2}\sum_{j=1}^{M} w_j \quad \text{and} \quad u_j = \frac{w_j}{2} \]
Note: there is a typo in (3.102), the denominator should be 2s instead of s, or alternatively you can view it as a new s′, which equals 2s.
We first need to show that Φ^T Φ is invertible. Suppose, for the sake of contradiction, that c is a nonzero vector in the kernel (null space) of Φ^T Φ. Then Φ^T Φ c equals 0 and so we have:
\[ 0 = c^T \Phi^T \Phi c = (\Phi c)^T \Phi c = ||\Phi c||^2 \]
The equation above shows that Φc = 0. However, Φc = c_1 ϕ_1 + c_2 ϕ_2 + ... + c_M ϕ_M, and the columns {ϕ_1, ϕ_2, ..., ϕ_M} of Φ are linearly independent, i.e. they form a basis for the column space of Φ. Therefore we cannot have c_1 ϕ_1 + c_2 ϕ_2 + ... + c_M ϕ_M = 0 with c ≠ 0. This is the contradiction. Hence Φ^T Φ is invertible. Then let's first prove two specific cases.
Case 1: w_1 is in the column space of Φ. In this case, we have Φc = w_1 for some c. So we have:
\[ \Phi(\Phi^T\Phi)^{-1}\Phi^T w_1 = \Phi(\Phi^T\Phi)^{-1}\Phi^T\Phi c = \Phi c = w_1 \]
Case 2: w_2 is in Φ^⊥, where Φ^⊥ is used to denote the orthogonal complement of the column space of Φ. Then we have Φ^T w_2 = 0, which leads to:
\[ \Phi(\Phi^T\Phi)^{-1}\Phi^T w_2 = 0 \]
Recall that any vector w ∈ R^N can be written as the sum of two vectors w_1 and w_2, where w_1 is in the column space of Φ and w_2 ∈ Φ^⊥. And so we have:
\[ \Phi(\Phi^T\Phi)^{-1}\Phi^T w = \Phi(\Phi^T\Phi)^{-1}\Phi^T (w_1 + w_2) = w_1 \]
Which is exactly what orthogonal projection is supposed to do.
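The two cases above can be illustrated with a small numerical experiment. This is only a sketch, assuming NumPy and a randomly generated design matrix (which has full column rank with probability one); the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 3
Phi = rng.normal(size=(N, M))

# Projection matrix Phi (Phi^T Phi)^-1 Phi^T.
P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T

# Case 1: a vector inside the column space of Phi.
w1 = Phi @ rng.normal(size=M)

# Case 2: a vector in the orthogonal complement, built from the
# trailing columns of a complete QR factorization of Phi.
Q, _ = np.linalg.qr(Phi, mode="complete")
w2 = Q[:, M:] @ rng.normal(size=N - M)

print(np.allclose(P @ w1, w1), np.allclose(P @ w2, 0.0))
```

The general case follows by linearity: applying `P` to `w1 + w2` returns `w1`.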
Let's calculate the derivative of (3.104) with respect to w.
\[ \nabla E_D(w) = -\sum_{n=1}^{N} r_n \left\{ t_n - w^T\phi(x_n) \right\} \phi(x_n)^T \]
We set the derivative equal to 0.
\[ 0 = \sum_{n=1}^{N} r_n t_n \phi(x_n)^T - w^T \left( \sum_{n=1}^{N} r_n \phi(x_n)\phi(x_n)^T \right) \]
If we denote \sqrt{r_n} ϕ(x_n) = ϕ′(x_n) and \sqrt{r_n} t_n = t′_n, we can obtain:
\[ 0 = \sum_{n=1}^{N} t'_n \phi'(x_n)^T - w^T \left( \sum_{n=1}^{N} \phi'(x_n)\phi'(x_n)^T \right) \]
Taking advantage of (3.11) – (3.17), we can derive a similar result, i.e. w_ML = (Φ^T Φ)^{−1} Φ^T t. But here, we define t as:
\[ t = \left[ \sqrt{r_1}\, t_1, \sqrt{r_2}\, t_2, ..., \sqrt{r_N}\, t_N \right]^T \]
We also define Φ as an N × M matrix, with element Φ(i, j) = \sqrt{r_i} ϕ_j(x_i).
The interpretation is twofold: (1) Examining Eq (3.10)-(3.12), we see that if we replace β by r_n · β in the summation term (i.e. the noise variance β^{−1} by β^{−1}/r_n), Eq (3.12) will become the expression in exercise 3.3, so r_n acts as a data-dependent noise precision factor. (2) r_n can also be viewed as the effective number of observations of (x_n, t_n). In other words, you can treat (x_n, t_n) as occurring r_n times.
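The rescaling trick can be compared against solving the weighted normal equations directly. The following sketch assumes NumPy and uses random, made-up data and weights; the names `w_scaled` and `w_direct` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 50, 4
Phi = rng.normal(size=(N, M))          # unweighted design matrix
t = rng.normal(size=N)                 # targets
r = rng.uniform(0.5, 2.0, size=N)      # positive weights r_n

# Route 1: scale rows by sqrt(r_n), then use ordinary least squares.
sqrt_r = np.sqrt(r)
Phi_s = sqrt_r[:, None] * Phi
t_s = sqrt_r * t
w_scaled = np.linalg.solve(Phi_s.T @ Phi_s, Phi_s.T @ t_s)

# Route 2: solve the weighted normal equations Phi^T R Phi w = Phi^T R t.
R = np.diag(r)
w_direct = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ t)

print(np.allclose(w_scaled, w_direct))
```

Both routes give the same weight vector, since Phi_s^T Phi_s equals Phi^T R Phi and Phi_s^T t_s equals Phi^T R t.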
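Firstly, we rearrange E_D(w), writing x_{ni} for the i-th component of x_n and ϵ_{ni} for the noise added to it.
\[
\begin{aligned}
E_D(w) &= \frac{1}{2}\sum_{n=1}^{N} \left\{ \left[ w_0 + \sum_{i=1}^{D} w_i (x_{ni} + \epsilon_{ni}) \right] - t_n \right\}^2 \\
&= \frac{1}{2}\sum_{n=1}^{N} \left\{ \left( w_0 + \sum_{i=1}^{D} w_i x_{ni} - t_n \right) + \sum_{i=1}^{D} w_i \epsilon_{ni} \right\}^2 \\
&= \frac{1}{2}\sum_{n=1}^{N} \left\{ y(x_n, w) - t_n + \sum_{i=1}^{D} w_i \epsilon_{ni} \right\}^2 \\
&= \frac{1}{2}\sum_{n=1}^{N} \left\{ \left( y(x_n, w) - t_n \right)^2 + \left( \sum_{i=1}^{D} w_i \epsilon_{ni} \right)^2 + 2\left( \sum_{i=1}^{D} w_i \epsilon_{ni} \right)\left( y(x_n, w) - t_n \right) \right\}
\end{aligned}
\]
Where we have used y(x_n, w) to denote the output of the linear model when the input variable is x_n, without noise added. For the second term in the equation above, we can obtain:
\[ E_\epsilon\left[ \left( \sum_{i=1}^{D} w_i \epsilon_{ni} \right)^2 \right] = E_\epsilon\left[ \sum_{i=1}^{D}\sum_{j=1}^{D} w_i w_j \epsilon_{ni} \epsilon_{nj} \right] = \sum_{i=1}^{D}\sum_{j=1}^{D} w_i w_j E_\epsilon[\epsilon_{ni} \epsilon_{nj}] = \sigma^2 \sum_{i=1}^{D}\sum_{j=1}^{D} w_i w_j \delta_{ij} \]
Which gives
\[ E_\epsilon\left[ \left( \sum_{i=1}^{D} w_i \epsilon_{ni} \right)^2 \right] = \sigma^2 \sum_{i=1}^{D} w_i^2 \]
For the third term, we can obtain:
\[
\begin{aligned}
E_\epsilon\left[ 2\left( \sum_{i=1}^{D} w_i \epsilon_{ni} \right)\left( y(x_n, w) - t_n \right) \right] &= 2\left( y(x_n, w) - t_n \right) \sum_{i=1}^{D} w_i E_\epsilon[\epsilon_{ni}] \\
&= 0
\end{aligned}
\]
Therefore, if we calculate the expectation of E_D(w) with respect to ϵ, each of the N data points contributes one copy of the second term, and we can obtain:
\[ E_\epsilon[E_D(w)] = \frac{1}{2}\sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \frac{N\sigma^2}{2} \sum_{i=1}^{D} w_i^2 \]
This is the noise-free sum-of-squares error plus a weight-decay term that excludes the bias w_0.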
We can firstly rewrite the constraint (3.30) as:
\[ \frac{1}{2}\left( \sum_{j=1}^{M} |w_j|^q - \eta \right) \leq 0 \]
Where we deliberately introduce the scaling factor 1/2 for convenience. Then it is straightforward to obtain the Lagrange function.
\[ L(w, \lambda) = \frac{1}{2}\sum_{n=1}^{N} \left\{ t_n - w^T\phi(x_n) \right\}^2 + \frac{\lambda}{2}\left( \sum_{j=1}^{M} |w_j|^q - \eta \right) \]
It is obvious that L(w, λ) and (3.29) have the same dependence on w. Meanwhile, if we denote the optimal w that minimizes L(w, λ) as w⋆(λ), we can see that
\[ \eta = \sum_{j=1}^{M} |w_j^\star|^q \]
Firstly, we write down the log likelihood function.
\[ \ln p(T|X, W, \beta) = -\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N} \left[ t_n - W^T\phi(x_n) \right]^T \Sigma^{-1} \left[ t_n - W^T\phi(x_n) \right] \]
Where we have already omitted the constant term. We set the derivative of the equation above with respect to W equal to zero.
\[ 0 = -\sum_{n=1}^{N} \Sigma^{-1} \left[ t_n - W^T\phi(x_n) \right] \phi(x_n)^T \]
Therefore, we can obtain a similar result for W as (3.15). For Σ, comparing with (2.118) – (2.124), we can easily write down a similar result:
\[ \Sigma = \frac{1}{N}\sum_{n=1}^{N} \left[ t_n - W_{ML}^T\phi(x_n) \right] \left[ t_n - W_{ML}^T\phi(x_n) \right]^T \]
We can see that the solutions for W and Σ are also decoupled.
Let's begin by writing down the prior distribution p(w) and likelihood function p(t|X, w, β).
\[ p(w) = N(w|m_0, S_0), \quad p(t|X, w, \beta) = \prod_{n=1}^{N} N(t_n | w^T\phi(x_n), \beta^{-1}) \]
Since the posterior PDF equals the product of the prior PDF and the likelihood function, up to a normalization constant, we mainly focus on the exponent of the product.
\[
\begin{aligned}
\text{exponent} &= -\frac{\beta}{2}\sum_{n=1}^{N} \left\{ t_n - w^T\phi(x_n) \right\}^2 - \frac{1}{2}(w - m_0)^T S_0^{-1} (w - m_0) \\
&= -\frac{\beta}{2}\sum_{n=1}^{N} \left\{ t_n^2 - 2 t_n w^T\phi(x_n) + w^T\phi(x_n)\phi(x_n)^T w \right\} - \frac{1}{2}(w - m_0)^T S_0^{-1} (w - m_0) \\
&= -\frac{1}{2} w^T \left[ \sum_{n=1}^{N} \beta\phi(x_n)\phi(x_n)^T + S_0^{-1} \right] w - \frac{1}{2}\left[ -2 m_0^T S_0^{-1} - 2\beta\sum_{n=1}^{N} t_n \phi(x_n)^T \right] w + \text{const}
\end{aligned}
\]
Hence, by comparing the quadratic term with the standard Gaussian distribution, we can obtain: S_N^{−1} = S_0^{−1} + βΦ^T Φ. And then comparing the linear term, we can obtain:
\[ -2 m_N^T S_N^{-1} = -2 m_0^T S_0^{-1} - 2\beta\sum_{n=1}^{N} t_n \phi(x_n)^T \]
If we multiply both sides by −0.5 and then transpose both sides, we can easily see that m_N = S_N (S_0^{−1} m_0 + βΦ^T t).
Firstly, we write down the prior:
\[ p(w) = N(w | m_N, S_N) \]
Where m_N, S_N are given by (3.50) and (3.51). And if now we observe another sample (x_{N+1}, t_{N+1}), we can write down the likelihood function:
\[ p(t_{N+1} | x_{N+1}, w) = N(t_{N+1} | y(x_{N+1}, w), \beta^{-1}) \]
Since the posterior equals the product of the likelihood function and the prior, up to a constant, we focus on the exponent (dropping the common factor −1/2).
\[
\begin{aligned}
\text{exponent} &= (w - m_N)^T S_N^{-1} (w - m_N) + \beta\left( t_{N+1} - w^T\phi(x_{N+1}) \right)^2 \\
&= w^T \left[ S_N^{-1} + \beta\phi(x_{N+1})\phi(x_{N+1})^T \right] w - 2 w^T \left[ S_N^{-1} m_N + \beta\phi(x_{N+1}) t_{N+1} \right] + \text{const}
\end{aligned}
\]
Therefore, after observing (x_{N+1}, t_{N+1}), we have p(w) = N(w | m_{N+1}, S_{N+1}), where we have defined:
\[ S_{N+1}^{-1} = S_N^{-1} + \beta\phi(x_{N+1})\phi(x_{N+1})^T \]
And
\[ m_{N+1} = S_{N+1}\left( S_N^{-1} m_N + \beta\phi(x_{N+1}) t_{N+1} \right) \]
We know that the prior p(w) can be written as:
\[ p(w) = N(w | m_N, S_N) \]
And the likelihood function p(t_{N+1} | x_{N+1}, w) can be written as:
\[ p(t_{N+1} | x_{N+1}, w) = N(t_{N+1} | y(x_{N+1}, w), \beta^{-1}) \]
According to the fact that y(x_{N+1}, w) = w^T ϕ(x_{N+1}) = ϕ(x_{N+1})^T w, the likelihood can be further written as:
\[ p(t_{N+1} | x_{N+1}, w) = N(t_{N+1} | \phi(x_{N+1})^T w, \beta^{-1}) \]
Then we take advantage of (2.113), (2.114) and (2.116), which gives:
\[ p(w | x_{N+1}, t_{N+1}) = N\left( w \,\middle|\, \Sigma\left\{ \phi(x_{N+1})\beta t_{N+1} + S_N^{-1} m_N \right\}, \Sigma \right) \]
Where Σ = (S_N^{−1} + ϕ(x_{N+1}) β ϕ(x_{N+1})^T)^{−1}, and we can see that the result is exactly the same as the one we obtained in the previous problem.
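The sequential update can be compared with the batch posterior numerically. The sketch below assumes NumPy and, as an illustrative assumption not stated in this solution, a zero-mean isotropic prior N(w | 0, α^{−1} I); the values of `alpha` and `beta` and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 3, 20
alpha, beta = 2.0, 5.0                  # assumed prior and noise precisions
Phi = rng.normal(size=(N, M))           # rows are phi(x_n)^T
t = rng.normal(size=N)

# Sequential: absorb one observation at a time.
m = np.zeros(M)
S_inv = alpha * np.eye(M)
for phi_n, t_n in zip(Phi, t):
    S_inv_new = S_inv + beta * np.outer(phi_n, phi_n)
    # m_{N+1} = S_{N+1} (S_N^-1 m_N + beta * phi * t)
    m = np.linalg.solve(S_inv_new, S_inv @ m + beta * phi_n * t_n)
    S_inv = S_inv_new

# Batch: use all observations at once.
S_inv_batch = alpha * np.eye(M) + beta * Phi.T @ Phi
m_batch = np.linalg.solve(S_inv_batch, beta * Phi.T @ t)

print(np.allclose(m, m_batch), np.allclose(S_inv, S_inv_batch))
```

Working with the precision matrix S^{−1} rather than S keeps each step a simple rank-one addition.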
We already know:
\[ p(t | w, \beta) = N(t | y(x, w), \beta^{-1}) \]
And
\[ p(w | \mathbf{t}, \alpha, \beta) = N(w | m_N, S_N) \]
Where m_N, S_N are given by (3.53) and (3.54). As in the previous problem, we can rewrite p(t | w, β) as:
\[ p(t | w, \beta) = N(t | \phi(x)^T w, \beta^{-1}) \]
And then, taking advantage of (2.113), (2.114) and (2.115), we can obtain:
\[ p(t | \mathbf{t}, \alpha, \beta) = N\left( t \,\middle|\, \phi(x)^T m_N, \; \beta^{-1} + \phi(x)^T S_N \phi(x) \right) \]
Which is exactly the same as (3.58), if we notice that
\[ \phi(x)^T m_N = m_N^T \phi(x) \]
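We need to use the result obtained in Prob.3.8. In Prob.3.8, we have derived a formula for S_{N+1}^{−1}:
\[ S_{N+1}^{-1} = S_N^{-1} + \beta\phi(x_{N+1})\phi(x_{N+1})^T \]
And then using (3.110), we can obtain:
\[
\begin{aligned}
S_{N+1} &= \left[ S_N^{-1} + \beta\phi(x_{N+1})\phi(x_{N+1})^T \right]^{-1} \\
&= \left[ S_N^{-1} + \sqrt{\beta}\phi(x_{N+1}) \sqrt{\beta}\phi(x_{N+1})^T \right]^{-1} \\
&= S_N - \frac{S_N \left( \sqrt{\beta}\phi(x_{N+1}) \right)\left( \sqrt{\beta}\phi(x_{N+1})^T \right) S_N}{1 + \left( \sqrt{\beta}\phi(x_{N+1})^T \right) S_N \left( \sqrt{\beta}\phi(x_{N+1}) \right)} \\
&= S_N - \frac{\beta S_N \phi(x_{N+1})\phi(x_{N+1})^T S_N}{1 + \beta\phi(x_{N+1})^T S_N \phi(x_{N+1})}
\end{aligned}
\]
Now we calculate σ_N^2(x) − σ_{N+1}^2(x) according to (3.59).
\[
\begin{aligned}
\sigma_N^2(x) - \sigma_{N+1}^2(x) &= \phi(x)^T (S_N - S_{N+1}) \phi(x) \\
&= \phi(x)^T \frac{\beta S_N \phi(x_{N+1})\phi(x_{N+1})^T S_N}{1 + \beta\phi(x_{N+1})^T S_N \phi(x_{N+1})} \phi(x) \\
&= \frac{\phi(x)^T S_N \phi(x_{N+1}) \phi(x_{N+1})^T S_N \phi(x)}{1/\beta + \phi(x_{N+1})^T S_N \phi(x_{N+1})} \\
&= \frac{\left[ \phi(x)^T S_N \phi(x_{N+1}) \right]^2}{1/\beta + \phi(x_{N+1})^T S_N \phi(x_{N+1})} \quad (\ast)
\end{aligned}
\]
And since S_N is positive definite, the denominator of (∗) is positive, while its numerator is a square, so (∗) is non-negative. Therefore, we have proved that σ_N^2(x) − σ_{N+1}^2(x) ≥ 0.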
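The rank-one update of S_N and the sign of the variance change can be verified on random inputs. This is a minimal sketch assuming NumPy; the matrix `S_N`, the vectors and `beta` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(5)
M = 4
beta = 3.0
B = rng.normal(size=(M, M))
S_N = np.linalg.inv(B @ B.T + np.eye(M))   # a positive definite covariance
phi_new = rng.normal(size=M)               # phi(x_{N+1})
phi_x = rng.normal(size=M)                 # phi(x) at a test input

# Direct inversion of S_N^-1 + beta * phi phi^T.
S_direct = np.linalg.inv(np.linalg.inv(S_N) + beta * np.outer(phi_new, phi_new))

# Rank-one formula obtained from (3.110).
S_rank1 = S_N - beta * (S_N @ np.outer(phi_new, phi_new) @ S_N) / (
    1.0 + beta * phi_new @ S_N @ phi_new
)

# Decrease of the model-uncertainty part of the predictive variance.
var_drop = phi_x @ (S_N - S_rank1) @ phi_x
print(np.allclose(S_direct, S_rank1), var_drop)
```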
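Let's begin by writing down the prior PDF p(w, β):
\[
\begin{aligned}
p(w, \beta) &= N(w | m_0, \beta^{-1} S_0)\, \mathrm{Gam}(\beta | a_0, b_0) \quad (\ast) \\
&\propto \left( \frac{\beta^M}{|S_0|} \right)^{1/2} \exp\left( -\frac{1}{2}(w - m_0)^T \beta S_0^{-1} (w - m_0) \right) b_0^{a_0} \beta^{a_0 - 1} \exp(-b_0\beta)
\end{aligned}
\]
And then we write down the likelihood function p(t|X, w, β):
\[
\begin{aligned}
p(\mathbf{t} | X, w, \beta) &= \prod_{n=1}^{N} N(t_n | w^T\phi(x_n), \beta^{-1}) \\
&\propto \prod_{n=1}^{N} \beta^{1/2} \exp\left[ -\frac{\beta}{2}\left( t_n - w^T\phi(x_n) \right)^2 \right] \quad (\ast\ast)
\end{aligned}
\]
According to Bayesian inference, we have p(w, β|t) ∝ p(t|X, w, β) × p(w, β). We first focus on the quadratic term with regard to w in the exponent.
\[
\begin{aligned}
\text{quadratic term} &= -\frac{\beta}{2} w^T S_0^{-1} w + \sum_{n=1}^{N} -\frac{\beta}{2} w^T \phi(x_n)\phi(x_n)^T w \\
&= -\frac{\beta}{2} w^T \left[ S_0^{-1} + \sum_{n=1}^{N} \phi(x_n)\phi(x_n)^T \right] w
\end{aligned}
\]
Where the first term is generated by (∗), and the second by (∗∗). By now, we know that:
\[ S_N^{-1} = S_0^{-1} + \sum_{n=1}^{N} \phi(x_n)\phi(x_n)^T \]
We then focus on the linear term with regard to w in the exponent.
\[
\begin{aligned}
\text{linear term} &= \beta m_0^T S_0^{-1} w + \sum_{n=1}^{N} \beta t_n \phi(x_n)^T w \\
&= \beta\left[ m_0^T S_0^{-1} + \sum_{n=1}^{N} t_n \phi(x_n)^T \right] w
\end{aligned}
\]
Again, the first term is generated by (∗), and the second by (∗∗). We can also obtain:
\[ m_N^T S_N^{-1} = m_0^T S_0^{-1} + \sum_{n=1}^{N} t_n \phi(x_n)^T \]
Which gives:
\[ m_N = S_N \left[ S_0^{-1} m_0 + \sum_{n=1}^{N} t_n \phi(x_n) \right] \]
Then we focus on the constant term with regard to w in the exponent.
\[
\begin{aligned}
\text{constant term} &= \left( -\frac{\beta}{2} m_0^T S_0^{-1} m_0 - b_0\beta \right) - \frac{\beta}{2}\sum_{n=1}^{N} t_n^2 \\
&= -\beta\left[ \frac{1}{2} m_0^T S_0^{-1} m_0 + b_0 + \frac{1}{2}\sum_{n=1}^{N} t_n^2 \right]
\end{aligned}
\]
Therefore, we can obtain:
\[ \frac{1}{2} m_N^T S_N^{-1} m_N + b_N = \frac{1}{2} m_0^T S_0^{-1} m_0 + b_0 + \frac{1}{2}\sum_{n=1}^{N} t_n^2 \]
Which gives:
\[ b_N = \frac{1}{2} m_0^T S_0^{-1} m_0 + b_0 + \frac{1}{2}\sum_{n=1}^{N} t_n^2 - \frac{1}{2} m_N^T S_N^{-1} m_N \]
Finally, we focus on the power of β, where the factor β^{M/2} comes from the Gaussian over w.
\[ \text{power of } \beta = \left( \frac{M}{2} + a_0 - 1 \right) + \frac{N}{2} \]
Which gives:
\[ \frac{M}{2} + a_N - 1 = \left( \frac{M}{2} + a_0 - 1 \right) + \frac{N}{2} \]
Hence,
\[ a_N = a_0 + \frac{N}{2} \]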
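Similar to (3.57), we write down the expression of the predictive distribution p(t|X, t):
\[ p(t | X, \mathbf{t}) = \iint p(t | w, \beta)\, p(w, \beta | X, \mathbf{t})\, dw\, d\beta \quad (\ast) \]
We know that:
\[ p(t | w, \beta) = N(t | y(x, w), \beta^{-1}) = N(t | \phi(x)^T w, \beta^{-1}) \]
and that:
\[ p(w, \beta | X, \mathbf{t}) = N(w | m_N, \beta^{-1} S_N)\, \mathrm{Gam}(\beta | a_N, b_N) \]
We go back to (∗), and first deal with the integral with regard to w:
\[
\begin{aligned}
p(t | X, \mathbf{t}) &= \int \left[ \int N(t | \phi(x)^T w, \beta^{-1})\, N(w | m_N, \beta^{-1} S_N)\, dw \right] \mathrm{Gam}(\beta | a_N, b_N)\, d\beta \\
&= \int N\left( t \,\middle|\, \phi(x)^T m_N, \; \beta^{-1} + \phi(x)^T \beta^{-1} S_N \phi(x) \right) \mathrm{Gam}(\beta | a_N, b_N)\, d\beta \\
&= \int N\left[ t \,\middle|\, \phi(x)^T m_N, \; \beta^{-1}\left( 1 + \phi(x)^T S_N \phi(x) \right) \right] \mathrm{Gam}(\beta | a_N, b_N)\, d\beta
\end{aligned}
\]
Where we have used (2.113), (2.114) and (2.115). Then, following (2.158)-(2.160), we can see that p(t|X, t) = St(t|µ, λ, ν), where we have defined:
\[ \mu = \phi(x)^T m_N, \quad \lambda = \frac{a_N}{b_N} \cdot \left[ 1 + \phi(x)^T S_N \phi(x) \right]^{-1}, \quad \nu = 2 a_N \]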
Problem 3.14 Solution(Wait for updating)
Firstly, according to (3.16), if we use the new orthonormal basis set specified in the problem to construct Φ, we can obtain an important property: Φ^T Φ = I. Hence, if α = 0, together with (3.54), we know that S_N = β^{−1} I. Finally, according to (3.62), we can obtain:
\[ k(x, x') = \beta\psi(x)^T S_N \psi(x') = \psi(x)^T \psi(x') \]
It is quite obvious if we substitute (3.92) and (3.95) into (3.82), which gives:
\[ E(m_N) = \frac{\beta}{2}||\mathbf{t} - \Phi m_N||^2 + \frac{\alpha}{2} m_N^T m_N = \frac{N - \gamma}{2} + \frac{\gamma}{2} = \frac{N}{2} \]
Problem 3.16 Solution(Waiting for update)
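We know that
\[ p(\mathbf{t} | w, \beta) = \prod_{n=1}^{N} N(t_n | \phi(x_n)^T w, \beta^{-1}) = N(\mathbf{t} | \Phi w, \beta^{-1} I) \]
And
\[ p(w | \alpha) = N(w | 0, \alpha^{-1} I) \]
Comparing them with (2.113), (2.114) and (2.115), we can obtain:
\[ p(\mathbf{t} | \alpha, \beta) = N(\mathbf{t} | 0, \beta^{-1} I + \alpha^{-1} \Phi\Phi^T) \]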
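We know that:
\[
\begin{aligned}
p(\mathbf{t} | w, \beta) &= \prod_{n=1}^{N} N(t_n | \phi(x_n)^T w, \beta^{-1}) \\
&= \prod_{n=1}^{N} \frac{1}{(2\pi\beta^{-1})^{1/2}} \exp\left\{ -\frac{1}{2\beta^{-1}}\left( t_n - \phi(x_n)^T w \right)^2 \right\} \\
&= \left( \frac{\beta}{2\pi} \right)^{N/2} \exp\left\{ \sum_{n=1}^{N} -\frac{\beta}{2}\left( t_n - \phi(x_n)^T w \right)^2 \right\} \\
&= \left( \frac{\beta}{2\pi} \right)^{N/2} \exp\left\{ -\frac{\beta}{2}||\mathbf{t} - \Phi w||^2 \right\}
\end{aligned}
\]
And that:
\[
\begin{aligned}
p(w | \alpha) &= N(w | 0, \alpha^{-1} I) \\
&= \frac{\alpha^{M/2}}{(2\pi)^{M/2}} \exp\left\{ -\frac{\alpha}{2}||w||^2 \right\}
\end{aligned}
\]
If we substitute the expressions above into (3.77), we can obtain (3.78) just as required.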
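We expand (3.79) as follows:
\[
\begin{aligned}
E(w) &= \frac{\beta}{2}||\mathbf{t} - \Phi w||^2 + \frac{\alpha}{2} w^T w \\
&= \frac{\beta}{2}\left( \mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T\Phi w + w^T\Phi^T\Phi w \right) + \frac{\alpha}{2} w^T w \\
&= \frac{1}{2}\left[ w^T(\beta\Phi^T\Phi + \alpha I) w - 2\beta\mathbf{t}^T\Phi w + \beta\mathbf{t}^T\mathbf{t} \right]
\end{aligned}
\]
Observing the equation above, we see that E(w) contains the following term:
\[ \frac{1}{2}(w - m_N)^T A (w - m_N) \quad (\ast) \]
Now, we need to solve for A and m_N. We expand (∗) and obtain:
\[ (\ast) = \frac{1}{2}\left( w^T A w - 2 m_N^T A w + m_N^T A m_N \right) \]
We firstly compare the quadratic term, which gives:
\[ A = \beta\Phi^T\Phi + \alpha I \]
And then we compare the linear term, which gives:
\[ m_N^T A = \beta\mathbf{t}^T\Phi \]
Noticing that A = A^T, which implies A^{−1} is also symmetric, we first transpose and then multiply by A^{−1} on both sides, which gives:
\[ m_N = \beta A^{-1}\Phi^T\mathbf{t} \]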
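Now we rewrite E(w):
\[
\begin{aligned}
E(w) &= \frac{1}{2}\left[ w^T(\beta\Phi^T\Phi + \alpha I) w - 2\beta\mathbf{t}^T\Phi w + \beta\mathbf{t}^T\mathbf{t} \right] \\
&= \frac{1}{2}\left[ (w - m_N)^T A (w - m_N) + \beta\mathbf{t}^T\mathbf{t} - m_N^T A m_N \right] \\
&= \frac{1}{2}(w - m_N)^T A (w - m_N) + \frac{1}{2}\left( \beta\mathbf{t}^T\mathbf{t} - m_N^T A m_N \right) \\
&= \frac{1}{2}(w - m_N)^T A (w - m_N) + \frac{1}{2}\left( \beta\mathbf{t}^T\mathbf{t} - 2 m_N^T A m_N + m_N^T A m_N \right) \\
&= \frac{1}{2}(w - m_N)^T A (w - m_N) + \frac{1}{2}\left( \beta\mathbf{t}^T\mathbf{t} - 2 m_N^T A m_N + m_N^T(\beta\Phi^T\Phi + \alpha I) m_N \right) \\
&= \frac{1}{2}(w - m_N)^T A (w - m_N) + \frac{1}{2}\left[ \beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\Phi m_N + m_N^T(\beta\Phi^T\Phi) m_N \right] + \frac{\alpha}{2} m_N^T m_N \\
&= \frac{1}{2}(w - m_N)^T A (w - m_N) + \frac{\beta}{2}||\mathbf{t} - \Phi m_N||^2 + \frac{\alpha}{2} m_N^T m_N
\end{aligned}
\]
Just as required.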
Based on the standard form of a multivariate normal distribution with mean m_N and covariance A^{−1}, we know that
\[ \frac{|A|^{1/2}}{(2\pi)^{M/2}} \int \exp\left\{ -\frac{1}{2}(w - m_N)^T A (w - m_N) \right\} dw = 1 \]
Hence,
\[ \int \exp\left\{ -\frac{1}{2}(w - m_N)^T A (w - m_N) \right\} dw = (2\pi)^{M/2} |A|^{-1/2} \]
And since E(m_N) doesn't depend on w, (3.85) follows directly. Then we substitute (3.85) into (3.78), which immediately gives (3.86).
You can just follow the steps from (3.87) to (3.92), which are already very clear.
Let's first prove (3.117). According to (C.47) and (C.48), we know that if A is an M × M real symmetric matrix with eigenvalues λ_i, i = 1, 2, ..., M, then |A| and Tr(A) can be written as:
\[ |A| = \prod_{i=1}^{M} \lambda_i, \quad \mathrm{Tr}(A) = \sum_{i=1}^{M} \lambda_i \]
Back to this problem, according to section 3.5.2, we know that A has eigenvalues α + λ_i, i = 1, 2, ..., M. Hence the left side of (3.117) equals:
\[ \text{left side} = \frac{d}{d\alpha} \ln\left[ \prod_{i=1}^{M} (\alpha + \lambda_i) \right] = \sum_{i=1}^{M} \frac{d}{d\alpha} \ln(\alpha + \lambda_i) = \sum_{i=1}^{M} \frac{1}{\alpha + \lambda_i} \]
And according to (3.81), we can obtain:
\[ A^{-1} \frac{d}{d\alpha} A = A^{-1} I = A^{-1} \]
For the symmetric matrix A, its inverse A^{−1} has eigenvalues 1/(α + λ_i), i = 1, 2, ..., M. Therefore,
\[ \mathrm{Tr}\left( A^{-1} \frac{d}{d\alpha} A \right) = \sum_{i=1}^{M} \frac{1}{\alpha + \lambda_i} \]
Hence they are the same, and (3.92) is quite obvious.
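Let's differentiate (3.86) with respect to β. The first term dependent on β in (3.86) is:
\[ \frac{d}{d\beta}\left( \frac{N}{2}\ln\beta \right) = \frac{N}{2\beta} \]
The second term is:
\[ \frac{d}{d\beta} E(m_N) = \frac{1}{2}||\mathbf{t} - \Phi m_N||^2 + \frac{\beta}{2}\frac{d}{d\beta}||\mathbf{t} - \Phi m_N||^2 + \frac{d}{d\beta}\frac{\alpha}{2} m_N^T m_N \]
The last two terms in the equation above can be further written as:
\[
\begin{aligned}
\frac{\beta}{2}\frac{d}{d\beta}||\mathbf{t} - \Phi m_N||^2 + \frac{d}{d\beta}\frac{\alpha}{2} m_N^T m_N &= \left\{ \frac{\beta}{2}\frac{d}{dm_N}||\mathbf{t} - \Phi m_N||^2 + \frac{d}{dm_N}\frac{\alpha}{2} m_N^T m_N \right\} \cdot \frac{dm_N}{d\beta} \\
&= \left\{ \frac{\beta}{2}\left[ -2\Phi^T(\mathbf{t} - \Phi m_N) \right] + \frac{\alpha}{2} 2 m_N \right\} \cdot \frac{dm_N}{d\beta} \\
&= \left\{ -\beta\Phi^T(\mathbf{t} - \Phi m_N) + \alpha m_N \right\} \cdot \frac{dm_N}{d\beta} \\
&= \left\{ -\beta\Phi^T\mathbf{t} + (\alpha I + \beta\Phi^T\Phi) m_N \right\} \cdot \frac{dm_N}{d\beta} \\
&= \left\{ -\beta\Phi^T\mathbf{t} + A m_N \right\} \cdot \frac{dm_N}{d\beta} \\
&= 0
\end{aligned}
\]
Where we have taken advantage of (3.83) and (3.84). Hence
\[ \frac{d}{d\beta} E(m_N) = \frac{1}{2}||\mathbf{t} - \Phi m_N||^2 = \frac{1}{2}\sum_{n=1}^{N} \left( t_n - m_N^T\phi(x_n) \right)^2 \]
The last term dependent on β in (3.86) is:
\[ \frac{d}{d\beta}\left( \frac{1}{2}\ln|A| \right) = \frac{\gamma}{2\beta} \]
Therefore, if we combine all those expressions together, we will obtain (3.94). And then if we rearrange it, we will obtain (3.95).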
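First, according to (3.10), we know that p(t|X, w, β) can be further written as p(t|X, w, β) = N(t|Φw, β^{−1} I), and we are given that p(w|β) = N(w|m_0, β^{−1} S_0) and p(β) = Gam(β|a_0, b_0). Therefore, we just follow the hint in the problem.
\[
\begin{aligned}
p(\mathbf{t}) &= \iint p(\mathbf{t} | X, w, \beta)\, p(w | \beta)\, dw\; p(\beta)\, d\beta \\
&= \iint \left( \frac{\beta}{2\pi} \right)^{N/2} \exp\left\{ -\frac{\beta}{2}(\mathbf{t} - \Phi w)^T(\mathbf{t} - \Phi w) \right\} \cdot \left( \frac{\beta}{2\pi} \right)^{M/2} |S_0|^{-1/2} \exp\left\{ -\frac{\beta}{2}(w - m_0)^T S_0^{-1} (w - m_0) \right\} dw \\
&\qquad \cdot \Gamma(a_0)^{-1} b_0^{a_0} \beta^{a_0 - 1} \exp(-b_0\beta)\, d\beta \\
&= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{(M+N)/2} |S_0|^{1/2}} \iint \exp\left\{ -\frac{\beta}{2}(\mathbf{t} - \Phi w)^T(\mathbf{t} - \Phi w) \right\} \exp\left\{ -\frac{\beta}{2}(w - m_0)^T S_0^{-1} (w - m_0) \right\} dw \\
&\qquad \cdot \beta^{a_0 - 1 + N/2 + M/2} \exp(-b_0\beta)\, d\beta \\
&= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{(M+N)/2} |S_0|^{1/2}} \iint \exp\left\{ -\frac{\beta}{2}(w - m_N)^T S_N^{-1} (w - m_N) \right\} dw \\
&\qquad \cdot \exp\left\{ -\frac{\beta}{2}\left( \mathbf{t}^T\mathbf{t} + m_0^T S_0^{-1} m_0 - m_N^T S_N^{-1} m_N \right) \right\} \beta^{a_N - 1 + M/2} \exp(-b_0\beta)\, d\beta
\end{aligned}
\]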
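Where we have defined
\[
\begin{aligned}
m_N &= S_N\left( S_0^{-1} m_0 + \Phi^T\mathbf{t} \right) \\
S_N^{-1} &= S_0^{-1} + \Phi^T\Phi \\
a_N &= a_0 + \frac{N}{2} \\
b_N &= b_0 + \frac{1}{2}\left( m_0^T S_0^{-1} m_0 - m_N^T S_N^{-1} m_N + \sum_{n=1}^{N} t_n^2 \right)
\end{aligned}
\]
Which are exactly the same as those in Prob.3.12. Then we evaluate the integral, taking advantage of the normalization of the multivariate Gaussian distribution and the Gamma distribution.
\[
\begin{aligned}
p(\mathbf{t}) &= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{(M+N)/2} |S_0|^{1/2}} \int \left( \frac{2\pi}{\beta} \right)^{M/2} |S_N|^{1/2} \beta^{a_N - 1 + M/2} \exp(-b_N\beta)\, d\beta \\
&= \frac{b_0^{a_0}}{\Gamma(a_0)(2\pi)^{(M+N)/2} |S_0|^{1/2}} (2\pi)^{M/2} |S_N|^{1/2} \int \beta^{a_N - 1} \exp(-b_N\beta)\, d\beta \\
&= \frac{1}{(2\pi)^{N/2}} \frac{|S_N|^{1/2}}{|S_0|^{1/2}} \frac{b_0^{a_0}}{b_N^{a_N}} \frac{\Gamma(a_N)}{\Gamma(a_0)}
\end{aligned}
\]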
Just as required.
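Let's follow the hint and begin by writing down expressions for the likelihood, prior and posterior PDF. We know that p(t|w, β) = N(t|Φw, β^{−1} I). What's more, the forms of the prior and posterior are quite similar:
\[ p(w, \beta) = N(w | m_0, \beta^{-1} S_0)\, \mathrm{Gam}(\beta | a_0, b_0) \]
And
\[ p(w, \beta | \mathbf{t}) = N(w | m_N, \beta^{-1} S_N)\, \mathrm{Gam}(\beta | a_N, b_N) \]
Where the relationships among those parameters are shown in Prob.3.12 and Prob.3.23. Now according to (3.119), we can write:
\[
\begin{aligned}
p(\mathbf{t}) &= N(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{N(w | m_0, \beta^{-1} S_0)\, \mathrm{Gam}(\beta | a_0, b_0)}{N(w | m_N, \beta^{-1} S_N)\, \mathrm{Gam}(\beta | a_N, b_N)} \\
&= N(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{N(w | m_0, \beta^{-1} S_0)}{N(w | m_N, \beta^{-1} S_N)} \cdot \frac{b_0^{a_0} \beta^{a_0 - 1} \exp(-b_0\beta) / \Gamma(a_0)}{b_N^{a_N} \beta^{a_N - 1} \exp(-b_N\beta) / \Gamma(a_N)} \\
&= N(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{N(w | m_0, \beta^{-1} S_0)}{N(w | m_N, \beta^{-1} S_N)} \cdot \frac{b_0^{a_0} \Gamma(a_N)}{b_N^{a_N} \Gamma(a_0)} \beta^{a_0 - a_N} \exp\left\{ -(b_0 - b_N)\beta \right\} \\
&= N(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{N(w | m_0, \beta^{-1} S_0)}{N(w | m_N, \beta^{-1} S_N)} \exp\left\{ -(b_0 - b_N)\beta \right\} \frac{b_0^{a_0} \Gamma(a_N)}{b_N^{a_N} \Gamma(a_0)} \beta^{-N/2}
\end{aligned}
\]
Where we have used a_N = a_0 + N/2. Now we deal with the terms expressed in the form of Gaussian distributions:
\[
\begin{aligned}
\text{Gaussian terms} &= N(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{N(w | m_0, \beta^{-1} S_0)}{N(w | m_N, \beta^{-1} S_N)} \\
&= \left( \frac{\beta}{2\pi} \right)^{N/2} \exp\left\{ -\frac{\beta}{2}(\mathbf{t} - \Phi w)^T(\mathbf{t} - \Phi w) \right\} \cdot \frac{|\beta^{-1} S_N|^{1/2} \exp\left\{ -\frac{\beta}{2}(w - m_0)^T S_0^{-1} (w - m_0) \right\}}{|\beta^{-1} S_0|^{1/2} \exp\left\{ -\frac{\beta}{2}(w - m_N)^T S_N^{-1} (w - m_N) \right\}} \\
&= \left( \frac{\beta}{2\pi} \right)^{N/2} \frac{|S_N|^{1/2}}{|S_0|^{1/2}} \exp\left\{ -\frac{\beta}{2}(\mathbf{t} - \Phi w)^T(\mathbf{t} - \Phi w) \right\} \cdot \frac{\exp\left\{ -\frac{\beta}{2}(w - m_0)^T S_0^{-1} (w - m_0) \right\}}{\exp\left\{ -\frac{\beta}{2}(w - m_N)^T S_N^{-1} (w - m_N) \right\}}
\end{aligned}
\]
We look back at the previous problem and notice that at the last step in the deduction of p(t), we complete the square with respect to w. And if we carefully compare the left and right sides at that step, we can obtain:
\[ \exp\left\{ -\frac{\beta}{2}(\mathbf{t} - \Phi w)^T(\mathbf{t} - \Phi w) \right\} \exp\left\{ -\frac{\beta}{2}(w - m_0)^T S_0^{-1} (w - m_0) \right\} = \exp\left\{ -\frac{\beta}{2}(w - m_N)^T S_N^{-1} (w - m_N) \right\} \exp\left\{ -(b_N - b_0)\beta \right\} \]
Hence, we go back to deal with the Gaussian terms:
\[ \text{Gaussian terms} = \left( \frac{\beta}{2\pi} \right)^{N/2} \frac{|S_N|^{1/2}}{|S_0|^{1/2}} \exp\left\{ -(b_N - b_0)\beta \right\} \]
If we substitute the expressions above into p(t), we will obtain (3.118) immediately.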
0.4 Linear Models for Classification

If the convex hulls of {x_n} and {y_n} intersect, we know that there will be a point z which can be written as z = \sum_n α_n x_n and also z = \sum_n β_n y_n. Hence we can obtain:
\[
\begin{aligned}
\hat{w}^T z + w_0 &= \hat{w}^T \left( \sum_n \alpha_n x_n \right) + w_0 \\
&= \sum_n \alpha_n (\hat{w}^T x_n) + \left( \sum_n \alpha_n \right) w_0 \\
&= \sum_n \alpha_n (\hat{w}^T x_n + w_0) \quad (\ast)
\end{aligned}
\]
Where we have used \sum_n α_n = 1. And if {x_n} and {y_n} are linearly separable, we have \hat{w}^T x_n + w_0 > 0 and \hat{w}^T y_n + w_0 < 0 for all x_n, y_n. Together with α_n ≥ 0 and (∗), we know that \hat{w}^T z + w_0 > 0. And if we calculate \hat{w}^T z + w_0 from the perspective of {y_n} following the same procedure, we can obtain \hat{w}^T z + w_0 < 0. Hence a contradiction occurs. In other words, they are not linearly separable if their convex hulls intersect.
We have already proved the first statement, i.e., "convex hulls intersect" gives "not linearly separable", and what the second part wants us to prove is that "linearly separable" gives "convex hulls do not intersect". This can be done simply by contrapositive.
The true converse of the first statement should be: if their convex hulls do not intersect, the data sets are linearly separable. This is exactly what the Hyperplane Separation Theorem shows us.

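Let's make the dependency of E_D(\tilde{W}) on w_0 explicit:
\[ E_D(\tilde{W}) = \frac{1}{2}\mathrm{Tr}\left\{ (XW + \mathbf{1}w_0^T - T)^T (XW + \mathbf{1}w_0^T - T) \right\} \]
Then we calculate the derivative of E_D(\tilde{W}) with respect to w_0:
\[ \frac{\partial E_D(\tilde{W})}{\partial w_0} = N w_0 + (XW - T)^T \mathbf{1} \]
Where we have used the property:
\[ \frac{\partial}{\partial X}\mathrm{Tr}\left[ (AXB + C)(AXB + C)^T \right] = 2A^T(AXB + C)B^T \]
We set the derivative equal to 0, which gives:
\[ w_0 = -\frac{1}{N}(XW - T)^T \mathbf{1} = \bar{t} - W^T\bar{x} \]
Where we have denoted:
\[ \bar{t} = \frac{1}{N}T^T\mathbf{1}, \quad \text{and} \quad \bar{x} = \frac{1}{N}X^T\mathbf{1} \]
If we substitute the equations above into E_D(\tilde{W}), we can obtain:
\[ E_D(\tilde{W}) = \frac{1}{2}\mathrm{Tr}\left\{ (XW + \bar{T} - \bar{X}W - T)^T (XW + \bar{T} - \bar{X}W - T) \right\} \]
Where we further denote
\[ \bar{T} = \mathbf{1}\bar{t}^T, \quad \text{and} \quad \bar{X} = \mathbf{1}\bar{x}^T \]
Then we set the derivative of E_D(\tilde{W}) with regard to W to 0, which gives:
\[ W = \hat{X}^\dagger \hat{T}, \quad \hat{X} = X - \bar{X}, \quad \text{and} \quad \hat{T} = T - \bar{T} \]
Now consider the prediction for a new given x, we have:
\[
\begin{aligned}
y(x) &= W^T x + w_0 \\
&= W^T x + \bar{t} - W^T\bar{x} \\
&= \bar{t} + W^T(x - \bar{x})
\end{aligned}
\]
If we know that a^T t_n + b = 0 holds for some a and b, we can obtain:
\[ a^T\bar{t} = \frac{1}{N}a^T T^T \mathbf{1} = \frac{1}{N}\sum_{n=1}^{N} a^T t_n = -b \]
Therefore,
\[
\begin{aligned}
a^T y(x) &= a^T\left[ \bar{t} + W^T(x - \bar{x}) \right] \\
&= a^T\bar{t} + a^T W^T (x - \bar{x}) \\
&= -b + a^T\hat{T}^T(\hat{X}^\dagger)^T(x - \bar{x}) \\
&= -b
\end{aligned}
\]
Where we have used:
\[
\begin{aligned}
a^T\hat{T}^T &= a^T(T - \bar{T})^T = a^T\left( T - \frac{1}{N}\mathbf{1}\mathbf{1}^T T \right)^T \\
&= a^T T^T - \frac{1}{N}a^T T^T \mathbf{1}\mathbf{1}^T = -b\mathbf{1}^T + b\mathbf{1}^T \\
&= \mathbf{0}^T
\end{aligned}
\]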
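The result can be illustrated numerically: if every training target satisfies a linear constraint, the least-squares prediction for a new input satisfies it too. The sketch below assumes NumPy; the constraint vector `a`, the offset `b` and the data are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, K = 30, 4, 3
X = rng.normal(size=(N, D))

# Constraint a^T t_n + b = 0 with a = (1, 2, -1), b = 0.5.
a = np.array([1.0, 2.0, -1.0])
b = 0.5
T = rng.normal(size=(N, K))
# Choose the third target component so that the constraint holds exactly:
# t0 + 2*t1 - t2 + b = 0  =>  t2 = t0 + 2*t1 + b.
T[:, 2] = a[0] * T[:, 0] + a[1] * T[:, 1] + b

# Least-squares solution with explicit bias, as derived above.
x_bar = X.mean(axis=0)
t_bar = T.mean(axis=0)
W = np.linalg.pinv(X - x_bar) @ (T - t_bar)   # W = X_hat^dagger T_hat
w0 = t_bar - W.T @ x_bar

x_new = rng.normal(size=D)
y_new = W.T @ x_new + w0
print(a @ y_new + b)   # should vanish up to rounding error
```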
Suppose there are Q constraints in total. We can write a_q^T t_n + b_q = 0, q = 1, 2, ..., Q for all the target vectors t_n, n = 1, 2, ..., N. Or alternatively, we can group them together:
\[ A^T t_n + b = 0 \]
Where A is a K × Q matrix (K being the dimension of the target vectors) whose qth column is a_q, and b is a Q × 1 column vector whose qth element is b_q. For every pair {a_q, b_q} we can follow the same procedure as in the previous problem to show that a_q^T y(x) + b_q = 0. In other words, the proofs do not affect each other. Therefore, it is obvious that:
\[ A^T y(x) + b = 0 \]
We use a Lagrange multiplier to enforce the constraint w^T w = 1. We now need to maximize:
\[ L(\lambda, w) = w^T(m_2 - m_1) + \lambda(w^T w - 1) \]
We calculate the derivatives:
\[ \frac{\partial L(\lambda, w)}{\partial \lambda} = w^T w - 1 \]
And
\[ \frac{\partial L(\lambda, w)}{\partial w} = m_2 - m_1 + 2\lambda w \]
We set the derivatives above equal to 0, which gives:
\[ w = -\frac{1}{2\lambda}(m_2 - m_1) \propto (m_2 - m_1) \]
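We expand (4.25) using (4.22), (4.23) and (4.24), where the scalars on the first line are the projected class means and the vectors m_1, m_2 on the second line are the class means in the input space.
\[
\begin{aligned}
J(w) &= \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} \\
&= \frac{||w^T(\mathbf{m}_2 - \mathbf{m}_1)||^2}{\sum_{n \in C_1} (w^T x_n - m_1)^2 + \sum_{n \in C_2} (w^T x_n - m_2)^2}
\end{aligned}
\]
The numerator can be further written as:
\[ \text{numerator} = \left[ w^T(\mathbf{m}_2 - \mathbf{m}_1) \right]\left[ w^T(\mathbf{m}_2 - \mathbf{m}_1) \right]^T = w^T S_B w, \quad S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T \]
And it is the same for the denominator:
\[
\begin{aligned}
\text{denominator} &= \sum_{n \in C_1} \left[ w^T(x_n - \mathbf{m}_1) \right]^2 + \sum_{n \in C_2} \left[ w^T(x_n - \mathbf{m}_2) \right]^2 \\
&= w^T S_{w1} w + w^T S_{w2} w \\
&= w^T S_W w
\end{aligned}
\]
\[ S_W = \sum_{n \in C_1} (x_n - \mathbf{m}_1)(x_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (x_n - \mathbf{m}_2)(x_n - \mathbf{m}_2)^T \]
Just as required.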
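Let's follow the hint, beginning by expanding (4.33), where we use w_0 = −w^T m and the target coding t_n = N/N_1 for class C_1 and t_n = −N/N_2 for class C_2.
\[
\begin{aligned}
(4.33) &= \sum_{n=1}^{N} (w^T x_n) x_n + w_0 \sum_{n=1}^{N} x_n - \sum_{n=1}^{N} t_n x_n \\
&= \sum_{n=1}^{N} x_n x_n^T w - (w^T m) \sum_{n=1}^{N} x_n - \left( \sum_{n \in C_1} t_n x_n + \sum_{n \in C_2} t_n x_n \right) \\
&= \sum_{n=1}^{N} x_n x_n^T w - (w^T m) \cdot (N m) - \left( \sum_{n \in C_1} \frac{N}{N_1} x_n + \sum_{n \in C_2} \frac{-N}{N_2} x_n \right) \\
&= \sum_{n=1}^{N} x_n x_n^T w - N (w^T m) m - N\left( \sum_{n \in C_1} \frac{1}{N_1} x_n - \sum_{n \in C_2} \frac{1}{N_2} x_n \right) \\
&= \sum_{n=1}^{N} x_n x_n^T w - N m m^T w - N(m_1 - m_2) \\
&= \left[ \sum_{n=1}^{N} x_n x_n^T - N m m^T \right] w - N(m_1 - m_2)
\end{aligned}
\]
If we set the derivative equal to 0, we will see that:
\[ \left[ \sum_{n=1}^{N} x_n x_n^T - N m m^T \right] w = N(m_1 - m_2) \]
Therefore, now we need to prove:
\[ \sum_{n=1}^{N} x_n x_n^T - N m m^T = S_W + \frac{N_1 N_2}{N} S_B \]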
Let's expand the left side of the equation above, using m = (N_1 m_1 + N_2 m_2)/N and writing the products of mean vectors as outer products:
\[
\begin{aligned}
\text{left} &= \sum_{n=1}^{N} x_n x_n^T - N\left( \frac{N_1}{N} m_1 + \frac{N_2}{N} m_2 \right)\left( \frac{N_1}{N} m_1 + \frac{N_2}{N} m_2 \right)^T \\
&= \sum_{n=1}^{N} x_n x_n^T - \frac{N_1^2}{N} m_1 m_1^T - \frac{N_2^2}{N} m_2 m_2^T - \frac{N_1 N_2}{N}\left( m_1 m_2^T + m_2 m_1^T \right) \\
&= \sum_{n=1}^{N} x_n x_n^T - N_1 m_1 m_1^T - N_2 m_2 m_2^T + \frac{N_1 N_2}{N}\left( m_1 m_1^T + m_2 m_2^T - m_1 m_2^T - m_2 m_1^T \right) \\
&= \sum_{n=1}^{N} x_n x_n^T - N_1 m_1 m_1^T - N_2 m_2 m_2^T + \frac{N_1 N_2}{N} S_B \\
&= \sum_{n \in C_1} \left( x_n x_n^T - m_1 x_n^T - x_n m_1^T + m_1 m_1^T \right) + \sum_{n \in C_2} \left( x_n x_n^T - m_2 x_n^T - x_n m_2^T + m_2 m_2^T \right) + \frac{N_1 N_2}{N} S_B \\
&= \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T + \frac{N_1 N_2}{N} S_B \\
&= S_W + \frac{N_1 N_2}{N} S_B
\end{aligned}
\]
Where in the third line we have used N_1^2/N = N_1 − N_1 N_2/N and N_2^2/N = N_2 − N_1 N_2/N (since N_1 + N_2 = N), and in the fifth line we have used \sum_{n \in C_k} x_n = N_k m_k.
Just as required.
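The matrix identity proved above can be confirmed on synthetic data. This is a minimal sketch assuming NumPy; the class sizes, the class offsets and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
D, N1, N2 = 3, 12, 8
X1 = rng.normal(size=(N1, D)) + 1.0    # samples of class C1 (rows)
X2 = rng.normal(size=(N2, D)) - 1.0    # samples of class C2 (rows)
X = np.vstack([X1, X2])
N = N1 + N2

m1, m2, m = X1.mean(axis=0), X2.mean(axis=0), X.mean(axis=0)

# Within-class and between-class scatter matrices.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
S_B = np.outer(m2 - m1, m2 - m1)

lhs = X.T @ X - N * np.outer(m, m)     # sum_n x_n x_n^T - N m m^T
rhs = S_W + (N1 * N2 / N) * S_B
print(np.allclose(lhs, rhs))
```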
This problem is quite simple. We can solve it by definition. We know that the logistic sigmoid function has the form:
\[ \sigma(a) = \frac{1}{1 + \exp(-a)} \]
Hence:
\[
\begin{aligned}
\sigma(a) + \sigma(-a) &= \frac{1}{1 + \exp(-a)} + \frac{1}{1 + \exp(a)} \\
&= \frac{2 + \exp(a) + \exp(-a)}{[1 + \exp(-a)][1 + \exp(a)]} \\
&= \frac{2 + \exp(a) + \exp(-a)}{2 + \exp(a) + \exp(-a)} = 1
\end{aligned}
\]
Next we exchange the dependent and independent variables to obtain its inverse.
\[ a = \frac{1}{1 + \exp(-y)} \]
We first rearrange the equation above, which gives:
\[ \exp(-y) = \frac{1 - a}{a} \]
Then we take the logarithm of both sides, which gives:
\[ y = \ln\left( \frac{a}{1 - a} \right) \]
Just as required.
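Both properties are easy to check with a few lines of code. The sketch below uses only the Python standard library; the function names `sigmoid` and `logit` and the test values are illustrative.

```python
import math

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def logit(p):
    """Inverse of the logistic sigmoid, ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

a = 0.7
symmetry_gap = sigmoid(a) + sigmoid(-a) - 1.0   # should be ~0
round_trip = logit(sigmoid(1.3))                # should recover 1.3
print(symmetry_gap, round_trip)
```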
According to (4.58) and (4.64), we can write:
\[
\begin{aligned}
a &= \ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)} \\
&= \ln p(x|C_1) - \ln p(x|C_2) + \ln\frac{p(C_1)}{p(C_2)} \\
&= -\frac{1}{2}(x - \mu_1)^T\Sigma^{-1}(x - \mu_1) + \frac{1}{2}(x - \mu_2)^T\Sigma^{-1}(x - \mu_2) + \ln\frac{p(C_1)}{p(C_2)} \\
&= (\mu_1 - \mu_2)^T\Sigma^{-1}x - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)} \\
&= w^T x + w_0
\end{aligned}
\]
Where in the second-to-last step, we collect the terms according to their order in x: the quadratic terms cancel because both classes share the same Σ, leaving only a linear and a constant term. We have also defined:
\[ w = \Sigma^{-1}(\mu_1 - \mu_2) \]
And
\[ w_0 = -\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)} \]
Finally, since p(C_1|x) = σ(a) as stated in (4.57), we have p(C_1|x) = σ(w^T x + w_0) just as required.
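We begin by writing down the likelihood function.
\[
\begin{aligned}
p(\{\phi_n, t_n\} | \pi_1, \pi_2, ..., \pi_K) &= \prod_{n=1}^{N}\prod_{k=1}^{K} \left[ p(\phi_n|C_k)p(C_k) \right]^{t_{nk}} \\
&= \prod_{n=1}^{N}\prod_{k=1}^{K} \left[ \pi_k p(\phi_n|C_k) \right]^{t_{nk}}
\end{aligned}
\]
Hence we can obtain the expression for the log likelihood, in which only the first term depends on π_k:
\[ \ln p = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\left[ \ln\pi_k + \ln p(\phi_n|C_k) \right] = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln\pi_k + \text{const} \]
Since there is a constraint on π_k, we need to add a Lagrange multiplier to the expression, which becomes:
\[ L = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln\pi_k + \lambda\left( \sum_{k=1}^{K} \pi_k - 1 \right) \]
We calculate the derivative of the expression above with regard to π_k:
\[ \frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{t_{nk}}{\pi_k} + \lambda \]
And if we set the derivative equal to 0, we can obtain:
\[ \pi_k = -\left( \sum_{n=1}^{N} t_{nk} \right) / \lambda = -\frac{N_k}{\lambda} \quad (\ast) \]
And if we perform summation on both sides with regard to k, we can see that:
\[ 1 = -\left( \sum_{k=1}^{K} N_k \right) / \lambda = -\frac{N}{\lambda} \]
Which gives λ = −N. Substituting it into (∗), we can obtain π_k = N_k / N.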
This time, we focus on the terms which depend on µ_k and Σ in the log likelihood.
\[ \ln p = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\left[ \ln\pi_k + \ln p(\phi_n|C_k) \right] = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln p(\phi_n|C_k) + \text{const} \]
Provided p(ϕ|C_k) = N(ϕ|µ_k, Σ), we can further derive, up to an additive constant:
\[ \ln p = \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\left[ -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}(\phi_n - \mu_k)^T\Sigma^{-1}(\phi_n - \mu_k) \right] \]
We first calculate the derivative of the expression above with regard to µ_k:
\[ \frac{\partial \ln p}{\partial \mu_k} = \sum_{n=1}^{N} t_{nk}\Sigma^{-1}(\phi_n - \mu_k) \]
We set the derivative equal to 0, which gives:
\[ \sum_{n=1}^{N} t_{nk}\Sigma^{-1}\phi_n = \sum_{n=1}^{N} t_{nk}\Sigma^{-1}\mu_k = N_k\Sigma^{-1}\mu_k \]
Therefore, if we multiply both sides by Σ / N_k, we will obtain (4.161). Now let's calculate the derivative of ln p with regard to Σ, which gives:
\[
\begin{aligned}
\frac{\partial \ln p}{\partial \Sigma} &= \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\left( -\frac{1}{2}\Sigma^{-1} \right) - \frac{1}{2}\frac{\partial}{\partial \Sigma}\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}(\phi_n - \mu_k)^T\Sigma^{-1}(\phi_n - \mu_k) \\
&= \sum_{n=1}^{N} -\frac{1}{2}\Sigma^{-1} - \frac{1}{2}\frac{\partial}{\partial \Sigma}\sum_{k=1}^{K} N_k \mathrm{Tr}(\Sigma^{-1} S_k) \\
&= -\frac{N}{2}\Sigma^{-1} + \frac{1}{2}\sum_{k=1}^{K} N_k\Sigma^{-1} S_k \Sigma^{-1}
\end{aligned}
\]
Where we have used \sum_k t_{nk} = 1 and denoted
\[ S_k = \frac{1}{N_k}\sum_{n=1}^{N} t_{nk}(\phi_n - \mu_k)(\phi_n - \mu_k)^T \]
Now we set the derivative equal to 0 and rearrange the equation, which gives:
\[ \Sigma = \sum_{k=1}^{K} \frac{N_k}{N} S_k \]
Based on the definition, we can write down
\[ p(\phi|C_k) = \prod_{m=1}^{M}\prod_{l=1}^{L} \mu_{kml}^{\phi_{ml}} \]
Note that here only one of the values ϕ_{m1}, ϕ_{m2}, ..., ϕ_{mL} is 1 and the others are all 0, because we have used a 1-of-L binary coding scheme. We have also taken advantage of the assumption that the M components of ϕ are independent conditioned on the class C_k. We substitute the expression above into (4.63), which gives:
\[ a_k = \sum_{m=1}^{M}\sum_{l=1}^{L} \phi_{ml}\ln\mu_{kml} + \ln p(C_k) \]
Hence it is obvious that a_k is a linear function of the components of ϕ.
Based on definition, i.e., (4.59), we know that the logistic sigmoid has the form:
\[ \sigma(a) = \frac{1}{1+\exp(-a)} \]
Now, we calculate its derivative with regard to a.
\[ \frac{d\sigma(a)}{da} = \frac{\exp(-a)}{[1+\exp(-a)]^2} = \frac{\exp(-a)}{1+\exp(-a)}\cdot\frac{1}{1+\exp(-a)} = [1-\sigma(a)]\cdot\sigma(a) \]
Just as required.
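The identity dσ/da = σ(a)[1 − σ(a)] is easy to sanity-check numerically. The following is a minimal sketch, not part of the original solution: the function names and the step size `eps` are illustrative assumptions, and the check simply compares the closed form with a central finite difference.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_grad_analytic(a):
    """Closed-form derivative sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


def central_diff(f, a, eps=1e-5):
    """Central finite-difference estimate of f'(a); eps is an assumed step size."""
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)


def max_abs_gap(points):
    """Largest disagreement between the two derivative estimates."""
    return max(
        abs(sigmoid_grad_analytic(a) - central_diff(sigmoid, a)) for a in points
    )


if __name__ == "__main__":
    print(max_abs_gap([-3.0, -1.0, 0.0, 0.5, 2.0]))
```

The gap should be at the level of finite-difference error, far below the size of the derivative itself.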
\[ \begin{aligned}
\nabla E(w) &= -\nabla\sum_{n=1}^{N}\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\} \\
&= -\sum_{n=1}^{N}\frac{d\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\}}{dy_n}\,\frac{dy_n}{da_n}\,\frac{da_n}{dw} \\
&= -\sum_{n=1}^{N}\Big(\frac{t_n}{y_n}-\frac{1-t_n}{1-y_n}\Big)\cdot y_n(1-y_n)\cdot\phi_n \\
&= -\sum_{n=1}^{N}\frac{t_n-y_n}{y_n(1-y_n)}\cdot y_n(1-y_n)\cdot\phi_n \\
&= \sum_{n=1}^{N}(y_n-t_n)\phi_n
\end{aligned} \]
Where we have used $y_n = \sigma(a_n)$, $a_n = w^T\phi_n$, the chain rule and (4.88).
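As a hedged illustration of the result ∇E(w) = Σ (yn − tn) ϕn, the sketch below compares it with a finite-difference gradient of the cross-entropy error. The tiny data set, the weight vector and all function names are made up for the example and do not come from the text.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def error(w, Phi, t):
    """Cross-entropy error E(w) = -sum_n {t_n ln y_n + (1 - t_n) ln(1 - y_n)}."""
    total = 0.0
    for phi, tn in zip(Phi, t):
        y = sigmoid(dot(w, phi))
        total -= tn * math.log(y) + (1.0 - tn) * math.log(1.0 - y)
    return total


def grad_analytic(w, Phi, t):
    """Gradient sum_n (y_n - t_n) phi_n."""
    g = [0.0] * len(w)
    for phi, tn in zip(Phi, t):
        y = sigmoid(dot(w, phi))
        for i, p in enumerate(phi):
            g[i] += (y - tn) * p
    return g


def grad_numeric(w, Phi, t, eps=1e-6):
    """Central finite differences, one weight at a time."""
    g = []
    for i in range(len(w)):
        up, dn = list(w), list(w)
        up[i] += eps
        dn[i] -= eps
        g.append((error(up, Phi, t) - error(dn, Phi, t)) / (2.0 * eps))
    return g


def max_gap(w, Phi, t):
    ga, gn = grad_analytic(w, Phi, t), grad_numeric(w, Phi, t)
    return max(abs(a - b) for a, b in zip(ga, gn))


# Illustrative data: a bias feature plus one input feature.
PHI = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.1]]
T = [1.0, 0.0, 1.0, 0.0]

if __name__ == "__main__":
    print(max_gap([0.2, -0.4], PHI, T))
```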
According to the definition, if a dataset is linearly separable, we can find a vector w such that wT ϕ(xn ) > 0 for the points xn of one class and wT ϕ(xm ) < 0 for the points xm of the other class. The boundary is then given by wT ϕ(x) = 0. Note that for any point x0 in the dataset, the value of wT ϕ(x0 ) is either positive or negative, but it cannot equal 0.
Therefore, the maximum likelihood solution for logistic regression is degenerate. We suppose that for those points xn belonging to class C 1 we have wT ϕ(xn ) > 0, and that wT ϕ(xm ) < 0 for those belonging to class C 2 . According to (4.87), if |w| → ∞, we have
p(C 1 |ϕ(xn )) = σ(wT ϕ(xn )) → 1
Where we have used wT ϕ(xn ) → +∞. And since wT ϕ(xm ) → −∞, we can also obtain:
p(C 2 |ϕ(xm )) = 1 − p(C 1 |ϕ(xm )) = 1 − σ(wT ϕ(xm )) → 1
In other words, for the likelihood function, i.e., (4.89), if we have |w| → ∞ and we label all the points lying on one side of the boundary as class C 1 and those on the other side as class C 2 , then every term in (4.89) approaches its maximum value, i.e., 1, which maximizes the likelihood.
Hence, for a linearly separable dataset, the learning process will drive |w| → ∞ and use the linear boundary to label the dataset, which can cause a severe over-fitting problem.
Since yn is the output of the logistic sigmoid function, we know that 0 < yn < 1 and hence yn (1 − yn ) > 0. Then we use (4.97). For an arbitrary non-zero real vector a ≠ 0, we have:
\[ \begin{aligned}
a^T H a &= a^T\Big[\sum_{n=1}^{N} y_n(1-y_n)\phi_n\phi_n^T\Big]a \\
&= \sum_{n=1}^{N} y_n(1-y_n)\,(\phi_n^T a)^T(\phi_n^T a) \\
&= \sum_{n=1}^{N} y_n(1-y_n)\,b_n^2
\end{aligned} \]
Where we have denoted $b_n = \phi_n^T a$. At least one of { b 1 , b 2 , ..., b N } must be non-zero, so the expression above is larger than 0 and hence H is positive definite.
To see why, suppose all the b n = 0. Then a = [a 1 , a 2 , ..., a M ]T lies in the null space of the N × M matrix Φ. However, by the rank-nullity theorem, Rank(Φ) + Nullity(Φ) = M, and we have already assumed that the M features are independent, i.e., Rank(Φ) = M , which means the null space contains only 0. This contradicts a ≠ 0.
We still denote yn = p( t = 1|ϕn ), and then we can write down the log likelihood by replacing t n with πn in (4.89) and (4.90).
\[ \ln p(\mathbf{t}|w) = \sum_{n=1}^{N}\{\pi_n\ln y_n + (1-\pi_n)\ln(1-y_n)\} \]
We should discuss two situations separately, namely j = k and j ≠ k. When j ≠ k, we have:
\[ \frac{\partial y_k}{\partial a_j} = \frac{-\exp(a_k)\cdot\exp(a_j)}{\big[\sum_i\exp(a_i)\big]^2} = -y_k\cdot y_j \]
And when j = k, we have:
\[ \frac{\partial y_k}{\partial a_k} = \frac{\exp(a_k)\sum_i\exp(a_i) - \exp(a_k)\exp(a_k)}{\big[\sum_i\exp(a_i)\big]^2} = y_k - y_k^2 = y_k(1-y_k) \]
Both cases can be written as:
\[ \frac{\partial y_k}{\partial a_j} = y_k(I_{kj}-y_j) \]
Where $I_{kj}$ are the elements of the identity matrix.
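The softmax Jacobian ∂yk/∂aj = yk(Ikj − yj) can be checked entry by entry against finite differences. This is a small sketch under assumed names (`softmax`, `jacobian_analytic`, `jacobian_numeric`); the max-subtraction inside `softmax` is a standard numerical-stability trick that does not change the result.

```python
import math


def softmax(a):
    """Softmax y_k = exp(a_k) / sum_i exp(a_i), shifted by max(a) for stability."""
    m = max(a)
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]


def jacobian_analytic(a):
    """J[k][j] = y_k * (I_kj - y_j)."""
    y = softmax(a)
    K = len(a)
    return [
        [y[k] * ((1.0 if k == j else 0.0) - y[j]) for j in range(K)]
        for k in range(K)
    ]


def jacobian_numeric(a, eps=1e-6):
    """Central finite differences with respect to each a_j."""
    K = len(a)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        up, dn = list(a), list(a)
        up[j] += eps
        dn[j] -= eps
        yu, yd = softmax(up), softmax(dn)
        for k in range(K):
            J[k][j] = (yu[k] - yd[k]) / (2.0 * eps)
    return J


def max_gap(a):
    Ja, Jn = jacobian_analytic(a), jacobian_numeric(a)
    K = len(a)
    return max(abs(Ja[k][j] - Jn[k][j]) for k in range(K) for j in range(K))


if __name__ == "__main__":
    print(max_gap([0.3, -1.2, 2.0]))
```

Because the outputs sum to one, each column of the Jacobian sums to zero, which gives a second quick consistency check.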
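We differentiate every term $t_{nk}\ln y_{nk}$ with regard to $w_j$.
\[ \begin{aligned}
\frac{\partial\, t_{nk}\ln y_{nk}}{\partial w_j} &= \frac{\partial\, t_{nk}\ln y_{nk}}{\partial y_{nk}}\,\frac{\partial y_{nk}}{\partial a_j}\,\frac{\partial a_j}{\partial w_j} \\
&= t_{nk}\,\frac{1}{y_{nk}}\cdot y_{nk}(I_{kj}-y_{nj})\cdot\phi_n \\
&= t_{nk}(I_{kj}-y_{nj})\,\phi_n
\end{aligned} \]
Where we have used (4.105) and (4.106). Next we perform summation over n and k.
\[ \begin{aligned}
\nabla_{w_j}E &= -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}(I_{kj}-y_{nj})\,\phi_n \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,y_{nj}\,\phi_n - \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}I_{kj}\,\phi_n \\
&= \sum_{n=1}^{N}\Big[\Big(\sum_{k=1}^{K} t_{nk}\Big)y_{nj}\,\phi_n\Big] - \sum_{n=1}^{N} t_{nj}\,\phi_n \\
&= \sum_{n=1}^{N} y_{nj}\,\phi_n - \sum_{n=1}^{N} t_{nj}\,\phi_n \\
&= \sum_{n=1}^{N}(y_{nj}-t_{nj})\,\phi_n
\end{aligned} \]
Where we have used the fact that for arbitrary n, we have $\sum_{k=1}^{K} t_{nk} = 1$.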
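We write down the log likelihood.
\[ \ln p(\mathbf{t}|w) = \sum_{n=1}^{N}\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\} \]
Therefore,
\[ \begin{aligned}
\nabla_w\ln p &= \sum_{n=1}^{N}\frac{\partial\ln p}{\partial y_n}\cdot\frac{\partial y_n}{\partial a_n}\cdot\frac{\partial a_n}{\partial w} \\
&= \sum_{n=1}^{N}\Big(\frac{t_n}{y_n}-\frac{1-t_n}{1-y_n}\Big)\Phi'(a_n)\,\phi_n \\
&= \sum_{n=1}^{N}\frac{t_n-y_n}{y_n(1-y_n)}\,\Phi'(a_n)\,\phi_n
\end{aligned} \]
Where we have used y = p( t = 1|a) = Φ(a) and $a_n = w^T\phi_n$. According to (4.114), we can obtain:
\[ \Phi'(a) = N(\theta|0,1)\big|_{\theta=a} = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}a^2\Big) \]
Hence, we can obtain:
\[ \nabla_w\ln p = \sum_{n=1}^{N}\frac{t_n-y_n}{y_n(1-y_n)}\,\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\,\phi_n \]
To calculate the Hessian Matrix, we need to first evaluate several derivatives.
\[ \begin{aligned}
\frac{\partial}{\partial w}\Big\{\frac{t_n-y_n}{y_n(1-y_n)}\Big\} &= \frac{\partial}{\partial y_n}\Big\{\frac{t_n-y_n}{y_n(1-y_n)}\Big\}\cdot\frac{\partial y_n}{\partial a_n}\cdot\frac{\partial a_n}{\partial w} \\
&= \frac{-y_n(1-y_n)-(t_n-y_n)(1-2y_n)}{[y_n(1-y_n)]^2}\,\Phi'(a_n)\,\phi_n \\
&= -\frac{y_n^2+t_n-2y_nt_n}{y_n^2(1-y_n)^2}\,\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\,\phi_n
\end{aligned} \]
And
\[ \frac{\partial}{\partial w}\Big\{\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\Big\} = \frac{\partial}{\partial a_n}\Big\{\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\Big\}\frac{\partial a_n}{\partial w} = -\frac{a_n}{\sqrt{2\pi}}\exp\Big(-\frac{a_n^2}{2}\Big)\phi_n \]
Therefore, using the product rule, we can obtain:
\[ \begin{aligned}
\frac{\partial}{\partial w}\Big\{\frac{t_n-y_n}{y_n(1-y_n)}\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\Big\} &= \frac{\partial}{\partial w}\Big\{\frac{t_n-y_n}{y_n(1-y_n)}\Big\}\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}} + \frac{t_n-y_n}{y_n(1-y_n)}\frac{\partial}{\partial w}\Big\{\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\Big\} \\
&= -\Big[\frac{y_n^2+t_n-2y_nt_n}{y_n(1-y_n)}\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}} - a_n(y_n-t_n)\Big]\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}\,y_n(1-y_n)}\,\phi_n
\end{aligned} \]
Finally, if we perform summation over n, we can obtain the Hessian Matrix:
\[ \begin{aligned}
H = \nabla\nabla_w\ln p &= \sum_{n=1}^{N}\frac{\partial}{\partial w}\Big\{\frac{t_n-y_n}{y_n(1-y_n)}\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}}\Big\}\,\phi_n^T \\
&= -\sum_{n=1}^{N}\Big[\frac{y_n^2+t_n-2y_nt_n}{y_n(1-y_n)}\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}} - a_n(y_n-t_n)\Big]\frac{\exp(-a_n^2/2)}{\sqrt{2\pi}\,y_n(1-y_n)}\,\phi_n\phi_n^T
\end{aligned} \]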
Problem 4.20 Solution(waiting for update)
We know that the Hessian Matrix is of size MK × MK , and the ( j, k)th block with size M × M is given by (4.110), where j, k = 1, 2, ..., K . Therefore, we can obtain:
\[ u^T H u = \sum_{j=1}^{K}\sum_{k=1}^{K} u_j^T H_{j,k}\, u_k \qquad (*) \]
Where we use $u_k$ to denote the kth block vector of u with size M × 1, and $H_{j,k}$ to denote the ( j, k)th block matrix of H with size M × M . Then based on (4.110), we further expand (∗):
\[ \begin{aligned}
(*) &= \sum_{j=1}^{K}\sum_{k=1}^{K} u_j^T\Big\{-\sum_{n=1}^{N} y_{nk}(I_{kj}-y_{nj})\,\phi_n\phi_n^T\Big\}u_k \\
&= \sum_{j=1}^{K}\sum_{k=1}^{K} u_j^T\Big\{-\sum_{n=1}^{N} y_{nk}I_{kj}\,\phi_n\phi_n^T\Big\}u_k + \sum_{j=1}^{K}\sum_{k=1}^{K} u_j^T\Big\{\sum_{n=1}^{N} y_{nk}y_{nj}\,\phi_n\phi_n^T\Big\}u_k \\
&= \sum_{k=1}^{K} u_k^T\Big\{-\sum_{n=1}^{N} y_{nk}\,\phi_n\phi_n^T\Big\}u_k + \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N} y_{nj}\,u_j^T\{\phi_n\phi_n^T\}\,y_{nk}\,u_k
\end{aligned} \]
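It is quite obvious.
\[ \begin{aligned}
\Phi(a) &= \int_{-\infty}^{a} N(\theta|0,1)\,d\theta \\
&= \frac{1}{2} + \int_{0}^{a} N(\theta|0,1)\,d\theta \\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\int_{0}^{a}\exp(-\theta^2/2)\,d\theta \\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\,\frac{\sqrt{\pi}}{2}\int_{0}^{a}\frac{2}{\sqrt{\pi}}\exp(-\theta^2/2)\,d\theta \\
&= \frac{1}{2}\Big(1 + \frac{1}{\sqrt{2}}\int_{0}^{a}\frac{2}{\sqrt{\pi}}\exp(-\theta^2/2)\,d\theta\Big) \\
&= \frac{1}{2}\Big\{1 + \frac{1}{\sqrt{2}}\,\mathrm{erf}(a)\Big\}
\end{aligned} \]
Where we have used
\[ \int_{-\infty}^{0} N(\theta|0,1)\,d\theta = \frac{1}{2} \]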
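The relation between Φ(a) and erf can be checked numerically. One caveat: the derivation uses an erf whose integrand is exp(−θ²/2), whereas Python's `math.erf` uses the standard integrand exp(−u²). The sketch below assumes the conversion erf_book(a) = √2 · math.erf(a/√2), obtained from the substitution θ = √2 u; the function names and the quadrature settings are illustrative choices.

```python
import math


def erf_book(a):
    """erf with integrand exp(-theta^2 / 2), expressed through the standard
    math.erf via theta = sqrt(2) * u (an assumption made for this sketch)."""
    return math.sqrt(2.0) * math.erf(a / math.sqrt(2.0))


def probit_via_erf(a):
    """Phi(a) = 1/2 * (1 + erf_book(a) / sqrt(2))."""
    return 0.5 * (1.0 + erf_book(a) / math.sqrt(2.0))


def probit_by_quadrature(a, lower=-10.0, steps=20000):
    """Trapezoid-rule integral of N(theta | 0, 1) from `lower` to a.
    The mass below `lower` is negligible for lower = -10."""
    def pdf(x):
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    h = (a - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(a))
    for i in range(1, steps):
        total += pdf(lower + i * h)
    return total * h


if __name__ == "__main__":
    for a in (-1.0, 0.0, 0.7, 2.0):
        print(a, probit_via_erf(a), probit_by_quadrature(a))
```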
If we denote f (θ ) = p(D |θ ) p(θ ), we can write:
\[ \begin{aligned}
p(D) &= \int p(D|\theta)\,p(\theta)\,d\theta = \int f(\theta)\,d\theta \\
&= f(\theta_{MAP})\,\frac{(2\pi)^{M/2}}{|A|^{1/2}} \\
&= p(D|\theta_{MAP})\,p(\theta_{MAP})\,\frac{(2\pi)^{M/2}}{|A|^{1/2}}
\end{aligned} \]
Where $\theta_{MAP}$ is the value of θ at the mode of f (θ ), A is the Hessian Matrix of − ln f (θ ) and we have also used (4.135). Therefore,
\[ \ln p(D) = \ln p(D|\theta_{MAP}) + \ln p(\theta_{MAP}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|A| \]
Just as required.
According to (4.137), we can write:
\[ \begin{aligned}
\ln p(D) &= \ln p(D|\theta_{MAP}) + \ln p(\theta_{MAP}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|A| \\
&= \ln p(D|\theta_{MAP}) - \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|V_0| - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|A| \\
&= \ln p(D|\theta_{MAP}) - \frac{1}{2}\ln|V_0| - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) - \frac{1}{2}\ln|A|
\end{aligned} \]
Where we have used the definition of the multivariate Gaussian Distribution. Then, from (4.138), we can write:
\[ \begin{aligned}
A &= -\nabla\nabla\ln p(D|\theta_{MAP})\,p(\theta_{MAP}) \\
&= -\nabla\nabla\ln p(D|\theta_{MAP}) - \nabla\nabla\ln p(\theta_{MAP}) \\
&= H - \nabla\nabla\Big\{-\frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m)\Big\} \\
&= H + \nabla\big\{V_0^{-1}(\theta_{MAP}-m)\big\} \\
&= H + V_0^{-1}
\end{aligned} \]
Where we have denoted $H = -\nabla\nabla\ln p(D|\theta_{MAP})$. Therefore, the equation above becomes:
\[ \begin{aligned}
\ln p(D) &= \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) - \frac{1}{2}\ln\big\{|V_0|\cdot|H+V_0^{-1}|\big\} \\
&= \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) - \frac{1}{2}\ln|V_0H+I| \\
&\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) - \frac{1}{2}\ln|V_0| - \frac{1}{2}\ln|H| \\
&\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) - \frac{1}{2}\ln|H| + \text{const}
\end{aligned} \]
Where we have used the property of the determinant |A|·|B| = |AB|, and the fact that the prior is broad, i.e., I can be neglected with regard to $V_0H$. What's more, since the prior is given in advance, we can view $V_0$ as constant. And if the number of data points is large, we can write:
\[ H = \sum_{n=1}^{N} H_n = N\widehat{H} \]
Where $\widehat{H} = \frac{1}{N}\sum_{n=1}^{N} H_n$, and then
\[ \begin{aligned}
\ln p(D) &\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) - \frac{1}{2}\ln|H| + \text{const} \\
&\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) - \frac{1}{2}\ln|N\widehat{H}| + \text{const} \\
&\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-m)^T V_0^{-1}(\theta_{MAP}-m) - \frac{M}{2}\ln N - \frac{1}{2}\ln|\widehat{H}| + \text{const} \\
&\approx \ln p(D|\theta_{MAP}) - \frac{M}{2}\ln N
\end{aligned} \]
This is because when N >> 1, the other terms can be neglected.
Problem 4.24 Solution(Waiting for updating)
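We first need to obtain the expression for the first derivative of the probit function Φ(λa) with regard to a. According to (4.114), we can write down:
\[ \frac{d}{da}\Phi(\lambda a) = \frac{d\Phi(\lambda a)}{d(\lambda a)}\cdot\frac{d\lambda a}{da} = \frac{\lambda}{\sqrt{2\pi}}\exp\Big\{-\frac{1}{2}(\lambda a)^2\Big\} \]
Which further gives:
\[ \frac{d}{da}\Phi(\lambda a)\Big|_{a=0} = \frac{\lambda}{\sqrt{2\pi}} \]
And for the logistic sigmoid function, according to (4.88), we have
\[ \frac{d\sigma}{da}\Big|_{a=0} = \sigma(0)\,(1-\sigma(0)) = 0.5\times 0.5 = \frac{1}{4} \]
Where we have used σ(0) = 0.5. Setting their derivatives at the origin equal, we have:
\[ \frac{\lambda}{\sqrt{2\pi}} = \frac{1}{4} \]
i.e., $\lambda = \sqrt{2\pi}/4$. And hence $\lambda^2 = \pi/8$.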
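To illustrate what the choice λ² = π/8 buys, the sketch below checks that Φ(λa) has slope 1/4 at the origin and measures how far Φ(λa) strays from σ(a) on a grid. It is only an illustration: the grid range, the step size and the function names are assumptions, and the probit is evaluated through the standard `math.erf`.

```python
import math

LAMBDA = math.sqrt(math.pi / 8.0)


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def probit(a):
    """Standard normal CDF, written with the standard library erf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def scaled_probit_slope_at_zero(eps=1e-5):
    """Central-difference slope of a -> probit(LAMBDA * a) at a = 0."""
    return (probit(LAMBDA * eps) - probit(-LAMBDA * eps)) / (2.0 * eps)


def max_gap(lo=-8.0, hi=8.0, steps=1600):
    """Largest |sigmoid(a) - probit(LAMBDA * a)| over an evenly spaced grid."""
    h = (hi - lo) / steps
    return max(
        abs(sigmoid(lo + i * h) - probit(LAMBDA * (lo + i * h)))
        for i in range(steps + 1)
    )


if __name__ == "__main__":
    print(scaled_probit_slope_at_zero(), max_gap())
```

Matching the slopes at the origin keeps the two curves within about 0.02 of each other over the whole real line, which is why the scaled probit is a convenient stand-in for the sigmoid.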
We will prove (4.152) in a simpler and more intuitive way. First, we need to prove a simple yet useful statement: suppose a random variable follows a normal distribution, X ∼ N ( X |µ, σ2 ). Then the probability of X ≤ x is P ( X ≤ x) = Φ((x − µ)/σ), where x is a given real number. We can see this by writing down the integral:
\[ \begin{aligned}
P(X\le x) &= \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big[-\frac{1}{2\sigma^2}(X-\mu)^2\Big]dX \\
&= \int_{-\infty}^{\frac{x-\mu}{\sigma}}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2}\gamma^2\Big)\,\sigma\,d\gamma \\
&= \int_{-\infty}^{\frac{x-\mu}{\sigma}}\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}\gamma^2\Big)\,d\gamma \\
&= \Phi\Big(\frac{x-\mu}{\sigma}\Big)
\end{aligned} \]
Where we have changed the variable X = µ + σγ. Now consider two independent random variables X ∼ N (0, λ−2 ) and Y ∼ N (µ, σ2 ). We first calculate the conditional probability P ( X ≤ Y | Y = a):
\[ P(X\le Y\,|\,Y=a) = P(X\le a) = \Phi\Big(\frac{a-0}{\lambda^{-1}}\Big) = \Phi(\lambda a) \]
Together with the law of total probability, we can obtain:
\[ P(X\le Y) = \int_{-\infty}^{+\infty}P(X\le Y\,|\,Y=a)\,\mathrm{pdf}(Y=a)\,da = \int_{-\infty}^{+\infty}\Phi(\lambda a)\,N(a|\mu,\sigma^2)\,da \]
Where pdf(·) denotes the probability density function and we have also used pdf(Y ) = N (µ, σ2 ). What's more, since X and Y are independent Gaussians, X − Y also follows a normal distribution, with:
\[ E[X-Y] = E[X]-E[Y] = 0-\mu = -\mu \]
And
\[ \mathrm{var}[X-Y] = \mathrm{var}[X]+\mathrm{var}[Y] = \lambda^{-2}+\sigma^2 \]
Therefore, X − Y ∼ N (−µ, λ−2 + σ2 ) and it follows that:
\[ P(X-Y\le 0) = \Phi\Big(\frac{0-(-\mu)}{\sqrt{\lambda^{-2}+\sigma^2}}\Big) = \Phi\Big(\frac{\mu}{\sqrt{\lambda^{-2}+\sigma^2}}\Big) \]
Since P ( X ≤ Y ) = P ( X − Y ≤ 0), we obtain what has been required.
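The resulting identity, ∫ Φ(λa) N(a|µ, σ²) da = Φ(µ / √(λ⁻² + σ²)), can also be verified by direct numerical integration. The following is a rough sketch: the trapezoid rule, the integration window of ±10σ and the test values of λ, µ and σ are all arbitrary choices made for the example.

```python
import math


def probit(a):
    """Standard normal CDF via the standard library erf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def normal_pdf(a, mu, sigma):
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def lhs(lam, mu, sigma, width=10.0, steps=20000):
    """Trapezoid estimate of the integral of probit(lam * a) * N(a | mu, sigma^2)
    over [mu - width * sigma, mu + width * sigma]."""
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / steps

    def f(a):
        return probit(lam * a) * normal_pdf(a, mu, sigma)

    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


def rhs(lam, mu, sigma):
    """Closed form probit(mu / sqrt(lam^-2 + sigma^2))."""
    return probit(mu / math.sqrt(lam ** -2 + sigma ** 2))


if __name__ == "__main__":
    print(lhs(0.8, 0.5, 1.3), rhs(0.8, 0.5, 1.3))
```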
0.5 Neural Networks

Based on definition of tanh(·), we can obtain:
\[ \begin{aligned}
\tanh(a) &= \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} \\
&= -1 + \frac{2e^{a}}{e^{a}+e^{-a}} \\
&= -1 + 2\,\frac{1}{1+e^{-2a}} \\
&= 2\sigma(2a) - 1
\end{aligned} \]
Suppose we have parameters $w_{ji}^{(1s)}$, $w_{j0}^{(1s)}$ and $w_{kj}^{(2s)}$, $w_{k0}^{(2s)}$ for a network whose hidden units use the logistic sigmoid function as activation, and $w_{ji}^{(1t)}$, $w_{j0}^{(1t)}$ and $w_{kj}^{(2t)}$, $w_{k0}^{(2t)}$ for another one using tanh(·). For the network using tanh(·) as activation, we can write down the following expression by using (5.4):
\[ \begin{aligned}
a_k^{(t)} &= \sum_{j=1}^{M} w_{kj}^{(2t)}\tanh(a_j^{(t)}) + w_{k0}^{(2t)} \\
&= \sum_{j=1}^{M} w_{kj}^{(2t)}\big[2\sigma(2a_j^{(t)})-1\big] + w_{k0}^{(2t)} \\
&= \sum_{j=1}^{M} 2w_{kj}^{(2t)}\,\sigma(2a_j^{(t)}) + \Big[-\sum_{j=1}^{M} w_{kj}^{(2t)} + w_{k0}^{(2t)}\Big]
\end{aligned} \]
What's more, we also have:
\[ a_k^{(s)} = \sum_{j=1}^{M} w_{kj}^{(2s)}\,\sigma(a_j^{(s)}) + w_{k0}^{(2s)} \]
To make the two networks equivalent, i.e., $a_k^{(s)} = a_k^{(t)}$, we should make sure that:
\[ \begin{cases}
a_j^{(s)} = 2a_j^{(t)} \\
w_{kj}^{(2s)} = 2w_{kj}^{(2t)} \\
w_{k0}^{(2s)} = -\sum_{j=1}^{M} w_{kj}^{(2t)} + w_{k0}^{(2t)}
\end{cases} \]
Note that the first condition can be achieved by simply enforcing:
\[ w_{ji}^{(1s)} = 2w_{ji}^{(1t)}, \quad\text{and}\quad w_{j0}^{(1s)} = 2w_{j0}^{(1t)} \]
Therefore, these two networks are equivalent under a linear transformation of their parameters.
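The parameter mapping between the tanh network and the sigmoid network can be exercised on a concrete example. The sketch below is illustrative only: it assumes a two-layer network with linear output pre-activations, and the weight values, the input and the helper names are invented for the test.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W1, b1, W2, b2, act):
    """Output pre-activations a_k of a two-layer network with hidden activation `act`."""
    z = [act(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * zj for w, zj in zip(row, z)) + b for row, b in zip(W2, b2)]


def sigmoid_params_from_tanh(W1, b1, W2, b2):
    """Map tanh-network parameters to sigmoid-network parameters:
    first layer doubled, second-layer weights doubled, output bias shifted."""
    W1s = [[2.0 * w for w in row] for row in W1]
    b1s = [2.0 * b for b in b1]
    W2s = [[2.0 * w for w in row] for row in W2]
    b2s = [b - sum(row) for row, b in zip(W2, b2)]
    return W1s, b1s, W2s, b2s


def max_output_gap(x, W1, b1, W2, b2):
    y_t = forward(x, W1, b1, W2, b2, math.tanh)
    W1s, b1s, W2s, b2s = sigmoid_params_from_tanh(W1, b1, W2, b2)
    y_s = forward(x, W1s, b1s, W2s, b2s, sigmoid)
    return max(abs(p - q) for p, q in zip(y_t, y_s))


# Illustrative parameters: 2 inputs, 3 hidden units, 2 outputs.
W1 = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.2]]
B1 = [0.1, -0.2, 0.4]
W2 = [[1.0, -2.0, 0.5], [0.3, 0.8, -1.1]]
B2 = [0.05, -0.3]

if __name__ == "__main__":
    print(max_output_gap([0.7, -1.3], W1, B1, W2, B2))
```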

It is obvious. We write down the likelihood.
\[ p(T|X,w) = \prod_{n=1}^{N} N(t_n\,|\,y(x_n,w),\beta^{-1}I) \]
Taking the negative logarithm, we can obtain:
\[ E(w,\beta) = -\ln p(T|X,w) = \frac{\beta}{2}\sum_{n=1}^{N}\big[(y(x_n,w)-t_n)^T(y(x_n,w)-t_n)\big] - \frac{NK}{2}\ln\beta + \text{const} \]
Here we have used const to denote the term independent of both w and β, and we have used the definition of the multivariate Gaussian Distribution. What's more, we see that the covariance matrix β−1 I and the weight parameter w have decoupled, which is distinct from the next problem. We can first solve for wML by minimizing the first term on the right of the equation above, or equivalently (5.11), i.e., imagining β is fixed. Then, by setting the derivative of E (w, β) with regard to β to zero, we can obtain (5.17) and hence β ML .
Following the process in the previous question, we first write down the negative logarithm of the likelihood function.
\[ E(w,\Sigma) = \frac{1}{2}\sum_{n=1}^{N}\Big\{[y(x_n,w)-t_n]^T\Sigma^{-1}[y(x_n,w)-t_n]\Big\} + \frac{N}{2}\ln|\Sigma| + \text{const} \qquad (*) \]
Note here we have assumed Σ is unknown and const denotes the term independent of both w and Σ. In the first situation, if Σ is fixed and known, the equation above will reduce to:
\[ E(w) = \frac{1}{2}\sum_{n=1}^{N}\Big\{[y(x_n,w)-t_n]^T\Sigma^{-1}[y(x_n,w)-t_n]\Big\} + \text{const} \]
We can simply solve for wML by minimizing it. If Σ is unknown, since Σ is in the first term on the right of (∗), solving for wML will involve Σ. Note that in the previous problem, the main reason that they can decouple is the independence assumption, i.e., Σ reduces to β−1 I, so that we can bring β to the front and view it as a fixed multiplying factor when solving for wML .
Based on (5.20), the conditional distribution of targets, taking mislabeling into account, given input x and weight w is:
\[ p(t=1|x,w) = (1-\epsilon)\cdot p(t_r=1|x,w) + \epsilon\cdot p(t_r=0|x,w) \]
Note that here we use t to denote the observed target label and $t_r$ to denote its real label, and that our network aims to predict the real label $t_r$, not t, i.e., $p(t_r=1|x,w) = y(x,w)$. Hence we see that:
\[ p(t=1|x,w) = (1-\epsilon)\cdot y(x,w) + \epsilon\cdot\big[1-y(x,w)\big] \qquad (*) \]
The same holds for p( t = 0|x, w):
\[ p(t=0|x,w) = (1-\epsilon)\cdot\big[1-y(x,w)\big] + \epsilon\cdot y(x,w) \qquad (**) \]
Combining (∗) and (∗∗), we can obtain:
\[ p(t|x,w) = (1-\epsilon)\cdot y^{t}(1-y)^{1-t} + \epsilon\cdot(1-y)^{t}y^{1-t} \]
Where y is short for y(x, w). Therefore, taking the negative logarithm, we can obtain the error function:
\[ E(w) = -\sum_{n=1}^{N}\ln\Big\{(1-\epsilon)\cdot y_n^{t_n}(1-y_n)^{1-t_n} + \epsilon\cdot(1-y_n)^{t_n}y_n^{1-t_n}\Big\} \]
When ϵ = 0, it is obvious that the equation above will reduce to (5.21).
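It is obvious by using (5.22).
\[ \begin{aligned}
E(w) &= -\ln\prod_{n=1}^{N} p(t_n|x_n,w) \\
&= -\ln\prod_{n=1}^{N}\prod_{k=1}^{K} y_k(x_n,w)^{t_{nk}}\big[1-y_k(x_n,w)\big]^{1-t_{nk}} \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\ln\Big\{y_k(x_n,w)^{t_{nk}}\big[1-y_k(x_n,w)\big]^{1-t_{nk}}\Big\} \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\ln\big[y_{nk}^{t_{nk}}(1-y_{nk})^{1-t_{nk}}\big] \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\big\{t_{nk}\ln y_{nk} + (1-t_{nk})\ln(1-y_{nk})\big\}
\end{aligned} \]
Where we have denoted $y_{nk} = y_k(x_n,w)$.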
We know that $y_k = \sigma(a_k)$, where σ(·) represents the logistic sigmoid function. Moreover,
\[ \frac{d\sigma}{da} = \sigma(1-\sigma) \]
Therefore,
\[ \begin{aligned}
\frac{dE(w)}{da_k} &= -t_k\frac{1}{y_k}\big[y_k(1-y_k)\big] + (1-t_k)\frac{1}{1-y_k}\big[y_k(1-y_k)\big] \\
&= \big[y_k(1-y_k)\big]\Big[\frac{1-t_k}{1-y_k}-\frac{t_k}{y_k}\Big] \\
&= (1-t_k)\,y_k - t_k(1-y_k) \\
&= y_k - t_k
\end{aligned} \]
Just as required.
It is similar to the previous problem. First we denote $y_{kn} = y_k(x_n,w)$. If we use the softmax function as activation for the output unit, according to (4.106), we have:
\[ \frac{dy_{kn}}{da_j} = y_{kn}(I_{kj}-y_{jn}) \]
Therefore,
\[ \begin{aligned}
\frac{dE(w)}{da_j} &= \frac{d}{da_j}\Big\{-\sum_{n=1}^{N}\sum_{k=1}^{K} t_{kn}\ln y_k(x_n,w)\Big\} \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{d}{da_j}\big\{t_{kn}\ln y_{kn}\big\} \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{kn}\frac{1}{y_{kn}}\big[y_{kn}(I_{kj}-y_{jn})\big] \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}(t_{kn}I_{kj}-t_{kn}y_{jn}) \\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{kn}I_{kj} + \sum_{n=1}^{N}\sum_{k=1}^{K} t_{kn}y_{jn} \\
&= -\sum_{n=1}^{N} t_{jn} + \sum_{n=1}^{N} y_{jn} \\
&= \sum_{n=1}^{N}(y_{jn}-t_{jn})
\end{aligned} \]
Where we have used the fact that $I_{kj} = 1$ only when k = j (and is 0 otherwise), and that $\sum_{k=1}^{K} t_{kn} = 1$.
It is obvious based on the definition of tanh, i.e., (5.59).
\[ \begin{aligned}
\frac{d}{da}\tanh(a) &= \frac{(e^{a}+e^{-a})(e^{a}+e^{-a}) - (e^{a}-e^{-a})(e^{a}-e^{-a})}{(e^{a}+e^{-a})^2} \\
&= 1 - \frac{(e^{a}-e^{-a})^2}{(e^{a}+e^{-a})^2} \\
&= 1 - \tanh(a)^2
\end{aligned} \]
We know that the logistic sigmoid function σ(a) ∈ [0, 1]. Therefore, if we perform a linear transformation h(a) = 2σ(a) − 1, we obtain a mapping h(a) from (−∞, +∞) to [−1, 1]. In this case, the conditional distribution of targets given inputs can be similarly written as:
\[ p(t|x,w) = \Big[\frac{1+y(x,w)}{2}\Big]^{(1+t)/2}\Big[\frac{1-y(x,w)}{2}\Big]^{(1-t)/2} \]
Where $[1+y(x,w)]/2$ represents the conditional probability p(C 1 | x). Since now y(x, w) ∈ [−1, 1], we need this linear transformation to make it satisfy the constraint for a probability. Then we can further obtain:
\[ \begin{aligned}
E(w) &= -\sum_{n=1}^{N}\Big\{\frac{1+t_n}{2}\ln\frac{1+y_n}{2} + \frac{1-t_n}{2}\ln\frac{1-y_n}{2}\Big\} \\
&= -\frac{1}{2}\sum_{n=1}^{N}\big\{(1+t_n)\ln(1+y_n) + (1-t_n)\ln(1-y_n)\big\} + N\ln 2
\end{aligned} \]
It is obvious. Suppose H is positive definite, i.e., (5.37) holds. We set v equal to an eigenvector of H, i.e., $v = u_i$, which gives:
\[ v^T H v = v^T(Hv) = u_i^T\lambda_i u_i = \lambda_i\|u_i\|^2 \]
Therefore, every $\lambda_i$ should be positive. On the other hand, if all the eigenvalues $\lambda_i$ are positive, from (5.38) and (5.39), we see that H is positive definite.
It is obvious. We follow (5.35) and then write the error function in the form of (5.36). To obtain the contour, we set E (w) equal to a constant C.
\[ E(w) = E(w^*) + \frac{1}{2}\sum_i\lambda_i\alpha_i^2 = C \]
We rearrange the equation above, and then obtain:
\[ \sum_i\lambda_i\alpha_i^2 = B \]
Where B = 2C − 2E (w∗ ) is a constant. Therefore, the contours of constant error are ellipses whose axes are aligned with the eigenvectors $u_i$ of the Hessian Matrix H. The length of the jth axis is given by setting all $\alpha_i = 0$ for i ≠ j:
\[ \alpha_j = \sqrt{\frac{B}{\lambda_j}} \]
In other words, the length is inversely proportional to the square root of the corresponding eigenvalue $\lambda_j$.
If H is positive definite, we know the second term on the right side of (5.32) will be positive for arbitrary w ≠ w∗. Therefore, E (w∗ ) is a local minimum. On the other hand, if w∗ is a local minimum, then for w ≠ w∗ in its neighborhood we have
\[ E(w^*) - E(w) = -\frac{1}{2}(w-w^*)^T H(w-w^*) < 0 \]
In other words, for arbitrary w ≠ w∗, (w − w∗ )T H(w − w∗ ) > 0. According to the previous problem, this means H is positive definite.
It is obvious. Suppose that there are W adaptive parameters in the net-

work. Therefore, b has W independent parameters. Since H is symmetric,
there should be W (W + 1)/2 independent parameters in it. Therefore, there
are W + W (W + 1)/2 = W (W + 3)/2 parameters in total.
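It is obvious. Since we have
\[ E_n(w_{ji}+\epsilon) = E_n(w_{ji}) + \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3) \]
And
\[ E_n(w_{ji}-\epsilon) = E_n(w_{ji}) - \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3) \]
We subtract the second equation from the first, which gives
\[ E_n(w_{ji}+\epsilon) - E_n(w_{ji}-\epsilon) = 2\epsilon E_n'(w_{ji}) + O(\epsilon^3) \]
Rearranging the equation above, i.e., dividing by 2ϵ so that the remainder becomes O(ϵ2 ), we obtain what has been required.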
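The different error orders of forward and central differences show up clearly in a numerical experiment: halving ϵ should roughly halve the forward-difference error and quarter the central-difference error. The sketch below uses sin as a stand-in for the error function, which is purely an illustrative assumption, as are the function names and step sizes.

```python
import math


def forward_diff(f, w, eps):
    """Forward difference, with O(eps) truncation error."""
    return (f(w + eps) - f(w)) / eps


def central_diff(f, w, eps):
    """Central difference, with O(eps^2) truncation error."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


def error_ratio(diff, f, df, w, eps):
    """Ratio of the absolute errors at step eps and at step eps / 2."""
    err_full = abs(diff(f, w, eps) - df(w))
    err_half = abs(diff(f, w, eps / 2.0) - df(w))
    return err_full / err_half


if __name__ == "__main__":
    # sin stands in for E_n; its exact derivative is cos.
    print(error_ratio(forward_diff, math.sin, math.cos, 1.0, 1e-2))
    print(error_ratio(central_diff, math.sin, math.cos, 1.0, 1e-2))
```

A ratio near 2 indicates first-order accuracy and a ratio near 4 indicates second-order accuracy.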
It is obvious. The back propagation formalism starts from performing summation near the input, as shown in (5.73). By symmetry, the forward propagation formalism should start near the output.
\[ J_{ki} = \frac{\partial y_k}{\partial x_i} = \frac{\partial h(a_k)}{\partial x_i} = h'(a_k)\frac{\partial a_k}{\partial x_i} \qquad (*) \]
Where h(·) is the activation function at the output node $a_k$. Considering all the units j which have links to unit k:
\[ \frac{\partial a_k}{\partial x_i} = \sum_j\frac{\partial a_k}{\partial a_j}\frac{\partial a_j}{\partial x_i} = \sum_j w_{kj}\,h'(a_j)\frac{\partial a_j}{\partial x_i} \qquad (**) \]
Where we have used:
\[ a_k = \sum_j w_{kj}z_j, \qquad z_j = h(a_j) \]
It is similar for $\partial a_j/\partial x_i$. In this way we have obtained a recursive formula starting from the input node:
\[ \frac{\partial a_l}{\partial x_i} = \begin{cases} w_{li}, & \text{if there is a link from input unit } i \text{ to } l \\ 0, & \text{if there isn't a link from input unit } i \text{ to } l \end{cases} \]
Using the recursive formula (∗∗) and then (∗), we can obtain the Jacobian Matrix.
It is obvious. We begin by writing down the error function.
\[ E = \frac{1}{2}\sum_{n=1}^{N}\|y_n-t_n\|^2 = \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{M}(y_{n,m}-t_{n,m})^2 \]
Where the subscript m denotes the mth element of the vector. Then we can write down the Hessian Matrix as before.
\[ H = \nabla\nabla E = \sum_{n=1}^{N}\sum_{m=1}^{M}\nabla y_{n,m}(\nabla y_{n,m})^T + \sum_{n=1}^{N}\sum_{m=1}^{M}(y_{n,m}-t_{n,m})\nabla\nabla y_{n,m} \]
Similarly, we now know that the Hessian Matrix can be approximated as:
\[ H \simeq \sum_{n=1}^{N}\sum_{m=1}^{M} b_{n,m}b_{n,m}^T \]
Where $b_{n,m} = \nabla y_{n,m}$.
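It is obvious.
\[ \begin{aligned}
\frac{\partial^2 E}{\partial w_r\partial w_s} &= \frac{\partial}{\partial w_r}\,\frac{1}{2}\int\!\!\int 2(y-t)\frac{\partial y}{\partial w_s}\,p(x,t)\,dx\,dt \\
&= \int\!\!\int\Big[(y-t)\frac{\partial^2 y}{\partial w_r\partial w_s} + \frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\Big]p(x,t)\,dx\,dt
\end{aligned} \]
Since we know that
\[ \begin{aligned}
\int\!\!\int(y-t)\frac{\partial^2 y}{\partial w_r\partial w_s}\,p(x,t)\,dx\,dt &= \int\!\!\int(y-t)\frac{\partial^2 y}{\partial w_r\partial w_s}\,p(t|x)\,p(x)\,dx\,dt \\
&= \int\frac{\partial^2 y}{\partial w_r\partial w_s}\Big\{\int(y-t)\,p(t|x)\,dt\Big\}p(x)\,dx \\
&= 0
\end{aligned} \]
Note that in the last step, we have used $y = \int t\,p(t|x)\,dt$. Then we substitute it into the second derivative, which gives
\[ \begin{aligned}
\frac{\partial^2 E}{\partial w_r\partial w_s} &= \int\!\!\int\frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\,p(x,t)\,dx\,dt \\
&= \int\frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\,p(x)\,dx
\end{aligned} \]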

By analogy with section 5.3.2, we denote by $w_{ki}^{\text{skip}}$ the parameters corresponding to skip-layer connections, i.e., those connecting the input unit i with the output unit k. Note that the discussion in section 5.3.2 is still correct, and now we only need to obtain the derivative of the error function with respect to the additional parameters $w_{ki}^{\text{skip}}$.
\[ \frac{\partial E_n}{\partial w_{ki}^{\text{skip}}} = \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ki}^{\text{skip}}} = \delta_k x_i \]
Where we have used $a_k = y_k$ due to linear activation at the output unit and:
\[ y_k = \sum_{j=0}^{M} w_{kj}^{(2)}z_j + \sum_i w_{ki}^{\text{skip}}x_i \]
Where the first term on the right side corresponds to the information conveyed from the hidden units to the output, and the second term corresponds to the information conveyed directly from the input to the output.
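The error function is given by (5.21). Therefore, we can obtain:
\[ \begin{aligned}
\nabla E(w) &= \sum_{n=1}^{N}\frac{\partial E}{\partial a_n}\nabla a_n \\
&= -\sum_{n=1}^{N}\frac{\partial}{\partial a_n}\big[t_n\ln y_n + (1-t_n)\ln(1-y_n)\big]\nabla a_n \\
&= -\sum_{n=1}^{N}\Big\{\frac{\partial(t_n\ln y_n)}{\partial y_n}\frac{\partial y_n}{\partial a_n} + \frac{\partial(1-t_n)\ln(1-y_n)}{\partial y_n}\frac{\partial y_n}{\partial a_n}\Big\}\nabla a_n \\
&= -\sum_{n=1}^{N}\Big[\frac{t_n}{y_n}\cdot y_n(1-y_n) + (1-t_n)\frac{-1}{1-y_n}\cdot y_n(1-y_n)\Big]\nabla a_n \\
&= -\sum_{n=1}^{N}\big[t_n(1-y_n) - (1-t_n)\,y_n\big]\nabla a_n \\
&= \sum_{n=1}^{N}(y_n-t_n)\nabla a_n
\end{aligned} \]
Where we have used the conclusion of problem 5.6. Now we calculate the second derivative.
\[ \nabla\nabla E(w) = \sum_{n=1}^{N}\Big\{y_n(1-y_n)\nabla a_n(\nabla a_n)^T + (y_n-t_n)\nabla\nabla a_n\Big\} \]
Similarly, we can drop the last term, which gives exactly what has been asked.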
We begin by writing down the error function.
\[ E(w) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk} \]
Here we assume that the output of the network has K units in total and there are W weight parameters in the network. We first calculate the first derivative:
\[ \begin{aligned}
\nabla E &= \sum_{n=1}^{N}\frac{dE}{da_n}\cdot\nabla a_n \\
&= -\sum_{n=1}^{N}\Big[\frac{d}{da_n}\Big(\sum_{k=1}^{K} t_{nk}\ln y_{nk}\Big)\Big]\cdot\nabla a_n \\
&= \sum_{n=1}^{N} c_n\cdot\nabla a_n
\end{aligned} \]
Note that here $c_n = dE/da_n$ is a vector with size K × 1, and $\nabla a_n$ is a matrix with size K × W . Moreover, the operator · means inner product, which gives ∇E as a vector with size 1 × W . According to (4.106), we can obtain the jth element of $c_n$:
\[ \begin{aligned}
c_{n,j} &= -\frac{\partial}{\partial a_j}\Big(\sum_{k=1}^{K} t_{nk}\ln y_{nk}\Big) \\
&= -\sum_{k=1}^{K}\frac{\partial}{\partial a_j}(t_{nk}\ln y_{nk}) \\
&= -\sum_{k=1}^{K}\frac{t_{nk}}{y_{nk}}\,y_{nk}(I_{kj}-y_{nj}) \\
&= -\sum_{k=1}^{K} t_{nk}I_{kj} + \sum_{k=1}^{K} t_{nk}y_{nj} \\
&= -t_{nj} + y_{nj}\Big(\sum_{k=1}^{K} t_{nk}\Big) \\
&= y_{nj} - t_{nj}
\end{aligned} \]
Now we calculate the second derivative:
\[ \nabla\nabla E = \sum_{n=1}^{N}\Big(\frac{dc_n}{da_n}\nabla a_n\Big)\cdot\nabla a_n + c_n\nabla\nabla a_n \]
Here $dc_n/da_n$ is a matrix with size K × K . As before, the second term can be neglected, which gives:
\[ H = \sum_{n=1}^{N}\Big(\frac{dc_n}{da_n}\nabla a_n\Big)\cdot\nabla a_n \]
We first write down the expression of the Hessian Matrix in the case of K outputs.
\[ H_{N,K} = \sum_{n=1}^{N}\sum_{k=1}^{K} b_{n,k}b_{n,k}^T \]
Where $b_{n,k} = \nabla_w a_{n,k}$. Therefore, we have:
\[ H_{N+1,K} = H_{N,K} + \sum_{k=1}^{K} b_{N+1,k}b_{N+1,k}^T = H_{N,K} + B_{N+1}B_{N+1}^T \]
Where $B_{N+1} = [b_{N+1,1}, b_{N+1,2}, ..., b_{N+1,K}]$ is a matrix with size W × K , and here W is the total number of the parameters in the network. By analogy with (5.88)-(5.89), we can obtain the following, where the scalar denominator of (5.89) is replaced by a K × K matrix inverse (the Woodbury identity) because $B_{N+1}^T H_{N,K}^{-1}B_{N+1}$ is now a matrix:
\[ H_{N+1,K}^{-1} = H_{N,K}^{-1} - H_{N,K}^{-1}B_{N+1}\big(I + B_{N+1}^T H_{N,K}^{-1}B_{N+1}\big)^{-1}B_{N+1}^T H_{N,K}^{-1} \qquad (*) \]
Furthermore, similarly, we have:
\[ H_{N+1,K+1} = H_{N+1,K} + \sum_{n=1}^{N+1} b_{n,K+1}b_{n,K+1}^T = H_{N+1,K} + B_{K+1}B_{K+1}^T \]
Where $B_{K+1} = [b_{1,K+1}, b_{2,K+1}, ..., b_{N+1,K+1}]$ is a matrix with size W × ( N + 1). Also, we can obtain:
\[ H_{N+1,K+1}^{-1} = H_{N+1,K}^{-1} - H_{N+1,K}^{-1}B_{K+1}\big(I + B_{K+1}^T H_{N+1,K}^{-1}B_{K+1}\big)^{-1}B_{K+1}^T H_{N+1,K}^{-1} \]
Where $H_{N+1,K}^{-1}$ is defined by (∗). If we substitute (∗) into the expression above, we can obtain the relationship between $H_{N+1,K+1}^{-1}$ and $H_{N,K}^{-1}$.
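We begin by handling the first case.
\[ \begin{aligned}
\frac{\partial^2 E_n}{\partial w_{kj}^{(2)}\partial w_{k'j'}^{(2)}} &= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial w_{k'j'}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{k'j'}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\frac{\partial\sum_{j'}w_{k'j'}z_{j'}}{\partial w_{k'j'}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}z_{j'}\Big) \\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)z_{j'} + \frac{\partial E_n}{\partial a_{k'}}\frac{\partial z_{j'}}{\partial w_{kj}^{(2)}} \\
&= \frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)\frac{\partial a_k}{\partial w_{kj}^{(2)}}z_{j'} + 0 \\
&= \frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)z_j z_{j'} \\
&= z_j z_{j'}M_{kk'}
\end{aligned} \]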
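Then we focus on the second case, first for j ≠ j′:
\[ \begin{aligned}
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\partial w_{j'i'}^{(1)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial w_{j'i'}^{(1)}}\Big) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\sum_{k'}\frac{\partial E_n}{\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{j'i'}^{(1)}}\Big) \\
&= \sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}h'(a_{j'})x_{i'}\Big) \\
&= h'(a_{j'})x_{i'}\sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)w_{k'j'}^{(2)} \\
&= h'(a_{j'})x_{i'}\sum_{k'}\sum_{k}\frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)w_{k'j'}^{(2)}\frac{\partial a_k}{\partial w_{ji}^{(1)}} \\
&= h'(a_{j'})x_{i'}\sum_{k'}\sum_{k}\frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)w_{k'j'}^{(2)}\cdot\big(w_{kj}^{(2)}h'(a_j)x_i\big) \\
&= h'(a_{j'})x_{i'}\sum_{k'}\sum_{k}M_{kk'}\,w_{k'j'}^{(2)}\cdot w_{kj}^{(2)}h'(a_j)x_i \\
&= x_{i'}x_i\,h'(a_{j'})h'(a_j)\sum_{k'}\sum_{k}w_{k'j'}^{(2)}\cdot w_{kj}^{(2)}M_{kk'}
\end{aligned} \]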
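When j = j′, similarly we have:
\[ \begin{aligned}
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\partial w_{ji'}^{(1)}} &= \sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}h'(a_j)x_{i'}\Big) \\
&= x_{i'}\sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)w_{k'j}^{(2)}h'(a_j) + x_{i'}\sum_{k'}\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\frac{\partial h'(a_j)}{\partial w_{ji}^{(1)}} \\
&= x_{i'}x_i\,h'(a_j)h'(a_j)\sum_{k'}\sum_{k}w_{k'j}^{(2)}\cdot w_{kj}^{(2)}M_{kk'} + x_{i'}\sum_{k'}\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\frac{\partial h'(a_j)}{\partial w_{ji}^{(1)}} \\
&= x_{i'}x_i\,h'(a_j)h'(a_j)\sum_{k'}\sum_{k}w_{k'j}^{(2)}\cdot w_{kj}^{(2)}M_{kk'} + x_{i'}\sum_{k'}\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}h''(a_j)x_i \\
&= x_{i'}x_i\,h'(a_j)h'(a_j)\sum_{k'}\sum_{k}w_{k'j}^{(2)}\cdot w_{kj}^{(2)}M_{kk'} + h''(a_j)\,x_i x_{i'}\sum_{k'}\delta_{k'}w_{k'j}^{(2)}
\end{aligned} \]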
It seems that what we have obtained is slightly different from (5.94) when
j = j ′ . However this is not the case, since the summation over k′ in the second
term of our formulation and the summation over k in the first term of (5.94) is
actually the same (i.e., they both represent the summation over all the output
units). Combining the situation when j = j ′ and j ̸= j ′ , we can obtain (5.94)
just as required. Finally, we deal with the third case. Similarly we first focus
on j ̸= j ′ :
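\[ \begin{aligned}
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\partial w_{kj'}^{(2)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial w_{kj'}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{kj'}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\frac{\partial\sum_{j'}w_{kj'}z_{j'}}{\partial w_{kj'}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}z_{j'}\Big) \\
&= z_{j'}\sum_{k'}\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\frac{\partial a_{k'}}{\partial w_{ji}^{(1)}} \\
&= z_{j'}\sum_{k'}M_{kk'}\,w_{k'j}^{(2)}h'(a_j)x_i \\
&= x_i\,h'(a_j)\,z_{j'}\sum_{k'}M_{kk'}\,w_{k'j}^{(2)}
\end{aligned} \]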
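Note that in (5.95), there are two typos: (i) $H_{kk'}$ should be $M_{kk'}$. (ii) j should exchange position with j′ on the right side of (5.95). When j = j′, we have:
\[ \begin{aligned}
\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\partial w_{kj}^{(2)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial w_{kj}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{kj}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\frac{\partial\sum_{j}w_{kj}z_{j}}{\partial w_{kj}^{(2)}}\Big) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}z_j\Big) \\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\Big)z_j + \frac{\partial E_n}{\partial a_k}\frac{\partial z_j}{\partial w_{ji}^{(1)}} \\
&= x_i\,h'(a_j)\,z_j\sum_{k'}M_{kk'}\,w_{k'j}^{(2)} + \frac{\partial E_n}{\partial a_k}\frac{\partial z_j}{\partial w_{ji}^{(1)}} \\
&= x_i\,h'(a_j)\,z_j\sum_{k'}M_{kk'}\,w_{k'j}^{(2)} + \delta_k\,h'(a_j)\,x_i
\end{aligned} \]
Combining these two situations, we obtain (5.95) just as required.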
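It is similar to the previous problem.
\[ \begin{aligned}
\frac{\partial^2 E_n}{\partial w_{k'i'}\partial w_{kj}} &= \frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial w_{kj}}\Big) \\
&= \frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial a_k}z_j\Big) \\
&= z_j\,\frac{\partial a_{k'}}{\partial w_{k'i'}}\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big) \\
&= z_j\,x_{i'}M_{kk'}
\end{aligned} \]
And
\[ \begin{aligned}
\frac{\partial^2 E_n}{\partial w_{k'i'}\partial w_{ji}} &= \frac{\partial}{\partial w_{k'i'}}\Big(\sum_k\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ji}}\Big) \\
&= \frac{\partial}{\partial w_{k'i'}}\Big(\sum_k\frac{\partial E_n}{\partial a_k}w_{kj}h'(a_j)x_i\Big) \\
&= h'(a_j)\,x_i\sum_k w_{kj}\frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial a_k}\Big) \\
&= h'(a_j)\,x_i\sum_k w_{kj}\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\frac{\partial a_{k'}}{\partial w_{k'i'}} \\
&= h'(a_j)\,x_i\sum_k w_{kj}M_{kk'}x_{i'} \\
&= x_i x_{i'}\,h'(a_j)\sum_k w_{kj}M_{kk'}
\end{aligned} \]
Finally, we have
\[ \begin{aligned}
\frac{\partial^2 E_n}{\partial w_{k'i'}\partial w_{ki}} &= \frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial w_{ki}}\Big) \\
&= \frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial a_k}x_i\Big) \\
&= x_i\,\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\frac{\partial a_{k'}}{\partial w_{k'i'}} \\
&= x_i x_{i'}M_{kk'}
\end{aligned} \]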
It is obvious. According to (5.113), we have:
\[ \begin{aligned}
\tilde{a}_j &= \sum_i\tilde{w}_{ji}\tilde{x}_i + \tilde{w}_{j0} \\
&= \sum_i\frac{1}{a}w_{ji}\cdot(ax_i+b) + w_{j0} - \frac{b}{a}\sum_i w_{ji} \\
&= \sum_i w_{ji}x_i + w_{j0} = a_j
\end{aligned} \]
Where we have used (5.115), (5.116) and (5.117). So far, we have proved that under the transformation the hidden unit activation $a_j$ is unchanged. If the activation function at the hidden unit is also unchanged, we have $\tilde{z}_j = z_j$. Now we deal with the output unit $\tilde{y}_k$:
\[ \begin{aligned}
\tilde{y}_k &= \sum_j\tilde{w}_{kj}\tilde{z}_j + \tilde{w}_{k0} \\
&= \sum_j cw_{kj}\cdot z_j + cw_{k0} + d \\
&= c\Big[\sum_j w_{kj}\cdot z_j + w_{k0}\Big] + d \\
&= cy_k + d
\end{aligned} \]
Where we have used (5.114), (5.119) and (5.120). To be more specific, here we have proved that the linear transformation between $\tilde{y}_k$ and $y_k$ can be achieved by making transformation (5.119) and (5.120).
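A quick numerical experiment makes the invariance concrete: rescale the inputs as x̃ = a x + b, transform the first-layer parameters as w̃ji = wji / a and w̃j0 = wj0 − (b/a) Σ wji, transform the second-layer parameters as w̃kj = c wkj and w̃k0 = c wk0 + d, and the outputs become c yk + d. The sketch below assumes a tanh hidden layer with linear outputs; the parameter values and helper names are illustrative.

```python
import math


def forward(x, W1, b1, W2, b2):
    """Two-layer network: tanh hidden units, linear outputs."""
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * zj for w, zj in zip(row, z)) + b for row, b in zip(W2, b2)]


def transformed_params(W1, b1, W2, b2, a, b, c, d):
    """Compensate for inputs scaled as a*x + b and request outputs c*y + d."""
    W1t = [[w / a for w in row] for row in W1]
    b1t = [bj - (b / a) * sum(row) for row, bj in zip(W1, b1)]
    W2t = [[c * w for w in row] for row in W2]
    b2t = [c * bk + d for bk in b2]
    return W1t, b1t, W2t, b2t


def max_gap(x, W1, b1, W2, b2, a, b, c, d):
    y = forward(x, W1, b1, W2, b2)
    x_tilde = [a * xi + b for xi in x]
    W1t, b1t, W2t, b2t = transformed_params(W1, b1, W2, b2, a, b, c, d)
    y_tilde = forward(x_tilde, W1t, b1t, W2t, b2t)
    return max(abs(yt - (c * yk + d)) for yt, yk in zip(y_tilde, y))


# Illustrative parameters: 2 inputs, 3 hidden units, 2 outputs.
W1 = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.2]]
B1 = [0.1, -0.2, 0.4]
W2 = [[1.0, -2.0, 0.5], [0.3, 0.8, -1.1]]
B2 = [0.05, -0.3]

if __name__ == "__main__":
    print(max_gap([0.7, -1.3], W1, B1, W2, B2, a=2.5, b=-0.4, c=3.0, d=1.2))
```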
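Since we know the gradient of the error function with respect to w is:
\[ \nabla E = H(w-w^*) \]
Together with (5.196), we can obtain:
\[ \begin{aligned}
w^{(\tau)} &= w^{(\tau-1)} - \rho\nabla E \\
&= w^{(\tau-1)} - \rho H(w^{(\tau-1)}-w^*)
\end{aligned} \]
Multiplying both sides by $u_j^T$ and using $w_j = w^T u_j$, we can obtain:
\[ \begin{aligned}
w_j^{(\tau)} &= u_j^T\big[w^{(\tau-1)} - \rho H(w^{(\tau-1)}-w^*)\big] \\
&= w_j^{(\tau-1)} - \rho\,u_j^T H(w^{(\tau-1)}-w^*) \\
&= w_j^{(\tau-1)} - \rho\eta_j\,u_j^T(w^{(\tau-1)}-w^*) \\
&= w_j^{(\tau-1)} - \rho\eta_j(w_j^{(\tau-1)}-w_j^*) \\
&= (1-\rho\eta_j)\,w_j^{(\tau-1)} + \rho\eta_j w_j^*
\end{aligned} \]
Where we have used (5.198). Then we use mathematical induction to prove (5.197), beginning by calculating $w_j^{(1)}$ with the initial weight vector at the origin, $w^{(0)} = 0$:
\[ \begin{aligned}
w_j^{(1)} &= (1-\rho\eta_j)\,w_j^{(0)} + \rho\eta_j w_j^* \\
&= \rho\eta_j w_j^* \\
&= \big[1-(1-\rho\eta_j)\big]w_j^*
\end{aligned} \]
Suppose (5.197) holds for τ. We now prove that it also holds for τ + 1.
\[ \begin{aligned}
w_j^{(\tau+1)} &= (1-\rho\eta_j)\,w_j^{(\tau)} + \rho\eta_j w_j^* \\
&= (1-\rho\eta_j)\big[1-(1-\rho\eta_j)^{\tau}\big]w_j^* + \rho\eta_j w_j^* \\
&= \Big\{(1-\rho\eta_j)\big[1-(1-\rho\eta_j)^{\tau}\big] + \rho\eta_j\Big\}w_j^* \\
&= \big[1-(1-\rho\eta_j)^{\tau+1}\big]w_j^*
\end{aligned} \]
Hence (5.197) holds for τ = 1, 2, .... Provided $|1-\rho\eta_j| < 1$, we have $(1-\rho\eta_j)^{\tau}\to 0$ as τ → ∞ and thus $w^{(\tau)}\to w^*$. If τ is finite and $\eta_j \gg (\rho\tau)^{-1}$, the above argument still holds since τ is still relatively large. Conversely, when $\eta_j \ll (\rho\tau)^{-1}$, we expand the expression above:
\[ |w_j^{(\tau)}| = \big|\big[1-(1-\rho\eta_j)^{\tau}\big]w_j^*\big| \approx |\tau\rho\eta_j w_j^*| \ll |w_j^*| \]
We can see that $(\rho\tau)^{-1}$ works as the regularization parameter α in section 3.5.3.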
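The closed form for the weight component along one eigenvector can be compared with the iteration it summarizes. The sketch below works with a single scalar component and starts from zero, as in the derivation; the parameter values and function names are arbitrary illustrations.

```python
def iterate(rho, eta, w_star, tau):
    """Run tau gradient steps w <- w - rho * eta * (w - w_star), starting at 0."""
    w = 0.0
    for _ in range(tau):
        w = w - rho * eta * (w - w_star)
    return w


def closed_form(rho, eta, w_star, tau):
    """Closed form [1 - (1 - rho * eta)^tau] * w_star."""
    return (1.0 - (1.0 - rho * eta) ** tau) * w_star


if __name__ == "__main__":
    # Large eigenvalue: the component is close to w_star after 25 steps.
    print(iterate(0.1, 2.0, 3.0, 25), closed_form(0.1, 2.0, 3.0, 25))
    # Small eigenvalue (eta << 1 / (rho * tau)): the component stays near zero.
    print(closed_form(0.1, 1e-4, 3.0, 10), 10 * 0.1 * 1e-4 * 3.0)
```

The second line of output illustrates the early-stopping effect: for eigenvalues much smaller than 1/(ρτ), the component is approximately τρη times its optimum, so it is strongly shrunk toward zero.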
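Based on definition or by analogy with (5.128), we have:
\[ \begin{aligned}
\Omega_n &= \frac{1}{2}\sum_k\Big(\frac{\partial y_{nk}}{\partial\xi}\Big|_{\xi=0}\Big)^2 \\
&= \frac{1}{2}\sum_k\Big(\sum_i\frac{\partial y_{nk}}{\partial x_i}\frac{\partial x_i}{\partial\xi}\Big|_{\xi=0}\Big)^2 \\
&= \frac{1}{2}\sum_k\Big(\sum_i\tau_i\frac{\partial}{\partial x_i}y_{nk}\Big)^2
\end{aligned} \]
Where we have denoted
\[ \tau_i = \frac{\partial x_i}{\partial\xi}\Big|_{\xi=0} \]
And this is exactly the form given in (5.201) and (5.202) if the nth observation $y_{nk}$ is denoted as $y_k$ in short. Firstly, we define $\alpha_j$ and $\beta_j$ as (5.205) shows, where $z_j$ and $a_j$ are given by (5.203). Then we will prove that (5.204) holds:
\[ \begin{aligned}
\alpha_j &= \sum_i\tau_i\frac{\partial z_j}{\partial x_i} = \sum_i\tau_i\frac{\partial h(a_j)}{\partial x_i} \\
&= \sum_i\tau_i\frac{\partial h(a_j)}{\partial a_j}\frac{\partial a_j}{\partial x_i} \\
&= h'(a_j)\sum_i\tau_i\frac{\partial}{\partial x_i}a_j = h'(a_j)\,\beta_j
\end{aligned} \]
Moreover,
\[ \begin{aligned}
\beta_j &= \sum_i\tau_i\frac{\partial a_j}{\partial x_i} = \sum_i\tau_i\frac{\partial\sum_{i'}w_{ji'}z_{i'}}{\partial x_i} \\
&= \sum_i\tau_i\sum_{i'}\frac{\partial w_{ji'}z_{i'}}{\partial x_i} = \sum_i\tau_i\sum_{i'}w_{ji'}\frac{\partial z_{i'}}{\partial x_i} \\
&= \sum_{i'}w_{ji'}\sum_i\tau_i\frac{\partial z_{i'}}{\partial x_i} = \sum_{i'}w_{ji'}\alpha_{i'}
\end{aligned} \]
So far we have proved that (5.204) holds, and now we aim to find a forward propagation formula to calculate $\Omega_n$. We begin by evaluating $\{\beta_j\}$ at the input units, and then use the first equation in (5.204) to obtain $\{\alpha_j\}$ at the input units, and then the second equation to evaluate $\{\beta_j\}$ at the first hidden layer, and again the first equation to evaluate $\{\alpha_j\}$ at the first hidden layer. We repeatedly evaluate $\{\beta_j\}$ and $\{\alpha_j\}$ in this way until reaching the output layer. Then we deal with (5.206):
\[ \begin{aligned}
\frac{\partial\Omega_n}{\partial w_{rs}} &= \frac{\partial}{\partial w_{rs}}\Big\{\frac{1}{2}\sum_k(\mathcal{G}y_k)^2\Big\} = \frac{1}{2}\sum_k\frac{\partial(\mathcal{G}y_k)^2}{\partial w_{rs}} \\
&= \frac{1}{2}\sum_k\frac{\partial(\mathcal{G}y_k)^2}{\partial(\mathcal{G}y_k)}\frac{\partial(\mathcal{G}y_k)}{\partial w_{rs}} = \sum_k\mathcal{G}y_k\frac{\partial\,\mathcal{G}y_k}{\partial w_{rs}} \\
&= \sum_k\mathcal{G}y_k\,\mathcal{G}\Big[\frac{\partial y_k}{\partial w_{rs}}\Big] = \sum_k\alpha_k\,\mathcal{G}\Big[\frac{\partial y_k}{\partial a_r}\frac{\partial a_r}{\partial w_{rs}}\Big] \\
&= \sum_k\alpha_k\,\mathcal{G}\big[\delta_{kr}z_s\big] = \sum_k\alpha_k\Big\{\mathcal{G}[\delta_{kr}]\,z_s + \mathcal{G}[z_s]\,\delta_{kr}\Big\} \\
&= \sum_k\alpha_k\big\{\phi_{kr}z_s + \alpha_s\delta_{kr}\big\}
\end{aligned} \]
Provided with the idea in section 5.3, the backward propagation formula is easy to derive. We can simply replace $E_n$ with $y_k$ to obtain a backward equation, so we omit it here.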
Following the procedure in section 5.5.5, we can obtain:
\[ \Omega = \frac{1}{2}\int\big(\tau^T\nabla y(x)\big)^2\,p(x)\,dx \]
Since we have $\tau = \partial s(x,\xi)/\partial\xi$ and s = x + ξ, we have τ = I. Therefore, substituting τ into the equation above, we can obtain:
\[ \Omega = \frac{1}{2}\int\big(\nabla y(x)\big)^2\,p(x)\,dx \]
Just as required.
The modifications only affect derivatives with respect to the weights in the convolutional layer. The units within a feature map (indexed m) have different inputs, but all share a common weight vector, $w^{(m)}$. Therefore, we can write:
\[ \frac{\partial E_n}{\partial w_i^{(m)}} = \sum_j\frac{\partial E_n}{\partial a_j^{(m)}}\frac{\partial a_j^{(m)}}{\partial w_i^{(m)}} = \sum_j\delta_j^{(m)}z_{ji}^{(m)} \]
Here $a_j^{(m)}$ denotes the activation of the jth unit in the mth feature map, whereas $w_i^{(m)}$ denotes the ith element of the corresponding feature vector, and finally $z_{ji}^{(m)}$ denotes the ith input for the jth unit in the mth feature map. Note that $\delta_j^{(m)}$ can be computed recursively from the units in the following layer.
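It is obvious. Firstly, we know that:
\[ \frac{\partial}{\partial w_i}\Big\{\pi_j N(w_i|\mu_j,\sigma_j^2)\Big\} = -\pi_j\frac{w_i-\mu_j}{\sigma_j^2}N(w_i|\mu_j,\sigma_j^2) \]
We now differentiate the error function with respect to $w_i$:
\[ \begin{aligned}
\frac{\partial\widetilde{E}}{\partial w_i} &= \frac{\partial E}{\partial w_i} + \frac{\partial\lambda\Omega(w)}{\partial w_i} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{\partial}{\partial w_i}\Big\{\sum_{i'}\ln\Big(\sum_{j=1}^{M}\pi_j N(w_{i'}|\mu_j,\sigma_j^2)\Big)\Big\} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{\partial}{\partial w_i}\Big\{\ln\Big(\sum_{j=1}^{M}\pi_j N(w_i|\mu_j,\sigma_j^2)\Big)\Big\} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{1}{\sum_{j=1}^{M}\pi_j N(w_i|\mu_j,\sigma_j^2)}\frac{\partial}{\partial w_i}\Big\{\sum_{j=1}^{M}\pi_j N(w_i|\mu_j,\sigma_j^2)\Big\} \\
&= \frac{\partial E}{\partial w_i} + \lambda\frac{1}{\sum_{j=1}^{M}\pi_j N(w_i|\mu_j,\sigma_j^2)}\Big\{\sum_{j=1}^{M}\pi_j\frac{w_i-\mu_j}{\sigma_j^2}N(w_i|\mu_j,\sigma_j^2)\Big\} \\
&= \frac{\partial E}{\partial w_i} + \lambda\sum_{j=1}^{M}\frac{\pi_j N(w_i|\mu_j,\sigma_j^2)}{\sum_k\pi_k N(w_i|\mu_k,\sigma_k^2)}\frac{w_i-\mu_j}{\sigma_j^2} \\
&= \frac{\partial E}{\partial w_i} + \lambda\sum_{j=1}^{M}\gamma_j(w_i)\frac{w_i-\mu_j}{\sigma_j^2}
\end{aligned} \]
Where we have used (5.138) and defined (5.140).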
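It is similar to the previous problem. Since we know that:
$$
\frac{\partial}{\partial \mu_j}\big\{\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2)\big\} = \pi_j \frac{w_i - \mu_j}{\sigma_j^2}\,\mathcal{N}(w_i|\mu_j, \sigma_j^2)
$$
We can derive (writing the sum over mixture components with index $k$ to avoid a clash with the fixed index $j$):
$$
\begin{aligned}
\frac{\partial \tilde{E}}{\partial \mu_j} &= \frac{\partial \lambda\Omega(\mathbf{w})}{\partial \mu_j} = -\lambda \frac{\partial}{\partial \mu_j}\Big\{\sum_i \ln\Big(\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\Big)\Big\} \\
&= -\lambda \sum_i \frac{\partial}{\partial \mu_j}\Big\{\ln\Big(\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\Big)\Big\} \\
&= -\lambda \sum_i \frac{1}{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)}\,\frac{\partial}{\partial \mu_j}\Big\{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\Big\} \\
&= -\lambda \sum_i \frac{1}{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)}\,\pi_j \frac{w_i - \mu_j}{\sigma_j^2}\,\mathcal{N}(w_i|\mu_j, \sigma_j^2) \\
&= \lambda \sum_i \frac{\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2)}{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)}\,\frac{\mu_j - w_i}{\sigma_j^2} = \lambda \sum_i \gamma_j(w_i)\,\frac{\mu_j - w_i}{\sigma_j^2}
\end{aligned}
$$
Note that there is a typo in (5.142). The numerator should be $\mu_j - w_i$
instead of $\mu_i - w_j$. This can be seen from the fact that the mean and
variance of a Gaussian component must carry the same subindex: since $\sigma_j$
is in the denominator, $\mu_j$ should occur in the numerator instead of $\mu_i$.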
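It is similar to the previous problem. Since we know that:
$$
\frac{\partial}{\partial \sigma_j}\big\{\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2)\big\} = \Big(-\frac{1}{\sigma_j} + \frac{(w_i - \mu_j)^2}{\sigma_j^3}\Big)\,\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2)
$$
We can derive:
$$
\begin{aligned}
\frac{\partial \tilde{E}}{\partial \sigma_j} &= \frac{\partial \lambda\Omega(\mathbf{w})}{\partial \sigma_j} = -\lambda \frac{\partial}{\partial \sigma_j}\Big\{\sum_i \ln\Big(\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\Big)\Big\} \\
&= -\lambda \sum_i \frac{1}{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)}\,\frac{\partial}{\partial \sigma_j}\Big\{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\Big\} \\
&= -\lambda \sum_i \frac{1}{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)}\,\frac{\partial}{\partial \sigma_j}\big\{\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2)\big\} \\
&= \lambda \sum_i \frac{1}{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)}\Big(\frac{1}{\sigma_j} - \frac{(w_i - \mu_j)^2}{\sigma_j^3}\Big)\,\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2) \\
&= \lambda \sum_i \frac{\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2)}{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)}\Big(\frac{1}{\sigma_j} - \frac{(w_i - \mu_j)^2}{\sigma_j^3}\Big) \\
&= \lambda \sum_i \gamma_j(w_i)\Big(\frac{1}{\sigma_j} - \frac{(w_i - \mu_j)^2}{\sigma_j^3}\Big)
\end{aligned}
$$
Just as required.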
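It is trivial. We begin by verifying (5.208) when $j \neq k$:
$$
\frac{\partial \pi_k}{\partial \eta_j} = \frac{\partial}{\partial \eta_j}\Big\{\frac{\exp(\eta_k)}{\sum_l \exp(\eta_l)}\Big\} = \frac{-\exp(\eta_k)\exp(\eta_j)}{\big[\sum_l \exp(\eta_l)\big]^2} = -\pi_j \pi_k
$$
And if now we have $j = k$:
$$
\frac{\partial \pi_k}{\partial \eta_k} = \frac{\partial}{\partial \eta_k}\Big\{\frac{\exp(\eta_k)}{\sum_l \exp(\eta_l)}\Big\} = \frac{\exp(\eta_k)\big[\sum_l \exp(\eta_l)\big] - \exp(\eta_k)\exp(\eta_k)}{\big[\sum_l \exp(\eta_l)\big]^2} = \pi_k - \pi_k \pi_k
$$
If we combine these two cases, we can easily see that (5.208) holds. Now
we prove (5.147).
$$
\begin{aligned}
\frac{\partial \tilde{E}}{\partial \eta_j} &= \lambda \frac{\partial \Omega(\mathbf{w})}{\partial \eta_j} = -\lambda \frac{\partial}{\partial \eta_j}\Big\{\sum_i \ln\Big\{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\Big\}\Big\} \\
&= -\lambda \sum_i \frac{1}{\sum_{l=1}^M \pi_l \mathcal{N}(w_i|\mu_l, \sigma_l^2)}\,\frac{\partial}{\partial \eta_j}\Big\{\sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\Big\} \\
&= -\lambda \sum_i \frac{1}{\sum_{l=1}^M \pi_l \mathcal{N}(w_i|\mu_l, \sigma_l^2)}\sum_{k=1}^M \frac{\partial}{\partial \pi_k}\big\{\pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\big\}\frac{\partial \pi_k}{\partial \eta_j} \\
&= -\lambda \sum_i \frac{1}{\sum_{l=1}^M \pi_l \mathcal{N}(w_i|\mu_l, \sigma_l^2)}\sum_{k=1}^M \mathcal{N}(w_i|\mu_k, \sigma_k^2)\,(\delta_{jk}\pi_j - \pi_j \pi_k) \\
&= -\lambda \sum_i \frac{1}{\sum_{l=1}^M \pi_l \mathcal{N}(w_i|\mu_l, \sigma_l^2)}\Big\{\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2) - \pi_j \sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)\Big\} \\
&= -\lambda \sum_i \Big\{\frac{\pi_j \mathcal{N}(w_i|\mu_j, \sigma_j^2)}{\sum_{l=1}^M \pi_l \mathcal{N}(w_i|\mu_l, \sigma_l^2)} - \frac{\pi_j \sum_{k=1}^M \pi_k \mathcal{N}(w_i|\mu_k, \sigma_k^2)}{\sum_{l=1}^M \pi_l \mathcal{N}(w_i|\mu_l, \sigma_l^2)}\Big\} \\
&= -\lambda \sum_i \big\{\gamma_j(w_i) - \pi_j\big\} = \lambda \sum_i \big\{\pi_j - \gamma_j(w_i)\big\}
\end{aligned}
$$
Just as required.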
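The softmax derivative (5.208) used above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the test vector `eta` are illustrative choices, not part of the original solution. It compares the analytic Jacobian $\delta_{jk}\pi_j - \pi_j\pi_k$ with central finite differences.

```python
import numpy as np

def softmax(eta):
    """Mixing coefficients pi_k = exp(eta_k) / sum_l exp(eta_l)."""
    z = np.exp(eta - np.max(eta))  # subtracting the max improves numerical stability
    return z / z.sum()

def softmax_jacobian(eta):
    """Analytic Jacobian J[k, j] = d pi_k / d eta_j = delta_jk * pi_j - pi_j * pi_k."""
    pi = softmax(eta)
    return np.diag(pi) - np.outer(pi, pi)

def numeric_jacobian(eta, h=1e-6):
    """Central finite differences; column j holds d pi / d eta_j."""
    J = np.zeros((eta.size, eta.size))
    for j in range(eta.size):
        step = np.zeros_like(eta)
        step[j] = h
        J[:, j] = (softmax(eta + step) - softmax(eta - step)) / (2 * h)
    return J

eta = np.array([0.3, -1.2, 2.0, 0.5])  # arbitrary illustrative values
max_abs_diff = np.max(np.abs(softmax_jacobian(eta) - numeric_jacobian(eta)))
```

Each column of the Jacobian sums to zero, which reflects the constraint that the mixing coefficients always sum to one.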
It is trivial. We set the attachment point of the lower arm with the ground
as the origin of the coordinate system. We first find the vertical distance from
the origin to the target point, which is also the value of $x_2$:
$$
x_2 = L_1 \sin(\pi - \theta_1) + L_2 \sin(\theta_2 - (\pi - \theta_1)) = L_1 \sin\theta_1 - L_2 \sin(\theta_1 + \theta_2)
$$
Similarly, we calculate the horizontal distance from the origin to the target point:
$$
x_1 = -L_1 \cos(\pi - \theta_1) + L_2 \cos(\theta_2 - (\pi - \theta_1)) = L_1 \cos\theta_1 - L_2 \cos(\theta_1 + \theta_2)
$$
From these two equations, we can clearly see the 'forward kinematics' of
the robot arm.
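As a quick sanity check of the two expressions, the sketch below evaluates both the unsimplified and the simplified form of the forward kinematics. The function names and the default arm lengths are assumptions made for illustration only.

```python
import math

def forward_kinematics(theta1, theta2, L1=1.0, L2=1.0):
    """Simplified form: end-effector position (x1, x2) from the joint angles."""
    x1 = L1 * math.cos(theta1) - L2 * math.cos(theta1 + theta2)
    x2 = L1 * math.sin(theta1) - L2 * math.sin(theta1 + theta2)
    return x1, x2

def forward_kinematics_unsimplified(theta1, theta2, L1=1.0, L2=1.0):
    """Same quantity written with the angles (pi - theta1) used in the derivation."""
    x1 = -L1 * math.cos(math.pi - theta1) + L2 * math.cos(theta2 - (math.pi - theta1))
    x2 = L1 * math.sin(math.pi - theta1) + L2 * math.sin(theta2 - (math.pi - theta1))
    return x1, x2

# Arm pointing straight up and fully extended: theta1 = pi/2, theta2 = pi.
x1_up, x2_up = forward_kinematics(math.pi / 2, math.pi)
```

With unit arm lengths the fully extended vertical arm ends at height $L_1 + L_2 = 2$ directly above the origin, which both forms reproduce.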
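By analogy with (5.208), we can write:
$$
\frac{\partial \pi_k(\mathbf{x})}{\partial a_j^{\pi}} = \delta_{jk}\pi_j(\mathbf{x}) - \pi_j(\mathbf{x})\pi_k(\mathbf{x})
$$
Using (5.153), we can see that:
$$
E_n = -\ln\Big\{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\}
$$
Therefore, we can derive:
$$
\begin{aligned}
\frac{\partial E_n}{\partial a_j^{\pi}} &= -\frac{\partial}{\partial a_j^{\pi}} \ln\Big\{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\,\frac{\partial}{\partial a_j^{\pi}}\Big\{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\sum_{k=1}^K \frac{\partial \pi_k}{\partial a_j^{\pi}}\,\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2) \\
&= -\frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\sum_{k=1}^K \big[\delta_{jk}\pi_j(\mathbf{x}_n) - \pi_j(\mathbf{x}_n)\pi_k(\mathbf{x}_n)\big]\,\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2) \\
&= -\frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\Big\{\pi_j(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_j, \sigma_j^2) - \pi_j(\mathbf{x}_n)\sum_{k=1}^K \pi_k(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\} \\
&= \frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\Big\{-\pi_j(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_j, \sigma_j^2) + \pi_j(\mathbf{x}_n)\sum_{k=1}^K \pi_k(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\}
\end{aligned}
$$
Using the definition (5.154), we have:
$$
\frac{\partial E_n}{\partial a_j^{\pi}} = -\gamma_j + \pi_j
$$
Note that our result differs from (5.155) only in the subindex. The two
are the same if we substitute index $j$ by index $k$ in the final
expression.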
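We deal with the derivative of the error function with respect to $\boldsymbol{\mu}_k$ instead,
which gives a vector as the result. The $l$th element of this vector
is what we have been asked for. Since we know that:
$$
\frac{\partial}{\partial \boldsymbol{\mu}_k}\big\{\pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\big\} = \frac{\mathbf{t}_n - \boldsymbol{\mu}_k}{\sigma_k^2}\,\pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)
$$
One thing worth noticing is that here we focus on the isotropic case as
stated on page 273 of the textbook. To be more precise, $\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)$ should
be $\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2\mathbf{I})$. Provided with the equation above, we can further obtain:
$$
\begin{aligned}
\frac{\partial E_n}{\partial \boldsymbol{\mu}_k} &= -\frac{\partial}{\partial \boldsymbol{\mu}_k}\ln\Big\{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\,\frac{\partial}{\partial \boldsymbol{\mu}_k}\Big\{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\cdot\frac{\mathbf{t}_n - \boldsymbol{\mu}_k}{\sigma_k^2}\,\pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2) \\
&= -\gamma_k\,\frac{\mathbf{t}_n - \boldsymbol{\mu}_k}{\sigma_k^2}
\end{aligned}
$$
Hence, noticing (5.152), the $l$th element of the result above is what we are
required to find:
$$
\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \frac{\partial E_n}{\partial \mu_{kl}} = \gamma_k\,\frac{\mu_{kl} - t_{nl}}{\sigma_k^2}
$$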
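Similarly, we know that:
$$
\frac{\partial}{\partial \sigma_k}\big\{\pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\big\} = \Big\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n - \boldsymbol{\mu}_k\|^2}{\sigma_k^3}\Big\}\,\pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)
$$
Therefore:
$$
\begin{aligned}
\frac{\partial E_n}{\partial \sigma_k} &= -\frac{\partial}{\partial \sigma_k}\ln\Big\{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\,\frac{\partial}{\partial \sigma_k}\Big\{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)\Big\} \\
&= -\frac{1}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2)}\cdot\Big\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n - \boldsymbol{\mu}_k\|^2}{\sigma_k^3}\Big\}\,\pi_k \mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k, \sigma_k^2) \\
&= -\gamma_k\Big\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n - \boldsymbol{\mu}_k\|^2}{\sigma_k^3}\Big\}
\end{aligned}
$$
Note that there is a typo in (5.157). The underlying reason is that the
determinant of the isotropic covariance is $|\sigma_k^2 \mathbf{I}_{D\times D}| = (\sigma_k^2)^D$, which
produces the factor $D$ in the first term.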
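First we know two properties of the Gaussian distribution $\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}, \sigma^2\mathbf{I})$:
$$
\mathbb{E}[\mathbf{t}] = \int \mathbf{t}\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}, \sigma^2\mathbf{I})\, d\mathbf{t} = \boldsymbol{\mu}
$$
And
$$
\mathbb{E}[\|\mathbf{t}\|^2] = \int \|\mathbf{t}\|^2\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}, \sigma^2\mathbf{I})\, d\mathbf{t} = L\sigma^2 + \|\boldsymbol{\mu}\|^2
$$
Where we have used $\mathbb{E}[\mathbf{t}^T\mathbf{A}\mathbf{t}] = \mathrm{Tr}[\mathbf{A}\sigma^2\mathbf{I}] + \boldsymbol{\mu}^T\mathbf{A}\boldsymbol{\mu}$ with $\mathbf{A} = \mathbf{I}$. This
property can be found in the Matrix Cookbook, eq (378). Here $L$ is the dimension of
$\mathbf{t}$. Noticing (5.148), we can write:
$$
\begin{aligned}
\mathbb{E}[\mathbf{t}|\mathbf{x}] &= \int \mathbf{t}\,p(\mathbf{t}|\mathbf{x})\, d\mathbf{t} = \int \mathbf{t}\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k, \sigma_k^2)\, d\mathbf{t} \\
&= \sum_{k=1}^K \pi_k \int \mathbf{t}\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k, \sigma_k^2)\, d\mathbf{t} = \sum_{k=1}^K \pi_k \boldsymbol{\mu}_k
\end{aligned}
$$
Then we prove (5.160).
$$
\begin{aligned}
s^2(\mathbf{x}) &= \mathbb{E}\big[\|\mathbf{t} - \mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 \,\big|\, \mathbf{x}\big] = \mathbb{E}\big[\|\mathbf{t}\|^2 - 2\mathbf{t}^T\mathbb{E}[\mathbf{t}|\mathbf{x}] + \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 \,\big|\, \mathbf{x}\big] \\
&= \mathbb{E}[\|\mathbf{t}\|^2|\mathbf{x}] - 2\,\mathbb{E}[\mathbf{t}|\mathbf{x}]^T\mathbb{E}[\mathbf{t}|\mathbf{x}] + \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 = \mathbb{E}[\|\mathbf{t}\|^2|\mathbf{x}] - \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 \\
&= \int \|\mathbf{t}\|^2 \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k, \sigma_k^2)\, d\mathbf{t} - \Big\|\sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K \pi_k \int \|\mathbf{t}\|^2\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k, \sigma_k^2)\, d\mathbf{t} - \Big\|\sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K \pi_k (L\sigma_k^2 + \|\boldsymbol{\mu}_k\|^2) - \Big\|\sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K \pi_k \sigma_k^2 + \sum_{k=1}^K \pi_k \|\boldsymbol{\mu}_k\|^2 - 2\Big\|\sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big\|^2 + \Big\|\sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K \pi_k \sigma_k^2 + \sum_{k=1}^K \pi_k \|\boldsymbol{\mu}_k\|^2 - 2\Big(\sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big)^T\Big(\sum_{k=1}^K \pi_k \boldsymbol{\mu}_k\Big) + \sum_{k=1}^K \pi_k \Big\|\sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K \pi_k \sigma_k^2 + \sum_{k=1}^K \pi_k \Big\|\boldsymbol{\mu}_k - \sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K \pi_k \Big(L\sigma_k^2 + \Big\|\boldsymbol{\mu}_k - \sum_{l=1}^K \pi_l \boldsymbol{\mu}_l\Big\|^2\Big)
\end{aligned}
$$
Here the last term of the sixth line uses $\sum_k \pi_k = 1$. Note that there is a typo in (5.160), i.e., the coefficient $L$ in front of $\sigma_k^2$ is
missing.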
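The algebra above can be checked numerically. The sketch below, which assumes NumPy and uses randomly generated mixture parameters purely for illustration, compares the moment form $\mathbb{E}[\|\mathbf{t}\|^2|\mathbf{x}] - \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2$ with the final closed form including the factor $L$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 3, 2                              # number of components, target dimension
pi = rng.dirichlet(np.ones(K))           # mixing coefficients, sum to one
mu = rng.normal(size=(K, L))             # component means
sigma2 = rng.uniform(0.5, 2.0, size=K)   # isotropic component variances

# Conditional mean E[t|x] = sum_k pi_k mu_k
mean = pi @ mu

# Moment form: E[||t||^2 | x] - ||E[t|x]||^2
second_moment = np.sum(pi * (L * sigma2 + np.sum(mu**2, axis=1)))
s2_from_moments = second_moment - mean @ mean

# Closed form: sum_k pi_k (L sigma_k^2 + ||mu_k - sum_l pi_l mu_l||^2)
s2_closed_form = np.sum(pi * (L * sigma2 + np.sum((mu - mean) ** 2, axis=1)))
```

The two values agree to floating-point precision only when the factor $L$ is kept in front of $\sigma_k^2$.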
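From (5.167) and (5.171), we can write down the expression for the predictive distribution:
$$
\begin{aligned}
p(t|\mathbf{x}, \mathcal{D}, \alpha, \beta) &= \int p(\mathbf{w}|\mathcal{D}, \alpha, \beta)\,p(t|\mathbf{x}, \mathbf{w}, \beta)\, d\mathbf{w} \\
&\approx \int q(\mathbf{w}|\mathcal{D})\,p(t|\mathbf{x}, \mathbf{w}, \beta)\, d\mathbf{w} \\
&= \int \mathcal{N}(\mathbf{w}|\mathbf{w}_{MAP}, \mathbf{A}^{-1})\,\mathcal{N}(t|\mathbf{g}^T\mathbf{w} - \mathbf{g}^T\mathbf{w}_{MAP} + y(\mathbf{x}, \mathbf{w}_{MAP}), \beta^{-1})\, d\mathbf{w}
\end{aligned}
$$
Note here $p(t|\mathbf{x}, \mathbf{w}, \beta)$ is given by (5.171) and $q(\mathbf{w}|\mathcal{D})$ is the approximation
to the posterior $p(\mathbf{w}|\mathcal{D}, \alpha, \beta)$, which is given by (5.167). Then by analogy with
(2.115), we first deal with the mean of the predictive distribution:
$$
\text{mean} = \big[\mathbf{g}^T\mathbf{w} - \mathbf{g}^T\mathbf{w}_{MAP} + y(\mathbf{x}, \mathbf{w}_{MAP})\big]_{\mathbf{w} = \mathbf{w}_{MAP}} = y(\mathbf{x}, \mathbf{w}_{MAP})
$$
Then we deal with the covariance matrix:
$$
\text{covariance} = \beta^{-1} + \mathbf{g}^T\mathbf{A}^{-1}\mathbf{g}
$$
Just as required.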
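Using the Laplace approximation, we can obtain:
$$
p(\mathcal{D}|\mathbf{w}, \beta)\,p(\mathbf{w}|\alpha) \approx p(\mathcal{D}|\mathbf{w}_{MAP}, \beta)\,p(\mathbf{w}_{MAP}|\alpha)\exp\Big\{-\frac{1}{2}(\mathbf{w} - \mathbf{w}_{MAP})^T\mathbf{A}(\mathbf{w} - \mathbf{w}_{MAP})\Big\}
$$
Then using (5.174), (5.162) and (5.163), we can obtain:
$$
\begin{aligned}
p(\mathcal{D}|\alpha, \beta) &= \int p(\mathcal{D}|\mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)\, d\mathbf{w} \\
&\approx \int p(\mathcal{D}|\mathbf{w}_{MAP}, \beta)\,p(\mathbf{w}_{MAP}|\alpha)\exp\Big\{-\frac{1}{2}(\mathbf{w} - \mathbf{w}_{MAP})^T\mathbf{A}(\mathbf{w} - \mathbf{w}_{MAP})\Big\}\, d\mathbf{w} \\
&= p(\mathcal{D}|\mathbf{w}_{MAP}, \beta)\,p(\mathbf{w}_{MAP}|\alpha)\,\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}} \\
&= \prod_{n=1}^N \mathcal{N}(t_n|y(\mathbf{x}_n, \mathbf{w}_{MAP}), \beta^{-1})\,\mathcal{N}(\mathbf{w}_{MAP}|\mathbf{0}, \alpha^{-1}\mathbf{I})\,\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}}
\end{aligned}
$$
The factor $\frac{1}{2}$ in the exponent is what makes the Gaussian integral evaluate to $(2\pi)^{W/2}/|\mathbf{A}|^{1/2}$.
If we take the logarithm of both sides, we will obtain (5.175) just as required.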
For a k-class classification problem, we need to use the softmax activation
function, and the error function is now given by (5.24). Therefore, the
Hessian matrix should be derived from (5.24), and the cross entropy in (5.184)
will also be replaced by (5.24).
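By analogy to Prob.5.39, we can write:
$$
p(\mathcal{D}|\alpha) = p(\mathcal{D}|\mathbf{w}_{MAP})\,p(\mathbf{w}_{MAP}|\alpha)\,\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}}
$$
Since the prior $p(\mathbf{w}|\alpha)$ follows a Gaussian distribution, i.e.,
(5.162), as stated in the text, we can obtain:
$$
\begin{aligned}
\ln p(\mathcal{D}|\alpha) &= \ln p(\mathcal{D}|\mathbf{w}_{MAP}) + \ln p(\mathbf{w}_{MAP}|\alpha) - \frac{1}{2}\ln|\mathbf{A}| + \text{const} \\
&= \ln p(\mathcal{D}|\mathbf{w}_{MAP}) - \frac{\alpha}{2}\mathbf{w}_{MAP}^T\mathbf{w}_{MAP} + \frac{W}{2}\ln\alpha - \frac{1}{2}\ln|\mathbf{A}| + \text{const} \\
&= -E(\mathbf{w}_{MAP}) + \frac{W}{2}\ln\alpha - \frac{1}{2}\ln|\mathbf{A}| + \text{const}
\end{aligned}
$$
Just as required.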
0.6 Kernel Methods

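Recall that in section 6.1, $a_n$ can be written as (6.4). We can derive:
$$
\begin{aligned}
a_n &= -\frac{1}{\lambda}\{\mathbf{w}^T\phi(\mathbf{x}_n) - t_n\} \\
&= -\frac{1}{\lambda}\{w_1\phi_1(\mathbf{x}_n) + w_2\phi_2(\mathbf{x}_n) + ... + w_M\phi_M(\mathbf{x}_n) - t_n\} \\
&= -\frac{w_1}{\lambda}\phi_1(\mathbf{x}_n) - \frac{w_2}{\lambda}\phi_2(\mathbf{x}_n) - ... - \frac{w_M}{\lambda}\phi_M(\mathbf{x}_n) + \frac{t_n}{\lambda} \\
&= \Big(c_n - \frac{w_1}{\lambda}\Big)\phi_1(\mathbf{x}_n) + \Big(c_n - \frac{w_2}{\lambda}\Big)\phi_2(\mathbf{x}_n) + ... + \Big(c_n - \frac{w_M}{\lambda}\Big)\phi_M(\mathbf{x}_n)
\end{aligned}
$$
Here we have defined:
$$
c_n = \frac{t_n/\lambda}{\phi_1(\mathbf{x}_n) + \phi_2(\mathbf{x}_n) + ... + \phi_M(\mathbf{x}_n)}
$$
From what we have derived above, we can see that $a_n$ is a linear combination
of the elements of $\phi(\mathbf{x}_n)$. What's more, we first substitute $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^T$ into (6.7), and
then we will obtain (6.5). Next, substituting (6.3) into (6.5), we will obtain
(6.2) just as required.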

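If we set $\mathbf{w}^{(0)} = \mathbf{0}$ in (4.55), we can obtain:
$$
\mathbf{w}^{(\tau+1)} = \eta\sum_{n=1}^N c_n t_n \phi_n
$$
where $N$ is the total number of samples and $c_n$ is the number of times that $t_n\phi_n$ has
been added from step 0 to step $\tau + 1$. Therefore, setting $\alpha_n = \eta c_n$, it is obvious that we have:
$$
\mathbf{w} = \sum_{n=1}^N \alpha_n t_n \phi_n
$$
We further substitute the expression above into (4.55), which gives:
$$
\sum_{n=1}^N \alpha_n^{(\tau+1)} t_n \phi_n = \sum_{n=1}^N \alpha_n^{(\tau)} t_n \phi_n + \eta\, t_n \phi_n
$$
In other words, the update process is to add the learning rate $\eta$ to the coefficient
$\alpha_n$ corresponding to the misclassified pattern $\mathbf{x}_n$, i.e.,
$$
\alpha_n^{(\tau+1)} = \alpha_n^{(\tau)} + \eta
$$
Now we similarly substitute it into (4.52):
$$
y(\mathbf{x}) = f(\mathbf{w}^T\phi(\mathbf{x})) = f\Big(\sum_{n=1}^N \alpha_n t_n \phi_n^T\phi(\mathbf{x})\Big) = f\Big(\sum_{n=1}^N \alpha_n t_n k(\mathbf{x}_n, \mathbf{x})\Big)
$$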
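The dual update derived above can be sketched in code. This is a minimal illustration, assuming NumPy; the function names, the toy dataset, and the choice of a linear kernel without bias are all assumptions made for the example, not part of the original solution.

```python
import numpy as np

def linear_kernel(a, b):
    """k(x, x') = x^T x' (any valid kernel could be substituted)."""
    return float(a @ b)

def train_kernel_perceptron(X, t, kernel=linear_kernel, eta=1.0, max_epochs=100):
    """Dual perceptron: alpha_n grows by eta each time x_n is misclassified."""
    N = len(X)
    alpha = np.zeros(N)
    K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])
    for _ in range(max_epochs):
        mistakes = 0
        for n in range(N):
            # w^T phi(x_n) = sum_m alpha_m t_m k(x_m, x_n)
            activation = np.sum(alpha * t * K[:, n])
            if t[n] * activation <= 0:
                alpha[n] += eta
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(x, X, t, alpha, kernel=linear_kernel):
    """y(x) = f(sum_n alpha_n t_n k(x_n, x)) with f the sign function."""
    s = sum(alpha[n] * t[n] * kernel(X[n], x) for n in range(len(X)))
    return 1 if s >= 0 else -1

# Toy linearly separable data (illustrative only).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
alpha = train_kernel_perceptron(X, t)
predictions = [predict(x, X, t, alpha) for x in X]
```

Only the Gram matrix and the coefficients $\alpha_n$ appear in training and prediction, so the feature vector $\phi(\mathbf{x})$ never needs to be formed explicitly.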
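We begin by expanding the Euclidean metric.
$$
\begin{aligned}
\|\mathbf{x} - \mathbf{x}_n\|^2 &= (\mathbf{x} - \mathbf{x}_n)^T(\mathbf{x} - \mathbf{x}_n) \\
&= (\mathbf{x}^T - \mathbf{x}_n^T)(\mathbf{x} - \mathbf{x}_n) \\
&= \mathbf{x}^T\mathbf{x} - 2\mathbf{x}_n^T\mathbf{x} + \mathbf{x}_n^T\mathbf{x}_n
\end{aligned}
$$
Similar to (6.24)-(6.26), we use a nonlinear kernel $k(\mathbf{x}_n, \mathbf{x})$ to replace $\mathbf{x}_n^T\mathbf{x}$,
which gives a general nonlinear nearest-neighbor classifier with cost function
defined as:
$$
k(\mathbf{x}, \mathbf{x}) + k(\mathbf{x}_n, \mathbf{x}_n) - 2k(\mathbf{x}_n, \mathbf{x})
$$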
To construct such a matrix, let us suppose the two eigenvalues are 1 and
2, and the matrix has the form:
$$
\begin{bmatrix} a & b \\ c & d \end{bmatrix}
$$
Therefore, based on the definition of an eigenvalue, we have two equations:
$$
\begin{cases}
(a - 2)(d - 2) = bc & (1) \\
(a - 1)(d - 1) = bc & (2)
\end{cases}
$$
Subtracting (1) from (2) yields:
$$
a + d = 3
$$
Therefore, we set $a = 4$ and $d = -1$. Then we substitute them into (1), and
thus we see:
$$
bc = -6
$$
Finally, we choose $b = 3$ and $c = -2$. The constructed matrix is:
$$
\begin{bmatrix} 4 & 3 \\ -2 & -1 \end{bmatrix}
$$
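The construction can be verified directly. The short sketch below assumes NumPy and simply computes the eigenvalues of the constructed matrix.

```python
import numpy as np

# The matrix constructed above: positive eigenvalues, some negative entries.
A = np.array([[4.0, 3.0],
              [-2.0, -1.0]])

# eigvals may return complex dtype for a non-symmetric matrix; here the
# eigenvalues are real, so we keep the real part and sort them.
eigenvalues = np.sort(np.linalg.eigvals(A).real)
has_negative_entry = bool((A < 0).any())
```

The trace (3) and determinant (2) match the sum and product of the eigenvalues 1 and 2.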
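Since $k_1(\mathbf{x}, \mathbf{x}')$ is a valid kernel, it can be written as:
$$
k_1(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')
$$
We can obtain:
$$
k(\mathbf{x}, \mathbf{x}') = c\,k_1(\mathbf{x}, \mathbf{x}') = \big[\sqrt{c}\,\phi(\mathbf{x})\big]^T\big[\sqrt{c}\,\phi(\mathbf{x}')\big]
$$
Therefore, (6.13) is a valid kernel. It is similar for (6.14):
$$
k(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})\,k_1(\mathbf{x}, \mathbf{x}')\,f(\mathbf{x}') = \big[f(\mathbf{x})\phi(\mathbf{x})\big]^T\big[f(\mathbf{x}')\phi(\mathbf{x}')\big]
$$
Just as required.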
We suppose $q(x)$ can be written as:
$$
q(x) = a_n x^n + a_{n-1}x^{n-1} + ... + a_1 x + a_0
$$
We now obtain:
$$
k(\mathbf{x}, \mathbf{x}') = a_n k_1(\mathbf{x}, \mathbf{x}')^n + a_{n-1}k_1(\mathbf{x}, \mathbf{x}')^{n-1} + ... + a_1 k_1(\mathbf{x}, \mathbf{x}') + a_0
$$
By repeatedly using (6.13), (6.17) and (6.18), we can easily verify that $k(\mathbf{x}, \mathbf{x}')$
is a valid kernel. For (6.16), we can use the Taylor expansion of the exponential, and since the
coefficients of this expansion are all positive, we can similarly prove its
validity.
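To prove (6.17), we will use the property stated below (6.12). Since we
know $k_1(\mathbf{x}, \mathbf{x}')$ and $k_2(\mathbf{x}, \mathbf{x}')$ are valid kernels, their Gram matrices $\mathbf{K}_1$ and $\mathbf{K}_2$
are both positive semidefinite. Given the relation (6.17), the Gram matrix of $k$ is
$\mathbf{K} = \mathbf{K}_1 + \mathbf{K}_2$, and for any vector $\mathbf{a}$ we have $\mathbf{a}^T\mathbf{K}\mathbf{a} = \mathbf{a}^T\mathbf{K}_1\mathbf{a} + \mathbf{a}^T\mathbf{K}_2\mathbf{a} \geq 0$.
Hence $\mathbf{K}$ is also positive semidefinite and thus $k(\mathbf{x}, \mathbf{x}')$ is also a
valid kernel.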
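To prove (6.18), we assume the map function for kernel $k_1(\mathbf{x}, \mathbf{x}')$ is $\phi^{(1)}(\mathbf{x})$,
and similarly $\phi^{(2)}(\mathbf{x})$ for $k_2(\mathbf{x}, \mathbf{x}')$. Moreover, we further assume the dimension
of $\phi^{(1)}(\mathbf{x})$ is $M$, and that of $\phi^{(2)}(\mathbf{x})$ is $N$. We expand $k(\mathbf{x}, \mathbf{x}')$ based on (6.18):
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{x}') &= k_1(\mathbf{x}, \mathbf{x}')\,k_2(\mathbf{x}, \mathbf{x}') \\
&= \phi^{(1)}(\mathbf{x})^T\phi^{(1)}(\mathbf{x}')\,\phi^{(2)}(\mathbf{x})^T\phi^{(2)}(\mathbf{x}') \\
&= \sum_{i=1}^M \phi_i^{(1)}(\mathbf{x})\phi_i^{(1)}(\mathbf{x}')\sum_{j=1}^N \phi_j^{(2)}(\mathbf{x})\phi_j^{(2)}(\mathbf{x}') \\
&= \sum_{i=1}^M\sum_{j=1}^N \big[\phi_i^{(1)}(\mathbf{x})\phi_j^{(2)}(\mathbf{x})\big]\big[\phi_i^{(1)}(\mathbf{x}')\phi_j^{(2)}(\mathbf{x}')\big] \\
&= \sum_{k=1}^{MN} \phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')
\end{aligned}
$$
where $\phi_i^{(1)}(\mathbf{x})$ is the $i$th element of $\phi^{(1)}(\mathbf{x})$, and $\phi_j^{(2)}(\mathbf{x})$ is the $j$th element
of $\phi^{(2)}(\mathbf{x})$. To be more specific, we have proved that $k(\mathbf{x}, \mathbf{x}')$ can be written as
$\phi(\mathbf{x})^T\phi(\mathbf{x}')$. Here $\phi(\mathbf{x})$ is an $MN \times 1$ column vector, and the $k$th ($k = 1, 2, ..., MN$)
element is given by $\phi_i^{(1)}(\mathbf{x}) \times \phi_j^{(2)}(\mathbf{x})$. What's more, we can also express $i, j$ in
terms of $k$:
$$
i = (k - 1) \oslash N + 1 \quad\text{and}\quad j = (k - 1) \odot N + 1
$$
where $\oslash$ and $\odot$ mean integer division and remainder, respectively.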
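The construction of the product feature map can be illustrated with a small sketch. It assumes NumPy; the two feature maps `phi1` and `phi2` are hypothetical examples chosen only so that $M = 2$ and $N = 3$.

```python
import numpy as np

def phi1(x):
    """Hypothetical feature map of k1, with M = 2 elements."""
    return np.array([x[0], x[1]])

def phi2(x):
    """Hypothetical feature map of k2, with N = 3 elements."""
    return np.array([1.0, x[0] * x[1], x[0] ** 2])

def phi_product(x):
    """MN-dimensional feature map of k1 * k2.

    Row-major flattening of the outer product puts phi1_i * phi2_j at
    0-based position i * N + j, matching i = (k-1) div N, j = (k-1) mod N.
    """
    return np.outer(phi1(x), phi2(x)).ravel()

x = np.array([0.5, -1.0])
xp = np.array([2.0, 0.3])

k_product = (phi1(x) @ phi1(xp)) * (phi2(x) @ phi2(xp))   # k1(x, x') * k2(x, x')
k_from_phi = phi_product(x) @ phi_product(xp)             # phi(x)^T phi(x')
```

The two numbers coincide, which is the content of (6.18) for this particular pair of feature maps.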
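For (6.19) we suppose $k_3(\mathbf{x}, \mathbf{x}') = \mathbf{g}(\mathbf{x})^T\mathbf{g}(\mathbf{x}')$, and thus we have:
$$
k(\mathbf{x}, \mathbf{x}') = k_3(\phi(\mathbf{x}), \phi(\mathbf{x}')) = \mathbf{g}(\phi(\mathbf{x}))^T\mathbf{g}(\phi(\mathbf{x}')) = \mathbf{f}(\mathbf{x})^T\mathbf{f}(\mathbf{x}')
$$
where we have denoted $\mathbf{g}(\phi(\mathbf{x})) = \mathbf{f}(\mathbf{x})$, and now it is obvious that (6.19)
holds. To prove (6.20), we suppose $\mathbf{x}$ is an $N \times 1$ column vector and $\mathbf{A}$ is an $N \times N$
symmetric positive semidefinite matrix. We know that $\mathbf{A}$ can be decomposed
as $\mathbf{Q}\mathbf{B}\mathbf{Q}^T$. Here $\mathbf{Q}$ is an $N \times N$ orthogonal matrix, and $\mathbf{B}$ is an $N \times N$ diagonal
matrix whose elements are no less than 0. Now we can derive:
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{x}') &= \mathbf{x}^T\mathbf{A}\mathbf{x}' = \mathbf{x}^T\mathbf{Q}\mathbf{B}\mathbf{Q}^T\mathbf{x}' = (\mathbf{Q}^T\mathbf{x})^T\mathbf{B}(\mathbf{Q}^T\mathbf{x}') = \mathbf{y}^T\mathbf{B}\mathbf{y}' \\
&= \sum_{i=1}^N B_{ii}\,y_i y_i' = \sum_{i=1}^N \big(\sqrt{B_{ii}}\,y_i\big)\big(\sqrt{B_{ii}}\,y_i'\big) = \phi(\mathbf{x})^T\phi(\mathbf{x}')
\end{aligned}
$$
To be more specific, we have proved that $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$, and here
$\phi(\mathbf{x})$ is an $N \times 1$ column vector, whose $i$th ($i = 1, 2, ..., N$) element is given by
$\sqrt{B_{ii}}\,y_i$, i.e., $\sqrt{B_{ii}}\,(\mathbf{Q}^T\mathbf{x})_i$.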
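To prove (6.21), let's first expand the expression:
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{x}') &= k_a(\mathbf{x}_a, \mathbf{x}_a') + k_b(\mathbf{x}_b, \mathbf{x}_b') \\
&= \sum_{i=1}^M \phi_i^{(a)}(\mathbf{x}_a)\phi_i^{(a)}(\mathbf{x}_a') + \sum_{j=1}^N \phi_j^{(b)}(\mathbf{x}_b)\phi_j^{(b)}(\mathbf{x}_b') \\
&= \sum_{k=1}^{M+N} \phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')
\end{aligned}
$$
where we have assumed that the feature map $\phi^{(a)}$ of $k_a$ has dimension $M$ and the feature map
$\phi^{(b)}$ of $k_b$ has dimension $N$. The mapping function $\phi(\mathbf{x})$ is an $(M + N) \times 1$ column vector, whose $k$th
($k = 1, 2, ..., M + N$) element $\phi_k(\mathbf{x})$ is:
$$
\phi_k(\mathbf{x}) =
\begin{cases}
\phi_k^{(a)}(\mathbf{x}_a) & 1 \leq k \leq M \\
\phi_{k-M}^{(b)}(\mathbf{x}_b) & M + 1 \leq k \leq M + N
\end{cases}
$$
(6.22) is quite similar to (6.18). We follow the same procedure:
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{x}') &= k_a(\mathbf{x}_a, \mathbf{x}_a')\,k_b(\mathbf{x}_b, \mathbf{x}_b') \\
&= \sum_{i=1}^M \phi_i^{(a)}(\mathbf{x}_a)\phi_i^{(a)}(\mathbf{x}_a')\sum_{j=1}^N \phi_j^{(b)}(\mathbf{x}_b)\phi_j^{(b)}(\mathbf{x}_b') \\
&= \sum_{i=1}^M\sum_{j=1}^N \big[\phi_i^{(a)}(\mathbf{x}_a)\phi_j^{(b)}(\mathbf{x}_b)\big]\big[\phi_i^{(a)}(\mathbf{x}_a')\phi_j^{(b)}(\mathbf{x}_b')\big] \\
&= \sum_{k=1}^{MN} \phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')
\end{aligned}
$$
By analogy to (6.18), the mapping function $\phi(\mathbf{x})$ is an $MN \times 1$ column vector,
whose $k$th ($k = 1, 2, ..., MN$) element $\phi_k(\mathbf{x})$ is:
$$
\phi_k(\mathbf{x}) = \phi_i^{(a)}(\mathbf{x}_a) \times \phi_j^{(b)}(\mathbf{x}_b)
$$
To be more specific, $\mathbf{x}_a$ and $\mathbf{x}_b$ are the sub-vectors of $\mathbf{x}$ on which $k_a$ and $k_b$ act.
What's more, we can also express $i, j$ in terms of $k$:
$$
i = (k - 1) \oslash N + 1 \quad\text{and}\quad j = (k - 1) \odot N + 1
$$
where $\oslash$ and $\odot$ mean integer division and remainder, respectively.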
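According to (6.9), we have:
$$
y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^T(\mathbf{K} + \lambda\mathbf{I}_N)^{-1}\mathbf{t} = \mathbf{k}(\mathbf{x})^T\mathbf{a} = \sum_{n=1}^N f(\mathbf{x}_n)\cdot f(\mathbf{x})\cdot a_n = \Big[\sum_{n=1}^N f(\mathbf{x}_n)\cdot a_n\Big]f(\mathbf{x})
$$
We see that if we choose $k(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})f(\mathbf{x}')$ we will always find a solution
$y(\mathbf{x})$ proportional to $f(\mathbf{x})$.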
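We follow the hint.
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{x}') &= \exp(-\mathbf{x}^T\mathbf{x}/2\sigma^2)\cdot\exp(\mathbf{x}^T\mathbf{x}'/\sigma^2)\cdot\exp(-(\mathbf{x}')^T\mathbf{x}'/2\sigma^2) \\
&= \exp(-\mathbf{x}^T\mathbf{x}/2\sigma^2)\cdot\Big(1 + \frac{\mathbf{x}^T\mathbf{x}'}{\sigma^2} + \frac{(\mathbf{x}^T\mathbf{x}'/\sigma^2)^2}{2!} + \cdots\Big)\cdot\exp(-(\mathbf{x}')^T\mathbf{x}'/2\sigma^2) \\
&= \phi(\mathbf{x})^T\phi(\mathbf{x}')
\end{aligned}
$$
where $\phi(\mathbf{x})$ is a column vector with infinite dimension. To be more specific,
(6.12) gives a simple example of how to decompose $(\mathbf{x}^T\mathbf{x}')^2$. In our case,
we can also decompose $(\mathbf{x}^T\mathbf{x}')^k$, $k = 1, 2, ..., \infty$ in a similar way. However,
since $k \to \infty$, the decomposition will contain monomials of unbounded degree.
Thus, there will be infinitely many terms in the decomposition and the feature
mapping function $\phi(\mathbf{x})$ will have infinite dimension.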
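For scalar inputs the expansion can be made concrete by truncating the series. The sketch below assumes NumPy and a one-dimensional input; the explicit feature $\phi_k(x) = e^{-x^2/2\sigma^2}(x/\sigma)^k/\sqrt{k!}$ follows from the Taylor series above, while the truncation degree and test points are arbitrary illustrative choices.

```python
import math
import numpy as np

def truncated_feature_map(x, sigma=1.0, degree=30):
    """First degree+1 elements of the infinite feature vector for scalar x.

    phi_k(x) = exp(-x^2 / (2 sigma^2)) * (x / sigma)^k / sqrt(k!)
    """
    ks = np.arange(degree + 1)
    factorials = np.array([math.factorial(int(k)) for k in ks], dtype=float)
    return np.exp(-x**2 / (2 * sigma**2)) * (x / sigma) ** ks / np.sqrt(factorials)

def gaussian_kernel(x, xp, sigma=1.0):
    """Exact Gaussian kernel exp(-(x - x')^2 / (2 sigma^2))."""
    return math.exp(-(x - xp) ** 2 / (2 * sigma**2))

x, xp = 0.8, -0.4
approx = float(truncated_feature_map(x) @ truncated_feature_map(xp))
exact = gaussian_kernel(x, xp)
```

The inner product of the truncated feature vectors converges to the exact kernel value as the degree grows, which is why no finite-dimensional feature map reproduces the kernel exactly.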
First, let's explain the problem a little bit. According to (6.27), what we
need to prove here is:
$$
k(A_1, A_2) = 2^{|A_1 \cap A_2|} = \phi(A_1)^T\phi(A_2)
$$
The biggest difference from the previous problem is that $\phi(A)$ is a $2^{|D|} \times 1$
column vector and, instead of being indexed by $1, 2, ..., 2^{|D|}$, it is indexed by $\{U \mid U \subseteq D\}$
(note that $\{U \mid U \subseteq D\}$ is the set of all subsets of $D$ and thus has
$2^{|D|}$ elements in total). Therefore, according to (6.95), we can obtain:
$$
\phi(A_1)^T\phi(A_2) = \sum_{U \subseteq D}\phi_U(A_1)\phi_U(A_2)
$$
The summation iterates through all possible subsets of $D$. The current term
equals 1 if and only if the subset $U$ is a subset of both $A_1$
and $A_2$ simultaneously, i.e., $U \subseteq A_1 \cap A_2$. Therefore, the sum
counts the subsets of the intersection $A_1 \cap A_2$.
Since a set with $|A_1 \cap A_2|$ elements has $2^{|A_1 \cap A_2|}$ subsets,
what we have deduced above can be written as:
$$
\phi(A_1)^T\phi(A_2) = 2^{|A_1 \cap A_2|}
$$
Just as required.
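The counting argument can be checked by brute force on a small set. The sketch below uses only the Python standard library; the set `D` and the subsets `A1`, `A2` are illustrative choices.

```python
from itertools import combinations

def all_subsets(D):
    """All subsets U of D, in a fixed order, as frozensets."""
    items = sorted(D)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def phi(A, D):
    """Feature vector indexed by subsets U of D: phi_U(A) = 1 if U is a subset of A, else 0."""
    A = frozenset(A)
    return [1 if U <= A else 0 for U in all_subsets(D)]

D = {1, 2, 3, 4}
A1, A2 = {1, 2, 3}, {2, 3, 4}

inner_product = sum(u * v for u, v in zip(phi(A1, D), phi(A2, D)))
kernel_value = 2 ** len(A1 & A2)
```

Here $A_1 \cap A_2 = \{2, 3\}$, so both quantities equal the number of subsets of a two-element set.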
Problem 6.13 Solution (Wait for update)

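Since the covariance matrix $\mathbf{S}$ is fixed, according to (6.32) we can obtain:
$$
\mathbf{g}(\boldsymbol{\mu}, \mathbf{x}) = \nabla_{\boldsymbol{\mu}}\ln p(\mathbf{x}|\boldsymbol{\mu}) = \frac{\partial}{\partial\boldsymbol{\mu}}\Big(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{x} - \boldsymbol{\mu})\Big) = \mathbf{S}^{-1}(\mathbf{x} - \boldsymbol{\mu})
$$
Therefore, according to (6.34), we can obtain:
$$
\mathbf{F} = \mathbb{E}_{\mathbf{x}}\big[\mathbf{g}(\boldsymbol{\mu}, \mathbf{x})\,\mathbf{g}(\boldsymbol{\mu}, \mathbf{x})^T\big] = \mathbf{S}^{-1}\,\mathbb{E}_{\mathbf{x}}\big[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\big]\,\mathbf{S}^{-1}
$$
Since $\mathbf{x} \sim \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{S})$, we have:
$$
\mathbb{E}_{\mathbf{x}}\big[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\big] = \mathbf{S}
$$
So we obtain $\mathbf{F} = \mathbf{S}^{-1}$ and then according to (6.33), we have:
$$
k(\mathbf{x}, \mathbf{x}') = \mathbf{g}(\boldsymbol{\mu}, \mathbf{x})^T\mathbf{F}^{-1}\mathbf{g}(\boldsymbol{\mu}, \mathbf{x}') = (\mathbf{x} - \boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{x}' - \boldsymbol{\mu})
$$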
We rewrite the problem. What we are required to prove is that the Gram
matrix $\mathbf{K}$:
$$
\mathbf{K} = \begin{bmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{bmatrix},
$$
where $k_{ij}$ ($i, j = 1, 2$) is short for $k(x_i, x_j)$, should be positive semidefinite. A
positive semidefinite matrix has a non-negative determinant, i.e.,
$$
k_{12}k_{21} \leq k_{11}k_{22}.
$$
Using the symmetry of the kernel, i.e., $k_{12} = k_{21}$, we obtain what
has been required.
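Based on the total derivative of function $f$, the first-order change in $f$ caused by a perturbation $\Delta\mathbf{w}$ is:
$$
f\big((\mathbf{w} + \Delta\mathbf{w})^T\phi_1, ..., (\mathbf{w} + \Delta\mathbf{w})^T\phi_N\big) - f\big(\mathbf{w}^T\phi_1, ..., \mathbf{w}^T\phi_N\big) \approx \sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\phi_n)}\cdot\Delta\mathbf{w}^T\phi_n
$$
Which can be further written as:
$$
f\big((\mathbf{w} + \Delta\mathbf{w})^T\phi_1, ..., (\mathbf{w} + \Delta\mathbf{w})^T\phi_N\big) - f\big(\mathbf{w}^T\phi_1, ..., \mathbf{w}^T\phi_N\big) \approx \Big[\sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\phi_n)}\cdot\phi_n^T\Big]\Delta\mathbf{w}
$$
Note that here $\phi_n$ is short for $\phi(\mathbf{x}_n)$. Based on the equation above, we can
obtain:
$$
\nabla_{\mathbf{w}}f = \sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\phi_n)}\cdot\phi_n^T
$$
Now we focus on the derivative of function $g$ with respect to $\mathbf{w}$:
$$
\nabla_{\mathbf{w}}g = \frac{\partial g}{\partial(\mathbf{w}^T\mathbf{w})}\cdot 2\mathbf{w}^T
$$
In order to find the optimal $\mathbf{w}$, we set the derivative of $J$ with respect to
$\mathbf{w}$ equal to 0, yielding:
$$
\nabla_{\mathbf{w}}J = \nabla_{\mathbf{w}}f + \nabla_{\mathbf{w}}g = \sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\phi_n)}\cdot\phi_n^T + \frac{\partial g}{\partial(\mathbf{w}^T\mathbf{w})}\cdot 2\mathbf{w}^T = 0
$$
Rearranging the equation above, we can obtain:
$$
\mathbf{w} = -\frac{1}{2a}\sum_{n=1}^N \frac{\partial f}{\partial(\mathbf{w}^T\phi_n)}\cdot\phi_n
$$
Where we have defined $a = \partial g / \partial(\mathbf{w}^T\mathbf{w})$, and since $g$ is a monotonically
increasing function, we have $a > 0$. Hence the optimal $\mathbf{w}$ is a linear combination of the $\phi_n$.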
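We consider a variation in the function $y(\mathbf{x})$ of the form:
$$
y(\mathbf{x}) \to y(\mathbf{x}) + \epsilon\eta(\mathbf{x})
$$
Substituting it into (6.39) yields:
$$
\begin{aligned}
E[y + \epsilon\eta] &= \frac{1}{2}\sum_{n=1}^N \int \{y + \epsilon\eta - t_n\}^2\,\nu(\boldsymbol{\xi})\, d\boldsymbol{\xi} \\
&= \frac{1}{2}\sum_{n=1}^N \int \big\{(y - t_n)^2 + 2\cdot(\epsilon\eta)\cdot(y - t_n) + (\epsilon\eta)^2\big\}\,\nu(\boldsymbol{\xi})\, d\boldsymbol{\xi} \\
&= E[y] + \epsilon\sum_{n=1}^N \int \{y - t_n\}\,\eta\,\nu\, d\boldsymbol{\xi} + O(\epsilon^2)
\end{aligned}
$$
Note that here $y$ is short for $y(\mathbf{x}_n + \boldsymbol{\xi})$, $\eta$ is short for $\eta(\mathbf{x}_n + \boldsymbol{\xi})$ and $\nu$ is short
for $\nu(\boldsymbol{\xi})$. Several clarifications must be made here. What we have
done is vary the function $y$ by a small amount (i.e., $\epsilon\eta$) and then expand
the corresponding error with respect to the small parameter $\epsilon$. The coefficient
of $\epsilon$ is the first derivative of the error $E[y + \epsilon\eta]$ with respect to
$\epsilon$ at $\epsilon = 0$. Since $y$ is the optimal function that makes $E$
the smallest, the first derivative of the error $E[y + \epsilon\eta]$ should equal zero at
$\epsilon = 0$, which gives:
$$
\sum_{n=1}^N \int \{y(\mathbf{x}_n + \boldsymbol{\xi}) - t_n\}\,\eta(\mathbf{x}_n + \boldsymbol{\xi})\,\nu(\boldsymbol{\xi})\, d\boldsymbol{\xi} = 0
$$
Now we are required to find a function $y$ that satisfies the equation
above no matter what $\eta$ is. We choose:
$$
\eta(\mathbf{x}) = \delta(\mathbf{x} - \mathbf{z})
$$
This allows us to evaluate the integral:
$$
\sum_{n=1}^N \int \{y(\mathbf{x}_n + \boldsymbol{\xi}) - t_n\}\,\eta(\mathbf{x}_n + \boldsymbol{\xi})\,\nu(\boldsymbol{\xi})\, d\boldsymbol{\xi} = \sum_{n=1}^N \{y(\mathbf{z}) - t_n\}\,\nu(\mathbf{z} - \mathbf{x}_n)
$$
We set it to zero and rearrange it, which finally gives (6.40) just as required.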
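According to the main text below Eq (6.48), we know that $f(x, t)$, i.e., $f(\mathbf{z})$,
follows a zero-mean isotropic Gaussian:
$$
f(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0}, \sigma^2\mathbf{I})
$$
Then $f(x - x_m, t - t_m)$, i.e., $f(\mathbf{z} - \mathbf{z}_m)$, should also satisfy a Gaussian distribution:
$$
f(\mathbf{z} - \mathbf{z}_m) = \mathcal{N}(\mathbf{z}|\mathbf{z}_m, \sigma^2\mathbf{I}), \qquad \mathbf{z}_m = (x_m, t_m)
$$
The integral $\int f(\mathbf{z} - \mathbf{z}_m)\, dt$ corresponds to the marginal distribution with
respect to the remaining variable $x$ and, thus, we obtain:
$$
\int f(\mathbf{z} - \mathbf{z}_m)\, dt = \mathcal{N}(x|x_m, \sigma^2)
$$
We substitute all the expressions into Eq (6.48), which gives:
$$
\begin{aligned}
p(t|x) &= \frac{p(t, x)}{\int p(t, x)\, dt} = \frac{\sum_n \mathcal{N}(\mathbf{z}|\mathbf{z}_n, \sigma^2\mathbf{I})}{\sum_m \mathcal{N}(x|x_m, \sigma^2)} \\
&= \frac{\sum_n \frac{1}{2\pi\sigma^2}\exp\big(-\frac{1}{2}(\mathbf{z} - \mathbf{z}_n)^T(\sigma^2\mathbf{I})^{-1}(\mathbf{z} - \mathbf{z}_n)\big)}{\sum_m \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big(-\frac{1}{2\sigma^2}(x - x_m)^2\big)} \\
&= \frac{\sum_n \frac{1}{2\pi\sigma^2}\exp\big(-\frac{1}{2\sigma^2}(x - x_n)^2\big)\exp\big(-\frac{1}{2\sigma^2}(t - t_n)^2\big)}{\sum_m \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big(-\frac{1}{2\sigma^2}(x - x_m)^2\big)} \\
&= \sum_n \frac{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\big(-\frac{1}{2\sigma^2}(x - x_n)^2\big)}{\sum_m \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big(-\frac{1}{2\sigma^2}(x - x_m)^2\big)}\cdot\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(t - t_n)^2\Big) \\
&= \sum_n \pi_n\cdot\mathcal{N}(t|t_n, \sigma^2)
\end{aligned}
$$
where we have defined:
$$
\pi_n = \frac{\exp\big(-\frac{1}{2\sigma^2}(x - x_n)^2\big)}{\sum_m \exp\big(-\frac{1}{2\sigma^2}(x - x_m)^2\big)}
$$
We also observe that:
$$
\sum_n \pi_n = 1
$$
Therefore, the conditional distribution $p(t|x)$ is given by a Gaussian mixture.
Similarly, we attempt to find a specific form for Eq (6.46):
$$
k(x, x_n) = \frac{\int f(x - x_n, t)\, dt}{\sum_m \int f(x - x_m, t)\, dt} = \frac{\mathcal{N}(x|x_n, \sigma^2)}{\sum_m \mathcal{N}(x|x_m, \sigma^2)} = \pi_n
$$
In other words, the conditional distribution can be more precisely written as:
$$
p(t|x) = \sum_n k(x, x_n)\cdot\mathcal{N}(t|t_n, \sigma^2)
$$
Thus its mean is given by:
$$
\mathbb{E}[t|x] = \sum_n k(x, x_n)\cdot t_n
$$
Its variance is given by:
$$
\mathrm{var}[t|x] = \mathbb{E}[t^2|x] - \mathbb{E}[t|x]^2 = \sum_n k(x, x_n)\cdot(t_n^2 + \sigma^2) - \Big(\sum_n k(x, x_n)\cdot t_n\Big)^2
$$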
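The resulting mean and variance are straightforward to evaluate. The following sketch assumes NumPy and scalar inputs; the function name, the training points, and the value of `sigma` are illustrative assumptions.

```python
import numpy as np

def conditional_moments(x, x_train, t_train, sigma=0.5):
    """Weights k(x, x_n), conditional mean and variance of t given x.

    k(x, x_n) = N(x | x_n, sigma^2) / sum_m N(x | x_m, sigma^2); the Gaussian
    normalising constants cancel, so only the exponents are needed.
    """
    logits = -(x - x_train) ** 2 / (2 * sigma**2)
    w = np.exp(logits - logits.max())        # shift by the max for stability
    k = w / w.sum()
    mean = np.sum(k * t_train)
    var = np.sum(k * (t_train**2 + sigma**2)) - mean**2
    return k, mean, var

# Illustrative one-dimensional training set.
x_train = np.array([-1.0, 0.0, 0.5, 2.0])
t_train = np.array([0.2, 1.0, 1.4, -0.3])
k, mean, var = conditional_moments(0.3, x_train, t_train)
```

Because the weights sum to one, the mean is a convex combination of the targets, and the variance is never smaller than $\sigma^2$: it equals $\sigma^2$ plus the spread of the targets under the weights.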
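Similar to Prob.6.17, it is straightforward to show that:
$$
y(\mathbf{x}) = \sum_n t_n\,k(\mathbf{x}, \mathbf{x}_n)
$$
where
$$
k(\mathbf{x}, \mathbf{x}_n) = \frac{g(\mathbf{x}_n - \mathbf{x})}{\sum_m g(\mathbf{x}_m - \mathbf{x})}
$$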
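Since we know that $\mathbf{t}_{N+1} = (t_1, t_2, ..., t_N, t_{N+1})^T$ follows a Gaussian distribution,
i.e., $\mathbf{t}_{N+1} \sim \mathcal{N}(\mathbf{t}_{N+1}|\mathbf{0}, \mathbf{C}_{N+1})$ given in Eq (6.64), if we rearrange its
order by putting the last element (i.e., $t_{N+1}$) in the first position, denoted as
$\bar{\mathbf{t}}_{N+1}$, it should also satisfy a Gaussian distribution:
$$
\bar{\mathbf{t}}_{N+1} = (t_{N+1}, t_1, t_2, ..., t_N)^T \sim \mathcal{N}(\bar{\mathbf{t}}_{N+1}|\mathbf{0}, \bar{\mathbf{C}}_{N+1}), \qquad
\bar{\mathbf{C}}_{N+1} = \begin{pmatrix} c & \mathbf{k}^T \\ \mathbf{k} & \mathbf{C}_N \end{pmatrix}
$$
Where $\mathbf{k}$ and $c$ have been given in the main text below Eq (6.65). The
conditional distribution $p(t_{N+1}|\mathbf{t}_N)$ should also be a Gaussian. By analogy to
Eq (2.94)-(2.98), we can simply treat $t_{N+1}$ as $\mathbf{x}_a$, $\mathbf{t}_N$ as $\mathbf{x}_b$, $c$ as $\boldsymbol{\Sigma}_{aa}$, $\mathbf{k}$ as $\boldsymbol{\Sigma}_{ba}$,
$\mathbf{k}^T$ as $\boldsymbol{\Sigma}_{ab}$ and $\mathbf{C}_N$ as $\boldsymbol{\Sigma}_{bb}$. Substituting them into Eq (2.79) and Eq (2.80)
yields:
$$
\boldsymbol{\Lambda}_{aa} = (c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}
$$
And:
$$
\boldsymbol{\Lambda}_{ab} = -(c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}\mathbf{k}^T\mathbf{C}_N^{-1}
$$
Then substituting them into Eq (2.96) and (2.97) yields:
$$
p(t_{N+1}|\mathbf{t}_N) = \mathcal{N}(\boldsymbol{\mu}_{a|b}, \boldsymbol{\Lambda}_{aa}^{-1})
$$
For its mean $\boldsymbol{\mu}_{a|b}$, we have:
$$
\begin{aligned}
\boldsymbol{\mu}_{a|b} &= 0 - \big(c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k}\big)\cdot\big[-(c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}\mathbf{k}^T\mathbf{C}_N^{-1}\big]\cdot(\mathbf{t}_N - \mathbf{0}) \\
&= \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{t}_N = m(\mathbf{x}_{N+1})
\end{aligned}
$$
Similarly, for its variance $\boldsymbol{\Lambda}_{aa}^{-1}$ (note that since $t_{N+1}$ is a scalar, the
mean and the covariance matrix degenerate to the one-dimensional case),
we have:
$$
\boldsymbol{\Lambda}_{aa}^{-1} = c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} = \sigma^2(\mathbf{x}_{N+1})
$$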
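The partitioned-Gaussian argument can be checked numerically. The sketch below assumes NumPy and uses a randomly generated symmetric positive definite matrix in place of a real Gaussian-process covariance; all variable names are illustrative. It compares the direct formulas (6.66) and (6.67) with the values read off from the precision matrix of the reordered covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
B = rng.normal(size=(N + 1, N + 1))
C_full = B @ B.T + (N + 1) * np.eye(N + 1)  # random SPD stand-in for C_{N+1}
C_N = C_full[:N, :N]
k = C_full[:N, N]
c = C_full[N, N]
t_N = rng.normal(size=N)

# Direct formulas (6.66) and (6.67)
mean_direct = k @ np.linalg.solve(C_N, t_N)
var_direct = c - k @ np.linalg.solve(C_N, k)

# Reordered covariance with t_{N+1} first, and its precision matrix
C_bar = np.block([[np.array([[c]]), k[None, :]],
                  [k[:, None], C_N]])
Lambda = np.linalg.inv(C_bar)
Lambda_aa = Lambda[0, 0]
Lambda_ab = Lambda[0, 1:]

# Conditional mean and variance with zero prior means, cf. (2.96)-(2.97)
mean_from_precision = -(Lambda_ab @ t_N) / Lambda_aa
var_from_precision = 1.0 / Lambda_aa
```

Both routes give the same predictive mean and variance, as the derivation predicts.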
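We follow the hint, beginning by verifying the mean. We write Eq (6.62)
in matrix form:
$$
\mathbf{C}_N = \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N
$$
Where we have used Eq (6.54). Here $\boldsymbol{\Phi}$ is the design matrix defined below
Eq (6.51) and $\mathbf{I}_N$ is an identity matrix. Before we use Eq (6.66), we need to
obtain $\mathbf{k}$:
$$
\begin{aligned}
\mathbf{k} &= [k(\mathbf{x}_1, \mathbf{x}_{N+1}), k(\mathbf{x}_2, \mathbf{x}_{N+1}), ..., k(\mathbf{x}_N, \mathbf{x}_{N+1})]^T \\
&= \frac{1}{\alpha}[\phi(\mathbf{x}_1)^T\phi(\mathbf{x}_{N+1}), \phi(\mathbf{x}_2)^T\phi(\mathbf{x}_{N+1}), ..., \phi(\mathbf{x}_N)^T\phi(\mathbf{x}_{N+1})]^T \\
&= \frac{1}{\alpha}\boldsymbol{\Phi}\,\phi(\mathbf{x}_{N+1})
\end{aligned}
$$
Now we substitute all the expressions into Eq (6.66), yielding:
$$
m(\mathbf{x}_{N+1}) = \alpha^{-1}\phi(\mathbf{x}_{N+1})^T\boldsymbol{\Phi}^T\big[\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N\big]^{-1}\mathbf{t}
$$
Next using matrix identity (C.6), we obtain:
$$
\boldsymbol{\Phi}^T\big[\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N\big]^{-1} = \alpha\beta\big[\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I}_M\big]^{-1}\boldsymbol{\Phi}^T = \alpha\beta\,\mathbf{S}_N\boldsymbol{\Phi}^T
$$
Where we have used Eq (3.54). Substituting it into $m(\mathbf{x}_{N+1})$, we obtain:
$$
m(\mathbf{x}_{N+1}) = \beta\,\phi(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} = \langle\phi(\mathbf{x}_{N+1}), \beta\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}\rangle
$$
Where $\langle\cdot,\cdot\rangle$ represents the inner product. Comparing the result above
with Eq (3.58), (3.54) and (3.53), we conclude that the means are equal. It is
similar for the variance. We substitute $c$, $\mathbf{k}$ and $\mathbf{C}_N$ into Eq (6.67). Then we
simplify the expression using matrix identity (C.7). Finally, we will observe
that it is equal to Eq (3.59).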
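The equivalence of the two views can be confirmed numerically. The sketch below assumes NumPy; the design matrix, targets, and hyperparameter values are random or arbitrary illustrative choices. It computes the predictive mean and variance once from the Gaussian-process formulas with the kernel $k(\mathbf{x}, \mathbf{x}') = \alpha^{-1}\phi(\mathbf{x})^T\phi(\mathbf{x}')$ and once from the Bayesian linear regression formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 3
alpha, beta = 0.7, 4.0                 # illustrative hyperparameters
Phi = rng.normal(size=(N, M))          # design matrix, rows are phi(x_n)^T
t = rng.normal(size=N)
phi_new = rng.normal(size=M)           # phi(x_{N+1})

# Gaussian-process view: (6.62), (6.65), (6.66), (6.67)
C_N = Phi @ Phi.T / alpha + np.eye(N) / beta
k = Phi @ phi_new / alpha
c = phi_new @ phi_new / alpha + 1.0 / beta
gp_mean = k @ np.linalg.solve(C_N, t)
gp_var = c - k @ np.linalg.solve(C_N, k)

# Bayesian linear regression view: (3.53), (3.54), (3.58), (3.59)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
blr_mean = beta * phi_new @ S_N @ Phi.T @ t
blr_var = 1.0 / beta + phi_new @ S_N @ phi_new
```

The agreement of the variances is the numerical counterpart of the simplification with identity (C.7) that the text leaves to the reader.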
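Based on Eq (6.64) and (6.65), we first write down the joint distribution
for $\mathbf{t}_{N+L} = [t_1(\mathbf{x}), t_2(\mathbf{x}), ..., t_{N+L}(\mathbf{x})]^T$:
$$
p(\mathbf{t}_{N+L}) = \mathcal{N}(\mathbf{t}_{N+L}|\mathbf{0}, \mathbf{C}_{N+L})
$$
Where $\mathbf{C}_{N+L}$ is similarly given by:
$$
\mathbf{C}_{N+L} = \begin{pmatrix} \mathbf{C}_{1,N} & \mathbf{K} \\ \mathbf{K}^T & \mathbf{C}_{N+1,N+L} \end{pmatrix}
$$
The expression above has already implicitly divided the vector $\mathbf{t}_{N+L}$ into
two parts. Similar to Prob.6.20, for later simplicity we rearrange the order
of $\mathbf{t}_{N+L}$, denoted as $\bar{\mathbf{t}}_{N+L} = [t_{N+1}, ..., t_{N+L}, t_1, ..., t_N]^T$. Moreover, $\bar{\mathbf{t}}_{N+L}$ should
also follow a Gaussian distribution:
$$
p(\bar{\mathbf{t}}_{N+L}) = \mathcal{N}(\bar{\mathbf{t}}_{N+L}|\mathbf{0}, \bar{\mathbf{C}}_{N+L}), \qquad
\bar{\mathbf{C}}_{N+L} = \begin{pmatrix} \mathbf{C}_{N+1,N+L} & \mathbf{K}^T \\ \mathbf{K} & \mathbf{C}_{1,N} \end{pmatrix}
$$
Now we use Eq (2.94)-(2.98) and Eq (2.79)-(2.80) to derive the conditional
distribution, beginning by calculating $\boldsymbol{\Lambda}_{aa}$:
$$
\boldsymbol{\Lambda}_{aa} = (\mathbf{C}_{N+1,N+L} - \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{K})^{-1}
$$
and $\boldsymbol{\Lambda}_{ab}$:
$$
\boldsymbol{\Lambda}_{ab} = -(\mathbf{C}_{N+1,N+L} - \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{K})^{-1}\cdot\mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}
$$
Now we can obtain:
$$
p(t_{N+1}, ..., t_{N+L}|\mathbf{t}_N) = \mathcal{N}(\boldsymbol{\mu}_{a|b}, \boldsymbol{\Lambda}_{aa}^{-1}), \qquad
\boldsymbol{\mu}_{a|b} = \mathbf{0} + \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{t}_N = \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{t}_N
$$
If now we want to find the conditional distribution $p(t_j|\mathbf{t}_N)$, where $N + 1 \leq j \leq N + L$,
we only need to find the corresponding entry in the mean (i.e., the
$(j - N)$-th entry) and covariance matrix (i.e., the $(j - N)$-th diagonal entry) of
$p(t_{N+1}, ..., t_{N+L}|\mathbf{t}_N)$. In this case, it will degenerate to Eq (6.66) and (6.67)
just as required.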
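By definition, we only need to prove that for an arbitrary vector $\mathbf{x} \neq \mathbf{0}$, $\mathbf{x}^T\mathbf{W}\mathbf{x}$
is positive. Here suppose that $\mathbf{W}$ is an $M \times M$ matrix. We expand the multiplication:
$$
\mathbf{x}^T\mathbf{W}\mathbf{x} = \sum_{i=1}^M\sum_{j=1}^M W_{ij}\cdot x_i\cdot x_j = \sum_{i=1}^M W_{ii}\cdot x_i^2
$$
where we have used the fact that $\mathbf{W}$ is a diagonal matrix. Since $W_{ii} > 0$,
we obtain $\mathbf{x}^T\mathbf{W}\mathbf{x} > 0$ just as required. Suppose we have two positive definite
matrices, denoted as $\mathbf{A}_1$ and $\mathbf{A}_2$, i.e., for an arbitrary vector $\mathbf{x} \neq \mathbf{0}$, we have $\mathbf{x}^T\mathbf{A}_1\mathbf{x} > 0$
and $\mathbf{x}^T\mathbf{A}_2\mathbf{x} > 0$. Therefore, we can obtain:
$$
\mathbf{x}^T(\mathbf{A}_1 + \mathbf{A}_2)\mathbf{x} = \mathbf{x}^T\mathbf{A}_1\mathbf{x} + \mathbf{x}^T\mathbf{A}_2\mathbf{x} > 0
$$
Just as required.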
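Based on the Newton-Raphson formula, Eq (6.81) and Eq (6.82), we have:
$$
\begin{aligned}
\mathbf{a}_N^{new} &= \mathbf{a}_N - (-\mathbf{W}_N - \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N) \\
&= \mathbf{a}_N + (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N) \\
&= (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}\big[(\mathbf{W}_N + \mathbf{C}_N^{-1})\mathbf{a}_N + \mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N\big] \\
&= \mathbf{C}_N\mathbf{C}_N^{-1}(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N) \\
&= \mathbf{C}_N\big[(\mathbf{W}_N + \mathbf{C}_N^{-1})\mathbf{C}_N\big]^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N) \\
&= \mathbf{C}_N(\mathbf{I} + \mathbf{W}_N\mathbf{C}_N)^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N)
\end{aligned}
$$
Just as required.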
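Using Eq (6.77), (6.78) and (6.86), we can obtain:
$$
\begin{aligned}
p(a_{N+1}|\mathbf{t}_N) &= \int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N|\mathbf{t}_N)\, d\mathbf{a}_N \\
&= \int \mathcal{N}(a_{N+1}|\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N, c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})\cdot\mathcal{N}(\mathbf{a}_N|\mathbf{a}_N^{\star}, \mathbf{H}^{-1})\, d\mathbf{a}_N
\end{aligned}
$$
By analogy to Eq (2.115), i.e.,
$$
p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})\, d\mathbf{x}
$$
We can obtain:
$$
p(a_{N+1}|\mathbf{t}_N) = \mathcal{N}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T) \quad (*)
$$
where
$$
\mathbf{A} = \mathbf{k}^T\mathbf{C}_N^{-1}, \quad \mathbf{b} = \mathbf{0}, \quad \mathbf{L}^{-1} = c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k}
$$
And
$$
\boldsymbol{\mu} = \mathbf{a}_N^{\star}, \quad \boldsymbol{\Lambda} = \mathbf{H}
$$
Therefore, the mean is given by:
$$
\mathbf{A}\boldsymbol{\mu} + \mathbf{b} = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N^{\star} = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{C}_N(\mathbf{t}_N - \boldsymbol{\sigma}_N) = \mathbf{k}^T(\mathbf{t}_N - \boldsymbol{\sigma}_N)
$$
Where we have used Eq (6.84). The covariance matrix is given by:
$$
\begin{aligned}
\mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T &= c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} + \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{H}^{-1}(\mathbf{k}^T\mathbf{C}_N^{-1})^T \\
&= c - \mathbf{k}^T\big(\mathbf{C}_N^{-1} - \mathbf{C}_N^{-1}\mathbf{H}^{-1}\mathbf{C}_N^{-1}\big)\mathbf{k} \\
&= c - \mathbf{k}^T\big(\mathbf{C}_N^{-1} - \mathbf{C}_N^{-1}(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}\mathbf{C}_N^{-1}\big)\mathbf{k} \\
&= c - \mathbf{k}^T\big(\mathbf{C}_N^{-1} - (\mathbf{C}_N\mathbf{W}_N\mathbf{C}_N + \mathbf{C}_N)^{-1}\big)\mathbf{k}
\end{aligned}
$$
Where we have used Eq (6.85) and the fact that $\mathbf{C}_N$ is symmetric. Then we
use matrix identity (C.7) to further reduce the expression, which will finally
give Eq (6.88).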
Problem 6.27 Solution (Wait for update)
This problem is really complicated. What's more, Eq (6.91) does not seem to be right.
0.7 Sparse Kernel Machines
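By analogy to Eq (2.249), we can obtain:
$$
p(\mathbf{x}|t) =
\begin{cases}
\dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}\dfrac{1}{Z_k}\cdot k(\mathbf{x}, \mathbf{x}_n) & t = +1 \\[2ex]
\dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}\dfrac{1}{Z_k}\cdot k(\mathbf{x}, \mathbf{x}_n) & t = -1
\end{cases}
$$
where $N_{+1}$ represents the number of samples with label $t = +1$, and similarly
for $N_{-1}$; each sum runs over the samples of the corresponding class. $Z_k$ is a
normalization constant representing the volume of the hypercube. Since we have equal priors for the classes, i.e.,
$$
p(t) =
\begin{cases}
0.5 & t = +1 \\
0.5 & t = -1
\end{cases}
$$
Based on Bayes' Theorem, we have $p(t|\mathbf{x}) \propto p(\mathbf{x}|t)\cdot p(t)$, yielding:
$$
p(t|\mathbf{x}) =
\begin{cases}
\dfrac{1}{Z}\cdot\dfrac{1}{N_{+1}}\cdot\displaystyle\sum_{n=1}^{N_{+1}} k(\mathbf{x}, \mathbf{x}_n) & t = +1 \\[2ex]
\dfrac{1}{Z}\cdot\dfrac{1}{N_{-1}}\cdot\displaystyle\sum_{n=1}^{N_{-1}} k(\mathbf{x}, \mathbf{x}_n) & t = -1
\end{cases}
$$
Where $1/Z$ is a normalization constant that guarantees the posterior sums to 1
over $t$. To classify a new sample $\mathbf{x}^{\star}$, we try to find the value $t^{\star}$
that maximizes $p(t|\mathbf{x})$. Therefore, we can obtain:
$$
t^{\star} =
\begin{cases}
+1 & \text{if } \dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}} k(\mathbf{x}, \mathbf{x}_n) \geq \dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}} k(\mathbf{x}, \mathbf{x}_n) \\[2ex]
-1 & \text{if } \dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}} k(\mathbf{x}, \mathbf{x}_n) \leq \dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}} k(\mathbf{x}, \mathbf{x}_n)
\end{cases}
\quad (*)
$$
If we now choose the kernel function as $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$, we have:
$$
\frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}} k(\mathbf{x}, \mathbf{x}_n) = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}} \mathbf{x}^T\mathbf{x}_n = \mathbf{x}^T\tilde{\mathbf{x}}_{+1}
$$
Where we have denoted:
$$
\tilde{\mathbf{x}}_{+1} = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}} \mathbf{x}_n
$$
and similarly for $\tilde{\mathbf{x}}_{-1}$. Therefore, the classification criterion $(*)$ can be
written as:
$$
t^{\star} =
\begin{cases}
+1 & \text{if } \mathbf{x}^T\tilde{\mathbf{x}}_{+1} \geq \mathbf{x}^T\tilde{\mathbf{x}}_{-1} \\
-1 & \text{if } \mathbf{x}^T\tilde{\mathbf{x}}_{+1} \leq \mathbf{x}^T\tilde{\mathbf{x}}_{-1}
\end{cases}
$$
When we choose the kernel function as $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$, we can similarly
obtain the classification criterion:
$$
t^{\star} =
\begin{cases}
+1 & \text{if } \phi(\mathbf{x})^T\tilde{\phi}(\mathbf{x}_{+1}) \geq \phi(\mathbf{x})^T\tilde{\phi}(\mathbf{x}_{-1}) \\
-1 & \text{if } \phi(\mathbf{x})^T\tilde{\phi}(\mathbf{x}_{+1}) \leq \phi(\mathbf{x})^T\tilde{\phi}(\mathbf{x}_{-1})
\end{cases}
\qquad
\tilde{\phi}(\mathbf{x}_{+1}) = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}} \phi(\mathbf{x}_n)
$$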
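The decision rule can be sketched in a few lines. The example assumes NumPy; the function names and the toy data are illustrative, and ties are resolved in favour of the class $+1$ as an arbitrary convention.

```python
import numpy as np

def linear_kernel(a, b):
    """k(x, x') = x^T x'."""
    return float(a @ b)

def classify(x, X_pos, X_neg, kernel):
    """Compare the class-averaged kernel values, as in criterion (*)."""
    score_pos = np.mean([kernel(x, xn) for xn in X_pos])
    score_neg = np.mean([kernel(x, xn) for xn in X_neg])
    return 1 if score_pos >= score_neg else -1

# Illustrative training samples for the two classes and a new point.
X_pos = np.array([[2.0, 2.0], [3.0, 1.0]])
X_neg = np.array([[-1.0, -2.0], [-2.0, 0.0]])
x_new = np.array([1.0, 0.5])

label = classify(x_new, X_pos, X_neg, linear_kernel)

# With the linear kernel the rule reduces to comparing x^T (class mean).
label_via_means = 1 if x_new @ X_pos.mean(axis=0) >= x_new @ X_neg.mean(axis=0) else -1
```

For the linear kernel both computations compare the same two inner products, so they always return the same label.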
Suppose we have found $\mathbf{w}_0$ and $b_0$, which let all points satisfy Eq (7.5)
and simultaneously minimize Eq (7.3). The hyperplane determined by $\mathbf{w}_0$ and
$b_0$ is the optimal (maximum-margin) decision boundary. Now if the constraint in Eq (7.5)
becomes:
$$
t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) \geq \gamma
$$
with $\gamma > 0$, we can conclude that under the change of variables $\mathbf{w}_0 \to \gamma\mathbf{w}_0$ and
$b_0 \to \gamma b_0$, the constraint is still satisfied and Eq (7.3) is still minimized. In
other words, if the right side of the constraint changes from 1 to $\gamma$, the new
optimal solution is $\gamma\mathbf{w}_0$ and $\gamma b_0$. Rescaling $(\mathbf{w}, b)$ by a positive constant does
not change the set of points where $\mathbf{w}^T\phi(\mathbf{x}) + b = 0$, so the hyperplane and the
minimum distance from the points to it are still the same.
Suppose $\mathbf{x}_1$ belongs to class one and we denote its target value
$t_1 = 1$, and similarly $\mathbf{x}_2$ belongs to class two and we denote its target value
$t_2 = -1$. Since we only have two points, they must have $t_i\cdot y(\mathbf{x}_i) = 1$ as shown
in Fig. 7.1. Therefore, we have an equality constrained optimization problem:
$$
\text{minimize } \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad
\begin{cases}
\mathbf{w}^T\phi(\mathbf{x}_1) + b = 1 \\
\mathbf{w}^T\phi(\mathbf{x}_2) + b = -1
\end{cases}
$$
This is a convex optimization problem, for which a global
optimum is known to exist.
Since we know that
$$
\rho = \frac{1}{\|\mathbf{w}\|}
$$
Therefore, we have:
$$
\frac{1}{\rho^2} = \|\mathbf{w}\|^2
$$
In other words, we only need to prove that
$$
\|\mathbf{w}\|^2 = \sum_{n=1}^N a_n
$$
When we find the optimal solution, the second term on the right hand side
of Eq (7.7) vanishes. Based on Eq (7.8) and Eq (7.10), we also observe that its
dual is given by:
$$
\tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2}\|\mathbf{w}\|^2
$$
Therefore, we have:
$$
\frac{1}{2}\|\mathbf{w}\|^2 = L(\mathbf{a}) = \tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2}\|\mathbf{w}\|^2
$$
Rearranging it, we will obtain what we are required.
We have already proved this problem in the previous one.
If the target variable can only take values in $\{-1, 1\}$, and we know that
$$
p(t = 1|y) = \sigma(y)
$$
We can obtain:
$$
p(t = -1|y) = 1 - p(t = 1|y) = 1 - \sigma(y) = \sigma(-y)
$$
Therefore, combining these two situations, we can derive:
$$
p(t|y) = \sigma(yt)
$$
Consequently, we can obtain the negative log likelihood:
$$
-\ln p(\mathcal{D}) = -\ln\prod_{n=1}^N \sigma(y_n t_n) = -\sum_{n=1}^N \ln\sigma(y_n t_n) = \sum_{n=1}^N E_{LR}(y_n t_n)
$$
Here $\mathcal{D}$ represents the dataset, i.e., $\mathcal{D} = \{(\mathbf{x}_n, t_n); n = 1, 2, ..., N\}$, and $E_{LR}(yt)$
is given by Eq (7.48). With the addition of a quadratic regularization term, we
obtain exactly Eq (7.47).
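The derivatives are easy to obtain. Our main task is to derive Eq (7.61)
using Eq (7.57)-(7.60).
$$
\begin{aligned}
L &= C\sum_{n=1}^N(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N(\mu_n\xi_n + \hat{\mu}_n\hat{\xi}_n) \\
&\quad - \sum_{n=1}^N a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) \\
&= C\sum_{n=1}^N(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N(a_n + \mu_n)\xi_n - \sum_{n=1}^N(\hat{a}_n + \hat{\mu}_n)\hat{\xi}_n \\
&\quad - \sum_{n=1}^N a_n(\epsilon + y_n - t_n) - \sum_{n=1}^N \hat{a}_n(\epsilon - y_n + t_n) \\
&= C\sum_{n=1}^N(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N C\xi_n - \sum_{n=1}^N C\hat{\xi}_n \\
&\quad - \sum_{n=1}^N(a_n + \hat{a}_n)\epsilon - \sum_{n=1}^N(a_n - \hat{a}_n)(y_n - t_n) \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N(a_n + \hat{a}_n)\epsilon - \sum_{n=1}^N(a_n - \hat{a}_n)(y_n - t_n) \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N(a_n - \hat{a}_n)(\mathbf{w}^T\phi(\mathbf{x}_n) + b - t_n) - \sum_{n=1}^N(a_n + \hat{a}_n)\epsilon \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N(a_n - \hat{a}_n)(\mathbf{w}^T\phi(\mathbf{x}_n) + b) - \sum_{n=1}^N(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^N(a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N(a_n - \hat{a}_n)\mathbf{w}^T\phi(\mathbf{x}_n) - \sum_{n=1}^N(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^N(a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \|\mathbf{w}\|^2 - \sum_{n=1}^N(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^N(a_n - \hat{a}_n)t_n \\
&= -\frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^N(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^N(a_n - \hat{a}_n)t_n
\end{aligned}
$$
Here the third equality uses (7.59) and (7.60), the term in $b$ vanishes by (7.58), and
$\sum_n(a_n - \hat{a}_n)\mathbf{w}^T\phi(\mathbf{x}_n) = \|\mathbf{w}\|^2$ follows from (7.57).
Just as required.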
This obviously follows from the KKT condition, described in Eq (7.67) and
(7.68).
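The prior is given by Eq (7.80).
$$
p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M} N(w_i|0, \alpha_i^{-1}) = N(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1}), \qquad \mathbf{A} = \mathrm{diag}(\alpha_i)
$$
The likelihood is given by Eq (7.79).
$$
\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta) &= \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}, \beta^{-1}) \\
&= \prod_{n=1}^{N} N(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \beta^{-1}) \\
&= N(\mathbf{t}|\Phi\mathbf{w}, \beta^{-1}\mathbf{I})
\end{aligned}
$$
where
$$
\Phi = [\phi(\mathbf{x}_1), \phi(\mathbf{x}_2), ..., \phi(\mathbf{x}_N)]^T
$$
Our definitions of Φ and A are consistent with the main text. Therefore,
according to Eq (2.113)-Eq (2.117), we have:
$$
p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}, \beta) = N(\mathbf{w}|\mathbf{m}, \Sigma)
$$
where
$$
\Sigma = (\mathbf{A} + \beta\Phi^T\Phi)^{-1}, \qquad \mathbf{m} = \beta\Sigma\Phi^T\mathbf{t}
$$
Just as required.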
Problem 7.10&7.11 Solution
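It is quite similar to the previous problem. We begin by writing down the
prior:
$$
p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M} N(w_i|0, \alpha_i^{-1}) = N(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})
$$
Then we write down the likelihood:
$$
\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta) &= \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}, \beta^{-1}) \\
&= \prod_{n=1}^{N} N(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \beta^{-1}) \\
&= N(\mathbf{t}|\Phi\mathbf{w}, \beta^{-1}\mathbf{I})
\end{aligned}
$$
We also know that:
$$
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) = \int p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w}|\boldsymbol{\alpha})\, d\mathbf{w}
$$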
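First, as required by Prob.7.10, we will solve it by completing the square.
We begin by writing down the expression for p(t|X, α, β):
$$
\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) &= \int N(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})\, N(\mathbf{t}|\Phi\mathbf{w}, \beta^{-1}\mathbf{I})\, d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w}
\end{aligned}
$$
where
$$
E(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} + \frac{\beta}{2}\|\mathbf{t} - \Phi\mathbf{w}\|^2
$$
We expand E(w) with respect to w:
$$
\begin{aligned}
E(\mathbf{w}) &= \frac{1}{2}\left\{\mathbf{w}^T(\mathbf{A} + \beta\Phi^T\Phi)\mathbf{w} - 2\beta\mathbf{t}^T(\Phi\mathbf{w}) + \beta\mathbf{t}^T\mathbf{t}\right\} \\
&= \frac{1}{2}\left\{\mathbf{w}^T\Sigma^{-1}\mathbf{w} - 2\mathbf{m}^T\Sigma^{-1}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right\} \\
&= \frac{1}{2}\left\{(\mathbf{w} - \mathbf{m})^T\Sigma^{-1}(\mathbf{w} - \mathbf{m}) + \beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m}\right\}
\end{aligned}
$$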
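where we have used Eq (7.82) and Eq (7.83). Substituting E(w) into the
integral, we obtain:
$$
\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) &= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M}\alpha_i^{1/2} \cdot \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M}\alpha_i^{1/2} \cdot (2\pi)^{M/2} \cdot |\Sigma|^{1/2} \exp\left\{-\frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m})\right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot |\Sigma|^{1/2} \cdot \prod_{i=1}^{M}\alpha_i^{1/2} \cdot \exp\left\{-\frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m})\right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot |\Sigma|^{1/2} \cdot \prod_{i=1}^{M}\alpha_i^{1/2} \cdot \exp\{-E(\mathbf{t})\}
\end{aligned}
$$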
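We further expand E(t):
$$
\begin{aligned}
E(\mathbf{t}) &= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\Sigma^{-1}\mathbf{m}) \\
&= \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - (\beta\Sigma\Phi^T\mathbf{t})^T\Sigma^{-1}(\beta\Sigma\Phi^T\mathbf{t})\right) \\
&= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \beta^2\mathbf{t}^T\Phi\Sigma\Sigma^{-1}\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \beta^2\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T(\beta\mathbf{I} - \beta^2\Phi\Sigma\Phi^T)\mathbf{t} \\
&= \frac{1}{2}\mathbf{t}^T\left[\beta\mathbf{I} - \beta\Phi(\mathbf{A} + \beta\Phi^T\Phi)^{-1}\Phi^T\beta\right]\mathbf{t} \\
&= \frac{1}{2}\mathbf{t}^T(\beta^{-1}\mathbf{I} + \Phi\mathbf{A}^{-1}\Phi^T)^{-1}\mathbf{t} = \frac{1}{2}\mathbf{t}^T\mathbf{C}^{-1}\mathbf{t}
\end{aligned}
$$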
Note that in the last step we have used matrix identity Eq (C.7). Therefore,
since we know that the pdf is Gaussian and the exponential term is
given by E(t), we can easily write down Eq (7.85) once the normalization
constants are taken into account.
What’s more, as required by Prob.7.11, the integral can also be evaluated
directly using Eq (2.113)-Eq (2.117).
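The last step of the expansion of E(t) rests on the identity βI − β²ΦΣΦ^T = (β^{-1}I + ΦA^{-1}Φ^T)^{-1}. The sketch below checks it numerically in the smallest non-trivial setting, assuming a single basis function (M = 1) evaluated at N = 2 inputs, so that Σ and A are scalars and C is a 2 × 2 matrix. All numerical values are arbitrary illustrative choices.

```python
def inv2(mat):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = mat
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

alpha, beta = 0.7, 2.5        # hypothetical hyperparameters
phi = [1.0, -2.0]             # the single column of Phi (N = 2, M = 1)
t = [0.3, 1.1]                # hypothetical targets

# Scalar versions of Eq (7.83) and Eq (7.82) for M = 1.
sigma = 1.0 / (alpha + beta * sum(p * p for p in phi))
m = beta * sigma * sum(p * ti for p, ti in zip(phi, t))

# C = beta^{-1} I + Phi A^{-1} Phi^T and its direct inverse.
C = [[(1.0 / beta if i == j else 0.0) + phi[i] * phi[j] / alpha
      for j in range(2)] for i in range(2)]
C_inv = inv2(C)

# The same inverse written as beta I - beta^2 Phi Sigma Phi^T.
woodbury = [[(beta if i == j else 0.0) - beta ** 2 * phi[i] * sigma * phi[j]
             for j in range(2)] for i in range(2)]

# E(t) computed from m and Sigma, and from C^{-1}.
e_from_m = 0.5 * (beta * sum(ti * ti for ti in t) - m * m / sigma)
e_from_c = 0.5 * sum(t[i] * C_inv[i][j] * t[j]
                     for i in range(2) for j in range(2))
```

Both `C_inv` and `woodbury` hold the same matrix, and `e_from_m` matches `e_from_c`, so the two forms of the exponent are interchangeable.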
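According to the previous problem, we can explicitly write down the log
marginal likelihood in an alternative form:
$$
\ln p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi + \frac{1}{2}\ln|\Sigma| + \frac{1}{2}\sum_{i=1}^{M}\ln\alpha_i - E(\mathbf{t})
$$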
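We first derive:
$$
\begin{aligned}
\frac{dE(\mathbf{t})}{d\alpha_i} &= -\frac{1}{2}\frac{d}{d\alpha_i}(\mathbf{m}^T\Sigma^{-1}\mathbf{m}) \\
&= -\frac{1}{2}\frac{d}{d\alpha_i}(\beta^2\mathbf{t}^T\Phi\Sigma\Sigma^{-1}\Sigma\Phi^T\mathbf{t}) \\
&= -\frac{1}{2}\frac{d}{d\alpha_i}(\beta^2\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \\
&= -\frac{1}{2}\mathrm{Tr}\left[\frac{d}{d\Sigma^{-1}}(\beta^2\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \cdot \frac{d\Sigma^{-1}}{d\alpha_i}\right] \\
&= \frac{1}{2}\beta^2\,\mathrm{Tr}\left[\Sigma(\Phi^T\mathbf{t})(\Phi^T\mathbf{t})^T\Sigma \cdot \mathbf{I}_i\right] = \frac{1}{2}m_i^2
\end{aligned}
$$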
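In the last step, we have utilized the following equation:
$$
\frac{d}{d\mathbf{X}}\mathrm{Tr}(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}) = -\mathbf{X}^{-T}\mathbf{A}^T\mathbf{B}^T\mathbf{X}^{-T}
$$
Moreover, here I_i is a matrix with all elements equal to zero, except the
i-th diagonal element, which equals 1. Then we
utilize matrix identity Eq (C.22) to derive:
$$
\begin{aligned}
\frac{d\ln|\Sigma|}{d\alpha_i} &= -\frac{d\ln|\Sigma^{-1}|}{d\alpha_i} \\
&= -\mathrm{Tr}\left[\Sigma\frac{d}{d\alpha_i}(\mathbf{A} + \beta\Phi^T\Phi)\right] \\
&= -\Sigma_{ii}
\end{aligned}
$$
Combining these results, we obtain:
$$
\frac{d\ln p}{d\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}m_i^2 - \frac{1}{2}\Sigma_{ii}
$$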
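Setting it to zero, we obtain:
$$
\alpha_i = \frac{1 - \alpha_i\Sigma_{ii}}{m_i^2} = \frac{\gamma_i}{m_i^2}
$$
Then we calculate the derivative of ln p with respect to β, beginning with:
$$
\begin{aligned}
\frac{d\ln|\Sigma|}{d\beta} &= -\frac{d\ln|\Sigma^{-1}|}{d\beta} \\
&= -\mathrm{Tr}\left[\Sigma\frac{d}{d\beta}(\mathbf{A} + \beta\Phi^T\Phi)\right] \\
&= -\mathrm{Tr}\left[\Sigma\Phi^T\Phi\right]
\end{aligned}
$$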
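Then we continue:
$$
\begin{aligned}
\frac{dE(\mathbf{t})}{d\beta} &= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\mathbf{m}^T\Sigma^{-1}\mathbf{m}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\beta^2\mathbf{t}^T\Phi\Sigma\Sigma^{-1}\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\beta^2\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \beta\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t} - \frac{1}{2}\beta^2\frac{d}{d\beta}(\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t} - \beta^2\frac{d}{d\beta}(\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t})\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) - \beta^2\frac{d}{d\beta}(\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t})\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) - \beta^2\,\mathrm{Tr}\left[\frac{d}{d\Sigma^{-1}}(\mathbf{t}^T\Phi\Sigma\Phi^T\mathbf{t}) \cdot \frac{d\Sigma^{-1}}{d\beta}\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) + \beta^2\,\mathrm{Tr}\left[\Sigma(\Phi^T\mathbf{t})(\Phi^T\mathbf{t})^T\Sigma \cdot \Phi^T\Phi\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) + \mathrm{Tr}\left[\mathbf{m}\mathbf{m}^T \cdot \Phi^T\Phi\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\Phi\mathbf{m}) + \mathrm{Tr}\left[\Phi\mathbf{m}\mathbf{m}^T \cdot \Phi^T\right]\right\} \\
&= \frac{1}{2}\|\mathbf{t} - \Phi\mathbf{m}\|^2
\end{aligned}
$$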
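Therefore, we have obtained:
$$
\frac{d\ln p}{d\beta} = \frac{1}{2}\left(\frac{N}{\beta} - \|\mathbf{t} - \Phi\mathbf{m}\|^2 - \mathrm{Tr}[\Sigma\Phi^T\Phi]\right)
$$
Using Eq (7.83), we can obtain:
$$
\begin{aligned}
\Sigma\Phi^T\Phi &= \Sigma\Phi^T\Phi + \beta^{-1}\Sigma\mathbf{A} - \beta^{-1}\Sigma\mathbf{A} \\
&= \Sigma(\beta\Phi^T\Phi + \mathbf{A})\beta^{-1} - \beta^{-1}\Sigma\mathbf{A} \\
&= \mathbf{I}\beta^{-1} - \beta^{-1}\Sigma\mathbf{A} \\
&= (\mathbf{I} - \Sigma\mathbf{A})\beta^{-1}
\end{aligned}
$$
Setting the derivative equal to zero, we can obtain:
$$
\beta^{-1} = \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2}{N - \mathrm{Tr}(\mathbf{I} - \Sigma\mathbf{A})} = \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2}{N - \sum_i \gamma_i}
$$
Just as required.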
This problem is quite confusing. In my view, the posterior should
be denoted as p(w|t, X, {a_i, b_i}, a_β, b_β), where a_β, b_β control the Gamma
distribution of β, and a_i, b_i control the Gamma distribution of α_i. What we
should do is to maximize the marginal likelihood p(t|X, {a_i, b_i}, a_β, b_β) with
respect to {a_i, b_i}, a_β, b_β. Now we do not have a point estimate for the
hyperparameters β and α_i. We have a distribution (controlled by the hyperpriors,
i.e., {a_i, b_i}, a_β, b_β) instead.
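We begin by writing down p(t|x, w, β*). Using Eq (7.76) and Eq (7.77), we
can obtain:
$$
p(t|\mathbf{x}, \mathbf{w}, \beta^*) = N(t|\mathbf{w}^T\phi(\mathbf{x}), (\beta^*)^{-1})
$$
Then we write down p(w|X, t, α*, β*). Using Eq (7.81), (7.82) and (7.83),
we can obtain:
$$
p(\mathbf{w}|\mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) = N(\mathbf{w}|\mathbf{m}, \Sigma)
$$
where m and Σ are evaluated using Eq (7.82) and (7.83) given α = α*
and β = β*. Then we utilize Eq (7.90) and obtain:
$$
\begin{aligned}
p(t|\mathbf{x}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) &= \int N(t|\mathbf{w}^T\phi(\mathbf{x}), (\beta^*)^{-1})\, N(\mathbf{w}|\mathbf{m}, \Sigma)\, d\mathbf{w} \\
&= \int N(t|\phi(\mathbf{x})^T\mathbf{w}, (\beta^*)^{-1})\, N(\mathbf{w}|\mathbf{m}, \Sigma)\, d\mathbf{w}
\end{aligned}
$$
Using Eq (2.113)-(2.117), we can obtain:
$$
p(t|\mathbf{x}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) = N(t|\mu, \sigma^2)
$$
where
$$
\mu = \mathbf{m}^T\phi(\mathbf{x}), \qquad \sigma^2 = (\beta^*)^{-1} + \phi(\mathbf{x})^T\Sigma\phi(\mathbf{x})
$$
Just as required.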
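We just follow the hint.
$$
\begin{aligned}
L(\boldsymbol{\alpha}) &= -\frac{1}{2}\left\{N\ln 2\pi + \ln|\mathbf{C}| + \mathbf{t}^T\mathbf{C}^{-1}\mathbf{t}\right\} \\
&= -\frac{1}{2}\left\{N\ln 2\pi + \ln|\mathbf{C}_{-i}| + \ln|1 + \alpha_i^{-1}\varphi_i^T\mathbf{C}_{-i}^{-1}\varphi_i| + \mathbf{t}^T\left(\mathbf{C}_{-i}^{-1} - \frac{\mathbf{C}_{-i}^{-1}\varphi_i\varphi_i^T\mathbf{C}_{-i}^{-1}}{\alpha_i + \varphi_i^T\mathbf{C}_{-i}^{-1}\varphi_i}\right)\mathbf{t}\right\} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln|1 + \alpha_i^{-1}\varphi_i^T\mathbf{C}_{-i}^{-1}\varphi_i| + \frac{1}{2}\mathbf{t}^T\frac{\mathbf{C}_{-i}^{-1}\varphi_i\varphi_i^T\mathbf{C}_{-i}^{-1}}{\alpha_i + \varphi_i^T\mathbf{C}_{-i}^{-1}\varphi_i}\mathbf{t} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln|1 + \alpha_i^{-1}s_i| + \frac{1}{2}\frac{q_i^2}{\alpha_i + s_i} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln\frac{\alpha_i + s_i}{\alpha_i} + \frac{1}{2}\frac{q_i^2}{\alpha_i + s_i} \\
&= L(\boldsymbol{\alpha}_{-i}) + \frac{1}{2}\left[\ln\alpha_i - \ln(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i}\right] = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i)
\end{aligned}
$$
where we have defined λ(α_i), s_i and q_i as shown in Eq (7.97)-(7.99).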
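We first calculate the first derivative of Eq (7.97) with respect to α_i:
$$
\frac{\partial\lambda}{\partial\alpha_i} = \frac{1}{2}\left[\frac{1}{\alpha_i} - \frac{1}{\alpha_i + s_i} - \frac{q_i^2}{(\alpha_i + s_i)^2}\right]
$$
Then we calculate the second derivative:
$$
\frac{\partial^2\lambda}{\partial\alpha_i^2} = \frac{1}{2}\left[-\frac{1}{\alpha_i^2} + \frac{1}{(\alpha_i + s_i)^2} + \frac{2q_i^2}{(\alpha_i + s_i)^3}\right]
$$
Next we aim to prove that when α_i is given by Eq (7.101), i.e., when the
first derivative equals 0, the second derivative (i.e., the expression above) is
negative. First we can obtain:
$$
\alpha_i + s_i = \frac{s_i^2}{q_i^2 - s_i} + s_i = \frac{s_i q_i^2}{q_i^2 - s_i}
$$
Therefore, substituting α_i + s_i and α_i into the second derivative, we can
obtain:
$$
\begin{aligned}
\frac{\partial^2\lambda}{\partial\alpha_i^2} &= \frac{1}{2}\left[-\frac{(q_i^2 - s_i)^2}{s_i^4} + \frac{(q_i^2 - s_i)^2}{s_i^2 q_i^4} + \frac{2q_i^2(q_i^2 - s_i)^3}{s_i^3 q_i^6}\right] \\
&= \frac{1}{2}\left[-\frac{q_i^4(q_i^2 - s_i)^2}{q_i^4 s_i^4} + \frac{s_i^2(q_i^2 - s_i)^2}{s_i^4 q_i^4} + \frac{2s_i(q_i^2 - s_i)^3}{s_i^4 q_i^4}\right] \\
&= \frac{1}{2}\frac{(q_i^2 - s_i)^2}{q_i^4 s_i^4}\left[-q_i^4 + s_i^2 + 2s_i(q_i^2 - s_i)\right] \\
&= \frac{1}{2}\frac{(q_i^2 - s_i)^2}{q_i^4 s_i^4}\left[-(q_i^2 - s_i)^2\right] \\
&= -\frac{1}{2}\frac{(q_i^2 - s_i)^4}{q_i^4 s_i^4} < 0
\end{aligned}
$$
Just as required.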
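The stationary point and the sign of the second derivative can be confirmed numerically. The sketch below evaluates λ(α_i) and its first two derivatives at α_i = s_i²/(q_i² − s_i), using arbitrary illustrative values of s_i and q_i chosen so that q_i² > s_i.

```python
import math

def lam(alpha, s, q):
    """lambda(alpha_i) in the form of Eq (7.97)."""
    return 0.5 * (math.log(alpha) - math.log(alpha + s) + q * q / (alpha + s))

def d_lam(alpha, s, q):
    """First derivative of lambda with respect to alpha_i."""
    return 0.5 * (1.0 / alpha - 1.0 / (alpha + s) - q * q / (alpha + s) ** 2)

def d2_lam(alpha, s, q):
    """Second derivative of lambda with respect to alpha_i."""
    return 0.5 * (-1.0 / alpha ** 2 + 1.0 / (alpha + s) ** 2
                  + 2.0 * q * q / (alpha + s) ** 3)

s, q = 2.0, 3.0                          # hypothetical values with q^2 > s
alpha_star = s * s / (q * q - s)         # stationary point, Eq (7.101)

first = d_lam(alpha_star, s, q)
second = d2_lam(alpha_star, s, q)
closed_form = -0.5 * (q * q - s) ** 4 / (q ** 4 * s ** 4)
```

At `alpha_star` the first derivative vanishes and the second derivative equals the closed-form expression −(q² − s)⁴ / (2 q⁴ s⁴), which is negative, so the stationary point is a maximum.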
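We just follow the hint. According to Eq (7.102), Eq (7.86) and matrix
identity (C.7), we have:
$$
\begin{aligned}
Q_i &= \varphi_i^T\mathbf{C}^{-1}\mathbf{t} \\
&= \varphi_i^T(\beta^{-1}\mathbf{I} + \Phi\mathbf{A}^{-1}\Phi^T)^{-1}\mathbf{t} \\
&= \varphi_i^T(\beta\mathbf{I} - \beta\mathbf{I}\Phi(\mathbf{A} + \Phi^T\beta\mathbf{I}\Phi)^{-1}\Phi^T\beta\mathbf{I})\mathbf{t} \\
&= \varphi_i^T(\beta - \beta^2\Phi(\mathbf{A} + \beta\Phi^T\Phi)^{-1}\Phi^T)\mathbf{t} \\
&= \varphi_i^T(\beta - \beta^2\Phi\Sigma\Phi^T)\mathbf{t} \\
&= \beta\varphi_i^T\mathbf{t} - \beta^2\varphi_i^T\Phi\Sigma\Phi^T\mathbf{t}
\end{aligned}
$$
Similarly, we can obtain:
$$
\begin{aligned}
S_i &= \varphi_i^T\mathbf{C}^{-1}\varphi_i \\
&= \varphi_i^T(\beta - \beta^2\Phi\Sigma\Phi^T)\varphi_i \\
&= \beta\varphi_i^T\varphi_i - \beta^2\varphi_i^T\Phi\Sigma\Phi^T\varphi_i
\end{aligned}
$$
Just as required.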
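We begin by differentiating the first term in Eq (7.109) with respect to w. This
can be easily evaluated based on Eq (4.90)-(4.91).
$$
\frac{\partial}{\partial\mathbf{w}}\left\{\sum_{n=1}^{N} t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right\} = \sum_{n=1}^{N}(t_n - y_n)\phi_n = \Phi^T(\mathbf{t} - \mathbf{y})
$$
The derivative of the second term in Eq (7.109) with respect to w
is simple to obtain. Therefore, the first derivative of Eq (7.109) with
respect to w is:
$$
\frac{\partial\ln p}{\partial\mathbf{w}} = \Phi^T(\mathbf{t} - \mathbf{y}) - \mathbf{A}\mathbf{w}
$$
For the Hessian matrix, we can first obtain:
$$
\begin{aligned}
\frac{\partial}{\partial\mathbf{w}}\left\{\Phi^T(\mathbf{t} - \mathbf{y})\right\} &= \sum_{n=1}^{N}\frac{\partial}{\partial\mathbf{w}}\left\{(t_n - y_n)\phi_n\right\} \\
&= -\sum_{n=1}^{N}\frac{\partial}{\partial\mathbf{w}}\left\{y_n \cdot \phi_n\right\} \\
&= -\sum_{n=1}^{N}\frac{\partial\sigma(\mathbf{w}^T\phi_n)}{\partial\mathbf{w}} \cdot \phi_n^T \\
&= -\sum_{n=1}^{N}\frac{\partial\sigma(a)}{\partial a} \cdot \frac{\partial a}{\partial\mathbf{w}} \cdot \phi_n^T
\end{aligned}
$$
where we have defined a = w^T ϕ_n. Then we can utilize Eq (4.88) to derive:
$$
\frac{\partial}{\partial\mathbf{w}}\left\{\Phi^T(\mathbf{t} - \mathbf{y})\right\} = -\sum_{n=1}^{N}\sigma(1 - \sigma) \cdot \phi_n \cdot \phi_n^T = -\Phi^T\mathbf{B}\Phi
$$
where B is a diagonal N × N matrix with elements b_n = y_n(1 − y_n). Therefore,
we can obtain the Hessian matrix:
$$
\mathbf{H} = \frac{\partial}{\partial\mathbf{w}}\left\{\frac{\partial\ln p}{\partial\mathbf{w}}\right\} = -(\Phi^T\mathbf{B}\Phi + \mathbf{A})
$$
Just as required.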
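We begin from Eq (7.114).
$$
\begin{aligned}
p(\mathbf{t}|\boldsymbol{\alpha}) &= p(\mathbf{t}|\mathbf{w}^*)\, p(\mathbf{w}^*|\boldsymbol{\alpha})\,(2\pi)^{M/2}|\Sigma|^{1/2} \\
&= \left[\prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w})\right]\left[\prod_{i=1}^{M} N(w_i|0, \alpha_i^{-1})\right](2\pi)^{M/2}|\Sigma|^{1/2}\,\Big|_{\mathbf{w} = \mathbf{w}^*} \\
&= \left[\prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w})\right] \cdot N(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1}) \cdot (2\pi)^{M/2}|\Sigma|^{1/2}\,\Big|_{\mathbf{w} = \mathbf{w}^*}
\end{aligned}
$$
We further take the logarithm of both sides.
$$
\begin{aligned}
\ln p(\mathbf{t}|\boldsymbol{\alpha}) &= \left[\sum_{n=1}^{N}\ln p(t_n|\mathbf{x}_n, \mathbf{w}) + \ln N(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1}) + \frac{M}{2}\ln 2\pi + \frac{1}{2}\ln|\Sigma|\right]_{\mathbf{w} = \mathbf{w}^*} \\
&= \left[\sum_{n=1}^{N}\left[t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right] - \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} + \frac{1}{2}\ln|\mathbf{A}| + \frac{1}{2}\ln|\Sigma| + \text{const}\right]_{\mathbf{w} = \mathbf{w}^*} \\
&= \left[\sum_{n=1}^{N}\left[t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right] - \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w}\right]_{\mathbf{w} = \mathbf{w}^*} + \left[\frac{1}{2}\ln|\Sigma| + \frac{1}{2}\ln|\mathbf{A}| + \text{const}\right]_{\mathbf{w} = \mathbf{w}^*}
\end{aligned}
$$
Using the chain rule, we can obtain:
$$
\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i}\Big|_{\mathbf{w} = \mathbf{w}^*} = \frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\mathbf{w}}\frac{\partial\mathbf{w}}{\partial\alpha_i}\Big|_{\mathbf{w} = \mathbf{w}^*}
$$
Observing Eq (7.109), (7.110) and that (7.110) equals 0 at w*, we can
conclude that the first term on the right hand side of ln p(t|α) has zero
derivative with respect to w at w*. Therefore, we only need to focus on the
second term:
$$
\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i}\Big|_{\mathbf{w} = \mathbf{w}^*} = \frac{\partial}{\partial\alpha_i}\left[\frac{1}{2}\ln|\Sigma| + \frac{1}{2}\ln|\mathbf{A}|\right]_{\mathbf{w} = \mathbf{w}^*}
$$
It is rather easy to obtain:
$$
\frac{\partial}{\partial\alpha_i}\left[\frac{1}{2}\ln|\mathbf{A}|\right] = \frac{1}{2}\frac{\partial}{\partial\alpha_i}\left[\sum_i\ln\alpha_i\right] = \frac{1}{2\alpha_i}
$$
Then, following the same procedure as in Prob.7.12, we can obtain:
$$
\frac{\partial}{\partial\alpha_i}\left[\frac{1}{2}\ln|\Sigma|\right] = -\frac{1}{2}\Sigma_{ii}
$$
Therefore, we obtain:
$$
\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}\Sigma_{ii}
$$
Note: here I draw a different conclusion from the main text. I have also
verified my result in another way. You can write the prior as the product of
N(w_i|0, α_i^{-1}) instead of N(w|0, A^{-1}). In this form, since we know that:
$$
\frac{\partial}{\partial\alpha_i}\sum_{i=1}^{M}\ln N(w_i|0, \alpha_i^{-1}) = \frac{\partial}{\partial\alpha_i}\left(\frac{1}{2}\ln\alpha_i - \frac{\alpha_i}{2}w_i^2\right) = \frac{1}{2\alpha_i} - \frac{1}{2}(w_i^*)^2
$$
The above expression can be used to replace the derivative of −1/2 w^T A w +
1/2 ln |A|. Since the derivative of the likelihood with respect to α_i is not zero
at w*, (7.115) seems not right anyway.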
0.8 Graphical Models

We are required to prove:
$$
\int_{\mathbf{x}} p(\mathbf{x})\, d\mathbf{x} = \int_{\mathbf{x}}\prod_{k=1}^{K} p(x_k|pa_k)\, d\mathbf{x} = 1
$$
Here we adopt the same assumption as in the main text: no arrows lead
from a higher numbered node to a lower numbered node. According to Eq (8.5), we can write:
$$
\begin{aligned}
\int_{\mathbf{x}} p(\mathbf{x})\, d\mathbf{x} &= \int_{\mathbf{x}}\prod_{k=1}^{K} p(x_k|pa_k)\, d\mathbf{x} \\
&= \int_{\mathbf{x}} p(x_K|pa_K)\prod_{k=1}^{K-1} p(x_k|pa_k)\, d\mathbf{x} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\int_{x_K}\left[p(x_K|pa_K)\prod_{k=1}^{K-1} p(x_k|pa_k)\right] dx_K\, dx_1\, dx_2 ... dx_{K-1} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\left[\prod_{k=1}^{K-1} p(x_k|pa_k)\int_{x_K} p(x_K|pa_K)\, dx_K\right] dx_1\, dx_2 ... dx_{K-1} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\prod_{k=1}^{K-1} p(x_k|pa_k)\, dx_1\, dx_2 ... dx_{K-1}
\end{aligned}
$$
Note that from the third line to the fourth line, we have used the fact
that x_1, x_2, ..., x_{K−1} do not depend on x_K, and thus the product from k = 1 to
K − 1 can be moved outside the integral with respect to x_K. From the fourth
line to the fifth line, we have used the fact that the conditional probability
is correctly normalized. The aforementioned procedure is
repeated K times until all the variables have been integrated out.
This statement is obvious. Suppose that there exists an ordered numbering
of the nodes such that for each node there are no links going to a
lower-numbered node, and that there is a directed cycle in the graph:
a_1 → a_2 → ... → a_N
To make it a real cycle, we also require a_N → a_1. According to the
assumption, every link goes to a higher-numbered node, so a_1 < a_2 < ... < a_N.
Therefore, the last link a_N → a_1 is invalid since a_N > a_1.
Based on the definition, we can obtain:
$$
p(a, b) = p(a, b, c = 0) + p(a, b, c = 1) =
\begin{cases}
0.336, & \text{if } a = 0, b = 0 \\
0.264, & \text{if } a = 0, b = 1 \\
0.256, & \text{if } a = 1, b = 0 \\
0.144, & \text{if } a = 1, b = 1
\end{cases}
$$
Similarly, we can obtain:
$$
p(a) = p(a, b = 0) + p(a, b = 1) =
\begin{cases}
0.6, & \text{if } a = 0 \\
0.4, & \text{if } a = 1
\end{cases}
$$
And
$$
p(b) = p(a = 0, b) + p(a = 1, b) =
\begin{cases}
0.592, & \text{if } b = 0 \\
0.408, & \text{if } b = 1
\end{cases}
$$
Therefore, we conclude that p(a, b) ≠ p(a) p(b). For instance, we have
p(a = 1, b = 1) = 0.144, p(a = 1) = 0.4 and p(b = 1) = 0.408. It is obvious
that:
$$
0.144 = p(a = 1, b = 1) \neq p(a = 1)\, p(b = 1) = 0.4 \times 0.408
$$
To prove the conditional independence, we first calculate p(c):
$$
p(c) = \sum_{a, b = 0, 1} p(a, b, c) =
\begin{cases}
0.480, & \text{if } c = 0 \\
0.520, & \text{if } c = 1
\end{cases}
$$
According to Bayes’ Theorem, we have:
$$
p(a, b|c) = \frac{p(a, b, c)}{p(c)} =
\begin{cases}
0.400, & \text{if } a = 0, b = 0, c = 0 \\
0.277, & \text{if } a = 0, b = 0, c = 1 \\
0.100, & \text{if } a = 0, b = 1, c = 0 \\
0.415, & \text{if } a = 0, b = 1, c = 1 \\
0.400, & \text{if } a = 1, b = 0, c = 0 \\
0.123, & \text{if } a = 1, b = 0, c = 1 \\
0.100, & \text{if } a = 1, b = 1, c = 0 \\
0.185, & \text{if } a = 1, b = 1, c = 1
\end{cases}
$$
Similarly, we also have:
$$
p(a|c) = \frac{p(a, c)}{p(c)} =
\begin{cases}
0.240/0.480 = 0.500, & \text{if } a = 0, c = 0 \\
0.360/0.520 = 0.692, & \text{if } a = 0, c = 1 \\
0.240/0.480 = 0.500, & \text{if } a = 1, c = 0 \\
0.160/0.520 = 0.308, & \text{if } a = 1, c = 1
\end{cases}
$$
where we have used p(a, c) = p(a, b = 0, c) + p(a, b = 1, c). Similarly, we
can obtain:
$$
p(b|c) = \frac{p(b, c)}{p(c)} =
\begin{cases}
0.384/0.480 = 0.800, & \text{if } b = 0, c = 0 \\
0.208/0.520 = 0.400, & \text{if } b = 0, c = 1 \\
0.096/0.480 = 0.200, & \text{if } b = 1, c = 0 \\
0.312/0.520 = 0.600, & \text{if } b = 1, c = 1
\end{cases}
$$
Now we can easily verify the statement p(a, b|c) = p(a|c) p(b|c). For
instance, we have:
$$
0.1 = p(a = 1, b = 1|c = 0) = p(a = 1|c = 0)\, p(b = 1|c = 0) = 0.5 \times 0.2 = 0.1
$$
This problem follows the previous one. We have already calculated p(a)
and p(b|c), and we restate them here.
$$
p(a) = p(a, b = 0) + p(a, b = 1) =
\begin{cases}
0.6, & \text{if } a = 0 \\
0.4, & \text{if } a = 1
\end{cases}
$$
And
$$
p(b|c) = \frac{p(b, c)}{p(c)} =
\begin{cases}
0.384/0.480 = 0.800, & \text{if } b = 0, c = 0 \\
0.208/0.520 = 0.400, & \text{if } b = 0, c = 1 \\
0.096/0.480 = 0.200, & \text{if } b = 1, c = 0 \\
0.312/0.520 = 0.600, & \text{if } b = 1, c = 1
\end{cases}
$$
We can also obtain p(c|a):
$$
p(c|a) = \frac{p(a, c)}{p(a)} =
\begin{cases}
0.24/0.6 = 0.4, & \text{if } a = 0, c = 0 \\
0.36/0.6 = 0.6, & \text{if } a = 0, c = 1 \\
0.24/0.4 = 0.6, & \text{if } a = 1, c = 0 \\
0.16/0.4 = 0.4, & \text{if } a = 1, c = 1
\end{cases}
$$
Now we can easily verify the statement that p(a, b, c) = p(a) p(c|a) p(b|c)
given Table 8.2. The directed graph looks like:
a → c → b
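The table look-ups in the two solutions above are easy to mechanize. The sketch below stores a joint distribution p(a, b, c) whose eight entries are the ones consistent with every marginal and conditional value quoted in these solutions (they are assumed to match Table 8.2 of the main text), and then checks marginal dependence, conditional independence given c, and the chain factorization. The helper `marginal` is a hypothetical name introduced for this illustration.

```python
from itertools import product

# Joint distribution p(a, b, c); keys are (a, b, c).
joint = {
    (0, 0, 0): 0.192, (0, 0, 1): 0.144, (0, 1, 0): 0.048, (0, 1, 1): 0.216,
    (1, 0, 0): 0.192, (1, 0, 1): 0.064, (1, 1, 0): 0.048, (1, 1, 1): 0.096,
}

def marginal(keep):
    """Sum the joint over all variables whose position is not in `keep`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

p_a, p_b, p_c = marginal([0]), marginal([1]), marginal([2])
p_ab, p_ac, p_bc = marginal([0, 1]), marginal([0, 2]), marginal([1, 2])

states = list(product((0, 1), repeat=3))

# p(a, b) = p(a) p(b) for every (a, b)?
marginally_independent = all(
    abs(p_ab[(a, b)] - p_a[(a,)] * p_b[(b,)]) < 1e-9
    for a, b in product((0, 1), repeat=2))

# p(a, b | c) = p(a | c) p(b | c) for every (a, b, c)?
conditionally_independent = all(
    abs(joint[(a, b, c)] / p_c[(c,)]
        - (p_ac[(a, c)] / p_c[(c,)]) * (p_bc[(b, c)] / p_c[(c,)])) < 1e-9
    for a, b, c in states)

# p(a, b, c) = p(a) p(c | a) p(b | c) for every (a, b, c)?
chain_factorizes = all(
    abs(joint[(a, b, c)]
        - p_a[(a,)] * (p_ac[(a, c)] / p_a[(a,)]) * (p_bc[(b, c)] / p_c[(c,)])) < 1e-9
    for a, b, c in states)
```

With these numbers `marginally_independent` is false while `conditionally_independent` and `chain_factorizes` are both true, matching the graph a → c → b.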
It looks quite like Figure 8.6. The difference is that we introduce α i for
each w i , where i = 1, 2, ..., M .
Figure 1: probabilistic graphical model corresponding to the RVM described

in (7.79) and (7.80).
Problem 8.6 Solution(Wait for update)
Let’s just follow the hint. We begin by calculating the mean µ.
$$
E[x_1] = b_1
$$
According to Eq (8.15), we can obtain:
$$
E[x_2] = \sum_{j \in pa_2} w_{2j}E[x_j] + b_2 = w_{21}b_1 + b_2
$$
Then we can obtain:
$$
\begin{aligned}
E[x_3] &= w_{32}E[x_2] + b_3 \\
&= w_{32}(w_{21}b_1 + b_2) + b_3 \\
&= w_{32}w_{21}b_1 + w_{32}b_2 + b_3
\end{aligned}
$$
Therefore, we obtain Eq (8.17) just as required. Next, we deal with the
covariance matrix.
$$
\mathrm{cov}[x_1, x_1] = v_1
$$
Then we can obtain:
$$
\mathrm{cov}[x_1, x_2] = \sum_{k \in pa_2} w_{2k}\,\mathrm{cov}[x_1, x_k] + I_{12}v_2 = w_{21}\,\mathrm{cov}[x_1, x_1] = w_{21}v_1
$$
And also cov[x_2, x_1] = cov[x_1, x_2] = w_21 v_1. Hence, we can obtain:
$$
\mathrm{cov}[x_2, x_2] = \sum_{k \in pa_2} w_{2k}\,\mathrm{cov}[x_2, x_k] + I_{22}v_2 = w_{21}^2v_1 + v_2
$$
Next, we can obtain:
$$
\mathrm{cov}[x_1, x_3] = \sum_{k \in pa_3} w_{3k}\,\mathrm{cov}[x_1, x_k] + I_{13}v_3 = w_{32}w_{21}v_1
$$
Then, we can obtain:
$$
\mathrm{cov}[x_2, x_3] = \sum_{k \in pa_3} w_{3k}\,\mathrm{cov}[x_2, x_k] + I_{23}v_3 = w_{32}(v_2 + w_{21}^2v_1)
$$
Finally, we can obtain:
$$
\begin{aligned}
\mathrm{cov}[x_3, x_3] &= \sum_{k \in pa_3} w_{3k}\,\mathrm{cov}[x_3, x_k] + I_{33}v_3 \\
&= w_{32}\left[w_{32}(v_2 + w_{21}^2v_1)\right] + v_3
\end{aligned}
$$
where we have used the fact that cov[x_3, x_2] = cov[x_2, x_3]. By now, we
have obtained Eq (8.18) just as required.
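The recursions used above can be run mechanically for any topologically ordered linear-Gaussian graph. The following sketch implements them with 0-indexed nodes and compares the result for the three-node chain against the closed-form entries derived in this solution. The function name `linear_gaussian_moments` and the numerical weights, biases and variances are illustrative assumptions.

```python
def linear_gaussian_moments(w, b, v, parents):
    """Mean and covariance from the recursions of Eq (8.15) and Eq (8.16).

    Nodes are 0-indexed and topologically ordered; w[j][k] is the weight on
    parent k of node j, b[j] the bias and v[j] the conditional variance.
    """
    n = len(b)
    mean = [0.0] * n
    cov = [[0.0] * n for _ in range(n)]
    for j in range(n):
        mean[j] = b[j] + sum(w[j][k] * mean[k] for k in parents[j])
        for i in range(j + 1):
            # cov[i][k] with k < j is already filled in, also when i == j,
            # because cov[j][k] was set as the mirror of cov[k][j] earlier
            # in this same pass over i.
            c = sum(w[j][k] * cov[i][k] for k in parents[j])
            if i == j:
                c += v[j]
            cov[i][j] = cov[j][i] = c
    return mean, cov

# Chain x1 -> x2 -> x3 with hypothetical parameters.
w21, w32 = 0.5, -2.0
b1, b2, b3 = 1.0, 2.0, 3.0
v1, v2, v3 = 1.0, 0.5, 0.25

w = [[0.0, 0.0, 0.0], [w21, 0.0, 0.0], [0.0, w32, 0.0]]
mean, cov = linear_gaussian_moments(w, [b1, b2, b3], [v1, v2, v3], [[], [0], [1]])

expected_mean = [b1, b2 + w21 * b1, b3 + w32 * b2 + w32 * w21 * b1]
expected_cov = [
    [v1, w21 * v1, w32 * w21 * v1],
    [w21 * v1, v2 + w21 ** 2 * v1, w32 * (v2 + w21 ** 2 * v1)],
    [w32 * w21 * v1, w32 * (v2 + w21 ** 2 * v1), v3 + w32 ** 2 * (v2 + w21 ** 2 * v1)],
]
```

The recursive and closed-form results agree entry by entry, which gives a mechanical check of the hand derivation.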
According to the definition, we can write:
p(a, b, c| d ) = p(a| d ) p( b, c| d )
We marginalize both sides with respect to c, yielding:
p(a, b| d ) = p(a| d ) p( b| d )
Just as required.
This statement is easy to see but a little bit difficult to prove. We put Fig
8.26 here to give a better illustration.
Figure 2: Markov blanket of a node x i
The Markov blanket Φ of node x_i is made up of three kinds of nodes: (i) the
set Φ_1 containing all the parents of node x_i (x_1 and x_2 in Fig.2), (ii) the set
Φ_2 containing all the children of node x_i (x_5 and x_6 in Fig.2), and (iii) the set
Φ_3 containing all the co-parents of node x_i (x_3 and x_4 in Fig.2). According to
the d-separation criterion, we need to show that all the paths from node x_i to
an arbitrary node x̂ ∉ Φ = Φ_1 ∪ Φ_2 ∪ Φ_3 are blocked given that the Markov
blanket Φ is observed.
Any path from x̂ to the target node x_i must reach x_i through one of two kinds
of node: Φ_1 or Φ_2. First, suppose that x̂ connects to x_i via some node x⋆ ∈ Φ_1. The
arrows definitely meet head-to-tail or tail-to-tail at node x⋆ because the link
from a parent node x⋆ to x_i has its tail connected to the parent node x⋆, and
since x⋆ is in Φ_1 ⊆ Φ, we see that this path is blocked.
In the second case, suppose that x̂ connects to x_i via some node x⋆ ∈ Φ_2.
We need to further divide this situation. If the path from x̂ to x_i also goes
through a node x⋆⋆ from Φ_3 (e.g., in Fig.2, some node x̂ connects to node x_3,
and in this example x⋆⋆ = x_3, x⋆ = x_5), it is clear that the arrows meet
head-to-tail or tail-to-tail at the node x⋆⋆ ∈ Φ_3 ⊆ Φ, so this path is blocked.
In the final case, suppose that x̂ connects to x_i via some node x⋆ ∈ Φ_2 and
the path doesn’t go through any node from Φ_3. An important observation is
that the arrows cannot meet head-to-head at node x⋆ (otherwise, this path
would go through a node from Φ_3). Thus, the arrows must meet either
head-to-tail or tail-to-tail at node x⋆ ∈ Φ_2 ⊆ Φ. Therefore, the path is also blocked.
By examining Fig.8.54, we can obtain:
$$
p(a, b, c, d) = p(a)\, p(b)\, p(c|a, b)\, p(d|c)
$$
Next we sum both sides with respect to c and d, and
obtain:
$$
\begin{aligned}
p(a, b) &= p(a)\, p(b)\sum_c\sum_d p(c|a, b)\, p(d|c) \\
&= p(a)\, p(b)\sum_c\left[\sum_d p(c|a, b)\, p(d|c)\right] \\
&= p(a)\, p(b)\sum_c p(c|a, b) \times 1 \\
&= p(a)\, p(b) \times 1 \\
&= p(a)\, p(b)
\end{aligned}
$$
To show that a and b are dependent conditioned on d, we need to show
that the following factorization does not hold in general:
$$
p(a, b|d) = p(a|d)\, p(b|d)
$$
We multiply both sides by p(d) and use the product rule, yielding:
$$
p(a, b, d) = p(a, d)\, p(b|d) \qquad (*)
$$
In other words, we can equivalently examine the expression above instead.
Recall that we have:
$$
p(a, b, c, d) = p(a)\, p(b)\, p(c|a, b)\, p(d|c)
$$
We sum both sides with respect to c, yielding:
$$
p(a, b, d) = p(a)\, p(b)\sum_c p(c|a, b)\, p(d|c)
$$
Combining with (∗), writing p(a, d) = p(a) p(d|a) and dividing by p(a) p(b),
conditional independence would require:
$$
\sum_c p(c|a, b)\, p(d|c) = \frac{p(d|a)\, p(b|d)}{p(b)} = \frac{p(d|a)\, p(d|b)}{p(d)}
$$
The right hand side is a product of a function of (a, d) and a function of
(b, d), while the left hand side couples a and b through p(c|a, b) and does not
factorize in this way in general. Therefore, this expression will not hold in
general, and, thus, a and b are not independent conditioned on d.
This problem is quite straightforward, but it needs some patience.
According to Bayes’ Theorem, we have:
$$
p(F = 0|D = 0) = \frac{p(D = 0|F = 0)\, p(F = 0)}{p(D = 0)} \qquad (*)
$$
We will calculate each of the terms on the right hand side. Let’s begin with
the denominator p(D = 0). According to the sum rule, we have:
$$
\begin{aligned}
p(D = 0) &= p(D = 0, G = 0) + p(D = 0, G = 1) \\
&= p(D = 0|G = 0)\, p(G = 0) + p(D = 0|G = 1)\, p(G = 1) \\
&= 0.9 \times 0.315 + (1 - 0.9) \times (1 - 0.315) \\
&= 0.352
\end{aligned}
$$
where we have used Eq (8.30), Eq (8.105) and Eq (8.106). Note that the
second term in the numerator, i.e., p(F = 0), equals 0.1, which can be easily
derived from the main text above Eq (8.30). We now only need to calculate
p(D = 0|F = 0). Similarly, according to the sum rule, we have:
$$
\begin{aligned}
p(D = 0|F = 0) &= \sum_{G = 0, 1} p(D = 0, G|F = 0) \\
&= \sum_{G = 0, 1} p(D = 0|G, F = 0)\, p(G|F = 0) \\
&= \sum_{G = 0, 1} p(D = 0|G)\, p(G|F = 0) \\
&= 0.9 \times 0.81 + (1 - 0.9) \times (1 - 0.81) \\
&= 0.748
\end{aligned}
$$
Several clarifications must be made here. First, from the second line
to the third line, we simply eliminate the dependence on F = 0 because we
know that D only depends on G according to Eq (8.105) and Eq (8.106).
Second, from the third line to the fourth line, we have used Eq (8.31), Eq (8.105)
and Eq (8.106). Now, we substitute all of them back into (∗), yielding:
$$
p(F = 0|D = 0) = \frac{p(D = 0|F = 0)\, p(F = 0)}{p(D = 0)} = \frac{0.748 \times 0.1}{0.352} = 0.2125
$$
Next, we are required to calculate the probability conditioned on both
D = 0 and B = 0. Similarly, we can write:
$$
\begin{aligned}
p(F = 0|D = 0, B = 0) &= \frac{p(D = 0, B = 0, F = 0)}{p(D = 0, B = 0)} \\
&= \frac{\sum_G p(D = 0, B = 0, F = 0, G)}{\sum_G p(D = 0, B = 0, G)} \\
&= \frac{\sum_G p(B = 0, F = 0, G)\, p(D = 0|B = 0, F = 0, G)}{\sum_G p(B = 0, G)\, p(D = 0|B = 0, G)} \\
&= \frac{\sum_G p(B = 0, F = 0, G)\, p(D = 0|G)}{\sum_G p(B = 0, G)\, p(D = 0|G)} \qquad (**)
\end{aligned}
$$
We need to calculate p(B = 0, F = 0, G) and p(B = 0, G), where G = 0, 1.
We begin by calculating p(B = 0, F = 0, G = 0):
$$
\begin{aligned}
p(B = 0, F = 0, G = 0) &= p(G = 0|B = 0, F = 0) \times p(B = 0, F = 0) \\
&= p(G = 0|B = 0, F = 0) \times p(B = 0) \times p(F = 0) \\
&= (1 - 0.1) \times (1 - 0.9) \times (1 - 0.9) \\
&= 0.009
\end{aligned}
$$
Similarly, we can obtain p(B = 0, F = 0, G = 1) = 0.001. Next we calculate
p(B = 0, G):
$$
\begin{aligned}
p(B = 0, G = 0) &= \sum_{F = 0, 1} p(B = 0, G = 0, F) \\
&= \sum_{F = 0, 1} p(G = 0|B = 0, F) \times p(B = 0, F) \\
&= \sum_{F = 0, 1} p(G = 0|B = 0, F) \times p(B = 0) \times p(F) \\
&= (1 - 0.1) \times (1 - 0.9) \times (1 - 0.9) + (1 - 0.2) \times (1 - 0.9) \times 0.9 \\
&= 0.081
\end{aligned}
$$
Similarly, we can obtain p(B = 0, G = 1) = 0.019. We substitute them back
into (∗∗), yielding:
$$
\begin{aligned}
p(F = 0|D = 0, B = 0) &= \frac{\sum_G p(B = 0, F = 0, G)\, p(D = 0|G)}{\sum_G p(B = 0, G)\, p(D = 0|G)} \\
&= \frac{0.009 \times 0.9 + 0.001 \times (1 - 0.9)}{0.081 \times 0.9 + 0.019 \times (1 - 0.9)} \\
&= 0.1096
\end{aligned}
$$
Just as required. The intuition behind this result agrees with common
sense. Moreover, by analogy to Fig.8.54, the nodes a and b in Fig.8.54
represent B and F in our case. Node c represents G, while node d represents
D. You can use the d-separation criterion to verify the conditional independence properties.
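The two posterior probabilities can also be obtained by brute-force enumeration of the joint distribution p(B) p(F) p(G|B, F) p(D|G). The sketch below uses the probability values quoted in the calculation above; the helper names `joint` and `prob` are hypothetical and introduced only for this illustration.

```python
# Conditional probability tables of the fuel-gauge example (state 1 = full/charged).
p_B = {1: 0.9, 0: 0.1}
p_F = {1: 0.9, 0: 0.1}
p_G1_given_BF = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.1}  # p(G=1 | B, F)
p_D_given_G = {(1, 1): 0.9, (0, 1): 0.1, (0, 0): 0.9, (1, 0): 0.1}    # key is (D, G)

def p_G(g, b, f):
    """p(G = g | B = b, F = f)."""
    p1 = p_G1_given_BF[(b, f)]
    return p1 if g == 1 else 1.0 - p1

def joint(b, f, g, d):
    """p(B, F, G, D) from the factorization of the directed graph."""
    return p_B[b] * p_F[f] * p_G(g, b, f) * p_D_given_G[(d, g)]

def prob(**fixed):
    """Sum the joint over every variable not fixed by keyword (b, f, g, d)."""
    total = 0.0
    for b in (0, 1):
        for f in (0, 1):
            for g in (0, 1):
                for d in (0, 1):
                    state = {"b": b, "f": f, "g": g, "d": d}
                    if all(state[k] == val for k, val in fixed.items()):
                        total += joint(b, f, g, d)
    return total

p_f0_given_d0 = prob(f=0, d=0) / prob(d=0)
p_f0_given_d0_b0 = prob(f=0, d=0, b=0) / prob(d=0, b=0)
```

The enumeration reproduces p(F = 0|D = 0) = 0.2125 and p(F = 0|D = 0, B = 0) ≈ 0.1096: observing that the battery is flat partly explains away the empty reading of the driver.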
An intuitive solution is to construct a matrix A of size M × M.
If there is a link between node i and node j, the entry in the i-th row and j-th
column of matrix A, i.e., A_{i,j}, equals 1. Otherwise, it equals 0.
Since the graph is undirected, the matrix A is symmetric. What’s more,
the elements on the diagonal are 0 by definition. Every undirected graph can be
represented by such a matrix A, and the mapping is one-to-one.
In other words, we equivalently count the number of possible matrices A
satisfying the following criteria: (i) each entry is either 0 or 1, (ii) the matrix is
symmetric, and (iii) all of the entries on the diagonal are already determined
(i.e., they all equal 0).
Using the property of symmetry, we only need to count the free variables
in the lower triangle of the matrix. In the first column, there are (M − 1) free
variables. In the second column, there are (M − 2) free variables. Therefore,
the total number of free variables is given by:
$$
(M - 1) + (M - 2) + ... + 0 = \frac{M(M - 1)}{2}
$$
Each of these free variables has two choices, i.e., 1 or 0. Therefore,
the total number of such matrices is 2^{M(M−1)/2}. In the case of M = 3, there are
8 possible undirected graphs:
Figure 3: the undirected graph when M = 3
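The counting argument can be confirmed by brute force for small M. The sketch below enumerates every 0/1 matrix of size M × M and keeps those that are symmetric with a zero diagonal. It is an illustrative check, practical only for small M because the enumeration grows as 2^{M²}.

```python
from itertools import product

def count_undirected_graphs(m):
    """Count symmetric 0/1 adjacency matrices with zero diagonal by enumeration."""
    count = 0
    for bits in product((0, 1), repeat=m * m):
        a = [bits[r * m:(r + 1) * m] for r in range(m)]   # rows of the matrix
        symmetric = all(a[i][j] == a[j][i] for i in range(m) for j in range(m))
        zero_diag = all(a[i][i] == 0 for i in range(m))
        if symmetric and zero_diag:
            count += 1
    return count

counts = {m: count_undirected_graphs(m) for m in (1, 2, 3, 4)}
```

For each M the enumerated count equals 2^{M(M−1)/2}; in particular `counts[3]` is 8, matching the figure.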
It is straightforward. Suppose that x_k is the target variable, whose state
may be either value in {−1, +1}, while all other variables are fixed. According to Eq (8.42), we
can obtain:
$$
\begin{aligned}
E(\mathbf{x}, \mathbf{y}) &= h\sum_{i \neq k} x_i - \beta\sum_{i, j \neq k} x_i x_j - \eta\sum_{i \neq k} x_i y_i \\
&\quad + h x_k - \beta\sum_m x_k x_m - \eta x_k y_k
\end{aligned}
$$
Note that we write down the dependence of E(x, y) on x_k explicitly in the
second line. Moreover, the x_i x_j term in the first line
doesn’t include the pairs {x_i, x_j} in which one of the two is x_k. These terms are
accounted for by x_k x_m in the second line, where x_m ranges over
the neighbors of x_k. Noticing that the first line doesn’t depend on x_k, we can
obtain:
$$
E(\mathbf{x}, \mathbf{y})|_{x_k = 1} - E(\mathbf{x}, \mathbf{y})|_{x_k = -1} = 2h - 2\beta\sum_m x_m - 2\eta y_k
$$
Obviously, the difference depends only on quantities local to x_k: h, the
neighbors x_m and its observed value y_k.
It is quite obvious. When h = 0, β = 0, the energy function reduces to
$$
E(\mathbf{x}, \mathbf{y}) = -\eta\sum_i x_i y_i
$$
If there exists some index j which satisfies x_j ≠ y_j, then, considering that
x_j, y_j ∈ {−1, +1}, the product x_j y_j equals −1. By changing the sign of x_j, we can always
increase the value of x_j y_j from −1 to 1, and, thus, decrease the energy
function E(x, y).
Therefore, given the observed binary pixels y_i ∈ {−1, +1}, where i = 1, 2, ..., D,
in order to obtain the minimum of energy, the optimal choice for x_i is to set it
equal to y_i.
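Both statements about the energy function can be checked on a tiny example. The sketch below assumes a hypothetical 2 × 2 image with 4-neighbour links and arbitrary values of h, β and η. It compares the energy change from flipping one pixel with the local formula 2h − 2β Σ_m x_m − 2η y_k, and then minimizes the energy by enumeration for h = β = 0.

```python
from itertools import product

def energy(x, y, h, beta, eta, edges):
    """E(x, y) in the form of Eq (8.42); `edges` lists each neighbouring pair once."""
    return (h * sum(x)
            - beta * sum(x[i] * x[j] for i, j in edges)
            - eta * sum(xi * yi for xi, yi in zip(x, y)))

# Hypothetical 2x2 image, pixels indexed 0..3 row by row, 4-neighbour links.
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
y = [1, -1, -1, 1]
h, beta, eta = 0.3, 1.0, 2.1

def flip_difference(x, k):
    """E with x_k = +1 minus E with x_k = -1, all other pixels fixed."""
    up, down = list(x), list(x)
    up[k], down[k] = 1, -1
    return energy(up, y, h, beta, eta, edges) - energy(down, y, h, beta, eta, edges)

def local_formula(x, k):
    """2h - 2 beta sum_m x_m - 2 eta y_k, using only the neighbours of pixel k."""
    neighbours = [j if i == k else i for i, j in edges if k in (i, j)]
    return 2 * h - 2 * beta * sum(x[m] for m in neighbours) - 2 * eta * y[k]

x = [1, 1, -1, -1]
differences_match = all(
    abs(flip_difference(x, k) - local_formula(x, k)) < 1e-9 for k in range(4))

# With h = beta = 0, search all 2^4 configurations for the energy minimum.
best = min(product((-1, 1), repeat=4),
           key=lambda cfg: energy(cfg, y, 0.0, 0.0, eta, edges))
```

The flag `differences_match` comes out true, and the minimizing configuration `best` coincides with the observed image `y`.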
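This problem can be solved by analogy to Eq (8.49) - Eq (8.54). We begin
by noticing:
$$
p(x_{n-1}, x_n) = \sum_{x_1}...\sum_{x_{n-2}}\sum_{x_{n+1}}...\sum_{x_N} p(\mathbf{x})
$$
We also have:
$$
p(\mathbf{x}) = \frac{1}{Z}\psi_{1,2}(x_1, x_2)\,\psi_{2,3}(x_2, x_3)...\psi_{N-1,N}(x_{N-1}, x_N)
$$
By analogy to Eq (8.52), we can obtain:
$$
\begin{aligned}
p(x_{n-1}, x_n) &= \frac{1}{Z}\left[\sum_{x_{n-2}}\psi_{n-2,n-1}(x_{n-2}, x_{n-1})...\left[\sum_{x_2}\psi_{2,3}(x_2, x_3)\left[\sum_{x_1}\psi_{1,2}(x_1, x_2)\right]\right]...\right] \\
&\quad \times \psi_{n-1,n}(x_{n-1}, x_n) \\
&\quad \times \left[\sum_{x_{n+1}}\psi_{n,n+1}(x_n, x_{n+1})...\left[\sum_{x_N}\psi_{N-1,N}(x_{N-1}, x_N)\right]...\right] \\
&= \frac{1}{Z} \times \mu_\alpha(x_{n-1}) \times \psi_{n-1,n}(x_{n-1}, x_n) \times \mu_\beta(x_n)
\end{aligned}
$$
Just as required.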
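The result p(x_{n−1}, x_n) = μ_α(x_{n−1}) ψ_{n−1,n}(x_{n−1}, x_n) μ_β(x_n) / Z can be compared with brute-force marginalization on a small chain. The sketch below uses 0-indexed nodes, a chain of N = 5 nodes with K = 3 states each, and randomly drawn positive potentials; all of these are illustrative assumptions.

```python
from itertools import product
import random

random.seed(0)
N, K = 5, 3   # chain length and number of states per node (illustrative)

# psi[i][a][b] is the potential linking node i (state a) and node i+1 (state b).
psi = [[[random.uniform(0.1, 1.0) for _ in range(K)] for _ in range(K)]
       for _ in range(N - 1)]

def unnormalised(x):
    """Product of all pairwise potentials for the configuration x."""
    p = 1.0
    for i in range(N - 1):
        p *= psi[i][x[i]][x[i + 1]]
    return p

Z = sum(unnormalised(x) for x in product(range(K), repeat=N))

def mu_alpha(n):
    """Forward message arriving at node n from nodes 0..n-1."""
    msg = [1.0] * K
    for i in range(n):
        msg = [sum(psi[i][a][b] * msg[a] for a in range(K)) for b in range(K)]
    return msg

def mu_beta(n):
    """Backward message arriving at node n from nodes n+1..N-1."""
    msg = [1.0] * K
    for i in range(N - 2, n - 1, -1):
        msg = [sum(psi[i][a][b] * msg[b] for b in range(K)) for a in range(K)]
    return msg

def pair_marginal_messages(n):
    """p(x_{n-1}, x_n) from messages, for 1 <= n <= N-1."""
    a_msg, b_msg = mu_alpha(n - 1), mu_beta(n)
    return [[a_msg[a] * psi[n - 1][a][b] * b_msg[b] / Z for b in range(K)]
            for a in range(K)]

def pair_marginal_brute(n):
    """p(x_{n-1}, x_n) by summing the full joint over all other nodes."""
    table = [[0.0] * K for _ in range(K)]
    for x in product(range(K), repeat=N):
        table[x[n - 1]][x[n]] += unnormalised(x) / Z
    return table
```

For every neighbouring pair the two tables agree up to floating-point error, while the message-based version costs O(N K²) instead of O(K^N).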
We can simply obtain p(x_N) using Eq (8.52) and Eq (8.54):
$$
p(x_N) = \frac{1}{Z}\mu_\alpha(x_N) \qquad (*)
$$
According to Bayes’ Theorem, we have:
$$
p(x_n|x_N) = \frac{p(x_n, x_N)}{p(x_N)}
$$
Therefore, now we only need to derive an expression for p(x_n, x_N), where
n = 1, 2, ..., N − 1. We follow the same procedure as in the previous problem.
Since we know that:
$$
p(x_n, x_N) = \sum_{x_1}...\sum_{x_{n-1}}\sum_{x_{n+1}}...\sum_{x_{N-1}} p(\mathbf{x})
$$
we can obtain:
$$
\begin{aligned}
p(x_n, x_N) &= \frac{1}{Z}\left[\sum_{x_{n-1}}\psi_{n-1,n}(x_{n-1}, x_n)...\left[\sum_{x_2}\psi_{2,3}(x_2, x_3)\left[\sum_{x_1}\psi_{1,2}(x_1, x_2)\right]\right]...\right] \\
&\quad \times \left[\sum_{x_{n+1}}\psi_{n,n+1}(x_n, x_{n+1})...\left[\sum_{x_{N-1}}\psi_{N-2,N-1}(x_{N-2}, x_{N-1})\,\psi_{N-1,N}(x_{N-1}, x_N)\right]...\right]
\end{aligned}
$$
Note that in the second line, the summation term with respect to x_{N−1} is
the product of ψ_{N−2,N−1}(x_{N−2}, x_{N−1}) and ψ_{N−1,N}(x_{N−1}, x_N). So here we can
actually draw an undirected graph with N − 1 nodes, and adopt the proposed
algorithm to solve for p(x_n, x_N). If we use x⋆_n to represent the new nodes, then the
joint distribution can be written as:
$$
p(\mathbf{x}^\star) = \frac{1}{Z^\star}\psi^\star_{1,2}(x^\star_1, x^\star_2)\,\psi^\star_{2,3}(x^\star_2, x^\star_3)...\psi^\star_{N-2,N-1}(x^\star_{N-2}, x^\star_{N-1})
$$
where ψ⋆_{n,n+1}(x⋆_n, x⋆_{n+1}) is defined as:
$$
\psi^\star_{n,n+1}(x^\star_n, x^\star_{n+1}) =
\begin{cases}
\psi_{n,n+1}(x_n, x_{n+1}), & n = 1, 2, ..., N - 3 \\
\psi_{N-2,N-1}(x_{N-2}, x_{N-1})\,\psi_{N-1,N}(x_{N-1}, x_N), & n = N - 2
\end{cases}
$$
In other words, we have combined the original nodes x_{N−1} and x_N.
Moreover, we have the relationship:
$$
p(x_n, x_N) = p(x^\star_n) = \frac{1}{Z^\star}\mu^\star_\alpha(x^\star_n)\,\mu^\star_\beta(x^\star_n), \qquad n = 1, 2, ..., N - 1
$$
By applying the proposed algorithm to the new undirected graph, p(x⋆_n)
can be easily evaluated, and so can p(x_n, x_N).
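It is straightforward to see that every path connecting node x_2 and x_5
in Fig.8.38 must pass through node x_3. Therefore, when x_3 is observed, all paths
are blocked and the conditional independence property holds. For more details, you should read section
8.3.1. According to Bayes’ Theorem, we can obtain:
$$
p(x_2|x_3, x_5) = \frac{p(x_2, x_3, x_5)}{p(x_3, x_5)}
$$
Using the proposed algorithm in section 8.4.1, we can obtain:
$$
\begin{aligned}
p(x_2|x_3, x_5) &= \frac{p(x_2, x_3, x_5)}{p(x_3, x_5)} = \frac{\sum_{x_1}\sum_{x_4} p(\mathbf{x})}{\sum_{x_1}\sum_{x_2}\sum_{x_4} p(\mathbf{x})} \\
&= \frac{\sum_{x_1}\sum_{x_4}\psi_{1,2}\psi_{2,3}\psi_{3,4}\psi_{4,5}}{\sum_{x_1}\sum_{x_2}\sum_{x_4}\psi_{1,2}\psi_{2,3}\psi_{3,4}\psi_{4,5}} \\
&= \frac{\left(\sum_{x_1}\psi_{1,2}\right) \cdot \psi_{2,3} \cdot \left(\sum_{x_4}\psi_{3,4}\psi_{4,5}\right)}{\sum_{x_2}\left[\left(\sum_{x_1}\psi_{1,2}\right)\psi_{2,3}\right] \cdot \left(\sum_{x_4}\psi_{3,4}\psi_{4,5}\right)} \\
&= \frac{\left(\sum_{x_1}\psi_{1,2}\right) \cdot \psi_{2,3}}{\sum_{x_2}\left[\left(\sum_{x_1}\psi_{1,2}\right)\psi_{2,3}\right]}
\end{aligned}
$$
It is obvious that the right hand side doesn’t depend on x_5.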
First, the distribution represented by a directed tree can trivially be
written as an equivalent distribution over an undirected tree by moralization.
You can find more details in section 8.4.2.
Conversely, now we want to represent a distribution, which is given by
an undirected tree, via a directed tree. For example, the distribution defined
by the undirected tree in Fig.4 can be written as:
$$
p(\mathbf{x}) = \frac{1}{Z}\psi_{1,3}(x_1, x_3)\,\psi_{2,3}(x_2, x_3)\,\psi_{3,4}(x_3, x_4)\,\psi_{4,5}(x_4, x_5)
$$
We simply choose x_4 as the root and the corresponding directed tree is
well defined by working outwards. In this case, the distribution defined by
the directed tree is:
$$
p(\mathbf{x}) = p(x_4)\, p(x_5|x_4)\, p(x_3|x_4)\, p(x_1|x_3)\, p(x_2|x_3)
$$
Thus it is not difficult to change an undirected tree to a directed one by
setting:
$$
p(x_4)\, p(x_5|x_4) \propto \psi_{4,5}, \quad p(x_3|x_4) \propto \psi_{3,4}, \quad p(x_2|x_3) \propto \psi_{2,3}, \quad p(x_1|x_3) \propto \psi_{1,3}
$$
Figure 4: Example of changing an undirected tree to a directed one
The symbol ∝ indicates equality up to a normalization term, which is used
to guarantee that the PDF integrates to 1. In summary, in the particular
case of an undirected tree, there is only one path between any pair of nodes,
and thus every maximal clique is given by a pair of two nodes in an undirected
tree. This is because if we choose any three nodes x_1, x_2, x_3, according to the
definition there cannot exist a loop. Otherwise there would be two paths between
x_1 and x_3: (i) x_1 → x_3 and (ii) x_1 → x_2 → x_3. In the directed tree, each node
depends on only one node (except the root), i.e., its parent. Thus we can
easily change an undirected tree to a directed one by matching each potential
function with the corresponding conditional PDF, as shown in the example.
Moreover, we can choose any node in the undirected tree to be the root and
then work outwards to obtain a directed tree. Therefore, in an undirected tree
with n nodes, there are n corresponding directed trees in total.
Problem 8.19-8.29 Solution (Waiting for update)
I am quite confused by the deduction in Eq (8.66). I do not understand the
sum-product algorithm and the max-sum algorithm very well.
0.9 Mixture Models and EM

For each r_nk, when n is fixed and k = 1, 2, ..., K, only one of them equals
1 and the others are all 0. Therefore, there are K possible choices. When N
data points are given, there are K^N possible assignments for {r_nk; n = 1, 2, ..., N; k =
1, 2, ..., K}. For each assignment, the optimal {µ_k; k = 1, 2, ..., K} are
determined by Eq (9.4).
As discussed in the main text, by iteratively performing the E-step and
M-step, the distortion measure in Eq (9.1) is gradually minimized. In the worst
case, we find the optimal assignment and {µ_k} only in the last iteration. In
other words, K^N iterations are required. However, the algorithm is guaranteed to
converge because the number of assignments is finite and the optimal {µ_k} is determined
once the assignment is given.
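By analogy to Eq (9.1), we can write down:
$$
J_N = J_{N-1} + \sum_{k=1}^{K} r_{Nk}\|\mathbf{x}_N - \boldsymbol{\mu}_k\|^2
$$
In the E-step, we still assign the N-th data point x_N to the closest center.
Suppose that this closest center is µ_m. Then the expression above
reduces to:
$$
J_N = J_{N-1} + \|\mathbf{x}_N - \boldsymbol{\mu}_m\|^2
$$
In the M-step, we set the derivative of J_N with respect to µ_k to 0, where
k = 1, 2, ..., K. We can observe that for those µ_k with k ≠ m, we have:
$$
\frac{\partial J_N}{\partial\boldsymbol{\mu}_k} = \frac{\partial J_{N-1}}{\partial\boldsymbol{\mu}_k}
$$
In other words, we only update µ_m in the M-step by setting the
derivative of J_N equal to 0. Utilizing Eq (9.4), we can obtain:
$$
\begin{aligned}
\boldsymbol{\mu}_m^{(N)} &= \frac{\sum_{n=1}^{N-1} r_{nm}\mathbf{x}_n + \mathbf{x}_N}{\sum_{n=1}^{N-1} r_{nm} + 1} \\
&= \frac{\dfrac{\sum_{n=1}^{N-1} r_{nm}\mathbf{x}_n}{\sum_{n=1}^{N-1} r_{nm}} + \dfrac{\mathbf{x}_N}{\sum_{n=1}^{N-1} r_{nm}}}{1 + \dfrac{1}{\sum_{n=1}^{N-1} r_{nm}}} \\
&= \frac{\boldsymbol{\mu}_m^{(N-1)} + \dfrac{\mathbf{x}_N}{\sum_{n=1}^{N-1} r_{nm}}}{1 + \dfrac{1}{\sum_{n=1}^{N-1} r_{nm}}} \\
&= \boldsymbol{\mu}_m^{(N-1)} + \frac{\dfrac{\mathbf{x}_N}{\sum_{n=1}^{N-1} r_{nm}} - \dfrac{\boldsymbol{\mu}_m^{(N-1)}}{\sum_{n=1}^{N-1} r_{nm}}}{1 + \dfrac{1}{\sum_{n=1}^{N-1} r_{nm}}} \\
&= \boldsymbol{\mu}_m^{(N-1)} + \frac{\mathbf{x}_N - \boldsymbol{\mu}_m^{(N-1)}}{1 + \sum_{n=1}^{N-1} r_{nm}}
\end{aligned}
$$
So far we have obtained a sequential on-line update formula just as
required.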
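The on-line update can be checked against the batch mean of Eq (9.4) for a single cluster. In the sketch below every point is assumed to be assigned to the same center µ_m, so that Σ r_nm is simply the number of points seen so far; the data values are made up for illustration.

```python
def sequential_mean(points):
    """Apply mu <- mu + (x - mu) / (1 + count), where count is the number
    of points already assigned to this center."""
    mu, count = list(points[0]), 1
    for x in points[1:]:
        mu = [m + (xi - m) / (1 + count) for m, xi in zip(mu, x)]
        count += 1
    return mu

cluster = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [6.0, 2.0]]   # hypothetical points
online = sequential_mean(cluster)
batch = [sum(col) / len(cluster) for col in zip(*cluster)]
```

After the last point has been processed, `online` equals the batch mean `batch`, so the sequential formula never needs to revisit earlier data.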
We simply follow the hint.

∑
p(x) = p(z) p(x|z)
z
K [
∑∏ ] zk
= (πk N (x|µk , Σk )
z k=1
Note that we have used the 1-of-K coding scheme for $\mathbf{z} = [z_1, z_2, ..., z_K]^T$. To
be more specific, only one of $z_1, z_2, ..., z_K$ will be 1 and all others will equal 0.
Therefore, the summation over $\mathbf{z}$ actually consists of $K$ terms, and the $k$-th
term corresponds to $z_k$ equal to 1 and the others 0. Moreover, for the $k$-th term,
the product reduces to $\pi_k\,\mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)$. Therefore, we can obtain:
$$p(\mathbf{x}) = \sum_{\mathbf{z}} \prod_{k=1}^{K} \big[\pi_k\,\mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)\big]^{z_k} = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)$$
Just as required.
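According to Bayes’ Theorem, we can write:
$$p(\theta|\mathbf{X}) \propto p(\mathbf{X}|\theta)\,p(\theta)$$
Taking the logarithm of both sides, the proportionality becomes an additive constant:
$$\ln p(\theta|\mathbf{X}) = \ln p(\mathbf{X}|\theta) + \ln p(\theta) + \text{const}$$
Further utilizing Eq (9.29), we can obtain:
$$\begin{aligned}
\ln p(\theta|\mathbf{X}) &= \ln\Big\{\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\theta)\Big\} + \ln p(\theta) + \text{const} \\
&= \ln\Big\{\Big[\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\theta)\Big]\cdot p(\theta)\Big\} + \text{const} \\
&= \ln\Big\{\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\theta)\,p(\theta)\Big\} + \text{const}
\end{aligned}$$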
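In other words, in this case, the only modification is that the term $p(\mathbf{X}, \mathbf{Z}|\theta)$
in Eq (9.29) is replaced by $p(\mathbf{X}, \mathbf{Z}|\theta)\,p(\theta)$. Therefore, in the E-step, we still
need to calculate the posterior $p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})$, and in the M-step we are
required to maximize $Q'(\theta, \theta^{\text{old}})$. By analogy to Eq (9.30), we can
write down $Q'(\theta, \theta^{\text{old}})$:
$$\begin{aligned}
Q'(\theta, \theta^{\text{old}}) &= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}}) \ln\big[p(\mathbf{X}, \mathbf{Z}|\theta)\,p(\theta)\big] \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}}) \big[\ln p(\mathbf{X}, \mathbf{Z}|\theta) + \ln p(\theta)\big] \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z}|\theta) + \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}}) \ln p(\theta) \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z}|\theta) + \ln p(\theta)\cdot\sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}}) \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z}|\theta) + \ln p(\theta) \\
&= Q(\theta, \theta^{\text{old}}) + \ln p(\theta)
\end{aligned}$$
Just as required.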
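Notice that the conditioning on $\mu$, $\Sigma$ and $\pi$ can be omitted here, and we only
need to prove that $p(\mathbf{Z}|\mathbf{X})$ can be written as the product of $p(\mathbf{z}_n|\mathbf{x}_n)$. Correspondingly,
the small dots representing $\mu$, $\Sigma$ and $\pi$ can also be omitted in Fig 9.6.
Observing Fig 9.6 and using the definition, we can write:
$$p(\mathbf{X}, \mathbf{Z}) = p(\mathbf{x}_1|\mathbf{z}_1)\,p(\mathbf{z}_1)\cdots p(\mathbf{x}_N|\mathbf{z}_N)\,p(\mathbf{z}_N) = p(\mathbf{x}_1, \mathbf{z}_1)\cdots p(\mathbf{x}_N, \mathbf{z}_N)$$
Moreover, since there is no link from $\mathbf{z}_m$ to $\mathbf{z}_n$, from $\mathbf{x}_m$ to $\mathbf{x}_n$, or from
$\mathbf{z}_m$ to $\mathbf{x}_n$ ($m \neq n$), we can obtain:
$$p(\mathbf{Z}) = p(\mathbf{z}_1)\cdots p(\mathbf{z}_N), \qquad p(\mathbf{X}) = p(\mathbf{x}_1)\cdots p(\mathbf{x}_N)$$
These can also be verified by calculating the marginal distribution from
$p(\mathbf{X}, \mathbf{Z})$, for example:
$$p(\mathbf{Z}) = \sum_{\mathbf{X}} p(\mathbf{X}, \mathbf{Z}) = \sum_{\mathbf{x}_1, ..., \mathbf{x}_N} p(\mathbf{x}_1, \mathbf{z}_1)\cdots p(\mathbf{x}_N, \mathbf{z}_N) = p(\mathbf{z}_1)\cdots p(\mathbf{z}_N)$$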
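According to Bayes’ Theorem, we have
$$\begin{aligned}
p(\mathbf{Z}|\mathbf{X}) &= \frac{p(\mathbf{X}|\mathbf{Z})\,p(\mathbf{Z})}{p(\mathbf{X})} \\
&= \frac{\big[\prod_{n=1}^{N} p(\mathbf{x}_n|\mathbf{z}_n)\big]\big[\prod_{n=1}^{N} p(\mathbf{z}_n)\big]}{\prod_{n=1}^{N} p(\mathbf{x}_n)} \\
&= \prod_{n=1}^{N} \frac{p(\mathbf{x}_n|\mathbf{z}_n)\,p(\mathbf{z}_n)}{p(\mathbf{x}_n)} \\
&= \prod_{n=1}^{N} p(\mathbf{z}_n|\mathbf{x}_n)
\end{aligned}$$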
Just as required. The essence behind the problem is that in the directed
graph, there are only links from zn to xn . The deeper reason is that (i) the
mixture model is given by Fig 9.4, and (ii) we assume the data {xn } is i.i.d,
and thus there is no link from xm to xn .
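By analogy to Eq (9.19), we calculate the derivative of Eq (9.14) with
respect to $\Sigma$:
$$\frac{\partial \ln p}{\partial \Sigma} = \frac{\partial}{\partial \Sigma}\Big\{\sum_{n=1}^{N} \ln a_n\Big\} = \sum_{n=1}^{N} \frac{1}{a_n}\frac{\partial a_n}{\partial \Sigma} \qquad (*)$$
where we have defined:
$$a_n = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma)$$
Recall that in Prob.2.34, we have proved:
$$\frac{\partial \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma)}{\partial \Sigma} = -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\mathbf{S}_{nk}\Sigma^{-1}, \qquad \mathbf{S}_{nk} = (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^T$$
Therefore, we can obtain:
$$\begin{aligned}
\frac{\partial a_n}{\partial \Sigma} &= \frac{\partial}{\partial \Sigma}\Big\{\sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma)\Big\} \\
&= \sum_{k=1}^{K} \pi_k \frac{\partial}{\partial \Sigma}\Big\{\mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma)\Big\} \\
&= \sum_{k=1}^{K} \pi_k \frac{\partial}{\partial \Sigma}\Big\{\exp\big[\ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma)\big]\Big\} \\
&= \sum_{k=1}^{K} \pi_k \cdot \exp\big[\ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma)\big] \cdot \frac{\partial}{\partial \Sigma}\big[\ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma)\big] \\
&= \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \cdot \Big(-\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\mathbf{S}_{nk}\Sigma^{-1}\Big)
\end{aligned}$$
Substituting the equation above into $(*)$, we can obtain:
$$\begin{aligned}
\frac{\partial \ln p}{\partial \Sigma} &= \sum_{n=1}^{N} \frac{1}{a_n}\frac{\partial a_n}{\partial \Sigma} \\
&= \sum_{n=1}^{N} \frac{\sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma) \cdot \big(-\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\mathbf{S}_{nk}\Sigma^{-1}\big)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma)} \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk}) \cdot \Big(-\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\mathbf{S}_{nk}\Sigma^{-1}\Big) \\
&= -\frac{1}{2}\Big\{\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\Big\}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\Big\{\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\,\mathbf{S}_{nk}\Big\}\Sigma^{-1}
\end{aligned}$$
If we set the derivative equal to 0, we can obtain:
$$\Sigma = \frac{\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\,\mathbf{S}_{nk}}{\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})}$$
Since the responsibilities of each data point sum to 1 over $k$, the denominator equals $N$.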
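We begin by calculating the derivative of Eq (9.36) with respect to $\mu_k$ (the inner summation index is written as $j$ to avoid a clash with the fixed $k$):
$$\begin{aligned}
\frac{\partial \ln p}{\partial \mu_k} &= \frac{\partial}{\partial \mu_k}\Big\{\sum_{n=1}^{N}\sum_{j=1}^{K} z_{nj}\big[\ln \pi_j + \ln \mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma_j)\big]\Big\} \\
&= \frac{\partial}{\partial \mu_k}\Big\{\sum_{n=1}^{N} z_{nk}\big[\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)\big]\Big\} \\
&= \sum_{n=1}^{N} z_{nk}\,\frac{\partial}{\partial \mu_k}\Big\{\ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)\Big\} \\
&= \sum_{\mathbf{x}_n \in C_k} \frac{\partial}{\partial \mu_k}\Big\{\ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)\Big\}
\end{aligned}$$
where we have used $\mathbf{x}_n \in C_k$ to denote the data points $\mathbf{x}_n$ which are
assigned to the $k$-th cluster. Therefore, $\mu_k$ is given by the mean of those
$\mathbf{x}_n \in C_k$, just as in the case of a single Gaussian. It is exactly the same for the
covariance. Next, we maximize Eq (9.36) with respect to $\pi_k$ by introducing a
Lagrange multiplier:
$$L = \ln p + \lambda\Big(\sum_{k=1}^{K} \pi_k - 1\Big)$$
We calculate the derivative of $L$ with respect to $\pi_k$ and set it to 0:
$$\frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{z_{nk}}{\pi_k} + \lambda = 0$$
We multiply both sides by $\pi_k$ and sum over $k$, making use of the constraint
Eq (9.9), yielding $\lambda = -N$. Substituting it back into the expression, we can
obtain:
$$\pi_k = \frac{1}{N}\sum_{n=1}^{N} z_{nk}$$
Just as required.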
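Since $\gamma(z_{nk})$ is fixed, the only dependency of Eq (9.40) on $\mu_k$ occurs in the
Gaussian, yielding:
$$\begin{aligned}
\frac{\partial E_{\mathbf{z}}[\ln p]}{\partial \mu_k} &= \frac{\partial}{\partial \mu_k}\Big\{\sum_{n=1}^{N} \gamma(z_{nk}) \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)\Big\} \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \cdot \frac{\partial \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)}{\partial \mu_k} \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \cdot \Sigma_k^{-1}(\mathbf{x}_n - \mu_k)
\end{aligned}$$
Setting the derivative equal to 0, we obtain Eq (9.16) (the overall sign is irrelevant once the expression is set to zero), and consequently
Eq (9.17) just as required. Note that there is a typo in Eq (9.16): $\Sigma_k$
should be $\Sigma_k^{-1}$.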
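We first calculate the derivative of Eq (9.40) with respect to $\Sigma_k$:
$$\begin{aligned}
\frac{\partial E_{\mathbf{z}}}{\partial \Sigma_k} &= \frac{\partial}{\partial \Sigma_k}\Big\{\sum_{n=1}^{N} \gamma(z_{nk}) \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)\Big\} \\
&= \sum_{n=1}^{N} \gamma(z_{nk})\,\frac{\partial \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)}{\partial \Sigma_k} \\
&= \sum_{n=1}^{N} \gamma(z_{nk}) \cdot \Big[-\frac{1}{2}\Sigma_k^{-1} + \frac{1}{2}\Sigma_k^{-1}\mathbf{S}_{nk}\Sigma_k^{-1}\Big]
\end{aligned}$$
As in Prob 9.6, we have defined:
$$\mathbf{S}_{nk} = (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^T$$
Setting the derivative equal to 0 and rearranging it, we obtain:
$$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{S}_{nk}}{\sum_{n=1}^{N} \gamma(z_{nk})} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{S}_{nk}}{N_k}$$
where $N_k$ is given by Eq (9.18). So now we have obtained Eq (9.19) just
as required. Next, to maximize Eq (9.40) with respect to $\pi_k$, we still need to
introduce a Lagrange multiplier to enforce that the $\pi_k$ sum to 1 over $k$,
as in Prob 9.7:
$$L = E_{\mathbf{z}} + \lambda\Big(\sum_{k=1}^{K} \pi_k - 1\Big)$$
We calculate the derivative of $L$ with respect to $\pi_k$ and set it to 0:
$$\frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \lambda = 0$$
We multiply both sides by $\pi_k$ and sum over $k$, making use of the constraint
Eq (9.9), yielding $\lambda = -N$ (see Eq (9.20)-Eq (9.22) for more details).
Substituting it back into the expression, we can obtain:
$$\pi_k = \frac{1}{N}\sum_{n=1}^{N} \gamma(z_{nk}) = \frac{N_k}{N}$$
Just as Eq (9.22).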
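According to the properties of probability density functions, we know that:
$$p(\mathbf{x}_b|\mathbf{x}_a) = \frac{p(\mathbf{x}_a, \mathbf{x}_b)}{p(\mathbf{x}_a)} = \frac{p(\mathbf{x})}{p(\mathbf{x}_a)} = \sum_{k=1}^{K} \frac{\pi_k}{p(\mathbf{x}_a)} \cdot p(\mathbf{x}|k)$$
Note that here $p(\mathbf{x}_a)$ can be viewed as a normalization constant which
guarantees that the integral of $p(\mathbf{x}_b|\mathbf{x}_a)$ over $\mathbf{x}_b$ equals 1. Similarly, we
can also obtain:
$$p(\mathbf{x}_a|\mathbf{x}_b) = \sum_{k=1}^{K} \frac{\pi_k}{p(\mathbf{x}_b)} \cdot p(\mathbf{x}|k)$$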
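According to the problem description, the expectation, i.e., Eq (9.40), can
now be written as:
$$E_{\mathbf{z}}[\ln p] = \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\Big\{\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n|\mu_k, \epsilon\mathbf{I})\Big\}$$
In the M-step, we are required to maximize the expression above with
respect to $\mu_k$ and $\pi_k$. In Prob.9.8, we have already proved that $\mu_k$ should be
given by Eq (9.17):
$$\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n \qquad (*)$$
where $N_k$ is given by Eq (9.18). Moreover, in this case, by analogy to Eq
(9.16), $\gamma(z_{nk})$ is slightly different:
$$\gamma(z_{nk}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n|\mu_k, \epsilon\mathbf{I})}{\sum_j \pi_j\,\mathcal{N}(\mathbf{x}_n|\mu_j, \epsilon\mathbf{I})}$$
When $\epsilon \to 0$, we can obtain:
$$\sum_j \pi_j\,\mathcal{N}(\mathbf{x}_n|\mu_j, \epsilon\mathbf{I}) \approx \pi_m\,\mathcal{N}(\mathbf{x}_n|\mu_m, \epsilon\mathbf{I}), \quad \text{where } m = \arg\min_j \|\mathbf{x}_n - \mu_j\|^2$$
To be more clear, the summation is dominated by the largest of the terms $\pi_j\,\mathcal{N}(\mathbf{x}_n|\mu_j, \epsilon\mathbf{I})$,
and this term is in turn determined by the exponent, i.e., $-\|\mathbf{x}_n - \mu_j\|^2$. Therefore,
$\gamma(z_{nk})$ is given exactly by Eq (9.2), i.e., we have $\gamma(z_{nk}) = r_{nk}$. Combining
with $(*)$, we obtain exactly Eq (9.4). Next, according to Prob.9.9, $\pi_k$ is
given by Eq (9.22):
$$\pi_k = \frac{N_k}{N} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})}{N} = \frac{\sum_{n=1}^{N} r_{nk}}{N}$$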
In other words, πk equals the fraction of the data points assigned to the
k-th cluster.
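To see the hardening of the responsibilities numerically, here is a minimal sketch. The helper `responsibilities` and the toy numbers are hypothetical; the code evaluates the responsibilities for isotropic covariance $\epsilon\mathbf{I}$ in log space, dropping the normalisation factor that is common to all components.

```python
import math


def responsibilities(x, means, weights, eps):
    """Responsibilities of a Gaussian mixture with shared covariance eps * I."""
    # log(pi_k) - ||x - mu_k||^2 / (2 eps); the common factor cancels out.
    logs = [
        math.log(w) - sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) / (2.0 * eps)
        for mu, w in zip(means, weights)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical example: x is closer to the first mean, but the second
# component has a much larger mixing coefficient.
x = [1.0, 0.0]
means = [[0.0, 0.0], [3.0, 0.0]]
weights = [0.1, 0.9]
soft = responsibilities(x, means, weights, eps=10.0)
hard = responsibilities(x, means, weights, eps=1e-3)
```

For large $\epsilon$ the mixing coefficients dominate, whereas for small $\epsilon$ essentially all responsibility goes to the closest center regardless of $\pi_k$ (as long as $\pi_k > 0$), which matches the K-means assignment.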
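First we calculate the mean $E[\mathbf{x}]$ of the mixture:
$$\begin{aligned}
E[\mathbf{x}] &= \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x} \\
&= \int \mathbf{x}\sum_{k=1}^{K} \pi_k\,p(\mathbf{x}|k)\,d\mathbf{x} \\
&= \sum_{k=1}^{K} \pi_k \int \mathbf{x}\,p(\mathbf{x}|k)\,d\mathbf{x} \\
&= \sum_{k=1}^{K} \pi_k\,\mu_k
\end{aligned}$$
Then we deal with the covariance matrix. For an arbitrary random variable
$\mathbf{x}$, according to Eq (2.63) we have:
$$\text{cov}[\mathbf{x}] = E\big[(\mathbf{x} - E[\mathbf{x}])(\mathbf{x} - E[\mathbf{x}])^T\big] = E[\mathbf{x}\mathbf{x}^T] - E[\mathbf{x}]E[\mathbf{x}]^T$$
Since $E[\mathbf{x}]$ is already obtained, we only need to find $E[\mathbf{x}\mathbf{x}^T]$. First we
focus on the $k$-th component only and rearrange the expression above, yielding:
$$E_k[\mathbf{x}\mathbf{x}^T] = \text{cov}_k[\mathbf{x}] + E_k[\mathbf{x}]E_k[\mathbf{x}]^T = \Sigma_k + \mu_k\mu_k^T$$
We further use Eq (2.62), yielding:
$$\begin{aligned}
E[\mathbf{x}\mathbf{x}^T] &= \int \mathbf{x}\mathbf{x}^T \sum_{k=1}^{K} \pi_k\,p(\mathbf{x}|k)\,d\mathbf{x} \\
&= \sum_{k=1}^{K} \pi_k \int \mathbf{x}\mathbf{x}^T p(\mathbf{x}|k)\,d\mathbf{x} \\
&= \sum_{k=1}^{K} \pi_k\,E_k[\mathbf{x}\mathbf{x}^T] \\
&= \sum_{k=1}^{K} \pi_k\,(\mu_k\mu_k^T + \Sigma_k)
\end{aligned}$$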
Therefore, we obtain Eq (9.50) just as required.
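As a quick sanity check of the two results, the following sketch evaluates them in the one-dimensional special case, where $\Sigma_k$ reduces to a variance. The function name `mixture_moments` and the example parameters are assumptions for illustration.

```python
def mixture_moments(weights, means, variances):
    """Mean and variance of a 1-D mixture from its component moments."""
    # E[x] = sum_k pi_k mu_k
    mean = sum(w * m for w, m in zip(weights, means))
    # E[x^2] = sum_k pi_k (sigma_k^2 + mu_k^2)
    second = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    # var[x] = E[x^2] - E[x]^2
    return mean, second - mean * mean


# Two equally weighted unit-variance components centred at -1 and +1.
mean, var = mixture_moments([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Here the mixture variance exceeds the component variance because the spread of the component means contributes as well.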
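First, let’s make this problem clearer. For a mixture of Bernoulli distributions,
the complete-data log likelihood is given by Eq (9.54) and the
model parameters are $\pi_k$ and $\mu_k$. To obtain those parameters, we
can adopt the EM algorithm. In the E-step, we calculate $\gamma(z_{nk})$ as shown in Eq
(9.56). In the M-step, we update $\pi_k$ and $\mu_k$ according to Eq (9.59) and Eq
(9.60), where $N_k$ and $\bar{\mathbf{x}}_k$ are defined in Eq (9.57) and Eq (9.58). Now let’s go
back to this problem. The expectation of $\mathbf{x}$ is given by Eq (9.49):
$$E[\mathbf{x}] = \sum_{k=1}^{K} \pi_k^{(\text{opt})}\mu_k^{(\text{opt})}$$
Here $\pi_k^{(\text{opt})}$ and $\mu_k^{(\text{opt})}$ are the parameters obtained when EM has converged.
Using Eq (9.58) and Eq (9.59), we can obtain:
$$\begin{aligned}
E[\mathbf{x}] &= \sum_{k=1}^{K} \pi_k^{(\text{opt})}\mu_k^{(\text{opt})} \\
&= \sum_{k=1}^{K} \pi_k^{(\text{opt})}\,\frac{1}{N_k^{(\text{opt})}}\sum_{n=1}^{N} \gamma(z_{nk})^{(\text{opt})}\mathbf{x}_n \\
&= \sum_{k=1}^{K} \frac{N_k^{(\text{opt})}}{N}\,\frac{1}{N_k^{(\text{opt})}}\sum_{n=1}^{N} \gamma(z_{nk})^{(\text{opt})}\mathbf{x}_n \\
&= \sum_{k=1}^{K} \frac{1}{N}\sum_{n=1}^{N} \gamma(z_{nk})^{(\text{opt})}\mathbf{x}_n \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} \frac{\gamma(z_{nk})^{(\text{opt})}\mathbf{x}_n}{N} \\
&= \sum_{n=1}^{N} \frac{\mathbf{x}_n}{N}\sum_{k=1}^{K} \gamma(z_{nk})^{(\text{opt})} \\
&= \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n = \bar{\mathbf{x}}
\end{aligned}$$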
If we set all $\mu_k$ equal to the same value $\hat{\mu}$ in initialization, in the first E-step we can
obtain:
$$\gamma(z_{nk})^{(1)} = \frac{\pi_k^{(0)}\,p(\mathbf{x}_n|\mu_k = \hat{\mu})}{\sum_{j=1}^{K} \pi_j^{(0)}\,p(\mathbf{x}_n|\mu_j = \hat{\mu})} = \frac{\pi_k^{(0)}}{\sum_{j=1}^{K} \pi_j^{(0)}} = \pi_k^{(0)}$$
Note that here $\hat{\mu}$ and $\pi_k^{(0)}$ are the initial values. In the subsequent M-step,
according to Eq (9.57)-(9.60), we can obtain:
$$\mu_k^{(1)} = \frac{1}{N_k^{(1)}}\sum_{n=1}^{N} \gamma(z_{nk})^{(1)}\mathbf{x}_n = \frac{\sum_{n=1}^{N} \gamma(z_{nk})^{(1)}\mathbf{x}_n}{\sum_{n=1}^{N} \gamma(z_{nk})^{(1)}} = \frac{\sum_{n=1}^{N} \pi_k^{(0)}\mathbf{x}_n}{\sum_{n=1}^{N} \pi_k^{(0)}} = \frac{\sum_{n=1}^{N} \mathbf{x}_n}{N}$$
and
$$\pi_k^{(1)} = \frac{N_k^{(1)}}{N} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})^{(1)}}{N} = \frac{\sum_{n=1}^{N} \pi_k^{(0)}}{N} = \pi_k^{(0)}$$
In other words, in this case, after the first EM iteration, we find that the
new $\mu_k^{(1)}$ are all identical and equal to $\bar{\mathbf{x}}$. Moreover, the new $\pi_k^{(1)}$ are
identical to their corresponding initial values $\pi_k^{(0)}$. Therefore, in the second
EM iteration, we can similarly conclude that:
$$\mu_k^{(2)} = \mu_k^{(1)} = \bar{\mathbf{x}}, \qquad \pi_k^{(2)} = \pi_k^{(1)} = \pi_k^{(0)}$$
In other words, the EM algorithm actually stops after the first iteration.
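This behaviour is easy to reproduce with a small experiment. The sketch below is a bare-bones EM step for a Bernoulli mixture written for illustration only; the function name `bernoulli_em_step` and the toy data set are assumptions, and no safeguards against zero probabilities are included.

```python
def bernoulli_em_step(data, weights, means):
    """One EM iteration for a mixture of Bernoulli distributions."""
    K, D, N = len(weights), len(data[0]), len(data)
    # E-step: responsibilities gamma(z_nk), cf. Eq (9.56).
    resp = []
    for x in data:
        unnorm = []
        for k in range(K):
            p = weights[k]
            for i in range(D):
                p *= means[k][i] if x[i] == 1 else 1.0 - means[k][i]
            unnorm.append(p)
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    # M-step: N_k, then mu_k and pi_k, cf. Eq (9.57)-(9.60).
    n_k = [sum(r[k] for r in resp) for k in range(K)]
    new_means = [
        [sum(r[k] * x[i] for r, x in zip(resp, data)) / n_k[k] for i in range(D)]
        for k in range(K)
    ]
    new_weights = [n / N for n in n_k]
    return new_weights, new_means


# Hypothetical binary data; both components start from the same mean vector.
data = [[1, 0], [1, 1], [0, 0], [1, 0]]
weights, means = [0.3, 0.7], [[0.5, 0.5], [0.5, 0.5]]
w1, m1 = bernoulli_em_step(data, weights, means)
w2, m2 = bernoulli_em_step(data, w1, m1)
```

After the first step both mean vectors equal the sample mean of the data, the mixing coefficients are unchanged, and the second step leaves everything as it is.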

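$$\begin{aligned}
p(\mathbf{x}, \mathbf{z}|\mu, \pi) &= p(\mathbf{x}|\mathbf{z}, \mu) \cdot p(\mathbf{z}|\pi) \\
&= \prod_{k=1}^{K} p(\mathbf{x}|\mu_k)^{z_k} \cdot \prod_{k=1}^{K} \pi_k^{z_k} \\
&= \prod_{k=1}^{K} \big[\pi_k\,p(\mathbf{x}|\mu_k)\big]^{z_k}
\end{aligned}$$
Then we marginalize over $\mathbf{z}$, yielding:
$$p(\mathbf{x}|\mu) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z}|\mu, \pi) = \sum_{\mathbf{z}} \prod_{k=1}^{K} \big[\pi_k\,p(\mathbf{x}|\mu_k)\big]^{z_k}$$
The summation over $\mathbf{z}$ is made up of $K$ terms, and the $k$-th term corresponds
to $z_k = 1$ with all other $z_j$, $j \neq k$, equal to 0. Therefore, the $k$-th term
simply reduces to $\pi_k\,p(\mathbf{x}|\mu_k)$. Hence, performing the summation over $\mathbf{z}$
finally gives Eq (9.47) just as required. To be more clear, we summarize
the aforementioned statement:
$$\begin{aligned}
p(\mathbf{x}|\mu) &= \sum_{\mathbf{z}} \prod_{k=1}^{K} \big[\pi_k\,p(\mathbf{x}|\mu_k)\big]^{z_k} \\
&= \prod_{k=1}^{K} \big[\pi_k\,p(\mathbf{x}|\mu_k)\big]^{z_k}\Big|_{z_1 = 1} + ... + \prod_{k=1}^{K} \big[\pi_k\,p(\mathbf{x}|\mu_k)\big]^{z_k}\Big|_{z_K = 1} \\
&= \pi_1\,p(\mathbf{x}|\mu_1) + ... + \pi_K\,p(\mathbf{x}|\mu_K) \\
&= \sum_{k=1}^{K} \pi_k\,p(\mathbf{x}|\mu_k)
\end{aligned}$$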
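Noticing that $\pi_k$ doesn’t depend on any $\mu_{ki}$, we can omit the first term in
the braces when calculating the derivative of Eq (9.55) with respect to
$\mu_{ki}$:
$$\begin{aligned}
\frac{\partial E_{\mathbf{z}}[\ln p]}{\partial \mu_{ki}} &= \frac{\partial}{\partial \mu_{ki}}\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\Big\{\sum_{i=1}^{D} \big[x_{ni}\ln \mu_{ki} + (1 - x_{ni})\ln(1 - \mu_{ki})\big]\Big\} \\
&= \sum_{n=1}^{N} \gamma(z_{nk})\,\frac{\partial}{\partial \mu_{ki}}\Big\{x_{ni}\ln \mu_{ki} + (1 - x_{ni})\ln(1 - \mu_{ki})\Big\} \\
&= \sum_{n=1}^{N} \gamma(z_{nk})\Big(\frac{x_{ni}}{\mu_{ki}} - \frac{1 - x_{ni}}{1 - \mu_{ki}}\Big) \\
&= \sum_{n=1}^{N} \gamma(z_{nk})\,\frac{x_{ni} - \mu_{ki}}{\mu_{ki}(1 - \mu_{ki})}
\end{aligned}$$
Setting the derivative equal to 0, we can obtain:
$$\mu_{ki} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\,x_{ni}}{\sum_{n=1}^{N} \gamma(z_{nk})} = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,x_{ni}$$
where $N_k$ is defined as in Eq (9.57). If we group all the $\mu_{ki}$ into a column
vector, i.e., $\mu_k = [\mu_{k1}, \mu_{k2}, ..., \mu_{kD}]^T$, we obtain Eq (9.59) just as required.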
We follow the hint, beginning by introducing a Lagrange multiplier:
$$L = E_{\mathbf{z}}[\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi)] + \lambda\Big(\sum_{k=1}^{K} \pi_k - 1\Big)$$
We calculate the derivative of $L$ with respect to $\pi_k$ and then set it equal
to 0:
$$\frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \lambda = 0 \qquad (*)$$
Here $E_{\mathbf{z}}[\ln p]$ is given by Eq (9.55). We first multiply both sides of the
expression by $\pi_k$ and then sum over $k$, which gives:
$$\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk}) + \sum_{k=1}^{K} \lambda\pi_k = 0$$
Noticing that $\sum_{k=1}^{K} \pi_k$ equals 1, we can obtain:
$$\lambda = -\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})$$
Finally, substituting it back into $(*)$ and rearranging it, we can obtain:
$$\pi_k = -\frac{\sum_{n=1}^{N} \gamma(z_{nk})}{\lambda} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})}{\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})} = \frac{N_k}{N}$$
where $N_k$ is defined by Eq (9.57) and $N$ is the sum of $N_k$ over $k$,
which is also equal to the number of data points.
The incomplete-data log likelihood is given by Eq (9.51), and $p(\mathbf{x}_n|\mu_k)$ lies
in the interval $[0, 1]$, which can be easily verified from its definition, i.e., Eq
(9.44). Therefore, we can obtain:
$$\ln p(\mathbf{X}|\mu, \pi) = \sum_{n=1}^{N} \ln\Big\{\sum_{k=1}^{K} \pi_k\,p(\mathbf{x}_n|\mu_k)\Big\} \leq \sum_{n=1}^{N} \ln\Big\{\sum_{k=1}^{K} \pi_k \times 1\Big\} = \sum_{n=1}^{N} \ln 1 = 0$$
where we have used the fact that the logarithm is monotonically increasing,
and that the sum of $\pi_k$ over $k$ equals 1. Moreover, to achieve
equality, we need $p(\mathbf{x}_n|\mu_k)$ equal to 1 for all $n = 1, 2, ..., N$. However, this
is hardly possible.
To illustrate this, suppose that $p(\mathbf{x}_n|\mu_k)$ equals 1 for all data points. Without
loss of generality, consider two data points $\mathbf{x}_1 = [x_{11}, x_{12}, ..., x_{1D}]^T$ and
$\mathbf{x}_2 = [x_{21}, x_{22}, ..., x_{2D}]^T$ whose $i$-th entries are different. We further assume
$x_{1i} = 1$ and $x_{2i} = 0$, since $x_i$ is a binary variable. According to Eq (9.44), if we
want $p(\mathbf{x}_1|\mu_k) = 1$, we must have $\mu_{ki} = 1$ (otherwise it must be less than 1).
However, this makes $p(\mathbf{x}_2|\mu_k)$ equal to 0, since there is a factor $1 - \mu_{ki} = 0$ in
the product shown in Eq (9.44).
Therefore, only when the data set is pathological will we reach this singularity
point by adopting EM. Note that in the main text, the author states
that the condition should be pathological initialization. This is also true. For
instance, in the extreme case, when the data set is not pathological, if we
initialize one $\pi_k$ equal to 1 and all others to 0, and some of the $\mu_{ki}$ to 1 and the others to 0,
we may also reach the singularity.
In Prob.9.4, we have proved that if we want to maximize the posterior
by EM, the only modification is that in the M-step, we need to maximize
$Q'(\theta, \theta^{\text{old}}) = Q(\theta, \theta^{\text{old}}) + \ln p(\theta)$. Here $Q(\theta, \theta^{\text{old}})$ has already been given by
$E_{\mathbf{z}}[\ln p]$, i.e., Eq (9.55). Therefore, we only need to derive $\ln p(\theta)$. Note that $\ln p(\theta)$ is
made up of two parts: (i) the prior for $\mu_k$ and (ii) the prior for $\pi$. We begin by
dealing with the first part. Here we assume that, for fixed $k$, the Beta prior is the same for all $\mu_{ki}$, i.e.:
$$p(\mu_{ki}|a_k, b_k) = \frac{\Gamma(a_k + b_k)}{\Gamma(a_k)\Gamma(b_k)}\,\mu_{ki}^{a_k - 1}(1 - \mu_{ki})^{b_k - 1}, \quad i = 1, 2, ..., D$$
Therefore, the contribution of this Beta prior to $\ln p(\theta)$ should be given by:
$$\sum_{k=1}^{K}\sum_{i=1}^{D} \big[(a_k - 1)\ln \mu_{ki} + (b_k - 1)\ln(1 - \mu_{ki})\big]$$
One thing worth mentioning is that since we will maximize $Q'(\theta, \theta^{\text{old}})$
with respect to $\pi$ and $\mu_k$, we can omit the terms which do not depend on $\pi$ or $\mu_k$,
such as $\Gamma(a_k + b_k)/\big(\Gamma(a_k)\Gamma(b_k)\big)$. Then we deal with the second part. According
to Eq (2.38), we can obtain:
$$p(\pi|\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K} \pi_k^{\alpha_k - 1}$$
Therefore, the contribution of the Dirichlet prior to $\ln p(\theta)$ should be given
by:
$$\sum_{k=1}^{K} (\alpha_k - 1)\ln \pi_k$$
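Therefore, now $Q'(\theta, \theta^{\text{old}})$ can be written as:
$$Q'(\theta, \theta^{\text{old}}) = E_{\mathbf{z}}[\ln p] + \sum_{k=1}^{K}\sum_{i=1}^{D} \big[(a_k - 1)\ln \mu_{ki} + (b_k - 1)\ln(1 - \mu_{ki})\big] + \sum_{k=1}^{K} (\alpha_k - 1)\ln \pi_k$$
Similarly, we calculate the derivative of $Q'(\theta, \theta^{\text{old}})$ with respect to $\mu_{ki}$.
This can be simplified by reusing the deduction in Prob.9.15:
$$\begin{aligned}
\frac{\partial Q'}{\partial \mu_{ki}} &= \frac{\partial E_{\mathbf{z}}[\ln p]}{\partial \mu_{ki}} + \frac{a_k - 1}{\mu_{ki}} - \frac{b_k - 1}{1 - \mu_{ki}} \\
&= \sum_{n=1}^{N} \gamma(z_{nk})\Big(\frac{x_{ni}}{\mu_{ki}} - \frac{1 - x_{ni}}{1 - \mu_{ki}}\Big) + \frac{a_k - 1}{\mu_{ki}} - \frac{b_k - 1}{1 - \mu_{ki}} \\
&= \frac{\sum_{n=1}^{N} x_{ni}\cdot\gamma(z_{nk}) + a_k - 1}{\mu_{ki}} - \frac{\sum_{n=1}^{N} (1 - x_{ni})\,\gamma(z_{nk}) + b_k - 1}{1 - \mu_{ki}} \\
&= \frac{N_k\bar{x}_{ki} + a_k - 1}{\mu_{ki}} - \frac{N_k - N_k\bar{x}_{ki} + b_k - 1}{1 - \mu_{ki}}
\end{aligned}$$
Note that here $\bar{x}_{ki}$ is defined as the $i$-th entry of $\bar{\mathbf{x}}_k$ defined in Eq (9.58).
To be more clear, we have used Eq (9.57) and Eq (9.58) in the last step:
$$\sum_{n=1}^{N} x_{ni}\cdot\gamma(z_{nk}) = N_k\cdot\Big[\frac{1}{N_k}\sum_{n=1}^{N} x_{ni}\cdot\gamma(z_{nk})\Big] = N_k\cdot\bar{x}_{ki}$$
Setting the derivative equal to 0 and rearranging it, we can obtain:
$$\mu_{ki} = \frac{N_k\bar{x}_{ki} + a_k - 1}{N_k + a_k - 1 + b_k - 1}$$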
Next we maximize $Q'(\theta, \theta^{\text{old}})$ with respect to $\pi$. By analogy to Prob.9.16,
we introduce a Lagrange multiplier:
$$L = E_{\mathbf{z}} + \sum_{k=1}^{K} (\alpha_k - 1)\ln \pi_k + \lambda\Big(\sum_{k=1}^{K} \pi_k - 1\Big)$$
Note that the second term on the right-hand side of the definition of $Q'$ has
been omitted, since that term can be viewed as a constant with regard to $\pi$.
We then calculate the derivative of $L$ with respect to $\pi_k$ by taking advantage
of Prob.9.16:
$$\frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \frac{\alpha_k - 1}{\pi_k} + \lambda = 0$$
Similarly, we first multiply both sides of the expression by $\pi_k$ and then
sum over $k$, which gives:
$$\sum_{k=1}^{K}\sum_{n=1}^{N} \gamma(z_{nk}) + \sum_{k=1}^{K} (\alpha_k - 1) + \sum_{k=1}^{K} \lambda\pi_k = 0$$
Noticing that $\sum_{k=1}^{K} \pi_k$ equals 1, we can obtain:
$$\lambda = -\sum_{k=1}^{K} N_k - \sum_{k=1}^{K} (\alpha_k - 1) = -N - \alpha_0 + K$$
Here we have used Eq (2.39). Substituting it back into the derivative, we
can obtain:
$$\pi_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) + \alpha_k - 1}{-\lambda} = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K}$$
It is not difficult to show that if N is large, the update formula for π and
µ in this case (MAP), will reduce to the results given in the main text (MLE).
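The large-$N$ behaviour of the mixing coefficients can be checked with a short sketch. The function name `map_mixing` and the example counts are assumptions for illustration; the code simply evaluates the MAP formula $(N_k + \alpha_k - 1)/(N + \alpha_0 - K)$ for given effective counts $N_k$.

```python
def map_mixing(n_k, alpha):
    """MAP estimate of the mixing coefficients under a Dirichlet prior."""
    n, alpha0, k = sum(n_k), sum(alpha), len(n_k)
    return [(nk + ak - 1.0) / (n + alpha0 - k) for nk, ak in zip(n_k, alpha)]


# Same proportions N_k / N = (0.3, 0.7), small versus large data set.
small = map_mixing([3.0, 7.0], [2.0, 2.0])
large = map_mixing([3000.0, 7000.0], [2.0, 2.0])
```

With only ten points the prior pulls the estimate towards equal weights, while with ten thousand points the estimate is almost indistinguishable from the maximum likelihood value $N_k/N$.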
We first introduce a latent variable $\mathbf{z} = [z_1, z_2, ..., z_K]^T$, only one element of which
equals 1 while the others are all 0. The conditional distribution of $\mathbf{x}$ is given by:
$$p(\mathbf{x}|\mathbf{z}, \mu) = \prod_{k=1}^{K} p(\mathbf{x}|\mu_k)^{z_k}$$
The distribution of the latent variable is given by:
$$p(\mathbf{z}|\pi) = \prod_{k=1}^{K} \pi_k^{z_k}$$
If we follow the same procedure as in Prob.9.14, we can show that Eq
(9.84) holds. In other words, the introduction of the latent variable is valid.
Therefore, according to Bayes’ Theorem, we can obtain:
$$p(\mathbf{X}, \mathbf{Z}|\mu, \pi) = \prod_{n=1}^{N} p(\mathbf{z}_n|\pi)\,p(\mathbf{x}_n|\mathbf{z}_n, \mu) = \prod_{n=1}^{N}\prod_{k=1}^{K} \big[\pi_k\,p(\mathbf{x}_n|\mu_k)\big]^{z_{nk}}$$
We further use Eq (9.85), which gives:
$$\begin{aligned}
\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi) &= \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\ln\Big[\pi_k\prod_{i=1}^{D}\prod_{j=1}^{M} \mu_{kij}^{x_{nij}}\Big] \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\Big[\ln \pi_k + \sum_{i=1}^{D}\sum_{j=1}^{M} x_{nij}\ln \mu_{kij}\Big]
\end{aligned}$$
Similarly, in the E-step, the responsibilities are evaluated using Bayes’
theorem, which gives:
$$\gamma(z_{nk}) = E[z_{nk}] = \frac{\pi_k\,p(\mathbf{x}_n|\mu_k)}{\sum_{j=1}^{K} \pi_j\,p(\mathbf{x}_n|\mu_j)}$$
Next, in the M-step, we are required to maximize $E_{\mathbf{z}}[\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi)]$ with
respect to $\pi$ and $\mu_k$, where $E_{\mathbf{z}}[\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi)]$ is given by:
$$E_{\mathbf{z}}[\ln p(\mathbf{X}, \mathbf{Z}|\mu, \pi)] = \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\Big[\ln \pi_k + \sum_{i=1}^{D}\sum_{j=1}^{M} x_{nij}\ln \mu_{kij}\Big]$$
Notice that there exist two constraints: (i) the sum of $\pi_k$ over $k$
equals 1, and (ii) the sum of $\mu_{kij}$ over $j$ equals 1 for any $k$ and $i$. We therefore
need to introduce Lagrange multipliers:
$$L = E_{\mathbf{z}}[\ln p] + \lambda\Big(\sum_{k=1}^{K} \pi_k - 1\Big) + \sum_{k=1}^{K}\sum_{i=1}^{D} \eta_{ki}\Big(\sum_{j=1}^{M} \mu_{kij} - 1\Big)$$
First we maximize $L$ with respect to $\pi_k$. This is actually identical to the
case in the main text. To be more clear, we calculate the derivative of $L$ with
respect to $\pi_k$:
$$\frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})}{\pi_k} + \lambda$$
As in Prob.9.16, we can obtain:
$$\pi_k = \frac{N_k}{N}$$
where $N_k$ is defined as:
$$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$
$N$ is the sum of $N_k$ over $k$, and also equals the number of data
points. Then we calculate the derivative of $L$ with respect to $\mu_{kij}$:
$$\frac{\partial L}{\partial \mu_{kij}} = \sum_{n=1}^{N} \frac{\gamma(z_{nk})\,x_{nij}}{\mu_{kij}} + \eta_{ki}$$
We set it to 0 and multiply both sides by $\mu_{kij}$, which gives:
$$\sum_{n=1}^{N} \gamma(z_{nk})\,x_{nij} + \eta_{ki}\,\mu_{kij} = 0$$
By analogy to deriving $\pi_k$, an intuitive idea is to sum
the above expression over $j$, so that we can use the constraint $\sum_j \mu_{kij} = 1$:
$$\eta_{ki} = -\sum_{j=1}^{M}\sum_{n=1}^{N} \gamma(z_{nk})\,x_{nij} = -\sum_{n=1}^{N} \gamma(z_{nk})\Big[\sum_{j=1}^{M} x_{nij}\Big] = -\sum_{n=1}^{N} \gamma(z_{nk}) = -N_k$$
where we have used the fact that $\sum_j x_{nij} = 1$. Substituting back into the
derivative, we can obtain:
$$\mu_{kij} = -\frac{\sum_{n=1}^{N} \gamma(z_{nk})\,x_{nij}}{\eta_{ki}} = \frac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,x_{nij}$$
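We first calculate the derivative of Eq (9.62) with respect to $\alpha$ and set it
to 0:
$$\frac{\partial E[\ln p]}{\partial \alpha} = \frac{M}{2}\cdot\frac{1}{2\pi}\cdot\frac{2\pi}{\alpha} - \frac{E[\mathbf{w}^T\mathbf{w}]}{2} = 0$$
We rearrange the equation above, which gives:
$$\alpha = \frac{M}{E[\mathbf{w}^T\mathbf{w}]} \qquad (*)$$
Therefore, we now need to calculate the expectation $E[\mathbf{w}^T\mathbf{w}]$. Notice that
the posterior has already been given by Eq (3.49):
$$p(\mathbf{w}|\mathbf{t}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N, \mathbf{S}_N)$$
To calculate $E[\mathbf{w}^T\mathbf{w}]$, we write down a property of a Gaussian random
variable: if $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, \Sigma)$, we have:
$$E[\mathbf{x}^T\mathbf{A}\mathbf{x}] = \text{Tr}[\mathbf{A}\Sigma] + \mathbf{m}^T\mathbf{A}\mathbf{m}$$
This property is given as Eq (378) in ’the Matrix Cookbook’. Utilizing
this property, we can obtain:
$$E[\mathbf{w}^T\mathbf{w}] = \text{Tr}[\mathbf{S}_N] + \mathbf{m}_N^T\mathbf{m}_N$$
Substituting it back into $(*)$, we obtain what is required.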
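We calculate the derivative of Eq (9.62) with respect to $\beta$ and set it equal
to 0:
$$\frac{\partial E[\ln p]}{\partial \beta} = \frac{N}{2}\cdot\frac{1}{2\pi}\cdot\frac{2\pi}{\beta} - \frac{1}{2}\sum_{n=1}^{N} E[(t_n - \mathbf{w}^T\phi_n)^2] = 0$$
Rearranging it, we obtain:
$$\beta = \frac{N}{\sum_{n=1}^{N} E[(t_n - \mathbf{w}^T\phi_n)^2]}$$
Therefore, we are required to calculate the expectation. To be more clear,
this expectation is with respect to the posterior defined by Eq (3.49):
$$p(\mathbf{w}|\mathbf{t}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N, \mathbf{S}_N)$$
We expand the expectation:
$$\begin{aligned}
E[(t_n - \mathbf{w}^T\phi_n)^2] &= E[t_n^2 - 2t_n\cdot\mathbf{w}^T\phi_n + \mathbf{w}^T\phi_n\phi_n^T\mathbf{w}] \\
&= E[t_n^2] - E[2t_n\cdot\mathbf{w}^T\phi_n] + E[\mathbf{w}^T(\phi_n\phi_n^T)\mathbf{w}] \\
&= t_n^2 - 2t_n\cdot E[\phi_n^T\mathbf{w}] + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N] + \mathbf{m}_N^T\phi_n\phi_n^T\mathbf{m}_N \\
&= t_n^2 - 2t_n\phi_n^T\cdot E[\mathbf{w}] + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N] + \mathbf{m}_N^T\phi_n\phi_n^T\mathbf{m}_N \\
&= t_n^2 - 2t_n\phi_n^T\mathbf{m}_N + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N] + \mathbf{m}_N^T\phi_n\phi_n^T\mathbf{m}_N \\
&= (t_n - \mathbf{m}_N^T\phi_n)^2 + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N]
\end{aligned}$$
Substituting it back into the derivative, we can obtain:
$$\begin{aligned}
\frac{1}{\beta} &= \frac{1}{N}\sum_{n=1}^{N}\Big\{(t_n - \mathbf{m}_N^T\phi_n)^2 + \text{Tr}[\phi_n\phi_n^T\mathbf{S}_N]\Big\} \\
&= \frac{1}{N}\Big\{\|\mathbf{t} - \Phi\mathbf{m}_N\|^2 + \text{Tr}[\Phi^T\Phi\mathbf{S}_N]\Big\}
\end{aligned}$$
Note that in the last step, we have performed vectorization. Here the $j$-th
row of $\Phi$ is given by $\phi_j^T$, identical to the definition given in Chapter 3.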
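The two re-estimation formulas, $\alpha = M/E[\mathbf{w}^T\mathbf{w}]$ and the expression for $1/\beta$ above, can be tried out in the simplest possible setting. The sketch below assumes a single scalar weight ($M = 1$) and scalar basis values, so that $\mathbf{S}_N$ and $\mathbf{m}_N$ reduce to numbers computed from the standard linear-regression posterior; the function name `em_step` and the toy data are hypothetical.

```python
def em_step(phi, t, alpha, beta):
    """One EM update of (alpha, beta) for t_n = w * phi_n + noise, M = 1."""
    sum_pp = sum(p * p for p in phi)
    # E-step: scalar posterior over w, S_N = 1 / (alpha + beta * sum phi^2).
    s_n = 1.0 / (alpha + beta * sum_pp)
    m_n = beta * s_n * sum(p * y for p, y in zip(phi, t))
    # M-step: alpha = M / (Tr[S_N] + m_N^2) with M = 1.
    new_alpha = 1.0 / (s_n + m_n * m_n)
    # M-step: 1/beta = (||t - Phi m_N||^2 + Tr[Phi^T Phi S_N]) / N.
    resid = sum((y - m_n * p) ** 2 for p, y in zip(phi, t))
    new_beta = len(t) / (resid + s_n * sum_pp)
    return new_alpha, new_beta, m_n


# Hypothetical data generated by t = 2 * phi plus small fixed perturbations.
phi = [0.0, 1.0, 2.0, 3.0, 4.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.0]
t = [2.0 * p + e for p, e in zip(phi, noise)]
alpha, beta = 1.0, 1.0
for _ in range(50):
    alpha, beta, m_n = em_step(phi, t, alpha, beta)
```

On this toy problem the posterior mean settles close to the slope used to generate the data, and both hyper-parameters stay positive throughout, as the update formulas guarantee.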
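First let’s expand the complete-data log likelihood using Eq (7.79), Eq
(7.80) and Eq (7.76).
$$\begin{aligned}
\ln p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha) &= \ln p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta) + \ln p(\mathbf{w}|\alpha) \\
&= \sum_{n=1}^{N} \ln p(t_n|\mathbf{x}_n, \mathbf{w}, \beta^{-1}) + \sum_{i=1}^{M} \ln \mathcal{N}(w_i|0, \alpha_i^{-1}) \\
&= \sum_{n=1}^{N} \ln \mathcal{N}(t_n|\mathbf{w}^T\phi_n, \beta^{-1}) + \sum_{i=1}^{M} \ln \mathcal{N}(w_i|0, \alpha_i^{-1}) \\
&= \frac{N}{2}\ln\frac{\beta}{2\pi} - \frac{\beta}{2}\sum_{n=1}^{N} (t_n - \mathbf{w}^T\phi_n)^2 + \frac{1}{2}\sum_{i=1}^{M} \ln\frac{\alpha_i}{2\pi} - \sum_{i=1}^{M} \frac{\alpha_i}{2}w_i^2
\end{aligned}$$
Therefore, the expectation of the complete-data log likelihood with respect
to the posterior of $\mathbf{w}$ equals:
$$E_{\mathbf{w}}[\ln p] = \frac{N}{2}\ln\frac{\beta}{2\pi} - \frac{\beta}{2}\sum_{n=1}^{N} E_{\mathbf{w}}[(t_n - \mathbf{w}^T\phi_n)^2] + \frac{1}{2}\sum_{i=1}^{M} \ln\frac{\alpha_i}{2\pi} - \sum_{i=1}^{M} \frac{\alpha_i}{2}E_{\mathbf{w}}[w_i^2]$$
We calculate the derivative of $E_{\mathbf{w}}[\ln p]$ with respect to $\alpha_i$ and set it to 0:
$$\frac{\partial E_{\mathbf{w}}[\ln p]}{\partial \alpha_i} = \frac{1}{2}\cdot\frac{1}{2\pi}\cdot\frac{2\pi}{\alpha_i} - \frac{1}{2}E_{\mathbf{w}}[w_i^2] = 0$$
Rearranging it, we can obtain:
$$\alpha_i = \frac{1}{E_{\mathbf{w}}[w_i^2]} = \frac{1}{\big(E_{\mathbf{w}}[\mathbf{w}\mathbf{w}^T]\big)_{(i,i)}}$$
Here the subscript $(i, i)$ represents the entry in the $i$-th row and $i$-th
column of the matrix $E_{\mathbf{w}}[\mathbf{w}\mathbf{w}^T]$. So now, we are required to calculate the
expectation. To be more clear, this expectation is with respect to the posterior
defined by Eq (7.81):
$$p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \Sigma)$$
Here we use Eq (377) from ’the Matrix Cookbook’. We restate it
here: if $\mathbf{w} \sim \mathcal{N}(\mathbf{m}, \Sigma)$, we have:
$$E[\mathbf{w}\mathbf{w}^T] = \Sigma + \mathbf{m}\mathbf{m}^T$$
According to this equation, we can obtain:
$$\alpha_i = \frac{1}{\big(E_{\mathbf{w}}[\mathbf{w}\mathbf{w}^T]\big)_{(i,i)}} = \frac{1}{(\Sigma + \mathbf{m}\mathbf{m}^T)_{(i,i)}} = \frac{1}{\Sigma_{ii} + m_i^2}$$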
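Now we calculate the derivative of $E_{\mathbf{w}}[\ln p]$ with respect to $\beta$ and set it to
0:
$$\frac{\partial E_{\mathbf{w}}[\ln p]}{\partial \beta} = \frac{N}{2}\cdot\frac{1}{2\pi}\cdot\frac{2\pi}{\beta} - \frac{1}{2}\sum_{n=1}^{N} E_{\mathbf{w}}[(t_n - \mathbf{w}^T\phi_n)^2] = 0$$
Rearranging it, we obtain:
$$\beta^{(\text{new})} = \frac{N}{\sum_{n=1}^{N} E_{\mathbf{w}}[(t_n - \mathbf{w}^T\phi_n)^2]}$$
Therefore, we are required to calculate the expectation. By analogy to the
deduction in Prob.9.21, we can obtain:
$$\begin{aligned}
\frac{1}{\beta^{(\text{new})}} &= \frac{1}{N}\sum_{n=1}^{N}\Big\{(t_n - \mathbf{m}^T\phi_n)^2 + \text{Tr}[\phi_n\phi_n^T\Sigma]\Big\} \\
&= \frac{1}{N}\Big\{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + \text{Tr}[\Phi^T\Phi\Sigma]\Big\}
\end{aligned}$$
To make it consistent with Eq (9.68), let’s first prove a statement:
$$(\beta^{-1}\mathbf{A} + \Phi^T\Phi)\,\Sigma = \beta^{-1}\mathbf{I}$$
This can be easily shown by substituting $\Sigma$, i.e., Eq (7.83), back into the
expression:
$$(\beta^{-1}\mathbf{A} + \Phi^T\Phi)\,\Sigma = (\beta^{-1}\mathbf{A} + \Phi^T\Phi)(\mathbf{A} + \beta\Phi^T\Phi)^{-1} = \beta^{-1}\mathbf{I}$$
Now we start from this statement and rearrange it, which gives:
$$\Phi^T\Phi\Sigma = \beta^{-1}\mathbf{I} - \beta^{-1}\mathbf{A}\Sigma = \beta^{-1}(\mathbf{I} - \mathbf{A}\Sigma)$$
Substituting back into the expression for $\beta^{(\text{new})}$:
$$\begin{aligned}
\frac{1}{\beta^{(\text{new})}} &= \frac{1}{N}\Big\{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + \text{Tr}[\Phi^T\Phi\Sigma]\Big\} \\
&= \frac{1}{N}\Big\{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + \text{Tr}[\beta^{-1}(\mathbf{I} - \mathbf{A}\Sigma)]\Big\} \\
&= \frac{1}{N}\Big\{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + \beta^{-1}\,\text{Tr}[\mathbf{I} - \mathbf{A}\Sigma]\Big\} \\
&= \frac{1}{N}\Big\{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + \beta^{-1}\sum_i (1 - \alpha_i\Sigma_{ii})\Big\} \\
&= \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + \beta^{-1}\sum_i \gamma_i}{N}
\end{aligned}$$
Here we have defined $\gamma_i = 1 - \alpha_i\Sigma_{ii}$ as in Eq (7.89). Note that there is a
typo in Eq (9.68): $\mathbf{m}_N$ should be $\mathbf{m}$.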
Some clarifications must be made here. Eq (7.87)-(7.88) only give the
same stationary points, i.e., the same $\alpha^\star$ and $\beta^\star$, as those given by Eq (9.67)-
(9.68). However, the hyper-parameters estimated at a specific iteration
by the two methods are in general not the same.
When convergence is reached, Eq (7.87) can be written as:
$$\alpha_i^\star = \frac{1 - \alpha_i^\star\Sigma_{ii}}{m_i^2} \quad\Longrightarrow\quad \alpha_i^\star = \frac{1}{m_i^2 + \Sigma_{ii}}$$
This is identical to Eq (9.67). When convergence is reached, Eq (9.68) can
be written as:
$$(\beta^\star)^{-1} = \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2 + (\beta^\star)^{-1}\sum_i \gamma_i}{N} \quad\Longrightarrow\quad (\beta^\star)^{-1} = \frac{\|\mathbf{t} - \Phi\mathbf{m}\|^2}{N - \sum_i \gamma_i}$$
This is identical to Eq (7.88).
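We substitute Eq (9.71) and Eq (9.72) into Eq (9.70):
$$\begin{aligned}
L(q, \theta) + \text{KL}(q\|p) &= \sum_{\mathbf{Z}} q(\mathbf{Z})\Big\{\ln\frac{p(\mathbf{X}, \mathbf{Z}|\theta)}{q(\mathbf{Z})} - \ln\frac{p(\mathbf{Z}|\mathbf{X}, \theta)}{q(\mathbf{Z})}\Big\} \\
&= \sum_{\mathbf{Z}} q(\mathbf{Z})\Big\{\ln\frac{p(\mathbf{X}, \mathbf{Z}|\theta)}{p(\mathbf{Z}|\mathbf{X}, \theta)}\Big\} \\
&= \sum_{\mathbf{Z}} q(\mathbf{Z})\ln p(\mathbf{X}|\theta) \\
&= \ln p(\mathbf{X}|\theta)
\end{aligned}$$
Note that in the last step, we have used the fact that $\ln p(\mathbf{X}|\theta)$ doesn’t
depend on $\mathbf{Z}$, and that the sum of $q(\mathbf{Z})$ over $\mathbf{Z}$ equals 1 because $q(\mathbf{Z})$
is a probability distribution.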
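The decomposition can also be verified numerically for a discrete latent variable. In the sketch below the joint values `joint` (standing for $p(\mathbf{X}, \mathbf{Z} = z|\theta)$ at the observed $\mathbf{X}$) and the distribution `q` are arbitrary made-up numbers, chosen only for illustration.

```python
import math

# Hypothetical values of p(X, Z=z | theta) for the observed X, z = 0, 1, 2.
joint = [0.1, 0.3, 0.2]
# An arbitrary distribution q(Z) over the same three states.
q = [0.5, 0.25, 0.25]

evidence = sum(joint)                       # p(X | theta)
posterior = [j / evidence for j in joint]   # p(Z | X, theta)

# Lower bound L(q, theta), cf. Eq (9.71).
lower_bound = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))
# KL(q || p), cf. Eq (9.72).
kl = -sum(qz * math.log(p / qz) for qz, p in zip(q, posterior))
```

The sum `lower_bound + kl` agrees with `math.log(evidence)` up to rounding error, whatever valid `q` is chosen.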
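We calculate the derivative of Eq (9.71) with respect to $\theta$, given $q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})$:
$$\begin{aligned}
\frac{\partial L(q, \theta)}{\partial \theta} &= \frac{\partial}{\partial \theta}\Big\{\sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\ln\frac{p(\mathbf{X}, \mathbf{Z}|\theta)}{p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})}\Big\} \\
&= \frac{\partial}{\partial \theta}\Big\{\sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\ln p(\mathbf{X}, \mathbf{Z}|\theta) - \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\ln p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\Big\} \\
&= \frac{\partial}{\partial \theta}\Big\{\sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\ln p(\mathbf{X}, \mathbf{Z}|\theta)\Big\} \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\,\frac{\partial \ln p(\mathbf{X}, \mathbf{Z}|\theta)}{\partial \theta} \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\,\frac{1}{p(\mathbf{X}, \mathbf{Z}|\theta)}\,\frac{\partial p(\mathbf{X}, \mathbf{Z}|\theta)}{\partial \theta} \\
&= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\,\frac{1}{p(\mathbf{X}, \mathbf{Z}|\theta)}\,\frac{\partial\,\big[p(\mathbf{X}|\theta)\cdot p(\mathbf{Z}|\mathbf{X}, \theta)\big]}{\partial \theta} \\
&= \sum_{\mathbf{Z}} \frac{p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})}{p(\mathbf{X}, \mathbf{Z}|\theta)}\Big[p(\mathbf{X}|\theta)\,\frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta} + p(\mathbf{Z}|\mathbf{X}, \theta)\,\frac{\partial p(\mathbf{X}|\theta)}{\partial \theta}\Big]
\end{aligned}$$
We evaluate this derivative at $\theta = \theta^{\text{old}}$:
$$\begin{aligned}
\frac{\partial L(q, \theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} &= \Big\{\sum_{\mathbf{Z}} \frac{p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})}{p(\mathbf{X}, \mathbf{Z}|\theta)}\Big[p(\mathbf{X}|\theta)\,\frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta} + p(\mathbf{Z}|\mathbf{X}, \theta)\,\frac{\partial p(\mathbf{X}|\theta)}{\partial \theta}\Big]\Big\}\Big|_{\theta^{\text{old}}} \\
&= \sum_{\mathbf{Z}} \frac{p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})}{p(\mathbf{X}, \mathbf{Z}|\theta^{\text{old}})}\Big[p(\mathbf{X}|\theta^{\text{old}})\,\frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} + p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\,\frac{\partial p(\mathbf{X}|\theta)}{\partial \theta}\Big|_{\theta^{\text{old}}}\Big] \\
&= \sum_{\mathbf{Z}} \frac{1}{p(\mathbf{X}|\theta^{\text{old}})}\Big[p(\mathbf{X}|\theta^{\text{old}})\,\frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} + p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})\,\frac{\partial p(\mathbf{X}|\theta)}{\partial \theta}\Big|_{\theta^{\text{old}}}\Big] \\
&= \sum_{\mathbf{Z}} \frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} + \sum_{\mathbf{Z}} \frac{p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})}{p(\mathbf{X}|\theta^{\text{old}})}\cdot\frac{\partial p(\mathbf{X}|\theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} \\
&= \sum_{\mathbf{Z}} \frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} + \frac{1}{p(\mathbf{X}|\theta^{\text{old}})}\cdot\frac{\partial p(\mathbf{X}|\theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} \\
&= \sum_{\mathbf{Z}} \frac{\partial p(\mathbf{Z}|\mathbf{X}, \theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} + \frac{\partial \ln p(\mathbf{X}|\theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} \\
&= \Big\{\frac{\partial}{\partial \theta}\sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \theta)\Big\}\Big|_{\theta^{\text{old}}} + \frac{\partial \ln p(\mathbf{X}|\theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} \\
&= \frac{\partial 1}{\partial \theta}\Big|_{\theta^{\text{old}}} + \frac{\partial \ln p(\mathbf{X}|\theta)}{\partial \theta}\Big|_{\theta^{\text{old}}} \\
&= \frac{\partial \ln p(\mathbf{X}|\theta)}{\partial \theta}\Big|_{\theta^{\text{old}}}
\end{aligned}$$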
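This problem is much easier to prove if we view it from the perspective
of the KL divergence. Note that when $q(\mathbf{Z}) = p(\mathbf{Z}|\mathbf{X}, \theta^{\text{old}})$, the KL divergence
vanishes at $\theta = \theta^{\text{old}}$, and that in general the KL divergence is greater than or equal to zero. Therefore,
$\theta^{\text{old}}$ is a minimum of the KL divergence, and we must have:
$$\frac{\partial\,\text{KL}(q\|p)}{\partial \theta}\Big|_{\theta^{\text{old}}} = 0$$
Otherwise, there would exist a point $\theta$ in the neighborhood of $\theta^{\text{old}}$ at which
the KL divergence is less than 0, which is a contradiction. Then, using Eq (9.70), the result is trivial to
prove.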
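From Eq (9.18), we have:
$$N_k^{\text{old}} = \sum_n \gamma^{\text{old}}(z_{nk})$$
If now we just re-evaluate the responsibilities for one data point $\mathbf{x}_m$, we can
obtain:
$$\begin{aligned}
N_k^{\text{new}} &= \sum_{n \neq m} \gamma^{\text{old}}(z_{nk}) + \gamma^{\text{new}}(z_{mk}) \\
&= \sum_n \gamma^{\text{old}}(z_{nk}) + \gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk}) \\
&= N_k^{\text{old}} + \gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})
\end{aligned}$$
Similarly, according to Eq (9.17), we can obtain:
$$\begin{aligned}
\mu_k^{\text{new}} &= \frac{1}{N_k^{\text{new}}}\sum_{n \neq m} \gamma^{\text{old}}(z_{nk})\,\mathbf{x}_n + \frac{\gamma^{\text{new}}(z_{mk})\,\mathbf{x}_m}{N_k^{\text{new}}} \\
&= \frac{1}{N_k^{\text{new}}}\sum_n \gamma^{\text{old}}(z_{nk})\,\mathbf{x}_n + \frac{\gamma^{\text{new}}(z_{mk})\,\mathbf{x}_m}{N_k^{\text{new}}} - \frac{\gamma^{\text{old}}(z_{mk})\,\mathbf{x}_m}{N_k^{\text{new}}} \\
&= \frac{N_k^{\text{old}}}{N_k^{\text{new}}}\,\frac{1}{N_k^{\text{old}}}\sum_n \gamma^{\text{old}}(z_{nk})\,\mathbf{x}_n + \frac{\gamma^{\text{new}}(z_{mk})\,\mathbf{x}_m}{N_k^{\text{new}}} - \frac{\gamma^{\text{old}}(z_{mk})\,\mathbf{x}_m}{N_k^{\text{new}}} \\
&= \frac{N_k^{\text{old}}}{N_k^{\text{new}}}\,\mu_k^{\text{old}} + \big[\gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})\big]\frac{\mathbf{x}_m}{N_k^{\text{new}}} \\
&= \mu_k^{\text{old}} - \frac{N_k^{\text{new}} - N_k^{\text{old}}}{N_k^{\text{new}}}\,\mu_k^{\text{old}} + \big[\gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})\big]\frac{\mathbf{x}_m}{N_k^{\text{new}}} \\
&= \mu_k^{\text{old}} - \frac{\gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})}{N_k^{\text{new}}}\,\mu_k^{\text{old}} + \big[\gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})\big]\frac{\mathbf{x}_m}{N_k^{\text{new}}} \\
&= \mu_k^{\text{old}} + \frac{\gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})}{N_k^{\text{new}}}\cdot\big(\mathbf{x}_m - \mu_k^{\text{old}}\big)
\end{aligned}$$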
Just as required.
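A small numerical check makes the incremental formula concrete. The sketch below works with scalar data for brevity; the function name `incremental_update` and the toy responsibilities are assumptions. It compares the incremental result with a full recomputation of Eq (9.17) after one responsibility has changed.

```python
def incremental_update(mu_old, n_old, x_m, g_old, g_new):
    """Update N_k and mu_k after re-evaluating one responsibility."""
    n_new = n_old + g_new - g_old
    mu_new = mu_old + (g_new - g_old) / n_new * (x_m - mu_old)
    return mu_new, n_new


# Hypothetical 1-D data and responsibilities for a single component k.
xs = [1.0, 2.0, 4.0]
gam = [0.2, 0.5, 0.9]
n_old = sum(gam)
mu_old = sum(g * x for g, x in zip(gam, xs)) / n_old

# Re-evaluate only the second point: its responsibility moves 0.5 -> 0.8.
mu_inc, n_inc = incremental_update(mu_old, n_old, xs[1], gam[1], 0.8)

# Full recomputation for comparison.
gam_new = [0.2, 0.8, 0.9]
mu_batch = sum(g * x for g, x in zip(gam_new, xs)) / sum(gam_new)
```

Both routes give the same mean, but the incremental one touches only the data point whose responsibility changed.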
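By analogy to the previous problem, we use Eq (9.24)-Eq (9.27), beginning
by first deriving an update formula for the mixing coefficients $\pi_k$:
$$\begin{aligned}
\pi_k^{\text{new}} &= \frac{N_k^{\text{new}}}{N} = \frac{1}{N}\Big\{N_k^{\text{old}} + \gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})\Big\} \\
&= \pi_k^{\text{old}} + \frac{\gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})}{N}
\end{aligned}$$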
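Here we have used the conclusion (the update formula for $N_k^{\text{new}}$) of the
previous problem. Next we deal with the covariance matrix $\Sigma_k$. By analogy to
the previous problem, we can obtain:
$$\begin{aligned}
\Sigma_k^{\text{new}} &= \frac{1}{N_k^{\text{new}}}\sum_{n \neq m} \gamma^{\text{old}}(z_{nk})\,(\mathbf{x}_n - \mu_k^{\text{new}})(\mathbf{x}_n - \mu_k^{\text{new}})^T + \frac{1}{N_k^{\text{new}}}\gamma^{\text{new}}(z_{mk})\,(\mathbf{x}_m - \mu_k^{\text{new}})(\mathbf{x}_m - \mu_k^{\text{new}})^T \\
&\approx \frac{1}{N_k^{\text{new}}}\sum_{n \neq m} \gamma^{\text{old}}(z_{nk})\,(\mathbf{x}_n - \mu_k^{\text{old}})(\mathbf{x}_n - \mu_k^{\text{old}})^T + \frac{1}{N_k^{\text{new}}}\gamma^{\text{new}}(z_{mk})\,(\mathbf{x}_m - \mu_k^{\text{new}})(\mathbf{x}_m - \mu_k^{\text{new}})^T \\
&= \frac{1}{N_k^{\text{new}}}\sum_n \gamma^{\text{old}}(z_{nk})\,(\mathbf{x}_n - \mu_k^{\text{old}})(\mathbf{x}_n - \mu_k^{\text{old}})^T + \frac{1}{N_k^{\text{new}}}\gamma^{\text{new}}(z_{mk})\,(\mathbf{x}_m - \mu_k^{\text{new}})(\mathbf{x}_m - \mu_k^{\text{new}})^T \\
&\qquad - \frac{1}{N_k^{\text{new}}}\gamma^{\text{old}}(z_{mk})\,(\mathbf{x}_m - \mu_k^{\text{old}})(\mathbf{x}_m - \mu_k^{\text{old}})^T \\
&= \frac{N_k^{\text{old}}}{N_k^{\text{new}}}\,\Sigma_k^{\text{old}} + \frac{\gamma^{\text{new}}(z_{mk})}{N_k^{\text{new}}}\,(\mathbf{x}_m - \mu_k^{\text{new}})(\mathbf{x}_m - \mu_k^{\text{new}})^T - \frac{\gamma^{\text{old}}(z_{mk})}{N_k^{\text{new}}}\,(\mathbf{x}_m - \mu_k^{\text{old}})(\mathbf{x}_m - \mu_k^{\text{old}})^T \\
&= \Big\{1 + \frac{N_k^{\text{old}} - N_k^{\text{new}}}{N_k^{\text{new}}}\Big\}\Sigma_k^{\text{old}} + \frac{\gamma^{\text{new}}(z_{mk})}{N_k^{\text{new}}}\,(\mathbf{x}_m - \mu_k^{\text{new}})(\mathbf{x}_m - \mu_k^{\text{new}})^T - \frac{\gamma^{\text{old}}(z_{mk})}{N_k^{\text{new}}}\,(\mathbf{x}_m - \mu_k^{\text{old}})(\mathbf{x}_m - \mu_k^{\text{old}})^T \\
&= \Big\{1 + \frac{\gamma^{\text{old}}(z_{mk}) - \gamma^{\text{new}}(z_{mk})}{N_k^{\text{new}}}\Big\}\Sigma_k^{\text{old}} + \frac{\gamma^{\text{new}}(z_{mk})}{N_k^{\text{new}}}\,(\mathbf{x}_m - \mu_k^{\text{new}})(\mathbf{x}_m - \mu_k^{\text{new}})^T - \frac{\gamma^{\text{old}}(z_{mk})}{N_k^{\text{new}}}\,(\mathbf{x}_m - \mu_k^{\text{old}})(\mathbf{x}_m - \mu_k^{\text{old}})^T \\
&= \Sigma_k^{\text{old}} + \frac{\gamma^{\text{new}}(z_{mk})}{N_k^{\text{new}}}\Big\{(\mathbf{x}_m - \mu_k^{\text{new}})(\mathbf{x}_m - \mu_k^{\text{new}})^T - \Sigma_k^{\text{old}}\Big\} - \frac{\gamma^{\text{old}}(z_{mk})}{N_k^{\text{new}}}\Big\{(\mathbf{x}_m - \mu_k^{\text{old}})(\mathbf{x}_m - \mu_k^{\text{old}})^T - \Sigma_k^{\text{old}}\Big\}
\end{aligned}$$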
One important thing worth mentioning is that the second step contains
an approximate equality. In the previous problem, we have
shown that if we only re-evaluate the responsibilities of the data point $\mathbf{x}_m$, all the centers $\mu_k$
change from $\mu_k^{\text{old}}$ to $\mu_k^{\text{new}}$, and the update formula is given by Eq (9.78).
However, for computational convenience, we have made an approximation
here. Other approximations can also be applied. For instance,
you can replace $\mu_k^{\text{new}}$ with $\mu_k^{\text{old}}$ wherever it occurs.
The complete solution is obtained by substituting Eq (9.78) into the
right-hand side of the first equality and then rearranging it, in order to construct
a relation between $\Sigma_k^{\text{new}}$ and $\Sigma_k^{\text{old}}$. However, this is too complicated.
0.10 Variational Inference

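This problem is very similar to Prob.9.24. We substitute Eq (10.3) and Eq
(10.4) into Eq (10.2):
$$\begin{aligned}
L(q) + \text{KL}(q\|p) &= \int_{\mathbf{Z}} q(\mathbf{Z})\Big\{\ln\frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})} - \ln\frac{p(\mathbf{Z}|\mathbf{X})}{q(\mathbf{Z})}\Big\}\,d\mathbf{Z} \\
&= \int_{\mathbf{Z}} q(\mathbf{Z})\Big\{\ln\frac{p(\mathbf{X}, \mathbf{Z})}{p(\mathbf{Z}|\mathbf{X})}\Big\}\,d\mathbf{Z} \\
&= \int_{\mathbf{Z}} q(\mathbf{Z})\ln p(\mathbf{X})\,d\mathbf{Z} \\
&= \ln p(\mathbf{X})
\end{aligned}$$
Note that in the last step, we have used the fact that $\ln p(\mathbf{X})$ doesn’t depend
on $\mathbf{Z}$, and that the integral of $q(\mathbf{Z})$ over $\mathbf{Z}$ equals 1 because $q(\mathbf{Z})$ is
a PDF.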
To be more clear, we are required to solve:
$$\begin{cases}
m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\,(m_2 - \mu_2) \\
m_2 = \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\,(m_1 - \mu_1)
\end{cases}$$
To obtain the equations above, we substitute $E[z_i] = m_i$, where
$i = 1, 2$, into Eq (10.13) and Eq (10.14). Here the unknown parameters are
$m_1$ and $m_2$. It is trivial to notice that $m_i = \mu_i$ is a solution of the equations
above.
Let’s solve these equations from another perspective. Firstly, if any (or both)
of $\Lambda_{11}^{-1}$ and $\Lambda_{22}^{-1}$ equals 0, we can obtain $m_i = \mu_i$ directly from Eq (10.13)-
(10.14). When neither $\Lambda_{11}^{-1}$ nor $\Lambda_{22}^{-1}$ equals 0, we substitute $m_1$, i.e., the first
line, into the second line:
$$\begin{aligned}
m_2 &= \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\,(m_1 - \mu_1) \\
&= \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\big[\mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\,(m_2 - \mu_2) - \mu_1\big] \\
&= \mu_2 - \Lambda_{22}^{-1}\Lambda_{21}\,\mu_1 + \Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}\,(m_2 - \mu_2) + \Lambda_{22}^{-1}\Lambda_{21}\,\mu_1 \\
&= (1 - \Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12})\,\mu_2 + \Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}\,m_2
\end{aligned}$$
We rearrange the expression above, yielding:
$$(1 - \Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12})\,(m_2 - \mu_2) = 0$$
The first factor on the left-hand side equals 0 only when the distribution
is singular, i.e., the determinant of the precision matrix $\Lambda$ (i.e., $\Lambda_{11}\Lambda_{22} - \Lambda_{12}\Lambda_{21}$)
is 0. Therefore, if the distribution is nonsingular, we must have
$m_2 = \mu_2$. Substituting it back into the first line, we obtain $m_1 = \mu_1$.
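The same conclusion can be observed by iterating the two coupled equations numerically. The sketch below treats $z_1$ and $z_2$ as scalars; the function name `mean_field_means`, the chosen precision matrix, and the starting point are assumptions made for illustration.

```python
def mean_field_means(mu, lam, m_init, sweeps=100):
    """Iterate the coupled updates for m_1 and m_2 (scalar z_1, z_2)."""
    mu1, mu2 = mu
    (l11, l12), (l21, l22) = lam
    m1, m2 = m_init
    for _ in range(sweeps):
        # m_1 = mu_1 - Lambda_11^{-1} Lambda_12 (m_2 - mu_2)
        m1 = mu1 - l12 / l11 * (m2 - mu2)
        # m_2 = mu_2 - Lambda_22^{-1} Lambda_21 (m_1 - mu_1)
        m2 = mu2 - l21 / l22 * (m1 - mu1)
    return m1, m2


# Hypothetical nonsingular precision matrix (determinant 2.0 - 0.64 > 0).
m1, m2 = mean_field_means(
    mu=(1.0, -2.0),
    lam=((2.0, 0.8), (0.8, 1.0)),
    m_init=(5.0, 5.0),
)
```

Each sweep shrinks the error in $m_2$ by the factor $\Lambda_{22}^{-1}\Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}$, which is smaller than 1 in magnitude for a positive-definite precision matrix, so the iteration settles on $m_i = \mu_i$ from any starting point.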
Let’s start from the definition of KL divergence given in Eq (10.16).

∫ [∑M ]
K L( p|| q) = − p(Z) ln q i (Z i ) d Z + const
i =1
∫ [ ∑ ]
= − p(Z) ln q j (Z j ) + ln q i (Z i ) d Z + const
i ̸= j
∫
= − p(Z) ln q j (Z j ) d Z + const
∫ [∫ ∏ ]
= − p(Z) d Z i ln q j (Z j ) d Z j + const
i ̸= j
∫
= − P (Z j ) ln q j (Z j ) d Z j + const
Note that in the third step, since all the factors q i (Z i ), where i ̸= j , are
fixed, they can be absorbed into the ’Const’ variable. In the last step, we have
denoted the marginal distribution:
∫ ∏
p(Z j ) = p(Z) d Z i
i ̸= j
We introduce the Lagrange multiplier to enforce q j (Z j ) integrate to 1.

∫ ∫
L = − P (Z j ) ln q j (Z j ) d Z j + λ ( q j (Z j ) d Z j − 1)
Using the functional derivative (for more details, you can refer to Appendix D or Prob.1.34), we calculate the functional derivative of $L$ with respect to $q_j(Z_j)$ and set it to 0:
\[
-\frac{p(Z_j)}{q_j(Z_j)} + \lambda = 0
\]
\[
\lambda\,q_j(Z_j) = p(Z_j)
\]
Integrating both sides with respect to $Z_j$, we see that $\lambda = 1$. Substituting it back into the derivative, we can obtain the optimal $q_j(Z_j)$:
\[
q_j^\star(Z_j) = p(Z_j)
\]
Strictly speaking, we should also enforce $q_j(Z_j) \geq 0$ as a constraint. However, when we only enforce that $q_j(Z_j)$ integrates to 1 and obtain the final closed-form expression, $q_j(Z_j)$ is automatically nonnegative at all $Z_j$ because $p(Z_j)$ is a PDF. Therefore, there is no need to introduce this inequality constraint explicitly.
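The result can be sanity-checked on a discrete toy problem. The sketch below assumes a made-up 2 x 3 joint table `p` and compares $KL(p\|q_1 q_2)$ at the marginals against randomly drawn factorized distributions; the table, the helper names, and the random candidates are all illustrative and not part of the original solution.

```python
import math
import random

# Illustrative joint distribution p(z1, z2) on a 2 x 3 grid (entries sum to 1).
p = [[0.10, 0.25, 0.05],
     [0.30, 0.10, 0.20]]

def kl_to_product(q1, q2):
    """KL(p || q1 * q2) for the joint table p above."""
    return sum(p[i][j] * math.log(p[i][j] / (q1[i] * q2[j]))
               for i in range(2) for j in range(3))

# Marginals of p: the claimed optimum.
p1 = [sum(row) for row in p]
p2 = [p[0][j] + p[1][j] for j in range(3)]
best = kl_to_product(p1, p2)

def random_dist(n, rng):
    """Draw a random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
others = [kl_to_product(random_dist(2, rng), random_dist(3, rng))
          for _ in range(1000)]
print(best, min(others))
```

No random factorized candidate should achieve a smaller divergence than the product of marginals.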
We begin by writing down the KL divergence.

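\[
\begin{aligned}
KL(p\|q) &= -\int p(x)\ln\left\{\frac{q(x)}{p(x)}\right\}dx \\
&= -\int p(x)\ln q(x)\,dx + \text{const} \\
&= -\int p(x)\left[-\frac{D}{2}\ln 2\pi - \frac12\ln|\Sigma| - \frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right]dx + \text{const} \\
&= \int p(x)\left[\frac12\ln|\Sigma| + \frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right]dx + \text{const} \\
&= \frac12\ln|\Sigma| + \int p(x)\,\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\,dx + \text{const} \\
&= \frac12\ln|\Sigma| + \int p(x)\,\frac12\left[x^T\Sigma^{-1}x - 2\mu^T\Sigma^{-1}x + \mu^T\Sigma^{-1}\mu\right]dx + \text{const} \\
&= \frac12\ln|\Sigma| + \int p(x)\,\frac12\mathrm{Tr}[\Sigma^{-1}(xx^T)]\,dx - \mu^T\Sigma^{-1}E[x] + \frac12\mu^T\Sigma^{-1}\mu + \text{const} \\
&= \frac12\ln|\Sigma| + \frac12\mathrm{Tr}[\Sigma^{-1}E(xx^T)] - \mu^T\Sigma^{-1}E[x] + \frac12\mu^T\Sigma^{-1}\mu + \text{const}
\end{aligned}
\]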
Here $D$ is the dimension of $x$. We first calculate the derivative of $KL(p\|q)$ with respect to $\mu$ and set it to 0:
\[
\frac{\partial KL}{\partial\mu} = -\Sigma^{-1}E[x] + \Sigma^{-1}\mu = 0
\]
Therefore, we obtain $\mu = E[x]$. When $\mu = E[x]$ is satisfied, the KL divergence reduces to:
\[
KL(p\|q) = \frac12\ln|\Sigma| + \frac12\mathrm{Tr}[\Sigma^{-1}E(xx^T)] - \frac12\mu^T\Sigma^{-1}\mu + \text{const}
\]
Then we calculate the derivative of $KL(p\|q)$ with respect to $\Sigma$ and set it to 0:
\[
\frac{\partial KL}{\partial\Sigma} = \frac12\Sigma^{-1} - \frac12\Sigma^{-1}E[xx^T]\Sigma^{-1} + \frac12\Sigma^{-1}\mu\mu^T\Sigma^{-1} = 0
\]
Note that here we have used Eq (61) and Eq (124) in ’MatrixCookBook’, and that $\Sigma$ and $E[xx^T]$ are both symmetric. We rewrite those equations here for your reference:
\[
\frac{\partial\, a^TX^{-1}b}{\partial X} = -X^{-T}ab^TX^{-T} \quad\text{and}\quad \frac{\partial\,\mathrm{Tr}(AX^{-1}B)}{\partial X} = -X^{-T}A^TB^TX^{-T}
\]
Rearranging the derivative, we can obtain:
\[
\Sigma = E[xx^T] - \mu\mu^T = E[xx^T] - E[x]E[x]^T = \mathrm{cov}[x]
\]
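As a quick one-dimensional illustration of this moment-matching result, the sketch below takes a small discrete distribution `p` (made-up points and weights) and evaluates only the $q$-dependent part of $KL(p\|q)$, namely $-E_p[\ln q(x)]$, at the matched moments and at perturbed values. The helper name `cross_entropy` and all numbers are assumptions introduced for illustration.

```python
import math

# Illustrative 1-D distribution p: support points xs with probabilities ws.
xs = [-1.0, 0.5, 2.0, 4.0]
ws = [0.2, 0.4, 0.3, 0.1]

def cross_entropy(mu, var):
    """-E_p[ln N(x | mu, var)], the part of KL(p||q) that depends on q."""
    return sum(w * (0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var))
               for x, w in zip(xs, ws))

mean = sum(w * x for x, w in zip(xs, ws))
var = sum(w * x * x for x, w in zip(xs, ws)) - mean ** 2   # E[x^2] - E[x]^2

best = cross_entropy(mean, var)
# Perturb the matched moments; the objective should never improve.
perturbed = [cross_entropy(mean + dm, var * s)
             for dm in (-0.5, -0.1, 0.0, 0.1, 0.5)
             for s in (0.5, 0.9, 1.0, 1.1, 2.0)]
print(best, min(perturbed))
```

The minimum over the perturbed grid is attained at the unperturbed pair, matching $\mu = E[x]$ and $\Sigma = \mathrm{cov}[x]$.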
We introduce a property of the Dirac function:

\[
\int\delta(\theta-\theta_0)f(\theta)\,d\theta = f(\theta_0)
\]
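We first calculate the optimal $q_z(z)$ by fixing $q_\theta(\theta)$. This is achieved by minimizing the KL divergence given in Eq (10.4):
\[
\begin{aligned}
KL(q\|p) &= -\int q(Z)\ln\left\{\frac{p(Z|X)}{q(Z)}\right\}dZ \\
&= -\int\!\!\int q_z(z)q_\theta(\theta)\ln\left\{\frac{p(z,\theta|X)}{q_z(z)q_\theta(\theta)}\right\}dz\,d\theta \\
&= -\int\!\!\int q_z(z)q_\theta(\theta)\ln\left\{\frac{p(z,\theta|X)}{q_z(z)}\right\}dz\,d\theta + \int q_\theta(\theta)\ln q_\theta(\theta)\,d\theta \\
&= -\int\!\!\int q_z(z)q_\theta(\theta)\ln\left\{\frac{p(z,\theta|X)}{q_z(z)}\right\}dz\,d\theta + \text{const} \\
&= -\int q_\theta(\theta)\left\{\int q_z(z)\ln\left\{\frac{p(z,\theta|X)}{q_z(z)}\right\}dz\right\}d\theta + \text{const} \\
&= -\int q_z(z)\ln\left\{\frac{p(z,\theta_0|X)}{q_z(z)}\right\}dz + \text{const} \\
&= -\int q_z(z)\ln\left\{\frac{p(z|\theta_0,X)\,p(\theta_0|X)}{q_z(z)}\right\}dz + \text{const} \\
&= -\int q_z(z)\ln\left\{\frac{p(z|\theta_0,X)}{q_z(z)}\right\}dz + \text{const}
\end{aligned}
\]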
Here ’const’ denotes the terms independent of $q_z(z)$. As discussed at the end of this problem, this constant actually diverges, because it contains the negative entropy of a Dirac function:
\[
\int q_\theta(\theta)\ln q_\theta(\theta)\,d\theta
\]
Now it is clear that the KL divergence is minimized when $q_z(z)$ equals $p(z|\theta_0, X)$. This corresponds to the E-step. Next, we calculate the optimal $q_\theta(\theta)$, i.e., $\theta_0$, by maximizing $\mathcal{L}(q)$ given in Eq (10.3) while keeping $q_z(z)$ fixed:
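\[
\begin{aligned}
\mathcal{L}(q) &= \int q(Z)\ln\left\{\frac{p(X,Z)}{q(Z)}\right\}dZ \\
&= \int\!\!\int q_z(z)q_\theta(\theta)\ln\left\{\frac{p(X,z,\theta)}{q_z(z)q_\theta(\theta)}\right\}dz\,d\theta \\
&= \int\!\!\int q_z(z)q_\theta(\theta)\ln\left\{\frac{p(X,z,\theta)}{q_z(z)}\right\}dz\,d\theta - \int q_\theta(\theta)\ln q_\theta(\theta)\,d\theta \\
&= \int\!\!\int q_z(z)q_\theta(\theta)\ln p(X,z,\theta)\,dz\,d\theta - \int q_\theta(\theta)\ln q_\theta(\theta)\,d\theta + \text{const} \\
&= \int q_\theta(\theta)\,E_{q_z}[\ln p(X,z,\theta)]\,d\theta - \int q_\theta(\theta)\ln q_\theta(\theta)\,d\theta + \text{const} \\
&= E_{q_z(z)}[\ln p(X,z,\theta_0)] - \int q_\theta(\theta)\ln q_\theta(\theta)\,d\theta + \text{const}
\end{aligned}
\]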
The second term is the entropy of a Dirac function, which is $-\infty$ and independent of the value of $\theta_0$. Loosely speaking, we therefore only need to maximize the first term, which is exactly the M-step.
One important point needs to be clarified here. You may object that no matter how we set $\theta_0$, $\mathcal{L}(q)$ will always be $-\infty$. This is an intrinsic problem whenever we use a point estimate for $q_\theta(\theta)$, and it already occurs when we derive the optimal $q_z(z)$ by minimizing the KL divergence in the first step. Therefore, ’maximizing’ and ’minimizing’ should be understood in a loose sense in this problem, where we neglect the divergent term.
Let’s use the hint by first enforcing α → 1.

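\[
\begin{aligned}
D_\alpha(p\|q) &= \frac{4}{1-\alpha^2}\left(1 - \int p^{(1+\alpha)/2}q^{(1-\alpha)/2}\,dx\right) \\
&= \frac{4}{1-\alpha^2}\left\{1 - \int\frac{p}{p^{(1-\alpha)/2}}\left[1 + \frac{1-\alpha}{2}\ln q + O\!\left(\frac{1-\alpha}{2}\right)^2\right]dx\right\} \\
&= \frac{4}{1-\alpha^2}\left\{1 - \int p\cdot\frac{1 + \frac{1-\alpha}{2}\ln q + O\!\left(\frac{1-\alpha}{2}\right)^2}{1 + \frac{1-\alpha}{2}\ln p + O\!\left(\frac{1-\alpha}{2}\right)^2}\,dx\right\} \\
&\approx \frac{4}{1-\alpha^2}\left\{1 - \int p\cdot\frac{1 + \frac{1-\alpha}{2}\ln q}{1 + \frac{1-\alpha}{2}\ln p}\,dx\right\} \\
&= -\frac{4}{1-\alpha^2}\left\{\int p\cdot\left[\frac{1 + \frac{1-\alpha}{2}\ln q}{1 + \frac{1-\alpha}{2}\ln p} - 1\right]dx\right\} \\
&= -\frac{4}{1-\alpha^2}\left\{\int p\cdot\frac{\frac{1-\alpha}{2}\ln q - \frac{1-\alpha}{2}\ln p}{1 + \frac{1-\alpha}{2}\ln p}\,dx\right\} \\
&= -\frac{2}{1+\alpha}\left\{\int p\cdot\frac{\ln q - \ln p}{1 + \frac{1-\alpha}{2}\ln p}\,dx\right\} \\
&\approx -\int p\cdot(\ln q - \ln p)\,dx = -\int p\cdot\ln\frac{q}{p}\,dx
\end{aligned}
\]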
Here $p$ and $q$ are short for $p(x)$ and $q(x)$. The case $\alpha \to -1$ is similar. One important thing worth mentioning is that if we directly approximate $p^{(1+\alpha)/2}$ by $p$ instead of $p/p^{(1-\alpha)/2}$ in the first step, we won’t get the desired result.
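The two limits can be checked numerically on discrete distributions, where the integral becomes a sum. The following sketch uses two made-up probability vectors `p` and `q`; the function name `alpha_div` is introduced here only for illustration.

```python
import math

# Illustrative discrete distributions (each sums to 1).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

def alpha_div(alpha):
    """Discrete alpha-divergence D_alpha(p||q) for the vectors above."""
    s = sum(pi ** ((1 + alpha) / 2) * qi ** ((1 - alpha) / 2)
            for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))   # KL(p||q)
kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))   # KL(q||p)

print(alpha_div(0.9999), kl_pq)     # alpha -> 1
print(alpha_div(-0.9999), kl_qp)    # alpha -> -1
```

As $\alpha$ approaches $1$ the value approaches $KL(p\|q)$, and as $\alpha$ approaches $-1$ it approaches $KL(q\|p)$.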
Let’s begin from Eq (10.25).
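\[
\begin{aligned}
\ln q_\mu^\star(\mu) &= -\frac{E[\tau]}{2}\left\{\lambda_0(\mu-\mu_0)^2 + \sum_{n=1}^{N}(x_n-\mu)^2\right\} + \text{const} \\
&= -\frac{E[\tau]}{2}\left\{\lambda_0\mu^2 - 2\lambda_0\mu_0\mu + \lambda_0\mu_0^2 + N\mu^2 - 2\Big(\sum_{n=1}^{N}x_n\Big)\mu + \sum_{n=1}^{N}x_n^2\right\} + \text{const} \\
&= -\frac{E[\tau]}{2}\left\{(\lambda_0+N)\mu^2 - 2\Big(\lambda_0\mu_0 + \sum_{n=1}^{N}x_n\Big)\mu + \Big(\lambda_0\mu_0^2 + \sum_{n=1}^{N}x_n^2\Big)\right\} + \text{const} \\
&= -\frac{E[\tau](\lambda_0+N)}{2}\left\{\mu^2 - 2\,\frac{\lambda_0\mu_0 + \sum_{n=1}^{N}x_n}{\lambda_0+N}\,\mu + \frac{\lambda_0\mu_0^2 + \sum_{n=1}^{N}x_n^2}{\lambda_0+N}\right\} + \text{const}
\end{aligned}
\]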
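From this expression, we see that $q_\mu^\star(\mu)$ should be a Gaussian. Suppose that it has the form $q_\mu^\star(\mu) = \mathcal{N}(\mu|\mu_N, \lambda_N^{-1})$; then its logarithm can be written as:
\[
\ln q_\mu^\star(\mu) = \frac12\ln\frac{\lambda_N}{2\pi} - \frac{\lambda_N}{2}(\mu-\mu_N)^2
\]
We match the terms related to $\mu$ (the quadratic term and the linear term), yielding:
\[
\lambda_N = E[\tau]\cdot(\lambda_0+N), \quad\text{and}\quad \lambda_N\mu_N = E[\tau]\cdot(\lambda_0+N)\cdot\frac{\lambda_0\mu_0 + \sum_{n=1}^{N}x_n}{\lambda_0+N}
\]
\[
\mu_N = \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0+N}
\]
where $\bar{x}$ is the mean of $x_n$, i.e.,
\[
\bar{x} = \frac{1}{N}\sum_{n=1}^{N}x_n
\]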
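The matching of coefficients can be verified numerically: the unnormalized log density from Eq (10.25) and the Gaussian quadratic form should differ only by a constant that does not depend on $\mu$. The sketch below assumes made-up data and hyperparameters and a fixed value for $E[\tau]$; the function names are hypothetical.

```python
# Illustrative observations and hyperparameters (assumed values).
xs = [1.2, 0.7, 2.5, 1.9]
lambda0, mu0 = 2.0, 0.5
e_tau = 1.3                      # current value of E[tau]

N = len(xs)
xbar = sum(xs) / N
mu_N = (lambda0 * mu0 + N * xbar) / (lambda0 + N)
lambda_N = (lambda0 + N) * e_tau

def log_q_unnormalized(mu):
    """Right-hand side of Eq (10.25) without the constant."""
    return -0.5 * e_tau * (lambda0 * (mu - mu0) ** 2
                           + sum((x - mu) ** 2 for x in xs))

def log_gauss_kernel(mu):
    """Quadratic form of N(mu | mu_N, 1/lambda_N) without the constant."""
    return -0.5 * lambda_N * (mu - mu_N) ** 2

# The gap between the two expressions should be the same for every mu.
gaps = [log_q_unnormalized(m) - log_gauss_kernel(m) for m in (-3.0, 0.0, 1.0, 4.5)]
print(gaps)
```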
Then we deal with the other factor $q_\tau(\tau)$. Note that there is a typo in Eq (10.28): the coefficient ahead of $\ln\tau$ should be $\frac{N+1}{2}$. Let’s verify this by collecting the terms that contribute $\ln\tau$. The first term inside the expectation, i.e., $\ln p(D|\mu,\tau)$, gives $\frac{N}{2}\ln\tau$, and the second term inside the expectation, i.e., $\ln p(\mu|\tau)$, gives $\frac12\ln\tau$. Finally, the last term $\ln p(\tau)$ gives $(a_0-1)\ln\tau$. Consequently, Eq (10.29), Eq (10.31) and Eq (10.33) also change. The correct forms of these equations will be given in this and the following problems.
Now suppose that $q_\tau(\tau)$ is a Gamma distribution, i.e., $q_\tau(\tau) = \mathrm{Gam}(\tau|a_N, b_N)$; then we have:
\[
\ln q_\tau(\tau) = -\ln\Gamma(a_N) + a_N\ln b_N + (a_N-1)\ln\tau - b_N\tau
\]
Comparing it with Eq (10.28) and matching the coefficients ahead of $\tau$ and $\ln\tau$, we can obtain:
\[
a_N - 1 = a_0 - 1 + \frac{N+1}{2} \;\Rightarrow\; a_N = a_0 + \frac{N+1}{2}
\]
And similarly
\[
b_N = b_0 + \frac12 E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]
\]
Just as required.
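According to Eq (B.27), we have:

\[
\begin{aligned}
E[\tau] &= \frac{a_0 + (N+1)/2}{b_0 + \frac12 E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]} \\
&\approx \frac{N/2}{\frac12 E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]} \\
&= \frac{N}{E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]} \\
&= \left\{\frac{1}{N}\cdot E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]\right\}^{-1}
\end{aligned}
\]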
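According to Eq (B.28), we have:
\[
\begin{aligned}
\mathrm{var}[\tau] &= \frac{a_0 + (N+1)/2}{\left(b_0 + \frac12 E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]\right)^2} \\
&\approx \frac{N/2}{\frac14 E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]^2} \approx 0
\end{aligned}
\]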
Just as required.
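The underlying assumption of this problem is $a_0 = b_0 = \lambda_0 = 0$. According to Eq (10.26), Eq (10.27) and the definition of variance, we can obtain:
\[
\begin{aligned}
E[\mu^2] &= \lambda_N^{-1} + E[\mu]^2 = \frac{1}{(\lambda_0+N)E[\tau]} + \left(\frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0+N}\right)^2 \\
&= \frac{1}{N E[\tau]} + \bar{x}^2
\end{aligned}
\]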
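Note that there is a typo in Eq (10.29), as stated in the previous problem: a term $\frac12$ is missing. Therefore $E[\tau]^{-1}$ actually equals:
\[
\begin{aligned}
\frac{1}{E[\tau]} = \frac{b_N}{a_N} &= \frac{b_0 + \frac12 E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right]}{a_0 + (N+1)/2} \\
&= \frac{\frac12 E_\mu\left[\sum_{n=1}^{N}(x_n-\mu)^2\right]}{(N+1)/2} \\
&= \frac{1}{N+1}E_\mu\Big[\sum_{n=1}^{N}(x_n-\mu)^2\Big] \\
&= \frac{N}{N+1}\cdot\frac{1}{N}E_\mu\Big[\sum_{n=1}^{N}(x_n-\mu)^2\Big] \\
&= \frac{N}{N+1}\left\{\overline{x^2} - 2\bar{x}E[\mu] + E[\mu^2]\right\} \\
&= \frac{N}{N+1}\left\{\overline{x^2} - 2\bar{x}^2 + \frac{1}{N E[\tau]} + \bar{x}^2\right\} \\
&= \frac{N}{N+1}\left\{\overline{x^2} - \bar{x}^2 + \frac{1}{N E[\tau]}\right\} \\
&= \frac{N}{N+1}\left\{\overline{x^2} - \bar{x}^2\right\} + \frac{1}{(N+1)E[\tau]}
\end{aligned}
\]
Moving the last term to the left hand side and solving for $1/E[\tau]$, we obtain:
\[
\frac{1}{E[\tau]} = \overline{x^2} - \bar{x}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n-\bar{x})^2
\]
Actually it is still a biased estimator.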
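The self-consistent solution can also be found by simply iterating the two updates. The sketch below assumes $a_0 = b_0 = \lambda_0 = 0$ as in the problem, uses the corrected $a_N = (N+1)/2$, and runs on made-up data; variable names are illustrative.

```python
# Illustrative data (assumed values).
xs = [2.1, 3.4, 1.9, 4.2, 3.0]
N = len(xs)
xbar = sum(xs) / N
S = sum((x - xbar) ** 2 for x in xs)      # sum of squared deviations

inv_tau = 1.0                             # initial guess for 1 / E[tau]
for _ in range(100):
    e_mu = xbar                           # E[mu] with lambda_0 = 0
    e_mu2 = xbar ** 2 + inv_tau / N       # E[mu^2] = E[mu]^2 + 1 / (N E[tau])
    # b_N = 0.5 * E_mu[ sum_n (x_n - mu)^2 ] with b_0 = 0
    b_N = 0.5 * sum(x * x - 2 * x * e_mu + e_mu2 for x in xs)
    a_N = (N + 1) / 2                     # corrected shape parameter, a_0 = 0
    inv_tau = b_N / a_N                   # 1 / E[tau] = b_N / a_N

print(inv_tau, S / N)
```

The iteration converges to the maximum likelihood variance $\frac{1}{N}\sum_n (x_n - \bar{x})^2$, i.e., the biased estimator mentioned above.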

We substitute $\mathcal{L}_m$, i.e., Eq (10.35), back into the right hand side of Eq (10.34), yielding:
\[
\begin{aligned}
\text{(right)} &= \sum_m\sum_Z q(Z|m)q(m)\left\{\ln\frac{p(Z,X,m)}{q(Z|m)q(m)} - \ln\frac{p(Z,m|X)}{q(Z|m)q(m)}\right\} \\
&= \sum_m\sum_Z q(Z|m)q(m)\ln\left\{\frac{p(Z,X,m)}{p(Z,m|X)}\right\} \\
&= \sum_m\sum_Z q(Z,m)\ln p(X) \\
&= \ln p(X)
\end{aligned}
\]
Just as required.
We introduce the Lagrange multiplier:

\[
\begin{aligned}
L &= \sum_m\sum_Z q(Z|m)q(m)\ln\left\{\frac{p(Z,X,m)}{q(Z|m)q(m)}\right\} - \lambda\left\{\sum_m q(m) - 1\right\} \\
&= \sum_m\sum_Z q(Z|m)q(m)\left\{\ln p(Z,X,m) - \ln q(Z|m)\right\} - \sum_m\sum_Z q(Z|m)q(m)\ln q(m) - \lambda\left\{\sum_m q(m) - 1\right\} \\
&= \sum_m q(m)\cdot C - \sum_m\left\{\sum_Z q(Z|m)\right\}q(m)\ln q(m) - \lambda\left\{\sum_m q(m) - 1\right\}
\end{aligned}
\]
where we have defined:
\[
C = \sum_Z q(Z|m)\left\{\ln p(Z,X,m) - \ln q(Z|m)\right\}
\]
According to the calculus of variations given in Appendix D, and also Prob.1.34, we can obtain the derivative of $L$ with respect to $q(m)$ and set it to 0. Using $\sum_Z q(Z|m) = 1$:
\[
\begin{aligned}
\frac{\partial L}{\partial q(m)} &= C - \sum_Z q(Z|m)\left[\ln q(m) + 1\right] - \lambda \\
&= \sum_Z q(Z|m)\left\{\ln p(Z,X,m) - \ln q(Z|m)\right\} - \sum_Z q(Z|m)\ln q(m) - 1 - \lambda \\
&= \sum_Z q(Z|m)\ln\left\{\frac{p(Z,X,m)}{q(Z|m)q(m)}\right\} - 1 - \lambda = 0
\end{aligned}
\]
We multiply both sides by $q(m)$ and then perform summation over $m$, yielding:
\[
\sum_m\sum_Z q(Z|m)q(m)\ln\left\{\frac{p(Z,X,m)}{q(Z|m)q(m)}\right\} - (1+\lambda)\sum_m q(m) = 0
\]
Noticing that the first term is actually $\mathcal{L}_m$ defined in Eq (10.35) and that the summation of $q(m)$ over $m$ equals 1, we can obtain:
\[
\lambda = \mathcal{L}_m - 1
\]
We substitute $\lambda$ back into the derivative, yielding:

\[
\sum_Z q(Z|m)\ln\left\{\frac{p(Z,X,m)}{q(Z|m)q(m)}\right\} - \mathcal{L}_m = 0 \qquad (\ast)
\]
One important thing must be clarified here: there is a typo in Eq (10.36). $\mathcal{L}_m$ in Eq (10.36) should be $\mathcal{L}''$, which is defined as:
\[
\mathcal{L}'' = \sum_Z q(Z|m)\ln\left\{\frac{p(Z,X|m)}{q(Z|m)}\right\}
\]
Now with the definition of $\mathcal{L}''$, we expand $(\ast)$:
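\[
\begin{aligned}
(\ast) &= \sum_Z q(Z|m)\ln\left\{\frac{p(Z,X,m)}{q(Z|m)q(m)}\right\} - \mathcal{L}_m \\
&= \sum_Z q(Z|m)\ln\left\{\frac{p(Z,X|m)\,p(m)}{q(Z|m)q(m)}\right\} - \mathcal{L}_m \\
&= \mathcal{L}'' + \sum_Z q(Z|m)\ln\frac{p(m)}{q(m)} - \mathcal{L}_m \\
&= \mathcal{L}'' + \ln\frac{p(m)}{q(m)} - \sum_m\sum_Z q(Z|m)q(m)\ln\left\{\frac{p(Z,X,m)}{q(Z|m)q(m)}\right\} \\
&= \mathcal{L}'' + \ln\frac{p(m)}{q(m)} - \sum_m q(m)\left\{\sum_Z q(Z|m)\ln\frac{p(Z,X|m)\,p(m)}{q(Z|m)q(m)}\right\} \\
&= \mathcal{L}'' + \ln\frac{p(m)}{q(m)} - \sum_m q(m)\left\{\mathcal{L}'' + \sum_Z q(Z|m)\ln\frac{p(m)}{q(m)}\right\} \\
&= \mathcal{L}'' + \ln\frac{p(m)}{q(m)} - \sum_m q(m)\left\{\mathcal{L}'' + \ln\frac{p(m)}{q(m)}\right\} \\
&= \ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)} - \sum_m q(m)\ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)} = 0
\end{aligned}
\]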
The solution is given by:

\[
q(m) = \frac{1}{A}\cdot p(m)\exp(\mathcal{L}'')
\]
where $\frac{1}{A}$ is a normalization constant, used to guarantee that the summation of $q(m)$ over $m$ equals 1. More specifically, $A$ is given by:
\[
A = \sum_m p(m)\exp(\mathcal{L}'')
\]
Therefore, $A$ does not depend on the value of $m$. You can verify the result for $q(m)$ by substituting it back into the last line of $(\ast)$, yielding:
\[
\ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)} - \sum_m q(m)\ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)} = \ln A - \sum_m q(m)\cdot\ln A = 0
\]
One last thing worth mentioning is that you can obtain $q(m)$ directly from $\mathcal{L}_m$ given in Eq (10.35), without introducing a Lagrange multiplier. In this way, we actually obtain:
\[
\mathcal{L}_m = \sum_m q(m)\ln\frac{p(m)\exp(\mathcal{L}'')}{q(m)}
\]
This is the negative of the KL divergence between $q(m)$ and $p(m)\exp(\mathcal{L}'')$. Since $p(m)\exp(\mathcal{L}'')$ is not normalized, we cannot set $q(m)$ equal to $p(m)\exp(\mathcal{L}'')$ to achieve the minimum KL divergence, i.e., 0, because $q(m)$ is a probability distribution and must sum to 1 over $m$.
Therefore, we can guess that the optimal $q(m)$ is given by the normalized $p(m)\exp(\mathcal{L}'')$. In this way, the constraint that $q(m)$ sums to 1 over $m$ is satisfied automatically. The more rigorous proof using the Lagrange multiplier has been shown above.
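A small numerical experiment supports this guess. The sketch below assumes a made-up prior `prior` over three models and hypothetical per-model values `L_pp` standing in for $\mathcal{L}''$; it compares the bound at the normalized solution with the bound at random distributions.

```python
import math
import random

prior = [0.5, 0.3, 0.2]        # p(m), illustrative
L_pp = [-3.0, -1.5, -2.2]      # hypothetical values of L'' for each model m

def bound(q):
    """sum_m q(m) * ln( p(m) * exp(L''_m) / q(m) )."""
    return sum(qm * (math.log(pm) + lm - math.log(qm))
               for qm, pm, lm in zip(q, prior, L_pp))

unnorm = [pm * math.exp(lm) for pm, lm in zip(prior, L_pp)]
A = sum(unnorm)                               # normalization constant
q_star = [u / A for u in unnorm]              # normalized p(m) exp(L'')

rng = random.Random(1)

def random_dist(n):
    """Draw a random probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

trials = [bound(random_dist(3)) for _ in range(1000)]
print(bound(q_star), math.log(A), max(trials))
```

At the normalized solution the bound equals $\ln A$, and no random candidate exceeds it.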
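The solution procedure has already been given in Eq (10.43) - (10.49), so here we explain it in more detail, starting from Eq (10.43):
\[
\begin{aligned}
\ln q^\star(Z) &= E_{\pi,\mu,\Lambda}[\ln p(X,Z,\pi,\mu,\Lambda)] + \text{const} \\
&= E_\pi[\ln p(Z|\pi)] + E_{\mu,\Lambda}[\ln p(X|Z,\mu,\Lambda)] + \text{const} \\
&= \text{const} + E_\pi\Big[\sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\ln\pi_k\Big] \\
&\quad + E_{\mu,\Lambda}\Big[\sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\Big\{\frac12\ln|\Lambda_k| - \frac{D}{2}\ln 2\pi - \frac12(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\Big\}\Big] \\
&= \text{const} + \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}E_\pi[\ln\pi_k] \\
&\quad + \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}E_{\mu,\Lambda}\Big[\frac12\ln|\Lambda_k| - \frac{D}{2}\ln 2\pi - \frac12(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\Big] \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\ln\rho_{nk} + \text{const}
\end{aligned}
\]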
where we have used Eq (10.37) and Eq (10.38), and $D$ is the dimension of $x_n$. Here $\ln\rho_{nk}$ is defined as:
\[
\begin{aligned}
\ln\rho_{nk} &= E_\pi[\ln\pi_k] + E_{\mu,\Lambda}\Big[\frac12\ln|\Lambda_k| - \frac{D}{2}\ln 2\pi - \frac12(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\Big] \\
&= E_\pi[\ln\pi_k] + \frac12 E_{\mu,\Lambda}[\ln|\Lambda_k|] - \frac{D}{2}\ln 2\pi - \frac12 E_{\mu_k,\Lambda_k}\big[(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\big]
\end{aligned}
\]
Taking the exponential of both sides, we can obtain:
\[
q^\star(Z) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\rho_{nk}^{z_{nk}}
\]
Because $q^\star(Z)$ should be correctly normalized, we are required to find the normalization constant. In this problem, directly calculating the normalization constant by summing $q^\star(Z)$ over $Z$ is nontrivial. Therefore, we will prove that Eq (10.49) is the correct normalization by mathematical induction. When $N = 1$, $q^\star(Z)$ reduces to $\prod_{k=1}^{K}\rho_{1k}^{z_{1k}}$, and it is easy to see that the normalization constant is given by:
\[
A = \sum_{z_1}\prod_{k=1}^{K}\rho_{1k}^{z_{1k}} = \sum_{j=1}^{K}\rho_{1j}
\]
Here we have used the 1-of-K coding scheme for $z_1 = [z_{11}, z_{12}, ..., z_{1K}]^T$, i.e., only one of $\{z_{11}, z_{12}, ..., z_{1K}\}$ is 1 and all the others are 0. Therefore the summation over $z_1$ is made up of $K$ terms, and the $j$-th term corresponds to $z_{1j} = 1$ with all other $z_{1i}$ equal to 0. In this case, we have obtained:
\[
q^\star(Z) = \frac{1}{A}\prod_{k=1}^{K}\rho_{1k}^{z_{1k}} = \prod_{k=1}^{K}\left(\frac{\rho_{1k}}{\sum_{j=1}^{K}\rho_{1j}}\right)^{z_{1k}}
\]
It is exactly the same as Eq (10.48) and Eq (10.49). Suppose now we have

proved that for N − 1, the normalized q⋆ (Z) is given by Eq (10.48) and Eq
(10.49). For N , we have:
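\[
\begin{aligned}
\sum_Z q^\star(Z) &= \sum_{z_1,...,z_N}\prod_{n=1}^{N}\prod_{k=1}^{K}r_{nk}^{z_{nk}} \\
&= \sum_{z_N}\Big\{\sum_{z_1,...,z_{N-1}}\prod_{n=1}^{N}\prod_{k=1}^{K}r_{nk}^{z_{nk}}\Big\} \\
&= \sum_{z_N}\Big\{\sum_{z_1,...,z_{N-1}}\Big[\prod_{k=1}^{K}r_{Nk}^{z_{Nk}}\Big]\cdot\Big[\prod_{n=1}^{N-1}\prod_{k=1}^{K}r_{nk}^{z_{nk}}\Big]\Big\} \\
&= \sum_{z_N}\Big\{\Big[\prod_{k=1}^{K}r_{Nk}^{z_{Nk}}\Big]\cdot\sum_{z_1,...,z_{N-1}}\prod_{n=1}^{N-1}\prod_{k=1}^{K}r_{nk}^{z_{nk}}\Big\} \\
&= \sum_{z_N}\Big\{\Big[\prod_{k=1}^{K}r_{Nk}^{z_{Nk}}\Big]\cdot 1\Big\} \\
&= \sum_{z_N}\prod_{k=1}^{K}r_{Nk}^{z_{Nk}} = \sum_{k=1}^{K}r_{Nk} = 1
\end{aligned}
\]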
The proof of the final step is exactly the same as that for N = 1. So now,
with the assumption Eq (10.48) and Eq (10.49) are right for N − 1, we have
shown that they are also correct for N . The proof is complete.
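For small $N$ and $K$ the sum over all assignments $Z$ can be enumerated by brute force, which gives an independent check that the normalization constant factorizes as $\prod_n\sum_k\rho_{nk}$. The sketch below uses a made-up table `rho`; it is only a sanity check, not part of the induction proof.

```python
import itertools

# rho[n][k] > 0, illustrative values with N = 3 data points and K = 3 components.
rho = [[0.2, 1.5, 0.7],
       [2.0, 0.1, 0.4],
       [0.9, 0.9, 0.3]]
N, K = len(rho), len(rho[0])

# Each z_n is one-hot, so an assignment Z is one component index per data point.
total = 0.0
for assignment in itertools.product(range(K), repeat=N):
    term = 1.0
    for n, k in enumerate(assignment):
        term *= rho[n][k]          # prod_k rho_nk^{z_nk} picks out rho[n][k]
    total += term

product_of_sums = 1.0
for n in range(N):
    product_of_sums *= sum(rho[n])

print(total, product_of_sums)
```

Dividing each row of `rho` by its sum therefore yields responsibilities for which the unnormalized product already sums to 1, as in Eq (10.48) and Eq (10.49).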

Let’s start from Eq (10.54).

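\[
\begin{aligned}
\ln q^\star(\pi,\mu,\Lambda) &\propto \ln p(\pi) + \sum_{k=1}^{K}\ln p(\mu_k,\Lambda_k) + E[\ln p(Z|\pi)] + \sum_{k=1}^{K}\sum_{n=1}^{N}E[z_{nk}]\ln\mathcal{N}(x_n|\mu_k,\Lambda_k^{-1}) \\
&= \ln C(\alpha_0) + \sum_{k=1}^{K}(\alpha_0-1)\ln\pi_k \\
&\quad + \sum_{k=1}^{K}\ln\mathcal{N}(\mu_k|m_0,(\beta_0\Lambda_k)^{-1}) + \sum_{k=1}^{K}\ln\mathcal{W}(\Lambda_k|W_0,v_0) \\
&\quad + \sum_{n=1}^{N}\sum_{k=1}^{K}\ln\pi_k\,E[z_{nk}] + \sum_{k=1}^{K}\sum_{n=1}^{N}E[z_{nk}]\ln\mathcal{N}(x_n|\mu_k,\Lambda_k^{-1})
\end{aligned}
\]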
It is easy to observe that the equation above can be decomposed into a sum of terms involving only $\pi$ together with those involving only $\mu$ and $\Lambda$. In other words, $q(\pi,\mu,\Lambda)$ can be factorized into the product of $q(\pi)$ and $q(\mu,\Lambda)$. We first extract the terms that depend on $\pi$.
\[
\begin{aligned}
\ln q^\star(\pi) &\propto (\alpha_0-1)\sum_{k=1}^{K}\ln\pi_k + \sum_{k=1}^{K}\sum_{n=1}^{N}\ln\pi_k\,E[z_{nk}] \\
&= (\alpha_0-1)\sum_{k=1}^{K}\ln\pi_k + \sum_{k=1}^{K}\sum_{n=1}^{N}r_{nk}\ln\pi_k \\
&= \sum_{k=1}^{K}\ln\pi_k\cdot\Big[\alpha_0 - 1 + \sum_{n=1}^{N}r_{nk}\Big]
\end{aligned}
\]
Comparing it to the standard form of a Dirichlet distribution, we can conclude that $q^\star(\pi) = \mathrm{Dir}(\pi|\alpha)$, where the $k$-th entry of $\alpha$, i.e., $\alpha_k$, is given by:
\[
\alpha_k = \alpha_0 + \sum_{n=1}^{N}r_{nk} = \alpha_0 + N_k
\]
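Next we gather all the terms dependent on $\mu = \{\mu_k\}$ and $\Lambda = \{\Lambda_k\}$:

\[
\begin{aligned}
\ln q^\star(\mu,\Lambda) &= \sum_{k=1}^{K}\ln\mathcal{N}(\mu_k|m_0,(\beta_0\Lambda_k)^{-1}) + \sum_{k=1}^{K}\ln\mathcal{W}(\Lambda_k|W_0,v_0) + \sum_{k=1}^{K}\sum_{n=1}^{N}E[z_{nk}]\ln\mathcal{N}(x_n|\mu_k,\Lambda_k^{-1}) \\
&\propto \sum_{k=1}^{K}\Big\{\frac12\ln|\beta_0\Lambda_k| - \frac12(\mu_k-m_0)^T\beta_0\Lambda_k(\mu_k-m_0)\Big\} \\
&\quad + \sum_{k=1}^{K}\Big\{\frac{v_0-D-1}{2}\ln|\Lambda_k| - \frac12\mathrm{Tr}(W_0^{-1}\Lambda_k)\Big\} \\
&\quad + \sum_{k=1}^{K}\sum_{n=1}^{N}r_{nk}\Big\{\frac12\ln|\Lambda_k| - \frac12(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\Big\}
\end{aligned}
\]
We use the knowledge that the optimal $q^\star(\mu,\Lambda)$ can be written as:
\[
q^\star(\mu,\Lambda) = \prod_{k=1}^{K}q^\star(\mu_k|\Lambda_k)\,q^\star(\Lambda_k) = \prod_{k=1}^{K}\mathcal{N}(\mu_k|m_k,(\beta_k\Lambda_k)^{-1})\,\mathcal{W}(\Lambda_k|W_k,v_k) \qquad (\ast)
\]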
We first complete the square with respect to $\mu_k$. For a given $k$, the quadratic term is given by:
\[
-\frac12\mu_k^T(\beta_0\Lambda_k)\mu_k - \sum_{n=1}^{N}\frac12 r_{nk}\,\mu_k^T\Lambda_k\mu_k = -\frac12\mu_k^T(\beta_0\Lambda_k + N_k\Lambda_k)\mu_k
\]
Therefore, comparing with $(\ast)$, we can obtain:
\[
\beta_k = \beta_0 + N_k
\]
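Next, we write down the linear term with respect to $\mu_k$:
\[
\begin{aligned}
\mu_k^T(\beta_0\Lambda_k m_0) + \sum_{n=1}^{N}r_{nk}\cdot\mu_k^T(\Lambda_k x_n) &= \mu_k^T\Big(\beta_0\Lambda_k m_0 + \sum_{n=1}^{N}r_{nk}\Lambda_k x_n\Big) \\
&= \mu_k^T\Lambda_k\Big(\beta_0 m_0 + \sum_{n=1}^{N}r_{nk}x_n\Big) \\
&= \mu_k^T\Lambda_k(\beta_0 m_0 + N_k\bar{x}_k)
\end{aligned}
\]
where we have defined:
\[
\bar{x}_k = \frac{1}{\sum_{n=1}^{N}r_{nk}}\sum_{n=1}^{N}r_{nk}x_n = \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}x_n
\]
Comparing to the standard form, we can obtain:
\[
m_k = \frac{1}{\beta_k}(\beta_0 m_0 + N_k\bar{x}_k)
\]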
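Now we have obtained $q^\star(\mu_k|\Lambda_k) = \mathcal{N}(\mu_k|m_k,(\beta_k\Lambda_k)^{-1})$. Using the relation:
\[
\ln q^\star(\Lambda_k) = \ln q^\star(\mu_k,\Lambda_k) - \ln q^\star(\mu_k|\Lambda_k)
\]
and focusing only on the terms dependent on $\Lambda_k$, we can obtain:

\[
\begin{aligned}
\ln q^\star(\Lambda_k) &\propto \Big\{\frac12\ln|\beta_0\Lambda_k| - \frac12(\mu_k-m_0)^T\beta_0\Lambda_k(\mu_k-m_0)\Big\} \\
&\quad + \Big\{\frac{v_0-D-1}{2}\ln|\Lambda_k| - \frac12\mathrm{Tr}(W_0^{-1}\Lambda_k)\Big\} \\
&\quad + \sum_{n=1}^{N}r_{nk}\Big\{\frac12\ln|\Lambda_k| - \frac12(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\Big\} \\
&\quad - \Big\{\frac12\ln|\beta_k\Lambda_k| - \frac12(\mu_k-m_k)^T\beta_k\Lambda_k(\mu_k-m_k)\Big\} \\
&\propto \Big\{\frac12\ln|\Lambda_k| - \frac12\mathrm{Tr}\big[\beta_0(\mu_k-m_0)(\mu_k-m_0)^T\cdot\Lambda_k\big]\Big\} \\
&\quad + \Big\{\frac{v_0-D-1}{2}\ln|\Lambda_k| - \frac12\mathrm{Tr}(W_0^{-1}\Lambda_k)\Big\} \\
&\quad + \Big\{\frac{N_k}{2}\ln|\Lambda_k| - \frac12\mathrm{Tr}\Big[\sum_{n=1}^{N}r_{nk}(x_n-\mu_k)(x_n-\mu_k)^T\cdot\Lambda_k\Big]\Big\} \\
&\quad - \Big\{\frac12\ln|\Lambda_k| - \frac12\mathrm{Tr}\big[\beta_k(\mu_k-m_k)(\mu_k-m_k)^T\cdot\Lambda_k\big]\Big\} \\
&= \frac{v_0-D-1+N_k}{2}\ln|\Lambda_k| - \frac12\mathrm{Tr}[T\cdot\Lambda_k]
\end{aligned}
\]
where
\[
T = \beta_0(\mu_k-m_0)(\mu_k-m_0)^T + W_0^{-1} + \sum_{n=1}^{N}r_{nk}(x_n-\mu_k)(x_n-\mu_k)^T - \beta_k(\mu_k-m_k)(\mu_k-m_k)^T
\]
By matching the coefficient ahead of $\ln|\Lambda_k|$, we can obtain:

\[
v_k = v_0 + N_k
\]
Next, by matching the coefficient in the trace, we see that:
\[
W_k^{-1} = T
\]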
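Let’s further simplify $T$, beginning by introducing a useful equation, which will be used here and later in Prob.10.16:
\[
\begin{aligned}
\sum_{n=1}^{N}r_{nk}x_nx_n^T &= \sum_{n=1}^{N}r_{nk}(x_n-\bar{x}_k+\bar{x}_k)(x_n-\bar{x}_k+\bar{x}_k)^T \\
&= \sum_{n=1}^{N}r_{nk}\big[(x_n-\bar{x}_k)(x_n-\bar{x}_k)^T + \bar{x}_k\bar{x}_k^T + 2(x_n-\bar{x}_k)\bar{x}_k^T\big] \\
&= \sum_{n=1}^{N}r_{nk}\big[(x_n-\bar{x}_k)(x_n-\bar{x}_k)^T\big] + \sum_{n=1}^{N}r_{nk}\big[\bar{x}_k\bar{x}_k^T\big] + \sum_{n=1}^{N}r_{nk}\big[2(x_n-\bar{x}_k)\bar{x}_k^T\big] \\
&= N_kS_k + N_k\bar{x}_k\bar{x}_k^T + 2\sum_{n=1}^{N}r_{nk}\big[(x_n-\bar{x}_k)\bar{x}_k^T\big] \\
&= N_kS_k + N_k\bar{x}_k\bar{x}_k^T + 2\big[(N_k\bar{x}_k - N_k\bar{x}_k)\bar{x}_k^T\big] \\
&= N_kS_k + N_k\bar{x}_k\bar{x}_k^T
\end{aligned}
\]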
where in the last step we have used Eq (10.51). Now we are ready to prove that $T$ is exactly given by Eq (10.62). Let’s first consider the coefficients of the terms quadratic in $\mu_k$:
\[
\text{(quad)} = \beta_0\mu_k\mu_k^T + \sum_{n=1}^{N}r_{nk}\mu_k\mu_k^T - \beta_k\mu_k\mu_k^T = \Big(\beta_0 + \sum_{n=1}^{N}r_{nk} - \beta_k\Big)\mu_k\mu_k^T = 0
\]
where the summation is equal to $N_k$ and we have also used Eq (10.60). Next we focus on the linear term:
\[
\begin{aligned}
\text{(linear)} &= -2\beta_0 m_0\mu_k^T - 2\sum_{n=1}^{N}r_{nk}x_n\mu_k^T + 2\beta_k m_k\mu_k^T \\
&= 2\Big(-\beta_0 m_0 - \sum_{n=1}^{N}r_{nk}x_n + \beta_k m_k\Big)\mu_k^T = 0
\end{aligned}
\]
where the bracket vanishes because $\beta_k m_k = \beta_0 m_0 + N_k\bar{x}_k$.
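Finally we deal with the constant term:

\[
\begin{aligned}
\text{(const)} &= W_0^{-1} + \beta_0 m_0m_0^T + \sum_{n=1}^{N}r_{nk}x_nx_n^T - \beta_k m_km_k^T \\
&= W_0^{-1} + \beta_0 m_0m_0^T + N_kS_k + N_k\bar{x}_k\bar{x}_k^T - \beta_k m_km_k^T \\
&= W_0^{-1} + N_kS_k + \beta_0 m_0m_0^T + N_k\bar{x}_k\bar{x}_k^T - \frac{1}{\beta_k}\beta_k^2 m_km_k^T \\
&= W_0^{-1} + N_kS_k + \beta_0 m_0m_0^T + N_k\bar{x}_k\bar{x}_k^T - \frac{1}{\beta_k}(\beta_0 m_0 + N_k\bar{x}_k)(\beta_0 m_0 + N_k\bar{x}_k)^T \\
&= W_0^{-1} + N_kS_k + \Big(\beta_0 - \frac{\beta_0^2}{\beta_k}\Big)m_0m_0^T + \Big(N_k - \frac{N_k^2}{\beta_k}\Big)\bar{x}_k\bar{x}_k^T - \frac{1}{\beta_k}2(\beta_0 m_0)\cdot(N_k\bar{x}_k)^T \\
&= W_0^{-1} + N_kS_k + \frac{\beta_0N_k}{\beta_k}m_0m_0^T + \frac{\beta_0N_k}{\beta_k}\bar{x}_k\bar{x}_k^T - \frac{\beta_0N_k}{\beta_k}2(m_0)\cdot(\bar{x}_k)^T \\
&= W_0^{-1} + N_kS_k + \frac{\beta_0N_k}{\beta_k}(m_0-\bar{x}_k)(m_0-\bar{x}_k)^T
\end{aligned}
\]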
Just as required.
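The simplification of the constant term can be checked numerically in the scalar case ($D = 1$), where outer products reduce to squares. The sketch below assumes made-up data, responsibilities and hyperparameters for one fixed component $k$, and leaves out $W_0^{-1}$ since it appears unchanged on both sides.

```python
# Scalar sanity check (D = 1); all numbers are illustrative assumptions.
xs = [0.5, 1.7, -0.3, 2.4]           # data points x_n
r = [0.9, 0.2, 0.6, 0.1]             # responsibilities r_nk for one fixed k
beta0, m0 = 0.7, 0.4                 # prior hyperparameters

Nk = sum(r)
xbar = sum(rn * x for rn, x in zip(r, xs)) / Nk
Sk = sum(rn * (x - xbar) ** 2 for rn, x in zip(r, xs)) / Nk
betak = beta0 + Nk                                  # Eq (10.60)
mk = (beta0 * m0 + Nk * xbar) / betak               # Eq (10.61)

# Constant term of T without W0^{-1}, before and after simplification.
lhs = beta0 * m0 ** 2 + sum(rn * x * x for rn, x in zip(r, xs)) - betak * mk ** 2
rhs = Nk * Sk + beta0 * Nk / betak * (m0 - xbar) ** 2
print(lhs, rhs)
```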
Let’s begin by definition.

\[
\begin{aligned}
E_{\mu_k,\Lambda_k}[(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)] &= \int\!\!\int(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\,q^\star(\mu_k,\Lambda_k)\,d\mu_k\,d\Lambda_k \\
&= \int\Big\{\int(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\,q^\star(\mu_k|\Lambda_k)\,d\mu_k\Big\}q^\star(\Lambda_k)\,d\Lambda_k \\
&= \int E_{\mu_k}[(\mu_k-x_n)^T\Lambda_k(\mu_k-x_n)]\cdot q^\star(\Lambda_k)\,d\Lambda_k
\end{aligned}
\]
The inner expectation is with respect to $\mu_k$, which follows a Gaussian distribution. We use Eq (380) in ’MatrixCookbook’: if $x \sim \mathcal{N}(m, \Sigma)$, we have:
\[
E[(x-m')^TA(x-m')] = (m-m')^TA(m-m') + \mathrm{Tr}(A\Sigma)
\]
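Therefore, here we can obtain:

\[
E_{\mu_k}[(\mu_k-x_n)^T\Lambda_k(\mu_k-x_n)] = (m_k-x_n)^T\Lambda_k(m_k-x_n) + \mathrm{Tr}\big[\Lambda_k\cdot(\beta_k\Lambda_k)^{-1}\big]
\]
Substituting it back into the integration, we can obtain:

\[
\begin{aligned}
E_{\mu_k,\Lambda_k}[(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)] &= \int\big[(m_k-x_n)^T\Lambda_k(m_k-x_n) + D\beta_k^{-1}\big]\cdot q^\star(\Lambda_k)\,d\Lambda_k \\
&= D\beta_k^{-1} + E_{\Lambda_k}\big[(m_k-x_n)^T\Lambda_k(m_k-x_n)\big] \\
&= D\beta_k^{-1} + E_{\Lambda_k}\big\{\mathrm{Tr}[\Lambda_k\cdot(m_k-x_n)(m_k-x_n)^T]\big\} \\
&= D\beta_k^{-1} + \mathrm{Tr}\big\{E_{\Lambda_k}[\Lambda_k]\cdot(m_k-x_n)(m_k-x_n)^T\big\} \\
&= D\beta_k^{-1} + \mathrm{Tr}\big\{v_kW_k\cdot(m_k-x_n)(m_k-x_n)^T\big\} \\
&= D\beta_k^{-1} + v_k(m_k-x_n)^TW_k(m_k-x_n)
\end{aligned}
\]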
Just as required.
There is a typo in Eq (10.69). The numerator should be $\alpha_0 + N_k$. Let’s substitute Eq (10.58) into (B.17):
\[
E[\pi_k] = \frac{\alpha_k}{\sum_k\alpha_k} = \frac{\alpha_0 + N_k}{K\alpha_0 + \sum_k N_k} = \frac{\alpha_0 + N_k}{K\alpha_0 + N}
\]
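As a small illustration, the sketch below computes $N_k$, $\alpha_k$ and the corrected $E[\pi_k]$ from a made-up responsibility matrix `r` (rows sum to 1) and an assumed $\alpha_0$, and confirms that the expected mixing coefficients form a valid distribution.

```python
# Illustrative responsibilities r[n][k]; each row sums to 1.
r = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4],
     [0.6, 0.1, 0.3]]
alpha0 = 0.5                          # assumed prior concentration
N, K = len(r), len(r[0])

Nk = [sum(r[n][k] for n in range(N)) for k in range(K)]
alpha = [alpha0 + nk for nk in Nk]                          # Eq (10.58)
e_pi = [(alpha0 + nk) / (K * alpha0 + N) for nk in Nk]      # corrected Eq (10.69)

print(alpha, e_pi)
```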

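\[
\begin{aligned}
E[\ln p(X|Z,\mu,\Lambda)] &= \sum_{n=1}^{N}\sum_{k=1}^{K}E[z_{nk}\ln\mathcal{N}(x_n|\mu_k,\Lambda_k^{-1})] \\
&= \sum_{n=1}^{N}\sum_{k=1}^{K}E[z_{nk}]\cdot E\Big[-\frac{D}{2}\ln 2\pi + \frac12\ln|\Lambda_k| - \frac12(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)\Big] \\
&= \frac12\sum_{n=1}^{N}\sum_{k=1}^{K}E[z_{nk}]\cdot\Big\{-D\ln 2\pi + E[\ln|\Lambda_k|] - E[(x_n-\mu_k)^T\Lambda_k(x_n-\mu_k)]\Big\} \\
&= \frac12\sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\cdot\Big\{-D\ln 2\pi + \ln\tilde{\Lambda}_k - D\beta_k^{-1} - v_k(x_n-m_k)^TW_k(x_n-m_k)\Big\}
\end{aligned}
\]
where we have used Eq (10.50), Eq (10.64) and Eq (10.65). We first deal with the first three terms inside the bracket, i.e.,
\[
\begin{aligned}
\frac12\sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\cdot\Big\{-D\ln 2\pi + \ln\tilde{\Lambda}_k - D\beta_k^{-1}\Big\} &= \frac12\sum_{k=1}^{K}\sum_{n=1}^{N}r_{nk}\cdot\Big\{-D\ln 2\pi + \ln\tilde{\Lambda}_k - D\beta_k^{-1}\Big\} \\
&= \frac12\sum_{k=1}^{K}\Big[\sum_{n=1}^{N}r_{nk}\Big]\cdot\Big[-D\ln 2\pi + \ln\tilde{\Lambda}_k - D\beta_k^{-1}\Big] \\
&= \frac12\sum_{k=1}^{K}N_k\cdot\Big[-D\ln 2\pi + \ln\tilde{\Lambda}_k - D\beta_k^{-1}\Big]
\end{aligned}
\]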
where we have used the definition of $N_k$. Next we deal with the last term inside the bracket, i.e.,
\[
\begin{aligned}
\frac12\sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\cdot\Big\{-v_k(x_n-m_k)^TW_k(x_n-m_k)\Big\} &= -\frac12\sum_{n=1}^{N}\sum_{k=1}^{K}\mathrm{Tr}\big[r_{nk}v_k\cdot(x_n-m_k)(x_n-m_k)^T\cdot W_k\big] \\
&= -\frac12\sum_{k=1}^{K}\mathrm{Tr}\Big[\sum_{n=1}^{N}r_{nk}v_k\cdot(x_n-m_k)(x_n-m_k)^T\cdot W_k\Big]
\end{aligned}
\]
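Since we have:
\[
\begin{aligned}
\sum_{n=1}^{N}r_{nk}v_k\cdot(x_n-m_k)(x_n-m_k)^T &= v_k\sum_{n=1}^{N}r_{nk}\cdot(\bar{x}_k-m_k+x_n-\bar{x}_k)(\bar{x}_k-m_k+x_n-\bar{x}_k)^T \\
&= v_k\sum_{n=1}^{N}r_{nk}\cdot(\bar{x}_k-m_k)(\bar{x}_k-m_k)^T \\
&\quad + v_k\sum_{n=1}^{N}r_{nk}\cdot(x_n-\bar{x}_k)(x_n-\bar{x}_k)^T \\
&\quad + v_k\sum_{n=1}^{N}r_{nk}\cdot 2(\bar{x}_k-m_k)(x_n-\bar{x}_k)^T \\
&= v_kN_k\cdot(\bar{x}_k-m_k)(\bar{x}_k-m_k)^T + v_kN_kS_k \\
&\quad + v_k\cdot 2(\bar{x}_k-m_k)\Big(\sum_{n=1}^{N}r_{nk}x_n - \sum_{n=1}^{N}r_{nk}\bar{x}_k\Big)^T \\
&= v_kN_k\cdot(\bar{x}_k-m_k)(\bar{x}_k-m_k)^T + v_kN_kS_k \\
&\quad + v_k\cdot 2(\bar{x}_k-m_k)(N_k\bar{x}_k - N_k\bar{x}_k)^T \\
&= v_kN_k\cdot(\bar{x}_k-m_k)(\bar{x}_k-m_k)^T + v_kN_kS_k
\end{aligned}
\]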
Therefore, the last term can be reduced to:

\[
\begin{aligned}
\frac12\sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\cdot\Big\{-v_k(x_n-m_k)^TW_k(x_n-m_k)\Big\} &= -\frac12\sum_{k=1}^{K}\mathrm{Tr}\big[v_kN_k\cdot(\bar{x}_k-m_k)(\bar{x}_k-m_k)^TW_k\big] - \frac12\sum_{k=1}^{K}\mathrm{Tr}[v_kN_kS_kW_k] \\
&= -\frac12\sum_{k=1}^{K}N_kv_k\cdot(\bar{x}_k-m_k)^TW_k(\bar{x}_k-m_k) - \frac12\sum_{k=1}^{K}N_kv_k\mathrm{Tr}[S_kW_k]
\end{aligned}
\]
If we combine the first three terms and the last term, we obtain exactly Eq (10.71).
Next we prove Eq (10.72). According to Eq (10.37), we have:
\[
E[\ln p(Z|\pi)] = \sum_{n=1}^{N}\sum_{k=1}^{K}E[z_{nk}\ln\pi_k] = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\ln\tilde{\pi}_k
\]
Just as required.
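According to Eq (10.39), we have:
\[
\begin{aligned}
E[\ln p(\pi)] &= \ln C(\alpha_0) + (\alpha_0-1)\sum_{k=1}^{K}E[\ln\pi_k] \\
&= \ln C(\alpha_0) + (\alpha_0-1)\sum_{k=1}^{K}\ln\tilde{\pi}_k
\end{aligned}
\]
\[
\begin{aligned}
E[\ln p(\mu,\Lambda)] &= \sum_{k=1}^{K}E[\ln\mathcal{N}(\mu_k|m_0,(\beta_0\Lambda_k)^{-1})] + \sum_{k=1}^{K}E[\ln\mathcal{W}(\Lambda_k|W_0,v_0)] \\
&= \sum_{k=1}^{K}E\Big\{-\frac{D}{2}\ln 2\pi + \frac12\ln|\beta_0\Lambda_k| - \frac12(\mu_k-m_0)^T(\beta_0\Lambda_k)(\mu_k-m_0)\Big\} \\
&\quad + \sum_{k=1}^{K}E\Big\{\ln B(W_0,v_0) + \frac{v_0-D-1}{2}\ln|\Lambda_k| - \frac12\mathrm{Tr}[W_0^{-1}\Lambda_k]\Big\} \\
&= \sum_{k=1}^{K}E\Big\{-\frac{D}{2}\ln 2\pi + \frac{D}{2}\ln\beta_0 + \frac12\ln|\Lambda_k| - \frac12(\mu_k-m_0)^T(\beta_0\Lambda_k)(\mu_k-m_0)\Big\} \\
&\quad + \sum_{k=1}^{K}E\Big\{\ln B(W_0,v_0) + \frac{v_0-D-1}{2}\ln|\Lambda_k| - \frac12\mathrm{Tr}[W_0^{-1}\Lambda_k]\Big\} \\
&= \frac{K\cdot D}{2}\ln\frac{\beta_0}{2\pi} + \frac12\sum_{k=1}^{K}\ln\tilde{\Lambda}_k - \frac12\sum_{k=1}^{K}E\Big\{(\mu_k-m_0)^T(\beta_0\Lambda_k)(\mu_k-m_0)\Big\} \\
&\quad + K\cdot\ln B(W_0,v_0) + \frac{v_0-D-1}{2}\sum_{k=1}^{K}\ln\tilde{\Lambda}_k - \frac12\sum_{k=1}^{K}E\Big\{\mathrm{Tr}[W_0^{-1}\Lambda_k]\Big\}
\end{aligned}
\]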
So now we need to calculate these two expectations. Using (B.80), we can obtain:
\[
\sum_{k=1}^{K}E\Big\{\mathrm{Tr}[W_0^{-1}\Lambda_k]\Big\} = \sum_{k=1}^{K}\mathrm{Tr}\Big\{W_0^{-1}\cdot E[\Lambda_k]\Big\} = \sum_{k=1}^{K}v_k\cdot\mathrm{Tr}\Big\{W_0^{-1}W_k\Big\}
\]
To calculate the other expectation, first we write down two properties of the Gaussian distribution, i.e.,
\[
E[\mu_k] = m_k, \qquad E[\mu_k\mu_k^T] = m_km_k^T + \beta_k^{-1}\Lambda_k^{-1}
\]
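\[
\begin{aligned}
\sum_{k=1}^{K}E\Big\{(\mu_k-m_0)^T(\beta_0\Lambda_k)(\mu_k-m_0)\Big\} &= \beta_0\sum_{k=1}^{K}E\Big\{\mathrm{Tr}[\Lambda_k\cdot(\mu_k-m_0)(\mu_k-m_0)^T]\Big\} \\
&= \beta_0\sum_{k=1}^{K}E_{\mu_k,\Lambda_k}\Big\{\mathrm{Tr}\big[\Lambda_k\cdot(\mu_k\mu_k^T - 2\mu_km_0^T + m_0m_0^T)\big]\Big\} \\
&= \beta_0\sum_{k=1}^{K}E_{\Lambda_k}\Big\{\mathrm{Tr}\big[\Lambda_k\cdot(m_km_k^T + \beta_k^{-1}\Lambda_k^{-1} - 2m_km_0^T + m_0m_0^T)\big]\Big\} \\
&= \beta_0\sum_{k=1}^{K}E_{\Lambda_k}\Big\{\mathrm{Tr}\big[\beta_k^{-1}I + \Lambda_k\cdot(m_km_k^T - 2m_km_0^T + m_0m_0^T)\big]\Big\} \\
&= \beta_0\sum_{k=1}^{K}E_{\Lambda_k}\Big\{D\cdot\beta_k^{-1} + \mathrm{Tr}\big[\Lambda_k\cdot(m_k-m_0)(m_k-m_0)^T\big]\Big\} \\
&= \sum_{k=1}^{K}\frac{D\beta_0}{\beta_k} + \beta_0\sum_{k=1}^{K}E_{\Lambda_k}\Big\{(m_k-m_0)^T\Lambda_k(m_k-m_0)\Big\} \\
&= \sum_{k=1}^{K}\frac{D\beta_0}{\beta_k} + \beta_0\sum_{k=1}^{K}(m_k-m_0)^T\cdot E_{\Lambda_k}[\Lambda_k]\cdot(m_k-m_0) \\
&= \sum_{k=1}^{K}\frac{D\beta_0}{\beta_k} + \beta_0\sum_{k=1}^{K}(m_k-m_0)^T\cdot(v_kW_k)\cdot(m_k-m_0)
\end{aligned}
\]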
Substituting these two expectations back, we obtain Eq (10.74) just as required. According to Eq (10.48), we have:
\[
E[\ln q(Z)] = \sum_{n=1}^{N}\sum_{k=1}^{K}E[z_{nk}]\cdot\ln r_{nk} = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\cdot\ln r_{nk}
\]
\[
\begin{aligned}
E[\ln q(\pi)] &= \ln C(\alpha) + \sum_{k=1}^{K}(\alpha_k-1)E[\ln\pi_k] \\
&= \ln C(\alpha) + \sum_{k=1}^{K}(\alpha_k-1)\ln\tilde{\pi}_k
\end{aligned}
\]
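To derive Eq (10.77), we follow the same procedure as that for Eq (10.74):
\[
\begin{aligned}
E[\ln q(\mu,\Lambda)] &= \sum_{k=1}^{K}E[\ln\mathcal{N}(\mu_k|m_k,(\beta_k\Lambda_k)^{-1})] + \sum_{k=1}^{K}E[\ln\mathcal{W}(\Lambda_k|W_k,v_k)] \\
&= \sum_{k=1}^{K}E\Big\{-\frac{D}{2}\ln 2\pi + \frac{D}{2}\ln\beta_k + \frac12\ln|\Lambda_k| - \frac12(\mu_k-m_k)^T(\beta_k\Lambda_k)(\mu_k-m_k)\Big\} \\
&\quad + \sum_{k=1}^{K}E\Big\{\ln B(W_k,v_k) + \frac{v_k-D-1}{2}\ln|\Lambda_k| - \frac12\mathrm{Tr}[W_k^{-1}\Lambda_k]\Big\} \\
&= \sum_{k=1}^{K}\Big\{\frac{D}{2}\ln\frac{\beta_k}{2\pi} + \frac12\ln\tilde{\Lambda}_k - \frac12 E\big[(\mu_k-m_k)^T(\beta_k\Lambda_k)(\mu_k-m_k)\big]\Big\} \\
&\quad + \sum_{k=1}^{K}\Big\{\ln B(W_k,v_k) + \frac{v_k-D-1}{2}\ln\tilde{\Lambda}_k - \frac12 E\big[\mathrm{Tr}(W_k^{-1}\Lambda_k)\big]\Big\} \\
&= \sum_{k=1}^{K}\Big\{\frac{D}{2}\ln\frac{\beta_k}{2\pi} + \frac12\ln\tilde{\Lambda}_k - \frac{D}{2}\Big\} \\
&\quad + \sum_{k=1}^{K}\Big\{\ln B(W_k,v_k) + \frac{v_k-D-1}{2}\ln\tilde{\Lambda}_k - \frac12 v_k\mathrm{Tr}[W_k^{-1}W_k]\Big\} \\
&= \sum_{k=1}^{K}\Big\{\frac{D}{2}\ln\frac{\beta_k}{2\pi} + \frac12\ln\tilde{\Lambda}_k - \frac{D}{2}\Big\} \\
&\quad + \sum_{k=1}^{K}\Big\{\ln B(W_k,v_k) + \frac{v_k-D-1}{2}\ln\tilde{\Lambda}_k - \frac{v_kD}{2}\Big\}
\end{aligned}
\]
It is identical to Eq (10.77).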
This problem is quite involved, so let’s explain it in detail. In section 10.2.1, we obtained the update formulas for all the coefficients using the general framework of variational inference. For more details you can see Prob.10.12 and Prob.10.13.
Moreover, in the previous problem, we have shown that $\mathcal{L}$ is given by Eq (10.70)-Eq (10.77) if we assume the form of $q$, i.e., Eq (10.42), Eq (10.48), Eq (10.55), Eq (10.57) and Eq (10.59). Note that here we do not know the specific values of those coefficients, e.g., Eq (10.60)-Eq (10.63). In this problem, we will show that by maximizing $\mathcal{L}$ with respect to those coefficients, we obtain the same formulas as in section 10.2.1.
To summarize, the coefficients that need to be estimated are $\{\beta_k, m_k, v_k, W_k, \alpha_k, r_{nk}\}$. We begin by considering $\beta_k$. Noting that only Eq (10.71), (10.74) and (10.77) contain $\beta_k$, we calculate the derivative of $\mathcal{L}$ with respect to $\beta_k$ and set it to zero:

\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\beta_k} &= \Big(\frac12N_kD\beta_k^{-2}\Big) + \Big(\frac12D\beta_0\beta_k^{-2}\Big) - \Big(\frac{D}{2}\cdot\frac{1}{2\pi}\cdot\frac{2\pi}{\beta_k}\Big) \\
&= \frac12\beta_k^{-2}\cdot(N_kD + D\beta_0 - D\beta_k) = 0
\end{aligned}
\]
The three brackets in the first line correspond to the derivatives of Eq (10.71), (10.74) and (10.77). Rearranging it, we obtain Eq (10.60). Next we consider $m_k$, which only occurs in the quadratic terms in Eq (10.71) and (10.74).
\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial m_k} &= \Big[\frac12N_kv_k\cdot 2W_k(\bar{x}_k-m_k)\Big] + \Big[-\frac12\beta_0v_k\cdot 2W_k(m_k-m_0)\Big] \\
&= v_kW_k\big[N_k\cdot(\bar{x}_k-m_k) - \beta_0(m_k-m_0)\big] = 0
\end{aligned}
\]
Similarly, the two brackets in the first line correspond to the derivatives of Eq (10.71) and (10.74). Rearranging it, we obtain Eq (10.61). Next, notice that $v_k$ and $W_k$ are always coupled in $\mathcal{L}$, e.g., $v_k$ occurs ahead of the quadratic terms in Eq (10.71), so we deal with $v_k$ and $W_k$ simultaneously. Let’s first make this clearer by writing down the terms in $\mathcal{L}$ that depend on $v_k$ and $W_k$:
\[
\begin{aligned}
(10.77) &\propto \sum_{k=1}^{K}\Big\{\frac12\ln\tilde{\Lambda}_k - H[q(\Lambda_k)]\Big\} \qquad (\ast) \\
(10.71) &\propto \frac12\sum_{k=1}^{K}N_k\Big\{\ln\tilde{\Lambda}_k - v_k\cdot\mathrm{Tr}[(S_k+A_k)W_k]\Big\} \\
(10.74) &\propto \frac12\sum_{k=1}^{K}\Big\{\ln\tilde{\Lambda}_k - \beta_0v_k\cdot\mathrm{Tr}[B_kW_k]\Big\} + \frac{v_0-D-1}{2}\sum_{k=1}^{K}\ln\tilde{\Lambda}_k - \frac12\sum_{k=1}^{K}v_k\mathrm{Tr}[W_0^{-1}W_k] \\
&= \frac{v_0-D}{2}\sum_{k=1}^{K}\ln\tilde{\Lambda}_k - \frac12\sum_{k=1}^{K}v_k\mathrm{Tr}[(\beta_0B_k + W_0^{-1})W_k]
\end{aligned}
\]
where $\ln\tilde{\Lambda}_k$ is given by Eq (10.65), and $A_k$ and $B_k$ are given by:
\[
A_k = (\bar{x}_k-m_k)(\bar{x}_k-m_k)^T, \qquad B_k = (m_k-m_0)(m_k-m_0)^T
\]
Moreover, $H[q(\Lambda_k)]$ is given by (B.82):

\[
H[q(\Lambda_k)] = -\ln B(W_k,v_k) - \frac{v_k-D-1}{2}\ln\tilde{\Lambda}_k + \frac{v_kD}{2}
\]
where $\ln B(W_k,v_k)$ can be calculated based on (B.79). Here we only keep the terms dependent on $v_k$ and $W_k$:
\[
\ln B(W_k,v_k) \propto -\frac{v_k}{2}\ln|W_k| - \frac{v_kD}{2}\ln 2 - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{v_k+1-i}{2}\Big)
\]
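To further simplify the derivative, we now write down the terms in $\mathcal{L}$ that depend on $v_k$ and $W_k$ for a given index $k$:
\[
\begin{aligned}
\mathcal{L} &\propto -\Big\{\frac12\ln\tilde{\Lambda}_k - H[q(\Lambda_k)]\Big\} + \frac12N_k\Big\{\ln\tilde{\Lambda}_k - v_k\cdot\mathrm{Tr}[(S_k+A_k)W_k]\Big\} \\
&\quad + \frac{v_0-D}{2}\ln\tilde{\Lambda}_k - \frac12v_k\mathrm{Tr}[(\beta_0B_k + W_0^{-1})W_k] \\
&= \frac12(-1+N_k+v_0-D)\ln\tilde{\Lambda}_k + H[q(\Lambda_k)] - \frac12v_k\cdot\mathrm{Tr}[(N_kS_k + N_kA_k + \beta_0B_k + W_0^{-1})W_k] \\
&= \frac12(-1+N_k+v_0-D)\ln\tilde{\Lambda}_k - \frac12v_k\cdot\mathrm{Tr}[(N_kS_k + N_kA_k + \beta_0B_k + W_0^{-1})W_k] \\
&\quad - \ln B(W_k,v_k) - \frac{v_k-D-1}{2}\ln\tilde{\Lambda}_k + \frac{v_kD}{2} \\
&= \frac12(N_k+v_0-v_k)\ln\tilde{\Lambda}_k - \frac12v_k\cdot\mathrm{Tr}[F_kW_k] + \frac{v_kD}{2} - \ln B(W_k,v_k)
\end{aligned}
\]
where we have defined:
\[
F_k = N_kS_k + N_kA_k + \beta_0B_k + W_0^{-1}
\]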
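Note that Eq (10.77) enters $\mathcal{L}$ with a minus sign, so the negative of $(\ast)$ has been used in the first line. We first calculate the derivative of $\mathcal{L}$ with respect to $v_k$ and set it to zero. Writing $\psi$ for the digamma function, i.e., the derivative of $\ln\Gamma$:
\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial v_k} &= \frac12(N_k+v_0-v_k)\frac{d\ln\tilde{\Lambda}_k}{dv_k} - \frac12\ln\tilde{\Lambda}_k - \frac12\mathrm{Tr}[F_kW_k] + \frac{D}{2} \\
&\quad + \frac{\ln|W_k|}{2} + \frac{D\ln 2}{2} + \frac12\sum_{i=1}^{D}\psi\Big(\frac{v_k+1-i}{2}\Big) \\
&= \frac12\Big[(N_k+v_0-v_k)\frac{d\ln\tilde{\Lambda}_k}{dv_k} - \mathrm{Tr}[F_kW_k] + D\Big] = 0
\end{aligned}
\]
where in the last step we have used the definition of $\ln\tilde{\Lambda}_k$, i.e., Eq (10.65). Then we calculate the derivative of $\mathcal{L}$ with respect to $W_k$ and set it to zero:
\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial W_k} &= \frac12(N_k+v_0-v_k)W_k^{-1} - \frac{v_k}{2}F_k + \frac{v_k}{2}W_k^{-1} \\
&= \frac12(N_k+v_0-v_k)W_k^{-1} - \frac{v_k}{2}(F_k - W_k^{-1}) = 0
\end{aligned}
\]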
Inspecting these two derivatives, we find that if the following two conditions:
\[
N_k + v_0 - v_k = 0, \quad\text{and}\quad F_k = W_k^{-1}
\]
are satisfied, the derivatives of $\mathcal{L}$ with respect to $v_k$ and $W_k$ will both be zero. Rearranging the first condition, we obtain Eq (10.63). Next we prove that the second condition is exactly Eq (10.62), by simplifying $F_k$.
\[
\begin{aligned}
F_k &= N_kS_k + N_kA_k + \beta_0B_k + W_0^{-1} \\
&= W_0^{-1} + N_kS_k + N_k\cdot(\bar{x}_k-m_k)(\bar{x}_k-m_k)^T + \beta_0\cdot(m_k-m_0)(m_k-m_0)^T
\end{aligned}
\]
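Comparing this with Eq (10.62), we only need to prove:

\[
N_k\cdot(\bar{x}_k-m_k)(\bar{x}_k-m_k)^T + \beta_0\cdot(m_k-m_0)(m_k-m_0)^T = \frac{\beta_0N_k}{\beta_0+N_k}(\bar{x}_k-m_0)(\bar{x}_k-m_0)^T
\]
Let’s start from the left hand side.
\[
\begin{aligned}
\text{(left)} &= N_k\bar{x}_k\bar{x}_k^T - 2N_k\bar{x}_km_k^T + N_km_km_k^T + \beta_0m_km_k^T - 2\beta_0m_km_0^T + \beta_0m_0m_0^T \\
&= N_k\bar{x}_k\bar{x}_k^T - 2N_k\bar{x}_k\Big(\frac{\beta_0m_0+N_k\bar{x}_k}{\beta_0+N_k}\Big)^T + (N_k+\beta_0)\Big(\frac{\beta_0m_0+N_k\bar{x}_k}{\beta_0+N_k}\Big)\Big(\frac{\beta_0m_0+N_k\bar{x}_k}{\beta_0+N_k}\Big)^T \\
&\quad - 2\beta_0\Big(\frac{\beta_0m_0+N_k\bar{x}_k}{\beta_0+N_k}\Big)m_0^T + \beta_0m_0m_0^T
\end{aligned}
\]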
Then we collect terms in $\bar{x}_k$ and $m_0$, and check that the coefficients match those on the right hand side. As an example, we calculate the coefficient of the quadratic term $\bar{x}_k\bar{x}_k^T$:
\[
\begin{aligned}
\text{(quad)} &= N_k - 2N_k\frac{N_k}{\beta_0+N_k} + (\beta_0+N_k)\Big(\frac{N_k}{\beta_0+N_k}\Big)^2 \\
&= \frac{N_k(\beta_0+N_k) - 2N_k^2 + N_k^2}{\beta_0+N_k} \\
&= \frac{\beta_0N_k}{\beta_0+N_k}
\end{aligned}
\]
The linear and constant terms are similar, and we omit them here to save space. The update formulas for $\alpha_k$ and $r_{nk}$ remain to be derived. Noticing that only Eq (10.72), (10.73) and (10.76) depend on $\alpha_k$, we now calculate the derivative of $\mathcal{L}$ with respect to $\alpha_k$:
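\[
\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\alpha_k} &= \sum_{n=1}^{N}r_{nk}\frac{d\ln\tilde{\pi}_k}{d\alpha_k} + (\alpha_0-1)\frac{d\ln\tilde{\pi}_k}{d\alpha_k} - \Big[(\alpha_k-1)\frac{d\ln\tilde{\pi}_k}{d\alpha_k} + \ln\tilde{\pi}_k + \frac{d\ln C(\alpha)}{d\alpha_k}\Big] \\
&= (N_k+\alpha_0-\alpha_k)\frac{d\ln\tilde{\pi}_k}{d\alpha_k} - \ln\tilde{\pi}_k - \frac{d\ln C(\alpha)}{d\alpha_k} \\
&= (N_k+\alpha_0-\alpha_k)\big[\phi'(\alpha_k) - \phi'(\hat{\alpha})\big] - \big[\phi(\alpha_k) - \phi(\hat{\alpha})\big] - \frac{d\big[\ln\Gamma(\hat{\alpha}) - \ln\Gamma(\alpha_k)\big]}{d\alpha_k} \\
&= (N_k+\alpha_0-\alpha_k)\big[\phi'(\alpha_k) - \phi'(\hat{\alpha})\big] - \big[\phi(\alpha_k) - \phi(\hat{\alpha})\big] - \big[\phi(\hat{\alpha}) - \phi(\alpha_k)\big] \\
&= (N_k+\alpha_0-\alpha_k)\big[\phi'(\alpha_k) - \phi'(\hat{\alpha})\big] = 0
\end{aligned}
\]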
where we have used (B.25) and Eq (10.66). Therefore, we obtain Eq (10.58).

Finally, we are required to derive an update formula for $r_{nk}$. Note that $\bar{x}_k$, $S_k$ and $N_k$ also contain $r_{nk}$, so Eq (10.71), (10.72) and (10.75) depend on $r_{nk}$. Using the definition of $N_k$, i.e., Eq (10.51), we can obtain:
\[
\begin{aligned}
\mathcal{L} &\propto \frac12\sum_{k,n}r_{nk}\Big\{\ln\tilde{\Lambda}_k - D\beta_k^{-1}\Big\} - \frac12\sum_{k}N_kv_k\mathrm{Tr}[(S_k+A_k)W_k] \\
&\quad + \sum_{k,n}r_{nk}\ln\tilde{\pi}_k - \sum_{k,n}r_{nk}\ln r_{nk}
\end{aligned}
\]
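Note that the constraint $\sum_k r_{nk} = 1$ holds for $r_{nk}$, so we cannot simply calculate the derivative and set it to zero; we must introduce a Lagrange multiplier. Before doing so, let’s simplify $S_k + A_k$:
\[
\begin{aligned}
S_k + A_k &= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}(x_n-\bar{x}_k)(x_n-\bar{x}_k)^T + (\bar{x}_k-m_k)(\bar{x}_k-m_k)^T \\
&= \frac{1}{N_k}\sum_{n=1}^{N}\big[r_{nk}x_nx_n^T - 2r_{nk}x_n\bar{x}_k^T + r_{nk}\bar{x}_k\bar{x}_k^T\big] + \bar{x}_k\bar{x}_k^T - 2\bar{x}_km_k^T + m_km_k^T \\
&= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}x_nx_n^T - \frac{\sum_{n=1}^{N}2r_{nk}x_n\bar{x}_k^T}{N_k} + \frac{\sum_{n=1}^{N}r_{nk}\bar{x}_k\bar{x}_k^T}{N_k} + \bar{x}_k\bar{x}_k^T - 2\bar{x}_km_k^T + m_km_k^T \\
&= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}x_nx_n^T - \frac{2N_k\bar{x}_k\bar{x}_k^T}{N_k} + \frac{N_k\bar{x}_k\bar{x}_k^T}{N_k} + \bar{x}_k\bar{x}_k^T - 2\bar{x}_km_k^T + m_km_k^T \\
&= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}x_nx_n^T - 2\bar{x}_km_k^T + m_km_k^T \\
&= \frac{1}{N_k}\Big(\sum_{n=1}^{N}r_{nk}x_nx_n^T - 2N_k\bar{x}_km_k^T + N_km_km_k^T\Big) \\
&= \frac{1}{N_k}\Big[\sum_{n=1}^{N}r_{nk}(x_nx_n^T - 2x_nm_k^T + m_km_k^T)\Big] \\
&= \frac{1}{N_k}\sum_{n=1}^{N}r_{nk}(x_n-m_k)(x_n-m_k)^T
\end{aligned}
\]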
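Substituting this into $\mathcal{L}$, we obtain:
\[
\begin{aligned}
\mathcal{L} &\propto \frac12\sum_{k,n}r_{nk}\Big\{\ln\tilde{\Lambda}_k - D\beta_k^{-1}\Big\} + \sum_{k,n}r_{nk}\ln\tilde{\pi}_k - \sum_{k,n}r_{nk}\ln r_{nk} \\
&\quad - \frac12\sum_{k}N_kv_k\mathrm{Tr}[(S_k+A_k)W_k] \\
&= \frac12\sum_{k,n}r_{nk}\Big\{\ln\tilde{\Lambda}_k - D\beta_k^{-1}\Big\} + \sum_{k,n}r_{nk}\ln\tilde{\pi}_k - \sum_{k,n}r_{nk}\ln r_{nk} \\
&\quad - \frac12\sum_{k=1}^{K}\sum_{n=1}^{N}v_kr_{nk}(x_n-m_k)^TW_k(x_n-m_k)
\end{aligned}
\]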
Introducing Lagrange Multiplier λn , we obtain:
1∑ { } ∑ ∑
(Lagrange) = e k − D β−1 + r nk ln π
k
2 k,n k,n k,n
1∑ K ∑ N ∑N ∑
− vk r nk (xn − mk )T Wk (xn − mk ) + λn (1 − r nk )
2 k=1 n=1 n=1 k
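Calculating the derivative with respect to $r_{nk}$ and setting it to zero, we obtain:
\[
\frac{\partial(\text{Lagrange})}{\partial r_{nk}} = \frac{1}{2}\left\{\ln\tilde{\Lambda}_k - D\beta_k^{-1}\right\} + \ln\tilde{\pi}_k - [\ln r_{nk} + 1]
- \frac{1}{2} v_k(\mathbf{x}_n-\mathbf{m}_k)^T\mathbf{W}_k(\mathbf{x}_n-\mathbf{m}_k) - \lambda_n = 0
\]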
Moving ln r nk to the right side and then exponentiating both sides, we
obtain Eq (10.67), and the normalized r nk is given by Eq (10.49), (10.46), and
(10.64)-(10.66).
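The update just derived can be sketched in code. The function below is an illustrative helper (its name and argument layout are assumptions, not from the text): it takes the expectations $\ln\tilde{\pi}_k$ and $\ln\tilde{\Lambda}_k$ as given inputs, forms $\ln\rho_{nk}$ up to a constant, and normalises over $k$, which is where the multiplier $\lambda_n$ and the constant terms are absorbed.

```python
import numpy as np

def responsibilities(X, ln_pi_tilde, ln_Lambda_tilde, beta, m, W, v):
    """Return r_nk (rows sum to 1) from the variational expectations.

    X: (N, D) data; ln_pi_tilde, ln_Lambda_tilde, beta, v: (K,);
    m: (K, D) means; W: (K, D, D) Wishart scale matrices.
    """
    N, D = X.shape
    diff = X[:, None, :] - m[None, :, :]                  # (N, K, D)
    # (x_n - m_k)^T W_k (x_n - m_k) for every n, k
    quad = np.einsum('nkd,kde,nke->nk', diff, W, diff)
    ln_rho = ln_pi_tilde + 0.5 * ln_Lambda_tilde - 0.5 * D / beta - 0.5 * v * quad
    ln_rho = ln_rho - ln_rho.max(axis=1, keepdims=True)   # stabilise the exponent
    rho = np.exp(ln_rho)
    return rho / rho.sum(axis=1, keepdims=True)

# Tiny usage example with made-up values
rng = np.random.default_rng(1)
K, D = 3, 2
r_demo = responsibilities(
    rng.normal(size=(5, D)), np.log(np.full(K, 1.0 / K)), np.zeros(K),
    np.ones(K), rng.normal(size=(K, D)), np.stack([np.eye(D)] * K), np.full(K, 3.0))
```

Subtracting the row maximum before exponentiating does not change the normalised result; it only avoids overflow or underflow.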
Let’s start from the definition, i.e., Eq (10.78).

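\[
\begin{aligned}
p(\hat{\mathbf{x}}|\mathbf{X})
&= \sum_{\hat{\mathbf{z}}}\iiint p(\hat{\mathbf{x}}|\hat{\mathbf{z}},\boldsymbol{\mu},\boldsymbol{\Lambda})\,p(\hat{\mathbf{z}}|\boldsymbol{\pi})\,p(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}|\mathbf{X})\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&= \sum_{\hat{\mathbf{z}}}\iiint \prod_{k=1}^{K}\mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})^{\hat{z}_k}\cdot\prod_{k=1}^{K}\pi_k^{\hat{z}_k}\cdot p(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}|\mathbf{X})\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&\approx \sum_{\hat{\mathbf{z}}}\iiint \prod_{k=1}^{K}\left[\mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\right]^{\hat{z}_k}\cdot q(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&= \sum_{k=1}^{K}\iiint \left[\mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\right]\cdot q(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&= \sum_{k=1}^{K}\iiint \left[\mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\right]\cdot q(\boldsymbol{\pi})\cdot\prod_{j=1}^{K} q(\boldsymbol{\mu}_j,\boldsymbol{\Lambda}_j)\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda}
\end{aligned}
\]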
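Where we have used the fact that $\hat{\mathbf{z}}$ uses a one-of-K coding scheme. Recalling that $\boldsymbol{\mu} = \{\boldsymbol{\mu}_k\}$ and $\boldsymbol{\Lambda} = \{\boldsymbol{\Lambda}_k\}$, the term inside the summation can be further simplified. Namely, for every index $j \neq k$, the integration with respect to $\boldsymbol{\mu}_j$ and $\boldsymbol{\Lambda}_j$ equals 1, i.e.,
\[
\begin{aligned}
p(\hat{\mathbf{x}}|\mathbf{X})
&= \sum_{k=1}^{K}\iiint \left[\mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\right]\cdot q(\boldsymbol{\pi})\cdot\prod_{j=1}^{K} q(\boldsymbol{\mu}_j,\boldsymbol{\Lambda}_j)\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}\,d\boldsymbol{\Lambda} \\
&= \sum_{k=1}^{K}\iiint \mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\cdot q(\boldsymbol{\pi})\cdot q(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k)\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}_k\,d\boldsymbol{\Lambda}_k \\
&= \sum_{k=1}^{K}\iiint \mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\pi_k\cdot \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})\cdot\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\pi}\,d\boldsymbol{\mu}_k\,d\boldsymbol{\Lambda}_k
\end{aligned}
\]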
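We notice that in the expression above, only $\pi_k\cdot\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})$ contains $\pi_k$, and we know that the expectation of $\pi_k$ with respect to $\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})$ is $\alpha_k/\hat{\alpha}$. Therefore:
\[
\begin{aligned}
p(\hat{\mathbf{x}}|\mathbf{X})
&= \sum_{k=1}^{K}\frac{\alpha_k}{\hat{\alpha}}\iint \mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\mu}_k\,d\boldsymbol{\Lambda}_k \\
&= \sum_{k=1}^{K}\left\{\int\left[\int \mathcal{N}(\hat{\mathbf{x}}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\cdot\mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,d\boldsymbol{\mu}_k\right]\cdot\frac{\alpha_k}{\hat{\alpha}}\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\Lambda}_k\right\} \\
&= \sum_{k=1}^{K}\left\{\int \mathcal{N}(\hat{\mathbf{x}}|\mathbf{m}_k,(1+\beta_k^{-1})\boldsymbol{\Lambda}_k^{-1})\cdot\frac{\alpha_k}{\hat{\alpha}}\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\Lambda}_k\right\} \\
&= \sum_{k=1}^{K}\frac{\alpha_k}{\hat{\alpha}}\int \mathcal{N}(\hat{\mathbf{x}}|\mathbf{m}_k,(1+\beta_k^{-1})\boldsymbol{\Lambda}_k^{-1})\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\Lambda}_k
\end{aligned}
\]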
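Notice that the Wishart distribution is a conjugate prior for the Gaussian distribution with known mean and unknown precision. We conclude that the product $\mathcal{N}(\hat{\mathbf{x}}|\mathbf{m}_k,(1+\beta_k^{-1})\boldsymbol{\Lambda}_k^{-1})\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)$ is again an unnormalized Wishart distribution, which can be verified by focusing on the dependency on $\boldsymbol{\Lambda}_k$:
\[
\begin{aligned}
(\text{product}) &\propto |\boldsymbol{\Lambda}_k|^{1/2+(v_k-D-1)/2}\cdot\exp\left\{-\frac{\mathrm{Tr}[\boldsymbol{\Lambda}_k\cdot(\hat{\mathbf{x}}-\mathbf{m}_k)(\hat{\mathbf{x}}-\mathbf{m}_k)^T]}{2(1+\beta_k^{-1})} - \frac{1}{2}\mathrm{Tr}[\boldsymbol{\Lambda}_k\mathbf{W}_k^{-1}]\right\} \\
&\propto \mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}',v')
\end{aligned}
\]
where
\[
v' = v_k + 1
\quad\text{and}\quad
[\mathbf{W}']^{-1} = \frac{(\hat{\mathbf{x}}-\mathbf{m}_k)(\hat{\mathbf{x}}-\mathbf{m}_k)^T}{1+\beta_k^{-1}} + \mathbf{W}_k^{-1}
\]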
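Using the normalization constant of the Wishart distribution, i.e., (B.79), the integral over $\boldsymbol{\Lambda}_k$ of the unnormalized Wishart is proportional to $1/B(\mathbf{W}',v')$, so we can obtain:
\[
\begin{aligned}
p(\hat{\mathbf{x}}|\mathbf{X}) &= \sum_{k=1}^{K}\frac{\alpha_k}{\hat{\alpha}}\int \mathcal{N}(\hat{\mathbf{x}}|\mathbf{m}_k,(1+\beta_k^{-1})\boldsymbol{\Lambda}_k^{-1})\cdot\mathcal{W}(\boldsymbol{\Lambda}_k|\mathbf{W}_k,v_k)\,d\boldsymbol{\Lambda}_k \\
&\propto \sum_{k=1}^{K}\frac{\alpha_k}{\hat{\alpha}}\cdot\frac{1}{B(\mathbf{W}',v')}
\end{aligned}
\]
and, for the $k$-th component,
\[
\begin{aligned}
\frac{1}{B(\mathbf{W}',v')} &\propto \left|\frac{(\hat{\mathbf{x}}-\mathbf{m}_k)(\hat{\mathbf{x}}-\mathbf{m}_k)^T}{1+\beta_k^{-1}} + \mathbf{W}_k^{-1}\right|^{-(v_k+1)/2} \\
&\propto \left|\frac{1}{1+\beta_k^{-1}}\mathbf{W}_k(\hat{\mathbf{x}}-\mathbf{m}_k)(\hat{\mathbf{x}}-\mathbf{m}_k)^T + \mathbf{I}\right|^{-(v_k+1)/2}
\end{aligned}
\]
Here we have only considered those terms dependent on $\hat{\mathbf{x}}$. Next, we use:
\[
|\mathbf{I} + \mathbf{a}\mathbf{b}^T| = 1 + \mathbf{a}^T\mathbf{b}
\]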
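The expression above can be further simplified, for the $k$-th component, to:
\[
\begin{aligned}
p(\hat{\mathbf{x}}|\mathbf{X}) &\propto \left|\frac{1}{1+\beta_k^{-1}}\mathbf{W}_k(\hat{\mathbf{x}}-\mathbf{m}_k)(\hat{\mathbf{x}}-\mathbf{m}_k)^T + \mathbf{I}\right|^{-(v_k+1)/2} \\
&= \left[1 + \frac{1}{1+\beta_k^{-1}}(\hat{\mathbf{x}}-\mathbf{m}_k)^T\mathbf{W}_k(\hat{\mathbf{x}}-\mathbf{m}_k)\right]^{-(v_k+1)/2}
\end{aligned}
\]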
By comparing it with (B.68), we notice that it is a Student’s t distribution,

whose parameters are defined by Eq (10.81)-(10.82).
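The determinant identity $|\mathbf{I} + \mathbf{a}\mathbf{b}^T| = 1 + \mathbf{a}^T\mathbf{b}$ used in the last step is easy to sanity-check numerically. A minimal sketch with arbitrary random vectors (the dimension and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=4)
b = rng.normal(size=4)

lhs = np.linalg.det(np.eye(4) + np.outer(a, b))   # |I + a b^T|
rhs = 1.0 + a @ b                                 # 1 + a^T b
print(lhs, rhs)
```

In the derivation above, $\mathbf{a} = \mathbf{W}_k(\hat{\mathbf{x}}-\mathbf{m}_k)/(1+\beta_k^{-1})$ and $\mathbf{b} = \hat{\mathbf{x}}-\mathbf{m}_k$.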
Let’s begin by dealing with $q^\star(\boldsymbol{\Lambda}_k)$. When $N \to +\infty$, $N_k$ also approaches $+\infty$ based on Eq (10.51). Therefore, $[\mathbf{W}_k]^{-1} \to N_k\mathbf{S}_k$ and $v_k \to N_k$. Using (B.80), we conclude that $\mathbb{E}[\boldsymbol{\Lambda}_k] = v_k\mathbf{W}_k \to \mathbf{S}_k^{-1}$. If we can now prove that the entropy $H[\boldsymbol{\Lambda}_k]$ is zero, we can conclude that the distribution collapses to a Dirac function, i.e., the distribution is sharply peaked around $\mathbf{S}_k^{-1}$, which is identical to the EM solution for the Gaussian mixture given by Eq (9.25). Therefore, let’s now start from $\ln B(\mathbf{W}_k,v_k)$, i.e., (B.79).
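\[
\begin{aligned}
\ln B(\mathbf{W}_k,v_k) &= -\frac{v_k}{2}\ln|\mathbf{W}_k| - \frac{v_kD}{2}\ln 2 - \frac{D(D-1)}{4}\ln\pi - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{v_k+1-i}{2}\Big) \\
&\to \frac{N_k}{2}\ln|N_k\mathbf{S}_k| - \frac{N_kD}{2}\ln 2 - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{N_k+1-i}{2}\Big) \\
&= \frac{N_k}{2}\big(D\ln N_k + \ln|\mathbf{S}_k| - D\ln 2\big) - \sum_{i=1}^{D}\ln\Gamma\Big(\frac{N_k-1-i}{2}+1\Big) \\
&\approx \frac{N_k}{2}\Big(D\ln\frac{N_k}{2} + \ln|\mathbf{S}_k|\Big) - \sum_{i=1}^{D}\Big[\frac{1}{2}\ln 2\pi - \frac{N_k-1-i}{2} + \Big(\frac{N_k-1-i}{2}+\frac{1}{2}\Big)\ln\frac{N_k-1-i}{2}\Big] \\
&\approx \frac{N_k}{2}\Big(D\ln\frac{N_k}{2} + \ln|\mathbf{S}_k|\Big) - \sum_{i=1}^{D}\Big[-\frac{N_k}{2} + \frac{N_k}{2}\ln\frac{N_k}{2}\Big] \\
&= \frac{N_k}{2}\Big(D\ln\frac{N_k}{2} + \ln|\mathbf{S}_k|\Big) + \frac{N_kD}{2} - \frac{N_kD}{2}\ln\frac{N_k}{2} \\
&= \frac{N_k}{2}\big(D + \ln|\mathbf{S}_k|\big)
\end{aligned}
\]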
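Where we have used Eq (1.146) to approximate the logarithm of the Gamma function and kept only the leading-order terms in $N_k$. Next we deal with $\mathbb{E}[\ln|\boldsymbol{\Lambda}_k|]$ based on (B.81):
\[
\begin{aligned}
\mathbb{E}[\ln|\boldsymbol{\Lambda}_k|] &= \sum_{i=1}^{D}\phi\Big(\frac{v_k+1-i}{2}\Big) + D\ln 2 + \ln|\mathbf{W}_k| \\
&\to \sum_{i=1}^{D}\ln\Big(\frac{N_k+1-i}{2}\Big) + D\ln 2 - \ln|N_k\mathbf{S}_k| \\
&\approx \sum_{i=1}^{D}\ln\frac{N_k}{2} + D\ln 2 - D\ln N_k - \ln|\mathbf{S}_k| \\
&= D\ln\frac{N_k}{2} + D\ln 2 - D\ln N_k - \ln|\mathbf{S}_k| \\
&= -\ln|\mathbf{S}_k|
\end{aligned}
\]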
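Where we have used Eq (10.241) to approximate $\phi\big(\frac{v_k+1-i}{2}\big)$. Now we are ready to deal with the entropy $H[q(\boldsymbol{\Lambda}_k)]$:
\[
\begin{aligned}
H[q(\boldsymbol{\Lambda}_k)] &= -\ln B(\mathbf{W}_k,v_k) - \frac{v_k-D-1}{2}\mathbb{E}[\ln|\boldsymbol{\Lambda}_k|] + \frac{v_kD}{2} \\
&\to -\frac{N_k}{2}\big(D + \ln|\mathbf{S}_k|\big) + \frac{N_k}{2}\ln|\mathbf{S}_k| + \frac{N_kD}{2} = 0
\end{aligned}
\]
Therefore, we can conclude that the distribution $q^\star(\boldsymbol{\Lambda}_k)$ collapses to a Dirac function at $\mathbf{S}_k^{-1}$. In other words, when $N \to +\infty$, $\boldsymbol{\Lambda}_k$ can only take one value, $\mathbf{S}_k^{-1}$.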
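Next, we deal with $q^\star(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k)$. According to Eq (10.60), when $N \to +\infty$, we conclude that $\beta_k \to N_k$, and thus $\mathbf{m}_k \to \bar{\mathbf{x}}_k$ based on Eq (10.61). Since we know $q^\star(\boldsymbol{\mu}_k|\boldsymbol{\Lambda}_k) = \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})$ and the precision $\beta_k\boldsymbol{\Lambda}_k \to N_k\mathbf{S}_k^{-1}$ is large, we conclude that when $N \to \infty$, $\boldsymbol{\mu}_k$ also takes only one value, $\bar{\mathbf{x}}_k$, which is identical to the EM solution for the Gaussian mixture, i.e., Eq (9.24).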
Finally, we consider $q^\star(\boldsymbol{\pi})$ given by Eq (10.54). Since we know $\alpha_k \to N_k$ based on Eq (10.58), we see that $\mathbb{E}[\pi_k] = \alpha_k/\hat{\alpha} \to N_k/N$ and
\[
\mathrm{var}[\pi_k] = \frac{\alpha_k(\hat{\alpha}-\alpha_k)}{\hat{\alpha}^2(\hat{\alpha}+1)} \le \frac{\hat{\alpha}\cdot\hat{\alpha}}{\hat{\alpha}^3} = \frac{1}{\hat{\alpha}} \to 0
\]
We can therefore conclude that $\pi_k$ takes only one value, $N_k/N$, which is identical to the EM solution for the Gaussian mixture, i.e., Eq (9.26). Now it is trivial to see that the predictive distribution will reduce to a mixture of Gaussians using Eq (10.80). Because $\boldsymbol{\pi}$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Lambda}_k$ all reduce to Dirac functions, the integration is easy to perform.
This can be verified directly. The total number of labelings equals the number of ways to assign K labels to K objects. For the first label we have K choices, for the second label K − 1 choices, and so on. Therefore, the total number is given by K!.

Let’s explain this problem in detail. Suppose that we have a mixture of Gaussians $p(\mathbf{Z}|\mathbf{X})$, which we are required to approximate. It has K components, and the modes are denoted $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, ..., \boldsymbol{\mu}_K\}$. We use variational inference, i.e., Eq (10.3), to minimize the KL divergence $\mathrm{KL}(q\|p)$, and obtain an approximate distribution $q_s(\mathbf{Z})$ and a corresponding lower bound $L(q_s)$.
According to the problem description, this approximate distribution $q_s(\mathbf{Z})$ will be a single-mode Gaussian located at one of the modes of $p(\mathbf{Z}|\mathbf{X})$, i.e., $q_s(\mathbf{Z}) = \mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_s,\boldsymbol{\Sigma}_s)$, where $s \in \{1, 2, ..., K\}$. Now, we replicate this $q_s$ K! times in total. Each of the copies is moved to the center of one mode.
Now we can write down the mixture distribution made up of K! Gaussian distributions:
\[
q_m(\mathbf{Z}) = \frac{1}{K!}\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)
\]
Where $C(m)$ represents the mode of the $m$-th component, $C(m) \in \{1, 2, ..., K\}$.
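What the problem wants us to prove is:
\[
L(q_s) + \ln K! \approx L(q_m)
\]
In other words, the lower bound obtained with $q_m$, i.e., $L(q_m)$, is $\ln K!$ larger than the one obtained with $q_s$, i.e., $L(q_s)$. Based on Eq (10.3), let’s equivalently deal with the KL divergence. According to Eq (10.4), we can obtain:
\[
\begin{aligned}
\mathrm{KL}(q_m\|p) &= -\int q_m(\mathbf{Z})\ln\frac{p(\mathbf{Z}|\mathbf{X})}{q_m(\mathbf{Z})}\,d\mathbf{Z} \\
&= -\int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_m(\mathbf{Z})\ln q_m(\mathbf{Z})\,d\mathbf{Z} \\
&= -\int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_m(\mathbf{Z})\ln\Big\{\frac{1}{K!}\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \\
&= -\int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_m(\mathbf{Z})\ln\Big\{\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} + \ln\frac{1}{K!} \\
&= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} \\
&\quad + \int \frac{1}{K!}\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\sum_{m'=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m')},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z}
\end{aligned}
\]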
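In order to further simplify the KL divergence, we write down two useful equations. First, we use the "negligible overlap" property. To be more specific, under the assumption that the overlap between components is negligible, we obtain for each $m$:
\[
\int \mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\sum_{m'=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m')},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \approx \int \mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\big\}\,d\mathbf{Z}
\]
The second equation is that for any $m_1, m_2 \in \{1, 2, ..., K!\}$, we have:
\[
\begin{aligned}
\int q_s\ln q_s\,d\mathbf{Z} &= \int \mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m_1)},\boldsymbol{\Sigma}_s)\ln\big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m_1)},\boldsymbol{\Sigma}_s)\big\}\,d\mathbf{Z} \\
&= \int \mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m_2)},\boldsymbol{\Sigma}_s)\ln\big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m_2)},\boldsymbol{\Sigma}_s)\big\}\,d\mathbf{Z}
\end{aligned}
\]
because the negative entropy of a Gaussian depends only on its covariance, not on its mean.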
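Therefore, now we can obtain:
\[
\begin{aligned}
\mathrm{KL}(q_m\|p) &= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int \frac{1}{K!}\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\Big\{\sum_{m'=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m')},\boldsymbol{\Sigma}_s)\Big\}\,d\mathbf{Z} \\
&\approx -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \frac{1}{K!}\sum_{m=1}^{K!}\int \mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\big\}\,d\mathbf{Z} \\
&= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int \mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\ln\big\{\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\big\}\,d\mathbf{Z} \quad (\forall m) \\
&= -\ln K! - \int q_m(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_s(\mathbf{Z})\ln q_s(\mathbf{Z})\,d\mathbf{Z} \\
&= -\ln K! - \int \Big\{\frac{1}{K!}\sum_{m=1}^{K!}\mathcal{N}(\mathbf{Z}|\boldsymbol{\mu}_{C(m)},\boldsymbol{\Sigma}_s)\Big\}\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_s(\mathbf{Z})\ln q_s(\mathbf{Z})\,d\mathbf{Z} \\
&\approx -\ln K! - \int q_s(\mathbf{Z})\ln p(\mathbf{Z}|\mathbf{X})\,d\mathbf{Z} + \int q_s(\mathbf{Z})\ln q_s(\mathbf{Z})\,d\mathbf{Z} \\
&= -\ln K! - \int q_s(\mathbf{Z})\ln\frac{p(\mathbf{Z}|\mathbf{X})}{q_s(\mathbf{Z})}\,d\mathbf{Z} = -\ln K! + \mathrm{KL}(q_s\|p)
\end{aligned}
\]
To obtain the desired result, we have adopted an approximation here; you should notice, however, that this approximation is rough.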
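The key step, that an equal-weight mixture of non-overlapping copies has a negative entropy lower by the log of the number of copies, can be illustrated in one dimension. This is a minimal sketch under assumed values: `M` plays the role of K!, the components are unit-variance Gaussians placed 40 standard deviations apart, and the integrals are approximated by a Riemann sum.

```python
import numpy as np

def gauss(z, mu, s=1.0):
    """1-D Gaussian density with mean mu and standard deviation s."""
    return np.exp(-0.5 * ((z - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

M = 6                                    # plays the role of K! (illustrative)
centres = 40.0 * np.arange(M)            # widely separated modes
z = np.linspace(-20.0, centres[-1] + 20.0, 400001)
dz = z[1] - z[0]

q_m = sum(gauss(z, c) for c in centres) / M   # mixture of M shifted copies
q_s = gauss(z, centres[0])                    # a single copy

def neg_entropy(q):
    """Riemann-sum estimate of the integral of q ln q (with 0 ln 0 := 0)."""
    safe = np.where(q > 0, q, 1.0)
    return np.sum(q * np.log(safe)) * dz

lhs = neg_entropy(q_m)
rhs = neg_entropy(q_s) - np.log(M)
print(lhs, rhs)
```

With this separation the two numbers agree to many digits; moving the centres closer together makes the overlap, and hence the discrepancy, visible.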
Let’s go back to Eq (10.70). If we now treat $\pi_k$ as a parameter without a prior distribution, $\pi_k$ will only occur in the second term in Eq (10.70), i.e., $\mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})]$. Therefore, we can obtain:
\[
L \propto \mathbb{E}[\ln p(\mathbf{Z}|\boldsymbol{\pi})] = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\ln\pi_k
\]
Where we have used Eq (10.72), and since $\pi_k$ is a point estimate, the expectation $\mathbb{E}[\ln\pi_k]$ reduces to $\ln\pi_k$. Now we introduce a Lagrange multiplier:
\[
\mathrm{Lag} = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\ln\pi_k + \lambda\cdot\Big(\sum_{k=1}^{K}\pi_k - 1\Big)
\]
Calculating the derivative of the expression above with respect to $\pi_k$ and setting it to zero, we obtain:
\[
\frac{\sum_{n=1}^{N} r_{nk}}{\pi_k} + \lambda = \frac{N_k}{\pi_k} + \lambda = 0 \qquad (*)
\]
Multiplying both sides by $\pi_k$ and then summing both sides over $k$, we obtain
\[
\sum_{k=1}^{K} N_k + \lambda\sum_{k=1}^{K}\pi_k = 0
\]
Since the sum of $N_k$ over $k$ equals $N$, and the sum of $\pi_k$ over $k$ equals 1, rearranging the equation above yields:
\[
\lambda = -N
\]
Substituting it back into $(*)$, we can obtain:
\[
\pi_k = \frac{N_k}{N} = \frac{1}{N}\sum_{n=1}^{N} r_{nk}
\]
Just as required.
Recall that the singularity in the maximum likelihood estimation of a Gaussian mixture arises when the determinant of the covariance matrix $\boldsymbol{\Sigma}_k$ approaches 0, so that the value of $\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$ approaches $+\infty$. For more details, you can read Section 9.2, especially page 434.
In this problem, the intuition is that since we have introduced a prior distribution for $\boldsymbol{\Lambda}_k$, this singularity won’t exist when adopting MAP. Let’s verify this statement, beginning by writing down the posterior.
p(Z|X, π, µ, Λ) ∝ p(X|Z, π, µ, Λ) · p(Z, π, µ, Λ)
= p(X|Z, π, µ, Λ) · p(Z|π, µ, Λ) · p(π|µ, Λ) · p(µ, Λ)
= p(X|Z, µ, Λ) · p(Z|π) · p(π) · p(µ, Λ)
Note that in the first step we have used Bayes’ theorem, that in the second
step we have used the fact that p(a, b) = p(a| b) · p( b), and that in the last step
we have omitted the extra dependence based on definition, i.e., Eq (10.37)-
(10.40). Now let’s calculate the MAP solution for Λk .
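\[
\begin{aligned}
\ln p(\mathbf{Z}|\mathbf{X},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda}) \propto{}& \frac{1}{2}\sum_{n=1}^{N} z_{nk}\left\{\ln|\boldsymbol{\Lambda}_k| - (\mathbf{x}_n-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_n-\boldsymbol{\mu}_k)\right\} \\
&+ \frac{1}{2}\left\{\ln|\boldsymbol{\Lambda}_k| - \beta_0(\boldsymbol{\mu}_k-\mathbf{m}_0)^T\boldsymbol{\Lambda}_k(\boldsymbol{\mu}_k-\mathbf{m}_0)\right\} \\
&+ \frac{1}{2}\left\{(v_0-D-1)\ln|\boldsymbol{\Lambda}_k| - \mathrm{Tr}[\mathbf{W}_0^{-1}\boldsymbol{\Lambda}_k]\right\} + \text{const} \\
={}& c\cdot\ln|\boldsymbol{\Lambda}_k| - \mathrm{Tr}[\mathbf{B}\boldsymbol{\Lambda}_k] + \text{const}
\end{aligned}
\]
Where const is the term independent of $\boldsymbol{\Lambda}_k$, and we have defined:
\[
c = \frac{1}{2}\Big(v_0 - D + \sum_{n=1}^{N} z_{nk}\Big)
\]
and
\[
\mathbf{B} = \frac{1}{2}\Big\{\sum_{n=1}^{N} z_{nk}(\mathbf{x}_n-\boldsymbol{\mu}_k)(\mathbf{x}_n-\boldsymbol{\mu}_k)^T + \beta_0(\boldsymbol{\mu}_k-\mathbf{m}_0)(\boldsymbol{\mu}_k-\mathbf{m}_0)^T + \mathbf{W}_0^{-1}\Big\}
\]
Next we calculate the derivative of $\ln p(\mathbf{Z}|\mathbf{X},\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Lambda})$ with respect to $\boldsymbol{\Lambda}_k$ and set it to 0, yielding:
\[
c\cdot\boldsymbol{\Lambda}_k^{-1} - \mathbf{B} = 0
\]
Therefore, we obtain:
\[
\boldsymbol{\Lambda}_k^{-1} = \frac{1}{c}\mathbf{B}
\]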
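Note that in the MAP framework, we need to solve for $z_{nk}$ first, and then substitute them into $c$ and $\mathbf{B}$ in the expression above. Nevertheless, from the expression above, we can see that $\boldsymbol{\Lambda}_k^{-1}$ won’t have zero determinant: $\mathbf{B}$ contains the positive definite term $\mathbf{W}_0^{-1}$, and the other two terms are positive semidefinite.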
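To make this concrete, the sketch below evaluates $\boldsymbol{\Lambda}_k^{-1} = \mathbf{B}/c$ in the degenerate situation that breaks maximum likelihood: a component that owns a single data point with $\boldsymbol{\mu}_k$ sitting exactly on it. The helper name and all numerical values are illustrative assumptions, and the sketch assumes $c > 0$.

```python
import numpy as np

def map_covariance(X, z, mu, m0, beta0, W0, v0):
    """MAP estimate of Lambda_k^{-1} = B / c for one component k.

    X: (N, D) data, z: (N,) hard assignments z_nk, mu: (D,) current mean.
    """
    D = X.shape[1]
    diff = X - mu
    B = 0.5 * ((z[:, None] * diff).T @ diff
               + beta0 * np.outer(mu - m0, mu - m0)
               + np.linalg.inv(W0))
    c = 0.5 * (v0 - D + z.sum())
    return B / c

# Degenerate case: the component owns only the first point and mu equals it,
# so the data scatter term vanishes.
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])
z = np.array([1.0, 0.0, 0.0])
Sigma = map_covariance(X, z, mu=X[0], m0=np.zeros(2), beta0=1.0,
                       W0=np.eye(2), v0=2.0)
det_Sigma = np.linalg.det(Sigma)
print(Sigma, det_Sigma)
```

Even though the data scatter term is exactly zero here, the prior contributions keep the estimated covariance non-singular.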
We qualitatively solve this problem. As the number of mixture compo-

nents grows, so does the number of variables that may be correlated, but they
are treated as independent under a variational approximation if Eq (10.5) has
been used. Therefore, the proportion of probability mass under the true dis-
tribution, p(Z, π, µ, Σ|X), that the variational approximation q(Z, π, µ, Σ) does
not capture, will grow. The consequence will be that the second term in (10.2),
the KL divergence between q(Z, π, µ, Σ) and p(Z, π, µ, Σ|X) will increase.
To answer the question whether we will underestimate or overestimate
the number of components by minimizing KL( q|| p) divergence under factor-
ization, we only need to see Fig.10.3. It is obvious that we will underestimate
the number of components.
In this problem, we also need to consider the prior p(β) = Gam(β| c 0 , d 0 ).

To be more specific, based on the original joint distribution p(t, w, α), i.e., Eq
(10.90), the joint distribution p(t, w, α, β) now should be written as:
p(t, w, α, β) = p(t|w, β) p(w|α) p(α) p(β)
Where the first term on the right hand side is given by Eq (10.87), the
second one is given by Eq (10.88), the third one is given by Eq (10.89), and
the last one is given by Gam(β| c 0 , d 0 ). Using the variational framework, we
assume a posterior variational distribution:
q(w, α, β) = q(w) q(α) q(β)

It is trivial to observe that introducing a Gamma prior for β doesn’t affect

q(α) because the expectation of p(β) can be absorbed into the ’const’ term in
Eq (10.92). In other words, we still obtain Eq (10.93)-Eq(10.95).
Now we deal with q(w). By analogy to Eq (10.96)-(10.98), we can obtain:
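\[
\begin{aligned}
\ln q^\star(\mathbf{w}) &\propto \mathbb{E}_\beta[\ln p(\mathbf{t}|\mathbf{w},\beta)] + \mathbb{E}_\alpha[\ln p(\mathbf{w}|\alpha)] + \text{const} \\
&\propto -\frac{\mathbb{E}_\beta[\beta]}{2}\cdot\sum_{n=1}^{N}(t_n-\mathbf{w}^T\boldsymbol{\phi}_n)^2 - \frac{\mathbb{E}_\alpha[\alpha]}{2}\mathbf{w}^T\mathbf{w} + \text{const} \\
&= -\frac{1}{2}\mathbf{w}^T\left\{\mathbb{E}_\beta[\beta]\cdot\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathbb{E}_\alpha[\alpha]\cdot\mathbf{I}\right\}\mathbf{w} + \mathbb{E}_\beta[\beta]\mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{t} + \text{const}
\end{aligned}
\]
Therefore, by analogy to Eq (10.99)-(10.101), we can conclude that $q^\star(\mathbf{w})$ is still Gaussian, i.e., $q^\star(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$, where we have defined:
\[
\mathbf{m}_N = \mathbb{E}_\beta[\beta]\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}
\quad\text{and}\quad
\mathbf{S}_N = \left\{\mathbb{E}_\beta[\beta]\cdot\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathbb{E}_\alpha[\alpha]\cdot\mathbf{I}\right\}^{-1}
\]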
Next, we deal with q(β). According to definition, we have:
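\[
\begin{aligned}
\ln q^\star(\beta) &\propto \mathbb{E}_{\mathbf{w}}[\ln p(\mathbf{t}|\mathbf{w},\beta)] + \ln p(\beta) + \text{const} \\
&\propto \frac{N}{2}\ln\beta - \frac{\beta}{2}\cdot\mathbb{E}\Big[\sum_{n=1}^{N}(t_n-\mathbf{w}^T\boldsymbol{\phi}_n)^2\Big] + (c_0-1)\ln\beta - d_0\beta \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \frac{\beta}{2}\cdot\mathbb{E}[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2] - d_0\beta \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\cdot\Big\{\frac{1}{2}\mathbb{E}[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2] + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\cdot\Big\{\frac{1}{2}\mathbb{E}[\mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} - 2\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{t}^T\mathbf{t}] + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\cdot\Big\{\frac{1}{2}\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}\mathbf{w}^T]] - \mathbf{t}^T\boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}] + \frac{1}{2}\mathbf{t}^T\mathbf{t} + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\cdot\Big\{\frac{1}{2}\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}(\mathbf{m}_N\mathbf{m}_N^T+\mathbf{S}_N)] - \mathbf{t}^T\boldsymbol{\Phi}\mathbf{m}_N + \frac{1}{2}\mathbf{t}^T\mathbf{t} + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\cdot\Big\{\frac{1}{2}\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{S}_N] + \frac{1}{2}\mathbf{m}_N^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{m}_N - \mathbf{t}^T\boldsymbol{\Phi}\mathbf{m}_N + \frac{1}{2}\mathbf{t}^T\mathbf{t} + d_0\Big\} \\
&= \Big(\frac{N}{2}+c_0-1\Big)\ln\beta - \beta\cdot\frac{1}{2}\Big\{\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{S}_N] + \|\boldsymbol{\Phi}\mathbf{m}_N-\mathbf{t}\|^2 + 2d_0\Big\}
\end{aligned}
\]
Therefore, we obtain $q^\star(\beta) = \mathrm{Gam}(\beta|c_N,d_N)$, where we have defined:
\[
c_N = \frac{N}{2} + c_0
\quad\text{and}\quad
d_N = d_0 + \frac{1}{2}\Big\{\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{S}_N] + \|\boldsymbol{\Phi}\mathbf{m}_N-\mathbf{t}\|^2\Big\}
\]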
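Furthermore, notice that from (B.27), the expectations in $\mathbf{m}_N$ and $\mathbf{S}_N$ can be expressed in terms of $a_N$, $b_N$, $c_N$ and $d_N$:
\[
\mathbb{E}[\alpha] = \frac{a_N}{b_N} \quad\text{and}\quad \mathbb{E}[\beta] = \frac{c_N}{d_N}
\]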
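The coupled updates for $q(\mathbf{w})$, $q(\alpha)$ and $q(\beta)$ can be iterated until they stop changing. The following is a minimal sketch, assuming the standard forms $a_N = a_0 + M/2$ and $b_N = b_0 + \frac{1}{2}\mathbb{E}[\mathbf{w}^T\mathbf{w}]$ from Eq (10.94)-(10.95); the function name, the hyperparameter defaults, the fixed iteration count and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def vb_linear_regression(Phi, t, a0=1e-2, b0=1e-2, c0=1e-2, d0=1e-2, n_iter=100):
    """Iterate the variational updates for q(w), q(alpha), q(beta)."""
    N, M = Phi.shape
    E_alpha, E_beta = 1.0, 1.0                 # arbitrary starting values
    for _ in range(n_iter):
        # q(w) = N(w | m_N, S_N)
        S_N = np.linalg.inv(E_beta * Phi.T @ Phi + E_alpha * np.eye(M))
        m_N = E_beta * S_N @ Phi.T @ t
        # q(alpha) = Gam(alpha | a_N, b_N), with E[w^T w] = m_N^T m_N + Tr(S_N)
        a_N = a0 + M / 2
        b_N = b0 + 0.5 * (m_N @ m_N + np.trace(S_N))
        # q(beta) = Gam(beta | c_N, d_N)
        c_N = c0 + N / 2
        d_N = d0 + 0.5 * (np.trace(Phi.T @ Phi @ S_N)
                          + np.sum((Phi @ m_N - t) ** 2))
        E_alpha, E_beta = a_N / b_N, c_N / d_N
    return m_N, S_N, E_alpha, E_beta

# Synthetic check: noise standard deviation 0.1, i.e. true precision 100
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 2.0])
t = Phi @ w_true + 0.1 * rng.normal(size=200)
m_N, S_N, E_alpha, E_beta = vb_linear_regression(Phi, t)
print(m_N, E_beta)
```

On data like this, the posterior mean should land close to `w_true` and $\mathbb{E}[\beta]$ close to the true noise precision.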
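We have already obtained all the update formulae. Next, we calculate the lower bound. Compared with Eq (10.107), the first term on the right hand side is modified, and two more terms are added on the right hand side, i.e., $+\mathbb{E}[\ln p(\beta)]$ and $-\mathbb{E}[\ln q^\star(\beta)]$. Let’s start by calculating the two added terms:
\[
\begin{aligned}
+\mathbb{E}[\ln p(\beta)] &= (c_0-1)\mathbb{E}[\ln\beta] - d_0\mathbb{E}[\beta] + c_0\ln d_0 - \ln\Gamma(c_0) \\
&= (c_0-1)\cdot(\varphi(c_N)-\ln d_N) - d_0\frac{c_N}{d_N} + c_0\ln d_0 - \ln\Gamma(c_0)
\end{aligned}
\]
where we have used (B.26) and (B.30). Similarly, we have:
\[
-\mathbb{E}[\ln q^\star(\beta)] = H[\beta] = \ln\Gamma(c_N) - (c_N-1)\cdot\varphi(c_N) - \ln d_N + c_N
\]
where we have used (B.31). Finally, we deal with the modification of the first term on the right hand side of Eq (10.107):
\[
\begin{aligned}
\mathbb{E}_{\beta,\mathbf{w}}[\ln p(\mathbf{t}|\mathbf{w},\beta)] &= \mathbb{E}_\beta\Big\{\frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \frac{\beta}{2}\mathbb{E}_{\mathbf{w}}[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2]\Big\} \\
&= \frac{N}{2}\mathbb{E}_\beta[\ln\beta] - \frac{N}{2}\ln 2\pi - \frac{\mathbb{E}_\beta[\beta]}{2}\mathbb{E}_{\mathbf{w}}[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2] \\
&= \frac{N}{2}\big(\varphi(c_N)-\ln d_N-\ln 2\pi\big) - \frac{c_N}{2d_N}\mathbb{E}_{\mathbf{w}}[\|\boldsymbol{\Phi}\mathbf{w}-\mathbf{t}\|^2] \\
&= \frac{N}{2}\big(\varphi(c_N)-\ln d_N-\ln 2\pi\big) - \frac{c_N}{2d_N}\Big\{\mathrm{Tr}[\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{S}_N] + \|\boldsymbol{\Phi}\mathbf{m}_N-\mathbf{t}\|^2\Big\}
\end{aligned}
\]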
The last question is the predictive distribution. It is not difficult to ob-
serve that the predictive distribution is still given by Eq (10.105) and Eq
(10.106), with 1/β replaced by 1/E[β].
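Let’s deal with the terms in Eq (10.107) one by one. Noticing Eq (10.87), we have:
\[
\begin{aligned}
\mathbb{E}[\ln p(\mathbf{t}|\mathbf{w})]_{\mathbf{w}} &= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2}\mathbb{E}\Big[\sum_{n=1}^{N}(t_n-\mathbf{w}^T\boldsymbol{\phi}_n)^2\Big] \\
&= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2}\mathbb{E}\Big[\sum_{n=1}^{N}t_n^2 - 2\sum_{n=1}^{N}t_n\cdot\mathbf{w}^T\boldsymbol{\phi}_n + \sum_{n=1}^{N}\mathbf{w}^T\boldsymbol{\phi}_n\cdot\boldsymbol{\phi}_n^T\mathbf{w}\Big] \\
&= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2}\mathbb{E}\Big[\mathbf{t}^T\mathbf{t} - 2\mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{t} + \mathbf{w}^T\cdot(\boldsymbol{\Phi}^T\boldsymbol{\Phi})\cdot\mathbf{w}\Big] \\
&= -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\beta - \frac{\beta}{2}\mathbf{t}^T\mathbf{t} + \beta\,\mathbb{E}[\mathbf{w}^T]\cdot\boldsymbol{\Phi}^T\mathbf{t} - \frac{\beta}{2}\mathrm{Tr}\Big[\mathbb{E}[\mathbf{w}\mathbf{w}^T]\cdot(\boldsymbol{\Phi}^T\boldsymbol{\Phi})\Big]
\end{aligned}
\]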
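Where we have defined $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, ..., \boldsymbol{\phi}_N]^T$, i.e., the $i$-th row of $\boldsymbol{\Phi}$ is $\boldsymbol{\phi}_i^T$. Then using Eq (10.99), (10.100) and (10.103), it is easy to obtain (10.108). Next, we deal with the second term by noticing Eq (10.88):
\[
\mathbb{E}[\ln p(\mathbf{w}|\alpha)]_{\mathbf{w},\alpha} = -\frac{M}{2}\ln(2\pi) + \frac{M}{2}\mathbb{E}[\ln\alpha]_\alpha - \frac{\mathbb{E}[\alpha]_\alpha}{2}\cdot\mathbb{E}[\mathbf{w}^T\mathbf{w}]_{\mathbf{w}}
\]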
Then using Eq (10.93)-(10.95), (B.27), (B.30) and Eq (10.103), we obtain
Eq (10.109) just as required. Then we deal with the third term in Eq (10.107)
by noticing Eq (10.89):
E[ln p(α)]α = a 0 ln b 0 + (a 0 − 1)E[ln α] − b 0 E[α] − ln Γ(a 0 )
Similarly, using Eq (10.93)-(10.95), (B.27), (B.30), we will obtain Eq (10.110).

Notice that there is a typo in Eq (10.110). The last term in Eq (10.110) should
be ln Γ(a 0 ) instead of ln Γ(a N ).
Finally we deal with the last two terms in Eq (10.107). We notice that these two terms are actually the entropies of a Gaussian and a Gamma distribution, so that using (B.31) and (B.41), we can obtain:
\[
-\mathbb{E}[\ln q(\alpha)]_\alpha = H[\alpha] = \ln\Gamma(a_N) - (a_N-1)\cdot\varphi(a_N) - \ln b_N + a_N
\]
and
\[
-\mathbb{E}[\ln q(\mathbf{w})]_{\mathbf{w}} = H[\mathbf{w}] = \frac{1}{2}\ln|\mathbf{S}_N| + \frac{M}{2}\big(1+\ln(2\pi)\big)
\]
The second derivative of $f(x)$ is given by:
\[
\frac{d^2}{dx^2}(\ln x) = \frac{d}{dx}\Big(\frac{1}{x}\Big) = -\frac{1}{x^2} < 0
\]
Therefore, $f(x) = \ln x$ is concave for $0 < x < \infty$. Based on the definition, i.e., Eq (10.133), we can obtain:
\[
g(\lambda) = \min_x\{\lambda x - \ln x\}
\]
We observe that:
\[
\frac{d}{dx}(\lambda x - \ln x) = \lambda - \frac{1}{x}
\]
In other words, when $\lambda \le 0$, $\lambda x - \ln x$ always decreases as $x$ increases. On the other hand, when $\lambda > 0$, $\lambda x - \ln x$ achieves its minimum at $x = 1/\lambda$. Therefore, we conclude that:
\[
g(\lambda) = \lambda\cdot\frac{1}{\lambda} - \ln\frac{1}{\lambda} = 1 + \ln\lambda
\]
Substituting $g(\lambda)$ back into Eq (10.132), we obtain:
\[
f(x) = \min_\lambda\{\lambda x - 1 - \ln\lambda\}
\]
We calculate the derivative:
\[
\frac{d}{d\lambda}(\lambda x - 1 - \ln\lambda) = x - \frac{1}{\lambda}
\]
Therefore, when $\lambda = 1/x$, $\lambda x - 1 - \ln\lambda$ achieves its minimum with respect to $\lambda$, which yields:
\[
f(x) = \frac{1}{x}\cdot x - 1 - \ln\frac{1}{x} = \ln x
\]
In other words, we have shown that Eq (10.132) indeed recovers $f(x) = \ln x$.
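The dual representation can also be checked numerically by minimising $\lambda x - 1 - \ln\lambda$ over a grid of $\lambda$ values. This is a rough sketch; the grid range and the test points are arbitrary assumptions, chosen so that the minimiser $\lambda = 1/x$ lies inside the grid.

```python
import numpy as np

lam = np.linspace(1e-3, 10.0, 200001)          # grid over lambda > 0

def f_via_bound(x):
    """min over the lambda grid of  lambda * x - 1 - ln(lambda)."""
    return np.min(lam * x - 1.0 - np.log(lam))

xs = np.array([0.2, 1.0, 3.0])
approx = np.array([f_via_bound(x) for x in xs])
exact = np.log(xs)
print(approx, exact)
```

Every grid value of $\lambda$ gives an upper bound on $\ln x$, so `approx` can only overshoot `exact`, and only by the small grid discretisation error.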
We begin by calculating the first derivative of $f(x) = \ln\sigma(x) = -\ln(1+e^{-x})$:
\[
\frac{df(x)}{dx} = -\frac{-e^{-x}}{1+e^{-x}} = \sigma(x)\cdot e^{-x}
\]
Then we can obtain the second derivative:
\[
\frac{d^2f(x)}{dx^2} = \frac{-e^{-x}(1+e^{-x}) - e^{-x}(-e^{-x})}{(1+e^{-x})^2} = -[\sigma(x)]^2\cdot e^{-x} < 0
\]
Therefore, the log logistic function $f(x)$ is concave. Utilizing this concavity, we can obtain:
\[
f(x) \le f(\xi) + f'(\xi)\cdot(x-\xi)
\]
which gives,
\[
\ln\sigma(x) \le \ln\sigma(\xi) + \sigma(\xi)\cdot e^{-\xi}\cdot(x-\xi) \qquad (*)
\]
Comparing the expression above with Eq (10.136), we define $\lambda = \sigma(\xi)\cdot e^{-\xi}$. Then we can obtain:
\[
\lambda = \sigma(\xi)\cdot e^{-\xi} = \frac{e^{-\xi}}{1+e^{-\xi}} = 1 - \frac{1}{1+e^{-\xi}} = 1 - \sigma(\xi)
\]
In other words, we have obtained $\sigma(\xi) = 1-\lambda$. In order to simplify $(*)$, we need to express $\xi$ using $\lambda$. According to the definition of $\lambda$, we can obtain:
\[
\xi = \ln\sigma(\xi) - \ln\lambda = \ln(1-\lambda) - \ln\lambda
\]
Now $(*)$ can be simplified as:
\[
\begin{aligned}
\ln\sigma(x) &\le \ln(1-\lambda) + \lambda\cdot(x-\xi) \\
&= \lambda\cdot x + \ln(1-\lambda) - \lambda\cdot\xi \\
&= \lambda\cdot x + \ln(1-\lambda) - \lambda\cdot\big[\ln(1-\lambda) - \ln\lambda\big] \\
&= \lambda\cdot x + (1-\lambda)\ln(1-\lambda) + \lambda\ln\lambda \\
&= \lambda\cdot x - g(\lambda)
\end{aligned}
\]
Just as required.
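As a quick numerical illustration, the sketch below evaluates the gap $\lambda x - g(\lambda) - \ln\sigma(x)$ with $g(\lambda) = -\lambda\ln\lambda - (1-\lambda)\ln(1-\lambda)$ on a grid, and checks that it closes at $\lambda = 1 - \sigma(x)$. The grid ranges are arbitrary assumptions.

```python
import numpy as np

def log_sigmoid(x):
    return -np.log1p(np.exp(-x))

def g(lam):
    """Binary entropy function appearing in the upper bound."""
    return -lam * np.log(lam) - (1.0 - lam) * np.log(1.0 - lam)

x = np.linspace(-5.0, 5.0, 101)
lam = np.linspace(0.01, 0.99, 99)

# gap[i, j] = lam_j * x_i - g(lam_j) - ln sigma(x_i); should be non-negative
gap = lam[None, :] * x[:, None] - g(lam)[None, :] - log_sigmoid(x)[:, None]

# The bound is tight at lambda = 1 - sigma(x)
lam_star = 1.0 - 1.0 / (1.0 + np.exp(-x))
tight = lam_star * x - g(lam_star) - log_sigmoid(x)
print(gap.min(), np.abs(tight).max())
```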
We start by calculating the derivative of $f(x) = -\ln(e^{x/2}+e^{-x/2})$ with respect to $x$:
\[
\frac{df(x)}{dx} = -\frac{\frac{1}{2}e^{x/2} - \frac{1}{2}e^{-x/2}}{e^{x/2}+e^{-x/2}} = -\frac{1}{2}\cdot\frac{e^{x/2}-e^{-x/2}}{e^{x/2}+e^{-x/2}}
\]
Then the second derivative is:
\[
\begin{aligned}
\frac{d^2f(x)}{dx^2} &= -\frac{1}{2}\cdot\frac{(\frac{1}{2}e^{x/2}+\frac{1}{2}e^{-x/2})(e^{x/2}+e^{-x/2}) - (e^{x/2}-e^{-x/2})(\frac{1}{2}e^{x/2}-\frac{1}{2}e^{-x/2})}{(e^{x/2}+e^{-x/2})^2} \\
&= -\frac{1}{4}\cdot\frac{4\cdot e^{x/2}\cdot e^{-x/2}}{(e^{x/2}+e^{-x/2})^2} = -\frac{1}{(e^{x/2}+e^{-x/2})^2} < 0
\end{aligned}
\]
Therefore, $f(x)$ is concave with respect to $x$. We denote $y = x^2$, and calculate:
\[
\frac{df}{dy} = \frac{d}{dy}\left\{-\ln(e^{\sqrt{y}/2}+e^{-\sqrt{y}/2})\right\} = -\frac{\frac{1}{4}y^{-1/2}e^{\sqrt{y}/2} - \frac{1}{4}y^{-1/2}e^{-\sqrt{y}/2}}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}}
\]
Then the second derivative can be calculated as:
\[
\begin{aligned}
\frac{d^2f}{dy^2} &= -\frac{1}{4}\frac{d}{dy}\left\{y^{-1/2}\cdot\frac{e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}}\right\} \\
&= -\frac{1}{4}\left\{-\frac{1}{2}y^{-3/2}\cdot\frac{e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}} + y^{-1/2}\cdot\frac{y^{-1/2}}{(e^{\sqrt{y}/2}+e^{-\sqrt{y}/2})^2}\right\} \\
&= -\frac{1}{4}\cdot\frac{y^{-1}}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}}\cdot\left\{-\frac{1}{2}y^{-1/2}\cdot(e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}) + \frac{1}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}}\right\}
\end{aligned}
\]
In order to show that the second derivative is no less than 0, we only need to prove:
\[
-\frac{1}{2}y^{-1/2}\cdot(e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}) + \frac{1}{e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}} \le 0
\]
Which is equivalent to:
\[
(e^{\sqrt{y}/2}+e^{-\sqrt{y}/2})(e^{\sqrt{y}/2}-e^{-\sqrt{y}/2}) \ge 2y^{1/2} \iff e^{\sqrt{y}} - e^{-\sqrt{y}} \ge 2\sqrt{y} \qquad (*)
\]
We construct a function $g(t) = e^t - e^{-t} - 2t$, where $t \ge 0$. Notice that:
\[
g'(t) = e^t + e^{-t} - 2 \ge 2\sqrt{e^t\cdot e^{-t}} - 2 = 0
\]
Therefore, we conclude that $g(t)$ is monotonically increasing on $t \ge 0$, and the minimum of $g(t)$ is achieved at $t = 0$. Since $g(0) = 0$, we obtain:
\[
e^t - e^{-t} \ge 2t, \quad \forall t \ge 0
\]
By now, we have proved $(*)$, and thus $f(x)$ is convex with respect to $x^2$. Utilizing the convexity of $f(x)$ with respect to $x^2$, we can obtain:
\[
-\ln(e^{\sqrt{y}/2}+e^{-\sqrt{y}/2}) \ge -\ln(e^{\sqrt{\epsilon}/2}+e^{-\sqrt{\epsilon}/2}) - \frac{1}{4}\epsilon^{-1/2}\cdot\frac{e^{\sqrt{\epsilon}/2}-e^{-\sqrt{\epsilon}/2}}{e^{\sqrt{\epsilon}/2}+e^{-\sqrt{\epsilon}/2}}\cdot(y-\epsilon)
\]
Substituting $y = x^2$ and $\epsilon = \xi^2$ back into the expression above, we obtain:
\[
\begin{aligned}
-\ln(e^{x/2}+e^{-x/2}) &\ge -\ln(e^{\xi/2}+e^{-\xi/2}) - \frac{1}{4}\xi^{-1}\cdot\frac{e^{\xi/2}-e^{-\xi/2}}{e^{\xi/2}+e^{-\xi/2}}\cdot(x^2-\xi^2) \\
&= -\ln(e^{\xi/2}+e^{-\xi/2}) - \lambda(\xi)\cdot(x^2-\xi^2)
\end{aligned}
\]
Where we have used the definition of $\lambda(\xi)$, i.e., Eq (10.141). Notice that the expression above is identical to Eq (10.143), from which we can easily obtain Eq (10.144).
Figure 5: f ( x) with respect to x and x2
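The resulting lower bound can be probed numerically. This is a small sketch, assuming $\lambda(\xi) = \tanh(\xi/2)/(4\xi)$, which is the same quantity as the fraction in the last derivation step; the grids over $x$ and $\xi$ are arbitrary.

```python
import numpy as np

def f(x):
    return -np.log(np.exp(x / 2.0) + np.exp(-x / 2.0))

def lam(xi):
    """lambda(xi) = tanh(xi / 2) / (4 xi), for xi != 0."""
    return np.tanh(xi / 2.0) / (4.0 * xi)

x = np.linspace(-6.0, 6.0, 241)
xi = np.linspace(0.1, 6.0, 60)

# lower[i, j] = f(xi_j) - lambda(xi_j) * (x_i^2 - xi_j^2)
lower = f(xi)[None, :] - lam(xi)[None, :] * (x[:, None] ** 2 - xi[None, :] ** 2)
gap = f(x)[:, None] - lower          # should be non-negative everywhere
print(gap.min())
```

The gap is zero exactly when $x = \pm\xi$, which is why optimising $\xi$ per data point tightens the bound.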
By carefully reading Section 10.6.1, we can obtain a Gaussian approximation to the posterior $p(\mathbf{w}|\mathbf{t})$, i.e., Eq (10.156)-(10.168), if the variational lower bound of the logistic sigmoid function, i.e., Eq (10.151), has been used. What the problem asks for is a sequential update for $\mathbf{m}_N$ and $\mathbf{S}_N^{-1}$.
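The intrinsic reason is that each data point $\{\boldsymbol{\phi}_n, t_n\}$ corresponds to one variational parameter $\xi_n$ and the likelihood is given by the product of $p(t_n|\mathbf{w})$. Let’s begin by deriving a sequential update formula for $\mathbf{S}_N^{-1}$:
\[
\begin{aligned}
\mathbf{S}_{N+1}^{-1} &= \mathbf{S}_0^{-1} + 2\sum_{n=1}^{N+1}\lambda(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T \\
&= 2\lambda(\xi_{N+1})\boldsymbol{\phi}_{N+1}\boldsymbol{\phi}_{N+1}^T + \mathbf{S}_0^{-1} + 2\sum_{n=1}^{N}\lambda(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T \\
&= 2\lambda(\xi_{N+1})\boldsymbol{\phi}_{N+1}\boldsymbol{\phi}_{N+1}^T + \mathbf{S}_N^{-1}
\end{aligned}
\]
Next we derive a sequential update formula for $\mathbf{m}_N$:
\[
\begin{aligned}
\mathbf{m}_{N+1} &= \mathbf{S}_{N+1}\Big[\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N+1}(t_n-1/2)\boldsymbol{\phi}_n\Big] \\
&= \mathbf{S}_{N+1}\Big[\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}(t_n-1/2)\boldsymbol{\phi}_n + (t_{N+1}-1/2)\boldsymbol{\phi}_{N+1}\Big] \\
&= \mathbf{S}_{N+1}\Big[\mathbf{S}_N^{-1}\mathbf{S}_N\Big[\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}(t_n-1/2)\boldsymbol{\phi}_n\Big] + (t_{N+1}-1/2)\boldsymbol{\phi}_{N+1}\Big] \\
&= \mathbf{S}_{N+1}\Big[\mathbf{S}_N^{-1}\mathbf{m}_N + (t_{N+1}-1/2)\boldsymbol{\phi}_{N+1}\Big]
\end{aligned}
\]
In conclusion, when a new data point $(\boldsymbol{\phi}_{N+1}, t_{N+1})$ arrives, we first update $\mathbf{S}_{N+1}^{-1}$, and then update $\mathbf{m}_{N+1}$ based on the formula we obtained above.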
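The two recursions can be compared against the batch formulas in a few lines. The sketch below assumes fixed, already chosen variational parameters $\xi_n$ (no re-optimisation), a unit-covariance zero-mean prior, and random features and labels; all of these are illustrative choices.

```python
import numpy as np

def lam(xi):
    return np.tanh(xi / 2.0) / (4.0 * xi)

rng = np.random.default_rng(0)
N, M = 20, 3
Phi = rng.normal(size=(N, M))                      # rows are phi_n^T
t = rng.integers(0, 2, size=N).astype(float)       # binary targets
xi = rng.uniform(0.5, 2.0, size=N)                 # fixed xi_n (assumed given)
S0_inv, m0 = np.eye(M), np.zeros(M)

# Batch solution, Eq (10.157)-(10.158)
SN_inv_batch = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi
mN_batch = np.linalg.solve(SN_inv_batch, S0_inv @ m0 + Phi.T @ (t - 0.5))

# Sequential solution: absorb one data point at a time
S_inv, m = S0_inv.copy(), m0.copy()
for phi_n, t_n, xi_n in zip(Phi, t, xi):
    S_inv_new = S_inv + 2.0 * lam(xi_n) * np.outer(phi_n, phi_n)
    m = np.linalg.solve(S_inv_new, S_inv @ m + (t_n - 0.5) * phi_n)
    S_inv = S_inv_new

print(np.abs(S_inv - SN_inv_batch).max(), np.abs(m - mN_batch).max())
```

Working with $\mathbf{S}^{-1}$ and solving a linear system at each step avoids forming $\mathbf{S}_{N+1}$ explicitly.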
To prove Eq (10.163), we only need to prove Eq (10.162), from which Eq

(10.163) can be easily derived according to the text below Eq (10.132). There-
fore, in what follows, we prove that the derivative of Q (ξ, ξold ) with respect to
ξn will give Eq (10.162). We start by noticing Eq (4.88), i.e.,
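\[
\frac{d\sigma(\xi)}{d\xi} = \sigma(\xi)\cdot(1-\sigma(\xi))
\]
Noticing Eq (10.150), now we can obtain:
\[
\begin{aligned}
\frac{dQ(\boldsymbol{\xi},\boldsymbol{\xi}^{old})}{d\xi_n} &= \frac{\sigma'(\xi_n)}{\sigma(\xi_n)} - \frac{1}{2} - \lambda'(\xi_n)\cdot(\boldsymbol{\phi}_n^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\boldsymbol{\phi}_n - \xi_n^2) - \lambda(\xi_n)\cdot(-2\xi_n) \\
&= 1 - \sigma(\xi_n) - \frac{1}{2} + 2\xi_n\cdot\lambda(\xi_n) - \lambda'(\xi_n)\cdot(\boldsymbol{\phi}_n^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\boldsymbol{\phi}_n - \xi_n^2) \\
&= \frac{1}{2} - \sigma(\xi_n) + \sigma(\xi_n) - \frac{1}{2} - \lambda'(\xi_n)\cdot(\boldsymbol{\phi}_n^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\boldsymbol{\phi}_n - \xi_n^2) \\
&= -\lambda'(\xi_n)\cdot(\boldsymbol{\phi}_n^T\mathbb{E}[\mathbf{w}\mathbf{w}^T]\boldsymbol{\phi}_n - \xi_n^2)
\end{aligned}
\]
where the third line uses $2\xi_n\lambda(\xi_n) = \sigma(\xi_n) - \frac{1}{2}$ from the definition of $\lambda(\xi)$. Setting the derivative equal to zero, we obtain Eq (10.162), from which Eq (10.163) follows.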
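First, we should clarify one thing: there are typos in Eq (10.164). It is not difficult to spot these errors if we notice that for $q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$, in its logarithm, i.e., $\ln q(\mathbf{w})$, the term $\frac{1}{2}\ln|\mathbf{S}_N|$ should always have the same sign as $\frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N$. This is our intuition. However, this is not the case in Eq (10.164). Based on Eq (10.159), Eq (10.153) and the Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0)$, we can analytically obtain the correct lower bound $L(\boldsymbol{\xi})$ (this will also be strictly proved in the next problem):
\[
\begin{aligned}
L(\boldsymbol{\xi}) &= \frac{1}{2}\ln|\mathbf{S}_N| - \frac{1}{2}\ln|\mathbf{S}_0| + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N - \frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}\Big\{\ln\sigma(\xi_n) - \frac{1}{2}\xi_n + \lambda(\xi_n)\xi_n^2\Big\} \\
&= \frac{1}{2}\ln|\mathbf{S}_N| + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + \sum_{n=1}^{N}\Big\{\ln\sigma(\xi_n) - \frac{1}{2}\xi_n + \lambda(\xi_n)\xi_n^2\Big\} + \text{const}
\end{aligned}
\]
Where const denotes the terms unrelated to $\xi_n$, because $\mathbf{m}_0$ and $\mathbf{S}_0$ don’t depend on $\xi_n$. Moreover, noticing that $\mathbf{S}_N^{-1}\cdot\mathbf{m}_N$ also doesn’t depend on $\xi_n$ according to Eq (10.157), it will be convenient to define a variable $\mathbf{z}_N = \mathbf{S}_N^{-1}\cdot\mathbf{m}_N$, and we can easily verify:
\[
\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N = [\mathbf{S}_N\mathbf{S}_N^{-1}\mathbf{m}_N]^T\mathbf{S}_N^{-1}[\mathbf{S}_N\mathbf{S}_N^{-1}\mathbf{m}_N] = [\mathbf{S}_N\mathbf{z}_N]^T\mathbf{S}_N^{-1}[\mathbf{S}_N\mathbf{z}_N] = \mathbf{z}_N^T\mathbf{S}_N\mathbf{z}_N
\]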
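Now, we can obtain:
\[
\begin{aligned}
\frac{\partial L(\boldsymbol{\xi})}{\partial\xi_n} &= \frac{d}{d\xi_n}\Big\{\frac{1}{2}\ln|\mathbf{S}_N| + \frac{1}{2}\mathbf{z}_N^T\mathbf{S}_N\mathbf{z}_N + \sum_{n=1}^{N}\Big\{\ln\sigma(\xi_n) - \frac{1}{2}\xi_n + \lambda(\xi_n)\xi_n^2\Big\}\Big\} \\
&= \frac{1}{2}\mathrm{Tr}\Big[\mathbf{S}_N^{-1}\frac{\partial\mathbf{S}_N}{\partial\xi_n}\Big] + \frac{1}{2}\mathrm{Tr}\Big[\mathbf{z}_N\mathbf{z}_N^T\cdot\frac{\partial\mathbf{S}_N}{\partial\xi_n}\Big] + \lambda'(\xi_n)\xi_n^2
\end{aligned}
\]
Where we have used Eq (3.117) for the first term, and for the second term we have used:
\[
\frac{d}{d\xi_n}\Big\{\frac{1}{2}\mathbf{z}_N^T\mathbf{S}_N\mathbf{z}_N\Big\} = \frac{1}{2}\frac{d}{d\xi_n}\Big\{\mathrm{Tr}\big[\mathbf{z}_N\mathbf{z}_N^T\cdot\mathbf{S}_N\big]\Big\} = \frac{1}{2}\mathrm{Tr}\Big[\mathbf{z}_N\mathbf{z}_N^T\cdot\frac{\partial\mathbf{S}_N}{\partial\xi_n}\Big]
\]
Furthermore, for the last term, we can follow the same procedure as in the previous problem, and our remaining task is to calculate $\partial\mathbf{S}_N/\partial\xi_n$. Based on Eq (10.158) and (C.21), we can obtain:
\[
\frac{\partial\mathbf{S}_N}{\partial\xi_n} = -\mathbf{S}_N\frac{\partial\mathbf{S}_N^{-1}}{\partial\xi_n}\mathbf{S}_N = -\mathbf{S}_N\cdot[2\lambda'(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T]\cdot\mathbf{S}_N
\]
Substituting it back into the derivative, we can obtain:
\[
\begin{aligned}
\frac{\partial L(\boldsymbol{\xi})}{\partial\xi_n} &= \frac{1}{2}\mathrm{Tr}\Big[(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\frac{\partial\mathbf{S}_N}{\partial\xi_n}\Big] + \lambda'(\xi_n)\xi_n^2 \\
&= -\frac{1}{2}\mathrm{Tr}\Big[(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\,\mathbf{S}_N\cdot[2\lambda'(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T]\cdot\mathbf{S}_N\Big] + \lambda'(\xi_n)\xi_n^2 \\
&= -\lambda'(\xi_n)\cdot\Big\{\mathrm{Tr}\Big[(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\cdot\mathbf{S}_N\cdot\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\cdot\mathbf{S}_N\Big] - \xi_n^2\Big\} = 0
\end{aligned}
\]
Therefore, we obtain:
\[
\begin{aligned}
\xi_n^2 &= \mathrm{Tr}\Big[(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\cdot\mathbf{S}_N\cdot\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\cdot\mathbf{S}_N\Big] \\
&= (\mathbf{S}_N\cdot\boldsymbol{\phi}_n)^T\cdot(\mathbf{S}_N^{-1}+\mathbf{z}_N\mathbf{z}_N^T)\cdot(\mathbf{S}_N\cdot\boldsymbol{\phi}_n) \\
&= \boldsymbol{\phi}_n^T\cdot(\mathbf{S}_N+\mathbf{S}_N\mathbf{z}_N\mathbf{z}_N^T\mathbf{S}_N)\cdot\boldsymbol{\phi}_n \\
&= \boldsymbol{\phi}_n^T\cdot(\mathbf{S}_N+\mathbf{m}_N\mathbf{m}_N^T)\cdot\boldsymbol{\phi}_n
\end{aligned}
\]
Where we have used the definition of $\mathbf{z}_N$, i.e., $\mathbf{z}_N = \mathbf{S}_N^{-1}\cdot\mathbf{m}_N$, and also repeatedly used the symmetry of $\mathbf{S}_N$.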
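There is a typo in Eq (10.164); for more details you can refer to the previous problem. Let’s calculate $L(\boldsymbol{\xi})$ based on Eq (10.159), Eq (10.153) and the Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0)$:
\[
\begin{aligned}
h(\mathbf{w},\boldsymbol{\xi})p(\mathbf{w}) ={}& \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0)\cdot\prod_{n=1}^{N}\sigma(\xi_n)\exp\Big\{\mathbf{w}^T\boldsymbol{\phi}_nt_n - (\mathbf{w}^T\boldsymbol{\phi}_n+\xi_n)/2 - \lambda(\xi_n)([\mathbf{w}^T\boldsymbol{\phi}_n]^2-\xi_n^2)\Big\} \\
={}& \Big\{(2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n)\Big\}\cdot\exp\Big\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\Big\} \\
&\cdot\prod_{n=1}^{N}\exp\Big\{\mathbf{w}^T\boldsymbol{\phi}_nt_n - (\mathbf{w}^T\boldsymbol{\phi}_n+\xi_n)/2 - \lambda(\xi_n)([\mathbf{w}^T\boldsymbol{\phi}_n]^2-\xi_n^2)\Big\} \\
={}& \Big\{(2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n)\cdot\exp\Big(-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2\Big)\Big\} \\
&\cdot\exp\Big\{-\frac{1}{2}\mathbf{w}^T\Big(\mathbf{S}_0^{-1}+2\sum_{n=1}^{N}\lambda(\xi_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T\Big)\mathbf{w} + \mathbf{w}^T\Big(\mathbf{S}_0^{-1}\mathbf{m}_0+\sum_{n=1}^{N}\boldsymbol{\phi}_n(t_n-\tfrac{1}{2})\Big)\Big\}
\end{aligned}
\]
Noticing Eq (10.157)-(10.158), we can obtain:
\[
\begin{aligned}
h(\mathbf{w},\boldsymbol{\xi})p(\mathbf{w}) ={}& \Big\{(2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n)\cdot\exp\Big(-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2\Big)\Big\} \\
&\cdot\exp\Big\{-\frac{1}{2}\mathbf{w}^T\mathbf{S}_N^{-1}\mathbf{w} + \mathbf{w}^T\mathbf{S}_N^{-1}\mathbf{m}_N\Big\} \\
={}& \Big\{(2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n) \\
&\quad\cdot\exp\Big(-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2 + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N\Big)\Big\} \\
&\cdot\exp\Big\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w}-\mathbf{m}_N)\Big\}
\end{aligned}
\]
Therefore, utilizing the normalization constant of the Gaussian distribution, now we can obtain:
\[
\begin{aligned}
\int h(\mathbf{w},\boldsymbol{\xi})p(\mathbf{w})\,d\mathbf{w} ={}& (2\pi)^{W/2}\cdot|\mathbf{S}_N|^{1/2}\cdot\Big\{(2\pi)^{-W/2}\cdot|\mathbf{S}_0|^{-1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n) \\
&\quad\cdot\exp\Big(-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2 + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N\Big)\Big\} \\
={}& \Big(\frac{|\mathbf{S}_N|}{|\mathbf{S}_0|}\Big)^{1/2}\cdot\prod_{n=1}^{N}\sigma(\xi_n) \\
&\cdot\exp\Big(-\frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 - \sum_{n=1}^{N}\frac{\xi_n}{2} + \sum_{n=1}^{N}\lambda(\xi_n)\xi_n^2 + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N\Big)
\end{aligned}
\]
Therefore, $L(\boldsymbol{\xi})$ can be written as:
\[
\begin{aligned}
L(\boldsymbol{\xi}) &= \ln\int h(\mathbf{w},\boldsymbol{\xi})p(\mathbf{w})\,d\mathbf{w} \\
&= \frac{1}{2}\ln\frac{|\mathbf{S}_N|}{|\mathbf{S}_0|} - \frac{1}{2}\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0 + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N + \sum_{n=1}^{N}\Big\{\ln\sigma(\xi_n) - \frac{1}{2}\xi_n + \lambda(\xi_n)\xi_n^2\Big\}
\end{aligned}
\]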
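Let’s clarify this problem. What this problem wants us to prove is the following. Suppose that at the beginning the joint distribution comprises a product of $j-1$ factors, i.e.,
\[
p_{j-1}(D,\boldsymbol{\theta}) = \prod_{i=1}^{j-1} f_i(\boldsymbol{\theta})
\]
and now the joint distribution comprises a product of $j$ factors:
\[
p_j(D,\boldsymbol{\theta}) = \prod_{i=1}^{j} f_i(\boldsymbol{\theta}) = p_{j-1}(D,\boldsymbol{\theta})\cdot f_j(\boldsymbol{\theta})
\]
Then we are asked to prove Eq (10.242). This situation corresponds to $j-1$ data points at the beginning, after which one more data point is obtained. For more details you can read the text below Eq (10.188). Based on the definition, we can write down:
\[
\begin{aligned}
p_j(D) &= \int p_j(D,\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= \int p_{j-1}(D,\boldsymbol{\theta})\cdot f_j(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= \int p_{j-1}(D)\cdot p_{j-1}(\boldsymbol{\theta}|D)\cdot f_j(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= p_{j-1}(D)\cdot\int p_{j-1}(\boldsymbol{\theta}|D)\cdot f_j(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&\approx p_{j-1}(D)\cdot\int q_{j-1}(\boldsymbol{\theta})\cdot f_j(\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= p_{j-1}(D)\cdot Z_j
\end{aligned}
\]
Where we have sequentially used Bayes’ theorem, the fact that $q_{j-1}(\boldsymbol{\theta})$ is an approximation to the posterior $p_{j-1}(\boldsymbol{\theta}|D)$, and Eq (10.197). To further prove Eq (10.243), we only need to recursively use the expression we have proved.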
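Let’s start from the definition. $q(\boldsymbol{\theta})$ will be initialized as
\[
q^{init}(\boldsymbol{\theta}) = \tilde{f}_0(\boldsymbol{\theta})\prod_{i\neq 0}\tilde{f}_i(\boldsymbol{\theta}) = f_0(\boldsymbol{\theta})\prod_{i\neq 0}\tilde{f}_i(\boldsymbol{\theta})
\]
Where we have used $\tilde{f}_0(\boldsymbol{\theta}) = f_0(\boldsymbol{\theta})$ according to the problem description. Then we can obtain:
\[
q^{\backslash 0}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_0(\boldsymbol{\theta})} = \prod_{i\neq 0}\tilde{f}_i(\boldsymbol{\theta})
\]
Next, we will obtain $q^{new}(\boldsymbol{\theta})$ by matching its moments against $q^{\backslash 0}(\boldsymbol{\theta})f_0(\boldsymbol{\theta})$, which exactly equals:
\[
q^{\backslash 0}(\boldsymbol{\theta})f_0(\boldsymbol{\theta}) = \prod_{i\neq 0}\tilde{f}_i(\boldsymbol{\theta})\cdot f_0(\boldsymbol{\theta}) = q^{init}(\boldsymbol{\theta})
\]
In other words, in order to obtain $q^{new}(\boldsymbol{\theta})$, we need to match its moments against $q^{init}(\boldsymbol{\theta})$, and since $q^{new}$ and $q^{init}$ both belong to the same exponential family, they will be identical if they have the same moments. Moreover, based on Eq (10.206), we have:
\[
Z_0 = \int q^{\backslash 0}(\boldsymbol{\theta})f_0(\boldsymbol{\theta})\,d\boldsymbol{\theta} = \int q^{init}(\boldsymbol{\theta})\,d\boldsymbol{\theta} = 1
\]
Therefore, based on Eq (10.207), we have:
\[
\tilde{f}_0(\boldsymbol{\theta}) = Z_0\frac{q^{new}(\boldsymbol{\theta})}{q^{\backslash 0}(\boldsymbol{\theta})} = 1\cdot\frac{q^{init}(\boldsymbol{\theta})}{q^{\backslash 0}(\boldsymbol{\theta})} = f_0(\boldsymbol{\theta})
\]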
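Based on Eq (10.205), (10.212) and (10.213), we can obtain:
\[
\begin{aligned}
q^{\backslash n}(\boldsymbol{\theta}) = \frac{q(\boldsymbol{\theta})}{\tilde{f}_n(\boldsymbol{\theta})} &= \frac{\mathcal{N}(\boldsymbol{\theta}|\mathbf{m},v\mathbf{I})}{s_n\mathcal{N}(\boldsymbol{\theta}|\mathbf{m}_n,v_n\mathbf{I})} \\
&\propto \frac{\exp\{-\frac{1}{2}(\boldsymbol{\theta}-\mathbf{m})^T(v\mathbf{I})^{-1}(\boldsymbol{\theta}-\mathbf{m})\}}{\exp\{-\frac{1}{2}(\boldsymbol{\theta}-\mathbf{m}_n)^T(v_n\mathbf{I})^{-1}(\boldsymbol{\theta}-\mathbf{m}_n)\}} \\
&= \exp\Big\{-\frac{1}{2}(\boldsymbol{\theta}-\mathbf{m})^T(v\mathbf{I})^{-1}(\boldsymbol{\theta}-\mathbf{m}) + \frac{1}{2}(\boldsymbol{\theta}-\mathbf{m}_n)^T(v_n\mathbf{I})^{-1}(\boldsymbol{\theta}-\mathbf{m}_n)\Big\} \\
&= \exp\Big\{-\frac{1}{2}(\boldsymbol{\theta}^T\mathbf{A}\boldsymbol{\theta} + \boldsymbol{\theta}^T\cdot\mathbf{B} + C)\Big\}
\end{aligned}
\]
Where we have collected the terms in $\boldsymbol{\theta}$ in the last step, and we have defined:
\[
\mathbf{A} = (v\mathbf{I})^{-1} - (v_n\mathbf{I})^{-1} \quad\text{and}\quad \mathbf{B} = 2\cdot\big[-(v\mathbf{I})^{-1}\cdot\mathbf{m} + (v_n\mathbf{I})^{-1}\cdot\mathbf{m}_n\big]
\]
Note that in order to match this to a Gaussian, we don’t actually need $C$, so we omit it here. Now we match this against a Gaussian. Beginning with the quadratic term, we can obtain:
\[
[\boldsymbol{\Sigma}^{\backslash n}]^{-1} = (v\mathbf{I})^{-1} - (v_n\mathbf{I})^{-1} = (v^{-1}-v_n^{-1})\mathbf{I} = [v^{\backslash n}]^{-1}\cdot\mathbf{I}
\]
It is identical to Eq (10.215). By matching the linear term, we can also obtain:
\[
-2\cdot[\boldsymbol{\Sigma}^{\backslash n}]^{-1}\cdot\mathbf{m}^{\backslash n} = \mathbf{B} = 2\cdot\big[-(v\mathbf{I})^{-1}\cdot\mathbf{m} + (v_n\mathbf{I})^{-1}\cdot\mathbf{m}_n\big]
\]
\[
\begin{aligned}
\mathbf{m}^{\backslash n} &= -[\boldsymbol{\Sigma}^{\backslash n}]\cdot\big[-(v\mathbf{I})^{-1}\cdot\mathbf{m} + (v_n\mathbf{I})^{-1}\cdot\mathbf{m}_n\big] \\
&= -[v^{\backslash n}]\cdot\big[-v^{-1}\cdot\mathbf{m} + v_n^{-1}\cdot\mathbf{m}_n\big] \\
&= v^{\backslash n}\cdot v^{-1}\cdot\mathbf{m} - \frac{v^{\backslash n}}{v_n}\cdot\mathbf{m}_n \\
&= v^{\backslash n}\big([v^{\backslash n}]^{-1} + v_n^{-1}\big)\cdot\mathbf{m} - \frac{v^{\backslash n}}{v_n}\cdot\mathbf{m}_n \\
&= \mathbf{m} + \frac{v^{\backslash n}}{v_n}\cdot(\mathbf{m}-\mathbf{m}_n)
\end{aligned}
\]
Which is identical to Eq (10.214). One important point worth clarifying is that for two arbitrary Gaussian random variables, their ratio is not Gaussian. You can find more details by looking up "ratio distribution" on Wikipedia. For example, the ratio of two independent zero-mean Gaussian random variables follows a Cauchy distribution. Moreover, the product of two Gaussian random variables is not a Gaussian random variable.
However, the product of two Gaussian PDFs, e.g., $p(\mathbf{x})$ and $p(\mathbf{y})$, can be a Gaussian PDF, because when $\mathbf{x}$ and $\mathbf{y}$ are independent, $p(\mathbf{x},\mathbf{y}) = p(\mathbf{x})p(\mathbf{y})$ is a Gaussian PDF. In the EP framework, according to Eq (10.204), we have already assumed that $q(\boldsymbol{\theta})$, i.e., Eq (10.212), is given by the product of the $\tilde{f}_j(\boldsymbol{\theta})$, i.e., (10.213). Therefore, the division leaves the product of the remaining Gaussian PDFs, which is still Gaussian.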
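Finally, based on Eq (10.206) and (10.209), we can obtain:
\[
\begin{aligned}
Z_n &= \int q^{\backslash n}(\boldsymbol{\theta})\,p(\mathbf{x}_n|\boldsymbol{\theta})\,d\boldsymbol{\theta} \\
&= \int \mathcal{N}(\boldsymbol{\theta}|\mathbf{m}^{\backslash n},v^{\backslash n}\mathbf{I})\cdot\big\{(1-w)\mathcal{N}(\mathbf{x}_n|\boldsymbol{\theta},\mathbf{I}) + w\mathcal{N}(\mathbf{x}_n|\mathbf{0},\alpha\mathbf{I})\big\}\,d\boldsymbol{\theta} \\
&= (1-w)\int \mathcal{N}(\boldsymbol{\theta}|\mathbf{m}^{\backslash n},v^{\backslash n}\mathbf{I})\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\theta},\mathbf{I})\,d\boldsymbol{\theta} + w\int \mathcal{N}(\boldsymbol{\theta}|\mathbf{m}^{\backslash n},v^{\backslash n}\mathbf{I})\cdot\mathcal{N}(\mathbf{x}_n|\mathbf{0},\alpha\mathbf{I})\,d\boldsymbol{\theta} \\
&= (1-w)\mathcal{N}(\mathbf{x}_n|\mathbf{m}^{\backslash n},(v^{\backslash n}+1)\mathbf{I}) + w\mathcal{N}(\mathbf{x}_n|\mathbf{0},\alpha\mathbf{I})
\end{aligned}
\]
Where we have used Eq (2.115).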
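The cavity parameters and the closed form for $Z_n$ can be checked against brute-force integration in one dimension. This is a minimal sketch with made-up values for $\mathbf{m}$, $v$, $\mathbf{m}_n$, $v_n$, $w$ and the clutter variance (called `a` in the code); the helper names are illustrative, and it assumes $v_n > v$ so that the cavity variance is positive.

```python
import numpy as np

def iso_normal(x, mean, var):
    """Isotropic Gaussian density N(x | mean, var * I); last axis is the dimension."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    D = diff.shape[-1]
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / var) / (2.0 * np.pi * var) ** (D / 2.0)

def cavity(m, v, m_n, v_n):
    """Cavity mean and variance obtained by dividing q by the n-th site factor."""
    v_cav = 1.0 / (1.0 / v - 1.0 / v_n)
    m_cav = m + (v_cav / v_n) * (m - m_n)
    return m_cav, v_cav

def evidence_Zn(x_n, m_cav, v_cav, w, a):
    """Closed-form Z_n for the clutter likelihood."""
    return ((1.0 - w) * iso_normal(x_n, m_cav, v_cav + 1.0)
            + w * iso_normal(x_n, np.zeros_like(x_n), a))

# 1-D example with illustrative values
m, v = np.array([1.0]), 0.5
m_n, v_n = np.array([0.8]), 2.0
x_n, w, a = np.array([1.5]), 0.3, 10.0
m_cav, v_cav = cavity(m, v, m_n, v_n)
Z_closed = evidence_Zn(x_n, m_cav, v_cav, w, a)

# Brute-force integral of q_cav(theta) * p(x_n | theta) over a theta grid
theta = np.linspace(-15.0, 15.0, 60001)[:, None]
dtheta = theta[1, 0] - theta[0, 0]
lik = (1.0 - w) * iso_normal(x_n, theta, 1.0) + w * iso_normal(x_n, np.zeros(1), a)
Z_numeric = np.sum(iso_normal(theta, m_cav, v_cav) * lik) * dtheta
print(Z_closed, Z_numeric)
```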
This problem is quite involved, but a hint has already been given in Eq (10.244) and (10.245). Notice that Eq (10.244) contains the term $\nabla_{\mathbf{m}^{\setminus n}} \ln Z_n$, which by the chain rule satisfies $\nabla_{\mathbf{m}^{\setminus n}} \ln Z_n = (\nabla_{\mathbf{m}^{\setminus n}} Z_n)/Z_n$. Since the exact form of $Z_n$ was derived in the previous problem, we start from $\nabla_{\mathbf{m}^{\setminus n}} \ln Z_n$ to obtain Eq (10.244). Before starting, we write down a basic formula: for a Gaussian random variable $\mathbf{x} \sim \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$, we have:
\[
\nabla_{\boldsymbol{\mu}}\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \cdot \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})
\]
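Now we can obtain:
\[
\begin{aligned}
\nabla_{\mathbf{m}^{\setminus n}} \ln Z_n &= \frac{1}{Z_n} \nabla_{\mathbf{m}^{\setminus n}} Z_n \\
&= \frac{1}{Z_n} \nabla_{\mathbf{m}^{\setminus n}} \int q^{\setminus n}(\boldsymbol{\theta})\, p(\mathbf{x}_n|\boldsymbol{\theta})\, d\boldsymbol{\theta} \\
&= \frac{1}{Z_n} \int \left\{ \nabla_{\mathbf{m}^{\setminus n}} q^{\setminus n}(\boldsymbol{\theta}) \right\} p(\mathbf{x}_n|\boldsymbol{\theta})\, d\boldsymbol{\theta} \\
&= \frac{1}{Z_n} \int \frac{1}{v^{\setminus n}} (\boldsymbol{\theta} - \mathbf{m}^{\setminus n})\, q^{\setminus n}(\boldsymbol{\theta})\, p(\mathbf{x}_n|\boldsymbol{\theta})\, d\boldsymbol{\theta} \\
&= \frac{1}{Z_n} \cdot \frac{1}{v^{\setminus n}} \left\{ \int \boldsymbol{\theta}\, q^{\setminus n}(\boldsymbol{\theta})\, p(\mathbf{x}_n|\boldsymbol{\theta})\, d\boldsymbol{\theta} - \mathbf{m}^{\setminus n} \int q^{\setminus n}(\boldsymbol{\theta})\, p(\mathbf{x}_n|\boldsymbol{\theta})\, d\boldsymbol{\theta} \right\} \\
&= \frac{1}{v^{\setminus n}} \left\{ \mathbb{E}[\boldsymbol{\theta}] - \mathbf{m}^{\setminus n} \right\}
\end{aligned}
\]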
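Here we have used $q^{\setminus n}(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}|\mathbf{m}^{\setminus n}, v^{\setminus n}\mathbf{I})$ and $q^{\setminus n}(\boldsymbol{\theta}) \cdot p(\mathbf{x}_n|\boldsymbol{\theta}) = Z_n \cdot q^{\text{new}}(\boldsymbol{\theta})$, so the expectation is taken under $q^{\text{new}}(\boldsymbol{\theta})$. Rearranging the equation above, we obtain Eq (10.244). Then we use Eq (10.216), yielding:
\[
\begin{aligned}
\mathbb{E}[\boldsymbol{\theta}] &= \mathbf{m}^{\setminus n} + v^{\setminus n} \nabla_{\mathbf{m}^{\setminus n}} \ln Z_n \\
&= \mathbf{m}^{\setminus n} + v^{\setminus n} \cdot \frac{1}{Z_n} (1-w)\,\mathcal{N}(\mathbf{x}_n|\mathbf{m}^{\setminus n}, (v^{\setminus n}+1)\mathbf{I}) \cdot \frac{1}{v^{\setminus n}+1} (\mathbf{x}_n - \mathbf{m}^{\setminus n}) \\
&= \mathbf{m}^{\setminus n} + v^{\setminus n} \cdot \rho_n \cdot \frac{1}{v^{\setminus n}+1} (\mathbf{x}_n - \mathbf{m}^{\setminus n})
\end{aligned}
\]
where we have defined:
\[
\begin{aligned}
\rho_n &= \frac{1}{Z_n} (1-w)\,\mathcal{N}(\mathbf{x}_n|\mathbf{m}^{\setminus n}, (v^{\setminus n}+1)\mathbf{I}) \\
&= \frac{1}{Z_n} (1-w) \cdot \frac{Z_n - w\,\mathcal{N}(\mathbf{x}_n|\mathbf{0}, \alpha\mathbf{I})}{1-w} \\
&= 1 - \frac{w}{Z_n}\,\mathcal{N}(\mathbf{x}_n|\mathbf{0}, \alpha\mathbf{I})
\end{aligned}
\]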
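Therefore, we have proved that the mean $\mathbf{m}$ is given by Eq (10.217). Next we prove Eq (10.218). Similarly, we can write down:
\[
\begin{aligned}
\nabla_{v^{\setminus n}} \ln Z_n &= \frac{1}{Z_n} \nabla_{v^{\setminus n}} Z_n \\
&= \frac{1}{Z_n} \nabla_{v^{\setminus n}} \int q^{\setminus n}(\boldsymbol{\theta})\, p(\mathbf{x}_n|\boldsymbol{\theta})\, d\boldsymbol{\theta} \\
&= \frac{1}{Z_n} \int \left\{ \nabla_{v^{\setminus n}} q^{\setminus n}(\boldsymbol{\theta}) \right\} p(\mathbf{x}_n|\boldsymbol{\theta})\, d\boldsymbol{\theta} \\
&= \frac{1}{Z_n} \int \left\{ \frac{1}{2(v^{\setminus n})^2} \|\mathbf{m}^{\setminus n} - \boldsymbol{\theta}\|^2 - \frac{D}{2v^{\setminus n}} \right\} q^{\setminus n}(\boldsymbol{\theta})\, p(\mathbf{x}_n|\boldsymbol{\theta})\, d\boldsymbol{\theta} \\
&= \int q^{\text{new}}(\boldsymbol{\theta}) \left\{ \frac{1}{2(v^{\setminus n})^2} (\mathbf{m}^{\setminus n} - \boldsymbol{\theta})^T (\mathbf{m}^{\setminus n} - \boldsymbol{\theta}) - \frac{D}{2v^{\setminus n}} \right\} d\boldsymbol{\theta} \\
&= \frac{1}{2(v^{\setminus n})^2} \left\{ \mathbb{E}[\boldsymbol{\theta}^T\boldsymbol{\theta}] - 2\,\mathbb{E}[\boldsymbol{\theta}]^T\mathbf{m}^{\setminus n} + \|\mathbf{m}^{\setminus n}\|^2 \right\} - \frac{D}{2v^{\setminus n}}
\end{aligned}
\]
Rearranging, we obtain:
\[
\mathbb{E}[\boldsymbol{\theta}^T\boldsymbol{\theta}] = 2(v^{\setminus n})^2 \cdot \nabla_{v^{\setminus n}} \ln Z_n + 2\,\mathbb{E}[\boldsymbol{\theta}]^T\mathbf{m}^{\setminus n} - \|\mathbf{m}^{\setminus n}\|^2 + D \cdot v^{\setminus n}
\]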
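There is a typo in Eq (10.245): the term $D \cdot v^{\setminus n}$ is missing. The underlying reason is that, when calculating $\nabla_{v^{\setminus n}} q^{\setminus n}(\boldsymbol{\theta})$, two factors in $q^{\setminus n}(\boldsymbol{\theta})$ depend on $v^{\setminus n}$: one is inside the exponential, and the other is the normalization factor $1/|v^{\setminus n}\mathbf{I}|^{1/2}$, which is outside the exponential. Now, we again use Eq (10.216), yielding: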
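\[
\begin{aligned}
\nabla_{v^{\setminus n}} \ln Z_n &= \frac{1}{Z_n} (1-w)\,\mathcal{N}(\mathbf{x}_n|\mathbf{m}^{\setminus n}, (v^{\setminus n}+1)\mathbf{I}) \cdot \left[ \frac{1}{2(v^{\setminus n}+1)^2} \|\mathbf{x}_n - \mathbf{m}^{\setminus n}\|^2 - \frac{D}{2(v^{\setminus n}+1)} \right] \\
&= \rho_n \cdot \left[ \frac{1}{2(v^{\setminus n}+1)^2} \|\mathbf{x}_n - \mathbf{m}^{\setminus n}\|^2 - \frac{D}{2(v^{\setminus n}+1)} \right]
\end{aligned}
\]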
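Finally, using the definition of the covariance, we obtain:
\[
v\mathbf{I} = \mathbb{E}[\boldsymbol{\theta}\boldsymbol{\theta}^T] - \mathbb{E}[\boldsymbol{\theta}]\,\mathbb{E}[\boldsymbol{\theta}]^T
\]
Therefore, taking the trace, we obtain:
\[
\begin{aligned}
v &= \frac{1}{D} \left\{ \mathbb{E}[\boldsymbol{\theta}^T\boldsymbol{\theta}] - \mathbb{E}[\boldsymbol{\theta}]^T\mathbb{E}[\boldsymbol{\theta}] \right\} = \frac{1}{D} \left\{ \mathbb{E}[\boldsymbol{\theta}^T\boldsymbol{\theta}] - \|\mathbb{E}[\boldsymbol{\theta}]\|^2 \right\} \\
&= \frac{1}{D} \left\{ 2(v^{\setminus n})^2 \cdot \nabla_{v^{\setminus n}} \ln Z_n + 2\,\mathbb{E}[\boldsymbol{\theta}]^T\mathbf{m}^{\setminus n} - \|\mathbf{m}^{\setminus n}\|^2 + D \cdot v^{\setminus n} - \|\mathbb{E}[\boldsymbol{\theta}]\|^2 \right\} \\
&= \frac{1}{D} \left\{ 2(v^{\setminus n})^2 \cdot \nabla_{v^{\setminus n}} \ln Z_n - \|\mathbb{E}[\boldsymbol{\theta}] - \mathbf{m}^{\setminus n}\|^2 + D \cdot v^{\setminus n} \right\} \\
&= \frac{1}{D} \left\{ 2(v^{\setminus n})^2 \cdot \nabla_{v^{\setminus n}} \ln Z_n - \left\| v^{\setminus n} \cdot \rho_n \cdot \frac{1}{v^{\setminus n}+1} (\mathbf{x}_n - \mathbf{m}^{\setminus n}) \right\|^2 + D \cdot v^{\setminus n} \right\}
\end{aligned}
\]
If we substitute $\nabla_{v^{\setminus n}} \ln Z_n$ into the expression above, we obtain Eq (10.218) as required.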
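The moment-matching results can be checked numerically. The sketch below is a minimal illustration for the one-dimensional case (D = 1), not part of the original solution: the function names, the integration grid and the test values are all assumptions chosen for the example. It evaluates the closed-form mean and variance given by Eq (10.217) and (10.218) and compares them with brute-force moments of $q^{\setminus n}(\theta)\,p(x_n|\theta)$ computed on a grid.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def ep_clutter_update(x_n, m_cav, v_cav, w, alpha):
    """Closed-form Z_n, new mean and new variance for D = 1.

    m_cav, v_cav are the cavity mean and variance; alpha is the variance
    of the clutter component (hypothetical argument names).
    """
    clutter = w * normal_pdf(x_n, 0.0, alpha)
    z_n = (1.0 - w) * normal_pdf(x_n, m_cav, v_cav + 1.0) + clutter
    rho = 1.0 - clutter / z_n
    m_new = m_cav + rho * v_cav / (v_cav + 1.0) * (x_n - m_cav)
    v_new = (v_cav
             - rho * v_cav ** 2 / (v_cav + 1.0)
             + rho * (1.0 - rho) * v_cav ** 2 * (x_n - m_cav) ** 2 / (v_cav + 1.0) ** 2)
    return z_n, m_new, v_new


def moments_by_quadrature(x_n, m_cav, v_cav, w, alpha, lo=-30.0, hi=30.0, steps=60000):
    """Brute-force check: integrate q_cav(theta) * p(x_n | theta) on a grid."""
    h = (hi - lo) / steps
    s0 = s1 = s2 = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        lik = (1.0 - w) * normal_pdf(x_n, theta, 1.0) + w * normal_pdf(x_n, 0.0, alpha)
        g = normal_pdf(theta, m_cav, v_cav) * lik * h
        s0 += g
        s1 += g * theta
        s2 += g * theta * theta
    mean = s1 / s0
    return s0, mean, s2 / s0 - mean ** 2
```

Calling both functions with the same arguments, for example `(2.0, 1.0, 2.0, 0.3, 10.0)`, should give values that agree to within the quadrature error.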
0.11 Sampling Methods
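Based on the definition, we can write down:
\[
\begin{aligned}
\mathbb{E}[\hat{f}] &= \mathbb{E}\left[ \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{z}^{(l)}) \right] \\
&= \frac{1}{L} \sum_{l=1}^{L} \mathbb{E}[f(\mathbf{z}^{(l)})] \\
&= \frac{1}{L} \cdot L \cdot \mathbb{E}[f] = \mathbb{E}[f]
\end{aligned}
\]
where we have used the linearity of expectation to exchange the expectation and the summation, and the fact that $\mathbb{E}[f(\mathbf{z}^{(l)})] = \mathbb{E}[f]$ because all the $\mathbf{z}^{(l)}$ are drawn from $p(\mathbf{z})$. Next, we deal with the variance: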
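\[
\begin{aligned}
\text{var}[\hat{f}] &= \mathbb{E}[(\hat{f} - \mathbb{E}[\hat{f}])^2] = \mathbb{E}[\hat{f}^2] - \mathbb{E}[\hat{f}]^2 = \mathbb{E}[\hat{f}^2] - \mathbb{E}[f]^2 \\
&= \mathbb{E}\left[ \left( \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{z}^{(l)}) \right)^2 \right] - \mathbb{E}[f]^2 \\
&= \frac{1}{L^2}\, \mathbb{E}\left[ \left( \sum_{l=1}^{L} f(\mathbf{z}^{(l)}) \right)^2 \right] - \mathbb{E}[f]^2 \\
&= \frac{1}{L^2}\, \mathbb{E}\left[ \sum_{l=1}^{L} f^2(\mathbf{z}^{(l)}) + \sum_{i,j=1,\, i \neq j}^{L} f(\mathbf{z}^{(i)}) f(\mathbf{z}^{(j)}) \right] - \mathbb{E}[f]^2 \\
&= \frac{1}{L^2} \sum_{l=1}^{L} \mathbb{E}[f^2(\mathbf{z}^{(l)})] + \frac{L^2 - L}{L^2}\, \mathbb{E}[f]^2 - \mathbb{E}[f]^2 \\
&= \frac{1}{L^2} \sum_{l=1}^{L} \mathbb{E}[f^2(\mathbf{z}^{(l)})] - \frac{1}{L}\, \mathbb{E}[f]^2 \\
&= \frac{1}{L^2} \cdot L \cdot \mathbb{E}[f^2] - \frac{1}{L}\, \mathbb{E}[f]^2 \\
&= \frac{1}{L}\, \mathbb{E}[f^2] - \frac{1}{L}\, \mathbb{E}[f]^2 = \frac{1}{L}\, \mathbb{E}[(f - \mathbb{E}[f])^2]
\end{aligned}
\]
In the fifth line, the cross terms factorize as $\mathbb{E}[f(\mathbf{z}^{(i)}) f(\mathbf{z}^{(j)})] = \mathbb{E}[f]^2$ for $i \neq j$ because the samples are independent. This is just as required.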
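As a quick empirical illustration of the two results above, the following sketch repeats the estimator many times and measures its mean and variance. The choice f(z) = z² with z drawn from a standard Gaussian (so that E[f] = 1 and var[f] = 2) and all function names are assumptions made only for this example.

```python
import random


def estimator(f, sampler, L, rng):
    """Monte Carlo estimate f_hat = (1/L) * sum_l f(z^(l))."""
    return sum(f(sampler(rng)) for _ in range(L)) / L


def study(L, repeats=20000, seed=0):
    """Return the empirical mean and variance of f_hat over many repetitions."""
    rng = random.Random(seed)
    f = lambda z: z * z                       # E[f] = 1, var[f] = 2 for z ~ N(0, 1)
    sampler = lambda r: r.gauss(0.0, 1.0)
    estimates = [estimator(f, sampler, L, rng) for _ in range(repeats)]
    mean = sum(estimates) / repeats
    var = sum((e - mean) ** 2 for e in estimates) / repeats
    return mean, var
```

With L = 10, the empirical variance should be close to var[f]/L = 0.2, and the empirical mean close to 1.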
What this problem wants us to prove is that if we use $y = h^{-1}(z)$ to transform the value of z to y, where z is uniformly distributed over [0, 1] and $h(\cdot)$ is defined by Eq (11.6), then y follows the desired distribution p(y). Let's prove it, beginning with Eq (11.1):
\[
p^{\star}(y) = p(z) \cdot \left| \frac{dz}{dy} \right| = 1 \cdot h'(y) = \frac{d}{dy} \int_{-\infty}^{y} p(\hat{y})\, d\hat{y} = p(y)
\]
Just as required.
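We use what we have obtained in the previous problem.
\[
\begin{aligned}
h(y) &= \int_{-\infty}^{y} p(\hat{y})\, d\hat{y} \\
&= \int_{-\infty}^{y} \frac{1}{\pi} \frac{1}{1 + \hat{y}^2}\, d\hat{y} \\
&= \frac{1}{\pi} \tan^{-1}(y) + \frac{1}{2}
\end{aligned}
\]
Here the constant 1/2 comes from the lower limit, since $\tan^{-1}(-\infty) = -\pi/2$. Therefore, since $z = h(y) = \frac{1}{\pi}\tan^{-1}(y) + \frac{1}{2}$, we can obtain the transformation from z to y: $y = \tan\left(\pi (z - \frac{1}{2})\right)$.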
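The inverse-CDF transformation can be sketched in a few lines. This is only an illustration under the assumption that the normalized cumulative distribution of the standard Cauchy density is h(y) = 1/2 + arctan(y)/π, whose inverse is y = tan(π(z − 1/2)); the function names are hypothetical.

```python
import math
import random


def cauchy_cdf(y):
    """h(y): cumulative distribution of p(y) = 1 / (pi * (1 + y^2))."""
    return 0.5 + math.atan(y) / math.pi


def cauchy_inverse_cdf(z):
    """Inverse of h: maps a uniform z in (0, 1) to a Cauchy-distributed y."""
    return math.tan(math.pi * (z - 0.5))


def sample_cauchy(n, seed=0):
    """Draw n Cauchy samples by transforming uniform random numbers."""
    rng = random.Random(seed)
    return [cauchy_inverse_cdf(rng.random()) for _ in range(n)]
```

A simple sanity check is that the fraction of samples below y = 1 approaches h(1) = 0.75.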
First, I believe there is a typo in Eq (11.10) and (11.11): both $\ln z_1$ and $\ln z_2$ should be $\ln(z_1^2 + z_2^2)$. In the following, we solve the problem under this assumption.
We only need to calculate the Jacobian determinant $|\partial(z_1, z_2)/\partial(y_1, y_2)|$. Since $y_1$ and $y_2$ each depend on both $z_1$ and $z_2$ through $r^2 = z_1^2 + z_2^2$, computing it directly is cumbersome. For a problem defined on a circle, it is convenient to use polar coordinates:
\[
z_1 = r\cos\theta, \quad z_2 = r\sin\theta
\]
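It is easy to obtain:
\[
\frac{\partial(z_1, z_2)}{\partial(r, \theta)} =
\begin{bmatrix} \partial z_1/\partial r & \partial z_1/\partial\theta \\ \partial z_2/\partial r & \partial z_2/\partial\theta \end{bmatrix} =
\begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}
\]
\[
\left| \frac{\partial(z_1, z_2)}{\partial(r, \theta)} \right| = r(\cos^2\theta + \sin^2\theta) = r
\]
Then we substitute r and θ into Eq (11.10), yielding:
\[
y_1 = r\cos\theta \left( \frac{-2\ln r^2}{r^2} \right)^{1/2} = \cos\theta\, (-2\ln r^2)^{1/2} \qquad (*)
\]
Similarly, we also have:
\[
y_2 = \sin\theta\, (-2\ln r^2)^{1/2} \qquad (**)
\]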
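It is easy to obtain:
\[
\frac{\partial(y_1, y_2)}{\partial(r, \theta)} =
\begin{bmatrix} \partial y_1/\partial r & \partial y_1/\partial\theta \\ \partial y_2/\partial r & \partial y_2/\partial\theta \end{bmatrix} =
\begin{bmatrix} -2\cos\theta\, (-2\ln r^2)^{-1/2} \cdot r^{-1} & -\sin\theta\, (-2\ln r^2)^{1/2} \\ -2\sin\theta\, (-2\ln r^2)^{-1/2} \cdot r^{-1} & \cos\theta\, (-2\ln r^2)^{1/2} \end{bmatrix}
\]
\[
\left| \frac{\partial(y_1, y_2)}{\partial(r, \theta)} \right| = -2r^{-1}(\cos^2\theta + \sin^2\theta) = -2r^{-1}
\]
Next, we use the chain rule for Jacobian determinants:
\[
\begin{aligned}
\left| \frac{\partial(z_1, z_2)}{\partial(y_1, y_2)} \right| &= \left| \frac{\partial(z_1, z_2)}{\partial(r, \theta)} \cdot \frac{\partial(r, \theta)}{\partial(y_1, y_2)} \right| \\
&= \left| \frac{\partial(z_1, z_2)}{\partial(r, \theta)} \right| \cdot \left| \frac{\partial(r, \theta)}{\partial(y_1, y_2)} \right| \\
&= \left| \frac{\partial(z_1, z_2)}{\partial(r, \theta)} \right| \cdot \left| \frac{\partial(y_1, y_2)}{\partial(r, \theta)} \right|^{-1} \\
&= r \cdot (-2r^{-1})^{-1} = -\frac{r^2}{2}
\end{aligned}
\]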
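By squaring both sides of (∗) and (∗∗) and adding them together, we can obtain:
\[
y_1^2 + y_2^2 = -2\ln r^2 \quad \Rightarrow \quad r^2 = \exp\left\{ -\frac{y_1^2 + y_2^2}{2} \right\}
\]
Finally, we can obtain:
\[
p(y_1, y_2) = p(z_1, z_2) \left| \frac{\partial(z_1, z_2)}{\partial(y_1, y_2)} \right| = \frac{1}{\pi} \cdot \left| -\frac{r^2}{2} \right| = \frac{1}{2\pi} r^2 = \frac{1}{2\pi} \exp\left\{ -\frac{y_1^2 + y_2^2}{2} \right\}
\]
Just as required.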
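The transformation analyzed above can be written as a short sampler. The sketch below assumes the corrected form of Eq (11.10)-(11.11) with $r^2 = z_1^2 + z_2^2$ inside the logarithm, and it obtains points uniform in the unit circle by rejecting points from the enclosing square; the function name is hypothetical.

```python
import math
import random


def box_muller_pair(rng):
    """Return two independent standard Gaussian samples (polar Box-Muller)."""
    while True:
        # Draw a point uniformly in [-1, 1]^2 and keep it if it lies in the unit circle.
        z1 = rng.uniform(-1.0, 1.0)
        z2 = rng.uniform(-1.0, 1.0)
        r2 = z1 * z1 + z2 * z2
        if 0.0 < r2 <= 1.0:
            break
    # y_i = z_i * (-2 ln r^2 / r^2)^(1/2), with r^2 = z1^2 + z2^2.
    scale = math.sqrt(-2.0 * math.log(r2) / r2)
    return z1 * scale, z2 * scale
```

Drawing many pairs and checking that each coordinate has mean near 0, variance near 1 and negligible cross-correlation is a simple way to confirm the derived density empirically.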
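Since y is a linear transformation of the Gaussian random variable z, it is still Gaussian. We only need to match its moments (mean and covariance). We know that $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T$, and $\mathbf{y} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}$. Now, using $\mathbb{E}[\mathbf{z}] = \mathbf{0}$, we obtain:
\[
\begin{aligned}
\mathbb{E}[\mathbf{y}] &= \mathbb{E}[\boldsymbol{\mu} + \mathbf{L}\mathbf{z}] \\
&= \boldsymbol{\mu} + \mathbf{L} \cdot \mathbb{E}[\mathbf{z}] \\
&= \boldsymbol{\mu}
\end{aligned}
\]
Moreover, using $\text{cov}[\mathbf{z}] = \mathbb{E}[\mathbf{z}\mathbf{z}^T] - \mathbb{E}[\mathbf{z}]\mathbb{E}[\mathbf{z}]^T = \mathbb{E}[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$, we can obtain:
\[
\begin{aligned}
\text{cov}[\mathbf{y}] &= \mathbb{E}[\mathbf{y}\mathbf{y}^T] - \mathbb{E}[\mathbf{y}]\mathbb{E}[\mathbf{y}]^T \\
&= \mathbb{E}[(\boldsymbol{\mu} + \mathbf{L}\mathbf{z})(\boldsymbol{\mu} + \mathbf{L}\mathbf{z})^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T \\
&= \mathbb{E}[\boldsymbol{\mu}\boldsymbol{\mu}^T + \boldsymbol{\mu}(\mathbf{L}\mathbf{z})^T + (\mathbf{L}\mathbf{z})\boldsymbol{\mu}^T + (\mathbf{L}\mathbf{z})(\mathbf{L}\mathbf{z})^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T \\
&= \boldsymbol{\mu}\, \mathbb{E}[\mathbf{z}]^T \mathbf{L}^T + \mathbf{L}\, \mathbb{E}[\mathbf{z}]\, \boldsymbol{\mu}^T + \mathbb{E}[\mathbf{L}\mathbf{z}\mathbf{z}^T\mathbf{L}^T] \\
&= \mathbf{L} \cdot \mathbb{E}[\mathbf{z}\mathbf{z}^T] \cdot \mathbf{L}^T = \mathbf{L} \cdot \mathbf{I} \cdot \mathbf{L}^T \\
&= \boldsymbol{\Sigma}
\end{aligned}
\]
Just as required.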
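For a concrete feel of this result, the sketch below samples a two-dimensional Gaussian through y = μ + Lz. It is a minimal illustration restricted to 2×2 covariance matrices, with a hand-written Cholesky factor; the function names and the example covariance are assumptions for this example only.

```python
import math
import random


def cholesky_2x2(sigma):
    """Lower-triangular L with L L^T = sigma, for a symmetric positive-definite 2x2 matrix."""
    (a, b), (_, d) = sigma
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return [[l11, 0.0], [l21, l22]]


def sample_gaussian_2d(mu, sigma, rng):
    """Draw y = mu + L z with z ~ N(0, I), so that y ~ N(mu, sigma)."""
    L = cholesky_2x2(sigma)
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [mu[0] + L[0][0] * z[0],
            mu[1] + L[1][0] * z[0] + L[1][1] * z[1]]
```

The empirical mean and covariance of many such draws should approach μ and Σ respectively.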
This problem is all about definitions. According to the description of rejection sampling, for a specific value $\mathbf{z}_0$ drawn from $q(\mathbf{z})$, we generate a random variable $u_0$ that is uniformly distributed on the interval $[0, kq(\mathbf{z}_0)]$, and we accept $\mathbf{z}_0$ if the generated value of $u_0$ is less than $\tilde{p}(\mathbf{z}_0)$. Therefore, we obtain:
\[
P[\text{accept}|\mathbf{z}_0] = \frac{\tilde{p}(\mathbf{z}_0)}{kq(\mathbf{z}_0)}
\]
Since $\mathbf{z}_0$ is drawn from $q(\mathbf{z})$, we obtain the total acceptance rate by integration:
\[
P[\text{accept}] = \int P[\text{accept}|\mathbf{z}_0] \cdot q(\mathbf{z}_0)\, d\mathbf{z}_0 = \int \frac{\tilde{p}(\mathbf{z}_0)}{k}\, d\mathbf{z}_0
\]
This is identical to Eq (11.14). Substituting Eq (11.13) into the expression above yields:
\[
P[\text{accept}] = \frac{Z_p}{k}
\]
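We define a very small vector $\boldsymbol{\epsilon}$, and we can obtain:
\[
\begin{aligned}
P[\mathbf{x}_0 \in (\mathbf{x}, \mathbf{x} + \boldsymbol{\epsilon})] &= P[\mathbf{z}_0 \in (\mathbf{x}, \mathbf{x} + \boldsymbol{\epsilon})\,|\,\text{accept}] \\
&= \frac{P[\text{accept}, \mathbf{z}_0 \in (\mathbf{x}, \mathbf{x} + \boldsymbol{\epsilon})]}{P[\text{accept}]} \\
&= \frac{\int_{(\mathbf{x}, \mathbf{x}+\boldsymbol{\epsilon})} q(\mathbf{z}_0)\, P[\text{accept}|\mathbf{z}_0]\, d\mathbf{z}_0}{Z_p/k} \\
&= \int_{(\mathbf{x}, \mathbf{x}+\boldsymbol{\epsilon})} \frac{k}{Z_p} \cdot q(\mathbf{z}_0) \cdot P[\text{accept}|\mathbf{z}_0]\, d\mathbf{z}_0 \\
&= \int_{(\mathbf{x}, \mathbf{x}+\boldsymbol{\epsilon})} \frac{k}{Z_p} \cdot q(\mathbf{z}_0) \cdot \frac{\tilde{p}(\mathbf{z}_0)}{kq(\mathbf{z}_0)}\, d\mathbf{z}_0 \\
&= \int_{(\mathbf{x}, \mathbf{x}+\boldsymbol{\epsilon})} \frac{1}{Z_p} \cdot \tilde{p}(\mathbf{z}_0)\, d\mathbf{z}_0 \\
&= \int_{(\mathbf{x}, \mathbf{x}+\boldsymbol{\epsilon})} p(\mathbf{z}_0)\, d\mathbf{z}_0
\end{aligned}
\]
Just as required. Several clarifications must be made here: (1) we have used P[A] to represent the probability that event A occurs, and p(z) or q(z) to represent a Probability Density Function (PDF). (2) Please be careful with $P[\mathbf{x}_0 \in (\mathbf{x}, \mathbf{x} + \boldsymbol{\epsilon})] = P[\mathbf{z}_0 \in (\mathbf{x}, \mathbf{x} + \boldsymbol{\epsilon})\,|\,\text{accept}]$, which is the key point of this problem.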
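The acceptance rate $Z_p/k$ can be observed in a small experiment. The sketch below is an illustration under assumed choices that are not part of the problem: the unnormalized target is $\tilde{p}(z) = \exp(-z^2/2)$ (so $Z_p = \sqrt{2\pi}$), the proposal is a standard Cauchy density, and $k = 2\pi e^{-1/2}$ is the smallest constant for which $kq(z) \geq \tilde{p}(z)$ everywhere.

```python
import math
import random


def p_tilde(z):
    """Unnormalized target density (assumed example): exp(-z^2 / 2)."""
    return math.exp(-0.5 * z * z)


def proposal_pdf(z):
    """Standard Cauchy proposal density q(z)."""
    return 1.0 / (math.pi * (1.0 + z * z))


def rejection_sample(n_proposals, seed=0):
    """Run rejection sampling; return accepted samples, acceptance rate and k."""
    rng = random.Random(seed)
    k = 2.0 * math.pi * math.exp(-0.5)   # max over z of p_tilde(z) / q(z), attained at z^2 = 1
    accepted = []
    for _ in range(n_proposals):
        z0 = math.tan(math.pi * (rng.random() - 0.5))   # draw z0 from q by inverse CDF
        u0 = rng.uniform(0.0, k * proposal_pdf(z0))     # u0 uniform on [0, k q(z0)]
        if u0 <= p_tilde(z0):
            accepted.append(z0)
    return accepted, len(accepted) / n_proposals, k
```

Under these assumptions the observed acceptance rate should approach $\sqrt{2\pi}/k \approx 0.66$, and the accepted samples should have variance close to 1.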

Notice that the symbols used in the main text are different from those in the problem description. In the following, we use those of the main text. Namely, y is uniformly distributed on the interval [0, 1], and $z = b\tan y + c$. We aim to prove Eq (11.16). Since we know that:
\[
q(z) = p(y) \cdot \left| \frac{dy}{dz} \right|
\]
and that:
\[
y = \arctan\frac{z - c}{b} \quad \Rightarrow \quad \frac{dy}{dz} = \frac{1}{b} \cdot \frac{1}{1 + [(z - c)/b]^2}
\]
Substituting it back, we obtain:
\[
q(z) = 1 \cdot \frac{1}{b} \cdot \frac{1}{1 + [(z - c)/b]^2}
\]
In my view, Eq (11.16) is an expression for the comparison function kq(z), not the proposal function q(z). If we wish to use Eq (11.16) to express the proposal function, the numerator in Eq (11.16) should be 1/b instead of k, because the proposal function q(z) is a PDF and should integrate to 1. However, in rejection sampling, the comparison function is what we actually care about.
There is a typo in Eq (11.17), which is not difficult to see if we carefully examine Fig. 11.6. The correct form should be:
\[
q_i(z) = k_i \lambda_i \exp\{-\lambda_i (z - z_i)\}, \quad \tilde{z}_{i-1,i} < z \leq \tilde{z}_{i,i+1}, \quad \text{where } i = 1, 2, ..., N
\]
Here we use $\tilde{z}_{i,i+1}$ to represent the intersection point of the i-th and (i+1)-th envelope segments, $q_i(z)$ to represent the comparison function on the i-th segment, and N to denote the total number of segments. Notice that $\tilde{z}_{0,1}$ and $\tilde{z}_{N,N+1}$ can be $-\infty$ and $\infty$, respectively.
First, from Fig. 11.6, we see that $q_i(z_i) = \tilde{p}(z_i)$. Substituting the expression above into this equation yields:
\[
k_i \lambda_i = \tilde{p}(z_i) \qquad (*)
\]
One important point should be made clear: we can only evaluate the unnormalized $\tilde{p}(z)$ at a specific point z, not the normalized PDF p(z). This is the assumption of rejection sampling. For more details, please refer to Section 11.1.2.
Since $q_i(z)$ and $q_{i+1}(z)$ must have the same value at $\tilde{z}_{i,i+1}$, we obtain:
\[
k_i \lambda_i \exp\{-\lambda_i (\tilde{z}_{i,i+1} - z_i)\} = k_{i+1} \lambda_{i+1} \exp\{-\lambda_{i+1} (\tilde{z}_{i,i+1} - z_{i+1})\}
\]
After some rearrangement, we obtain:
\[
\tilde{z}_{i,i+1} = \frac{1}{\lambda_i - \lambda_{i+1}} \left\{ \ln\frac{k_i \lambda_i}{k_{i+1} \lambda_{i+1}} + \lambda_i z_i - \lambda_{i+1} z_{i+1} \right\} \qquad (**)
\]
Before moving on, we should make some clarifications: adaptive rejection sampling begins with several grid points, e.g., $z_1, z_2, ..., z_N$, at which we evaluate the slope of $\ln \tilde{p}(z)$, giving $\lambda_i = -\frac{d}{dz} \ln \tilde{p}(z)$ at $z = z_i$ for $i = 1, 2, ..., N$. Then we can easily obtain $k_i$ based on (∗), and next the intersection points $\tilde{z}_{i,i+1}$ based on (∗∗).
In this problem, we use the same notation as in the previous one. First, we need to know the probability of sampling from each segment. Since Eq (11.17) is not correctly normalized, we first calculate its normalization constant $Z_q$:
\[
\begin{aligned}
Z_q &= \int_{\tilde{z}_{0,1}}^{\tilde{z}_{N,N+1}} q(z)\, dz = \sum_{i=1}^{N} \int_{\tilde{z}_{i-1,i}}^{\tilde{z}_{i,i+1}} q_i(z)\, dz \\
&= \sum_{i=1}^{N} \int_{\tilde{z}_{i-1,i}}^{\tilde{z}_{i,i+1}} k_i \lambda_i \exp\{-\lambda_i (z - z_i)\}\, dz \\
&= \sum_{i=1}^{N} -k_i \exp\{-\lambda_i (z - z_i)\} \Big|_{\tilde{z}_{i-1,i}}^{\tilde{z}_{i,i+1}} \\
&= \sum_{i=1}^{N} -k_i \left[ \exp\{-\lambda_i (\tilde{z}_{i,i+1} - z_i)\} - \exp\{-\lambda_i (\tilde{z}_{i-1,i} - z_i)\} \right] = \sum_{i=1}^{N} \hat{k}_i
\end{aligned}
\]
where we have defined:
\[
\hat{k}_i = -k_i \left[ \exp\{-\lambda_i (\tilde{z}_{i,i+1} - z_i)\} - \exp\{-\lambda_i (\tilde{z}_{i-1,i} - z_i)\} \right] \qquad (*)
\]
From this derivation, we know that the probability of sampling from the i-th segment is given by $\hat{k}_i / Z_q$, where $Z_q = \sum_{i=1}^{N} \hat{k}_i$. Therefore, we define an auxiliary random variable η, which is uniform on the interval [0, 1], and then define:
\[
i = j \quad \text{if } \eta \in \left[ \frac{1}{Z_q} \sum_{m=0}^{j-1} \hat{k}_m,\; \frac{1}{Z_q} \sum_{m=0}^{j} \hat{k}_m \right], \quad j = 1, 2, ..., N \qquad (**)
\]
where we have defined $\hat{k}_0 = 0$ for convenience. At this point, we have chosen the i-th segment. Next, we sample from the i-th exponential distribution using the technique in Section 11.1.1. According to Eq (11.6), we can write down:
\[
\begin{aligned}
h_i(z) &= \int_{\tilde{z}_{i-1,i}}^{z} \frac{q_i(z')}{\hat{k}_i}\, dz' \\
&= \frac{1}{\hat{k}_i} \int_{\tilde{z}_{i-1,i}}^{z} k_i \lambda_i \exp\{-\lambda_i (z' - z_i)\}\, dz' \\
&= \frac{-k_i}{\hat{k}_i} \exp\{-\lambda_i (z' - z_i)\} \Big|_{\tilde{z}_{i-1,i}}^{z} \\
&= \frac{-k_i}{\hat{k}_i} \left[ \exp\{-\lambda_i (z - z_i)\} - \exp\{-\lambda_i (\tilde{z}_{i-1,i} - z_i)\} \right] \\
&= \frac{k_i}{\hat{k}_i} \exp(\lambda_i z_i) \left[ \exp\{-\lambda_i \tilde{z}_{i-1,i}\} - \exp\{-\lambda_i z\} \right]
\end{aligned}
\]
Notice that $q_i(z)$ is not correctly normalized, and $q_i(z)/\hat{k}_i$ is the correctly normalized form. Setting $\xi = h_i(z)$ and solving for z, we obtain:
\[
h_i^{-1}(\xi) = -\frac{1}{\lambda_i} \ln\left[ \exp\{-\lambda_i \tilde{z}_{i-1,i}\} - \frac{\xi\, \hat{k}_i}{k_i \exp(\lambda_i z_i)} \right]
\]
In conclusion, we first generate a random variable η, which is uniform on the interval [0, 1], and obtain the segment index i according to (∗∗). We then generate a random variable ξ, which is also uniform on the interval [0, 1], and transform it to z using $z = h_i^{-1}(\xi)$.
Notice that here $\lambda_i$, $\tilde{z}_{i,i+1}$ and $k_i$ can be obtained once the grid points $z_1, z_2, ..., z_N$ are given. For more details, please refer to the previous problem. After these quantities are obtained, $\hat{k}_i$ can also be determined using (∗), and thus $h_i^{-1}(\xi)$ can be determined.
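The two-stage procedure summarized above can be sketched as follows. This is an illustration under several assumptions that are not in the original text: the example target is $\tilde{p}(z) = \exp(-z^2/2)$, all grid points have nonzero and distinct slopes, the first slope λ is negative and the last positive (so the outer segments are integrable), and the uniform variable ξ is clamped slightly inside (0, 1) to avoid taking the logarithm of zero. The function names and the tuple layout are hypothetical.

```python
import math
import random


def build_envelope(grid, log_p, dlog_p):
    """Return one tuple (lam, k, z_i, lower, upper, k_hat) per envelope segment."""
    lam = [-dlog_p(z) for z in grid]                          # lambda_i = -d/dz ln p~(z_i)
    k = [math.exp(log_p(z)) / l for z, l in zip(grid, lam)]   # from k_i * lambda_i = p~(z_i)
    cuts = [-math.inf]
    for i in range(len(grid) - 1):                            # intersection points
        num = (math.log((k[i] * lam[i]) / (k[i + 1] * lam[i + 1]))
               + lam[i] * grid[i] - lam[i + 1] * grid[i + 1])
        cuts.append(num / (lam[i] - lam[i + 1]))
    cuts.append(math.inf)
    segments = []
    for i, z_i in enumerate(grid):
        lo, hi = cuts[i], cuts[i + 1]
        # Unnormalized mass k_hat_i of segment i.
        k_hat = -k[i] * (math.exp(-lam[i] * (hi - z_i)) - math.exp(-lam[i] * (lo - z_i)))
        segments.append((lam[i], k[i], z_i, lo, hi, k_hat))
    return segments


def sample_envelope(segments, rng):
    """Pick a segment with probability k_hat_i / Z_q, then invert h_i."""
    z_q = sum(seg[5] for seg in segments)
    eta = rng.random() * z_q
    acc = 0.0
    for lam, k, z_i, lo, hi, k_hat in segments:
        acc += k_hat
        if eta <= acc:
            break
    xi = min(max(rng.random(), 1e-12), 1.0 - 1e-12)           # keep the log argument positive
    inner = math.exp(-lam * lo) - xi * k_hat / (k * math.exp(lam * z_i))
    return -math.log(inner) / lam
```

For instance, `build_envelope([-1.5, -0.5, 0.5, 1.5], lambda z: -0.5 * z * z, lambda z: -z)` gives intersection points at −1, 0 and 1, and the fraction of samples landing in the first segment should approach its share $\hat{k}_1 / Z_q$.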
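Based on the definition and Eq (11.34)-(11.36), we can write down:
\[
\begin{aligned}
\mathbb{E}_{\tau}[(z^{(\tau)})^2] &= 0.5 \cdot \mathbb{E}_{\tau-1}[(z^{(\tau-1)})^2] + 0.25 \cdot \mathbb{E}_{\tau-1}[(z^{(\tau-1)} + 1)^2] + 0.25 \cdot \mathbb{E}_{\tau-1}[(z^{(\tau-1)} - 1)^2] \\
&= \mathbb{E}_{\tau-1}[(z^{(\tau-1)})^2] + 0.5
\end{aligned}
\]
If the initial state is $z^{(0)} = 0$ (there is a typo in the line below Eq (11.36)), we obtain $\mathbb{E}_{\tau}[(z^{(\tau)})^2] = \tau/2$ by induction, just as required.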
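The result can be illustrated by simulating the random walk directly. The sketch below assumes the transition probabilities used in the recursion above (stay with probability 0.5, move by +1 or −1 with probability 0.25 each) and a start at z = 0; the function name and the number of walkers are arbitrary choices for this example.

```python
import random


def simulate_second_moment(tau, walkers=20000, seed=0):
    """Estimate E[(z^(tau))^2] for the random walk started at z = 0."""
    rng = random.Random(seed)
    total = 0
    for _ in range(walkers):
        z = 0
        for _ in range(tau):
            u = rng.random()
            if u < 0.25:
                z += 1        # step up with probability 0.25
            elif u < 0.5:
                z -= 1        # step down with probability 0.25
            # otherwise stay put with probability 0.5
        total += z * z
    return total / walkers
```

For τ = 20 the estimate should be close to τ/2 = 10, which shows that the typical distance travelled grows only as the square root of τ.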

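This problem requires you to know the definition of detailed balance, i.e., Eq (11.40):
\[
p^{\star}(\mathbf{z})\, T(\mathbf{z}, \mathbf{z}') = p^{\star}(\mathbf{z}')\, T(\mathbf{z}', \mathbf{z})
\]
Note that here $\mathbf{z}$ and $\mathbf{z}'$ are the sampled values of $[z_1, z_2, ..., z_M]^T$ in two consecutive Gibbs sampling steps. Without loss of generality, we assume that we are updating $z_j^{\tau}$ to $z_j^{\tau+1}$ in step τ:
\[
\begin{aligned}
p^{\star}(\mathbf{z})\, T(\mathbf{z}, \mathbf{z}') &= p(z_1^{\tau}, z_2^{\tau}, ..., z_M^{\tau}) \cdot p(z_j^{\tau+1}|\mathbf{z}_{\setminus j}^{\tau}) \\
&= p(z_j^{\tau}|\mathbf{z}_{\setminus j}^{\tau}) \cdot p(\mathbf{z}_{\setminus j}^{\tau}) \cdot p(z_j^{\tau+1}|\mathbf{z}_{\setminus j}^{\tau}) \\
&= p(z_j^{\tau}|\mathbf{z}_{\setminus j}^{\tau+1}) \cdot p(\mathbf{z}_{\setminus j}^{\tau+1}) \cdot p(z_j^{\tau+1}|\mathbf{z}_{\setminus j}^{\tau+1}) \\
&= p(z_j^{\tau}|\mathbf{z}_{\setminus j}^{\tau+1}) \cdot p(z_1^{\tau+1}, z_2^{\tau+1}, ..., z_M^{\tau+1}) \\
&= T(\mathbf{z}', \mathbf{z}) \cdot p^{\star}(\mathbf{z}')
\end{aligned}
\]
To be more specific, we write down the first line based on Gibbs sampling, where $\mathbf{z}_{\setminus j}^{\tau}$ denotes all the entries in the vector $\mathbf{z}^{\tau}$ except $z_j^{\tau}$. In the second line, we use the product rule, i.e., p(a, b) = p(a|b) p(b), for the first term. In the third line, we use the fact that $\mathbf{z}_{\setminus j}^{\tau} = \mathbf{z}_{\setminus j}^{\tau+1}$. In the fourth line, we apply the product rule in reverse to the last two terms, and finally obtain what has been asked.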
Gibbs sampling is not ergodic for this specific distribution. The reason is that the projections of the two shaded regions onto the $z_1$ axis do not overlap, and neither do their projections onto the $z_2$ axis. For instance, denote the lower-left shaded region as region 1. If the initial sample falls into this region, then no matter how many steps are carried out, all the generated samples remain in region 1. The same holds for the upper-right region.
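Let's begin with the definition.
\[
\begin{aligned}
p(\mu|x, \tau, \mu_0, s_0) &\propto p(x|\mu, \tau, \mu_0, s_0) \cdot p(\mu|\tau, \mu_0, s_0) \\
&= p(x|\mu, \tau) \cdot p(\mu|\mu_0, s_0) \\
&= \mathcal{N}(x|\mu, \tau^{-1}) \cdot \mathcal{N}(\mu|\mu_0, s_0)
\end{aligned}
\]
where in the first line we have used Bayes' theorem:
\[
p(\mu|x, c) \propto p(x|\mu, c) \cdot p(\mu|c)
\]
Now, using Eq (2.113)-(2.117), we can obtain $p(\mu|x, \tau, \mu_0, s_0) = \mathcal{N}(\mu|\mu^{\star}, s^{\star})$, where:
\[
[s^{\star}]^{-1} = s_0^{-1} + \tau, \quad \mu^{\star} = s^{\star} \cdot (\tau \cdot x + s_0^{-1} \mu_0)
\]
It is similar for $p(\tau|x, \mu, a, b)$:
\[
\begin{aligned}
p(\tau|x, \mu, a, b) &\propto p(x|\tau, \mu, a, b) \cdot p(\tau|\mu, a, b) \\
&= p(x|\mu, \tau) \cdot p(\tau|a, b) \\
&= \mathcal{N}(x|\mu, \tau^{-1}) \cdot \text{Gam}(\tau|a, b)
\end{aligned}
\]
Based on Section 2.3.6, especially Eq (2.150)-(2.151), we can obtain $p(\tau|x, \mu, a, b) = \text{Gam}(\tau|a^{\star}, b^{\star})$, where we have defined:
\[
a^{\star} = a + 0.5, \quad b^{\star} = b + 0.5 \cdot (x - \mu)^2
\]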
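The two conditionals can be combined into a Gibbs sampler. The sketch below is a minimal illustration for a single observation x, with hypothetical function names and an arbitrary initialization; note that Python's `random.gammavariate` takes a shape and a scale, so the rate $b^{\star}$ is passed as `1.0 / b_star`.

```python
import math
import random


def conditional_mu(x, tau, mu0, s0):
    """Mean and variance of p(mu | x, tau): 1/s* = 1/s0 + tau, mu* = s* (tau x + mu0 / s0)."""
    s_star = 1.0 / (1.0 / s0 + tau)
    mu_star = s_star * (tau * x + mu0 / s0)
    return mu_star, s_star


def conditional_tau(x, mu, a, b):
    """Shape and rate of p(tau | x, mu): a* = a + 0.5, b* = b + 0.5 (x - mu)^2."""
    return a + 0.5, b + 0.5 * (x - mu) ** 2


def gibbs(x, mu0, s0, a, b, steps, seed=0):
    """Alternate draws from the two conditionals; returns a list of (mu, tau) pairs."""
    rng = random.Random(seed)
    tau = a / b                                           # start at the prior mean of tau (a choice)
    samples = []
    for _ in range(steps):
        mu_star, s_star = conditional_mu(x, tau, mu0, s0)
        mu = rng.gauss(mu_star, math.sqrt(s_star))        # gauss takes a standard deviation
        a_star, b_star = conditional_tau(x, mu, a, b)
        tau = rng.gammavariate(a_star, 1.0 / b_star)      # second argument is the scale = 1 / rate
        samples.append((mu, tau))
    return samples
```

For example, `conditional_mu(2.0, 1.0, 0.0, 1.0)` returns `(1.0, 0.5)`, in agreement with the formulas for $\mu^{\star}$ and $s^{\star}$.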
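Based on the definition, we can write down:
\[
\begin{aligned}
\mathbb{E}[z_i'] &= \mathbb{E}[\mu_i + \alpha(z_i - \mu_i) + \sigma_i (1 - \alpha^2)^{1/2} \nu] \\
&= \mu_i + \mathbb{E}[\alpha(z_i - \mu_i)] + \mathbb{E}[\sigma_i (1 - \alpha^2)^{1/2} \nu] \\
&= \mu_i + \alpha \cdot \mathbb{E}[z_i - \mu_i] + [\sigma_i (1 - \alpha^2)^{1/2}] \cdot \mathbb{E}[\nu] \\
&= \mu_i
\end{aligned}
\]
where we have used the fact that the mean of $z_i$ is $\mu_i$, i.e., $\mathbb{E}[z_i] = \mu_i$, and that the mean of ν is 0, i.e., $\mathbb{E}[\nu] = 0$. Then we deal with the variance:
\[
\begin{aligned}
\text{var}[z_i'] &= \mathbb{E}[(z_i' - \mu_i)^2] \\
&= \mathbb{E}[(\alpha(z_i - \mu_i) + \sigma_i (1 - \alpha^2)^{1/2} \nu)^2] \\
&= \mathbb{E}[\alpha^2 (z_i - \mu_i)^2] + \mathbb{E}[\sigma_i^2 (1 - \alpha^2) \nu^2] + \mathbb{E}[2\alpha (z_i - \mu_i) \cdot \sigma_i (1 - \alpha^2)^{1/2} \nu] \\
&= \alpha^2 \cdot \mathbb{E}[(z_i - \mu_i)^2] + \sigma_i^2 (1 - \alpha^2) \cdot \mathbb{E}[\nu^2] + 2\alpha \cdot \sigma_i (1 - \alpha^2)^{1/2} \cdot \mathbb{E}[(z_i - \mu_i) \nu] \\
&= \alpha^2 \cdot \text{var}[z_i] + \sigma_i^2 (1 - \alpha^2) \cdot (\text{var}[\nu] + \mathbb{E}[\nu]^2) + 2\alpha \cdot \sigma_i (1 - \alpha^2)^{1/2} \cdot \mathbb{E}[z_i - \mu_i] \cdot \mathbb{E}[\nu] \\
&= \alpha^2 \cdot \sigma_i^2 + \sigma_i^2 (1 - \alpha^2) \cdot 1 + 0 \\
&= \sigma_i^2
\end{aligned}
\]
where we have used the fact that $z_i$ and ν are independent, and thus:
\[
\mathbb{E}[(z_i - \mu_i) \nu] = \mathbb{E}[z_i - \mu_i] \cdot \mathbb{E}[\nu] = 0
\]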
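The invariance of the mean and variance can be observed by iterating the over-relaxation update on a single Gaussian variable. The sketch below is an illustration only: it treats the conditional mean μ and standard deviation σ as fixed constants (in a real Gibbs sweep they would depend on the other variables), and the function name and parameter values are assumptions.

```python
import math
import random


def over_relax_chain(mu, sigma, alpha, steps, seed=0):
    """Iterate z' = mu + alpha (z - mu) + sigma (1 - alpha^2)^(1/2) nu, with nu ~ N(0, 1)."""
    rng = random.Random(seed)
    z = rng.gauss(mu, sigma)          # start from the target distribution N(mu, sigma^2)
    chain = []
    for _ in range(steps):
        nu = rng.gauss(0.0, 1.0)
        z = mu + alpha * (z - mu) + sigma * math.sqrt(1.0 - alpha ** 2) * nu
        chain.append(z)
    return chain
```

With μ = 1, σ = 2 and α = −0.5, the empirical mean and variance of a long chain should stay close to 1 and 4.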
Using Eq (11.57), we can write down:
\[
\frac{\partial H}{\partial r_i} = \frac{\partial K}{\partial r_i} = r_i
\]
Comparing this with Eq (11.53), we obtain Eq (11.58). Similarly, still using Eq (11.57), we can obtain:
\[
\frac{\partial H}{\partial z_i} = \frac{\partial E}{\partial z_i}
\]
Comparing this with Eq (11.55), we obtain Eq (11.59).
According to Bayes' Theorem and Eq (11.54), (11.63), we have:
\[
p(\mathbf{r}|\mathbf{z}) = \frac{p(\mathbf{z}, \mathbf{r})}{p(\mathbf{z})} = \frac{1/Z_H \cdot \exp(-H(\mathbf{z}, \mathbf{r}))}{1/Z_p \cdot \exp(-E(\mathbf{z}))} = \frac{Z_p}{Z_H} \cdot \exp(-K(\mathbf{r}))
\]
where we have used Eq (11.57). Moreover, by noticing Eq (11.56), we conclude that $p(\mathbf{r}|\mathbf{z})$ is a Gaussian distribution.
There are typos in Eq (11.68) and (11.69): the signs in the exponent of the second argument of the min function are not right. To be more specific, Eq (11.68) should be:
\[
\frac{1}{Z_H} \exp(-H(\mathcal{R}))\, \delta V\, \frac{1}{2} \min\{1, \exp(H(\mathcal{R}) - H(\mathcal{R}'))\} \qquad (*)
\]
and Eq (11.69) should be:
\[
\frac{1}{Z_H} \exp(-H(\mathcal{R}'))\, \delta V\, \frac{1}{2} \min\{1, \exp(H(\mathcal{R}') - H(\mathcal{R}))\} \qquad (**)
\]
When $H(\mathcal{R}) = H(\mathcal{R}')$, they are clearly equal. When $H(\mathcal{R}) > H(\mathcal{R}')$, (∗) reduces to:
\[
\frac{1}{Z_H} \exp(-H(\mathcal{R}))\, \delta V\, \frac{1}{2}
\]
because the min function gives 1, and in this case (∗∗) gives:
\[
\frac{1}{Z_H} \exp(-H(\mathcal{R}'))\, \delta V\, \frac{1}{2} \exp(H(\mathcal{R}') - H(\mathcal{R})) = \frac{1}{Z_H} \exp(-H(\mathcal{R}))\, \delta V\, \frac{1}{2}
\]
Therefore, they are identical, and the case $H(\mathcal{R}) < H(\mathcal{R}')$ is similar.
0.12 Continuous Latent Variables
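By analogy to Eq (12.2), we can conclude that the data projected onto a vector $\mathbf{u}_{M+1}$ has variance given by $\mathbf{u}_{M+1}^T \mathbf{S} \mathbf{u}_{M+1}$. Moreover, there are two constraints on $\mathbf{u}_{M+1}$: (1) it should be correctly normalized, i.e., $\mathbf{u}_{M+1}^T \mathbf{u}_{M+1} = 1$, and (2) it should be orthogonal to all the previous M chosen eigenvectors $\{\mathbf{u}_1, \mathbf{u}_2, ..., \mathbf{u}_M\}$. We aim to maximize the variance with respect to $\mathbf{u}_{M+1}$ subject to these two constraints. This can be done by introducing Lagrange multipliers:
\[
L = \mathbf{u}_{M+1}^T \mathbf{S} \mathbf{u}_{M+1} + \lambda (1 - \mathbf{u}_{M+1}^T \mathbf{u}_{M+1}) + \sum_{m=1}^{M} \eta_m \mathbf{u}_m^T \mathbf{u}_{M+1} \qquad (*)
\]
Therefore, we can further calculate its derivative with respect to $\mathbf{u}_{M+1}$ and set it to zero:
\[
\frac{\partial L}{\partial \mathbf{u}_{M+1}} = 2\mathbf{S}\mathbf{u}_{M+1} - 2\lambda \mathbf{u}_{M+1} + \sum_{k=1}^{M} \eta_k \mathbf{u}_k = 0
\]
We left-multiply by $\mathbf{u}_m^T$ for a fixed $m \in \{1, ..., M\}$, so that the left-hand side becomes:
\[
\begin{aligned}
2\mathbf{u}_m^T \mathbf{S} \mathbf{u}_{M+1} - 2\lambda \mathbf{u}_m^T \mathbf{u}_{M+1} + \sum_{k=1}^{M} \eta_k \mathbf{u}_m^T \mathbf{u}_k &= 2\mathbf{u}_m^T \mathbf{S} \mathbf{u}_{M+1} - 0 + \eta_m \\
&= 2\mathbf{u}_{M+1}^T \mathbf{S} \mathbf{u}_m + \eta_m \\
&= 2\lambda_m \mathbf{u}_{M+1}^T \mathbf{u}_m + \eta_m \\
&= \eta_m
\end{aligned}
\]
where we have used the property of orthogonality, and in the second line we have transposed the first term and used the fact that S is symmetric. Since this expression must equal zero, we obtain $\eta_m = 0$. This directly reduces (∗) to the form shown in Eq (12.4), and consequently we need to choose, among the eigenvectors of S not yet chosen, the one with the largest eigenvalue.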
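\[
\mathbf{u}_i^T \mathbf{u}_i = \frac{1}{N\lambda_i} \mathbf{v}_i^T \mathbf{X}\mathbf{X}^T \mathbf{v}_i
\]
We left-multiply both sides of Eq (12.28) by $\mathbf{v}_i^T$, yielding:
\[
\frac{1}{N} \mathbf{v}_i^T \mathbf{X}\mathbf{X}^T \mathbf{v}_i = \lambda_i \mathbf{v}_i^T \mathbf{v}_i = \lambda_i \|\mathbf{v}_i\|^2 = \lambda_i
\]
Here we have used the fact that $\mathbf{v}_i$ has unit length. Substituting it back into $\mathbf{u}_i^T \mathbf{u}_i$, we can obtain:
\[
\mathbf{u}_i^T \mathbf{u}_i = 1
\]
Just as required.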
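We know $p(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{m}, \boldsymbol{\Sigma})$ and $p(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}|\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I})$. According to Eq (2.113)-(2.115), we have:
\[
p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\mathbf{W}\mathbf{m} + \boldsymbol{\mu}, \sigma^2\mathbf{I} + \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^T) = \mathcal{N}(\mathbf{x}|\hat{\boldsymbol{\mu}}, \sigma^2\mathbf{I} + \hat{\mathbf{W}}\hat{\mathbf{W}}^T)
\]
where we have defined:
\[
\hat{\boldsymbol{\mu}} = \mathbf{W}\mathbf{m} + \boldsymbol{\mu}
\]
and
\[
\hat{\mathbf{W}} = \mathbf{W}\boldsymbol{\Sigma}^{1/2}
\]
Therefore, in the general case, the final form of $p(\mathbf{x})$ can still be written as Eq (12.35).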
