
Homework for the Course

“Advanced Learning Models”

Anja PANTOVIC
master MSIAM DS
anja.pantovic@grenoble-inp.org

Predrag PILIPOVIC
master MSIAM DS
predrag.pilipovic@grenoble-inp.org

1 Neural Networks

Let $X = (x_{ij})_{ij}$, $i, j \in \{1, \ldots, 5\}$ denote the input of a convolutional layer with no bias. Let
$W = (w_{ij})_{ij}$, $i, j \in \{1, \ldots, 3\}$ denote the weights of the convolutional filter. Let $Y = (y_{ij})_{ij}$, $i \in
\{1, \ldots, I\}$, $j \in \{1, \ldots, J\}$ denote the output of the convolution operation.

1. What is the output size (i.e. values of I and J) if:

(a) the convolution has no padding and no stride?


The output size for a convolution with no padding and no stride (i.e. stride 1) is 3 × 3.
(b) the convolution has stride 2 and no padding?
The output size for a convolution with stride 2 and no padding is 2 × 2.
(c) the convolution has no stride and padding 2?
The output size for a convolution with no stride and padding 2 is 7 × 7.
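Below is a minimal sketch (not part of the original solution) that evaluates the standard output-size formula $\lfloor (n + 2p - f)/s \rfloor + 1$ for an $n \times n$ input, an $f \times f$ filter, padding $p$ and stride $s$, and reproduces the three answers above; the helper name `conv_output_size` is ours.

```python
# Output-size formula for a square input/filter: floor((n + 2p - f) / s) + 1.
def conv_output_size(n, f, p, s):
    return (n + 2 * p - f) // s + 1

# The three cases of question 1 (5x5 input, 3x3 filter).
for p, s in [(0, 1), (0, 2), (2, 1)]:
    size = conv_output_size(n=5, f=3, p=p, s=s)
    print(f"padding={p}, stride={s}: {size} x {size}")
# padding=0, stride=1: 3 x 3
# padding=0, stride=2: 2 x 2
# padding=2, stride=1: 7 x 7
```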

2. Let us suppose that we are in situation 1.(b) (i.e. stride 2 and no padding). Let us
also assume that the output of the convolution goes through a ReLU activation, whose
output is denoted by Z = (zij )ij , i ∈ {1, ..., I}, j ∈ {1, ..., J}:

(a) Derive the expression of the output pixels zij as a function of the input and the
weights.
We have seen that I = J = 2 in our case. Let us first see what happens when we apply the filter at the top-left corner of the input:
$$\begin{aligned}
y_{11} &= w_{11}x_{11} + w_{12}x_{12} + w_{13}x_{13} \\
&\quad + w_{21}x_{21} + w_{22}x_{22} + w_{23}x_{23} \\
&\quad + w_{31}x_{31} + w_{32}x_{32} + w_{33}x_{33} \\
&= \sum_{i=1}^{3}\sum_{j=1}^{3} w_{ij}x_{ij}.
\end{aligned}$$

If we do the same for the remaining $y_{ij}$, we get
$$\begin{aligned}
y_{12} &= w_{11}x_{13} + w_{12}x_{14} + w_{13}x_{15} + w_{21}x_{23} + w_{22}x_{24} + w_{23}x_{25} + w_{31}x_{33} + w_{32}x_{34} + w_{33}x_{35} = \sum_{i=1}^{3}\sum_{j=1}^{3} w_{ij}x_{i,j+2}, \\
y_{21} &= w_{11}x_{31} + w_{12}x_{32} + w_{13}x_{33} + w_{21}x_{41} + w_{22}x_{42} + w_{23}x_{43} + w_{31}x_{51} + w_{32}x_{52} + w_{33}x_{53} = \sum_{i=1}^{3}\sum_{j=1}^{3} w_{ij}x_{i+2,j}, \\
y_{22} &= w_{11}x_{33} + w_{12}x_{34} + w_{13}x_{35} + w_{21}x_{43} + w_{22}x_{44} + w_{23}x_{45} + w_{31}x_{53} + w_{32}x_{54} + w_{33}x_{55} = \sum_{i=1}^{3}\sum_{j=1}^{3} w_{ij}x_{i+2,j+2}.
\end{aligned}$$
So, we can conclude that the general formula is
$$y_{lk} = \sum_{i=1}^{3}\sum_{j=1}^{3} w_{ij}\,x_{i+2(l-1),\,j+2(k-1)}, \qquad l, k \in \{1, 2\}.$$
Finally, we know that $z_{lk} = \sigma(y_{lk})$, where $\sigma$ is the ReLU activation function, more precisely $\sigma(x) = \max\{0, x\}$.
(b) How many multiplications and additions are needed to compute the output (the
forward pass)?
As we saw in the first part of the question, computing each $y_{lk}$ requires 9 multiplications
and 8 additions. As there are 4 cells in the output of the convolution, we need $4 \cdot (9 + 8) = 68$
operations (36 multiplications and 32 additions) to compute the output of the convolution.
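As a sanity check, here is a short NumPy sketch (our own illustration, not required by the homework) of this forward pass: it implements $y_{lk} = \sum_{i,j} w_{ij} x_{i+2(l-1),\,j+2(k-1)}$ followed by the ReLU, one 3 × 3 patch per output pixel.

```python
import numpy as np

def conv_stride2_relu(X, W):
    """Stride-2, no-padding convolution of a 5x5 input with a 3x3 filter, then ReLU."""
    Y = np.zeros((2, 2))
    for l in range(2):
        for k in range(2):
            # One output pixel: 9 multiplications and 8 additions.
            Y[l, k] = np.sum(W * X[2 * l:2 * l + 3, 2 * k:2 * k + 3])
    return np.maximum(Y, 0.0)  # z_{lk} = max(0, y_{lk})

rng = np.random.default_rng(0)
X, W = rng.standard_normal((5, 5)), rng.standard_normal((3, 3))
print(conv_stride2_relu(X, W).shape)  # (2, 2)
```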
3. Assume now that we are provided with the derivative of the loss w.r.t. the output of
the convolution layer ∂L/∂zij , ∀i ∈ {1, ..., I}, j ∈ {1, ..., J}:
(a) Derive the expression of $\partial L/\partial x_{ij}$, $\forall i, j \in \{1, \ldots, 5\}$.
We will use the chain rule, so we have
$$\frac{\partial L}{\partial x_{ij}} = \sum_{l=1}^{2}\sum_{k=1}^{2} \frac{\partial L}{\partial z_{lk}}\cdot\frac{\partial z_{lk}}{\partial y_{lk}}\cdot\frac{\partial y_{lk}}{\partial x_{ij}}.$$
We assumed that $\partial L/\partial z_{lk}$ is known, so we need to compute the two remaining partial derivatives.
We can easily see that
$$\frac{\partial z_{lk}}{\partial y_{lk}} = \begin{cases} 1, & y_{lk} > 0, \\ 0, & y_{lk} < 0. \end{cases}$$
For the last partial derivative, we change the indexing of the two sums in the formula for $y_{lk}$:
$$y_{lk} = \sum_{i=2l-1}^{2l+1}\ \sum_{j=2k-1}^{2k+1} w_{i-2(l-1),\,j-2(k-1)}\,x_{ij}.$$
Hence $\partial y_{lk}/\partial x_{ij} = w_{i-2(l-1),\,j-2(k-1)}$ whenever $2l-1 \le i \le 2l+1$ and $2k-1 \le j \le 2k+1$ (and 0 otherwise), which gives us
$$\frac{\partial L}{\partial x_{ij}} = \sum_{l=1}^{2}\sum_{k=1}^{2} \frac{\partial L}{\partial z_{lk}}\cdot\frac{\partial z_{lk}}{\partial y_{lk}}\cdot w_{i-2(l-1),\,j-2(k-1)},$$
where the terms with out-of-range weight indices are understood to be zero.
(b) Derive the expression of $\partial L/\partial w_{ij}$, $\forall i, j \in \{1, \ldots, 3\}$.
Similarly, we have
$$\frac{\partial L}{\partial w_{ij}} = \sum_{l=1}^{2}\sum_{k=1}^{2} \frac{\partial L}{\partial z_{lk}}\cdot\frac{\partial z_{lk}}{\partial y_{lk}}\cdot\frac{\partial y_{lk}}{\partial w_{ij}} = \sum_{l=1}^{2}\sum_{k=1}^{2} \frac{\partial L}{\partial z_{lk}}\cdot\frac{\partial z_{lk}}{\partial y_{lk}}\cdot x_{i+2(l-1),\,j+2(k-1)}.$$
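The two formulas above translate directly into code; the following sketch (our own, with assumed variable names such as `dL_dZ` for $\partial L/\partial z$) accumulates the gradients w.r.t. the input and the weights for the stride-2, no-padding, ReLU case.

```python
import numpy as np

def conv_backward(X, W, dL_dZ):
    """Backward pass of the 5x5 -> 2x2 stride-2 convolution followed by ReLU."""
    # Recompute the pre-activations y_{lk}.
    Y = np.zeros((2, 2))
    for l in range(2):
        for k in range(2):
            Y[l, k] = np.sum(W * X[2 * l:2 * l + 3, 2 * k:2 * k + 3])
    dL_dY = dL_dZ * (Y > 0)          # dz_{lk}/dy_{lk} is 1 where y_{lk} > 0, else 0
    dL_dX = np.zeros_like(X)
    dL_dW = np.zeros_like(W)
    for l in range(2):
        for k in range(2):
            # dL/dx: each output pixel spreads its gradient over its 3x3 input patch.
            dL_dX[2 * l:2 * l + 3, 2 * k:2 * k + 3] += dL_dY[l, k] * W
            # dL/dw: each output pixel contributes its input patch, scaled by dL/dy.
            dL_dW += dL_dY[l, k] * X[2 * l:2 * l + 3, 2 * k:2 * k + 3]
    return dL_dX, dL_dW
```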

Let us now consider a fully connected layer, with two input and two output neurons, without
bias and with a sigmoid activation. Let xi , i = 1, 2 denote the inputs, and zj , j = 1, 2 the
output. Let wij denote the weight connecting input i to output j. Let us also assume that
the gradient of the loss at the output ∂L/∂zj , j = 1, 2 is provided.

4. Derive the expressions for the following derivatives:


(a) $\dfrac{\partial L}{\partial x_i}$:
Firstly, let us introduce the sigmoid activation function
$$\sigma(x) = \frac{1}{1 + \exp(-x)},$$
whose derivative is
$$\sigma'(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2}.$$
Also, we know that
$$z_j = \sigma(w_{1j}x_1 + w_{2j}x_2) = \frac{1}{1 + \exp(-(w_{1j}x_1 + w_{2j}x_2))}, \qquad j = 1, 2.$$
Again, using the chain rule, we get
$$\frac{\partial L}{\partial x_i} = \sum_{j=1}^{2}\frac{\partial L}{\partial z_j}\cdot\frac{\partial z_j}{\partial x_i} = \sum_{j=1}^{2}\frac{\partial L}{\partial z_j}\cdot\frac{w_{ij}\exp(-(w_{1j}x_1 + w_{2j}x_2))}{(1 + \exp(-(w_{1j}x_1 + w_{2j}x_2)))^2}.$$

(b) $\dfrac{\partial L}{\partial w_{ij}}$:
Similarly, we have
$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial z_j}\cdot\frac{\partial z_j}{\partial w_{ij}} = \frac{\partial L}{\partial z_j}\cdot\frac{x_i\exp(-(w_{1j}x_1 + w_{2j}x_2))}{(1 + \exp(-(w_{1j}x_1 + w_{2j}x_2)))^2},$$
but this time without the sum, because $z_j$ is the only output that depends on $w_{ij}$.
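As a quick check of (a) and (b), the sketch below compares the two formulas against central finite differences for a toy loss $L = \sum_j z_j^2$ (chosen only so that $\partial L/\partial z_j = 2z_j$ is available in closed form; all names here are our own).

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
sigma_prime = lambda x: np.exp(-x) / (1.0 + np.exp(-x)) ** 2

def forward(x, W):                       # W[i, j] = w_{ij}
    return sigma(W.T @ x)                # z_j = sigma(w_{1j} x_1 + w_{2j} x_2)

L = lambda x, W: np.sum(forward(x, W) ** 2)

rng = np.random.default_rng(1)
x, W = rng.standard_normal(2), rng.standard_normal((2, 2))
s, z = W.T @ x, forward(x, W)
dL_dz = 2 * z                            # from L = ||z||^2

# Analytic gradients from the formulas above.
dL_dx = np.array([sum(dL_dz[j] * W[i, j] * sigma_prime(s[j]) for j in range(2))
                  for i in range(2)])
dL_dW = np.array([[dL_dz[j] * x[i] * sigma_prime(s[j]) for j in range(2)]
                  for i in range(2)])

# Central finite differences.
eps = 1e-6
e0 = np.array([eps, 0.0])
num_dx0 = (L(x + e0, W) - L(x - e0, W)) / (2 * eps)
E00 = np.zeros((2, 2)); E00[0, 0] = eps
num_dW00 = (L(x, W + E00) - L(x, W - E00)) / (2 * eps)
print(np.isclose(dL_dx[0], num_dx0), np.isclose(dL_dW[0, 0], num_dW00))  # True True
```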
(c) $\dfrac{\partial^2 L}{\partial w_{ij}^2}$:
Keeping in mind that $\partial L/\partial z_j$ is itself a function of $w_{ij}$, we have
$$\frac{\partial^2 L}{\partial w_{ij}^2} = \frac{\partial}{\partial w_{ij}}\left(\frac{\partial L}{\partial z_j}\cdot\frac{\partial z_j}{\partial w_{ij}}\right) = \frac{\partial^2 L}{\partial w_{ij}\,\partial z_j}\cdot\frac{\partial z_j}{\partial w_{ij}} + \frac{\partial L}{\partial z_j}\cdot\frac{\partial^2 z_j}{\partial w_{ij}^2}.$$
The only thing left to compute is $\partial^2 z_j/\partial w_{ij}^2$, because we assumed that $\partial L/\partial z_j$ is known, and hence $\partial^2 L/\partial z_j\,\partial w_{ij}$ is known as well. We will need the second derivative of $\sigma$:
$$\sigma''(x) = \frac{2\exp(-2x)}{(1 + \exp(-x))^3} - \frac{\exp(-x)}{(1 + \exp(-x))^2}.$$
Finally, we have
$$\frac{\partial^2 z_j}{\partial w_{ij}^2} = \frac{\partial}{\partial w_{ij}}\left(\frac{x_i\exp(-(w_{1j}x_1 + w_{2j}x_2))}{(1 + \exp(-(w_{1j}x_1 + w_{2j}x_2)))^2}\right) = x_i^2\,\sigma''(w_{1j}x_1 + w_{2j}x_2).$$
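A short numerical check (our own sketch) that the formula for $\sigma''$ above matches a second-order central difference:

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
sigma_pp = lambda x: (2 * np.exp(-2 * x) / (1 + np.exp(-x)) ** 3
                      - np.exp(-x) / (1 + np.exp(-x)) ** 2)

x0, eps = 0.7, 1e-4
numeric = (sigma(x0 + eps) - 2 * sigma(x0) + sigma(x0 - eps)) / eps ** 2
print(np.isclose(numeric, sigma_pp(x0), atol=1e-5))  # True
```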

(d) $\dfrac{\partial^2 L}{\partial w_{ij}\,\partial w_{i'j'}}$, $i \neq i'$, $j \neq j'$: Again, from the chain rule we have
$$\frac{\partial^2 L}{\partial w_{ij}\,\partial w_{i'j'}} = \frac{\partial}{\partial w_{ij}}\left(\frac{\partial L}{\partial z_{j'}}\cdot\frac{\partial z_{j'}}{\partial w_{i'j'}}\right) = \frac{\partial^2 L}{\partial w_{ij}\,\partial z_{j'}}\cdot\frac{\partial z_{j'}}{\partial w_{i'j'}} + \frac{\partial L}{\partial z_{j'}}\cdot\frac{\partial^2 z_{j'}}{\partial w_{ij}\,\partial w_{i'j'}}.$$
But now we can see from the previous parts that the last term is always zero for $j \neq j'$, because $z_{j'}$ does not depend on $w_{ij}$, so only the first term remains.

(e) The elements in (c) and (d) are the entries of the Hessian matrix of L w.r.t the
weight vector. Imagine now that storing the weights of a network requires 40
MB of disk space: how much would it require to store the gradient? And the
Hessian?
If storing the weights of a network requires 40 MB and we have 4 weights, then storing a single number requires 10 MB. The gradient also has 4 entries, so we can conclude it requires 40 MB as well. Since the Hessian is a symmetric matrix, we only need to store its upper triangle, which means $n(n+1)/2$ entries for an $n \times n$ matrix. In our case $n = 4$, so we need 10 entries,
or 100 MB.
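The arithmetic above in a tiny sketch (assuming, as in the text, 4 weights stored in 40 MB):

```python
n = 4                                           # number of weights
mb_per_number = 40 / n                          # 10 MB per stored number
gradient_mb = n * mb_per_number                 # 4 entries  -> 40 MB
hessian_mb = n * (n + 1) // 2 * mb_per_number   # 10 entries -> 100 MB (upper triangle)
print(gradient_mb, hessian_mb)                  # 40.0 100.0
```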

2 Conditionally Positive Definite Kernels

Let $\mathcal{X}$ be a set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called conditionally positive definite (c.p.d.) if and only if it is symmetric and satisfies
$$\sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) \geq 0$$
for any $n \in \mathbb{N}$, $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ and $(a_1, a_2, \ldots, a_n) \in \mathbb{R}^n$ with $\sum_{i=1}^{n} a_i = 0$.

1. Show that a positive definite (p.d.) function is c.p.d..


Let $k$ be a positive definite function. This means that it is symmetric and that for any $n \in \mathbb{N}$, $(x_1, \ldots, x_n) \in \mathcal{X}^n$ and $(a_1, \ldots, a_n) \in \mathbb{R}^n$,
$$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j k(x_i, x_j) \geq 0.$$
Since this holds for any $(a_1, \ldots, a_n) \in \mathbb{R}^n$ and $n \in \mathbb{N}$, it holds in particular for $(a_1, \ldots, a_n) \in \mathbb{R}^n$ with $\sum_{i=1}^{n} a_i = 0$. Hence, any positive definite function is conditionally positive definite.
2. Is a constant function p.d.? Is it c.p.d.?
Let $k$ be a constant function:
$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, y) \mapsto c.$$
Since $k(x, y) = k(y, x) = c$ for all $(x, y) \in \mathcal{X} \times \mathcal{X}$, the symmetry holds.
Let $n \in \mathbb{N}$, $(x_1, \ldots, x_n) \in \mathcal{X}^n$ and $(a_1, \ldots, a_n) \in \mathbb{R}^n$. Then
$$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j k(x_i, x_j) = c\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j = c\left(\sum_{i=1}^{n} a_i\right)^2,$$
which is nonnegative for every choice of $(a_1, \ldots, a_n)$ if and only if $c \geq 0$. So $k$ is positive definite if and only if $c \geq 0$. From the first question we then already know that $k$ is c.p.d. when $c \geq 0$. Let us see whether the conditional positive definiteness of $k$ holds for any $c$.
Let $n \in \mathbb{N}$, $(x_1, \ldots, x_n) \in \mathcal{X}^n$ and $(a_1, \ldots, a_n) \in \mathbb{R}^n$ such that $\sum_{i=1}^{n} a_i = 0$. We have
$$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j k(x_i, x_j) = c\left(\sum_{i=1}^{n} a_i\right)^2 = 0.$$
Hence, $k$ is a conditionally positive definite function for any $c \in \mathbb{R}$.


3. If X is a Hilbert space, then is k(x, y) = −||x − y||2 p.d.? Is it c.p.d.?
Firstly, we can see that
$$k(x, y) = -\|x - y\|^2 = -\|x\|^2 - \|y\|^2 + 2\langle x, y\rangle = -\|y - x\|^2 = k(y, x),$$
hence $k$ is symmetric. Let $n \in \mathbb{N}$, $(x_1, \ldots, x_n) \in \mathcal{X}^n$ and $(a_1, \ldots, a_n) \in \mathbb{R}^n$ such that $\sum_{i=1}^{n} a_i = 0$. Then
$$\begin{aligned}
\sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) &= -\sum_{i,j=1}^{n} a_i a_j \|x_i\|^2 - \sum_{i,j=1}^{n} a_i a_j \|x_j\|^2 + 2\sum_{i,j=1}^{n} a_i a_j \langle x_i, x_j\rangle \\
&= -\underbrace{\sum_{j=1}^{n} a_j}_{0}\sum_{i=1}^{n} a_i \|x_i\|^2 - \underbrace{\sum_{i=1}^{n} a_i}_{0}\sum_{j=1}^{n} a_j \|x_j\|^2 + 2\sum_{i,j=1}^{n} a_i a_j \langle x_i, x_j\rangle \\
&= 2\sum_{i,j=1}^{n} a_i a_j \langle x_i, x_j\rangle = 2\left\|\sum_{i=1}^{n} a_i x_i\right\|^2 \geq 0.
\end{aligned}$$
Thus, $k$ is a conditionally positive definite function. Intuitively, $k$ is not a positive definite function; to prove this we need a counterexample. Take $n = 2$ with $x_1 \neq x_2$ and $a_1 = a_2 = 1$:
$$\sum_{i,j=1}^{2} a_i a_j k(x_i, x_j) = a_1^2\underbrace{k(x_1, x_1)}_{0} + 2a_1 a_2 k(x_1, x_2) + a_2^2\underbrace{k(x_2, x_2)}_{0} = -2\|x_1 - x_2\|^2 < 0,$$
so $k$ is not positive definite.
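The following sketch (our own numerical illustration) mirrors this result for $k(x, y) = -\|x - y\|^2$: the Gram matrix on a random sample has a negative eigenvalue, so $k$ is not p.d., while the quadratic form is nonnegative whenever the coefficients sum to zero, consistent with $k$ being c.p.d.

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.standard_normal((6, 3))                                 # 6 points in R^3
K = -np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)    # K_ij = -||x_i - x_j||^2

print(np.linalg.eigvalsh(K).min() < 0)                  # True: not positive definite

a = rng.standard_normal(6)
a -= a.mean()                                           # enforce sum(a_i) = 0
print(a @ K @ a >= -1e-10)                              # True: c.p.d. quadratic form
```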


4. Let X be a nonempty set, and x0 ∈ X a point. For any function k : X × X → R, let
k̃ : X × X → R be the function defined by

k̃(x, y) = k(x, y) − k(x0 , x) − k(x0 , y) + k(x0 , x0 ).

Show that k is c.p.d. if and only if k̃ is p.d.

(⇐) Suppose k̃ is positive definite, i.e. for all $n \in \mathbb{N}$, $(a_1, \ldots, a_n) \in \mathbb{R}^n$ and $(x_1, \ldots, x_n) \in \mathcal{X}^n$ we have
$$\sum_{i,j=1}^{n} a_i a_j \tilde{k}(x_i, x_j) \geq 0.$$
Let us fix $n \in \mathbb{N}$ and choose $(a_1, \ldots, a_n) \in \mathbb{R}^n$ such that $\sum_{i=1}^{n} a_i = 0$. For $(x_1, \ldots, x_n) \in \mathcal{X}^n$ we have
$$\begin{aligned}
\sum_{i,j=1}^{n} a_i a_j \tilde{k}(x_i, x_j) &= \sum_{i,j=1}^{n} a_i a_j \bigl(k(x_i, x_j) - k(x_0, x_i) - k(x_0, x_j) + k(x_0, x_0)\bigr) \\
&= \sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) - \underbrace{\sum_{j=1}^{n} a_j}_{0}\sum_{i=1}^{n} a_i k(x_0, x_i) - \underbrace{\sum_{i=1}^{n} a_i}_{0}\sum_{j=1}^{n} a_j k(x_0, x_j) + k(x_0, x_0)\underbrace{\left(\sum_{i=1}^{n} a_i\right)^2}_{0} \\
&= \sum_{i,j=1}^{n} a_i a_j k(x_i, x_j).
\end{aligned}$$
Hence $\sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) \geq 0$, i.e. positive definiteness of k̃ implies conditional positive definiteness of $k$.


(⇒) Let us now suppose that $k$ is conditionally positive definite, i.e. for any $n \in \mathbb{N}$, $(x_0, x_1, \ldots, x_n) \in \mathcal{X}^{n+1}$ and any $(a_0, a_1, \ldots, a_n) \in \mathbb{R}^{n+1}$ such that $\sum_{i=0}^{n} a_i = 0$, we have
$$\sum_{i,j=0}^{n} a_i a_j k(x_i, x_j) \geq 0.$$
We want to show that for all $n \in \mathbb{N}$, $(a_1, \ldots, a_n) \in \mathbb{R}^n$ and $(x_1, \ldots, x_n) \in \mathcal{X}^n$,
$$\sum_{i,j=1}^{n} a_i a_j \tilde{k}(x_i, x_j) \geq 0.$$
To use the assumption, we introduce $a_0 := -\sum_{i=1}^{n} a_i$, so that $\sum_{i=0}^{n} a_i = 0$. Now we have
$$\begin{aligned}
\sum_{i,j=1}^{n} a_i a_j \tilde{k}(x_i, x_j) &= \sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) - \sum_{i,j=1}^{n} a_i a_j k(x_0, x_j) - \sum_{i,j=1}^{n} a_i a_j k(x_0, x_i) + \sum_{i,j=1}^{n} a_i a_j k(x_0, x_0) \\
&= \sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) - \underbrace{\sum_{i=1}^{n} a_i}_{-a_0}\sum_{j=1}^{n} a_j k(x_0, x_j) - \underbrace{\sum_{j=1}^{n} a_j}_{-a_0}\sum_{i=1}^{n} a_i k(x_i, x_0) + \underbrace{\left(\sum_{i=1}^{n} a_i\right)^2}_{a_0^2} k(x_0, x_0) \\
&= \sum_{i,j=0}^{n} a_i a_j k(x_i, x_j) \geq 0.
\end{aligned}$$
Finally, we can conclude that $k$ is c.p.d. if and only if k̃ is p.d.
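As a numerical illustration of this equivalence (a sketch with our own choices of sample and base point), starting from the c.p.d. kernel $k(x, y) = -\|x - y\|^2$, the centred kernel k̃ indeed yields a positive semidefinite Gram matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
pts = rng.standard_normal((6, 3))
x0 = pts[0]                                   # fixed base point x_0
k = lambda x, y: -np.sum((x - y) ** 2)        # c.p.d. kernel from question 3

K_tilde = np.array([[k(x, y) - k(x0, x) - k(x0, y) + k(x0, x0) for y in pts]
                    for x in pts])
print(np.linalg.eigvalsh(K_tilde).min() >= -1e-10)   # True: k~ is p.s.d. up to round-off
```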


5. Let k be a c.p.d. kernel on X such that k(x, x) = 0 for any x ∈ X . Show that there
exists a Hilbert space H and a mapping Φ : X → H such that, for any x, y ∈ X ,
k(x, y) = −||Φ(x) − Φ(y)||2 .

Let $k$ be a c.p.d. kernel on $\mathcal{X}$ such that $k(x, x) = 0$ for any $x \in \mathcal{X}$. From the previous question, we know how to construct a positive definite kernel from $k$. Let
$$\tilde{k}(x, y) := \frac{1}{2}\bigl(k(x, y) - k(x_0, x) - k(x_0, y) + k(x_0, x_0)\bigr),$$
where $x_0 \in \mathcal{X}$ is fixed. Then k̃ is p.d., and hence we can use the Aronszajn theorem, which says that there exists a Hilbert space $\mathcal{H}$ and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that, for any $x, y \in \mathcal{X}$,
$$\tilde{k}(x, y) = \langle\Phi(x), \Phi(y)\rangle_{\mathcal{H}}.$$
The only thing left to prove is that
$$k(x, y) = -\|\Phi(x) - \Phi(y)\|^2.$$
For this part, we use the assumption $k(x, x) = 0$ for any $x \in \mathcal{X}$ (in particular $k(x_0, x_0) = 0$). We have
$$\begin{aligned}
\|\Phi(x) - \Phi(y)\|^2 &= \langle\Phi(x), \Phi(x)\rangle_{\mathcal{H}} - 2\langle\Phi(x), \Phi(y)\rangle_{\mathcal{H}} + \langle\Phi(y), \Phi(y)\rangle_{\mathcal{H}} \\
&= \tilde{k}(x, x) - 2\tilde{k}(x, y) + \tilde{k}(y, y) \\
&= \tfrac{1}{2}\bigl(\underbrace{k(x, x)}_{0} - 2k(x_0, x) + \underbrace{k(x_0, x_0)}_{0}\bigr) - \bigl(k(x, y) - k(x_0, x) - k(x_0, y) + \underbrace{k(x_0, x_0)}_{0}\bigr) + \tfrac{1}{2}\bigl(\underbrace{k(y, y)}_{0} - 2k(x_0, y) + \underbrace{k(x_0, x_0)}_{0}\bigr) \\
&= -k(x_0, x) - k(x, y) + k(x_0, x) + k(x_0, y) - k(x_0, y) = -k(x, y).
\end{aligned}$$
Hence $k(x, y) = -\|\Phi(x) - \Phi(y)\|^2$.
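On a finite sample one can make the feature map explicit by factorising the Gram matrix of k̃; the sketch below (our own, again using $k(x, y) = -\|x - y\|^2$, which satisfies $k(x, x) = 0$) recovers $k(x, y) = -\|\Phi(x) - \Phi(y)\|^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
pts = rng.standard_normal((5, 2))
x0 = pts[0]
k = lambda x, y: -np.sum((x - y) ** 2)

# Gram matrix of k~ = (1/2)(k(x,y) - k(x0,x) - k(x0,y) + k(x0,x0)).
Kt = 0.5 * np.array([[k(x, y) - k(x0, x) - k(x0, y) + k(x0, x0) for y in pts]
                     for x in pts])
vals, vecs = np.linalg.eigh(Kt)
Phi = vecs * np.sqrt(np.clip(vals, 0, None))   # rows are finite-dimensional Phi(x_i)

i, j = 1, 3
print(np.isclose(k(pts[i], pts[j]), -np.sum((Phi[i] - Phi[j]) ** 2)))  # True
```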
6. Show that if k is c.p.d., then the function exp(tk(x, y)) is p.d. for all t ≥ 0.
Firstly, we show that the product of two p.d. kernels is also a p.d. kernel. Let $k_1$ and $k_2$ be two p.d. kernels, and $[k_1]$, $[k_2]$ the corresponding positive semidefinite similarity (Gram) matrices on points $x_1, \ldots, x_n$. Since $[k_2]$ is symmetric positive semidefinite, it has a symmetric positive semidefinite square root $S$, i.e. $[k_2] = S^2$, more precisely
$$[k_2]_{ij} = \sum_{l=1}^{n} S_{il}S_{lj} = \sum_{l=1}^{n} S_{il}S_{jl}$$
for all $i, j = 1, 2, \ldots, n$. Therefore, for any $(a_1, a_2, \ldots, a_n) \in \mathbb{R}^n$, we have
$$\sum_{i,j=1}^{n} a_i a_j [k_1]_{ij}[k_2]_{ij} = \sum_{i,j=1}^{n} a_i a_j [k_1]_{ij}\left(\sum_{l=1}^{n} S_{il}S_{jl}\right) = \sum_{l=1}^{n}\underbrace{\sum_{i,j=1}^{n}\underbrace{a_i S_{il}}_{\tilde{a}_i}\underbrace{a_j S_{jl}}_{\tilde{a}_j}[k_1]_{ij}}_{\geq 0} \geq 0.$$
We used the fact that each inner sum is nonnegative, since it is the quadratic form of the p.d. kernel $k_1$ with the weights $(\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_n) \in \mathbb{R}^n$. So the similarity matrix $[k]$ defined by $[k]_{ij} = [k_1]_{ij}[k_2]_{ij}$ is positive semidefinite. The symmetry of the kernel $k$ is an immediate consequence of the symmetry of $k_1$ and $k_2$. This means we have proved that the product kernel $k$ is indeed a p.d. kernel.

We also need to prove that if a sequence of p.d. kernels $\{k_n\}_{n\in\mathbb{N}}$ converges pointwise to $k$, i.e. for all $x, y \in \mathcal{X}$,
$$\lim_{n\to\infty} k_n(x, y) = k(x, y),$$
then $k$ is a p.d. kernel. First of all, by uniqueness of the limit, the pointwise limit $k$ is indeed a well-defined function. It is symmetric as an immediate consequence of the symmetry of all the kernels $k_n$. Let $m \in \mathbb{N}$, $(x_1, x_2, \ldots, x_m) \in \mathcal{X}^m$ and $(a_1, a_2, \ldots, a_m) \in \mathbb{R}^m$; then we have
$$\sum_{i,j=1}^{m} a_i a_j k(x_i, x_j) = \sum_{i,j=1}^{m} a_i a_j \lim_{n\to\infty} k_n(x_i, x_j) = \lim_{n\to\infty}\underbrace{\sum_{i,j=1}^{m} a_i a_j k_n(x_i, x_j)}_{\geq 0} \geq 0.$$
This proves that $k$ is also a p.d. kernel.

Now we can go back to our assignment. If $k$ is c.p.d., we know that we can associate with $k$ the p.d. kernel k̃ such that
$$\tilde{k}(x, y) = k(x, y) - k(x_0, x) - k(x_0, y) + k(x_0, x_0)$$
for any $x, y \in \mathcal{X}$ and some fixed point $x_0 \in \mathcal{X}$. From the previous line it follows that
$$k(x, y) = \tilde{k}(x, y) + k(x_0, x) + k(x_0, y) - k(x_0, x_0),$$
or
$$\exp(tk(x, y)) = \underbrace{\exp(t\tilde{k}(x, y))}_{k_1}\,\underbrace{\exp(tk(x_0, x))\exp(tk(x_0, y))\exp(-tk(x_0, x_0))}_{k_2}.$$
We know that k̃ is p.d., and using the Taylor expansion we can write
$$\exp(t\tilde{k}(x, y)) = \sum_{m=0}^{\infty}\frac{(t\tilde{k}(x, y))^m}{m!} = \lim_{m\to\infty}\sum_{i=0}^{m}\frac{(t\tilde{k}(x, y))^i}{i!}.$$
On the right-hand side we have a pointwise limit of sums of products of p.d. kernels with nonnegative coefficients, which means that $\exp(t\tilde{k}(x, y))$ is also p.d. (a sum of p.d. kernels, and a nonnegative multiple of a p.d. kernel, is again p.d.). On the other hand, for all $(a_1, a_2, \ldots, a_n) \in \mathbb{R}^n$ we have
$$\sum_{i,j=1}^{n} a_i a_j \exp(tk(x_0, x_i))\exp(tk(x_0, x_j))\exp(-tk(x_0, x_0)) = \exp(-tk(x_0, x_0))\left(\sum_{i=1}^{n} a_i\exp(tk(x_0, x_i))\right)^2 \geq 0.$$
Finally, we proved that $k_1$ and $k_2$ are p.d., hence $\exp(tk)$ is p.d. as the product of $k_1$ and $k_2$.
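In the special case $k(x, y) = -\|x - y\|^2$, this is exactly the statement that the Gaussian (RBF) kernel $\exp(-t\|x - y\|^2)$ is p.d.; the sketch below (our own numerical check) confirms it on a random sample for several values of $t \geq 0$.

```python
import numpy as np

rng = np.random.default_rng(5)
pts = rng.standard_normal((8, 4))
sq_dists = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)

for t in [0.0, 0.1, 1.0, 10.0]:
    G = np.exp(-t * sq_dists)                        # Gram matrix of exp(t * k)
    print(t, np.linalg.eigvalsh(G).min() >= -1e-10)  # True for every t
```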
7. Conversely, show that if the function exp(tk(x, y)) is p.d. for any t ≥ 0, then k is c.p.d.
We know that
$$k(x, y) = \lim_{t\to 0^+}\frac{\exp(tk(x, y)) - 1}{t}$$
for all $x, y \in \mathcal{X}$. We assumed that $\exp(tk)$ is p.d., so it is also c.p.d. Now, for any $n \in \mathbb{N}$, $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$, any $(a_1, a_2, \ldots, a_n) \in \mathbb{R}^n$ such that $\sum_{i=1}^{n} a_i = 0$, and any $t > 0$, we have
$$\sum_{i,j=1}^{n} a_i a_j \frac{\exp(tk(x_i, x_j)) - 1}{t} = \frac{1}{t}\underbrace{\sum_{i,j=1}^{n} a_i a_j \exp(tk(x_i, x_j))}_{\geq 0} - \frac{1}{t}\underbrace{\sum_{i=1}^{n} a_i\sum_{j=1}^{n} a_j}_{0} \geq 0,$$
which means that $(\exp(tk) - 1)/t$ is c.p.d. for every $t > 0$. And finally,
$$\sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) = \sum_{i,j=1}^{n} a_i a_j \lim_{t\to 0^+}\frac{\exp(tk(x_i, x_j)) - 1}{t} = \lim_{t\to 0^+}\sum_{i,j=1}^{n} a_i a_j \frac{\exp(tk(x_i, x_j)) - 1}{t} \geq 0,$$
so $k$ is c.p.d.

8. Show that the shortest-path distance on a tree is c.p.d over the set of vertices (a tree is
an undirected graph without loops. The shortest-path distance between two vertices
is the number of edges of the unique path that connects them). Is the shortest-path
distance over graphs c.p.d. in general?

Let $G = (V, E)$ be a tree ($V$ being the set of vertices and $E$ the set of edges) and $x_0 \in V$ its root.
Let us represent each vertex $x \in V$ by $\Phi(x) \in \mathbb{R}^{|E|}$, defined by
$$\Phi_i(x) = \begin{cases} 1, & \text{if the } i\text{-th edge lies on the path between } x \text{ and } x_0, \\ 0, & \text{otherwise.} \end{cases}$$
We know that for each vertex $x \in V$ the path to $x_0$ is unique. Then the graph distance $d_G(x, y)$ between any two vertices $x$ and $y$ (the length of the shortest path between $x$ and $y$) is given by
$$d_G(x, y) = \|\Phi(x) - \Phi(y)\|^2,$$
since the edges on which $\Phi(x)$ and $\Phi(y)$ differ are exactly the edges of the unique path between $x$ and $y$. In problem 2.3 we saw that $-d_G$ is then c.p.d., and using question 6 we can conclude
that $\exp(-t\,d_G(x, y))$ is p.d. for all $t \geq 0$.
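For a concrete tree one can check this numerically; the sketch below (our own) uses the path graph 1–2–3–4–5, whose shortest-path distance is $d(i, j) = |i - j|$, and verifies that $\exp(-t\,d)$ is positive semidefinite for several $t \geq 0$.

```python
import numpy as np

idx = np.arange(5)
D = np.abs(idx[:, None] - idx[None, :])              # distance matrix of the path graph

for t in [0.1, 1.0, 5.0]:
    G = np.exp(-t * D)
    print(t, np.linalg.eigvalsh(G).min() >= -1e-10)  # True: p.s.d. for each t
```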

On the other hand, general graphs do not have the property that $-d_G$ is c.p.d. We can
see this with the following counterexample. Let us look at the graph below.

[Figure: a graph on five vertices labelled 1 to 5, in which vertices 1 and 5 are each adjacent to vertices 2, 3 and 4 (the complete bipartite graph $K_{2,3}$).]

We can write down its shortest-path distance matrix (a $5 \times 5$ matrix):
$$[d_G] = \begin{pmatrix}
0 & 1 & 1 & 1 & 2 \\
1 & 0 & 2 & 2 & 1 \\
1 & 2 & 0 & 2 & 1 \\
1 & 2 & 2 & 0 & 1 \\
2 & 1 & 1 & 1 & 0
\end{pmatrix}.$$

In order for the shortest-path distance to correspond to a c.p.d. function (i.e. for $-d_G$ to be c.p.d.), the kernel $\exp(-t\,d_G(x, y))$ must be p.d. for all $t \geq 0$. We can write down the matrix $[\exp(-t\,d_G(x, y))]$ and show that it is not positive semidefinite. We have
$$[\exp(-t\,d_G)] = \begin{pmatrix}
1 & e^{-t} & e^{-t} & e^{-t} & e^{-2t} \\
e^{-t} & 1 & e^{-2t} & e^{-2t} & e^{-t} \\
e^{-t} & e^{-2t} & 1 & e^{-2t} & e^{-t} \\
e^{-t} & e^{-2t} & e^{-2t} & 1 & e^{-t} \\
e^{-2t} & e^{-t} & e^{-t} & e^{-t} & 1
\end{pmatrix}.$$
Since a positive semidefinite matrix has a nonnegative determinant (the determinant is the product of its eigenvalues), it suffices to find a value of $t$ for which the determinant is negative. A direct computation gives
$$\det\bigl[\exp(-t\,d_G)\bigr] = e^{-10t}\,(e^{2t} - 2)\,(e^{2t} - 1)^4.$$
Let $t = 0.2$. Then the determinant reads
$$e^{-10t}(e^{2t} - 2)(e^{2t} - 1)^4 = \underbrace{e^{-2}}_{>0}\,\underbrace{(e^{0.4} - 2)}_{<0}\,\underbrace{(e^{0.4} - 1)^4}_{>0} < 0.$$

Hence, we can conclude that the shortest-path distance over graphs is not c.p.d. in general.
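The counterexample can also be confirmed numerically (our own sketch): for the distance matrix above and $t = 0.2$, the matrix $\exp(-t\,d_G)$ has a negative eigenvalue, and its determinant matches $e^{-10t}(e^{2t} - 2)(e^{2t} - 1)^4 < 0$.

```python
import numpy as np

D = np.array([[0, 1, 1, 1, 2],
              [1, 0, 2, 2, 1],
              [1, 2, 0, 2, 1],
              [1, 2, 2, 0, 1],
              [2, 1, 1, 1, 0]], dtype=float)

t = 0.2
G = np.exp(-t * D)
print(np.linalg.eigvalsh(G).min())   # negative -> G is not positive semidefinite
print(np.linalg.det(G),              # both expressions agree and are negative
      np.exp(-10 * t) * (np.exp(2 * t) - 2) * (np.exp(2 * t) - 1) ** 4)
```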
