Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 11
Module 11.1 : The convolution operation
Suppose we are tracking the position of an aeroplane using a laser sensor at discrete time intervals, giving measurements x0, x1, x2, . . .

Now suppose our sensor is noisy. To obtain a less noisy estimate we would like to average several measurements. More recent measurements are more important, so we would like to take a weighted average:

$$s_t = \sum_{a=0}^{\infty} x_{t-a}\, w_{-a} = (x * w)_t$$

Here x is the input, w is the filter, and s is the convolution.
In practice, we would only sum over a small window:

$$s_t = \sum_{a=0}^{6} x_{t-a}\, w_{-a}$$

The weight array (w) is known as the filter. We just slide the filter over the input and compute the value of s_t based on a window around x_t. Here the input (and the kernel) is one dimensional.

W (w_{-6} . . . w_0): 0.01 0.01 0.02 0.02 0.04 0.4 0.5
X: 1.00 1.10 1.20 1.40 1.70 1.80 1.90 2.10 2.20 2.40 2.50 2.70
S (as the filter slides): 1.80 1.96 2.11 2.16 2.28 2.42

Can we use a convolution operation on a 2D input also?
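This sliding-window average is easy to sketch in code. Here is a minimal Python version using the slide's filter and input; depending on rounding conventions, individual printed values may differ slightly from the slide's S row.

```python
# Weighted moving average as a 1D convolution: s_t = sum_{a=0}^{6} x_{t-a} * w_{-a}.
x = [1.00, 1.10, 1.20, 1.40, 1.70, 1.80, 1.90, 2.10, 2.20, 2.40, 2.50, 2.70]
w = [0.01, 0.01, 0.02, 0.02, 0.04, 0.4, 0.5]   # w_{-6} ... w_0: recent inputs weigh more

def conv1d(x, w):
    k = len(w)
    # Only positions where the full window fits ("valid" convolution).
    return [sum(x[t - a] * w[k - 1 - a] for a in range(k))
            for t in range(k - 1, len(x))]

s = conv1d(x, w)
print([round(v, 2) for v in s])
```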
We can think of images as 2D inputs. We would now like to use a 2D filter (m × n). First let us see what the 2D formula looks like:

$$S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i-a,\, j-b}\, K_{a,b}$$

This formula looks at all the preceding neighbours (i − a, j − b). In practice, we use the following formula, which looks at the succeeding neighbours:

$$S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i+a,\, j+b}\, K_{a,b}$$
Let us apply this idea to a toy example and see the results.

Input:
a b c d
e f g h
i j k l

Kernel:
w x
y z

Output: computed by sliding the kernel over the input.
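A minimal numeric sketch of this sliding computation, using the "succeeding neighbours" formula on a small hypothetical input (the numbers are illustrative, not from the slide):

```python
# "Valid" 2D convolution with the succeeding-neighbours formula:
# S_ij = sum_a sum_b I[i+a][j+b] * K[a][b]
def conv2d(I, K):
    m, n = len(K), len(K[0])
    H, W = len(I), len(I[0])
    return [[sum(I[i + a][j + b] * K[a][b] for a in range(m) for b in range(n))
             for j in range(W - n + 1)]
            for i in range(H - m + 1)]

I = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
K = [[1, 0],
     [0, 1]]          # each output adds a pixel to its lower-right neighbour

print(conv2d(I, K))   # [[7, 9, 11], [15, 17, 19]]
```

Note how a 3 × 4 input and a 2 × 2 kernel give a 2 × 3 output: the kernel cannot slide past the boundary.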
For the rest of the discussion we will use the following formula for convolution:

$$S_{ij} = (I * K)_{ij} = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \; \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I_{i-a,\, j-b}\, K_{\frac{m}{2}+a,\, \frac{n}{2}+b}$$

In other words, we will assume that the kernel is centered on the pixel of interest. So we will be looking at both preceding and succeeding neighbors.
Let us see some examples of 2D convolutions applied to images
For example, convolving an image with the all-ones 3 × 3 kernel

1 1 1
1 1 1
1 1 1

(suitably scaled) averages each pixel with its neighbours, which blurs the image.
We just slide the kernel over the input image. Each time we slide the kernel we get one value in the output. The resulting output is called a feature map. We can use multiple filters to get multiple feature maps.
Question: In the 1D case, we slide a one dimensional filter over a one dimensional input. In the 2D case, we slide a two dimensional filter over a two dimensional input. What would happen in the 3D case?
(Figure: a 3D input with R, G, B channels.)
So far we have not said anything explicit about the dimensions of the
1. inputs
2. filters
3. outputs
and the relations between them. We will see how they are related, but before that we will define a few quantities.
We first define the following quantities:
- Width (W1), Height (H1) and Depth (D1) of the original input
- The Stride S (we will come back to this later)
- The number of filters K
- The spatial extent (F) of each filter (the depth of each filter is the same as the depth of the input)
The output is W2 × H2 × D2 (we will soon see a formula for computing W2, H2 and D2).
Let us compute the dimensions (W2, H2) of the output. Notice that we can't place the kernel at the corners, as it would cross the input boundary. This is true for all the shaded points (where the kernel crosses the input boundary), and it results in an output of smaller dimensions than the input. As the size of the kernel increases, this becomes true for even more pixels: for example, with a 5 × 5 kernel we get an even smaller output.

In general,
W2 = W1 − F + 1
H2 = H1 − F + 1
We will refine this formula further.
What if we want the output to be of the same size as the input? We can use something known as padding: pad the input with an appropriate number of 0 inputs so that we can now apply the kernel at the corners. Let us use pad P = 1 with a 3 × 3 kernel. This means we will add one row and one column of 0 inputs at the top, bottom, left and right. We now have,
W2 = W1 − F + 2P + 1
H2 = H1 − F + 2P + 1
We will refine this formula further.
What does the stride S do? It defines the intervals at which the filter is applied (here S = 2). Here, we are essentially skipping every 2nd pixel, which will again result in an output of smaller dimensions.
Finally, coming to the depth of the output: each filter gives us one 2D output, and K filters will give us K such 2D outputs. We can think of the resulting output as a W2 × H2 × K volume. Thus D2 = K.

Putting everything together:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K
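These formulas can be checked with a small helper (a hypothetical function name; the integer division assumes the stride divides evenly):

```python
# Output dimensions of a convolution layer: W2 = (W1 - F + 2P)/S + 1, etc., D2 = K.
def conv_output_dims(W1, H1, F, K, S=1, P=0):
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K   # D2 = K

print(conv_output_dims(227, 227, F=11, K=96, S=4, P=0))  # (55, 55, 96)
print(conv_output_dims(32, 32, F=5, K=6, S=1, P=0))      # (28, 28, 6)
```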
Let us do a few exercises.

Example 1: a 227 × 227 × 3 input convolved with 96 filters of size 11 × 11 × 3, with Stride = 4 and Padding = 0:
W2 = (227 − 11)/4 + 1 = 55
H2 = (227 − 11)/4 + 1 = 55
D2 = 96
So the output is 55 × 55 × 96.
Example 2: a 32 × 32 × 1 input convolved with 6 filters of size 5 × 5 × 1, with Stride = 1 and Padding = 0:
W2 = (32 − 5)/1 + 1 = 28
H2 = (32 − 5)/1 + 1 = 28
D2 = 6
So the output is 28 × 28 × 6.
Module 11.3 : Convolutional Neural Networks
Putting things into perspective
What is the connection between this operation (convolution) and neural networks? We will try to understand this by considering the task of "image classification".
The classical pipeline extracts features from the image and feeds them to a classifier that predicts a label such as car, bus, monument or flower. The features could be:
- Raw pixels
- Edge detector outputs
- SIFT/HOG descriptors
Input → Features → Classifier. For example, a handcrafted edge-detector kernel:

0 0 0 0 0
0 1 1 1 0
0 1 -8 1 0
0 1 1 1 0
0 0 0 0 0

Instead of using handcrafted kernels such as edge detectors, can we learn meaningful kernels/filters in addition to learning the weights of the classifier?
Even better: instead of using handcrafted kernels (such as edge detectors), can we learn multiple meaningful kernels/filters in addition to learning the weights of the classifier? (The figure shows such learned kernels as matrices of real-valued weights.)
Input → (learned kernels) → Classifier: the kernel weights, just like the classifier weights, are learned using backpropagation. (The figure shows two learned kernels as matrices of real-valued weights.)
Consider the task of classifying digits (10 classes). This is what a regular feed-forward neural network would look like: every input pixel connected to every neuron in the next layer. With convolution, instead, each output value depends only on a small window of the input (the interactions between neighboring pixels are more interesting). Can we get away with these sparser connections? Yes, indeed.
A complete convolutional neural network:
Input 32 × 32 × 1
→ Convolution Layer 1 (S = 1, F = 5, K = 6, P = 0): 28 × 28 × 6, Param = 150
→ Pooling Layer 1 (S = 2, F = 2, K = 6, P = 0): 14 × 14 × 6, Param = 0
→ Convolution Layer 2 (S = 1, F = 5, K = 16, P = 0): 10 × 10 × 16, Param = 2400
→ Pooling Layer 2 (S = 2, F = 2, K = 16, P = 0): 5 × 5 × 16, Param = 0
→ FC 1 (120): Param = 48120
→ FC 2 (84): Param = 10164
→ Output (10): Param = 850
Max pooling: apply 2 × 2 filters with stride 2, taking the maximum of each window.

Input (4 × 4):
5 8 3 4
7 6 4 5
1 3 1 2
1 4 2 1

Output (2 × 2):
8 5
4 2
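A minimal sketch of this pooling operation (assuming an even-sized input):

```python
# 2x2 max pooling with stride 2: keep the maximum of each non-overlapping window.
def maxpool2x2(x):
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

x = [[5, 8, 3, 4],
     [7, 6, 4, 5],
     [1, 3, 1, 2],
     [1, 4, 2, 1]]
print(maxpool2x2(x))   # [[8, 5], [4, 2]]
```

Note that max pooling has no parameters, which is why the pooling layers above contribute Param = 0.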
We will now see some case studies where convolutional neural networks have been successful.
LeNet-5 for handwritten character recognition

Exercise: compute the number of parameters at each layer of the network above (answers in parentheses):
- Convolution Layer 1 (S = 1, F = 5, K = 6, P = 0): Param = ? (150)
- Pooling Layer 1 (S = 2, F = 2, K = 6, P = 0): Param = ? (0)
- Convolution Layer 2 (S = 1, F = 5, K = 16, P = 0): Param = ? (2400)
- Pooling Layer 2 (S = 2, F = 2, K = 16, P = 0): Param = ? (0)
- FC 1 (120): Param = ? (48120)
- FC 2 (84): Param = ? (10164)
- Output (10): Param = ? (850)
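The counts can be reproduced with two small helpers (hypothetical names; the convolution count follows the slide in omitting biases, while the fully connected counts include one bias per output unit):

```python
# Parameter counts for LeNet-5 style layers.
def conv_params(F, D_in, K):
    return F * F * D_in * K            # weights only, matching the slide's numbers

def fc_params(n_in, n_out):
    return n_in * n_out + n_out        # weights plus one bias per output unit

print(conv_params(5, 1, 6))            # 150    (Convolution Layer 1)
print(conv_params(5, 6, 16))           # 2400   (Convolution Layer 2)
print(fc_params(5 * 5 * 16, 120))      # 48120  (FC 1)
print(fc_params(120, 84))              # 10164  (FC 2)
print(fc_params(84, 10))               # 850    (Output)
```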
How do we train a convolutional neural network?
Consider a toy example: a 3 × 3 input

b c d
e f g
h i j

convolved with a 2 × 2 kernel

w x
y z

giving a 2 × 2 output with entries l, m, n, o. Writing each output entry in terms of the inputs and the kernel weights lets us derive the gradients needed for backpropagation.
ImageNet Success Stories (roadmap for rest of the talk)
- AlexNet
- ZFNet
- VGGNet
(Figure: ILSVRC top-5 error rates: 28.2 and 25.8 for earlier shallow methods, 16.4 for AlexNet, 11.7 for ZFNet, 7.3 for VGGNet (19 layers), 6.7 for GoogLeNet (22 layers), and the 152-layer ResNet.)
AlexNet. Total Parameters: 27.55M.

(Figure: the AlexNet architecture: a 227 × 227 × 3 input, a stack of convolution and max-pooling layers with 96, 256, 384, 384 and 256 filters, followed by two fully connected layers of 4096 units each and a 1000-way output.)
Let us look at the connections in the fully connected layers in more detail. We first stretch the final volume to make it a 1d vector (here 2 × 2 × 256 = 1024). This 1d vector is then densely connected to the other layers (4096 → 4096 → 1000), just as in a regular feedforward neural network.
ZFNet.

(Figure: ZFNet compared with AlexNet: the first convolution uses 7 × 7 filters instead of 11 × 11, and the later convolution layers use 512, 1024 and 512 filters instead of 384, 384 and 256; the 4096-unit fully connected layers and the 1000-way output are unchanged.)

Difference in Total No. of Parameters: 1.45M.
VGGNet.

(Figure: the VGG-16 architecture: a 224 × 224 × 3 input, stacks of 3 × 3 convolutions with 64, 128, 256, 512 and 512 filters, each stack followed by max-pooling, then two fully connected layers of 4096 units and a 1000-way softmax output.)
Module 11.5 : Image Classification continued (GoogLeNet and ResNet)
Consider the output at a certain layer of a convolutional neural network. After this layer we could apply a max-pooling layer, or a 1 × 1 convolution, or a 3 × 3 convolution, or a 5 × 5 convolution.

Question: Why choose between these options (convolution, maxpooling, filter sizes)?

Idea: Why not apply all of them at the same time and then concatenate the feature maps?
Well, this naive idea could result in a large number of computations. If P = 0 and S = 1, then convolving a W × H × D input with an F × F × D filter results in a (W − F + 1) × (H − F + 1) sized output. Each element of the output requires O(F × F × D) computations. Can we reduce the number of computations?
Yes, by using 1 × 1 convolutions. Huh?? What does a 1 × 1 convolution do? It aggregates along the depth: convolving a D × W × H input with D1 1 × 1 filters (D1 < D) results in a D1 × W × H output (S = 1, P = 0). If D1 < D, this effectively reduces the depth of the input and hence the computations: instead of O(F × F × D) we will need only O(F × F × D1) computations per output element. We could then apply the subsequent 3 × 3 or 5 × 5 filters on this reduced output.
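A back-of-the-envelope comparison of multiply counts per output position (the specific D, D1, F, K values here are illustrative assumptions, not from the slide):

```python
# Cost of a 5x5 convolution on a depth-256 input, with and without a 1x1 bottleneck.
D, D1, F, K = 256, 64, 5, 32           # input depth, reduced depth, filter size, #filters

direct = F * F * D * K                 # 5x5 filters applied directly: O(F*F*D) each
bottleneck = D * D1 + F * F * D1 * K   # 1x1 reduction to depth D1, then 5x5 filters

print(direct, bottleneck)              # 204800 67584
```

Even after paying for the extra 1 × 1 layer, the bottleneck version needs roughly a third of the multiplies here.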
But we might want to use different dimensionality reductions before the 3 × 3 and the 5 × 5 filters. So we can use D1 and D2 1 × 1 filters before the 3 × 3 and 5 × 5 filters respectively. We can then add the maxpooling layer followed by its own dimensionality reduction (1 × 1 convolutions), and a new set of plain 1 × 1 convolutions. Finally, we concatenate all these feature maps. This is called the Inception module. We will now see GoogLeNet, which contains many such inception modules.
(Figure: GoogLeNet: a stack of convolution and pooling layers followed by nine inception modules (3a, 3b, 4a–4e, 5a, 5b). For example, the first inception module, applied on a 28 × 28 × 192 input, uses 64 1 × 1 convolutions; 96 1 × 1 convolutions (dimensionality reduction) followed by 128 3 × 3 convolutions; 16 1 × 1 convolutions (dimensionality reduction) followed by 32 5 × 5 convolutions; and 3 × 3 maxpooling followed by 32 1 × 1 convolutions; all feeding a filter concatenation. The next module similarly uses 128 1 × 1 convolutions, 128 → 192 for the 3 × 3 branch and 32 → 96 for the 5 × 5 branch.)
Important Trick: Got rid of the fully connected layer. Notice that the output of the last layer is 7 × 7 × 1024 dimensional. If we were to add a fully connected layer with 1000 nodes (for 1000 classes) on top of this, we would have 7 × 7 × 1024 × 1000 ≈ 49M parameters. Instead, they use an average pooling of size 7 × 7 on each of the 1024 feature maps, which results in a 1024 dimensional output; the final layer's weight matrix is then W ∈ R^{1024×1000}. This significantly reduces the number of parameters.
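The saving is easy to verify:

```python
# Parameters of the final classifier with and without 7x7 average pooling.
H = W = 7
D = 1024
n_classes = 1000

fc_on_volume = H * W * D * n_classes   # flatten 7x7x1024, then fully connect: ~50M
fc_after_avgpool = D * n_classes       # 7x7 average pooling first: ~1M

print(fc_on_volume, fc_after_avgpool)  # 50176000 1024000
```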
ImageNet Success Stories (roadmap continued)
- GoogLeNet
- ResNet
Suppose we have been able to train a shallow neural network well. Now suppose we construct a deeper network which has a few more layers (in orange). Intuitively, if the shallow network works well then the deep network should also work well, by simply learning to compute identity functions in the new layers. Essentially, the solution space of a shallow neural network is a subset of the solution space of a deep neural network.
But in practice it is observed that this doesn't happen: the deeper network has a higher error rate on the test set.
Consider any two stacked layers in a CNN. The two layers are essentially learning some function H(x) of the input x. What if we enable them to learn only a residual function of the input, adding an identity connection so that

H(x) = F(x) + x

where F(x) is the function computed by the stacked (relu) layers?
Why would this help? Remember our argument that a deeper version of a shallow network would do just fine by learning identity transformations in the new layers. The identity connection from the input allows a ResNet to retain a copy of the input: to compute the identity, the stacked layers only need to learn F(x) = 0. Using this idea they were able to train really deep networks.
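A minimal sketch of a residual block on plain Python lists (the two-layer F here is a hypothetical stand-in for the stacked conv layers). With all weights zero, F(x) = 0 and the block reduces to the identity, which is why the deeper residual network can at least match its shallower counterpart:

```python
# H(x) = F(x) + x, where F is the learned transformation of the stacked layers.
def relu(v):
    return [max(0.0, u) for u in v]

def linear(W, v):
    return [sum(w * u for w, u in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    Fx = linear(W2, relu(linear(W1, x)))     # F(x): linear -> relu -> linear
    return [f + u for f, u in zip(Fx, x)]    # add the identity connection

x = [1.0, -2.0, 3.0]
Z = [[0.0] * 3 for _ in range(3)]            # zero weights => F(x) = 0
print(residual_block(x, Z, Z))               # [1.0, -2.0, 3.0] (identity)
```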
ResNet won 1st place in all five main tracks:
- ImageNet Classification: "ultra-deep" 152-layer nets
- ImageNet Detection: 16% better than the 2nd best system
- ImageNet Localization: 27% better than the 2nd best system
- COCO Detection: 11% better than the 2nd best system
- COCO Segmentation: 12% better than the 2nd best system
Bag of tricks:
- Batch Normalization after every CONV layer
- Xavier/2 initialization from [He et al.]
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used