
CS7015 (Deep Learning): Lecture 18
Using joint distributions for classification and sampling, Latent Variables, Restricted Boltzmann Machines, Unsupervised Learning, Motivation for Sampling

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras

Acknowledgments
Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman
An Introduction to Restricted Boltzmann Machines, Asja Fischer and Christian Igel

Module 18.1: Using joint distributions for classification and sampling

Now that we have some understanding of joint probability distributions and efficient ways of representing them, let us see some more practical examples where we can use these joint distributions.

M1: An unexpected and necessary masterpiece
M2: Delightfully merged information and comedy
M3: Director's first true masterpiece
M4: Sci-fi perfection, truly mesmerizing film.
M5: Waste of time and money
M6: Best Lame Historical Movie Ever

Consider a movie critic who writes reviews for movies.
For simplicity, let us assume that he always writes reviews containing a maximum of 5 words.
Further, let us assume that there are a total of 50 words in his vocabulary.
Each of the 5 words in his review can be treated as a random variable which takes one of the 50 values.
Given many such reviews written by the reviewer, we could learn the joint probability distribution

    P(X_1, X_2, ..., X_5)

In fact, we can even think of a very simple factorization for this model:

    P(X_1, X_2, ..., X_5) = ∏_i P(X_i | X_{i-1}, X_{i-2})

In other words, we are assuming that the i-th word only depends on the previous 2 words and not anything before that.
Let us consider one such factor, P(X_i = time | X_{i-2} = waste, X_{i-1} = of). We can estimate this as

    count(waste of time) / count(waste of)

and the two counts mentioned above can be computed by going over all the reviews.
We could similarly compute the probabilities of all such factors.

(Figure: a small word graph over "waste", "of", "and", "time", "money" illustrating these dependencies.)

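As a minimal sketch of how such a factor could be estimated from data (assuming the reviews are available as a list of strings; the toy corpus and variable names below are illustrative, not from the lecture):

    from collections import Counter

    # hypothetical toy corpus of reviews
    reviews = [
        "waste of time and money",
        "waste of money and time",
    ]

    trigram_counts = Counter()
    bigram_counts = Counter()
    for review in reviews:
        words = review.split()
        for i in range(2, len(words)):
            trigram_counts[tuple(words[i-2:i+1])] += 1  # count(w_{i-2} w_{i-1} w_i)
            bigram_counts[tuple(words[i-2:i])] += 1     # count(w_{i-2} w_{i-1})

    # P(X_i = time | X_{i-2} = waste, X_{i-1} = of)
    p = trigram_counts[("waste", "of", "time")] / bigram_counts[("waste", "of")]
    print(p)  # 0.5 for the toy corpus above
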
Okay, so now what can we do with this joint distribution?

M7: More realistic than real life

    w      P(w | more, realistic)   P(w | realistic, than)   P(w | than, real)   ...
    than   0.61                     0.01                     0.20                ...
    as     0.12                     0.10                     0.16                ...
    for    0.14                     0.09                     0.05                ...
    real   0.01                     0.50                     0.01                ...
    the    0.02                     0.12                     0.12                ...
    life   0.05                     0.11                     0.33                ...

    (Here P(w | a, b) abbreviates P(X_i = w | X_{i-2} = a, X_{i-1} = b).)

Given a review, classify if this was written by the reviewer.
Generate new reviews which would look like reviews written by this reviewer.
How would you do this? By sampling from this distribution! What does that mean? Let us see!

    P(M7) = P(X_1 = more) · P(X_2 = realistic | X_1 = more)
            · P(X_3 = than | X_1 = more, X_2 = realistic)
            · P(X_4 = real | X_2 = realistic, X_3 = than)
            · P(X_5 = life | X_3 = than, X_4 = real)
          = 0.2 × 0.25 × 0.61 × 0.50 × 0.33 ≈ 0.005

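As a rough sketch of this scoring step in code (the five factor values are the ones used in the worked computation of P(M7) above):

    # factor values from the worked example for M7: "More realistic than real life"
    factors = [
        0.20,  # P(X1 = more)
        0.25,  # P(X2 = realistic | X1 = more)
        0.61,  # P(X3 = than | X1 = more, X2 = realistic)
        0.50,  # P(X4 = real | X2 = realistic, X3 = than)
        0.33,  # P(X5 = life | X3 = than, X4 = real)
    ]

    p_m7 = 1.0
    for f in factors:
        p_m7 *= f
    print(round(p_m7, 4))  # 0.005: the score of M7 under the reviewer's model
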
How does the reviewer start his reviews (what is the first word that he chooses)?
We could take the word which has the highest probability and put it as the first word in our review.
Having selected this, what is the most likely second word that the reviewer uses?
Having selected the first two words, what is the most likely third word that the reviewer uses? And so on...

    w         P(X_1 = w)   P(X_2 = w | X_1 = the)   P(X_i = w | X_{i-2} = the, X_{i-1} = movie)   ...
    the       0.62         0.01                     0.01                                          ...
    movie     0.10         0.40                     0.01                                          ...
    amazing   0.01         0.22                     0.01                                          ...
    useless   0.01         0.20                     0.03                                          ...
    was       0.01         0.00                     0.60                                          ...
    ...

    The movie was really amazing

But there is a catch here!
Selecting the most likely word at each time step will only give us the same review again and again!
But we would like to generate different reviews.
So instead of taking the max value we can sample from this distribution. How? Let us see!

Suppose there are 10 words in the vocabulary.
We have computed the probability distribution P(X_1 = word).
P(X_1 = the) is the fraction of reviews having "the" as the first word.
Similarly, we have computed P(X_2 = word_2 | X_1 = word_1) and P(X_3 = word_3 | X_1 = word_1, X_2 = word_2).

    w             P(X_1 = w)   P(X_2 = w | X_1 = the)   P(X_i = w | X_{i-2} = the, X_{i-1} = movie)   ...
    the           0.62         0.01                     0.01                                          ...
    movie         0.10         0.40                     0.01                                          ...
    amazing       0.01         0.22                     0.01                                          ...
    useless       0.01         0.20                     0.03                                          ...
    was           0.01         0.00                     0.60                                          ...
    is            0.01         0.00                     0.30                                          ...
    masterpiece   0.01         0.11                     0.01                                          ...
    I             0.21         0.00                     0.01                                          ...
    liked         0.01         0.01                     0.01                                          ...
    decent        0.01         0.02                     0.01                                          ...

Suppose the review generated so far is "The movie ...". Now consider that we want to generate the 3rd word in the review given the first 2 words of the review.
We can think of the 10 words as forming a 10-sided dice where each side corresponds to a word.
The probability of each side showing up is not uniform but as per the values given in the table.
We can select the next word by rolling this dice and picking up the word which shows up.
You can write a python program to roll such a biased dice (see the sketch below).

    Index   Word          P(X_i = w | X_{i-2} = the, X_{i-1} = movie)   ...
    0       the           0.01                                          ...
    1       movie         0.01                                          ...
    2       amazing       0.01                                          ...
    3       useless       0.03                                          ...
    4       was           0.60                                          ...
    5       is            0.30                                          ...
    6       masterpiece   0.01                                          ...
    7       I             0.01                                          ...
    8       liked         0.01                                          ...
    9       decent        0.01                                          ...

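A minimal sketch of rolling such a biased dice in Python (random.choices draws an item in proportion to the given weights; the word list and probabilities are the ones in the table above):

    import random

    words = ["the", "movie", "amazing", "useless", "was",
             "is", "masterpiece", "I", "liked", "decent"]
    # P(X_i = w | X_{i-2} = the, X_{i-1} = movie), from the table above
    probs = [0.01, 0.01, 0.01, 0.03, 0.60, 0.30, 0.01, 0.01, 0.01, 0.01]

    # roll the biased 10-sided dice once
    next_word = random.choices(words, weights=probs, k=1)[0]
    print(next_word)  # most often "was", sometimes "is", occasionally any other word
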
Now, at each timestep we do not pick the most likely word but all words are possible depending on their probability (just as rolling a biased dice or tossing a biased coin).
Every run will now give us a different review!

Generated Reviews:
    the movie is liked decent
    I liked the amazing movie
    the movie is masterpiece
    the movie I liked useless

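Putting the pieces together, a rough sketch of such a generation loop, assuming the estimated distributions are stored in dictionaries (p_x1 maps a word to P(X_1 = word); p_x2_given_x1 and p_xi_given_prev2 are keyed by the previous word(s); these names are illustrative, not from the lecture):

    import random

    def sample_from(dist):
        """dist maps word -> probability; draw one word from it."""
        words = list(dist.keys())
        probs = list(dist.values())
        return random.choices(words, weights=probs, k=1)[0]

    def generate_review(p_x1, p_x2_given_x1, p_xi_given_prev2, max_len=5):
        """Sample a review of at most max_len words from the trigram model."""
        w1 = sample_from(p_x1)                # first word ~ P(X1)
        w2 = sample_from(p_x2_given_x1[w1])   # second word ~ P(X2 | X1)
        review = [w1, w2]
        for _ in range(max_len - 2):
            prev2 = (review[-2], review[-1])
            review.append(sample_from(p_xi_given_prev2[prev2]))  # ~ P(Xi | Xi-2, Xi-1)
        return " ".join(review)

Every call to generate_review can return a different review, which is exactly the behaviour described above.
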
Returning to our story...

Okay, so now what can we do with this joint distribution?
Given a review, classify if this was written by the reviewer.
Generate new reviews which would look like reviews written by this reviewer.
Correct noisy reviews or help in completing incomplete reviews, for example

    argmax_{X_5} P(X_1 = the, X_2 = movie, X_3 = was, X_4 = amazingly, X_5 = ?)

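Note that under the trigram factorization only the last factor depends on X_5, so this argmax reduces to maximizing P(X_5 = w | X_3 = was, X_4 = amazingly) over the vocabulary. A minimal sketch, reusing the illustrative p_xi_given_prev2 dictionary from the generation sketch above:

    def complete_review(prefix, p_xi_given_prev2):
        """Fill in the 5th word of a 4-word prefix by maximizing P(X5 | X3, X4)."""
        dist = p_xi_given_prev2[(prefix[-2], prefix[-1])]  # P(X5 = w | X3, X4)
        best_word = max(dist, key=dist.get)                # argmax over the vocabulary
        return prefix + [best_word]

    # e.g. complete_review(["the", "movie", "was", "amazingly"], p_xi_given_prev2)
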
Let us take an example from another domain.

Consider images which contain m × n pixels (say 32 × 32).
Each pixel here is a random variable which can take values from 0 to 255 (colors).
We thus have a total of 32 × 32 = 1024 random variables (X_1, X_2, ..., X_1024).
Together these pixels define the image, and different combinations of pixel values lead to different images.
Given many such images, we want to learn the joint distribution P(X_1, X_2, ..., X_1024).

We can assume each pixel is dependent only on its neighbors.
In this case we could factorize the distribution over a Markov network as a product of factors

    ∏_i φ(D_i)

where D_i is a set of variables which form a maximal clique (basically, a group of neighboring pixels).

Again, what can we do with this joint distribution?
Given a new image, classify if it is indeed a bedroom (e.g., Probability Score = 0.01).
Generate new images which would look like bedrooms (say, if you are an interior designer).
Correct noisy images or help in completing incomplete images.

Such models which try to estimate the probability P(X) from a large number of samples are called generative models.

Module 18.2: The concept of a latent variable

We now introduce the concept of a latent variable.
Recall that earlier we mentioned that the neighboring pixels in an image are dependent on each other.
Why is it so? (intuitively, because we expect them to have the same color, texture, etc.)
Let us probe this intuition a bit more and try to formalize it.

Suppose we asked a friend to send us a good wallpaper, and he/she thinks a bit about it and sends us this image.
Why are all the pixels in the top portion of the image blue? (because our friend decided to show us an image of the sky as opposed to mountains or green fields)
But then why blue and not black? (because our friend decided to show us an image which depicts daytime as opposed to night time)
Okay, but why is it not cloudy (gray)? (because our friend decided to show us an image which depicts a sunny day)
These decisions made by our friend (sky, sunny, daytime, etc.) are not explicitly known to us (they are hidden from us).
We only observe the images, but what we observe depends on these latent (hidden) decisions.

So what exactly are we trying to say here?
We are saying that there are certain underlying hidden (latent) characteristics which are determining the pixels and their interactions.
We could think of these as additional (latent) random variables in our distribution.
These are latent because we do not observe them, unlike the pixels which are observable random variables.
The pixels depend on the choice of these latent variables.

(Figure: example images labelled "Latent Variable = daytime", "Latent Variable = night", and "Latent Variable = cloudy".)

More formally, we now have visible (observed) variables or pixels (V = {V_1, V_2, V_3, ..., V_1024}) and hidden variables (H = {H_1, H_2, ..., H_n}).
Can you now think of a Markov network to represent the joint distribution P(V, H)?
Our original Markov network suggested that the pixels were dependent on neighboring pixels (forming a clique).
But now we could have a better Markov network involving these latent variables.
This Markov network suggests that the pixels (observed variables) are dependent on the latent variables (which is exactly the intuition that we were trying to build in the previous slides).
The interactions between the pixels are captured through the latent variables.

Before we move on to more formal definitions and equations, let us probe the idea of using latent variables a bit more.
We will talk about two concepts: abstraction and generation.

First let us talk about abstraction.
Suppose we are able to learn the joint distribution P(V, H).
Using this distribution we can find

    P(H | V) = P(V, H) / Σ_H P(V, H)

In other words, given an image, we can find the most likely latent configuration (H = h) that generated this image (of course, keeping the computational cost aside for now).
What does this h capture? It captures a latent representation or abstraction of the image!

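To make the marginalization concrete, here is a tiny brute-force sketch for boolean hidden variables; it enumerates all 2^n hidden configurations, so it only illustrates the formula and is not a practical algorithm. The function joint(v, h), which returns the (possibly unnormalized) value of P(V = v, H = h), is an assumed placeholder:

    from itertools import product

    def posterior_over_h(v, joint, n):
        """P(H = h | V = v) for every boolean hidden configuration h of length n."""
        hs = list(product([0, 1], repeat=n))
        scores = [joint(v, h) for h in hs]      # P(V = v, H = h), up to a constant
        z = sum(scores)                          # Σ_H P(V = v, H)
        return {h: s / z for h, s in zip(hs, scores)}

    def most_likely_h(v, joint, n):
        """The latent configuration h that best explains the image v (its abstraction)."""
        post = posterior_over_h(v, joint, n)
        return max(post, key=post.get)
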
In other words, it captures the most important properties of the image.
For example, if you were to describe the adjacent image, you wouldn't say "I am looking at an image where pixel 1 is blue, pixel 2 is blue, ..., pixel 1024 is beige".
Instead you would just say "I am looking at an image of a sunny beach with an ocean in the background and beige sand".
This is exactly the abstraction captured by the vector h.

Under this abstraction all these images would look very similar (i.e., they would have very similar latent configurations h).
Even though in the original feature space (pixels) there is a significant difference between these images, in the latent space they would be very close to each other.
This is very similar to the idea behind PCA and autoencoders.

Of course, we still need to figure out a way of computing P(H | V).
In the case of PCA, learning such latent representations boiled down to learning the eigenvectors of X^T X (using linear algebra).
In the case of autoencoders, this boiled down to learning the parameters of the feedforward network (W_enc, W_dec) (using gradient descent).
We still haven't seen how to learn the parameters of P(H, V) (we are far from it, but we will get there soon!).

Ok, I am just going to drag this a bit more! (bear with me)
Remember that in practice we have no clue what these hidden variables are!
Even in PCA, once we are given the new dimensions we have no clue what these dimensions actually mean.
We cannot interpret them (for example, we cannot say dimension 1 corresponds to weight, dimension 2 corresponds to height, and so on!).
Even here, we just assume there are some latent variables which capture the essence of the data, but we do not really know what these are (because no one ever tells us what these are).
Only for illustration purposes we assumed that h_1 corresponds to sunny/cloudy, h_2 corresponds to beach, and so on.

Just to reiterate, remember that while sending us the wallpaper images our friend never told us what latent variables he/she considered.
Maybe our friend had the following latent variables in mind: h_1 = cheerful, h_2 = romantic, and so on.
In fact, it doesn't really matter what the interpretation of these latent variables is.
All we care about is that they should help us learn a good abstraction of the data.
How? (we will get there eventually)

We will now talk about another interesting concept related to latent variables: generation.
Once again, assume that we are able to learn the joint distribution P(V, H).
Using this distribution we can find

    P(V | H) = P(V, H) / Σ_V P(V, H)

Why is this interesting?

Well, I can now say "Create an image which is cloudy, has a beach and depicts daytime".
Or, given h = [....], find the corresponding V which maximizes P(V | H).
In other words, I can now generate images given certain latent variables.
The hope is that I should be able to ask the model to generate very creative images given some latent configuration (we will come back to this later).

The story ahead...
We have tried to understand the intuition behind latent variables and how they could potentially allow us to do abstraction and generation.
We will now concretize these intuitions by developing equations (models) and learning algorithms.
And of course, we will tie all this back to neural networks!

For the remainder of this discussion we will assume that all our variables take only boolean values.
Thus, the vector V will be a boolean vector ∈ {0, 1}^m (there are a total of 2^m values that V can take).
And the vector H will be a boolean vector ∈ {0, 1}^n (there are a total of 2^n values that H can take).

Module 18.3: Restricted Boltzmann Machines

We return to our Markov network containing hidden variables and visible variables.
We will get rid of the image and just keep the hidden and visible variables.
We have edges between each pair of (hidden, visible) variables.
We do not have edges between (hidden, hidden) and (visible, visible) variables.

(Figure: a bipartite graph with hidden units h_1, h_2, ..., h_n in one layer and visible units v_1, v_2, ..., v_m in the other, with an edge between every (hidden, visible) pair.)

Earlier, we saw that given such a Markov network the joint probability distribution can be written as a product of factors.
Can you tell how many factors are there in this case?
Recall that factors correspond to maximal cliques.
What are the maximal cliques in this case? Every pair of visible and hidden node forms a clique.
How many such cliques do we have? (m × n)

So we can write the joint pdf as a product of the
following factors
h1 h2 ··· hn
1 YY
P (V, H) = φij (vi , hj )
Z i j

v1 v2 ··· vm

39/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
So we can write the joint pdf as a product of the
following factors
h1 h2 ··· hn
1 YY
P (V, H) = φij (vi , hj )
Z i j

In fact, we can also add additional factors


corresponding to the nodes and write
1 YY Y Y
v1 v2 ··· vm P (V, H) = φij (vi , hj ) ψi (vi ) ξj (hj )
Z i j i j

39/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
So we can write the joint pdf as a product of the
following factors
h1 h2 ··· hn
1 YY
P (V, H) = φij (vi , hj )
Z i j

In fact, we can also add additional factors


corresponding to the nodes and write
1 YY Y Y
v1 v2 ··· vm P (V, H) = φij (vi , hj ) ψi (vi ) ξj (hj )
Z i j i j

It is legal to do this (i.e., add factors for ψi (vi )ξj (hj ))


as long as we ensure that Z is adjusted in a way that
the resulting quantity is a probability distribution

39/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
So we can write the joint pdf as a product of the
following factors
h1 h2 ··· hn
1 YY
P (V, H) = φij (vi , hj )
Z i j

In fact, we can also add additional factors


corresponding to the nodes and write
1 YY Y Y
v1 v2 ··· vm P (V, H) = φij (vi , hj ) ψi (vi ) ξj (hj )
Z i j i j

It is legal to do this (i.e., add factors for ψi (vi )ξj (hj ))


as long as we ensure that Z is adjusted in a way that
the resulting quantity is a probability distribution
Z is the partition function and is given by
Z = Σ_V Σ_H ∏_i ∏_j φ_ij(v_i, h_j) ∏_i ψ_i(v_i) ∏_j ξ_j(h_j)

39/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us understand each of these factors in
more detail
h1 h2 ··· hn

v1 v2 ··· vm

40/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us understand each of these factors in
more detail
h1 h2 ··· hn
For example, φ11 (v1 , h1 ) is a factor which
takes the values of v1 ∈ {0, 1} and h1 ∈ {0, 1}
and returns a value indicating the affinity
between these two variables

v1 v2 ··· vm

40/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us understand each of these factors in
more detail
h1 h2 ··· hn
For example, φ11 (v1 , h1 ) is a factor which
takes the values of v1 ∈ {0, 1} and h1 ∈ {0, 1}
and returns a value indicating the affinity
between these two variables
The adjoining table shows one such possible
v1 v2 ··· vm instantiation of the φ11 function

φ11 (v1 , h1 )
0 0 30
0 1 5
1 0 1
1 1 10

40/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us understand each of these factors in
more detail
h1 h2 ··· hn
For example, φ11 (v1 , h1 ) is a factor which
takes the values of v1 ∈ {0, 1} and h1 ∈ {0, 1}
and returns a value indicating the affinity
between these two variables
The adjoining table shows one such possible
v1 v2 ··· vm instantiation of the φ11 function
Similarly, ψ1 (v1 ) takes the value of v1 ∈ {0, 1}
φ11 (v1 , h1 )
0 0 30 and gives us a number which roughly indicates
0 1 5
1 0 1
the possibility of v1 taking on the value 1 or 0
1 1 10

40/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us understand each of these factors in
more detail
h1 h2 ··· hn
For example, φ11 (v1 , h1 ) is a factor which
takes the values of v1 ∈ {0, 1} and h1 ∈ {0, 1}
and returns a value indicating the affinity
between these two variables
The adjoining table shows one such possible
v1 v2 ··· vm instantiation of the φ11 function
Similarly, ψ1 (v1 ) takes the value of v1 ∈ {0, 1}
φ11 (v1 , h1 )
0 0 30 and gives us a number which roughly indicates
0 1 5
1 0 1
the possibility of v1 taking on the value 1 or 0
1 1 10
The adjoining table shows one such possible
ψ1 (v1 )
instantiation of the ψ1 function
0 10
1 2

40/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us understand each of these factors in
more detail
h1 h2 ··· hn
For example, φ11 (v1 , h1 ) is a factor which
takes the values of v1 ∈ {0, 1} and h1 ∈ {0, 1}
and returns a value indicating the affinity
between these two variables
The adjoining table shows one such possible
v1 v2 ··· vm instantiation of the φ11 function
Similarly, ψ1 (v1 ) takes the value of v1 ∈ {0, 1}
φ11 (v1 , h1 )
0 0 30 and gives us a number which roughly indicates
0 1 5
1 0 1
the possibility of v1 taking on the value 1 or 0
1 1 10
The adjoining table shows one such possible
ψ1 (v1 )
instantiation of the ψ1 function
0 10
1 2
A similar interpretation can be made for
ξ1 (h1 )
40/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Just to be sure that we understand this correctly let us take a small example
where |V | = 3 (i.e., V ∈ {0, 1}3 ) and |H| = 2 (i.e., H ∈ {0, 1}2 )

41/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose we are now interested in P (V =<
0, 0, 0 >, H =< 1, 1 >)
h1 h2

v1 v2 v3

φ11 (v1 , h1 ) φ12 (v1 , h2 ) φ21 (v2 , h1 ) φ22 (v2 , h2 ) φ31 (v3 , h1 ) φ32 (v3 , h2 )
0 0 20 0 0 6 0 0 3 0 0 2 0 0 6 0 0 3
0 1 3 0 1 20 0 1 3 0 1 1 0 1 3 0 1 1
1 0 5 1 0 10 1 0 2 1 0 10 1 0 5 1 0 10
1 1 10 1 1 2 1 1 10 1 1 10 1 1 10 1 1 10
ψ1 (v1 ) ψ2 (v2 ) ψ3 (v3 ) ξ1 (h1 ) ξ2 (h2 )
0 30 0 100 0 1 0 100 0 10
1 1 1 1 1 100 1 1 1 10

42/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose we are now interested in P (V =<
0, 0, 0 >, H =< 1, 1 >)
h1 h2
We can compute this using the following
function

P (V =< 0, 0, 0 >, H =< 1, 1 >)


1
= φ11 (0, 1)φ12 (0, 1)φ21 (0, 1)
Z
v1 v2 v3 φ22 (0, 1)φ31 (0, 1)φ32 (0, 1)
ψ1 (0)ψ2 (0)ψ3 (0)ξ1 (1)ξ2 (1)
φ11 (v1 , h1 ) φ12 (v1 , h2 ) φ21 (v2 , h1 ) φ22 (v2 , h2 ) φ31 (v3 , h1 ) φ32 (v3 , h2 )
0 0 20 0 0 6 0 0 3 0 0 2 0 0 6 0 0 3
0 1 3 0 1 20 0 1 3 0 1 1 0 1 3 0 1 1
1 0 5 1 0 10 1 0 2 1 0 10 1 0 5 1 0 10
1 1 10 1 1 2 1 1 10 1 1 10 1 1 10 1 1 10
ψ1 (v1 ) ψ2 (v2 ) ψ3 (v3 ) ξ1 (h1 ) ξ2 (h2 )
0 30 0 100 0 1 0 100 0 10
1 1 1 1 1 100 1 1 1 10

42/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose we are now interested in P (V =<
0, 0, 0 >, H =< 1, 1 >)
h1 h2
We can compute this using the following
function

P (V =< 0, 0, 0 >, H =< 1, 1 >)


1
= φ11 (0, 1)φ12 (0, 1)φ21 (0, 1)
Z
v1 v2 v3 φ22 (0, 1)φ31 (0, 1)φ32 (0, 1)
ψ1 (0)ψ2 (0)ψ3 (0)ξ1 (1)ξ2 (1)
φ11 (v1 , h1 ) φ12 (v1 , h2 ) φ21 (v2 , h1 ) φ22 (v2 , h2 ) φ31 (v3 , h1 ) φ32 (v3 , h2 )
0 0 20 0 0 6 0 0 3 0 0 2 0 0 6 0 0 3
0 1 3 0 1 20 0 1 3 0 1 1 0 1 3 0 1 1
1 0 5 1 0 10 1 0 2 1 0 10 1 0 5 1 0 10
1 1 10 1 1 2 1 1 10 1 1 10 1 1 10 1 1 10
ψ1 (v1 ) ψ2 (v2 ) ψ3 (v3 ) ξ1 (h1 ) ξ2 (h2 )
0 30 0 100 0 1 0 100 0 10
1 1 1 1 1 100 1 1 1 10

and the partition function will be given by

Z = Σ_{v1=0}^{1} Σ_{v2=0}^{1} Σ_{v3=0}^{1} Σ_{h1=0}^{1} Σ_{h2=0}^{1} φ11(v1, h1) φ12(v1, h2) φ21(v2, h1) φ22(v2, h2) φ31(v3, h1) φ32(v3, h2) ψ1(v1) ψ2(v2) ψ3(v3) ξ1(h1) ξ2(h2)


42/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
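(Aside, not from the original slides) To make the arithmetic above concrete, here is a small Python sketch that brute-forces this example: it multiplies the clique potentials from the tables above for the configuration V = <0,0,0>, H = <1,1> and normalizes by a partition function obtained by enumerating all 2^3 · 2^2 = 32 configurations. The factor tables are copied from the slide; everything else is purely illustrative.

import itertools

# Factor tables from the slide: phi[(i, j)][(v_i, h_j)], psi[i][v_i], xi[j][h_j]
phi = {
    (1, 1): {(0, 0): 20, (0, 1): 3,  (1, 0): 5,  (1, 1): 10},
    (1, 2): {(0, 0): 6,  (0, 1): 20, (1, 0): 10, (1, 1): 2},
    (2, 1): {(0, 0): 3,  (0, 1): 3,  (1, 0): 2,  (1, 1): 10},
    (2, 2): {(0, 0): 2,  (0, 1): 1,  (1, 0): 10, (1, 1): 10},
    (3, 1): {(0, 0): 6,  (0, 1): 3,  (1, 0): 5,  (1, 1): 10},
    (3, 2): {(0, 0): 3,  (0, 1): 1,  (1, 0): 10, (1, 1): 10},
}
psi = {1: {0: 30, 1: 1}, 2: {0: 100, 1: 1}, 3: {0: 1, 1: 100}}
xi  = {1: {0: 100, 1: 1}, 2: {0: 10, 1: 10}}

def unnormalized(v, h):
    # product of all clique potentials for one configuration (V, H)
    p = 1.0
    for i in (1, 2, 3):
        for j in (1, 2):
            p *= phi[(i, j)][(v[i - 1], h[j - 1])]
        p *= psi[i][v[i - 1]]
    for j in (1, 2):
        p *= xi[j][h[j - 1]]
    return p

# partition function: sum over all 2^3 * 2^2 = 32 configurations
Z = sum(unnormalized(v, h)
        for v in itertools.product((0, 1), repeat=3)
        for h in itertools.product((0, 1), repeat=2))

print(unnormalized((0, 0, 0), (1, 1)) / Z)   # P(V=<0,0,0>, H=<1,1>)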
How do we learn these clique potentials:
φij (vi , hj ), ψi (vi ), ξj (hj )?

h1 h2 ··· hn

v1 v2 ··· vm

43/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
How do we learn these clique potentials:
φij (vi , hj ), ψi (vi ), ξj (hj )?
Whenever we want to learn something what
do we introduce?
h1 h2 ··· hn

v1 v2 ··· vm

43/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
How do we learn these clique potentials:
φij (vi , hj ), ψi (vi ), ξj (hj )?
Whenever we want to learn something what
do we introduce? (parameters)
h1 h2 ··· hn

v1 v2 ··· vm

43/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
How do we learn these clique potentials:
H ∈ {0, 1}n φij (vi , hj ), ψi (vi ), ξj (hj )?
c1 c2 cn Whenever we want to learn something what
do we introduce? (parameters)
h1 h2 ··· hn
So we will introduce a parametric form for
these clique potentials and then learn these
w1,1 wm,n parameters
W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

43/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
How do we learn these clique potentials:
H ∈ {0, 1}n φij (vi , hj ), ψi (vi ), ξj (hj )?
c1 c2 cn Whenever we want to learn something what
do we introduce? (parameters)
h1 h2 ··· hn
So we will introduce a parametric form for
these clique potentials and then learn these
w1,1 wm,n parameters
W ∈ Rm×n
The specific parametric form chosen by RBMs
is
v1 v2 ··· vm
φij (vi , hj ) = ewij vi hj
b1 b2 bm ψi (vi ) = ebi vi
V ∈ {0, 1}m
ξj (hj ) = ecj hj

43/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
With this parametric form, let us see what the
H ∈ {0, 1}n joint distribution looks like
c1 c2 cn
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

44/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
With this parametric form, let us see what the
H ∈ {0, 1}n joint distribution looks like
c1 c2 cn 1 YY Y Y
P (V, H) = φij (vi , hj ) ψi (vi ) ξj (hj )
h1 h2 ··· hn Z
i j i j

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

44/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
With this parametric form, let us see what the
H ∈ {0, 1}n joint distribution looks like
c1 c2 cn 1 YY Y Y
P (V, H) = φij (vi , hj ) ψi (vi ) ξj (hj )
h1 h2 ··· hn Z
i j i j
1 Y Y wij vi hj Y bi vi Y cj hj
= e e e
Z
w1,1 wm,n W ∈ Rm×n i j i j

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

44/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
With this parametric form, let us see what the
H ∈ {0, 1}n joint distribution looks like
c1 c2 cn 1 YY Y Y
P (V, H) = φij (vi , hj ) ψi (vi ) ξj (hj )
h1 h2 ··· hn Z
i j i j
1 Y Y wij vi hj Y bi vi Y cj hj
= e e e
Z
w1,1 wm,n W ∈ Rm×n i j i j
1 P P P P
= e i j wij vi hj e i bi vi e j cj hj
Z
v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

44/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
With this parametric form, let us see what the
H ∈ {0, 1}n joint distribution looks like
c1 c2 cn 1 YY Y Y
P (V, H) = φij (vi , hj ) ψi (vi ) ξj (hj )
h1 h2 ··· hn Z
i j i j
1 Y Y wij vi hj Y bi vi Y cj hj
= e e e
Z
w1,1 wm,n W ∈ Rm×n i j i j
1 P P P P
= e i j wij vi hj e i bi vi e j cj hj
Z
v1 v2 ··· vm 1 P P P P
= e i j wij vi hj + i bi vi + j cj hj
Z
b1 b2 bm
V ∈ {0, 1}m

44/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
With this parametric form, let us see what the
H ∈ {0, 1}n joint distribution looks like
c1 c2 cn 1 YY Y Y
P (V, H) = φij (vi , hj ) ψi (vi ) ξj (hj )
h1 h2 ··· hn Z
i j i j
1 Y Y wij vi hj Y bi vi Y cj hj
= e e e
Z
w1,1 wm,n W ∈ Rm×n i j i j
1 P P P P
= e i j wij vi hj e i bi vi e j cj hj
Z
v1 v2 ··· vm 1 P P P P
= e i j wij vi hj + i bi vi + j cj hj
Z
b1 b2 bm 1
V ∈ {0, 1}m = e−E(V,H) where,
Z

44/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
With this parametric form, let us see what the
H ∈ {0, 1}n joint distribution looks like
c1 c2 cn
h1 h2 ··· hn
w1,1 wm,n W ∈ Rm×n
v1 v2 ··· vm
b1 b2 bm
V ∈ {0, 1}m

P(V, H) = (1/Z) ∏_i ∏_j φ_ij(v_i, h_j) ∏_i ψ_i(v_i) ∏_j ξ_j(h_j)
= (1/Z) ∏_i ∏_j e^{w_ij v_i h_j} ∏_i e^{b_i v_i} ∏_j e^{c_j h_j}
= (1/Z) e^{Σ_i Σ_j w_ij v_i h_j} e^{Σ_i b_i v_i} e^{Σ_j c_j h_j}
= (1/Z) e^{Σ_i Σ_j w_ij v_i h_j + Σ_i b_i v_i + Σ_j c_j h_j}
= (1/Z) e^{−E(V,H)} where,

E(V, H) = − Σ_i Σ_j w_ij v_i h_j − Σ_i b_i v_i − Σ_j c_j h_j

44/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
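(Aside, not from the original slides) A minimal NumPy sketch of the energy function and the resulting Gibbs/Boltzmann form of the joint, for a toy RBM with randomly chosen parameters. The partition function is computed by brute-force enumeration, which is only feasible here because m and n are tiny.

import itertools
import numpy as np

m, n = 3, 2                              # number of visible and hidden units (kept tiny)
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))              # w_ij couples v_i and h_j
b = rng.normal(size=m)                   # visible biases
c = rng.normal(size=n)                   # hidden biases

def energy(v, h):
    # E(V, H) = - sum_ij w_ij v_i h_j - sum_i b_i v_i - sum_j c_j h_j
    return -(v @ W @ h) - b @ v - c @ h

configs = [(np.array(v), np.array(h))
           for v in itertools.product((0, 1), repeat=m)
           for h in itertools.product((0, 1), repeat=n)]
Z = sum(np.exp(-energy(v, h)) for v, h in configs)   # brute-force partition function

def joint(v, h):
    # P(V, H) = exp(-E(V, H)) / Z
    return np.exp(-energy(np.array(v), np.array(h))) / Z

print(joint((1, 0, 1), (0, 1)))          # probability of one configuration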
XX X X
H ∈ {0, 1}n E(V, H) = − wij vi hj − bi vi − cj hj
c1 c2 cn i j i j

h1 h2 ··· hn
Because of the above form, we refer to these
networks as (restricted) Boltzmann machines
w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

45/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
XX X X
H ∈ {0, 1}n E(V, H) = − wij vi hj − bi vi − cj hj
c1 c2 cn i j i j

h1 h2 ··· hn
Because of the above form, we refer to these
networks as (restricted) Boltzmann machines
w1,1 wm,n W ∈ Rm×n The term comes from statistical mechanics
where the distribution of particles in a system
over various possible states is given by
v1 v2 ··· vm

F(state) ∝ e^{−E/(kT)}
b1 b2 bm
V ∈ {0, 1}m

45/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
XX X X
H ∈ {0, 1}n E(V, H) = − wij vi hj − bi vi − cj hj
c1 c2 cn i j i j

h1 h2 ··· hn
Because of the above form, we refer to these
networks as (restricted) Boltzmann machines
w1,1 wm,n W ∈ Rm×n The term comes from statistical mechanics
where the distribution of particles in a system
over various possible states is given by
v1 v2 ··· vm

F(state) ∝ e^{−E/(kT)}
b1 b2 bm
V ∈ {0, 1}m which is called the Boltzmann distribution or
the Gibbs distribution

45/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Module 18.4: RBMs as Stochastic Neural Networks

46/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
But what is the connection between this and deep neural networks?
We will get to it over the next few slides!

47/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will start by deriving a formula for P (V |H) and
P (H|V )
H∈ {0, 1}n
c1 c2 cn
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

48/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will start by deriving a formula for P (V |H) and
P (H|V )
H∈ {0, 1}n
c1 c2 cn In particular, let us take the l-th visible unit and derive
a formula for P (vl = 1|H)
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

48/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will start by deriving a formula for P (V |H) and
P (H|V )
H∈ {0, 1}n
c1 c2 cn In particular, let us take the l-th visible unit and derive
a formula for P (vl = 1|H)
h1 h2 ··· hn We will first define V−l as the state of all the visible
units except the l-th unit

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

48/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will start by deriving a formula for P (V |H) and
P (H|V )
H∈ {0, 1}n
c1 c2 cn In particular, let us take the l-th visible unit and derive
a formula for P (vl = 1|H)
h1 h2 ··· hn We will first define V−l as the state of all the visible
units except the l-th unit
We now define the following quantities
w1,1 wm,n W ∈ Rm×n
v1 v2 ··· vm

α_l(H) = − Σ_{i=1}^{n} w_{il} h_i − b_l

β(V_{−l}, H) = − Σ_{i=1}^{n} Σ_{j=1, j≠l}^{m} w_{ij} h_i v_j − Σ_{j=1, j≠l}^{m} b_j v_j − Σ_{i=1}^{n} c_i h_i
b1 b2 bm
V ∈ {0, 1}m

48/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will start by deriving a formula for P (V |H) and
P (H|V )
H∈ {0, 1}n
c1 c2 cn In particular, let us take the l-th visible unit and derive
a formula for P (vl = 1|H)
h1 h2 ··· hn We will first define V−l as the state of all the visible
units except the l-th unit
We now define the following quantities
w1,1 wm,n W ∈ Rm×n
v1 v2 ··· vm
b1 b2 bm
V ∈ {0, 1}m

α_l(H) = − Σ_{i=1}^{n} w_{il} h_i − b_l

β(V_{−l}, H) = − Σ_{i=1}^{n} Σ_{j=1, j≠l}^{m} w_{ij} h_i v_j − Σ_{j=1, j≠l}^{m} b_j v_j − Σ_{i=1}^{n} c_i h_i

Notice that

E(V, H) = v_l α_l(H) + β(V_{−l}, H)

48/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We can now write P (vl = 1|H) as
H∈ {0, 1}n
c1 c2 cn
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

49/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We can now write P (vl = 1|H) as
H∈ {0, 1}n
p(vl = 1|H) = P (vl = 1|V−l , H)
c1 c2 cn
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

49/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We can now write P (vl = 1|H) as
H∈ {0, 1}n
p(vl = 1|H) = P (vl = 1|V−l , H)
c1 c2 cn
p(vl = 1, V−l , H)
h1 h2 ··· hn =
p(V−l , H)

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

49/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We can now write P (vl = 1|H) as
H∈ {0, 1}n
p(vl = 1|H) = P (vl = 1|V−l , H)
c1 c2 cn
p(vl = 1, V−l , H)
h1 h2 ··· hn =
p(V−l , H)
e−E(vl =1,V−l ,H)
=
w1,1 wm,n W ∈ Rm×n
e−E(vl =1,V−l ,H) + e−E(vl =0,V−l ,H)

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

49/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We can now write P (vl = 1|H) as
H∈ {0, 1}n
p(vl = 1|H) = P (vl = 1|V−l , H)
c1 c2 cn
p(vl = 1, V−l , H)
h1 h2 ··· hn =
p(V−l , H)
e−E(vl =1,V−l ,H)
=
w1,1 wm,n W ∈ Rm×n
e−E(vl =1,V−l ,H) + e−E(vl =0,V−l ,H)
= e^{−β(V_{−l},H) − 1·α_l(H)} / (e^{−β(V_{−l},H) − 1·α_l(H)} + e^{−β(V_{−l},H) − 0·α_l(H)})
v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

49/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We can now write P (vl = 1|H) as
H∈ {0, 1}n
p(vl = 1|H) = P (vl = 1|V−l , H)
c1 c2 cn
p(vl = 1, V−l , H)
h1 h2 ··· hn =
p(V−l , H)
e−E(vl =1,V−l ,H)
=
w1,1 wm,n W ∈ Rm×n
e−E(vl =1,V−l ,H) + e−E(vl =0,V−l ,H)
= e^{−β(V_{−l},H) − 1·α_l(H)} / (e^{−β(V_{−l},H) − 1·α_l(H)} + e^{−β(V_{−l},H) − 0·α_l(H)})
v1 v2 ··· vm
= e^{−β(V_{−l},H)} · e^{−α_l(H)} / (e^{−β(V_{−l},H)} · e^{−α_l(H)} + e^{−β(V_{−l},H)})
b1 b2 bm
V ∈ {0, 1}m

49/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We can now write P (vl = 1|H) as
H∈ {0, 1}n
p(vl = 1|H) = P (vl = 1|V−l , H)
c1 c2 cn
p(vl = 1, V−l , H)
h1 h2 ··· hn =
p(V−l , H)
e−E(vl =1,V−l ,H)
=
w1,1 wm,n W ∈ Rm×n
e−E(vl =1,V−l ,H) + e−E(vl =0,V−l ,H)
= e^{−β(V_{−l},H) − 1·α_l(H)} / (e^{−β(V_{−l},H) − 1·α_l(H)} + e^{−β(V_{−l},H) − 0·α_l(H)})
v1 v2 ··· vm
= e^{−β(V_{−l},H)} · e^{−α_l(H)} / (e^{−β(V_{−l},H)} · e^{−α_l(H)} + e^{−β(V_{−l},H)})
b1 b2 bm
V ∈ {0, 1}m
= e^{−α_l(H)} / (e^{−α_l(H)} + 1) = 1 / (1 + e^{α_l(H)})

49/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We can now write P (vl = 1|H) as
H∈ {0, 1}n
p(vl = 1|H) = P (vl = 1|V−l , H)
c1 c2 cn
p(vl = 1, V−l , H)
h1 h2 ··· hn =
p(V−l , H)
e−E(vl =1,V−l ,H)
=
w1,1 wm,n W ∈ Rm×n
e−E(vl =1,V−l ,H) + e−E(vl =0,V−l ,H)
= e^{−β(V_{−l},H) − 1·α_l(H)} / (e^{−β(V_{−l},H) − 1·α_l(H)} + e^{−β(V_{−l},H) − 0·α_l(H)})
v1 v2 ··· vm
= e^{−β(V_{−l},H)} · e^{−α_l(H)} / (e^{−β(V_{−l},H)} · e^{−α_l(H)} + e^{−β(V_{−l},H)})
b1 b2 bm
V ∈ {0, 1}m
= e^{−α_l(H)} / (e^{−α_l(H)} + 1) = 1 / (1 + e^{α_l(H)})
= σ(−α_l(H)) = σ(Σ_{i=1}^{n} w_{il} h_i + b_l)
49/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay, so we arrived at
H∈ {0, 1}n Xn
p(vl = 1|H) = σ( wil hi + bl )
c1 c2 cn
i=1

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

50/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay, so we arrived at
H∈ {0, 1}n Xn
p(vl = 1|H) = σ( wil hi + bl )
c1 c2 cn
i=1

h1 h2 ··· hn Similarly, we can show that


m
X
p(hl = 1|V ) = σ( wil vi + cl )
w1,1 wm,n W ∈ Rm×n i=1

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

50/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay, so we arrived at
H∈ {0, 1}n Xn
p(vl = 1|H) = σ( wil hi + bl )
c1 c2 cn
i=1

h1 h2 ··· hn Similarly, we can show that


m
X
p(hl = 1|V ) = σ( wil vi + cl )
w1,1 wm,n W ∈ Rm×n i=1

The RBM can thus be interpreted as a stochastic


neural network, where the nodes and edges correspond
v1 v2 ··· vm to neurons and synaptic connections, respectively.

b1 b2 bm
V ∈ {0, 1}m

50/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay, so we arrived at
H ∈ {0, 1}n
c1 c2 cn

p(v_l = 1|H) = σ(Σ_{i=1}^{n} w_{il} h_i + b_l)

h1 h2 ··· hn

Similarly, we can show that

p(h_l = 1|V) = σ(Σ_{i=1}^{m} w_{il} v_i + c_l)

w1,1 wm,n W ∈ Rm×n

The RBM can thus be interpreted as a stochastic


neural network, where the nodes and edges correspond
v1 v2 ··· vm to neurons and synaptic connections, respectively.
The conditional probability of a single (hidden or
b1 b2 bm visible) variable being 1 can be interpreted as the firing
V ∈ {0, 1}m rate of a (stochastic) neuron with sigmoid activation
function

50/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
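(Aside, not from the original slides) The two conditionals above translate directly into the "stochastic neuron" picture: compute a sigmoid pre-activation, then flip a biased coin for each unit. A minimal sketch, with arbitrary toy parameters:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, c):
    # p(h_j = 1 | V) = sigma(sum_i w_ij v_i + c_j); sample each h_j independently
    p = sigmoid(v @ W + c)
    return (rng.random(p.shape) < p).astype(int), p

def sample_v_given_h(h, W, b):
    # p(v_i = 1 | H) = sigma(sum_j w_ij h_j + b_i); sample each v_i independently
    p = sigmoid(W @ h + b)
    return (rng.random(p.shape) < p).astype(int), p

# toy usage
m, n = 6, 3
W = rng.normal(size=(m, n)); b = rng.normal(size=m); c = rng.normal(size=n)
v = rng.integers(0, 2, size=m)
h, p_h = sample_h_given_v(v, W, c)
v_new, p_v = sample_v_given_h(h, W, b)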
Given this neural network view of RBMs, can
H ∈ {0, 1}n you say something about what h is trying to
c1 c2 cn learn?

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

51/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Given this neural network view of RBMs, can
H ∈ {0, 1}n you say something about what h is trying to
c1 c2 cn learn?
It is learning an abstract representation of V
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

51/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Given this neural network view of RBMs, can
H ∈ {0, 1}n you say something about what h is trying to
c1 c2 cn learn?
It is learning an abstract representation of V
h1 h2 ··· hn
This looks similar to autoencoders but how do
we train such an RBM? What is the objective
w1,1 wm,n function?
W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

51/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Given this neural network view of RBMs, can
H ∈ {0, 1}n you say something about what h is trying to
c1 c2 cn learn?
It is learning an abstract representation of V
h1 h2 ··· hn
This looks similar to autoencoders but how do
we train such an RBM? What is the objective
w1,1 wm,n function?
W ∈ Rm×n
We will see this in the next lecture!

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

51/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Module 18.5: Unsupervised Learning with RBMs

52/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
So far, we have mainly dealt with supervised
H ∈ {0, 1}n learning where we are given {xi , yi }ni=1 for
c1 c2 cn training

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

53/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
So far, we have mainly dealt with supervised
H ∈ {0, 1}n learning where we are given {xi , yi }ni=1 for
c1 c2 cn training
In other words, for every training example we
h1 h2 ··· hn are given a label (or class) associated with it

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

53/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
So far, we have mainly dealt with supervised
H ∈ {0, 1}n learning where we are given {xi , yi }ni=1 for
c1 c2 cn training
In other words, for every training example we
h1 h2 ··· hn are given a label (or class) associated with it
Our job was then to learn a model which
w1,1 wm,n predicts ŷ such that the difference between y
W ∈ Rm×n
and ŷ is minimized

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

53/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
But in the case of RBMs, our training data
H ∈ {0, 1}n only contains x (for example, images)
c1 c2 cn
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

54/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
But in the case of RBMs, our training data
H ∈ {0, 1}n only contains x (for example, images)
c1 c2 cn There is no explicit label (y) associated with
the input
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

54/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
But in the case of RBMs, our training data
H ∈ {0, 1}n only contains x (for example, images)
c1 c2 cn There is no explicit label (y) associated with
the input
h1 h2 ··· hn
Of course, in addition to x we have the latent
variable h but we don’t know what these h’s
w1,1 wm,n are
W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

54/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
But in the case of RBMs, our training data
H ∈ {0, 1}n only contains x (for example, images)
c1 c2 cn There is no explicit label (y) associated with
the input
h1 h2 ··· hn
Of course, in addition to x we have the latent
variable h but we don’t know what these h’s
w1,1 wm,n are
W ∈ Rm×n
We are interested in learning P (x, h) which we
have parameterized as
v1 v2 ··· vm
P(V, H) = (1/Z) e^{−(−Σ_i Σ_j w_ij v_i h_j − Σ_i b_i v_i − Σ_j c_j h_j)}
b1 b2 bm
V ∈ {0, 1}m

54/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
What is the objective function that we should use?
H∈ {0, 1}n
c1 c2 cn
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

55/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
What is the objective function that we should use?
H∈ {0, 1}n First note that if we have learnt P (x, h) we can
c1 c2 cn compute P (x)

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

55/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
What is the objective function that we should use?
H∈ {0, 1}n First note that if we have learnt P (x, h) we can
c1 c2 cn compute P (x)
What would we want P (X = x) to be for any x
h1 h2 ··· hn belonging to our training data?

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

55/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
What is the objective function that we should use?
H∈ {0, 1}n First note that if we have learnt P (x, h) we can
c1 c2 cn compute P (x)
What would we want P (X = x) to be for any x
h1 h2 ··· hn belonging to our training data?
We would want it to be high

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

55/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
What is the objective function that we should use?
H∈ {0, 1}n First note that if we have learnt P (x, h) we can
c1 c2 cn compute P (x)
What would we want P (X = x) to be for any x
h1 h2 ··· hn belonging to our training data?
We would want it to be high
So now can you think of an objective function
w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

55/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
What is the objective function that we should use?
H∈ {0, 1}n First note that if we have learnt P (x, h) we can
c1 c2 cn compute P (x)
What would we want P (X = x) to be for any x
h1 h2 ··· hn belonging to our training data?
We would want it to be high
So now can you think of an objective function
w1,1 wm,n W ∈ Rm×n
N
Y
maximize P (X = xi )
i=1

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

55/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
What is the objective function that we should use?
H∈ {0, 1}n First note that if we have learnt P (x, h) we can
c1 c2 cn compute P (x)
What would we want P (X = x) to be for any x
h1 h2 ··· hn belonging to our training data?
We would want it to be high
So now can you think of an objective function
w1,1 wm,n W ∈ Rm×n
N
Y
maximize P (X = xi )
i=1

v1 v2 ··· vm
Or, log-likelihood
b1 b2 bm
l l
V ∈ {0, 1}m ln L (θ) = ln
Y
p(xi |θ) =
X
ln p(xi |θ)
i=1 i=1

55/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
What is the objective function that we should use?
H∈ {0, 1}n First note that if we have learnt P (x, h) we can
c1 c2 cn compute P (x)
What would we want P (X = x) to be for any x
h1 h2 ··· hn belonging to our training data?
We would want it to be high
So now can you think of an objective function
w1,1 wm,n W ∈ Rm×n
maximize ∏_{i=1}^{N} P(X = x_i)

v1 v2 ··· vm

Or, log-likelihood

b1 b2 bm
V ∈ {0, 1}m

ln L(θ) = ln ∏_{i=1}^{N} p(x_i|θ) = Σ_{i=1}^{N} ln p(x_i|θ)

where θ are the parameters

55/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
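(Aside, not from the original slides) For a tiny RBM this log-likelihood can be evaluated exactly by marginalizing H and enumerating Z, which is a useful sanity check even though it is hopeless at realistic sizes. A sketch with made-up toy data:

import itertools
import numpy as np

def log_likelihood(data, W, b, c):
    # average of ln p(v) over the data, with p(v) = (1/Z) sum_H exp(-E(v, H))
    m, n = W.shape
    unnorm = lambda v, h: np.exp(v @ W @ h + b @ v + c @ h)   # exp(-E(v, h))
    Hs = [np.array(h) for h in itertools.product((0, 1), repeat=n)]
    Vs = [np.array(v) for v in itertools.product((0, 1), repeat=m)]
    Z = sum(unnorm(v, h) for v in Vs for h in Hs)
    return float(np.mean([np.log(sum(unnorm(v, h) for h in Hs) / Z) for v in data]))

rng = np.random.default_rng(0)
m, n = 4, 2
W = 0.01 * rng.normal(size=(m, n)); b = np.zeros(m); c = np.zeros(n)
data = [rng.integers(0, 2, size=m) for _ in range(5)]   # stand-in for training examples
print(log_likelihood(data, W, b, c))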
Okay so we have the objective function now!
H ∈ {0, 1}n What next?
c1 c2 cn
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

56/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay so we have the objective function now!
H ∈ {0, 1}n What next?
c1 c2 cn We need a learning algorithm

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

56/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay so we have the objective function now!
H ∈ {0, 1}n What next?
c1 c2 cn We need a learning algorithm

··· We can just use gradient descent if we are able


h1 h2 hn
to compute the gradient of the loss function
w.r.t. the parameters
w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

56/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay so we have the objective function now!
H ∈ {0, 1}n What next?
c1 c2 cn We need a learning algorithm

··· We can just use gradient descent if we are able


h1 h2 hn
to compute the gradient of the loss function
w.r.t. the parameters
w1,1 wm,n W ∈ Rm×n Let us see if we can do that

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

56/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Module 18.6: Computing the gradient of the log
likelihood

57/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will just consider the loss for a single training
example
H∈ {0, 1}n
c1 c2 cn ln L (θ)

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

58/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will just consider the loss for a single training
example
H∈ {0, 1}n
c1 c2 cn ln L (θ) = ln p(V |θ)

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

58/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will just consider the loss for a single training
example
H∈ {0, 1}n 1 X −E(V,H)
c1 c2 cn ln L (θ) = ln p(V |θ) = ln e
Z H

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

58/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will just consider the loss for a single training
example
H∈ {0, 1}n 1 X −E(V,H)
c1 c2 cn ln L (θ) = ln p(V |θ) = ln e
Z H
X −E(V,H) X −E(V,H)
h1 h2 ··· hn = ln e − ln e
H V,H

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

58/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will just consider the loss for a single training
example
H∈ {0, 1}n 1 X −E(V,H)
c1 c2 cn ln L (θ) = ln p(V |θ) = ln e
Z H
X −E(V,H) X −E(V,H)
h1 h2 ··· hn = ln e − ln e
H V,H

∂ ln L (θ)
w1,1 wm,n ∂θ
W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

58/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will just consider the loss for a single training
example
H∈ {0, 1}n 1 X −E(V,H)
c1 c2 cn ln L (θ) = ln p(V |θ) = ln e
Z H
X −E(V,H) X −E(V,H)
h1 h2 ··· hn = ln e − ln e
H V,H

∂ ln L (θ)
 
∂ X X
= ln e−E(V,H) − ln e−E(V,H)
w1,1 wm,n ∂θ ∂θ
W ∈ Rm×n H V,H

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

58/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will just consider the loss for a single training
example
H∈ {0, 1}n 1 X −E(V,H)
c1 c2 cn ln L (θ) = ln p(V |θ) = ln e
Z H
X −E(V,H) X −E(V,H)
h1 h2 ··· hn = ln e − ln e
H V,H

∂ ln L (θ)
 
∂ X X
= ln e−E(V,H) − ln e−E(V,H)
w1,1 wm,n ∂θ ∂θ
W ∈ Rm×n H V,H

1 X ∂E(V, H)
= −P e−E(V,H)
H e−E(V,H) H
∂θ
v1 v2 ··· vm 1 X −E(V,H) ∂E(V, H)
+P −E(V,H)
e
V,H e V,H
∂θ
b1 b2 bm
V ∈ {0, 1}m

58/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
We will just consider the loss for a single training
example
H ∈ {0, 1}n
c1 c2 cn
h1 h2 ··· hn
w1,1 wm,n W ∈ Rm×n
v1 v2 ··· vm
b1 b2 bm
V ∈ {0, 1}m

ln L(θ) = ln p(V|θ) = ln (1/Z) Σ_H e^{−E(V,H)}
= ln Σ_H e^{−E(V,H)} − ln Σ_{V,H} e^{−E(V,H)}

∂ ln L(θ)/∂θ = ∂/∂θ ( ln Σ_H e^{−E(V,H)} − ln Σ_{V,H} e^{−E(V,H)} )
= − (1 / Σ_H e^{−E(V,H)}) Σ_H e^{−E(V,H)} ∂E(V,H)/∂θ + (1 / Σ_{V,H} e^{−E(V,H)}) Σ_{V,H} e^{−E(V,H)} ∂E(V,H)/∂θ
= − Σ_H (e^{−E(V,H)} / Σ_H e^{−E(V,H)}) ∂E(V,H)/∂θ + Σ_{V,H} (e^{−E(V,H)} / Σ_{V,H} e^{−E(V,H)}) ∂E(V,H)/∂θ
58/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn e−E(V,H)
P −E(V,H)
= p(V, H)
V,H e
h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn e−E(V,H)
P −E(V,H)
= p(V, H)
V,H e
h1 h2 ··· hn
e−E(V,H)
P −E(V,H)
H e
w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn e−E(V,H)
P −E(V,H)
= p(V, H)
V,H e
h1 h2 ··· hn
1 −E(V,H)
e−E(V,H) Z
e
P −E(V,H) = 1
P −E(V,H)
H e Z H e
w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn e−E(V,H)
P −E(V,H)
= p(V, H)
V,H e
h1 h2 ··· hn
1 −E(V,H)
e−E(V,H) Z
e
P −E(V,H) = 1
P −E(V,H)
H e Z H e
w1,1 wm,n W ∈ Rm×n p(V, H)
=
p(V )

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn e−E(V,H)
P −E(V,H)
= p(V, H)
V,H e
h1 h2 ··· hn
1 −E(V,H)
e−E(V,H) Z
e
P −E(V,H) = 1
P −E(V,H)
H e Z H e
w1,1 wm,n W ∈ Rm×n p(V, H)
= = p(H|V )
p(V )

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn e−E(V,H)
P −E(V,H)
= p(V, H)
V,H e
h1 h2 ··· hn
1 −E(V,H)
e−E(V,H) Z
e
P −E(V,H) = 1
P −E(V,H)
H e Z H e
w1,1 wm,n W ∈ Rm×n p(V, H)
= = p(H|V )
p(V )

v1 v2 ··· vm ∂ ln L (θ)
∂θ
b1 b2 bm
V ∈ {0, 1}m

59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn e−E(V,H)
P −E(V,H)
= p(V, H)
V,H e
h1 h2 ··· hn
1 −E(V,H)
e−E(V,H) Z
e
P −E(V,H) = 1
P −E(V,H)
H e Z H e
w1,1 wm,n W ∈ Rm×n p(V, H)
= = p(H|V )
p(V )

v1 v2 ··· vm ∂ ln L (θ) X e−E(V,H) ∂E(V, H)


=− P −E(V,H)
∂θ H H e ∂θ
b1 b2 bm
V ∈ {0, 1}m
X e−E(V,H) ∂E(V, H)
+ P −E(V,H)
V,H V,H e ∂θ

59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Now,
H∈ {0, 1}n
c1 c2 cn
h1 h2 ··· hn
w1,1 wm,n W ∈ Rm×n
v1 v2 ··· vm
b1 b2 bm
V ∈ {0, 1}m

e^{−E(V,H)} / Σ_{V,H} e^{−E(V,H)} = p(V, H)

e^{−E(V,H)} / Σ_H e^{−E(V,H)} = ((1/Z) e^{−E(V,H)}) / ((1/Z) Σ_H e^{−E(V,H)}) = p(V, H) / p(V) = p(H|V)

∂ ln L(θ)/∂θ = − Σ_H (e^{−E(V,H)} / Σ_H e^{−E(V,H)}) ∂E(V,H)/∂θ + Σ_{V,H} (e^{−E(V,H)} / Σ_{V,H} e^{−E(V,H)}) ∂E(V,H)/∂θ
= − Σ_H p(H|V) ∂E(V,H)/∂θ + Σ_{V,H} p(V,H) ∂E(V,H)/∂θ
59/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay, so we have,
H ∈ {0, 1}n ∂ ln L (θ) X ∂E(V, H)
c1 c2 cn =− p(H|V )
∂θ ∂θ
H
h1 h2 ··· hn X ∂E(V, H)
+ p(V, H)
∂θ
V,H
w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

60/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay, so we have,
H ∈ {0, 1}n ∂ ln L (θ) X ∂E(V, H)
c1 c2 cn =− p(H|V )
∂θ ∂θ
H
h1 h2 ··· hn X ∂E(V, H)
+ p(V, H)
∂θ
V,H
w1,1 wm,n W ∈ Rm×n
Remember that θ is a collection of all the
parameters in our model, i.e., Wij , bi , cj ∀i ∈
{1, . . . , m} and ∀j ∈ {1, . . . , n}
v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

60/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Okay, so we have,
H ∈ {0, 1}n
c1 c2 cn
h1 h2 ··· hn

∂ ln L(θ)/∂θ = − Σ_H p(H|V) ∂E(V,H)/∂θ + Σ_{V,H} p(V,H) ∂E(V,H)/∂θ
w1,1 wm,n W ∈ Rm×n
Remember that θ is a collection of all the
parameters in our model, i.e., Wij , bi , cj ∀i ∈
{1, . . . , m} and ∀j ∈ {1, . . . , n}
v1 v2 ··· vm
We will follow our usual recipe of computing
b1 b2 bm the partial derivative w.r.t. one weight wij
V ∈ {0, 1}m and then generalize to the gradient w.r.t. the
entire weight matrix W

60/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
∂L (θ)
H ∈ {0, 1}n
c1 c2 cn ∂wij

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

61/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
∂L (θ)
H ∈ {0, 1}n
c1 c2 cn ∂wij
X ∂E(V, H) X ∂E(V, H)
h1 h2 ··· hn =− p(H|V ) + p(V, H)
∂wij ∂wij
H V,H

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

61/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
∂L (θ)
H ∈ {0, 1}n
c1 c2 cn ∂wij
X ∂E(V, H) X ∂E(V, H)
h1 h2 ··· hn =− p(H|V ) + p(V, H)
∂wij ∂wij
H V,H
X X
= p(H|V )hi vj − p(V, H)hi vj
w1,1 wm,n W ∈ Rm×n H V,H

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

61/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
∂L (θ)
H ∈ {0, 1}n
c1 c2 cn ∂wij
X ∂E(V, H) X ∂E(V, H)
h1 h2 ··· hn =− p(H|V ) + p(V, H)
∂wij ∂wij
H V,H
X X
= p(H|V )hi vj − p(V, H)hi vj
w1,1 wm,n W ∈ Rm×n H V,H

v1 v2 ··· vm

b1 b2 bm We can write the above as a sum of two


V ∈ {0, 1}m expectations

61/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
H ∈ {0, 1}n
c1 c2 cn
h1 h2 ··· hn
w1,1 wm,n W ∈ Rm×n

∂L(θ)/∂w_ij = − Σ_H p(H|V) ∂E(V,H)/∂w_ij + Σ_{V,H} p(V,H) ∂E(V,H)/∂w_ij
= Σ_H p(H|V) h_i v_j − Σ_{V,H} p(V,H) h_i v_j

= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]


v1 v2 ··· vm

b1 b2 bm We can write the above as a sum of two


V ∈ {0, 1}m expectations

61/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
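(Aside, not from the original slides) For a tiny RBM both expectations can be evaluated exactly by enumeration, which makes the formula above concrete (and is exactly what becomes intractable as m and n grow). A brute-force sketch:

import itertools
import numpy as np

def grad_W(v_data, W, b, c):
    # d ln L / dW = E_{p(H|v_data)}[v h^T] - E_{p(V,H)}[v h^T], both by enumeration
    m, n = W.shape
    Hs = [np.array(h) for h in itertools.product((0, 1), repeat=n)]
    Vs = [np.array(v) for v in itertools.product((0, 1), repeat=m)]
    unnorm = lambda v, h: np.exp(v @ W @ h + b @ v + c @ h)   # exp(-E(v, h))

    # positive phase: expectation under p(H | v_data)
    w_h = np.array([unnorm(v_data, h) for h in Hs]); w_h /= w_h.sum()
    positive = sum(w * np.outer(v_data, h) for w, h in zip(w_h, Hs))

    # negative phase: expectation under the full joint p(V, H)
    Z = sum(unnorm(v, h) for v in Vs for h in Hs)
    negative = sum((unnorm(v, h) / Z) * np.outer(v, h) for v in Vs for h in Hs)
    return positive - negative

rng = np.random.default_rng(0)
m, n = 4, 2
W = 0.1 * rng.normal(size=(m, n)); b = np.zeros(m); c = np.zeros(n)
print(grad_W(rng.integers(0, 2, size=m), W, b, c))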
How do we compute these expectations?
∂L (θ)
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij

62/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
How do we compute these expectations?
∂L (θ) The first summation can actually be
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij simplified (we will come back and simplify it
later)

62/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
How do we compute these expectations?
∂L (θ) The first summation can actually be
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij simplified (we will come back and simplify it
later)
However, the second summation contains an
exponential number of terms and hence
intractable in practice

62/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
How do we compute these expectations?
∂L (θ) The first summation can actually be
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij simplified (we will come back and simplify it
later)
However, the second summation contains an
exponential number of terms and hence
intractable in practice
So how do we deal with this ?

62/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Module 18.7: Motivation for Sampling

63/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
The trick is to approximate the sum by using
∂L (θ) a few samples instead of an exponential
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij number of samples

64/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
The trick is to approximate the sum by using
∂L (θ) a few samples instead of an exponential
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij number of samples
We will try to understand this with the help
of an analogy

64/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which
has a population of 10M and you
want to compute the average
weight of this population

65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which
has a population of 10M and you
want to compute the average
weight of this population
You can think of X as a random
variable which denotes a person

65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which
has a population of 10M and you
want to compute the average
weight of this population
You can think of X as a random
variable which denotes a person
The value assigned to this
random variable can be any
person from your population

65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which
has a population of 10M and you
want to compute the average
weight of this population
You can think of X as a random
variable which denotes a person
The value assigned to this
random variable can be any
person from your population
For each person you have an
associated value denoted by
weight(X)

65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which
has a population of 10M and you
want to compute the average
weight of this population
You can think of X as a random
variable which denotes a person
The value assigned to this
random variable can be any
person from your population
For each person you have an
associated value denoted by
weight(X)
You are then interested in
computing the expected value of
weight(X) as shown on the RHS
65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which X
has a population of 10M and you E[weight(X)] = p(x)weight(x)
want to compute the average (x∈P )
weight of this population
You can think of X as a random
variable which denotes a person
The value assigned to this
random variable can be any
person from your population
For each person you have an
associated value denoted by
weight(X)
You are then interested in
computing the expected value of
weight(X) as shown on the RHS
65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which X
has a population of 10M and you E[weight(X)] = p(x)weight(x)
want to compute the average (x∈P )
weight of this population Of course, it is going to be hard to get the
You can think of X as a random weights of every person in the population
variable which denotes a person and hence in practice we approximate the
above sum by sampling only few subjects
The value assigned to this
from the population (say 10000)
random variable can be any
person from your population
For each person you have an
associated value denoted by
weight(X)
You are then interested in
computing the expected value of
weight(X) as shown on the RHS
65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which X
has a population of 10M and you E[weight(X)] = p(x)weight(x)
want to compute the average (x∈P )
weight of this population Of course, it is going to be hard to get the
You can think of X as a random weights of every person in the population
variable which denotes a person and hence in practice we approximate the
above sum by sampling only few subjects
The value assigned to this
from the population (say 10000)
random variable can be any P
person from your population x∈P [:10000] [p(x)weight(x)]
E[weight(X)] ≈ P
For each person you have an x∈P [:10000] p(x)
associated value denoted by
weight(X)
You are then interested in
computing the expected value of
weight(X) as shown on the RHS
65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which X
has a population of 10M and you E[weight(X)] = p(x)weight(x)
want to compute the average (x∈P )
weight of this population Of course, it is going to be hard to get the
You can think of X as a random weights of every person in the population
variable which denotes a person and hence in practice we approximate the
above sum by sampling only few subjects
The value assigned to this
from the population (say 10000)
random variable can be any P
person from your population x∈P [:10000] [p(x)weight(x)]
E[weight(X)] ≈ P
For each person you have an x∈P [:10000] p(x)
associated value denoted by
Further, you assume that P (X) = N1 = 1
10K ,
weight(X)
i.e., every person in your population is
You are then interested in equally likely
computing the expected value of
weight(X) as shown on the RHS
65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Suppose you live in a city which has a population of 10M and you want to compute the average weight of this population
You can think of X as a random variable which denotes a person
The value assigned to this random variable can be any person from your population
For each person you have an associated value denoted by weight(X)
You are then interested in computing the expected value of weight(X) as shown on the RHS

E[weight(X)] = Σ_{x∈P} p(x) weight(x)

Of course, it is going to be hard to get the weights of every person in the population and hence in practice we approximate the above sum by sampling only few subjects from the population (say 10000)

E[weight(X)] ≈ Σ_{x∈P[:10000]} p(x) weight(x) / Σ_{x∈P[:10000]} p(x)

Further, you assume that P(X) = 1/N = 1/10K, i.e., every person in your population is equally likely

E[weight(X)] ≈ Σ_{x∈Persons[:10000]} weight(x) / 10^4
65/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
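(Aside, not from the original slides) The analogy in code, with hypothetical synthetic weights: because every person is equally likely, a plain uniform subsample already gives a good estimate of the population average.

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=70.0, scale=12.0, size=10_000_000)   # hypothetical weights (kg)

sample = rng.choice(population, size=10_000, replace=False)      # uniform subsample
print(population.mean(), sample.mean())                          # the two should be close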
This looks easy, why can’t we do the same
for our task ?
X
E[X] = xp(x)
(x∈P )

66/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
This looks easy, why can’t we do the same
for our task ?
X
E[X] = xp(x)
(x∈P ) Why can’t we simply approximate the sum
by using some samples?

66/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
This looks easy, why can’t we do the same
for our task ?
X
E[X] = xp(x)
(x∈P ) Why can’t we simply approximate the sum
by using some samples?
What does that mean? It means that instead
of considering all possible values of
{v, h} ∈ 2m+n let us just consider some
samples from this population

66/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
This looks easy, why can’t we do the same
for our task ?
X
E[X] = xp(x)
(x∈P ) Why can’t we simply approximate the sum
by using some samples?
What does that mean? It means that instead
of considering all possible values of
{v, h} ∈ 2m+n let us just consider some
samples from this population
Analogy: Earlier we had 10M samples in the
population from which we drew 10K
samples, now we have 2m+n samples in the
population from which we need to draw a
reasonable number of samples

66/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
This looks easy, why can’t we do the same
for our task ?
X
E[X] = xp(x)
(x∈P ) Why can’t we simply approximate the sum
by using some samples?
What does that mean? It means that instead
of considering all possible values of
{v, h} ∈ {0, 1}^{m+n} let us just consider some
samples from this population
Analogy: Earlier we had 10M samples in the
population from which we drew 10K
samples, now we have 2^{m+n} samples in the
population from which we need to draw a
reasonable number of samples
Why is this not straightforward? Let us see!

66/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
For simplicity, first let us just focus on the
H ∈ {0, 1}n visible variables (V ∈ 2m ) and let us see
c1 c2 cn what it means to draw samples from P (V )

h1 h2 ··· hn

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

67/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
For simplicity, first let us just focus on the
H ∈ {0, 1}n visible variables (V ∈ 2m ) and let us see
c1 c2 cn what it means to draw samples from P (V )
Well, we know that V = v1 , v2 , . . . , vm where
h1 h2 ··· hn each vi ∈ {0, 1}

w1,1 wm,n W ∈ Rm×n

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

67/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
For simplicity, first let us just focus on the
H ∈ {0, 1}n visible variables (V ∈ 2m ) and let us see
c1 c2 cn what it means to draw samples from P (V )
Well, we know that V = v1 , v2 , . . . , vm where
h1 h2 ··· hn each vi ∈ {0, 1}
Suppose we decide to approximate the sum
w1,1 wm,n by 10K samples instead of the full 2m
W ∈ Rm×n
samples

v1 v2 ··· vm

b1 b2 bm
V ∈ {0, 1}m

67/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
For simplicity, first let us just focus on the
H ∈ {0, 1}n visible variables (V ∈ 2m ) and let us see
c1 c2 cn what it means to draw samples from P (V )
Well, we know that V = v1 , v2 , . . . , vm where
h1 h2 ··· hn each vi ∈ {0, 1}
Suppose we decide to approximate the sum
w1,1 wm,n by 10K samples instead of the full 2m
W ∈ Rm×n
samples
It is easy to create these samples by
v1 v2 ··· vm assigning values to each vi

b1 b2 bm
V ∈ {0, 1}m

67/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
For simplicity, first let us just focus on the
H ∈ {0, 1}n visible variables (V ∈ 2m ) and let us see
c1 c2 cn what it means to draw samples from P (V )
Well, we know that V = v1 , v2 , . . . , vm where
h1 h2 ··· hn each vi ∈ {0, 1}
Suppose we decide to approximate the sum
w1,1 wm,n by 10K samples instead of the full 2m
W ∈ Rm×n
samples
It is easy to create these samples by
v1 v2 ··· vm assigning values to each vi
For example,
b1 b2 bm
V = 11111 . . . 11111, V = 00000 . . . 0000, V =
V ∈ {0, 1}m
00110011 . . . 00110011, . . . V = 0101 . . . 0101
are all samples from this population

67/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
For simplicity, first let us just focus on the
H ∈ {0, 1}n visible variables (V ∈ 2m ) and let us see
c1 c2 cn what it means to draw samples from P (V )
Well, we know that V = v1 , v2 , . . . , vm where
h1 h2 ··· hn each vi ∈ {0, 1}
Suppose we decide to approximate the sum
w1,1 wm,n by 10K samples instead of the full 2m
W ∈ Rm×n
samples
It is easy to create these samples by
v1 v2 ··· vm assigning values to each vi
For example,
b1 b2 bm
V = 11111 . . . 11111, V = 00000 . . . 0000, V =
V ∈ {0, 1}m
00110011 . . . 00110011, . . . V = 0101 . . . 0101
are all samples from this population
So which samples do we consider ?
67/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Well, that’s where the catch is!

68/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Well, that’s where the catch is!
Unlike, our population analogy, here we
cannot assume that every sample is equally
likely

68/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Well, that’s where the catch is!
Unlike, our population analogy, here we
cannot assume that every sample is equally
likely
Why?

68/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Well, that’s where the catch is!
Unlike, our population analogy, here we
cannot assume that every sample is equally
likely
Why? (Hint: consider the case that visible
variables correspond to pixels from natural
images)

68/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Well, that’s where the catch is!
Unlike, our population analogy, here we
cannot assume that every sample is equally
likely
Why? (Hint: consider the case that visible
Likely
variables correspond to pixels from natural
images)
Clearly some images are more likely than the
others!

Unlikely

68/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Well, that’s where the catch is!
Unlike, our population analogy, here we
cannot assume that every sample is equally
likely
Why? (Hint: consider the case that visible
Likely
variables correspond to pixels from natural
images)
Clearly some images are more likely than the
others!
Hence, we cannot assume that all samples
from the population (V ∈ {0, 1}^m) are equally
likely
Unlikely

68/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us see this in more detail

Uniform distribution

Multimodal distribution

69/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us see this in more detail
In our analogy, every person was equally
likely so we could just sample people
uniformly randomly

Uniform distribution

Multimodal distribution

69/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us see this in more detail
In our analogy, every person was equally
likely so we could just sample people
uniformly randomly
However, now if we sample people uniformly
Uniform distribution randomly then we will not get the true
picture of the expected value

Multimodal distribution

69/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us see this in more detail
In our analogy, every person was equally
likely so we could just sample people
uniformly randomly
However, now if we sample people uniformly
Uniform distribution randomly then we will not get the true
picture of the expected value
We need to draw more samples from the high
probability region and fewer samples from
the low probability region

Multimodal distribution

69/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
Let us see this in more detail
In our analogy, every person was equally
likely so we could just sample people
uniformly randomly
However, now if we sample people uniformly
Uniform distribution randomly then we will not get the true
picture of the expected value
We need to draw more samples from the high
probability region and fewer samples from
the low probability region
In other words each sample needs to be
drawn in proportion to its probability and
not uniformly
Multimodal distribution

69/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
That is where the problem lies!
∂L (θ|V )
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij

XXYY
Z= φij (vi , hj )
V H i j
Y Y 
ψi (vi ) ξj (hj )
i j

70/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
That is where the problem lies!
∂L (θ|V ) To draw a sample (V, H), we need to know
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij its probability P (V, H)

XXYY
Z= φij (vi , hj )
V H i j
Y Y 
ψi (vi ) ξj (hj )
i j

70/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
That is where the problem lies!
∂L (θ|V ) To draw a sample (V, H), we need to know
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij its probability P (V, H)
And of course, we also need this P (V, H) to
XXYY
Z= φij (vi , hj ) compute the expectation
V H i j
Y Y 
ψi (vi ) ξj (hj )
i j

70/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
That is where the problem lies!
∂L (θ|V ) To draw a sample (V, H), we need to know
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij its probability P (V, H)
And of course, we also need this P (V, H) to
XXYY
Z= φij (vi , hj ) compute the expectation
V H i j
Y Y  But, unfortunately computing P (V, H) is
ψi (vi ) ξj (hj ) intractable because of the partition function
i j
Z

70/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
That is where the problem lies!
∂L (θ|V ) To draw a sample (V, H), we need to know
= Ep(H|V ) [vi hj ] − Ep(V,H) [vi hj ]
∂wij its probability P (V, H)
And of course, we also need this P (V, H) to
XXYY
Z= φij (vi , hj ) compute the expectation
V H i j
Y Y  But, unfortunately computing P (V, H) is
ψi (vi ) ξj (hj ) intractable because of the partition function
i j
Z
Hence, approximating the summation by
using a few samples is not straightforward!
(or rather drawing a few samples from the
distribution is hard!)

70/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
The story so far
Conclusion: Okay, I get it that drawing samples from this distribution P is
hard.

71/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
The story so far
Conclusion: Okay, I get it that drawing samples from this distribution P is
hard.
Question: Is it possible to draw samples from an easier distribution (say, Q) as
long as I am sure that if I keep drawing samples from Q eventually my
samples will start looking as if they were drawn from P !

71/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
The story so far
Conclusion: Okay, I get it that drawing samples from this distribution P is
hard.
Question: Is it possible to draw samples from an easier distribution (say, Q) as
long as I am sure that if I keep drawing samples from Q eventually my
samples will start looking as if they were drawn from P !
Answer: Well if you can actually prove this then why not? (and that’s what
we do in Gibbs Sampling)

71/71
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 18
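(Aside, not from the original slides; the next lecture makes this precise) A sketch of what such an easier distribution Q looks like for an RBM: block Gibbs sampling, which only ever needs the tractable conditionals p(H|V) and p(V|H) and never the partition function. After enough alternating steps the samples behave like draws from P(V, H).

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v0, W, b, c, steps=1000):
    # alternately sample H ~ p(H|V) and V ~ p(V|H); Z is never needed
    v = v0.copy()
    h = (rng.random(W.shape[1]) < sigmoid(v @ W + c)).astype(int)
    for _ in range(steps):
        v = (rng.random(W.shape[0]) < sigmoid(W @ h + b)).astype(int)
        h = (rng.random(W.shape[1]) < sigmoid(v @ W + c)).astype(int)
    return v, h

m, n = 6, 3
W = rng.normal(size=(m, n)); b = np.zeros(m); c = np.zeros(n)
v_sample, h_sample = gibbs_chain(rng.integers(0, 2, size=m), W, b, c)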
