
Module 5: Recurrent Neural Networks and

Deep Unsupervised Learning

J. Andrew, Asst. Prof. - CSE, KITS


Module 5: Recurrent Neural Networks and
Deep Unsupervised Learning
• Recurrent Neural Networks
• Long Short Term Memory (LSTM)
• Gated Recurrent Unit (GRU)
• Autoencoders and Variational Autoencoders
• Generative Adversarial Networks (GANs)
• Deep Boltzmann Machines (DBM)
• Deep Reinforcement Learning
Recurrent Neural Networks
Topics
• Recurrent Neural Networks
– Bidirectional RNNs
– Encoder-Decoder Sequence-to-
Sequence Architectures
– Deep Recurrent Networks
– Recursive Neural Networks
• Long Short Term Memory (LSTM)
• Gated Recurrent Unit (GRU)
Recurrent Neural Networks
• Recurrent neural networks or RNNs (Rumelhart et al., 1986a) are a family
of neural networks for processing sequential data.
• A recurrent neural network is a neural network that is specialized for
processing a sequence of values x(1), . . . , x(T).
• A Recurrent Neural Network remembers the past, and its decisions are
influenced by what it has learnt from the past.
• Unlike a feedforward NN, an RNN learns from prior inputs and its own outputs.
• RNNs can take one or more input vectors and produce one or more output
vectors and the output(s) are influenced not just by weights applied on
inputs like a regular NN, but also by a “hidden” state vector representing
the context based on prior input(s)/output(s).
Recurrent Neural Networks
Parameter Sharing
• Common weights are shared across the different positions of the input sequence.
• Vanilla neural networks don't share parameters across inputs. Each node
has its own weights.
• This makes a vanilla NN incapable of processing sequential input.


Unfolding Computational Graphs
• A computational graph is a way to formalize the structure of a set of
computations, such as those involved in mapping inputs and parameters
to outputs and loss.
• Unfolding a recursive or recurrent computation produces a computational graph
that has a repetitive structure, typically corresponding to a chain of
events.
For example, consider the classical form of a dynamical system:

s(t) = f(s(t-1); θ)

where s(t) is called the state of the system and θ the parameters.


Unfolding Computational Graphs
• The equation is recurrent because the definition of s at time t refers back to the
same definition at time t − 1.
• For example, if we unfold the equation for τ = 3 time steps, we obtain

s(3) = f(s(2); θ) = f(f(s(1); θ); θ)
Unfolding Computational Graphs
Formula for calculating the current state:

ht = f(ht-1, xt)

ht -> current state
ht-1 -> previous state
xt -> input at time t

Formula for applying the activation function (tanh):

ht = tanh(whh · ht-1 + wxh · xt)

whh -> weight at the recurrent neuron
wxh -> weight at the input neuron
Unfolding Computational Graphs
Formula for calculating the output:

yt = Why · ht

yt -> output
Why -> weight at the output layer
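
A minimal NumPy sketch of one forward step under these formulas; the sizes and random initialization below are assumptions for illustration, not part of the original slides:

import numpy as np

# Assumed toy sizes: 4 input features, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
Wxh = rng.normal(size=(3, 4))   # wxh, weight at the input neuron
Whh = rng.normal(size=(3, 3))   # whh, weight at the recurrent neuron
Why = rng.normal(size=(2, 3))   # Why, weight at the output layer

h_prev = np.zeros(3)            # previous state ht-1
x_t = rng.normal(size=4)        # input xt

h_t = np.tanh(Whh @ h_prev + Wxh @ x_t)  # current state ht
y_t = Why @ h_t                          # output yt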
Training a RNN
Teacher Forcing
• Teacher forcing is a strategy for training recurrent neural networks that
uses the ground truth output from a prior time step as an input, instead of
the model's own output.
• Models that have recurrent connections from their outputs leading back
into the model may be trained with teacher forcing.
• Teacher forcing emerges from the maximum likelihood
criterion: during training the model receives the ground truth
output y(t) as input at time t + 1.
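
A minimal sketch of a teacher-forced training loop; decoder_step is a hypothetical one-step decoder (returning a probability vector over tokens and the next hidden state), not a function from the slides:

import numpy as np

def cross_entropy(probs, true_idx):
    return -np.log(probs[true_idx] + 1e-9)

def teacher_forced_loss(decoder_step, h0, start_token, targets):
    h, prev, loss = h0, start_token, 0.0
    for y_true in targets:
        probs, h = decoder_step(prev, h)   # one decoding step
        loss += cross_entropy(probs, y_true)
        prev = y_true  # teacher forcing: feed the ground truth y(t), not the model's own output
    return loss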
Training a RNN
Computing the Gradient in a Recurrent Neural Network
• Computing the gradient through a recurrent neural network is straightforward.
• The use of back-propagation on the unrolled graph is called the
back-propagation through time (BPTT) algorithm.
• Gradients obtained by back-propagation may then be used with any
general-purpose gradient-based techniques to train an RNN.
• Once the gradients on the internal nodes of the computational graph are
obtained, we can obtain the gradients on the parameter nodes.
• Because the parameters are shared across many time steps, we must take
some care when denoting calculus operations involving these variables.
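
A minimal NumPy sketch of BPTT for a tiny RNN with a squared-error loss; all sizes and the loss choice are assumptions for illustration. Note how the gradients on the shared weights accumulate across time steps:

import numpy as np

T, H = 4, 3                                   # assumed: 4 time steps, 3 hidden units
rng = np.random.default_rng(0)
Wxh, Whh, Why = rng.normal(size=(H, 1)), rng.normal(size=(H, H)), rng.normal(size=(1, H))
xs, ys = rng.normal(size=(T, 1, 1)), rng.normal(size=(T, 1, 1))

# Forward pass: store every hidden state for reuse in the backward pass.
hs = [np.zeros((H, 1))]
for t in range(T):
    hs.append(np.tanh(Wxh @ xs[t] + Whh @ hs[-1]))

# Backward pass through time.
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dh_next = np.zeros((H, 1))
for t in reversed(range(T)):
    dy = Why @ hs[t + 1] - ys[t]       # gradient of 0.5*(y_pred - y)^2 at step t
    dWhy += dy @ hs[t + 1].T           # shared weights: gradients accumulate
    dh = Why.T @ dy + dh_next          # from the output and from step t+1
    da = (1 - hs[t + 1] ** 2) * dh     # back through tanh
    dWxh += da @ xs[t].T
    dWhh += da @ hs[t].T
    dh_next = Whh.T @ da               # pass the gradient back to step t-1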
Types of RNN
• Bidirectional RNN
• Encoder-Decoder Sequence-to-Sequence Architectures
• Deep Recurrent Networks
• Recursive Neural Networks
Bidirectional RNNs
• In many applications we want to output a prediction of y(t) which may
depend on the whole input sequence.
• For example, in speech recognition, the correct interpretation of the
current sound as a phoneme may depend on the next few phonemes.
• In speech recognition and handwriting recognition tasks, where there
could be considerable ambiguity given just one part of the input, we often
need to know what's coming next to better understand the context and
interpret the present input.
• Bidirectional recurrent neural networks (or bidirectional RNNs) were
invented to address that need.
Bidirectional RNNs
• As the name suggests, bidirectional RNNs combine an
RNN that moves forward through time beginning from
the start of the sequence with another RNN that moves
backward through time beginning from the end of the
sequence.
• This allows the output units o(t) to compute a
representation that depends on both the past and the
future, but is most sensitive to the input values around
time t, without having to specify a fixed-size window
around t.
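
A minimal Keras sketch of a bidirectional RNN; the sequence length, feature count and layer sizes are assumptions for illustration:

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

# Assumed task: binary classification of sequences of length 20 with 8 features.
model = Sequential()
model.add(Bidirectional(LSTM(32), input_shape=(20, 8)))  # forward RNN + backward RNN
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')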
Encoder-Decoder Sequence-to-Sequence
Architectures
• RNN can be trained to map an input sequence to an output sequence
which is not necessarily of the same length.
• This comes up in many applications, such as speech recognition, machine
translation or question answering, where the input and output sequences
in the training set are generally not of the same length.
• The idea is: (1) an encoder or reader or input RNN processes the input
sequence. The encoder emits the context C, usually as a simple function of
its final hidden state.
• (2) a decoder or writer or output RNN is conditioned on that fixed-length
vector to generate the output sequence.
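
A minimal Keras sketch of this encoder-decoder idea; the vocabulary sizes and latent dimension are assumptions for illustration:

from keras.models import Model
from keras.layers import Input, LSTM, Dense

num_enc, num_dec, latent = 70, 90, 256   # assumed feature/vocabulary sizes

enc_in = Input(shape=(None, num_enc))
_, state_h, state_c = LSTM(latent, return_state=True)(enc_in)  # context C = final states

dec_in = Input(shape=(None, num_dec))
dec_seq = LSTM(latent, return_sequences=True)(dec_in, initial_state=[state_h, state_c])
dec_out = Dense(num_dec, activation='softmax')(dec_seq)

model = Model([enc_in, dec_in], dec_out)  # trained with teacher forcing on the decoder input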
Deep Recurrent Networks
• The computation in most RNNs can be decomposed into three blocks of
parameters and associated transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.

• When the network is unfolded, each of these corresponds to a shallow
transformation.
• By a shallow transformation, we mean a transformation that would be
represented by a single layer within a deep MLP.
Deep Recurrent Networks
• A recurrent neural network can be made
deep in many ways
(a) The hidden recurrent state can be broken
down into groups organized hierarchically.
(b) Deeper computation (e.g., an MLP) can be
introduced in the input-to-hidden, hidden-to-
hidden and hidden-to-output parts. This may
lengthen the shortest path linking different
time steps.
(c) The path-lengthening effect can be
mitigated by introducing skip connections.
Recursive Neural Networks
• Recursive neural networks represent yet another generalization of
recurrent networks, with a different kind of computational graph, which is
structured as a deep tree.
• Recursive neural networks were introduced by Pollack (1990) and their
potential use for learning to reason was described by Bottou (2011).
• Recursive networks have been successfully applied to processing data
structures as input to neural nets in natural language processing as well as
in computer vision.
• One clear advantage of recursive nets over recurrent nets is that for a
sequence of the same length τ, the depth can be drastically reduced from
τ to O(log τ ), which might help deal with long-term dependencies.
Recursive Neural Networks
• A recursive network has a computational graph that
generalizes that of the recurrent network from a
chain to a tree.
• A variable-size sequence x(1), x(2), …, x(t) can be
mapped to a fixed-size representation (the output o),
with a fixed set of parameters (the weight matrices
U, V, W).
• The figure illustrates a supervised learning case in
which some target y is provided which is associated
with the whole sequence.
Long Short Term Memory
The Challenge of Long-Term Dependencies
• The basic problem is that gradients propagated over many stages tend to
either vanish or explode.
• The difficulty with long-term dependencies arises from the exponentially
smaller weights given to long-term interactions compared to short-term
ones.
• One may hope that the problem can be avoided simply by staying in a
region of parameter space where the gradients do not vanish or explode.
The Challenge of Long-Term Dependencies
• Unfortunately, in order to store memories in a way that is robust to small
perturbations, the RNN must enter a region of parameter space where
gradients vanish.
• This does not mean that it is impossible to learn, but that it might take a
very long time to learn long-term dependencies, because the signal about
these dependencies will tend to be hidden by the smallest fluctuations
arising from short-term dependencies.
LSTM Networks
• Long Short Term Memory networks – usually just called “LSTMs” – are a
special kind of RNN, capable of learning long-term dependencies.
• They were introduced by Hochreiter & Schmidhuber (1997), and were
refined and popularized by many people in following work.
• They work tremendously well on a large variety of problems such as
unconstrained handwriting recognition, speech recognition, handwriting
generation, machine translation, image captioning, and parsing.
• Remembering information for long periods of time is their default
behaviour.
LSTM Networks
An LSTM module has a three-step process
• Every LSTM module has
  – a Forget Gate
  – an Input Gate
  – an Output Gate
• Forget Gate: decides how much of the past you should remember.
• Input Gate: decides how much of this unit's input is added to the current state.
• Output Gate: decides which part of the current cell state makes it to the output.
LSTM Networks
Forget Gate:
• This gate decides which information is to be omitted from the cell at that
particular time step.
• The decision is made by a sigmoid function.
• It looks at the previous state (ht-1) and the current input (xt), and outputs a
number between 0 (omit this) and 1 (keep this) for each number in the cell
state Ct−1:

ft = σ(Wf · [ht-1, xt] + bf)
LSTM Networks
Input Gate:
• A sigmoid function decides which values to let through (0 to 1), and a
tanh function gives weightage to the values that are passed, deciding
their level of importance (ranging from −1 to 1):

it = σ(Wi · [ht-1, xt] + bi)
C̃t = tanh(WC · [ht-1, xt] + bC)
Ct = ft * Ct−1 + it * C̃t

Output Gate:
• A sigmoid function decides which values of the cell state to let through
(0 to 1); a tanh of the cell state gives weightage to the values (ranging
from −1 to 1), and this is multiplied with the output of the sigmoid:

ot = σ(Wo · [ht-1, xt] + bo)
ht = ot * tanh(Ct)
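
A minimal NumPy sketch of one LSTM step implementing the three gates above; the packed weight layout is an assumption for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W has shape (4H, H + X): rows for the forget, input, output gates and the candidate.
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[:H])          # forget gate: how much of c_prev to keep
    i = sigmoid(z[H:2*H])       # input gate: how much new content to add
    o = sigmoid(z[2*H:3*H])     # output gate: what part of the cell to emit
    g = np.tanh(z[3*H:])        # candidate cell content, in [-1, 1]
    c_t = f * c_prev + i * g    # new cell state
    h_t = o * np.tanh(c_t)      # new hidden state / output
    return h_t, c_t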
Gated Recurrent Units (GRU)
• The GRU is an LSTM variant, introduced by K. Cho et al. in 2014.
• The GRU is the newer generation of recurrent neural networks and is
similar to an LSTM.
• The GRU retains the vanishing-gradient resistance of the LSTM, but GRUs
are internally simpler and faster than LSTMs.
• GRUs got rid of the cell state and use the hidden state to transfer
information.
• A GRU also has only two gates: a reset gate and an update gate.
Gated Recurrent Units (GRU)
Update Gate
• The update gate helps the model determine how much of the past
information (from previous time steps) needs to be passed along to the
future:

zt = σ(W(z) · xt + U(z) · ht-1)

• That is really powerful, because the model can decide to copy all the
information from the past and eliminate the risk of the vanishing gradient
problem.
• W(z) is the weight applied to the input xt; ht-1 is the previous state, with its own weight U(z).
Gated Recurrent Units (GRU)
Reset Gate
• This gate is used by the model to decide how much of the past
information to forget:

rt = σ(W(r) · xt + U(r) · ht-1)

• Unlike the LSTM, the GRU has no persistent cell state distinct from the
hidden state.
Gated Recurrent Units (GRU)
from keras.models import Sequential
from keras.layers import GRU, Dropout

def build_model(layers):
    # layers = [input_features, hidden_units]
    model = Sequential()
    model.add(GRU(layers[1], input_shape=(None, layers[0]),
                  activation='tanh', return_sequences=True))
    model.add(Dropout(0.15))
    return model
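
A hypothetical call such as build_model([8, 32]) then returns a model mapping sequences with 8 features per step to a 32-dimensional GRU output at every step; Dropout(0.15) randomly zeroes 15% of those outputs during training to reduce overfitting.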
GRU v/s LSTM
• GRU and LSTM have comparable performance and there is no simple way
to recommend one or the other for a specific task.
• GRUs are faster to train and need less data to generalize.
• When there is enough data, an LSTM’s greater expressive power may lead
to better results.
• Like LSTMs, GRUs are drop-in replacements for the SimpleRNN cell.
Autoencoders
What is autoencoder?
• An autoencoder is a neural network that is trained to attempt to copy its
input to its output.
• Internally, it has a hidden layer h that describes a code used to represent
the input.
• The network may be viewed as consisting of two parts: an encoder
function h = f(x) and a decoder that produces a reconstruction r = g(h).
What is autoencoder?
• Autoencoder, by design, reduces data dimensions by
learning how to ignore the noise in the data.
• If an autoencoder succeeds in simply learning to set
g(f(x)) = x everywhere, then it is not especially useful.
• Usually they are restricted in ways that allow them to
copy only approximately, and to copy only input that
resembles the training data.
• Because the model is forced to prioritize which
aspects of the input should be copied, it often learns
useful properties of the data.
What are autoencoders used for?
• Data denoising and dimensionality reduction for data visualization are
considered the two main practical applications of autoencoders.
• With appropriate dimensionality and sparsity constraints, autoencoders
can learn data projections that are more interesting than PCA or other
basic techniques.
Applications of Autoencoder
• Dimensionality Reduction
• Image Compression
• Image Denoising
• Feature Extraction
• Image generation
• Sequence to sequence prediction
• Recommendation system
Components of Autoencoder
• 1- Encoder: in which the model learns how to reduce the input dimensions
and compress the input data into an encoded representation.
• 2- Bottleneck: the layer that contains the compressed
representation of the input data. This is the lowest possible dimensionality
of the input data.
Components of Autoencoder
• 3- Decoder: in which the model learns how to reconstruct the data from
the encoded representation to be as close to the original input as possible.
• 4- Reconstruction Loss: the method that measures how
well the decoder is performing and how close the output is to the original
input.
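
A minimal Keras sketch tying the four components together, assuming 784-dimensional inputs (e.g. flattened 28x28 images) and a 32-dimensional bottleneck:

from keras.models import Model
from keras.layers import Input, Dense

inp = Input(shape=(784,))
code = Dense(32, activation='relu')(inp)       # encoder -> bottleneck h = f(x)
out = Dense(784, activation='sigmoid')(code)   # decoder -> reconstruction r = g(h)

autoencoder = Model(inp, out)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')  # reconstruction loss
# autoencoder.fit(x_train, x_train, ...)  # note: the input is also the target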
Types of Autoencoders
• Undercomplete Autoencoders
• Convolutional autoencoder
• Regularized Autoencoders
• Sparse Autoencoders
• Denoising Autoencoders
Undercomplete Autoencoders
• The goal of the autoencoder is to capture the most important features present
in the data.
• Undercomplete autoencoders have a smaller dimension for the hidden layer
compared to the input layer. This helps to obtain important features from
the data.
• The objective is to minimize the loss function by penalizing g(f(x)) for being
different from the input x.
Undercomplete Autoencoders
• When the decoder is linear and we use a mean squared error loss function,
the undercomplete autoencoder generates a reduced feature space
similar to PCA.
• We get a powerful nonlinear generalization of PCA when the encoder
function f and decoder function g are nonlinear.
• Undercomplete autoencoders do not need any regularization as they
maximize the probability of data rather than copying the input to the
output.
Convolutional autoencoder

• In the traditional architecture of autoencoders, it is not taken into


account the fact that a signal can be seen as a sum of other signals.
• Convolutional Autoencoders (CAE), on the other way, use the
convolution operator to accommodate this observation.
• Convolution operator allows filtering an input signal in order to
extract some part of its content.
• They learn to encode the input in a set of simple signals and then try
to reconstruct the input from them.
Regularized autoencoder
• Undercomplete autoencoders, with code dimension less than the input
dimension, can learn the most salient features of the data distribution
• A problem occurs if the hidden code is allowed to have dimension equal to
the input, or in the overcomplete case in which the hidden code has
dimension greater than the input.
• Regularized autoencoders use a loss function that encourages the model
to have other properties besides the ability to copy its input to its output.
Regularized autoencoder
• Two types of regularized autoencoders:
  – Sparse autoencoder
  – Denoising autoencoder
Sparse autoencoder

• Sparse autoencoders have hidden nodes greater than input nodes.


They can still discover important features from the data.
• Sparsity constraint is introduced on the hidden layer. This is to prevent
output layer copy input data.
• Sparse autoencoders have a sparsity penalty, Ω(h), a value close to
zero but not zero. Sparsity penalty is applied on the hidden layer in
addition to the reconstruction error. This prevents overfitting.
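
A minimal Keras sketch, assuming an L1 activity regularizer on the hidden layer as the sparsity penalty Ω(h) added to the reconstruction error; the sizes are illustrative:

from keras import regularizers
from keras.models import Model
from keras.layers import Input, Dense

inp = Input(shape=(784,))
h = Dense(128, activation='relu',
          activity_regularizer=regularizers.l1(1e-5))(inp)  # sparsity penalty on h
out = Dense(784, activation='sigmoid')(h)

model = Model(inp, out)
model.compile(optimizer='adam', loss='binary_crossentropy')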
Sparse autoencoder
• Sparse autoencoders are typically used to learn features for another task
such as classification
• Sparse autoencoders take the highest activation values in the hidden layer
and zero out the rest of the hidden nodes. This prevents the autoencoder
from using all of the hidden nodes at a time, forcing only a reduced number
of hidden nodes to be used.
• As we activate and deactivate hidden nodes for each row in the dataset,
each hidden node extracts a feature from the data.
Denoising Autoencoders(DAE)
• Denoising refers to intentionally adding noise to the raw input before
providing it to the network. Denoising can be achieved using stochastic
mapping.
• Denoising autoencoders create a corrupted copy of the input by
introducing some noise. This helps prevent the autoencoder from copying the
input to the output without learning features about the data.
Denoising Autoencoders(DAE)
• Corruption of the input can be done randomly by setting some of the
inputs to zero; the remaining inputs are copied unchanged into the noised input.
• Denoising autoencoders must remove the corruption to generate an
output that is similar to the input. The output is compared with the input, not
with the noised input. To minimize the loss function we continue training until
convergence.
• Denoising autoencoders minimize the loss function between the output
and the original (uncorrupted) input.
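
A minimal NumPy sketch of this corruption step (masking noise); the drop probability is an assumption for illustration:

import numpy as np

def corrupt(x, drop_prob=0.3, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) > drop_prob  # keep each input with probability 1 - drop_prob
    return x * mask                         # zeroed entries are the corruption

# Training pairs: the model sees corrupt(x_train) but is scored against the clean x_train,
# e.g. model.fit(corrupt(x_train), x_train, ...)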
Stacked Denoising Autoencoders
• A stacked autoencoder is a neural network with multiple layers of sparse
autoencoders.
• Adding more hidden layers than just one to an autoencoder helps reduce
high-dimensional data to a smaller code representing important features.
• Each hidden layer is a more compact representation than the last hidden
layer.
• We can also denoise the input and then pass the data through the stacked
autoencoder; this is called a stacked denoising autoencoder.
Variational Autoencoders
Limitations of autoencoders
• The fundamental problem with autoencoders, for generation, is that the
latent space they convert their inputs to, where their encoded vectors
lie, may not be continuous or allow easy interpolation.
• The high degree of freedom of the autoencoder, which makes it possible to
encode and decode with no information loss (despite the low
dimensionality of the latent space), leads to severe overfitting.
• The autoencoder is solely trained to encode and decode with as little loss
as possible, no matter how the latent space is organized.

Note: the latent space is the space where your features lie.
Variational Autoencoders
• A variational autoencoder can be defined as being an autoencoder whose
training is regularized to avoid overfitting and ensure that the latent space
has good properties that enable generative process.
• The variational autoencoder or VAE (Kingma, 2013; Rezende et al., 2014) is
a directed model that uses learned approximate inference and can be
trained purely with gradient-based methods.
• Just as a standard autoencoder, a variational autoencoder is an
architecture composed of both an encoder and a decoder, and it is
trained to minimise the reconstruction error between the
encoded-decoded data and the initial data.
Variational Autoencoders
• However, in order to introduce some regularization of the latent
space, instead of encoding an input as a single point, we encode it as a
distribution over the latent space.
The variational autoencoder is trained as follows:
• first, the input is encoded as a distribution over the latent space
• second, a point from the latent space is sampled from that distribution
• third, the sampled point is decoded and the reconstruction error can be
computed
• finally, the reconstruction error is backpropagated through the network
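
A minimal NumPy sketch of the sampling step (the reparameterization trick), assuming the encoder outputs a mean mu and a log-variance log_var for each input:

import numpy as np

def sample_latent(mu, log_var, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(mu.shape)      # noise from N(0, I)
    return mu + np.exp(0.5 * log_var) * eps  # z = mu + sigma * eps, differentiable in mu and log_var

# The training loss adds a KL term that pushes each encoded distribution toward N(0, I):
#   loss = reconstruction_error - 0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))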
Variational Autoencoders
• Difference between autoencoder (deterministic) and variational
autoencoder (probabilistic).
Generative Adversarial Networks
Introduction
• Generative Adversarial Networks (GANs) are a powerful class of neural
networks that are used for unsupervised learning. It was developed and
introduced by Ian J. Goodfellow in 2014.
• GANs are basically made up of a system of two neural network
models which compete with each other and are able to analyze, capture
and copy the variations within a dataset.
• The main focus of GANs is to generate
data from scratch: mostly images, but other domains, including music, have
also been explored.
What is the Need for GAN?
• It has been noticed that most mainstream neural nets can be easily
fooled into misclassifying things by adding only a small amount of noise
into the original data.
• Surprisingly, the model after adding noise has higher confidence in the
wrong prediction than when it predicted correctly.
• The reason for such adversarial behaviour is that most machine learning models learn
from a limited amount of data, which is a huge drawback, as it makes them prone to
overfitting.
• An intuitive way to understand GANs is to imagine a forger trying to create
a fake Picasso painting. At first, the forger is pretty bad at the task. He
mixes some of his fakes with authentic Picassos and shows them all to an
art dealer. The art dealer makes an authenticity assessment for each
painting and gives the forger feedback about what makes a Picasso look
like a Picasso. The forger goes back to his studio to prepare some new
fakes. As time goes on, the forger becomes increasingly competent at
imitating the style of Picasso, and the art dealer becomes increasingly
expert at spotting fakes. In the end, they have on their hands some
excellent fake Picassos.
• That’s what a GAN is: a forger network and an expert network, each being
trained to best the other.
GAN Architecture
Generator and discriminator
• A GAN is composed of two deep networks: the generator and
the discriminator.
• The generator network directly produces samples x = g(z; θ(g)).
• Its adversary, the discriminator network, attempts to distinguish between
samples drawn from the training data and samples drawn from the
generator.
• The discriminator emits a probability value given by d(x; θ(d)), indicating
the probability that x is a real training example rather than a fake sample
drawn from the model.
Generator and discriminator
• A GAN is mathematically described by the minimax objective below:

min_G max_D V(D, G) = Ex~Pdata(x)[log D(x)] + Ez~P(z)[log(1 − D(G(z)))]

• G = Generator; D = Discriminator
• Pdata(x) = distribution of real data; P(z) = distribution of the generator's input noise
• x = sample from Pdata(x); z = sample from P(z)
• D(x) = Discriminator network output; G(z) = Generator network output
Training GAN
• One unusual capability of the GAN training procedure is that it can fit
probability distributions that assign zero probability to the training points.
• Rather than maximizing the log probability of specific points, the generator net
learns to trace out a manifold whose points resemble training points in some
way.
• Somewhat paradoxically, this means that the model may assign a
log-likelihood of negative infinity to the test set, while still representing a
manifold that a human observer judges to capture the essence of the
generation task.
• This is not clearly an advantage or a disadvantage, and one may also guarantee
that the generator network assigns non-zero probability to all points simply by
making the last layer of the generator network add Gaussian noise to all of the
generated values.
• Generator networks that add Gaussian noise in this manner sample from the
same distribution that one obtains by using the generator network to
parametrize the mean of a conditional Gaussian distribution.
Training GAN
So, basically, training a GAN has two parts:
• Part 1: The Discriminator is trained while the Generator is idle. In this
phase, the Generator is only forward propagated and no back-propagation is
done on it. The Discriminator is trained on real data for n epochs to see if it
can correctly predict them as real. In the same phase, the Discriminator is
also trained on the fake data generated by the Generator to see if it
can correctly predict them as fake.
• Part 2: The Generator is trained while the Discriminator is idle. After the
Discriminator has been trained on the Generator's fake data, we
can get its predictions and use the results to train the Generator,
improving on the previous state to try and fool the Discriminator.
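
A minimal sketch of one such two-part step in Keras style; generator, discriminator and gan (the generator stacked under a frozen discriminator) are assumed to be pre-built, compiled models, not code from the slides:

import numpy as np

def train_gan_step(generator, discriminator, gan, real_batch, latent_dim=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(real_batch)
    # Part 1: train the discriminator on real (label 1) and fake (label 0) batches.
    fake_batch = generator.predict(rng.standard_normal((n, latent_dim)))
    discriminator.train_on_batch(real_batch, np.ones((n, 1)))
    discriminator.train_on_batch(fake_batch, np.zeros((n, 1)))
    # Part 2: train the generator through the frozen discriminator,
    # labelling its fakes as "real" so it learns to fool the discriminator.
    noise = rng.standard_normal((n, latent_dim))
    gan.train_on_batch(noise, np.ones((n, 1)))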
Backpropagation
• The discriminator outputs a value D(x) indicating the chance that x is a
real image.
• Our objective is to maximize the chance of recognizing real images as real
and generated images as fake, i.e. the maximum likelihood of the observed
data.
• To measure the loss, we use cross-entropy as in most deep learning: p
log(q). For real images, p (the true label) equals 1. For generated images,
we reverse the label (i.e. one minus the label). So the discriminator
objective becomes:

max_D Ex~Pdata(x)[log D(x)] + Ez~P(z)[log(1 − D(G(z)))]
Backpropagation
• On the generator side, the objective function wants the model to generate
images with the highest possible value of D(x) to fool the discriminator:

min_G Ez~P(z)[log(1 − D(G(z)))]

• We often define a GAN as a minimax game in which G wants to
minimize the value function V while D wants to maximize it.
Types of GAN
• Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the
Discriminator are simple multi-layer perceptrons. In a vanilla GAN, the
algorithm is really simple: it tries to optimize the mathematical equation
using stochastic gradient descent.
• Conditional GAN (CGAN): CGAN can be described as a deep learning
method in which some conditional parameters are put into place. In a
CGAN, an additional parameter 'y' is added to the Generator for
generating the corresponding data. Labels are also put into the input to
the Discriminator in order to help it distinguish the real
data from the fake generated data.
Types of GAN
• Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and
most successful implementations of GAN. It is composed of ConvNets in
place of multi-layer perceptrons. The ConvNets are implemented without max
pooling, which is replaced by convolutional stride. Also, the layers are
not fully connected.
• Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible
image representation consisting of a set of band-pass images, spaced an
octave apart, plus a low-frequency residual. This approach uses multiple
Generator and Discriminator networks at different levels of the
Laplacian pyramid. It is mainly used because it produces very
high-quality images. The image is first down-sampled at each layer of the
pyramid and then up-scaled again at each layer in a backward pass, where
the image acquires noise from the Conditional GAN at these layers until
it reaches its original size.
Types of GAN
• Super Resolution GAN (SRGAN): SRGAN, as the name suggests, is a way of
designing a GAN in which a deep neural network is used along with an
adversarial network in order to produce higher-resolution images. This
type of GAN is particularly useful in optimally up-scaling native
low-resolution images to enhance their details while minimizing errors.
Applications of GAN
• Generate Examples for Image Datasets
• Generate Photographs of Human Faces
• Generate Realistic Photographs
• Generate Cartoon Characters
• Image-to-Image Translation
• Text-to-Image Translation
• Semantic-Image-to-Photo Translation
• Face Frontal View Generation
• Generate New Human Poses
• Photos to Emojis
• Photograph Editing
• Face Aging
• Photo Blending
• Super Resolution
• Photo Inpainting
• Clothing Translation
• Video Prediction
• 3D Object Generation
Summary
A GAN consists of a generator network coupled with a discriminator network. The
discriminator is trained to differentiate between the output of the generator and real images
from a training dataset, and the generator is trained to fool the discriminator. Remarkably, the
generator never sees images from the training set directly; the information it has about the
data comes from the discriminator.

GANs are difficult to train, because training a GAN is a dynamic process rather than a
simple gradient descent process with a fixed loss landscape. Getting a GAN to train correctly
requires using a number of heuristic tricks, as well as extensive tuning.

GANs can potentially produce highly realistic images. But unlike VAEs, the latent space
they learn doesn't have a neat continuous structure and thus may not be suited for certain
practical applications, such as image editing via latent-space concept vectors.
Deep Boltzmann Machine
Introduction
• Boltzmann machines were originally introduced as a general
“connectionist” approach to learning arbitrary probability distributions
over binary vectors
• They were one of the first neural networks capable of learning internal
representations, and are able to represent and (given sufficient time) solve
difficult combinatoric problems.
• The Boltzmann machine is an energy-based model:

P(x) = exp(−E(x)) / Z

• where E(x) is the energy function and Z is the partition function that
ensures that Σx P(x) = 1.
Introduction
• The energy function of the Boltzmann machine is given by

E(x) = −xᵀUx − bᵀx

• where U is the “weight” matrix of model parameters and b is the vector of
bias parameters.
• Training examples for the Boltzmann machine are n-dimensional binary vectors.
• The Boltzmann machine becomes more powerful when not all the
variables are observed.
• The units in the Boltzmann machine are divided into 'visible' units, V, and
'hidden' units, H. The visible units are those that receive information from the
'environment', i.e. the training set is a set of binary vectors over the set V.
Applications of Boltzmann Machines
• Dimensionality Reduction
• Classification
• Regression
• Collaborative Filtering
• Feature Learning
• Topic Modeling
Types of Boltzmann Machines
• Types of Boltzmann Machines:
  1. Restricted Boltzmann Machine (RBM)
  2. Deep Belief Network (DBN)
  3. Deep Boltzmann Machine (DBM)
Restricted Boltzmann Machine (RBM)
• Invented under the name harmonium (Smolensky, 1986), restricted
Boltzmann machines are some of the most common building blocks of
deep probabilistic models.
• The Restricted Boltzmann Machine is an undirected graphical model.
• Restricted Boltzmann Machines are shallow, two-layer neural nets that
constitute the building blocks of deep belief networks.
• The first layer of the RBM is called the visible, or input, layer, and the
second is the hidden layer.
Restricted Boltzmann Machine (RBM)
• The restriction in a Restricted Boltzmann Machine is that there is no
intra-layer communication.
• Each node is a locus of computation that processes input and begins by
making stochastic decisions about whether to transmit that input or not.
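
RBMs are commonly trained with contrastive divergence (CD-1), mentioned again in the DBM training section below. A minimal NumPy sketch for a binary RBM, with assumed array shapes (W: visible x hidden; b_v, b_h: bias vectors):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_v, b_h, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    ph0 = sigmoid(v0 @ W + b_h)                       # hidden probabilities given the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # stochastic binary hidden states
    pv1 = sigmoid(h0 @ W.T + b_v)                     # reconstruction of the visible units
    ph1 = sigmoid(pv1 @ W + b_h)                      # hidden probabilities given the reconstruction
    W += lr * (v0.T @ ph0 - pv1.T @ ph1)              # positive minus negative statistics
    b_v += lr * (v0 - pv1).sum(axis=0)
    b_h += lr * (ph0 - ph1).sum(axis=0)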
Restricted Boltzmann Machine (RBM)
Difference between Autoencoders & RBMs
• An autoencoder is a simple 3-layer neural network where output units are directly
connected back to input units.
• The number of hidden units is much smaller than the number of visible ones.
• The task of training is to minimize the reconstruction error, i.e. find the most efficient
compact representation for the input data.
• RBM shares a similar idea, but it uses stochastic units with a particular distribution
instead of deterministic units.
• The task of training is to find out how these two sets of variables are actually connected
to each other.
• One aspect that distinguishes the RBM from autoencoders is that it has two biases:
  – The hidden bias helps the RBM produce the activations on the forward pass, while
  – the visible layer's biases help the RBM learn the reconstructions on the backward
    pass.
Deep Belief Networks
• Deep belief networks (DBNs) were one of the first non-convolutional
models to successfully admit training of deep architectures.
• A deep belief network is a hybrid graphical model involving both directed
and undirected connections.
• Like an RBM, it has no intra-layer connections. However, a DBN has
multiple hidden layers, and thus there are connections between hidden
units that are in separate layers.
• All of the local conditional probability distributions needed by the deep
belief network are copied directly from the local conditional probability
distributions of its constituent RBMs.
Deep Belief Networks
• Deep belief networks are generative models with
several layers of latent variables.
• The latent variables are typically binary, while the
visible units may be binary or real.
• The connections between the top two layers are
undirected.
• The connections between all other layers are directed, with the arrows
pointed toward the layer that is closest to the data.
Deep Belief Networks
• A DBN solves two problems: an inference problem and a learning problem.
• Inference Problem: infer the states of the unobserved variables.
• Learning Problem: adjust the interactions between the variables to make the
network more likely to generate the observed data.
The two most significant properties of deep belief nets are:
• There is an efficient, layer-by-layer procedure for learning the top-down,
generative weights that determine how the variables in one layer depend on
the variables in the layer above.
• After learning, the values of the latent variables in every layer can be inferred
by a single, bottom-up pass that starts with an observed data vector in the
bottom layer and uses the generative weights in the reverse direction.
Applications of Deep Belief Nets
• Image/ Face Recognition
• Video Sequence recognition
• Motion-capture data
Deep Boltzmann Machine (DBM)
• Unsupervised, probabilistic, generative model with entirely undirected
connections between different layers
• Contains visible units and multiple layers of hidden units
• Like the RBM, no intra-layer connection exists in the DBM. Connections exist only
between units of neighboring layers
• Network of symmetrically connected stochastic binary units
• A DBM can be organized as a bipartite graph with odd layers on one side and
even layers on the other
• Units within a layer are independent of each other but are dependent
on neighbouring layers
Deep Boltzmann Machine (DBM)
• Learning is made efficient by layer-by-layer pre-training: greedy layer-wise
pre-training, slightly different from that used for the DBN.
• After learning the binary features in each layer, the DBM is fine-tuned by
back-propagation.
• A DBM is an energy-based model, meaning that the
joint probability distribution over the model variables
is parametrized by an energy function E .
Deep Boltzmann Machine (DBM)
• Bi-partite graph structure
DBM vs RBM & DBN
• In comparison to the RBM energy function, the DBM energy function includes
connections between the hidden units (latent variables) in the form of the
weight matrices (W(2) and W(3)).
• The approximate inference procedure for a DBM uses a top-down feedback in
addition to the usual bottom-up pass, allowing Deep Boltzmann Machines to
better incorporate uncertainty about ambiguous inputs.
• A disadvantage of the DBM is that its approximate inference, based on the
mean field approach, is slower than the single bottom-up pass used in Deep
Belief Networks. Mean field inference needs to be performed for every new test
input.
DBM Mean Field Inference
• Computing the posterior distribution is known as an inference problem.
• We have a data distribution P(x), and computing the posterior distribution is
often intractable.
• We can approximate intractable inference with simpler, tractable inference
by introducing a distribution Q(x) which is the best approximation of P(x).
• Q(x) becomes the mean field approximation when the variables in the Q
distribution are assumed independent of one another.
• Our goal is to minimize the KL divergence between the approximate
distribution and the actual distribution.
DBM Training
• The Boltzmann machine uses randomly initialized Markov chains to
approximate the gradient of the likelihood function, which is too slow to be
practical.
• The DBM uses greedy layer-by-layer pre-training to speed up learning the
weights. It relies on learning a stack of Restricted Boltzmann Machines with
a small modification, using contrastive divergence.
• The key intuition for greedy layer-wise training of a DBM is that we double
the input for the lowest-level RBM and the hidden units of the top-level RBM.
• Lower-level RBM inputs are doubled to compensate for the lack of top-down
input into the first hidden layer. Similarly, for the top-level RBM, we double
the hidden units to compensate for the lack of bottom-up input.
Thank You
