• Vanilla neural networks don’t share parameters across inputs. Each node
has its own weight.
• Diagram notation: Yt denotes the output at time step t, and Why denotes the weight matrix at the output layer (hidden-to-output).
Training an RNN
Teacher Forcing
• Teacher forcing is a strategy for training recurrent neural networks that
uses the ground truth output from a prior time step as an input, instead of
the model's own output.
• Models that have recurrent connections from their outputs leading back
into the model may be trained with teacher forcing.
• Teacher forcing is a procedure that emerges from the maximum likelihood
criterion, in which during training the model receives the ground truth
output y(t) as input at time t + 1.
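A minimal sketch of the idea (our own NumPy illustration with made-up dimensions, not code from the slides): during training the ground-truth output is fed back as the next input, whereas at inference time the model's own prediction would be.

# Minimal sketch of teacher forcing with a hypothetical single-layer RNN.
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4                      # sequence length, input size, hidden size
W_xh = rng.normal(size=(d_h, d_in)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(d_h, d_h)) * 0.1    # hidden-to-hidden weights
W_hy = rng.normal(size=(d_in, d_h)) * 0.1   # hidden-to-output weights

y_true = rng.normal(size=(T, d_in))         # ground-truth target sequence
h = np.zeros(d_h)
x = np.zeros(d_in)                          # start-of-sequence input
for t in range(T):
    h = np.tanh(W_xh @ x + W_hh @ h)        # recurrent update
    y_pred = W_hy @ h                       # model prediction at time t
    x = y_true[t]                           # teacher forcing: the next input is the ground truth,
                                            # not y_pred (which would be used at inference time)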
Computing the Gradient in a Recurrent Neural Network
• Computing the gradient through a recurrent neural network is
straightforward
• The use of back-propagation on the unrolled graph is called the back-
propagation through time (BPTT) algorithm.
• Gradients obtained by back-propagation may then be used with any
general-purpose gradient-based techniques to train an RNN.
• Once the gradients on the internal nodes of the computational graph are
obtained, we can obtain the gradients on the parameter nodes.
• Because the parameters are shared across many time steps, we must take
some care when denoting calculus operations involving these variables.
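As an illustration of BPTT (our own sketch; PyTorch is assumed only for its automatic differentiation): unrolling the recurrence and calling backward() accumulates, for each shared weight matrix, the sum of its gradient contributions from every time step.

# BPTT is ordinary back-propagation applied to the unrolled computational graph.
import torch

T, d_in, d_h = 5, 3, 4
W_xh = (0.1 * torch.randn(d_h, d_in)).requires_grad_()   # shared input-to-hidden weights
W_hh = (0.1 * torch.randn(d_h, d_h)).requires_grad_()    # shared hidden-to-hidden weights
W_hy = (0.1 * torch.randn(1, d_h)).requires_grad_()      # shared hidden-to-output weights

x = torch.randn(T, d_in)          # input sequence
y = torch.randn(T, 1)             # target sequence
h = torch.zeros(d_h)

loss = 0.0
for t in range(T):                # unroll the recurrence over time
    h = torch.tanh(W_xh @ x[t] + W_hh @ h)
    loss = loss + ((W_hy @ h - y[t]) ** 2).sum()

loss.backward()                   # gradients flow back through every time step
# W_hh.grad now holds the sum of the gradient contributions from all T steps.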
Types of RNN
• Bidirectional RNN
• Encoder-Decoder Sequence-to-Sequence Architectures
• Deep Recurrent Networks
• Recursive Neural Networks
Bidirectional RNNs
• In many applications we want to output a prediction of y(t) which may
depend on the whole input sequence.
• For example, in speech recognition, the correct interpretation of the
current sound as a phoneme may depend on the next few phonemes
• In speech recognition and handwriting recognition tasks, where there
could be considerable ambiguity given just one part of the input, we often
need to know what is coming next to better understand the context and
correctly interpret the present input.
• Bidirectional recurrent neural networks (or bidirectional RNNs) were
invented to address that need.
Bidirectional RNNs
• As the name suggests, bidirectional RNNs combine an
RNN that moves forward through time beginning from
the start of the sequence with another RNN that moves
backward through time beginning from the end of the
sequence.
• This allows the output units o(t) to compute a
representation that depends on both the past and the
future but is most sensitive to the input values around
time t, without having to specify a fixed-size window
around t
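A minimal sketch of a bidirectional recurrent layer (Keras assumed; the layer sizes and input shape are illustrative, not from the slides): one LSTM reads the sequence forward, another reads it backward, and their outputs are combined at every time step.

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

model = Sequential([
    # input: sequences of 100 steps with 26 features per step (e.g. acoustic features)
    Bidirectional(LSTM(64, return_sequences=True), input_shape=(100, 26)),
    Dense(10, activation="softmax"),   # per-time-step prediction o(t)
])
model.compile(loss="categorical_crossentropy", optimizer="adam")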
Encoder-Decoder Sequence-to-Sequence Architectures
• An RNN can be trained to map an input sequence to an output sequence
that is not necessarily of the same length.
• This comes up in many applications, such as speech recognition, machine
translation or question answering, where the input and output sequences
in the training set are generally not of the same length
• The idea is: (1) an encoder or reader or input RNN processes the input
sequence. The encoder emits the context C, usually as a simple function of
its final hidden state.
• (2) a decoder or writer or output RNN is conditioned on that fixed-length
vector to generate the output sequence.
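A minimal encoder-decoder sketch (Keras functional API assumed; token counts and the latent dimension are illustrative): the encoder's final hidden state acts as the context C, and the decoder is conditioned on it through its initial state.

from keras.models import Model
from keras.layers import Input, LSTM, Dense

num_enc_tokens, num_dec_tokens, latent_dim = 70, 90, 256

# Encoder ("reader"): keep only its final states as the context C.
encoder_inputs = Input(shape=(None, num_enc_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
context = [state_h, state_c]

# Decoder ("writer"): conditioned on C via its initial state; during training
# the decoder inputs are the target sequence shifted by one step (teacher forcing).
decoder_inputs = Input(shape=(None, num_dec_tokens))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs, initial_state=context)
decoder_outputs = Dense(num_dec_tokens, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")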
Deep Recurrent Networks
• The computation in most RNNs can be decomposed into three blocks of
parameters and associated transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.
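One common way to make a recurrent network deep (our own Keras sketch; layer sizes are illustrative, not from the slides) is to stack recurrent layers, so that the hidden-state computation itself becomes a deep transformation.

from keras.models import Sequential
from keras.layers import GRU, Dense

model = Sequential([
    GRU(64, return_sequences=True, input_shape=(None, 20)),  # input -> hidden block
    GRU(64, return_sequences=True),                          # hidden -> hidden, deepened
    GRU(64),                                                 # last recurrent layer
    Dense(1),                                                # hidden -> output block
])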
The Challenge of Long-Term Dependencies
• One may hope that the problem of vanishing or exploding gradients can be
avoided simply by staying in a region of parameter space where the
gradients do not vanish or explode.
• Unfortunately, in order to store memories in a way that is robust to small
perturbations, the RNN must enter a region of parameter space where
gradients vanish
• It does not mean that it is impossible to learn, but that it might take a very
long time to learn long-term dependencies
LSTM Networks
• LSTM (Long Short-Term Memory) networks address this problem with gated cells
that control what information is kept, added and emitted:
• Forget Gate: decides how much of the past you should remember.
• Input Gate: decides how much of this unit is added to the current state.
• Output Gate: decides which part of the current cell makes it to the output.
LSTM Networks
Forget Gate:
• This gate decides which information should be discarded from the cell state
at that particular time step.
• The decision is made by a sigmoid function.
• It looks at the previous state (ht−1) and the current input (Xt) and outputs a
number between 0 (omit this) and 1 (keep this) for each number in the cell
state Ct−1.
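For reference, in the standard LSTM formulation (notation ours, matching the description above) the forget gate is computed as
ft = σ(Wf · [ht−1, Xt] + bf)
where Wf and bf are the forget gate's weight matrix and bias, and [ht−1, Xt] is the concatenation of the previous hidden state and the current input.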
LSTM Networks
Input Gate:
• A sigmoid function decides which values to let through (outputting values
between 0 and 1), and a tanh function gives a weight to the values that are
passed, deciding their level of importance on a scale from −1 to 1.
Output Gate:
• A sigmoid function decides which values to let through (outputting values
between 0 and 1), and a tanh function gives a weight to the values that are
passed, deciding their level of importance on a scale from −1 to 1; this is
then multiplied by the output of the sigmoid.
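For reference, the standard formulations of these two gates (notation ours) are:
Input gate: it = σ(Wi · [ht−1, Xt] + bi), with candidate values C̃t = tanh(WC · [ht−1, Xt] + bC) and cell-state update Ct = ft ⊙ Ct−1 + it ⊙ C̃t
Output gate: ot = σ(Wo · [ht−1, Xt] + bo), with hidden state ht = ot ⊙ tanh(Ct)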
Gated Recurrent Units (GRU)
• The GRU is an LSTM variant introduced by K. Cho et al. in 2014.
• The GRU is a newer generation of recurrent neural network and is
similar to an LSTM.
• The GRU retains the LSTM's resistance to the vanishing gradient problem,
but is internally simpler and faster than the LSTM.
• GRUs got rid of the separate cell state and use the hidden state to transfer
information.
• A GRU also has only two gates: a reset gate and an update gate.
Gated Recurrent Units (GRU)
Update Gate
• The update gate helps the model to determine how much of the past
information (from previous time steps) needs to be passed along to the
future.
• That is really powerful because the model can decide to copy all the
information from the past, eliminating the risk of the vanishing gradient
problem.
• The update gate at time step t is computed as zt = σ(W(z) xt + U(z) ht−1),
where W(z) is the weight applied to the input xt and U(z) is the weight
applied to the previous state ht−1.
Gated Recurrent Units (GRU)
Reset Gate
• This gate is used by the model to decide how much of the past
information to forget.
• Unlike the LSTM, the GRU has no persistent cell state distinct from the
hidden state.
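For reference, in one common formulation (same notation as the update gate above; the exact convention varies between papers) the reset gate and the resulting hidden state are
rt = σ(W(r) xt + U(r) ht−1)
h̃t = tanh(W xt + rt ⊙ (U ht−1))
ht = zt ⊙ ht−1 + (1 − zt) ⊙ h̃t
so the reset gate rt controls how much of the past hidden state enters the candidate memory h̃t, and the update gate zt blends the old hidden state with the new candidate.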
Gated Recurrent Units (GRU)
from keras.models import Sequential
from keras.layers import GRU, Dropout, Dense

def build_model(layers):
    # layers is assumed to be [input_dim, hidden_units, output_dim]
    model = Sequential()
    model.add(GRU(layers[1], input_shape=(None, layers[0])))
    model.add(Dropout(0.15))
    model.add(Dense(layers[2]))
    return model
GRU v/s LSTM
• GRU and LSTM have comparable performance and there is no simple way
to recommend one or the other for a specific task.
• GRUs are faster to train and need less data to generalize.
• When there is enough data, an LSTM’s greater expressive power may lead
to better results.
• Like LSTMs, GRUs are drop-in replacements for the SimpleRNN cell.
Autoencoders
What is autoencoder?
• An autoencoder is a neural network that is trained to attempt to copy its
input to its output.
• 1- Encoder: in which the model learns how to reduce the input dimensions
and compress the input data into an encoded representation.
• 2- Bottleneck: the layer that contains the compressed (encoded)
representation of the input; this is the lowest possible dimension of the
input data.
• 3- Decoder: in which the model learns how to reconstruct the data from
the encoded representation to be as close to the original input as possible.
• Common types of autoencoders include:
– Sparse autoencoder
– Denoising autoencoder
Sparse autoencoder
Note: the latent space is the space where your features lie.
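A minimal sketch of a sparse autoencoder (Keras assumed; the input and latent dimensions are illustrative, not from the slides): an L1 activity penalty on the encoded layer pushes most latent activations toward zero.

from keras.models import Model
from keras.layers import Input, Dense
from keras import regularizers

input_dim, latent_dim = 784, 32                      # e.g. flattened 28x28 images
inputs = Input(shape=(input_dim,))
encoded = Dense(latent_dim, activation="relu",
                activity_regularizer=regularizers.l1(1e-5))(inputs)   # sparsity penalty
decoded = Dense(input_dim, activation="sigmoid")(encoded)             # reconstruction

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x, x, ...)   # trained to copy its input to its output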
Variational Autoencoders
• A variational autoencoder can be defined as an autoencoder whose
training is regularized to avoid overfitting and to ensure that the latent space
has good properties that enable a generative process.
• The variational autoencoder or VAE (Kingma, 2013; Rezende et al., 2014) is
a directed model that uses learned approximate inference and can be
trained purely with gradient-based methods.
• Just as a standard autoencoder, a variational autoencoder is an
architecture composed of both an encoder and a decoder and that is
trained to minimise the reconstruction error between the encoded-
decoded data and the initial data.
Variational Autoencoders
• However, in order to introduce some regularization of the latent
space, instead of encoding an input as a single point, we encode it as a
distribution over the latent space.
The variational autoencoder is trained as follows:
• first, the input is encoded as distribution over the latent space
• second, a point from the latent space is sampled from that distribution
• third, the sampled point is decoded and the reconstruction error can be
computed
• finally, the reconstruction error is backpropagated through the network
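The four steps above can be made concrete with a small sketch. This is our own NumPy illustration with a linear encoder and decoder and made-up dimensions; a real VAE would use neural networks for both and an autodiff framework to backpropagate the loss.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_z = 8, 2
W_mu, W_logvar = rng.normal(size=(d_z, d_in)), rng.normal(size=(d_z, d_in))
W_dec = rng.normal(size=(d_in, d_z))

x = rng.normal(size=d_in)
# 1) encode the input as a distribution over the latent space
mu, log_var = W_mu @ x, W_logvar @ x
# 2) sample a latent point (reparameterization trick keeps sampling differentiable)
z = mu + np.exp(0.5 * log_var) * rng.normal(size=d_z)
# 3) decode the sampled point and compute the reconstruction error
x_hat = W_dec @ z
rec_loss = np.sum((x - x_hat) ** 2)
# 4) regularization: KL divergence between N(mu, var) and N(0, I)
kl_loss = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
loss = rec_loss + kl_loss   # this total loss is what gets backpropagated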
Variational Autoencoders
• Difference between autoencoder (deterministic) and variational
autoencoder (probabilistic).
Generative Adversarial Networks
Introduction
• Generative Adversarial Networks (GANs) are a powerful class of neural
networks that are used for unsupervised learning. They were developed and
introduced by Ian J. Goodfellow and colleagues in 2014.
• GANs are made up of a system of two competing neural network models
that are able to analyze, capture and copy the variations within a dataset.
• The main focus of a GAN (Generative Adversarial Network) is to generate
data from scratch, mostly images, although other domains, including music,
have also been explored.
What is the Need for GAN?
• It has been observed that most mainstream neural nets can be easily
fooled into misclassifying things when only a small amount of noise is added
to the original data.
• Surprisingly, after the noise is added, the model often has higher confidence
in the wrong prediction than it had in the correct one.
• The reason for this adversarial vulnerability is that most machine learning
models learn from a limited amount of data, which is a huge drawback, as it
makes them prone to overfitting.
• An intuitive way to understand GANs is to imagine a forger trying to create
a fake Picasso painting. At first, the forger is pretty bad at the task. He
mixes some of his fakes with authentic Picassos and shows them all to an
art dealer. The art dealer makes an authenticity assessment for each
painting and gives the forger feedback about what makes a Picasso look
like a Picasso. The forger goes back to his studio to prepare some new
fakes. As time goes on, the forger becomes increasingly competent at
imitating the style of Picasso, and the art dealer becomes increasingly
expert at spotting fakes. In the end, they have on their hands some
excellent fake Picassos.
• That’s what a GAN is: a forger network and an expert network, each being
trained to best the other.
GAN Architecture
Generator and discriminator
• A GAN is composed of two deep networks, the generator and
the discriminator.
• The generator network directly produces samples x = g(z; θ(g)).
• Its adversary, the discriminator network, attempts to distinguish between
samples drawn from the training data and samples drawn from the
generator.
• The discriminator emits a probability value given by d(x; θ(d)), indicating
the probability that x is a real training example rather than a fake sample
drawn from the model.
Generator and discriminator
• A GAN is mathematically described by the standard minimax objective:
min_g max_d V(d, g) = E_x∼p_data [log d(x)] + E_z∼p_z [log(1 − d(g(z)))]
GANs are difficult to train, because training a GAN is a dynamic process rather than a
simple gradient descent process with a fixed loss landscape. Getting a GAN to train correctly
requires using a number of heuristic tricks, as well as extensive tuning.
GANs can potentially produce highly realistic images. But unlike VAEs, the latent space
they learn doesn't have a neat continuous structure and thus may not be suited for certain
practical applications, such as image editing via latent-space concept vectors.
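To make the adversarial setup above concrete, here is a minimal sketch of one training step. It is our own illustration (PyTorch, made-up layer sizes and placeholder data), not code from the slides: the discriminator is updated to separate real from generated samples, then the generator is updated to fool it.

import torch
import torch.nn as nn

d_z, d_x = 16, 32
g = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d_x))   # generator
d = nn.Sequential(nn.Linear(d_x, 64), nn.ReLU(), nn.Linear(64, 1))     # discriminator
opt_g = torch.optim.Adam(g.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(d.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, d_x)            # a batch of "real" samples (placeholder data)
z = torch.randn(8, d_z)               # latent noise

# Discriminator step: push d(real) toward 1 and d(g(z)) toward 0.
fake = g(z).detach()                  # do not backprop into the generator here
loss_d = bce(d(real), torch.ones(8, 1)) + bce(d(fake), torch.zeros(8, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: try to make the discriminator label generated samples as real.
loss_g = bce(d(g(z)), torch.ones(8, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()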
Deep Boltzmann Machine
Introduction
• Boltzmann machines were originally introduced as a general
“connectionist” approach to learning arbitrary probability distributions
over binary vectors
• They were one of the first neural networks capable of learning internal
representations, and are able to represent and (given sufficient time) solve
difficult combinatoric problems.
• The Boltzmann machine is an energy-based model: it defines the probability of
a binary vector x as P(x) = exp(−E(x)) / Z,
• where E(x) is the energy function and Z is the partition function that
ensures that ∑x P(x) = 1.
Introduction
• The energy function of the Boltzmann machine is given by
E(x) = −xᵀ U x − bᵀ x,
where U is the matrix of connection weights between units and b is the vector
of bias parameters.