
Deep Learning Tutorial

Reference: Hung-yi Lee


Deep learning attracts lots of attention.
• Google Trends: search interest in "deep learning" rises sharply over 2007–2015.

Deep learning obtains many exciting results.

Among the talks this afternoon, this one focuses on the technical part.

Outline

Part I: Introduction of Deep Learning

Part II: Why Deep?

Part III: Tips for Training Deep Neural Network

Part IV: Neural Network with Memory


Part I:
Introduction of
Deep Learning

What people already knew in the 1980s


Example Application
• Handwriting Digit Recognition

  Input: an image of a handwritten digit → Machine → Output: "2"

  Input  x1, x2, …, x256      16 x 16 = 256 pixels, ink → 1, no ink → 0
  Output y1, y2, …, y10       Each dimension represents the confidence
                              of a digit, e.g. y1 = 0.1 ("is 1"),
                              y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0"),
                              so the image is recognized as "2".
Example Application
• Handwriting Digit Recognition

  The machine is a function f: R^256 → R^10 that maps the 256-dimensional
  pixel vector x1, …, x256 to the 10-dimensional output y1, …, y10 ("2").
  In deep learning, the function f is represented by a neural network.
Element of Neural Network
• Neuron: f: R^K → R

  Inputs a1, a2, …, aK, weights w1, w2, …, wK, bias b:

      z = a1 w1 + a2 w2 + ⋯ + aK wK + b

  The output is a = σ(z), where σ is the activation function.
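As a concrete illustration, here is a minimal sketch of this weighted-sum-plus-activation computation in Python (NumPy); the weights, inputs, and bias are made-up values, not from the slide:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # z = a1*w1 + ... + aK*wK + b, output = sigma(z)
    z = np.dot(a, w) + b
    return sigmoid(z)

# Example with K = 3 (arbitrary illustrative values)
a = np.array([1.0, -1.0, 0.5])
w = np.array([0.2, 0.3, -0.4])
b = 0.1
print(neuron(a, w, b))  # a single scalar activation
```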
Neural Network
• Connecting many neurons gives a network:

  Input layer: x1, x2, …, xN
  Hidden layers: Layer 1, Layer 2, …, Layer L (each a set of neurons)
  Output layer: y1, y2, …, yM

Deep means many hidden layers.


Example of Neural Network
• A small network with sigmoid activations.  For input (1, −1), the
  first layer has weights (1, −2) and (−1, 1) with biases 1 and 0,
  giving z = 4 → σ(4) ≈ 0.98 and z = −2 → σ(−2) ≈ 0.12.

  Sigmoid function:  σ(z) = 1 / (1 + e^(−z))
Example of Neural Network
• Propagating the inputs through all three layers, the network defines
  a function f: R^2 → R^2 with

      f(1, −1) = (0.62, 0.83)        f(0, 0) = (0.51, 0.85)

  Different parameters define different functions.
Matrix Operation
• The first layer above is one matrix operation.  With
  W = [[1, −2], [−1, 1]], x = (1, −1) and b = (1, 0):

      σ(W x + b) = σ((4, −2)) = (0.98, 0.12)
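A quick numeric check of this layer in NumPy, a sketch using the weights and bias from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])
b = np.array([1.0, 0.0])
x = np.array([1.0, -1.0])

a = sigmoid(W @ x + b)
print(a)  # approximately [0.982, 0.119], matching the 0.98 / 0.12 on the slide
```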
Neural Network
• Layer by layer, with weight matrices W^1, W^2, …, W^L and bias
  vectors b^1, b^2, …, b^L:

      a^1 = σ(W^1 x + b^1)
      a^2 = σ(W^2 a^1 + b^2)
      ⋯
      y   = σ(W^L a^{L−1} + b^L)

• The whole network is one nested function:

      y = f(x) = σ(W^L ⋯ σ(W^2 σ(W^1 x + b^1) + b^2) ⋯ + b^L)

• Because everything is matrix multiplication, parallel computing
  techniques (e.g. GPUs) can be used to speed it up.
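The nested formula above translates directly into a loop over layers; a minimal sketch, where the layer sizes and random parameters are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Apply y = sigma(W^L ... sigma(W^1 x + b^1) ... + b^L) layer by layer
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [256, 500, 500, 10]     # e.g. 256 inputs, two hidden layers, 10 outputs
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y = forward(rng.standard_normal(256), weights, biases)
print(y.shape)  # (10,)
```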
Softmax
• Softmax layer as the output layer

  Ordinary layer: y1 = σ(z1), y2 = σ(z2), y3 = σ(z3).
  In general the outputs can be any values and may not be easy to
  interpret as a distribution.

Softmax
• Softmax layer as the output layer

      y_i = e^{z_i} / Σ_j e^{z_j}      Probability: 1 > y_i > 0, Σ_i y_i = 1

  Example: z = (3, 1, −3) gives e^z ≈ (20, 2.7, 0.05), so
  y ≈ (0.88, 0.12, ≈0).
How to set network parameters
• θ = {W^1, b^1, W^2, b^2, ⋯, W^L, b^L}

  The network maps the 256 pixel values (ink → 1, no ink → 0) through
  the layers and a softmax output to y1, …, y10.

  Set the network parameters θ such that:
  • Input the image of "1" → y1 has the maximum value
  • Input the image of "2" → y2 has the maximum value
  • …

  How can we make the neural network achieve this?
Training Data
• Preparing training data: images and their labels

“5” “0” “4” “1”

“9” “2” “1” “3”

Using the training data to find the network parameters.
Cost
• Given a set of network parameters θ, each training example has a
  cost value.

  Example: for the image of "1", the network outputs
  y = (0.2, 0.3, …, 0.5) while the target is (1, 0, …, 0); the cost
  L(θ) measures how far the output is from the target.

  The cost can be the Euclidean distance or the cross entropy between
  the network output and the target.
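As a sketch, both cost choices for a single example; the output vector uses the illustrative values from the slide, with the unspecified entries set to zero:

```python
import numpy as np

def squared_error(y, target):
    # Euclidean-distance-style cost
    return np.sum((y - target) ** 2)

def cross_entropy(y, target, eps=1e-12):
    # Cross entropy between the target distribution and the network output
    return -np.sum(target * np.log(y + eps))

y = np.array([0.2, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5])
target = np.zeros(10); target[0] = 1.0   # the image is a "1"
print(squared_error(y, target), cross_entropy(y, target))
```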
Total Cost
• For all R training examples x^1, …, x^R with targets ŷ^1, …, ŷ^R,
  each example r has a cost L^r(θ).  The total cost is

      C(θ) = Σ_{r=1}^{R} L^r(θ)

  It measures how bad the network parameters θ are on this task.
  Training means finding the network parameters θ* that minimize this
  value.
Gradient Descent
• Assume there are only two parameters w1 and w2 in the network:
  θ = {w1, w2}.  The error surface plots C over (w1, w2); the colors
  represent the value of C.

  • Randomly pick a starting point θ^0
  • Compute the negative gradient at θ^0:

        ∇C(θ^0) = [ ∂C(θ^0)/∂w1 ,  ∂C(θ^0)/∂w2 ]^T

  • Move by −η∇C(θ^0), where η is the learning rate
Gradient Descent
• Repeat: at each point θ^t, compute the negative gradient −∇C(θ^t),
  multiply it by the learning rate η, and update

      θ^{t+1} = θ^t − η∇C(θ^t)

  Eventually we reach a minimum.
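A minimal gradient-descent sketch on a toy two-parameter cost; the quadratic cost here is made up purely to show the update rule:

```python
import numpy as np

def C(theta):
    # Toy convex cost over (w1, w2), a stand-in for the real error surface
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def grad_C(theta):
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

eta = 0.1                                              # learning rate
theta = np.random.default_rng(0).standard_normal(2)   # random starting point theta^0
for t in range(100):
    theta = theta - eta * grad_C(theta)                # theta^{t+1} = theta^t - eta * grad C
print(theta, C(theta))                                 # close to (3, -1), cost close to 0
```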
Local Minima
• Gradient descent never guarantees reaching the global minimum.
  Different initial points θ^0 reach different minima, so you get
  different results.

  "Who is Afraid of Non-Convex Loss Functions?"
  http://videolectures.net/eml07_lecun_wia/

Besides local minima ……
• Very slow at a plateau          (∇C(θ) ≈ 0)
• Stuck at a saddle point         (∇C(θ) = 0)
• Stuck at a local minimum        (∇C(θ) = 0)
In the physical world ……
• Momentum: a ball rolling down the cost surface does not stop
  immediately at a plateau or in a small dip; its accumulated momentum
  carries it onward.
  How about putting this phenomenon into gradient descent?

Momentum
• Movement = Negative of Gradient + Momentum (the previous movement)
• Even where the gradient is 0, the momentum term keeps the parameters
  moving, so the real movement can pass plateaus and shallow dips.
• This still does not guarantee reaching the global minimum, but it
  gives some hope ……
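A sketch of the momentum update on the same toy cost as before; the momentum coefficient 0.9 is a typical choice, not something specified on the slide:

```python
import numpy as np

def grad_C(theta):
    # Gradient of the same toy quadratic cost used in the earlier sketch
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

eta, mu = 0.1, 0.9                 # learning rate and momentum coefficient
theta = np.zeros(2)
movement = np.zeros(2)
for t in range(300):
    # Movement = Negative of Gradient + Momentum (previous movement)
    movement = mu * movement - eta * grad_C(theta)
    theta = theta + movement
print(theta)                       # converges near (3, -1)
```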
Mini-batch
• Randomly initialize θ^0
• Pick the 1st mini-batch (e.g. examples x^1, x^31, …):
      C = L^1 + L^31 + ⋯        θ^1 ← θ^0 − η∇C(θ^0)
• Pick the 2nd mini-batch (e.g. examples x^2, x^16, …):
      C = L^2 + L^16 + ⋯        θ^2 ← θ^1 − η∇C(θ^1)
• ……
• C is different each time we update the parameters!

Mini-batch
• Original gradient descent computes the total C on all training data
  for every update; with mini-batches each update uses only part of
  the data, so the path is unstable, but the updates are faster and in
  practice work better.
• Until all mini-batches have been picked: that is one epoch.
• Repeat the above process for many epochs.
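A sketch of this training procedure, assuming a hypothetical helper `grad_loss(theta, batch)` that returns the gradient of the summed loss over the examples in the batch (not something defined on the slides):

```python
import numpy as np

def train(theta, data, grad_loss, eta=0.1, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    for epoch in range(epochs):
        order = rng.permutation(len(data))          # reshuffle examples each epoch
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            # C is computed on this mini-batch only, so it differs at every update
            theta = theta - eta * grad_loss(theta, batch)
        # one epoch = every mini-batch has been picked once
    return theta
```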


Backpropagation
• A network can have millions of parameters.
• Backpropagation is the way to compute the gradients
efficiently (not today)
• Ref:
  http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
• Many toolkits can compute the gradients automatically
• Ref:
  http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
Part II:
Why Deep?
Deeper is Better?

  Layer X Size   Word Error Rate (%)   |   Layer X Size   Word Error Rate (%)
  1 X 2k         24.2                  |
  2 X 2k         20.4                  |
  3 X 2k         18.4                  |
  4 X 2k         17.8                  |
  5 X 2k         17.2                  |   1 X 3772       22.5
  7 X 2k         17.1                  |   1 X 4634       22.6
                                       |   1 X 16k        22.1

  Not surprising: more parameters, better performance.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem
• Any continuous function f: R^N → R^M can be realized by a network
  with one hidden layer (given enough hidden neurons).
  Reference for the reason:
  http://neuralnetworksanddeeplearning.com/chap4.html

• So why a "deep" neural network rather than a "fat" neural network?


Fat + Short v.s. Thin + Tall
• A shallow network (fat + short) and a deep network (thin + tall)
  with the same number of parameters: which one is better?
Fat + Short v.s. Thin + Tall
• The word error rate table above (Seide et al., Interspeech 2011)
  answers this: for a comparable number of parameters, deep networks
  (e.g. 7 X 2k, 17.1%) clearly beat shallow ones (e.g. 1 X 16k, 22.1%).
Why Deep?
• Deep → Modularization

  Train four classifiers directly on images:
  Classifier 1: girls with long hair
  Classifier 2: boys with long hair   ← weak: only a few examples
  Classifier 3: girls with short hair
  Classifier 4: boys with short hair

Why Deep?
• Deep → Modularization

  Instead, first train basic classifiers for the attributes
  (boy or girl?  long or short hair?).  Each basic classifier can have
  sufficient training examples.

Why Deep?
• Deep → Modularization

  The classifiers for the attributes can be trained with little data.
  The four final classifiers then share the basic classifiers as
  modules, so even "boys with long hair" (little data) can be handled.

Why Deep?
• Deep → Modularization → Less training data?
  (Deep learning also works on small data sets like TIMIT.)

  The modularization is learned automatically from data: the first
  layer learns the most basic classifiers, the second layer uses the
  1st layer as modules to build classifiers, the third layer uses the
  2nd layer as modules, and so on.
SVM: apply a hand-crafted kernel function φ(x), then a simple classifier.
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf

Deep Learning: a learnable kernel.  The hidden layers learn the feature
transform φ(x) from data (x1 … xN → …), and the output layer is the
simple classifier producing y1 … yM.
Hard to get the power of Deep …

Before 2006, deeper usually did not imply better.


Training Deep Networks
• Build a feature space
• Note that this is what we do with SVM kernels, or trained
hidden layers in BP, etc., but now we will build the
feature space using deep architectures
• Unsupervised training between layers can decompose
the problem into distributed sub-problems (with higher
levels of abstraction) to be further decomposed at
subsequent layers



Training Deep Networks
• Difficulties of supervised training of deep networks
• Early layers of MLP do not get trained well
• Diffusion of Gradient – error attenuates as it propagates to earlier layers
• Leads to very slow training
• Exacerbated since the top couple of layers can usually learn any task "pretty well",
  so the error signal reaching earlier layers drops quickly as the top layers "mostly"
  solve the task; lower layers never get the opportunity to use their capacity
  to improve results, they just do a random feature map
• Need a way for early layers to do effective work
• Instability of gradient in deep networks: Vanishing or exploding gradient
• Product of many terms, which unless “balanced” just right, is unstable
• Either early or late layers stuck while “opposite” layers are learning
• Often not enough labeled data available while there may be lots of
unlabeled data
• Can we use unsupervised/semi-supervised approaches to take advantage of
the unlabeled data
• Deep networks tend to have more sensitive training issues than shallow
  networks during supervised training



Greedy Layer-Wise Training
• One answer is greedy layer-wise training
1. Train first layer using your data without the labels (unsupervised)
• Since there are no targets at this level, labels don't help. Could also
use the more abundant unlabeled data which is not part of the training
set (i.e. self-taught learning).
2. Then freeze the first layer parameters and start training the
second layer using the output of the first layer as the
unsupervised input to the second layer
3. Repeat this for as many layers as desired
• This builds our set of robust features
4. Use the outputs of the final layer as inputs to a supervised
layer/model and train the last supervised layer(s) (leave early
weights frozen)
5. Unfreeze all weights and fine tune the full network by training
with a supervised approach, given the pre-training weight
settings
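A compact sketch of steps 1–3 of this procedure as code; `train_autoencoder(inputs, n_hidden)` and `encode(ae, inputs)` are hypothetical helper functions standing in for whatever single-layer unsupervised trainer is used:

```python
import numpy as np

def greedy_layerwise_pretrain(X_unlabeled, layer_sizes, train_autoencoder, encode):
    """Steps 1-3: train one auto-encoder per layer on the previous layer's output.

    `train_autoencoder` and `encode` are assumed helpers (not from the slides)
    that fit a single-layer auto-encoder and return its hidden representation.
    """
    layers, inputs = [], X_unlabeled
    for n_hidden in layer_sizes:
        ae = train_autoencoder(inputs, n_hidden)   # unsupervised, no labels needed
        layers.append(ae)                          # freeze this layer's parameters
        inputs = encode(ae, inputs)                # its output feeds the next layer
    return layers                                  # robust feature hierarchy

# Step 4: train a supervised output layer on the final features (early weights frozen).
# Step 5: unfreeze everything and fine-tune the whole network with supervised training.
```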



Deep Net with Greedy Layer-Wise Training
• (Diagram) Original inputs → unsupervised learning builds a new
  feature space → a supervised ML model is learned on top.


Greedy Layer-Wise Training
• Greedy layer-wise training avoids many of the
problems of trying to train a deep net in a supervised
fashion
• Each layer gets full learning focus in its turn since it is the
only current "top" layer
• Can take advantage of unlabeled data
• When you finally tune the entire network with supervised
training the network weights have already been adjusted so
that you are in a good error basin and just need fine tuning.
This helps with problems of
• Ineffective early layer learning
• Deep network local minima
• We will discuss the two most common approaches
• Stacked Auto-Encoders
• Deep Belief Networks



Self Taught vs Unsupervised
Learning
• When using Unsupervised Learning as a pre-processor to supervised
learning you are typically given examples from the same distribution as
the later supervised instances will come from
• Assume the distribution comes from a set containing just examples from a
  defined set of possible output classes, but the label is not available (e.g.
  images of cars vs. trains vs. motorcycles)
• In Self-Taught Learning we do not require that the later supervised
instances come from the same distribution
• e.g., Do self-taught learning with any images, even though later you will do
supervised learning with just cars, trains and motorcycles.
• These types of distributions are more readily available than ones which just
have the classes of interest
• However, if distributions are very different…
• New tasks share concepts/features from existing data and statistical
regularities in the input distribution that many tasks can benefit from
• Note similarities to supervised multi-task and transfer learning
• Both approaches reasonable in deep learning models



Auto-Encoders
• A type of unsupervised learning which tries to discover generic
features of the data
• Learn identity function by learning important sub-features (not by just
passing through data)
• Compression, etc.
• Can use just new features in the new training set or concatenate both
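A minimal sketch of such an auto-encoder trained by plain gradient descent: it learns to reproduce its input through a smaller hidden layer, and the hidden activations become the new features.  The data, layer sizes, and learning rate are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=200, seed=0):
    # Learn to reconstruct X through a bottleneck, so the hidden units
    # become generic feature detectors rather than a pass-through.
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.standard_normal((n_in, n_hidden)) * 0.1; b1 = np.zeros(n_hidden)
    W2 = rng.standard_normal((n_hidden, n_in)) * 0.1; b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)          # encode
        X_hat = sigmoid(H @ W2 + b2)      # decode
        err = X_hat - X                   # reconstruction error
        dZ2 = err * X_hat * (1 - X_hat)
        dZ1 = (dZ2 @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dZ2 / len(X); b2 -= lr * dZ2.mean(0)
        W1 -= lr * X.T @ dZ1 / len(X); b1 -= lr * dZ1.mean(0)
    return W1, b1                         # the learned encoder = new features

X = (np.random.default_rng(1).random((100, 20)) > 0.5).astype(float)
W1, b1 = train_autoencoder(X, n_hidden=8)
features = sigmoid(X @ W1 + b1)           # new feature representation of X
print(features.shape)                     # (100, 8)
```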

Stacked Auto-Encoders
• Bengio (2007) – After Deep Belief Networks (2006)
• Stack many (sparse) auto-encoders in succession and train
them using greedy layer-wise training
• Drop the decode output layer each time



Stacked Auto-Encoders
• Do supervised training on the last layer using final
features
• Then do supervised training on the entire network to fine-tune all
  weights

Sparse Encoders
• Auto encoders will often do a dimensionality reduction
• PCA-like or non-linear dimensionality reduction
• This leads to a "dense" representation which is nice in
terms of parsimony
• All features typically have non-zero values for any input and the
combination of values contains the compressed information
• However, this distributed and entangled representation can
often make it more difficult for successive layers to pick out
the salient features
• A sparse representation uses more features where at any
given time many/most of the features will have a 0 value
• Thus there is an implicit compression each time but with varying
nodes
• This leads to more localist variable length encodings where a
particular node (or small group of nodes) with value 1 signifies the
presence of a feature (small set of bases)
• A type of simplicity bottleneck (regularizer)
• This is easier for subsequent layers to use for learning



How do we implement a sparse
Auto-Encoder?
• Use more hidden nodes in the encoder
• Use regularization techniques which encourage
sparseness (e.g. a significant portion of nodes have
0 output for any given input)
• Penalty in the learning function for non-zero nodes
• Weight decay
• etc.
• De-noising Auto-Encoder
• Stochastically corrupt training instance each time, but
still train auto-encoder to decode the uncorrupted
instance, forcing it to learn conditional dependencies
within the instance
• Better empirical results, handles missing values well
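Building on the earlier auto-encoder sketch, the de-noising variant only changes what the encoder sees; the 30% corruption level is an assumed hyper-parameter, not from the slide:

```python
import numpy as np

def corrupt(X, drop_prob=0.3, seed=0):
    # Stochastically zero out a fraction of each input (masking noise)
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) > drop_prob
    return X * mask

# Train exactly as in the earlier auto-encoder sketch, but feed the corrupted
# input to the encoder while still asking the decoder to reconstruct the clean X:
#   H     = sigmoid(corrupt(X) @ W1 + b1)
#   X_hat = sigmoid(H @ W2 + b2)
#   err   = X_hat - X        # the target is the *uncorrupted* instance
```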
Sparse Representation
• For bases below, which is easier to see intuition for
current pattern - if a few of these are on and the rest
0, or if all have some non-zero value?
• Easier to learn if sparse



Stacked Auto-Encoders
• Concatenation approach (i.e. using both hidden features
and original features in final (or other) layers) can be better
if not doing fine tuning. If fine tuning, the pure replacement
approach can work well.
• Always fine tune if there is a sufficient amount of labeled
data
• For real valued inputs, MLP training is like regression and
thus could use linear output node activations, still sigmoid
at hidden
• Stacked Auto-Encoders empirically not quite as accurate as
DBNs (Deep Belief Networks)
• (with De-noising auto-encoders, stacked auto-encoders competitive
with DBNs)
• Not generative like DBNs, though recent work with de-noising auto-
encoders may allow generative capacity



A single unit computes a weighted sum of its inputs and then applies f:
with weights w1 = −0.06, w2 = −2.5, w3 = 1.4 and inputs 2.7, −8.6, 0.002,

  x = −0.06 × 2.7 + (−2.5) × (−8.6) + 1.4 × 0.002 = 21.34

and the unit outputs f(x).
A dataset
  Fields                 class
  1.4  2.7  1.9          0
  3.8  3.4  3.2          0
  6.4  2.8  1.7          1
  4.1  0.1  0.2          0
  etc …

Training the neural network
• Initialise with random weights
• Present a training pattern, e.g. (1.4, 2.7, 1.9)
• Feed it through to get the output, e.g. 0.8
• Compare with the target output (0): error = 0.8
• Adjust the weights based on the error
• Present the next pattern, e.g. (6.4, 2.8, 1.7):
  output 0.9, target 1, error = −0.1; adjust the weights again
• And so on ….

Repeat this thousands, maybe millions of times – each time


taking a random training instance, and making slight
weight adjustments
Algorithms for weight adjustment are designed to make
changes that will reduce the error
The decision boundary perspective…
• Initial random weights give an arbitrary boundary
• Present a training instance / adjust the weights (repeated many times)
• Eventually the boundary separates the classes ….
The point I am trying to make
• weight-learning algorithms for NNs are dumb

• they work by making thousands and thousands of tiny


adjustments, each making the network do better at the most
recent pattern, but perhaps a little worse on many others

• but, by dumb luck, eventually this tends to be good enough to


learn effective classifiers for many real applications
Some other points
Detail of a standard NN weight learning algorithm –
later

If f(x) is non-linear, a network with 1 hidden layer


can, in theory, learn perfectly any classification
problem. A set of weights exists that can produce the
targets from the inputs. The problem is finding them.
Some other ‘by the way’ points
If f(x) is linear, the NN can only draw straight decision
boundaries (even if there are many layers of units)
Some other ‘by the way’ points
NNs use nonlinear f(x) so they can draw complex boundaries, but keep
the data unchanged.  SVMs only draw straight lines, but they transform
the data first in a way that makes that OK.
Feature detectors
• What is this unit doing?  Hidden layer units become self-organised
  feature detectors.

What does this unit detect?
• (Its weight image has strong +ve weights along the top row of the
  input and low/zero weights elsewhere.)
• It will send a strong signal for a horizontal line in the top row,
  ignoring everywhere else.

What does this unit detect?
• (Strong +ve weights clustered in the top-left corner.)
• Strong signal for a dark area in the top left corner.

What features might you expect a good NN to learn, when trained with
data like this (handwritten digits)?
• vertical lines
• horizontal lines
• small circles
• But what about position invariance?  Our example unit detectors were
  tied to specific parts of the image.
successive layers can learn higher-level features …
• Early layers detect lines in specific positions, etc.
• Higher-level detectors combine them: “horizontal line”,
  “RHS vertical line”, “upper loop”, etc.
• What does this unit detect?
So: multiple layers make sense
• Your brain works that way.
• Many-layer neural network architectures should be capable of
  learning the true underlying features and ‘feature logic’, and
  therefore generalise very well …
• But, until very recently, our weight-learning algorithms simply did
  not work on multi-layer architectures.
Along came deep learning …
The new way to train multi-layer NNs…
• Train the first layer first, then the next layer, then the next …
  and finally the output layer.
• EACH of the (non-output) layers is trained to be an auto-encoder.
  Basically, it is forced to learn good features that describe what
  comes from the previous layer.
an auto-encoder is trained, with an absolutely standard weight-
adjustment algorithm, to reproduce the input

By making this happen with (many) fewer units than the inputs, this
forces the ‘hidden layer’ units to become good feature detectors

intermediate layers are each trained to be auto-encoders (or similar);
the final layer is trained to predict the class based on the outputs
from the previous layers
And that’s that
• That’s the basic idea
• There are many many types of deep learning,
• different kinds of autoencoder, variations on
architectures and training algorithms, etc…
• Very fast growing area …
Part III:
Tips for Training DNN
Recipe for Learning
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/

Recipe for Learning
• Don’t forget overfitting!
• If the results on the training data are bad: modify the network, or
  use a better optimization strategy.
• If the results on the training data are good but the testing results
  are bad: prevent overfitting.
Recipe for Learning
• Modify the Network: new activation functions, for example ReLU or
  Maxout
• Better optimization strategy: adaptive learning rates
• Prevent overfitting: Dropout
  (Only use this approach when you have already obtained good results
  on the training data.)
Part III:
Tips for Training DNN
New Activation Function
ReLU
• Rectified Linear Unit (ReLU): a = z if z > 0, a = 0 if z ≤ 0
  (compare with the sigmoid σ(z))

  Reasons:
  1. Fast to compute
  2. Biological reason
  3. Equivalent to an infinite number of sigmoids with different biases
  4. Addresses the vanishing gradient problem

  [Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]
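A one-line sketch of the activation itself:

```python
import numpy as np

def relu(z):
    # a = z for z > 0, a = 0 otherwise
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 4.0])))  # [0.  0.  0.  1.5 4. ]
```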
Vanishing Gradient Problem
• In a deep sigmoid network, the layers closer to the input have
  smaller gradients, learn very slowly, and stay almost random, while
  the layers closer to the output have larger gradients, learn fast,
  and converge, even though they sit on nearly random lower layers!?
• In 2006, people used RBM pre-training.  In 2015, people use ReLU.
Vanishing Gradient Problem
• Intuitive way to see the gradient: perturb a weight w in an early
  layer by Δw and observe how the cost changes,

      ∂C/∂w ≈ ΔC/Δw

  A large Δw near the input is squashed by every sigmoid it passes
  through, so its effect on the output, and hence ΔC, is small:
  earlier layers get smaller gradients.
ReLU
• With a = z (active) or a = 0 (inactive), the neurons that output 0
  can be removed from the network for that input.

ReLU
• What remains is a thinner, linear network for that input region, so
  the gradients do not become smaller as they propagate back.
Maxout             ReLU is a special case of Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
  Group the linear units and take the max within each group, e.g. with
  inputs x1, x2: one first-layer group computes (5, 7) → max = 7,
  another (−1, 1) → max = 1; in the next layer one group gives
  (1, 2) → 2 and another (4, 3) → 4.
• You can have more than 2 elements in a group.
Maxout             ReLU is a special case of Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function in a maxout network can be any piecewise
  linear convex function
• How many pieces depends on how many elements are in a group
  (2 elements in a group → 2 pieces; 3 elements → 3 pieces)
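A sketch of a maxout layer; the group size and dimensions are arbitrary illustrative choices:

```python
import numpy as np

def maxout(x, W, b, group_size=2):
    # W: (n_in, n_units * group_size), b: (n_units * group_size,)
    z = x @ W + b
    z = z.reshape(-1, group_size)       # one row per group
    return z.max(axis=1)                # max within each group

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 6))         # 3 maxout units, 2 elements per group
b = np.zeros(6)
print(maxout(x, W, b))                  # 3 activations

# ReLU as a special case: fix one element of each pair at z = 0 (zero weights
# and bias for the second element), so max(z, 0) = ReLU(z).
```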


Part III:
Tips for Training DNN
Adaptive Learning Rate
Learning Rate
• Set the learning rate η carefully.
• If the learning rate is too large, the cost may not decrease after
  each update (the steps −η∇C(θ) overshoot).
• If the learning rate is too small, training will be too slow.
• Can we give different parameters different learning rates?
Adagrad
• Original gradient descent: θ^t ← θ^{t−1} − η∇C(θ^{t−1})
• Each parameter w is considered separately:

      w^{t+1} ← w^t − η_w g^t,      g^t = ∂C(θ^t)/∂w

  with a parameter-dependent learning rate

      η_w = η / sqrt( Σ_{i=0}^{t} (g^i)^2 )

  where η is a constant and the denominator is the summation of the
  squares of the previous derivatives.

Adagrad
• Example: parameter w1 sees gradients g^0 = 0.1, g^1 = 0.2, …;
  parameter w2 sees g^0 = 20.0, g^1 = 10.0, …
  Learning rates:   w1: η/sqrt(0.1²), then η/sqrt(0.1² + 0.2²), …
                    w2: η/sqrt(20²),  then η/sqrt(20² + 10²), …
• Observation:
  1. The learning rate gets smaller and smaller for all parameters.
  2. Smaller derivatives give a larger learning rate, and vice versa.
     Why?  In steep directions (large derivatives) a small learning
     rate avoids overshooting; in flat directions (small derivatives)
     a large learning rate keeps progress, so the effective step sizes
     are balanced.
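A sketch of the Adagrad update on the slide's two gradient streams; the small epsilon guards against division by zero and is a common implementation detail, not on the slide:

```python
import numpy as np

class Adagrad:
    def __init__(self, eta=0.1, eps=1e-8):
        self.eta, self.eps = eta, eps
        self.sum_sq = None                      # running sum of squared gradients

    def update(self, theta, grad):
        if self.sum_sq is None:
            self.sum_sq = np.zeros_like(theta)
        self.sum_sq += grad ** 2
        # Parameter-dependent rate: eta / sqrt(sum of squared past gradients)
        return theta - self.eta * grad / (np.sqrt(self.sum_sq) + self.eps)

opt = Adagrad(eta=0.1)
theta = np.array([0.0, 0.0])                    # [w1, w2]
for grad in [np.array([0.1, 20.0]), np.array([0.2, 10.0])]:
    theta = opt.update(theta, grad)
print(theta)   # w2 has seen much larger gradients, so its effective step is smaller
```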
Not the whole story ……
• Adagrad [John Duchi, JMLR’11]
• RMSprop
• https://www.youtube.com/watch?v=O3sxAc4hxZU

• Adadelta [Matthew D. Zeiler, arXiv’12]


• Adam [Diederik P. Kingma, ICLR’15]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• “No more pesky learning rates” [Tom Schaul, arXiv’12]
Part III:
Tips for Training DNN
Dropout
Dropout (Training)
• Pick a mini-batch: θ^t ← θ^{t−1} − η∇C(θ^{t−1})
• Each time, before computing the gradients:
  • Each neuron has p% chance to drop out
    → the structure of the network is changed (a thinner network!)
  • Use the new, thinner network for training
• For each mini-batch, we resample the dropped-out neurons.
Dropout (Testing)
• No dropout
• If the dropout rate at training is p%, all the weights are
  multiplied by (1 − p)%
• Assume the dropout rate is 50%: if a weight w = 1 after training,
  set w = 0.5 for testing.
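A sketch of both phases for one layer's activations, following the formulation on the slide (many modern libraries use the equivalent "inverted dropout" instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    # Each neuron has probability p to drop out (its output is forced to 0)
    mask = rng.random(a.shape) >= p
    return a * mask

def scale_weights_for_test(W, p=0.5):
    # No dropout at test time; all weights are multiplied by (1 - p)
    return W * (1.0 - p)

a = np.array([0.3, 1.2, 0.7, 0.9])
print(dropout_train(a))                    # some activations zeroed for this mini-batch
print(scale_weights_for_test(np.ones(3)))  # [0.5 0.5 0.5], the w = 1 -> 0.5 example
```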
Dropout - Intuitive Reason
• When working in a team, if everyone expects their partner to do the
  work, nothing gets done in the end.
• However, if you know your partner may drop out, you will do better
  yourself.
• When testing, no one actually drops out, so the results end up being
  good.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p)% (p = dropout rate)
  when testing?
• Training (dropout rate 50%): on average half of the inputs to a
  neuron are dropped, so the expected pre-activation is about z.
• Testing (no dropout): all inputs are present, so with the training
  weights w1 … w4 the pre-activation would be about z′ ≈ 2z;
  multiplying the weights by 0.5 restores z′ ≈ z.
Dropout is a kind of ensemble.
• Ensemble: sample several training sets (Set 1, Set 2, Set 3, Set 4)
  from the training data and train a bunch of networks with different
  structures (Network 1 to 4).
• At test time, feed the testing data x to all networks and average
  their outputs y1, y2, y3, y4.
Dropout is a kind of ensemble.
• Training: with M neurons there are 2^M possible thinned networks.
  Each mini-batch trains one of them, and some parameters are shared
  across all of these networks.
• Testing: instead of averaging the outputs of all thinned networks,
  use the full network with all weights multiplied by (1 − p)%; the
  result approximates the ensemble average y.
More about dropout
• More reference for dropout [Nitish Srivastava, JMLR’14] [Pierre Baldi,
NIPS’13][Geoffrey E. Hinton, arXiv’12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML’13]
• Dropconnect [Li Wan, ICML’13]
  • Dropout deletes neurons
  • Dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT’14]
  • Dropout rate decreases by epochs
• Standout [J. Ba, NIPS’13]
  • Each neuron has a different dropout rate
Part IV:
Neural Network
with Memory
Neural Network needs Memory
• Named Entity Recognition
  • Detecting named entities like names of people, locations,
    organizations, etc. in a sentence.
  • Each word (e.g. "apple" as a 1-of-N vector) is fed to a DNN that
    outputs the probability of each class, e.g. 0.1 people,
    0.1 location, 0.5 organization, 0.3 none.

Neural Network needs Memory
• In "the president of apple eats an apple", the first "apple" should
  be tagged ORG and the second NONE, but a plain DNN sees the same
  input x for both occurrences and must produce the same output y.
  The DNN needs memory!
Recurrent Neural Network (RNN)
• The outputs of the hidden layer (a1, a2) are copied and stored in
  the memory.
• The memory can be considered as another input at the next time step.
RNN
• The same network (with weights W_i, W_h, W_o) is used again and
  again at every time step:

      a^t = σ(W_i x^t + W_h a^{t−1})
      y^t = W_o a^t

  so the output y_i depends on x1, x2, …, x_i.
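A sketch of the unrolled computation; the sizes and the tanh choice are illustrative assumptions, since the slide only names the weight matrices W_i, W_h, W_o:

```python
import numpy as np

def rnn_forward(xs, Wi, Wh, Wo):
    # The same Wi, Wh, Wo are used at every time step.
    a = np.zeros(Wh.shape[0])          # memory, initialised to zeros
    ys = []
    for x in xs:                       # x1, x2, x3, ...
        a = np.tanh(Wi @ x + Wh @ a)   # new hidden state, stored as memory
        ys.append(Wo @ a)              # y_t depends on x1 ... x_t through a
    return ys

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 5, 8, 3
Wi = rng.standard_normal((d_hid, d_in)) * 0.1
Wh = rng.standard_normal((d_hid, d_hid)) * 0.1
Wo = rng.standard_normal((d_out, d_hid)) * 0.1
ys = rnn_forward([rng.standard_normal(d_in) for _ in range(4)], Wi, Wh, Wo)
print(len(ys), ys[0].shape)            # 4 outputs, each of dimension 3
```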


RNN: How to train?
• Each output y^t is compared with its target ŷ^t, giving costs
  L1, L2, L3, …
• Find the network parameters (W_i, W_h, W_o) that minimize the total
  cost: Backpropagation through time (BPTT).
Of course it can be deep …
• Stack several recurrent hidden layers between the inputs
  x^t, x^{t+1}, x^{t+2}, … and the outputs y^t, y^{t+1}, y^{t+2}, …

Bidirectional RNN
• One RNN reads the sequence forwards and another reads it backwards;
  each output y^t is produced from both hidden states, so it can
  depend on the whole input sequence.
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• E.g. Speech Recognition: the input is a sequence of acoustic
  vectors, the output is the character sequence "好棒" ("great").
  Simply trimming repeated outputs (好好好棒棒棒棒棒 → 好棒) has a
  problem: it can never produce "好棒棒".

Many to Many (Output is shorter)
• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06]
  [Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li,
  Interspeech’15][Andrew Senior, ASRU’15]
• Add an extra symbol "φ" representing "null":
  好 φ φ 棒 φ φ φ φ  →  "好棒"
  好 φ φ 棒 φ 棒 φ φ  →  "好棒棒"
Many to Many (No Limitation)
• Both input and output are sequences with different lengths
  → sequence-to-sequence learning.
• E.g. Machine Translation (machine learning → 機器學習): an encoder
  RNN reads "machine learning"; its final state contains all the
  information about the input sequence, and a decoder RNN generates
  機 器 學 習 …

Many to Many (No Limitation)
• Problem: the decoder does not know when to stop, e.g. it may keep
  generating 機 器 學 習 慣 性 ……
  (Like the endless "推 tlkagk: ==========斷==========" reply chains
  on PTT.  Ref:
  http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87 )
Many to Many (No Limitation)
• Solution: add a stop symbol "===" (斷); the decoder generates
  機 器 學 習 === and stops.
  [Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]
Thanks to 曾柏翔 for providing the experimental results.

Unfortunately ……
• RNN-based networks are not always easy to learn.
  In real experiments on language modeling the cost sometimes jumps
  around wildly; only sometimes are we lucky.
• The error surface is rough: it is either very flat or very steep.
  Clipping the gradient helps.  [Razvan Pascanu, ICML’13]

Why?
• Toy example: a 1000-step RNN with a single recurrent weight w,
  input 1 at the first step and 0 afterwards, so y^1000 = w^999.

      w = 1     → y^1000 = 1
      w = 1.01  → y^1000 ≈ 20000     large gradient → small learning rate?
      w = 0.99  → y^1000 ≈ 0
      w = 0.01  → y^1000 ≈ 0         small gradient → large learning rate?

  The same parameter can produce both exploding and vanishing effects,
  so no single learning rate works.
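The toy example is easy to verify numerically:

```python
# y^1000 = w^999 for input 1, 0, 0, ..., 0 and linear activations
for w in [1.0, 1.01, 0.99, 0.01]:
    print(w, w ** 999)
# 1.0 -> 1.0, 1.01 -> ~2e4, 0.99 -> ~4.3e-5, 0.01 -> underflows to 0.0
```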
Helpful Techniques
• Nesterov’s Accelerated Gradient (NAG):
  • Advanced momentum method
• RMSProp
  • Advanced approach to give each parameter a different learning rate
  • Considers the change of the second derivatives
• Long Short-term Memory (LSTM)
  • Can deal with gradient vanishing (not gradient explosion)
Long Short-term Memory (LSTM)
• A special neuron with 4 inputs and 1 output:
  • Input gate: a signal (from other parts of the network) controls
    whether the input is written into the memory cell.
  • Forget gate: a signal controls whether the memory cell is kept or
    erased.
  • Output gate: a signal controls whether the cell content is read
    out to the rest of the network.

• With input z, gate signals z_i, z_f, z_o and previous cell value c:

      c′ = g(z) f(z_i) + c f(z_f)
      a  = h(c′) f(z_o)

  The gate activation f is usually a sigmoid: its value lies between
  0 and 1, mimicking an open or closed gate.
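A sketch of one step of this cell, following the formulas above; taking g and h to be tanh is a common choice but is not specified on the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(z, z_i, z_f, z_o, c):
    # c' = g(z) f(z_i) + c f(z_f);   a = h(c') f(z_o)
    g, h = np.tanh, np.tanh
    c_new = g(z) * sigmoid(z_i) + c * sigmoid(z_f)
    a = h(c_new) * sigmoid(z_o)
    return a, c_new

# One step with made-up scalar inputs: input 1.0, open input gate, open
# forget gate (keep the old memory), open output gate, old cell value 0.5.
a, c = lstm_cell_step(1.0, 3.0, 3.0, 3.0, 0.5)
print(a, c)
```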
Original Network: simply replace the neurons with LSTM
• Each LSTM unit needs its own z, z_i, z_f and z_o computed from the
  inputs (x1, x2), so an LSTM layer has 4 times the parameters of an
  ordinary layer.

Extension: “peephole”
• In the unrolled LSTM the four signals z_f, z_i, z, z_o at time t are
  computed from h^{t−1} and x^t; the “peephole” extension also feeds
  the cell value c^{t−1} into them.  The cell state c^t and output
  h^t / y^t are then passed on to time t+1.

Other Simpler Alternatives
• Gated Recurrent Unit (GRU) [Cho, EMNLP’14]
• Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR’15]
• Vanilla RNN initialized with the identity matrix + ReLU activation
  function [Quoc V. Le, arXiv’15]
  • Outperforms or is comparable with LSTM on 4 different tasks
What is the next wave?
• Attention-based Model
  • A controller (DNN/LSTM) maps input x to output y while reading
    from and writing to an internal memory (or information from the
    output) through reading heads and writing heads.
  • Already applied to speech recognition, caption generation, QA,
    visual QA.
What is the next wave?
• Attention-based Model
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus.
arXiv Pre-Print, 2015.
• Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. arXiv Pre-Print,
2014
• Ask Me Anything: Dynamic Memory Networks for Natural Language Processing.
Kumar et al. arXiv Pre-Print, 2015
• Neural Machine Translation by Jointly Learning to Align and Translate. D.
Bahdanau, K. Cho, Y. Bengio; International Conference on Representation
Learning 2015.
• Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention. Kelvin Xu et. al.. arXiv Pre-Print, 2015.
• Attention-Based Models for Speech Recognition. Jan Chorowski, Dzmitry
Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio. arXiv Pre-Print,
2015.
• Recurrent models of visual attention. V. Mnih, N. Hees, A. Graves and K.
Kavukcuoglu. In NIPS, 2014.
• A Neural Attention Model for Abstractive Sentence Summarization. A. M. Rush,
S. Chopra and J. Weston. EMNLP 2015.
Concluding Remarks
Concluding Remarks
• Introduction of deep learning
• Discussing some reasons for using deep learning
• New techniques for deep learning
• ReLU, Maxout
• Giving all the parameters different learning rates
• Dropout
• Network with memory
• Recurrent neural network
• Long short-term memory (LSTM)
Reading Materials
• “Neural Networks and Deep Learning”
• written by Michael Nielsen
• http://neuralnetworksanddeeplearning.com/
• “Deep Learning” (not finished yet)
• Written by Yoshua Bengio, Ian J. Goodfellow and
Aaron Courville
• http://www.iro.umontreal.ca/~bengioy/dlbook/
Thank you
for your attention!
Acknowledgement
• Thanks to Ryan Sun for writing in to point out typos in the slides
Appendix
Matrix Operation
• The first layer written as σ(W x + b) = a:
  W = [[1, −2], [−1, 1]], x = (x1, x2) = (1, −1), b = (1, 0)
  → σ((4, −2)) = (0.98, 0.12) = (y1, y2)
Why Deep? – Logic Circuits
• A two levels of basic logic gates can represent any
Boolean function.
• However, no one uses two levels of logic gates to
build computers
• Using multiple layers of logic gates to build some functions is
  much simpler (fewer gates needed).
Boosting
• Input x → weak classifier 1, weak classifier 2, … → combine.

Deep Learning
• The first layer acts as weak classifiers, the second layer boosts
  them into boosted weak classifiers, and later layers boost those
  again.
Maxout             ReLU is a special case of Maxout
• ReLU: z = wx + b, a = max(z, 0).
  Maxout with a 2-element group: z1 = wx + b, z2 = 0,
  a = max(z1, z2), which is exactly ReLU.

Maxout             ReLU is a special case of Maxout
• With a learnable second element z2 = w′x + b′,
  a = max(z1, z2) becomes a learnable (piecewise linear) activation
  function.
