Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky
Reminder: The error surface for a linear neuron

• The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
  – For a linear neuron with a squared error, it is a quadratic bowl.
  – Vertical cross-sections are parabolas.
  – Horizontal cross-sections are ellipses.
• For multi-layer, non-linear nets the error surface is much more complicated.
  – But locally, a piece of a quadratic bowl is usually a very good approximation.

[Figure: a quadratic bowl with vertical axis E and horizontal axes w1, w2]
Convergence speed of full batch learning when the error surface is a quadratic bowl

• Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
  – The gradient is big in the direction in which we only want to travel a small distance.
  – The gradient is small in the direction in which we want to travel a large distance.
• Even for non-linear multi-layer nets, the error surface is locally quadratic, so the same speed issues apply.
How the learning goes wrong

• If the learning rate is big, the weights slosh to and fro across the ravine.
  – If the learning rate is too big, this oscillation diverges.
• What we would like to achieve:
  – Move quickly in directions with small but consistent gradients.
  – Move slowly in directions with big but inconsistent gradients.

[Figure: an elongated ravine in the error surface over E and w, with the weights oscillating across it]
Stochastic gradient descent

• If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
  – So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
  – The extreme version of this approach updates the weights after each case. It is called "online" learning.
• Mini-batches are usually better than online learning.
  – Less computation is used updating the weights.
  – Computing the gradient for many cases simultaneously uses matrix-matrix multiplies, which are very efficient, especially on GPUs.
• Mini-batches need to be balanced for classes.
Two types of learning algorithm

• If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
  – The optimization community has studied the general problem of optimizing smooth non-linear functions for many years.
  – Multilayer neural nets are not typical of the problems they study, so their methods may need a lot of adaptation.
• For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.
  – The mini-batches may need to be quite big when adapting fancy methods.
  – Big mini-batches are more computationally efficient.
A basic mini-batch gradient descent algorithm

• Guess an initial learning rate.
  – If the error keeps getting worse or oscillates wildly, reduce the learning rate.
  – If the error is falling fairly consistently but slowly, increase the learning rate.
• Write a simple program to automate this way of adjusting the learning rate.
• Towards the end of mini-batch learning it nearly always helps to turn down the learning rate.
  – This removes fluctuations in the final weights caused by the variations between mini-batches.
• Turn down the learning rate when the error stops decreasing.
  – Use the error on a separate validation set. (A minimal sketch of this loop is given below.)
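A minimal Python sketch of this recipe, assuming a hypothetical `loss_and_grad(w, X, y)` helper (not part of the lecture) that returns the mini-batch error and its gradient; the halving and nudging constants stand in for the hand-tuned adjustments described above.

```python
import numpy as np

def minibatch_sgd(loss_and_grad, w, X, y, lr=0.01, batch_size=100, epochs=20):
    """Plain mini-batch gradient descent with a crude automatic learning-rate rule."""
    n = X.shape[0]
    prev_epoch_error = np.inf
    for epoch in range(epochs):
        perm = np.random.permutation(n)
        epoch_error = 0.0
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            err, grad = loss_and_grad(w, X[idx], y[idx])
            w -= lr * grad                    # steepest-descent step on this mini-batch
            epoch_error += err
        # crude version of the adjustment rule on the slide above
        if epoch_error > prev_epoch_error:    # error got worse: turn the rate down
            lr *= 0.5
        else:                                 # error falling: nudge the rate up slowly
            lr *= 1.05
        prev_epoch_error = epoch_error
    return w, lr
```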
Neural Networks for Machine Learning, Lecture 6b
A bag of tricks for mini-batch gradient descent
Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky
Be careful about turning down the learning rate

• Turning down the learning rate reduces the random fluctuations in the error that are caused by the different gradients on different mini-batches.
  – So we get a quick win.
  – But then we get slower learning.
• Don't turn down the learning rate too soon!

[Figure: error vs. epoch, showing a quick drop when the learning rate is reduced, followed by slower progress]
Initializing the weights

• If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
  – So they can never learn to be different features.
  – We break symmetry by initializing the weights to have small random values.
• If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
  – We generally want smaller incoming weights when the fan-in is big, so initialize the weights to be inversely proportional to sqrt(fan-in).
• We can also scale the learning rate the same way. (A sketch of this initialization is given below.)
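A minimal sketch of such an initializer for a fully connected layer; the `scale` constant and the use of NumPy's `standard_normal` are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def init_weights(fan_in, fan_out, scale=1.0, seed=0):
    """Small random weights, scaled down as the fan-in grows: this breaks symmetry
    between hidden units while avoiding overshoot for units with many inputs."""
    rng = np.random.default_rng(seed)
    return scale * rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# Example: weights for a layer with 256 inputs are ~16x smaller than for 1 input.
W = init_weights(256, 128)
```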
Shifting the inputs

• When using steepest descent, shifting the input values makes a big difference.
  – It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.
• The hyperbolic tangent (which is 2*logistic - 1) produces hidden activations that are roughly zero mean.
  – In this respect it's better than the logistic.

[Figure: training cases (101, 101) -> 2 and (101, 99) -> 0 give a long, narrow error surface over w1 and w2; the shifted cases (1, 1) -> 2 and (1, -1) -> 0 give a nearly circular one. Color indicates the training case.]
Scaling the inputs

• When using steepest descent, scaling the input values makes a big difference.
  – It usually helps to transform each component of the input vector so that it has unit variance over the whole training set. (A sketch of this standardization is given below.)

[Figure: training cases (0.1, 10) -> 2 and (0.1, -10) -> 0 give a long, narrow error surface; the rescaled cases (1, 1) -> 2 and (1, -1) -> 0 give a nearly circular one. Color indicates the weight axis.]
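A minimal sketch of the shift-and-scale transform from the last two slides, assuming the inputs arrive as NumPy arrays with one row per training case; the small constant added to the standard deviation is only there to guard against constant input components.

```python
import numpy as np

def standardize(X_train, X_test):
    """Shift each input component to zero mean and scale it to unit variance,
    using statistics computed on the whole training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8        # avoid division by zero for constant inputs
    return (X_train - mean) / std, (X_test - mean) / std
```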
A more thorough method: Decorrelate the input components

• For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.
• There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.
  – Drop the principal components with the smallest eigenvalues. This achieves some dimensionality reduction.
  – Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis-aligned elliptical error surface into a circular one.
• For a circular error surface, the gradient points straight towards the minimum. (A sketch of PCA whitening follows below.)
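A minimal sketch of PCA whitening along the lines described above; the eigendecomposition of the covariance matrix and the small stabilizing constant are implementation choices, not prescribed by the lecture.

```python
import numpy as np

def pca_whiten(X, n_components):
    """Decorrelate the input components with PCA, drop the components with the
    smallest eigenvalues, and divide the rest by the square roots of their eigenvalues."""
    X = X - X.mean(axis=0)                     # PCA assumes zero-mean data
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]             # keep the leading principal directions
    kept_eigvals = eigvals[order]
    Z = X @ components                         # project onto the kept components
    return Z / np.sqrt(kept_eigvals + 1e-8)    # whiten: unit variance along each axis
```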
Common problems that occur in multilayer networks

• If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
  – The error derivatives for the hidden units will all become tiny and the error will not decrease.
  – This is usually a plateau, but people often mistake it for a local minimum.
• In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of the time it should be a 1.
  – The network finds this strategy quickly and may take a long time to improve on it by making use of the input.
  – This is another plateau that looks like a local minimum.
Four ways to speed up mini-batch learning

• Use "momentum".
  – Instead of using the gradient to change the position of the weight "particle", use it to change the velocity.
• Use separate adaptive learning rates for each parameter.
  – Slowly adjust the rate using the consistency of the gradient for that parameter.
• rmsprop: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.
  – This is the mini-batch version of just using the sign of the gradient.
• Take a fancy method from the optimization literature that makes use of curvature information (not this lecture).
  – Adapt it to work for neural nets.
  – Adapt it to work for mini-batches.
Neural Networks for Machine Learning, Lecture 6c
The momentum method
Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky
The intuition behind the momentum method

• Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.
  – The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.
  – Its momentum makes it keep going in the previous direction.
• It damps oscillations in directions of high curvature by combining gradients with opposite signs.
• It builds up speed in directions with a gentle but consistent gradient.
The equations of the momentum method

The effect of the gradient is to increment the previous velocity. The velocity also decays by α, which is slightly less than 1:

    v(t) = α v(t−1) − ε ∂E/∂w (t)

The weight change is equal to the current velocity:

    Δw(t) = v(t) = α v(t−1) − ε ∂E/∂w (t)

So the weight change can be expressed in terms of the previous weight change and the current gradient:

    Δw(t) = α Δw(t−1) − ε ∂E/∂w (t)

(A sketch of this update follows below.)
The behavior of the momentum method

• At the beginning of learning there may be very large gradients.
  – So it pays to use a small momentum (e.g. 0.5).
  – Once the large gradients have disappeared and the weights are stuck in a ravine, the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99).
• If the error surface is a tilted plane, the ball reaches a terminal velocity:

    v(∞) = (1 / (1 − α)) (−ε ∂E/∂w)

  – If the momentum is close to 1, this is much faster than simple gradient descent.
• This allows us to learn at a rate that would cause divergent oscillations without the momentum.
A better type of momentum (Nesterov 1983)

• The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
• Ilya Sutskever (2012, unpublished) suggested a new form of momentum that often works better.
  – Inspired by the Nesterov method for optimizing convex functions.
• First make a big jump in the direction of the previous accumulated gradient.
• Then measure the gradient where you end up and make a correction.
  – It's better to correct a mistake after you have made it! (A sketch of this update is given below.)
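A minimal sketch of the Nesterov-style update described above, assuming a hypothetical `grad_fn` callback (not from the lecture) that returns ∂E/∂w at an arbitrary point; the velocity `v` plays the role of the accumulated gradient.

```python
import numpy as np

def nesterov_step(w, v, grad_fn, alpha=0.9, eps=0.01):
    """Nesterov-style momentum: first jump in the direction of the previous
    accumulated gradient, then measure the gradient where you land and correct."""
    lookahead = w + alpha * v      # big jump using the previous accumulated gradient
    grad = grad_fn(lookahead)      # gradient measured where we ended up
    v = alpha * v - eps * grad     # correction applied after the jump
    w = w + v
    return w, v
```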
A picture of the Nesterov method

• First make a big jump in the direction of the previous accumulated gradient.
• Then measure the gradient where you end up and make a correction.

[Figure: brown vector = jump, red vector = correction, green vector = accumulated gradient, blue vectors = standard momentum]
Neural Networks for Machine Learning, Lecture 6d
A separate, adaptive learning rate for each connection
Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky
The intuition behind separate adaptive learning rates

• In a multilayer net, the appropriate learning rates can vary widely between weights:
  – The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small. (Gradients can get very small in the early layers of very deep nets.)
  – The fan-in of a unit determines the size of the "overshoot" effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error. (The fan-in often varies widely between layers.)
• So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.
One way to determine the individual learning rates

• Start with a local gain g_ij of 1 for every weight, and update each weight by

    Δw_ij = −ε g_ij ∂E/∂w_ij

• Increase the local gain if the gradient for that weight does not change sign:

    if (∂E/∂w_ij (t)) (∂E/∂w_ij (t−1)) > 0
      then g_ij(t) = g_ij(t−1) + 0.05
      else g_ij(t) = g_ij(t−1) × 0.95

• Use small additive increases and multiplicative decreases (for mini-batch learning).
  – This ensures that big gains decay rapidly when oscillations start.
  – If the gradient is totally random, the gain will hover around 1: we add δ half the time and multiply by (1 − δ) half the time. (A sketch of this rule follows below.)
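A minimal sketch of this rule applied element-wise to a weight matrix; the 0.05/0.95 constants are the ones quoted above, and the clipping range [0.1, 10] anticipates the advice on the next slide.

```python
import numpy as np

def adaptive_gain_step(w, gains, grad, prev_grad, eps=0.01, delta=0.05):
    """Per-weight adaptive learning rates: additive increase when the gradient
    keeps its sign, multiplicative decrease when it flips (all element-wise)."""
    agree = grad * prev_grad > 0
    gains = np.where(agree, gains + delta, gains * (1.0 - delta))
    gains = np.clip(gains, 0.1, 10.0)       # keep the gains in a reasonable range
    w = w - eps * gains * grad              # delta w_ij = -eps * g_ij * dE/dw_ij
    return w, gains
```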
Tricks for making adaptive learning rates work better

• Limit the gains to lie in some reasonable range,
  – e.g. [0.1, 10] or [0.01, 100].
• Use full batch learning or big mini-batches.
  – This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
• Adaptive learning rates can be combined with momentum.
  – Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).
• Adaptive learning rates only deal with axis-aligned effects.
  – Momentum does not care about the alignment of the axes.
Neural Networks for Machine Learning, Lecture 6e
rmsprop: Divide the gradient by a running average of its recent magnitude
Geoffrey Hinton, with Nitish Srivastava and Kevin Swersky
rprop: Using only the sign of the gradient

• The magnitude of the gradient can be very different for different weights and can change during learning.
  – This makes it hard to choose a single global learning rate.
• For full batch learning, we can deal with this variation by only using the sign of the gradient.
  – The weight updates are then all of the same magnitude.
  – This escapes from plateaus with tiny gradients quickly.
• rprop combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.
  – Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
  – Otherwise decrease the step size multiplicatively (e.g. times 0.5).
  – Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster's advice). (A sketch of this rule follows below.)
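A minimal sketch of full-batch rprop as described above; the step-size limits and the 1.2/0.5 factors are the values quoted on this slide.

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    """Full-batch rprop: use only the sign of the gradient, with a separate
    multiplicatively adapted step size for each weight."""
    agree = grad * prev_grad > 0
    step = np.where(agree, step * up, step * down)   # grow or shrink each step size
    step = np.clip(step, step_min, step_max)         # keep steps between 1e-6 and 50
    w = w - np.sign(grad) * step                      # move by the step size, sign only
    return w, step
```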
Why rprop does not work with mini-batches

• The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.
  – Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of −0.9 on the tenth mini-batch.
  – We want this weight to stay roughly where it is.
• rprop would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale).
  – So the weight would grow a lot.
• Is there a way to combine:
  – The robustness of rprop.
  – The efficiency of mini-batches.
  – The effective averaging of gradients over mini-batches? (A small numeric check of the averaging argument is given below.)
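A small numeric check of the averaging argument above, comparing what plain SGD and a sign-only (rprop-like) update would do to that weight over the ten mini-batches; the learning rate and step size are arbitrary illustrative values.

```python
# Ten mini-batch gradients for one weight: +0.1 nine times, then -0.9 once.
grads = [0.1] * 9 + [-0.9]

eps, step = 0.01, 0.01
sgd_move = sum(-eps * g for g in grads)                          # ~0.0: the moves cancel
rprop_move = sum(-step * (1 if g > 0 else -1) for g in grads)    # -0.08: drifts by 8 steps
print(sgd_move, rprop_move)
```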
rmsprop: A mini-batch version of rprop

• rprop is equivalent to using the gradient, but also dividing by the size of the gradient.
  – The problem with mini-batch rprop is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?
• rmsprop: Keep a moving average of the squared gradient for each weight:

    MeanSquare(w, t) = 0.9 MeanSquare(w, t−1) + 0.1 (∂E/∂w (t))²

• Dividing the gradient by sqrt(MeanSquare(w, t)) makes the learning work much better (Tijmen Tieleman, unpublished). (A sketch of the update is given below.)
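A minimal sketch of the rmsprop update; the learning rate and the small constant added to the square root are common illustrative choices, not specified in the lecture.

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, decay=0.9, small=1e-8):
    """rmsprop: keep a moving average of the squared gradient for each weight
    and divide the gradient by the root of that average."""
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + small)   # small avoids division by zero
    return w, mean_square
```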
Further developments of rmsprop

• Combining rmsprop with standard momentum.
  – Momentum does not help as much as it normally does. Needs more investigation.
• Combining rmsprop with Nesterov momentum (Sutskever 2012).
  – It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.
• Combining rmsprop with adaptive learning rates for each connection.
  – Needs more investigation.
• Other methods related to rmsprop.
  – Yann LeCun's group has a fancy version in "No more pesky learning rates".
Summary of learning methods for neural networks

• For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.
  – Conjugate gradient, LBFGS ...
  – Adaptive learning rates, rprop ...
• For big, redundant datasets use mini-batches.
  – Try gradient descent with momentum.
  – Try rmsprop (with momentum?).
  – Try LeCun's latest recipe.
• Why there is no simple recipe:
  – Neural nets differ a lot:
    • Very deep nets (especially ones with narrow bottlenecks).
    • Recurrent nets.
    • Wide shallow nets.
  – Tasks differ a lot:
    • Some require very accurate weights, some don't.
    • Some have many very rare cases (e.g. words).