
1

Back-Propagation Algorithm
! Perceptron
! Gradient Descent
! Multi-layered neural network
! Back-Propagation
! More on Back-Propagation
! Examples
2
Inner-product
! A measure of the projection of one vector
onto another

$$\text{net} = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \cdot \|\vec{x}\| \cdot \cos(\theta)$$

$$\text{net} = \sum_{i=1}^{n} w_i \, x_i$$
Activation function
$$o = f(\text{net}) = f\!\left(\sum_{i=1}^{n} w_i \, x_i\right)$$

$$f(x) := \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$$
3
Sigmoid function:

$$f(x) := \sigma(x) = \frac{1}{1 + e^{-ax}}$$

Step (threshold) function:

$$f(x) := \theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$

Piecewise-linear function:

$$f(x) := \begin{cases} 1 & \text{if } x \ge 0.5 \\ x & \text{if } -0.5 < x < 0.5 \\ 0 & \text{if } x \le -0.5 \end{cases}$$
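These formulas translate directly into code. Below is a minimal NumPy sketch of the net input and of the sgn and sigmoid activations above; the example weight and input vectors and the gain a are illustrative, not taken from the slides.

    import numpy as np

    def net_input(w, x):
        # net = <w, x> = sum_i w_i * x_i
        return np.dot(w, x)

    def sgn(net):
        # threshold activation: +1 if net >= 0, -1 otherwise
        return 1.0 if net >= 0 else -1.0

    def sigmoid(net, a=1.0):
        # smooth activation: 1 / (1 + exp(-a * net))
        return 1.0 / (1.0 + np.exp(-a * net))

    w = np.array([0.5, -0.3, 0.8])
    x = np.array([1.0, 2.0, -1.0])
    net = net_input(w, x)
    print(sgn(net), sigmoid(net))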
Gradient Descent
! To understand gradient descent, consider a simpler linear unit, where

$$o = \sum_{i=0}^{n} w_i \, x_i$$

! Let's learn the w_i that minimize the squared error over the training set D = {(x_1, t_1), (x_2, t_2), ..., (x_d, t_d), ..., (x_m, t_m)} (t for target)
4
Error for different hypotheses, for w_0 and w_1 (dim 2)
! We want to move the weight vector in the direction that decreases E

$$w_i = w_i + \Delta w_i$$

$$\vec{w} = \vec{w} + \Delta\vec{w}$$
5
Differentiating E
Update rule for gradient descent

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}$$
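A minimal sketch of this batch update for the linear unit of the previous slide, assuming training examples stored as rows of a NumPy array X; the learning rate eta and the toy data are illustrative.

    import numpy as np

    def batch_gradient_step(w, X, t, eta=0.05):
        # o_d = w . x_d for every training example d
        o = X @ w
        # delta w_i = eta * sum_d (t_d - o_d) * x_id
        return w + eta * X.T @ (t - o)

    # toy data: 3 examples, a bias column of ones plus 2 inputs
    X = np.array([[1.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0]])
    t = np.array([1.0, 0.0, 1.0])
    w = np.zeros(3)
    for _ in range(100):
        w = batch_gradient_step(w, X, t)
    print(w)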
6
Stochastic Approximation to Gradient Descent
! The gradient descent training rule updates the weights by summing over all the training examples in D
! Stochastic gradient descent approximates gradient descent by updating the weights incrementally
! Calculate the error for each example
! Known as the delta rule or LMS (least mean squares) weight update
! Adaline rule, used for adaptive filters, Widrow and Hoff (1960)

$$\Delta w_i = \eta \, (t - o) \, x_i$$
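A sketch of this incremental (delta-rule) variant, which updates after every single example rather than after the full sum; the toy data and eta are again illustrative.

    import numpy as np

    def delta_rule_epoch(w, X, t, eta=0.05):
        # one pass over the data, updating after every single example
        for x_d, t_d in zip(X, t):
            o_d = np.dot(w, x_d)              # linear unit output for this example
            w = w + eta * (t_d - o_d) * x_d   # delta w_i = eta * (t - o) * x_i
        return w

    w = delta_rule_epoch(np.zeros(3),
                         np.array([[1., 0., 1.], [1., 1., 0.], [1., 1., 1.]]),
                         np.array([1., 0., 1.]))
    print(w)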
7
XOR problem and Perceptron
! By Minsky and Papert in the mid-1960s
8
Multi-layer Networks
! The limitations of simple perceptron do not
apply to feed-forward networks with
intermediate or hidden nonlinear units
! A network with just one hidden unit can
represent any Boolean function
! The great power of multi-layer networks was
realized long ago
! But it was only in the eighties it was shown how to
make them learn
! Multiple layers of cascade linear units still
produce only linear functions
! We search for networks capable of
representing nonlinear functions
! Units should use nonlinear activation functions
! Examples of nonlinear activation functions
9
XOR-example
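The XOR figure itself is not reproduced here, but a standard construction shows how a single hidden layer of threshold units solves the problem, since XOR(x1, x2) = OR(x1, x2) AND NOT AND(x1, x2). The weights below are one hand-picked solution, not necessarily the ones on the slide.

    import numpy as np

    def step(net):
        return 1 if net >= 0 else 0

    def xor_net(x1, x2):
        # hidden unit 1 ~ OR:  fires if x1 + x2 >= 0.5
        h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)
        # hidden unit 2 ~ AND: fires if x1 + x2 >= 1.5
        h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)
        # output unit ~ h1 AND NOT h2
        return step(1.0 * h1 - 2.0 * h2 - 0.5)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))   # prints the XOR truth table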
10
! Back-propagation is a learning algorithm for
multi-layer neural networks
! It was invented independently several times
! Bryson and Ho [1969]
! Werbos [1974]
! Parker [1985]
! Rumelhart et al. [1986]
Parallel Distributed Processing - Vol. 1
Foundations
David E. Rumelhart, James L. McClelland and the PDP
Research Group
What makes people smarter than computers? These volumes
by a pioneering neurocomputing.....
11
Back-propagation
! The algorithm gives a prescription for changing the weights w_ij in any feed-forward network to learn a training set of input-output pairs {x^d, t^d}
! We consider a simple two-layer network
[Figure: a two-layer network with five inputs x_1, ..., x_5 (generic input x_k), a hidden layer, and an output layer]
12
! Given the pattern x^d, the hidden unit j receives a net input

$$\text{net}_j^d = \sum_{k=1}^{5} w_{jk} \, x_k^d$$

! and produces the output

$$V_j^d = f(\text{net}_j^d) = f\!\left(\sum_{k=1}^{5} w_{jk} \, x_k^d\right)$$

! Output unit i thus receives

$$\text{net}_i^d = \sum_{j=1}^{3} W_{ij} \, V_j^d = \sum_{j=1}^{3} \left( W_{ij} \cdot f\!\left(\sum_{k=1}^{5} w_{jk} \, x_k^d\right) \right)$$

! and produces the final output

$$o_i^d = f(\text{net}_i^d) = f\!\left(\sum_{j=1}^{3} W_{ij} \, V_j^d\right) = f\!\left(\sum_{j=1}^{3} \left( W_{ij} \cdot f\!\left(\sum_{k=1}^{5} w_{jk} \, x_k^d\right) \right)\right)$$
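A sketch of this forward pass for the 5-input, 3-hidden, 2-output network of the slides, using a sigmoid for f; the random weight initialization and the example pattern are placeholders.

    import numpy as np

    def f(net):
        # sigmoid activation, gain a = 1
        return 1.0 / (1.0 + np.exp(-net))

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(3, 5))     # input-to-hidden weights w_jk
    W = rng.normal(scale=0.1, size=(2, 3))     # hidden-to-output weights W_ij

    x_d = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # one input pattern x^d

    net_hidden = w @ x_d            # net_j^d = sum_k w_jk * x_k^d
    V = f(net_hidden)               # V_j^d = f(net_j^d)
    net_out = W @ V                 # net_i^d = sum_j W_ij * V_j^d
    o = f(net_out)                  # o_i^d = f(net_i^d)
    print(o)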
13
! Our usual error function
! For l outputs and m input-output pairs {x^d, t^d}:

$$E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{l} \left(t_i^d - o_i^d\right)^2$$

! In our example E becomes

$$E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left(t_i^d - o_i^d\right)^2$$

$$E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left(t_i^d - f\!\left(\sum_{j=1}^{3} W_{ij} \cdot f\!\left(\sum_{k=1}^{5} w_{jk} \, x_k^d\right)\right)\right)^2$$

! E[w] is differentiable given that f is differentiable
! Gradient descent can be applied
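A sketch of this error computed over a small batch of outputs and targets; the arrays T and O below are illustrative placeholders.

    import numpy as np

    def squared_error(T, O):
        # E[w] = 1/2 * sum_d sum_i (t_i^d - o_i^d)^2
        # T[d, i] holds the target t_i^d, O[d, i] the network output o_i^d
        return 0.5 * np.sum((T - O) ** 2)

    T = np.array([[1.0, 0.0], [0.0, 1.0]])
    O = np.array([[0.8, 0.1], [0.3, 0.7]])
    print(squared_error(T, O))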
14
! For the hidden-to-output connections the gradient descent rule gives:

$$\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = -\eta \sum_{d=1}^{m} \left(t_i^d - o_i^d\right) f'(\text{net}_i^d) \cdot \left(-V_j^d\right)$$

$$\Delta W_{ij} = \eta \sum_{d=1}^{m} \left(t_i^d - o_i^d\right) f'(\text{net}_i^d) \cdot V_j^d$$

! Defining

$$\delta_i^d = f'(\text{net}_i^d)\left(t_i^d - o_i^d\right)$$

! we can write

$$\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d$$

! For the input-to-hidden connections w_jk we must differentiate with respect to the w_jk
! Using the chain rule we obtain

$$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \cdot \frac{\partial V_j^d}{\partial w_{jk}}$$
15
$$\delta_i^d = f'(\text{net}_i^d)\left(t_i^d - o_i^d\right)$$

$$\delta_j^d = f'(\text{net}_j^d) \sum_{i=1}^{2} W_{ij} \, \delta_i^d$$

$$\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \left(t_i^d - o_i^d\right) f'(\text{net}_i^d) \, W_{ij} \, f'(\text{net}_j^d) \cdot x_k^d$$

$$\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \delta_i^d \, W_{ij} \, f'(\text{net}_j^d) \cdot x_k^d$$

$$\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d \, x_k^d$$

$$\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d \, V_j^d$$

! We have the same form as before, with a different definition of δ
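A sketch showing how these δ equations turn into weight updates for a single pattern in the 5-3-2 network, assuming the sigmoid activation (so f'(net) = f(net)(1 - f(net))); the function signature and the learning rate are illustrative.

    import numpy as np

    def backprop_pattern(w, W, x_d, t_d, eta=0.1):
        f = lambda net: 1.0 / (1.0 + np.exp(-net))
        V = f(w @ x_d)                         # hidden activations V_j^d
        o = f(W @ V)                           # outputs o_i^d
        # delta_i^d = f'(net_i^d) * (t_i^d - o_i^d), with f' = o * (1 - o)
        delta_out = o * (1.0 - o) * (t_d - o)
        # delta_j^d = f'(net_j^d) * sum_i W_ij * delta_i^d
        delta_hidden = V * (1.0 - V) * (W.T @ delta_out)
        # Delta W_ij = eta * delta_i^d * V_j^d ;  Delta w_jk = eta * delta_j^d * x_k^d
        W = W + eta * np.outer(delta_out, V)
        w = w + eta * np.outer(delta_hidden, x_d)
        return w, W

Reusing w, W, and x_d from the forward-pass sketch above, one update would be, for example, w, W = backprop_pattern(w, W, x_d, np.array([1.0, 0.0])).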
16
! In general, with an arbitrary number of layers, the back-propagation update rule always has the form

$$\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{\text{output}} \cdot V_{\text{input}}$$

! where "output" and "input" refer to the two ends of the connection concerned
! V stands for the appropriate input (a hidden unit or a real input, x^d)
! δ depends on the layer concerned
! The equation

$$\delta_j^d = f'(\text{net}_j^d) \sum_{i=1}^{2} W_{ij} \, \delta_i^d$$

! allows us to determine the δ for a given hidden unit V_j in terms of the δs of the units o_i it feeds
! The coefficients are the usual forward weights, but the errors δ are propagated backward
! hence the name back-propagation
17
! We have to use a nonlinear, differentiable activation function
! Examples:

$$f(x) = \sigma(x) = \frac{1}{1 + e^{-\beta x}}$$

$$f'(x) = \sigma'(x) = \beta \cdot \sigma(x) \cdot (1 - \sigma(x))$$

$$f(x) = \tanh(\beta x)$$

$$f'(x) = \beta \cdot \left(1 - f(x)^2\right)$$
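The two activations and their derivatives written out as functions; the gain parameter beta is kept explicit (beta = 1 recovers the plain sigmoid and tanh) and is the only assumption here.

    import numpy as np

    def sigmoid(x, beta=1.0):
        return 1.0 / (1.0 + np.exp(-beta * x))

    def sigmoid_prime(x, beta=1.0):
        s = sigmoid(x, beta)
        return beta * s * (1.0 - s)                    # f'(x) = beta * f(x) * (1 - f(x))

    def tanh_act(x, beta=1.0):
        return np.tanh(beta * x)

    def tanh_prime(x, beta=1.0):
        return beta * (1.0 - np.tanh(beta * x) ** 2)   # f'(x) = beta * (1 - f(x)^2)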
18
! Consider a network with M layers, m = 1, 2, ..., M
! V_i^m denotes the output of the ith unit of the mth layer
! V_i^0 is a synonym for x_i, the ith input
! The index m labels layers, not patterns
! w_ij^m denotes the connection from V_j^{m-1} to V_i^m

Stochastic Back-Propagation Algorithm (mostly used)
1. Initialize the weights to small random values
2. Choose a pattern x_k^d and apply it to the input layer: V_k^0 = x_k^d for all k
3. Propagate the signal forward through the network:

$$V_i^m = f(\text{net}_i^m) = f\!\left(\sum_j w_{ij}^m \, V_j^{m-1}\right)$$

4. Compute the deltas for the output layer:

$$\delta_i^M = f'(\text{net}_i^M)\left(t_i^d - V_i^M\right)$$

5. Compute the deltas for the preceding layers, for m = M, M-1, ..., 2:

$$\delta_i^{m-1} = f'(\text{net}_i^{m-1}) \sum_j w_{ji}^m \, \delta_j^m$$

6. Update all connections:

$$\Delta w_{ij}^m = \eta \, \delta_i^m \, V_j^{m-1}, \qquad w_{ij}^{\text{new}} = w_{ij}^{\text{old}} + \Delta w_{ij}$$

7. Go to step 2 and repeat for the next pattern
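A compact sketch of steps 1-7 for a two-layer network, trained here on XOR as a small test problem; the layer sizes, learning rate, epoch count, and random seed are illustrative choices, not prescribed by the slides.

    import numpy as np

    def f(net):
        return 1.0 / (1.0 + np.exp(-net))             # sigmoid activation

    # XOR training set: 2 inputs plus a bias input fixed at 1
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
    T = np.array([[0.0], [1.0], [1.0], [0.0]])

    rng = np.random.default_rng(1)
    w = rng.uniform(-0.5, 0.5, size=(3, 3))           # 1. small random weights (hidden layer)
    W = rng.uniform(-0.5, 0.5, size=(1, 3))           #    and output layer
    eta = 0.5

    for epoch in range(10000):
        for x_d, t_d in zip(X, T):                    # 2./7. pick the next pattern
            V = f(w @ x_d)                            # 3. propagate the signal forward
            o = f(W @ V)
            delta_M = o * (1 - o) * (t_d - o)         # 4. deltas for the output layer
            delta_h = V * (1 - V) * (W.T @ delta_M)   # 5. deltas for the preceding layer
            W += eta * np.outer(delta_M, V)           # 6. update all connections
            w += eta * np.outer(delta_h, x_d)

    print(np.round(f(W @ f(w @ X.T)), 2))             # should move toward the targets 0 1 1 0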
19
More on Back-Propagation
! Gradient descent over the entire network weight vector
! Easily generalized to arbitrary directed graphs
! Will find a local, not necessarily global, error minimum
! In practice it often works well (can run multiple times)
! Gradient descent can be very slow if η is too small, and can oscillate widely if η is too large
! Often include a weight momentum term α
! The momentum parameter α is chosen between 0 and 1; 0.9 is a good value

$$\Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \cdot \Delta w_{pq}(t)$$
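A sketch of the momentum rule applied to a single weight update; grad_E stands for the current gradient ∂E/∂w_pq and is an assumed input, and the previous Δw_pq(t) must be carried between calls.

    def momentum_update(w, grad_E, prev_delta, eta=0.1, alpha=0.9):
        # delta_w(t+1) = -eta * dE/dw + alpha * delta_w(t)
        delta = -eta * grad_E + alpha * prev_delta
        return w + delta, delta    # return delta so it can be reused as delta_w(t) next step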
20
! Minimizes error over the training examples
! Will it generalize well?
! Training can take thousands of iterations; it is slow!
! Using the network after training is very fast
21
Convergence of Back-propagation
! Gradient descent to some local minimum
! Perhaps not global minimum...
! Add momentum
! Stochastic gradient descent
! Train multiple nets with different initial weights
! Nature of convergence
! Initialize weights near zero
! Therefore, initial networks near-linear
! Increasingly non-linear functions possible as training
progresses
22
Expressive Capabilities of ANNs
! Boolean functions:
! Every Boolean function can be represented by a network with a single hidden layer
! but it might require a number of hidden units exponential in the number of inputs
! Continuous functions:
! Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
! Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
NETtalk (Sejnowski et al., 1987)
23
Prediction
24
25
! Perceptron
! Gradient Descent
! Multi-layered neural network
! Back-Propagation
! More on Back-Propagation
! Examples
26
! RBF Networks, Support Vector Machines
