E_j[n] = (1/2) e_j^2[n]                                         (4)

E[n] = (1/2) Σ_{j∈C} e_j^2[n]                                   (5)
where C is a set containing all neurons of the output layer.
If the total number of patterns contained in the training set is N,
then the average squared error of the network will be
E_av = (1/N) Σ_{n=1}^{N} E[n]                                   (6)
This is the cost function of the network which is to be minimized.
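The cost function above can be evaluated directly. A minimal sketch, assuming made-up desired and actual outputs for N = 3 patterns and two output neurons:

```python
import numpy as np

# Hypothetical toy values: desired and actual outputs for N = 3 training
# patterns, each with two output neurons (the set C).
d = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # desired responses d_j[n]
y = np.array([[0.8, 0.1], [0.3, 0.7], [0.9, 0.6]])   # network outputs y_j[n]

e = d - y                          # error signals e_j[n]
E = 0.5 * np.sum(e**2, axis=1)     # per-pattern error E[n]      (Eq. 5)
E_av = np.mean(E)                  # average error over N patterns (Eq. 6)
```

Minimizing E_av over the weights is exactly the goal of the training procedure derived next.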
Differentiating E[n] with respect to w_ji[n] and making use of the chain rule, we get
∂E[n]/∂w_ji[n] = (∂E[n]/∂e_j[n]) (∂e_j[n]/∂y_j[n]) (∂y_j[n]/∂u_j[n]) (∂u_j[n]/∂w_ji[n])      (7)
The first term, ∂E[n]/∂e_j[n], on the right-hand side (RHS) of the above equation can be found by differentiating both sides of Equation (5) with respect to e_j[n]:

∂E[n]/∂e_j[n] = e_j[n]                                          (8)
The next term, ∂e_j[n]/∂y_j[n], on the RHS of (7) can be obtained by differentiating (3) with respect to y_j[n]:

∂e_j[n]/∂y_j[n] = -1                                            (9)
To find the term ∂y_j[n]/∂u_j[n], we have to differentiate (2) with respect to u_j[n]. That is,

∂y_j[n]/∂u_j[n] = f'_j(u_j[n])                                  (10)
Finally, the last term, ∂u_j[n]/∂w_ji[n], on the RHS of (7) can be computed by differentiating (1) with respect to w_ji[n] and is given by

∂u_j[n]/∂w_ji[n] = y_i[n]                                       (11)
Now equation (7) becomes

∂E[n]/∂w_ji[n] = -e_j[n] f'_j(u_j[n]) y_i[n]                    (12)
The correction Δw_ji[n] applied to w_ji[n] can now be defined as

Δw_ji[n] = -η (∂E[n]/∂w_ji[n])                                  (13)
where η is the learning rate, a factor that decides how fast the weights are allowed to change at each time step. The minus sign indicates that the weights are changed in such a way that the error decreases.
Substituting (12) into (13) yields

Δw_ji[n] = η δ_j[n] y_i[n]                                      (14)

where the local gradient δ_j[n] is defined by
δ_j[n] = -(∂E[n]/∂e_j[n]) (∂e_j[n]/∂y_j[n]) (∂y_j[n]/∂u_j[n]) = e_j[n] f'_j(u_j[n])      (15)
which shows that the local gradient δ_j[n] is the product of the corresponding error signal e_j[n] and the derivative f'_j(u_j[n]) of the associated activation function.
The above derivation is based on the assumption that neuron j is located in the output layer of the network. This is, of course, the simplest case: since neuron j is in the output layer, where the desired signal is always available, it is straightforward to compute the error e_j[n] and the local gradient δ_j[n] using (3) and (15), respectively.
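The output-layer case can be sketched in a few lines. A minimal example, assuming a logistic activation f(u) = 1/(1 + exp(-u)), so that f'(u) = f(u)(1 - f(u)), and made-up inputs, weights, and desired response:

```python
import numpy as np

def f(u):
    """Logistic activation (an assumed choice; the text leaves f unspecified)."""
    return 1.0 / (1.0 + np.exp(-u))

eta = 0.5                      # learning rate
y_i = np.array([1.0, 0.5])     # inputs reaching output neuron j (assumed)
w_ji = np.array([0.2, -0.3])   # weights into neuron j (assumed)
d_j = 1.0                      # desired response (assumed)

u_j = w_ji @ y_i                     # net input
y_j = f(u_j)                         # neuron output
e_j = d_j - y_j                      # error signal
delta_j = e_j * y_j * (1.0 - y_j)    # local gradient, Eq. (15), with f' = y(1-y)
w_ji = w_ji + eta * delta_j * y_i    # weight correction, Eq. (14)
```

One such update moves the output toward the desired response, i.e. the error shrinks, as the minus sign in (13) guarantees.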
Case II: Neuron j is a hidden node

Let us now consider the case in which neuron j is not in the output layer of the network but is located in the hidden layer immediately to the left of the output layer, as shown in the figure of the next slide. Note that the index j will now refer to the hidden layer and the index k to the output layer. Also note that the desired response d_k[n] is not directly available to the hidden-layer neurons.
[Figure: signal-flow graph of hidden neuron j feeding output neuron k. Inputs y_i[n], together with the bias input y_0 = -1 via weight w_j0, are weighted by w_ji[n] and summed to give u_j[n], which passes through f(.) to produce y_j[n]. Neuron k forms u_k[n] and y_k[n] in the same way, and y_k[n] is subtracted from the desired response d_k[n] to give the error e_k[n].]
In this new situation, the local gradient takes the following form:

δ_j[n] = -(∂E[n]/∂y_j[n]) (∂y_j[n]/∂u_j[n]) = -(∂E[n]/∂y_j[n]) f'_j(u_j[n])      (16)

Since neuron k is located in the output layer,

E[n] = (1/2) Σ_{k∈C} e_k^2[n]                                   (17)

which is simply (5) in which the index j has been replaced by the index k. Differentiating this equation with respect to y_j[n] and using the chain rule, we obtain

∂E[n]/∂y_j[n] = Σ_k e_k[n] (∂e_k[n]/∂u_k[n]) (∂u_k[n]/∂y_j[n])                   (18)
Since

e_k[n] = d_k[n] - y_k[n] = d_k[n] - f_k(u_k[n])                 (19)
therefore,

∂e_k[n]/∂u_k[n] = -f'_k(u_k[n])                                 (20)
The net input of neuron k is given by

u_k[n] = Σ_{j=0}^{q} w_kj[n] y_j[n]                             (21)

where q is the total number of inputs applied to neuron k. Differentiating u_k[n] with respect to y_j[n], we have

∂u_k[n]/∂y_j[n] = w_kj[n]                                       (22)
Substituting (20) and (22) into (18) yields

∂E[n]/∂y_j[n] = -Σ_k e_k[n] f'_k(u_k[n]) w_kj[n] = -Σ_k δ_k[n] w_kj[n]           (23)
The local gradient δ_j[n] for the hidden neuron j can now be obtained by using (23) in (16):

δ_j[n] = f'_j(u_j[n]) Σ_k δ_k[n] w_kj[n]                        (24)
Summary:

1. The correction Δw_ji[n] applied to the synaptic weight connecting neuron i to neuron j is defined as

Δw_ji[n] = η δ_j[n] y_i[n]

2. The local gradient δ_j[n] depends on whether neuron j is an output node or a hidden node:

(a) If neuron j is an output node, δ_j[n] equals the product of the derivative f'_j(u_j[n]) and the error signal e_j[n], both of which are associated with neuron j.

(b) If neuron j is a hidden node, δ_j[n] equals the product of the associated derivative f'_j(u_j[n]) and the weighted sum of the δs computed for the neurons in the next hidden or output layer that are connected to neuron j.
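The two rules above can be combined into a single training loop. A minimal sketch, assuming a tiny 2-2-1 network with sigmoid activations and one made-up training pattern (all values are illustrative, not from the text):

```python
import numpy as np

f = lambda u: 1.0 / (1.0 + np.exp(-u))   # sigmoid activation (assumed)

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0])     # input pattern (assumed)
d = np.array([1.0])          # desired response (assumed)
eta = 1.0                    # learning rate
W1 = rng.standard_normal((2, 2))   # hidden-layer weights w_ji
W2 = rng.standard_normal((1, 2))   # output-layer weights w_kj

for _ in range(1000):
    u1 = W1 @ x;  y1 = f(u1)            # forward pass: hidden layer
    u2 = W2 @ y1; y2 = f(u2)            # forward pass: output layer
    e = d - y2                          # error signal
    delta2 = e * y2 * (1 - y2)          # output deltas, rule (a)
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)   # hidden deltas, rule (b)
    W2 += eta * np.outer(delta2, y1)    # weight corrections, rule 1
    W1 += eta * np.outer(delta1, x)
```

After training, the network output approaches the desired response for this pattern; note that for sigmoid units f'(u) = f(u)(1 - f(u)), which is why the derivative appears as y(1 - y).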
Improved Back-Propagation:

The back-propagation algorithm derived above has some drawbacks. First of all, the learning rate η should be chosen small to ensure minimization of the total error signal. However, for a small η the learning process becomes very slow. On the other hand, a large value of η corresponds to rapid learning but leads to parasitic oscillations, which prevent the algorithm from converging to the desired solution. Moreover, if the error function contains many local minima, the network might get trapped in a local minimum or get stuck on a very flat plateau. One simple way to improve the standard back-propagation algorithm is to use an adaptive learning rate and momentum, as described below.
Momentum:

The idea here is to give the weights and biases some momentum so that they do not get stuck in local minima but have enough energy to pass through them. Mathematically, adding momentum is expressed as

Δw_ji[n] = α Δw_ji[n-1] + η δ_j[n] y_i[n]                       (25)

where α is the momentum constant, which must have a value between 0 and 1. If α is zero, the algorithm is the same as the basic back-propagation rule, i.e. no momentum. α equal to 1 means that the weights change exactly as they did in the preceding time step. A typical value of α is 0.9 to 0.95.
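The effect of the momentum term is easy to see for a single weight. A minimal sketch, assuming a constant gradient term η δ_j[n] y_i[n] = 0.1 per step (an illustrative value):

```python
alpha = 0.9        # momentum constant
grad_step = 0.1    # eta * delta_j[n] * y_i[n], held fixed for illustration
dw = 0.0           # Delta w_ji[n-1], the previous weight change
w = 0.0

for _ in range(50):
    dw = alpha * dw + grad_step   # Eq. (25)
    w += dw
```

On a long stretch of consistent gradient the step size settles at grad_step / (1 - alpha), here a 10x effective speed-up over plain back-propagation, which is why momentum accelerates learning along smooth directions of the error surface.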
Adaptive Learning Rate:

As mentioned earlier, it is difficult to choose an appropriate value of the learning rate η for a particular application, and the optimal value can change during training. This parameter should therefore be updated as the training phase progresses; that is, the learning rate should be adaptive.

One way of doing this is to change the learning rate according to how the error function responded to the last change in weights. If a weight update decreased the error function, the weights were probably changed in the right direction, and η is increased. If, on the other hand, the error function increased, the value of η is reduced.
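This rule can be sketched in a few lines. The increase/decrease factors 1.05 and 0.7, and the helper name adapt_eta, are illustrative assumptions, not values from the text:

```python
def adapt_eta(eta, prev_error, new_error, up=1.05, down=0.7):
    """Grow eta if the last weight update reduced the error, else shrink it."""
    return eta * up if new_error < prev_error else eta * down

eta = 0.1
eta = adapt_eta(eta, prev_error=0.50, new_error=0.45)             # error fell: eta grows
eta_after_fail = adapt_eta(eta, prev_error=0.45, new_error=0.60)  # error rose: eta shrinks
```

In practice the decrease factor is chosen more aggressive than the increase factor, so that a single bad step quickly undoes several cautious increases.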