E_j[n] = (1/2) e_j^2[n]                                         (4)

E[n] = (1/2) Σ_{j∈C} e_j^2[n]                                   (5)
where C is a set containing all neurons of the output layer.
If the total number of patterns contained in the training set is N,
then the average squared error of the network will be
E_av = (1/N) Σ_{n=1}^{N} E[n]                                   (6)
This is the cost function of the network which is to be minimized.
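The cost function above can be evaluated directly. A minimal sketch, assuming made-up desired and actual outputs for N = 3 patterns and two output neurons:

```python
import numpy as np

# Hypothetical toy values: desired and actual outputs for N = 3 training
# patterns, each with two output neurons (the set C).
d = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # desired responses d_j[n]
y = np.array([[0.8, 0.1], [0.3, 0.7], [0.9, 0.6]])   # network outputs y_j[n]

e = d - y                          # error signals e_j[n]
E = 0.5 * np.sum(e**2, axis=1)     # per-pattern error E[n]      (Eq. 5)
E_av = np.mean(E)                  # average error over N patterns (Eq. 6)
```

Minimizing E_av over the weights is exactly the goal of the training procedure derived next.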
Differentiating E[n] with respect to w_ji[n] and making use of the chain rule, we get
∂E[n]/∂w_ji[n] = (∂E[n]/∂e_j[n]) (∂e_j[n]/∂y_j[n]) (∂y_j[n]/∂u_j[n]) (∂u_j[n]/∂w_ji[n])      (7)
The first term, ∂E[n]/∂e_j[n], on the right-hand side (RHS) of the above equation can be found by differentiating both sides of Equation (5) with respect to e_j[n]:

∂E[n]/∂e_j[n] = e_j[n]                                          (8)
The next term, ∂e_j[n]/∂y_j[n], on the RHS of (7) can be obtained by differentiating (3) with respect to y_j[n]:

∂e_j[n]/∂y_j[n] = -1                                            (9)
To find the term ∂y_j[n]/∂u_j[n], we have to differentiate (2) with respect to u_j[n]. That is,

∂y_j[n]/∂u_j[n] = f'_j(u_j[n])                                  (10)
Finally, the last term, ∂u_j[n]/∂w_ji[n], on the RHS of (7) can be computed by differentiating (1) with respect to w_ji[n] and is given by

∂u_j[n]/∂w_ji[n] = y_i[n]                                       (11)
Now equation (7) becomes

∂E[n]/∂w_ji[n] = -e_j[n] f'_j(u_j[n]) y_i[n]                    (12)
The correction Δw_ji[n] applied to w_ji[n] can now be defined as

Δw_ji[n] = -η (∂E[n]/∂w_ji[n])                                  (13)
where η is the learning rate, a factor that decides how fast the weights are allowed to change at each time step. The minus sign indicates that the weights are changed in such a way that the error decreases.
Substituting (12) into (13) yields

Δw_ji[n] = η δ_j[n] y_i[n]                                      (14)

where the local gradient δ_j[n] is defined by
δ_j[n] = -(∂E[n]/∂e_j[n]) (∂e_j[n]/∂y_j[n]) (∂y_j[n]/∂u_j[n]) = e_j[n] f'_j(u_j[n])      (15)
which shows that the local gradient δ_j[n] is the product of the corresponding error signal e_j[n] and the derivative f'_j(u_j[n]) of the associated activation function.
The above derivation is based on the assumption that neuron j is located in the output layer of the network. This is, of course, the simplest case: since neuron j is in the output layer, where the desired signal is always available, it is straightforward to compute the error e_j[n] and the local gradient δ_j[n] using (3) and (15), respectively.
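The output-layer case can be sketched in a few lines. A minimal example, assuming a logistic activation f(u) = 1/(1 + exp(-u)), so that f'(u) = f(u)(1 - f(u)), and made-up inputs, weights, and desired response:

```python
import numpy as np

def f(u):
    """Logistic activation (an assumed choice; the text leaves f unspecified)."""
    return 1.0 / (1.0 + np.exp(-u))

eta = 0.5                      # learning rate
y_i = np.array([1.0, 0.5])     # inputs reaching output neuron j (assumed)
w_ji = np.array([0.2, -0.3])   # weights into neuron j (assumed)
d_j = 1.0                      # desired response (assumed)

u_j = w_ji @ y_i                     # net input
y_j = f(u_j)                         # neuron output
e_j = d_j - y_j                      # error signal
delta_j = e_j * y_j * (1.0 - y_j)    # local gradient, Eq. (15), with f' = y(1-y)
w_ji = w_ji + eta * delta_j * y_i    # weight correction, Eq. (14)
```

One such update moves the output toward the desired response, i.e. the error shrinks, as the minus sign in (13) guarantees.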
Case II: Neuron j is a hidden node

Let us now consider the case in which neuron j is not in the output layer of the network but is located in the hidden layer immediately to the left of the output layer, as shown in the figure of the next slide. Note that the index j will now refer to the hidden layer and the index k to the output layer. Also note that the desired response d_k[n] is not directly available to the hidden-layer neurons.
[Figure: signal-flow graph of hidden neuron j feeding output neuron k. Inputs y_i[n], together with the bias input y_0 = -1 via weight w_j0, are weighted by w_ji[n] and summed to give u_j[n], which passes through f(.) to produce y_j[n]. Neuron k forms u_k[n] and y_k[n] in the same way, and y_k[n] is subtracted from the desired response d_k[n] to give the error e_k[n].]
In this new situation, the local gradient takes the following form:

δ_j[n] = -(∂E[n]/∂y_j[n]) (∂y_j[n]/∂u_j[n]) = -(∂E[n]/∂y_j[n]) f'_j(u_j[n])      (16)

Since neuron k is located in the output layer,

E[n] = (1/2) Σ_{k∈C} e_k^2[n]                                   (17)

which is simply (5) in which the index j has been replaced by the index k. Differentiating this equation with respect to y_j[n] and using the chain rule, we obtain

∂E[n]/∂y_j[n] = Σ_k e_k[n] (∂e_k[n]/∂u_k[n]) (∂u_k[n]/∂y_j[n])                   (18)
Since

e_k[n] = d_k[n] - y_k[n] = d_k[n] - f_k(u_k[n])                 (19)
therefore,

∂e_k[n]/∂u_k[n] = -f'_k(u_k[n])                                 (20)
The net input of neuron k is given by

u_k[n] = Σ_{j=0}^{q} w_kj[n] y_j[n]                             (21)

where q is the total number of inputs applied to neuron k. Differentiating u_k[n] with respect to y_j[n], we have

∂u_k[n]/∂y_j[n] = w_kj[n]                                       (22)
Substituting (20) and (22) into (18) yields

∂E[n]/∂y_j[n] = -Σ_k e_k[n] f'_k(u_k[n]) w_kj[n] = -Σ_k δ_k[n] w_kj[n]           (23)
The local gradient δ_j[n] for the hidden neuron j can now be obtained by using (23) in (16):

δ_j[n] = f'_j(u_j[n]) Σ_k δ_k[n] w_kj[n]                        (24)
Summary:

1. The correction Δw_ji[n] applied to the synaptic weight connecting neuron i to neuron j is defined as

Δw_ji[n] = η δ_j[n] y_i[n]

2. The local gradient δ_j[n] depends on whether neuron j is an output node or a hidden node:

(a) If neuron j is an output node, δ_j[n] equals the product of the derivative f'_j(u_j[n]) and the error signal e_j[n], both of which are associated with neuron j.

(b) If neuron j is a hidden node, δ_j[n] equals the product of the associated derivative f'_j(u_j[n]) and the weighted sum of the δs computed for the neurons in the next hidden or output layer that are connected to neuron j.
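The two rules above can be combined into a single training loop. A minimal sketch, assuming a tiny 2-2-1 network with sigmoid activations and one made-up training pattern (all values are illustrative, not from the text):

```python
import numpy as np

f = lambda u: 1.0 / (1.0 + np.exp(-u))   # sigmoid activation (assumed)

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0])     # input pattern (assumed)
d = np.array([1.0])          # desired response (assumed)
eta = 1.0                    # learning rate
W1 = rng.standard_normal((2, 2))   # hidden-layer weights w_ji
W2 = rng.standard_normal((1, 2))   # output-layer weights w_kj

for _ in range(1000):
    u1 = W1 @ x;  y1 = f(u1)            # forward pass: hidden layer
    u2 = W2 @ y1; y2 = f(u2)            # forward pass: output layer
    e = d - y2                          # error signal
    delta2 = e * y2 * (1 - y2)          # output deltas, rule (a)
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)   # hidden deltas, rule (b)
    W2 += eta * np.outer(delta2, y1)    # weight corrections, rule 1
    W1 += eta * np.outer(delta1, x)
```

After training, the network output approaches the desired response for this pattern; note that for sigmoid units f'(u) = f(u)(1 - f(u)), which is why the derivative appears as y(1 - y).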
Improved Back-Propagation:

The back-propagation algorithm derived above has some drawbacks. First of all, the learning rate η should be chosen small to ensure minimization of the total error signal. However, for a small η the learning process becomes very slow. On the other hand, a large value of η corresponds to rapid learning but leads to parasitic oscillations, which prevent the algorithm from converging to the desired solution. Moreover, if the error function contains many local minima, the network might get trapped in a local minimum or get stuck on a very flat plateau. One simple way to improve the standard back-propagation algorithm is to use an adaptive learning rate and momentum, as described below.
Momentum:

The idea here is to give the weights and biases some momentum so that they do not get stuck in local minima but have enough energy to pass through them. Mathematically, adding momentum is expressed as

Δw_ji[n] = α Δw_ji[n-1] + η δ_j[n] y_i[n]                       (25)

where α is the momentum constant, which must have a value between 0 and 1. If α is zero, the algorithm is the same as the basic back-propagation rule, i.e. no momentum. α equal to 1 means that the weights change exactly as they did in the preceding time step. A typical value of α is 0.9 to 0.95.
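The effect of the momentum term is easy to see for a single weight. A minimal sketch, assuming a constant gradient term η δ_j[n] y_i[n] = 0.1 per step (an illustrative value):

```python
alpha = 0.9        # momentum constant
grad_step = 0.1    # eta * delta_j[n] * y_i[n], held fixed for illustration
dw = 0.0           # Delta w_ji[n-1], the previous weight change
w = 0.0

for _ in range(50):
    dw = alpha * dw + grad_step   # Eq. (25)
    w += dw
```

On a long stretch of consistent gradient the step size settles at grad_step / (1 - alpha), here a 10x effective speed-up over plain back-propagation, which is why momentum accelerates learning along smooth directions of the error surface.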
Adaptive Learning Rate:

As mentioned earlier, it is difficult to choose an appropriate value of the learning rate η for a particular application, and the optimal value can change during training. This parameter should therefore be updated as the training phase progresses; that is, the learning rate should be adaptive.

One way of doing this is to change the learning rate according to how the error function responded to the last change in weights. If a weight update decreased the error function, the weights were probably changed in the right direction, and η is increased. If, on the other hand, the error function increased, the value of η is reduced.
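This rule can be sketched in a few lines. The increase/decrease factors 1.05 and 0.7, and the helper name adapt_eta, are illustrative assumptions, not values from the text:

```python
def adapt_eta(eta, prev_error, new_error, up=1.05, down=0.7):
    """Grow eta if the last weight update reduced the error, else shrink it."""
    return eta * up if new_error < prev_error else eta * down

eta = 0.1
eta = adapt_eta(eta, prev_error=0.50, new_error=0.45)             # error fell: eta grows
eta_after_fail = adapt_eta(eta, prev_error=0.45, new_error=0.60)  # error rose: eta shrinks
```

In practice the decrease factor is chosen more aggressive than the increase factor, so that a single bad step quickly undoes several cautious increases.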