
Ch. 8 Neural Networks

Hantao Zhang
http://www.cs.uiowa.edu/~hzhang/c145

The University of Iowa
Department of Computer Science

Brains as Computational Devices


Motivation: Algorithms developed over centuries do not fit the complexity of
real-world problems. The human brain is the most sophisticated computer,
suitable for solving extremely complex problems.

Reasonable size: 10^11 neurons (neural cells), and only a small portion of
these cells are used.
Simple building blocks: no cell contains too much information; information is
saved mainly in the connections among neurons.
Massively parallel: each region of the brain controls specialized tasks.
Fault-tolerant: reliable, with graceful degradation.

Comparing Brains with Computers


                      Computer                          Human Brain
Computational units   1 CPU, 10^5 gates                 10^11 neurons
Storage units         10^9 bits RAM, 10^10 bits disk    10^11 neurons, 10^14 synapses
Cycle time            10^-8 sec                         10^-3 sec
Bandwidth             10^9 bits/sec                     10^14 bits/sec
Neuron updates/sec    10^5                              10^14

Even if a computer is one million times faster than a brain in raw speed, the
brain ends up being one billion times faster than a computer at what it does.
Example: recognizing a face
  Brain: < 1 s (a few hundred cycles)
  Computer: billions of cycles

Biological System
[Figure: a biological neuron: cell body (soma) with nucleus, dendrites, an axon with its axonal arborization, and synapses onto dendrites of other cells]

A neuron does nothing until the collective influence of all its inputs
reaches a threshold level.
At that point, the neuron produces a full-strength output in the form of a
narrow pulse that proceeds from the cell body, down the axon, and into the
axon's branches.
It fires! Since it either fires or does nothing, it is considered an
"all or nothing" device.
A synapse increases or decreases the strength of the connection and causes
excitation or inhibition of the subsequent neuron.

Analogy from Biology


[Figure: an artificial unit: input links carry activations a_j through weights W_j,i into the input function in_i; the activation function g produces the output a_i = g(in_i), which is sent along the output links]

Artificial neurons are viewed as nodes connected to other nodes via links
that correspond to neural connections.
Each link is associated with a weight.
The weight determines the nature (+/-) and strength of one node's influence
on another.
If the combined influence of all the links is strong enough, the node is
activated (similar to the firing of a neuron).

A Neural Network Unit


Artificial Neural Network


A neural network is a graph of nodes (or units) connected by links.
Each link has an associated weight, a real number.
Typically, each node i has several incoming links and several outgoing links.
Each incoming link provides a real number as input to the node, and the node
sends one real number through every outgoing link.
The output of a node is a function of the weighted sum of the node's inputs.

The Input Function


Each incoming link of a unit i feeds it an input value, or activation value,
a_j coming from another unit.
The input function in_i of a unit is simply the weighted sum of the unit's
inputs:

    in_i(a_1, ..., a_{n_i}) = Σ_{j=1}^{n_i} W_{j,i} a_j

The unit applies the activation function g_i to the result of in_i to produce
an output:

    out_i = g_i(in_i) = g_i( Σ_{j=1}^{n_i} W_{j,i} a_j )
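As a small illustration (my own sketch, not from the slides), the C fragment
below computes a single unit's weighted input and its output, using the
sigmoid of the next slide as the activation function; the weights and
activations are made-up values.

#include <math.h>
#include <stdio.h>

/* Activation function g: here the sigmoid 1 / (1 + e^-x). */
double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* in_i = sum_j w[j] * a[j]; out_i = g(in_i). */
double unit_output(const double *w, const double *a, int n) {
    double in = 0.0;
    for (int j = 0; j < n; j++)
        in += w[j] * a[j];
    return sigmoid(in);
}

int main(void) {
    double w[3] = {0.5, -1.0, 2.0};   /* hypothetical link weights W_{j,i} */
    double a[3] = {1.0, 0.0, 0.5};    /* hypothetical incoming activations a_j */
    printf("out_i = %f\n", unit_output(w, a, 3));
    return 0;
}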

Typical Activation Functions


[Figure: graphs of a_i against in_i for (a) the step function, (b) the sign function, and (c) the sigmoid function, each saturating at +1]

(a) Step function:     step_t(x) = 1 if x ≥ t, 0 if x < t
(b) Sign function:     sign(x) = +1 if x ≥ 0, -1 if x < 0
(c) Sigmoid function:  sig(x) = 1 / (1 + e^{-x})

Typical Activation Functions 2


Hard limiter:

    f(x) = 1 if x > θ;  0 if -θ ≤ x ≤ θ;  -1 if x < -θ

Binary sigmoid (exponential sigmoid):

    sig(x) = 1 / (1 + e^{-cx})

where c controls the saturation of the curve. When c → ∞, the hard limiter is
achieved.

Bipolar sigmoid (atan):

    f(x) = tan^{-1}(x)
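These functions are easy to code directly. The C sketch below is one possible
rendering (the function names are mine); the slope parameter c shows how the
binary sigmoid approaches the hard limiter as c grows.

#include <math.h>

/* Step function with threshold t: 1 if x >= t, else 0. */
double step_t(double x, double t) { return (x >= t) ? 1.0 : 0.0; }

/* Sign function: +1 if x >= 0, -1 otherwise. */
double sign_fn(double x) { return (x >= 0.0) ? 1.0 : -1.0; }

/* Binary sigmoid with slope c: approaches the hard limiter as c grows. */
double binary_sigmoid(double x, double c) { return 1.0 / (1.0 + exp(-c * x)); }

/* Bipolar sigmoid based on atan, as on the slide. */
double bipolar_sigmoid(double x) { return atan(x); }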

Units as Logic Gates


AND: two inputs, each with weight W = 1; threshold t = 1.5
OR:  two inputs, each with weight W = 1; threshold t = 0.5
NOT: one input with weight W = -1; threshold t = -0.5

Activation function: step_t

Since units can implement the AND, OR, and NOT boolean operators, neural nets
are Turing-complete: they can implement any computable function.
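To check the gate weights above, the following sketch (code I added for
illustration) evaluates the AND, OR, and NOT units with the step_t activation
over all boolean inputs.

#include <stdio.h>

/* step_t activation: fires (1) when the weighted sum reaches threshold t. */
static int step(double sum, double t) { return sum >= t ? 1 : 0; }

/* Two-input units with weights 1, 1. */
static int and_unit(int a, int b) { return step(1.0*a + 1.0*b, 1.5); }
static int or_unit (int a, int b) { return step(1.0*a + 1.0*b, 0.5); }
/* One-input unit with weight -1 and threshold -0.5. */
static int not_unit(int a)        { return step(-1.0*a, -0.5); }

int main(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d %d: AND=%d OR=%d NOT(a)=%d\n",
                   a, b, and_unit(a, b), or_unit(a, b), not_unit(a));
    return 0;
}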

Structures of Neural Networks


Directed:
    Acyclic:
        Feed-forward:
            Multi-layer: nodes are grouped into layers and all links go from
            one layer to the next layer.
            Single layer: each node sends its output out of the network.
        Tree: ...
        Arbitrary feed: ...
    Cyclic: ...
Undirected: ...

Multilayer, Feed-forward Networks


A kind of neural network in which
links are directional and form no cycles (the net is a
directed acyclic graph);
the root nodes of the graph are input units, their
activation value is determined by the environment;
the leaf nodes are output units;
the remaining nodes are hidden units;
units can be divided into layers: a unit in a layer is
connected only to units in the next layer.


A Two-layer, Feed-forward Network


[Figure: a two-layer feed-forward network: input units I_k, weights W_{k,j}, hidden units a_j, weights W_{j,i}, output units O_i]

Notes:

The roots of the graph are at the bottom and the (only) leaf at the top.
The layer of input units is generally not counted (which is why this is
a two-layer net).


Example
[Figure: a two-layer network with input units I1, I2, hidden units H3, H4, output unit O5, and weights w13, w14, w23, w24, w35, w45]

    a5 = g5(W3,5 a3 + W4,5 a4)
       = g5(W3,5 g3(W1,3 a1 + W2,3 a2) + W4,5 g4(W1,4 a1 + W2,4 a2))

where ai is the output and gi is the activation function of node i.

Multilayer, Feed-forward Networks


A powerful computational device:
with just one hidden layer, they can approximate any
continuous function;
with just two hidden layers, they can approximate any
computable function.
However, the number of units needed per layer may grow exponentially with the
number of input units.

Perceptrons
Single-layer, feed-forward networks whose units use a step
function as activation function.

[Figure: (left) a perceptron network: input units I_j connected by weights W_j,i to output units O_i; (right) a single perceptron: input units I_j connected by weights W_j to a single output unit O]

Perceptrons
Perceptrons caused a great stir when they were invented
because it was shown that
If a function is representable by a perceptron, then it
is learnable with 100% accuracy, given enough
training examples.
The problem is that perceptrons can only represent
linearly-separable functions.


Linearly Separable Functions


On a 2-dimensional space:

[Figure: plots of I1 against I2 for (a) I1 and I2, (b) I1 or I2, (c) I1 xor I2; a separating line exists in (a) and (b) but not in (c)]

A black dot corresponds to an output value of 1. An empty dot corresponds to
an output value of 0.

How to Represent XOR function by NN


[Figure: the two-layer network from before: input units I1, I2, hidden units H3, H4, output unit O5]

    a5 = g5(W3,5 a3 + W4,5 a4)
       = g5(W3,5 g3(W1,3 a1 + W2,3 a2) + W4,5 g4(W1,4 a1 + W2,4 a2))

where ai is the output of node i, gi = step_0.5 is the activation function of
node i, and

    W1,3 = W2,4 = W3,5 = W4,5 = 1,   W1,4 = W2,3 = -1.
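One way to convince yourself that these weights compute XOR is to evaluate
the network on all four inputs; the sketch below (added for illustration)
applies step_0.5 at every node.

#include <stdio.h>

static int step05(double x) { return x >= 0.5 ? 1 : 0; }

/* 2-2-1 network with W1,3 = W2,4 = W3,5 = W4,5 = 1 and W1,4 = W2,3 = -1. */
static int xor_net(int a1, int a2) {
    int a3 = step05( 1.0*a1 - 1.0*a2);   /* hidden unit H3 */
    int a4 = step05(-1.0*a1 + 1.0*a2);   /* hidden unit H4 */
    return step05(1.0*a3 + 1.0*a4);      /* output unit O5 */
}

int main(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("%d xor %d = %d\n", a, b, xor_net(a, b));
    return 0;
}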

A Linearly Separable Function


On a 3-dimensional space:
The minority function: return 1 if the input vector contains fewer ones than
zeros; return 0 otherwise.

Inputs I1, I2, I3, each with weight W = -1; threshold t = -1.5.

[Figure: (a) the separating plane in (I1, I2, I3) space; (b) the weights and threshold]
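The same style of check works for the 3-input minority unit; the sketch below
(again my own) uses weights of -1 and threshold -1.5 and prints the output
for all eight input vectors.

#include <stdio.h>

/* Minority unit: weights -1, -1, -1, threshold -1.5. */
static int minority(int i1, int i2, int i3) {
    double sum = -1.0*i1 - 1.0*i2 - 1.0*i3;
    return sum >= -1.5 ? 1 : 0;
}

int main(void) {
    for (int i1 = 0; i1 <= 1; i1++)
        for (int i2 = 0; i2 <= 1; i2++)
            for (int i3 = 0; i3 <= 1; i3++)
                printf("%d%d%d -> %d\n", i1, i2, i3, minority(i1, i2, i3));
    return 0;
}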


Computing with a 2-layer NN


#define NUM_INPUTS 4
#define NUM_HIDDEN_NEURONS 4
#define NUM_OUTPUT_NEURONS 3
typedef struct mlp_s {
    /* Inputs to the MLP (+1 for bias) */
    double inputs[NUM_INPUTS+1];
    /* Weights from Hidden to Input Layer (+1 for bias) */
    double w_h_i[NUM_HIDDEN_NEURONS+1][NUM_INPUTS+1];
    /* Hidden layer (+1 for bias) */
    double hidden[NUM_HIDDEN_NEURONS+1];
    /* Weights from Output to Hidden Layer (+1 for bias) */
    double w_o_h[NUM_OUTPUT_NEURONS][NUM_HIDDEN_NEURONS+1];
    /* Outputs of the MLP */
    double outputs[NUM_OUTPUT_NEURONS];
} mlp_t;

double step(double x) {
    if (x > 0.0) return 1.0; else return 0.0;
}

Computing with a 2-layer NN


void feed_forward( mlp_t *mlp ) {
    int i, h, out;

    /* Feed the inputs to the hidden layer through
       the hidden-to-input weights. */
    for ( h = 0 ; h < NUM_HIDDEN_NEURONS ; h++ ) {
        mlp->hidden[h] = 0.0;
        for ( i = 0 ; i < NUM_INPUTS+1 ; i++ ) {
            mlp->hidden[h] += ( mlp->inputs[i] * mlp->w_h_i[h][i] );
        }
        mlp->hidden[h] = step( mlp->hidden[h] );
    }
    mlp->hidden[NUM_HIDDEN_NEURONS] = 1.0;   /* fixed bias for the output layer */

    /* Feed the hidden layer activations to the output layer
       through the output-to-hidden weights. */
    for ( out = 0 ; out < NUM_OUTPUT_NEURONS ; out++ ) {
        mlp->outputs[out] = 0.0;
        for ( h = 0 ; h < NUM_HIDDEN_NEURONS+1 ; h++ ) {
            mlp->outputs[out] += ( mlp->hidden[h] * mlp->w_o_h[out][h] );
        }
        mlp->outputs[out] = step( mlp->outputs[out] );
    }
}
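A possible way to exercise feed_forward is sketched below (my own example,
with made-up input values; in practice the weights w_h_i and w_o_h would come
from training). The caller fills in the input pattern, sets the bias input to
1, and reads the outputs.

#include <stdio.h>   /* in addition to the definitions above */

int main(void) {
    mlp_t mlp = {0};                  /* all weights zero here (hypothetical) */

    /* Example input pattern plus the fixed bias input. */
    mlp.inputs[0] = 1.0;
    mlp.inputs[1] = 0.0;
    mlp.inputs[2] = 1.0;
    mlp.inputs[3] = 0.0;
    mlp.inputs[NUM_INPUTS] = 1.0;     /* bias */

    feed_forward(&mlp);

    for (int out = 0; out < NUM_OUTPUT_NEURONS; out++)
        printf("output[%d] = %f\n", out, mlp.outputs[out]);
    return 0;
}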

Applications of Neural Networks


Signal and Image Processing
Signal prediction (e.g., weather prediction)
Adaptive noise cancellation
Satellite image analysis
Multimedia processing
Bioinformatics
Functional classification of protein and genes
Clustering of genes based on DNA microarray data

Applications of Neural Networks


Astronomy
Classification of objects (stars and galaxies)
Compression of astronomical data
Finance and Marketing
Stock market prediction
Fraud detection
Loan approval
Product bundling
Strategic planning


Computing with NNs


Different functions are implemented by different network
topologies and unit weights.
The lure of NNs is that a network need not be explicitly
programmed to compute a certain function f .
Given enough nodes and links, a NN can learn the
function by itself.
It does so by looking at a training set of input/output
pairs for f and modifying its topology and weights so
that its own input/output behavior agrees with the
training pairs.
In other words, NNs learn by induction, too.


Learning = Training in NN
Neural networks are trained using data referred to as a
training set.
The process is one of computing outputs, comparing the outputs with the
desired answers, adjusting the weights, and repeating.
The information of a neural network is in its structure, activation
functions, and weights.
Learning to use different structures and activation functions is very
difficult.
The weights express the relative strength of an input value or of a value
coming from a connecting unit (i.e., in another layer). It is by adjusting
these weights that a neural network learns.

Process for Developing NN


1. Collect data: ensure that the application is amenable to a NN approach,
   and pick the data randomly.
2. Separate the data into a training set and a test set.
3. Define a network structure: are perceptrons sufficient?
4. Select a learning algorithm: decided by the available tools.
5. Set parameter values: they will affect the length of the training period.
6. Training: determine and revise the weights.
7. Test: if not acceptable, go back to steps 1, 2, ..., or 5.
8. Delivery of the product.

The Perceptron Learning Method


Weight updating in perceptrons is very simple because
each output node is independent of the other output nodes.

[Figure: (left) a perceptron network with input units I_j, weights W_j,i, and output units O_i; (right) a single perceptron with weights W_j and output unit O]

With no loss of generality then, we can consider a perceptron with a single
output node.

Normalizing Unit Thresholds


Notice that, if t is the threshold value of the output unit, then

    step_t( Σ_{j=1}^{n} W_j I_j ) = step_0( Σ_{j=0}^{n} W_j I_j )

where W_0 = t and I_0 = -1.
Therefore, we can always assume that the unit's threshold is 0 if we include
the actual threshold as the weight of an extra link with a fixed input value.
This allows thresholds to be learned like any other weight.
Then, we can even allow output values in [0, 1] by replacing step_0 by the
sigmoid function.

The Perceptron Learning Method


If O is the value returned by the output unit for a given example and T is
the expected output, then the unit's error is

    Err = T - O

If the error Err is positive, we need to increase O; otherwise, we need to
decrease O.

The Perceptron Learning Method


Since O = g( Σ_{j=0}^{n} W_j I_j ), we can change O by changing each W_j.

Assuming g is monotonic, to increase O we should increase W_j if I_j is
positive, and decrease W_j if I_j is negative.
Similarly, to decrease O we should decrease W_j if I_j is positive, and
increase W_j if I_j is negative.
This is done by updating each W_j as follows:

    W_j ← W_j + α I_j Err

where α is a positive constant, the learning rate, and Err = T - O.

Theoretic Background
Learn by adjusting weights to reduce the error on the training set.
The squared error for an example with input x and true output y is

    E = (1/2) Err^2 = (1/2) (y - h_W(x))^2

Perform optimization search by gradient descent:

    ∂E/∂W_j = Err · ∂Err/∂W_j = Err · ∂/∂W_j ( y - g( Σ_{j=0}^{n} W_j x_j ) )
            = -Err · g'(in) · x_j

Weight update rule


    W_j ← W_j - α ∂E/∂W_j = W_j + α Err g'(in) x_j

E.g., a positive error means we should increase the network output, which
means increasing the weights on positive inputs and decreasing them on
negative inputs.
Simple weight update rule (treating g'(in) as a constant):

    W_j ← W_j + α Err x_j
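Putting the pieces together, one training epoch of the perceptron rule might
look like the following C sketch (my own code; the sizes and the x[e][0] = -1
bias convention from the threshold-normalization slide are assumptions).

#define N_INPUTS   5        /* hypothetical sizes for illustration */
#define N_EXAMPLES 8

/* step_0 activation: threshold 0, since w[0] with x[e][0] = -1 encodes the
   actual threshold. */
static double step0(double x) { return x >= 0.0 ? 1.0 : 0.0; }

/* One pass over the training set with the perceptron rule
   W_j <- W_j + alpha * I_j * Err. */
void perceptron_epoch(double w[N_INPUTS + 1],
                      const double x[N_EXAMPLES][N_INPUTS + 1],
                      const double t[N_EXAMPLES],
                      double alpha) {
    for (int e = 0; e < N_EXAMPLES; e++) {
        double in = 0.0;
        for (int j = 0; j <= N_INPUTS; j++)
            in += w[j] * x[e][j];           /* weighted sum, x[e][0] == -1 */
        double out = step0(in);
        double err = t[e] - out;            /* Err = T - O */
        for (int j = 0; j <= N_INPUTS; j++)
            w[j] += alpha * x[e][j] * err;  /* W_j <- W_j + alpha * I_j * Err */
    }
}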

A 5-place Minority Function


First, collect the data (see below); then choose a structure (a perceptron
with five inputs and one output) and the activation function (i.e., step_{-3}).
Finally, set up the parameters (i.e., W_i = 0) and start to learn.

Assuming α = 1, we have Sum = Σ_{i=1}^{5} W_i I_i, Out = step_{-3}(Sum),
Err = T - Out, and W_j ← W_j + α I_j Err.

[Table: training trace with columns I1..I5, W1..W5, Sum, Out, Err, one row per example e1..e8]

A 5-place Minority Function


The same as the last example, except that α = 0.5 instead of α = 1, and the
initial W_i are different.

Sum = Σ_{i=1}^{5} W_i I_i, Out = step_{-3}(Sum), Err = T - Out, and
W_j ← W_j + α I_j Err.

[Table: training trace with columns I1..I5, W1..W5, Sum, Out, Err, one row per example e1..e8]
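Such a training trace can be produced mechanically. The sketch below (my own
code, with hypothetical training examples, learning rate α = 1 as on the
first of these two slides, and the threshold taken to be -3 since the slides'
minus signs were lost in extraction) applies the stated update rule once to
each example and prints one row of the table per example.

#include <stdio.h>

#define N 5          /* five inputs */
#define M 8          /* eight training examples e1..e8 */

/* step_{-3}: output 1 when the weighted sum is at least -3 (assumed threshold). */
static double out_fn(double sum) { return sum >= -3.0 ? 1.0 : 0.0; }

int main(void) {
    /* Hypothetical training examples (the slides' own examples are not shown). */
    double I[M][N] = {
        {0,0,0,0,0}, {1,0,0,0,0}, {1,1,0,0,0}, {1,1,1,0,0},
        {1,1,1,1,0}, {1,1,1,1,1}, {0,1,0,1,0}, {0,0,1,0,0}
    };
    double W[N] = {0, 0, 0, 0, 0};     /* initial weights, as on the slide */
    double alpha = 1.0;                /* learning rate */

    for (int e = 0; e < M; e++) {
        int ones = 0;
        for (int j = 0; j < N; j++) ones += (int)I[e][j];
        double T = (ones < N - ones) ? 1.0 : 0.0;   /* minority-function target */

        double sum = 0.0;
        for (int j = 0; j < N; j++) sum += W[j] * I[e][j];
        double Out = out_fn(sum);
        double Err = T - Out;
        for (int j = 0; j < N; j++) W[j] += alpha * I[e][j] * Err;

        printf("e%d: Sum=%.1f Out=%.0f Err=%+.0f  W = %.1f %.1f %.1f %.1f %.1f\n",
               e + 1, sum, Out, Err, W[0], W[1], W[2], W[3], W[4]);
    }
    return 0;
}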

Multilayer Perceptrons (MLP)


Layers are usually fully connected; the numbers of hidden units are typically
chosen by hand.

[Figure: a multilayer network with input units a_k, weights W_{k,j}, hidden units a_j, weights W_{j,i}, and output units a_i]

All continuous functions can be represented with 2 layers, all functions with
3 layers.

Sigmoid Function in MLP


    g(x) = 1 / (1 + e^{-cx})

As c becomes larger, the curve becomes sharper:

[Figure: sigmoid curves for increasing values of c]

Sigmoid Function in Perceptron


    g(x) = 1 / (1 + e^{-cx})

Only linearly-separable functions:

[Figure: output of a single sigmoid unit over the input space]

Sigmoid Function in Two Layer MLP


    g(x) = 1 / (1 + e^{-cx})

Two hidden nodes produce ridge-like functions:

[Figure: 3D plot of h_W(x1, x2) for x1, x2 in [-4, 4], showing a ridge]

Sigmoid Function in Two Layer MLP


    g(x) = 1 / (1 + e^{-cx})

Four hidden nodes produce bump-like functions:

[Figure: 3D plot of h_W(x1, x2) for x1, x2 in [-4, 4], showing a bump]

Errors with Sigmoid Functions


    ∂E/∂W_j = -Err · g'(in) · x_j

    W_j ← W_j - α ∂E/∂W_j = W_j + α Err g'(in) x_j

Assuming g(x) = 1 / (1 + e^{-x}), we get

    g'(x) = e^{-x} / (1 + e^{-x})^2 = g(x) (1 - g(x))

Eq. (8.8) on page 267 of the textbook, g'(u) = u(1 - u), is a typo.

Back-propagation Learning
1. Phase 1: Propagation
   (a) Forward propagation of a training example's input to get the output O.
   (b) Backward propagation of the output error to generate the deltas of all
       neural nodes (neurons).
2. Phase 2: Weight update
   (a) Multiply a node's output delta and its input activation to get the
       gradient of the weight.
   (b) Move the weight in the opposite direction of the gradient by
       subtracting a ratio of it from the weight.

(Most neuroscientists deny that back-propagation occurs in the brain.)

Back-propagation Learning
Output layer: similar to the single-layer perceptron, let
Δ_i = Err_i · g'(in_i) (called EO in Eq. 8.6). Then

    W_j,i ← W_j,i + α a_j Δ_i

Hidden layer: back-propagate the error from the output layer,

    Δ_j = ( Σ_i W_j,i Δ_i ) · g'(in_j)

(called Eh in Eq. 8.7, which has typos).
The update rule is identical:

    W_k,j ← W_k,j + α a_k Δ_j

Back-propagation Learning

Initialize the weights in the network (often randomly)
while (stopping criterion not met)
    For each example e in the training set
        O = neural-net-output(network, e)     \\ forward phase
        T = teacher output for e
        Calculate err = (T - O) at the output units
        \\ backward phase
        Compute delta_wh for all weights from the hidden layer
            to the output layer
        Compute delta_wi for all weights from the input layer
            to the hidden layer
        Update the weights in the network
Return the network
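One concrete reading of this pseudocode for a single-hidden-layer network
with sigmoid units is sketched below (array sizes and names are assumptions,
not the textbook's code); it follows the two delta formulas from the previous
slide.

#include <math.h>

#define NI 4   /* input units  (assumed sizes for illustration) */
#define NH 4   /* hidden units */
#define NO 3   /* output units */

static double g(double x)        { return 1.0 / (1.0 + exp(-x)); }
static double g_prime(double in) { double y = g(in); return y * (1.0 - y); }

/* One training example: forward pass, then backward pass and weight update. */
void backprop_step(double w_kj[NH][NI], double w_ji[NO][NH],
                   const double a_k[NI], const double t[NO], double alpha) {
    double in_j[NH], a_j[NH], in_i[NO], a_i[NO];
    double delta_i[NO], delta_j[NH];

    /* Forward phase. */
    for (int j = 0; j < NH; j++) {
        in_j[j] = 0.0;
        for (int k = 0; k < NI; k++) in_j[j] += w_kj[j][k] * a_k[k];
        a_j[j] = g(in_j[j]);
    }
    for (int i = 0; i < NO; i++) {
        in_i[i] = 0.0;
        for (int j = 0; j < NH; j++) in_i[i] += w_ji[i][j] * a_j[j];
        a_i[i] = g(in_i[i]);
    }

    /* Backward phase: output deltas, then back-propagated hidden deltas. */
    for (int i = 0; i < NO; i++)
        delta_i[i] = (t[i] - a_i[i]) * g_prime(in_i[i]);
    for (int j = 0; j < NH; j++) {
        double sum = 0.0;
        for (int i = 0; i < NO; i++) sum += w_ji[i][j] * delta_i[i];
        delta_j[j] = sum * g_prime(in_j[j]);
    }

    /* Weight updates: W <- W + alpha * activation * delta. */
    for (int i = 0; i < NO; i++)
        for (int j = 0; j < NH; j++) w_ji[i][j] += alpha * a_j[j] * delta_i[i];
    for (int j = 0; j < NH; j++)
        for (int k = 0; k < NI; k++) w_kj[j][k] += alpha * a_k[k] * delta_j[j];
}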

Learning XOR Function


Assuming α = 0.5 and g' = 1:
S1 = u·u1 + v·v1, a1 = step_0.5(S1),
S2 = u·u2 + v·v2, a2 = step_0.5(S2),
S3 = a1·w1 + a2·w2, Out = step_0.5(S3),
Err = T - Out, w_j ← w_j + α a_j Err for j = 1, 2,
u_i ← u_i + α u w_i Err and v_i ← v_i + α v w_i Err, for i = 1, 2.

Learning XOR Function


Assuming α = 0.5 and g' = 1:
S1 = u·u1 + v·v1, a1 = step_0.5(S1),
S2 = u·u2 + v·v2, a2 = step_0.5(S2),
S3 = a1·w1 + a2·w2, Out = step_0.5(S3),
Err = T - Out, w_j ← w_j + α a_j Err for j = 1, 2,
u_i ← u_i + α u w_i Err and v_i ← v_i + α v w_i Err, for i = 1, 2.

[Table: training trace with columns u1, u2, v1, v2, w1, w2, S1, a1, S2, a2, S3, Out, Err, one row per example e1..e4]

Handwritten Digits

[Figure: pixel images of digits, each made of 35 pixels]

The neural network has 35 input nodes, one for each pixel of such an image.
The NN is trained on these perfect examples many times.

Handwritten Digits
The neural network has 10 hidden nodes, fully connected to all the input
nodes (350 edges), and fully connected to all 10 output nodes (100 edges).
Each output node represents one digit.
The final result is decided by the maximum output node (winner takes all).
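"Winner takes all" just means returning the index of the largest output; a
minimal C sketch (using an output array like the one in the earlier MLP
code):

/* Return the index of the largest output: the recognized digit. */
int winner_takes_all(const double outputs[], int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (outputs[i] > outputs[best]) best = i;
    return best;
}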

Neural Network Summary


Advantages
Easy to adapt to unknown situations
Robustness: fault tolerance due to network redundancy
Autonomous learning and generalization
Disadvantages
Poor accuracy
Large complexity of the network structure
Over-fitting to training examples (cannot generalize well)
The solution is a black box (no insights into the problem)


Nearest Neighbor Classification


Compute the distances from the input to all the examples;
choose the k examples that are the nearest neighbors of the input;
decide by the majority class of these k neighbors.
Question: can this classification technique be regarded as a neural network?

Probabilistic Neural Network (PNN)


Compute the distances from the input to all the examples;
accumulate the normalized distances for each class of examples;
decide by the class whose accumulated normalized distance is largest (winner
takes all).

Probabilistic Neural Network (PNN)


Compute the distances from the input to all the examples:

    h_i = E_i · F = Σ_{j=1}^{n} e_{ij} a_j   (Eq. 8.13)

where E_i = (e_{i1}, e_{i2}, ..., e_{in}) is an example and
F = (a_1, a_2, ..., a_n) is the input. Some people also use the Euclidean
distance: h_i = sqrt( Σ_{j=1}^{n} (e_{ij} - a_j)^2 ).

Accumulate the normalized distances for each class of examples: initially
c_j := 0; then c_j += e^{(h_i - 1)/σ^2} / N_j for each h_i whose example
belongs to class j, where c_j is the accumulated normalized distance for
class j, N_j is the number of examples of class j, and σ is the smoothing
factor. This normalized distance is called a normalized Radial-Basis Function
(RBF) in probability theory (a kind of probability density). Other normalized
distances are also used in practice.

Decide by the class whose accumulated normalized distance is largest (winner
takes all).
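Read as code, the PNN decision could be sketched as follows (my own rendering
of Eq. 8.13 and the accumulation step; the sizes and variable names are
assumptions).

#include <math.h>

#define N_FEATURES 4     /* assumed sizes for illustration */
#define N_EXAMPLES 8
#define N_CLASSES  3

int pnn_classify(const double F[N_FEATURES],
                 const double E[N_EXAMPLES][N_FEATURES],
                 const int cls[N_EXAMPLES],      /* class of each example */
                 const int Nj[N_CLASSES],        /* examples per class */
                 double sigma) {
    double c[N_CLASSES] = {0.0};

    for (int i = 0; i < N_EXAMPLES; i++) {
        /* h_i = E_i . F (Eq. 8.13): dot product of example and input. */
        double h = 0.0;
        for (int j = 0; j < N_FEATURES; j++) h += E[i][j] * F[j];
        /* Accumulate the normalized (RBF-style) distance for the example's class. */
        c[cls[i]] += exp((h - 1.0) / (sigma * sigma)) / Nj[cls[i]];
    }

    /* Winner takes all over the accumulated class scores. */
    int best = 0;
    for (int k = 1; k < N_CLASSES; k++)
        if (c[k] > c[best]) best = k;
    return best;
}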

Summary
Learning needed for unknown environments, lazy designers
Learning method depends on type of performance element,
available feedback, type of component to be improved, and its
representation
For supervised learning, the aim is to find a simple hypothesis
approximately consistent with training examples
Learning performance = prediction accuracy measured on test
set
Many applications: speech, driving, handwriting, credit cards,
etc.

