
Training Neural Networks
Robert Turetsky
Columbia University
rjt72@columbia.edu
Systems, Man and Cybernetics Society
IEEE North Jersey Chapter
December 12, 2000
Objective
Introduce fundamental concepts in
Artificial Neural Networks
Discuss methods of training ANNs
Explore some uses of ANNs
Assess the accuracy of artificial neurons
as models for biological neurons
Discuss current views, ideas and
research
Organization
Why Neural Networks?
Single TLUs
Training Neural Nets: Back propagation
Working with Neural Networks
Modeling the neuron
The multi-agent architecture
Directions and destinations
Why Neural Networks?
The Von Neumann architecture
Memory for
programs and
data
CPU for math
and logic
Control unit to
steer program
flow
Von Neumann vs. ANNs
Von Neumann:
Follows rules
Solution can/must be formally specified
Cannot generalize
Not error tolerant
Neural Net:
Learns from data
Rules on data are not visible
Able to generalize
Copes well with noise
Circuits that LEARN
Three types of learning:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Hebbian networks: reward good paths,
punish bad paths
Train neural net by adjusting weights
PAC (Probably Approximately Correct)
theory: Kearns & Vazirani 1994, Haussler 1990
Supervised Learning Concepts
Training set: input/output pairs
Supervised learning because we know the
correct action for every input in the training set
We want our neural net to act correctly on as
many training vectors as possible
Choose training set to be a typical set of inputs
The Neural net will (hopefully) generalize to all
inputs based on training set
Validation Set: Check to see how well our
training can generalize
Neural Net Applications
Miros Corp.: Face recognition
Handwriting Recognition
BrainMaker: Medical Diagnosis
Bushnell: Neural net for combinational
automatic test pattern generation
ALVINN: Knight Rider in real life!
Getting rich: LBS Capital Management
predicts the S&P 500
History of Neural Networks
1943: McCulloch and Pitts - Modeling the
Neuron for Parallel Distributed Processing
1958: Rosenblatt - Perceptron
1969: Minsky and Papert publish limits on the
ability of a perceptron to generalize
1970s and 1980s: ANN renaissance
1986: Rumelhart, Hinton + Williams present
backpropagation
1989: Tsividis: Neural Network on a chip
Threshold Logic Units
The building blocks of
Neural Networks
The TLU at a glance
TLU: Threshold Logic Unit
Loosely based on the firing of biological
neurons
Many inputs, one binary output
Threshold: Biasing function
Squashing function compresses infinite
input into range of 0 - 1
The TLU in Action
Training TLUs: Notation
$\theta$ = threshold of TLU
X = input vector
W = weight vector
$s = X \cdot W$
i.e.: if $s > \theta$, output = 1
if $s < \theta$, output = 0
d = desired output of TLU
f = actual output of TLU with current X and W
Augmented Vectors
Motivation: train the threshold at the
same time as the input weights
$X \cdot W > \theta$ is the same as $X \cdot W - \theta > 0$
Set threshold of TLU to 0
Augment W: $W = [w_1, w_2, \ldots, w_n, -\theta]$
Augment X: $X = [x_1, x_2, \ldots, x_n, 1]$
New TLU equation: $X \cdot W > 0$
(for augmented X and W)
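The augmentation trick can be sketched in a few lines of Python (a minimal example of my own, not from the talk; the weights and inputs are illustrative):

```python
# Folding the threshold into the weight vector leaves the
# TLU's decision unchanged for every input.

def tlu(x, w, theta):
    """Original TLU: fire iff x . w > theta."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s > theta else 0

def tlu_augmented(x, w):
    """Augmented TLU: threshold is 0; -theta rides along as a weight."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s > 0 else 0

w, theta = [0.5, -0.3, 0.8], 0.6
w_aug = w + [-theta]          # W = [w1, ..., wn, -theta]

for x in [[1, 0, 1], [0, 1, 0], [1, 1, 1]]:
    x_aug = x + [1]           # X = [x1, ..., xn, 1]
    assert tlu(x, w, theta) == tlu_augmented(x_aug, w_aug)
```

Any learning rule that adjusts the augmented weights now adjusts the threshold for free.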
Gradient Descent Methods
Error Function: How far off are we?
Example Error function:
s depends on weight values
Gradient Descent: Minimize error by
moving weights along the decreasing
slope of error
The Idea: iterate through the training set
and adjust the weights to minimize the
gradient of the error
$\varepsilon = \sum_i (d_i - f_i)^2$
Gradient Descent: The Math
We have $\varepsilon = (d - f)^2$
Gradient of $\varepsilon$: $\partial \varepsilon / \partial W = \left[ \partial \varepsilon / \partial w_1, \ldots, \partial \varepsilon / \partial w_i, \ldots, \partial \varepsilon / \partial w_n \right]$
Using the chain rule: $\partial \varepsilon / \partial W = (\partial \varepsilon / \partial s)(\partial s / \partial W)$
Since $s = X \cdot W$, we have $\partial s / \partial W = X$
Also: $\partial \varepsilon / \partial s = -2(d - f)\,\partial f / \partial s$
Which finally gives: $\partial \varepsilon / \partial W = -2(d - f)\,(\partial f / \partial s)\,X$
Gradient Descent: Back to reality
So we have $\partial \varepsilon / \partial W = -2(d - f)\,(\partial f / \partial s)\,X$
The problem: the threshold output f is not
differentiable, so $\partial f / \partial s$ is undefined
Three solutions:
Ignore it: the error-correction procedure
Fudge it: Widrow-Hoff
Approximate it: the generalized delta procedure
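The three options can be sketched as weight-update rules on augmented vectors; this is my own illustration, and the data and constants below are made up:

```python
import math

# Three ways around the non-differentiable threshold, all of the
# form W <- W + c * error_term * X on augmented vectors.

def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def error_correction(w, x, d, c):
    """Ignore df/ds: use the thresholded output itself."""
    f = 1 if dot(x, w) > 0 else 0
    return [wi + c * (d - f) * xi for wi, xi in zip(w, x)]

def widrow_hoff(w, x, d, c):
    """Fudge it: pretend f = s, so df/ds = 1 (the LMS/delta rule)."""
    s = dot(x, w)
    return [wi + c * (d - s) * xi for wi, xi in zip(w, x)]

def generalized_delta(w, x, d, c):
    """Approximate the threshold with a sigmoid: df/ds = f(1 - f)."""
    f = 1.0 / (1.0 + math.exp(-dot(x, w)))
    return [wi + c * (d - f) * f * (1 - f) * xi for wi, xi in zip(w, x)]

# One Widrow-Hoff step on (X=[1,1], d=1) from W=[0,0] with c=0.5:
w = widrow_hoff([0.0, 0.0], [1.0, 1.0], 1.0, 0.5)
print(w)  # -> [0.5, 0.5]; squared error drops from 1.0 to 0.0
```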
Training a TLU: Example
Train a neural network to match the
following linearly separable training set:
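The slide's training set itself was shown as a figure and is lost here; as a hypothetical stand-in, this sketch runs the error-correction procedure on the (linearly separable) AND function, with augmented vectors so the threshold is learned as the last weight:

```python
# Error-correction training of a single TLU. The AND data set,
# learning rate, and epoch cap are illustrative choices.

def train_tlu(samples, c=0.1, epochs=100):
    n = len(samples[0][0])
    w = [0.0] * (n + 1)                       # augmented weights, incl. -theta
    for _ in range(epochs):
        errors = 0
        for x, d in samples:
            xa = list(x) + [1]                # augment input with constant 1
            f = 1 if sum(xi * wi for xi, wi in zip(xa, w)) > 0 else 0
            if f != d:                        # error-correction update
                w = [wi + c * (d - f) * xi for wi, xi in zip(w, xa)]
                errors += 1
        if errors == 0:                       # converged: all patterns correct
            return w
    return w

and_set = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_tlu(and_set)
```

Because the data is linearly separable, the perceptron convergence theorem guarantees this loop terminates with all four patterns classified correctly.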
Behind the scenes: Planes
and Hyperplanes
What can a TLU learn?
Linearly Separable Functions
A single TLU can implement any
Linearly separable function
A AND B is linearly separable
A XOR B is not
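One way to see this concretely is a brute-force search over TLU parameters (a sketch of mine, not from the talk; the grid bounds and step are arbitrary):

```python
# Scan a coarse grid of weights and thresholds for a TLU matching a
# 2-input truth table. AND admits a solution (e.g. w=(1,1), theta=1.5);
# XOR admits none at any resolution, since it is not linearly separable.

def tlu_can_implement(table, step=0.5, lo=-2.0, hi=2.0):
    vals = [lo + k * step for k in range(int((hi - lo) / step) + 1)]
    for w1 in vals:
        for w2 in vals:
            for theta in vals:
                ok = all((1 if x1 * w1 + x2 * w2 > theta else 0) == d
                         for (x1, x2), d in table)
                if ok:
                    return True
    return False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```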
NEURAL NETWORKS
An Architecture for Learning
Neural Network Fundamentals
Chain multiple TLUs together
Three layers:
Input Layer
Hidden Layers
Output Layer
Two classifications:
Feed-Forward
Recurrent
Neural Network Terminology
Training ANNs: Backpropagation
Main Idea: distribute the error function
across the hidden layers, corresponding
to their effect on the output
Works on feed-forward networks
Use sigmoid units during training; once
trained, they can be replaced with threshold functions.
Back-Propagation: Bird's-eye view
Repeat:
Choose training pair and copy it to input layer
Cycle that pattern through the net
Calculate error derivative between output
activation and target output
Back propagate the summed product of the
weights and errors in the output layer to
calculate the error on the hidden units
Update weights according to the error on that
unit
Until error is low or the net settles
Back-Prop: Sharing the Blame
We want to assign blame for the output error to each weight:
$W_i^j$ = weights of the i-th sigmoid in the j-th layer
$X^{j-1}$ = inputs to our TLU (outputs from the previous layer)
$c_i^j$ = learning-rate constant of the i-th sigmoid in the j-th layer
$\delta_i^j$ = sensitivity of the network output to changes in the input of our TLU
Important equations:
$W_i^j \leftarrow W_i^j + c_i^j\,\delta_i^j\,X^{j-1}$
$\delta_i^j = -\partial \varepsilon / \partial s_i^j = 2(d - f)\,\partial f / \partial s_i^j$
Back-Prop: Calculating $\delta_i^j$
For the output layer (layer k): $\delta_i^j = \delta^k = (d - f)\,\partial f / \partial s^k$
(the constant factor of 2 is absorbed into the learning rate c)
$\delta^k = (d - f)\,f(1 - f)$ for a sigmoid
Therefore $W^k \leftarrow W^k + c^k (d - f)\,f(1 - f)\,X^{k-1}$
For the hidden layers:
See Nilsson 1998 for the calculation
Recursive formula with base case $\delta^k = (d - f)f(1 - f)$:
$\delta_i^j = f_i^j (1 - f_i^j) \sum_{l=1}^{m^{j+1}} \delta_l^{j+1} w_{il}^{j+1}$
Back-Prop: Example
Train a 2-layer neural net with the
following input:
$x_1^0 = 1,\ x_2^0 = 0,\ x_3^0 = 1,\ d = 0$
$x_1^0 = 0,\ x_2^0 = 0,\ x_3^0 = 1,\ d = 1$
$x_1^0 = 0,\ x_2^0 = 1,\ x_3^0 = 1,\ d = 0$
$x_1^0 = 1,\ x_2^0 = 1,\ x_3^0 = 1,\ d = 1$
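A minimal sketch of backprop on these four training pairs, assuming a 3-input, 2-hidden-unit, 1-output sigmoid network (the layer sizes, learning rate, and epoch count are my choices, not the slides'):

```python
import math, random

random.seed(0)

# The slide's four training pairs (note d is not linearly separable in x)
data = [([1, 0, 1], 0), ([0, 0, 1], 1), ([0, 1, 1], 0), ([1, 1, 1], 1)]

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

n_in, n_hid = 3, 2
# augmented weights: the last entry of each row acts as -theta (a bias)
W1 = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
W2 = [random.uniform(-1, 1) for _ in range(n_hid + 1)]

def forward(x):
    xa = x + [1]
    h = [sigmoid(sum(w * v for w, v in zip(row, xa))) for row in W1]
    f = sigmoid(sum(w * v for w, v in zip(W2, h + [1])))
    return xa, h, f

def loss():
    return sum((d - forward(x)[2]) ** 2 for x, d in data)

c = 0.5
initial = loss()
for _ in range(5000):
    for x, d in data:
        xa, h, f = forward(x)
        delta_out = (d - f) * f * (1 - f)            # output-layer delta
        # hidden deltas use the *old* output weights
        delta_h = [h[i] * (1 - h[i]) * delta_out * W2[i] for i in range(n_hid)]
        ha = h + [1]
        for i in range(n_hid + 1):                   # update output weights
            W2[i] += c * delta_out * ha[i]
        for i in range(n_hid):                       # update hidden weights
            for j in range(n_in + 1):
                W1[i][j] += c * delta_h[i] * xa[j]
```

The error derivative is propagated from the output unit back through W2 to the hidden units, exactly as in the recursive formula above.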
Back-Prop: Problems
Learning rate is non-optimal
One solution: learn the learning rate
Network paralysis: weights grow so large
that $f_i^j(1 - f_i^j) \to 0$, and the net never learns
Local extrema: gradient descent is a
greedy method
These problems are acceptable in many
cases, even if workarounds can't be found
Back-Prop: Momentum
We want to choose a learning rate that
is as large as possible
Speed up convergence
Avoid oscillations
Add a momentum term dependent on the
past weight change:
$\Delta w_{ij}(t) = c\,\delta_i^j\,x_i + \alpha\,\Delta w_{ij}(t-1)$
Another Method: ALOPEX
Developed by Tzanakou and Harth (1973)
for receptive field mapping in the visual
pathway of frogs
The main ideas:
Use cross-correlation to determine a
direction of movement in gradient field
Add a random element to avoid local
extrema
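A rough sketch of the ALOPEX idea in one dimension (my simplification, not the original 1973 formulation): correlate the last weight change with the last change in the cost, step against that correlation, and keep a small random term to help escape local extrema. The cost function and constants are illustrative.

```python
import random

random.seed(1)

def alopex_minimize(cost, w0, step=0.05, noise=0.01, iters=500):
    w = w0
    prev_cost = cost(w)
    dw = step                      # arbitrary first move
    for _ in range(iters):
        w_new = w + dw
        c_new = cost(w_new)
        corr = dw * (c_new - prev_cost)
        # move against the correlation: if the last change and the
        # cost moved together, reverse direction; otherwise keep going
        dw = (-step if corr > 0 else step) + random.uniform(-noise, noise)
        w, prev_cost = w_new, c_new
    return w

w = alopex_minimize(lambda v: (v - 2.0) ** 2, w0=5.0)
```

On this smooth bowl the correlation term alone behaves like a fixed-step descent; the noise term is what lets the full algorithm wander out of shallow local extrema in harder landscapes.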
WORKING WITH
NEURAL NETS
AI the easy way!
ANN Project Lifecycle
Task identification and design
Feasibility
Data Coding
Network Design
Data Collection
Data Checking
Training and Testing
Error Analysis
Network Analysis
System Implementation
ANN Design Tradeoffs
Generalization vs. accuracy:
Definition of the problem: must be well-posed
Data coding: dimensionality reduction vs. many dimensions
Number of network units: low vs. high
Data collection: less data needed, sparse data, even distribution, noisy data tolerated vs. more data needed, dense data, uneven distribution, no noise tolerated
Test criteria: generalizes well to unseen data vs. meets the required level of accuracy
Problem in training: overfitting
A good design will find a balance between these two extremes!
ANN Design Balance: Depth
Too few hidden units will cause errors in accuracy
Too many hidden units will cause errors in generalization!
[Figure: percent error vs. number of hidden units - the training-set error keeps falling while the validation-set error turns back up past the optimum]
Modeling the neuron
Wetware: Biological Neurons
The Process: Neuron Firing
Each electrical signal received at a synapse
causes neurotransmitter release
The neurotransmitter travels across the synaptic
cleft and is received by the other neuron at a
receptor site
Post-Synaptic-Potential (PSP) either increases
(hyperpolarizes) or decreases (depolarizes) the
polarization of the post-synaptic membrane (the
receptors)
In hyperpolarization, the spike train is inhibited.
In depolarization, the spike train is excited.
The Process: Part 2
Each PSP travels along the dendrite of the
new neuron, and spreads itself over the cell
body
When the effect of the PSP reaches the
axon hillock, it is summed with other PSPs.
If the sum is greater than a certain threshold,
the neuron fires a spike along the axon
Once the spike reaches the synapse of an
efferent neuron, the process starts in that
neuron
The neuron to the TLU
Cell Body (Soma) = accumulator plus its
threshold function
Dendrites = inputs to the TLU
Axon = output of the TLU
Information Encoding:
Neurons use frequency
TLUs use value
Modeling the Neuron: Capabilities
Humans and Neural Nets are both:
Good at pattern recognition
Bad at mathematical calculation
Good at compressing lots of information
into a yes/no decision
Taught via training period
TLUs win because neurons are slow
Wetware wins because we have a
cheap source of billions of neurons
Do ANNs model neuron structures?
No: hundreds of types of specialized neurons,
only one TLU
No: Weights to neural threshold controlled by
many neurotransmitters, not just one
Yes: Most of the complexity in the neuron is
devoted to sustaining life, not information
processing
Maybe: There is no real method for
backpropagation in the brain. Instead, firing
of neurons increases connection strength
High Level: Agent Architecture
Our minds are composed of a series of
non-intelligent agents
The hierarchy, interconnections, and
interactions between the agents creates
our intelligence
There is no one agent in control
We learn by forming new connections
between agents
We improve by dealing with agents at a
higher level, i.e., creating mental scripts
Agent Hierarchy: Playing with Blocks
Builder delegates to sub-agents: Begin, Add, End
Add in turn uses: Find, Get, Put
Get uses: See, Grasp; Put uses: Move, Release
From the outside, Builder knows how to build towers.
From inside, Builder just turns on other agents.
How We Remember: K-Line Theory
New Knowledge: Connections
Sandcastles in the sky: Everything we know is
connected to everything else we know
Knowledge is acquired by making new
connections between things we already know
Thing
Alive: Animal (Bird, Fish), Plant (Oak, Fir), Virus
Not alive: Wood (Boat), Metal (Plane), Stone (House)
Thing
Air: Bird, Plane
Land: Dog, Car
Sea: Fish, Boat
Learning Meaning
Uniframing: Combining several
descriptions into one
Accumulating: Collecting incompatible
descriptions
Reformulating: modifying a description's
character
Transforming: bridging between
structures and functions or actions
The Exception Principle
It rarely pays to tamper with a rule that
nearly always works. It is better to
complement it with an accumulation of
exceptions
Birds can fly
Birds can fly, unless they are penguins
or ostriches
The Exception Principle:
Overfitting
Birds can fly, unless they are penguins
or ostriches, or if they happen to be
dead, or have broken wings, or are
confined to cages, or have their feet
stuck in cement, or have undergone
experiences so dreadful as to render
them psychologically incapable of flight
In real thought, finding exceptions to
everything is usually unnecessary.
Minsky's Principles
Most new knowledge is simply finding a
new way to relate things we already
know
There is nothing wrong with circular
logic or having imperfect rules
Any idea will seem self-evident... once
you've forgotten learning it.
Easy things are hard: we're least aware
of what our minds do best
TO THE FUTURE AND
BEYOND
Why you should be nice
to your computer
Im lonely and Im bored.
Come play with me!
Computers are Dumb
Deep Blue might be able to win at
chess, but it won't know to come in from
the rain.
Computers can only know what they're
told, or what they're told to learn
Computers lack a sense of mortality and
a physical self to preserve
All of this will change when computers
can reach consciousness
I, Silicon Consciousness
Kurzweil: By 2019, a $1000 computer
will be equivalent to the human brain.
By 2029, machines will claim to be
conscious. We will believe them.
By 2049, nanobot swarms will make
virtual reality obsolete in real reality.
By 2099, man and machine will have
completely merged.
You mean to tell me?????
We humans will gradually introduce
machines into our bodies, as implants
Our machines will grow more human as
they learn, and learn to design themselves
The Neo-Luddite scenarios:
AI succeeds in creating conscious beings. All
life is at the mercy of the machines.
Humans retain control: workers are obsolete.
The power to decide the fate of the masses is
now completely in the hands of the elite.
Neural Networks: Conclusions
Neural Networks are a powerful tool for:
Pattern recognition
Generalizing to a problem
Machine learning
Training Neural Networks
Can be done, but exercise great care
Still has room for improvement
Understanding and creating
consciousness?
Still working on it :)
