
Neurocomputing

NN-Models

Neurocomputing
Prof. Dr.-Ing. Andreas König
Institute of Integrated Sensor Systems ISE

Dept. of Electrical Engineering and Information Technology


Technische Universität Kaiserslautern

Fall Semester 2006

© Andreas König Slide 2-1

Neurocomputing
NN-Models

Course Contents:
1. Introduction
2. Rehearsal of Artificial Neural Network models relevant for
implementation and analysis of the required computational steps
3. Analysis of typical ANN-applications with regard to computational
requirements
4. Aspects of simulation of ANNs and systems
5. Efficient VLSI-implementation by simplification of the original
algorithms
6. Derivation of a taxonomy of neural hardware
7. Digital neural network hardware
8. Analog and mixed-signal neural network hardware
9. Principles of optical neural network hardware implementation
10. Evolvable hardware overview
11. Summary and Outlook

© Andreas König Slide 2-2

Neurocomputing
Chapter Contents NN-Models

2. Rehearsal of Artificial Neural Network models relevant for


implementation and analysis of the required computational steps
2.1 Discussion and analysis of the ADALINE (Perceptron) – recall and
learning requirements
2.2 Relevant neural networks and related statistical methods for
classification purposes
2.2.1 Parametric Classifiers (Normal distribution, Mahalanobis,
Euclidean Distance)
2.2.2 Nonparametric classifiers (Parzen Window, Nearest neighbor)
2.2.3 Multi-Layer-Perceptron with Backpropagation learning
2.2.4 Learning-Vector-Quantization
2.2.5 Dynamic-Nearest-Neighbor Classifiers
2.2.6 Restricted-Coulomb-Energy-Network

© Andreas König Slide 2-3

Neurocomputing
Chapter Contents NN-Models

2.2.7 Radial-Basis-Function Networks


2.2.8 Probabilistic Neural Networks
2.2.9 Hopfield Network
2.2.10 Boltzmann Machine
2.2.11 Self-Organizing Feature Map
2.2.12 Cellular Neural Network
2.2.13 Summary

© Andreas König Slide 2-4

Neurocomputing
ADALINE/Perceptron NN-Models

¾ Early neuron models, such as the famous Perceptron or ADaptive Linear


Element (ADALINE) neuron were used for (simple) classification
¾ Example of the Perceptron application for letter recognition:

[Figure: Perceptron for letter recognition — input retina with fixed connections, variable weights, one output neuron]

¾ Special learning algorithm for the Perceptron
¾ Systematic approach for the ADALINE
© Andreas König Slide 2-5

Neurocomputing
ADALINE/Perceptron NN-Models

¾ In the general case, the neuron possesses a nonlinear activation function f(net), which is the identity function in the ADALINE:

o = f\!\left( \sum_{i=1}^{m} x_i w_i \right)   (2.1)

f(net) = \frac{1}{1 + e^{-net}}   (2.2)

[Figure: dot-product neuron — stimuli x_1 … x_m, weights w_1 … w_m, cell body (activation) f(net), output o; plot of the sigmoid f(net)]

¾ In classification (recall mode), a step function is employed
© Andreas König Slide 2-6

Neurocomputing
ADALINE/Perceptron NN-Models

¾ The deviation of the actual neuron output from the desired or prescribed output in such a supervised approach can be assessed for all pattern pairs
¾ The resulting error can be displayed as an error (hyper)surface with the neuron weights as parameters, here w1 and w2:

[Figure: error surface over the weight plane (w1, w2)]

¾ Good performance corresponds to a valley location, which has to be reached !


© Andreas König Slide 2-7

Neurocomputing
ADALINE/Perceptron NN-Models

¾ Adaptation of a neuron weight is commonly achieved by gradient descent


based on an error function:
E = \frac{1}{2} \sum_{k=1}^{N} \left( y^k - f\!\left( \sum_{i=1}^{m} x_i^k w_i \right) \right)^2   (2.3)

¾ Every weight is adapted after (random) initialization according to:

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}   (2.4)

¾ The gradient is computed as:

\frac{\partial E}{\partial w_i} = -\sum_{k=1}^{N} \left( y^k - f\!\left( \sum_{i=1}^{m} x_i^k w_i \right) \right) \cdot f'\!\left( \sum_{i=1}^{m} x_i^k w_i \right) \cdot x_i^k   (2.5)

¾ Inserting (2.5) into (2.4) yields the batch learning rule; reducing the batch size to one yields the on-line learning rule with immediate weight adaptation
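To make the recall and learning requirements concrete, a minimal Python/NumPy sketch of the on-line delta rule for a single linear neuron (ADALINE, f = identity) is given below; data shapes and parameter values are illustrative only, not part of the original slides:

import numpy as np

def adaline_online(X, y, eta=0.01, epochs=50):
    """On-line delta rule for a single linear neuron (f = identity).
    X: (N, m) patterns, y: (N,) targets."""
    w = np.random.uniform(-0.1, 0.1, X.shape[1])   # random initialization
    for _ in range(epochs):
        for x_k, y_k in zip(X, y):
            o = np.dot(x_k, w)                     # forward phase, eq. (2.1)
            w += eta * (y_k - o) * x_k             # on-line update, eqs. (2.4)/(2.5) with f' = 1
    return w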

© Andreas König Slide 2-8

Neurocomputing
ADALINE/Perceptron NN-Models

¾ Additional implementation requirement for implementing the Delta-Rule


(ADALINE):
[Figure: data path of the Delta rule — δ_j is formed from the difference between target y_j^k and output y_j, scaled by η and x_i, accumulated to Δw_ij, and added to w_ij^old to give w_ij^new]

¾ For batch learning, the need for accumulation and intermediate storage of the individual batch pattern contributions must be regarded here

¾ Learning requires availability of the previous forward-phase results ( y_j^k )

¾ An arbitrary nonlinearity, e.g., (2.2), requires the availability of its derivative and an additional scaling (multiplication) step !
© Andreas König Slide 2-9

Neurocomputing
ADALINE/Perceptron NN-Models

¾ Simple dot product neurons can serve as linear classifiers:

o = f\!\left( \sum_{i=1}^{m} x_i w_i + 1 \cdot w_0 \right) = f\!\left( \sum_{i=0}^{m} x_i w_i \right)

[Figure: dot-product neuron with an additional threshold input +1 weighted by w_0; separating line in the (x1, x2) plane between Class 1 (o = -1) and Class 2 (o = +1)]

¾ A single neuron can separate a linearly separable problem with a separating line (plane, hyperplane)
¾ Logical combinations of a layer allow tackling non-linear problems !
© Andreas König Slide 2-10

Neurocomputing
ADALINE/Perceptron NN-Models

¾ Simple distance neurons can also serve as classifiers with spherical regions:

o = f\!\left( \sum_{i=1}^{m} (x_i - w_i)^2 - R \right)

[Figure: distance neuron with radius parameter R; in the (x1, x2) plane a circle of radius R around the weight vector (w1, w2) separates the two classes]

¾ A single neuron can separate a region by a radius-limited (hyper)sphere
¾ The norm of the distance vector is computed and compared with the sphere radius; inside gives +1, outside returns -1
© Andreas König Slide 2-11

Neurocomputing
Relevant ANN for Classification NN-Models

¾ System context of neural network application:


[Figure: recognition system chain — receiver & segmentation → preprocessing → feature extraction → classification (feature space, class labels) → knowledge-based interpretation; example: address reading on mail pieces; sensor front-ends: vision (CCD, CMOS), IR, UV, SAR, US, THz, sonic, olfaction, degustation, ...]

© Andreas König Slide 2-12

Neurocomputing
Relevant ANN for Classification NN-Models

¾ In the following, artificial neural networks will be regarded in their role as classifiers, where they are most commonly applied

¾ They will be compared with established techniques of statistical pattern


recognition

¾ Both performance and implementation requirement/cost will be regarded

¾ The objective is the determination of basic building blocks and operations


common to most regarded algorithms

¾ Later, implementation options for these basic building blocks will be


considered

¾ ANN serve in different places of recognition systems and in very different


applications !

© Andreas König Slide 2-13

Neurocomputing
Relevant ANN for Classification NN-Models

¾ Numerous options to define estimation functions


¾ Taxonomy of important classification methods:

classification methods
   statistical approaches
      parametric: BAC/MLC, MAC, EAC, CBC
      non-parametric: Parzen window (→ PNN), hyper sphere (→ RCE), kNN (→ LVQ), decision tree
   function approximation: polynomial classifier (→ RBF/BP/CasCor)

© Andreas König Slide 2-14

Neurocomputing
Parametric Methods NN-Models

¾ The Bayes-Normal distribution Classifier (BAC) assumes Gaussian


distributions for the classes
¾ The decision functions are determined as

\hat{d}_i(\vec{x}) = \frac{P(\omega_i)}{\sqrt{(2\pi)^N \det K_i}} \; e^{-\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T K_i^{-1}(\vec{x}-\vec{\mu}_i)}   (2.6)

© Andreas König Slide 2-15

Neurocomputing
Parametric Methods NN-Models

¾ The intersections of two class region probability distribution functions


(pdf) define the class borders as lines of equiprobability

sketch of
class boundary

¾ Parabolic class boundaries result in the two-dimensional example


¾ In the case of more than two classes, class regions are defined by
intersections of resulting parabolic functions

© Andreas König Slide 2-16

Neurocomputing
Parametric Methods NN-Models

¾ Assuming equal a priori values for all classes returns the Maximum-Likelihood-Classifier (MLC):

\hat{d}_i(\vec{x}) = -\frac{1}{2}\ln(\det K_i) - \frac{1}{2}(\vec{x}-\vec{\mu}_i)^T K_i^{-1}(\vec{x}-\vec{\mu}_i)   (2.7)

¾ Assuming further equal covariance for all classes returns the Mahalanobis-Classifier (MAC):

\hat{d}_i(\vec{x}) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^T K^{-1}(\vec{x}-\vec{\mu}_i)   (2.8)

[Figure: sketch of the resulting class boundary in the (x1, x2) plane]
© Andreas König Slide 2-17

Neurocomputing
Parametric Methods NN-Models

¾ Assuming further the covariance matrix to be the identity matrix returns the Centroid or Euclidean Distance Classifier (EAC):

\hat{d}_i(\vec{x}) = \left\| \vec{x} - \vec{\mu}_i \right\|^2 = \sum_{j=0}^{N-1} (x_j - \mu_j)^2   (2.9)

[Figure: sketch of the resulting piecewise-linear class boundary in the (x1, x2) plane]

¾ Simplifying the metric returns the City-Block-Classifier (CBC):

\hat{d}_i(\vec{x}) = \sum_{j=0}^{N-1} \left| x_j - \mu_j \right|   (2.10)
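As an illustration of how little computation the parametric decision functions (2.7)–(2.10) require per class, a minimal NumPy sketch is given below; the class means and covariance matrices are assumed to be estimated beforehand (function names are hypothetical):

import numpy as np

def d_mlc(x, mu_i, K_i):
    """Maximum-Likelihood discriminant, eq. (2.7); pick the class with the largest value."""
    diff = x - mu_i
    return -0.5 * np.log(np.linalg.det(K_i)) - 0.5 * diff @ np.linalg.inv(K_i) @ diff

def d_mac(x, mu_i, K):
    """Mahalanobis discriminant with shared covariance K, eq. (2.8)."""
    diff = x - mu_i
    return -0.5 * diff @ np.linalg.inv(K) @ diff

def d_eac(x, mu_i):
    """Squared Euclidean distance, eq. (2.9); classification picks the minimum."""
    return np.sum((x - mu_i) ** 2)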
© Andreas König Slide 2-18

Neurocomputing
Nonparametric Methods NN-Models

¾ Parametric methods condense all the sample set information into few
model parameters and perform a global pdf estimation
¾ In contrast, nonparametric methods perform a local pdf estimation:

p(\vec{x}) = \frac{k(\vec{x})}{N \cdot \nu}   (2.11)

¾ Assumption of either a fixed volume ν or a fixed number of patterns k

¾ The Parzen-window classifier is based on the first alternative:

p_{Parzen}(\vec{x}) = \frac{1}{N \cdot h_N^M} \sum_{i=1}^{N} \kappa\!\left( \frac{\vec{x}-\vec{x}_i}{h_N} \right)   (2.12)

¾ In the simplest case, the kernels used in (2.12) could be Gaussian functions:

[Figure: class-wise summation of the kernel contributions in a window of width h_N; the class with the maximum summed contribution wins]
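A minimal Python sketch of the Parzen estimate (2.12) with Gaussian kernels; the kernel's normalization constant is omitted here, since it is common to all classes and cancels in the Max-of-L decision (an assumption of this sketch, not stated on the slide):

import numpy as np

def parzen_pdf(x, samples, h):
    """Parzen window density estimate, eq. (2.12), with unnormalized Gaussian kernel."""
    N, M = samples.shape
    d2 = np.sum((samples - x) ** 2, axis=1)      # squared distances to all stored samples
    k = np.exp(-d2 / (2.0 * h ** 2))             # Gaussian kernel contributions
    return k.sum() / (N * h ** M)

def parzen_classify(x, class_samples, h):
    """Assign x to the class with the maximum estimated density (Max-of-L)."""
    return int(np.argmax([parzen_pdf(x, s, h) for s in class_samples]))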
© Andreas König Slide 2-19

Neurocomputing
Nonparametric Methods NN-Models

¾ For the given choice of Gaussian kernels, the implementation could look like:

[Figure: per kernel j — subtract w_ij from x_i, square, accumulate, divide by 2σ² (replaceable by multiplication with a precomputed factor for fixed σ), and exponentiate to obtain o_j; class-wise summation of the kernel outputs followed by a Max-of-L decision delivering the class index and pdf value(s)]
© Andreas König Slide 2-20

Neurocomputing
Nonparametric Methods NN-Models

¾ The kernel width must be adapted to data density


¾ k-nearest-neighbor classifiers (kNN), under control of the parameter k, estimate the density and determine the class affiliation of a new pattern by evaluating the k nearest neighbors:

p_{kNN}(\vec{x}) = \frac{k-1}{N \cdot \nu(\vec{x})}   (2.13)

[Figure: query point in the (x1, x2) plane with the volume spanned by its k=5 nearest neighbors; vote counts Class 1: 0/5 = 0.0, Class 2: 3/5 = 0.6, Class 3: 2/5 = 0.4 → maximum for Class 2]
¾ kNN training means storage of all patterns !
¾ Two basic variations: voting and volumetric kNN
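A voting kNN classifier in the spirit of the example above can be sketched as follows (all training patterns are stored as-is, reflecting the memory demand noted above; NumPy arrays assumed):

import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """Voting k-nearest-neighbor classification with Euclidean metric."""
    d2 = np.sum((X_train - x) ** 2, axis=1)      # distances to all stored patterns
    nn = np.argsort(d2)[:k]                      # indices of the k nearest neighbors
    labels, votes = np.unique(y_train[nn], return_counts=True)
    return labels[np.argmax(votes)]              # majority vote (Max-of-L)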
© Andreas König Slide 2-21

Neurocomputing
Nonparametric Methods NN-Models

¾ Sketch of the class boundary for 1-NN classification (k=1):

[Figure: 1-NN class boundary in the (x1, x2) plane for three classes]

¾ The class-specific Voronoi tessellation defines the class borders in this case

© Andreas König Slide 2-22

Neurocomputing
Nonparametric Methods NN-Models

¾ Implementation requirements of kNN (special case k=1):

[Figure: per reference vector — subtract w_ij from x_i, square, accumulate, then a minimum search over all distances yields o_j]

¾ Parallel implementation of the individual neurons advocates parallelisation of the minimum (maximum) search:

[Figure: array of K parallel distance-computing processing elements feeding a parallel minimum search that delivers the class index, neighbor indices, and pdf value(s)]
© Andreas König Slide 2-23

Neurocomputing
Nonparametric Methods NN-Models

¾ Edited-Nearest-Neighbor-classification (ENN, Devijver and Kittler, 1980)


¾ For k=1, resubstitution is guaranteed; however, generalization can be affected by this over-specialization in the following situations:

[Figure: outlier patterns of one class lying inside the regions of other classes (k=5 neighborhood shown)]

¾ Automatic determination of the outliers in the data set:

G_{ENN} = \frac{kNN_{\omega = \omega_j}}{kNN_{\omega \neq \omega_j}}   (2.14)

¾ The k neighbors are determined and investigated for class affiliation
¾ Elimination if the ratio of same to different class affiliations is below a threshold:

G_{ENN} < \Theta   (2.15)

¾ Edited sample set is reduced, resubstitution no longer assured,


generalization generally improved by outlier elimination (rf. qo)
© Andreas König Slide 2-24

Neurocomputing
Nonparametric Methods NN-Models

¾ Condensed-Nearest-Neighbor-classification (CoNN, Hart, 1968)


¾ The complete storage in kNN requires large memory & long computation
¾ Reduction by limiting storage to vectors defining class boundary

[Figure: sample set in the (x1, x2) plane with the reference vectors stored by CoNN]

Pseudo-code of the CoNN algorithm:
1. Initially empty classifier
2. Get first (next) pattern x_i
3. 1-NN-classify the sample set pattern
   If correct_classification goto 4
   Else insert pattern x_i; goto 2
4. If all_patterns_corr_class break;
   Else goto 2

¾ Algorithm reduces effort but depends on sample set presentation order and
leaves redundancy in the CoNN; recall by 1-NN mandatory !
© Andreas König Slide 2-25

Neurocomputing
Nonparametric Methods NN-Models

¾ Condensed-Nearest-Neighbor-classification (CoNN, Hart, 1968)


¾ Step-wise demonstration:

[Figure: the first pattern is inserted as the first reference vector; pseudo-code of the CoNN algorithm as on the previous slide]

¾ All class 3 patterns in the following will be classified correctly

© Andreas König Slide 2-26

Neurocomputing
Nonparametric Methods NN-Models

¾ Condensed-Nearest-Neighbor-classification (CoNN, Hart, 1968)


¾ Sketch of potential final solution:

[Figure: sketch of a potential final reference vector set containing a potentially redundant reference vector; pseudo-code of the CoNN algorithm as on the previous slide]

¾ Removal of existing redundancy by following step

© Andreas König Slide 2-27

Neurocomputing
Nonparametric Methods NN-Models

¾ Reduced-Nearest-Neighbor-classification (RNN, Gates, 1972)


¾ Addition of a removal step to clean reference vector set from redundant
instances, i.e., vectors not required for perfect resubstitution

[Figure: the potentially redundant reference vector left by CoNN is removed]

Pseudo-code of the RNN algorithm:
1. Run the CoNN algorithm
2. Tentatively remove the first (next) reference vector r_i
3. 1-NN-classify all sample set patterns
   If correct_classification permanently remove r_i; goto 4
   Else restore r_i; goto 4
4. If last_ref_vec break;
   Else goto 2
¾ CoNN redundancy is eliminated, but a strong dependence on the presentation order remains
¾ Alternatively, CoNN and RNN steps can be interleaved in a modified, potentially faster algorithm; however, limit cycles can occur (rf. qs)
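A compact Python sketch of Hart's condensing step is given below (the RNN pruning pass would follow the same pattern, tentatively removing one stored vector at a time); this is an illustration under the stated pseudo-code, not the exact course implementation:

import numpy as np

def one_nn(x, refs, ref_labels):
    d2 = np.sum((refs - x) ** 2, axis=1)
    return ref_labels[int(np.argmin(d2))]

def conn_condense(X, y, max_passes=100):
    """Hart's CoNN: keep only patterns needed for correct 1-NN resubstitution."""
    refs, labels = [X[0]], [y[0]]                      # start with the first pattern
    for _ in range(max_passes):
        inserted = False
        for x_i, y_i in zip(X, y):
            if one_nn(x_i, np.array(refs), np.array(labels)) != y_i:
                refs.append(x_i); labels.append(y_i)   # insert misclassified pattern
                inserted = True
        if not inserted:                               # all patterns correctly classified
            break
    return np.array(refs), np.array(labels)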
© Andreas König Slide 2-28

Neurocomputing
Nonparametric Methods NN-Models

¾ QuickCog example for Iris data:

[Figure: QuickCog 2D projection of the Iris data (mirrored due to the projection) with the numbered sample patterns and the selected reference vectors]

¾ With the given settings, 10 reference vectors are chosen (qs=0.906666)

© Andreas König Slide 2-29

Neurocomputing
Nonparametric Methods NN-Models

¾ A traditional method in OCR used for nonparametric classification, based on the function approximation approach, is the polynomial classifier:

\hat{d}_i(\vec{x}) = a_{0,i} + a_{1,i} x_1 + a_{2,i} x_2 + \ldots + a_{N,i} x_N + a_{N+1,i} x_1^2 + a_{N+2,i} x_1 x_2 + \ldots   (2.16)

¾ Expressing (2.16) by new coordinates results in

\vec{v} = (v_1, v_2, \ldots, v_p) = (1, x_1, x_2, \ldots, x_N, x_1^2, x_1 x_2, \ldots)   (2.17)

\hat{\vec{d}} = A^T \cdot \vec{v}(\vec{x})   (2.18)

¾ LMS-optimization to determine the polynomial coefficients of A
¾ Similarity to a dot product neuron with nonlinear synapses !
¾ Excessive growth of the number of variables in (2.17) for increasing order G of the polynomial:

p = \binom{N+G}{G} = \frac{(N+G)!}{N! \cdot G!}   (2.19)

¾ Data compression/term selection, typically G = 1…3 (Schürmann, Kressel)
© Andreas König Slide 2-30

Neurocomputing
Neural Networks NN-Models
¾ A multilayer perceptron with the backpropagation algorithm serves as a classifier for non-linearly separable problems:

[Figure: MLP structure — input layer x_k, hidden layer h_j with weights w_jk, output layer o_i with weights w_ij, and a Max-of-L decision for the class affiliation]
¾ Proven to be a universal function approximator with one nonlinear HL
¾ A learning rule is required for this network, in particular for the hidden layer(s)
¾ Choice of the hidden layer size and the learning parameters can be difficult
¾ Resubstitution is not guaranteed; generalization can be surprising
© Andreas König Slide 2-31

Neurocomputing
Neural Networks NN-Models

¾ Introducing the following abbreviations using the notation given with the network structure:

o_i = f\!\left( \sum_j h_j w_{ij} \right); \qquad h_j = f\!\left( \sum_k x_k w_{jk} \right)   (2.20)

¾ With (2.20) the error can be expressed as:

E = \frac{1}{2} \sum_{\mu=1}^{N} \sum_{i=1}^{L} \left( y_i^\mu - f\!\left( \sum_j h_j^\mu w_{ij} \right) \right)^2   (2.21)

¾ Every weight is adapted after (random) initialization according to:

\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}   (2.22)

¾ The gradient for the output layer weights is computed as:

\frac{\partial E}{\partial w_{ij}} = -\sum_{\mu=1}^{N} \left( y_i^\mu - f\!\left( \sum_j h_j^\mu w_{ij} \right) \right) \cdot f'\!\left( \sum_j h_j^\mu w_{ij} \right) \cdot h_j^\mu   (2.23)
© Andreas König Slide 2-32

Neurocomputing
Neural Networks NN-Models

¾ This can be expressed employing the abbreviations of (2.20) as

\frac{\partial E}{\partial w_{ij}} = -\sum_{\mu=1}^{N} \left( y_i^\mu - o_i^\mu \right) \cdot o_i'^\mu \cdot h_j^\mu   (2.24)

¾ Inserting in (2.22) gives the output layer batch adaptation rule

\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = \eta \sum_{\mu=1}^{N} \left( y_i^\mu - o_i^\mu \right) \cdot o_i'^\mu \cdot h_j^\mu   (2.25)

¾ For the hidden layer adaptation rule, the error function must be expanded:

E = \frac{1}{2} \sum_{\mu=1}^{N} \sum_{i=1}^{L} \left( y_i^\mu - f\!\left( \sum_j w_{ij} f\!\left( \sum_k x_k^\mu w_{jk} \right) \right) \right)^2   (2.26)

(chain of intermediate quantities: net_j → h_j → net_i → o_i → e_i)

© Andreas König Slide 2-33

Neurocomputing
Neural Networks NN-Models

¾ Every hidden weight is adapted after (random) initialization according to:

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}}   (2.27)

¾ The gradient for the hidden layer weights is computed by application of the chain rule:

\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial e_i} \cdot \frac{\partial e_i}{\partial o_i} \cdot \frac{\partial o_i}{\partial net_i} \cdot \frac{\partial net_i}{\partial h_j} \cdot \frac{\partial h_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{jk}}   (2.28)

\frac{\partial E}{\partial w_{jk}} = -\sum_{\mu=1}^{N} \sum_{i=1}^{L} \left( y_i^\mu - f\!\left( \sum_j h_j^\mu w_{ij} \right) \right) \cdot f'\!\left( \sum_j h_j^\mu w_{ij} \right) \cdot w_{ij} \cdot f'\!\left( \sum_k x_k^\mu w_{jk} \right) \cdot x_k^\mu   (2.29)

¾ This can again be expressed employing the abbreviations of (2.20) as

\frac{\partial E}{\partial w_{jk}} = -\sum_{\mu=1}^{N} \sum_{i=1}^{L} \left( y_i^\mu - o_i^\mu \right) \cdot o_i'^\mu \cdot w_{ij} \cdot h_j'^\mu \cdot x_k^\mu   (2.30)

© Andreas König Slide 2-34

Neurocomputing
Neural Networks NN-Models

¾ Introduction of error or δ-terms with

\delta_i^\mu = \left( y_i^\mu - o_i^\mu \right) \cdot o_i'^\mu   (2.31)

\delta_j^\mu = h_j'^\mu \cdot \sum_{i=1}^{L} w_{ij}\, \delta_i^\mu   (2.32)

¾ ... allows a compact representation of the adaptation rules:

\Delta w_{ij} = \eta \sum_{\mu=1}^{N} \delta_i^\mu \cdot h_j^\mu   (2.33)

\Delta w_{jk} = \eta \sum_{\mu=1}^{N} \delta_j^\mu \cdot x_k^\mu   (2.34)

¾ This learning rule is denoted as error-backpropagation learning rule


¾ Numerous variants of this vanilla approach are in existence to improve
learning behavior, e.g., introduction of a momentum term or adaptive η
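The compact rules (2.31)–(2.34) translate directly into an on-line training step for a one-hidden-layer MLP; the following NumPy sketch uses sigmoid activations and omits the bias/threshold weights for brevity (layer shapes are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, W_jk, W_ij, eta=0.1):
    """One on-line backpropagation step for a 1-hidden-layer MLP.
    W_jk: hidden weights (n_hidden, n_in), W_ij: output weights (n_out, n_hidden)."""
    h = sigmoid(W_jk @ x)                         # hidden activations, eq. (2.20)
    o = sigmoid(W_ij @ h)                         # outputs, eq. (2.20)
    delta_i = (y - o) * o * (1.0 - o)             # output deltas, eq. (2.31)
    delta_j = h * (1.0 - h) * (W_ij.T @ delta_i)  # hidden deltas, eq. (2.32)
    W_ij += eta * np.outer(delta_i, h)            # output layer update, eq. (2.33)
    W_jk += eta * np.outer(delta_j, x)            # hidden layer update, eq. (2.34)
    return W_jk, W_ij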
© Andreas König Slide 2-35

Neurocomputing
Neural Networks NN-Models

¾ Learning rule requirements for output layer:


[Figure: data path of the output layer learning rule — δ_i is formed from (y_i^μ − o_i^μ) and the derivative o_i'^μ of the nonlinearity (NL), scaled by η and h_j^μ, accumulated to Δw_ij, and added to w_ij^old to give w_ij^new; plots of the sigmoid and of its derivative, which is close to zero in the saturation regions]

For batch learning, the need for accumulation and intermediate storage of the individual batch pattern contributions must be regarded here

© Andreas König Slide 2-36

Neurocomputing
Neural Networks NN-Models

¾ Data flow in forward and backward propagation:

[Figure: the forward pass produces the outputs o_i; the error terms δ_i = (y_i^μ − o_i^μ) · o_i'^μ are propagated backwards through the weights w_ij (∑_i w_ij δ_i → δ_j) and further through w_jk (∑_j w_jk δ_j → δ_k)]

¾ Implies transposed access to the weight memory
¾ Learning of the lower layers recursively requires data from the previous layer (∑_i w_ij δ_i)

¾ Each neuron has an additional weight connected to constant +/-1 (threshold)


© Andreas König Slide 2-37

Neurocomputing
Neural Networks NN-Models

¾ Commonly this „vanilla“ backpropagation learning algorithm is implemented in NN-HW; it is nice and regular !
¾ Improved, faster variants tend to lose that advantage
¾ Momentum term extension:

\Delta w_{ij}(t) = -\eta \frac{\partial E}{\partial w_{ij}} + \alpha \cdot \Delta w_{ij}(t-1)   (2.35)

¾ Potentially helps to avoid sluggish behavior or oscillations
¾ Globally adaptive learning rate rule:

\Delta \eta = \begin{cases} +\kappa & \text{for } \Delta E < 0 \\ -\varphi \cdot \eta & \text{for } \Delta E > 0 \\ 0 & \text{else} \end{cases}   (2.36)

© Andreas König Slide 2-38

Neurocomputing
Neural Networks NN-Models

¾ Locally adaptive learning rate rule:

\Delta \eta_{ij} = \begin{cases} +\kappa & \text{for } \bar{\delta}(t-1)\,\delta(t) > 0 \\ -\varphi \cdot \eta_{ij} & \text{for } \bar{\delta}(t-1)\,\delta(t) < 0 \\ 0 & \text{else} \end{cases}   (2.37)

with \delta(t) = \frac{\partial E}{\partial w_{ij}} and \bar{\delta}(t) = (1-\Theta)\,\delta(t) + \Theta\,\bar{\delta}(t-1)

¾ Denoted as the Delta-Bar-Delta rule
¾ More information on the error surface is provided by the 2nd order derivatives (Hessian matrix)
¾ Requires storage in the order of the square of the number of weights
¾ Restricted effort by using only the diagonal elements of the Hessian matrix:

\Delta w_{ij}(t) = -\eta \, \frac{\partial E / \partial w_{ij}}{\partial^2 E / \partial w_{ij}^2}   (2.38)

(in practice extended to avoid division by zero)
© Andreas König Slide 2-39

Neurocomputing
Neural Networks NN-Models

¾ Model-based acceleration technique, denoted as Quickprop, by Fahlman:


[Figure: parabolic model of the error cross-section over w_ij — from the two known gradient values at w_ij(t−1) and w_ij(t) the minimum location w_ij(t+1) is estimated]

¾ Fitting a parabola y = ax² + bx + c through the two known gradients y_1' = 2ax_1 + b and y_2' = 2ax_2 + b gives y_2' − y_1' = 2a(x_2 − x_1); setting 0 = 2ax_3 + b yields the minimum estimate x_3 = x_2 + \frac{y_2'}{y_1' - y_2'}\,(x_2 - x_1), leading to the weight update

w_{ij}(t+1) = w_{ij}(t) + \frac{\frac{\partial E}{\partial w_{ij}}(t)}{\frac{\partial E}{\partial w_{ij}}(t-1) - \frac{\partial E}{\partial w_{ij}}(t)} \left( w_{ij}(t) - w_{ij}(t-1) \right)   (2.39)
¾ Weight increment limited by maximum step size given by µ !
¾ Epoch or batch learning rule !
© Andreas König Slide 2-40

Neurocomputing
Neural Networks - LVQ NN-Models

¾ A learning vector quantization network (LVQ) employs a winner-take-all mechanism that leaves only the strongest response active:

[Figure: LVQ structure — input layer x_k, WTA layer with weights w_ij, output (classification) layer assigning Class 1 … Class L]

¾ Adjustment of a fixed number of reference vectors, commonly initialized randomly or by SOM training
¾ Recall via 1-NN classification
© Andreas König Slide 2-41

Neurocomputing
Neural Networks - LVQ NN-Models

¾ Basic LVQ-1 learning method [Kohonen 1989]
¾ Iterative presentation of training data and WTA computation finding w_c(t):

\vec{w}_c(t+1) = \vec{w}_c(t) + \alpha(t)\,[\vec{x}(t) - \vec{w}_c(t)] \quad \text{if } \omega_c = \omega_x
\vec{w}_c(t+1) = \vec{w}_c(t) - \alpha(t)\,[\vec{x}(t) - \vec{w}_c(t)] \quad \text{if } \omega_c \neq \omega_x
\vec{w}_i(t+1) = \vec{w}_i(t) \quad \forall\, i \neq c
(2.40)

[Figure: the winning reference vector w_c(t) is moved along x(t) − w_c(t) to w_c(t+1) in the (w1/x1, w2/x2) plane; plot of the temporal decay of the learning rate α(t)]
¾ Dead reference vectors wi can occur, sufficient no. per class not assured
¾ Different initialization, e.g., RNN, could be suitable
¾ Basic methods extended by improved versions LVQ2/2.1 & LVQ3
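An LVQ-1 training sweep following (2.40) might look like the minimal sketch below; the reference vectors and their class labels are assumed to be initialized beforehand (e.g., randomly or by RNN), and the learning rate decay is handled by the caller:

import numpy as np

def lvq1_epoch(X, y, W, w_labels, alpha):
    """One LVQ-1 epoch, eq. (2.40): move the winner towards (same class)
    or away from (different class) the presented pattern."""
    for x_t, y_t in zip(X, y):
        c = int(np.argmin(np.sum((W - x_t) ** 2, axis=1)))   # WTA: closest reference vector
        sign = 1.0 if w_labels[c] == y_t else -1.0
        W[c] += sign * alpha * (x_t - W[c])
    return W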
© Andreas König Slide 2-42

Neurocomputing
Neural Networks - LVQ NN-Models

¾ LVQ-2 learning method [Kohonen 1989]
¾ Now two weight vectors are determined for a new pattern:

\vec{w}_i(t+1) = \vec{w}_i(t) + \alpha(t)\,[\vec{x}(t) - \vec{w}_i(t)] \quad \text{if } \omega_i = \omega_x
\vec{w}_l(t+1) = \vec{w}_l(t) - \alpha(t)\,[\vec{x}(t) - \vec{w}_l(t)] \quad \text{if } \omega_l \neq \omega_x
\vec{w}_k(t+1) = \vec{w}_k(t) \quad \forall\, k \neq i, l
(2.41)

¾ Window computation:

\min\!\left( \frac{d_{xi}}{d_{xl}}, \frac{d_{xl}}{d_{xi}} \right) > \hat{\eta} = \frac{1-w}{1+w}   (2.42)

[Figure: the reference vectors w_i(t) and w_l(t) on either side of the class border are moved towards/away from x(t), shifting the old class border to the new one]

¾ Monotonic decrease of d_{li}
¾ Remedy: accept vectors on both sides of the window (LVQ 2.1)
© Andreas König Slide 2-43

Neurocomputing
Neural Networks - LVQ NN-Models

¾ Computational requirements in forward phase as for 1-NN methods


¾ Requirements in learning (LVQ-1):

[Figure: learning data path — the difference x_i − w_ci is scaled by the decaying learning rate α(t) and added to (right class) or subtracted from (wrong class) w_ij^old to give w_ij^new]

¾ Main computational effort in forward phase (all neurons compute)


¾ Adaptation requires only very few neurons to update their weights
¾ This leads to inefficient use of potentially parallel hardware
¾ Window computation and decaying learn rate computation impose
additional demands on computing resources

© Andreas König Slide 2-44

Neurocomputing
Neural Networks - RCE NN-Models

¾ A special class of ANN consists of a kernel layer and an output layer:

[Figure: network structure — input layer x_k, kernel layer computing ||w_j − x_k|| with radii R_j, output (classification) layer summing the activated kernels per class, and resolving of the class affiliation, e.g., by Max-of-L]
¾ The Restricted-Coulomb-Energy network (RCE) employs step functions
in the kernel layer and or-gate or summation of activated kernel neurons
¾ The network is generated from scratch by (patented) dynamic training
© Andreas König Slide 2-45

Neurocomputing
Neural Networks - RCE NN-Models

¾ RCE-training is part of the Nestor-Learning-System (NLS)


¾ Dynamic placement and scaling of hyperspheres:

[Figure: hyperspheres with radii between R_min and R_max placed in the (x1, x2) plane]

Pseudo-code of the RCE algorithm:
1. Initially empty classifier
2. Get first (next) pattern x_i
3. RCE-classify pattern
   If correct_classification goto 4
   Else if unknown insert x_i; goto 2
   Else if ambiguous reduce radii of r_i with ω_i ≠ ω_j; goto 2
   Else (misclassification) reduce radii of r_i with ω_i ≠ ω_j; insert x_i; goto 2
4. If all_patterns_corr_class break;
   Else goto 2
¾ Result strongly depends on presentation order (Pro-RCE extension)
¾ Insertion only does not remove redundant neurons
¾ Classification resolving by voting or kernel based pdf estimation (PRCE)
¾ Additional attributes unknown and uncertain/ambiguous
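A strongly simplified, single-pass Python illustration of the dynamic RCE placement and radius shrinking (a sketch of the pseudo-code above, not the full, patented NLS training) could look like this:

import numpy as np

def rce_train_pass(X, y, R_max, R_min):
    """One presentation pass of simplified RCE training: insert uncovered or
    misclassified patterns as new hyperspheres with radius R_max and shrink
    wrong-class spheres that cover the pattern (not below R_min)."""
    centers, radii, labels = [], [], []
    for x_i, y_i in zip(X, y):
        d = np.array([np.linalg.norm(x_i - c) for c in centers])
        active = [k for k in range(len(centers)) if d[k] <= radii[k]]
        if not any(labels[k] == y_i for k in active):        # unknown or misclassified
            centers.append(x_i.copy()); radii.append(R_max); labels.append(y_i)
        for k in active:                                     # shrink conflicting spheres
            if labels[k] != y_i:
                radii[k] = max(R_min, min(radii[k], 0.999 * d[k]))
    return centers, radii, labels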
© Andreas König Slide 2-46

Neurocomputing
Neural Networks - RCE NN-Models

¾ RCE-training is part of the Nestor-Learning-System (NLS):

[Figure: RCE hypersphere placement for mechatronic data and for the Iris data (classes 1 and 2)]

© Andreas König Slide 2-47

Neurocomputing
Neural Networks - RCE NN-Models

¾ Computational requirements of RCE forward phase:

[Figure: per kernel neuron — subtract w_ij from x_i, square, accumulate, subtract R_j, and threshold (> 0) to signal an activated hypersphere (o_j > R_j vs. o_j ≤ R_j) to the global class assignment]
¾ Evaluation of activated hyperspheres and result determination (class
assignment) according to global rules
¾ Not a really regular structure, not easily amenable to parallel implementation
¾ Computational requirements of RCE learning phase:
• Pattern storage requires fast transfer to memory
• Initial setting of radii to Rmax
• Radius reduction requires distance computation between two weight
vectors and comparison to Rmin
• Rather irregular process, not amenable to parallelization
• Overlap of forward and training phase hard due to resource conflicts
© Andreas König Slide 2-48

Neurocomputing
Neural Networks - RCE NN-Models

¾ Rough sketch of a potential RCE architecture:

[Figure: rough sketch of a potential RCE architecture — a processing element (PE) operates time-multiplexed on reference vector memories holding the first, second, third, ... stored patterns and the associated radii R_1 … R_N and R_{N+1} … R_{2N}]

¾ One more memory: Storage of reference vectors class affiliations !


© Andreas König Slide 2-49

Neurocomputing
Neural Networks - RBF NN-Models

¾ The Radial-Basis-Function network (RBF) employs (commonly Gaussian) kernel functions in the kernel layer and dot product output layer neurons:

[Figure: network structure — input layer x_k, kernel layer computing ||w_j − x_k|| with widths σ_j, output (classification) layer, and resolving of the class affiliation, e.g., by Max-of-L; heuristic width choice σ_j = d_{ker\_max} / \sqrt{2M}]

¾ RBF networks are also universal function approximators !
¾ The number of kernels is fixed by choice, with random or SOM-based initialization
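The RBF forward phase (Gaussian kernel layer followed by linear dot-product output neurons) can be sketched compactly; the centers, widths, and output weights are assumed to be given, and the function name is hypothetical:

import numpy as np

def rbf_forward(x, centers, sigmas, W_out):
    """RBF forward phase: Gaussian kernel activations, then linear output layer."""
    d2 = np.sum((centers - x) ** 2, axis=1)      # squared distances to all kernel centers
    phi = np.exp(-d2 / (2.0 * sigmas ** 2))      # Gaussian kernel outputs
    return W_out @ phi                           # dot-product output neurons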
© Andreas König Slide 2-50

Neurocomputing
Neural Networks - RBF NN-Models
¾ RBF-training on a fixed hidden layer might not be efficient
¾ Dynamic training algorithms for function approximation and classification
¾ The first one is Platt's Resource-Allocation-Network (RAN)

[Figure: sketch of the kernel widths σ_j placed in the (x1, x2) plane]

Pseudo-code of the principal RAN algorithm:
1. Initially empty kernel layer
2. Get first (next) pattern x_i
3. Compute output value
   If Error > ε and d > δ
      insert x_i as new kernel; σ_j = min(κ·δ, κ·d)
   Else adapt output layer weights ...
4. If sum_err < ε2 break;
   Else δ = δ·e^(-1/τ); goto 2
¾ Fast (evolving) training, insertion only does not remove redundant neurons
¾ Classification by determination of maximum pdf; background applicable !
¾ Training for all RBF-parameters can be achieved by gradient descent !
¾ Smooth and well-generalizing behavior of RBF-networks
© Andreas König Slide 2-51

Neurocomputing
Neural Networks - RBF NN-Models

¾ Computational requirements of RBF forward phase:

• Output layer neurons correspond to MLP output neurons


• Hidden layer neurons correspond to distance neuron computation,
where the computed distance is subject to a Gaussian non-linearity
with potentially global or local σ

¾ Computational requirements of RBF learning phase:

• Requirements vary with chosen learning approach


• Output layer neuron adaptation corresponds to MLP
• Complete gradient descent weight and σ adaptation can take place
with similar data propagation and approach as in MLP
• Usually, a simple scheme with mild requirements chosen in context
of HW-implementation

¾ An extension of RBF is to include oriented kernels in hyperbasis function


networks, i.e., introducing the effort of a BAC in each hidden neuron
© Andreas König Slide 2-52

Neurocomputing
Neural Networks-PNN NN-Models

¾ Probabilistic-Neural-Networks of Specht resemble Parzen-Window:


[Figure: PNN structure — input layer x_k, kernel layer computing ||w_j − x_k|| with fixed global width σ, class-wise pdf summation nodes in the output (classification) layer, and resolving of the class affiliation, e.g., by Max-of-L]
¾ Each training data vector is stored as Gaussian kernel with fixed global σ
¾ According to the class labels, the kernels are wired to pdf summation nodes
¾ Explicit cost or a priori weighting can be employed before pdf max-of-L
© Andreas König Slide 2-53

Neurocomputing
Neural Networks-PNN NN-Models

¾ pdf computation:

p_i(\vec{x}) = \frac{1}{(2\pi)^{n/2}\,\sigma^n} \cdot \frac{1}{m_i} \sum_{j=1}^{m_i} \exp\!\left( -\frac{(\vec{x}-\vec{\mu}_j)^T(\vec{x}-\vec{\mu}_j)}{2\sigma^2} \right)   (2.43)

¾ Explicit weighting and cost function inclusion as well as rejection generation are easily feasible
¾ Computational simplification assuming normalized vectors of unit length:

g_j = \exp\!\left( -\frac{(\vec{x}-\vec{\mu}_j)^T(\vec{x}-\vec{\mu}_j)}{2\sigma^2} \right) = \exp\!\left( \frac{-\vec{x}^T\vec{x} + 2\,\vec{x}^T\vec{\mu}_j - \vec{\mu}_j^T\vec{\mu}_j}{2\sigma^2} \right) = \exp\!\left( \frac{-1 + 2\,\vec{x}^T\vec{\mu}_j - 1}{2\sigma^2} \right) = \exp\!\left( \frac{net_j - 1}{\sigma^2} \right)

¾ Reduction of Gaussian to exponential activation function
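The PNN recall of (2.43) amounts to summing one Gaussian kernel per stored training vector within each class; a minimal sketch follows (the normalization constant is shared by all classes and therefore omitted for the Max-of-L decision — an assumption of this sketch):

import numpy as np

def pnn_classify(x, class_patterns, sigma):
    """PNN recall: per-class Gaussian pdf estimate in the spirit of eq. (2.43), then Max-of-L."""
    pdfs = []
    for P in class_patterns:                     # P: (m_i, n) stored patterns of one class
        d2 = np.sum((P - x) ** 2, axis=1)
        pdfs.append(np.mean(np.exp(-d2 / (2.0 * sigma ** 2))))
    return int(np.argmax(pdfs)), pdfs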


© Andreas König Slide 2-54

Neurocomputing
Neural Networks-PNN NN-Models

¾ Computational requirements of the PNN forward phase:

[Figure: per kernel — subtract w_ij from x_i, square, accumulate, and apply the nonlinearity (NL); class-specific pdf accumulation followed by a Max-of-L decision]

¾ Metric computation can be simplified and parallelized
¾ pdf computation is rather complex
¾ Learning is a process of storing training patterns
¾ Additionally, σ, rejection thresholds, or cost factor specification is required
¾ Relation with the RCE variant denoted as P-RCE
¾ In the uncertain case, pdf computation for the activated hyperspheres takes place as in the general PNN to resolve the class
© Andreas König Slide 2-55

Neurocomputing
Neural Networks-AM NN-Models

¾ Additional application fields of neural networks

© Andreas König Slide 2-56

Neurocomputing
Neural Networks-AM NN-Models

¾ Associative memories serve to establish mappings between


incomplete/distorted versions of the pattern itself (auto association) or
between entirely different patterns (hetero association)
¾ The conditioning of animals (dogs) to show a flow of saliva at the presentation of a ringing bell (Pavlovian reflex) is a prominent example
¾ Steinbuch's Lernmatrix was the first hardware implementation
© Andreas König Slide 2-57

Neurocomputing
Neural Networks-AM NN-Models

¾ Technical systems employ associative memory on a large scale, e.g., in


memory management systems (cache, page-based memory)
¾ The search is for a certain bit pattern, tolerance in the search can be added
by masking bits of the pattern, i.e., excluding these from the search
¾ Here, metrics serve for pattern comparison and similarity measure
¾ First case: Linear Associative Memory

Y =W ⋅ X (2.44)

¾ The pattern association relates to standard linear algebra


¾ If Y=X then the case of auto association is met
¾ Determination of the association matrix W:
  • Pseudo-inverse computation:   W = Y X^T \left( X X^T \right)^{-1}   (2.45)
  • Gradient descent:   \Delta W = \eta \sum_{\mu=1}^{N} \left( \vec{y}^\mu - W \vec{x}^\mu \right) \left( \vec{x}^\mu \right)^T   (2.46)
  • Correlation matrix:   W = \frac{1}{N} \sum_{\mu=1}^{N} \vec{y}^\mu \left( \vec{x}^\mu \right)^T   (2.47)

© Andreas König Slide 2-58

Neurocomputing
Neural Networks-AM NN-Models

¾ Inherent problem: limited storage capacity, which leads to crosstalk between patterns during the association process:

\vec{y} = \left\| \vec{x}^\nu \right\|^2 \left( \vec{y}^\nu + \sum_{\mu \neq \nu} \vec{y}^\mu \frac{\vec{x}^\nu \vec{x}^\mu}{\left\| \vec{x}^\nu \right\|^2} \right)   (2.48)

¾ Remedy: pairwise orthogonal patterns

\vec{y} = \left\| \vec{x}^\nu \right\|^2 \left( \vec{y}^\nu + \sum_{\mu \neq \nu} \vec{y}^\mu \frac{0}{\left\| \vec{x}^\nu \right\|^2} \right) = \left\| \vec{x}^\nu \right\|^2 \vec{y}^\nu   (2.49)

¾ Remedy: pairwise orthonormal patterns

\vec{y} = \left\| \vec{x}^\nu \right\|^2 \vec{y}^\nu = 1 \cdot \vec{y}^\nu   (2.50)

¾ Storage limitation due to the orthogonalization requirements
¾ Further activities: sparse coded or nonlinear associative memories
¾ Representatives: Kanerva's or Palm's associative memory and the Hopfield network

© Andreas König Slide 2-59

Neurocomputing
Neural Networks-Hopfield NN-Models

¾ The Hopfield network (Hopfield 82) is a recurrent neural network with binary or real-valued neurons and weights, applied for pattern restoration and optimization

¾ An energy function is associated with the Hopfield network:

E(t) = -\frac{1}{2} \sum_i \sum_j w_{ij}\, o_i(t)\, o_j(t) - \sum_j in_j\, o_j(t) + \sum_j \Theta_j\, o_j(t)   (2.51)

[Figure: energy landscape — the initial state rolls into an attractor (final state); spurious attractors can also exist]
© Andreas König Slide 2-60

Neurocomputing
Neural Networks-Hopfield NN-Models

¾ Stored patterns correspond to attractors in energy landscape


¾ Excessive number of patterns leads to spurious attractors
¾ Storage capacity has been analyzed to be
p \approx 0.146\, N

¾ Binary Hopfield neuron computation in the forward phase:

net_j(t+1) = \sum_i w_{ij}\, o_i(t) + in_j   (2.52)

o_j(t+1) = f\!\left(net_j(t+1)\right) = \begin{cases} 1 & \text{if } net_j(t+1) > \Theta_j \\ 0 & \text{if } net_j(t+1) < \Theta_j \\ o_j(t) & \text{else} \end{cases}   (2.53)

¾ Asynchronous (one-at-a-time) or synchronous change of the neuron states
¾ Learning can take place, e.g., by correlation learning:

w_{ij} = \frac{1}{N} \sum_{\nu=1}^{P} x_i^\nu \cdot x_j^\nu   (2.54)

¾ Weights can be computed externally, e.g., for optimization tasks
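A minimal Python sketch of correlation learning (2.54) and asynchronous binary recall following (2.52)/(2.53); bipolar states ±1, zero thresholds, and ties broken towards −1 are simplifying assumptions of this sketch:

import numpy as np

def hopfield_train(patterns):
    """Correlation (Hebbian) learning, eq. (2.54); patterns are bipolar (+/-1) row vectors."""
    N = patterns.shape[1]
    W = (patterns.T @ patterns) / N
    np.fill_diagonal(W, 0.0)                      # self-connections commonly set to zero
    return W

def hopfield_recall(W, x, steps=10):
    """Asynchronous recall, eqs. (2.52)/(2.53), with thresholds Theta_j = 0 and in_j = 0."""
    o = x.copy()
    for _ in range(steps):
        for j in np.random.permutation(len(o)):   # one-at-a-time state updates
            o[j] = 1 if W[j] @ o > 0 else -1
    return o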


© Andreas König Slide 2-61

Neurocomputing
Neural Networks-Hopfield NN-Models

¾ Application examples of the Hopfield network for pattern restoration:

[Figure: training patterns and distorted test patterns restored by the network]

© Andreas König Slide 2-62

Neurocomputing
Neural Networks-Hopfield NN-Models

¾ Optimization with Hopfield network applied to the Traveling-Sales-


Person-problem (TSP):
[Figure: TSP solution matrix — rows: cities, columns: order of traversal]

¾ Energy function adapted to the problem with (soft) constraints:

(2.55)  [energy function with weighted penalty terms A–D]

¾ The penalty terms of the soft constraints disappear if
   A: only one "1" per row, i.e., every city is visited only once
   B: only one "1" per column, i.e., only one city is visited at a time
   C: every city is visited once (excludes the trivial solution)
   D: tour length (main constraint)
© Andreas König Slide 2-63

Neurocomputing
Neural Networks-Hopfield NN-Models

¾ Due to the network growth with the number of cities n (n2 neurons, n4
weights) problem size was commonly limited to several hundred cities

¾ This and a class of related optimization problems of significant


economical impact, e.g., air-crew-scheduling-problem etc., kindled the
interest in fast parallel hardware implementation

¾ For this kind of task real-valued neurons were used in the Hopfield
network, employing the nonlinearity of (2.2)

¾ Practical drawback: Hopfield network can only reach a local optimum


during “roll-off” in the energy landscape, thus can get stuck in a shallow
dent representing a potentially sub-optimal or bad solution

¾ Advocates other techniques employing random mechanisms, such as


Simulated Annealing or Boltzmann Machines

© Andreas König Slide 2-64

Neurocomputing
Neural Networks-SA NN-Models

¾ Principle of Simulated Annealing, escaping a local minimum:

[Figure: four snapshots (1-4) of a state escaping a local minimum of the energy landscape E while the system temperature T decays over time t]

¾ The decaying system temperature T defines the system's ability to accept a temporary decrease of the solution quality in order to leave a local optimum
© Andreas König Slide 2-65

Neurocomputing
Neural Networks-BM NN-Models

¾ A Boltzmann Machine (BM) is a neural network exploiting a stochastic transition mechanism inspired by SA (Korst & Aarts, 1990)
¾ Basically, the network has input, hidden, and output neurons connected by bi-directional weighted connections
¾ For optimization, only hidden units are used
¾ Neurons are binary, i.e., "on" or "off"; their states are denoted as the BM configuration
¾ In every configuration, the sum of all connection weights incident with two "on" neurons is accumulated as the consensus function

C(k) = \sum_{\{u,v\} \in U} w_{uv}\, k(u)\, k(v)   (2.56)

¾ Here, k(u) gives the state of neuron u in the current configuration
¾ A sequential BM generates a state transition of one unit at a time
¾ A parallel BM generates state transitions for several up to all units at a time
¾ A state transition is generated with a probability of:

G(u) = \frac{1}{|U|}   (2.57)
© Andreas König Slide 2-66

Neurocomputing
Neural Networks-BM NN-Models

¾ A state change will lead to a consensus difference between configurations k and k_u:

\Delta C_k(u) = C(k_u) - C(k)   (2.58)

¾ A state transition will be accepted with a probability A_k(u,c) controlled by the implied consensus difference and the system temperature:

A_k(u,c) = \frac{1}{1 + \exp\!\left( -\Delta C_k(u) / c \right)}   (2.59)

© Andreas König Slide 2-67

Neurocomputing
Neural Networks-BM NN-Models

¾ Computation for the acceptance of a neuron's proposed state transition:

o_k(t+1) = \begin{cases} \bar{o}_k(t) & \text{if } random < A_k(u,c) \\ o_k(t) & \text{else} \end{cases}   (2.60)

¾ The temperature is reduced gradually during the process (Markov chain length), reaching intermediate equilibria
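The stochastic acceptance of a proposed single-unit state change according to (2.58)–(2.60) can be sketched as follows; a symmetric weight matrix with zero diagonal is assumed, and the function name is hypothetical:

import numpy as np

def bm_try_flip(W, k, u, c):
    # Propose flipping binary unit u; accept with probability
    # A_k(u, c) = 1 / (1 + exp(-dC / c)), eq. (2.59).
    k_new = k.copy()
    k_new[u] = 1 - k_new[u]                       # state transition of unit u
    # consensus difference, eq. (2.58); factor 0.5 since the pair sum counts
    # every unordered pair once (W symmetric, zero diagonal)
    dC = 0.5 * (k_new @ W @ k_new - k @ W @ k)
    if np.random.rand() < 1.0 / (1.0 + np.exp(-dC / c)):
        return k_new                              # transition accepted
    return k                                      # transition rejected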

¾ For optimization, the problem must be reformulated to a binary variable representation
¾ The binary variables are assigned to the BM neurons and problem-specific weights are determined to meet the constraints (order preserving, feasible)
¾ Minimization and maximization problems must be mapped in an appropriate way to the consensus function, which ensures valid solution finding
¾ Example: general Cut problem:

f(X) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} w_{ij} \left( (1-x_i)\,x_j + (1-x_j)\,x_i \right)   (2.61)

© Andreas König Slide 2-68

Neurocomputing
Neural Networks-BM NN-Models

¾ The BM has the clear advantage to find better, perhaps global minima
¾ This property is also attractive with regard to learning general mappings:

¾ Supervised learning takes place in two phases:


¾ Clamped Phase: the environmental units are clamped to the prescribed values of the aspired mapping and the remaining hidden units equilibrate. The probability z'_{u,v} of each weight having pre- and postsynaptic active neurons is calculated
¾ Free Phase: all units equilibrate (special case: input units clamped) and the probability z_{u,v} is calculated
© Andreas König Slide 2-69

Neurocomputing
Neural Networks-BM NN-Models

¾ Formulation of the BM learning algorithm:

[Figure: BM learning algorithm — the weights w_ij are adapted based on the clamped-phase and free-phase co-activation probabilities z'_{u,v} and z_{u,v}]

¾ Learning can be (dreadfully) slow; however, it can reach a good optimum
¾ A convergence proof exists: the global optimum is reached for an infinitesimally slow cooling process
¾ Applied in optimization and pattern recognition, rarely implemented
¾ Demands for random generators and cooling schedule implementation

© Andreas König Slide 2-70

Neurocomputing
Neural Networks-SOM NN-Models

¾ The Self-Organizing feature Map (SOM), introduced by Teuvo Kohonen, is probably the most well-known and most widely applied neural network
¾ The SOM was derived from physiological evidence observed in the somatosensory cortex, e.g. [Kohonen 89]

[Figure: SOM grid with winning neuron N_c and neighborhood of radius r(t); neighborhood function (Gaussian, pyramid, box), learning rate α(t), weight vectors w_j, component planes; the input vector v_i is compared to all weight vectors by WTA]

d_c = \min_{j=1}^{N_{SOM}} \sum_{i=1}^{M} (v_i - w_{ij})^2   (2.62)

© Andreas König Slide 2-71

Neurocomputing
Neural Networks-SOM NN-Models

¾ The SOM features the properties of data quantization, probability density approximation, and a topology-preserving, dimensionality-reducing mapping
¾ Typically, 1D or 2D SOM neuron grids are employed (3D in robotics)
¾ SOM learning in its common technical implementation:

1. Random initialization of the neuron weight vectors w_j
2. Iterative presentation of stimuli vectors v_i and computation of the winner neuron N_c:
   d_c = \min_{j=1}^{N_{SOM}} \sum_{i=1}^{M} (v_i - w_{ij})^2
3. Adaptation of the winning neuron and its neighbors:
   w_{ij}(t+1) = \begin{cases} w_{ij}(t) + \alpha(t)\, N_c(r(t))\, (v_{ik} - w_{ij}(t)) & \text{for } j \in N_c(r(t)) \\ w_{ij}(t) & \text{for } j \notin N_c(r(t)) \end{cases}   (2.63)
4. Reduce α(t) and r(t); terminate learning by a maximum number of steps or an error criterion
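One SOM adaptation step following (2.62)/(2.63) might be sketched as below, here with a Gaussian neighborhood function (one of the options named on the previous slide) on a 2D grid; the grid coordinates and the decay schedules for α and r are assumed to be handled by the caller:

import numpy as np

def som_step(v, W, grid, alpha, r):
    """One SOM update: WTA winner search, eq. (2.62), and neighborhood
    adaptation in the spirit of eq. (2.63) with a Gaussian neighborhood.
    W: (n_neurons, dim) weight vectors, grid: (n_neurons, 2) neuron grid coordinates."""
    c = int(np.argmin(np.sum((W - v) ** 2, axis=1)))          # winner neuron N_c
    g2 = np.sum((grid - grid[c]) ** 2, axis=1)                # squared grid distances to winner
    h = np.exp(-g2 / (2.0 * r ** 2))                          # Gaussian neighborhood factor
    W += alpha * h[:, None] * (v - W)                         # move neurons towards the stimulus
    return W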

© Andreas König Slide 2-72

Neurocomputing
Neural Networks-SOM NN-Models

¾ For the special case of a two-dimensional SOM, two-dimensional weights, and stimuli with a uniform probability distribution, the network unfolding during training can be observed:

[Figure: SOM grid at initialization and after 20, 100, 300, 1000, and 10000 training steps]

© Andreas König Slide 2-73

Neurocomputing
Neural Networks-SOM NN-Models

¾ During the training process, the SOM unfolds in the multivariate pattern
space and creates a topology preserving mapping to the 2D neuron grid
¾ Example of SOM visualization for Cube-data:

© Andreas König Slide 2-74

Neurocomputing
Neural Networks-SOM NN-Models

¾ SOM component planes for Cube-data:

[Figure: SOM component planes 1-3 for the Cube data]

¾ More discussed in sensor signal processing lecture ....


© Andreas König Slide 2-75

Neurocomputing
Neural Networks-SOM NN-Models

¾ Computational requirements of the SOM in the forward phase are the


common distance computation, e.g., Euclidean distance, followed by the
search for the closest stimulus, i.e., the minimum distance
¾ This compares to 1-NN computation and can be subject to parallelization
¾ In particular the minimum search is an attractive candidate as sequential
search can be a bottleneck in a parallel array
¾ Remedy: Comparator tree or efficient bitwise parallel comparison schemes

¾ In learning, however, only a subset (of decreasing size) of neurons take part
¾ Both spatial and temporal adaptations for the learn rate have to be computed
and communicated to adapting neurons:
[Figure: SOM learning data path — the difference x_i − w_ij is scaled by the decaying learning rate α(t) and the neighborhood factor N_c(r(t)) and added to w_ij^old to give w_ij^new]
© Andreas König Slide 2-76

Neurocomputing
Neural Networks-CNN NN-Models

¾ Cellular-Neural-Networks (CNN), Chua and Yang, IEEE TCAS (1988), p. 1257:

[Figure: CNN network topology — a regular grid of neurons with inputs and a restricted local neighborhood of radius r = 1]

DTCNN:

s_j(t) = \sum_i w_{ij}\, o_i(t) + \sum_k w_{kj}\, x_k + \Theta_j   (2.64)

o_j(t+1) = \begin{cases} 1 & \text{for } s_j(t) > 0 \\ -1 & \text{for } s_j(t) \leq 0 \end{cases}   (2.65)
¾ Neural networks for low-level processing: Retina/vision chips, cochlea
chips, cellular-neural-networks with restricted local neighborhood conn.
¾ Implementation of linear and non-linear image processing operations based
on appropriate (heuristically determined) cloning templates
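A DTCNN iteration according to (2.64)/(2.65) with 3×3 cloning templates can be sketched as below; SciPy is assumed to be available, the feedback and control templates are named A and B here, and for the symmetric templates of typical examples the convolution/correlation distinction does not matter:

import numpy as np
from scipy.signal import convolve2d

def dtcnn_step(o, x, A, B, theta):
    """One synchronous DTCNN iteration, eqs. (2.64)/(2.65).
    o: current bipolar outputs, x: constant inputs, A/B: 3x3 cloning templates."""
    s = convolve2d(o, A, mode="same") + convolve2d(x, B, mode="same") + theta
    return np.where(s > 0, 1, -1)                 # hard bipolar output nonlinearity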
© Andreas König Slide 2-77

Neurocomputing
Neural Networks-CNN NN-Models

¾ Image processing capabilities of Discrete-Time-Cellular-Neural-Networks


(DTCNN) for various cloning templates:
¾ Skeletonization, edge extraction, connected-component detection, hole filling, concentric-contour detection, dilation/erosion, noise elimination

© Andreas König Slide 2-78

Neurocomputing
Neural Networks-CNN NN-Models

¾ Application to feature computation for OCR:

¾ Preferred application in analog or mixed-signal implementation


¾ Time-continuous CNN with differential equation for modelling
¾ Grey value output/processing and hexagonal neighborhood as options
¾ No explicit learning rule available !
¾ Forward phase requirements: Twice the dot product computation,
accumulation to state and non-linearity (thresholding) for output
© Andreas König Slide 2-79

Neurocomputing
Conclusions NN-Models

¾ Identified Basic Algorithmic Building Blocks:

• Vector Subtraction, Addition, and Scaling


• Matrix-Vector Multiplication
• Distance Metric (Dot Product, Euclidean Distance, …)
• Non-Linearity (Sigmoid, Gaussian, …, and their derivatives)
• Winner-Takes-All-Mechanism (WTA), corresponds to efficient
Max/Min-Search
• Convolution/Correlation Support
• Random Generators
• Dynamic Network Topology Support

¾ Operations supported by dedicated neural network hardware


¾ Often complemented for conventional signal processing needs

© Andreas König Slide 2-80

Neurocomputing
Conclusions NN-Models

¾ Unifying the regarded algorithms' requirements for a common forward phase:

[Figure: unified forward data path — subtract w_ij from x_i, multiply/square, accumulate, optionally subtract R_j, apply a nonlinearity (NL), and perform a minimum search to obtain o_j]

¾ A data path for a processing element will follow these conceptual guidelines, obeying additional constraints
¾ The learning implementation is considerably more inhomogeneous
¾ Additional non-linearity required (derivative)
¾ Commonly, implementations including on-chip learning tend to be more
specialized and support only single or few algorithms

© Andreas König Slide 2-81

Neurocomputing
Summary NN-Models

¾ The chapter briefly introduced to (revisited) important and commonly


applied artificial neural network algorithms

¾ These are MLPs with backpropagation learning, RBFs, RCE, PNN,


LVQ/SOM, Hopfield/Boltzmann and Cellular neural networks

¾ The focus of the presentation was on the computational requirements of the ANN algorithms and their potential for parallel implementation

¾ Common requirements for the forward phase were identified for


potential multi-model implementation (not so much for learning)

¾ Spiking neural network algorithms not yet included

In the next step, typical applications will be investigated with


regard to the justification of the underlying effort of an actual
dedicated massively parallel system implementation.

© Andreas König Slide 2-82

