

Ch 10: Widrow-Hoff Learning

(LMS Algorithm)

In this chapter we apply the principles of performance learning to a single-layer linear neural network.

Widrow-Hoff learning is an approximate steepest descent algorithm, in which the performance index is mean square error.


• Bernard Widrow began working on neural networks in the late 1950s, at about the same time that Frank Rosenblatt developed the perceptron learning rule.

• In 1960 Widrow and Hoff introduced the ADALINE (ADAptive LInear NEuron) network.

• Its learning rule is called the LMS (Least Mean Square) algorithm.

• ADALINE is similar to the perceptron, except that its transfer function is linear instead of hard-limiting.



Widrow, B., and Hoff, M. E., Jr., 1960, Adaptive switching circuits, in 1960 IRE WESCON Convention Record, Part 4, New York: IRE, pp. 96–104.

Widrow, B., and Lehr, M. A., 1990, 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation, Proc. IEEE, 78:1415–1441.

Widrow, B., and Stearns, S. D., 1985, Adaptive Signal Processing, Englewood Cliffs, NJ: Prentice-Hall.

 


• Both have the same limitation: they can only solve linearly separable problems.

• The LMS algorithm minimizes mean square error, and therefore tries to move the decision boundaries as far from the training patterns as possible.

• The LMS algorithm has found many more practical uses than the perceptron (for example, most long-distance phone lines use ADALINE networks for echo cancellation).

 



ADALINE Network

$\mathbf{a} = \mathrm{purelin}(\mathbf{W}\mathbf{p} + \mathbf{b}) = \mathbf{W}\mathbf{p} + \mathbf{b}$

$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}({}_i\mathbf{w}^T\mathbf{p} + b_i) = {}_i\mathbf{w}^T\mathbf{p} + b_i$

${}_i\mathbf{w}$ is made up of the elements of the ith row of W:

${}_i\mathbf{w} = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$

Two-Input ADALINE

$a = \mathrm{purelin}(n) = \mathrm{purelin}({}_1\mathbf{w}^T\mathbf{p} + b) = {}_1\mathbf{w}^T\mathbf{p} + b$

$a = {}_1\mathbf{w}^T\mathbf{p} + b = w_{1,1}p_1 + w_{1,2}p_2 + b$

The ADALINE, like the perceptron, has a decision boundary, which is determined by the input vectors for which the net input n is zero.
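For concreteness, here is a minimal NumPy sketch of a two-input ADALINE; the weights, bias, and test points are made-up values chosen for illustration, not taken from the slides:

```python
import numpy as np

def purelin(n):
    """Linear transfer function: a = n."""
    return n

# Assumed example parameters (not from the slides): 1w = [1, 1], b = -1.
w = np.array([1.0, 1.0])   # the single row of W
b = -1.0

def adaline(p):
    """Single-neuron ADALINE output a = purelin(1w' p + b)."""
    return purelin(w @ p + b)

# The decision boundary is the line 1w' p + b = 0, i.e. p1 + p2 = 1 for these values.
for p in [np.array([2.0, 2.0]), np.array([0.0, 0.0]), np.array([1.0, 0.0])]:
    a = adaline(p)
    side = "a > 0" if a > 0 else ("a < 0" if a < 0 else "on the boundary (n = 0)")
    print(p, "->", a, ";", side)
```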



Mean Square Error

The LMS algorithm is an example of supervised training.

Training Set:

$\{\mathbf{p}_1, t_1\}, \{\mathbf{p}_2, t_2\}, \dots, \{\mathbf{p}_Q, t_Q\}$

Input: $\mathbf{p}_q$    Target: $t_q$

Notation:

$\mathbf{x} = \begin{bmatrix} {}_1\mathbf{w} \\ b \end{bmatrix}, \qquad \mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}, \qquad a = {}_1\mathbf{w}^T\mathbf{p} + b = \mathbf{x}^T\mathbf{z}$

Mean Square Error:

$F(\mathbf{x}) = E[e^2] = E[(t-a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$

The expectation is taken over all sets of input/target pairs.
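Since this expectation is generally not available in closed form, in practice F(x) is often estimated by averaging the squared error over the training pairs. A minimal sketch of that estimate in the augmented notation above (the data and weight values below are made up):

```python
import numpy as np

# Made-up training set: Q input/target pairs (p_q, t_q), one p_q per row.
P = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
t = np.array([1.0, -1.0, 0.0])

# Augmented notation: x = [1w; b], z = [p; 1], so a = x' z.
x = np.array([0.3, -0.2, 0.1])            # [w11, w12, b], arbitrary values
Z = np.hstack([P, np.ones((len(P), 1))])  # one z_q per row

e = t - Z @ x             # errors t_q - a_q
F_hat = np.mean(e ** 2)   # sample estimate of F(x) = E[e^2]
print("estimated mean square error:", F_hat)
```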


Error Analysis

$F(\mathbf{x}) = E[e^2] = E[(t-a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$

$F(\mathbf{x}) = E[t^2 - 2t\,\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}]$

$F(\mathbf{x}) = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\,\mathbf{x}$

This can be written in the following convenient form:

$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$

where

$c = E[t^2], \qquad \mathbf{h} = E[t\mathbf{z}], \qquad \mathbf{R} = E[\mathbf{z}\mathbf{z}^T]$


The vector h gives the cross-correlation between the input vector and its associated target. R is the input correlation matrix.

The diagonal elements of this matrix are equal to the mean square values of the elements of the input vectors.

The mean square error for the ADALINE network is a quadratic function:

$F(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$

where

$\mathbf{d} = -2\mathbf{h}, \qquad \mathbf{A} = 2\mathbf{R}$

Stationary Point

Hessian Matrix:

A = 2R

The correlation matrix R must be at least positive semidefinite. In fact, it can be shown that all correlation matrices are either positive definite or positive semidefinite. If there are any zero eigenvalues, the performance index will either have a weak minimum or no stationary point (depending on d = -2h); otherwise there will be a unique global minimum x* (see Chapter 8).

Fx c d

=

+

T

x

Stationary point:

+

1

T

---x

2

Ax

==d + Ax

– 2h + 2Rx

=

0

– 2h + 2Rx

10


If R (the correlation matrix) is positive definite:

$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$

If we could calculate the statistical quantities h and R, we could find the minimum point directly from the above equation.

But it is usually not desirable or convenient to calculate h and R, so we need an approximation.
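For small problems where sample estimates of h and R are easy to form, the minimum can nevertheless be computed directly, which is useful later for checking what the LMS algorithm converges to. A sketch with made-up data:

```python
import numpy as np

# Made-up training data in augmented form z = [p; 1].
P = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
t = np.array([1.0, -1.0, 0.0])
Z = np.hstack([P, np.ones((len(P), 1))])

# Sample estimates of the statistical quantities.
R = (Z.T @ Z) / len(Z)          # R = E[z z']
h = (Z.T @ t) / len(Z)          # h = E[t z]

x_star = np.linalg.solve(R, h)  # x* = R^{-1} h (assumes R is positive definite)
print("minimum point x* =", x_star)
```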


Approximate Steepest Descent

Approximate mean square error (one sample):

$\hat{F}(\mathbf{x}) = \big(t(k) - a(k)\big)^2 = e^2(k)$

Expectation of the squared error has been replaced by the squared error at iteration k.

Approximate (stochastic) gradient:

$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k)$

$\left[\nabla e^2(k)\right]_j = \frac{\partial e^2(k)}{\partial w_{1,j}} = 2e(k)\,\frac{\partial e(k)}{\partial w_{1,j}}, \qquad j = 1, 2, \dots, R$

$\left[\nabla e^2(k)\right]_{R+1} = \frac{\partial e^2(k)}{\partial b} = 2e(k)\,\frac{\partial e(k)}{\partial b}$


Approximate Gradient Calculation

ek = tk – ak  ------------- ---------------------------------- =  tk
ek
= tk – ak
-------------
----------------------------------
=
tk
 w
w
 w 1 j
1 j
 1 j
R
e k 
------------- =
tk 
– w
1 i
 w 1 j
w 1 j
i = 1
T –  pk + b w 1  p i k  + b
T
–  pk + b
w
1
p i k  + b

Where p i (k) is the ith elements of the input vector at kth iteration.

ek

-------------

w 1j

=

p j

k

------------- ek b = –1

ˆ 2  Fx =  k = –2ekzk e
ˆ
2
 Fx
=  k = –2ekzk
e

13

Now we can see the beauty of approximating the mean square error by the single error at iteration k as in:

$\hat{F}(\mathbf{x}) = \big(t(k) - a(k)\big)^2 = e^2(k)$

• This approximation to F(x) can now be used in the steepest descent algorithm.

LMS Algorithm

$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\nabla F(\mathbf{x})\Big|_{\mathbf{x} = \mathbf{x}_k}$


If we substitute $\hat{F}(\mathbf{x})$ for $F(\mathbf{x})$:

$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\,\mathbf{z}(k)$

${}_1\mathbf{w}(k+1) = {}_1\mathbf{w}(k) + 2\alpha e(k)\,\mathbf{p}(k)$

$b(k+1) = b(k) + 2\alpha e(k)$

These last two equations make up the LMS algorithm, which is also called the delta rule or the Widrow-Hoff learning algorithm.
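A minimal sketch of the single-neuron LMS rule as a training loop; the learning rate, initial values, and data are assumptions for illustration:

```python
import numpy as np

def lms_train(P, t, alpha=0.05, epochs=50, w=None, b=0.0):
    """Single-neuron LMS / Widrow-Hoff learning.

    P: input vectors p(k), one per row.  t: targets t(k).
    Applies 1w(k+1) = 1w(k) + 2*alpha*e(k)*p(k) and b(k+1) = b(k) + 2*alpha*e(k).
    """
    w = np.zeros(P.shape[1]) if w is None else w.astype(float)
    for _ in range(epochs):
        for p, target in zip(P, t):
            a = w @ p + b                 # ADALINE output (purelin)
            e = target - a                # error at this iteration
            w = w + 2 * alpha * e * p     # weight update
            b = b + 2 * alpha * e         # bias update
    return w, b

# Example with made-up data:
P = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
t = np.array([1.0, -1.0, 0.0])
w, b = lms_train(P, t)
print("w =", w, " b =", b)
```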


Multiple-Neuron Case

${}_i\mathbf{w}(k+1) = {}_i\mathbf{w}(k) + 2\alpha e_i(k)\,\mathbf{p}(k)$

$b_i(k+1) = b_i(k) + 2\alpha e_i(k)$

Matrix Form:

$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha\,\mathbf{e}(k)\,\mathbf{p}^T(k)$

$\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha\,\mathbf{e}(k)$
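A sketch of one matrix-form LMS step for a layer of linear neurons (the dimensions and values below are made up):

```python
import numpy as np

def lms_step(W, b, p, t, alpha):
    """One matrix-form LMS step:
    W(k+1) = W(k) + 2*alpha*e(k)*p(k)',  b(k+1) = b(k) + 2*alpha*e(k)."""
    a = W @ p + b                        # layer output (purelin)
    e = t - a                            # error vector, one element per neuron
    W = W + 2 * alpha * np.outer(e, p)   # rank-one weight update
    b = b + 2 * alpha * e
    return W, b

# Example: 2 neurons, 3 inputs (made-up values).
W = np.zeros((2, 3))
b = np.zeros(2)
p = np.array([1.0, -1.0, 0.5])
t = np.array([1.0, -1.0])
W, b = lms_step(W, b, p, t, alpha=0.1)
print(W, b)
```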



Analysis of Convergence

Note that x_k is a function only of z(k-1), z(k-2), …, z(0). If we assume that successive input vectors are statistically independent, then x_k is independent of z(k).

We will show that, for stationary input processes meeting this condition, the expected value of the weight vector will converge to:

$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$

This is the minimum mean square error (minimum E[e_k^2]) solution, as we saw before.

Recall the LMS Algorithm:

$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\,\mathbf{z}(k)$

Take the expectation of both sides:

$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\,E[e(k)\,\mathbf{z}(k)]$

Substitute $t(k) - \mathbf{x}_k^T\mathbf{z}(k)$ for the error:

$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{E[t(k)\,\mathbf{z}(k)] - E[\mathbf{z}(k)\mathbf{z}^T(k)\,\mathbf{x}_k]\right\}$


Since x_k is independent of z(k):

$E[\mathbf{x}_{k+1}] = E[\mathbf{x}_k] + 2\alpha\left\{\mathbf{h} - \mathbf{R}\,E[\mathbf{x}_k]\right\}$

$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$

For stability, the eigenvalues of this matrix must fall inside the unit circle.

Conditions for Stability

eig I – 2R 

=

1

2

i

1

(where i is an eigenvalue of R)

Since

i 0 ,

1

2

i

1

.

19

Therefore the stability condition simplifies to

1

2



1 

i

i

– 1

for all

i

0

1

max

Note: this is the same condition as for the steepest descent (SD) algorithm. In SD we use the Hessian matrix A; here we use the input correlation matrix R (recall that A = 2R).



Steady State Response

$E[\mathbf{x}_{k+1}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_k] + 2\alpha\mathbf{h}$

If the system is stable, a steady state condition will be reached:

$E[\mathbf{x}_{ss}] = [\mathbf{I} - 2\alpha\mathbf{R}]\,E[\mathbf{x}_{ss}] + 2\alpha\mathbf{h}$

The solution to this equation is

$E[\mathbf{x}_{ss}] = \mathbf{R}^{-1}\mathbf{h} = \mathbf{x}^*$

This is also the strong minimum of the performance index.

Thus the LMS solution, obtained by applying one input at a time, is the same as the minimum mean square error solution $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$.

Example

Banana: $\mathbf{p}_1 = \begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix}, \quad t_1 = -1$

Apple: $\mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}, \quad t_2 = 1$

If inputs are generated randomly with equal probability, the input correlation matrix is:

$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \frac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \frac{1}{2}\mathbf{p}_2\mathbf{p}_2^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -1 \\ 0 & -1 & 1 \end{bmatrix}$

The eigenvalues of R are

$\lambda_1 = 1.0, \qquad \lambda_2 = 0.0, \qquad \lambda_3 = 2.0$

so the learning rate must satisfy

$\alpha < \frac{1}{\lambda_{max}} = \frac{1}{2.0} = 0.5$

We take α = 0.2. (Note: in practice it is difficult to calculate R and hence the bound on α; they are usually chosen by trial and error.)
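This calculation, and the check that α = 0.2 is below the bound of 0.5, is easy to verify numerically (a verification sketch, not part of the slides):

```python
import numpy as np

p1 = np.array([-1.0, 1.0, -1.0])   # banana
p2 = np.array([ 1.0, 1.0, -1.0])   # apple

# Each pattern occurs with probability 1/2, so R = E[p p'] is the average.
R = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)
eigvals = np.linalg.eigvalsh(R)

print("R =\n", R)
print("eigenvalues:", eigvals)                      # 0.0, 1.0, 2.0
print("alpha must be below", 1.0 / eigvals.max())   # 0.5, so alpha = 0.2 is safe
```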



Iteration One (Banana)

$a(0) = \mathbf{W}(0)\mathbf{p}(0) = \mathbf{W}(0)\mathbf{p}_1 = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = 0$

$e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1$

$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha e(0)\,\mathbf{p}^T(0) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix} + 2(0.2)(-1)\begin{bmatrix} -1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}$

W(0) is selected arbitrarily; here W(0) = [0 0 0].

Iteration Two (Apple)

$a(1) = \mathbf{W}(1)\mathbf{p}(1) = \mathbf{W}(1)\mathbf{p}_2 = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = -0.4$

$e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4$

$\mathbf{W}(2) = \mathbf{W}(1) + 2\alpha e(1)\,\mathbf{p}^T(1) = \begin{bmatrix} 0.4 & -0.4 & 0.4 \end{bmatrix} + 2(0.2)(1.4)\begin{bmatrix} 1 & 1 & -1 \end{bmatrix} = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}$


Iteration Three (Banana)

$a(2) = \mathbf{W}(2)\mathbf{p}(2) = \mathbf{W}(2)\mathbf{p}_1 = \begin{bmatrix} 0.96 & 0.16 & -0.16 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \\ -1 \end{bmatrix} = -0.64$

$e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36$

$\mathbf{W}(3) = \mathbf{W}(2) + 2\alpha e(2)\,\mathbf{p}^T(2) = \begin{bmatrix} 1.1040 & 0.0160 & -0.0160 \end{bmatrix}$

If we continue this procedure, the algorithm converges to $\mathbf{W} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$.
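The hand calculations above can be reproduced with a short script that keeps alternating banana and apple presentations (a sketch; the run length of 60 iterations is an arbitrary choice):

```python
import numpy as np

alpha = 0.2
W = np.zeros(3)                    # W(0) selected arbitrarily (all zeros)
patterns = [(np.array([-1.0, 1.0, -1.0]), -1.0),   # banana, t = -1
            (np.array([ 1.0, 1.0, -1.0]),  1.0)]   # apple,  t = +1

for k in range(60):                # alternate banana, apple, banana, ...
    p, t = patterns[k % 2]
    a = W @ p                      # no bias in this example
    e = t - a
    W = W + 2 * alpha * e * p
    if k < 3:
        print(f"W({k+1}) =", W)    # [0.4 -0.4 0.4], [0.96 0.16 -0.16], [1.104 0.016 -0.016]

print("after many iterations, W ~", np.round(W, 3))   # approaches [1 0 0]
```

The first three printed weight vectors match W(1), W(2), and W(3) computed by hand above.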


Some general comments on the learning process:

• Computationally, the learning process goes through all training examples (an epoch) a number of times, until a stopping criterion is reached.

• The convergence process can be monitored with a plot of the mean-squared error function F(W(k)).



The popular stopping criteria are:

• the mean-squared error is sufficiently small: F(W(k)) < ε

• the rate of change of the mean-squared error is sufficiently small.
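A sketch of an epoch-based LMS training loop that uses both stopping criteria; the thresholds eps and delta and the data are assumptions for illustration:

```python
import numpy as np

def train_adaline(P, t, alpha=0.05, eps=1e-4, delta=1e-7, max_epochs=1000):
    """LMS training with two stopping criteria:
    stop when F(W(k)) < eps, or when the change in F between epochs is below delta."""
    Q, R = P.shape
    w, b = np.zeros(R), 0.0
    F_prev = np.inf
    for epoch in range(max_epochs):
        for p, target in zip(P, t):          # one epoch = one pass over the data
            e = target - (w @ p + b)
            w += 2 * alpha * e * p
            b += 2 * alpha * e
        F = np.mean((t - (P @ w + b)) ** 2)  # mean-squared error after this epoch
        if F < eps or abs(F_prev - F) < delta:
            break
        F_prev = F
    return w, b, F, epoch + 1

# Made-up data:
P = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
t = np.array([1.0, -1.0, 0.0])
w, b, F, epochs = train_adaline(P, t)
print("stopped after", epochs, "epochs with F =", F)
```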



Adaptive Filtering

ADALINE is one of the most widely used NNs in practical applications. One of the major application areas has been Adaptive Filtering.

Tapped Delay Line

Adaptive Filter




$a(k) = \mathrm{purelin}(\mathbf{W}\mathbf{p} + b) = \sum_{i=1}^{R} w_{1,i}\,y(k-i+1) + b$

In Digital Signal Processing (DSP) language we recognize this network as a finite impulse response (FIR) filter.
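A sketch of this adaptive FIR filter: a tapped delay line holds the most recent R input samples, and the LMS rule adapts the taps at every time step (the signals and parameters below are made up for illustration):

```python
import numpy as np

def adaptive_fir_lms(y, t, R=3, alpha=0.05):
    """ADALINE as an adaptive FIR filter.
    a(k) = sum_i w_{1,i} * y(k-i+1) + b, with taps adapted by LMS."""
    w, b = np.zeros(R), 0.0
    a_out, e_out = [], []
    delay = np.zeros(R)                                # [y(k), y(k-1), ..., y(k-R+1)]
    for k in range(len(y)):
        delay = np.concatenate(([y[k]], delay[:-1]))   # shift the new sample in
        a = w @ delay + b                              # filter output
        e = t[k] - a                                   # error against the target signal
        w += 2 * alpha * e * delay                     # LMS tap update
        b += 2 * alpha * e
        a_out.append(a)
        e_out.append(e)
    return np.array(a_out), np.array(e_out), w, b

# Made-up signals: the filter learns to map y(k) to a scaled, phase-shifted target.
k = np.arange(200)
y = np.sin(2 * np.pi * k / 20)
t = 0.5 * np.sin(2 * np.pi * k / 20 + 0.3)
a, e, w, b = adaptive_fir_lms(y, t)
print("final taps:", w, " final error:", e[-1])   # error becomes small as the taps adapt
```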


Example: Noise Cancellation




Noise Cancellation Adaptive Filter

A two-input filter can attenuate and phase-shift the noise ν in the desired way.



Correlation Matrix

To analyze this system we need to find the input correlation matrix R and the input/target cross-correlation vector h:

$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T], \qquad \mathbf{h} = E[t\mathbf{z}]$

$\mathbf{z}(k) = \begin{bmatrix} v(k) \\ v(k-1) \end{bmatrix}, \qquad t(k) = s(k) + m(k)$

$\mathbf{R} = \begin{bmatrix} E[v^2(k)] & E[v(k)v(k-1)] \\ E[v(k-1)v(k)] & E[v^2(k-1)] \end{bmatrix}$

$\mathbf{h} = \begin{bmatrix} E[(s(k)+m(k))\,v(k)] \\ E[(s(k)+m(k))\,v(k-1)] \end{bmatrix}$


We must define the noise signal ν, the EEG signal s, and the filtered noise m, to be able to obtain specific values.

We assume the EEG signal is a white (uncorrelated from one time step to the next) random signal uniformly distributed between the values -0.2 and +0.2, and that the noise source (a 60 Hz sine wave sampled at 180 Hz) is given by

$v(k) = 1.2\sin\left(\frac{2\pi k}{3}\right)$

The filtered noise that contaminates the EEG is the noise attenuated by a factor of 1.0 and shifted in phase by -3π/4:


$m(k) = 1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)$

The elements of R are:

$E[v^2(k)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\frac{2\pi k}{3}\right)^2 = (1.2)^2(0.5) = 0.72$

$E[v^2(k-1)] = E[v^2(k)] = 0.72$

$E[v(k)v(k-1)] = \frac{1}{3}\sum_{k=1}^{3}\left(1.2\sin\frac{2\pi k}{3}\right)\left(1.2\sin\frac{2\pi(k-1)}{3}\right) = (1.2)^2(0.5)\cos\frac{2\pi}{3} = -0.36$

$\mathbf{R} = \begin{bmatrix} 0.72 & -0.36 \\ -0.36 & 0.72 \end{bmatrix}$



Stationary Point

Esk+ mkvk = Eskvk +Emkvk

 k  v  k  + E  m  k  v 

0

Th

independent and zero mean.

1 st t

i

erm s zero

b

ecause s

e

(k)

an

d

(k)

v

are

Emk

 vk

=

1

---

3

3

k = 1

1.2 sin

2k

3

--------- – ------

3

4

 

  

1.2 sin --------- 2k  

3

= –0.51

Now we find the 2 nd element of h:

Esk+mkvk – 1 = Eskvk – 1 +Emkvk – 1

3

k – 1  + E  m  k  v  k – 1

0

35

2k 2k –1  3     Emk  vk –1 = ---
2k
2k –1
3  
Emk
 vk –1
= ---
1 
  1.2
sin --------- – ------
1.2 sin-----------------------
= 0.70
3
3
4
 
3
k = 1
h
=
Esk  + mkvk
Esk  + mkvk – 1
–0.51
h
=
0.70
–1
–1
0.72
– 0.36
– 0.51
–0.30
x 
==
R
h
=
– 0.36
0.72
0.70
0.82
Now, what kind of error will we have at the minimum solution?



Performance Index

$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$

We have just found x*, R, and h. To find c we have:

$c = E[t^2(k)] = E[(s(k) + m(k))^2] = E[s^2(k)] + 2E[s(k)m(k)] + E[m^2(k)]$

The middle term is zero because s(k) and m(k) are independent and zero mean.

$E[s^2(k)] = \int_{-0.2}^{0.2}\frac{1}{0.4}\,s^2\,ds = \frac{s^3}{3(0.4)}\bigg|_{-0.2}^{0.2} = 0.0133$

$E[m^2(k)] = \frac{1}{3}\sum_{k=1}^{3}\left[1.2\sin\left(\frac{2\pi k}{3} - \frac{3\pi}{4}\right)\right]^2 = 0.72$

$c = 0.0133 + 0.72 = 0.7333$

$F(\mathbf{x}^*) = c - 2\mathbf{x}^{*T}\mathbf{h} + \mathbf{x}^{*T}\mathbf{R}\mathbf{x}^* = 0.7333 - 2(0.72) + 0.72 = 0.0133$


The minimum mean square error is the same as the mean square value of the EEG signal. This is what we expected, since the “error” of this adaptive noise canceller is in fact the reconstructed EEG Signal.
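These statistics can be verified numerically by averaging over a long run of the signals defined above (a verification sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 300000                                  # long run so sample averages are accurate
k = np.arange(1, K + 1)

s = rng.uniform(-0.2, 0.2, K)               # EEG signal: white, uniform on [-0.2, 0.2]
v = 1.2 * np.sin(2 * np.pi * k / 3)         # noise source (60 Hz sampled at 180 Hz)
m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)   # filtered noise reaching the EEG
t = s + m                                   # target: the contaminated EEG

v1 = np.roll(v, 1)                          # v(k-1) (wraparound at k=0 is negligible here)
Z = np.column_stack([v, v1])                # z(k) = [v(k); v(k-1)]

R = (Z.T @ Z) / K                           # ~ [[0.72, -0.36], [-0.36, 0.72]]
h = (Z.T @ t) / K                           # ~ [-0.51, 0.70]
c = np.mean(t ** 2)                         # ~ 0.0133 + 0.72 = 0.7333

x_star = np.linalg.solve(R, h)              # ~ [-0.30, 0.82]
F_min = c - 2 * x_star @ h + x_star @ R @ x_star
print("R =\n", R, "\nh =", h, "\nx* =", x_star)
print("F(x*) =", F_min, " vs  E[s^2] =", np.mean(s ** 2))   # both ~ 0.0133
```

All printed quantities should agree with the hand calculations above to within sampling error.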



LMS Response for α=0.1

[Figure: LMS trajectory in the (W1,1, W1,2) weight plane, plotted over the contours of F(x).]

The LMS trajectory looks like a noisy version of steepest descent.


Note that the contours in this figure reflect the fact that the eigenvalues and the eigenvectors of the Hessian matrix A=2R are

$\lambda_1 = 2.16, \quad \mathbf{z}_1 = \begin{bmatrix} 0.7071 \\ -0.7071 \end{bmatrix}, \qquad \lambda_2 = 0.72, \quad \mathbf{z}_2 = \begin{bmatrix} 0.7071 \\ 0.7071 \end{bmatrix}$

If the learning rate is decreased, the LMS trajectory is smoother, but learning proceeds more slowly.

Note that for stability α_max = 2/2.16 = 0.926.




nnd10eeg

Note that the error does not go to zero, because the LMS algorithm is approximate steepest descent; it uses an estimate of the gradient, not the true gradient.


Echo Cancellation




HW

Ch 4: E 2, 4, 6, 7

Ch 5: 5, 7, 9

Ch 6: 4, 5, 8, 10

Ch 7: 1, 5, 6, 7

Ch 8: 2, 4, 5

Ch 9: 2, 5, 6

Ch 10: 3, 6, 7
