CS = Conditioned Stimulus
CR = Conditioned Response
(I. Pavlov)

Learning the association CS → U(C)R
Example: Porsche → Good Feeling
[Overview map: NON-EVALUATIVE FEEDBACK (Correlations) vs. EVALUATIVE FEEDBACK (Rewards). Under non-evaluative feedback: UN-SUPERVISED LEARNING, correlation based (Hebb-Rule; Differential Hebb-Rule, slow and fast; LTP, LTD = anti; STDP and STDP-Models; ISO-Learning; ISO-Model of STDP; Biophysics of Synaptic Plasticity; correlation-based control, non-evaluative; ISO-Control). CLASSICAL CONDITIONING, correlation of signals (Rescorla/Wagner; δ-Rule, supervised learning; Neuronal Reward Systems, Basal Ganglia; Dopamine; Glutamate). Under evaluative feedback: REINFORCEMENT LEARNING, example based (Dynamic Prog., Bellman Eq.; TD(λ), often λ = 0; TD(1); TD(0); Monte Carlo Control; Neur. TD-Models; Actor/Critic (Critic); SARSA; Q-Learning), with technical & Basal Gangl. and biophysical & network branches.]
Notation

US = r, R = Reward (similar to X0 in ISO/ICO)

Note: The notion of a state really only makes sense as soon as there is more than one state.
Types of Rules

1) Rescorla-Wagner Rule: allows explaining several types of conditioning experiments.

2) TD-rule (TD-algorithm): allows measuring the value of states and accumulating rewards. Thereby it generalizes the Rescorla-Wagner rule.

3) The TD-algorithm can be extended to allow measuring the value of actions and thereby to control behavior, either by way of
a) Q- or SARSA-learning, or
b) Actor-Critic architectures.
Rescorla-Wagner Rule

Pavlovian:    Train: u → r                           Result: u → v = max
Extinction:   Pre-Train: u → r    Train: u (no r)    Result: u → v = 0
Partial:      Train: u → r on only some trials       Result: u → v < max

We define: v = w·u, and the weight update

   w → w + ε δ u   with   δ = r − v,

where we use stochastic gradient descent for minimizing the squared prediction error (r − v)².

Do you see the similarity of this rule with the δ-rule discussed earlier!?

Blocking:     Pre-Train: u1 → r   Train: u1 + u2 → r   Result: u1 → v = max, u2 → v = 0

For Blocking: The association formed during pre-training leads to δ = 0. As w2 starts with zero, the expected reward v = w1·u1 + w2·u2 remains at r. This keeps δ = 0, and the new association with u2 cannot be learned.

Inhibitory:   Train: u1 + u2 (no r), interleaved with u1 → r   Result: u1 → v = max, u2 → v < 0
Overshadow:   Train: u1 + u2 → r                               Result: u1 → v < max, u2 → v < max
Secondary:    Pre-Train: u1 → r   Train: u2 → u1              Result: u2 → v = max
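A minimal numerical sketch of this rule in Python, illustrating blocking (the function name, the learning rate ε = 0.1 and the trial counts are illustrative assumptions, not values from the slides):

import numpy as np

def rescorla_wagner(trials, w, eps=0.1):
    # Apply w <- w + eps * delta * u with delta = r - v for a sequence of (u, r) trials.
    w = np.array(w, dtype=float)
    for u, r in trials:
        u = np.asarray(u, dtype=float)
        v = w @ u              # prediction v = w . u
        delta = r - v          # prediction error delta = r - v
        w += eps * delta * u   # only stimuli that are present get updated
    return w

# Pre-training: u1 alone predicts the reward r = 1, so w1 approaches max
w = rescorla_wagner([([1.0, 0.0], 1.0)] * 100, w=[0.0, 0.0])

# Training: u1 + u2 together with the same reward
w = rescorla_wagner([([1.0, 1.0], 1.0)] * 100, w=w)
print(w)   # delta is already ~0, so w2 barely grows: the association with u2 is blocked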
State, Action, Reward, Value, Policy

[Figure: a maze of states (s1, s2, ..., 10 to 16) connected by actions (a1, a2, ..., a14, a15) with rewards r1, r2.]

Policy (example): p(N) = 0.5, p(S) = 0.125, p(W) = 0.25, p(E) = 0.125.

[Figure: value = 0.0 everywhere at the start; the goal carries reward R = 1; after evaluation the value map reads 0.9 next to R, then 0.8, etc., down to 0.0 at position x.]
Policy Evaluation: from the possible start locations, determine the values of the states.
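A short worked example, assuming the figure's values arise from a discount factor of 0.9 and the single terminal reward R = 1: the state one step before the goal then has value 0.9·R = 0.9, the state two steps away 0.9² ≈ 0.8, and so on, while states from which the policy never reaches the reward keep the value 0.0.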
Policies

Greedy Policy: The agent always exploits and selects the most rewarding action. This is suboptimal, as the agent never finds better new paths.

ε-Greedy Policy: With a small probability ε the agent selects a random action instead of the greedy one.

Softmax Policy: an action is selected with probability

   P(a) = exp(Q_a / T) / Σ_{b=1}^{n} exp(Q_b / T),

where Q_a is the value of the currently to be evaluated action a and T is a temperature parameter. For large T all actions have nearly the same probability (exploration); for small T the most valuable action dominates (exploitation).
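A minimal sketch of these action-selection schemes in Python (the function names and the values ε = 0.1 and T = 0.5 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, T=1.0):
    # Pick action a with probability exp(Q_a/T) / sum_b exp(Q_b/T).
    prefs = np.asarray(q_values, dtype=float) / T
    prefs -= prefs.max()                      # for numerical stability
    p = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=p))

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q), softmax_action(q, T=0.5))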
Tree backup methods vs. linear backup methods

[Figures (tree.cdr): backup diagrams; from state s, actions a1, a2 lead via rewards r1, r2 to successor states (..., 9 to 16, B) and further actions (a14, a15). Tree backups branch over all successor states and actions, whereas linear backups follow a single path.]
The elegant trick is to assume that, if the process converges, the value of the next state V(s_{t+1}) should be an accurate estimate of the expected return downstream of this state (i.e., downstream of s_{t+1}). This is why it is called TD (temporal difference) learning. Thus, we would hope that the following holds:
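Written out in standard TD(0) notation (the discount factor γ and learning rate α are the conventional symbols, assumed here rather than taken from the slide), the hoped-for relation and the resulting update are:

   V(s_t) ≈ ⟨ r_{t+1} + γ V(s_{t+1}) ⟩

   V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]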
Indeed, proofs exist that under certain boundary conditions this
procedure, known as TD(0), converges to the optimal value
function for all states.
Instead of just updating the most recently left state s_t, we will now loop through all states visited in the past of this trial which still have an eligibility trace larger than zero, and update them according to:
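A common way to write this eligibility-trace update (TD(λ); the trace e(s), the learning rate α and the discount factor γ follow standard notation and are assumptions here):

   δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
   e(s_t) ← e(s_t) + 1
   for every state s with e(s) > 0:   V(s) ← V(s) + α δ_t e(s),   e(s) ← γ λ e(s)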
   w_i → w_i + ε [ r(t + 1) + v(t + 1) − v(t) ] u_i(t)

[Figure: TD-learning of reward prediction with a serial-compound representation. The stimulus is expanded into predictive signals X1, ..., Xn (X0 = reward/US) with traces x̄; shown are the reward r, the prediction v(t), the temporal difference v(t+1) − v(t), and the δ-error δ = v̇ + r, together with the weight update w_i → w_i + ε δ x̄_i. Panels #1 to #3 trace learning from Start (w0 = 0, w1 = 0 and w0 = 1, w1 = 0) to End (w0 = 1, w1 = 0 and w0 = 1, w1 = 1); the δ-error shows a forward shift because of the acausal derivative.]
Observations (with δ = v̇ + r):

The δ-error moves forward from the US to the CS.

The reward expectation signal extends forward to the CS.
DA-responses in the basal ganglia: pars compacta of the substantia nigra and the medially adjoining ventral tegmental area.

[Figure: dopamine neuron recordings (time marks Tr, 0.5 s, 1.0 s, 1.5 s). Panels: response to the reward with no CS; after learning, when the predicted reward occurs, the response appears at the CS instead; this neuron is supposed to represent the δ-error of TD-learning, which has moved forward as expected; after learning, when the predicted reward does not occur, the omission of the reward leads to inhibition, as also predicted.]

This is incompatible with a serial-compound representation of the stimulus, as the δ-error should then move forward step by step, which is not found. Rather, it shrinks at r and grows at the CS.
The Basic Control Structure

Schematic diagram of Control Loops, an old slide from some lectures earlier! Any recollections?

[Figure: basic control loop, labeled with reflex/reaction pathways.]
Control Loops

[Figure: the Set-Point enters a Controller (X0); Control Signals go to the Controlled System, which is subject to Disturbances; Feedback from the system returns to the controller.]
Control Loops

[Figure: the same loop in RL terms: a Critic observes the Context and sends a Reinforcement Signal to the Actor (X0, the Controller); the Actor emits Actions (Control Signals) to the Environment (Controlled System), which is subject to Disturbances; Feedback returns from the environment.]
An Actor-Critic Architecture: The Critic produces evaluative reinforcement feedback for the Actor by observing the consequences of its actions. The Critic takes the form of a TD-error, which indicates whether things have gone better or worse than expected with the preceding action. Thus, this TD-error can be used to evaluate the preceding action: if the error is positive, the tendency to select this action should be strengthened; otherwise it should be weakened.
Example of an Actor-Critic Procedure

The Actor selects actions according to a softmax policy

   π(s, a) = e^{p(s,a)} / Σ_b e^{p(s,b)},

where p(s,a) are the values of the modifiable (by the Critic!) policy parameters of the actor, indicating the tendency to select action a when being in state s.

We can now modify p for a given state-action pair at time t with:

   p(s_t, a_t) ← p(s_t, a_t) + β δ_t
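A minimal tabular sketch of this procedure (the environment interface, the learning rates α, β and the discount factor γ are illustrative assumptions):

import numpy as np

def softmax(prefs):
    # pi(s,a) = exp(p(s,a)) / sum_b exp(p(s,b))
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def actor_critic_step(p, V, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.9):
    # The Critic computes the TD-error and uses it both to update its own
    # value estimate and to strengthen or weaken the Actor's tendency p(s,a).
    delta = r + gamma * V[s_next] - V[s]   # TD-error delta_t
    V[s] += alpha * delta                  # Critic update
    p[s, a] += beta * delta                # Actor update: p(s_t,a_t) <- p(s_t,a_t) + beta*delta_t
    return delta

# Example with 2 states and 2 actions
p = np.zeros((2, 2))   # policy parameters p(s,a)
V = np.zeros(2)        # state values
rng = np.random.default_rng(0)
s = 0
a = int(rng.choice(2, p=softmax(p[s])))    # sample an action from the softmax policy
actor_critic_step(p, V, s, a, r=1.0, s_next=1)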
[Figure: basal ganglia circuit: Frontal Cortex, Thalamus, Striatum (S), STN, GPe, VP/SNr/GPi, and the DA-System (SNc, VTA, RRA).]

VP = ventral pallidum, SNr = substantia nigra pars reticulata, SNc = substantia nigra pars compacta, GPi = globus pallidus pars interna, GPe = globus pallidus pars externa.

[Figure: simplified circuit with Cortex = C, striatum = S, STN = subthalamic nucleus, DA = dopamine system, r = reward.]

[Figure: corticostriatal (pre, Glu) and nigrostriatal (DA) afferents converging on a medium-sized spiny projection neuron in the striatum (post).]
SARSA-Learning

It is also possible to evaluate actions directly by assigning values to state-action pairs (Q-values, not V-values!) rather than just to states. Interestingly, one can use exactly the same mathematical formalism and write an update for Q(s_t, a_t).

[Figure: backup diagram; from s_t the agent takes a_t, receives r_{t+1}, reaches s_{t+1}, and uses Q_{t+1} of the next state-action pair.]

SARSA = state-action-reward-state-action. On-policy update!
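In the standard form (learning rate α and discount factor γ assumed), this SARSA update reads:

   Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]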
Q-Learning

Here the update of Q(s_t, a_t) uses the maximal Q-value among the actions available in the next state.

[Figure: from s_t the agent takes a_t, receives r_{t+1}, and reaches s_{t+1}; the agent could go to several next states via a_{t+1}, but the update uses max Q_{t+1}.]

Even if the agent will not go to the blue state but to the black one, it will nonetheless use the blue Q-value for the update of the red state-action pair.
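In the same notation, the (off-policy) Q-learning update replaces Q(s_{t+1}, a_{t+1}) by the maximum over the next actions:

   Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]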
Notes:

1) For SARSA and Q-learning rigorous proofs exist that, under appropriate conditions, they will converge to the optimal policy.

2) Q-learning is the most widely used method for policy optimization.

3) For regular state-action spaces in a fully Markovian system, Q-learning converges faster than SARSA.

Regular state-action spaces: states tile the state space in a non-overlapping way; the system is fully deterministic (hence rewards and values are associated to state-action pairs in a deterministic way); actions cover the space fully.
Problems of RL

Curse of Dimensionality
In real-world problems it is difficult or impossible to define discrete state-action spaces.

(Temporal) Credit Assignment Problem
RL cannot handle large state-action spaces, as the reward gets diluted too much along the way.

Partial Observability Problem
In a real-world scenario an RL-agent will often not know exactly in what state it will end up after performing an action. Furthermore, states must be history independent.

State-Action Space Tiling
Deciding about the actual state- and action-space tiling is difficult, as it is often critical for the convergence of RL-methods. Alternatively one could employ a continuous version of RL, but these methods are equally difficult to handle.

Non-Stationary Environments
As for other learning methods, RL will only work in quasi-stationary environments.
Problems of RL

Credit Structuring Problem
One also needs to decide about the reward structure, which will affect the learning. Several possible strategies exist:

External evaluative feedback: The designer of the RL-system places rewards and punishments by hand. This strategy generally works only in very limited scenarios, because it essentially requires detailed knowledge about the RL-agent's world.

Internal evaluative feedback: Here the RL-agent is equipped with sensors that can measure physical aspects of the world (as opposed to 'measuring' numerical rewards). The designer then only decides which of these physical influences are rewarding and which are not.

Exploration-Exploitation Dilemma
RL-agents need to explore their environment in order to assess its reward structure. After some exploration the agent might have found a set of apparently rewarding actions. However, how can the agent be sure that the actions found were actually the best? Hence, when should an agent continue to explore, and when should it just exploit its existing knowledge? Mostly heuristic strategies are employed, for example annealing-like procedures, where the naive agent starts with exploration and its exploration drive gradually diminishes over time, turning it more towards exploitation.
[Figure: place-cell based navigation model. A Sensor Layer of place fields (1, ..., n) projects via Q-values to a Motor Layer of direction cells (N, NE, E, SE, S, SW, W, NW); motor activity is driven by Learned, Learned & Random, and Random components, the latter produced by a random-walk generation algorithm. Arena from Start to Goal (10000 units, 1.5 m). Real (left) and generated (right) path examples.]
NW
where i(st) are the features over the state space, and i,at are
the adaptable weights binding features to actions.
We assume that a place cell i produces spikes with a scaled
Gaussian-shaped probability distribution:
where i is the distance from the i-th place field centre to the
sample point (x,y) on the rats trajectory, defines the width of
the place field, and A is a scaling factor.
We then use the actual place field spiking to determine the values
for features i, i = 1, .., n, which take the value of 1, if place cell i
spikes at the given moment on the given point of the trajectory of
the model animal, otherwise it is zero:
Where i,at is the weight from the i-th place cell to action(-cell) a,
and state st is defined by (xt,yt), which are the actual coordinates
of the model animal in the field.
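A minimal sketch of this feature construction and the resulting linear Q-function (the concrete Gaussian, the parameter values and all variable names are assumptions made for illustration):

import numpy as np

rng = np.random.default_rng(0)

def place_field_features(pos, centres, sigma=0.1, A=1.0):
    # Binary features phi_i: 1 if place cell i spikes at the current position,
    # with a scaled Gaussian-shaped spiking probability of the distance to the field centre.
    d = np.linalg.norm(centres - pos, axis=1)        # distance to each place-field centre
    p_spike = A * np.exp(-d**2 / (2.0 * sigma**2))   # scaled Gaussian-shaped probability
    return (rng.random(len(centres)) < p_spike).astype(float)

def q_value(phi, w, a):
    # Linear function approximation: Q(s,a) = sum_i w[i,a] * phi_i(s)
    return float(w[:, a] @ phi)

centres = rng.random((50, 2))    # 50 place-field centres in a unit arena
w = np.zeros((50, 8))            # weights from place cells to 8 action cells (N, NE, ..., NW)
phi = place_field_features(np.array([0.3, 0.7]), centres)
print(q_value(phi, w, a=0))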
Results

[Figure: steps to the goal (0 to 300) for runs with and without function approximation; one run is marked as divergent.]
RL versus CL

Reinforcement learning and correlation-based (Hebbian) learning in comparison:

RL: 1) evaluative feedback (rewards)
CL: 1) non-evaluative feedback (correlations only)
Neural-SARSA (ndSARSA)