Outline
Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models
Reinforcement Learning: AI = RL
Deep Representations
A deep representation is a composition of many functions:
x →(w_1) h_1 →(w_2) h_2 → ... →(w_n) h_n →(w_{n+1}) y
Its gradient can be backpropagated through the composition by the chain rule.
Linear transformations: h_{k+1} = W h_k
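To make the composition concrete, here is a minimal NumPy sketch (the two-layer shapes, the tanh non-linearity, and the squared-error loss are assumptions chosen for illustration): the forward pass composes linear maps, and the backward pass recovers every gradient by the chain rule.

```python
import numpy as np

# Forward pass: a composition y = w2 @ tanh(w1 @ x),
# scored by a squared-error loss l(y) = (y_target - y)^2.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=(1, 4))
y_target = 1.0  # assumed training target

h1 = np.tanh(w1 @ x)   # first layer: linear map + non-linearity
y = w2 @ h1            # second layer: linear map
loss = ((y_target - y) ** 2).item()

# Backward pass: the gradient flows back through the composition
# by the chain rule, exactly as backpropagation computes it.
dy = 2 * (y - y_target)               # dl/dy
dw2 = np.outer(dy, h1)                # dl/dw2
dh1 = w2.T @ dy                       # dl/dh1
dw1 = np.outer(dh1 * (1 - h1**2), x)  # dl/dw1, through the tanh
```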
Weight Sharing
A recurrent neural network shares weights between time-steps (see the sketch below):
[Diagram: unrolled recurrence h_0 → h_1 → ... → h_n with inputs x_1 ... x_n and outputs y_1 ... y_n, reusing the same weights w_1, w_2 at every step.]
A convolutional neural network shares weights between local regions.
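A small sketch of this weight sharing for the recurrent case (the recurrence h_t = tanh(Wh h_{t-1} + Wx x_t) and all sizes are assumed for illustration): the same two weight matrices are applied at every time-step of the unrolled network.

```python
import numpy as np

rng = np.random.default_rng(0)
Wh = 0.1 * rng.normal(size=(4, 4))  # hidden-to-hidden weights, shared
Wx = 0.1 * rng.normal(size=(4, 2))  # input-to-hidden weights, shared
xs = rng.normal(size=(10, 2))       # input sequence x_1 ... x_n

h = np.zeros(4)                     # initial state h_0
for x_t in xs:                      # unroll over time-steps
    h = np.tanh(Wh @ h + Wx @ x_t)  # the SAME Wh, Wx at every step
```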
Loss Function
A loss function l(y) measures how good the output y is.
[Diagram: the loss l(y) is attached to the output of the forward computation x →(w_1) h_1 → ... →(w_n) h_n →(w_{n+1}) y.]
The loss, in combination with the function approximator, can be used to create an error function:
L(w) = E_x[ l(y) ]
The partial derivative of this error function, ∂L(w)/∂w = E_x[ ∂l(y)/∂w ], i.e. the gradient, can now be used to update the internal variables in the function approximator.

Gradient descent:
Adjust w in the direction of the negative gradient,
Δw = −(α/2) ∂L(w)/∂w
where α is a step-size parameter.
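A minimal stochastic gradient descent loop in the same spirit (the linear model, squared loss, and synthetic data are assumptions for the sketch):

```python
import numpy as np

# Minimise L(w) = E_x[(y_target - w @ x)^2] by stochastic gradient descent.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])  # generates the targets (assumed)
w = np.zeros(2)
alpha = 0.1                     # step-size parameter

for _ in range(1000):
    x = rng.normal(size=2)          # sample an input
    y_target = w_true @ x
    y = w @ x                       # forward pass
    grad = 2 * (y - y_target) * x   # sampled gradient dl(y)/dw
    w -= 0.5 * alpha * grad         # Δw = -(α/2) ∂l/∂w
```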
Policy-based RL: search directly for the optimal policy π*
Value-based RL: estimate the optimal value function Q*(s, a)
Model-based RL: build a model of the environment and plan (e.g. by lookahead) using it
Bellman Equation
The optimal value function obeys the Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]
Treat the right-hand side as a target and minimise the mean-squared error between it and the Q-network by stochastic gradient descent:
L(w) = E[ ( r + γ max_{a'} Q(s', a', w) − Q(s, a, w) )² ]
where r + γ max_{a'} Q(s', a', w) is the target; the update follows the gradient ∂L(w)/∂w.
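The same target drives tabular Q-learning; here is a sketch on an invented 5-state chain (the environment and constants are assumptions), nudging Q(s, a) toward the sampled target r + γ max_{a'} Q(s', a'):

```python
import numpy as np

# Tabular Q-learning on a toy 5-state chain: move left/right,
# reward 1 on reaching the right end. Environment is assumed.
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for _ in range(5000):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))         # explore uniformly
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    target = r + gamma * Q[s2].max()         # r + γ max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])    # descend the squared TD error
```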
Example: TD Gammon
[Diagram: a backgammon position s is fed into a neural network with weights w that outputs an estimated value V(s, w).]
Deep Q-Networks
DQN stabilises deep value-based RL (a sketch follows below):
- Experience replay: sample transitions (s, a, r, s') from a replay memory D, breaking correlations in the data
- A frozen target network with old parameters w⁻: avoids oscillations and breaks correlations between the Q-network and its target
- Reward clipping: keeps gradients robust

L(w) = E_{s,a,r,s' ~ D}[ ( r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w) )² ]

[Diagram: agent-environment loop with state s_t, action a_t, reward r_t.]
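A compact sketch of the replay and target-network machinery (the linear Q-function and synthetic transitions are assumptions, not the Atari setup):

```python
import random
import numpy as np

# Sketch of DQN's two stabilisers: experience replay and a frozen target net.
gamma, alpha = 0.99, 0.01
n_features, n_actions = 4, 2
w = np.zeros((n_actions, n_features))  # online Q-network parameters
w_target = w.copy()                    # frozen target parameters w-
replay = []                            # replay memory D

rng = np.random.default_rng(0)
for step in range(1, 2001):
    s = rng.normal(size=n_features)    # stand-in for a real transition
    a = int(rng.integers(n_actions))
    r = float(s[0])                    # toy reward signal
    s2 = rng.normal(size=n_features)
    replay.append((s, a, r, s2))       # store; sampling breaks correlations

    for s_b, a_b, r_b, s2_b in random.sample(replay, min(32, len(replay))):
        target = r_b + gamma * (w_target @ s2_b).max()  # uses old, fixed w-
        td_error = target - (w @ s_b)[a_b]
        w[a_b] += alpha * td_error * s_b  # gradient step on the squared error

    if step % 100 == 0:
        w_target = w.copy()            # periodically refresh the target
```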
DQN in Atari
- End-to-end learning of Q(s, a) from raw pixels s
- Input: a stack of raw pixels from the last 4 frames
- Output: Q(s, a) for each joystick/button position
- Reward: the change in score for that step
DQN Demo
DQN
Scores in five Atari games for Q-learning with and without replay and a target network:

Game           | Q-learning | + Target Q | + Replay | + Replay + Target Q
Breakout       |          3 |         10 |      241 |                 317
Enduro         |         29 |        142 |      831 |                1006
River Raid     |       1453 |       2868 |     4103 |                7447
Seaquest       |        276 |       1003 |      823 |                2894
Space Invaders |        302 |        373 |      826 |                1089
Deterministic Actor-Critic
Use two networks (sketched below):
- an actor network a = π(s, u)
- a critic network Q(s, a, w)
The critic is trained by Q-learning to evaluate the current policy; the actor is adjusted in the direction that increases the critic's estimate, ∇_a Q(s, a, w).
[Diagram: state s flows through weights u_1 ... u_n to the actor output a = π(s); s and a flow through weights w_1 ... w_n to the critic output Q.]
[Lillicrap et al.]
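A sketch of the two-network update with linear function approximation (the toy reward and all constants are assumptions; Lillicrap et al.'s DDPG uses deep networks together with DQN-style replay and target networks): the critic learns Q by temporal-difference updates, and the actor follows ∇_a Q(s, a, w) · ∇_u π(s, u).

```python
import numpy as np

# Deterministic actor-critic with linear function approximation:
# actor a = pi(s) = u @ s, critic Q(s, a) = w @ [s, a].
rng = np.random.default_rng(0)
n = 3
u = 0.1 * rng.normal(size=n)          # actor parameters
w = np.zeros(n + 1)                   # critic parameters
gamma, a_lr, c_lr = 0.99, 1e-3, 1e-2

def pi(s):
    return u @ s                      # deterministic policy

def q(s, a):
    return w @ np.append(s, a)        # critic value

for _ in range(2000):
    s = rng.normal(size=n)
    a = pi(s) + 0.1 * rng.normal()    # exploration noise
    r = -(a - s[0]) ** 2              # toy reward: match the first feature
    s2 = rng.normal(size=n)

    # Critic: TD update toward r + γ Q(s', pi(s'))
    td = r + gamma * q(s2, pi(s2)) - q(s, a)
    w += c_lr * td * np.append(s, a)

    # Actor: ascend ∇_a Q(s, a)|a=pi(s) · ∇_u pi(s), here w[-1] * s
    u += a_lr * w[-1] * s
```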
DDPG Demo
Model-Based RL
Learn a transition model of the environment
p(r, s' | s, a)
Plan using the transition model
[Diagram: lookahead tree over left/right action sequences.]
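A sketch of both steps with a tabular maximum-likelihood model standing in for a deep one (the chain environment and sample counts are assumptions):

```python
import numpy as np
from collections import defaultdict

# 1. Learn a transition model p(r, s' | s, a) by counting outcomes.
#    The 5-state chain environment is an assumption for the sketch.
n_states, n_actions, gamma = 5, 2, 0.9

def env_step(s, a):  # hidden true dynamics: move left (0) or right (1)
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return (1.0 if s2 == n_states - 1 else 0.0), s2

counts = defaultdict(lambda: defaultdict(int))
for s in range(n_states):
    for a in range(n_actions):
        for _ in range(10):                 # collect experience
            counts[(s, a)][env_step(s, a)] += 1

# 2. Plan using the learned model: value iteration over its predictions.
V = np.zeros(n_states)
for _ in range(100):
    for s in range(n_states):
        V[s] = max(
            sum(c * (r + gamma * V[s2]) for (r, s2), c in counts[(s, a)].items())
            / sum(counts[(s, a)].values())
            for a in range(n_actions)
        )
```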
Deep Models
(Gregor et al.)
DARN Demo
Challenges of Model-Based RL
Compounding errors: small errors in the transition model compound as a trajectory is rolled out, so plans over long horizons can rest on badly mispredicted states and rewards.
Deep Learning in Go
Monte-Carlo search
- Monte-Carlo tree search (MCTS) simulates future trajectories
- Builds a large lookahead search tree with millions of positions
- State-of-the-art 19×19 Go programs use MCTS
- e.g. the first strong Go program, MoGo (Gelly et al.)
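A minimal flat Monte-Carlo sketch of the "simulate future trajectories" idea (the toy subtraction game stands in for Go; real MCTS also grows a search tree and selects moves with an upper-confidence rule):

```python
import random

# Flat Monte-Carlo: score each legal move by the average outcome of
# random rollouts. Toy game: take 1-3 stones, taking the last one wins.
def legal_moves(n):
    return [m for m in (1, 2, 3) if m <= n]

def rollout(n, to_move):
    """Play random moves to the end; return the winning player (0 or 1)."""
    while True:
        n -= random.choice(legal_moves(n))
        if n == 0:
            return to_move             # the player who just moved wins
        to_move = 1 - to_move

def mc_move(n, player, sims=500):
    best_move, best_rate = None, -1.0
    for m in legal_moves(n):
        if n - m == 0:
            return m                   # immediate win
        wins = sum(rollout(n - m, 1 - player) == player for _ in range(sims))
        if wins / sims > best_rate:
            best_move, best_rate = m, wins / sims
    return best_move

print(mc_move(10, 0))  # 2 is optimal here: it leaves a multiple of 4
```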
Convolutional Networks
- 12-layer convnet trained to predict expert moves
- Raw convnet (looking at 1 position, no search at all)
- Equals the performance of MoGo with a 10^5-position search tree (Maddison et al.)

Move-prediction accuracy:

Program                | Accuracy
Human 6-dan            | 52%
12-Layer ConvNet       | 55%
8-Layer ConvNet*       | 44%
Prior state-of-the-art | 31-39%
*Clarke & Storkey

Winning rate of the raw convnet against existing search-based programs:

Program      | Winning rate
GnuGo        | 97%
MoGo (100k)  | 46%
Pachi (10k)  | 47%
Pachi (100k) | 11%
Conclusion
Questions?
"The only stupid question is the one you never asked." (Rich Sutton)