
Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks

Jakob N. Foerster1,*  jakob.foerster@cs.ox.ac.uk
Yannis M. Assael1,*  yannis.assael@cs.ox.ac.uk
Nando de Freitas1,2,3  nandodefreitas@google.com
Shimon Whiteson1  shimon.whiteson@cs.ox.ac.uk

1 University of Oxford, United Kingdom
2 Canadian Institute for Advanced Research, CIFAR NCAP Program
3 Google DeepMind

*These authors contributed equally to this work.

arXiv:1602.02672v1 [cs.AI] 8 Feb 2016

Abstract

We propose deep distributed recurrent Q-networks (DDRQN), which enable teams of agents to learn to solve communication-based coordination tasks. In these tasks, the agents are not given any pre-designed communication protocol. Therefore, in order to successfully communicate, they must first automatically develop and agree upon their own communication protocol. We present empirical results on two multi-agent learning problems based on well-known riddles, demonstrating that DDRQN can successfully solve such tasks and discover elegant communication protocols to do so. To our knowledge, this is the first time deep reinforcement learning has succeeded in learning communication protocols. In addition, we present ablation experiments that confirm that each of the main components of the DDRQN architecture is critical to its success.

1. Introduction

In recent years, advances in deep learning have been instrumental in solving a number of challenging reinforcement learning (RL) problems, including high-dimensional robot control (Levine et al., 2015; Assael et al., 2015; Watter et al., 2015), visual attention (Ba et al., 2015), and the Atari learning environment (ALE) (Guo et al., 2014; Mnih et al., 2015; Stadie et al., 2015; Wang et al., 2015; Schaul et al., 2016; van Hasselt et al., 2016; Oh et al., 2015; Bellemare et al., 2016; Nair et al., 2015).

The above-mentioned problems all involve only a single learning agent. However, recent work has begun to address multi-agent deep RL. In competitive settings, deep learning for Go (Maddison et al., 2015; Silver et al., 2016) has recently shown success. In cooperative settings, Tampuu et al. (2015) have adapted deep Q-networks (Mnih et al., 2015) to allow two agents to tackle a multi-agent extension to ALE. Their approach is based on independent Q-learning (Shoham et al., 2007; Shoham & Leyton-Brown, 2009; Zawadzki et al., 2014), in which all agents learn their own Q-functions independently in parallel.

However, these approaches all assume that each agent can fully observe the state of the environment. While DQN has also been extended to address partial observability (Hausknecht & Stone, 2015), only single-agent settings have been considered. To our knowledge, no work on deep reinforcement learning has yet considered settings that are both partially observable and multi-agent.

Such problems are both challenging and important. In the cooperative case, multiple agents must coordinate their behaviour so as to maximise their common payoff while facing uncertainty, not only about the hidden state of the environment but also about what their teammates have observed and thus how they will act. Such problems arise naturally in a variety of settings, such as multi-robot systems and sensor networks (Matarić, 1997; Fox et al., 2000; Gerkey & Matarić, 2004; Olfati-Saber et al., 2007; Cao et al., 2013).

In this paper, we propose deep distributed recurrent Q-networks (DDRQN) to enable teams of agents to learn effectively coordinated policies on such challenging problems. We show that a naive approach of simply training independent DQN agents with long short-term memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) is inadequate for multi-agent partially observable problems.

Therefore, we introduce three modifications that are key to DDRQN's success: a) last-action inputs: supplying each agent with its previous action as input on the next time step, so that agents can approximate their action-observation histories; b) inter-agent weight sharing: a single network's weights are used by all agents, but that network conditions on the agent's unique ID, to enable fast learning while also allowing for diverse behaviour; and c) disabling experience replay, which is poorly suited to the non-stationarity arising from multiple agents learning simultaneously.

To evaluate DDRQN, we propose two multi-agent reinforcement learning problems that are based on well-known riddles: the hats riddle, where n prisoners in a line must determine their own hat colours; and the switch riddle, in which n prisoners must determine when they have all visited a room containing a single switch. Both riddles have been used as interview questions at companies like Google and Goldman Sachs.

While these environments do not require convolutional networks for perception, the presence of partial observability means that they do require recurrent networks to deal with complex sequences, as in some single-agent works (Hausknecht & Stone, 2015; Ba et al., 2015) and language-based tasks (Narasimhan et al., 2015). In addition, because partial observability is coupled with multiple agents, optimal policies critically rely on communication between agents. Since no communication protocol is given a priori, reinforcement learning must automatically develop a coordinated communication protocol.

Our results demonstrate that DDRQN can successfully solve these tasks, outperforming baseline methods, and discovering elegant communication protocols along the way. To our knowledge, this is the first time deep reinforcement learning has succeeded in learning communication protocols. In addition, we present ablation experiments that confirm that each of the main components of the DDRQN architecture is critical to its success.

2. Background

In this section, we briefly introduce DQN and its multi-agent and recurrent extensions.

2.1. Deep Q-Networks

In a single-agent, fully-observable reinforcement learning setting (Sutton & Barto, 1998), an agent observes its current state s_t ∈ S at each discrete time step t, chooses an action a_t ∈ A according to a potentially stochastic policy π, observes a reward signal r_t, and transitions to a new state s_{t+1}. Its objective is to maximise an expectation over the discounted return,

R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots,   (1)

where γ ∈ [0, 1) is a discount factor. The Q-function of a policy π is:

Q^\pi(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a_t = a \right].   (2)

The optimal action-value function Q^*(s, a) = \max_\pi Q^\pi(s, a) obeys the Bellman optimality equation:

Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right].   (3)

Deep Q-networks (DQNs) (Mnih et al., 2015) use neural networks parameterised by θ to represent Q(s, a; θ). DQNs are optimised by minimising the following loss function at each iteration i:

L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[ \left( y_i^{DQN} - Q(s, a; \theta_i) \right)^2 \right],   (4)

with target

y_i^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta_i^-).   (5)

Here, θ_i^- are the weights of a target network that is frozen for a number of iterations while the online network Q(s, a; θ_i) is updated by gradient descent. DQN uses experience replay (Lin, 1993; Mnih et al., 2015): during learning, the agent builds a dataset D_t = {e_1, e_2, ..., e_t} of experiences e_t = (s_t, a_t, r_t, s_{t+1}) across episodes. The Q-network is then trained by sampling mini-batches of experiences from D uniformly at random. Experience replay helps prevent divergence by breaking correlations among the samples. It also enables reuse of past experiences for learning, thereby reducing sample costs.

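As a concrete illustration of equations (4) and (5), the sketch below computes the DQN target and the mini-batch squared TD error in NumPy. It is only a sketch of the standard update; the function and array names (q_online, q_target, etc.) and the batch layout are our own and not part of the original description.

import numpy as np

def dqn_loss(q_online, q_target, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error for a mini-batch, following Eqs. (4)-(5).

    q_online: (B, A) array of Q(s, a; theta_i) for the sampled states
    q_target: (B, A) array of Q(s', a; theta_i^-) for the sampled next states
    actions, rewards, dones: (B,) arrays describing the sampled transitions
    """
    # y_i = r + gamma * max_a' Q(s', a'; theta_i^-), with bootstrapping
    # switched off at terminal states.
    y = rewards + gamma * (1.0 - dones) * q_target.max(axis=1)
    q_taken = q_online[np.arange(len(actions)), actions]
    return np.mean((y - q_taken) ** 2)
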
2.2. Independent DQN

DQN has been extended to cooperative multi-agent settings, in which each agent m observes the global state s_t, selects an individual action a_t^m, and receives a team reward, r_t, shared among all agents. Tampuu et al. (2015) address this setting with a framework that combines DQN with independent Q-learning, applied to two-player pong, in which all agents independently and simultaneously learn their own Q-functions Q^m(s, a^m; θ_i^m). While independent Q-learning can in principle lead to convergence problems (since one agent's learning makes the environment appear non-stationary to other agents), it has a strong empirical track record (Shoham et al., 2007; Shoham & Leyton-Brown, 2009; Zawadzki et al., 2014).

2.3. Deep Recurrent Q-Networks

Both DQN and independent DQN assume full observability, i.e., the agent receives s_t as input. By contrast, in partially observable environments, s_t is hidden and the agent instead receives only an observation o_t that is correlated with s_t but in general does not disambiguate it.

Hausknecht & Stone (2015) propose the deep recurrent Q-network (DRQN) architecture to address single-agent, partially observable settings. Instead of approximating Q(s, a) with a feed-forward network, they approximate Q(o, a) with a recurrent neural network that can maintain an internal state and aggregate observations over time. This can be modelled by adding an extra input h_{t-1} that represents the hidden state of the network, yielding Q(o_t, h_{t-1}, a; θ_i). Thus, DRQN outputs both Q_t and h_t at each time step. DRQN was tested on a partially observable version of ALE in which a portion of the input screens were blanked out.

2.4. Partially Observable Multi-Agent RL

In this work, we consider settings where there are both multiple agents and partial observability: each agent receives its own private observation o_t^m at each time step and maintains an internal state h_t^m. However, we assume that learning can occur in a centralised fashion, i.e., agents can share parameters, etc., during learning, so long as the policies they learn condition only on their private histories. In other words, we consider centralised learning of decentralised policies.

We are interested in such settings because it is only when multiple agents and partial observability coexist that agents have an incentive to communicate. Because no communication protocol is given a priori, the agents must first automatically develop and agree upon such a protocol. To our knowledge, no work on deep RL has considered such settings and no work has demonstrated that deep RL can successfully learn communication protocols.

3. DDRQN

The most straightforward approach to deep RL in partially observable multi-agent settings is to simply combine DRQN with independent Q-learning, in which case each agent's Q-network represents Q^m(o_t^m, h_{t-1}^m, a^m; θ_i^m), which conditions on that agent's individual hidden state as well as its observation. This approach, which we call the naive method, performs poorly, as we show in Section 5.

Instead, we propose deep distributed recurrent Q-networks (DDRQN), which makes three key modifications to the naive method. The first, last-action input, involves providing each agent with its previous action as input to the next time step. Since the agents employ stochastic policies for the sake of exploration, they should in general condition their actions on their action-observation histories, not just their observation histories. Feeding the last action as input allows the RNN to approximate action-observation histories.

The second, inter-agent weight sharing, involves tying the weights of all agents' networks. In effect, only one network is learned and used by all agents. However, the agents can still behave differently because they receive different observations and thus evolve different hidden states. In addition, each agent receives its own index m as input, making it easier for agents to specialise. Weight sharing dramatically reduces the number of parameters that must be learned, greatly speeding learning.

The third, disabling experience replay, simply involves turning off this feature of DQN. Although experience replay is helpful in single-agent settings, when multiple agents learn independently the environment appears non-stationary to each agent, rendering its own experience obsolete and possibly misleading.

Given these modifications, DDRQN learns a Q-function of the form Q(o_t^m, h_{t-1}^m, m, a_{t-1}^m, a_t^m; θ_i). Note that θ_i does not condition on m, due to weight sharing, and that a_{t-1}^m is a portion of the history while a_t^m is the action whose value the Q-network estimates.

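The following fragment sketches how such an input could be assembled for the shared network, with the agent index and the previous action appended to the private observation as one-hot vectors. The helper name and tensor layout are our own illustration, not part of the paper.

import torch
import torch.nn.functional as F

def build_agent_input(obs, last_action, agent_id, n_agents, n_actions):
    """Concatenate o_t^m, a_{t-1}^m and the agent index m into one input vector.

    obs: (B, obs_dim) float tensor with each agent's private observation
    last_action, agent_id: (B,) long tensors
    """
    a_prev = F.one_hot(last_action, n_actions).float()  # last-action input
    m_id = F.one_hot(agent_id, n_agents).float()        # agent index fed to the shared weights
    return torch.cat([obs, a_prev, m_id], dim=-1)
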
Algorithm 1 DDRQN

  Initialise θ_1 and θ_1^-
  for each episode e do
    h_1^m = 0 for each agent m
    s_1 = initial state, t = 1
    while s_t ≠ terminal and t < T do
      for each agent m do
        With probability ε pick random action a_t^m,
        else a_t^m = arg max_a Q(o_t^m, h_{t-1}^m, m, a_{t-1}^m, a; θ_i)
      Get reward r_t and next state s_{t+1}, t = t + 1
    ∇θ = 0                                                ⊳ reset gradient
    for j = t - 1 down to 1 do
      for each agent m do
        y_j^m = r_j if s_j is terminal, else
        y_j^m = r_j + γ max_a Q(o_{j+1}^m, h_j^m, m, a_j^m, a; θ_i^-)
        Accumulate gradients for (y_j^m - Q(o_j^m, h_{j-1}^m, m, a_{j-1}^m, a_j^m; θ_i))^2
    θ_{i+1} = θ_i + α ∇θ                                   ⊳ update parameters
    θ_{i+1}^- = θ_i^- + ρ (θ_{i+1} - θ_i^-)                 ⊳ update target network

Algorithm 1 describes DDRQN. First, we initialise the target and Q-networks. For each episode, we also initialise the state, s_1, the internal state of the agents, h_1^m, and a_0^m. Next, for each time step we pick an action for each agent ε-greedily w.r.t. the Q-function. We feed in the previous action, a_{t-1}^m, and the agent index, m, along with the observation o_t^m and the previous internal state, h_{t-1}^m. After all agents have taken their actions, we query the environment for a state update and reward information.

When we reach the final time step or a terminal state, we proceed to the Bellman updates. Here, for each agent, m, and time step, j, we calculate a target Q-value, y_j^m, using the observed reward, r_j, and the discounted target network. We also accumulate the gradients, ∇θ, by regressing the Q-value estimate, Q(o_j^m, h_{j-1}^m, m, a_{j-1}^m, a; θ_i), against the target Q-value, y_j^m, for the action chosen, a_j^m.

Lastly, we conduct two weight updates: first θ_i in the direction of the accumulated gradients, ∇θ, and then the target network, θ_i^-, in the direction of the updated parameters.

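The sketch below condenses Algorithm 1 into PyTorch-style code under several assumptions of ours: a hypothetical env with reset()/step() returning per-agent observations, a recurrent q_net with an init_hidden() method that takes (observation, last action, agent index, hidden state) and returns a 1-D vector of action values, and default hyperparameters taken from the switch-riddle settings of Section 5. The hidden-state bookkeeping for the Bellman targets is simplified relative to the algorithm; this is a sketch, not the authors' implementation.

import copy
import random
import torch

def train_ddrqn(env, q_net, optimiser, n_agents, n_actions,
                n_episodes, eps=0.05, gamma=0.95, rho=0.01):
    """Condensed sketch of Algorithm 1: one shared recurrent Q-network,
    epsilon-greedy rollouts, no experience replay, soft target updates."""
    target_net = copy.deepcopy(q_net)                       # frozen copy, theta^-
    for _ in range(n_episodes):
        h = [q_net.init_hidden() for _ in range(n_agents)]  # h_0^m
        h_tgt = [q_net.init_hidden() for _ in range(n_agents)]
        last_a = [0] * n_agents                             # a_0^m
        obs, done = env.reset(), False
        chosen_q, max_tgt_q, rewards = [], [], []           # per-step records
        while not done:
            step_q, step_tgt, acts = [], [], []
            for m in range(n_agents):
                q, h[m] = q_net(obs[m], last_a[m], m, h[m])
                with torch.no_grad():
                    q_t, h_tgt[m] = target_net(obs[m], last_a[m], m, h_tgt[m])
                a = random.randrange(n_actions) if random.random() < eps \
                    else int(q.argmax())
                acts.append(a)
                step_q.append(q[a])
                step_tgt.append(q_t.max())
            obs, r, done = env.step(acts)
            chosen_q.append(step_q)
            max_tgt_q.append(step_tgt)
            rewards.append(r)
            last_a = acts
        # Bellman targets: bootstrap from the next step's target-network value,
        # except at the final (terminal or time-limited) step of the episode.
        loss = 0.0
        T = len(rewards)
        for j in range(T):
            for m in range(n_agents):
                y = rewards[j] if j == T - 1 \
                    else rewards[j] + gamma * max_tgt_q[j + 1][m]
                loss = loss + (y - chosen_q[j][m]) ** 2
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        # Soft target-network update: theta^- += rho * (theta - theta^-).
        for p, p_t in zip(q_net.parameters(), target_net.parameters()):
            p_t.data += rho * (p.data - p_t.data)

Note that, as in the text, the loss is accumulated over a full episode and there is no replay buffer: each episode's experience is used once and discarded.
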

4. Multi-Agent Riddles

In this section, we describe the riddles on which we evaluate DDRQN.

4.1. Hats Riddle

The hats riddle can be described as follows: An executioner lines up 100 prisoners single file and puts a red or a blue hat on each prisoner's head. Every prisoner can see the hats of the people in front of him in the line, but not his own hat, nor those of anyone behind him. The executioner starts at the end (back) and asks the last prisoner the colour of his hat. He must answer "red" or "blue". If he answers correctly, he is allowed to live. If he gives the wrong answer, he is killed instantly and silently. (While everyone hears the answer, no one knows whether an answer was right.) On the night before the line-up, the prisoners confer on a strategy to help them. What should they do? (Poundstone, 2012). Figure 1 illustrates this setup.

Figure 1. Hats: Each prisoner can hear the answers from all preceding prisoners (to the left) and see the colour of the hats in front of him (to the right) but must guess his own hat colour.

An optimal strategy is for all prisoners to agree on a communication protocol in which the first prisoner says "blue" if the number of blue hats is even and "red" otherwise (or vice-versa). All remaining prisoners can then deduce their hat colour given the hats they see in front of them and the responses they have heard behind them. Thus, everyone except the first prisoner will definitely answer correctly.

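To make the deduction concrete, here is a small Python transcription of that parity protocol (our own sketch, not part of the paper). It returns the answers in order from the back of the line; by construction, every prisoner after the first is correct.

def parity_strategy(hats):
    """Simulate the hand-coded protocol. hats is a list of 'blue'/'red',
    ordered from the back of the line (who answers first) to the front."""
    answers = []
    for m, hat in enumerate(hats):
        seen_ahead = hats[m + 1:]          # hats prisoner m can see
        heard = answers                    # answers given behind him
        blues = seen_ahead.count('blue') + heard.count('blue')
        # The first prisoner announces the parity of the blue hats he sees;
        # everyone else deduces their own colour from that parity.
        answers.append('blue' if blues % 2 == 0 else 'red')
    return answers

# Example: parity_strategy(['red', 'blue', 'blue']) returns
# ['blue', 'blue', 'blue'], so prisoners 2 and 3 answer correctly.
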
To formalise the hats riddle as a multi-agent RL task, we define a state space s = (s^1, ..., s^n, a^1, ..., a^n), where n is the total number of agents, s^m ∈ {blue, red} is the m-th agent's hat colour and a^m ∈ {blue, red} is the action it took on the m-th step. At all other time steps, agent m can only take a null action. On the m-th time step, agent m's observation is o^m = (a^1, ..., a^{m-1}, s^{m+1}, ..., s^n). Reward is zero except at the end of the episode, when it is the total number of agents with the correct action: r_n = \sum_m I(a^m = s^m). We label only the relevant observation o^m and action a^m of agent m, omitting the time index.

Although this riddle is a single-action and single-observation problem, it is still partially observable, given that none of the agents can observe the colour of their own hat.

4.2. Switch Riddle

The switch riddle can be described as follows: One hundred prisoners have been newly ushered into prison. The warden tells them that starting tomorrow, each of them will be placed in an isolated cell, unable to communicate amongst each other. Each day, the warden will choose one of the prisoners uniformly at random with replacement, and place him in a central interrogation room containing only a light bulb with a toggle switch. The prisoner will be able to observe the current state of the light bulb. If he wishes, he can toggle the light bulb. He also has the option of announcing that he believes all prisoners have visited the interrogation room at some point in time. If this announcement is true, then all prisoners are set free, but if it is false, all prisoners are executed. The warden leaves and the prisoners huddle together to discuss their fate. Can they agree on a protocol that will guarantee their freedom? (Wu, 2002).

Figure 2. Switch: Every day one prisoner gets sent to the interrogation room, where he can see the switch and choose between the actions On, Off, Tell and None.

A number of strategies (Song, 2012; Wu, 2002) have been analysed for the infinite time-horizon version of this problem, in which the goal is to guarantee survival. One well-known strategy is for one prisoner to be designated the counter. Only he is allowed to turn the switch off, while each other prisoner turns it on only once. Thus, when the counter has turned the switch off n - 1 times, he can Tell.

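A minimal simulation of that designated-counter protocol is sketched below; the function and variable names are ours, and the day limit exists only to bound the loop.

import random

def counter_protocol(n, max_days=100000, seed=0):
    """Simulate the designated-counter strategy: prisoner 0 is the counter and
    is the only one allowed to switch the light off; every other prisoner
    switches it on exactly once. Returns (day of the Tell, whether it was safe)."""
    rng = random.Random(seed)
    light_on, count = False, 0
    has_flipped = [False] * n            # whether each prisoner has turned it on
    visited = [False] * n
    for day in range(1, max_days + 1):
        m = rng.randrange(n)             # uniform visitor, with replacement
        visited[m] = True
        if m == 0:                       # the counter
            if light_on:
                light_on, count = False, count + 1
            if count == n - 1:           # everyone else has been counted
                return day, all(visited)
        elif not light_on and not has_flipped[m]:
            light_on, has_flipped[m] = True, True
    return None, False
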
To formalise the switch riddle, we define a state space s = (SW_t, IR_t, s^1, ..., s^n), where SW_t ∈ {on, off} is the position of the switch, IR_t ∈ {1, ..., n} is the current visitor in the interrogation room, and s^1, ..., s^n ∈ {0, 1} tracks which agents have already been to the interrogation room. At time step t, agent m observes o_t^m = (ir_t, sw_t), where ir_t = I(IR_t = m), and sw_t = SW_t if the agent is in the interrogation room and null otherwise. If agent m is in the interrogation room, then its actions are a_t^m ∈ {On, Off, Tell, None}; otherwise the only action is None. The episode ends when an agent chooses Tell or when the maximum time step is reached. The reward r_t is 0, except when an agent chooses Tell, in which case it is 1 if all agents have been to the interrogation room and -1 otherwise.

5. Experiments

In this section, we evaluate DDRQN on both multi-agent riddles. In our experiments, prisoners select actions using an ε-greedy policy, with an exploration rate that depends on the number of agents n for the hats riddle and ε = 0.05 for the switch riddle. For the latter, the discount factor was set to γ = 0.95, and the target networks, as described in Section 3, are updated with ρ = 0.01, while in both cases weights were optimised using Adam (Kingma & Ba, 2014) with a learning rate of 1 × 10^-3. The proposed architectures make use of rectified linear units and LSTM cells. Further details of the network implementations are described in the Supplementary Material, and source code will be published online.

5.1. Hats Riddle

Figure 3 shows the architecture we use to apply DDRQN to the hats riddle. To select a^m, the network is fed as input o^m = (a^1, ..., a^{m-1}, s^{m+1}, ..., s^n), as well as m and n. The answers heard are passed through two single-layer MLPs, z_a^k = MLP[1 × 64](a^k) + MLP[2 × 64](m, n), and their outputs are added element-wise. Subsequently, z_a^k is passed through an LSTM network, y_a^k, h_a^k = LSTM_a[64](z_a^k, h_a^{k-1}). We follow a similar procedure for the n - m hats observed, defining y_s^k, h_s^k = LSTM_s[64](z_s^k, h_s^{k-1}). Finally, the last values of the two LSTM networks, y_a^{m-1} and y_s^n, are used to approximate the Q-values for each action, Q^m = MLP[128 × 64, 64 × 64, 64 × 1](y_a^{m-1} || y_s^n). The network is trained with mini-batches of 20 episodes.

Figure 3. Hats: Each agent m observes the answers, a^k, k < m, from all preceding agents and the hat colours, s^k, k > m, in front of him. Both variable-length sequences are processed through RNNs. First, the answers heard are passed through two single-layer MLPs, z_a^k = MLP(a^k) + MLP(m, n), and their outputs are added element-wise. z_a^k is passed through an LSTM network, y_a^k, h_a^k = LSTM_a(z_a^k, h_a^{k-1}). Similarly, for the observed hats we define y_s^k, h_s^k = LSTM_s(z_s^k, h_s^{k-1}). The last values of the two LSTMs, y_a^{m-1} and y_s^n, are used to approximate Q^m = MLP(y_a^{m-1} || y_s^n), from which the action a^m is chosen.

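A PyTorch sketch of this dual-LSTM architecture is given below. The layer sizes follow the text (64-unit embeddings and LSTMs feeding a 128 → 64 → 64 → 1 output MLP), but the input encodings, ReLU placement, and batching are our assumptions; it also assumes both input sequences are non-empty.

import torch
import torch.nn as nn

class HatsQNet(nn.Module):
    """Sketch of the Figure 3 architecture: one LSTM over the answers heard,
    one LSTM over the hats seen, both conditioned on (m, n)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.embed_answer = nn.Linear(1, hidden)   # MLP[1 x 64](a^k)
        self.embed_hat = nn.Linear(1, hidden)      # MLP[1 x 64](s^k)
        self.embed_mn = nn.Linear(2, hidden)       # MLP[2 x 64](m, n)
        self.lstm_a = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm_s = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Sequential(               # MLP[128 x 64, 64 x 64, 64 x 1]
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, answers, hats, m, n):
        # answers: (B, m-1, 1); hats: (B, n-m, 1); m, n: (B, 1) floats.
        mn = self.embed_mn(torch.cat([m, n], dim=-1)).unsqueeze(1)
        z_a = self.embed_answer(answers) + mn      # element-wise addition
        z_s = self.embed_hat(hats) + mn
        y_a, _ = self.lstm_a(z_a)                  # (B, m-1, hidden)
        y_s, _ = self.lstm_s(z_s)
        last = torch.cat([y_a[:, -1], y_s[:, -1]], dim=-1)  # y_a^{m-1} || y_s^n
        return self.q_head(last)                   # single output, as in 64 x 1
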
Furthermore, we use an adaptive variant of curriculum learning (Bengio et al., 2009) to pave the way for scalable strategies and better training performance. We sample examples from a multinomial distribution of curricula, each corresponding to a different n, where the current bound is raised every time performance becomes near optimal. The probability of sampling a given n is inversely proportional to the performance gap compared to the normalised maximum reward. The performance is depicted in Figure 5.

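Read literally, that sampling scheme could look like the following sketch; the smoothing constant and the exact normalisation are our assumptions, and n_bound is assumed to be raised elsewhere once performance at the current bound is near optimal.

import numpy as np

def sample_curriculum(perf, n_min=3, n_bound=10, smooth=1e-3, rng=np.random):
    """Pick the number of agents n for the next training episode.

    perf: dict mapping n -> current normalised reward in [0, 1].
    The probability of a given n is inversely proportional to its gap
    from the normalised maximum reward of 1.
    """
    candidates = list(range(n_min, n_bound + 1))
    gaps = np.array([1.0 - perf.get(n, 0.0) for n in candidates])
    inv = 1.0 / (gaps + smooth)        # inversely proportional to the gap
    probs = inv / inv.sum()
    return int(rng.choice(candidates, p=probs))
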
Figure 5. Hats: Using curriculum learning, DDRQN achieves good performance for n = 3...20 agents, compared to the optimal strategy.

We first evaluate DDRQN for n = 10 and compare it with tabular Q-learning. Tabular Q-learning is feasible only with few agents, since the state space grows exponentially with n. In addition, separate tables for each agent preclude generalising across agents.

Figure 4 shows the results, in which DDRQN substantially outperforms tabular Q-learning. In addition, DDRQN comes close in performance to the optimal strategy described in Section 4.1. This figure also shows the results of an ablation experiment in which inter-agent weight sharing has been removed from DDRQN. The results confirm that inter-agent weight sharing is key to performance.

Figure 4. Results on the hats riddle with n = 10 agents, comparing DDRQN with and without inter-agent weight sharing to a tabular Q-table and a hand-coded optimal strategy. The lines depict the average of 10 runs and 95% confidence intervals.

Since each agent takes only one action in the hats riddle, it is essentially a single-step problem. Therefore, last-action inputs and disabling experience replay do not play a role and do not need to be ablated. We consider these components in the switch riddle in Section 5.2.

We compare the strategies DDRQN learns to the optimal strategy by computing the percentage of trials in which the first agent correctly encodes the parity of the observed hats in its answer. Table 1 shows that the encoding is almost perfect for n ∈ {3, 5, 8}. For n ∈ {12, 16, 20}, the agents do not encode parity but learn a different distributed solution that is nonetheless close to optimal. We believe that qualitatively this solution corresponds more to the agents communicating information about other hats through their answers, instead of only the first agent.

Table 1. Percent agreement on the hats riddle between DDRQN and the optimal parity-encoding strategy.

  N     % Agreement
  3     100.0%
  5     100.0%
  8      79.6%
  12     52.6%
  16     50.8%
  20     52.5%

perfect for n {3, 5, 8}. For n {12, 16, 20}, the agents
do not encode parity but learn a different distributed solu- 128](om m
t , OneHot(at1 ), OneHot(m), n). Their embed-
tion that is nonetheless close to optimal. We believe that ding zt is then passed an LSTM network, ytm , hm
m
=
t
qualitatively this solution corresponds to more the agents LSTM[128](ztm , hm ), which is used to approximate the
t1
communicating information about other hats through their agents action-observation history. Finally, the output ytm
answers, instead of only the first agent. of the LSTM is used at each step to approximate the
Q-values of each action using a 2-layer MLP Qm t =
5.2. Switch Riddle MLP[128 128, 128 128, 128 4](ytm ). As in the hats
Figure 6 illustrates the model architecture used in the riddle, curriculum learning was used for training.
switch riddle. Each agent m is modelled as a recurrent Figure 7, which shows results for n = 3, shows that
neural network with LSTM cells that is unrolled for Dmax DDRQN learns an optimal policy, beating the naive method
time-steps, where d denotes the number of days of the and the hand coded strategy, tell on last day. This veri-
episode. In our experiments, we limit d to Dmax = 4n 6 fies that the three modifications of DDRQN substantially
in order to keep the experiments computationally tractable. improve performance on this task. In following paragraphs
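A PyTorch sketch of this per-step model is shown below. The layer widths mirror the MLP[(7 + n) × 128, 128 × 128] embedding, the 128-unit LSTM, and the 128 → 128 → 128 → 4 Q-head described above, while the feature encoding and activation choices are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchQNet(nn.Module):
    """Sketch of the Figure 6 architecture: input MLP, LSTM over the episode,
    and an output MLP producing Q-values for On/Off/Tell/None."""
    def __init__(self, n_agents, n_actions=4, hidden=128):
        super().__init__()
        self.n_agents = n_agents
        in_dim = 7 + n_agents                    # (7 + n) input features, as in the text
        self.embed = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.q_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, sw, in_room, last_action, agent_id, n, state=None):
        # sw, in_room, n: (B, 1) floats; last_action, agent_id: (B,) longs.
        feats = torch.cat([sw, in_room,
                           F.one_hot(last_action, 4).float(),
                           F.one_hot(agent_id, self.n_agents).float(),
                           n], dim=-1)
        z = self.embed(feats)
        y, state = self.lstm(z.unsqueeze(1), state)   # one step of the unrolled LSTM
        return self.q_head(y[:, -1]), state           # Q_t^m and the carried hidden state
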
Figure 7 shows results for n = 3: DDRQN learns an optimal policy, beating the naive method and the hand-coded strategy "tell on last day". This verifies that the three modifications of DDRQN substantially improve performance on this task. In the following paragraphs we analyse the importance of the individual modifications.

Figure 7. Switch: For n = 3, DDRQN outperforms the naive method and a simple hand-coded strategy, "tell on last day", achieving Oracle-level performance. The lines depict the average of 10 runs and the 95% confidence interval.

We analysed the strategy that DDRQN discovered for n = 3 by looking at 1000 sampled episodes. Figure 8 shows a decision tree, constructed from those samples, that corresponds to an optimal strategy allowing the agents to collectively track the number of visitors to the interrogation room. When a prisoner visits the interrogation room after day two, there are only two options: either one or two prisoners may have visited the room before. If three prisoners had been, the third prisoner would have already finished the game. The two remaining options can be encoded via the "on" and "off" positions respectively. In order to carry out this strategy, each prisoner has to learn to keep track of whether he has visited the cell before and what day it currently is.

Figure 8. Switch: For n = 3, DDRQN manages to discover a perfect strategy, which we visualise as a decision tree in this figure. After day 2, the "on" position of the switch encodes that 2 prisoners have visited the interrogation room, while "off" encodes that one prisoner has.

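The decision tree of Figure 8 can be transcribed as the following rule; the encoding below is our own reading of the strategy described above.

def switch_policy_n3(day, has_been_before, switch_on):
    """Decision rule read off Figure 8 for n = 3.

    Returns one of 'On', 'Off', 'Tell', 'None' for the prisoner currently
    in the interrogation room."""
    if day == 1:
        return 'On'
    if day == 2:
        return 'Off' if has_been_before else 'On'
    # From day 3 onwards, 'On' encodes that two prisoners have visited,
    # 'Off' that only one has.
    if has_been_before:
        return 'None'
    return 'Tell' if switch_on else 'On'
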
Figure 9 compares the performance of DDRQN to a variant in which the switch has been disabled. After around 3,500 episodes, the two diverge in performance. Hence, there is a clearly identifiable point during learning when the prisoners start learning to communicate via the switch. Note that only when the switch is enabled can DDRQN outperform the hand-coded "tell on last day" strategy. Thus, communication via the switch is required for good performance.

Figure 9. Switch: At 3.5k episodes the DDRQN line clearly separates from the performance line for the no-switch test and starts exceeding "tell on last day". At this point the agents start to discover strategies that evolve communication using the switch. The lines depict the mean of 10 runs and the 95% confidence interval.

Figure 10 shows performance for n = 4. On most runs, DDRQN clearly beats the hand-coded "tell on last day" strategy and final performance approaches 90% of the oracle. However, on some of the remaining runs DDRQN fails to significantly outperform the hand-coded strategy. Analysing the learned strategies suggests that prisoners typically encode whether 2 or 3 prisoners have been to the room via the "on" and "off" positions of the switch, respectively. This strategy generates no false negatives, i.e., when the 4th prisoner enters the room, he always Tells, but it generates false positives around 5% of the time. Example strategies are included in the Supplementary Material.

Figure 10. Switch: 10 runs using curriculum learning for n = 3, 4. In most cases DDRQN was able to find strategies that outperform "tell on last day" for n = 4.

Furthermore, Figure 11 shows the results of ablation experiments in which each of the modifications in DDRQN is removed one by one. The results show that all three modifications contribute substantially to DDRQN's performance. Inter-agent weight sharing is by far the most important, without which the agents are essentially unable to learn the task, even for n = 3. Last-action inputs also play a significant role, without which performance does not substantially exceed that of the "tell on last day" strategy. Disabling experience replay also makes a difference, as performance with replay never reaches optimal, even after 50,000 episodes. This result is not surprising given the non-stationarity induced by multiple agents learning in parallel. Such non-stationarity arises even though agents can track their action-observation histories via RNNs within a given episode. Since their memories are reset between episodes, learning performed by other agents appears non-stationary from their perspective.

However, this non-stationarity is particularly important in communication-based tasks like these riddles, since the value function for communication actions depends heavily on the interpretation of these messages by the other agents, which is in turn set by their Q-functions.

Figure 11. Switch: Tied weights and last-action input are key to the performance of DDRQN. Experience replay prevents agents from reaching Oracle level. The experiment was executed for n = 3 and the lines depict the average of 10 runs and the 95% confidence interval.

6. Related Work

There has been a plethora of work on multi-agent reinforcement learning with communication, e.g., (Tan, 1993; Melo et al., 2011; Panait & Luke, 2005; Zhang & Lesser, 2013; Maravall et al., 2013). However, most of this work assumes a pre-defined communication protocol. One exception is the work of Kasai et al. (2008), in which tabular Q-learning agents have to learn the content of a message to solve a predator-prey task. Their approach is similar to the Q-table benchmark used in Section 5.1. By contrast, DDRQN uses recurrent neural networks that allow for memory-based communication and generalisation across agents.

Another example of open-ended communication learning in a multi-agent task is given in (Giles & Jim, 2002). However, here evolutionary methods are used for learning communication protocols, rather than RL. By using deep RL with shared weights, we enable our agents to develop distributed communication strategies and allow for faster learning via gradient-based optimisation.

Furthermore, planning-based RL methods have been employed to include messages as an integral part of the multi-agent reinforcement learning challenge (Spaan et al., 2006). However, so far this work has not been extended to deal with high-dimensional complex problems.

In ALE, partial observability has been artificially introduced by blanking out a fraction of the input screen (Hausknecht & Stone, 2015). Deep recurrent reinforcement learning has also been applied to text-based games, which are naturally partially observable (Narasimhan et al., 2015). Recurrent DQN was also successful in the email campaign challenge (Li et al., 2015).

However, all these examples apply recurrent DQN in single-agent domains. Without the combination of multiple agents and partial observability, there is no need to learn a communication protocol, an essential feature of our work.

7. Conclusions & Future Work

This paper proposed deep distributed recurrent Q-networks (DDRQN), which enable teams of agents to learn to solve communication-based coordination tasks. In order to successfully communicate, agents in these tasks must first automatically develop and agree upon their own communication protocol. We presented empirical results on two multi-agent learning problems based on well-known riddles, demonstrating that DDRQN can successfully solve such tasks and discover elegant communication protocols to do so. In addition, we presented ablation experiments that confirm that each of the main components of the DDRQN architecture is critical to its success.

Future work is needed to fully understand and improve the scalability of the DDRQN architecture for large numbers of agents, e.g., for n > 4 in the switch riddle. We also hope to further explore the local minima structure of the coordination and strategy space that underlies these riddles. Another avenue for improvement is to extend DDRQN to make use of various multi-agent adaptations of Q-learning (Tan, 1993; Littman, 1994; Lauer & Riedmiller, 2000; Panait & Luke, 2005).

A benefit of using deep models is that they can efficiently cope with high-dimensional perceptual signals as inputs. In the future this can be tested by replacing the binary representation of the colour with real images of hats, or by applying DDRQN to other scenarios that involve real-world data as input.

While we have advanced a new proposal for using riddles as a test field for multi-agent partially observable reinforcement learning with communication, we also hope that this research will spur the development of further interesting and challenging domains in the area.

8. Acknowledgements

This work was supported by the Oxford-Google DeepMind Graduate Scholarship and the EPSRC.

References

Assael, J.-A. M., Wahlström, N., Schön, T. B., and Deisenroth, M. P. Data-efficient learning of feedback policies from image pixels using deep dynamical models. arXiv preprint arXiv:1510.02173, 2015.

Ba, J., Mnih, V., and Kavukcuoglu, K. Multiple object recognition with visual attention. In ICLR, 2015.

Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. Increasing the action gap: New operators for reinforcement learning. In AAAI, 2016.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In ICML, pp. 41-48, 2009.

Cao, Y., Yu, W., Ren, W., and Chen, G. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427-438, 2013.

Fox, D., Burgard, W., Kruppa, H., and Thrun, S. A probabilistic approach to collaborative multi-robot localization. Autonomous Robots, 8(3):325-344, 2000.

Gerkey, B. P. and Matarić, M. J. A formal analysis and taxonomy of task allocation in multi-robot systems. International Journal of Robotics Research, 23(9):939-954, 2004.

Giles, C. L. and Jim, K. C. Learning communication for multi-agent systems. In Innovative Concepts for Agent-Based Systems, pp. 377-390. Springer, 2002.

Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, pp. 3338-3346, 2014.

Hausknecht, M. and Stone, P. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Kasai, T., Tenmoto, H., and Kamiya, A. Learning of communication codes in multi-agent reinforcement learning problem. In IEEE Conference on Soft Computing in Industrial Applications, pp. 1-6, 2008.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lauer, M. and Riedmiller, M. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In ICML, 2000.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

Li, X., Li, L., Gao, J., He, X., Chen, J., Deng, L., and He, J. Recurrent reinforcement learning: A hybrid approach. arXiv preprint arXiv:1509.03044, 2015.

Lin, L.-J. Reinforcement learning for robots using neural networks. PhD thesis, School of Computer Science, Carnegie Mellon University, 1993.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), pp. 157-163, 1994.

Maddison, C. J., Huang, A., Sutskever, I., and Silver, D. Move evaluation in Go using deep convolutional neural networks. In ICLR, 2015.

Maravall, D., De Lope, J., and Domínguez, R. Coordination of communication in robot teams by reinforcement learning. Robotics and Autonomous Systems, 61(7):661-666, 2013.

Matarić, M. J. Reinforcement learning in the multi-robot domain. Autonomous Robots, 4(1):73-83, 1997.

Melo, F. S., Spaan, M., and Witwicki, S. J. QueryPOMDP: POMDP-based communication in multiagent systems. In Multi-Agent Systems, pp. 189-204, 2011.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

Narasimhan, K., Kulkarni, T., and Barzilay, R. Language understanding for text-based games using deep reinforcement learning. In EMNLP, 2015.

Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action-conditional video prediction using deep networks in Atari games. In NIPS, pp. 2845-2853, 2015.

Olfati-Saber, R., Fax, J. A., and Murray, R. M. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215-233, 2007.

Panait, L. and Luke, S. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387-434, 2005.

Poundstone, W. Are You Smart Enough to Work at Google?: Fiendish Puzzles and Impossible Interview Questions from the World's Top Companies. Oneworld Publications, 2012.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In ICLR, 2016.

Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, New York, 2009.

Shoham, Y., Powers, R., and Grenager, T. If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365-377, 2007.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Song, Y. 100 prisoners and a light bulb. Technical report, University of Washington, 2012.

Spaan, M., Gordon, G. J., and Vlassis, N. Decentralized planning under uncertainty for teams of communicating agents. In International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 249-256, 2006.

Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, 1998.

Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., and Vicente, R. Multiagent cooperation and competition with deep reinforcement learning. arXiv preprint arXiv:1511.08779, 2015.

Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In ICML, 1993.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI, 2016.

Wang, Z., de Freitas, N., and Lanctot, M. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. A. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.

Wu, W. 100 prisoners and a lightbulb. Technical report, OCF, UC Berkeley, 2002.

Zawadzki, E., Lipson, A., and Leyton-Brown, K. Empirically evaluating multiagent learning algorithms. arXiv preprint arXiv:1401.8074, 2014.

Zhang, C. and Lesser, V. Coordinating multi-agent reinforcement learning with limited communication. Volume 2, pp. 1101-1108, 2013.