Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks

These authors contributed equally to this work.

Abstract

We propose deep distributed recurrent Q-networks (DDRQN), which enable teams of agents to learn to solve communication-based coordination tasks. In these tasks, the agents are not given any pre-designed communication protocol. Therefore, in order to successfully communicate, they must first automatically develop and agree upon their own communication protocol. We present empirical results on two multi-agent learning problems based on well-known riddles, demonstrating that DDRQN can successfully solve such tasks and discover elegant communication protocols to do so. To our knowledge, this is the first time deep reinforcement learning has succeeded in learning communication protocols. In addition, we present ablation experiments that confirm that each of the main components of the DDRQN architecture is critical to its success.

1. Introduction

In recent years, advances in deep learning have been instrumental in solving a number of challenging reinforcement learning (RL) problems, including high-dimensional robot control (Levine et al., 2015; Assael et al., 2015; Watter et al., 2015), visual attention (Ba et al., 2015), and the Atari learning environment (ALE) (Guo et al., 2014; Mnih et al., 2015; Stadie et al., 2015; Wang et al., 2015; Schaul et al., 2016; van Hasselt et al., 2016; Oh et al., 2015; Bellemare et al., 2016; Nair et al., 2015).

The above-mentioned problems all involve only a single learning agent. However, recent work has begun to address multi-agent deep RL. In competitive settings, deep learning for Go (Maddison et al., 2015; Silver et al., 2016) has recently shown success. In cooperative settings, Tampuu et al. (2015) have adapted deep Q-networks (Mnih et al., 2015) to allow two agents to tackle a multi-agent extension to ALE. Their approach is based on independent Q-learning (Shoham et al., 2007; Shoham & Leyton-Brown, 2009; Zawadzki et al., 2014), in which all agents learn their own Q-functions independently in parallel.

However, these approaches all assume that each agent can fully observe the state of the environment. While DQN has also been extended to address partial observability (Hausknecht & Stone, 2015), only single-agent settings have been considered. To our knowledge, no work on deep reinforcement learning has yet considered settings that are both partially observable and multi-agent.

Such problems are both challenging and important. In the cooperative case, multiple agents must coordinate their behaviour so as to maximise their common payoff while facing uncertainty, not only about the hidden state of the environment but also about what their teammates have observed and thus how they will act. Such problems arise naturally in a variety of settings, such as multi-robot systems and sensor networks (Matarić, 1997; Fox et al., 2000; Gerkey & Matarić, 2004; Olfati-Saber et al., 2007; Cao et al., 2013).

In this paper, we propose deep distributed recurrent Q-networks (DDRQN) to enable teams of agents to learn effectively coordinated policies on such challenging problems. We show that a naive approach of simply training independent DQN agents with long short-term memory (LSTM) networks (Hochreiter & Schmidhuber, 1997) is inadequate for multi-agent partially observable problems.

Therefore, we introduce three modifications that are key to DDRQN's success: a) last-action inputs: supplying each agent with its previous action as input on the next time step, so that agents can approximate their action-observation histories; b) inter-agent weight sharing: a single network's
weights are used by all agents, but the network conditions on the agent's unique ID, to enable fast learning while also allowing for diverse behaviour; and c) disabling experience replay, which is poorly suited to the non-stationarity arising from multiple agents learning simultaneously.

To evaluate DDRQN, we propose two multi-agent reinforcement learning problems that are based on well-known riddles: the hats riddle, where n prisoners in a line must determine their own hat colours; and the switch riddle, in which n prisoners must determine when they have all visited a room containing a single switch. Both riddles have been used as interview questions at companies like Google and Goldman Sachs.

While these environments do not require convolutional networks for perception, the presence of partial observability means that they do require recurrent networks to deal with complex sequences, as in some single-agent (Hausknecht & Stone, 2015; Ba et al., 2015) and language-based (Narasimhan et al., 2015) tasks. In addition, because partial observability is coupled with multiple agents, optimal policies critically rely on communication between agents. Since no communication protocol is given a priori, reinforcement learning must automatically develop a coordinated communication protocol.

Our results demonstrate that DDRQN can successfully solve these tasks, outperforming baseline methods, and discovering elegant communication protocols along the way. To our knowledge, this is the first time deep reinforcement learning has succeeded in learning communication protocols. In addition, we present ablation experiments that confirm that each of the main components of the DDRQN architecture is critical to its success.

2. Background

In this section, we briefly introduce DQN and its multi-agent and recurrent extensions.

2.1. Deep Q-Networks

In a single-agent, fully-observable reinforcement learning setting (Sutton & Barto, 1998), an agent observes its current state s_t ∈ S at each discrete time step t, chooses an action a_t ∈ A according to a potentially stochastic policy π, observes a reward signal r_t, and transitions to a new state s_{t+1}. Its objective is to maximise an expectation over the discounted return, R_t:

    R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯,    (1)

where γ ∈ [0, 1) is a discount factor. The Q-function of a policy π is:

    Q^π(s, a) = E[R_t | s_t = s, a_t = a].    (2)

The optimal action-value function Q*(s, a) = max_π Q^π(s, a) obeys the Bellman optimality equation:

    Q*(s, a) = E_{s′}[ r + γ max_{a′} Q*(s′, a′) | s, a ].    (3)

Deep Q-networks (DQNs) (Mnih et al., 2015) use neural networks parameterised by θ to represent Q(s, a; θ). DQNs are optimised by minimising the following loss function at each iteration i:

    L_i(θ_i) = E_{s,a,r,s′}[ ( y_i^DQN − Q(s, a; θ_i) )² ],    (4)

with target

    y_i^DQN = r + γ max_{a′} Q(s′, a′; θ_i^−).    (5)

Here, θ_i^− are the weights of a target network that is frozen for a number of iterations while updating the online network Q(s, a; θ_i) by gradient descent. DQN uses experience replay (Lin, 1993; Mnih et al., 2015): during learning, the agent builds a dataset D_t = {e_1, e_2, ..., e_t} of experiences e_t = (s_t, a_t, r_t, s_{t+1}) across episodes. The Q-network is then trained by sampling mini-batches of experiences from D uniformly at random. Experience replay helps prevent divergence by breaking correlations among the samples. It also enables reuse of past experiences for learning, thereby reducing sample costs.

2.2. Independent DQN

DQN has been extended to cooperative multi-agent settings, in which each agent m observes the global state s_t, selects an individual action a_t^m, and receives a team reward, r_t, shared among all agents. Tampuu et al. (2015) address this setting with a framework that combines DQN with independent Q-learning, applied to two-player pong, in which all agents independently and simultaneously learn their own Q-functions Q^m(s, a^m; θ_i^m). While independent Q-learning can in principle lead to convergence problems (since one agent's learning makes the environment appear non-stationary to other agents), it has a strong empirical track record (Shoham et al., 2007; Shoham & Leyton-Brown, 2009; Zawadzki et al., 2014).

2.3. Deep Recurrent Q-Networks

Both DQN and independent DQN assume full observability, i.e., the agent receives s_t as input. By contrast, in partially observable environments, s_t is hidden and the agent receives only an observation o_t that is correlated with s_t but in general does not disambiguate it.

Hausknecht & Stone (2015) propose the deep recurrent Q-network (DRQN) architecture to address single-agent, partially observable settings. Instead of approximating Q(s, a) with a feed-forward network, they approximate Q(o, a) with a recurrent neural network that can maintain an internal state and aggregate observations over time.
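The mechanics shared by DQN and DRQN are easy to miss in prose: the bootstrap target of Eqs. (4)-(5) is computed under the frozen parameters θ⁻, while only the online estimate is regressed toward it. A minimal tabular sketch in plain Python (the toy states, `gamma`, and function names are illustrative assumptions, not the paper's setup):

```python
# Minimal sketch of the DQN loss (Eqs. 4-5): the target y uses the
# frozen target parameters theta^-; only the online estimate changes.

def dqn_targets(batch, q_target, gamma=0.99):
    """y_i = r + gamma * max_a' Q(s', a'; theta^-) per transition.

    `batch` is a list of (s, a, r, s_next, done); `q_target` maps a
    state to a dict of action-values under the frozen parameters.
    """
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(s_next).values())
        targets.append(r + bootstrap)
    return targets

def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared error between targets and online estimates (Eq. 4)."""
    ys = dqn_targets(batch, q_target, gamma)
    errs = [(y - q_online(s)[a]) ** 2
            for y, (s, a, r, s_next, done) in zip(ys, batch)]
    return sum(errs) / len(errs)
```

Note that the gradient of this loss is taken with respect to the online parameters only; the target network is updated separately, which is what keeps the regression target stable between updates.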
Lastly, we conduct two weight updates: first θ_i, in the direction of the accumulated gradients ∇θ, and then the target network, θ_i^−, in the direction of θ_i.

4. Riddles

In this section, we describe the riddles on which we evaluate DDRQN.

4.1. Hats Riddle

The hats riddle can be described as follows: An executioner lines up 100 prisoners single file and puts a red or a blue hat on each prisoner's head. Every prisoner can see the hats of the people in front of him in the line, but not his own hat, nor those of anyone behind him. The executioner starts at the end (back) and asks the last prisoner the colour of his hat. He must answer "red" or "blue". If he answers correctly, he is allowed to live. If he gives the wrong answer, he is killed instantly and silently. (While everyone hears the answer, no one knows whether an answer was right.) On the night before the line-up, the prisoners confer on a strategy to help them. What should they do? (Poundstone, 2012). Figure 1 illustrates this setup.

Figure 1. Hats: Each prisoner can hear the answers from all preceding prisoners (to the left) and see the colour of the hats in front of him (to the right) but must guess his own hat colour.

An optimal strategy is for all prisoners to agree on a communication protocol in which the first prisoner says "blue" if the number of blue hats is even and "red" otherwise (or vice versa). All remaining prisoners can then deduce their hat colour given the hats they see in front of them and the responses they have heard behind them. Thus, everyone except the first prisoner will definitely answer correctly.

To formalise the hats riddle as a multi-agent RL task, we define a state space s = (s¹, ..., sⁿ, a¹, ..., aⁿ), where n is the total number of agents, s^m ∈ {blue, red} is the m-th agent's hat colour and a^m ∈ {blue, red} is the action it took on the m-th time step. At all other time steps, agent m can only take a null action. On the m-th time step, agent m's observation is o^m = (a¹, ..., a^{m−1}, s^{m+1}, ..., sⁿ). Reward is zero except at the end of the episode, when it is the total number of agents with the correct action: r_n = Σ_m I(a^m = s^m). We label only the relevant observation o^m and action a^m of agent m, omitting the time index.

Although this riddle is a single-action and single-observation problem, it is still partially observable, given that none of the agents can observe the colour of their own hat.

4.2. Switch Riddle

The switch riddle can be described as follows: One hundred prisoners have been newly ushered into prison. The warden tells them that starting tomorrow, each of them will be placed in an isolated cell, unable to communicate amongst each other. Each day, the warden will choose one of the prisoners uniformly at random with replacement, and place him in a central interrogation room containing only a light bulb with a toggle switch. The prisoner will be able to observe the current state of the light bulb. If he wishes, he can toggle the light bulb. He also has the option of announcing that he believes all prisoners have visited the interrogation room at some point in time. If this announcement is true, then all prisoners are set free, but if it is false, all prisoners are executed. The warden leaves and the prisoners huddle together to discuss their fate. Can they agree on a protocol that will guarantee their freedom? (Wu, 2002).

Figure 2. Switch: Every day one prisoner gets sent to the interrogation room, where he can see the switch and choose between actions On, Off, Tell and None.

A number of strategies (Song, 2012; Wu, 2002) have been analysed for the infinite time-horizon version of this problem, in which the goal is to guarantee survival. One well-known strategy is for one prisoner to be designated the counter. Only he is allowed to turn the switch off, while each other prisoner turns it on only once. Thus, when the counter has turned the switch off n − 1 times, he can Tell.

To formalise the switch riddle, we define a state space s = (SW_t, IR_t, s¹, ..., sⁿ), where SW_t ∈ {on, off} is the position of the switch, IR_t ∈ {1, ..., n} is the current visitor in the interrogation room, and s¹, ..., sⁿ ∈ {0, 1} track which agents have already been to the interrogation room. At time step t, agent m observes o_t^m = (ir_t, sw_t), where ir_t = I(IR_t = m), and sw_t = SW_t if the agent is in the interrogation room and null otherwise. If agent m is in the interrogation room, then its actions are a_t^m ∈ {On, Off, Tell, None}; otherwise the only action is None. The episode ends when an agent chooses Tell or when the maximum time step is reached. The reward r_t is 0 except when an agent chooses Tell, in which case it is 1 if all agents have been to the interrogation room and −1 otherwise.
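The formalisation above maps directly onto a small simulator. The following sketch is our own illustrative rendering (the class and method names are ours, not the paper's): it implements the state (SW_t, IR_t, s¹, ..., sⁿ), the per-agent observation (ir_t, sw_t), and the ±1 reward on Tell.

```python
import random

# Illustrative sketch of the switch-riddle environment as formalised
# above (names ours). Reward is +1/-1 only when an agent says Tell.

class SwitchRiddle:
    def __init__(self, n, max_steps, seed=0):
        self.n, self.max_steps = n, max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.switch = 'off'
        self.visited = [False] * self.n   # s^1 ... s^n
        self.t = 0
        self.in_room = self.rng.randrange(self.n)  # uniform, with replacement
        self.visited[self.in_room] = True
        return self._obs()

    def _obs(self):
        # Agent m observes (ir_t, sw_t); sw_t is null outside the room.
        return [(int(m == self.in_room),
                 self.switch if m == self.in_room else None)
                for m in range(self.n)]

    def step(self, action):
        """`action` is the room occupant's choice; all others act None."""
        if action == 'Tell':
            return None, (1 if all(self.visited) else -1), True
        if action in ('On', 'Off'):
            self.switch = action.lower()
        self.t += 1
        if self.t >= self.max_steps:
            return None, 0, True
        self.in_room = self.rng.randrange(self.n)
        self.visited[self.in_room] = True
        return self._obs(), 0, False
```

A correct protocol in this environment is exactly one that only ever says Tell once all entries of `visited` are true.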
5. Experiments

5.1. Hats Riddle

Figure 3. Hats: Each agent m observes the answers, a^k, k < m, from all preceding agents and the hat colours, s^k, in front of him, k > m. Both variable-length sequences are processed through RNNs. First, the answers heard are passed through two single-layer MLPs, z_a^k = MLP(a^k) + MLP(m, n), and their outputs are added element-wise. z_a^k is passed through an LSTM network, y_a^k, h_a^k = LSTM_a(z_a^k, h_a^{k−1}). Similarly, for the observed hats we define y_s^k, h_s^k = LSTM_s(z_s^k, h_s^{k−1}). The last values of the two LSTMs, y_a^{m−1} and y_s^n, are used to approximate Q^m = MLP(y_a^{m−1} || y_s^n), from which the action a^m is chosen.

As shown in Figure 3, the answers heard are passed through two single-layer MLPs, z_a^k = MLP(a^k) + MLP(m, n), and their outputs are added element-wise. Subsequently, z_a^k is passed through an LSTM network, y_a^k, h_a^k = LSTM_a[64](z_a^k, h_a^{k−1}). We follow a similar procedure for the n − m hats observed, defining y_s^k, h_s^k = LSTM_s[64](z_s^k, h_s^{k−1}). Finally, the last values of the two LSTM networks, y_a^{m−1} and y_s^n, are used to approximate the Q-values for each action: Q^m = MLP[128 × 64, 64 × 64, 64 × 1](y_a^{m−1} || y_s^n). The network is trained with mini-batches of 20 episodes.

Furthermore, we use an adaptive variant of curriculum learning (Bengio et al., 2009) to pave the way for scalable strategies and better training performance. We sample examples from a multinomial distribution of curricula, each corresponding to a different n, where the current bound is raised every time performance becomes near-optimal. The probability of sampling a given n is inversely proportional to the performance gap compared to the normalised maximum reward. The performance is depicted in Figure 5.

We first evaluate DDRQN for n = 10 and compare it with tabular Q-learning. Tabular Q-learning is feasible only with few agents, since the state space grows exponentially with n. In addition, separate tables for each agent preclude generalising across agents.

Figure 4 shows the results, in which DDRQN substantially outperforms tabular Q-learning. In addition, DDRQN also comes close in performance to the optimal strategy described in Section 4.1. This figure also shows the results of an ablation experiment in which inter-agent weight sharing has been removed from DDRQN. The results confirm that inter-agent weight sharing is key to performance.

Since each agent takes only one action in the hats riddle, it is essentially a single-step problem. Therefore, last-action inputs and disabling experience replay do not play a role and do not need to be ablated. We consider these components in the switch riddle in Section 5.2.
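The optimal strategy from Section 4.1, which serves as the reference in the comparisons that follow, can be written out concretely. A short sketch (the function name and indexing convention are ours): prisoner 0 is the one at the back who answers first and sees all hats ahead of him.

```python
# Sketch of the optimal parity protocol from Section 4.1 (names ours):
# the first prisoner announces the parity of the blue hats he sees;
# every later prisoner deduces his own colour from that announcement,
# the hats still visible ahead, and the answers heard so far.

def parity_protocol(hats):
    """hats[m] is the m-th prisoner's colour ('blue' or 'red'), with
    m = 0 answering first and seeing hats[m+1:]; returns all answers."""
    answers = []
    for m in range(len(hats)):
        seen_blue = sum(h == 'blue' for h in hats[m + 1:])
        if m == 0:
            # 'blue' iff the number of blue hats ahead is even.
            answers.append('blue' if seen_blue % 2 == 0 else 'red')
        else:
            # Answers 1..m-1 equal those prisoners' true colours.
            heard_blue = sum(a == 'blue' for a in answers[1:m])
            announced = 0 if answers[0] == 'blue' else 1  # parity of hats[1:]
            my_hat_blue = (seen_blue + heard_blue + announced) % 2 == 1
            answers.append('blue' if my_hat_blue else 'red')
    return answers
```

By induction, every prisoner after the first recovers his own colour exactly, which is why the protocol's expected reward is n − 1/2 (the first prisoner is right half the time).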
We compare the strategies DDRQN learns to the optimal strategy by computing the percentage of trials in which the first agent correctly encodes the parity of the observed hats in its answer. Table 1 shows that the encoding is almost perfect for n ∈ {3, 5, 8}. For n ∈ {12, 16, 20}, the agents do not encode parity but learn a different distributed solution that is nonetheless close to optimal. We believe that, qualitatively, this solution corresponds to more of the agents communicating information about other hats through their answers, instead of only the first agent.

Table 1. Percent agreement on the hats riddle between DDRQN and the optimal parity-encoding strategy.

N     % Agreement
3     100.0%
5     100.0%
8     79.6%
12    52.6%
16    50.8%
20    52.5%

Figure 5. Hats: Using curriculum learning, DDRQN achieves good performance for n = 3...20 agents, compared to the optimal strategy.

5.2. Switch Riddle

Figure 6 illustrates the model architecture used in the switch riddle. Each agent m is modelled as a recurrent neural network with LSTM cells that is unrolled for D_max time steps, where d denotes the number of days of the episode. In our experiments, we limit d to D_max = 4n − 6 in order to keep the experiments computationally tractable.

Figure 6. Switch: Agent m receives as input: the switch state sw_t, whether he is in the room, ir_t^m, his last action, a_{t−1}^m, his ID, m, and the number of agents, n. At each step, the inputs are processed through a 2-layer MLP z_t^m = MLP(sw_t, ir_t^m, OneHot(a_{t−1}^m), OneHot(m), n). Their embedding z_t^m is then passed to an LSTM network, y_t^m, h_t^m = LSTM(z_t^m, h_{t−1}^m), which is used to approximate the agent's action-observation history. Finally, the output y_t^m of the LSTM is used at each step to compute Q_t^m = MLP(y_t^m).

The inputs, o_t^m, a_{t−1}^m, m and n, are processed through a 2-layer MLP, z_t^m = MLP[(7 + n) × 128, 128 × 128](o_t^m, OneHot(a_{t−1}^m), OneHot(m), n). Their embedding z_t^m is then passed to an LSTM network, y_t^m, h_t^m = LSTM[128](z_t^m, h_{t−1}^m), which is used to approximate the agent's action-observation history. Finally, the output y_t^m of the LSTM is used at each step to approximate the Q-values of each action using a 2-layer MLP, Q_t^m = MLP[128 × 128, 128 × 128, 128 × 4](y_t^m). As in the hats riddle, curriculum learning was used for training.

Figure 7 shows results for n = 3: DDRQN learns an optimal policy, beating the naive method and the hand-coded strategy "tell on last day". This verifies that the three modifications of DDRQN substantially improve performance on this task. In the following paragraphs we analyse the importance of the individual modifications.

We analysed the strategy that DDRQN discovered for n = 3 by looking at 1000 sampled episodes. Figure 8 shows a decision tree, constructed from those samples, that corresponds to an optimal strategy allowing the agents to collectively track the number of visitors to the interrogation room. When a prisoner visits the interrogation room after day two, there are only two options: either one or two prisoners may have visited the room before. If three prisoners had been there, the third prisoner would have already finished the game. The two remaining options can be encoded via the on and off positions, respectively. In order to carry out this strategy, each prisoner has to learn to keep track of whether he has visited the room before and what day it currently is.

Figure 9 compares the performance of DDRQN to a variant in which the switch has been disabled. After around 3,500 episodes, the two diverge in performance. Hence, there is a clearly identifiable point during learning when the prisoners start to discover strategies that use the switch to communicate.
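For context on what the agents must rediscover, the designated-counter strategy from Section 4 can be simulated directly. The harness below is our own sketch (the function name, the choice of prisoner 0 as counter, and the episode cap are ours):

```python
import random

# Sketch of the designated-counter strategy (Section 4): prisoner 0 is
# the counter and is the only one allowed to turn the switch off; every
# other prisoner turns it on exactly once. After n-1 offs, he can Tell.

def run_counter_strategy(n, seed=0):
    rng = random.Random(seed)
    switch = 'off'
    turned_on = [False] * n   # has prisoner m already turned it on?
    count = 0                 # offs performed by the counter
    visited = [False] * n
    for day in range(100000):  # cap is ours, for safety only
        m = rng.randrange(n)   # warden's uniform choice, with replacement
        visited[m] = True
        if m == 0:             # the counter
            if switch == 'on':
                switch = 'off'
                count += 1
            if count == n - 1:
                return all(visited)  # Tell: True iff announcement is true
        elif switch == 'off' and not turned_on[m]:
            switch = 'on'
            turned_on[m] = True
    return False
```

Since the counter only counts each other prisoner once, his announcement is guaranteed correct when it happens, which is why this strategy guarantees survival in the infinite-horizon setting, at the cost of long episodes.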
Figure 7. Switch: For n = 3, DDRQN outperforms the naive method and a simple hand-coded strategy, "tell on last day", achieving Oracle-level performance. The lines depict the average of 10 runs and the 95% confidence interval.

Figure 9. Switch: At 3.5k episodes, the DDRQN line clearly separates from the performance line for the no-switch test and starts exceeding "tell on last day". At this point the agents start to discover strategies that evolve communication using the switch. The lines depict the mean of 10 runs and the 95% confidence interval.

Figure 8. Switch: Decision tree of the strategy DDRQN discovered for n = 3, branching on the day, on whether the prisoner has been in the interrogation room before, and on the switch position, to choose among On, Off, None and Tell.
References

Ba, J., Mnih, V., and Kavukcuoglu, K. Multiple object recognition with visual attention. In ICLR, 2015.

Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. Increasing the action gap: New operators for reinforcement learning. In AAAI, 2016.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In ICML, pp. 41-48, 2009.

Cao, Y., Yu, W., Ren, W., and Chen, G. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427-438, 2013.

Fox, D., Burgard, W., Kruppa, H., and Thrun, S. Probabilistic approach to collaborative multi-robot localization. Autonomous Robots, 8(3):325-344, 2000.

Gerkey, B. P. and Matarić, M. J. A formal analysis and taxonomy of task allocation in multi-robot systems. International Journal of Robotics Research, 23(9):939-954, 2004.

Giles, C. L. and Jim, K. C. Learning communication for multi-agent systems. In Innovative Concepts for Agent-Based Systems, pp. 377-390. Springer, 2002.

Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, pp. 3338-3346, 2014.

Hausknecht, M. and Stone, P. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Kasai, T., Tenmoto, H., and Kamiya, A. Learning of communication codes in multi-agent reinforcement learning problem. In IEEE Conference on Soft Computing in Industrial Applications, pp. 1-6, 2008.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lin, L. J. Reinforcement learning for robots using neural networks. PhD thesis, School of Computer Science, Carnegie Mellon University, 1993.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), pp. 157-163, 1994.

Maddison, C. J., Huang, A., Sutskever, I., and Silver, D. Move evaluation in Go using deep convolutional neural networks. In ICLR, 2015.

Maravall, D., De Lope, J., and Domínguez, R. Coordination of communication in robot teams by reinforcement learning. Robotics and Autonomous Systems, 61(7):661-666, 2013.

Matarić, M. J. Reinforcement learning in the multi-robot domain. Autonomous Robots, 4(1):73-83, 1997.

Melo, F. S., Spaan, M., and Witwicki, S. J. QueryPOMDP: POMDP-based communication in multiagent systems. In Multi-Agent Systems, pp. 189-204, 2011.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

Narasimhan, K., Kulkarni, T., and Barzilay, R. Language understanding for text-based games using deep reinforcement learning. In EMNLP, 2015.

Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action-conditional video prediction using deep networks in Atari games. In NIPS, pp. 2845-2853, 2015.

Olfati-Saber, R., Fax, J. A., and Murray, R. M. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215-233, 2007.

Panait, L. and Luke, S. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387-434, 2005.

Poundstone, W. Are You Smart Enough to Work at Google?: Fiendish Puzzles and Impossible Interview Questions from the World's Top Companies. Oneworld Publications, 2012.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In ICLR, 2016.

Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, New York, 2009.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Wang, Z., de Freitas, N., and Lanctot, M. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. A. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.

Wu, W. 100 prisoners and a lightbulb. Technical report, OCF, UC Berkeley, 2002.

Zawadzki, E., Lipson, A., and Leyton-Brown, K. Empirically evaluating multiagent learning algorithms. arXiv preprint arXiv:1401.8074, 2014.

Zhang, C. and Lesser, V. Coordinating multi-agent reinforcement learning with limited communication. Volume 2, pp. 1101-1108, 2013.