
Introduction to Game Theory

7. Repeated Games
Dana Nau
University of Maryland



Repeated Games
  Used by game theorists, economists, social and behavioral scientists
as highly simplified models of various real-world situations

  Examples (the slide shows each stage game’s payoff matrix): Iterated Battle of
the Sexes, Iterated Prisoner’s Dilemma, Roshambo, Repeated Ultimatum Game,
Iterated Chicken Game, Repeated Stag Hunt, Repeated Matching Pennies
Finitely Repeated Games
  In repeated games, some game G is played multiple times by the same set of agents
  G is called the stage game
•  Usually (but not always), G is a normal-form game
  Each occurrence of G is called an iteration or a round
  Usually each agent knows what all the agents did in the previous iterations,
but not what they’re doing in the current iteration
  Thus, an imperfect-information game with perfect recall
  Usually each agent’s payoff function is additive

    Prisoner’s Dilemma:
                 C        D
        C       3, 3     0, 5
        D       5, 0     1, 1

    Iterated Prisoner’s Dilemma, with 2 iterations:
                      Agent 1:    Agent 2:
        Round 1:         C           C
        Round 2:         D           C
        Total payoff:  3+5 = 8     3+0 = 3
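  A minimal sketch in Python of how the additive payoffs above accumulate (the payoff
table is the one from the slide; the function and variable names are just illustrative):

    # (agent 1's move, agent 2's move) -> (agent 1's payoff, agent 2's payoff)
    PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
               ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

    def total_payoffs(history):
        """history = list of (agent 1's move, agent 2's move) pairs, one per round."""
        p1 = sum(PAYOFFS[moves][0] for moves in history)
        p2 = sum(PAYOFFS[moves][1] for moves in history)
        return p1, p2

    print(total_payoffs([("C", "C"), ("D", "C")]))   # -> (8, 3), as in the example above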


Strategies
  The repeated game has a much bigger strategy space than the stage game
  One kind of strategy is a stationary strategy:
  Use the same strategy at every iteration
  More generally, an agent’s play at each stage may depend on the history
  What happened in previous iterations
  (The slide shows the game tree for the Iterated Prisoner’s Dilemma with 2 iterations)



Backward Induction
  If the number of iterations is finite and known, we can use backward
induction to get a subgame-perfect equilibrium
  Example: finitely many repetitions of the Prisoner’s Dilemma
  In the last round, the dominant strategy is D
  That’s common knowledge
  So in the 2nd-to-last round, D also is the dominant strategy
  …
  The SPE is (D,D) on every round

                 Agent 1:   Agent 2:
    Round 1:        D          D
    Round 2:        D          D
    Round 3:        D          D
    Round 4:        D          D

  As with the Centipede game, this argument is vulnerable to
both empirical and theoretical criticisms


Backward Induction when G is 0-sum
  As before, backward induction works much better in zero-sum games
  In the last round, equilibrium is the minimax profile
•  Each agent uses his/her minimax strategy
  That’s common knowledge
  So in the 2nd-to-last round, the equilibrium
again is the minimax strategy profile

  The SPE: each agent plays his/her minimax strategy on every round



Infinitely Repeated Games
  An infinitely repeated game in extensive form would be an infinite tree
  Payoffs can’t be attached to any terminal nodes
  Payoffs can’t be the sums of the payoffs in the stage games (generally infinite)
  Two common ways around this problem
  Let ri(1), ri(2), … be an infinite sequence of payoffs for agent i
  Agent i’s average reward is
        lim k→∞ (1/k) ∑ j=1..k ri( j)
  Agent i’s future discounted reward is the discounted sum of the payoffs, i.e.,
        ∑ j=1..∞ β^j ri( j)
    where β (with 0 ≤ β ≤ 1) is a constant called the discount factor

  Two ways to interpret the discount factor:
1.  The agent cares more about the present than the future
2.  The agent cares about the future, but the game ends at any round with
probability 1 − β
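  A small Python sketch of the two criteria, evaluated on a finite prefix of the payoff
sequence (the sample payoff stream and the function names are illustrative, not from the slides):

    def average_reward(payoffs):
        """Approximates lim_{k->inf} (1/k) * sum_{j=1..k} r(j) on the given prefix."""
        return sum(payoffs) / len(payoffs)

    def discounted_reward(payoffs, beta):
        """sum_{j>=1} beta**j * r(j), truncated to the given prefix (0 <= beta < 1)."""
        return sum(beta ** j * r for j, r in enumerate(payoffs, start=1))

    stream = [3, 3, 3, 5, 1, 1]               # a hypothetical per-round payoff sequence
    print(average_reward(stream))             # 2.666...
    print(discounted_reward(stream, 0.9))     # later payoffs count for less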



Example
  Some well-known strategies for the Iterated Prisoner’s Dilemma:
»  AllC: always cooperate
»  AllD (the Hawk strategy): always defect
»  Grim: cooperate until the other agent defects, then defect forever
»  Tit-for-Tat (TFT): cooperate on the first move. On the nth move, repeat
the other agent’s (n–1)th move
»  Tester: defect on move 1. If the other agent retaliates, play TFT.
Otherwise, randomly intersperse cooperation and defection

  Example plays (from the slide):
    AllC, Grim, or TFT  vs  AllC, Grim, or TFT:   C C, C C, C C, C C, …
    Grim vs AllD:                                 C D, D D, D D, D D, …
    TFT vs Tester:                                C D, D C, C C, C C, …

  If the discount factor is large enough, each of the following is a Nash equilibrium:
  (TFT, TFT), (TFT, GRIM), and (GRIM, GRIM)
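  The strategies above are simple enough to state as code. A minimal Python sketch (not
from the slides) follows; each strategy maps the history of (my move, their move) pairs to
the next move, and Tester’s behavior after a retaliation (apologize, then play TFT) is an
assumed reading of the description above:

    import random

    def all_c(history): return "C"                      # AllC: always cooperate
    def all_d(history): return "D"                      # AllD: always defect

    def grim(history):                                  # defect forever after any defection
        return "D" if any(their == "D" for _, their in history) else "C"

    def tft(history):                                   # Tit-for-Tat
        return "C" if not history else history[-1][1]   # repeat their previous move

    def tester(history):                                # one hedged reading of Tester
        if not history:
            return "D"                                  # probe with a defection on move 1
        if len(history) == 1:
            return "C"                                  # wait to see whether they retaliate
        if history[1][1] == "D":                        # they retaliated against the probe,
            return "C" if len(history) == 2 else history[-1][1]   # so apologize, then play TFT
        return random.choice("CD")                      # no retaliation: mix C and D at random

    def play(s1, s2, rounds=6):
        h1, h2 = [], []
        for _ in range(rounds):
            m1, m2 = s1(h1), s2(h2)
            h1.append((m1, m2))
            h2.append((m2, m1))
        return h1

    print(play(grim, all_d))    # (C,D), (D,D), (D,D), ... as in the Grim vs AllD example
    print(play(tft, tester))    # (C,D), (D,C), then mutual cooperation, as in TFT vs Tester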



Equilibrium Payoffs for Repeated Games
  There’s a “folk theorem” that tells what the possible equilibrium payoffs
are in repeated games
  It says roughly the following:
  In an infinitely repeated game whose stage game is G, there is a
Nash equilibrium whose average payoffs are (p1, p2, …, pn)
if and only if
  G has a mixed-strategy profile (s1, s2, …, sn) whose payoffs are
(p1, p2, …, pn), with the following property:
•  For each i, pi ≥ the payoff agent i would get if the other agents
used their minimax strategies against i



Proof and Examples
  The proof proceeds in 2 parts:
  Use the definitions of minimax and best-response to show that in every
equilibrium, an agent’s average payoff ≥ the agent’s minimax value
  Show how to construct an equilibrium that gives each agent i the average
payoff pi, given certain constraints on (p1, p2, …, pn)
•  In this equilibrium, the agents cycle in lock-step through a sequence of game
outcomes that achieve (p1, p2, …, pn)
•  If any agent i deviates, then the others punish i forever, by playing their
minimax strategies against i

  Example 1: IPD with (p1, p2) = (3, 3)
    Grim vs. another agent: (C,C), (C,C), (C,C), (C,C), … until the other agent
    defects; after that, Grim punishes by defecting forever
  Example 2: IPD with (p1, p2) = (2.5, 2.5)
    Agents 1 and 2 cycle through (D,C), (C,D), (D,C), (C,D), …, so each averages
    (5 + 0)/2 = 2.5; when agent 2 deviates, agent 1 punishes by defecting forever

  There’s a large family of such theorems, for various conditions on the game


Zero-Sum Repeated Games
  For two-player zero-sum repeated games, the folk theorem is still true, but it
becomes vacuous

  Suppose we iterate a two-player zero-sum game G


  Let V be the value of G (from the Minimax Theorem)
  If agent 2 uses a minimax strategy against 1, then 1’s maximum payoff is V
•  Thus max value for p1 is V, so min value for p2 is –V
  If agent 1 uses a minimax strategy against 2, then 2’s maximum payoff is –V
•  Thus max value for p2 is –V, so min value for p1 is V

  Thus in the iterated game, the only Nash-equilibrium payoff profile is (V,–V)
  The only way to get this is if each agent always plays his/her minimax strategy
•  If agent 1 plays a non-minimax strategy s1 and agent 2 plays his/her best
response, 2’s expected payoff will be higher than –V



Roshambo (Rock, Paper, Scissors)
                 Rock       Paper      Scissors
    Rock         0, 0       –1, 1       1, –1
    Paper        1, –1       0, 0      –1, 1
    Scissors    –1, 1        1, –1      0, 0
    (row player’s payoff listed first)

  Nash equilibrium for the stage game:


  choose randomly, P=1/3 for each move
  Nash equilibrium for the repeated game:
  always choose randomly, P=1/3 for each move
  Expected payoff = 0

  Let’s see how that works out in practice …
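  As a quick sanity check of that prediction, here is a small Python sketch in which both
agents play the uniformly random equilibrium strategy (the payoff dictionary encodes the
matrix above; everything else is illustrative):

    import random

    # Row player's payoff in Rock-Paper-Scissors; the game is zero-sum.
    PAYOFF = {("R", "R"): 0,  ("R", "P"): -1, ("R", "S"): 1,
              ("P", "R"): 1,  ("P", "P"): 0,  ("P", "S"): -1,
              ("S", "R"): -1, ("S", "P"): 1,  ("S", "S"): 0}

    rounds = 100_000
    total = sum(PAYOFF[(random.choice("RPS"), random.choice("RPS"))] for _ in range(rounds))
    print(total / rounds)    # close to 0, the equilibrium expected payoff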



Roshambo (Rock, Paper, Scissors)
  1999 international roshambo programming competition
www.cs.ualberta.ca/~darse/rsbpc1.html
  Round-robin tournament:
•  55 programs, 1000 iterations
for each pair of programs
•  Lowest possible score = –55000,
highest possible score = 55000
  Average over 25 tournaments:
•  Highest score (Iocaine Powder): 13038
•  Lowest score (Cheesebot): –36006
  Very different from the game-theoretic prediction
  A Nash equilibrium strategy is best for you
if the other agents also use their Nash equilibrium strategies

  In many cases, the other agents won’t use Nash equilibrium strategies
  If you can forecast their actions accurately, you may be able to do
much better than the Nash equilibrium strategy

  Why won’t the other agents use their Nash equilibrium strategies?
  Because they may be trying to forecast your actions too

  Something analogous can happen in non-zero-sum games



Iterated Prisoner’s Dilemma
  Multiple iterations of the Prisoner’s Dilemma:

                   Cooperate    Defect
    Cooperate        3, 3        0, 5
    Defect           5, 0        1, 1     (Defect, Defect) is the Nash equilibrium

  Widely used to study the emergence of cooperative behavior among agents
  e.g., Axelrod (1984), The Evolution of Cooperation
  Axelrod ran a famous set of tournaments
  People contributed strategies encoded as computer programs
  Axelrod played them against each other
  (Thought bubble on the slide: “If I defect now, he might punish me by
defecting next time”)


TFT with Other Agents
  In Axelrod’s tournaments, TFT usually did best
»  It could establish and maintain cooperation with many other agents
»  It could prevent malicious agents from taking advantage of it

    TFT vs AllC:    C C, C C, C C, C C, …
    TFT vs AllD:    C D, D D, D D, D D, …
    TFT vs Grim:    C C, C C, C C, C C, …
    TFT vs TFT:     C C, C C, C C, C C, …
    TFT vs Tester:  C D, D C, C C, C C, …



Example:
  A real-world example of the IPD, described in Axelrod’s book:
  World War I trench warfare

  Incentive to cooperate:
  If I attack the other side, then they’ll retaliate and I’ll get hurt
  If I don’t attack, maybe they won’t either
  Result: evolution of cooperation
  Although the two infantries were supposed to be enemies, they
avoided attacking each other
IPD with Noise

  In noisy environments,
  There’s a nonzero probability (e.g., 10%) that a “noise gremlin” will change
some of the actions
•  Cooperate (C) becomes Defect (D), and vice versa
  Can use this to model accidents
  Compute the score using the changed action
  Can also model misinterpretations
  Compute the score using the original action
  (Illustration on the slide: noise turns one agent’s C into a D, and the other
agent wonders “Did he really intend to do that?”)
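  A minimal Python sketch of the noise model (the 10% rate and the function name are just
for illustration):

    import random

    def add_noise(intended, p_flip=0.1):
        """With probability p_flip, the 'noise gremlin' flips the intended action."""
        if random.random() < p_flip:
            return "D" if intended == "C" else "C"
        return intended

    # Accidents: score the changed action.  Misinterpretations: score the original
    # action, but the other agent observes the changed one.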
Example of Noise

  Story from a British army officer in World War I:


  I was having tea with A Company when we heard a lot of shouting and went
out to investigate. We found our men and the Germans standing on their
respective parapets. Suddenly a salvo arrived but did no damage.
Naturally both sides got down and our men started swearing at the Germans,
when all at once a brave German got onto his parapet and shouted out:
“We are very sorry about that; we hope no one was hurt. It is not our
fault. It is that damned Prussian artillery.”
  The salvo wasn’t the German infantry’s intention
  They neither expected it nor desired it



Noise Makes it Difficult
to Maintain Cooperation

  Consider two agents who both use TFT
  One accident or misinterpretation can cause a long string of retaliations:

    C C
    C C
    C C   (noise turns one agent’s C into a D)
    C D   (retaliation)
    D C   (retaliation)
    C D   (retaliation)
    D C   (retaliation)
    …


Some Strategies for the Noisy IPD
  Principle: be more forgiving in the face of defections

  Tit-For-Two-Tats (TFTT)
»  Retaliate only if the other agent defects twice in a row
•  Can tolerate isolated instances of defections, but susceptible to exploitation
of its generosity
•  Beaten by the TESTER strategy I described earlier
  Generous Tit-For-Tat (GTFT)
»  Forgive randomly: small probability of cooperation if the other agent defects
»  Better than TFTT at avoiding exploitation, but worse at maintaining cooperation
  Pavlov
»  Win-Stay, Lose-Shift
•  Repeat previous move if I earn 3 or 5 points in the previous iteration
•  Reverse previous move if I earn 0 or 1 points in the previous iteration
»  Thus if the other agent defects continuously, Pavlov will alternately cooperate
and defect (see the sketch below)
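  Minimal Python sketches of the three strategies above (history = list of (my move,
their move) pairs; the GTFT forgiveness probability is an assumed value, not from the slides):

    import random

    def tftt(history):
        """Tit-For-Two-Tats: defect only if the other agent defected twice in a row."""
        return "D" if len(history) >= 2 and history[-1][1] == "D" and history[-2][1] == "D" else "C"

    def gtft(history, p_forgive=0.1):
        """Generous TFT: like TFT, but forgive a defection with small probability."""
        if history and history[-1][1] == "D":
            return "C" if random.random() < p_forgive else "D"
        return "C"

    def pavlov(history):
        """Win-Stay, Lose-Shift: repeat my last move after 3 or 5 points, reverse it after 0 or 1."""
        if not history:
            return "C"
        mine, theirs = history[-1]
        won = (theirs == "C")                 # (C,C) earns 3 points, (D,C) earns 5
        return mine if won else ("D" if mine == "C" else "C")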



Discussion
  The British army officer’s story:
  a German shouted, “We are very sorry about that; we hope no one was
hurt. It is not our fault. It is that damned Prussian artillery.”
  The apology avoided a conflict
  It was convincing because it was consistent with the German infantry’s
past behavior
  The British had ample evidence that the German infantry wanted to
keep the peace

  If you can tell which actions are affected by noise, you can avoid reacting
to the noise

  IPD agents often behave deterministically


  For others to cooperate with you it helps if you’re predictable
  This makes it feasible to build a model from observed behavior
The DBS Agent
  Work by my recent PhD graduate, Tsz-Chiu Au
  Now a postdoc at University of Texas

  From the other agent’s recent behavior, build a model π of the other
agent’s strategy
  Use the model to filter noise
  Use the model to help plan our next move

  References:
»  Au & Nau. Accident or intention: That is the question (in the iterated
prisoner’s dilemma). AAMAS, 2006.
»  Au & Nau. Is it accidental or intentional? A symbolic approach to the noisy
iterated prisoner’s dilemma. In G. Kendall (ed.), The Iterated Prisoners’
Dilemma: 20 Years On. World Scientific, 2007.



Modeling the other agent
  A set of rules of the following form
if our last move was m and their last move was m'
then P[their next move will be C]
  Four rules: one for each of (C,C), (C,D), (D,C), and (D,D)
  For example, TFT can be described as
  (C,C) → 1, (C,D) → 1, (D,C) → 0, (D,D) → 0

  How to get the probabilities?


  One way: look at the agent’s behavior in the recent past
  During the last k iterations,
  What fraction of the time did the other agent cooperate at iteration j
when the agents’ moves were (x,y) at iteration j–1?
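  A Python sketch of this estimation step (the window size k and the handling of unseen
states are assumptions, not details from the slides):

    def estimate_pi(history, k=20):
        """history = list of (my move, their move) pairs; returns the four rule probabilities."""
        recent = history[-k:]
        coop = {(x, y): 0 for x in "CD" for y in "CD"}    # times they cooperated after state (x, y)
        seen = {(x, y): 0 for x in "CD" for y in "CD"}    # times state (x, y) occurred
        for prev, cur in zip(recent, recent[1:]):
            seen[prev] += 1
            if cur[1] == "C":                             # did they cooperate on the next move?
                coop[prev] += 1
        return {s: (coop[s] / seen[s] if seen[s] else None) for s in seen}

    # Against TFT this converges to {(C,C): 1, (C,D): 1, (D,C): 0, (D,D): 0}, as above.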



Modeling the other agent
  π can only model a very small set of strategies
  It doesn’t even model the Grim strategy correctly:
  If Grim defects, it may be defecting because of something that
happened many moves ago

  But we’re not trying to model an agent’s entire strategy, just its recent
behavior
  If an agent’s behavior changes, then the probabilities in π will change
  e.g., after Grim defects a few times, the rules will give a very low
probability of it cooperating again



Noise Filtering
  Suppose the applicable rule is deterministic
  P[their next move will be C] = 0 or 1
  If the other agent’s next move isn’t what the rule predicts, then
  Assume the observed action is noise
  Behave as if the action were what the rule predicted
  (Illustration on the slide: “The other agent cooperates when I do”, so when
isolated defections appear, “I think these defections are actually noise” and
“I won’t retaliate here”)
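  A minimal Python sketch of this filtering rule (pi is the four-rule model from the earlier
slide; the names are illustrative):

    def filter_noise(pi, state, observed):
        """state = (my last move, their last move); observed = their current move."""
        p_c = pi[state]
        if p_c in (0.0, 1.0):                         # the applicable rule is deterministic
            predicted = "C" if p_c == 1.0 else "D"
            if observed != predicted:
                return predicted                      # treat the observation as noise
        return observed                               # otherwise take it at face value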


Change of Behavior

  Anomalies in observed behavior can be due either to noise or to a genuine
change of behavior
  Changes of behavior occur because
  The other agent can change its strategy anytime
  E.g., if noise affects one of Agent 1’s actions, this may trigger a change in
Agent 2’s behavior
•  Agent 1 does not know this
  How to distinguish noise from a real change of behavior?

  (Illustration on the slide: the other agent announces “I am Grim. If you ever
betray me, I will never forgive you.” Noise turns one of Agent 1’s C’s into a D,
and from then on the other agent defects every round; those later defections
are not noise)
Detection of a Change of Behavior
Temporary tolerance:
  When we observe unexpected behavior from the other agent
  Don’t immediately decide whether it’s noise or a real change of behavior
  Instead, defer judgment for a few iterations
  If the anomaly persists, then recompute π based on the other agent’s
recent behavior
  (Illustration on the slide: “The defections might be accidents, so I shouldn’t
lose my temper too soon”; when they persist, “I think the other agent has really
changed, so I’ll change my behavior too”)
Move generation
  Modified version of game-tree search
  Use the policy π to predict probabilities of the other agent’s moves
  Compute the expected utility for move x as

u1(x) = ∑ y∈{C,D} u1(x,y) × P(y | π, previous moves)

where x = my move, y = other agent’s move


  Choose the move with the highest expected utility
  (The slide shows the lookahead tree: from the current iteration, branch on the
outcome (C,C), (C,D), (D,C), or (D,D), then on the next iteration, then the
iteration after next, …)
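  A depth-1 version of this computation in Python (the payoffs are agent 1’s payoffs from
the Prisoner’s Dilemma table used throughout; the numbers reproduce the example on the
next slide):

    U1 = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}   # agent 1's payoffs

    def expected_utility(x, p_c):
        """p_c = P[their next move is C], taken from the applicable rule of the model pi."""
        return p_c * U1[(x, "C")] + (1 - p_c) * U1[(x, "D")]

    p_c = 0.7                               # e.g., rule (C,C) -> 0.7
    print(expected_utility("C", p_c))       # 0.7*3 + 0.3*0 = 2.1
    print(expected_utility("D", p_c))       # 0.7*5 + 0.3*1 = 3.8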
Example

  Suppose we have the rules
1.  (C,C) → 0.7
2.  (C,D) → 0.4
3.  (D,C) → 0.1
4.  (D,D) → 0.1
  The agents’ last moves were (C,C), so rule 1 applies:
P(C) = 0.7, P(D) = 0.3
  Suppose we search to depth 1
    u1(C) = 0.7 u1(C,C) + 0.3 u1(C,D) = 2.1 + 0 = 2.1
    u1(D) = 0.7 u1(D,C) + 0.3 u1(D,D) = 3.5 + 0.3 = 3.8
»  So D looks better
  Is D really what we should choose?


Example (continued)

  With the same rules as above, it’s not wise to choose D
»  On the move after that, the opponent will retaliate with P = 0.9
»  The depth-1 search didn’t see this
  But if we search to depth d > 1, we’ll see it
  C will look better and we’ll choose it instead
  In general, it’s best to look far ahead
»  e.g., 60 moves


How to Search Deeper
  Game trees grow exponentially with search depth
»  How to search the tree deeply?
  Key assumption: π accurately models the other agent’s future behavior
  Then we can use dynamic programming
»  Makes the search polynomial in the search depth
»  Can easily search to depth 60
»  Equivalent to solving an acyclic MDP of depth 60
  This generates fairly good moves
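  A Python sketch of this dynamic-programming lookahead, under the stated assumption that
π fully describes the other agent: the value of a position then depends only on the pair of
last moves, so the cost is linear in the depth rather than exponential (this is a sketch of
the idea, not the DBS implementation):

    U1 = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}   # agent 1's payoffs

    def best_move(pi, state, depth=60):
        """pi maps (my last move, their last move) -> P[they cooperate next]."""
        def q(x, s, value):
            # expected payoff of my move x from state s, plus the value of the resulting state
            p_c = pi[s]
            return sum(p * (U1[(x, y)] + value[(x, y)])
                       for y, p in (("C", p_c), ("D", 1 - p_c)))

        value = {s: 0.0 for s in U1}                  # value of each state at the horizon
        for _ in range(depth):                        # back up one more round each pass
            value = {s: max(q("C", s, value), q("D", s, value)) for s in U1}
        scores = {x: q(x, state, value) for x in "CD"}
        return max(scores, key=scores.get), scores

    pi = {("C", "C"): 0.7, ("C", "D"): 0.4, ("D", "C"): 0.1, ("D", "D"): 0.1}
    print(best_move(pi, ("C", "C")))   # with a deep lookahead, C beats D here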
20th Anniversary IPD Competition

http://www.prisoners-dilemma.com

  Category 2: IPD with noise


  165 programs participated

  DBS dominated the


top 10 places

  Two agents scored


higher than DBS
  They both used
master-and-slaves strategies



Master & Slaves Strategy
  Each participant could submit up to 20 programs
  Some submitted programs that could recognize each other
  (by communicating pre-arranged sequences of Cs and Ds)
  The 20 programs worked as a team
•  1 master, 19 slaves
  When a slave plays with its master
•  Slave cooperates, master defects
=> maximizes the master’s payoff
  When a slave plays with an agent not in its team
•  It defects
=> minimizes the other agent’s payoff
  (Cartoon on the slide: the master says “My goons give me all their money …
and they beat up everyone else”)


Comparison
  Analysis
  Each master-slaves team’s average score was much lower than DBS’s
  If BWIN and IMM01 had each been restricted to ≤ 10 slaves,
DBS would have placed 1st
  Without any slaves, BWIN and IMM01 would have done badly

  In contrast, DBS had no slaves


  DBS established cooperation
with many other agents
  DBS did this despite the noise,
because it filtered out the noise



Summary
  Finitely repeated games – backward induction
  Infinitely repeated games
  average reward, future discounted reward
  equilibrium payoffs
  Non-equilibrium strategies
  opponent modeling in roshambo
  iterated prisoner’s dilemma with noise
•  opponent models based on observed behavior
•  detection and removal of noise
•  game-tree search against the opponent model
  20th anniversary IPD competition

