7. Repeated Games
Dana Nau
University of Maryland
Examples:
• Iterated Prisoner’s Dilemma
• Iterated Battle of the Sexes
• Iterated Chicken Game
• Repeated Ultimatum Game
• Repeated Matching Pennies
• Repeated Stag Hunt
• Roshambo
Finitely Repeated Games
In repeated games, some game G is played multiple times by the same set of agents
G is called the stage game
• Usually (but not always), G is a normal-form game
Each occurrence of G is called an iteration or a round
Usually each agent knows what all the agents did in the previous iterations, but not what they’re doing in the current iteration
• Thus, an imperfect-information game with perfect recall
Usually each agent’s payoff function is additive

Prisoner’s Dilemma (the stage game):

                2: C    2: D
       1: C     3, 3    0, 5
       1: D     5, 0    1, 1

Iterated Prisoner’s Dilemma, with 2 iterations:

                     Agent 1:     Agent 2:
     Round 1:           C            C
     Round 2:           D            C
     Total payoff:   3 + 5 = 8    3 + 0 = 3
Agent i’s future discounted reward is the discounted sum of the payoffs, i.e.,

    ∑_{j=1}^∞ β^j r_i^(j)

where r_i^(j) is agent i’s payoff in iteration j, and β (with 0 ≤ β ≤ 1) is a constant called the discount factor
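As a concrete check, here is a minimal sketch (not from the slides): the matrix is the Prisoner’s Dilemma above, and β = 0.9 is an arbitrary illustration.

```python
# Payoffs for the Prisoner's Dilemma stage game above: (agent 1, agent 2).
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def discounted_reward(history, player, beta):
    """Sum of beta**j * r_i^(j) over the rounds actually played (j = 1, 2, ...)."""
    return sum(beta**j * PAYOFF[joint][player]
               for j, joint in enumerate(history, start=1))

# The 2-iteration example above: agent 1 plays C, D; agent 2 plays C, C.
history = [('C', 'C'), ('D', 'C')]
print(sum(PAYOFF[pair][0] for pair in history))   # additive payoff, agent 1: 8
print(sum(PAYOFF[pair][1] for pair in history))   # additive payoff, agent 2: 3
print(discounted_reward(history, 0, beta=0.9))    # 0.9*3 + 0.81*5 = 6.75
```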
…
Two IPD strategies that combine cooperation and defection:
• TFT (Tit-for-Tat): cooperate in the first iteration; afterward, repeat whatever the other agent did in the previous iteration
• GRIM (Grim Trigger): cooperate until the other agent defects, then defect forever
If the discount factor is large enough, each of the following is a Nash equilibrium:
(TFT, TFT), (TFT, GRIM), and (GRIM, GRIM)
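A quick sanity check on the “large enough” condition (a sketch using the payoff matrix above; the comparison is between always cooperating against GRIM and defecting from round 1 onward, so the threshold is specific to this matrix and this deviation):

```python
# Against GRIM: cooperate forever vs. defect from round 1 on.
#   cooperate:  sum_{j>=1} beta^j * 3           = 3*beta / (1 - beta)
#   defect:     5*beta + sum_{j>=2} beta^j * 1  = 5*beta + beta**2 / (1 - beta)
def coop_value(beta):   return 3 * beta / (1 - beta)
def defect_value(beta): return 5 * beta + beta**2 / (1 - beta)

for beta in (0.4, 0.5, 0.6, 0.9):
    print(beta, coop_value(beta) > defect_value(beta))
# Cooperation wins exactly when beta > 1/2 for this payoff matrix.
```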
…
Folk theorems generalize this idea, characterizing the payoff profiles sustainable in equilibrium, for various conditions on the game
In a two-player zero-sum stage game, the minimax theorem gives each agent a minimax strategy that guarantees the game’s value: at least V for agent 1, at least –V for agent 2
Thus in the iterated game, the only Nash-equilibrium payoff profile is (V, –V)
The only way to get this is if each agent always plays his/her minimax strategy
• If agent 1 plays a non-minimax strategy s1 and agent 2 plays his/her best response, 2’s expected payoff will be higher than –V
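To make the bullet concrete, here is a minimal sketch using Matching Pennies, a zero-sum game with value V = 0 (the deviation probability 0.7 is just an illustration):

```python
# Matching Pennies: agent 1 wins +1 on a match, agent 2 wins +1 on a mismatch.
# Agent 1's minimax strategy is Heads with probability 0.5, and V = 0.
def best_response_payoff_2(p_heads_1):
    """Agent 2's expected payoff when best-responding to agent 1's mixture."""
    payoff_if_2_heads = -p_heads_1 + (1 - p_heads_1)  # loses on (H,H), wins on (T,H)
    payoff_if_2_tails = p_heads_1 - (1 - p_heads_1)   # wins on (H,T), loses on (T,T)
    return max(payoff_if_2_heads, payoff_if_2_tails)

print(best_response_payoff_2(0.5))  # 0.0: agent 2 gets exactly -V
print(best_response_payoff_2(0.7))  # 0.4: against a non-minimax mix, 2 beats -V
```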
In many cases, the other agents won’t use Nash equilibrium strategies
If you can forecast their actions accurately, you may be able to do
much better than the Nash equilibrium strategy
Why won’t the other agents use their Nash equilibrium strategies?
Because they may be trying to forecast your actions too
Example matchups (each column pair shows the two agents’ moves in successive iterations):

  Round   TFT  AllC   TFT  AllD   TFT  Grim   TFT  TFT    TFT  Tester
    1      C    C      C    D      C    C      C    C      C    D
    2      C    C      D    D      C    C      C    C      D    C
    3      C    C      D    D      C    C      C    C      C    C
    4      C    C      D    D      C    C      C    C      C    C
    ⋮      C    C      D    D      C    C      C    C      C    C
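These traces can be reproduced in a few lines (a sketch; TESTER is implemented here as one common variant: defect first, then cooperate, back off permanently if the opponent retaliates, otherwise alternate to exploit):

```python
# Each strategy maps the two move histories (mine, theirs) to its next move.
def all_c(me, them):  return 'C'
def all_d(me, them):  return 'D'
def tft(me, them):    return them[-1] if them else 'C'
def grim(me, them):   return 'D' if 'D' in them else 'C'

def tester(me, them):
    # Variant: defect on move 1, cooperate on move 2; if the opponent
    # retaliated, apologize and cooperate forever; otherwise alternate D/C.
    if len(me) == 0: return 'D'
    if len(me) == 1: return 'C'
    if them[1] == 'D': return 'C'                # opponent retaliated
    return 'D' if len(me) % 2 == 0 else 'C'      # exploit a non-retaliator

def play(s1, s2, rounds=7):
    h1, h2 = [], []
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        h1.append(m1); h2.append(m2)
    return h1, h2

for opp in (all_c, all_d, grim, tft, tester):
    h1, h2 = play(tft, opp)
    print(f"TFT vs {opp.__name__}: {''.join(h1)} / {''.join(h2)}")
```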
…
Example: trench warfare in World War I
Incentive to cooperate:
If I attack the other side, then they’ll retaliate and I’ll get hurt
If I don’t attack, maybe they won’t either
Result: evolution of cooperation
Although the two infantries were supposed to be enemies, they avoided attacking each other
IPD with Noise
Noise: with some small probability, an agent’s action is mis-executed or mis-perceived, so a C may come across as a D (or vice versa)
…
Example of Noise
…
Strategies for coping with noise in the IPD:
Tit-For-Two-Tats (TFTT)
» Retaliate only if the other agent defects twice in a row
• Can tolerate isolated instances of defections, but susceptible to exploitation
of its generosity
• Beaten by the TESTER strategy I described earlier
Generous Tit-For-Tat (GTFT)
» Forgive randomly: small probability of cooperation if the other agent defects
» Better than TFTT at avoiding exploitation, but worse at maintaining cooperation
Pavlov
» Win-Stay, Lose-Shift
• Repeat my previous move if I earned 3 or 5 points in the previous iteration
• Reverse my previous move if I earned 0 or 1 points in the previous iteration
» Thus if the other agent defects continuously, Pavlov will alternately cooperate and defect
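Sketches of these three strategies, in the same (history → move) convention as the earlier traces (GTFT’s forgiveness probability of 0.1 is an illustrative value; published variants differ):

```python
import random

PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def tftt(me, them):
    # Tit-For-Two-Tats: retaliate only after two defections in a row
    return 'D' if them[-2:] == ['D', 'D'] else 'C'

def gtft(me, them, p_forgive=0.1):
    # Generous TFT: like TFT, but forgive a defection with small probability
    if not them or them[-1] == 'C':
        return 'C'
    return 'C' if random.random() < p_forgive else 'D'

def pavlov(me, them):
    # Win-Stay (3 or 5 points), Lose-Shift (0 or 1 points)
    if not me:
        return 'C'
    won = PAYOFF[(me[-1], them[-1])] in (3, 5)
    return me[-1] if won else ('D' if me[-1] == 'C' else 'C')

# Pavlov against a continual defector alternates C and D:
me, them = [], []
for _ in range(6):
    me.append(pavlov(me, them))
    them.append('D')
print(''.join(me))  # CDCDCD
```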
If you can tell which actions are affected by noise, you can avoid reacting to the noise
From the other agent’s recent behavior, build a model π of the other agent’s strategy
• Use the model to filter noise
• Use the model to help plan your next move

Au & Nau. Accident or intention: That is the question (in the iterated prisoner’s dilemma). AAMAS, 2006.
But we’re not trying to model an agent’s entire strategy, just its recent
behavior
If an agent’s behavior changes, then the probabilities in π will change
e.g., after Grim defects a few times, the rules will give a very low
probability of it cooperating again
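A minimal sketch of such a model (the window size and the Laplace smoothing are illustrative choices, not taken from the paper): estimate, from recent rounds only, the probability that the other agent cooperates after each joint move.

```python
from collections import defaultdict

def build_model(my_moves, their_moves, window=20):
    """pi[(my_prev, their_prev)] ~ P(the other agent cooperates next)."""
    coop, total = defaultdict(int), defaultdict(int)
    start = max(1, len(their_moves) - window)
    for t in range(start, len(their_moves)):
        context = (my_moves[t - 1], their_moves[t - 1])
        total[context] += 1
        coop[context] += (their_moves[t] == 'C')
    # Laplace smoothing: rarely-seen contexts stay near 0.5
    return {ctx: (coop[ctx] + 1) / (total[ctx] + 2) for ctx in total}

# TFT vs Grim, after a (noise-induced) defection by TFT in round 4:
mine   = list('CCCDCDCDCD')
theirs = list('CCCCDDDDDD')   # Grim triggers at round 5, defects thereafter
print(build_model(mine, theirs))
# pi[('C','D')] and pi[('D','D')] come out low: Grim won't cooperate again
```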
(figure: a lookahead tree over the next iteration and the iteration after next)
Example
Suppose we have the rules:
1. (C,C) → 0.7
2. (C,D) → 0.4
3. (D,C) → 0.1
4. (D,D) → 0.1

(figure: a lookahead tree over the next iteration and the iteration after next)
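One way to read these rules (an assumption about the notation): (my last move, the other agent’s last move) → probability that the other agent cooperates in the next iteration. Under that reading, here is a sketch of planning one move with a two-step lookahead, matching the “iteration after next” figure; β = 0.9 is an illustrative discount factor, not from the slides.

```python
# Sketch: plan the next move from the rules above, with a two-step lookahead.
PI = {('C', 'C'): 0.7, ('C', 'D'): 0.4, ('D', 'C'): 0.1, ('D', 'D'): 0.1}
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def expected_payoff(my_move, p_coop):
    return p_coop * PAYOFF[(my_move, 'C')] + (1 - p_coop) * PAYOFF[(my_move, 'D')]

def plan(context, beta=0.9):
    """Maximize payoff for the next iteration plus the iteration after next."""
    p = PI[context]
    def value(move):
        now = expected_payoff(move, p)
        # (move, their move) becomes the next context, which determines
        # their behavior in the iteration after next.
        later = sum(prob * max(expected_payoff(m2, PI[(move, o)]) for m2 in 'CD')
                    for o, prob in (('C', p), ('D', 1 - p)))
        return now + beta * later
    return max('CD', key=value)

print(plan(('C', 'C')))  # 'C': cooperation keeps P(cooperation) at 0.7
```

With a purely myopic choice, D would look better here (expected 3.8 vs 2.1 in the next iteration); it is the iteration after next, where defection drops the cooperation probability to 0.1, that makes C the better move.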
20th Anniversary IPD Competition
http://www.prisoners-dilemma.com