Resource Distribution
Anna Kosiorek (4746651), Anant Semwal (4630734), Rolf Starre (4334620)
January 10, 2018
the total expected resource consumption of the agents does not exceed the resource availability. First, we define a finite-horizon CMDP more precisely. A finite constrained Markov decision process is a 7-tuple ⟨S, A, T, R, C, h, s_1⟩. Each agent i is defined by a separate set of states S_i, a set of actions A_i, a transition function T_i, a reward function R_i and finally by its initial state s_{i,1}. We define transition probabilities as follows: T_i : S_i × A_i × S_i → [0, 1]. Let T_i(s, a, s') = P(s' | s, a) be the probability of advancing to state s' ∈ S_i from state s ∈ S_i by choosing an action a. Given the state s and action a, we define the reward R_i(s, a) ∈ ℝ. The finite time horizon is assumed to be t = 1, ..., h, which corresponds to the total number of decisions, and s_{i,1} represents the initial state of agent i. Similarly to unconstrained MDPs, the solution, a set of multi-agent Markov policies Π = {π_1, ..., π_n} with π_i : {1, ..., h} × S_i → A_i, should maximize the total expected sum of rewards of the agents:

\max \sum_{i=1}^{n} \sum_{t=1}^{h} \mathbb{E}_{\pi_i}\left[ R_i(s_{i,t}, \pi_i(t, s_{i,t})) \right]    (1)

given that every agent i starts in its initial state s_{i,1}. Writing x^i_{t,s,a} for the probability that agent i reaches state s at time t and executes action a, the objective can be reformulated as:

\max \sum_{i=1}^{n} \sum_{t=1}^{h} \sum_{s \in S_i} \sum_{a \in A_i} x^i_{t,s,a} R_i(s, a)    (2)

The constrained problem, however, adds the requirement that the total usage of resources by the agents must not exceed the resource limits. This constraint can be represented by:

\sum_{i=1}^{n} \sum_{s \in S_i} \sum_{a \in A_i} x^i_{t,s,a} C_{i,j}(s, a) \leq L_j \quad \forall j    (3)

where C_{i,j} is the usage of resource type j by agent i and L_j is the limit of resource type j.
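To make the occupancy-measure formulation concrete, the following minimal Python sketch (the instance sizes, rewards, costs and the limit are invented purely for illustration) evaluates the objective of equation 2 and the resource usage of equation 3:

import numpy as np

# Illustrative toy instance: n agents, horizon h, |S| states, |A| actions per agent.
n, h, S, A = 2, 3, 2, 2
rng = np.random.default_rng(0)

# x[i, t, s, a]: probability that agent i is in state s at time t and executes action a.
# Any occupancy measure sums to 1 over (s, a) for each agent and timestep.
x = rng.random((n, h, S, A))
x /= x.sum(axis=(2, 3), keepdims=True)

R = rng.random((n, S, A))        # R[i, s, a]: reward of agent i for (s, a)
C = rng.random((n, 1, S, A))     # C[i, j, s, a]: usage of resource type j (one type here)
L = np.array([1.5])              # L[j]: limit of resource type j (assumed value)

# Equation 2: total expected reward, summed over agents, timesteps, states and actions.
total_reward = np.einsum('itsa,isa->', x, R)

# Equation 3: expected usage of each resource type, here checked per timestep as in the
# TCL domain; summing over t instead would give the single total budget of advertising.
usage_per_t = np.einsum('itsa,ijsa->tj', x, C)
print(total_reward, usage_per_t, np.all(usage_per_t <= L))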
The instances of those problems targeted in this report are Thermostatically Controlled Loads (TCL) [4] and the Advertising domain [1]. The TCL problem is about balancing a heating system among customers' heating devices. It is represented as a multi-agent Markov decision process (MMDP), in which each TCL device is an individual agent. The devices are controlled centrally, by switching them from on to off and vice versa, depending on their temperature. In the TCL domain each agent gets a reward that is inversely proportional to the difference between its current temperature and its goal temperature. The resource constraints are defined per timestep.

In the Advertising domain we have an advertiser with a fixed budget and a set of target users [1]. The advertiser can spend money on actions, each of which is targeted at a specific user; an action could be, for instance, an in-app or web-page ad. The goal of the advertiser is to move the users to a buy state, in which the user buys a product and thereby produces a reward for the advertiser, and to spend the budget in a way that maximizes the total reward. Where the TCL problem has a resource constraint per timestep, the advertising domain has only one total resource constraint.

3 Optimistic Linear Support

OLS [6] is a parameter-dependent optimization approach to compute the Convex Coverage Set (CCS), which is the set of optimal policies. In this method a weight vector is selected and its dot product with the multi-objective reward function is computed to obtain the scalarized reward value, which is explained later in detail. Roijers [6] discusses a toy problem with a single state and four possible actions, with rewards as tabulated in Table 1.

Table 1: A simple MODP: select one element from each list and receive the associated reward vector [6].

Action | Va         | Vb
A      | (5.7, 6.9) | (7.3, 7.6)
B      | (7.1, 5.7) | (5.9, 8.2)
C      | (7.5, 5.4) | (8.8, 6.4)
D      | (6.6, 6.7) | (6.6, 7.7)
OLS starts by generating the boundary weight vectors: the weight vectors that are 1 for one of the objectives and 0 for all the others. In this example it generates the boundary weight vectors w = [1, 0] and w = [0, 1]. The joint policy CC (V^π = 16.3) performs best under w = [1, 0], and its value under w = [0, 1] will be 11.8. Similarly, joint policy AB (V^π = 15.1) will perform best under w = [0, 1] and earn a reward of 11.6 for w = [1, 0]. Next, OLS will find corner weights (Section 3.2) for these two policies to look for a better policy around the corner weights. This process of identifying new corner weights and generating the optimal policy for each of them is repeated until no further corner weights are identified. The Pareto Coverage Set will be the set of policies generated by OLS along with their respective weight vectors.
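The numbers in this example follow directly from Table 1. The short Python sketch below (ours, not part of the original report) enumerates the joint policies and scalarizes their summed value vectors with the two boundary weight vectors:

import itertools
import numpy as np

# Value vectors from Table 1: one element is picked from list a and one from list b.
Va = {'A': (5.7, 6.9), 'B': (7.1, 5.7), 'C': (7.5, 5.4), 'D': (6.6, 6.7)}
Vb = {'A': (7.3, 7.6), 'B': (5.9, 8.2), 'C': (8.8, 6.4), 'D': (6.6, 7.7)}

# The value vector of a joint policy is the sum of the two chosen reward vectors.
joint = {a + b: np.add(Va[a], Vb[b]) for a, b in itertools.product(Va, Vb)}

for w in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    best = max(joint, key=lambda p: w @ joint[p])
    print(f"w={w}: best joint policy {best}, value vector {joint[best]}")
# w=[1, 0]: CC with value vector (16.3, 11.8), i.e. 11.8 under w=[0, 1]
# w=[0, 1]: AB with value vector (11.6, 15.1), i.e. 11.6 under w=[1, 0]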
The OLS algorithm as proposed by [6] is shown in Algorithm 1.

Algorithm 1 OLS(m, SolveSODP, ε)
 1: procedure OLS  ▷ a MODP m, a single-objective subroutine SolveSODP, and the maximum allowed error ε
 2:   S ← ∅  ▷ a partial CCS
 3:   W ← ∅  ▷ a set of visited weights
 4:   Q ← an empty priority queue
 5:   for all extrema of the weight simplex w_e do
 6:     Q.add(w_e, ∞)  ▷ add extrema to Q with infinite priority
 7:   end for
 8:   while !Q.isEmpty() && !TimeOut do
 9:     w ← Q.pop()
10:     V^π ← SolveSODP(m, w)
11:     W ← W ∪ {w}
12:     if V^π ∉ S then
13:       W_del ← the corner weights made obsolete by V^π, removed from Q and stored
14:       W_del ← {w} ∪ W_del
15:       W_{V^π} ← newCornerWeights(V^π, W_del, S)
16:       S ← S ∪ {V^π}
17:       for w ∈ W_{V^π} do
18:         Δ_r(w) ← improvement calculated using maxValueLP(w, S, W)
19:         if Δ_r(w) ≥ ε then
20:           Q.add(w, Δ_r(w))
21:         end if
22:       end for
23:     end if
24:   end while
25:   return S and the highest Δ_r(w) left in Q
26: end procedure

3.1 Single Objective Solver

OLS requires a single-objective decision problem (SODP) solver, which is invoked as a subroutine. It computes the best policy for a given weight vector w, from which the payoff vector can be derived. Specifically, it calculates the policy that maximizes the scalarized function, which we define as follows:

V_w = w_0 \mathbb{E}[V] - w_1 \mathbb{E}[c_1] - \dots - w_r \mathbb{E}[c_r]    (4)

Here \mathbb{E}[V] is the expected reward of an agent and \mathbb{E}[c_i] is the expected cost of using resource i under the chosen policy. Depending on the weight vector, the agent's reward is penalized for resource usage by subtracting the weighted resource costs for each resource type; hence the minus signs in equation 4.
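As a small illustration of equation 4 (the numbers are made up), the scalarization rewards expected value and penalizes expected resource consumption:

import numpy as np

def scalarize(w, expected_reward, expected_costs):
    """Equation 4: V_w = w_0 E[V] - w_1 E[c_1] - ... - w_r E[c_r]."""
    w = np.asarray(w, dtype=float)
    return w[0] * expected_reward - w[1:] @ np.asarray(expected_costs, dtype=float)

# One policy with expected reward 14.0 and two resource types (illustrative values).
print(scalarize([1.0, 0.0, 0.0], 14.0, [3.0, 5.0]))  # 14.0: no penalty for resources
print(scalarize([0.5, 0.3, 0.2], 14.0, [3.0, 5.0]))  # 7.0 - 0.9 - 1.0 = 5.1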
The corner-weight iteration continues until the full CCS is generated. The intuition behind the improvement estimate is that we can interpolate between the points of the found optimal value vectors to obtain an optimistic estimate at any weight w.
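This interpolation can be expressed as a small linear program, in the spirit of the maxValueLP call in Algorithm 1: an unknown value vector may not score better, at any already visited weight, than the best vector found for that weight. The sketch below is our own illustration (not the report's implementation), reusing the two value vectors of the Table 1 example and the corner weight between them:

import numpy as np
from scipy.optimize import linprog

def optimistic_estimate(w, visited_weights, found_vectors):
    """Upper bound on the scalarized value achievable at weight w, given a partial CCS."""
    V = np.asarray(found_vectors, dtype=float)
    A_ub = np.asarray(visited_weights, dtype=float)
    # At every visited weight w', an undiscovered vector cannot beat the best known value.
    b_ub = np.array([np.max(A_ub[k] @ V.T) for k in range(len(A_ub))])
    res = linprog(c=-np.asarray(w, dtype=float), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * V.shape[1])
    return -res.fun

found = [(16.3, 11.8), (11.6, 15.1)]   # value vectors of the CC and AB joint policies
visited = [(1.0, 0.0), (0.0, 1.0)]     # boundary weights that have been solved already
corner = np.array([0.4125, 0.5875])    # weight where the two scalarized values intersect

upper = optimistic_estimate(corner, visited, found)
current = max(corner @ np.array(v) for v in found)
print(upper, current, upper - current)  # the difference is the improvement estimate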
4 Fairness Constraint

For solving the constrained MDPs, a Linear Program (LP) was defined with the help of Column Generation (CG). It computes optimal policies subject to the resource limitations. The formulation of the problem was created with respect to equations 2 and 3 as follows:

\max \sum_{i=1}^{n} \sum_{\pi_i \in Z_i} V_{i,\pi_i} x_{i,\pi_i}
\text{s.t.} \quad \sum_{i=1}^{n} \sum_{\pi_i \in Z_i} C_{i,j,\pi_i} x_{i,\pi_i} \leq L_j \quad \forall j
\sum_{\pi_i \in Z_i} x_{i,\pi_i} = 1 \quad \forall i
x_{i,\pi_i} \geq 0 \quad \forall i, \pi_i    (5)

To incorporate the fairness constraint, the LP is extended with an additional variable z, which leads to the following formulation:

\max z
\text{s.t.} \quad z \leq \sum_{i=1}^{n} \sum_{\pi_i \in Z_i} V_{i,\pi_i} x_{i,\pi_i}
\sum_{i=1}^{n} \sum_{\pi_i \in Z_i} C_{i,j,\pi_i} x_{i,\pi_i} \leq L_j \quad \forall j
\sum_{\pi_i \in Z_i} x_{i,\pi_i} = 1 \quad \forall i
x_{i,\pi_i} \geq 0 \quad \forall i, \pi_i
-\infty < z < \infty    (6)
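To illustrate how a master LP of this form can be solved in practice, here is a minimal sketch for equation 5 using scipy.optimize.linprog; the candidate policy values, resource costs and the limit are invented for illustration, and this is not the report's actual implementation:

import numpy as np
from scipy.optimize import linprog

# Two agents, each with two candidate policies pi in Z_i (illustrative numbers).
V = np.array([[10.0, 6.0],      # V[i, k]: expected value of agent i's k-th policy
              [ 9.0, 4.0]])
C = np.array([[[8.0, 2.0],      # C[j, i, k]: usage of resource j by agent i's k-th policy
               [7.0, 1.0]]])
L = np.array([10.0])            # resource limits L_j (assumed)

n_agents, n_pols = V.shape
n_vars = n_agents * n_pols      # one variable x_{i,pi} per (agent, candidate policy)

c = -V.reshape(n_vars)                          # maximize total value -> minimize its negation
A_ub = C.reshape(len(L), n_vars)                # one resource constraint row per type j
b_ub = L
A_eq = np.zeros((n_agents, n_vars))             # each agent's policy probabilities sum to 1
for i in range(n_agents):
    A_eq[i, i * n_pols:(i + 1) * n_pols] = 1.0
b_eq = np.ones(n_agents)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n_vars)
print(res.x.reshape(n_agents, n_pols), -res.fun)  # policy mixture per agent, total value

In these terms, equation 6 only adds the variable z and the constraint z ≤ Σ V x, i.e. one extra column and one extra inequality row in the same setup.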
5 Analysis

Although we intended to test our implementation on the TCL problem by modeling it as a MA-MMDP, due to challenges related to the computational complexity of finding unique policies for n > 2 objectives and of identifying corner weights in multidimensional space, we instead tested the implementation on the Advertising domain.
The SODP solver then returns more diverse policies. The same problem can be viewed from a different perspective: for a boundary weight, the weighted policy only restricts the agent from using one type of resource; to be exact, for a weight w_i the resource r_i is minimized. There are in total r resources, so for the weight w_0 there are no restrictions on resource usage at all. To tackle this problem another solution can be proposed: changing the scalarized function from its initial form

V_w = w_0 \mathbb{E}[V] - w_1 \mathbb{E}[c_1] - \dots - w_r \mathbb{E}[c_r].
6 Experiments

The experiments were executed on the Advertising domain for different values of the horizon and of the number of agents. First, the set of policies was extracted using the OLS algorithm and the results were verified against those of CG. Later we included the fairness constraint and verified the existence of a fair policy with an improved reward for the worst-off agent. All three computation results were compared according to the minimal, maximal and average values of the expected rewards. Tables 2, 3 and 4 show data collected for 5000 agents, 1 bully and a budget of 16000.
Figure 1: Pareto Front of policies for advertising domain
6.1 Retrieving the Set of Policies

Using the OLS algorithm, sets of policies per agent type were retrieved. Tests resulted in the CCS having approximately 50 policies per agent type. One such set is illustrated in Figure 1; each line represents a policy. The red line corresponds to the policy extracted for weight vector [1, 0]. It is an optimal policy for an agent given no resource limitations. The blue line represents the policy found for weight vector [0, 1]; given that this objective focuses on penalizing the agent for resource usage, no resources were used.

This was expected: since all the agents were the same, there was almost no variation in the assigned policies. Therefore, we modified the initial problem and introduced a 'Bully' Agent. In this kind of modeling one agent gets a higher maximum reward for completing its job than the other agents, and thus a solver that ignores fairness will allocate more resources to this agent. However, when a fairness constraint is applied, resources are "fairly" distributed so that the other agents also get their share of resources. That leads to selecting policies of similar reward value for the Bully Agent and for the rest of the agents.
Table 5 shows how the minimum agent reward changes with the number of bullies; adding the fairness constraint improves the reward of the worst-off agent.

Table 5: Minimum reward values with a varying number of bullies (the CG, OLS and OLS + Fair columns give the minimum agent reward)

Bullies | CG     | OLS    | OLS + Fair
6       | 12.903 | 12.903 | 13.398
7       | 12.903 | 12.903 | 13.399
8       | 12.903 | 12.903 | 13.400
9       | 12.903 | 12.903 | 13.401
10      | 12.903 | 12.903 | 13.402

Fairness can be quantified in several ways, depending on the application, the system design, etc. In our experiments we use the notion of fairness that maximizes the worst performance of the agents while keeping the overall performance in consideration [7]. Other quantifiers of fairness are Jain's index, proportional fairness and envy-freeness [2]. To implement an envy-freeness based fairness function we would have to model an envy function between the agents to represent the dissatisfaction of agent i when it prefers the share of agent j. This would have complicated the model, and thus we chose to use the fairness model based on the maximin criterion described in [5].
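As a side note on these quantifiers, both the maximin criterion and Jain's index are straightforward to compute from a vector of per-agent rewards; the small sketch below (with invented reward values) contrasts an allocation dominated by a bully with a fairer one:

import numpy as np

def jains_index(rewards):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly equal rewards."""
    x = np.asarray(rewards, dtype=float)
    return x.sum() ** 2 / (len(x) * (x ** 2).sum())

def maximin(rewards):
    """Maximin criterion: the reward of the worst-off agent."""
    return float(np.min(rewards))

unfair = [20.0, 12.9, 12.9, 12.9]   # a bully agent receives far more than the rest
fair   = [14.0, 13.4, 13.4, 13.4]   # a fairness constraint lifts the worst-off agents

for name, r in (("unfair", unfair), ("fair", fair)):
    print(name, maximin(r), round(jains_index(r), 3))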
References

[2] M. Cheung and C. Swamy. "Approximation Algorithms for Single-minded Envy-free Profit-maximization Problems with Limited Supply". In: 2008 49th Annual IEEE Symposium on Foundations of Computer Science. Oct. 2008, pp. 35–44. doi: 10.1109/FOCS.2008.15.

[3] L. R. Ford and D. R. Fulkerson. "Constructing Maximal Dynamic Flows from Static Flows". In: Oper. Res. 6.3 (June 1958), pp. 419–433. issn: 0030-364X. doi: 10.1287/opre.6.3.419. url: http://dx.doi.org/10.1287/opre.6.3.419.

[4] Frits de Nijs, Matthijs T. J. Spaan, and Mathijs de Weerdt. "Best-Response Planning of Thermostatically Controlled Loads under Power Constraints". In: AAAI. 2015, pp. 615–621.

[5] Robert Denda, Albert Banchs, and Wolfgang Effelsberg. "The Fairness Challenge in Computer Networks". In: Quality of Future Internet Services (QofIS 2000), LNCS 1922. Springer, 2000. url: http://www.ia.pw.edu.pl/~wogrycza/dydaktyka/lncs1922.pdf.

[6] Diederik M. Roijers. "Multi-objective decision-theoretic planning". In: AI Matters 2.4 (2016), pp. 11–12.

[7] Chongjie Zhang and Julie A. Shah. "Fairness in Multi-Agent Sequential Decision-Making". In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani et al. Curran Associates, Inc., 2014, pp. 2636–2644. url: http://papers.nips.cc/paper/5588-fairness-in-multi-agent-sequential-decision-making.pdf.