
Group I: Multi-agent Multi-objective Planning for Fair Resource Distribution
Anna Kosiorek (4746651), Anant Semwal (4630734), Rolf Starre (4334620)
January 10, 2018

1 Introduction

Multi-agent problems with resources are common in the real world. Examples include sharing Internet bandwidth among several users, domestic or commercial heating where resources must be spent to keep the temperature in multiple locations at an effective level, and allocating a budget to various advertisement campaigns that promote product sales (the advertising domain). When resources are limited, unfair allocation leads to contention between agents, and users may try to game the system to gain an advantage.

The notion of fairness is essential in real-world problems that involve multiple agents and limited resources. Fairness can be modeled as an additional constraint and added to the solver to ensure that the selected policies are fair; it guarantees that every agent is able to access the resource. Consider two Internet users whose tariffs for Internet usage per MB are X and Y respectively, with X > Y. The revenue of the telecommunication operator in time step t is

    R_t = a_t X + b_t Y

where a_t and b_t are binary variables that decide which user gets Internet access. Given that both users want to use the resource in every time step, we formulate the objective as

    max Σ_{t=0}^{T} R_t

subject to the constraint a_t + b_t = 1, indicating that only a single resource is available to service the requests. An unfair planner will generate a policy that always allocates the resource to the user with the higher tariff, leaving the other user unhappy with the service, whereas a fair policy will try to balance rewards and resource allocation.

Unfortunately, existing planners for multi-agent problems typically do not consider fairness. These problems are often modeled as constrained Markov Decision Processes (CMDPs) [1]. We propose to model them instead as a multi-agent multi-objective Markov Decision Process (MA-MMDP). This can be done by treating the resource constraints as individual objectives; this way the value of not using a resource can be compared with the utility of consuming it. We created an implementation that solves MA-MMDP problems fairly using Optimistic Linear Support (OLS) [6] and Column Generation [3]. We applied this approach to the advertising domain and produced results showing that fairness can be achieved.

The report is structured as follows. Section 1 introduces the motivation for the topic. Section 2 presents an overview of CMDPs, and Section 3 discusses OLS in depth. Details about Column Generation and the fairness constraint are given in Section 4. Section 5 analyses the difficulties we encountered, and Section 6 describes the experiments. The report concludes with related work in Section 7, followed by conclusions in Section 8.

2 Background

The project focuses mainly on constraint-based problems. Constrained Markov Decision Processes (CMDPs) can be represented as a linear program that computes optimal policies maximizing the sum of the expected agent rewards, under the condition that the total expected resource consumption of the agents does not exceed the resource availability.
First, we define a finite-horizon CMDP more precisely. A finite constrained Markov decision process is a tuple ⟨S, A, T, R, h, s_1⟩. Each agent i is defined by a separate set of states S_i, a set of actions A_i, a transition function T_i, a reward function R_i, and its initial state s_{i,1}. The transition probabilities are defined as T_i : S_i × A_i × S_i → [0, 1], where T_i(s, a, s') = P(s' | s, a) is the probability of advancing to state s' ∈ S_i from state s ∈ S_i by choosing action a. Given state s and action a, the reward is R_i(s, a) ∈ ℝ. The finite time horizon is t = 1, ..., h, which corresponds to the total number of decisions, and s_{i,1} represents the initial state of agent i. Similarly to unconstrained MDPs, the solution, a set of Markov policies π = {π_i : {1, ..., h} × S_i → A_i, ∀i}, should maximize the total expected sum of rewards of the agents:

    max Σ_{i=1}^{n} E_{π_i}[ Σ_{t=1}^{h} R_i(s_{i,t}, π_i(t, s_{i,t})) ]    (1)

given the initial states s_{i,1}. Letting x^i_{t,s,a} denote the probability that agent i reaches state s at time t and executes action a, this can be reformulated as:

    max Σ_{i=1}^{n} Σ_{t=1}^{h} Σ_{s∈S_i} Σ_{a∈A_i} x^i_{t,s,a} R_i(s, a)    (2)

The constrained problem additionally requires that the total usage of resources by the agents does not exceed the resource limits. This constraint can be represented by:

    Σ_{i=1}^{n} Σ_{s∈S_i} Σ_{a∈A_i} x^i_{t,s,a} C_{i,j}(s, a) ≤ L_j    ∀j    (3)

where C_{i,j} is the usage of resource type j by agent i and L_j is the limit of resource type j.
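
To make this per-agent model concrete, the sketch below shows one possible way to hold the components ⟨S_i, A_i, T_i, R_i, h, s_{i,1}⟩ in code. It is only an illustration with hypothetical names (AgentCMDP, check_transitions), assuming small tabular state and action sets; it is not part of the planners discussed in this report.

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    State = int
    Action = int

    @dataclass
    class AgentCMDP:
        """Finite-horizon model of a single agent i (illustrative sketch)."""
        states: List[State]                                     # S_i
        actions: List[Action]                                   # A_i
        transitions: Dict[Tuple[State, Action, State], float]   # T_i(s, a, s') = P(s' | s, a)
        rewards: Dict[Tuple[State, Action], float]              # R_i(s, a)
        costs: Dict[Tuple[State, Action], List[float]]          # C_{i,j}(s, a) for each resource j
        horizon: int                                            # h
        initial_state: State                                    # s_{i,1}

        def check_transitions(self) -> bool:
            # For every (s, a), the outgoing probabilities should sum to 1.
            for s in self.states:
                for a in self.actions:
                    total = sum(self.transitions.get((s, a, s2), 0.0) for s2 in self.states)
                    if abs(total - 1.0) > 1e-9:
                        return False
            return True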
The problem instances targeted in this report are Thermostatically Controlled Loads (TCL) [4] and the Advertising domain [1]. The TCL problem is about balancing a heating system among customers' heating devices. It is represented as a multi-agent Markov decision process (MMDP) in which each TCL device is an individual agent. The devices are controlled centrally, by switching them from on to off and vice versa depending on their temperature. In the TCL domain each agent gets a reward that is inversely proportional to the difference between its current temperature and its goal temperature. The resource constraints are defined per timestep.

In the Advertising domain we have an advertiser with a fixed budget and a set of target users [1]. The advertiser can spend money on actions, each of which targets a specific user; an action could be, for instance, an in-app or web page ad. The goal of the advertiser is to move the users to a buy state, in which the user buys a product and produces a reward for the advertiser, and to spend the budget in a way that maximizes the total reward. Where the TCL problem has a resource constraint per timestep, the advertising domain has only one total resource constraint.

3 Optimistic Linear Support

OLS [6] is a parameter-dependent optimization approach for computing the Convex Coverage Set (CCS), the set of optimal policies. In this method a weight vector is selected, and the dot product with the multi-objective reward function is computed to obtain a scalarized reward value; this is explained in detail below. Roijers [6] discusses a toy problem with a single state and four possible actions, with the reward vectors tabulated in Table 1.

Table 1: A simple MODP: select one element from each list and receive the associated reward vector [6]

    Action    Va            Vb
    A         (5.7, 6.9)    (7.3, 7.6)
    B         (7.1, 5.7)    (5.9, 8.2)
    C         (7.5, 5.4)    (8.8, 6.4)
    D         (6.6, 6.7)    (6.6, 7.7)

OLS starts by generating the boundary weight vectors, i.e. the weight vectors that are 1 for one of the objectives and 0 for all others. In this example it generates the boundary weight vectors [1, 0] and [0, 1] and scalarizes the reward values accordingly, by computing the dot product between the weight vector and the reward vectors:
    V^A(a) = w · V_a(a)
    V^B(a) = w · V_b(a)

Here w is a two-dimensional weight vector, since we have two objectives, and a is the selected action. V^A and V^B are the rewards earned by selecting a pair of actions from the set of available actions.

This generates two optimal joint policies, CC and AB. Joint policy CC performs best under w = [1, 0], where it earns a reward of 16.3; its reward under w = [0, 1] is 11.8. Similarly, joint policy AB (V^π = 15.1) performs best under w = [0, 1] and earns a reward of 11.6 for w = [1, 0]. Next, OLS finds the corner weights (Section 3.2) of these two policies, and looks for a better policy around those corner weights. This process of identifying new corner weights and generating the optimal policy for them is repeated until no further corner weights are identified. The resulting coverage set is the set of policies generated by OLS, together with their respective weight vectors.
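
As a sanity check of this scalarization step, the following standalone sketch of the toy problem in Table 1 (our own illustration, not the actual solver) enumerates all joint actions and picks the best one for each boundary weight; it recovers CC with scalarized value ≈ 16.3 for w = [1, 0] and AB with ≈ 15.1 for w = [0, 1].

    import numpy as np

    # Reward vectors from Table 1: V_a and V_b per action (two objectives each).
    Va = {"A": np.array([5.7, 6.9]), "B": np.array([7.1, 5.7]),
          "C": np.array([7.5, 5.4]), "D": np.array([6.6, 6.7])}
    Vb = {"A": np.array([7.3, 7.6]), "B": np.array([5.9, 8.2]),
          "C": np.array([8.8, 6.4]), "D": np.array([6.6, 7.7])}

    def best_joint_action(w):
        """Return the joint action maximizing w . (V_a(a1) + V_b(a2))."""
        best, best_value = None, -np.inf
        for a1, va in Va.items():
            for a2, vb in Vb.items():
                value = np.dot(w, va + vb)
                if value > best_value:
                    best, best_value = a1 + a2, value
        return best, best_value

    print(best_joint_action(np.array([1.0, 0.0])))   # ('CC', ~16.3)
    print(best_joint_action(np.array([0.0, 1.0])))   # ('AB', ~15.1)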
The OLS algorithm as proposed in [6] is shown in Algorithm 1.

Algorithm 1 OLS(m, SolveSODP, ε)

    procedure OLS(m, SolveSODP, ε)    ▷ a MODP m, a single-objective subroutine SolveSODP, a maximum allowed error ε
        S ← ∅                         ▷ a partial CCS
        W ← ∅                         ▷ the set of visited weights
        Q ← empty priority queue
        for all extrema w_e of the weight simplex do
            Q.add(w_e, ∞)             ▷ add the extrema to Q with infinite priority
        end for
        while !Q.isEmpty() and !TimeOut do
            w ← Q.pop()
            V^π ← SolveSODP(m, w)
            W ← W ∪ {w}
            if V^π ∉ S then
                W_del ← the corner weights made obsolete by V^π, removed from Q and stored
                W_del ← {w} ∪ W_del
                W_{V^π} ← newCornerWeights(V^π, W_del, S)
                S ← S ∪ {V^π}
                for w' ∈ W_{V^π} do
                    Δ(w') ← improvement estimated by maxValueLP(w', S, W)
                    if Δ(w') ≥ ε then
                        Q.add(w', Δ(w'))
                    end if
                end for
            end if
        end while
        return S and the highest Δ(w) left in Q
    end procedure

3.1 Single Objective Solver

OLS requires a single-objective decision problem (SODP) solver, which is invoked as a subroutine. It computes the best policy for the given weight vector w, from which the payoff vector can be derived. Specifically, it calculates the policy that maximizes the scalarized function, which we define as follows:

    V_w = w_0 E[V] − w_1 E[c_1] − ... − w_r E[c_r]    (4)

Here E[V] is the expected reward of an agent and E[c_i] is the expected cost of using resource i under the evaluated policy. Depending on the weight vector, the agent's reward is penalized for resource usage by subtracting the weighted resource cost of each resource type; hence the minus signs in Equation 4.
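
A minimal sketch of this scalarization (our own illustration; in the actual SODP solver the quantity is evaluated inside the dynamic program rather than by a standalone function) could look as follows, where expected_reward and expected_costs are assumed to come from evaluating a candidate policy.

    def scalarize(w, expected_reward, expected_costs):
        """Equation 4: V_w = w_0 E[V] - w_1 E[c_1] - ... - w_r E[c_r].

        w               -- weight vector of length r + 1
        expected_reward -- E[V] under the candidate policy
        expected_costs  -- [E[c_1], ..., E[c_r]] under the same policy
        """
        value = w[0] * expected_reward
        for w_j, cost_j in zip(w[1:], expected_costs):
            value -= w_j * cost_j
        return value

    # Example: one resource type, weight 0.3 on its expected consumption.
    print(scalarize([1.0, 0.3], expected_reward=12.0, expected_costs=[5.0]))   # 10.5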

3.2 New Corner Weights

Finding corner weights is a fundamental part of the OLS algorithm. They indicate how the scalarized reward value of a specific policy varies when the same policy is evaluated under different weight vectors. The new-corner-weights routine finds the set of new weight vectors at which there is a possibility of finding a new policy that belongs to the CCS. This is done by computing the intersection of d hyperplanes (with d the number of objectives) that are created by the policies in the partial CCS.
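
For illustration, the sketch below intersects the value hyperplanes of d value vectors on the weight simplex (weights summing to 1). It is one plausible way to realize this computation, under our own assumption that each policy contributes the hyperplane w ↦ w · V; it is not the project's implementation. It also exposes the singular-matrix failure mode discussed later in Section 5.1.

    import numpy as np

    def corner_weight(value_vectors):
        """Intersect the scalarized-value hyperplanes of exactly d value vectors
        (d = number of objectives) on the weight simplex. Returns (w, v) with
        V_k . w = v for every value vector V_k, or None if the linear system is
        singular -- the failure case discussed in Section 5.1."""
        V = np.asarray(value_vectors, dtype=float)   # shape (d, d)
        d = V.shape[1]
        A = np.zeros((d + 1, d + 1))
        b = np.zeros(d + 1)
        A[:d, :d] = V          # V_k . w - v = 0 for each k
        A[:d, d] = -1.0
        A[d, :d] = 1.0         # the weights sum to 1
        b[d] = 1.0
        try:
            x = np.linalg.solve(A, b)
        except np.linalg.LinAlgError:
            return None
        return x[:d], x[d]

    # Two objectives: value vectors of joint policies CC and AB from the toy example.
    print(corner_weight([[16.3, 11.8], [11.6, 15.1]]))   # w ≈ [0.4125, 0.5875], v ≈ 13.66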


3.3 Estimate Improvement

To estimate the improvement that is possible at the corner weights, an optimistic hypothetical CCS is generated. The intuition is that we can interpolate between the points of the optimal value vectors found so far to obtain an optimistic estimate at any weight w.
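
One way to read this estimate (a hedged sketch of the idea behind maxValueLP, not the project's code): at a corner weight w, an optimistic value is the largest w · V attainable by any hypothetical value vector V that does not beat the optima already found at the visited weights; the improvement is the gap between that bound and the best value the partial CCS currently achieves at w. With scipy this bound can be written as a small linear program.

    import numpy as np
    from scipy.optimize import linprog

    def optimistic_improvement(w, visited_weights, best_values, partial_ccs):
        """Upper-bound the scalarized value at corner weight w by any value vector
        that stays below the optima already found at the visited weights, then
        subtract the best value the current partial CCS achieves at w."""
        res = linprog(c=-np.asarray(w),                   # maximize w . V
                      A_ub=np.asarray(visited_weights),   # w' . V <= V*_{w'} for every visited w'
                      b_ub=np.asarray(best_values),
                      bounds=[(None, None)] * len(w),
                      method="highs")
        if not res.success:                               # e.g. the bound is unbounded
            return np.inf
        upper = -res.fun
        current = max(np.dot(w, V) for V in partial_ccs)
        return upper - current

    # Toy example: after finding CC and AB at the two boundary weights.
    print(optimistic_improvement([0.4125, 0.5875],
                                 visited_weights=[[1.0, 0.0], [0.0, 1.0]],
                                 best_values=[16.3, 15.1],
                                 partial_ccs=[[16.3, 11.8], [11.6, 15.1]]))   # ≈ 1.94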
4 Fairness Constraint

To solve the constrained MDPs, a Linear Program (LP) is defined with the help of Column Generation (CG). It computes optimal policies subject to the resource limitations. The formulation of the problem follows Equations 2 and 3:

    max Σ_{i=1}^{n} Σ_{π_i∈Z_i} V_{i,π_i} x_{i,π_i}
    s.t. Σ_{i=1}^{n} Σ_{π_i∈Z_i} C_{i,j,π_i} x_{i,π_i} ≤ L_j    ∀j
         Σ_{π_i∈Z_i} x_{i,π_i} = 1    ∀i                        (5)
         x_{i,π_i} ≥ 0    ∀i, π_i

where Z_i is the set of candidate policies of agent i, V_{i,π_i} is the value and C_{i,j,π_i} the resource consumption of policy π_i, and x_{i,π_i} is the probability that agent i uses policy π_i. In general the expected value E[V_i] of agent i's policy is thus given by:

    E[V_i] = Σ_{π_i∈Z_i} V_{i,π_i} x_{i,π_i}

To retrieve policies with a fair distribution of resources, we aim to maximize the minimum expected value over all agents. The LP objective is therefore changed to maximizing a new variable z, which takes the value of the lowest expected reward over all agents:

    z = min_i E[V_i]

To encode this minimization in an LP, a standard 'trick' is applied: z is bounded by the expected value of each agent,

    z ≤ E[V_i]    ∀i

All this reasoning leads to the following LP formulation:

    max z
    s.t. z ≤ Σ_{π_i∈Z_i} V_{i,π_i} x_{i,π_i}    ∀i
         Σ_{i=1}^{n} Σ_{π_i∈Z_i} C_{i,j,π_i} x_{i,π_i} ≤ L_j    ∀j
         Σ_{π_i∈Z_i} x_{i,π_i} = 1    ∀i                        (6)
         x_{i,π_i} ≥ 0    ∀i, π_i
         −∞ < z < ∞
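
As an illustration of formulation (6), the sketch below solves a tiny instance directly with scipy's linprog. The data (two agents with two candidate policies each and a single resource) is made up, and in the actual approach Column Generation repeatedly solves such a master LP while adding new policies as columns.

    import numpy as np
    from scipy.optimize import linprog

    # Toy data: 2 agents, 2 candidate policies each, one resource with limit L.
    V = np.array([[10.0, 4.0],    # expected value of each agent's policies
                  [ 8.0, 3.0]])
    C = np.array([[ 5.0, 1.0],    # expected resource consumption of those policies
                  [ 5.0, 1.0]])
    L = 6.0

    n_agents, n_pols = V.shape
    n_x = n_agents * n_pols             # the x_{i,pi} variables; z is the last variable

    c = np.zeros(n_x + 1)
    c[-1] = -1.0                        # maximize z (linprog minimizes)

    A_ub, b_ub = [], []
    for i in range(n_agents):           # z <= sum_pi V_{i,pi} x_{i,pi} for every agent i
        row = np.zeros(n_x + 1)
        row[i * n_pols:(i + 1) * n_pols] = -V[i]
        row[-1] = 1.0
        A_ub.append(row)
        b_ub.append(0.0)
    row = np.zeros(n_x + 1)             # total expected consumption <= L
    row[:n_x] = C.flatten()
    A_ub.append(row)
    b_ub.append(L)

    A_eq = []
    for i in range(n_agents):           # each agent's policy probabilities sum to 1
        row = np.zeros(n_x + 1)
        row[i * n_pols:(i + 1) * n_pols] = 1.0
        A_eq.append(row)
    b_eq = np.ones(n_agents)

    bounds = [(0, None)] * n_x + [(None, None)]   # x >= 0, z free
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=bounds, method="highs")
    print(res.x[:n_x].reshape(V.shape))           # mixing probabilities, ≈ [[0.36, 0.64], [0.64, 0.36]]
    print(res.x[-1])                              # worst-off expected value z ≈ 6.18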

5 Analysis

Although we intended to test our implementation on the TCL problem by modeling it as a MA-MMDP, we instead tested the implementation on the Advertising domain, due to challenges related to the computational complexity of finding unique policies for more than two objectives and of identifying corner weights in a multidimensional weight space.

5.1 TCL Problem

While trying to compute policies with the OLS algorithm, we encountered the following limitations:

• The OLS solver could not find corner weights, because too few of the computed policies intersect.

• When calculating the intersection of hyperplanes using matrices, the inverse was not always defined, which led to undefined points of intersection.

The first problem arose because not enough unique policies were extracted from the SODP solver using the boundary weight vectors. Its cause lies in the structure of the policy evaluation. When the SODP solver computes the policy that maximizes the scalarized function for a given weight, it picks the first action that maximizes the weighted value in the current time step and state. Mathematically, however, other actions with the same weighted value are equally optimal. We therefore proposed to randomize the action selection among the actions that give the same weighted value. This way the SODP solver returns more diverse policies.
The same problem can also be viewed from a different perspective. For a boundary weight, the scalarized objective only restricts the agent's use of one resource type: for a weight w_i, resource r_i is minimized. There are r resources in total, so for the weight w_0 there is no restriction on resource usage at all. To tackle this, another solution can be applied: change the scalarized function from its initial form

    V_w = w_0 E[V] − w_1 E[c_1] − ... − w_r E[c_r]

to:

    V_w = 0.01 · E[V] + 0.99 · (w_0 E[V] − w_1 E[c_1] − ... − w_r E[c_r])

This way the expected value always has an impact on the scalarized value function. Using the above two solutions this problem was resolved.
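
Both fixes can be sketched as follows (illustrative only, with hypothetical function and argument names; in the real SODP solver they are applied inside the per-timestep maximization): break ties randomly among equally good actions, and blend a small fixed share of the expected value into every scalarization.

    import random

    def pick_action(scalarized_values, tol=1e-9):
        """Randomly break ties among the actions whose scalarized value is maximal."""
        best = max(scalarized_values.values())
        candidates = [a for a, v in scalarized_values.items() if best - v <= tol]
        return random.choice(candidates)

    def blended_scalarize(w, expected_reward, expected_costs, mix=0.01):
        """Modified scalarization from Section 5.1: keep a small fixed share of E[V]
        so that the reward never disappears from the objective, even for boundary
        weights that put all their mass on a resource objective."""
        base = w[0] * expected_reward - sum(w_j * c_j for w_j, c_j in zip(w[1:], expected_costs))
        return mix * expected_reward + (1.0 - mix) * base

    # A boundary weight that only penalizes resource 1 still 'sees' the reward:
    print(blended_scalarize([0.0, 1.0], expected_reward=12.0, expected_costs=[5.0]))   # -4.83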
However, another problem arose. To calculate the corner weights we used an implementation that finds the intersection of a set of hyperplanes. Each hyperplane is represented as a vector of coefficients plus a constant term, and to construct it the implementation requires a matrix of points known to lie on the hyperplane. Computing the hyperplane requires the determinant of this matrix to be nonzero, which was not always the case; as a result, the hyperplane could not always be found.

5.2 Advertising domain

While in the TCL domain we had to apply some adjustments due to the high number of objectives, in the Advertising domain there is no need for them. There is only one resource constraint over all time steps, so only one objective penalizes resource usage. Moreover, the expected consumption does not vary within [0, 1] as in TCL; on the contrary, its value grows even faster than the expected reward value. Thus, the resource constraint objective has enough impact on the scalarized value function.

6 Experiments

The experiments were executed on the Advertising domain for different values of the horizon and numbers of agents. First the set of policies was extracted using the OLS algorithm and the results were verified against those of CG. Then we included the fairness constraint and verified the existence of a fair policy with an improved reward for the worst-off agent. All three computations were compared on the minimal, maximal and average values of the expected rewards. Tables 2, 3 and 4 show data collected for 5000 agents, 1 bully and a budget of 16000.

Table 2: Minimum Reward Values

    Minimum Agent Reward
    Horizon    CG        OLS       OLS + Fair
    10         12.903    12.903    13.394
    20         10.490    10.490    16.067
    30         10.703    10.703    16.106
    40         10.709    10.709    16.106
    50         10.709    10.709    16.106

Table 3: Maximum Rewards

    Maximum Agent Reward
    Horizon    CG         OLS        OLS + Fair
    10         83.398     83.398     13.394
    20         136.076    136.076    18.248
    30         140.892    149.068    18.591
    40         141.081    149.386    18.602
    50         141.088    149.394    18.602

Table 4: Average Rewards

    Average Agent Reward
    Horizon    CG         OLS        OLS + Fair
    10         13.4062    13.4062    13.3942
    20         16.0843    16.0843    16.0676
    30         16.1247    16.1247    16.1072
    40         16.1243    16.1243    16.1068
    50         16.1242    16.1242    16.1067
Figure 1: Pareto Front of policies for advertising domain

6.1 Retrieving Set of Policies

Using the OLS algorithm, sets of policies per agent type were retrieved. Tests resulted in a CCS of approximately 50 policies per agent type. One such set is illustrated in Figure 1, where each line represents a policy. The red line corresponds to the policy extracted with weight vector [1, 0]; it is an optimal policy for an agent given no resource limitations. The blue line represents the policy found with weight vector [0, 1]; since this objective focuses on penalizing the agent for resource usage, no resources are used.

6.2 Verification of Results

Testing our implementation was the first milestone. In this experiment we ignored the fairness constraint and solved the problem using our implementation. The reward values for each agent were compared with the reward values from the existing solver and were found to be equal.

6.3 Bully Agent

The initial Advertising domain problem assumed that all agents are the same. In experiments under this assumption, where all agents share the same reward function and transition probabilities, the optimal policy computed with the fairness constraint gives the same results as the one computed without it. This was expected: since all agents are identical, there is almost no variation in the assigned policies. Therefore, we modified the initial problem and introduced a 'Bully' agent. In this model one agent gets a higher maximum reward for completing its job than the other agents, so a solver that ignores fairness will allocate more resources to this agent. When a fairness constraint is applied, however, resources are distributed "fairly", so that the other agents also get their share; this leads to policies of similar reward value for the Bully agent and for the rest of the agents.

The results can be seen in Tables 2, 3 and 4. Table 2 shows that with the fairness constraint the minimum agent reward is higher than without it, whereas the maximum agent reward is much lower (Table 3). This happens because without the fairness constraint a large share of the resources is given to the Bully agent, while with the fairness constraint the resources are distributed more evenly. Table 4 shows that the fair solution obtains a slightly lower average reward per agent.

Next we analyzed the impact on the agents' reward values when varying the number of bullies. The results of this experiment are tabulated in Tables 5, 6 and 7. The solver kept generating effectively fair policies as more bullies were added to the model. The results of the CG and OLS implementations were similar, and OLS with fairness improved the performance of the worst-off agent.
Table 5: Minimum Reward Values with varying number of bullies

    Minimum Agent Reward
    Bullies    CG        OLS       OLS + Fair
    6          12.903    12.903    13.398
    7          12.903    12.903    13.399
    8          12.903    12.903    13.400
    9          12.903    12.903    13.401
    10         12.903    12.903    13.402

Table 6: Maximum Rewards with varying number of bullies

    Maximum Agent Reward
    Bullies    CG        OLS       OLS + Fair
    6          83.398    83.398    13.398
    7          83.398    83.398    13.399
    8          83.398    83.398    13.400
    9          83.398    83.398    13.401
    10         83.398    83.398    13.402

Table 7: Average Rewards with varying number of bullies

    Average Agent Reward
    Bullies    CG        OLS       OLS + Fair
    6          13.470    13.470    13.3987
    7          13.483    13.483    13.3996
    8          13.496    13.496    13.4005
    9          13.509    13.509    13.4014
    10         13.522    13.522    13.4023

6.4 TCL Problem

We also tried to experiment with the TCL domain, but no results were obtained, due to the limitations described in Section 5.

7 Related work

The notion of fairness is widely used in the field of computer networks [5]. The applications range from server load balancing, network bandwidth sharing and congestion control to network design. In our experiments we use the notion of fairness to maximize the worst performance among the agents while keeping the overall performance in consideration [7]. Other quantifiers of fairness are Jain's index, proportional fairness and envy-freeness [2]. To implement an envy-freeness based fairness function we would have to model an envy function for each pair of agents, representing the dissatisfaction of agent i when it prefers the share of agent j. This would have complicated the model, and we therefore chose the fairness model based on the maximin criterion described in [5].

The solution given in [6] computes the optimal policy of MMDPs for the TCL problem. Because the agents aim to maximize the collective system reward, this may lead to an unfair distribution of resources between the agents, leaving some of them far from their goal. We therefore included a fairness constraint in the model and showed that fair policies are just as good.

8 Conclusions

With the reward value tables in Section 6 we demonstrated that a fair policy exists that allocates resources fairly among all agents (comparing the difference between the maximum and minimum rewards as well as the average reward), thereby improving the performance of the worst-off agent (the minimum reward). As we were not able to resolve the challenges encountered while solving the TCL problem, creating an alternative implementation for generating corner weights and optimistic improvement estimates remains future work.

References

[1] Craig Boutilier and Tyler Lu. "Budget Allocation using Weakly Coupled, Constrained Markov Decision Processes". In: Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI-16). New York, NY, 2016, pp. 52–61.
[2] M. Cheung and C. Swamy. "Approximation Algorithms for Single-minded Envy-free Profit-maximization Problems with Limited Supply". In: 2008 49th Annual IEEE Symposium on Foundations of Computer Science. Oct. 2008, pp. 35–44. DOI: 10.1109/FOCS.2008.15.

[3] L. R. Ford and D. R. Fulkerson. "Constructing Maximal Dynamic Flows from Static Flows". In: Oper. Res. 6.3 (June 1958), pp. 419–433. ISSN: 0030-364X. DOI: 10.1287/opre.6.3.419. URL: http://dx.doi.org/10.1287/opre.6.3.419.

[4] Frits de Nijs, Matthijs T. J. Spaan, and Mathijs de Weerdt. "Best-Response Planning of Thermostatically Controlled Loads under Power Constraints". In: AAAI. 2015, pp. 615–621.

[5] Robert Denda, Albert Banchs, and Wolfgang Effelsberg. "The Fairness Challenge in Computer Networks". URL: http://www.ia.pw.edu.pl/~wogrycza/dydaktyka/lncs1922.pdf.

[6] Diederik M. Roijers. "Multi-objective decision-theoretic planning". In: AI Matters 2.4 (2016), pp. 11–12.

[7] Chongjie Zhang and Julie A. Shah. "Fairness in Multi-Agent Sequential Decision-Making". In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani et al. Curran Associates, Inc., 2014, pp. 2636–2644. URL: http://papers.nips.cc/paper/5588-fairness-in-multi-agent-sequential-decision-making.pdf.
