
IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 3, NO. 4, OCTOBER-DECEMBER 2015

LsPS: A Job Size-Based Scheduler for Efficient Task Assignments in Hadoop

Yi Yao, Jianzhe Tai, Bo Sheng, and Ningfang Mi, Member, IEEE

Abstract: The MapReduce paradigm and its open source implementation Hadoop are emerging as an important standard for large-scale data-intensive processing in both industry and academia. A MapReduce cluster is typically shared among multiple users with different types of workloads. When a flock of jobs are concurrently submitted to a MapReduce cluster, they compete for the shared resources and the overall system performance, in terms of job response times, might be seriously degraded. Therefore, one challenging issue is efficient scheduling in such a shared MapReduce environment. However, we find that the conventional scheduling algorithms supported by Hadoop cannot always guarantee good average response times under different workloads. To address this issue, we propose a new Hadoop scheduler which leverages the knowledge of workload patterns to reduce average job response times by dynamically tuning the resource shares among users and the scheduling algorithms for each user. Both simulation and real experimental results from an Amazon EC2 cluster show that our scheduler reduces the average MapReduce job response time under a variety of system workloads compared to the existing FIFO and Fair schedulers.

Index Terms: MapReduce, Hadoop, scheduling, heavy-tailed workloads, bursty workloads

1 INTRODUCTION

MAPREDUCE [1] has become an important paradigm for parallel data-intensive cluster programming due to its simplicity and flexibility. Essentially, it is a software framework that allows a cluster of computers to process a large set of structured or unstructured data in parallel. MapReduce's user base is growing quickly. Apache Hadoop [2] is an open source implementation of MapReduce that has been widely adopted in industry [3]. With the rise of cloud computing, it has become more convenient for IT businesses to set up a cluster of servers in the cloud and launch batches of MapReduce jobs. As a result, a large variety of data-intensive applications now use the MapReduce framework.

In a classic Hadoop system, each MapReduce job is partitioned into small tasks which are distributed and executed across multiple machines. There are two kinds of tasks, i.e., map tasks and reduce tasks. Each map task applies the same map function to process a block of the input data and produces intermediate results in the form of key-value pairs. The intermediate data are partitioned by hash functions and fetched by the corresponding reduce tasks as their inputs. Once all the intermediate data have been fetched, a reduce task starts to execute and produces the final results. The Hadoop implementation closely resembles the MapReduce framework. A single master node manages the distributed slave nodes. The master node communicates with the slave nodes through heartbeat messages, which carry the status information of the slaves. Job scheduling is performed by a centralized jobtracker routine in the master node. The scheduler assigns tasks to slave nodes that have free resources and responds to their heartbeats as well. The resources in each slave node are represented as map/reduce slots. Each slave node has a fixed number of slots, and each map (resp. reduce) slot processes only one map (resp. reduce) task at a time.

Typically, multiple users compete for the available slots in a MapReduce cluster when they concurrently submit jobs to the cluster. As a result, the average job response time,1 an important performance concern in MapReduce systems, might be seriously degraded. Recent studies [4], [5], [6] found that MapReduce workloads often exhibit heavy-tailed (or long-tailed) characteristics, where a MapReduce workflow consists of a few extremely large jobs and many small jobs. In such a MapReduce system, an efficient scheduling policy is a critical factor for improving system performance in terms of average job response times. However, we found that the existing policies supported by Hadoop do not perform well under heavy-tailed and diverse workloads. By default, Hadoop uses a first-in-first-out (FIFO) scheduler which is originally designed to optimize the completion length (i.e., makespan) of a batch of jobs. However, it is not efficient in clusters that serve diverse workloads, since small jobs experience extremely long waiting times when submitted after a large job. The Fair scheduler [7], [8] was proposed to improve the average job response times in shared Hadoop clusters by assigning to all jobs, on average, an equal share of resources over time. However, we notice that the Fair scheduler makes its scheduling decisions without considering the workload patterns of different users.

1. The response time of a MapReduce job is measured from the time when that job is submitted to the time when it is completed, i.e., the summation of that job's waiting time and execution time.

Y. Yao, J. Tai, and N. Mi are with the Department of Electrical and Computer Engineering, Northeastern University, 409 Dana Research Center, 360 Huntington Avenue, Boston, MA 02115. E-mail: {yyao, jtai, ningfang}@ece.neu.edu.
B. Sheng is with the Department of Computer Science, University of Massachusetts Boston, Boston, MA. E-mail: shengbo@cs.umb.edu.
Compared to the early days, the workloads in MapReduce systems have been changing along the following three directions. First, a MapReduce cluster, once established, is no longer dedicated to a particular job, but serves multiple jobs from different applications or users. For example, Facebook [9], one of Hadoop's biggest champions, keeps more than 100 petabytes of Hadoop data online and allows multiple applications and users to submit their ad hoc queries to the shared Hive-Hadoop clusters. Second, as a data processing service, MapReduce is becoming prevalent and open to numerous clients from the Internet, much like today's search engine services. For example, a smartphone user may send a job to a MapReduce cluster through an app asking for the most popular words in the tweets logged during the past three days. Third, the characteristics of MapReduce jobs vary a lot, essentially because of the diversity of user demands. Recent analysis of the MapReduce workloads of current enterprise clients [10], e.g., Facebook and Yahoo!, has revealed the diversity of MapReduce job sizes, which range from seconds to hours. Overall, workload diversity is common in practice when jobs are submitted by different users. For example, some users run small interactive jobs while others submit large periodical jobs; some users run jobs that process files of similar sizes while the job sizes of other users vary widely. We thus argue that a good Hadoop scheduler should take this diversity of workload characteristics into consideration, with the goal of reducing MapReduce job execution times.

In this paper, we propose a novel Hadoop scheduler, called LsPS, which aims to improve the average job response time of Hadoop systems by leveraging job size patterns to tune its scheduling schemes among users and for each user as well. Specifically, we first develop a lightweight information collector that tracks statistics of recently finished jobs from each user. A self-tuning scheduling policy is then designed to schedule Hadoop jobs at two levels: the resource shares across multiple users are tuned based on the estimated job size of each user, and the job scheduling for each individual user is further adjusted to accommodate that user's job size distribution. Experimental results from both the simulation model and the Amazon EC2 Hadoop cluster environment confirm the effectiveness and robustness of our solution. We show that our scheduler improves the average job response times under a variety of system workloads in comparison with the FIFO, Fair, and Capacity schedulers, which have been widely adopted as the standard scheduling disciplines in MapReduce frameworks such as Apache Hadoop.

2 MOTIVATIONS

In order to investigate the pros and cons of the existing Hadoop schedulers (i.e., FIFO and Fair), we conduct several experiments in a Hadoop cluster at Amazon EC2. We lease 11 EC2 nodes, where one node serves as the master and the remaining 10 nodes run as the slaves. In this Hadoop cluster, each slave node contains two map slots and two reduce slots. The WordCount application is run to compute the occurrence frequency of each word in input files with different sizes. The randomtextwriter program is used to generate random files as inputs to the WordCount applications.

2.1 How to Share Slots

Specifically, there are two tiers of scheduling in a Hadoop system which is shared by multiple users: (1) Tier 1 is responsible for assigning free slots to active users; and (2) Tier 2 schedules jobs for each individual user. In this section, we first investigate different Hadoop scheduling policies at Tier 1. When no minimum share is specified for any user, the Fair scheduler allocates the available slots among the users such that all users get an equal share of slots over time. However, we argue that Fair unfortunately becomes inefficient when the job sizes of the active users are not uniform.

For example, we perform an experiment with two users such that user 1 submits 30 WordCount jobs that each scan a 180 MB input file, while user 2 submits six WordCount jobs that each scan a 1.6 GB input file. All the jobs are submitted at roughly the same time. We set the HDFS block size to 30 MB. Thus, each job from user 2 has 54 (1.6 GB / 30 MB) map tasks, while each job from user 1 has only six (180 MB / 30 MB) map tasks. The reduce task number of each job is set equal to its map task number. As the average task execution times of the two users are similar, the average job size (i.e., the average task number times the average task execution time) of user 2 is about nine times larger than that of user 1.

In the context of single-user job queues, it is well known that giving preferential treatment to shorter jobs can reduce the overall expected response time [11]. However, directly using the shortest job first (SJF) policy has several drawbacks. First, large jobs can be starved under SJF, and SJF lacks flexibility when a certain level of fairness or priority between users is required, which is common in real use cases. Moreover, SJF requires precise job size prediction before execution, which is not easy to achieve in real-world systems. In contrast, sharing-based scheduling easily solves the starvation problem and provides the flexibility to integrate fairness between users by, for example, setting up minimum shares for users. Allowing all users to run their applications concurrently also helps to improve the job size prediction accuracy in a Hadoop system by gathering information from finished tasks. Motivated by this observation and the analysis of discriminatory processor sharing among multiple users in [12], we evaluate discriminatory share policies in Hadoop systems. It is extremely hard to find an optimal share policy in a dynamic environment where user workload patterns may change frequently over time. Therefore, we opt to heuristically assign slots inversely proportional to the current average job sizes of the users, and to dynamically tune the shares over time according to workload pattern changes, as illustrated in the sketch below. This method introduces little overhead, which is important for a real implementation.
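To make the numbers in this example concrete, the short Python sketch below recomputes the map task counts and average job sizes of the two users and the slot split that the inverse-proportional heuristic suggests. This is our own illustration; the block size, slot count, and unit task time are the assumed values from the setup above, not measurements.

```python
import math

BLOCK_MB = 30         # assumed HDFS block size from the example above
TOTAL_MAP_SLOTS = 20  # 10 slaves x 2 map slots in the EC2 setup above
AVG_TASK_TIME = 1.0   # the two users' task times are similar, so use 1 unit

def map_tasks(input_mb):
    """One map task per HDFS block of the input file."""
    return math.ceil(input_mb / BLOCK_MB)

tasks_u1 = map_tasks(180)    # 180 MB inputs  -> 6 map tasks per job
tasks_u2 = map_tasks(1620)   # ~1.6 GB inputs -> 54 map tasks per job

# job size = task number x average task execution time
size_u1 = tasks_u1 * AVG_TASK_TIME
size_u2 = tasks_u2 * AVG_TASK_TIME
print(size_u2 / size_u1)     # ~9: user 2's jobs are about nine times larger

# heuristic: slot share inversely proportional to the average job size
weights = {"user1": 1 / size_u1, "user2": 1 / size_u2}
total = sum(weights.values())
shares = {u: TOTAL_MAP_SLOTS * w / total for u, w in weights.items()}
print(shares)                # {'user1': 18.0, 'user2': 2.0}, i.e., a 9:1 split
```

With equal task times, the 6:54 task-count ratio translates directly into the roughly 9:1 share ratio that the Fair_V2 variant uses in the experiments below.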
TABLE 1
Average Job Response Times (in Seconds) for Two Users with Different Job Sizes under Fair and Two Variants

            ShareRatio   User 1    User 2    All
Fair        1:1          548.06    1189.33   656.61
Fair_V1     1:9          1132.33   983.16    1107.47
Fair_V2     9:1          375.56    1280.66   516.41

TABLE 2
Average Job Response Times under FIFO and Fair when Job Sizes Have Three Different Distributions

        CV = 0       CV = 1       CV = 1.8
FIFO    239.10 sec   208.78 sec   234.95 sec
Fair    346.45 sec   220.11 sec   128.35 sec

We compare the Fair policy and two variants, i.e., sharing slots proportional to the average job sizes of the users (Fair_V1) and the inversely proportional sharing policy (Fair_V2), under this two-user scenario.

Table 1 shows the average response times of the jobs from user 1, user 2, and both users under Fair and the two variants, i.e., Fair_V1 and Fair_V2. We observe that Fair_V2 achieves a non-negligible improvement by assigning more slots to user 1, who has small jobs. We therefore conclude that when the job sizes of various users are not uniform, a good Hadoop scheduler should adjust the slot shares among multiple users based on their average job sizes, aiming to improve the overall performance in terms of job response times.

2.2 How to Schedule

Now, we look closely at the Hadoop scheduling policies at Tier 2, i.e., allocating slots to the jobs from the same user. As shown in [10], job execution times may vary from seconds to hours in enterprise workloads. The average job response time under the FIFO scheduling policy thus becomes unacceptable because small jobs are often stuck behind large ones and thus experience long waiting times. On the other hand, the Fair scheduler solves this problem by equally assigning slots to all jobs no matter what sizes those jobs have, thus avoiding long waits behind large jobs. However, the average job response time of the Fair scheduler depends on the job size distribution, similar to the Processor Sharing policy [13]: when the job sizes have high variance, i.e., the coefficient of variation of the job sizes CV > 1, Fair achieves better performance (i.e., shorter average job response time) than FIFO; but this performance benefit disappears when the job sizes become close to each other, with CV <= 1.

To verify this observation, we conduct experiments in our Hadoop cluster by running WordCount applications under three different job size distributions: (1) input files have the same size, with CV = 0; (2) input file sizes are exponentially distributed, with CV = 1; and (3) input file sizes are highly variable, with CV = 1.8. As shown in Table 2, when the input file sizes are exponentially distributed, both FIFO and Fair obtain similar average job response times; Fair significantly reduces the average job response times in the case of high variance but loses its superiority when all files have similar sizes.

Fig. 1. Response times of each WordCount job under FIFO and Fair when the input file sizes have different CV.

The response times of each job in the three experiments with different job size distributions are also plotted in Fig. 1. We observe that when the job sizes are similar, most jobs experience shorter response times under FIFO than under Fair, see Fig. 1a. However, as the variation of the job sizes increases, i.e., CV > 1, the percentage of jobs which finish more quickly under Fair increases as well, which thus allows Fair to achieve a better average job response time. These results further confirm that the relative performance of the two scheduling policies depends on the job size distribution. Clearly, the response time of each individual job is mainly related to that particular job's size under the Fair scheduling policy. On the other hand, under the FIFO policy, each job's response time may be affected by other jobs which were submitted earlier. FIFO allows most of the jobs to experience faster response times when the job sizes are similar, while most jobs finish faster under Fair when the jobs have variable sizes. The CV of the job sizes can thus serve as a threshold to determine which policy achieves shorter job response times under a certain workload. We thus argue that a good Hadoop scheduler should dynamically adjust the scheduling algorithms at Tier 2 according to the distribution of the job sizes.

3 ALGORITHM DESCRIPTION

Considering the dependency between map and reduce tasks, Hadoop scheduling can be formulated as a two-stage multi-processor flow-shop problem. However, finding the optimal solution with the minimum response times (flow times) is NP-hard [14]. Therefore, in this section we propose LsPS, an adaptive scheduling algorithm which leverages the knowledge of workload characteristics to dynamically adjust the scheduling schemes, aiming to improve efficiency in terms of job response times, especially under heavy-tailed workloads [6].

The details of our designed LsPS scheduler are presented in Algorithms 1-3. The architecture of LsPS is shown in Fig. 2.

TABLE 3
Notations Used in the Algorithm

U / u_i                   number of users / the ith user, i in [1, U].
J_i / job_{i,j}           set of all of user i's jobs / the jth job of user i.
t^m_{i,j} / t^r_{i,j}     average map/reduce task execution time of job_{i,j}.
t̄^m_i / t̄^r_i             average map/reduce task execution time of the jobs from u_i.
n^m_{i,j} / n^r_{i,j}     number of map/reduce tasks in job_{i,j}.
s_{i,j}                   size of job_{i,j}, i.e., the total execution time of its tasks.
S_i / S̃_i                 average size of the completed/current jobs from u_i.
CV_i / C̃V_i               CV of the completed/current job sizes of u_i.
SU_i / SJ_{i,j}           the slot share of u_i / the slot share of job_{i,j}.
AS_i                      the slot share that u_i actually received.
T_W                       time window for collecting historic information.
W                         window size for collecting historic information.

Fig. 2. The architecture overview of LsPS.

Briefly, LsPS consists of the following three components:

- Workload information collection: monitor the execution of each job and each task, and gather the workload information.
- Scheduling among multiple users: allocate (both map and reduce) slots to users according to their workload characteristics, i.e., scheduling at Tier 1.
- Scheduling for each individual user: tune the scheduling scheme for the jobs of each individual user based on that user's job size distribution, i.e., scheduling at Tier 2.

LsPS appropriately allocates slots to Hadoop users and guides each user to select the right scheduling algorithm for their own job queue, even under highly variable and heavy-tailed workloads. In the remainder of this section, we describe the detailed implementation of the above three components. Table 3 lists the notations used in the rest of this paper.

Algorithm 1. Overview of LsPS
1. When a new job from user i is submitted
   a. Estimate the job size and the average job size S̃_i of user i using Eq. (5);
   b. Adjust the slot shares among all active users, see Algorithm 2;
   c. Tune the job scheduling scheme for user i, see Algorithm 3;
2. When a task of job j from user i is finished
   a. Update the estimated average task execution time t̄_{i,j};
3. When the jth job from user i is finished
   a. Measure the average map/reduce task execution times t^m_{i,j} / t^r_{i,j} and the map/reduce task numbers n^m_{i,j} / n^r_{i,j};
   b. Update the history information of user i, i.e., t̄_i, S_i, CV_i, using Eqs. (1)-(4);
4. When a free slot is available
   a. Sort the users in decreasing order of their deficits SU_i - AS_i;
   b. Assign the slot to the first user in the sorted list;
   c. Increase the number of actually received slots AS_i by 1;
   d. Choose a job from that user u_i to get service based on the current scheduling scheme.

3.1 Workload Information Collection

When a Hadoop system is shared by multiple users, the job sizes and patterns of each user must be considered when designing an efficient scheduling algorithm. Therefore, a lightweight history information collector is introduced in LsPS for collecting the important historic information of jobs and users upon each job's completion. Here we collect and update the information of each job's map and reduce tasks separately, through the same functions. To avoid redundant description, we use the general term task to represent both types of tasks and the term size to represent the size of either the map phase or the reduce phase of each job in what follows.

Algorithm 2. Tier 1: Allocate Slots to Each User
Input: historic information of each active user;
Output: slot share SU_i of each active user;
for each user u_i do
    update that user's slot share SU_i using Eq. (6);
    for the jth job of user i, i.e., job_{i,j}, do
        if currently scheduling based on submission times then
            if job_{i,j} has the earliest submission time in J_i then
                SJ_{i,j} = SU_i;
            else
                SJ_{i,j} = 0;
        else
            SJ_{i,j} = SU_i / |J_i|.

Algorithm 3. Tier 2: Tune Job Scheduling for Each User
Input: historic information of each active user;
Output: UseFIFO vector;
for each user u_i do
    if user u_i is active, i.e., |J_i| > 1, then
        calculate the C̃V_i of the current jobs;
        if CV_i < 1 and C̃V_i < 1 then
            schedule the current jobs based on their submission times;
        if CV_i > 1 or C̃V_i > 1 then
            equally allocate slots among the current jobs;
            clear the history information and restart collection.
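As a reading aid, the following Python sketch mirrors the per-user decisions of Algorithms 2 and 3: choosing FIFO or Fair from the two CV values and then splitting the user's slot share SU_i among its jobs. It is a simplified illustration of the pseudocode above, not the actual Hadoop plug-in; the Job and User containers and their field names are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Job:
    submit_time: float
    est_size: float          # estimated size (task number x avg task time)

@dataclass
class User:
    jobs: List[Job] = field(default_factory=list)   # J_i: running/waiting jobs
    cv_hist: float = 0.0     # CV_i of completed job sizes (from the collector)
    use_fifo: bool = True    # Tier-2 scheme currently in effect
    slot_share: float = 0.0  # SU_i, set by the Tier-1 allocation (Eq. (6))

def cv(values):
    """Coefficient of variation of a list of job sizes."""
    if len(values) < 2:
        return 0.0
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return (var ** 0.5) / mean if mean > 0 else 0.0

def tier2_tune(user: User) -> None:
    """Algorithm 3: pick FIFO or Fair for this user's own queue."""
    cv_cur = cv([j.est_size for j in user.jobs])     # C~V_i of current jobs
    if user.cv_hist < 1 and cv_cur < 1:
        user.use_fifo = True                         # similar sizes -> FIFO
    elif user.cv_hist > 1 or cv_cur > 1:
        user.use_fifo = False                        # variable sizes -> Fair
        # on a conflict between the two CVs the collector is also reset (see text)

def tier1_job_shares(user: User) -> List[float]:
    """Algorithm 2: split the user's share SU_i into per-job shares SJ_{i,j}."""
    if not user.jobs:
        return []
    if user.use_fifo:
        first = min(range(len(user.jobs)),
                    key=lambda k: user.jobs[k].submit_time)
        return [user.slot_share if k == first else 0.0
                for k in range(len(user.jobs))]
    return [user.slot_share / len(user.jobs)] * len(user.jobs)
```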
In LsPS, the important history workload information that needs to be collected for each user u_i includes its average task execution times t̄^m_i (and t̄^r_i), its average job size S_i, and the coefficient of variation of its job sizes CV_i. We adopt Welford's one-pass algorithm [15] to update these statistics on-line as follows:

$s_{i,j} = t^m_{i,j} \cdot n^m_{i,j} + t^r_{i,j} \cdot n^r_{i,j}$,   (1)

$S_i = S_i + (s_{i,j} - S_i)/j$,   (2)

$v_i = v_i + (s_{i,j} - S_i)^2 \cdot (j-1)/j$,   (3)

$CV_i = \sqrt{v_i/j}\,/\,S_i$,   (4)

where s_{i,j} denotes the size of the jth completed job of user u_i (i.e., job_{i,j}), t^m_{i,j} (resp. t^r_{i,j}) represents the measured average map (resp. reduce) task execution time of job_{i,j}, and n^m_{i,j} (resp. n^r_{i,j}) is the measured map (resp. reduce) task number of job_{i,j}. A job's size s_{i,j} is defined as the summation of the execution times of all tasks of the job, which is independent of the level of task concurrency during the execution, i.e., concurrently running multiple map (or reduce) tasks may reduce the execution time of the job, but not the job size. The estimation of a job's size uses the historic task execution times of the job, based on the well-accepted assumption that tasks of the same type (either map or reduce) of the same job have similar execution times. Additionally, v_i / j denotes the variance of u_i's job sizes. S_i and v_i are both initialized to 0 and updated each time a new job finishes and its information is collected. The average map (resp. reduce) task execution time t̄^m_i (resp. t̄^r_i) can be updated as well with Equations (2)-(4) by replacing s_{i,j} with t^m_{i,j} (resp. t^r_{i,j}).

Our solution uses Welford's one-pass algorithm to update the statistics on-line without keeping all data records. Therefore, the data structure for each user's information only includes the user ID, the number of jobs that have been recorded, the average map/reduce task execution times, the average and variance of the job sizes, and the last update time for detecting inactive users. The memory space needed for each user is 48 bytes, and our solution only requires a total memory space of 288 bytes when considering six users in our experiments. This is a trivial overhead for regular MapReduce clusters. To further reduce the space overhead, our solution periodically evicts out-of-date user records if a user has not submitted any jobs in 30 minutes. In addition, the average task execution time of each active job (i.e., currently running or waiting) is recorded in another data structure, e.g., the JobInfo structure used by Fair. JobInfo stores information such as the number of running tasks of each active job; it is created when a job is submitted and deleted when that job is finished. Our solution extends this data structure to keep the estimated average task execution time of each active job by adding an additional 24 bytes to JobInfo.

We use a moving window to collect and update the workload information of each user. Let T_W be a window for monitoring the past scheduling history. In each monitoring window, the system completes exactly W jobs; we set W = 100 in all the experiments presented in this paper. We also assume that the scheduler is able to correctly measure the information of each completed job, such as its map/reduce execution times as well as the number of map/reduce tasks. This assumption is reasonable for most Hadoop systems. Upon each job's completion, LsPS updates the workload statistics of the job's owner using the above equations, i.e., Eqs. (1)-(4), see Algorithm 1 step 3. The statistics collected in the present monitoring window are then utilized by LsPS to tune the schemes for scheduling the following jobs.

3.2 Scheduling Among Multiple Users

In this section, we present our algorithm (i.e., Algorithm 2) for scheduling among multiple users. Our goal is to decide the deserved amount of slots and to allocate an appropriate number of slots to each active user to run their jobs. In a MapReduce system, there are two types of slots, i.e., map slots and reduce slots. Therefore, we have designed two algorithms, one for allocating map slots and the other for allocating reduce slots. However, they share the same design principles. For simplicity, we present a general form of the algorithm in the rest of this section.

Basically, our solution is motivated by the observation (see Section 2.1) that equally assigning slot shares cannot achieve better performance than policies that allocate slots based on average job sizes when the job sizes of the various users are not uniform. Therefore, we propose to adaptively adjust the slot shares among all active users such that the share ratio is inversely proportional to the ratio of their average job sizes. For example, in a simple case of two users, if their average job size ratio is equal to 1:2, then the number of slots assigned to user 1 will be twice that assigned to user 2. Consequently, LsPS implicitly gives higher priority to users with smaller jobs, resulting in shorter job response times.

One critical issue that needs to be addressed is how to correctly measure the execution times of the map or reduce phases of jobs that are currently running or waiting for service. In Hadoop systems, it is not possible to obtain the exact execution times of a job's tasks before the job is finished. However, job sizes are predictable in a Hadoop system, as discussed earlier in this section. In this work, we estimate a job's size as its task number times its average task execution time, through the following steps: (1) the number of tasks of the jth job from user i (job_{i,j}), i.e., n_{i,j}, can be obtained immediately when the job is submitted; (2) similar to [16], we assume that the execution times of the tasks from the same job are close to each other, and thus the average execution time, t̄_{i,j}, of the finished tasks of a currently running job job_{i,j} can be used to represent the overall average task execution time t_{i,j} of that job; and (3) for pending jobs and for running jobs that have no finished tasks yet, pre-configured job profiles can be used to estimate the task execution times [17]. If no job profiles are available, then our solution uses the historic information of past jobs from the same user, i.e., the average task execution time of recently finished jobs from u_i (t̄_i), to approximate the average task execution time t̄_{i,j} of job_{i,j}.
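A minimal Python sketch of the information collector and the job-size estimate described in this section is given below, assuming one statistics object per user that is updated once per completed job via Eqs. (1)-(4); the class and field names are ours, and the reduce-side statistics would be kept in a second, identical instance.

```python
import math

class UserStats:
    """One-pass (Welford-style) statistics for one user, cf. Eqs. (1)-(4)."""
    def __init__(self):
        self.jobs_seen = 0        # j: number of completed jobs recorded
        self.avg_size = 0.0       # S_i: running mean of completed job sizes
        self.var_sum = 0.0        # v_i: running sum used for the variance
        self.avg_task_time = 0.0  # t_i: running mean map task execution time

    def record_completed_job(self, t_map, n_map, t_red, n_red):
        size = t_map * n_map + t_red * n_red            # Eq. (1)
        self.jobs_seen += 1
        j = self.jobs_seen
        delta = size - self.avg_size
        self.avg_size += delta / j                      # Eq. (2)
        self.var_sum += delta * (size - self.avg_size)  # Eq. (3), one-pass form
        # the per-user average task time is updated the same way (Eqs. (2)-(4))
        self.avg_task_time += (t_map - self.avg_task_time) / j

    @property
    def cv(self):
        """CV_i, Eq. (4): standard deviation of job sizes over their mean."""
        if self.jobs_seen == 0 or self.avg_size == 0:
            return 0.0
        return math.sqrt(self.var_sum / self.jobs_seen) / self.avg_size

def estimate_job_size(n_tasks, finished_task_times, user_avg_task_time):
    """Size of a running/pending job: task count x estimated avg task time."""
    if finished_task_times:                 # step (2): use its finished tasks
        t_avg = sum(finished_task_times) / len(finished_task_times)
    else:                                   # step (3): fall back to user history
        t_avg = user_avg_task_time
    return n_tasks * t_avg
```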
Therefore, the average map-phase size of the jobs from user u_i is calculated as follows:

$\widetilde{S}_i = \frac{1}{|J_i|} \sum_{j=1}^{|J_i|} n^m_{i,j} \cdot \bar{t}^m_{i,j}$,   (5)

where J_i represents the set of jobs from user u_i that are currently running or waiting for service. The average reduce-phase size of the jobs from user u_i is calculated in the same way. Fig. 3 illustrates an example, where the estimated average map task execution times of both a Sort job and a Grep job quickly converge to a stable value that is consistent with the actual one, i.e., the average task execution time of all 300 map tasks.

Fig. 3. Illustrating actual task execution times and on-line estimated average task execution times of a Sort job with 300 map tasks and a Grep job with 300 map tasks.

As shown in Algorithm 2 step 1, once a new job arrives, LsPS updates the average job size of that job's owner and then adaptively adjusts the deserved map slot shares (SU_i) among all active users using Eq. (6):

$SU_i = SU^f_i \cdot \Big( \alpha \cdot U \cdot \frac{\widetilde{S}_i^{-1}}{\sum_{i=1}^{U} \widetilde{S}_i^{-1}} + (1-\alpha) \Big)$,   (6)

$\forall i,\; SU_i > 0$,   (7)

$\sum_{i=1}^{U} SU_i = \sum_{i=1}^{U} SU^f_i$,   (8)

where SU_i represents the estimated slot share that should be assigned to user u_i, SU^f_i represents the deserved slot share of user u_i under the Fair scheme, i.e., equally dispatching the slots among all users, U indicates the number of users that are currently active in the system, and α is a tuning parameter within the range from 0 to 1. Parameter α in Eq. (6) controls how aggressively LsPS biases towards the users with smaller jobs: when α is close to 0, our scheduler increases the degree of fairness among all users, performing similarly to Fair; and when α is increased to 1, LsPS gives a strong bias towards the users with small jobs in order to improve the efficiency in terms of job response times. In the remainder of the paper, we set α to 1 if not explicitly specified otherwise. When all users have the same average job size, SU_i is equal to SU^f_i, i.e., slots are allocated fairly among users. Also, when Eq. (6) is used to calculate SU_i for each user, it is guaranteed that no active user is starved for map/reduce slots, see Eq. (7), and that all available slots in the system are fully distributed to the active users, see Eq. (8).

The resulting deserved slot shares (i.e., SU_i) are not necessarily equal to the actual assignments among users (i.e., AS_i). They are used to determine which user receives a slot that has just become available for redistribution, see Algorithm 1 step 4. LsPS sorts all active users in non-increasing order of their deficits, i.e., the gap between the expected assigned slots (SU_i) and the actually received slots (AS_i), and then dispatches that particular slot to the user with the largest deficit. Additionally, it might happen in a Hadoop system that some users have high deficits but their actual demands on map/reduce slots are smaller than their expected shares. In such a case, LsPS re-dispatches the extra slots to those users who have lower deficits but need more slots for serving their jobs.

3.3 Scheduling for a User

The second design principle used in LsPS is to dynamically tune the scheduling scheme for the jobs within an individual user by leveraging the knowledge of the job size distribution. As observed in Section 2.2, the scheme of equally distributing shared resources performs better by preventing small jobs from waiting behind large ones. However, when the jobs have similar sizes, scheduling jobs based on their submission times becomes superior to the former scheme.

Therefore, our algorithm (as described in Algorithm 3) considers the CV of the job sizes, i.e., map size plus reduce size, of each user to determine which scheme should be used to distribute the free slots to the jobs from that user. To improve the accuracy, we combine the historic job size information and the estimated size distribution of the running and waiting jobs in the system. CV_i of the sizes of the already finished jobs of user i is provided directly by the history information collector, and C̃V_i of the sizes of the waiting and running jobs is calculated based on the estimated job sizes. When the two values of a user are both smaller than 1, LsPS schedules the current jobs in that user's sub-queue in the order of their submission times; otherwise, the user-level scheduler fairly assigns slots among the jobs. When the two values conflict, i.e., CV_i > 1 and C̃V_i < 1 or vice versa, which means that the user's workload pattern may be changing, the fair scheme is adopted, the history information is cleared, and a new collection window is started at this time, see the pseudo-code in Algorithm 3.
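The Tier-1 allocation of Eqs. (6)-(8) and the deficit-based dispatch of Algorithm 1 step 4 can be sketched as follows. This is a simplified illustration under the stated equations; the function and variable names are ours, and the example numbers are arbitrary.

```python
def lsps_shares(avg_sizes, total_slots, alpha=1.0):
    """Eq. (6): slot shares biased inversely to current average job sizes.

    avg_sizes: dict user -> estimated average job size (S~_i, Eq. (5));
    the fair share SU^f_i is total_slots / U for every user.
    """
    users = list(avg_sizes)
    U = len(users)
    fair = total_slots / U
    inv = {u: 1.0 / avg_sizes[u] for u in users}
    inv_sum = sum(inv.values())
    return {u: fair * (alpha * U * inv[u] / inv_sum + (1.0 - alpha))
            for u in users}

def pick_user_for_free_slot(shares, received):
    """Algorithm 1 step 4: serve the user with the largest deficit SU_i - AS_i."""
    return max(shares, key=lambda u: shares[u] - received.get(u, 0))

# example: three users sharing 30 map slots, average job sizes 10, 20, and 40
shares = lsps_shares({"u1": 10.0, "u2": 20.0, "u3": 40.0}, total_slots=30)
print(shares)   # u1 gets the largest share, u3 the smallest; shares sum to 30
winner = pick_user_for_free_slot(shares, received={"u1": 12, "u2": 9, "u3": 5})
print(winner)   # 'u1', the user whose actual assignment lags its share most
```

Note that with alpha = 1 the shares sum to the total slot count and every share stays positive, which matches the guarantees of Eqs. (7) and (8).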
4 MODEL DESCRIPTION

In this section, we introduce a queuing model that is developed to emulate a Hadoop system. The main purpose of this model is to compare various Hadoop scheduling schemes and to provide a first validation of our new approach. The model does not include all the details of a complex Hadoop system, but it provides a general guideline that is useful for performance evaluation without running experimental tests. In Section 5, besides experiments on a real Hadoop cluster, we also conduct trace-driven simulations based on this model to evaluate the performance improvement of LsPS.

Fig. 4. Design of the Hadoop MapReduce cluster simulator.

The model, as shown in Fig. 4, consists of two queues for map tasks (Q_m) and reduce tasks (Q_r), respectively. Once a job is submitted, its tasks are inserted into Q_m (resp. Q_r) through the map (resp. reduce) task dispatcher. Furthermore, the model includes s servers to represent the s available slots in the system, such that s_m servers are used to serve map tasks while the remaining servers, i.e., s_r = s - s_m, connect to the reduce queue for executing reduce tasks. Note that the values of {s_m, s_r} are based on the actual Hadoop configuration.

An important feature of MapReduce jobs that needs to be considered in the model is the dependency between map and reduce tasks. Typically, in a Hadoop cluster, there is a parameter which decides when a job can start its reduce tasks. By default, this parameter is set to 5 percent, which indicates that the first reduce task can be launched when 5 percent of the map tasks are committed. Under this setting, a job's first wave of reduce tasks overlaps with its map phase and can prefetch the output of the map tasks in the overlapping period. However, previous work [18] found that this setting can lead to performance degradation under the Fair scheduling policy and proposed to launch reduce tasks gradually according to the progress of the map phase. We further found that delaying the launch time of the reduce tasks, i.e., setting the parameter to a large value such as 100 percent, can improve the performance of Fair and the other slot-sharing based schedulers. Therefore, in our experiments, we set the parameter to 100 percent, i.e., reduce tasks run only when all map tasks are completed, for all three policies (i.e., FIFO, Fair, and our LsPS). However, this is not a necessary assumption in the model. Our scheduler works in the same way under the other two settings, i.e., launching the first reduce task when 5 percent of the map tasks are committed, or launching the reduce tasks gradually according to the progress of the map phase.

Finally, compared to the complex architecture of a MapReduce system, the simulator built on this model is a much-simplified tool. Our goal is to use a queuing model to compare different scheduling policies quickly and effectively. The key point of the simulation model is to capture the impact of different scheduling policies on the job response time under different situations. Therefore, the model mainly simulates this key feature, i.e., how slots are shared, without capturing the low-level details of a MapReduce system, such as communication costs, data locality, and fault-tolerance mechanisms.

5 EXPERIMENTAL RESULTS

In this section, we present the performance evaluation of the proposed LsPS scheduler, which aims to improve the efficiency of a Hadoop system, especially under highly variable and/or bursty workloads from different users.

5.1 Simulation Evaluation

We first evaluate LsPS with our simulation model, which is developed to emulate a classic Hadoop system. On top of this model, we use trace-driven simulations to evaluate the performance improvement of LsPS in terms of average job response times. Later, we verify the performance of LsPS by implementing the proposed policy as a plug-in scheduler in an EC2 Hadoop cluster.

In our simulations, we configure the numbers of both map and reduce slots in the cluster to be equal to 300. We also have U users {u_1, ..., u_i, ..., u_U} share the Hadoop cluster by submitting J_i jobs to the system. The specification of each user includes its job inter-arrival times and its job sizes, which are created based on the specified distributions and methods. Recall that each Hadoop job's size is determined by the number of map (resp. reduce) tasks of that job as well as the execution time of each map (resp. reduce) task. In our model, we vary the distributions of the map/reduce task numbers to investigate various job size patterns, while fixing a uniform distribution to draw the execution times of the map/reduce tasks.

In general, we consider the following four different distributions to generate job inter-arrival times and job map/reduce task numbers; a sketch of how such traces can be generated is given after this list.

- Uniform distribution (u), which indicates similar job sizes.
- Exponential distribution (e), which implies medium diversity of job sizes or inter-arrival times.
- Hyperexponential distribution (h), which means high variance of the traces.
- Bursty pattern (b), which indicates high variance and high auto-correlation of the traces.

Here, we use a hyperexponential distribution to approximate a heavy-tailed distribution [19], [20], which has often been observed in MapReduce clusters [6].
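The sketch below shows one way to draw job sizes or inter-arrival times that follow the four patterns above. It is our own illustrative generator rather than the authors' trace tool; the two-branch hyperexponential parameters and the on/off burst model are assumed choices that only reproduce the qualitative variance and correlation properties described.

```python
import random

def uniform_trace(n, mean):
    """Similar sizes: uniform around the mean (CV well below 1)."""
    return [random.uniform(0.5 * mean, 1.5 * mean) for _ in range(n)]

def exponential_trace(n, mean):
    """Medium diversity: exponential with CV = 1."""
    return [random.expovariate(1.0 / mean) for _ in range(n)]

def hyperexponential_trace(n, mean, p=0.9, ratio=10.0):
    """High variance (CV > 1): a two-branch hyperexponential.

    With probability p draw from a 'small' exponential, otherwise from a
    'large' one; branch means are scaled so the overall mean stays `mean`.
    """
    small = mean / (p + (1 - p) * ratio)
    large = ratio * small
    return [random.expovariate(1.0 / (small if random.random() < p else large))
            for _ in range(n)]

def bursty_trace(n, mean, switch_prob=0.2, burst_scale=8.0):
    """High variance plus auto-correlation: runs of large values (bursts)
    alternate with runs of small ones; the output is rescaled to the mean."""
    out, in_burst = [], False
    for _ in range(n):
        if random.random() < switch_prob:      # occasionally flip the phase
            in_burst = not in_burst
        scale = burst_scale if in_burst else 1.0
        out.append(random.expovariate(1.0 / (mean * scale)))
    factor = mean / (sum(out) / len(out))      # rescale to the target mean
    return [x * factor for x in out]

sizes = hyperexponential_trace(1000, mean=50)  # e.g., map task counts per job
```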
We first consider two simple cases where the cluster is shared by two users. We evaluate the impact of different job size patterns in case 1 and of different job arrival patterns in case 2. We then validate the robustness of LsPS with a complex case where the cluster is shared by multiple users with different job size and arrival patterns.

5.1.1 Simple Case 1: Two Users with Diverse Job Size Patterns

Consider a simple case of two users, i.e., u_1 and u_2, that concurrently submit Hadoop jobs to the system. We first focus on evaluating the different Hadoop schedulers under various job size patterns, i.e., we conduct experiments with different job size distributions for u_2, while always keeping the uniform distribution to generate the job sizes of u_1. Specifically, we consider u_2 with (1) similar job sizes; (2) high variability in job sizes; and (3) high variability and strong temporal dependence in job sizes. We also set the job size ratio between u_1 and u_2 to 1:1, i.e., the two users have the same average job size. Furthermore, both users have exponentially distributed job interarrival times with the same mean of 300 seconds.

Fig. 5. Average job response times of (a) two users, (b) user 1, and (c) user 2 under three different scheduling policies and different job size distribution settings. The relative improvement with respect to Fair is also plotted on each bar of LsPS.

Fig. 5 shows the mean job response times of both users under the different policies and the relative improvement with respect to Fair. Job response time is measured from the moment when a particular job is submitted to the moment when all its associated map and reduce tasks are finished. We first observe that high variability in job sizes dramatically degrades the performance under FIFO, as a large number of small jobs are stuck behind the extremely large ones, see plot (a) in Fig. 5. In contrast, both Fair and LsPS effectively mitigate such negative performance effects by equally distributing the available slots between the two users and within a single user. Our policy further improves the overall performance by shifting the scheduler to FIFO for the jobs from u_1 and thus significantly reducing its mean job response time, by 60 and 62 percent with respect to Fair when the job sizes of user 2 are highly variable (i.e., the (u, h) setting) and temporally dependent (i.e., the (u, b) setting), respectively, see plot (b) in Fig. 5. On the other hand, Fair loses its superiority when both users have similar job sizes, while our new scheduler uses the features of both users' job sizes to tune the scheduling at the two tiers and thus achieves performance close to the best one.

To further investigate the tail of the job response times, we plot in Fig. 6 the complementary cumulative distribution functions (CCDFs) of the job response times, i.e., the probability that the response time experienced by an individual job is greater than the value on the horizontal axis, for both users under the three scheduling policies. Consistently, almost all jobs from the two users experience shorter response times under Fair and LsPS than under FIFO when the job sizes of u_2 are highly variable. In addition, compared to Fair, LsPS reduces the response times for more than 60 percent of the jobs, yielding shorter tails in the job response times.

Fig. 6. CCDFs of response times of all jobs under different schedulers, where user 1 has a uniform job size distribution while user 2 has (a) similar job sizes; (b) high variability in job sizes; and (c) high variability and strong temporal dependence in job sizes.

In order to analyze the impact of the relative job sizes on LsPS performance, we conduct another two sets of experiments with various job size ratios between the two users, i.e., we keep the same parameters as in the previous experiments but tune the job sizes of u_2 such that the average job size of u_1 is 10 times smaller (resp. larger) than that of u_2, see the results shown in Fig. 7(I) (resp. Fig. 7(II)). In addition, we tune the job arrival rates of u_2 to keep the same load in the system. Recall that our LsPS scheduler always gives higher priority, i.e., assigns more slots, to the user with the smaller average job size, see Section 3.2. As a result, LsPS achieves non-negligible improvements in overall job response times no matter which user has the smaller jobs, see plots (a) in Figs. 7(I) and (II). Further confirmation of this benefit comes from the plots in Figs. 8(I) and 8(II), which show that most jobs experience the shortest response times under LsPS. Indeed, the part of the workload whose job sizes are large receives increased response times, but the number of penalized jobs is less than 5 percent of the total.

Now, we look closely at each user's response times. We observe that under the two different job size ratios, LsPS always achieves a significant improvement in job response times for the user which submits small jobs on average, by assigning more slots to that user, see plot (b) in Fig. 7(I) and plot (c) in Fig. 7(II). Meanwhile, although LsPS discriminately treats the other user (i.e., the one having larger jobs) with fewer resources, this policy does not always sacrifice that user's performance. For example, as shown in plot (c) of Fig. 7(I), when the job sizes of u_2 are highly variable and/or strongly dependent, shorter response times are achieved under LsPS or Fair than under FIFO because small jobs now have reserved slots and do not wait behind the large ones. Another example can be found in plot (b) of Fig. 7(II), where we observe that LsPS is superior to Fair on the performance of user 1 by switching the Tier 2 scheduling algorithm to FIFO.
Fig. 7. Average job response times of (a) two users, (b) user 1, and (c) user 2 under three different scheduling policies. The relative job size ratio between the two users is (I) 1:10, and (II) 10:1.

Fig. 8. CCDFs of response times of all jobs under three different scheduling policies. The relative job size ratio between the two users is (I) 1:10, and (II) 10:1.

5.1.2 Simple Case 2: Two Users with Diverse Job Arrival Patterns

We now turn to consider changes in the job arrival patterns. We conduct experiments with varying arrival processes for the second user, i.e., u_2, while always fixing uniform job size distributions for both users as well as a relative job size ratio of 1:10 between them. The job interarrival times of u_2 are drawn from three different arrival patterns, i.e., exponential, hyper-exponential, and bursty, while user 1's job interarrival times are exponentially distributed in all the experiments. We depict the average job response times of the two users in Fig. 9 and the CCDFs of the job response times in Fig. 10.

Consistent with the previous experiments, our LsPS scheduler performs best in terms of the overall job response times, see plot (a) in Fig. 9. We observe that this benefit indeed comes from the response time improvement of u_1, i.e., LsPS assigns u_1 more slot shares due to its smaller average job size and further schedules its jobs based on the FIFO discipline because its job sizes have low variability. However, compared to FIFO, this outcome unfortunately penalizes user 2, especially when this user's arrival process is hyper-exponential or bursty, see plot (c) in Fig. 9. Meanwhile, due to the uniform job size distribution, LsPS schedules the jobs from u_2 in the order of their submission times, which compensates for the reduced resources and thus lowers the average response time when compared to Fair. The CCDFs shown in Fig. 10 further confirm that a large portion of the jobs experiences shorter response times under LsPS than under the other two policies.
Fig. 9. Average job response times of (a) two users, (b) user 1, and (c) user 2 under different scheduling policies and different job interarrival time distributions. The relative job size ratio of the two users is 1:10.

Fig. 10. CCDFs of job response times under different schedulers, where the job interarrival times of user 1 are exponentially distributed and user 2's arrival process is (a) exponential; (b) hyper-exponential; and (c) bursty. The relative job size ratio of the two users is 1:10.

5.1.3 Complex Case: Multiple Users with Diverse Job Arrival/Size Patterns

To further verify the robustness of LsPS, we conduct experiments under a more complex case of six users which have mixed workloads of varying job arrival and job size patterns. Table 4 presents the detailed experimental settings. Here, users with larger IDs have relatively larger job sizes on average. We also adjust the average arrival rate of each user such that all the users submit the same load to the system. Table 5 and Fig. 11 present the average job response times as well as the distributions of the job response times of all users under the different scheduling policies. We compare our proposed LsPS scheduler with the commonly adopted ones, i.e., the FIFO, Fair, and Capacity schedulers. For the Capacity scheduler, we consider three different capacity sharing configurations, i.e., equal capacity shares in Capacity_v1, capacity shares proportional to each user's average job size in Capacity_v2, and capacity shares inversely proportional to each user's average job size in Capacity_v3. Table 5 shows the average job response times of each individual user and of all six users as well. Furthermore, in order to analyze the impact of parameter α in Eq. (6), Table 5 also shows the simulation results under LsPS with α equal to 0.3, 0.6, and 1.0.

We first observe that LsPS with different α significantly improves the overall job response times compared to FIFO, Fair, and Capacity. Meanwhile, the average response times of the first four users are improved as well under LsPS because those users have relatively smaller job sizes and thus receive more map/reduce slots for executing their jobs. On the other hand, although the last two users u_5 and u_6 are assigned the smallest numbers of slots, their average job response times are not dramatically increased. In contrast, their jobs even experience faster response times compared to Fair for u_5, and compared to FIFO for u_6. The main reason is that LsPS switches the scheduling for each user between FIFO and Fair based on their job size distributions and thus improves that particular user's response times. Additionally, LsPS completes the jobs from the first four users within a short time period, such that the occupied slots are released soon and then reassigned to the last two users, which further decreases the job response times of these two users. We also observe that the different sharing configurations have significant impacts on Capacity's performance. However, even under Capacity_v3, the average job response time is still two times slower than under LsPS when α is set to 1.0.

Recall that parameter α in Eq. (6) is a tuning parameter that controls how aggressively LsPS discriminates large jobs from small ones. The larger α is, the stronger the bias given towards the users with small jobs. Table 5 shows that LsPS with α = 1.0 introduces the strongest bias on user slot shares and achieves the best response time improvement. Therefore, we set α = 1.0 in the remainder of this paper.

5.2 Case Studies in Amazon EC2

To further verify the effectiveness and robustness of our new scheduler, we implement and evaluate LsPS in Amazon EC2, a cloud platform that provides pools of computing resources to developers for flexibly configuring and scaling their computational capacity on demand.

5.2.1 Experimental Settings

In particular, we lease an m1.large instance as the master node, which provides two virtual cores with two EC2 Compute Units each, 7.5 GB memory, and 850 GB storage, to perform the heartbeat and jobtracker routines for job scheduling. We also use the same 11 m1.large instances to launch slave nodes, each of which is configured with two map slots and two reduce slots. Such a configuration ensures that the system bottleneck is not our scheduler on the master node, while the overall job response times depend on the scheduling algorithms as well as the processing capability of each slave node.
TABLE 4
Experimental Settings for Each User

User   Job Size Pattern   Arrival Pattern   Avg. Size Ratio
1      Bursty             Hyper-exp.        1
2      Exponential        Exponential       5
3      Uniform            Exponential       10
4      Hyper-exp.         Exponential       20
5      Uniform            Bursty            50
6      Hyper-exp.         Hyper-exp.        100

TABLE 5
Average Response Times (in Seconds) of All Users and Each User under Different Scheduling Policies

User   FIFO       Fair       Capacity_v1   Capacity_v2   Capacity_v3   LsPS (α=0.3)   LsPS (α=0.6)   LsPS (α=1.0)
1      7357.60    211.43     606.52        8688.95       193.64        163.00         150.05         142.51
2      11520.03   283.43     426.94        9648.50       242.47        234.64         222.00         220.20
3      10822.45   475.00     376.19        4992.06       357.96        276.18         258.88         254.41
4      10626.55   1182.14    2083.75       2532.49       2372.63       742.41         734.50         647.84
5      12017.48   40677.55   30441.12      22275.19      29416.86      22637.01       17557.73       13317.92
6      11346.46   3318.84    14598.88      3415.35       14997.93      3194.37        4760.66        5587.35
All    8488.00    939.24     1211.70       8513.58       886.27        583.05         505.94         441.66

Fig. 11. CCDFs of job response times of the complex case where six users have mixed workloads.

As the Hadoop project [2] provides an API to support pluggable schedulers, we implement our proposed LsPS scheduling policy in Amazon Hadoop by extending the TaskScheduler interface. In particular, we add a module in our scheduler to periodically predict the job sizes of users based on the execution times of finished tasks (which are recorded for logging purposes in the original Hadoop implementation). We also integrate another module to calculate the slot shares between users upon the submission of new jobs and to assign tasks of different users according to the deficiency between their running tasks and their deserved slot assignments.

The benchmarks we consider for performance evaluation include the following four classic MapReduce applications.

- WordCount: takes text files as inputs and computes the occurrence frequency of each existing word;
- PiEstimator: estimates the value of π using a quasi-Monte Carlo method;
- Grep: extracts and counts the strings in the input files that match a given regular expression;
- Sort: takes sequential files as inputs and fragments and sorts the input data.

In addition, the randomtextwriter program is used to generate a random file as the input to the WordCount and Grep applications. We also run the RandomWriter application to write 10 GB of random data, which is used as the input to the Sort applications. For the PiEstimator applications, we set the sample space of each map task to 100 million random points. We have 20 map tasks for each PiEstimator job, thus the total number of random points for one π estimation is 2 billion.

5.2.2 Workloads with Mixed Applications

We conduct experiments with mixed MapReduce applications, aiming to evaluate LsPS in a diverse environment of both CPU-bound applications, such as PiEstimator, and IO-bound applications, e.g., the WordCount and Grep applications. In this experiment, there are four users, each of which submits a set of jobs for one of the above four MapReduce applications, according to the specified job size and arrival patterns, see Table 6.

The experimental results for the overall and each user's average job response times are shown in Table 7. We first observe that LsPS reduces the overall job response times by a factor of 3.5 and 1.8 over FIFO and Fair, respectively. We interpret this as an outcome of setting suitable scheduling algorithms for each user based on their corresponding workload features. More importantly, LsPS significantly reduces the average response times of the first three users and even slightly improves the average response time of user 4, which is actually assigned fewer resources due to its large job size. The CCDFs of the job response times are depicted in Fig. 12. A large fraction of jobs experience faster response times under LsPS than under FIFO and Fair, while the number of penalized jobs which receive increased response times is less than 3 percent of the total. Summarizing, the real experimental results are consistent with the results shown in our simulations (see Section 5.1), which further confirms the effectiveness and robustness of our new LsPS scheduler.

5.2.3 Non-Stationary Workloads

In the previous sections, we have confirmed that LsPS performs effectively under stationary workloads, where all users have stable job size/arrival patterns. Now, we turn to evaluate LsPS under non-stationary workloads, further verifying its effectiveness when the workloads of some users change over time.

In particular, we conduct experiments with two users which submit a set of WordCount jobs to our Hadoop cluster consisting of 18 map and 18 reduce slots on m1.large instances (i.e., slave nodes) in Amazon EC2.
422 IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 3, NO. 4, OCTOBER-DECEMBER 2015

TABLE 6
Case Study: Experimental Settings for Four Users in Amazon EC2

User Job Average Input Input Size Job Arrival Average Inter-arrival Submission
Type Size Pattern Pattern Time Number
1 WordCount 100 MB Exponential Bursty 20 sec 150
2 PiEstimator - - Uniform 30 sec 100
3 Grep 2000 MB Bursty Exponential 100 sec 30
4 Sort 10 GB Uniform Exponential 600 sec 5

In particular, we conduct experiments with two users that submit WordCount jobs to our Hadoop cluster consisting of 18 map and 18 reduce slots on m1.large instances (i.e., slave nodes) in Amazon EC2. We further generate a non-stationary workload by changing the job size/arrival patterns of user 1 while keeping user 2's input file sizes and job interarrival times fixed, both exponentially distributed with means of 500 MB and 25 seconds, respectively. Table 8 summarizes the three changes in user 1's workload.

Table 9 shows the mean response times across both users as well as for each individual user under the Fair and LsPS policies. The average job response times measured during each period are also shown in the table. We observe that LsPS successfully captures the changes in user 1's workload and dynamically tunes the two-level scheduling (i.e., between the two users and within each user) based on the measured job size/arrival patterns. As a result, LsPS achieves noticeable response time improvements of 26, 24, and 40 percent during the three periods, and of 34 percent overall, with respect to Fair.

To better understand how LsPS handles non-stationary workloads, Fig. 13 illustrates how LsPS dynamically adjusts its two-level scheduling algorithms in an on-line fashion. Specifically, the transient distribution of the 18 map slots between the two users is depicted in Fig. 13a, where red areas indicate the slots assigned to user 1 while green areas represent those assigned to user 2. We also plot the changes of the scheduling scheme within user 1 as a function of time in Fig. 13b.

As observed, during the first period, LsPS assigns more slots to user 1 than to user 2 because LsPS detects that user 1 has the smaller average job size, see Fig. 13a. Meanwhile, the jobs from user 1 are scheduled according to the FIFO discipline, see Fig. 13b, which further reduces the response times of user 1 and thus results in better overall response times during this period. Once LsPS captures the change in user 1's job size distribution, i.e., from uniform to hyperexponential, LsPS quickly switches the scheduling within user 1 from FIFO to Fair and thus consistently achieves shorter response times in the second period, see Fig. 13b. Later, when user 1 starts to submit large jobs with the uniform distribution, LsPS turns to dispatch more resources to user 2, decreasing its job response times during the last period. On the other hand, user 1 still experiences shorter job response times than under Fair even though this user now receives fewer resources. We attribute this to the FIFO scheduling used for this user in the third period. Moreover, the long delay during the shift from period 2 to period 3 affects only a few jobs, because the job interarrival time becomes quite long during period 3.

Fig. 13c shows the number of jobs that are running or waiting for service under the Fair and LsPS policies, giving evidence that LsPS can consistently improve the average job response times by dynamically adapting the job scheduling to workload changes. Therefore, we conclude that these results strongly demonstrate the effectiveness and robustness of LsPS under both stationary and non-stationary workloads.
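To make the adaptation described above more concrete, the toy sketch below shows one way such a two-level decision could be expressed: slot shares are biased toward users with smaller average job sizes, and the within-user discipline is switched between FIFO and Fair depending on how variable a user's recent job sizes are. The inverse-proportional share rule and the threshold of 1.0 on the coefficient of variation are illustrative assumptions, not the exact LsPS formulas.

    from statistics import mean, pstdev

    def choose_policy(job_sizes, cv_threshold=1.0):
        """Pick the within-user discipline from recent job sizes.
        Highly variable (e.g., heavy-tailed) sizes favor Fair sharing,
        while similar-sized jobs favor FIFO. The threshold is illustrative."""
        m = mean(job_sizes)
        cv = pstdev(job_sizes) / m if m > 0 else 0.0
        return "Fair" if cv > cv_threshold else "FIFO"

    def split_slots(avg_job_sizes, total_slots):
        """Bias slot shares toward users with smaller average job sizes
        (an inverse-proportional split; the real scheduler's weighting differs)."""
        weights = {u: 1.0 / s for u, s in avg_job_sizes.items()}
        total_w = sum(weights.values())
        return {u: round(total_slots * w / total_w) for u, w in weights.items()}

    # Example mirroring the two-user experiment: user 1 submits smaller jobs than user 2.
    print(split_slots({"user1": 100, "user2": 500}, total_slots=18))  # more slots to user 1
    print(choose_policy([100, 95, 110, 105]))                         # uniform-like sizes -> FIFO
    print(choose_policy([10, 5, 8, 400, 7, 6]))                       # heavy-tailed sizes -> Fair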

TABLE 7
Average Response Times (in Seconds) of All Users and Each User in the Amazon Hadoop Cluster

User   FIFO     Fair     LsPS
1      251.36   121.18   67.50
2      280.06   149.95   74.79
3      235.33   118.36   75.18
4      330.20   248.00   209.00
All    259.61   132.90   74.16

Fig. 12. CCDFs of job response times when four different MapReduce applications are running on the Amazon EC2 cluster.
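For readers reproducing tail plots like Fig. 12, the following fragment computes an empirical CCDF from a list of measured job response times; the sample values shown are hypothetical.

    def empirical_ccdf(samples):
        """Return (value, P[X > value]) pairs for the empirical CCDF of the samples."""
        n = len(samples)
        ordered = sorted(samples)
        # After the i-th smallest value, exactly n - (i + 1) samples exceed it.
        return [(x, (n - i - 1) / n) for i, x in enumerate(ordered)]

    # Example with a handful of response times (in seconds).
    response_times = [67.5, 74.8, 75.2, 209.0, 88.1, 64.3]
    for value, tail_prob in empirical_ccdf(response_times):
        print(f"P[T > {value:7.1f} s] = {tail_prob:.2f}")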

TABLE 8
Experimental Settings for User 1's Non-Stationary Workload

Period   Avg. Input Size   Size Pattern   Interarrival Time   Submission Number
1        100 MB            Uniform        5 sec               50
2        100 MB            Hyper-Exp.     5 sec               50
3        2500 MB           Uniform        125 sec             10
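As an illustration, the sketch below generates per-job (input size, interarrival time) traces that follow Table 8 for user 1 and the stationary exponential settings of user 2 described above. The uniform bounds and the two-branch hyperexponential parameters are assumptions, since only the means and distribution families are specified here.

    import random

    def user1_trace():
        """Three periods, following Table 8 (sizes in MB, gaps in seconds)."""
        trace = []
        # Period 1: 50 jobs, uniform sizes around 100 MB (bounds assumed), fixed 5 s gaps.
        trace += [(random.uniform(50, 150), 5.0) for _ in range(50)]
        # Period 2: 50 jobs, ~100 MB mean but hyperexponential (two-branch mix; rates assumed).
        for _ in range(50):
            size = random.expovariate(1 / 20) if random.random() < 0.9 else random.expovariate(1 / 820)
            trace.append((size, 5.0))
        # Period 3: 10 large jobs, uniform sizes around 2500 MB (bounds assumed), 125 s gaps.
        trace += [(random.uniform(2000, 3000), 125.0) for _ in range(10)]
        return trace

    def user2_trace(num_jobs=110):
        """Stationary: exponential sizes (mean 500 MB) and gaps (mean 25 s)."""
        return [(random.expovariate(1 / 500), random.expovariate(1 / 25)) for _ in range(num_jobs)]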

Fig. 13. Illustrating (a) the slot shares between the two users, where the grey (resp. dark) areas indicate the slots assigned to user 1 (resp. user 2); (b) the scheduling scheme of user 1 across time, where 1 indicates FIFO and 0 indicates Fair; and (c) the transient number of jobs that are running or waiting for service.

6 RELATED WORKS
Scheduling in Hadoop systems has already received a lot of attention. An early work of Zaharia et al. [7] proposed the Fair scheduler, which assigns resources equally to each user and provides performance isolation between users. However, its objective is not to optimize system performance. In [21], a delay scheduler was proposed to improve the performance of the Fair scheduler by increasing data locality; it simply delays a task assignment for a while if the task's data is not local. This improvement is at the task level and can be combined with our proposed job scheduling policy. The Quincy scheduler [22] considered a similar direction and finds a fair assignment that accounts for locality by formulating and solving a minimum-cost flow network problem; however, it is limited by its high computational complexity. In [23], Sandholm and Lai considered the profit of the service provider and proposed a scheduler that splits slots among users according to the bids they pay instead of a fair share; the efficiency of the scheduler is not considered in their work. Wolf et al. proposed a slot allocation scheduler called FLEX [24] that can optimize toward a given scheduling metric, e.g., average response time or makespan, by sorting jobs with generic schemes before allocating slots. Their scheme targets the batch setting, i.e., all jobs are submitted at the same time for scheduling, while our scheduling policy works well in the on-line setting. Verma et al. [25] and Tian et al. [5] focus on improving the makespan of a batch of jobs submitted at the same time rather than the average job response time in the multi-user setting.

Another major direction for improving Hadoop scheduling policies considers the deadlines or SLAs of jobs. A deadline-based scheduler was proposed in [17], which uses the earliest-deadline-first policy to sort jobs and a Lagrange optimization method to find the minimum numbers of map and reduce slots each job requires to meet its deadline. This solution requires a detailed profile of each job that provides the execution times of its map and reduce tasks. Polo et al. [16] estimated the task execution time of a job from the average execution time of that job's already finished tasks, and calculated the number of slots a job needs from its deadline and the estimated task execution time. We partly adopt this method to help estimate the job size of each user in our proposed LsPS scheduler (a simplified sketch of such an estimate is given at the end of this section).

Some other works focus on a theoretical understanding of the scheduling problem in Hadoop systems and give approximate scheduling algorithms [14], [26]. Chang et al. [26] proposed a 2-approximation scheduling algorithm; however, their assumption that there are no precedence relationships between the tasks of a given job does not hold in Hadoop. Moseley et al. [14] formalized the job scheduling problem in MapReduce as a generalization of the two-stage flexible flow-shop problem and gave off-line and on-line approximation algorithms; however, their work assumes that the execution times of tasks are known.

Furthermore, some prior work exploits the heterogeneity of the system and of jobs' resource requirements to improve scheduler performance. Zaharia et al. [27] showed that Hadoop's speculative execution can lead to poor performance in heterogeneous environments and proposed the LATE scheduler, which stops unnecessary speculative executions. Dhok and Varma [28] applied Bayesian pattern classification to classify and sort jobs and dynamically decide the number of slots a node can provide according to its load limit, instead of assigning a fixed number of slots to each node. Ghodsi et al. [29] considered the differing resource requirements of Hadoop jobs and proposed a dominant resource fairness policy that assigns resources, rather than slots, fairly to jobs. Since homogeneous clusters are more common in Hadoop deployments, we do not consider heterogeneous systems in this work.
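As a simplified sketch of the job size estimate borrowed from [16], the snippet below projects a job's map-phase size from the average execution time of its already finished map tasks; the class name and its interface are illustrative rather than the actual LsPS implementation.

    class JobSizeEstimator:
        """Estimate a job's map-phase size from its finished tasks:
        size ~= average finished-task time * total number of map tasks."""

        def __init__(self, num_map_tasks):
            self.num_map_tasks = num_map_tasks
            self.finished = 0
            self.mean_task_time = 0.0   # running average, updated incrementally

        def task_finished(self, task_time):
            # Incremental mean avoids storing every task duration.
            self.finished += 1
            self.mean_task_time += (task_time - self.mean_task_time) / self.finished

        def estimated_size(self):
            """Projected total work of the map phase (in task-seconds).
            Returns None until at least one task has finished."""
            if self.finished == 0:
                return None
            return self.mean_task_time * self.num_map_tasks

    # Example: a 20-map job whose first three tasks took 30, 34, and 29 seconds.
    est = JobSizeEstimator(num_map_tasks=20)
    for t in (30, 34, 29):
        est.task_finished(t)
    print(est.estimated_size())   # ~620 task-seconds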

TABLE 9
Average Job Response Times (in Seconds) of Fair and LsPS under Non-Stationary Workloads

        Period 1           Period 2           Period 3           All
User    Fair     LsPS      Fair     LsPS      Fair     LsPS      Fair     LsPS
1       108.96   50.30     100.72   60.78     803.00   771.10    168.30   120.59
2       277.90   354.50    326.70   329.60    357.52   158.40    341.74   210.87
All     136.32   101.00    138.38   105.58    431.33   260.51    235.75   155.70

7 CONCLUSION
In this paper, we have proposed LsPS, an adaptive scheduling technique for improving the efficiency of Hadoop systems that process heterogeneous MapReduce jobs. MapReduce workloads of contemporary enterprise clients have revealed a wide diversity of job sizes, ranging from seconds to hours and having varying distributions as well. Our new on-line policy captures the present job size patterns of each user and leverages this knowledge to dynamically adjust the slot shares among all active users and to further tune, on the fly, the scheme for scheduling jobs within a single user. Using real experiments in Amazon EC2, we have shown that LsPS consistently improves performance in terms of job response times under a variety of system workloads. We have also shown the effectiveness and robustness of LsPS under non-stationary workloads.

ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation grant CNS-1251129 and by the AWS in Education Research Grant.

REFERENCES
[1] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," in Proc. 6th Conf. Symp. Operating Syst. Des. Implementation, 2004, p. 10.
[2] Apache Hadoop. (2014). [Online]. Available: http://hadoop.apache.org/
[3] Apache Hadoop Users. (2014). [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy
[4] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz, "The case for evaluating mapreduce performance using workload suites," in Proc. IEEE 19th Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst., 2011, pp. 390-399.
[5] C. Tian, H. Zhou, Y. He, and L. Zha, "A dynamic mapreduce scheduler for heterogeneous workloads," in Proc. 8th Int. Conf. Grid Cooperative Comput., 2009, pp. 218-224.
[6] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An analysis of traces from a production mapreduce cluster," in Proc. 10th IEEE/ACM Int. Conf. Cluster, Cloud Grid Comput., 2010, pp. 94-103.
[7] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user mapreduce clusters," Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2009-55, Apr. 2009.
[8] Fair Scheduler. (2014). [Online]. Available: http://hadoop.apache.org/common/docs/r1.0.4/fair_scheduler.html
[9] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, "Data warehousing and analytics infrastructure at facebook," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 1013-1020.
[10] Y. Chen, A. S. Ganapathi, R. Griffith, and R. H. Katz, "A methodology for understanding mapreduce performance under diverse workloads," Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2010-135, 2010.
[11] L. E. Schrage and L. W. Miller, "The queue M/G/1 with the shortest remaining processing time discipline," Operations Research, vol. 14, no. 4, pp. 670-684, 1966.
[12] K. Avrachenkov, U. Ayesta, P. Brown, and R. Núñez-Queija, "Discriminatory processor sharing revisited," in Proc. IEEE INFOCOM, Mar. 2005, pp. 784-795.
[13] I. Mitrani, Probabilistic Modelling. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[14] B. Moseley, A. Dasgupta, R. Kumar, and T. Sarlós, "On scheduling in map-reduce and flow-shops," in Proc. 23rd Annu. ACM Symp. Parallelism Algorithms Archit., 2011, pp. 289-298.
[15] B. P. Welford, "Note on a method for calculating corrected sums of squares and products," Technometrics, vol. 4, pp. 419-420, 1962.
[16] J. Polo, D. Carrera, Y. Becerra, J. Torres, E. Ayguadé, M. Steinder, and I. Whalley, "Performance-driven task co-scheduling for mapreduce environments," in Proc. IEEE Netw. Operations Manage. Symp., 2010, pp. 373-380.
[17] A. Verma, L. Cherkasova, and R. H. Campbell, "Aria: Automatic resource inference and allocation for mapreduce environments," in Proc. 8th ACM Int. Conf. Autonomic Comput., 2011, pp. 235-244.
[18] J. Tan, X. Meng, and L. Zhang, "Performance analysis of coupling scheduler for mapreduce/hadoop," in Proc. IEEE INFOCOM, Mar. 2012, pp. 2586-2590.
[19] A. Feldmann and W. Whitt, "Fitting mixtures of exponentials to long-tail distributions to analyze network performance models," in Proc. IEEE INFOCOM, 1997, vol. 3, pp. 1096-1104.
[20] A. Riska, V. Diev, and E. Smirni, "Efficient fitting of long-tailed data sets into hyperexponential distributions," in Proc. IEEE GLOBECOM, 2002, pp. 2513-2517.
[21] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling," in Proc. 5th Eur. Conf. Comput. Syst., 2010, pp. 265-278.
[22] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair scheduling for distributed computing clusters," in Proc. 22nd ACM Symp. Operating Syst. Principles, 2009, pp. 261-276.
[23] T. Sandholm and K. Lai, "Mapreduce optimization using regulated dynamic prioritization," in Proc. SIGMETRICS: 11th Int. Joint Conf. Meas. Model. Comput. Syst., 2009, pp. 299-310.
[24] J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin, "Flex: A slot allocation scheduling optimizer for mapreduce workloads," in Proc. ACM/IFIP/USENIX 11th Int. Conf. Middleware, 2010, pp. 1-20.
[25] A. Verma, L. Cherkasova, and R. H. Campbell, "Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance," in Proc. IEEE 20th Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst., 2012, pp. 11-18.
[26] H. Chang, M. Kodialam, R. R. Kompella, T. V. Lakshman, M. Lee, and S. Mukherjee, "Scheduling in mapreduce-like systems for fast completion time," in Proc. IEEE INFOCOM, 2011, pp. 3074-3082.
[27] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving mapreduce performance in heterogeneous environments," in Proc. 8th USENIX Conf. Operating Syst. Des. Implementation, 2008, pp. 29-42.
[28] J. Dhok and V. Varma, "Using pattern classification for task assignment in mapreduce," in Proc. ISEC, 2010.
[29] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, "Dominant resource fairness: Fair allocation of multiple resource types," in Proc. 8th USENIX Conf. Netw. Syst. Des. Implementation, Mar. 2011, p. 24.

Yi Yao received the BS and MS degrees in computer science from Southeast University, China, in 2007 and 2010. He is working toward the PhD degree at the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA. His current research interests include resource management, scheduling, and cloud computing.

Jianzhe Tai received the BS and MS degrees in electrical and information engineering from the Dalian University of Technology, China, in 2009 and 2007. He is working toward the PhD degree at the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA. His current research interests include software-defined storage systems and virtualization in hybrid storage.

Bo Sheng received the PhD degree in computer science from the College of William and Mary in 2010. He is an assistant professor in the Department of Computer Science, University of Massachusetts Boston. His research interests include mobile computing, wireless networks, security, and cloud computing.

Ningfang Mi received the BS degree in computer science from Nanjing University, China, in 2000 and the MS degree in computer science from the University of Texas at Dallas, TX, in 2004. She received the PhD degree in computer science from the College of William and Mary, VA, in 2009. She is an assistant professor in the Department of Electrical and Computer Engineering, Northeastern University, Boston. Her current research interests include performance evaluation, capacity planning, resource management, simulation, data centers, and cloud computing. She is a member of the ACM and the IEEE.
