Abstract: The MapReduce paradigm and its open source implementation Hadoop are emerging as an important standard for large-scale data-intensive processing in both industry and academia. A MapReduce cluster is typically shared among multiple users with different types of workloads. When a flock of jobs are concurrently submitted to a MapReduce cluster, they compete for the shared resources and the overall system performance, in terms of job response times, might be seriously degraded. One challenging issue is therefore to schedule jobs efficiently in such a shared MapReduce environment. However, we find that the conventional scheduling algorithms supported by Hadoop cannot always guarantee good average response times under different workloads. To address this issue, we propose a new Hadoop scheduler, which leverages the knowledge of workload patterns to reduce average job response times by dynamically tuning the resource shares among users and the scheduling algorithms for each user. Both simulation and real experimental results from an Amazon EC2 cluster show that our scheduler reduces the average MapReduce job response time under a variety of system workloads compared to the existing FIFO and Fair schedulers.
1 INTRODUCTION
assigning to all jobs, on average, an equal share of resources over time. However, we notice that the Fair scheduler makes its scheduling decisions without considering the workload patterns of different users.

Compared to the early stage, the workloads in MapReduce systems have been changing in the following three directions. First, a MapReduce cluster, once established, is no longer dedicated to a particular job, but to multiple jobs from different applications or users. For example, Facebook [9] is one of Hadoop's biggest champions, which keeps more than 100 petabytes of Hadoop data online, and allows multiple applications and users to submit their ad hoc queries to the shared Hive-Hadoop clusters. Second, as a data processing service, MapReduce is becoming prevalent and open to numerous clients from the Internet, like today's search engine services. For example, a smartphone user may send a job to a MapReduce cluster through an App asking for the most popular words in the tweets logged in the past three days. Third, the characteristics of MapReduce jobs vary a lot, which is essentially caused by the diversity of user demands. Recent analysis of the MapReduce workloads of current enterprise clients [10], e.g., Facebook and Yahoo!, has revealed the diversity of MapReduce job sizes, which range from seconds to hours. Overall, workload diversity is common in practice when jobs are submitted by different users. For example, some users run small interactive jobs while other users submit large periodical jobs; some users run jobs that process files with similar sizes while the job sizes of other users vary widely. We thus argue that a good Hadoop scheduler should take the diversity in workload characteristics into consideration, with the goal of reducing MapReduce job execution times.

In this paper, we propose a novel Hadoop scheduler, called LsPS, which aims to improve the average job response time of Hadoop systems by leveraging job size patterns to tune its scheduling schemes among users and for each user as well. Specifically, we first develop a lightweight information collector that tracks statistical information of recently finished jobs from each user. A self-tuning scheduling policy is then designed to schedule Hadoop jobs at two levels: the resource shares across multiple users are tuned based on the estimated job size of each user, and the job scheduling for each individual user is further adjusted to accommodate that user's job size distribution. Experimental results in both the simulation model and the Amazon EC2 Hadoop cluster environment confirm the effectiveness and the robustness of our solution. We show that our scheduler improves the average job response times under a variety of system workloads in comparison with the FIFO, Fair, and Capacity schedulers, which have been widely adopted as the standard scheduling disciplines in MapReduce frameworks such as Apache Hadoop.

2 MOTIVATIONS

In order to investigate the pros and cons of the existing Hadoop schedulers (i.e., FIFO and Fair), we conduct several experiments in a Hadoop cluster at Amazon EC2. We lease 11 EC2 nodes, where one node serves as the master and the remaining 10 nodes run as the slaves. In this Hadoop cluster, each slave node contains two map slots and two reduce slots. The WordCount application is run to compute the occurrence frequency of each word in input files with different sizes. The randomtextwriter program is used to generate random files as inputs to the WordCount applications.

2.1 How to Share Slots

Specifically, there are two tiers of scheduling in a Hadoop system which is shared by multiple users: (1) Tier 1 is responsible for assigning free slots to active users; and (2) Tier 2 schedules jobs for each individual user. In this section, we first investigate different Hadoop scheduling policies at Tier 1. When no minimum share is specified for any user, the Fair scheduler allocates the available slots among users such that all users get an equal share of slots over time. However, we argue that Fair unfortunately becomes inefficient when the job sizes of active users are not uniform.

For example, we perform an experiment with two users such that user 1 submits 30 WordCount jobs to scan a 180 MB input file, while user 2 submits six WordCount jobs to scan a 1.6 GB input file. All the jobs are submitted at roughly the same time. We set the HDFS block size to 30 MB. Thus, each job from user 2 has 54 (1.6 GB/30 MB) map tasks, while each job from user 1 only has six (180 MB/30 MB) map tasks. The reduce task number of each job is set equal to its map task number. As the average task execution times of these two users are similar, the average job size (i.e., the average task number times the average task execution time) of user 2 is about nine times larger than that of user 1.
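To make the size comparison concrete, the following sketch (our own illustration, with an arbitrary assumed common task time t) recomputes the ratio from the task counts quoted above:

    # Job size = (number of map + reduce tasks) x average task execution time.
    t = 20.0                      # assumed common average task execution time (seconds)
    size_u1 = (6 + 6) * t         # 6 map + 6 reduce tasks per user-1 job
    size_u2 = (54 + 54) * t       # 54 map + 54 reduce tasks per user-2 job
    print(size_u2 / size_u1)      # 9.0, the ratio stated in the text, independent of t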
In the context of single-user job queues, it is well known that giving preferential treatment to shorter jobs can reduce the overall expected response time [11]. However, directly using the shortest job first (SJF) policy has several drawbacks. First, large jobs could be starved under SJF, and SJF lacks flexibility when a certain level of fairness or priority between users is required, which is common in real use cases. Moreover, precise job size prediction before execution is also required for using SJF, which is not easy to achieve in real-world systems. In contrast, sharing-based scheduling can easily solve the starvation problem and provides the flexibility to integrate fairness between users by, for example, setting up minimum shares for users. Allowing all users to run their applications concurrently also helps to improve the job size prediction accuracy in a Hadoop system by gathering information from finished tasks. Motivated by this observation and the analysis of discriminatory processor sharing between multiple users in [12], we evaluate discriminatory share policies in Hadoop systems. It is extremely hard and complex to find an optimal share policy in a dynamic environment where user workload patterns may change frequently over time. Therefore, we opted to heuristically assign slots that are inversely proportional to the current average job sizes of users, and to dynamically tune the shares over time according to workload pattern changes. This method introduces little overhead,
TABLE 1
Average Job Response Times (in Seconds) for Two Users with Different Job Sizes under Fair and Two Variants

             Share Ratio   User 1     User 2     All
    Fair     1:1            548.06    1189.33     656.61
    Fair_V1  1:9           1132.33     983.16    1107.47
    Fair_V2  9:1            375.56    1280.66     516.41

TABLE 2
Average Job Response Times under FIFO and Fair when Job Sizes Have Three Different Distributions

             CV = 0        CV = 1        CV = 1.8
    FIFO     239.10 sec    208.78 sec    234.95 sec
    Fair     346.45 sec    220.11 sec    128.35 sec
Fig. 1. Response times of each WordCount job under FIFO and Fair when the input file sizes have different CV.
TABLE 3
Notations Used in the Algorithm
shown in Fig. 2. Briefly, LsPS consists of the following three components:

- Workload information collection: monitor the execution of each job and each task, and gather the workload information.
- Scheduling among multiple users: allocate (both map and reduce) slots to users according to their workload characteristics, i.e., scheduling at Tier 1.
- Scheduling for each individual user: tune the scheduling scheme for jobs from each individual user based on that user's job size distribution, i.e., scheduling at Tier 2.

LsPS appropriately allocates slots to Hadoop users and guides each user to select the right scheduling algorithm for their own job queue, even under highly variable and heavy-tailed workloads. In the remainder of this section, we describe the detailed implementation of the above three components. Table 3 lists some notations used in the rest of this paper.

Algorithm 1. Overview of LsPS
1. When a new job from user i is submitted
   a. Estimate its job size and the average job size S_i of user i using Eq. (5);
   b. Adjust slot shares among all active users, see Algorithm 2;
   c. Tune the job scheduling scheme for user i, see Algorithm 3;
2. When a task of job j from user i is finished
   a. Update the estimated average task execution time t_{i,j};
3. When the jth job from user i is finished
   a. Measure the average map/reduce task execution times t^m_{i,j} / t^r_{i,j} and the map/reduce task numbers n^m_{i,j} / n^r_{i,j};
   b. Update the history information of user i, i.e., t_i, S_i, CV_i, using Eqs. (1)-(4);
4. When a free slot is available
   a. Sort users in decreasing order of their deficits SU_i - AS_i;
   b. Assign the slot to the first user in the sorted list;
   c. Increase the number of actually received slots AS_i by 1;
   d. Choose a job from user u_i to get service based on the current scheduling scheme.

3.1 Workload Information Collection

When a Hadoop system is shared by multiple users, the job sizes and patterns of each user must be considered for designing an efficient scheduling algorithm. Therefore, a lightweight history information collector is introduced in LsPS for collecting the important historic information of jobs and users upon each job's completion. Here we collect and update the information of each job's map and reduce tasks separately, through the same functions. To avoid redundant description, we use the general term task to represent both types of tasks and the term size to represent the size of either the map phase or the reduce phase of each job, as follows.

Algorithm 2. Tier 1: Allocate Slots for Each User
Input: historic information of each active user;
Output: slot share SU_i of each active user;
for each user u_i do
   Update that user's slot share SU_i using Eq. (6);
   for the jth job of user i, i.e., job_{i,j}, do
      if currently scheduling based on submission times then
         if job_{i,j} has the earliest submission time in J_i then
            SJ_{i,j} <- SU_i;
         else
            SJ_{i,j} <- 0;
      else
         SJ_{i,j} <- SU_i / |J_i|.

Algorithm 3. Tier 2: Tune Job Scheduling for Each User
Input: historic information of each active user;
Output: UseFIFO vector;
for each user u_i do
   if user u_i is active, i.e., |J_i| >= 1, then
      calculate the CV'_i of current jobs;
      if CV_i < 1 and CV'_i < 1 then
         schedule current jobs based on their submission times;
      if CV_i > 1 || CV'_i > 1 then
         equally allocate slots among current jobs;
      if CV_i and CV'_i conflict (one is above 1 and the other below) then
         clear the history information and restart collection.
In LsPS, the important history workload information that needs to be collected for each user u_i includes its average task execution time t^m_i (and t^r_i), its average job size S_i, and the coefficient of variation of its job sizes, CV_i. We here adopt Welford's one-pass algorithm [15] to update these statistics on-line as follows:

   s_{i,j} = \bar{t}^{m}_{i,j} \, n^{m}_{i,j} + \bar{t}^{r}_{i,j} \, n^{r}_{i,j},   (1)

   \bar{S}_i = \bar{S}_i + (s_{i,j} - \bar{S}_i)/j,   (2)

   v_i = v_i + (s_{i,j} - \bar{S}_i)^2 (j-1)/j,   (3)

The collector works over windows of the past scheduling history: in each monitoring window, the system completes exactly W jobs; we set W = 100 in all the experiments presented in this paper. We also assume that the scheduler is able to correctly measure the information of each completed job, such as its map/reduce execution times as well as the number of its map/reduce tasks. This assumption is reasonable for most Hadoop systems. Upon each job's completion, LsPS updates the workload statistics of the job owner using the above equations, i.e., Eqs. (1)-(4), see Algorithm 1 step 3. The statistical information collected in the present monitoring window will then be utilized by LsPS to tune the schemes for scheduling the following jobs.
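As a concrete illustration, the following Python sketch (our own, not the authors' code) applies the per-job updates of Eqs. (1)-(3); the CV computation at the end is an assumption, since Eq. (4) is not reproduced above:

    class UserStats:
        # Running job-size statistics for one user, updated once per finished job.
        def __init__(self):
            self.j = 0          # number of finished jobs seen in this window
            self.mean = 0.0     # running average job size S_i
            self.v = 0.0        # running variance accumulator v_i

        def on_job_finished(self, t_map, n_map, t_red, n_red):
            s = t_map * n_map + t_red * n_red              # Eq. (1): job size s_{i,j}
            self.j += 1
            delta = s - self.mean                          # uses the pre-update mean
            self.v += delta * delta * (self.j - 1) / self.j   # Eq. (3)
            self.mean += delta / self.j                    # Eq. (2)

        def cv(self):
            # Assumed form of Eq. (4): CV_i = standard deviation / mean.
            if self.j == 0 or self.mean == 0.0:
                return 0.0
            return (self.v / self.j) ** 0.5 / self.mean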
3.2 Scheduling Among Multiple Users

The tasks of a newly submitted job job_{i,j} have not yet been executed, so LsPS uses the measured average task execution time of user u_i (t_i) to approximate the average task execution time t_{i,j} of job_{i,j}. Therefore, the average map phase size of jobs from user u_i is calculated as follows:

   \bar{S}^{m}_i = \frac{1}{|J_i|} \sum_{j=1}^{|J_i|} n^{m}_{i,j} \, \bar{t}_{i,j},   (5)

where J_i represents the set of jobs from user u_i that are currently running or waiting for service. The average reduce phase size of jobs from user u_i can be calculated in the same way. Fig. 3 illustrates an example, where the estimated average map task execution times for both Sort and Grep jobs quickly converge to a stable value which is consistent with the actual one, i.e., the average task execution time of all 300 map tasks.

Fig. 3. Illustrating actual task execution times and on-line estimated average task execution times of a Sort job with 300 map tasks and a Grep job with 300 map tasks.
As shown in Algorithm 2 step 1, once a new job arrives, LsPS updates the average job size of that job's owner and then adaptively adjusts the deserved map slot shares (SU_i) among all active users using Eq. (6):

   SU_i = \widetilde{SU}_i \left( \alpha \, U \, \frac{1/\bar{S}_i}{\sum_{k=1}^{U} 1/\bar{S}_k} + (1 - \alpha) \right),   (6)

   \forall i, \; SU_i > 0,   (7)

   \sum_{i=1}^{U} SU_i = \sum_{i=1}^{U} \widetilde{SU}_i,   (8)

where SU_i represents the estimated slot share that should be assigned to user u_i, \widetilde{SU}_i represents the deserved slot share of user u_i under the Fair scheme, i.e., equally dispatching the slots among all users, U indicates the number of users that are currently active in the system, and α is a tuning parameter within the range from 0 to 1. Parameter α in Eq. (6) can be used to control how aggressively LsPS biases towards the users with smaller jobs: when α is close to 0, our scheduler increases the degree of fairness among all users, performing similarly to Fair; and when α is increased to 1, LsPS gives a strong bias towards the users with small jobs in order to improve the efficiency in terms of job response times. In the remainder of the paper, we set α to 1 if there is no explicit specification. When all users have the same average job size, SU_i is equal to \widetilde{SU}_i, i.e., slots are fairly allocated among users. Also, when using Eq. (6) to calculate the SU_i of each user, it is guaranteed that no active user gets starved for map/reduce slots, see Eq. (7), and that all available slots in the system are fully distributed to the active users, see Eq. (8).

The resulting deserved slot shares (i.e., SU_i) are not necessarily equal to the actual assignments among users (i.e., AS_i). They are used to determine which user can receive the slot that just became available for redistribution, see Algorithm 1 step 4. LsPS sorts all active users in non-increasing order of their deficits, i.e., the gap between the expected assigned slots (SU_i) and the actually received slots (AS_i), and then dispatches that particular slot to the user with the largest deficit. Additionally, it might happen in the Hadoop system that some users have high deficits but their actual demands on map/reduce slots are less than the expected shares. In such a case, LsPS re-dispatches the extra slots to those users who have lower deficits but need more slots for serving their jobs.
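A minimal sketch of this Tier-1 step is given below (our illustration, not the authors' implementation); the function names, the avg_size and assigned dictionaries, and the total_slots parameter are assumptions:

    def deserved_shares(avg_size, total_slots, alpha=1.0):
        # avg_size: {user: estimated average job size S_i}; returns {user: SU_i} per Eq. (6).
        users = list(avg_size)
        U = len(users)
        fair = total_slots / U                            # Fair share of each user
        inv_sum = sum(1.0 / avg_size[u] for u in users)
        return {u: fair * (alpha * U * (1.0 / avg_size[u]) / inv_sum + (1.0 - alpha))
                for u in users}

    def dispatch_free_slot(deserved, assigned):
        # Give a freed slot to the user with the largest deficit SU_i - AS_i (Algorithm 1, step 4).
        user = max(deserved, key=lambda u: deserved[u] - assigned.get(u, 0))
        assigned[user] = assigned.get(user, 0) + 1
        return user

    # Two users whose average job sizes differ by a factor of 9 share 40 map slots:
    shares = deserved_shares({'u1': 100.0, 'u2': 900.0}, total_slots=40)   # {'u1': 36.0, 'u2': 4.0}
    next_user = dispatch_free_slot(shares, assigned={'u1': 30, 'u2': 2})   # 'u1' (deficit 6 vs. 2)

Note that for any α in [0, 1] the computed shares stay positive and sum to the total number of slots, matching Eqs. (7) and (8).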
be assigned to user ui , SUi represents the deserved slot pseudo-code in Algorithm 3.
shares for user ui under the Fair scheme, i.e., equally dis-
patching the slots among all users, U indicates the number
of users that are currently active in the system, and a is a 4 MODEL DESCRIPTION
tuning parameter within the range from 0 to 1. Parameter a In this section, we introduce a queuing model that is devel-
in Eq. (6) can be used to control how aggressively LsPS oped to emulate a Hadoop system. The main purpose of
biases towards the users with smaller jobs: when a is close this model is to compare various Hadoop scheduling
to 0, our scheduler increases the degree of fairness among schemes, and give the first proof of our new approach. This
all users, performing similar as Fair; and when a is model does not include all the details of the complex
increased to 1, LsPS gives the strong bias towards the users Hadoop system, but provide a general guideline to users
with small jobs in order to improve the efficiency in terms which is useful for performance evaluation without running
of job response times. In the remainder of the paper, we set experiment tests. In Section 5, besides experiments on a real
a as 1 if there is no explicit specification. When all users Hadoop clusters, we also conduct trace-driven simulations
have the same average job sizes, one can get SUi equal to based on this model to evaluate the performance improve-
SUi , i.e., fairly allocating slots among users. Also, when ment of LsPS.
using Eq. (6) to calculate the SUi for each user, it is guaran- The model, as shown in Fig. 4 consists of two queues for
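The decision rule just described can be summarized by the following sketch (ours, with assumed names; the paper's UseFIFO vector would be filled from the returned scheme, and the window-reset flag reflects our reading of the conflict case):

    def choose_tier2_scheme(cv_history, cv_current):
        # cv_history: CV_i of finished job sizes; cv_current: CV'_i of running/waiting jobs.
        if cv_history < 1.0 and cv_current < 1.0:
            scheme = 'FIFO'     # similar job sizes: serve jobs by submission time
        else:
            scheme = 'FAIR'     # variable job sizes: split the user's slots equally among jobs
        # Conflicting signals (one CV above 1, the other below) indicate that the user's
        # workload pattern may have changed, so the collection window is restarted.
        restart_window = (cv_history < 1.0) != (cv_current < 1.0)
        return scheme, restart_window

    choose_tier2_scheme(0.4, 1.6)   # -> ('FAIR', True): pattern shift, clear history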
4 MODEL DESCRIPTION

In this section, we introduce a queuing model that is developed to emulate a Hadoop system. The main purpose of this model is to compare various Hadoop scheduling schemes and to give a first proof of our new approach. The model does not include all the details of the complex Hadoop system, but provides a general guideline to users, which is useful for performance evaluation without running experimental tests. In Section 5, besides experiments on a real Hadoop cluster, we also conduct trace-driven simulations based on this model to evaluate the performance improvement of LsPS.

The model, as shown in Fig. 4, consists of two queues for map tasks (Q_m) and reduce tasks (Q_r), respectively. Once a
Fig. 5. Average job response times of (a) two users, (b) user 1, and (c) user 2 under three different scheduling policies and different job size distribution settings. The relative improvement with respect to Fair is also plotted on each bar of LsPS.
exponentially distributed job interarrival times with the same mean of 300 seconds.

Fig. 5 shows the mean job response times of both users under different policies and the relative improvement with respect to Fair. Job response time is measured from the moment when that particular job is submitted to the moment that all the associated map and reduce tasks are finished. We first observe that high variability in job sizes dramatically degrades the performance under FIFO, as a large number of small jobs are stuck behind the extremely large ones, see plot (a) in Fig. 5. In contrast, both Fair and LsPS effectively mitigate such negative performance effects by equally distributing the available slots between the two users and within a single user. Our policy further improves the overall performance by shifting the scheduler to FIFO for the jobs from u_1, and thus significantly reduces its mean job response time by 60 and 62 percent with respect to Fair when the job sizes of user 2 are highly variable (i.e., u:h) and temporally dependent (i.e., u:b), respectively, see plot (b) in Fig. 5. On the other hand, Fair loses its superiority when both users have similar job sizes, while our new scheduler uses the features of both users' job sizes to tune the scheduling at the two tiers and thus achieves performance close to the best one.

To further investigate the tail of job response times, we plot in Fig. 6 the complementary cumulative distribution functions (CCDFs) of job response times, i.e., the probability that the response times experienced by individual jobs are greater than the value on the horizontal axis, for both users under the three scheduling policies. Consistently, almost all jobs from the two users experience shorter response times under Fair and LsPS than under FIFO when the job sizes of u_2 are highly variable. In addition, compared to Fair, LsPS reduces the response times for more than 60 percent of jobs, having shorter tails in job response times.

In order to analyze the impact of relative job sizes on LsPS performance, we conduct another two sets of experiments with various job size ratios between the two users, i.e., we keep the same parameters as in the previous experiments but tune the job sizes of u_2 such that the average job size of u_1 is 10 times less (resp. more) than that of u_2, see the results shown in Fig. 7(I) (resp. Fig. 7(II)). In addition, we tune the job arrival rates of u_2 to keep the same loads in the system. Recall that our LsPS scheduler always gives higher priority, i.e., assigns more slots, to the user with the smaller average job size, see Section 3.2. As a result, LsPS achieves non-negligible improvements of overall job response times no matter which user has smaller job sizes, see plots (a) in Fig. 7(I) and (II). Further confirmation of this benefit comes from the plots in Figs. 8(I) and 8(II), which show that most jobs experience the shortest response times when the scheduling is LsPS. Indeed, the part of the workload whose job sizes are large receives increased response times, but the number of penalized jobs is less than 5 percent of the total.

Now, we look closely at each user's response times. We observe that under the two different job size ratios, LsPS always achieves a significant improvement in job response times for the user which submits small jobs on average, by assigning more slots to that user, see plot (b) in Fig. 7(I) and plot (c) in Fig. 7(II). Meanwhile, although LsPS discriminately treats the other user (i.e., the one having larger jobs) with fewer resources, this policy does not always sacrifice that user's performance. For example, as shown in plot (c) of Fig. 7(I), when the job sizes of u_2 are highly variable and/or strongly dependent, shorter response times are achieved under LsPS or Fair than under FIFO because small jobs now have the reserved slots without waiting behind the large ones. Another example can be found in plot (b) of Fig. 7(II), where we observe that LsPS is superior to Fair on the performance of user 1 by switching the tier 2 scheduling algorithm to FIFO.
Fig. 6. CCDFs of response times of all jobs under different schedulers, where user 1 has a uniform job size distribution while user 2 has (a) similar job sizes; (b) high variability in job sizes; and (c) high variability and strong temporal dependence in job sizes.
Fig. 7. Average job response times of (a) two users, (b) user 1, and (c) user 2 under three different scheduling policies. The relative job size ratio between two users is (I) 1:10, and (II) 10:1.
5.1.2 Simple Case 2: Two Users with Diverse Job Arrival Patterns

We now turn to consider the changes in job arrival patterns. We conduct experiments with varying arrival processes of the second user, i.e., u_2, while always fixing the uniform job size distributions for both users as well as the relative job size ratio between them as 1:10. Therefore, the job interarrival times of u_2 are drawn from three different arrival patterns, i.e., exponential, hyper-exponential and bursty, while user 1's job interarrival times are exponentially distributed in all the experiments. We then depict the average job response times of the two users in Fig. 9 and the CCDFs of job response times in Fig. 10.

Consistent with the previous experiments, our LsPS scheduler outperforms the others in terms of the overall job response times, see plot (a) in Fig. 9. We observe that this benefit indeed comes from the response time improvement of u_1, i.e., LsPS assigns u_1 more slot shares due to its smaller average job size and further schedules its jobs based on the FIFO discipline because its job sizes have low variability. However, compared to FIFO, this outcome unfortunately penalizes user 2, especially when this user's arrival process is hyper-exponential or bursty, see plot (c) in Fig. 9. Meanwhile, due to the uniform job size distribution, LsPS schedules the jobs from u_2 in the order of their submission times, which indeed compensates for the smaller resource share and thus reduces the average response time when compared to Fair. The CCDFs
Fig. 8. CCDFs of response times of all jobs under three different scheduling policies. The relative job size ratio between two users is (I) 1:10, and (II)
10:1.
Fig. 9. Average job response times of (a) two users, (b) user 1, and (c) user 2 under different scheduling policies and different job interarrival time distributions. The relative job size ratio of two users is 1:10.
shown in Fig. 10 further confirm that a large portion of jobs experiences shorter response times under LsPS than under the other two policies.

5.1.3 Complex Case: Multiple Users with Diverse Job Arrival/Size Patterns

To further verify the robustness of LsPS, we conduct experiments under a more complex case of six users which have mixed workloads of varying job arrival and job size patterns. Table 4 presents the detailed experimental settings. Here, users with larger IDs have relatively larger job sizes on average. We also adjust the average arrival rate of each user such that all the users submit the same load to the system. Table 5 and Fig. 11 present the average job response times as well as the distributions of job response times of all users under different scheduling policies. We compare our proposed LsPS scheduler with the commonly adopted ones, i.e., the FIFO, Fair, and Capacity schedulers. For the Capacity scheduler, we consider three different capacity sharing configurations, i.e., equal capacity shares in Capacity_v1, capacity shares proportional to each user's average job size in Capacity_v2, and capacity shares inversely proportional to each user's average job size in Capacity_v3. Table 5 shows the average job response times of each individual user and of all six users as well. Furthermore, in order to analyze the impact of parameter α in Eq. (6), Table 5 also shows the simulation results under LsPS with α equal to 0.3, 0.6 and 1.0.

We first observe that LsPS with different α significantly improves the overall job response times compared to FIFO, Fair and Capacity. Meanwhile, the average response times of the first four users are improved as well under LsPS because those users have relatively smaller job sizes and thus receive more map/reduce slots for executing their jobs. On the other hand, although the last two users u_5 and u_6 are assigned the least number of slots, their average job response times have not been dramatically increased. In contrast, their jobs even experience faster response times compared to Fair for u_5, and FIFO for u_6. The main reason is that LsPS switches the scheduling for each user between FIFO and Fair based on their job size distributions and thus improves that particular user's response times. Additionally, LsPS completes the jobs from the first four users within a short time period such that the occupied slots are released soon and then reassigned to the last two users, which further decreases the job response times of these two users. We also observe that different sharing configurations have significant impacts on Capacity's performance. However, even under Capacity_v3, the average job response time is still two times slower than LsPS when α is set to 1.0.

Recall that parameter α in Eq. (6) is a tuning parameter to control how aggressively LsPS discriminates large jobs from small ones. The larger α is, the stronger the bias given towards the users with small jobs. Table 5 shows that LsPS with α = 1.0 introduces the strongest bias on user slot shares and achieves the best response time improvement. Therefore, we set α = 1.0 in the remainder of this paper.

5.2 Case Studies in Amazon EC2

To further verify the effectiveness and robustness of our new scheduler, we implement and evaluate LsPS in Amazon EC2, a cloud platform that provides pools of computing resources to developers for flexibly configuring and scaling their computational capacity on demand.

5.2.1 Experimental Settings

In particular, we lease an m1.large instance as the master node, which provides two virtual cores with two EC2 Compute Units each, 7.5 GB memory, and 850 GB storage, to perform the heartbeat and jobtracker routines for job scheduling. We also use the same 11 m1.large instances to launch slave nodes, each of which is configured with two map slots and two reduce slots. Such a configuration ensures that the system bottleneck is not our scheduler on the master node, while the overall job response times depend on the
Fig. 10. CCDFs of job response times under different schedulers, where the job interarrival times of user 1 are exponentially distributed and user 2's arrival process is (a) exponential; (b) hyper-exponential; and (c) bursty. The relative job size ratio of two users is 1:10.
TABLE 4
Experimental Settings for Each User

TABLE 5
Average Response Times (in Seconds) of All Users and Each User under Different Scheduling Policies

TABLE 6
Case Study: Experimental Settings for Four Users in Amazon EC2

    User   Job Type      Average Input Size   Input Size Pattern   Job Arrival Pattern   Average Inter-arrival Time   Submission Number
    1      WordCount     100 MB               Exponential          Bursty                20 sec                       150
    2      PiEstimator   -                    -                    Uniform               30 sec                       100
    3      Grep          2000 MB              Bursty               Exponential           100 sec                      30
    4      Sort          10 GB                Uniform              Exponential           600 sec                      5
large instances (i.e., slave nodes) in Amazon EC2. We further generate a non-stationary workload by changing the job size/arrival patterns of user 1 while keeping user 2's input file sizes and job interarrival times fixed, both exponentially distributed with means of 500 MB and 25 seconds, respectively. Table 8 illustrates the three changes in user 1's workloads.

Table 9 shows the mean response times of the two users as well as of each user under the Fair and LsPS policies. The average job response times measured during each period are also shown in the table. We observe that LsPS successfully captures the changes in user 1's workloads and dynamically tunes the two-level scheduling (i.e., between the two users and within each user) based on the measured job size/arrival patterns. As a result, LsPS always achieves noticeable response time improvements, by 26, 24, and 40 percent during the three periods and by 34 percent overall with respect to Fair.

To better understand how LsPS handles non-stationary workloads, Fig. 13 illustrates how LsPS dynamically adjusts its two-level scheduling algorithms in an on-line fashion. Specifically, the transient distributions of the 18 map slots between the two users are depicted in Fig. 13a, where red areas indicate the slots assigned to user 1 while green areas represent those assigned to user 2. We also plot the changes of the scheduling within user 1 as a function of time in Fig. 13b.

As we observed, during the first period, LsPS assigns more shares to user 1 than to user 2 because LsPS detects that user 1 has the smaller average job size, see Fig. 13a. Meanwhile, the jobs from user 1 are scheduled according to the FIFO discipline, see Fig. 13b, which further reduces the response times of user 1 and thus results in better overall response times during this period. Once LsPS captures the change in user 1's job size distribution, i.e., from uniform to hyper-exponential, LsPS quickly switches the scheduling within user 1 from FIFO to Fair and thus consistently achieves shorter response times in the second period, see Fig. 13b. Later, when user 1 starts to submit large jobs with the uniform distribution, LsPS turns to dispatch more resources to user 2, decreasing its job response times during the last period. On the other hand, user 1 still experiences shorter job response times than under Fair even though this user now receives fewer resources. We interpret this by observing the FIFO scheduling for this user in the third period. Moreover, the long delay existing in the shift from period 2 to period 3 indeed only affects a few jobs because the job interarrival time actually becomes quite long during period 3.

Fig. 13c shows the number of jobs that are running or waiting for service under the Fair and LsPS policies, giving evidence that LsPS can consistently improve the average job response times through dynamically adapting the job scheduling to the workload changes. Therefore, we conclude that these results strongly demonstrate the effectiveness and robustness of LsPS under both stationary and non-stationary workloads.

TABLE 7
Average Response Times (in Seconds) of All Users and Each User in the Amazon Hadoop Cluster

TABLE 8
Experimental Settings for User 1's Non-Stationary Workload

TABLE 9
Average Job Response Times (in Seconds) of Fair and LsPS under Non-Stationary Workloads

6 RELATED WORKS

Scheduling in Hadoop systems has already received a lot of attention. An early work by Matei Zaharia et al. [7] proposed the Fair scheduler, which assigns resources equally to each user and provides performance isolation between users. However, its objective is not to optimize the system performance. In [21], a delay scheduler was proposed to improve the performance of the Fair scheduler by increasing data locality. It simply delays the task assignment for a while if the task's data is not local. This improvement is at the task level and can be combined with our proposed job scheduling policy. The Quincy scheduler [22] considered a similar direction and finds a fair assignment, while taking locality into account, by formulating and solving a minimum flow network problem. However, it is limited due to its high computation complexity. In [23], Sandholm and Lai considered the profit of the service provider and proposed a scheduler that splits slots among users according to the bids they pay instead of fair shares. The efficiency of the scheduler is not considered in their work.
terms of job response times under a variety of system workloads. We have also shown the effectiveness and robustness of LsPS under non-stationary workloads.

ACKNOWLEDGMENTS

This work was partially supported by the National Science Foundation grant CNS-1251129 and the AWS in Education Research Grant.

REFERENCES

[1] J. Dean, S. Ghemawat, and G. Inc., "Mapreduce: Simplified data processing on large clusters," in Proc. 6th Conf. Symp. Operating Syst. Des. Implementation, 2004, p. 10.
[2] Apache Hadoop. (2014). [Online]. Available: http://hadoop.apache.org/
[3] Apache Hadoop Users. (2014). [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy
[4] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz, "The case for evaluating mapreduce performance using workload suites," in Proc. IEEE 19th Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst., 2011, pp. 390-399.
[5] C. Tian, H. Zhou, Y. He, and L. Zha, "A dynamic mapreduce scheduler for heterogeneous workloads," in Proc. 8th Int. Conf. Grid Cooperative Comput., 2009, pp. 218-224.
[6] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An analysis of traces from a production mapreduce cluster," in Proc. 10th IEEE/ACM Int. Conf. Cluster, Cloud Grid Comput., 2010, pp. 94-103.
[7] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Job scheduling for multi-user mapreduce clusters," University of California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2009-55, Apr. 2009.
[8] Fair Scheduler. (2014). [Online]. Available: http://hadoop.apache.org/common/docs/r1.0.4/fair_scheduler.html
[9] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, "Data warehousing and analytics infrastructure at facebook," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 1013-1020.
[10] Y. Chen, A. S. Ganapathi, R. Griffith, and R. H. Katz, "A methodology for understanding mapreduce performance under diverse workloads," Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2010-135, 2010.
[11] L. E. Schrage and L. W. Miller, "The queue M/G/1 with the shortest remaining processing time discipline," Operations Research, vol. 14, no. 4, pp. 670-684, 1966.
[12] K. Avrachenkov, U. Ayesta, P. Brown, and R. Núñez-Queija, "Discriminatory processor sharing revisited," in Proc. IEEE INFOCOM, Mar. 2005, pp. 784-795.
[13] I. Mitrani, Probabilistic Modelling. Cambridge, U.K.: Cambridge Univ. Press, 1998.
[14] B. Moseley, A. Dasgupta, R. Kumar, and T. Sarlós, "On scheduling in map-reduce and flow-shops," in Proc. 23rd Annu. ACM Symp. Parallelism Algorithms Archit., 2011, pp. 289-298.
[15] B. P. Welford, "Note on a method for calculating corrected sums of squares and products," Technometrics, vol. 4, pp. 419-420, 1962.
[16] J. Polo, D. Carrera, Y. Becerra, J. Torres, E. Ayguadé, M. Steinder, and I. Whalley, "Performance-driven task co-scheduling for mapreduce environments," in Proc. IEEE Netw. Operations Manage. Symp., 2010, pp. 373-380.
[17] A. Verma, L. Cherkasova, and R. H. Campbell, "Aria: Automatic resource inference and allocation for mapreduce environments," in Proc. 8th ACM Int. Conf. Autonomic Comput., 2011, pp. 235-244.
[18] J. Tan, X. Meng, and L. Zhang, "Performance analysis of coupling scheduler for mapreduce/hadoop," in Proc. IEEE INFOCOM, Mar. 2012, pp. 2586-2590.
[19] A. Feldmann and W. Whitt, "Fitting mixtures of exponentials to long-tail distributions to analyze network performance models," in Proc. IEEE INFOCOM, 1997, vol. 3, pp. 1096-1104.
[20] A. Riska, V. Diev, and E. Smirni, "Efficient fitting of long-tailed data sets into hyperexponential distributions," in Proc. IEEE GLOBECOM, 2002, pp. 2513-2517.
[21] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling," in Proc. 5th Eur. Conf. Comput. Syst., 2010, pp. 265-278.
[22] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: Fair scheduling for distributed computing clusters," in Proc. 5th Eur. Conf. Comput. Syst., 2009, pp. 261-276.
[23] T. Sandholm and K. Lai, "Mapreduce optimization using regulated dynamic prioritization," in Proc. SIGMETRICS: 11th Int. Joint Conf. Meas. Model. Comput. Syst., 2009, pp. 299-310.
[24] J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K.-L. Wu, and A. Balmin, "Flex: A slot allocation scheduling optimizer for mapreduce workloads," in Proc. ACM/IFIP/USENIX 11th Int. Conf. Middleware, 2010, pp. 1-20.
[25] A. Verma, L. Cherkasova, and R. H. Campbell, "Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance," in Proc. IEEE 20th Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst., 2012, pp. 11-18.
[26] H. Chang, M. Kodialam, R. R. Kompella, T. V. Lakshman, M. Lee, and S. Mukherjee, "Scheduling in mapreduce-like systems for fast completion time," in Proc. IEEE INFOCOM, 2011, pp. 3074-3082.
[27] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving mapreduce performance in heterogeneous environments," in Proc. 8th USENIX Conf. Operating Syst. Des. Implementation, 2008, pp. 29-42.
[28] J. Dhok and V. Varma, "Using pattern classification for task assignment in mapreduce," in Proc. ISEC, 2010.
[29] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, "Dominant resource fairness: Fair allocation of multiple resource types," in Proc. 8th USENIX Conf. Netw. Syst. Des. Implementation, Mar. 2011, p. 24.

Yi Yao received the BS and MS degrees in computer science from Southeast University, China, in 2007 and 2010. He is working toward the PhD degree at the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA. His current research interests include resource management, scheduling, and cloud computing.

Jianzhe Tai received the BS and MS degrees in electrical and information engineering from the Dalian University of Technology, China, in 2009 and 2007. He is working toward the PhD degree at the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA. His current research interests include software-defined storage systems and virtualization in hybrid storage.

Bo Sheng received the PhD degree in computer science from the College of William and Mary in 2010. He is an assistant professor at the Department of Computer Science, University of Massachusetts Boston. His research interests include mobile computing, wireless networks, security, and cloud computing.

Ningfang Mi received the BS degree in computer science from Nanjing University, China, in 2000 and the MS degree in computer science from the University of Texas at Dallas, TX, in 2004. She received the PhD degree in computer science from the College of William and Mary, VA, in 2009. She is an assistant professor at the Department of Electrical and Computer Engineering, Northeastern University, Boston. Her current research interests include performance evaluation, capacity planning, resource management, simulation, data centers, and cloud computing. She is a member of the ACM and the IEEE.