Contents lists available at ScienceDirect
Future Generation Computer Systems
journal homepage: www.elsevier.com/locate/fgcs
A GSA based hybrid algorithm for bi-objective workflow scheduling in cloud computing
Anubhav Choudhary, Indrajeet Gupta, Vishakha Singh, Prasanta K. Jana *,1
Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India
Highlights
• Proposed an efficient hybrid scheme of GSA and HEFT, called HGSA, for workflow scheduling.
• Systematic derivation of a fitness function based on makespan and cost.
• Novel and proficient elimination strategy for inferior agents.
• Demonstration of better performance through simulation results and the statistical test ANOVA.
Article info
Article history:
Received 28 February 2017
Received in revised form 12 October 2017
Accepted 3 January 2018
Available online 8 January 2018
Keywords:
Gravitational Search Algorithm
Workflow scheduling
Cost
Makespan
Cost time equivalence
Abstract
Workflow scheduling in cloud computing has drawn enormous attention due to its wide application in both scientific and business areas. It is an NP-complete problem. Therefore, many researchers have proposed a number of heuristic as well as metaheuristic techniques by considering several issues, such as energy conservation, cost and makespan. However, it is still an open area of research, as most heuristics or metaheuristics may not fulfill certain optimality criteria and only produce near-optimal solutions. In this paper, we propose a metaheuristic based algorithm for workflow scheduling that considers minimization of makespan and cost. The proposed algorithm is a hybridization of the popular metaheuristic, Gravitational Search Algorithm (GSA), and the equally popular heuristic, Heterogeneous Earliest Finish Time (HEFT), to schedule workflow applications. We introduce a new factor called cost time equivalence to make the bi-objective optimization more realistic. We consider monetary cost ratio (MCR) and schedule length ratio (SLR) as the performance metrics to compare the proposed algorithm with existing algorithms. With rigorous experiments over different scientific workflows, we show the effectiveness of the proposed algorithm over the standard GSA, the Hybrid Genetic Algorithm (HGA) and HEFT. We validate the results by the well-known statistical test, Analysis of Variance (ANOVA). In all the cases, simulation results show that the proposed approach outperforms these algorithms. © 2018 Elsevier B.V. All rights reserved.
1. Introduction
Workflows have wide applications in business as well as in scientific areas such as astronomy, weather forecasting, medicine and bioinformatics. Generally, these workflows are vast in size as they consist of a large number of independent and/or dependent tasks, and thus they demand huge infrastructure for their computation, communication, and storage. Clouds [1] provide such an infrastructure in order to execute the workflow on virtualized
* Corresponding author.
E-mail addresses: anubhav.choudhary@live.com (A. Choudhary), indrajeet7830@gmail.com (I. Gupta), vs.make.a.vish@gmail.com (V. Singh), prasantajana@yahoo.com (P.K. Jana).
1 IEEE Senior Member.
0167-739X/© 2018 Elsevier B.V. All rights reserved.
resources which are provisioned dynamically. However, the allocation of the resources and the order in which the tasks of a given workflow will be executed are matters of great importance. This is commonly referred to as the workflow scheduling problem. In fact, workflow scheduling is an NP-complete problem which has been extensively studied for other paradigms, such as grid and cluster computing. It is noteworthy that if there are n tasks in a workflow and m available virtual machines (VMs), then there exist m^n different ways in which the tasks can be mapped to the VM pool. For large values of n and m, finding an optimal solution by a brute-force approach is computationally very expensive. Therefore, a metaheuristic approach can be very effective for solving this problem. However, every metaheuristic algorithm has its own merits and demerits. Hybridization of such metaheuristic approaches has been shown to produce better results
A. Choudhary et al. / Future Generation Computer Systems 83 (2018) 14–26
[2,3] and therefore this has become a recent trend of research in cloud computing. Heterogeneous Earliest Finish Time (HEFT) [4] is an efficient heuristic proposed for task scheduling in heterogeneous multiprocessor systems, which is also used for cloud computing [5–7]. This algorithm maps each task, arranged in a priority order, to the VM for which the earliest finish time is minimum. It should be noted that it is essentially a single-objective algorithm which can only optimize the makespan. The Gravitational Search Algorithm (GSA) [8] is
a popular metaheuristic approach which utilizes the concept of
the law of gravitation to find a near-optimal solution. The algorithm starts with a set of random particles, where each particle represents a solution and the mass of each particle is calculated using a fitness function based on the application. A particle with a higher fitness value has a higher mass and, hence, can exert more force to attract other particles towards it. Eventually, all the particles converge towards an optimal point. GSA is capable of obtaining a global optimum faster than other metaheuristic algorithms and, hence, has a higher convergence rate. Moreover, it provides better results than Central Force Optimization (CFO) and Particle Swarm Optimization (PSO), as demonstrated in [8].

In this paper, we propose a metaheuristic based algorithm for the workflow scheduling problem which is a hybrid of HEFT and GSA. Specifically, we address the following workflow scheduling problem. Given a workflow consisting of a set of tasks T = {t_1, t_2, …, t_n} with their computational load and precedence constraints, and also given a set of VMs, V = {v_1, v_2, …, v_m}, our objective is to map all the tasks to the available VMs so that the entire workflow can be executed in minimum time and at minimum computational cost. The proposed algorithm is presented with an efficient agent representation and a systematic derivation of the fitness function. The algorithm is extensively simulated using scientific workflows of different sizes and is shown to produce better results as compared to other related algorithms such as the Hybrid Genetic Algorithm (HGA), GSA and HEFT. We use ANOVA [9], a statistical test, to validate the simulation results. This test determines whether a given result set differs significantly, in the statistical sense, from another set of results.

Many algorithms have been proposed for workflow scheduling in cloud computing which are based on metaheuristic approaches. For instance, Rodriguez et al. [10] have proposed a PSO based algorithm with the objective of minimizing the execution cost while meeting deadline constraints. Similarly, an HGA has been presented in [3] that also has a single objective, i.e., minimizing the makespan. Many other metaheuristic approaches have been developed for workflow scheduling, a survey of which can be found in [11]. However, our approach is different from all such approaches and has the following novelty. We consider parameters such as communication bandwidth, output data size of each task, VM boot time, VM shutdown time and performance variability of VMs in order to create a more realistic environment for scheduling. Most of these features are absent in the existing works. Moreover, our approach deals with two objectives, in contrast to the single objective in many existing algorithms. The hybridization of GSA with HEFT is also novel in the sense that it fully exploits the benefits of both algorithms. Our contribution can be summarized as follows.
• A hybrid algorithm based on GSA and HEFT to minimize makespan and total computational cost.
• An efficient agent representation and a systematic derivation of the fitness function.
• Introduction of cost time equivalence and a procedure for eradicating the inferior agents.
• Demonstration of better performance of the algorithm through extensive simulation and comparison with other heuristic/metaheuristic based approaches.
• Validation of the performance through the statistical test ANOVA.
The rest of the paper is organized as follows. Related works are stated in Section 2. Section 3 explains the application and cloud model. Section 4 describes the terminologies used in the paper and the problem statement. Section 5 presents the proposed work with an illustration. Performance metrics, experimental results, and comparison are discussed in Section 6 followed by Section 7 which concludes the paper.
2. Related works
Many heuristic and metaheuristic based algorithms have been proposed for workflow scheduling in cloud computing. In this section, we present a short review of some of the works that are relevant to our proposed scheme. HEFT [4] is a popular heuristic which was initially developed for task scheduling in heterogeneous multiprocessor systems. It is well known that HEFT performs better than many other heuristics, such as [12,13], for task scheduling. However, it considers the minimization of makespan only. An extension of HEFT called Pareto Optimal Scheduling Heuristic (POSH) was proposed by Su et al. [5] for workflow scheduling in the cloud, to minimize the makespan and cost of execution. POSH produces acceptable solutions. Nevertheless, the solutions are derived from a constricted search space and thus it may miss better solutions. An energy-efficient scheduling scheme with a deadline constraint for heterogeneous cloud environments was proposed in [14]. In this work, a new VM scheduler is developed which is shown to reduce energy consumption in the execution of workflows. The authors claimed to achieve up to a 20% reduction in energy requirement and an improvement in processing capacity of 8%. Fard et al. [15] proposed another heuristic called multi-objective list scheduling (MOLS), which provides a general framework for multi-objective static workflow scheduling. It supports four objectives, namely makespan, cost, reliability, and energy, and provides the execution plan based on the selected objectives. Abrishami et al. [16] adopted the Partial Critical Path (PCP) approach for workflow scheduling and designed two algorithms, a one-phase algorithm called IaaS Cloud Partial Critical Paths (IC-PCP) and a two-phase algorithm called IaaS Cloud Partial Critical Paths with Deadline Distribution (IC-PCPD2). Here, a homogeneous cloud environment is assumed. Recently, Casas et al.
[17] have proposed the balanced and file reuse-replication scheduling (BaRRS) algorithm to schedule workflows, based on two optimization constraints, i.e., makespan and cost. They have also focused on finding the optimal number of VMs required for a given workflow. However, it has a large computational overhead. Panda et al. [18] have developed a normalization based task scheduling for a heterogeneous multi-cloud environment. This technique provides a way to schedule tasks over multiple cloud providers. In another work [19], they have proposed a modification of the min–min algorithm with an uncertainty parameter for scheduling tasks in a heterogeneous multi-cloud environment. Gupta et al. [20] have also reported a workflow scheduling algorithm for the multi-cloud environment. However, this work focuses more on compute-intensive workflows. Metaheuristics are well-known techniques to obtain near-optimal solutions. For instance, Pandey et al. [21] proposed a PSO based workflow scheduling algorithm for cost optimization. It is designed to consider computational cost and data transmission cost to provide an execution plan such that the overall cost is minimized. However, this approach has been tested only on limited workflow applications. Jena et al. [22] proposed a multi-objective nested Particle Swarm Optimization (TSPSO) algorithm for workflow scheduling to optimize energy as well as processing time.
Fig. 1. An example of a workflow: nodes represent the tasks and the edges represent precedence relations. The numeric value 7 inside the node t_2 is the computational load of the task t_2 and the edge label 11 is the size of the data generated by t_2.
However, a crucial scheduling factor, i.e., the bandwidth between the VMs, is not considered in this work. Moreover, no experiment was reported on large-scale workflows. Many GA based solutions have also been proposed. Garg et al. [23] proposed a hybrid GA driven by linear programming (LP) for cost optimization in grid computing. This work combines the capabilities of both LP and GA to find a schedule that minimizes the combined cost of all users of the grid. Wang et al. [24] proposed a look-ahead genetic algorithm (LAGA) to optimize both reliability and makespan. But this work focused only on compute-intensive workflows and did not consider the communication time, which is a vital factor for scheduling workflows.
3. Models
This section contains a detailed description of the workflow application model and the cloud server model assumed for the development of the proposed algorithm. The important terminologies used throughout the paper are also described in this section.
3.2. Cloud server model
We assume a cloud server which contains a set of m VMs, represented by V = {v_1, v_2, v_3, …, v_m}. Each VM has its own computational power measured in terms of million instructions per second (MIPS). All VMs are fully connected to each other and may reside in one or more physical cloud servers. The time required to transfer output data from task t_i to t_j is described as the communication overhead time. Note that the communication overhead time is the ratio of the output data size of a task t_i to the bandwidth between the VMs. If both t_i and t_j execute on the same VM, then the communication overhead is assumed to be zero.
4. Terminologies and problem statement
4.1. Constraints and assumptions
The notations used in the proposed work are given in Table 1. The proposed algorithm considers the following constraints and assumptions, similar to the work presented in [10].
1. We consider performance variance for the VMs to calculate the effective CPU cycles for the execution of the tasks. The reason behind this variability is the heterogeneity and shared nature of the underlying hardware infrastructure. Based on a survey [27], the overall performance variability of Amazon's EC2 cloud is found to be 24%. For the VM v_j, the performance variance is represented as deg_{v_j}. Thus, using deg_{v_j}, the execution time of task t_i on VM v_j can be written as

ET_{t_i}^{v_j} = Load(t_i) / (Capacity(v_j) × (1 − deg_{v_j}))    (1)

where ET_{t_i}^{v_j} is the execution time of task t_i on VM v_j, Load(t_i) is the computational load of task t_i and Capacity(v_j) is the computational capacity of VM v_j.
2. The unit chargeable time τ is considered to calculate the cost of execution. If we utilize a leased VM for a time less than τ, it is still charged for the full time period.
3. An initial boot time is always required when a VM is leased. So, we consider the VM boot time in the calculation of the makespan. We also consider the VM shutdown time, as it is required to release the provisioned VM.
4. Each pair of VMs is assumed to be connected with roughly the same bandwidth.
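Eq. (1) can be sketched as a small helper function; the function name and the sample numbers below are our own illustrative choices (the 24% degradation echoes the EC2 figure cited above):

```python
def execution_time(load_mi: float, capacity_mips: float, degradation: float) -> float:
    """Eq. (1): execution time of a task with computational load `load_mi` (MI)
    on a VM of capacity `capacity_mips` (MIPS) whose performance is degraded
    by the fraction `degradation` (deg_vj in [0, 1))."""
    return load_mi / (capacity_mips * (1.0 - degradation))

# A 10,000 MI task on a 1,000 MIPS VM with 24% performance degradation.
t = execution_time(10_000, 1_000, 0.24)
```

Note how a 24% degradation inflates the nominal 10 s execution time to roughly 13.2 s.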
3.1. Workflow application model

The workflow application is represented by a Directed Acyclic Graph (DAG) [25,26], G = (T, E), as shown in Fig. 1, where T = {t_1, t_2, …, t_n} is the set of tasks and E is the set of edges. An edge t_i → t_j indicates the precedence relation between the predecessor t_i and the successor t_j. Thus, task t_j cannot start unless task t_i is complete. Each task t_i is labeled with its computational load in million instructions (MI). Also, the label on each edge t_i → t_j indicates the size of the output data generated by t_i. This data is required to start the execution of the task t_j. The task without any predecessor is termed the entry task (t_entry) and the task without any successor is termed the exit task (t_exit). If there is more than one entry task in the workflow, a new pseudo entry task is created with zero computational load and no output data. Then all the entry tasks are connected with the pseudo task so formed. Similarly, a pseudo exit task can be created, if required.
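The pseudo-entry construction described above can be sketched as follows; the dictionary-based DAG encoding and all task names are our own illustrative choices:

```python
def add_pseudo_entry(succ, load):
    """succ: task -> list of successor tasks; load: task -> computational load (MI).
    If more than one task has no predecessor, add a zero-load pseudo entry task
    with zero-size output edges preceding every original entry task."""
    tasks = set(succ) | {t for ss in succ.values() for t in ss}
    has_pred = {t for ss in succ.values() for t in ss}
    entries = sorted(tasks - has_pred)
    if len(entries) > 1:
        succ["t_entry"] = entries   # pseudo task precedes all entry tasks
        load["t_entry"] = 0         # zero computational load, no output data
    return succ, load

# Two entry tasks t1 and t2 both feeding t3.
succ = {"t1": ["t3"], "t2": ["t3"], "t3": []}
load = {"t1": 7, "t2": 5, "t3": 9}
succ, load = add_pseudo_entry(succ, load)
```

A pseudo exit task could be added symmetrically by mirroring the same check on successors.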
4.2. Problem formulation
For the sake of the bi-objective problem formulation, we first describe the two important parameters, i.e., makespan and cost, as follows.
Makespan Calculation:
Makespan is also referred to as the total execution time of the entire workflow. It includes the boot time of the leased VM, the execution time, the data transfer time between two VMs and the shutdown time of the VM. While computing the makespan, it is assumed that a VM cannot execute a task while data is being transferred from it to another VM. The makespan is equal to the summation of the VM boot time, the maximum of VMtime over all VMs and the VM shutdown time. VMtime[i] denotes the last timestamp (starting from zero for each workflow) up to which the VM v_i executes its assigned tasks. We need to add the VM boot time and the VM shutdown time only once, because only the boot time of the first VM and the shutdown time of the last VM contribute to the makespan; the rest are overlapped
Table 1
Notations and definitions.

Notation — Definition
N — Population size.
n — Number of tasks in a given workflow.
t_i — The ith task.
m — Number of available VMs.
v_j — The jth VM.
X_i — The ith agent.
X_best — The best agent known so far.
α — Determines the weight of makespan and cost in the fitness calculation.
β — The cost makespan equivalence factor. It is a part of the SLA and its value depends on the priority and urgency of the application.
fit_i — Fitness value of the ith agent based on its makespan and cost.
M_i — Mass of the ith agent.
σ — A random variable used in the pricing model.
V_cbase — Base price based on the slowest VM.
deg_{v_j} — Performance degradation of VM v_j.
γ — A small constant which regulates the declination of the gravitational constant.
δ — Threshold mass for replacing the inferior agents.
with other events. Therefore, makespan can be mathematically formulated as follows.
Makespan = VMboottime + max_{i=1,…,m} (VMtime[i]) + VMshutdowntime    (2)
Cost Calculation:
In our assumed model, a cloud server consists of VMs with varying computational capacity for different types of workload.
ET_{t_i}^{v_j} is the execution time of the task t_i on VM v_j as defined in Eq. (1). Let τ be the unit chargeable time for which the charge of execution of any task will be accounted. Let V_cbase be the base price charged for the slowest VM; then, as per the exponential pricing model [5], the cost of execution of the task t_i on VM v_j is denoted as cost(t_i, v_j) and is formulated as follows.

cost(t_i, v_j) = σ × ⌈ET_{t_i}^{v_j} / τ⌉ × V_cbase × exp(CPU cycles of v_j / slowest CPU cycle)    (3)
where σ is a random variable used to generate different combinations of VM pricing and capacity. Let B_{i,j} be a boolean variable, such that

B_{i,j} = 1 if task t_i is assigned to v_j, and 0 otherwise    (4)
Therefore, the total cost of execution for a workflow is defined as
Total Cost =
n
m
∑ ∑
i =1 j =1
B _{i}_{,} _{j} × cost(t _{i} , v _{j} )
(5)
However, the values of makespan and cost may have different scales, so the value of one attribute may overwhelm the value of the other during agent evaluation. Thus, it is not valid to perform a linear formulation directly using the actual values. One intuitive approach is to normalize both makespan and cost in order to scale these values to the same range. Normalization can be done using any of the well-known methods such as min–max normalization, Z-score normalization, etc. However, this solution has a major drawback, i.e., if the global minimum and maximum values of makespan and cost change, the relative rank of the agents may vary between two consecutive iterations. To resolve this issue, we use the makespan equivalent of total cost (ME_cost), calculated using Eq. (6), instead of the total cost.
ME_cost = β × Total Cost    (6)

Problem Statement:
The objective of the proposed work is to minimize the makespan and the makespan equivalent of total cost as given in Eqs. (2) and (6) respectively. Therefore, it is wise to minimize their linear combination. The workflow scheduling problem can be formulated as follows.

Minimize   z = α × Makespan + (1 − α) × ME_cost    (7)
subject to
(i) Σ_{j=1}^{m} B_{i,j} = 1, for i = 1, 2, 3, …, n
(ii) 0 ≤ α ≤ 1
The constraint (i) indicates that any task of the workflow can be assigned to one and only one VM and the constraint (ii) limits the range of α which balances makespan and total cost.
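The objective in Eq. (7) can be sketched directly; with the agent encoding used later in the paper (one VM index per task), constraint (i) holds by construction, so only constraint (ii) needs an explicit check. All numeric values below are illustrative:

```python
def objective(makespan: float, total_cost: float, alpha: float, beta: float) -> float:
    """z = alpha * Makespan + (1 - alpha) * ME_cost, Eq. (7),
    with ME_cost = beta * Total Cost from Eq. (6)."""
    assert 0.0 <= alpha <= 1.0, "constraint (ii)"
    me_cost = beta * total_cost          # express cost on the makespan scale
    return alpha * makespan + (1.0 - alpha) * me_cost

# Equal weighting of makespan and cost-equivalent (alpha = 0.5).
z = objective(makespan=120.0, total_cost=40.0, alpha=0.5, beta=1.5)
```

Setting alpha = 1 reduces z to pure makespan minimization, while alpha = 0 optimizes cost alone.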
5. Proposed work
As our algorithm is a hybrid of GSA and HEFT, we first provide
a brief description of both of these algorithms as follows.
5.1. Overview of heterogeneous earliest finish time (HEFT)
The main idea behind HEFT [4] is to schedule tasks in such a way that the earliest finish time (EFT) is minimized for all the tasks. HEFT is executed in two phases, which are described as follows.
Phase 1: Calculating priority of tasks. In this phase, the priority of each task is calculated using the average execution time and the average communication time. The priorities are calculated in a bottom-up approach. The sequence of tasks is generated from higher to lower priority value, satisfying the precedence constraints of the given workflow. The priority of task t_i is given by

pri(t_i) = w_i + max_{t_j ∈ succ(t_i)} (c_{i,j} + pri(t_j))    (8)

where w_i is the average execution time of task t_i on the available VMs and c_{i,j} is the average communication time between task t_i and task t_j.

Phase 2: Mapping tasks to VMs. This is the main phase of the algorithm, where the actual mapping of tasks to VMs is performed according to the priority of the tasks. The task with the highest priority is scheduled first, by calculating the earliest start time (EST) and EFT on all available VMs.
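Phase 1 can be sketched as a recursive computation of Eq. (8); the three-task DAG and all numeric values are our own illustrative choices:

```python
from functools import lru_cache

# w: average execution time per task; c: average communication time per edge;
# succ: successor lists of the DAG (all values illustrative).
w = {"t1": 10, "t2": 8, "t3": 6}
c = {("t1", "t2"): 3, ("t1", "t3"): 4}
succ = {"t1": ["t2", "t3"], "t2": [], "t3": []}

@lru_cache(maxsize=None)
def pri(t):
    """pri(t_i) = w_i + max over successors t_j of (c_ij + pri(t_j)), Eq. (8).
    Exit tasks (no successors) get priority w_i."""
    if not succ[t]:
        return w[t]
    return w[t] + max(c[(t, j)] + pri(j) for j in succ[t])

# Schedule tasks in decreasing priority order (bottom-up "upward rank").
order = sorted(succ, key=pri, reverse=True)
```

Here pri("t1") = 10 + max(3 + 8, 4 + 6) = 21, so t1 is scheduled first, respecting precedence.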
Fig. 2. Working of GSA.
Table 2
Example of a mapping / agent.

Task:  t_1  t_2  t_3  t_4  t_5  t_6  t_7  t_8
VM:    1    3    2    1    5    1    3    1
5.3. Agent representation
In the proposed algorithm, a mapping of the tasks of a given workflow to the VMs is considered as an agent. For a workflow having n tasks and a cloud server having m VMs, the ith agent can be represented as a vector of dimension 1 × n, i.e.,
X_i = [x_i^1, x_i^2, x_i^3, …, x_i^n]    (9)

where x_i^d represents the VM assigned to the task t_d. Note that x_i^d is an integer which lies in the interval [1, m]. Table 2 shows an example of an agent for 8 tasks on 5 VMs. Here, task t_1 is mapped to VM v_1, i.e., task t_1 will be executed on VM v_1, while preserving the precedence constraints.
5.4. Fitness evaluation
For a given agent, we can always calculate its makespan and makespan equivalent of total cost using Eqs. (2) and (6) respectively. Let Makespan^i and ME_cost^i be the makespan and the makespan equivalent of total cost for the ith agent in the population. Now, the fitness value of the ith agent can be computed as

Fitness(fit_i) = 1 / (1 + α × Makespan^i + (1 − α) × ME_cost^i)    (10)
The fitness value calculated using Eq. (10) is an absolute fitness value. As its calculation does not require any information about the population, the relative difference in fitness value of two agents remains the same irrespective of the iteration in which they are present. Agents with lower makespan and lower cost have higher fitness values, and they can be considered as potential candidates for the final solution rather than agents with higher makespan or higher cost.
5.2. Overview of gravitational search algorithm (GSA)
The GSA is based on the law of gravity [28] and was introduced by Rashedi et al. [8]. It is a population-based search algorithm where each agent is considered as a particle (so, we use agent and particle interchangeably throughout the paper) and its fitness value is considered as its mass. Each particle represents a solution of a given problem. The main idea is that the heavier particles, i.e., superior solutions, do not move much as compared to lighter particles, i.e., inferior solutions. All particles apply force on every other particle. As each particle has some mass, its acceleration and velocity can be calculated using the net force. Using the calculated velocity, the new position of the particle can be found. When the algorithm terminates, the particle with the highest mass provides the near-optimal solution. The working of the GSA is depicted in Fig. 2. To use this algorithm in scheduling, we first identify the search space and the particle representation. Then we initialize the population randomly. Now, in each iteration, we calculate the fitness value of all particles using the fitness function defined as per the optimization constraints. Based on the best and worst particles identified, we calculate the mass to update the position of each particle. The gravity constant that is used to calculate the velocity and the position is also updated in each iteration. We repeat all the steps till the algorithm attains a certain termination criterion.
Remark. The problem of workflow scheduling is to minimize the linear combination of the makespan and the makespan equivalent of total cost as described in Section 4.2. As our fitness value is the reciprocal of this quantity, a higher fitness value is desirable.
In the proposed algorithm, we apply min–max normalization in order to scale the fitness values of all the agents in the population to the range [0, 1] to obtain the mass of each agent. So, the mass of the ith agent is given by

Mass(M_i) = (fit_i − min_{j=1,…,N} fit_j) / (max_{j=1,…,N} fit_j − min_{j=1,…,N} fit_j)    (11)

We use Eq. (11) for our proposed work, which is used in the simulations.
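Eqs. (10) and (11) can be sketched together; the makespan/cost pairs below are illustrative:

```python
def fitness(makespan: float, me_cost: float, alpha: float) -> float:
    """Absolute fitness of an agent, Eq. (10): reciprocal of the weighted objective."""
    return 1.0 / (1.0 + alpha * makespan + (1.0 - alpha) * me_cost)

def masses(fits):
    """Min-max normalize fitness values into [0, 1] to get agent masses, Eq. (11)."""
    lo, hi = min(fits), max(fits)
    return [(f - lo) / (hi - lo) for f in fits]

# Three agents as (makespan, ME_cost) pairs, alpha = 0.5.
fits = [fitness(m, c, 0.5) for m, c in [(100, 80), (120, 90), (90, 95)]]
M = masses(fits)  # best agent gets mass 1.0, worst gets 0.0
```

Because Eq. (10) is population-independent, only the masses (not the fitness ranking) depend on which other agents are present.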
5.5. Proposed scheduling technique
The proposed HGSA algorithm works in two phases. The first phase focuses only on the optimization of makespan. In the second phase, it attempts to optimize the cost while trying to maximize the fitness value, which is calculated from both makespan and cost. The result of the first phase guides the particle movement in the GSA that runs in the second phase of the proposed work. This improves the result as compared to GSA with random initial particles. We use GSA by incorporating HEFT with the following steps.
Step 1: The initial population is seeded with the output of the HEFT algorithm. The HEFT heuristic provides guidance to the GSA that improves the overall performance of the proposed algorithm. It helps to generate better solutions in a smaller number of iterations.

Step 2: The best particle identified from the current generation based on the fitness function is preserved. This is done to ensure that the best agent does not get degraded in future generations.

Step 3: The agents having mass less than the threshold mass (δ) are removed from the current population, as they have very little or no contribution to updating the population. In place of all the removed agents, new agents generated with the help of the best agent identified so far are added to the population. This improves the overall fitness of the population.
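The elimination rule of Step 3 can be sketched as follows; the helper name, the agent values and the threshold are our own illustrative assumptions:

```python
import random

def replace_inferior(population, masses, best, delta, n_vms):
    """Step 3: replace each agent whose mass is below the threshold delta
    with a copy of the best agent in which one randomly chosen task is
    remapped to a random VM (keeping diversity around the best solution)."""
    new_pop = []
    for agent, mass in zip(population, masses):
        if mass < delta:
            child = list(best)                           # clone the best agent
            pos = random.randrange(len(child))           # pick one task position
            child[pos] = random.randrange(1, n_vms + 1)  # remap it to a random VM
            new_pop.append(child)
        else:
            new_pop.append(agent)
    return new_pop

random.seed(1)
pop = [[1, 3, 2], [2, 2, 1], [3, 1, 1]]   # three agents, 3 tasks on 5 VMs
masses_ = [1.0, 0.05, 0.6]                # second agent falls below delta
pop = replace_inferior(pop, masses_, best=[1, 3, 2], delta=0.1, n_vms=5)
```

Only the inferior agent is regenerated; it differs from the best agent in at most one task-to-VM assignment.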
5.6. Position update of particle
Let us consider a system of N agents. We define the position of the ith agent as follows.
X_i = [x_i^1, x_i^2, x_i^3, …, x_i^n], for i = 1, 2, 3, …, N    (12)

where x_i^d shows the position of the ith agent in the dth dimension. Let M_i(k) and G(k) be the mass of the ith agent and the gravitational constant, respectively, in the kth iteration. We can define the force acting on the ith agent by the jth agent in the kth iteration as follows.
F_{i,j}^d(k) = G(k) × (M_i(k) × M_j(k)) / (R_{i,j}(k) + ϵ) × (x_j^d(k) − x_i^d(k))    (13)

where ϵ is a very small constant and R_{i,j}(k) is the Euclidean distance between the ith and jth agents in the kth iteration. R_{i,j}(k) is defined as

R_{i,j}(k) = ‖X_i(k), X_j(k)‖_2    (14)

We suppose that the total force that acts on the ith agent in the dth dimension is a randomly weighted sum of the forces exerted in the dth dimension by the other agents. Then,
F_i^d(k) = Σ_{j=1, j≠i}^{N} rand_j × F_{i,j}^d(k)    (15)

where rand_j is a random number that lies in the interval [0, 1]. By the law of motion [28], the acceleration of the ith agent in the dth dimension in the kth iteration is given by
a_i^d(k) = F_i^d(k) / M_i(k)    (16)
Furthermore, the next velocity of the agent is considered as a fraction of its current velocity added to its acceleration. Therefore, its position and velocity can be calculated as follows.
vel_i^d(k + 1) = rand_i × vel_i^d(k) + a_i^d(k)    (17)

x_i^d(k + 1) = x_i^d(k) + vel_i^d(k + 1)    (18)

The gravitational constant G_0 is initialized in the beginning and is reduced as the algorithm proceeds, in order to improve the search accuracy. G(k) is a function of the initial value G_0 and the iteration number k, defined as

G(k) = G_0 × (k_0 / k)^γ, γ < 1    (19)
where, γ is a small constant that regulates the reduction in the gravitational constant.
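The update rules (13)-(19) can be sketched for a continuous search space; the constants G0, gamma, eps and the two-agent example are our own illustrative choices, and velocities are cold-started at zero when none are supplied:

```python
import math
import random

def gsa_step(X, M, k, V=None, k0=1, G0=100.0, gamma=0.5, eps=1e-9):
    """One GSA iteration over continuous positions, following Eqs. (13)-(19).
    X: agent positions, M: agent masses, V: previous velocities (zeros if None).
    For workflow scheduling the updated coordinates would still have to be
    rounded back to valid VM indices in [1, m]."""
    N, n = len(X), len(X[0])
    G = G0 * (k0 / k) ** gamma                    # Eq. (19): decaying constant
    if V is None:
        V = [[0.0] * n for _ in range(N)]         # cold-start velocities
    for i in range(N):
        for d in range(n):
            F = 0.0
            for j in range(N):
                if j == i:
                    continue
                R = math.dist(X[i], X[j])         # Eq. (14): Euclidean distance
                # Eq. (13), randomly weighted and summed as in Eq. (15)
                F += random.random() * G * M[i] * M[j] / (R + eps) * (X[j][d] - X[i][d])
            a = F / (M[i] + eps)                  # Eq. (16); eps guards mass 0
            V[i][d] = random.random() * V[i][d] + a   # Eq. (17)
            X[i][d] = X[i][d] + V[i][d]               # Eq. (18)
    return X, V

random.seed(0)
X, V = gsa_step([[1.0, 2.0], [3.0, 1.0]], [1.0, 0.5], k=1)
```

In a full run, the returned velocities would be fed back into the next call so that Eq. (17) accumulates momentum across iterations.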
5.7. Algorithm
We start by generating the initial population by random mapping of the tasks onto the VMs in step 1 of the proposed algorithm, followed by seeding the result of HEFT into the population in step 2. Once the initial population is ready, a set of iterative steps is applied to each agent of the population to get the final result, as per step 3 through step 16. The first step of each iteration is to calculate the gravitational constant in step 4. Then, in step 5, we compute the fitness value of each agent using Eq. (10). Note that Eq. (10) requires the values of both cost and makespan; Algorithms 2 and 3 can be utilized for calculating these values. Based on the fitness values, we identify the best and the worst agents in step 6 for calculating the mass of all the agents in step 7. In step 8, we update the position of each agent by calculating the net force, net acceleration and velocity. In the remaining steps, we replace the inferior agents by new agents generated with the help of the best agent known so far. A new agent is generated by mapping one of the tasks to a randomly selected VM, while the rest of the mapping remains the same as that of the best agent.
Algorithm 1: Proposed Workflow Scheduling Algorithm
Input: Workflow Application (W) and Cloud Server Specification (CSS)
Output: Task mapping with VMs (M)
1   Initialize population X with N randomly generated agents.
2   Replace one of the agents by the mapping generated by HEFT.
3   for k = 1 to MAX_ITERATION do
4       Compute gravitational constant G(k) using Eq. (19)
5       Compute fitness value fit_i for i = 1, 2, 3, ..., N using Eq. (10)
6       Identify best and worst agents based on the calculated fitness values.
7       Compute mass M_i for i = 1, 2, 3, ..., N using Eq. (11)
8       Update velocity and position of each agent using Eq. (13) to (18)
9       for i = 1 to N do
10          if M_i < δ then
11              Pos = a random integer from interval [1, n]
12              x_i^d = x_best^d for d = 1, 2, 3, ..., n
13              x_i^Pos = a random integer from interval [1, m]
14          end if
15      end for
16  end for
17  Find M corresponding to the best agent based on fit_i for i = 1, 2, 3, ..., N
18  return M
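The inferior-agent elimination strategy (steps 9–15 of Algorithm 1) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: agents are assumed to be lists of VM indices (one entry per task), masses are assumed normalized, and all identifiers are chosen for this sketch.

```python
import random

def replace_inferior(agent_masses, population, best_agent, n_tasks, n_vms, delta=0.1):
    """Replace agents whose mass falls below the threshold delta.

    Each replacement clones the best agent's task-to-VM mapping and
    remaps one randomly chosen task to a randomly chosen VM, as in
    steps 11-13 of Algorithm 1.
    """
    for i, mass in enumerate(agent_masses):
        if mass < delta:
            new_agent = list(best_agent)                      # x_i^d = x_best^d
            pos = random.randrange(n_tasks)                   # Pos in [1, n]
            new_agent[pos] = random.randrange(1, n_vms + 1)   # x_i^Pos in [1, m]
            population[i] = new_agent
    return population
```

The clone-then-perturb step keeps the new agent close to the best-known solution while still injecting diversity into the population.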
Algorithm 2: CostCalculation
Input: Workflow Application (W), Mapping (M), Cloud Server Specification (CSS)
Output: Cost value (Cost)
1   Set Cost = 0
2   for each task t_i ∈ W do
3       Execution_time ET_{t_i}^{v_M[i]} = Load(t_i) / (Capacity(v_M[i]) × (1 − deg_{v_M[i]}))
4       Unit_used = ⌈ET_{t_i}^{v_M[i]} / τ⌉
5       Rate_per_unit = σ × V_cbase × exp(CPU cycles of v_M[i] / slowest CPU cycle)
6       Cost = Cost + Unit_used × Rate_per_unit
7   end for
8   return Cost
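A minimal Python sketch of Algorithm 2, under the assumption that each VM is described by a dict with its capacity (MIPS), performance degradation and CPU cycles; τ is the billing interval and σ, V_cbase are the pricing constants of the cost model. All names and the data layout are illustrative, not from the paper's code.

```python
import math

def cost_of_schedule(loads, mapping, vms, tau=1.0, sigma=1.0, v_cbase=1.0):
    """Billing-unit cost of a task-to-VM mapping (Algorithm 2 sketch).

    loads:   list of task loads (instructions)
    mapping: mapping[i] = index of the VM assigned to task i
    vms:     list of dicts with 'capacity', 'deg' and 'cycles' keys
    """
    slowest = min(v["cycles"] for v in vms)
    total = 0.0
    for i, load in enumerate(loads):
        vm = vms[mapping[i]]
        # execution time on the assigned VM, degraded by its variance
        et = load / (vm["capacity"] * (1.0 - vm["deg"]))
        units = math.ceil(et / tau)               # billing units consumed
        rate = sigma * v_cbase * math.exp(vm["cycles"] / slowest)
        total += units * rate
    return total
```

Note that the ceiling in the unit count reflects the common cloud billing practice of charging per started interval.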
A. Choudhary et al. / Future Generation Computer Systems 83 (2018) 14–26
Algorithm 3: MakespanCalculation
Input: Workflow Application (W), Mapping (M), Cloud Server Specification (CSS)
Output: Makespan value (Makespan)
1   for each v_i ∈ V do
2       VM_time[i] = 0
3   end for
4   for each task t_i ∈ W in topological order do
5       if t_i.ParentCount ≠ 0 then
6           Parent_finishtime = max_{t_k ∈ pred(t_i)}(Task_actual_finish_time[k])
7       end if
8       if t_i.ChildCount ≠ 0 then
9           Transfer_time = 0
10          for each task t_j where t_j ∈ succ(t_i) and M[i] ≠ M[j] do
11              if output data of task t_i is not transferred to v_M[j] then
12                  Transfer_time = Transfer_time + t_i.Output_data_size / Bandwidth
13              end if
14          end for
15      end if
16      Execution_time ET_{t_i}^{v_M[i]} = Load(t_i) / (Capacity(v_M[i]) × (1 − deg_{v_M[i]}))
17      Actual_start_time = max(Parent_finishtime, VM_time[M[i]])
18      Task_actual_finish_time[i] = Actual_start_time + Execution_time + Transfer_time
19      VM_time[M[i]] = Task_actual_finish_time[i]
20  end for
21  Makespan = VM_boot_time + max_{v_i ∈ Cloud}(VM_time[i]) + VM_shutdown_time
22  return Makespan
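Algorithm 3 can be sketched in Python as below. This is an illustrative simplification, not the paper's code: tasks are assumed to be indexed in topological order, and the sketch charges one transfer per successor on a different VM, whereas Algorithm 3 transfers the output data at most once per destination VM.

```python
def makespan(loads, mapping, vms, succ, out_size, bandwidth=1.0,
             boot=0.5, shutdown=0.5):
    """Makespan of a schedule (Algorithm 3 sketch).

    loads:    task loads, indexed in topological order
    mapping:  mapping[i] = index of the VM assigned to task i
    vms:      list of dicts with 'capacity' and 'deg' keys
    succ:     succ[i] = list of successor task indices of task i
    out_size: out_size[i] = output data size of task i
    """
    n = len(loads)
    pred = [[] for _ in range(n)]          # derive predecessors from succ
    for i in range(n):
        for j in succ[i]:
            pred[j].append(i)
    vm_time = [0.0] * len(vms)             # per-VM availability time
    aft = [0.0] * n                        # actual finish time per task
    for i in range(n):
        parent_finish = max((aft[k] for k in pred[i]), default=0.0)
        # transfer cost for each child placed on a different VM
        transfer = sum(out_size[i] / bandwidth
                       for j in succ[i] if mapping[j] != mapping[i])
        vm = vms[mapping[i]]
        et = loads[i] / (vm["capacity"] * (1.0 - vm["deg"]))
        start = max(parent_finish, vm_time[mapping[i]])
        aft[i] = start + et + transfer
        vm_time[mapping[i]] = aft[i]
    return boot + max(vm_time) + shutdown
```

As in Algorithm 3, the transfer time is accounted for in the producing task's finish time, and VM boot and shutdown overheads bracket the schedule.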
Fig. 3. Workflow and cloud environment: (a) Montage workflow of 16 tasks; (b) cloud environment.
5.8. An illustration

Consider a Montage workflow [29,30] consisting of 16 tasks, T = {t_1, t_2, ..., t_16}, and a set of 4 VMs, V = {v_1, v_2, v_3, v_4}, as shown in Fig. 3. We have to schedule the workflow on the given VMs, which are fully connected to each other. The output of this illustration is a mapping of the given tasks onto the VMs that is optimized in terms of makespan and cost. Table 3 shows the parameters used in this illustration and Table 4 shows the initial population generated, as described in step 1 of the proposed algorithm.

Table 3
Parameters used in the illustration.

Parameter                                   Value
Number of VMs                               4
Computational power of all VMs              2.0, 3.5, 4.5 and 5.5 MIPS
Network bandwidth                           1 MBps
Boot time and shutdown time of VM           0.5 sec
Performance variance of VM                  24%
MAX_ITERATION                               10
Population size (N)                         100
Gravitational constant (G_0)                5
Weight of makespan and cost (α)             0.5
Cost time equivalence (β)                   1
Small constant used in gravity (γ)          0.3
Mass threshold for inferior agents (δ)      0.1
Small constant used in force (ϵ)            10

We now compute a schedule using HEFT for the given workflow; the resultant schedule is then included in the generated population. To compute the HEFT schedule, we need to find the priority of each task using Eq. (8). Then, starting from the highest priority, each task is mapped to the VM that minimizes its EFT. Table 5 shows the priorities as well as the mapping of the tasks to VMs. The makespan and cost of the schedule generated by HEFT are 36.57 sec and $50.31, respectively.

Fig. 4 demonstrates the process of including the mapping generated by HEFT into the initial population by replacing the shaded agent. The selection of the agent to be replaced is purely random. This completes step 2 of the algorithm. Now, the current population contains the HEFT-generated agent as well as the agents generated randomly. These agents are processed as described in step 3 through step 16, for a certain number of iterations. Table 6 shows the details of the best agent identified
Table 4
Initial population of agents.

Agent 1    4 1 1 2 3 4 4 2 3 1 2 3 2 1 1 4
Agent 2    2 3 1 1 1 2 3 4 4 1 1 2 1 3 3 3
...
Agent N    1 2 1 1 1 3 3 4 4 1 2 2 2 2 1 1
Fig. 4. Seeding of HEFT solution into population.
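The seeding step depicted in Fig. 4 amounts to overwriting one randomly chosen agent with the HEFT mapping. A minimal sketch, with all names illustrative:

```python
import random

def seed_heft(population, heft_mapping):
    """Replace a randomly chosen agent with the HEFT-generated mapping (step 2)."""
    idx = random.randrange(len(population))
    population[idx] = list(heft_mapping)
    return population
```

Seeding a known good heuristic solution into a random initial population is a common way to give a metaheuristic a strong starting point without sacrificing diversity.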
Table 5
Task priority and task mapping by HEFT.

Task    Priority    Virtual machine
t_1     107.71      3
t_2     103.37      2
t_3     107.74      4
t_4     103.24      4
t_5      94.38      2
t_6      94.44      3
t_7      90.26      2
t_8      90.20      4
t_9      90.05      3
t_10     90.08      2
t_11     90.06      4
t_12     90.08      4
t_13     77.89      4
t_14     77.57      4
t_15      2.63      4
t_16      0.28      4
in each iteration based on the calculated fitness value. The resultant schedule, shown in Table 7, has a makespan of 32.64 sec and a cost of $47.71.
6. Experimental results and comparison
This section presents the simulation results of the proposed algorithm and compares it with three workflow scheduling algorithms: the standard GSA based approach, HEFT and HGA. Note that, for the sake of comparison, we convert the single objective of the HGA (minimization of makespan) into a biobjective, keeping all the constraints the same as in the proposed algorithm.
6.1. Experimental setup
The simulations were carried out in C++ on an Intel(R) Core(TM) i5-2540M CPU at 2.60 GHz with 4 GB RAM, running on a Linux platform. The specifications of the cloud environment, as well as the parameters used for the evaluation of our proposed algorithm, are given in Table 8.
Table 6
Iteration wise specification of the best agent.

Iteration    Makespan     Cost         Fitness
1            36.575226    50.313644    2.250 × 10^−2
2            32.338966    49.307339    2.391 × 10^−2
3            32.536972    48.014904    2.422 × 10^−2
4            32.642235    47.711269    2.428 × 10^−2
5            32.642235    47.711269    2.428 × 10^−2
6            32.642235    47.711269    2.428 × 10^−2
7            32.642235    47.711269    2.428 × 10^−2
8            32.642235    47.711269    2.428 × 10^−2
9            32.642235    47.711269    2.428 × 10^−2
10           32.642235    47.711269    2.428 × 10^−2
6.2. Performance metrics
We normalize the makespan and monetary cost similar to that in [5] and call them the schedule length ratio (SLR) and monetary cost ratio (MCR) of tasks, as follows:

SLR = Makespan / Σ_{t_i ∈ CP} min_{j=1}^{m} {ET_{t_i}^{v_j}}    (20)

MCR = Total Cost / Σ_{t_i ∈ CP} min_{j=1}^{m} {cost(t_i, v_j)}    (21)

The denominator is the summation of the minimum execution time (respectively, monetary cost) of the tasks on the critical path (CP), without communication cost. For a given task graph, an algorithm that produces a scheduling plan with lower SLR and MCR values is more effective.

We also calculate the normalized fitness value for easy comparison and visualization of the overall quality of the results. We use max-normalization to normalize the absolute fitness value as calculated using Eq. (10). After applying normalization, the maximum value is mapped to one and the rest of the values lie in the interval (0, 1]. Mathematically, max-normalization is defined as

x̂_i = x_i / max_{j=1 to N}(x_j)    (22)

where x̂_i is the normalized value for x_i.
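Eq. (22) is a one-liner in practice. A minimal sketch (assuming all fitness values are positive, so the result lands in (0, 1]):

```python
def max_normalize(values):
    """Max-normalization: map the maximum value to 1, the rest into (0, 1]."""
    m = max(values)
    return [v / m for v in values]
```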
Table 7
Resultant schedule of Montage with 16 tasks.

Task    t_1  t_2  t_3  t_4  t_5  t_6  t_7  t_8  t_9  t_10  t_11  t_12  t_13  t_14  t_15  t_16
VM      3    1    4    4    2    3    1    1    3    2     4     4     4     4     4     4
Fig. 5. Scientific workflows: (a) CyberShake; (b) Epigenomics; (c) Inspiral; (d) Montage; (e) SIPHT. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
6.3. Dataset
The proposed algorithm is evaluated on various scientific workflows, as considered by Bharathi et al. [29] and Juve et al. [30]. These workflows are synthesized using the generator program provided by the Pegasus project [31]. It uses information gathered from actual executions of scientific workflows to generate a synthetic workflow, which is in turn a near approximation of a real workflow. We used CyberShake (I/O- and network-intensive), Epigenomics (both compute-intensive and network-intensive), Inspiral (compute-intensive), Montage (network-intensive) and SIPHT (I/O-intensive) in the simulation. We divide each type of workflow into three categories based on the number of constituent tasks, as shown in Table 9. Each workflow has characteristic features which play pivotal roles in the scheduling process; a detailed characterization of each workflow can be found in [29]. The topology of tasks in a given workflow is also a major criterion for scheduling. These workflows exhibit a variety of topological features, such as pipeline (yellow task nodes), data aggregation (red task nodes) and data partitioning (green task nodes), as shown in Fig. 5.
6.4. Result analysis and performance evaluation
In this subsection, we evaluate the performance of our proposed algorithm against the HEFT, standard GSA and the HGA with respect to makespan and monetary cost. The MCR, SLR and the normalized fitness, as defined in Section 6.2, are used as the performance metrics for the comparative analysis. Note that lower values of the SLR and MCR are desirable, as they indicate lower makespan and cost, respectively, whereas a higher value of the normalized fitness is preferred. We present the results obtained by using the same machine configuration, the same constraints, and the same set of workflow applications (of various sizes and types). Figs. 6–9 show the bar charts for the MCR, SLR and the normalized fitness value, comparing the HGSA, HGA, standard GSA, and the HEFT. From the figures, we observe that the MCR of the proposed HGSA is better than that of the HEFT, HGA and the GSA for all the workflow categories (small, medium and large). Thus, the proposed HGSA outperforms the others in terms of MCR for all the aforementioned workflows. We also observe that the SLR obtained by the proposed algorithm is much better than that of the GSA and the HGA. However, it is somewhat inferior to that of the HEFT. This is due to the fact
Table 8
Parameters used during the experiment.

Parameter                                   Value
Network bandwidth                           1 MBps
Boot time and shutdown time of VM           0.5 sec
Performance variance of VM                  24%
MAX_ITERATION                               200
Population size (N)                         500
Gravitational constant (G_0)                5
Weight of makespan (α)                      0.5
Cost time equivalence (β)                   50
Small constant used in gravity (γ)          0.3
Mass threshold for inferior agents (δ)      0.1
Small constant used in force (ϵ)            10
that the HEFT is a single-objective scheduling algorithm which focuses on makespan only. To calculate the normalized fitness value, we used two input parameters, namely α and β, as shown in Table 1. The normalized fitness value shows the overall quality as per the user requirement. From Fig. 9(a)–(c), we observe that the proposed HGSA performs better than the HEFT, HGA and the GSA. We get better results using the HGSA even in the case where its SLR is poorer than that of the HEFT, as the difference in cost is enough to compensate for the difference in makespan.
6.5. Analysis of variance (ANOVA)
We also conducted hypothesis testing using ANOVA [9]. It is a statistical method which compares the means of two or more groups to determine whether there is a significant difference among them. This test has a null hypothesis (H_0) and an alternate hypothesis (H_1), defined as
H_0: µ_1 = µ_2 = µ_3 = ... = µ_n    (23)

H_1: Means are not equal    (24)
During the test, if F_statistical < F_critical, we fail to reject the null hypothesis and conclude that all groups have the same mean. If F_statistical > F_critical, we reject the null hypothesis and accept the alternate hypothesis, from which we conclude that at least one group differs significantly from the others.
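The one-way ANOVA F statistic used here can be computed with the standard library alone; a minimal sketch (the function name and input layout are illustrative — in practice a library routine such as scipy's one-way ANOVA would typically be used):

```python
def one_way_anova(groups):
    """F statistic for a one-way ANOVA over a list of sample groups."""
    all_vals = [x for g in groups for x in g]
    grand = sum(all_vals) / len(all_vals)       # grand mean over all samples
    k, n = len(groups), len(all_vals)
    # between-groups sum of squares, weighted by group size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within-groups sum of squares (deviation from each group's own mean)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)           # df_between = k - 1
    ms_within = ss_within / (n - k)             # df_within = n - k
    return ms_between / ms_within
```

The resulting F statistic is then compared against the critical value for (k − 1, n − k) degrees of freedom, which is 3.35 at the 0.05 level for the (2, 27) configuration used in Tables 10–14.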
Fig. 6. Results for small sized workflows: (a) monetary cost ratio; (b) schedule length ratio.

Fig. 7. Results for medium sized workflows: (a) monetary cost ratio; (b) schedule length ratio.

Fig. 8. Results for large sized workflows.
Fig. 9. Comparison of normalized fitness: (a) small sized workflows; (b) medium sized workflows; (c) large sized workflows.
Table 9
Category of the workflow based on the number of tasks.

Number of tasks    Category
24 to 60           Small
100 to 400         Medium
800 to 2000        Large
The test was performed to compare the standard GSA, hybrid
GA and the hybrid GSA. In order to do this experiment, all three
algorithms were executed 10 times for each of the five scientific
Table 10
ANOVA using CyberShake workflow of 2000 tasks.
workflows of various sizes. Tables 10–14 show the results for each workflow of 2000 tasks. As can be seen, for all workflows we have F_statistical > F_critical. Thus, we reject the null hypothesis, and the means of all the groups are significantly different. This implies that the performance of HGSA is better and more consistent than that of HGA and GSA.
7. Conclusion
In this paper, we have presented a hybrid gravitational search algorithm for scheduling workflows, with the basic objective of reducing the makespan as well as the cost of execution. The efficiency
(a) Summary of input

Group    Count    Sum         Average     Variance
HGSA     10       3.51E−05    3.51E−06    3.44E−16
GSA      10       2.96E−05    2.96E−06    5.10E−17
HGA      10       3.17E−05    3.17E−06    2.01E−16

(b) ANOVA test result

Source of variation    SS          df    MS          F stat     P-value      F crit
Between groups         1.56E−12    2     7.81E−13    3925.06    5.278E−34    3.35
Within groups          5.37E−15    27    1.99E−16
Total                  1.57E−12    29
Table 11
ANOVA using Epigenomics workflow of 2000 tasks.

(a) Summary of input

Group    Count    Sum         Average     Variance
HGSA     10       3.60E−07    3.60E−08    5.90E−20
GSA      10       3.00E−07    3.00E−08    1.30E−20
HGA      10       3.10E−07    3.10E−08    5.90E−20

(b) ANOVA test result

Source of variation    SS          df    MS          F stat     P-value     F crit
Between groups         1.60E−16    2     7.80E−17    1788.54    2.00E−29    3.35
Within groups          1.20E−18    27    4.40E−20
Total                  1.60E−16    29

Table 12
ANOVA using Inspiral workflow of 2000 tasks.

(a) Summary of input

Group    Count    Sum         Average     Variance
HGSA     10       3.90E−06    3.90E−07    3.70E−18
GSA      10       3.50E−06    3.50E−07    4.50E−19
HGA      10       3.60E−06    3.60E−07    2.30E−18

(b) ANOVA test result

Source of variation    SS          df    MS          F stat     P-value     F crit
Between groups         9.60E−15    2     4.80E−15    2238.84    1.00E−30    3.35
Within groups          5.80E−17    27    2.10E−18
Total                  9.70E−15    29

Table 13
ANOVA using Montage workflow of 2000 tasks.

(a) Summary of input

Group    Count    Sum         Average    Variance
HGSA     10       8.30E−05