
Proceedings of the URECA@NTU 2008-09

Cloud Computing: Application On Data Farming


Yong Yong Cheng, School of Computer Engineering

Asst Prof Low Yoke Hean, Malcolm, School of Computer Engineering

Mr Choo Chwee Seng, DSO National Laboratories

Abstract: Data Farming depends on simulation models that can be run multiple times to bring larger portions of the problem landscape to light; it is used to aid intuition and to gain insight into a problem domain. Data Farming is thus an iterative process that is both computation-intensive and data-intensive. Cloud Computing is rapidly gaining widespread acceptance, with many big companies, such as Yahoo and Google, using it in their daily business operations. Cloud Computing is therefore incorporated into Data Farming so that Data Farming can be performed at a lower cost, on a larger scale and in less time. Hadoop, a Cloud Computing platform with a proven ability to process vast amounts of data, is an appropriate platform for Data Farming. Indeed, Cloud Computing provides Data Farming with the benefits of Grid Computing while avoiding its drawbacks. Finally, a Web Service is created so that Data Farming can run simulations in the cloud at a lower cost, on a larger scale and in less time.

Keywords: data farming; cloud computing; hadoop; mapreduce; map; reduce

1 INTRODUCTION

This paper looks at the use of Cloud Computing within the Data Farming of Agent-Based Simulation experiments, and at the incorporation of Cloud Computing into Data Farming so that Data Farming can be performed at a lower cost, on a larger scale and in less time. Data Farming relies on a set of enabling technologies and processes that have been the focus of ongoing research and development efforts: distributed and high-performance computing; agent-based simulations and rapid model development; knowledge discovery methods; high-dimensional data visualization techniques; design-of-experiments methods; human-computer interfaces; teamwork and collaborative environments; and heuristic search techniques [1]. Data Farming can be thought of as nothing more than putting these advances to work in order to automate the scientific method. It depends on simulation models that can be run multiple times to bring larger portions (as opposed to simply points) of the problem landscape to light [2]. It is not intended to predict an outcome; rather, it is used to aid intuition and to gain insight [1].

Data Farming is an iterative process. The steps, as shown in Figure 1, are inherent in the process and may be repeated until sufficient insight into a problem is gained [1].

Figure 1: Data Farming Iterative Process, Adapted From [1]

During Data Farming, each simulation model has to be run multiple times, either for model testing or for parameter space exploration. The runs are performed using a single computer, a cluster of computers (cluster computing) or a grid of computers (grid computing). The pros and cons of each option are summarized in Table 1.

Table 1: Computing Options for Data Farming

A Single Computer
  Cons: Time-consuming. May not be able to handle large and complex computing tasks.

A Cluster of Computers (Cluster Computing)
  Pros: Reduces the time needed to obtain the results. Able to handle large and complex computing tasks.
  Cons: All computers must have the same hardware and software configurations.

A Grid of Computers (Grid Computing)
  Pros: Reduces the time needed to obtain the results. Able to handle large and complex computing tasks. Computers can differ in their configurations.
  Cons: Have to deal with all issues occurring within the grid. High cost when the grid expands in size.

The results of Data Farming may be incorporated into other modeling and operation analysis activities. Insight may be used to provide input to deterministic models or equations, or to build more realistic simulations and models [1].

Cloud Computing is a style of computing in which IT-related capabilities are provided as a service, allowing users to access technology-enabled services over the Internet without knowledge of, expertise with, or control over the technology infrastructure that supports them [3]. The adoption of Cloud Computing has its own benefits and drawbacks. The benefits are: access to completely different levels of scale and economics, in terms of the ability to scale very rapidly and to operate IT systems more cheaply; easier change management of infrastructure, including maintenance and upgrades (cloud vendors extensively virtualize and commoditize the underlying components to make them non-disruptive to replace and improve); improved agility to deploy solutions and choice between vendors, particularly when cloud interoperability becomes more of a reality than it is today; and an on-ramp to new computing advances such as non-relational databases, new languages, and frameworks that are designed to encourage scalability and take advantage of new innovations [4].

The drawbacks of adopting Cloud Computing include the security of enterprise data stored in the cloud, the risk of lock-in to cloud platform vendors, the loss of control over cloud resources run and managed by someone else, and reliability [4].

Data Farming requires a huge amount of computing power to run a simulation thousands or millions of times across a large parameter and value space [5]. By adopting Cloud Computing, computing power can be tapped from multiple clouds over the Internet so that Data Farming can be performed at a lower cost, on a larger scale and in less time. At the same time, the benefits of Grid Computing are preserved while its drawbacks are eliminated.

This paper is organized as follows: Section 2 demonstrates the ability of the Hadoop platform to perform Evolutionary Multi-Objective Optimization using an Island Model Genetic Algorithm. Section 3 summarizes the main contributions of this paper and discusses directions for future research.

2 HADOOP AND EVOLUTIONARY MULTI-OBJECTIVE OPTIMIZATION

An experiment is conducted to determine whether Evolutionary Multi-Objective Optimization performed using a Hadoop cluster will produce a better outcome in less time.

2.1 HADOOP

Hadoop is a Java-based Cloud Computing software platform that allows one to easily write and run applications that process vast amounts of data. It implements MapReduce, using the Hadoop Distributed File System (HDFS) [6]. As shown in Figure 2, MapReduce divides applications into many small blocks of work, and HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce then processes the data where it is located [6].

Figure 2: MapReduce and HDFS, Extracted From [6]

The experiment uses a Hadoop cluster comprising four compute nodes. All nodes have the same hardware and software specification: a Pentium 4 3.60GHz with 2GB of memory, running Windows XP Professional. Within the Hadoop cluster, all nodes serve as slave nodes, with one of them performing the additional tasks of a master node. The master node is responsible for: scheduling a job's component tasks on the slave nodes, monitoring them and re-executing the failed tasks [7]; managing the file system namespace; determining the mapping of blocks to the slave nodes; and regulating access to files by the clients [8].

The slave node is responsible for: executing the tasks as directed by the master node [7]; serving read and write requests from the file system's clients; and performing block creation, deletion, and replication upon instruction from the master node [8].
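To make the master/slave division of labour concrete, the following is a minimal sketch (not from the paper) of a client writing a file into HDFS using the Hadoop 0.19-era API documented in [8]; the path and file contents are purely illustrative. The master (NameNode) only handles the namespace and block mapping, while the data itself is streamed to the slave (DataNode) processes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads the cluster's hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);     // connects to the master (NameNode)

        // Illustrative path: one input file describing a Data Farming excursion.
        Path file = new Path("/datafarming/input/parameters.txt");
        FSDataOutputStream out = fs.create(file); // NameNode allocates the blocks
        out.writeUTF("excursion-1: param1=0.5 param2=10");
        out.close();                              // blocks are replicated on the DataNodes

        System.out.println("Replication factor: "
                + fs.getFileStatus(file).getReplication());
    }
}
```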


2.2 EVOLUTIONARY MULTI-OBJECTIVE OPTIMIZATION

Evolutionary Multi-Objective Optimization can be used in the Data Farming problem to reduce the number of evaluations needed. In many real-world problems, there are several criteria that have to be considered in order to evaluate the quality of an individual. Only on the basis of a comparison of these several criteria or objectives (hence multi-objective) can a decision be made as to the superiority of one individual over another [9]. Multi-Objective Optimization is concerned with the simultaneous minimization of N criteria f_r, r = 1, ..., N. The values f_r are determined by the objective functions, which in turn depend on the variables of the individuals (the decision variables) [9]. Evolutionary Multi-Objective Optimization uses an evolutionary algorithm in its optimization. Figure 3 shows the structure of an extended multi-population evolutionary algorithm.

Figure 3: Structure of an Extended Multi-Population Evolutionary Algorithm, Adapted From [9]

First, an initial population is created at random (or according to a predefined scheme), which is the starting point of the evolutionary process. Then a loop consisting of the steps evaluation (fitness assignment), selection, recombination and/or mutation is executed a certain number of times. Each loop iteration is called a generation, and a predefined maximum number of generations often serves as the termination criterion of the loop. Other conditions, such as stagnation in the population or the existence of an individual of sufficient quality, may also be used to stop the simulation. At the end, the best individuals in the final population represent the outcome of the evolutionary algorithm [10]. The following sub-sections describe the evolutionary algorithm used in Evolutionary Multi-Objective Optimization and the modifications made to it so that Evolutionary Multi-Objective Optimization can be performed using the Hadoop cluster.

2.2.1 NSGA-II

Metaheuristic Algorithms in Java (jMetal) is an object-oriented Java-based framework aimed at facilitating the development, experimentation, and study of metaheuristic algorithms for solving multi-objective optimization problems. A number of metaheuristic algorithms, such as the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [11], have already been implemented in jMetal, and many problems usually included in performance studies are also available in the framework [12]. NSGA-II [11] has been chosen as the evolutionary algorithm and has the following configuration in the experiment.

Selection: Binary Tournament (two individuals are randomly chosen; the fitter of the two is selected as a parent).
Crossover: Simulated Binary Crossover (SBX) with a distribution index of 20.0 and a probability of 0.9.
Mutation: Polynomial Mutation with a distribution index of 20.0 and a probability of 1.0 divided by the number of decision variables in the problem.
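To make this configuration concrete, the following is an illustrative sketch of the three operators as they are commonly defined; this is our own code, not jMetal's implementation, and the class and method names are ours. Note that the scalar fitness used in the tournament is a simplification: NSGA-II actually compares individuals by non-domination rank and crowding distance.

```java
import java.util.Random;

/** Minimal, illustrative versions of the configured variation operators. */
public final class NsgaOperators {
    private static final Random RNG = new Random();

    /** Binary tournament: two individuals are chosen at random and the one
     *  with the better (here, lower) fitness is selected as a parent. */
    public static int binaryTournament(double[] fitness) {
        int a = RNG.nextInt(fitness.length);
        int b = RNG.nextInt(fitness.length);
        return fitness[a] <= fitness[b] ? a : b;
    }

    /** Simulated Binary Crossover (SBX) on one pair of decision variables,
     *  with distribution index eta (20.0 in the experiment). It is applied
     *  to a parent pair with the configured probability (0.9). */
    public static double[] sbx(double x1, double x2, double eta) {
        double u = RNG.nextDouble();
        double beta = (u <= 0.5)
                ? Math.pow(2.0 * u, 1.0 / (eta + 1.0))
                : Math.pow(1.0 / (2.0 * (1.0 - u)), 1.0 / (eta + 1.0));
        double c1 = 0.5 * ((1.0 + beta) * x1 + (1.0 - beta) * x2);
        double c2 = 0.5 * ((1.0 - beta) * x1 + (1.0 + beta) * x2);
        return new double[] { c1, c2 };
    }

    /** Polynomial mutation of one variable within [lower, upper], with
     *  distribution index eta (20.0); applied with probability 1/n for a
     *  problem with n decision variables. */
    public static double polynomialMutation(double x, double lower,
                                            double upper, double eta) {
        double u = RNG.nextDouble();
        double delta = (u < 0.5)
                ? Math.pow(2.0 * u, 1.0 / (eta + 1.0)) - 1.0
                : 1.0 - Math.pow(2.0 * (1.0 - u), 1.0 / (eta + 1.0));
        double y = x + delta * (upper - lower);
        return Math.min(upper, Math.max(lower, y)); // clamp to the bounds
    }
}
```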

The benchmark problems Kursawe, ZDT-1, ZDT-2, ZDT-3, ZDT-4 and ZDT-6 have been chosen for NSGA-II [11] to optimize, and the following metrics have been chosen to evaluate the solutions produced by NSGA-II [11] on each of the problems.

Pareto Front Size: measures the number of individuals in the non-dominated set.
Hyper Volume (HV): calculates the volume (in the objective space) covered by the individuals in the non-dominated set [13].
Generational Distance (GD): measures how far the individuals in the set of non-dominated individuals found are from those in the Pareto-optimal set [13].
Inverted Generational Distance (IGD): measures how far the individuals in the Pareto-optimal set are from those in the set of non-dominated individuals found [13].
Spread: measures the extent of spread achieved among the obtained individuals [13].
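As a concrete illustration of the distance-based metrics, the sketch below computes Generational Distance in its commonly used form; IGD is obtained by swapping the roles of the two sets. This is our own sketch, not jMetal's implementation, and normalization details vary between implementations.

```java
/** Illustrative computation of Generational Distance (GD): a measure of how
 *  far the found non-dominated points lie from the true Pareto front. */
public final class GenerationalDistance {

    /** found: objective vectors of the non-dominated set found;
     *  front: objective vectors sampled from the true Pareto-optimal front. */
    public static double gd(double[][] found, double[][] front) {
        double sum = 0.0;
        for (double[] p : found) {
            double nearest = Double.POSITIVE_INFINITY;
            for (double[] q : front) {
                nearest = Math.min(nearest, squaredDistance(p, q));
            }
            sum += nearest; // squared distance to the closest front point
        }
        // GD = (sum of squared nearest distances)^(1/2) / n
        return Math.sqrt(sum) / found.length;
    }

    private static double squaredDistance(double[] p, double[] q) {
        double s = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            s += d * d;
        }
        return s;
    }
}
```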

2.2.2 Parallel Island Model

The Island Model is a popular and efficient way to implement a genetic algorithm on both serial and parallel machines. In a parallel implementation of an Island Model, each machine executes a genetic algorithm and maintains its own sub-population for search. The machines work in concert by periodically exchanging a portion of their populations in a process called migration. The Island Model introduces two parameters: the migration interval, the number of generations (or evaluations) between migrations, and the migration size, the number of individuals in the population to migrate [14].

The Parallel Island Model has often been reported to display better search performance than serial single-population models, both in terms of the quality of the solution found and in terms of effort, measured as the total number of evaluations of points sampled in the search space [14]. One reason for the improved search quality is that the various islands maintain some degree of independence and thus explore different regions of the search space while at the same time sharing information by means of migration. This can also be seen as a means of sustaining genetic diversity [14].

The Parallel Island Model has been used to implement NSGA-II [11]. The total population is composed of four sub-populations, each with 50 individuals. Copies of the individuals that make up the fittest 10% of a sub-population are allowed to migrate, and the sub-population that receives these individuals deletes the least fit 10% of its own population. The migration scheme assumes that the sub-populations are arranged in a ring. On the first migration, copies of the individuals move from their current sub-population to its immediate neighbor to the left. Migrations occur between all sub-populations simultaneously. On the second migration, the sub-populations send copies of the individuals to the sub-population two moves to the left in the ring. In general, the migration destination address is incremented by 1 and moves around the ring. The process is repeated until each sub-population has sent one set of copies of individuals to every other sub-population (not including itself) [14]. A code sketch of this ring scheme is given after the list below.

The experiment is also conducted to determine whether the following changes to the Parallel Island Model NSGA-II help to produce a better outcome in less time.

Migration Interval: for each problem, a comparison is performed between the Parallel Island Model NSGA-II with a migration interval of 1, 5, 10 and 20.
Degree of Independence: for each problem, a comparison is performed between the Parallel Island Model NSGA-II with zero degree of independence among sub-populations (each explores the whole search space) and with some degree of independence among sub-populations (each explores a different region of the search space).
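The following is an illustrative sketch of the ring migration scheme just described (our own code, not the paper's): on migration step k, island i sends copies of its fittest 10% to island (i + k) mod n, and the receiving island first drops its least fit 10%. The direction of the ring is an arbitrary choice here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of ring migration between n islands. Each inner list is a
 *  sub-population assumed to be sorted fittest-first (ranking is done
 *  elsewhere, e.g. by NSGA-II's non-dominated sorting). */
public final class RingMigration {

    /** step: the migration step number, 1 .. n-1, so that every island
     *  eventually sends individuals to every other island once. */
    public static void migrate(List<List<double[]>> islands, int step) {
        int n = islands.size();
        int migrants = islands.get(0).size() / 10; // fittest 10% (5 of 50)

        // Copy the emigrants first so all migrations happen simultaneously.
        List<List<double[]>> outbound = new ArrayList<List<double[]>>();
        for (List<double[]> island : islands) {
            outbound.add(new ArrayList<double[]>(island.subList(0, migrants)));
        }

        for (int i = 0; i < n; i++) {
            int dest = (i + step) % n; // destination moves around the ring
            List<double[]> target = islands.get(dest);
            // The receiving island deletes its least fit 10% ...
            for (int j = 0; j < migrants; j++) {
                target.remove(target.size() - 1);
            }
            // ... and receives copies of the sender's fittest 10%.
            for (double[] individual : outbound.get(i)) {
                target.add(individual.clone());
            }
        }
    }
}
```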

2.2.3 MapReduce

MapReduce is a programming model in which the user expresses the computation as two functions: Map and Reduce. The computation takes a set of input key/value pairs and produces a set of output key/value pairs [15]. Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function, such as hash(key) mod R. The number of partitions (R) and the partitioning function can be specified by the user. After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with filenames as specified by the user) [15].

The Parallel Island Model NSGA-II has been adapted to the MapReduce framework in order to run on the Hadoop cluster. The Map function performs the selection, crossover and mutation operations, and ranks the individuals within the sub-population. The Reduce function takes care of migrations between the sub-populations. Figure 4 illustrates the adaptation of the Parallel Island Model NSGA-II within MapReduce.

Figure 4: Adaptation of Parallel Island Model NSGA-II within MapReduce
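The paper does not list its MapReduce code; the skeleton below is our sketch of the adaptation just described, using the Hadoop 0.19-era API documented in [7]. The evolveAndRank helper is a hypothetical stand-in for the jMetal-based NSGA-II logic, and the single constant intermediate key is one simple way to let one reducer see all islands so it can perform the migrations.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** One migration interval of the island model as one MapReduce job: each
 *  map task evolves one serialized sub-population; the reduce task then
 *  performs the migrations between the sub-populations. */
public class IslandModelJob {

    public static class IslandMapper extends MapReduceBase
            implements Mapper<IntWritable, Text, IntWritable, Text> {
        public void map(IntWritable islandId, Text subPopulation,
                        OutputCollector<IntWritable, Text> output,
                        Reporter reporter) throws IOException {
            // Selection, crossover, mutation and ranking of the individuals
            // within this island's sub-population (placeholder below).
            Text evolved = evolveAndRank(subPopulation);
            // A single constant key routes every island to one reducer.
            output.collect(new IntWritable(0), evolved);
        }
    }

    public static class MigrationReducer extends MapReduceBase
            implements Reducer<IntWritable, Text, IntWritable, Text> {
        public void reduce(IntWritable key, Iterator<Text> islands,
                           OutputCollector<IntWritable, Text> output,
                           Reporter reporter) throws IOException {
            int islandId = 0;
            while (islands.hasNext()) {
                // Placeholder: apply the ring migration scheme of Section
                // 2.2.2 across the collected islands before emitting them.
                output.collect(new IntWritable(islandId++),
                               new Text(islands.next())); // copy: Hadoop reuses objects
            }
        }
    }

    /** Placeholder for the jMetal-based NSGA-II operators (hypothetical). */
    static Text evolveAndRank(Text subPopulation) {
        return subPopulation;
    }
}
```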

2.3 EXPERIMENT

The experiment is conducted in the following two phases. All reported results are averages over five replications.

2.3.1 Degree of Independence

This phase observes how the degree of independence affects the execution of the Parallel Island Model NSGA-II on the Hadoop cluster and the outcome it produces. For each problem, the Parallel Island Model NSGA-II is executed on two types of sub-populations: sub-populations with zero degree of independence and sub-populations with some degree of independence. For each type, the Parallel Island Model NSGA-II is executed with a migration interval of 1, 5, 10 and 20. Each run of the Parallel Island Model NSGA-II consists of 25,000 evaluations. In the Parallel Island Model NSGA-II with some degree of independence among sub-populations, the whole search space is divided into 16 squares; the four squares that form a reverse diagonal line across the search space are explored.
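The paper does not specify how the 16 squares are encoded. Purely as an illustration, and assuming the partition is taken over two of the decision variables, one way to assign each island a square on the reverse diagonal of a 4 x 4 grid is sketched below; all names and bounds are our assumptions.

```java
/** Illustrative sketch: partition a two-variable search space [lo, hi]^2
 *  into a 4x4 grid and give each of the four islands one square on the
 *  reverse diagonal, i.e. squares (0,3), (1,2), (2,1), (3,0). */
public final class DiagonalPartition {

    /** Returns {xMin, xMax, yMin, yMax} for the given island (0..3). */
    public static double[] squareFor(int island, double lo, double hi) {
        double side = (hi - lo) / 4.0;
        int col = island;     // islands walk across the columns
        int row = 3 - island; // reverse diagonal: row = 3 - col
        return new double[] {
            lo + col * side, lo + (col + 1) * side,
            lo + row * side, lo + (row + 1) * side
        };
    }
}
```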


2.3.2 Sequential Versus Parallel

This phase determines whether the effort of making the Parallel Island Model NSGA-II run on the Hadoop cluster is worthwhile. To obtain a benchmark for each problem, the sequential NSGA-II, consisting of 100,000 evaluations, is executed on a single computer. For each problem, the Parallel Island Model NSGA-II is executed with a migration interval of 1, 5, 10 and 20. Each run of the Parallel Island Model NSGA-II consists of 100,000 evaluations. As the time taken to evaluate an individual in jMetal is almost negligible, the evaluation of an individual has been made artificially expensive by putting the thread to sleep for 30 seconds.
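As a minimal illustration of this artificially expensive evaluation, the wrapper might look like the following sketch (the names are ours, not the paper's): the real objective computation is near-instant, so a 30-second sleep stands in for a costly simulation run.

```java
/** Sketch of the artificially expensive evaluation described above. */
public final class ExpensiveEvaluation {

    public static double[] evaluate(double[] decisionVariables) {
        try {
            Thread.sleep(30000); // 30-second stand-in for an expensive simulation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return computeObjectives(decisionVariables); // the cheap real part
    }

    // Placeholder for the actual ZDT/Kursawe objective functions.
    private static double[] computeObjectives(double[] x) {
        return new double[] { 0.0, 0.0 };
    }
}
```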

2.4 EXPERIMENTAL RESULTS

2.4.1 Degree of Independence

The Parallel Island Model NSGA-II with some degree of independence among sub-populations obtains fewer Pareto-optimal individuals than the Parallel Island Model NSGA-II with zero degree of independence among sub-populations. This is evident in the solutions for all problems (clearest for ZDT-3; see Table 2) and is due to fewer sub-populations exploring fewer regions of the search space [14].

Table 2: Pareto Front Size of Solutions for ZDT-3

  Parallel Island Model NSGA-II with zero degree of independence among sub-populations:
    migration interval 1: 149; interval 5: 144; interval 10: 146; interval 20: 141
  Parallel Island Model NSGA-II with some degree of independence among sub-populations:
    migration interval 1: 126; interval 5: 109; interval 10: 116; interval 20: 112

On the other metrics, similar results are obtained for the Parallel Island Model NSGA-II with both types of sub-populations. Thus, the Parallel Island Model NSGA-II with zero degree of independence among sub-populations is used in the next phase of the experiment.

2.4.2 Sequential Versus Parallel

The Parallel Island Model NSGA-II obtains fewer Pareto-optimal individuals than the sequential NSGA-II. Due to migrations between sub-populations, more duplicated individuals are found in the solutions produced by the Parallel Island Model NSGA-II. Although the results on the HV are similar, the results on the GD, IGD and Spread are worse than those of the sequential NSGA-II. Tables 3 and 4 show the results for ZDT-4.

Table 3: Results of Solutions for ZDT-4 (i)
(columns: sequential NSGA-II, then Parallel Island Model NSGA-II with migration interval 1, 5, 10, 20)

  Pareto Front Size: 200 | 163 | 180 | 185 | 185
  Hyper Volume:      6.63E-01 | 6.61E-01 | 6.62E-01 | 6.62E-01 | 6.62E-01
  Spread:            3.99E-01 | 6.82E-01 | 6.81E-01 | 6.69E-01 | 6.71E-01

Table 4: Results of Solutions for ZDT-4 (ii)
(columns: sequential NSGA-II, then Parallel Island Model NSGA-II with migration interval 1, 5, 10, 20)

  Generational Distance:          1.10E-04 | 9.66E-05 | 7.28E-05 | 7.28E-05 | 6.40E-05
  Inverted Generational Distance: 9.34E-05 | 1.61E-04 | 1.44E-04 | 1.37E-04 | 1.43E-04

Figure 5 shows that the Parallel Island Model NSGA-II takes around one-fifth of the time taken by the sequential NSGA-II. The effort of implementing the Parallel Island Model NSGA-II to run on a Hadoop cluster is therefore worthwhile when the evaluation of an individual is expensive.

Figure 5: Times Taken (msec) in Optimizing ZDT-4

3 CONCLUSION

Cloud Computing is broadly similar to Grid Computing; the difference is that a single user at a given point in time gets only a small portion of the cloud. We therefore believe that whatever has been done for Data Farming with Grid Computing will be applicable when incorporating Cloud Computing into Data Farming. The experiment in Section 2 demonstrates that, just like Grid Computing, Cloud Computing can perform Data Farming in less time than running Data Farming on a single computer, provided the evaluation of an individual is expensive. In another experiment not shown in this report, a famous strategic game, known as the Iterated Prisoner's Dilemma, is created as a simulation model in the Automated Co-Evolution (ACE) Framework. The ACE Framework is able to run simulations successfully on the Hadoop cluster via a SOAP Web Service, which is a useful component for running simulations in the cloud. Given the abundance of computing power that can be found in the cloud, Data Farming can be performed at a lower cost, on a larger scale and in less time. Figure 6 shows a configuration that can be used to run large-scale Data Farming in the cloud.

Figure 6: Large-Scale Data Farming in the Cloud

In the initial phase, there are many individuals per island. Each island undergoes a sequence of MapReduce steps. In each MapReduce step, the map invocation executes another (inner) sequence of MapReduce steps and the reduce invocation performs migrations between the islands. In the inner sequence of MapReduce steps, each map invocation evaluates an individual and each reduce invocation performs the selection, crossover and mutation operations for the individuals in an island. In the termination phase, all the individuals represent the outcome of Data Farming. The Hadoop platform can be enhanced with HBase and memcached to improve the performance of Data Farming: HBase provides the Hadoop platform with a scalable, distributed database to manage the data produced by Data Farming, while memcached helps to speed Data Farming up.
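The paper gives no code for this configuration. As a rough sketch under our own assumptions, the outer sequence of MapReduce steps could be driven by a loop that submits one Hadoop job per step, chaining each job's output into the next job's input; the mapper and reducer classes refer back to the skeleton in Section 2.2.3.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

/** Hypothetical driver for the outer sequence of MapReduce steps in
 *  Figure 6: each iteration runs one Hadoop job whose output (the
 *  migrated islands) becomes the input of the next step. */
public class DataFarmingDriver {
    public static void main(String[] args) throws Exception {
        Path current = new Path(args[0]);       // initial islands in HDFS
        int steps = Integer.parseInt(args[1]);  // number of outer steps

        for (int step = 1; step <= steps; step++) {
            JobConf conf = new JobConf(DataFarmingDriver.class);
            conf.setJobName("data-farming-step-" + step);

            // Mapper/Reducer from the Section 2.2.3 sketch; the expensive
            // per-individual evaluations run inside the map tasks.
            conf.setMapperClass(IslandModelJob.IslandMapper.class);
            conf.setReducerClass(IslandModelJob.MigrationReducer.class);
            conf.setOutputKeyClass(IntWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.setInputFormat(SequenceFileInputFormat.class);
            conf.setOutputFormat(SequenceFileOutputFormat.class);

            Path next = new Path(args[0] + "-step-" + step);
            FileInputFormat.setInputPaths(conf, current);
            FileOutputFormat.setOutputPath(conf, next);

            JobClient.runJob(conf); // blocks until this step completes
            current = next;         // chain: this output feeds the next step
        }
    }
}
```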


ACKNOWLEDGMENT
We would like to express our thanks to Mr Spencer Low Kin Ming and Mr Koh Tech Chuan from DSO National Laboratories for their valuable comments and discussions. We also wish to acknowledge the funding support for this project from Nanyang Technological University under the Undergraduate Research Experience on Campus (URECA) programme.


REFERENCES
[1] Gary E. Horne and Ted E. Meyer, "Data Farming: Discovering Surprise," presented at the Winter Simulation Conference (WSC) '04, Washington, DC, USA, 2004.
[2] Alfred G. Brandstein and Gary E. Horne, "Data Farming: A Meta-Technique for Research in the 21st Century," presented at the Genetic and Evolutionary Computation Conference (GECCO) '07, London, England, UK, 2007.
[3] Wikipedia, The Free Encyclopedia, "Cloud Computing," (December 13, 2008), [accessed December 13, 2008]. Available at: http://en.wikipedia.org/wiki/Cloud_computing
[4] Dion Hinchcliffe, "Eight Ways that Cloud Computing Will Change Business," (June 5, 2009), [accessed June 17, 2009]. Available at: http://blogs.zdnet.com/Hinchcliffe/?p=488
[5] Wikipedia, The Free Encyclopedia, "Data Farming," (December 30, 2007), [accessed June 17, 2009]. Available at: http://en.wikipedia.org/wiki/Data_farming
[6] The Apache Software Foundation, "Hadoop Core," (April 12, 2008), [accessed December 13, 2008]. Available at: http://hadoop.apache.org/core/
[7] The Apache Software Foundation, "Hadoop Map/Reduce Tutorial," (November 25, 2008), [accessed December 13, 2008]. Available at: http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html
[8] The Apache Software Foundation, "HDFS Architecture," (November 25, 2008), [accessed December 13, 2008]. Available at: http://hadoop.apache.org/core/docs/r0.19.1/hdfs_design.html
[9] Hartmut Pohlheim, "Evolutionary Algorithms," (1994-2006), [accessed December 10, 2008]. Available at: http://www.geatbx.com/docu/algindex.html
[10] Eckart Zitzler, Marco Laumanns, and Stefan Bleuler, "A Tutorial on Evolutionary Multiobjective Optimization," in Proceedings of the Workshop on Multiple Objective Metaheuristics, 2004, pp. 3-38.
[11] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan, "A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182-197, April 2002.
[12] NEO: Networking and Emerging Optimization, "jMetal," (April 12, 2008), [accessed December 13, 2008]. Available at: http://mallba10.lcc.uma.es/wiki/index.php/JMetal
[13] NEO: Networking and Emerging Optimization, "Tools," (October 9, 2007), [accessed June 17, 2009]. Available at: http://mallba10.lcc.uma.es/wiki/index.php/Tools
[14] Darrell Whitley, Soraya Rana, and Robert B. Heckendorn, "The Island Model Genetic Algorithm: On Separability, Population Size and Convergence," Journal of Computing and Information Technology, vol. 7, pp. 33-47, 1998.
[15] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," presented at the Sixth Symposium on Operating System Design and Implementation (OSDI), San Francisco, CA, USA, December 2004.
