
NANYANG TECHNOLOGICAL UNIVERSITY

CLOUD COMPUTING:

APPLICATION ON DATA FARMING

Yong Yong Cheng

School of Computer Engineering

2010

NANYANG TECHNOLOGICAL UNIVERSITY

SCE09-0445

CLOUD COMPUTING:

APPLICATION ON DATA FARMING

Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Engineering (Computer Science) of the Nanyang Technological University

By

Yong Yong Cheng

School of Computer Engineering

2010


Abstract

Objective-Based Data Farming requires a massive amount of computing power to run thousands or millions of simulations. Traditionally, acquiring this computing power meant owning the infrastructure: either a cluster or grid of many cheap computers, or an expensive supercomputer. Either way, satisfying this ever-increasing need amounts to an exorbitant sum of money over time.

With the introduction of Amazon Elastic Compute Cloud (Amazon EC2) and the MapReduce programming model, owning the infrastructure is no longer a prerequisite for access to massive computing power. The term “Cloud Computing” has been introduced and grows more popular with each passing day. Cloud Computing makes a massive amount of computing power available as a utility at a low cost. It also offers other benefits, such as easy real-time scalability, high availability and fault tolerance.

This project implements a robust private Cloud to address the security concerns of running military applications. It also implements a public Cloud to demonstrate the feasibility of using a public infrastructure.

In this project, a Web Service and 6 MapReduce applications, which are used to distribute workloads within a Cloud, are designed and implemented. They allow conventional Objective-Based Data Farming frameworks to take full advantage of Cloud Computing.


Acknowledgements

I would like to express my thanks to Asst. Prof. Malcolm Low Yoke Hean, Dr. James Decraene and Mr. Zeng Fanchao. Their guidance provided me with insight into various ideas and concepts that are useful to this project.


Table of Contents

Abstract  3
Acknowledgements  4
Table of Contents  5
List of Tables  8
List of Figures  9
Chapter 1: Introduction  11
  1.1 Objectives  12
  1.2 Background  12
    1.2.1 Data Farming  12
    1.2.2 Computing Environments  13
    1.2.3 Objective-Based Data Farming  16
  1.3 Scope  17
  1.4 Report Organization  17
Chapter 2: Concepts & Frameworks  19
  2.1 Apache Hadoop  19
    2.1.1 Hadoop Distributed File System  20
    2.1.2 MapReduce  21
    2.1.3 Condor  22
  2.2 Web Service  24
  2.3 Complex Adaptive System Evolver  24
    2.3.1 Map Aware Non-Uniform Automata  26
    2.3.2 Evolutionary Algorithm  26
    2.3.3 Island Model  27
Chapter 3: Implementing Cloud Infrastructure & Applications  29
  3.1 Hadoop Cluster @NTU  29
    3.1.1 Development Cluster  31
    3.1.2 Auto-Updating & Reporting Tool  32
  3.2 Hadoop Cluster @EC2  33
  3.3 Hadoop Service  34
    3.3.1 Web Service Clients  38
Chapter 4: Preliminary Work  42
  4.1 Automated Red Teaming Framework  42
  4.2 MapReduce MANA  45
    4.2.1 Problems & Solutions  46
    4.2.2 Demonstrating Apache Hadoop-Compliant ART  46
      4.2.2.1 Results & Analysis  47
Chapter 5: Implementing Apache Hadoop-Compliant CASE  48
  5.1 Apache Hadoop-Compliant CASE  48
  5.2 Replication Model  50
    5.2.1 Testing Cluster Robustness  51
      5.2.1.1 Results & Analysis  51
  5.3 Standard Model  52
    5.3.1 Demonstrating Scalability  53
      5.3.1.1 Results & Analysis  54
    5.3.2 Evaluating Hadoop Cluster @EC2  58
      5.3.2.1 Results & Analysis  59
  5.4 Island MapReduce  60
    5.4.1 Island MapReduce 1  60
    5.4.2 Island MapReduce 2  61
    5.4.3 Evaluating Island MapReduce  62
      5.4.3.1 Results & Analysis  63
Chapter 6: Conclusion  66
  6.1 Summary  66
  6.2 Limitations  67
  6.3 Future Enhancements  68
References  69
Appendix  72
  Appendix A: List of Nodes in Hadoop Cluster @NTU  72
  Appendix B: Past Problems in MRMANA & Solutions  74
  Appendix C: Evolving Agent-Based Simulations in the Clouds  81

List of Tables

Table 1: Pros & Cons of Computing Environments for Data Farming
Table 2: Comparison between Condor & Apache Hadoop
Table 3: 32-Bit Instance Types on Amazon Elastic Compute Cloud (Amazon EC2)
Table 4: Web API (Application Programming Interface) of Hadoop Service
Table 5: Solutions to Problems in MRMANA
Table 6: Times Taken By MRMANA & MOMANA in Executing Simple & Complex Scenarios
Table 7: 5 Entry Points in Apache Hadoop-Compliant CASE
Table 8: Execution Times Demonstrating Robustness
Table 9: Execution Times Using 5 & K Excursions in a Map Task
Table 10: Execution Times Demonstrating Scalability
Table 11: Benchmarks for Virtual Machine, Small & Medium Instances
Table 12: Execution Times on Hadoop Cluster @EC2 & Hadoop Cluster @NTU
Table 13: Differences between Configuration 1, 2, 3 & 4 for Island MapReduce
Table 14: Execution Times for Island MapReduce with Configuration 1, 2, 3 & 4

List of Figures

Figure 1: Search Popularity on Google Since 2007
Figure 2: Data Farming Iterative Process
Figure 3: Cloud Computing Stack
Figure 4: Component Stacks of Apache Hadoop & Google MapReduce Framework
Figure 5: MapReduce Execution
Figure 6: Complex Adaptive System Evolver (CASE)
Figure 7: General Structure of an Evolutionary Algorithm (EA)
Figure 8: Differences between Single Population Model & Island Model
Figure 9: 37-Nodes Production Cluster
Figure 10: Screen Capture of the Installer
Figure 11: 4-Nodes Development Cluster
Figure 12: Steps in Getting an Update to Be Applied on Each Slave Node
Figure 13: Hadoop Cluster @EC2
Figure 14: Steps for Executing a MapReduce Job on Hadoop Cluster @NTU
Figure 15: Strategy Formulated To Solve a Problem in Log Recording
Figure 16: Solution to Solve Performance Degradation Due To Increasing Number of Queries
Figure 17: Screen Capture of Hadoop Service
Figure 18: Package Diagram for Hadoop Service & Java-Based Web Service Client
Figure 19: Screen Capture of Java-Based Web Service Client
Figure 20: Screen Capture of Web-Based Web Service Client
Figure 21: Screen Capture of CASE GUI
Figure 22: Cloud Computing Architecture
Figure 23: Architecture of ART Framework
Figure 24: Screen Capture of ART Framework
Figure 25: Steps for ART Framework to Submit & Run a MapReduce Job
Figure 26: Application Workflows of MRMANA & MOMANA
Figure 27: Differences between Apache Hadoop-Compliant ART Framework & CASE
Figure 28: 5 Entry Points in Apache Hadoop-Compliant CASE
Figure 29: Application Workflow of Replication Model
Figure 30: Application Workflow of Standard Model
Figure 31: Execution Times Using 5 & K Excursions in a Map Task
Figure 32: Execution Times Demonstrating Scalability Using Simple Scenario
Figure 33: Execution Times Demonstrating Scalability Using Complex Scenario
Figure 34: Comparison of Solutions between Standard Model & CASE
Figure 35: Application Workflow of Island MapReduce 1
Figure 36: Application Workflow of Island MapReduce 2
Figure 37: Comparison of Solutions between Island MapReduce 1 & Island MapReduce 2
Figure 38: Comparison of Solutions between Configuration 2, 3 & 4 for Island MapReduce 1

Chapter 1: Introduction

Merrill Lynch, one of the world’s leading financial management and advisory companies, issued a research note titled “The Cloud Wars: $100+ Billion at Stake” on 7th May 2008 [1]. Its analysts estimated that by 2011 the Cloud Computing market opportunity would amount to $160 billion.

Today, Cloud Computing is rapidly gaining widespread acceptance with both the public and industry. Figure 1 shows the search popularity on Google, a popular search engine, since 2007 for three terms: Cloud Computing, Grid Computing and Cluster Computing. From the figure, it can be observed that the popularity of Cloud Computing is on the rise.

Figure 1: Search Popularity on Google Since 2007

Many large corporations, such as Activision, British Telecom (BT) and Yahoo, have jumped on the Cloud Computing bandwagon and obtained the benefits it promises, including reduced cost, scalability, high availability and fault tolerance. Gartner, a well-known information technology research and advisory company, has even identified Cloud Computing as one of the top 10 strategic technologies for 2011 [2].

1.1 Objectives

This project aims to:

Explore the paradigm of Cloud Computing through Apache Hadoop, a popular Cloud Computing software framework; and

Incorporate Cloud Computing into Data Farming so that the latter can be carried out at a lower cost, on a larger scale and in a faster time.

1.2 Background

1.2.1 Data Farming

Data Farming [3] [4] [5] is a technique that executes simulation models thousands or millions of times to reveal the complexities of a problem landscape. It combines a set of enabling technologies and processes into a single integrated task that automates this scientific method. This set of technologies and processes includes distributed and high-performance computing, agent-based simulations and rapid model development, knowledge discovery methods, high-dimensional data visualization techniques, design-of-experiments methods, human-computer interfaces, teamwork and collaborative environments, and heuristic search techniques. Data Farming is not intended to predict an outcome; it is employed instead to aid intuition and to gain insight into a problem scenario.

Data Farming is a collaborative and iterative process. The steps, as shown in Figure 2, are essential to the process and may be repeated until sufficient insights into a problem are gained.


Figure 2: Data Farming Iterative Process

The obtained results may be incorporated into other modeling and operational analysis activities, while the insight gained may be used to provide input to deterministic models or equations, or build more realistic simulations and models.

1.2.2 Computing Environments

During Data Farming, each simulation model has to be executed thousands or millions of times either for model testing or for parameter space exploration. These executions are performed using a single computer, a cluster of computers (also known as Cluster Computing [6]) or a grid of computers (also known as Grid Computing [7]). The pros and cons of each type of computing environment are illustrated in Table 1.

Table 1: Pros & Cons of Computing Environments for Data Farming

A Single Computer
Pros:
- Simple to deploy simulation models.
Cons:
- Time-consuming to obtain a large amount of results.
- May not be able to handle large and complex computing tasks.

A Cluster of Computers (Cluster Computing)
Pros:
- Reduces the time required to obtain a large amount of results.
- Able to handle large and complex computing tasks.
- Low network latency, as homogeneous computers are geographically located near to each other and linked together within a dedicated network.
Cons:
- High cost involved in managing, maintaining and upgrading the computers.
- Limited size, as computers have to be located within the same organization.
- Complex to deploy simulation models, since executions have to be split across multiple computers.

A Grid of Computers (Grid Computing)
Pros:
- Reduces the time required to obtain a large amount of results.
- Able to handle large and complex computing tasks.
- Unlimited size, as computers can be geographically dispersed and located across multiple organizations.
Cons:
- High cost involved in managing, maintaining and upgrading the computers.
- Expensive, as computers may be left unused most of the time if computing tasks are not large and complex enough to utilize them.
- Has to deal with more management issues when computers are managed by unrelated organizations.
- More complex to deploy simulation models, since executions also have to be performed on heterogeneous computers.

Cloud Computing [8] [9] [10] represents a technology advancement in which Grid Computing is made more user-friendly and attractive. One of its most outstanding characteristics is the ability to provision resources on demand, which removes the need to over-provision in order to meet the demands of large and complex computing tasks. These resources include computing power, storage space and network bandwidth.

Cloud Computing is a computing concept in which resources in distributed computing systems are provided as a service, allowing users to consume them via the Internet and on a utility computing basis. The users do not require any knowledge of, expertise with, or control over the technology infrastructure that supplies them with the resources.


A typical Cloud Computing architecture is composed of six layers, as shown in Figure 3.


Figure 3: Cloud Computing Stack [11]

Clients – Computer hardware and/or software that are solely designed to deliver Cloud services and are essentially useless without them.

Application (also known as Software-as-a-Service, SaaS) – Eliminates the need to install and run applications on the users’ computers, mitigating the burden of maintenance, upgrades and support. An example is Facebook.

Platform (also known as Platform-as-a-Service, PaaS) – Provides a platform as a service, consuming Cloud infrastructure and supporting Cloud applications. It aids in the development and deployment of Cloud applications, and eliminates the cost and complexity in buying and managing the underlying infrastructure. An example is Apache Hadoop.

Infrastructure (also known as Infrastructure-as-a-Service, IaaS) – Offers a computing infrastructure as a service. It allows the users to purchase resources, such as computing power and storage space, on a utility computing basis. Two such examples are High Performance Computing Centre (HPCC) in Nanyang Technological University (NTU) and Amazon Elastic Compute Cloud (Amazon EC2).

Servers – Computer hardware and/or software that are used to support the delivery of Cloud services.

The benefits of Cloud Computing are:


Cost – Users benefit from lower up-front capital cost, as they do not need to purchase their own infrastructure and instead pay only for the resources that they use;

Scalability – Users do not need to over-provision their resources to meet the peak demands, as resources are provisioned dynamically on a fine-grained and self-service basis near real-time; and

Mobility – Users are able to access the resources via the Internet regardless of their locations and devices.

The drawbacks of Cloud Computing include privacy and security concerns about handing confidential data over to third-party providers, and the inability of users to do anything when the third-party providers suffer outages. However, it is in the best interest of the third-party providers to employ the most sophisticated high-availability and security strategies available for the users’ data, and these strategies are likely to be far more stringent than any company’s in-house policies.

1.2.3 Objective-Based Data Farming

Data Farming explores the entire parameter space. The size of this parameter space is proportional to the complexity of the problem landscape. As the problem landscape becomes more complex, the parameter space gets even larger, and thus the number of times each simulation model must be executed, as well as the time involved in doing so, increases tremendously.

Objective-Based Data Farming uses Evolutionary Algorithms (EAs) and objectives to direct a search within the parameter space. This search reduces the number of times each simulation model has to be executed, as points in the parameter space that prove worthwhile based on the objectives are used to discover other points that may be worth evaluating. Each point in the parameter space corresponds to one execution of a simulation model.

However, there is a limit to the reduction the search can achieve. A large reduction may hide the complexities of the problem landscape and thus hinder gaining insight into a problem scenario.

Objective-Based Data Farming still requires a massive amount of resources to execute a simulation model thousands or millions of times across a large parameter and value space. By adopting Cloud Computing, a huge amount of resources can be purchased inexpensively and only when needed, so that Data Farming can be performed at a lower cost, on a larger scale and in a faster time. Simultaneously, the benefits of Cluster Computing and Grid Computing are preserved while their drawbacks are eliminated.

1.3 Scope

Many Cloud Computing offerings and platforms are available in the market. Due to time constraints, it is impossible to evaluate all of them in this project. Hence, this project explores the paradigm of Cloud Computing through Apache Hadoop, a popular Cloud Computing software platform inspired by the Google MapReduce framework. The development of Apache Hadoop was initiated and is led by Yahoo, and it has spawned many startups. One of these startups is Cloudera, which has funding of $36 million [12].

Besides exploring a private Cloud implemented using Apache Hadoop, this project also performs the exploration on a public Cloud. Amazon Elastic Compute Cloud (Amazon EC2) is chosen as the public Cloud, as it supports the Windows operating system. It has been in operation since 25th August 2006.

Due to time constraints, two Objective-Based Data Farming frameworks are chosen for incorporation with Cloud Computing: the Automated Red Teaming (ART) Framework and the Complex Adaptive System Evolver (CASE).

1.4 Report Organization

This report is organized as follows:

Chapter 2 illustrates the concepts and the software frameworks that are employed in this project;

Chapter 3 describes the implementation of the Cloud infrastructure and applications;

Chapter 4 describes the preliminary work of incorporating Cloud Computing into an Objective-Based Data Farming framework;


Chapter 5 describes Apache Hadoop-compliant CASE, and presents the experiments that had been performed and their obtained results; and

Chapter 6 summarizes this project, and describes its limitations and possible enhancements in the near future.


Chapter 2: Concepts & Frameworks

This chapter describes the concepts and the software frameworks employed in this project. The software frameworks are Apache Hadoop and the Complex Adaptive System Evolver (CASE). The MapReduce programming model and the concepts of Web Services, Map Aware Non-Uniform Automata (MANA), Evolutionary Algorithms and the Island Model are also reviewed in this chapter.

2.1 Apache Hadoop

Apache Hadoop [13] is an open-source Java software framework for developing and deploying applications that run on large clusters built of commodity computers. A popular software platform used to realize Cloud Computing, it is largely inspired by the Google MapReduce framework, which is implemented in C++ and processes more than 20,000 terabytes of data across Google’s massive computing clusters per day.

The framework is able to transparently furnish applications with scalable and reliable distributed computing capabilities with the implementation of two core components: Hadoop Distributed File System (HDFS) and MapReduce. Figure 4 shows the component stacks of the Apache Hadoop and Google MapReduce framework. Each component in Apache Hadoop has its equivalent component in Google MapReduce framework.



Figure 4: Component Stacks of Apache Hadoop & Google MapReduce Framework

The advantages of using Apache Hadoop are:

Common tasks, such as scheduling, input partitioning, failover, replication and sorting of intermediate results, in distributed computing systems are automatically taken care of by the framework;

Massive computing clusters have become increasingly easier to utilize because of the simplified MapReduce programming model; and

The simplified MapReduce programming model also allows the users to concentrate on designing the workflows of their applications.

The design of Apache Hadoop assumes that it is much more efficient to move the computation closer to where the required data is located than to move the data to the computation. This is especially true when the data is huge. This design decision aims to minimize network congestion and increase the overall throughput of the computing system.

2.1.1 Hadoop Distributed File System

Hadoop Distributed File System (HDFS) [14] is a distributed file system designed to be deployed on low-cost commodity computers. It is highly fault-tolerant, provides high-throughput access to application data and is especially suitable for applications that handle large data-sets.


Although HDFS allows the users to view and store their data as files, each file is actually split into one or more blocks internally. These blocks are replicated and then stored among multiple computers within the computing system.

A typical HDFS cluster consists of a single NameNode and one or more DataNodes. The NameNode manages the namespace of the file system and executes only namespace operations, such as renaming files and directories. It also regulates client access to files and determines the mapping of blocks to DataNodes. The DataNodes serve read/write requests from clients and perform block creation, deletion and replication upon receiving instructions from the NameNode.

The NameNode is a single point of failure for a typical HDFS cluster. If it fails, manual intervention is required to start the namespace recovery. Work is in progress to start this namespace recovery automatically for a failed NameNode [15].
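The splitting-and-replication behaviour described above can be sketched in a few lines. This is a toy illustration, not HDFS code: the block size, replication factor and DataNode names below are made up for the example (HDFS itself used a 64 MB default block size and a replication factor of three).

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Block size is in bytes here purely for illustration; real HDFS
# used a 64 MB default block size and a replication factor of 3.

BLOCK_SIZE = 64
REPLICATION = 3

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks (the NameNode's view)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

data = b"x" * 200                      # a 200-byte "file"
blocks = split_into_blocks(data)       # four blocks: 64 + 64 + 64 + 8 bytes
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(len(blocks), nodes)
```

The round-robin placement is a deliberate simplification; the real NameNode also takes rack topology and DataNode load into account when choosing replica locations.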

2.1.2 MapReduce

MapReduce [16] was introduced to the world by Google in 2004. It is a programming model and an associated implementation to generate and process large data-sets in a scalable, reliable and fault-tolerant manner.

In the MapReduce programming model, a computation is expressed as two functions: Map and Reduce. The computation consumes a set of input key-value pairs and produces a set of output key-value pairs, conceivably of different types. The Map function processes a key-value pair to generate a set of intermediate key-value pairs, while the Reduce function merges and processes all intermediate values associated with the same intermediate key to produce one or more output key-value pairs.

A MapReduce implementation distributes both Map and Reduce invocations across multiple computers within the computing system. It automatically partitions the input data into a set of input splits, which are then processed in parallel by Map invocations on different computers. Using a partitioning function, the intermediate key space is partitioned into R pieces, and each Reduce invocation processes one or more pieces. After successful completion, the output of the MapReduce execution, as shown in Figure 5, is available in R output files.

A typical MapReduce cluster consists of a single JobTracker and one or more TaskTrackers. The JobTracker schedules Map/Reduce invocations to be executed across multiple nodes, monitors them and re-schedules those failed invocations for execution. The TaskTracker executes the invocations as instructed by the JobTracker.

In this project, a MapReduce execution and a Map/Reduce invocation are addressed as a MapReduce job and a Map/Reduce task respectively.


Figure 5: MapReduce Execution
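The model just described can be illustrated with the canonical word-count computation. The sketch below is plain Python rather than the Hadoop API; the hash-based partitioner stands in for the partitioning function that splits the intermediate key space into R pieces, and each returned dictionary plays the role of one of the R output files.

```python
from collections import defaultdict

R = 2  # number of Reduce partitions, hence R output "files"

def map_fn(_key, line):
    # Map: emit one (word, 1) intermediate pair per word in the line.
    for word in line.split():
        yield word, 1

def partition(key):
    # Partitioning function over the intermediate key space.
    return hash(key) % R

def reduce_fn(key, values):
    # Reduce: merge all intermediate values for the same key.
    yield key, sum(values)

def run_job(inputs):
    # Shuffle: group intermediate pairs by key within R partitions.
    partitions = [defaultdict(list) for _ in range(R)]
    for key, line in inputs:
        for k, v in map_fn(key, line):
            partitions[partition(k)][k].append(v)
    # Produce one dictionary of reduced pairs per partition.
    return [dict(kv for k, vs in part.items() for kv in reduce_fn(k, vs))
            for part in partitions]

outputs = run_job([(0, "the quick fox"), (1, "the lazy dog the")])
```

A framework such as Hadoop runs the same three phases, but distributes the Map and Reduce invocations across TaskTrackers and writes the R partitions to HDFS instead of returning them in memory.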

2.1.3 Condor

Condor [17] is an open-source high throughput computing software framework to distribute and execute computationally intensive tasks in-parallel across multiple computers in large clusters. It runs on multiple operating systems: Linux, UNIX, Mac OS X, FreeBSD, and Windows.


Condor is able to seamlessly integrate both dedicated and non-dedicated resources into one computing environment. One outstanding feature in Condor is the ability to identify idle computers and distribute tasks to these computers for execution.

Although Condor has been successfully deployed as a Cloud platform in place of Apache Hadoop [18], it is still not suitable for large numbers of tasks that are short-running, data-intensive or both. Work is in progress to use Condor to manage the clusters that support Apache Hadoop [19].

Table 2 shows the comparison between Condor and Apache Hadoop.

Table 2: Comparison between Condor & Apache Hadoop

Task Type
- Condor: Computation-intensive.
- Apache Hadoop: Data-intensive.

Application Structure
- Condor: Sequential, MW (Master-Worker), MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).
- Apache Hadoop: MapReduce.

Checkpoint Mechanism
- Condor: Yes; only supported on UNIX-like operating systems.
- Apache Hadoop: No.

Scheduling Awareness
- Condor: Idle computers are identified and tasks are distributed to them for execution. Memory usage of computers is monitored, and a task is executed on a computer only if the memory requirements specified by the task are met on that computer (only supported on the Linux operating system).
- Apache Hadoop: Tasks are executed on computers that store the required data or are close to the location of the required data.

Data Transfer
- Condor: No shared file system is required. During the execution of a task on a remote computer, input/output data is transferred automatically from/to the user's computer.
- Apache Hadoop: A shared file system is required, preferably a distributed file system. Large clusters are partitioned into multiple racks, each consisting of computers in close proximity and holding a replica of the required data, so that during the execution of a task in a rack, a large portion of the data transfer occurs within that rack.

2.2 Web Service

A Web Service [20] is a software system that supports interoperable computer-to-computer interaction over a network and has an interface described in WSDL (Web Service Definition Language). Other software systems interact with the Web Service using SOAP (Simple Object Access Protocol) messages. These messages are typically transmitted using HTTP (Hypertext Transfer Protocol) with an XML (Extensible Markup Language) serialization.

A Web Service can be engaged and used in several ways. In general, the following broad steps are required.

1. Both the requester and provider become known to each other, or at least one of them becomes known to the other.

2. The requester and provider also agree on the service description, which governs the mechanism of interacting with the service, and semantics, which dictates the meaning and purpose of the interaction. Both the service description and semantics will govern the interaction between the requester’s and provider’s software systems.

3. The service description and semantics are realized by the requester’s and provider’s software systems.

4. The requester’s and provider’s software systems exchange messages, thus performing some tasks on behalf of the requester and provider. The exchange of messages with the provider’s software system represents the concrete manifestation of interacting with the provider’s Web Service.
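As a rough illustration of the message exchange in step 4, the sketch below builds a SOAP 1.1 request envelope using only the Python standard library. The service namespace and the submitJob operation are hypothetical examples; a real client would obtain both from the provider's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace and operation, for illustration only;
# in practice these come from the provider's WSDL.
SVC_NS = "http://example.com/hadoopservice"

def build_soap_request(operation, params):
    """Build a minimal SOAP 1.1 request envelope as an XML string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(op, f"{{{SVC_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# Envelope for a hypothetical job-submission call.
xml_str = build_soap_request("submitJob", {"jarFile": "case.jar",
                                           "mapTasks": 8})
```

The resulting envelope would then be POSTed over HTTP to the provider's endpoint, and the response parsed the same way in reverse.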

Web Services have many advantages. They provide interoperability between software systems running on heterogeneous platforms, and allow software applications and services from different companies and locations to be combined easily into an integrated service. By utilizing HTTP, they are able to work through many firewall security measures without requiring any changes to the firewall filtering rules.

2.3 Complex Adaptive System Evolver

Complex Adaptive System Evolver (CASE) [21] is a framework developed by the EVOSIM project in the Parallel & Distributed Computing Center (PDCC) of Nanyang Technological University (NTU), under a project funded by the Singapore Defence Science & Technology Agency (DSTA). CASE is designed to simulate and evolve complex war game scenarios using Evolutionary Algorithms (EAs). A scenario represents a military operation modeled using Map Aware Non-Uniform Automata (MANA), while an excursion represents a change to one or more parameters in the scenario.

The CASE framework, as illustrated in Figure 6, is constructed in a modular fashion using the Ruby programming language. It is composed of three main components.

Excursion Generator – Takes in a scenario XML file and a set of excursion specification text files as inputs. Using these inputs, a set of excursion XML files are generated and sent to the Simulation Engine.

Simulation Engine – Receives the set of excursion XML files and executes MANA using them as inputs. A set of result text files detailing the outcomes of the simulations are generated and used by the Evolutionary Algorithm to direct the search.

Evolutionary Algorithm – Receives the set of result text files. Paired with the associated set of excursion specification text files, the result text files are processed to generate a new set of excursion specification text files. This new set of excursion specification text files may then be sent to the Excursion Generator to begin another round of evolution.
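The interaction between the three components can be sketched as a loop. The sketch below is a Python toy, not the Ruby implementation: the score function stands in for a MANA measure of effectiveness, the dictionaries stand in for the scenario/excursion XML files, and the keep-the-best-half selection scheme is deliberately simplistic.

```python
import random

def score(e):
    # Toy objective standing in for a MANA measure of effectiveness;
    # best at speed=7, range=3 (values are arbitrary for the example).
    return 100 - abs(e["speed"] - 7) - abs(e["range"] - 3)

def excursion_generator(scenario, specs):
    # Apply each excursion specification (a parameter override) to the
    # base scenario to produce excursion definitions.
    return [{**scenario, **spec} for spec in specs]

def simulation_engine(excursions):
    # Stand-in for executing MANA on each excursion and parsing results.
    return [score(e) for e in excursions]

def evolutionary_algorithm(specs, results, rng):
    # Keep the better half of the specifications, then add mutated copies.
    ranked = [s for _, s in sorted(zip(results, specs),
                                   key=lambda pair: -pair[0])]
    survivors = ranked[: len(ranked) // 2]
    children = [{k: v + rng.choice([-1, 0, 1]) for k, v in s.items()}
                for s in survivors]
    return survivors + children

rng = random.Random(0)
scenario = {"speed": 5, "range": 5}
specs = [{"speed": rng.randint(0, 10), "range": rng.randint(0, 10)}
         for _ in range(8)]
for _ in range(10):                      # rounds of evolution
    excursions = excursion_generator(scenario, specs)
    results = simulation_engine(excursions)
    specs = evolutionary_algorithm(specs, results, rng)
best = max(excursion_generator(scenario, specs), key=score)
```

Because the better half of the specifications survives unchanged each round, the best excursion found never gets worse from one round to the next, mirroring how each CASE round feeds the Evolutionary Algorithm's output back into the Excursion Generator.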


Figure 6: Complex Adaptive System Evolver (CASE)


2.3.1 Map Aware Non-Uniform Automata

Map Aware Non-Uniform Automata (MANA) [22] is a proprietary agent-based simulation model designed by the Defence Technology Agency (DTA) in New Zealand. It is developed using Delphi as the programming language and runs only on the Microsoft Windows operating system.

MANA is used largely to model military operations, such as civil violence management, maritime surveillance and coastal patrols, due to its easy representation of the more chaotic and intangible aspects of military conflicts. By leaving out detailed physical attributes of the military subjects concerned, scenarios can be run relatively fast over many excursions with MANA, so that unique situations or tactics, where friendly forces can achieve dominance over an enemy, can be discovered.

In this project, a large amount of the time spent conducting the experiments goes to using MANA to simulate scenarios over thousands or millions of excursions.

2.3.2 Evolutionary Algorithm

Evolutionary Algorithms (EAs) [23] are stochastic search methods that imitate natural biological evolution. By applying the principle of the survival of the fittest, EAs operate on a population of potential solutions to generate progressively better approximations to a solution. At each generation, a new set of approximations is created by selecting individuals according to their fitness in the problem domain and breeding them together using operators borrowed from natural adaptation. Eventually, this process leads to an evolution of individuals that are better suited to their environment than the individuals they were created from.

Figure 7 shows the general structure of an EA. A population is initially created at random, and then a loop consisting of evaluation, selection, crossover and/or mutation is executed a number of times. Each iteration of the loop is called a generation, and the termination criterion can be either a predefined maximum number of generations or another condition, such as stagnation in the population or the existence of an individual of sufficient quality. Finally, the individuals in the last population represent the best outcomes of the EA.


Figure 7: General Structure of an Evolutionary Algorithm (EA)
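The loop in Figure 7 can be sketched as a minimal single-objective EA. The two-parameter individuals, the truncation selection, the arithmetic crossover and the placeholder objective below are illustrative choices for the sketch, not part of CASE or the ART Framework:

```python
import random

def evolve(fitness, n_pop=20, n_gen=50, p_mut=0.1, seed=42):
    """Minimal generational EA: evaluate, select, crossover, mutate."""
    rng = random.Random(seed)
    # Individuals are two placeholder parameters in [-5, 5].
    pop = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(n_pop)]
    for _ in range(n_gen):                                # one iteration = one generation
        ranked = sorted(pop, key=fitness, reverse=True)   # evaluation
        parents = ranked[:n_pop // 2]                     # truncation selection
        children = []
        while len(children) < n_pop:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # arithmetic crossover
            if rng.random() < p_mut:                      # mutation
                child[rng.randrange(len(child))] += rng.gauss(0, 0.5)
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Placeholder objective: maximise -(x^2 + y^2); the optimum is (0, 0).
best = evolve(lambda ind: -(ind[0] ** 2 + ind[1] ** 2))
```

Real data-farming EAs such as NSGA-II replace the selection and variation operators above with multi-objective counterparts, but the generational loop is the same.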

2.3.3 Island Model

The Island Model [24] [25] is an efficient parallelization technique to implement an EA. It consists of several islands, with each island executing an EA and maintaining its own sub-population for searching. They work together by periodically exchanging a portion of their sub-populations in a process called migration. Figure 8 shows the differences between the single population model and the Island Model.


Figure 8: Differences between Single Population Model & Island Model

The Island Model has often been reported to display better search performance than the single population model, in terms of the amount of computation time required, the quality of solutions found and the effort measured in the total number of evaluations of individuals sampled in the search space [26] [27]. One reason for this improvement in search performance is that the various islands maintain some degree of independence and thus explore different regions of the search space, while at the same time sharing information by means of migration. This can be seen as a means of sustaining genetic diversity.

However, the Island Model introduces more parameters into the process. The four parameters that usually need to be fine-tuned when using the Island Model are described below.

Migration Interval – The number of generations or evaluations of individuals before a migration occurs;

Migration Size – The number of individuals on an island to migrate;

Migration Policy – The type of individuals on the source island to migrate and those on the destination island to substitute with; and

Migration Topology – The destination island that the individuals on a source island are to be migrated to.
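As an illustration of these four parameters, the sketch below implements one migration step under a ring migration topology with a best-replaces-worst migration policy; the data layout and parameter names are assumptions made for the example:

```python
def migrate(islands, migration_size, fitness):
    """Ring-topology migration: each island sends copies of its best
    individuals to the next island, which replaces its own worst
    individuals (a common best-replaces-worst migration policy)."""
    n = len(islands)
    # Select emigrants from every island first, so migration is simultaneous.
    emigrants = [sorted(pop, key=fitness, reverse=True)[:migration_size]
                 for pop in islands]
    for i, incoming in enumerate(emigrants):
        dest = (i + 1) % n                       # ring migration topology
        islands[dest].sort(key=fitness)          # worst individuals first
        islands[dest][:migration_size] = [list(ind) for ind in incoming]
    return islands

# Two islands of scalar individuals; fitness is simply the value itself.
islands = [[[1], [2], [9]], [[3], [8], [4]]]
migrate(islands, migration_size=1, fitness=lambda ind: ind[0])
```

The migration interval would control how many generations run between calls to `migrate`, which is deliberately left outside this sketch.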


Chapter 3: Implementing Cloud Infrastructure & Applications

This chapter describes the implementation of a private Cloud and a public Cloud. They are called Hadoop Cluster @NTU and Hadoop Cluster @EC2 respectively. Since this project involves the usage of military applications, such as MANA, the implementation of the private Cloud is necessary to ensure the confidentiality of both the applications and the data. A Web Service, which makes an Apache Hadoop cluster available via the Internet, and all its Web Service clients are also presented in this chapter. This Web Service is called Hadoop Service.

3.1 Hadoop Cluster @NTU

Hadoop Cluster @NTU is located in the Parallel & Distributed Computing Centre (PDCC) of Nanyang Technological University (NTU). Figure 9 illustrates the 37-node production cluster. The master node in this production cluster is a dedicated physical computer, and the other 36 slave nodes are made up of 30 non-dedicated physical computers and 6 dedicated virtual machines.

No user intervention is required to bring up the 37-node production cluster. The NameNode and the JobTracker run automatically after the master node boots up, while the DataNode and the TaskTracker on each slave node run automatically after that node boots up.


Figure 9: 37-Nodes Production Cluster

Each slave node performs two roles: a DataNode and a TaskTracker. With this configuration on each slave node, the MapReduce implementation is able to schedule tasks effectively on the slave nodes where the data is located. High bandwidth is also achieved throughout the production cluster.

The following tasks have been performed to complement the production cluster. The last two tasks have also been implemented for the development cluster.

The master node has been moved to the server room, which is off-limits to the public. It can only be accessed remotely.

6 dedicated virtual machines have been added to the production cluster. This ensures that no MapReduce job will be terminated prematurely due to the inability of a task to execute, even if all 30 non-dedicated physical machines are offline.

Rack-awareness has been enabled by a tool (named as “topology.7z”) implemented in Python. It keeps network traffic within the same rack/place wherever possible, which is much more desirable than network traffic moving across racks/places. High bandwidth is achieved throughout the production cluster, and fault tolerance is improved because the NameNode places block replicas on multiple racks/places.


A shell script (named as “hadoopservice-maintenance.sh”) and a batch file (named as “hadoopservice-maintenance.bat”) have been written to remove any temporary files created by the Web Service and to delete any application data that has been stored with the Web Service for more than three months. The shell script and the batch file are scheduled to run automatically every night.

A batch file (named as “hadoop-maintenance.bat”) has been written to maintain the NameNode and the JobTracker in a good and working state. This batch file has to be executed manually and can also be used to restart the NameNode when the NameNode fails.
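The rack-awareness tool itself is not reproduced here. The sketch below only illustrates the general contract that a Hadoop topology script follows: given node addresses, it prints one rack path per address, which the NameNode then uses for replica placement. The IP prefixes and rack names are invented for the example, not taken from the actual cluster:

```python
import sys

# Assumed mapping from IP prefix to rack path; the actual tool would cover
# every node in the production cluster.
RACKS = {
    "155.69.1.": "/pdcc/rack1",
    "155.69.2.": "/pdcc/rack2",
}
DEFAULT_RACK = "/default-rack"   # Hadoop's conventional fallback rack name

def resolve(address):
    """Map one node address to its rack path."""
    for prefix, rack in RACKS.items():
        if address.startswith(prefix):
            return rack
    return DEFAULT_RACK

if __name__ == "__main__":
    # Hadoop invokes the configured topology script with one or more node
    # addresses as arguments and reads the rack paths from standard output.
    print(" ".join(resolve(a) for a in sys.argv[1:]))
```

With such a script configured, the NameNode can place block replicas on more than one rack, which is what improves fault tolerance when a whole rack goes offline.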

A physical computer can be added as a new node to the production cluster by running an installer (named as “HadoopUpdater.7z”). This installer is created using NSIS (Nullsoft Scriptable Install System) [28]. Figure 10 shows the screen capture of the installer.


Figure 10: Screen Capture of the Installer

The production cluster is equivalent to having Cloud Computing on a private network and can be used to address privacy, security and reliability concerns. However, it lacks two of the benefits of Cloud Computing: lower up-front capital cost and less hands-on management. For a list of nodes in the production cluster, please refer to Appendix A.

3.1.1 Development Cluster

The development cluster consists of 4 nodes, all of which are dedicated virtual machines. Figure 11 illustrates the 4-node development cluster. It is mainly used for development and testing purposes.


Figure 11: 4-Nodes Development Cluster

For a list of nodes in the development cluster, please refer to Appendix A.

3.1.2 Auto-Updating & Reporting Tool

An auto-updating and reporting tool (named as “HadoopNodeService.7z”) is installed on every slave node in both the production and development clusters. It is developed in C# and makes use of an open-source updater known as GUP (Generic Updater for Win32) [29]. GUP has been modified to download and install updates (named as “HadoopUpdater.7z”) automatically in silent mode whenever they are available. Figure 12 shows the steps in getting an update applied on each slave node. An update, which can be created using NSIS (Nullsoft Scriptable Install System), must be applied on all slave nodes in both clusters before another update can be applied.

This tool also supplies the job scheduler in Apache Hadoop with information about each slave node in the production and development clusters. This information includes memory usage, processor utilization rate and idleness of the particular slave node.


Figure 12: Steps in Getting an Update to Be Applied on Each Slave Node

3.2 Hadoop Cluster @EC2

Amazon Elastic Compute Cloud (Amazon EC2) [30] is a Web Service offered by Amazon to supply its users with resizable resources in the Cloud. It allows its users to rent virtual computers, which represent resources such as computing power, storage space and network bandwidth. Each virtual computer is called an instance.

Amazon EC2 offers great ease in deploying multiple instances. An Amazon Machine Image (AMI) can be created from an instance, which has already been installed with the essential software, and many more instances can then be easily spawned using this AMI.

Amazon EC2 also offers many types of instances. A 32-bit AMI can be used to create instances of any 32-bit instance type. Table 3 shows the 32-bit instance types available on Amazon EC2. 1 EC2 Compute Unit (ECU) represents the CPU capacity of a 1.0-1.2 GHz (Gigahertz) 2007 Opteron or 2007 Xeon processor.


Table 3: 32-Bit Instance Types on Amazon Elastic Compute Cloud (Amazon EC2)

Instance Types | Instance Categories | Memory Sizes | EC2 Compute Units (ECUs)
Micro | Micro | 613 MB | Up To 2 ECUs (For Short Periodic Bursts)
Small | Standard | 1.7 GB | 1 ECU (1 Virtual Core With 1 ECU)
Medium | High-CPU | 1.7 GB | 5 ECUs (2 Virtual Cores With 2.5 ECUs Each)

Housed on Amazon EC2, Hadoop Cluster @EC2 is made up of 4 instances. Each instance can be treated as a node. Figure 13 illustrates the 4-node cluster. The master node has a permanent IP (Internet Protocol) address.


Figure 13: Hadoop Cluster @EC2

3.3 Hadoop Service

The Hadoop Service (named as “HadoopService.7z”) is a Web Service that allows MapReduce jobs to be submitted and run on an Apache Hadoop cluster via the Internet. This Web Service is an improved version of the Web Service that had been implemented during the author’s IA (Industrial Attachment) and URECA (Undergraduate Research Experience on Campus) projects. It is implemented in Java and runs on an Oracle GlassFish Server 2.1.1, which uses the Common class loader to load Apache Hadoop JAR (Java Archive) files.


The following features have been implemented to enhance the Web Service.

Kills a running MapReduce job on an Apache Hadoop cluster.

Views all logs that are generated by a MapReduce job when it is running on an Apache Hadoop cluster.

Uploads/Downloads a file using MTOM/XOP (Message Transmission Optimization Mechanism/XML-Binary Optimized Packaging).

For a MapReduce job to run on an Apache Hadoop cluster via the Web Service, two sets of files have to be submitted to the Web Service first: (i) a MapReduce model and (ii) an input data-set. A MapReduce model is an application written according to the MapReduce programming model to process the input data-set on an Apache Hadoop cluster. Figure 14 presents the steps for executing a MapReduce job on the Hadoop Cluster @NTU.


Figure 14: Steps for Executing a MapReduce Job on Hadoop Cluster @NTU

Table 4 describes the Web API (Application Programming Interface) of Hadoop Service.


Table 4: Web API (Application Programming Interface) of Hadoop Service

Methods | Descriptions
getVersion() | Gets the version number of the Apache Hadoop JAR files that Hadoop Service is using.
isClusterAvailable() | Checks whether the Apache Hadoop cluster is available.
listModels() | Lists all MapReduce models stored in Hadoop Service.
getModel() | Retrieves the status information of a MapReduce model.
addModel(), removeModel() | Adds/Removes a MapReduce model to/from Hadoop Service.
listInputs() | Lists all input data-sets stored in Hadoop Service.
getInput() | Retrieves the status information of an input data-set.
addInput(), removeInput() | Adds/Removes an input data-set to/from Hadoop Service.
listOutputs() | Lists all output data-sets stored in Hadoop Service.
removeOutput() | Removes an output data-set from Hadoop Service.
prepareJob(), runPreparedJob() | Prepares & runs a MapReduce job.
runJob(), killJob() | Runs/Kills a MapReduce job.
getOutput() | Retrieves the status information of an output data-set.
getCompressedOutput() | Retrieves the file information of a compressed output data-set.
getCompressedWCOutput() | Retrieves the file information of a compressed output data-set, which may be a portion of the original data-set.
uploadFile(), downloadFile() | Uploads/Downloads a file in multiple segments.
putFile(), getFile() | Uploads/Downloads a file using MTOM/XOP.
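The method names in Table 4 suggest the typical calling sequence for one job. The sketch below walks through that sequence against an in-memory stand-in; the `FakeHadoopService` class is entirely hypothetical and only mirrors a subset of the method names above, not the real SOAP interface:

```python
class FakeHadoopService:
    """Hypothetical in-memory stand-in for the Web Service, used only to
    illustrate the calling order; it is not the real Hadoop Service API."""
    def __init__(self):
        self.models, self.inputs, self.outputs = {}, {}, {}
    def addModel(self, name, data):   self.models[name] = data
    def addInput(self, name, data):   self.inputs[name] = data
    def runJob(self, model, inp, out):
        # A real call would run on the cluster; here the "job" merely
        # pairs the stored model with the stored input data-set.
        self.outputs[out] = (self.models[model], self.inputs[inp])
    def getOutput(self, name):        return self.outputs[name]
    def removeOutput(self, name):     del self.outputs[name]

def submit_and_collect(service):
    """The usual sequence: upload model and input, run, fetch, clean up."""
    service.addModel("mrmana", b"model-jar")
    service.addInput("scenario", b"excursions")
    service.runJob("mrmana", "scenario", "results")
    result = service.getOutput("results")
    service.removeOutput("results")
    return result

result = submit_and_collect(FakeHadoopService())
```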

Hadoop Service allows multiple MapReduce jobs to be submitted and run concurrently. Running these MapReduce jobs in parallel caused problems in log recording and service performance. These problems and their solutions are presented below.

1. The logging information produced by one running MapReduce job became interleaved with that produced by other running MapReduce jobs. Figure 15 presents the strategy formulated to solve this problem.


Figure 15: Strategy Formulated To Solve a Problem in Log Recording
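The strategy in Figure 15 is shown only as a diagram. One plausible realization, sketched here as an assumption rather than the actual Hadoop Service code, is to give every job its own non-propagating logger and log file, so that concurrent jobs cannot interleave their log lines:

```python
import logging
import os
import tempfile

def job_logger(job_id, log_dir):
    """Create an isolated logger whose records go only to this job's file,
    keeping concurrent jobs' log lines apart."""
    logger = logging.getLogger(f"hadoopservice.job.{job_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False                   # keep records out of shared logs
    handler = logging.FileHandler(os.path.join(log_dir, f"{job_id}.log"))
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger

log_dir = tempfile.mkdtemp()
a, b = job_logger("job_1", log_dir), job_logger("job_2", log_dir)
a.info("map phase started")        # lands only in job_1.log
b.info("reduce phase started")     # lands only in job_2.log
```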


2. The status of each running MapReduce job is queried frequently. The increasing number of queries, due to multiple running MapReduce jobs, degraded the performance of the Web Service. Figure 16 shows the solution to this problem.


Figure 16: Solution to Solve Performance Degradation Due To Increasing Number of Queries
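The solution in Figure 16 is likewise shown only as a diagram. A cache with a short time-to-live, sketched below as one plausible realization (the class and parameter names are invented), can answer repeated status queries without a cluster round-trip each time:

```python
import time

class StatusCache:
    """Serve repeated job-status queries from a cached copy, refreshing
    from the (expensive) cluster-side query only after the TTL expires."""
    def __init__(self, fetch, ttl_seconds=5.0, clock=time.monotonic):
        self.fetch = fetch            # the expensive cluster-side query
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}              # job_id -> (timestamp, status)

    def status(self, job_id):
        now = self.clock()
        cached = self._cache.get(job_id)
        if cached and now - cached[0] < self.ttl:
            return cached[1]          # fresh enough: no cluster round-trip
        status = self.fetch(job_id)
        self._cache[job_id] = (now, status)
        return status

calls = []
def expensive_fetch(job_id):
    calls.append(job_id)              # records each real cluster query
    return "RUNNING"

fake_time = [0.0]                     # controllable clock for the example
cache = StatusCache(expensive_fetch, ttl_seconds=5.0, clock=lambda: fake_time[0])
first = cache.status("job_1")         # miss: queries the cluster
second = cache.status("job_1")        # hit: served from the cache
fake_time[0] = 10.0
third = cache.status("job_1")         # TTL expired: queries again
```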

3. When the status of a running MapReduce job was queried, the logging information produced by the particular MapReduce job was also sent simultaneously. This also degraded the performance of the Web Service. It has been observed that logging information is viewed only when a MapReduce job fails. Thus logging information has been modified to be sent only when a MapReduce job completes.

Please take note that a MapReduce job from the perspective of Hadoop Service corresponds to one or more MapReduce jobs on an Apache Hadoop cluster, all of which are spawned by a single instance of a MapReduce model.

Hadoop Service also allows the status information of the cluster to be viewed visually via the Internet. Figure 17 shows the screen capture of the webpage presenting the status information of the cluster.


Figure 17: Screen Capture of Hadoop Service

3.3.1 Web Service Clients

Two Web Service clients have been implemented to allow MapReduce jobs to be submitted and run on an Apache Hadoop cluster via the above Web Service. The first Web Service client (named as “HadoopWSClient.7z”) is implemented in Java. It serves as a GUI (Graphical User Interface)/CLI (Command-Line Interface) application for the users or an add-on application for any PHP-enabled Web Server. Figure 18 shows the package diagram for Hadoop Service and this first Web Service client.


Figure 18: Package Diagram for Hadoop Service & Java-Based Web Service Client

Figure 19 shows the screen capture of the first Web Service client.


Figure 19: Screen Capture of Java-Based Web Service Client

The second Web Service client (named as “HadoopWebClient.7z”) is a Web-based application implemented in PHP and JavaScript. It provides an interactive and personalized experience for the users by employing AJAX (Asynchronous JavaScript & XML) and Client-Side Local Storage. It runs on any Web Server that supports PHP 5.2.4 and relies on the first Web Service client, which must be running in the background on the Web Server. Figure 20 shows the screen capture of this second Web Service client. It provides an option to choose between the Apache Hadoop clusters for submitting and running a MapReduce job.


Figure 20: Screen Capture of Web-Based Web Service Client

Figure 21 shows the screen capture of another Web Service client that uses Hadoop Service. It is implemented by another student specifically to run CASE (Complex Adaptive System Evolver) on an Apache Hadoop cluster. It is called CASE GUI.


Figure 21: Screen Capture of CASE GUI

Figure 22 illustrates the Cloud Computing architecture that has been implemented.


Figure 22: Cloud Computing Architecture


Chapter 4: Preliminary Work

This chapter describes the preliminary work of incorporating Cloud Computing into the Automated Red Teaming (ART) Framework. Two MapReduce models, which were implemented to enable the ART Framework to execute on an Apache Hadoop cluster, are also presented in this chapter. They are MapReduce MANA (MRMANA) and Map-Only MANA (MOMANA).

4.1 Automated Red Teaming Framework

Red Teaming is a technique often utilized to uncover vulnerabilities and breaches in operational concepts, with the ultimate goal of improving them. However, it demands close collaboration from a group of subject matter experts, whose knowledge and experience greatly influence the success of the technique. This is especially so for military operational concepts, given their complicated and multi-faceted nature.

Automated Red Teaming (ART) is a concept that enhances the Manual Red Teaming (MRT) effort with the automated discovery of vulnerabilities and breaches in the targeted system. The technique works by assessing the targeted system using a series of rigorous strategies and keeping track of those strategies that have performed exceedingly well against the operational concepts of the Blue team. These well-performing strategies provide the subject matter experts with alternative views of the various vulnerabilities and breaches in the operational concepts of the Blue team.

The ART Framework realizes the ART concept by leveraging advanced technologies such as high-performance computing, EAs and agent-based simulations. It is developed by DSO National Laboratories (DSO) using the Visual C++ programming language.

The architecture of the ART Framework, as shown in Figure 23, is composed of the following components:

ART Parameters Setting Interface allows the selection of those parameters that are required to be varied;


Simulation Model Dependent Modules add a layer of data flow between the ART Framework and the simulation models. Data flowing into the simulation models are the parameters to be executed, and data flowing out are the results of the simulation runs. These data are translated, by wrappers that follow the simulation format, into the ART Framework data structures;

EA Module stores the EA library in which the user can choose from. It also prepares the parameters for the individual simulation, analyses the results and distills the desired Red Teaming objectives;

Condor Controller submits the run of each individual simulation to the Condor cluster. It also monitors the completion of each individual run and signals the ART Controller for further processing;

ART Output Module provides feedback on the whole process and updates the user on the selected parameters and the run results; and

ART Controller coordinates the whole process.


Figure 23: Architecture of ART Framework [31]

Figure 24 shows a screen capture of the ART Framework.


Figure 24: Screen Capture of ART Framework

The Java-based Web Service client has been incorporated into the ART Framework. Figure 25 presents the steps performed by the ART Framework, through the Java-based Web Service client, to submit and run a MapReduce job on an Apache Hadoop cluster.


Figure 25: Steps for ART Framework to Submit & Run a MapReduce Job


4.2 MapReduce MANA

MapReduce MANA (MRMANA) is a MapReduce application that had initially been implemented by the author during his Industrial Attachment in the DSO National Laboratories and was incrementally improved upon during this project. In MRMANA, each Map task executes MANA for one replication of an excursion, while each Reduce task gathers the obtained results of all replications that belong to an excursion. After gathering the results, the Reduce task calculates the means and standard deviations, and generates an output file for the excursion.

If the execution time for one replication of an excursion is negligible, it will be very expensive to execute MANA for only one replication in the Map task due to the overhead in creating the Map task. Thus Map-Only MANA (MOMANA) is preferred over MRMANA. In MOMANA, each Map task executes MANA for all replications of an excursion, calculates the means and standard deviations, and generates an output file for the excursion. There is no Reduce task in MOMANA. Figure 26 illustrates the application workflows of MRMANA and MOMANA.

The number of Reduce tasks spawned in MRMANA and the number of Map tasks spawned in MOMANA are equivalent to the number of excursions in the particular MapReduce job.


Figure 26: Application Workflows of MRMANA & MOMANA
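The MRMANA data flow can be sketched as follows; `run_mana` is a placeholder standing in for the real MANA invocation, and the shuffle phase is simulated in-process rather than performed by Hadoop:

```python
import math
from collections import defaultdict

def map_task(excursion_id, replication_seed):
    """One Map task: run MANA for a single replication of one excursion."""
    def run_mana(seed):
        # Placeholder, not the real simulation: a fake measure of effectiveness.
        return float(excursion_id * 10 + seed % 3)
    return excursion_id, run_mana(replication_seed)

def reduce_task(excursion_id, results):
    """One Reduce task: gather all replications of one excursion and
    compute the mean and (population) standard deviation."""
    n = len(results)
    mean = sum(results) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in results) / n)
    return excursion_id, mean, std

# The shuffle phase, simulated in-process: group Map outputs by excursion.
grouped = defaultdict(list)
for excursion in (1, 2):
    for replication in range(30):             # 30 replications per excursion
        key, value = map_task(excursion, replication)
        grouped[key].append(value)
outputs = [reduce_task(k, v) for k, v in sorted(grouped.items())]
```

In MOMANA, the inner replication loop would move into `map_task` itself and the reduce step would disappear, which is exactly the trade-off discussed above.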


4.2.1 Problems & Solutions

Table 5 lists the problems that arose in the execution of MRMANA and their respective solutions. These problems are commonly found in MapReduce applications, and their solutions are also used in the subsequent MapReduce models.

Table 5: Solutions to Problems in MRMANA

Problems | Solutions
Empty Result File (Due to a race condition) | Each task attempt produces a result file that is named uniquely from other attempts. When a task attempt completes successfully, it renames the result file to the intended filename.
Missing Result File (Due to offline DataNodes) | Increase the replication factor for the result files and/or enable rack-awareness for the clusters to improve fault tolerance.

For a more detailed description of the above problems and their solutions, please refer to Appendix B.
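The write-then-rename fix for the empty-result-file race can be sketched as below; the attempt-naming scheme is illustrative, not the one actually used by MRMANA:

```python
import os
import tempfile

def write_result_atomically(directory, final_name, data, attempt_id):
    """Each task attempt writes to its own uniquely named file, then renames
    it to the intended filename only on success. Readers therefore never
    observe a partially written (empty) result file."""
    attempt_path = os.path.join(directory, f"{final_name}.attempt-{attempt_id}")
    final_path = os.path.join(directory, final_name)
    with open(attempt_path, "wb") as f:
        f.write(data)                        # concurrent attempts cannot collide
    os.replace(attempt_path, final_path)     # atomic rename on POSIX filesystems
    return final_path

out_dir = tempfile.mkdtemp()
path = write_result_atomically(out_dir, "excursion-7.csv", b"mean,std\n", "0001")
```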

4.2.2 Demonstrating Apache Hadoop-Compliant ART

This experiment was performed to benchmark the time taken by MRMANA and MOMANA in executing two types of scenarios. The ART Framework was utilized to submit and run MapReduce jobs on Hadoop Cluster @NTU.

The experiment was performed using the production cluster and thus might be affected by the performance of those nodes that were being utilized by other users. However, the production cluster exhibits the hazards that are unavoidable in a distributed heterogeneous computing environment, and at the same time these hazards also pose a trial to the robustness of the cluster.

Both types of scenarios were executed for 30 replications on the slowest node, which is representative of most of the nodes in the cluster. The simple scenario executed in 0.05 second, while the complex scenario took 1 minute to execute. Other common settings are described below.

Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II) [32]


Generations: 100

Population Size: 100

Replications for each Excursion: 30

4.2.2.1 Results & Analysis

All the results shown here are the averages of 2 replications.

Table 6 shows the times taken by MRMANA and MOMANA in executing both types of scenarios.

Table 6: Times Taken By MRMANA & MOMANA in Executing Simple & Complex Scenarios

Execution Times (minutes) | Simple Scenario | Complex Scenario
MRMANA | 23.11 | 321.06
MOMANA | 5.76 | 395.90
ART Framework (Using A Single Computer) | 15.83 | 10366.67

The results show that MRMANA executed faster than MOMANA when the complex scenario was used, while for the simple scenario MOMANA executed faster than MRMANA. Compared to running the ART Framework on a single computer, MRMANA took a longer time for the simple scenario. This was mainly due to the overhead of executing only one replication of the excursion in each Map task, since the execution time for one replication of the excursion was negligible.


Chapter 5: Implementing Apache Hadoop-Compliant CASE

This chapter describes Apache Hadoop-compliant CASE and presents the four MapReduce models that were implemented to enable CASE to execute on an Apache Hadoop cluster. These four MapReduce models are the Replication Model, the Standard Model, Island MapReduce 1 and Island MapReduce 2. Experiments that use these MapReduce models and their obtained results are also presented in this chapter.

5.1 Apache Hadoop-Compliant CASE

Apache Hadoop-compliant CASE has been modified in such a way that each component in the original CASE can be run within a Map/Reduce task. These modifications differ greatly from those that were made to the ART Framework to make it Apache Hadoop-compliant. Figure 27 illustrates the differences between the Apache Hadoop-compliant ART Framework and CASE. In Apache Hadoop-compliant CASE, each MapReduce job executes the EA and passes the generated result-sets to the next MapReduce job.


Figure 27: Differences between Apache Hadoop-Compliant ART Framework & CASE


Apache Hadoop-compliant CASE has 5 entry points. These entry points are used by MapReduce models to execute the components in CASE. Table 7 describes each of the 5 entry points.

Table 7: 5 Entry Points in Apache Hadoop-Compliant CASE

Entry Points | Descriptions
execute | Executes the simulation model.
evolve | Executes EA.
replicate | Executes CASE.
evolve_migrate | Executes EA and generates migrating individuals.
exec_migrate | Executes CASE and generates migrating individuals.

Each entry point often uses methods that are almost the same as the original methods in CASE. These methods are modified to run effectively and efficiently within a Map/Reduce task. They are written in the same file (named as “hadoop.rb”) to form a module. Figure 28 shows the entry points and the methods that they use. Each entry point writes its obtained results to the Standard Output Stream.


Figure 28: 5 Entry Points in Apache Hadoop-Compliant CASE
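The shape of this module can be sketched as a dispatch from entry-point name to handler, with every handler writing its result to the Standard Output Stream as described above. Note that the sketch is in Python with placeholder handlers, whereas the actual module (hadoop.rb) is written in Ruby:

```python
import sys

# Placeholder handlers; the real ones call into the CASE components.
def execute(args):        return f"simulated {args}"
def evolve(args):         return f"evolved {args}"
def replicate(args):      return f"replicated {args}"
def evolve_migrate(args): return f"evolved+migrants {args}"
def exec_migrate(args):   return f"executed+migrants {args}"

# Entry-point names match Table 7.
ENTRY_POINTS = {
    "execute": execute,
    "evolve": evolve,
    "replicate": replicate,
    "evolve_migrate": evolve_migrate,
    "exec_migrate": exec_migrate,
}

def main(entry_point, args):
    result = ENTRY_POINTS[entry_point](args)
    sys.stdout.write(result + "\n")   # results go to the Standard Output Stream
    return result

out = main("evolve", "generation-3")
```

Writing to standard output is what lets the enclosing Map/Reduce task collect each entry point's results without any shared state.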


5.2 Replication Model

In Replication Model, each Map task executes an instance of CASE and produces a result-set, while a single Reduce task gathers all the obtained result-sets and combines them into a single result file. Each instance of CASE is run independently. Figure 29 illustrates the application workflow of the Replication Model. The number of Map tasks spawned is specified by the user.

The Replication Model executes multiple instances of CASE on multiple computers simultaneously. All instances normally have the same setting and execute the same EA within the same parameter space. It can also be configured for each instance to have different settings, execute different EAs and/or execute within different regions of the parameter space.

As each Map task may execute for a long period of time, the Replication Model is thus not the recommended way to make use of Apache Hadoop.


Figure 29: Application Workflow of Replication Model
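When each instance is configured to search a different region, the parameter space must be partitioned before the Map tasks are launched. The sketch below shows one simple way to do this for a single one-dimensional parameter; the real CASE parameter space is multi-dimensional, so this is an illustration only:

```python
def partition_parameter_space(lower, upper, n_map_tasks):
    """Split a one-dimensional parameter range into disjoint, contiguous
    regions, one per Map task, so each CASE instance searches its own slice."""
    width = (upper - lower) / n_map_tasks
    return [(lower + i * width, lower + (i + 1) * width)
            for i in range(n_map_tasks)]

regions = partition_parameter_space(0.0, 100.0, 4)
```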


5.2.1 Testing Cluster Robustness

In this experiment, the Replication Model was utilized to test the robustness of the production cluster and its fault tolerance.

The experiment was performed using the production cluster and thus might be affected by the performance of those nodes that were being utilized by other users. However, the production cluster exhibits the hazards that are unavoidable in a distributed heterogeneous computing environment, and at the same time these hazards also pose a trial to the robustness of the cluster.

A simple scenario was used in this experiment. This simple scenario took 5 seconds to execute for 30 replications on the slowest node, which is representative of most of the nodes. Other common settings are described below.

Map Tasks: 10

Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II)

Generations: 10

Population Size: 100

Replications for each Excursion: 30

5.2.1.1 Results & Analysis

All the results shown here are the averages of 10 replications.

Table 8 shows the execution times for the Replication Model when a given set of nodes was switched off at the start, in the midst or at the end of its execution. All nodes within this set were being utilized by the Replication Model at that point in time.

Table 8: Execution Times Demonstrating Robustness

Number of Offline Nodes During Execution | When Nodes Were Switched Off | Execution Times (minutes)
0 | - | 90.53
5 | Start of Execution | 110.83
5 | Midst of Execution | 135.05
5 | End of Execution | 210.33
10 | Start of Execution | 100.67
10 | Midst of Execution | 130.08
10 | End of Execution | 190.17


During this experiment, all MapReduce jobs completed successfully. This experiment has shown that the production cluster is robust and fault tolerant.

5.3 Standard Model

To adapt to the MapReduce programming model, the Standard Model splits CASE into two portions, each of which can be executed independently. The first portion, which executes in a Map task, is responsible for the first two components, the Excursion Generator and the Simulation Engine, while the second portion, which runs in a Reduce task, takes care of the last component, the Evolutionary Algorithm.

The Standard Model consists of G MapReduce jobs, which are executed sequentially. In each MapReduce job, each spawned Map task executes MANA R times per excursion, while a single Reduce task gathers all the obtained results and executes the EA with them as inputs. The output of the MapReduce job is then supplied as the input to the next MapReduce job. Figure 30 illustrates the application workflow of the Standard Model. The number of generations (G), the number of excursions to be executed by each Map task (E) and the number of replications per excursion (R) can be specified by the user.


Figure 30: Application Workflow of Standard Model
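The chaining of the G sequential MapReduce jobs can be sketched as a driver loop in which each job's output population becomes the next job's input; `simulate` and `evolve` below are toy placeholders for the MANA and EA stages, not the real components:

```python
def run_standard_model(initial_population, generations, simulate, evolve):
    """One MapReduce job per generation: the Map phase evaluates every
    excursion (simulate), the single Reduce task runs the EA (evolve),
    and each job's output becomes the next job's input."""
    population = initial_population
    for _ in range(generations):
        # Map phase: each Map task would evaluate E excursions, R times each.
        evaluated = [(individual, simulate(individual)) for individual in population]
        # Reduce phase: one Reduce task runs the EA on all gathered results.
        population = evolve(evaluated)
    return population

# Toy stages: fitness is -x^2; evolve keeps the better half, duplicated,
# so the population size stays constant across generations.
simulate = lambda x: -x * x
def evolve(evaluated):
    ranked = [ind for ind, _ in sorted(evaluated, key=lambda p: p[1], reverse=True)]
    half = ranked[:len(ranked) // 2]
    return half + half

final = run_standard_model([4, 3, 2, 1], generations=2,
                           simulate=simulate, evolve=evolve)
```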


Each Map task must be able to finish its execution within two hours. If it cannot, the MapReduce job will fail, and the failure of the MapReduce job will cause the Standard Model execution to be terminated. Thus the Standard Model may require the user to split the workload among more Map tasks by choosing a smaller E.

5.3.1 Demonstrating Scalability

This set of experiments was divided into two parts and utilized the Standard Model. The first part observed the effect of varying the number of excursions per Map task on the execution time, while the second part investigated the degree of scalability of the Standard Model.

The experiment was performed using the production cluster and thus might be affected by the performance of those nodes that were being utilized by other users. However, the production cluster exhibits the hazards that are unavoidable in a distributed heterogeneous computing environment, and at the same time these hazards also pose a trial to the robustness of the cluster.

The first portion of the experiment utilized a simple scenario. This simple scenario executed in 2 and 5 seconds for 30 replications on the fastest and slowest nodes in the cluster respectively. Other common settings are described below.

Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II)

Generations: 50

Population Size: 100

Replications for each Excursion: 30

The second part of the experiment used two scenarios. Both were executed for 30 replications on the slowest node, which was representative of most nodes in the cluster: the simple scenario took 5 seconds to execute, while the complex scenario took 60 seconds. Other common settings are described below.

Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II)

Generations: 100

Population Size: 100


Replications for each Excursion: 30

The experiment also compared the quality of the solutions, in terms of hyper-volume, obtained by the Standard Model to those produced by CASE and by a MapReduce application [33] proposed by the Illinois Genetic Algorithms Laboratory in 2009 to scale genetic algorithms.

The MapReduce application was modified from the Standard Model. It takes the parameter values of each excursion as the key for that excursion and has more than one Reduce task. A customized partitioning function splits all excursions equally among the Reduce tasks.
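The even split described above can be sketched as a round-robin partitioner. The function name and index-based keying below are illustrative assumptions, not the application's actual code.

```python
# Sketch of a partitioning function that spreads excursions equally across
# Reduce tasks, in the spirit of the modified MapReduce application.
def partition(excursion_index, num_reduce_tasks):
    """Assign excursions round-robin so each Reduce task gets an equal share."""
    return excursion_index % num_reduce_tasks

# Distributing 100 excursions over 4 Reduce tasks gives 25 each.
buckets = {r: [] for r in range(4)}
for i in range(100):
    buckets[partition(i, 4)].append(i)
```

In Hadoop this role is played by a custom `Partitioner` implementation; the sketch only shows the assignment logic.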

5.3.1.1 Results & Analysis

All the results shown here are the averages of 10 replications.

Table 9 shows the execution times for the Standard Model utilizing a given number of nodes in the production cluster. For each given set of nodes, the Standard Model was configured to execute 5 and K excursions in each of its Map tasks. K, which spreads the population's excursions evenly over the available nodes, can be calculated using the following formula.

K = ceil(Population Size / Number of Nodes Involved)

Table 9: Execution Times Using 5 & K Excursions in a Map Task

| Number of Nodes Involved | Number of Excursions per Map Task | Execution Times (minutes) |
|---|---|---|
| 1 | 5 | 522.95 |
| 1 | 100 | 424.72 |
| 2 | 5 | 273.80 |
| 2 | 50 | 235.82 |
| 3 | 5 | 201.47 |
| 3 | 34 | 166.20 |
| 4 | 5 | 162.00 |
| 4 | 25 | 130.15 |
| 5 | 5 | 122.97 |
| 5 | 20 | 104.55 |
| 10 | 5 | 80.92 |
| 10 | 10 | 70.02 |
| 15 | 5 | 69.77 |
| 15 | 7 | 62.75 |
| 20 | 5 | 48.18 |
| 20 | - | - |
| 25 | 5 | 48.05 |
| 25 | 4 | 51.07 |
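The K values listed in Table 9 follow K = ceil(P / N) for a population size of P = 100 and N nodes; a quick check:

```python
import math

def excursions_per_map_task(population_size, num_nodes):
    """K = ceil(P / N): split the population's excursions evenly over the nodes."""
    return math.ceil(population_size / num_nodes)

# K values reported in Table 9 for each node count. For 20 nodes, K equals
# the 5-excursion baseline, so no separate K run was listed.
table9_k = {1: 100, 2: 50, 3: 34, 4: 25, 5: 20, 10: 10, 15: 7, 25: 4}
assert all(excursions_per_map_task(100, n) == k for n, k in table9_k.items())
```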

Figure 31 presents the above table visually. Using K excursions per Map task yielded a faster execution time for the Standard Model with 15 or fewer nodes in the cluster. It did not help with 20 nodes and beyond, as the execution time for K excursions in a Map task was negligible compared to the overheads involved in using the Standard Model.

[Line chart: Execution Times (minutes) against Number of Nodes Involved (1 to 25), with one series for 5 Excursions/Map Task and one for K Excursions/Map Task.]

Figure 31: Execution Times Using 5 & K Excursions in a Map Task

Table 10 shows the execution times for the Standard Model on the simple and complex scenarios. To investigate the degree of scalability of the Standard Model, the number of nodes was increased incrementally, and the number of excursions executed in each Map task was adjusted according to the formula used to obtain K.


Table 10: Execution Times Demonstrating Scalability

| Number of Nodes Involved | Number of Excursions per Map Task | Simple Scenario (minutes) | Complex Scenario (minutes) |
|---|---|---|---|
| 1 | 100 | 854.13 | 10517.86 |
| 2 | 50 | 476.08 | 5875.90 |
| 4 | 25 | 266.52 | 3163.48 |
| 5 | 20 | 219.80 | 2353.27 |
| 10 | 10 | 136.73 | 1309.33 |
| 20 | 5 | 97.42 | 632.13 |
| 25 | 4 | 92.75 | 483.73 |
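From the figures in Table 10, the speedup relative to a single node can be computed directly. The sketch below copies the table's values and shows that the complex scenario keeps scaling at 25 nodes while the simple one flattens, supporting the conclusion that the optimal node count depends on scenario complexity.

```python
# Speedup relative to one node, computed from Table 10 (values in minutes).
simple   = {1: 854.13, 2: 476.08, 4: 266.52, 5: 219.80,
            10: 136.73, 20: 97.42, 25: 92.75}
complex_ = {1: 10517.86, 2: 5875.90, 4: 3163.48, 5: 2353.27,
            10: 1309.33, 20: 632.13, 25: 483.73}

def speedup(times):
    base = times[1]                       # single-node baseline
    return {n: round(base / t, 2) for n, t in times.items()}

print(speedup(simple))    # 25 nodes: roughly 9.2x
print(speedup(complex_))  # 25 nodes: roughly 21.7x
```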

Figure 32 visually shows the execution times for the Standard Model using the simple scenario. As the number of nodes involved increases, the execution time decreases accordingly. For this simple scenario, however, the decrease is only substantial while the number of nodes involved is less than 20.

[Line chart: Execution Times (minutes) against Number of Nodes Involved (1 to 25) for the simple scenario.]

Figure 32: Execution Times Demonstrating Scalability Using Simple Scenario

Figure 33 shows the execution times for the Standard Model using the complex scenario. The same observation applies to this complex scenario. It can therefore be concluded that an optimal number of nodes exists and depends on the complexity of the scenario being used.

The following issues were identified in this experiment.


A MapReduce job may take a longer time to complete. The delay is caused by nodes that are either equipped with a slower processor or experiencing a higher computational load due to external factors, such as use by other users.

Network traffic may delay the launch of the Map tasks, so each Map task starts at a different time. This may in turn aggravate the previous issue.

[Line chart: Execution Times (minutes) against Number of Nodes Involved (1 to 25) for the complex scenario.]

Figure 33: Execution Times Demonstrating Scalability Using Complex Scenario

Figure 34 shows the quality of the solutions, in terms of hyper-volume, produced by the Standard Model, CASE and the MapReduce application. The solutions produced by the Standard Model had a hyper-volume similar to that obtained by CASE. The solutions produced by the MapReduce application were less optimal because each of its Reduce tasks executed the EA with a set of similar excursions.

This set of experiments and its findings are documented in a published paper; please refer to Appendix C for this paper.


[Line chart: Hyper-Volume (-30 to -5) against Generations (0 to 100), with one series each for the Standard Model, CASE and the MapReduce application.]

Figure 34: Comparison of Solutions between Standard Model & CASE

5.3.2 Evaluating Hadoop Cluster @EC2

An experiment was performed to demonstrate the feasibility of using Amazon EC2 within this project. In this experiment, the Standard Model was executed on Hadoop Cluster @EC2 and on the development cluster in Hadoop Cluster @NTU. The common settings are described below.

Evolutionary Algorithm: NSGA-II (Non-Dominated Sorting Genetic Algorithm II)

Generations: 50

Population Size: 40

Excursions per Map Task: 10

Replications for each Excursion: 30

Benchmarks were performed on the small and medium instances in Hadoop Cluster @EC2 and on a virtual machine in the development cluster. Table 11 shows the benchmarks that were obtained. A medium instance has 5 EC2 Compute Units (ECUs), while a small instance has only 1 ECU.


Table 11: Benchmarks for Virtual Machine, Small & Medium Instances

| Benchmarks | Small Instance | Medium Instance | Virtual Machine |
|---|---|---|---|
| Execution Times for 30 Replications of the Simple Scenario Used in this Experiment (seconds) | 5.0 | 2.1 | 5.0 |
| Execution Times for 30 Replications of the Complex Scenario (seconds) | 138.0 | 58.4 | 69.5 |
| Whetstone iSSE3 (GFLOPS) | 3.65 | 14.87 | 21.10 |
| CPU Mark | 363.4 | 1490.6 | 1272.7 |

5.3.2.1 Results & Analysis

The execution times shown in Table 12 are the averages of 10 replications. The calculated costs for the respective execution times take into account only the hourly charge for each instance involved.

The cost for data transfer is indicated by a plus symbol (+) and could not be accounted for in this experiment because of the small scale at which it was performed. However, this cost may escalate if a more complex scenario is used and/or an experiment is performed at a larger scale.
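The instance-hour costs in Table 12 can be reproduced from EC2's hour-granularity billing, under which each started instance-hour is charged as a full hour. The per-hour rates used below ($0.12 for a small Windows instance, $0.29 for a medium one) are inferred from the table's totals, not quoted from the report.

```python
import math

# Hedged reconstruction of the instance-hour costs in Table 12.
# Assumption: EC2 bills each started hour as a full hour, at the inferred
# rates of $0.12/hour (small) and $0.29/hour (medium).
def run_cost(num_instances, hourly_rate, runtime_minutes):
    billed_hours = math.ceil(runtime_minutes / 60)   # partial hours round up
    return round(num_instances * billed_hours * hourly_rate, 2)

print(run_cost(4, 0.12, 89.32))  # 4 small instances, ~89 min -> 0.96
print(run_cost(4, 0.29, 48.70))  # 4 medium instances, ~49 min -> 1.16
```

The rounding to whole hours is why the 89-minute small-instance run costs twice the hourly rate while the 49-minute medium run costs only one hour per instance.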

Table 12: Execution Times on Hadoop Cluster @EC2 & Hadoop Cluster @NTU

| Clusters | Descriptions | Costs | Mean Execution Time (minutes) | Standard Deviation (minutes) |
|---|---|---|---|---|
| Hadoop Cluster @EC2 | 4 Small Instances, Public IP Address for Master Node | $0.96+ | 89.32 | 1.80 |
| Hadoop Cluster @EC2 | 4 Medium Instances, Public IP Address for Master Node | $1.16+ | 48.70 | 1.07 |
| Hadoop Cluster @EC2 | 4 Small Instances, Private IP Address for Master Node | $0.96 | 88.53 | 1.23 |
| Hadoop Cluster @EC2 | 4 Medium Instances, Private IP Address for Master Node | $1.16 | 48.40 | 0.95 |
| Hadoop Cluster @NTU | 4 Virtual Machines | - | 90.83 | 2.43 |


Based on the obtained results, it is feasible to make use of Amazon EC2 within this project. The cost can be further reduced by using medium instances to form the cluster and a private IP address for the master node.

5.4 Island MapReduce

The Island MapReduce is designed to run CASE in a scalable fashion by employing two techniques: MapReduce programming model and Island Model. It exists in two versions: Island MapReduce 1 and Island MapReduce 2.

Island MapReduce 1 and Island MapReduce 2 use the same migration topology and migration policy. The islands are arranged in a ring topology and each island is tagged with a different index number. Random excursions from a particular island are sent to the next island in the ring, where they replace random excursions in the destination island.
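The shared migration policy can be sketched as follows. The population contents, migration size and function names below are illustrative assumptions, not the frameworks' actual code.

```python
import random

# Sketch of the ring migration policy shared by Island MapReduce 1 and 2:
# a few random excursions from island i replace random excursions on island
# (i + 1) mod N.
def migrate(islands, migration_size, rng=random):
    n = len(islands)
    # Sample all emigrants first, before any island is modified.
    emigrants = [rng.sample(pop, migration_size) for pop in islands]
    for i, migrants in enumerate(emigrants):
        dest = islands[(i + 1) % n]                 # next island in the ring
        for m in migrants:
            dest[rng.randrange(len(dest))] = m      # replace a random excursion
    return islands

# 20 islands of 50 illustrative excursions, tagged (island_index, member_index).
islands = [[(isl, j) for j in range(50)] for isl in range(20)]
migrate(islands, migration_size=2, rng=random.Random(42))
```

Island populations keep their size; only which excursions they hold changes.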

5.4.1 Island MapReduce 1

Island MapReduce 1 consists of one or more MapReduce jobs. In a MapReduce job, each pair of Map and Reduce tasks corresponds to an island. Each Map task executes an instance of CASE for T generations and produces two result-sets: the excursions remaining in that island, and the excursions migrating from it to another island. The associated Reduce task gathers and groups all the excursions for its island. The output of the MapReduce job is then supplied as the input to the subsequent MapReduce job. Figure 35 illustrates the application workflow of Island MapReduce 1. The number of islands (N) and the migration interval (T) can be specified by the user.



Figure 35: Application Workflow of Island MapReduce 1

5.4.2 Island MapReduce 2

Like the Standard Model, Island MapReduce 2 splits CASE into two portions in order to adapt it to the MapReduce programming model. Each portion can be executed independently. The first portion, which executes in a Map task, is responsible for the first two components of CASE: the Excursion Generator and the Simulation Engine. The second portion, which runs in a Reduce task, takes care of its last component: the Evolutionary Algorithm.

Island MapReduce 2 consists of G MapReduce jobs, which are executed sequentially. In each MapReduce job, each Map task executes MANA for R times per excursion, while each Reduce task corresponds to an island. A Reduce task gathers all the obtained results for the excursions in its island, executes the EA with them as inputs, and generates at least one result-set containing the excursions remaining in that island. Whenever T is reached, it produces an additional result-set containing the excursions migrating from the island. The output of the MapReduce job is then supplied as the input to the subsequent MapReduce job. Figure 36 illustrates the application workflow of Island MapReduce 2. The number of islands (N), the number of generations (G), the number of excursions to be executed by each Map task (E), the number of replications per excursion (R) and the migration interval (T) can be specified by the user.

Island MapReduce 2 allows more scalability and parallelism than Island MapReduce 1: by specifying a smaller E, the workload can be split among more Map tasks. However, it is more efficient and effective to employ Island MapReduce 1 whenever simple scenarios are used.


Figure 36: Application Workflow of Island MapReduce 2

5.4.3 Evaluating Island MapReduce

In this experiment, the timings and the quality of the solutions in terms of hyper-volume were compared between Island MapReduce 1 and Island MapReduce 2. It also investigated the possible effect of two parameters on the timings and the quality of the solutions. These two parameters, which are commonly tuned in the Island Model, are the population size and the migration interval.
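Hyper-volume, the quality measure used throughout these comparisons, can be sketched for a two-objective minimization front as the area dominated by the front up to a reference point. The front, reference point and function below are illustrative; this is not the exact tool used to produce the report's figures.

```python
# Sketch of the 2-D hyper-volume indicator for a minimization problem:
# the area between a non-dominated front and a reference point. Larger
# area means a better front.
def hypervolume_2d(front, reference):
    """Area dominated by a non-dominated 2-D front, bounded by `reference`."""
    rx, ry = reference
    pts = sorted(front)              # ascending in the first objective
    area, prev_y = 0.0, ry
    for x, y in pts:
        area += (rx - x) * (prev_y - y)   # horizontal strip for this point
        prev_y = y
    return area

front = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, (5.0, 5.0)))  # -> 12.0
```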


This experiment was performed on the production cluster and thus might be affected by the performance of nodes that were being utilized by other users. However, the production cluster exhibits the hazards that are unavoidable in a distributed heterogeneous computing environment, and these hazards also test the robustness of the cluster.

A simple scenario was used in this experiment: 30 replications of it took 5 seconds to execute on the slowest node, which was representative of most of the nodes. Other common settings are described below.

Islands: 20

Generations: 100

Replications for each Excursion: 30

Table 13 shows the differences between the four configurations used in this experiment. The EA used in this experiment was either DE (Differential Evolution) [34] or NSGA-II (Non-Dominated Sorting Genetic Algorithm II).

Table 13: Differences between Configuration 1, 2, 3 & 4 for Island MapReduce

| Characteristics | Configuration 1 | Configuration 2 | Configuration 3 | Configuration 4 |
|---|---|---|---|---|
| Evolutionary Algorithm | DE | NSGA-II | NSGA-II | NSGA-II |
| Excursions per Generation on an Island | 50 | 50 | 20 | 20 |
| Migration Interval | 5 | 10 | 10 | 5 |
| Migration Size | 2 | 2 | 2 | 2 |

5.4.3.1 Results & Analysis

All the results shown here are the averages of 10 replications. Table 14 shows the execution times for Island MapReduce with Configurations 1, 2, 3 and 4. Island MapReduce 1 and Island MapReduce 2 were executed using Configuration 1. As expected, Island MapReduce 2 achieved a much faster execution time than Island MapReduce 1.


Table 14: Execution Times for Island MapReduce with Configuration 1, 2, 3 & 4

| Configurations | Island MapReduce | Number of Excursions per Map Task | Execution Times (minutes) |
|---|---|---|---|
| Configuration 1 | 1 | 250 | 433.38 |
| Configuration 1 | 2 | 10 | 390.23 |
| Configuration 2 | 1 | 250 | 435.02 |
| Configuration 3 | 1 | 200 | 204.25 |
| Configuration 4 | 1 | 100 | 198.02 |

Figure 37 presents the quality of the solutions, in terms of hyper-volume, obtained by Island MapReduce 1 and Island MapReduce 2 when executed with Configuration 1. Both versions obtained solutions with similar hyper-volumes.

[Line chart: Hyper-Volume (-30 to -5) against Generations (0 to 100), with one series each for Island MapReduce 1 and Island MapReduce 2.]

Figure 37: Comparison of Solutions between Island MapReduce 1 & Island MapReduce 2

Figure 38 shows the quality of the solutions, in terms of hyper-volume, obtained by Island MapReduce 1 with Configurations 2, 3 and 4. Using the smaller number of excursions in Configuration 3 produced less optimal solutions. However, this can be improved by using a smaller migration interval, as in Configuration 4.


[Line chart: Hyper-Volume (-30 to -5) against Generations (0 to 100), with one series each for Configurations 2, 3 and 4.]

Figure 38: Comparison of Solutions between Configuration 2, 3 & 4 for Island MapReduce 1


Chapter 6: Conclusion

6.1 Summary

The popularity of Cloud Computing is on the rise, and it has delivered benefits to its adopters. These benefits include high availability, scalability, reduced cost and fault tolerance.

This project explores the paradigm of Cloud Computing through Apache Hadoop. Because this project involves the use of military applications, such as MANA, a private Cloud, called Hadoop Cluster @NTU, was implemented. Despite operating in a non-dedicated and heterogeneous computing environment, the implemented Cloud cluster has demonstrated its robustness and high fault tolerance. Besides the private Cloud, a public Cloud, called Hadoop Cluster @EC2, was also deployed and evaluated. This public Cloud is used mainly for comparison purposes and to exhibit the feasibility of running this project on a public Cloud infrastructure.

Cloud Computing is not considered complete without Web Services. In this project, Hadoop Service was implemented; it allows the submission and running of MapReduce jobs on an Apache Hadoop cluster via the Internet. This system, which consists of Hadoop Cluster @NTU and Hadoop Service, was used in the recent IDFW (International Data Farming Workshop) 21 held in Lisbon, Portugal on 19th-24th September 2010. Three Web Service clients were implemented; the CASE GUI, a Web-based Web Service client, was implemented especially to run Apache Hadoop-compliant CASE on an Apache Hadoop cluster.

To investigate the integration of Cloud Computing and Data Farming, the Apache Hadoop-compliant ART and CASE were implemented, and six different MapReduce models were implemented and evaluated. They enable the Apache Hadoop-compliant ART Framework/CASE to execute on an Apache Hadoop cluster. Although two MapReduce models may serve the same purpose, as with MRMANA/MOMANA and Island MapReduce 1/Island MapReduce 2, they differ greatly in their implementation complexities and scalabilities: increasing the scalability of a MapReduce model escalates its implementation complexity.


6.2 Limitations

Because MANA is a proprietary agent-based simulation model that runs only on the Windows operating system, all nodes within Hadoop Cluster @NTU must run Windows. This limitation has posed a few issues. The first issue is that Apache Hadoop is unable to monitor the memory usage of each node in the clusters, and is thus unable to schedule tasks based on their memory requirements.

Although Apache Hadoop is implemented in Java, Windows and UNIX-like operating systems differ in how the child processes spawned by a Map/Reduce task on a node are terminated. When a Map/Reduce task fails or is killed, all its child processes are automatically terminated on UNIX-like operating systems; on Windows, these child processes are left running in the background instead. Careful programming, together with a mechanism built into each MapReduce model, can only minimize the chance of child processes being left running in the background. The mechanism may still fail because Windows reuses process identifiers when creating new processes.

As Apache Hadoop is not supported as a production platform on Windows operating system, any patches available in the Hadoop community require tremendous effort in testing before they can be deployed on nodes within Hadoop Cluster @NTU.

The last issue is that Windows operating system is not widely supported by Cloud providers. Currently, Amazon is the only major Cloud provider that allows its customers to run their Cloud applications on Windows operating system.

Within Hadoop Cluster @NTU, there are 30 non-dedicated physical computers. A Map/Reduce task fails when the computer executing it is shut down by another user. If this happens too frequently, the MapReduce job will take a longer time to complete.

The presence of non-dedicated physical computers also poses a serious problem for tools that are required by certain MapReduce jobs and have to be installed on each node. These tools can be moved, overwritten or even deleted by another user, causing the affected MapReduce jobs to fail.

6.3 Future Enhancements

The possible future enhancements are:

The current method used to identify bottlenecks within Hadoop Cluster @NTU requires tremendous effort. This can be facilitated greatly by installing Chukwa [35] to monitor the clusters. Chukwa automatically collects all logs generated by Apache Hadoop and uses these logs for debugging, performance measurement and operational monitoring. It will also aid considerably in identifying those problematic nodes within the clusters.

Cascading [36] is an application programming interface (API) that greatly simplifies the creation of MapReduce applications for Apache Hadoop. However, it may not be effective for the applications in this project, as they are normally composed of multiple iterations of similar operations. Twister [37], an API supporting iterative MapReduce computations, may be a worthier alternative to explore.

Because this project involves the usage of military applications, such as MANA, security is always a topmost concern whenever the project is discussed. Yahoo has recently released a version of Hadoop that integrates with Kerberos, a mature open-source authentication standard [38]. This version of Hadoop is worth exploring, since it supports security.


References

[1] Markus Klem, "Merrill Lynch Estimates Cloud Computing To Be $100 Billion Market," SYS-CON Media [Online], (21st August 2008). Available: http://www.sys-con.com/node/604936

[2] Christy Pettey, "Gartner Identifies the Top 10 Strategic Technologies for 2011," Gartner Newsroom [Online], (19th October 2010). Available: http://www.gartner.com/it/page.jsp?id=1454221

[3] Alfred G. Brandstein and Gary E. Horne, "Data Farming: A Meta-Technique for Research in the 21st Century," in Maneuver Warfare Science 1998, Quantico, VA, USA, 1998, Page 93 – 99.

[4] Gary E. Horne and Ted E. Meyer, "Data Farming: Discovering Surprise," in Proceedings of the 36th Conference on Winter Simulation (WSC 2004), Washington, DC, USA, 2004, Page 807 – 813.

[5] Philip S. Barry, Jianping Zhang and Mary McDonald, "Architecting a Knowledge Discovery Engine for Military Commanders Utilizing Massive Runs of Simulations," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003), Washington, DC, USA, 2003, Page 699 – 704.

[6] Mark Baker and Rajkumar Buyya, "Cluster Computing At a Glance," in High Performance Cluster Computing: Architectures and Systems, Volume 1, 1st Edition. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1999, Chapter 1, Page 3 – 46.

[7] Bart Jacob, Michael Brown, Kentaro Fukui and Nihar Trivedi, "What Grid Computing Is," in Introduction to Grid Computing, 1st Edition. Austin, TX, USA: IBM International Technical Support Organization, 2005, Chapter 1, Page 3 – 6.

[8] L. Youseff, M. Butrico and D. Da Silva, "Towards a Unified Ontology of Cloud Computing," in Proceedings of the 2008 Grid Computing Environments Workshop (GCE 2008), Austin, TX, USA, 2008, Page 1 – 10.

[9] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres and Maik Lindner, "A Break in the Clouds: Towards a Cloud Definition," in ACM SIGCOMM Computer Communication Review, Volume 39, Issue 1. New York, NY, USA: ACM, 2008, Page 50 – 55.

[10] Mladen A. Vouk, "Cloud Computing – Issues, Research and Implementation," in Proceedings of the 30th International Conference on Information Technology Interfaces (ITI 2008), Cavtat/Dubrovnik, Croatia, 2008, Page 31 – 40.

[11] Shlomo Swidler, "The OGF Open Cloud Computing Interface," presented at the IGT 2009 World Summit of Cloud Computing, Shefayim, Israel, 2009.

[12] Cari Tuna, "Cloudera Raises Hefty Funding Round," The Wall Street Journal [Online], (26th October 2010). Available: http://blogs.wsj.com/digits/2010/10/26/cloudera-raises-hefty-funding-round/?mod=google_news_blog

[13] Tom White, "Meet Hadoop," in Hadoop: The Definitive Guide, 1st Edition. Sebastopol, CA, USA: O'Reilly Media, 2009, Chapter 1, Page 1 – 13.

[14] Tom White, "The Hadoop Distributed Filesystem," in Hadoop: The Definitive Guide, 1st Edition. Sebastopol, CA, USA: O'Reilly Media, 2009, Chapter 3, Page 41 – 74.

[15] Konstantin Shvachko, "Automatic Namespace Recovery from the Secondary Image," The Apache Software Foundation [Online], (8th July 2009). Available: https://issues.apache.org/jira/browse/HADOOP-2585

[16] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Communications of the ACM, Volume 51, Issue 1. New York, NY, USA: ACM, 2008, Page 107 – 113.

[17] Douglas Thain, Todd Tannenbaum and Miron Livny, "Condor and the Grid," in Grid Computing: Making the Global Infrastructure a Reality, 1st Edition. Chichester, UK: John Wiley & Sons, 2003, Chapter 11, Page 299 – 335.

[18] Douglas Thain and Christopher Moretti, "Abstractions for Cloud Computing with Condor," in Cloud Computing and Software Services: Theory and Techniques, 1st Edition. Boca Raton, FL, USA: CRC Press, 2009, Chapter 7, Page 153 – 171.

[19] Miron Livny, "Condor and the Cloud – The Challenges and the Roadmap of Condor," presented at Condor & the Cloud with Professor Miron Livny & a FaceBook IT Case Study, Hertzelia, Israel, 2009.

[20] Web Services Architecture, W3C Working Group Note, David Booth, Hugo Haas and Francis McCabe, 11th February 2004.

[21] James Decraene, Yong Yong Cheng, Malcolm Low Yoke Hean, Suiping Zhou, Wentong Cai and Choo Chwee Seng, "Evolving Agent-Based Simulations in the Clouds," in Proceedings of the 3rd International Workshop on Advanced Computational Intelligence (IWACI 2010), Suzhou, China, 2010, Page 244 – 249.

[22] Gregory C. McIntosh, David P. Galligan, Mark A. Anderson and Michael K. Lauren, "Recent Developments in the MANA Agent-Based Model," in Scythe Issue 1, Scheveningen, Netherlands, 2006, Page 38 – 39.

[23] Eckart Zitzler, Marco Laumanns and Stefan Bleuler, "A Tutorial on Evolutionary Multiobjective Optimization," in Proceedings of the Workshop on Multiple Objective Metaheuristics (MOMH 2004), Heidelberg, Germany, 2004, Page 3 – 38.

[24] Darrell Whitley, Soraya Rana and Robert B. Heckendorn, "The Island Model Genetic Algorithm: On Separability, Population Size and Convergence," in Journal of Computing and Information Technology, Volume 7, Issue 1. Zagreb, Croatia: University Computing Centre, 1999, Page 33 – 47.

[25] Zbigniew Skolicki and Kenneth De Jong, "The Influence of Migration Sizes and Intervals on Island Models," in Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO 2005), Washington, DC, USA, 2005, Page 1295 – 1302.

[26] Heinz Muhlenbein, "Evolution in Time and Space – The Parallel Genetic Algorithm," in Proceedings of the 1st Workshop on Foundations of Genetic Algorithms (FOGA 1990), Indiana, IN, USA, 1990, Page 316 – 337.

[27] Darrell Whitley and Timothy Starkweather, "GENITOR II: A Distributed Genetic Algorithm," in Journal of Experimental & Theoretical Artificial Intelligence, Volume 2, Issue 3, 1990, Page 189 – 214.

[28] Nullsoft Scriptable Install System [Online], (2009). Available: http://nsis.sourceforge.net/Main_Page

[29] GUP for Win32 [Online], (2010). Available: http://gup-win32.tuxfamily.org/

[30] Shufen Zhang, Shuai Zhang, Xuebin Chen and Shangzhuo Wu, "Analysis and Research of Cloud Computing System Interface," in Proceedings of the 2nd International Conference on Future Networks (ICFN 2010), Sanya, China, 2010, Page 88 – 92.

[31] C.L. Chua, W.C. Sim, C.S. Choo and Victor Tay, "Automated Red Teaming: An Objective-Based Data Farming Approach For Red Teaming," in Proceedings of the 40th Conference on Winter Simulation (WSC 2008), Austin, TX, USA, 2008, Page 1456 – 1462.

[32] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal and T. Meyarivan, "A Fast Elitist Multi-Objective Genetic Algorithm: NSGA-II," in IEEE Transactions on Evolutionary Computation, Volume 6, Issue 2, 2002, Page 182 – 197.

[33] Abhishek Verma, Xavier Llora, Roy H. Campbell and David E. Goldberg, "Scaling Genetic Algorithms Using MapReduce," in Proceedings of the 9th International Conference on Intelligent Systems Design and Applications (ISDA 2009), Pisa, Italy, 2009, Page 13 – 18.

[34] R. Storn and K. Price, "Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces," in Journal of Global Optimization, Volume 11, Issue 4, 1997, Page 341 – 359.

[35] Jerome Boulon, Andy Konwinski, Runping Qi, Ariel Rabkin, Eric Yang and Mac Yang, "Chukwa: A Large-Scale Monitoring System," in Proceedings of Cloud Computing and its Applications 2008 (CCA 2008), Chicago, IL, USA, 2008.

[36] Cascading [Online], (2010). Available: http://www.cascading.org/

[37] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu and Geoffrey Fox, "Twister: A Runtime for Iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC 2010), Chicago, IL, USA, 2010, Page 810 – 818.

[38] Yahoo Distribution of Hadoop [Online], (2010). Available: http://yahoo.github.com/hadoop-common/

Appendix

Appendix A: List of Nodes in Hadoop Cluster @NTU

Web Server

| Clusters | Host Names | IP Addresses |
|---|---|---|
| Both Clusters | pdcc.ntu.edu.sg | 155.69.148.19 |

Dedicated Physical Computer: Core 2 Duo CPU 2.66 GHz + 2 GB RAM

| Clusters | Host Names | IP Addresses |
|---|---|---|
| Production Cluster | pdccrs-01 | 155.69.145.248 |

Non-Dedicated Physical Computers: Pentium D CPU 3.20 GHz + 2 GB RAM

| Clusters | Host Names | IP Addresses |
|---|---|---|
| Production Cluster | pdc1 | 155.69.151.85 |
| Production Cluster | pdc2 | 155.69.151.86 |
| Production Cluster | pdc3 | 155.69.151.87 |
| Production Cluster | pdc4 | 155.69.151.88 |
| Production Cluster | pdc5 | 155.69.151.89 |
| Production Cluster | pdc6 | 155.69.151.90 |
| Production Cluster | pdc7 | 155.69.151.91 |
| Production Cluster | pdc8 | 155.69.151.92 |
| Production Cluster | pdc9 | 155.69.151.93 |
| Production Cluster | pdc10 | 155.69.151.94 |
| Production Cluster | pdc11 | 155.69.151.95 |
| Production Cluster | pdc12 | 155.69.151.96 |
| Production Cluster | pdc13 | 155.69.151.97 |
| Production Cluster | pdc14 | 155.69.151.98 |
| Production Cluster | pdc15 | 155.69.151.99 |
| Production Cluster | pdc16 | 155.69.151.100 |
| Production Cluster | pdc17 | 155.69.151.101 |
| Production Cluster | pdc18 | 155.69.151.102 |
| Production Cluster | pdc19 | 155.69.151.103 |
| Production Cluster | pdc20 | 155.69.151.104 |
| Production Cluster | pdc21 | 155.69.151.105 |
| Production Cluster | pdc22 | 155.69.151.106 |
| Production Cluster | pdc23 | 155.69.151.107 |
| Production Cluster | pdc24 | 155.69.151.108 |

Non-Dedicated Physical Computers: Core 2 Quad CPU 2.66 GHz + 3 GB RAM

| Clusters | Host Names | IP Addresses |
|---|---|---|
| Production Cluster | pdc25 | 155.69.145.211 |
| Production Cluster | pdc26 | 155.69.145.212 |
| Production Cluster | pdc27 | 155.69.145.213 |
| Production Cluster | pdc28 | 155.69.145.214 |
| Production Cluster | pdc29 | 155.69.145.215 |
| Production Cluster | pdc33 | 155.69.145.228 |

Dedicated Virtual Machines: Xeon X5560 CPU 2.80 GHz + 3 GB RAM

| Clusters | Host Names | IP Addresses |
|---|---|---|
| Production Cluster | fypyyc1 | 155.69.102.245 |
| Production Cluster | fypyyc2 | 155.69.102.246 |
| Production Cluster | fypyyc3 | 155.69.102.247 |
| Production Cluster | fypyyc4 | 155.69.102.248 |
| Production Cluster | fypyyc5 | 155.69.102.249 |
| Production Cluster | fypyyc6 | 155.69.102.250 |
| Development Cluster | fypyyc7 | 155.69.102.251 |
| Development Cluster | fypyyc8 | 155.69.102.252 |
| Development Cluster | fypyyc9 | 155.69.102.253 |
| Development Cluster | fypyyc10 | 155.69.102.254 |

Appendix B: Past Problems in MRMANA & Solutions

[Table: past problems encountered in MRMANA and their solutions.]

Appendix C: Evolving Agent-Based Simulations in the Clouds

This paper was published in the Proceedings of the 3rd International Workshop on Advanced Computational Intelligence (IWACI 2010).


Evolving Agent-based Simulations in the Clouds

James Decraene, Yong Yong Cheng, Malcolm Yoke Hean Low Suiping Zhou, Wentong Cai and Chwee Seng Choo

Abstract— Evolving agent-based simulations enables one to automate the difficult iterative process of modeling complex adaptive systems to exhibit pre-specified/desired behaviors. Nevertheless this emerging technology, combining research advances in agent-based modeling/simulation and evolutionary computation, requires significant computing resources (i.e., high performance computing facilities) to evaluate simulation models across a large search space. Moreover, such experiments are typically conducted in an infrequent fashion and may occur when the computing facilities are not fully available. The user may thus be confronted with a computing budget limiting the use of these “evolvable simulation” techniques. We propose the use of the cloud computing paradigm to address these budget and flexibility issues. To assist this research, we utilize a modular evolutionary framework coined CASE (for complex adaptive system evolver) which is capable of evolving agent-based models using nature-inspired search algorithms. In this paper, we present an adaptation of this framework which supports the cloud computing paradigm. An example evolutionary experiment, which examines a simplified military scenario modeled with the agent-based simulation platform MANA, is presented. This experiment refers to Automated Red Teaming: a vulnerability assessment tool employed by defense analysts to study combat operations (which are regarded here as complex adaptive systems). The experimental results suggest promising research potential in exploiting the cloud computing paradigm to support computing intensive evolvable simulation experiments. Finally, we discuss an additional extension to our cloud computing compliant CASE in which we propose to incorporate a distributed evolutionary approach, e.g., the island-based model to further optimize the evolutionary search.

I. INTRODUCTION

EXAMINING complex adaptive systems (CAS) remains problematic as the traditional analytical and statistical modeling methods appear to limit the study of CAS [1]. To overcome these issues, Holland proposed the use of evolutionary agent-based simulations to examine the emergent and complicated phenomena characterizing CAS. In evolutionary agent-based simulations, multiple and interacting evolvable agents (e.g., neurones, traders, soldiers, etc.) determine, as a whole, the behavior of the system (e.g., brain, financial market, warfare, etc.). The evolution of agents is conducted through the use of evolutionary computation techniques (e.g., learning classifier systems, genetic programming, evolution strategies, etc.). The evolution of CAS can be

James Decraene, Yong Yong Cheng, Malcolm Yoke Hean Low, Suiping Zhou and Wentong Cai are with the Parallel and Distributed Computing Center at the School of Computer Engineering, Nanyang Technological University, Singapore (email: jdecraene@ntu.edu.sg). Chwee Seng Choo is with DSO National Laboratories, 20 Science Park Drive, Singapore. This R&D work was supported by the Defence Research and Technology Office, Ministry of Defence, Singapore under the EVOSIM Project (Evolutionary Computing Based Methodologies for Modeling, Simulation and Analysis).

driven to exhibit pre-specified and desired system behaviors (e.g., to identify critical conditions leading to the emergence of specific system-level phenomena such as a financial crisis or battlefield outcomes). Although this method appears to be satisfactory for studying CAS, it is limited by the requirement of significant computational resources. Indeed, in "evolvable simulation" experiments, many simulation models are iteratively generated and evaluated. Due to the stochastic nature of both evolutionary algorithms and agent-based simulations, experiment replications are also required to account for statistical fluctuations. As a result, the experimental process is computationally highly demanding. Moreover, such experiments are typically conducted occasionally, when the computing facilities may not be fully available. To address these computing budget issues, involving both scalability and flexibility constraints, we examine the cloud computing paradigm [2]. This distributed computing paradigm has recently been introduced to specifically address such computing budget issues, where large datasets and considerable computational requirements are dealt with. To assist this research, we propose to modify a modular evolutionary framework, coined CASE for "complex adaptive system evolver", to support cloud computing facilities. In the remainder of this paper, we first provide introductions to both evolutionary agent-based simulations and cloud computing. Following this, we present the CASE framework. The latter is then extended to support the cloud computing paradigm. A series of experiments is described to evaluate our cloud computing compliant framework in terms of scalability. The experiments involve a simplified military simulation which is modeled with the agent-based simulation platform MANA [3]. Finally, we discuss an additional extension to CASE which would incorporate a distributed evolutionary approach [4] to further optimize the search process.

II. EVOLUTIONARY AGENT-BASED SIMULATIONS

Agent-based systems (ABSs) are computational methods which can model the intricate and non-linear dynamics of complex adaptive systems. ABSs are commonly implemented with object-oriented programming environments in which agents are instantiations of object classes. ABSs typically involve a large number of autonomous agents which are executed in a concurrent or pseudo-concurrent manner (i.e., using a time-slicing algorithm). Each agent possesses its own distinct state variables, can be dynamically deleted and is capable of interacting with the other agents. The agents' computational methods may include stochastic processes resulting in a stochastic behavior at the system level. To study ABSs, the data-farming method was proposed as a means to identify the "landscape of possibilities" [5], i.e., the spectrum of possible simulation outcomes. In data farming experiments, specific simulation model parameters are selected and varied (according to pre-specified boundary values). This exploratory analysis of parameters enables one to examine the effects of the parameters over the simulation outcomes. Several techniques [6] have been introduced to reduce the search space, where each solution/design point is a distinct simulation model. The search space can be reduced even further when one is interested in a single (or target) system behavior. Evolutionary computation (EC) techniques can here be used to drive the generation/evaluation of simulation models. In this paper, we examine such an "objective-based data farming" approach using evolutionary agent-based simulations [7]. In evolutionary ABS, EC techniques are utilized to evolve simulation models to exhibit a desirable output/behavior. This method differs from simulation optimization techniques [8] as it relies on the simulation of autonomous and concurrent agents whose (inter)actions may include stochastic elements. Therefore, the evaluation of the simulation models is also stochastic by nature.
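As an illustration of the plain data-farming sweep described above, the following Ruby sketch enumerates design points over a coarse grid for two hypothetical behavioral parameters; the function name, ranges and step count are illustrative and not part of CASE:

```ruby
# Minimal data-farming sketch: enumerate design points over a grid of
# hypothetical agent parameters. Each combination is one candidate
# simulation model to be executed and analyzed.
def design_points(ranges, steps)
  grids = ranges.map do |lo, hi|
    (0...steps).map { |i| lo + (hi - lo) * i / (steps - 1.0) }
  end
  # Cartesian product of the per-parameter grids.
  grids.first.product(*grids[1..-1])
end

# Two illustrative parameters with their boundary values.
points = design_points([[-100, 100], [20, 100]], 5)
# 5 x 5 = 25 design points spanning this small "landscape of possibilities".
```

A real data-farming run would replace the grid with a space-filling design (e.g., the Latin hypercubes of [6]) to keep the number of design points tractable.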

III. CLOUD COMPUTING

Cloud computing [2] is a novel high performance computing (HPC) paradigm which has recently attracted considerable attention. The computing capabilities (i.e., compute and storage clouds) are typically provided as a service via the Internet. This web approach enables users to access HPC services without requiring expertise in the technology that supports them. In other words, the user does not need expertise in mainframe administration and maintenance, distributed systems, networking, etc. The key benefits of cloud computing are identified as follows:

Reduced Cost: Cloud computing infrastructures are provided by a third party and do not need to be purchased for potentially infrequent computing tasks. Users pay for the resources on a "utility" computing basis. This enables users with limited financial and computing resources to exploit high performance computing facilities (e.g., the Amazon Elastic Compute Cloud, the Sun Grid) without having to invest into personal and expensive computing facilities.

Scalability: Multiple computing "clouds" (which can be distant from each other) can be aggregated to form a single virtual entity enabling users to conduct very large scale experiments. The computing resources are dynamically provided and self-managed by the cloud computing server. Cloud computing is an HPC paradigm; in other words, it aims at enabling users to exploit large amounts of computing power in a short period of time (in minutes or hours). Thus, cloud computing differs from "High Throughput Computing" approaches, such as Condor [9]¹, which aim at provisioning large amounts of computing power over longer periods of time (in days or weeks).

One of the core technologies underlying cloud computing, enabling the above benefits, is the MapReduce programming model [11]. This model is composed of two distinct phases:

Map: The input data is partitioned into subsets and distributed across multiple compute nodes. The data subsets are processed in parallel by the different nodes. A set of intermediate files results from the Map phase and is processed during the Reduce phase.

Reduce: Multiple compute nodes process the intermediate files which are then collated to produce the output files. Similarly to the Map processes, the Reduce operations are distributed (and executed in parallel) over multiple compute nodes.

The relative simplicity of the MapReduce programming model facilitates the efficient parallel distribution of computationally expensive jobs. This parallelism also enables the recovery from failure during the operations (this is particularly relevant when considering a distributed environment where some nodes may fail during a run). Map/Reduce operations may be replicated (if a distinct operation fails, its replication is retrieved). Also, failed operations may automatically be rescheduled. These fault-tolerant features are inherent properties of cloud computing frameworks such as Apache Hadoop. Thus, the user is not required to handle such issues. We suggest that evolutionary agent-based simulations can be expressed as MapReduce computations, and consequently, may exploit the benefits provided by the cloud computing paradigm. In the next section, we briefly present some related studies which examined the combination of the MapReduce programming model with evolutionary algorithms.
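The evaluation of a batch of simulation models maps naturally onto these two phases. The Ruby sketch below is illustrative only: the "simulation" is a stub and the function names are ours, but it shows the shape of the computation (Map evaluates models independently; Reduce collates the intermediate pairs into an output):

```ruby
# Map: each compute node evaluates one simulation model and emits an
# intermediate (model_id, score) pair. The stub stands in for a real
# simulation engine run.
def map_phase(models)
  models.map { |id, params| [id, params.sum] } # stubbed "simulation"
end

# Reduce: collate the intermediate pairs into the output; here the
# best (lowest-cost) model is retrieved.
def reduce_phase(pairs)
  pairs.min_by { |_, score| score }
end

models = { "m1" => [3, 4], "m2" => [1, 1] }
best = reduce_phase(map_phase(models))
```

Because each Map invocation is independent, a failed or slow evaluation can simply be re-run on another node, which is the source of the fault-tolerance described above.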

IV. RELATED STUDIES

Recent studies have combined evolutionary computation and the MapReduce programming model. In [12], Jin et al. claimed that, as devised, the MapReduce model cannot directly support the implementation of parallel genetic algorithms (i.e., a specific island-based model). As a result, MapReduce was extended and included an additional Reduce process. The iterative cycle is as follows. During the Map phase, multiple instances of the genetic algorithms are executed in parallel. The local optimal solutions of each population are collected during the first Reduce phase. An additional collection and sorting of the local optimal solutions is conducted during the second Reduce phase. The resulting set of global optimal solutions is then utilized to initiate the next generation. Llora et al. [13] presented a different approach where several evolutionary algorithms were adapted to support the MapReduce model (in contrast with Jin et al. who adapted the MapReduce model and not the evolutionary algorithm

1 Note that Condor is being adapted to support cloud computing [10].

itself). The parallelization of the evolutionary algorithms was here conducted using a decentralized and distributed selection approach [14]. This method avoided the requirement of a second Reduce process (i.e., a single selection operation is conducted over the aggregation of the different pools of solutions).

The above studies provide guidance for translating evolutionary algorithms into MapReduce operations. The approach proposed by Llora et al. is further examined in Section VI. Note that, in contrast with Jin et al. and Llora et al.'s approaches, the "objective function" is here the simulation of stochastic agent-based models. The resolution (i.e., level of abstraction) of the simulations is the key factor (i.e., the bulk of the work) determining the computational requirements of the evolutionary experiments. In the next section, a description of the CASE framework is provided.

V. THE CASE FRAMEWORK

CASE is a recently developed framework which enables one to evolve simulation models using nature-inspired search algorithms. This system was constructed in a modular manner (using the Ruby programming language) to accommodate the user's specific requirements (e.g., use of different simulation engines or evolutionary algorithms, etc.). This framework can be regarded as a simplification of the Automated Red Teaming framework [15] which was developed by the DSO National Laboratories of Singapore. CASE is composed of three main components which are distinguished as follows:

1) The model generator: This component takes as inputs a base simulation model specified in the eXtensible Markup Language (XML) and a set of model specification text files. According to these inputs, novel XML simulation models are generated and sent to the simulation engine for evaluation. Thus, as currently devised, CASE only supports simulation models specified in XML. Moreover, the model generator may consider constraints over the evolvable parameters (this feature is optional). These constraints are specified in a text file by the user. These constraints (due for instance to interactions between evolvable simulation parameters) aim at increasing the plausibility of generated simulation models (e.g., through introducing cost trade-offs for specific parameter values).

2) The simulation engine: The set of XML simulation models is received and executed by the stochastic simulation engine. Each simulation model is replicated a number of times to account for statistical fluctuations. A set of result files detailing the outcomes of the simulations (in the form of numerical values for instance) is generated. These measurements are used to evaluate the generated models, i.e., these figures are the fitness (or "cost") values utilized by the evolutionary algorithm (EA) to direct the search.

3) The evolutionary algorithm: The set of simulation results and associated model specification files are received by the evolutionary algorithm, which, in turn, processes the results and produces a new "generation" of model specification files. The generation of these new model specifications is driven by the user-specified (multi)objectives (e.g., maximize/minimize some quantitative values capturing the target system behavior). The algorithm iteratively generates models which would incrementally, through the evolutionary search, best exhibit the desired outcome behavior. The model specification files are sent back to the model generator; this completes the search iteration. This component is the key module responsible for the automated analysis and modeling of simulations.

Communications between the three components are conducted via text files for simplicity and flexibility. Note that the flexible nature of CASE allows one to develop and integrate different simulation platforms (using models specified in XML) and search algorithms. In the next section, we propose a cloud computing compliant version of CASE.

VI. MAPREDUCE CASE

We present our adaptation of the CASE framework to support the MapReduce programming model. This adaptation is conducted using the Apache Hadoop framework which relies on the Map and Reduce functions devised in functional programming languages such as Lisp. During initialization, the CASE modules (simple Ruby scripts and the simulation engine executable) are sent to the compute nodes. Then, at each search iteration, only the model specification files are transmitted to the compute nodes, where, locally the generation and evaluation of simulation models are conducted. The motivation of this approach is to decrease the network traffic and distribute the computational effort (moving computation is cheaper than moving data). Also, note that only a single Reduce process is conducted to retrieve the intermediate result files. Future work will consider exploiting the Reduce phase through analyzing intermediate result files (to assist the evolutionary algorithm) using multiple compute nodes. This relatively straightforward implementation illustrates the simplicity of the MapReduce programming model.

VII. EXPERIMENT

We present an example experiment in which the CASE framework is utilized for Automated Red Teaming (ART), a simulation-based military methodology utilized to uncover weaknesses of operation plans. Here, "combat" is conceptually regarded as a complex adaptive system whose outcomes result from complex non-linear dynamics [16]. The agent-based simulation platform MANA [3], developed by the New Zealand Defence Technology Agency, is employed to model and perform the simulations.

A. Automated Red Teaming

Automated Red Teaming (ART) was originally proposed by the defense research community as a vulnerability assessment tool to automatically uncover critical weaknesses of operational plans [7]. Using this computer/simulation-based approach, defense analysts may subsequently resolve the identified tactical plan loopholes.

A stochastic agent-based simulation is typically used to model and simulate the behavioral and dynamical features of the environment/agents. The agents are specified with a set of properties which defines their intrinsic capabilities and personality, such as sensor range, fire range, movement range, communications range, aggressiveness, response to injured teammates and cohesion. A review of ABS systems applied to various military applications is provided by Cioppa et al. [17].

In ART experiments, a defensive Blue team (a set of agents) is subjected to repeated attacks, where multiple scenarios may be examined, from a belligerent Red team. Thus, ART aims at anticipating the adversary behaviour through the simulation of various potential scenarios.

B. Setting

A maritime anchorage protection scenario is examined. In this scenario, a Blue team (composed of 7 vessels) conducts patrols to protect an anchorage (in which 10 Green commercial vessels are anchored) against threats. Red forces (5 vessels) attempt to break Blue's defense strategy and inflict damage on the anchored vessels. The aim of the study is to discover Red strategies that are able to breach Blue's defensive tactic. We detail the model, evolutionary algorithm and cloud computing facilities utilized in the experiments:

The model: Figure 1 depicts the scenario which was modeled using the ABS platform MANA.


Fig. 1. MANA model of the maritime anchorage protection scenario adapted from [18]. The map covers an area of 100 by 50 nautical miles (1 nm = 1.852km). The dashed lines depict the patrolling paths of the different Blue vessels.

The Blue patrolling strategy is composed of two layers: an outer (with respect to the anchorage area, 30 by 10 nm) and an inner patrol. The outer patrol consists of four smaller but faster boats. They provide the first layer of defence whereas the larger and heavily armored ships inside the anchorage are the second defensive layer. In CASE, each candidate solution is represented by a vector of real values defining the different evolvable Red behavioral parameters (Table I). As the number of decision variables increases, the search space becomes significantly larger. According to the number of evolvable properties and associated ranges given for this experiment, the search space contains 1.007 distinct candidate solutions (i.e., variants of the original simulation model).

TABLE I
EVOLVABLE RED PARAMETERS

Red property                     Min       Max
Team 1 initial position (x,y)    (0,0)     (399,39)
Team 2 initial position (x,y)    (0,160)   (399,199)
Intermediate waypoints (x,y)     (0,40)    (399,159)
Team 1 final position (x,y)      (0,160)   (399,199)
Team 2 final position (x,y)      (0,0)     (399,39)
Aggressiveness                   -100      100
Cohesiveness                     -100      100
Determination                    20        100

The home and final positions together with the intermediate waypoints define the trajectory of each distinct Red vessel. Three of the Red crafts (Team 1) were set up to initiate their attack from the north while the remaining two (Team 2) attack from the south. This allows Red to perform a multi-directional attack on the anchorage. In addition, the final positions of the Red crafts are constrained to the opposite region (with respect to the initial area) to simulate escapes from the anchorage following successful attacks. Psychological elements are included in the decision variables to address the potential effects on the Red force. The aggressiveness determines the reaction of individual Red crafts upon detecting a Blue patrol. Cohesiveness influences the propensity of Red to maneuver as a group or not, whereas determination stands for Red's willingness to follow the defined trajectories. The Red crafts' aggressiveness against the Blue force is varied from unaggressive (-100) to very aggressive (100). Likewise, the cohesiveness of the Red crafts is varied from independent (-100) to very cohesive (100). Finally, a minimum value of 20 is set for determination to prevent inaction from occurring.
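A candidate solution of this kind can be sketched as a bounded vector of reals. Only the three scalar behavioral traits are shown here, with the ranges of Table I; the (x,y) waypoint pairs would follow the same clamp-to-range pattern, and the helper names are ours:

```ruby
# Bounds for the scalar evolvable Red traits (from Table I).
BOUNDS = {
  aggressiveness: [-100, 100],
  cohesiveness:   [-100, 100],
  determination:  [20, 100],   # floor of 20 prevents inaction
}

# Draw a random candidate within the bounds.
def random_candidate(rng = Random.new)
  BOUNDS.transform_values { |(lo, hi)| lo + rng.rand * (hi - lo) }
end

# Feasibility check used when mutation pushes a value out of range.
def valid?(cand)
  BOUNDS.all? { |k, (lo, hi)| cand[k].between?(lo, hi) }
end
```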

The evolutionary algorithm: The Non-dominated Sorting Genetic Algorithm II (NSGA-II) [19] is employed to conduct the evolutionary search using the parameter values listed in Table II:

TABLE II
EVOLUTIONARY ALGORITHM SETTING

Parameter                    Value
Population size              100
Number of search iterations  50
Mutation probability         0.1
Mutation index               20
Crossover rate               0.9
Crossover index              20

The NSGA-II population size and number of search iterations indicate that 5000 distinct MANA simulation models are generated and evaluated for each experimental run. Each individual simulation model is executed/replicated 30 times to account for statistical fluctuations. The efficiency of the algorithm is measured by the number of Green casualties with respect to the number of Red casualties. In other words, the objectives are:

To minimize the number of Green (commercial) vessels "alive".

To minimize the number of Red casualties.
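With both objectives minimized, NSGA-II ranks candidates by Pareto dominance. A minimal sketch of the objective vector and the dominance test (the casualty counts and function names are illustrative):

```ruby
# The two Red objectives, both minimized.
def objectives(green_alive, red_casualties)
  [green_alive, red_casualties]
end

# a dominates b iff a is no worse in every objective and better in one.
def dominates?(a, b)
  a.zip(b).all? { |x, y| x <= y } && a != b
end

a = objectives(2, 1)   # 8 of the 10 Green vessels sunk, 1 Red craft lost
b = objectives(5, 1)   # only 5 Green sunk at the same Red cost
# a dominates b: fewer Green survivors at no extra Red cost.
```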

The cloud computing facilities: The cloud computing cluster is composed of 30 laboratory workstations located at the Parallel and Distributed Computing Center, Nanyang Technological University. Note that the hardware of the workstations may vary from each other, thus a heterogeneous environment is considered. Moreover, as these workstations may also occasionally be utilized by students, the performance of the workstations may also be affected during experiments. This exemplifies the hazards (e.g., a student may reboot a compute node) that may occur in a distributed environment. We purposely utilize such a computing environment to test the fault-tolerant features of Hadoop.

C. Results

Figure 2 presents the running times of two experiments where we incrementally increased the number of available compute nodes. In the first experimental run, a relatively fast version of the simulation model is employed (requiring 5 seconds to execute 30 replications on a compute node). In the second case, the model execution time is increased from 5 to 90 seconds to reflect real-life military simulation models, which typically require such an amount of time.

It can be observed that as the number of available compute nodes increases, the time required to perform the experiment decreases accordingly. Nevertheless, we note that this relationship (i.e., number of nodes/time) is not exactly scalable (most remarkably when the number of compute nodes is higher than 10) in the first model, whereas in the second experimental run, the running time scales with the number of utilized compute nodes. The results suggest that, according to the execution time of the simulation model, an "optimal" (from a computing cost point of view) number of compute nodes exists.

A number of issues causing overheads were identified:

1) The iterative nature of the evolutionary algorithm requires the synchronization of the search iterations. As a result, compute nodes equipped with a relatively slower CPU (or having a higher computational load due to external factors such as students using the computer) may cause a delay.

2) Delays may also occur due to network traffic. The latter may lead the model evaluations to occur with differing start times (this issue may thus aggravate the previous one).
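A rough back-of-the-envelope model makes the synchronization overhead concrete: since each search iteration ends with a barrier, the slowest node sets the iteration time. All figures below are illustrative, not measured values from the experiments:

```ruby
# Per-iteration time under a synchronization barrier: models are dealt
# evenly to the nodes and the barrier waits for the slowest node, so a
# single loaded workstation delays the whole iteration.
def iteration_time(model_secs, models, slowdowns)
  per_node = (models.to_f / slowdowns.size).ceil
  slowdowns.map { |s| per_node * model_secs * s }.max
end

uniform = iteration_time(90, 100, [1.0] * 10)         # all nodes nominal
skewed  = iteration_time(90, 100, [1.0] * 9 + [2.0])  # one node at half speed
# One half-speed node doubles the iteration time for everyone.
```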

[Figure 2 charts omitted.]

Fig. 2. Running times of MapReduce CASE experiments with an increasing number of compute nodes, using fast (top) and slow (bottom) variants of the base simulation model. Time (hours) is plotted against the number of distributed compute nodes.

Future work will consider the utilization of an asynchronous model, accounting for a heterogeneous computing environment, to resolve the above issues. Also, note that some experiments were conducted while laboratory demonstrations were occurring. Nevertheless, no significant deteriorations upon the experiments were observed (apart from the occasional slowdown of some model evaluations). All experiments were thus successfully achieved using this heterogeneous and relatively hazardous computing environment. This supports the robustness qualities of the cloud computing paradigm. In the next section, we discuss the integration of distributed evolutionary computation techniques within our CASE MapReduce model.

VIII. FUTURE WORK

Our simplistic adaptation of CASE did not exploit some features (e.g., the shuffling process, multiple Reduce processes) of the MapReduce model. We discuss future directions, examining distributed evolutionary computation, which may potentially address this deficit:

Island-based model: The island-based model [4] is a popular and efficient way to implement evolutionary algorithms on distributed systems. In this model, each compute node executes an independent evolutionary algorithm over its own sub-population. The nodes work in consort by periodically exchanging solutions in a process called "migration". It has been reported that such models often exhibit better search performance in terms of both accuracy and speed. This approach may thus further optimize the evolutionary search given a limited computing budget. We may, for instance, devise Reduce processes that would carry out the computations required during the migrations (e.g., selection of the most promising solutions to be transferred).

Self-adaptive mechanisms: Similarly to the parameter setting of evolutionary algorithms, the performance of distributed evolutionary approaches may vary according to the specific migration scheme employed. Numerous parameters (as mentioned above) are to be pre-specified by the user and ultimately determine the efficiency of the distributed evolutionary search. This parameter tuning process is thus a critical step which typically requires series of preliminary experiments to identify a satisfactory set of parameter values. Consequently, running such preliminary experiments conflicts with our intention to resolve computing budget issues. Recent studies [20], [21] have addressed this issue where self-adaptive methods are used to automate this parameter tuning process. We suggest that these computations may be expressed as Reduce processes.

The above directions are currently being investigated using our seminal work on combining CASE and the MapReduce model.
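A ring-topology migration step, one common instantiation of the island-based model, can be sketched as follows (the population contents, replacement policy and topology are illustrative choices, not the paper's design):

```ruby
# Ring migration: each island sends its best (lowest-cost) individual
# to its neighbor, which replaces its own worst individual with it.
def migrate(islands)
  best = islands.map { |pop| pop.min }
  islands.each_with_index.map do |pop, i|
    incoming = best[(i - 1) % islands.size]  # ring neighbor's champion
    (pop - [pop.max]) + [incoming]           # drop local worst, adopt it
  end
end

# Three islands holding (cost-valued) individuals; minimize cost.
islands = [[5, 9], [1, 7], [3, 8]]
islands = migrate(islands)
```

In a MapReduce setting, each island would evolve inside a Map task, while a Reduce process carries out the exchange shown in `migrate`.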

IX. CONCLUSION

We first briefly presented the fields of evolutionary agent-based simulations and cloud computing. To date, the work reported here is among the very first attempts to combine evolutionary agent-based simulations with the MapReduce programming model. To assist this research, we utilized the modular evolutionary framework CASE. The latter was adapted to support the MapReduce model. To test our novel framework, we presented an evolutionary experiment which involved Automated Red Teaming, a method originating from the defense research community where warfare is conceptually regarded as a complex adaptive system. The experimental results demonstrated the benefits of the MapReduce approach in terms of both scalability and robustness. Finally, we discussed a future research direction in which self-adaptive distributed evolutionary algorithms are considered to further optimize the evolutionary search.

ACKNOWLEDGMENTS

We would like to thank the following organizations that helped make this R&D work possible:

Defence Research and Technology Office, Ministry of Defence, Singapore, for sponsoring the Evolutionary Computing Based Methodologies for Modeling, Simulation and Analysis project, which is part of the Defence Innovative Research Programme FY08.

Defence Technology Agency, New Zealand Defence Force, for sharing the Agent Based Model, MANA.

Parallel and Distributed Computing Center, School of Computer Engineering, Nanyang Technological University, Singapore.

DSO National Laboratories, Singapore.

REFERENCES

[1] J. Holland, "Studying complex adaptive systems," Journal of Systems Science and Complexity, vol. 19, no. 1, pp. 1-8, 2006.
[2] A. Weiss, "Computing in the Clouds," netWorker, vol. 11, no. 4, pp. 16-25, 2007.
[3] M. Lauren and R. Stephen, "Map-aware Non-uniform Automata (MANA): A New Zealand Approach to Scenario Modelling," Journal of Battlefield Technology, vol. 5, pp. 27-31, 2002.
[4] E. Cantu-Paz, Efficient and Accurate Parallel Genetic Algorithms. Kluwer Academic Pub, 2000.
[5] P. Barry and M. Koehler, "Simulation in Context: Using Data Farming for Decision Support," in Proceedings of the 36th Winter Simulation Conference, 2004, pp. 814-819.
[6] T. Cioppa and T. Lucas, "Efficient Nearly Orthogonal and Space-filling Latin Hypercubes," Technometrics, vol. 49, no. 1, pp. 45-55, 2007.
[7] C. Chua, C. Sim, C. Choo, and V. Tay, "Automated Red Teaming: an Objective-based Data Farming Approach for Red Teaming," in Proceedings of the 40th Winter Simulation Conference, 2008, pp. 1456-1462.
[8] S. Olafsson and J. Kim, "Simulation Optimization," in Proceedings of the 34th Winter Simulation Conference, vol. 1, 2002, pp. 79-84.
[9] M. Litzkow, M. Livny, and M. Mutka, "Condor: a Hunter of Idle Workstations," in Proceedings of the 8th International Conference on Distributed Computing Systems, 1988, pp. 104-111.
[10] D. Thain and C. Moretti, "Abstractions for Cloud Computing with Condor," in Cloud Computing and Software Services, S. Ahson and M. Ilyas, Eds. CRC Press, 2010, to appear.
[11] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[12] C. Jin, C. Vecchiola, and R. Buyya, "MRPGA: An Extension of MapReduce for Parallelizing Genetic Algorithms," in ESCIENCE '08: Proceedings of the 2008 Fourth IEEE International Conference on eScience. Washington, DC, USA: IEEE Computer Society, 2008, pp. 214-221.
[13] X. Llora, A. Verma, R. Campbell, and D. Goldberg, "When Huge Is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing," Parallel and Distributed Computational Intelligence, pp. 11-41, 2010.
[14] K. De Jong and J. Sarma, "On Decentralizing Selection Algorithms," in Proceedings of the Sixth International Conference on Genetic Algorithms, 1995, pp. 17-23.
[15] C. S. Choo, C. L. Chua, and S.-H. V. Tay, "Automated Red Teaming: a Proposed Framework for Military Application," in Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation. New York, NY, USA: ACM, 2007, pp. 1936-1942.
[16] A. Ilachinski, Artificial War: Multiagent-based Simulation of Combat. World Scientific Pub Co Inc, 2004.
[17] T. Cioppa, T. Lucas, and S. Sanchez, "Military Applications of Agent-based Simulations," in Proceedings of the 36th Winter Simulation Conference, 2004, pp. 171-180.
[18] M. Low, M. Chandramohan, and C. Choo, "Multi-Objective Bee Colony Optimization Algorithm to Automated Red Teaming," in Proceedings of the 41st Winter Simulation Conference, 2009, pp. 1798-1808.
[19] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan, "A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimization: NSGA-II," Lecture Notes in Computer Science, pp. 849-858, 2000.
[20] K. Srinivasa, K. Venugopal, and L. Patnaik, "A Self-adaptive Migration Model Genetic Algorithm for Data Mining Applications," Information Sciences, vol. 177, no. 20, pp. 4295-4313, 2007.
[21] C. Leon, G. Miranda, and C. Segura, "A Memetic Algorithm and a Parallel Hyperheuristic Island-based Model for a 2D Packing Problem," in GECCO '09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. New York, NY, USA: ACM, 2009, pp. 1371-1378.