
Large-scale Elastic Data Processing in Micro-cloud Environments based on StreamMine3G

Pawel Skorupinski June 16, 2013


Abstract. Large-scale processing of data in company-owned data centers as well as in the cloud are two well-established approaches; however, both have drawbacks when a data-as-a-service concept is to be introduced. Therefore, a novel data-as-a-service model based on Micro-clouds was presented [lea]. It gives companies that cannot afford to extract, store, or process all the data in house the possibility of querying public data resources. Characteristic of a Micro-cloud environment is that data is distributed over small, geographically distributed, inhomogeneous data centers. That imposes the need for a new approach to operator placement, one that is highly aware of execution costs, which may be influenced by transfer costs over WAN, low bandwidths on links, and the limited safety of Micro-cloud locations. In this thesis, different algorithms searching for an optimal operator placement are presented. It is shown that relaxing the problem to a linear programming (LP) problem with an awareness of data processing topologies leads to good placements. Furthermore, the ideas of defining the problem more precisely as a mixed integer linear programming (MILP) problem, as well as possibilities of using metaheuristics, are tentatively analyzed and described.

Contents
1 Introduction
2 Background
  2.1 Concepts about a Data Processing Framework
    2.1.1 Web Crawling
    2.1.2 Storing Data
    2.1.3 Data Processing
  2.2 Background of Micro-cloud Environment Design
    2.2.1 Micro-clouds and Data Centers Concept
    2.2.2 Micro-clouds and Cloud Computing Concept
3 Algorithms Description
  3.1 General Factors to Be Considered
  3.2 General Formulation of a Price-aware Operator Placement Problem (OPP) for the Micro-cloud Environment
    3.2.1 Additional Constraint Regarding Operator Placement Problem
  3.3 Possible Topologies for Data Processing
  3.4 Greedy Approach for an Operator Placement Problem
  3.5 All in One Micro-cloud Approach for Solving an Operator Placement Problem
  3.6 Approach Based on the Simplex Algorithm for Operator Placement Problem Solving
    3.6.1 Transportation Problem [Chu]
    3.6.2 Connections between Operator Placement Problem, Transportation Problem and Linear Programming
    3.6.3 Reducing an Operator Placement Problem to a Transportation Problem
    3.6.4 Constraints
  3.7 Approach for Choosing Hosts for Processing in Destination Micro-clouds
4 System Design
  4.1 Persistent Model of the System
    4.1.1 Model of the Physical System
    4.1.2 Pricing Profiles
    4.1.3 Data Sources
    4.1.4 Profiles of Worker Algorithms
    4.1.5 Queries Waiting for an Execution and Information on the System State
  4.2 Data Sources
    4.2.1 mongoDB - a Technology for Historic Data Sources
    4.2.2 Live Data Sources
  4.3 StreamMine3G Platform - a Technology for Event Processing [sm3]
    4.3.1 Accessop implementation
    4.3.2 Mapper, Workerop and Partitioner implementation
    4.3.3 Implementation of the manager
  4.4 Design and Implementation of the Tasks Scheduler
    4.4.1 General Architectural Approach
    4.4.2 Component Model of the Scheduler
    4.4.3 Data Flow within the System
  4.5 Implementation of the Placement Algorithms
    4.5.1 Implementation of the All in One Micro-cloud Approach
    4.5.2 Implementation of an Approach Based on the Simplex Algorithm
    4.5.3 Implementation of Algorithms for a Solution Normalization and Choosing Hosts for Processing
5 Evaluation
  5.1 A Simulation Environment
    5.1.1 Design of a Simulation Environment
    5.1.2 Simulation of the Designed System
  5.2 On Approximating the Price and the Time of the Solution Execution
    5.2.1 Analysis of Sources
    5.2.2 Analysis of Connections
    5.2.3 Analysis of Destinations
    5.2.4 Calculations of the Solution's Time
    5.2.5 Calculations of the Solution's Price
  5.3 Evaluations of Positioning Algorithms
    5.3.1 Analysis of Simplex Algorithm Constraints in a Simple Mathematical Model
    5.3.2 Tests of the System Implementation
    5.3.3 Summary of the Measurements
    5.3.4 Evaluations of Algorithms Based on Measurements
6 Future Work and Conclusion
  6.1 Future Work
    6.1.1 Using Metaheuristics for Complex Constraints
    6.1.2 Extended Mathematical Model with Mixed Integer Linear Programming
    6.1.3 Modeling Solutions for Queries with More Levels of Worker Operators
    6.1.4 Even Transfer Distribution over Connections
  6.2 Conclusion

Introduction

A framework that gives constant access to the public resources of the World Wide Web is a very powerful tool. It allows analyses to be run on information shared on a daily basis by more than one third of the global population [int13]. Nowadays, this wealth is available only to computer science giants like Google or Yahoo. Since the main goal of those companies is to provide World Wide Web search engines and the technologies around them, that power is the essence of their existence.

There are, however, companies that would like to be able to analyze the petabytes of data accessible on the World Wide Web without spending the effort and money on building and maintaining an extremely expensive processing environment. They would like to have simple access to all that data based on a data as a service (DaaS) paradigm. Data as a service is a concept of providing data on demand to the user regardless of the geographic or organizational separation of provider and consumer [daa]. The biggest advantage of the paradigm is that the costs of maintaining and processing data are distributed between all of the customers. The data demanded by a customer can be historic data kept in persistent storage or data retrieved live from external sources.

Such a paradigm would be a great opportunity for companies that sell around the world. They have to put a big effort into choosing the right marketing strategies in order to derive benefit, and there is no better resource of data on how those strategies work than the World Wide Web. For sports companies like Adidas, for example, the sponsored athletes need to be followed to see how they are perceived by potential clients; Internet resources could give instant feedback from all around the world in such a domain.

Data as a service makes it possible for single companies to avoid the big maintenance costs. To further reduce the costs of data processing in general, a novel approach that reduces the total cost of system maintenance is considered: data is to be stored and processed in small, geographically distributed data centers, called Micro-clouds. The new paradigm comes together with new challenges, such as dealing with inhomogeneity, the high distribution of the system, and low bandwidths. Therefore fault-tolerant, time- and price-aware components need to be used in the system to provide access to data and to execute computations on it.

The focus of the thesis is to present algorithms that find optimal solutions for operator placements over the system nodes and that are aware of the specifics of the Micro-cloud environment. Those specifics were analyzed during the prototypical implementation of a data processing framework that could become the core of the software in Micro-cloud environments.

The thesis is structured as follows. In Chapter 2, the background of the topic is given and the most important concepts and technologies are explained. In Chapter 3, algorithms to solve the problem of operator placement inside the Micro-cloud environment are described. In Chapter 4, the essential elements of the system design and implementation are explained. In Chapter 5, the simulation environment setup and the algorithms evaluating solutions are explained; then, the measurements on the algorithms are described. Chapter 6 contains ideas for further development of placement algorithms and concludes the thesis.

Background

This Chapter gives a background of the topic of the thesis.

2.1 Concepts about a Data Processing Framework

There are many concepts and technologies that need to be considered when building a large-scale elastic live and historic data processing framework. The questions that need to be answered are: How to retrieve World Wide Web data constantly into the system? How to store data inside of the system? How to process live as well as historic data? The possible answers to those questions are analyzed below.


2.1.1 Web Crawling

Web crawling is the systematic and automatic browsing of World Wide Web data [web]. Web crawlers would provide the streams of live data that could be accessed from inside the system. In principle, Web crawlers work as follows. To start their work, they require a list of URLs to visit. They recognize the hyperlinks in every visited page, thereby finding the paths to further sites of the Web. Web crawlers need to be aware of the facts that a lot of WWW data gets updated and removed very quickly, and that the same content is often represented by many URLs. Therefore, policies define their behavior. That includes rules on which pages to download, which pages to revisit, or how to coordinate the work of distributed Web crawlers. There are many projects providing Web crawling functionality. One of them, available under the Apache license, is Apache Nutch. It is highly scalable (up to 100 machines) and feature rich [nut].

2.1.2 Storing Data

In the created data processing framework there is a need for convenient retrieval of historic data. This is possible through a distributed storage technology. There are several approaches for storing vast amounts of data in a distributed manner. They all should provide strategies obeying fault tolerance policies, like replication or equal distribution. The first big group are so-called distributed document-oriented databases. They belong to the family of NoSQL solutions. These systems are designed around the notion of a schema-free, self-contained document. Every document in the system is identified by a unique key and fully describes itself. Although no schema is defined on their content, documents can be looked up based on it. That differentiates the model from a key-value store, where only the key gives access to its values (the difference disappears when key-value stores enable secondary indexing). Documents are often encoded in formats like JSON or XML that can represent structured data [dod] [dod10] [int]. There is also a group of distributed systems that provide access to data spread over machines as if it were stored on a local file system. They are called distributed file systems. They encapsulate functionalities characteristic of file systems, like a hierarchy of directories and access permissions for users [dfs] [hdf11].

2.1.3 Data Processing

To process large-scale data in a distributed manner, a special programming model is needed. MapReduce is a paradigm that allows massive scalability across a huge number of machines. What makes this model so convenient is that processing is split into two jobs. The first job is called map. Its role is to take an original set of data and convert it into a set of tuples. The next job, called reduce, takes the set of tuples from the first job and reduces it to a smaller number of tuples. That division of work allows high parallelism inside the system, as both map and reduce jobs can be distributed over multiple nodes [wir] [ibm]. An example of how a word counting algorithm works with MapReduce is presented below.
1. Original set of data: We are who we are
2. Set of tuples after a map job: (We, 1) (are, 1) (who, 1) (we, 1) (are, 1)
3. Outcome of a reduce job: (we, 2) (are, 2) (who, 1)
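As an illustration of the paradigm only (this is a minimal sketch, not the thesis implementation; the function names and types are chosen just for this example), the two jobs for word counting could look as follows:

#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map job: split the input text into (word, 1) tuples
// (a real mapper would also normalize case, e.g. "We" -> "we")
std::vector<std::pair<std::string, int>> mapJob(const std::string& text) {
    std::vector<std::pair<std::string, int>> tuples;
    std::istringstream in(text);
    std::string word;
    while (in >> word) tuples.emplace_back(word, 1);
    return tuples;
}

// reduce job: sum the counts of equal words
std::map<std::string, int> reduceJob(const std::vector<std::pair<std::string, int>>& tuples) {
    std::map<std::string, int> counts;
    for (const auto& t : tuples) counts[t.first] += t.second;
    return counts;
}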

2.1.3.1 Approaches for Processing


There are two basic approaches to data processing. One of them is batch processing. This is the process of analyzing a huge amount of data at once, without any manual intervention. It is meant to work on historic data [bat]. An example of a distributed batch processing engine is Apache Hadoop, based on the MapReduce processing model. The data of the big files that need to be processed can be divided between map jobs and then sent on to reduce jobs [had]. A typical Hadoop job takes hours and is run on dozens of machines. There can be one job run per input directory [Kle10]. Another approach is dealing with potentially infinite streams of data coming live into the system during its execution. In the standard scenario, the goal of the stream processing engine is to identify the meaningful events within those streams and to employ techniques on them such as detection of complex patterns of many events, event correlation and abstraction, event hierarchies, and relationships between events [esp]. The scenario needed for the system is specific. Here both streams of data and the data of big files need to be processed in the same manner. In other words, historic data processed by MapReduce jobs may also be enriched by live streams of data. Of course, there should be no limit on the jobs defined on the data. Event stream processing engines with a MapReduce interface exist, examples of which are StreamMine3G or Hadoop Online Prototype [hop].

2.2 Background of Micro-cloud Environment Design

A Micro-cloud environment is a group of Micro-clouds connected to and cooperating with each other. Each of those Micro-clouds contains a group of nodes that can store data and host operators during query execution. The nodes are physically grouped into racks, and every Micro-cloud can consist of a few racks. The design of the Micro-cloud environment is based on paradigms taken from the data center and cloud computing concepts. The similarities as well as the differences between the concepts are presented in this chapter.

2.2.1 Micro-clouds and Data Centers Concept

Micro-clouds differ from the standard data center approach by offering better data distribution as well as green computing opportunities. They were designed with an awareness of the disadvantages of data centers:
- High percentage of energy waste
- Poor distribution
Data centers use a lot of energy. In 2010 it was already 1.3% of the energy consumption in the world (and about 2% in the USA) [Koo11]. The biggest problem is how much of this energy is actually wasted. According to some research, only 6-12% of the electricity powering servers in data centers performs computations [Gla12]. Much of the rest is used for cooling the devices and their surroundings so that they do not get overheated. Because of their size, data centers are typically poorly distributed. Consequently, it is often the case that they are geographically far from the original data sources and potential clients.
The Micro-cloud is a new proposal created with an awareness of the drawbacks of the data center paradigm. The proposal assumes that Micro-clouds would be small data centers, containing a much smaller number of nodes and racks. They could potentially be placed in households. Placing small data centers inside houses could be pro-environmental in a few ways:
- Heat produced by processing nodes, instead of being cooled away, could be used for heating the households. That is a double advantage concerning electricity usage, as less of it would be spent on cooling the machines as well as on heating the house
- System nodes would be placed in already existing houses, so there would be no need to build new huge data center buildings


The Micro-cloud approach also comes with big advantages concerning the Data as a Service paradigm. Quality of Service could potentially be increased, as the system would be better geographically spread (data would be closer to clients). It would also assure a better bandwidth distribution, as data coming from external sources and reaching external destinations (clients) would normally be divided between a lot of Micro-clouds. The Micro-cloud approach entails new challenges because of some of its characteristics. Since the number of Micro-clouds in a potential environment would be big, it is going to be much more inhomogeneous than the standard approach. Micro-clouds will have nodes with various hardware, which means various values of parameters such as computation power. Some Micro-clouds are going to be connected by low-bandwidth links, therefore awareness of data transfers needs to be injected into the algorithms. Some of them would also be placed in locations that are out of full control; physical safety would therefore need to be substituted with security on the software level.

2.2.2 Micro-clouds and Cloud Computing Concept

Basically, the concept of cloud computing means performing distributed computing over a network, on many connected machines at the same time [ccw]. It gives a solution to a fundamental problem of IT companies, which is how to increase capacity and extend capabilities without investing in new infrastructure. Moreover, cloud computing enables such extensions on the fly [EK]. There are a few models of how cloud computing services are offered to clients by providers [ccw]. According to the infrastructure as a service (IaaS) model, clients get access to virtual machines and other resources, like storage or network. The platform as a service (PaaS) model is already on an operating system abstraction level - clients get access to a platform consisting of execution environments, a web server, and a database. Software as a service (SaaS) provides access to applications and databases.
All of the classical cloud computing models share one property concerning data processing: the processing needs to be preceded by a transfer of the data to the cloud. The data as a service model does not fit any of those models, as it provides shared data accessible for processing by cloud customers.
There are characteristics that a Micro-cloud environment providing data as a service would share with other cloud computing providers. One of them is pricing profiles. Amazon EC2 is an infrastructure as a service cloud provider that has very complex pricing profiles. Costs are defined for various virtual machine instances as well as for transfers inside the cloud and with external sources. There are also special pricing profiles for so-called spot instances. These are basically computing capacities that can be exchanged between clients [ec2]. Their profiles change over time and have characteristic peaks, meaning that the computing price suddenly grows for a short time. That happens when demand starts to exceed supply. Such profiles would also characterize data processing queries inside a Micro-cloud environment. Their structure could be influenced by at least two factors: the current heating demands of the household where the Micro-cloud is located, and the current exploitation of the Micro-cloud. The prices could be defined for processing on nodes as well as for transfers over WAN.

Algorithms Description

A Micro-cloud environment needs a novel approach to finding solutions for operator placement. There are a few reasons for that, as well as a few factors that should be considered while looking for a solution. They are explained in this chapter, followed by a general formulation of the price-aware operator placement problem for Micro-cloud environments. The algorithms explained in the thesis aim at finding optimal solutions for a specific topology of how operators send data streams between each other. Therefore, before the algorithms are mathematically described, different topologies and their possible impact on a placement strategy are explained.

3.1 General Factors to Be Considered

In normal cases, every portion of data inside the system is replicated a few times. Consequently, for any portion of data there is always a choice of where to take it from. Generally, depending on the characteristics of the Micro-cloud environment, there are a few factors that could influence the way replicas are chosen:
- Lowering the price of the execution
- Lowering the time of the execution
- Reducing transfers through public networks
- Choosing sources that are closer to a client

The first three factors depend on each other, therefore it is hard to consider them as independent facets of the algorithms. Instead, an algorithm can be aware of one of them, which with high probability makes it indirectly aware of the other factors. Because of the characteristics of the Micro-cloud environment and the StreamMine3G architecture, feedback may occur. For example, it may seem favorable from a price-awareness point of view to move the operators processing data out of an expensive Micro-cloud into another one, even if the sources have to stay in the expensive one. However, this could raise the time of the execution (because of bandwidth limits) and increase transfers through wide area networks. And since it will take longer to transfer data to the processing operators, the access operators will also work slower, so their execution will cost more money. The fourth, geographic factor is omitted in the algorithm implementations. It is assumed that data is located close to the client by default; therefore, in most situations this factor would not be important.

3.2 General Formulation of a Price-aware Operator Placement Problem (OPP) for the Micro-cloud Environment

As the price is said to be the most important criterion when making a decision on a placement, the operator placement problem considered in the thesis focuses on minimizing it. The formulation of the OPP presented in this chapter is on an abstract level; additional technical constraints that should be taken into account are presented after the abstract description.
The objective of the operator placement problem is to determine which hosts should be chosen to execute the operators accessing the data of every key, as well as those executing the operators working on that data, and which connection paths should be chosen between them, so as to minimize the total cost of processing with a given algorithm. It is assumed that access operators can be placed only on hosts from which they can retrieve data locally.
Let K be the set of all keys k whose data needs to be accessed and retrieved to solve the query, and H the set of all hosts h that can be sources of at least one of the keys during the execution time of the query. Let a source s be an element of a set S, defined as a pair of a key (representing the data of this source) and a host from which it can be retrieved. Every key and every host can be seen as a set of sources; hence

\[ \forall s \in S\ \big(\exists k \in K\ (s \in k)\ \wedge\ \exists h \in H\ (s \in h)\big). \]

\sigma(s) will generally denote the amount of data to be retrieved by the source. Every algorithm for processing the data has a specific topology that is needed to properly process the data. Assuming that there are m host-key pairs s, n hosts able to run a processing operator w_a on the first destination level, and so on, a solution topology can generally be presented as in Figure 1. Let it be assumed that a destination d of any level represents a system node that is able to process data, during the execution time of the query, with the type of algorithm demanded for its level. \sigma(d) will generally denote the amount of data to be processed by the destination.


Figure 1: General formulation of a topology of sources and a few levels of destinations to process data in a demanded way (a source level s_1 ... s_m with access operators; a destination level d_1 ... d_n running worker operators of type w_a; a further destination level d_1 ... d_p running worker operators of type w_b; each path between consecutive levels carries an amount x at a unit cost c)

For every source and every destination of every level except the last one, there exists a communication path between it and every node of the next level. Each of those paths has a cost c. For the paths to nodes of the first destination level, c equals the sum of the costs of retrieving a unit of data on a source, transferring it and processing it on a destination. For every subsequent level it equals the sum of the costs of transferring a unit of data and processing it on a destination of that level. For every communication path, the amount of data x to be transferred through it needs to be defined. By defining all of the x values, a placement for all sources and all processing operators is defined and consequently the OPP is solved. Assigning \sigma(k_i) to the size of the key k_i, there are the equations

\[ \forall k \in K:\ \sum_{s \in k} \sigma(s) = \sigma(k), \qquad \sum_{j=1}^{n} x_{ij} = \sigma(s_i), \quad 1 \le i \le m. \]

Those equations guarantee that all of the data of the keys is going to be transported to destinations. On the first destination level and every other level, for a node d_j there is the equation

\[ \sum_{i=1}^{m} x_{ij} = \sigma(d_j) = \rho_{w_a} \sum_{k=1}^{p} x_{jk}, \quad 1 \le j \le n, \]

where \rho_{w_a} is a data size reduction coefficient for the algorithm w_a used on that destination level; it determines how many times data is typically reduced by that algorithm. The price minimization function for the topology of Figure 1 would look as follows:

\[ \min z = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij} + \sum_{j=1}^{n} \sum_{k=1}^{p} c_{jk} x_{jk} + \ldots \]

Clearly, no negative commodities are to be transported on the paths: x_{ij} \ge 0, x_{jk} \ge 0, \ldots, for 1 \le i \le m, 1 \le j \le n, 1 \le k \le p, \ldots
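For illustration only (the numbers below are invented for this example and do not come from the thesis), consider a single key k with \sigma(k) = 10 MB, two host-key pairs s_1 and s_2 holding replicas of k, a single destination d_1 and no further levels. With unit costs c_{11} = 0.02 and c_{21} = 0.05 per MB, the program reduces to

\[ \min z = 0.02\,x_{11} + 0.05\,x_{21} \quad \text{s.t.} \quad x_{11} = \sigma(s_1),\; x_{21} = \sigma(s_2),\; \sigma(s_1) + \sigma(s_2) = 10,\; x_{11}, x_{21} \ge 0, \]

whose optimum x_{11} = 10, x_{21} = 0, z = 0.2 simply routes the whole key through its cheaper replica. The additional constraints introduced in Section 3.6.4 (for example the per-host retrieval limit) are what can force the load to be spread in less trivial cases.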


Figure 2: StreamMine operators in the system (accessop, mapper, workerop)

3.2.1 Additional Constraint Regarding Operator Placement Problem

There are three main constraints on solutions of the operator placement problem that determine the feasibility of the found solutions. The reasons why they should be taken into account are listed below:
1. Time (processing speed limits) - directing data streams from many sources to one destination can cause the destination to process the data more slowly than it arrives and therefore increase the total execution price
2. Rules of partitioning data between processing operator partitions - the definitions of operators and data partitioners in the processing engine determine how the data is split between partitions of operators, as well as whether some fixed number of partitions is defined for an operator. Only an awareness of this can lead to finding optimal placements
3. Non-linear splitting of transfer between the sources of a key - there might be data sources that allow only discrete splitting of source data between nodes (e.g. mongoDB and real-time data sources belong to them). Algorithms unaware of that fact will normally only be able to find solutions that approximate the optimal solution

3.3 Possible Topologies for Data Processing

A topology describes how the operators are connected to each other in order to solve a query within a data processing system. Different operator topologies might be needed depending on the operation chosen to process the data. Indeed, it could also be possible to execute the same operation using various topologies - it would depend on which programming model is used. A prototypical framework was prepared to solve a word count algorithm based on the MapReduce paradigm. Hence, three operators are needed in the topology. The first is an accessop - a universal operator accessing the data of historic and live sources. Next in the topology is a mapper and finally a workerop, playing the role of the reducer. For every source, one accessop needs to be deployed. The architecture of connections between operators is shown in Figure 2. The mapper and workerop operators should be partitioned into slices to increase the efficiency of the execution. Moreover, special logic needs to be employed in the way data is partitioned between them. The first important thing is that there is always one mapper slice per accessop instance. Data is partitioned so that only the co-located slice receives all of the data from the access operator. The partitioner of the data heading to workerop slices guarantees that data is split as equally as possible between them. That achieves the situation where the data flow between slices looks as presented in Figure 3.

Figure 3: Distribution of StreamMine3G operator slices and the logic of data flow between them

Topology-awareness in this context means that an algorithm is aware of the fact that every host chosen as a location of an accessop slice and a mapper slice is going to send data portions to every host chosen as a location of a workerop slice.

3.4 Greedy Approach for an Operator Placement Problem

The first algorithm implemented is a greedy approach. Its characteristics are:
- For every key, exactly one source will always be chosen
- workerop partitions will always be placed on the same nodes where the accessops are chosen to be placed
The strategy of the algorithm is to try to put operators in a low number of Micro-clouds but on various hosts (to avoid overly long reading from disk). That makes the algorithm somewhat time-aware and hence indirectly price-aware. Since this is a greedy approach, the decision on choosing the hosts is made only once for every key - there is no feedback. There are two ways to choose hosts for keys that do not have replicas in already chosen Micro-clouds. One is price-aware - the currently cheapest host is chosen. The other - the price-oblivious one - chooses a random host. A minimal sketch of this strategy is given below.
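The following sketch illustrates the greedy selection only; the types and the replica map are assumptions standing in for the thesis's persistent model, and it omits the refinement of spreading the load over various hosts within a Micro-cloud as well as the workerop co-location step.

#include <map>
#include <set>
#include <string>
#include <vector>

struct Host { std::string id; std::string microCloud; double currentPrice; };

// Greedy choice of one source host per key, preferring Micro-clouds already in use.
// `replicas` maps every key to the hosts holding a replica of it (assumed non-empty).
std::vector<Host> greedyPlacement(const std::vector<std::string>& keys,
                                  const std::map<std::string, std::vector<Host>>& replicas,
                                  bool priceAware) {
    std::vector<Host> chosen;
    std::set<std::string> usedClouds;
    for (const auto& key : keys) {
        const std::vector<Host>& candidates = replicas.at(key);
        const Host* best = nullptr;
        // 1) prefer a host inside a Micro-cloud that was already chosen for another key
        for (const auto& h : candidates)
            if (usedClouds.count(h.microCloud) && (!best || h.currentPrice < best->currentPrice))
                best = &h;
        // 2) otherwise take the currently cheapest host (price-aware)
        //    or simply the first candidate (price-oblivious; a random pick in the thesis)
        if (!best) {
            best = &candidates.front();
            if (priceAware)
                for (const auto& h : candidates)
                    if (h.currentPrice < best->currentPrice) best = &h;
        }
        chosen.push_back(*best);
        usedClouds.insert(best->microCloud);
    }
    return chosen;
}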

3.5 All in One Micro-cloud Approach for Solving an Operator Placement Problem

This approach is based on the assumptions that:
- When data is smartly distributed over Micro-clouds, it will often be the case that all the queried data can be retrieved from sources placed in one Micro-cloud
- It will often be viable (at least from a time-consumption point of view) to retrieve all the data from sources placed in one Micro-cloud
Since these assumptions sound reliable, an effort was made to implement this solution. Though it is theoretically price-unaware, the time savings caused by the fact that there is no WAN transfer can lead to cost savings as well. The algorithm searches through the hosts and checks whether there is a combination of them that are all placed in the same Micro-cloud and together contain all of the keys.


\[ \exists m \in M\ \forall k \in K\ \exists s\ \exists h\ \big(s \in k \wedge s \in h \wedge h \in m\big), \]

where m is a Micro-cloud (a set of hosts), M the set of all Micro-clouds, k a key (the set of sources that contain its data), K the set of all keys that are to be processed, s a source, and h a host (the set of sources placed on it). Every such combination found will be checked.
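A sketch of this coverage check is shown below; the index structure (Micro-cloud name mapped to the set of keys replicated inside it) is an assumption made for the example, not the thesis's data model.

#include <map>
#include <set>
#include <string>
#include <vector>

// For every Micro-cloud, the set of keys that have at least one replica on one of its hosts.
using CloudKeyIndex = std::map<std::string, std::set<std::string>>;

// Return the Micro-clouds that alone contain replicas of all requested keys.
std::vector<std::string> cloudsCoveringAllKeys(const CloudKeyIndex& index,
                                               const std::set<std::string>& requestedKeys) {
    std::vector<std::string> result;
    for (const auto& entry : index) {
        bool coversAll = true;
        for (const auto& key : requestedKeys)
            if (!entry.second.count(key)) { coversAll = false; break; }
        if (coversAll) result.push_back(entry.first);  // candidate for an all-in-one placement
    }
    return result;
}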

3.6 Approach Based on the Simplex Algorithm for Operator Placement Problem Solving

This approach is based on the fact that price minimization is the main goal of the algorithms. Therefore, an effort is made to reduce the problem to the transportation problem. It is proven that the simplex algorithm can find the optimal solution of the transportation problem [Chu]. To understand the similarities between the OPP and the TP, the transportation problem will first be presented. Then, the ways of simplifying the OPP will be described. Finally, the process of reduction will be shown.

3.6.1 Transportation Problem [Chu]

The transportation model determines a minimum-cost plan for transporting a commodity from a number of sources to a number of destinations. It is provable that an optimal feasible solution can always be found with the simplex algorithm. Let there be m sources that produce the commodity for n destinations. At the i-th source (i = 1, 2, ..., m) there are s_i units of the commodity available. The demand at the j-th destination (j = 1, 2, ..., n) is denoted by d_j. The cost of transporting one unit of the commodity from the i-th source to the j-th destination is c_{ij}. Let x_{ij} (1 \le i \le m, 1 \le j \le n) be the amount of the commodity transported from the i-th source to the j-th destination. The problem is to determine the values of x_{ij} that minimize the total cost of transporting all commodities from sources to destinations.

The commodities transported from the i-th source have to be equal to the amount of commodity available at the i-th source:

\[ \sum_{j=1}^{n} x_{ij} = s_i, \quad 1 \le i \le m, \]

and the commodities transported to the j-th destination have to be equal to the j-th destination's demand:

\[ \sum_{i=1}^{m} x_{ij} = d_j, \quad 1 \le j \le n. \]

Naturally, the total supply must be equal to the total demand:

\[ \sum_{i=1}^{m} s_i = \sum_{j=1}^{n} d_j. \]

The minimization function of the overall transportation cost looks as follows:

\[ \min z = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij}. \]

Clearly, no negative commodities are to be transported on the paths: x_{ij} \ge 0, 1 \le i \le m, 1 \le j \le n.
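The sketch below is not taken from the thesis; it only shows how the supplies, demands and costs of such a transportation instance could be laid out as the standard LP data (min c^T x subject to A x = b, x >= 0) that a simplex solver consumes.

#include <cstddef>
#include <vector>

// Dense LP data for min c^T x  s.t.  A x = b, x >= 0, with x laid out row-major:
// x[i*n + j] is the amount shipped from source i to destination j.
struct LinearProgram {
    std::vector<double> c;               // m*n objective coefficients
    std::vector<std::vector<double>> A;  // (m+n) x (m*n) equality constraints
    std::vector<double> b;               // m supplies followed by n demands
};

LinearProgram buildTransportationLP(const std::vector<double>& supply,              // s_i
                                    const std::vector<double>& demand,              // d_j
                                    const std::vector<std::vector<double>>& cost) { // c_ij
    const std::size_t m = supply.size(), n = demand.size();
    LinearProgram lp;
    lp.c.resize(m * n);
    lp.A.assign(m + n, std::vector<double>(m * n, 0.0));
    lp.b.resize(m + n);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            lp.c[i * n + j] = cost[i][j];
            lp.A[i][i * n + j] = 1.0;      // row i:   sum_j x_ij = s_i
            lp.A[m + j][i * n + j] = 1.0;  // row m+j: sum_i x_ij = d_j
        }
    for (std::size_t i = 0; i < m; ++i) lp.b[i] = supply[i];
    for (std::size_t j = 0; j < n; ++j) lp.b[m + j] = demand[j];
    return lp;  // to be handed to any simplex implementation
}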

3.6.2 Connections between Operator Placement Problem, Transportation Problem and Linear Programming

The operator placement problem has a lot in common with the transportation problem. It is even more similar to it than to the warehouse location problem [SPB77], because the choice of communication paths influences the price at the sources, so they cannot be treated separately. The main difference is that specific goods are available at chosen sources (data identified by keys) and each of those goods has to be fully shipped (its transportation may be split between sources). This difference can easily be defined as a constraint for the simplex algorithm.
It is not the original version of the OPP that is presented as a transportation problem but a simplified version of it. There is always only one level of destinations considered. The destinations are not the hosts that can process the data but groups of hosts of similar characteristics, namely a common Micro-cloud. This transformation is made in order to reduce the complexity of the linear programming problem that is going to be solved by the simplex algorithm.
In the operator placement problem, as in the transportation problem, there is no special unit of transported data. Even if the data of the sources is stored in bigger chunks, it is loosely splittable between connections when it is sent to destinations. Therefore the basis of the problem is considered to be linear programming, not integer programming. Additionally, the OPP has to deal with three further constraints. An effort is made to present the time constraint and the rules-of-partitioning constraint as additional constraints for the simplex algorithm. The non-linear splitting of transfer between the sources of a key is supposed to be taken care of by later conversion algorithms.

3.6.3 Reducing an Operator Placement Problem to a Transportation Problem

During this process, concepts of the operator placement problem are represented as concepts of the transportation problem. Values known from the database and from previous calculations, as well as approximations, are used to represent the variables of the transportation problem. Let the minimization function z be the total price of the execution. The sources s_i of the transportation problem will be all potential sources of data in the OPP, i.e. host-key pairs. As in the first step we are looking for a solution on the Micro-cloud level, the destinations d_j will be all Micro-clouds within the system.


The transportation cost c_{ij} will be the sum of the prices of data retrieval at s_i, data processing at d_j and data transfer from s_i to d_j. The route commodity x_{ij} will be the number of megabytes to be sent on the path between s_i and d_j.

3.6.3.1 Counting the cost

Let us say that P_{ij} is the sum of the prices of execution at s_i (P_{s_i}) and at d_j (P_{d_j}), and of the transfer between them, P_{t_{ij}}:

\[ P_{ij} = P_{s_i} + P_{d_j} + P_{t_{ij}}. \]

The cost c_{ij} of the transfer on every path is proportional to P_{ij}, but since c_{ij} has to be a price per megabyte of data, the equation for the cost on a path between s_i and d_j looks as follows:

\[ c_{ij} = \frac{P_{ij}}{q(s_i)}, \]

where q(s_i) is a method giving the size of the data identified by the key belonging to the key-host pair s_i. The equation for the price P_{s_i} goes as follows:

\[ P_{s_i} = T_{s_i} \cdot \frac{\int_{t_0}^{t_0+T_{exec}} pcp_{M_{s_i}}(x)\,dx}{T_{exec}}, \]

where t_0 is the point in time when the query processing begins, T_{exec} is the approximated execution time of the processing, T_{s_i} is the duration of the key data retrieval for this source, M_{s_i} is the Micro-cloud where s_i is placed, and pcp_M is the processing cost profile of that Micro-cloud. The time of the execution at s_i can be presented as shown below:

\[ T_{s_i} = \max\left( \frac{q(s_i)\, f_{vm(s_i)}}{V_{st(s_i)}},\ \frac{q(s_i)}{V_{t_{i1}}},\ \ldots,\ \frac{q(s_i)}{V_{t_{im}}} \right), \]

where st(s_i) is the source type of this source, V_{st(\cdot)} is the speed with which data is retrieved for this source type on a standard virtual machine, V_t are the bandwidths of the output connections, vm(s_i) is the VM type of this source and f_{vm(\cdot)} is the acceleration factor between this and a standard VM type. The equation for the price P_{t_{ij}}:

\[ M_{s_i} = M_{d_j} \Rightarrow P_{t_{ij}} = 0, \]
\[ M_{s_i} \neq M_{d_j} \Rightarrow P_{t_{ij}} = q(s_i) \cdot \frac{\int_{t_0}^{t_0+T_{exec}} ocp_{M_{s_i}}(x)\,dx + \int_{t_0}^{t_0+T_{exec}} icp_{M_{d_j}}(x)\,dx}{T_{exec}}. \]

When a source and a destination are in the same Micro-cloud, this cost is equal to zero. When they are in different ones, we need to look at ocp_M(\cdot), the output cost profile of the source Micro-cloud, and icp_M(\cdot), the input cost profile of the destination Micro-cloud, and multiply the average price during the expected system execution time by the number of megabytes that would be transferred on this path. The equation for the price P_{d_j} is as follows:

\[ P_{d_j} = T_{d_j} \cdot \frac{\int_{t_0}^{t_0+T_{exec}} pcp_{M_{d_j}}(x)\,dx}{T_{exec}}. \]

The time of the execution at d_j for the connection with source s_i:

\[ T_{d_j} = \max\left( T_{s_i},\ \frac{q(s_i)\, f_{vm(d_j)}}{V_{wt(d_j)}} \right). \]

When counting the time of the execution, the execution time of the source should be considered as well, because it was previously correlated with the bandwidths of a transportation path; if they are lower than the speed of processing at the destination, the time of the execution at d_j will depend on them.

3.6.4 Constraints
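A sketch of how these quantities could be evaluated in code is shown below. The profile representation (a cyclic list of hourly prices), the unit assumptions (processing prices per hour, transfer prices per GB, times in hours) and all helper names are assumptions made for this example, not the thesis implementation.

#include <vector>

// A pricing profile sampled per hour; treated as cyclic over its length.
struct PriceProfile { std::vector<double> hourlyPrice; };

// Average value of a profile over the expected execution window [t0, t0 + execHours).
double averagePrice(const PriceProfile& p, int t0Hour, int execHours) {
    double sum = 0.0;
    for (int h = 0; h < execHours; ++h)
        sum += p.hourlyPrice[(t0Hour + h) % p.hourlyPrice.size()];
    return sum / execHours;
}

// c_ij = (P_si + P_dj + P_tij) / q(s_i), following the formulas above.
// retrievalTime / processingTime play the roles of T_si and T_dj (in hours).
double pathCost(double dataMB, double retrievalTime, double processingTime,
                double avgSrcProcPrice, double avgDstProcPrice,
                double avgOutPrice, double avgInPrice, bool sameMicroCloud) {
    const double pSrc = retrievalTime * avgSrcProcPrice;
    const double pDst = processingTime * avgDstProcPrice;
    const double pTransfer = sameMicroCloud ? 0.0 : dataMB / 1024.0 * (avgOutPrice + avgInPrice);
    return (pSrc + pDst + pTransfer) / dataMB;  // price per megabyte on this path
}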

To let an algorithm find the optimal solution of a problem that is as similar as possible to the specific operator placement problem, constraints on the commodities have to be defined as well.

3.6.4.1 Constraint K: Sources of a key have to produce data of the size of the data identified by that key

This constraint is already defined in the general formulation of the OPP. If K is a set of host-key pairs with the same key, the sum of the data they retrieve has to be equal to the size of the data of that key:

\[ K = \{s_0, s_1, \ldots, s_n\} \Rightarrow \sum_{i=0}^{n} \sigma(s_i) = q(s_0) \]

(all pairs in K share the same key, so q(s_0) = \cdots = q(s_n)).

3.6.4.2 Constraint D: Time (processing speed) limits inside a Micro-cloud

This is one of the three basic technical constraints of the OPP for Micro-clouds. It defines a limit on how much data can come into a destination Micro-cloud. Let Q_0 be the total size of all the keys from the client's request, vmcnt_M the number of VMs within the cloud that are going to be free at the time, and W the approximate number of workerops to be deployed:

\[ d_j \le Q_0 \cdot \frac{vmcnt_{M_{d_j}}}{W}. \]

Let us say that vmcnt_{M_{d_j}} = W/2. That would mean that only half of the demanded workerops can be deployed in this Micro-cloud; in consequence, we let only half of the whole data flow into this Micro-cloud.

3.6.4.3 Constraint T: Rules of partitioning data between operator partitions

Constraint T is the one that injects topology-awareness into the problem. It is needed due to the logic behind the tested processing algorithm, word count. In practice, the data flow heading from source to worker nodes is spread equally between all of the destinations. If K is the set of all keys to be retrieved, D is the set of all destinations, \sigma(k) is the size of a key, and x_{kd} is the transfer size of key k data to a destination d:

\[ \forall k_1 \in K\ \forall k_2 \in K\ \forall d \in D:\ \frac{x_{k_1 d}}{\sigma(k_1)} = \frac{x_{k_2 d}}{\sigma(k_2)}. \]

3.6.4.4 Constraint L: Limit of the retrieval size for every host

If H is a set of host-key pairs for the same host, the sum of the sizes of the data identified by all the keys of those host-key pairs cannot be greater than the parameter Q_{max}, the maximum size of data to be retrieved from one host:

\[ H = \{s_0, s_1, \ldots, s_n\} \Rightarrow \sum_{i=0}^{n} q(s_i) \le Q_{max}. \]
q (si ) Qmax .

3.7 Approach for Choosing Hosts for Processing in Destination Micro-clouds

The all in one Micro-cloud approach as well as the approach based on the simplex algorithm end up choosing sources and destinations with the granularity of a Micro-cloud. Also, the information given as the output of an algorithm might not be precise enough to be translated correctly into an input of the processing engine. Therefore an algorithm for normalizing a solution and choosing hosts for processing needed to be specified. The subsequent actions of this algorithm are the following (a sketch of the worker-count estimate used in steps 2 and 3 is given after the list):
1. Normalization of unsplittable key sources - previous algorithms might decide that some data retrieval is split between a few hosts. However, some of the data sources might have unsplittable keys. Therefore, an additional algorithm needs to be run to normalize the solution by choosing one host for data retrieval in those cases
2. Counting the number of workers needed in the system - the source host placement is analyzed to find how many hosts are needed in the system to process the data without delays
3. Counting the number of workers needed per Micro-cloud - based on the previous information and the outcomes of the placement algorithms, the number of hosts to run processing in every Micro-cloud can be found
4. Normalization of the solution by conforming sources, keys and transfers to how they will look during execution - the whole solution is normalized according to the topology properties defined in the processing engine for the type of processing specified in the query by the client
5. Choosing processing hosts in every Micro-cloud - at this point of the algorithm the destination hosts that are going to run the processing can be chosen
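One plausible way of estimating the worker counts in steps 2 and 3 is sketched below; the rate-based model and the proportional split are assumptions made for illustration and are not taken from the thesis.

#include <cmath>
#include <map>
#include <string>

// Estimate how many workerop slices keep up with the aggregate rate at which the
// chosen sources emit data (hypothetical model: rates in MB/s).
int workersNeededInSystem(double totalSourceRateMBps, double perWorkerRateMBps) {
    return static_cast<int>(std::ceil(totalSourceRateMBps / perWorkerRateMBps));
}

// Split the total worker count over Micro-clouds proportionally to the share of
// data that the placement algorithm directed into each of them.
std::map<std::string, int> workersPerMicroCloud(const std::map<std::string, double>& dataShare,
                                                int totalWorkers) {
    std::map<std::string, int> result;
    for (const auto& entry : dataShare)  // entry.second in [0, 1], shares sum to 1
        result[entry.first] = static_cast<int>(std::ceil(entry.second * totalWorkers));
    return result;
}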

System Design

To analyze the placement problem deeply, as well as to test the placement algorithms in an environment as similar as possible to a true Micro-cloud environment, a whole processing framework extended by scheduling components containing the placement algorithms needed to be designed and implemented from scratch. The fundamental technologies needed to be chosen: MongoDB was chosen as the technology for data storage and StreamMine3G was chosen to work as the processing engine in the system (the reasons for those choices are explained later in this chapter). Also, a lot of effort was put into creating a persistent model of the system, so that it would represent the real system reliably - that was essential for the evaluation of the operator placement algorithms. The system was implemented so that the logic is divided into loosely coupled components that can easily be reused and extended in future work on Micro-cloud environments. The key elements of the system design are presented in this chapter together with explanations of the choices.


4.1 Persistent Model of the System

During the implementation, a persistent model of the environment was designed. It is compatible with the principal assumptions of the Micro-cloud environment. A short overview of the solutions used in the model is presented in this chapter.

4.1.1 Model of the Physical System

In the model, the physical entities need to be represented. In the Micro-cloud environment, there would be three levels of abstraction regarding the physical structure of the system. The highest level of abstraction would be a view of all of the Micro-clouds within the system. They would have attributes like a name, a host address and a geographical location. For every Micro-cloud, there would also be a network profile defined. It would specify the transfer speeds between the nodes inside the Micro-cloud, as well as the output and input bandwidth. Every Micro-cloud would consist of one or more racks. A view of all of the racks in the system would be the next abstraction level. The main purpose of racks would be to group the physical machines that work as hosts in the system and are therefore able to retrieve and process data. A view of all of the hosts would be the lowest abstraction level. Every host would have its URL (the datum indispensable for establishing connections with the services running on it) and attributes defining its physical characteristics. They would determine the disk read speed and a computation factor (meaning the computation speed of this host compared to some standard instance). Worth mentioning, the host entity represents the VM as well as its physical host - no differentiation is made between those two in the model. A sketch of how this model could be expressed as data structures follows.
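A minimal sketch of these three abstraction levels as plain data structures, with field names chosen only for this example:

#include <string>
#include <vector>

struct NetworkProfile {
    double internalMBps;   // transfer speed between nodes inside the Micro-cloud
    double inputMBps;      // input bandwidth of the Micro-cloud
    double outputMBps;     // output bandwidth of the Micro-cloud
};

struct Host {
    std::string url;           // needed to reach services running on the host
    double diskReadMBps;       // disk read speed
    double computationFactor;  // speed relative to a standard instance
};

struct Rack { std::vector<Host> hosts; };

struct MicroCloud {
    std::string name;
    std::string hostAddress;
    std::string geoLocation;
    NetworkProfile network;
    std::vector<Rack> racks;   // every Micro-cloud consists of one or more racks
};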

4.1.2 Pricing Profiles

Every Micro-cloud in the environment would have its pricing profile defined. For every period of time (it could be an hour or even a day), the profile would determine the price (per GB) of the input and output traffic as well as the price of usage (execution) of the hosts within the Micro-cloud.

4.1.3 Data Sources

Information on every external data source that the system is supposed to use to retrieve data has to have its representation in the persistent model. The main attribute of a data source is its name - this is the identifier used to define queries on it. Every data source would have its main instance listening on some host on some port. Moreover, for every data source there would be a fixed port defined on which it would listen on all of the hosts where it is deployed. Obviously, a set of technical characteristics would need to be defined for every data source: the name of the collection (database etc.) that stores the data, as well as information about the type (historic / real-time), the specific technology, the version and the expected data transfer. Data sources that do not have an internal system of data-to-host mappings need to have the hosts running their services explicitly defined. In the created system, this concerns real-time data sources.

4.1.4 Profiles of Worker Algorithms

In the prepared model, it is assumed that it is possible to determine a reliable processing speed for every algorithm that could be used in data processing, for every number of slices of the worker operator - excluding bandwidth limits on connections.

4.1.5 Queries Waiting for an Execution and Information on the System State

After the optimal placement is found, every query to be executed is stored together with the time at which its execution should start. The persistent model also stores information about the system state, with data about the hosts that are currently retrieving or processing data, as well as information about the hosts that are going to be used for processing in the future.

4.2 Data Sources

4.2.1 mongoDB - a Technology for Historic Data Sources

mongoDB was chosen as the technology for storing historic data within the system. The decision was preceded by an analysis of two scalable, high-performance storage systems - mongoDB and the Hadoop Distributed File System (HDFS). In the abstract, they are based on quite different paradigms: mongoDB is a NoSQL document-oriented database, while HDFS is a distributed file system. However, mongoDB comes with a lightweight extension, GridFS, that provides an abstraction layer of a distributed file system over the database. Therefore, both technologies can be used in a very similar manner; however, a number of differences between them makes mongoDB fit the specific architecture of Micro-cloud environments better. Below, the characteristics of mongoDB (specifically GridFS) are presented, together with a comparison with the characteristics of HDFS. Moreover, their benefits and drawbacks concerning the specific architecture are stated.

4.2.1.1 File Storage

To provide an abstraction layer of file storage in mongoDB, the GridFS extension is used. This is a specification for storing and retrieving files of a size over 16 MB. The logic behind it (and what actually makes that abstraction layer very simple) is that a file is simply stored as a group of chunks of a fixed size. GridFS stores files in two collections: one is a collection of chunks, the second keeps metadata about the files (and is replicated separately from the chunks collection) [mdba]. It looks different in HDFS. Although the files are divided internally into blocks, externally they are seen as one big stream of data; the chosen number of bytes can be read at the chosen offset [hdfa]. When reading from mongoDB, the whole document/chunk of data needs to be read into memory.

4.2.1.2 Replication

In mongoDB, replica sets are defined manually. That means every replica set has a given definition containing the nodes that will store its data. Consequently, if one file is placed on the same node as another file, they are placed together on all of the nodes of the replica set [mdbc]. Every replica set has one primary and many secondary members. The primary replica is used for writing. The choice of the source for reading depends on the read preference - it can be based either on member type (only primary, secondaries preferred etc.) or on geographical location (nearest). Since the role of the algorithm implemented during the thesis is to choose the replica hosted in the best place and to place an operator there, a strategy that allows reading from any replica was chosen [mdbb]. The approach used in mongoDB is clearly different from how HDFS works. There, no fixed replica sets are specified. Instead, the client sets a replication factor and assigns data-storing nodes to racks. The replication algorithm in the Hadoop file system is rack-aware. There is a rule for storing the first three replicas: the first is stored on some node, the second on a node in another rack, and the third in the rack of the first replica. All the other ones are stored randomly [hdfb].

4.2.1.3 Sharding

Since data replication in mongoDB always takes place between strictly defined nodes, data partitioning is needed to allow many replica sets and still keep them as one system. Therefore shards are introduced [mdbd]. Sharding partitions a collection to store portions of data in different replica sets. It takes care of an even distribution of data over machines (shard balancing). Shard keys need to be defined by picking fields of the documents stored in the database. Because of the way horizontal scalability is injected into HDFS, there is no place for anything like sharding; the system automatically scales out when new nodes are added.

4.2.1.4 MongoDB, HDFS and the specifics of the Micro-cloud environment

During a run of the placement algorithm, the nodes which should host operator slices are found. It is the system's assumption that source operators are always co-located with sources. Therefore it was important to look for a solution that lets the file system client run by the operator access the local data directly, without connecting to the main process (through the network). It was verified that such a solution works for mongoDB, as every mongoDB daemon process (mongod) gives the same access to the data as the mongoDB shard routing service process (mongos). A small difference appears when a client is about to read a GridFS file (file names are not replicated together with the data), therefore operators are given a file's id instead of a file's name as input. It was not checked whether the same solution is possible with HDFS. The main problem with HDFS appears when considering the architecture of the Micro-cloud environment. Since there are many data centers over which data should be replicated, more than one replication level is needed. One could consider bringing rack-awareness to the level of Micro-clouds and skipping awareness on the rack level. But Micro-clouds need special policies that are connected with geographical distribution in regard to the clients. MongoDB, enabling manual replica set creation together with sharding using shard tags (specific ranges of a shard key can be associated with a specific subset of shards [mdbe]), is much more convenient and hence much easier to fit into the specifics of the Micro-cloud environment.

4.2.2 Live Data Sources

Live (real-time) data sources are the second group of sources that can be used to retrieve data. By definition, the data retrieved by them is sent directly to be processed and is not identified by any keys. Examples of such data sources could be Web crawlers or instances receiving data in real time from smart grids. Within the system, an exemplary live data source was implemented. It is a process with two child threads. One of them is the connection server: it listens on a port and accepts connections; new connections are added to a list of sockets that is shared between the threads. The second thread is the responder: in every iteration, it takes all of the sockets on the list of connections and sends a new chunk of data to each of them. The implementation of that source is not prepared for high loads - it was made for test purposes only. Nevertheless, the current architecture of the system assumes that there would in fact always be only one client connecting to the socket, moreover on the same machine. A sketch of such a two-thread source is given below.
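A minimal sketch of such a source using POSIX sockets and std::thread; the port number and payload are placeholders, and error handling and socket cleanup are omitted. This illustrates the two-thread structure only and is not the thesis's implementation.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::vector<int> gClients;   // sockets shared between the two threads
std::mutex gClientsMutex;

// Thread 1: accept incoming connections and add them to the shared list.
void connectionServer(int listenFd) {
    for (;;) {
        int client = accept(listenFd, nullptr, nullptr);
        if (client < 0) continue;
        std::lock_guard<std::mutex> lock(gClientsMutex);
        gClients.push_back(client);
    }
}

// Thread 2: periodically push a new chunk of data to every connected client.
void responder() {
    for (;;) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        const std::string chunk = "example live data chunk\n";   // test payload only
        std::lock_guard<std::mutex> lock(gClientsMutex);
        for (int client : gClients)
            send(client, chunk.data(), chunk.size(), 0);
    }
}

int main() {
    int listenFd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9999);                 // port chosen for this example
    bind(listenFd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listenFd, 16);
    std::thread server(connectionServer, listenFd);
    std::thread resp(responder);
    server.join();
    resp.join();
    return 0;
}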

4.3 StreamMine3G Platform - a Technology for Event Processing [sm3]

StreamMine3G was chosen as the processing engine for event streams within the system. It has an open logic for running either continuous or batch processing, and it can also work according to the MapReduce model. Therefore, it is a good choice for the system, where the processing engine has to deal with high loads of data coming from both historic and live data sources. StreamMine3G is an event processing engine designed for high scalability, elasticity and fault tolerance. It can be run as a cluster of nodes, each of which has to run StreamMine3G as well as ZooKeeper. ZooKeeper is a centralized service that takes care of maintaining the current configuration of the cluster, moreover providing naming and distributed synchronization [zk]. There are two types of nodes within StreamMine3G - master and worker. There should be one master node hosting the manager. Its role is to conduct the jobs over the worker nodes by deploying operators on them and taking care of removals when a job is done.


Operators within the system can either receive data from external data sources or from other operators. Based on that, they are called either access operators (in short: accessop) or worker operators (workerop). All of the data that goes through the StreamMine3G cluster should be considered as streams - unbounded sequences of tuples/events. Following the potential use cases of the system, these could be streams of World Wide Web pages. StreamMine3G allows any topologies to be defined. Every operator can have a few upstream operators and a few downstream operators. Upstream operators send events to the operator; downstream operators receive events from it. Topologies are defined by the manager. Every operator can consist of a number of slices. Slice deployment can be seen as a physical mapping between operators and cluster nodes. However, a slice is not only a deployment of an operator on a node but also a partition of the operator. That means it can be defined which data heading to the operator reaches exactly that partition of it. The components taking care of forwarding data portions to the proper slices are called partitioners.

4.3.1 Accessop implementation

The role of accessop is to provide uniform access to all of the external event sources that the system is supposed to use.

4.3.1.1 MongoDB data source adapter

The MongoDB data source adapter provides access to the processes of the mongoDB system through the GridFS abstraction level, modified to let the process communicate directly with the local node, skipping the connection with the mongos process. An accessop reading data from a mongoDB data source always has a fixed list of files (or parts of files) to be read defined at its input.

Changes in GridFS implementation  There is a GridFS implementation that comes together with the mongoDB library. Since it targets the scenario where the client connects to the mongos process, it did not fully fit the needs of applications running on every StreamMine3G node. The original GridFS requires a file name to be given at the input, but the system is built in such a way that the collection with file metadata is not accessible on every node (actually only on the nodes of one shard). Therefore, the code was changed so that files are found not by file name but only by file id (which is stored as metadata with every chunk's data). The other change was to enable listing and reading of files from any replica. To achieve that, the queries in the methods realizing those functionalities were extended with a query option so that the system would let them be run on secondary replicas (slaves).
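The access pattern can be sketched with the mongoDB Java driver (the adapter itself modifies the C++ GridFS client shipped with the mongoDB library, so the snippet below is only an illustration; the database name is the one used later in the evaluation, the rest are standard GridFS conventions):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.ReadPreference;

public class GridFsByIdReader {
    // Reads the chunks of one GridFS file directly from a local mongod, identified
    // by its files_id instead of its file name, allowing reads from secondary replicas.
    public static void readFile(String host, int port, Object fileId) throws Exception {
        MongoClient client = new MongoClient(host, port);
        try {
            DB db = client.getDB("filesystem");                  // database holding the GridFS files
            DBCollection chunks = db.getCollection("fs.chunks"); // default GridFS chunks collection
            DBObject query = new BasicDBObject("files_id", fileId);
            DBCursor cursor = chunks.find(query)
                    .sort(new BasicDBObject("n", 1))             // chunks are ordered by their index "n"
                    .setReadPreference(ReadPreference.secondaryPreferred()); // allow secondary (slave) reads
            while (cursor.hasNext()) {
                byte[] data = (byte[]) cursor.next().get("data");
                process(data);                                   // hand the chunk's data over for processing
            }
        } finally {
            client.close();
        }
    }

    private static void process(byte[] chunkData) { /* emit the chunk as an event */ }
}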

4.3.1.2 Real-time data source adapter

The real-time data source adapter basically connects to a socket at a given host and port address. What differs from the mongoDB data source adapter is that, instead of a definition of which data is to be read, the work period is limited.

4.3.2 Mapper, Workerop and Partitioner implementation

The implementation of the operators running the word count algorithm inside the system is based on the code available on the StreamMine3G site and works according to the MapReduce model; it is only extended by a few mechanisms. The mapper operator, the one whose slices are always co-located with accessop instances, takes the data of the incoming event and splits it into single words. Each of the words is emitted as a separate event. workerop plays the role of the reducer in the MapReduce paradigm. It is a stateful operator that keeps a map of the words that have already arrived with events, together with a counter for each of them.


During event processing, the counter of the appropriate word is incremented. When the work finishes, every slice stores the final version of its state. Since every slice received a different range of words, the final state of each slice contains the final outcomes for the words of that range. To guarantee an even distribution of events between all of the workerop slices, a custom partitioner was implemented. It analyzes both the incoming partition key and the incoming event. mapper can define that it wants an event to be broadcast (this is used to notify about the end of the stream). When the partition key has another value, the hash of the word stored in the event buffer is computed. Then the number of the slice that should receive the event is computed with a simple formula:

sliceNumber = floor( (double)hash / (double)0xffffffff * (double)slicesCount )
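Expressed in Java for readability (the actual partitioner is a StreamMine3G component and the broadcast sentinel is illustrative), the decision looks as follows:

// Minimal sketch of the custom partitioner: a reserved partition-key value means
// "broadcast to all slices" (end-of-stream notification); otherwise the word hash,
// treated as an unsigned 32-bit value, is mapped onto a slice index.
public final class WordPartitionerSketch {
    public static final long BROADCAST_KEY = -1L; // illustrative sentinel, not the real value

    /** Returns -1 for broadcast, otherwise the index of the target slice. */
    public static int targetSlice(long partitionKey, long wordHash, int slicesCount) {
        if (partitionKey == BROADCAST_KEY) {
            return -1; // the caller forwards the event to every slice
        }
        long unsignedHash = wordHash & 0xffffffffL;
        return (int) Math.floor((double) unsignedHash / (double) 0xffffffffL * slicesCount);
    }
}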

4.3.3 Implementation of the manager

The StreamMine3G manager is the component of the system that receives information about newly scheduled tasks from the Scheduler component and is responsible for a correct deployment of all the operators needed to process the queries, as well as for cleaning up right after the processing finishes. Basically, it needs to call methods of the so-called cloud controller, which implicitly takes care of spreading the deployment data between the nodes with the help of ZooKeeper. In the current implementation, the database is used as the communication channel between the Scheduler and the StreamMine3G manager. Every second, the manager queries the database for the tasks whose execution needs to be started during that second. To guarantee a deterministic behavior of the manager, the next step of the deployment process is called only after the asynchronous notifications about a successful deployment of the previous steps have arrived. An example of controlling the deployment of a task with two operators, each of which has one slice, is presented in Figure 4. It shows that during the deployment every action is executed on every operator/slice (depending on the deployment state) before any other action is executed on any of the operators/slices. This is possible thanks to an (atomic) counter that counts up to the moment when the appropriate number of responses from the cloud controller has been reached. Information about the properties of deployed objects is accessible through maps (string → operator, string → slice). Furthermore, the relationships between objects (slices, their operators, their tasks) are stored by the manager. During the removal of slices and operators, a different logic is used. Every slice sends a notification to the manager every time any of its sources reaches the end of its stream. Those notifications are counted on the manager's side. When their number reaches the number of sources of some slice, the removal of that slice is initiated. Generally, slices can be removed at various points in time (depending on when they finish their operations); there is no need to wait for other slices. The procedure of removing an operator is initiated when the number of still-working slices of that operator reaches zero. Every time a slice of an operator is removed, the information about the busy state of its host is updated, so that the placement algorithms running simultaneously can have an up-to-date view of the system state.
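The step synchronization can be sketched as a small barrier-like counter (class and method names are illustrative, not the actual manager code): a limit is set to the number of expected callbacks, and the next deployment step is triggered only when the last callback arrives.

import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the deployment-step barrier used by the manager: setLimit() declares
// how many cloud-controller callbacks are expected for the current step; incrementCheck()
// is called from each asynchronous callback and reports whether the step is complete,
// so the manager may start the next one (createOp -> deployOp -> launchSlice, as in Figure 4).
public class StepCounter {
    private final AtomicInteger count = new AtomicInteger(0);
    private volatile int limit = Integer.MAX_VALUE;

    public void setLimit(int expectedResponses) {
        limit = expectedResponses;
        count.set(0);
    }

    /** Returns true exactly once, when the last expected response has arrived. */
    public boolean incrementCheck() {
        return count.incrementAndGet() == limit;
    }
}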

4.4 Design and Implementation of the Tasks Scheduler

The Tasks Scheduler is the component of the system that is responsible for finding placements for the queries defined by clients and scheduling them for execution by the processing engine. Precisely, it has the following roles:
- Provide an interface to the client that allows defining queries on both historical and live data sources within the system
- Run the search for operator placement solutions
- Give the client information about the costs and the time of processing his query
- Translate the found solutions into the standard topology definition that is the input for the StreamMine3G manager


Figure 4: Simple presentation of the message exchange between three components (Counter, Manager, CloudControl) during the deployment of a task with two operators, each of which has one slice


Figure 5: The SchedulerInput interface of the Scheduler component, consisting of the methods useHistoricalDataSource(), useRealTimeDataSource(), runPlacement() and confirmExecution()

The interface of the Scheduler is presented in Figure 5. It consists of four methods. The first two methods are basically about input data definition. The methods useLiveDataSource() and useHistoricalDataSource() let the client define what data should be processed during the query execution. Every source in the system has a unique name stored in the database. Choosing a live data source as one to be used by the query means that all the data produced by that source during the query execution will be part of the input for the system. When a historical data source is chosen, a list of keys to be retrieved from this source needs to be provided. The data type of the keys of that source depends on the technical type of the data source; for mongoDB these are unique strings (file ids) - not file names. It is assumed that the mapping from file names to unique strings is done by an external component responsible for translating a business query into a technical query. The two other methods let the client define the details of the query, as well as choose the found operator placement solution that best fits the requirements. A ClientQuery object is expected as the input of the runPlacement() method. It contains information about the way the data should be processed (worker algorithm type) and the time when the query processing is supposed to be started (it can be defined as null, which means as soon as possible). Two further fields let the client define the preferred execution time and price for that query, but finding such a solution is outside the scope of the Scheduler component. Instead, the runPlacement() method returns a list of possible placement solutions found during the algorithm's execution and lets the client choose the most adequate one. To confirm one of the solutions, the client calls the confirmSolution() method, passing a number identifying the chosen solution as a parameter.
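Put together, the interface could look roughly as follows in Java (the parameter and return types, as well as the helper classes, are assumptions derived from the description above, not the real signatures):

import java.util.Date;
import java.util.List;

// Hypothetical shape of the Scheduler's client-facing interface.
public interface SchedulerInput {
    /** Select a live data source by its unique name; all data it produces during execution is used. */
    void useLiveDataSource(String sourceName);

    /** Select a historical (mongoDB) source and the list of file-id keys to retrieve from it. */
    void useHistoricalDataSource(String sourceName, List<String> fileIds);

    /** Run the placement algorithms for the given query and return the candidate solutions. */
    List<PlacementSolution> runPlacement(ClientQuery query);

    /** Confirm one of the returned solutions, identified by its number, for execution. */
    void confirmSolution(int solutionNumber);
}

// Placeholder types standing in for the real structures of the system.
class ClientQuery {
    String workerAlgorithmType;   // the way the data should be processed
    Date startTime;               // null means "as soon as possible"
    Double preferredTimeMinutes;  // preferred execution time (informational only)
    Double preferredPrice;        // preferred price (informational only)
}

class PlacementSolution {
    int number;                   // identifier passed to confirmSolution()
    double approximatedPrice;
    double approximatedTimeMinutes;
}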

4.4.1 General Architectural Approach

An effort was made to make the system easily configurable from the outside; therefore properties files were introduced. As a result, those points of the implementation that encapsulate algorithms that can be classified as strategies were left open to a simple exchange based on the content of a properties file. Right now, this applies mainly to the simplex-based placement algorithm.

4.4.2 Component Model of Scheduler

Figure 6 presents the components inside the system. Besides the processing engine - StreamMine3G - there is the Scheduler component as well as the SystemState component, which provides common methods for system state access to both of the other components. The Scheduler consists of a few subcomponents, each of which is described below.


Figure 6: Components inside of the system (the Scheduler with its subcomponents SchedulerInput, Mapper, Placer, PriceTimeApproximator and StreamMineTaskPrepare, the StreamMine3G Manager, and SystemState) together with their dependencies


4.4.2.1 SchedulerInput Component

SchedulerInput consists of a class that implements the system interface and is responsible for controlling the communication between the internal components and the external components (Scheduler clients). It keeps the basic data that is exchanged with the client and that needs to be shared between method calls, such as the mapping of keys to hosts for every defined data source, the client's query, and the list of solution graphs once they are created.

4.4.2.2 Mapper Component

The Mapper component implements the functionality of building host-to-key maps for every data source. It is called by SchedulerInput every time useLiveDataSource() or useHistoricalDataSource() is called with correct parameters. In the system, mapping retrieval is possible for two types of data sources: live and historical (based on MongoDB technology). The classes implementing the mapping share a common method they inherit from an abstract parent class. Its task is to construct the map from every given key to every system node that enables that key's retrieval. For live data sources, the operation of map creation is trivial: there are no specific keys defining the data of those sources, so basically every node running this data source is placed in the map. The information about which nodes run each live data source is stored in the Micro-cloud persistent model. For mongoDB data sources, building the map requires connecting to the mongos process. It consists of two parts (see: Figure 7). The first part builds a map of pairs: mongo key → set of hosts. Possibly, such a data structure could be kept inside the system and refreshed only from time to time - right now it is built up every time the mongoDB source keys mapper is called. In the first step of that part, a map of shard names and the hosts that hold the replicas containing those shards is created. In fact, every shard definition is stored in the mongos configdb as a string containing the shard's name and the list of hosts that belong to its replication set; consequently, this operation is basically about string parsing. Then, in the next step, all of the keys in the system are iterated. Every key identifies a group of chunks of one file (sometimes the file is split to keep an equal distribution between shards). For every key, the file id, the first chunk and the name of the shard that stores it are retrieved. That makes it possible to translate the previous map into the map expected as the outcome of this part. The second part is about finding the keys given as the system input in the map of mongo keys. The definition of both kinds of keys is slightly different, as keys given as input represent whole files, while mongo keys represent groups of chunks of one file that are stored in the same shard. Consequently, it is possible that the hosts-to-keys map that is the output of the method will have more keys than given in the input.

4.4.2.3 Placer component

The role of the component is to find solutions for the placement of all of the keys given as input to the system, the distribution of which is determined within the Mapper component. The functionality of the component is initiated by SchedulerInput during the runPlacement() method. It is supposed to be called once, after all of the needed data sources are defined. As an input, it receives:
- A set of structures representing every data source defined by the client, each of which consists of:
  - a data source definition
  - a structure with a bidirectional mapping between keys and the nodes hosting their data
- The client's query


Figure 7: An example of combining sharding data from mongoDB with the keys requested by the client, processed in two parts

The Placer is built in such a way that the preplacement, placement and postplacement algorithms can be exchanged independently of each other. During every call, no matter which algorithms are going to be used, the following order of calls is kept:

runAlgorithm()
    call ::prePlacement()
    call ::doRunAlgorithm()
    call ::postPlacement()
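This is essentially the template method pattern; a minimal Java sketch (only the three calls above come from the implementation, the remaining names are illustrative):

// Sketch of the Placer's fixed call order: concrete subclasses supply the actual placement
// strategy in doRunAlgorithm(), while pre- and postplacement remain exchangeable as well.
public abstract class PlacementAlgorithm {

    public final SolutionGraph runAlgorithm(PlacementInput input) {
        prePlacement(input);                            // approximate execution time, filter busy hosts
        SolutionGraph solution = doRunAlgorithm(input); // e.g. all-in-one-Micro-cloud or simplex-based
        postPlacement(solution);                        // approximate price/time, check feasibility
        return solution;
    }

    protected abstract void prePlacement(PlacementInput input);
    protected abstract SolutionGraph doRunAlgorithm(PlacementInput input);
    protected abstract void postPlacement(SolutionGraph solution);
}

// Placeholder types for the structures exchanged between the steps.
class PlacementInput { /* data source definitions, key-to-host maps, client query */ }
class SolutionGraph { /* sources, destinations and connections with transfer values */ }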

The roles of those subsequent procedures are described below.

Preplacement  A few steps are taken before any of the placement algorithms. Their role is basically to analyze and correct the data received from other components before the placement algorithms start working on it. In the first step, the time of the execution is approximated. This value is needed by the further steps of preplacement, but may also be used by the placement algorithms (the simplex-based algorithm uses it). No big effort was made to make the approximation algorithm very accurate; it is assumed that a very general approximation of the possible execution period length is good enough at this point. The algorithm respects the condition that only data about Micro-cloud profiles for the next 48 hours is reliable. Therefore, client queries that have their start time set more than 48 hours ahead are rejected at this point. Two scenarios are considered separately - either there is some historical source or there is none. If there is, only the time of the historical data retrieval is taken into account. The time approximation is then based on two very simple (too simple) assumptions: the total processing speed of data equals its retrieval speed, and on an average host, data of the size of two average-size keys is going to be retrieved. Those assumptions can be revised after a sufficient number of algorithm runs to make the approximation more reliable. When there are no historical keys, the time defined for real-time data retrieval is taken into account. The approximated time period lengths are then compared with the maximum time for a reliable analysis (ending when the pricing profiles become unreliable - so 48 hours from now). The lower of the two values becomes the execution period length that is used in the further algorithms. Since only the time period computed in the previous step is taken into account, the real-time keys need to be changed as if they were to retrieve data only within this period of time. The time of their work after that period is taken out of account; it is assumed that this can be treated as another scheduling problem. During the execution period whose length was just approximated, some of the nodes that are supposed to host the needed keys can be busy (executing other tasks). Therefore they should be out of the interest of the placement algorithms. Consequently, a procedure is run that checks each of those hosts for being on the busy list during the approximated execution period. If that is the case, the host is removed from the mapping.

Placement  The role of this procedure is to run the placement algorithms. As shown in Sections 3.4, 3.5 and 3.6, there are three main approaches for finding placement solutions. During placement, several of them can be called subsequently to build up a list of possible solutions that can be checked for feasibility and evaluated by the client.

Postplacement  After a run of any of the placement algorithms, two key steps have to be taken:

1. Run computations to approximate the price and the time of the execution
2. Run a final analysis of the feasibility of the given solution

The algorithm for a solution's price and time approximation is described in detail in Section 5.2. After it is executed, the solution can be checked against the feasibility rules. For a solution to be feasible, all of the hosts that are to be used for the retrieval or the processing need to be free during the period of time defined for each of them during the solution analysis (the price and time approximation procedure). If some host does not meet that expectation, the whole solution is marked as infeasible and is no longer treated as a solution of the placement problem. The lists of hosts that caused the feasibility check to fail are stored for potential future reruns of the placement algorithms.

4.4.2.4 PriceTimeApproximator Component

PriceTimeApproximator is called during the postplacement routine inside the Placer component. Its role is to analyze solution graphs in order to find the partial times and prices, as well as the comprehensive times and prices, using the algorithm described in Chapter 5.2. As an input, it receives a graph, the expected start time of the execution and the worker algorithm type. It works directly on the graph and sets all of the computed values inside its structures.

4.4.2.5 StreamMineTaskPrepare Component

StreamMineTaskPrepare is responsible for translating a graph into a StreamMine3G manager input. It is called by SchedulerInput during the storeExecution() method run, naturally only if some solutions were already prepared by the Placer. As parameters it receives the structures indispensable for preparing the manager's input - the solution graph chosen by the client and the client's query. An algorithm is implemented whose role is to analyze the structures that encapsulate a found solution and translate them into the StreamMine3G manager's input. The mapping has to be done in a specific way for every worker operator algorithm. For the word count algorithm that the created system implements, three types of operators have to be placed within the mapping. The way of describing the properties of each of the operators is presented below.

Name  For every task a unique id is generated, and the operators of this task have their names built upon it (to guarantee uniqueness). The source operator's name is built based on the schema taskuid operatortype uniquenumber (as there are many accessops, each of them gets its own unique number). The other operators (mapper and workerop) have names like: taskuid operatortype workeralgorithmname.

Wire with  In the definition of every operator there is a place for the names of the operators downstream of it. The list defined for every accessop contains the name of the mapper of this task. The list defined for the mapper contains the workerop's name.

Library path  A properties file used during the translation contains the library paths for every type of operator. Those paths are used here.

Partitioner path  As with library paths.

Parameters  Currently, parameters are set only for the accessops. The parameters are filled up as follows:

- Host - name of the source host (in practice - a URL)
- Port - port on which the service listens on the host (specified in the data source definition)
- Source name - collection name (specified in the data source definition)
- Partition key - id of the mapper's slice placed on the same node
- Data source implementation type, read preference type - set based on the data source type taken from its definition
- Time limit - set for real-time sources, as defined in the client's query
- General keys - set for mongoDB data sources: a list of mongoDB file keys, each defined by a key string, the first chunk number and the last chunk number to be retrieved on this node

Hosts  This is the list of nodes that are going to host any of the slices of the operator. For workerop, this is the list of names of the destination hosts in the graph; for mapper - the list of names of the source hosts in the graph; for each accessop it is the name of one of the source hosts.

Key range size  This parameter is set because of the way partitioning between the accessops and the mapper works. The mapper is set to have a key range size equal to the number of its slices (so equal to the number of sources).

End-of-stream signals to shut down slice  This value equals the number of original sources that deliver data to the operator slice. For mapper and workerop slices this is the number of accessops. For accessops this equals 1.

4.4.2.6 SystemState Component

This component was created to keep the methods that are common to the StreamMine3G manager and the Scheduler. Basically, they provide the Scheduler with access to the current system state and the possibility of adding expectations of future changes to it, as well as providing the manager with the possibility of adding actual changes of the system state. Right now, this amounts to keeping the state, and the expected changes, of host busy times up to date.

4.4.3 Data Flow within the System

An exemplary flow of the data about one key included in the client's query is presented in Figure 8. In this example, it is a mongoDB key. At the input of the system, it is defined as a string identifying a file. It is sent in this form to the Mapper component. From there it returns as a few groups of chunks that are parts of the file - each of which has a set of hosts holding replicas of this data assigned. The Placer component takes this data as input and converts it into pairs: group of chunks → one host. It might be the case that one previous group of chunks is split into a few by the Placer. Those pairs are sent to StreamMineTaskPrepare.


Figure 8: Presentation of how the data description of an exemplary mongoDB key changes during the execution of the Scheduler


4.5 Implementation of the Placement Algorithms

4.5.1 Implementation of All in One Micro-cloud Approach

Micro-clouds that contain all the needed data are easily found in two steps:
- The keys-to-hosts map that is the input of the Placer component is translated into a keys-to-Micro-clouds map
- Micro-clouds that turn out to hold all of the keys defined in the map are the Micro-clouds of the algorithm's interest

What is left after that is to create a solution, for every Micro-cloud found, that uses only nodes inside that Micro-cloud. It is done in the following steps:
- A set of the nodes inside the Micro-cloud that host any of the keys' data is created
- For every host that is an element of that set, all of the keys that it stores are taken
- The host → key pairs from the previous steps are used as the definitions of sources within the system
- The destination is defined generally as the Micro-cloud. Choosing concrete nodes as hosts of the worker operators is a part common with the other algorithms and is described in a later section

The steps of choosing sources for keys described above are a rather simplistic solution and might be extended to a more adequate one. Since it is assumed that the data of one key is replicated no more than once within one Micro-cloud, the solution of always picking the first host of every key can be considered good enough.

4.5.2 Implementation of an Approach Based on Simplex Algorithm

The general task realized by the implementation of this approach is to prepare structures that can be used to form the input for the external simplex-solving component, and to store the data that comes as its output. Therefore, for every potential connection of the solution's graph a special object is created that implements the whole complexity of the considered problem as if it were the transportation problem. It consists of the following getter/setter methods:

double getC();
double getTransfer();
void setTransfer(double transfer);

They encapsulate the abstraction of computing the cost for the problem, and they let the transfer be set according to the simplex outcome and be read later for further processing. There is also a group of methods that provide access to all of the objects defining the connection: the source s, its key k and host h, as well as the destination d. The way in which the values are computed lets the solving algorithm be aware of:
- execution prices during the retrieval and processing of data
- transfer prices
- indirectly, the transfer bandwidths on connections (the slower the bandwidth, the greater the price of retrieval)

4.5.2.1 Price Approximation Procedure

The role of this procedure is to approximate the price of every connection in the system according to the algorithm description. Therefore, the approximate price on the source, on the connection and on the destination need to be computed and summed up. With the first call of the getter of the cost of a connection, a price-computing procedure is called. It is divided into three parts:
- Compute an approximate execution price on the source for this connection
- Compute an approximate transfer price on this connection
- Compute an approximate execution price on the destination for this connection

In the first step, the (maximal possible) bandwidth of the connection is computed. It is done by comparing the bandwidths along the communication path and taking the minimum. The compared values are:
- The retrieval speed on the source (so basically the bandwidth of a real-time source, or the disk bandwidth if it is a historical source)
- The bandwidth between the nodes within the Micro-cloud (in the current model, there is no differentiation between the inside-of-rack speed and the bandwidth between racks)
- If the source and the destination are in different Micro-clouds:
  - the output bandwidth of the source Micro-cloud(1)
  - the input bandwidth of the destination Micro-cloud(1)
  - the bandwidth inside the destination Micro-cloud

It is important to notice that the processing speed of the worker operator is skipped when looking for the speed of a connection. The reason is that there will generally not be one-to-one connections in the system (that is, a source will send to many destinations and every destination will get data from many sources). Therefore, further algorithms will compute an appropriate number of worker operator slices in the system, so that they do not influence the bandwidths on the connections. Then, in the next step, the time can be computed, as the bandwidth and the data size (which is obviously the size of the data represented by the key) are known. Finally, the price can be computed. It is done by multiplying the approximated time by the average price during the assumed execution period(2). The price of the transfer on every possible connection is computed by summing up the price of the output transfer of the Micro-cloud where the source is placed and the price of the input transfer to the destination Micro-cloud. Each of them is computed by getting the average price of transferring one GB of data during the assumed execution period(2) and multiplying it by the size of the key's data on the source. Of course, when the source is inside the destination Micro-cloud, this price equals 0. Finally, the price on the potential execution host is computed, as if there were only one worker operator deployed in the system. The approximate speed of a computation on one node with the chosen worker algorithm is taken from the database; it is used as the processing speed unless the bandwidth on the considered communication path is lower than that value. The execution time and the price are computed in the same manner as for the source. Since all of the prices are represented in the same way in the model, a generic method to compute the average price during a given period of time was implemented. The pricing profile of every Micro-cloud is stored as a group of pairs: date-time → new price value. A generic algorithm for computing the average price (either execution or transfer) is presented below.
initialize averagePricesMap as empty Map

getAveragePrice()
    initialize averagePrice as Double
    initialize knownAveragePrice to value of timePeriod in averagePricesMap
    if knownAveragePrice different from null
        set averagePrice to knownAveragePrice
    else
        initialize microCloudProfileList to List of profileNodes
        store in microCloudProfileList the last profileNode before the timePeriod start
        store in microCloudProfileList all the profileNodes during the timePeriod
        initialize currentPointOfTime to timePeriod start
        initialize area to 0.0
        for every profileNode in microCloudProfileList
            initialize priceInThisPeriod to price from profileNode
            initialize currentPeriodEnd
            if this is the last profileNode, set currentPeriodEnd to timePeriod end
            else, set currentPeriodEnd to next period start in microCloudProfileList
            initialize currentPeriodLength to the difference of currentPeriodEnd and currentPointOfTime
            set area to area + currentPeriodLength * priceInThisPeriod
            set currentPointOfTime to currentPeriodEnd
        set averagePrice to area / timePeriod length
        put pair timePeriod <-> averagePrice into averagePricesMap
    return averagePrice

(1) For simplicity, an incorrect assumption is made about the bandwidths between Micro-clouds - that a connection will be able to use their whole in/out bandwidth.
(2) The assumed execution period is defined as the period starting at the point of time defined in the client's query and lasting the number of minutes determined during preplacement while approximating the execution time length.
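A minimal Java rendering of this weighted-average computation (the profile is represented here as a sorted map from the minute at which a new price becomes valid to that price; class and field names are illustrative, and the per-period result caching from the pseudocode is omitted):

import java.util.Map;
import java.util.TreeMap;

// Sketch of the average-price computation over a piecewise-constant pricing profile:
// the profile maps "time (in minutes) at which a new price becomes valid" to that price,
// and the average over [start, end) is the time-weighted mean of the prices in force.
public class PricingProfile {
    private final TreeMap<Long, Double> priceChanges = new TreeMap<>(); // minute -> price per unit

    public void addPriceChange(long minute, double price) {
        priceChanges.put(minute, price);
    }

    public double averagePrice(long startMinute, long endMinute) {
        // Price in force at the start of the period: the last change at or before it.
        Map.Entry<Long, Double> inForce = priceChanges.floorEntry(startMinute);
        double price = (inForce != null) ? inForce.getValue() : 0.0;
        double area = 0.0;
        long current = startMinute;
        for (Map.Entry<Long, Double> change
                : priceChanges.tailMap(startMinute, false).headMap(endMinute, false).entrySet()) {
            area += (change.getKey() - current) * price;
            current = change.getKey();
            price = change.getValue();
        }
        area += (endMinute - current) * price; // last segment up to the period end
        return area / (endMinute - startMinute);
    }
}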

The maps of prices are defined separately for each of the Micro-clouds. The implementation prevents recounting of the same prices (this method is called every time a connection's destination is that Micro-cloud). The returned price is the average price of a unit - so per GB for transfer and per hour for a computation.

4.5.2.2 Initializing Simplex Parameters

There are five parameters used as the input for the external simplex solver component to solve the optimal operator placement problem as a linear optimization problem:

1. Linear Objective Function
2. Linear Constraint Set
3. Non Negative Constraint
4. Goal Type
5. Max Iter

Parameters 1, 3 and 4 are common with the transportation problem (see: Chapter 3.6.1). The linear objective function is a sum of terms ci·xi. The array of ci values, given as a parameter, is filled with the cost values of every connection, computed as described in Chapter 4.5.2.1. The non-negative constraint defines that all of the xi values should be greater than or equal to zero. The goal type says that the goal is to minimize the value of the linear objective function. Parameter 5 defines whether there should be a limit on the number of iterations when searching for the optimal solution. Parameter 2 is the set of constraints taken into consideration when solving the linear objective function with the simplex algorithm. They are defined in theory in Chapter 3.6.4. In practice, the definition of a linear constraint amounts to defining three things - coefficients, a relationship and a value. There is one coefficient defined for every connection transfer xi. In effect, an equation of the type presented below is created:
sum_{i=1}^{n} coefficient_i · x_i    [ ≤ | = | ≥ ]    value,      where n is the number of connections

There is one constraint K defined for every key. Connections with sources that represent that specific key get a coefficient of value 1 (all the others are filled with zeros). The value becomes the size of that key and the relationship is equals.


There is one constraint D defined for every destination (Micro-cloud). Connections with that specific destination get a coefficient of value 1 (all the others are filled with zeros). The value becomes the processing limit chosen for that Micro-cloud; the specific strategy for determining this value is described later. Naturally, the relationship between the left side of the equation and the limit value is less or equals. A constraint L is similar to the constraint D - it is also about setting limits. There is one constraint L defined for every host. Connections with sources that represent that specific host get a coefficient of value 1 (all the others are filled with zeros). The value becomes the retrieval limit chosen for that host; again, the strategy for determining this value is described later, and the relationship is less or equals. There is one constraint T defined for each pair key - destination. In each constraint defined for some destination, all of the connections with sources that represent the chosen key k0 get a coefficient of value 1. Each connection between the sources of the other keys and that destination gets a coefficient equal to minus the proportion between the size of k0 and the size of the connection's source key. Setting the value to 0 and the relationship to equals guarantees that the data of every key is split in the same manner between the destination Micro-clouds, which is compatible with the topology specifics.
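The thesis relies on an external simplex solver; the parameter list above matches the linear optimization API of Apache Commons Math, so, under the assumption that this (or a similar) library is used, assembling and solving the problem could look roughly like the following sketch:

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.optim.MaxIter;
import org.apache.commons.math3.optim.PointValuePair;
import org.apache.commons.math3.optim.linear.LinearConstraint;
import org.apache.commons.math3.optim.linear.LinearConstraintSet;
import org.apache.commons.math3.optim.linear.LinearObjectiveFunction;
import org.apache.commons.math3.optim.linear.NonNegativeConstraint;
import org.apache.commons.math3.optim.linear.Relationship;
import org.apache.commons.math3.optim.linear.SimplexSolver;
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType;

public class SimplexPlacementSketch {
    // costs[i] is the approximated price per unit of transfer on connection i (Section 4.5.2.1);
    // keyCoefficients/keySizes illustrate the K constraints, destCoefficients/destLimits the D constraints.
    public static double[] solve(double[] costs,
                                 List<double[]> keyCoefficients, double[] keySizes,
                                 List<double[]> destCoefficients, double[] destLimits) {
        LinearObjectiveFunction objective = new LinearObjectiveFunction(costs, 0);

        List<LinearConstraint> constraints = new ArrayList<>();
        for (int k = 0; k < keySizes.length; k++) {   // constraint K: all data of a key must be retrieved
            constraints.add(new LinearConstraint(keyCoefficients.get(k), Relationship.EQ, keySizes[k]));
        }
        for (int d = 0; d < destLimits.length; d++) { // constraint D: processing limit per Micro-cloud
            constraints.add(new LinearConstraint(destCoefficients.get(d), Relationship.LEQ, destLimits[d]));
        }
        // ...constraints L (per-host retrieval limits) and T (equal split ratios) are built analogously

        PointValuePair result = new SimplexSolver().optimize(
                new MaxIter(10000),
                objective,
                new LinearConstraintSet(constraints),
                GoalType.MINIMIZE,
                new NonNegativeConstraint(true));
        return result.getPoint();                     // one transfer value x_i per connection
    }
}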

4.5.2.3 Strategies while Initializing Simplex Parameters

The first strategy approximates how many workers will be needed in the system. The chosen strategy can be considered very demanding. It is assumed that all of the data will be transferred within the system at the speed of its retrieval. Then, depending on the type of worker algorithm, an appropriate number of slices to process the data in real time is chosen. In effect, the maximal number of workers potentially needed to process all of the data is known. As a consequence of this strategy, the simplex algorithm finds solutions that reserve room in every chosen Micro-cloud for real-time processing of all of the incoming data, regardless of other specifics of the system. Because of those demands, a lot of better, feasible solutions may be rejected. The second strategy sets the limit of data to be retrieved from one source host during the execution of a query. Many things could influence that limit (e.g. the demanded time of execution, when it is proven that some sources with a big amount of data to retrieve are bottlenecks). The chosen strategy is simple but takes one important consideration into account. It is based on the current model of the system, where real-time sources are always on different nodes than historical sources. That makes the problem easier, as limits on hosts with real-time sources make no sense (the only time constraint is defined by the user and the amount of data is explicitly derived from it). The limit on the data to be retrieved from historical sources can be defined in the config file, but it cannot be smaller than one third of the sum of the data sizes of all of the keys to be potentially retrieved from that node; otherwise, in some cases, it would be impossible to find any feasible solution.

4.5.2.4 After Simplex Algorithm Run

The previous description of the simplex-based approach covered how the input for the simplex solver is defined. Based on those definitions, the simplex solver finds an optimum and returns an array of transfer values xi for every connection, in the same order in which the costs ci were initialized. Therefore, the transfers can easily be written to the list of connections by calling setTransfer() on every connection during a single iteration. A state ready for further processing is thus achieved. Now the concrete hosts inside every chosen destination Micro-cloud have to be chosen; this part of the algorithm is common with the All in one Micro-cloud approach.


4.5.3 Implementation of Algorithms for a Solution Normalization and Choosing Hosts for Processing

The subsequent algorithms, run to obtain a fully prepared solution from the outcomes of the All in one Micro-cloud algorithm as well as the simplex-based algorithm, are described below.

4.5.3.1 Normalization of Indivisible Key Sources

This step gives the guarantee that none of the keys that are unsplittable(3) is divided into several parts in the definition of the solution provided by the previous processing. The algorithm iterates through all of the keys. If a key is found to be unsplittable, first the source that was defined by the previous algorithms to retrieve the greatest amount of data for that key is found. Then the transfers on every output connection of that source for this key are increased proportionally, so that all of the data will now be transferred from that source. Consequently, the transfers from all of the other sources of that key can be set to zero.

4.5.3.2 Counting the Number of Workers Needed in the System

This procedure estimates the number of workers that will be needed in the system constructed in the previous steps. Generally, the strategy is to find a number that will process the incoming data on time (without delays). In the model used here, it is assumed that the outgoing bandwidths are independent of each other (a slow bandwidth on one communication line will not delay and slow down the transfer speed on other lines). In other words, it is assumed that the average bandwidth going out of every source equals the average of all of the transfer speeds of its outgoing connections. The sum of all of these average bandwidths is the approximate bandwidth within the system. The algorithm consists of three subparts. At first, a so-called fractions map is created: every Micro-cloud that is going to be used is paired in a map with the fraction of the total transfer going out from the source operators that comes into this Micro-cloud. Afterwards, the total transfer per second from the source operators is approximated. It is done by taking the bandwidths between every active source and every active destination and multiplying them by the fraction of the transfer that will be sent from this source to that destination. The bandwidth between a source and a destination is assumed to be the lower value of the pair: the retrieval speed on the source multiplied by the fraction of the transfer that will go to this destination, or the bandwidth between the source and the destination. Then, an appropriate number of hosts to process the data without delays can be determined, based on the worker algorithm type specified in the client's query.

4.5.3.3 Counting the Number of Workers Needed per Micro-cloud

When this procedure starts, the number of worker operator slices to be deployed in the system is already defined, as is the amount of data supposed to arrive at every Micro-cloud. It was noticed that deriving the number of workers in every Micro-cloud from the amounts of incoming data is just a conversion from double to integer values; only some rules needed to be set to make that conversion fit the system specifics best. The rules are explained in the example below.
The number of needed worker operator slices TotalWorkersNo is set to 6.
TotalSystemTransfer is the sum of all the transfers in the system and equals 35.0.
A list of destinations D, represented by tuples <OrderNumber, InputTransfer, WorkersNo>, exists:
    D<1,10.0,?>, D<2,15.0,?>, D<3,0.0,?>, D<4,10.0,?>
The list of destinations D is sorted by the field InputTransfer in descending order:
    D<2,15.0,?>, D<1,10.0,?>, D<4,10.0,?>, D<3,0.0,?>
The WorkersNo field is filled with the outcomes of the equation
WorkersNo = floor(TotalWorkersNo * InputTransfer / TotalSystemTransfer):
    D<2,15.0,2>, D<1,10.0,1>, D<4,10.0,1>, D<3,0.0,0>
After this operation, the number of worker operator slices that are not yet assigned equals N
and is lower than the number of destinations. Each of the N Micro-clouds with the biggest
incoming transfers gets one more worker slice:
    D<2,15.0,3>, D<1,10.0,2>, D<4,10.0,1>, D<3,0.0,0>

(3) Within the current system implementation, keys of real-time sources are unsplittable and keys of historical sources are splittable. The key of a real-time source represents a whole stream of data and is therefore, by definition, indivisible. On the other hand, the key of a historical source based on MongoDB's GridFS may be divided into keys representing smaller amounts of data, as it represents a group of chunks.
As presented in the example, the algorithm tries to split workers between Micro-clouds fairly, depending on incoming transfer. When it is impossible, Micro-clouds with bigger incoming transfers are supported. The map created above is forwarded to be used in further processing. 4.5.3.4 Normalization of the Solution

In this step, the graph of connections between sources and destinations (still at the abstraction level of Micro-clouds) is given characteristics that fit an actual deployment inside the StreamMine3G environment. To achieve that, a few actions need to be taken. Only those sources that have an output transfer greater than zero are put into the new graph. Only those destinations that have a number of worker slices to be deployed greater than zero are put into the new graph. Afterwards, the values on the connections are set as closely as possible to those that will occur between the nodes during the execution. As explained in Section 3.3, the topology used in the system consists of accessops co-located with mapper slices, and of workerop slices, and every mapper slice will send data to every workerop slice. The size of this data will be equal on every outgoing communication line. Therefore, for every source, a connection to every destination is defined. Since the destinations in the current graph are still defined as Micro-clouds, the data transfer on every connection is set to the data size of the source multiplied by the number of worker slices to be deployed in the destination, divided by the total number of worker slices to be deployed. A redefinition of keys is another essential part of the normalization algorithm. In principle, it is about redefining splittable keys by dividing each of them into a group of keys, each of which represents a smaller amount of data. This operation needs to be executed only for the keys that, according to the current definition of the graph, need to be retrieved on several hosts. When a key needs to be redefined, the following steps are taken:

1. A list of the sources that should each provide a part of that key is created
2. The number of indivisible chunks being part of that key is taken
3. The number of chunks for every source is computed with the same double-to-integer conversion as shown in 4.5.3.3
4. A group of new keys is created - each of which contains a subset of the chunks of the original key
5. Each of the new keys is connected with the appropriate source

4.5.3.5 Choosing Hosts in every Micro-cloud

The goal of this procedure is to convert a graph with Micro-clouds as destinations into a graph having hosts specified as destinations. The number of worker slices that should be deployed in every Micro-cloud is already known; now the task is to choose the hosts in every Micro-cloud according to the chosen strategy. The current model defines only one rack per Micro-cloud, therefore the decision was made to skip, for now, the implementation of how to distribute slices between racks. However, the logic of such a distribution would be very similar to the one in the procedure of picking the numbers of worker slices per Micro-cloud.


The hosts for the deployment of workerop slices inside every Micro-cloud are chosen as follows. At first, the input connections are divided into two groups - those having sources within that Micro-cloud and those having sources in other Micro-clouds. It is then checked whether there are enough free hosts in the Micro-cloud to place the demanded number of worker operator slices. If there are not enough of them, the implemented solution deals with it in a very straightforward and simple, and therefore rather not ultimate, way: the number of slices to be hosted is lowered. Other, smarter strategies could move a slice to another rack (if there are many racks) or to another (best fitting) Micro-cloud, or even rerun the previous parts of the algorithm with the information that this Micro-cloud cannot execute so many worker operators. When the number of hosts is finally determined, an algorithm that chooses the hosts is run. It starts with those that produce the biggest amount of data, so that the biggest amount can be processed locally. If there are fewer source hosts than the needed number of worker hosts, the remaining ones are chosen randomly.

4.5.3.6 Definition of Connections between Sources and Destinations

Finally, all of the sources are connected with all of the destinations in the graph. For every source, the same transfer size is set on every connection outgoing from it. After that, the solution's graph is ready.

5 Evaluation

5.1 A Simulation Environment

During the thesis, an effort was made to simulate a real environment. Database data, network interfaces, and StreamMine3G, mongoDB and Web crawler simulators were prepared according to one congruent design, so that the whole system could be run as if it were spread over the Micro-clouds. In this chapter, first the design, including some assumptions, is presented; then it is shown how the parts of the system were run in practice.

5.1.1 Design of a Simulation Environment

5.1.1.1 Micro-cloud Environment

The simulated system looks as follows. There are five Micro-clouds, each of which has one rack inside. Three Micro-clouds have four machines inside, two of them have three. Every machine has a URL defined to access it, e.g. host no. 1 of the rack in Micro-cloud no. 1 has the address h1.r1.mc1.microcloud.com. Every machine in the system hosts the StreamMine3G service as well as one type of data source. Fifteen machines run the mongod process (the daemon process of the mongoDB system) and three of them run real-time data sources (called Web crawlers). The details are presented in Figure 9.

5.1.1.2 Assumptions about Retrieval, Processing and Transfer Speeds

There are data transfer limits defined as attributes of every Micro-cloud (see: Figure 10). All of them have connections inside, enabling data transfer between nodes at a speed of 2 Gb/s. Three of them have an input bandwidth equal to the output bandwidth, amounting to 1 Gb/s. Another one also has a symmetric link, with a bandwidth of 50 Mb/s. The last one has an asymmetric link with an input transfer limit of 50 Mb/s and an output limit of 5 Mb/s. It is assumed that the read speed from hard drives by the mongod process on every node equals 100 MB/s and the data transfer rate from real-time data sources equals 50 MB/s. For simplicity, it is assumed that neither the accessop nor the mapper of StreamMine3G will slow down those transfer rates (see: Figure 11).


Figure 9: Distribution of data sources and StreamMine3G nodes over hosts in Micro-clouds according to the proposed design


Figure 10: Input, output and internal bandwidths of every Micro-cloud in the environment

Figure 11: Assumptions about data retrieval speeds on source nodes (mongod read from hard drive: 100 MB/s; Web crawler: 50 MB/s)

The next assumption made is that workerop will be able to process data with the word count algorithm at a speed of 50 MB/s (see: Figure 12).

5.1.1.3 Pricing Profiles

Every Micro-cloud has a pricing profile defined. Input and output transfer prices are held at a constant level. Prices of execution change with time. Their profiles generally consist of two types of day-long parts. One of them is a regular pricing profile, where the processing price rises during the morning and falls in the afternoon. The second one is the case when the Micro-cloud is going to be extremely busy and is therefore much more expensive all day long. A pricing profile graph for one week for one of the Micro-clouds is presented in Figure 13.

5.1.1.4 Division of Data inside of Data Sources

Figure 14 presents the shards inside the mongoDB system. There are five of them, each of which consists of three nodes forming one replica set. The replicas of every replica set fulfill the policy that states that each of them should be placed in a different Micro-cloud.

Figure 12: The assumption about the processing speed of a workerop executing the word count algorithm (50 MB/s)


Figure 13: A graph of the pricing profile of one of the Micro-clouds for one week (2013-05-20 to 2013-05-27), showing the transfer price (in/out) per GB and the processing price per hour

Figure 14: Model of the mongoDB system and the distribution of replicas over hosts (five shards, each replicated over three hosts, plus the mongos and config processes on fs.microcloud.com)

As the mongoDB replication rules state, there is always one primary replica - all the other ones are called secondary replicas. Two more mongoDB processes are run. One is the mongos process - a routing service for shard configurations that processes queries from the application layer and determines the location of the data in the sharded cluster [mon]. The other one is a config database storing the cluster's metadata. Figure 15 presents the external sources of data for the real-time data sources. In the created model, two of the Web crawlers read the same external data - therefore they are considered as one data source. The third one is the only one that reads the other external source.

5.1.2 Simulation of the Designed System

To test the system in practice as well as to make measurements, the whole system was simulated on one machine. The data describing the designed system was stored, according to the model explained in Section 4.1, in a MySQL database.

5.1.2.1 Creation of local interfaces to run services


Figure 15: Presentation of the external sources for the real-time data sources

All of the mongod instances needed to listen on the same port, as in the model there is a fixed port number defined on which the data source daemons should be listening on all of the hosts. Therefore nineteen new local network interfaces were created using the tunctl and ifconfig applications (one for the management processes, eighteen for the hosts of the system). E.g. for h1.r1.mc1.microcloud.com, an interface was created using:

tunctl
ifconfig tap1 10.1.1.1 netmask 255.0.0.0 up

Then, for every interface, an entry was added to /etc/hosts (so that the services can be accessed by domain name), following the pattern:
10.1.1.1 h1.r1.mc1.microcloud.com

5.1.2.2 Configuring mongoDB

To run the mongoDB system as designed, sixteen mongod processes and one mongos process needed to be run. For every shard, three mongod processes were run, configured to be part of one replica set. Each of them was run on the same port of a different interface, according to the sharding design described before. E.g. the replicas of shard 1 were run as follows:
#shard_1
mongod --shardsvr --dbpath /data/microcloud/mc2/r1/h1 --port 10001 --bind_ip 10.2.1.1 --replSet shard_1 --oplogSize 100 --rest --journal &
mongod --shardsvr --dbpath /data/microcloud/mc3/r1/h1 --port 10001 --bind_ip 10.3.1.1 --replSet shard_1 --oplogSize 100 --rest --journal &
mongod --shardsvr --dbpath /data/microcloud/mc4/r1/h3 --port 10001 --bind_ip 10.4.1.3 --replSet shard_1 --oplogSize 100 --rest --journal &

When all the replica sets were running, a connection to a chosen replica of every set was made to initiate the replica set with a shell command. Then the config database, which is supposed to store the cluster's metadata, as well as the mongos process were run using the commands presented below.
#config
mongod --configsvr --dbpath /data/microcloud/config --port 20001 --bind_ip 10.0.0.1 &
#mongos
mongos --configdb fs.microcloud.com:20001 --port 27017 --bind_ip 10.0.0.1 --chunkSize 1 2> /data/microcloud/mongoserr.log &> /data/microcloud/mongosout.log &


Filling shards with data  To provide a good input for the word count algorithm, files containing the contents of books were created as GridFS files inside mongoDB. Their id was set to the md5 hash of the content. When the files were uploaded to the database called filesystem, sharding was enabled on the GridFS chunks collection of this database. As a result, after some time the number of chunks in every shard was equal.

5.1.2.3 Configuration of real-time data sources

The Java application simulating the Web crawler functionality takes two parameters as input. One of them is the name of a file containing the content of a book, so that its content can be sent to clients. The second one is the name of the interface on which the service ought to be listening. For each data source of this type, the same port to listen on is set in the database, therefore this parameter can be fixed inside the code.

5.1.2.4 Configuration of the StreamMine3G run

StreamMine3G was run using a special Google Protocol Buffer bridge, so that the manager could be run in Java. The manager's execution efficiency is not essential; the key point was that the Scheduler is implemented in Java and the manager is logically close to it (some structures are common). Every StreamMine3G node was configured in a separate file, named after the interface it should run on. It defined the manager's library as the GPB-based one and the host to run on as the appropriate interface name. The ports were set to 9001 and 19001 and the number of threads to 8 for each of the nodes. When the configuration files were defined, the StreamMine3G processes could be run taking those files as input. Then the system was ready to run the StreamMine3G manager process.

5.2 On Approximating the Price and the Time of the Solution Execution

The algorithm implemented to approximate the price and the time of the possible executions found by the placement algorithms was essential for the evaluation of this work. It analyzes in detail how a given solution would be executed, in order to set reliable values for all of the outcome solutions, so that they can be compared and then executed similarly to the way that was anticipated. The algorithm consists of five main parts:

1. Analysis of sources
2. Analysis of connections
3. Analysis of destinations
4. Calculations of the solution's time
5. Calculations of the solution's price

Some of the mechanisms used in the algorithm are the same as those used when approximating prices on connections before the Simplex algorithm run.

5.2.1 Analysis of Sources

The goal of this step is not only to compute the approximate time and price of the execution of the accessop on a given source, but also to determine how the source execution and the output links would influence each other (particularly, how they would limit each other). The first step is to compute the retrieval time of every key on a given source. It is done in the same manner as explained in Chapter 4.5.2.1. To compute the length of the retrieval period, the actual retrieval speed has to be determined. It is assumed that this speed will be pessimistically influenced by the slowest output connection; an effort to give a better approximation than the most pessimistic one was skipped.

[Figure 16 sketch: a source with retrieval speed RS = 100 and four output connections with bandwidths B = 120, 120, 12 and 250; every output speed OS drops from 25 to 12 (the bandwidth of the slowest link) and the effective retrieval speed drops to RS = 48. Legend: RS retrieval speed, OS output speed, B bandwidth.]

Figure 16: The algorithm assumes that the connection with the lowest bandwidth will influence the transfer speed of all the other connections

It is assumed that when one connection is only slightly slower than the others, taking its speed as the maximum possible retrieval speed is not a big error; on the other hand, when one connection is clearly slower, it can really determine the speed of the whole retrieval. Then, once the retrieval speed and the retrieval time are set, an average data transfer speed for every output connection can be determined (see Figure 16). In the next stage, the execution times for every host that is going to take part in data retrieval during the query run are set. Moreover, the order in which the keys are going to be retrieved is determined. Right now the order is set without any strategy, although the arrangement of the data on the disk could be taken into account to avoid unnecessary movements of the hard drive's head stack. When the order and the retrieval time of every key are known, the delay periods between the query execution start time and the beginning of every key's retrieval can be determined. The sum of all retrieval times on the host is from now on known as the comprehensive execution time on the host. When the total execution time on the host is known, the price can be computed. It is done using the same algorithm as presented in Chapter 4.5.2.1. That gives a precise value of the execution price, provided the execution lasts as predicted by the time-approximating algorithm.
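
A minimal sketch of this bottleneck assumption, reproducing the numbers of Figure 16 (all names and units are illustrative, not the thesis code):

public final class SourceSpeedEstimator {

    /** Effective retrieval speed given the nominal source speed and the output link bandwidths. */
    public static double effectiveRetrievalSpeed(double nominalRetrievalSpeed, double[] linkBandwidths) {
        double slowest = Double.MAX_VALUE;
        for (double b : linkBandwidths) slowest = Math.min(slowest, b);
        // Each of the n outputs is assumed to send at the slowest link's bandwidth ...
        double boundedByLinks = slowest * linkBandwidths.length;
        // ... but the source can never retrieve faster than its nominal speed.
        return Math.min(nominalRetrievalSpeed, boundedByLinks);
    }

    /** Average transfer speed per output connection under the same assumption. */
    public static double perConnectionSpeed(double nominalRetrievalSpeed, double[] linkBandwidths) {
        return effectiveRetrievalSpeed(nominalRetrievalSpeed, linkBandwidths) / linkBandwidths.length;
    }

    public static void main(String[] args) {
        double[] bandwidths = {120, 120, 12, 250};                    // the Figure 16 example
        System.out.println(effectiveRetrievalSpeed(100, bandwidths)); // 48.0
        System.out.println(perConnectionSpeed(100, bandwidths));      // 12.0
    }
}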

5.2.2 Analysis of Connections

The period during which a transfer takes place on a given connection is known, as it coincides with the period of data retrieval on the source. Therefore, the price of a transfer can easily be computed by multiplying the size of the data to be retrieved on the source by the average prices of the output transfer of the source Micro-cloud and of the input transfer of the destination Micro-cloud (if the source and the destination are placed in different ones). Again, for computing an average price, the algorithm described in Chapter 4.5.2.1 is used.
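
A small sketch of that computation, assuming that averaged per-unit prices for the transfer period are already available and that transfers inside one Micro-cloud incur no WAN transfer price (the latter is an assumption, not stated explicitly in the text):

public final class TransferPriceEstimator {
    public static double transferPrice(double dataSize,
                                       double avgOutputPriceSourceMc,
                                       double avgInputPriceDestinationMc,
                                       boolean sameMicroCloud) {
        // Assumption: intra-Micro-cloud transfers are not billed as WAN traffic.
        if (sameMicroCloud) return 0.0;
        return dataSize * (avgOutputPriceSourceMc + avgInputPriceDestinationMc);
    }
}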

5.2.3 Analysis of Destinations

The analysis of the time period and the price of an execution on the destination host is a bit more complex, as first the assumptions about how the input traffic will be apportioned in time have to be analyzed in detail. Figures 17, 18 and 19 present an example of the logic implemented to simulate how the workerop on the analyzed destination is going to deal with the incoming transfers. Figure 17 shows a situation where three sources send data to the destination. Their actual transfer speeds are already computed, as well as the moments in time when each of them starts sending. The processing speed with the given algorithm is known as well. Figure 18 is the outcome of a method that analyzes the incoming bandwidths and shows the distribution of the sums of the incoming transfers in time. As can be seen clearly in the picture, not all of the data can be processed immediately as it comes. Consequently, an algorithm was implemented to compute how much time it will take for the processing operator to deal with all of the incoming data. The graph presented in Figure 19 is the outcome of a conversion of the distribution of total incoming transfer speeds in time into processing speeds in time. This algorithm is executed while computing the execution time of the destination. In the example, data is still being processed two seconds after the last data has arrived. When the execution period is known, the standard procedure for computing the price (Chapter 4.5.2.1) is called.
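
A minimal sketch of this backlog logic, with time discretized into equal steps (names, units and the example numbers are illustrative, not taken from the implementation):

public final class DestinationBacklogSimulator {

    /**
     * @param incomingPerStep  element [s][t] is the data sent by source s during step t
     * @param processedPerStep amount of data the workerop can process in one step
     * @return number of steps until all received data has been processed
     */
    public static int stepsUntilProcessed(double[][] incomingPerStep, double processedPerStep) {
        int steps = 0;
        for (double[] source : incomingPerStep) steps = Math.max(steps, source.length);

        double backlog = 0.0;
        int step = 0;
        // First, consume the incoming data while it is still arriving ...
        for (; step < steps; step++) {
            double arrived = 0.0;
            for (double[] source : incomingPerStep) {
                if (step < source.length) arrived += source[step];
            }
            backlog = Math.max(0.0, backlog + arrived - processedPerStep);
        }
        // ... then drain whatever is still queued after the last arrival,
        // which is why processing can end later than the last transfer.
        while (backlog > 0.0) {
            backlog = Math.max(0.0, backlog - processedPerStep);
            step++;
        }
        return step;
    }

    public static void main(String[] args) {
        double[][] incoming = {
                {30, 30, 30, 30},   // source 1
                { 0, 40, 40, 40},   // source 2, starts one step later
                { 0,  0, 50, 50}    // source 3, starts two steps later
        };
        // With a processing limit of 80 per step, one extra step is needed after the last arrival.
        System.out.println(stepsUntilProcessed(incoming, 80));
    }
}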

5.2.4 Calculations of the solution's time

The solution's execution period end is bounded by the moment when the last workerop finishes its work (which is going to be at least a bit later than the finish of any accessop). Consequently, the set of destinations in the solution graph is iterated in search of the worker operator that is going to finish its work at the latest moment in time. The difference between that moment and the beginning of the execution time defined by the client is known from now on as the execution time of the given solution.
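
In code, this is essentially a maximum over the worker finish times (a trivial sketch, names illustrative):

import java.util.List;

public final class SolutionTimeEstimator {
    // The solution's execution time is the latest workerop finish minus the client-defined start.
    public static long executionTimeMillis(long queryStartMillis, List<Long> workerFinishTimesMillis) {
        long latestFinish = queryStartMillis;
        for (long finish : workerFinishTimesMillis) {
            latestFinish = Math.max(latestFinish, finish);
        }
        return latestFinish - queryStartMillis;
    }
}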

5.2.5 Calculations of the solution's price

To calculate the solution's price, one needs to sum up the execution prices on all of the hosts that are going to be used in the task together with the prices of all of the data transfers. Therefore, all of the prices computed before are summed up, with the exclusion of those situations where both a workerop and an accessop are to be placed on the same host. In that case, the price of the one that is executed longer is taken (normally, that should be the worker's execution price).
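
A sketch of that summation, with the co-location rule applied per host (the HostCost structure and all names are illustrative):

import java.util.List;

public final class SolutionPriceEstimator {

    public static final class HostCost {
        final double accessPrice;  // price of the accessop slice on this host (0 if none)
        final double accessTime;   // its predicted execution time
        final double workerPrice;  // price of the workerop slice on this host (0 if none)
        final double workerTime;   // its predicted execution time

        public HostCost(double accessPrice, double accessTime, double workerPrice, double workerTime) {
            this.accessPrice = accessPrice;
            this.accessTime = accessTime;
            this.workerPrice = workerPrice;
            this.workerTime = workerTime;
        }
    }

    public static double solutionPrice(List<HostCost> hosts, double totalTransferPrice) {
        double price = totalTransferPrice;
        for (HostCost h : hosts) {
            if (h.accessPrice > 0 && h.workerPrice > 0) {
                // Co-located accessop and workerop: bill only the one that runs longer
                // (normally the worker), instead of charging the host twice.
                price += (h.workerTime >= h.accessTime) ? h.workerPrice : h.accessPrice;
            } else {
                price += h.accessPrice + h.workerPrice;
            }
        }
        return price;
    }
}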

5.3 Evaluations of Positioning Algorithms

The evaluation was performed on all of the algorithms described in this work, varied by the parameters defining the strategies to be used. The names that will be used for those algorithms in the further parts are described below:
- greedy - greedy approach described in Chapter 3.4 without price-aware choices
- greedyp - greedy approach described in Chapter 3.4 with price-aware choices
- simplexk - simplex algorithm-based approach with the K constraint
- simplexkdmax - simplex algorithm-based approach with K and D constraints, with the workers number approximation strategy as explained in Chapter 4.5.2.3
- simplexkdhalf - simplex algorithm-based approach with K and D constraints, with the loosened workers number approximation strategy
- simplexkt - simplex algorithm-based approach with K and T constraints
- simplexktdmax - simplex algorithm-based approach with K, T and D constraints, with the workers number approximation strategy as explained in Chapter 4.5.2.3
- simplexktdhalf - simplex algorithm-based approach with K, T and D constraints, with the loosened workers number approximation strategy


[Plot: transfer speeds of Input traffic 1, Input traffic 2 and Input traffic 3 and the Processing speed of the destination, over time in seconds.]

Figure 17: A graph presenting the situation when the destination node receives data from three sources at given speeds during a given period of time

[Plot: Size of input per second and the Processing limit, over time in seconds.]

Figure 18: A graph presenting the sums of incoming data per second during a given period of time

[Plot: Processing of input and the Processing limit, over time in seconds.]

Figure 19: A graph presenting what would be the processing speeds of an algorithm if data came with expected speeds

5.3.1 Analysis of Simplex Algorithm Constraints in a Simple Mathematical Model

Constraints in the implementation were created as shown in the pseudo code in Chapter 4.5.2.3. To better understand the measurements that were going to be the outcome of the system tests, an analysis was first performed on simple mathematical models in a simplex solving application [sim]. One of them is explained below. The symbols follow the rules presented in Chapter 3.6. There is a minimization problem z:

min z = x11 + 2x12 + x21 + 3x22 + 4x31 + x32 + 4x41 + x42.

That means there are four sources and two destinations. The first two sources are placed in Micro-cloud 1, so transfers on connections x11 and x21 are cheaper than the ones heading to Micro-cloud 2. The other sources are placed in Micro-cloud 2. The problem might be subject to constraints of type K:

x11 + x12 = 10,
x21 + x22 + x31 + x32 = 8,
x41 + x42 = 4.

They would mean that some key of size 10 can be retrieved from source 1, a key of size 8 can be retrieved from sources 2 and 3, and another one of size 4 can be retrieved from source 4. Constraints of type T could look as follows:

x11 - 1.25x21 - 1.25x31 = 0,
x12 - 1.25x22 - 1.25x32 = 0,
x11 - 2.5x41 = 0,
x12 - 2.5x42 = 0.

They would set the rule of the StreamMine3G data partitioner that the data of every key is split in the same manner between all of the destinations. The problem might as well be subject to constraints of type D:

x11 + x21 + x31 + x41 ≤ 20,
x12 + x22 + x32 + x42 ≤ 20.

They would mean that the transfer limit to each destination is equal to 20. Constraints of type L were excluded from testing. (A sketch of feeding this example model to an off-the-shelf LP solver is given after the list below.) The main outcomes of this basic analysis of mathematical models were:
- A minimization problem subject only to constraints K finds connection paths totally independently of each other. That means it chooses paths as if all of the source traffic were going to one destination. In practice, that could mean it successfully chooses the connection paths within the currently cheapest Micro-clouds, but the traffic costs and bandwidth limits that would appear between them in practice, according to the topology of word count, could kill the positive sides of the outcome.
- A minimization problem subject to constraints K and D has the same problems as described above, but it correctly obeys the rule of avoiding too much traffic into the destinations.
- A minimization problem subject to constraints K and T succeeds in dividing the traffic of every picked source between the destinations in the same manner, but with a limitation to one situation: for every key always one source is chosen, and in the whole solution always exactly one destination is chosen.


- A minimization problem subject to constraints K, T and D succeeds in dividing the traffic of the sources of every key between the destinations, but fails in dividing the traffic of every source separately (when the constraints do not let all the traffic go to the best destination match). Therefore, extending the transfers-division-aware algorithm to be destination-limits-aware might produce outcomes that are worse after the conversion to an operator placement problem solution. Probably the same problems would be produced by adding L constraints to K and T.
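
The analysis above was done with the web-based simplex tool [sim]; purely for illustration, the same example model (objective plus K, T and D constraints) can also be handed to an off-the-shelf LP solver such as Apache Commons Math. This is a sketch, not the thesis implementation:

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.optim.PointValuePair;
import org.apache.commons.math3.optim.linear.LinearConstraint;
import org.apache.commons.math3.optim.linear.LinearConstraintSet;
import org.apache.commons.math3.optim.linear.LinearObjectiveFunction;
import org.apache.commons.math3.optim.linear.NonNegativeConstraint;
import org.apache.commons.math3.optim.linear.Relationship;
import org.apache.commons.math3.optim.linear.SimplexSolver;
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType;

public class ExampleLpModel {
    public static void main(String[] args) {
        // Variable order: x11, x12, x21, x22, x31, x32, x41, x42
        LinearObjectiveFunction z = new LinearObjectiveFunction(
                new double[]{1, 2, 1, 3, 4, 1, 4, 1}, 0);

        List<LinearConstraint> c = new ArrayList<>();
        // K constraints: key sizes retrievable from each source (group)
        c.add(new LinearConstraint(new double[]{1, 1, 0, 0, 0, 0, 0, 0}, Relationship.EQ, 10));
        c.add(new LinearConstraint(new double[]{0, 0, 1, 1, 1, 1, 0, 0}, Relationship.EQ, 8));
        c.add(new LinearConstraint(new double[]{0, 0, 0, 0, 0, 0, 1, 1}, Relationship.EQ, 4));
        // T constraints: every key is split between destinations in the same proportion
        c.add(new LinearConstraint(new double[]{1, 0, -1.25, 0, -1.25, 0, 0, 0}, Relationship.EQ, 0));
        c.add(new LinearConstraint(new double[]{0, 1, 0, -1.25, 0, -1.25, 0, 0}, Relationship.EQ, 0));
        c.add(new LinearConstraint(new double[]{1, 0, 0, 0, 0, 0, -2.5, 0}, Relationship.EQ, 0));
        c.add(new LinearConstraint(new double[]{0, 1, 0, 0, 0, 0, 0, -2.5}, Relationship.EQ, 0));
        // D constraints: transfer limit of 20 into each destination
        c.add(new LinearConstraint(new double[]{1, 0, 1, 0, 1, 0, 1, 0}, Relationship.LEQ, 20));
        c.add(new LinearConstraint(new double[]{0, 1, 0, 1, 0, 1, 0, 1}, Relationship.LEQ, 20));

        PointValuePair solution = new SimplexSolver().optimize(
                z, new LinearConstraintSet(c), GoalType.MINIMIZE, new NonNegativeConstraint(true));

        System.out.println("min z = " + solution.getValue());
        for (double x : solution.getPoint()) System.out.print(x + " ");
    }
}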

5.3.2 Tests of the System Implementation

To see how the simplex algorithm-based algorithms work in practice, bound together with the conversions to the operator placement problem, and to compare their outcomes with the outcomes of the greedy approaches, tests were run. Placements for queries using data from various numbers of keys of the mongoDB storage were run. Placements using real-time data sources were skipped, as those using only them are not interesting enough, and the mixed ones, because of some simplifications in the implementation, could give outcomes that are not easily analyzable. The greedy algorithm using randomization was run three times to get a bigger picture of the solutions it is able to find.

5.3.2.1 Analysis of the Price- and Time-Awareness of All the Algorithms

All of the algorithms were employed to find the best solutions for operator placements at different starting times for two different key combinations. In the second combination, there was a Micro-cloud that contained all of the chunks of every key. The comparison of prices and times for all of the algorithms is presented in Figures 20 and 21. The values given on the Y axis are multiples of the price/time that was computed with the simplexkt algorithm. As can be seen, on average simplexkt is clearly the algorithm working best for this specific problem of finding placements for the implemented version of the word count algorithm. It finds solutions better than the best ones found by the random greedy algorithm. For the first group of keys, simplexktdhalf is able to find solutions with prices at the level of the best ones found by greedy. What is interesting is that greedyp does not always find either the cheapest or the fastest solutions of those that can be found using the random greedy approach. Simplex algorithms not using constraints of type T achieve very bad outcomes, even worse than the average outcomes of the greedy algorithm. simplexktdmax was able to find faster but more expensive solutions than simplexktdhalf. Both of them had limits set on a level that did not block the solutions from being different than the ones found by simplexkt.

5.3.2.2 Comparison of Outcomes for Different Pricing Profiles

To see what influence the structure of the profiles has on the outcome, a second variant of profiles was used. The thing that was changed was the factor between the average execution price and the average transfer price: it was increased 100 times. The test for every profile was performed for two combinations of keys, one with much less data than the other. The outcomes do not lead to clear results; they are presented in Figures 22 and 23. It is clear that simplexkt is the best algorithm for both types of profiles. There is no algorithm that could be said to be getting closer to simplexkt in price- and time-awareness when the pricing profiles structure changes. Only a few small comments can be made after a deeper analysis. simplexkt was able to find a cheaper solution (but taking a bit more time) than the allinone algorithm for the changed profile on the day when processing in the Micro-cloud containing all of the data was expensive. That means it can really fit into various situations. Also the simplexktdhalf algorithm, even if its solutions were always much more expensive, was after the change of profiles sometimes able to find faster solutions than those found by simplexkt.

5.3.2.3 Analysis of the Influence of Worker Number Changes on Prices and Times

[Bar chart: normalized price per algorithm for keys combination 1 and keys combination 2.]

Figure 20: Comparison of geometric means of execution prices counted by algorithms in comparison to prices being an outcome of simplexkt algorithm

[Bar chart: normalized time per algorithm for keys combination 1 and keys combination 2.]

Figure 21: Comparison of geometric means of execution times counted by algorithms in comparison to times being an outcome of simplexkt algorithm


[Bar chart: normalized price per algorithm for the standard and changed pricing profiles, each with fewer and more keys.]

Figure 22: Comparison of geometric means of execution prices counted by algorithms in comparison to prices being an outcome of the simplexkt algorithm for various numbers of keys and various pricing profiles

[Bar chart: normalized time per algorithm for the standard and changed pricing profiles, each with fewer and more keys.]

Figure 23: Comparison of geometric means of execution times counted by algorithms in comparison to times being an outcome of the simplexkt algorithm for various numbers of keys and various pricing profiles

[Bar chart: proportion between simplexktdhalf and simplexkt execution, shown separately for time and price, with the normal and the doubled number of hosts.]

Figure 24: Change of proportions of the execution price and time of solutions created using simplexkt and simplexktdhalf when the number of hosts was doubled

During the analysis of the measurement outcomes, it appeared that the algorithm picking the hosts of every Micro-cloud for processing would often like to pick more hosts than were present in all of the chosen Micro-clouds taken together. It was then checked what the outcome would be for one of the key combinations (the one representing the greater amount of data) if the number of hosts for processing was doubled. Figure 24 presents how the proportion of the price and the time between the outcomes of simplexkt and simplexktdhalf changed when the number of hosts was doubled. It appeared that the proportion for both price and time was halved. In fact, for most of the situations simplexktdhalf was finding faster solutions; only for a few was the execution time much longer. What appeared was that the simplexktdhalf algorithm caused the longest keys to be cut in half, which in consequence lowered the retrieval time. And its strategy of allowing a few Micro-clouds to be chosen as destinations was able to deal with the increased amount of data transfer well.

5.3.3 Summary of the Measurements

The tests show that only one algorithm was found that consistently works on a satisfying level for the specific type of operator placement problem: simplexkt. The others, though sometimes giving decent outcomes (especially considering the time of the execution), are not that predictable. The simplexkt algorithm has, though, a few disadvantages:
- It is built up specifically for the problem of processing word count with data sources in a topology where data is partitioned equally to all of the destinations. It would not have such a positive outcome in other situations.
- It does not allow setting constraints like retrieval limits on sources or transfer limits to destinations, which might be needed in some cases.
Anyway, the model of transforming a simplified form of the operator placement problem into a linear programming problem, extended by transformations that follow it to restore the previous complexity, proved to be good at least in some cases. Fortunately, there are possible extensions that could lead to more complex and more universal definitions of the operator placement problem.


5.3.4 Evaluations of Algorithms based on Measurements

The main results of the search for optimal operator placement solutions by presenting the problem's simplified form as a linear programming problem were:
- As the outcomes of the measurements showed, a very good algorithm, simplexkt, was found for the case when all of the sources are about to send retrieved data to all of the worker operator slices. This would be a solution for the queries that can be resolved with one worker operator that reduces the data size so significantly that a transfer to the sink can already be excluded from consideration.
- The simplexk algorithm was presented that, because of its structure, would find the best solutions in the case when all data retrieved from a source could be sent only to a slice of the worker operator placed in one Micro-cloud. This could be the case when the worker operator is e.g. a filter and, likewise, reduces the data size so significantly that a transfer to the sink can already be excluded from consideration.
- Problems defined with constraints on the destinations' processing were sometimes able to find solutions that were faster and simultaneously comparable in price with those without the constraints. However, it was shown in a mathematical model that they do not work together with the equal transfer division constraints (those of simplexkt), and the outcomes were not clear enough to be well converted to solutions with the created components.
- Conversions from the simplex algorithm results to ready solutions often appeared not to match the further outcomes. The number of worker operator slices needed in the system was often approximated really badly.
Those observations lead to three main conclusions:
- The algorithms based on the linear programming model are able to deal with specific problems, but a congruent, possibly universal mathematical model is lacking for finding solutions for a possibly wide variety of problems.
- Also the smaller components of the system, like those approximating the execution time or the needed number of workers (either before or after the run of the simplex algorithm), should have their mathematical models defined to determine their exact implementation.
- The system is created in a nicely decoupled way, so that it is possible to determine the parts that could be redefined in a better way. However, there should probably be more feedback passed between components.
It is also important to remember that the current model is built on many simplistic and even false assumptions. Some of them are:
- There is a possibility of finding some reliable processing speed for every worker algorithm independently of the content of the incoming data.
- Every node-to-node connection might possibly use the full input and output bandwidth of the Micro-clouds.
- There is one VM on every machine, and it is able to use the full bandwidth when reading the disk, the full processing speed and the full connection bandwidth.
- Co-locating an access operator slice with a worker operator slice would not slow down the retrieval or the processing speed of either of them.
These and many other assumptions would need to be checked and corrected to obtain a correct model.

6 Future Work and Conclusion

6.1 Future Work

In this chapter, approaches are outlined to make the placer algorithms more aware of all of the query definitions that may appear. This can be done either through the usage of heuristics or through the transformation of the current linear programming (LP) problem into a mixed integer linear programming (MILP) problem.

6.1.1 Using Metaheuristics for Complex Constraints

A way of building up an algorithm that is more aware of various constraints is to use metaheuristics. They would enable searching for an optimal solution for a precisely defined problem in a wider set that includes infeasible solutions as well. An example of finding an LP-feasible relaxation of a problem and using heuristics to search through a set of solutions to find the best ones that are feasible for the basic problem is presented in [Sun98]. There, a term of total infeasibility is introduced to define how far the optimal LP solution found on a given graph is from a feasible solution of the given problem. Also, the terms characteristic for the used tabu search procedure have to be defined, like neighborhood and moves. They determine how the algorithm would walk through the solutions by changing the graph. The possible moves could be:
- removing chosen sources, chosen connections or chosen destinations by setting a transfer value (or the values of a group of transfers) to zero (or removing them from the graph),
- setting limits on the chosen connections.
Heuristics would be good in the sense that they would enable a feedback between the components, so that every parameter (like the approximated execution time) would be set with the maximal possible precision at every step. An example of how heuristics could be used is finding solutions obeying the processing limits of destination Micro-clouds. It is not a simple task to set the limit of data that can be transported to one Micro-cloud in the way it is done right now (depending on the workers number approximation). Moreover, it might sometimes be the case that a solution that exceeds those limits a bit and has some delays would be cheaper and even faster than other solutions. Instead of setting that constraint, after every run of the algorithm the outcome could be checked for its feasibility (whether there are no Micro-clouds in which more worker slices than possible would need to be deployed). One assumption would make it easier to rate the feasibility concerning processing limits. Right now it is a complex process that takes total transfers into account. If it was assumed that whenever one key is located together with another key on one host, then they are together on every other host as well (as it is with sharding in mongoDB), those keys could be grouped into one key as an input to the problem-solving algorithm. Then it could be assumed that every host chosen in a solution retrieves every one of (those grouped) keys right from the query processing start. The processing limits could consequently be checked by analyzing only one second of the processing, summing the amount of data that can be processed during every second in that Micro-cloud (a sketch of such a check is given below). That seems to be a much better solution than the one applied right now.
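
A minimal sketch of that per-second feasibility check under the grouped-keys assumption (all names and units are illustrative):

import java.util.Map;

public final class ProcessingLimitCheck {

    /**
     * @param incomingBytesPerSecond  total data per second entering each destination Micro-cloud
     * @param hostsPerMicroCloud      number of hosts available for worker slices in each Micro-cloud
     * @param bytesPerSecondPerWorker processing speed of one worker slice
     * @return true if no Micro-cloud would need more worker slices than it has hosts
     */
    public static boolean isFeasible(Map<String, Double> incomingBytesPerSecond,
                                     Map<String, Integer> hostsPerMicroCloud,
                                     double bytesPerSecondPerWorker) {
        for (Map.Entry<String, Double> e : incomingBytesPerSecond.entrySet()) {
            int hosts = hostsPerMicroCloud.getOrDefault(e.getKey(), 0);
            double capacity = hosts * bytesPerSecondPerWorker;  // what the Micro-cloud can absorb per second
            if (e.getValue() > capacity) {
                return false;                                   // more worker slices needed than hosts available
            }
        }
        return true;
    }
}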

6.1.2 Extended Mathematical Model with a Mixed Integer Linear Programming

During the final considerations of the problem it came out that some subproblems of the operator placement problem are actually decision problems. They can be represented only as integer variables. Therefore the linear programming model used so far would need to be extended to mixed integer linear programming (MILP).

6.1.2.1 Splitting Data to be Retrieved with MILP

The fact that the problem is a mix of integer and linear parts can be shown on the sending of the retrieved data. In Chapter 3.6.2 it was explained that the problem is linear, because data can be divided linearly between output connections. That is of course true. On the other hand, for mongoDB (not for HDFS) there is a decision problem of choosing the source from which every chunk is retrieved; a single chunk cannot be split. Earlier it was assumed that rounding the outcomes would be enough. But in some cases outcomes were observed that could not be rounded in a way that would not put the solution far from the optimum.


An extended version would look as follows. Let us assume there is a key of size k1 identifying data divided into a number of chunks n1. It is stored on sources s1 and s2. Each of the sources has two output connections, consecutively x11, x12, x21 and x22. There are two additional variables y1 and y2. The variables x are floating-point, the variables y are integers. The problem of retrieving whole chunks and then splitting them equally over the connections could then be defined as:

x11 + x12 + x21 + x22 = k1,
x11 + x12 = y1,
x21 + x22 = y2,
y1 + y2 = n1.

Such a presentation would also perfectly fit the decision of picking hosts for real-time data sources, as there all data has to be retrieved from one place (consequently, n1 would equal 1).

6.1.2.2 Determining the Number of Micro-clouds for Processing

Usage of MILP would bring the possibility of setting the maximum number of Micro-clouds to be chosen as destinations of the data streams. This would force the algorithm to find solutions where the sources send cheaply to a few of the cheapest Micro-clouds. The number of chosen Micro-clouds could then be seen as the maximum number of slice groups that would receive data from the sources. To show an example, there are two sources, one of which stores a key k1 and the second a key k2. There are also two destination Micro-clouds d1 and d2. The transfer size from source i to destination j is denoted as xij. There are two extra boolean variables y1 and y2. To guarantee that all of the data of both keys reaches the same destination Micro-cloud, one could define five constraints:

x11 + x12 = k1,
x21 + x22 = k2,
x11 + x21 ≤ (k1 + k2) · y1,
x12 + x22 ≤ (k1 + k2) · y2,
y1 + y2 = 1.

6.1.3 Modeling solutions for queries with more levels of worker operators

There might be cases where some algorithms performed on the data would execute cheaper and faster if a few levels of worker operators were considered. Let us consider the simple example of word count that has data coming to the worker operator (reducer) partitioned in a way that not one slice but a group of slices can receive a word with a hash from some domain. Then, a two-level partitioner would need to be employed. On the first level, there would be location-based partitioning (every chosen Micro-cloud would be one partition). On the second, there would be hash-based partitioning. When a slice finishes its work (or once every period of time), it would send its state to the sink operator, which does the additional work of summing up the outcomes of all of the worker operator slices. To find the optimal solution for a placement of the slices of every operator with an awareness of the course of the whole algorithm, the minimization function would need to be extended by the costs of sending data between worker slices and a sink operator slice. The output data size from every Micro-cloud can only be estimated if there is some data size reduction factor defined for the worker algorithm, expressing how many times the data size on the output of the algorithm is smaller than on its input (as presented in the general formulation of the problem in 3.2). Costs of potential

transfers from worker slices to the sink would be counted as the transfer price of such an amount of data between Micro-clouds added to the execution price of the sink operator in its Micro-cloud. The algorithm could be forced to put the sink operator in only one Micro-cloud with the constraints defined in Chapter 6.1.2.2. Also, it could be decided to put worker operator slices in some maximal number of Micro-clouds (e.g. 2 or 3), thereby bounding the maximum number of first-level partitions.

6.1.4 Even Transfer Distribution over Connections

The definition of constraint T, which allows finding solutions fitting the way the operators are placed to solve the word count problem in the current configuration of StreamMine3G, has its disadvantage. It splits the transfers between destinations in the same manner for every key, not for every source, as it is supposed to (therefore constraint D was sometimes able to erase the advantages of constraint T). So far, a way to express an even transfer distribution over connections using mixed integer linear programming has not been found. Possibly, it is already a mixed integer non-linear programming problem.

6.2 Conclusion

The main goals stated for the thesis were accomplished. The advantages and possible drawbacks of Micro-cloud environments were analyzed based on the available knowledge. According to that, the usability of two distributed storage technologies was compared. It appeared that the freedom of defining replicas and shards arbitrarily makes it possible to fit mongoDB well into the specifics of a Micro-cloud environment.

In the next phase, the prototype of the processing framework based on the StreamMine3G event processing engine, to be deployed on the nodes of Micro-clouds, was built. It consists of an access operator, worker operators and a manager. The access operator is implemented so that the information on which source to use, and how, is given as parameters. That makes it very usable and easily extensible. The worker operators are implemented to solve one exemplary algorithm (word count) using the map-reduce paradigm. For the manager, a standard input was created, so that it can build topologies based on it. The whole manager is implemented in an asynchronous manner, able to react to changes in the system. The manager was successfully tested for deploying, launching and cleaning up topologies as defined in the input. The access operator was able to read in the same manner from both historic and real-time data sources. The worker operators were able to count words using the map-reduce paradigm.

The part that required the most effort was building the component that allows running algorithms solving the operator placement problem. The way of defining and running queries from the outside was determined. The component specifying the actual mappings of the data needed to execute the queries to system hosts was implemented. The operator placement search was divided into phases. Solutions being the outcomes of the operator placement algorithms were standardized, and the algorithm for approximating the price and the time of solutions was proposed. A component translating solutions into the standard StreamMine3G manager input completed the implementation. Thereby, a stable, extensible testing environment for the next operator placement algorithms was created.

Finally, on top of the built-up framework, the algorithms proposed for solving the operator placement problem could be run, evaluated and compared. They were checked for how well they deal with one specific placement problem. It appeared that it is possible to define a simplified version of the operator placement problem as a linear programming problem and find very good solutions using the simplex algorithm. The algorithm designed specifically for the tested placement problem was clearly giving the best solutions, a few times cheaper and faster than the best ones found by the simple greedy algorithm that was implemented. Other simplex algorithm-based solutions, although not designed strictly for the tested topology, were sometimes finding interesting solutions; otherwise they were comparable with the average outcomes of the greedy algorithms.

The work was completed with proposals for future work on developing the operator placement algorithms. By defining the problem more strictly, using integer variables together with linear ones for constraint definitions, and by using heuristics to deal with other constraints, they are expected to give a way to find good solutions for a wide range of specific placement problems.

References
[bat] Batch processing. http://en.wikipedia.org/wiki/Batch_processing. Accessed: 2013-06-10.
[ccw] Cloud computing. http://en.wikipedia.org/wiki/Cloud_computing. Accessed: 2013-06-10.
[Chu] S. Chu. Transportation problems. Linear Programming lecture notes.
[daa] Data as a service. http://en.wikipedia.org/wiki/Data_as_a_service. Accessed: 2013-06-07.
[dfs] Clustered file system: distributed file systems. http://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems. Accessed: 2013-06-10.
[dod] Document-oriented database. http://en.wikipedia.org/wiki/Document-oriented_database. Accessed: 2013-06-10.
[dod10] Comparing document databases to key-value stores. http://nosql.mypopescu.com/post/659390374/comparing-document-databases-to-key-value-stores, June 2010. Accessed: 2013-06-10.
[ec2] Amazon EC2 spot instances. http://aws.amazon.com/ec2/spot-instances/. Accessed: 2013-06-10.
[EK] Eric Knorr, Galen Gruman. What cloud computing really means. http://www.infoworld.com/d/cloud-computing/what-cloud-computing-really-means-031. Accessed: 2013-06-10.
[esp] Event stream processing. http://en.wikipedia.org/wiki/Event_stream_processing. Accessed: 2013-06-10.
[Gla12] James Glanz. Power, pollution and the internet. The New York Times, September 2012.
[had] Apache Hadoop. http://en.wikipedia.org/wiki/Apache_Hadoop. Accessed: 2013-06-10.
[hdfa] HDFS C++ library header file. https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20/src/c++/libhdfs/hdfs.h. Accessed: 2013-05-30.
[hdfb] Understanding Hadoop clusters and the network. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/. Accessed: 2013-05-30.
[hdf11] An introduction to the Hadoop distributed file system. http://www.ibm.com/developerworks/library/wa-introhdfs/, February 2011. Accessed: 2013-06-10.
[hop] HOP: Hadoop Online Prototype. https://code.google.com/p/hop/. Accessed: 2013-06-10.
[ibm] What is MapReduce? http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/. Accessed: 2013-06-10.
[int] An introduction to document databases. http://weblogs.asp.net/britchie/archive/2010/08/12/document-databases.aspx. Accessed: 2013-06-10.
[int13] Internet 2012 in numbers. http://royal.pingdom.com/2013/01/16/internet-2012-in-numbers/, January 2013. Accessed: 2013-06-10.
[Kle10] Wilhelm Kleiminger. Stream processing in the cloud. June 2010.
[Koo11] Jonathan Koomey. Growth in data center electricity use 2005 to 2010. Analytics Press, August 2011.
[lea] Large-scale elastic architecture for data as a service. http://www.leads-project.eu/. Accessed: 2013-06-10.
[mdba] mongoDB GridFS. http://docs.mongodb.org/manual/core/gridfs/. Accessed: 2013-05-30.
[mdbb] mongoDB read preference. http://docs.mongodb.org/manual/core/read-preference/. Accessed: 2013-05-30.
[mdbc] mongoDB replica set architectures and deployment patterns. http://docs.mongodb.org/manual/core/replica-set-architectures/. Accessed: 2013-05-30.
[mdbd] mongoDB sharded cluster overview. http://docs.mongodb.org/manual/core/sharded-clusters/. Accessed: 2013-05-30.
[mdbe] mongoDB tag aware sharding. http://docs.mongodb.org/manual/core/tag-aware-sharding/. Accessed: 2013-05-30.
[mon] mongoDB mongos. http://docs.mongodb.org/manual/reference/program/mongos/. Accessed: 2013-06-07.
[nut] Nutch. http://en.wikipedia.org/wiki/Nutch. Accessed: 2013-06-10.
[sim] Finite mathematics utility: simplex method tool. http://www.zweigmedia.com/RealWorld/simplex.html. Accessed: 2013-06-10.
[sm3] StreamMine3G overview & concepts. https://streammine3g.inf.tu-dresden.de/trac/wiki/OverviewAndConcepts. Accessed: 2013-06-07.
[SPB77] Stephen P. Bradley, Arnoldo C. Hax, Thomas L. Magnanti. Applied mathematical programming. February 1977.
[Sun98] Minghe Sun. A tabu search heuristic procedure for solving the transportation problem with exclusionary side constraints. Journal of Heuristics, 3(4):305-326, March 1998.
[web] Web crawler. http://en.wikipedia.org/wiki/Web_crawler. Accessed: 2013-06-10.
[wir] MapReduce. http://en.wikipedia.org/wiki/MapReduce. Accessed: 2013-06-10.
[zk] Apache ZooKeeper. http://zookeeper.apache.org/. Accessed: 2013-06-07.
