
Parallel Computing

Parallel processing is the ability to carry out multiple operations or tasks simultaneously [1]. Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel").
Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task.
Background
Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit on one computer. Only one instruction may execute at a time; after that instruction is finished, the next is executed.
Parallel computing, on the other hand, uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above [2].

Parallel computing is used both to tackle tasks of a complex nature and to perform routine parallelizable jobs. With parallel computing, a number of computations can be performed at once, bringing down the time required to complete a project. Parallel processing is particularly useful in projects that require complex computations, such as weather modeling and digital special effects.
With the help of parallel computing, highly complicated scientific problems that are otherwise extremely difficult to solve can be solved effectively. Parallel computing is most effective for tasks that involve a large number of calculations, have time constraints, and can be divided into a number of smaller tasks.
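As a toy illustration of this decomposition, the following Python sketch divides a sum-of-squares computation into independent chunks and runs them on a small worker pool (all names are illustrative, not from the original text):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """One independent sub-task: sum of squares over its own slice."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(numbers, workers=4):
    """Divide the input into chunks and combine the partial results."""
    size = max(1, len(numbers) // workers)
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum_of_squares(list(range(10))))  # 285
```

Threads are used only to keep the sketch self-contained; CPU-bound work in Python would normally use processes (or another runtime) to get a true parallel speedup.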


The type of parallel computing we will be focusing on is distributed computing.
Distributed computing
Distributed computing is a field of computer science that studies distributed systems. A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers. A distributed system may have a common goal, such as solving a large computational problem. Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users [4].

An important paradigm in distributed parallel computing is Google's MapReduce.
MapReduce
MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers [5].

MapReduce was developed within Google as a mechanism for processing large amounts of raw data, for example, crawled documents or web request logs. This data is so large that it must be distributed across thousands of machines in order to be processed in a reasonable time. This distribution implies parallel computing, since the same computations are performed on each CPU but with a different dataset. MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance.


Overview
MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
"ap" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to
worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node
processes the smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way
to Iorm the output the answer to the problem it was originally trying to solve.
MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle: a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
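A minimal single-process sketch of the map and reduce steps, using word counting as the example problem (function names are illustrative):

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    return [(word, 1) for word in document.split()]

def reduce_fn(key, values):
    """Reduce: combine all counts emitted for one intermediate key."""
    return (key, sum(values))

def map_reduce(documents):
    # Map phase: each document could be processed by a different worker.
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    # Reduce phase: each key's values are presented to a single reducer.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

print(map_reduce(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

Because map_fn is independent per document and each key's values go to a single reducer, both loops could be distributed across machines exactly as described above.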




MapReduce Execution Overview
The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits or shards. The input shards can be processed in parallel on different machines.
Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.
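The partitioning function can be sketched directly; a hypothetical example with R = 4 partitions:

```python
def partition(key, R):
    """Route an intermediate key to one of R reduce partitions: hash(key) mod R."""
    return hash(key) % R

R = 4
keys = ["apple", "banana", "apple", "cherry", "banana"]
buckets = [[] for _ in range(R)]
for k in keys:
    buckets[partition(k, R)].append(k)
# Repeated keys always hash to the same bucket, so one reducer
# sees every value emitted for a given key.
```

Python's built-in hash is stable within a process, which is all the sketch needs; a real implementation would use a deterministic hash so that every machine agrees on the routing.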
The illustration below shows the overall flow of a MapReduce operation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in the illustration correspond to the numbers in the list below).
1. The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
After successful completion, the output of the MapReduce execution is available in the R output files.
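The whole sequence can be mimicked in a few lines of single-process Python. The driver below is a toy model of the steps above (shard into M pieces, map, partition into R regions, sort, reduce), not the real library:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, M=3, R=2):
    # Step 1: shard the input into M pieces.
    shards = [records[i::M] for i in range(M)]
    # Steps 3-4: map tasks emit pairs, partitioned into R regions.
    regions = [defaultdict(list) for _ in range(R)]
    for shard in shards:  # one map task per shard
        for record in shard:
            for key, value in map_fn(record):
                regions[hash(key) % R][key].append(value)
    # Steps 5-7: each reduce task sorts its keys and produces one output "file".
    return [{k: reduce_fn(k, region[k]) for k in sorted(region)}
            for region in regions]

outputs = run_mapreduce(
    ["a b", "b c", "a"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
# outputs holds R dictionaries, one per reduce partition.
```

Each dictionary in the result plays the role of one of the R output files; merging them yields the complete word counts.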
To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
Completed map tasks are re-executed when failure occurs because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed, since their output is stored in a global file system [8].


Uses
MapReduce is useful in a wide range of applications, including distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation. Moreover, the MapReduce model has been adapted to several computing environments, such as multi-core and many-core systems, desktop grids, volunteer computing environments, dynamic cloud environments, and mobile environments.
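Distributed grep, for instance, maps a pattern match over every input line and uses an identity-like reduce. A sketch (the pattern and log data are made up for illustration):

```python
import re

PATTERN = re.compile(r"error")  # hypothetical search pattern

def grep_map(line):
    """Map: emit the line if it matches the pattern, nothing otherwise."""
    return [line] if PATTERN.search(line) else []

def grep_reduce(emitted):
    """Reduce for grep is essentially the identity: pass matches through."""
    return list(emitted)

log = ["ok", "error: disk full", "ok", "fatal error"]
matches = grep_reduce(m for line in log for m in grep_map(line))
print(matches)  # ['error: disk full', 'fatal error']
```

Spread across many machines, the map calls scan disjoint shards of the logs in parallel, which is what makes grep over terabytes practical.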
At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web. It replaced the old ad hoc programs that updated the index and ran the various analyses [5].


Hadoop (MapReduce framework by Apache)
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware [9]. Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware [10].

The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework [9].
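Applications need not be written in Java: Hadoop Streaming runs any executable that reads lines on stdin and writes key<TAB>value lines on stdout. A sketch of a streaming word-count pair (the file name and invocation are illustrative):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' for each word on each line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: input arrives sorted by key, so consecutive
    lines with the same word can be summed with groupby."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. hadoop jar hadoop-streaming.jar -mapper 'wordcount.py map' ...
    stream = mapper if sys.argv[1] == "map" else reducer
    sys.stdout.writelines(line + "\n" for line in stream(sys.stdin))
```

Hadoop sorts the mapper output by key before it reaches the reducer, which is why a groupby over consecutive lines is sufficient here.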

Architecture
The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server, or Jobtracker, and several slave servers, or Tasktrackers, one per node in the cluster. The Jobtracker is the point of interaction between users and the framework. Users submit map/reduce jobs to the Jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The Jobtracker manages the assignment of map and reduce tasks to the Tasktrackers. The Tasktrackers execute tasks upon instruction from the Jobtracker and also handle data motion between the map and reduce phases.
Hadoop DFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
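A toy model of that block layout: split a file into fixed-size blocks and place each block's replicas on distinct datanodes. The round-robin placement here is a simplification; real HDFS placement is rack-aware.

```python
def place_blocks(file_size, block_size, replication, datanodes):
    """Split a file into fixed-size blocks (only the last may be smaller)
    and assign each block to `replication` distinct datanodes, round-robin."""
    n_blocks = (file_size + block_size - 1) // block_size
    placement = []
    for b in range(n_blocks):
        size = min(block_size, file_size - b * block_size)
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        placement.append((size, replicas))
    return placement

# A 160 MB file with 64 MB blocks and replication factor 3:
layout = place_blocks(160, 64, 3, ["dn1", "dn2", "dn3", "dn4"])
# blocks of 64, 64, and 32 MB, each held on 3 distinct datanodes
```

With replicas on distinct nodes, the loss of any single datanode leaves every block still readable, which is what makes failure handling transparent.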
Architecture
Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on. The Namenode makes filesystem namespace operations, such as opening, closing, and renaming of files and directories, available via an RPC interface. It also determines the mapping of blocks to Datanodes.
The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
Scalability
The intent is to scale Hadoop up to handling thousands of computers [11].

Some Companies Using Hadoop [12]

About.com
Amazon.com
AOL
Apple
eHarmony
LinkedIn
Meebo
Metaweb
Netflix
The New York Times
Rackspace
Razorfish
StumbleUpon
Twitter

Difference from Us
We have decided to use the MapReduce paradigm as our choice for parallel computing. We will be using the Hadoop framework to run MapReduce jobs on our architecture. The difference here is that Hadoop is designed to run on clusters, while our architecture would be a volunteer computing environment, which is highly volatile and unstable. We would therefore have to modify the Hadoop framework to adapt it to a volunteer computing environment.
Cloud computing
Cloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a utility (like the electricity grid) over a network (typically the Internet).
Public cloud
Public cloud describes cloud computing in the traditional mainstream sense, whereby resources are dynamically provisioned to the general public on a fine-grained, self-service basis over the Internet, via web applications/web services, from an off-site third-party provider who bills on a fine-grained utility computing basis.
Amazon Elastic Compute Cloud
Amazon Elastic Compute Cloud (EC2) is a central part of Amazon.com's cloud computing platform, Amazon Web Services (AWS). EC2 allows users to rent virtual computers on which to run their own computer applications. EC2 allows scalable deployment of applications by providing a Web service through which a user can boot an Amazon Machine Image to create a virtual machine, which Amazon calls an "instance", containing any software desired. A user can create, launch, and terminate server instances as needed, paying by the hour for active servers, hence the term "elastic". EC2 provides users with control over the geographical location of instances, which allows for latency optimization and high levels of redundancy [15].

In November 2010, Amazon made the switch of its own retail website to EC2 and AWS.

References
1. http://en.wikipedia.org/wiki/Parallel_processing
2. http://en.wikipedia.org/wiki/Parallel_computing#Distributed_computing
3. http://www.wisegeek.com/what-is-parallel-processing.htm
4. http://en.wikipedia.org/wiki/Distributed_computing
5. http://en.wikipedia.org/wiki/MapReduce
6. http://code.google.com/edu/parallel/mapreduce-tutorial.html
7. http://en.wikipedia.org/wiki/MapReduce
8. http://code.google.com/edu/parallel/mapreduce-tutorial.html
9. http://wiki.apache.org/hadoop/
10. http://developer.yahoo.com/hadoop/
11. http://wiki.apache.org/hadoop/ProjectDescription
12. http://wiki.apache.org/hadoop/PoweredBy
13. Dean, Jeff and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce-osdi04.pdf
14. White, Tom. Hadoop: The Definitive Guide.
15. http://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud
16. http://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud
17. http://www.infoworld.com/d/cloud-computing/what-cloud-computing-really-means-031
18. http://computer.howstuffworks.com/cloud-computing/cloud-computing.htm
