Table of Contents:
- Introduction
- MapReduce Basic Approach
- What is MapReduce?
- Map Phase
- Combining & Shuffling Phase
- Reduce Phase
- The Model
- MapReduce Implementations
- Examples of Problems Expressed as MapReduce Computations
  - Word Counting
  - Minimum Spanning Tree
  - Page Ranking
- Using Cloud Computing Services for MapReduce
- References
Introduction
Algorithm analysis is about optimizing computational operations toward their
lowest possible cost. However, for a certain set of problems we quickly reach the
limit of such optimization, due to their complexity or their nature. Companies
that dominate the web have to process large sets of data every day, and within
only a few milliseconds. For instance, Facebook has to process millions of news
feeds every day, because user interactions represent the backbone of its social
concept. Google also has to process billions of search queries. Indeed, Google is
the pioneer of storing tremendous amounts of data. It has to store and cache all
the web pages of …
What is MapReduce?
MapReduce is a programming model that uses parallel processing across multiple
computers to process large data sets. Its power resides in two main operations:
the map() and reduce() procedures, both of which are defined by the user. We will
see later on that an optional combining & shuffling stage can be added to optimize
a MapReduce job. The map stage performs an independent record transformation: it
takes a key-value pair (K1, V1) and generates a list of zero or more intermediate
key-value pairs, list(K2, V2). Not all data is structured as key/value pairs;
however, MapReduce can accommodate such data by generating dummy attributes for
the key or value to produce a viable input data set. The reduce function then
aggregates the results of the map phase and outputs zero or more key-value pairs
of type list(K3, V3). The types of the input keys and values may differ from the
types of the output keys and values; the intermediate key and value types produced
by the map function, however, must match the input types of the reduce function.
When a MapReduce task is launched, the output of the map function is distributed
across a cluster of many computers, and the reduce function then collects that
intermediate data back to produce the final result. The end result is a scale-free
programming model: MapReduce code written for 1 MB of data can also handle
terabytes of data and beyond. (See below the flowchart of a MapReduce procedure
for counting words.)
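The word-counting computation can also be sketched as two short functions plus a
minimal driver (a plain-Python sketch, independent of any particular framework;
here `run()` plays the role of both the master and the shuffle stage):

```python
from collections import defaultdict

def word_count_map(_, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def word_count_reduce(word, counts):
    """Reduce: sum all the 1s emitted for this word."""
    yield word, sum(counts)

def run(lines):
    """Driver sketch: split input, map, shuffle (group by key), then reduce."""
    groups = defaultdict(list)
    for i, line in enumerate(lines):          # input pairs: (line number, line)
        for key, value in word_count_map(i, line):
            groups[key].append(value)         # shuffle: collect values per key
    return dict(kv for key in groups for kv in word_count_reduce(key, groups[key]))
```

For example, `run(["the cat sat", "the cat"])` groups the intermediate pairs by
word before the reducers sum them.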
Map Phase
The map phase gives the user an opportunity to operate on every record in the data
set individually. This phase is commonly used to delete unwanted fields, modify or
transform fields, or apply filters. Specific joins and grouping can also be done in the
map (e.g., joins where the data is already sorted or hash-based aggregation). There
is no requirement that for every input record there should be one output record.
Maps can choose to remove records or group multiple records into a single one.
The output of the Map Phase is then sent either directly to the Reduce Phase or
to the combining & shuffling phase.
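As a concrete illustration of filtering and field removal in a mapper, the sketch
below drops every non-error record and strips the timestamp field; the
tab-separated "timestamp, level, message" record format is made up for this
example:

```python
def clean_map(_, record):
    """Map: keep only ERROR-level records and drop the timestamp field.

    `record` is assumed to be "timestamp<TAB>level<TAB>message" -- a
    hypothetical format used only for illustration.
    """
    timestamp, level, message = record.split("\t")
    if level == "ERROR":           # filter: non-error records are removed
        yield level, message       # transform: the timestamp field is dropped
    # no yield otherwise: zero output records for this input record
```

This shows both behaviors mentioned above: some inputs produce no output at all,
and surviving records come out with fewer fields than they went in with.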
Reduce Phase
The input to the reduce phase is each key produced by the map function plus all
of the values associated with that key. Since all records with the same key are
now collected together, it is possible to join and aggregate them. The MapReduce
user explicitly controls parallelism in the reduce function. MapReduce jobs that
do not require a reduce phase can set the reduce count to zero; these are
referred to as map-only jobs, and their map output is written directly as the
final output. The aggregation is what combines all the partial results into a
single one, which represents the final result.
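A reducer for such an aggregation can be as small as the sketch below, which
emits the maximum value observed for each key (an illustrative choice; a sum or
an average follows the same shape):

```python
def max_reduce(key, values):
    """Reduce: receives one key together with every value grouped under
    that key, and emits a single aggregated record for the key."""
    yield key, max(values)
```

Because the shuffle has already grouped all values for a key together, the
reducer never needs to see any other key's data.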
The Model
A MapReduce task is performed in the following steps:
1. Input data, such as a long text file, is split into key-value pairs. These
key-value pairs are then fed to the mapper. This job is done by the
master program.
2. The Map function processes each key-value pair individually and
outputs one or more intermediate key-value pairs.
3. All intermediate key-value pairs are collected, sorted, and grouped by
key (this is done by the combining & shuffling phase).
4. For each unique key, the reduce function receives the key with a list of
all the values associated with it. The reducer aggregates these values
in some way (adding them up, taking averages, finding the maximum,
etc.) and outputs one or more output key-value pairs.
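The four steps above can be simulated on a single machine in a few lines (a
sketch only; a real framework distributes steps 2-4 across a cluster):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(input_pairs, map_fn, reduce_fn):
    # Step 2: apply the map function to every input key-value pair.
    intermediate = [kv for k, v in input_pairs for kv in map_fn(k, v)]
    # Step 3: shuffle & sort -- collect, sort, and group the pairs by key.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        # Step 4: hand each unique key and its full value list to the reducer.
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

# Step 1: the "master" splits the input into (line number, line) pairs.
pairs = list(enumerate(["to be or not", "to be"]))
result = run_mapreduce(
    pairs,
    lambda _, line: ((word, 1) for word in line.split()),
    lambda word, ones: [(word, sum(ones))],
)
```

The driver is generic: any pair of map and reduce functions with the signatures
described above can be plugged in.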
MapReduce Implementations
The MapReduce paradigm is extremely powerful in part because it is implemented by
many frameworks, such as Apache Hadoop, Riak, and Infinispan, that take care of
the details. The user only has to write the map and reduce functions. Issuing
tasks, file I/O, networking between nodes, synchronization, and failure recovery
are all managed by these frameworks, which are themselves written in different
programming languages.
2. Calculating: in this second MapReduce job, we compute the PageRank of each
website. In the map phase, we map each outgoing link to the webpage together
with its current rank and its total number of outgoing links. In the reduce
phase, we calculate the PageRank of each webpage using the formula described
earlier.
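This calculating job can be sketched as follows. The sketch assumes the standard
PageRank update PR(p) = (1 - d)/N + d * sum(PR(q)/out(q)), with damping factor
d = 0.85 and N total pages; the graph encoding, the constants, and the helper
names are illustrative assumptions, and the formula may differ in detail from
the one described on the earlier pages:

```python
from collections import defaultdict

D, N = 0.85, 3  # damping factor and total page count (illustrative values)

def pagerank_map(page, state):
    """Map: `state` is (current_rank, outgoing_links). Emit this page's rank
    share to every page it links to, and re-emit its own link list so the
    graph structure survives the round."""
    rank, links = state  # assumes every page has at least one outgoing link
    yield page, ("links", links)
    for target in links:
        yield target, ("share", rank / len(links))

def pagerank_reduce(page, values):
    """Reduce: sum the incoming rank shares and apply
    PR(p) = (1 - d)/N + d * total_incoming_share."""
    links, total = [], 0.0
    for tag, value in values:
        if tag == "links":
            links = value
        else:
            total += value
    yield page, ((1 - D) / N + D * total, links)

def pagerank_iteration(graph):
    """One full MapReduce round over {page: (rank, links)}."""
    groups = defaultdict(list)
    for page, state in graph.items():
        for key, val in pagerank_map(page, state):
            groups[key].append(val)
    return dict(kv for key in groups for kv in pagerank_reduce(key, groups[key]))
```

Repeating `pagerank_iteration` until the ranks stop changing gives the final
PageRank values; each iteration is one MapReduce job.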
References:
http://aws.amazon.com/elasticmapreduce/
https://courses.cs.washington.edu/courses/cse490h/08au/lectures/algorithms.pdf
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
http://www.slideshare.net/andreaiacono/mapreduce-34478449
http://hadooptutorial.wikispaces.com/Sorting+feature+of+MapReduce
http://chimera.labs.oreilly.com/books/1234000001811/apb.html#overview_mr_distributed_cache
http://blog.xebia.com/2011/09/27/wiki-pagerank-with-hadoop/