
Processing Large Data Set with a Map Reduce Approach

Research paper, part of the Honors Component of CSC3323: Algorithm Analysis

Supervisor: Dr. Hamid HARROUD


Honor Student: Ali ELABRIDI


Table of Contents:
Introduction
MapReduce Basic Approach
What is MapReduce?
Map Phase
Combining & Shuffling Phase
Reduce Phase
The Model
MapReduce Implementations
Examples of problems expressed as MapReduce computations
Word Counting
Minimum Spanning Tree
Page Ranking
Using Cloud Computing Services for Map Reduce

Introduction
Algorithm analysis is about optimizing computational operations to the lowest cost possible. However, for some problems we quickly reach the maximum optimization a single machine can offer, because of their complexity or their sheer scale. Companies that dominate the web have to process large data sets every day, and in only a few milliseconds. For instance, Facebook has to process millions of news feed items every day, because user interactions are the backbone of its social concept. Google has to process billions of search queries; indeed, Google is the pioneer of storing tremendous amounts of data, since it has to store and cache the web pages of the entire World Wide Web, and has done so since its creation. Every website indexed by Google has to be parsed and analyzed to provide users with the most accurate and relevant search results. To do so, the search engine counts the occurrences of specific keywords on a web page to infer that it is the most relevant page for a particular search query. It also ranks website reputation (PageRank): the more a website is referenced by other websites, the more it is considered trustworthy, and the higher it is placed in search results. Both strategies involve heavy computation on a large data set, since they must consider the whole World Wide Web. Unfortunately, no matter how efficient an algorithm is, a single computer, regardless of its performance, cannot, for example, compute the PageRank of a website, simply because it cannot retrieve and count all the incoming links to that website across the whole World Wide Web without running out of memory. The MapReduce approach is specifically well suited to solving large-scale problems such as this one.

MapReduce Basic Approach


To clarify the idea behind MapReduce in a basic manner, consider the following concrete example: suppose we need to count the citizens of a country. Instead of relying on a single person to count the inhabitants one by one, we send a representative to each city who counts the number of people in that specific city (the map phase). At the end, we gather all the representatives from the cities and sum the counts they made to get the total number of people in the country (the reduce phase).


What is MapReduce?
The MapReduce paradigm is a programming model that uses parallel processing on multiple computers to process large data sets. Its power resides in two main operations: the map() and reduce() procedures, both of which are defined by the user. We will see later that we can also add an optional combining and shuffling stage to optimize a MapReduce job. The map stage performs an independent transformation of each record: it takes a key/value pair (K1, V1) and generates zero or more intermediate key/value pairs, list(K2, V2). Not all data is structured as key/value pairs, but MapReduce can accommodate this by generating dummy attributes to produce a viable key/value input. The reduce function then aggregates the results of the map phase and outputs zero or more key/value pairs, list(K3, V3). The input keys and values may be drawn from a different domain than the output keys and values, whereas the intermediate keys and values produced by the map function are from the same domain as the output keys and values. When a MapReduce task is launched, the work of the map function is distributed across a cluster of many computers, and the map output is then collected back by the reduce function to form the final result. The end result is a scale-free programming model: MapReduce code written for 1 MB of data can also handle terabytes of data and beyond. (The type sketch below makes these domains concrete; a full word-counting example appears later in the paper.)
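To make these domains concrete, here is a minimal sketch using Hadoop's Java API; the class names and the concrete Writable types are illustrative choices, not something fixed by the model:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// (K1, V1) = (LongWritable, Text): a byte offset and one line of input.
// (K2, V2) = (Text, IntWritable): the intermediate pairs emitted by map().
// (K3, V3) = (Text, IntWritable): the final pairs emitted by reduce();
// note that (K2, V2) and (K3, V3) share the same domain, as described above.
public class TypeFlow {
    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> { }

    public static class MyReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> { }
}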


Map Phase
The map phase gives the user an opportunity to operate on every record in the data
set individually. This phase is commonly used to delete unwanted fields, modify or
transform fields, or apply filters. Specific joins and grouping can also be done in the
map (e.g., joins where the data is already sorted or hash-based aggregation). There
is no requirement that for every input record there should be one output record.
Maps can choose to remove records or group multiple records into a single one.
The output of the map phase is then sent either directly to the reduce phase or to the combining & shuffling phase.
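As an illustrative sketch of such record-level filtering (the tab-separated record layout and the "ACTIVE" status field are assumptions made up for this example):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: id <TAB> payload <TAB> status <TAB> ...
        String[] fields = line.toString().split("\t");
        if (fields.length >= 3 && fields[2].equals("ACTIVE")) {
            // Emit a trimmed record; records failing the filter emit
            // nothing, so map is not required to be one-in, one-out.
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}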

Combining & Shuffling Phase


The combining & shuffling phase takes advantage of the fact that while the map phase is running, its output is buffered in memory rather than written to disk. A reduce-type function combines pairs with the same key into a single list, then flushes the buffer to leave space for new data produced by the map phase when memory runs out. The combiner class outputs data as if it came from the map function; the only difference is that processing is faster, since some key/value pairs have already been partially aggregated and only need to be merged by the reduce function with the other lists produced by the combiner. For example, a word-count MapReduce application whose map operation outputs (word, 1) pairs as words are encountered in the input can use a combiner to speed up processing. Once a certain number of pairs has been output, the combine operation is called once per unique word, with the list of values available as an iterator. The combiner then emits (word, count-in-this-part-of-the-input) pairs.
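A sketch of the combiner just described; in Hadoop a combiner is written with the Reducer signature, and the class name here is ours:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the buffered (word, 1) pairs and emits
// (word, count-in-this-part-of-the-input) pairs.
public class WordCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int partial = 0;
        for (IntWritable c : counts) {
            partial += c.get();
        }
        context.write(word, new IntWritable(partial));
    }
}

It is registered on the job with job.setCombinerClass(WordCountCombiner.class); because its output has the same type as the map output, the reducer cannot tell whether a pair came from the mapper or from the combiner.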

Reduce Phase
The input to the reduce phase is each key produced by the map function, together with all of the values associated with that key. Since all records with the same key are now collected together, it is possible to join and aggregate them. The MapReduce user explicitly controls parallelism in the reduce phase. Jobs that do not require a reduce phase can set the reduce count to zero; these are referred to as map-only jobs, and their map output is written directly as the final output. The aggregation is what combines all the partial results into a single one, which represents our final result.
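A minimal driver sketch for such a map-only job; the job name is a placeholder, and the identity mapper is our own choice to keep the example self-contained:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only job");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(Mapper.class); // the base Mapper passes records through
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // Zero reducers: the framework skips shuffle and reduce entirely and
        // writes each mapper's output directly as the job's final output.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}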

The Model
A Map Reduce task is performed in the following steps:
1. Input data, such as a long text file, is split into key-value pairs. These
key-value pairs are then fed to the mapper. This job is done by the
master program.
2. The Map function processes each key-value pair individually and
outputs one or more intermediate key-value pairs.
3. All intermediate key-value pairs are collected, sorted, and grouped by
key (done by the combining & shuffling phase).
4. For each unique key, the reduce function receives the key with a list of
all the values associated with it. The reducer aggregates these values
in some way (adding them up, taking averages, finding the maximum,
etc.) and outputs one or more output key-value pairs.
5. Output pairs are collected and stored in an output file (by the master
program).

Figure 1: Execution overview of a MapReduce task
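To make the five steps concrete, the following sketch simulates them in plain Java on a tiny, made-up word-count input, with no Hadoop involved:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapReduceModel {
    public static void main(String[] args) {
        // Step 1: split the input into records, here one line each.
        List<String> records = Arrays.asList("the quick fox", "the lazy dog");

        // Step 2: map each record to intermediate (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = records.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Step 3: shuffle, i.e. group all intermediate values by key.
        Map<String, List<Integer>> grouped = intermediate.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Step 4: reduce, i.e. aggregate each key's values (here, by summing).
        Map<String, Integer> output = new TreeMap<>();
        grouped.forEach((word, ones) ->
                output.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

        // Step 5: emit the final output pairs.
        output.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}

Running it prints each word with its total count; in a real deployment the same map and reduce logic runs on many machines, and step 3 is handled entirely by the framework.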

MapReduce Implementations
The MapReduce paradigm is extremely powerful because it is implemented by many frameworks, such as Apache Hadoop, Riak, and Infinispan, that take care of the details. The user only has to write the map and reduce functions. Task scheduling, file I/O, networking between nodes, synchronization, and failure recovery are all managed by these frameworks, which are written in different programming languages.


Examples of problems expressed as MapReduce computations

In this section, we will go through some concrete applications of MapReduce using Hadoop.
Word Counting:

We want to compute the occurrences, or frequency, of certain words present in a large document. One way of solving the problem is to have a map function that takes as input the name of the document and its content as a key/value pair, and emits each word in the document with an occurrence count of 1. The reduce function then takes each word as a key, together with an iterator over the whole set of values, and emits the total number of occurrences of each specific word. (Find below the Java code, using Hadoop, for word counting with all its dependencies.) By splitting the word-counting task into small tasks done by multiple nodes in our cluster, the output is computed in parallel, which decreases the time needed to do the job.


Figure 2: Java Hadoop code for word counting
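A sketch of that word-count job in Java with Hadoop follows; it is closely based on the canonical WordCount example from the Apache Hadoop MapReduce tutorial and matches the map and reduce behavior described above:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each line of input, emit (word, 1) for every word in it.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: for each word, sum the 1s (or partial counts) and emit the total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // the reducer doubles as combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is submitted with the input and output directories as its two command-line arguments; note that the reducer class doubles as the combiner, as discussed earlier.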


Minimum Spanning Tree: Another problem that can be solved using MapReduce, and one of the topics discussed in CSC3323 Algorithm Analysis, is finding the minimum spanning tree (MST) that reaches all nodes at minimum cost in a very large weighted graph. One intuitive approach that can easily be mapped to MapReduce, and that was discussed in class, is Prim's algorithm. Prim's approach to finding the minimum spanning tree is to repeatedly pick the cheapest edge linking a set S of nodes with the remaining nodes V - S, where V is the set of all nodes in the graph. For the MapReduce approach, we can partition our graph into multiple sets of nodes and send one set to each computer in our cluster. Each computer finds the minimum spanning tree of its given set; the reduce function then joins the sets whose MSTs have already been found by the parallel tasks, again picking the bridging edges that cost the least. By then we have found a minimum spanning tree of the whole large graph, using parallel computation.
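Below is a self-contained sketch of this divide-and-merge idea on a toy graph; the node ids, weights, and partitioning are made up. For brevity, the per-partition MST step uses Kruskal's algorithm (take the cheapest edge first, skipping edges that would close a cycle) rather than Prim's; both produce a minimum spanning tree, and the overall structure, local MSTs computed in parallel followed by a merge over the bridging edges, is the one described above:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PartitionMst {

    static final class Edge {
        final int u, v;
        final double w;
        Edge(int u, int v, double w) { this.u = u; this.v = v; this.w = w; }
    }

    // Union-find over node ids, with path halving.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // The per-partition step each cluster node runs: a minimum spanning
    // forest of the edges it was given (Kruskal's algorithm).
    static List<Edge> localMst(List<Edge> edges, int numNodes) {
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort(Comparator.comparingDouble(e -> e.w));
        int[] parent = new int[numNodes];
        for (int i = 0; i < numNodes; i++) parent[i] = i;
        List<Edge> mst = new ArrayList<>();
        for (Edge e : sorted) {
            int ru = find(parent, e.u), rv = find(parent, e.v);
            if (ru != rv) {
                parent[ru] = rv;
                mst.add(e);
            }
        }
        return mst;
    }

    public static void main(String[] args) {
        // Two partitions of a toy 6-node graph, plus the bridging edges.
        List<Edge> part1 = List.of(new Edge(0, 1, 1.0), new Edge(1, 2, 2.0), new Edge(0, 2, 4.0));
        List<Edge> part2 = List.of(new Edge(3, 4, 1.5), new Edge(4, 5, 2.5), new Edge(3, 5, 5.0));
        List<Edge> bridges = List.of(new Edge(2, 3, 3.0), new Edge(0, 5, 6.0));

        // "Map": each cluster node solves its own partition independently.
        List<Edge> candidates = new ArrayList<>(localMst(part1, 6));
        candidates.addAll(localMst(part2, 6));
        candidates.addAll(bridges);

        // "Reduce": merge by running MST again on the local MSTs plus bridges.
        for (Edge e : localMst(candidates, 6))
            System.out.println(e.u + " -(" + e.w + ")- " + e.v);
    }
}

This merge is sound because an edge rejected by a partition's local MST closes a cycle of cheaper edges that also exists in the full graph, so it can never be needed globally; the final pass therefore only has to consider the local MSTs plus the bridging edges.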
Page Ranking: This algorithm was first developed by Larry Page, one of the co-founders of Google. His strategy is to rely on other websites to determine whether a specific website is worthy or not. The idea is simple: count the number of incoming links to a specific website and use the following formula to compute its page rank:

PageRank(A) = 0.15 + 0.85 * ( PageRank(B)/OutgoingLinks(B) + PageRank(...)/OutgoingLinks(...) )
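As a quick illustration with made-up numbers: if A's only incoming links come from B (PageRank 0.5, 2 outgoing links) and C (PageRank 0.3, 1 outgoing link), then PageRank(A) = 0.15 + 0.85 * (0.5/2 + 0.3/1) = 0.15 + 0.85 * 0.55 = 0.6175.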



Suppose that in such a link graph website A has the highest PageRank among its peers; it is therefore placed first in the search results if its content matches the keywords entered by the user (using the word-counting strategy described previously). The difficulty, however, resides in the fact that this graph is large and ever growing. Therefore, we need to use MapReduce to traverse it and extract the incoming links to website A. To do so, we run three different Hadoop jobs:
1. Parsing: traverse the whole web graph. In the map phase, emit each website together with its outgoing links; in the reduce phase, collect for each website the links to the other pages.

2. Calculating: in this second MapReduce job, we compute the PageRank of each website. In the map phase, we map each outgoing link to the webpage, with its rank and total number of outgoing links; in the reduce phase, we calculate the page rank of each webpage using the formula described earlier (see the sketch after this list).

3. Sorting: rank the websites by their computed page ranks.
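A sketch of the calculating job (job 2) in Hadoop Java follows. The line format "page<TAB>rank<TAB>comma-separated-outlinks" and the "links:" marker are assumptions about what the parsing job produced, not something specified in this paper:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankCalc {

    public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed line format: page <TAB> rank <TAB> out1,out2,...
            String[] parts = line.toString().split("\t");
            if (parts.length < 3) return; // skip pages with no recorded outlinks
            String page = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String[] outlinks = parts[2].split(",");
            // Each outgoing link receives an equal share of this page's rank.
            for (String target : outlinks) {
                ctx.write(new Text(target), new Text(Double.toString(rank / outlinks.length)));
            }
            // Pass the link structure through so the next iteration still has it.
            ctx.write(new Text(page), new Text("links:" + parts[2]));
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text page, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            String links = "";
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("links:")) {
                    links = s.substring("links:".length());
                } else {
                    sum += Double.parseDouble(s); // rank contribution from one in-link
                }
            }
            double rank = 0.15 + 0.85 * sum; // the formula given earlier
            ctx.write(page, new Text(rank + "\t" + links));
        }
    }
}

A single run of this job performs one iteration of the calculation; in practice the job is run several times, feeding each output back in as the next input, until the ranks stabilize.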


Using Cloud Computing Services for Map Reduce


This section of the paper is presented as an extra. We have seen that MapReduce implementations free us from dealing with tedious, low-level implementation details when running large-scale programs; that is the software part. However, running MapReduce also requires multiple computers in a large cluster with high connectivity between them, and this hardware part is not easy to build either. Therefore, processing large data can also be done using the cloud. One of the most used products is Amazon Elastic MapReduce (EMR). It is a Platform as a Service (PaaS) that simplifies big data processing not only by providing and managing the Hadoop framework, but also by managing the hardware for you. Amazon EMR is reliable, secure, and low-cost: you pay only for what you use and for how long you use it. The resources are flexible; you may extend the number of CPUs or cores you need, as well as the amount of RAM. Furthermore, you may connect your virtual cluster to other Amazon products, such as Amazon S3 (Amazon Simple Storage Service), to store, for example, the input and output of your MapReduce tasks. Thus you not only avoid dealing with the low-level implementation of scale-free programs, but, by using the cloud, you also avoid setting up a large and complicated cluster to run them.

