Some revolutions are marked by a single, spectacular event: the storming of the Bastille during the
French Revolution, or the destruction of the towers of the World Trade Center on September 11,
2001, which so changed the US’s relationship with the rest of the world. But often the most important
revolutions aren’t announced with the blare of trumpets. They occur softly, too slow for a single news
cycle, but fast enough that if you aren’t alert, the revolution is over before you’re aware it’s happening.
Such a revolution is happening right now in computing. Microprocessor clock speeds have stagnated
since about 2000. Major chipmakers such as Intel and AMD continue to wring out improvements in
speed by improving on-chip caching and other clever techniques, but they are gradually hitting the
point of diminishing returns. Instead, as transistors continue to shrink in size, the chipmakers are
packing multiple processing units onto a single chip. Most computers shipped today use multi-core
microprocessors, i.e., chips with 2 (or 4, or 8, or more) separate processing units on the main
microprocessor.
The result is a revolution in software development. We’re gradually moving from the old world in
which multiple processor computing was a special case, used only for boutique applications, to a
world in which it is widespread. As this movement happens, software development, so long tailored to
single-processor models, is seeing a major shift in some of its basic paradigms, to make the use of multiple processors natural and convenient for programmers.
This movement to multiple processors began decades ago. Projects such as the Connection Machine
demonstrated the potential of massively parallel computing in the 1980s. In the 1990s, scientists
became large-scale users of parallel computing, using parallel computing to simulate things like
nuclear explosions and the dynamics of the Universe. Those scientific applications were a bit like the
early scientific computing of the late 1940s and 1950s: specialized, boutique applications, built with
heroic effort using relatively primitive tools. As computing with multiple processors becomes
widespread, though, we’re seeing a flowering of general-purpose software development tools tailored to multiple processor computing.
One of the organizations driving this shift is Google. Google is one of the largest users of multiple
processor computing in the world, with its entire computing cluster containing hundreds of thousands
of commodity machines, located in data centers around the world, and linked using commodity networking. This approach is known as distributed computing; the characteristic feature of distributed computing is that the processors in the cluster don’t necessarily share any memory or disk space, and so information sharing must be mediated by the (relatively slow) network.
In this post, I’ll describe a framework for distributed computing called MapReduce. MapReduce was
introduced in a paper written in 2004 by Jeffrey Dean and Sanjay Ghemawat from Google. What’s
beautiful about MapReduce is that it makes parallelization almost entirely invisible to the programmer
who is using MapReduce to develop applications. If, for example, you allocate a large number of
machines in the cluster to a given MapReduce job, the job runs in a highly parallelized way. If, on the
other hand, you allocate only a small number of machines, it will run in a much more serial way, and it’s even possible to run the job on just a single machine.
What exactly is MapReduce? From the programmer’s point of view, it’s just a library that’s imported at
the start of your program, like any other library. It provides a single library call that you can make,
passing in a description of some input data and two ordinary serial functions (the “mapper” and
“reducer”) that you, the programmer, specify in your favorite programming language. MapReduce
then takes over, ensuring that the input data is distributed through the cluster, and computing those
two functions across the entire cluster of machines, in a way we’ll make precise shortly. All the details
– parallelization, distribution of data, tolerance of machine failures – are hidden away from the programmer.
What we’re going to do in this post is learn how to use the MapReduce library. To do this, we don’t
need a big sophisticated version of the MapReduce library. Instead, we can get away with a toy
implementation (just a few lines of Python!) that runs on a single machine. By using this single-
machine toy library we can learn how to develop for MapReduce. The programs we develop will run
essentially unchanged when, in later posts, we improve the MapReduce library so that it can run on a
cluster of machines.
Okay, so how do we use MapReduce? I’ll describe it with a simple example, which is a program to
count the number of occurrences of different words in a set of files. The example is simple, but you’ll
find it rather strange if you’re not already familiar with MapReduce: the program we’ll describe is
certainly not the way most programmers would solve the word-counting problem! What it is, however,
is an excellent illustration of the basic ideas of MapReduce. Furthermore, what we’ll eventually see is
that by using this approach we can easily scale our wordcount program up to run on millions or even
billions of documents, spread out over a large cluster of computers, and that’s not something a conventional wordcount program could do easily.
The input to a MapReduce job is just a set of (input_key,input_value) pairs, which we’ll
implement as a Python dictionary. In the wordcount example, the input keys will be the filenames of
the files we’re interested in counting words in, and the corresponding input values will be the contents
of those files:
filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
  f = open(filename)
  i[filename] = f.read()
  f.close()
After this code is run the Python dictionary i will contain the input to our MapReduce job, namely, i
has three keys containing the filenames, and three corresponding values containing the contents of
those files. Note that I’ve used Windows’ filenaming conventions above; if you’re running a Mac or
Linux you may need to tinker with the filenames. Also, to run the code you will of course need the text files text\a.txt, text\b.txt and text\c.txt. You can create some simple example texts by cutting and pasting from the following:
text\a.txt:
The quick brown fox jumped over the lazy grey dogs.
text\b.txt:
That's one small step for a man, one giant leap for mankind.
text\c.txt:
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.
The MapReduce job will process this input dictionary in two phases: the map phase, which produces
output which (after a little intermediate processing) is then processed by the reduce phase. In the map
phase what happens is that for each (input_key,input_value) pair in the input dictionary i, a function mapper(input_key,input_value) is computed, whose output is a list of intermediate keys and values. This function mapper is supplied by the programmer – we’ll show how it works for
wordcount below. The output of the map phase is just the list formed by concatenating the list of
intermediate keys and values for all of the different input keys and values.
I said above that the function mapper is supplied by the programmer. In the wordcount example, what
mapper does is take the input key and input value – a filename, and a string containing the contents of the file – and then move through the words in the file. For each word it encounters, it returns the intermediate key and value (word,1), indicating that it found one occurrence of word. So, for example, mapper("text\\a.txt", i["text\\a.txt"]) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1),
('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]
Notice that everything has been lowercased, so we don’t count words with different cases as distinct.
Furthermore, the same key gets repeated multiple times, because words like the appear more than
once in the text. This, incidentally, is the reason we use a Python list for the output, and not a Python
dictionary, for in a dictionary the same key can only be used once.
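You can see the problem directly at a Python prompt: converting such a list of pairs into a dictionary silently collapses the repeats, keeping only one entry per key.

dict([('the', 1), ('quick', 1), ('the', 1)])
# gives {'the': 1, 'quick': 1} - the repeated 'the' survives only once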
Here’s the Python code for the mapper function, together with a helper function used to remove
punctuation:
def mapper(input_key,input_value):
  return [(word,1) for word in
          remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
  return s.translate(string.maketrans("",""), string.punctuation)
mapper works by lowercasing the input file, removing the punctuation, splitting the resulting string around whitespace, and finally emitting the pair (word,1) for each resulting word. Note, incidentally, that I’m ignoring the issue of apostrophes (so “that's” becomes thats), to keep the code simple, but you can easily extend the code to deal with them if you wish.
With this specification of mapper, the output of the map phase for wordcount is simply the result of concatenating the lists returned by mapper("text\\a.txt", i["text\\a.txt"]), mapper("text\\b.txt", i["text\\b.txt"]) and mapper("text\\c.txt", i["text\\c.txt"]):
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1),
('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1),
('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1),
('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1),
('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1),
('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1),
('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1),
('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1),
('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1),
('mankind', 1)]
The map phase of MapReduce is logically trivial, but when the input dictionary has, say, 10 billion
keys, and those keys point to files held on thousands of different machines, implementing the map
phase is actually quite non-trivial. What the MapReduce library handles is details like knowing which
files are stored on what machines, making sure that machine failures don’t affect the computation,
making efficient use of the network, and storing the output in a useable form. We won’t worry about
these issues for now, but we will come back to them in future posts.
What the MapReduce library now does in preparation for the reduce phase is to group together all the
intermediate values which have the same key. In our example the result of doing this is the following
intermediate dictionary:
{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1],
'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1],
'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1],
'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1],
'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1],
'that': [1], 'little': [1], 'small': [1], 'step': [1],
'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1],
'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1],
'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
We see, for example, that the word ‘and’, which appears only once in the three files, has as its
associated value a list containing just a single 1, [1]. By contrast, the word ‘one’, which appears twice, has [1, 1] as its associated value.
The reduce phase now commences. A programmer-defined function reducer(intermediate_key,intermediate_value_list) is applied to each entry of the intermediate dictionary. For wordcount, reducer simply sums up the list of intermediate values, and returns both the intermediate_key and the sum as the output. This is done by the following code:
def reducer(intermediate_key,intermediate_value_list):
  return (intermediate_key,sum(intermediate_value_list))
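As a quick sanity check, you can apply reducer by hand to one of the entries in the intermediate dictionary above:

reducer('the', [1, 1, 1])
# returns ('the', 3)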
The output from the reduce phase, and from the total MapReduce computation, is thus:
[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1),
('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2),
('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1),
('white', 1), ('was', 2), ('mary', 2), ('brown', 1), ('lazy', 1),
('sure', 1), ('that', 1), ('little', 1), ('small', 1), ('step', 1),
('everywhere', 1), ('mankind', 1), ('went', 1), ('man', 1), ('a', 2),
('fleece', 1), ('grey', 1), ('dogs', 1), ('quick', 1), ('the', 3),
('thats', 1)]
You can easily check that this is just a list of the words in the three files we started with, and the number of times each of those words occurs.
We’ve looked at code defining the input dictionary i, the mapper and reducer functions. Collecting
it all up, and adding a call to the MapReduce library, here’s the complete wordcount.py program:
#word_count.py
import string
import map_reduce
def mapper(input_key,input_value):
  return [(word,1) for word in
          remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
  return s.translate(string.maketrans("",""), string.punctuation)

def reducer(intermediate_key,intermediate_value_list):
  return (intermediate_key,sum(intermediate_value_list))

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
  f = open(filename)
  i[filename] = f.read()
  f.close()
print map_reduce.map_reduce(i,mapper,reducer)
The map_reduce module imported by this program implements MapReduce in pretty much the
simplest possible way, using some useful functions from the itertools library:
# map_reduce.py
import itertools
def map_reduce(i,mapper,reducer):
  intermediate = []
  for (key,value) in i.items():
    intermediate.extend(mapper(key,value))
  groups = {}
  for key, group in itertools.groupby(sorted(intermediate),
                                      lambda x: x[0]):
    groups[key] = list([y for x, y in group])
  return [reducer(intermediate_key,groups[intermediate_key])
          for intermediate_key in groups]
(Credit to a nice blog post from Dave Spencer for the use of itertools.groupby to simplify the
reduce phase.)
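If the groupby call looks opaque, the grouping step can equally well be written with a plain dictionary; here’s a sketch of that alternative (it collects values in encounter order rather than sorted order, which makes no difference for wordcount):

groups = {}
for (intermediate_key, intermediate_value) in intermediate:
  # setdefault creates an empty list the first time a key is seen
  groups.setdefault(intermediate_key, []).append(intermediate_value)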
Obviously, on a single machine an implementation of the MapReduce library is pretty trivial! In later
posts we’ll extend this library so that it can distribute the execution of the mapper and reducer
functions across multiple machines on a network. The payoff is that, with enough improvement to the library, we can use our wordcount.py program essentially unchanged to count the words not just in 3 files, but in billions of files, spread over thousands of computers in a cluster.
What the MapReduce library does, then, is provide an approach to developing in a distributed
environment where many simple tasks (like wordcount) remain simple for the programmer. Important
(but boring) tasks like parallelization, getting the right data into the right places, dealing with the failure
of computers and networking components, and even coping with racks of computers being taken
offline for maintenance, are all taken care of under the hood of the library.
In the posts that follow, we’re thus going to do two things. First, we’re going to learn how to develop
MapReduce applications. That means taking familiar tasks – things like computing PageRank – and
figuring out how they can be done within the MapReduce framework. We’ll do that in the next post in
this series. In later posts, we’ll also take a look at Hadoop, an open source platform that can be used to develop and run MapReduce applications on large clusters.
Second, we’ll go under the hood of MapReduce, and look at how it works. We’ll scale up our toy
implementation so that it can be used over small clusters of computers. This is not only fun in its own
right, it will also make us better MapReduce programmers, in the same way as understanding the
innards of an operating system (for example) can make you a better application programmer.
To finish off this post, though, we’ll do just two things. First, we’ll sum up what MapReduce does,
stripping out the wordcount-specific material. It’s not any kind of a formal specification, just a brief
informal summary, together with a few remarks. We’ll refine this summary a little in some future posts,
but this is the basic MapReduce model. Second, we’ll give an overview of how MapReduce takes advantage of a cluster of machines to parallelize jobs.
MapReduce in general
Summing up our earlier description of MapReduce, and with the details about wordcount removed,
the input to a MapReduce job is a set of (input_key,input_value) pairs. Each pair is used as input to the programmer-supplied function mapper(input_key,input_value), whose output is a list of intermediate keys and values:
[(intermediate_key,intermediate_value),
(intermediate_key',intermediate_value'),
...]
The output from all the different input pairs is then sorted, so that intermediate values associated with the same intermediate key are grouped together into a list. The programmer-supplied function reducer is then applied to each intermediate key and list of intermediate values, to produce the output from the MapReduce job.
A natural question is whether the order of the values in the list of intermediate values matters. I have to admit I’m not sure of the answer to this question – if it’s discussed in the original paper, then I missed
it. In most of the examples I’m familiar with, the order doesn’t matter, because the reducer works by
applying a commutative, associative operation across all intermediate values in the list. As we’ll see in
a minute, because the mapper computations are potentially done in parallel, on machines which may
be of varying speed, it’d be hard to guarantee the ordering, and this suggests that the ordering
doesn’t matter. It’d be nice to know for sure – if anyone reading this does know the answer, I’d appreciate hearing about it.
One of the most striking things about MapReduce is how restrictive it is. A priori it’s by no means clear
that MapReduce should be all that useful in practical applications. It turns out, though, that many important problems can be expressed as MapReduce computations. We’ve seen wordcount implemented in this post, and we’ll see how to
compute PageRank using MapReduce in the next post. Many other important computations can also
be implemented using MapReduce, including finding shortest paths in a graph, grepping a large document collection, and running many data mining algorithms. For many such problems,
though, the standard approach doesn’t obviously translate into MapReduce. Instead, you need to
think through the problem again from scratch, and find a way of doing it using MapReduce.
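To give a flavour of that rethinking, here is a sketch – my own illustration, not something from the MapReduce paper – of how one of the problems just mentioned, grepping a document collection, might be expressed with the toy map_reduce library from above. The search pattern, filenames and output format are all assumptions made for the example:

# grep_count.py: a sketch of "grep" in the MapReduce model, reusing the
# toy map_reduce library from above. The pattern and filenames are
# illustrative assumptions, not part of the original post.
import map_reduce

pattern = "lamb"  # assumed search string

def mapper(input_key,input_value):
  # emit (filename, line) for every line of the file containing the pattern
  return [(input_key, line) for line in input_value.split("\n")
          if pattern in line]

def reducer(intermediate_key,intermediate_value_list):
  # for each file with at least one match, report the number of matching lines
  return (intermediate_key, len(intermediate_value_list))

filenames = ["text\\a.txt","text\\b.txt","text\\c.txt"]
i = {}
for filename in filenames:
  f = open(filename)
  i[filename] = f.read()
  f.close()

print map_reduce.map_reduce(i,mapper,reducer)

With the example files created earlier, only text\c.txt contains "lamb", so the output should be [('text\\c.txt', 2)].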