[Figure 3: Data transfer rates over time for different executions of the sort program. Each column of panels plots Input (MB/s), Shuffle (MB/s), and Output (MB/s) against time in seconds, for (a) normal execution, (b) no backup tasks, and (c) 200 tasks killed.]

original text line as the intermediate key/value pair. We used a built-in Identity function as the Reduce operator. This function passes the intermediate key/value pair unchanged as the output key/value pair. The final sorted output is written to a set of 2-way replicated GFS files (i.e., 2 terabytes are written as the output of the program).

As before, the input data is split into 64MB pieces (M = 15000). We partition the sorted output into 4000 files (R = 4000). The partitioning function uses the initial bytes of the key to segregate it into one of R pieces.

Our partitioning function for this benchmark has built-in knowledge of the distribution of keys. In a general sorting program, we would add a pre-pass MapReduce operation that would collect a sample of the keys and use the distribution of the sampled keys to compute split-points for the final sorting pass.

Figure 3 (a) shows the progress of a normal execution of the sort program. The top-left graph shows the rate at which input is read. The rate peaks at about 13 GB/s and dies off fairly quickly since all map tasks finish before 200 seconds have elapsed. Note that the input rate is less than for grep. This is because the sort map tasks spend about half their time and I/O bandwidth writing intermediate output to their local disks. The corresponding intermediate output for grep had negligible size.

The middle-left graph shows the rate at which data is sent over the network from the map tasks to the reduce tasks. This shuffling starts as soon as the first map task completes. The first hump in the graph is for the first batch of approximately 1700 reduce tasks (the entire MapReduce was assigned about 1700 machines, and each machine executes at most one reduce task at a time). Roughly 300 seconds into the computation, some of this first batch of reduce tasks finish and we start shuffling data for the remaining reduce tasks. All of the shuffling is done about 600 seconds into the computation.

The bottom-left graph shows the rate at which sorted data is written to the final output files by the reduce tasks. There is a delay between the end of the first shuffling period and the start of the writing period because the machines are busy sorting the intermediate data. The writes continue at a rate of about 2-4 GB/s for a while. All of the writes finish about 850 seconds into the computation. Including startup overhead, the entire computation takes 891 seconds. This is similar to the current best reported result of 1057 seconds for the TeraSort benchmark [18].

A few things to note: the input rate is higher than the shuffle rate and the output rate because of our locality optimization; most data is read from a local disk and bypasses our relatively bandwidth-constrained network. The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data (we make two replicas of the output for reliability and availability reasons). We write two replicas because that is the mechanism for reliability and availability provided by our underlying file system. Network bandwidth requirements for writing data would be reduced if the underlying file system used erasure coding [14] rather than replication.
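As an illustration of the operators described above, a minimal sketch of the sort benchmark's Map and Identity Reduce is given below, written against the Mapper/Reducer interface used in the paper's word-count example. The class names, the 10-byte key extraction via substr, and the exact Emit signatures are our assumptions, not code from the paper.

  #include <string>
  #include "mapreduce/mapreduce.h"

  // Map: emit a short sorting key extracted from each record, with the
  // original text line as the value. (Assumes the sort key occupies the
  // first 10 bytes of the record, as in TeraSort-style input.)
  class SortMapper : public Mapper {
   public:
    virtual void Map(const MapInput& input) {
      const std::string& line = input.value();
      Emit(line.substr(0, 10), line);  // assumed string-accepting Emit
    }
  };
  REGISTER_MAPPER(SortMapper);

  // Reduce: an Identity operator that passes every intermediate
  // key/value pair through unchanged to the output.
  class IdentityReducer : public Reducer {
   public:
    virtual void Reduce(ReduceInput* input) {
      while (!input->done()) {
        Emit(input->value());
        input->NextValue();
      }
    }
  };
  REGISTER_REDUCER(IdentityReducer);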

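The general-sorting variant mentioned above (collect a key sample in a pre-pass, then derive split-points for the final sorting pass) could look like the following sketch. The RangePartitioner class and its interface are entirely hypothetical; the paper does not specify a partitioner API. It assumes the sample contains at least R keys.

  #include <algorithm>
  #include <string>
  #include <vector>

  // Hypothetical range partitioner: given a key sample (e.g., gathered by
  // a pre-pass MapReduce), compute R-1 split points and assign each key to
  // one of R pieces so that piece i holds keys in [split[i-1], split[i]).
  class RangePartitioner {
   public:
    RangePartitioner(std::vector<std::string> sample, int r) : r_(r) {
      std::sort(sample.begin(), sample.end());
      // Pick R-1 evenly spaced keys from the sorted sample as split points.
      for (int i = 1; i < r_; i++) {
        splits_.push_back(sample[i * sample.size() / r_]);
      }
    }

    // Returns the piece number in [0, R) for the given key.
    int Partition(const std::string& key) const {
      return std::upper_bound(splits_.begin(), splits_.end(), key) -
             splits_.begin();
    }

   private:
    int r_;
    std::vector<std::string> splits_;
  };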
