Tool Implementation
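A driver class typically implements the Tool interface and is run through ToolRunner, so that generic options such as the -conf flag used below are parsed automatically. A minimal sketch (class name and job settings are illustrative):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already contains any -conf / -D options parsed by ToolRunner.
    JobConf conf = new JobConf(getConf(), MyDriver.class);
    conf.setJobName("my-job");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}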
Packaging a Job
A job's classes must be packaged into a job JAR file to send to the cluster
Any dependent JAR files can be packaged in a lib subdirectory in the job JAR file.
The client classpath
The user's client-side classpath set by hadoop jar <jar> is made up of:
The job JAR file
Any JAR files in the lib directory of the job JAR file, and the classes directory.
The classpath defined by HADOOP_CLASSPATH, if set
Launching a Job
To launch the job, we run the driver, specifying the cluster that we want to run
the job on with the -conf option.
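For example (driver class, configuration file, and paths are illustrative):
$ hadoop jar job.jar com.example.MyDriver -conf conf/cluster.xml input output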
Data Structures
Key-value pairs are the basic data structure in MapReduce
Keys and values can be: integers, floats, strings, raw bytes
They can also be arbitrary data structures
The design of MapReduce algorithms involves:
Imposing the key-value structure on arbitrary datasets
o E.g.: for a collection of Web pages, input keys may be URLs and
values may be the HTML content
In some algorithms, input keys are not used, in others they uniquely
identify a record
Keys can be combined in complex ways to design various algorithms
A MapReduce job
The programmer defines a mapper and a reducer as follows:
o map: (k1, v1) → [(k2, v2)]
o reduce: (k2, [v2]) → [(k3, v3)]
A MapReduce job consists of:
o A dataset stored on the underlying distributed filesystem, which is
split in a number of files across machines
o The mapper is applied to every input key-value pair to generate
intermediate key-value pairs
o The reducer is applied to all values associated with the same
intermediate key to generate output key-value pairs
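As a concrete illustration, here is a minimal word-count sketch using Hadoop's classic (org.apache.hadoop.mapred) API; class names and paths are illustrative:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // map: (byte offset, line of text) -> [(word, 1)]
  public static class WordCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        output.collect(word, ONE);
      }
    }
  }

  // reduce: (word, [counts]) -> [(word, total)]
  public static class WordCountReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}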
Figure: Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs. Reducers are applied to all
intermediate values associated with the same intermediate key. Between the map and reduce phases lies a barrier that involves a large distributed
sort and group-by.
Restrictions
Using external resources
E.g.: data stores other than the distributed file system
Concurrent access by many map/reduce tasks
Side effects
Not allowed in functional programming
E.g.: preserving state across multiple inputs
State is kept internal
I/O and execution
External side effects using distributed data stores (e.g. BigTable)
No input (e.g. computing π), no reducers, but never no mappers
Debugging a Job
The web UI (use debug statements that log to standard error)
Custom counters
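A minimal sketch of a custom counter in the classic API; the enum name and the malformed-record check are illustrative:

// Inside a Mapper implementation (imports as in the word-count example above):
enum RecordQuality { MALFORMED }

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  if (value.toString().trim().isEmpty()) {
    reporter.incrCounter(RecordQuality.MALFORMED, 1);   // visible in the job's web UI
    return;
  }
  // ... normal record processing ...
}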
Hadoop Logs
Anything written to standard output or standard error is directed to the relevant log file.
Remote Debugging
Attaching a debugger is hard to arrange when running a job on a cluster
Options:
o Reproduce the failure locally
o Use JVM debugging options
o Use task profiling
o Use IsolationRunner
Set keep.failed.task.files to true to keep a failed task's files.
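One way to set this, assuming the driver uses ToolRunner as in the sketch earlier (class and paths illustrative):
conf.setBoolean("keep.failed.task.files", true);
or, from the command line:
$ hadoop jar job.jar com.example.MyDriver -D keep.failed.task.files=true input output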
Tuning a Job
Job Submission
JobClient class
The runJob() method creates a new instance of JobClient
and then calls submitJob() on it
Simple checks on the job:
Is there an output directory?
Are there any input splits?
Can I copy the JAR of the job to HDFS?
NOTE: the JAR of the job is replicated 10 times
MapReduce Workflows
o When the processing gets more complex:
o As a rule of thumb, think about adding more jobs, rather than adding complexity to jobs.
o For more complex problems, consider a higher-level language than MapReduce, such as Pig, Hive, Cascading, Cascalog, or Crunch.
o One immediate benefit is that it frees you from the translation into MapReduce jobs,
allowing you to concentrate on the analysis you are performing.
JobControl:
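A minimal sketch of chaining two dependent jobs with the classic API's JobControl; prepareConf() and aggregateConf() are hypothetical helpers that build the two JobConf objects:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class Workflow {
  public static void main(String[] args) throws Exception {
    JobConf conf1 = prepareConf();      // hypothetical: configures the first job
    JobConf conf2 = aggregateConf();    // hypothetical: configures the second job

    Job first = new Job(conf1);
    Job second = new Job(conf2);
    second.addDependingJob(first);      // second runs only after first succeeds

    JobControl control = new JobControl("workflow");
    control.addJob(first);
    control.addJob(second);

    new Thread(control).start();        // JobControl implements Runnable
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}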
Classic MapReduce
Failures
One of the major benefits of using Hadoop is its ability to handle failures and allow the job to complete.
Task failure:
When user code in the map or reduce task throws a runtime exception.
The error ultimately makes it into the user logs.
Hanging tasks are dealt with differently: they are killed once mapred.task.timeout expires (see the property below).
When the jobtracker is notified of a task attempt that has failed (by the tasktracker's
heartbeat call), it will reschedule execution of the task.
The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously
failed
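For reference, the timeout is a job property set in milliseconds; the value below is the usual default of 10 minutes:
<property>
  <name>mapred.task.timeout</name>
  <value>600000</value>
</property>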
Failures
Tasktracker failure :
The jobtracker will notice a tasktracker that has stopped sending heartbeats if it hasn't received
one for 10 minutes (configured via the mapred.tasktracker.expiry.interval property, in
milliseconds)
And remove it from its pool of tasktrackers to schedule tasks on.
Jobtracker failure
Failure of the jobtracker is the most serious failure mode.
Hadoop has no mechanism for dealing with jobtracker failure; it is a single point of failure,
so in this case all running jobs fail.
After restarting a jobtracker, any jobs that were running at the time it was stopped will need to
be resubmitted.
Partitioners
Partitioners are responsible for:
Dividing up the intermediate key space
Assigning intermediate key-value pairs to reducers
They specify the reduce task to which each intermediate key-value pair must be copied
Hash-based partitioner
Computes the hash of the key modulo the number of reducers r
This ensures a roughly even partitioning of the key space
o However, it ignores values: this can cause imbalance in the data processed by
each reducer
When dealing with complex keys, even the base partitioner may need customization
Combiners
Combiners are an (optional) optimization:
Allow local aggregation before the shuffle and sort phase
Each combiner operates in isolation
Essentially, combiners are used to save bandwidth
E.g.: word count program
Combiners can also be implemented inside the mapper using local data structures (see the sketch below)
E.g., an associative array keeps intermediate counts and their aggregates
The map task then only emits once all input records of its split are
processed
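A hedged sketch of the in-mapper variant with the classic API (a conventional combiner would instead be registered with conf.setCombinerClass(WordCountReducer.class)); class names are illustrative:

// In addition to the word-count imports above, requires java.util.HashMap and java.util.Map.
public static class InMapperCombiningMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private OutputCollector<Text, IntWritable> out;       // remembered for close()

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    out = output;
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      Integer current = counts.get(token);
      counts.put(token, current == null ? 1 : current + 1);   // local aggregation
    }
  }

  @Override
  public void close() throws IOException {
    // Emit once, after every record of this map task's split has been seen.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      out.collect(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}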
Mapper
Reducer
Tutorial: MRUnit
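A minimal MRUnit test sketch for the word-count mapper shown earlier (assumes MRUnit's classic-API MapDriver and JUnit 4; names are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void emitsOneCountPerWord() throws Exception {
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCount.WordCountMapper());
    driver.withInput(new LongWritable(0), new Text("hello hello world"))
          .withOutput(new Text("hello"), new IntWritable(1))
          .withOutput(new Text("hello"), new IntWritable(1))
          .withOutput(new Text("world"), new IntWritable(1))
          .runTest();
  }
}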
MapReduce Types
Input / output to mappers and reducers
a. map: (k1, v1) → [(k2, v2)]
b. reduce: (k2, [v2]) → [(k3, v3)]
In Hadoop, a mapper is created as follows:
a. void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
Types:
a. K types implement WritableComparable
b. V types implement Writable
What is a Writable?
Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
All keys are instances of WritableComparable
o Why comparable? Keys must be sortable, because the framework sorts and groups them during the shuffle.
All values are instances of Writable
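A minimal sketch of a custom value type implementing Writable (a key type would implement WritableComparable and add compareTo()); field names are illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageStats implements Writable {
  private long hits;
  private double avgLatencyMs;

  public void write(DataOutput out) throws IOException {   // serialize the fields
    out.writeLong(hits);
    out.writeDouble(avgLatencyMs);
  }

  public void readFields(DataInput in) throws IOException { // deserialize in the same order
    hits = in.readLong();
    avgLatencyMs = in.readDouble();
  }
}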
Reading Data
Datasets are specified by InputFormats
An InputFormat defines the input data (e.g. a file, a directory)
An InputFormat is a factory for RecordReader objects that extract
key-value records from the input source
An InputFormat identifies partitions of the data that form an InputSplit
An InputSplit is a (reference to a) chunk of the input processed by
a single map
o Largest split is processed first
Each split is divided into records, and the map processes each
record (a key-value pair) in turn
Splits and records are logical; they are not physically bound to a file
Record Readers
Each InputFormat provides its own RecordReader implementation
LineRecordReader
Used by TextInputFormat; reads a line from a text file
KeyValueLineRecordReader
Used by KeyValueTextInputFormat
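For example, the input format chosen on the JobConf determines which RecordReader produces the mapper's records (path illustrative):
conf.setInputFormat(KeyValueTextInputFormat.class);    // key/value split on the first tab of each line
FileInputFormat.setInputPaths(conf, new Path("input"));
// With the default TextInputFormat, LineRecordReader supplies (byte offset, line) pairs instead.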
WritableComparator
Compares WritableComparable data
By default, deserializes the objects and calls their compareTo() method
Can provide a fast path that compares serialized bytes directly
Configured through:
JobConf.setOutputValueGroupingComparator()
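A minimal sketch of a raw comparator that takes the fast path for IntWritable keys by comparing the serialized bytes directly (registration shown for the classic API):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

public class FastIntComparator extends WritableComparator {
  public FastIntComparator() {
    super(IntWritable.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int a = readInt(b1, s1);    // read the big-endian int without deserializing
    int b = readInt(b2, s2);
    return a < b ? -1 : (a == b ? 0 : 1);
  }
}
// Registered as the sort comparator with: conf.setOutputKeyComparatorClass(FastIntComparator.class);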
Partitioner
int getPartition(key, value, numPartitions)
Outputs the partition number for a given key
One partition == all values sent to a single reduce task
HashPartitioner used by default
Uses key.hashCode() to compute the partition number
JobConf used to set Partitioner implementation
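A minimal custom partitioner sketch with the classic API, partitioning Text keys by hash (the same idea HashPartitioner uses):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf conf) { }               // no setup needed

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask the sign bit so the modulo result is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
// Registered on the job with: conf.setPartitionerClass(WordPartitioner.class);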
The Reducer
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
Keys and values sent to one partition all go to the same reduce task
Calls are sorted by key
Early keys are reduced and output before late keys
Joins
MapReduce can perform joins between large datasets
Join:
Map-Side Joins
A map-side join between large inputs works by
performing the join before the data reaches the map
function.
The inputs to each map must be partitioned and
sorted in a particular way.
Each input dataset must be divided into the
same number of partitions, and it must be sorted by
the same key (the join key) in each source.
All the records for a particular key must reside in
the same partition.
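A hedged configuration sketch using the classic API's CompositeInputFormat (org.apache.hadoop.mapred.join); the join expression, input format, and paths are illustrative, and both inputs must already be partitioned and sorted identically on the join key:
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", KeyValueTextInputFormat.class,
    new Path("input/customers"), new Path("input/orders")));
// The map function then receives the join key plus a TupleWritable holding one value per source.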
Reduce-Side Joins
A reduce-side join is more general than a map-side join
the input datasets don't have to be structured in
any particular way
the mapper tags each record with its source and
uses the join key as the map output key, so that the
records with the same key are brought together in
the reducer.
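A minimal sketch of the tagging step with the classic API (class names are illustrative; assumes the join key is the first comma-separated field of each record):

public static class OrdersMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split(",", 2);
    if (fields.length < 2) return;                     // skip malformed records
    // key = join key; value = source tag + the rest of the record
    output.collect(new Text(fields[0]), new Text("ORDERS\t" + fields[1]));
  }
}
// A second mapper tags records from the other source (e.g. "CUSTOMERS");
// the reducer then joins all values that arrive under the same join key.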
HDFS Architecture
Figure: HDFS architecture. The NameNode stores metadata (file names and replica counts, e.g. /home/foo/data with 6 replicas) and serves clients' metadata and block operations; clients read from and write to DataNodes directly, and DataNodes spread across racks (Rack1, Rack2) replicate blocks among themselves.
Operating System
As Hadoop is written in Java, it is mostly portable
between different operating systems
Installation Steps
Install Java
Install ssh and sshd
$ gunzip hadoop-0.18.0.tar.gz
$ tar xvf hadoop-0.18.0.tar
Additional Configuration
conf/masters
contains the hostname of the SecondaryNameNode
It should be a fully-qualified domain name.
conf/slaves
contains the hostname of every machine in the cluster that
should start TaskTracker and DataNode daemons
Ex:
slave01
slave02
slave03
Advanced Configuration
Enable passwordless ssh:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Advanced Configuration
Various directories should be created on each
node
The NameNode requires the NameNode metadata
directory
$ mkdir -p /home/hadoop/dfs/name
Advanced Configuration (cont.)
bin/slaves.sh allows a command to be
executed on all nodes in the slaves file.
$ mkdir -p /tmp/hadoop
$ export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
$ export HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /tmp/hadoop"
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /home/hadoop/dfs/data"
Format HDFS
$ bin/hadoop namenode -format
Important Directories
Directory            Default location                   Suggested location
HADOOP_LOG_DIR       ${HADOOP_HOME}/logs                /var/log/hadoop
hadoop.tmp.dir       /tmp/hadoop-${user.name}           /tmp/hadoop
dfs.name.dir         ${hadoop.tmp.dir}/dfs/name         /home/hadoop/dfs/name
dfs.data.dir         ${hadoop.tmp.dir}/dfs/data         /home/hadoop/dfs/data
mapred.system.dir    ${hadoop.tmp.dir}/mapred/system    /hadoop/mapred/system
Recommended configuration
dfs.name.dir and dfs.data.dir should be moved out of hadoop.tmp.dir.
Adjust mapred.system.dir
Selecting Machines
Hadoop is designed to take advantage of
whatever hardware is available
Hadoop jobs written in Java can consume
between 1 and 2 GB of RAM per core
If you use Hadoop Streaming to write your jobs
in a scripting language such as Python, more
memory may be advisable.
Cluster Configurations
Small Clusters: 2-10 Nodes
Medium Clusters: 10-40 Nodes
Large Clusters: Multiple Racks
configuration in conf/hadoop-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>head.server.node.com:9001</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://head.server.node.com:9000</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/dfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
NameNode backup
The cluster's hadoop-site.xml file should then
instruct the NameNode to write to the backup
directory as well:
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
<final>true</final>
</property>
Backup NameNode
The backup machine can also serve as the SecondaryNameNode
this is not a failover NameNode process
It takes periodic snapshots of the NameNode's metadata
Nodes must be decommissioned on a schedule that permits
replication of the blocks being decommissioned.
The exclude files are specified in conf/hadoop-site.xml:
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
<property>
<name>mapred.hosts.exclude</name>
<value>/home/hadoop/excludes</value>
<final>true</final>
</property>
create an empty file with this name:
$ touch /home/hadoop/excludes
Replication Setting
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>
Tutorial
Configure Hadoop Cluster in two nodes.
Tutorial-Installed Hadoop in Cluster.docx
Property                                Range           Description
io.file.buffer.size                     32768-131072
io.sort.factor                          50-200
io.sort.mb                              50-200
mapred.reduce.parallel.copies           20-50
tasktracker.http.threads                40-50
mapred.tasktracker.map.tasks.maximum