Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Chapter Topics

Delving Deeper into the Hadoop API

- Using the ToolRunner Class
- Setting Up and Tearing Down Mappers and Reducers
- Decreasing the Amount of Intermediate Data with Combiners
- Accessing HDFS Programmatically
- Using the Distributed Cache
- Using the Hadoop API's Library of Mappers, Reducers and Partitioners

Using the ToolRunner Class
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
if (args.length != 2) {
  System.out.printf(
      "Usage: %s [generic options] <input dir> <output dir>\n",
      getClass().getSimpleName());
  return -1;
}

Job job = new Job(getConf());
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");

FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
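For context, job setup code like this typically lives in the run() method of a driver class that extends Configured and implements Tool, so that ToolRunner can invoke it. A minimal sketch of that harness (the class name follows the slides; the surrounding boilerplate is an assumption based on the standard Tool pattern, not reproduced from the deck):

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // ... job setup as shown above ...
    // boolean success = job.waitForCompletion(true);
    // return success ? 0 : 1;
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options (-D, -files, -libjars, ...)
    // and places them in the Configuration before calling run() with
    // the remaining command-line arguments.
    int exitCode = ToolRunner.run(new WordCount(), args);
    System.exit(exitCode);
  }
}
```

This is why run() can call getConf(): ToolRunner has already loaded the Hadoop configuration and any generic options into the Configured base class.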
Setting Up and Tearing Down Mappers and Reducers
Passing Parameters

public class MyDriverClass {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("paramname", value);
    Job job = new Job(conf);
    ...
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
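On the task side, a parameter set this way is read back from the Configuration, typically in the Mapper's setup() method. A sketch ("paramname" follows the driver snippet above; the Mapper class and types are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private int param;

  @Override
  public void setup(Context context) {
    // Retrieve the value the driver stored with conf.setInt("paramname", ...);
    // the second argument is the default used if the property is not set.
    param = context.getConfiguration().getInt("paramname", 0);
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // use param when processing each record
  }
}
```

setup() runs once per task before any calls to map(), so the lookup cost is paid once rather than per record.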
Decreasing the Amount of Intermediate Data with Combiners
The Combiner

Often, Mappers produce large amounts of intermediate data
- That data must be passed to the Reducers
- This can result in a lot of network traffic

It is often possible to specify a Combiner
- Like a 'mini-Reducer'
- Runs locally on a single Mapper's output
- Output from the Combiner is sent to the Reducers

Combiner and Reducer code are often identical
- Technically, this is possible if the operation performed is commutative and associative
- Input and output data types for the Combiner/Reducer must be identical
The Combiner

Combiners run as part of the Map phase
Output from the Combiners is passed to the Reducers

[Figure: data flow for three input blocks. Each block passes through an InputFormat, a Mapper, a Combiner and a Partitioner; the combined output is then sent to the Reducer.]
WordCount Revisited

[Figure: two nodes run WordCount Mappers. Node 1 processes "the cat sat on the mat" and Node 2 processes "the aardvark sat on the sofa". Each Mapper emits one (word, 1) pair per word, and all of these pairs are shuffled across the network to the Reducers, which receive the sorted keys aardvark, cat, mat, on, sat, sofa, the.]
[Figure: the same two Mappers, each now followed by a Combiner. The Combiner sums the counts for each word locally, so that, for example, each node emits (the, 2) once instead of (the, 1) twice, reducing the amount of data sent to the Reducers.]
Writing a Combiner

The Combiner uses the same signature as the Reducer
- Takes in a key and a list of values
- Outputs zero or more (key, value) pairs
- The actual method called is the reduce method in the class

reduce(inter_key, [v1, v2, ...]) → (result_key, result_value)
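As a concrete sketch, a summing reduce method of this shape is safe to use as a Combiner (the class name follows the driver code shown earlier in this chapter; the implementation itself is the standard summing pattern, not reproduced from the deck):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Summation is commutative and associative, so partial sums
    // produced by a Combiner combine correctly at the Reducer.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```

Because input and output types are both (Text, IntWritable), the same class can serve as both Combiner and Reducer.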
Summing works as a Combiner because addition is commutative and associative. For the key foo with values (2, 3, 5, 6):

Sum(Sum(2,3), Sum(5,6)) = Sum(5, 11) = 16 = Sum(2,3,5,6)

Averaging does not. For the key bar with values (5, 5, 5, 8, 10):

Avg(Avg(5,5,5), Avg(8,10)) = Avg(5, 9) = 7, but Avg(5,5,5,8,10) = 6.6
Specifying a Combiner

Specify the Combiner class to be used in your MapReduce code in the driver
- Use the setCombinerClass method, e.g.:

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setCombinerClass(SumReducer.class);

Input and output data types for the Combiner and the Reducer for a job must be identical

VERY IMPORTANT: The Combiner may run zero, one, or more times on the output from any given Mapper
- Do not put code in the Combiner which could influence your results if it runs more than once
Accessing HDFS Programmatically
The conf object has read in the Hadoop configuration files, and therefore knows the address of the NameNode

A file in HDFS is represented by a Path object:

Path p = new Path("/path/to/my/file");
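The FileSystem API ties these pieces together. A minimal sketch of opening and reading an HDFS file line by line (the path follows the slide above; the class name is illustrative and error handling is omitted for brevity):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem.get uses the default filesystem named in the
    // configuration files to locate the NameNode.
    FileSystem fs = FileSystem.get(conf);

    Path p = new Path("/path/to/my/file");
    FSDataInputStream in = fs.open(p);
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}
```

Writing is symmetrical: fs.create(p) returns an FSDataOutputStream for a new file in HDFS.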
Using the Distributed Cache
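The slides for this topic are not reproduced in the text. As a sketch of the usual pattern: when a job is submitted through ToolRunner with the -files generic option, the named file is copied into the distributed cache and appears in each task's working directory, where a Mapper can open it as a regular local file. The file name and Mapper class below are illustrative:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes the job was submitted with the -files generic option, e.g.:
//   hadoop jar wc.jar WordCount -files stopwords.txt <input> <output>
public class StopWordMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final Set<String> stopWords = new HashSet<String>();

  @Override
  public void setup(Context context) throws IOException {
    // The cached file is available by name in the task's working directory.
    BufferedReader reader =
        new BufferedReader(new FileReader("stopwords.txt"));
    String line;
    while ((line = reader.readLine()) != null) {
      stopWords.add(line.trim().toLowerCase());
    }
    reader.close();
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\W+")) {
      if (word.length() > 0 && !stopWords.contains(word.toLowerCase())) {
        context.write(new Text(word.toLowerCase()), new IntWritable(1));
      }
    }
  }
}
```

Loading the file in setup() means each task reads it once, no matter how many records it processes.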
Using the Hadoop API's Library of Mappers, Reducers and Partitioners
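The slides' examples for this topic are not reproduced in the text. As a sketch, Hadoop ships ready-made classes such as TokenCounterMapper and IntSumReducer, so a WordCount driver can be built without writing any Mapper or Reducer code at all (the driver class name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class LibraryWordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration());
    job.setJarByClass(LibraryWordCount.class);
    job.setJobName("Library Word Count");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // TokenCounterMapper tokenizes each input line and emits (word, 1)
    job.setMapperClass(TokenCounterMapper.class);
    // IntSumReducer sums the counts; being a pure sum, it is also
    // safe to use as a Combiner
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```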
Key Points

Use the ToolRunner class to build drivers
- Parses job options and configuration variables automatically

Override Mapper and Reducer setup and cleanup methods
- Set up and tear down, e.g. reading configuration parameters

Combiners are 'mini-Reducers'
- Run locally on Mapper output to reduce the data sent to Reducers

The FileSystem API lets you read and write HDFS files programmatically

The Distributed Cache lets you copy local files to worker nodes
- Mappers and Reducers can access them directly as regular files

Hadoop includes a library of predefined Mappers, Reducers, and Partitioners
Bibliography

The following offer more information on topics discussed in this chapter:

- Combiners are discussed in TDG 3e on pages 33-36.
- A table describing available filesystems in Hadoop is on pages 52-53 of TDG 3e.
- The HDFS API is described in TDG 3e on pages 55-67.
- Distributed cache: see pages 289-295 of TDG 3e for more details.