Chapter 1.1
Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Chapter Topics
Introduction
Course Objectives
About Apache Hadoop

Course Objectives

Chapter Topics
Introduction
Course Objectives
About Apache Hadoop

What is Apache Hadoop?

Facts About Apache Hadoop
Open source
Around 60 committers from ~10 companies
  Cloudera, Yahoo!, Facebook, Apple, and more
Hundreds of contributors writing features, fixing bugs
Many related projects, applications, tools, etc.

A Large (and Growing) Ecosystem
[Figure labels: Pig, ZooKeeper, Impala]

Who uses Hadoop?

CDH

The Motivation for Hadoop
Chapter 1.2

The Motivation For Hadoop

Chapter Topics

Traditional Large-Scale Computation

Distributed Systems
[Figure labels: Database, Hadoop Cluster]

Distributed Systems: Problems

Distributed Systems: Challenges

Distributed Systems: The Data Bottleneck

Distributed Systems: The Data Bottleneck (cont'd)

Distributed Systems: The Data Bottleneck (cont'd)

Chapter Topics

Partial Failure Support

Data Recoverability

Component Recovery

Consistency

Scalability

Chapter Topics

Hadoop's History

Core Hadoop Concepts

Hadoop: Very High-Level Overview

Fault Tolerance
If a node fails, the master will detect that failure and re-assign the work to a different node on the system
Restarting a task does not require communication with nodes working on other portions of the data
If a failed node restarts, it is automatically added back to the system and assigned new tasks
If a node appears to be running slowly, the master can redundantly execute another instance of the same task
  Results from the first to finish will be used
  Known as speculative execution

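The speculative-execution bullet above can be illustrated with a small Python sketch. This is not Hadoop code: `run_attempt` and `speculative_run` are hypothetical names, the node names are invented, and the sleep durations merely simulate a slow and a fast node.

```python
import concurrent.futures
import time


def run_attempt(node, delay_seconds, payload):
    """Simulate one attempt of a task on a node (hypothetical helper)."""
    time.sleep(delay_seconds)  # stands in for the node's processing time
    return node, sum(payload)


def speculative_run(payload, attempts):
    """Run the same task on several nodes and keep the first result.

    `attempts` maps a node name to its simulated delay in seconds.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(attempts)) as pool:
        futures = [
            pool.submit(run_attempt, node, delay, payload)
            for node, delay in attempts.items()
        ]
        done, _pending = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        # The first attempt to finish supplies the result. In this sketch the
        # slower attempt simply runs to completion and its output is ignored.
        return next(iter(done)).result()


winner, total = speculative_run([1, 2, 3], {"slow-node": 0.5, "fast-node": 0.01})
print(winner, total)
```

Both attempts compute the same answer, so it does not matter which one wins; the redundant attempt only bounds the time lost to a slow node.
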
Hadoop

Hadoop: Very High-Level Overview

Core Hadoop Concepts

Scalability
[Figure: axis label "Number of Nodes"]

Fault Tolerance

Chapter Topics

What Can You Do With Hadoop?
Operational Efficiency

Where Does Data Come From?
Science
  Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
Industry
  Financial, pharmaceutical, manufacturing, insurance, online, energy, retail data
Legacy
  Sales data, customer behavior, product databases, accounting data, etc.
System Data
  Log files, health and status feeds, activity streams, network messages, web analytics, intrusion detection, spam filters

Common Types of Analysis with Hadoop

What is Common Across Hadoop-able Problems?

Use Case: Opower
Opower
  SaaS for utility companies
  Provides insights to customers about their energy usage
The data
  Huge amounts of data
  Many different sources
The analysis, e.g.
  Similar homes comparison
  Heating and cooling usage
  Bill forecasting

Benefits of Analyzing with Hadoop

Hadoop Basic Concepts
Chapter 1.3

Hadoop: Basic Concepts
What is Hadoop?
What features does the Hadoop Distributed File System (HDFS) provide?
What are the concepts behind MapReduce?
How does a Hadoop cluster operate?

Chapter Topics
What is Hadoop?
The Hadoop Distributed File System (HDFS)
How MapReduce Works

The Hadoop Project

Hadoop Components

Hadoop Components (cont'd)
[Figure labels: Hadoop Ecosystem, HBase, Flume, Oozie, CDH]

Core Components: HDFS and MapReduce

Hadoop Components: HDFS
HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
Data is split into blocks and distributed across multiple nodes in the cluster
  Each block is typically 64MB or 128MB in size
Each block is replicated multiple times
  Default is to replicate each block three times
  Replicas are stored on different nodes
  This ensures both reliability and availability

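The block and replica arithmetic above can be sketched in Python. The function names are hypothetical, and the round-robin placement is an assumption chosen for brevity; it is not HDFS's actual placement policy. Only the 128MB block size and the replication factor of three come from the slide.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128MB, one of the typical sizes named above
REPLICATION = 3                 # the default replication factor named above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file of `file_size` bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` different nodes, round-robin.

    Round-robin is an illustrative assumption, not HDFS's real policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


sizes = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300MB file
placement = place_replicas(len(sizes), ["A", "B", "C", "D", "E"])
print(sizes)
print(placement)
```

A 300MB file yields two full 128MB blocks and one 44MB block, and every block lands on three distinct nodes, so losing any single node still leaves two copies of each block.
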
Hadoop Components: MapReduce

Chapter Topics
What is Hadoop?
The Hadoop Distributed File System (HDFS)
How MapReduce Works

HDFS Basic Concepts
[Figure labels: HDFS, Native OS filesystem, Disk Storage]

How Files Are Stored in HDFS

How Files are Stored (1)
Data files are split into blocks and distributed to data nodes
[Figure: a very large data file divided into Block 1, Block 2 and Block 3]

How Files are Stored (2)
Data files are split into blocks and distributed to data nodes
[Figure: the same file and its three blocks, with Block 1 also shown on several data nodes]

How Files are Stored (3)
Data files are split into blocks and distributed to data nodes
Each block is replicated on multiple nodes (default 3x)
[Figure: Blocks 1, 2 and 3 of the file, each shown on several data nodes]

How Files are Stored (4)
Data files are split into blocks and distributed to data nodes
Each block is replicated on multiple nodes (default 3x)
NameNode stores metadata
[Figure: Blocks 1, 2 and 3 of a very large data file shown on several data nodes; the NameNode holds metadata: information about files and blocks]

HDFS: Points To Note
Although files are split into 64MB or 128MB blocks, if a file is smaller than this the full 64MB/128MB will not be used
Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files
Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster
When a client application wants to read a file:
  It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
  It then communicates directly with the DataNodes to read the data
  The NameNode will not be a bottleneck

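The two-step read described above can be modelled with a toy Python sketch. The `NameNode` and `DataNode` classes below are simplified stand-ins written for illustration, not Hadoop's classes, and the block ids, node names and block contents are invented.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def add_file(self, path, blocks):
        """Register a file; `blocks` is a list of (block_id, [node names])."""
        self.file_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.block_locations[block_id] = list(nodes)

    def locate(self, path):
        """Return [(block_id, [node names])] for a file, in block order."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, name_node, data_nodes):
    """Client read: ask the NameNode for locations, then read from DataNodes."""
    data = b""
    for block_id, nodes in name_node.locate(path):
        # Any replica will do; this sketch just takes the first one listed.
        data += data_nodes[nodes[0]].read_block(block_id)
    return data


data_nodes = {name: DataNode() for name in "ABC"}
data_nodes["A"].blocks[4] = b"first block, "
data_nodes["B"].blocks[4] = b"first block, "
data_nodes["B"].blocks[5] = b"second block"
data_nodes["C"].blocks[5] = b"second block"

name_node = NameNode()
name_node.add_file("/logs/041213.log", [(4, ["A", "B"]), (5, ["B", "C"])])
contents = read_file("/logs/041213.log", name_node, data_nodes)
print(contents)
```

The file's bytes never pass through the `NameNode` object, which is why the metadata lookup does not become a bottleneck for reads. Conversely, without `file_blocks` nothing on the DataNodes says which blocks belong to which file.
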
Options for Accessing HDFS
FsShell Command line: hadoop fs
Java API
Ecosystem Projects
  Flume
    Collects data from network sources (e.g., system logs)
  Sqoop
    Transfers data between HDFS and RDBMS
  Hue
    Web-based interactive UI. Can browse, upload, download, and view files
[Figure: a client uses put and get to move files to and from the HDFS cluster]

Example: Storing and Retrieving Files (1)
[Figure: local files /logs/031512.log and /logs/041213.log, and an HDFS cluster of Nodes A, B, C, D and E]

Example: Storing and Retrieving Files (2)
[Figure: the two files divided into blocks numbered 1 to 5, with the block numbers shown on Nodes A to E]

Example: Storing and Retrieving Files (3)
[Figure: a request for /logs/041213.log against the same block layout]

Example: Storing and Retrieving Files (4)
[Figure: a request for /logs/041213.log against the same block layout]

HDFS NameNode Availability
Classic mode
  One NameNode
  One helper node called SecondaryNameNode
[Figure labels: Name Node, Secondary Name Node]

hadoop fs Examples (1)
$ hadoop fs -ls
$ hadoop fs -ls /

hadoop fs Examples (2)

hadoop fs Examples (3)

Important Notes About HDFS

Chapter Topics
What is Hadoop?
The Hadoop Distributed File System (HDFS)
How MapReduce Works

What Is MapReduce?

MapReduce: Terminology

Hadoop Components: MapReduce
The Mapper
  Each Map task (typically) operates on a single HDFS block
  Map tasks (usually) run on the node where the block is stored
Shuffle and Sort
  Sorts and consolidates intermediate data from all mappers
  Happens as Map tasks complete and before Reduce tasks start
The Reducer
  Operates on shuffled/sorted intermediate data (Map task output)
  Produces final output
[Figure labels: Map, Shuffle and Sort, Reduce]

Example: Word Count
Input Data
  the cat sat on the mat
  the aardvark sat on the sofa
Result (after Map and Reduce)
  aardvark 1
  cat 1
  mat 1
  on 2
  sat 2
  sofa 1
  the 4

Example: The WordCount Mapper
Input Data
  the cat sat on the mat
  the aardvark sat on the sofa
WordCountMapper output
  Map (first line): the 1, cat 1, sat 1, on 1, the 1, mat 1
  Map (second line): the 1, aardvark 1, sat 1, on 1, the 1, sofa 1

Example: Shuffle & Sort
Mapper Output
  the 1, cat 1, sat 1, on 1, the 1, mat 1
  the 1, aardvark 1, sat 1, on 1, the 1, sofa 1
Intermediate Data
  aardvark 1
  cat 1
  mat 1
  on 1,1
  sat 1,1
  sofa 1
  the 1,1,1,1

Example: SumReducer
Reducer Output
  reduce: aardvark 1
  reduce: the 4

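The word count walk-through on the preceding slides can be reproduced with a minimal, single-process Python sketch. It only imitates the data flow (map, shuffle and sort, reduce). It is not the Hadoop Java API, and the function names are chosen here to echo the slide titles.

```python
from collections import defaultdict


def word_count_mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def sum_reducer(key, values):
    """Emit one (key, total) pair per input key."""
    return key, sum(values)


lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]
mapped = [pair for line in lines for pair in word_count_mapper(line)]
intermediate = shuffle_and_sort(mapped)
result = [sum_reducer(key, values) for key, values in intermediate]
print(result)
```

Here `intermediate` corresponds to the Intermediate Data column on the Shuffle & Sort slide and `result` to the Result column on the Word Count slide.
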
Mappers Run in Parallel
Hadoop runs Map tasks on the slave node where the block is stored (when possible)
Many Mappers can run in parallel
Minimizes network traffic
[Figure: input data (Lorem ipsum text) divided into HDFS Blocks 1, 2 and 3; one Map task per block emits each word with a list of 1s]

MapReduce: The Mapper
Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic
  Multiple Mappers run in parallel, each processing a portion of the input data
The Mapper reads data in the form of key/value pairs
  The Mapper may use or completely ignore the input key
  For example, a standard pattern is to read one line of a file at a time
    The key is the byte offset into the file at which the line starts
    The value is the contents of the line itself
    Typically the key is considered irrelevant
If the Mapper writes anything out, the output must be in the form of key/value pairs

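The line-at-a-time pattern above, where the key is a byte offset and the value is the line, can be sketched in Python. `line_records` is a hypothetical helper that mimics what a line-oriented input reader hands to the Mapper; it is not a Hadoop class.

```python
def line_records(data):
    """Yield (byte_offset, line_text) pairs for each line in `data` (bytes)."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)  # the next key is where the next line starts


def mapper(key, value):
    """A Mapper that ignores its input key and emits (word, 1) pairs."""
    return [(word, 1) for word in value.split()]


data = b"the cat sat on the mat\nthe aardvark sat on the sofa\n"
records = list(line_records(data))
print(records)
print(mapper(*records[0]))
```

The first line is 22 bytes plus a newline, so the second record's key is 23. The `mapper` function never looks at that key, matching the remark that it is typically considered irrelevant.
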
MapReduce: The Reducer
After the Map phase is over, all intermediate values for a given intermediate key are combined together into a list
This list is given to a Reducer
  There may be a single Reducer, or multiple Reducers
  All values associated with a particular intermediate key are guaranteed to go to the same Reducer
  The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
  This step is known as the shuffle and sort
The Reducer outputs zero or more final key/value pairs
  These are written to HDFS
  In practice, the Reducer usually emits a single key/value pair for each input key

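One way to see why every value for a key reaches the same Reducer is to route keys with a deterministic hash. The Python sketch below uses CRC32 modulo the number of Reducers; that particular hash is an assumption made so the example is repeatable, and `partition` and `shuffle` are hypothetical names rather than Hadoop APIs.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a Reducer for a key; the same key always maps to the same Reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route (key, value) pairs to Reducers and group values per key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    # Each Reducer receives its keys, with their value lists, in sorted order.
    return [sorted(groups.items()) for groups in reducers]


pairs = [("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1)]
reducer_inputs = shuffle(pairs, num_reducers=2)
for index, groups in enumerate(reducer_inputs):
    print(index, groups)
```

Because the Reducer index depends only on the key, both `("the", 1)` pairs end up in one value list on one Reducer, regardless of which Mapper emitted them.
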
Why Do We Care About Counting Words?

Another Example: Analyzing Log Data
Input Data
  ...
  2013-03-15 12:39 - 74.125.226.230 /common/logo.gif 1231ms - 2326
  2013-03-15 12:39 157.166.255.18 /catalog/cat1.html 891ms - 1211
  2013-03-15 12:40 65.50.196.141 /common/logo.gif 1992ms - 1198
  2013-03-15 12:41 64.69.4.150 /common/promoex.jpg 3992ms - 2326
  ...
FileTypeMapper output (Map)
  gif 1231
  html 891
  gif 1992
  jpg 3992
  html 788
  gif 3997
Intermediate Data after Shuffle and Sort
  html 891,788,344,2990
  gif 1231,1992,3997,872
  jpg 3992,7881,2999
  png 919,890,3441,444
  txt 344,325,444,421
AverageReducer output (Reduce)
  html 888.6
  gif 1886.4
  jpg 888.6
  png 1201.0
  txt 399.1

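The log example can be sketched in Python as a mapper that emits (file type, response time) and a reducer that averages. The parsing rules are assumptions inferred from the four visible log lines: the request path is taken to be the field starting with "/" and the response time the field ending in "ms". The function names echo the slide labels but are not Hadoop classes. The averages printed here come only from those four lines, so they are not expected to match the figure, whose remaining input is elided.

```python
import os
from collections import defaultdict


def file_type_mapper(line):
    """Emit (file extension, response time in ms) for one log line."""
    fields = line.split()
    path = next(f for f in fields if f.startswith("/"))
    millis = next(f for f in fields if f.endswith("ms"))
    extension = os.path.splitext(path)[1].lstrip(".")
    return extension, int(millis[:-2])


def average_reducer(key, values):
    """Emit (key, mean of values)."""
    return key, sum(values) / len(values)


log_lines = [
    "2013-03-15 12:39 - 74.125.226.230 /common/logo.gif 1231ms - 2326",
    "2013-03-15 12:39 157.166.255.18 /catalog/cat1.html 891ms - 1211",
    "2013-03-15 12:40 65.50.196.141 /common/logo.gif 1992ms - 1198",
    "2013-03-15 12:41 64.69.4.150 /common/promoex.jpg 3992ms - 2326",
]

grouped = defaultdict(list)
for line in log_lines:
    key, value = file_type_mapper(line)
    grouped[key].append(value)  # stands in for the shuffle and sort

averages = dict(average_reducer(k, v) for k, v in sorted(grouped.items()))
print(averages)
```

The structure is the same as word count: only the mapper's key (file type instead of word) and the reducer's aggregate (mean instead of sum) change.
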
Features of MapReduce

Hadoop Environments

Hadoop Solutions
Chapter 1.4

Hadoop Solutions

Chapter Topics
Hadoop Solutions

Some Common Hadoop Applications

Data Processing/ETL Offload: Traditional Approach
ETL: Extract/Transform/Load
[Figure labels: Discard, Too costly, Data quality]

Data Processing/ETL Offload (cont'd)
[Figure labels: Data Stream, Enterprise Data Warehouse, ETL, Networked Storage, Analytics, Real-time]

Data Processing/ETL Offload Use Case: BlackBerry (1)

Data Processing/ETL Offload Use Case: BlackBerry (2)
[Figure labels: Data, Filter and Split, Streaming ETL, Complex Correlation, Streaming ETL, Data Warehouse, Archive Storage]

Data Processing/ETL Offload Use Case: BlackBerry (3)
[Figure labels: Data, Filter and Split, Hadoop, Correlation, Archive Storage, Data Warehouse (2nd), ETL, Data Warehouse (1st)]

Data Processing/ETL Offload Use Case: BlackBerry (4)
The impact
  90% code base reduction for ETL
    Remaining code is business focused
  Performance increases
    e.g., previous ad hoc query that took four days now takes 53 minutes
  Can perform queries and correlations that were previously impossible
  Significant capital cost reduction

Data Warehouse Offload
[Figure labels: Hadoop; Lower Value Data, 50 TB; 100 TB; High Value Data, 50 TB; Historical Data; Data Processing; Operational Analytics; Ad Hoc Exploratory; Reporting; Transformation/Batch; Business Analytics; Data Hub]

Data Warehouse Offload Use Case: CBS Interactive

Telemetry
Telemetry
  Data collected at many points monitored in real time
  Huge amounts of data previously hard to store and analyze
Examples:
  Opower: SmartMeter readings
  Chevron: Acoustic sampling from oil wells
    15TB per well
  A major videogame company: player actions in multi-player online games

Telemetry Use Case: NetApp

360° Customer View Use Case: Experian (1)

360° Customer View Use Case: Experian (2)
The analysis
  Cross-Channel Identity Resolution (CCIR) engine built on Hadoop
  Creates 360° view of customers
The impact
  50x performance increase over prior data warehouse system
[Figure label: Cross-Channel Identity Resolution]

The Need for the Enterprise Data Hub
Thousands of Employees & Lots of Inaccessible Information
Heterogeneous Legacy IT Infrastructure
Silos of Multi-Structured Data Difficult to Integrate
  ERP, CRM, RDBMS, Machines
  Files, Images, Video, Logs, Clickstreams
  External Data Sources

The Enterprise Data Hub: One Unified System

Data Hub Use Case: Nokia

Other Common Applications
Predictive modeling
Recommendations and ad targeting
Time series analysis
Point of Sale (PoS) analysis

Chapter Topics
Hadoop Solutions

Some Other Interesting Use Case Examples
Orbitz
A Financial Regulatory Body
eBay
A Leading North American Retailer

Orbitz

Financial Regulatory Body

eBay

Leading North American Retailer

Key Points (1)
Hadoop has become a valuable business intelligence tool, and will become an increasingly important part of a BI infrastructure
Hadoop won't replace the EDW
  But any organization with a large EDW should explore Hadoop as a complement to its BI infrastructure
Use Hadoop to offload the time- and resource-intensive processing of large data sets so you can free up your data warehouse to serve user needs

Key Points (2)
