Contact:
Dr. Bina Ramamurthy
CSE Department
University at Buffalo (SUNY)
bina@buffalo.edu
http://www.cse.buffalo.edu/faculty/bina
Partially Supported by
NSF DUE Grant: 0737243
CCSCNE 2009, Plattsburgh, April 24, 2009
Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
Google collects 270PB data in a month (2007), 20000PB a day (2008)
2010 census data is expected to be a huge gold mine of information
Data mining huge amounts of data collected in a wide range of domains
The Outline
Introduction to MapReduce
From CS Foundation to MapReduce
MapReduce programming model
Hadoop Distributed File System
Relevance to Undergraduate Curriculum
Demo (Internet access needed)
Our experience with the framework
Summary
References
MapReduce
What is MapReduce?
[Figure: word-count application, single-threaded. Main drives a WordCounter with parse( ) and count( ) operations; it reads a DataCollection (web, weed, green, sun, moon, land, part) and fills a ResultTable.]
[Figure: the same word-count application with Main running 1..* Threads over the WordCounter (parse( ), count( )); the DataCollection and the ResultTable are shared.]
Observe:
Multi-thread
Lock on shared data
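The observation above can be illustrated with a minimal sketch, assuming Python threads and an in-memory list of text lines as the data collection. The names `count_words` and `threaded_word_count` are hypothetical and do not come from the talk; the point is only that every update to the shared result table has to take the lock.

```python
import threading
from collections import Counter


def count_words(lines, result_table, lock):
    """Parse each line into words and add them to the shared result table."""
    for line in lines:
        for word in line.split():
            # Every thread writes to the same table, so each update is locked.
            with lock:
                result_table[word] += 1


def threaded_word_count(data_collection, num_threads=4):
    """Count words with several threads that share one locked result table."""
    result_table = Counter()
    lock = threading.Lock()
    # Give each thread every num_threads-th line of the collection.
    chunks = [data_collection[i::num_threads] for i in range(num_threads)]
    threads = [
        threading.Thread(target=count_words, args=(chunk, result_table, lock))
        for chunk in chunks
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result_table
```

The lock keeps the counts correct, but it also serializes the updates, which is the cost the later designs try to avoid.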
B.Ramamurthy & K.Madurai
[Figure: word count with Main running 1..* Threads and the work split between 1..* Parsers and Counters. The Parsers turn the DataCollection into a WordList of KEY entries (web, weed, green, sun, moon, land, part); separate counters produce the KEY/VALUE ResultTable.]
Peta-scale Data
[Figure: the same parser/counter design (Main, 1..* Threads, 1..* Parsers, Counter, WordList of KEYs, KEY/VALUE ResultTable) applied to a peta-scale data collection.]
A single machine cannot serve all the data: you need a distributed system, with monitoring
Exploit parallelism afforded by splitting parsing and counting
Provision and locate computing at data locations
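The parallelism gained by splitting parsing from counting can be sketched as follows. This is an illustration under stated assumptions, not the talk's code: the data is assumed to arrive as a list of chunks (each a list of text lines), a thread pool stands in for separate machines, and the names `parse`, `count` and `parallel_word_count` are hypothetical. Each counter owns a private table, so no lock is needed until the final merge.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parse(chunk):
    """Parser: turn one chunk of text lines into a word list."""
    return [word for line in chunk for word in line.split()]


def count(word_list):
    """Counter: count one word list into its own private table."""
    return Counter(word_list)


def parallel_word_count(chunks, workers=4):
    """Parse and count each chunk independently, then merge the tables."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        word_lists = list(pool.map(parse, chunks))
        partial_tables = list(pool.map(count, word_lists))
    # Merging is the only step that touches more than one table.
    result_table = Counter()
    for table in partial_tables:
        result_table.update(table)
    return result_table
```

Because the parsers and counters share nothing, the same structure carries over to chunks that live on different machines.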
Peta-scale Data
[Figure: the parser/counter design (Main, 1..* Threads, 1..* Parsers, Counter, WordList of KEYs, KEY/VALUE ResultTable) with the data divided into many data collections.]
[Figure: one node, holding a data collection together with its own Threads, Parsers, Counter, WordList and ResultTable.]
[Figure: several such nodes side by side, each with its own Main, data collection, Threads, Parsers, Counter, WordList and ResultTable.]
Map Operation
[Figure: the data collection is divided into split 1, split 2, ..., split n. A Map runs on each split and emits KEY/VALUE pairs for words such as web, weed, green, sun, moon, land and part.]
Reduce Operation
[Figure: a Map runs on each of split 1, split 2, ..., split n of the data collection, and Reduce operations combine the map outputs.]
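The map and reduce operations can be sketched for word counting as below. This is a single-process illustration, not Hadoop code: `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical names, and the grouping step inside `run_mapreduce` stands in for the data movement that a real runtime performs between the two phases.

```python
from collections import defaultdict


def map_fn(split):
    """Map: emit a (word, 1) pair for every word in one split."""
    for line in split:
        for word in line.split():
            yield (word, 1)


def reduce_fn(key, values):
    """Reduce: add up all the values emitted for one key."""
    return (key, sum(values))


def run_mapreduce(splits):
    """Run map on every split, group the pairs by key, then reduce each key."""
    grouped = defaultdict(list)
    for split in splits:
        for key, value in map_fn(split):
            grouped[key].append(value)
    # Reduce starts only after every split has been mapped.
    return dict(reduce_fn(key, values) for key, values in grouped.items())
```

The user writes only the two small functions; everything in `run_mapreduce` is the kind of work the framework takes over.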
[Figure: several Parse-hash stages feed Count stages, which write the output partitions P-0000 (count1), P-0001 (count2) and P-0002 (count3).]
[Figure: a terabyte-sized input containing Cat, Bat, Dog and other words is divided into splits. Each split goes through map and combine, followed by reduce, producing the output files part0, part1 and part2.]
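The split, map, combine, reduce pipeline with three output parts can be approximated in a few lines. The sketch below rests on assumptions not stated in the talk: keys are assigned to parts by a CRC32 hash (chosen here only because it is stable across runs), the combine step is a local count per split, and `map_and_combine`, `partition` and `run_job` are hypothetical names rather than any framework's API.

```python
import zlib
from collections import Counter

NUM_PARTS = 3  # assumed number of reduce outputs: part0, part1, part2


def map_and_combine(split):
    """Map each word to a count of 1 and combine the counts within the split."""
    return Counter(word for line in split for word in line.split())


def partition(word, num_parts=NUM_PARTS):
    """Pick the output part for a key with a hash that is stable across runs."""
    return zlib.crc32(word.encode("utf-8")) % num_parts


def run_job(splits, num_parts=NUM_PARTS):
    """Route combined counts to their part and reduce them by summing."""
    parts = [Counter() for _ in range(num_parts)]
    for split in splits:
        for word, local_count in map_and_combine(split).items():
            parts[partition(word, num_parts)][word] += local_count
    return {f"part{i}": dict(table) for i, table in enumerate(parts)}
```

Combining inside each split shrinks what has to be moved to the reducers, and hashing on the key guarantees that all counts for one word end up in the same part.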
MapReduce Programming Model
MapReduce Characteristics
mutexes
Map and Reduce are the main operations: simple code.
There are other supporting operations, such as combine and partition (outside the scope of this talk).
All map operations must complete before the reduce operation starts.
Map and reduce operations are typically performed by the same physical processor.
The numbers of map tasks and reduce tasks are configurable.
Operations are provisioned near the data.
Commodity hardware and storage.
The runtime takes care of splitting and moving data for operations.
Special distributed file system. Example: Hadoop Distributed File System and Hadoop Runtime.
Scope of MapReduce
Data size: small
Hadoop
What is Hadoop?
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
[Figure: HDFS architecture. Labels: Application, HDFS Client, local file system, block size 2K, HDFS Server, master node, Name Nodes.]
More details: We discuss this in great detail in my Operating
Systems course
[Figure: the same HDFS architecture with the blockmap and heartbeat exchanges added.]
Demo
Summary
References