
SOFTWARE SYSTEMS DEVELOPMENT
MAP-REDUCE, Hadoop, HBase
The problem
Batch (offline) processing of huge data sets using commodity hardware

Linear scalability

Need an infrastructure that handles all the mechanics, allowing developers to focus on the processing logic/algorithms
Data Sets
The New York Stock Exchange: 1 Terabyte of data
per day

Facebook: 100 billion photos, 1 Petabyte (1,000 Terabytes)

Internet Archive: 2 Petabytes of data, growing by 20 Terabytes per month

Can't put the data on a single node; a distributed file system is needed to hold it
Batch processing
Single write/append, multiple reads
Example: analyze log files for the most frequent URL

Each data entry is self-contained
At each step, each data entry can be treated individually
After the aggregation, each aggregated data set
can be treated individually

Grid Computing
Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network)

Works well for computation-intensive tasks; problematic for huge data sets, as the network becomes a bottleneck

Programming paradigm: low-level Message Passing Interface (MPI)
Hadoop
Open-source implementation of 2 key ideas
HDFS: Hadoop distributed file system
Map-Reduce: Programming Model

Built on the model of Google's infrastructure (GFS and Map-Reduce papers published in 2003/2004)

Java/Python/C interfaces, several projects
built on top of it
Approach
A limited but simple model that fits a broad range of applications

Handle communication, redundancy, and scheduling in the infrastructure

Move computation to data instead of moving
data to computation
Who is using Hadoop?
Distributed File System (HDFS)
Files are split into large blocks (64 MB or 128 MB)
Compare with a typical file system block of 512 bytes
Replicated among Data Nodes (DN)
3 copies by default
Name Node (NN) keeps track of files and their blocks
Single Master node
Stream-based I/O
Sequential access
HDFS: File Read
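A minimal client-side sketch of an HDFS file read, using the Java FileSystem API (the file path is a made-up example):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);            // the client contacts the Name Node here
    // Hypothetical path; the Name Node resolves it to blocks, data is streamed from Data Nodes
    Path path = new Path("/logs/access.log");
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {   // sequential, stream-based read
        System.out.println(line);
      }
    }
  }
}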
HDFS: File Write
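And a matching sketch of an HDFS file write (again with an illustrative path; blocks are pipelined to the replica Data Nodes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Single writer; the file is written once and later read sequentially
    try (FSDataOutputStream out = fs.create(new Path("/logs/access.log.copy"))) {
      out.writeBytes("GET /index.html 200\n");       // data is pipelined to the replica Data Nodes
    }
  }
}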
HDFS: Data Node Distance
Map Reduce
A Programming Model

Decompose a processing job into Map and
Reduce stages

Developers need to provide code for the Map and Reduce functions, configure the job, and let Hadoop handle the rest
Map-Reduce Model
MAP function
Map each data entry into a <key, value> pair

Examples
Map each log file entry into <URL, 1>
Map each daily stock trading record into <STOCK, Price>
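A rough sketch of the first example as a Hadoop mapper, assuming each input value is one log line and that the URL is the second whitespace-separated field (a hypothetical log format):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <URL, 1> for every log entry; the counting happens later in the Reduce stage
public class UrlCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Hypothetical log format: the URL is assumed to be the second whitespace-separated field
    String[] fields = line.toString().split("\\s+");
    if (fields.length > 1) {
      context.write(new Text(fields[1]), ONE);
    }
  }
}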

Hadoop: Shuffle/Merge phase
Hadoop merges (shuffles) the output of the MAP stage into
<key, value1, value2, value3>

Examples
<URL, 1, 1, 1, 1, 1, 1>
<STOCK, Price on day 1, Price on day 2, ...>
Reduce function
Reduce the entries produced by Hadoop's merge processing into a <key, value> pair

Examples
Map <URL, 1,1,1> into <URL, 3>
Map <Stock, 3,2,10> into <Stock, 10>
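A matching sketch of the reducer for the URL example: it receives <URL, 1, 1, 1, ...> after the shuffle and emits <URL, count>:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s produced by the map stage: <URL, 1, 1, 1> becomes <URL, 3>
public class UrlCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text url, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(url, new IntWritable(sum));
  }
}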

Map-Reduce Flow
Hadoop Infrastructure
Replicate/Distribute data among the nodes
Input
Output
Map/Shuffle output
Schedule Processing
Partition Data
Assign processing nodes (PN)
Move code to PN (e.g. send Map/Reduce code)
Manage failures (block CRC, rerun Map/Reduce if necessary)

Example: Trading Data Processing
Input:
Historical Stock Data
Records are in a CSV (comma-separated values) text file
Each line: stock_symbol, low_price, high_price
1987-2009 data for all stocks, one record per stock per day

Output:
Maximum interday delta for each stock
Map Function: Part I
Map Function: Part II
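A minimal sketch of such a map function, assuming the org.apache.hadoop.mapreduce API and that the per-record delta is high_price minus low_price (class and field handling are illustrative):

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses one CSV line "stock_symbol,low_price,high_price" and emits <stock_symbol, high - low>
public class StockDeltaMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    if (fields.length < 3) {
      return;                                   // skip malformed or header lines
    }
    try {
      float low = Float.parseFloat(fields[1].trim());
      float high = Float.parseFloat(fields[2].trim());
      context.write(new Text(fields[0].trim()), new FloatWritable(high - low));
    } catch (NumberFormatException e) {
      // ignore records that cannot be parsed
    }
  }
}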
Reduce Function
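A matching sketch of the reduce function, keeping the maximum delta per stock:

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <stock_symbol, [delta1, delta2, ...]> and emits <stock_symbol, max delta>
public class StockDeltaReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
  @Override
  protected void reduce(Text symbol, Iterable<FloatWritable> deltas, Context context)
      throws IOException, InterruptedException {
    float max = Float.NEGATIVE_INFINITY;
    for (FloatWritable delta : deltas) {
      max = Math.max(max, delta.get());
    }
    context.write(symbol, new FloatWritable(max));
  }
}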
Running the Job: Part I
Running the Job: Part II
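A minimal driver sketch using the org.apache.hadoop.mapreduce Job API, reusing the mapper and reducer sketches above; input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures the job and lets Hadoop handle scheduling, shuffling and failures
public class MaxDeltaJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max stock delta");
    job.setJarByClass(MaxDeltaJob.class);
    job.setMapperClass(StockDeltaMapper.class);
    job.setReducerClass(StockDeltaReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. the historical CSV data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}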
Inside Hadoop
Datastore: HBASE
Distributed, column-oriented database on top of HDFS

Modeled after Google's BigTable data store

Random reads/writes on top of the sequential, stream-oriented HDFS

Billions of Rows * Millions of Columns *
Thousands of Versions
HBASE: Logical View
Row Key     | Time Stamp | Column "contents:" | Column Family "anchor:" (referred by/to) | Column "mime:"
com.cnn.www | t9         |                    | anchor:cnnsi.com = cnn.com/1             |
            | t8         |                    | anchor:my.look.ca = cnn.com/2            |
            | t6         | <html>..           |                                          | text/html
            | t5         | <html>..           |                                          |
            | t3         | <html>..           |                                          |
Physical View
Row Key     | Time Stamp | Column "contents:"
com.cnn.www | t6         | <html>..
            | t5         | <html>..
            | t3         | <html>..

Row Key     | Time Stamp | Column Family "anchor:"
com.cnn.www | t9         | anchor:cnnsi.com = cnn.com/1
            | t8         | anchor:my.look.ca = cnn.com/2

Row Key     | Time Stamp | Column "mime:"
com.cnn.www | t6         | text/html
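These versioned cells map directly onto the HBase client API. A hedged sketch that reads the anchor family of the example row, assuming a table named webtable and a reasonably recent Java client (setMaxVersions is named readVersions in the newest client versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WebtableRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("webtable"))) {
      Get get = new Get(Bytes.toBytes("com.cnn.www"));
      get.addFamily(Bytes.toBytes("anchor"));     // only the "anchor" column family
      get.setMaxVersions(3);                      // ask for up to 3 timestamped versions per cell
      Result result = table.get(get);
      for (Cell cell : result.rawCells()) {
        System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell)) + " @ "
            + cell.getTimestamp() + " = " + Bytes.toString(CellUtil.cloneValue(cell)));
      }
    }
  }
}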
HBASE: Region Servers
Tables are split into horizontal regions
Each region comprises a subset of rows
HDFS: NameNode, DataNode
MapReduce: JobTracker, TaskTracker
HBASE: Master Server, Region Server

HBASE Architecture
HBASE vs RDBMS
HBase tables are similar to RDBMS tables, with a few differences:
Rows are sorted by row key
Only cells are versioned
Columns can be added on the fly by the client, as long as the column family they belong to already exists
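As an illustration of the last point, a small sketch that writes a brand-new column qualifier into an existing column family; only the family ("anchor") must exist in the table schema, while the qualifier ("example.org", invented here) is created at write time:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WebtablePut {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("webtable"))) {
      Put put = new Put(Bytes.toBytes("com.cnn.www"));
      // "anchor" must be a pre-existing column family; "example.org" is a new, ad-hoc qualifier
      put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("example.org"),
          Bytes.toBytes("cnn.com/3"));
      table.put(put);                             // the new cell is versioned by its timestamp
    }
  }
}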
