Agenda
Architecture of HDFS and MapReduce
Hadoop Streaming
Hadoop Pipes
Basics of HBase and ZooKeeper
Hadoop Distributed Filesystem
When a dataset outgrows the storage capacity of a single
physical machine, it becomes necessary to partition it
across a number of separate machines.
Filesystems that manage the storage across a network of
machines are called distributed filesystems.
Since they are network-based, all the complications of
network programming kick in, thus making distributed
filesystems more complex than regular disk filesystems.
For example, one of the biggest challenges is making the
filesystem tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS,
which stands for Hadoop Distributed Filesystem.
Design of HDFS
Very large files
Very large in this context means files that are hundreds of
megabytes, gigabytes, or terabytes in size. There are Hadoop
clusters running today that store petabytes of data.
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on.
It's designed to run on clusters of commodity hardware for which the
chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable
interruption to the user in the face of such failure.
Where HDFS is not a good fit
Low-latency data access
Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS. Remember, HDFS is
optimized for delivering a high throughput of data, and this may be at
the expense of latency. HBase is currently a better choice for low-latency
access.
By making a block large enough, the time to transfer the data from the
disk can be made to be significantly larger than the time to seek to the
start of the block.
A quick calculation shows that if the seek time is around 10 ms, and the
transfer rate is 100 MB/s, then to make the seek time 1% of the transfer
time, we need to make the block size around 100 MB.
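Spelled out, the same calculation:

  seek time            = 10 ms
  target transfer time = 10 ms / 0.01 = 1 s    (seek is then 1% of transfer)
  block size           = 100 MB/s × 1 s = 100 MB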
Job initialization involves creating an object to represent the job,
which encapsulates its tasks and bookkeeping information to keep
track of the tasks' status and progress (step 5).
To create the list of tasks to run, the job scheduler first
retrieves the input splits computed by the JobClient from
the shared filesystem (step 6).
It then creates one map task for each split.
The number of reduce tasks to create is determined
by the mapred.reduce.tasks property in the JobConf.
The scheduler simply creates this number of reduce tasks to
be run. Tasks are given IDs at this point.
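As a minimal sketch using the classic org.apache.hadoop.mapred API: setting the reduce-task count programmatically, which writes the mapred.reduce.tasks property (the driver class name is hypothetical):

  import org.apache.hadoop.mapred.JobConf;

  public class ReduceTaskConfig {
      public static void main(String[] args) {
          // The class argument is only used to locate the job JAR.
          JobConf conf = new JobConf(ReduceTaskConfig.class);
          // Backs mapred.reduce.tasks: the scheduler creates exactly this
          // many reduce tasks (map tasks are one per input split).
          conf.setNumReduceTasks(4);
          System.out.println(conf.get("mapred.reduce.tasks"));  // prints 4
      }
  }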
Task Assignment
Tasktrackers run a simple loop that periodically sends
heartbeat method calls to the jobtracker.
Heartbeats tell the jobtracker that a tasktracker is alive.
As a part of the heartbeat, a tasktracker will indicate
whether it is ready to run a new task, and if it is, the
jobtracker will allocate it a task, which it communicates to
the tasktracker using the heartbeat return value (step 7).
Tasktrackers have a fixed number of slots for map tasks
and for reduce tasks:
The default scheduler fills empty map task slots before
reduce task slots
If the tasktracker has at least one empty map task slot,
the jobtracker will select a map task; otherwise, it will
select a reduce task (sketched below).
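A purely illustrative sketch of the heartbeat loop and the slot preference described above; JobTrackerStub, HeartbeatResponse, and Task are hypothetical stand-ins, not Hadoop's real internal classes:

  // Hypothetical model of the tasktracker heartbeat loop (step 7).
  public class TasktrackerLoop {
      interface Task { void run(); }
      interface HeartbeatResponse { Task assignedTask(); }  // null when nothing assigned
      interface JobTrackerStub {
          // The tasktracker reports which kinds of slots are free; the
          // jobtracker prefers a map task while a map slot is empty.
          HeartbeatResponse heartbeat(boolean freeMapSlot, boolean freeReduceSlot);
      }

      static void run(JobTrackerStub jobtracker) throws InterruptedException {
          while (true) {
              // "I am alive" plus a request for work; the assigned task
              // comes back in the heartbeat return value.
              HeartbeatResponse response = jobtracker.heartbeat(true, true);
              Task task = response.assignedTask();
              if (task != null) {
                  task.run();
              }
              Thread.sleep(5000);  // heartbeat interval (simplified)
          }
      }
  }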
Task Execution
Once the tasktracker has been assigned a task:
It localizes the job JAR by copying it from the shared
filesystem to the tasktracker's filesystem. It also copies any
files needed from the distributed cache by the application to
the local disk (step 8).
It creates a local working directory for the task, and
un-jars the contents of the JAR into this directory.
It creates an instance of TaskRunner to run the task.
TaskRunner
TaskRunner launches a new Java Virtual Machine (step 9)
to run each task in (step 10).
Why a new JVM for each task?
Any bugs in the user-defined map and reduce functions
don't affect the tasktracker (by causing it to crash or hang).
The child process communicates with its parent through the
umbilical interface.
It informs the parent of the task's progress every few
seconds until the task is complete.
User Defined Counters
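As a minimal sketch using the classic org.apache.hadoop.mapred API: a user-defined counter is typically declared as a Java enum and incremented through the Reporter (the mapper and counter names below are hypothetical):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class CountingMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {

      // A user-defined counter group, declared as a Java enum (hypothetical names).
      enum RecordQuality { GOOD, MALFORMED }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
          if (value.toString().trim().isEmpty()) {
              reporter.incrCounter(RecordQuality.MALFORMED, 1);
          } else {
              reporter.incrCounter(RecordQuality.GOOD, 1);
              output.collect(value, new LongWritable(1));
          }
      }
  }

Hadoop aggregates the counts from all tasks and reports the totals at the end of the job.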
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to
write your map and reduce functions in languages other
than Java.
Hadoop Streaming uses Unix standard streams as the
interface between Hadoop and your program, so you can
use any language that can read standard input and write
to standard output to write your MapReduce program.
Ruby, Python
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/sample.txt \
  -output output \
  -mapper src/main/ruby/mapper.rb \
  -reducer src/main/ruby/reducer.rb
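The mapper and reducer can be any executable that honors the stdin/stdout contract. As a sketch, a hypothetical word-count mapper written in Java; keys and values are tab-separated, one pair per line:

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;

  // Streaming contract: read input lines from standard input and write
  // tab-separated key/value pairs to standard output.
  public class StreamingMapper {
      public static void main(String[] args) throws IOException {
          BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
          String line;
          while ((line = in.readLine()) != null) {
              for (String word : line.split("\\s+")) {
                  if (!word.isEmpty()) {
                      System.out.println(word + "\t1");  // key <TAB> value
                  }
              }
          }
      }
  }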
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to
Hadoop MapReduce.
Unlike Streaming, which uses standard input and
output to communicate with the map and reduce code,
Pipes uses sockets as the channel over which the
tasktracker communicates with the process running the
C++ map or reduce function. JNI is not used.
hadoop fs -put max_temperature bin/max_temperature
hadoop fs -put input/sample.txt sample.txt
hadoop pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input sample.txt \
  -output output \
  -program bin/max_temperature
Job status update
Data flow for two jobs
HBase
HBase is a distributed column-oriented
database built on top of HDFS
HBase is the Hadoop application to use when you require
real-time read/write random access to very large datasets.
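As a minimal sketch against the classic HBase Java client API; the table name "testtable" and the column family "data" are assumptions and must already exist:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseRandomAccess {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "testtable");  // assumed existing table

          // Random write: a single cell in the row keyed "row1".
          Put put = new Put(Bytes.toBytes("row1"));
          put.add(Bytes.toBytes("data"), Bytes.toBytes("qual1"),
                  Bytes.toBytes("value1"));
          table.put(put);

          // Random read: fetch the same row back by key.
          Get get = new Get(Bytes.toBytes("row1"));
          System.out.println(table.get(get));

          table.close();
      }
  }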
ZooKeeper
http://blog.adku.com/2011/02/hbase-vs-cassandra.html
http://www.google.co.in/search?q=hbase+tutorial&ie=utf-8&oe=utf-8&aq=t&rls=o