LABELS: HADOOP-TUTORIAL, HDFS
3 OCTOBER 2013
Hadoop Tutorial: Part 1 - What is Hadoop? (An Overview)

Hadoop is an open source software framework that supports data-intensive distributed applications and is licensed under the Apache v2 license.
At least, that is what you will find as the first line of the definition of Hadoop on Wikipedia. So what are data-intensive distributed applications?
Well, data-intensive is nothing but Big Data (data that has outgrown in size), and distributed applications are applications that work over a network by communicating and coordinating with each other through message passing (say, using RPC for inter-process communication or through a message queue).
Hence Hadoop works in a distributed environment and is built to store, handle and process very large data sets (in petabytes, exabytes and more). Now, just because I am saying that Hadoop stores petabytes of data, this doesn't mean that Hadoop is a database. Remember, it is a framework that handles large amounts of data for processing. You will get to know the difference between Hadoop and databases (or NoSQL databases, which is what we call Big Data's databases) as you go down the line in the coming tutorials.
Hadoop was derived from the research papers published by Google on the Google File System (GFS) and Google's MapReduce. So there are two integral parts of Hadoop: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.
Hadoop Distributed File System (HDFS)
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Well, let's get into the details of the statement mentioned above:
Very large files: When we say very large files, we mean that the size of a file can be in the range of gigabytes, terabytes, petabytes or maybe more.
Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
Now, here we are talking about a filesystem, the Hadoop Distributed FileSystem. We all know a few of the other file systems, like the Linux filesystem and the Windows filesystem. So the next question that comes up is...
What is the difference between a normal FileSystem and the Hadoop Distributed File System?
The two major notable differences between HDFS and other filesystems are:
Block Size: Every disk has a block size, which is the minimum amount of data that can be written to or read from the disk. A filesystem also consists of blocks, which are built out of these disk blocks. Normally disk blocks are 512 bytes and filesystem blocks are a few kilobytes. HDFS also has the concept of blocks, but here one block is 64 MB by default, and the size can be increased in integral multiples of 64, i.e. 128 MB, 256 MB, 512 MB or even more (in GBs); it all depends on the requirements and use cases (see the configuration sketch just after this list).
So why is the block size so large for HDFS? Keep on reading and you will get it in the next few tutorials :)
Metadata Storage: In a normal file system there is hierarchical storage of metadata. Let's say there is a folder ABC; inside that folder there is another folder DEF, and inside that there is a hello.txt file. Now the information about hello.txt (i.e. the metadata of hello.txt) will be kept with DEF, and in turn the metadata of DEF will be kept with ABC. Hence this forms a hierarchy, and the hierarchy is maintained up to the root of the filesystem. But in HDFS we don't have a hierarchy of metadata. All the metadata information resides on a single machine, known as the Namenode (or Master Node) of the cluster. This node contains all the information about the files and folders and lots of other information too, which we will learn in the next few tutorials. :)
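As a quick illustration of the block-size knob mentioned above, here is a minimal sketch of how it is commonly set, assuming the Hadoop 1.x property name dfs.block.size in hdfs-site.xml (the 128 MB value is just an example, not a recommendation):

<!-- hdfs-site.xml (sketch): raise the default block size from 64 MB to 128 MB -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 * 1024 * 1024 bytes -->
</property>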
Well, this was just an overview of Hadoop and the Hadoop Distributed File System. In the next part I will go into the depth of HDFS and thereafter MapReduce, and will continue from here...
Let me know in the comment section if you have any doubts in understanding anything, and I will be really glad to answer them :)
Comments
Romain Rigaux said...
Nice summary!
October 03, 2013
Pragya Khare said...
I know I'm a beginner and this question might be a silly one....but can you please explain to me how PARALLELISM is achieved via map-reduce at the processor level??? If I've a dual core processor, is it that only 2 jobs will run at a time in parallel?
October 05, 2013
Anonymous said...
Hi, I am from a Mainframe background and with little knowledge of core Java... Do you think Java is needed for learning Hadoop in addition to Hive/Pig? I even want to learn Java for map reduce but couldn't find what all will be used in real time.. and the Definitive Guide book seems tough for learning MapReduce with Java.. any option where I can learn it step by step?
Sorry for the long comment.. but it would be helpful if you can guide me..
October 05, 2013
Deepak Kumar said...
@Pragya Khare...
First thing, always remember the one popular saying: NO questions are foolish :) And by the way, it is a very good question.
Actually there are two things: one is what will be the best practice, and the other is what happens by default.
Well, by default the number of mappers and reducers is set to 2 for any task tracker, hence one sees a maximum of 2 maps and 2 reduces at a given instance on a TaskTracker (which is configurable). This doesn't only depend on the processor but on lots of other factors as well, like RAM, CPU, power, disk and others....
http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
And for the other part, i.e. best practices, it depends on your use case. You can go through the 3rd point of the link below to understand it more conceptually:
http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
Well, I will explain all of this when I reach the advanced MapReduce tutorials.. Till then keep reading !! :)
October 05, 2013
Deepak Kumar said...
@Anonymous
As Hadoop is written in Java, most of its APIs are written in core Java... Well, to know about the Hadoop architecture you don't need Java... But to go to its API level and start programming in MapReduce you need to know core Java.
And as for the requirements in Java you have asked about... you just need simple core Java concepts and programming for Hadoop and MapReduce.. And Hive/Pig are SQL-like data flow languages that are really easy to learn... And since you are from a programming background it won't be very difficult to learn Java :) You can also go through the link below for further details :)
http://www.bigdataplanet.info/2013/09/What-are-the-Pre-requisites-for-getting-started-with-Big-Data-Technologies.html
October 05, 2013
LABELS: HADOOP-TUTORIAL, HDFS
6 OCTOBER 2013
Hadoop Tutorial: Part 2 - Hadoop Distributed File System (HDFS)

In the last tutorial on What is Hadoop?, I gave you a brief idea about Hadoop. The two integral parts of Hadoop are Hadoop HDFS and Hadoop MapReduce.
Let's go further, deep inside HDFS.
Hadoop Distributed File System (HDFS) Concepts:
First take a look at the following two terms that will be used while describing HDFS.
Cluster: A Hadoop cluster is made up of many machines in a network; each machine is termed a node, and these nodes talk to each other over the network.
Block Size: This is the minimum unit of storage in a filesystem, in which data is kept contiguously. The default size of a single block in HDFS is 64 MB.
In HDFS, data is stored by splitting it into small chunks or parts. Let's say you have a text file of 200 MB and you want to keep this file in a Hadoop cluster. What happens is that the file breaks, or splits, into a number of chunks, where each chunk is equal to the block size that is set for the HDFS cluster (which is 64 MB by default). Hence a 200 MB file gets split into 4 parts: 3 parts of 64 MB and 1 part of 8 MB, and each part will be kept on a different machine (a small sketch of this arithmetic follows below). Which split is kept on which machine is decided by the Namenode, which we will discuss in detail below.
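To make that arithmetic concrete, here is a minimal, hedged sketch in Java (the splitSizes helper is hypothetical and simply mirrors the division described above; it is not HDFS code):

import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {
    // Returns the sizes (in MB) of the chunks a file would be divided into,
    // given the file size and the configured block size.
    static List<Long> splitSizes(long fileSizeMb, long blockSizeMb) {
        List<Long> chunks = new ArrayList<Long>();
        long remaining = fileSizeMb;
        while (remaining > 0) {
            chunks.add(Math.min(blockSizeMb, remaining));
            remaining -= blockSizeMb;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A 200 MB file with the default 64 MB block size -> [64, 64, 64, 8]
        System.out.println(splitSizes(200, 64));
    }
}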
Now, in a Hadoop Distributed File System (HDFS) cluster, there are two kinds of nodes: a master node and many worker nodes. These are known as the Namenode (master node) and the Datanodes (worker nodes).
Namenode:
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. So it contains the information about all the files and directories and their hierarchy in the cluster, in the form of a namespace image and edit logs. Along with the filesystem information, it also knows the Datanodes on which all the blocks of a file are kept.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a filesystem interface similar to a Portable Operating System Interface (POSIX), so the user code does not need to know about the namenode and datanodes to function.
Datanode:
These are the workers that do the real work, and here by real work we mean that the storage of the actual data is done by the datanodes. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.
One important thing to note here: in one cluster there will be only one Namenode, and there can be N number of datanodes.
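If you want to see this block-to-datanode mapping for yourself on a running cluster, the fsck utility that ships with Hadoop can print the blocks of a file and the nodes they live on; a hedged example (the path is just a placeholder):

bin/hadoop fsck /user/deepak/sample.txt -files -blocks -locations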
Since the Namenode contains the metadata of all the files and directories and also knows the datanodes on which each split of a file is stored, what do you think will happen if the Namenode goes down?
Yes, if the Namenode is down we cannot access any of the files and directories in the cluster.
We will not even be able to connect with any of the datanodes to get any of the files. Think of it: we have kept our files by splitting them into different chunks, and we have kept those chunks on different datanodes, and it is the Namenode that keeps track of all the files' metadata. So only the Namenode knows how to reconstruct a file back into one piece from all the splits, and this is the reason that if the Namenode is down in a Hadoop cluster, everything is down.
This is also the reason why the Namenode is known as a single point of failure in Hadoop.
Now, since the Namenode is so important, we have to make it resilient to failure, and for that Hadoop provides us with two mechanisms.
The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
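A minimal sketch of that configuration, assuming the Hadoop 1.x property name dfs.name.dir in hdfs-site.xml (the two directories are placeholders for a local disk and an NFS mount):

<!-- hdfs-site.xml (sketch): namenode metadata written to two locations -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/remote-nfs/dfs/nn</value>
</property>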
The second way is to run a Secondary Namenode. Well, as the name suggests, it does not act like a Namenode. So if it doesn't act like a namenode, how does it prevent failure?
Well, the secondary namenode also contains a namespace image and edit logs, like the namenode. After every certain interval of time (one hour by default) it copies the namespace image from the namenode, merges this namespace image with the edit log, and copies it back to the namenode so that the namenode has a fresh copy of the namespace image. Now let's suppose that at some instant the namenode goes down and becomes corrupt; then we can restart some other machine with the namespace image and the edit log that we have with the secondary namenode, and hence a total failure can be prevented.
The secondary namenode needs almost the same amount of memory and CPU for its work as the Namenode, so it is also kept on a separate machine, like the namenode. Hence we see that in a single cluster we have one Namenode, one Secondary Namenode and many Datanodes, and HDFS consists of these three elements.
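The one-hour checkpoint interval mentioned above is itself configurable; a minimal sketch, assuming the Hadoop 1.x property name fs.checkpoint.period (set in core-site.xml, in seconds):

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- 3600 seconds = one hour, the default -->
</property>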
This was again an overview of the Hadoop Distributed File System (HDFS). In the next part of the tutorial we will look at the working of the Namenode and Datanode in a more detailed manner, and we will see how reads and writes happen in HDFS.
Let me know in the comment section if you have any doubts in understanding anything, and I will be really glad to answer your questions :)
Comments
Vishwash said...
Very informative...
October 07, 2013
Tushar Karande said...
Thanks for such informative tutorials :)
please keep posting .. waiting for more... :)
October 08, 2013
Anonymous said...
Nice information........ But I have one doubt: what is the advantage of keeping the file in chunks on different datanodes? What kind of benefit are we getting here?
October 08, 2013
Deepak Kumar said...
@Anonymous: Well, there are lots of reasons... I will explain them in great detail in the next few articles...
But for now let us understand this... since we have split the file into two, we can now take the power of two processors (parallel processing) on two different nodes to do our analysis (like search, calculation, prediction and lots more).. Again, let's say my file size is in some petabytes... You won't find one hard disk that big.. and even if there were one, how do you think we would read and write on that hard disk? The latency would be really high... it would take lots of time... Again, there are more reasons for the same... I will make you understand this in more technical ways in the coming tutorials... Till then keep reading :)
October 08, 2013
http://www.bigdataplanet.info/2013/10/Hadoop-Tutorial-Part-2-Hadoop-Distributed-File-System.html
http://www.devx.com/opensource/exploring-the-hadoop-distributed-file-system-hdfs.html
Exploring the Hadoop Distributed File
System (HDFS)
Kaushik Pal explores the basics of the Hadoop Distributed File System (HDFS), the underlying file system of the Apache Hadoop framework.
by Kaushik Pal
http://beginnersbook.com/2013/.../hdfs/
Hadoop Distributed File System (HDFS)
by CHAITANYA SINGH
in HADOOP
Before understanding what HDFS is, I would first like to explain what a distributed file system is.
What is a Distributed File System?
As you know, each physical system has its own storage limit. And when it comes to storing lots of data, we may need more than one system: basically a network of systems, so that the data can be segregated among various machines which are connected to each other through a network. Such a way of managing the storage of bulk data is known as a distributed file system.
What is HDFS - Hadoop Distributed File System?
Hadoop has its own distributed file system, which is known as HDFS (renamed from NDFS).
HDFS Design
1. Hadoop doesn't require expensive hardware to store data; rather, it is designed to support common and easily available hardware.
2. It is designed to store very large files (as you all know, in order to index the whole web it may be necessary to store files which are terabytes or petabytes in size, or even more than that). Hadoop clusters are used to perform this task.
3. It is designed for streaming data access.
Hadoop file systems
1) Local: This file system is for locally connected disks.
2) HDFS: The Hadoop distributed file system, explained above.
3) HFTP: Its purpose is to provide read-only access to the Hadoop distributed file system over HTTP.
4) HSFTP: Almost similar to HFTP; unlike HFTP, it provides read-only access over HTTPS.
5) HAR - Hadoop Archives: Used for archiving files.
6) WebHDFS: Grants write access over HTTP.
7) KFS: A cloud store system similar to GFS and HDFS.
8) Distributed RAID: Like HAR, it is also used for archival.
9) S3: A file system backed by Amazon S3.
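Each of these file systems is selected by the URI scheme used with the generic filesystem shell. A couple of hedged examples (host name, port and paths are placeholders):

hadoop fs -ls file:///tmp                       # local file system
hadoop fs -ls hdfs://namenode-host:8020/user    # HDFS
hadoop fs -ls har:///user/archive/files.har     # a Hadoop archive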
HDFS Cluster Nodes
An HDFS cluster has two kinds of nodes:
1. Namenode
2. Datanode
1. Namenode
It basically stores the names and addresses of the datanodes, and it keeps the filesystem metadata in the form of a tree. Without the namenode this whole system of storing and retrieving data would not work, as it is the namenode that is responsible for knowing which data is stored where.
2. Datanodes
Datanodes are used to store the data in the form of blocks. They store and retrieve data blocks after communicating with the namenode.
Important links:
1. HDFS Guide
2. HDFS Java API
3. HDFS Source code
Hadoop 1.0.4 Documentation
HDFS Architecture Guide
-$trod"ctio$
Ass"#ptio$s a$d (oa.s
o Hard0are Fai."re
o trea#i$1 Data Access
o 'ar1e Data ets
o i#p.e Cohere$c& 2ode.
o 3 2o4i$1 Co#p"tatio$ is Cheaper tha$ 2o4i$1 Data5
o Porta)i.it& Across Hetero1e$eo"s Hard0are a$d o/t0are P.at/or#s
6a#e6ode a$d Data6odes
,he Fi.e &ste# 6a#espace
Data *ep.icatio$
o *ep.ica P.ace#e$t7 ,he First 8a)& teps
o *ep.ica e.ectio$
o a/e#ode
,he Persiste$ce o/ Fi.e &ste# 2etadata
,he Co##"$icatio$ Protoco.s
*o)"st$ess
o Data Disk Fai."re9 Heart)eats a$d *e:*ep.icatio$
o C."ster *e)a.a$ci$1
o Data -$te1rit&
o 2etadata Disk Fai."re
o $apshots
Data ;r1a$i<atio$
o Data 8.ocks
o ta1i$1
o *ep.icatio$ Pipe.i$i$1
Accessi)i.it&
o F he..
o DFAd#i$
o 8ro0ser -$ter/ace
pace *ec.a#atio$
o Fi.e De.etes a$d !$de.etes
o Decrease *ep.icatio$ Factor
*e/ere$ces
Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is http://hadoop.apache.org/hdfs/.
Assumptions and Goals
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.
Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
"Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
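As a small illustration of the per-file replication factor (a hedged sketch; the path is a placeholder), it can be changed from the FS shell described later in this guide:

bin/hadoop dfs -setrep -w 3 /foodir/myfile.txt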
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
The current, default replica placement policy described here is a work in progress.
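The three-replica rule above can be summarized in a few lines. The snippet below is only a hedged, self-contained illustration (the real block placement code also weighs node load, free space and the full rack topology); the rack and node names are made up:

import java.util.*;

public class PlacementSketch {
    public static void main(String[] args) {
        // Hypothetical cluster: rack name -> nodes on that rack.
        Map<String, List<String>> racks = new LinkedHashMap<String, List<String>>();
        racks.put("/rack1", Arrays.asList("node1", "node2"));
        racks.put("/rack2", Arrays.asList("node3", "node4"));

        String localRack = "/rack1";   // rack of the writing client
        String remoteRack = "/rack2";  // any rack other than the local one

        List<String> targets = new ArrayList<String>();
        targets.add(racks.get(localRack).get(0));   // replica 1: node on the local rack
        targets.add(racks.get(remoteRack).get(0));  // replica 2: node on a remote rack
        targets.add(racks.get(remoteRack).get(1));  // replica 3: different node, same remote rack
        System.out.println(targets);                // [node1, node3, node4]
    }
}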
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
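An administrator can inspect or control this state with the DFSAdmin tool covered later in this guide; two hedged examples:

bin/hadoop dfsadmin -safemode get     # report whether the NameNode is in Safemode
bin/hadoop dfsadmin -safemode leave   # force the NameNode to leave Safemode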
The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode: this is the Blockreport.
The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.
Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
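The idea can be illustrated with a few lines of Java. This is only a conceptual sketch using java.util.zip.CRC32 (the real client checksums data in small, configurable chunks rather than whole blocks, and stores the results in the hidden checksum file mentioned above):

import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 checksum over a buffer, standing in for the per-chunk
    // checksums the HDFS client keeps alongside the data it writes.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block contents".getBytes();
        long storedAtWrite = checksum(block);  // computed when the block is written
        long seenAtRead = checksum(block);     // recomputed when the block is read back
        // If the two values differed, the client would fetch the block
        // from another DataNode that holds a replica.
        System.out.println(storedAtWrite == seenAtRead ? "block OK" : "block corrupt");
    }
}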
Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.
The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
Snapshots
Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release.
Data Organization
Data Blocks
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
Staging
A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.
The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
Accessibility
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
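As a hedged sketch of the Java API mentioned above (the path is a placeholder and the cluster location comes from the client's Hadoop configuration files), the following program prints an HDFS file to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // the configured default file system
        FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}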
FS Shell
HDFS allows user data to be organized in the form of files and directories. It provides a commandline interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:
Action                                                 Command
Create a directory named /foodir                       bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                       bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt   bin/hadoop dfs -cat /foodir/myfile.txt
FS shell is targeted for applications that need a scripting language to interact with the stored data.
DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:
Action                                        Command
Put the cluster in Safemode                   bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes                  bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)      bin/hadoop dfsadmin -refreshNodes
Browser Interface
A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
Space Reclamation
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
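In practice the trash location is kept per user; a hedged example of deleting a file and then pulling it back (assuming trash is enabled via the fs.trash.interval property, and with /user/deepak as a placeholder home directory):

bin/hadoop dfs -rm /foodir/myfile.txt
bin/hadoop dfs -mv /user/deepak/.Trash/Current/foodir/myfile.txt /foodir/myfile.txt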
Decrease /eplication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas
that can be deleted. The next Heartbeat transfers this information to the DataNode.
The DataNode then removes the corresponding blocks and the corresponding free
space appears in the cluster. Once again, there might be a time delay between the
completion of the setReplication API call and the appearance of free space in the
cluster.
/e"erences
HDFS Java API: http://hadoop.apache.org/core/docs/current/api/
HDFS source code: http://hadoop.apache.org/hdfs/version_control.html
by Dhruba Borthakur
Last Published: 02/13/2013 19:20:58
Copyright © 2008 The Apache Software Foundation.
The Power of Hadoop
Even within the context of other hi-tech technologies, Hadoop went from
obscurity to fame in a miraculously short amount of time. It had to; the pressures
driving the development of this technology were too great. If you are not familiar
with Hadoop, let's start by looking at the void it is trying to fill.
Companies, up until recently (say the last five to ten years or so), did not have
the massive amounts of data to manage as they do today. Most companies only
had to manage the data relating to running their business and managing their
customers. Even those with millions of customers didn't have trouble storing data
using your everyday relational database like Microsoft SQL Server or Oracle.
But today, companies are realizing that with the growth of the Internet and with
self-servicing (or SaaS) Web sites, there are now hundreds of millions of potential
customers that are all voluntarily providing massive amounts of valuable
business intelligence. Think of storing something as simple as a Web log that
records every click of every user on your site. How does a company store and
manipulate this data when it is generating potentially trillions of rows of data
every year?
Generally speaking, the essence of the problem Hadoop is attempting to solve is
that data is coming in faster than hard drive capacities are growing. Today we
have 4 TB drives available, which can then be assembled on SAN or NAS devices
to easily get 40 TB volumes or maybe even 400 TB volumes. But what if you
needed a 4,000 TB, or 4 petabyte (PB), volume? The costs quickly get incredibly
high for most companies to absorb... until now. Enter Hadoop.
Hadoop Architecture
One of the keys to Hadoop's success is that it operates on everyday common
hardware. A typical company has a backroom with hardware that has since passed
its prime. Using old and outdated computers, one can pack them full of relatively
inexpensive hard drives (they don't need to be the same total capacity within each
computer) and use them within a Hadoop cluster. Need to expand capacity? Add
more computers or hard drives. Hadoop can leverage all the hard drives into one
giant volume available for storing all types of data, from web logs to large video
files. It is not uncommon for Hadoop to be used to store rows of data that are
over 1 GB per row!
The file system that Hadoop uses is called the Hadoop Distributed File System, or
HDFS. It is a highly fault tolerant file system that focuses on high availability and
fast read performance. It is best used for data that is written once and read often. It
leverages all the hard drives in the system when writing data because Hadoop
knows that bottlenecks stem from writing and reading to a single hard drive. The
more hard drives are used simultaneously during the writing and reading of data,
the faster the system operates as a whole.
The HDFS file system operates in small file blocks which are spread across all
hard drives available within a cluster. The block size is configurable and
optimized to the data being stored. It also replicates the blocks over multiple
drives across multiple computers and even across multiple network subnets. This
allows for hard drives or computers to fail (and they will) and not disrupt the
system. It also allows Hadoop to be strategic in which blocks it accesses during a
read. Hadoop will choose to read certain replicated blocks when it feels it can
retrieve the data faster using one computer over another. Hadoop analyzes which
computers and hard drives are currently being utilized, along with network
bandwidth, to strategically pick the next hard drive to read a block. This produces
a system that is very quick to respond to requests.
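To make the "configurable block size and replication" point concrete, here is a minimal hdfs-site.xml sketch. The dfs.replication property is the long-standing name; the block-size property has been spelled dfs.block.size in older releases and dfs.blocksize in newer ones, so check the release you run. The values shown are illustrative.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>              <!-- number of copies kept of each block -->
  </property>
  <property>
    <name>dfs.block.size</name>   <!-- dfs.blocksize in newer releases -->
    <value>134217728</value>      <!-- 128 MB, expressed in bytes -->
  </property>
</configuration>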
MapReduce
Despite the relatively odd name, MapReduce is the cornerstone of Hadoop's data
retrieval system. It is an abstracted programming layer on top of HDFS and is
responsible for simplifying how data is read back to the user. It has a purpose
similar to SQL in that it allows programmers to focus on building intelligent
queries and not get involved in the underlying plumbing responsible for
implementing or optimizing the queries. The "Map" part of the name refers to the
task of building a map of the best way to sort and filter the information
requested, and then returning it as a pseudo result set. The "Reduce" task
summarizes the data, such as counting and summing certain columns.
These two tasks are both analyzed by the Hadoop engine and then broken into
many pieces or nodes (a divide and conquer model) which are all processed in
parallel by individual workers. The result is the ability to process petabytes of
data in a matter of hours.
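To make the Map and Reduce roles concrete, below is a compact sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. It is the standard introductory example rather than anything specific to this article; class names are arbitrary, and the input and output paths are supplied on the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts collected for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this is normally packaged into a jar and launched with something like bin/hadoop jar wordcount.jar WordCount /input /output.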
MapReduce is a programming model originally developed by Google and has
since been implemented in many programming languages. You can find out more
on MapReduce by visiting http://mapreduce.org.
In my next post, I'll take a look at some of the other popular components around
Hadoop, including advanced analytical tools like Hive and Pig. In the meantime, if
you'd like to learn more about Hadoop, check out our new course.
Apache Hadoop, Hadoop, Apache, the Apache feather logo, and the Apache
Hadoop project logo are either registered trademarks or trademarks of the
Apache Software Foundation in the United States and other countries.
About the Author
%artin "hae<erle is the Uice "resident of 8echnology
for .earn)owAnline. $artin 9oined the copany in ;JJ> and started teaching 68
professionals nationwide to de&elop applications using Uisual Studio and
$icrosoft S%. Ser&er. He has been a featured speaker at &arious conferences
including $icrosoft 8ech-4d' -e&#onnections and the $icrosoft )#- #hannel
Suit. 8oday' he is responsible for all product and software de&elopent as
well as anaging the copany7s 68 infrastructure. $artin en9oys staying on the
cutting edge of technology and guiding the copany to produce the best
learning content with the best user e(perience in the industry. 6n his spare tie'
$artin en9oys golf' fshing' and being with his wife and three teenage children.
This entry was posted in Uncategorized and tagged big data, Hadoop, Hadoop
architecture, Hadoop Distributed File System, HDFS, MapReduce, Martin
Schaeferle on July 3, 2014 by Marty S.
"1/&*H
Search for1
Search
!O))OW 0"
)/T1"T PO"T"
"ower "i&ot -ashboards
HadoopQ"igs' Hi&es' and Vookeepers' Ah $yD
/ootstrap 2undaentals with Ada /arney
Watch S0etting Started with AngularFST
AgileGScru 4ssentials for "ractitioners
/&*H3(1"
Fuly <E;>
Fune <E;>
$ay <E;>
April <E;>
$arch <E;>
2ebruary <E;>
Fanuary <E;>
-eceber <E;B
)o&eber <E;B
Actober <E;B
Septeber <E;B
August <E;B
Fuly <E;B
Fune <E;B
$ay <E;B
April <E;B
$arch <E;B
2ebruary <E;B
Fanuary <E;B
-eceber <E;<
)o&eber <E;<
Actober <E;<
Septeber <E;<
August <E;<
Fuly <E;<
Proudl! powered b! 1ordPress
From Wikipedia, the free encyclopedia
Apache Hadoop
Developer(s): Apache Software Foundation
Stable release: 0.20.0 / 2009-04-22 (8 months ago)
Written in: Java
Operating system: Cross-platform
Development status: Active
Type: Distributed File System
License: Apache License 2.0
Website: http://hadoop.apache.org/
Apache Hadoop is a Java software framework that supports data-intensive distributed
applications under a free license.[1] It enables applications to work with thousands of nodes
and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File
System (GFS) papers.
Hadoop is a top-level Apache project, being built and used by a community of contributors
from all over the world.[2] Yahoo! has been the largest contributor[3] to the project and uses
Hadoop extensively in its web search and advertising businesses.[4] IBM and Google have
announced a major initiative to use Hadoop to support university courses in distributed
computer programming.[5]
Hadoop was created by Doug Cutting (now a Cloudera employee),[6] who named it after his
child's stuffed elephant. It was originally developed to support distribution for
the Nutch search engine project.[7]
Contents
1 Architecture
  1.1 Hadoop Distributed File System
  1.2 Job Tracker and Task Tracker: the MapReduce engine
  1.3 Other applications
2 Prominent users
  2.1 Hadoop at Yahoo!
  2.2 Other users
3 Hadoop on Amazon EC2/S3 services
4 Hadoop with Sun Grid Engine
5 See also
6 References
7 Bibliography
8 External links
Arc"itecture
Hadoop co$sists o/ the Hadoop Core9 0hich pro4ides access to the /i.es&ste#s that Hadoop
s"pports. R*ack a0are$essR is a$ opti#i<atio$ 0hich takes i$to acco"$t the 1eo1raphic
c."steri$1 o/ ser4ersE $et0ork tra//ic )et0ee$ ser4ers i$ di//ere$t 1eo1raphic c."sters is
#i$i#i<ed.
PJQ
As o/ D"$e 200J9 the .ist o/ s"pported /i.es&ste#s i$c."des7
HDF7 HadoopOs o0$ /i.es&ste#. ,his is desi1$ed to sca.e to peta)&tes o/ stora1e a$d
r"$s o$ top o/ the /i.es&ste#s o/ the "$der.&i$1 operati$1 s&ste#s.
A#a<o$ 3 /i.es&ste#. ,his is tar1eted at c."sters hosted o$ the A#a<o$ +.astic
Co#p"te C.o"d ser4er:o$:de#a$d i$/rastr"ct"re. ,here is $o rack:a0are$ess i$ this
/i.es&ste#9 as it is a.. re#ote.
C.o"dtore =pre4io"s.& Fos#os Distri)"ted Fi.e &ste#> : .ike HDF9 this is rack:
a0are.
F,P Fi.es&ste#7 this stores a.. its data o$ re#ote.& accessi).e F,P ser4ers.
*ead:o$.& H,,P a$d H,,P /i.e s&ste#s.
Hadoop Distributed File System
The HDFS filesystem stores large files (an ideal file size is a multiple of 64 MB[9]), across
multiple machines. It achieves reliability by replicating the data across multiple hosts, and
hence does not require RAID storage on hosts. With the default replication value, 3, data is
stored on three nodes: two on the same rack, and one on a different rack.
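One way to see this placement on a live cluster is the fsck tool, which reports the blocks of a file and, with the flags shown, which DataNodes and racks hold each replica; the path here is illustrative only.

bin/hadoop fsck /foodir/myfile.txt -files -blocks -locations -racks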
,he /i.es&ste# is )"i.t /ro# a c."ster o/ data nodes9 each o/ 0hich ser4es "p ).ocks o/ data
o4er the $et0ork "si$1 a ).ock protoco. speci/ic to HDF. ,he& a.so ser4e the data o4er H,,P9
a..o0i$1 access to a.. co$te$t /ro# a 0e) )ro0ser or other c.ie$t. Data $odes ca$ ta.k to each
other to re)a.a$ce data9 to #o4e copies aro"$d9 a$d to keep the rep.icatio$ o/ data hi1h.
A /i.es&ste# reA"ires o$e "$iA"e ser4er9 the name node. ,his is a si$1.e poi$t o/ /ai."re /or a$
HDF i$sta..atio$. -/ the $a#e $ode 1oes do0$9 the /i.es&ste# is o//.i$e. Whe$ it co#es )ack
"p9 the $a#e $ode #"st rep.a& a.. o"tsta$di$1 operatio$s. ,his rep.a& process ca$ take o4er
ha./ a$ ho"r /or a )i1 c."ster.
P10Q
,he /i.es&ste# i$c."des 0hat is ca..ed a Secondary Namenode9
0hich #is.eads so#e peop.e i$to thi$ki$1 that 0he$ the pri#ar& 6a#e$ode 1oes o//.i$e9 the
eco$dar& 6a#e$ode takes o4er. -$ /act9 the eco$dar& 6a#e$ode re1".ar.& co$$ects 0ith
the $a#e$ode a$d do0$.oads a s$apshot o/ the pri#ar& 6a#e$odeOs director& i$/or#atio$9
0hich is the$ sa4ed to a director&. ,his eco$dar& 6a#e$ode is "sed to1ether 0ith the edit
.o1 o/ the Pri#ar& 6a#e$ode to create a$ "p:to:date director& str"ct"re.
A$other .i#itatio$ o/ HDF is that it ca$$ot )e direct.& #o"$ted )& a$ e?isti$1 operati$1
s&ste#. (etti$1 data i$to a$d o"t o/ the HDF /i.e s&ste#9 a$ actio$ that o/te$ $eeds to )e
per/or#ed )e/ore a$d a/ter e?ec"ti$1 a jo)9 ca$ )e i$co$4e$ie$t. A Fi.es&ste# i$
!serspace has )ee$ de4e.oped to address this pro).e#9 at .east /or 'i$"? a$d so#e other !$i?
s&ste#s.
*ep.icati$1 data three ti#es is cost.&. ,o a..e4iate this cost9 rece$t 4ersio$s o/ HDF ha4e
eras"re codi$1 s"pport 0here)& #".tip.e ).ocks o/ the sa#e /i.e are co#)i$ed to1ether to
1e$erate a parit& ).ock. HDF creates parit& ).ocks as&$chro$o"s.& a$d the$ decreases the
rep.icatio$ /actor o/ the /i.e /ro# 3 to 2. t"dies ha4e sho0$ that this tech$iA"e decreases the
ph&sica. stora1e reA"ire#e$ts /ro# a /actor o/ 3 to a /actor o/ aro"$d 2.2.
Job Tracker and Task Tracker: the MapReduce engine
Above the file systems comes the MapReduce engine, which consists of one Job Tracker, to
which client applications submit MapReduce jobs. The Job Tracker pushes work out to
available Task Tracker nodes in the cluster, striving to keep the work as close to the data as
possible. With a rack-aware filesystem, the Job Tracker knows which node contains the data,
and which other machines are nearby. If the work cannot be hosted on the actual node where
the data resides, priority is given to nodes in the same rack. This reduces network traffic on
the main backbone network. If a Task Tracker fails or times out, that part of the job is
rescheduled. If the Job Tracker fails, all ongoing work is lost.
Hadoop version 0.21 adds some checkpointing to this process; the Job Tracker records what it
is up to in the filesystem. When a Job Tracker starts up, it looks for any such data, so that it
can restart work from where it left off. In earlier versions of Hadoop, all active work was lost
when a Job Tracker restarted.
Known limitations of this approach are:
The allocation of work to task trackers is very simple. Every task tracker has a number
of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The
Job Tracker allocates work to the tracker nearest to the data with an available slot. There is
no consideration of the current active load of the allocated machine, and hence its actual
availability.
If one task tracker is very slow, it can delay the entire MapReduce operation,
especially towards the end of a job, where everything can end up waiting for a single slow
task. With speculative execution enabled, however, a single task can be executed on multiple
slave nodes (see the configuration sketch below).
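Speculative execution is switched per task type through the job configuration (typically mapred-site.xml). The property names below are the ones used by the classic JobTracker-era MapReduce framework; they may be spelled differently in later releases, so treat this as a sketch.

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>      <!-- allow duplicate attempts of slow map tasks -->
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>      <!-- allow duplicate attempts of slow reduce tasks -->
</property>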
Other applications
The HDFS filesystem is not restricted to MapReduce jobs. It can be used for other applications,
many of which are under way at Apache. The list includes the HBase database, the Apache
Mahout machine learning system, and matrix operations. Hadoop can in theory be used for
any sort of work that is batch-oriented rather than real-time, very data-intensive, and able to
work on pieces of the data in parallel.
Prominent users
Hadoop at Yahoo!
On February 19, 2008, Yahoo! launched what it claimed was the world's largest Hadoop
production application. The Yahoo! Search Webmap is a Hadoop application that runs on a
more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web
search query.[11]
There are multiple Hadoop clusters at Yahoo!, each occupying a single datacenter (or fraction
thereof). No HDFS filesystems or MapReduce jobs are split across multiple datacenters;
instead each datacenter has a separate filesystem and workload. The cluster servers run
Linux, and are configured on boot using Kickstart. Every machine bootstraps the Linux image,
including the Hadoop distribution. Cluster configuration is also aided through a program called
ZooKeeper. Work that the clusters perform is known to include the index calculations for the
Yahoo! search engine.
On June 10, 2009, Yahoo! released its own distribution of Hadoop.[12]
Other users
Besides Yahoo!, many other organizations are using Hadoop to run large distributed
computations. Some of them include:[2]
A9.com
AOL
Booz Allen Hamilton
eHarmony
Facebook
Freebase
Fox Interactive Media
IBM
ImageShack
ISI
Joost
Last.fm
LinkedIn
Metaweb
Meebo
Ning
Powerset (now part of Microsoft)
Proteus Technologies
The New York Times
Rackspace
Veoh
Hadoop on Amazon EC2/S3 services
It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple
Storage Service (S3).[13] As an example, The New York Times used 100 Amazon EC2 instances
and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million
finished PDFs in the space of 24 hours at a computation cost of about $240 (not including
bandwidth).[14]
,here is s"pport /or the 3 /i.es&ste# i$ Hadoop distri)"tio$s9 a$d the Hadoop tea# 1e$erates
+C2 #achi$e i#a1es a/ter e4er& re.ease. Fro# a p"re per/or#a$ce perspecti4e9 Hadoop o$
3B+C2 is i$e//icie$t9 as the 3 /i.es&ste# is re#ote a$d de.a&s ret"r$i$1 /ro# e4er& 0rite
operatio$ "$ti. the data are 1"ara$teed to $ot )e .ost. ,his re#o4es the .oca.it& ad4a$ta1es o/
Hadoop9 0hich sched".es 0ork $ear data to sa4e o$ $et0ork .oad.
On April 2, 2009, Amazon announced the beta release of a new service called Amazon Elastic
MapReduce, which they describe as "a web service that enables businesses, researchers, data
analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes
a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic
Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3)."[15]
Hadoop 9it" Sun :rid 6ngine
Hadoop ca$ a.so )e "sed i$ co#p"te /ar#s a$d hi1h:per/or#a$ce co#p"ti$1 e$4iro$#e$ts.
-$te1ratio$ 0ith"$ (rid +$1i$e 0as re.eased9 a$d r"$$i$1 Hadoop o$ "$ (rid ="$Os o$:
de#a$d "ti.it& co#p"ti$1 ser4ice> is possi).e.
P16Q
-$ the i$itia. i#p.e#e$tatio$ o/ the
i$te1ratio$9 the CP!:ti#e sched".er has $o k$o0.ed1e o/ the .oca.it& o/ the data. A ke& /eat"re
o/ the Hadoop *"$ti#e9 Rdo the 0ork i$ the sa#e ser4er or rack as the dataR is there/ore .ost.
D"ri$1 the "$ HPC o/t0are Workshop O0I9 a$ i#pro4ed i$te1ratio$ 0ith data:.oca.it&
a0are$ess 0as a$$o"$ced.
P17Q
"$ a.so has the Hadoop Live CD ;pe$o.aris project9 0hich a..o0s r"$$i$1 a /"..& /"$ctio$a.
Hadoop c."ster "si$1 a .i4e CD.
P1JQ
See also
Free software portal
Nutch - an effort to build an open source search engine based on Lucene and Hadoop.
Also created by Doug Cutting.
HBase - BigTable-model database. Sub-project of Hadoop.
Hypertable - HBase alternative
MapReduce - Hadoop's fundamental data filtering algorithm
Apache Pig - Reporting query language for Hadoop
Apache Mahout - Machine learning algorithms implemented on Hadoop
Cloud computing
References
1. ^ "Hadoop is a framework for running applications on large clusters of commodity
hardware. The Hadoop framework transparently provides applications both reliability and data
motion. Hadoop implements a computational paradigm named map/reduce, where the application is
divided into many small fragments of work, each of which may be executed or re-executed on any
node in the cluster. In addition, it provides a distributed file system that stores data on the compute
nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the
distributed file system are designed so that node failures are automatically handled by the
framework." Hadoop Overview
2. ^ a b Applications and organizations using Hadoop
3. ^ Hadoop Credits Page
4. ^ Yahoo! Launches World's Largest Hadoop Production Application
5. ^ Google Press Center: Google and IBM Announce University Initiative to Address Internet-
Scale Computing Challenges
6. ^ Hadoop creator goes to Cloudera
7. ^ "Hadoop contains the distributed computing platform that was formerly a part of Nutch.
This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of
map/reduce." About Hadoop
8. ^ http://hadoop.apache.org/core/docs/r0.17.2/hdfs_user_guide.html#Rack+Awareness
9. ^ The Hadoop Distributed File System: Architecture and Design
10. ^ Improve Namenode startup performance. "Default scenario for 20 million files with the
max Java heap size set to 14GB: 40 minutes. Tuning various Java options such as young size,
parallel garbage collection, initial Java heap size: 14 minutes"
11. ^ Yahoo! Launches World's Largest Hadoop Production Application (Hadoop and Distributed
Computing at Yahoo!)
12. ^ Hadoop and Distributed Computing at Yahoo!
13. ^ Running Hadoop on Amazon EC2/S3: http://aws.typepad.com/aws/2008/02/taking-massive.html
14. ^ Self-service, Prorated Super Computing Fun! - Open - Code - New York Times Blog
15. ^ Amazon Elastic MapReduce Beta
16. ^ "Creating Hadoop pe under SGE". Sun Microsystems. 2008-01-16.
17. ^ "HDFS-Aware Scheduling With Grid Engine". Sun Microsystems. 2009-09-10.
18. ^ "OpenSolaris Project: Hadoop Live CD". Sun Microsystems. 2008-08-29.
Bibliograp"y
Ch"ck9 'a# =Da$"ar& 2J9 2010>9 Hadoop in Action =1st ed.>9 2a$$i$19
pp. 3259 -86 1I351J21I6
Ge$$er9 Daso$ =D"$e 229 200I>9 Pro Hadoop =1st ed.>9 Apress9 pp. 4409 -86 143021I424
White9 ,o# =D"$e 169 200I>9 Hadoop: The Definitive !ide =1st ed.>9 ;O*ei..& 2edia9
pp. 5249 -86 05I6521I7I
External links
Official web site
R"lat"d lin7s
!p to date as o/ 6o4e#)er 169 200I
2e$tio$ o/ 6"tch a$d Hadoop : Ho0 (oo1.e Works : -$/rastr"ct"re

-82 2ap*ed"ce ,oo.s : a.phaWorks 7 -82 2ap*ed"ce ,oo.s /or +c.ipse 7 ;4er4ie0

Heritri? Hadoop DF Writer Processor : ,he 'a) : N4e$ts
Hadoop 0e)site : We.co#e to Apache Hadoop CoreM
A 6L, ).o1 : e./:ser4ice9 Prorated "per Co#p"ti$1 F"$M : ;pe$ 8.o1 : 6L,i#es.co#
Hadoop 0iki
LahooOs )et o$ Hadoop : LahooMOs )et o$ Hadoop : ;O*ei..& *adar
(oo1.e Press Ce$ter7 (oo1.e a$d -82 A$$o"$ce !$i4ersit& -$itiati4e to Address -$ter$et:
ca.e Co#p"ti$1 Cha..e$1es : (oo1.e Press Ce$ter7 Press *e.ease
LahooOs Do"1 C"tti$1 o$ 2ap*ed"ce a$d the F"t"re o/ Hadoop : -$/o%7 LahooOs Do"1
C"tti$1 o$ 2ap*ed"ce a$d the F"t"re o/ Hadoop
Hadoop ;4er4ie0 : ProjectDescriptio$ : Hadoop Wiki
Hadoop 0iki
Hadoop 0e)site : We.co#e to Apache HadoopM
LahooM 'a"$ches Wor.dOs 'ar1est Hadoop Prod"ctio$ App.icatio$ : LahooM 'a"$ches
Wor.dOs 'ar1est Hadoop Prod"ctio$ App.icatio$ =Hadoop a$d Distri)"ted Co#p"ti$1 at
LahooM>
*"$$i$1 Hadoop o$ A#a<o$ +C2B3 : A#a<o$ We) er4ices 8.o17 ,aki$1 2assi4e
Distri)"ted Co#p"ti$1 to the Co##o$ 2a$ : Hadoop o$ A#a<o$ +C2B3
R"lat"d topics
!p to date as o/ A"1"st 1I9 2010
6"tch
Apache (ero$i#o
Cassa$dra =data)ase>
Apache Der)&
Apache A$t
Apache o.r
Apache NooFeeper
Apache ,o#cat
Co"chD8
'"ce$e
Ad4ertise#e$ts
The text of the above Wikipedia article is available under the Creative Commons Attribution-
ShareAlike License. This content and its associated elements are made available under the
same license, where attribution must include acknowledgement of The Full Wiki as the source
on the same page, with a link back to this page with no nofollow tag.