LABELS: HADOOP-TUTORIAL, HDFS
3 OCTOBER 2013
Hadoop Tutorial: Part 1 - What is Hadoop? (An Overview)
Hadoop is an open source software framework that supports data-intensive distributed applications and is licensed under the Apache v2 license.
At least, this is what you are going to find as the first line of the definition of Hadoop on Wikipedia. So what are data-intensive distributed applications?
Well, data-intensive just means BigData (data that has outgrown in size), and distributed applications are applications that work on a network by communicating and coordinating with each other by passing messages (say, using RPC interprocess communication or through a Message Queue).
Hence Hadoop works in a distributed environment and is built to store, handle and process large amounts of data (petabytes, exabytes and more). Now, just because I am saying that Hadoop stores petabytes of data, this doesn't mean that Hadoop is a database. Remember, it's a framework that handles large amounts of data for processing. You will get to know the difference between Hadoop and databases (or NoSQL databases, which is what we call BigData's databases) as you go down the line in the coming tutorials.
Hadoop was derived from the research papers published by Google on the Google File System (GFS) and Google's MapReduce. So there are two integral parts of Hadoop: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.
Hadoop Distributed File System (HDFS)
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Well, let's get into the details of the statement mentioned above:
Very large files: When we say very large files, we mean that the size of a file can be in the range of gigabytes, terabytes, petabytes or maybe more.
Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors), for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
Now, here we are talking about a filesystem, the Hadoop Distributed FileSystem. And we all know a few of the other filesystems, like the Linux filesystem and the Windows filesystem. So the next question that comes up is...
What is the difference between a normal FileSystem and the Hadoop Distributed File System?
The two major differences that are notable between HDFS and other filesystems are:
Block Size: Every disk has a block size, and this is the minimum amount of data that is written to and read from a disk. A filesystem also consists of blocks, which are made out of these blocks on the disk. Normally disk blocks are 512 bytes, and filesystem blocks are a few kilobytes. In HDFS we also have the concept of blocks, but here one block is 64 MB by default, and this can be increased in integral multiples of 64, i.e. 128 MB, 256 MB, 512 MB or even more, in GBs. It all depends on the requirements and use-cases.
So why are these block sizes so large for HDFS? Keep on reading and you will get it in the next few tutorials :)
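As a rough back-of-the-envelope sketch of the difference in scale (not Hadoop code, just the arithmetic): a file stored in small local-filesystem-style blocks needs far more block entries to be tracked than the same file stored in 64 MB HDFS blocks.

```python
GB = 1024 * 1024 * 1024

def block_count(file_size, block_size):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)

small_fs_blocks = block_count(1 * GB, 4 * 1024)       # 4 KB, local-FS style
hdfs_blocks = block_count(1 * GB, 64 * 1024 * 1024)   # 64 MB, HDFS default

print(small_fs_blocks)  # 262144 block entries to track for one 1 GB file
print(hdfs_blocks)      # 16 block entries to track for the same file
```

Fewer, larger blocks mean far less per-block bookkeeping, which is one part of the answer teased above.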
Metadata Storage: In a normal filesystem there is hierarchical storage of metadata. Let's say there is a folder ABC; inside that folder there is another folder DEF, and inside that there is a hello.txt file. Now the information about hello.txt (i.e. the metadata of hello.txt) will be with DEF, and in turn the metadata of DEF will be with ABC. Hence this forms a hierarchy, and this hierarchy is maintained up to the root of the filesystem.
But in HDFS we don't have a hierarchy of metadata. All the metadata information resides on a single machine known as the Namenode (or Master Node) of the cluster. And this node contains all the information about the files and folders, and lots of other information too, which we will learn in the next few tutorials. :)
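The contrast above can be sketched in a few lines. This is purely illustrative (the paths, block names and dict layout are invented, not Hadoop's actual data structures): with all metadata held centrally, finding a file is one query against the namespace rather than a folder-by-folder walk.

```python
# Toy "namenode" namespace: every path maps directly to its metadata.
namespace = {
    "/ABC":               {"type": "dir"},
    "/ABC/DEF":           {"type": "dir"},
    "/ABC/DEF/hello.txt": {"type": "file", "blocks": ["blk_1", "blk_2"]},
}

def lookup(path):
    """One query to the central namespace answers where a file's blocks are."""
    return namespace[path]

print(lookup("/ABC/DEF/hello.txt")["blocks"])  # ['blk_1', 'blk_2']
```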
Well, this was just an overview of Hadoop and the Hadoop Distributed File System. In the next part I will go into the depth of HDFS, and thereafter MapReduce, and will continue from here...
Let me know in the comment section if you have any doubts in understanding anything, and I will be really glad to answer the same :)
These might also help you:
1. Hadoop Tutorial: Part 4 - Write Operations in HDFS
2. Hadoop Tutorial: Part 3 - Replica Placement or Replication and Read Operations in HDFS
3. Hadoop Tutorial: Part 2 - Hadoop Distributed File System (HDFS)
4. Hadoop Tutorial: Part 1 - What is Hadoop? (An Overview)
5. Best of Books and Resources to Get Started with Hadoop
6. Hadoop Tutorial: Part 5 - All Hadoop Shell Commands you will Need
7. Hadoop Installation on Local Machine (Single node Cluster)
Find Comments below or Add one
Romain Rigaux said...
Nice summary!
October 03, 2013
pragya khare said...
I know I'm a beginner and this question might be a silly one... but can you please explain to me how PARALLELISM is achieved via map-reduce at the processor level? If I've a dual core processor, is it that only 2 jobs will run at a time in parallel?
October 05, 2013
Anonymous said...
Hi, I am from a Mainframe background and have little knowledge of core Java... Do you think Java is needed for learning Hadoop in addition to Hive/PIG? I even want to learn Java for map reduce, but couldn't find what all will be used in real time... and the Definitive Guide book seems tough for learning MapReduce with Java... any option where I can learn it step by step?
Sorry for the long comment... but it would be helpful if you can guide me.
October 05, 2013
Deepak Kumar said...
@Pragya Khare...
First thing, always remember the one popular saying: NO questions are foolish :) And by the way, it is a very good question.
Actually there are two things: one is what the best practice is, and the other is what happens by default.
Well, by default the number of mappers and reducers is set to 2 for any task tracker, hence one sees a maximum of 2 maps and 2 reduces at a given instance on a TaskTracker (this is configurable). And this doesn't only depend on the processor but on lots of other factors as well, like RAM, CPU, power, disk and others:
http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
And as for the other factor, i.e. best practices, it depends on your use case. You can go through the 3rd point of the link below to understand it more conceptually:
http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
Well, I will explain all these when I reach the advanced MapReduce tutorials. Till then keep reading!! :)
October 05, 2013
Deepak Kumar said...
@Anonymous
As Hadoop is written in Java, most of its APIs are written in core Java... Well, to know about the Hadoop architecture you don't need Java... but to go to its API level and start programming in MapReduce you need to know core Java.
And as for the requirement in Java you have asked about... you just need simple core Java concepts and programming for Hadoop and MapReduce. And Hive/PIG are SQL-like data flow languages that are really easy to learn... And since you are from a programming background it won't be very difficult to learn Java :) You can also go through the link below for further details :)
http://www.bigdataplanet.info/2013/09/What-are-the-Pre-requsites-for-getting-started-with-Big-Data-Technologies.html
October 05, 2013
ABOUT THE AUTHOR
DEEPAK KUMAR
Big Data / Hadoop Developer, Software Engineer, Thinker, Learner, Geek, Blogger, Coder
I love to play around with Data, Big Data!
© 2013 All Rights Reserved BigData Planet.
All articles on this website by Deepak Kumar are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
LABELS: HADOOP-TUTORIAL, HDFS
6 OCTOBER 2013
Hadoop Tutorial: Part 2 - Hadoop Distributed File System (HDFS)
In the last tutorial on What is Hadoop?, I gave you a brief idea about Hadoop. The two integral parts of Hadoop are Hadoop HDFS and Hadoop MapReduce.
Let's go further deep inside HDFS.
Hadoop Distributed File System (HDFS) Concepts:
First take a look at the following two terminologies that will be used while describing HDFS.
Cluster: A Hadoop cluster is made by having many machines in a network; each machine is termed a node, and these nodes talk to each other over the network.
Block Size: This is the minimum size of one block in a filesystem, in which data can be kept contiguously. The default size of a single block in HDFS is 64 MB.
In HDFS, data is kept by splitting it into small chunks or parts. Let's say you have a text file of 200 MB and you want to keep this file in a Hadoop cluster. What happens is that the file breaks, or splits, into a number of chunks, where each chunk is equal to the block size that is set for the HDFS cluster (which is 64 MB by default). Hence a 200 MB file gets split into 4 parts: 3 parts of 64 MB and 1 part of 8 MB, and each part will be kept on a different machine. On which machine which split will be kept is decided by the Namenode, which we will be discussing in detail below.
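The splitting arithmetic described above can be sketched as follows (the actual splitting is done by HDFS itself when a client writes the file; this just reproduces the numbers):

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the chunks a file is broken into."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])

print(split_into_blocks(200))  # [64, 64, 64, 8] -> 4 parts, as described above
```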
Now, in a Hadoop Distributed File System or HDFS cluster there are two kinds of nodes, a Master Node and many Worker Nodes. These are known as:
Namenode (master node) and Datanode (worker node).
Namenode:
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. So it contains the information about all the files and directories and their hierarchy in the cluster, in the form of a Namespace Image and edit logs. Along with the filesystem information, it also knows the Datanodes on which all the blocks of a file are kept.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a filesystem interface similar to a Portable Operating System Interface (POSIX), so the user code does not need to know about the namenode and datanode to function.
Datanode:
These are the workers that do the real work, and here by real work we mean that the storage of the actual data is done by the datanode. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.
Here is one important thing to note: in one cluster there will be only one Namenode, and there can be N number of datanodes.
The Namenode contains the metadata of all the files and directories, and it also knows the datanodes on which each split of a file is stored. So let's say the Namenode goes down; then what do you think will happen?
Yes, if the Namenode is down, we cannot access any of the files and directories in the cluster.
We will not even be able to connect to any of the datanodes to get any of the files. Think of it: we have kept our files by splitting them into different chunks, we have kept those chunks on different datanodes, and it is the Namenode that keeps track of all the files' metadata. So only the Namenode knows how to reconstruct a file back into one piece from all the splits, and this is the reason that if the Namenode is down in a Hadoop cluster, everything is down.
This is also the reason why the Namenode is known as a Single Point of Failure.
Now, since the Namenode is so important, we have to make it resilient to failure, and for that Hadoop provides us with two mechanisms.
The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to the local disk as well as a remote NFS mount.
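A minimal sketch of that idea (the file name, format and directory layout are invented for illustration, not Hadoop's own): write the same state synchronously to every configured directory, and make each copy appear atomically by writing to a temporary file and then renaming it into place.

```python
import os
import tempfile

def save_state(state: bytes, dirs):
    """Write the same metadata state to every directory, atomically each time."""
    for d in dirs:
        os.makedirs(d, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=d)
        with os.fdopen(fd, "wb") as f:
            f.write(state)
            f.flush()
            os.fsync(f.fileno())        # force the bytes to disk first
        os.replace(tmp, os.path.join(d, "fsimage"))  # atomic rename into place

base = tempfile.mkdtemp()
dirs = [os.path.join(base, "dir_local"), os.path.join(base, "dir_nfs")]
save_state(b"namespace-image-v1", dirs)
```

In a real cluster the second directory would be a remote NFS mount, so a copy of the metadata survives the loss of the namenode's local disk.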
The second way is running a Secondary Namenode. Well, as the name suggests, it does not act like a Namenode. So if it doesn't act like a namenode, how does it prevent the failure?
Well, the Secondary Namenode also contains a namespace image and edit logs, like the namenode. Now, after every certain interval of time (one hour by default) it copies the namespace image from the namenode, merges this namespace image with the edit log, and copies it back to the namenode so that the namenode will have a fresh copy of the namespace image. Now let's suppose that at some instant the namenode goes down and becomes corrupt: then we can restart some other machine with the namespace image and the edit log, which is what we have with the secondary namenode, and hence a total failure can be prevented.
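The merge step above can be sketched as a toy checkpoint (the dict-and-list data structures here are invented for illustration; the real fsimage and edit log are binary on-disk formats): the namespace image is a snapshot, the edit log is a list of operations since that snapshot, and replaying the log over the snapshot yields a fresh image.

```python
def checkpoint(image, edit_log):
    """Replay the edit log over the namespace image; return a fresh image
    and an empty log (the log can start again from the new snapshot)."""
    merged = dict(image)
    for op, path in edit_log:
        if op == "create":
            merged[path] = {"type": "file"}
        elif op == "delete":
            merged.pop(path, None)
    return merged, []

image = {"/a.txt": {"type": "file"}}
log = [("create", "/b.txt"), ("delete", "/a.txt")]
fresh, log = checkpoint(image, log)
print(sorted(fresh))  # ['/b.txt']
```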
The Secondary Namenode takes almost the same amount of memory and CPU for its working as the Namenode, so it is also kept on a separate machine, like the namenode. Hence we see here that in a single cluster we have one Namenode, one Secondary Namenode and many Datanodes, and HDFS consists of these three elements.
This was again an overview of the Hadoop Distributed File System, HDFS. In the next part of the tutorial we will learn about the working of the Namenode and Datanode in a more detailed manner: we will see how reads and writes happen in HDFS.
Let me know in the comment section if you have any doubts in understanding anything, and I will be really glad to answer your questions :)
Find Comments below or Add one
vishwash said...
very informative...
October 07, 2013
Tushar Karande said...
Thanks for such informative tutorials :)
please keep posting.. waiting for more... :)
October 08, 2013
Anonymous said...
Nice information... But I have one doubt: what is the advantage of keeping the file in chunks on different datanodes? What kind of benefit are we getting here?
October 08, 2013
Deepak Kumar said...
@Anonymous: Well, there are lots of reasons... I will explain that in great detail in the next few articles...
But for now let us understand this: since we have split the file into two, we can now use the power of two processors (parallel processing) on two different nodes to do our analysis (like search, calculation, prediction and lots more). Again, let's say my file size is some petabytes... you won't find one hard disk that big... and even if there were one, how do you think we would read and write on that hard disk? The latency would be really high, and it would take lots of time... Again, there are more reasons for the same... I will make you understand this in more technical ways in the coming tutorials... Till then keep reading :)
October 08, 2013
Hadoop Distributed File System (HDFS)
by CHAITANYA SINGH, in HADOOP
Before understanding what HDFS is, I would first like to explain what a distributed file system is.
What is a Distributed File System?
As you know, each physical system has its own storage limit, and when it comes to storing lots of data we may need more than one system, basically a network of systems, so that the data can be segregated among various machines which are connected to each other through a network. This type of management for storing bulk data is known as a distributed file system.
What is HDFS (Hadoop Distributed File System)?
Hadoop has its own distributed file system, which is known as HDFS (renamed from NDFS).
HDFS Design
1. Hadoop doesn't require expensive hardware to store data; rather, it is designed to support common and easily available hardware.
2. It is designed to store very large files (as you all know, in order to index the whole web it may be required to store files which are in terabytes and petabytes, or even more than that). Hadoop clusters are used to perform this task.
3. It is designed for streaming data access.
Hadoop file systems
1) Local: This file system is for locally connected disks.
2) HDFS: Hadoop distributed file system, explained above.
3) HFTP: Its purpose is to provide read-only access to the Hadoop distributed file system over HTTP.
4) HSFTP: Almost similar to HFTP; unlike HFTP, it provides read-only access over HTTPS.
5) HAR (Hadoop Archives): Used for archiving files.
6) WebHDFS: Grants write access over HTTP.
7) KFS: A cloud store system similar to GFS and HDFS.
8) Distributed RAID: Like HAR, it is also used for archival.
9) S3: A file system backed by Amazon S3.
HDFS Cluster Nodes
An HDFS cluster has two kinds of nodes:
1. namenode
2. datanode
1. Namenodes
The namenode basically stores the names and addresses of the datanodes. It stores the metadata in the form of a tree. Without the Namenode this whole system of storing and retrieving data would not work, as it is responsible for knowing which data is stored where.
2. Datanodes
Datanodes are used to store the data in the form of blocks. They store and retrieve data in the form of data blocks after communication with the Namenodes.
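The store-and-retrieve role of a datanode can be sketched as a toy class (the class and its method names are invented for illustration; this is not Hadoop's API): blocks are kept keyed by block id, and the node can report the list of blocks it holds, mirroring the periodic block reports sent to the namenode.

```python
class ToyDataNode:
    """Toy model of a datanode: a keyed store of data blocks."""

    def __init__(self):
        self._blocks = {}

    def store(self, block_id, data):
        self._blocks[block_id] = data

    def retrieve(self, block_id):
        return self._blocks[block_id]

    def block_report(self):
        """What a datanode periodically tells the namenode it is holding."""
        return sorted(self._blocks)

node = ToyDataNode()
node.store("blk_1", b"first 64 MB chunk...")
node.store("blk_2", b"second chunk...")
print(node.block_report())  # ['blk_1', 'blk_2']
```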
Important links:
1. HDFS Guide
2. HDFS Java API
3. HDFS Source code
The text of the above Wikipedia article is available under the Creative Commons Attribution-ShareAlike License. This content and its associated elements are made available under the same license, where attribution must include acknowledgement of The Full Wiki as the source on the same page, with a link back to this page with no nofollow tag.