
Hadoop

An Elephant can't jump. But can carry heavy load.

A 20 page introduction to hadoop and friends.

Prashant Sharma

Table of Contents
1. Introduction
   1.1 What is distributed computing?
   1.2 What is hadoop? (Name of a toy elephant actually)
   1.3 How does Hadoop eliminate complexities?
   1.4 What is map-reduce?
   1.5 What is HDFS?
   1.6 What is Namenode?
   1.7 What is a datanode?
   1.8 What is a JobTracker and TaskTracker?
2. How map-reduce works?
   2.1 Introduction
   2.2 Map-reduce is the answer
   2.3 An example program which puts inverted index in action using Hadoop 0.20.203 API
   2.4 How Hadoop runs Map-reduce?
       2.4.1 Submit Job
       2.4.2 Job Initialization
       2.4.3 Task Assignment
       2.4.4 Task Execution
3. Hadoop Streaming
   3.1 A simple example run
   3.2 How it works?
   3.3 Features
4. Hadoop Distributed File System
   4.1 Introduction
   4.2 What HDFS cannot do?
   4.3 Anatomy of HDFS
       4.3.1 Filesystem Metadata
       4.3.2 Anatomy of a write
       4.3.3 Anatomy of a read
   4.4 Accessibility
       4.4.1 DFS shell
       4.4.2 DFS Admin
       4.4.3 Browser Interface
       4.4.4 Mountable HDFS
5. Serialization
   5.1 Introduction
   5.2 Write your own composite writable
   5.3 An example explained on serialization and having custom Writables from hadoop repos
   5.4 Why Java Object Serialization is not so efficient compared to other serialization frameworks?
       5.4.1 Java Serialization does not meet the criteria of a serialization format
       5.4.2 Java Serialization is not compact
       5.4.3 Java Serialization is not fast
       5.4.4 Java Serialization is not extensible
       5.4.5 Java Serialization is not interoperable
       5.4.6 Serialization IDL
6. Distributed Cache
   6.1 Introduction
   6.2 An example usage
7. Securing the Elephant
   7.1 Kerberos tickets
   7.2 Example of using Kerberos
   7.3 Delegation tokens
   7.4 Further securing the elephant
8. Hadoop Job Scheduling
   8.1 Three schedulers: Default scheduler, Capacity Scheduler, Fair Scheduler
Appendix 1A: Avro Serialization
   Apache Avro; Avro is a data serialization system; Avro relies on JSON schemas; Avro serialization is fast; Avro serialization is interoperable
Appendix 1B
   1. QMC pi estimator
   Grep example
   WordCount

1. Introduction
1.1 What is distributed computing?

Multiple autonomous systems appear as one, interacting via a message-passing interface, with no single point of failure.

Challenges of distributed computing:

1. Resource sharing. Access any data and utilize CPU resources across the system.
2. Openness. Extensions, interoperability, portability.
3. Concurrency. Allows concurrent access and update of shared resources.
4. Scalability. Handles extra load, like an increase in users, etc.
5. Fault tolerance. By having provisions for redundancy and recovery.
6. Heterogeneity. Different operating systems, different hardware; a middleware system allows this.
7. Transparency. The system should appear as a whole instead of a collection of computers.
8. The biggest challenge is to hide the details and complexity of accomplishing the above from the user and to have a common, unified interface to interact with it. Which is where Hadoop comes in.

1.2 What is hadoop? (Name of a toy elephant actually)

Hadoop is a framework which provides open source libraries for distributed computing using a simple map-reduce interface and its own distributed filesystem called HDFS. It facilitates scalability and takes care of detecting and handling failures.

1.3 How does Hadoop eliminate complexities?

Hadoop has components which take care of all the complexities for us, and by using a simple map-reduce framework we are able to harness the power of distributed computing without having to worry about complexities like fault tolerance and data loss. It has a replication mechanism for data recovery, job scheduling, and blacklisting of faulty nodes through a configurable blacklisting policy. The major components are:
1. Map-reduce (JobTracker and TaskTracker)
2. Namenode and Secondary namenode (an HDFS NameNode stores the edit log and the filesystem image)
3. Datanode (runs on slaves)
4. JobTracker (runs on the master server)
5. TaskTracker (runs on slaves)

1.4 What is map-reduce?

The map-reduce framework was introduced by Google: "A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs" (definition from the Google paper on MapReduce). It broadly consists of two mandatory functions to implement: map and reduce. A map is a function which is executed on each key-value pair from an input split, does some processing, and again emits a key and value pair. After map, and before reduce can begin, there is a phase called shuffle which copies the output, sorts it on key, and aggregates the values. These key and aggregated-value pairs are captured by reduce, which outputs a reduced key-value pair. This process is also called aggregation, as you get the values aggregated for a particular key as input to the reduce method. Again, in reduce you may do whatever you want with the key-values, and what you emit is also key-value pairs, which are dumped directly to a file. Now, simply by expressing a problem in terms of map-reduce, we can execute a task in parallel, distribute it across a broad cluster, and be relieved of taking care of all the complexities of distributed computing. Indeed "life made easy": had you tried doing the same thing with MPI libraries, you would understand the complexity of scaling there to hundreds or even thousands of nodes. There is a lot more going on in map-reduce than just map and reduce, but the beauty of Hadoop is that it takes care of most of those things; a user need not dig into the details simply to run a job, though knowing those features helps in tuning parameters and thus improving efficiency and fault tolerance.
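To make the map, shuffle and reduce phases concrete, here is a minimal word-count sketch written against the same new (org.apache.hadoop.mapreduce) API used by the examples later in this document. It is illustrative code, not taken from the original text; the class and variable names are made up.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal word-count sketch: map emits <word, 1>, the shuffle groups the 1s
// for each word, and reduce sums them.
public class WordCountSketch {

  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);               // emit <word, 1>
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {            // values grouped by the shuffle
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // emit <word, total count>
    }
  }
}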

1.5 What is HDFS?

Hadoop has its own implementation of a distributed file system called the Hadoop Distributed Filesystem (HDFS), which is coherent and provides all the facilities of a file system. It implements ACLs and provides a subset of the usual UNIX commands for accessing or querying the filesystem, and if one mounts it as fuse-dfs then it is possible to access it like any other Linux filesystem with standard Unix commands.

1.6 What is Namenode?

A single point of failure for an HDFS installation. It contains information regarding every block's location as well as the entire directory structure and files. By saying it is a single point of failure I mean that if the namenode goes down, the whole filesystem is offline. Hadoop also has a secondary namenode which contains the edit log, which in case of a failure of the namenode can be used to replay all the actions of the filesystem and thus restore its state. The secondary namenode regularly contacts the namenode and takes checkpointed snapshot images. At any time of failure these checkpointed images can be used to restore the namenode. Efforts are currently underway to provide high availability for the Namenode.

1.7 What is a datanode?

A datanode stores the actual blocks of data, and stores and retrieves blocks when asked. Datanodes periodically report back to the Namenode with the list of blocks they are storing.

1.8 What is a JobTracker and TaskTracker?

There is one JobTracker (also a single point of failure) running on the master node and several TaskTrackers running on slave nodes. Each TaskTracker has multiple task instances running, and every TaskTracker reports to the JobTracker in the form of a heartbeat at regular intervals, which also carries a message with the progress of the current job it is executing, or indicates that it is idle if it has finished executing. The JobTracker schedules jobs and takes care of failed ones by re-executing them on some other nodes. In MRv2, efforts are being made to provide high availability for the JobTracker, which would definitely change the way it works today.

2. How map-reduce works?

MapReduce: Simplified Data Processing on Large Clusters, by Google.

2.1 Introduction.

These are the basic steps of a typical map-reduce program as described by the Google MapReduce paper. We will understand this by taking an inverted index as an example. An inverted index is the same as the one that appears at the back of a book, where each word is listed along with the locations where it occurs. Its main usage is to build indexes for search engines. Suppose you were to build a search engine with an inverted index as the index. The conventional way is to build the inverted index in a large map (data structure) and update the map by reading the documents and updating the index. The limitation of this approach: if the number of documents is large, then disk I/O will become a bottleneck. And what if the data is in PBs? Will it scale?

2.2 Map-reduce is the answer.

The map function parses each document and emits a sequence of <word, document ID> pairs, where word is the key. As the name says, the key is unique, and different document IDs with the same word are merged when passed to the reduce function as input; this is done in the sort and shuffle phase. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

2.3 An example program which puts inverted index in action using Hadoop 0.20.203 API.

package testhadoop;

import java.io.IOException;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Here we find the inverted index of a corpus. You can use the wikipedia
 * corpus. Read the hadoop quickstart guide for installation instructions.
 * Map    : emits word as key and filename as value.
 * Reduce : emits word and occurrences in filenames.
 *
 * @author Prashant
 */
public class InvertedIndex {

  /**
   * Mapper is the abstract class which needs to be extended to write a mapper.
   * We specify the input key and value formats and the output key and value
   * formats in <InKey, InVal, OutKey, OutVal>. So in the mapper we chose
   * <Object, Text, Text, Text>. Remember we can only use the Writable
   * implementing classes for key and value pairs (serialization issues are
   * discussed later). Emits (Word, Filename in which it occurs).
   *
   * @author Prashant
   */
  public static class WFMapper extends Mapper<Object, Text, Text, Text> {

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      Text word = new Text();
      // Tokenize each line on the basis of "\",. \t\n". You should add more
      // symbols if you are using an HTML or XML corpus.
      StringTokenizer itr = new StringTokenizer(value.toString(), "\",. \t\n");
      // Here we use the context object to retrieve the name of the file the
      // map is working on.
      String file = new String(((FileSplit) context.getInputSplit())
          .getPath().toString());
      Text FN = new Text(file);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        // Emits intermediate key and value pairs.
        context.write(word, FN);
      }
    }
  }

  /**
   * Almost the same concept for the reducer as for the mapper (read the
   * WFMapper documentation). Emits (Word, [<Filename, count>...]).
   * Here we store the file names in a hashtable and increment the count to
   * augment the index.
   */
  public static class WFReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Hashtable<String, Long> table = new Hashtable<String, Long>();
      for (Text val : values) {
        if (table.containsKey(val.toString())) {
          Long temp = table.get(val.toString());
          temp = temp.longValue() + 1;
          table.put(val.toString(), temp);
        } else {
          table.put(val.toString(), new Long(1));
        }
      }
      String result = "";
      Enumeration<String> e = table.keys();
      while (e.hasMoreElements()) {
        String tempkey = e.nextElement().toString();
        Long tempvalue = (Long) table.get(tempkey);
        result = result + "< " + tempkey + ", " + tempvalue.toString() + " > ";
      }
      context.write(key, new Text(result));
    }
  }

  public static void main(String[] args) throws Exception {
    /*
     * Load the configurations into the configuration object (from the XML
     * files that you set up while installing hadoop).
     */
    Configuration conf = new Configuration();
    // Pass the arguments to the Hadoop utility for options parsing.
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: invertedIndex <in> <out>");
      System.exit(2);
    }
    // Create a Job object from the configuration.
    Job job = new Job(conf, "Inverted index");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(WFMapper.class);
    /*
     * Why do we use a combiner when it is optional? Well, a combiner helps in
     * reducing the output at the mapper end itself, so the bandwidth load over
     * the network is reduced and the efficiency of the reducer increases.
     */
    job.setCombinerClass(WFReducer.class); // We used the same class for the
                                           // combiner as the reducer, although
                                           // it is possible to write a separate one.
    job.setReducerClass(WFReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

2.4 How Hadoop runs Map-reduce?

This section is for you if you are curious about what goes on behind the scenes when you deploy a map-reduce program and see a dump of messages printed on stderr (which is information about your job's status and not errors, if everything went fine).

The diagram above is self-explanatory and tells us how map-reduce works.

2.4.1 Submit Job
• Asks the JobTracker for a new job ID.
• Checks the output specification of the job: checks the output directory; if it exists, an error is thrown and the job is not submitted.
• Passes the JobConf to JobClient.runJob() or submitJob().
• runJob() blocks, submitJob() does not (synchronous and asynchronous ways of submitting a job).
• JobClient determines the proper division of the input into InputSplits.
• Submits the job to the JobTracker.
• Computes the input splits for the job. If the splits cannot be computed (the inputs don't exist), an error is thrown and the job is not submitted.
• Copies the resources needed to run the job to HDFS, in a directory named as specified by "mapred.system.dir".
• The job jar file is copied with a high replication factor (default 10), which can be set by the "mapred.submit.replication" property.
• Tells the JobTracker that the job is ready.
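The runJob()/submitJob() distinction above belongs to the old JobClient API. As a rough sketch (not from the original text), the equivalent blocking and non-blocking calls on the new Job object look like this; configureJob() is a hypothetical helper standing in for the setup shown in section 2.3.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitModes {
  // Hypothetical helper: stands in for the mapper/reducer/path setup of 2.3.
  static Job configureJob(Configuration conf) throws Exception {
    Job job = new Job(conf, "example");
    // ... setJarByClass, setMapperClass, setReducerClass, input/output paths ...
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Synchronous: blocks until the job finishes, printing progress to stderr.
    Job blocking = configureJob(conf);
    blocking.waitForCompletion(true);

    // Asynchronous: submit() returns immediately; poll for completion yourself.
    Job background = configureJob(conf);
    background.submit();
    while (!background.isComplete()) {
      Thread.sleep(5000);
    }
  }
}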

2.4.2 Job Initialization
• Puts the job in an internal queue.
• The job scheduler will pick it up and initialize it:
  - creates a Job object for the job being run;
  - encapsulates its tasks;
  - keeps book-keeping info to track task status and progress.
• Creates the list of tasks to run.
• Retrieves the number of input splits computed by the JobClient from the shared filesystem.
• Creates one map task for each split.
• The scheduler creates the reduce tasks and assigns them to TaskTrackers.
  - The number of reduce tasks is determined by mapred.reduce.tasks.
• Task IDs are given to each task.

2.4.3 Task Assignment
• TaskTrackers send heartbeats to the JobTracker via RPC.
• A TaskTracker indicates its readiness for a new task.
• The JobTracker will allocate a task.
• The JobTracker communicates the task in the response to a heartbeat.
• Choosing a TaskTracker:
  - The JobTracker must choose a task for a TaskTracker.
  - It uses the scheduler to choose a task, according to the configured job scheduling algorithm; the default one is based on priority and FIFO.

2.4.4 Task Execution
• The TaskTracker has been assigned the task; the next step is to run it.
• It localizes the job by copying the jar file from "mapred.system.dir" to job-specific directories, and copies any other files required.
• Creates a local working directory for the task and un-jars the contents of the jar into it.
• Creates an instance of TaskRunner to run the task.
• The TaskRunner launches a new JVM to run each task:
  - to avoid the TaskTracker failing due to any bugs in MapReduce tasks;
  - only the child JVM exits in case of a problem.
• TaskTracker.Child.main():
  - sets up the child TaskInProgress attempt;
  - reads the XML configuration;
  - connects back to the necessary MapReduce components via RPC;
  - uses TaskRunner to launch the user process.

3. Hadoop Streaming.

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.

3.1 A simple example run.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

3.2 How it works?

Mapper side: when an executable is specified for mappers, each mapper task launches the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. Similarly for the reducer: each reducer task gets its input key-value pairs converted into a form readable on stdin, and the output of the executable is converted back into key/value pairs.

3.3 Features.

You can specify an internal class as a mapper instead of an executable, like this:
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper
Input/output format classes can be specified like this:
    -inputformat JavaClassName
    -outputformat JavaClassName
    -partitioner JavaClassName
    -combiner JavaClassName
You can specify JobConf parameters:
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
        -reducer /bin/wc \
        -jobconf mapred.reduce.tasks=2

4. Hadoop Distributed File System

4.1 Introduction.

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
• It has a large block size (default 64 MB) for storage, to compensate for seek time relative to network bandwidth. So very large files are ideal for storage.
• Streaming data access. A write-once, read-many-times architecture. Since files are large, the time to read them is a more significant parameter than the seek to the first record.
• Commodity hardware. It is designed to run on commodity hardware, which may fail; HDFS is capable of handling that.

4.2 What HDFS cannot do?

• Low-latency data access. It is not optimized for low-latency data access; it trades latency to increase the throughput of the data.
• Lots of small files. Since the block size is 64 MB, lots of small files (which waste blocks) will increase the memory requirements of the namenode.
• Multiple writers and arbitrary modification. There is no support for multiple writers in HDFS, and files are written by a single writer, with writes always made at the end of the file.

4.3 Anatomy of HDFS

4.3.1 Filesystem Metadata.
• The HDFS namespace is stored by the Namenode.
• The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata, for example creating a new file or changing the replication factor of a file.
• The EditLog is stored in the Namenode's local filesystem.
• The entire filesystem namespace, including the mapping of blocks to files and the filesystem properties, is stored in a file called FsImage, also kept in the Namenode's local filesystem.

4.3.2 Anatomy of a write.
• DFSOutputStream splits the data into packets.
• It writes them into an internal queue.
• The DataStreamer asks the namenode for a list of datanodes and consumes the internal data queue.
• The namenode gives a list of datanodes for the pipeline.
• An internal queue of packets waiting to be acknowledged is maintained.
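For reference, this is roughly what the client side of that write path looks like through the public FileSystem API (a minimal sketch, not from the original text; the path name is made up).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // create() returns an FSDataOutputStream; behind it, DFSOutputStream
    // packetizes the data and the DataStreamer pushes the packets down the
    // datanode pipeline obtained from the namenode.
    FSDataOutputStream out = fs.create(new Path("/foodir/newfile.txt"));
    out.write("hello hdfs\n".getBytes("UTF-8"));
    out.close();  // flushes the remaining packets and waits for acknowledgements
  }
}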

4.3.3 Anatomy of a read.
• The namenode returns the locations of the blocks for the first few blocks of the file.
• The datanode list is sorted according to proximity to the client.
• FSDataInputStream wraps DFSInputStream, which manages datanode and namenode I/O.
• Read is called repeatedly on the datanode until the end of the block is reached.
• It then finds the next DataNode for the next data block.
• All of this happens transparently to the client.
• The client calls close after finishing reading the data.
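And the matching client-side read path, again as a minimal sketch with a made-up path; the pipeline details described above all happen behind fs.open().

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);              // talks to the namenode
    // open() returns an FSDataInputStream wrapping DFSInputStream, which
    // fetches block locations and reads from the closest datanodes.
    FSDataInputStream in = fs.open(new Path("/foodir/myfile.txt"));
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) > 0) {
      System.out.write(buffer, 0, bytesRead);
    }
    in.close();                                        // close after finishing
  }
}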

4.4 Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance, or it can be mounted as a unix filesystem.

4.4.1 DFS shell

HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called DFSShell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Action:  Create a directory named /foodir
Command: bin/hadoop dfs -mkdir /foodir

Action:  View the contents of a file named /foodir/myfile.txt
Command: bin/hadoop dfs -cat /foodir/myfile.txt

DFSShell is targeted at applications that need a scripting language to interact with the stored data.

4.4.2 DFS Admin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:

Action:  Put the cluster in SafeMode
Command: bin/hadoop dfsadmin -safemode enter

Action:  Generate a list of Datanodes
Command: bin/hadoop dfsadmin -report

Action:  Decommission Datanode datanodename
Command: bin/hadoop dfsadmin -decommission datanodename

4.4.3 Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

4.4.4 Mountable HDFS

Please visit the MountableHDFS wiki page for more details.

5. Serialization.

5.1 Introduction.

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Expectations from a serialization interface:
• Compact. To utilize bandwidth efficiently.
• Fast. Reduced processing overhead of serializing and deserializing.
• Extensible. Easily enhanceable protocols.
• Interoperable. Support for different languages.

Hadoop has the Writable interface, which has all of those features except interoperability, which is addressed by Avro. The following predefined implementations of WritableComparable are available:
1. IntWritable
2. LongWritable
3. DoubleWritable
4. VLongWritable. Variable size, stores as much as needed: 1-9 bytes of storage.
5. VIntWritable. Less used, as it is pretty much covered by VLong.
6. BooleanWritable
7. FloatWritable
8. BytesWritable
9. NullWritable. This does not store anything and may be used when we do not want to give anything as key or value. It also has one important usage: for example, if we want to write a sequence file and do not want it stored as key and value pairs, we can give the key as a NullWritable object, and since it stores nothing, all values will be merged by the reduce method into one single instance.
10. MD5Hash
11. ObjectWritable
12. GenericWritable
Apart from the above there are four Writable collection types:
1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable

5.2 Write your own composite writable.

Besides the predefined writables, we can implement WritableComparable to serialize a class and use it as a key and value, as any class used as a key and/or value in map-reduce must be serializable.
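A bare-bones skeleton of such a composite key might look like the sketch below (illustrative only; the class name is made up). The full, raw-comparator-optimized example taken from the Hadoop repository follows in section 5.3.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Minimal composite key: a pair of ints usable as a map-reduce key.
public class SimplePair implements WritableComparable<SimplePair> {
  private int first;
  private int second;

  public void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {    // serialize
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialize
    first = in.readInt();
    second = in.readInt();
  }

  @Override
  public int compareTo(SimplePair o) {     // defines the sort order in the shuffle
    if (first != o.first) {
      return first < o.first ? -1 : 1;
    }
    if (second != o.second) {
      return second < o.second ? -1 : 1;
    }
    return 0;
  }

  @Override
  public int hashCode() { return first * 157 + second; }    // used by HashPartitioner

  @Override
  public boolean equals(Object o) {
    return o instanceof SimplePair
        && ((SimplePair) o).first == first && ((SimplePair) o).second == second;
  }
}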

5.3 An example explained on serialization and having custom Writables, from the hadoop repos. (See comments)
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package testhadoop;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * This is an example Hadoop Map/Reduce application.
 * It reads the text input files that must contain two integers per line.
 * The output is sorted by the first and second number and grouped on the
 * first number.
 *
 * To run: bin/hadoop jar build/hadoop-examples.jar secondarysort <in-dir> <out-dir>
 */
/*
 * In this example the use of a composite key is demonstrated; since there is
 * no default implementation of a composite key, we had to override methods
 * from WritableComparable.
 *
 * MapClass: MapClass simply reads the line from the input and emits the pair
 * IntPair(left, right) as key and right as value, and a reducer class just
 * emits the sum of the input values.
 */
public class SecondarySort2 {

  /**
   * Define a pair of integers that are writable.
   * They are serialized in a byte comparable format.
   */
  public static class IntPair implements WritableComparable<IntPair> {
    private int first = 0;
    private int second = 0;

    /**
     * Set the left and right values.
     */
    public void set(int left, int right) {
      first = left;
      second = right;
    }

    public int getFirst() {
      return first;
    }

    public int getSecond() {
      return second;
    }

    /**
     * Read the two integers.
     * Encoded as: MIN_VALUE -> 0, 0 -> -MIN_VALUE, MAX_VALUE -> -1
     */
    @Override
    public void readFields(DataInput in) throws IOException {
      first = in.readInt() + Integer.MIN_VALUE;
      second = in.readInt() + Integer.MIN_VALUE;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(first - Integer.MIN_VALUE);
      out.writeInt(second - Integer.MIN_VALUE);
    }

    @Override
    public int hashCode() {
      return first * 157 + second;
    }

    @Override
    public boolean equals(Object right) {
      if (right instanceof IntPair) {
        IntPair r = (IntPair) right;
        return r.first == first && r.second == second;
      } else {
        return false;
      }
    }

    /** A Comparator that compares serialized IntPair. */
    public static class Comparator extends WritableComparator {
      public Comparator() {
        super(IntPair.class);
      }

      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {
        return compareBytes(b1, s1, l1, b2, s2, l2);
      }
    }

    static {                                     // register this comparator
      WritableComparator.define(IntPair.class, new Comparator());
    }

    /** Compare on the basis of first and then second. */
    @Override
    public int compareTo(IntPair o) {
      if (first != o.first) {
        return first < o.first ? -1 : 1;
      } else if (second != o.second) {
        return second < o.second ? -1 : 1;
      } else {
        return 0;
      }
    }
  }

  /**
   * Partition based on the first part of the pair. We will need to override
   * the partitioner as we cannot go for the default HashPartitioner, since we
   * have our own implementation of key and hash function.
   */
  /* Partition function: (first * 127) MOD (no. of partitions). */
  public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
    @Override
    public int getPartition(IntPair key, IntWritable value, int numPartitions) {
      return Math.abs(key.getFirst() * 127) % numPartitions;
    }
  }

  /**
   * Read two integers from each line and generate a key, value pair
   * as ((left, right), right).
   */
  /* MapClass simply reads the line from the input and emits the pair
   * IntPair(left, right) as key and right as value. */
  public static class MapClass
      extends Mapper<LongWritable, Text, IntPair, IntWritable> {

    private final IntPair key = new IntPair();
    private final IntWritable value = new IntWritable();

    @Override
    public void map(LongWritable inKey, Text inValue, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(inValue.toString());
      int left = 0;
      int right = 0;
      if (itr.hasMoreTokens()) {
        left = Integer.parseInt(itr.nextToken());
        if (itr.hasMoreTokens()) {
          right = Integer.parseInt(itr.nextToken());
        }
        key.set(left, right);
        value.set(right);
        context.write(key, value);
      }
    }
  }

  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce
      extends Reducer<IntPair, IntWritable, Text, IntWritable> {
    private final Text first = new Text();

    @Override
    public void reduce(IntPair key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      first.set(Integer.toString(key.getFirst()));
      for (IntWritable value : values) {
        context.write(first, value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: secondarysort2 <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "secondary sort");
    job.setJarByClass(SecondarySort2.class);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);

    // group and partition by the first int in the pair
    job.setPartitionerClass(FirstPartitioner.class);

    // the map output is IntPair, IntWritable
    job.setMapOutputKeyClass(IntPair.class);
    job.setMapOutputValueClass(IntWritable.class);

    // the reduce output is Text, IntWritable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

5.4 Why is Java Object Serialization not so efficient compared to other serialization frameworks?

5.4.1 Java Serialization does not meet the criteria of a serialization format:
1) compact  2) fast  3) extensible  4) interoperable

5.4.2 Java Serialization is not compact.

Java writes the classname of each object being written to the stream; this is true of classes that implement java.io.Serializable or java.io.Externalizable. Subsequent instances of the same class write a reference handle to the first occurrence, which occupies only 5 bytes. However, reference handles don't work well with random access, since the referent class may occur at any point in the preceding stream; that is, there is state stored in the stream. Even worse, reference handles play havoc with sorting records in a serialized stream, since the first record of a particular class is distinguished and must be treated as a special case. All of these problems are avoided by not writing the classname to the stream at all, which is the approach that Writable takes. This makes the assumption that the client knows the expected type. The result is that the format is considerably more compact than Java Serialization, and random access and sorting work as expected, since each record is independent of the others (so there is no stream state).
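A quick way to see the difference is to serialize the same integer both ways and compare the sizes; the sketch below is illustrative code, not from the original text.

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;

public class SerializedSizeDemo {
  public static void main(String[] args) throws Exception {
    // Java Serialization: the class name and stream header are written too.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(bytes);
    oos.writeObject(Integer.valueOf(163));
    oos.close();
    System.out.println("java.io serialized Integer: " + bytes.size() + " bytes");

    // Writable: no class name, no stream state -- just the 4 payload bytes.
    DataOutputBuffer out = new DataOutputBuffer();
    new IntWritable(163).write(out);
    System.out.println("IntWritable: " + out.getLength() + " bytes");
  }
}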

5.4.3 Java Serialization is not fast.

Java Serialization is a general-purpose mechanism for serializing graphs of objects, so it necessarily has some overhead for serialization and deserialization operations, whereas a Map-Reduce job at its core serializes and deserializes billions of records of different types, and benefits in terms of memory and bandwidth from not allocating new objects for each record.

5.4.4 Java Serialization is not extensible.

In terms of extensibility, Java Serialization has some support for evolving a type, but it is hard to use effectively (Writables have no support: the programmer has to manage evolution himself).

5.4.5 Java Serialization is not interoperable.

In principle, other languages could interpret the Java Serialization stream protocol (defined by the Java Object Serialization Specification), but in practice there are no widely used implementations in other languages, so it is a Java-only solution. The situation is the same for Writables.

5.4.6 Serialization IDL

There are many serialization frameworks that approach serialization in a different way: rather than defining types through code, they allow types to be defined in a language-neutral, declarative fashion using an Interface Description Language (IDL). The system then generates types for different languages, which encourages interoperability. Avro is one of the serialization frameworks which uses the IDL mechanism. Please see the appendix for details.

6. Distributed Cache.

6.1 Introduction.

A facility provided by the map-reduce framework to distribute the files explicitly specified with the -files option across the cluster, where they are kept in cache for processing. Generally, all extra files needed by map-reduce tasks should be distributed this way to save network bandwidth.

6.2 An example usage:

$ hadoop jar job.jar XYZUsingDistributedCache -files input/somethingtobecached input/data output
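On the task side, the shipped files show up in the node-local cache. A sketch of a mapper that opens such a file in setup() is shown below (illustrative, not from the original text; it uses the old-API DistributedCache helper, and the map body is elided).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper that opens a file shipped with "-files" from the node-local cache.
public class CacheAwareMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached != null && cached.length > 0) {
      BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
      // ... load lookup data, stop-word lists, etc. ...
      reader.close();
    }
  }
}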

7. Securing the elephant.

In order to safeguard a hadoop cluster against unauthorized access, which may lead to data loss or the leak of important data, we need some mechanism to ensure that access is granted only to an authorized entity, with rights defining what extent of security we want to enforce for that user. Hadoop uses Kerberos for ensuring cluster security.

7.1 Kerberos tickets.

Kerberos is a computer network authentication protocol which assigns tickets to nodes communicating over an insecure network to establish each other's identity in a secure manner. There are three steps to access a service:
a) Authentication. Receive a TGT (Ticket Granting Ticket) from the authentication server.
b) Authorization. Contact the Ticket Granting Server with the TGT to obtain a service ticket.
c) Service Request. The service ticket can now be used to access the resource.

7.2 Example of using Kerberos.

$ kinit
(provide the password for the hadoop user: ******)
$ hadoop fs -put anything ...

A Kerberos ticket is valid for 10 hours once received and can be renewed.
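For a long-running process that cannot run an interactive kinit, the same login can be done programmatically from a keytab. The sketch below is illustrative; the principal name and keytab path are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLogin {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    UserGroupInformation.setConfiguration(conf);
    // Obtain Kerberos credentials from a keytab instead of an interactive kinit.
    UserGroupInformation.loginUserFromKeytab(
        "someuser@EXAMPLE.COM", "/etc/security/keytabs/someuser.keytab");
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("anything"), new Path("/user/someuser/anything"));
  }
}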

7.3 Delegation tokens.

Delegation tokens are used by Hadoop in the background so that the user does not have to authenticate on every command by contacting the KDC.

7.4 Further securing the elephant.

• Besides this, in order to isolate tasks per user rather than running them all as the operating system user, set mapred.task.tracker.task-controller to org.apache.hadoop.mapred.LinuxTaskController.
• To enforce ACLs, i.e. so that each user is able to access only his own jobs, set mapred.acls.enabled to true, and set mapred.job.acl-view-job and mapred.job.acl-modify-job for each job.
Hadoop does not use encryption for RPC or for transferring HDFS blocks to and from datanodes.

8. Hadoop Job Scheduling.

Now that we have jobs running, we need somebody to take care of them. Hadoop has pluggable schedulers, as well as a default scheduler. These are necessary to provide users access to cluster resources and also to control access and limit usage.

8.1 Three schedulers:

Default scheduler:
• A single priority-based queue of jobs.
• Scheduling tries to balance the map and reduce load on all tasktrackers in the cluster.

Capacity Scheduler:
• Yahoo!'s scheduler.
The Capacity Scheduler supports the following features:
• Support for multiple queues, where a job is submitted to a queue.
• Queues are allocated a fraction of the capacity of the grid, in the sense that a certain capacity of resources will be at their disposal. All jobs submitted to a queue will have access to the capacity allocated to the queue.
• Free resources can be allocated to any queue beyond its capacity. When there is demand for these resources from queues running below capacity at a future point in time, as tasks scheduled on these resources complete, they will be assigned to jobs on queues running below capacity.
• Queues optionally support job priorities (disabled by default).
• Within a queue, jobs with higher priority will have access to the queue's resources before jobs with lower priority. However, once a job is running it will not be preempted for a higher-priority job, though new tasks from the higher-priority job will be preferentially scheduled.
• In order to prevent one or more users from monopolizing its resources, each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them.
• Support for memory-intensive jobs, wherein a job can optionally specify higher memory requirements than the default, and the tasks of the job will only be run on TaskTrackers that have enough memory to spare.

Fair Scheduler: (ensures fairness amongst users)
• Built by Facebook.
• Multiple queues (pools) of jobs, sorted in FIFO order or by fairness limits.
• Each pool is guaranteed a minimum capacity, and excess capacity is shared by all jobs using a fairness algorithm.
• The scheduler tries to ensure that, over time, all jobs receive the same number of resources.
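These schedulers are normally enabled on the JobTracker through mapred-site.xml rather than in job code; purely for illustration, the sketch below expresses the same settings through the Configuration API. The property names shown are the ones used by the 0.20-era contrib schedulers and may differ in other versions, so treat them as assumptions.

import org.apache.hadoop.conf.Configuration;

public class SchedulerConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Cluster side (JobTracker): plug in the Fair Scheduler instead of the default.
    conf.set("mapred.jobtracker.taskScheduler",
             "org.apache.hadoop.mapred.FairScheduler");
    conf.set("mapred.fairscheduler.allocation.file",
             "/etc/hadoop/conf/fair-scheduler.xml");
    // Job side: ask for a particular pool (assuming such a pool is defined).
    conf.set("mapred.fairscheduler.pool", "research");
  }
}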

Appendix 1A: Avro Serialization.

Apache Avro is one of the serialization frameworks used in Hadoop. Avro offers a lot of advantages compared to Java Serialization: 1) compact 2) fast 3) interoperable 4) extensible.

Avro is a data serialization system. It provides rich data structures that are compact and are transported in a binary data format. It provides a container file to store persistent data. It provides Remote Procedure Call (RPC). It provides simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols; code generation is an optional optimization, only worth implementing for statically typed languages.

Avro relies on JSON schemas. Avro data is always serialized with its schema. Avro relies on a schema-based system that defines a data contract to be exchanged. When Avro data is read, the schema used when writing it is always present. This strategy means that a minimal amount of data is generated, enabling faster transport. Avro is compatible with C, Java, Python and some more languages.

Avro serialization is compact. Since the schema is present when data is read, considerably less type information needs to be encoded with the data, resulting in a smaller serialization size.

Avro serialization is fast. The smaller serialization size makes it faster to transport to remote machines.

Avro serialization is interoperable. Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
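As a small taste of what this looks like with the Java generic API (a sketch, assuming Avro 1.5+ on the classpath; the schema and field names are made up):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
  public static void main(String[] args) throws IOException {
    // The schema is plain JSON; no code generation is needed for the generic API.
    String schemaJson =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "prashant");
    user.put("age", 30);

    // Binary encoding: only the field values are written; the schema travels separately.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
    encoder.flush();
    System.out.println("Serialized size: " + out.size() + " bytes");
  }
}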

Appendix 1B

Documented examples from the latest Apache Hadoop distribution, from the new hadoop 0.21 trunk version. They can be ported to the hadoop 0.20.203 API and used.

1. QMC pi estimator.

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.examples;

import java.io.IOException;
import java.math.BigDecimal;
import java.math.RoundingMode;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BooleanWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/** Firstly, this is the only Mapreduce program that prints output to the screen
 * instead of a file.
 * HaltonSequence:
 *   Populates the arrays with the Halton sequence, i.e. in the first iteration when i=0,
 *   q is 1, 1/2, 1/4 .... 63 nos.; variable k is mod-ed for randomization. For the second
 *   iteration when i=1, q is 1, 1/3, 1/9, ...; x is the sum of all elements of the series d * q.
 * nextPoint():
 *   This simply undoes what the constructor has done to get the next point.
 * QmcMapper: creates 'size' samples and checks whether they are inside or outside, by first
 *   subtracting 0.5 (the center coordinates, supposedly) from x and y, then putting them in
 *   the equation of the circle and checking whether it satisfies (x^2 + y^2 > r^2); it then
 *   emits 2 values, numInside and numOutside, separated by a true or false value as keys.
 * QmcReducer:
 *   The reducer does not emit anything; it simply iterates over the keys true and false, sums
 *   the number of points and updates the variables numInside and numOutside. It has a separate
 *   overridden close method wherein it writes the output to the file reduce-out, and it has
 *   only one reduce task (thus this is possible, else there would be concurrency issues).
 *
 * overridden cleanup(): separate close to register the output to a file.
 * estimate: calls the specified no. of mappers and 1 reducer and reads the output from the
 *   file written by the reducer.
 */

/**
 * A map/reduce program that estimates the value of Pi
 * using a quasi-Monte Carlo (qMC) method.
 * Arbitrary integrals can be approximated numerically by qMC methods.
 * In this example,
 * we use a qMC method to approximate the integral $I = \int_S f(x) dx$,
 * where $S=[0,1)^2$ is a unit square,
 * $x=(x_1,x_2)$ is a 2-dimensional point,
 * and $f$ is a function describing the inscribed circle of the square $S$,
 * $f(x)=1$ if $(2x_1-1)^2+(2x_2-1)^2 <= 1$ and $f(x)=0$, otherwise.
 * It is easy to see that Pi is equal to $4I$.
 * So an approximation of Pi is obtained once $I$ is evaluated numerically.
 *
 * There are better methods for computing Pi.
 * We emphasize numerical approximation of arbitrary integrals in this example.
 * For computing many digits of Pi, consider using bbp.
 *
 * The implementation is discussed below.
 *
 * Mapper:
 *   Generate points in a unit square
 *   and then count points inside/outside of the inscribed circle of the square.
 *
 * Reducer:
 *   Accumulate points inside/outside results from the mappers.
 *
 * Let numTotal = numInside + numOutside.
 * The fraction numInside/numTotal is a rational approximation of
 * the value (Area of the circle)/(Area of the square) = $I$,
 * where the area of the inscribed circle is Pi/4
 * and the area of the unit square is 1.
 * Finally, the estimated value of Pi is 4(numInside/numTotal).
 */
public class QuasiMonteCarlo extends Configured implements Tool {
  static final String DESCRIPTION
      = "A map/reduce program that estimates Pi using a quasi-Monte Carlo method.";
  /** tmp directory for input/output */
  static private final Path TMP_DIR = new Path(
      QuasiMonteCarlo.class.getSimpleName() + "_TMP_3_141592654");

  /** 2-dimensional Halton sequence {H(i)},
   * where H(i) is a 2-dimensional point and i >= 1 is the index.
   * Halton sequence is used to generate sample points for Pi estimation.
   */
  private static class HaltonSequence {
    /** Bases */
    static final int[] P = {2, 3};
    /** Maximum number of digits allowed */
    static final int[] K = {63, 40};

    private long index;
    private double[] x;
    private double[][] q;
    private int[][] d;

    /** Initialize to H(startindex),
     * so the sequence begins with H(startindex+1).
     */
    HaltonSequence(long startindex) {
      index = startindex;
      x = new double[K.length];
      q = new double[K.length][];
      d = new int[K.length][];
      for (int i = 0; i < K.length; i++) {
        q[i] = new double[K[i]];
        d[i] = new int[K[i]];
      }

      for (int i = 0; i < K.length; i++) {
        long k = index;
        x[i] = 0;

        for (int j = 0; j < K[i]; j++) {
          q[i][j] = (j == 0 ? 1.0 : q[i][j - 1]) / P[i];
          d[i][j] = (int) (k % P[i]);
          k = (k - d[i][j]) / P[i];
          x[i] += d[i][j] * q[i][j];
        }
      }
    }

    /** Compute next point.
     * Assume the current point is H(index).
     * Compute H(index+1).
     *
     * @return a 2-dimensional point with coordinates in [0,1)^2
     */
    double[] nextPoint() {
      index++;
      for (int i = 0; i < K.length; i++) {
        for (int j = 0; j < K[i]; j++) {
          d[i][j]++;
          x[i] += q[i][j];
          if (d[i][j] < P[i]) {
            break;
          }
          d[i][j] = 0;
          x[i] -= (j == 0 ? 1.0 : q[i][j - 1]);
        }
      }
      return x;
    }
  }

  /**
   * Mapper class for Pi estimation.
   * Generate points in a unit square
   * and then count points inside/outside of the inscribed circle of the square.
   */
  public static class QmcMapper extends
      Mapper<LongWritable, LongWritable, BooleanWritable, LongWritable> {

    /** Map method.
     * @param offset samples starting from the (offset+1)th sample.
     * @param size the number of samples for this map
     * @param context output {true->numInside, false->numOutside}
     */
    public void map(LongWritable offset, LongWritable size, Context context)
        throws IOException, InterruptedException {

      final HaltonSequence haltonsequence = new HaltonSequence(offset.get());
      long numInside = 0L;
      long numOutside = 0L;

      for (long i = 0; i < size.get(); ) {
        // generate points in a unit square
        final double[] point = haltonsequence.nextPoint();

        // count points inside/outside of the inscribed circle of the square
        final double x = point[0] - 0.5;
        final double y = point[1] - 0.5;
        if (x * x + y * y > 0.25) {
          numOutside++;
        } else {
          numInside++;
        }

        // report status
        i++;
        if (i % 1000 == 0) {
          context.setStatus("Generated " + i + " samples.");
        }
      }

      // output map results
      context.write(new BooleanWritable(true), new LongWritable(numInside));
      context.write(new BooleanWritable(false), new LongWritable(numOutside));
    }
  }

  /**
   * Reducer class for Pi estimation.
   * Accumulate points inside/outside results from the mappers.
   */
  public static class QmcReducer extends
      Reducer<BooleanWritable, LongWritable, WritableComparable<?>, Writable> {

    private long numInside = 0;
    private long numOutside = 0;

    /**
     * Accumulate number of points inside/outside results from the mappers.
     * @param isInside Is the points inside?
     * @param values An iterator to a list of point counts
     * @param context dummy, not used here.
     */
    public void reduce(BooleanWritable isInside,
        Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      if (isInside.get()) {
        for (LongWritable val : values) {
          numInside += val.get();
        }
      } else {
        for (LongWritable val : values) {
          numOutside += val.get();
        }
      }
    }

    /**
     * Reduce task done, write output to a file.
     */
    @Override
    public void cleanup(Context context) throws IOException {
      // write output to a file
      Path outDir = new Path(TMP_DIR, "out");
      Path outFile = new Path(outDir, "reduce-out");
      Configuration conf = context.getConfiguration();
      FileSystem fileSys = FileSystem.get(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(fileSys, conf,
          outFile, LongWritable.class, LongWritable.class,
          CompressionType.NONE);
      writer.append(new LongWritable(numInside), new LongWritable(numOutside));
      writer.close();
    }
  }

  /**
   * Run a map/reduce job for estimating Pi.
   *
   * @return the estimated value of Pi
   */
  public static BigDecimal estimatePi(int numMaps, long numPoints,
      Configuration conf
      ) throws IOException, ClassNotFoundException, InterruptedException {
    Job job = new Job(conf);
    // setup job conf
    job.setJobName(QuasiMonteCarlo.class.getSimpleName());
    job.setJarByClass(QuasiMonteCarlo.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);

    job.setOutputKeyClass(BooleanWritable.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setMapperClass(QmcMapper.class);

    job.setReducerClass(QmcReducer.class);
    job.setNumReduceTasks(1);

    // turn off speculative execution, because DFS doesn't handle
    // multiple writers to the same file.
    job.setSpeculativeExecution(false);

    // setup input/output directories
    final Path inDir = new Path(TMP_DIR, "in");
    final Path outDir = new Path(TMP_DIR, "out");
    FileInputFormat.setInputPaths(job, inDir);
    FileOutputFormat.setOutputPath(job, outDir);

    final FileSystem fs = FileSystem.get(conf);
    if (fs.exists(TMP_DIR)) {
      throw new IOException("Tmp directory " + fs.makeQualified(TMP_DIR)
          + " already exists.  Please remove it first.");
    }
    if (!fs.mkdirs(inDir)) {
      throw new IOException("Cannot create input directory " + inDir);
    }

    try {
      // generate an input file for each map task
      for (int i = 0; i < numMaps; ++i) {
        final Path file = new Path(inDir, "part" + i);
        final LongWritable offset = new LongWritable(i * numPoints);
        final LongWritable size = new LongWritable(numPoints);
        final SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, file,
            LongWritable.class, LongWritable.class, CompressionType.NONE);
        try {
          writer.append(offset, size);
        } finally {
          writer.close();
        }
        System.out.println("Wrote input for Map #" + i);
      }

      // start a map/reduce job
      System.out.println("Starting Job");
      final long startTime = System.currentTimeMillis();
      job.waitForCompletion(true);
      final double duration = (System.currentTimeMillis() - startTime) / 1000.0;
      System.out.println("Job Finished in " + duration + " seconds");

QQC. QQQ. QQR. QQL. QQS. QQM. QQT. QQU. QRF. QR>. QRC. QRQ. QRR. QRL. QRS. QRM. QRT. QRU. QLF. QL>. QLC. QLQ. QLR. QLL. QLS. QLM. QLT. QLU. QSF. QS>. QSC. QSQ. QSR. QSL. QSS. QSM. QST. QSU. QMF. QM>. QMC. QMQ. QMR. QML. QMS. QMM. QMT. QMU. QTF. QT>. QTC. QTQ. QTR. QTL.9 QTS.

33read outputs "ath in'ile 2 new "ath.outHir, 5reduceKout50; 6ong*ritable numInside 2 new 6ong*ritable.0; 6ong*ritable numOutside 2 new 6ong*ritable.0; SeNuence'ile.&eader reader 2 new SeNuence'ile.&eader. s, in'ile, con 0; try ) reader.next.numInside, numOutside0; 9 finally ) reader.close.0; 9 33compute estimated value final EigHecimal numTotal 2 EigHecimal.valueO .num%aps0.multipl/.EigHecimal.valueO .num"oints00; return EigHecimal.valueO .R0.setScale.CF0 .multipl/.EigHecimal.valueO .numInside.get.000 .divide.numTotal, &ounding%ode.HA6'JD"0; 9 finally ) s.delete.T%"JHI&, true0; 9 9 /** * !arse arguments and then runs a map/reduce ,o#. * !rint output in standard out. * * return a non=/ero if there is an error. )therwise' return :. */ public int run.String?@ args0 throws Exception ) if .args.length B2 C0 ) S/stem.err.println.5Dsage: 5=get!lass.0.get8ame.0=5 +n%aps- +nSamples-50; Tool&unner.print(eneric!ommandDsage.S/stem.err0; return C; 9 final int n%aps 2 Integer.parseInt.args?F@0; final long nSamples 2 6ong.parse6ong.args?>@0; S/stem.out.println.58umber o %aps 2 5 = n%aps0; S/stem.out.println.5Samples per %ap 2 5 = nSamples0; S/stem.out.println.5Estimated value o "i is 5 = estimate"i.n%aps, nSamples, get!on .000; return F; 9 /** * main method for running it as a stand alone command. */ public static void main.String?@ argv0 throws Exception ) S/stem.exit.Tool&unner.run.null, new Vuasi%onte!arlo.0, argv00; 9

Grep example:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/** Grep search uses RegexMapper to read the input, match each line against the regex,
 * and emit every matching word as key with a count of one as value.
 * LongSumReducer simply sums all the long values emitted by the mappers for a particular key,
 * giving the total count for that key.
 * It first searches by the above procedure and, since the output obtained is sorted on words,
 * it then runs a second job with InverseMapper to sort the output on frequencies.
 */

/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}                               // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return 2;
    }

    Path tempDir =
        new Path("grep-temp-" +
            Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    Configuration conf = getConf();
    conf.set(RegexMapper.PATTERN, args[2]);
    if (args.length == 4)
      conf.set(RegexMapper.GROUP, args[3]);

    Job grepJob = new Job(conf);

    try {
      grepJob.setJobName("grep-search");

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      grepJob.waitForCompletion(true);

      Job sortJob = new Job(conf);
      sortJob.setJobName("grep-sort");

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormatClass(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);

      sortJob.setNumReduceTasks(1);                 // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setSortComparatorClass(               // sort by decreasing freq
          LongWritable.DecreasingComparator.class);

      sortJob.waitForCompletion(true);
    } finally {
      FileSystem.get(conf).delete(tempDir, true);
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
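
One detail of the usage line is easy to miss: the optional <group> argument selects which capture group of the regex RegexMapper emits as the key; when it is omitted, group 0 (the whole match) is used. The snippet below only illustrates that selection with java.util.regex; the pattern, input line and class name are made up and are not part of the Grep example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: what the optional <group> argument of Grep selects. */
public class GroupSketch {
  public static void main(String[] args) {
    Pattern p = Pattern.compile("dfs\\.(\\w+)");   // sample pattern with one capture group
    Matcher m = p.matcher("dfs.replication=3");    // sample input line
    if (m.find()) {
      System.out.println(m.group(0));  // whole match (default group 0): dfs.replication
      System.out.println(m.group(1));  // capture group 1 only: replication
    }
  }
}

With the distribution's examples jar, the two chained jobs are usually started with something along the lines of "bin/hadoop jar hadoop-examples-*.jar grep <inDir> <outDir> 'dfs[a-z.]+'", after which the single output file lists the matches ordered by decreasing frequency.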

WordCount example:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Word count documentation.
 * TokenizerMapper: takes each line from the input set, tokenizes it,
 * and emits each word as key and the integer one as value.
 * IntSumReducer: accepts each word as key and aggregates all of its values,
 * i.e. the ones the mappers have emitted; iterating over the values and
 * adding them up gives the count of that word.
 */
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
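
To make the data flow concrete, the sketch below collapses WordCount's two phases into one process: each line is tokenized exactly as TokenizerMapper does, and the per-word ones are summed the way IntSumReducer does. It is an illustration only; the class name and the two sample lines are invented.

import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Illustrative only: WordCount's map and reduce logic collapsed into one process. */
public class WordCountSketch {
  public static void main(String[] args) {
    String[] lines = { "an elephant can not jump", "an elephant can carry load" };
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    for (String line : lines) {                  // one map() call per input line
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        String word = itr.nextToken();           // map emits (word, 1)
        counts.merge(word, 1, Integer::sum);     // reduce sums the ones per word
      }
    }
    // prints {an=2, can=2, carry=1, elephant=2, jump=1, load=1, not=1}
    System.out.println(counts);
  }
}

Against a real cluster the job is typically submitted as "bin/hadoop jar hadoop-examples-*.jar wordcount <in> <out>" (the jar name varies by release), with the input and output paths living on HDFS.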
