MapR Technologies, Inc. | Technical Brief | Confidential | [?]/23/11 | www.mapr.com

MapR's Direct Access NFS vs. Hadoop FUSE
Executive Summary
The MapR Distribution for Apache Hadoop greatly enhances Hadoop through Direct Access NFS. This capability is enabled by MapR's replacement of the Hadoop Distributed File System (HDFS) with a highly scalable, enterprise-class storage service layer that supports random reads and writes. Users can mount the cluster over NFS, so standard file-based applications (command-line utilities, file browsers, databases, etc.) can work directly with the data in the cluster, and application servers can write their data directly into the cluster. Other Hadoop distributions, unlike MapR, are based on HDFS, which is a write-once (effectively read-only) file system with no support for NFS. MapR provides Direct Access NFS along with 100% compatibility at the API layer for MapReduce and HDFS, so there is no need to change or recompile existing applications.
Other distributions claim that a "Hadoop FUSE" component provides similar functionality to MapR's NFS, but this is not true. First, their back-end storage service (HDFS) only supports sequential I/O, so a client application that does anything else can fail at any point in time. Second, Hadoop FUSE does not provide the minimal consistency guarantees needed by traditional applications (a client cannot always read what it just wrote). Third, a client component must be installed and maintained on every client, and only Linux clients are supported. Finally, while MapR's NFS provides full wire-speed performance, Hadoop FUSE is extremely slow.
The following table provides a high-level comparison between MapR's Direct Access NFS feature and Hadoop FUSE.
                                          NFS (MapR)                          Hadoop FUSE / Hadoop FUSE exported over NFS
Supported client operating systems        Linux, Mac, Windows, Solaris, etc.  Linux
No client software installation needed    Yes                                 No
Random read/write                         Yes                                 No (arbitrary applications will fail)
Performance                               Wire-speed                          Slow
"Self consistency" (read what you write)  Yes                                 No
Stable                                    Yes                                 No (see explanation below)
Recommended by MapR/Cloudera for direct   Yes                                 No
writing from application server?
MapR
FileSystem API and NFS
Like HDFS, MapR's storage services implement the Hadoop FileSystem API. However, in addition to the FileSystem API, MapR also provides an NFS interface so that clients can mount the cluster and access the data directly.
The following diagram shows how this works:

Each node in the cluster has a FileServer service, whose role is similar in many ways to the DataNode in HDFS. In addition, there can be one or more NFS Gateway services running in the cluster. In many deployments the NFS Gateway service runs on every node in the cluster, alongside the FileServer service.
A MapR cluster can be accessed either through the Hadoop FileSystem API or through NFS:
• Hadoop FileSystem API. To access a MapR cluster via the Hadoop FileSystem API, the MapR/Hadoop client must be installed on the client. MapR provides easy-to-install clients for Linux, Mac and Windows. The Hadoop FileSystem API is in Java, so in most cases client applications are developed in Java and linked to the hadoop-core-*.jar library.
• NFS. To access a MapR cluster over NFS, the client mounts any of the NFS Gateway services. There is no need to install any software on the client, because every common operating system includes an NFS client. In Windows, the MapR cluster becomes a letter drive (Y:, Z:, etc.), whereas in Linux and Mac the cluster is accessible as a directory in the local file system (e.g., /mapr). (Note that some low-end Windows versions do not include the NFS client.)
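Because an NFS-mounted cluster looks like an ordinary directory, an application can write into it with plain file I/O and no Hadoop library. The following is a minimal sketch in Python, assuming the cluster is mounted at a path such as /mapr. The function name and the log path are hypothetical, and a temporary directory stands in for the mount point so the sketch runs anywhere.

```python
import os
import tempfile


def append_log_line(mount_root, relative_path, line):
    """Append one line to a file under an NFS mount using plain file I/O.

    mount_root is assumed to be the directory where the cluster is mounted
    (for example /mapr on Linux or Mac). Nothing here is Hadoop-specific.
    """
    full_path = os.path.join(mount_root, relative_path)
    # Create the parent directory if it does not exist yet.
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "a", encoding="utf-8") as log_file:
        log_file.write(line + "\n")
    return full_path


if __name__ == "__main__":
    # A temporary directory stands in for the real mount point (assumption).
    with tempfile.TemporaryDirectory() as fake_mount:
        path = append_log_line(fake_mount, "logs/app.log", "server started")
        append_log_line(fake_mount, "logs/app.log", "request handled")
        with open(path, encoding="utf-8") as log_file:
            print(log_file.read(), end="")
```

On a real deployment the only change would be to pass the actual mount point instead of the temporary directory.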
The Hadoop FileSystem API is designed for MapReduce (with functions such as getFileBlockLocations), so MapReduce jobs normally read and write data through that API. However, the NFS interface is often more suitable for applications that are not specific to Hadoop. For example, an application server can use the NFS interface to write its log files directly into the cluster. Text editors, command-line utilities and file browsers can also use the NFS interface to access the cluster.
Note that MapR provides high availability for NFS in the M5 edition. The administrator allocates a pool of virtual IP addresses (VIPs), which the cluster then automatically assigns to the NFS Gateways. A VIP automatically migrates from one NFS Gateway service to another in the event of a failure, so that all clients that mounted the cluster through that VIP can continue reading and writing data with no impact. In a typical deployment, a simple load-balancing scheme, such as DNS round-robin, is used to uniformly distribute clients among the different NFS Gateways (i.e., VIPs).
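The VIP behavior described above can be illustrated with a toy model. This is only a sketch of the idea (VIPs spread over gateways, VIPs moved off a failed gateway, clients spread over VIPs in round-robin order); it is not MapR's actual assignment algorithm, and all names and addresses are made up for the example.

```python
from itertools import cycle


def assign_vips(vips, gateways):
    """Spread virtual IPs across the live gateways in round-robin order."""
    if not gateways:
        raise ValueError("no live NFS gateways")
    owners = cycle(gateways)
    return {vip: next(owners) for vip in vips}


def fail_gateway(assignment, failed, live_gateways):
    """Move every VIP owned by the failed gateway to a surviving gateway."""
    survivors = [g for g in live_gateways if g != failed]
    if not survivors:
        raise ValueError("no surviving NFS gateways")
    replacements = cycle(survivors)
    return {
        vip: (next(replacements) if owner == failed else owner)
        for vip, owner in assignment.items()
    }


def pick_vip_round_robin(vips, client_index):
    """Mimic DNS round-robin: successive clients get successive VIPs."""
    return vips[client_index % len(vips)]


if __name__ == "__main__":
    vips = ["10.0.0.101", "10.0.0.102", "10.0.0.103"]  # hypothetical pool
    gateways = ["node-a", "node-b", "node-c"]          # hypothetical nodes
    assignment = assign_vips(vips, gateways)
    print(assignment)
    print(fail_gateway(assignment, "node-b", gateways))
    print([pick_vip_round_robin(vips, i) for i in range(4)])
```

The point of the model is that clients only ever know about VIPs, so moving a VIP to another gateway is invisible to them.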
Random read/write
The MapR Distribution for Apache Hadoop includes an underlying storage system that supports random reads and writes (with support for multiple readers and writers simultaneously). This provides a significant advantage over other distributions, in which HDFS provides a write-once storage system (similar to FTP, or a CD-ROM).
Having support for random reads and writes is necessary in order to provide NFS access (and, more generally, any kind of access for non-Hadoop applications). NFS is a simple protocol in which the client sends the server requests to write (or read) a given number of bytes at a given offset in a given file. In a MapR cluster, the NFS Gateway service receives these requests from the client and translates them into the corresponding RPCs to the FileServer services. The server side in the NFS protocol is mostly stateless: there is no concept of opening or closing files.
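The request shape described above (a byte count, an offset and a file, with no open or close step visible to the client) can be sketched with positional I/O. This is an illustration of the access pattern only, not of the NFS wire protocol or of MapR's gateway code. The handler names are hypothetical, and os.pread and os.pwrite are available on Unix-like systems.

```python
import os
import tempfile


def handle_write(path, offset, data):
    """Serve one 'write these bytes at this offset' request.

    The file is opened and closed inside the call, so the handler keeps
    no per-client state between requests.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        return os.pwrite(fd, data, offset)  # number of bytes written
    finally:
        os.close(fd)


def handle_read(path, offset, count):
    """Serve one 'read this many bytes at this offset' request."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, count, offset)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "data.bin")
        # Requests arrive out of order; this only works on storage that
        # supports random writes.
        handle_write(target, 6, b"world")
        handle_write(target, 0, b"hello ")
        print(handle_read(target, 0, 11))
```

Each request carries everything needed to serve it, which is why the backing store must accept writes at arbitrary offsets.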
Mountable HDFS (Hadoop FUSE)
Other Hadoop distributions (e.g., CDH) cannot provide NFS access because the underlying storage system (HDFS) does not support random read/write semantics. A project called Hadoop FUSE was developed so that Hadoop clusters can be mounted on Linux systems. This approach has many problems:
• It requires a client installation. In other words, software must be installed on every machine that will read from or write to the Hadoop cluster. In addition, if the Hadoop API changes, the software must be upgraded on each client. Installing and maintaining Hadoop software on every client (e.g., every application server in the datacenter) can be painful.
• FUSE only works on Linux. Windows and Mac clients cannot mount the cluster.

• If the application doesn't write data sequentially, it receives an I/O error ("Operation not supported"). Not only do many applications not write sequentially, it is impossible to know how an application behaves without actually looking at its source code.
• It is very slow. Unlike NFS clients, which are typically part of the OS kernel, a FUSE client runs in user space. In Hadoop FUSE, the FUSE implementation calls libhdfs, a C/C++ wrapper for hadoop-core-*.jar. The libhdfs library calls the corresponding Java FileSystem API method, which in turn communicates with the cluster. All these memory copies and transitions from kernel to user space and then to the JVM hurt performance.
• No "self consistency". If you write to a file and then immediately read the file, the read returns empty. You must wait several seconds after writing to the file before reading it. This behavior is obviously not something that applications expect, so the results can be catastrophic depending on the application.
• Poor stability and data loss. We tried Hadoop FUSE on multiple platforms: CentOS 5.4 and Ubuntu 10.10 (the latter using the Cloudera Demo VM). In Ubuntu 10.10, any time we tried to edit an existing file (e.g., in vi), not only was the new content not saved, but the existing content of the file was lost. It appears that when the client tried to rewrite the file, the NameNode didn't think that the client had opened the file and thus threw a LeaseExpiredException.
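Two of the problems listed above (no read-after-write consistency, and I/O errors on non-sequential writes) are easy to probe from an application's point of view. The following is a minimal sketch in Python with hypothetical function names. It only exercises whatever directory it is given, and makes no claim about how any particular mount will respond.

```python
import os
import tempfile


def reads_own_write(directory, payload=b"self-consistency probe"):
    """Write a file, read it back immediately, and compare.

    Returns True when the read returns exactly what was just written.
    """
    path = os.path.join(directory, "probe.tmp")
    try:
        with open(path, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f:
            return f.read() == payload
    finally:
        if os.path.exists(path):
            os.remove(path)


def supports_in_place_update(directory):
    """Overwrite one byte in the middle of an existing file.

    Returns False if the file system rejects the non-sequential write
    with an I/O error, or if the update is not visible afterwards.
    """
    path = os.path.join(directory, "update_probe.tmp")
    try:
        with open(path, "wb") as f:
            f.write(b"AAAA")
        with open(path, "r+b") as f:
            f.seek(1)          # jump back into existing content
            f.write(b"B")
        with open(path, "rb") as f:
            return f.read() == b"ABAA"
    except OSError:
        return False
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # A local temporary directory is used here; pass a mount point instead
    # to probe a mounted cluster.
    with tempfile.TemporaryDirectory() as directory:
        print("reads own write:", reads_own_write(directory))
        print("in-place update:", supports_in_place_update(directory))
```

Ordinary applications such as text editors implicitly rely on both checks passing.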
To further illustrate why MapR's Direct Access NFS provides much higher performance than Hadoop FUSE, the following diagrams outline how data flows from the client application to the cluster:


It's worth noting that Hadoop FUSE does work with MapR, because it uses the Hadoop FileSystem API internally, but we strongly advise against using it due to all these problems.
Export FUSE over NFS?
In theory, it is possible to mount HDFS on a Linux server using FUSE and then export that through NFS using the standard Linux NFS server. However, this should be avoided for the following reasons:
• FUSE relies on the kernel's inode cache since FUSE is path-based and not inode-based like NFS. In other words, an NFS client specifies an inode number in a read/write request, and the server must be able to translate that inode into a path so it can be passed into the Hadoop FUSE implementation. However, the NFS server is generally stateless, so it has no way of knowing whether or not a client is keeping a reference to a file. As a result, it may flush an inode from its cache even though a client is still referencing the corresponding file, and the NFS server will no longer be able to translate the inode number into a path, so additional requests from the client will fail. (Note that FUSE provides a "noforget" option to mitigate this problem by never flushing inodes from the cache, but in the case of HDFS this is not practical because the NFS server will eventually run out of memory.)
• NFS reorders writes. So even if an application is writing sequentially, the write requests will be processed in random order, and the Hadoop FUSE client will fail because HDFS can only write sequentially.
• Exporting FUSE-mounted file systems over NFS is only possible in recent Linux kernels. It doesn't work with RHEL/CentOS 5.x.
See http://wiki.apache.org/hadoop/MountableHDFS and https://github.com/fuse4x/fuse/blob/master/README.NFS for more details.
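The write-reordering problem can be shown with a toy model of a store that accepts only sequential writes. This is a sketch for illustration, not HDFS or Hadoop FUSE code; the class and function names are hypothetical.

```python
class SequentialOnlyFile:
    """Toy model of a write-once store that accepts only appends."""

    def __init__(self):
        self.data = bytearray()

    def write_at(self, offset, chunk):
        # A write is accepted only if it starts exactly at the current end.
        if offset != len(self.data):
            raise OSError(
                f"non-sequential write at offset {offset}, "
                f"expected {len(self.data)}"
            )
        self.data.extend(chunk)


def replay(requests):
    """Apply (offset, bytes) write requests in the order they arrive."""
    target = SequentialOnlyFile()
    for offset, chunk in requests:
        target.write_at(offset, chunk)
    return bytes(target.data)


if __name__ == "__main__":
    # The application writes sequentially...
    in_order = [(0, b"aaaa"), (4, b"bbbb"), (8, b"cccc")]
    print(replay(in_order))

    # ...but the requests reach the store in a different order.
    reordered = [in_order[0], in_order[2], in_order[1]]
    try:
        replay(reordered)
    except OSError as err:
        print("failed:", err)
```

The application did nothing wrong in the second case; the failure comes purely from the order in which the requests were delivered.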
Summary
The MapR Distribution for Apache Hadoop includes a robust, enterprise-class storage service that supports random reads and writes and exposes the standard NFS interface so that clients can mount the cluster and read and write data directly. This capability makes Hadoop much easier to use, and enables new classes of applications.
Other Hadoop distributions claim to support a "mountable HDFS" via FUSE. However, as demonstrated in this article, Hadoop FUSE simply does not work for most desired use cases, due to HDFS and FUSE limitations.
