The Hadoop Ecosystem Table

9/4/2016
This page is a summary to keep track of Hadoop-related projects, focused on the FLOSS environment.
Apache HDFS
Red Hat GlusterFS
Quantcast File System QFS
Ceph Filesystem
https://hadoopecosystemtable.github.io/
Distributed Filesystem
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With Zookeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
Links:
1. hadoop.apache.org
2. Google FileSystem GFS Paper
3. Cloudera Why HDFS
4. Hortonworks Why HDFS
GlusterFS is a scale-out network-attached storage file system. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was announced as a commercially supported integration of GlusterFS with Red Hat Enterprise Linux. Gluster File System is now known as Red Hat Storage Server.
Links:
1. www.gluster.org
2. Red Hat Hadoop Plugin
QFS is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop's HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters. It is written in C++ and has fixed-footprint memory management. QFS uses Reed-Solomon error correction as its method for assuring reliable access to data.
Reed-Solomon coding is very widely used in mass storage systems to correct the burst errors associated with media defects. Rather than storing three full versions of each file like HDFS, resulting in the need for three times more storage, QFS only needs 1.5x the raw capacity because it stripes data across nine different disk drives.
Links:
1. QFS site
2. GitHub QFS
3. HADOOP-8885
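The storage figures quoted for HDFS and QFS can be checked with a little arithmetic. The sketch below is only an illustration: the function name is made up, and the split of the nine stripes into 6 data and 3 parity stripes is an assumption (it is the split that makes nine drives and a 1.5x factor agree), not something the description states.

```python
def storage_overhead(data_units, redundancy_units):
    """Raw capacity needed per unit of user data."""
    return (data_units + redundancy_units) / data_units

# HDFS-style replication: one original plus two extra full copies.
hdfs_factor = storage_overhead(data_units=1, redundancy_units=2)

# QFS-style striping, ASSUMING 6 data stripes + 3 parity stripes (nine drives).
qfs_factor = storage_overhead(data_units=6, redundancy_units=3)

print(f"HDFS replication: {hdfs_factor:.1f}x raw capacity")
print(f"QFS striping:     {qfs_factor:.1f}x raw capacity")
```

Under these assumptions the two factors come out as 3.0 and 1.5, matching the numbers in the description.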
Ceph is a free software storage platform designed to present object, block, and file storage from a single distributed computer cluster. Ceph's main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available. The data is replicated, making it fault tolerant.
Links:
1. Ceph Filesystem site
2. Ceph and Hadoop
3. HADOOP-6253
Lustre file system
Alluxio
The Lustre filesystem is a high-performance distributed filesystem intended for larger network and high-availability environments. Traditionally, Lustre is configured to manage remote data storage disk devices within a Storage Area Network (SAN), which is two or more remotely attached disk devices communicating via a Small Computer System Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.
With Hadoop HDFS the software needs a dedicated cluster of computers on which to run. But folks who run high-performance computing clusters for other purposes often don't run HDFS, which leaves them with a bunch of computing power, tasks that could almost certainly benefit from a bit of map-reduce, and no way to put that power to work running Hadoop. Intel's noticed this and, in version 2.5 of its Hadoop distribution that it quietly released last week, has added support for Lustre: the Intel HPC Distribution for Apache Hadoop* Software, a new product that combines Intel Distribution for Apache Hadoop software with Intel Enterprise Edition for Lustre software. This is the only distribution of Apache Hadoop that is integrated with Lustre, the parallel file system used by many of the world's fastest supercomputers.
Links:
1. wiki.lustre.org/
2. Hadoop with Lustre
3. Intel HPC Hadoop
Alluxio, the world's first memory-centric virtual distributed storage system, unifies data access and bridges computation frameworks and underlying storage systems. Applications only need to connect with Alluxio to access data stored in any underlying storage systems. Additionally, Alluxio's memory-centric architecture enables data access orders of magnitude faster than existing solutions.
In the big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, Ceph, or OSS. Alluxio brings significant performance improvement to the stack; for example, Baidu uses Alluxio to improve their data analytics performance by 30 times. Beyond performance, Alluxio bridges new workloads with data stored in traditional storage systems. Users can run Alluxio using its standalone cluster mode, for example on Amazon EC2, or launch Alluxio with Apache Mesos or Apache Yarn.
Alluxio is Hadoop compatible. This means that existing Spark and MapReduce programs can run on top of Alluxio without any code changes. The project is open source (Apache License 2.0) and is deployed at multiple companies. It is one of the fastest growing open source projects. With less than three years of open source history, Alluxio has attracted more than 160 contributors from over 50 institutions, including Alibaba, Alluxio, Baidu, CMU, IBM, Intel, NJU, Red Hat, UC Berkeley, and Yahoo. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution.
Links:
1. Alluxio site
GridGain
XtreemFS
GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the In-Memory Apache Hadoop Accelerator, which aims to accelerate HDFS and Map/Reduce by bringing both data and computations into memory. This work is done with GGFS, a Hadoop-compliant in-memory file system. For I/O-intensive jobs GridGain GGFS offers performance close to 100x faster than standard HDFS. Paraphrasing Dmitriy Setrakyan from GridGain Systems talking about GGFS regarding Tachyon:
- GGFS allows read-through and write-through to/from underlying HDFS or any other Hadoop-compliant file system with zero code change. Essentially GGFS entirely removes the ETL step from integration.
- GGFS has the ability to pick and choose what folders stay in memory, what folders stay on disc, and what folders get synchronized with the underlying (HD)FS either synchronously or asynchronously.
- GridGain is working on adding a native MapReduce component which will provide native complete Hadoop integration without changes in API, like Spark currently forces you to do. Essentially GridGain MR+GGFS will allow bringing Hadoop completely or partially in memory in Plug-n-Play fashion without any API changes.
Links:
1. GridGain site
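The read-through and write-through behaviour attributed to GGFS can be pictured with a small in-memory layer in front of a slower store. This is a conceptual sketch only: the class and method names are invented for illustration, a plain dict stands in for HDFS, and nothing here reflects GridGain's actual API.

```python
class ReadWriteThroughCache:
    """In-memory layer in front of a slower backing store (hypothetical)."""

    def __init__(self, backing_store):
        self.backing = backing_store  # stands in for HDFS or a similar file system
        self.memory = {}              # paths currently held in memory

    def read(self, path):
        # Read-through: on a miss, fetch from the backing store and keep a copy.
        if path not in self.memory:
            self.memory[path] = self.backing[path]
        return self.memory[path]

    def write(self, path, data):
        # Write-through (synchronous variant): update memory and backing store together.
        self.memory[path] = data
        self.backing[path] = data


store = {"/logs/a.txt": b"old"}
cache = ReadWriteThroughCache(store)
first_read = cache.read("/logs/a.txt")  # miss, loaded from the backing store
cache.write("/logs/b.txt", b"new")      # lands in memory and in the store
```

An asynchronous variant would queue the backing-store update instead of performing it inside `write`; that is left out to keep the sketch short.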
XtreemFS is a general-purpose storage system and covers most storage needs in a single deployment. It is open source, requires no special hardware or kernel modules, and can be mounted on Linux, Windows and OS X. XtreemFS runs distributed and offers resilience through replication. XtreemFS volumes can be accessed through a FUSE component that offers normal file interaction with POSIX-like semantics. Furthermore, an implementation of Hadoop's FileSystem interface is included, which makes XtreemFS available for use with Hadoop, Flink and Spark out of the box. XtreemFS is licensed under the New BSD license. The XtreemFS project is developed by Zuse Institute Berlin. The development of the project has been funded by the European Commission since 2006 under Grant Agreements No. FP6-033576, FP7-ICT-257438, and FP7-318521, as well as the German projects MoSGrid, "First We Take Berlin", FFMK, GeoMultiSens, and BBDC.
Links:
1. XtreemFS site
2. Flink on XtreemFS. Spark XtreemFS
Apache Ignite
Apache MapReduce
Apache Pig
Distributed Programming
Apache Ignite In-Memory Data Fabric is a distributed in-memory platform for computing and transacting on large-scale data sets in real time. It includes a distributed key-value in-memory store, SQL capabilities, map-reduce and other computations, distributed data structures, continuous queries, messaging and events subsystems, and Hadoop and Spark integration. Ignite is built in Java and provides .NET and C++ APIs.
Links:
1. Apache Ignite
2. Apache Ignite documentation
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from the Google paper "MapReduce: Simplified Data Processing on Large Clusters". The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for "Yet Another Resource Negotiator". It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
Links:
1. Apache MapReduce
2. Google MapReduce paper
3. Writing YARN applications
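For readers new to the MapReduce model, the three phases (map, shuffle, reduce) can be sketched in a single process. This is a toy word count in plain Python, not Hadoop code: the function names are illustrative, and a real cluster would run the map and reduce calls in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit one (word, 1) pair per word in the input record.
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the values of one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data flow"])
print(counts)
```

Because each map call looks at one record and each reduce call at one key, both phases can be spread across machines without coordination, which is the property the model relies on.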
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop's processing system, MapReduce.
Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.
Links:
1. pig.apache.org/
2. Pig examples by Alan Gates
JAQL
Apache Spark
JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. As its name implies, a primary use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within JAQL" capability lets programmers work with structured SQL data while employing a JSON data model that's less restrictive than its Structured Query Language counterparts.
Specifically, Jaql allows you to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive. Jaql's query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig.
JAQL was created by workers at IBM Research Labs in 2008 and released to open source. While it continues to be hosted as a project on Google Code, where a downloadable version is available under an Apache 2.0 license, the major development activity around JAQL has remained centered at IBM. The company offers the query language as part of the tools suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with a workflow orchestrator, JAQL is used in BigInsights to exchange data between storage, processing and analytics jobs. It also provides links to external data and services, including relational databases and machine learning data.
Links:
1. JAQL in Google Code
2. What is Jaql? by IBM
Apache Spark is a data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications.
Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop and inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big data sets.
To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.
Links:
1. Apache Spark
2. Mirror of Spark on Github
3. RDDs Paper
4. Spark: Cluster Computing... Paper
Spark Research
Apache Storm
Apache Flink
Storm is a complex event processor (CEP) and distributed computation framework written predominantly in the Clojure programming language. It is a distributed real-time computation system for processing fast, large streams of data. Storm is an architecture based on the master-workers paradigm, so a Storm cluster mainly consists of a master and worker nodes, with coordination done by Zookeeper.
Storm makes use of zeromq (0mq, zeromq), an advanced, embeddable networking library. It provides a message queue, but unlike message-oriented middleware (MOM), a 0MQ system can run without a dedicated message broker. The library is designed to have a familiar socket-style API.
Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. Storm was initially developed and deployed at BackType in 2011. After 7 months of development BackType was acquired by Twitter in July 2011. Storm was open sourced in September 2011.
Hortonworks is developing a Storm-on-YARN version and plans to finish the base-level integration in 2013 Q4. This is the plan from Hortonworks. Yahoo/Hortonworks also plans to move the Storm-on-YARN code from github.com/yahoo/storm-yarn to be a subproject of the Apache Storm project in the near future.
Twitter has recently released a Hadoop-Storm hybrid called "Summingbird". Summingbird fuses the two frameworks into one, allowing developers to use Storm for short-term processing and Hadoop for deep data dives. It is a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system.
Links:
1. Storm Project/
2. Storm on YARN
Apache Flink (formerly called Stratosphere) features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations.
Flink is a data processing system and an alternative to Hadoop's MapReduce component. It comes with its own runtime, rather than building on top of MapReduce. As such, it can work completely independently of the Hadoop ecosystem. However, Flink can also access Hadoop's distributed file system (HDFS) to read and write data, and Hadoop's next-generation resource manager (YARN) to provision cluster resources. Since most Flink users are using Hadoop HDFS to store their data, it already ships the required libraries to access HDFS.
Links:
1. Apache Flink incubator page
2. Stratosphere site
Apache Apex
Apache Apex is an enterprise-grade, Apache YARN-based big-data-in-motion platform that unifies stream processing as well as batch processing. It processes big data in motion in a highly scalable, highly performant, fault-tolerant, stateful, secure, distributed, and easily operable way. It provides a simple API that enables users to write or reuse generic Java code, thereby lowering the expertise needed to write big data applications.
The Apache Apex platform is supplemented by Apache Apex-Malhar, which is a library of operators that implement common business logic functions needed by customers who want to quickly develop applications. These operators provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; MySql, Cassandra, MongoDB, Redis, HBase, CouchDB and other databases, along with JDBC connectors. The library also includes a host of other common business logic patterns that help users to significantly reduce the time it takes to go into production. Ease of integration with all other big data technologies is one of the primary missions of Apache Apex-Malhar.
Apex, available on GitHub, is the core technology upon which DataTorrent's commercial offering, DataTorrent RTS 3, along with other technology such as a data ingestion tool called dtIngest, is based.
Links:
1. Apache Apex from DataTorrent
2. Apache Apex main page
3. Apache Apex Proposal
Netflix PigPen
AMPLab SIMR
Facebook Corona
PigPen is map-reduce for Clojure which compiles to Apache Pig. Clojure is a dialect of the Lisp programming language created by Rich Hickey, so it is a functional general-purpose language, and it runs on the Java Virtual Machine, Common Language Runtime, and JavaScript engines. In PigPen there are no special user-defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program. This tool was open sourced by Netflix, Inc., the American provider of on-demand Internet streaming media.
Links:
1. PigPen on GitHub
Apache Spark was developed with Apache YARN in mind. However, up to now, it has been relatively hard to run Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights, and without having Spark or Scala installed on any of the nodes.
Links:
1. SIMR on GitHub
Corona is "the next version of Map-Reduce" from Facebook, based on its own fork of Hadoop. The current Hadoop implementation of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets. The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook engineers looked at but discounted because of the highly customised nature of the company's deployment of Hadoop and HDFS. Corona, like YARN, spawns multiple job trackers (one for each job, in Corona's case).
Links:
1. Corona on Github
Apache REEF
Apache Twill
Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. Apache REEF drastically simplifies development for those resource managers through the following features:
- Centralized Control Flow: Apache REEF turns the chaos of a distributed application into events in a single machine, the Job Driver. Events include container allocation, Task launch, completion and failure. For failures, Apache REEF makes every effort to make the actual `Exception` thrown by the Task available to the Driver.
- Task runtime: Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in every container of a REEF application. Evaluators can keep data in memory in between Tasks, which enables efficient pipelines on REEF.
- Support for multiple resource managers: Apache REEF applications are portable to any supported resource manager with minimal effort. Further, new resource managers are easy to support in REEF.
- .NET and Java API: Apache REEF is the only API to write YARN or Mesos applications in .NET. Further, a single REEF application is free to mix and match Tasks written for .NET or Java.
- Plugins: Apache REEF allows for plugins (called "Services") to augment its feature set without adding bloat to the core. REEF includes many Services, such as name-based communications between Tasks, MPI-inspired group communications (Broadcast, Reduce, Gather, ...) and data ingress.
Links:
1. Apache REEF Website
Twill is an abstraction over Apache Hadoop YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their business logic. Twill uses a simple thread-based model that Java programmers will find familiar. YARN can be viewed as a compute fabric of a cluster, which means YARN applications like Twill will run on any Hadoop 2 cluster.
YARN is an open source application that allows the Hadoop cluster to turn into a collection of virtual machines. Weave, developed by Continuuity and initially housed on Github, is a complementary open source application that uses a programming model similar to Java threads, making it easy to write distributed applications. In order to remove a conflict with a similarly named project on Apache, called "Weaver," Weave's name changed to Twill when it moved to Apache incubation.
Twill functions as a scaled-out proxy. Twill is a middleware layer in between YARN and any application on YARN. When you develop a Twill app, Twill handles APIs in YARN that resemble a multi-threaded application familiar to Java. It is very easy to build multi-processed distributed applications in Twill.
Links:
1. Apache Twill Incubator
Damballa Parkour
Apache Hama
Datasalt Pangool
Apache Tez
Apache DataFu
Parkour is a library for developing MapReduce programs using the LISP-like language Clojure. Parkour aims to provide deep Clojure integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete access to absolutely everything possible in raw Java Hadoop MapReduce.
Links:
1. Parkour GitHub Project
Apache Hama is an Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce. Many data analysis techniques such as machine learning and graph algorithms require iterative computations; this is where the Bulk Synchronous Parallel model can be more effective than "plain" MapReduce.
Links:
1. Hama site
Pangool is a new MapReduce paradigm: a new API for MR jobs, at a higher level than Java.
Links:
1. Pangool
2. GitHub Pangool
Tez is a proposal to develop a generic application which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN. Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for end users; in fact, it enables developers to build end-user applications with much better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases. The Tez framework constitutes part of the Stinger initiative (a low-latency SQL-type query interface for Hadoop based on Hive).
Links:
1. Apache Tez Incubator
2. Hortonworks Apache Tez page
DataFu provides a collection of Hadoop MapReduce jobs and functions in higher-level languages based on it to perform data analysis. It provides functions for common statistics tasks (e.g. quantiles, sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop jobs for incremental data processing in MapReduce. DataFu is a collection of Pig UDFs (including PageRank, sessionization, set operations, sampling, and much more) that were originally developed at LinkedIn.
Links:
1. DataFu Apache Incubator
Pydoop
Kangaroo
TinkerPop
Pachyderm MapReduce
Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++ Pipes and the C libhdfs APIs, that allows you to write full-fledged MapReduce applications with HDFS access. Pydoop has several advantages over Hadoop's built-in solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython package, it allows you to access all standard library and third-party modules, some of which may not be available to other Python implementations such as Jython.
Links:
1. SF Pydoop site
2. Pydoop GitHub Project
Kangaroo is an open-source project from Conductor for writing MapReduce jobs consuming data from Kafka. The introductory post explains Conductor's use case: loading data from Kafka to HBase by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions which are limited to a single InputSplit per Kafka partition, Kangaroo can launch multiple consumers at different offsets in the stream of a single partition for increased throughput and parallelism.
Links:
1. Kangaroo Introduction
2. Kangaroo GitHub Project
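The idea of reading one partition with several consumers at different offsets amounts to cutting an offset range into contiguous chunks, one per input split. The sketch below shows only that arithmetic; `split_offsets` is a hypothetical helper, not part of Kangaroo or Kafka, and how Kangaroo actually sizes its splits is not described here.

```python
def split_offsets(start, end, num_splits):
    """Divide the half-open offset range [start, end) into contiguous chunks.

    Hypothetical helper: each returned (lo, hi) pair could back one
    consumer reading offsets lo..hi-1 of the same partition.
    """
    total = end - start
    base, extra = divmod(total, num_splits)
    splits = []
    lo = start
    for i in range(num_splits):
        # The first `extra` chunks take one additional offset each.
        hi = lo + base + (1 if i < extra else 0)
        if hi > lo:  # skip empty chunks when there are more splits than offsets
            splits.append((lo, hi))
        lo = hi
    return splits


splits = split_offsets(100, 110, 3)
print(splits)
```

Each chunk can then be consumed independently, which is where the extra parallelism over a one-split-per-partition design would come from.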
TinkerPop is a graph computing framework written in Java. It provides a core API that graph system vendors can implement. There are various types of graph systems including in-memory graph libraries, OLTP graph databases, and OLAP graph processors. Once the core interfaces are implemented, the underlying graph system can be queried using the graph traversal language Gremlin and processed with TinkerPop-enabled algorithms. For many, TinkerPop is seen as the JDBC of the graph computing community.
Links:
1. Apache Tinkerpop Proposal
2. TinkerPop site
Pachyderm is a completely new MapReduce engine built on top of Docker and CoreOS. In Pachyderm MapReduce (PMR) a job is an HTTP server inside a Docker container (a microservice). You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. Data is POSTed to the container over HTTP and the results are stored back in the file system. You can implement the web server in any language you want and pull in any library. Pachyderm also creates a DAG for all the jobs in the system and their dependencies, and it automatically schedules the pipeline such that each job isn't run until its dependencies have completed. Everything in Pachyderm "speaks in diffs", so it knows exactly which data has changed and which subsets of the pipeline need to be rerun. CoreOS is an open-source lightweight operating system based on Chrome OS; in fact, CoreOS is a fork of Chrome OS. CoreOS provides only the minimal functionality required for deploying applications inside software containers, together with built-in mechanisms for service discovery and configuration sharing.
Links:
1. Pachyderm site
2. Pachyderm introduction article
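The dependency-aware scheduling described for Pachyderm can be illustrated with a topological sort over a job DAG. This is a generic sketch using Python's standard `graphlib` module (Python 3.9+); the job names and both helper functions are made up for illustration and say nothing about how Pachyderm is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
pipeline = {
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def schedule(dag):
    """Return jobs in an order where every job follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def jobs_to_rerun(dag, changed):
    """Return the changed job plus every job downstream of it."""
    affected = {changed}
    for job in schedule(dag):
        # A job is affected if any of its dependencies is affected.
        if dag.get(job, set()) & affected:
            affected.add(job)
    return affected


order = schedule(pipeline)
rerun = jobs_to_rerun(pipeline, "clean")
print(order)
print(sorted(rerun))
```

Walking the jobs in dependency order is what lets `jobs_to_rerun` find all downstream work in a single pass; jobs upstream of the change (here `ingest`) are left alone.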
Apache Beam
Apache Beam is an open source, unified model for defining and executing data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.
The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and Millwheel. This model was originally known as the "Dataflow Model" and first implemented as Google Cloud Dataflow, including a Java SDK on GitHub for writing pipelines and a fully managed service for executing them on Google Cloud Platform.
In January 2016, Google and a number of partners submitted the Dataflow Programming Model and SDKs portion as an Apache Incubator Proposal, under the name Apache Beam (unified Batch + strEAM processing).
Links:
1. Apache Beam Proposal
2. DataFlow Beam and Spark Comparison
Apache HBase
Apache Cassandra
Hypertable
NoSQL Databases
Column Data Model
Apache HBase is a Google BigTable-inspired, non-relational distributed database. It offers random, real-time read/write operations on column-oriented very large tables (BDDB: Big Data Data Base). It is the backing system for MR job outputs. It is the Hadoop database. It is for backing Hadoop MapReduce jobs with Apache HBase tables.
Links:
1. Apache HBase Home
2. Mirror of HBase on Github
Apache Cassandra is a distributed non-SQL DBMS; it is a BDDB. MR can retrieve data from Cassandra. This BDDB can run without HDFS, or on top of HDFS (DataStax fork of Cassandra). HBase and its required supporting systems are derived from what is known of the original Google BigTable and Google File System designs (as known from the Google File System paper Google published in 2003, and the BigTable paper published in 2006). Cassandra, on the other hand, is a recent open source fork of a standalone database system initially coded by Facebook, which, while implementing the BigTable data model, uses a system inspired by Amazon's Dynamo for storing data (in fact much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon).
Links:
1. Apache HBase Home
2. Cassandra on GitHub
3. Training Resources
4. Cassandra Paper
Hypertable is a database system inspired by publications on the design of Google's BigTable. The project is based on the experience of engineers who were solving large-scale data-intensive tasks for many years. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++. Sponsored by Baidu, the Chinese search engine.
Links:
TODO
Apache Accumulo
Apache Kudu
MongoDB
RethinkDB
ArangoDB
EventStore
Apache Accumulo is a distributed key/value store: a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Accumulo is software created by the NSA with security features.
Links:
1. Apache Accumulo Home
Apache Kudu is a distributed, columnar, relational data store optimized for analytical use cases requiring very fast reads with competitive write speeds.
- Relational data model (tables) with strongly-typed columns and a fast, online alter table operation.
- Scale-out and sharded with support for partitioning based on key ranges and/or hashing.
- Fault-tolerant and consistent due to its implementation of Raft consensus.
- Supported by Apache Impala and Apache Drill, enabling fast SQL reads and writes through those systems.
- Integrates with MapReduce and Spark.
- Additionally provides "NoSQL" APIs in Java, Python, and C++.
Links:
1. Apache Kudu Home
2. Kudu on Github
3. Kudu technical whitepaper (pdf)
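The two sharding schemes mentioned for Kudu, key ranges and hashing, differ in how a row key is mapped to a partition. The sketch below illustrates the general idea only: the function names are invented, CRC32 is an arbitrary stand-in hash, and none of this is Kudu's API or its actual hash function.

```python
import bisect
import zlib


def hash_partition(key, num_buckets):
    """Hash partitioning: spread keys evenly, ignoring their order."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(key, split_points):
    """Range partitioning: sorted split points define contiguous key ranges.

    With split points ["g", "p"], partition 0 holds keys below "g",
    partition 1 holds keys from "g" up to (not including) "p",
    and partition 2 holds keys from "p" upward.
    """
    return bisect.bisect_right(split_points, key)


bucket = hash_partition("user-42", 4)
ranges = [range_partition(k, ["g", "p"]) for k in ("apple", "kiwi", "zebra")]
print(bucket, ranges)
```

Range partitioning keeps neighbouring keys together, which suits scans over a key interval, while hash partitioning avoids hot spots when keys arrive in sorted order; combining the two is a common way to get some of both.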
Document Data Model
MongoDB is a document-oriented database system. It is part of the NoSQL family of database systems. Instead of storing data in tables as is done in a "classical" relational database, MongoDB stores structured data as JSON-like documents.
Links:
1. Mongodb site
RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to set up and learn.
Links:
1. RethinkDB site
ArangoDB is an open-source database with a flexible data model for documents, graphs, and key-values. Build high-performance applications using a convenient SQL-like query language or JavaScript extensions.
Links:
1. ArangoDB site
Stream Data Model
EventStore is an open-source, functional database with support for Complex Event Processing. It provides a persistence engine for applications using event sourcing, or for storing time-series data. Event Store is written in C# and C++ for the server, which runs on Mono or the .NET CLR, on Linux or Windows. Applications using Event Store can be written in JavaScript. Event sourcing (ES) is a way of persisting your application's state by storing the history that determines the current state of your application.
Links:
1. EventStore site
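Event sourcing, as defined above, means the stored history is the source of truth and the current state is derived by replaying it. A minimal sketch follows, assuming a made-up account example with two event types; it uses plain Python and no Event Store API.

```python
from functools import reduce


def apply_event(balance, event):
    """Fold a single event into the running state (an account balance)."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def current_state(events, initial=0):
    # The state is never stored directly; it is rebuilt from the history.
    return reduce(apply_event, events, initial)


history = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
balance = current_state(history)
balance_after_two = current_state(history[:2])
print(balance, balance_after_two)
```

Replaying a prefix of the history, as in the last call, gives the state at an earlier point in time, which is one reason the approach is attractive for auditing.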
Redis DataBase
Linkedin Voldemort
RocksDB
OpenTSDB
ArangoDB
Neo4j
TitanDB
TokuDB
HandlerSocket
Key-value Data Model
Redis is an open-source, networked, in-memory data structures store with optional durability. It is written in ANSI C. In its outer layer, the Redis data model is a dictionary which maps keys to values. One of the main differences between Redis and other structured storage systems is that Redis supports not only strings, but also abstract data types. Sponsored by Redis Labs. It's BSD licensed.
Links:
1. Redis site
2. Redis Labs site
Voldemort is a distributed data store that is designed as a key-value store used by LinkedIn for high-scalability storage.
Links:
1. Voldemort site
RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database, but the project's current focus is on embedded workloads.
Links:
1. RocksDB site
OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.
Links:
1. OpenTSDB site
Graph Data Model
ArangoDB is an open-source database with a flexible data model for documents, graphs, and key-values. Build high-performance applications using a convenient SQL-like query language or JavaScript extensions.
Links:
1. ArangoDB site
Neo4j is an open-source graph database written entirely in Java. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables.
Links:
1. Neo4j site
TitanDB is a highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users.
Links:
1. Titan site
NewSQL Databases
TokuDB is a storage engine for MySQL and MariaDB that is specifically designed for high performance on write-intensive workloads. It achieves this via Fractal Tree indexing. TokuDB is a scalable, ACID and MVCC compliant storage engine. TokuDB is one of the technologies that enable Big Data in MySQL.
Links:
TODO
HandlerSocket is a NoSQL plugin for MySQL/MariaDB (the storage engine of MySQL). It works as a daemon inside the mysqld process, accepting TCP connections, and executing requests from clients. HandlerSocket does not support SQL queries. Instead, it supports simple CRUD operations on tables. HandlerSocket can be much faster than mysqld/libmysql in some cases because it has lower CPU, disk, and network overhead.
Links:
TODO
Akiban Server
Drizzle
Haeinsa
SenseiDB
Sky
BayesDB
InfluxDB
Akiban Server is an open-source database that brings document stores and relational databases together. Developers get powerful document access alongside surprisingly powerful SQL.
Links:
TODO
Drizzle is a re-designed version of the MySQL v6.0 codebase and is designed around a central concept of having a microkernel architecture. Features such as the query cache and authentication system are now plugins to the database, which follow the general theme of "pluggable storage engines" that were introduced in MySQL 5.1. It supports PAM, LDAP, and HTTP AUTH for authentication via plugins it ships. Via its plugin system it currently supports logging to files, syslog, and remote services such as RabbitMQ and Gearman. Drizzle is an ACID-compliant relational database that supports transactions via an MVCC design.
Links:
TODO
Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase. Use Haeinsa if you need strong ACID semantics on your HBase cluster. It is based on the Google Percolator concept.
Links:
TODO
SenseiDB is an open-source, distributed, realtime, semi-structured database. Some features: full-text search, fast realtime updates, structured and faceted search, BQL (a SQL-like query language), fast key-value lookup, high performance under concurrent heavy update and query volumes, and Hadoop integration.
Links:
1. SenseiDB site
Sky is an open source database used for flexible, high-performance analysis of behavioral data. For certain kinds of data such as clickstream data and log data, it can be several orders of magnitude faster than traditional approaches such as SQL databases or Hadoop.
Links:
1. SkyDB site
BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.
Links:
1. BayesDB site
InfluxDB is an open source distributed time series database with no external dependencies. It's useful for recording metrics, events, and performing analytics. It has a built-in HTTP API so you don't have to write any server-side code to get up and running. InfluxDB is designed to be scalable, simple to install and manage, and fast to get data in and out. It aims to answer queries in real time. That means every data point is indexed as it comes in and is immediately available in queries that should return under 100ms.
Links:
1. InfluxDB site
Apache Hive
Apache HCatalog
Apache Trafodion
Apache HAWQ
Apache Drill
Cloudera Impala
Facebook Presto
Datasalt Splout SQL
SQL on Hadoop
Hive is a data warehouse infrastructure developed by Facebook. It offers data summarization, query, and analysis, and provides HiveQL, a SQL-like language (not SQL-92 compliant).
1. Apache HIVE site
2. Apache HIVE GitHub Project
HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. Right now HCatalog is part of Hive. Only old versions are available as separate downloads.
TODO
Apache Trafodion is a webscale SQL-on-Hadoop solution enabling enterprise-class transactional and operational workloads on HBase. Trafodion is a native MPP ANSI SQL database engine that builds on the scalability, elasticity and flexibility of HDFS and HBase, extending these to provide guaranteed transactional integrity for all workloads including multi-column, multi-row, multi-table, and multi-server updates.
1. Apache Trafodion website
2. Apache Trafodion wiki
3. Apache Trafodion GitHub Project
Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database, evolved from Greenplum Database, with the scalability and convenience of Hadoop.
1. Apache HAWQ site
2. HAWQ GitHub Project
Drill is the open-source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery. In recent years open-source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need.
1. Apache Incubator Drill
The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. It is a clone of Google Dremel (the system behind Google BigQuery).
1. Cloudera Impala site
2. Impala GitHub Project
Facebook has open sourced Presto, a SQL engine it says is on average 10 times faster than Hive for running queries across large datasets stored in Hadoop and elsewhere.
1. Presto site
Splout allows serving an arbitrarily big dataset with high QPS rates and at the same time provides full SQL query syntax.
TODO
Apache Tajo
Apache Phoenix
Apache MRQL
Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large datasets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities. For reference, the Apache Software Foundation announced Tajo as a Top-Level Project in April 2014.
1. Apache Tajo site
Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low-latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
1. Apache Phoenix site
MRQL (pronounced miracle) is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in four modes:
in MapReduce mode using Apache Hadoop,
in BSP mode (Bulk Synchronous Parallel mode) using Apache Hama,
in Spark mode using Apache Spark, and
in Flink mode using Apache Flink.
1. Apache Incubator MRQL site
Kylin
Apache Flume
Kylin is an open-source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
1. Kylin project site
Data Ingestion
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
1. Apache Flume project site
Apache Sqoop
Facebook Scribe
Apache Chukwa
Apache Kafka
Netflix Suro
Apache Samza
Cloudera Morphline
HIHO
Apache NiFi
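Sqoop is a system for bulk data transfer between HDFS and structured datastores such as RDBMS. It plays a role similar to Flume, but for moving data between HDFS and an RDBMS.
Scribe is a real-time log aggregator. It is an Apache Thrift service.
Chukwa is a large-scale log aggregator and analytics system.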
Kafka is a distributed publish-subscribe system for processing large amounts of streaming data. Kafka is a message queue developed by LinkedIn that persists messages to disk in a very performant manner. Because messages are persisted, clients have the interesting ability to rewind a stream and consume the messages again. Another upside of the disk persistence is that bulk importing the data into HDFS for offline analysis can be done very quickly and efficiently. Storm, developed by BackType (which was acquired by Twitter in 2011), is more about transforming a stream of messages into new streams.
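The rewind behaviour described for Kafka can be illustrated with a small sketch. This is a toy in-memory model, not Kafka's client API: the names `MessageLog`, `Consumer`, `poll`, and `seek` here are hypothetical, and a Python list stands in for the on-disk log. The point it shows is that the log keeps every message and each consumer only tracks its own read offset, so re-reading is a matter of moving that offset back.

```python
class MessageLog:
    """Toy append-only log: messages are kept, never removed on read."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Hypothetical consumer that owns its read position (offset)."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        """Fetch the next batch and advance the offset past it."""
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Rewind (or skip ahead) by resetting the read position."""
        self.offset = offset
```

A consumer that has read everything can call `seek(0)` and `poll()` again to replay the stream from the start, which a queue that deletes messages on delivery could not offer.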
Suro has its roots in Apache Chukwa, which was initially adopted by Netflix. It is a log aggregator like Storm and Samza.
Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. Developed at LinkedIn (http://www.linkedin.com/in/jaykreps).
Cloudera Morphlines is a new open-source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
This project is a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable. HIHO connects Hadoop with multiple RDBMS and file systems, so that data can be loaded to Hadoop and unloaded from Hadoop.
Apache NiFi is a dataflow system that is currently under incubation at the Apache Software Foundation. NiFi is based on the concepts of flow-based programming and is highly configurable. NiFi uses a component-based extension model to rapidly add capabilities to complex dataflows. Out of the box NiFi has several extensions for dealing with file-based dataflows such as FTP, SFTP, and HTTP integration, as well as integration with HDFS. One of NiFi's unique features is a rich, web-based interface for designing, controlling, and monitoring a dataflow.
Apache Sqoop links: 1. Apache Sqoop project site
Facebook Scribe links: TODO
Apache Chukwa links: TODO
Apache Kafka links: 1. Apache Kafka 2. GitHub source code
Netflix Suro links: TODO
Apache Samza links: TODO
Cloudera Morphline links: TODO
HIHO links: TODO
Apache NiFi links: 1. Apache NiFi
Apache ManifoldCF
Apache Thrift
Apache Zookeeper
Apache Avro
Apache Curator
Apache Karaf
Apache ManifoldCF provides a framework for connecting source content repositories like file systems, DB, CMIS, SharePoint, FileNet... to target repositories or indexes, such as Apache Solr or ElasticSearch. It's a kind of crawler for multi-content repositories, supporting a lot of sources and multi-format conversion for indexing by means of the Apache Tika Content Extractor transformation filter.
1. Apache ManifoldCF
Service Programming
Thrift is a cross-language RPC framework for service creation. It is the service base for Facebook technologies (Facebook is the original Thrift contributor). Thrift provides a framework for developing and accessing remote services. It allows developers to create services that can be consumed by any application that is written in a language that there are Thrift bindings for. Thrift manages serialization of data to and from a service, as well as the protocol that describes a method invocation, response, etc. Instead of writing all the RPC code, you can just get straight to your service logic. Thrift uses TCP, and so a given service is bound to a particular port.
1. Apache Thrift
ZooKeeper is a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects already use ZooKeeper to coordinate the cluster and provide highly available distributed services; perhaps the most famous of those are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems; it simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service which gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
1. Apache Zookeeper
2. Google Chubby paper
Apache Avro is a framework for modeling, serializing and making Remote Procedure Calls (RPC). Avro data is described by a schema, and one interesting feature is that the schema is stored in the same file as the data it describes, so files are self-describing. Avro does not require code generation. This framework can compete with other similar tools like Apache Thrift, Google Protocol Buffers, ZeroC ICE, and so on.
1. Apache Avro
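The idea of a self-describing file can be sketched with the standard library alone. The snippet below is only an illustration of "schema travels with the data"; it is not Avro's binary container format and does not use the Avro library. The function names and the JSON-lines layout are assumptions made for the example.

```python
import io
import json


def write_container(stream, schema, records):
    """Write the schema as the first line, then one JSON row per record."""
    stream.write(json.dumps(schema) + "\n")
    for record in records:
        # Rows hold values only; field names live once, in the schema header.
        row = [record[field["name"]] for field in schema["fields"]]
        stream.write(json.dumps(row) + "\n")


def read_container(stream):
    """Recover the schema from the file itself, then rebuild the records."""
    schema = json.loads(stream.readline())
    names = [field["name"] for field in schema["fields"]]
    records = [dict(zip(names, json.loads(line))) for line in stream if line.strip()]
    return schema, records


# Hypothetical schema and data, written to an in-memory "file".
user_schema = {
    "name": "User",
    "fields": [{"name": "id", "type": "int"}, {"name": "name", "type": "string"}],
}
buffer = io.StringIO()
write_container(buffer, user_schema, [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}])
```

A reader needs no generated classes and no out-of-band schema: everything required to interpret the rows is in the first line of the file.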
Curator is a set of Java libraries that make using Apache ZooKeeper much easier.
TODO
Apache Karaf is an OSGi runtime that runs on top of any OSGi framework and provides you a set of services, a powerful provisioning concept, an extensible shell and more.
TODO
Twitter Elephant Bird
LinkedIn Norbert
Apache Oozie
LinkedIn Azkaban
Apache Falcon
Schedoscope
Elephant Bird is a project that provides utilities (libraries) for working with LZOP-compressed data. It also provides a container format that supports working with Protocol Buffers and Thrift in MapReduce, Writables, Pig LoadFuncs, Hive SerDe, and HBase miscellanea. This open-source library is used extensively at Twitter.
Norbert is a library that provides easy cluster management and workload distribution. With Norbert, you can quickly distribute a simple client/server architecture to create a highly scalable architecture capable of handling heavy traffic. Implemented in Scala, Norbert wraps ZooKeeper and Netty and uses Protocol Buffers for transport to make it easy to build a cluster-aware application. A Java API is provided, and pluggable load balancing strategies are supported, with round-robin and consistent-hash strategies provided out of the box.
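The two load balancing strategies named above can be sketched generically. This is not Norbert's API (Norbert is a Scala library); the class names, the `pick` method, and the choice of SHA-256 with 100 virtual points per node are all assumptions made for illustration.

```python
import bisect
import hashlib
import itertools


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotation, ignoring the request key."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(list(nodes))

    def pick(self, key=None):
        return next(self._cycle)


class ConsistentHashBalancer:
    """Maps each key to the first node point at or after its hash on a ring."""

    def __init__(self, nodes, replicas=100):
        # Each node gets several virtual points to spread load more evenly.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def pick(self, key):
        # Wrap around to the first point when the hash is past the last one.
        index = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[index][1]
```

Round robin spreads requests evenly but sends the same key to different nodes over time. Consistent hashing keeps a key on the same node, and when a node leaves, only the keys that lived on that node are reassigned.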
Scheduling
Oozie is a workflow scheduler system for MapReduce jobs using DAGs (Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs by time (frequency) and data availability.
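Scheduling a workflow expressed as a DAG comes down to running each job only after the jobs it depends on. The sketch below shows that ordering step with Python's standard `graphlib` module (Python 3.9 or later). It is not Oozie's workflow definition format, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return job names ordered so that every job follows its prerequisites.

    `dependencies` maps each job to the set of jobs that must finish first.
    A cyclic definition is rejected by graphlib with a CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical four-step workflow.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}
```

Calling `execution_order(workflow)` yields an order in which `ingest` comes first and `report` last; a real scheduler would additionally wait on time or data-availability triggers before starting the first job.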
Azkaban is a Hadoop workflow management tool: a batch job scheduler that can be seen as a combination of the cron and make Unix utilities with a friendly UI.
Apache Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows. Instead of hard-coding complex data lifecycle capabilities, Hadoop applications can now rely on the well-tested Apache Falcon framework for these functions. Falcon's simplification of data management is quite useful to anyone building apps on Hadoop. Data management on Hadoop encompasses data motion, process orchestration, lifecycle management, and data discovery, among other concerns that are beyond ETL. Falcon is a new data processing and management platform for Hadoop that solves this problem and creates additional opportunities by building on existing components within the Hadoop ecosystem (e.g. Apache Oozie, Apache Hadoop DistCp) without reinventing the wheel.
Schedoscope is a new open-source project providing a scheduling framework for pain-free agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data warehouse these days. Datasets (including dependencies) are defined using a Scala DSL, which can embed MapReduce jobs, Pig scripts, Hive queries or Oozie workflows to build the dataset. The tool includes a test framework to verify logic and a command-line utility to load and reload data.
Twitter Elephant Bird links: 1. Elephant Bird GitHub
LinkedIn Norbert links: 1. LinkedIn Project 2. GitHub source code
Apache Oozie links: 1. Apache Oozie 2. GitHub source code
LinkedIn Azkaban links: TODO
Apache Falcon links: TODO
Schedoscope links: GitHub source code
Apache Mahout
WEKA
Cloudera Oryx
Deeplearning4j
MADlib
Machine Learning
Mahout is a machine learning library and math library, on top of MapReduce.
TODO
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License.
TODO
The Oryx open-source project provides simple, real-time, large-scale machine learning / predictive analytics infrastructure. It implements a few classes of algorithm commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering.
1. Oryx at GitHub
2. Cloudera forum for Machine Learning
The Deeplearning4j open-source project is the most widely used deep learning framework for the JVM. DL4J includes deep neural nets such as recurrent neural networks, Long Short-Term Memory networks (LSTMs), convolutional neural networks, various autoencoders, and feedforward neural networks such as restricted Boltzmann machines and deep-belief networks. It also has natural language processing algorithms such as word2vec, doc2vec, GloVe and TF-IDF. All Deeplearning4j networks run distributed on multiple CPUs and GPUs. They work as Hadoop jobs, and integrate with Spark on the slave level for host-thread orchestration. Deeplearning4j's neural networks are applied to use cases such as fraud and anomaly detection, recommender systems, and predictive maintenance.
1. Deeplearning4j Website
2. Gitter Community for Deeplearning4j
The MADlib project leverages the data-processing capabilities of an RDBMS to analyze data. The aim of this project is the integration of statistical data analysis into databases. The MADlib project is self-described as the Big Data Machine Learning in SQL for Data Scientists. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (now Pivotal).
1. MADlib Community
H2O
H2O is a statistical, machine learning and math runtime tool for big data analysis. Developed by the predictive analytics company H2O.ai, H2O has established a leadership position in the ML scene together with R and Databricks' Spark. According to the team, H2O is the world's fastest in-memory platform for machine learning and predictive analytics on big data. It is designed to help users scale machine learning, math, and statistics over large datasets.
In addition to H2O's point-and-click Web UI, its REST API allows easy integration into various clients. This means explorative analysis of data can be done in a typical fashion in R, Python, and Scala, and entire workflows can be written up as automated scripts.
1. H2O at GitHub
2. H2O Blog
Sparkling Water
Sparkling Water combines two open-source technologies: Apache Spark and H2O, a machine learning engine. It makes H2O's library of advanced algorithms, including Deep Learning, GLM, GBM, KMeans, PCA, and Random Forest, accessible from Spark workflows. Spark users are provided with the options to select the best features from either platform to meet their machine learning needs. Users can combine Spark's RDD API and Spark MLLib with H2O's machine learning algorithms, or use H2O independent of Spark in the model building process and post-process the results in Spark.
Sparkling Water provides a transparent integration of H2O's framework and data structures into Spark's RDD-based environment by sharing the same execution space as well as providing an RDD-like API for H2O data structures.
1. Sparkling Water at GitHub
2. Sparkling Water Examples
Apache SystemML
Apache SystemML was open sourced by IBM and is closely related to Apache Spark. Think of Apache Spark as the analytics operating system for any application that taps into huge volumes of streaming data. MLLib, the machine learning library for Spark, provides developers with a rich set of machine learning algorithms. SystemML enables developers to translate those algorithms so they can easily digest different kinds of data and run on different kinds of computers.
SystemML allows a developer to write a single machine learning algorithm and automatically scale it up using Spark or Hadoop.
SystemML scales for big data analytics with high-performance optimizer technology, and empowers users to write customized machine learning algorithms using a simple domain-specific language (DSL) without learning complicated distributed programming. It is an extensible complement framework of Spark MLlib.
1. Apache SystemML
2. Apache Proposal
Benchmarking and QA Tools
Apache Hadoop Benchmarking
There are two main JAR files in Apache Hadoop for benchmarking. These JARs contain micro-benchmarks for testing particular parts of the infrastructure; for instance, TestDFSIO analyzes the disk system, TeraSort evaluates MapReduce tasks, WordCount measures cluster performance, etc. Micro-benchmarks are packaged in the tests and examples JAR files, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments. For the Apache Hadoop 2.2.0 stable version, the following JAR files are available for tests, examples and benchmarking. The Hadoop micro-benchmarks are bundled in these JAR files: hadoop-mapreduce-examples-2.2.0.jar and hadoop-mapreduce-client-jobclient-2.2.0-tests.jar.
Yahoo Gridmix3
Hadoop cluster benchmarking from the Yahoo engineering team.
PUMA Benchmarking
Benchmark suite which represents a broad range of MapReduce applications exhibiting application characteristics with high/low computation and high/low shuffle volumes. There are a total of 13 benchmarks, of which TeraSort, WordCount, and Grep are from the Hadoop distribution. The rest of the benchmarks were developed in-house and are currently not part of the Hadoop distribution. The three benchmarks from the Hadoop distribution are also slightly modified to take the number of reduce tasks as input from the user and to generate final time-completion statistics of jobs.
Berkeley SWIM Benchmark
The SWIM benchmark (Statistical Workload Injector for MapReduce) is a benchmark representing a real-world big data workload, developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems, based on real industry workloads.
Intel HiBench
HiBench is a Hadoop benchmark suite.
Apache Hadoop Benchmarking links: 1. MAPREDUCE-3561, umbrella ticket to track all the issues related to performance
Yahoo Gridmix3 links: TODO
PUMA Benchmarking links: 1. MAPREDUCE-5116 2. Faraz Ahmad researcher 3. PUMA Docs
Berkeley SWIM Benchmark links: 1. GitHub SWIM
Intel HiBench links: TODO
Apache Yetus
To help maintain consistency over a large and disconnected set of committers, automated patch testing was added to Hadoop's development process. This automated patch testing (now included as part of Apache Yetus) works as follows: when a patch is uploaded to the bug tracking system, an automated process downloads the patch, performs some static analysis, and runs the unit tests. These results are posted back to the bug tracker, and alerts notify interested parties about the state of the patch.
However, the Apache Yetus project addresses much more than traditional patch testing; it is a better approach that includes a massive rewrite of the patch testing facility used in Hadoop.
1. Altiscale Blog Entry
2. Apache Yetus Proposal
3. Apache Yetus Project site
Apache Sentry
Apache Knox Gateway
Apache Ranger
Metascope
Apache Ambari
Security
Sentry is the next step in enterprise-grade big data security and delivers fine-grained authorization to data stored in Apache Hadoop. An independent security module that integrates with the open-source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise datasets. Sentry was a Cloudera development.
TODO
Knox Gateway is a system that provides a single point of secure access for Apache Hadoop clusters. The goal is to simplify Hadoop security for both users (i.e. who access the cluster data and execute jobs) and operators (i.e. who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serves one or more Hadoop clusters.
1. Apache Knox
2. Apache Knox Gateway Hortonworks web
Apache Ranger (formerly called Apache Argus or HDP Advanced Security) delivers a comprehensive approach to central security policy administration across the core enterprise security requirements of authentication, authorization, accounting and data protection. It extends baseline features for coordinated enforcement across Hadoop workloads from batch, interactive SQL and real-time, and leverages the extensible architecture to apply policies consistently against additional Hadoop ecosystem components (beyond HDFS, Hive, and HBase) including Storm, Solr, Spark, and more.
1. Apache Ranger
2. Apache Ranger Hortonworks web
Metadata Management
Metascope is a metadata management and data discovery tool which serves as an add-on to Schedoscope. Metascope is able to collect technical, operational and business metadata from your Hadoop Datahub and makes it easy to search and navigate via a portal.
GitHub source code
System Deployment
Ambari is an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Apache Ambari was donated by the Hortonworks team to the ASF. It's a powerful and nice interface for Hadoop and other typical applications from the Hadoop ecosystem. Apache Ambari is under heavy development, and it will incorporate new features in the near future. For example, Ambari is able to deploy a complete Hadoop system from scratch; however, it is not possible to use this GUI on a Hadoop system that is already running. The ability to provision the operating system could be a good addition; however, it is probably not on the roadmap.
1. Apache Ambari
Cloudera HUE
Apache Mesos
Myriad
Marathon
Brooklyn
Hortonworks HOYA
Apache Helix
HUE is a web application for interacting with Apache Hadoop. It's not a deployment tool; it is an open-source Web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license. HUE is used for Hadoop and its ecosystem user operations. For example, HUE offers editors for Hive, Impala, Oozie, and Pig, notebooks for Spark, Solr Search dashboards, and HDFS, YARN, and HBase browsers.
1. HUE home page
Mesos is a cluster manager that provides resource sharing and isolation across cluster applications, as HTCondor, SGE or Torque can also do. However, Mesos has a Hadoop-centred design.
TODO
Myriad is a Mesos framework designed for scaling YARN clusters on Mesos. Myriad can expand or shrink one or more YARN clusters in response to events as per configured rules and policies.
1. Myriad GitHub
Marathon is a Mesos framework for long-running services. Given that you have Mesos running as the kernel for your datacenter, Marathon is the init or upstart daemon.
TODO
Brooklyn is a library that simplifies application deployment and management. For deployment, it is designed to tie in with other tools, giving single-click deploy and adding the concepts of manageable clusters and fabrics. Many common software entities are available out of the box. It integrates with Apache Whirr, and thereby Chef and Puppet, to deploy well-known services such as Hadoop and elasticsearch (or use POBS, plain-old-bash-scripts). Use PaaS's such as OpenShift, alongside self-built clusters, for maximum flexibility.
TODO
HOYA is defined as running HBase On YARN. The Hoya tool is a Java tool, and is currently CLI driven. It takes in a cluster specification in terms of the number of region servers, the location of HBASE_HOME, the ZooKeeper quorum hosts, the configuration that the new HBase cluster instance should use, and so on. So HOYA is for HBase deployment using a tool developed on top of YARN. Once the cluster has been started, the cluster can be made to grow or shrink using the Hoya commands. The cluster can also be stopped and later resumed. Hoya implements the functionality through YARN APIs and HBase's shell scripts. The goal of the prototype was to have minimal code changes and, as of this writing, it has required zero code changes in HBase.
1. Hortonworks Blog
Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Originally developed by LinkedIn, it is now an incubator project at Apache. Helix is developed on top of ZooKeeper for coordination tasks.
1. Apache Helix
Apache Bigtop
Buildoop
Deploop
SequenceIQ Cloudbreak
Bigtop was originally developed and released as an open-source packaging infrastructure by Cloudera. Bigtop is used by some vendors to build their own distributions based on Apache Hadoop (CDH, Pivotal HD, Intel's distribution); however, Apache Bigtop does many more tasks, like continuous integration testing (with Jenkins, Maven, ...), and is useful for packaging (RPM and DEB), deployment with Puppet, and so on. Bigtop also features Vagrant recipes for spinning up "n-node" Hadoop clusters, and the bigpetstore blueprint application, which demonstrates construction of a full-stack Hadoop app with ETL, machine learning, and dataset generation. Apache Bigtop could be considered a community effort with one main focus: packaging all the bits of the Hadoop ecosystem as a whole, rather than as individual projects.
1. Apache Bigtop.
Buildoop is an open-source project licensed under Apache License 2.0, based on the Apache Bigtop idea. Buildoop is a collaboration project that provides templates and tools to help you create custom Linux-based systems based on the Hadoop ecosystem. The project is built from scratch using the Groovy language, and is not based on a mixture of tools as Bigtop is (Makefile, Gradle, Groovy, Maven); it is probably easier to program than Bigtop, and the design is focused on the basic ideas behind the buildroot Yocto Project. The project is in early stages of development right now.
1. Hadoop Ecosystem Builder.
Deploop is a tool for provisioning, managing and monitoring Apache Hadoop clusters, focused on the Lambda Architecture. LA is a generic design based on the concepts of Twitter engineer Nathan Marz. This generic architecture was designed to address common requirements for big data. The Deploop system is in ongoing development, in the alpha phase of maturity. The system is set up on top of highly scalable technologies like Puppet and MCollective.
1. The Hadoop Deploy System.
Cloudbreak is an effective way to start and run multiple instances and versions of Hadoop clusters in the cloud, Docker containers or bare metal. It is a cloud- and infrastructure-agnostic and cost-effective Hadoop-as-a-Service platform API. It provides automatic scaling, secure multi-tenancy and full cloud lifecycle management.
Cloudbreak leverages the cloud infrastructure platforms to create host instances, uses Docker technology to deploy the requisite containers cloud-agnostically, and uses Apache Ambari (via Ambari Blueprints) to install and manage a Hortonworks cluster. This is a tool within the HDP ecosystem.
1. GitHub project.
2. Cloudbreak introduction.
3. Cloudbreak in Hortonworks.
Applications
Apache Nutch
Sphinx Search Server
Apache OODT
HIPI Library
PivotalR
Jumbune
Nutch is a highly extensible and scalable open-source web crawler software project, and a search engine based on Lucene. A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Web crawlers can copy all the pages they visit for later processing by a search engine that indexes the downloaded pages so that users can search them much more quickly.
TODO
Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily, or index and search data on the fly, working with Sphinx pretty much as with a database server.
TODO
OODT was originally developed at NASA Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA's scientific archives.
TODO
HIPI is a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment.
TODO
PivotalR is a package that enables users of R, the most popular open-source statistical programming language and environment, to interact with the Pivotal (Greenplum) Database as well as Pivotal HD/HAWQ and the open-source database PostgreSQL for Big Data analytics. R is a programming language and data analysis software: you do data analysis in R by writing scripts and functions in the R programming language. R is a complete, interactive, object-oriented language: designed by statisticians, for statisticians. The language provides objects, operators and functions that make the process of exploring, modeling, and visualizing data a natural one.
1. PivotalR on GitHub
Development Frameworks
Jumbune is an open-source product that sits on top of any Hadoop distribution and assists in development and administration of MapReduce solutions. The objective of the product is to assist analytical solution providers to port fault-free applications on production Hadoop environments.
Jumbune supports all active major branches of Apache Hadoop, namely 1.x, 2.x, 0.23.x, and the commercial MapR, HDP 2.x and CDH 5.x distributions of Hadoop. It has the ability to work well with both Yarn and non-Yarn versions of Hadoop.
It has four major modules: MapReduce Debugger, HDFS Data Validator, On-demand cluster monitor and MapReduce job profiler. Jumbune can be deployed on any remote user machine and uses a lightweight agent on the NameNode of the cluster to relay relevant information to and fro.
1. Jumbune
2. Jumbune GitHub Project
3. Jumbune JIRA page
Spring XD
Spring XD (Xtreme Data) is an evolution of the Spring Java application development framework, by Pivotal, to help Big Data applications. SpringSource was the company created by the founders of the Spring Framework. SpringSource was purchased by VMware, where it was maintained for some time as a separate division within VMware. Later VMware, and its parent company EMC Corporation, formally created a joint venture called Pivotal. Spring XD is more than a development framework library; it is a distributed and extensible system for data ingestion, real-time analytics, batch processing, and data export. It could be considered an alternative to Apache Flume/Sqoop/Oozie in some scenarios. Spring XD is part of Pivotal Spring for Apache Hadoop (SHDP). SHDP, integrated with Spring, Spring Batch and Spring Data, is part of the Spring IO Platform as foundational libraries. Building on top of, and extending, this foundation, the Spring IO platform provides Spring XD as a big data runtime. Spring for Apache Hadoop (SHDP) aims to help simplify the development of Hadoop-based applications by providing a consistent configuration and API across a wide range of Hadoop ecosystem projects such as Pig, Hive, and Cascading, in addition to providing extensions to Spring Batch for orchestrating Hadoop-based workflows.
1. Spring XD on GitHub
Cask Data Application Platform
Cask Data Application Platform is an open-source application development platform for the Hadoop ecosystem that provides developers with data and application virtualization to accelerate application development, address a range of real-time and batch use cases, and deploy applications into production. The deployment is made by Cask Coopr, an open-source template-based cluster management solution that provisions, manages, and scales clusters for multi-tiered application stacks on public and private clouds. Another component is Tigon, a distributed framework built on Apache Hadoop and Apache HBase for real-time, high-throughput, low-latency data processing and analytics applications.
1. Cask Site
Categorize Pending...
Twitter Summingbird
A system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird.
TODO
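The hybrid idea can be sketched as two views that are merged at read time: a batch view computed over historical events and a streaming view updated per event. The snippet below is a toy illustration using event counts, not Summingbird's API (Summingbird is a Scala library); all function and class names are assumptions. It relies on counts being combinable by simple addition, which is what lets the two partial results be merged.

```python
from collections import Counter


def batch_counts(events):
    """Batch layer: count every event in a complete historical dataset."""
    return Counter(events)


class StreamingCounter:
    """Streaming layer: update counts one event at a time as they arrive."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event] += 1


def merged_view(batch_view, realtime_view):
    """Serve queries from the sum of the batch and real-time partial counts."""
    return batch_view + realtime_view
```

In this sketch the batch job would be rerun periodically over all stored data, while the streaming counter covers only the events that arrived since the last batch run, so that the merged view is both complete and fresh.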
Apache Kiji
Build real-time Big Data applications on Apache HBase.
TODO
S4 Yahoo
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
TODO
Metamarkets Druid
Real-time analytical data store.
TODO
Concurrent Cascading
Concurrent Lingual
Concurrent Pattern
Apache Giraph
Talend
Akka Toolkit
Eclipse BIRT
SpagoBI
Jedox Palo
Twitter Finagle
Intel GraphBuilder
Apache Tika
Cascading is an application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.
TODO
Lingual is an open-source project enabling fast and simple Big Data application development on Apache Hadoop. It delivers ANSI-standard SQL technology to easily build new applications and integrate existing ones onto Hadoop.
TODO
Pattern provides machine learning for Cascading on Apache Hadoop through an API and standards-based PMML.
TODO
Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google.
TODO
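Iterative, vertex-centric graph processing in the Pregel style proceeds in supersteps: each vertex reads the messages sent to it in the previous step, updates its own value, and sends messages to its neighbours, until no messages remain. The sketch below runs that loop on a single machine to label connected components. It is an illustration of the model only, not Giraph's Java API, and the function name is an assumption.

```python
def connected_components(edges):
    """Label each vertex with the smallest vertex id in its component."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Every vertex starts with its own id as its label.
    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its label to all of its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(labels[v])

    # Later supersteps: a vertex that learns of a smaller label adopts it
    # and forwards it; vertices with nothing new stay silent.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(labels[v])
        inbox = outbox
    return labels
```

The loop stops once a superstep produces no messages, which corresponds to every vertex having voted to halt. In a distributed system the vertices would be partitioned across workers and the inboxes exchanged over the network between supersteps.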
Talend is an open-source software vendor that provides data integration, data management, enterprise application integration and big data software and solutions.
TODO
Akka is an open-source toolkit and runtime simplifying the construction of concurrent applications on the Java platform.
TODO
BIRT is an open-source Eclipse-based reporting system that integrates with your Java/Java EE application to produce compelling reports.
TODO
SpagoBI is an Open Source Business Intelligence suite, belonging to the free/open-source SpagoWorld initiative, founded and supported by Engineering Group. It offers a large range of analytical functions, a highly functional semantic layer often absent in other open-source platforms and projects, and a respectable set of advanced data visualization features including geospatial analytics.
TODO
Palo Suite combines all core applications (OLAP Server, Palo Web, Palo ETL Server and Palo for Excel) into one comprehensive and customisable Business Intelligence platform. The platform is completely based on Open Source products, representing a high-end Business Intelligence solution which is available entirely free of any license fees.
TODO
Finagle is an asynchronous network stack for the JVM that you can use to build asynchronous Remote Procedure Call (RPC) clients and servers in Java, Scala, or any JVM-hosted language.
TODO
GraphBuilder is a library which provides tools to construct large-scale graphs on top of Apache Hadoop.
TODO
Tika is a toolkit that detects and extracts metadata and structured text content from various documents using existing parser libraries.
TODO
Apache Zeppelin
Zeppelin is a modern web-based tool for data scientists to collaborate over large-scale data exploration and visualization projects. It is a notebook-style interpreter that enables collaborative analysis sessions shared between users. Zeppelin is independent of the execution framework itself. The current version runs on top of Apache Spark, but it has pluggable interpreter APIs to support other data processing systems. More execution frameworks could be added at a later date, e.g. Apache Flink and Crunch, as well as SQL-like backends such as Hive, Tajo, and MRQL.
1. Apache Zeppelin site
Published with GitHub Pages by Javi Roman, and contributors