Sort-Based Shuffle in Spark

Motivation

A sort-based shuffle can be more scalable than Spark's current hash-based one because it
doesn't require writing a separate file for each reduce task from each mapper. Instead, we write
a single sorted file and serve ranges of it to different reducers. In jobs with a lot of reduce tasks
(say 10,000+), this saves significant memory for compression and serialization buffers and
results in more sequential disk I/O.

Implementation

To perform a sort-based shuffle, each map task will produce one or more output files sorted by a
key's partition ID, then merge-sort them to yield a single output file. Because it's only necessary
to group the keys together into partitions, we won't bother to also sort them within each partition.
Operators like sortByKey can do that in the reduce task, as they do in the hash-based shuffle.
Once the map tasks produce these files, reducers will be able to request ranges of these files to
get their particular data. For this purpose, we'll have an index file for each output file saying
where each partition is located, and we'll update the BlockManager to support using this index.
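
As a rough illustration of the ordering described above, here is a minimal Python sketch (Python is used only for brevity; Spark itself is written in Scala). The function names and the modulo-of-hash partitioner are assumptions made for the example, not part of the design. Records are ordered by partition ID alone, and a per-partition record range is derived from that order.

```python
def partition_id(key, num_partitions):
    # Hypothetical stand-in for a hash partitioner; Python's % is never negative here.
    return hash(key) % num_partitions


def sort_by_partition(records, num_partitions):
    """Order (key, value) records by partition ID only.

    The sort is stable, so keys inside one partition stay in arrival order
    rather than being sorted by key.
    """
    tagged = [(partition_id(k, num_partitions), k, v) for k, v in records]
    tagged.sort(key=lambda t: t[0])
    return tagged


def partition_ranges(tagged):
    """Map each non-empty partition ID to its (start, end) record range."""
    ranges = {}
    for i, (pid, _, _) in enumerate(tagged):
        start, _ = ranges.get(pid, (i, i))
        ranges[pid] = (start, i + 1)
    return ranges
```

A reducer for partition p would then only need the records in ranges[p], which is the role the index file plays for the on-disk output.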

NOTE: If the keys have an Ordering, it might be useful to sort the keys within each range to allow
combining across merged files. We may leave this out in the initial implementation and add it in
later. In the current codebase, the ShuffleMapTask already combines data using an
ExternalAppendOnlyMap, so the elements going into the shuffle phase will have unique keys.
However, we can move the merging/combining to happen as part of the sort later.
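
To make the idea in the note concrete, the following Python sketch shows how combining could ride along with a merge when each run is sorted by (partition ID, key). It is an illustration only: merge_and_combine and the combine callback are hypothetical names, and it assumes keys are comparable, which corresponds to having an Ordering.

```python
import heapq
from functools import reduce
from itertools import groupby
from operator import itemgetter


def merge_and_combine(runs, combine):
    """Merge runs of (partition_id, key, value) and combine equal keys.

    Each run must already be sorted by (partition_id, key). Because the merged
    stream is sorted the same way, records with the same key are adjacent and
    can be combined in a single pass.
    """
    by_pid_and_key = itemgetter(0, 1)
    merged = heapq.merge(*runs, key=by_pid_and_key)
    for (pid, key), group in groupby(merged, key=by_pid_and_key):
        yield pid, key, reduce(combine, (value for _, _, value in group))
```

For example, merging two runs that both contain key "a" in partition 0 with an addition combiner yields a single record for that key holding the sum.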

Shuffle Map Tasks

Map tasks will write data through a SortedFileWriter that creates one or more sorted files,
merges them, and then creates an index file for the merged file. Because we need to be able to
serve any partition of this file, this SortedFileWriter must reset compression and serialization
streams when writing each range. This makes it different from the ExternalAppendOnlyMap,
though some of the code will be similar. In addition, to avoid calling our Partitioner on each key
multiple times as it gets sorted, we'll make each intermediate file track the start and end location
of each partition it contains.
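
The reason for resetting the streams can be seen in a small Python sketch. Here pickle and zlib stand in for Spark's serializer and compression codec, and write_partitioned and read_partition are hypothetical helpers. Each partition is written as its own self-contained compressed stream, so any one of them can be decoded from its byte range without reading what came before it.

```python
import io
import pickle
import zlib


def write_partitioned(out, buckets):
    """Write each partition as an independent serialized + compressed stream.

    buckets maps partition ID -> list of (key, value) records.
    Returns (partition_id, byte_offset, byte_length) entries in partition order.
    """
    index = []
    for pid in sorted(buckets):
        start = out.tell()
        # A fresh pickle and a fresh zlib stream per partition: this is the "reset".
        out.write(zlib.compress(pickle.dumps(buckets[pid])))
        index.append((pid, start, out.tell() - start))
    return index


def read_partition(f, index, pid):
    """Decode a single partition using only its own byte range."""
    for entry_pid, offset, length in index:
        if entry_pid == pid:
            f.seek(offset)
            return pickle.loads(zlib.decompress(f.read(length)))
    return []  # partition had no records
```

If the whole file were written as one continuous compressed stream instead, a reader starting at a partition's offset would not have the state needed to decompress it.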

The SortedFileWriter will work as follows:
1. Given a stream of incoming key-value pairs, first write them into buckets in memory
   based on their partition ID. These buckets can just be ArrayBuffers for each partition ID if
   we assume input keys are unique (see above), or later on we can have a hash table to
   support combining.
2. When the total size of the buckets gets too large, write the current in-memory output to a
   new file. This intermediate file will contain a header saying at which position each partition
   ID begins. Unlike our final serving use case, these positions can be given by # of objects,
   and we don't need to reset compression and serialization streams.
3. After all the intermediate files are written, merge-sort them into a final file. We can have a
   maximum merge factor to avoid opening too many files at once, though in practice you
   probably want to merge at least 10-100 at a time.
4. When writing the final file, reset the serialization and compression streams after writing
   each partition and track the byte positions of each partition to create an index file.
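
The bucketing, spilling and merging steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the proposed implementation: SortedFileWriterSketch is a hypothetical class, the spill threshold is a record count rather than a memory estimate, pickle stands in for the serializer, the merge opens all spill files at once (no merge factor), and the final output is returned as a list instead of being written with per-partition stream resets.

```python
import heapq
import os
import pickle
from collections import defaultdict


def _read_run(path):
    """Yield (partition_id, record) from a spill file, using its header
    instead of re-running the partitioner on every key."""
    with open(path, "rb") as f:
        header = pickle.load(f)  # [(partition_id, first_object_position), ...]
        i, pos = 0, 0
        while True:
            try:
                record = pickle.load(f)
            except EOFError:
                return
            while i + 1 < len(header) and header[i + 1][1] <= pos:
                i += 1
            yield header[i][0], record
            pos += 1


class SortedFileWriterSketch:
    def __init__(self, num_partitions, max_records_in_memory, spill_dir):
        self.num_partitions = num_partitions
        self.max_records = max_records_in_memory
        self.spill_dir = spill_dir
        self.buckets = defaultdict(list)  # partition ID -> buffered records
        self.buffered = 0
        self.spills = []

    def insert(self, key, value):
        pid = hash(key) % self.num_partitions  # stand-in partitioner
        self.buckets[pid].append((key, value))
        self.buffered += 1
        if self.buffered >= self.max_records:
            self._spill()

    def _spill(self):
        if self.buffered == 0:
            return
        path = os.path.join(self.spill_dir, "spill-%d.bin" % len(self.spills))
        pids = sorted(self.buckets)
        header, pos = [], 0
        for pid in pids:  # positions are counted in objects, not bytes
            header.append((pid, pos))
            pos += len(self.buckets[pid])
        with open(path, "wb") as f:
            pickle.dump(header, f)
            for pid in pids:
                for record in self.buckets[pid]:
                    pickle.dump(record, f)
        self.spills.append(path)
        self.buckets.clear()
        self.buffered = 0

    def finish(self):
        """Spill what is left, merge all runs by partition ID, delete the spills."""
        self._spill()
        runs = [_read_run(p) for p in self.spills]
        merged = list(heapq.merge(*runs, key=lambda rec: rec[0]))
        for p in self.spills:
            os.remove(p)
        self.spills = []
        return merged
```

Because each spill file is already ordered by partition ID, the merge only compares partition IDs; records within a partition come out in spill order, not key order.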

After it runs through all the data, the SortedFileWriter will delete any intermediate files and just
leave the final file and its index in the block manager.

NOTE: It may be useful to store intermediate files in the same format as final ones, with a reset
before each partition, so that we can merge them without deserializing data. However, this could
backfire when the amount of data per partition is small (since we start a new compression and
serialization stream). We can investigate this later.

File Format

In both the intermediate and final merged files, it makes sense to use a sparse index that
contains (partitionID, startLocation) pairs. This is to avoid having the index take much more
space than the data if you have a large number of reduce tasks but few keys. (This is similar to
how we avoid requesting zero-length blocks in the current shuffle). For the intermediate files, the
index can be at the start of each file, and the locations will be given in # of objects within the file.
We can do this because we know the # of objects before starting to write the file, both in the
case of merging two files and the case of creating one from in-memory data. For the final file, it's
easiest to put the index in a separate file after we finish because we won't know the byte
positions for each range until we finish writing them.
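
A sparse index of this kind could look like the Python sketch below. The helper names are hypothetical, and the example assumes the total length of the file is stored or known alongside the index so that the last partition's end can be found. Only non-empty partitions get an entry, so the index grows with the number of partitions that actually hold data rather than with the number of reduce tasks.

```python
from bisect import bisect_left


def build_sparse_index(sizes):
    """sizes[p] is the size of partition p (in objects or bytes).

    Returns ([(partition_id, start_location), ...], total_size) with entries
    only for partitions whose size is non-zero.
    """
    index, pos = [], 0
    for pid, size in enumerate(sizes):
        if size > 0:
            index.append((pid, pos))
            pos += size
    return index, pos


def lookup(index, total_size, pid):
    """Return the (start, end) range of a partition, or None if it is empty."""
    # (pid,) sorts just before (pid, anything), so this finds pid's entry if present.
    i = bisect_left(index, (pid,))
    if i == len(index) or index[i][0] != pid:
        return None
    end = index[i + 1][1] if i + 1 < len(index) else total_size
    return index[i][1], end
```

With sizes [3, 0, 0, 5, 0, 2] the index holds three entries instead of six, and a lookup for an empty partition returns None, so no zero-length range is ever requested.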

File Serving

The BlockManager will need to be modified to serve ranges of a file based on an index. Right
now it has some similar code for dealing with consolidated shuffle files: when someone requests
a shuffle block, it asks a class called ShuffleBlockManager to get a FileSegment for this block,
which may be a segment inside a real file managed by the DiskStore. We can do something
similar here, though it might also be nice to make the BlockManager explicitly aware of indices
and of sub-blocks. (One benefit of that is an easier move to in-memory shuffle.)
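
One way to picture index-based serving is the Python sketch below. It does not reflect Spark's actual BlockManager or FileSegment classes: the FileSegment tuple here is a simplified stand-in, the index layout (pairs of big-endian 64-bit integers) is an assumption made for the example, and get_segment and read_segment are hypothetical helpers.

```python
import os
import struct
from collections import namedtuple

# Simplified stand-in for a file segment: which file, where, and how many bytes.
FileSegment = namedtuple("FileSegment", ["path", "offset", "length"])

# Assumed index entry layout: (partition ID, start byte offset).
ENTRY = struct.Struct(">qq")


def write_index(index_path, entries):
    with open(index_path, "wb") as f:
        for pid, offset in entries:
            f.write(ENTRY.pack(pid, offset))


def get_segment(data_path, index_path, reduce_id):
    """Resolve a reducer's partition to a byte range of the merged file."""
    with open(index_path, "rb") as f:
        raw = f.read()
    entries = [ENTRY.unpack_from(raw, i) for i in range(0, len(raw), ENTRY.size)]
    file_size = os.path.getsize(data_path)
    for i, (pid, offset) in enumerate(entries):
        if pid == reduce_id:
            end = entries[i + 1][1] if i + 1 < len(entries) else file_size
            return FileSegment(data_path, offset, end - offset)
    return None  # sparse index: no entry means the partition is empty


def read_segment(segment):
    with open(segment.path, "rb") as f:
        f.seek(segment.offset)
        return f.read(segment.length)
```

The caller only ever sees a (file, offset, length) triple, which is why the existing consolidated-shuffle code path is a plausible starting point.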

Reduce Tasks

Reduce tasks will fetch and hash together data the same way they do now. Because they use
ExternalAppendOnlyMap, they already have a sort-based way of spilling to disk if they receive too
many values, so we don't need to do anything special for them. Note, however, that the
ExternalAppendOnlyMap merges all spilled files at once, so it may not perform well if there are a
huge number of files. We can update it to use a tiered merging policy in a separate patch.
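
As a sketch of what a tiered merging policy could mean here, the Python function below merges sorted runs in passes so that no more than max_fan_in runs are open at once. The name tiered_merge and its parameters are hypothetical, and in-memory lists stand in for the intermediate files a real implementation would write. This is one simple multi-pass scheme; a production policy might instead choose which runs to merge based on their sizes.

```python
import heapq


def tiered_merge(runs, max_fan_in, key=None):
    """Merge sorted runs while never merging more than max_fan_in at a time."""
    if max_fan_in < 2:
        raise ValueError("max_fan_in must be at least 2")
    runs = [list(r) for r in runs]
    # Keep merging groups of runs until few enough remain for a final merge.
    while len(runs) > max_fan_in:
        next_tier = []
        for i in range(0, len(runs), max_fan_in):
            group = runs[i:i + max_fan_in]
            # Stand-in for writing the merged group to a new intermediate file.
            next_tier.append(list(heapq.merge(*group, key=key)))
        runs = next_tier
    return list(heapq.merge(*runs, key=key))
```

With five runs and a fan-in of two, the runs are reduced to three, then two, and only then merged into the final output, at the cost of rewriting some data more than once.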