
Lesson 11: Introduction

The Multithreaded DAG Model

DAG = Directed Acyclic Graph: a collection of vertices and directed edges (lines with arrows).
Each edge connects two vertices. The net result of these connections is that there is no way to
start at some vertex A, follow a sequence of directed edges, and end up back at A.

DAGs can be used for a variety of tasks, including modeling processes in which data flows in a
consistent direction through a network of processors.

Each vertex is an operation, such as a function call, an addition, or a branch.
Directed edges show how operations depend on one another.

The sink depends on the output of the source.
Assume there is always one start vertex and one exit vertex.

Begin the analysis by looking for a start vertex: a vertex whose inputs are all satisfied.
This vertex can be assigned to any open processor.

Scheduling = taking units of work and assigning them to processors.

How long will it take to run the DAG? A cost model is needed.
Cost model assumptions:
all processors run at the same speed
1 operation = 1 unit of time
edges do not have any cost associated with them

Example: Sequential Reduction

Reduction = reduce an array to the sum of its elements.
To find the cost of this reduction, we will only count the cost of the array accesses and the
cost of the additions.
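As a concrete reference, here is a minimal Python sketch of the sequential reduction being analyzed (the function name is illustrative):

```python
def sequential_reduce(a):
    """Sum the elements of a: n array reads and n additions,
    where each addition depends on the previous partial sum."""
    total = 0
    for x in a:       # one array access per iteration...
        total += x    # ...feeding one addition on the sequential chain
    return total

print(sequential_reduce([1, 2, 3, 4]))  # prints 10
```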

How long will it take to execute this DAG with P processors?
Tp(n) ≥ ceil(n/P) (the n array accesses can be divided among the P processors), and
Tp(n) ≥ n (one unit of time for each of the n additions).

The additions must be done sequentially.
Both time conditions must be true, so Tp(n) ≥ max(ceil(n/P), n).

Since ceil(n/P) is always at least one but the addition bound dominates, this reduction will take
at least n units of time on a PRAM, no matter how many processors are available.


QUIZ: A Reduction Tree

Assume associativity: (a + b) + c = a + (b + c).
Assume n processors.
Assume the additions are done in pairs.

What is the minimum time on a PRAM with P = n processors?
The DAG is executed level by level, and each level takes constant time, so all that is needed to
calculate the time is the number of levels: log n.
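The level-by-level pairwise scheme can be sketched as follows (a serial simulation; on a PRAM every pair on a level is added simultaneously):

```python
def tree_reduce(a):
    """Pairwise (tree) reduction of a non-empty sequence.
    Each pass halves the number of values, so there are ceil(log2 n) levels."""
    values = list(a)
    while len(values) > 1:
        # Add neighbors in pairs; an odd value out is carried to the next level.
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

print(tree_reduce(range(8)))  # prints 28, after log2(8) = 3 levels
```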

Work and Span

Work = number of vertices in the DAG = W(n)

Span = longest path through the DAG = D(n) = number of vertices on the longest path

The span is also known as the critical path.

T1(n) = W(n)
T∞(n) = D(n)

QUIZ: Work and Span for Reduction

For the sequential DAG, span = O(n).
For the tree DAG, span = O(log n).

Basic Work-Span Laws

W(n)/D(n) = the amount of work per critical-path vertex = the average available parallelism in the
DAG.

How many processors should be used for the problem? W(n)/D(n).

Span Law: Tp(n) ≥ D(n)
Work Law: Tp(n) ≥ ceil(W(n)/P)

Tp(n) ≥ max{Span Law, Work Law} = max{D(n), ceil(W(n)/P)}
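The combined lower bound can be written as a one-line helper (the tree-reduction numbers below are illustrative):

```python
import math

def time_lower_bound(work, span, p):
    """Work-span lower bound on parallel time: Tp(n) >= max(D(n), ceil(W(n)/P))."""
    return max(span, math.ceil(work / p))

# Tree reduction of n = 1024 values: roughly W = 2n - 1 vertices, D = log2(n) + 1.
print(time_lower_bound(2047, 11, 64))    # work law dominates: ceil(2047/64) = 32
print(time_lower_bound(2047, 11, 1024))  # span law dominates: 11
```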

Brent's Theorem, Part 1 (setup)

Is there an upper bound on the time to execute the DAG? Yes, according to Brent's Theorem.

Given a PRAM with P processors, break the execution into phases:
1. Each phase has one critical-path vertex.
2. The non-critical-path vertices in each phase are independent. The vertices in a phase can
have edges that enter or exit the phase, but they cannot depend on one another.
3. Every vertex has to be in some phase, and only one phase.

How long will it take to execute phase k?

QUIZ: Brent's Theorem Aside
Use the equivalence ceil(a/b) = floor((a − 1)/b) + 1, so ceil(a/b) ≤ (a − 1)/b + 1.

Brent's Theorem, Part 2

Phase k, with W_k vertices, takes t_k = ceil(W_k/P) time. The upper bound on the time to
execute the DAG is:

Tp(n) = sum_k t_k = sum_k ceil(W_k/P) ≤ sum_k ((W_k − 1)/P + 1)

which becomes (using sum_k W_k = W(n) and the fact that there are D(n) phases):

Tp(n) ≤ (W(n) − D(n))/P + D(n)

This is Brent's Theorem. It says:

The time to execute the DAG using P processors is at most the time to execute the critical
path, plus the time to execute everything off the critical path divided among the P processors.

**This sets the goal for any scheduler.**

Brent's upper bound and the work-span lower bound are within a factor of two of each other.

This implies that you may be able to execute the DAG in less time than Brent's bound predicts,
but never faster than the lower bound.
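The two bounds can be compared directly; the sketch below checks the factor-of-two claim on a few illustrative DAG shapes (the work/span/processor numbers are made up):

```python
import math

def lower_bound(work, span, p):
    """Work-span lower bound: Tp(n) >= max(D, ceil(W/P))."""
    return max(span, math.ceil(work / p))

def brent_upper_bound(work, span, p):
    """Brent's Theorem: Tp(n) <= (W - D)/P + D."""
    return (work - span) / p + span

for w, d, p in [(2047, 11, 64), (10**6, 100, 256), (50, 50, 8)]:
    ub, lb = brent_upper_bound(w, d, p), lower_bound(w, d, p)
    # The two bounds bracket the achievable time within a factor of 2.
    assert lb <= ub <= 2 * lb
```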

Desiderata: Speedup, Work Optimality, and Weak Scaling

How can we tell if a DAG is good or bad?

Speedup = best sequential time / parallel time: Sp(n) = T*(n)/Tp(n)

T*(n) depends on the work done by the best sequential algorithm.
Tp(n) depends on the work, the span, n, and P.

Ideal speedup: linear in P (you want the speedup to grow linearly with the number of processors).

Sp(n) = Θ(P), i.e., best sequential work / parallel time = W*(n)/Tp(n) = Θ(P).

Use Brent's Theorem to get an upper bound on the time, and hence a lower bound on the
speedup. There is still a dependence on n; it is just not shown on the right-hand side:

Sp(n) ≥ W*(n) / ((W(n) − D(n))/P + D(n)) = P / (W/W* + (P − 1)·D/W*)

P = number of processors. The denominator is the penalty: to get linear scaling, the
denominator needs to be a constant.

To get a constant in the denominator:
W(n) = O(W*(n)) — work optimality.

Weak scalability:
P = O(W*/D), equivalently W*/P = Ω(D): the work per processor has to grow proportionally to the
span, and the span depends on the problem size n.
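Plugging Brent's bound into the speedup definition gives a quick numeric check of these conditions (the reduction figures below are illustrative):

```python
def speedup_lower_bound(w_star, w, d, p):
    """Speedup guaranteed by Brent's bound:
    Sp(n) >= W* / ((W - D)/P + D) = P / (W/W* + (P - 1) * D / W*)."""
    return w_star / ((w - d) / p + d)

# Work-optimal tree reduction: W = W* = 2n, D = log2(n) + 1 = 21 for n = 2^20.
# With enough work per processor (n/P >> D), the guaranteed speedup is near P.
n, p = 2**20, 1024
s = speedup_lower_bound(2 * n, 2 * n, 21, p)
print(round(s))  # close to p = 1024
```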

Recap:

Linear speedup is the goal.
To achieve linear scaling, the work of the parallel algorithm should match the best sequential
algorithm, and the work per processor should grow as a function of n.

Basic Concurrency Primitives

The Divide-and-Conquer Scheme

This is the sequential version of the divide-and-conquer scheme.

Note that the two recursive calls are independent, and will now be marked with SPAWN.

Spawn is a signal to either the compiler or the runtime system that the target is an
independent unit of work. The target may be executed asynchronously from the caller.


SYNC: there is a dependence between a, b, and the return statement; these have to be combined.
Sync is used to wait for the dependent statements.

To which spawn does a given sync apply? The sync matches any spawn in the same frame.

Nested parallelism: there is always an implicit sync before returning to the caller.

A spawn creates two independent paths: one path carries the new work, and one path
continues after the spawn.
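These semantics can be sketched in Python with a thread pool, where submitting a task plays the role of spawn and waiting on its future plays the role of sync (an illustration of the model, not how a real spawn/sync runtime is built):

```python
from concurrent.futures import ThreadPoolExecutor

# A generous worker count avoids deadlock from nested blocking waits;
# real runtimes handle this with work stealing instead.
pool = ThreadPoolExecutor(max_workers=32)

def reduce_dc(a, lo, hi):
    """Divide-and-conquer reduction over a[lo:hi]."""
    if hi - lo <= 1:
        return a[lo] if hi > lo else 0
    mid = (lo + hi) // 2
    left = pool.submit(reduce_dc, a, lo, mid)  # "spawn": may run asynchronously
    right = reduce_dc(a, mid, hi)              # caller continues with the other half
    return left.result() + right               # "sync": wait before combining

print(reduce_dc(list(range(8)), 0, 8))  # prints 28
```

Keeping n small relative to the worker count matters here: each pending combine blocks a pool thread, which is exactly the scheduling problem a real work-stealing runtime solves.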

QUIZ: A Subtle Point About Spawns
The above recursive reduction uses two spawns; are they both necessary? You can eliminate B
but not A.

If you eliminate the A spawn, you eliminate the concurrency; this is bad.

If you eliminate the B spawn, the two subgraphs can still be executed concurrently.

Basic Analysis of Work and Span
Many of the analysis tools used on sequential algorithms can be used on parallel algorithms.

We want to analyze work and span.

Assume each spawn and sync is a constant-time operation, so they can be ignored in the
analysis.

Analyzing work means counting the total operations; for the recursive reduction we end up with
linear work, O(n).

Analyzing span: a spawn creates two paths, and the critical path is the longer of the two.

Desiderata for Work and Span
The goals of a parallel algorithm designer:
1. Work optimality: achieve a degree of work that matches the best sequential algorithm.
2. Find algorithms with polylogarithmic span: D(n) = O(log^k n); this is low span.
This ensures that the average available parallelism grows with n.

Concurrency Primitive: Parallel For
All iterations are independent of one another.
A parfor creates n independent subpaths.
The end of a parfor loop includes an implicit sync point.
The work of a parfor is W_parfor(n) = O(n).
The span of a parfor is D_parfor(n) = O(1) in theory, but in practice it will grow with n, especially
if n is really large.

QUIZ: Implementing ParFor
If the implementation executes the spawns sequentially, one after another, this creates a
bottleneck: the span grows with n. This is bad.

Implementing ParFor, Part 2
Implement parfor as a recursive procedure call (ParForT). This is a better way to implement a
parallel for loop: the span now grows logarithmically with n.

For the rest of this course, assume the ParForT implementation.

D(n) = O(log n)
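A Python sketch of the ParForT idea, reusing the thread-pool stand-in for spawn/sync: iterations are launched through a balanced recursion tree, so the depth of nested spawns is O(log n) rather than O(n):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=32)  # generous, to allow nested blocking waits

def par_for_t(body, lo, hi):
    """Run body(i) for every i in [lo, hi) via a balanced spawn tree."""
    if hi - lo <= 0:
        return
    if hi - lo == 1:
        body(lo)
        return
    mid = (lo + hi) // 2
    left = pool.submit(par_for_t, body, lo, mid)  # spawn the left half
    par_for_t(body, mid, hi)                      # keep the right half
    left.result()                                 # the implicit sync at loop end

out = [0] * 8

def square(i):
    out[i] = i * i   # each iteration writes a distinct index: no dependence

par_for_t(square, 0, 8)
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49]
```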

QUIZ: Matrix-Vector Multiply
If a loop carries a dependence, then it cannot be parallelized with a parfor.

Data Races and Race Conditions
If we look at the nested loops, we see that in the innermost loop, different iterations of j write
to the same element indexed by i.

Data race = at least one read and one write can happen at the same data location at the same
time.

Race condition = a data race that causes an error.

**A data race does not always lead to a race condition.**
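A minimal Python illustration (the counter and thread counts are illustrative): both threads race on the read-modify-write of `counter`; whether updates are actually lost, turning the data race into a race condition, depends on how the threads happen to interleave.

```python
import threading

counter = 0
lock = threading.Lock()

def add_unsafe(times):
    global counter
    for _ in range(times):
        counter += 1        # data race: unsynchronized read-modify-write

def add_safe(times):
    global counter
    for _ in range(times):
        with lock:          # the lock serializes the read-modify-write
            counter += 1

threads = [threading.Thread(target=add_safe, args=(100_000,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 200000 with the lock; add_unsafe may (but need not) lose updates
```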

Vector Notation
t[1:n] ← A[i, 1:n] * x[1:n] — this is a more compact form of the parfor loop.
t[:] ← A[i, :] * x[:]

This can be further reduced to: y[i] ← y[i] + reduce(t)
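Spelled out in plain Python, the vector-notation matrix-vector multiply looks like this (the row loop over i is the candidate parfor; each temporary t is private to its iteration):

```python
def matvec(A, x):
    """y[i] = y[i] + reduce(A[i, :] * x[:]) for each row i; rows are independent."""
    n = len(A)
    y = [0] * n
    for i in range(n):                                  # parallelizable over rows
        t = [A[i][j] * x[j] for j in range(len(x))]     # t[:] = A[i, :] * x[:]
        y[i] = y[i] + sum(t)                            # y[i] = y[i] + reduce(t)
    return y

print(matvec([[1, 2], [3, 4]], [1, 1]))  # [3, 7]
```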
