Lesson 3-1: Basic Model of Locality
A First Basic Model
To find a locality-aware algorithm we need a machine model. We will be using a variation on the von Neumann model.
von Neumann Model:
Has a sequential processor that does basic compute operations.
The processor connects to a main memory that is nearly infinite but really slow.
Fast memory: small but very fast, size $= Z$, measured in number of words.
Rules:
1. The processor can only work with data that is in the fast memory, known as the local data rule.
2. When there is a transfer of data between the fast and slow memory, the data is transferred in blocks of size $L$, known as the block transfer rule.
For example: if you want to move $x$ words ($x \le L$) from slow to fast memory, you need to pay to move $L - x$ additional nearby words.
In this model you may need to consider data alignment.
Costs:
The model has two costs associated with it:
1. Work, $W(n) \equiv$ the number of computation operations. How many operations will the processor have to perform?
2. Data transfers, $Q(n; Z, L) \equiv$ the number of $L$-sized slow-fast transfers (loads and stores). The number of transfers depends on the size of the cache and the block size. This will be referred to as $Q$ and called the I/O complexity.
Example:
Given an array of size $n$, sum its elements.
The processor needs to do at least $n - 1$ additions: $W(n) \ge n - 1 = \Omega(n)$.
For memory transfers you need to make at least one pass through the data. This can be considered the lower bound on transfers:
$Q(n; Z, L) \ge \lceil n/L \rceil = \Omega(n/L)$
(The ceiling takes into account any partial transfer if $n/L$ is not an integer.)
Note that the bound does NOT depend on $Z$, the size of the cache: because you touch each data element only once, the size of the fast memory does not matter.
Reduction does not reuse data. This is BAD!

Examples of Two-Level Memories:
Hard disk & main memory
L1 cache & CPU registers
Tape storage & hard disk
Remote server RAM & local server RAM
The Internet & your brain
How many transfers are necessary in the worst case, assuming nothing about alignment?
Answer: $\lceil n/L \rceil + 1$
Here's an example:
Let $n = 4$ and $L = 2$.
Case 1: the array is aligned on an $L$-word boundary. Then transfers $= \lceil n/L \rceil = 2$.
Case 2: the array is not aligned on an $L$-word boundary. Then an extra transfer is needed, for 3 transfers in total.
When $n \gg L$, the $+1$ can be ignored.
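The alignment cases above can be checked with a small counting sketch. It assumes slow memory is divided into fixed blocks of `L` words starting at word address 0; the function name `blocks_touched` and the `offset` parameter are illustrative choices, not part of the lesson.

```python
def blocks_touched(n, L, offset=0):
    """Count the L-word blocks overlapped by n contiguous words.

    `offset` is the word address of the first element; offset % L == 0
    means the array is aligned on an L-word boundary (an assumption of
    this sketch about how blocks are laid out).
    """
    if n == 0:
        return 0
    first_block = offset // L
    last_block = (offset + n - 1) // L
    return last_block - first_block + 1


# The n = 4, L = 2 example: aligned versus shifted by one word.
aligned = blocks_touched(4, 2, offset=0)     # 2 transfers
unaligned = blocks_touched(4, 2, offset=1)   # 3 transfers
```

Scanning every possible offset for a given `n` and `L` gives the worst case, which for this example is the ceiling of n/L plus one.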
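Minimum Transfers to Sort
Given an array of size $n$, sort it.
Assume a slow/fast memory model.
Recall that comparison sorts need to perform on the order of $n \log n$ operations: $W(n) = \Omega(n \log n)$.
What is the number of slow/fast memory transfers?
$\lceil n/L \rceil$, or just $n/L$:
$Q(n; Z, L) = \Omega(\lceil n/L \rceil)$ or $\Omega(n/L)$
The $n$ appears because each element is touched at least once; the $L$ because you read the elements from slow memory one block at a time.
This answer would be impressive: $Q(n; Z, L) = \Omega\left(\frac{n}{L} \cdot \frac{\log(n/L)}{\log(Z/L)}\right)$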
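To get a feel for how far apart the two sorting bounds are, the sketch below evaluates both expressions with their hidden constants taken as 1. That choice, the function names, and the sample sizes are assumptions made for illustration only.

```python
import math


def trivial_sort_bound(n, L):
    """One pass over the data: ceil(n / L) block transfers."""
    return math.ceil(n / L)


def sort_io_bound(n, Z, L):
    """(n/L) * log(n/L) / log(Z/L), with the hidden constant taken as 1."""
    return (n / L) * math.log(n / L) / math.log(Z / L)


# Hypothetical sizes: 2^20 words to sort, a 2^10-word fast memory, 2^4-word blocks.
n, Z, L = 2**20, 2**10, 2**4
one_pass = trivial_sort_bound(n, L)   # 65536
stronger = sort_io_bound(n, Z, L)     # about 2.67 times the one-pass figure
```

The ratio of the two is log(n/L) / log(Z/L), so the stronger bound only exceeds a single pass when the number of blocks in the input is larger than the number of blocks that fit in fast memory.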
Consider a matrix-matrix multiply on a machine with a two-level memory.
The matrices are all $n \times n$ objects.
For a non-Strassen algorithm, the work is $W(n) = O(n^3)$.
Question: What is the minimum number of transfers?
Answer: $Q(n; Z, L) = \Omega(n^2 / L)$
The $n \cdot n$ counts the number of elements; dividing by $L$ converts it to the number of transfers.
Answer if you are already familiar with the question: $Q(n; Z, L) = \Omega\left(\frac{n^3}{L \sqrt{Z}}\right)$
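I/O Example: Reduction
$W(n) = \Omega(n)$ (work)

$Q(n; Z, L) = \Omega(n/L)$ (number of transfers)

Let's look at an algorithm to see if we can achieve the lower bound.
For a sequential processor without fast memory, the algorithm is a single loop that adds each element of the array $X$ into a running sum $s$.
When you have a two-level memory, you need to think about when to move data from slow to fast memory.
Assume $s$ begins locally, already in the fast memory.
Assume $n \gg Z$ (the array is much bigger than the cache).
Assume $X$ is aligned on an $L$-word boundary.
Now make slow and fast memory transfers explicit: for each block of $X$, load the block into a local buffer $y$, then add the words of $y$ into $s$.
Note: the outer loop steps through the array one block ($L$ words) at a time.
$\hat{L}$ is the size of the current block: $L$, or smaller for a final partial block. This detail can often be ignored.
$y$: this is a load from slow to fast memory; it requests at most $L$ words (1 block transfer).
Since $s$ and $y$ are local to fast memory, the processor can execute the innermost loop.
Work $= W(n) = \Theta(n)$
Transfers $= Q(n; Z, L) = \Theta(\lceil n/L \rceil)$
Compare to the lower bounds:
Lower bounds: Work $= W(n) = \Omega(n)$, $Q(n; Z, L) = \Omega(n/L)$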
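The reduction with explicit transfers can be sketched as follows. This is a minimal model rather than the lesson's own pseudocode: the Python list `X` stands in for slow memory, the locals `s` and `y` stand in for fast memory, and the counters `work` and `transfers` are added here purely to tally the two costs.

```python
def blocked_sum(X, L):
    """Sum X one L-word block at a time, counting additions and block loads."""
    s = 0            # running sum, assumed to live in fast memory already
    work = 0         # number of additions performed
    transfers = 0    # number of block loads from slow memory
    for i in range(0, len(X), L):
        y = X[i:i + L]     # one block load; the final block may be shorter than L
        transfers += 1
        for v in y:        # s and y are both local, so this loop needs no transfers
            s += v
            work += 1
    return s, work, transfers


total, work, transfers = blocked_sum(list(range(10)), L=4)
# total == 45, work == 10, transfers == 3 (that is, ceil(10 / 4))
```

The sketch counts n additions because it starts from `s = 0`; that is still Θ(n) work, and the transfer count matches the ceiling of n/L lower bound for an aligned array.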
Observation:
Caches are very fast, but they are not sufficient to guarantee high performance.
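Matrix-Vector Multiply
Multiply a dense $n \times n$ matrix, $A$, by a vector, $x$, accumulating the product into a vector $y$.
Work $= W(n) = \Theta(n^2)$
The matrix is stored in memory in column-major order: it is stored column-wise, with each column following the previous column in memory.
An element can be found in memory using the following rule:
$a_{ij} \leftrightarrow A[i + j \cdot n]$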
Consider two algorithms to compute the product:
Algorithm 1: the outer loop iterates over rows, the inner loop over columns.
Algorithm 2: the outer loop iterates over columns, the inner loop over rows.
In the basic RAM model, these algorithms are identical.
Question: Which algorithm does fewer transfers in the two-level model?
Assumptions:
The fast memory can hold two vectors: $Z = 2n + O(L)$
$L \mid n$ ($L$ divides $n$)
All arrays and matrices are aligned on $L$-word boundaries.
You can ignore floors and ceilings.
You can assume the algorithm preloads $x$ and $y$, and stores $y$ at the end.
These assumptions imply the number of transfers is at least:
$Q(n; Z, L) = 3n/L + \text{???}$
So the real question is how many additional transfers loading the matrix requires.
Answer:
Algorithm 2 requires fewer transfers.
Consider algorithm 1: it iterates over rows. Loading an element loads a block's worth of elements from the same column (the matrix is stored by columns). The next element in the row then needs to be loaded, which brings in a new block from the next column.
This leads to $Q(n; Z, L) = 3n/L + n^2$.
In algorithm 2, the block transfer matches the storage format.
This leads to $Q(n; Z, L) = 3n/L + n^2/L$.
In the sequential model these two algorithms are identical, but in the two-level model they are different.
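The two loop orders can be compared with a small simulation. It is a sketch under a simplifying assumption: besides `x` and `y`, fast memory has room for exactly one `L`-word block of `A` at a time, which is one way to read Z = 2n + O(L). The function name `matvec_transfers` and the `row_outer` flag are illustrative, and only the block loads of `A` are counted (the 3n/L term for `x` and `y` is the same for both algorithms).

```python
def matvec_transfers(A, x, n, L, row_outer):
    """Accumulate y = A x for a column-major A, counting block loads of A."""
    y = [0.0] * n
    loaded_block = None   # index of the block of A currently in fast memory
    loads = 0

    def a(i, j):
        nonlocal loaded_block, loads
        addr = i + j * n            # column-major rule: a_ij is stored at A[i + j*n]
        block = addr // L
        if block != loaded_block:   # block not resident: pay for one transfer
            loaded_block = block
            loads += 1
        return A[addr]

    if row_outer:                   # algorithm 1: rows outside, columns inside
        for i in range(n):
            for j in range(n):
                y[i] += a(i, j) * x[j]
    else:                           # algorithm 2: columns outside, rows inside
        for j in range(n):
            for i in range(n):
                y[i] += a(i, j) * x[j]
    return y, loads


n, L = 8, 4
A = [float(k) for k in range(n * n)]
x = [1.0] * n
y1, loads1 = matvec_transfers(A, x, n, L, row_outer=True)    # loads1 == 64 == n*n
y2, loads2 = matvec_transfers(A, x, n, L, row_outer=False)   # loads2 == 16 == n*n/L
```

Both calls produce the same `y`; only the number of block loads differs, by a factor of `L`.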
If you have a fully associative cache, will it help algorithm 1 to be as fast as algorithm 2?
Algorithmic Design Goals
What are the goals? What makes an algorithm good?
Goal 1: Work optimality
The two-level algorithm should do the same work, asymptotically, as the best algorithm.
$W(n) = \Theta(W_*(n))$, where $W_*$ is the work of the best asymptotic algorithm.
Goal 2: High computational intensity
This is the ratio of work to words transferred.
Intensity is measured in operations per word; it measures the data reuse of the algorithm. It is good to have high intensity, as long as the work is optimal.
This should remind you of work and span.
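Written in symbols, and assuming "words transferred" means the block size times the number of transfers, the intensity and its value for the earlier examples would be as below. The symbol $I$ and the substitutions are a worked reading of the definitions in these notes, not a formula recovered from the original slides.

```latex
\begin{align*}
I(n; Z, L) &\equiv \frac{W(n)}{L \cdot Q(n; Z, L)} \\[4pt]
\text{Reduction:}\quad
I &= \frac{\Theta(n)}{L \cdot \Theta(n/L)} = \Theta(1) \\[4pt]
\text{Matrix-vector, algorithm 1:}\quad
I &= \frac{\Theta(n^2)}{L \left( 3n/L + n^2 \right)} = \Theta(1/L) \\[4pt]
\text{Matrix-vector, algorithm 2:}\quad
I &= \frac{\Theta(n^2)}{L \left( 3n/L + n^2/L \right)} = \Theta(1)
\end{align*}
```

None of these grows with $n$ or $Z$, which reflects the lack of data reuse: each matrix or array element is used once.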
Which is Better?
Given two algorithms, which is better?
Answer: Neither; there is insufficient information.
Recall the goals: low work and high intensity.
Algorithm 1 does less work, but its intensity is a constant.
In algorithm 2, the intensity grows.
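Intensity, Balance, and Time
The relationship between work, transfers, and execution time.
$\tau$ = time per operation = [time]/[operation]
Time to compute: $T_{\text{comp}} = \tau W$
$\alpha$ = amortized time to move data between slow and fast memory = [time]/[word]
The time to execute $Q$ transfers: $T_{\text{mem}} = \alpha L Q$
The minimum time to execute the program: $T \ge \max(T_{\text{comp}}, T_{\text{mem}})$ (assumes perfect overlap)
The execution time relative to the ideal running time $\tau W$:
$\frac{T}{\tau W} \ge \max\left(1, \frac{\alpha / \tau}{W / (L Q)}\right)$
It is ideal because it assumes data movement is free.
The second term is the penalty you must pay for moving the data. This is: machine balance / intensity $= B / I$.
$B$ = machine balance $= \alpha / \tau$, in [ops]/[word] (this is machine dependent): how many operations can be executed in the time it takes to move a word of data.
The time as a function of balance and intensity: $T \ge \tau W \max(1, B/I)$
The maximum time is: $T \le T_{\text{comp}} + T_{\text{mem}} = \tau W (1 + B/I)$
Normalized performance: $R \equiv \frac{\tau W_*}{T} \le \frac{W_*}{W} \min(1, I/B)$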
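A numeric sketch may help fix the units. It assumes compute time is (time per operation) times W, memory time is (time per word) times L times Q, perfect overlap for the lower bound, and no overlap for the upper bound. The parameter names `tau` and `alpha` and the sample numbers are hypothetical.

```python
def time_bounds(W, Q, L, tau, alpha):
    """Return execution-time bounds plus the intensity and machine balance.

    tau   : time per operation (assumed name)
    alpha : amortized time per word moved (assumed name)
    """
    t_comp = tau * W
    t_mem = alpha * L * Q
    return {
        "lower": max(t_comp, t_mem),   # perfect overlap of compute and transfers
        "upper": t_comp + t_mem,       # no overlap at all
        "intensity": W / (L * Q),      # operations per word moved
        "balance": alpha / tau,        # operations per word-time on this machine
    }


# Hypothetical numbers: 1000 operations, 100 transfers of 4 words each,
# 1 time unit per operation, 5 time units per word.
r = time_bounds(W=1000, Q=100, L=4, tau=1.0, alpha=5.0)
# intensity 2.5 is below balance 5.0, so the run is memory bound:
# lower == 2000.0, upper == 3000.0
```

Here the lower bound equals `tau * W * max(1, balance / intensity)`, which is the "penalty for moving the data" form of the same expression.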
Roofline Plots
To visualize the relationships between $R$, $I$, and $B$, look at a roofline plot.
Assume $W_*$ and $W$ are fixed, but $I$ can vary.
The plot of this is a linearly increasing slope, an inflection point, and a plateau.
What are the value of the plateau and the location of the inflection point, that is, the values of $x_0$ and $y_0$?
$x_0 = B$: the critical point is reached as soon as $I = B$. So when designing an algorithm, try for an intensity $I \ge B$.
$y_0 = W_*/W$ (it is the maximum possible value): if you design an algorithm that is not work-optimal, you will pay a penalty.
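The shape of the roofline can be tabulated with a few lines of code. The sketch assumes normalized performance is bounded by (W*/W) times min(1, I/B), which is consistent with a plateau at W*/W and a critical point at I = B; the function name and sample values are illustrative.

```python
def roofline_bound(I, B, W_star, W):
    """Upper bound on normalized performance at intensity I (assumed form)."""
    return (W_star / W) * min(1.0, I / B)


B = 4.0   # hypothetical machine balance, in operations per word
# A work-optimal algorithm (W == W_star) swept over several intensities:
curve = [(I, roofline_bound(I, B, 100, 100)) for I in (1.0, 2.0, 4.0, 8.0, 16.0)]
# rises linearly (0.25, 0.5) until I reaches B, then stays at 1.0

# An algorithm doing twice the optimal work tops out at 0.5 however high I is:
capped = roofline_bound(8.0, B, 50, 100)
```

Below `I = B` the bound is limited by data movement; above it, only by how close the algorithm's work is to optimal.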
Intensity of Conventional Matrix Multiply
Consider a matrix-matrix multiply (non-Strassen).
Execute this algorithm on a two-level memory machine.
Assume:
Transfer size is 1 word ($L = 1$ word)
$Z = 2n + O(1)$
Question: What is the intensity of the algorithm?
$I(n; Z) = \Theta(1)$
Note:
$W(n) = \Theta(n^3)$
$Q(n; Z) = n^2$ (for elements in $A$) $+ 2n^2$ (for elements in $C$) $+ n^3$ (for elements in $B$)
The reads of $B$ dominate the overall transfer cost:
$Q(n; Z) \approx n^3$
$I(n; Z)$ = ratio of $W$ and $Q$ $= \Theta(1)$
Can you do better? Yes.
There are $n^3$ transfers but only $n^2$ data, so there might be a factor-of-$n$ reuse of data available.
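Intensity of Conventional Matrix Multiply, Part 2
Divide the matrices into $b \times b$ blocks.
The reads and writes of blocks are slow/fast memory transfers.
Count the block transfers and determine the intensity of the algorithm.
Assume: $b$ is chosen so that the blocks in use fit in fast memory, i.e. $b = \Theta(\sqrt{Z})$.
Answer: $I(n; Z) = \Theta(b)$ or $\Theta(\sqrt{Z})$
$W(n) = n^3$
$Q(n; Z) = \Theta(n^3 / b)$
Blocking is better than reading individual elements.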
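A blocked multiply with explicit transfer counting can be sketched as below. The loop structure is an assumption, not taken from the lesson's figure: each block of `C` is loaded once, updated with every matching pair of `A` and `B` blocks, then stored. It also assumes that `b` divides `n`, that three `b`-by-`b` blocks fit in fast memory, and that `L = 1` so transfers are counted in words.

```python
def blocked_matmul(A, B, n, b):
    """Compute C = A * B with b-by-b blocking, counting work and words moved."""
    assert n % b == 0, "this sketch assumes b divides n"
    C = [[0] * n for _ in range(n)]
    work = 0    # multiply-add operations
    words = 0   # words moved between slow and fast memory (L = 1)
    for bi in range(0, n, b):
        for bj in range(0, n, b):
            words += b * b                  # load one block of C
            for bk in range(0, n, b):
                words += 2 * b * b          # load one block of A and one of B
                for i in range(bi, bi + b):
                    for j in range(bj, bj + b):
                        acc = C[i][j]
                        for k in range(bk, bk + b):
                            acc += A[i][k] * B[k][j]
                            work += 1
                        C[i][j] = acc
            words += b * b                  # store the finished block of C
    return C, work, words


n = 8
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i - j for j in range(n)] for i in range(n)]
results = {b: blocked_matmul(A, B, n, b) for b in (1, 2, 4, 8)}
intensity = {b: work / words for b, (_, work, words) in results.items()}
# words moved: 1152, 640, 384, 256 for b = 1, 2, 4, 8; work is 512 every time
```

With this loop structure the words moved come to 2n² + 2n³/b, so the intensity grows roughly in proportion to `b` until the 2n² term for `C` takes over.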
