
Lesson 3-1: Basic Model of Locality

A First Basic Model

To find a locality-aware algorithm we need a machine model. We will be using a variation on the
von Neumann model.

von Neumann Model:
Has a sequential processor that does basic compute operations.
The processor connects to a main memory that is nearly infinite but really slow.
There is also a fast memory that is small but very fast; its size is Z, measured in number of words.

Rules:
1. The processor can only work with data that is in the fast memory, known as
the local data rule.

2. When there is a transfer of data between the fast and slow memory, the data is
transferred in blocks of size L, known as the block transfer rule.
For example: if you want to move x words (x ≤ L, all in one block) from slow to fast memory, you need
to pay to move the L - x additional nearby words in that block.

In this model you may need to consider data alignment.

Costs:
The model has two costs associated with it:
1. Work, W(n) = the number of computation operations. How many operations will the
processor have to perform?
2. Data transfers, Q(n; Z, L) = the number of L-sized slow-fast transfers (loads and stores).
The number of transfers depends on the size of the cache and the block size.
This will be referred to as Q and called the I/O complexity.

Example:
Given an array of size n, sum its elements.
The processor needs to do at least n - 1 additions: W(n) ≥ n - 1 additions = Ω(n).
For memory transfers you need to make at least one pass through the data. This can
be considered the lower bound on transfers:
Q(n; Z, L) ≥ ⌈n/L⌉ transfers = Ω(n/L)
(The ceiling takes into account any partial transfer if n/L is not an integer.)
Note that the bound does NOT depend on Z, the size of the cache, because you
are touching each data element only once, so the size of the fast memory does not
matter.
Reduction does not reuse data, and this is BAD!


Examples of Two-Level Memories:
Hard disk & main memory
L1 cache & CPU registers
Tape storage & hard disk
Remote server RAM & local server RAM
The Internet & your brain

How many transfers are necessary in the worst case, assuming nothing about
alignment?
Answer: ⌈n/L⌉ + 1
Here's an example:
Let n = 4 and L = 2.
Case 1: the array is aligned on an L-word boundary. Then transfers = ⌈n/L⌉ = 2 transfers.
Case 2: the array is not aligned on an L-word boundary. Then an extra transfer is needed, for 3 transfers in total.

When n >> L, the +1 can be ignored.
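The alignment argument above can be checked by counting how many L-word blocks an array overlaps for each possible starting offset. This is a minimal sketch; the function names `blocks_touched` and `worst_case_transfers` are made up for illustration and are not part of the lesson.

```python
import math

def blocks_touched(start, n, L):
    """Number of L-word blocks that cover words start .. start + n - 1 (n >= 1)."""
    first_block = start // L
    last_block = (start + n - 1) // L
    return last_block - first_block + 1

def worst_case_transfers(n, L):
    """Maximum number of blocks over every possible alignment of the array."""
    return max(blocks_touched(start, n, L) for start in range(L))

# The lesson's example: n = 4, L = 2.
aligned = blocks_touched(0, 4, 2)      # array starts on a block boundary
misaligned = blocks_touched(1, 4, 2)   # array starts one word into a block
bound = math.ceil(4 / 2) + 1           # the ceiling(n/L) + 1 bound
```

For n = 4 and L = 2 the misaligned case reaches the bound exactly. For other sizes the bound is an upper limit that may not be reached.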

Minimum Transfers to Sort
Given an array of size n, sort it.
Assume a slow/fast memory model.
Recall that comparison sorts need to perform n log(n) operations: W(n) = Ω(n log n).

What is the number of slow/fast memory transfers?
⌈n/L⌉, or just n/L:
Q(n; Z, L) = Ω(⌈n/L⌉) or Ω(n/L)
The n appears because each element is touched at least once, the L because you read the elements from slow
memory one block at a time.
This answer would be impressive: Q(n; Z, L) = Ω((n/L) · log(n/L) / log(Z/L))
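To get a feel for how far apart the two sorting bounds are, they can be evaluated numerically. This sketch ignores the hidden constants in the asymptotic notation, and the values of n, Z and L are arbitrary choices for illustration.

```python
import math

def trivial_bound(n, L):
    """One pass over the data: ceiling(n / L) block transfers."""
    return math.ceil(n / L)

def sorting_bound(n, Z, L):
    """(n/L) * log(n/L) / log(Z/L), with constants ignored."""
    return (n / L) * math.log2(n / L) / math.log2(Z / L)

# Illustrative sizes only: 2^20 words, a 2^13-word fast memory, 8-word blocks.
n, Z, L = 2**20, 2**13, 2**3
one_pass = trivial_bound(n, L)
sort_cost = sorting_bound(n, Z, L)
ratio = sort_cost / one_pass   # equals log(n/L) / log(Z/L)
```

The ratio of the two bounds is log(n/L) / log(Z/L), so the sharper bound only pulls away from a single pass when n/L is much larger than Z/L.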

Consider a matrix-matrix multiply on a machine with a two-level memory.
The matrices are all n × n objects.
For a non-Strassen algorithm, the work is W(n) = O(n³).
Question: What is the minimum number of transfers?
Answer: Q(n; Z, L) = Ω(n²/L)
The n · n counts the number of elements; dividing by L converts it to the number of transfers.

Answer if you are already familiar with the question: Q(n; Z, L) = Ω(n³ / (L · √Z))

I/O Example: Reduction

W(n) = Ω(n) (work)

Q(n; Z, L) = Ω(n/L) (number of transfers)
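Let's look at an algorithm to see if we can achieve the lower bound.

For a sequential processor without fast memory, the reduction is a single loop that adds each element of the array X to a running sum s.

When you have a two-level memory, you need to think about when to move data from slow to
fast memory.

Assume s begins locally, already in the fast memory.
Assume n >> Z (the array is much bigger than the cache).
Assume X is aligned on an L-word boundary.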

Now make slow and fast memory transfers explicit:

Note: the outer loop steps through the array one block (L words) at a time.
L̂ is the size of the current block: L, or smaller for a final partial block. This detail can often be ignored.
y: this is a load from slow to fast memory; it requests at most L words (1 block transfer).
Since s and y are local to fast memory, the processor can execute the innermost loop.
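The blocked reduction described in the notes above might look like the following. It is a sketch only: the slow memory is modeled as a Python list, a slice stands in for a block load, and the counters `work` and `transfers` are added here to mirror W and Q.

```python
def blocked_sum(X, L):
    """Sum X one L-word block at a time, counting additions and block loads."""
    n = len(X)
    s = 0           # running sum; assumed to live in fast memory
    work = 0        # W: number of additions performed
    transfers = 0   # Q: number of slow-to-fast block loads
    for i in range(0, n, L):        # outer loop: one block per iteration
        L_hat = min(L, n - i)       # the final block may be partial
        y = X[i:i + L_hat]          # load: one block transfer, at most L words
        transfers += 1
        for value in y:             # inner loop touches only local data
            s += value
            work += 1
    return s, work, transfers

total, work, transfers = blocked_sum(list(range(10)), 4)
```

With n = 10 and L = 4 the loop performs three loads (two full blocks and one partial block), which matches ⌈n/L⌉.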

Work = W(n) = Θ(n)

Transfers = Q(n; Z, L) = Θ(⌈n/L⌉)

Compare to the lower bounds:
Lower bounds: Work = W(n) = Ω(n), Q(n; Z, L) = Ω(n/L)
The algorithm matches both lower bounds.

Observation:
Caches are very fast, but they are not sufficient to guarantee high performance.

Matrix-Vector Multiply

Multiply a dense n × n matrix, A, by a vector, x; the result is the vector y.

Work = W(n) = O(n²)
The matrix is stored in memory in column-major order: it is stored column-wise, with each
column following the previous column in memory.

The element a_ij can be found in memory using
the following rule:

a_ij ↔ A[i + j · n]
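The column-major rule can be written as a one-line index function. The helper names here (`col_major_index`, `to_col_major`) are made up for illustration, and indices are zero-based.

```python
def col_major_index(i, j, n):
    """Offset of element (i, j) of an n x n matrix stored column by column."""
    return i + j * n

def to_col_major(rows):
    """Flatten a square list-of-rows matrix into column-major order."""
    n = len(rows)
    return [rows[i][j] for j in range(n) for i in range(n)]

M = [[1, 2],
     [3, 4]]
flat = to_col_major(M)                     # column 0 first, then column 1
element = flat[col_major_index(0, 1, 2)]   # row 0, column 1 of M
```

Moving down a column changes the offset by 1, while moving along a row changes it by n. That difference is what separates the two loop orders considered next.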

Consider two algorithms to compute the product:

Algorithm 1: the outer loop iterates over rows, the inner
loop over columns.

Algorithm 2: the outer loop iterates over columns, the inner over rows.

In the basic RAM model, these algorithms are identical.

Question: Which algorithm in the two-level model does fewer transfers?
Assumptions:
The fast memory can hold two vectors: Z = 2n + O(L).
L | n (L divides n).
All arrays and matrices are aligned on L-word boundaries.
You can ignore floors and ceilings.
You can assume the algorithm preloads x and y, and stores y at the end.
These assumptions imply the number of transfers is at least:
Q(n; Z, L) = 3n/L + ???

So really, how many additional transfers does loading the matrix require?

Answer:
Algorithm 2 requires fewer transfers.
Consider Algorithm 1, which iterates over rows. Loading an element will load a block's worth of
column elements (the array is stored by columns). Then the next element in the row must
be loaded, which causes a new block of column elements to be loaded.
This will lead to Q(n; Z, L) = 3n/L + n²

In Algorithm 2, the block transfer matches the storage format.
This will lead to Q(n; Z, L) = 3n/L + n²/L

In the sequential model these two algorithms are identical, but in the two-level model they are
different.
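The difference between the two loop orders can be reproduced with a small simulation. This is a sketch under simplifying assumptions that go beyond the lesson's list: x and y are taken to be already in fast memory, and only a single L-word buffer is available for the matrix, so a transfer is counted whenever the block being read changes. The function name `matvec_transfers` is hypothetical.

```python
def matvec_transfers(A, x, n, L, by_rows):
    """Compute y = A x (A column-major) and count block loads of A."""
    y = [0] * n
    loaded_block = None   # index of the matrix block currently in fast memory
    transfers = 0
    if by_rows:           # outer loop over rows, inner over columns
        order = ((i, j) for i in range(n) for j in range(n))
    else:                 # outer loop over columns, inner over rows
        order = ((i, j) for j in range(n) for i in range(n))
    for i, j in order:
        address = i + j * n          # column-major offset of a_ij
        block = address // L
        if block != loaded_block:    # miss: fetch the block holding a_ij
            loaded_block = block
            transfers += 1
        y[i] += A[address] * x[j]
    return y, transfers

n, L = 8, 4
A = list(range(n * n))
x = [1] * n
y_rows, q_rows = matvec_transfers(A, x, n, L, by_rows=True)
y_cols, q_cols = matvec_transfers(A, x, n, L, by_rows=False)
```

Both orders produce the same y. With n = 8 and L = 4 the row-wise order costs n² = 64 matrix transfers and the column-wise order costs n²/L = 16.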

If you have a fully associative cache, will it help Algorithm 1 to be as fast as Algorithm 2?

Algorithmic Design Goals
What are the goals? What makes an algorithm good?

Goal 1:
Work optimality
The two-level algorithm should do the same work, asymptotically, as the best algorithm:

W(n) = Θ(W*(n)), where W* is the work of the best asymptotic algorithm.

Goal 2:
High computational intensity
This is the ratio of work to words transferred: I(n; Z, L) = W(n) / (L · Q(n; Z, L)).

Intensity is measured in operations per word; it measures the data reuse of the algorithm. It is good to have high
intensity, as long as the work is optimal.
This should remind you of work and span.

Which Is Better?
Given two algorithms, which is better?

Answer: Neither; there is insufficient information.
Recall the goals: low work and high intensity.
Algorithm 1 does less work, but its intensity is a constant.
Algorithm 2 does more work, but its intensity grows.

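Intensity, Balance, and Time
The relationship between work, transfers, and execution time:
τ = time per operation = [time]/[operation]

Time to compute = T_comp = τ · W

α = amortized time to move data between slow and fast memory = [time]/[word]

The time to execute Q transfers = T_mem = α · L · Q

The minimum time to execute the program: T ≥ max(T_comp, T_mem). This assumes perfect overlap
of computation and data movement.

The execution time relative to the ideal running time:

T ≥ τ · W · max(1, (α/τ) / (W/(L · Q)))

The factor τ · W is ideal because it assumes data
movement is free.

The max(...) factor is the penalty you must pay for moving the data.
It is: machine balance / intensity, where the intensity is I = W/(L · Q).

B = α/τ = machine balance, in [ops]/[word] (this is
machine dependent).

[ops]/[word]: how many operations can be
executed in the time it takes to move a word
of data.

The time as a function of balance and intensity:

T ≥ τ · W · max(1, B/I)

The maximum time, with no overlap at all, is:

T ≤ T_comp + T_mem = τ · W · (1 + B/I)

Normalized performance:

R = τ · W* / T ≤ (W*/W) · min(1, I/B), where W* is the work of the best algorithm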
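The time model in this section can be put into a few lines of code. This sketch calls the time per operation `tau` and the amortized time per word `alpha`; these names, and the sample numbers, are assumptions made for illustration.

```python
def time_model(W, Q, L, tau, alpha):
    """Bounds on execution time from work W and Q transfers of L words each."""
    T_comp = tau * W               # time spent computing
    T_mem = alpha * L * Q          # time spent moving data
    intensity = W / (L * Q)        # I: operations per word moved
    balance = alpha / tau          # B: operations per word-transfer time
    return {
        "T_comp": T_comp,
        "T_mem": T_mem,
        "I": intensity,
        "B": balance,
        "T_min": max(T_comp, T_mem),   # perfect overlap of compute and transfers
        "T_max": T_comp + T_mem,       # no overlap at all
    }

# Illustrative numbers only: a reduction-like workload with I = 1.
m = time_model(W=1000, Q=100, L=10, tau=1.0, alpha=4.0)
penalty = max(1.0, m["B"] / m["I"])    # the max(1, B/I) factor
```

Here I = 1 and B = 4, so even with perfect overlap the run takes four times the ideal compute time. Raising I to B or above removes the penalty.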

Roofline Plots
To visualize the relationships between R, I, and B, look at a roofline plot.

Assume W* and W are fixed, but I can vary.

The plot has a linearly increasing slope, an
inflection point, and a plateau.

Two values matter: the height of the plateau and the location of the
inflection point.

What are the values of x_0 and y_0, the coordinates of the inflection point?

x_0 = B: the critical point is reached as soon as I = B. So when designing an algorithm, try for an
intensity I ≥ B.

y_0 = W*/W (it is the maximum possible value). If you design an algorithm that is not work optimal,
you will pay a penalty.
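The shape of the roofline can be tabulated directly. This sketch assumes the normalized performance bound takes the form (W*/W) · min(1, I/B), which is consistent with the slope, inflection point and plateau described above. The function name and sample values are illustrative.

```python
def normalized_performance(I, B, W, W_star):
    """Upper bound on normalized performance at intensity I and balance B."""
    return (W_star / W) * min(1.0, I / B)

B = 8.0
# Work-optimal algorithm (W == W*): performance rises linearly, then plateaus at 1.
curve = [(I, normalized_performance(I, B, W=100, W_star=100))
         for I in (1.0, 2.0, 4.0, 8.0, 16.0)]
# An algorithm doing twice the optimal work plateaus at W*/W = 0.5.
capped = normalized_performance(16.0, B, W=200, W_star=100)
```

Below I = B the bound grows in proportion to I. At and beyond I = B it is flat at W*/W, so extra intensity past the balance point buys nothing.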

Intensity of Conventional Matrix Multiply

Consider a matrix-matrix multiply (non-Strassen).

Execute this algorithm on a two-level memory machine.

Assume:
Transfer size = 1 word (L = 1 word)
Z = 2n + O(1)

Question:
What is the intensity of the algorithm?

I(n; Z) = Θ(1)

Note:
W(n) = O(n³)
Q(n; Z) = n² (for elements in A) + 2n² (for elements in C) + n³ (for elements in B)

The reads of B dominate the overall transfer cost:
Q(n; Z) ≈ n³

I(n; Z) = ratio of W and Q ≈ 1, that is, Θ(1)

Can you do better? Yes.

There are n³ transfers but only n² data, so there might be a factor of n reuse of the data available.
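The transfer count above can be reproduced by walking a triple loop and tallying loads and stores. The loop structure here is an assumption chosen to match the stated counts and Z = 2n + O(1): one row of A and one column of B are held in fast memory at a time. Counting each multiply-add as one operation is also an assumption.

```python
def naive_matmul_counts(n):
    """Count operations W and word transfers Q (L = 1) for C = C + A B."""
    W = 0
    Q = 0
    for i in range(n):
        Q += n              # load row i of A once (n words, then reused)
        for j in range(n):
            Q += 1          # load C[i][j]
            Q += n          # load column j of B (n words, not reused across i)
            W += n          # n multiply-adds for the dot product
            Q += 1          # store C[i][j]
    return W, Q

W_small, Q_small = naive_matmul_counts(4)
W_big, Q_big = naive_matmul_counts(100)
intensity_big = W_big / Q_big    # approaches 1 as n grows
```

The total is n² + 2n² + n³ words, dominated by the repeated loads of B, so the ratio W/Q stays near 1 however large n becomes.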

Intensity of Conventional Matrix Multiply, Part 2

Divide the matrices into b × b blocks.

The reads and writes of blocks are slow/fast
memory transfers.

Count the block transfers and determine the
intensity of the algorithm.

Assume:

Answer: I(n; Z) = Θ(b), or Θ(√Z) when the blocks are sized to fill the fast memory (b = Θ(√Z))

W(n) = Θ(n³)

Q(n; Z) = Θ(n³/b)

Blocking is better than reading individual elements.
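A blocked multiply with explicit transfer counting shows where the factor of b comes from. This sketch assumes that b divides n, that L = 1, and that the fast memory can hold three b × b blocks at once (Z ≥ 3b²); the source leaves its own assumptions unstated, so these are labeled guesses. Each multiply-add counts as one operation.

```python
def blocked_matmul(A, B, n, b):
    """C = A B on n x n lists of lists, in b x b blocks; returns C, W, Q."""
    assert n % b == 0, "sketch assumes b divides n"
    C = [[0] * n for _ in range(n)]
    W = 0   # multiply-adds
    Q = 0   # words moved between slow and fast memory
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            Q += b * b                    # load the C block
            for k0 in range(0, n, b):
                Q += 2 * b * b            # load one A block and one B block
                for i in range(i0, i0 + b):
                    for j in range(j0, j0 + b):
                        acc = C[i][j]
                        for k in range(k0, k0 + b):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
                W += b ** 3               # work done on this block triple
            Q += b * b                    # store the C block
    return C, W, Q

n = 4
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
M = [[i * n + j for j in range(n)] for i in range(n)]
C2, W2, Q2 = blocked_matmul(identity, M, n, 2)
intensities = [W / Q for _, W, Q in (blocked_matmul(identity, M, n, b) for b in (1, 2, 4))]
```

The count works out to Q = 2n² + 2n³/b, so for n much larger than b the intensity W/Q is close to b/2. Doubling the block side roughly doubles the reuse of every word loaded.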
