
CCL 433 Distributed Computing

Introduction to Parallel Computing


What is Parallel Computing?
What is Parallel Computing? (2)
What is Parallel Computing? (3)
What is Parallel Computing? (4)
The Real World is Massively Parallel
Uses of Parallel Computing
Uses of Parallel Computing (2)
Reasons for Using Parallel Computing
• Save time and/or money
• Solve larger problems
• Provide concurrency
• Use of non-local resources
• Limits to serial computing
Top500.org Statistics
Top500.org Statistics (2)
Top500.org Statistics (3)
Top500.org Statistics (4)
von Neumann Architecture
Flynn's Classical Taxonomy
• S I S D: Single Instruction, Single Data
• S I M D: Single Instruction, Multiple Data
• M I S D: Multiple Instruction, Single Data
• M I M D: Multiple Instruction, Multiple Data
SISD
SISD: Examples
SIMD
SIMD (2)
SIMD: Examples
• ILLIAC IV
• Cray X-MP
• Cray Y-MP
MISD
MIMD
MIMD: Examples
• IBM POWER5
• HP/Compaq Alphaserver
• Intel IA-32
Some General Parallel Terminology
• Supercomputing / High Performance Computing (HPC)
• Node
• CPU / Socket / Processor / Core
• Task
• Pipelining
• Shared Memory
• Symmetric Multi-Processor (SMP)
• Distributed Memory
• Communications
• Synchronization
• Granularity
• Observed Speedup
• Parallel Overhead
• Massively Parallel
• Embarrassingly Parallel
• Scalability
Some General Parallel Terminology (2)
Parallel Computer Memory Architectures
• Shared Memory
• Distributed Memory
• Hybrid Distributed-Shared Memory
Shared Memory (UMA)
Shared Memory (NUMA)
Distributed Memory
Hybrid Distributed-Shared Memory
Parallel Programming Models
• Shared Memory (without threads)
• Threads
• Distributed Memory / Message Passing
• Data Parallel
• Hybrid
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
Shared Memory Model (without threads)
• Tasks share a common address space, which they read from and write to asynchronously
• Locks / semaphores may be used to control access to the shared memory
• The notion of data ownership is lacking
• It is more difficult to understand and manage data locality
• Implementations:
  – Native compilers and/or hardware translate user program variables into actual memory addresses (a minimal process-based sketch follows below)
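A minimal sketch of this model on a POSIX/Linux system (an illustration, not part of the original slides): two processes created with fork() share one mapped memory region and use a semaphore as the lock that controls access. All names (shared_t, counter) are illustrative.

    /* Shared memory without threads: two processes, one shared region. */
    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    typedef struct {
        sem_t lock;    /* semaphore controlling access to the shared data */
        int   counter; /* data read and written asynchronously by both tasks */
    } shared_t;

    int main(void) {
        /* Map one region that both the parent and the forked child will see. */
        shared_t *sh = mmap(NULL, sizeof(shared_t), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        sem_init(&sh->lock, 1 /* shared between processes */, 1);
        sh->counter = 0;

        pid_t pid = fork();
        for (int i = 0; i < 1000; i++) {   /* both tasks update the counter */
            sem_wait(&sh->lock);           /* lock: one writer at a time */
            sh->counter++;
            sem_post(&sh->lock);
        }
        if (pid == 0) return 0;            /* child is done */
        wait(NULL);
        printf("counter = %d\n", sh->counter);  /* expect 2000 */
        munmap(sh, sizeof(shared_t));
        return 0;
    }

Error checking is omitted for brevity; compile with -pthread on Linux.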
Threads Model
• Threads implementations commonly comprise:
  – A library of subroutines that are called from within parallel source code
  – A set of compiler directives embedded in either serial or parallel source code
• The programmer is responsible for determining all parallelism
• Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP (a brief sketch follows below)
Threads Model: Implementations
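A hedged illustration of the library-of-subroutines form of the threads model, using POSIX Threads; the thread count and function names are illustrative only. The equivalent OpenMP form is noted in the trailing comment.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Work performed by each thread; the argument selects its share. */
    static void *worker(void *arg) {
        long id = (long)arg;
        printf("thread %ld doing its share of the work\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];

        /* The programmer explicitly creates and manages the parallelism. */
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);

        /* The compiler-directive alternative (OpenMP) would instead be:
               #pragma omp parallel
               { ... work ... }                                           */
        return 0;
    }

Compile with -pthread (e.g., gcc -pthread).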
Distributed Memory / Message Passing Model
• Message passing implementations usually comprise a library of subroutines
• The programmer is responsible for determining all parallelism
• The MPI Forum was formed to establish a standard interface for message passing implementations
• MPI is now the de facto industry standard for message passing
Distributed Memory / Message Passing Model: Implementations
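A minimal sketch of the message passing model using MPI, the de facto standard named above (illustration only; the value 42 and the tag are arbitrary). Task 0 sends one integer to task 1 through library calls.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                                   /* data owned by task 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Build with mpicc and run with at least two tasks, e.g. mpirun -np 2 ./a.out.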
Data Parallel Model
Data Parallel Model: Implementations
• Accomplished by writing a program with data parallel constructs
• The constructs can be calls to a data parallel subroutine library or compiler directives recognized by a data parallel compiler
• Compiler directives:
  – Allow the programmer to specify the distribution and alignment of data
  – This model usually requires the compiler to produce object code with calls to a message passing library (MPI) for data distribution
  – All message passing is done invisibly to the programmer
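The directive-based data parallel compilers the slide refers to (e.g., HPF-style directives) are not shown here; as a rough stand-in only, the sketch below uses an OpenMP directive to apply the same operation to every element of an array, with the compiler/runtime (not the programmer) deciding how the work is divided. Array names and sizes are illustrative.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];   /* b is zero-initialized here */

        /* Each element gets the same operation: classic data parallelism. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i] + 1.0;

        printf("a[0] = %f\n", a[0]);
        return 0;
    }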
Hybrid Model
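One common hybrid combination (an assumption for illustration, not prescribed by the slides) is MPI between nodes plus OpenMP threads within each node. A minimal sketch:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);                 /* message passing between nodes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel                    /* shared-memory threads on-node */
        {
            printf("MPI task %d, OpenMP thread %d\n",
                   rank, omp_get_thread_num());
        }

        MPI_Finalize();
        return 0;
    }

Here no thread other than the master makes MPI calls, so plain MPI_Init suffices.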
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)
Designing Parallel Programs
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• I/O
• Limits and Costs of Parallel Programming
• Performance Analysis and Tuning
Automatic vs. Manual Parallelization
• Designing and developing parallel programs has characteristically been a very manual process
  – The programmer identifies and actually implements the parallelism
  – Time consuming
  – Complex
  – Error-prone
• Various tools have been available to assist the programmer with converting serial programs into parallel programs
  – The most common is a parallelizing compiler or pre-processor
  – It works in two different ways:
      • Fully automatic
      • Programmer directed
Understand the Problem and the Program
• The first step in developing parallel software
• Determine whether or not the problem is one that can actually be parallelized
• Identify the program's hotspots
• Identify bottlenecks in the program
• Identify inhibitors to parallelism
• Investigate other algorithms if possible
Partitioning
• Breaking the problem into discrete chunks of work that can be distributed to multiple tasks
• Also known as decomposition
• There are two basic ways to partition computational work among parallel tasks: domain decomposition and functional decomposition
Domain Decomposition
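A hedged sketch of domain decomposition (illustration only): a 1-D block distribution that gives each of ntasks tasks a contiguous chunk of an N-element data set. The names N, ntasks, mystart, and myend are hypothetical.

    #include <stdio.h>

    #define N 100

    int main(void) {
        int ntasks = 4;
        for (int rank = 0; rank < ntasks; rank++) {
            /* Spread any remainder over the first N % ntasks tasks. */
            int base    = N / ntasks;
            int extra   = N % ntasks;
            int mysize  = base + (rank < extra ? 1 : 0);
            int mystart = rank * base + (rank < extra ? rank : extra);
            int myend   = mystart + mysize - 1;
            printf("task %d owns elements %d..%d\n", rank, mystart, myend);
        }
        return 0;
    }

Other distributions (cyclic, block-cyclic, 2-D blocks) follow the same idea of assigning each task its own portion of the domain.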
Functional Decomposition
Functional Decomposition: Ecosystem Modeling
Functional Decomposition: Signal Processing
Functional Decomposition: Climate Modeling
Communications
The need for communications between tasks depends upon your problem:
• You DON'T need communications
  – Tasks share no data
  – These types of problems are often called embarrassingly parallel because very little inter-task communication is required
• You DO need communications
  – Tasks need to share data with each other
Communications (2)
• Factors to consider when designing inter-task communications:
  – Cost of communications
  – Latency vs. bandwidth
  – Visibility of communications
  – Synchronous vs. asynchronous communications
  – Scope of communications
  – Efficiency of communications
  – Overhead and complexity
Scope of Communications
• Point-to-point
  – One task acting as the sender/producer of data, and another acting as the receiver/consumer
• Collective
  – Data sharing between more than two tasks
  – Tasks are often specified as being members in a common group, or collective
  – Some common variations exist (a sketch of both scopes follows below)
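An illustration of the two scopes, assuming MPI: a point-to-point exchange between exactly two tasks, followed by a collective operation involving every task in the group. The token value and tag are arbitrary; run with at least two tasks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, token = 0, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Point-to-point: task 0 is the producer, task 1 the consumer. */
        if (rank == 0) {
            token = 7;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* Collective: every member of the group contributes its rank. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", size - 1, sum);

        MPI_Finalize();
        return 0;
    }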
Overhead and Complexity: A Simple Example
Synchronization
• Types of synchronization:
  – Barrier
  – Lock / semaphore
  – Synchronous communication operations
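A hedged sketch of the first two types from the list above, using POSIX Threads (barrier support assumed available, as on Linux): a lock serializing updates to shared data, and a barrier that no thread passes until all threads have reached it. Thread count and names are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_mutex_t   lock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t barrier;
    static int               counter = 0;

    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);        /* lock / semaphore style protection */
        counter++;
        pthread_mutex_unlock(&lock);

        pthread_barrier_wait(&barrier);   /* barrier: wait until all arrive */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %d\n", counter);   /* always NTHREADS */
        pthread_barrier_destroy(&barrier);
        return 0;
    }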
Data Dependencies
• A dependence exists between program statements when the order of statement execution affects the results of the program
• A data dependence results from multiple use of the same location(s) in storage by different tasks
• Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism
• Loop carried data dependence:

        DO 500 J = MYSTART, MYEND
           A(J) = A(J-1) * 2.0
    500 CONTINUE

  The value of A(J-1) must be computed before the value of A(J), so the iterations cannot be executed independently.
• Loop independent data dependence:

      task 1          task 2
      ------          ------
      X = 2           X = 4
        .               .
        .               .
      Y = X**2        Y = X**3

  The value of Y depends on which task stores X last (shared memory) or on when the value of X is communicated (distributed memory).
Data Dependencies: Examples
Handling Data Dependencies
• How to handle data dependencies:
  – Distributed memory architectures: communicate required data at synchronization points
  – Shared memory architectures: synchronize read/write operations between tasks
Load Balancing
Achieving load balance:
• Equally partition the work each task receives
  – Evenly distribute the data set among the tasks
  – Evenly distribute the iterations across the tasks
  – Use a performance analysis tool to detect any load imbalances
  – Adjust the work accordingly
• Use dynamic work assignment
  – Certain classes of problems result in load imbalances even if data is evenly distributed among tasks
  – When the amount of work each task will perform is intentionally variable, or is unable to be predicted, it may be helpful to use a scheduler / task pool approach (a brief sketch follows below)
  – It may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code
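A rough sketch of the scheduler / task pool idea using OpenMP dynamic scheduling (an illustration, not the only way to build a task pool): iterations with unpredictable cost are handed out a few at a time, so faster threads simply take more work. The work function and chunk size are hypothetical.

    #include <stdio.h>

    #define NTASKS 1000

    /* Stand-in for a piece of work whose cost varies from task to task. */
    static double do_work(int i) {
        double x = 0.0;
        for (int k = 0; k < (i % 100) * 1000; k++)
            x += 1.0 / (k + 1.0);
        return x;
    }

    int main(void) {
        double total = 0.0;

        /* schedule(dynamic, 4): threads grab 4 iterations at a time from a
           shared pool instead of receiving a fixed, possibly unbalanced slice. */
        #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
        for (int i = 0; i < NTASKS; i++)
            total += do_work(i);

        printf("total = %f\n", total);
        return 0;
    }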
Granularity
• A qualitative measure of the ratio of computation to communication
• Fine-grain parallelism:
  – Relatively small amounts of computational work are done between communication events
• Coarse-grain parallelism:
  – Relatively large amounts of computational work are done between communication/synchronization events
Fine-grain Parallelism
• Facilitates load balancing
• Implies high communication overhead and less opportunity for performance enhancement
• If granularity is too fine, it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation
Coarse-grain Parallelism
• Relatively large amounts of computational work are done between communication/synchronization events
• High computation to communication ratio
• Implies more opportunity for performance increase
• Harder to load balance efficiently
Fine-grain vs. Coarse-grain
• The most efficient granularity is dependent on the algorithm and the hardware environment in which it runs
• In most cases the overhead associated with communications and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity
• Fine-grain parallelism can help reduce overheads due to load imbalance
I/O: The Bad News
• I/O operations are generally regarded as inhibitors to parallelism
• Parallel I/O systems may be immature or not available for all platforms
• In an environment where all tasks see the same file space, write operations can result in file overwriting
• Read operations can be affected by the file server's ability to handle multiple read requests at the same time
• I/O that must be conducted over the network (NFS, non-local) can cause severe bottlenecks and even crash file servers
I/O: The Good News
• Parallel file systems are available:
  – GPFS: General Parallel File System for AIX (IBM)
  – Lustre: for Linux clusters (Oracle)
  – PVFS/PVFS2: Parallel Virtual File System for Linux clusters (Clemson/Argonne/Ohio State/others)
  – PanFS: Panasas ActiveScale File System for Linux clusters (Panasas, Inc.)
  – HP SFS: HP StorageWorks Scalable File Share, a Lustre-based parallel file system (Global File System for Linux) product from HP
• The parallel I/O programming interface specification for MPI has been available since 1996 as part of MPI-2
  – Vendor and free implementations are now commonly available
A Few Pointers on I/O
• Rule #1: reduce overall I/O as much as possible
• If you have access to a parallel file system, investigate using it
• Writing large chunks of data rather than small packets is usually significantly more efficient
• Confine I/O to specific serial portions of the job, and then use parallel communications to distribute data to parallel tasks
  – For example, Task 1 could read an input file and then communicate the required data to the other tasks
  – Likewise, Task 1 could perform the write operation after receiving the required data from all other tasks
• Use local, on-node file space for I/O if possible
  – For example, each node may have /tmp file space, which is more efficient than performing I/O over the network (a parallel-write sketch follows below)
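A hedged sketch using the MPI-2 parallel I/O interface mentioned above: each task writes one large, contiguous block of a shared file at its own offset (large chunks rather than many small packets). The file name "out.dat" and buffer size are illustrative only.

    #include <mpi.h>

    #define COUNT 1000000

    int main(int argc, char **argv) {
        int rank;
        static int buf[COUNT];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < COUNT; i++)
            buf[i] = rank;                    /* this task's portion of the data */

        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes COUNT ints starting at its own byte offset. */
        MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
        MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }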
Limits and Costs of Parallel Programming
• Amdahl's Law
• Complexity
• Portability
• Resource Requirements
• Scalability
Amdahl's Law
• Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized:

      speedup = 1 / (1 - P)

• If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup)
• If all of the code is parallelized, P = 1 and the speedup is infinite (in theory)
• If 50% of the code can be parallelized (P = 0.5), maximum speedup = 2, meaning the code will run twice as fast
Amdahl's Law (2)
• Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by:

      speedup = 1 / (P/N + S)

  where P = parallel fraction, N = number of processors, and S = serial fraction (S = 1 - P)
• Obviously, there are limits to the scalability of parallelism. For example:

                          Speedup
      N          P = .50     P = .90     P = .99
      10          1.82        5.26        9.17
      100         1.98        9.17       50.25
      1000        1.99        9.91       90.99
      10000       1.99        9.91       99.02
      100000      1.99        9.99       99.90
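A small check of the formula above (illustrative only): computing speedup = 1/(P/N + S) with S = 1 - P reproduces the first row of the table.

    #include <stdio.h>

    static double amdahl(double p, double n) {
        return 1.0 / (p / n + (1.0 - p));   /* speedup = 1 / (P/N + S) */
    }

    int main(void) {
        double fractions[] = {0.50, 0.90, 0.99};
        for (int i = 0; i < 3; i++)
            printf("N = 10, P = %.2f -> speedup = %.2f\n",
                   fractions[i], amdahl(fractions[i], 10.0));
        /* prints 1.82, 5.26, 9.17 */
        return 0;
    }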
Amdahl's Law (3)
• Certain problems demonstrate increased performance by increasing the problem size. For example:

      2D grid calculations:   85 seconds   (85%)
      Serial fraction:        15 seconds   (15%)

• We can increase the problem size by doubling the grid dimensions and halving the time step, resulting in four times the number of grid points and twice the number of time steps. The timings then look like:

      2D grid calculations:  680 seconds   (97.84%)
      Serial fraction:        15 seconds   (2.16%)

• Problems that increase the percentage of parallel time with their size are more scalable than problems with a fixed percentage of parallel time
Complexity
• In general, parallel applications are much more complex than corresponding serial applications:
  – Multiple instruction streams executing at the same time
  – Data flowing among the instruction streams
• The costs of complexity are measured in programmer time in virtually every aspect of the software development cycle:
  – Design
  – Coding
  – Debugging
  – Tuning
  – Maintenance
• Essential: adhere to good software development practices when working with parallel applications
Portability
• Portability issues with parallel programs have become less serious due to standardization in several APIs
• However:
  – The usual portability issues associated with serial programs apply to parallel programs
  – Implementations of the standards differ across APIs; sometimes code modifications are required in order to effect portability
  – Operating systems can play a key role in code portability issues
  – Hardware architectures are characteristically highly variable and can affect portability
Resource Requirements
• The primary intent of parallel programming is to decrease execution time; however, in order to accomplish this, more CPU time is required
  – e.g., a parallel code that runs in 1 hour on 8 processors actually uses 8 hours of CPU time
• The amount of memory required can be greater for parallel codes than serial codes, due to the need to replicate data and the overheads associated with parallel support libraries and subsystems
• For short-running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation, due to the overhead costs associated with setting up the parallel environment, task creation, communications, and task termination
Scalability
• The ability of a parallel program's performance to scale is a result of a number of interrelated factors; simply adding more machines is rarely the answer
• The algorithm may have inherent limits to scalability; at some point, adding more resources causes performance to decrease
• Hardware factors play a significant role in scalability. Examples:
  – Memory-CPU bus bandwidth on an SMP machine
  – Communications network bandwidth
  – Amount of memory available on any given machine or set of machines
  – Processor clock speed
• Parallel support libraries and subsystem software can limit scalability independent of your application
Performance Analysis and Tuning
• As with debugging, monitoring and analyzing parallel program execution is significantly more of a challenge than for serial programs
• A number of parallel tools for execution monitoring and program analysis are available
• Some are quite useful; some are cross-platform as well
• Work remains to be done, particularly in the area of scalability
