What is Parallel Computing?

The Real World is Massively Parallel

Uses of Parallel Computing

Reasons for Using Parallel Computing
• Save time and/or money
• Solve larger problems
• Provide concurrency
• Use of non-local resources
• Limits to serial computing

Top500.org Statistics

von Neumann Architecture

Flynn's Classical Taxonomy
• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data

SISD Examples

SIMD Examples: ILLIAC IV, Cray X-MP, Cray Y-MP

MISD

MIMD Examples: IBM POWER5, HP/Compaq Alphaserver, Intel IA-32

Some General Parallel Terminology
• Supercomputing / High Performance Computing (HPC)
• Node
• CPU / Socket / Processor / Core
• Task
• Pipelining
• Shared Memory
• Symmetric Multi-Processor (SMP)
• Distributed Memory
• Communications
• Synchronization
• Granularity
• Observed Speedup
• Parallel Overhead
• Massively Parallel
• Embarrassingly Parallel
• Scalability

Parallel Computer Memory Architectures
• Shared Memory
• Distributed Memory
• Hybrid Distributed-Shared Memory

Shared Memory (UMA)

Shared Memory (NUMA)

Distributed Memory

Hybrid Distributed-Shared Memory

Parallel Programming Models
• Shared Memory (without threads)
• Threads
• Distributed Memory / Message Passing
• Data Parallel
• Hybrid
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)

Shared Memory Model (without threads)
• Tasks share a common address space, which they read from and write to asynchronously
• Locks/semaphores may be used to control access to the shared memory
• The notion of data ownership is lacking, so it is more difficult to understand and manage data locality
• Implementations: native compilers and/or hardware translate user program variables into actual memory addresses

Threads Model
• Threads implementations commonly comprise:
  - A library of subroutines that are called from within parallel source code
  - A set of compiler directives embedded in either serial or parallel source code
• The programmer is responsible for determining all parallelism
• Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP

Threads Model Implementations

Distributed Memory / Message Passing Model
• Message passing implementations usually comprise a library of subroutines
• The programmer is responsible for determining all parallelism
• The MPI Forum was formed to establish a standard interface for message passing implementations
• MPI is now the de facto industry standard for message passing

Distributed Memory / Message Passing Model Implementations

Data Parallel Model

Data Parallel Model Implementations
• Accomplished by writing a program with data parallel constructs
• The constructs can be calls to a data parallel subroutine library, or compiler directives recognized by a data parallel compiler
• Compiler directives allow the programmer to specify the distribution and alignment of data
• This model usually requires the compiler to produce object code with calls to a message passing library (MPI) for data distribution; all message passing is done invisibly to the programmer

Hybrid Model

Single Program Multiple Data (SPMD) (see the MPI sketch below)

Multiple Program Multiple Data (MPMD)
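Since MPI and the SPMD style recur throughout these slides, a minimal sketch may help make them concrete. The program below is not from the original material: it assumes an MPI installation (compiled with mpicc, launched with mpirun) and shows the essence of SPMD, where every task runs one and the same executable and branches on its rank.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start the MPI runtime      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id (0..size-1) */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks      */

        if (rank == 0) {
            /* SPMD: one program, but task 0 takes a coordinating branch */
            printf("Task 0 of %d: coordinating\n", size);
        } else {
            printf("Task %d of %d: working\n", rank, size);
        }

        MPI_Finalize();
        return 0;
    }

Launched as, for example, mpirun -np 4 ./a.out, the single executable produces one "coordinating" line and three "working" lines.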
Designing Parallel Programs
• Automatic vs Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• I/O
• Limits and Costs of Parallel Programming
• Performance Analysis and Tuning

Automatic vs Manual Parallelization
• Designing and developing parallel programs has characteristically been a very manual process: the programmer identifies and actually implements the parallelism
  - Time consuming, complex, error-prone
• Various tools have been available to assist the programmer with converting serial programs into parallel programs; a parallelizing compiler or pre-processor works in two different ways:
  - Fully Automatic
  - Programmer Directed

Understand the Problem and the Program
• The first step in developing parallel software
• Determine whether or not the problem is one that can actually be parallelized
• Identify the program's hotspots
• Identify bottlenecks in the program
• Identify inhibitors to parallelism
• Investigate other algorithms if possible

Partitioning
• Breaking the problem into discrete chunks of work that can be distributed to multiple tasks; also known as decomposition
• There are two basic ways to partition computational work among parallel tasks: domain decomposition and functional decomposition

Domain Decomposition

Functional Decomposition

Functional Decomposition: Ecosystem Modeling

Functional Decomposition: Signal Processing

Functional Decomposition: Climate Modeling

Communications
The need for communications between tasks depends upon your problem:
• You DON'T need communications: the tasks share no data. These types of problems are often called embarrassingly parallel because very little inter-task communication is required
• You DO need communications: the tasks need to share data with each other

Communications (2)
• Factors to consider when designing inter-task communications:
  - Cost of communications
  - Latency vs bandwidth
  - Visibility of communications
  - Synchronous vs asynchronous communications
  - Scope of communications
  - Efficiency of communications
  - Overhead and complexity

Scope of Communications
• Point-to-point: one task acting as the sender/producer of data, and another acting as the receiver/consumer
• Collective: data sharing between more than two tasks, which are often specified as being members of a common group or collective; some common variations are broadcast, scatter, gather, and reduction

Overhead and Complexity: A Simple Example

Synchronization
• Types of synchronization:
  - Barrier
  - Lock / semaphore
  - Synchronous communication operations

Data Dependencies
• A dependence exists between program statements when the order of statement execution affects the results of the program
• A data dependence results from multiple uses of the same location(s) in storage by different tasks
• Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism
• Loop-carried data dependence:

        DO 500 J = MYSTART, MYEND
           A(J) = A(J-1) * 2.0
    500 CONTINUE

• Loop-independent data dependence:

    task 1        task 2
    ------        ------
    X = 2         X = 4
      .             .
    Y = X**2      Y = X**3
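The two cases above are exactly what decides whether a loop can be parallelized. A short C sketch (illustrative, not from the slides; OpenMP is used here since the Threads Model section names it) contrasts them:

    #include <stdio.h>
    #define N 1000

    int main(void) {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = b[i] = 1.0;

        /* Loop-carried dependence: each a[i] needs the a[i-1] computed by
           the previous iteration, so the iterations must run in order and
           the loop cannot safely be marked parallel. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] * 2.0;

        /* No dependence between iterations: each b[i] reads and writes only
           its own storage location, so the iterations may be divided among
           threads in any order. Compile with, e.g., gcc -fopenmp. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = b[i] * 2.0;

        printf("%f %f\n", a[N - 1], b[N - 1]);
        return 0;
    }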
Data Dependencies: Examples

Handling Data Dependencies
• Distributed memory architectures: communicate the required data at synchronization points
• Shared memory architectures: synchronize read/write operations between tasks

Load Balancing

Achieving Load Balance
• Equally partition the work each task receives
  - Evenly distribute the data set among the tasks
  - Evenly distribute the iterations across the tasks
  - Use performance analysis tools to detect any load imbalances, and adjust the work accordingly
• Use dynamic work assignment
  - Certain classes of problems result in load imbalances even if data is evenly distributed among tasks
  - When the amount of work each task will perform is intentionally variable, or is unable to be predicted, it may be helpful to use a scheduler-task pool approach
  - It may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code

Granularity
• A qualitative measure of the ratio of computation to communication
• Fine-grain parallelism: relatively small amounts of computational work are done between communication events
• Coarse-grain parallelism: relatively large amounts of computational work are done between communication/synchronization events

Fine-grain Parallelism
• Facilitates load balancing
• Implies high communication overhead and less opportunity for performance enhancement
• If granularity is too fine, it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation

Coarse-grain Parallelism
• Relatively large amounts of computational work are done between communication/synchronization events
• High computation-to-communication ratio
• Implies more opportunity for performance increase
• Harder to load balance efficiently

Fine-grain vs Coarse-grain
• The most efficient granularity is dependent on the algorithm and the hardware environment in which it runs
• In most cases the overhead associated with communications and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity
• Fine-grain parallelism can help reduce overheads due to load imbalance

I/O: The Bad News
• I/O operations are generally regarded as inhibitors to parallelism
• Parallel I/O systems may be immature or not available for all platforms
• In an environment where all tasks see the same file space, write operations can result in file overwriting
• Read operations can be affected by the file server's ability to handle multiple read requests at the same time
• I/O that must be conducted over the network (NFS, non-local) can cause severe bottlenecks and even crash file servers

I/O: The Good News
• Parallel file systems are available:
  - GPFS: General Parallel File System for AIX (IBM)
  - Lustre: for Linux clusters (Oracle)
  - PVFS/PVFS2: Parallel Virtual File System for Linux clusters (Clemson/Argonne/Ohio State/others)
  - PanFS: Panasas ActiveScale File System for Linux clusters (Panasas, Inc.)
  - HP SFS: HP StorageWorks Scalable File Share, a Lustre-based parallel file system (Global File System for Linux) product from HP
• The parallel I/O programming interface specification for MPI has been available since 1996 as part of MPI-2; vendor and free implementations are now commonly available
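As a concrete illustration of the MPI-2 parallel I/O interface just mentioned, each task can write a disjoint region of one shared file instead of all tasks contending for the same file. This is a minimal sketch; the file name and block size are illustrative assumptions, not from the slides.

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, buf[4];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 4; i++) buf[i] = rank;   /* this task's data */

        /* All tasks open the same file collectively... */
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        /* ...but each writes at its own byte offset, so no task
           overwrites another task's region. */
        MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(buf),
                          buf, 4, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }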
A Few Pointers on I/O
• Rule #1: Reduce overall I/O as much as possible
• If you have access to a parallel file system, investigate using it
• Writing large chunks of data rather than small packets is usually significantly more efficient
• Confine I/O to specific serial portions of the job, and then use parallel communications to distribute the data to the parallel tasks. For example, Task 1 could read an input file and then communicate the required data to the other tasks; likewise, Task 1 could perform the write operation after receiving the required data from all other tasks
• Use local, on-node file space for I/O if possible. For example, each node may have /tmp file space, which is more efficient than performing I/O over the network

Limits and Costs of Parallel Programming
• Amdahl's Law
• Complexity
• Portability
• Resource Requirements
• Scalability

Amdahl's Law
• States that potential program speedup is defined by the fraction of code (P) that can be parallelized: speedup = 1 / (1 - P)
• If P = 0, speedup = 1 (no speedup)
• If all of the code is parallelized, P = 1 and the speedup is infinite (in theory)
• If P = .50, maximum speedup = 2, meaning the code will run twice as fast

Amdahl's Law (2)
• Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by

      speedup = 1 / (P/N + S)

  where P = parallel fraction, N = number of processors, and S = serial fraction (S = 1 - P)
• Obviously, there are limits to the scalability of parallelism. For example:

                        Speedup
      N          P = .50    P = .90    P = .99
      10           1.82       5.26       9.17
      100          1.98       9.17      50.25
      1000         1.99       9.91      90.99
      10000        1.99       9.91      99.02
      100000       1.99       9.99      99.90

Amdahl's Law (3)
• Certain problems demonstrate increased performance by increasing the problem size. For example:

      2D grid calculations:  85 seconds  (85%)
      Serial fraction:       15 seconds  (15%)

• We can increase the problem size by doubling the grid dimensions and halving the time step, resulting in four times the number of grid points and twice the number of time steps. The timings then look like:

      2D grid calculations:  680 seconds  (97.84%)
      Serial fraction:        15 seconds  ( 2.16%)

• Problems that increase the percentage of parallel time with their size are more scalable than problems with a fixed percentage of parallel time
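The speedup table above is easy to verify numerically. The short C program below (an illustration, not part of the slides) evaluates speedup = 1 / (P/N + S) for the same values of P and N and reproduces the table:

    #include <stdio.h>

    int main(void) {
        const double p[] = { 0.50, 0.90, 0.99 };
        const long   n[] = { 10, 100, 1000, 10000, 100000 };

        printf("%8s %10s %10s %10s\n", "N", "P=.50", "P=.90", "P=.99");
        for (int i = 0; i < 5; i++) {
            printf("%8ld", n[i]);
            for (int j = 0; j < 3; j++) {
                double s = 1.0 - p[j];   /* serial fraction S = 1 - P */
                printf(" %10.2f", 1.0 / (p[j] / n[i] + s));
            }
            printf("\n");
        }
        return 0;
    }

For N = 10 and P = .50, this gives 1 / (0.05 + 0.50) = 1.82, matching the first table entry.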
Complexity
• In general, parallel applications are much more complex than corresponding serial applications: multiple instruction streams execute at the same time, and data flows among the instruction streams
• The costs of complexity are measured in programmer time in virtually every aspect of the software development cycle: design, coding, debugging, tuning, maintenance
• Essential: adhere to good software development practices when working with parallel applications

Portability
• Portability issues with parallel programs have become less serious due to standardization in several APIs
• However:
  - The usual portability issues associated with serial programs apply to parallel programs
  - Standards implementations differ across APIs, and sometimes code modifications are required in order to effect portability
  - Operating systems can play a key role in code portability issues
  - Hardware architectures are characteristically highly variable and can affect portability

Resource Requirements
• The primary intent of parallel programming is to decrease execution time
• However, in order to accomplish this, more CPU time is required; e.g., a parallel code that runs in 1 hour on 8 processors actually uses 8 hours of CPU time
• The amount of memory required can be greater for parallel codes than serial codes, due to the need to replicate data and the overheads associated with parallel support libraries and subsystems
• For short-running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation, due to the overhead costs associated with setting up the parallel environment, task creation, communications, and task termination

Scalability
• The ability of a parallel program's performance to scale is a result of a number of interrelated factors; simply adding more machines is rarely the answer
• The algorithm may have inherent limits to scalability: at some point, adding more resources causes performance to decrease
• Hardware factors play a significant role in scalability. Examples:
  - Memory-CPU bus bandwidth on an SMP machine
  - Communications network bandwidth
  - Amount of memory available on any given machine or set of machines
  - Processor clock speed
• Parallel support libraries and subsystems software can limit scalability independent of your application

Performance Analysis and Tuning
• As with debugging, monitoring and analyzing parallel program execution is significantly more of a challenge than for serial programs
• A number of parallel tools for execution monitoring and program analysis are available; some are quite useful, and some are cross-platform as well
• Work remains to be done, particularly in the area of scalability
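A minimal first step toward such analysis, before reaching for dedicated tools, is measuring observed speedup (defined in the terminology list as serial wall-clock time divided by parallel wall-clock time). The sketch below is an illustrative assumption, not from the slides; it uses OpenMP's omp_get_wtime and must be compiled with OpenMP enabled (e.g., gcc -fopenmp).

    #include <stdio.h>
    #include <omp.h>

    #define N 10000000

    int main(void) {
        static double a[N];
        double t0, serial, parallel;

        t0 = omp_get_wtime();
        for (long i = 0; i < N; i++) a[i] = i * 0.5;   /* serial run */
        serial = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++) a[i] = i * 0.5;   /* parallel run */
        parallel = omp_get_wtime() - t0;

        /* observed speedup = serial wall time / parallel wall time */
        printf("observed speedup: %.2f\n", serial / parallel);
        return 0;
    }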