Instructor: Martin Herbordt, PHO 333
Office Hours: T, Th 3-5 and by appointment
Phone: x39850   Email: herbordt@bu.edu   Webpage: http://learn.bu.edu
TFs: As this is an advanced undergrad/grad class with a large enrollment, the role of the TFs will be limited to grading and helping to find and solve system problems for the programming assignments.
Mission Statement: Programming for performance using the capabilities of modern processors
Course Description (catalog): Considers theory and practice of hardware-aware programming. The key theme is obtaining a significant fraction of potential performance through knowledge of the underlying computing platform and how the platform interacts with programs. Studies the architecture of, and programming methods for, contemporary high-performance processors. These include complex processor cores, multicore processors, and graphics processors (GPUs). Labs include use and evaluation of programming methods on these processors through applications such as matrix operations and the Fast Fourier Transform.
Prerequisites: Computer Organization (EC413 or equivalent), programming in C, and sufficient academic maturity, e.g., to learn new programming tools from professional documentation.
Course Motivation:
For several decades programmers found the von Neumann (vN) model to be an adequate world view for obtaining most of the potential performance from target systems. This model is familiar: instructions are executed serially in a single stream, and data are stored in a single image of memory. Instruction executions and memory accesses are assumed to be uniform, with little penalty. Except for specialized processors (DSPs, supercomputers, MPPs), good vN programming plus a good compiler meant taking advantage of most of a computer's capability.
Recent directions in processor architecture (complex memory hierarchies, superscalar and deeply pipelined CPUs, multicore, and accelerators such as SSE, GPUs, and FPGAs) have made the vN approach to obtaining performance obsolete for all but the very simplest embedded processors. For many applications performance is secondary; in those cases current methods remain excellent. But for applications requiring performance, a deeper level of machine understanding is required: the programmer must be aware of the underlying hardware at all stages of software development, from algorithm selection, through coding and interaction with system tools such as compilers and libraries, to debugging, tuning, and maintenance.
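The gap between vN-style reasoning and real machines shows up even in trivial code. As a minimal illustration (not part of the course materials; the function names are ours), the two loops below compute the same sum, but the row-major version walks memory sequentially and so makes far better use of cache lines on typical hardware:

```c
#include <assert.h>

#define N 512

/* Row-major traversal: consecutive accesses fall in the same
 * cache line, so most loads are cache hits. */
long sum_row_major(int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same data: each access strides
 * N * sizeof(int) bytes, missing the cache on most loads once
 * the matrix outgrows it. */
long sum_col_major(int a[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Sanity check: both traversal orders produce the same total. */
int traversal_demo(void) {
    static int m[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = i + j;
    return sum_row_major(m) == sum_col_major(m);
}
```

Under the vN model the two loops are equivalent; on a real memory hierarchy the row-major version is typically several times faster for large N.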
Texts and Organization (supplemented with additional articles, lecture notes, and tutorials):
For the complete (tentative) readings, see the Readings document.
Part 1: Single-core
This part of the course is based on sections of courses taught at CMU and ETH.
How to Write Fast Code, Markus Pueschel, Lecture Notes from CMU and ETH
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 5 and parts of Chapter 6
Various Intel HW & SW Reference Manuals
H&P 4th Edition, Appendix G: (Mostly) SW methods for complex processors
H&P 4th Edition, Appendix F: Vector processors
Part 2: Multicore
This part is a condensed and applied version of material from EC713 Parallel Computer Architecture.
Computer Architecture (Chapter 4): Hennessy & Patterson, 4e
Parallel Computer Architecture (Chapter 5): D. Culler, et al.
Parallel Programming (Chapters 2 and 3): D. Culler, et al.
Threads Primer: A Guide to Multithreaded Programming (Chapters 2-5): Lewis & Berg
OpenMP Lecture Notes: SCV at BU
Part 3: GPUs
This part is based on a course taught at UIUC by Wen-mei Hwu.
CUDA Reference Manual(s): NVIDIA
Programming Massively Parallel Processors: Kirk & Hwu
Course Mechanics
Style: One of the missions of this course is to be a practicum associated with the computer organization and architecture curriculum. As such, we explore contemporary high-end processors in some depth and then practice using that knowledge to obtain high resource utilization with real programs. The emphasis is therefore on programming, with lectures in support of the labs. Lectures will also introduce appropriate theory when necessary, especially with respect to performance evaluation.
Grading:
  Exam: 30%
  Programming/Homework Assignments: 50%
  Final Project: 20%
Please note that these percentages are tentative. Note also that the impact of an assignment/exam grade on the final grade depends on the expected variance in addition to the total.
Weekly Assignments: Until the beginning of the project there will be weekly assignments, 10 in all (0-9). All involve programming, mostly exploring small amounts of code in great depth. Some assignments involve pencil-and-paper problems in addition to programming.
Late Policy: Assignments must be submitted on time, usually on Mondays (by 23:59:59 local time). There is then a 20% per day penalty. You will get a total of 5 free late days to handle special (but common) occurrences such as illness, interviews, etc.
Academic Honesty versus Collaboration: You are encouraged to work together to learn the material and to discuss approaches to solving problems. However, you must come up with and write up the programs and other solutions on your own.
Exams: There will be at least one midterm exam. There may also be a final exam. Exams are open readings and open notes.
Final Project: The purpose is to add depth and to practice the concepts learned in an extended case study. Results will be written up conference-paper style and presented to the class. You may work in teams of up to two students. I will consider larger groups with sufficient justification.
Late Classes: Unfortunately, this class has been rescheduled so that it now conflicts with faculty candidate and distinguished lecture speakers. Therefore, on at least 2 days, class will be from 5:15-6:45. There may be a few more shifts; these will be announced well in advance.
Course Objectives
Review
  Computer architecture, including memory hierarchy and basic pipelining.
Learn about
  Various contemporary high-end processors, in particular the i7 core, multicore cache, and GPUs
  Methods of performance evaluation
  Methods of hardware-aware code development
  How to program complex hardware to obtain high utilization
Gain experience with developing efficient programs, including
  using extended instruction sets
  synchronization
  methods of parallel programming
  cache-aware optimizations
  CUDA for GPU programming.
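As a flavor of the CPU-aware side of this list, here is a minimal C sketch (illustrative only, not course code) of loop unrolling with multiple accumulators, a standard technique for exposing instruction-level parallelism on superscalar cores:

```c
/* Naive dot product: one accumulator serializes the additions
 * into a single loop-carried dependence chain. */
double dot_naive(const double *x, const double *y, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Unrolled by 2 with separate accumulators: the two chains are
 * independent, so a superscalar core can overlap them. */
double dot_unrolled(const double *x, const double *y, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                      /* cleanup iteration for odd n */
        s0 += x[i] * y[i];
    return s0 + s1;
}

/* Check both versions on a small example: 1*4 + 2*5 + 3*6 = 32. */
int dot_demo(void) {
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    return dot_naive(x, y, 3) == 32.0 && dot_unrolled(x, y, 3) == 32.0;
}
```

Splitting the sum into s0 and s1 changes nothing in the vN model but breaks the dependence chain that otherwise limits throughput to one add per loop iteration.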
From the Course Requisition Form
Basic Goals
1. Students should learn enough about processor architecture and programming to write fast code (code with high utilization of available resources) on contemporary processors.
2. The knowledge and experience should enable students to extend this capability to new processors and to large and varied applications.
Detailed Goals
Students should have a good understanding of the theory and practice of
1. Measuring performance
2. Developing fast code (decomposition, mapping, load balancing)
3. Using advanced capabilities such as blocking, SIMD vector extensions, and basic code optimizations
4. Parallel processing with small-scale (multicore) shared-memory processors
5. Programming GPUs, including problems in getting good performance
Students should also develop a deeper understanding of one of the three technologies with an extended project. This will consist of examining a more complex numerical problem.
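As a preview of the blocking mentioned in goal 3, here is a minimal C sketch of cache tiling for matrix multiply (illustrative only; the block size is a hypothetical, untuned value):

```c
#define BS 32  /* hypothetical block size; tuned in practice to L1 cache */

/* Cache-blocked matrix multiply, C += A*B, all n x n row-major.
 * Operating on BS x BS tiles reuses each loaded tile many times
 * before it is evicted, cutting memory traffic for large n. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}

/* Check a 2x2 case against a hand-computed result:
 * [1 2; 3 4] * [5 6; 7 8] = [19 22; 43 50]. */
int matmul_demo(void) {
    double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0, 0, 0, 0};
    matmul_blocked(2, A, B, C);
    return C[0] == 19 && C[1] == 22 && C[2] == 43 && C[3] == 50;
}
```

The loop nest does exactly the arithmetic of the textbook triple loop; only the order of the iterations changes, which is why blocking is a pure hardware-aware optimization.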
Course Outcomes
1. Sufficient knowledge of various processor architectures to be able to write high-performance programs
2. Basic knowledge of the principles and practice of performance evaluation and of writing high-performance programs, with the goal of applying this knowledge to other processors
3. Ability to use SSE instruction set extensions
4. Ability to write parallel programs using PThreads and OpenMP
5. Ability to write GPU programs in CUDA
6. Ability to formulate and design programs at a high level, accounting for a target architecture
Class Schedule (Spring 2014; condensed, as the original table did not survive extraction):
Classes meet twice weekly from 22-Jan (Class 1) through 30-Apr (Class 27), with additional dates of 2-May and 8-May. Recoverable class/date pairs include: Class 7, 10-Feb; Class 8, 12-Feb; Class 9, 19-Feb; Class 10, 24-Feb; Class 11, 26-Feb; Class 12, 3-Mar; Class 13, 5-Mar; Class 18 (Mid-Term Exam), 31-Mar; Class 19, 2-Apr; Class 20, 7-Apr; Class 21, 9-Apr; Class 22, 14-Apr; Class 23, 16-Apr; Classes 24-27, 23-Apr through 30-Apr.
Assignments 0-9 run in weekly succession: Ass. 0 goes out at the first class, and each later assignment goes out when the previous one is due (one due date is listed as Fri 3/8; Ass. 9 is due at the end of the term).
Topics in sequence: CPU-aware optimizations; Intro to SIMD and vector processing (NOTE: LATE CLASS in new room, 5:15-6:30, room TBD); OpenMP; Multicore Cache (state machines, protocols, optimizations); Multicore synchronization and its implementations; Mid-Term Exam.
READINGS & SCHEDULE 2014
L00: Class 1 (1) Intro, motivation, overview
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Programmability: Design Costs and Payoffs Using AMD GPU Streaming Languages and Traditional Multi-Core Libraries, Weber, SAAHPC, Extended Abstract and Lecture Slides
How to Write Fast Code, Markus Pueschel, Lecture Slides, Class 1
Debunking the 100x GPU versus CPU Myth: An Evaluation of Throughput Computing on CPU and GPU, V.W. Lee, et al., ISCA '10.
L01a, L01b: Class 2 (1) Review of memory hierarchy; memory-aware optimizations
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Any standard computer architecture textbook
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 6 (parts).
L02a, L02b: Classes 3-4 (2) CPU-aware optimizations, compiler interactions
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 5.
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
L03a, L03b: Classes 5-6 (2) Vector processing, SIMD, data-parallel programming, vector extensions
Slides from David Patterson
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
H&P 5e, Appendix G, Vector Processors
How to Write Fast Code, Markus Pueschel, Lecture Slides, Classes 13-14
Introduction to SSE Programming, Alex Fr, The Code Project,
http://www.codeproject.com/KB/recipes/sseintro.aspx
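The SSE readings above cover intrinsics of the kind used in this minimal C sketch (illustrative only; assumes an x86 target with SSE support), which adds two float arrays four elements at a time:

```c
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */

/* Vector add c = a + b using 128-bit SSE registers, four floats
 * per operation; n is assumed to be a multiple of 4. */
void add4(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}

/* Check one vector's worth of elements against scalar addition. */
int sse_demo(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    add4(a, b, c, 4);
    return c[0] == 11 && c[1] == 22 && c[2] == 33 && c[3] == 44;
}
```

The unaligned load/store intrinsics (_mm_loadu_ps, _mm_storeu_ps) work on any data; the aligned variants can be faster when 16-byte alignment is guaranteed.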
L04: Class 7 (1) Advanced single-core optimizations
Slides from Michelle Hugue and David Patterson
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Exploiting Instruction-Level Parallelism with Software Approaches, H&P 3e, Chapter 4, Sections 1-4
Hardware and Software for VLIW and EPIC, H&P 4e, Appendix G, pp. G-1 to G-15
Static Multiple Issue, P&H 4e, Chapter 4.10, pp. 393-398
Loop Unrolling Tutorial, Michelle Hugue, University of Maryland
Part II: Classes 9-16
L05: Classes 8, 10 (2) Parallel Programming
Parallel Computer Architecture Draft, Culler, et al.: Chapter 2, Parallel Programming
Parallel Computer Architecture Draft, Culler, et al.: Chapter 3, Sections 3.1, 3.2.2, 3.4.1, Programming for Performance
Type Architectures, L. Snyder
L06, L07: Classes 9, 11 (2) Programming with Threads; Review of Concurrency and Synchronization
Slides from H. Casanova, U. Delaware and U. Hawaii
POSIX Threads Programming, B. Barney, LLNL
Operating Systems Concepts by Peterson and Silberschatz, Chapter 9
L08: Class 12 OpenMP
Slides from BU SCV
L09: Classes 13-14 (1.5) Cache for Shared Memory
Parallel Computer Architecture, by Culler, et al., Chapter 5, Sections 5.1-5.4
L10: Classes 14-15 (1.5) Synchronization implementation for shared-memory processors
Parallel Computer Architecture, by Culler, et al., Chapter 5, Section 5.5
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, by J.M. Mellor-Crummey & M.L. Scott
Lecture Notes, M. Herlihy
Part III: Classes 16-17, 19-20, 23-24
L11-L16: Classes 16-17, 19-21, 23 (6) NVIDIA GPUs, CUDA, performance optimization, case studies
Slides from Kirk and Hwu
Programming Massively Parallel Processors by Kirk and Hwu
Various case studies, esp. from molecular dynamics
Class 18: Mid-Term
Class 22: Cancelled in lieu of group meetings with instructor
Class 24: Work on projects
Classes 25-27: Project presentations