EC527: High Performance Programming with Multicore and GPUs, Spring 2014

Instructor: Martin Herbordt, PHO 333

Office Hours: T, Th 3-5 and by appointment

Phone: x39850
Email: herbordt@bu.edu
Webpage: http://learn.bu.edu
TFs: As this is an advanced undergrad/grad class with a large enrollment, the role of the TFs will be limited to grading and helping to find and solve system problems for the programming assignments.

Mission Statement: Programming for performance using the capabilities of modern processors

Course Description (catalog): Considers theory and practice of hardware-aware programming. Key theme is obtaining a significant fraction of potential performance through knowledge of the underlying computing platform and how the platform interacts with programs. Studies architecture of, and programming methods for, contemporary high-performance processors. These include complex processor cores, multicore processors, and graphics processors (GPUs). Labs include use and evaluation of programming methods on these processors through applications such as matrix operations and the Fast Fourier Transform.

Prerequisites: Computer Organization (EC413 or equivalent), programming in C, academic maturity sufficient, e.g., to learn new programming tools from professional documentation.

Course Motivation:
For several decades programmers found the von Neumann (vN) model to be an adequate world view for obtaining most of the potential performance from target systems. This model is familiar: instructions are executed serially in a single stream and data are stored in a single image of memory. Instruction executions and memory accesses are assumed to be uniform with little penalty. Except for specialized processors (DSPs, supercomputers, MPPs), good vN programming plus a good compiler meant taking advantage of most of a computer's capability.
Recent directions in processor architecture (complex memory hierarchies, superscalar and deeply pipelined CPUs, multicore, and accelerators such as SSE, GPUs and FPGAs) have made the vN approach to obtaining performance obsolete for all but the very simplest embedded processors. For many applications performance is secondary; in those cases current methods remain excellent. But for applications requiring performance, a deeper level of machine understanding is required: the programmer must be aware of the underlying hardware at all stages of software development, from algorithm selection, through coding, to interaction with system tools such as compilers and libraries, and finally debugging, tuning, and maintenance.
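
To make the point about non-uniform memory access concrete, here is a minimal C sketch (not taken from the course materials). The function names `sum_row_major` and `sum_col_major` and the row-major layout are assumptions chosen for illustration. Both functions return the same value, and under the vN model they would cost the same. On a machine with caches, the column-order version typically runs slower on large matrices because consecutive accesses are `cols` elements apart.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order, visiting the
   elements in the order they sit in memory (stride 1). */
long sum_row_major(const int *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}

/* Same arithmetic, but column by column: each access is cols
   elements away from the previous one (stride cols). */
long sum_col_major(const int *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}
```

Timing the two calls on a matrix much larger than the last-level cache is one simple way to observe the gap; the size of the gap depends on the machine and the compiler flags.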

Texts and Organization (supplemented with additional articles, lecture notes, and tutorials):
For complete (tentative) readings see Readings document.
Part 1: Single core
This part of the course is based on sections of courses taught at CMU and ETH.
How to Write Fast Code, Markus Pueschel, Lecture Notes from CMU and ETH
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 5 and parts of Chapter 6
Various Intel HW & SW Reference Manuals
H&P 4th Edition Appendix G: (Mostly) SW methods for complex processors
H&P 4th Edition Appendix F: Vector processors

Part 2: Multicore
This part is a condensed and applied version of material from EC713 Parallel Computer Architecture.
Computer Architecture (Chapter 4): Hennessy & Patterson 4e
Parallel Computer Architecture (Chapter 5): D. Culler, et al.
Parallel Programming (Chapters 2 and 3): D. Culler, et al.
Threads Primer: A Guide to Multithreaded Programming (Chapters 2-5): Lewis & Berg
OpenMP Lecture Notes: SCV at BU

Part 3: GPUs
This part is based on a course taught at UIUC by Wen-mei Hwu.
CUDA Reference Manual(s): NVIDIA
Programming Massively Parallel Processors: Kirk & Hwu

Course Mechanics
Style: One of the missions of this course is to be a practicum associated with the computer organization and architecture curriculum. As such we explore contemporary high-end processors in some depth and then practice using that knowledge to obtain high resource utilization with real programs. The emphasis is therefore on programming, with lectures in support of the labs. Lectures will also introduce appropriate theory when necessary, especially with respect to performance evaluation.
Grading:
Exam: 30%
Programming/Homework Assignments: 50%
Final Project: 20%
Please note that these percentages are tentative. Also note that the impact of an assignment or exam grade on the final grade depends on the expected variance in addition to the total.
Weekly Assignments: Until the beginning of the project there will be weekly assignments, 10 in all (0-9). All involve programming, mostly exploring small amounts of code in great depth. Some assignments involve pencil-and-paper problems in addition to programming.
Late Policy: Assignments must be submitted on time, usually on Mondays (by 23:59:59 local time). There is then a 20% per day penalty. You will get a total of 5 free late days to handle special (but common) occurrences such as illness, interviews, etc.
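
As a worked example of the late-policy arithmetic, the C sketch below computes the percentage deducted from a late assignment. It rests on assumptions the syllabus does not state: that free late days are applied before any penalty accrues, that the 20% is taken from the full assignment score, and that the penalty is capped at 100%. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Percentage deducted for a submission that is days_late days late when the
   student spends free_days_used of the 5 free late days on it.
   ASSUMPTIONS (not stated in the syllabus): free days are subtracted first,
   the penalty is 20 percentage points per remaining day, capped at 100. */
int late_penalty_percent(int days_late, int free_days_used)
{
    assert(days_late >= 0);
    assert(free_days_used >= 0 && free_days_used <= 5);

    int charged_days = days_late - free_days_used;
    if (charged_days < 0)
        charged_days = 0;

    int penalty = 20 * charged_days;
    return penalty > 100 ? 100 : penalty;
}
```

Under these assumptions, an assignment submitted three days late with one free day applied loses 40%.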
Academic Honesty versus Collaboration: You are encouraged to work together to learn the material and to discuss approaches to solving problems. However, you must come up with and write up the programs and other solutions on your own.
Exams: There will be at least one midterm exam. There may also be a final exam. Exams are open readings and open notes.
Final Project: The purpose is to add depth and to practice the concepts learned in an extended case study. Results will be written up conference-paper style and presented to the class. You may work in teams of up to two students. I will consider larger groups with sufficient justification.
Late Classes: Unfortunately, this class has been rescheduled so that it now conflicts with faculty candidate and distinguished lecture speakers. Therefore, on at least 2 days, class will be from 5:15-6:45. There may be a few more shifts; these will be announced well in advance.

Course Objectives
Review
  Computer architecture including memory hierarchy and basic pipelining.
Learn about
  Various contemporary high-end processors, in particular the i7 core, multicore cache, and GPUs
  Methods of performance evaluation
  Methods of hardware-aware code development
  How to program complex hardware to obtain high utilization
Gain experience with developing efficient programs, including
  using extended instruction sets
  synchronization
  methods of parallel programming
  cache-aware optimizations
  CUDA for GPU programming.

From the Course Requisition Form
Basic Goals
1. Students should learn enough about processor architecture and programming to write fast code (code with high utilization of available resources) on contemporary processors.
2. The knowledge and experience should enable students to extend this capability to new processors and to large and varied applications.

Detailed Goals
Students should have a good understanding of theory and practice of
1. Measuring performance
2. Developing fast code (decomposition, mapping, load balancing)
3. Using advanced capabilities such as blocking, SIMD vector extensions, and basic code optimizations
4. Parallel processing with small-scale (multicore) shared-memory processors
5. Programming GPUs, including problems in getting good performance
Students should also develop a deeper understanding of one of the three technologies with an extended project. This will consist of examining a more complex numerical problem.
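
Blocking, named in goal 3, can be sketched as follows. This is a minimal illustration under assumptions, not course code: the function name `matmul_blocked`, the tile size `BLOCK`, and the row-major square-matrix layout are all choices made for the example. The loops are tiled so that each tile of the operands is reused while it is likely still in cache; the arithmetic is the same as in the naive triple loop.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32  /* tile edge; an assumed value, to be tuned per cache size */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices, computed tile by tile.
   The caller is expected to initialize C (e.g. to zero). */
void matmul_blocked(const double *A, const double *B, double *C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                /* Work inside one tile; min_size handles ragged edges. */
                for (size_t i = ii; i < min_size(ii + BLOCK, n); i++)
                    for (size_t k = kk; k < min_size(kk + BLOCK, n); k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < min_size(jj + BLOCK, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

A reasonable starting point for `BLOCK` is a value for which three tiles of doubles fit in the L1 or L2 data cache; the best value is found by measurement on the target machine.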

Course Outcomes
1. Sufficient knowledge of various processor architectures to be able to write high-performance programs
2. Basic knowledge of principles and practice of performance evaluation and writing high-performance programs with the goal of applying this knowledge to other processors.
3. Ability to use SSE instruction set extensions
4. Ability to write parallel programs using PThreads and OpenMP
5. Ability to write GPU programs in CUDA
6. Ability to formulate and design programs at a high level accounting for a target architecture
Please Note: All dates are tentative! Exam and assignment due dates are not official until announced in class.

Wk  Cl  Date    Lecture Topic
1   1   15-Jan  Intro, administration, goals & expectations; Motivation
2   -   20-Jan  No Class -- MLK Birthday
2   2   22-Jan  Review memory hierarchy; Memory-aware optimizations
3   3   27-Jan  CPU-aware optimizations
3   4   29-Jan  CPU-aware optimizations
NOTE - LATE CLASS in new room (5:15-6:30 in TBD)
4   5   3-Feb   Intro to SIMD and Vector processing
4   6   5-Feb   Programming using SSE
5   7   10-Feb  Advanced single-thread methods
5   8   12-Feb  Intro to parallel programming, part 1
6   -   17-Feb  No Class -- President's Day
6   9   19-Feb  Thread-based programming and Pthreads
7   10  24-Feb  Intro to parallel programming, part 2
7   11  26-Feb  Review Concurrency
8   12  3-Mar   OpenMP
8   13  5-Mar   Multicore Cache -- state machines, protocols, optimizations
SPRING BREAK -- March 9-17
9   14  17-Mar  Multicore Cache; Multicore synchronization
9   15  19-Mar  Multicore synchronization implementations
10  16  24-Mar  GPU -- First pass
10  17  26-Mar  GPU -- Thread Organization
11  18  31-Mar  Mid-Term Exam
11  19  2-Apr   GPU -- Memory Organization
12  20  7-Apr   GPU -- In Depth, processor part 1
12  21  9-Apr   GPU -- In Depth, processor part 2
13  22  14-Apr  Open -- group meetings
13  23  16-Apr  GPU -- In Depth, memory
14  24  23-Apr  Open -- work on projects
14  25  24-Apr  Project presentations
15  26  28-Apr  Project presentations
15  27  30-Apr  Project presentations
16  -   2-May   Project Writeups Due -- 17:00
16  -   8-May   Final Exam - Thursday, 9:00 - 11:00

HW (in schedule order)
Ass. 0 out
Ass. 0 due, Ass. 1 out
Ass. 1 due, Ass. 2 out
Ass. 2 due, Ass. 3 out
Ass. 3 due, Ass. 4 out
Ass. 4 due, Ass. 5 out
Ass. 5 due, Ass. 6 out
Ass. 6 due Fri 3/8
Ass. 7 out
Ass. 7 due
Ass. 8 out
Ass. 8 due, Ass. 9 out
Ass. 9 due


READINGS & SCHEDULE 2014

L00: Class 1 (1) Intro, motivation, overview
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Programmability: Design Costs and Payoffs Using AMD GPU Streaming Languages and Traditional Multi-Core Libraries, Weber, SAAHPC, Extended Abstract and Lecture Slides
How to Write Fast Code, Markus Pueschel, Lecture Slides, Class 1
Debunking the 100x GPU versus CPU Myth: An evaluation of throughput computing on CPU and GPU, V.W. Lee, et al. ISCA '10.

L01a, L01b: Class 2 (1) Review memory hierarchy, Memory-aware optimizations
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Any standard Computer Architecture textbook
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 6 (parts).

L02a, L02b: Classes 3-4 (2) CPU-aware optimizations, compiler interactions
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 5.
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report

L03a, L03b: Classes 5-6 (2) Vector Processing, SIMD, Data Parallel Programming, Vector extensions
Slides from David Patterson
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
H&P 5e Appendix G, Vector Processors
How to Write Fast Code, Markus Pueschel, Lecture Slides, Classes 13-14
Introduction to SSE Programming, Alex Fr, The Code Project, http://www.codeproject.com/KB/recipes/sseintro.aspx

L04: Class 7 (1) Advanced single-core optimizations
Slides from Michelle Hugue and David Patterson
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Exploiting Instruction-Level Parallelism with Software Approaches, H&P 3e Chapter 4, Sections 1-4
Hardware and Software for VLIW and EPIC, H&P 4e Appendix G, pp. G-1 to G-15
Static Multiple Issue, P&H 4e Chapter 4.10, pp. 393-398
Loop Unrolling Tutorial, Michelle Hugue, University of Maryland

Part II: Classes 9-16

L05: Classes 8, 10 (2) Parallel Programming
Parallel Computer Architecture Draft, Culler, et al.: Chapter 2, Parallel Programming
Parallel Computer Architecture Draft, Culler, et al.: Chapter 3, Sections 3.1, 3.2.2, 3.4.1, Programming for Performance
Type Architectures, L. Snyder

L06, L07: Classes 9, 11 (2) Programming with Threads, Review Concurrency and Synchronization
Slides from H. Casanova, U Delaware, and U Hawaii
POSIX Threads Programming, B. Barney, LLNL
Operating Systems Concepts by Peterson and Silberschatz, Chapter 9

L08: Class 12 OpenMP
Slides from BU SCV

L09: Classes 13-14 (1.5) Cache for Shared Memory
Parallel Computer Architecture, by Culler, et al. Chapter 5, Sections 5.1-5.4

L10: Classes 14-15 (1.5) Synchronization implementation for shared-memory processors
Parallel Computer Architecture, by Culler, et al. Chapter 5, Section 5.5
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, by J.M. Mellor-Crummey & M.L. Scott.
Lecture Notes, M. Herlihy

Part III: Classes 16-17, 19-20, 23-24

L11-L16: Classes 16-17, 19-21, 23 (6) NVIDIA GPUs, CUDA, Performance optimization, case studies
Slides from Kirk and Hwu
Programming Massively Parallel Processors by Kirk and Hwu
Various case studies, esp. from Molecular Dynamics

Class 18: Mid-Term
Class 22: Cancelled in lieu of group meetings with instructor
Class 24: Work on projects
Classes 25-27: Project presentations