Instructor: Martin Herbordt, PHO 333
Office Hours: T, Th 3-5 and by appointment
Phone: x39850   Email: herbordt@bu.edu   Webpage: http://learn.bu.edu
TFs: As this is an advanced undergrad/grad class with a large enrollment, the role of the TFs will be limited to grading and helping to find and solve system problems for the programming assignments.
Mission Statement: Programming for performance using the capabilities of modern processors
Course Description (catalog): Considers theory and practice of hardware-aware programming. The key theme is obtaining a significant fraction of potential performance through knowledge of the underlying computing platform and how the platform interacts with programs. Studies the architecture of, and programming methods for, contemporary high-performance processors. These include complex processor cores, multicore processors, and graphics processors (GPUs). Labs include use and evaluation of programming methods on these processors through applications such as matrix operations and the Fast Fourier Transform.
Prerequisites: Computer Organization (EC413 or equivalent), programming in C, and sufficient academic maturity, e.g., to learn new programming tools from professional documentation.
Course Motivation:
For several decades programmers found the von Neumann (vN) model to be an adequate world view for obtaining most of the potential performance from target systems. This model is familiar: instructions are executed serially in a single stream, and data are stored in a single image of memory. Instruction executions and memory accesses are assumed to be uniform, with little penalty. Except for specialized processors (DSPs, supercomputers, MPPs), good vN programming plus a good compiler meant taking advantage of most of a computer's capability.
Recent directions in processor architecture (complex memory hierarchies, superscalar and deeply pipelined CPUs, multicore, and accelerators such as SSE, GPUs, and FPGAs) have made the vN approach to obtaining performance obsolete for all but the very simplest embedded processors. For many applications performance is secondary; in those cases current methods remain excellent. But for applications requiring performance, a deeper level of machine understanding is required: the programmer must be aware of the underlying hardware at all stages of software development, from algorithm selection, through coding and interaction with system tools such as compilers and libraries, to debugging, tuning, and maintenance.
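The gap between vN-style reasoning and real machines shows up even in trivial code. As a minimal illustration (not part of the course materials; the function names are ours), the two loops below compute the same sum, but the row-major version walks memory sequentially and so makes far better use of cache lines on typical hardware:

```c
#include <assert.h>

#define N 512

/* Row-major traversal: consecutive accesses fall in the same
 * cache line, so most loads are cache hits. */
long sum_row_major(int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same data: each access strides
 * N * sizeof(int) bytes, missing the cache on most loads once
 * the matrix outgrows it. */
long sum_col_major(int a[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Sanity check: both traversal orders produce the same total. */
int traversal_demo(void) {
    static int m[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = i + j;
    return sum_row_major(m) == sum_col_major(m);
}
```

Under the vN model the two loops are equivalent; on a real memory hierarchy the row-major version is typically several times faster for large N.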
Texts and Organization (supplemented with additional articles, lecture notes, and tutorials):
For the complete (tentative) readings, see the Readings document.
Part 1: Single-core
This part of the course is based on sections of courses taught at CMU and ETH.
How to Write Fast Code, Markus Pueschel, Lecture Notes from CMU and ETH
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 5 and parts of Chapter 6
Various Intel HW & SW Reference Manuals
H&P 4th Edition, Appendix G: (Mostly) SW methods for complex processors
H&P 4th Edition, Appendix F: Vector processors
Part 2: Multicore
This part is a condensed and applied version of material from EC713 Parallel Computer Architecture.
Computer Architecture (Chapter 4): Hennessy & Patterson, 4e
Parallel Computer Architecture (Chapter 5): D. Culler, et al.
Parallel Programming (Chapters 2 and 3): D. Culler, et al.
Threads Primer: A Guide to Multithreaded Programming (Chapters 2-5): Lewis & Berg
OpenMP Lecture Notes: SCV at BU
Part 3: GPUs
This part is based on a course taught at UIUC by Wen-mei Hwu.
CUDA Reference Manual(s): NVIDIA
Programming Massively Parallel Processors: Kirk & Hwu
Course Mechanics
Style: One of the missions of this course is to be a practicum associated with the computer organization and architecture curriculum. As such, we explore contemporary high-end processors in some depth and then practice using that knowledge to obtain high resource utilization with real programs. The emphasis is therefore on programming, with lectures in support of the labs. Lectures will also introduce appropriate theory when necessary, especially with respect to performance evaluation.
Grading:
  Exam: 30%
  Programming/Homework Assignments: 50%
  Final Project: 20%
Please note that these percentages are tentative. Note also that the impact of an assignment/exam grade on the final grade depends on the expected variance in addition to the total.
Weekly Assignments: Until the beginning of the project there will be weekly assignments, 10 in all (0-9). All involve programming, mostly exploring small amounts of code in great depth. Some assignments involve pencil-and-paper problems in addition to programming.
Late Policy: Assignments must be submitted on time, usually on Mondays (by 23:59:59 local time). There is then a 20% per day penalty. You will get a total of 5 free late days to handle special (but common) occurrences such as illness, interviews, etc.
Academic Honesty versus Collaboration: You are encouraged to work together to learn the material and to discuss approaches to solving problems. However, you must come up with and write up the programs and other solutions on your own.
Exams: There will be at least one midterm exam. There may also be a final exam. Exams are open readings and open notes.
Final Project: The purpose is to add depth and to practice the concepts learned in an extended case study. Results will be written up conference-paper style and presented to the class. You may work in teams of up to two students. I will consider larger groups with sufficient justification.
Late Classes: Unfortunately, this class has been rescheduled so that it now conflicts with faculty candidate and distinguished lecture speakers. Therefore, on at least 2 days, class will be from 5:15-6:45. There may be a few more shifts; these will be announced well in advance.
Course Objectives
Review
  Computer architecture, including memory hierarchy and basic pipelining.
Learn about
  Various contemporary high-end processors, in particular the i7 core, multicore cache, and GPUs
  Methods of performance evaluation
  Methods of hardware-aware code development
  How to program complex hardware to obtain high utilization
Gain experience with developing efficient programs, including
  using extended instruction sets
  synchronization
  methods of parallel programming
  cache-aware optimizations
  CUDA for GPU programming.
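As a flavor of the CPU-aware side of this list, here is a minimal C sketch (illustrative only, not course code) of loop unrolling with multiple accumulators, a standard technique for exposing instruction-level parallelism on superscalar cores:

```c
/* Naive dot product: one accumulator serializes the additions
 * into a single loop-carried dependence chain. */
double dot_naive(const double *x, const double *y, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Unrolled by 2 with separate accumulators: the two chains are
 * independent, so a superscalar core can overlap them. */
double dot_unrolled(const double *x, const double *y, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                      /* cleanup iteration for odd n */
        s0 += x[i] * y[i];
    return s0 + s1;
}

/* Check both versions on a small example: 1*4 + 2*5 + 3*6 = 32. */
int dot_demo(void) {
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    return dot_naive(x, y, 3) == 32.0 && dot_unrolled(x, y, 3) == 32.0;
}
```

Splitting the sum into s0 and s1 changes nothing in the vN model but breaks the dependence chain that otherwise limits throughput to one add per loop iteration.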
From the Course Requisition Form
Basic Goals
1. Students should learn enough about processor architecture and programming to write fast code (code with high utilization of available resources) on contemporary processors.
2. The knowledge and experience should enable students to extend this capability to new processors and to large and varied applications.
Detailed Goals
Students should have a good understanding of the theory and practice of
1. Measuring performance
2. Developing fast code (decomposition, mapping, load balancing)
3. Using advanced capabilities such as blocking, SIMD vector extensions, and basic code optimizations
4. Parallel processing with small-scale (multicore) shared-memory processors
5. Programming GPUs, including problems in getting good performance
Students should also develop a deeper understanding of one of the three technologies with an extended project. This will consist of examining a more complex numerical problem.
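As a preview of the blocking mentioned in goal 3, here is a minimal C sketch of cache tiling for matrix multiply (illustrative only; the block size is a hypothetical, untuned value):

```c
#define BS 32  /* hypothetical block size; tuned in practice to L1 cache */

/* Cache-blocked matrix multiply, C += A*B, all n x n row-major.
 * Operating on BS x BS tiles reuses each loaded tile many times
 * before it is evicted, cutting memory traffic for large n. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}

/* Check a 2x2 case against a hand-computed result:
 * [1 2; 3 4] * [5 6; 7 8] = [19 22; 43 50]. */
int matmul_demo(void) {
    double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0, 0, 0, 0};
    matmul_blocked(2, A, B, C);
    return C[0] == 19 && C[1] == 22 && C[2] == 43 && C[3] == 50;
}
```

The loop nest does exactly the arithmetic of the textbook triple loop; only the order of the iterations changes, which is why blocking is a pure hardware-aware optimization.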
Course Outcomes
1. Sufficient knowledge of various processor architectures to be able to write high-performance programs
2. Basic knowledge of the principles and practice of performance evaluation and of writing high-performance programs, with the goal of applying this knowledge to other processors
3. Ability to use SSE instruction set extensions
4. Ability to write parallel programs using PThreads and OpenMP
5. Ability to write GPU programs in CUDA
6. Ability to formulate and design programs at a high level, accounting for a target architecture
Class Schedule (Spring 2014; condensed, as the original table did not survive extraction):
Classes meet twice weekly from 22-Jan (Class 1) through 30-Apr (Class 27), with additional dates of 2-May and 8-May. Recoverable class/date pairs include: Class 7, 10-Feb; Class 8, 12-Feb; Class 9, 19-Feb; Class 10, 24-Feb; Class 11, 26-Feb; Class 12, 3-Mar; Class 13, 5-Mar; Class 18 (Mid-Term Exam), 31-Mar; Class 19, 2-Apr; Class 20, 7-Apr; Class 21, 9-Apr; Class 22, 14-Apr; Class 23, 16-Apr; Classes 24-27, 23-Apr through 30-Apr.
Assignments 0-9 run in weekly succession: Ass. 0 goes out at the first class, and each later assignment goes out when the previous one is due (one due date is listed as Fri 3/8; Ass. 9 is due at the end of the term).
Topics in sequence: CPU-aware optimizations; Intro to SIMD and vector processing (NOTE: LATE CLASS in new room, 5:15-6:30, room TBD); OpenMP; Multicore Cache (state machines, protocols, optimizations); Multicore synchronization and its implementations; Mid-Term Exam.
READINGS & SCHEDULE 2014
L00: Class 1 (1) Intro, motivation, overview
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Programmability: Design Costs and Payoffs Using AMD GPU Streaming Languages and Traditional Multi-Core Libraries, Weber, SAAHPC, Extended Abstract and Lecture Slides
How to Write Fast Code, Markus Pueschel, Lecture Slides, Class 1
Debunking the 100x GPU versus CPU Myth: An Evaluation of Throughput Computing on CPU and GPU, V.W. Lee, et al., ISCA '10.
L01a, L01b: Class 2 (1) Review of memory hierarchy; memory-aware optimizations
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Any standard computer architecture textbook
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 6 (parts).
L02a, L02b: Classes 3-4 (2) CPU-aware optimizations, compiler interactions
Computer Systems: A Programmer's Perspective, Bryant & O'Hallaron, Chapter 5.
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
L03a, L03b: Classes 5-6 (2) Vector processing, SIMD, data-parallel programming, vector extensions
Slides from David Patterson
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
H&P 5e, Appendix G, Vector Processors
How to Write Fast Code, Markus Pueschel, Lecture Slides, Classes 13-14
Introduction to SSE Programming, Alex Fr, The Code Project,
http://www.codeproject.com/KB/recipes/sseintro.aspx
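The SSE readings above cover intrinsics of the kind used in this minimal C sketch (illustrative only; assumes an x86 target with SSE support), which adds two float arrays four elements at a time:

```c
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */

/* Vector add c = a + b using 128-bit SSE registers, four floats
 * per operation; n is assumed to be a multiple of 4. */
void add4(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
}

/* Check one vector's worth of elements against scalar addition. */
int sse_demo(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    add4(a, b, c, 4);
    return c[0] == 11 && c[1] == 22 && c[2] == 33 && c[3] == 44;
}
```

The unaligned load/store intrinsics (_mm_loadu_ps, _mm_storeu_ps) work on any data; the aligned variants can be faster when 16-byte alignment is guaranteed.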
L04: Class 7 (1) Advanced single-core optimizations
Slides from Michelle Hugue and David Patterson
How to Write Fast Numerical Code: A Small Introduction, S. Chellappa, et al.; CMU Tech Report
Exploiting Instruction-Level Parallelism with Software Approaches, H&P 3e, Chapter 4, Sections 1-4
Hardware and Software for VLIW and EPIC, H&P 4e, Appendix G, pp. G-1 to G-15
Static Multiple Issue, P&H 4e, Chapter 4.10, pp. 393-398
Loop Unrolling Tutorial, Michelle Hugue, University of Maryland
Part II: Classes 9-16
L05: Classes 8, 10 (2) Parallel Programming
Parallel Computer Architecture Draft, Culler, et al.: Chapter 2, Parallel Programming
Parallel Computer Architecture Draft, Culler, et al.: Chapter 3, Sections 3.1, 3.2.2, 3.4.1, Programming for Performance
Type Architectures, L. Snyder
L06, L07: Classes 9, 11 (2) Programming with Threads; Review of Concurrency and Synchronization
Slides from H. Casanova, U. Delaware and U. Hawaii
POSIX Threads Programming, B. Barney, LLNL
Operating Systems Concepts by Peterson and Silberschatz, Chapter 9
L08: Class 12 OpenMP
Slides from BU SCV
L09: Classes 13-14 (1.5) Cache for Shared Memory
Parallel Computer Architecture, by Culler, et al., Chapter 5, Sections 5.1-5.4
L10: Classes 14-15 (1.5) Synchronization implementation for shared-memory processors
Parallel Computer Architecture, by Culler, et al., Chapter 5, Section 5.5
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, by J.M. Mellor-Crummey & M.L. Scott
Lecture Notes, M. Herlihy
Part III: Classes 16-17, 19-20, 23-24
L11-L16: Classes 16-17, 19-21, 23 (6) NVIDIA GPUs, CUDA, performance optimization, case studies
Slides from Kirk and Hwu
Programming Massively Parallel Processors by Kirk and Hwu
Various case studies, esp. from molecular dynamics
Class 18: Mid-Term
Class 22: Cancelled in lieu of group meetings with instructor
Class 24: Work on projects
Classes 25-27: Project presentations