
Performance Optimization and Modeling on Modern Computer Architectures

U Greifswald, 8.-12.10.2012

Tentative agenda

1. Prerequisites: The structured performance engineering approach
2. Introduction: Modern computer architecture
   a. Modern (Intel and AMD) x86 architectures
   b. GPGPUs
   c. Networks and clusters
3. Crash course in OpenMP and MPI
4. Typical performance properties and issues with OpenMP and MPI
   a. OpenMP: Microbenchmarking the socket and the node
   b. OpenMP: Cost of synchronization
   c. Case study: OpenMP-parallel histogram computation
   d. OpenMP: Enforcing affinity with likwid-pin
   e. MPI: Intra-node vs. inter-node communication
   f. MPI: Mapping problems
   g. MPI: Enforcing affinity
   h. Hybrid MPI+OpenMP programming: Pitfalls, benefits, opportunities
5. High-level performance models
   a. Scalability laws
   b. The significance and typical forms of overhead
   c. Performance vs. scalability
   d. Slow computing
6. Practical performance analysis (part 1)
   a. Finding hot spots: gprof
   b. Analyzing communication behavior: Intel Trace Collector/Analyzer
7. Code execution on modern architectures
   a. In-core execution
      i. Pipelining revisited
      ii. L1, decoders, ports, throughput
      iii. SIMD (Single Instruction Multiple Data)
      iv. Typical bottlenecks: MULT/ADD, LD/ST, dependencies, long-latency instructions, scalar execution
   b. Understanding the memory hierarchy
      i. Data transfer between memory levels
      ii. Write allocate vs. NT stores
      iii. Cache hierarchies
      iv. Contention and saturation in cache and memory
   c. NUMA effects: anisotropy and asymmetry
8. The Roofline Model
   a. Prerequisites and assumptions
   b. Case study: Optimizing a 3D Jacobi smoother
      i. Core-level optimizations
         1. Blocking
         2. NT stores
         3. SIMD vectorization
      ii. Multithreading: contention at different memory hierarchies
      iii. Temporal Blocking
9. Multicore saturation and the ECM model
   a. Prerequisites and assumptions
   b. Explaining saturation behavior
10. Energy & parallel scalability
   a. Energy consumption of modern processors
   b. The energy-to-solution metric and a multicore power model
   c. Performance engineering == power engineering
   d. Case studies
11. Case study: The Lattice-Boltzmann Method (LBM)
   a. The LBM algorithm
   b. Roofline model analysis
   c. ECM model analysis
   d. The role of SIMD vectorization
   e. Power and energy analysis
12. Practical performance analysis (part 2)
   a. Hardware metrics
   b. likwid-perfctr
   c. Typical performance patterns and metric signatures
13. Structured performance engineering revisited
14. Sparse Matrix-Vector Multiplication
   a. Data layouts
   b. Roofline model analysis and saturation behavior
   c. Overlapping communication and computation by functional parallelization on multicore chips
   d. A performance model for GPU-based spMVM
15. A backprojection algorithm for CT volume reconstruction
   a. The algorithm
   b. Naive analysis
   c. Detailed analysis and ECM performance model
   d. Optimizations and performance comparisons
16. Introduction to GPGPU programming with nVIDIA CUDA
   a. GPGPU architecture
   b. Performance expectations
   c. First steps with CUDA
