
Parallel programming:

Introduction to GPU architecture

Sylvain Collange
Inria Rennes – Bretagne Atlantique
sylvain.collange@inria.fr
GPU internals

What makes a GPU tick?

NVIDIA GeForce GTX 980 Maxwell GPU. Artist rendering!

2
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
3
The free lunch era... was yesterday
1980s to 2002: Moore's law, Dennard scaling, micro-architecture
improvements
Exponential performance increase
Software compatibility preserved

Do not rewrite software, buy a new machine!

Hennessy, Patterson. Computer Architecture: A Quantitative Approach. 4th Ed., 2006

4
Computer architecture crash course

How does a processor work?


Or rather, how it worked from the 1980s to the 1990s:
modern processors are much more complicated!
An attempt to sum up 30 years of research in 15 minutes

5
Machine language: instruction set

Registers
State for computations
Keep variables and temporaries
Example: R0, R1, R2, R3... R31
Instructions
Perform computations on registers,
move data between registers and memory, branch…
Instruction word
Binary representation of an instruction
01100111
Assembly language
Readable form of machine language
ADD R1, R3
Examples
Which instruction set does your laptop/desktop run?
Your cell phone?

6
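As an illustration (hypothetical RISC-like instructions, not any particular real ISA), a single C statement could map to a handful of machine instructions:

// C source: a = a + b, with a and b in memory.
// A compiler for a RISC-like machine might emit:
//   LOAD  R1, [a]      ; read a into register R1
//   LOAD  R3, [b]      ; read b into register R3
//   ADD   R1, R3       ; R1 <- R1 + R3
//   STORE [a], R1      ; write the result back to memory
void add_in_place(int *a, const int *b) {
    *a = *a + *b;
}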
The Von Neumann processor

PC

+1 Fetch unit
Instruction
word
Decoder
Operands Operation
Memory
Register file

Branch Unit    Arithmetic and Logic Unit    Load/Store Unit

Result bus

State machine
Let's look at it step by step
7
Step by step: Fetch

The processor maintains a Program Counter (PC)


Fetch: read the instruction word pointed to by the PC from memory

PC

Fetch unit 01100111

Instruction
word
01100111
Memory

8
Decode

Split the instruction word to understand what it represents


Which operation? → ADD
Which operands? → R1, R3

PC

Fetch unit

01100111 Instruction
word
Decoder
Operands Operation
Memory
R1, R3 ADD

9
Read operands

Get the value of registers R1, R3 from the register file


PC

Fetch unit
Instruction
word
Decoder
Operands
R1, R3 Memory
Register file

42, 17

10
Execute operation

Compute the result: 42 + 17


PC

Fetch unit
Instruction
word
Decoder
Operands Operation
Memory
Register file
42, 17 ADD

Arithmetic and
Logic Unit

59

11
Write back

Write the result back to the register file


PC

Fetch unit
Instruction
word
Decoder

R1 Operands Operation
Memory
Register file
59

Arithmetic and
Logic Unit

Result bus

12
Increment PC

PC

+1 Fetch unit
Instruction
word
Decoder
Operands Operation
Memory
Register file

Arithmetic and
Logic Unit

Result bus

13
Load or store instruction

Can read and write memory from a computed address


PC

+1 Fetch unit
Instruction
word
Decoder
Operands Operation
Memory
Register file

Arithmetic and Logic Unit    Load/Store Unit

Result bus

14
Branch instruction

Instead of incrementing PC, set it to a computed value


PC

+1 Fetch unit
Instruction
word
Decoder
Operands Operation
Memory
Register file

Branch Unit    Arithmetic and Logic Unit    Load/Store Unit

Result bus

15
What about the state machine?

The state machine controls everybody
Sequences the successive steps
Sends signals to units depending on the current state
At every clock tick, switch to the next state
Clock is a periodic signal (as fast as possible)

States: Fetch-decode → Read operands → Execute → Writeback → Increment PC → back to Fetch-decode
16
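A minimal software sketch of this fetch-decode-execute cycle, using a toy instruction encoding invented for illustration (not the encoding from the slides):

#include <stdint.h>
#include <stdio.h>

enum { OP_ADD = 0, OP_LOADI = 1, OP_HALT = 2 };   // toy opcodes

int main(void) {
    // Toy program: R1 <- 42; R3 <- 17; R1 <- R1 + R3; halt.
    uint16_t program[] = {
        (uint16_t)((OP_LOADI << 12) | (1 << 8) | 42),
        (uint16_t)((OP_LOADI << 12) | (3 << 8) | 17),
        (uint16_t)((OP_ADD   << 12) | (1 << 8) | (3 << 4)),
        (uint16_t)( OP_HALT  << 12),
    };
    int reg[4] = {0};
    int pc = 0;

    for (;;) {
        uint16_t iw = program[pc];      // Fetch: read the instruction word at PC
        int op  = iw >> 12;             // Decode: operation...
        int rd  = (iw >> 8) & 0xF;      // ...and operands
        int rs  = (iw >> 4) & 0xF;
        int imm = iw & 0xFF;
        if (op == OP_HALT) break;
        if (op == OP_LOADI) reg[rd] = imm;               // Execute + write back
        if (op == OP_ADD)   reg[rd] = reg[rd] + reg[rs];
        pc = pc + 1;                    // Increment PC
    }
    printf("R1 = %d\n", reg[1]);        // prints 59
    return 0;
}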
Recap

We can build a real processor


As it was in the early 1980s

You
are
here

How did processors become faster?


17
Reason 1: faster clock

Progress in semiconductor technology


allows higher frequencies

Frequency
scaling

But this is not enough!


18
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
19
Going faster using ILP: pipeline

Idea: we do not have to wait until instruction n has finished


to start instruction n+1
Like a factory assembly line
Or the bandejão (university cafeteria) line

20
Pipelined processor
Program

1: add r1, r3 Fetch Decode Execute Writeback


2: mul r2, r3
3: load r3, [r1]
1: add

Independent instructions can follow each other


Exploits ILP to hide instruction latency
21
Pipelined processor
Program

1: add r1, r3 Fetch Decode Execute Writeback


2: mul r2, r3
3: load r3, [r1]
2: mul 1: add

Independent instructions can follow each other


Exploits ILP to hide instruction latency
22
Pipelined processor
Program

1: add r1, r3 Fetch Decode Execute Writeback


2: mul r2, r3
3: load r3, [r1]
3: load 2: mul 1: add

Independent instructions can follow each other


Exploits ILP to hide instruction latency
23
Superscalar execution

Multiple execution units in parallel


Independent instructions can execute at the same time

Decode Execute Writeback

Fetch Decode Execute Writeback

Decode Execute Writeback

Exploits ILP to increase throughput

24
Locality

Time to access main memory: ~200 clock cycles


One memory access every few instructions
Are we doomed?

Fortunately: principle of locality


~90% of memory accesses on ~10% of data
Accessed locations are often the same
Temporal locality
Access the same location at different times
Spatial locality
Access locations close to each other

26
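A short sketch of how spatial locality shows up in code (array size chosen arbitrarily): traversing a matrix row by row touches consecutive addresses, while traversing it column by column jumps a full row ahead on every access.

#define N 1024
static float m[N][N];

// Row-major traversal: consecutive elements share cache lines (spatial locality).
float sum_by_rows(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

// Column-major traversal: each access lands N*4 bytes after the previous one,
// so for large N almost every access misses in the cache.
float sum_by_columns(void) {
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}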
Caches

Large memories are slower than small memories


The computer theorists lied to you:
in the real world, access in an array of size n costs O(log n), not O(1)!
Think about looking up a book in a small or huge library
Idea: put frequently-accessed data in small, fast memory
Can be applied recursively: hierarchy with multiple levels of cache

L1 cache:  capacity 64 KB,  access time 2 ns
L2 cache:  1 MB,  10 ns
L3 cache:  8 MB,  30 ns
Memory:    8 GB,  60 ns

27
Branch prediction

What if we have a branch?


We do not know the next PC to fetch from until the branch executes
Solution 1: wait until the branch is resolved
Problem: programs have 1 branch every 5 instructions on average
We would spend most of our time waiting
Solution 2: predict (guess) the most likely direction
If correct, we have bought some time
If wrong, just go back and start over
Modern CPUs can correctly predict over 95% of branches
World record holder: 1.691 mispredictions / 1000 instructions
General concept: speculation

P. Michaud and A. Seznec. "Pushing the branch predictability limits with the multi-poTAGE+ SC
predictor." JWAC-4: Championship Branch Prediction (2014). 28
Example CPU: Intel Core i7 Haswell
Up to 192 instructions
in flight
May be 48 predicted branches
ahead
Up to 8 instructions/cycle
executed out of order
About 25 pipeline stages at
~4 GHz
Quiz: how far does light travel
during the 0.25 ns of a clock
cycle?
Too complex to explain in 1
slide, or even 1 lecture

David Kanter. Intel's Haswell CPU architecture. RealWorldTech, 2012.
http://www.realworldtech.com/haswell-cpu/

29
Recap

Many techniques to run sequential programs


as fast as possible
Discovers and exploits parallelism between instructions
Speculates to remove dependencies
Works on existing binary programs,
without rewriting or re-compiling
Upgrading hardware is cheaper than improving software
Extremely complex machine

30
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
31
Technology evolution
Memory wall
Memory speed does not increase as fast as computing speed
More and more difficult to hide memory latency
[Chart: compute vs. memory performance over time, with a widening gap]

Power wall
Power consumption of transistors does not decrease as fast as density increases
Performance is now limited by power
[Chart: transistor density, transistor power, and total power consumption over time]

ILP wall
Law of diminishing returns on Instruction-Level Parallelism
Pollack's rule: cost ≃ performance²
[Chart: cost vs. serial performance]

32
Usage changes

New applications demand


parallel processing
Computer games : 3D graphics
Search engines, social networks…
“big data” processing
New computing devices are
power-constrained
Laptops, cell phones, tablets…
Small, light, battery-powered
Datacenters
High power supply
and cooling costs

33
Latency vs. throughput

Latency: time to solution


CPUs
Minimize time, at the expense of
power

Throughput: quantity of tasks


processed per unit of time
GPUs
Assumes unlimited parallelism
Minimize energy per operation

34
Amdahl's law

Bounds speedup attainable on a parallel machine

S = 1 / ((1 − P) + P / N)

S: speedup
P: ratio of parallel portions
N: number of processors
Total time = time to run the sequential portions (1 − P) + time to run the parallel portions (P / N)

[Plot: S (speedup) as a function of N (available processors)]

G. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS 1967.
35
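A worked example (numbers chosen for illustration): with P = 90% of the work parallel and N = 16 processors, S = 1 / (0.1 + 0.9/16) ≈ 6.4. Even with N → ∞, the speedup is bounded by 1 / (1 − P) = 10.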
Why heterogeneous architectures?
S = 1 / ((1 − P) + P / N): time to run the sequential portions, plus time to run the parallel portions on N processors

Latency-optimized multi-core (CPU)


Low efficiency on parallel portions:
spends too much resources

Throughput-optimized multi-core (GPU)


Low performance on sequential portions

Heterogeneous multi-core (CPU+GPU)


Use the right tool for the right job
Allows aggressive optimization
for latency or for throughput

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008. 36
Example: System on Chip for smartphone
Small cores
for background activity

GPU

Big cores
for applications
Lots of interfaces
Special-purpose accelerators
37
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
38
The (simplest) graphics rendering pipeline
Vertices → Vertex shader → Primitives (triangles…) → Clipping, Rasterization, Attribute interpolation → Fragments → Fragment shader (reads Textures) → Z-Compare, Blending (uses the Z-Buffer) → Pixels in the Framebuffer

Programmable stages: vertex shader, fragment shader
Parametrizable stages: the rest of the pipeline
39
How much performance do we need

… to run 3DMark 11 at 50 frames/second?

Element Per frame Per second

Vertices 12.0M 600M


Primitives 12.6M 630M
Fragments 180M 9.0G
Instructions 14.4G 720G

Intel Core i7 2700K: 56 Ginsn/s peak


We need to go 13x faster
Make a special-purpose accelerator

40
Source: Damien Triolet, Hardware.fr
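As a quick check of the figures above: 14.4G shader instructions per frame × 50 frames/s = 720G instructions/s, and 720G / 56G ≈ 13, hence the 13× factor.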
Beginnings of GPGPU

Microsoft DirectX: 7.x, 8.0, 8.1, 9.0a, 9.0b, 9.0c, 10.0 (unified shaders), 10.1, 11

NVIDIA: NV10, NV20, NV30, NV40, G70, G80-G90, GT200, GF100
(FP 16, programmable shaders, FP 32, dynamic control flow, SIMT, CUDA)

ATI/AMD: R100, R200, R300, R400, R500, R600, R700, Evergreen
(FP 24, CTM, FP 64, CAL)

GPGPU traction

Timeline: 2000 – 2010
42
Today: what do we need GPUs for?

1. 3D graphics rendering for games


Complex texture mapping, lighting
computations…

2. Computer Aided Design


workstations
Complex geometry

3. GPGPU
Complex synchronization, data
movements
One chip to rule them all
Find the common denominator

43
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
44
Little's law: data=throughput×latency
[Chart: throughput (GB/s) vs. latency (ns) for each level of the memory hierarchy.
Intel Core i7 920: latencies between 1.25 and 50 ns, throughputs on the order of 50–180 GB/s.
NVIDIA GeForce GTX 580: L1 ~1500 GB/s at ~30 ns, L2 ~320 GB/s at ~210 ns, DRAM ~190 GB/s at ~350 ns.]
45

J. Little. A proof for the queuing formula L = λW. JSTOR, 1961.
Hiding memory latency with pipelining
Memory throughput: 190 GB/s
Memory latency: 350 ns
Data in flight = throughput × latency = 66,500 bytes

At 1 GHz: 190 bytes/cycle of throughput, 350 cycles of latency → ~65 KB of requests in flight

[Diagram: requests issued back to back over time so that their 350-cycle latencies overlap]
46
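A minimal sketch of this back-of-the-envelope computation; the 128-byte transaction size is an assumption used only to convert bytes in flight into outstanding requests:

#include <stdio.h>

int main(void) {
    double throughput_gbps = 190.0;  // memory throughput, GB/s
    double latency_ns      = 350.0;  // memory latency, ns
    double request_bytes   = 128.0;  // assumed size of one memory transaction

    // Little's law: data in flight = throughput * latency (GB/s * ns = bytes)
    double bytes_in_flight    = throughput_gbps * latency_ns;
    double requests_in_flight = bytes_in_flight / request_bytes;

    printf("Bytes in flight:    %.0f\n", bytes_in_flight);     // 66500
    printf("Requests in flight: %.0f\n", requests_in_flight);  // ~520
    return 0;
}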
Consequence: more parallelism

GPU vs. CPU:
8× more parallelism to feed more units (throughput)
8× more parallelism to hide longer latency
64× more total parallelism

How to find this parallelism?

[Diagram: requests spread ×8 in space (more units) and ×8 in time (deeper pipelining)]
47
Sources of parallelism

ILP: Instruction-Level Parallelism
Between independent instructions in a sequential program
Example: add r3 ← r1, r2 and mul r0 ← r0, r1 can run in parallel; sub r1 ← r3, r0 depends on both

TLP: Thread-Level Parallelism
Between independent execution contexts: threads
Example: thread 1 executes add while thread 2 executes mul

DLP: Data-Level Parallelism
Between elements of a vector: same operation on several elements
Example: vadd r ← a, b computes r1 = a1+b1, r2 = a2+b2, r3 = a3+b3 at once
48
Example: X ← a×X

In-place scalar-vector product: X ← a×X

Sequential (ILP):  for i = 0 to n-1 do: X[i] ← a * X[i]

Threads (TLP):     launch n threads: X[tid] ← a * X[tid]

Vector (DLP):      X ← a * X

Or any combination of the above

49
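A sketch of the first two variants in code (function and variable names are mine, not from the slides); the DLP variant corresponds to vector instructions on a CPU, or to the hardware grouping the CUDA threads below into warps:

// Sequential version (ILP only): a plain loop on the CPU.
void scale_cpu(float a, float *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];
}

// Thread-parallel version (TLP): one CUDA thread per element.
__global__ void scale_gpu(float a, float *x, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        x[tid] = a * x[tid];
}

// Launch example, with n threads grouped into blocks of 256:
//   scale_gpu<<<(n + 255) / 256, 256>>>(a, d_x, n);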
Uses of parallelism

“Horizontal” parallelism for throughput
More units working in parallel

“Vertical” parallelism for latency hiding
Pipelining: keep units busy when waiting for dependencies, memory

[Diagram: tasks A, B, C, D running on parallel units (throughput axis) and overlapped across cycles 1–4 (latency axis)]
50
How to extract parallelism?

        Horizontal            Vertical
ILP     Superscalar           Pipelined
TLP     Multi-core, SMT       Interleaved / switch-on-event multithreading
DLP     SIMD / SIMT           Vector / temporal SIMT

We have seen the first row: ILP


We will now review techniques for the next rows: TLP, DLP 51
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
52
Sequential processor

Source code:
for i = 0 to n-1
    X[i] ← a * X[i]

Machine code:
    move i ← 0
loop:
    load t ← X[i]
    mul t ← a×t
    store X[i] ← t
    add i ← i+1
    branch i<n? loop

[Diagram: sequential CPU pipeline (Fetch, Decode, Execute, Memory) with add i ← 18 in Fetch, store X[17] in Decode, mul in Execute]

Focuses on instruction-level parallelism


Exploits ILP: vertically (pipelining) and horizontally (superscalar)

53
The incremental approach: multi-core

Several processors
on a single chip
sharing one memory space

Intel Sandy Bridge

Area: benefits from Moore's law


Power: extra cores consume little when not in use
e.g. Intel Turbo Boost

Source: Intel
54
Homogeneous multi-core
Horizontal use of thread-level parallelism

[Diagram: two identical cores sharing one memory. Thread T0 on core 0 (IF, ID, EX, LSU stages) executes add i ← 18, store X[17], mul; thread T1 on core 1 executes add i ← 50, store X[49], mul.]

Threads: T0, T1

Improves peak throughput

55
Example: Tilera Tile-GX

Grid of (up to) 72 tiles


Each tile: 3-way VLIW processor,
5 pipeline stages, 1.2 GHz


Tile (1,1) Tile (1,2) Tile (1,8)



Tile (9,1) Tile (9,8)

56
Interleaved multi-threading
Vertical use of thread-level parallelism

[Diagram: a single pipeline (Fetch, Decode, Execute, Memory, load-store unit) time-shared by four threads; instructions from T0–T3 in flight: mul, mul, add i ← 73, add i ← 50, load X[89], store X[72], load X[17], store X[49].]

Threads: T0, T1, T2, T3

Hides latency thanks to explicit parallelism


improves achieved throughput
57
Example: Oracle Sparc T5
16 cores / chip
Core: out-of-order superscalar, 8 threads
15 pipeline stages, 3.6 GHz

Thread 1
Thread 2

Thread 8

Core 1 Core 2 Core 16


58
Clustered multi-core
For each individual unit, select between:
Horizontal replication
Vertical time-multiplexing

Examples: Sun UltraSparc T2, T3; AMD Bulldozer; IBM Power 7

[Diagram: threads T0, T1 → Cluster 1 and T2, T3 → Cluster 2; shared Fetch and Decode (br; mul, store), replicated EX, Memory and L/S units with add i ← 73, add i ← 50, load X[89], store X[72], load X[17], store X[49] in flight.]

Area-efficient tradeoff
Blurs boundaries between cores
59
Implicit SIMD
Factorization of fetch/decode, load-store units
Fetch 1 instruction on behalf of several threads
Read 1 memory location and broadcast to several registers

[Diagram: threads T0–T3 share one Fetch and one Decode stage: (0-3) load is fetched once, (0-3) store is decoded once; (0) mul, (1) mul, (2) mul, (3) mul execute side by side; (0)–(3) access Memory together.]

In NVIDIA-speak
SIMT: Single Instruction, Multiple Threads
Convoy of synchronized threads: warp
Extracts DLP from multi-thread applications

60
Explicit SIMD
Single Instruction Multiple Data
Horizontal use of data level parallelism

Machine code:
loop:
    vload T ← X[i]
    vmul T ← a×T
    vstore X[i] ← T
    add i ← i+4
    branch i<n? loop

[Diagram: SIMD CPU pipeline (F, D, X, Mem) with add i ← 20 in F, vstore X[16..19] in D, vmul in X]
Examples
Intel MIC (16-wide)
AMD GCN GPU (16-wide×4-deep)
Most general purpose CPUs (4-wide to 8-wide)
61
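For comparison, a sketch of the same X ← a×X written with explicit SIMD instructions on a CPU (8-wide AVX intrinsics; n is assumed to be a multiple of 8 to keep the example short):

#include <immintrin.h>

// Each vector multiply processes 8 floats at once (horizontal DLP).
void scale_avx(float a, float *x, int n) {
    __m256 va = _mm256_set1_ps(a);            // broadcast a into all 8 lanes
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);   // vload  T <- X[i..i+7]
        vx = _mm256_mul_ps(va, vx);           // vmul   T <- a*T
        _mm256_storeu_ps(&x[i], vx);          // vstore X[i..i+7] <- T
    }
}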
Quiz: link the words

Parallelism: ILP, TLP, DLP
Use: horizontal (more throughput), vertical (hide latency)
Architectures: superscalar processor, homogeneous multi-core, multi-threaded core, clustered multi-core, implicit SIMD, explicit SIMD

62
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
65
Hierarchical combination

Both CPUs and GPUs combine these techniques


Multiple cores
Multiple threads/core
SIMD units

66
Example CPU: Intel Core i7

Is a wide superscalar, but has also


Multicore
Multi-thread / core
SIMD units
Up to 117 operations/cycle from 8 threads

256-bit
SIMD
units: AVX
4 CPU cores

Wide superscalar

Simultaneous Multi-Threading:
2 threads

67
Example GPU: NVIDIA GeForce GTX 580
SIMT: warps of 32 threads
16 SMs / chip
2×16 cores / SM, 48 warps / SM

[Diagram: each SM (SM1 … SM16) interleaves warps 1–48 over time on its two groups of 16 cores (cores 1–16 and 17–32)]

Up to 512 operations per cycle from 24576 threads in flight

68
Taxonomy of parallel architectures

        Horizontal             Vertical
ILP     Superscalar / VLIW     Pipelined
TLP     Multi-core, SMT        Interleaved / switch-on-event multithreading
DLP     SIMD / SIMT            Vector / temporal SIMT

69
Classification: multi-core
Intel Haswell:     ILP 8 (horizontal);  TLP 4 cores (horizontal) × 2-way Hyperthreading (vertical);  DLP 8 (AVX SIMD)
Fujitsu SPARC64 X: ILP 8;  TLP 16 cores × 2 threads;  DLP 2
General-purpose multi-cores: balance ILP, TLP and DLP

IBM Power 8:     ILP 10;  TLP 12 cores × 8 threads
Oracle Sparc T5: ILP 2;   TLP 16 cores × 8 threads
Sparc T: focus on TLP
70
Classification: GPU and many small-core
Intel MIC:     ILP 2 (horizontal);  TLP 60 cores × 4 threads;  DLP 16 (SIMD units)
Nvidia Kepler: ILP 2;  TLP 16×4 (cores × units, horizontal) × 32 (multithreading, vertical);  DLP 32 (SIMT)
AMD GCN:       TLP 20×4 × 40;  DLP 16 × 4
GPU: focus on DLP and TLP, horizontal and vertical

Tilera Tile-GX:   ILP 3;  72 cores
Kalray MPPA-256:  ILP 5;  17×16 cores
Many small-core: focus on horizontal TLP
71
Takeaway

All processors use hardware mechanisms to turn parallelism into


performance
GPUs focus on Thread-level and Data-level parallelism

72
Outline
Computer architecture crash course
The simplest processor
Exploiting instruction-level parallelism
GPU, many-core: why, what for?
Technological trends and constraints
From graphics to general purpose
Forms of parallelism, how to exploit them
Why we need (so much) parallelism: latency and throughput
Sources of parallelism: ILP, TLP, DLP
Uses of parallelism: horizontal, vertical
Let's design a GPU!
Ingredients: Sequential core, Multi-core, Multi-threaded core, SIMD
Putting it all together
Architecture of current GPUs: cores, memory
73
Computation cost vs. memory cost
Power measurements on NVIDIA GT200

                                 Energy/op (nJ)   Total power (W)
Instruction control                   1.8               18
Multiply-add on a 32-wide warp        3.6               36
Load 128B from DRAM                   80                90

With the same amount of energy


Load 1 word from external memory (DRAM)
Compute 44 flops
Must optimize memory accesses first!

74
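A quick check of the 44-flops figure above: one 32-wide multiply-add costs 3.6 nJ for 32 × 2 = 64 flops, i.e. about 0.056 nJ per flop; loading 128 B costs 80 nJ, i.e. 2.5 nJ per 4-byte word; and 2.5 / 0.056 ≈ 44 flops per word loaded.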
External memory: discrete GPU

Classical CPU-GPU model
Split memory spaces
Highest bandwidth from GPU memory
Transfers to main memory are slower

[Diagram: CPU ↔ main memory (8 GB) at 26 GB/s; GPU ↔ graphics memory (3 GB) at 290 GB/s; CPU ↔ GPU over PCI Express at 16 GB/s]

Ex: Intel Core i7 4770, Nvidia GeForce GTX 780

75
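A sketch of what this asymmetry means in practice (illustrative sizes; scale_gpu is the hypothetical kernel from the earlier X ← a×X example):

#include <cuda_runtime.h>

__global__ void scale_gpu(float a, float *x, int n);  // defined elsewhere

void run_on_discrete_gpu(float a, float *h_x, int n) {
    size_t bytes = n * sizeof(float);
    float *d_x;
    cudaMalloc(&d_x, bytes);

    // Crosses PCI Express (~16 GB/s): slow, do it rarely.
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    // The kernel reads and writes graphics memory (~290 GB/s): keep data resident there.
    scale_gpu<<<(n + 255) / 256, 256>>>(a, d_x, n);

    // Crossing PCI Express again: only copy back what is actually needed.
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
}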
External memory: embedded GPU

Most GPUs today
Same memory space
May support memory coherence
GPU can read directly from CPU caches
More contention on external memory

[Diagram: CPU and GPU on the same chip, sharing a cache and main memory (8 GB) at 26 GB/s]

76
GPU: on-chip memory
Conventional wisdom
Cache area in CPU vs. GPU
according to the NVIDIA
CUDA Programming Guide:

But... if we include registers:

                      Register files + caches
NVIDIA GM204 GPU      8.3 MB
AMD Hawaii GPU        15.8 MB
Intel Core i7 CPU     9.3 MB

GPU/accelerator internal memory exceeds that of desktop CPUs


77
Registers: CPU vs. GPU

Registers keep the contents of local variables


Typical values
CPU GPU

Registers/thread 32 32

Registers/core 256 65536

Read / Write ports 10R/5W 2R/1W

GPU: many more registers, but made of simpler memory


78
Internal memory: GPU

Cache hierarchy
Keep frequently-accessed data
Reduce throughput demand on main memory
Managed by hardware (L1, L2) or software (shared memory)

[Diagram: cores with per-core L1 caches (1 MB total, ~2 TB/s), a crossbar, L2 slices (6 MB total), and external memory at 290 GB/s]
79
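A sketch of software-managed on-chip memory in CUDA (my example, not from the slides): each block stages its elements in __shared__ memory, so the tree reduction below runs entirely on chip, with one read per element and one write per block to external memory.

// Per-block sum; assumes blockDim.x == 256 (a power of two).
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];                  // on-chip, shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one read from external memory
    __syncthreads();

    // Tree reduction entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];               // one write per block
}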
Caches: CPU vs. GPU

              CPU                   GPU
Latency       Caches, prefetching   Multi-threading
Throughput    —                     Caches

On CPU, caches are designed to avoid memory latency


Throughput reduction is a side effect
On GPU, multi-threading deals with memory latency
Caches are used to improve throughput (and energy)

80
GPU: thousands of cores?
Computational resources

NVIDIA GPUs    G80/G92  GT200   GF100   GK104   GK110   GM204
               (2006)   (2008)  (2010)  (2012)  (2012)  (2014)
Exec. units    128      240     512     1536    2688    2048
SMs            16       30      16      8       14      16

AMD GPUs       R600     R700    Evergreen  NI      SI      VI
               (2007)   (2008)  (2009)     (2010)  (2012)  (2013)
Exec. units    320      800     1600       1536    2048    2560
SIMD-CUs       4        10      20         24      32      40

Number of clients in interconnection network (cores)


stays limited
81
Takeaway

Result of many tradeoffs


Between locality and parallelism
Between core complexity and interconnect complexity
GPU optimized for throughput
Exploits primarily DLP, TLP
Energy-efficient on parallel applications with regular behavior
CPU optimized for latency
Exploits primarily ILP
Can use TLP and DLP when available

82
Next time

Next Tuesday, 1:00pm, room 2014


CUDA
Execution model
Programming model
API
Thursday 1:00pm, room 2011
Lab work: what is my GPU and when should I use it?
There may be available seats even if you are not enrolled

83