Parallelism Leverage High Performance Computer Architecture

CS 61C: Great Ideas in Computer Architecture
Pipelining Hazards
Instructor: Senior Lecturer SOE Dan Garcia

Great Idea #4: Parallelism
Software
  Parallel Requests
    Assigned to computer
    e.g. search "Garcia"
  Parallel Threads
    Assigned to core
    e.g. lookup, ads
  Parallel Instructions
    > 1 instruction @ one time
    e.g. 5 pipelined instructions
  Parallel Data
    > 1 data item @ one time
    e.g. add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3)
  Hardware descriptions
    All gates functioning in parallel at same time
Hardware
  [Figure: Warehouse Scale Computer and Smart Phone; Computer (Core, Memory, Input/Output); Core (Instruction Unit(s), Functional Unit(s), Cache Memory); Logic Gates]
Leverage Parallelism & Achieve High Performance

Pipelined Execution Representation
[Figure: Time runs left to right; each instruction passes through IF, ID, EX, MEM, WB, and each row starts one stage later than the row above]
Every instruction must take same number of steps, so some stages will idle
  e.g. MEM stage for any arithmetic instruction

Graphical Pipeline Diagrams
[Figure: datapath with PC, +4, MUX, instruction memory, rd/rs/rt, Register File, imm, ALU, Data memory, divided into stages 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]
Use datapath figure below to represent pipeline:
  IF = I$, ID = Reg, EX = ALU, Mem = D$, WB = Reg
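
The staggered layout of the pipelined execution diagram can be reproduced with a short Python sketch. This is only an illustration under simple assumptions: an ideal five-stage pipeline with no stalls, one instruction entering per clock cycle. The function name `pipeline_diagram` and the text layout are made up for this example and are not part of the course material.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return one text row per instruction; each 4-character column is one clock cycle."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            # Instruction i (0-based) occupies stage s in column i + s.
            cells[i + s] = stage
        rows.append(f"{name:<6}" + "".join(f"{c:<4}" for c in cells).rstrip())
    return rows

# The instruction names are just row labels for the diagram.
for row in pipeline_diagram(["Load", "Add", "Store", "Sub", "Or"]):
    print(row)
```

Each printed row is shifted one column to the right of the previous one, which is the "staircase" shape used in the diagrams on these slides.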

Graphical Pipeline Representation
RegFile: left half is write, right half is read
[Figure: pipeline diagram over Time (clock cycles); instruction order Load, Add, Store, Sub, Or; each instruction passes through I$, Reg, ALU, D$, Reg]

Pipelining Performance (1/3)
Use Tc ("time between completion of instructions") to measure speedup
  Tc,pipelined >= Tc,single-cycle / Number of stages
  Equality only achieved if stages are balanced (i.e. take the same amount of time)
If not balanced, speedup is reduced
Speedup due to increased throughput
  Latency for each instruction does not decrease

Pipelining Performance (2/3)
Assume time for stages is
  100 ps for register read or write
  200 ps for other stages

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps

What is pipelined clock rate?
  Compare pipelined datapath with single-cycle datapath

Pipelining Performance (3/3)
Single-cycle: Tc = 800 ps
Pipelined: Tc = 200 ps
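
The clock-rate question can be worked through numerically. The sketch below is a minimal illustration, assuming the stage times given above, an ideal pipeline with no hazards, and a single-cycle design whose clock must fit the slowest instruction (lw). The function and variable names are made up for this example.

```python
# Stage latencies in picoseconds, taken from the stage-time assumptions above.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def single_cycle_tc(stage_ps):
    # One long cycle must fit the slowest instruction (lw uses every stage).
    return sum(stage_ps.values())

def pipelined_tc(stage_ps):
    # Every stage gets the same clock period, set by the slowest stage.
    return max(stage_ps.values())

def total_time_ps(n_instructions, stage_ps, pipelined):
    if pipelined:
        # Fill the pipeline once, then complete one instruction per cycle.
        cycles = n_instructions + len(stage_ps) - 1
        return cycles * pipelined_tc(stage_ps)
    return n_instructions * single_cycle_tc(stage_ps)

tc_single = single_cycle_tc(STAGE_PS)   # 800 ps
tc_pipe = pipelined_tc(STAGE_PS)        # 200 ps
clock_ghz = 1000 / tc_pipe              # 1 / (200 ps) = 5 GHz
n = 1_000_000
speedup = total_time_ps(n, STAGE_PS, False) / total_time_ps(n, STAGE_PS, True)
print(tc_single, tc_pipe, clock_ghz, round(speedup, 2))  # 800 200 5.0 4.0
```

The speedup approaches 4 rather than 5 even though there are five stages, because the 100 ps register stages are padded out to the 200 ps clock: the stages are not balanced.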

Pipelining Hazards
A hazard is a situation that prevents starting the next instruction in the next clock cycle
1) Structural hazard
  A required resource is busy (e.g. needed in multiple stages)
2) Data hazard
  Data dependency between instructions
  Need to wait for previous instruction to complete its data read/write
3) Control hazard
  Flow of execution depends on previous instruction

1. Structural Hazards
Conflict for use of a resource
MIPS pipeline with a single memory?
  Load/Store requires memory access for data
  Instruction fetch would have to stall for that cycle
    Causes a pipeline "bubble"
Hence, pipelined datapaths require separate instruction/data memories
  Separate L1 I$ and L1 D$ take care of this
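
The single-memory conflict can be located with a little arithmetic on stage positions. The following Python sketch is an illustration only: it assumes the five-stage pipeline of these slides, one instruction fetched per cycle, no stalls, and a hypothetical helper name `memory_conflicts`. Instruction i (counting from 0) is fetched in cycle i + 1 and reaches MEM in cycle i + 4, the same cycle in which instruction i + 3 is fetched.

```python
def memory_conflicts(instructions):
    """List (cycle, mem_instr_index, fetched_instr_index) triples where a
    load/store's MEM stage and a later instruction's IF stage would both
    need a single shared memory. Stalls are not modelled."""
    conflicts = []
    for i, op in enumerate(instructions):
        if op in ("lw", "sw"):
            mem_cycle = i + 4        # IF is cycle i + 1, MEM is three stages later
            fetched = mem_cycle - 1  # index of the instruction in IF that cycle
            if fetched < len(instructions):
                conflicts.append((mem_cycle, i, fetched))
    return conflicts

# A load followed by four other instructions: the load's data access
# collides with the fetch of the fourth instruction in the list.
print(memory_conflicts(["lw", "add", "sub", "and", "or"]))  # [(4, 0, 3)]
```

With separate instruction and data memories (I$ and D$), the two accesses in that cycle go to different structures and no stall is needed.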

Structural Hazard #1: Single Memory
[Figure: pipeline diagram over Time (clock cycles); instruction order Load, Instr 1, Instr 2, Instr 3, Instr 4; each instruction passes through I$, Reg, ALU, D$, Reg]
Trying to read same memory twice in same clock cycle

Structural Hazard #2: Registers (1/2)
[Figure: pipeline diagram over Time (clock cycles); instruction order Load, Instr 1, Instr 2, Instr 3, Instr 4; each instruction passes through I$, Reg, ALU, D$, Reg]
Can we read and write to registers simultaneously?

Structural Hazard #2: Registers (2/2)
Two different solutions have been used:
  1) Split RegFile access in two: Write during 1st half and Read during 2nd half of each clock cycle
    Possible because RegFile access is VERY fast (takes less than half the time of ALU stage)
  2) Build RegFile with independent read and write ports
Conclusion: Read and Write to registers during same clock cycle is okay

2. Data Hazards (1/2)
Consider the following sequence of instructions:
  add $t0, $t1, $t2
  sub $t4, $t0, $t3
  and $t5, $t0, $t6
  or  $t7, $t0, $t8
  xor $t9, $t0, $t10

2. Data Hazards (2/2)
Data-flow backwards in time are hazards
[Figure: pipeline diagram over Time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for the instruction order below]
  add $t0,$t1,$t2
  sub $t4,$t0,$t3
  and $t5,$t0,$t6
  or  $t7,$t0,$t8
  xor $t9,$t0,$t10

Data Hazard Solution: Forwarding
Forward result as soon as it is available
  OK that it's not stored in RegFile yet
[Figure: pipeline diagram over Time (clock cycles) with stages IF, ID/RF, EX, MEM, WB for the instruction order below]
  add $t0,$t1,$t2
  sub $t4,$t0,$t3
  and $t5,$t0,$t6
  or  $t7,$t0,$t8
  xor $t9,$t0,$t10

Datapath for Forwarding (1/2)
What changes need to be made here?

Datapath for Forwarding (2/2)
Handled by forwarding unit
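
To make the $t0 example concrete, the sketch below classifies each read of a register by its distance from the instruction that last wrote it. It is a rough illustration under stated assumptions: the five-stage pipeline of these slides, only three-register ALU instructions (no loads), and a RegFile that writes in the first half of a cycle and reads in the second half. The function names `parse` and `classify_raw` and the category labels are made up for this example.

```python
import re

def parse(instr):
    """Split a three-register MIPS instruction into (op, dest, sources)."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    return op, regs[0], regs[1:]

def classify_raw(program):
    """For each source register, report its relation to the nearest earlier writer."""
    last_writer = {}  # register name -> index of the most recent instruction writing it
    report = []
    for i, instr in enumerate(program):
        op, dest, sources = parse(instr)
        for src in sources:
            if src in last_writer:
                distance = i - last_writer[src]
                if distance <= 2:
                    kind = "needs forwarding"      # register read precedes producer's WB
                elif distance == 3:
                    kind = "ok via split RegFile"  # WB and register read share one cycle
                else:
                    kind = "no hazard"
                report.append((instr, src, distance, kind))
        last_writer[dest] = i
    return report

program = [
    "add $t0,$t1,$t2",
    "sub $t4,$t0,$t3",
    "and $t5,$t0,$t6",
    "or  $t7,$t0,$t8",
    "xor $t9,$t0,$t10",
]
for entry in classify_raw(program):
    print(entry)
```

Under these assumptions, sub and and need the forwarded value, or reads $t0 in the same cycle that add writes it back, and xor reads it one cycle later from the RegFile.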

Data Hazard: Loads (1/4)
Recall: Dataflow backwards in time are hazards
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB for the instructions below]
  lw  $t0,0($t1)
  sub $t3,$t0,$t2
Can't solve all cases with forwarding
  Must stall instruction dependent on load, then forward (more hardware)

Data Hazard: Loads (2/4)
Hardware stalls pipeline
  Called "hardware interlock"
[Figure: pipeline diagram with bubbles for the instructions below]
  lw  $t0, 0($t1)
  sub $t3,$t0,$t2
  and $t5,$t0,$t4
  or  $t7,$t0,$t6
Schematically, this is what we want, but in reality stalls done "horizontally"
How to stall just part of pipeline?

Data Hazard: Loads (3/4)
Stall is equivalent to nop
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB; the nop row is all bubbles]
  lw  $t0, 0($t1)
  nop
  sub $t3,$t0,$t2
  and $t5,$t0,$t4
  or  $t7,$t0,$t6

Data Hazard: Loads (4/4)
Slot after a load is called a load delay slot
  If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle
  Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)
Idea: Let the compiler put an unrelated instruction in that slot -> no stall!
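
The load delay slot rule can be checked mechanically. The following Python sketch is a simplified illustration, not a model of real interlock hardware: it assumes programs made only of lw, sw, nop and three-register ALU instructions, and it conservatively treats every register read by the instruction after a lw as needing the loaded value one cycle too early. The helper names `split_instr` and `load_use_stalls` are made up for this example.

```python
import re

def split_instr(instr):
    """Return (opcode, [registers in textual order])."""
    op, _, rest = instr.strip().partition(" ")
    return op, re.findall(r"\$\w+", rest)

def load_use_stalls(program):
    """Return indices of instructions that sit in a load delay slot and
    read the register that the preceding lw loads."""
    stalls = []
    for i in range(1, len(program)):
        prev_op, prev_regs = split_instr(program[i - 1])
        if prev_op != "lw":
            continue
        loaded = prev_regs[0]  # destination register of the lw
        op, operands = split_instr(program[i])
        # sw reads all of its registers; the other instructions write the first one.
        sources = operands if op == "sw" else operands[1:]
        if loaded in sources:
            stalls.append(i)
    return stalls

program = [
    "lw $t0, 0($t1)",
    "sub $t3,$t0,$t2",
    "and $t5,$t0,$t4",
    "or $t7,$t0,$t6",
]
print(load_use_stalls(program))  # [1]
```

Only the sub stalls: and and or also read $t0, but they are far enough behind the load for forwarding or the RegFile to supply the value. Replacing the sub's slot with a nop or an unrelated instruction makes the list empty.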

Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction!
MIPS code for D=A+B; E=A+C;

# Method 1:
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  add $t3, $t1, $t2    # Stall!
  sw  $t3, 12($t0)
  lw  $t4, 8($t0)
  add $t5, $t1, $t4    # Stall!
  sw  $t5, 16($t0)
13 cycles

# Method 2:
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  lw  $t4, 8($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)
11 cycles

Summary
Hazards reduce effectiveness of pipelining
  Cause stalls/bubbles
Structural Hazards
  Conflict in use of datapath component
Data Hazards
  Need to wait for result of a previous instruction
Control Hazards
  Address of next instruction uncertain/unknown
  More to come next lecture!
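
As a check on the 13-cycle and 11-cycle figures in the code scheduling example, the sketch below counts cycles for both orderings. It is a minimal model under stated assumptions: a five-stage pipeline, full forwarding, and exactly one stall cycle whenever an instruction reads a register loaded by the lw immediately before it. The helper names `split_instr` and `count_cycles` are made up for this example.

```python
import re

def split_instr(instr):
    """Return (opcode, [registers in textual order])."""
    op, _, rest = instr.strip().partition(" ")
    return op, re.findall(r"\$\w+", rest)

def count_cycles(program, stages=5):
    """Cycles to finish the program: one per instruction, plus the
    pipeline fill (stages - 1), plus one per load-use stall."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_regs = split_instr(prev)
        op, operands = split_instr(cur)
        # sw reads all of its registers; the other instructions write the first one.
        sources = operands if op == "sw" else operands[1:]
        if prev_op == "lw" and prev_regs[0] in sources:
            stalls += 1
    return len(program) + (stages - 1) + stalls

method1 = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
method2 = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
print(count_cycles(method1), count_cycles(method2))  # 13 11
```

Both orderings execute the same seven instructions; moving the third lw ahead of the first add removes both load-use stalls, which accounts for the two-cycle difference.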

