
CS 61C: Great Ideas in Computer Architecture

Pipelining Hazards

Instructor: Senior Lecturer SOE Dan Garcia

Great Idea #4: Parallelism

Software

Parallel Requests
Assigned to computer
e.g. search Garcia

Parallel Threads
Assigned to core
e.g. lookup, ads

Parallel Instructions
> 1 instruction @ one time
e.g. 5 pipelined instructions

Parallel Data
> 1 data item @ one time
e.g. add of 4 pairs of words

Hardware descriptions
All gates functioning in parallel at same time

Hardware

Warehouse Scale Computer, Smart Phone
Computer: Core(s), Memory, Input/Output
Core: Instruction Unit(s), Functional Unit(s) (A0+B0 A1+B1 A2+B2 A3+B3), Cache Memory
Logic Gates

Leverage Parallelism & Achieve High Performance

Pipelined Execution Representation

Time
IF ID EX MEM WB
   IF ID EX MEM WB
      IF ID EX MEM WB
         IF ID EX MEM WB
            IF ID EX MEM WB
               IF ID EX MEM WB

Every instruction must take same number of steps, so some stages will idle
e.g. MEM stage for any arithmetic instruction

Graphical Pipeline Diagrams

[Figure: datapath with PC, +4, MUX, instruction memory, Register File (rs, rt, rd), imm, ALU and Data memory, divided into five stages:
1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back]

Use datapath figure below to represent pipeline:

IF   ID   EX   Mem  WB
I$   Reg  ALU  D$   Reg

Graphical Pipeline Representation

RegFile: left half is write, right half is read

[Figure: time (clock cycles) across, instruction order down; Load, Add, Store, Sub and Or each pass through I$, Reg, ALU, D$, Reg, each starting one cycle after the previous instruction]

Pipelining Performance (1/3)

Use Tc (time between completion of instructions) to measure speedup
Tc,pipelined >= Tc,single-cycle / Number of stages
Equality only achieved if stages are balanced (i.e. take the same amount of time)

If not balanced, speedup is reduced

Speedup due to increased throughput
Latency for each instruction does not decrease
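The staggered diagram and the throughput-versus-latency point can be sketched in a few lines of Python. This is a minimal illustration that assumes an ideal five-stage pipeline with no stalls; the function name `pipeline_chart` and the text layout are hypothetical choices, not part of the slides.

```python
# Illustrative sketch: ideal 5-stage pipeline, one instruction issued per cycle, no hazards.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(instructions):
    """Return (rows, total_cycles); instruction i occupies stage s in cycle i + s + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # each later instruction is shifted one cycle to the right
        rows.append(name.ljust(6) + "".join(c.ljust(4) for c in cells).rstrip())
    return rows, total_cycles

rows, total_cycles = pipeline_chart(["Load", "Add", "Store", "Sub", "Or"])
for row in rows:
    print(row)
print("cycles for 5 instructions:", total_cycles)    # 5 + 5 - 1 = 9
print("latency of each instruction:", len(STAGES))   # still 5 cycles
```

Five instructions finish in 9 cycles instead of 25, yet every instruction still takes 5 cycles from fetch to write back: the gain is in throughput, not latency.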

Pipelining Performance (2/3)

Assume time for stages is
100ps for register read or write
200ps for other stages

Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200ps         100ps           200ps    200ps           100ps            800ps
sw         200ps         100ps           200ps    200ps                            700ps
R-format   200ps         100ps           200ps                    100ps            600ps
beq        200ps         100ps           200ps                                     500ps

What is pipelined clock rate?
Compare pipelined datapath with single-cycle datapath

Pipelining Performance (3/3)

Single-cycle
Tc = 800 ps

Pipelined
Tc = 200 ps
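As a worked check of the numbers on these two slides, the Python sketch below recomputes the per-instruction totals, both values of Tc and the pipelined clock rate. The stage latencies come from the slide; the dictionary names are illustrative assumptions.

```python
# Stage latencies in picoseconds, taken from the slide's assumptions.
STAGE_PS = {"Instr fetch": 200, "Register read": 100, "ALU op": 200,
            "Memory access": 200, "Register write": 100}

# Which stages each instruction class actually needs (per the table).
USES = {
    "lw":       ["Instr fetch", "Register read", "ALU op", "Memory access", "Register write"],
    "sw":       ["Instr fetch", "Register read", "ALU op", "Memory access"],
    "R-format": ["Instr fetch", "Register read", "ALU op", "Register write"],
    "beq":      ["Instr fetch", "Register read", "ALU op"],
}

totals = {instr: sum(STAGE_PS[s] for s in stages) for instr, stages in USES.items()}
tc_single = max(totals.values())       # single-cycle clock must fit the slowest instruction
tc_pipelined = max(STAGE_PS.values())  # pipelined clock must fit the slowest stage
clock_ghz = 1000 / tc_pipelined        # a period of 1000 ps corresponds to 1 GHz

print(totals)
print(tc_single, tc_pipelined, clock_ghz)
print("steady-state speedup:", tc_single / tc_pipelined)
```

Under these assumptions the pipelined clock runs at 5 GHz and the steady-state speedup is 4 rather than 5, because the five stages are not balanced.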

Pipelining Hazards

A hazard is a situation that prevents starting the next instruction in the next clock cycle

1) Structural hazard
A required resource is busy (e.g. needed in multiple stages)

2) Data hazard
Data dependency between instructions
Need to wait for previous instruction to complete its data read/write

3) Control hazard
Flow of execution depends on previous instruction

1. Structural Hazards

Conflict for use of a resource

MIPS pipeline with a single memory?
Load/Store requires memory access for data
Instruction fetch would have to stall for that cycle
Causes a pipeline bubble

Hence, pipelined datapaths require separate instruction/data memories
Separate L1 I$ and L1 D$ take care of this

Structural Hazard #1: Single Memory

[Figure: time (clock cycles) across, instruction order down; Load, Instr 1, Instr 2, Instr 3 and Instr 4 each pass through I$, Reg, ALU, D$, Reg]
Trying to read same memory twice in same clock cycle

Structural Hazard #2: Registers (1/2)

[Figure: time (clock cycles) across, instruction order down; Load, Instr 1, Instr 2, Instr 3 and Instr 4 each pass through I$, Reg, ALU, D$, Reg]
Can we read and write to registers simultaneously?
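The single-memory conflict can be located mechanically. The sketch below is an illustration only: it assumes a five-stage pipeline issuing one instruction per cycle with no stalls, and the function name `single_memory_conflicts` is hypothetical.

```python
# Sketch: with ONE shared memory, find cycles where a fetch (IF) and a data access (MEM) collide.
# Assumes a 5-stage pipeline (IF ID EX MEM WB), one instruction issued per cycle, no stalls.
MEMORY_OPS = {"lw", "sw"}

def single_memory_conflicts(ops):
    """Return (cycle, mem_index, fetch_index) for each IF/MEM collision (cycles are 1-based)."""
    conflicts = []
    for i, op in enumerate(ops):
        if op in MEMORY_OPS:
            mem_cycle = i + 4            # instruction i is in IF in cycle i+1, so in MEM in cycle i+4
            fetch_index = mem_cycle - 1  # instruction j is fetched in cycle j+1
            if fetch_index < len(ops):
                conflicts.append((mem_cycle, i, fetch_index))
    return conflicts

# Load followed by Instr 1..4, modeled here as plain ALU operations.
ops = ["lw", "add", "add", "add", "add"]
print(single_memory_conflicts(ops))
```

For this sequence the only collision is in cycle 4, where the load's data access overlaps the fetch of Instr 3. Separate instruction and data memories remove it.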

Structural Hazard #2: Registers (2/2)

Two different solutions have been used:

1) Split RegFile access in two: Write during 1st half and Read during 2nd half of each clock cycle
Possible because RegFile access is VERY fast (takes less than half the time of ALU stage)

2) Build RegFile with independent read and write ports

Conclusion: Read and Write to registers during same clock cycle is okay

2. Data Hazards (1/2)

Consider the following sequence of instructions:

add $t0, $t1, $t2
sub $t4, $t0, $t3
and $t5, $t0, $t6
or  $t7, $t0, $t8
xor $t9, $t0, $t10
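To see which of these instructions are actually at risk, the sketch below lists every read-after-write dependence on `$t0` and flags the ones that would read a stale value without forwarding. It assumes the five-stage pipeline and the split register file (write in the first half of a cycle, read in the second half) described earlier; the tuple encoding and function names are illustrative.

```python
# Each instruction is (op, dest, src1, src2); the sequence is the one from the slide.
program = [
    ("add", "$t0", "$t1", "$t2"),
    ("sub", "$t4", "$t0", "$t3"),
    ("and", "$t5", "$t0", "$t6"),
    ("or",  "$t7", "$t0", "$t8"),
    ("xor", "$t9", "$t0", "$t10"),
]

def raw_dependences(program):
    """Return (producer_index, consumer_index, register) for every read-after-write pair."""
    deps = []
    last_writer = {}
    for i, (op, dest, *srcs) in enumerate(program):
        for src in srcs:
            if src in last_writer:
                deps.append((last_writer[src], i, src))
        last_writer[dest] = i
    return deps

def hazard_without_forwarding(producer, consumer):
    # Producer writes the register in WB (cycle producer + 5); consumer reads it in ID
    # (cycle consumer + 2). Same-cycle write-then-read is fine with a split RegFile.
    return consumer + 2 < producer + 5

deps = raw_dependences(program)
hazards = [d for d in deps if hazard_without_forwarding(d[0], d[1])]
print(deps)
print(hazards)
```

All four later instructions depend on `add`, but only `sub` and `and` read `$t0` before it is written back; `or` reads it in the same cycle as the write, and `xor` reads it one cycle later.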

2. Data Hazards (2/2)

Data-flow backwards in time are hazards

[Figure: time (clock cycles) across, instruction order down; add is labeled IF, ID/RF, EX, MEM, WB and every instruction passes through I$, Reg, ALU, D$, Reg]
add $t0,$t1,$t2
sub $t4,$t0,$t3
and $t5,$t0,$t6
or $t7,$t0,$t8
xor $t9,$t0,$t10

Data Hazard Solution: Forwarding

Forward result as soon as it is available
OK that it's not stored in RegFile yet

[Figure: same pipeline diagram, with the result of add forwarded to the later instructions that need it]
add $t0,$t1,$t2
sub $t4,$t0,$t3
and $t5,$t0,$t6
or $t7,$t0,$t8
xor $t9,$t0,$t10
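One way to reason about forwarding is to ask, for each dependent operand, where its value can be taken from. The sketch below is a simplified model, not the slides' datapath: it covers ALU-to-ALU dependences only, and the pipeline-register names `EX/MEM` and `MEM/WB` follow a common textbook convention that is assumed here rather than taken from the slides.

```python
# Simplified model: where does each dependent source operand come from when forwarding exists?
# Covers ALU producers only; load producers need a stall and are not modeled here.
program = [
    ("add", "$t0", "$t1", "$t2"),
    ("sub", "$t4", "$t0", "$t3"),
    ("and", "$t5", "$t0", "$t6"),
    ("or",  "$t7", "$t0", "$t8"),
    ("xor", "$t9", "$t0", "$t10"),
]

def forwarding_source(distance):
    """distance = how many instructions after the producer the consumer is."""
    if distance == 1:
        return "EX/MEM"   # result just left the ALU (assumed register name)
    if distance == 2:
        return "MEM/WB"   # result is one stage further along (assumed register name)
    return "RegFile"      # already written back, or written in the same cycle it is read

def operand_sources(program):
    """Return (consumer_index, register, source) for every operand with an earlier writer."""
    plan = []
    last_writer = {}
    for i, (op, dest, *srcs) in enumerate(program):
        for src in srcs:
            if src in last_writer:
                plan.append((i, src, forwarding_source(i - last_writer[src])))
        last_writer[dest] = i
    return plan

plan = operand_sources(program)
for consumer, reg, source in plan:
    print(program[consumer][0], reg, "<-", source)
```

In this model `sub` and `and` receive `$t0` through forwarding paths, while `or` and `xor` read it from the register file, so none of the five instructions has to stall.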

Datapath for Forwarding (1/2)


Datapath for Forwarding (2/2)

What changes need to be made here?

Handled by forwarding unit


Data Hazard: Loads (1/4)

Recall: Dataflow backwards in time are hazards

[Figure: lw and sub, one cycle apart, each labeled IF, ID/RF, EX, MEM, WB]
lw $t0,0($t1)
sub $t3,$t0,$t2

Can't solve all cases with forwarding
Must stall instruction dependent on load, then forward (more hardware)

Data Hazard: Loads (2/4)

Hardware stalls pipeline
Called hardware interlock

[Figure: pipeline diagram with bubbles inserted after the load]
lw $t0, 0($t1)
sub $t3,$t0,$t2
and $t5,$t0,$t4
or $t7,$t0,$t6

Schematically, this is what we want, but in reality stalls done horizontally
How to stall just part of pipeline?

Data Hazard: Loads (3/4)

Stall is equivalent to nop

[Figure: pipeline diagram in which the nop row consists entirely of bubbles]
lw $t0, 0($t1)
nop
sub $t3,$t0,$t2
and $t5,$t0,$t4
or $t7,$t0,$t6

Data Hazard: Loads (4/4)

Slot after a load is called a load delay slot
If that instruction uses the result of the load, then the hardware interlock will stall it for one cycle
Letting the hardware stall the instruction in the delay slot is equivalent to putting a nop in the slot (except the latter uses more code space)

Idea: Let the compiler put an unrelated instruction in that slot -> no stall!
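The equivalence between a hardware stall and a nop in the load delay slot can be illustrated with a small Python sketch. It is a simplification that assumes destination-first instructions only (so `sw` is not handled), encodes `lw` as (op, dest, base) with the offset dropped, and uses a hypothetical function name.

```python
# Sketch: make the load-use stall explicit by placing a nop in the load delay slot.
# Encoding (illustrative): ("lw", dest, base) and (op, dest, src1, src2) for ALU instructions.
def fill_load_delay_slots(program):
    out = []
    for instr in program:
        prev = out[-1] if out else None
        # Stall only when the instruction right after a load reads the loaded register.
        if prev is not None and prev[0] == "lw" and prev[1] in instr[2:]:
            out.append(("nop",))
        out.append(instr)
    return out

program = [
    ("lw",  "$t0", "$t1"),          # lw  $t0, 0($t1)
    ("sub", "$t3", "$t0", "$t2"),   # uses $t0 in the load delay slot
    ("and", "$t5", "$t0", "$t4"),
    ("or",  "$t7", "$t0", "$t6"),
]

for instr in fill_load_delay_slots(program):
    print(instr)
```

Only `sub` sits in the delay slot, so a single nop is inserted; `and` and `or` are far enough from the load to get `$t0` without stalling.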

Code Scheduling to Avoid Stalls

Reorder code to avoid use of load result in the next instruction!
MIPS code for D=A+B; E=A+C;

# Method 1:
lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2     # Stall!
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4     # Stall!
sw  $t5, 16($t0)
13 cycles

# Method 2:
lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
11 cycles

Summary

Hazards reduce effectiveness of pipelining
Cause stalls/bubbles

Structural Hazards
Conflict in use of datapath component

Data Hazards
Need to wait for result of a previous instruction

Control Hazards
Address of next instruction uncertain/unknown
More to come next lecture!
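The 13-cycle and 11-cycle figures from the code scheduling slide can be reproduced with a short Python sketch. It assumes a five-stage pipeline in which forwarding resolves every dependence except a load followed immediately by a use of its result, which costs one stall; the tuple encoding and helper names are illustrative.

```python
# Sketch: count cycles for a straight-line sequence on a 5-stage pipeline.
# Assumption: forwarding handles everything except load-use, which costs one stall cycle.
# Encoding (illustrative): ("lw", dest, base), ("sw", data, base), (op, dest, src1, src2).
def sources(instr):
    op = instr[0]
    if op == "lw":
        return [instr[2]]             # base register only
    if op == "sw":
        return [instr[1], instr[2]]   # store data and base register
    return list(instr[2:])

def count_cycles(program, stages=5):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in sources(cur):
            stalls += 1               # load result used by the very next instruction
    return len(program) + (stages - 1) + stalls

method1 = [
    ("lw", "$t1", "$t0"), ("lw", "$t2", "$t0"),
    ("add", "$t3", "$t1", "$t2"), ("sw", "$t3", "$t0"),
    ("lw", "$t4", "$t0"),
    ("add", "$t5", "$t1", "$t4"), ("sw", "$t5", "$t0"),
]
method2 = [
    ("lw", "$t1", "$t0"), ("lw", "$t2", "$t0"), ("lw", "$t4", "$t0"),
    ("add", "$t3", "$t1", "$t2"), ("sw", "$t3", "$t0"),
    ("add", "$t5", "$t1", "$t4"), ("sw", "$t5", "$t0"),
]

print(count_cycles(method1), count_cycles(method2))
```

Seven instructions need 7 + 4 = 11 cycles with no stalls; Method 1 adds two load-use stalls, which gives 13.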
