Sei sulla pagina 1di 83

CS2100 Computer Organization

Basic Pipelining

Review: Single Cycle vs. Multiple Cycle Timing


Single Cycle Implementation:
Cycle 1

Cycle 2

Clk

lw

sw

multicycle clock
slower than 1/5th of
single cycle clock
due to stage register
overhead

Multiple Cycle Implementation:


Clk

Waste

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10

lw
IFetch

Dec

Exec

Mem

WB

sw
IFetch

Dec

Exec

Mem

R-type
IFetch

How Can We Make It Even Faster?

Split the multiple instruction cycle into smaller and


smaller steps

There is a point of diminishing returns where as much time is


spent loading the state registers as doing the work

Start fetching and executing the next instruction before


the current one has completed

Pipelining (all?) modern processors are pipelined for


performance

Remember the performance equation:


CPU time = CPI * CC * IC

Fetch (and execute) more than one instruction at a time

Superscalar processing stay tuned

Inspiration Automobile Assembly Line

Lessons learnt from automobile assembly lines

Faster line movement yields more cars per hour off the
line

Faster line movement requires more stages, each doing


simpler tasks

To maximize efficiency, all stages should take the same


amount of time

Filling, flushing or stalling the assembly line are all


bad news

Key idea

Key analogy: the instruction is our car

Instructions go through the same stages in the


processing

Aim: to process the maximum number of instructions in


the shortest time

Increase the line movement

Balance the pipeline

Minimize delays

A Pipelined MIPS Processor

Start the next instruction before the current one has


completed

improves throughput - total amount of work done in a given time

instruction latency (execution time, delay time, response time time from the start of an instruction to its completion) is not
reduced
Cycle 1 Cycle 2

lw
sw
R-type

IFetch

Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8

Dec

Exec

Mem

WB

IFetch

Dec

Exec

Mem

WB

IFetch

Dec

Exec

Mem

WB

- clock cycle (pipeline stage time) is limited by the slowest stage


- for some instructions, some stages are wasted cycles

Single Cycle, Multiple Cycle, vs. Pipeline


Single Cycle Implementation:
Cycle 1

Cycle 2

Clk
lw

sw

Waste

Multiple Cycle Implementation:


Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10
Clk
lw
IFetch

Dec

Exec

Mem

WB

sw
IFetch

Dec

Pipeline Implementation:
lw

IFetch
sw

Dec

Exec

Mem

WB

IFetch

Dec

Exec

Mem

WB

Dec

Exec

Mem

R-type IFetch

WB

Exec

Mem

R-type
IFetch

MIPS Pipeline Datapath Modifications

What do we need to add/modify in our MIPS datapath?


State registers between each pipeline stage to isolate them

IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read

Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

add $1, $2, $3

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Tracing a single instruction


IF:IFetch

ID:Dec

EX:Execute

MEM:
MemAccess

WB:
WriteBack

Add
Shift
left 2
Read Addr 2Data 1

File

Write Addr
Write Data

16

System Clock

Sign
Extend

Read
Data 2

32

ALU

Exec/Mem

Register Read
Dec/Exec

Read
Address

Read Addr 1
IFetch/Dec

PC

Instruction
Memory

Add

Data
Memory
Address
Write Data

Read
Data

Mem/WB

Pipeline registers

State must propagate from one stage to another by


means of pipeline registers

Different from multicycle registers

Pipeline registers driven by the clock

They are arrays of D flip-flops driven by the clock

Combinatorial logic sandwiched by clocked state


registers

Clock determined by the slowest stage

Pipelining the MIPS ISA

What makes it easy

all instructions are the same length (32 bits)


- can fetch in the 1st stage and decode in the 2nd stage

few instruction formats (three) with symmetry across formats


- can begin reading register file in 2nd stage

memory operations can occur only in loads and stores


- can use the execute stage to calculate memory addresses

each MIPS instruction writes at most one result (i.e.,


changes the machine state) and does so near the end of the
pipeline (MEM and WB)

What makes it hard

structural hazards: what if we had only one memory?

control hazards: what about branches?

data hazards: what if an instructions input operands


depend on the output of a previous instruction?

Graphically Representing MIPS Pipeline

Reg

ALU

IM

DM

Reg

Can help with answering questions like:

How many cycles does it take to execute this code?

What is the ALU doing during cycle 4?

Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!


Time (clock cycles)

IM

Reg

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

Inst 3

DM

ALU

Inst 2

Reg

ALU

Inst 1

IM

ALU

O
r
d
e
r

Inst 0

ALU

I
n
s
t
r.

Once the
pipeline is full,
one instruction
is completed
every cycle, so
CPI = 1

Inst 4
Time to fill the pipeline

Reg

Reg

Reg

Reg

DM

Reg

Can Pipelining Get Us Into Trouble?

Yes: Pipeline Hazards

structural hazards: attempt to use the same resource by two


different instructions at the same time

data hazards: attempt to use data before it is ready


- An instructions source operand(s) are produced by a prior
instruction still in the pipeline

control hazards: attempt to make a decision about program


control flow before the condition has been evaluated and the
new PC target address calculated
- branch instructions

Can always resolve hazards by waiting

pipeline control must detect the hazard

and take action to resolve hazards

A Single Memory Would Be a Structural Hazard


Time (clock cycles)

Inst 4

Mem

Reg

Reg

Mem

Reg

Reg

Mem

Reg

Reg

ALU

Inst 3

Reg

ALU

Inst 2

Mem

Reg

ALU

Inst 1

Reg

Mem

ALU

O
r
d
e
r

lw

Reading data from


memory

ALU

I
n
s
t
r.

Mem

Mem

Mem

Mem

Reading instruction
from memory

Mem

Reg

Fix with separate instr and data memories (I$ and D$)

How About Register File Access?


Time (clock cycles)

Inst 2

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

Inst 1

Reg

ALU

IM

ALU

O
r
d
e
r

add $1,

ALU

I
n
s
t
r.

add $2,$1,

clock edge that controls


register writing

Reg

Reg

Fix register file


access hazard by
doing reads in the
second half of the
cycle and writes in
the first half
Reg

DM

Reg

clock edge that controls


loading of pipeline state
registers

Register Usage Can Cause Data Hazards

Dependencies backward in time cause hazards

Reg

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

$8,$1,$9

IM

ALU

or

DM

ALU

and $6,$1,$7

Reg

ALU

sub $4,$1,$5

IM

ALU

add $1,

xor $4,$1,$5

Read before write data hazard

Reg

Reg

Reg

Reg

DM

Reg

Loads Can Cause Data Hazards

Reg

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

DM

IM

Reg

ALU

sub $4,$1,$5

IM

ALU

$1,4($2)

ALU

O
r
d
e
r

lw

ALU

I
n
s
t
r.

Dependencies backward in time cause hazards

and $6,$1,$7
or

$8,$1,$9

xor $4,$1,$5

Load-use data hazard

Reg

Reg

Reg

Reg

DM

Reg

One Way to Fix a Data Hazard

Reg

DM

Reg

IM

Reg

DM

IM

Reg

ALU

IM

ALU

O
r
d
e
r

add $1,

ALU

I
n
s
t
r.

Can fix data


hazard by
waiting stall
but impacts CPI

stall
stall
sub $4,$1,$5
and $6,$1,$7

Reg

DM

Reg

Another Way to Fix a Data Hazard

or

$8,$1,$9

xor $4,$1,$5

IM

Reg

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

and $6,$1,$7

DM

ALU

sub $4,$1,$5

Reg

ALU

IM

ALU

O
r
d
e
r

add $1,

ALU

I
n
s
t
r.

Fix data hazards


by forwarding
results as soon as
they are available
to where they are
needed

Reg

Reg

Reg

Reg

DM

Reg

Forwarding with Load-use Data Hazards

or

$8,$1,$9

xor $4,$1,$5

IM

Reg

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

and $6,$1,$7

DM

ALU

sub $4,$1,$5

Reg

ALU

$1,4($2) IM

ALU

O
r
d
e
r

lw

ALU

I
n
s
t
r.

Reg

Reg

Reg

Will still need one stall cycle even with forwarding

Reg

DM

Reg

Branch Instructions Cause Control Hazards

Inst 4

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

Inst 3

Reg

ALU

lw

IM

ALU

O
r
d
e
r

beq

ALU

I
n
s
t
r.

Dependencies backward in time cause hazards


DM

Reg

Reg

Reg

DM

Reg

One Way to Fix a Control Hazard


beq

O
r
d
e
r

stall

IM

Reg

ALU

I
n
s
t
r.

DM

Fix branch
hazard by
waiting
stall but
affects CPI

Reg

stall

stall
Reg

DM

IM

Reg

ALU

Inst 3

IM

ALU

lw

Reg

DM

Corrected Datapath to Save RegWrite Addr

Need to preserve the destination register address in the


pipeline state registers

IF/ID

ID/EX

EX/MEM

Add
4

Shift
left 2

PC

Instruction
Memory
Read
Address

Add

Read Addr 1

Data
Memory

Register Read

Read Addr 2Data 1

File

Write Addr
Write Data

16

Sign
Extend

Read
Data 2

32

MEM/WB

ALU

Address
Write Data

Read
Data

MIPS Pipeline Control Path Modifications

All control signals can be determined during Decode

and held in the state registers between pipeline stages


ID/EX
EX/MEM
IF/ID

Control

Add
4

Shift
left 2

PC

Instruction
Memory
Read
Address

Read Addr 1

Data
Memory

Register Read

Read Addr 2Data 1

File

Write Addr
Write Data

16

Sign
Extend

Read
Data 2

32

MEM/WB

Add

ALU

Address
Write Data

Read
Data

Other Pipeline Structures Are Possible

What about the (slow) multiply operation?

Make the clock twice as slow or

let it take two cycles (since it doesnt use the DM stage)


MUL

Reg

ALU

IM

DM

Reg

What if the data memory access is twice as slow as


the instruction memory?
make the clock twice as slow or

let data memory access take two cycles (and keep the same
clock rate)
IM

Reg

ALU

DM1

DM2

Reg

Sample Pipeline Alternatives

ARM7

IM

Reg

PC update
IM access

XScale

IM

IM1
PC update
BTB access
start IM access

Reg

IM2

ALU op
DM access
shift/rotate
commit result
(write back)
DM

Reg

Reg

SHFT

ALU

StrongARM-1

decode
reg
access

ALU

EX

decode
reg 1 access
IM access

DM1

DM write
reg write
start DM access
exception

ALU op

shift/rotate
reg 2 access

Reg
DM2

Summary

All modern day processors use pipelining

Pipelining doesnt help latency of single task, it helps


throughput of entire workload

Potential speedup: a CPI of 1 and fast a CC

Pipeline rate limited by slowest pipeline stage

Unbalanced pipe stages makes for inefficiencies

The time to fill pipeline and time to drain it can impact


speedup for deep pipelines and short code runs

Must detect and resolve hazards

Stalling negatively affects CPI (makes CPI less than the ideal
of 1)

Dealing with Pipeline Hazards

Review: MIPS Pipeline Data and Control Paths


PCSrc
ID/EX
EX/MEM
Control
IF/ID
Add
RegWrite

PC

Instruction
Memory
Read
Address

Shift
left 2

Add

Read Addr 1

Data
Memory

Register Read

Read Addr 2Data 1

File

Write Addr
Write Data

16

Sign
Extend

MEM/WB

Branch

ALUSrc
ALU

Read
Data 2

Address

Read
Data

Write Data
ALU
cntrl

32

ALUOp

RegDst

MemWrite MemRead

MemtoReg

Control Settings
EX Stage
Reg
Dst

MEM Stage

WB Stage

ALU ALU ALU Brch Mem Mem Reg Mem


Op1 Op0 Src
Read Write Write toReg

R
lw

sw

beq

Review: One Way to Fix a Data Hazard


Reg

DM

Reg

IM

Reg

DM

IM

Reg

ALU

IM

ALU

O
r
d
e
r

add $1,

ALU

I
n
s
t
r.

Fix data hazard


by waiting
stall but
impacts CPI

stall
stall
sub $4,$1,$5
and $6,$7,$1

Reg

DM

Reg

Review: Another Way to Fix a Data Hazard

or

$8,$1,$1

sw

$4,4($1)

IM

Reg

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

and $6,$7,$1

DM

ALU

sub $4,$1,$5

Reg

ALU

IM

ALU

O
r
d
e
r

add $1,

ALU

I
n
s
t
r.

Fix data hazards


by forwarding
results as soon as
they are available
to where they are
Reg
needed

Reg

Reg

Reg

DM

Reg

Data Forwarding (aka Bypassing)

Take the result from the earliest point that it exists in any of
the pipeline state registers and forward it to the functional
units (e.g., the ALU) that need it that cycle

For ALU functional unit: the inputs can come from any
pipeline register rather than just from ID/EX by

adding multiplexors to the inputs of the ALU

connecting the Rd write data in EX/MEM or MEM/WB to either (or


both) of the EXs stage Rs and Rt ALU mux inputs

adding the proper control hardware to control the new muxes

Other functional units may need similar forwarding logic


(e.g., the DM)

With forwarding can achieve a CPI of 1 even in the


presence of data dependencies

Data Forwarding Control Conditions


1.

EX/MEM hazard:

if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd
and (EX/MEM.RegisterRd
ForwardA = 10
if (EX/MEM.RegWrite
and (EX/MEM.RegisterRd
and (EX/MEM.RegisterRd
ForwardB = 10
2.

!= 0)
= ID/EX.RegisterRs))
!= 0)
= ID/EX.RegisterRt))

Forwards the
result from the
previous instr.
to either input
of the ALU

MEM/WB hazard:

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd
and (MEM/WB.RegisterRd
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd
and (MEM/WB.RegisterRd
ForwardB = 01

!= 0)
= ID/EX.RegisterRs))
!= 0)
= ID/EX.RegisterRt))

Forwards the
result from the
second
previous instr.
to either input
of the ALU

Forwarding Illustration

sub $4,$1,$5
and $6,$7,$1

Reg

DM

IM

Reg

DM

IM

Reg

ALU

IM

ALU

O
r
d
e
r

add $1,

ALU

I
n
s
t
r.

EX/MEM hazard
forwarding

Reg

Reg

DM

Reg

MEM/WB hazard
forwarding

Yet Another Complication!

add $1,$1,$4

Reg

DM

IM

Reg

DM

IM

Reg

ALU

add $1,$1,$3

IM

ALU

O
r
d
e
r

add $1,$1,$2

ALU

I
n
s
t
r.

Another potential data hazard can occur when there is a


conflict between the result of the WB stage instruction and
the MEM stage instruction which should be forwarded?

Reg

Reg

DM

Reg

Corrected Data Forwarding Control Conditions


2.

MEM/WB hazard:

if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRs)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd != 0)
and (EX/MEM.RegisterRd != ID/EX.RegisterRt)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01

Datapath with Forwarding Hardware


PCSrc

ID/EX
EX/MEM
Control
IF/ID
Add
4

Shift
left 2

PC

Instruction
Memory
Read
Address

Add

Read Addr 1

Data
Memory

Register Read

Read Addr 2Data 1

File

Write Addr
Write Data
16

Sign
Extend

MEM/WB

Branch

ALU

Read
Data 2

Address

Read
Data

Write Data
ALU
cntrl

32

EX/MEM.RegisterRd
ID/EX.RegisterRt
ID/EX.RegisterRs

Forward
Unit

MEM/WB.RegisterRd

Memory-to-Memory Copies

For loads immediately followed by stores (memory-tomemory copies) can avoid a stall by adding forwarding
hardware from the MEM/WB register to the data memory
input.

sw $1,4($3)

IM

Reg

DM

IM

Reg

ALU

O
r
d
e
r

lw $1,4($2)

ALU

I
n
s
t
r.

Would need to add a Forward Unit and a mux to the memory


access stage

Reg

DM

Reg

Forwarding with Load-use Data Hazards

or
xor $8,$1,$9
$4,$1,$5
xor $4,$1,$5

Reg

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

DM

IM

Reg

ALU

or $6,$1,$7
$8,$1,$9
and

IM

ALU

and $4,$1,$5
$6,$1,$7
sub

DM

ALU

stall $4,$1,$5
sub

Reg

ALU

IM

$1,4($2)

ALU

O
r
d
e
r

lw

ALU

I
n
s
t
r.

Reg

Reg

Reg

Reg

Reg

DM

Load-use Hazard Detection Unit

Need a Hazard detection Unit in the ID stage that inserts


a stall between the load and its use
ID Hazard Detection
if (ID/EX.MemRead
and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
stall the pipeline
2.

The first line tests to see if the instruction now in the EX


stage is a lw; the next two lines check to see if the
destination register of the lw matches either source
register of the instruction in the ID stage (the load-use
instruction)

After this one cycle stall, the forwarding logic can handle
the remaining data hazards

Stall Hardware

Along with the Hazard Unit, we have to implement the stall

Prevent the instructions in the IF and ID stages from


progressing down the pipeline done by preventing the
PC register and the IF/ID pipeline register from changing

Insert a bubble between the lw instruction (in the EX


stage) and the load-use instruction (in the ID stage) (i.e.,
insert a noop in the execution stream)

Hazard detection Unit controls the writing of the PC (PC.write)


and IF/ID (IF/ID.write) registers

Set the control bits in the EX, MEM, and WB control fields of the
ID/EX pipeline register to 0 (noop). The Hazard Unit controls the
mux that chooses between the real control values and the 0s.

Let the lw instruction and the instructions after it in the


pipeline (before it in the code) proceed normally down the
pipeline

Adding the Hazard Hardware


PCSrc

Hazard
Unit

EX/MEM

Control 0

Shift
left 2

Instruction
Memory
PC

ID/EX.MemRead

IF/ID
Add

ID/EX

Read
Address

Add

Read Addr 1

Data
Memory

Register Read

Read Addr 2Data 1

File

Write Addr
Write Data
16

Sign
Extend

Read
Data 2
32

ID/EX.RegisterRt

MEM/WB

Branch

ALU

Address

Read
Data

Write Data
ALU
cntrl

Forward
Unit

On control hazards

Review: Datapath with Data Hazard Control


PCSrc

te
Wri
.
D
I
IF/

rite
PC.W

Hazard
Unit

EX/MEM

Control 0

Shift
left 2

Instruction
Memory
PC

ID/EX.MemRead

IF/ID
Add

ID/EX

Read
Address

Add

Read Addr 1

Data
Memory

Register Read

Read Addr 2Data 1

File

Write Addr
Write Data
16

Sign
Extend

Read
Data 2
32

ID/EX.RegisterRt

MEM/WB

Branch

ALU

Address

Read
Data

Write Data
ALU
cntrl

Forward
Unit

Control Hazards

When the flow of instruction addresses is not sequential


(i.e., PC = PC + 4); incurred by change of flow instructions

Conditional branches (beq, bne)

Unconditional branches (j, jal, jr)

Exceptions

Possible approaches

Stall (impacts CPI)

Move decision point as early in the pipeline as possible, thereby


reducing the number of stall cycles

Delay decision (requires compiler support)

Predict and hope for the best !

Control hazards occur less frequently than data hazards,


but there is nothing as effective against control hazards as
forwarding is for data hazards

Datapath Branch and Jump Hardware


Jump

PCSrc
ID/EX

Shift
left 2
IF/ID

EX/MEM

Control

Add
PC+4[31-28]

PC

Instruction
Memory
Read
Address

Shift
left 2

Add

Read Addr 1

Data
Memory

Register Read

Read Addr 2Data 1

File

Write Addr
Write Data
16

Sign
Extend

Read
Data 2
32

MEM/WB

Branch

ALU

Address

Read
Data

Write Data
ALU
cntrl

Forward
Unit

Jumps Incur One Stall

Jumps not decoded until ID, so one flush is needed

j target

IM

Reg

DM

IM

Reg

ALU

O
r
d
e
r

flush

DM

ALU

Reg

ALU

I
n
s
t
r.

IM

Fix jump
hazard by
waiting
stall but
affects CPI

Reg

Reg

DM

Reg

Fortunately, jumps are very infrequent only 3% of the


SPECint instruction mix

Supporting ID Stage Jumps


Jump

PCSrc
ID/EX

Shift
left 2
IF/ID

EX/MEM

Control

Add
PC+4[31-28]

PC

Instruction
Memory
Read
Address

Shift
left 2

Add

Read Addr 1

Data
Memory

Register Read

Read Addr 2Data 1

File

Write Addr
Write Data
16

Sign
Extend

Read
Data 2
32

MEM/WB

Branch

ALU

Address

Read
Data

Write Data
ALU
cntrl

Forward
Unit

Two Types of Stalls

Noop instruction (or bubble) inserted between two


instructions in the pipeline (as done for load-use
situations)

Keep the instructions earlier in the pipeline (later in the code)


from progressing down the pipeline for a cycle (bounce them in
place with write control signals)

Insert noop by zeroing control bits in the pipeline register at the


appropriate stage

Let the instructions later in the pipeline (earlier in the code)


progress normally down the pipeline

Flushes (or instruction squashing) were an instruction in


the pipeline is replaced with a noop instruction (as done
for instructions located sequentially after j instructions)

Zero the control bits for the instruction to be flushed

Review: Branches Incur Three Stalls


beq

O
r
d
e
r

flush

IM

Reg

ALU

I
n
s
t
r.

DM

Reg

Fix branch
hazard by
waiting
stall but
affects CPI

flush

flush
IM

Reg

ALU

beq target

DM

Reg

Moving Branch Decisions Earlier in Pipe

Move the branch decision hardware back to the EX stage

Reduces the number of stall (flush) cycles to two

Adds an and gate and a 2x1 mux to the EX timing path

Add hardware to compute the branch target address and


evaluate the branch decision to the ID stage

Reduces the number of stall (flush) cycles to one


(like with jumps)
- But now need to add forwarding hardware in ID stage

Computing branch target address can be done in parallel with


RegFile read (done for all instructions only used when needed)

Comparing the registers cant be done until after RegFile read, so


comparing and updating the PC adds a mux, a comparator, and an
and gate to the ID timing path

For deeper pipelines, branch decision points can be even


later in the pipeline, incurring more stalls

ID Branch Forwarding Issues

MEM/WB forwarding
is taken care of by the
normal RegFile write
before read operation

WB
MEM
EX
ID
IF

add3
$1,
add2
$3,
add1
$4,
beq
$1,$2,Loop
next_seq_instr

Need to forward from the


EX/MEM pipeline stage to
the ID comparison
hardware for cases like

WB
MEM
EX
ID
IF

add3
$3,
add2
$1,
add1
$4,
beq
$1,$2,Loop
next_seq_instr

if (IDcontrol.Branch
and (EX/MEM.RegisterRd
and (EX/MEM.RegisterRd
ForwardC = 1
if (IDcontrol.Branch
and (EX/MEM.RegisterRd
and (EX/MEM.RegisterRd
ForwardD = 1

Forwards the
result from the
second
previous instr.
!= 0)
= IF/ID.RegisterRt)) to either input
of the compare
!= 0)
= IF/ID.RegisterRs))

ID Branch Forwarding Issues, cont

If the instruction immediately


WB
add3
$3,
MEM add2
$4,
before the branch produces
EX
add1
$1,
one of the branch source
beq
$1,$2,Loop
operands, then a stall needs ID
IF
next_seq_instr
to be inserted (between the
beq and add1) since the EX stage ALU operation
is occurring at the same time as the ID stage branch
compare operation

Bounce the beq (in ID) and next_seq_instr (in IF) in place
(ID Hazard Unit deasserts PC.Write and IF/ID.Write)

Insert a stall between the add in the EX stage and the beq in
the ID stage by zeroing the control bits going into the ID/EX
pipeline register (done by the ID Hazard Unit)

If the branch is found to be taken, then flush the


instruction currently in IF (IF.Flush)

Supporting
ID Stage Branches
Branch

PCSrc

IF/ID
Add

PC

Instruction
Memory
Read
Address

IF.Flush

ID/EX

Control

Shift
left 2

Add

Read Addr 1

RegFile

Read Addr 2
Read Data 1
Write Addr
ReadData 2
Write Data
16

EX/MEM

Sign
Extend 32

MEM/WB

Compare

Hazard
Unit

Data
Memory
ALU

Read Data
Address
Write Data

ALU
cntrl

Forward
Unit
Forward
Unit

Delayed Decision

If the branch hardware has been moved to the ID stage,


then we can eliminate all branch stalls with delayed
branches which are defined as always executing the next
sequential instruction after the branch instruction the
branch takes effect after that next instruction

MIPS compiler moves an instruction to immediately after the


branch that is not affected by the branch (a safe instruction)
thereby hiding the branch delay

With deeper pipelines, the branch delay grows requiring


more than one delay slot

Delayed branches have lost popularity compared to more


expensive but more flexible (dynamic) hardware branch prediction

Growth in available transistors has made hardware branch


prediction relatively cheaper

Scheduling Branch Delay Slots


A. From before branch
add $1,$2,$3
if $2=0 then
delay slot

becomes

B. From branch target


sub $4,$5,$6
add $1,$2,$3
if $1=0 then
delay slot
becomes

if $2=0 then
add $1,$2,$3

C. From fall through


add $1,$2,$3
if $1=0 then
delay slot
sub $4,$5,$6
becomes
add $1,$2,$3
if $1=0 then

add $1,$2,$3
if $1=0 then
sub $4,$5,$6

sub $4,$5,$6

A is the best choice, fills delay slot and reduces IC

In B and C, the sub instruction may need to be copied, increasing IC

In B and C, must be okay to execute sub when branch fails

Static Branch Prediction

Resolve branch hazards by assuming a given outcome


and proceeding without waiting to see the actual branch
outcome

1.

Predict not taken always predict branches will not be


taken, continue to fetch from the sequential instruction
stream, only when branch is taken does the pipeline stall

If taken, flush instructions after the branch (earlier in the pipeline)


-

in IF, ID, and EX stages if branch logic in MEM three stalls

In IF and ID stages if branch logic in EX two stalls

in IF stage if branch logic in ID one stall

ensure that those flushed instructions havent changed the


machine state automatic in the MIPS pipeline since machine
state changing operations are at the tail end of the pipeline
(MemWrite (in MEM) or RegWrite (in WB))

restart the pipeline at the branch destination

Flushing with Misprediction (Not Taken)

Reg

DM

IM

Reg

IM

Reg

Reg

Reg

DM

ALU

20 or r8,$1,$9

IM

ALU

16 and $6,$1,$7

O
r
d
e
r

DM

ALU

8 flush
sub $4,$1,$5

Reg

ALU

4 beq $1,$2,2

I
n
s
t
r.

IM

Reg

DM

Reg

To flush the IF stage instruction, assert IF.Flush to


zero the instruction field of the IF/ID pipeline register
(transforming it into a noop)

Branching Structures

Predict not taken works well for top of the loop


branching structures
Loop: beq $1,$2,Out

But such loops have jumps at the


bottom of the loop to return to the
top of the loop and incur the
jump stall overhead
Out:

1nd loop instr


.
.
.
last loop instr
j Loop
fall out instr

Predict not taken doesnt work well for bottom of the


loop branching structures
Loop: 1st loop instr
2nd loop instr
.
.
.
last loop instr
bne $1,$2,Loop
fall out instr

Static Branch Prediction, cont

Resolve branch hazards by assuming a given outcome


and proceeding

2.

Predict taken predict branches will always be taken

Predict taken always incurs one stall cycle (if branch


destination hardware has been moved to the ID stage)

Is there a way to cache the address of the branch target


instruction ??

As the branch penalty increases (for deeper pipelines),


a simple static prediction scheme will hurt performance.
With more hardware, it is possible to try to predict
branch behavior dynamically during program execution

3.

Dynamic branch prediction predict branches at runtime using run-time information

Dynamic Branch Prediction

A branch prediction buffer (aka branch history table


(BHT)) in the IF stage addressed by the lower bits of the
PC, contains a bit passed to the ID stage through the
IF/ID pipeline register that tells whether the branch was
taken the last time it was execute

Prediction bit may predict incorrectly (may be a wrong prediction


for this branch this iteration or may be from a different branch
with the same low order PC bits) but the doesnt affect
correctness, just performance
- Branch decision occurs in the ID stage after determining that the
fetched instruction is a branch and checking the prediction bit

If the prediction is wrong, flush the incorrect instruction(s) in


pipeline, restart the pipeline with the right instruction, and invert
the prediction bit
- A 4096 bit BHT varies from 1% misprediction (nasa7, tomcatv) to
18% (eqntott)

Branch Target Buffer

The BHT predicts when a branch is taken, but does not


tell where its taken to!

A branch target buffer (BTB) in the IF stage can cache the


branch target address, but we also need to fetch the next
sequential instruction. The prediction bit in IF/ID selects which
next instruction will be loaded into IF/ID at the next clock edge
- Would need a two read port
instruction memory
BTB

Or the BTB can cache the


branch taken instruction while the
instruction memory is fetching the
next sequential instruction

PC

Instruction
Memory
0
Read
Address

If the prediction is correct, stalls can be avoided no matter


which direction they go

1-bit Prediction Accuracy

A 1-bit predictor will be incorrect twice when not taken

1.

Assume predict_bit = 0 to start (indicating


branch not taken) and loop control is at
the bottom of the loop code
Loop: 1st loop instr
2nd loop instr
First time through the loop, the predictor
.
mispredicts the branch since the branch is
.
taken back to the top of the loop; invert
.
prediction bit (predict_bit = 1)

2.

As long as branch is taken (looping),


prediction is correct

3.

Exiting the loop, the predictor again


mispredicts the branch since this time the
branch is not taken falling out of the loop;
invert prediction bit (predict_bit = 0)

last loop instr


bne $1,$2,Loop
fall out instr

For 10 times through the loop we have a 80% prediction


accuracy for a branch that is taken 90% of the time

2-bit Predictors

A 2-bit scheme can give 90% accuracy since a prediction


must be wrong twice before the prediction bit is changed

right 9 times
wrong on loop
Taken
fall out
1 Predict 11
Taken

Taken

Not taken
Taken

Predict 1
10 Taken

right on 1st
iteration

0 Predict 01
Not Taken

Not taken

Not taken

00Predict 0

Not Taken

Taken

Loop: 1st loop instr


2nd loop instr
.
.
.
last loop instr
bne $1,$2,Loop
fall out instr

Not taken

BHT also
stores the
initial FSM
state

Dealing with Exceptions

Exceptions (aka interrupts) are just another form of


control hazard. Exceptions arise from

R-type arithmetic overflow


Trying to execute an undefined instruction
An I/O device request
An OS service request (e.g., a page fault, TLB exception)
A hardware malfunction

The pipeline has to stop executing the offending


instruction in midstream, let all prior instructions
complete, flush all following instructions, set a register to
show the cause of the exception, save the address of the
offending instruction, and then jump to a prearranged
address (the address of the exception handler code)
The software (OS) looks at the cause of the exception
and deals with it

Two Types of Exceptions

Interrupts asynchronous to program execution

caused by external events

may be handled between instructions, so can let the


instructions currently active in the pipeline complete before
passing control to the OS interrupt handler

simply suspend and resume user program

Traps (Exception) synchronous to program execution

caused by internal events

condition must be remedied by the trap handler for that


instruction, so much stop the offending instruction midstream
in the pipeline and pass control to the OS trap handler

the offending instruction may be retried (or simulated by the


OS) and the program may continue or it may be aborted

Where in the Pipeline Exceptions Occur


Reg

ALU

IM

DM

Reg

Stage(s)?

Synchronous?

Arithmetic overflow

EX

yes

Undefined instruction

ID

yes

TLB or page fault

IF, MEM

yes

I/O service request

any

no

Hardware malfunction

any

no

Beware that multiple exceptions can occur


simultaneously in a single clock cycle

Multiple Simultaneous Exceptions


IM

Reg

ALU

Inst 0

I
n
s
t
r.

DM

Reg

D$ page fault
IM

Reg

ALU

Inst 1

DM

Reg

arithmetic overflow
IM

Reg

ALU

Inst 2

O
r
d
e
r

DM

Reg

Inst 4

IM

I$ page fault

Reg

DM

IM

Reg

ALU

Inst 3

ALU

undefined instruction
Reg

DM

Hardware sorts the exceptions so that the earliest


instruction is the one interrupted first

Reg

Additions to MIPS to Handle Exceptions (Fig 6.42)

Cause register (records exceptions) hardware to record


in Cause the exceptions and a signal to control writes to it
(CauseWrite)

EPC register (records the addresses of the offending


instructions) hardware to record in EPC the address of
the offending instruction and a signal to control writes to it
(EPCWrite)

A way to load the PC with the address of the exception


handler

Exception software must match exception to instruction

Expand the PC input mux where the new input is hardwired to


the exception handler address - (e.g., 8000 0180hex for arithmetic
overflow)

A way to flush offending instruction and the ones that


follow it

Summar
All modern
y day processors use pipelining for

performance (a CPI of 1 and fast a CC)


Pipeline clock rate limited by slowest pipeline stage
so designing a balanced pipeline is important
Must detect and resolve hazards

Structural hazards resolved by designing the pipeline


correctly
Data hazards
- Stall (impacts CPI)
- Forward (requires hardware support)

Control hazards put the branch decision hardware in as


early a stage in the pipeline as possible
- Stall (impacts CPI)
- Delay decision (requires compiler support)
- Static and dynamic prediction (requires hardware support)

Potrebbero piacerti anche