
5 Steps of DLX Datapath

Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back

[Figure: single-cycle DLX datapath. PC and Next-PC adder (+4), instruction memory, register file (RS1, RS2, RD), sign-extended immediate, ALU with Zero? test, data memory (LMD), and the write-back mux.]

IR <= mem[PC]; PC <= PC + 4
Reg[IRrd] <= Reg[IRrs] op_IRop Reg[IRrt]

5 Steps of DLX Datapath
Figure 3.4, Page 134

Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back

[Figure: pipelined DLX datapath with latches IF/ID, ID/EX, EX/MEM, MEM/WB between the stages; Next SEQ PC carried forward, Zero? test and branch mux in MEM.]

IF:  IR <= mem[PC]; PC <= PC + 4
ID:  A <= Reg[IRrs]; B <= Reg[IRrt]
EX:  rslt <= A op_IRop B
MEM: WB <= rslt
WB:  Reg[IRrd] <= WB

Inst. Set Processor Controller

Ifetch:       IR <= mem[PC]; PC <= PC + 4
opFetch-DCD:  A <= Reg[IRrs]; B <= Reg[IRrt]

Then branch on the opcode:
  br:   if bop(A,B) PC <= PC + IRim
  jmp:  PC <= IRjaddr
  RR:   r <= A op_IRop B;     WB <= r;       Reg[IRrd] <= WB
  RI:   r <= A op_IRop IRim;  WB <= r;       Reg[IRrd] <= WB
  LD:   r <= A + IRim;        WB <= Mem[r];  Reg[IRrd] <= WB
  (JSR, JR, and ST follow similar paths in the state diagram.)
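The controller's state sequence above can be sketched as a tiny interpreter. This is an illustrative Python sketch, not real DLX: the tuple instruction encoding and the OPS table are hypothetical stand-ins.

```python
# Illustrative sketch of the controller's state sequence for one
# instruction (Ifetch -> opFetch-DCD -> execute -> WB). The tuple
# encoding and the OPS table are hypothetical stand-ins, not real DLX.
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

def execute_one(instr_mem, data_mem, reg, pc):
    # Ifetch: IR <= mem[PC]; PC <= PC + 4
    ir = instr_mem[pc]
    pc += 4
    op = ir[0]
    if op in OPS:                         # RR: r <= A op B; WB <= r;
        _, rd, rs, rt = ir                #     Reg[IRrd] <= WB
        reg[rd] = OPS[op](reg[rs], reg[rt])
    elif op == "ld":                      # LD: r <= A + IRim; WB <= Mem[r]
        _, rd, rs, imm = ir
        reg[rd] = data_mem[reg[rs] + imm]
    elif op == "br":                      # br: if bop(A,B) PC <= PC + IRim
        _, rs1, rs2, imm = ir
        if reg[rs1] == reg[rs2]:
            pc += imm
    return pc

reg = {1: 0, 2: 5, 3: 7}
pc = execute_one({0: ("add", 1, 2, 3)}, {}, reg, 0)
print(reg[1], pc)   # 12 4
```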

5 Steps of DLX Datapath
Figure 3.4, Page 134

[Figure: the same pipelined DLX datapath, with the control fields for each instruction carried along in the IF/ID, ID/EX, EX/MEM, and MEM/WB latches.]

Data stationary control:
- local decode for each instruction phase / pipeline stage

4

Visualizing Pipelining
Figure 3.3, Page 133

Time (clock cycles):

             Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Instr i      Ifetch   Reg      ALU      DMem     Reg
Instr i+1             Ifetch   Reg      ALU      DMem     Reg
Instr i+2                      Ifetch   Reg      ALU      DMem     Reg
Instr i+3                               Ifetch   Reg      ALU      DMem     Reg

Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

            Ideal CPI x Pipeline depth      CycleTime_unpipelined
Speedup  =  ------------------------------ x ---------------------
            Ideal CPI + Pipeline stall CPI   CycleTime_pipelined

For simple RISC pipeline, Ideal CPI = 1:

                Pipeline depth          CycleTime_unpipelined
Speedup  =  ----------------------- x ---------------------
            1 + Pipeline stall CPI     CycleTime_pipelined
6
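The speedup formula above can be checked numerically. A minimal sketch; the function name and parameters are mine, with cycle_time_ratio standing for CycleTime_unpipelined / CycleTime_pipelined.

```python
# Sketch of the pipeline speedup formula above. Names are illustrative.
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_time_ratio=1.0):
    # Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI)
    #           * (unpipelined cycle time / pipelined cycle time)
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio

# Ideal 5-stage pipeline, no stalls: speedup equals the depth.
print(pipeline_speedup(depth=5, stall_cpi=0.0))   # 5.0
```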

Pipelining is not quite that easy!

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port / Structural Hazards
Figure 3.6, Page 142

Time (clock cycles):

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Load      Ifetch   Reg      ALU      DMem     Reg
Instr 1            Ifetch   Reg      ALU      DMem     Reg
Instr 2                     Ifetch   Reg      ALU      DMem     Reg
Instr 3                              Ifetch   Reg      ALU      DMem     Reg
Instr 4                                       Ifetch   Reg      ALU      DMem     Reg

In Cycle 4, Load's DMem access and Instr 3's Ifetch both need the single memory port.

One Memory Port / Structural Hazards
Figure 3.7, Page 143

Time (clock cycles):

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Load      Ifetch   Reg      ALU      DMem     Reg
Instr 1            Ifetch   Reg      ALU      DMem     Reg
Instr 2                     Ifetch   Reg      ALU      DMem     Reg
Stall                                Bubble   Bubble   Bubble   Bubble   Bubble
Instr 3                                       Ifetch   Reg      ALU      DMem     Reg

How do you bubble the pipe?

Example: Dual-port vs. Single-port

Machine A: dual-ported memory ("Harvard architecture")
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both; loads are 40% of instructions executed

SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster.


10
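The Machine A vs. Machine B arithmetic above can be reproduced directly; a sketch, with the depth chosen arbitrarily since it cancels out of the final ratio.

```python
# Reproducing the dual-port vs. single-port example above (a sketch).
depth = 5          # any depth works; it cancels in the final ratio

speedup_a = depth / (1 + 0.0)              # dual-ported: no structural stalls
speedup_b = depth / (1 + 0.4 * 1) * 1.05   # 40% loads stall 1 cycle; 1.05x clock

print(round(speedup_a / speedup_b, 2))   # 1.33
```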

Data Hazard on R1
Figure 3.9, Page 147

Time (clock cycles):

                  IF      ID/RF   EX      MEM     WB
add r1,r2,r3      Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r3              Ifetch  Reg     ALU     DMem    Reg
and r6,r1,r7                      Ifetch  Reg     ALU     DMem    Reg
or  r8,r1,r9                              Ifetch  Reg     ALU     DMem    Reg
xor r10,r1,r11                                    Ifetch  Reg     ALU     DMem    Reg

The sub, and, and or instructions all read r1 before add writes it back in WB.

11

Three Generic Data Hazards

Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

12

Three Generic Data Hazards

Write After Read (WAR): InstrJ writes an operand before InstrI reads it.
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers; it results from reuse of the name r1.
Can't happen in the DLX 5-stage pipeline because:
- all instructions take 5 stages,
- reads are always in stage 2, and
- writes are always in stage 5.
13

Three Generic Data Hazards

Write After Write (WAW): InstrJ writes an operand before InstrI writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "output dependence" by compiler writers; this also results from reuse of the name r1.
Can't happen in the DLX 5-stage pipeline because:
- all instructions take 5 stages, and
- writes are always in stage 5.
We will see WAR and WAW in more complicated pipes.
14
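The three hazard kinds above can be summarized in a small checker. A sketch; the (dest, sources) tuple encoding of an instruction is hypothetical.

```python
# Sketch: classify RAW/WAR/WAW hazards between two instructions, each
# given as (dest_reg, [source_regs]). The encoding is illustrative.
def hazards(instr_i, instr_j):
    """Hazards created when instr_j follows instr_i in program order."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")      # j reads what i writes
    if dest_j in srcs_i:
        found.add("WAR")      # j writes what i reads
    if dest_i == dest_j:
        found.add("WAW")      # both write the same register
    return found

# I: add r1,r2,r3   J: sub r4,r1,r3  ->  RAW on r1
print(hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))  # {'RAW'}
```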

Forwarding to Avoid Data Hazard
Figure 3.10, Page 149

Time (clock cycles):

add r1,r2,r3      Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r3              Ifetch  Reg     ALU     DMem    Reg
and r6,r1,r7                      Ifetch  Reg     ALU     DMem    Reg
or  r8,r1,r9                              Ifetch  Reg     ALU     DMem    Reg
xor r10,r1,r11                                    Ifetch  Reg     ALU     DMem    Reg

[Figure shows forwarding paths from add's ALU output into the ALU inputs of the following instructions.]

15

HW Change for Forwarding
Figure 3.20, Page 161

[Figure: forwarding hardware. Muxes in front of the ALU inputs choose among the ID/EX register values, the EX/MEM ALU result, and the MEM/WB write-back value; NextPC logic, immediate path, register file, and data memory as before.]

What circuit detects and resolves this hazard?

16
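One plausible answer to the question above is a forwarding unit that compares the destination fields in the EX/MEM and MEM/WB latches against each ALU source register in ID/EX. A sketch under that assumption; the field names mimic the latch names in the figure, but the function itself is mine.

```python
# Sketch of the forwarding-unit decision for one ALU source operand.
# Field names mimic the pipeline latches; the details are illustrative.
def forward_select(src_reg, exmem_rd, exmem_regwrite, memwb_rd, memwb_regwrite):
    """Return which pipeline latch feeds the ALU input for src_reg."""
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == src_reg:
        return "EX/MEM"       # forward the just-computed ALU result
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == src_reg:
        return "MEM/WB"       # forward the value being written back
    return "ID/EX"            # no hazard: use the register-file value

# sub r4,r1,r3 right after add r1,...: r1 comes from EX/MEM
print(forward_select(1, exmem_rd=1, exmem_regwrite=True,
                     memwb_rd=0, memwb_regwrite=False))   # EX/MEM
```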

Data Hazard Even with Forwarding
Figure 3.12, Page 153

Time (clock cycles):

lw  r1, 0(r2)     Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r6              Ifetch  Reg     ALU     DMem    Reg
and r6,r1,r7                      Ifetch  Reg     ALU     DMem    Reg
or  r8,r1,r9                              Ifetch  Reg     ALU     DMem    Reg

lw's data is not available until the end of MEM, but sub needs it at the start of EX, so forwarding alone cannot fix this.

17

Data Hazard Even with Forwarding
Figure 3.13, Page 154

Time (clock cycles):

lw  r1, 0(r2)     Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r6              Ifetch  Reg     Bubble  ALU     DMem    Reg
and r6,r1,r7                      Ifetch  Bubble  Reg     ALU     DMem    Reg
or  r8,r1,r9                              Bubble  Ifetch  Reg     ALU     DMem

How is this detected?

18
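A common answer to "How is this detected?" is a MIPS-style hazard detection unit that compares the load's destination in ID/EX against the source registers of the instruction in IF/ID. A sketch under that assumption; the names are illustrative.

```python
# Sketch of the load-use hazard check: stall when the instruction in EX
# is a load whose destination matches a source of the instruction in ID.
def must_stall(idex_memread, idex_rt, ifid_rs, ifid_rt):
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)

# lw r1,0(r2) immediately followed by sub r4,r1,r6 -> stall one cycle
print(must_stall(True, 1, 1, 6))    # True
```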

Software Scheduling to Avoid Load Hazards

Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:            Fast code:
  LW  Rb,b              LW  Rb,b
  LW  Rc,c              LW  Rc,c
  ADD Ra,Rb,Rc          LW  Re,e
  SW  a,Ra              ADD Ra,Rb,Rc
  LW  Re,e              LW  Rf,f
  LW  Rf,f              SW  a,Ra
  SUB Rd,Re,Rf          SUB Rd,Re,Rf
  SW  d,Rd              SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.


19
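The benefit of the reordering above can be counted with a simple rule: one stall whenever an instruction consumes a load's result in the very next slot. A sketch; the tuple encoding is mine.

```python
# Sketch: count load-use stalls in the two schedules above, assuming a
# one-cycle stall when a load's result is used in the very next slot.
# Instructions are (op, dest, sources) tuples (an illustrative encoding).
def count_stalls(code):
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("LW","Rb",[]), ("LW","Rc",[]), ("ADD","Ra",["Rb","Rc"]),
        ("SW","a",["Ra"]), ("LW","Re",[]), ("LW","Rf",[]),
        ("SUB","Rd",["Re","Rf"]), ("SW","d",["Rd"])]
fast = [("LW","Rb",[]), ("LW","Rc",[]), ("LW","Re",[]),
        ("ADD","Ra",["Rb","Rc"]), ("LW","Rf",[]), ("SW","a",["Ra"]),
        ("SUB","Rd",["Re","Rf"]), ("SW","d",["Rd"])]
print(count_stalls(slow), count_stalls(fast))   # 2 0
```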

Control Hazards

Branch instructions can cause great performance loss. A branch instruction needs two things:
- The branch outcome: taken or not taken
- The branch target:
    PC + 4                   if branch NOT taken
    PC + 4 + immediate*4     if branch taken

The branch instruction is not detected until the ID stage, at which point a new instruction has already been fetched.

For our original pipeline:

Branch Delay CC1

Consider the pipelined execution of: beq $1, $3, 100
During the first cycle, beq is fetched in the IF stage (PC = 1000).

[Figure: MIPS-style pipelined datapath. PC, instruction memory, IF/ID, register file, sign extend, branch-target adder, ALU with Zero output, main control, and the PCSrc mux selecting between PC + 4 and the branch target.]

Branch Delay CC2

During the second cycle, beq is decoded in the ID stage: $1 and $3 are read from the register file and the immediate 100 is sign-extended. The next_1 instruction is fetched in the IF stage (PC = 1004).

Branch Delay CC3

During the third cycle, beq is executed in the EX stage: the ALU compares the two register values (both 1234 in the figure). The next_2 instruction is fetched in the IF stage (PC = 1008).

Branch Delay CC4

During the fourth cycle, beq reaches the MEM stage: Zero = 1 and Beq = 1, so PCSrc selects the branch target 1404 (= 1004 + 100*4). The next_3 instruction is fetched in the IF stage (PC = 1012).

Branch Delay CC5

During the fifth cycle, the branch_target instruction is fetched (PC = 1404). Next_1 through next_3 should be converted into NOPs.

3-Cycle Branch Delay

Next_1 through Next_3 will be fetched anyway. The pipeline should flush Next_1 through Next_3 if the branch is taken; otherwise, they can be executed normally.

                   cc1  cc2  cc3  cc4  cc5  cc6  cc7
beq $1,$3,100      IM   Reg  ALU  DM   Reg
Next_1 // bubble        IM   Bub  Bub  Bub  Bub
Next_2 // bubble             IM   Bub  Bub  Bub
Next_3 // bubble                  IM   Bub  Bub
Branch_Target                          IM   Reg  ALU

Reducing the Delay of Branches

Branch delay can be reduced from 3 cycles to just 1 cycle by moving the branch decision from the 4th into the 2nd pipeline stage:
- Branches can be determined earlier, in the ID stage
- The branch address calculation adder is moved to the ID stage
- A comparator in the ID stage compares the two fetched registers to determine the branch decision (taken or not taken)
Only one instruction that follows the branch will have been fetched before the decision is known.

Reducing the Delay of Branches

[Figure: pipelined datapath with the branch moved to ID. IF.Flush, a hazard detection unit, the shift-left-2 and target adder plus a register comparator (=) in the ID stage, the forwarding unit, and control fields carried in ID/EX, EX/MEM, and MEM/WB.]

Branch Hazard Alternatives

Stall: always stall the pipeline until the branch direction is known
- The next instruction is always flushed (turned into a NOP)

Predict Branch Not Taken:
- Fetch the successor instruction: PC + 4 is already calculated
- Almost half of MIPS branches are not taken on average
- Flush instructions in the pipeline only if the branch is actually taken

Predict Branch Taken:
- Can predict backward branches (e.g., loop branches) as taken

Delayed Branch

Define the branch to take place after the next instruction. For a 1-cycle branch delay, we have one delay slot:

    branch instruction
    branch delay slot    (next instruction)
    ...
    branch target        (if branch taken)

    branch instruction (taken)   IF  ID  EX  MEM  WB
    branch delay slot                IF  ID  EX   MEM  WB
    branch target                        IF  ID   EX   MEM  WB

The compiler/assembler fills the branch delay slot by selecting a useful instruction.

Scheduling the Branch Delay Slot

From Before: move an independent instruction from before the branch into the slot
    add $t2,$t3,$t4            beq $s1, $s0, ...
    beq $s1, $s0, ...    =>    add $t2,$t3,$t4    <- delay slot

From Target: copy the instruction at the branch target into the slot (useful when the branch is predicted taken)
    beq $s1, $s0, ...    =>    beq $s1, $s0, ...
    ...                        sub $t4,$t5,$t6    <- delay slot
    sub $t4,$t5,$t6 (target)

From Fall Through: move the fall-through instruction into the slot (useful when the branch is predicted not taken)
    beq $s1, $s0, ...    =>    beq $s1, $s0, ...
    sub $t4,$t5,$t6            sub $t4,$t5,$t6    <- delay slot

More on Delayed Branch

Scheduling the delay slot with an independent instruction is the best choice; however, it is not always possible to find one.

The target instruction is useful when the branch is predicted taken:
- Such as in a loop branch
- May need to duplicate the instruction if it can be reached by another path
- Cancel the branch-delay instruction if the branch is not taken

The fall-through instruction is useful when the branch is predicted not taken.

Zero-Delayed Branch

How can we achieve zero delay for a taken branch, if the branch target address is computed in the ID stage?

Solution:
- Check the PC to see if the instruction being fetched is a branch
- Store the branch target address in a table in the IF stage; such a table is called the branch target buffer
- If the branch is predicted taken, fetch the next instruction from the stored target address

Branch Target and Prediction Buffer

The branch target buffer is implemented as a small cache that stores the branch target address of taken branches.

We also have a branch prediction buffer, to store the prediction bits for branch instructions. The prediction bits are dynamically determined by the hardware.

[Figure: the PC looks up the Branch Target Buffer (PC of branch -> target address) and the Prediction Buffer; a mux selects between PC + 4 and the predicted target.]

Dynamic Branch Prediction

Prediction of branches at runtime using prediction bits:
- One or a few prediction bits are associated with each branch instruction

The branch prediction buffer is a small memory:
- Indexed by the lower portion of the address of the branch instruction

The simplest scheme is to have 1 prediction bit per branch. We don't know whether the prediction is correct or not. If the prediction is correct, execution continues normally with no wasted cycles.

2-bit Prediction Scheme

- A prediction is just a hint that is assumed to be correct; if incorrect, the fetched instructions are flushed
- The 1-bit scheme has a performance shortcoming: a loop branch is almost always taken, except for the last iteration, and the 1-bit scheme predicts incorrectly twice rather than once (on the first and last loop iterations)
- 2-bit prediction schemes are often used: a prediction must be wrong twice before it is changed
- With 2 bits, a loop branch is mispredicted only once, on the last iteration

[State diagram: Predict Taken (strong) <-> Predict Taken (weak) <-> Predict Not Taken (weak) <-> Predict Not Taken (strong); taken outcomes move toward the Taken states, not-taken outcomes toward the Not Taken states.]
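The 2-bit saturating counter above can be simulated directly. A sketch; the state encoding 0-3 is mine (0/1 predict not taken, 2/3 predict taken).

```python
# Sketch of the 2-bit saturating-counter predictor described above.
# States 0,1 predict not-taken; states 2,3 predict taken.
def simulate(outcomes, state=3):
    mispredicts = 0
    for taken in outcomes:
        if (state >= 2) != taken:          # prediction disagrees with outcome
            mispredicts += 1
        # saturate toward 3 on taken, toward 0 on not taken
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

# A loop branch: taken 9 times, not taken on the last iteration.
print(simulate([True] * 9 + [False]))   # 1
```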

Control Hazard on Branches: Three-Stage Stall

10: beq r1,r3,36    Ifetch  Reg     ALU     DMem    Reg
14: and r2,r3,r5            Ifetch  Reg     ALU     DMem    Reg
18: or  r6,r1,r7                    Ifetch  Reg     ALU     DMem    Reg
22: add r8,r1,r9                            Ifetch  Reg     ALU     DMem    Reg
36: xor r10,r1,r11                                  Ifetch  Reg     ALU     DMem    Reg

What do you do with the 3 instructions in between? How do you do it? Where is the commit?

37

Branch Stall Impact

If CPI = 1 and 30% of instructions are branches that stall 3 cycles, then the new CPI = 1 + 0.3 x 3 = 1.9!

Two-part solution:
- Determine branch taken or not sooner, AND
- Compute the taken-branch address earlier

A DLX branch tests whether a register = 0 or != 0. DLX solution:
- Move the zero test to the ID/RF stage
- Add an adder to calculate the new PC in the ID/RF stage
- Result: a 1 clock cycle penalty for branches versus 3

38

Pipelined DLX Datapath
Figure 3.22, Page 163

Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back

[Figure: revised pipelined DLX datapath. The branch-target adder and the Zero? test are moved into the ID stage, so both the next sequential PC and the branch target are available one cycle after fetch.]

Interplay of instruction set design and cycle time.

39

Four Branch Hazard Alternatives

#1: Stall until the branch direction is clear

#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- Squash instructions in the pipeline if the branch is actually taken
- Advantage of late pipeline state update
- 47% of DLX branches are not taken on average
- PC + 4 is already calculated, so use it to get the next instruction

#3: Predict Branch Taken
- 53% of DLX branches are taken on average
- But the branch target address is not yet calculated in DLX, so DLX still incurs a 1-cycle branch penalty
- Other machines: branch target known before outcome

40

Four Branch Hazard Alternatives

#4: Delayed Branch
- Define the branch to take place AFTER a following instruction:

    branch instruction
    sequential successor_1
    sequential successor_2
    ........
    sequential successor_n     <- branch delay of length n
    branch target if taken

- A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline; DLX uses this

41

Delayed Branch

Where to get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From fall through: only valuable when the branch is not taken
- Canceling branches allow more slots to be filled

Compiler effectiveness for a single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots are useful in computation
- So about 50% (60% x 80%) of slots are usefully filled

Delayed branch downside: it scales poorly to 7-8 stage pipelines and to multiple instructions issued per clock (superscalar).

42

Evaluating Branch Alternatives

Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Scheduling scheme    Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
Stall pipeline       3                1.42   3.5                       1.0
Predict taken        1                1.14   4.4                       1.26
Predict not taken    1                1.09   4.5                       1.29
Delayed branch       0.5              1.07   4.6                       1.31

Assumptions: conditional & unconditional branches = 14% of instructions; 65% of them change the PC.


43
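The CPI column above follows from CPI = 1 + branch frequency x effective penalty, using the 14% branch frequency from the slide (for predict-not-taken, the 1-cycle penalty is charged only to the 65% of branches that change the PC). A sketch reproducing the table's CPI values.

```python
# Sketch: recompute the CPI column of the table above.
freq = 0.14   # branch frequency from the slide
effective_penalty = {"Stall pipeline": 3, "Predict taken": 1,
                     "Predict not taken": 0.65, "Delayed branch": 0.5}
cpi = {s: round(1 + freq * p, 2) for s, p in effective_penalty.items()}
print(cpi)
# {'Stall pipeline': 1.42, 'Predict taken': 1.14,
#  'Predict not taken': 1.09, 'Delayed branch': 1.07}
```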
