
5 Steps of DLX Datapath

Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back

[Figure: single-cycle DLX datapath. PC and Next-PC adder (+4), instruction memory, register file (RS1, RS2, RD), sign-extended immediate, ALU with Zero? test, data memory (LMD), and the write-back mux.]

IR <= mem[PC]; PC <= PC + 4
Reg[IRrd] <= Reg[IRrs] op_IRop Reg[IRrt]

5 Steps of DLX Datapath
Figure 3.4, Page 134

Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back

[Figure: pipelined DLX datapath with latches IF/ID, ID/EX, EX/MEM, MEM/WB between the stages; Next SEQ PC carried forward, Zero? test and branch mux in MEM.]

IF:  IR <= mem[PC]; PC <= PC + 4
ID:  A <= Reg[IRrs]; B <= Reg[IRrt]
EX:  rslt <= A op_IRop B
MEM: WB <= rslt
WB:  Reg[IRrd] <= WB

Inst. Set Processor Controller

Ifetch:       IR <= mem[PC]; PC <= PC + 4
opFetch-DCD:  A <= Reg[IRrs]; B <= Reg[IRrt]

Then branch on the opcode:
  br:   if bop(A,B) PC <= PC + IRim
  jmp:  PC <= IRjaddr
  RR:   r <= A op_IRop B;     WB <= r;       Reg[IRrd] <= WB
  RI:   r <= A op_IRop IRim;  WB <= r;       Reg[IRrd] <= WB
  LD:   r <= A + IRim;        WB <= Mem[r];  Reg[IRrd] <= WB
  (JSR, JR, and ST follow similar paths in the state diagram.)
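The controller's state sequence above can be sketched as a tiny interpreter. This is an illustrative Python sketch, not real DLX: the tuple instruction encoding and the OPS table are hypothetical stand-ins.

```python
# Illustrative sketch of the controller's state sequence for one
# instruction (Ifetch -> opFetch-DCD -> execute -> WB). The tuple
# encoding and the OPS table are hypothetical stand-ins, not real DLX.
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

def execute_one(instr_mem, data_mem, reg, pc):
    # Ifetch: IR <= mem[PC]; PC <= PC + 4
    ir = instr_mem[pc]
    pc += 4
    op = ir[0]
    if op in OPS:                         # RR: r <= A op B; WB <= r;
        _, rd, rs, rt = ir                #     Reg[IRrd] <= WB
        reg[rd] = OPS[op](reg[rs], reg[rt])
    elif op == "ld":                      # LD: r <= A + IRim; WB <= Mem[r]
        _, rd, rs, imm = ir
        reg[rd] = data_mem[reg[rs] + imm]
    elif op == "br":                      # br: if bop(A,B) PC <= PC + IRim
        _, rs1, rs2, imm = ir
        if reg[rs1] == reg[rs2]:
            pc += imm
    return pc

reg = {1: 0, 2: 5, 3: 7}
pc = execute_one({0: ("add", 1, 2, 3)}, {}, reg, 0)
print(reg[1], pc)   # 12 4
```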

5 Steps of DLX Datapath
Figure 3.4, Page 134

[Figure: the same pipelined DLX datapath, with the control fields for each instruction carried along in the IF/ID, ID/EX, EX/MEM, and MEM/WB latches.]

Data stationary control:
- local decode for each instruction phase / pipeline stage

4

Visualizing Pipelining
Figure 3.3, Page 133

Time (clock cycles):

             Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Instr i      Ifetch   Reg      ALU      DMem     Reg
Instr i+1             Ifetch   Reg      ALU      DMem     Reg
Instr i+2                      Ifetch   Reg      ALU      DMem     Reg
Instr i+3                               Ifetch   Reg      ALU      DMem     Reg

Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

            Ideal CPI x Pipeline depth      CycleTime_unpipelined
Speedup  =  ------------------------------ x ---------------------
            Ideal CPI + Pipeline stall CPI   CycleTime_pipelined

For simple RISC pipeline, Ideal CPI = 1:

                Pipeline depth          CycleTime_unpipelined
Speedup  =  ----------------------- x ---------------------
            1 + Pipeline stall CPI     CycleTime_pipelined
6
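The speedup formula above can be checked numerically. A minimal sketch; the function name and parameters are mine, with cycle_time_ratio standing for CycleTime_unpipelined / CycleTime_pipelined.

```python
# Sketch of the pipeline speedup formula above. Names are illustrative.
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, cycle_time_ratio=1.0):
    # Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI)
    #           * (unpipelined cycle time / pipelined cycle time)
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio

# Ideal 5-stage pipeline, no stalls: speedup equals the depth.
print(pipeline_speedup(depth=5, stall_cpi=0.0))   # 5.0
```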

Pipelining is not quite that easy!

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port / Structural Hazards
Figure 3.6, Page 142

Time (clock cycles):

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Load      Ifetch   Reg      ALU      DMem     Reg
Instr 1            Ifetch   Reg      ALU      DMem     Reg
Instr 2                     Ifetch   Reg      ALU      DMem     Reg
Instr 3                              Ifetch   Reg      ALU      DMem     Reg
Instr 4                                       Ifetch   Reg      ALU      DMem     Reg

In Cycle 4, Load's DMem access and Instr 3's Ifetch both need the single memory port.

One Memory Port / Structural Hazards
Figure 3.7, Page 143

Time (clock cycles):

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Load      Ifetch   Reg      ALU      DMem     Reg
Instr 1            Ifetch   Reg      ALU      DMem     Reg
Instr 2                     Ifetch   Reg      ALU      DMem     Reg
Stall                                Bubble   Bubble   Bubble   Bubble   Bubble
Instr 3                                       Ifetch   Reg      ALU      DMem     Reg

How do you bubble the pipe?

Example: Dual-port vs. Single-port

Machine A: dual-ported memory ("Harvard architecture")
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both; loads are 40% of instructions executed

SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
         = (Pipeline Depth / 1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster.


10
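The Machine A vs. Machine B arithmetic above can be reproduced directly; a sketch, with the depth chosen arbitrarily since it cancels out of the final ratio.

```python
# Reproducing the dual-port vs. single-port example above (a sketch).
depth = 5          # any depth works; it cancels in the final ratio

speedup_a = depth / (1 + 0.0)              # dual-ported: no structural stalls
speedup_b = depth / (1 + 0.4 * 1) * 1.05   # 40% loads stall 1 cycle; 1.05x clock

print(round(speedup_a / speedup_b, 2))   # 1.33
```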

Data Hazard on R1
Figure 3.9, Page 147

Time (clock cycles):

                  IF      ID/RF   EX      MEM     WB
add r1,r2,r3      Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r3              Ifetch  Reg     ALU     DMem    Reg
and r6,r1,r7                      Ifetch  Reg     ALU     DMem    Reg
or  r8,r1,r9                              Ifetch  Reg     ALU     DMem    Reg
xor r10,r1,r11                                    Ifetch  Reg     ALU     DMem    Reg

The sub, and, and or instructions all read r1 before add writes it back in WB.

11

Three Generic Data Hazards

Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

12

Three Generic Data Hazards

Write After Read (WAR): InstrJ writes an operand before InstrI reads it.
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers; it results from reuse of the name r1.
Can't happen in the DLX 5-stage pipeline because:
- all instructions take 5 stages,
- reads are always in stage 2, and
- writes are always in stage 5.
13

Three Generic Data Hazards

Write After Write (WAW): InstrJ writes an operand before InstrI writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "output dependence" by compiler writers; this also results from reuse of the name r1.
Can't happen in the DLX 5-stage pipeline because:
- all instructions take 5 stages, and
- writes are always in stage 5.
We will see WAR and WAW in more complicated pipes.
14
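The three hazard kinds above can be summarized in a small checker. A sketch; the (dest, sources) tuple encoding of an instruction is hypothetical.

```python
# Sketch: classify RAW/WAR/WAW hazards between two instructions, each
# given as (dest_reg, [source_regs]). The encoding is illustrative.
def hazards(instr_i, instr_j):
    """Hazards created when instr_j follows instr_i in program order."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")      # j reads what i writes
    if dest_j in srcs_i:
        found.add("WAR")      # j writes what i reads
    if dest_i == dest_j:
        found.add("WAW")      # both write the same register
    return found

# I: add r1,r2,r3   J: sub r4,r1,r3  ->  RAW on r1
print(hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))  # {'RAW'}
```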

Forwarding to Avoid Data Hazard
Figure 3.10, Page 149

Time (clock cycles):

add r1,r2,r3      Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r3              Ifetch  Reg     ALU     DMem    Reg
and r6,r1,r7                      Ifetch  Reg     ALU     DMem    Reg
or  r8,r1,r9                              Ifetch  Reg     ALU     DMem    Reg
xor r10,r1,r11                                    Ifetch  Reg     ALU     DMem    Reg

[Figure shows forwarding paths from add's ALU output into the ALU inputs of the following instructions.]

15

HW Change for Forwarding
Figure 3.20, Page 161

[Figure: forwarding hardware. Muxes in front of the ALU inputs choose among the ID/EX register values, the EX/MEM ALU result, and the MEM/WB write-back value; NextPC logic, immediate path, register file, and data memory as before.]

What circuit detects and resolves this hazard?

16
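One plausible answer to the question above is a forwarding unit that compares the destination fields in the EX/MEM and MEM/WB latches against each ALU source register in ID/EX. A sketch under that assumption; the field names mimic the latch names in the figure, but the function itself is mine.

```python
# Sketch of the forwarding-unit decision for one ALU source operand.
# Field names mimic the pipeline latches; the details are illustrative.
def forward_select(src_reg, exmem_rd, exmem_regwrite, memwb_rd, memwb_regwrite):
    """Return which pipeline latch feeds the ALU input for src_reg."""
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == src_reg:
        return "EX/MEM"       # forward the just-computed ALU result
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == src_reg:
        return "MEM/WB"       # forward the value being written back
    return "ID/EX"            # no hazard: use the register-file value

# sub r4,r1,r3 right after add r1,...: r1 comes from EX/MEM
print(forward_select(1, exmem_rd=1, exmem_regwrite=True,
                     memwb_rd=0, memwb_regwrite=False))   # EX/MEM
```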

Data Hazard Even with Forwarding
Figure 3.12, Page 153

Time (clock cycles):

lw  r1, 0(r2)     Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r6              Ifetch  Reg     ALU     DMem    Reg
and r6,r1,r7                      Ifetch  Reg     ALU     DMem    Reg
or  r8,r1,r9                              Ifetch  Reg     ALU     DMem    Reg

lw's data is not available until the end of MEM, but sub needs it at the start of EX, so forwarding alone cannot fix this.

17

Data Hazard Even with Forwarding
Figure 3.13, Page 154

Time (clock cycles):

lw  r1, 0(r2)     Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r6              Ifetch  Reg     Bubble  ALU     DMem    Reg
and r6,r1,r7                      Ifetch  Bubble  Reg     ALU     DMem    Reg
or  r8,r1,r9                              Bubble  Ifetch  Reg     ALU     DMem

How is this detected?

18
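A common answer to "How is this detected?" is a MIPS-style hazard detection unit that compares the load's destination in ID/EX against the source registers of the instruction in IF/ID. A sketch under that assumption; the names are illustrative.

```python
# Sketch of the load-use hazard check: stall when the instruction in EX
# is a load whose destination matches a source of the instruction in ID.
def must_stall(idex_memread, idex_rt, ifid_rs, ifid_rt):
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)

# lw r1,0(r2) immediately followed by sub r4,r1,r6 -> stall one cycle
print(must_stall(True, 1, 1, 6))    # True
```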

Software Scheduling to Avoid Load Hazards

Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:            Fast code:
  LW  Rb,b              LW  Rb,b
  LW  Rc,c              LW  Rc,c
  ADD Ra,Rb,Rc          LW  Re,e
  SW  a,Ra              ADD Ra,Rb,Rc
  LW  Re,e              LW  Rf,f
  LW  Rf,f              SW  a,Ra
  SUB Rd,Re,Rf          SUB Rd,Re,Rf
  SW  d,Rd              SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.


19
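The benefit of the reordering above can be counted with a simple rule: one stall whenever an instruction consumes a load's result in the very next slot. A sketch; the tuple encoding is mine.

```python
# Sketch: count load-use stalls in the two schedules above, assuming a
# one-cycle stall when a load's result is used in the very next slot.
# Instructions are (op, dest, sources) tuples (an illustrative encoding).
def count_stalls(code):
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("LW","Rb",[]), ("LW","Rc",[]), ("ADD","Ra",["Rb","Rc"]),
        ("SW","a",["Ra"]), ("LW","Re",[]), ("LW","Rf",[]),
        ("SUB","Rd",["Re","Rf"]), ("SW","d",["Rd"])]
fast = [("LW","Rb",[]), ("LW","Rc",[]), ("LW","Re",[]),
        ("ADD","Ra",["Rb","Rc"]), ("LW","Rf",[]), ("SW","a",["Ra"]),
        ("SUB","Rd",["Re","Rf"]), ("SW","d",["Rd"])]
print(count_stalls(slow), count_stalls(fast))   # 2 0
```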

Control Hazards

Branch instructions can cause great performance loss. A branch instruction needs two things:
- The branch outcome: taken or not taken
- The branch target:
    PC + 4                   if branch NOT taken
    PC + 4 + immediate*4     if branch taken

The branch instruction is not detected until the ID stage, at which point a new instruction has already been fetched.

For our original pipeline:

Branch Delay CC1

Consider the pipelined execution of: beq $1, $3, 100
During the first cycle, beq is fetched in the IF stage (PC = 1000).

[Figure: MIPS-style pipelined datapath. PC, instruction memory, IF/ID, register file, sign extend, branch-target adder, ALU with Zero output, main control, and the PCSrc mux selecting between PC + 4 and the branch target.]

Branch Delay CC2

During the second cycle, beq is decoded in the ID stage: $1 and $3 are read from the register file and the immediate 100 is sign-extended. The next_1 instruction is fetched in the IF stage (PC = 1004).

Branch Delay CC3

During the third cycle, beq is executed in the EX stage: the ALU compares the two register values (both 1234 in the figure). The next_2 instruction is fetched in the IF stage (PC = 1008).

Branch Delay CC4

During the fourth cycle, beq reaches the MEM stage: Zero = 1 and Beq = 1, so PCSrc selects the branch target 1404 (= 1004 + 100*4). The next_3 instruction is fetched in the IF stage (PC = 1012).

Branch Delay CC5

During the fifth cycle, the branch_target instruction is fetched (PC = 1404). Next_1 through next_3 should be converted into NOPs.

3-Cycle Branch Delay

Next_1 through Next_3 will be fetched anyway. The pipeline should flush Next_1 through Next_3 if the branch is taken; otherwise, they can be executed normally.

                   cc1  cc2  cc3  cc4  cc5  cc6  cc7
beq $1,$3,100      IM   Reg  ALU  DM   Reg
Next_1 // bubble        IM   Bub  Bub  Bub  Bub
Next_2 // bubble             IM   Bub  Bub  Bub
Next_3 // bubble                  IM   Bub  Bub
Branch_Target                          IM   Reg  ALU

Reducing the Delay of Branches

Branch delay can be reduced from 3 cycles to just 1 cycle by moving the branch decision from the 4th into the 2nd pipeline stage:
- Branches can be determined earlier, in the ID stage
- The branch address calculation adder is moved to the ID stage
- A comparator in the ID stage compares the two fetched registers to determine the branch decision (taken or not taken)
Only one instruction that follows the branch will have been fetched before the decision is known.

Reducing the Delay of Branches

[Figure: pipelined datapath with the branch moved to ID. IF.Flush, a hazard detection unit, the shift-left-2 and target adder plus a register comparator (=) in the ID stage, the forwarding unit, and control fields carried in ID/EX, EX/MEM, and MEM/WB.]

Branch Hazard Alternatives

Stall: always stall the pipeline until the branch direction is known
- The next instruction is always flushed (turned into a NOP)

Predict Branch Not Taken:
- Fetch the successor instruction: PC + 4 is already calculated
- Almost half of MIPS branches are not taken on average
- Flush instructions in the pipeline only if the branch is actually taken

Predict Branch Taken:
- Can predict backward branches (e.g., loop branches) as taken

Delayed Branch

Define the branch to take place after the next instruction. For a 1-cycle branch delay, we have one delay slot:

    branch instruction
    branch delay slot    (next instruction)
    ...
    branch target        (if branch taken)

    branch instruction (taken)   IF  ID  EX  MEM  WB
    branch delay slot                IF  ID  EX   MEM  WB
    branch target                        IF  ID   EX   MEM  WB

The compiler/assembler fills the branch delay slot by selecting a useful instruction.

Scheduling the Branch Delay Slot

From Before: move an independent instruction from before the branch into the slot
    add $t2,$t3,$t4            beq $s1, $s0, ...
    beq $s1, $s0, ...    =>    add $t2,$t3,$t4    <- delay slot

From Target: copy the instruction at the branch target into the slot (useful when the branch is predicted taken)
    beq $s1, $s0, ...    =>    beq $s1, $s0, ...
    ...                        sub $t4,$t5,$t6    <- delay slot
    sub $t4,$t5,$t6 (target)

From Fall Through: move the fall-through instruction into the slot (useful when the branch is predicted not taken)
    beq $s1, $s0, ...    =>    beq $s1, $s0, ...
    sub $t4,$t5,$t6            sub $t4,$t5,$t6    <- delay slot

More on Delayed Branch

Scheduling the delay slot with an independent instruction is the best choice; however, it is not always possible to find one.

The target instruction is useful when the branch is predicted taken:
- Such as in a loop branch
- May need to duplicate the instruction if it can be reached by another path
- Cancel the branch-delay instruction if the branch is not taken

The fall-through instruction is useful when the branch is predicted not taken.

Zero-Delayed Branch

How can we achieve zero delay for a taken branch, if the branch target address is computed in the ID stage?

Solution:
- Check the PC to see if the instruction being fetched is a branch
- Store the branch target address in a table in the IF stage; such a table is called the branch target buffer
- If the branch is predicted taken, fetch the next instruction from the stored target address

Branch Target and Prediction Buffer

The branch target buffer is implemented as a small cache that stores the branch target address of taken branches.

We also have a branch prediction buffer, to store the prediction bits for branch instructions. The prediction bits are dynamically determined by the hardware.

[Figure: the PC looks up the Branch Target Buffer (PC of branch -> target address) and the Prediction Buffer; a mux selects between PC + 4 and the predicted target.]

Dynamic Branch Prediction

Prediction of branches at runtime using prediction bits:
- One or a few prediction bits are associated with each branch instruction

The branch prediction buffer is a small memory:
- Indexed by the lower portion of the address of the branch instruction

The simplest scheme is to have 1 prediction bit per branch. We don't know whether the prediction is correct or not. If the prediction is correct, execution continues normally with no wasted cycles.

2-bit Prediction Scheme

- A prediction is just a hint that is assumed to be correct; if incorrect, the fetched instructions are flushed
- The 1-bit scheme has a performance shortcoming: a loop branch is almost always taken, except for the last iteration, and the 1-bit scheme predicts incorrectly twice rather than once (on the first and last loop iterations)
- 2-bit prediction schemes are often used: a prediction must be wrong twice before it is changed
- With 2 bits, a loop branch is mispredicted only once, on the last iteration

[State diagram: Predict Taken (strong) <-> Predict Taken (weak) <-> Predict Not Taken (weak) <-> Predict Not Taken (strong); taken outcomes move toward the Taken states, not-taken outcomes toward the Not Taken states.]
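The 2-bit saturating counter above can be simulated directly. A sketch; the state encoding 0-3 is mine (0/1 predict not taken, 2/3 predict taken).

```python
# Sketch of the 2-bit saturating-counter predictor described above.
# States 0,1 predict not-taken; states 2,3 predict taken.
def simulate(outcomes, state=3):
    mispredicts = 0
    for taken in outcomes:
        if (state >= 2) != taken:          # prediction disagrees with outcome
            mispredicts += 1
        # saturate toward 3 on taken, toward 0 on not taken
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

# A loop branch: taken 9 times, not taken on the last iteration.
print(simulate([True] * 9 + [False]))   # 1
```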

Control Hazard on Branches: Three-Stage Stall

10: beq r1,r3,36    Ifetch  Reg     ALU     DMem    Reg
14: and r2,r3,r5            Ifetch  Reg     ALU     DMem    Reg
18: or  r6,r1,r7                    Ifetch  Reg     ALU     DMem    Reg
22: add r8,r1,r9                            Ifetch  Reg     ALU     DMem    Reg
36: xor r10,r1,r11                                  Ifetch  Reg     ALU     DMem    Reg

What do you do with the 3 instructions in between? How do you do it? Where is the commit?

37

Branch Stall Impact

If CPI = 1 and 30% of instructions are branches that stall 3 cycles, then the new CPI = 1 + 0.3 x 3 = 1.9!

Two-part solution:
- Determine branch taken or not sooner, AND
- Compute the taken-branch address earlier

A DLX branch tests whether a register = 0 or != 0. DLX solution:
- Move the zero test to the ID/RF stage
- Add an adder to calculate the new PC in the ID/RF stage
- Result: a 1 clock cycle penalty for branches versus 3

38

Pipelined DLX Datapath
Figure 3.22, Page 163

Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back

[Figure: revised pipelined DLX datapath. The branch-target adder and the Zero? test are moved into the ID stage, so both the next sequential PC and the branch target are available one cycle after fetch.]

Interplay of instruction set design and cycle time.

39

Four Branch Hazard Alternatives

#1: Stall until the branch direction is clear

#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- Squash instructions in the pipeline if the branch is actually taken
- Advantage of late pipeline state update
- 47% of DLX branches are not taken on average
- PC + 4 is already calculated, so use it to get the next instruction

#3: Predict Branch Taken
- 53% of DLX branches are taken on average
- But the branch target address is not yet calculated in DLX, so DLX still incurs a 1-cycle branch penalty
- Other machines: branch target known before outcome

40

Four Branch Hazard Alternatives

#4: Delayed Branch
- Define the branch to take place AFTER a following instruction:

    branch instruction
    sequential successor_1
    sequential successor_2
    ........
    sequential successor_n     <- branch delay of length n
    branch target if taken

- A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline; DLX uses this

41

Delayed Branch

Where to get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From fall through: only valuable when the branch is not taken
- Canceling branches allow more slots to be filled

Compiler effectiveness for a single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots are useful in computation
- So about 50% (60% x 80%) of slots are usefully filled

Delayed branch downside: it scales poorly to 7-8 stage pipelines and to multiple instructions issued per clock (superscalar).

42

Evaluating Branch Alternatives

Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Scheduling scheme    Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
Stall pipeline       3                1.42   3.5                       1.0
Predict taken        1                1.14   4.4                       1.26
Predict not taken    1                1.09   4.5                       1.29
Delayed branch       0.5              1.07   4.6                       1.31

Assumptions: conditional & unconditional branches = 14% of instructions; 65% of them change the PC.


43
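The CPI column above follows from CPI = 1 + branch frequency x effective penalty, using the 14% branch frequency from the slide (for predict-not-taken, the 1-cycle penalty is charged only to the 65% of branches that change the PC). A sketch reproducing the table's CPI values.

```python
# Sketch: recompute the CPI column of the table above.
freq = 0.14   # branch frequency from the slide
effective_penalty = {"Stall pipeline": 3, "Predict taken": 1,
                     "Predict not taken": 0.65, "Delayed branch": 0.5}
cpi = {s: round(1 + freq * p, 2) for s, p in effective_penalty.items()}
print(cpi)
# {'Stall pipeline': 1.42, 'Predict taken': 1.14,
#  'Predict not taken': 1.09, 'Delayed branch': 1.07}
```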
