Lecture 37

CPS 104 Computer Organization and Programming
Lecture 37: Pipelined Processor
April 16, 2004
Gershon Kedem
http://kedem.duke.edu/cps104/Lectures
CPS104 Lec37.1
GK Spring 2004
Admin.
Homework 8 is posted. Due: Monday, April 19.

Review: A More Extensive Pipelining Example

Clock                     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
0: Load                   Ifetch   Reg/Dec  Exec     Mem      WrB
4: R-type                          Ifetch   Reg/Dec  Exec     Mem      WrB
8: Store                                    Ifetch   Reg/Dec  Exec     Mem      WrB
12: Beq (target is 1000)                             Ifetch   Reg/Dec  Exec     Mem      WrB

End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
End of Cycle 5: Load's WrB, R-type's Mem, Store's Exec, Beq's Reg
End of Cycle 6: R-type's WrB, Store's Mem, Beq's Exec
End of Cycle 7: Store's WrB, Beq's Mem

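
The end-of-cycle snapshots in the review example can be reproduced with a few lines of Python. This is only a sketch under the assumption that one instruction is fetched per cycle and nothing stalls; the names `STAGES`, `PROGRAM`, `stage_in_cycle` and `snapshot` are made up for this illustration.

```python
# Stage sequence of the 5-stage pipeline used in the slides.
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "WrB"]

# (address, kind) in program order; the load is fetched in Cycle 1 and
# one further instruction is fetched in each following cycle.
PROGRAM = [(0, "Load"), (4, "R-type"), (8, "Store"), (12, "Beq")]


def stage_in_cycle(issue_index, cycle):
    """Stage occupied in `cycle` (1-based) by the instruction fetched
    `issue_index` cycles after the first one, or None if it is not in the pipe."""
    k = cycle - 1 - issue_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def snapshot(cycle):
    """Map each instruction that is in flight in `cycle` to its stage."""
    result = {}
    for i, (addr, kind) in enumerate(PROGRAM):
        stage = stage_in_cycle(i, cycle)
        if stage is not None:
            result[f"{addr}: {kind}"] = stage
    return result


for c in range(4, 8):
    print(f"End of Cycle {c}:", snapshot(c))
```

The Cycle 4 line shows the load in Mem, the R-type in Exec, the store in Reg/Dec and the beq in Ifetch, matching the list above.
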
Review: Pipelining Example: End of Cycle 4

[Figure: pipelined datapath (IUnit, RFile, Exec Unit, Data Mem) at the end of Cycle 4]

Stage occupancy:
  0: Load's Mem
  4: R-type's Exec
  8: Store's Reg
  12: Beq's Ifetch

Pipeline registers:
  IF/ID: Beq instruction
  ID/Ex: Store's busA & busB
  Ex/Mem: R-type's result
  Mem/Wr: Load's Dout

Control signals: RegWr=0, ALUOp=R-type, ExtOp=x, Branch=0, RegDst=1, ALUSrc=0, MemWr=0, MemtoReg=x
PC = 16

Review: Pipelining Example: End of Cycle 5

[Figure: pipelined datapath (IUnit, RFile, Exec Unit, Data Mem) at the end of Cycle 5]

Stage occupancy:
  0: Load's Wr
  4: R-type's Mem
  8: Store's Exec
  12: Beq's Reg
  16: R-type's Ifetch

Pipeline registers:
  IF/ID: Instruction @ 16
  ID/Ex: Beq's busA & busB
  Ex/Mem: Store's address
  Mem/Wr: R-type's result

Control signals: RegWr=1, ALUOp=Add, ExtOp=1, Branch=0, RegDst=x, ALUSrc=1, MemWr=0, MemtoReg=1
PC = 20

Review: Pipelining Example: End of Cycle 6

[Figure: pipelined datapath (IUnit, RFile, Exec Unit, Data Mem) at the end of Cycle 6]

Stage occupancy:
  4: R-type's Wr
  8: Store's Mem
  12: Beq's Exec
  16: R-type's Reg
  20: R-type's Ifetch

Pipeline registers:
  ID/Ex: R-type's busA & busB
  Ex/Mem: Beq's results
  Mem/Wr: Nothing for Store

Control signals: RegWr=1, ALUOp=Sub, ExtOp=1, Branch=0, RegDst=x, ALUSrc=0, MemWr=1, MemtoReg=0
PC = 24

Review: Pipelining Example: End of Cycle 7

[Figure: pipelined datapath (IUnit, RFile, Exec Unit, Data Mem) at the end of Cycle 7]

Stage occupancy:
  8: Store's Wr
  12: Beq's Mem
  16: R-type's Exec
  20: R-type's Reg
  24: R-type's Ifetch

Pipeline registers:
  ID/Ex: R-type's busA & busB
  Ex/Mem: R-type's results
  Mem/Wr: Nothing for Beq

Control signals: RegWr=0, ALUOp=R-type, ExtOp=x, Branch=1, RegDst=1, ALUSrc=0, MemWr=0, MemtoReg=x
PC = 1000

Review: Data Hazards

So far we have ignored dependencies between instructions, but a real machine must deal with them. Example:

    sub  $2, $1, $3
    and  $12, $2, $5      # $12 depends on the result in $2
    or   $13, $6, $2      # but $2 is updated 3 clock
    add  $14, $2, $2      # cycles later. We have a problem!!
    sw   $15, 100($2)

Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
0: sub    Ifetch   Reg/Dec  Exec     Mem      WrB
4: and             Ifetch   Reg/Dec  Exec     Mem      WrB
8: or                       Ifetch   Reg/Dec  Exec     Mem      WrB
12: add                              Ifetch   Reg/Dec  Exec     Mem      WrB
16: sw                                        Ifetch   Reg/Dec  Exec     Mem      WrB

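
The dependencies in this example can be found mechanically. The sketch below lists read-after-write (RAW) pairs that fall inside the hazard window. It assumes, following the slides, that a register written at the end of WrB can first be read in the next cycle, so the three instructions after the producer are affected; the tuple layout and the name `raw_hazards` are illustrative only.

```python
# Each entry: (text, destination register or None, source registers).
PROGRAM = [
    ("sub $2, $1, $3",   "$2",  ["$1", "$3"]),
    ("and $12, $2, $5",  "$12", ["$2", "$5"]),
    ("or  $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2",  "$14", ["$2"]),
    ("sw  $15, 100($2)", None,  ["$15", "$2"]),
]


def raw_hazards(program, window=3):
    """Return (producer index, consumer index, register) for every consumer
    that reads a register within `window` instructions of its producer."""
    hazards = []
    for j, (_, dest, _) in enumerate(program):
        if dest is None:
            continue
        last = min(j + window, len(program) - 1)
        for i in range(j + 1, last + 1):
            if dest in program[i][2]:
                hazards.append((j, i, dest))
    return hazards


for producer, consumer, reg in raw_hazards(PROGRAM):
    print(f"{PROGRAM[consumer][0]!r} reads {reg} too early after {PROGRAM[producer][0]!r}")
```

Under this assumption `and`, `or` and `add` all read a stale `$2`, while `sw` is far enough behind `sub` to see the new value.
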
Data Hazard Solution: Register Forwarding

[Figure: forwarding datapath. Labeled elements: Registers, ALU, Data Memory, Forwarding Unit; pipeline registers ID/EX, EX/MEM, MEM/WB; select signals Forward A and Forward B; register fields Rs, Rt, Rd]

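
The slide shows only the forwarding datapath, not the select logic. A minimal sketch of one common textbook formulation follows; it is an assumption, not taken from the slide. Each ALU operand is taken from EX/MEM if the instruction there writes the needed register, else from MEM/WB, else from the register file. The function and parameter names are hypothetical.

```python
def forward_select(id_ex_rs, id_ex_rt,
                   ex_mem_rd, ex_mem_regwrite,
                   mem_wb_rd, mem_wb_regwrite):
    """Return the sources chosen for (Forward A, Forward B).

    Register numbers are plain integers; register 0 is never forwarded
    because $0 always reads as zero on MIPS.
    """
    def pick(src):
        # The newest result (EX/MEM) takes priority over the older one (MEM/WB).
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
            return "EX/MEM"
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
            return "MEM/WB"
        return "REG"

    return pick(id_ex_rs), pick(id_ex_rt)


# "and $12, $2, $5" in Exec while "sub $2, $1, $3" sits in EX/MEM:
print(forward_select(2, 5, ex_mem_rd=2, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False))
```
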
The Delay Branch Phenomenon

Clk                       Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9  Cycle 10 Cycle 11 Cycle 12
12: Beq (target is 1000)  Ifetch   Reg/Dec  Exec     Mem      Wr
16: R-type                         Ifetch   Reg/Dec  Exec     Mem      Wr
20: R-type                                  Ifetch   Reg/Dec  Exec     Mem      Wr
24: R-type                                           Ifetch   Reg/Dec  Exec     Mem      Wr
1000: Target of Br                                            Ifetch   Reg/Dec  Exec     Mem      Wr

Although beq is fetched during Cycle 4:
- The target address is NOT written into the PC until the end of Cycle 7
- The branch's target is NOT fetched until Cycle 8
- There is a 3-instruction delay before the branch takes effect

This is referred to as a Branch Hazard:
- Clever design techniques can reduce the delay to ONE instruction

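
The cycle counts on this slide follow from where in the pipeline the PC gets updated. A small sketch of that arithmetic, assuming the PC is written at the end of the stage that resolves the branch and the target is fetched in the following cycle; `branch_delay` is a made-up helper name.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]


def branch_delay(fetch_cycle, resolve_stage):
    """Return (cycle in which the PC is written, cycle in which the target
    is fetched, number of instructions fetched in between)."""
    pc_written = fetch_cycle + STAGES.index(resolve_stage)
    target_fetch = pc_written + 1
    delay = target_fetch - fetch_cycle - 1
    return pc_written, target_fetch, delay


print(branch_delay(4, "Mem"))      # PC written at end of Cycle 7, as on the slide
print(branch_delay(4, "Reg/Dec"))  # resolving the branch during decode instead
```

With the branch resolved in Mem the delay is three instructions; resolving it during Reg/Dec leaves a delay of one.
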
Reducing Branch delays (cont.)

- The design is optimized for branch not taken (no pipeline delay).
- If the branch is taken, the next instruction is converted to a NOOP by the control (pipeline bubble <=> one stage pipeline delay).
- The MIPS architecture defines a delayed branch slot to reduce this potential delay (see a later slide).

The Delay Load Phenomenon

Clock     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
I0: Load  Ifetch   Reg/Dec  Exec     Mem      Wr
Plus 1             Ifetch   Reg/Dec  Exec     Mem      Wr
Plus 2                      Ifetch   Reg/Dec  Exec     Mem      Wr
Plus 3                               Ifetch   Reg/Dec  Exec     Mem      Wr
Plus 4                                        Ifetch   Reg/Dec  Exec     Mem      Wr

Although the load is fetched during Cycle 1:
- The data is NOT written into the Reg File until the end of Cycle 5
- We cannot read this value from the Reg File until Cycle 6
- There is a 3-instruction delay before the load takes effect

This is referred to as a Data Hazard:
- Register forwarding reduces the load delay to ONE instruction
- It is not possible to entirely eliminate the load delay.

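
The "3 instructions without forwarding, 1 with forwarding" figures can be checked with a little cycle arithmetic. The sketch assumes the load is fetched in Cycle 1, the loaded value exists at the end of Mem (Cycle 4) and is in the register file at the end of Wr (Cycle 5), and a consumer needs it either in Reg/Dec (no forwarding) or in Exec (forwarding). `load_delay_slots` is a hypothetical name.

```python
def load_delay_slots(forwarding):
    """Number of instructions after a load that cannot yet see its result."""
    load_fetch = 1
    if forwarding:
        data_ready_end_of = load_fetch + 3  # end of Mem, Cycle 4
        needed_offset = 2                   # consumer needs the value in Exec
    else:
        data_ready_end_of = load_fetch + 4  # end of Wr, Cycle 5
        needed_offset = 1                   # consumer reads it in Reg/Dec

    # A consumer fetched d cycles after the load needs the value in cycle
    # load_fetch + d + needed_offset, which must come after data_ready_end_of.
    d = 1
    while load_fetch + d + needed_offset <= data_ready_end_of:
        d += 1
    return d - 1


print(load_delay_slots(forwarding=False))
print(load_delay_slots(forwarding=True))
```
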
Delayed Load and Branch on a Real MIPS Processor

The effect of load in a real MIPS Processor is delayed:

    lw   $1, 100($2)    // Load Register R1
    add  $3, $1, $0     // Move old R1 into R3
    add  $4, $1, $0     // Move new R1 into R4

The effect of load on a normal processor is NOT delayed:

    lw   $1, 100($2)    // Load Register R1
    add  $3, $1, $0     // Move new R1 into R3

The effect of branch and jump in a real MIPS Processor is delayed:

    Instruction Address: 0x00      j    1000
    Instruction Address: 0x04      add  $1, $2, $3
    Instruction Address: 0x1000    sub  $1, $2, $3

Branch and jump in a Normal processor are NOT delayed:

    Instruction Address: 0x00      j    1000
    Instruction Address: 0x1000    sub  $1, $2, $3

Branch Delays

[Figure: Control & Hazards. Datapath between the IF/ID and ID/EX registers. Labeled elements: PC, Instruction Memory, Registers, sign Extend, <<2, an adder (+), a "Bus A = Bus B" comparison; fields Rs, Rt, Rd, Imm]

Example:

        sub  $10, $4, $8
        beq  $10, $3, go
        add  $12, $2, $5
        . . .
    go: lw   $4, 16($12)

Load Data Forwarding

[Figure: forwarding datapath for load data. Labeled elements: Registers, Data Memory, Forwarding Unit; pipeline registers ID/EX, EX/MEM, MEM/WB; select signals Forward A and Forward B; register fields Rs, Rt, Rd]

Dealing with the Load Data Hazard

There are two ways to deal with the load data hazard:
- Insert a NOOP bubble into the data path.
- Use delayed load semantics (see the slide "Delayed Load and Branch on a Real MIPS Processor").

[Figure: pipelined datapath (IF_Unit, RFile, Exec Unit, Data Mem) with the IF/ID, ID/Ex, Ex/Mem and Mem/Wr registers and the control signals RegWr, ExtOp, ALUOp, Branch, RegDst, ALUSrc, MemWr, MemtoReg. Added stall signals: Stall0, Stall1, Stall2. Annotations: "Insert NOOP Here" and "How?"]

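
The slide leaves open how the stall is detected ("How?"). One common textbook condition, given here as an assumption rather than as this design's logic: stall when the instruction in ID/Ex is a load whose destination (Rt) is a source of the instruction waiting in IF/ID. The name `must_stall` and its parameters are hypothetical.

```python
def must_stall(id_ex_is_load, id_ex_rt, if_id_rs, if_id_rt):
    """True if a NOOP bubble has to be inserted behind the load.

    Register numbers are integers; $0 never causes a stall because it
    always reads as zero.
    """
    return bool(id_ex_is_load
                and id_ex_rt != 0
                and id_ex_rt in (if_id_rs, if_id_rt))


# lw $1, 100($2) followed directly by add $3, $1, $0:
print(must_stall(True, 1, if_id_rs=1, if_id_rt=0))
# The same add with one unrelated instruction in between (no load in ID/Ex):
print(must_stall(False, 1, if_id_rs=1, if_id_rt=0))
```

When the condition holds, the control would hold the PC and IF/ID for one cycle and zero the control signals entering ID/Ex, which is what creates the bubble.
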
Software Scheduling to Avoid Load Hazards

Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:                Fast code:
    LW   Rb,b                 LW   Rb,b
    LW   Rc,c                 LW   Rc,c
    ADD  Ra,Rb,Rc             LW   Re,e
    SW   a,Ra                 ADD  Ra,Rb,Rc
    LW   Re,e                 LW   Rf,f
    LW   Rf,f                 SW   a,Ra
    SUB  Rd,Re,Rf             SUB  Rd,Re,Rf
    SW   d,Rd                 SW   d,Rd

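
To see why the second ordering is faster, count the load-use pairs in each sequence. The sketch assumes a one-instruction load delay (forwarding in place), so a stall occurs only when an instruction uses a register loaded by the instruction immediately before it. The tuple encoding and `load_use_stalls` are illustrative.

```python
# (opcode, destination register or None, source registers)
SLOW = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []), ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", None, ["Rd"]),
]
FAST = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]


def load_use_stalls(code):
    """Count instructions that use the result of the load just before them."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls


print("slow:", load_use_stalls(SLOW), "stalls")
print("fast:", load_use_stalls(FAST), "stalls")
```

The slow version stalls at the ADD (it needs Rc) and at the SUB (it needs Rf); in the fast version an independent instruction sits between each load and its first use.
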
Compiler Avoiding Load Stalls

% loads stalling pipeline:

             scheduled   unscheduled
    tex         25%          65%
    spice       14%          42%
    gcc         31%          54%

Pipelining Complications

Interrupts (Exceptions)
- 5 instructions executing in 5 stage pipeline
- How to stop the pipeline?
- How to restart the pipeline?
- Who caused the interrupt?

Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

Pipelining Complications

Simultaneous exceptions in > 1 pipeline stage
- Load with data page fault in MEM stage
- Add with instruction page fault in IF stage

Solution #1
- Interrupt status vector per instruction
- Defer check until last stage, kill state update if exception

Solution #2
- Interrupt ASAP
- Restart everything that is incomplete

Exception in branch delay slot
- SW needs two PCs

Another advantage for state update late in pipeline!

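
Solution #1 can be illustrated with the load/add pair from this slide. In the sketch below each instruction carries its own record of faults (a stand-in for the status vector), and the check happens only when the instruction reaches the last stage, so the exception that is taken belongs to the earliest instruction in program order even if a later instruction faulted earlier in time. All names are hypothetical, and one instruction is assumed to be fetched per cycle.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def detection_order(program):
    """Faults sorted by the cycle in which the hardware first notices them."""
    events = []
    for i, (name, faults) in enumerate(program):
        for stage, exc in faults.items():
            cycle = i + STAGES.index(stage) + 1
            events.append((cycle, name, exc))
    return sorted(events)


def exception_taken(program):
    """Check status vectors in the last stage, i.e. in program order."""
    for name, faults in program:
        if faults:
            first_stage = min(faults, key=STAGES.index)
            return name, faults[first_stage]
    return None


PROGRAM = [
    ("lw",  {"MEM": "data page fault"}),
    ("add", {"IF": "instruction page fault"}),
]

print(detection_order(PROGRAM))  # the add's fault is noticed first, in cycle 2
print(exception_taken(PROGRAM))  # but the lw's fault is the one that is taken
```
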
Pipeline Complications

Complex Addressing Modes and Instructions

Address modes: Autoincrement causes register change during instruction execution
- Interrupts? Need to restore register state
- Adds WAR and WAW hazards since writes are no longer in the last stage

Memory-Memory Move Instructions
- Must be able to handle multiple page faults
- Long-lived instructions: partial state save on interrupt

Condition Codes

Pipeline Complications: Floating Point

[Figure: pipeline with several execution paths between IF, ID/RF and WB: EX followed by MEM; stages M1 through M7; stages A1 through A4; and an FP/INT Divide Unit, not pipelined, 25 clocks]

Pipelining Complications

Floating Point: long execution time (interrupts, WAW, WAR)
Also, may pipeline the FP execution units so that they can initiate new instructions without waiting the full latency

FP Instruction    Latency   Initiation Rate   (MIPS R4000)
Add, Subtract        4             3
Multiply             8             4
Divide              36            35
Square root        112           111
Negate               2             1
Absolute value       2             1
FP compare           3             2

Latency: cycles before use result
Initiation Rate: cycles before issue instr of same type

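
The two columns answer different questions: the latency bounds how soon a dependent instruction can use a result, and the initiation rate bounds how often independent operations of the same type can start. A rough sketch, assuming `n` independent operations of one type issued back to back and reading the table literally; `cycles_for_independent_ops` is a made-up helper.

```python
# (latency, initiation rate) per FP operation, copied from the table above.
FP_OPS = {
    "add": (4, 3),
    "multiply": (8, 4),
    "divide": (36, 35),
    "sqrt": (112, 111),
    "negate": (2, 1),
    "abs": (2, 1),
    "compare": (3, 2),
}


def cycles_for_independent_ops(op, n):
    """Cycles from issuing the first of n independent `op`s until the last
    result can be used: (n - 1) initiation intervals plus one latency."""
    latency, initiation = FP_OPS[op]
    return (n - 1) * initiation + latency


print(cycles_for_independent_ops("multiply", 4))
print(cycles_for_independent_ops("divide", 2))
```

Four multiplies overlap heavily (20 cycles rather than 4 x 8 = 32), while two divides barely overlap at all (71 cycles rather than 72).
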
Summary of Pipelining Basics

Hazards limit performance
- Structural: need more HW resources
- Data: need forwarding, compiler scheduling
- Control: early evaluation & PC, delayed branch, prediction

Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency

Compilers reduce cost of data and control hazards
- Load delay slots
- Branch delay slots
- Branch prediction

Interrupts, Instruction Set, FP make pipelining harder

Q: How would you handle context switches?

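
The point that pipelining helps bandwidth rather than latency can be made concrete with a cycle count. The sketch compares a k-stage pipeline with no stalls against a machine that runs each instruction through all k steps before starting the next, both at the same clock; these idealizations and the function names are assumptions made for illustration.

```python
def unpipelined_cycles(n, k):
    """n instructions, each taking k cycles, executed one after another."""
    return n * k


def pipelined_cycles(n, k):
    """k cycles to fill the pipe, then one instruction completes per cycle."""
    return k + (n - 1)


n, k = 1000, 5
print("latency of one instruction:", pipelined_cycles(1, k), "cycles either way")
print("1000 instructions:", unpipelined_cycles(n, k), "vs", pipelined_cycles(n, k), "cycles")
print("speedup:", round(unpipelined_cycles(n, k) / pipelined_cycles(n, k), 2))
```

Each instruction still needs five cycles from fetch to write-back, but the throughput approaches one instruction per cycle, so the speedup approaches the pipeline depth.
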
Case Study: MIPS R4000 (100 MHz to 200 MHz)

8 Stage Pipeline:
- IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
- IS: second half of access to instruction cache.
- RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
- EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF: data fetch, first half of access to data cache.
- DS: second half of access to data cache.
- TC: tag check, determine whether the data cache access hit.
- WB: write back for loads and register-register operations.

8 Stages: What is impact on Load delay? Branch delay? Why?

Case Study: MIPS R4000

TWO Cycle Load Latency (successive instructions, one per row)

    IF  IS  RF  EX  DF  DS  TC  WB
        IF  IS  RF  EX  DF  DS  TC
            IF  IS  RF  EX  DF  DS
                IF  IS  RF  EX  DF
                    IF  IS  RF  EX
                        IF  IS  RF
                            IF  IS
                                IF

THREE Cycle Branch Latency (conditions evaluated during EX phase)

    IF  IS  RF  EX  DF  DS
        IF  IS  RF  EX  DF
            IF  IS  RF  EX
                IF  IS  RF
                    IF  IS
                        IF

- Delay slot plus two stalls
- Branch likely cancels delay slot if not taken

MIPS R4000 Floating Point

- FP Adder, FP Multiplier, FP Divider
- Last step of FP Multiplier/Divider uses FP Adder HW
- 8 kinds of stages in FP units:

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers

MIPS FP Pipe Stages

FP Instr          Stages (cycle 1, 2, 3, ...)
Add, Subtract     U, S+A, A+R, R+S
Multiply          U, E+M, M, M, M, N, N+A, R
Divide            U, A, R, D^28, D+A, D+R, D+R, D+A, D+R, A, R
Square root       U, E, (A+R)^108, A, R
Negate            U, S
Absolute value    U, S
FP compare        U, A, R

(A superscript gives the number of times a stage is repeated.)

A  Mantissa ADD stage
D  Divide pipeline stage
E  Exception test stage
M  First stage of multiplier
N  Second stage of multiplier
R  Rounding stage
S  Operand shift stage
U  Unpack FP numbers

R4000 Performance

Not ideal CPI of 1:
- Load stalls (1 or 2 clock cycles)
- Branch stalls (2 cycles + unfilled slots)
- FP result stalls: RAW data hazard (latency)
- FP structural stalls: Not enough FP hardware (parallelism)

[Figure: chart (vertical axis 0 to 5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv; components: Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls]

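
The four stall categories add to the base CPI of 1. A minimal sketch of that bookkeeping follows; the stall values in it are hypothetical placeholders chosen for easy arithmetic, not measurements read off the chart, and `effective_cpi` is an illustrative name.

```python
def effective_cpi(stalls_per_instruction, base_cpi=1.0):
    """Base CPI plus the average stall cycles per instruction of each kind."""
    return base_cpi + sum(stalls_per_instruction.values())


# HYPOTHETICAL average stall cycles per instruction for an integer program.
HYPOTHETICAL_STALLS = {
    "load": 0.25,
    "branch": 0.5,
    "fp_result": 0.0,
    "fp_structural": 0.0,
}

print(effective_cpi(HYPOTHETICAL_STALLS))
```
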
