
CPS 104 Computer Organization and Programming Lecture 37: Pipelined Processor

April 16, 2004 Gershon Kedem http://kedem.duke.edu/cps104/Lectures

CPS104 Lec37.1

GK Spring 2004

Admin.

Homework 8 is posted. Due: Monday, April 19


Review: A More Extensive Pipelining Example


Clock cycle:              1        2        3        4        5        6        7        8
0: Load                   Ifetch   Reg/Dec  Exec     Mem      WrB
4: R-type                          Ifetch   Reg/Dec  Exec     Mem      WrB
8: Store                                    Ifetch   Reg/Dec  Exec     Mem      WrB
12: Beq (target is 1000)                             Ifetch   Reg/Dec  Exec     Mem      WrB

End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
End of Cycle 5: Load's WrB, R-type's Mem, Store's Exec, Beq's Reg
End of Cycle 6: R-type's WrB, Store's Mem, Beq's Exec
End of Cycle 7: Store's WrB, Beq's Mem
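
The stage occupancy in this example follows from a simple rule: an instruction fetched in cycle c is in stage number (current cycle - c). A minimal sketch of that rule, assuming one instruction is fetched per cycle and nothing stalls; the names `STAGES`, `PROGRAM`, `stage_in_cycle`, and `snapshot` are illustrative and not from the lecture:

```python
# The five stages of the example pipeline, in order.
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "WrB"]

# (address, instruction kind, cycle in which it is fetched)
PROGRAM = [(0, "Load", 1), (4, "R-type", 2), (8, "Store", 3), (12, "Beq", 4)]


def stage_in_cycle(fetch_cycle, cycle):
    """Return the stage occupied during `cycle`, or None if not in the pipe."""
    index = cycle - fetch_cycle
    if 0 <= index < len(STAGES):
        return STAGES[index]
    return None


def snapshot(cycle):
    """Map each instruction kind to the stage it occupies during `cycle`."""
    return {kind: stage_in_cycle(fetch, cycle) for _, kind, fetch in PROGRAM}


for cycle in range(4, 8):
    print("Cycle", cycle, snapshot(cycle))
```

The snapshot for cycle 4 reproduces the "End of Cycle 4" line: Load in Mem, R-type in Exec, Store in Reg/Dec, Beq in Ifetch.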

Review: Pipelining Example: End of Cycle 4

[Figure: pipelined datapath at the end of Cycle 4; PC = 16.]

Stage occupancy: 0: Load in Mem; 4: R-type in Exec; 8: Store in Reg; 12: Beq in Ifetch.
Pipeline registers: IF/ID holds the Beq instruction; ID/Ex holds the Store's busA and busB; Ex/Mem holds the R-type's result; Mem/Wr holds the Load's Dout.
Control signals: RegWr=0, ExtOp=x, ALUOp=R-type, RegDst=1, ALUSrc=0, Branch=0, MemWr=0, MemtoReg=x.

Review: Pipelining Example: End of Cycle 5

[Figure: pipelined datapath at the end of Cycle 5; PC = 20.]

Stage occupancy: 0: Load in Wr; 4: R-type in Mem; 8: Store in Exec; 12: Beq in Reg; 16: R-type in Ifetch.
Pipeline registers: IF/ID holds the instruction at 16; ID/Ex holds the Beq's busA and busB; Ex/Mem holds the Store's address; Mem/Wr holds the R-type's result.
Control signals: RegWr=1, ExtOp=1, ALUOp=Add, RegDst=x, ALUSrc=1, Branch=0, MemWr=0, MemtoReg=1.

Review: Pipelining Example: End of Cycle 6

[Figure: pipelined datapath at the end of Cycle 6; PC = 24.]

Stage occupancy: 4: R-type in Wr; 8: Store in Mem; 12: Beq in Exec; 16: R-type in Reg; 20: R-type in Ifetch.
Pipeline registers: IF/ID holds the instruction at 20; ID/Ex holds the R-type's busA and busB; Ex/Mem holds the Beq's results; Mem/Wr holds nothing for the Store.
Control signals: RegWr=1, ExtOp=1, ALUOp=Sub, RegDst=x, ALUSrc=0, Branch=0, MemWr=1, MemtoReg=0.

Review: Pipelining Example: End of Cycle 7

[Figure: pipelined datapath at the end of Cycle 7; PC = 1000.]

Stage occupancy: 8: Store in Wr; 12: Beq in Mem; 16: R-type in Exec; 20: R-type in Reg; 24: R-type in Ifetch.
Pipeline registers: IF/ID holds the instruction at 24; ID/Ex holds the R-type's busA and busB; Ex/Mem holds the R-type's results; Mem/Wr holds nothing for the Beq.
Control signals: RegWr=0, ExtOp=x, ALUOp=R-type, RegDst=1, ALUSrc=0, Branch=1, MemWr=0, MemtoReg=x.

Review: Data Hazards


So far we have ignored instruction dependencies, but a real machine must deal with them. Example:

    sub $2, $1, $3
    and $12, $2, $5     # $12 depends on the result in $2,
    or  $13, $6, $2     # but $2 is updated 3 clock
    add $14, $2, $2     # cycles later. We have a problem!
    sw  $15, 100($2)

Clock cycle:  1        2        3        4        5        6        7        8        9
0: sub        Ifetch   Reg/Dec  Exec     Mem      WrB
4: and                 Ifetch   Reg/Dec  Exec     Mem      WrB
8: or                           Ifetch   Reg/Dec  Exec     Mem      WrB
12: add                                  Ifetch   Reg/Dec  Exec     Mem      WrB
16: sw                                            Ifetch   Reg/Dec  Exec     Mem      WrB
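
The dependencies in the example can be found mechanically. The sketch below is an illustration rather than anything from the lecture: it assumes no forwarding, and it assumes a result becomes readable only for the fourth instruction after its producer, so the three instructions that follow a register write are at risk. The tuple encoding and the name `raw_dependences` are hypothetical.

```python
# (opcode, destination register or None, source registers)
PROGRAM = [
    ("sub", 2, (1, 3)),
    ("and", 12, (2, 5)),
    ("or", 13, (6, 2)),
    ("add", 14, (2, 2)),
    ("sw", None, (15, 2)),  # a store writes no register; it reads $15 and $2
]


def raw_dependences(program, window=3):
    """Return (producer index, consumer index, register) for each consumer
    that reads a register written by one of the `window` instructions
    immediately before it."""
    found = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:
            continue
        last = min(i + 1 + window, len(program))
        for j in range(i + 1, last):
            if dest in program[j][2]:
                found.append((i, j, dest))
    return found


for producer, consumer, reg in raw_dependences(PROGRAM):
    print(f"{PROGRAM[consumer][0]} reads ${reg} too soon after {PROGRAM[producer][0]}")
```

Under these assumptions the and, or, and add instructions are flagged, while sw is far enough behind sub to read the updated $2 from the register file.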

Data Hazard Solution: Register Forwarding


[Figure: forwarding datapath. Labels: pipeline registers ID/EX, EX/MEM, MEM/WB; Registers; ALU; Data Memory; multiplexer selects Forward A and Forward B; a Forwarding Unit with inputs Rs, Rt, Rd.]
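
The slide shows the forwarding unit only as a block. One common textbook formulation of its decision logic is sketched below; it is an assumption that this lecture's design uses exactly these conditions, and the function name `forward_select` and its parameter names are made up for illustration. Forward A would be computed from the Rs field of the instruction in ID/EX, and Forward B from its Rt field.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose where an ALU operand should come from.

    src_reg is Rs (for Forward A) or Rt (for Forward B) of the instruction
    currently in ID/EX. The newer result in EX/MEM takes priority over the
    older one in MEM/WB. Register 0 is never forwarded because it is
    hard-wired to zero on MIPS.
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


# `and $12, $2, $5` in Exec while `sub $2, $1, $3` sits in EX/MEM:
print(forward_select(2, True, 2, False, 0))   # Forward A for Rs = $2
print(forward_select(5, True, 2, False, 0))   # Forward B for Rt = $5
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the more recent value must win.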

The Delay Branch Phenomenon


Clock cycle:              4        5        6        7        8        9        10       11       12
12: Beq (target is 1000)  Ifetch   Reg/Dec  Exec     Mem      Wr
16: R-type                         Ifetch   Reg/Dec  Exec     Mem      Wr
20: R-type                                  Ifetch   Reg/Dec  Exec     Mem      Wr
24: R-type                                           Ifetch   Reg/Dec  Exec     Mem      Wr
1000: Target of Br                                            Ifetch   Reg/Dec  Exec     Mem      Wr

Although beq is fetched during Cycle 4:
- The target address is NOT written into the PC until the end of Cycle 7
- The branch's target is NOT fetched until Cycle 8
- There is a 3-instruction delay before the branch takes effect
This is referred to as a Branch Hazard:
- Clever design techniques can reduce the delay to ONE instruction

Reducing Branch delays (cont.)

- The design is optimized for branch not taken (no pipeline delay).
- If the branch is taken, the next instruction is converted to a NOOP by the control (pipeline bubble <=> one-stage pipeline delay).
- The MIPS architecture defines a delayed branch slot to reduce this potential delay (see a later slide).

The Delay Load Phenomenon


Clock cycle:  1        2        3        4        5        6        7        8        9
I0: Load      Ifetch   Reg/Dec  Exec     Mem      Wr
Plus 1                 Ifetch   Reg/Dec  Exec     Mem      Wr
Plus 2                          Ifetch   Reg/Dec  Exec     Mem      Wr
Plus 3                                   Ifetch   Reg/Dec  Exec     Mem      Wr
Plus 4                                            Ifetch   Reg/Dec  Exec     Mem      Wr

Although the load is fetched during Cycle 1:
- The data is NOT written into the Reg File until the end of Cycle 5
- We cannot read this value from the Reg File until Cycle 6
- There is a 3-instruction delay before the load takes effect
This is referred to as a Data Hazard:
- Register forwarding reduces the load delay to ONE instruction
- It is not possible to entirely eliminate the load delay.

Delayed Load and Branch on a Real MIPS Processor

The effect of load in a real MIPS processor is delayed:

    lw  $1, 100($2)     # Load register R1
    add $3, $1, $0      # Move old R1 into R3
    add $4, $1, $0      # Move new R1 into R4

The effect of load on a normal processor is NOT delayed:

    lw  $1, 100($2)     # Load register R1
    add $3, $1, $0      # Move new R1 into R3

The effect of branch and jump in a real MIPS processor is delayed:

    Instruction Address: 0x00      j 1000
    Instruction Address: 0x04      add $1, $2, $3
    Instruction Address: 0x1000    sub $1, $2, $3

Branch and jump in a normal processor are NOT delayed:

    Instruction Address: 0x00      j 1000
    Instruction Address: 0x1000    sub $1, $2, $3
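
The difference between delayed and non-delayed load can be made concrete with a toy interpreter. This is a sketch of the semantics as the slide describes them (the instruction right after a load still sees the old register value), not a model of any particular MIPS implementation; the tuple encoding, the `run` function, and the initial register and memory contents are all invented for the example.

```python
def run(program, regs, memory, delayed_load):
    """Execute a tiny lw/add program and return the final registers.

    With delayed_load=True, a loaded value becomes visible only after the
    instruction that follows the load has executed.
    """
    regs = dict(regs)
    pending = None  # (register, value) of a load not yet visible
    for op, dest, a, b in program:
        commit, pending = pending, None
        if op == "lw":          # lw dest, a(b): a is the offset, b the base
            value = memory[regs[b] + a]
            if delayed_load:
                pending = (dest, value)
            else:
                regs[dest] = value
        elif op == "add":       # add dest, a, b
            regs[dest] = regs[a] + regs[b]
        if commit is not None:  # the previous load's value lands now
            regs[commit[0]] = commit[1]
    if pending is not None:
        regs[pending[0]] = pending[1]
    return regs


PROGRAM = [("lw", 1, 100, 2), ("add", 3, 1, 0), ("add", 4, 1, 0)]
START = {0: 0, 1: 7, 2: 0, 3: 0, 4: 0}   # hypothetical old R1 = 7
MEMORY = {100: 42}                        # hypothetical new R1 = 42

print(run(PROGRAM, START, MEMORY, delayed_load=True))
print(run(PROGRAM, START, MEMORY, delayed_load=False))
```

With delayed load, R3 receives the old R1 and R4 the new one; without it, both receive the new value.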


Branch Delays
[Figure: datapath with the branch decision made early. Labels: PC; Instruction Memory; IF/ID; Registers with Rs and Rt; Imm; sign extend; <<2; + (adder); comparator Bus A = Bus B; Control & Hazards; ID/EX with Rt and Rd.]

Example:

        sub $10, $4, $8
        beq $10, $3, go
        add $12, $2, $5
        . . .
    go: lw  $4, 16($12)


Load Data Forwarding


[Figure: load data forwarding datapath. Labels: pipeline registers ID/EX, EX/MEM, MEM/WB; Registers; Data Memory; multiplexer selects Forward A and Forward B; a Forwarding Unit with inputs Rs, Rt, Rd.]

Dealing with the Load Data Hazard

There are two ways to deal with the load data hazard:
- Insert a NOOP bubble into the data path.
- Use delayed load semantics (see the slide "Delayed Load and Branch on a Real MIPS Processor").

[Figure: pipelined datapath (IF_Unit, RFile, Exec Unit, Data Mem; IF/ID, ID/Ex, Ex/Mem, and Mem/Wr registers) with stall signals Stall0, Stall1, and Stall2, annotated "Insert NOOP Here" and "How?". Control signals shown: RegWr, ExtOp, ALUOp, Branch, RegDst, ALUSrc, MemWr, MemtoReg.]
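
The first option, inserting a NOOP bubble, can be illustrated at the instruction-stream level. The sketch below is a software analogy for what the stall hardware does, under the assumption that forwarding has already reduced the load delay to one instruction; the tuple encoding and the names `NOOP` and `insert_load_bubbles` are hypothetical and do not describe the Stall0 to Stall2 signals in the figure.

```python
NOOP = ("noop", None, ())


def insert_load_bubbles(program):
    """Return a copy of `program` with a NOOP placed after every load whose
    destination register is read by the very next instruction.

    Each instruction is (opcode, destination register or None, sources).
    """
    scheduled = []
    for instr in program:
        if scheduled:
            prev_op, prev_dest, _ = scheduled[-1]
            if prev_op == "lw" and prev_dest is not None and prev_dest in instr[2]:
                scheduled.append(NOOP)  # one-cycle bubble
        scheduled.append(instr)
    return scheduled


PROGRAM = [
    ("lw", 1, (2,)),     # lw  $1, 100($2)
    ("add", 3, (1, 0)),  # add $3, $1, $0   uses $1 right after the load
    ("add", 4, (1, 0)),  # add $4, $1, $0   far enough away
]

for instr in insert_load_bubbles(PROGRAM):
    print(instr)
```

Only the first add needs a bubble; once the NOOP separates it from the load, the second add is unaffected.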


Software Scheduling to Avoid Load Hazards

Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
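
Why the second ordering is faster can be checked by counting load-use pairs. A minimal sketch, assuming a one-instruction load delay (a stall occurs only when an instruction reads a register loaded by the instruction immediately before it); the tuple encoding and the name `load_use_stalls` are illustrative:

```python
# (opcode, destination register or None, source registers)
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]


def load_use_stalls(code):
    """Count instructions that read a register loaded by their predecessor."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls


print("slow:", load_use_stalls(SLOW), "stalls")
print("fast:", load_use_stalls(FAST), "stalls")
```

In the slow code, ADD follows the load of Rc and SUB follows the load of Rf, so each stalls once. The fast code moves an independent instruction into each of those positions.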

Compiler Avoiding Load Stalls


[Figure: bar chart of the percentage of loads stalling the pipeline (axis from 0% to 80%) for gcc, spice, and tex, comparing scheduled and unscheduled code. Values shown in the chart: 14%, 25%, 31%, 42%, 54%, 65%.]

Pipelining Complications

Interrupts (Exceptions):
- 5 instructions executing in a 5-stage pipeline
- How to stop the pipeline?
- How to restart the pipeline?
- Who caused the interrupt?

Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

Pipelining Complications

Simultaneous exceptions in more than one pipeline stage:
- Load with data page fault in MEM stage
- Add with instruction page fault in IF stage
Solution #1:
- Interrupt status vector per instruction
- Defer check until last stage; kill state update if exception
Solution #2:
- Interrupt ASAP
- Restart everything that is incomplete
Exception in branch delay slot:
- SW needs two PCs
Another advantage for state update late in pipeline!

Pipeline Complications

Complex Addressing Modes and Instructions
Address modes: autoincrement causes register change during instruction execution
- Interrupts? Need to restore register state
- Adds WAR and WAW hazards since writes are no longer in the last stage
Memory-memory move instructions
- Must be able to handle multiple page faults
- Long-lived instructions: partial state save on interrupt
Condition codes

Pipeline Complications: Floating Point


[Figure: pipeline with multiple execution units. Labels: IF; ID/RF; EX; a seven-stage unit M1 to M7; a four-stage unit A1 to A4; FP/INT Divide Unit (not pipelined, 25 clocks); MEM; WB.]

Pipelining Complications

Floating point: long execution time (interrupts, WAW, WAR).
Also, may pipeline the FP execution units so they can initiate new instructions without waiting the full latency.

FP Instruction    Latency   Initiation Rate    (MIPS R4000)
Add, Subtract     4         3
Multiply          8         4
Divide            36        35
Square root       112       111
Negate            2         1
Absolute value    2         1
FP compare        3         2

Latency: cycles before the result can be used.
Initiation rate: cycles before issuing an instruction of the same type.

Summary of Pipelining Basics

Hazards limit performance:
- Structural: need more HW resources
- Data: need forwarding, compiler scheduling
- Control: early evaluation & PC, delayed branch, prediction
Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency.
Compilers reduce the cost of data and control hazards:
- Load delay slots
- Branch delay slots
- Branch prediction
Interrupts, instruction set, and FP make pipelining harder.
Q: How would you handle context switches?

Case Study: MIPS R4000 (100 MHz to 200 MHz)

8-stage pipeline:
- IF: first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access.
- IS: second half of access to the instruction cache.
- RF: instruction decode and register fetch, hazard checking, and instruction cache hit detection.
- EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF: data fetch, first half of access to the data cache.
- DS: second half of access to the data cache.
- TC: tag check, determine whether the data cache access hit.
- WB: write back for loads and register-register operations.
8 stages: what is the impact on load delay? Branch delay? Why?

Case Study: MIPS R4000


TWO-cycle load latency
[Figure: overlapped pipeline diagram of successive instructions through IF, IS, RF, EX, DF, DS, TC, WB.]

THREE-cycle branch latency (conditions evaluated during EX phase)
[Figure: overlapped pipeline diagram of successive instructions through IF, IS, RF, EX, DF, DS.]
- Delay slot plus two stalls
- Branch likely cancels delay slot if not taken

MIPS R4000 Floating Point


- FP adder, FP multiplier, FP divider
- Last step of FP multiplier/divider uses FP adder HW
- 8 kinds of stages in FP units:

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers

MIPS FP Pipe Stages


FP Instr          Stages (in order, starting at cycle 1)
Add, Subtract     U, S+A, A+R, R+S
Multiply          U, E+M, M, M, M, N, N+A, R
Divide            U, A, R, D^28, D+A, D+R, D+R, D+A, D+R, A, R
Square root       U, E, (A+R)^108, A, R
Negate            U, S
Absolute value    U, S
FP compare        U, A, R

A superscript (written here as ^) gives the number of times a stage is repeated.

Stage key:
A   Mantissa ADD stage
D   Divide pipeline stage
E   Exception test stage
M   First stage of multiplier
N   Second stage of multiplier
R   Rounding stage
S   Operand shift stage
U   Unpack FP numbers

R4000 Performance

Not ideal CPI of 1:
- Load stalls (1 or 2 clock cycles)
- Branch stalls (2 cycles + unfilled slots)
- FP result stalls: RAW data hazard (latency)
- FP structural stalls: not enough FP hardware (parallelism)

[Figure: bar chart of CPI (axis from 0 to 5) for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv, broken down into Base, Load stalls, Branch stalls, FP result stalls, and FP structural stalls.]
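
The breakdown on this slide amounts to adding the average stall cycles per instruction from each source to the ideal CPI of 1. A minimal sketch of that bookkeeping; the function name `pipeline_cpi` and every number in the example are made up for illustration and are not measurements from the chart:

```python
def pipeline_cpi(stall_cycles_per_instruction, base_cpi=1.0):
    """Return base CPI plus the sum of average stall cycles per instruction."""
    return base_cpi + sum(stall_cycles_per_instruction.values())


# Hypothetical stall contributions, in cycles per instruction.
example = {
    "load stalls": 0.10,
    "branch stalls": 0.35,
    "FP result stalls": 0.50,
    "FP structural stalls": 0.05,
}

print(f"CPI = {pipeline_cpi(example):.2f}")
```

Keeping the contributions separate, as the chart does, shows which hazard dominates for a given benchmark.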
