Sei sulla pagina 1di 33

CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining

Krste Asanovic
Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152

Last time in Lecture 3


Microcoding became less attractive as gap between RAM and ROM speeds reduced Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew Iron-law explains architecture design space
Trade instructions/program, cycles/instruction, and time/cycle

Load-Store RISC ISAs designed for efficient pipelined implementations


Very similar to vertical microcode Inspired by earlier Cray machines

2/3/2009

CS152-Spring09

Iron Law of Processor Performance


Time = Instructions Cycles Time Program Program * Instruction * Cycle Instructions per program depends on source code, compiler technology, and ISA Cycles per instructions (CPI) depends upon the ISA and the microarchitecture Time per cycle depends upon the microarchitecture and the base technology
Microarchitecture Microcoded Single-cycle unpipelined This lecture Pipelined
2/3/2009 CS152-Spring09

CPI >1 1 1

cycle time short long short


3

An Ideal Pipeline
stage 1 stage 2 stage 3 stage 4

All objects go through the same stages No sharing of resources between any two stages Propagation delay through all pipeline stages is equal The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?
CS152-Spring09

2/3/2009

Pipelined MIPS To pipeline MIPS: First build MIPS without pipelining with CPI=1 Next, add pipeline registers to reduce cycle time while maintaining CPI=1

2/3/2009

CS152-Spring09

Pipelined Datapath
0x4 Add we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext

PC

addr rdata

IR

ALU

we addr

Inst. Memory

Data Memory
wdata

rdata

write fetch decode & Reg-fetch execute memory -back phase phase phase phase phase Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM , tRF , tALU , tDM , tRW } ( = tDM probably) However, CPI will increase unless instructions are pipelined
2/3/2009 CS152-Spring09 6

Technology Assumptions
A small amount of very fast memory (caches) backed up by a large, slower memory Fast ALU (at least for integers) Multiported Register files (slower!)

Thus, the following timing assumption is reasonable


tIM tRF tALU tDM tRW

A 5-stage pipelined Harvard architecture will be the focus of our detailed design

2/3/2009

CS152-Spring09

5-Stage Pipelined Execution


0x4 Add we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext

PC

addr rdata

IR

ALU

we addr rdata wdata

Inst. Memory

Data Memory

I-Fetch (IF)

Decode, Reg. Fetch Execute (ID) (EX)


t0 IF1 t1 t2 ID1 EX1 IF2 ID2 IF3 t3 MA1 EX2 ID3 IF4 t4 WB1 MA2 EX3 ID4 IF5 t5

Memory (MA)
t6 t7 ....

Write -Back (WB)

time instruction1 instruction2 instruction3 instruction4 instruction5


2/3/2009

WB2 MA3 WB3 EX4 MA4 WB4 ID5 EX5 MA5 WB5
8

CS152-Spring09

5-Stage Pipelined Execution


Resource Usage Diagram
0x4 Add we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext

PC

addr rdata

IR

ALU

we addr rdata wdata

Inst. Memory

Data Memory

I-Fetch (IF)
Resources

Decode, Reg. Fetch Execute (ID) (EX)


t0 I1 t1 I2 I1 t2 I3 I2 I1 t3 I4 I3 I2 I1 t4 I5 I4 I3 I2 I1 t5 I5 I4 I3 I2

Memory (MA)
t6 t7 ....

Write -Back (WB)

time IF ID EX MA WB

I5 I4 I3

I5 I4

I5
9

2/3/2009

CS152-Spring09

Pipelined Execution:
ALU Instructions
0x4
Add

IR

IR
31

IR

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 ws wd rd2

A
ALU

we addr

GPRs
Imm Ext

wdata

Data Memory

rdata

wdata MD1 MD2

Not quite correct! We need an Instruction Reg (IR) for each stage
2/3/2009 CS152-Spring09 10

Pipelined MIPS Datapath


without jumps
F D E
IR 0x4
Add

M
IR
31

W
IR

RegDst RegWrite we rs1 rs2 rd1 ws wd rd2 OpSel A


ALU

MemWrite we addr

WBSrc

PC

addr

inst IR

Inst Memory

GPRs
Imm Ext

Data Memory wdata


wdata MD2

rdata

MD1 ExtSel BSrc

Control Points Need to Be Connected


2/3/2009 CS152-Spring09 11

How Instructions can Interact with each other in a pipeline


An instruction in the pipeline may need a resource being used by another instruction in the pipeline structural hazard An instruction may depend on something produced by an earlier instruction
Dependence may be for a data value

data hazard
Dependence may be for the next instructions address control hazard (branches, exceptions)

2/3/2009

CS152-Spring09

12

Data Hazards
r4 r1
0x4
Add

r1
IR IR
31

IR

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 ws wd rd2

A
ALU

we addr

GPRs
Imm Ext

wdata

Data Memory

rdata

wdata MD1 MD2

... r1 r0 + 10 r4 r1 + 17 ...
2/3/2009 CS152-Spring09

r1 is stale. Oops!
13

CS152 Administrivia
PS 1 due Tuesday Feb 10 in class Section covering PS 1 on Wednesday Feb 11
Room/time TBD

First Quiz on Thursday Feb 12


In class, closed-book, no computers or calculators Covers lectures 1-5 (this weeks lectures)

Lecture 7, Tuesday Feb 17 in 320 Soda Lecture 8, Thursday Feb 19 back in 306 Soda See website for full schedule

2/3/2009

CS152-Spring09

14

Resolving Data Hazards (1)

Strategy 1: Wait for the result to be available by freezing earlier pipeline stages interlocks

2/3/2009

CS152-Spring09

15

Feedback to Resolve Hazards

FB1

FB2

FB3

FB4

stage 1

stage 2

stage 3

stage 4

Later stages provide dependence information to earlier stages which can stall (or kill) instructions

Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)
2/3/2009 CS152-Spring09 16

Interlocks to resolve Data Hazards


Stall Condition

0x4
Add

nop

IR

IR
31

IR

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 ws wd rd2

A
ALU

we addr

GPRs
Imm Ext

... r1 r0 + 10 r4 r1 + 17 ...
2/3/2009

wdata

Data Memory

rdata

wdata

MD1

MD2

CS152-Spring09

17

Stalled Stages and Pipeline Bubbles


time t0 t1 t2 t3 t4 t5 (I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) r4 (r1) + 17 IF2 ID2 ID2 ID2 ID2 (I3) IF3 IF3 IF3 IF3 stalled stages (I4) (I5) time t0 t1 I1 I2 I1 t6 t7 ....

EX2 MA2 WB2 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5

Resource Usage

IF ID EX MA WB

t2 I3 I2 I1

t3 I3 I2 nop I1

t4 I3 I2 nop nop I1

t5 I3 I2 nop nop nop

t6 I4 I3 I2 nop nop

t7 I5 I4 I3 I2 nop

.... I5 I4 I3 I2

I5 I4 I3

I5 I4
18

I5

nop
2/3/2009 CS152-Spring09

pipeline bubble

Interlock Control Logic


stall ws Cstall rs rt ?

0x4
Add

nop

IR

IR
31

IR

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 ws wd rd2

A
ALU

we addr

GPRs
Imm Ext

wdata

Data Memory

rdata

wdata

MD1

MD2

Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.
2/3/2009 CS152-Spring09 19

Interlock Control Logic


ignoring jumps & branches
stall Cstall rs rt re1 0x4
Add

ws we ? re2 Cre we Cdest ws we Cdest IR


31

ws

nop

IR

IR

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 ws wd rd2

Cdest

A
ALU

we addr

GPRs
Imm Ext

wdata

Data Memory

rdata

wdata

MD1

MD2

Should we always stall if the rs field matches some rd? not every instruction writes a register we not every instruction reads a register re
2/3/2009 CS152-Spring09

20

Source & Destination Registers


R-type: I-type: J-type: ALU ALUi LW SW BZ J JAL JR JALR
op op op rs rs rt rt rd func immediate16

immediate26

rd (rs) func (rt) rt (rs) op imm rt M [(rs) + imm] M [(rs) + imm] (rt) cond (rs) true: PC (PC) + imm false: PC (PC) + 4 PC (PC) + imm r31 (PC), PC (PC) + imm PC (rs) r31 (PC), PC (rs)
2/3/2009 CS152-Spring09

source(s) rs, rt rs rs rs, rt rs rs rs rs

destination rd rt rt

31 31
21

Deriving the Stall Signal


Cdest ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 we = Case opcode ALU, ALUi, LW (ws 0) JAL, JALR on ... off Cre re1 = Case opcode ALU, ALUi, LW, SW, BZ, JR, JALR on J, JAL off re2 = Case opcode ALU, SW on ... off

Cstall

2/3/2009

stall = ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW) . re1D ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW) . re2D CS152-Spring09

t ry ! n o to is l s s l hi fu T e th
22

Hazards due to Loads & Stores


Stall Condition
What if (r1)+7 = (r3)+5 ?

0x4
Add

nop

IR

IR
31

IR

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 ws wd rd2

A
ALU

we addr

GPRs
Imm Ext

wdata

Data Memory

rdata

wdata

... M[(r1)+7] (r2) r4 M[(r3)+5] ...


2/3/2009

MD1

MD2

Is there any possible data hazard in this instruction sequence?


CS152-Spring09 23

Load & Store Hazards


... M[(r1)+7] (r2) r4 M[(r3)+5] ... (r1)+7 = (r3)+5 data hazard

However, the hazard is avoided because our memory system completes writes in one cycle ! Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course.
2/3/2009 CS152-Spring09 24

Resolving Data Hazards (2)

Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage bypass

2/3/2009

CS152-Spring09

25

Bypassing
time (I1) r1 r0 + 10 (I2) r4 r1 + 17 (I3) (I4) (I5) t0 IF1 t1 t2 t3 t4 t5 ID1 EX1 MA1 WB1 IF2 ID2 ID2 ID2 ID2 IF3 IF3 IF3 IF3 stalled stages t6 t7 .... EX2 MA2 WB2 ID3 EX3 MA3 IF4 ID4 EX4 IF5 ID5

Each stall or kill introduces a bubble in the pipeline CPI > 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input
time (I1) r1 r0 + 10 (I2) r4 r1 + 17 (I3) (I4) (I5)
2/3/2009

t0

t1 IF1

t2 t3 ID1 EX1 IF2 ID2 IF3

t4 MA1 EX2 ID3 IF4

t5 WB1 MA2 EX3 ID4 IF5

t6

t7

....

WB2 MA3 WB3 EX4 MA4 WB4 ID5 EX5 MA5 WB5
26

CS152-Spring09

Adding a Bypass
stall

r4 r1...
0x4
Add

r1 ...
nop
IR

IR

M
31

IR

ASrc
PC addr

inst IR

Inst Memory

we rs1 rs2 rd1 ws wd rd2

A
ALU

we addr

GPRs
Imm Ext

wdata

Data Memory

rdata

wdata

MD1

MD2

(I1) (I2)

2/3/2009

When does this bypass help? ... r1 r0 + 10 r1 M[r0 + 10] JAL 500 r4 r1 + 17 r4 r31 + 17 r4 r1 + 17 yes no no
CS152-Spring09 27

The Bypass Signal


Deriving it from the Stall Signal
stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D +((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D ) ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 we = Case opcode ALU, ALUi, LW (ws 0) JAL, JALR on ... off

ASrc = (rsD=wsE).weE.re1D

Is this correct?

No because only ALU and ALUi instructions can benefit from this bypass Split weE into two components: we-bypass, we-stall

2/3/2009

CS152-Spring09

28

Bypass and Stall Signals


Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE ALU, ALUi (ws 0) ... off we-stallE = Case opcodeE LW (ws 0) JAL, JALR on ... off

ASrc stall

= (rsD =wsE).we-bypassE . re1D = ((rsD =wsE).we-stallE + (rsD=wsM).weM + (rsD=wsW).weW). re1D +((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D

2/3/2009

CS152-Spring09

29

Fully Bypassed Datapath


stall

PC for JAL, ...

0x4
Add

nop

ASrc
we rs1 rs2 rd1 ws wd rd2

IR

IR

M
31

IR

PC

addr

inst IR

A
ALU

we addr

Inst Memory

GPRs
Imm Ext

BSrc

wdata

Data Memory

rdata

wdata MD1 MD2

Is there still a need for the stall signal ?

stall = (rsD=wsE). (opcodeE=LWE).(wsE 0 ).re1D + (rtD=wsE). (opcodeE=LWE).(wsE 0 ).re2D


CS152-Spring09 30

2/3/2009

Resolving Data Hazards (3)

Strategy 3: Speculate on the dependence. Two cases: Guessed correctly do nothing Guessed incorrectly kill and restart

2/3/2009

CS152-Spring09

31

Next Time: Control Hazards


Branches/Jumps Exceptions/Interrupts

2/3/2009

CS152-Spring09

32

Acknowledgements
These slides contain material developed and copyright by:
Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) David Patterson (UCB)

MIT material derived from course 6.823 UCB material derived from course CS252

2/3/2009

CS152-Spring09

33

Potrebbero piacerti anche