CPS104 Lec37.1
GK Spring 2004
Admin.
CPS104 Lec37.2
[Figure: pipeline diagram for the instruction sequence 0: Load, 4: R-type, 8: Store, 12: Beq]
End of Cycle 4: Load's Mem, R-type's Exec, Store's Reg, Beq's Ifetch
End of Cycle 5: Load's WrB, R-type's Mem, Store's Exec, Beq's Reg
End of Cycle 6: R-type's WrB, Store's Mem, Beq's Exec
End of Cycle 7: Store's WrB, Beq's Mem
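The end-of-cycle table above follows from one rule: an instruction issued one cycle after its predecessor sits one stage behind it. A minimal sketch of that rule, assuming one instruction is fetched per cycle starting at Cycle 1 (the names `STAGES`, `PROGRAM`, `stage_at` and `snapshot` are illustrative, not from the slides):

```python
# Five stages in the order used on the slides.
STAGES = ["Ifetch", "Reg", "Exec", "Mem", "WrB"]

# Instruction sequence from the slide: Load, R-type, Store, Beq.
PROGRAM = ["Load", "R-type", "Store", "Beq"]


def stage_at(cycle, issue_index):
    """Stage occupied in `cycle` (1-based) by the instruction fetched
    in cycle `issue_index + 1`, or None if it is not in the pipeline."""
    pos = cycle - 1 - issue_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def snapshot(cycle):
    """Map each instruction to its stage in the given cycle."""
    return {name: stage_at(cycle, i) for i, name in enumerate(PROGRAM)}


for c in range(4, 8):
    active = {k: v for k, v in snapshot(c).items() if v is not None}
    print(f"End of Cycle {c}: {active}")
```

Under these assumptions the four printed lines should match the four "End of Cycle" rows.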
CPS104 Lec37.3
[Figure: pipeline snapshot: 0: Load's Mem, 4: R-type's Exec, 8: Store's Reg; Branch=0]
CPS104 Lec37.4
[Figure: pipelined datapath (IUnit, RFile, Exec Unit, Data Mem), PC = 16, IF/ID: Instruction @ 16. Control signals shown: RegDst=1, ALUSrc=0, MemWr=0, MemtoReg=x, Branch=0]
CPS104 Lec37.5
[Figure: pipelined datapath (IUnit, RFile, Exec Unit, Data Mem), PC = 20, IF/ID: Instruction @ 20. Control signals shown: RegDst=x, ALUSrc=1, MemWr=0, MemtoReg=1, RegWr=1]
CPS104 Lec37.6
[Figure: pipelined datapath (IUnit, RFile, Exec Unit, Data Mem), PC = 24. Stage labels: 8: Store's WrB, 12: Beq's Mem, 16: R-type's Exec, 20: R-type's Reg, 24: R-type's Ifetch. Control signals shown: RegDst=x, ALUSrc=0, ALUOp=R-type, ExtOp=x, MemWr=1, MemtoReg=0, RegWr=0, Branch=1]
CPS104 Lec37.7
[Figure: pipelined datapath (IUnit, RFile, Exec Unit, Data Mem), PC = 1000, IF/ID: Instruction @ 24. Control signals shown: RegDst=1, ALUSrc=0, MemWr=0, MemtoReg=x]
So far we have ignored instruction dependencies, but a real machine must deal with them. Example:
0:  sub $2, $1, $3
4:  and $12, $2, $5      # $12 depends on the result in $2,
8:  or  $13, $6, $2      # but $2 is updated 3 clock cycles later.
12: add $14, $2, $2      # We have a problem!!
16: sw  $15, 100($2)
[Figure: pipeline diagram starting at Cycle 1; each instruction passes through Ifetch, Reg/Dec, Exec, Mem, WrB, one cycle behind its predecessor]
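The dependency problem in this example can be checked mechanically. The sketch below is an illustration only: it assumes one instruction is fetched per cycle, that registers are read in Reg/Dec (stage 2) and written at the end of WrB (stage 5), and that there is no forwarding. The tuple layout and the name `raw_hazards` are hypothetical.

```python
# (address, mnemonic, destination register or None, source registers)
PROGRAM = [
    (0,  "sub", "$2",  ["$1", "$3"]),
    (4,  "and", "$12", ["$2", "$5"]),
    (8,  "or",  "$13", ["$6", "$2"]),
    (12, "add", "$14", ["$2", "$2"]),
    (16, "sw",  None,  ["$15", "$2"]),
]

REG_STAGE = 2  # Reg/Dec is the 2nd stage
WRB_STAGE = 5  # WrB is the 5th stage


def raw_hazards(program):
    """Return (mnemonic, register) pairs that read a register before
    (or in the same cycle as) an earlier instruction writes it."""
    hazards = []
    for i, (_, _, dest, _) in enumerate(program):
        if dest is None:
            continue
        write_cycle = (i + 1) + WRB_STAGE - 1  # value written at end of this cycle
        for j in range(i + 1, len(program)):
            _, op, _, srcs = program[j]
            read_cycle = (j + 1) + REG_STAGE - 1
            if dest in srcs and read_cycle <= write_cycle:
                hazards.append((op, dest))
    return hazards


print(raw_hazards(PROGRAM))
```

With these assumptions the `and`, `or` and `add` all read `$2` too early, while the `sw` reads it in Cycle 6, after the write at the end of Cycle 5.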
CPS104 Lec37.8
[Figure: forwarding datapath: EX/MEM and MEM/WB pipeline registers, Registers, ALU, Data Memory, Forward A select, and a Forwarding Unit with inputs Rd, Rt, Rs]
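The slide shows the forwarding unit only as a block in the figure. As a rough illustration, the sketch below uses the textbook-style selection rule for the first ALU operand (Forward A); this rule is an assumption, not taken from the slides, and the function and parameter names are hypothetical.

```python
def forward_a(id_ex_rs, ex_mem_rd, ex_mem_regwr, mem_wb_rd, mem_wb_regwr):
    """Choose the source of ALU operand A for the instruction in Exec.

    Returns "EX/MEM", "MEM/WB" or "RegFile". Register 0 is never
    forwarded because it is hard-wired to zero in MIPS.
    """
    # The most recent producer (one instruction ahead) has priority.
    if ex_mem_regwr and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"
    # Otherwise take the value from the producer two instructions ahead.
    if mem_wb_regwr and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    # No dependency in flight: use the value read from the register file.
    return "RegFile"


# "and $12, $2, $5" right after "sub $2, $1, $3": Rs = 2 comes from EX/MEM.
print(forward_a(id_ex_rs=2, ex_mem_rd=2, ex_mem_regwr=True,
                mem_wb_rd=0, mem_wb_regwr=False))
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the younger result is the one the program expects.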
CPS104 Lec37.9
[Figure: pipeline diagram ending with 24: R-type and 1000: Target of Branch]
Although Beq is fetched during Cycle 4:
- The target address is NOT written into the PC until the end of Cycle 7
- The branch's target is NOT fetched until Cycle 8
- 3-instruction delay before the branch takes effect
This is referred to as a Branch Hazard:
- Clever design techniques can reduce the delay to ONE instruction
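The cycle numbers above can be reproduced with a little arithmetic. This is a sketch under the assumption that the PC is written at the end of the stage in which the branch is resolved; the second call assumes the "clever design" resolves the branch in Reg/Dec, which the slide does not state explicitly. `branch_timing` is a hypothetical helper.

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "WrB"]


def branch_timing(fetch_cycle, resolve_stage):
    """Return (cycle at whose end the PC is written,
               cycle in which the target is fetched,
               number of instructions fetched in between)."""
    k = STAGES.index(resolve_stage)        # stages after Ifetch
    pc_written_end_of = fetch_cycle + k
    target_fetched_in = pc_written_end_of + 1
    delay = target_fetched_in - fetch_cycle - 1
    return pc_written_end_of, target_fetched_in, delay


print(branch_timing(4, "Mem"))      # Beq fetched in Cycle 4, resolved in Mem
print(branch_timing(4, "Reg/Dec"))  # assumed earlier resolution
```

Resolving in Mem gives the Cycle 7 / Cycle 8 / 3-instruction figures from the slide; resolving in Reg/Dec would leave a single instruction of delay.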
CPS104 Lec37.10
The design is optimized for branch not taken (no pipeline delay).
If the branch is taken, the control converts the next instruction to a NOOP (pipeline bubble <=> one-stage pipeline delay).
The MIPS architecture defines a delayed branch slot to reduce this potential delay (see a later slide).
CPS104 Lec37.11
[Figure: pipeline diagram; a Load followed by four instructions, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr one cycle behind its predecessor]
Although Load is fetched during Cycle 1:
- The data is NOT written into the Reg File until the end of Cycle 5
- We cannot read this value from the Reg File until Cycle 6
- 3-instruction delay before the load takes effect
This is referred to as a Data Hazard:
- Register forwarding reduces the load delay to ONE instruction
- It is not possible to entirely eliminate the load delay.
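The "three instructions" and "one instruction" figures can be derived from when the loaded value becomes available and when a consumer needs it. A minimal sketch, assuming the load is fetched in Cycle 1, that without forwarding a consumer reads the register file in its Reg/Dec stage, and that with forwarding the value leaves Data Mem at the end of the load's Mem stage and is needed in the consumer's Exec stage (`load_use_stalls` is a hypothetical name):

```python
def load_use_stalls(distance, forwarding):
    """Stall cycles for an instruction `distance` slots after a load
    (distance 1 = immediately following) that uses the loaded register."""
    consumer_fetch = 1 + distance
    if forwarding:
        data_ready_end_of = 4            # end of the load's Mem stage
        needed_in = consumer_fetch + 2   # consumer's Exec stage
    else:
        data_ready_end_of = 5            # end of the load's Wr stage
        needed_in = consumer_fetch + 1   # consumer's Reg/Dec stage
    return max(0, data_ready_end_of + 1 - needed_in)


for d in range(1, 5):
    print(d, load_use_stalls(d, forwarding=False), load_use_stalls(d, forwarding=True))
```

Without forwarding the next three instructions are affected; with forwarding only the instruction immediately after the load is, and it still needs one cycle, which is why the load delay cannot be removed entirely.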
CPS104 Lec37.12
$1, 100($2)      // Load Register R1
$3, $1, $0       // Move old R1 into R3
$4, $1, $0       // Move new R1 into R4

$1, 100($2)      // Load Register R1
$3, $1, $0       // Move new R1 into R3

Instruction Address 0x00:    j 1000
Instruction Address 0x04:    add $1, $2, $3
Instruction Address 0x1000:  sub $1, $2, $3

Instruction Address 0x00:    j 1000
Instruction Address 0x1000:  sub $1, $2, $3
CPS104 Lec37.13
CPS104 Lec37.14
Branch Delays
Control & Hazards
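[Figure: branch hardware around the IF/ID and ID/EX registers: PC, Instruction Memory, Registers read via Rs and Rt, Imm with sign extend and <<2, an adder (+), a Bus A = Bus B comparison, and the Rt/Rd fields]
CPS104 Lec37.15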
CPS104 Lec37.16
CPS104 Lec37.17
There are two ways to deal with the load data hazard:
- Insert a NOOP bubble into the data path.
- Use Delayed load semantic (see a later slide)
[Figure: pipelined datapath with stall signals Stall0, Stall1, Stall2 and control signals RegWr, ExtOp, ALUOp; annotated "How?"]
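The first option, inserting a NOOP bubble, can be illustrated in software terms. The sketch below is not the hardware stall logic from the figure; it only shows where a bubble would land, assuming a one-instruction load delay. The tuple layout, the `noop` entry and the `add` mnemonic in the example are hypothetical.

```python
def insert_load_bubbles(program):
    """Insert a NOOP after every load whose result is used by the
    very next instruction.

    Each instruction is (mnemonic, destination or None, source list).
    """
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        next_srcs = program[i + 1][2] if i + 1 < len(program) else []
        if op == "lw" and dest in next_srcs:
            out.append(("noop", None, []))  # pipeline bubble
    return out


example = [
    ("lw",  "$1", ["$2"]),        # load $1 from 100($2)
    ("add", "$3", ["$1", "$0"]),  # uses $1 immediately
    ("add", "$4", ["$1", "$0"]),  # far enough from the load
]
for instr in insert_load_bubbles(example):
    print(instr)
```

Only the instruction directly behind the load gets a bubble in front of it; the third instruction is already one slot further away.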
CPS104 Lec37.18
[Figure: pipelined datapath with IF/ID, ID/Ex, Ex/Mem and Mem/Wr pipeline registers, IF_Unit, RFile, Data Mem; control signals RegDst, ALUSrc, MemWr, MemtoReg]
CPS104 Lec37.19
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd
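To see why the slow code is slow, one can count the places where a load is followed immediately by an instruction that uses its result. The sketch below assumes a one-instruction load delay; the `REORDERED` sequence is one possible schedule written for this illustration, not the fast code from the slide, and `count_load_stalls` is a hypothetical helper.

```python
def count_load_stalls(code):
    """Count loads whose destination is read by the next instruction.

    Each instruction is (mnemonic, destination register or None, sources).
    """
    stalls = 0
    for (op, dest, _), (_, _, next_srcs) in zip(code, code[1:]):
        if op == "LW" and dest in next_srcs:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", []), ("LW", "Rc", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

# Same instructions, with an independent instruction moved behind each
# load that was previously followed by its consumer.
REORDERED = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
    ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", None, ["Rd"]),
]

print(count_load_stalls(SLOW), count_load_stalls(REORDERED))
```

The reordering keeps every producer ahead of its consumer, so the results are unchanged while both load-use stalls disappear.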
CPS104 Lec37.20
CPS104 Lec37.21
Pipelining Complications
Interrupts (Exceptions)
- 5 instructions executing in 5 stage pipeline
- How to stop the pipeline?
- How to restart the pipeline?
- Who caused the interrupt?

Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation
CPS104 Lec37.22
Pipelining Complications
Simultaneous exceptions in > 1 pipeline stage
- Load with data page fault in MEM stage
- Add with instruction page fault in IF stage
Solution #1
- Interrupt status vector per instruction
- Defer check until last stage, kill state update if exception
Solution #2
- Interrupt ASAP
- Restart everything that is incomplete
Exception in branch delay slot
- SW needs two PCs
Another advantage for state update late in pipeline!
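Solution #1 can be sketched as follows: each in-flight instruction carries its own status record, exceptions are only recorded when they occur, and the check happens when the instruction reaches the last stage, in program order. This is a simplified model with hypothetical names (`InFlight`, `first_exception_at_commit`), not a description of any particular machine.

```python
from dataclasses import dataclass, field


@dataclass
class InFlight:
    name: str
    order: int                                   # position in program order
    status: dict = field(default_factory=dict)   # stage -> exception recorded there


def first_exception_at_commit(instructions):
    """Check status vectors in program order, as the last stage would,
    and return (instruction, stage, exception) for the first one found."""
    for instr in sorted(instructions, key=lambda i: i.order):
        for stage in ("IF", "ID", "EX", "MEM"):
            if stage in instr.status:
                return instr.name, stage, instr.status[stage]
    return None


load = InFlight("load", order=0)
add = InFlight("add", order=1)

# The add faults first in time (its IF overlaps the load's ID) ...
add.status["IF"] = "instruction page fault"
# ... and the load faults later, when it reaches MEM.
load.status["MEM"] = "data page fault"

print(first_exception_at_commit([add, load]))
```

Because the check is deferred, the load's fault is reported first even though the add's fault happened earlier in time, so exceptions are taken in program order.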
CPS104 Lec37.23
Pipeline Complications
Complex Addressing Modes and Instructions
Address modes: Autoincrement causes register change during instruction execution
- Interrupts? Need to restore register state
- Adds WAR and WAW hazards since writes no longer last stage
Memory-Memory Move Instructions
- Must be able to handle multiple page faults
- Long-lived instructions: partial state save on interrupt
Condition Codes
CPS104 Lec37.24
[Figure: multi-unit pipeline: IF, ID/RF, multiplier stages M1 to M7, adder stages A1 to A4, a block labeled "25 Clocks", WB]
CPS104 Lec37.25
Pipelining Complications
Floating Point: long execution time
Also, may pipeline FP execution unit so they can initiate new instructions without waiting full latency

FP Instruction    Latency    Initiation Rate
Add, Subtract     4          3
Multiply          8          4
Divide            36         35
Square root       112        111
Negate            2          1
Absolute value    2          1
FP compare        3          2
Latency: cycles before use result
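The two columns can be combined into a rough cost estimate for a run of independent operations of one type. The sketch assumes that "Initiation Rate" means the minimum number of cycles between starting two operations of the same kind, and that the operations do not depend on each other; `cycles_for` is a hypothetical helper.

```python
# (latency, initiation rate) per FP instruction, from the table above.
R4000_FP = {
    "add": (4, 3),
    "multiply": (8, 4),
    "divide": (36, 35),
    "square root": (112, 111),
    "negate": (2, 1),
    "absolute value": (2, 1),
    "compare": (3, 2),
}


def cycles_for(op, n):
    """Cycles until the last of `n` independent `op`s has its result:
    the last one starts (n - 1) intervals late, then takes full latency."""
    latency, interval = R4000_FP[op]
    return latency + (n - 1) * interval


print(cycles_for("add", 3), cycles_for("multiply", 2), cycles_for("divide", 2))
```

For divide and square root the initiation rate is almost the full latency, so those units are effectively not pipelined.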
CPS104 Lec37.26
(MIPS R4000)
Hazards limit performance
- Structural: need more HW resources
- Data: need forwarding, compiler scheduling
- Control: early evaluation & PC, delayed branch, prediction
Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency
Compilers reduce cost of data and control hazards
- Load delay slots
- Branch delay slots
- Branch prediction
Interrupts, Instruction Set, FP make pipelining harder
Q: How would you handle context switches?
CPS104 Lec37.27
8 Stage Pipeline:
- IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
- IS: second half of access to instruction cache.
- RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
- EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF: data fetch, first half of access to data cache.
- DS: second half of access to data cache.
- TC: tag check, determine whether the data cache access hit.
- WB: write back for loads and register-register operations.
8 Stages: What is impact on Load delay? Branch delay? Why?
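One way to approach the closing question is to measure the distance between the stage that produces a value and the stage that needs it. The sketch below assumes load data is forwarded at the end of DS to a consumer's EX stage, and that the branch outcome is known at the end of EX; `delay` is a hypothetical helper, and the 5-stage list is included only for comparison.

```python
R4000 = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]
FIVE_STAGE = ["Ifetch", "Reg/Dec", "Exec", "Mem", "WrB"]


def delay(pipeline, produce_stage, consume_stage):
    """Cycles a dependent instruction must trail its producer by, beyond
    the normal one-cycle spacing: the producer's result exists at the end
    of `produce_stage`, the consumer needs it at the start of `consume_stage`."""
    return pipeline.index(produce_stage) - pipeline.index(consume_stage)


# Load delay: data out of the cache at the end of DS, needed in EX.
print("R4000 load delay:  ", delay(R4000, "DS", "EX"))
# Branch delay: condition evaluated in EX, next fetch happens in IF.
print("R4000 branch delay:", delay(R4000, "EX", "IF"))
# Same arithmetic on the 5-stage pipeline with forwarding.
print("5-stage load delay:", delay(FIVE_STAGE, "Mem", "Exec"))
```

Splitting the cache accesses across more stages pushes the producing stage further from the consuming one, which is why both delays grow in the deeper pipeline.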
CPS104 Lec37.28
[Figure: R4000 pipeline diagram; successive instructions staggered one cycle apart through IF, IS, RF, EX, DF, DS]
Delay slot plus two stalls
Branch likely cancels delay slot if not taken
CPS104 Lec37.29
FP Adder, FP Multiplier, FP Divider
Last step of FP Multiplier/Divider uses FP Adder HW
8 kinds of stages in FP units:

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers
CPS104 Lec37.30
CPS104 Lec37.31
R4000 Performance
Not ideal CPI of 1:
- Load stalls (1 or 2 clock cycles)
- Branch stalls (2 cycles + unfilled slots)
- FP result stalls: RAW data hazard (latency)
- FP structural stalls: Not enough FP hardware (parallelism)
[Figure: bar chart, vertical axis 0 to 5, for the benchmarks eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv; legend includes "Base"]
CPS104 Lec37.32