ICS 233
Computer Architecture and Assembly Language
Prof. Muhamed Mudawar
Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath
Pipelined Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Processor Design
Pipelining Example
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Sequential Laundry
[Figure: timeline from 6 PM to 12 AM in 30-minute steps for loads A, B, C, D]
[Figure: the same loads on a time axis in 30-minute steps, with the stages overlapped]
Pipelined laundry takes 3 hours for 4 loads
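The laundry numbers can be checked with a few lines of arithmetic. This is a minimal sketch, assuming three 30-minute stages and four loads as on the slides; the function names are illustrative and not part of the course material.

```python
STAGE_MINUTES = 30   # each stage takes 30 minutes (from the slide)
NUM_STAGES = 3       # wash, dry, fold
NUM_LOADS = 4        # loads A, B, C, D


def sequential_minutes(loads: int, stages: int, stage_time: int) -> int:
    """Each load finishes all stages before the next load starts."""
    return loads * stages * stage_time


def pipelined_minutes(loads: int, stages: int, stage_time: int) -> int:
    """The first load needs `stages` steps; each later load adds one step."""
    return (stages + loads - 1) * stage_time


seq = sequential_minutes(NUM_LOADS, NUM_STAGES, STAGE_MINUTES)
pipe = pipelined_minutes(NUM_LOADS, NUM_STAGES, STAGE_MINUTES)
print(f"sequential: {seq / 60:.0f} hours, pipelined: {pipe / 60:.0f} hours")
# prints "sequential: 6 hours, pipelined: 3 hours"
```

The sequential result matches the 6 PM to 12 AM timeline, and the pipelined result matches the 3 hours stated above.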
With Pipelining
One completion every 1 time unit
Synchronous Pipeline
[Figure: Input, then stages S1, S2, ..., Sk, each followed by a Register, then Output; all registers are driven by a common Clock]
Pipeline Performance
Single-Cycle Datapath
ID = Decode and Register Fetch
EX = Execute and Calculate Address
MEM = Memory Access
WB = Write Back
[Figure: single-cycle datapath with PC, Next PC logic, Instruction Memory, Register File, Ext, ALU, Data Memory, and multiplexers]
Pipelined Datapath
[Figure: the datapath split into stages (ID = Decode, EX = Execute, MEM = Memory, WB) by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
Corrected Pipelined Datapath
Destination register number should come from MEM/WB
[Figure: pipelined datapath with stages ID, EX, MEM, WB and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Graphically Representing Pipelines
Figure shows the use of resources at each stage and each cycle
[Figure: resource usage (IM, Reg, ALU, DM, Reg) per instruction over clock cycles CC1 to CC8, starting with lw $6, 8($5)]
Instruction-Time Diagram
Diagram shows:
Which instruction occupies what stage at each clock cycle
Instruction execution is pipelined over the 5 stages
Up to five instructions can be in the pipeline at the same time
ALU instructions skip the MEM stage
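[Figure: instruction-time diagram with instruction order on the vertical axis and time from CC1 on the horizontal axis; rows pass through IF, ID, EX, MEM, WB or through IF, ID, EX, WB; operands visible include lw $6, 8($5), $7, 8($3), and $2, 10($3)]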
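An instruction-time diagram can be generated mechanically. The sketch below assumes the simplest case, where every instruction passes through all five stages and a new one enters every cycle; it does not model instructions that skip a stage. The function name and the placeholder instruction names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def time_diagram(instructions: list[str]) -> list[list[str]]:
    """Return one row per instruction; row[c] is the stage occupied in cycle c+1.

    Instruction i enters IF in cycle i+1 and advances one stage per cycle.
    Empty strings mark cycles in which the instruction is not in the pipeline.
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, _ in enumerate(instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage
        rows.append(row)
    return rows


# Only the first instruction is taken from the slides; the others are placeholders.
program = ["lw $6, 8($5)", "instr 2", "instr 3"]
for name, row in zip(program, time_diagram(program)):
    print(f"{name:<14}" + "".join(f"{stage:<5}" for stage in row))
```

Reading a column of the printed table gives the instructions that are in flight during one clock cycle.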
Single-Cycle vs Pipelined Performance
Single-cycle clock = 200+150+200+200+150 = 900 ps
[Figure: single-cycle execution; each instruction passes through IF, Reg, ALU, MEM, Reg in one 900 ps cycle]
Pipelined cycle time = max(200, 150) = 200 ps
[Figure: pipelined execution; the same five steps overlapped, with a new instruction starting every 200 ps]
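The timing comparison can be reproduced numerically. This is a rough sketch using the stage latencies from the sum above; the total-time formula for the pipeline, (stages + n - 1) cycles, assumes no stalls, and the helper name is illustrative.

```python
# Stage latencies in picoseconds, in the order IF, Reg (read), ALU, MEM, Reg (write),
# taken from the sum 200+150+200+200+150 on the slide.
STAGE_PS = [200, 150, 200, 200, 150]

single_cycle_ps = sum(STAGE_PS)     # one long cycle covers every stage
pipelined_cycle_ps = max(STAGE_PS)  # the clock must fit the slowest stage


def total_time_ps(n: int, pipelined: bool) -> int:
    """Time to run n instructions, assuming the pipeline never stalls."""
    if pipelined:
        return (len(STAGE_PS) + n - 1) * pipelined_cycle_ps
    return n * single_cycle_ps


n = 3
print(total_time_ps(n, pipelined=False))  # 2700
print(total_time_ps(n, pipelined=True))   # 1400
```

For long instruction sequences the ratio approaches 900 / 200 = 4.5, which is below the stage count of 5 because the stages are not perfectly balanced.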
Control Signals
[Figure: pipelined datapath annotated with the control signals PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, Br&J (j, beq, bne), MemRead, MemWrite, and MemtoReg; func feeds the ALU Control]
Control Signals
Signal groups: Decode, Execute Stage, Memory Stage, Writeback

Op      RegDst  ALUSrc  ALUOp
R-Type  1=Rd    0=Reg   R-Type
addi    0=Rt    1=Imm   ADD
slti    0=Rt    1=Imm   SLT
andi    0=Rt    1=Imm   AND
ori     0=Rt    1=Imm   OR
lw      0=Rt    1=Imm   ADD
sw              1=Imm   ADD
beq             0=Reg   SUB
bne             0=Reg   SUB
j
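The control table above can be held as a lookup keyed by opcode. This is a minimal sketch that covers only the three columns whose values are readable here (destination register select, second ALU operand select, ALU operation); the `Control` type and `decode` function are illustrative names, and `None` marks a cell with no recoverable value.

```python
from typing import NamedTuple, Optional


class Control(NamedTuple):
    reg_dst: Optional[str]  # "Rd" or "Rt"; None where the table has no entry
    alu_src: str            # "Reg" or "Imm"
    alu_op: str             # operation requested from the ALU


CONTROL = {
    "R-Type": Control("Rd", "Reg", "R-Type"),
    "addi":   Control("Rt", "Imm", "ADD"),
    "slti":   Control("Rt", "Imm", "SLT"),
    "andi":   Control("Rt", "Imm", "AND"),
    "ori":    Control("Rt", "Imm", "OR"),
    "lw":     Control("Rt", "Imm", "ADD"),
    "sw":     Control(None, "Imm", "ADD"),
    "beq":    Control(None, "Reg", "SUB"),
    "bne":    Control(None, "Reg", "SUB"),
}


def decode(op: str) -> Control:
    """Look up the control values for one opcode; raises KeyError if unknown."""
    return CONTROL[op]


print(decode("lw"))
```

Loads and stores both request ADD because the ALU computes the memory address, while beq and bne request SUB so that the zero flag reflects equality.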
Pipelined Control
[Figure: pipelined datapath with Main Control and ALU Control; the control signals are grouped as EX, M, and WB and carried through the ID/EX, EX/MEM, and MEM/WB registers]
Pass control signals along pipeline just like the data
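Passing control signals along the pipeline can be pictured as a bundle that shrinks at each pipeline register. The sketch below is an illustration only: the split of signals into the EX, M, and WB groups is an assumption based on the labels in the figure, and the values are the usual ones for lw.

```python
def decode_lw() -> dict:
    """Control bundle for lw as it leaves the Decode stage (grouping is an assumption)."""
    return {
        "EX": {"ALUSrc": "Imm", "ALUOp": "ADD"},
        "M": {"MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    }


def consume(bundle: dict, stage: str):
    """Use the signals of `stage` now and return the rest for the next pipeline register."""
    used = bundle[stage]
    remaining = {group: signals for group, signals in bundle.items() if group != stage}
    return used, remaining


id_ex = decode_lw()                        # latched in ID/EX
ex_signals, ex_mem = consume(id_ex, "EX")  # EX/MEM carries M and WB
m_signals, mem_wb = consume(ex_mem, "M")   # MEM/WB carries WB only
print(sorted(ex_mem), sorted(mem_wb))      # ['M', 'WB'] ['WB']
```

Each stage uses only its own group, so the write-back signals arrive in the WB stage together with the data of the same instruction.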
Pipelining Summary
Pipeline Hazards
Hazards: situations that would cause incorrect execution
If next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance
Structural Hazards
Problem
[Figure: instruction-time diagram over time from CC1; some instructions pass through IF, ID, EX, MEM, WB and others through IF, ID, EX, WB; operands visible include $6, 8($5) and $2, 10($3)]
Resolving Structural Hazards
Serious Hazard: hazard cannot be ignored
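A structural hazard can be found by listing which stage each instruction occupies in each cycle and looking for collisions. The sketch below assumes that a load uses all five stages while an ALU instruction goes from EX directly to WB; the function name is illustrative.

```python
from collections import defaultdict


def resource_conflicts(stage_lists):
    """stage_lists[i] is the stage sequence of the instruction issued in cycle i+1.

    Returns {(cycle, stage): [instruction indices]} for every stage that two or
    more instructions occupy in the same cycle.
    """
    usage = defaultdict(list)
    for i, stages in enumerate(stage_lists):
        for offset, stage in enumerate(stages):
            usage[(i + 1 + offset, stage)].append(i)
    return {key: users for key, users in usage.items() if len(users) > 1}


load = ["IF", "ID", "EX", "MEM", "WB"]
alu = ["IF", "ID", "EX", "WB"]          # assumed: skips MEM, so WB comes one cycle early
print(resource_conflicts([load, alu]))  # {(5, 'WB'): [0, 1]}
```

Both instructions reach WB in cycle 5, so they compete for the register file write port. When every instruction uses the same five stages, the function reports no conflict.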
Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Read After Write RAW Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology
I: add $1, $2, $3
# r1 is written
# r1 is read
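A read-after-write check between two instructions is a comparison of register names. This is a minimal sketch: instruction I is the add from the slide, while instruction J is a hypothetical sub chosen for illustration, since the original J is not legible here. The `Instr` type and `has_raw` function are illustrative names.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def has_raw(i: Instr, j: Instr) -> bool:
    """True if the later instruction j reads a register that i writes.

    Register $0 is excluded because it always reads as zero in MIPS.
    """
    return i.dest is not None and i.dest != "$0" and i.dest in j.srcs


I = Instr("add", "$1", ("$2", "$3"))   # add $1, $2, $3   (from the slide)
J = Instr("sub", "$4", ("$1", "$3"))   # hypothetical: sub $4, $1, $3
print(has_raw(I, J))   # True: J reads $1 after I writes it
print(has_raw(J, I))   # False
```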
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) over CC1 to CC8; the value of $2 is 10 through CC4, 10/20 in CC5, and 20 from CC6 on; the last instruction is sw $8, 10($2)]
[Figure: the same time axis and $2 values in instruction order, with two bubbles inserted before a following instruction reads the register file; instructions shown include or $6, $3, $2]
[Figure: pipeline diagram over CC1 to CC8 for the sequence ending in sw $8, 10($2)]
[Figure: pipelined datapath with forwarding multiplexers controlled by ForwardA and ForwardB; forwarded values include the ALU result and WriteData]
ICS 233 KFUPM
Muhamed Mudawar slide 34
Implementing Forwarding
[Figure: datapath with a Forwarding Unit that generates ForwardA and ForwardB from register-number comparisons such as IF/ID.Rs = EX/MEM.Rw and IF/ID.Rt = EX/MEM.Rw]
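Signal          Explanation
ForwardA = 00   No forwarding for Rs
ForwardA = 01   Rs matches ID/EX.Rw: forward from the instruction in the EX stage
ForwardA = 10   Rs matches EX/MEM.Rw: forward from the instruction in the MEM stage
ForwardB = 00   No forwarding for Rt
ForwardB = 01   Rt matches ID/EX.Rw: forward from the instruction in the EX stage
ForwardB = 10   Rt matches EX/MEM.Rw: forward from the instruction in the MEM stage

if (IF/ID.Rs == ID/EX.Rw ≠ 0 and ID/EX.RegWrite)
    ForwardA = 01
else if (IF/ID.Rs == EX/MEM.Rw ≠ 0 and EX/MEM.RegWrite)
    ForwardA = 10
else
    ForwardA = 00

if (IF/ID.Rt == ID/EX.Rw ≠ 0 and ID/EX.RegWrite)
    ForwardB = 01
else if (IF/ID.Rt == EX/MEM.Rw ≠ 0 and EX/MEM.RegWrite)
    ForwardB = 10
else
    ForwardB = 00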
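The forwarding conditions can be written as one small function that is called once for Rs and once for Rt. This is a sketch under assumptions: the comparison against EX/MEM.Rw and the use of an EX/MEM.RegWrite flag are inferred from the signal encodings and the lw/add/sub example in these slides, and the function and parameter names are illustrative.

```python
def forward_select(src: int, id_ex_rw: int, id_ex_regwrite: bool,
                   ex_mem_rw: int, ex_mem_regwrite: bool) -> str:
    """Return the 2-bit mux select for one source register of the instruction in decode.

    The nearer producer (ID/EX) is checked first, so it wins when both match.
    """
    if src != 0 and src == id_ex_rw and id_ex_regwrite:
        return "01"   # forward from the instruction in the EX stage
    if src != 0 and src == ex_mem_rw and ex_mem_regwrite:
        return "10"   # forward from the instruction in the MEM stage
    return "00"       # no forwarding


# sub $8,$4,$7 is in decode, add $7,$5,$6 is in EX, lw $4,100($9) is in MEM.
forward_a = forward_select(4, id_ex_rw=7, id_ex_regwrite=True,
                           ex_mem_rw=4, ex_mem_regwrite=True)
forward_b = forward_select(7, id_ex_rw=7, id_ex_regwrite=True,
                           ex_mem_rw=4, ex_mem_regwrite=True)
print(forward_a, forward_b)   # 10 01
```

The result agrees with the values ForwardA = 10 and ForwardB = 01 given for this instruction sequence in the slides.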
Forwarding Example
Instruction sequence:
lw  $4, 100($9)
add $7, $5, $6
sub $8, $4, $7
ForwardA = 10
ForwardB = 01
[Figure: datapath snapshot with sub $8,$4,$7, add $7,$5,$6, and lw $4,100($9) in successive stages; ForwardA = 10 and ForwardB = 01]
Load Delay
[Figure: pipeline diagram over CC1 to CC8 for lw $2, 20($1) and the three instructions after it in program order]
However, load can forward data to second next instruction
[Figure: the same sequence over CC1 to CC8 with one bubble inserted after lw $2, 20($1)]
Forwarding, Hazard Detection, and Stall Unit
[Figure: pipelined datapath with a combined Forwarding, Hazard Detection, and Stall Unit; signals shown include MemRead, ForwardA, ForwardB, PCWrite, and IF/IDWrite; a multiplexer with a 0 input follows Main Control]
Bubble clears control signals
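The stall decision for a load delay can be sketched as follows. The condition used here is the common textbook one (a load in the EX stage whose destination matches a source register of the instruction in decode) and is an assumption, since the slides show only the signal names PCWrite, IF/IDWrite, MemRead, and the bubble. The function name is illustrative, and the sketch treats both Rs and Rt as source registers.

```python
def load_use_stall(id_ex_memread: bool, id_ex_rw: int,
                   if_id_rs: int, if_id_rt: int) -> dict:
    """Decide whether the instruction in decode must wait one cycle for a load."""
    stall = id_ex_memread and id_ex_rw != 0 and id_ex_rw in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,      # 0 freezes the PC
        "IF/IDWrite": not stall,   # 0 freezes the fetched instruction
        "Bubble": stall,           # 1 clears the control signals entering ID/EX
    }


# lw $2, 20($1) is in EX; a hypothetical next instruction reads $2 and $5.
print(load_use_stall(True, 2, if_id_rs=2, if_id_rt=5))
# {'PCWrite': False, 'IF/IDWrite': False, 'Bubble': True}
```

After the one-cycle bubble the loaded value is available from the MEM stage and can be forwarded, which is why a single stall cycle is enough.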
Compiler Scheduling
Slow code:
lw  $10, ($1)    # $1 = addr b
lw  $11, ($2)    # $2 = addr c
. . .
sw  $12, ($3)    # $3 = addr a
lw  $13, ($4)    # $4 = addr e
lw  $14, ($5)    # $5 = addr f
. . .
. . . $15, ($6)  # $6 = addr d
Fast code:
lw  $10, 0($1)
lw  $11, 0($2)
lw  $13, 0($4)
lw  $14, 0($5)
. . .
sw  $12, 0($3)
. . .
. . . $14, 0($6)
# $1 is read
# $1 is written
# $1 is written
# $1 is written again
Control Hazards
Branch instructions can cause great performance loss
Branch instructions need two things:
Branch result: taken or not taken
Branch target:
PC + 4 if branch is not taken
PC + 4 + 4 × immediate if branch is taken
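The two possible next-PC values can be computed directly. This is a minimal sketch, assuming a 32-bit PC and a 16-bit immediate that is sign-extended before it is multiplied by 4; the function name is illustrative.

```python
def next_pc(pc: int, imm16: int, taken: bool) -> int:
    """PC + 4 if not taken, PC + 4 + 4 * sign-extended immediate if taken.

    imm16 is the raw 16-bit field (0 to 0xFFFF) from the branch instruction.
    """
    imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16   # sign-extend 16 bits
    target = pc + 4 + 4 * imm if taken else pc + 4
    return target & 0xFFFFFFFF                           # keep the result to 32 bits


print(hex(next_pc(0x1000, 3, taken=False)))       # 0x1004
print(hex(next_pc(0x1000, 3, taken=True)))        # 0x1010
print(hex(next_pc(0x1000, 0xFFFF, taken=True)))   # 0x1000 (immediate is -1)
```

The last call shows a backward branch: an immediate of -1 targets the branch instruction itself.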
[Figure: datapath snapshot for a taken beq: the ALU performs SUB, zero = 1, beq = 1, and PCSrc = 1; operands use forwarding from the MEM stage; code shown:]
label:  lw $8, ($7)
        . . .
        beq $5, $6, label
        next1
        next2
[Figure: instruction-time diagram over cc1 to cc7 for beq $5,$6,label followed by Next1 and Next2, both marked as bubbles]
Modified Datapath
[Figure: pipelined datapath with Next PC logic, NPC, PCSrc, a reset input, Imm16 and Imm26 inputs, and the forwarding multiplexers]
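Details of Next PC
[Figure: Next PC block with 30-bit adders (ADD) operating on NPC and the extended Imm16, the 4 most significant bits (msb 4) joined with the 26-bit Imm26, and a comparator (=) on Forwarded BusA and BusB producing zero; beq, bne, and zero feed PCSrc]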
Delayed Branch
Define branch to take place after the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
For a 1-cycle branch delay, we have one delay slot
        branch instruction
        branch delay slot (next instruction)
        branch target (if branch taken)
Before filling the delay slot:
label:  . . .
        add $t2,$t3,$t4
        beq $s1,$s0,label
        Delay Slot
After filling the delay slot:
label:  . . .
        beq $s1,$s0,label
        add $t2,$t3,$t4
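Filling a delay slot can be sketched as a small transformation on an instruction list. The version below handles only the simplest case: it moves the instruction just before the branch into the slot when the branch does not read that instruction's result, and otherwise inserts a nop. The tuple layout and function name are illustrative assumptions, not the method of any particular compiler or assembler.

```python
def fill_delay_slot(block):
    """block: list of (text, dest, srcs) tuples ending in a branch (dest is None).

    Returns a new list in which the instruction after the branch is the delay slot.
    """
    *body, branch = block
    if body:
        candidate = body[-1]
        _, dest, _ = candidate
        if dest not in branch[2]:              # branch does not depend on the candidate
            return body[:-1] + [branch, candidate]
    return block + [("nop", None, ())]         # nothing safe to move


code = [
    ("add $t2,$t3,$t4", "$t2", ("$t3", "$t4")),
    ("beq $s1,$s0,label", None, ("$s1", "$s0")),
]
for text, _, _ in fill_delay_slot(code):
    print(text)
# beq $s1,$s0,label
# add $t2,$t3,$t4
```

The add is safe to move because beq compares $s1 and $s0 and never reads $t2; the add executes whether or not the branch is taken, exactly as it did before the move.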
Zero-Delayed Branch
[Figure: lookup table of Target Addresses and Predict Bits; low-order bits of the PC are used as index; a comparator (=) and predict_taken control a mux that selects between Inc and the stored target for the next PC]
[Figure: prediction state diagram with Predict Taken and Not Taken states; transitions are labeled Taken and Not Taken]
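Dynamic prediction with a small table of prediction bits can be sketched as follows. This example uses a 2-bit saturating counter per entry, which is one common scheme; the exact transitions of the state diagram in these slides are not legible here, so the counter behaviour, the table size, and the initial state are all assumptions. The class name is illustrative.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, index_bits: int = 4):
        self.mask = (1 << index_bits) - 1
        # Counter values 0 and 1 predict not taken, 2 and 3 predict taken.
        self.counters = [1] * (1 << index_bits)   # assumed initial state

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask              # drop the two byte-offset bits

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


p = TwoBitPredictor()
history = []
for outcome in [True, True, False, True]:
    history.append(p.predict(0x400))
    p.update(0x400, outcome)
print(history)   # [False, True, True, True]
```

Once the counter has saturated, a single not-taken outcome does not flip the prediction, which suits loop branches that are taken many times and fall through once.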