CS M151B / EE M116C: Computer Systems Architecture

Pipelining
Instructor: Prof. Lei He

<LHE@ee.ucla.edu>
Some notes adapted from Glenn Reinman
Slide from Prof. B Parhami at UCSB
Review -- Single Cycle CPU
Single Cycle Datapath Partitioning
[Figure: single-cycle datapath partitioned into five stages: IF, ID, EX, Mem, WB.]
Goal is to balance work done in each cycle - minimize cycle time!
Review -- Multiple Cycle CPU
Review -- Instruction Latencies
Single-Cycle CPU (one long cycle per instruction)
  Load: Ifetch, Reg/Dec, Exec, Mem, Wr
Multiple Cycle CPU (one step per cycle)
  Load (Cycles 1 to 5): Ifetch, Reg/Dec, Exec, Mem, Wr
  Add (4 cycles): Ifetch, Reg/Dec, Exec, Wr
Getting the Best of Both Worlds
Single-cycle:
  Clock rate = 125 MHz
  CPI = 1
Multicycle:
  Clock rate = 500 MHz
  CPI ≈ 4
Pipelined:
  Clock rate = 500 MHz
  CPI ≈ 1
Single-cycle analogy: doctor appointments scheduled for 60 min per patient
Multicycle analogy: doctor appointments scheduled in 15-min increments
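The clock-rate and CPI figures on this slide can be compared with the usual relation time = instructions x CPI / clock rate. The following is a minimal Python sketch, not part of the original slides: the 500 MHz clock for the multicycle and pipelined designs, the rounded CPI values of 4 and 1, and the instruction count `N` are assumptions chosen for illustration.

```python
def exec_time_ns(n_instr, cpi, clock_mhz):
    """Execution time in nanoseconds: n_instr * CPI * cycle time."""
    cycle_ns = 1000.0 / clock_mhz  # 1 MHz corresponds to a 1000 ns cycle
    return n_instr * cpi * cycle_ns

# (CPI, clock rate in MHz) for each design; CPI 4 and 1 are rounded values.
designs = {
    "single-cycle": (1, 125),
    "multicycle": (4, 500),  # assumed clock rate
    "pipelined": (1, 500),   # assumed clock rate
}

N = 1000  # hypothetical instruction count
times = {name: exec_time_ns(N, cpi, mhz) for name, (cpi, mhz) in designs.items()}
for name, t in times.items():
    print(f"{name:12s} {t:8.0f} ns")
```

Under these assumptions the single-cycle and multicycle designs take the same time, and the pipelined design is about four times faster.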
15.1 Pipelining Concepts
Strategies for improving performance

1. Use multiple independent data paths accepting several instructions that are read out at once: multiple-instruction-issue or superscalar
2. Overlap execution of several instructions, starting the next instruction before the previous one has run to completion: (super)pipelined
Fig. 15.1 Pipelining in the student registration process.
[Figure: Start here -> Approval -> Cashier -> Registrar -> ID photo -> Pickup -> Exit]
A Pipelined Datapath
IF: Instruction fetch

ID: Instruction decode and register fetch
EX: Execution and effective address calculation
MEM: Memory access
WB: Write back
Note: These stages are often labeled with the primary structure in the stage,
rather than the main function of the stage:
IF stage = IM (instruction memory)
ID stage = Reg (register read)
EX stage = ALU
MEM stage = DM (data memory)
WB stage = Reg
Pipelined Datapath
Warning: the write register line is incorrect in this figure!
Execution in a Pipelined Datapath
Each lw passes through IF (IM), ID (Reg), EX (ALU), MEM (DM), WB (Reg).
      CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw    IM   Reg  ALU  DM   Reg
lw         IM   Reg  ALU  DM   Reg
lw              IM   Reg  ALU  DM   Reg
lw                   IM   Reg  ALU  DM   Reg
lw                        IM   Reg  ALU  DM   Reg
Steady state (pipeline full): CC5
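A pipeline diagram like this one can be generated mechanically: instruction i enters the first stage in cycle i + 1 and moves one stage to the right every cycle. The sketch below is illustrative only; the function name `pipeline_diagram` and the column widths are assumptions, and it models an ideal pipeline with no stalls.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]  # IF, ID, EX, MEM, WB

def pipeline_diagram(instrs):
    """Return (n_cycles, rows); each row lists the stage used in every cycle."""
    n_cycles = len(instrs) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instrs):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i is in stage s during cycle i + s + 1
        rows.append((name, cells))
    return n_cycles, rows

n_cycles, rows = pipeline_diagram(["lw"] * 5)
print("      " + "".join(f"CC{c:<4d}" for c in range(1, n_cycles + 1)))
for name, cells in rows:
    print(f"{name:<6s}" + "".join(f"{c:<6s}" for c in cells))

# Cycles in which every instruction of this short program occupies some stage.
full_cycles = [c + 1 for c in range(n_cycles) if all(cells[c] for _, cells in rows)]
print("pipeline full in cycle(s):", full_cycles)
```

With five instructions and five stages, the only cycle in which all stages are busy is cycle 5.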
Instruction Latencies and Throughput
Single-Cycle CPU (one long cycle)
  Load:  IF, Dec, EX, Mem, WB
Multiple Cycle CPU
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
  Load:  IF       Dec      EX       Mem      WB
Pipelined CPU
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
  Load:  IF       Dec      EX       Mem      WB
  Load:           IF       Dec      EX       Mem      WB
  Load:                    IF       Dec      EX       Mem      WB
  Load:                             IF       Dec      EX       Mem      WB
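The difference between latency and throughput can be put in numbers. This is a small sketch under simplifying assumptions that are not stated on the slide: every instruction takes all five steps in the multicycle design, and the pipeline never stalls. The function names are hypothetical.

```python
def multicycle_cycles(n, cycles_per_instr=5):
    """Total cycles when each instruction finishes before the next one starts."""
    return n * cycles_per_instr

def pipelined_cycles(n, n_stages=5):
    """Total cycles in an ideal pipeline: fill it once, then finish one per cycle."""
    return n_stages + (n - 1)

for n in (1, 4, 1000):
    m, p = multicycle_cycles(n), pipelined_cycles(n)
    print(f"n={n:5d}  multicycle={m:5d}  pipelined={p:5d}  speedup={m / p:.2f}")
```

The latency of a single load is 5 cycles in both designs. Four loads finish in 8 cycles when pipelined, and the speedup approaches the number of stages as the instruction count grows.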
Pipelining Advantages
Higher maximum throughput

Higher utilization of CPU resources
But:
more hardware needed
perhaps complex control
Mixed Instructions in the Pipeline
[Figure: lw and add overlapped across CC1 to CC6. lw uses IM, Reg, ALU, DM, Reg; add uses IM, Reg, ALU, Reg, with no DM access.]
Pipeline Principles
All instructions that share a pipeline must have the same stages in the same order.
  therefore, add does nothing during the MEM stage
  sw does nothing during the WB stage
All intermediate values must be latched each cycle.
There is no functional block reuse.
  example: we need 2 adders and an ALU (like in the single-cycle design)
IF (IM) -> ID (Reg) -> EX (ALU) -> MEM (DM) -> WB (Reg)
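The first principle can be made concrete by listing, for each instruction, which of the five stages do real work. The table `USEFUL` below is an illustrative assumption covering only three instructions; the point of the sketch is that every instruction still occupies all five stages, in order, even where it has nothing to do.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Stages in which each instruction does useful work (illustrative subset).
USEFUL = {
    "lw": {"IF", "ID", "EX", "MEM", "WB"},
    "add": {"IF", "ID", "EX", "WB"},   # nothing to do in MEM
    "sw": {"IF", "ID", "EX", "MEM"},   # nothing to do in WB
}

def schedule(op):
    """Every instruction visits all stages in order; some stages are idle."""
    return [(stage, "work" if stage in USEFUL[op] else "idle") for stage in STAGES]

for op in USEFUL:
    print(f"{op:4s}", " ".join(f"{stage}:{what}" for stage, what in schedule(op)))
```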
Pipelined Datapath
[Figure: pipelined datapath divided into five stages (Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back), separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]
The Pipeline in Execution
[Figure sequence: the pipelined datapath over six consecutive cycles, with each stage labeled by the instruction that occupies it.]
Cycle  IF                ID                EX                MEM               WB
1      add $10, $1, $2
2      lw $12, 1000($4)  add $10, $1, $2
3      sub $15, $4, $1   lw $12, 1000($4)  add $10, $1, $2
4      (not shown)       sub $15, $4, $1   lw $12, 1000($4)  add $10, $1, $2
5      (not shown)       (not shown)       sub $15, $4, $1   lw $12, 1000($4)  add $10, $1, $2
6      (not shown)       (not shown)       (not shown)       sub $15, $4, $1   lw $12, 1000($4)
The Pipeline in Execution, with controls
(This figure has a correct Write register)
Pipelined Control
FSM isn't really appropriate
Combinational logic (like the single-cycle design)!
  signals are generated once, in the ID stage
  they follow the instruction through the pipeline
  they get used in whatever stage they are needed
[Figure: the instruction in IF/ID feeds the control unit; the control signals are carried through ID/EX, EX/MEM, and MEM/WB.]
Pipelined Control Signals
             | Execution Stage Control Lines  | Memory Stage Control Lines | Write Back Stage Control Lines
Instruction  | RegDst ALUOp1 ALUOp0 ALUSrc    | Branch MemRead MemWrite    | RegWrite MemtoReg
R-Format     |                                |                            |
lw           |                                |                            |
sw           |                                |                            |
beq          |                                |                            |
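The slide leaves the table cells blank. The sketch below fills them with the values commonly given for the textbook MIPS pipeline; that choice is an assumption, not something taken from these slides, and `None` stands for a don't-care. It also illustrates the idea that the signals are generated once in ID and then travel with the instruction, each stage dropping the group it has already used. The function name `carry_through_pipeline` is hypothetical.

```python
# Control values per instruction class, grouped by the stage that uses them.
# Assumed from the standard textbook MIPS design; None means "don't care".
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq":      {"EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": None}},
}

def carry_through_pipeline(instr):
    """Show which control groups each pipeline register still holds."""
    signals = CONTROL[instr]                         # generated once, in ID
    id_ex = {g: signals[g] for g in ("EX", "MEM", "WB")}
    ex_mem = {g: id_ex[g] for g in ("MEM", "WB")}    # EX group already used
    mem_wb = {g: ex_mem[g] for g in ("WB",)}         # MEM group already used
    return {"ID/EX": id_ex, "EX/MEM": ex_mem, "MEM/WB": mem_wb}

for reg, groups in carry_through_pipeline("lw").items():
    print(reg, groups)
```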
Quick survey
Break
Are we moving too fast, or right pace?
Review of FP
Continue on pipelining
Midterm II: likely Feb. 28th
May be Feb. 24th
IEEE 754 FP Numbers
Single precision: representation of
  (-1)^S x 2^(E-127) x (1.M)
Fields: sign (1 bit) | exponent (8 bits) | mantissa (23 bits)
exponent: excess-127 binary integer (actual exponent is e = E - 127)
mantissa: sign + magnitude, normalized binary significand with hidden integer bit: 1.M
  0    = 0 00000000 00 . . . 0
  -1.5 = 1 01111111 10 . . . 0
  325  = 101000101 = 1.01000101 x 2^8
       = 0 10000111 01000101000000000000000
  0.2  = .001100110011... = 1.100110011... x 2^-3
       = 0 01111100 100110011...
range of about 2 x 10^-38 to 2 x 10^38
always normalized (so always a leading 1, never shown)
special representation of 0 (E = 00000000)
can do integer compare for greater-than, sign
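The single-precision encodings can be checked by unpacking the bits of a 32-bit float. This is a minimal sketch using Python's standard `struct` module; the helper names are hypothetical, and `fields_to_value` handles only normalized numbers.

```python
import struct

def float_to_fields(x):
    """Split the single-precision encoding of x into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit excess-127 exponent E
    mantissa = bits & 0x7FFFFF       # 23-bit fraction M (hidden 1 not stored)
    return sign, exponent, mantissa

def fields_to_str(x):
    s, e, m = float_to_fields(x)
    return f"{s} {e:08b} {m:023b}"

def fields_to_value(sign, exponent, mantissa):
    """Value of a normalized number: (-1)^S * 2^(E-127) * (1.M)."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

print(fields_to_str(0.0))    # 0 00000000 00000000000000000000000
print(fields_to_str(-1.5))   # 1 01111111 10000000000000000000000
print(fields_to_str(325.0))  # 0 10000111 01000101000000000000000
print(fields_to_value(*float_to_fields(325.0)))  # 325.0
```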
Double Precision FP (IEEE 754)
Fields: sign (1 bit) | exponent (11 bits) | mantissa (20 + 32 = 52 bits)
exponent: excess-1023 binary integer; actual exponent is e = E - 1023
N = (-1)^S x 2^(E-1023) x (1.M)
52 (+1) bit mantissa
range of about 2 x 10^-308 to 2 x 10^308
mantissa: sign + magnitude, normalized binary significand with hidden integer bit: 1.M
Range of FP Numbers
Exponent   Fraction   Object (single precision)
0          0          Zero
0          nonzero    Denormalized Number
1 to 254   any        Normalized Number (regular floating point number)
255        0          Infinity
255        nonzero    NaN
Range of the floating point number:
  Single precision: 2^-126 ~ 2^127 (10^-38 ~ 10^38)
  Double precision: 2^-1022 ~ 2^1023 (10^-308 ~ 10^308)
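The exponent limits follow from the field widths: the all-zeros and all-ones exponent codes are reserved, so normalized numbers use stored exponents 1 through 2^k - 2 with a bias of 2^(k-1) - 1. A short sketch of that arithmetic (the function name is hypothetical):

```python
def normalized_range(exp_bits, frac_bits):
    """Return (e_min, e_max, smallest, largest) for normalized numbers."""
    bias = 2 ** (exp_bits - 1) - 1
    e_min = 1 - bias                      # stored exponent 1
    e_max = (2 ** exp_bits - 2) - bias    # stored exponent just below all ones
    smallest = 2.0 ** e_min               # 1.000...0 x 2^e_min
    largest = (2 - 2.0 ** -frac_bits) * 2.0 ** e_max  # 1.111...1 x 2^e_max
    return e_min, e_max, smallest, largest

for name, k, f in (("single", 8, 23), ("double", 11, 52)):
    e_min, e_max, lo, hi = normalized_range(k, f)
    print(f"{name}: 2^{e_min} .. 2^{e_max}  ({lo:.3e} .. {hi:.3e})")
```

The decimal magnitudes come out near 1.2 x 10^-38 to 3.4 x 10^38 for single precision and 2.2 x 10^-308 to 1.8 x 10^308 for double precision.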
The Pipeline with Control Logic
Is it really that easy?
Suppose initially, register $i holds the number 2i.
What happens when we see the following dynamic instruction sequence:
  add $3, $10, $11
    this should add 20 + 22, putting result 42 into $3
  lw $8, 50($3)
    this should load memory location 92 (42+50) into $8
  sub $11, $8, $7
    this should subtract 14 from that just-loaded value
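The intended results, and the values a pipeline would read if it did nothing special, can be worked out directly. The sketch below assumes a five-stage pipeline with no stalling or forwarding, so neither lw nor sub sees any earlier result in the register file when it reads its operands; the memory contents are hypothetical placeholders.

```python
# Initial state from the example: register $i holds 2*i.
regs = {i: 2 * i for i in range(32)}
memory = {92: 1000, 56: -1}  # hypothetical contents, for illustration only

def run_sequential(regs, memory):
    """Intended behavior: each instruction completes before the next starts."""
    r = dict(regs)
    r[3] = r[10] + r[11]        # add $3, $10, $11
    r[8] = memory[50 + r[3]]    # lw  $8, 50($3)
    r[11] = r[8] - r[7]         # sub $11, $8, $7
    return r

def stale_operands(regs):
    """Operands read in ID when no earlier result has been written back yet."""
    return {
        "add $3, $10, $11": (regs[10], regs[11]),
        "lw $8, 50($3)": (regs[3],),
        "sub $11, $8, $7": (regs[8], regs[7]),
    }

intended = run_sequential(regs, memory)
stale = stale_operands(regs)
print("intended $3:", intended[3], " intended load address:", 50 + intended[3])
print("lw actually reads $3 =", stale["lw $8, 50($3)"][0],
      "-> address", 50 + stale["lw $8, 50($3)"][0])
print("sub actually reads ($8, $7) =", stale["sub $11, $8, $7"])
```

The load should use address 92 but computes 56 from the old value 6 in $3, and sub reads the old value 16 from $8 instead of the loaded value.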
[Figure sequence: the pipelined datapath over three consecutive cycles, annotated with the values flowing through it.]
First cycle shown: lw $8, 50($3) is in IF; add $3, $10, $11 is in ID and reads 20 and 22.
Next cycle: sub $11, $8, $7 is in IF; lw $8, 50($3) is in ID and reads 6 from $3; add $3, $10, $11 is in EX and computes 20 + 22 = 42.
  HAZARD: This should have been 42! But register 3 didn't get updated yet.
Next cycle: add $10, $1, $2 is in IF; sub $11, $8, $7 is in ID and reads 16 and 14; lw $8, 50($3) is in EX and computes 6 + 50 = 56; add $3, $10, $11 is in MEM with result 42.
  And this (the 16 read from $8) should be a value from memory (which hasn't even been loaded yet).
  Recall: this (the address 56) should have been 92.
Data Hazards
When a result is needed in the pipeline before it is available, a data hazard occurs.
                  CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
sub $2, $1, $3    IM   Reg  ALU  DM   Reg
and $12, $2, $5        IM   Reg  ALU  DM   Reg
or $13, $6, $2              IM   Reg  ALU  DM   Reg
add $14, $2, $2                  IM   Reg  ALU  DM   Reg
sw $15, 100($2)                       IM   Reg  ALU  DM
R2 Available: CC5, when sub writes it back.
R2 Needed: CC3, when and reads its registers.
Three ways to deal with this:
  1. Wait.
  2. Route (branch) the output of the ALU to where it is needed.
  3. Wait for the calculation to finish.
This sequence has 3 data hazards and 4 dependences on $2.
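The counts of dependences and hazards for this sequence can be reproduced with a small check. The sketch assumes an ideal five-stage pipeline in which instruction i (counting from 0) reads its registers in cycle i + 2 and writes its result in cycle i + 5, and in which a value written in WB cannot be read until the following cycle. The program encoding and the function name `analyze` are hypothetical, and the check does not handle a register that is written more than once.

```python
# (mnemonic, destination register or None, source registers)
program = [
    ("sub", 2, (1, 3)),      # sub $2, $1, $3
    ("and", 12, (2, 5)),     # and $12, $2, $5
    ("or", 13, (6, 2)),      # or  $13, $6, $2
    ("add", 14, (2, 2)),     # add $14, $2, $2
    ("sw", None, (15, 2)),   # sw  $15, 100($2)
]

def analyze(program):
    """Return (dependences, hazards) as lists of (producer, consumer) indices."""
    deps, hazards = [], []
    for i, (_, _, srcs) in enumerate(program):
        id_cycle = i + 2                 # consumer reads registers in ID
        for j in range(i):
            dest = program[j][1]
            wb_cycle = j + 5             # producer writes back in WB
            if dest is not None and dest in srcs:
                deps.append((j, i))
                if id_cycle <= wb_cycle:  # value is read before it is usable
                    hazards.append((j, i))
    return deps, hazards

deps, hazards = analyze(program)
print("dependences:", [(program[j][0], program[i][0]) for j, i in deps])
print("hazards:    ", [(program[j][0], program[i][0]) for j, i in hazards])
```

Under these assumptions all four later instructions depend on sub, but only and, or, and add read $2 too early; sw reads it in CC6, after the write-back.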
