
Single-Cycle Processors: Datapath & Control

Arvind

Computer Science & Artificial Intelligence Lab

M.I.T.

Based on the material prepared by

Arvind and Krste Asanovic

Instruction Set Architecture (ISA) versus Implementation

6.823 L5- 2
Arvind

ISA is the hardware/software interface
- Defines set of programmer visible state
- Defines instruction format (bit encoding) and instruction semantics
- Examples: MIPS, x86, IBM 360, JVM

Many possible implementations of one ISA
- 360 implementations: model 30 (c. 1964), z900 (c. 2001)
- x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon, Transmeta Crusoe, SoftPC
- MIPS implementations: R2000, R4000, R10000, ...
- JVM: HotSpot, PicoJava, ARM Jazelle, ...

September 26, 2005


Processor Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

- Instructions per program depends on source code, compiler technology, and ISA
- Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture          CPI   cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long         (this lecture)
Pipelined                  1     short
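The performance equation can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `execution_time`, the instruction count, and every CPI and cycle-time value are made-up numbers chosen to mirror the three microarchitecture styles, not measurements from the lecture.

```python
def execution_time(instructions, cpi, cycle_time):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cpi * cycle_time

# Hypothetical workload and delays (arbitrary time units).
program = 1_000_000  # instructions executed by one program

microcoded = execution_time(program, cpi=4, cycle_time=10)    # CPI > 1, short cycle
single_cycle = execution_time(program, cpi=1, cycle_time=27)  # CPI = 1, long cycle
pipelined = execution_time(program, cpi=1, cycle_time=10)     # CPI = 1, short cycle
```

With these assumed numbers the pipelined design finishes first, because it keeps the CPI of the single-cycle design and the short cycle of the microcoded one.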

Microarchitecture: Implementation of an ISA

[Figure: a Controller and a Data path; control points run from the controller to the data path, and status lines run from the data path back to the controller]

Structure: How components are connected. Static
Behavior: How data moves between components. Dynamic
Hardware Elements

Combinational circuits
- Mux, Demux, Decoder, ALU, ...

[Figure: Mux (inputs A0, A1, ..., An-1, lg(n)-bit Sel); Demux (input A, lg(n)-bit Sel, outputs O0, O1, ..., On-1); Decoder (lg(n)-bit input A, outputs O0, O1, ..., On-1); ALU (OpSelect input, Result and Comp? outputs)]

ALU OpSelect:
- Add, Sub, ...
- And, Or, Xor, Not, ...
- GT, LT, EQ, Zero, ...

Synchronous state elements
- Flipflop, Register, Register file, SRAM, DRAM

[Figure: a flipflop (ff) with inputs D, En, Clk and output Q; a register built from flipflops with inputs D0, D1, D2, ..., Dn-1, shared En and Clk, and outputs Q0, Q1, Q2, ..., Qn-1]

Edge-triggered: Data is sampled at the rising edge

Register Files

[Figure: a 2R+1W register file with inputs Clock (clk), WE (we), ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd) and outputs ReadData1 (rd1), ReadData2 (rd2). Internally, register 0 through register 31 (32 bits each) are written from wd under control of we and ws, and rd1 and rd2 are selected by rs1 and rs2.]

- No timing issues in reading a selected register
- Register files with a large number of ports are difficult to design
  - Intel's Itanium GPR file has 128 registers with 8 read ports and 4 write ports!
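A behavioral model makes the read/write asymmetry concrete. This is a minimal sketch, not the lecture's implementation: the class name `RegisterFile` and the method names `read` and `clock_edge` are hypothetical, and the model ignores electrical timing entirely.

```python
class RegisterFile:
    """2R+1W register file: two combinational read ports, one clocked write port."""

    def __init__(self, num_regs=32, width=32):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1  # keep stored values within the register width

    def read(self, rs1, rs2):
        # Reads are combinational: they return the current contents immediately.
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        # A write takes effect only at the rising clock edge, and only if enabled.
        if we:
            self.regs[ws] = wd & self.mask


rf = RegisterFile()
rf.clock_edge(we=True, ws=5, wd=0x1_0000_0001)  # upper bit is truncated to 32 bits
rf.clock_edge(we=False, ws=5, wd=7)             # ignored: write enable is low
values = rf.read(5, 0)
```

Adding ports in this model is trivial, which is exactly what the model hides: in hardware each extra port adds wiring and muxing to every register.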

A Simple Memory Model

[Figure: MAGIC RAM with inputs WriteEnable, Clock, Address, WriteData and output ReadData]

Reads and writes are always completed in one cycle
- a Read can be done any time (i.e. combinational)
- a Write is performed at the rising clock edge if it is enabled
  - the write address and data must be stable at the clock edge

Later in the course we will present a more realistic model of memory

Implementing MIPS:
Single-cycle per instruction datapath & control logic

The MIPS ISA

Processor State
- 32 32-bit GPRs, R0 always contains a 0
- 32 single precision FPRs, may also be viewed as 16 double precision FPRs
- FP status register, used for FP compares & exceptions
- PC, the program counter
- some other special registers

Data types
- 8-bit byte, 16-bit half word
- 32-bit word for integers
- 32-bit word for single precision floating point
- 64-bit word for double precision floating point

Load/Store style instruction set
- data addressing modes: immediate & indexed
- branch addressing modes: PC relative & register indirect
- Byte addressable memory, big endian mode

All instructions are 32 bits

Instruction Execution

Execution of an instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back

and the computation of the address of the next instruction

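Datapath: Reg-Reg ALU Instructions

[Figure: PC addresses the Inst. Memory, and an adder computes PC + 0x4. inst<25:21> and inst<20:16> drive the GPR read selects rs1 and rs2; inst<15:11> drives the write select ws. rd1 and rd2 feed the ALU, whose result returns to wd. ALU Control decodes inst<5:0> into the ALU OpCode. RegWrite drives we.]

Instruction format:

  field   opcode   rs      rt      rd      0       func
  bits    31:26    25:21   20:16   15:11   10:6    5:0
  width   6        5       5       5       5       6
  value   0                                0

rd ← (rs) func (rt)

RegWrite Timing?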
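The field boundaries of the Reg-Reg format (inst<25:21>, inst<20:16>, inst<15:11>, inst<5:0>) translate directly into shifts and masks. The sketch below is illustrative: the function name `decode_rtype` and the dictionary keys are hypothetical, and the example word is built by hand from arbitrary field values.

```python
def decode_rtype(inst):
    """Split a 32-bit instruction word into the six Reg-Reg format fields."""
    return {
        "opcode": (inst >> 26) & 0x3F,  # inst<31:26>, 6 bits
        "rs": (inst >> 21) & 0x1F,      # inst<25:21>, 5 bits
        "rt": (inst >> 16) & 0x1F,      # inst<20:16>, 5 bits
        "rd": (inst >> 11) & 0x1F,      # inst<15:11>, 5 bits
        "zero": (inst >> 6) & 0x1F,     # inst<10:6>, 5 bits, shown as 0 in the format
        "func": inst & 0x3F,            # inst<5:0>, 6 bits
    }


# Hand-built example word: opcode=0, rs=1, rt=2, rd=3, func=0x20.
word = (1 << 21) | (2 << 16) | (3 << 11) | 0x20
fields = decode_rtype(word)
```

In the datapath these extractions cost nothing: each field is simply a bundle of wires taken from the instruction word.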

Datapath: Reg-Imm ALU Instructions

[Figure: as for Reg-Reg ALU instructions, except that inst<20:16> drives the write select ws, inst<15:0> passes through the immediate extender (Imm Ext, controlled by ExtSel) to the second ALU input, and ALU Control decodes inst<31:26> into the ALU OpCode.]

Instruction format:

  field   opcode   rs      rt      immediate
  bits    31:26    25:21   20:16   15:0
  width   6        5       5       16

rt ← (rs) op immediate

Conflicts in Merging Datapath

[Figure: the Reg-Reg and Reg-Imm datapaths overlaid. The write select ws is driven by both inst<20:16> (rt) and inst<15:11> (rd); the second ALU input is driven by both rd2 and the Imm Ext output; ALU Control sees both inst<31:26> and inst<5:0>. Annotation: Introduce muxes.]

  opcode = 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)      rd ← (rs) func (rt)
  opcode (6)     | rs (5) | rt (5) | immediate (16)                 rt ← (rs) op immediate

Datapath for ALU Instructions

[Figure: the merged datapath. A mux controlled by RegDst (rt / rd) selects <20:16> or <15:11> as the write select ws. A mux controlled by BSrc (Reg / Imm) selects rd2 or the Imm Ext output as the second ALU input. ALU Control takes <31:26> and <5:0> and produces OpSel; ExtSel controls Imm Ext; RegWrite drives we.]

  opcode = 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)      rd ← (rs) func (rt)
  opcode (6)     | rs (5) | rt (5) | immediate (16)                 rt ← (rs) op immediate

Datapath for Memory Instructions

Should program and data memory be separate?

Harvard style: separate (Aiken and Mark 1 influence)
- read-only program memory
- read/write data memory
- at some level the two memories have to be the same

Princeton style: the same (von Neumann's influence)
- A Load or Store instruction requires accessing the memory more than once during its execution

Load/Store Instructions: Harvard Datapath

[Figure: the ALU-instruction datapath extended with a Data Memory. The ALU adds the base (from the GPRs) and the extended displacement (disp) to form addr; the GPRs supply wdata; MemWrite drives the memory we; a mux controlled by WBSrc (ALU / Mem) selects the ALU result or rdata as wd. Control signals: RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc.]

Instruction format:

  field   opcode   rs      rt      displacement
  bits    31:26    25:21   20:16   15:0
  width   6        5       5       16

addressing mode: (rs) + displacement

- rs is the base register
- rt is the destination of a Load or the source for a Store

MIPS Control Instructions

Conditional (on GPR) PC-relative branch: BEQZ, BNEZ
  fields: opcode (6 bits), rs (5 bits), offset (16 bits)

Unconditional register-indirect jumps: JR, JALR
  fields: opcode (6 bits), rs (5 bits); a further 16-bit field is shown unlabeled

Unconditional absolute jumps: J, JAL
  fields: opcode (6 bits), target (26 bits)

- PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): ±128 KB range
- Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range
- jump-&-link stores PC+4 into the link register (R31)
- All Control Transfers are delayed by 1 instruction
  - we will worry about the branch delay slot later
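The two target calculations can be written out as arithmetic on 32-bit values. This is a sketch under stated assumptions: the function names are hypothetical, the branch offset is assumed to be a signed 16-bit quantity, the jump uses PC<31:28> exactly as the bullet above words it, and the branch delay slot is ignored.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    # PC-relative: add offset*4 to PC+4 (the offset counts words, not bytes).
    return (pc + 4 + 4 * sign_extend16(offset16)) & MASK32


def jump_target(pc, target26):
    # Absolute: target*4 fills bits 27:0; bits 31:28 come from the PC.
    return (pc & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)


def link_value(pc):
    # jump-&-link writes PC+4 into the link register (R31).
    return (pc + 4) & MASK32
```

The ranges quoted above follow from the field widths: a signed 16-bit word offset spans 2^15 words (128 KB) in either direction, and a 26-bit word target spans 2^28 bytes (256 MB).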

Conditional Branches (BEQZ, BNEZ)

[Figure: the load/store datapath with a second adder that adds the extended offset to pc+4 to form the branch target br. A mux controlled by PCSrc selects between pc+4 and br as the next PC. A zero? status output is added. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc.]

Register-Indirect Jumps (JR)

[Figure: the conditional-branch datapath with a further PCSrc input, rind, taken from the register file read data. PCSrc now selects among pc+4, br, and rind.]

Register-Indirect Jump-&-Link (JALR)

[Figure: the JR datapath with the constant 31 added as a write-select (ws) option and the PC path added as a write-data (wd) option, so that the link value can be written to R31. PCSrc selects among pc+4, br, and rind.]

Absolute Jumps (J, JAL)

[Figure: the JALR datapath with a further PCSrc input, jabs. PCSrc now selects among br, rind, jabs, and pc+4.]

Harvard-Style Datapath for MIPS

[Figure: the complete single-cycle datapath: PC, Inst. Memory, the two adders, GPRs (with the constant 31 as a write-select option), Imm Ext, ALU with zero? output, and Data Memory. Control signals: PCSrc (br / rind / jabs / pc+4), RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc.]

Five-minute break to stretch your legs

Single-Cycle Hardwired Control: Harvard architecture

We will assume the clock period is sufficiently long for all of the following steps to be completed:

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time

tC > tIFetch + tRFetch + tALU + tDMem + tRWB

At the rising edge of the following clock, the PC, the register file and the memory are updated

Hardwired Control is pure Combinational Logic

[Figure: a combinational logic block with inputs op code and zero?, and outputs ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc]

ALU Control & Immediate Extension

[Figure: Inst<31:26> (Opcode) feeds a Decode Map that produces ALUop and ExtSel. OpSel is selected from Inst<5:0> (Func), ALUop, +, and 0?.]

OpSel  ( Func, Op, +, 0? )
ExtSel ( sExt16, uExt16, High16 )
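The three ExtSel choices can be sketched as a single function on a 16-bit immediate. The function name `imm_ext` is hypothetical. sExt16 and uExt16 are taken to mean sign extension and zero extension; the meaning of High16 is not spelled out here, so the sketch assumes it places the immediate in the upper 16 bits of the word.

```python
def imm_ext(imm16, ext_sel):
    """Extend a 16-bit immediate to 32 bits according to ExtSel."""
    imm16 &= 0xFFFF
    if ext_sel == "sExt16":
        # Sign extension: replicate bit 15 into bits 31:16.
        return (imm16 | 0xFFFF0000) if imm16 & 0x8000 else imm16
    if ext_sel == "uExt16":
        # Zero extension: bits 31:16 are 0.
        return imm16
    if ext_sel == "High16":
        # ASSUMPTION: immediate goes to bits 31:16, low half is 0.
        return imm16 << 16
    raise ValueError(f"unknown ExtSel value: {ext_sel}")
```

For example, the immediate 0x8000 extends to 0xFFFF8000 under sExt16 but stays 0x00008000 under uExt16.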
Hardwired Control Table

Opcode       ExtSel   BSrc   OpSel   MemW   RegW   WBSrc   RegDst   PCSrc
ALU          *        Reg    Func    no     yes    ALU     rd       pc+4
ALUi         sExt16   Imm    Op      no     yes    ALU     rt       pc+4
ALUiu        uExt16   Imm    Op      no     yes    ALU     rt       pc+4
LW           sExt16   Imm    +       no     yes    Mem     rt       pc+4
SW           sExt16   Imm    +       yes    no     *       *        pc+4
BEQZ (z=0)   sExt16   *      0?      no     no     *       *        br
BEQZ (z=1)   sExt16   *      0?      no     no     *       *        pc+4
J            *        *      *       no     no     *       *        jabs
JAL          *        *      *       no     yes    PC      R31      jabs
JR           *        *      *       no     no     *       *        rind
JALR         *        *      *       no     yes    PC      R31      rind

BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs
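Because hardwired control is pure combinational logic, the control table can be modeled as a lookup with no state. The sketch below is one reading of the table, not a definitive transcription: the names `Control`, `TABLE`, and `control` are hypothetical, `"*"` stands for a don't-care entry, and the BEQZ rows are keyed by `z` exactly as the table labels them.

```python
from typing import NamedTuple


class Control(NamedTuple):
    ext_sel: str
    b_src: str
    op_sel: str
    mem_write: bool
    reg_write: bool
    wb_src: str
    reg_dst: str
    pc_src: str


X = "*"  # don't care

TABLE = {
    "ALU": Control(X, "Reg", "Func", False, True, "ALU", "rd", "pc+4"),
    "ALUi": Control("sExt16", "Imm", "Op", False, True, "ALU", "rt", "pc+4"),
    "ALUiu": Control("uExt16", "Imm", "Op", False, True, "ALU", "rt", "pc+4"),
    "LW": Control("sExt16", "Imm", "+", False, True, "Mem", "rt", "pc+4"),
    "SW": Control("sExt16", "Imm", "+", True, False, X, X, "pc+4"),
    ("BEQZ", 0): Control("sExt16", X, "0?", False, False, X, X, "br"),
    ("BEQZ", 1): Control("sExt16", X, "0?", False, False, X, X, "pc+4"),
    "J": Control(X, X, X, False, False, X, X, "jabs"),
    "JAL": Control(X, X, X, False, True, "PC", "R31", "jabs"),
    "JR": Control(X, X, X, False, False, X, X, "rind"),
    "JALR": Control(X, X, X, False, True, "PC", "R31", "rind"),
}


def control(opcode, z=0):
    """Return the control signals for an opcode; z matters only for BEQZ."""
    key = (opcode, z) if opcode == "BEQZ" else opcode
    return TABLE[key]
```

Only the BEQZ rows depend on anything besides the opcode, which is why zero? is the single status input to the control logic.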

Pipelined MIPS

To pipeline MIPS:
- First build MIPS without pipelining with CPI=1
- Next, add pipeline registers to reduce cycle time while maintaining CPI=1

Pipelined Datapath

[Figure: the datapath divided into five phases: fetch phase (PC, 0x4 adder, Inst. Memory, IR), decode & Reg-fetch phase (GPRs, Imm Ext), execute phase (ALU), memory phase (Data Memory), and write-back phase]

Clock period can be reduced by dividing the execution of an instruction into multiple cycles

tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)

However, CPI will increase unless instructions are pipelined

An Ideal Pipeline

[Figure: stage 1 → stage 2 → stage 3 → stage 4]

- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?

How to divide the datapath into stages

Suppose memory is significantly slower than other stages. In particular, suppose

  tIM  = 10 units
  tDM  = 10 units
  tALU =  5 units
  tRF  =  1 unit
  tRW  =  1 unit

Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance

Alternative Pipelining

[Figure: the pipelined datapath with its fetch phase, decode & Reg-fetch phase, execute phase, memory phase, and write-back phase]

tC > max {tIM, tRF, tALU, tDM, tRW} = tDM
tC > max {tIM, tRF + tALU, tDM + tRW} = tDM + tRW

Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase
⇒ increase the critical path by 10%

Maximum Speedup by Pipelining

Assumptions                                     Unpipelined tC   Pipelined tC   Speedup
1. tIM = tDM = 10, tALU = 5, tRF = tRW = 1
   4-stage pipeline                             27               10             2.7
2. tIM = tDM = tALU = tRF = tRW = 5
   4-stage pipeline                             25               10             2.5
3. tIM = tDM = tALU = tRF = tRW = 5
   5-stage pipeline                             25               5              5.0

It is possible to achieve higher speedup with more stages in the pipeline.
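The speedup figures can be reproduced by taking the unpipelined clock period as the sum of all component delays and the pipelined clock period as the delay of the slowest stage. The sketch below is illustrative: the function names are hypothetical, pipeline-register overhead is ignored, and the 4-stage grouping (register fetch merged with the ALU) is an assumption, since the table does not say which stages are merged.

```python
def clock_period(stage_groups, delay):
    """Pipelined clock period: the slowest stage, where a stage may merge components."""
    return max(sum(delay[name] for name in group) for group in stage_groups)


def speedup(stage_groups, delay):
    """Unpipelined period (sum of all delays) divided by the pipelined period."""
    return sum(delay.values()) / clock_period(stage_groups, delay)


slow_memory = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}
uniform = {"IM": 5, "RF": 5, "ALU": 5, "DM": 5, "RW": 5}

# ASSUMPTION: the 4-stage pipeline merges register fetch with the ALU.
four_stage = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]
five_stage = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]

results = [
    speedup(four_stage, slow_memory),  # assumption 1
    speedup(four_stage, uniform),      # assumption 2
    speedup(five_stage, uniform),      # assumption 3
]
```

The same `clock_period` helper shows the cost of merging write-back into the memory stage with slow memory: the grouping `[["IM"], ["RF", "ALU"], ["DM", "RW"]]` gives a period of 11 units instead of 10.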

Thank you !
