l05 Singlecycle

1
Single-Cycle Processors: Datapath & Control
Arvind Computer Science & Artificial Intelligence Lab M.I.T. Based on the material prepared by Arvind and Krste Asanovic
Instruction Set Architecture (ISA) versus Implementation

ISA is the hardware/software interface
Defines set of programmer visible state Defines instruction format (bit encoding) and instruction semantics Examples: MIPS, x86, IBM 360, JVM
6.823 L5- 2 Arvind
Many possible implementations of one ISA

360 implementations: model 30 (c. 1964), z900 (c. 2001) x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon, Transmeta Crusoe, SoftPC MIPS implementations: R2000, R4000, R10000, ... JVM: HotSpot, PicoJava, ARM Jazelle, ...
September 26, 2005
6.823 L5- 3 Arvind
Processor Performance
Time Program = Instructions Program * Cycles Instruction * Time Cycle
Instructions per program depends on source code, compiler technology, and ISA Cycles per instructions (CPI) depends upon the ISA and the microarchitecture Time per cycle depends upon the microarchitecture and the base technology Microarchitecture this lecture Microcoded Single-cycle unpipelined Pipelined
September 26, 2005
CPI >1 1 1
cycle time short long short
6.823 L5- 4 Arvind
Microarchitecture:
status lines
Implementation of an ISA control points
Controller
Data path
Structure: How components are connected. Static Behavior: How data moves between components Dynamic
September 26, 2005
Hardware Elements
Combinational circuits
Mux, Demux, Decoder, ALU, ...
Sel A0 A1 An-1
lg(n)
OpSelect
- Add, Sub, ... - And, Or, Xor, Not, ... - GT, LT, EQ, Zero, ...
Sel A
lg(n)
Decoder
Demux
. . .
Mux
. . .
O0 O1 On1
A
lg(n)
. . .
O0 O1 On-1
ALU
B
Result Comp?
Synchronous state elements

Flipflop, Register, Register file, SRAM, DRAM D En Clk ff Q Clk En D Q
En Clk
register
D0 D1 D2
...
Dn-1
ff
Q0
ff
Q1
ff ...
Q2
ff
Qn-1
...
Edge-triggered: Data is sampled at the rising edge

September 26, 2005
6.823 L5- 6 Arvind
Register Files
Clock WE ReadSel1 ReadSel2 WriteSel WriteData ws clk
5 rs1 rs2 ws wd we
Register file 2R+1W
rd1 rd2
ReadData1 ReadData2
wd
32
rs1
5
register 0 we register 1 register 31

32 32
32
rd1
rs2
32 5
No timing issues in reading a selected register Register files with a large number of ports are difficult to design
September 26, 2005
Intels Itanium, GPR File has 128 registers with 8 read ports and 4 write ports!!!
32
rd2
6.823 L5- 7 Arvind
A Simple Memory Model

WriteEnable Clock Address WriteData
MAGIC RAM
ReadData
Reads and writes are always completed in one cycle a Read can be done any time (i.e. combinational) a Write is performed at the rising clock edge if it is enabled the write address and data must be stable at the clock edge Later in the course we will present a more realistic model of memory
September 26, 2005
6.823 L5- 8 Arvind
Implementing MIPS:
Single-cycle per instruction datapath & control logic
September 26, 2005
6.823 L5- 9 Arvind
The MIPS ISA

Processor State
32 32-bit GPRs, R0 always contains a 0 32 single precision FPRs, may also be viewed as 16 double precision FPRs FP status register, used for FP compares & exceptions PC, the program counter some other special registers 8-bit byte, 16-bit half word 32-bit word for integers 32-bit word for single precision floating point 64-bit word for double precision floating point data addressing modes- immediate & indexed branch addressing modes- PC relative & register indirect Byte addressable memory- big endian mode
Data types
Load/Store style instruction set
All instructions are 32 bits

September 26, 2005
6.823 L5- 10 Arvind
Instruction Execution
Execution of an instruction involves
1. 2. 3. 4. 5. instruction fetch decode and register fetch ALU operation memory operation (optional) write back
and the computation of the address of the next instruction
September 26, 2005
6.823 L5- 11 Arvind
Datapath: Reg-Reg ALU Instructions

RegWrite 0x4
Add clk
inst<25:21> inst<20:16> PC
clk
addr
inst
inst<15:11>
Inst. Memory
inst<5:0>
we rs1 rs2 rd1 ws wd rd2
ALU
GPRs
ALU
Control
OpCode
6 0
31 26 25 September 26, 2005
5 rs
5 rt
21 20 16 15
5 rd
11
5 0
5
6 func
0
RegWrite Timing?
rd (rs) func (rt)
6.823 L5- 12 Arvind
Datapath: Reg-Imm ALU Instructions

RegWrite 0x4
Add clk
inst<25:21> PC
clk
addr
inst
inst<20:16>
Inst. Memory
ALU
GPRs
Imm Ext
inst<15:0> inst<31:26>
ALU Control
OpCode
ExtSel
6 opcode
31 26 25 September 26, 2005
5 rs
2120
5 rt
16 15
16 immediate
0
rt (rs) op immediate
6.823 L5- 13 Arvind
Conflicts in Merging Datapath

RegWrite 0x4
Add clk
inst<25:21> PC
clk
addr
inst
Inst. Memory
inst<20:16> inst<15:11> inst<15:0> inst<31:26> inst<5:0>
Introduce muxes
ALU
GPRs
Imm Ext
ALU Control
OpCode
ExtSel
6 0 opcode
September 26, 2005
5 rs rs
5 rt rt
5 rd
5 0 immediate
6 func
rd (rs) func (rt) rt (rs) op immediate
6.823 L5- 14 Arvind
Datapath for ALU Instructions

RegWrite 0x4
Add clk
PC
clk
addr
<25:21> <20:16> inst <15:11> <15:0> <31:26>, <5:0>
Inst. Memory
ALU
GPRs
Imm Ext
ALU Control
OpCode
6 0 opcode
September 26, 2005
5 rs rs
5 rt rt
5 rd
RegDst rt / rd
ExtSel
OpSel
5 0
6 func
BSrc Reg / Imm
rd (rs) func (rt) rt (rs) op immediate
immediate
6.823 L5- 15 Arvind
Datapath for Memory Instructions

Should program and data memory be separate? Harvard style: separate (Aiken and Mark 1 influence) - read-only program memory - read/write data memory at some level the two memories have to be the same Princeton style: the same (von Neumanns influence) - A Load or Store instruction requires accessing the memory more than once during its execution
September 26, 2005
6.823 L5- 16 Arvind
Load/Store Instructions:Harvard Datapath

RegWrite 0x4
Add clk
MemWrite WBSrc ALU / Mem

clk
base PC
clk
addr
inst
ALU
we addr z
Inst. Memory
GPRs
Imm Ext
disp
Data Memory
wdata ALU Control
rdata
OpCode RegDst
ExtSel
OpSel
BSrc
6 opcode
31 26 25
5 rs
5 rt
21 20 16 15
16 displacement
0
addressing mode (rs) + displacement
rs is the base register rt is the destination of a Load or the source for a Store
September 26, 2005
6.823 L5- 17 Arvind
MIPS Control Instructions

Conditional (on GPR) PC-relative branch
6 opcode 6 opcode 5 rs 5 rs 5 16 offset 16 BEQZ, BNEZ
Unconditional register-indirect jumps

5 JR, JALR
Unconditional absolute jumps

6 opcode 26 target J, JAL
PC-relative branches add offset4 to PC+4 to calculate the target address (offset is in words): 128 KB range Absolute jumps append target4 to PC<31:28> to calculate the target address: 256 MB range jump-&-link stores PC+4 into the link register (R31) All Control Transfers are delayed by 1 instruction
we will worry about the branch delay slot later
September 26, 2005
6.823 L5- 18 Arvind
Conditional Branches (BEQZ, BNEZ)

PCSrc br RegWrite MemWrite WBSrc
pc+4
0x4 Add Add clk
PC
clk
addr
inst
Inst. Memory
clk
ALU z
we addr
GPRs
Imm Ext
Data Memory
wdata
rdata
ALU Control
OpCode RegDst September 26, 2005
ExtSel
OpSel
BSrc
zero?
6.823 L5- 19 Arvind
Register-Indirect Jumps (JR)

PCSrc br rind pc+4
0x4 Add Add clk
RegWrite
MemWrite
WBSrc
PC
clk
addr
inst
Inst. Memory
clk
ALU z
we addr
GPRs
Imm Ext
Data Memory
wdata
rdata
ALU Control
ExtSel
OpSel
BSrc
zero?
6.823 L5- 20 Arvind
Register-Indirect Jump-&-Link (JALR)

PCSrc br rind pc+4
0x4 Add Add clk
RegWrite
MemWrite
WBSrc
PC
clk
addr
inst
31
Inst. Memory
clk
ALU z
we addr
GPRs
Imm Ext
Data Memory
wdata
rdata
ALU Control
ExtSel
OpSel
BSrc
zero?
6.823 L5- 21 Arvind
Absolute Jumps (J, JAL)

PCSrc br rind jabs pc+4
0x4 Add Add clk
RegWrite
MemWrite
WBSrc
PC
clk
addr
inst
31
Inst. Memory
clk
ALU z
we addr
GPRs
Imm Ext
Data Memory
wdata
rdata
ALU Control
ExtSel
OpSel
BSrc
zero?
6.823 L5- 22 Arvind
Harvard-Style Datapath for MIPS

PCSrc br rind jabs pc+4
0x4 Add Add clk
RegWrite
MemWrite
WBSrc
PC
clk
addr
inst
31
Inst. Memory
clk
ALU z
we addr
GPRs
Imm Ext
Data Memory
wdata
rdata
ALU Control
ExtSel
OpSel
BSrc
zero?
23
Five-minute break to stretch your legs
Single-Cycle Hardwired Control:

Harvard architecture We will assume clock period is sufficiently long for all of the following steps to be completed:
1. 2. 3. 4. 5. instruction fetch decode and register fetch ALU operation data fetch if required register write-back setup time tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB
6.823 L5- 24 Arvind
At the rising edge of the following clock, the PC, the register file and the memory are updated
September 26, 2005
Hardwired Control is pure Combinational Logic

ExtSel BSrc op code zero? OpSel
6.823 L5- 25 Arvind
combinational logic
MemWrite WBSrc RegDst RegWrite PCSrc
September 26, 2005
6.823 L5- 26 Arvind
ALU Control & Immediate Extension

Inst<5:0> (Func) Inst<31:26> (Opcode) + 0? ALUop
OpSel ( Func, Op, +, 0? )
Decode Map
ExtSel ( sExt16, uExt16, High16)
September 26, 2005
6.823 L5- 27 Arvind
Hardwired Control Table

Opcode ALU ALUi ALUiu LW SW BEQZz=0 BEQZz=1 J JAL JR JALR ExtSel BSrc OpSel MemW RegW WBSrc RegDst PCSrc
* sExt16 uExt16 sExt16 sExt16 sExt16 sExt16 * * * *
Reg Imm Imm Imm Imm * * * * * *
Func Op Op + + 0? 0? * * * *
no no no no yes no no no no no no
yes yes yes yes no no no no yes no yes
ALU ALU ALU Mem * * * * PC * PC
rd rt rt rt * * * * R31 * R31
pc+4 pc+4 pc+4 pc+4 pc+4 br pc+4 jabs jabs rind rind
BSrc = Reg / Imm RegDst = rt / rd / R31

September 26, 2005
WBSrc = ALU / Mem / PC PCSrc = pc+4 / br / rind / jabs
6.823 L5- 28 Arvind
Pipelined MIPS
To pipeline MIPS: First build MIPS without pipelining with CPI=1 Next, add pipeline registers to reduce cycle time while maintaining CPI=1
September 26, 2005
6.823 L5- 29 Arvind
Pipelined Datapath
0x4 Add we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext
PC
addr rdata
IR
ALU
we addr
Inst. Memory
Data Memory
wdata
rdata
write fetch decode & Reg-fetch execute memory -back phase phase phase phase phase Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined
September 26, 2005
6.823 L5- 30 Arvind
An Ideal Pipeline
stage 1 stage 2 stage 3 stage 4
All objects go through the same stages No sharing of resources between any two stages Propagation delay through all pipeline stages is equal The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?
September 26, 2005
How to divide the datapath into stages

Suppose memory is significantly slower than other stages. In particular, suppose
tIM tDM tALU tRF tRW = = = = = 10 units 10 units 5 units 1 unit 1 unit
6.823 L5- 31 Arvind
Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance
September 26, 2005
6.823 L5- 32 Arvind
Alternative Pipelining
0x4 Add we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext
PC
addr rdata
IR
ALU
we addr
Inst. Memory
Data Memory
wdata
rdata
fetch phase
decode & Reg-fetch phase
execute phase
memory phase = = tt + tRW DM DM DM
write -back phase
tC > max {tIM , tRF , tALU , tDM ,, tt } } RW RW C IM RF+t ALU, t DM+t RW}
increase the critical path by 10%

Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase
September 26, 2005
6.823 L5- 33 Arvind
Maximum Speedup by Pipelining

Assumptions 1. tIM = tDM = 10, tALU = 5, tRF = tRW= 1 4-stage pipeline 2. tIM =tDM = tALU = tRF = tRW = 5 4-stage pipeline 3. tIM =tDM = tALU = tRF = tRW = 5 5-stage pipeline Unpipelined Pipelined Speedup
tC
tC
27
10
2.7
25
10
2.5
25
5.0
It is possible to achieve higher speedup with more stages in the pipeline.

September 26, 2005
34
Thank you !

l05 Singlecycle

Caricato da

Informazioni sul documento

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

l05 Singlecycle

Caricato da

Copyright:

Formati disponibili

1

Single-Cycle Processors: Datapath & Control

Instruction Set Architecture (ISA) versus Implementation

6.823 L5- 2 Arvind

Many possible implementations of one ISA

September 26, 2005

6.823 L5- 3 Arvind

cycle time short long short

6.823 L5- 4 Arvind

Implementation of an ISA control points

Synchronous state elements

Edge-triggered: Data is sampled at the rising edge

6.823 L5- 6 Arvind

Register file 2R+1W

register 0 we register 1 register 31

6.823 L5- 7 Arvind

A Simple Memory Model

6.823 L5- 8 Arvind

September 26, 2005

6.823 L5- 9 Arvind

The MIPS ISA

Load/Store style instruction set

All instructions are 32 bits

6.823 L5- 10 Arvind

and the computation of the address of the next instruction

September 26, 2005

6.823 L5- 11 Arvind

Datapath: Reg-Reg ALU Instructions

we rs1 rs2 rd1 ws wd rd2

6.823 L5- 12 Arvind

Datapath: Reg-Imm ALU Instructions

we rs1 rs2 rd1 ws wd rd2

6.823 L5- 13 Arvind

Conflicts in Merging Datapath

inst<20:16> inst<15:11> inst<15:0> inst<31:26> inst<5:0>

we rs1 rs2 rd1 ws wd rd2

rd (rs) func (rt) rt (rs) op immediate

6.823 L5- 14 Arvind

Datapath for ALU Instructions

<25:21> <20:16> inst <15:11> <15:0> <31:26>, <5:0>

we rs1 rs2 rd1 ws wd rd2

BSrc Reg / Imm

rd (rs) func (rt) rt (rs) op immediate

6.823 L5- 15 Arvind

Datapath for Memory Instructions

September 26, 2005

6.823 L5- 16 Arvind

Load/Store Instructions:Harvard Datapath

MemWrite WBSrc ALU / Mem

we rs1 rs2 rd1 ws wd rd2

addressing mode (rs) + displacement

6.823 L5- 17 Arvind

MIPS Control Instructions

Unconditional register-indirect jumps

Unconditional absolute jumps

6.823 L5- 18 Arvind

Conditional Branches (BEQZ, BNEZ)

we rs1 rs2 rd1 ws wd rd2

OpCode RegDst September 26, 2005

6.823 L5- 19 Arvind

Register-Indirect Jumps (JR)

we rs1 rs2 rd1 ws wd rd2

OpCode RegDst September 26, 2005