Arvind Computer Science & Artificial Intelligence Lab M.I.T. Based on the material prepared by Arvind and Krste Asanovic
Processor Performance
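$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$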
- Instructions per program depends on source code, compiler technology, and ISA
- Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture (this lecture)    CPI
Microcoded                          >1
Single-cycle unpipelined            1
Pipelined                           1

September 26, 2005
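The performance equation above can be evaluated directly. The following is a minimal sketch; the instruction count, CPI values, and cycle times are hypothetical numbers chosen only for illustration, not measurements from the lecture.

```python
def time_per_program(instructions, cpi, cycle_time_ns):
    """Execution time in ns = instructions * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns

# Hypothetical workload and clock periods (assumptions, not from the slides).
INSTRUCTIONS = 1_000_000
single_cycle = time_per_program(INSTRUCTIONS, cpi=1, cycle_time_ns=25)
pipelined = time_per_program(INSTRUCTIONS, cpi=1, cycle_time_ns=5)
microcoded = time_per_program(INSTRUCTIONS, cpi=4, cycle_time_ns=5)

print(single_cycle, pipelined, microcoded)
```

With the instruction count fixed, only the product of CPI and cycle time distinguishes the three designs.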
Microarchitecture:
[Figure: controller, data path, status lines]
- Structure: how components are connected (static)
- Behavior: how data moves between components (dynamic)
Hardware Elements
Combinational circuits
Mux, Demux, Decoder, ALU, ...
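[Figure: Mux (inputs A0, A1, ..., An-1; lg(n)-bit Sel), Demux (input A; lg(n)-bit Sel; outputs O0, O1, ..., On-1), Decoder (lg(n)-bit input A; outputs O0, O1, ..., On-1), ALU (inputs A and B; OpSelect; outputs Result and Comp?)]
OpSelect:
- Add, Sub, ...
- And, Or, Xor, Not, ...
- GT, LT, EQ, Zero, ...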
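The combinational elements listed above can be modeled as pure functions of their inputs. This is a rough behavioral sketch, assuming 32-bit unsigned operands and modeling only a few of the OpSelect operations; the function names and the string encoding of OpSelect are choices made for this example.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit data width

def mux(sel, inputs):
    """Select one of n inputs."""
    return inputs[sel]

def demux(sel, a, n):
    """Route input a to output number sel; all other outputs are 0."""
    return [a if i == sel else 0 for i in range(n)]

def decoder(a, n):
    """Assert exactly one of n outputs, chosen by the lg(n)-bit input a."""
    return [1 if i == a else 0 for i in range(n)]

def alu(op_select, a, b):
    """Return (result, zero) for a small subset of ALU operations."""
    ops = {
        "Add": lambda: (a + b) & MASK32,
        "Sub": lambda: (a - b) & MASK32,
        "And": lambda: a & b,
        "Or": lambda: a | b,
        "Xor": lambda: a ^ b,
    }
    result = ops[op_select]()
    return result, result == 0

print(mux(2, [10, 20, 30, 40]), decoder(3, 4), alu("Sub", 5, 5))
```

None of these functions keeps state: the outputs depend only on the current inputs, which is what makes them combinational.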
[Figure: a register built from n flip-flops (ff), with inputs D0, D1, D2, ..., Dn-1 and outputs Q0, Q1, Q2, ..., Qn-1]
Register Files
[Figure: register file with inputs Clock (clk), WE (we), ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd) and outputs ReadData1 (rd1), ReadData2 (rd2); select lines are 5 bits wide, data lines are 32 bits wide]
- No timing issues in reading a selected register
- Register files with a large number of ports are difficult to design
- Intel's Itanium GPR file has 128 registers with 8 read ports and 4 write ports!
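A behavioral sketch of a register file with two read ports and one write port may help fix the port names. It assumes 32 registers of 32 bits, consistent with the 5-bit selects and 32-bit data lines; the class and method names are invented for this example.

```python
class RegisterFile:
    """Two combinational read ports, one clocked write port."""

    def __init__(self, n_regs=32):
        self.regs = [0] * n_regs

    def read(self, rs1, rs2):
        """Combinational read: returns (rd1, rd2) for selects rs1 and rs2."""
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        """Rising clock edge: store wd into register ws if we is asserted."""
        if we:
            self.regs[ws] = wd & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_edge(we=True, ws=3, wd=42)
print(rf.read(3, 0))
```

Each extra port would add another select/data pair to every register, which hints at why heavily ported files are hard to build.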
MAGIC RAM
[Figure: MAGIC RAM with a ReadData output]
Reads and writes are always completed in one cycle
- a Read can be done any time (i.e. combinational)
- a Write is performed at the rising clock edge if it is enabled; the write address and data must be stable at the clock edge
Later in the course we will present a more realistic model of memory
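The timing rules above can be illustrated with a small behavioral model. This is only a sketch: the class name, the sparse dictionary storage, and the choice that unwritten addresses read as 0 are assumptions made for the example.

```python
class MagicRAM:
    """Idealized memory: combinational reads, writes at the rising clock edge."""

    def __init__(self):
        # Sparse storage; unwritten addresses read as 0 (an assumption).
        self.cells = {}

    def read(self, addr):
        """A read can happen at any time and sees the current contents."""
        return self.cells.get(addr, 0)

    def clock_edge(self, write_enable, addr, write_data):
        """The write takes effect only at the clock edge, and only if enabled."""
        if write_enable:
            self.cells[addr] = write_data

ram = MagicRAM()
before = ram.read(0x100)          # read before the edge sees the old value
ram.clock_edge(True, 0x100, 99)   # address and data are stable at the edge
after = ram.read(0x100)           # read after the edge sees the new value
print(before, after)
```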
Implementing MIPS:
Single-cycle per instruction datapath & control logic
Data types
Instruction Execution
Execution of an instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back
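[Figure: datapath for register-register ALU instructions: PC (clk), Inst. Memory (addr, inst), GPRs addressed by inst<25:21>, inst<20:16> and inst<15:11>, ALU, ALU Control driven by OpCode and inst<5:0>]
Format: 0 (6 bits, <31:26>) | rs (5 bits, <25:21>) | rt (5 bits, <20:16>) | rd (5 bits, <15:11>) | 0 (5 bits) | func (6 bits, <5:0>)
rd <- (rs) func (rt)
RegWrite Timing?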
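The field boundaries of the register-register format translate into shifts and masks. The sketch below assumes the standard 32-bit MIPS encoding; the function name and the example word (the usual encoding of `add $3, $1, $2`) are supplied for illustration.

```python
def decode_r_type(inst):
    """Split a 32-bit instruction word into register-register fields."""
    return {
        "opcode": (inst >> 26) & 0x3F,  # inst<31:26>
        "rs": (inst >> 21) & 0x1F,      # inst<25:21>
        "rt": (inst >> 16) & 0x1F,      # inst<20:16>
        "rd": (inst >> 11) & 0x1F,      # inst<15:11>
        "func": inst & 0x3F,            # inst<5:0>
    }

# add $3, $1, $2: opcode 0, rs=1, rt=2, rd=3, func=0x20
fields = decode_r_type(0x00221820)
print(fields)
```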
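[Figure: datapath for register-immediate ALU instructions: PC (clk), Inst. Memory (addr, inst), GPRs addressed by inst<25:21> and inst<20:16>, Imm Ext fed by inst<15:0>, ALU, ALU Control driven by inst<31:26> (OpCode) and producing ExtSel]
Format: opcode (6 bits, <31:26>) | rs (5 bits, <25:21>) | rt (5 bits, <20:16>) | immediate (16 bits, <15:0>)
rt <- (rs) op immediate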
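The Imm Ext block widens the 16-bit immediate to 32 bits under control of ExtSel. The sketch below uses the three ExtSel settings named in these slides (sExt16, uExt16, High16); reading High16 as "place the immediate in the upper half of the word" is an assumption, as is the function name.

```python
def imm_ext(imm16, ext_sel):
    """Extend a 16-bit immediate to a 32-bit value according to ExtSel."""
    imm16 &= 0xFFFF
    if ext_sel == "sExt16":
        # Sign extension: replicate bit 15 into bits 31..16.
        return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16
    if ext_sel == "uExt16":
        # Zero extension: upper 16 bits are 0.
        return imm16
    if ext_sel == "High16":
        # Assumed meaning: immediate moved to the upper 16 bits.
        return imm16 << 16
    raise ValueError(f"unknown ExtSel value: {ext_sel}")

print(hex(imm_ext(0xFFFF, "sExt16")), hex(imm_ext(0x1234, "High16")))
```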
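[Figure: the register-register and register-immediate datapaths merged: PC (clk), Inst. Memory (addr, inst), GPRs, Imm Ext, ALU, ALU Control with OpCode and ExtSel]
Introduce muxes
- register-register format: 0 (6 bits) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)
- register-immediate format: opcode (6 bits) | rs (5) | rt (5) | immediate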
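[Figure: datapath for ALU instructions with muxes added: PC (clk), Inst. Memory (addr), GPRs, Imm Ext, ALU, ALU Control driven by OpCode and producing RegDst (rt / rd), ExtSel and OpSel; both instruction formats (rs, rt, rd, func and rs, rt, immediate) are supported]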
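[Figure: datapath for memory instructions: PC (clk), Inst. Memory (addr, inst), GPRs supplying base, Imm Ext supplying disp, ALU (z), Data Memory (we, addr, wdata, rdata), ALU Control driven by OpCode and producing RegDst, ExtSel, OpSel and BSrc]
Format: opcode (6 bits, <31:26>) | rs (5 bits, <25:21>) | rt (5 bits, <20:16>) | displacement (16 bits, <15:0>)
- rs is the base register
- rt is the destination of a Load or the source for a Store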
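The roles of rs and rt in loads and stores can be sketched as follows. The example assumes the displacement is sign-extended before being added to the base, as in standard MIPS; the function names and the dictionary used as data memory are illustrative only.

```python
def effective_address(base, disp16):
    """Base register value plus the sign-extended 16-bit displacement."""
    disp = disp16 & 0xFFFF
    if disp & 0x8000:
        disp -= 0x10000  # interpret as a negative two's-complement value
    return (base + disp) & 0xFFFFFFFF

def load_word(regs, mem, rs, rt, disp16):
    """rt is the destination of a load."""
    regs[rt] = mem.get(effective_address(regs[rs], disp16), 0)

def store_word(regs, mem, rs, rt, disp16):
    """rt is the source of a store."""
    mem[effective_address(regs[rs], disp16)] = regs[rt]

regs = [0] * 32
mem = {}
regs[1] = 0x1000                       # base register
regs[2] = 99                           # value to store
store_word(regs, mem, 1, 2, 0xFFFC)    # displacement -4
load_word(regs, mem, 1, 3, 0xFFFC)
print(mem, regs[3])
```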
- PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): 128 KB range
- Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range
- jump-&-link stores PC+4 into the link register (R31)
- All control transfers are delayed by 1 instruction; we will worry about the branch delay slot later
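The two target-address calculations above can be written out as bit arithmetic. This sketch assumes a 16-bit signed branch offset and a 26-bit jump target field, as in standard MIPS encodings; the function names are invented, and the branch delay slot is ignored.

```python
def branch_target(pc, offset16):
    """PC-relative branch: PC + 4 + (sign-extended offset * 4)."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:
        offset -= 0x10000
    return (pc + 4 + offset * 4) & 0xFFFFFFFF

def jump_target(pc, target26):
    """Absolute jump: PC<31:28> concatenated with target * 4."""
    return (pc & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)

# Reach of each form in bytes, derived from the field widths.
BRANCH_REACH_EACH_WAY = (1 << 15) * 4   # 128 KB forward or backward
JUMP_REGION_SIZE = (1 << 26) * 4        # 256 MB region

print(hex(branch_target(0x1000, 0x0003)), hex(jump_target(0x10001000, 0x100)))
```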
[Figures: successive versions of the single-cycle datapath. Components: PC (clk), a 0x4 adder producing pc+4 and a second adder, Inst. Memory (addr, inst), GPRs (clk), Imm Ext, ALU with zero? output (z), Data Memory (we, addr, wdata, rdata), ALU Control. Control signals: ExtSel, OpSel, BSrc, RegWrite, MemWrite, WBSrc. The later versions also show the constant 31.]
At the rising edge of the following clock, the PC, the register file and the memory are updated
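[Figure: control implemented as combinational logic (Decode Map)]
ExtSel: sExt16, uExt16, High16

Control signal values that survive from the table (row and column labels are inferred from the values):

Instruction         OpSel   MemWrite   RegDst   next PC
ALU                 Func    no         rd       pc+4
ALUi                Op      no         rt       pc+4
ALUiu               Op      no         rt       pc+4
Load                +       no         rt       pc+4
Store               +       yes        *        pc+4
Branch (taken)      0?      no         *        br
Branch (not taken)  0?      no         *        pc+4
J                   *       no         *        jabs
JAL                 *       no         R31      jabs
JR                  *       no         *        rind
JALR                *       no         R31      rind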
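Hardwired control of this kind amounts to a lookup from instruction class to signal values. The sketch below builds such a lookup from the four columns of values given above; the row labels and the column names (`op_sel`, `mem_write`, `reg_dst`, `next_pc`) are inferred from the values and are assumptions, not labels taken from the slides.

```python
# Row labels are inferred (assumption): one per line of the control table.
ROWS = ["ALU", "ALUi", "ALUiu", "Load", "Store", "Branch taken",
        "Branch not taken", "J", "JAL", "JR", "JALR"]

# Column values copied from the table; "*" means don't care.
OP_SEL = "Func Op Op + + 0? 0? * * * *".split()
MEM_WRITE = "no no no no yes no no no no no no".split()
REG_DST = "rd rt rt rt * * * * R31 * R31".split()
NEXT_PC = "pc+4 pc+4 pc+4 pc+4 pc+4 br pc+4 jabs jabs rind rind".split()

CONTROL = {
    row: {"op_sel": op, "mem_write": mw, "reg_dst": rd, "next_pc": pc}
    for row, op, mw, rd, pc in zip(ROWS, OP_SEL, MEM_WRITE, REG_DST, NEXT_PC)
}

def select_next_pc(signal, pc_plus_4, branch_addr, jump_addr, reg_value):
    """Mux choosing the next PC from the four candidate sources."""
    sources = {"pc+4": pc_plus_4, "br": branch_addr,
               "jabs": jump_addr, "rind": reg_value}
    return sources[signal]

print(CONTROL["Store"], CONTROL["JAL"])
```

Because the table is a pure function of the instruction class (and the zero? status for branches), it needs no state, which is why it can be built from combinational logic alone.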
Pipelined MIPS
To pipeline MIPS:
- First build MIPS without pipelining with CPI=1
- Next, add pipeline registers to reduce cycle time while maintaining CPI=1
Pipelined Datapath
[Figure: datapath divided into phases: PC, 0x4 adder, Inst. Memory (addr, rdata), IR, GPRs (we, rs1, rs2, rd1, ws, wd, rd2), Imm Ext, ALU, Data Memory (we, addr, wdata, rdata)]
Phases: fetch, decode & Reg-fetch, execute, memory, write-back
- Clock period can be reduced by dividing the execution of an instruction into multiple cycles: $t_C > \max\{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\}$ (= $t_{DM}$ probably)
- However, CPI will increase unless instructions are pipelined
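A small calculation shows why dividing execution into phases does not help unless instructions overlap. The phase delays below are hypothetical values picked for illustration, and the variable names are invented for this sketch.

```python
# Hypothetical phase delays in arbitrary time units (assumptions).
delays = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}

# The clock must be at least as long as the slowest phase.
t_c = max(delays.values())

# Single-cycle design: one long cycle covering all phases, CPI = 1.
single_cycle_time = sum(delays.values())

# Multi-cycle but unpipelined: five short cycles per instruction, CPI = 5.
multicycle_time = len(delays) * t_c

# Pipelined, ideal case: one instruction completes every cycle, CPI = 1.
pipelined_time = t_c

print(t_c, single_cycle_time, multicycle_time, pipelined_time)
```

With these numbers the unpipelined multi-cycle design is slower per instruction than the single-cycle one; only overlapping instructions recovers the benefit of the shorter clock.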
An Ideal Pipeline
[Figure: stage 1 -> stage 2 -> stage 3 -> stage 4]
- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?
Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance
Alternative Pipelining
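[Figure: the same datapath with phases combined: PC, 0x4 adder, Inst. Memory (addr, rdata), IR, GPRs (we, rs1, rs2, rd1, ws, wd, rd2), Imm Ext, ALU, Data Memory (we, addr, wdata, rdata); labels include fetch phase and execute phase]
$t_C > \max\{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\}$
$t_C > \max\{t_{IM}, t_{RF} + t_{ALU}, t_{DM} + t_{RW}\}$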
Clock periods and speedup (column labels inferred from the values):

unpipelined tC    pipelined tC    speedup
27                10              2.7
25                10              2.5
25                5               5.0
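The effect of combining stages on speedup can be checked numerically. This sketch assumes all five phase delays equal 5 time units, a hypothetical choice that happens to give an unpipelined time of 25; the function names and the grouping of phases into stages are illustrative assumptions.

```python
def pipelined_clock(stages):
    """Clock period of a pipeline: the slowest stage, where each stage
    is a list of phase delays that are executed back to back."""
    return max(sum(stage) for stage in stages)

def speedup(unpipelined_tc, pipelined_tc):
    return unpipelined_tc / pipelined_tc

# Hypothetical delays: every phase takes 5 time units (assumption).
t_im = t_rf = t_alu = t_dm = t_rw = 5
unpipelined = t_im + t_rf + t_alu + t_dm + t_rw

five_stage = pipelined_clock([[t_im], [t_rf], [t_alu], [t_dm], [t_rw]])
merged = pipelined_clock([[t_im], [t_rf, t_alu], [t_dm, t_rw]])

print(speedup(unpipelined, five_stage), speedup(unpipelined, merged))
```

With equal phase delays, merging stages doubles the clock period and halves the speedup; merging costs nothing only when the combined stage is still no slower than the slowest remaining stage.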
Thank you!