Outline
Computer organization vs. architecture
Processor architecture
Pipeline architecture
Data, resource and branch hazards
Hardware abstraction
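[Figure: CPU (register file, PC, ALU, bus interface) connected by the system bus to a bridge; the memory bus links the bridge to main memory; the I/O bus links it to the USB controller (mouse, keyboard), the graphics adapter (display) and the disk controller (disk).]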
Hardware/software interface
Layers from software down to hardware: C++, machine instructions, registers and adders, transistors
Architecture focus: machine instructions and the register/adder level
Instruction set architecture: lowest level visible to a programmer
Microarchitecture: fills the gap between instructions and logic modules
Layer of Abstraction
Above: how to program the machine (HLL, OS)
Below: what needs to be built, and the tricks to make it run fast
Programmer-Visible State
PC (program counter)
Register file
Instructions
Language of the machine
Easily interpreted
Primitive compared to HLLs
Instructions
All MIPS instructions are 32 bits long; arithmetic instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code: A = B + C
MIPS code: add $s0, $s1, $s2 (registers associated with variables by compiler)
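To make the fixed 32-bit format concrete, here is a minimal sketch that packs the example instruction into a machine word. The helper name `encode_r_type` is hypothetical; the field layout and the register numbers are the standard MIPS R-type conventions, not something shown on the slide.

```python
# Standard MIPS register numbers: $s0..$s7 are registers 16..23.
REGISTERS = {f"$s{i}": 16 + i for i in range(8)}

ADD_FUNCT = 0x20  # funct code of the MIPS add instruction


def encode_r_type(rd, rs, rt, funct, shamt=0, opcode=0):
    """Pack an R-type word: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (
        (opcode << 26)
        | (REGISTERS[rs] << 21)
        | (REGISTERS[rt] << 16)
        | (REGISTERS[rd] << 11)
        | (shamt << 6)
        | funct
    )


# add $s0, $s1, $s2: destination first, then the two source registers.
word = encode_r_type("$s0", "$s1", "$s2", ADD_FUNCT)
print(f"{word:#010x}")  # 0x02328020
```

Note that the assembly order (destination first) differs from the bit order, where the two source fields come before the destination field.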
Design overview
Use the program counter (PC) to supply the instruction address
Get the instruction from memory
Read registers
Use the instruction to decide exactly what to do
[Figure: datapath (PC, instruction, ALU, address and data ports of memory) receiving control signals from the CONTROLLER and returning status signals to it.]
Elements that contain state (sequential)
Output is a function of current and previous inputs
State = memory
Examples: flip-flops, counters, registers, register files, memories
Components for MIPS subset
Register
Adder
ALU
Multiplexer
Register file
Program memory
Data memory
Bit manipulation components
Components - register
[Figure: PC register with a 32-bit input, a 32-bit output and a clock input.]
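Components - adder
[Figure: two 32-bit adders. The first adds 4 to PC and outputs PC+4; the second adds the offset to PC+4 and outputs PC+4+offset.]
Components - ALU
[Figure: ALU with two 32-bit inputs, an operation input, a 32-bit result, and the status outputs a=b and overflow.]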
[Figure: multiplexer (mux) with 32-bit data inputs and output, and a 1-bit select.]
[Figure: register file (Registers) with data ports and the RegWrite control; instruction memory with an address input; data memory with address, read data and write data ports and the MemRead control.]
[Figure: bit manipulation components. Sign extend: 16 bits in, 32 bits out. Shift: 32 bits in, 32 bits out, with 0 entering at the LSB.]
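The two bit manipulation components can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the shift amount of 2 is an assumption based on the unit labelled s2 in the later datapath figures.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_16(value):
    """Extend a 16-bit value to 32 bits by copying bit 15 into bits 31..16."""
    value &= 0xFFFF
    if value & 0x8000:
        value |= 0xFFFF0000
    return value


def shift_left_2(value):
    """Shift a 32-bit value left by two places; zeros enter at the LSB end."""
    return (value << 2) & MASK32


print(hex(sign_extend_16(0xFFFC)))  # 0xfffffffc (the 16-bit pattern for -4)
print(hex(shift_left_2(0x3)))       # 0xc
```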
Arithmetic-logic instructions: add, sub, and, or, slt
Memory reference instructions: lw, sw
Control flow instructions: beq, j
actions required
Fetching instruction
[Figure: PC drives the address input (ad) of the instruction memory (IM), which outputs the instruction (ins).]
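Addressing RF
[Figure: PC and IM as before; fields of the fetched instruction address the register file (RF), which outputs rd1 and rd2.]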
[Figure: RF outputs rd1 and rd2 feed the ALU; ins[25-21] and ins[20-16] address the registers to read, and ins[15-11] addresses the register to write.]
Incrementing PC
[Figure: datapath so far (PC, IM, RF with ins[15-11], rd1, rd2, ALU) with the PC update path added.]
Adding sw instruction
[Figure: datapath extended with the data memory (DM: ad, wd, rd), sign extension (sx) of ins[15-0], and a 0/1 multiplexer.]
Adding lw instruction
[Figure: datapath with DM (ad, wd, rd), sign extension (sx) of ins[15-0], and two multiplexers (0/1 and 1/0).]
[Figure: the lw/sw datapath extended with a shift-by-2 unit (s2).]
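Adding j instruction
[Figure: ins[25-0] passes through the shift-by-2 unit (s2) to give 28 bits, which are combined with PC+4[31-28] to form the jump address ja[31-0]; multiplexers select the next PC. The rest of the datapath (IM, RF, ALU, DM, sx) is unchanged.]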
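The jump address formation in the figure can be written out as a short sketch. The function name and the example values are hypothetical; the bit fields are the ones labelled in the figure (ins[25-0], PC+4[31-28], ja[31-0]).

```python
def jump_address(instruction, pc_plus_4):
    """Form ja[31-0] from ins[25-0] and the top four bits of PC+4."""
    target26 = instruction & 0x03FFFFFF   # ins[25-0]
    low28 = target26 << 2                 # shift by 2 gives 28 bits
    high4 = pc_plus_4 & 0xF0000000        # PC+4[31-28]
    return high4 | low28


# Hypothetical example: opcode 2 (j) with a 26-bit target field of 0x0100000.
instruction = (0x02 << 26) | 0x0100000
print(hex(jump_address(instruction, 0x40000004)))  # 0x40400000
```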
Control signals
[Figure: the datapath annotated with its control signals: jmp, Psrc, Asrc, op, MR, Rdst, M2R.]
Datapath + Control
[Figure: the full datapath with a control unit that decodes ins[31-26] (opc) and an ALU control (Actrl) that decodes ins[5-0]; control signals shown: jmp, brn, Psrc, Asrc, op, MR, Rdst, M2R.]
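As a rough illustration of what the control block does, the sketch below maps ins[31-26] to a set of control signals for the five instruction classes of the subset. The names Rdst, Asrc, M2R, MR, brn and jmp come from the figure; RW (register write) and MW (memory write) are assumed names, the opcode values are the standard MIPS ones, and the polarity of each select (1 picks ins[15-11] for Rdst, the sign-extended immediate for Asrc, and data memory for M2R) is an assumption. Don't-care entries are written as 0.

```python
OPCODES = {"R-type": 0x00, "lw": 0x23, "sw": 0x2B, "beq": 0x04, "j": 0x02}

# RW and MW are assumed signal names; the others appear in the figure.
SIGNALS = ("Rdst", "Asrc", "M2R", "RW", "MR", "MW", "brn", "jmp")

CONTROL_TABLE = {
    OPCODES["R-type"]: (1, 0, 0, 1, 0, 0, 0, 0),
    OPCODES["lw"]:     (0, 1, 1, 1, 1, 0, 0, 0),
    OPCODES["sw"]:     (0, 1, 0, 0, 0, 1, 0, 0),
    OPCODES["beq"]:    (0, 0, 0, 0, 0, 0, 1, 0),
    OPCODES["j"]:      (0, 0, 0, 0, 0, 0, 0, 1),
}


def decode(instruction):
    """Return the control signals for a 32-bit instruction word."""
    opcode = (instruction >> 26) & 0x3F   # ins[31-26]
    return dict(zip(SIGNALS, CONTROL_TABLE[opcode]))


print(decode(0x23 << 26))  # control signals for lw
```

A table lookup keeps the sketch short; real control logic would be derived as combinational equations from the same table.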
Analyzing performance
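Component delays
Register: 0
Adder: $t_+$
ALU: $t_A$
Multiplexer: 0
Register file: $t_R$
Program memory: $t_I$
Data memory: $t_M$
Bit manipulation components: 0
Delay: $\max(t_+,\ t_I + t_R + t_A + t_R)$
[Figure: path from PC through IM (ad, ins), RF (rad1 = ins[25-21], rad2 = ins[20-16], wad = ins[15-11]; outputs rd1, rd2) and ALU, back to the RF write data input (wd).]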
Delay: $\max(t_+,\ t_I + t_R + t_A + t_M)$
[Figure: path from PC through IM, RF (rad1 = ins[25-21], rad2 = ins[20-16]), sign extension (sx) of the 16-bit ins[15-0], and ALU to DM (ad, wd, rd).]
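The two delay expressions can be evaluated for concrete numbers. The values below are hypothetical (the slides give only symbols), and the function names are illustrative. In a single cycle design the clock period has to cover the slowest path, which is what the last line computes.

```python
# Hypothetical component delays in nanoseconds; only the symbols are given.
delays = {"t_plus": 1.0, "t_A": 2.0, "t_R": 1.0, "t_I": 2.0, "t_M": 2.0}


def register_path_delay(d):
    """max(t+, tI + tR + tA + tR): the path that ends with a register file write."""
    return max(d["t_plus"], d["t_I"] + d["t_R"] + d["t_A"] + d["t_R"])


def memory_path_delay(d):
    """max(t+, tI + tR + tA + tM): the path that ends at the data memory."""
    return max(d["t_plus"], d["t_I"] + d["t_R"] + d["t_A"] + d["t_M"])


clock_period = max(register_path_delay(delays), memory_path_delay(delays))
print(register_path_delay(delays), memory_path_delay(delays), clock_period)
```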
[Figure: design space with the axes CPI (low) and clock cycle (short), locating the single cycle design and the pipelined design.]
[Figure: pipelined datapath divided into the stages ID, EX, Mem and WB by the pipeline registers ID/EX, EX/Mem and Mem/WB; components: PC, IM (ad, ins), RF (rad, wad, wd, rd1, rd2), ALU, DM (ad, wd, rd).]
[Figure: pipelined datapath with control. Added signals: PCw (PC write), IF/IDw (IF/ID register write), flush and bubble. To stall the pipeline: PCw=0, IF/IDw=0, bubble=1.]
Graphical representation
5-stage pipeline
[Figure: the stages IF, ID, EX, Mem, WB and the resources used in each (IM, RF, ALU, DM, RF), drawn for the instructions lw, sw, add and beq.]
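The overlap in the figure can be tabulated with a small sketch. It assumes an ideal pipeline, with one instruction entering per cycle and no stalls or flushes; the function name is hypothetical.

```python
STAGES = ("IF", "ID", "EX", "Mem", "WB")


def pipeline_schedule(instructions):
    """Map each instruction to {stage: cycle}, starting one instruction per cycle."""
    return [
        (name, {stage: start + k + 1 for k, stage in enumerate(STAGES)})
        for start, name in enumerate(instructions)
    ]


schedule = pipeline_schedule(["lw", "sw", "add", "beq"])
for name, cycles in schedule:
    print(f"{name:4}", cycles)

# The last instruction leaves WB in cycle (number of instructions + 4).
total_cycles = schedule[-1][1]["WB"]
print(total_cycles)  # 8
```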
Pipelining
Simple multicycle design:
Resource sharing across cycles
Instructions need not all take the same number of cycles
Stages: IF, RF, EX/AG, WB
Degree of overlap: serial, overlapped, pipelined
Depth: shallow, deep
Hazards in Pipelining
Procedural dependencies => Control hazards
  conditional and unconditional branches, calls/returns
Data dependencies => Data hazards
  RAW (read after write)
  WAR (write after read)
  WAW (write after write)
Resource conflicts => Structural hazards
  use of the same resource in different stages
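The three kinds of data dependency can be detected by comparing the registers that two instructions read and write. The sketch below is illustrative: the dictionary representation of an instruction and the function name are assumptions, not part of the slides.

```python
def data_hazards(first, second):
    """Classify dependencies between two instructions in program order.

    Each instruction is a dict with 'reads' and 'writes' sets of register names.
    """
    kinds = set()
    if first["writes"] & second["reads"]:
        kinds.add("RAW")   # second reads what first writes
    if first["reads"] & second["writes"]:
        kinds.add("WAR")   # second writes what first reads
    if first["writes"] & second["writes"]:
        kinds.add("WAW")   # both write the same register
    return kinds


add = {"reads": {"$s1", "$s2"}, "writes": {"$s0"}}   # add $s0, $s1, $s2
sub = {"reads": {"$s0", "$s3"}, "writes": {"$s4"}}   # sub $s4, $s0, $s3
print(data_hazards(add, sub))  # {'RAW'}
```

Whether a detected dependency actually causes a stall depends on the pipeline organization, which this sketch does not model.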
Data Hazards
[Figure: timeline of the read/write steps of the previous instr and of the current instr; delay = 3.]
Structural Hazards
Caused by Resource Conflicts
Use of a hardware resource in more than one cycle
[Figure: reservation-table examples with the resources A, B, C, D and F used in several cycles, and entries marked X.]
Control Hazards
[Figure: timeline of the branch instr (with cond eval), the next inline instr and the target instr, with delay = 2 and delay = 5 marked.]
The order of cond eval and target addr gen may be different
Cond eval may be done in a previous instruction
Pipeline Performance
T
S stages
Frequency of interruptions: b
Branch Speed Up
Reduce time for computing CC and TIF
Branch Prediction
Guess the outcome and proceed, undo if necessary
Branch Elimination
Use conditional instructions (predicated execution)
[Figure: a branch on condition C (T/F) around statement S is replaced by the predicated form C:S.]
Branch Prediction
Treat conditional branches as unconditional branches / NOP
Undo if necessary
Strategies:
  Fixed (always guess inline)
  Static (guess on the basis of instruction type)
  Dynamic (guess based on recent history)
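One common way to realize the dynamic strategy is a 2-bit saturating counter per branch. The slides do not say which dynamic scheme is meant, so the sketch below is an assumed example; the class name, the initial counter value and the branch address are hypothetical.

```python
class TwoBitPredictor:
    """Counter values 0 and 1 predict inline; 2 and 3 predict target."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, addr):
        """Return True if the branch is predicted to go to its target."""
        return self.counters.get(addr, 0) >= 2

    def update(self, addr, taken):
        """Move the counter one step toward the actual outcome, saturating."""
        c = self.counters.get(addr, 0)
        self.counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)


predictor = TwoBitPredictor()
outcomes = [True, True, True, False, True, True]  # recent history of one branch
correct = 0
for taken in outcomes:
    if predictor.predict(0x400) == taken:
        correct += 1
    predictor.update(0x400, taken)
print(correct, "of", len(outcomes))  # 3 of 6
```

After the first two outcomes train the counter, the single inline outcome no longer flips the prediction, which is the point of using two bits instead of one.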
[Figure: branch target buffer (BTB) entry with the fields instr addr and pred stats.]
BTB Performance
BTB miss (.4): decision go inline
  result inline (.8): delay 0
  result target (.2): delay 5
BTB hit (.6): decision go to target
  result target (.8): delay 0
  result inline (.2): delay 4
Eff. delay = .4*.8*0 + .4*.2*5 + .6*.2*4 + .6*.8*0 = 0.88
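The effective delay on this slide is a probability-weighted sum over the four outcomes, which can be checked with a short sketch. The data layout is illustrative; the probabilities and delays are the ones on the slide, with .4 and .6 read as the miss and hit rates because that reading reproduces the stated 0.88.

```python
# Each case: (probability of the BTB outcome,
#             [(probability of the branch result, delay in cycles), ...])
cases = {
    "BTB miss, go inline": (0.4, [(0.8, 0), (0.2, 5)]),
    "BTB hit, go to target": (0.6, [(0.8, 0), (0.2, 4)]),
}

effective_delay = sum(
    p_btb * p_result * delay
    for p_btb, results in cases.values()
    for p_result, delay in results
)
print(round(effective_delay, 2))  # 0.88
```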
Compute/fetch scheme
(no dynamic branch prediction)
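[Figure: the instruction fetch address selects the instructions I, I+1, I+2, I+3 in the I-cache; an adder (+) produces the next sequential address; the branch target instructions are BTI, BTI+1, BTI+2, BTI+3.]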
BTAC scheme
[Figure: as in the compute/fetch scheme, with a BTAC added that is looked up with the branch address (BA) and supplies the branch target address (BTA); labels: IIFA, + (next sequential address), BTI, BTI+1, BTI+2, BTI+3.]
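The choice between the next sequential address and a cached target can be sketched as a table lookup. This is a simplification under stated assumptions: a fixed instruction size of 4 bytes, a BTAC modelled as a plain dictionary from branch address (BA) to branch target address (BTA), and hypothetical addresses.

```python
INSTRUCTION_BYTES = 4  # assumed fixed instruction size


def next_fetch_address(fetch_address, btac):
    """Return the BTA on a BTAC hit, otherwise the next sequential address."""
    return btac.get(fetch_address, fetch_address + INSTRUCTION_BYTES)


# Hypothetical BTAC content: the branch at 0x1008 goes to 0x2000.
btac = {0x1008: 0x2000}

print(hex(next_fetch_address(0x1000, btac)))  # 0x1004
print(hex(next_fetch_address(0x1008, btac)))  # 0x2000
```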
[Figure: six functional units (FU).]
Hierarchical structure
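CPU, followed by successive levels of memory:
Speed: fastest nearest the CPU, slowest farthest from it
Size: smallest nearest the CPU, biggest farthest from it
Cost / bit: highest nearest the CPU, lowest farthest from it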
Access: hit or miss
Read: Sequential / Concurrent, Simple / Forward
Load: Block load / Load forward / Wrap around
Replacement: LRU / LFU / FIFO / Random
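Of the replacement options listed, LRU is easy to sketch with an ordered dictionary. The example assumes a fully associative cache that tracks whole blocks; the class name and the access trace are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of block addresses with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used block comes first

    def access(self, block):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark as most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[block] = True
        return False


cache = LRUCache(capacity=2)
hits = [cache.access(b) for b in (1, 2, 1, 3, 2)]
print(hits)  # [False, False, True, False, False]
```

In the trace, the access to block 3 evicts block 2 (block 1 was used more recently), so the final access to block 2 misses again.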
Load policies
Block of 4 AUs (0 to 3), cache miss on AU 1:
Block Load
Load Forward
Fetch Bypass (wrap around load)
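The sketch below lists the order in which the AUs of a block would be loaded under each policy, for the slide's case of a miss on AU 1 in a 4-AU block. The slide's figure is not recoverable, so the orders shown are an assumption based on the usual meaning of these policy names; the function name is hypothetical.

```python
def load_order(policy, miss_au, block_size=4):
    """Return the AU indices in the order they are loaded after a miss."""
    if policy == "block load":
        # Load the whole block from its first AU.
        return list(range(block_size))
    if policy == "load forward":
        # Load from the missed AU to the end of the block only.
        return list(range(miss_au, block_size))
    if policy == "wrap around":
        # Start at the missed AU, then wrap to the start of the block.
        return [(miss_au + k) % block_size for k in range(block_size)]
    raise ValueError(f"unknown policy: {policy}")


for policy in ("block load", "load forward", "wrap around"):
    print(policy, load_order(policy, miss_au=1))
```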
Fetch Policies
Demand fetching: fetch only when required (miss)
Hardware prefetching: automatically prefetch the next block
Software prefetching: programmer decides to prefetch
  questions: how much ahead (prefetch distance), how often
Write Policies
Write hit: Write Back / Write Through
Write miss: Write Back / Write Through, each With Write Allocate or With No Write Allocate
On-chip | Off-chip
on-chip: fast but small
off-chip: large but slow
Thanks