Lec 02

Review of Computer Architecture
A Sahu, Deptt. of Comp. Sc. & Engg., IIT Guwahati
Outline
Computer organization vs. architecture
Processor architecture
Pipeline architecture
Data, resource and branch hazards
Superscalar & VLIW architecture
Memory hierarchy
Reference
Computer organization Vs Architecture

Comp Organization => Digital logic modules, logic and low level design
Comp Architecture => ISA design, microarchitecture design
Algorithms for designing the best microarchitecture, pipeline model, branch prediction strategy, memory management, etc.
Hardware abstraction
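[Figure: the CPU (register file, PC, ALU) connects through the bus interface and system bus to an I/O bridge. The bridge connects to main memory over the memory bus and to the I/O bus, which hosts the USB controller (mouse, keyboard), graphics adapter (display), disk controller (disk), and expansion slots for other devices such as network adapters.]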
Hardware/software interface
[Figure: levels from software (C++) through machine instructions and register/adder modules down to hardware (transistors); architecture focuses on the middle levels.]
Instruction set architecture: lowest level visible to a programmer
Micro architecture: fills the gap between instructions and logic modules
Instruction Set Architecture

Assembly language view:
Processor state (RF, mem)
Instruction set and encoding
Layer of abstraction:
[Figure: application program, compiler, and OS sit above the ISA; CPU design, circuit design, and chip layout sit below it.]
Above: how to program the machine (HLL, OS)
Below: what needs to be built, tricks to make it run fast
The Abstract Machine

[Figure: the CPU (registers, PC, ALU, condition codes) sends addresses to memory and exchanges data and instructions with it; memory holds code + data and the stack.]
Programmer-Visible State
PC: program counter
Register file: heavily used data
Condition codes
Memory: byte array holding code + data and the stack
Instructions
Language of the machine
Easily interpreted
Primitive compared to HLLs
Instruction set design goals

maximize performance, minimize cost, reduce design time
Instructions
All MIPS instructions are 32 bits long; arithmetic instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code: A = B + C
MIPS code: add $s0, $s1, $s2 (registers are associated with variables by the compiler)
Register numbers run from 0 to 31, e.g., $t0=8, $t1=9, $s0=16, $s1=17, etc.
Encoding of add $t0, $s1, $s2:
op     rs    rt    rd    shamt funct
000000 10001 10010 01000 00000 100000
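The field layout above can be checked with a small sketch. The helper name `encode_r_type` is hypothetical (not from the lecture); the field widths 6/5/5/5/5/6 are those of the MIPS R format shown above.

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack the six R-format fields into one 32-bit word (illustrative helper)."""
    layout = (("op", op, 6), ("rs", rs, 5), ("rt", rt, 5),
              ("rd", rd, 5), ("shamt", shamt, 5), ("funct", funct, 6))
    for name, value, width in layout:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t0, $s1, $s2: rs = $s1 (17), rt = $s2 (18), rd = $t0 (8), funct = 32
word = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=32)
bits = f"{word:032b}"
```

Here `bits` reproduces the six binary groups of the slide when split at widths 6, 5, 5, 5, 5, 6.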
Instructions LD/ST & Control

Load and store instructions
Example:
C code: A[8] = h + A[8];
MIPS code: lw $t0, 32($s3)
           add $t0, $s2, $t0
           sw $t0, 32($s3)
Example: lw $t0, 32($s2)
35  18  8   32
op  rs  rt  16 bit number
Example:
if (i != j)      beq $s4, $s5, Lab1
    h = i + j;   add $s3, $s4, $s5
else             j Lab2
    h = i - j;   Lab1: sub $s3, $s4, $s5
                 Lab2: ...
What constitutes ISA?

Set of basic/primitive operations
Arithmetic, Logical, Relational, Branch/jump, Data movement
Storage structure: registers/memory
Register-less machine, ACC based machine, a few special purpose registers, several general purpose registers, large number of registers
How addresses are specified
Direct, Indirect, Base vs. Index, Auto incr and auto decr, Pre (post) incr/decr, Stack
How operands are specified
3 address machine: r1 = r2 + r3
2 address machine: r1 = r1 + r2
1 address machine: Acc = Acc + x (Acc is implicit)
0 address machine: add values on top of stack
How instructions are encoded
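To make the operand styles concrete, here is a minimal sketch that computes a = b + c on a 0 address (stack) machine and on a 1 address (accumulator) machine. The mnemonics `push`, `pop`, `load`, `store` and both interpreter functions are hypothetical, chosen only for illustration.

```python
def run_stack_machine(program, memory):
    """0 address machine: 'add' names no operands, it uses the top of stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown operation {op}")
    return memory


def run_accumulator_machine(program, memory):
    """1 address machine: the accumulator is the implicit second operand."""
    acc = 0
    for op, addr in program:
        if op == "load":
            acc = memory[addr]
        elif op == "add":
            acc = acc + memory[addr]
        elif op == "store":
            memory[addr] = acc
        else:
            raise ValueError(f"unknown operation {op}")
    return memory


stack_result = run_stack_machine(
    [("push", "b"), ("push", "c"), ("add",), ("pop", "a")],
    {"a": 0, "b": 2, "c": 3})
acc_result = run_accumulator_machine(
    [("load", "b"), ("add", "c"), ("store", "a")],
    {"a": 0, "b": 2, "c": 3})
```

Both programs leave 5 in `a`; the stack version needs four instructions with fewer explicit operands, the accumulator version three.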
RISC vs. CISC

RISC: uniformity of instructions, simple set of operations and addressing modes, register based architecture with 3 address instructions
RISC: virtually all new ISAs since 1982
ARM, MIPS, SPARC, HP's PA-RISC, PowerPC, Alpha, CDC 6600
CISC: minimize code size, make assembly language easy
VAX: instructions from 1 to 54 bytes long!
Motorola 680x0, Intel 80x86
MIPS subset for implementation

Arithmetic - logic instructions: add, sub, and, or, slt
Memory reference instructions: lw, sw
Control flow instructions: beq, j
Incremental changes in the design to include other instructions will be discussed later
Design overview
Use the program counter (PC) to supply the instruction address
Get the instruction from memory
Read registers
Use the instruction to decide exactly what to do
[Figure: the PC addresses the instruction memory; the fetched instruction supplies register numbers to the register file; register data feed the ALU; the ALU output addresses the data memory.]
Division into data path and control
[Figure: the controller sends control signals to the data path and receives status signals from it.]
Building block types

Two types of functional units:
Elements that operate on data values (combinational)
output is a function of current input, no memory
Examples: gates (and, or, nand, nor, xor, inverter), multiplexer, decoder, adder, subtractor, comparator, ALU, array multipliers
Elements that contain state (sequential)
output is a function of current and previous inputs
state = memory
Examples: flip-flops, counters, registers, register files, memories
Components for MIPS subset
Register, Adder, ALU, Multiplexer, Register file, Program memory, Data memory, Bit manipulation components
Components - register
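[Figure: 32-bit PC register with a 32-bit input, a 32-bit output, and a clock input.]
Components - adder
[Figure: two 32-bit adders. One adds the constant 4 to PC to produce PC+4; the other adds a 32-bit offset to PC+4.]
Components - ALU
[Figure: ALU with two 32-bit inputs (a, b), an operation select input, a 32-bit result, and status outputs a=b and overflow.]
Components - multiplexers
[Figure: multiplexer choosing between the 32-bit inputs PC+4 and PC+4+offset under a 1-bit select signal, with a 32-bit output.]
Components - register file
[Figure: register file with three 5-bit register number inputs (read register 1, read register 2, write register), a write data input, two data outputs (read data 1, read data 2), and a RegWrite control input.]
Components - program memory
[Figure: instruction memory with an instruction address input and an instruction output.]
MIPS components - data memory
[Figure: data memory with address and write data inputs, a read data output, and MemWrite and MemRead control inputs.]
Components - bit manipulation circuits
[Figure: sign extend unit (16-bit input, 32-bit output) and shift unit (32-bit input, 32-bit output, with 0 filled in at the LSB end).]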
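The two bit manipulation circuits can be modelled in a few lines. This is a sketch with hypothetical function names; the shift amount of 2 is an assumption, based on the unit labelled s2 in the later datapath figures and on MIPS instructions being word aligned.

```python
def sign_extend_16(value):
    """Extend a 16-bit value to 32 bits by replicating bit 15 (the sign bit)."""
    value &= 0xFFFF
    if value & 0x8000:
        value |= 0xFFFF0000
    return value


def shift_left_2(value):
    """Shift a 32-bit value left by 2; zeros enter at the LSB end (assumed amount)."""
    return (value << 2) & 0xFFFFFFFF


positive = sign_extend_16(0x0020)      # upper 16 bits stay 0
negative = sign_extend_16(0xFFFC)      # upper 16 bits become 1
word_offset = shift_left_2(negative)   # word offset converted to a byte offset
```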
MIPS subset for implementation
Arithmetic - logic instructions: add, sub, and, or, slt
Memory reference instructions: lw, sw
Control flow instructions: beq, j
Datapath for add, sub, and, or, slt
Actions required:
Fetch instruction
Address the register file
Pass operands to ALU
Pass result to register file
Increment PC
Format: add $t0, $s1, $s2
op     rs    rt    rd    shamt funct
000000 10001 10010 01000 00000 100000
Fetching instruction
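[Figure: the PC drives the address input (ad) of the instruction memory (IM), which outputs the instruction (ins).]
Addressing RF
[Figure: ins[25-21] and ins[20-16] drive the read address inputs rad1 and rad2 of the register file (RF); the RF outputs are rd1 and rd2.]
Passing operands to ALU
[Figure: the RF outputs rd1 and rd2 feed the two ALU inputs.]
Passing the result to RF
[Figure: ins[15-11] drives the RF write address (wad) and the ALU result drives the RF write data (wd).]
Incrementing PC
[Figure: the complete datapath for these instructions, with the PC incremented for the next instruction.]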
Load and Store instructions

Format: I
Example: lw $t0, 32($s2)
35  18  8   32
op  rs  rt  16 bit number
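As a sketch of the I format, the fields of lw $t0, 32($s2) can be packed as follows. The helper `encode_i_type` is hypothetical; $t0 is taken as register 8 and $s2 as register 18, following the register numbering given earlier in the lecture.

```python
def encode_i_type(op, rs, rt, imm):
    """Pack op (6 bits), rs (5), rt (5) and a signed 16-bit immediate."""
    if not 0 <= op < 64 or not 0 <= rs < 32 or not 0 <= rt < 32:
        raise ValueError("field out of range")
    if not -(1 << 15) <= imm < (1 << 15):
        raise ValueError("immediate does not fit in 16 bits")
    # Negative immediates are stored in two's complement in the low 16 bits.
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# lw $t0, 32($s2): op = 35, rs = $s2 (18), rt = $t0 (8), offset = 32
lw_word = encode_i_type(op=35, rs=18, rt=8, imm=32)
# A negative offset, e.g. lw $t0, -4($s2), keeps the same upper fields.
lw_negative = encode_i_type(op=35, rs=18, rt=8, imm=-4)
```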
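Adding sw instruction
[Figure: datapath extended for sw. ins[15-0] is sign extended (sx) from 16 to 32 bits and a mux selects it as the second ALU input; the ALU result addresses the data memory (DM) and rd2 supplies the DM write data.]
Adding lw instruction
[Figure: datapath extended for lw. A mux selects ins[20-16] or ins[15-11] as the RF write address, and a mux selects the DM read data or the ALU result as the RF write data.]
Adding beq instruction
[Figure: datapath extended for beq. The sign extended offset passes through the shift unit (s2) and is added to PC+4; a mux selects PC+4 or the branch target as the next PC.]
Adding j instruction
[Figure: datapath extended for j. ins[25-0] passes through a shift unit (s2) to give 28 bits, which are combined with PC+4[31-28] to form the jump address ja[31-0]; a further mux selects ja as the next PC.]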
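The next-PC values formed in the beq and j datapaths can be sketched in software. The function names are hypothetical; the sketch assumes the standard MIPS rules that the s2 unit shifts left by 2 and that the branch offset is relative to PC+4.

```python
MASK32 = 0xFFFFFFFF


def jump_address(pc, instruction):
    """ja[31-0] = PC+4[31-28] combined with ins[25-0] shifted left by 2."""
    pc_plus_4 = (pc + 4) & MASK32
    target_26 = instruction & 0x03FFFFFF
    return (pc_plus_4 & 0xF0000000) | (target_26 << 2)


def branch_target(pc, instruction):
    """Branch target = PC+4 + (sign extended ins[15-0] shifted left by 2)."""
    offset = instruction & 0xFFFF
    if offset & 0x8000:          # negative 16-bit offset
        offset -= 0x10000
    return (pc + 4 + (offset << 2)) & MASK32


j_word = (2 << 26) | 0x100       # opcode 2 is j in MIPS; target field 0x100
ja = jump_address(0x40000000, j_word)
forward = branch_target(0x1000, 3)          # 3 words past PC+4
backward = branch_target(0x1000, 0xFFFF)    # offset -1: back to the branch itself
```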
Control signals
[Figure: complete datapath annotated with the control signals jmp, Psrc, Asrc, op, Rdst, RW, MR, MW, and M2R, and the ALU status output Z.]
Datapath + Control
[Figure: the same datapath with a control block driven by ins[31-26] and an ALU control block (Actrl) driven by ins[5-0] and opc; the signals shown also include brn, Z, and Psrc.]
Analyzing performance
Component delays
Register                      0
Adder                         t+
ALU                           tA
Multiplexer                   0
Register file                 tR
Program memory                tI
Data memory                   tM
Bit manipulation components   0
Delay for {add, sub, and, or, slt}
max( t+ , tI + tR + tA + tR )
[Figure: datapath with the path PC, IM, RF, ALU, RF highlighted.]
Delay for {sw}
max( t+ , tI + tR + tA + tM )
[Figure: datapath with the path PC, IM, RF, ALU, DM highlighted.]
Clock period in single cycle design

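[Figure: timing chart of the steps each instruction class needs]
R-class: tI, tR, tA, tR
lw:      tI, tR, tA, tM, tR
sw:      tI, tR, tA, tM
beq:     tI, tR, tA (adder delays t+ in parallel)
j:       tI (adder delay t+ in parallel)
clock period: long enough for the slowest instruction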
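A minimal sketch of how the single cycle clock period follows from the component delays. The R-class and sw expressions are the ones given on the delay slides; the lw expression is read off the timing chart, and the numeric delays are made-up values for illustration only.

```python
def instruction_delays(t_plus, t_i, t_r, t_a, t_m):
    """Critical path per instruction class, using the slide's delay symbols."""
    return {
        "R-class": max(t_plus, t_i + t_r + t_a + t_r),
        "lw": max(t_plus, t_i + t_r + t_a + t_m + t_r),  # read from the chart
        "sw": max(t_plus, t_i + t_r + t_a + t_m),
    }


def single_cycle_clock_period(delays):
    """In a single cycle design every instruction takes one clock period,
    so the period must cover the slowest instruction."""
    return max(delays.values())


# Hypothetical delays (arbitrary time units), not taken from the lecture.
delays = instruction_delays(t_plus=1, t_i=2, t_r=1, t_a=2, t_m=2)
period = single_cycle_clock_period(delays)
```

With these assumed numbers lw sets the clock period, so the faster R-class and sw instructions waste part of every cycle.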
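Clock period in multi-cycle design
[Figure: the same timing chart, with each instruction broken into steps]
R-class: tI, tR, tA, tR
lw:      tI, tR, tA, tM, tR
sw:      tI, tR, tA, tM
beq:     tI, tR, tA (adder delays t+ in parallel)
j:       tI (adder delay t+ in parallel)
clock period: long enough for the slowest single step
Cycle time and CPI
design         CPI    cycle time
multi-cycle    high   short
single cycle   low    long
pipelined      low    short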
Pipelined datapath (abstract)
[Figure: datapath split into the stages IF, ID, EX, Mem, WB, separated by the pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB.]
Fetch new instruction every cycle
[Figure: the same five stage datapath, with the PC adder in the IF stage supplying a new instruction address every cycle.]
Pipelined processor design
[Figure: pipelined datapath with control, including the signals PCw, IF/IDw, flush, and bubble; annotated with PCw=0, IF/IDw=0, bubble=1.]
Graphical representation
5 stage pipeline
actions: IF  ID  EX   Mem  WB
stages:  IM  RF  ALU  DM   RF
Usage of stages by instructions
[Figure: the stages IM, RF, ALU, DM, RF drawn for each of lw, sw, add, and beq, marking which stages each instruction uses.]
Pipelining
Simple multicycle design:
Resource sharing across cycles
All instructions may not take the same number of cycles
[Figure: instruction steps IF, RF, EX/AG, WB]
Faster throughput with pipelining
Degree of overlap: serial, overlapped, pipelined
Depth: shallow, deep
Hazards in Pipelining
Procedural dependencies => Control hazards
cond and uncond branches, calls/returns
Data dependencies => Data hazards
RAW (read after write)
WAR (write after read)
WAW (write after write)
Resource conflicts => Structural hazards
use of same resource in different stages
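The three data dependency types can be detected from the registers each instruction reads and writes. This is a sketch with a hypothetical helper; it reports potential hazards only, since whether a dependency actually stalls the pipeline depends on the pipeline design.

```python
def classify_data_hazards(first, second):
    """Return the dependency types between two instructions.

    `first` comes earlier in program order. Each instruction is a dict
    with 'reads' and 'writes' sets of register names.
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")   # second reads what first writes
    if first["reads"] & second["writes"]:
        hazards.add("WAR")   # second writes what first reads
    if first["writes"] & second["writes"]:
        hazards.add("WAW")   # both write the same register
    return hazards


add_instr = {"reads": {"$s1", "$s2"}, "writes": {"$t0"}}   # add $t0, $s1, $s2
sub_instr = {"reads": {"$t0", "$s3"}, "writes": {"$t1"}}   # sub $t1, $t0, $s3
or_instr = {"reads": {"$s4", "$s5"}, "writes": {"$s1"}}    # or  $s1, $s4, $s5

raw_case = classify_data_hazards(add_instr, sub_instr)
war_case = classify_data_hazards(add_instr, or_instr)
```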
Data Hazards
[Figure: timelines of the previous and current instructions, each marked with its read/write point; delay = 3.]
Structural Hazards
Caused by resource conflicts:
Use of a hardware resource in more than one cycle
Different sequences of resource usage by different instructions
Non-pipelined multi-cycle resources
[Figures: resource usage patterns over resources A, B, C, D, F, and X illustrating each case.]
Control Hazards
[Figure: timelines of the branch instr, the next inline instr, and the target instr, marked with cond eval (delay = 2) and target addr gen (delay = 5).]
The order of cond eval and target addr gen may be different
Cond eval may be done in previous instruction
Pipeline Performance
[Figure: instruction time T divided into S stages]
Frequency of interruptions: b
CPI = 1 + (S - 1) * b
Time = CPI * T / S
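The two formulas can be evaluated directly. The sketch below uses hypothetical function names and made-up values for S, b, and T; only the formulas themselves come from the slide.

```python
def pipeline_cpi(stages, interruption_freq):
    """CPI = 1 + (S - 1) * b."""
    return 1 + (stages - 1) * interruption_freq


def time_per_instruction(total_time, stages, interruption_freq):
    """Time = CPI * T / S, where T / S is the cycle time of one stage."""
    return pipeline_cpi(stages, interruption_freq) * total_time / stages


# Assumed example: 5 stages, 10% of instructions interrupt the flow, T = 10 ns.
cpi = pipeline_cpi(stages=5, interruption_freq=0.1)
time_ns = time_per_instruction(total_time=10.0, stages=5, interruption_freq=0.1)
ideal_ns = time_per_instruction(total_time=10.0, stages=5, interruption_freq=0.0)
```

With b = 0 the time per instruction falls to T / S; each interruption costs S - 1 extra cycles, so deeper pipelines are hurt more by the same b.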
Improving Branch Performance

Branch Elimination
Replace branch with other instructions
Branch Speed Up
Reduce time for computing CC and TIF
Branch Prediction
Guess the outcome and proceed, undo if necessary
Branch Target Capture

Make use of history
Branch Elimination
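[Figure: the conditional statement "if C then S" replaced by the guarded statement C:S]
Use conditional instructions (predicated execution)
With a branch:
OP1
BC CC = Z, +2
ADD R3, R2, R1
OP2
With a conditional instruction:
OP1
ADD R3, R2, R1, NZ
OP2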
Branch Speed Up : Early target address generation

Assume each instruction is a branch
Generate target address while decoding
If target is in the same page, omit translation
After decoding, discard target address if not a branch
[Figure: timing of BC: IF IF IF, D, AG, followed by TIF TIF TIF]
Branch Prediction
Treat conditional branches as unconditional branches / NOP
Undo if necessary
Strategies:
Fixed (always guess inline)
Static (guess on the basis of instruction type)
Dynamic (guess based on recent history)
Static Branch Prediction

Instr      %      Guess    Branch   Correct
uncond     14.5   always   100%     14.5%
cond       58     never    54%      27%
loop       9.8    always   91%      9%
call/ret   17.7   always   100%     17.7%
Total                               68.2%
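The Correct column can be recomputed from the other columns. In this sketch (helper name hypothetical) a guess of "always" is right when the branch is taken and a guess of "never" is right when it is not; the figures are the ones in the table.

```python
# (instruction type, % of branches, guess, fraction of times the branch is taken)
rows = [
    ("uncond", 14.5, "always", 1.00),
    ("cond", 58.0, "never", 0.54),
    ("loop", 9.8, "always", 0.91),
    ("call/ret", 17.7, "always", 1.00),
]


def correct_share(frequency, guess, taken):
    """Percentage points of all branches predicted correctly for one row."""
    hit_rate = taken if guess == "always" else 1.0 - taken
    return frequency * hit_rate


shares = {name: correct_share(freq, guess, taken)
          for name, freq, guess, taken in rows}
total_unrounded = sum(shares.values())
# The slide adds the rounded entries instead.
total_rounded = 14.5 + round(shares["cond"]) + round(shares["loop"]) + 17.7
```

The unrounded total comes out slightly below the slide's 68.2%, which is the sum of the rounded per-row values.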
Branch Target Capture

Branch Target Buffer (BTB)
Target Instruction Buffer (TIB)
Entry fields: instr addr, pred stats, target (target addr or target instr)
prob of target change < 5%
BTB Performance
decision                     result   prob   delay
BTB miss, go inline (.4)     inline   .8     0
                             target   .2     5
BTB hit, go to target (.6)   target   .8     0
                             inline   .2     4
Eff. delay = .4*.8*0 + .4*.2*5 + .6*.2*4 + .6*.8*0 = 0.88
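The effective delay is an expected value over the four decision/result combinations. A minimal sketch, using the probabilities and delays from the slide (variable names are illustrative):

```python
# (probability of the decision, probability of the result given it, delay)
outcomes = [
    (0.4, 0.8, 0),  # BTB miss, go inline, branch falls through
    (0.4, 0.2, 5),  # BTB miss, go inline, branch is actually taken
    (0.6, 0.2, 4),  # BTB hit, go to target, branch actually falls through
    (0.6, 0.8, 0),  # BTB hit, go to target, branch is taken
]

effective_delay = sum(p_decision * p_result * delay
                      for p_decision, p_result, delay in outcomes)
total_probability = sum(p_decision * p_result
                        for p_decision, p_result, _ in outcomes)
```

Only the two mispredicted cases contribute: 0.4 * 0.2 * 5 = 0.40 and 0.6 * 0.2 * 4 = 0.48.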
Compute/fetch scheme
(no dynamic branch prediction)
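[Figure: the instruction fetch address register (IFAR) indexes the I-cache to fetch I, I+1, I+2, I+3; an adder produces the next sequential address; the branch target address (BTA) is computed and then used to fetch BTI, BTI+1, BTI+2, BTI+3.]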
BTAC scheme
[Figure: the same fetch path with a BTAC added; the BTAC maps a branch address (BA) to its BTA, which is used to fetch BTI, BTI+1, BTI+2, BTI+3.]
ILP in VLIW processors

[Figure: the cache/memory feeds a fetch unit, which delivers a single multi-operation instruction to several functional units (FU) sharing a register file.]
ILP in Superscalar processors

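[Figure: the cache/memory feeds a fetch unit with a sequential stream of instructions; a decode and issue unit issues multiple instructions to several functional units (FU) sharing a register file. Legend: instruction/control paths, data paths, FU = functional unit.]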
Why Superscalars are popular ?

Binary code compatibility among scalar & superscalar processors of same family
Same compiler works for all processors (scalars and superscalars) of same family
Assembly programming of VLIWs is tedious
Code density in VLIWs is very poor
Instruction encoding schemes
Hierarchical structure
[Figure: the CPU on top of a stack of memory levels]
Level             Speed     Size       Cost / bit
nearest the CPU   Fastest   Smallest   Highest
farthest away     Slowest   Biggest    Lowest
Data transfer between levels

[Figure: a processor access to the nearest level results in a hit or a miss.]
Data are transferred between levels
unit of transfer = block
Principle of locality & Cache Policies

Temporal locality: references repeated in time
Spatial locality: references repeated in space
Special case: sequential locality
Cache policies:
Read: Sequential / Concurrent, Simple / Forward
Load: Block load / Load forward / Wrap around
Replacement: LRU / LFU / FIFO / Random
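As an illustration of one of the replacement policies listed above, here is a minimal sketch of LRU for a small fully associative cache. The function name, the reference stream, and the capacity are assumptions chosen for the example.

```python
from collections import OrderedDict


def simulate_lru(references, capacity):
    """Count hits for a block reference stream under LRU replacement.

    The OrderedDict keeps blocks ordered from least to most recently used.
    """
    cache = OrderedDict()
    hits = 0
    for block in references:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # now the most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return hits, list(cache)


hits, resident = simulate_lru([1, 2, 3, 1, 4, 2], capacity=3)
```

In this stream the re-reference to block 1 hits (temporal locality), while block 2 is evicted just before it is needed again.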
Load policies
[Figure: a block of addressable units (AU) with a cache miss on AU 1, comparing the load order under Block Load, Load Forward, and Fetch Bypass (wrap around load).]
Fetch Policies
Demand fetching: fetch only when required (miss)
Hardware prefetching: automatically prefetch next block
Software prefetching: programmer decides to prefetch
questions: how much ahead (prefetch distance), how often
Write Policies
Write hit: Write Back / Write Through
Write miss: Write Back / Write Through, each With Write Allocate / With No Write Allocate
Cache Types
Instruction | Data | Unified | Split
Split vs. Unified:
Split allows specializing each part
Unified allows best use of the capacity
On-chip | Off-chip
on-chip: fast but small
off-chip: large but slow
Single level | Multi level
References
1. Patterson, D. A.; Hennessy, J. L. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 2000.
2. Sima, D.; Fountain, T.; Kacsuk, P. Advanced Computer Architectures: A Design Space Approach. Pearson Education, 1998.
3. Flynn, M. J. Computer Architecture: Pipelined and Parallel Processor Design. Narosa Publishing, India, 1999.
4. Hennessy, J. L.; Patterson, D. A. Computer Architecture: A Quantitative Approach, 2nd Ed. Morgan Kaufmann, 2001.
Thanks
