
Instruction Sets

"Instruction set architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine"
- IBM introducing the 360 in 1964

Instruction set aspects
• operands
• memory issues
• operations (mostly control)

Compilers (paper #4)

READ papers #4-7, Appendix B

Appendix B Operand Storage

Why in the processor?
• faster access
• shorter address

Accumulator
+ less hardware
– high memory traffic
– likely bottleneck

© 2009 by Vijaykumar ECE565 Lecture Notes: Appendix B

Operand Storage

Stack - LIFO (60's - 70's)
+ code density (top of stack implicit)
– bottleneck while pipelining (why?)
• note: JAVA VM is stack-based

Registers - 8 to 256 words
+ flexible: temporaries and variables
– registers must be named
– code density and "second" name space
– compilers must manage (an advantage?)

Memory vs. Registers

Registers
+ faster (no addressing modes, no tags)
+ deterministic (no misses)
+ can replicate for more ports
+ short identifier
– must save/restore on procedure calls
– cannot take address of a register (distinct from memory)
– fixed size (FP, strings, structures)

Registers vs. Memory

How many registers? more =>
+ hold operands longer (reducing memory traffic + run time)
– longer register specifiers (except with register windows)
– slower registers
– more state slows context switches

Operands for ALU Instructions

ALU instructions combine operands

Number of explicit operands
• two - ri := ri op rj
• three - ri := rj op rk

Operands in registers or memory
• any combo - VAX - orthogonal but variable-length instrs
• at least one register - IBM 360/370 - not orthogonal
• all registers - Cray, RISCs - orthogonal but needs loads/stores

Operands for DSP

Integer and floating-point operands are common in general-purpose architectures

DSPs have fixed point
• used to represent a vertex
• the binary point is just to the right of the sign bit, so values are fractions
• 0100 0000 0000 0000 = 2^-1
• fixed point is the poor man's floating point
• without exponent or h/w normalization as in FP

Endian Wars

Order of bytes in words
• Big endian - MSB at address xxxx00
• Little endian - MSB at address xxxx11

Big endian - IBM, Motorola, SPARC
Little endian - DEC, Intel

Mode selectable
• common today
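The two byte orders above can be demonstrated with Python's struct module; this is a generic illustration, not from the slides.

```python
import struct

# Pack the 32-bit word 0x0A0B0C0D in both byte orders.
word = 0x0A0B0C0D
big = struct.pack(">I", word)     # big endian: MSB stored first (lowest address)
little = struct.pack("<I", word)  # little endian: LSB stored first

assert big == b"\x0a\x0b\x0c\x0d"     # MSB at address xxxx00
assert little == b"\x0d\x0c\x0b\x0a"  # MSB at address xxxx11
```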
Operand Alignment

What is alignment?
• address mod size = 0
• natural boundaries

e.g., aligned word (4 bytes), starting at hex address 10:

address: 10 11 12 13
data:    d0 d1 d2 d3

e.g., unaligned word (4 bytes), starting at hex address 11:

address: 10 11 12 13 ...
data:     - d0 d1 d2 d3

Alignment

No restrictions
• simpler software
• hardware must detect misalignment
• and make 2 memory accesses
• expensive logic, slows down all references (why?)
• sometimes required for backward compatibility
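The alignment test above ("address mod size = 0") and the reason a misaligned access needs two memory references can be sketched as follows; the 4-byte memory-block width is an assumption for illustration.

```python
def is_aligned(addr, size):
    # Natural alignment: the address must be a multiple of the operand size.
    return addr % size == 0

def memory_blocks_touched(addr, size, block=4):
    # How many aligned 'block'-byte chunks does an access [addr, addr+size) span?
    first = addr // block
    last = (addr + size - 1) // block
    return last - first + 1

assert is_aligned(0x10, 4)                   # aligned word
assert not is_aligned(0x11, 4)               # unaligned word
assert memory_blocks_touched(0x10, 4) == 1   # one memory access
assert memory_blocks_touched(0x11, 4) == 2   # hardware must make 2 accesses
```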

Alignment

Restricted alignment
• software must guarantee alignment
• hardware only detects misalignment and traps
• trap handler does it

Middle ground
• misaligned data ok but requires multiple instructions
• compiler must still know
• still trap on misaligned access

Addressing Modes

• register: Ri
• immediate: #n
• register indirect: M[Ri]
• displacement: M[Ri + #n]
• absolute: M[#n]
• indexed: M[Ri + Rj]
• memory indirect: M[M[Ri]]
• auto-increment: M[Ri]; Ri += d
• auto-decrement: M[Ri]; Ri -= d
• scaled: M[Ri + #n + Rj * d]
• update: M[Ri = Ri + #n]

Top 4 modes cover 93% of all VAX operands [Clark and Emer]
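The effective-address computations listed above can be sketched as small Python functions over a register file R and a memory M (dictionaries with illustrative values; this is a model, not any particular ISA).

```python
R = {1: 0x100, 2: 0x008, 3: 0x200}        # register file (illustrative values)
M = {0x100: 42, 0x108: 7, 0x200: 0x108}   # sparse memory (illustrative values)

def register_indirect(i):   return M[R[i]]               # M[Ri]
def displacement(i, n):     return M[R[i] + n]           # M[Ri + #n]
def indexed(i, j):          return M[R[i] + R[j]]        # M[Ri + Rj]
def memory_indirect(i):     return M[M[R[i]]]            # M[M[Ri]]
def scaled(i, n, j, d):     return M[R[i] + n + R[j]*d]  # M[Ri + #n + Rj*d]

assert register_indirect(1) == 42
assert displacement(1, 8) == 7
assert indexed(1, 2) == 7            # 0x100 + 0x008
assert memory_indirect(3) == 7       # M[M[0x200]] = M[0x108]
assert scaled(1, 0, 2, 1) == 7
```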
DSP Addressing Modes

Eg 1 - modulo or circular
• because DSPs deal with streams of data, they use circular buffers
• this mode naturally implements such buffers

Eg 2 - bit reverse
• specifically for FFT - a common DSP operation
• FFT shuffles data in a particular order; this mode does it
• 000 -> 000, 001 -> 100, 010 -> 010, 011 -> 110, 100 -> 001

These modes are hard for compilers; linking to assembly-level libraries allows programmers to use them

Operations

Arithmetic and logical - and, add
Data transfer - move, load
Control - branch, jump, call
System - system call, traps
Floating point - add, mul, div, sqrt
Decimal - addd, convert
String - move, compare
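The bit-reversed shuffle above can be sketched as a small helper (a generic illustration of what the hardware mode computes):

```python
def bit_reverse(x, bits):
    # Reverse the low 'bits' bits of x (e.g., 001 -> 100 for 3 bits).
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

# The 3-bit FFT shuffle order: 000, 001, 010, ... map to
assert [bit_reverse(i, 3) for i in range(8)] == [0, 4, 2, 6, 1, 5, 3, 7]
```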

DSP Operations

DSPs have many 8-bit, 16-bit, 32-bit operands

Instead of wasting 64-bit ALUs and datapaths on narrow operands
• DSPs allow 4 16-bit operations in parallel in a 64-bit datapath

Other peculiarities:
• in real time there is no time for exceptions => instead of excepting on overflow they use saturating arithmetic => if the result is too large or too small, hardware "saturates" to the largest or smallest representable number
• special rounding modes like IEEE
• multiply-accumulate instruction, because accumulating a series of products is common
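Saturating arithmetic as described above can be sketched for 16-bit operands (a generic model, not a specific DSP's semantics):

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def saturating_add16(a, b):
    # On overflow, clamp to the nearest representable value
    # instead of raising an exception or wrapping around.
    s = a + b
    return max(INT16_MIN, min(INT16_MAX, s))

assert saturating_add16(30000, 10000) == 32767     # clamps to largest
assert saturating_add16(-30000, -10000) == -32768  # clamps to smallest
assert saturating_add16(100, 23) == 123            # normal case unchanged
```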
Control Instructions

Aspects
• 1. taken or not
• 2. where is the target
• 3. link return address
• 4. save or restore

Instructions that change the PC
• (conditional) branches [1-2], (unconditional) jumps [2]
• function calls [2,3,4], function returns [2,4]
• system calls [2,3,4], system returns [2,4]

Taken or Not?

Compare and branch
+ no extra compare, no state passed between instructions
– requires ALU op, restricts code-scheduling opportunities

Implicitly set condition codes - Z, N, V, C
+ can be set "for free"
– constrains code reordering, extra state to save/restore

Explicitly set condition codes
+ can be set for free, decouples branch/fetch from pipeline
– extra state to save/restore

Taken or Not

Condition in general-purpose register
+ no special state, but uses up a register
– branch condition separate from branch logic in pipeline

Some data for MIPS
• > 80% of branches use immediate data, > 80% of those use zero
• 50% of branches use == 0 or <> 0

Compromise in MIPS
• branch==0, branch<>0
• compare instructions for all other compares

Where is the Target?

Arbitrary specifier
+ orthogonal
– more bits to specify, more time to decode
– branch execution and target separated in pipeline

PC-relative with immediate
+ position independent, target computable in branch unit
+ short immediate sufficient - #bits: <4 (47%), <8 (94%)
– target must be known statically, can't jump far
– other techniques needed for returns, distant jumps
Where is the Target

Register
+ short specifier, can jump anywhere, dynamic target ok (ret)
– extra instruction to load register
– branch and target separated in pipeline

Vectored trap - critical for OS calls
+ protection
– surprises cause implementation headaches

KEY: Connection to Pipelining

Control flow instructions affect which instruction is fetched next

Fetching occurs at the front end of the pipeline
• fetch in the front end, process in the back end

The register file is in the back end of the pipeline

If the pipeline is deep (done to make each stage small and fast, and hence the clock speed higher, so there are many stages)
• front end and back end are far away from each other (in time)
• => if processing a branch needs info from the back end, that will be slow

Link Return Address

Implicit register - many recent architectures use this
+ fast, simple
– s/w must save the register before the next call; surprise traps?

Explicit register
+ may avoid saving a register
– register must be specified

Processor stack
+ recursion handled directly
– complex instructions

Save or Restore State?

What state?
• function calls: registers
• system calls: registers, flags, PC, PSW, etc.

Hardware need not save registers
• caller can save the registers in use
• callee can save the registers it will use

Hardware register save
• IBM STM, VAX CALLS
• faster?
Save or Restore State?

Many recent architectures do no register saving
Or do implicit register saving with register windows (SPARC)

Notation

Generic assembly code
• sub r1, r2, r3
• means r1 := r2 - r3

Data sizes
• byte - 8 bits
• halfword - 16 bits
• word - 32 bits
• doubleword - 64 bits
• quadword - 128 bits


VAX

DEC 1977 VAX-11/780

Upward compatible from PDP-11

32-bit words and addresses

Virtual memory

16 GPRs (r15 = PC, r14 = SP), CCs

Extremely orthogonal and memory-memory

Decodes as a byte stream - variable-length instructions
• opcode: operation, #operands, operand types

Data types
• 8, 16, 32, 64, 128 bits
• char string - 8 bits/char
• decimal - 4 bits/digit
• numeric string - 8 bits/digit
VAX

Addressing modes
• literal 6 bits
• 8, 16, 32 bit immediates
• register, register deferred
• 8, 16, 32 bit displacements
• 8, 16, 32 bit displacements deferred
• indexed (scaled)
• autoincrement, autodecrement
• autoincrement deferred

Operations
• data transfer including string move
• arithmetic and logical (2 and 3 operands)
• control (branch, jump, etc)
• AOBLEQ
• function calls save state
• bit manipulation
• floating point - add, sub, mul, div, polyf
• system - exception, VM
• other - crc (cyclic redundancy check), insque (insert in queue)

VAX

addl3 R1,737(R2),#456
• byte 1: addl3 opcode
• byte 2: mode, R1
• byte 3: mode, R2
• bytes 4-5: 737
• byte 6: mode
• bytes 7-10: 456

VAX has too many modes and formats
The big deal with RISC is not fewer instrs
• fewer modes/formats => faster decoding in pipelining

8086

Intel in 1978
• chosen for the IBM PC in 1980
• remains the most popular 16-bit architecture
• upward compatible with 8080
• complex - "difficult to explain and impossible to love"
• special-purpose registers
• 4 arithmetic, 4 address, 4 segment, 2 control
• addresses - (16-bit segment << 4) + 16-bit offset
• 64K 16-byte-aligned 64KB segments
• many formats - see H&P
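The variable-length encoding of the addl3 example above can be checked with a small length calculation: each operand contributes one mode/register specifier byte plus any displacement or immediate extension bytes (a sketch of the byte counts from the slide, not a full VAX assembler).

```python
def operand_len(ext_bytes=0):
    # 1 specifier byte (mode + register) plus extension bytes.
    return 1 + ext_bytes

opcode = 1                      # byte 1: addl3
op1 = operand_len()             # byte 2: mode,R1 (register mode, no extension)
op2 = operand_len(ext_bytes=2)  # bytes 3-5: mode,R2 + 16-bit displacement 737
op3 = operand_len(ext_bytes=4)  # bytes 6-10: mode + 32-bit immediate 456

assert opcode + op1 + op2 + op3 == 10  # the instruction occupies 10 bytes
```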
MIPS

• RISC
• 32-bit byte addresses, aligned
• load/store - only displacement addressing
• standard data types
• 3 fixed-length formats
• 32 32-bit GPRs (r0 = 0)
• 16 64-bit (32 32-bit) FPRs
• FP status register
• no CCs

Data transfer
• load/store word, load/store byte/halfword (signed?)
• load/store FP single/double
• moves between GPRs and FPRs

ALU
• add/subtract (signed? immediate?)
• multiply/divide (signed?)
• and, or, xor (immediate?), shifts: sll, srl, sra (immediate?)
• sets (immediate?)

MIPS

Control
• branches == 0, <> 0
• conditional branch testing FP bit
• jump, jump register
• jump & link, jump & link register
• trap, return-from-exception

FP
• add/sub/mul/div single/double
• fp converts, fp set

Formats (field widths in bits):

I-type: Opcode (6) | rs1 (5) | rd (5) | Immediate (16)
R-type: Opcode (6) | rs1 (5) | rs2 (5) | rd (5) | func (11)
J-type: Opcode (6) | Offset added to PC (26)

I format - ALU immediate, loads/stores, branches, jump register
R format - RRR ALU ops
J format - unconditional jumps (& link?)
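The fixed-length field widths above can be sketched as a Python encoder; field order and names follow the slide's table, not any particular MIPS manual, and the opcode/register values below are illustrative.

```python
def encode_i(opcode, rs1, rd, imm16):
    # I-type: Opcode(6) | rs1(5) | rd(5) | Immediate(16) = 32 bits
    assert 0 <= opcode < 2**6 and 0 <= rs1 < 2**5 and 0 <= rd < 2**5
    return (opcode << 26) | (rs1 << 21) | (rd << 16) | (imm16 & 0xFFFF)

def encode_r(opcode, rs1, rs2, rd, func):
    # R-type: Opcode(6) | rs1(5) | rs2(5) | rd(5) | func(11) = 32 bits
    return (opcode << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | func

word = encode_i(0b100011, 4, 2, 100)    # e.g., a load with displacement 100

assert word >> 26 == 0b100011           # opcode field
assert (word >> 21) & 0x1F == 4         # rs1 field
assert word & 0xFFFF == 100             # immediate field
assert 0 <= encode_r(0, 1, 2, 3, 32) < 2**32  # fits in one 32-bit word
```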
Compilers 101 (Wulf's paper)

Wm. Wulf's "Compilers and Computer Architecture"

Compiler goals:
• all correct programs execute correctly
• most compiled programs execute fast (optimizations)
• fast compilation
• debugging support

Compilers 101

Phases to manage complexity
• Parsing --> intermediate representation
• Procedure inlining
• Loop optimizations
• Common subexpression elimination
• Jump optimization
• Constant propagation
• Register allocation
• Strength reduction
• Pipeline scheduling
• Code generation --> assembly code
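Two of the phases listed above, constant propagation and strength reduction, can be illustrated on a tiny expression; the transformation is done by hand here (real compilers do this on an intermediate representation), and the functions are illustrative.

```python
def before(x):
    k = 4                  # a constant the compiler can propagate
    return x * k + k - 4   # multiply, plus redundant arithmetic

def after(x):
    # constant propagation: k -> 4; constant folding: 4 - 4 -> 0
    # strength reduction: x * 4 -> x << 2 (shift instead of multiply)
    return x << 2

# The optimized version must agree with the original on all inputs.
assert all(before(x) == after(x) for x in range(-100, 100))
```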

Compilers 101

Performance gains from optimization:
• procedure inlining: 10%
• local optimization: 5%
• register allocation: 21%
• global + local: 14%
• global + local + reg-alloc: 63%
• everything: 81%

local: common subexpression, constant propagation
global: common subexpression, loop-invariant code motion

Compilers 101

What compiler writers want:
• regularity - similar structure across instructions
• orthogonality - across operation, data type, addressing
• composability - results from one directly to another
• regularity and orthogonality => composability

Compilers perform a giant case analysis
• too many choices make it hard

Orthogonal instruction sets
• operation, addressing mode, data type
Compilers 101

One solution or all possible solutions
• 2 branch conditions - eq, lt
• or all six - eq, ne, lt, gt, le, ge
• not 3 or 4

Primitives, NOT solutions

". . . by giving too much semantic content to the instruction, the machine designer made it possible to use the instruction only in limited contexts. In many cases the complex instructions are synthesized from more primitive operations, which if the compiler had access to, could be recomposed to more closely model the feature actually needed."

RISC vs. CISC (Clark & Bhandarkar)

Clark & Bhandarkar ASPLOS paper: VAX 8700 vs. MIPS R3000

Combines 3 features
• architecture
• implementation
• compilers and OS

Argues that
• implementation effects are second order
• compilers are similar
• RISCs are better than CISCs: fair comparison?

RISC vs. CISC

Recall Iron Law: Time = #instructions x CPI x clock period

RISC factor: {CPI_VAX x Instr_VAX} / {CPI_MIPS x Instr_MIPS}

Benchmark   instr ratio   CPI MIPS   CPI VAX   CPI ratio   RISC factor
li              1.6          1.1        6.5       6.0          3.7
eqntott         1.1          1.3        4.4       3.5          3.3
fpppp           2.9          1.5       15.2      10.5          2.7
tomcatv         2.9          2.1       17.5       8.2          2.9

RISC vs. CISC

Compensating factors
• increase VAX CPI but decrease VAX instruction count
• increase MIPS instruction count
• e.g. 1: loads/stores vs. operand specifiers
• e.g. 2: necessary complex instructions: loop branches

Factors favoring VAX
• big immediate values
• not-taken branches incur no delay
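The RISC factor above can be computed from a table row; using li as the example (the published figures for the other rows involve rounding, so they do not all reproduce exactly from the rounded CPIs shown).

```python
def risc_factor(cpi_vax, cpi_mips, instr_ratio):
    # RISC factor = {CPI_VAX * Instr_VAX} / {CPI_MIPS * Instr_MIPS}
    #             = (CPI_VAX / CPI_MIPS) / (Instr_MIPS / Instr_VAX)
    return (cpi_vax / cpi_mips) / instr_ratio

# li: CPI_VAX = 6.5, CPI_MIPS = 1.1, instruction ratio = 1.6
rf = risc_factor(6.5, 1.1, 1.6)
assert round(rf, 1) == 3.7   # matches the table's RISC factor for li
```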
RISC vs. CISC

Factors favoring MIPS
• operand specifier decoding
• number of registers
• separate floating-point unit
• simple branches/jumps (lower latency)
• no complex instructions
• instruction scheduling
• translation buffer
• branch displacement size

Technology Scaling (Borkar's paper)

Why is technology scaling so important?

What are the goals of scaling? (keep in mind that 0.7 ≈ 1/sqrt(2))
• reduce gate delay by 30% => clock up by 43% (1/0.7)
• double transistor density
• reduce energy per switch by 65% => reduce power by 50%

Scaling theory:
• delay = 0.7 => frequency = 1.43
• width, length, thickness = 0.7 => area cap, fringe cap = 0.7
• total cap = 0.7, total area = 0.7^2 = 0.5
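The scaling-theory arithmetic above can be checked directly; with linear-dimension factor s = 0.7 under constant-field scaling (all values are the slide's idealized numbers, relative to 1.0 before the shrink):

```python
s = 0.7                      # linear dimensions scale by 0.7 per generation

freq = 1 / s                 # delay * 0.7 => frequency up by ~1.43
cap = s                      # total capacitance scales by 0.7
area = s ** 2                # area scales by 0.7^2 ~ 0.5
vdd = s                      # constant-field scaling: voltage also * 0.7

power = freq * cap * vdd**2  # P = f * C * V^2

assert round(freq, 2) == 1.43
assert round(area, 2) == 0.49
assert round(power, 2) == 0.49   # ~50% power reduction, as claimed
```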

Scaling Trends

Clock frequency
• improves by a factor of 2, not just 1.43, every generation
• the gap from 1.43 to 2 comes from circuits and microarchitecture
• mainly less work per clock => more pipeline stages => deeper pipes => branch and cache-miss penalties worsen (making architects work harder to achieve performance)

Transistor density (#devices/area)
• doubles as expected if the old microarchitecture is shrunk
• if new microarchitecture, then density does not double
• more complexity and less time to optimize?

Interconnect
• width and thickness decrease with transistors
• interconnect distribution for a new microarchitecture is not different from that of the old => complexity is not the reason for the drop in density

Power = fCV^2
• to reduce by 50% => scale Vdd by 0.7 (constant-field scaling)
• will not reduce under constant-voltage scaling
• active capacitance density should increase by 43%, but only ~30% in reality due to lower logic density
Projections

Power will increase to 10KW if Vdd is not scaled
• if Vdd is scaled it will be 2KW in 2010
• if die size is restricted it will be 100W for a small die, 200W for a large die

Energy-delay trade-off
• why scale only delay? why not both energy and delay?
• then energy-delay is a better metric than delay alone
• if Vdd is not scaled, energy-delay reduces by 50%
• if Vdd is scaled, energy-delay reduces by 75%

If Vdd is scaled then the threshold voltage Vt must scale (take ECE559 if you do not understand this)
• but lower Vt => exponentially more leakage energy
• i.e., Vdd scaling reduces dynamic energy but increases leakage energy!
• leakage increases roughly 5x every generation
• today leakage is 15% of total power; soon it will equal dynamic power

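The leakage trend above ("15% today, roughly 5x per generation") implies leakage overtakes dynamic power within about two generations; a sketch assuming dynamic power stays fixed at 1.0 (a deliberate simplification, since dynamic power also changes with scaling):

```python
dynamic = 1.0
leakage = 0.15 * dynamic      # "today leakage is 15%"

generations = 0
while leakage < dynamic:
    leakage *= 5              # "leakage increases roughly 5x every generation"
    generations += 1

assert generations == 2               # 0.15 -> 0.75 -> 3.75
assert abs(leakage - 3.75) < 1e-9     # exceeds dynamic power by then
```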

Projections

• Vt decreases => noise margins reduce
• #logic transistors increases => soft error rate increases
• a soft error means a bit flip due to a neutron strike
• power density (power/area) increases
• we are already past the kitchen hot plate!

I have research projects on all these topics (not a coincidence!)
• leakage, power, power density, soft errors, noise . . . .

Challenges for Next Decade (Agerwala)

Application pull vs. technology push
• the standard recipe for innovation in architecture

Apps
• aircraft design, electromagnetics simulation, entertainment
• all of these require high compute power

Technology
• power - dynamic and static
Challenges for Next Decade

Optimal power-performance pipelines are shallower than those for pure performance
• because twice the speed comes at eight times the power
• pipeline depth fundamentally determines power
• deep pipeline + lower Vdd for power => poorer performance than shallow pipeline + high Vdd for performance

Options
• chip multiprocessors (CMPs) alleviate the power problem
• why will this reduce power?
• but software has to be able to take advantage of CMPs
• special accelerators
• special hardware for TCP/IP, security
• why will this reduce power?

Challenges for Next Decade

Options (cont'd):
• scale out
• cluster of low-cost computers working as one large computer
• why will this reduce power?
• system-wide power management
• compiler, operating system
• turn off things not used, scale down Vdd to the required level, because not EVERYTHING needs to run at the highest speed ALWAYS
