variable-length instructions, variable-complexity operations, memory-register ALU operations, etc., led to poor performance
Pentium Architecture
In order to improve performance using RISC features, the Pentium architects had to rethink things: they were stuck with their CISC instruction set (for backward compatibility)
in CISC architectures, a machine instruction is first translated into a sequence of microinstructions. Each microinstruction is a lengthy string of 1s and 0s, each of which refers to one control signal in the machine. There needs to be a process to translate each machine instruction into microinstructions and execute each microinstruction; this is done by collecting machine instructions and their associated microinstructions into microprograms
Why Microinstructions?
First, since the Pentium architecture uses a microprogrammed control unit, there is already a necessary step of decoding a machine instruction into microcode. Now, consider each microinstruction:
each is of equal length
each executes in the same amount of time
unless there are structural hazards such as a cache miss
branches are at the microinstruction level and are more predictable than machine language level branching
In a RISC architecture, each machine instruction is carried out directly in hardware because each instruction is simple and takes roughly 1 cycle to execute
to more efficiently pipeline a CISC architecture, we can pipeline the microinstructions (instead of machine instructions) to keep a pipeline running efficiently
An example architecture is shown to the right. Each of the various connections is controlled by a particular control signal
for instance, to send the MBR value to the AC, we would signal C11
note that this figure is incomplete
a microprogram is a sequence of micro-operations; each micro-operation is one or more control signals sent out in a clock cycle to move information from one location to another
Example
Consider a CISC instruction such as Add R1, X
this requires that X be moved into the MAR and a read signaled; the datum returned will be placed into the MBR; the adder is then sent the values in R1 and MBR, adding the two and storing the result back into R1. This sequence can be written in terms of micro-operations as:
t1: MAR ← (IR(address))
t2: MBR ← Memory
t3: R1 ← (R1) + (MBR)
The values t1, t2, etc. denote separate clock cycles
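The three-cycle sequence can be mimicked with a tiny register-transfer sketch (the address 0x100 and the register contents here are invented purely for illustration):

```python
# Toy register-transfer model of the Add R1, X micro-operation sequence.
# All addresses and initial values are hypothetical.
memory = {0x100: 7}                       # X lives at address 0x100 and holds 7
regs = {"IR_addr": 0x100, "MAR": 0, "MBR": 0, "R1": 5}

# t1: MAR <- (IR(address))
regs["MAR"] = regs["IR_addr"]
# t2: MBR <- Memory[MAR]  (the read completes in this cycle)
regs["MBR"] = memory[regs["MAR"]]
# t3: R1 <- (R1) + (MBR)
regs["R1"] = regs["R1"] + regs["MBR"]

print(regs["R1"])  # 12 (5 + 7)
```

Each commented step corresponds to one clock cycle (t1, t2, t3) of the micro-program above.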
There may be other sequences needed as well, for instance, if register results are stored in an accumulator temporarily, then we must change the above to include
we can then convert these into the actual control signals (for instance, MBR ← Memory is C5 in the previous figure)
Control Memory
Each microprogram consists of one or more microinstructions, each stored in a separate entry of the control memory. The control memory itself is firmware: a program, stored in ROM, that is placed inside of the control unit
[Figure: control memory layout. The Fetch routine ends with a jump to the Indirect or Execute routine; the Indirect routine ends with a jump to Execute; the Interrupt routine ends with a jump to Fetch; Execute dispatches with a jump to the op code's routine; each op code routine ends with a jump to Fetch or Interrupt.]
Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program
The micro-instruction address field points to a branch target in the control memory; the branch is taken if the condition bit is true
Vertical micro-instructions use function codes that need additional decoding
Horizontal micro-instructions contain 1 bit for every control signal controlled by the control unit
Because a horizontal micro-instruction requires 1 bit for every control line, it is longer than the vertical micro-instruction and therefore takes more space to store, but does not require additional time to decode by the control unit
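The space-versus-decode-time tradeoff can be sketched in a few lines (the 8-signal machine, the signal numbering, and the function codes below are invented for illustration):

```python
# Sketch contrasting horizontal and vertical micro-instruction encodings.
# The control-signal numbers (C0..C7) and function codes are hypothetical.
N_SIGNALS = 8

def horizontal(active_signals):
    """Horizontal: one bit per control line -- wide, but no decoding needed."""
    word = 0
    for s in active_signals:          # e.g. {5} asserts signal C5
        word |= 1 << s
    return word                        # an N_SIGNALS-bit control word

# Vertical: a short function code that must be decoded into control lines.
DECODE_TABLE = {0b00: {0}, 0b01: {5}, 0b10: {2, 3}}   # made-up code assignments

def vertical(func_code):
    """Vertical: compact encoding, but requires a decode step at runtime."""
    return DECODE_TABLE[func_code]

print(bin(horizontal({5})))   # 0b100000 -- bit 5 set in the wide word
print(vertical(0b01))         # {5} -- same signal, recovered by decoding
```

The horizontal word needs `N_SIGNALS` bits per micro-instruction, while the vertical form needs only enough bits for the function code plus a decoder, mirroring the storage/decode tradeoff described above.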
reservation stations (128 registers available) and multiple functional units (7 of them)
branch speculation used (control of speculation is given to reservation stations rather than a reorder buffer; commit still occurs, controlled by reservation stations)
trace cache used
Pentium IV Architecture
Specifications
2 simple ALUs (for simple integer operations like add and compare)
1 complex ALU (for integer multiplication and integer division)
1 load unit
1 store unit
1 floating point move unit (register-to-register move and convert)
1 floating point unit (addition, subtraction, multiplication, division)
the simple ALU units execute in half a clock cycle, so each can accommodate up to two microoperations per cycle, reducing latency
the load and store units have their own address calculation components so that the memory address can be computed first and then the memory access performed, along with an aggressive data cache to lower load latencies
floating point and complex ALU operations take more than 1 cycle, so those units are pipelined
the floating point units can handle up to 2 FP operations at a time, allowing for some SIMD execution and improving overall FP performance
Pentium IV Pipeline
The lengthening of the pipeline allowed for the faster clock rates
the clock rate is now so fast that it takes 2 complete cycles for an instruction or datum to cross the chip, so at least 2 stages in the pipeline are needed for certain operations like data movement! With the 128 reservation stations, 128 instructions can be in some state of execution simultaneously (as opposed to 40 in the Pentium III)
for now, consider it to be an instruction cache that stores instructions not by address but by the order in which they are being executed; in this way, branches do not necessarily cost us cache misses even when the instruction being branched to is not in the same cache block
A branch target buffer is used to store microinstruction branches (not machine instruction branches) within the trace cache
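The idea of indexing by dynamic execution order rather than by address can be caricatured with a toy sketch (the class, the micro-op names, and the addresses are all hypothetical; a real trace cache also handles partial hits, line limits, and trace building in hardware):

```python
# Toy trace cache: lines are indexed by the starting PC of a dynamic
# instruction sequence (a "trace"), not by address-aligned blocks, so
# instructions separated by a taken branch can share one cache line.
class TraceCache:
    def __init__(self):
        self.lines = {}

    def fill(self, start_pc, trace):
        """Store a decoded dynamic sequence under its starting PC."""
        self.lines[start_pc] = trace

    def fetch(self, start_pc):
        """A hit returns the entire trace, branches included."""
        return self.lines.get(start_pc)

tc = TraceCache()
# A trace crossing a taken branch still occupies a single line:
tc.fill(0x400, ["uop_a", "uop_b", "br_taken", "uop_target"])
print(tc.fetch(0x400))  # ['uop_a', 'uop_b', 'br_taken', 'uop_target']
```

The key point the sketch shows: fetching at 0x400 yields the post-branch micro-ops too, so a taken branch inside the trace does not force a second cache access.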
the target buffer uses a 2-level predictor to select between local and global histories
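A tournament-style chooser between local and global predictors can be sketched in miniature (this omits the branch-history registers of a true 2-level scheme, and the table sizes and indexing are made up; it only illustrates the "chooser trains toward whichever predictor was right" idea):

```python
# Much-simplified tournament predictor: a per-branch chooser selects
# between a local (per-branch) and a global 2-bit predictor.
class TwoBit:
    """Saturating 2-bit counter: states 0-1 predict not-taken, 2-3 taken."""
    def __init__(self):
        self.state = 2                      # start weakly taken
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

class Tournament:
    def __init__(self, entries=16):
        self.entries = entries
        self.local = [TwoBit() for _ in range(entries)]
        self.glob = TwoBit()
        self.chooser = [TwoBit() for _ in range(entries)]  # "taken" = use global

    def predict(self, pc):
        i = pc % self.entries
        src = self.glob if self.chooser[i].predict() else self.local[i]
        return src.predict()

    def update(self, pc, taken):
        i = pc % self.entries
        l_ok = self.local[i].predict() == taken
        g_ok = self.glob.predict() == taken
        if l_ok != g_ok:                    # train chooser toward the winner
            self.chooser[i].update(g_ok)
        self.local[i].update(taken)
        self.glob.update(taken)

p = Tournament()
for _ in range(8):                          # warm up on an always-taken branch
    p.update(0x40, True)
print(p.predict(0x40))  # True
```

Real hardware predictors index these tables with hashed branch history as well as the PC; the sketch keeps only the selection mechanism.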
the target buffer is 8 times the size of the one used in the Pentium III
The trace cache and branch target buffer combined mean that
microinstruction fetch and microinstruction decoding are rarely needed because, once fetched and decoded, the items are often found in the cache, and because predictions rarely cause wrong instructions to be fetched
This architecture is very complex and relies on being able to fetch and decode instructions quickly
the process breaks down when
fewer than 3 instructions can be fetched in 1 cycle
the trace cache causes a miss, or branches are mispredicted
fewer than 3 instructions can be issued because instructions have different numbers of microoperations
e.g., one instruction has 4 and another has 1, staggering when each instruction issues and executes
Source of Stalls
limitation of reservation stations
data dependencies cause a functional unit to stall
data cache access results in a miss
in some of these cases, the issue stage must stall, in others the commit stage must stall
misprediction rates are very low: about .8% for integer benchmarks and .1% for floating point benchmarks (these are misprediction rates at the machine-instruction level, not microinstructions)
the trace cache has nearly a 0% miss rate; the L1 and L2 data caches have miss rates of around 6% and .5% respectively
the machine's effective CPI is around 2.2
Pentium IV Comparison
The P4 has over twice the performance in many SPEC benchmarks in spite of a clock speed that isn't twice as fast (this info is not in this text edition)
The text provides a comparison between the P4 and the AMD Opteron
the Opteron uses dynamic scheduling, speculation, a shallower pipeline, issue and commit of up to 3 instructions per cycle, and a 2-level cache, and the chip has a similar transistor count although it runs at only 2.8 GHz
the Opteron uses a RISC instruction set, so its instructions are machine instructions, not microinstructions
the P4 has a higher CPI on all benchmarks except mcf (on which the AMD's CPI is more than twice the P4's)
so in most cases, instructions take less clock time in the AMD than in the P4, but the P4 has a slightly faster clock
The text provides a briefer comparison between the P4 and the IBM Power5
the Power5 runs at only 1.9 GHz; the Power5 is significantly better on most floating point benchmarks and slightly worse on most integer benchmarks with a clock speed half that of the P4
see figures 2.28–2.34 for specific comparisons
Improving one aspect of our processor does not necessarily improve performance
in fact, it might harm performance
consider lengthening the pipeline depth and increasing clock speed (as with the P4) but without adding reservation stations or using the trace cache
A Balancing Act
Modern processor design takes a lot of effort to balance out the factors
without accurate branch prediction and speculation hardware, stalls from mispredicted branches will drop performance greatly
as clock speeds increase, stalls from cache misses create a bigger impact on CPI, so larger caches and cache optimization techniques are needed (we cover the latter in chapter 5)
to support multiple issue of instructions, we need a larger cache-to-processor bandwidth, which can take up valuable space
as we increase the number of instructions that can be issued, we need to increase the number of reservation stations and reorder buffer size
For even greater improvement, we might need to turn to software approaches instead of, or in addition to, hardware enhancements; in appendix G, we will visit several compiler-based ideas
Sample Problem #1
We see how complex an architecture can become in the case of the Pentium IV
assume that we have additional space on the CPU and want to enhance some element(s); what should we pick and why? The choices are to:
add more reservation stations
add more ALU functional units
add another FP functional unit
add more load/store units
add a larger branch target buffer (either more entries, or more prediction bits)
attempt to speed up the system clock and lengthen the pipeline (the additional space will be used for pipeline latches, control logic, etc.)
add more memory to the trace cache
add more memory to the L1 cache
increase the microoperation queue size to store more microoperations at any time
Solution
Let's consider each not from the perspective of how useful it might be but of how much that particular hardware is limiting instruction issue and CPI
add more reservation stations: because we can issue no more than 3 microoperations per cycle, and assuming that the average microoperation executes for under 10 cycles, the 128 registers should be sufficient
add more ALU/FP functional units: since these are pipelined, additional units are not necessary
add more load/store units: limiting the number of loads may be a source of data dependencies, so an additional load unit might help; an additional store unit is probably not necessary
add a larger branch target buffer (either more entries, or more prediction bits): prediction accuracy is extremely high, so more entries or bits are not needed
Solution Continued
attempt to speed up the system clock and lengthen the pipeline (the additional space will be used for pipeline latches, control logic, etc.): there is little that we can do to further lengthen the pipeline; this may not be feasible
add more memory to the trace cache: similar to the branch target buffer, this will probably have very little impact because of the low miss rate of the current trace cache
add more L1 cache: this can make a significant impact since the miss rate is currently fairly high; this would be my top choice
increase the microoperation queue size to store more microoperations at any time: although it is unclear how many stalls arise from running out of microoperations, because of the trace cache's performance this is probably not necessary
Sample Problem #2
Two fallacies cited in the chapter are:
Processors with lower CPI will always be faster
Processors with faster clock rates will always be faster
Limitations
By 2000, architects found limitations in just how much ILP there is to exploit
inherent limitations to multiple-issue are the limited amount of ILP of a program:
how many instructions are independent of each other? how much distance is available between loading an operand and using it? between using and saving it?
multi-cycle latency for certain types of operations that cause inconsistencies in the amount of issuing that can be simultaneous
window sizes have ranged between 4 and 32 with some recent machines having sizes of 2-8
a machine with window size of 32 achieves about 1/5 of the ideal speedup for most benchmarks (see figure on next slide)
[Figure: speedup obtained under different types of branch prediction, ranging from perfect prediction down to none.]
The Power5 has 88 additional FP and 88 additional integer registers for reservation stations
surprisingly though, the number of registers does not have a dramatic impact as long as there are at least 64 + 64 registers available
Alias Analysis
Aside from register renaming, we have name dependencies on memory references. The models are:
global (perfect analysis of all global vars)
stack perfect (perfect analysis of all stack references)
inspection (examine accesses for interference at compile time)
none (assume all references conflict)
The authors describe an ambitious but realistic processor that could be available with today's technology:
issue up to 64 instructions / cycle with no restrictions on what instructions can be issued in the same cycle
tournament branch predictor with 1K entries and a 16-entry return predictor
perfect memory reference disambiguation performed dynamically
register renaming with 64 int and 64 FP registers
with a 64 instruction / cycle issue capability, the average number of instructions issued per cycle is estimated to be around 20; if there are no stalls for limited hardware, cache misses and misspeculation, this would result in a CPI of .05!
A Realizable Processor
we might question whether a 64-instruction window is reasonable given the complexity needed in comparing up to 64 instructions against each other in each cycle; today we find most computers limit window sizes to 8 at most
Example
Let's compare three hypothetical processors and determine their MIPS rating for the gcc benchmark
processor 1: simple MIPS 2-issue superscalar pipeline with a clock rate of 4 GHz, CPI of 0.8, and a cache system with .005 misses per instruction
processor 2: deeply pipelined MIPS with a clock rate of 5 GHz, CPI of 1.0, and a smaller cache yielding .0055 misses per instruction
processor 3: speculative superscalar with a 64-entry window that achieves 50% of its ideal issue rate (see figure 3.7), with a clock rate of 2.5 GHz and a small cache yielding .01 misses per instruction (although 25% of the miss penalty is not visible due to dynamic scheduling)
assume memory access time (miss penalty) is 50 ns
to solve this problem, we have to determine each processor's CPI, which is a combination of processor CPI and the impact of memory (cache misses)
Solution
Processor 1:
4 GHz clock = .25 ns per clock cycle
memory access of 50 ns, so miss penalty = 50 / .25 = 200 cycles
cache penalty = .005 * 200 = 1.0 cycles per instruction
overall CPI = 0.8 + 1.0 = 1.8
MIPS = 4 GHz / 1.8 = 2222 MIPS
Processor 2:
5 GHz clock = .2 ns per clock cycle
miss penalty = 50 / .2 = 250 cycles
cache penalty = .0055 * 250 = 1.4 cycles per instruction
overall CPI = 1.0 + 1.4 = 2.4
MIPS = 5 GHz / 2.4 = 2083 MIPS
Processor 3:
2.5 GHz clock = .4 ns per clock cycle
miss penalty takes effect only 75% of the time, so miss penalty = .75 * 50 / .4 = 94 cycles
cache penalty = .01 * 94 = 0.94
CPU portion of the CPI is based on half the ideal issue rate of a 64-entry window, which is 1 / (9 / 2) ≈ 0.22
overall CPI = 0.94 + 0.22 = 1.16
MIPS = 2.5 GHz / 1.16 = 2155 MIPS
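All three calculations follow the same pattern, which can be checked with a short sketch (the function name and parameters are mine, not the text's; the small differences from the slide's MIPS figures come from the slide rounding intermediate values):

```python
# overall CPI = base CPI + misses/instruction * visible miss penalty (cycles);
# MIPS = clock in MHz / CPI. miss_visible scales the penalty when dynamic
# scheduling hides part of it (processor 3 hides 25%).
def mips(clock_ghz, base_cpi, misses_per_instr, mem_ns=50, miss_visible=1.0):
    cycle_ns = 1.0 / clock_ghz
    penalty_cycles = miss_visible * mem_ns / cycle_ns
    cpi = base_cpi + misses_per_instr * penalty_cycles
    return clock_ghz * 1000 / cpi

print(round(mips(4.0, 0.8, 0.005)))    # 2222, matching the slide
print(round(mips(5.0, 1.0, 0.0055)))   # 2105 (slide rounds 1.375 to 1.4, giving 2083)
print(round(mips(2.5, 0.22, 0.01, miss_visible=0.75)))  # 2160 (slide uses 94 cycles, giving 2155)
```

Either way the ranking holds: the modest 2-issue 4 GHz design edges out both the deeper pipeline and the wide speculative core on this benchmark.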
Sample Problem #1
For the li benchmark
compare a perfect processor to one that has a 128-entry window, tournament branch predictor, 64 integer and 64 FP renaming registers, and inspection alias analysis
The more realistic processor is most limited by alias analysis (4 instructions per cycle), so a CPI = .25
the perfect machine is then .25 / .083 = 3 times faster on this benchmark
Sample Problem #2
Architects are considering one of three enhancements to the next generation of computers:
more on-chip cache to reduce the impact of memory access
faster memories
faster clock rates
Explain, using the example on pages 167-169, how each of these would impact the three hypothetical processors
more on-chip cache lowers cache CPI; depending on the current miss rate this might be useful, but for processors 1 and 2 the miss rates are already < 1%
faster memory reduces cache CPI (it decreases the number of cycles needed for any cache miss); since all three processors' CPIs are roughly half from cache misses and half from processor performance, this could have a significant impact
faster clock rates increase cache CPI and possibly have no effect on execution CPI; by merely increasing the clock rate, the stalls for memory accesses will increase; however, if this increase is coupled with a longer pipeline, then execution CPI might decrease and so overall performance might improve
Sample Problem #3
Consider a speculative superscalar with a window size of 32
with proper hardware support, the superscalar can issue 70% of the expected issue rate (see figure 3.2)
the processor has a 3.33 GHz clock rate
the processor stalls when all functional units are busy (which arises once in every 12 cycles)
when there is a misprediction, the processor requires 6 complete cycles to flush the reorder buffer and begin again (profile-based prediction is used)
memory accesses take 40 ns; 40% of the instructions are loads or stores; the instruction cache has a miss rate of .5% and the data cache has a miss rate of .03%
Solution:
cache miss penalty = 40 ns / (.3 ns per cycle at 3.33 GHz) ≈ 133 cycles
memory CPI = .005 * 133 + .40 * .0003 * 133 ≈ .665 + .016 = .681
CPU CPI = 1 / 6.3 + 1 / 12 + 6 * .05 = .159 + .083 + .300 = .542
overall CPI = .681 + .542 ≈ 1.22
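As a quick check of the arithmetic (note that at 3.33 GHz a 40 ns access works out to roughly 133 cycles; a 120-cycle penalty would correspond to a 3 GHz clock):

```python
# Recompute the Sample Problem #3 CPI from the given parameters.
miss_penalty = 40 * 3.33                  # ns * GHz = cycles, about 133
memory_cpi = 0.005 * miss_penalty + 0.40 * 0.0003 * miss_penalty
# CPU CPI: issue at 70% of the ideal rate of 9 (6.3/cycle), a structural
# stall once every 12 cycles, and a 6-cycle flush at a 5% misprediction rate.
cpu_cpi = 1 / 6.3 + 1 / 12 + 6 * 0.05
print(round(memory_cpi + cpu_cpi, 2))     # 1.22
```

Memory stalls and processor stalls each contribute roughly half of the total, which is why the problem splits the CPI into those two pieces.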