variable-length instructions, variable-complexity operations, memory-register ALU operations, etc., led to poor performance
Pentium Architecture
In order to improve performance using RISC features, the Pentium architects had to rethink things: they were stuck with their CISC instruction set (for backward compatibility)
in CISC architectures, a machine instruction is first translated into a sequence of microinstructions. Each microinstruction is a lengthy string of 1s and 0s, each of which refers to one control signal in the machine. There needs to be a process to translate each machine instruction into microinstructions and execute each microinstruction; this is done by collecting machine instructions and their associated microinstructions into microprograms
Why Microinstructions?
First, since the Pentium architecture uses a microprogrammed control unit, there is already a necessary step of decoding a machine instruction into microcode. Now, consider each microinstruction:
each is of equal length
each executes in the same amount of time
unless there are structural hazards such as a cache miss
branches are at the microinstruction level and are more predictable than machine language level branching
In a RISC architecture, each machine instruction is carried out directly in hardware because each instruction is simple and takes roughly 1 cycle to execute
to more efficiently pipeline a CISC architecture, we can pipeline the microinstructions (instead of machine instructions) to keep a pipeline running efficiently
An example architecture is shown to the right. Each of the various connections is controlled by a particular control signal
for instance, to send the MBR value to the AC, we would signal C11
note that this figure is incomplete
a microprogram is a sequence of micro-operations; each micro-operation is one or more control signals sent out in a clock cycle to move information from one location to another
Example
Consider a CISC instruction such as Add R1, X
this requires that X be moved into the MAR and a read signaled; the datum returned will be placed into the MBR; the adder is then sent the values in R1 and MBR, adding the two and storing the result back into R1. This sequence can be written in terms of micro-operations as:
t1: MAR ← (IR(address))
t2: MBR ← Memory
t3: R1 ← (R1) + (MBR)
The values t1, t2, etc. denote separate clock cycles
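The three-cycle sequence can be mimicked with a tiny register-transfer sketch (the address 0x100 and the register contents here are invented purely for illustration):

```python
# Toy register-transfer model of the Add R1, X micro-operation sequence.
# All addresses and initial values are hypothetical.
memory = {0x100: 7}                       # X lives at address 0x100 and holds 7
regs = {"IR_addr": 0x100, "MAR": 0, "MBR": 0, "R1": 5}

# t1: MAR <- (IR(address))
regs["MAR"] = regs["IR_addr"]
# t2: MBR <- Memory[MAR]  (the read completes in this cycle)
regs["MBR"] = memory[regs["MAR"]]
# t3: R1 <- (R1) + (MBR)
regs["R1"] = regs["R1"] + regs["MBR"]

print(regs["R1"])  # 12 (5 + 7)
```

Each commented step corresponds to one clock cycle (t1, t2, t3) of the micro-program above.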
There may be other sequences needed as well, for instance, if register results are stored in an accumulator temporarily, then we must change the above to include
we can then convert these into the actual control signals (for instance, MBR ← Memory is C5 in the previous figure)
Control Memory
Each microprogram consists of one or more microinstructions, each stored in a separate entry of the control memory. The control memory itself is firmware: a program, stored in ROM, that is placed inside of the control unit
[Figure: control memory layout. The Fetch routine ends with a jump to the Indirect or Execute routine; the Indirect routine ends with a jump to Execute; the Interrupt routine ends with a jump to Fetch; Execute dispatches with a jump to the op code's routine; each op code routine ends with a jump to Fetch or Interrupt.]
Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program
The micro-instruction address field points to a branch target in the control memory; the branch is taken if the condition bit is true
Vertical micro-instructions use function codes that need additional decoding
Horizontal micro-instructions contain 1 bit for every control signal controlled by the control unit
Because a horizontal micro-instruction requires 1 bit for every control line, it is longer than the vertical micro-instruction and therefore takes more space to store, but does not require additional time to decode by the control unit
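The space-versus-decode-time tradeoff can be sketched in a few lines (the 8-signal machine, the signal numbering, and the function codes below are invented for illustration):

```python
# Sketch contrasting horizontal and vertical micro-instruction encodings.
# The control-signal numbers (C0..C7) and function codes are hypothetical.
N_SIGNALS = 8

def horizontal(active_signals):
    """Horizontal: one bit per control line -- wide, but no decoding needed."""
    word = 0
    for s in active_signals:          # e.g. {5} asserts signal C5
        word |= 1 << s
    return word                        # an N_SIGNALS-bit control word

# Vertical: a short function code that must be decoded into control lines.
DECODE_TABLE = {0b00: {0}, 0b01: {5}, 0b10: {2, 3}}   # made-up code assignments

def vertical(func_code):
    """Vertical: compact encoding, but requires a decode step at runtime."""
    return DECODE_TABLE[func_code]

print(bin(horizontal({5})))   # 0b100000 -- bit 5 set in the wide word
print(vertical(0b01))         # {5} -- same signal, recovered by decoding
```

The horizontal word needs `N_SIGNALS` bits per micro-instruction, while the vertical form needs only enough bits for the function code plus a decoder, mirroring the storage/decode tradeoff described above.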
reservation stations (128 registers available) and multiple functional units (7 of them)
branch speculation used (control of speculation is given to reservation stations rather than a reorder buffer; commit still occurs, controlled by reservation stations)
trace cache used
Pentium IV Architecture
Specifications
2 simple ALUs (for simple integer operations like add and compare)
1 complex ALU (for integer multiplication and integer division)
1 load unit
1 store unit
1 floating point move unit (register-to-register move and convert)
1 floating point unit (addition, subtraction, multiplication, division)
the simple ALU units execute in half a clock cycle, so each can accommodate up to two microoperations per cycle, reducing latency
the load and store units have their own address calculation components so that the memory address can be computed first and then the memory access performed, along with an aggressive data cache to lower load latencies
floating point and complex ALU operations take more than 1 cycle, so those units are pipelined
the floating point units can handle up to 2 FP operations at a time, allowing for some SIMD execution and improving overall FP performance
Pentium IV Pipeline
The lengthening of the pipeline allowed for the faster clock rates
the clock rate is now so fast that it takes 2 complete cycles for an instruction or datum to cross the chip, so at least 2 stages in the pipeline are needed for certain operations like data movement! With the 128 reservation stations, 128 instructions can be in some state of execution simultaneously (as opposed to 40 in the Pentium III)
for now, consider it to be an instruction cache that stores instructions not by address but by the order in which they are being executed; in this way, branches do not necessarily cost us cache misses even when the instruction being branched to is not in the same cache block
A branch target buffer is used to store microinstruction branches (not machine instruction branches) within the trace cache
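The idea of indexing by dynamic execution order rather than by address can be caricatured with a toy sketch (the class, the micro-op names, and the addresses are all hypothetical; a real trace cache also handles partial hits, line limits, and trace building in hardware):

```python
# Toy trace cache: lines are indexed by the starting PC of a dynamic
# instruction sequence (a "trace"), not by address-aligned blocks, so
# instructions separated by a taken branch can share one cache line.
class TraceCache:
    def __init__(self):
        self.lines = {}

    def fill(self, start_pc, trace):
        """Store a decoded dynamic sequence under its starting PC."""
        self.lines[start_pc] = trace

    def fetch(self, start_pc):
        """A hit returns the entire trace, branches included."""
        return self.lines.get(start_pc)

tc = TraceCache()
# A trace crossing a taken branch still occupies a single line:
tc.fill(0x400, ["uop_a", "uop_b", "br_taken", "uop_target"])
print(tc.fetch(0x400))  # ['uop_a', 'uop_b', 'br_taken', 'uop_target']
```

The key point the sketch shows: fetching at 0x400 yields the post-branch micro-ops too, so a taken branch inside the trace does not force a second cache access.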
the target buffer uses a 2-level predictor to select between local and global histories
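A tournament-style chooser between local and global predictors can be sketched in miniature (this omits the branch-history registers of a true 2-level scheme, and the table sizes and indexing are made up; it only illustrates the "chooser trains toward whichever predictor was right" idea):

```python
# Much-simplified tournament predictor: a per-branch chooser selects
# between a local (per-branch) and a global 2-bit predictor.
class TwoBit:
    """Saturating 2-bit counter: states 0-1 predict not-taken, 2-3 taken."""
    def __init__(self):
        self.state = 2                      # start weakly taken
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

class Tournament:
    def __init__(self, entries=16):
        self.entries = entries
        self.local = [TwoBit() for _ in range(entries)]
        self.glob = TwoBit()
        self.chooser = [TwoBit() for _ in range(entries)]  # "taken" = use global

    def predict(self, pc):
        i = pc % self.entries
        src = self.glob if self.chooser[i].predict() else self.local[i]
        return src.predict()

    def update(self, pc, taken):
        i = pc % self.entries
        l_ok = self.local[i].predict() == taken
        g_ok = self.glob.predict() == taken
        if l_ok != g_ok:                    # train chooser toward the winner
            self.chooser[i].update(g_ok)
        self.local[i].update(taken)
        self.glob.update(taken)

p = Tournament()
for _ in range(8):                          # warm up on an always-taken branch
    p.update(0x40, True)
print(p.predict(0x40))  # True
```

Real hardware predictors index these tables with hashed branch history as well as the PC; the sketch keeps only the selection mechanism.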
the target buffer is 8 times the size of the one used in the Pentium III
The trace cache and branch target buffer combined mean that
microinstruction fetch and microinstruction decoding are rarely needed because, once fetched and decoded, the items are often found in the cache, and because predictions rarely cause wrong instructions to be fetched
This architecture is very complex and relies on being able to fetch and decode instructions quickly
the process breaks down when
fewer than 3 instructions can be fetched in 1 cycle
the trace cache causes a miss, or branches are mispredicted
fewer than 3 instructions can be issued because instructions have different numbers of microoperations
e.g., one instruction has 4 and another has 1, staggering when each instruction issues and executes
Source of Stalls
limitation of reservation stations
data dependencies cause a functional unit to stall
data cache access results in a miss
in some of these cases, the issue stage must stall, in others the commit stage must stall
misprediction rates are very low: about .8% for integer benchmarks and .1% for floating point benchmarks (these are misprediction rates at the machine-instruction level, not microinstructions)
the trace cache has nearly a 0% miss rate; the L1 and L2 data caches have miss rates of around 6% and .5% respectively
the machine's effective CPI is around 2.2
Pentium IV Comparison
The P4 has over twice the performance in many SPEC benchmarks in spite of a clock speed that isn't twice as fast (this info is not in this text edition)
The text provides a comparison between the P4 and the AMD Opteron
the Opteron uses dynamic scheduling, speculation, a shallower pipeline, issue and commit of up to 3 instructions per cycle, and a 2-level cache, and the chip has a similar transistor count although it runs at only 2.8 GHz
the Opteron uses a RISC instruction set, so its instructions are machine instructions, not microinstructions
the P4 has a higher CPI on all benchmarks except mcf (on which the AMD's CPI is more than twice the P4's)
so in most cases, instructions take less clock time in the AMD than in the P4, but the P4 has a slightly faster clock
The text provides a briefer comparison between the P4 and the IBM Power5
the Power5 runs at only 1.9 GHz; the Power5 is significantly better on most floating point benchmarks and slightly worse on most integer benchmarks with a clock speed half that of the P4
see figures 2.28–2.34 for specific comparisons
Improving one aspect of our processor does not necessarily improve performance
in fact, it might harm performance
consider lengthening the pipeline depth and increasing clock speed (as with the P4) but without adding reservation stations or using the trace cache
A Balancing Act
Modern processor design takes a lot of effort to balance out the factors
without accurate branch prediction and speculation hardware, stalls from mispredicted branches will drop performance greatly
as clock speeds increase, stalls from cache misses create a bigger impact on CPI, so larger caches and cache optimization techniques are needed (we cover the latter in chapter 5)
to support multiple issue of instructions, we need a larger cache-to-processor bandwidth, which can take up valuable space
as we increase the number of instructions that can be issued, we need to increase the number of reservation stations and reorder buffer size
For even greater improvement, we might need to turn to software approaches instead of, or in addition to, hardware enhancements; in appendix G, we will visit several compiler-based ideas
Sample Problem #1
We see how complex an architecture can become in the case of the Pentium IV
assume that we have additional space on the CPU and want to enhance some element(s); what should we pick and why? The choices are to:
add more reservation stations
add more ALU functional units
add another FP functional unit
add more load/store units
add a larger branch target buffer (either more entries, or more prediction bits)
attempt to speed up the system clock and lengthen the pipeline (the additional space will be used for pipeline latches, control logic, etc.)
add more memory to the trace cache
add more memory to the L1 cache
increase the microoperation queue size to store more microoperations at any time
Solution
Let's consider each not from the perspective of how useful it might be but of how much that particular hardware is limiting instruction issue and CPI
add more reservation stations: because we can issue no more than 3 microoperations per cycle, and assuming that the average microoperation executes for under 10 cycles, the 128 registers should be sufficient
add more ALU/FP functional units: since these are pipelined, additional units are not necessary
add more load/store units: limiting the number of loads may be a source of data dependencies, so an additional load unit might help; an additional store unit is probably not necessary
add a larger branch target buffer (either more entries, or more prediction bits): prediction accuracy is extremely high, so more entries or bits are not needed
Solution Continued
attempt to speed up the system clock and lengthen the pipeline (the additional space will be used for pipeline latches, control logic, etc.): there is little that we can do to further lengthen the pipeline; this may not be feasible
add more memory to the trace cache: similar to the branch target buffer, this will probably have very little impact because of the low miss rate of the current trace cache
add more L1 cache: this can make a significant impact since the miss rate is currently fairly high; this would be my top choice
increase the microoperation queue size to store more microoperations at any time: although it is unclear how many stalls arise from running out of microoperations, because of the trace cache's performance this is probably not necessary
Sample Problem #2
Two fallacies cited in the chapter are:
Processors with lower CPI will always be faster
Processors with faster clock rates will always be faster
Limitations
By 2000, architects found limitations in just how much ILP there is to exploit
inherent limitations to multiple-issue are the limited amount of ILP of a program:
how many instructions are independent of each other? how much distance is available between loading an operand and using it? between using and saving it?
multi-cycle latency for certain types of operations that cause inconsistencies in the amount of issuing that can be simultaneous
window sizes have ranged between 4 and 32 with some recent machines having sizes of 2-8
a machine with window size of 32 achieves about 1/5 of the ideal speedup for most benchmarks (see figure on next slide)
[Figure: speedup obtained under different types of branch prediction, ranging from perfect prediction down to none.]
The Power5 has 88 additional FP and 88 additional integer registers for reservation stations
surprisingly though, the number of registers does not have a dramatic impact as long as there are at least 64 + 64 registers available
Alias Analysis
Aside from register renaming, we have name dependencies on memory references. The models are:
global (perfect analysis of all global vars)
stack perfect (perfect analysis of all stack references)
inspection (examine accesses for interference at compile time)
none (assume all references conflict)
The authors describe an ambitious but realistic processor that could be available with today's technology:
issue up to 64 instructions / cycle with no restrictions on what instructions can be issued in the same cycle
tournament branch predictor with 1K entries and a 16-entry return predictor
perfect memory reference disambiguation performed dynamically
register renaming with 64 int and 64 FP registers
with a 64 instruction / cycle issue capability, the average number of instructions issued per cycle is estimated to be around 20; if there are no stalls for limited hardware, cache misses and misspeculation, this would result in a CPI of .05!
A Realizable Processor
we might question whether a 64-instruction window is reasonable given the complexity needed in comparing up to 64 instructions against each other in each cycle; today we find most computers limit window sizes to 8 at most
Example
Let's compare three hypothetical processors and determine their MIPS rating for the gcc benchmark
processor 1: simple MIPS 2-issue superscalar pipeline with a clock rate of 4 GHz, CPI of 0.8, and a cache system with .005 misses per instruction
processor 2: deeply pipelined MIPS with a clock rate of 5 GHz, CPI of 1.0, and a smaller cache yielding .0055 misses per instruction
processor 3: speculative superscalar with a 64-entry window that achieves 50% of its ideal issue rate (see figure 3.7), with a clock rate of 2.5 GHz and a small cache yielding .01 misses per instruction (although 25% of the miss penalty is not visible due to dynamic scheduling)
assume memory access time (miss penalty) is 50 ns
to solve this problem, we have to determine each processor's CPI, which is a combination of processor CPI and the impact of memory (cache misses)
Solution
Processor 1:
4 GHz clock = .25 ns per clock cycle
memory access of 50 ns, so miss penalty = 50 / .25 = 200 cycles
cache penalty = .005 * 200 = 1.0 cycles per instruction
overall CPI = 0.8 + 1.0 = 1.8
MIPS = 4 GHz / 1.8 = 2222 MIPS
Processor 2:
5 GHz clock = .2 ns per clock cycle
miss penalty = 50 / .2 = 250 cycles
cache penalty = .0055 * 250 = 1.4 cycles per instruction
overall CPI = 1.0 + 1.4 = 2.4
MIPS = 5 GHz / 2.4 = 2083 MIPS
Processor 3:
2.5 GHz clock = .4 ns per clock cycle
miss penalty takes effect only 75% of the time, so miss penalty = .75 * 50 / .4 = 94 cycles
cache penalty = .01 * 94 = 0.94
CPU portion of the CPI is based on half the ideal issue rate of a 64-entry window, which is 1 / (9 / 2) ≈ 0.22
overall CPI = 0.94 + 0.22 = 1.16
MIPS = 2.5 GHz / 1.16 = 2155 MIPS
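All three calculations follow the same pattern, which can be checked with a short sketch (the function name and parameters are mine, not the text's; the small differences from the slide's MIPS figures come from the slide rounding intermediate values):

```python
# overall CPI = base CPI + misses/instruction * visible miss penalty (cycles);
# MIPS = clock in MHz / CPI. miss_visible scales the penalty when dynamic
# scheduling hides part of it (processor 3 hides 25%).
def mips(clock_ghz, base_cpi, misses_per_instr, mem_ns=50, miss_visible=1.0):
    cycle_ns = 1.0 / clock_ghz
    penalty_cycles = miss_visible * mem_ns / cycle_ns
    cpi = base_cpi + misses_per_instr * penalty_cycles
    return clock_ghz * 1000 / cpi

print(round(mips(4.0, 0.8, 0.005)))    # 2222, matching the slide
print(round(mips(5.0, 1.0, 0.0055)))   # 2105 (slide rounds 1.375 to 1.4, giving 2083)
print(round(mips(2.5, 0.22, 0.01, miss_visible=0.75)))  # 2160 (slide uses 94 cycles, giving 2155)
```

Either way the ranking holds: the modest 2-issue 4 GHz design edges out both the deeper pipeline and the wide speculative core on this benchmark.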
Sample Problem #1
For the li benchmark
compare a perfect processor to one that has a 128-entry window, tournament branch predictor, 64 integer and 64 FP renaming registers, and inspection alias analysis
The more realistic processor is most limited by alias analysis (4 instructions per cycle), so a CPI = .25
the perfect machine is then .25 / .083 = 3 times faster on this benchmark
Sample Problem #2
Architects are considering one of three enhancements to the next generation of computers:
more on-chip cache to reduce the impact of memory access
faster memories
faster clock rates
Explain, using the example on pages 167-169, how each of these would impact the three hypothetical processors
more on-chip cache lowers cache CPI; depending on the current miss rate this might be useful, but for processors 1 and 2 the miss rates are already < 1%
faster memory reduces cache CPI (it decreases the number of cycles needed for any cache miss); since all three processors' CPIs are roughly half from cache misses and half from processor performance, this could have a significant impact
faster clock rates increase cache CPI and possibly have no effect on execution CPI; by merely increasing the clock rate, the stalls for memory accesses will increase; however, if this increase is coupled with a longer pipeline, then execution CPI might decrease and so overall performance might improve
Sample Problem #3
Consider a speculative superscalar with a window size of 32
with proper hardware support, the superscalar can issue 70% of the expected issue rate (see figure 3.2)
the processor has a 3.33 GHz clock rate
the processor stalls when all functional units are busy (which arises once in every 12 cycles)
when there is a misprediction, the processor requires 6 complete cycles to flush the reorder buffer and begin again (profile-based prediction is used)
memory accesses take 40 ns; 40% of the instructions are loads or stores; the instruction cache has a miss rate of .5% and the data cache has a miss rate of .03%
Solution:
cache miss penalty = 40 ns / (.3 ns per cycle at 3.33 GHz) ≈ 133 cycles
memory CPI = .005 * 133 + .40 * .0003 * 133 ≈ .665 + .016 = .681
CPU CPI = 1 / 6.3 + 1 / 12 + 6 * .05 = .159 + .083 + .300 = .542
overall CPI = .681 + .542 ≈ 1.22
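As a quick check of the arithmetic (note that at 3.33 GHz a 40 ns access works out to roughly 133 cycles; a 120-cycle penalty would correspond to a 3 GHz clock):

```python
# Recompute the Sample Problem #3 CPI from the given parameters.
miss_penalty = 40 * 3.33                  # ns * GHz = cycles, about 133
memory_cpi = 0.005 * miss_penalty + 0.40 * 0.0003 * miss_penalty
# CPU CPI: issue at 70% of the ideal rate of 9 (6.3/cycle), a structural
# stall once every 12 cycles, and a 6-cycle flush at a 5% misprediction rate.
cpu_cpi = 1 / 6.3 + 1 / 12 + 6 * 0.05
print(round(memory_cpi + cpu_cpi, 2))     # 1.22
```

Memory stalls and processor stalls each contribute roughly half of the total, which is why the problem splits the CPI into those two pieces.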