The document discusses advanced computer architecture and ways to improve computer performance. It covers topics like RISC instruction sets, pipelining, instruction-level parallelism, cache performance optimization, and exploiting parallelism at various levels. It discusses quantitative measures for comparing architectural ideas and computer performance, such as execution time, throughput, and benchmarks. Amdahl's law is introduced as a way to calculate expected speedup from architectural enhancements.
Copyright: Attribution Non-Commercial (BY-NC)
Introduction

• In CSC 362, we focused on:
– the roles of the components in the implementation
– the structure of the architecture (how things connect together)
• Here, we consider issues in current architecture design:
– using available technology to improve computer performance
– using quantitative measures to test architectural ideas
– using a RISC instruction set for examples
– discussing a variety of software and hardware techniques that provide optimization
– attempting to extract as much parallelism from the code as possible
• Topics covered include:
– RISC instruction sets
– pipelining
– instruction-level parallelism
– block-level parallelism
– thread-level parallelism
– multiprocessors
– improving cache performance
– optimizing virtual memory usage

Measuring Performance

• We might use one of the following metrics to measure performance:
– MIPS, MegaFLOPS
• neither of these tells us how the processor performs on the other type of operation
– clock speed (GHz rating)
• misleading, as we will explore throughout the semester
– execution time
• worthwhile on an unloaded system
– throughput
• number of programs per unit time; useful for servers and large systems
– wall-clock time
– CPU time, user CPU time, system CPU time
• CPU time = user CPU time + system CPU time
– system performance
• measured on an unloaded system
• note: CPU performance = 1 / execution time
• What does it mean to say that one computer is faster than another?
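One concrete answer is to compare execution times directly. A minimal Python sketch of that idea, using the 0.32 s vs. 0.38 s timings that appear in the example below (the helper names are illustrative, not from the notes):

```python
# A minimal sketch of the execution-time view of performance.
def performance(exec_time_sec):
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / exec_time_sec

def times_faster(exec_time_x, exec_time_y):
    """How many times faster machine X is than machine Y
    on the same program: ratio of Y's time to X's time."""
    return exec_time_y / exec_time_x

t_x, t_y = 0.32, 0.38           # program p1 timings on X and Y (seconds)
n = times_faster(t_x, t_y)
print(f"X is {n:.2f} times faster than Y")   # X is 1.19 times faster than Y
```

Note that the comparison is only valid because both timings are for the same program on an unloaded system.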
Meaning of Performance

• "X is n times faster than Y" means:
– Exec time Y / Exec time X = n
– Perf X / Perf Y = n
• Example:
– if the throughput of X is 1.3 times higher than Y's
• then X can execute 1.3 times as many tasks as Y in the same amount of time
• Example:
– X executes program p1 in .32 seconds
– Y executes program p1 in .38 seconds
– X is .38 / .32 = 1.19 times faster (19% faster)
• To validly compare two computers' performance, we must compare their performance on the same program
• Additionally, different computers may perform better on different programs
– e.g., C1 runs P1 faster than C2, but C2 runs P2 faster than C1
– we might use weighted averages or geometric means, as well as distributions, to derive a single processor's overall performance (see pages 34-37 if you are interested)

Benchmarks

• A benchmark suite is a set of programs that test different performance metrics
– example: test array capabilities, floating point operations, loops, ...
– the SPEC benchmark suites are commonly cited
– SPEC 96 is the most recent benchmark; see figure 1.13 on page 31
• Four levels of programs can be used to test performance:
– real programs
• e.g., a C compiler, a CAD tool
• programs with input, output, and options that the user selects
– kernels
• remove key pieces of programs and test just those
– toy benchmarks
• 10-100 lines of code, such as quicksort, whose performance is known in advance
– synthetic benchmarks
• try to match the average frequency of operations to simulate larger programs
• Only real programs are used today
– the other three have been discredited, since computer architects and compiler writers will optimize systems to perform well on those specific benchmarks/kernels
• Reported benchmark results must include:
– compiler settings and version
– input
– OS
– number/size of disks
• Results must be reproducible

Principles of Computer Design

• As computer architecture research has progressed, several key design concepts have been identified
– the goal today is to further exploit each of these, because they provide a great deal of performance speedup
– we will examine these and use a quantitative approach to identify the extent of the speedup
• Take advantage of parallelism
– using multiple hardware components (ALU functional units, memory modules, register ports, disk drives, etc.), we can attempt to execute instructions and threads in parallel
• Principle of locality of reference
– used to design memory systems so that we can attempt to keep in cache the data and instructions that will most likely be referenced soon
• Focus on the common case
– as we see next, a small speedup for the common case is better than a large speedup for an uncommon case

Amdahl's Law

• To explore architectural improvements, we need a mechanism to gauge the speedup of our improvements
• Amdahl's Law allows us to compute the speedup gained by using a particular feature
• Given an enhancement E:
– Speedup = performance with E / performance without E, or equivalently
– Speedup = execution time without E / execution time with E
• The law uses two factors:
– the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement (F)
– the improvement gained by the enhanced execution mode, i.e., how much faster the task would run if the enhanced mode were used for the entire program (S)

Speedup = 1 / [(1 - F) + F / S]

Examples

• Example 1:
– a web server is to be enhanced
• the new CPU is 10 times faster on computation than the old CPU
• the original CPU spent 40% of its time processing and 60% of its time waiting for I/O
– What will the speedup be?
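Amdahl's Law makes this a one-line computation. A sketch, with the 40% fraction and the factor of 10 taken from the example above:

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of the original execution
    time is sped up by a factor of s (Amdahl's Law)."""
    return 1.0 / ((1.0 - f) + f / s)

# Example 1: 40% of the time is computation; the new CPU is 10x faster there.
print(round(amdahl_speedup(0.4, 10), 2))   # 1.56
```

Note that as S grows without bound, the speedup is capped at 1 / (1 - F), which is why the fraction of time the enhancement applies matters more than the raw factor.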
• Fraction enhanced F = 40%; speedup of the enhanced mode S = 10
• Speedup = 1 / [(1 - .4) + .4 / 10] = 1.56
• Example 2:
– a benchmark consists of:
• 20% FP sqrt
• 50% FP operations (including sqrt)
• 50% other operations
– the enhancement options are:
• add FP sqrt hardware to speed up sqrt performance by a factor of 10, or
• enhance all FP operations by a factor of 1.6
– Speedup FP sqrt = 1 / [(1 - .2) + .2 / 10] = 1.22
– Speedup all FP = 1 / [(1 - .5) + .5 / 1.6] = 1.23
– The enhancement that supports the common case is (slightly) better

CPU Performance Formulae

• CPU time = CPU clock cycles * clock cycle time
– CPU clock cycles: the number of clock cycles that elapse during the execution of the given program
– clock cycle time is the reciprocal of the clock rate (how much time elapses in one clock cycle), which gives us:
• CPU time = CPU clock cycles for program / clock rate
• CPU time = IC * CPI * clock cycle time
– IC: instruction count (number of instructions)
– CPI: clock cycles per instruction
– IC * CPI = CPU clock cycles
• CPI = CPU clock cycles / IC
• CPU time = (Σ CPIi * ICi) * clock cycle time
• Average CPI = (Σ CPIi * ICi) / total instruction count
– in the latter equations, CPIi and ICi are for each type of operation (for instance, the CPI and number of adds, the CPI and number of loads, ...)

Example

• Assume:
– frequency of FP operations (including sqrt) = 25%, frequency of FP sqrt = 2%
– average CPI of FP operations = 4.0, CPI of FP sqrt = 20
– average CPI of other instructions = 1.33
– CPI original = 4 * 25% + 1.33 * 75% = 2.0
• Two alternatives:
– reduce the CPI of FP sqrt to 2, or
– reduce the average CPI of all FP ops (including sqrt) to 2.5
• CPI new FP sqrt = CPI original - 2% * (20 - 2) = 2.0 - .36 = 1.64
• CPI new FP = 75% * 1.33 + 25% * 2.5 = 1.625
– Speedup new FP sqrt = CPI original / CPI new FP sqrt = 2.0 / 1.64 = 1.22
– Speedup new FP = CPI original / CPI new FP = 2.0 / 1.625 = 1.23

Computing Speedup – Which Formula?

• We can compute speedup by:
– determining the difference in CPU time before and after an enhancement, or
– using Amdahl's Law
• Which should we use?
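Before answering, the weighted-CPI arithmetic from the FP example above can be sanity-checked in code. A sketch using the values given in the example (the notes round the original CPI to 2.0; unrounded it is 1.9975, which does not change the conclusion):

```python
# Weighted (average) CPI: sum of frequency_i * CPI_i over instruction types.
def average_cpi(mix):
    """mix: list of (frequency, cpi) pairs covering 100% of instructions."""
    return sum(freq * cpi for freq, cpi in mix)

cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])   # ~2.0
cpi_new_sqrt = cpi_original - 0.02 * (20 - 2)             # sqrt CPI 20 -> 2
cpi_new_fp   = average_cpi([(0.25, 2.5), (0.75, 1.33)])   # all FP ops -> 2.5

print(round(cpi_original / cpi_new_sqrt, 2))   # 1.22
print(round(cpi_original / cpi_new_fp, 2))     # 1.23
```

Since IC and the clock are unchanged, the ratio of CPIs is the speedup, confirming that enhancing all FP operations (the common case) edges out the sqrt-only hardware.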
– The formulas give the same answer; let's demonstrate this with an example:
• a benchmark consists of 35% loads, 15% stores, 40% ALU operations, and 10% branches
– the CPI is 5 for loads and stores and 4 for ALU operations and branches
– since this is an integer benchmark, the floating point registers are unused
– we could keep more values in registers by moving them to floating point registers, rather than storing them to and later reloading them from memory
• let's have the compiler replace some of the loads/stores with register moves
– this enhancement is done by the compiler, so it costs us nothing!
– assuming the compiler can eliminate 20% of the loads and stores this way, how worthwhile is it?

Solution

• We change some loads/stores into ALU operations (register moves)
– the overall CPI goes down; IC remains the same
• Solution 1: compute the CPU time difference
– CPU time = IC * CPI * clock cycle time
– CPIold = 50% * 5 + 50% * 4 = 4.5
– CPInew = 40% * 5 + 60% * 4 = 4.4
– since IC and the clock have not changed, the speedup is just CPIold / CPInew
– Speedup = 4.5 / 4.4 = 1.0227, a 2.27% speedup
• Solution 2: Amdahl's Law
– speedup of the enhanced mode: from 5 cycles to 4 cycles, so S = 5/4 = 1.25
– F is the fraction of the execution time where we use register moves instead of loads/stores:
• the overall CPI is 4.5
• the enhancement is used on 20% of the loads/stores
• 20% * 50% * 5 = .5 clock cycles out of 4.5, or .5 / 4.5 = 11.1% of the time
– Speedup = 1 / [1 - F + F / S] = 1 / [1 - .111 + .111 / 1.25] = 1 / .9778 = 1.0227, again a 2.27% speedup

Why MIPS Can Be Misleading

• MIPS = IC / (execution time * 10^6)
– execution time = IC * CPI / clock rate
– so MIPS = clock rate / (CPI * 10^6)
• Assume a load-store machine with a breakdown of:
– 43% ALU operations, CPI = 1
– 21% loads, 12% stores, and 24% branches, each with CPI = 2
• An optimizing compiler is able to discard 50% of the ALU operations
• Ignoring system issues, if the machine has a 2 nanosecond clock cycle (500 MHz), what are the MIPS ratings of the unoptimized and optimized versions, and does the MIPS value agree with the execution time?
– CPIunoptimized = .43 * 1 + .57 * 2 = 1.57
– MIPSunoptimized = 500 MHz / (1.57 * 10^6) = 318.5
– CPIoptimized = (.43/2 * 1 + .57 * 2) / (1 - .43/2) = 1.73
– MIPSoptimized = 500 MHz / (1.73 * 10^6) = 289.0
• The optimized program executes faster because it has fewer instructions, but its CPI is larger (a greater portion of its instructions have the higher CPI), so its MIPS rating is lower
• So, MIPS and execution time are not directly related!

Sample Problem #1

• Consider adding register-memory ALU instructions to a machine that previously permitted only register-register ALU operations
• Assume a benchmark with the following breakdown of operations is used to test this enhancement:
– ALU operations: 43%, CPI = 1
– loads: 21%, CPI = 2
– stores: 12%, CPI = 2
– branches: 24%, CPI = 2
• The new ALU register-memory operation has the following consequences:
– ALU register-memory operations have CPI = 2, and branches now have CPI = 3
– 25% of loaded values are used only once, so the new ALU register-memory instruction can replace those load + ALU pairs
• Is it worth it?
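One way to answer is to script the CPI and instruction-count bookkeeping. A sketch of that computation (variable names are illustrative; with unrounded fractions the speedup comes out near 0.92 rather than the 0.934 obtained with the rounded percentages in the written solution, but the conclusion is the same):

```python
# Register-memory ALU trade-off: does the enhancement pay off?
alu, loads, stores, branches = 0.43, 0.21, 0.12, 0.24

cpi_old = alu * 1 + (loads + stores + branches) * 2   # 1.57

moved  = 0.25 * alu    # load + ALU pairs replaced by reg-mem ALU ops
ic_new = 1.0 - moved   # those loads disappear: ~89% of the old IC

# Renormalize the instruction mix over the smaller instruction count.
f_loads    = (loads - moved) / ic_new
f_stores   = stores / ic_new
f_branches = branches / ic_new   # branches now cost 3 cycles
f_alu      = alu / ic_new        # 25% of these are reg-mem (CPI 2)

cpi_new = (f_loads * 2 + f_stores * 2 + f_branches * 3
           + f_alu * (0.25 * 2 + 0.75 * 1))

speedup = cpi_old / (ic_new * cpi_new)   # clock is unchanged
print(speedup < 1)                        # True: it's a slowdown
```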
Solution

• Three changes occur:
– some ALU operations use the new mode, which changes their CPI
– there are fewer loads
– all branches have a higher CPI
• We have a new distribution:
– 25% of the ALU operations become ALU register-memory operations
• 25% * 43% = 11% of the instructions were loads that are now eliminated, leaving us 89% as many instructions as before
– loads: [21% - (25% * 43%)] / 89% = 11%
– stores: 12% / 89% = 13%
– ALU operations: 43% / 89% = 48%
– branches: 24% / 89% = 27%
• CPIold = .43 * 1 + .57 * 2 = 1.57
• CPInew = .11 * 2 + .13 * 2 + .27 * 3 + .48 * (.25 * 2 + .75 * 1) = 1.89
• CPU time = IC * CPI * clock cycle time
– the clock is unchanged; CPI has been recomputed; IC in the new system is 89% of the old system's
– CPUold = IC * 1.57 * CCT
– CPUnew = .89 * IC * 1.89 * CCT
– Speedup = 1.57 / (.89 * 1.89) = .934
• This is a slowdown, so the enhancement is not an improvement!

Sample Problem #2

• Assume a machine with a perfect cache and the following instruction mix:
– ALU: 43%, CPI 1
– loads: 21%, CPI 2
– stores: 12%, CPI 2
– branches: 24%, CPI 2
• An imperfect cache has a miss rate of 5% for instructions and 10% for data, with a miss penalty of 40 cycles
• How much faster is the machine with the perfect cache?
– CPIperfect = .43 * 1 + .57 * 2 = 1.57
– with an imperfect cache, each instruction's CPI grows by the expected miss cycles for instruction fetch (5% miss rate) and, for loads and stores, for the data access (10% miss rate), where each miss adds 40 cycles:
– CPIimperfect = .43 * (1 + .05 * 40) + .21 * (2 + .05 * 40 + .10 * 40) + .12 * (2 + .05 * 40 + .10 * 40) + .24 * (2 + .05 * 40) = 4.89
– the perfect-cache machine is 4.89 / 1.57 = 3.11 times faster

Sample Problem #3

• Architects are considering one of two enhancements for their processor:
– #1 can be used 20% of the time and offers a speedup of 3
– #2 offers a speedup of 7
• What fraction of the time must the second enhancement be used to achieve the same overall speedup as the first?
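This reduces to inverting Amdahl's Law to solve for the fraction F. A sketch (the `solve_fraction` helper is hypothetical, derived by the same algebra as the written solution):

```python
def amdahl(f, s):
    """Amdahl's Law: overall speedup given fraction f and mode speedup s."""
    return 1.0 / ((1.0 - f) + f / s)

def solve_fraction(target_speedup, s):
    """Invert Amdahl's Law: find f such that amdahl(f, s) == target_speedup.
    Algebra: target = 1 / (1 - f*(1 - 1/s))  =>  f = (1 - 1/target) / (1 - 1/s)."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / s)

target = amdahl(0.20, 3)        # overall speedup of enhancement #1, ~1.154
f2 = solve_fraction(target, 7)  # required usage fraction for enhancement #2
print(round(f2, 3))             # 0.156
```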
– speedup from #1 = 1 / [(1 - .2) + .2 / 3] = 1.154
• For the second enhancement to match, we need 1.154 = 1 / [(1 - x) + x / 7], and we must solve for x:
– 1.154 = 1 / [(7 - 7x + x) / 7] = 7 / (7 - 6x)
– so 7 - 6x = 7 / 1.154
– 6x = 7 - 7 / 1.154 = 0.934
– x = 0.934 / 6 = 0.156

Sample Problem #4

• We will compare a CISC machine and a RISC machine on a benchmark
• The machines have the following characteristics:
– the CISC machine has CPIs of 4 for load/store, 3 for ALU/branch, and 10 for call/return, with a clock rate of 1.75 GHz
– the RISC machine has a CPI of 1.2 (as it is pipelined) and a clock rate of 1 GHz
– because the CISC machine uses complex instructions, the CISC version of the benchmark has 40% fewer instructions than the RISC version (that is, IC CISC is 40% less than IC RISC)
• The benchmark has a breakdown of:
– 38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns, and 11% branches
• Which machine will run the benchmark in less time?

Solution

• We compare the CPU time of the two machines
– CPU time = IC * CPI / clock rate
– since both clock rates are in GHz, we drop the GHz units to simplify
• CISC machine:
– first, compute the CISC machine's CPI from the individual CPIs and the benchmark's instruction breakdown:
• CPI = 4 * (.38 + .10) + 3 * (.35 + .11) + 10 * (.03 + .03) = 3.9
– recall that the CISC machine executes 40% fewer instructions, so IC CISC = .6 * IC RISC
– CPU time CISC = .6 * IC RISC * 3.9 / 1.75 = 1.34 * IC RISC
• RISC machine:
– CPU time RISC = IC RISC * 1.2 / 1 = 1.2 * IC RISC
• Since the RISC CPU time is smaller, the RISC machine is faster, by 1.34 / 1.2 = 1.12, or 12%
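The CISC/RISC comparison can be reproduced in a few lines. A sketch using the numbers from the problem, with GHz units dropped as in the solution and the RISC instruction count normalized to 1:

```python
# Benchmark mix, grouped by the CISC machine's CPI classes.
load_store = 0.38 + 0.10   # CPI 4
alu_branch = 0.35 + 0.11   # CPI 3
call_ret   = 0.03 + 0.03   # CPI 10

cpi_cisc = 4 * load_store + 3 * alu_branch + 10 * call_ret   # 3.9

ic_risc = 1.0              # normalize the RISC instruction count to 1
ic_cisc = 0.6 * ic_risc    # CISC version has 40% fewer instructions

t_cisc = ic_cisc * cpi_cisc / 1.75   # CPU time = IC * CPI / clock rate
t_risc = ic_risc * 1.2 / 1.0

print(round(t_cisc, 2), round(t_risc, 2))   # 1.34 1.2 -> RISC wins
```

The RISC machine wins by roughly 12% even though it executes more instructions, because its lower CPI does not fully offset the CISC machine's instruction-count and clock-rate advantages in this case.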