
Advanced Computer Architecture

• In CSC 362, we focused on
  – the roles of the components in the architecture
  – the structure of the architecture (how things connect together)
• Here, we focus on
  – Using available technology to improve computer performance
  – Using quantitative measures to test architectural ideas
  – Using a RISC instruction set for examples
  – Discussing a variety of software and hardware techniques that provide optimization
  – Attempting to force as much parallelism out of the code as possible
• We will consider issues in current architecture design and implementation:
  – RISC instruction sets
  – Pipelining
  – Instruction-level parallelism
  – Block-level parallelism
  – Thread-level parallelism
  – Multiprocessors
  – Improving cache performance
  – Optimizing virtual memory usage
Measuring Performance
• We might use one of the following terms to measure performance
– MIPS, MegaFLOPS
• neither of these terms tells us how the processor performs on the other type of
operation
– Clock Speed (GHz rating)
• misleading as we will explore throughout the semester
– Execution time
• worthwhile on an unloaded system
– Throughput
• number of programs / unit time – useful for servers and large systems
– Wall-clock time
– CPU time, user CPU time, system CPU time
• CPU time = user CPU time + system CPU time
– System performance
• on an unloaded system
– note: CPU performance = 1 / execution time
• What does it mean that one computer is faster than another?
Meaning of Performance
• X is n times faster than Y means
– Exec time Y / Exec time X = n
– Perf X / Perf Y = n
• Example:
– if the throughput of X is 1.3 times higher than that of Y
• then the number of tasks that can be executed on X is 1.3 times the number on Y in the
same amount of time
• Example:
– X executes program p1 in .32 seconds
– Y executes program p1 in .38 seconds
– X is .38 / .32 = 1.19 times faster
• 19% faster
• To validly compare two computers’ performance, we must
compare performance on the same program
• Additionally, different computers may perform better on different programs
– e.g., C1 runs P1 faster than C2 but C2 runs P2 faster than C1
– we might use weighted averages or geometric means, as well as
distributions, to derive a single processor’s overall performance (see pages
34-37 if you are interested)
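The n-times-faster definition above can be checked with a couple of lines (a minimal sketch; the times for p1 come from the example):

```python
def speedup(exec_time_y, exec_time_x):
    """n such that X is n times faster than Y: n = Exec time Y / Exec time X."""
    return exec_time_y / exec_time_x

# X runs p1 in .32 seconds, Y runs p1 in .38 seconds
n = speedup(0.38, 0.32)
print(round(n, 2))   # 1.19, i.e., X is about 19% faster
```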
Benchmarks

• Four levels of programs can be used to test performance
  – Real programs
    • e.g., a C compiler, a CAD tool
    • programs with input, output, and options that the user selects
  – Kernels
    • remove key pieces of programs and test just those
  – Toy benchmarks
    • 10-100 lines of code, such as quicksort, whose performance is known in advance
  – Synthetic benchmarks
    • try to match the average frequency of operations to simulate larger programs
• Only real programs are used today
  – The others have been discredited because computer architects and compiler writers
    will optimize systems to perform well on those specific benchmarks/kernels
• A benchmark suite is a set of programs that test different performance metrics
  – Example: test array capabilities, floating point operations, loops
  – The SPEC benchmark suites are commonly cited
  – SPEC 96 is the most recent benchmark suite; see figure 1.13 on page 31
• Reporting benchmark results must include
  – compiler settings and version
  – input
  – OS
  – number/size of disks
• Results must be reproducible
Principles of Computer Design
• As computer architecture research has progressed,
several key design concepts have been identified
– The goal today is to further exploit each of these because they
provide a great deal of performance speed up
– We will examine these and use a quantitative approach to
identify the extent of the speedup
• Take advantage of parallelism
– By using multiple hardware components (ALU functional units, memory
modules, register ports, disk drives, etc.), we can attempt to execute
instructions and threads in parallel
• Principle of locality of reference
– Used to design memory systems so that we can attempt to keep in cache
the data and instructions that will most likely be referenced soon
• Focus on the common case
– As we see next, if we can achieve a small speedup for executing the
common case, it is better than achieving a large speedup for an
uncommon case
Amdahl’s Law

• In order to explore architectural improvements, we need a mechanism
  to gauge the speedup our improvements provide
• Amdahl’s Law allows us to compute the speedup that can be gained by
  using a particular feature, as follows
• Given an enhancement E
  – Speedup = performance with E / performance without E
    or
  – Speedup = execution time without E / execution time with E
• The law uses two factors:
  – the fraction of the computation time in the original machine that can be
    converted to take advantage of the enhancement (F)
  – the improvement gained by the enhanced execution mode, i.e., how much
    faster the task would run if the enhanced mode were used for the entire
    program (S)

    Speedup = 1 / (1 – F + F / S)
Examples

• Example 1:
– Web server is to be enhanced
• new CPU is 10 times faster on computation than old CPU
• the original CPU spent 40% of its time processing and 60% of its time waiting
for I/O
– What will the speedup be?
• Fraction enhancement used = 40%
• Speedup in enhanced mode = 10
• Speedup = 1 / [(1 - .4) + .4/10] = 1.56
• Example 2:
– A benchmark spends:
  • 20% of its execution time in FP sqrt
  • 50% in FP operations (including sqrt)
  • 50% in other operations
– Enhancement options are:
• add FP sqrt hardware to speed up sqrt performance by a factor of 10
• enhance all FP operations by a factor of 1.6
– Speedup FP sqrt = 1/[(1-.2) + .2/10] = 1.22
– Speedup all FP = 1/[(1-.5) + .5/1.6] = 1.23
– The enhancement to support the common case is (slightly) better
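Both examples can be verified with a short Amdahl's Law helper (a sketch; the fractions and speedup factors come from the slides above):

```python
def amdahl(f, s):
    """Overall speedup when fraction f of execution time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Example 1: the enhancement covers 40% of the time and is 10x faster
print(round(amdahl(0.4, 10), 2))    # 1.56

# Example 2: FP sqrt (20% of time, 10x) vs. all FP ops (50% of time, 1.6x)
print(round(amdahl(0.2, 10), 2))    # 1.22
print(round(amdahl(0.5, 1.6), 2))   # 1.23 -- the common case wins (slightly)
```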
CPU Performance Formulae
• CPU time = CPU clock cycles * clock cycle time
– CPU clock cycles – the number of clock cycles that elapse during the
execution of the given program
– clock cycle time is the reciprocal of the clock rate – that is, how much
time elapses for one clock cycle, which gives us:
• CPU time = CPU clock cycles for prog / clock rate
• CPU time = IC * CPI * Clock cycle time
– IC - instruction count (number of instructions)
– CPI - clock cycles per instruction
– IC * CPI = CPU clock cycles
• CPI = CPU clock cycles / IC
• CPU time = ( CPIi * ICi) * clock cycle time
• Average CPI = (CPIi * ICi) / Total Instruction Count
– In the latter equation, CPIi and ICi are for each type of operation (for
instance, the CPI and number of adds, the CPI and number of loads, …)
Example
• Assume:
  – frequency of FP operations = 25% (including sqrt); frequency of
    FP sqrt = 2%
  – average CPI of FP operations = 4.0; CPI of FP sqrt = 20
  – average CPI of other instructions = 1.33
  – CPI original = 4 * 25% + 1.33 * 75% = 2.0
• Two alternatives:
  – reduce the CPI of FP sqrt to 2, or
  – reduce the average CPI of all FP ops (including sqrt) to 2.5
• CPI new FP sqrt = CPI original – 2% * (20 – 2) = 1.64
• CPI new FP = 75% * 1.33 + 25% * 2.5 = 1.625
  – Speedup new FP = CPI original / CPI new FP = 2.0 / 1.625 = 1.23
  – Speedup new FP sqrt = 2.0 / 1.64 = 1.22, so again the common case wins (slightly)
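The CPI arithmetic above can be replayed directly (a sketch; fractions and per-class CPIs are the slide's):

```python
# Instruction-mix fractions and CPIs from the example.
f_fp, f_sqrt, f_other = 0.25, 0.02, 0.75
cpi_fp, cpi_sqrt, cpi_other = 4.0, 20.0, 1.33

cpi_orig = cpi_fp * f_fp + cpi_other * f_other      # ~2.0
cpi_new_sqrt = cpi_orig - f_sqrt * (cpi_sqrt - 2)   # ~1.64
cpi_new_fp = f_other * cpi_other + f_fp * 2.5       # ~1.62
print(round(cpi_orig / cpi_new_fp, 2))              # 1.23
```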
Computing Speedup – which formula?
• We can compute speedup by
– determining the difference in CPU time before and after an enhancement
– or by using Amdahl’s Law
• Which should we use?
  – the formulas are equivalent
  – let's demonstrate this with an example:
• Benchmark consists of 35% loads, 15% stores, 40% ALU
operations and 10% branches
– CPI for each instruction is 5 for loads and stores and 4 for ALU and
branches (since this is an integer benchmark, the floating point registers
are not used)
– Consider that we could keep more values in registers by moving them to
floating point registers rather than storing and then reloading these values
in memory
• Let’s have the compiler replace some of the loads/stores with
register moves
– this enhancement is done by the compiler, so costs us nothing!
– assuming that the compiler can convert 20% of the loads/stores in the
program, how worthwhile is it?
Solution
• We change some loads/stores to ALU operations
– so overall CPI goes down, IC remains the same
• Solution 1: compute CPU Time differences
  – CPU Time = IC * CPI * clock cycle time
  – CPIold = 50% * 5 + 50% * 4 = 4.5
  – CPInew = 40% * 5 + 60% * 4 = 4.4
  – Since IC and the clock cycle time have not changed, the speedup is simply CPIold /
    CPInew
  – Speedup = 4.5 / 4.4 = 1.0227, a 2.27% speedup
• Solution 2: Amdahl’s Law
– Speedup of enhanced mode is from 5 cycles to 4 cycles or 5/4 = 1.25
– Fraction used = fraction of the execution time where we use conversions
instead of loads/stores
• overall CPI is 4.5
• enhancement used on 20% of loads/stores
• 20% * 50% * 5 = .5 clock cycles out of 4.5, or .5 / 4.5 = 11.1% of the time
– Amdahl’s Law = 1 / [1 – F + F / S] = 1 / [1 - .111 + .111 / 1.25] =
1 / .9778 = 1.0227 = 2.27% speedup
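The two routes really do coincide, as a quick computation confirms (a sketch using the slide's instruction mix):

```python
# Route 1: CPU-time ratio (IC and clock cycle time cancel out).
cpi_old = 0.50 * 5 + 0.50 * 4          # 4.5
cpi_new = 0.40 * 5 + 0.60 * 4          # 4.4
by_cpu_time = cpi_old / cpi_new

# Route 2: Amdahl's Law.
f = (0.20 * 0.50 * 5) / cpi_old        # fraction of time enhanced ~ .111
s = 5 / 4                              # loads/stores (CPI 5) become moves (CPI 4)
by_amdahl = 1 / ((1 - f) + f / s)

print(round(by_cpu_time, 4), round(by_amdahl, 4))   # 1.0227 1.0227
```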
Why MIPS Can Be Misleading

• Assume a load-store machine with a breakdown of
  – 43% ALU operations
  – 21% loads, 12% stores
  – 24% branches
  – CPI = 1 for ALU operations
  – CPI = 2 for all other operations
  – an optimizing compiler is able to discard 50% of the ALU operations
• Ignoring system issues, if the machine has a 2 nanosecond clock cycle (500
  MHz) and an unoptimized CPI of 1.57,
  – what is the MIPS rating for the optimized and unoptimized versions? does the
    MIPS value agree with the execution time?
• MIPS = IC / (Execution Time * 10^6)
  – Execution Time = IC * CPI / clock rate
  – so, MIPS = clock rate / (CPI * 10^6)
• CPIunoptimized = .43 * 1 + .57 * 2 = 1.57
• MIPSunoptimized = (500 * 10^6) / (1.57 * 10^6) = 318.5
• CPIoptimized = (.43 / 2 * 1 + .57 * 2) / (1 – .43 / 2) = 1.73
• MIPSoptimized = (500 * 10^6) / (1.73 * 10^6) = 289.0
• The optimized program will execute faster because it has fewer instructions,
  but its CPI is larger because a greater portion of its instructions have a higher
  CPI, and therefore its MIPS rating is lower
• So, MIPS and execution time are not directly related!
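The mismatch is easy to demonstrate: compute both the MIPS ratings and the execution times (a sketch; the instruction count is an arbitrary assumed value, since it cancels out of the comparison):

```python
clock_rate = 500e6                  # 500 MHz, i.e., a 2 ns clock cycle
ic = 1e9                            # assumed unoptimized instruction count

cpi_unopt = 0.43 * 1 + 0.57 * 2                          # 1.57
cpi_opt = (0.43 / 2 * 1 + 0.57 * 2) / (1 - 0.43 / 2)     # ~1.73

def mips(cpi):
    # MIPS = clock rate / (CPI * 10^6)
    return clock_rate / (cpi * 1e6)

def exec_time(n_instr, cpi):
    return n_instr * cpi / clock_rate

# The optimized version keeps only 78.5% of the instructions but has a
# higher CPI: its MIPS rating is lower even though it finishes sooner.
print(mips(cpi_unopt) > mips(cpi_opt))                            # True
print(exec_time(ic, cpi_unopt) > exec_time(0.785 * ic, cpi_opt))  # True
```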
Sample Problem #1
• Consider adding register-memory ALU instructions to a machine
that previously only permitted register-register ALU operations
• Assume a benchmark with the following breakdown of
operations is used to test this enhancement:
– ALU operations: 43%, CPI = 1
– Loads: 21%, CPI = 2
– Stores: 12%, CPI = 2
– Branches: 24%, CPI = 2
• The new ALU register-memory operation has the following
consequences:
– ALU register-memory operations have CPI = 2 and Branches now
have a CPI = 3
• But, 25% of data loaded are only used once so that the new ALU
register-memory instruction can be used in place of the load +
ALU operation
• Is it worth it?
Solution

• CPIold = .43 * 1 + .57 * 2 = 1.57
• 3 changes:
  – some ALU operations use the new mode, which changes their CPI
  – fewer loads
  – all branches have a higher CPI
• We have a new distribution:
  – 25% of ALU operations become ALU-memory operations
    • 25% * 43% = 11%, so we remove this many loads, giving us 89% as many
      instructions as previously
  – Loads: [21% – (25% * 43%)] / 89% = 11%
  – Stores: 12% / 89% = 13%
  – ALU operations: 43% / 89% = 48%
  – Branches: 24% / 89% = 27%
• CPInew = .11 * 2 + .13 * 2 + .27 * 3 + .48 * (.25 * 2 + .75 * 1) = 1.89
• CPU Time = IC * CPI * clock cycle time
  – the clock cycle time remains unchanged
  – CPI has been recomputed
  – IC in the new system is 89% of the old system
  – CPUold = IC * 1.57 * CCT
  – CPUnew = .89 * IC * 1.89 * CCT
  – Speedup = 1.57 / (.89 * 1.89) = .934
    • this is a slowdown, so this enhancement is not an improvement!
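The renormalization can be redone with exact fractions (a sketch; unrounded percentages give roughly .92 rather than the slide's .934, which uses rounded intermediate values, but the conclusion, a slowdown, is the same):

```python
# Old instruction mix and the conversion rule from the problem statement.
old_mix = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}

moved = 0.25 * old_mix["alu"]          # ALU ops that become ALU-memory ops
ic_ratio = 1 - moved                   # one load removed per converted op -> ~.89

new_mix = {
    "alu": old_mix["alu"] / ic_ratio,              # ~.48
    "load": (old_mix["load"] - moved) / ic_ratio,  # ~.11
    "store": old_mix["store"] / ic_ratio,          # ~.13
    "branch": old_mix["branch"] / ic_ratio,        # ~.27
}

cpi_old = 0.43 * 1 + 0.57 * 2                      # 1.57
cpi_new = (new_mix["load"] * 2 + new_mix["store"] * 2
           + new_mix["branch"] * 3
           + new_mix["alu"] * (0.25 * 2 + 0.75 * 1))
speedup = cpi_old / (ic_ratio * cpi_new)
print(round(speedup, 3))               # < 1, so the change is a slowdown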
Sample Problem #2

• Assume a machine with a perfect cache
  – and the following instruction mix breakdown:
    • ALU: 43%, CPI 1
    • Loads: 21%, CPI 2
    • Stores: 12%, CPI 2
    • Branches: 24%, CPI 2
  – An imperfect cache has a miss rate of 5% for instructions and 10% for data,
    and a miss penalty of 40 cycles
• How much faster is the machine with the perfect cache?
• CPIperfectcachemachine = .43 * 1 + .57 * 2 = 1.57
  – Because of cache misses, we have to compute the CPI for all instructions
    based on misses during instruction fetch (5%) and misses during data
    accesses (10%), where each miss adds 40 cycles to the CPI
• CPIimperfectcachemachine = .43 * (1 + .05 * 40) + .21 * (2 + .05 * 40 + .10 *
  40) + .12 * (2 + .05 * 40 + .10 * 40) + .24 * (2 + .05 * 40) = 4.89
• Perfect machine = 4.89 / 1.57 = 3.11 times faster
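The imperfect-cache CPI works out term by term as follows (a sketch; every instruction pays for fetch misses, and only loads/stores also pay for data misses):

```python
i_miss, d_miss, penalty = 0.05, 0.10, 40

cpi_perfect = 0.43 * 1 + 0.57 * 2                                    # 1.57
cpi_imperfect = (0.43 * (1 + i_miss * penalty)                       # ALU: fetch misses only
                 + 0.21 * (2 + i_miss * penalty + d_miss * penalty)  # loads: fetch + data
                 + 0.12 * (2 + i_miss * penalty + d_miss * penalty)  # stores: fetch + data
                 + 0.24 * (2 + i_miss * penalty))                    # branches: fetch only
print(round(cpi_imperfect, 2), round(cpi_imperfect / cpi_perfect, 2))  # 4.89 3.11
```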
Sample Problem #3
• Architects are considering one of two enhancements
for their processor
– #1 can be used 20% of the time and offers a speedup of 3
– #2 offers a speedup of 7
• What fraction of the time will the second enhancement
have to be used in order to achieve the same overall
speedup as the first enhancement?
– speedup from #1 = 1 / [(1 - .2) + .2 / 3] = 1.154
• So, for the second enhancement to match, we have
  1.154 = 1 / [(1 – x) + x / 7] and we must solve for x
  – using some algebra, we get:
  – 1.154 = 1 / (1 – 6x / 7) = 1 / [(7 – 6x) / 7] = 7 / (7 – 6x)
  – so 7 – 6x = 7 / 1.154, giving 6x = 7 – 7 / 1.154 = 0.934, or x = 0.934 / 6 = 0.156
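The algebra amounts to inverting Amdahl's Law for x, which can be checked numerically (a sketch using the problem's numbers):

```python
# Speedup of enhancement #1: used 20% of the time, 3x faster.
target = 1 / ((1 - 0.2) + 0.2 / 3)     # ~1.154

# Invert Amdahl's Law: target = 1 / (1 - x + x/7)  =>  x = (1 - 1/target) * 7/6
x = (1 - 1 / target) * 7 / 6
print(round(x, 3))                     # 0.156
```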
Sample Problem #4
• We will compare a CISC machine and a RISC
machine on a benchmark
– The machines have the following characteristics
• CISC machine has CPIs of
– 4 for load/store, 3 for ALU/branch, 10 for call/return
– CPU clock rate of 1.75 GHz
• RISC machine has a CPI of 1.2 (as it is pipelined) and a CPU clock
rate of 1 GHz
• the CISC machine uses complex instructions, so the CISC version of the
benchmark has 40% fewer instructions than the same benchmark on the RISC
machine (that is, IC CISC is 40% less than IC RISC)
– The benchmark has a breakdown of:
• 38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns,
and 11% branches
– Which machine will run the benchmark in less time?
Solution
• We compare the CPU time for both machines
– CPU time = IC * CPI / Clock rate
• Since both clock rates are in GHz, we will drop the GHz unit to simplify
• CISC machine:
– First, compute the CISC machine’s CPI given the individual CPI for the
machine and the benchmark’s breakdown of instructions:
• CPI = 4 * (.38 + .10) + 3 * (.35 + .11) + 10 * (.03 + .03) = 3.9
– CPU time CISC = IC CISC * 3.9 / 1.75
• RISC machine:
– IC * 1.2 / 1 = IC RISC * 1.2
– Recall that the CISC machine has 40% fewer instructions, so IC CISC = .6 *
IC RISC
• CPU time CISC = .6 * IC RISC * 3.9 / 1.75 = 1.34 IC RISC
• CPU time RISC = 1.2 IC RISC
• Since the RISC CPU time is smaller, the RISC machine is faster, by 1.34 / 1.2 =
1.12, or 12%
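The comparison can be checked numerically (a sketch; with unrounded arithmetic the ratio is about 1.11 — the slide's 1.12 comes from rounding 1.337 up to 1.34 first):

```python
# Benchmark mix from the problem statement.
mix = {"load": 0.38, "store": 0.10, "alu": 0.35, "call": 0.03, "ret": 0.03, "branch": 0.11}

cpi_cisc = (4 * (mix["load"] + mix["store"])       # loads/stores: CPI 4
            + 3 * (mix["alu"] + mix["branch"])     # ALU/branches: CPI 3
            + 10 * (mix["call"] + mix["ret"]))     # calls/returns: CPI 10 -> 3.9

# CPU time per RISC instruction, with the GHz unit dropped as on the slide.
time_cisc = 0.6 * cpi_cisc / 1.75   # IC CISC = .6 * IC RISC, 1.75 GHz clock
time_risc = 1.0 * 1.2 / 1.0         # CPI 1.2 at 1 GHz
print(round(time_cisc, 2), round(time_risc, 2))   # 1.34 1.2
print(time_risc < time_cisc)                      # True: the RISC machine wins
```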
