INSTRUCTION-LEVEL PARALLELISM
Introduction
When people make use of computers, they quickly consume all of the processing
power available.
This is the nature of computers: their flexibility quickly leads to ideas for more
automation, which in turn lead to more and more ideas, until the available
processing resources are exhausted.
In addition, software development is agglomerative: it is easier to add software, and
hence CPU load, than it is to remove it. This is a fundamental asymmetry in software
development.
As software developers have devised more sophisticated ways of developing
and layering software, and users have demanded more of the software
infrastructure, ever greater demands have been made of the underlying hardware.
Cycles-Per-Second (CPS)
A very simple equation quantifies the performance of hardware.
instructions per second = instructions per cycle × cycles per second
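As a quick sanity check, the relationship can be expressed in a few lines of code; the clock rate and IPC figures below are hypothetical, not measurements of any particular CPU:

```python
# Hedged sketch: the 3 GHz clock and IPC of 2 below are invented figures
# used only to illustrate the equation above.

def instructions_per_second(ipc: float, cycles_per_second: float) -> float:
    """instructions/second = instructions/cycle * cycles/second"""
    return ipc * cycles_per_second

# A hypothetical core running at 3 GHz and sustaining 2 instructions per cycle:
rate = instructions_per_second(ipc=2.0, cycles_per_second=3.0e9)
print(f"{rate:.2e} instructions per second")  # 6.00e+09 instructions per second
```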
Fortunately, Moore's law has given us a formidable, non-linear improvement
in performance for some time.
Hardware design has been incredibly competitive between teams of engineers
in companies.
For a long time the primary metric used to measure performance was cycles-per-second
(CPS), which gave rise to the megahertz myth. Intel has done more to
explore the boundaries of CPS in common CPUs than any other company.
However, Intel's customers found that a metric other than CPU frequency had become
increasingly important in their buying decisions: instructions-per-second-per-watt.
Instructions-Per-Cycle (IPC)
In parallel with research into ever-increasing cycles-per-second, designers
have also been improving the instructions-per-cycle (IPC) that their
hardware is capable of.
When they crossed the boundary of more than one instruction per cycle
(IPC > 1) they entered the world of instruction-level parallelism (ILP).
Instruction-level parallelism (ILP)
Most processors since 1985 have used the pipeline method just
described to overlap the execution of instructions. The background material discussed a
simple pipeline and how to achieve instruction overlap.
This section moves beyond instruction overlap to the actual execution of more than
one instruction at a time, through dynamic scheduling, and to techniques for
maximizing the throughput of a processor.
It will be useful to revisit the various dependencies and hazards
before discussing these more powerful techniques for identifying and
exploiting ILP.
Dependencies and hazards
Determining how one instruction relates to another is critical to determining
how much parallelism is available to exploit in an instruction stream. If two
instructions are not dependent, then they can execute simultaneously,
assuming sufficient resources (that is, no structural hazards).
Obviously, if one instruction depends on another, they must execute in order,
though they may still partially overlap. It is imperative, then, to determine
exactly how much and what kind of dependency exists between instructions.
The following sections describe the different kinds of non-structural
dependency that can exist in an instruction stream.
TYPES OF DEPENDENCIES:
There are three different types of dependencies:
1. data dependencies (aka true dependencies),
2. name dependencies and
3. control dependencies.
Data dependencies
An instruction j is data dependent on instruction i in one of two ways:
directly, where instruction i produces a result that may be used by
instruction j; or indirectly, where instruction j is data dependent on
instruction k and instruction k is data dependent on i, and so on.
Indirect data dependence means that one instruction depends on
another if there exists a chain of dependencies between them.
This dependence chain can be as long as the entire program! If two
instructions are data dependent, they cannot execute simultaneously nor be
completely overlapped.
A data dependency can be overcome in two ways:
1. maintaining the dependency but avoiding the hazard, or
2. eliminating the dependency by transforming the code.
Code scheduling is the primary method used to avoid a hazard without altering
the dependency.
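To make the definition concrete, direct data dependence can be detected mechanically by comparing the registers one instruction writes against the registers another instruction reads. The three-address tuple format and register names below are illustrative assumptions, not any real instruction set:

```python
# Toy model: an instruction is a tuple (dest, src1, src2), i.e. dest = src1 op src2.
# The format and the register names are assumptions for illustration only.

def writes(instr):
    return {instr[0]}

def reads(instr):
    return set(instr[1:])

def data_dependent(i, j):
    """True if j is directly data dependent on i: j reads a value i writes."""
    return bool(writes(i) & reads(j))

i = ("r1", "r2", "r3")   # r1 = r2 + r3
j = ("r4", "r1", "r5")   # r4 = r1 + r5 -- reads the r1 produced by i
k = ("r6", "r7", "r8")   # touches nothing that i writes

print(data_dependent(i, j))  # True: a RAW dependency on r1
print(data_dependent(i, k))  # False: i and k may execute simultaneously
```

An indirect dependence is then just the transitive closure of this relation along the instruction stream.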
Name Dependencies
The second type of dependence is a name dependency. A name dependency occurs
when two instructions use the same register or memory location, called a name, but
there is no flow of data between them.
There are two types of name dependencies between an instruction i that precedes
instruction j:
1. an anti-dependence occurs when j writes a register/memory that i reads (the
original value must be preserved until i can use it) or
2. an output dependence occurs when i and j write to the same register/memory
location (in this case instruction order must be preserved.)
Both anti-dependencies and output dependencies are name dependencies, as
opposed to true data dependencies, since there is no flow of information between
the two instructions.
In fact, these dependencies are a direct result of re-purposing registers; hence
an instruction-set architecture with sufficient registers can minimize the
number of name dependencies in an instruction stream.
Since name dependencies are not true dependencies, the instructions involved can
execute simultaneously or be reordered if the name (register or memory location)
used in one of the instructions is changed so that the conflict disappears.
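The renaming idea can be sketched in a few lines. The toy renamer below is an illustrative assumption (a real renamer must also track free lists, commit, and misprediction recovery); it simply maps every destination to a fresh physical register, which removes anti- and output dependencies while preserving true dependencies:

```python
# Minimal register-renaming sketch (illustrative only, not a real renamer).
# Toy instruction form: (dest, src1, src2), i.e. dest = src1 op src2.

def rename(instrs, num_arch_regs=8):
    # Architectural registers r0..r7 initially map to physical p0..p7.
    mapping = {f"r{n}": f"p{n}" for n in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instrs:
        s1, s2 = mapping[src1], mapping[src2]   # sources use the current mapping
        mapping[dest] = f"p{next_phys}"         # every write gets a fresh register
        next_phys += 1
        renamed.append((mapping[dest], s1, s2))
    return renamed

prog = [
    ("r1", "r2", "r3"),  # r1 = r2 + r3
    ("r2", "r1", "r4"),  # writes r2, which the first instruction reads (WAR)
    ("r1", "r5", "r6"),  # writes r1 again (WAW with the first instruction)
]
print(rename(prog))
# [('p8', 'p2', 'p3'), ('p9', 'p8', 'p4'), ('p10', 'p5', 'p6')]
```

After renaming, only the true dependency survives (the second instruction still reads p8, produced by the first); the WAR and WAW conflicts on r1 and r2 are gone.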
SPEC benchmark    Average instructions issued
gcc               55
espresso          63
li                18
fpppp             75
doduc             11

Table 1: Average instructions issued for the SPEC benchmarks on an ideal
processor.
Obviously the ideal processor is not realisable; a realisable but ambitious
processor achieves much less than the ideal one.
Table 2 explores a processor which can issue 64 instructions per clock with
no issue restrictions, with a tournament branch predictor of substantial size,
perfect disambiguation of memory references, and 64 additional renaming
registers for both integer and FP.
SPEC benchmark \ Window size
            Infinite   256   128    64    32    16     8     4
gcc               10    10    10     9     8     6     4     3
espresso          15    15    13    10     -     -     -     2
li                12    12    11    11     9     6     -     3
fpppp             52    47    35    22    14     8     5     3
doduc             17    16    15    12     -     -     -     -

Table 2: Average instruction issue of the SPEC benchmarks versus window size for
instruction decode.
The most startling observation is that, with the realistic processor constraints
listed above, the effect of window size on the integer programs is not as
severe as on the FP programs. This points to the key difference between these
two types of program: most FP programs are vector-based and can exploit
loop-level parallelism to a much greater extent than the typical integer program.
Typical application areas that make such demands for parallelism include:
Astrophysics
Atmospheric and Ocean Modeling
Bioinformatics
Biomolecular simulation: Protein folding
Computational Chemistry
Computational Fluid Dynamics (CFD)
Computational Physics
Computer vision and image understanding
Data Mining and Data-intensive Computing
Engineering analysis (CAD/CAM)
Global climate modeling and forecasting
Material Sciences
Military applications
Quantum chemistry
VLSI design
FLYNN'S CLASSIFICATION:
The most popular taxonomy of computer architecture was defined by Flynn in 1966.
Flynn's classification scheme is based on the notion of a stream of information.
Two types of information flow into a processor: instructions and data.
The instruction stream is defined as the sequence of instructions performed by the
processing unit.
The data stream is defined as the data traffic exchanged between the memory and the
processing unit.
According to Flynn's classification, either of the instruction or data streams can be
single or multiple.
Computer architecture can be classified into the following four distinct categories:
single-instruction single-data streams (SISD);
single-instruction multiple-data streams (SIMD);
multiple-instruction single-data streams (MISD); and
multiple-instruction multiple-data streams (MIMD).
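Since each category is fully determined by whether the two streams are single or multiple, the naming scheme can be written as a two-value lookup (a trivial sketch, included only to make the naming explicit):

```python
# Flynn's taxonomy as a simple lookup: the category name is determined by
# whether the instruction and data streams are single (S) or multiple (M).

def flynn(multiple_instruction_streams: bool, multiple_data_streams: bool) -> str:
    i = "M" if multiple_instruction_streams else "S"
    d = "M" if multiple_data_streams else "S"
    return f"{i}I{d}D"

print(flynn(False, False))  # SISD: a conventional serial computer
print(flynn(False, True))   # SIMD
print(flynn(True, False))   # MISD
print(flynn(True, True))    # MIMD: the general parallel computer
```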
1) Single Instruction and Single Data stream (SISD)
In this organization, sequential execution of instructions is performed by one CPU
containing a single processing element (PE), i.e., an ALU, under one control unit.
Therefore, SISD machines are conventional serial computers that process only one
stream of instructions and one stream of data.
This type of computer organization is depicted in the diagram:
SISD Organization
Examples of SISD machines include:
2) Single Instruction and Multiple Data stream (SIMD)
SIMD Organization
3) Multiple Instruction and Single Data stream (MISD)
MISD Organization
This classification is not popular in commercial machines, as the concept of a single
data stream executing on multiple processors is rarely applied.
For specialized applications, however, the MISD organization can be very helpful.
For example, real-time computers need to be fault tolerant, so several processors
execute on the same data to produce redundant results.
This is also known as N-version programming. The redundant results are compared
and should be the same; otherwise the faulty unit is replaced.
Thus MISD machines can be applied to fault-tolerant real-time computers.
4) Multiple Instruction and Multiple Data stream (MIMD)
In this organization, multiple processing elements and multiple control units are
organized as in MISD.
The difference is that in this organization multiple instruction streams operate
on multiple data streams.
Therefore, to handle the multiple instruction streams, multiple control units and
multiple processing elements are organized such that the processing elements
handle multiple data streams from main memory.
The processors work on their own data with their own instructions. Tasks executed by
different processors can start or finish at different times.
They are not lock-stepped, as in SIMD computers, but run asynchronously. It is this
classification that truly describes a parallel computer.
That is, in the real sense a MIMD organization is a parallel computer. All
multiprocessor systems fall under this classification.
Examples include: C.mmp, Burroughs D825, Cray-2, S1, Cray X-MP, HEP, Pluribus,
IBM 370/168 MP, Univac 1100/80, Tandem/16, IBM 3081/3084, C.m*, BBN Butterfly,
Meiko Computing Surface (CS-1), FPS T/40000, iPSC.
This type of computer organization is denoted as:
Is > 1
Ds > 1
MIMD Organization
Of the classifications discussed above, the MIMD organization is the most popular
for a parallel computer.
In the real sense, parallel computers execute instructions in MIMD mode.
HARDWARE MULTITHREADING
Hardware multithreading allows multiple threads to share the functional units of a
single processor in an overlapping fashion.
To permit this sharing, the processor must duplicate the independent state of each
thread. For example, each thread would have a separate copy of the register file and the
PC.
The memory itself can be shared through the virtual memory mechanisms, which
already support multiprogramming.
In addition, the hardware must support the ability to change to a different thread
relatively quickly.
In particular, a thread switch should be much more efficient than a process switch,
which typically requires hundreds to thousands of processor cycles, whereas a thread
switch can be nearly instantaneous.
There are two main approaches to hardware multithreading:
Fine-grained multithreading
Coarse-grained multithreading
Simultaneous multithreading, discussed below, is a variation that combines
multithreading with multiple issue.
Fine-grained multithreading:
Fine-grained multithreading switches between threads on each instruction, resulting in
interleaved execution of multiple threads.
This interleaving is often done in a round robin fashion, skipping any threads that are
stalled at that time.
To make fine-grained multithreading practical, the processor must be able to switch
threads on every clock cycle.
One key advantage of fine-grained multithreading is that it can hide the throughput
losses that arise from both short and long stalls, since instructions from other threads
can be executed when one thread stalls.
The primary disadvantage of fine-grained multithreading is that it slows down the
execution of the individual threads, since a thread that is ready to execute without stalls
will be delayed by instructions from other threads.
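The round-robin-with-skipping policy described above can be sketched as a small simulation. The thread names and stall pattern below are invented for illustration; a real pipeline would of course detect stalls in hardware:

```python
# Toy simulation of fine-grained multithreading: each cycle, issue from the
# next ready thread in round-robin order, skipping threads stalled that cycle.
# The thread ids and the stall pattern below are invented for illustration.

def fine_grained_schedule(threads, stalled, num_cycles):
    """threads: list of thread ids; stalled: {cycle: set of stalled ids}."""
    order = []
    start = 0
    for cycle in range(num_cycles):
        for offset in range(len(threads)):
            t = threads[(start + offset) % len(threads)]
            if t not in stalled.get(cycle, set()):
                order.append(t)                        # this thread issues
                start = (threads.index(t) + 1) % len(threads)
                break
        else:
            order.append(None)                         # all threads stalled: idle cycle
    return order

threads = ["T0", "T1", "T2"]
stalls = {1: {"T1"}, 3: {"T0", "T1", "T2"}}            # cycle 3: everyone stalled
print(fine_grained_schedule(threads, stalls, 5))
# ['T0', 'T2', 'T0', None, 'T1']
```

Note how T1's stall in cycle 1 is hidden by T2; only cycle 3, in which every thread stalls, is idle.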
Coarse-grained multithreading:
Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading.
Coarse-grained multithreading switches threads only on costly stalls, such as
second-level cache misses.
This change relieves the need to have thread switching be essentially free and is much
less likely to slow down the execution of an individual thread, since instructions from
other threads will only be issued when a thread encounters a costly stall.
Coarse-grained multithreading suffers, however, from a major drawback: it is limited
in its ability to overcome throughput losses, especially from shorter stalls.
This limitation arises from the pipeline start-up costs of coarse-grained multithreading.
Because a processor with coarse-grained multithreading issues instructions from a
single thread, when a stall occurs, the pipeline must be emptied or frozen.
The new thread that begins executing after the stall must fill the pipeline before
instructions will be able to complete.
Due to this start-up overhead, coarse-grained multithreading is much more useful for
reducing the penalty of high-cost stalls, where pipeline refill is negligible compared to
the stall time.
Simultaneous multithreading (SMT):
Simultaneous multithreading is a variation on hardware multithreading that uses the
resources of a multiple-issue, dynamically scheduled processor to exploit thread-level
parallelism at the same time it exploits instruction-level parallelism.
The key insight that motivates SMT is that multiple-issue processors often have more
functional unit parallelism available than a single thread can effectively use.
Furthermore, with register renaming and dynamic scheduling, multiple instructions
from independent threads can be issued without regard to the dependences among
them; the resolution of the dependences can be handled by the dynamic scheduling
capability.
Because SMT relies on these existing dynamic mechanisms, it does not switch
resources every cycle.
The top portion of the figure shows how four threads would execute independently
on a superscalar with no multithreading support.
The bottom portion shows how the four threads could be combined to execute on the
processor more efficiently using three multithreading options:
A superscalar with coarse-grained multithreading
A superscalar with fine-grained multithreading
A superscalar with simultaneous multithreading
In the superscalar without hardware multithreading support, the use of issue slots is
limited by a lack of instruction-level parallelism.
In addition, a major stall, such as an instruction cache miss, can leave the entire
processor idle.
In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by
switching to another thread that uses the resources of the processor.
Although this reduces the number of completely idle clock cycles, the pipeline start-up
overhead still leads to idle cycles, and limitations in ILP mean that not all issue
slots will be used.
In the fine-grained case, the interleaving of threads mostly eliminates fully empty slots.
Because only a single thread issues instructions in a given clock cycle, however,
limitations in instruction-level parallelism still lead to idle slots within some clock
cycles.
How four threads use the issue slots of a superscalar processor in different approaches.
The four threads at the top show how each would execute running alone on a standard
superscalar processor without multithreading support.
The three examples at the bottom show how they would execute running together in
three multithreading options.
The horizontal dimension represents the instruction issue capability in each clock
cycle. The vertical dimension represents a sequence of clock cycles.
An empty (white) box indicates that the corresponding issue slot is unused in that clock
cycle. The shades of gray and color correspond to four different threads in the
multithreading processors.
The additional pipeline start-up effects for coarse multithreading, which are not
illustrated in this figure, would lead to further loss in throughput for coarse
multithreading.
In the SMT case, thread-level parallelism and instruction-level parallelism are both
exploited, with multiple threads using the issue slots in a single clock cycle.
Ideally, issue slot usage would be limited only by imbalances between the resource
needs and resource availability of the multiple threads. In practice, other factors
can restrict how many slots are used.
For example, the recent Intel Nehalem multicore supports SMT with two threads to
improve core utilization.
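The slot-filling idea behind SMT can be reduced to a one-line sketch: in each cycle, ready instructions drawn from any thread compete for the same issue slots, up to the issue width. The 4-wide issue width and the per-thread ready counts below are illustrative assumptions:

```python
# Toy illustration of SMT issue-slot usage: ready instructions from all
# threads share the slots of a single cycle. The issue width and the
# per-thread ready counts are invented figures for illustration.

ISSUE_WIDTH = 4

def slots_used(ready_per_thread):
    """ready_per_thread: {thread id: instructions ready this cycle}."""
    return min(ISSUE_WIDTH, sum(ready_per_thread.values()))

# One thread with little ILP leaves most slots empty...
print(slots_used({"T0": 1}))                      # 1 of 4 slots used
# ...but other threads' instructions fill the rest in the same cycle.
print(slots_used({"T0": 1, "T1": 2, "T2": 3}))    # 4 of 4 slots used
```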
MULTICORE PROCESSORS
The Need for Multicore:
Due to advances in circuit technology and the performance limitations of wide-issue,
super-speculative processors, chip multiprocessors (CMP), or multicore technology,
have become the mainstream in CPU design.
Speeding up processor frequency had run its course by the earlier part of this decade;
computer architects needed a new approach to improve performance.
Adding an additional processing core to the same chip would, in theory, result in twice
the performance while dissipating less heat, though in practice the actual speed of each
core is slower than that of the fastest single-core processor.
Multicore is not a new concept, as the idea has been used in embedded systems and for
specialized applications for some time, but recently the technology has become
mainstream with Intel and Advanced Micro Devices (AMD) introducing many
commercially available multicore chips.
Multicore Basics:
The following isn't specific to any one multicore design but is rather a basic overview
of multicore architecture.
Although manufacturers' designs differ from one another, multicore architectures need
to adhere to certain aspects.
Multicore Implementations:
As with any technology, multicore architectures from different manufacturers vary
greatly.
Along with differences in communication and memory configuration, another variance
comes in the number of cores a microprocessor has.
In some multicore architectures, different cores have different functions, hence
they are heterogeneous.
Differences in architecture are discussed below for Intel's Core 2 Duo, Advanced
Micro Devices' Athlon 64 X2, Sony-Toshiba-IBM's CELL processor, and finally
Tilera's TILE64.
Intel and AMD Dual-Core Processors
Intel and AMD are the mainstream manufacturers of microprocessors.
Intel produces many different flavors of multicore processors: the Pentium D is used
in desktops, the Core 2 Duo is used in both laptop and desktop environments, and the
Xeon processor is used in servers.
AMD has the Athlon lineup for desktops, the Turion for laptops, and the Opteron for
servers/workstations.
Although the Core 2 Duo and the Athlon 64 X2 run on the same platforms, their
architectures differ greatly.