
UNIT IV PARALLELISM 9

Instruction-level parallelism - Parallel processing challenges - Flynn's classification - Hardware multithreading - Multicore processors.

INSTRUCTION-LEVEL PARALLELISM
Introduction
When people make use of computers, they quickly consume all of the processing
power available.
It is the nature of computers; their flexibility quickly leads to ideas for more
automation, which then lead to more and more ideas, until the processing resources
available are exhausted.
In addition, software development is agglomerative: it is easier to add software, and hence CPU load, than it is to remove it. This is a fundamental asymmetry in software development.
As software developers have developed more sophisticated ideas for building and layering software, and users have demanded more of the software infrastructure, ever more demands have been made of the underlying hardware.
Cycles-Per-Second (CPS)
A very simple equation quantifies the performance of hardware.
instructions per second = instructions per cycle x cycles per second
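As a worked example, with purely illustrative figures: a core sustaining 2 instructions per cycle at 3 x 10^9 cycles per second executes 6 x 10^9 instructions per second.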
Fortunately, Moore's law has given us a formidable, non-linear improvement in performance for some time.
Hardware design has been incredibly competitive between teams of engineers
in companies.
For a long time the primary metric used to measure performance was cycles-per-second (CPS), which gave rise to the megahertz myth. Intel has done more to explore the boundaries of CPS in common CPUs than any other company.
Intel's customers found, however, that a metric other than CPU frequency had become increasingly important in their buying decisions: instructions-per-second-per-watt.
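To continue the illustrative figures above: if the 6 x 10^9 instructions-per-second processor dissipates 100 watts, it delivers 6 x 10^7 instructions-per-second-per-watt; halving the power at the same throughput doubles that figure.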

Instructions-Per-Cycle (IPC)
In parallel with research into ever-increasing cycles-per-second, designers have also been improving the instructions-per-cycle (IPC) that their hardware is capable of.
When they crossed the boundary of greater than one instruction-per-cycle (IPC > 1) they entered the world of instruction-level parallelism (ILP).
Instruction-level parallelism (ILP)
Most processors since 1985 have used the pipeline method just described to overlap instructions. The background material discussed a simple pipeline and how to achieve instruction overlap.
The next step is not mere instruction overlap but the actual execution of more than one instruction at a time, through dynamic scheduling, together with techniques to maximize the throughput of a processor.
It will be useful to re-visit the various dependencies and hazards again,
before discussing these more powerful techniques for identifying and
exploiting more ILP.
Dependencies and hazards
Determining how one instruction relates to another is critical to determining how much parallelism is available to exploit in an instruction stream. If two instructions are not dependent, then they can execute simultaneously, assuming sufficient resources (that is, no structural hazards).
Obviously, if one instruction depends on another, they must execute in order
though they may still partially overlap. It is imperative then, to determine
exactly how much and what kind of dependency exists between instructions.
The following sections will describe the different kinds of non-structural
dependency that can exist in an instruction stream.
TYPES OF DEPENDENCIES:
There are three different types of dependencies:
1. data dependencies (aka true dependencies),
2. name dependencies and

3. control dependencies.
Data dependencies
An instruction j can be considered data dependent on instruction i in one of two ways: directly, where instruction i produces a result that may be used by instruction j; or indirectly, where instruction j is data dependent on instruction k and k is data dependent on i, and so on.
Indirect data dependence means that one instruction is dependent on another if there exists a chain of dependencies between them.
This dependence chain can be as long as the entire program! If two instructions are data dependent, they cannot execute simultaneously nor be completely overlapped.
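A minimal sketch in C of such a dependence chain (the variable names are illustrative; on a real machine the same chain appears between register operands):

int chain(void) {
    int x = 4, y = 2;
    int a = x + y;   /* instruction i */
    int b = a * 2;   /* instruction j: directly data dependent on i (reads a) */
    int c = b - 1;   /* instruction k: depends on j, hence indirectly on i */
    return c;
}

No two adjacent statements here can be fully overlapped, because each reads the value the previous one produced.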
A data dependency can be overcome in two ways:
1. maintaining the dependency but avoiding the hazard or
2. eliminating a dependency by transforming the code. Code scheduling is the
primary method used to avoid a hazard without altering the dependency.
Name Dependencies
The second type of dependence is a name dependency. A name dependency occurs
when two instructions use the same register or memory location, called a name, but
there is no flow of data between them.
There are two types of name dependencies between an instruction i that precedes instruction j:
1. an anti-dependence occurs when j writes a register/memory that i reads (the
original value must be preserved until i can use it) or
2. an output dependence occurs when i and j write to the same register/memory
location (in this case instruction order must be preserved.)
Both anti-dependencies and output dependencies are name dependencies, as
opposed to true data dependency, as there is no information flow between the
two instructions.
In fact these dependencies are a direct result of re-purposing registers, hence
an instruction-set architecture with sufficient registers can minimize the
number of name dependencies in an instruction stream.
Since instructions linked only by a name dependency are not truly dependent, they can execute simultaneously, or be reordered, if the name used in the instructions is changed so that they do not conflict.


This technique is known as register renaming and uses a bank of additional
temporary registers in addition to the register file.
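A minimal source-level sketch of renaming (the hardware performs this transparently with physical registers; the names t and t2 here are purely illustrative):

/* Before renaming: the name t is re-used, creating an output (WAW)
   dependence between the first and third statements and an anti (WAR)
   dependence between the second and third, with no data flow between them. */
int before(int a, int b, int d, int e) {
    int t = a + b;
    int c = t * 2;
    t = d + e;        /* re-uses the name t */
    return c + t;
}

/* After renaming the second definition of t to the fresh name t2, the
   third statement is independent and may be reordered or executed
   simultaneously with the first two. */
int after(int a, int b, int d, int e) {
    int t  = a + b;
    int c  = t * 2;
    int t2 = d + e;   /* fresh name: the name dependencies are gone */
    return c + t2;
}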
Data Hazards
A data hazard is created whenever there is a data dependency between
instructions and they are close enough to cause the pipeline to stall or some
other reordering of instructions.
Because of the dependency, we must preserve program order, that is, the
order in which the instructions would execute in a non-pipelined sequential
processor.
A requirement of ILP must be to maintain the correctness of a program and to reorder or overlap instructions only when correctness is not at risk.
Types of Data hazards:
There are three types of data hazards:
1. read after write (RAW): j tries to read a source before i writes it; this is the most common type and corresponds to a true data dependence;
2. write after write (WAW): j tries to write an operand before it is written by i; this corresponds to an output dependence;
3. write after read (WAR): j tries to write a destination before i has read it; this corresponds to an anti-dependence.
Self-evidently, the read after read (RAR) case is not a hazard.
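All three hazards appear in the following illustrative C fragment, where r1 to r7 stand in for registers:

int hazards(void) {
    int r2 = 1, r3 = 2, r5 = 3, r6 = 4, r7 = 5;
    int r1 = r2 + r3;   /* i */
    int r4 = r1 * r2;   /* j: RAW on r1, j reads what i wrote */
    r2 = r5 + 1;        /* k: WAR on r2, k writes what j read */
    r1 = r6 - r7;       /* l: WAW on r1, the second write must stay ordered after i */
    return r1 + r2 + r4;
}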
Control Dependencies
The last type of dependency is a control dependency. A control dependency
determines the order of an instruction i with respect to a branch, so that i is
executed in correct program order only if it should be.
The first basic block in a program is the only block without some control
dependency.
Consider the statements:

if (p1)
    S1;
if (p2)
    S2;

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

In general there are two constraints imposed by control dependencies: an instruction that is control dependent on a branch cannot be moved before the branch; conversely, an instruction that is not control dependent on a branch must not be moved after the branch in such a way that its execution would be controlled by the branch.
A control dependency does not in itself fundamentally limit performance.
We can execute an instruction path speculatively, provided we guarantee that speculatively executed instructions do not affect the program state until the branch result is determined.
This implies that the instructions executed speculatively must not raise an exception or otherwise cause side effects.
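As an illustration (a hypothetical fragment, not taken from the text above) of why speculation must suppress side effects:

#include <stddef.h>

int safe_read(const int *p) {
    int v = 0;
    if (p != NULL)   /* the branch */
        v = *p;      /* hoisting this load above the branch executes it
                        speculatively; if p is NULL the speculative load
                        would fault, so the hardware must not let the
                        exception become visible until the branch
                        outcome is known */
    return v;
}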
Limitations of ILP
To evaluate the maximum potential instruction-level parallelism possible, an instruction stream needs to be run on an ideal processor with no significant limitations.
The ideal processor always predicts branches correctly, has no structural hazards, and has an unlimited number of reorder buffer (ROB) entries and renaming registers.
This eliminates all control and name dependencies, leaving only true data
dependencies.
This model means that any instruction can be scheduled on the cycle
immediately following the execution of the predecessor on which it depends.
It further implies that it is possible for the last dynamically executed
instruction in the program to be scheduled on the first cycle.
To measure the available ILP, a set of programs, in this case the SPEC benchmarks, were compiled and optimised with a standard optimising compiler. The programs were then instrumented and executed to produce a trace of the instruction and data references.
Every instruction in the trace is then scheduled as early as possible given the
assumptions of perfect branch prediction and no hardware limits. Table 1
shows the average amount of parallelism available for 6 of the SPEC92
benchmarks. Three of these benchmarks are FP intensive and the other three
are integer programs.

SPEC benchmark    Instructions issued per cycle
gcc                        55
espresso                   63
li                         18
fpppp                      75
doduc                     119
tomcatv                   150

Table 1: SPEC benchmarks and the average instructions issued per cycle on an ideal processor.
Obviously the ideal processor is not realisable; a realisable but ambitious processor achieves much less than the ideal processor.
Table 2 explores a processor which can issue 64 instructions per clock with no issue restrictions, with a tournament branch predictor of substantial size, perfect disambiguation of memory references, and 64 additional renaming registers for both integer and FP.
SPEC \ Window   Infinite   256   128   64   32   16   8   4
gcc                   10    10    10    9    8    6   4   3
espresso              15    15    13   10    8    6   4   2
li                    12    12    11   11    9    6   4   3
fpppp                 52    47    35   22   14    8   5   3
doduc                 17    16    15   12    9    7   4   3

Table 2: Average instruction issue for the SPEC benchmarks versus the window size available for instruction issue.
The most startling observation is that, with the realistic processor constraints listed above, the effect of the window size for the integer programs is not as severe as for the FP programs. This points to the key difference between these two types of program: most FP programs are vector-based and can use loop-level parallelism to a much greater extent than the typical integer program.

PARALLEL PROCESSING CHALLENGES

Application demands: More computing cycles/memory needed.
Scientific/Engineering computing: CFD, Biology, Chemistry, Physics, ...
General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming.
Mainstream multithreaded programs are similar to parallel programs.
Technology Trends:
The number of transistors on a chip is growing rapidly. Clock rates are expected to continue to go up, but only slowly.
Actual performance returns are diminishing due to deeper pipelines.
Increased transistor density allows integrating multiple processor cores per chip, creating Chip-Multiprocessors (CMPs), even for mainstream computing applications (desktop/laptop...).
Architecture Trends:
Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
Coarser-level parallelism (at the task or thread level, TLP), as utilized in multiprocessor systems, is the most viable approach to further improve performance.
This is the main motivation for the development of chip-multiprocessors (CMPs).
Economics:
The increased use of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.
Today's microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs (processing elements).
Commercial System Area Networks (SANs) offer an alternative to more costly custom networks.
Challenging Applications in Applied Science/Engineering

Astrophysics
Atmospheric and Ocean Modeling
Bioinformatics
Biomolecular simulation: Protein folding
Computational Chemistry
Computational Fluid Dynamics (CFD)

Computational Physics
Computer vision and image understanding
Data Mining and Data-intensive Computing
Engineering analysis (CAD/CAM)
Global climate modeling and forecasting
Material Sciences
Military applications
Quantum chemistry
VLSI design

FLYNN'S CLASSIFICATION:

The most popular taxonomy of computer architecture was defined by Flynn in 1966.
Flynn's classification scheme is based on the notion of a stream of information.
Two types of information flow into a processor: instructions and data.
The instruction stream is defined as the sequence of instructions performed by the processing unit.
The data stream is defined as the data traffic exchanged between the memory and the processing unit.
According to Flynn's classification, either of the instruction or data streams can be single or multiple.
Computer architecture can thus be classified into the following four distinct categories:
single-instruction single-data streams (SISD);
single-instruction multiple-data streams (SIMD);
multiple-instruction single-data streams (MISD); and
multiple-instruction multiple-data streams (MIMD).
1) Single Instruction and Single Data stream (SISD)
In this organization, sequential execution of instructions is performed by one CPU
containing a single processing element (PE), i.e., ALU under one control unit.
Therefore, SISD machines are conventional serial computers that process only one
stream of instructions and one stream of data.
This type of computer organization is depicted in the diagram:

SISD Organization
Examples of SISD machines include:

the CDC 6600, which is unpipelined but has multiple functional units;
the CDC 7600, which has a pipelined arithmetic unit;
the Amdahl 470/6, which has pipelined instruction processing; and
the Cray-1, which supports vector processing.

2) Single Instruction and Multiple Data stream (SIMD)

In this organization, multiple processing elements work under the control of a single control unit. There is one instruction stream and multiple data streams.
All the processing elements of this organization receive the same instruction broadcast from the CU.
Main memory can also be divided into modules for generating multiple data streams, acting as a distributed memory.
Therefore, all the processing elements simultaneously execute the same instruction and are said to be 'lock-stepped' together.
Each processor takes its data from its own memory, and hence each operates on a distinct data stream. Every processor must be allowed to complete its instruction before the next instruction is taken for execution.
Thus, the execution of instructions is synchronous.
Examples of SIMD organisation are ILLIAC-IV, PEPE, BSP, STARAN, MPP, DAP
and the Connection Machine (CM-1).
This type of computer organization is denoted as:

SIMD Organization
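A data-parallel loop of the kind a SIMD machine executes in lockstep; this C fragment is only a sketch, with illustrative array names:

#define N 1024
float a[N], b[N], c[N];

void vector_add(void) {
    /* every processing element applies the same broadcast instruction
       (the add) to its own element, i.e., to its own data stream */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}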

3) Multiple Instruction and Single Data stream (MISD)


In this organization, multiple processing elements are organized under the control of multiple control units.
Each control unit handles one instruction stream and processes it through its corresponding processing element.
Each processing element, however, operates on only a single data stream at a time.
Therefore, for handling multiple instruction streams over a single data stream, multiple control units and multiple processing elements are organized in this classification.
All processing elements interact with the common shared memory for the organization of the single data stream.
The only known example of a computer capable of MISD operation is the C.mmp built
by Carnegie-Mellon University.
This type of computer organization is denoted as:
Is > 1, Ds = 1

MISD Organization
This classification is not popular in commercial machines, as the concept of a single data stream executing on multiple processors is rarely applied.
But for specialized applications, MISD organization can be very helpful.
For example, real-time computers need to be fault tolerant, where several processors execute the same program on the same data, producing redundant results.
This is also known as N-version programming. The redundant results are compared and should be identical; otherwise the faulty unit is replaced.
Thus MISD machines can be applied to fault-tolerant real-time computers.
4) Multiple Instruction and Multiple Data stream (MIMD)
In this organization, multiple processing elements and multiple control units are
organized as in MISD.
But the difference is that now in this organization multiple instruction streams operate
on multiple data streams.
Therefore, for handling multiple instruction streams, multiple control units and
multiple processing elements are organized such that multiple processing elements are
handling multiple data streams from the Main memory.
The processors work on their own data with their own instructions. Tasks executed by different processors can start or finish at different times.
They are not lock-stepped, as in SIMD computers, but run asynchronously. This classification is what is actually meant by a parallel computer.
In the real sense, then, MIMD organization is a parallel computer, and all multiprocessor systems fall under this classification.
Examples include: C.mmp, Burroughs D825, Cray-2, S1, Cray X-MP, HEP, Pluribus, IBM 370/168 MP, Univac 1100/80, Tandem/16, IBM 3081/3084, C.m*, BBN Butterfly, Meiko Computing Surface (CS-1), FPS T/40000, iPSC.
This type of computer organization is denoted as:
Is > 1, Ds > 1

MIMD Organization
Of the classifications discussed above, MIMD organization is the most popular for a
parallel computer.
In the real sense, parallel computers execute the instructions in MIMD mode.
HARDWARE MULTITHREADING
Hardware multithreading allows multiple threads to share the functional units of a
single processor in an overlapping fashion.
To permit this sharing, the processor must duplicate the independent state of each
thread. For example, each thread would have a separate copy of the register file and the
PC.
The memory itself can be shared through the virtual memory mechanisms, which
already support multiprogramming.
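A software-level illustration, as a minimal POSIX-threads sketch (nothing here is specific to any particular processor), of threads sharing one address space while each keeps its own register state and PC:

#include <pthread.h>
#include <stdio.h>

int shared = 0;                 /* one address space, visible to both threads */

static void *worker(void *arg) {
    (void)arg;
    shared += 1;                /* each thread has its own registers and PC,
                                   but reads and writes the same memory; this
                                   update is deliberately unsynchronized */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared = %d\n", shared);   /* may print 1 or 2: a data race */
    return 0;
}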

In addition, the hardware must support the ability to change to a different thread
relatively quickly.
In particular, a thread switch should be much more efficient than a process switch,
which typically requires hundreds to thousands of processor cycles while a thread
switch can be instantaneous.
There are three related approaches to hardware multithreading:
Fine-grained multithreading
Coarse-grained multithreading
Simultaneous multithreading
Fine-grained multithreading:
Fine-grained multithreading switches between threads on each instruction, resulting in
interleaved execution of multiple threads.
This interleaving is often done in a round robin fashion, skipping any threads that are
stalled at that time.
To make fine-grained multithreading practical, the processor must be able to switch
threads on every clock cycle.
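A sketch of the per-cycle selection policy (a hypothetical helper written in C; real hardware implements this in the fetch stage, not in software):

/* Pick the next ready thread in round-robin order, skipping any thread
   that is stalled this cycle; returns -1 if every thread is stalled. */
int next_thread(int last, const int stalled[], int nthreads) {
    for (int k = 1; k <= nthreads; k++) {
        int t = (last + k) % nthreads;
        if (!stalled[t])
            return t;
    }
    return -1;   /* all threads stalled: the issue cycle is wasted */
}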
One key advantage of fine-grained multithreading is that it can hide the throughput
losses that arise from both short and long stalls, since instructions from other threads
can be executed when one thread stalls.
The primary disadvantage of fine-grained multithreading is that it slows down the
execution of the individual threads, since a thread that is ready to execute without stalls
will be delayed by instructions from other threads.
Coarse-grained multithreading:
Coarse-grained multithreading was invented as an alternative to fine-grained
multithreading.
Coarse-grained multithreading switches threads only on costly stalls, such as second-level cache misses.
This change relieves the need to have thread switching be essentially free and is much
less likely to slow down the execution of an individual thread, since instructions from
other threads will only be issued when a thread encounters a costly stall.
Coarse-grained multithreading suffers, however, from a major drawback: it is limited
in its ability to overcome throughput losses, especially from shorter stalls.
This limitation arises from the pipeline start-up costs of coarse-grained multithreading.
Because a processor with coarse-grained multithreading issues instructions from a
single thread, when a stall occurs, the pipeline must be emptied or frozen.
The new thread that begins executing after the stall must fill the pipeline before
instructions will be able to complete.

Due to this start-up overhead, coarse-grained multithreading is much more useful for
reducing the penalty of high-cost stalls, where pipeline refill is negligible compared to
the stall time.
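The policy difference from fine-grained multithreading can be caricatured in a few lines of C (the event name is illustrative, and next_thread is the hypothetical helper from the fine-grained sketch above):

enum stall_event { NO_EVENT, L2_CACHE_MISS };

int next_thread(int last, const int stalled[], int nthreads);

/* Coarse-grained policy: keep issuing from the current thread and switch
   only when a costly stall occurs, rather than on every cycle. */
int coarse_select(int current, enum stall_event event,
                  const int stalled[], int nthreads) {
    if (event == L2_CACHE_MISS)
        return next_thread(current, stalled, nthreads);
    return current;
}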
Simultaneous multithreading (SMT):
Simultaneous multithreading is a variation on hardware multithreading that uses the
resources of a multiple-issue, dynamically scheduled processor to exploit thread-level
parallelism at the same time it exploits instruction-level parallelism.
The key insight that motivates SMT is that multiple-issue processors often have more
functional unit parallelism available than a single thread can effectively use.
Furthermore, with register renaming and dynamic scheduling, multiple instructions
from independent threads can be issued without regard to the dependences among
them; the resolution of the dependences can be handled by the dynamic scheduling
capability.
Since SMT relies on the existing dynamic mechanisms, it does not switch resources every cycle.
The top portion of the figure described below shows how four threads would execute independently on a superscalar with no multithreading support.
The bottom portion shows how the four threads could be combined to execute on the
processor more efficiently using three multithreading options:
A superscalar with coarse-grained multithreading
A superscalar with fine-grained multithreading
A superscalar with simultaneous multithreading
In the superscalar without hardware multithreading support, the use of issue slots is
limited by a lack of instruction-level parallelism.
In addition, a major stall, such as an instruction cache miss, can leave the entire
processor idle.
In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by switching to another thread that uses the resources of the processor.
Although this reduces the number of completely idle clock cycles, the pipeline start-up overhead still leads to idle cycles, and limitations in ILP mean that not all issue slots will be used.
In the fine-grained case, the interleaving of threads mostly eliminates fully empty slots.
Because only a single thread issues instructions in a given clock cycle, however,
limitations in instruction-level parallelism still lead to idle slots within some clock
cycles.

Figure: How four threads use the issue slots of a superscalar processor under the different approaches.
The four threads at the top show how each would execute running alone on a standard
superscalar processor without multithreading support.
The three examples at the bottom show how they would execute running together in
three multithreading options.
The horizontal dimension represents the instruction issue capability in each clock
cycle. The vertical dimension represents a sequence of clock cycles.
An empty (white) box indicates that the corresponding issue slot is unused in that clock
cycle. The shades of gray and color correspond to four different threads in the
multithreading processors.
The additional pipeline start-up effects for coarse multithreading, which are not
illustrated in this figure, would lead to further loss in throughput for coarse
multithreading.
In the SMT case, thread-level parallelism and instruction-level parallelism are both
exploited, with multiple threads using the issue slots in a single clock cycle.
Ideally, the issue slot usage is limited by imbalances in the resource needs and resource
availability over multiple threads. In practice, other factors can restrict how many slots
are used.
For example, the recent Intel Nehalem multicore supports SMT with two threads to
improve core utilization.

MULTICORE PROCESSORS
The Need for Multicore:
Due to advances in circuit technology and performance limitations in wide-issue, super-speculative processors, Chip-Multiprocessors (CMPs), or multicore technology, have become the mainstream in CPU designs.
Speeding up processor frequency had run its course in the earlier part of this decade;
computer architects needed a new approach to improve performance.
Adding an additional processing core to the same chip would, in theory, result in twice
the performance and dissipate less heat, though in practice the actual speed of each
core is slower than the fastest single core processor.
Multicore is not a new concept, as the idea has been used in embedded systems and for
specialized applications for some time, but recently the technology has become
mainstream with Intel and Advanced Micro Devices (AMD) introducing many
commercially available multicore chips.
Multicore Basics:
The following isn't specific to any one multicore design, but rather is a basic overview of multicore architecture.
Although manufacturer designs differ from one another, multicore architectures share certain common aspects.

Generic Modern Processor Configuration


Closest to the processor is Level 1 (L1) cache; this is very fast memory used to store
data frequently used by the processor.
Level 2 (L2) cache is just off-chip, slower than L1 cache, but still much faster than
main memory; L2 cache is larger than L1 cache and used for the same purpose.
Main memory is very large and slower than cache and is used, for example, to store a
file currently being edited in Microsoft Word.
Most systems have between 1 GB and 4 GB of main memory, compared to approximately 32 KB of L1 and 2 MB of L2 cache.
Finally, when data isn't located in cache or main memory, the system must retrieve it from the hard disk, which takes orders of magnitude more time than reading from the memory system.
If we set two cores side-by-side, one can see that a method of communication between
the cores, and to main memory, is necessary.
This is usually accomplished either using a single communication bus or an
interconnection network.
The bus approach is used with a shared memory model, whereas the interconnection network approach is used with a distributed memory model.
Beyond approximately 32 cores the bus becomes overloaded with the amount of processing, communication, and competition, which leads to diminished performance; a communication bus therefore has limited scalability.

Shared Memory Model

Distributed Memory Model


Multicore processors seem to answer the deficiencies of single-core processors by increasing bandwidth while decreasing power consumption.

Multicore Implementations:
As with any technology, multicore architectures from different manufacturers vary
greatly.
Along with differences in communication and memory configuration, another variance is the number of cores on the microprocessor.
In some multicore architectures different cores have different functions, and hence those designs are heterogeneous.

Differences in architectures are discussed below for Intel's Core 2 Duo, Advanced Micro Devices' Athlon 64 X2, Sony-Toshiba-IBM's CELL processor, and finally Tilera's TILE64.
Intel and AMD Dual-Core Processors
Intel and AMD are the mainstream manufacturers of microprocessors.
Intel produces many different flavors of multicore processors: the Pentium D is used in desktops, the Core 2 Duo is used in both laptop and desktop environments, and the Xeon processor is used in servers.
AMD has the Athlon lineup for desktops, Turion for laptops, and Opteron for servers/workstations.
Although the Core 2 Duo and Athlon 64 X2 run on the same platforms, their architectures differ greatly.

(a) Intel Core 2 Duo,

(b) AMD Athlon 64 X2 [5]

Both architectures are homogeneous dual-core processors.


The Core 2 Duo adheres to a shared memory model with private L1 caches and a shared L2 cache, which provides a peak transfer rate of 96 GB/sec.
If an L1 cache miss occurs, both the L2 cache and the second core's L1 cache are traversed in parallel before sending a request to main memory.
In contrast, the Athlon follows a distributed memory model with discrete L2 caches.
These L2 caches share a system request interface, eliminating the need for a bus.
The system request interface also connects the cores with an on-chip memory
controller and an interconnect called HyperTransport. HyperTransport effectively
reduces the number of buses required in a system, reducing bottlenecks and increasing
bandwidth.
