
INB375

Parallel Computing
Forms of Parallelism
Instruction Level Parallelism (ILP)
Multiple instructions (from one instruction stream or thread of execution)
can execute in each cycle.

[Figure: a single CPU core containing multiple execution units (ALU1, ALU2, FPU1, FPU2).]

Example instruction stream:
1. e = a + b
2. f = c + d
3. g = e * f

Hardware techniques: superscalar issue, pipelined instruction execution, out-of-order execution, branch prediction and speculative execution.
INB375 Parallel Computing 3
Vector Parallelism
Vector instructions operate on a list of data values

[Figure: a CPU core with vector registers and vector execution unit(s) alongside its scalar registers and scalar execution unit(s).]

Examples: Intel MMX, Intel Streaming SIMD Extensions (SSE), AltiVec (Apple, IBM, Motorola), Cray vector supercomputers.
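A rough illustration, not from the original slides: .NET exposes this kind of SIMD hardware through System.Numerics.Vector<T>. A minimal sketch of a vectorised array addition (array and method names are illustrative):

    using System.Numerics;

    static class VectorDemo
    {
        public static void AddVectors(float[] a, float[] b, float[] result)
        {
            int width = Vector<float>.Count;            // number of lanes in one vector register
            int i = 0;
            for (; i <= a.Length - width; i += width)
            {
                var va = new Vector<float>(a, i);       // load 'width' elements at once
                var vb = new Vector<float>(b, i);
                (va + vb).CopyTo(result, i);            // one vector add replaces 'width' scalar adds
            }
            for (; i < a.Length; i++)                   // scalar code for any leftover elements
                result[i] = a[i] + b[i];
        }
    }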

INB375 Parallel Computing 4


Processes and Threads

[Figure: Application1 consists of Process1 and Process2, while an unrelated application runs as Process3. Each process contains its own threads (Thread1, Thread2, ...) and its own memory; processes exchange data only via inter-process communication.]

INB375 Parallel Computing 5


Thread Level Parallelism (TLP)
Multiple threads (belonging to a single process) execute at the
same time.
[Figure: two ways of providing thread level parallelism.
Multi-threaded CPU: a single core holds the program counter and registers of Thread1 and Thread2 at the same time and shares its execution units (ALU1, ALU2, FPU1, FPU2) between them; this is simultaneous multithreading, or Hyperthreading on Intel CPUs.
Multi-core CPU: Core1 and Core2 each have their own program counter, registers and execution units, and the cores share the machine's memory.]
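A minimal sketch, not from the slides, of thread level parallelism in C#: two tasks belonging to one process, which the runtime can schedule onto different cores or hardware threads. Names such as SumRange are illustrative:

    using System.Threading.Tasks;

    static class TlpDemo
    {
        static long SumRange(long[] data, int start, int end)
        {
            long sum = 0;
            for (int i = start; i < end; i++) sum += data[i];
            return sum;
        }

        public static long ParallelSum(long[] data)
        {
            int mid = data.Length / 2;
            // Each task may run on its own core; both share the process's memory.
            Task<long> lower = Task.Run(() => SumRange(data, 0, mid));
            Task<long> upper = Task.Run(() => SumRange(data, mid, data.Length));
            return lower.Result + upper.Result;
        }
    }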
INB375 Parallel Computing 6
Process Level Parallelism
[Figure: processes (Process2, Process3, ...) spread across the cores of Machine1, Machine2 and Machine3. Each machine has its own memory, and the processes cooperate via inter-process communication.]
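By contrast, a hypothetical sketch of process level parallelism in C#: a parent process starts a separate worker process and exchanges text with it over standard input/output, a simple form of inter-process communication. The executable name worker.exe is an assumption, not something from the slides:

    using System.Diagnostics;

    static class ProcessDemo
    {
        public static string RunWorker(string request)
        {
            var info = new ProcessStartInfo("worker.exe")
            {
                RedirectStandardInput = true,
                RedirectStandardOutput = true,
                UseShellExecute = false
            };
            using (Process worker = Process.Start(info))
            {
                worker.StandardInput.WriteLine(request);          // send a message to the worker
                worker.StandardInput.Close();
                string reply = worker.StandardOutput.ReadToEnd(); // receive its reply
                worker.WaitForExit();
                return reply;
            }
        }
    }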


INB375 Parallel Computing 7
Shared vs. Distributed Memory
Shared Memory
Threads (within one process on the same machine) communicate via shared memory (see the sketch below).
Distributed Memory
Processes (on different machines) communicate via message passing.
Distributed memory programs contain explicit operations to send and receive messages.
Distributed Shared Memory
Provides a shared memory programming model on distributed memory machines.
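A minimal sketch, assuming C#, of shared memory communication: two threads of one process accumulate into the same variable, with a lock guarding the shared update. In a distributed memory version the partial sums would instead be sent and received as explicit messages:

    using System.Threading;

    static class SharedMemoryDemo
    {
        static long total = 0;                       // visible to every thread in the process
        static readonly object guard = new object();

        static void Accumulate(int[] chunk)
        {
            long local = 0;
            foreach (int x in chunk) local += x;
            lock (guard) { total += local; }         // communicate the result via shared memory
        }

        public static long Sum(int[] left, int[] right)
        {
            var t1 = new Thread(() => Accumulate(left));
            var t2 = new Thread(() => Accumulate(right));
            t1.Start(); t2.Start();
            t1.Join(); t2.Join();
            return total;
        }
    }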

INB375 Parallel Computing 8


Types of Parallel Computers
Flynn's Taxonomy
Single Instruction, Single Data stream (SISD)
Normal (Von Neumann) sequential computer
One thread of execution performing scalar operations
Single Instruction, Multiple Data streams (SIMD)
Includes superscalar(?), vector and stream processors
Multiple Instructions, Single Data streams (MISD)
Includes fault tolerant redundant systems
Arguably includes pipelined processors.
Multiple Instructions, Multiple Data streams (MIMD)
Includes most thread and process level parallelism
Includes both shared and distributed memory systems

INB375 Parallel Computing 10


Types of Supercomputers
Vector Processors
Early Cray supercomputers
Symmetric Multi-Processor (SMP)
Shared memory
Massively Parallel Processors (MPP)
Distributed memory, tightly coupled, proprietary interconnect
Cluster (Beowulf)
Distributed memory, commodity nodes, loosely coupled
Asymmetric Multi-Processor (AMP)
Specialized co-processors, e.g. GPUs
Hybrids
E.g. Cluster of SMPs with vector instructions and co-processors
Cycle Stealing Systems
Workstations donated when not otherwise in use.

INB375 Parallel Computing 11


Top 500 Supercomputer List
www.top500.org
1. Tianhe-2, National University of Defense Technology (China), 55 Pflops
2. Titan, Oak Ridge National Laboratory (USA), 27 Pflops
4. K Computer, RIKEN (Japan), 11 Pflops
7. JUQUEEN, Forschungszentrum Jülich (Germany), 6 Pflops
31. Avoca, Victorian Life Sciences Computation Initiative (Australia), 695 Tflops

INB375 Parallel Computing 12


Tianhe-2
(World's Most Powerful Computer)
16,000 nodes, each with 2 Intel Xeon CPUs (12 cores each) + 3 Xeon Phi co-processors (57 cores each)
Total 3,120,000 cores, 54.9 Petaflops peak
Massively parallel distributed memory (each node is a shared memory SMP)
Memory: 64 GB per node plus 8 GB on each Xeon Phi (88 GB per node), 1,375 TiB total memory
Kylin Linux operating system
Water cooled
17.6 Megawatts peak power (24 Megawatts including cooling)

INB375 Parallel Computing 13


INB375 Parallel Computing 14
Parallel Computing Terms
Concurrent Computing
Any system in which operations can happen at the same time.
Includes multi-threaded and multi-process systems.
Parallelism exists but is not introduced for performance reasons.
E.g. a user interface thread and a background thread, or a server processing requests from two different clients.
Need to be aware of problems like deadlock.
Parallel Computing
Different operations belonging to the same application are performed at the same time so as to allow the application to finish in less time (speedup).
Includes everything from the previous slide.
Distributed Computing
Any system where operations are performed at different physical locations.
Includes parallel computing clusters, but also distributed concurrent systems.

INB375 Parallel Computing 15


Parallelization

[Figure: parallelization at three levels. Sequential hardware is replaced by parallel hardware; a sequential program is parallelized into a parallel program; and a sequential algorithm may have to be changed into a parallel algorithm.]

INB375 Parallel Computing 16


Parallel Algorithms
In many cases, to efficiently perform a particular calculation
on a parallel computer we need to use a fundamentally
different algorithm.
Such an algorithm may be designed and expressed in a
manner that makes the parallelism explicit, or it may simply
be an algorithm still expressed in a sequential manner but
which provides greater opportunity for exploiting parallelism.
The best parallel algorithm may not perform as well as the
best sequential algorithm when executed on a single
processor.
The parallel algorithm may only shine for larger numbers of processors
In general, replacing an algorithm with a more efficient
equivalent algorithm is not something that software tools
(optimizing compilers) can do automatically.
It generally requires a programmer applying human intellect.

INB375 Parallel Computing 17


Inherent Parallelism
Often, however, we don't need to fundamentally change the algorithm.
Even if the algorithm isn't expressed in an explicitly parallel fashion, we can, through analysis, determine which computational steps within the algorithm can safely be performed in parallel.
We call this exploiting Inherent Parallelism.

In general, when parallelizing:


Some parts of the original application may need to be redesigned with
new parallel algorithms
In other sections of that application we may be able to exploit inherent
parallelism without fundamentally changing the algorithm.

INB375 Parallel Computing 18


Parallelism in
Pure Functional Programs
Pure functional languages express computation entirely via
function evaluation.
The entire computation can be viewed as a large data flow problem
which can be represented by a directed acyclic graph:
[Figure: a dataflow DAG in which Input 1 and Input 2 feed a network of function nodes whose results combine into a single output.]
Evaluating a function produces no other side effects.
There is no implied evaluation sequence
The only ordering constraint placed on the computation is that a function cannot be evaluated until its input parameters are available.
even that constraint is relaxed for lazy functional languages.
So, functions can be evaluated in parallel unless the result of
one is explicitly required as input to the other.
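An illustrative sketch in C# (the function names F, G and H are assumptions): because F and G are pure and neither needs the other's result, they can be evaluated in parallel, and H simply waits for both, mirroring the dataflow graph above:

    using System.Threading.Tasks;

    static class DataflowDemo
    {
        static int F(int x) => x * x;       // pure: no side effects
        static int G(int x) => x + 10;      // pure: no side effects
        static int H(int a, int b) => a - b;

        public static int Evaluate(int input1, int input2)
        {
            Task<int> f = Task.Run(() => F(input1));
            Task<int> g = Task.Run(() => G(input2));
            return H(f.Result, g.Result);   // the only ordering constraint: H needs both results
        }
    }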
INB375 Parallel Computing 19
Parallelism in
Imperative Programs
Imperative languages express computation in terms of
statements which change a program's state.
Invoking a function can have arbitrary, hidden side effects.
Correct evaluation relies on the statements being performed
in a particular sequence
Control flow dictated by Sequence, Selection and Repetition
It can therefore be very difficult in general to determine if
evaluating two functions in parallel will produce the same
result as executing them one after the other.

INB375 Parallel Computing 20


Inherent Parallelism
S1: a = b + c;
S2: d = b + e;
S3: f = a + d;
[Dependence graph: S1 and S2 are independent of each other; S3 depends on both S1 and S2.]

for (int i = 0; i < N; i++)
    a[i] = 0;

Parallelization

Parallel.For(0, N, i => a[i] = 0);   // idiomatic .NET TPL form of the parallel loop
INB375 Parallel Computing 21
Safe Parallelization
Parallel version of the program must produce the
same result as the original sequential version.
Sufficient to preserve all control and data
dependencies in the original program.
Control dependency
Where one statement can affect whether some other statement actually
executes.
Data dependency
Where one statement refers to the same data as some other statement.

INB375 Parallel Computing 22


Control Dependencies
Iterative version:

int gcd(int a, int b) {
    while (b != 0) {
        int t = b;
        b = a % b;
        a = t;
    }
    return a;
}

Recursive version:

int gcd(int a, int b) {
    if (b == 0)
        return a;
    else
        return gcd(b, a % b);
}

Argument checking (the rest of the body executes only if the exception is not thrown):

int gcd(int a, int b) {
    if (a < 0 || b < 0)
        throw new ArgumentOutOfRangeException();
    ...
}
INB375 Parallel Computing 23
Data Dependencies
Flow (or True) Dependence (W → R)
One statement reads a value written by an earlier statement:
    a = 1;
    ...
    b = a + 1;

Output Dependence (W → W)
One statement overwrites a value written by an earlier statement:
    a = 1;
    ...
    a = 2;

Anti Dependence (R → W)
One statement reads a value before it is overwritten by a later statement:
    b = a + 1;
    ...
    a = 2;

Input Dependence (R → R)
One statement reads a value also read by an earlier statement:
    b = a + 1;
    ...
    c = a - 1;
Note: input dependencies don't need to be preserved.
INB375 Parallel Computing 24
Pointer/Reference Aliasing

Results a = new Results(x, 100);
Results b = a;

a.Normalize();
b.FindBest(selector);

[Figure: a and b both reference the same Results object, so the two calls operate on the same data even though they go through different variable names.]
INB375 Parallel Computing 25


Array Data Dependence Analysis
for (int i = 0; i < n; i++)
    for (int j = i; j < n; j++)
        a[i, j+1] = a[n, j];

Any data dependencies between loop iterations?

If there is a (flow) data dependence then there must exist at least one iteration (ir, jr) that reads the same array element that is written by some iteration (iw, jw):

∃ ir, jr, iw, jw : 0 ≤ ir < n ∧ ir ≤ jr < n ∧
                   0 ≤ iw < n ∧ iw ≤ jw < n ∧
                   iw = n ∧ jw + 1 = jr

Since 0 ≤ iw < n contradicts iw = n, this system has no solution, so there is no flow dependence between iterations.

INB375 Parallel Computing 26


Limitations of Static Analysis
Any form of static dependence analysis is going to be inexact
in general.
Alias analysis is undecidable.
Data dependence analysis is undecidable.
Especially difficult to perform accurately inter-procedurally.
If in doubt, we must assume that a dependency might exist.
If a dependence might exist then we must not parallelize.
Safe, but conservative
May not find all of the available parallelism.

INB375 Parallel Computing 27


Automatic Parallelization
Parallelization can be performed:
Automatically by a tool/compiler,
and/or
Manually by a programmer.
Unfortunately:
Current compilers are generally not smart enough to perform this kind of parallelization: they are necessarily conservative, so they work in some cases but fail to parallelize effectively in many others.
Manual parallelization requires highly skilled programmers, and is very time consuming and error prone.
Creating correct sequential programs is hard enough; parallelization introduces many new potential categories of bugs (race conditions, deadlocks, starvation, etc.).
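As an illustration, not from the slides, a minimal C# sketch of one such bug, a race condition: two tasks increment a shared counter without synchronisation, so increments are lost and the printed total is usually less than 2,000,000:

    using System;
    using System.Threading.Tasks;

    static class RaceDemo
    {
        const int Iterations = 1000000;
        static int counter = 0;

        static void Main()
        {
            // Both tasks perform unsynchronised read-modify-write updates of 'counter'.
            Task t1 = Task.Run(() => { for (int i = 0; i < Iterations; i++) counter++; });
            Task t2 = Task.Run(() => { for (int i = 0; i < Iterations; i++) counter++; });
            Task.WaitAll(t1, t2);
            Console.WriteLine(counter);     // rarely prints 2000000
        }
    }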

INB375 Parallel Computing 28
