
Part I

Fundamental Concepts

Winter 2014 Parallel Processing, Fundamental Concepts Slide 1


About This Presentation

This presentation is intended to support the use of the textbook


Introduction to Parallel Processing: Algorithms and Architectures
(Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by
the author in connection with teaching the graduate-level course
ECE 254B: Advanced Computer Architecture: Parallel Processing,
at the University of California, Santa Barbara. Instructors can use
these slides in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. Behrooz Parhami

Edition   Released      Revised       Revised     Revised     Revised       Revised
First     Spring 2005   Spring 2006   Fall 2008   Fall 2010   Winter 2013   Winter 2014

Winter 2014 Parallel Processing, Fundamental Concepts Slide 2


I Fundamental Concepts

Topics in This Part


Chapter 1 Introduction to Parallelism
Chapter 2 A Taste of Parallel Algorithms
Chapter 3 Parallel Algorithm Complexity
Chapter 4 Models of Parallel Processing

Winter 2014 Parallel Processing, Fundamental Concepts Slide 3


1 Introduction to Parallelism

Topics in This Chapter

1.1 Why Parallel Processing?

1.2 A Motivating Example

1.3 Parallel Processing Ups and Downs

1.4 Types of Parallelism: A Taxonomy

1.5 Roadblocks to Parallel Processing

1.6 Effectiveness of Parallel Processing

Winter 2014 Parallel Processing, Fundamental Concepts Slide 4


Some Resources

1. Our textbook; followed closely in lectures:
   Parhami, B., Introduction to Parallel Processing: Algorithms and Architectures, Plenum Press, 1999

2. Recommended book; complementary software topics:
   Herlihy, M. and N. Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann, revised 1st ed., 2012

3. Free on-line book (Creative Commons License):
   Matloff, N., Programming on Parallel Machines: GPU, Multicore, Clusters and More, 341 pp., PDF file
   http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf

4. Useful free on-line course, sponsored by NVIDIA:
   Introduction to Parallel Programming, CPU/GPU, CUDA (Compute Unified Device Architecture)
   https://www.udacity.com/course/cs344

Winter 2014 Parallel Processing, Fundamental Concepts Slide 5


1.1 Why Parallel Processing?

The quest for higher-performance digital computers seems unending.

In the past two decades, the performance of microprocessors has enjoyed exponential growth.

The growth of microprocessor speed/performance by a factor of 2 every 18 months is known as Moore's law.
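
As a quick sanity check (not from the slides), doubling every 18 months corresponds to an annual growth factor of about 2^(12/18) ≈ 1.6, which matches the roughly 1.6-per-year slope shown later in Fig. 1.1:

```python
# Annual growth factor implied by doubling every 18 months
annual_factor = 2 ** (12 / 18)
print(f"{annual_factor:.2f}x per year")          # ~1.59x, i.e., the ~1.6/yr slope of Fig. 1.1
print(f"{annual_factor ** 10:.0f}x per decade")  # ~100x per decade
```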

Winter 2014 Parallel Processing, Fundamental Concepts Slide 6


1.1 Why Parallel Processing?
This growth is the result of a combination of two factors:

1) Increase in complexity (related both to higher device density and to larger size) of VLSI chips, projected to rise to around 10 M transistors per chip for microprocessors.

2) Introduction of, and improvements in, architectural features such as on-chip cache memories, large instruction buffers, multiple instruction issue per cycle, and multithreading.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 7


1.1 Why Parallel Processing?

Moore's law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months) [Scha97].

Moore's law seems to hold regardless of how one measures processor performance:

counting the number of executed instructions per second (IPS)

counting the number of floating-point operations per second (FLOPS)

using sophisticated benchmark suites that attempt to measure the processor's performance on real applications

Winter 2014 Parallel Processing, Fundamental Concepts Slide 8


1.1 Why Parallel Processing?
[Figure: processor performance (KIPS to TIPS, logarithmic scale) versus calendar year, 1980 to 2020, growing at roughly 1.6 per year through the 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, and R10000, with projections circa 1998 and circa 2012.]

The number of cores has been increasing from a few in 2005 to the current 10s, and is projected to reach 100s by 2020.

Fig. 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated).
Winter 2014 Parallel Processing, Fundamental Concepts Slide 9
1.1 Why Parallel Processing?

Even though it is expected that Moore's law will continue to hold for the near future, there is a limit that will eventually be reached. That some previous predictions about when the limit will be reached have proven wrong does not alter the fact that a limit, dictated by physical laws, does exist.

The most easily understood physical limit is that imposed by the finite speed of signal propagation along a wire.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 10


Why High-Performance Computing?

1. Higher speed (solve problems faster)
   Important when there are hard or soft deadlines; e.g., 24-hour weather forecast

2. Higher throughput (solve more problems)
   Important when we have many similar tasks to perform; e.g., transaction processing

3. Higher computational power (solve larger problems)
   e.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy

Winter 2014 Parallel Processing, Fundamental Concepts Slide 11


1.2 A Motivating Example

Fig. 1.3 The sieve of Eratosthenes yielding a list of 10 primes for n = 30. Marked elements have been distinguished by erasure from the list.

[Figure: the list 2 to 30 shown after initialization and after Passes 1, 2, and 3, whose current primes are 2, 3, and 5; the surviving entries 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 are the primes up to 30.]

Any composite number has a prime factor that is no greater than its square root.
Winter 2014 Parallel Processing, Fundamental Concepts Slide 12
Single-Processor Implementation of the Sieve

[Figure: a single processor P with Current Prime and Index registers, operating on a bit vector of n cells indexed 1 to n.]

Fig. 1.4 Schematic representation of single-processor solution for the sieve of Eratosthenes.
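
A minimal single-processor sketch (Python; assumed, not taken from the textbook) of the scheme in Fig. 1.4: a bit vector of n cells, a current-prime register, and an index that sweeps over the vector marking multiples.

```python
def sieve(n):
    """Sequential sieve of Eratosthenes: return all primes up to n."""
    is_prime = [True] * (n + 1)        # the bit vector of Fig. 1.4
    is_prime[0] = is_prime[1] = False
    current_prime = 2
    while current_prime * current_prime <= n:
        # the index register sweeps over multiples of the current prime
        for index in range(current_prime * current_prime, n + 1, current_prime):
            is_prime[index] = False
        # advance to the next unmarked element; it is the next current prime
        current_prime += 1
        while not is_prime[current_prime]:
            current_prime += 1
    return [i for i in range(2, n + 1) if is_prime[i]]

print(sieve(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29], the 10 primes of Fig. 1.3
```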

Winter 2014 Parallel Processing, Fundamental Concepts Slide 13


Control-Parallel Implementation of the Sieve

[Figure: p processors P1, P2, ..., Pp, each with its own Index register, sharing a Current Prime variable, a bit vector of n cells (1 to n) in shared memory, and an I/O device.]

Fig. 1.5 Schematic representation of a control-parallel solution for the sieve of Eratosthenes.
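
A minimal sketch (Python threads; assumed, not from the textbook) of the shared-memory scheme in Fig. 1.5: each processor repeatedly grabs the next unmarked number as the current prime and marks its multiples in the shared bit vector. CPython's global interpreter lock means this illustrates the coordination pattern rather than real speedup, and a worker may occasionally grab a composite whose marking is still in progress elsewhere; the extra marking is redundant but harmless.

```python
import threading

def control_parallel_sieve(n, p):
    is_prime = [True] * (n + 1)            # shared bit vector
    is_prime[0] = is_prime[1] = False
    lock = threading.Lock()                # serializes selection of the current prime
    next_candidate = [2]

    def worker():
        while True:
            with lock:                     # grab the next unmarked number no larger than sqrt(n)
                k = next_candidate[0]
                while k * k <= n and not is_prime[k]:
                    k += 1
                next_candidate[0] = k + 1
                if k * k > n:
                    return
            for m in range(k * k, n + 1, k):   # mark multiples of the current prime
                is_prime[m] = False

    workers = [threading.Thread(target=worker) for _ in range(p)]
    for w in workers: w.start()
    for w in workers: w.join()
    return [i for i in range(2, n + 1) if is_prime[i]]

print(control_parallel_sieve(1000, 3)[:10])    # the first 10 primes up to 1000
```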

Winter 2014 Parallel Processing, Fundamental Concepts Slide 14


Running Time of the Sequential/Parallel Sieve
[Figure: timelines from 0 to 1500 time units showing when the multiples of each current prime 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 are marked:
p = 1, t = 1411
p = 2, t = 706
p = 3, t = 499]

Fig. 1.6 Control-parallel realization of the sieve of Eratosthenes with n = 1000 and 1 ≤ p ≤ 3.
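
The times in Fig. 1.6 imply speedups of 1411/706 ≈ 2.0 for p = 2 and 1411/499 ≈ 2.8 for p = 3. The gain falls short of 3 because the single longest subtask (marking the multiples of 2) by itself takes roughly 500 time units, bounding the parallel running time no matter how many processors are added.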

Winter 2014 Parallel Processing, Fundamental Concepts Slide 15


Data-Parallel Implementation of the Sieve
Assume at most √n processors, so that all prime factors dealt with are in P1 (which broadcasts them); this requires √n ≤ n/p.

[Figure: each processor Pi has its own Current Prime and Index registers and owns one block of the bit vector: P1 holds cells 1 to n/p, P2 holds n/p+1 to 2n/p, ..., Pp holds n−n/p+1 to n, with communication links between the processors.]

Fig. 1.7 Data-parallel realization of the sieve of Eratosthenes.
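
A sequential simulation (a sketch under the stated assumptions, not from the textbook) of the data-parallel scheme in Fig. 1.7: the range 2..n is split into p contiguous blocks, P1 finds each current prime and "broadcasts" it, and every processor marks that prime's multiples inside its own block.

```python
import math

def data_parallel_sieve(n, p):
    """Simulate the data-parallel sieve: P1 finds each current prime (all primes up to
    sqrt(n) fall in its block when p <= sqrt(n)); every Pi marks multiples in its block."""
    assert p <= math.isqrt(n), "need p <= sqrt(n) so that P1 holds all current primes"
    block = -(-(n - 1) // p)                                   # ceiling of (n-1)/p
    lo = [2 + i * block for i in range(p)]                     # first number in Pi's block
    hi = [min(2 + (i + 1) * block, n + 1) for i in range(p)]   # one past the last
    marked = [[False] * (hi[i] - lo[i]) for i in range(p)]

    k = 2
    while k * k <= n:
        for i in range(p):                                     # each Pi marks its own block
            start = max(k * k, ((lo[i] + k - 1) // k) * k)     # first multiple of k to mark
            for m in range(start, hi[i], k):
                marked[i][m - lo[i]] = True
        k += 1                                                 # P1 searches for the next prime
        while k * k <= n and marked[0][k - lo[0]]:
            k += 1
    return [lo[i] + j for i in range(p)
            for j in range(hi[i] - lo[i]) if not marked[i][j]]

print(data_parallel_sieve(30, 3))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```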

Winter 2014 Parallel Processing, Fundamental Concepts Slide 16


One Reason for Sublinear Speedup:
Communication Overhead
[Figure: the left plot shows solution time versus the number of processors, split into a computation component that shrinks with p and a communication component that grows with p; the right plot shows the resulting actual speedup falling increasingly below the ideal speedup as p grows.]

Fig. 1.8 Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.
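
An illustrative toy model (assumed for this sketch; the parameter values are not taken from the figure): if the computation time shrinks as 1/p while per-processor communication overhead grows with p, total solution time is minimized at some intermediate p and the actual speedup flattens well below the ideal value p.

```python
def solution_time(p, t_comp=1000.0, t_comm_per_proc=2.0):
    """Toy model: perfectly divisible computation plus communication that grows with p."""
    return t_comp / p + t_comm_per_proc * p

for p in (1, 4, 16, 22, 64):
    t = solution_time(p)
    print(f"p={p:3d}  time={t:7.1f}  speedup={solution_time(1)/t:5.1f}  (ideal {p})")
# In this model the speedup peaks near p = sqrt(t_comp / t_comm_per_proc) ~ 22, then degrades.
```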

Winter 2014 Parallel Processing, Fundamental Concepts Slide 17


Another Reason for Sublinear Speedup:
Input/Output Overhead

[Figure: the left plot shows solution time versus the number of processors, split into a computation component that shrinks with p and a constant I/O component; the right plot shows the actual speedup leveling off below the ideal speedup.]

Fig. 1.9 Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.
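
In equation form (a simple model consistent with Fig. 1.9): if the computation parallelizes perfectly but a fixed I/O time t_IO remains, the solution time is T(1)/p + t_IO, so the speedup T(1) / (T(1)/p + t_IO) saturates at T(1)/t_IO no matter how many processors are used.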

Winter 2014 Parallel Processing, Fundamental Concepts Slide 18


1.3 Parallel Processing Ups and Downs

Parallel processing, in the literal sense of the term, is used in virtually every modern computer.

(For example, overlapping I/O with computation is a form of parallel processing, as is the overlap between instruction preparation and execution in a pipelined processor.)

Other forms of parallelism or concurrency that are widely used include the use of multiple functional units (e.g., separate integer and floating-point ALUs or two floating-point multipliers in one ALU) and multitasking (which allows overlap between computation and the memory load necessitated by a page fault).

Winter 2014 Parallel Processing, Fundamental Concepts Slide 19


1.3 Parallel Processing Ups and Downs

However, in this book the term parallel processing is used in the restricted sense of having multiple (usually identical) processors for the main computation, and not for I/O or other peripheral activities.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 20


1.3 Parallel Processing Ups and Downs

The history of parallel processing has had its ups and downs (read: company formations and bankruptcies!) with what appears to be a 20-year cycle. Serious interest in parallel processing started in the 1960s.

Commercial interest in parallel processing resurfaced in the 1980s. Driven primarily by contracts from the defense establishment and other federal agencies in the United States, numerous companies were formed to develop parallel systems.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 21


1.3 Parallel Processing Ups and Downs
However, three factors led to another downturn:

Government funding in the United States and other countries dried up, in part related to the end of the Cold War between the NATO allies and the Soviet bloc.

Commercial users in banking and other data-intensive industries were either saturated or disappointed by application difficulties.

Microprocessors developed so fast in terms of performance/cost ratio that custom-designed parallel machines always lagged in cost-effectiveness.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 22


1.4 Types of Parallelism: A Taxonomy
Flynn's categories distinguish single vs. multiple instruction streams and single vs. multiple data streams; Johnson's expansion further subdivides the MIMD class by memory organization (global vs. distributed) and by communication mechanism (shared variables vs. message passing):

SISD (single instruction, single data): uniprocessors
SIMD (single instruction, multiple data): array or vector processors
MISD (multiple instruction, single data): rarely used
MIMD (multiple instruction, multiple data): multiprocessors or multicomputers
  GMSV (global memory, shared variables): shared-memory multiprocessors
  GMMP (global memory, message passing): rarely used
  DMSV (distributed memory, shared variables): distributed shared memory
  DMMP (distributed memory, message passing): distributed-memory multicomputers

Fig. 1.11 The Flynn-Johnson classification of computer systems.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 23


1.5 Roadblocks to Parallel Processing
Grosch's law: Economy of scale applies, or computing power is proportional to the square of cost
  No longer valid; in fact, we can buy more MFLOPS of computing power with micros than with supers

Minsky's conjecture: Speedup tends to be proportional to log p
  Has roots in the analysis of memory bank conflicts; can be overcome

Tyranny of IC technology: Uniprocessors suffice (x10 faster every 5 years)
  Faster ICs make parallel machines faster too; and what about x1000 speedup?

Tyranny of vector supercomputers: Familiar programming model
  Not all computations involve vectors; besides, parallel vector machines exist

Software inertia: Billions of dollars of investment in software
  New programs; even uniprocessors benefit from parallelism spec

Amdahl's law: Unparallelizable code severely limits the speedup

Winter 2014 Parallel Processing, Fundamental Concepts Slide 24
Amdahl's Law

s = 1 / [f + (1 − f)/p] ≤ min(p, 1/f)

where f = fraction of the work unaffected by the enhancement, p = speedup of the rest (the enhancement factor), and s = the resulting overall speedup.

Fig. 1.12 Limit on speed-up according to Amdahl's law: speedup s plotted against the enhancement factor p (1 to 50) for f = 0, 0.01, 0.02, 0.05, and 0.1.
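
A short sketch (Python) that evaluates the formula above and its min(p, 1/f) bound for the f values plotted in Fig. 1.12:

```python
def amdahl_speedup(f, p):
    """s = 1 / (f + (1 - f)/p): f = fraction unaffected, p = speedup of the rest."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.0, 0.01, 0.02, 0.05, 0.1):
    s = amdahl_speedup(f, 50)
    bound = min(50, 1 / f) if f > 0 else 50
    print(f"f={f:4.2f}: s(50) = {s:5.1f}   upper bound min(p, 1/f) = {bound:5.1f}")
# e.g., f = 0.05 and p = 50 give s ~ 14.5: even 5% unenhanced work caps the speedup far below 50
```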

Winter 2014 Parallel Processing, Fundamental Concepts Slide 25


1.6 Effectiveness of Parallel Processing

p     Number of processors
W(p)  Work performed by p processors
T(p)  Execution time with p processors; T(1) = W(1), T(p) ≤ W(p)
S(p)  Speedup = T(1) / T(p)
E(p)  Efficiency = T(1) / [p T(p)]
R(p)  Redundancy = W(p) / W(1)
U(p)  Utilization = W(p) / [p T(p)]
Q(p)  Quality = T³(1) / [p T²(p) W(p)]

Fig. 1.13 Task graph exhibiting limited inherent parallelism; for this 13-node graph, W(1) = T(1) = 13 and T(∞) = 8.
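
A small helper (a sketch, not from the textbook) that evaluates these measures from W(1), T(1), W(p), T(p). The p = 4 schedule used below is only an assumption for illustration: Fig. 1.13 itself specifies just W(1) = T(1) = 13 and T(∞) = 8, so no schedule can achieve speedup above 13/8 ≈ 1.6.

```python
def effectiveness(p, w1, t1, wp, tp):
    """Effectiveness measures of Section 1.6, computed from W(1), T(1), W(p), T(p)."""
    return {
        "S(p) speedup":     t1 / tp,
        "E(p) efficiency":  t1 / (p * tp),
        "R(p) redundancy":  wp / w1,
        "U(p) utilization": wp / (p * tp),
        "Q(p) quality":     t1 ** 3 / (p * tp ** 2 * wp),
    }

# Illustration only: assume 4 processors reach T(4) = T(inf) = 8 with no redundant work, W(4) = 13.
for name, value in effectiveness(p=4, w1=13, t1=13, wp=13, tp=8).items():
    print(f"{name}: {value:.3f}")
# S = 1.625 (bounded by T(1)/T(inf) = 13/8), E = U = 0.406, R = 1, Q = 0.66
```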
Winter 2014 Parallel Processing, Fundamental Concepts Slide 26
Reduction or Fan-in Computation

Example: Adding 16 numbers, 8 processors, unit-time additions (15 additions arranged as a binary tree).

Zero-time communication:
  S(8) = 15 / 4 = 3.75
  E(8) = 15 / (8 × 4) ≈ 47%
  R(8) = 15 / 15 = 1
  Q(8) = 1.76

Unit-time communication:
  S(8) = 15 / 7 ≈ 2.14
  E(8) = 15 / (8 × 7) ≈ 27%
  R(8) = 22 / 15 ≈ 1.47
  Q(8) = 0.39

Fig. 1.14 Computation graph for finding the sum of 16 numbers.
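
A quick check (a sketch in Python) of the numbers quoted above: the tree has 4 addition levels, so T(8) = 4 with free communication; with unit-time communication the 3 upper levels each need a transfer first, giving T(8) = 7 and W(8) = 15 additions + 7 communications = 22.

```python
p, additions = 8, 15                     # 15 unit-time additions in the binary tree
t1 = additions                           # sequential time T(1) = W(1)

for label, tp, wp in (("zero-time communication", 4, 15),
                      ("unit-time communication", 7, 22)):
    s = t1 / tp                          # speedup
    e = t1 / (p * tp)                    # efficiency
    r = wp / additions                   # redundancy
    q = t1 ** 3 / (p * tp ** 2 * wp)     # quality
    print(f"{label}: S(8)={s:.2f}  E(8)={e:.0%}  R(8)={r:.2f}  Q(8)={q:.2f}")
# zero-time: S=3.75, E=47%, R=1.00, Q=1.76;  unit-time: S=2.14, E=27%, R=1.47, Q=0.39
```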


Winter 2014 Parallel Processing, Fundamental Concepts Slide 27
