
Part I

Fundamental Concepts

Winter 2014 Parallel Processing, Fundamental Concepts Slide 1


About This Presentation

This presentation is intended to support the use of the textbook


Introduction to Parallel Processing: Algorithms and Architectures
(Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by
the author in connection with teaching the graduate-level course
ECE 254B: Advanced Computer Architecture: Parallel Processing,
at the University of California, Santa Barbara. Instructors can use
these slides in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. Behrooz Parhami

Edition   Released      Revised       Revised     Revised     Revised       Revised
First     Spring 2005   Spring 2006   Fall 2008   Fall 2010   Winter 2013   Winter 2014

Winter 2014 Parallel Processing, Fundamental Concepts Slide 2


I Fundamental Concepts

Topics in This Part


Chapter 1 Introduction to Parallelism
Chapter 2 A Taste of Parallel Algorithms
Chapter 3 Parallel Algorithm Complexity
Chapter 4 Models of Parallel Processing

Winter 2014 Parallel Processing, Fundamental Concepts Slide 3


1 Introduction to Parallelism

Topics in This Chapter

1.1 Why Parallel Processing?

1.2 A Motivating Example

1.3 Parallel Processing Ups and Downs

1.4 Types of Parallelism: A Taxonomy

1.5 Roadblocks to Parallel Processing

1.6 Effectiveness of Parallel Processing

Winter 2014 Parallel Processing, Fundamental Concepts Slide 4


Some Resources

1. Our textbook; followed closely in lectures:
   Parhami, B., Introduction to Parallel Processing: Algorithms and Architectures, Plenum Press, 1999

2. Recommended book; complementary software topics:
   Herlihy, M. and N. Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann, revised 1st ed., 2012

3. Free on-line book (Creative Commons License):
   Matloff, N., Programming on Parallel Machines: GPU, Multicore, Clusters and More, 341 pp., PDF file
   http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf

4. Useful free on-line course, sponsored by NVIDIA:
   Introduction to Parallel Programming, CPU/GPU, CUDA (Compute Unified Device Architecture)
   https://www.udacity.com/course/cs344

Winter 2014 Parallel Processing, Fundamental Concepts Slide 5


1.1 Why Parallel Processing?

The quest for higher-performance digital computers seems unending.

In the past two decades, the performance of microprocessors has enjoyed exponential growth.

The growth of microprocessor speed/performance by a factor of 2 every 18 months is known as Moore's law.
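
As a quick sanity check (not from the slides), doubling every 18 months corresponds to an annual growth factor of about 2^(12/18) ≈ 1.6, which matches the roughly 1.6-per-year slope shown later in Fig. 1.1:

```python
# Annual growth factor implied by doubling every 18 months
annual_factor = 2 ** (12 / 18)
print(f"{annual_factor:.2f}x per year")          # ~1.59x, i.e., the ~1.6/yr slope of Fig. 1.1
print(f"{annual_factor ** 10:.0f}x per decade")  # ~100x per decade
```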

Winter 2014 Parallel Processing, Fundamental Concepts Slide 6


1.1 Why Parallel Processing?
This growth is the result of a combination of two factors:

1) Increase in complexity (related both to higher device density and to larger size) of VLSI chips, projected to rise to around 10 M transistors per chip for microprocessors.

2) Introduction of, and improvements in, architectural features such as on-chip cache memories, large instruction buffers, multiple instruction issue per cycle, and multithreading.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 7


1.1 Why Parallel Processing?

Moore's law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months) [Scha97].

Moore's law seems to hold regardless of how one measures processor performance:

counting the number of executed instructions per second (IPS)

counting the number of floating-point operations per second (FLOPS)

using sophisticated benchmark suites that attempt to measure the processor's performance on real applications

Winter 2014 Parallel Processing, Fundamental Concepts Slide 8


1.1 Why Parallel Processing?
[Figure: processor performance (KIPS to TIPS, logarithmic scale) versus calendar year, 1980 to 2020, growing at roughly 1.6 per year through the 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, and R10000, with projections circa 1998 and circa 2012.]

The number of cores has been increasing from a few in 2005 to the current 10s, and is projected to reach 100s by 2020.

Fig. 1.1 The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades (extrapolated).
Winter 2014 Parallel Processing, Fundamental Concepts Slide 9
1.1 Why Parallel Processing?

Even though it is expected that Moore's law will continue to hold for the near future, there is a limit that will eventually be reached. That some previous predictions about when the limit will be reached have proven wrong does not alter the fact that a limit, dictated by physical laws, does exist.

The most easily understood physical limit is that imposed by the finite speed of signal propagation along a wire.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 10


Why High-Performance Computing?

1. Higher speed (solve problems faster)
   Important when there are hard or soft deadlines; e.g., 24-hour weather forecast

2. Higher throughput (solve more problems)
   Important when we have many similar tasks to perform; e.g., transaction processing

3. Higher computational power (solve larger problems)
   e.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy

Winter 2014 Parallel Processing, Fundamental Concepts Slide 11


1.2 A Motivating Example

Fig. 1.3 The sieve of Eratosthenes yielding a list of 10 primes for n = 30. Marked elements have been distinguished by erasure from the list.

[Figure: the list 2 to 30 shown after initialization and after Passes 1, 2, and 3, whose current primes are 2, 3, and 5; the surviving entries 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 are the primes up to 30.]

Any composite number has a prime factor that is no greater than its square root.
Winter 2014 Parallel Processing, Fundamental Concepts Slide 12
Single-Processor Implementation of the Sieve

[Figure: a single processor P with Current Prime and Index registers, operating on a bit vector of n cells indexed 1 to n.]

Fig. 1.4 Schematic representation of single-processor solution for the sieve of Eratosthenes.
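
A minimal single-processor sketch (Python; assumed, not taken from the textbook) of the scheme in Fig. 1.4: a bit vector of n cells, a current-prime register, and an index that sweeps over the vector marking multiples.

```python
def sieve(n):
    """Sequential sieve of Eratosthenes: return all primes up to n."""
    is_prime = [True] * (n + 1)        # the bit vector of Fig. 1.4
    is_prime[0] = is_prime[1] = False
    current_prime = 2
    while current_prime * current_prime <= n:
        # the index register sweeps over multiples of the current prime
        for index in range(current_prime * current_prime, n + 1, current_prime):
            is_prime[index] = False
        # advance to the next unmarked element; it is the next current prime
        current_prime += 1
        while not is_prime[current_prime]:
            current_prime += 1
    return [i for i in range(2, n + 1) if is_prime[i]]

print(sieve(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29], the 10 primes of Fig. 1.3
```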

Winter 2014 Parallel Processing, Fundamental Concepts Slide 13


Control-Parallel Implementation of the Sieve

[Figure: p processors P1, P2, ..., Pp, each with its own Index register, sharing a Current Prime variable, a bit vector of n cells (1 to n) in shared memory, and an I/O device.]

Fig. 1.5 Schematic representation of a control-parallel solution for the sieve of Eratosthenes.
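
A minimal sketch (Python threads; assumed, not from the textbook) of the shared-memory scheme in Fig. 1.5: each processor repeatedly grabs the next unmarked number as the current prime and marks its multiples in the shared bit vector. CPython's global interpreter lock means this illustrates the coordination pattern rather than real speedup, and a worker may occasionally grab a composite whose marking is still in progress elsewhere; the extra marking is redundant but harmless.

```python
import threading

def control_parallel_sieve(n, p):
    is_prime = [True] * (n + 1)            # shared bit vector
    is_prime[0] = is_prime[1] = False
    lock = threading.Lock()                # serializes selection of the current prime
    next_candidate = [2]

    def worker():
        while True:
            with lock:                     # grab the next unmarked number no larger than sqrt(n)
                k = next_candidate[0]
                while k * k <= n and not is_prime[k]:
                    k += 1
                next_candidate[0] = k + 1
                if k * k > n:
                    return
            for m in range(k * k, n + 1, k):   # mark multiples of the current prime
                is_prime[m] = False

    workers = [threading.Thread(target=worker) for _ in range(p)]
    for w in workers: w.start()
    for w in workers: w.join()
    return [i for i in range(2, n + 1) if is_prime[i]]

print(control_parallel_sieve(1000, 3)[:10])    # the first 10 primes up to 1000
```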

Winter 2014 Parallel Processing, Fundamental Concepts Slide 14


Running Time of the Sequential/Parallel Sieve
[Figure: timelines from 0 to 1500 time units showing when the multiples of each current prime 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 are marked:
p = 1, t = 1411
p = 2, t = 706
p = 3, t = 499]

Fig. 1.6 Control-parallel realization of the sieve of Eratosthenes with n = 1000 and 1 ≤ p ≤ 3.
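
The times in Fig. 1.6 imply speedups of 1411/706 ≈ 2.0 for p = 2 and 1411/499 ≈ 2.8 for p = 3. The gain falls short of 3 because the single longest subtask (marking the multiples of 2) by itself takes roughly 500 time units, bounding the parallel running time no matter how many processors are added.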

Winter 2014 Parallel Processing, Fundamental Concepts Slide 15


Data-Parallel Implementation of the Sieve
Assume at most √n processors, so that all prime factors dealt with are in P1 (which broadcasts them); this requires √n ≤ n/p.

[Figure: each processor Pi has its own Current Prime and Index registers and owns one block of the bit vector: P1 holds cells 1 to n/p, P2 holds n/p+1 to 2n/p, ..., Pp holds n−n/p+1 to n, with communication links between the processors.]

Fig. 1.7 Data-parallel realization of the sieve of Eratosthenes.
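
A sequential simulation (a sketch under the stated assumptions, not from the textbook) of the data-parallel scheme in Fig. 1.7: the range 2..n is split into p contiguous blocks, P1 finds each current prime and "broadcasts" it, and every processor marks that prime's multiples inside its own block.

```python
import math

def data_parallel_sieve(n, p):
    """Simulate the data-parallel sieve: P1 finds each current prime (all primes up to
    sqrt(n) fall in its block when p <= sqrt(n)); every Pi marks multiples in its block."""
    assert p <= math.isqrt(n), "need p <= sqrt(n) so that P1 holds all current primes"
    block = -(-(n - 1) // p)                                   # ceiling of (n-1)/p
    lo = [2 + i * block for i in range(p)]                     # first number in Pi's block
    hi = [min(2 + (i + 1) * block, n + 1) for i in range(p)]   # one past the last
    marked = [[False] * (hi[i] - lo[i]) for i in range(p)]

    k = 2
    while k * k <= n:
        for i in range(p):                                     # each Pi marks its own block
            start = max(k * k, ((lo[i] + k - 1) // k) * k)     # first multiple of k to mark
            for m in range(start, hi[i], k):
                marked[i][m - lo[i]] = True
        k += 1                                                 # P1 searches for the next prime
        while k * k <= n and marked[0][k - lo[0]]:
            k += 1
    return [lo[i] + j for i in range(p)
            for j in range(hi[i] - lo[i]) if not marked[i][j]]

print(data_parallel_sieve(30, 3))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```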

Winter 2014 Parallel Processing, Fundamental Concepts Slide 16


One Reason for Sublinear Speedup:
Communication Overhead
[Figure: the left plot shows solution time versus the number of processors, split into a computation component that shrinks with p and a communication component that grows with p; the right plot shows the resulting actual speedup falling increasingly below the ideal speedup as p grows.]

Fig. 1.8 Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.
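
An illustrative toy model (assumed for this sketch; the parameter values are not taken from the figure): if the computation time shrinks as 1/p while per-processor communication overhead grows with p, total solution time is minimized at some intermediate p and the actual speedup flattens well below the ideal value p.

```python
def solution_time(p, t_comp=1000.0, t_comm_per_proc=2.0):
    """Toy model: perfectly divisible computation plus communication that grows with p."""
    return t_comp / p + t_comm_per_proc * p

for p in (1, 4, 16, 22, 64):
    t = solution_time(p)
    print(f"p={p:3d}  time={t:7.1f}  speedup={solution_time(1)/t:5.1f}  (ideal {p})")
# In this model the speedup peaks near p = sqrt(t_comp / t_comm_per_proc) ~ 22, then degrades.
```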

Winter 2014 Parallel Processing, Fundamental Concepts Slide 17


Another Reason for Sublinear Speedup:
Input/Output Overhead

[Figure: the left plot shows solution time versus the number of processors, split into a computation component that shrinks with p and a constant I/O component; the right plot shows the actual speedup leveling off below the ideal speedup.]

Fig. 1.9 Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.
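
In equation form (a simple model consistent with Fig. 1.9): if the computation parallelizes perfectly but a fixed I/O time t_IO remains, the solution time is T(1)/p + t_IO, so the speedup T(1) / (T(1)/p + t_IO) saturates at T(1)/t_IO no matter how many processors are used.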

Winter 2014 Parallel Processing, Fundamental Concepts Slide 18


1.3 Parallel Processing Ups and Downs

Parallel processing, in the literal sense of the term, is used in virtually every modern computer.

(For example, overlapping I/O with computation is a form of parallel processing, as is the overlap between instruction preparation and execution in a pipelined processor.)

Other forms of parallelism or concurrency that are widely used include the use of multiple functional units (e.g., separate integer and floating-point ALUs or two floating-point multipliers in one ALU) and multitasking (which allows overlap between computation and the memory load necessitated by a page fault).

Winter 2014 Parallel Processing, Fundamental Concepts Slide 19


1.3 Parallel Processing Ups and Downs

However, in this book the term parallel processing is used in the restricted sense of having multiple (usually identical) processors for the main computation, and not for I/O or other peripheral activities.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 20


1.3 Parallel Processing Ups and Downs

The history of parallel processing has had its ups and downs (read: company formations and bankruptcies!) with what appears to be a 20-year cycle. Serious interest in parallel processing started in the 1960s.

Commercial interest in parallel processing resurfaced in the 1980s. Driven primarily by contracts from the defense establishment and other federal agencies in the United States, numerous companies were formed to develop parallel systems.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 21


1.3 Parallel Processing Ups and Downs
However, three factors led to another downturn:

Government funding in the United States and other countries dried up, in part related to the end of the Cold War between the NATO allies and the Soviet bloc.

Commercial users in banking and other data-intensive industries were either saturated or disappointed by application difficulties.

Microprocessors developed so fast in terms of performance/cost ratio that custom-designed parallel machines always lagged in cost-effectiveness.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 22


1.4 Types of Parallelism: A Taxonomy
Flynn's categories distinguish single vs. multiple instruction streams and single vs. multiple data streams; Johnson's expansion further subdivides the MIMD class by memory organization (global vs. distributed) and by communication mechanism (shared variables vs. message passing):

SISD (single instruction, single data): uniprocessors
SIMD (single instruction, multiple data): array or vector processors
MISD (multiple instruction, single data): rarely used
MIMD (multiple instruction, multiple data): multiprocessors or multicomputers
  GMSV (global memory, shared variables): shared-memory multiprocessors
  GMMP (global memory, message passing): rarely used
  DMSV (distributed memory, shared variables): distributed shared memory
  DMMP (distributed memory, message passing): distributed-memory multicomputers

Fig. 1.11 The Flynn-Johnson classification of computer systems.

Winter 2014 Parallel Processing, Fundamental Concepts Slide 23


1.5 Roadblocks to Parallel Processing
Grosch's law: Economy of scale applies, or computing power is proportional to the square of cost
  No longer valid; in fact, we can buy more MFLOPS of computing power with micros than with supers

Minsky's conjecture: Speedup tends to be proportional to log p
  Has roots in the analysis of memory bank conflicts; can be overcome

Tyranny of IC technology: Uniprocessors suffice (x10 faster every 5 years)
  Faster ICs make parallel machines faster too; and what about x1000 speedup?

Tyranny of vector supercomputers: Familiar programming model
  Not all computations involve vectors; besides, parallel vector machines exist

Software inertia: Billions of dollars of investment in software
  New programs; even uniprocessors benefit from parallelism spec

Amdahl's law: Unparallelizable code severely limits the speedup

Winter 2014 Parallel Processing, Fundamental Concepts Slide 24
Amdahl's Law

s = 1 / [f + (1 − f)/p] ≤ min(p, 1/f)

where f = fraction of the work unaffected by the enhancement, p = speedup of the rest (the enhancement factor), and s = the resulting overall speedup.

Fig. 1.12 Limit on speed-up according to Amdahl's law: speedup s plotted against the enhancement factor p (1 to 50) for f = 0, 0.01, 0.02, 0.05, and 0.1.
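
A short sketch (Python) that evaluates the formula above and its min(p, 1/f) bound for the f values plotted in Fig. 1.12:

```python
def amdahl_speedup(f, p):
    """s = 1 / (f + (1 - f)/p): f = fraction unaffected, p = speedup of the rest."""
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.0, 0.01, 0.02, 0.05, 0.1):
    s = amdahl_speedup(f, 50)
    bound = min(50, 1 / f) if f > 0 else 50
    print(f"f={f:4.2f}: s(50) = {s:5.1f}   upper bound min(p, 1/f) = {bound:5.1f}")
# e.g., f = 0.05 and p = 50 give s ~ 14.5: even 5% unenhanced work caps the speedup far below 50
```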

Winter 2014 Parallel Processing, Fundamental Concepts Slide 25


1.6 Effectiveness of Parallel Processing

p     Number of processors
W(p)  Work performed by p processors
T(p)  Execution time with p processors; T(1) = W(1), T(p) ≤ W(p)
S(p)  Speedup = T(1) / T(p)
E(p)  Efficiency = T(1) / [p T(p)]
R(p)  Redundancy = W(p) / W(1)
U(p)  Utilization = W(p) / [p T(p)]
Q(p)  Quality = T³(1) / [p T²(p) W(p)]

Fig. 1.13 Task graph exhibiting limited inherent parallelism; for this 13-node graph, W(1) = T(1) = 13 and T(∞) = 8.
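
A small helper (a sketch, not from the textbook) that evaluates these measures from W(1), T(1), W(p), T(p). The p = 4 schedule used below is only an assumption for illustration: Fig. 1.13 itself specifies just W(1) = T(1) = 13 and T(∞) = 8, so no schedule can achieve speedup above 13/8 ≈ 1.6.

```python
def effectiveness(p, w1, t1, wp, tp):
    """Effectiveness measures of Section 1.6, computed from W(1), T(1), W(p), T(p)."""
    return {
        "S(p) speedup":     t1 / tp,
        "E(p) efficiency":  t1 / (p * tp),
        "R(p) redundancy":  wp / w1,
        "U(p) utilization": wp / (p * tp),
        "Q(p) quality":     t1 ** 3 / (p * tp ** 2 * wp),
    }

# Illustration only: assume 4 processors reach T(4) = T(inf) = 8 with no redundant work, W(4) = 13.
for name, value in effectiveness(p=4, w1=13, t1=13, wp=13, tp=8).items():
    print(f"{name}: {value:.3f}")
# S = 1.625 (bounded by T(1)/T(inf) = 13/8), E = U = 0.406, R = 1, Q = 0.66
```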
Winter 2014 Parallel Processing, Fundamental Concepts Slide 26
Reduction or Fan-in Computation

Example: Adding 16 numbers, 8 processors, unit-time additions (15 additions arranged as a binary tree).

Zero-time communication:
  S(8) = 15 / 4 = 3.75
  E(8) = 15 / (8 × 4) ≈ 47%
  R(8) = 15 / 15 = 1
  Q(8) = 1.76

Unit-time communication:
  S(8) = 15 / 7 ≈ 2.14
  E(8) = 15 / (8 × 7) ≈ 27%
  R(8) = 22 / 15 ≈ 1.47
  Q(8) = 0.39

Fig. 1.14 Computation graph for finding the sum of 16 numbers.
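
A quick check (a sketch in Python) of the numbers quoted above: the tree has 4 addition levels, so T(8) = 4 with free communication; with unit-time communication the 3 upper levels each need a transfer first, giving T(8) = 7 and W(8) = 15 additions + 7 communications = 22.

```python
p, additions = 8, 15                     # 15 unit-time additions in the binary tree
t1 = additions                           # sequential time T(1) = W(1)

for label, tp, wp in (("zero-time communication", 4, 15),
                      ("unit-time communication", 7, 22)):
    s = t1 / tp                          # speedup
    e = t1 / (p * tp)                    # efficiency
    r = wp / additions                   # redundancy
    q = t1 ** 3 / (p * tp ** 2 * wp)     # quality
    print(f"{label}: S(8)={s:.2f}  E(8)={e:.0%}  R(8)={r:.2f}  Q(8)={q:.2f}")
# zero-time: S=3.75, E=47%, R=1.00, Q=1.76;  unit-time: S=2.14, E=27%, R=1.47, Q=0.39
```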


Winter 2014 Parallel Processing, Fundamental Concepts Slide 27
