Introduction to Parallel Programming
Student Workbook with Instructor's Notes

Intel Software College
Legal Lines and Disclaimers


The information contained in this document is provided for informational purposes only and represents the current view of Intel Corporation ("Intel") and
its contributors ("Contributors") on, as of the date of publication. Intel and the Contributors make no commitment to update the information contained
in this document, and Intel reserves the right to make changes at any time, without notice.
DISCLAIMER. THIS DOCUMENT, IS PROVIDED "AS IS." NEITHER INTEL, NOR THE CONTRIBUTORS MAKE ANY REPRESENTATIONS OF ANY KIND WITH
RESPECT TO PRODUCTS REFERENCED HEREIN, WHETHER SUCH PRODUCTS ARE THOSE OF INTEL, THE CONTRIBUTORS, OR THIRD PARTIES. INTEL,
AND ITS CONTRIBUTORS EXPRESSLY DISCLAIM ANY AND ALL WARRANTIES, IMPLIED OR EXPRESS, INCLUDING WITHOUT LIMITATION, ANY
WARRANTIES OF MERCHANTABILITY, FITNESS FOR ANY PARTICULAR PURPOSE, NON-INFRINGEMENT, AND ANY WARRANTY ARISING OUT OF THE
INFORMATION CONTAINED HEREIN, INCLUDING WITHOUT LIMITATION, ANY PRODUCTS, SPECIFICATIONS, OR OTHER MATERIALS REFERENCED
HEREIN. INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT THIS DOCUMENT IS FREE FROM ERRORS, OR THAT ANY PRODUCTS OR OTHER
TECHNOLOGY DEVELOPED IN CONFORMANCE WITH THIS DOCUMENT WILL PERFORM IN THE INTENDED MANNER, OR WILL BE FREE FROM
INFRINGEMENT OF THIRD PARTY PROPRIETARY RIGHTS, AND INTEL, AND ITS CONTRIBUTORS DISCLAIM ALL LIABILITY THEREFOR.
INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT ANY PRODUCT REFERENCED HEREIN OR ANY PRODUCT OR TECHNOLOGY DEVELOPED IN
RELIANCE UPON THIS DOCUMENT, IN WHOLE OR IN PART, WILL BE SUFFICIENT, ACCURATE, RELIABLE, COMPLETE, FREE FROM DEFECTS OR SAFE FOR
ITS INTENDED PURPOSE, AND HEREBY DISCLAIM ALL LIABILITIES THEREFOR. ANY PERSON MAKING, USING OR SELLING SUCH PRODUCT OR
TECHNOLOGY DOES SO AT HIS OR HER OWN RISK.
Licenses may be required. Intel, its contributors and others may have patents or pending patent applications, trademarks, copyrights or other
intellectual proprietary rights covering subject matter contained or described in this document. No license, express, implied, by estoppels or otherwise,
to any intellectual property rights of Intel or any other party is granted herein. It is your responsibility to seek licenses for such intellectual property
rights from Intel and others where appropriate.
Limited License Grant. Intel hereby grants you a limited copyright license to copy this document for your use and internal distribution only. You may not
distribute this document externally, in whole or in part, to any other person or entity.
LIMITED LIABILITY. IN NO EVENT SHALL INTEL, OR ITS CONTRIBUTORS HAVE ANY LIABILITY TO YOU OR TO ANY OTHER THIRD PARTY, FOR ANY LOST
PROFITS, LOST DATA, LOSS OF USE OR COSTS OF PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, OR FOR ANY DIRECT, INDIRECT, SPECIAL OR
CONSEQUENTIAL DAMAGES ARISING OUT OF YOUR USE OF THIS DOCUMENT OR RELIANCE UPON THE INFORMATION CONTAINED HEREIN, UNDER ANY
CAUSE OF ACTION OR THEORY OF LIABILITY, AND IRRESPECTIVE OF WHETHER INTEL, OR ANY CONTRIBUTOR HAS ADVANCE NOTICE OF THE
POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS SHALL APPLY NOTWITHSTANDING THE FAILURE OF THE ESSENTIAL PURPOSE OF ANY LIMITED
REMEDY.
Intel and Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright 2007, Intel Corporation. All Rights Reserved.


January 2007 Intel Corporation
Contents
Lab 1: Identifying Parallelism
Lab 2: Introducing Threads
Lab 3: Domain Decomposition with OpenMP
Lab 4: Critical Sections and Reductions with OpenMP
Lab 5: Implementing Task Decompositions
Lab 6: Analyzing Parallel Performance
Lab 7: Improving Parallel Performance
Lab 8: Choosing the Appropriate Thread Model
Instructor's Notes and Solutions



Lab 1: Identifying Parallelism

Time Required
Thirty minutes
Part A
For each of the following code segments, draw a dependence graph and determine whether the
computation is suitable for parallelization. If the computation is suitable for parallelization, decide
how it should be divided among three CPUs. You may assume that all functions are free of side
effects.
Example 1:
for (i = 0; i < 4; i++) {
    a[i] = 0.25 * i;
    b[i] = 4.0 / (a[i] * a[i]);
}
Example 2:
if (a < b) c = f(-1);
else if (a == b) c = f(0);
else c = f(1);
Example 3:
for (i = 0; i < 4; i++)
    for (j = 0; j < 3; j++)
        a[i][j] = f(a[i][j] * b[j]);
Example 4:
prime = 2;
do {
    first = prime * prime;
    for (i = first; i < 10; i += prime)
        marked[i] = 1;
    while (marked[++prime]);
} while (prime * prime < N);


Example 5:
switch (i) {
    case 0:
        a = f(x);
        b = g(y);
        break;
    case 1:
        a = g(x);
        b = f(y);
        break;
    case -1:
        a = f(y);
        b = f(x);
        break;
}
Example 6:
sum = 0.0;
for (i = 0; i < 9; i++)
    sum = sum + b[i];

Part B
Describe how parallelism could be used to reduce the time needed to perform each of the following
tasks.
Example 7:
A relational database table contains (among other things) student ID numbers and their cumulative
GPAs. Find out the percentage of students with a cumulative GPA greater than 3.5.
Example 8:
A ray-tracing program renders a realistic image by tracing one or more rays for each pixel of the
display window.
Example 9:
An operating system utility searches a disk and identifies every text file containing a particular
phrase specified by the user.
Example 10:
We want to improve a game similar to Civilization IV by reducing the amount of time the human
player must wait for the virtual world to be set up.
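
As one possible illustration of Example 7, the sketch below assumes the GPA column has already been copied into an in-memory array. The function name percent_above and its parameters are hypothetical, and the code uses OpenMP, which later labs introduce. Each thread counts its share of the rows and the per-thread counts are combined at the end.

```c
/* Hypothetical helper (not part of the workbook): returns the percentage
   of entries in gpa[0..n-1] that exceed 'cutoff'. The rows are divided
   among the threads; the reduction clause sums the per-thread counts. */
double percent_above (const double *gpa, int n, double cutoff)
{
    int i;
    int count = 0;

    if (n <= 0) return 0.0;
#pragma omp parallel for reduction(+:count)
    for (i = 0; i < n; i++)
        if (gpa[i] > cutoff) count++;
    return 100.0 * count / n;
}
```

A call such as percent_above (gpa, n, 3.5) answers the question posed in Example 7. Compiled without OpenMP support, the pragma is ignored and the loop simply runs sequentially.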


Lab 2: Introducing Threads

Time Required
Thirty minutes
For each of the following programs or program segments:

1. determine whether the best parallelization approach is a domain decomposition or a task
   decomposition;
2. decide whether the best thread model is the fork/join model or the general threads model;
3. determine fork/join points (in the case of the fork/join model) or thread creation points (in
   the case of the general threads model); and
4. decide which variables should be shared and which variables should be private.
Example 1:
/* Matrix multiplication */
int i, j, k;
double **a, **b, **c, tmp;
...
for (i = 0; i < m; i++)
for (j = 0; j < n; j++) {
tmp = 0.0;
for (k = 0; k < p; k++)
tmp += a[i][k] * b[k][j];
c[i][j] = tmp;
}
Example 2:
/* This program implements an Internet-based service that
   responds to number-theoretic queries */
int main() {
    request r;
    ...
    while(1) {
        next_request(&r);
        acknowledge_request (r);
        switch (r.type) {
            case PRIME:   primality_test (r);
                          break;
            case PERFECT: perfect_test (r);
                          break;
            case WARING:  find_waring_integer (r);
                          break;
        }
    }
    ...
}
Example 3:
double inner_product (double *x, double *y, int n)
{
int i;
double result;
result = 0.0;
for (i = 0; i < n; i++)
result += x[i] * y[i];
return result;
}
int main (int argc, char *argv[])
{
double *d, *g, w, x, y, z;
int i;
...
for (i = 0; i < n; i++)
d[i] = -g[i] + (w/x) * d[i];
y = inner_product (d, g, n);
z = inner_product (d, t, n);
...
}
Example 4:
/* Finite difference method to solve string vibration
problem (from Michael J. Quinn, Parallel Programming
in C with MPI and OpenMP, p. 325) */
#include <stdio.h>
#include <math.h>

#define F(x) sin(3.14159*(x))
#define G(x) 0.0
#define a    1.0
#define c    2.0
#define m    2000
#define n    1000
#define T    1.0

int main (int argc, char *argv[])
{
    float h;
    int   i, j;
    float k;
    float L;
    float u[m+1][n+1];

    h = a / n;
    k = T / m;
    L = (k*c/h)*(k*c/h);
    for (j = 0; j <= m; j++) u[j][0] = u[j][n] = 0.0;
    for (i = 1; i < n; i++) u[0][i] = F(i*h);
    for (i = 1; i < n; i++)
        u[1][i] = (L/2.0)*(u[0][i+1] + u[0][i-1]) +
                  (1.0 - L) * u[0][i] + k * G(i*h);
    for (j = 1; j < m; j++)
        for (i = 1; i < n; i++)
            u[j+1][i] = 2.0*(1.0 - L) * u[j][i] +
                        L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
    for (j = 0; j <= m; j++) {
        for (i = 0; i <= n; i++) printf ("%6.3f ", u[j][i]);
        putchar ('\n');
    }
    return 0;
}


Lab 3: Domain Decomposition with OpenMP

Time Required
Fifty minutes
For each of the programs below:

1. make the program parallel by adding the appropriate OpenMP pragmas;
2. compile the program;
3. execute the program for 1, 2, 3, and 4 threads; and
4. check the program outputs to verify they are the same.
Note: You will need to generate matrices for the matrix multiplication exercise; a utility
program gen.c is included in the lab folder for this purpose. Compile this code, and run it
to create files matrix_a and matrix_b; explicit usage is outlined in the code itself. Be sure
to generate a workload sufficiently large (e.g., matrix dimensions 1000 x 1000) to be
meaningful.
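
If gen.c is not at hand, the file layout can be inferred from function 'read_matrix' in Program 1: an int row count, an int column count, then rows x cols floats in row-major order. The sketch below writes a file in that layout. It is not the gen.c shipped with the lab; the function name write_random_matrix, its parameters, and the use of rand() for the values are all assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for gen.c: writes a rows x cols matrix of
   pseudo-random floats in [0,1] using the layout that 'read_matrix'
   reads back (two ints, then rows*cols floats, row by row).
   Returns 0 on success and -1 on any failure. */
int write_random_matrix (const char *name, int rows, int cols, unsigned seed)
{
    FILE  *fptr;
    float *row;
    int    i, j;
    int    ok;

    if (rows <= 0 || cols <= 0) return -1;
    fptr = fopen (name, "wb");
    if (fptr == NULL) return -1;
    row = (float *) malloc (cols * sizeof(float));
    if (row == NULL) {
        fclose (fptr);
        return -1;
    }
    srand (seed);
    ok = (fwrite (&rows, sizeof(int), 1, fptr) == 1) &&
         (fwrite (&cols, sizeof(int), 1, fptr) == 1);
    for (i = 0; ok && i < rows; i++) {
        for (j = 0; j < cols; j++)
            row[j] = (float) rand () / (float) RAND_MAX;
        ok = (fwrite (row, sizeof(float), cols, fptr) == (size_t) cols);
    }
    free (row);
    if (fclose (fptr) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

Calling write_random_matrix ("matrix_a", 1000, 1000, 1) and write_random_matrix ("matrix_b", 1000, 1000, 2) would produce a pair of compatible inputs of the size suggested in the note.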
Program 1:
/*
 *   Matrix multiplication
 */
#include <stdio.h>
#include <stdlib.h>

/*
 *   Function 'rerror' is called when the program detects an
 *   error and wishes to print an appropriate message and exit.
 */
void rerror (char *s)
{
    printf ("%s\n", s);
    exit (-1);
}
/*
 *   Function 'allocate_matrix', passed the number of rows and columns,
 *   allocates a two-dimensional matrix of floats.
 */
void allocate_matrix (float ***subs, int rows, int cols) {
    int    i;
    float *lptr, *rptr;
    float *storage;

    storage = (float *) malloc (rows * cols * sizeof(float));
    *subs = (float **) malloc (rows * sizeof(float *));
    for (i = 0; i < rows; i++)
        (*subs)[i] = &storage[i*cols];
    return;
}
/*
 *   Given the name of a file containing a matrix of floats, function
 *   'read_matrix' opens the file and reads its contents.
 */
void read_matrix (
    char    *s,       /* File name */
    float ***subs,    /* 2D submatrix indices */
    int     *m,       /* Number of rows in matrix */
    int     *n)       /* Number of columns in matrix */
{
    char  error_msg[80];
    FILE *fptr;       /* Input file pointer */

    fptr = fopen (s, "r");
    if (fptr == NULL) {
        sprintf (error_msg, "Can't open file '%s'", s);
        rerror (error_msg);
    }
    fread (m, sizeof(int), 1, fptr);
    fread (n, sizeof(int), 1, fptr);
    allocate_matrix (subs, *m, *n);
    fread ((*subs)[0], sizeof(float), *m * *n, fptr);
    fclose (fptr);
    return;
}
/*
 *   Passed a pointer to a two-dimensional matrix of floats and
 *   the dimensions of the matrix, function 'print_matrix' prints
 *   the matrix elements to standard output. If the matrix has more
 *   than 10 columns, the output may not be easy to read.
 */
void print_matrix (float **a, int rows, int cols)
{
    int i, j;

    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++)
            printf ("%6.2f ", a[i][j]);
        putchar ('\n');
    }
    putchar ('\n');
    return;
}

/*
 *   Function 'matrix_multiply' multiplies two matrices containing
 *   floating-point numbers.
 */
void matrix_multiply (float **a, float **b, float **c,
                      int arows, int acols, int bcols)
{
    int   i, j, k;
    float tmp;

    for (i = 0; i < arows; i++)
        for (j = 0; j < bcols; j++) {
            tmp = 0.0;
            for (k = 0; k < acols; k++)
                tmp += a[i][k] * b[k][j];
            c[i][j] = tmp;
        }
    return;
}
int main (int argc, char *argv[])
{
    int     m1, n1;    /* Dimensions of matrix 'a' */
    int     m2, n2;    /* Dimensions of matrix 'b' */
    float **a, **b;    /* Two matrices being multiplied */
    float **c;         /* Product matrix */

    read_matrix ("matrix_a", &a, &m1, &n1);
    print_matrix (a, m1, n1);
    read_matrix ("matrix_b", &b, &m2, &n2);
    print_matrix (b, m2, n2);
    if (n1 != m2) rerror ("Incompatible matrix dimensions");
    allocate_matrix (&c, m1, n2);
    matrix_multiply (a, b, c, m1, n1, n2);
    print_matrix (c, m1, n2);
    return 0;
}
Program 2:
/*
 *   Polynomial Interpolation
 *
 *   This program demonstrates a function that performs polynomial
 *   interpolation. The function is taken from "Numerical Recipes
 *   in C", Second Edition, by William H. Press, Saul A. Teukolsky,
 *   William T. Vetterling, and Brian P. Flannery.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 20      /* Number of function sample points */
#define X 14.5    /* Interpolate at this value of x */

/* Function 'vector' is used to allocate vectors with subscript
   range v[nl..nh] */
double *vector (long nl, long nh)
{
    double *v;

    v = (double *) malloc(((nh-nl+2)*sizeof(double)));
    return v-nl+1;
}

/* Function 'free_vector' is used to free up memory allocated
   with function 'vector' */
void free_vector(double *v, long nl, long nh)
{
    free ((char *) (v+nl-1));
}
/* Function 'polint' performs a polynomial interpolation */
void polint (double xa[], double ya[], int n, double x, double *y,
             double *dy)
{
    int i, m, ns=1;
    double den,dif,dift,ho,hp,w;
    double *c, *d;

    dif = fabs(x-xa[1]);
    c = vector(1,n);
    d = vector(1,n);
    for (i=1; i <= n; i++) {
        dift = fabs (x - xa[i]);
        if (dift < dif) {
            ns = i;
            dif = dift;
        }
        c[i] = ya[i];
        d[i] = ya[i];
    }
    *y = ya[ns--];
    for (m = 1; m < n; m++) {
        for (i = 1; i <= n-m; i++) {
            ho = xa[i] - x;
            hp = xa[i+m] - x;
            w = c[i+1] - d[i];
            den = ho - hp;
            den = w / den;
            d[i] = hp * den;
            c[i] = ho * den;
        }
        *y += (*dy=(2*ns < (n-m) ? c[ns+1] : d[ns--]));
    }
    free_vector (d, 1, n);
    free_vector (c, 1, n);
}
/* Functions 'sign' and 'init' are used to initialize the
x and y vectors holding known values of the function.
*/
int sign (int j)
{
if (j % 2 == 0) return 1;
else return -1;
}
void init (int i, double *x, double *y)
{
int j;
*x = (double) i;
*y = sin(i);
}
/* Function 'main' demonstrates the polynomial interpolation function
   by generating some test points and then calling 'polint' with a
   value of x between two of the test points. */
int main (void)
{
    double x, y, dy;
    double *xa, *ya;
    int i;

    xa = vector (1, N);
    ya = vector (1, N);
    /* Initialize xa's and ya's */
    for (i = 1; i <= N; i++) {
        init (i, &xa[i], &ya[i]);
        printf ("f(%4.2f) = %13.11f\n", xa[i], ya[i]);
    }
    /* Interpolate polynomial at X */
    polint (xa, ya, N, X, &y, &dy);
    printf ("\nf(%6.3f) = %13.11f with error bound %13.11f\n", X, y,
            fabs(dy));
    free_vector (xa, 1, N);
    free_vector (ya, 1, N);
    return 0;
}


Lab 4: Critical Sections and Reductions with OpenMP

Time Required
Twenty minutes
Exercise 1
Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the
program, execute it on 1 and 2 threads, and make sure the program output is the same as the
sequential program. Finally, compare the execution times of the sequential, single-threaded, and
double-threaded programs.
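
As a reminder of the two constructs this lab is named for, here is a minimal sketch on a deliberately different problem (counting multiples of 3). The function names are hypothetical. The first version protects the shared counter with a critical section; the second lets OpenMP keep a private counter per thread and combine them at the end.

```c
/* Counts the multiples of 3 in [0, limit) with a critical section:
   every increment of the shared counter is serialized. */
int count_with_critical (int limit)
{
    int i;
    int count = 0;

#pragma omp parallel for
    for (i = 0; i < limit; i++)
        if (i % 3 == 0) {
#pragma omp critical
            count++;
        }
    return count;
}

/* Same computation with a reduction: each thread updates its own
   private copy of 'count', and the copies are summed at the end
   of the loop. */
int count_with_reduction (int limit)
{
    int i;
    int count = 0;

#pragma omp parallel for reduction(+:count)
    for (i = 0; i < limit; i++)
        if (i % 3 == 0) count++;
    return count;
}
```

Both functions return the same value; timing them against each other is one way to see the cost of serializing every update.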
/*
 *   A small college is thinking of instituting a six-digit student ID
 *   number. It wants to know how many "acceptable" ID numbers there
 *   are. An ID number is "acceptable" if it has no two consecutive
 *   identical digits and the sum of the digits is not 7, 11, or 13.
 *
 *   024332 is not acceptable because of the repeated 3s.
 *   204124 is not acceptable because the digits add up to 13.
 *   304530 is acceptable.
 */
#include <stdio.h>

/*
 *   Function "no_problem_with_digits" extracts the digits from
 *   the ID number from right to left, making sure that there are
 *   no repeated digits and that the sum of the digits is not 7,
 *   11, or 13.
 */
int no_problem_with_digits (int i)
{
    int j;
    int latest;    /* Digit currently being examined */
    int prior;     /* Digit to the right of "latest" */
    int sum;       /* Sum of the digits */

    prior = -1;
    sum = 0;
    for (j = 0; j < 6; j++) {
        latest = i % 10;
        if (latest == prior) return 0;
        sum += latest;
        prior = latest;
        i /= 10;
    }
    if ((sum == 7) || (sum == 11) || (sum == 13)) return 0;
    return 1;
}
/*
 *   Function "main" iterates through all possible six-digit ID
 *   numbers (integers from 0 to 999999), counting the ones that
 *   meet the college's definition of "acceptable."
 */
int main (void)
{
    int count;    /* Count of acceptable ID numbers */
    int i;

    count = 0;
    for (i = 0; i < 1000000; i++)
        if (no_problem_with_digits (i)) count++;
    printf ("There are %d acceptable ID numbers\n", count);
    return 0;
}
Exercise 2
Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the
program, execute it on 1 and 2 threads, and make sure the program output is the same as the
sequential program. Finally, compare the execution times of the sequential, single-threaded, and
double-threaded programs.
/*
 *   This program uses the Sieve of Eratosthenes to determine the
 *   number of prime numbers less than or equal to 'n'.
 *
 *   Adapted from code appearing in Parallel Programming in C with
 *   MPI and OpenMP, by Michael J. Quinn, McGraw-Hill (2004).
 */
#include <stdio.h>
#include <stdlib.h>
#define MIN(a,b) ((a)<(b)?(a):(b))

int main (int argc, char *argv[])
{
    int   count;     /* Prime count */
    int   first;     /* Index of first multiple */
    int   i;
    int   index;     /* Index of current prime */
    char *marked;    /* Marks for 2,...,'n' */
    int   n;         /* Sieving from 2, ..., 'n' */
    int   prime;     /* Current prime */

    if (argc != 2) {
        printf ("Command line: %s <m>\n", argv[0]);
        exit (1);
    }
    n = atoi(argv[1]);
    marked = (char *) malloc (n-1);
    if (marked == NULL) {
        printf ("Cannot allocate enough memory\n");
        exit (1);
    }
    for (i = 0; i < n-1; i++) marked[i] = 1;
    index = 0;
    prime = 2;
    do {
        first = prime * prime - 2;
        for (i = first; i < n-1; i += prime) marked[i] = 0;
        while (!marked[++index]);
        prime = index + 2;
    } while (prime * prime <= n);
    count = 0;
    for (i = 0; i < n-1; i++)
        count += marked[i];
    printf ("There are %d primes less than or equal to %d\n", count, n);
    return 0;
}
Exercise 3
The Monte Carlo method refers to the use of statistical sampling to solve a problem. Some experts
say that more than half of all supercomputing cycles are devoted to Monte Carlo computations. A
Monte Carlo program can benefit from parallel processing in two ways. Parallel processing can be
used to reduce the time needed to find a solution of a particular resolution. The other use of parallel
processing is to find a more accurate solution in the same amount of time. This assignment is to
reduce the time needed to find a solution of a particular accuracy. The following C program uses the
Monte Carlo method to come up with an approximation to pi. Add OpenMP directives to make the
program suitable for execution on multiple threads. Divide the number of points to be generated
evenly among the threads. Compare the execution times of the sequential, single-threaded, and
double-threaded programs.
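
The even division of points asked for above can be sketched as follows. The helper name samples_for_thread is hypothetical and is only one way to split the work; it hands any remainder to the lowest-numbered threads so that the per-thread counts add up exactly to the requested total.

```c
/* Hypothetical helper: number of points thread 'id' (0-based) should
   generate when 'samples' points are divided among 'threads' threads.
   The first (samples % threads) threads each take one extra point,
   so the counts over all threads sum to 'samples'. */
int samples_for_thread (int samples, int id, int threads)
{
    int base  = samples / threads;
    int extra = samples % threads;

    return base + (id < extra ? 1 : 0);
}
```

Because erand48 keeps its generator state in the array passed to it, each thread can own a private three-element seed array rather than sharing one; how those per-thread seeds are chosen is left open here.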
/*
 *   This program uses the Monte Carlo method to come up with an
 *   approximation to pi. Taken from Parallel Programming in C with
 *   MPI and OpenMP, by Michael J. Quinn, McGraw-Hill (2004).
 */
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int count;               /* Points inside unit circle */
    int i;
    int samples;             /* Number of points to generate */
    unsigned short xi[3];    /* Random number seed */
    double x, y;             /* Coordinates of point */

    /* Number of points and 3 random number seeds are command-line
       arguments. */
    if (argc != 5) {
        printf ("Command-line syntax: %s <samples> "
                "<seed> <seed> <seed>\n", argv[0]);
        exit (-1);
    }
    samples = atoi (argv[1]);
    count = 0;
    xi[0] = atoi(argv[2]);
    xi[1] = atoi(argv[3]);
    xi[2] = atoi(argv[4]);
    for (i = 0; i < samples; i++) {
        x = erand48(xi);
        y = erand48(xi);
        if (x*x + y*y <= 1.0) count++;
    }
    printf ("Estimate of pi: %7.5f\n", 4.0 * count / samples);
    return 0;
}


Lab 5: Implementing Task Decompositions

Time Required
Sixty minutes
Exercise 1
Make this quicksort program parallel by adding the appropriate OpenMP pragmas and clauses.
Compile the program, execute it on 1 and 2 threads, and make sure the program is still correctly
sorting the elements of array A. Finally, compare the execution times of the sequential,
single-threaded, and double-threaded programs.
/*
 *   Stack-based Quicksort
 *
 *   The quicksort algorithm works by repeatedly dividing unsorted
 *   sub-arrays into two pieces: one piece containing the smaller
 *   elements and the other piece containing the larger elements.
 *   The splitter element, used to subdivide the unsorted sub-array,
 *   ends up in its sorted location. By repeating this process on
 *   smaller and smaller sub-arrays, the entire array gets sorted.
 *
 *   The typical implementation of quicksort uses recursion. This
 *   implementation replaces recursion with iteration. It manages its
 *   own stack of unsorted sub-arrays. When the stack of unsorted
 *   sub-arrays is empty, the array is sorted.
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_UNFINISHED 1000      /* Maximum number of unsorted sub-arrays */

/* Global shared variables */

struct {
    int first;                   /* Low index of unsorted sub-array */
    int last;                    /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED];    /* Stack */
int unfinished_index;            /* Index of top of stack */

float *A;                        /* Array of elements to be sorted */
int    n;                        /* Number of elements in A */

/* Function 'swap' is called when we want to exchange two array elements */

void swap (float *x, float *y)
{
    float tmp;

    tmp = *x;
    *x = *y;
    *y = tmp;
}
/* Function 'partition' actually does the sorting by dividing an
Unsorted sub-array into two parts: those less than or equal to the
splitter, and those greater than the splitter. The splitter is the
last element in the unsorted sub-array. The splitter ends up in its
final, sorted location. The function returns the final location of
the splitter (its index). This code is an implementation of the
algorithm appearing in Introduction to Algorithms, Second Edition,
by Cormen, Leiserson, Rivest, and Stein (The MIT Press, 2001). */
int partition (int first, int last)
{
int i, j;
float x;
x = A[last];
i = first - 1;
for (j = first; j < last; j++)
if (A[j] <= x) {
i++;
swap (&A[i], &A[j]);
}
swap (&A[i+1], &A[last]);
return (i+1);
}
/* Function 'quicksort' repeatedly retrieves the indices of unsorted
sub-arrays from the stack and calls 'partition' to divide these
sub-arrays into two pieces. It keeps one of the pieces and puts the
other piece on the stack of unsorted sub-arrays. Eventually it ends
up with a piece that doesn't need to be sorted. At this point it
gets the indices of another unsorted sub-array from the stack. The
function continues until the stack is empty. */
void quicksort (void)
{
    int first;
    int last;
    int my_index;
    int q;    /* Split point in array */

    while (unfinished_index >= 0) {
        my_index = unfinished_index;
        unfinished_index--;
        first = unfinished[my_index].first;
        last = unfinished[my_index].last;
        while (first < last) {
            /* Split unsorted array into two parts */
            q = partition (first, last);
            /* Put upper portion on stack of unsorted sub-arrays */
            if ((unfinished_index+1) >= MAX_UNFINISHED) {
                printf ("Stack overflow\n");
                exit (-1);
            }
            unfinished_index++;
            unfinished[unfinished_index].first = q+1;
            unfinished[unfinished_index].last = last;
            /* Keep lower portion for next iteration of loop */
            last = q-1;
        }
    }
}
/* Function 'print_float_array', given the address and length of an
Array of floating-point values, prints the values to standard
output, one element per line. */
void print_float_array (float *A, int n)
{
int i;
printf ("Contents of array:\n");
for (i = 0; i < n; i++)
printf ("%6.4f\n", A[i]);
}
/* Function 'verify_sorted' returns 1 if the elements of array 'A'
are in monotonically increasing order; it returns 0 otherwise. */
int verify_sorted (float *A, int n)
{
int i;
for (i = 0; i < n-1; i++)
if (A[i] > A[i+1]) return 0;
return 1;
}
/* Function 'main' gets the array size and random number seed from
   the command line, initializes the array, prints the unsorted array,
   sorts the array, and prints the sorted array. */
int main (int argc, char *argv[])
{
    int i;
    int seed;                /* Seed component input by user */
    unsigned short xi[3];    /* Random number seed */

    if (argc != 3) {
        printf ("Command-line syntax: %s <n> <seed>\n", argv[0]);
        exit (-1);
    }
    seed = atoi (argv[2]);
    xi[0] = xi[1] = xi[2] = seed;
    n = atoi (argv[1]);
    A = (float *) malloc (n * sizeof(float));
    for (i = 0; i < n; i++)
        A[i] = erand48(xi);
    /*
    print_float_array (A, n);
    */
    unfinished[0].first = 0;
    unfinished[0].last = n-1;
    unfinished_index = 0;
    quicksort ();
    /*
    print_float_array (A, n);
    */
    if (verify_sorted (A, n)) printf ("Elements are sorted\n");
    else printf ("ERROR: Elements are NOT sorted\n");
    return 0;
}


Lab 6: Analyzing Parallel Performance

Time Required
Thirty-five minutes
Exercise 1
You are responsible for maintaining a library of core functions used by a wide variety of programs in
an application suite. Your supervisor has noted the availability of multi-core processors and wants to
know whether rewriting the library of functions using threads would significantly improve the
performance of the programs in the application suite. What do you need to do to provide a
meaningful answer?
Exercise 2
Somebody wrote an OpenMP program to solve the problem posed in Lab 5 and benchmarked its
performance sorting 25 million keys. Here are the run times of the program, as reported by the
command-line utility time:
Threads    Run Time (sec)
   1            8.535
   2           21.183
   3           22.184
   4           25.060
What is the efficiency of the multithreaded program for 2, 3, and 4 threads? What can you conclude
about the design of the parallel program? Can you offer any suggestions for improving the
performance of the program?
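
For reference, speedup is the one-thread time divided by the p-thread time, and efficiency is speedup divided by p. The sketch below encodes those two definitions; the function names are hypothetical. It takes the one-thread run of the parallel program as the baseline, which is an assumption, since no separate sequential timing is given in this exercise.

```c
/* Speedup of a run on several threads relative to the one-thread run. */
double speedup (double t_one_thread, double t_p_threads)
{
    return t_one_thread / t_p_threads;
}

/* Efficiency: the fraction of ideal (linear) speedup actually achieved
   on p threads. A value of 1.0 means perfect scaling; values far below
   1.0 mean the added threads are mostly wasted. */
double efficiency (double t_one_thread, double t_p_threads, int p)
{
    return speedup (t_one_thread, t_p_threads) / p;
}
```

Each row of the table can be passed to efficiency together with the one-thread time from the first row.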
Exercise 3
A co-worker has been working on converting a sequential program into a multithreaded program. At
this point, only some of the functions of the program have been made parallel. On a key data set,
the multithreaded program exhibits these execution times:
Processors    Time (sec)
    1            5.34
    2            3.74
    3            3.31
    4            3.10
Is your co-worker on the right track? Would you advise your co-worker to continue the
parallelization effort?


Exercise 4
You've worked hard to convert a key application to multithreaded execution, and you've
benchmarked it on a quad-core processor. Here are the results:
Threads    Time (sec)
   1          24.3
   2          14.6
   3          11.7
   4          10.6
Suppose an 8-core version of the processor becomes available.

(a) Predict the execution time of this algorithm on an 8-core processor.
(b) Give a reason why the actual speedup may be lower than expected.
(c) Give two reasons why the actual speedup may be higher than expected.
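
One common way to approach part (a), assuming the program behaves according to Amdahl's law, is to estimate the serial fraction from a measured run and then extrapolate. The sketch below uses the experimentally determined serial fraction (the Karp-Flatt metric) for the first step. The function names are hypothetical, and the model ignores effects such as cache behavior and synchronization overhead, so the result is an estimate rather than a guarantee.

```c
/* Experimentally determined serial fraction (Karp-Flatt metric):
   e = (1/speedup - 1/p) / (1 - 1/p), defined for p > 1.
   't1' is the one-thread time and 'tp' the time on p threads. */
double serial_fraction (double t1, double tp, int p)
{
    double inv_speedup = tp / t1;

    return (inv_speedup - 1.0 / p) / (1.0 - 1.0 / p);
}

/* Amdahl-style prediction of run time on p threads, given the
   one-thread time 't1' and serial fraction 'f':
   T(p) = T(1) * (f + (1 - f) / p). */
double predicted_time (double t1, double f, int p)
{
    return t1 * (f + (1.0 - f) / p);
}
```

Computing serial_fraction for each measured thread count is also informative in its own right: if the value grows as threads are added, parallel overhead rather than inherently serial code is likely limiting the speedup.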
Exercise 5
You have benchmarked your multithreaded application on a system with CPU A, and it exhibits this
performance:
Threads    Time (sec)
   1         14.20
   2          7.81
   3          5.87
   4          4.72
Next you benchmark the same application on an otherwise identical system that has been upgraded
with a newer processor, CPU B, and it exhibits this performance:
Threads    Time (sec)
   1         11.83
   2          7.01
   3          5.42
   4          4.59
CPU B is clearly faster than CPU A: every execution time is lower when CPU B is used. However,
while CPU B improves single-processor performance by 20%, the parallel program is only 3% faster
when four processors are engaged. Explain how this can happen.

Exercise 6
Hard disk drives continue to improve in speed at a slower rate than microprocessors. What are the
implications of this trend for developers of multithreaded applications? What can be done about it?


Lab 7: Improving Parallel Performance

Time Required
Forty-five minutes
Exercise 1
Recall that the parallel quicksort program developed in Lab 5 exhibited poor performance because of
excessive contention among the tasks for access to the shared stack containing the indices of
unsorted sub-arrays. You can dramatically improve the performance by reducing the frequency at
which threads access the shared stack.
One way to reduce accesses to the shared stack is to switch to sequential quicksort for sub-arrays
smaller than a threshold size. In other words, when a thread encounters a sub-array smaller than
the threshold size and partitions it into two pieces, it does not put one piece on the stack and work
on the remaining piece. Instead, it sorts both pieces itself by recursively calling the sequential
quicksort function.
Use this strategy, and the sequential quicksort function given below, to improve the performance of
the parallel quicksort program you developed in Lab 5. Run some experiments to determine the best
threshold size for switching to sequential quicksort.
void seq_quicksort (int first, int last)

{
int q;
/* Split point in array */
if (first < last) {
q = partition (first, last);
seq_quicksort (first, q-1);
seq_quicksort (q+1, last);
}
}
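
The control flow of the threshold strategy described above can be sketched as follows. This is a single-threaded, self-contained illustration, not a solution to the exercise: the names hybrid_sort, seq_sort, partition_f, swap_f, and STACK_SIZE are hypothetical, the array is passed as a parameter instead of being global, and the locking a parallel version needs around the shared stack is omitted.

```c
#define STACK_SIZE 1000    /* assumed capacity, mirroring MAX_UNFINISHED */

static void swap_f (float *x, float *y)
{
    float tmp = *x;
    *x = *y;
    *y = tmp;
}

/* Same partitioning scheme as the workbook's 'partition', but the
   array is a parameter so that the sketch stands alone. */
static int partition_f (float *v, int first, int last)
{
    float x = v[last];
    int i = first - 1;
    int j;

    for (j = first; j < last; j++)
        if (v[j] <= x) {
            i++;
            swap_f (&v[i], &v[j]);
        }
    swap_f (&v[i+1], &v[last]);
    return i + 1;
}

/* Plain recursive quicksort used for small sub-arrays. */
static void seq_sort (float *v, int first, int last)
{
    if (first < last) {
        int q = partition_f (v, first, last);
        seq_sort (v, first, q-1);
        seq_sort (v, q+1, last);
    }
}

/* Stack-based quicksort that stops pushing work once a sub-array is
   smaller than 'threshold'. Returns 0 on success, -1 on stack overflow. */
int hybrid_sort (float *v, int n, int threshold)
{
    struct { int first; int last; } stack[STACK_SIZE];
    int top = 0;
    int first, last, q;

    if (n < 2) return 0;
    stack[0].first = 0;
    stack[0].last = n - 1;
    while (top >= 0) {
        first = stack[top].first;
        last = stack[top].last;
        top--;
        while (first < last) {
            if (last - first + 1 < threshold) {
                seq_sort (v, first, last);    /* no stack traffic */
                break;
            }
            q = partition_f (v, first, last);
            if (top + 1 >= STACK_SIZE) return -1;
            top++;
            stack[top].first = q + 1;         /* upper piece goes on the stack */
            stack[top].last = last;
            last = q - 1;                     /* keep the lower piece */
        }
    }
    return 0;
}
```

In a multithreaded version only the pop and push steps touch shared data, so a larger threshold means fewer trips to the stack but also fewer pieces of work available to idle threads; that trade-off is what the experiments in this exercise are meant to explore.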
Exercise 2
The following C program counts the number of primes less than n. Use OpenMP pragmas and clauses
to enable it to run on a multiprocessor. Make as many changes as you can in the time allowed to
improve the performance of the program on the maximum available number of processors.
/*
 *   This C program counts the number of primes between 2 and n.
 */
#include <stdio.h>
#include <math.h>
#include <omp.h>
/*
Passed a positive integer p, function is_prime returns 1 if

