
MPI Workshop Notes1 DRAFT WORKING COPY

Date:         Friday, 8 June 2012
Time:         1:00 - 4:50 PM
Location:     Sir James Foot Building (Bldg. 47A), Room 241
Instructor:   Terry Clark, Ph.D. (t.clark3@uq.edu.au)
Organisation: UQ Research Computing Centre

In this workshop we introduce parallel computing with the Message Passing Interface (MPI), a standardized message-passing system designed for writing parallel programs in sequential languages such as C and Fortran. Since its release in 1994, and now with several reliable implementations, MPI remains a principal parallel programming model. Central to its success are its portability and the performance achievable for a significant share of applications in the technical computing community. MPI uses a model in which every process in a parallel execution runs exactly the same program, with each process operating on a different part of the calculation. The logic required to coordinate multiple processes through the program adds significant complexity, and this task is intensified by MPI's low-level interprocess communication primitives. Consequently, MPI program development tends to be tedious and error prone.

Parallel applications need to operate efficiently over a practical range of inputs and computing platforms. Within this operational range the performance varies, often widely, as a function of input, number of processors, types of resources, and MPI runtime parameters. It follows that to achieve good performance the users of parallel applications need informed choices to select and configure resources appropriately. The supporting information usually involves measurements made to assess the application on target computer systems with suitable data.

This workshop covers concepts and methods pertaining to the issues described above. The topics include MPI program development, parallel program debugging and profiling, and running parallel applications. The aim is a comprehensive introduction, necessarily of limited depth but without major gaps, that will enable attendees to pursue topics specific to their research needs. Site-specific content for the UQ HPC cluster, Barrine, is a critical adjunct which will provide attendees at all levels of expertise with a useful digest of protocols, tools, and systems. The programs discussed in this workshop and additional material can be retrieved from the web site and software repository at http://hpc-curlew.hpcu.uq.edu.au/.

$LastChangedDate: 2012-06-08 12:26:32 +1000 (Fri, 08 Jun 2012) $


1 This three-hour workshop is part of ongoing HPC training conducted by the UQ Research Computing Centre. The material in this section is suitable for HPC users at all levels.

Contents

1 Introduction
  1.1 Types of Parallel Systems
  1.2 Simple Sum
    1.2.1 Node codes to define simple-sum data
    1.2.2 Node codes to compute partial sum
    1.2.3 Node codes for simple sum
    1.2.4 Global sum approach
  1.3 SPMD model for Simple Sum
  1.4 Simple Sum using Basic MPI
2 MPI Standard and Implementations
  2.1 Message Passing Interface Standard
  2.2 MPI Implementations on Barrine
    2.2.1 Basic Information
    2.2.2 Features affecting program deployment
  2.3 Compilers on Barrine
  2.4 Makefiles and PBS scripts for Barrine
    2.4.1 Intel MPI with gcc and icc
    2.4.2 MPICH2 with gcc and icc
    2.4.3 SGI MPT with gcc and icc
    2.4.4 Open MPI with gcc and icc
3 Parallel Program Performance
  3.1 Performance measures
  3.2 Amdahl's Law assessment of speedup
  3.3 Performance examples
    3.3.1 Effect of serial part on Speedup and Efficiency
    3.3.2 Effect of communication on Speedup and Efficiency
    3.3.3 Effect of serial part and communication on Speedup and Efficiency
    3.3.4 Parallel performance with diminishing returns
  3.4 Barrine performance example
4 Barrine Network and I/O
  4.1 Infiniband and Gigabit Ethernet Networks
  4.2 Local file systems
5 Parallel Debuggers
  5.1 Totalview Debugger
  5.2 Example 1. Process skips baton pass
    5.2.1 Example 2. MPI process busy waiting
    5.2.2 Example 3. Process busy waiting with print statements
A PBS Environment Variables
B MPI Point-to-Point Communication Functions
  B.1 Blocking Send and Receive
  B.2 Non-blocking Send and Receive
  B.3 Combined Send and Receive
  B.4 Operations on Messages and Queues
  B.5 Definitions
  B.6 Send modes with the MPI API
  B.7 Semantics of Point-to-Point Communication
  B.8 References
C Code Listings
  C.1 Ping-pong Codes

1 Introduction

1.1 Types of Parallel Systems

PMS Notation. We use PMS notation to describe the key components of a computer system, where P is a processor, M is a memory, and S is a switch, or network.

Figure 1: Shared-memory architectures: (a) the symmetric multiprocessor (SMP, or dancehall) approach places all memory uniformly far from the processors, limiting the scalability of the approach; (b) NUMA (non-uniform memory access) distributes memory so that the latency to access local memory is fixed and independent of the number of processors.

Figure 2: Networks of workstations, also called NOWs or clusters, are message-passing architectures that use complete computers as building blocks (the nodes). The high-level diagram for a NOW is the same as for the NUMA multiprocessor illustrated in figure 1(b).²

² The main difference between the figure 1(b) multiprocessors and NOW message-passing systems is how non-local memory is accessed: a NUMA multiprocessor integrates communication into the memory system, whereas message-passing architectures perform explicit I/O operations.

1.2 Simple Sum

$$\sum_{i=1}^{n} a_i = a_1 + a_2 + a_3 + \cdots + a_{n-1} + a_n \qquad (1)$$

Suppose we use nP = 4 processors to calculate the sum S of n = 16 elements:

$$S = (a_{01} + a_{02} + a_{03} + a_{04}) + (a_{05} + a_{06} + a_{07} + a_{08}) + (a_{09} + a_{10} + a_{11} + a_{12}) + (a_{13} + a_{14} + a_{15} + a_{16}) \qquad (2)$$

The Goal

Figure 3: Time for the 4-way parallel addition (shown above the x-axis) is 1/4 of the sequential time.

A Detail

$$S = S_0 + S_1 + S_2 + S_3 \qquad (3)$$

where the subscripts are the process numbers for p0, p1, p2, p3.
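With n = 16 and the initialisation used in the node codes below (A[i] = thisProc*4 + i), the partial sums and the total work out as

$$S_0 = 1+2+3+4 = 10,\quad S_1 = 5+6+7+8 = 26,\quad S_2 = 9+10+11+12 = 42,\quad S_3 = 13+14+15+16 = 58,$$
$$S = 10 + 26 + 42 + 58 = 136.$$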

1.2.1 Node codes to define simple-sum data

(a) p0 code:

    int A[4]
    int nP = 4
    int thisProc = 0

    do i = 1, 4
        A[i] = (thisProc * 4) + i
    enddo

(b) p1 code: identical except thisProc = 1
(c) p2 code: identical except thisProc = 2
(d) p3 code: identical except thisProc = 3

Figure 4: Pseudo-code to initialise array A[0:3] on processes p0, p1, p2, p3 for the simple sum.

Figure 5: Initial data distribution for the 4-process simple sum, n = 16: each process holds four elements of A plus local accumulators R = 0 and S = 0.

1.2.2 Node codes to compute partial sum

(a) p0 code:

     1  int A[4], R
     2  int nP = 4
     3  int thisProc = 0
     4
     5  do i = 1, 4
     6      A[i] = (thisProc * 4) + i
     7  enddo
     8
     9  do i = 1, 4
    10      R = R + A[i]
    11  enddo

(b) p1 code: identical except thisProc = 1
(c) p2 code: identical except thisProc = 2
(d) p3 code: identical except thisProc = 3

Figure 6: Pseudo-code with the partial sum (lines 9-11) at processes p0, p1, p2, p3 for the simple sum.

Figure 7: Processes' memory contents after the partial sum: each process holds its local partial sum in R (10, 26, 42, and 58 across the four processes), with S still 0.

1.2.3 Node codes for simple sum

(a) p0 code:

     1  int A[4], R, S
     2  int nP = 4
     3  int thisProc = 0
     4
     5  do i = 1, 4
     6      A[i] = (thisProc * 4) + i
     7  enddo
     8  do i = 1, 4
     9      R = R + A[i]
    10  enddo
    11  S = +{R}

(b) p1 code: identical except thisProc = 1
(c) p2 code: identical except thisProc = 2
(d) p3 code: identical except thisProc = 3

Figure 8: Simple sum pseudo-code with the global sum on line 11.

Figure 9: Processes' memory contents after the global sum: each process now holds the total S = 136 alongside its partial sum R.

1.2.4 Global sum approach

Figure 10: Initial state of processes after the partial sum (see figure 7): process Pi holds its partial sum Ri.

Figure 11: Global sum, phase 1: exchange data from the initial state (figure 10).

Figure 12: Global sum, phase 2: exchange the summed data from phase 1 (figure 11).

Figure 13: Global sum, final state: all processes have the final sum, S = R0 + R1 + R2 + R3.
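The exchange pattern in figures 10-13 is recursive doubling: in each phase every process swaps its running partial sum with a partner whose rank differs in one bit. The sketch below illustrates that pattern with basic point-to-point MPI calls. It assumes the number of processes is a power of two, and it is only an illustration; the workshop's own globalSum() (figure 16) uses MPI_Allreduce instead.

#include <mpi.h>

/* Recursive-doubling global sum, as in figures 10-13.
 * Assumes nProc is a power of two. Illustrative sketch only. */
int globalSumButterfly(int R, MPI_Comm comm)
{
    int myProc, nProc, mask, partner, recvval;
    MPI_Status status;

    MPI_Comm_rank(comm, &myProc);
    MPI_Comm_size(comm, &nProc);

    for (mask = 1; mask < nProc; mask <<= 1) {
        partner = myProc ^ mask;          /* phase 1: neighbour; phase 2: two away; ... */
        MPI_Sendrecv(&R, 1, MPI_INT, partner, 0,
                     &recvval, 1, MPI_INT, partner, 0,
                     comm, &status);
        R += recvval;                     /* accumulate the partner's partial sum */
    }
    return R;                             /* every process now holds the full sum S */
}

With four processes the loop performs exactly the two exchange phases of figures 11 and 12.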

1.3 SPMD model for Simple Sum
Program SPMDSum (pseudo-code):

    Program SPMDSum
        int i, A[4], R, S;
        int (nP, thisProc) = initProcs()

        do i = 1, 4
            A[i] = (thisProc * 4) + i
        enddo

        do i = 1, 4
            R = R + A[i]
        enddo

        S = globalSum(R)
    end

Figure 14: The SPMD model for the simple sum. The single source file spmdsum.c is compiled to object code spmdsum.o and linked into the executable spmdsum; the runtime system then starts four copies of spmdsum as processes on cluster nodes (b01a22, b01a23, b02b29 in this illustration), each process with its own memory and processor.

1.4 Simple Sum using Basic MPI
#include "mympinc.h"
#define N 4

int main(int argc, char **argv)
{
    int A[N], R, S, i;
    int nP, thisProc;

    initProcesses(argc, argv, &thisProc, &nP);

    for (i = 1; i <= N; i++) {
        A[i-1] = (thisProc * 4) + i;
    }

    for (R = 0, i = 0; i < N; i++) {
        R = R + A[i];
    }

    S = globalSum(R, thisProc);

    MPI_Finalize();
    exit(0);
}

Figure 15: C program implementing the simple sum outlined above.

#include <stdio.h>
#include <mpi.h>

void initProcesses(int argc, char **argv, int *myproc, int *nproc)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, myproc);
    MPI_Comm_size(MPI_COMM_WORLD, nproc);
    printf("myProc=%d\tnProc=%d\t\tprogram=%s\n", *myproc, *nproc, argv[0]);
    MPI_Barrier(MPI_COMM_WORLD);
}

int globalSum(int R, int myProc)
{
    int S;
    int status = MPI_Allreduce(&R, &S, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("p%d\tR=%d\tS=%d\n", myProc, R, S);
    MPI_Barrier(MPI_COMM_WORLD);
    return(S);
}

Figure 16: C program functions completing the code in figure 15.

CC      = mpiicc
AR      = ar
ARFLAGS = r
CFLAGS  = -O1
LIBDIR  = .
LDFLAGS = -L$(LIBDIR)
LIBS    = -lmympi

program   = spmdsum
mylibrary = libmympi.a

$(program): $(program).o $(mylibrary)
	$(CC) $(LDFLAGS) -o $@ $< $(LIBS)

$(mylibrary): mympi.o
	$(AR) $(ARFLAGS) $@ $<

clean:;      $(RM) core *.o *.trace *.a
realclean:;  $(RM) core *.o* *.e* *.trace $(program) *.a

.c.o:;       $(CC) $(CFLAGS) -c $*.c

Figure 17: Makefile for the simple sum program in figure 15.

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=1:mpiprocs=1:NodeType=large
#PBS -l walltime=00:00:30
#PBS -N spmdmsg
#PBS -q workq

cd ${PBS_O_WORKDIR}
module load intel-mpi/3.2.2.006
mpirun -np 4 spmdsum

Figure 18: PBS script for the simple sum program of figure 15.

Six basic MPI functions:

MPI_Init         Initiate an MPI computation.
MPI_Finalize     Terminate an MPI computation.
MPI_Comm_size    Determine the number of MPI processes.
MPI_Comm_rank    Determine the MPI identifier of the calling process.
MPI_Send         Send a message to an MPI process.
MPI_Recv         Receive a message from an MPI process.
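As a self-contained illustration of these six calls (not a program from the workshop repository), the following minimal C program passes a single integer from rank 0 to rank 1; run it with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int nProc, myProc, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);                    /* initiate the MPI computation */
    MPI_Comm_size(MPI_COMM_WORLD, &nProc);     /* number of MPI processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);    /* identifier of the calling process */

    if (myProc == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (myProc == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("p%d received %d from p0 (of %d processes)\n", myProc, value, nProc);
    }

    MPI_Finalize();                            /* terminate the MPI computation */
    return 0;
}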

2 MPI Standard and Implementations

2.1 Message Passing Interface Standard

MPI 2.0 Standard (pdf)
MPI 1.1 Standard (pdf)

The definitive reference for MPI is the MPI Forum Web site.

2.2 MPI Implementations on Barrine

Installed MPI information:

Intel MPI (MPI-2.2)
  module files:   intel-mpi/3.2.2.006, intel-mpi/4.0.0.027, intel-mpi/4.0.1.007
  documentation:  reference manual and introduction guide for 3.2.2.006, 4.0.0.027, 4.0.1.007

MPICH2 (MPI-2.2)
  module file:    mpich2/1.4.1p1-intel
  documentation:  MPICH2 User's Guide; MPICH2 commands and routines

MPICH1 (MPI-1.1)
  module files:   mpich-ch4-p4, mpich-ch4-p4mpd
  documentation:  User's Guide to MPICH, v1.2.6

Open MPI (MPI-2.1)
  module files:   OpenMPI/1.2.8, OpenMPI/1.4.3, OpenMPI/1.5.3
  documentation:  Open MPI v1.6 commands and routines

SGI MPT (MPI-2.2)
  module files:   mpt/2.00, mpt/2.02
  documentation:  MPT User Guide (supports 2.06)

Features Affecting Application Deployment (includes Barrine-specific notes):

Intel MPI
  Infiniband:     yes (default)
  program build:  use the MPI compiler wrapper with the sequential compiler option (see figure 20); mpiicc invokes icc
  program launch: mpirun, mpiexec

MPICH2
  Infiniband:     3rd party, not on Barrine
  program build:  use the MPI compiler wrapper with the sequential compiler option (see figure 22)
  program launch: mpirun, mpiexec (mpiexec cannot run a non-MPI program)

SGI MPT
  Infiniband:     yes [SGI12]; use export MPI_USE_IB=1
  program build:  sequential compiler (icc, gcc, etc.); no mpicc wrapper
  program launch: mpiexec_mpt (under PBS), mpirun (outside PBS)

Open MPI
  Infiniband:     check...
  program build:  mpicc
  program launch: mpirun

2.3 Compilers on Barrine

Compiler-related documentation:

Intel:   C++ Compiler (http); Fortran Compiler (http); Fortran Reference, etc.
GNU:     GNU Compiler Collection; GNU OpenMP Manual; GCC online documentation
OpenMP:  Developing Threaded Applications; Cluster OpenMP Manual
Other:   Intel 64 and IA-32 Optimization Manual

2.4 Makefiles and PBS scripts for Barrine

Programs in this section are online at the software repository.

Shown here are Makefiles and PBS scripts for the program baton.c for each combination of compiler and MPI system: {GNU, Intel} x {Intel MPI, MPICH2, Open MPI, MPT}.³ Before invoking make on the Makefiles it is necessary to load modules for the compiler and MPI system; these module loads are shown in each accompanying PBS script. The source code listing for baton.c is shown below.

2.4.1 Intel MPI with gcc and icc

Makefile:

# GNU C and Intel MPI makefile
CC      = gcc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) -cc=$(CC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load compiler/gcc4.5.2
module load intel-mpi/3.2.2.006
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 19: Intel MPI with gcc: Makefile (top) and PBS script (bottom).

Makefile:

# Intel C and Intel MPI makefile
CC      = icc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) -cc=$(CC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load intelcc11/11.1.072
module load intel-mpi/3.2.2.006
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 20: Intel MPI with the Intel icc compiler: Makefile (top) and PBS script (bottom).

³ The GNU and Intel compilers demonstrated are gcc and icc.

2.4.2 MPICH2 with gcc and icc

Makefile:

# GNU C and MPICH2 makefile
CC      = gcc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load compiler/gcc4.5.2
module load mpich2/1.4.1p1-intel
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 21: MPICH2 with gcc: Makefile (top) and PBS script (bottom).

Makefile:

# Intel C and MPICH2 makefile
CC      = icc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load intelcc11/11.1.072
module load mpich2/1.4.1p1-intel
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 22: MPICH2 with the Intel icc compiler: Makefile (top) and PBS script (bottom).

2.4.3 SGI MPT with gcc and icc

Makefile:

# GNU C and MPT makefile
CC      = gcc
MPICC   = $(CC)
CFLAGS  = -O2
LDFLAGS =
LIBS    = -lmpi
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load mpt/2.00
module load compiler/gcc4.5.2
cd $PBS_O_WORKDIR
mpiexec_mpt -np 8 baton

Figure 23: SGI MPT with gcc: Makefile (top) and PBS script (bottom).


Makefile:

# Intel C and MPT makefile
CC      = icc
MPICC   = $(CC)
CFLAGS  = -O2
LDFLAGS =
LIBS    = -lmpi
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load intelcc11/11.1.072
module load mpt/2.00
cd $PBS_O_WORKDIR
mpiexec_mpt -np 8 baton

Figure 24: SGI MPT with the Intel icc compiler: Makefile (top) and PBS script (bottom).

2.4.4 Open MPI with gcc and icc

Makefile:

# GNU C and Open MPI makefile
CC      = gcc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load compiler/gcc4.5.2
module load OpenMPI/1.5.3
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 25: Open MPI with gcc: Makefile (top) and PBS script (bottom).

Makefile:

# Intel C and Open MPI makefile
CC      = icc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load intelcc11/11.1.072
module load OpenMPI/1.5.3
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 26: Open MPI with the Intel icc compiler: Makefile (top) and PBS script (bottom).


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>

MPI_Status status;
int myProc, nProc;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);
    MPI_Comm_size(MPI_COMM_WORLD, &nProc);

    check_mybaton_value(pass_baton());

    exit(MPI_Finalize());
}

int pass_baton()
{
    int toProc  = (myProc + 1) % nProc;
    int frmProc = (myProc - 1 + nProc) % nProc;
    int ibaton  = 0;

    if (myProc > 0)
        MPI_Recv(&ibaton, 1, MPI_INT, frmProc, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    ibaton += myProc;

    if (myProc < nProc-1)
        MPI_Send(&ibaton, 1, MPI_INT, toProc, 7, MPI_COMM_WORLD);

    return ibaton;
}

check_mybaton_value(int mybaton)
{
    int myanswer = myAnswer();

    wait_in_order();
    printf("p%d %s\n", myProc, ((mybaton - myanswer) ? "NOT OK" : "OK"));
    fflush(stdout);
    system("sleep 1");
    notifynext();
}

wait_in_order()
{
    if (myProc < nProc-1)
        MPI_Recv(0, 0, MPI_INT, myProc+1, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
}

notifynext()
{
    if (myProc > 0)
        MPI_Send(0, 0, MPI_INT, myProc-1, 3, MPI_COMM_WORLD);
}

int myAnswer()
{
    int i, sum;
    for (i = 1, sum = 0; i <= myProc; i++)
        sum += i;
    return(sum);
}

Figure 27: A copy of this baton.c is in each directory for section 2.4.

3 Parallel Program Performance

3.1 Performance measures

Speedup, S_P, using P processors is the ratio of the sequential time, T_1, to the parallel time, T_P:

$$S_P = \frac{T_1}{T_P} \qquad (4)$$

Efficiency, E_P, using P processors is the ratio of the speedup, S_P, to P:

$$E_P = \frac{S_P}{P} = \frac{T_1}{P\,T_P} \qquad (5)$$

3.2 Amdahl's Law assessment of speedup

Let f be the fraction of the work that executes serially. The parallel time is then

$$T_P = f\,T_1 + \frac{(1-f)\,T_1}{P} \qquad (6)$$

so the speedup is

$$S_P = \frac{T_1}{T_P} = \frac{T_1}{f\,T_1 + \frac{(1-f)\,T_1}{P}} \qquad (7)$$

$$S_P = \frac{1}{f + \frac{1-f}{P}} \qquad (8)$$
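A small illustration (not part of the workshop programs) that evaluates equation (8) for an assumed serial fraction; the value f = 0.1 is chosen only for the example.

#include <stdio.h>

/* Amdahl's Law, equation (8): S_P = 1 / (f + (1 - f)/P). */
static double amdahlSpeedup(double f, int P)
{
    return 1.0 / (f + (1.0 - f) / (double)P);
}

int main(void)
{
    const double f = 0.1;   /* assumed serial fraction, for illustration only */
    int P;

    for (P = 1; P <= 64; P *= 2) {
        double S = amdahlSpeedup(f, P);
        printf("P=%3d  speedup=%5.2f  efficiency=%5.1f%%\n", P, S, 100.0 * S / P);
    }
    return 0;
}

As P grows, the speedup is bounded above by 1/f (here 10), which is the pattern the examples in section 3.3 explore.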


3.3 Performance examples

3.3.1 Effect of serial part on Speedup and Efficiency

parallel part: parallelizes perfectly, i.e., contributes T_1(par)/P
serial part: executes serially, contributing T_1(ser)

Total: T_P = T_1(par)/P + T_1(ser)

nProcs   time (m)   speedup   efficiency
  1        110        1.0       100%
  2         60        1.8        92%
  4         35        3.1        79%
  8         23        4.9        61%
 16         16        6.7        42%
 32         13        8.3        26%
 64         12        9.5        15%

(Plot: compute time versus number of processes on log-log axes, with curves for the total, the parallel part, and the sequential part.)

3.3.2 Effect of communication on Speedup and Efficiency

communication: contributes k log2 P

Total: T_P = T_1/P + k log2 P

nProcs   time   speedup   efficiency
  1      100      1.0       100%
  2       51      2.0        98%
  4       27      3.8        93%
  8       16      6.4        81%
 16       10      9.8        61%
 32        8     12.3        38%
 64        7     13.2        21%

(Plot: compute time versus number of processes, with curves for the total, the parallel part, and the communication.)

3.3.3 Effect of serial part and communication on Speedup and Efficiency

parallel part: parallelizes perfectly, i.e., contributes T_1(par)/P
serial part: executes serially, contributing T_1(ser)
communication: contributes k log2 P

Total: T_P = T_1(par)/P + T_1(ser) + k log2 P

nProcs   time   speedup   efficiency
  1      110      1.0       100%
  2       61      1.8        90%
  4       37      3.0        74%
  8       26      4.3        54%
 16       20      5.4        34%
 32       18      6.1        19%
 64       17      6.3        10%

(Plot: compute time versus number of processes, with curves for the total, the parallel part, the sequential part, and the communication.)

3.3.4 Parallel performance with diminishing returns

Same model as section 3.3.3, T_P = T_1(par)/P + T_1(ser) + k log2 P, evaluated out to 2048 processes.

nProcs    time    speedup   efficiency
   1     110.0      1.0      100.0%
   2      61.0      1.8       90.2%
   4      37.0      3.0       74.3%
   8      25.5      4.3       54.0%
  16      20.3      5.4       34.0%
  32      18.1      6.1       19.0%
  64      17.6      6.3       10.0%
 128      17.8      6.2        4.8%
 256      18.4      6.0        2.3%
 512      19.2      5.7        1.1%
1024      20.1      5.5        0.5%
2048      20.0      5.2        0.3%

(Plot: compute time versus number of processes, with curves for the total, the parallel part, the sequential part, and the communication.)
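The tables in sections 3.3.1-3.3.4 follow the simple model above. The sketch below evaluates it directly; the parameter values (100 minutes of perfectly parallel work, 10 minutes of serial work, 1 minute per log2 communication step) are assumptions inferred from the tables rather than values stated in the text.

#include <stdio.h>
#include <math.h>

/* Evaluate T_P = T1_par/P + T1_ser + k*log2(P) and the derived speedup
 * and efficiency. Parameter values are assumptions for illustration. */
int main(void)
{
    const double T1_par = 100.0;  /* perfectly parallel work, minutes (assumed) */
    const double T1_ser = 10.0;   /* serial work, minutes (assumed) */
    const double k      = 1.0;    /* cost per log2 communication step (assumed) */
    const double T1     = T1_par + T1_ser;
    int P;

    for (P = 1; P <= 2048; P *= 2) {
        double TP = T1_par / P + T1_ser + k * log2((double)P);
        double S  = T1 / TP;
        printf("P=%5d  time=%6.1f  speedup=%4.1f  efficiency=%5.1f%%\n",
               P, TP, S, 100.0 * S / P);
    }
    return 0;
}

Compiled with -lm, this approximately reproduces the diminishing-returns behaviour tabulated above.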

3.4 Barrine performance example

Weather Research and Forecasting Model (WRF)⁴

nProcs   time (m)   speedup   efficiency
  1        530        1.0       100%
  4        245        2.2        54%
  8        150        3.5        44%
 16        120        4.4        28%
 32         84        6.3        20%
 64        100        5.3         8%

(Plots: elapsed time in minutes, speedup, and efficiency versus number of processes.)

⁴ Michael Hewson, with Professor Hamish McGowan's laboratory, provided the data for these WRF benchmarks. Michael applies WRF to climate modelling and orchestrates WRF and its associated programs on Barrine.

4 Barrine Network and I/O

4.1 Infiniband and Gigabit Ethernet Networks

Figure 28: Throughput estimates of Barrine's Infiniband and gigabit ethernet networks (throughput in MB/s versus message size in bytes, measured with Intel MPI and MPICH2).

Figure 29: Send-and-receive pairs between two processes used to estimate latency and throughput.

Figure 30: Ping-pong code to measure throughput. See Appendix C.1 for the complete listing.

int myproc, nproc;

int main(int argc, char **argv)
{
    int niter, msgsz;
    double clockticks;

    (void) getcmdline(argc, argv, &niter, &msgsz);
    (void) gethost(thishost, NMLEN);
    (void) initmpi(argc, argv);
    (void) echoparms(niter, msgsz);
    (void) pingpong(niter, msgsz, &clockticks);
    if (myproc == 0) summarise(niter, msgsz, clockticks);
    (void) zshutdown(0);
    return(0);
}

int pingpong(int niter, int msgsz, double *clockticksPtr)
{
    MPI_Status s;
    double clockticks = 0.0;
    int iter = 0;
    int *outcargo = (int *)malloc(sizeof(int) * msgsz);
    int *incargo  = (int *)malloc(sizeof(int) * msgsz);

    (void) define_cargo(msgsz, outcargo);

    do {
        if (myproc == 0) {
            clockticks -= MPI_Wtime();
            MPI_Send(outcargo, 1, MPI_INT, 1, 999, MPI_COMM_WORLD);
            MPI_Recv(incargo, msgsz, MPI_INT, 1, 888, MPI_COMM_WORLD, &s);
            clockticks += MPI_Wtime();
        } else if (myproc == 1) {
            clockticks -= MPI_Wtime();
            MPI_Recv(incargo, 1, MPI_INT, 0, 999, MPI_COMM_WORLD, &s);
            MPI_Send(outcargo, msgsz, MPI_INT, 0, 888, MPI_COMM_WORLD);
            clockticks += MPI_Wtime();
        }
    } while (++iter < niter);

    *clockticksPtr = clockticks;
    return(0);
}


Figure 31: Summarize point-to-point communication measurements. Full listing in Appendix C.1.

int summarise(int niter, int msgsz, double ticks)
{
    double ntotdat    = (double)msgsz * (double)niter * sizeof(int);
    double throughput = (ntotdat / ticks) / (1024.0 * 1024.0);
    double latency    = ((ticks / 2.0) / niter) * 1000000.0;

    fprintf(stdout, "p=%d\ttime=%fsec\tthru=%fMB\tlate=%fusec\n",
            myproc, ticks, throughput, latency);
    fflush(stdout);
    return(0);
}

int define_cargo(int msgsz, int *cargo)
{
    int ic;
    for (ic = 0; ic < msgsz; ic++) {
        *(cargo + ic) = hide(ic);
    }
    return(0);
}

int zshutdown(int status)
{
    int flag;
    MPI_Initialized(&flag);   /* only MPI function callable before MPI_Init() */
    if (flag) {
        if (status != 0) printf("p%d\t MPI shutdown\n", myproc);
        MPI_Finalize();
    }
    if (status != 0) printf("p%d\t exit status=%d\n", myproc, status);
    exit(status);
}

int initmpi(int argc, char **argv)
{
    initializeMPIprocesses(argc, argv, &myproc, &nproc);
    if (nproc > 2) (void) usage_and_exit(1);
    return(0);
}
4 BARRINE NETWORK AND I/0

25

Estimates of throughput for node pairs in two randomly selected node lists, one each for the x-axis and y-axis, and dierent lists for the two gures. Units are MB/second. Figure 32: Inniband throughput with Intel MPI; early AM Barrine measurements.
b07b27 b07b07 b07a29 b07a25 b07a21 b07a14 b07a08 b06b32 b06b29 b06b15 b06b08 b06b07 b06a15 b06a14 b06a13 b03b14 b03b01 b03a16 b02b32 b02b31 b02b26 b02b15 b02b13 b02b01 b02a34 b02a25 b02a01 b01b11 b01b05 b01b03 b01a33 b01a30 b 1 0 a 0 1 5 0 b 1 0 a 0 9 b 1 0 b 1 2 b 0 1 a 1 7 b 0 1 a 2 3 5 b 0 1 a 2 5 b 0 1 a 2 6 b 0 1 b 0 3 bb 0 0 1 1 bb 0 1 5 3 10 b 0 1 b 2 5 b 0 1 b 3 7 b 0 2 a 0 1 bb 0 0 2 2 a a 0 2 2 9 15 b 0 2 b 2 5 b 0 2 b 3 1 b 0 2 b 3 2 bb 0 0 3 3 a a 0 1 5 4 20 b 0 3 a 1 8 b 0 3 a 2 3 b 0 3 a 3 3 bb 0 0 6 6 a a 2 2 2 9 25 b 0 6 b 0 1 b 0 6 b 2 1 b 0 6 b 2 3 bb 0 0 6 7 bb 3 0 0 3 30 b 0 7 b 2 0 b 0 7 b 2 2

30

2500

25

20

2000

Ipartner node

15

1500

10

1000

500

Jpartner node
Figure 33: Inniband throughput with Intel MPI; midday Barrine measurements.
b10b11 b10a10 b10a07 b10a06 b07b09 b07b02 b07a23 b06b26 b06b25 b06a04 b03b28 b03b15 b03b04 b03a36 b03a32 b03a01 b02b29 b02b15 b02b12 b02a36 b02a35 b02a32 b02a23 b02a15 b01b27 b01b23 b01b18 b01b13 b01b12 b01b05 b01a33 b01a29 b 0 1 a 0 5 5 0 b 0 1 a 2 1 b 0 1 a 2 8 b 0 1 b 0 6 b 0 1 b 1 5 5 b 0 1 b 3 7 b 0 2 a 2 4 b 0 2 a 2 6 bb 0 0 2 2 a a 3 3 2 4 10 b 0 2 b 0 5 b 0 2 b 0 9 b 0 2 b 2 1 bb 0 0 2 3 ba 2 0 4 3 15 b 0 3 a 1 6 b 0 3 a 2 6 b 0 3 b 1 6 bb 0 0 3 3 bb 1 2 7 5 20 b 0 3 b 2 9 b 0 3 b 3 1 b 0 3 b 3 5 bb 0 0 6 6 a b 2 3 6 2 25 b 0 7 a 0 6 b 0 7 a 0 8 b 0 7 b 1 8 bb 0 1 7 0 ba 2 0 3 9 30 b 1 0 a 1 7 b 1 0 b 1 3

30

2500

25

20

2000

Ipartner node

15

1500

10

1000

500

Jpartner node

Figure 34: Each curve shows Infiniband throughput (MB/s) for an I-partner in Figure 32, plotted against the J-partner nodes.

Figure 35: Each curve shows Infiniband throughput (MB/s) for a J-partner in Figure 32, plotted against the I-partner nodes.

4.2 Local file systems

5 Parallel Debuggers

5.1 Totalview Debugger

The TotalView User's Guide provides detailed instructions for running TotalView for this section. The first example, section 5.2, uses a program that causes processes to hang on communication that will not occur; TotalView can be applied with breakpoints to track the processes. The second example, section 5.2.1, uses a program whose processes also hang on missing communication, but with the missing processes in an indefinite loop. The third example, section 5.2.2, incorporates print statements for debugging. It is also useful to try the Linux command gstack to get at the same information. Makefiles are shown with compilation flags for Intel MPI and with the compilation flags set for debugging.

5.2 Example 1. Process skips baton pass

The program batonpskips.c (see figure 36) is written so that the MPI process with rank=3 skips the baton passing. Suppose nProc=8; then processes 0-2 and 4-7 will hang because they are waiting on process rank=3.

Figure 36: batonpskips.c

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <uparms.h>
#include <misc.h>

#define ROGUE 3

main(int argc, char **argv)
{
    int myProc, nProc;
    int localbaton;
    int pass_baton(int myProc, int nProc, MPI_Comm comm);

    initializeMPIprocesses(argc, argv, &myProc, &nProc);

    if (myProc != ROGUE) {
        localbaton = pass_baton(myProc, nProc, MPI_COMM_WORLD);
        check_allbatons(myProc, nProc, MPI_COMM_WORLD, localbaton);
    }
    cleanup();
}

cleanup()
{
    int status = MPI_Finalize();
    exit(status);
}

/* baton pass in ring from 0, 1, ... nProc-1 */
int pass_baton(int myProc, int nProc, MPI_Comm comm)
{
    MPI_Status status;
    int msgtag  = 0;
    int toProc  = (myProc + 1) % nProc;
    int frmProc = (myProc - 1 + nProc) % nProc;
    int ibaton  = 0;

    if (myProc > 0)
        checkMPIerror(MPI_Recv(&ibaton, 1, MPI_INT, frmProc, msgtag, comm, &status));

    ibaton += myProc;

    if (myProc < nProc-1)
        checkMPIerror(MPI_Send(&ibaton, 1, MPI_INT, toProc, msgtag, comm));

    return ibaton;
}

Figure 37: Intel MPI with icc: Makefile (top) and shell script to run (bottom). Note the flags set for debugging, -O0 -g.

Makefile:

PROG    = batonpskips
CC      = mpicc
PREFIX  = ..
INCDIR  = $(PREFIX)/include
LIBDIR  = $(PREFIX)/lib
LDFLAGS = -L$(LIBDIR)
LIBS    = -lu -lmisc
CFLAGS  = -O0 -g -I$(INCDIR)

$(PROG): $(PROG).o libu.a
	$(CC) $(LDFLAGS) -o $@ $@.o -lm $(LIBS)

libu.a:;	cd $(LIBDIR); make -f $(LIBDIR)/makefile

clean:
	$(RM) core *.o *.trace
realclean:
	$(RM) core *.o *.trace busywait

.c.o:;	$(CC) $(CFLAGS) -c $*.c

Shell script:

#!/bin/bash
module load intel-mpi/3.2.2.006
mpirun -n 4 ./batonpskips

5.2.1 Example 2. MPI process busy waiting

Figure 38: busywait.c

#include <stdio.h>
#include <mpi.h>
#include <uparms.h>

main(int argc, char **argv)
{
    int myProc, nProc;
    int takebranch;

    initializeMPIprocesses(argc, argv, &myProc, &nProc);

    takebranch = (nProc % 2 == 0) ? 0 : 1;
    if (takebranch == 0) {
        send_and_recv(myProc, nProc);
    } else {
        loopawhile();
    }

    MPI_Finalize();
    return 0;
}

send_and_recv(int myProc, int nProc)
{
    int error, value;

    if (myProc == 0) {
        MPI_Status status;
        error = MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        checkMPIerror(error);
    } else if (1 == 2) {
        value = 9;
        error = MPI_Send(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
        checkMPIerror(error);
    }
}

loopawhile()
{
    int i;
    while (1)
        i = 0;
}

5.2.2 Example 3. Process busy waiting with print statements

Figure 39: busywaitprint.c; includes have been removed for space.

#define TRACE(x) x

main(int argc, char **argv)
{
    int myProc, nProc;
    int takebranch;

    TRACE(procprint(0,"start busywaitprt()");)
    initializeMPIprocesses(argc, argv, &myProc, &nProc);
    TRACE(procprint(myProc,"initialized()");)

    takebranch = (nProc % 2 == 0) ? 0 : 1;
    if (takebranch == 0) {
        send_and_recv(myProc, nProc);
    } else {
        loopawhile(myProc);
    }

    MPI_Finalize();
    return 0;
}

void send_and_recv(int myProc, int nProc)
{
    int error, value;

    TRACE(procprint(myProc,"enter send_and_recv()");)
    if (myProc == 0) {
        MPI_Status status;
        error = MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        checkMPIerror(error);
    } else if (1 == 2) {
        value = 9;
        error = MPI_Send(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
        checkMPIerror(error);
    }
    TRACE(procprint(myProc,"leave send_and_recv()");)
}

void loopawhile(int myProc)
{
    int i;
    TRACE(procprint(myProc,"enter loopawhile()");)
    while (1)
        i = 0;
    TRACE(procprint(myProc,"leave loopawhile()");)
}

Appendix A  PBS Environment Variables

PBS environment variables are defined from the login shell and from PBS-specific variables. Here is a partial list of the PBS-specific variables.

NCPUS: Number of threads, defaulting to the number of CPUs, on a compute node.
OMP_NUM_THREADS: Same as NCPUS.
PBS_ARRAY_ID: Array identifier of a subjob in a job array.
PBS_ARRAY_INDEX: Index number of a subjob in a job array.
PBS_CONF_PATH: Path to pbs.conf.
PBS_CPUSET_DEDICATED: Set by mpiexec to assert exclusive use of resources in the assigned cpuset.
PBS_ENVIRONMENT: When set, indicates a PBS job calling mpiexec. Job types are PBS_BATCH or PBS_INTERACTIVE.
PBS_JOBID: The job identifier assigned to the job or job array by the batch system.
PBS_JOBNAME: The job name supplied by the user.
PBS_NODEFILE: The filename containing the list of nodes assigned to the job.
PBS_O_HOME: Value of HOME from the submission environment.
PBS_O_HOST: The host name on which the qsub command was executed.
PBS_O_LOGNAME: Value of the user's login name in the submission environment.
PBS_O_MAIL: Value of MAIL from the submission environment.
PBS_O_PATH: Value of PATH from the submission environment.
PBS_O_QUEUE: The original queue name to which the job was submitted.
PBS_O_SHELL: Value of SHELL from the submission environment.
PBS_O_SYSTEM: The operating system name where qsub was executed.
PBS_O_WORKDIR: The absolute path of the directory where qsub was executed.
PBS_QUEUE: The name of the queue that executed the job.
TMPDIR: The job-specific temporary directory for the job.

B  MPI Point-to-Point Communication Functions

B.1 Blocking Send and Receive

MPI_SEND(buf, count, datatype, dest, tag, comm) sends in standard mode.
MPI_BSEND(buf, count, datatype, dest, tag, comm) sends in buffered mode.
MPI_SSEND(buf, count, datatype, dest, tag, comm) sends in synchronous mode.
MPI_RSEND(buf, count, datatype, dest, tag, comm) sends in ready mode.
MPI_RECV(buf, count, datatype, source, tag, comm, status) starts a standard-mode receive.

B.2 Non-blocking Send and Receive

MPI_ISEND(buf, count, datatype, dest, tag, comm, request) starts a standard-mode, nonblocking send.
MPI_IBSEND(buf, count, datatype, dest, tag, comm, request) starts a buffered-mode, nonblocking send.
MPI_ISSEND(buf, count, datatype, dest, tag, comm, request) starts a synchronous-mode, nonblocking send.
MPI_IRSEND(buf, count, datatype, dest, tag, comm, request) starts a ready-mode, nonblocking send.
MPI_IRECV(buf, count, datatype, source, tag, comm, request) starts a nonblocking receive.

A short usage sketch of the nonblocking calls follows.
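A minimal sketch (not from the workshop code) of the usual nonblocking pattern: post MPI_Irecv and MPI_Isend, optionally overlap other work, then complete both with MPI_Wait. The function and its partner argument are hypothetical names used only for illustration.

#include <mpi.h>

/* Exchange one integer with 'partner' using nonblocking calls. */
void exchange(int myvalue, int *theirvalue, int partner, MPI_Comm comm)
{
    MPI_Request sendreq, recvreq;
    MPI_Status  status;

    MPI_Irecv(theirvalue, 1, MPI_INT, partner, 0, comm, &recvreq);
    MPI_Isend(&myvalue,   1, MPI_INT, partner, 0, comm, &sendreq);

    /* ... unrelated computation could overlap the communication here ... */

    MPI_Wait(&recvreq, &status);   /* completion guarantees the message has arrived */
    MPI_Wait(&sendreq, &status);   /* completion guarantees myvalue may be reused */
}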

B.3 Combined Send and Receive

MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status) executes a blocking send and a blocking receive operation using the same communicator, but possibly different tags.

MPI_SENDRECV_REPLACE(buf, count, datatype, dest, sendtag, source, recvtag, comm, status) executes a blocking send and receive. The same buffer is used both for the send and for the receive, so that the message sent is replaced by the message received.

B.4 Operations on Messages and Queues

MPI_WAIT(request, status) returns when the operation identified by request is complete (it blocks).

MPI_TEST(request, flag, status) returns flag=true if the operation identified by request is complete; otherwise the call returns flag=false (non-blocking).

MPI_IPROBE(source, tag, comm, flag, status) returns flag=true if the queue contains a receivable message matching the pattern specified by the arguments source, tag, and comm.

MPI_PROBE(source, tag, comm, status) behaves like MPI_IPROBE except that it is a blocking call, returning only after a matching message has been found.

MPI_CANCEL(request) marks for cancellation a pending, nonblocking communication operation (send or receive). The cancel call is local; it returns immediately.

A probe-and-receive sketch follows.
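A hedged sketch (not from the workshop code) of the usual probe-then-receive idiom, sizing the receive buffer from the probed status. MPI_Get_count is a standard MPI call not listed above, and the helper name is hypothetical.

#include <stdlib.h>
#include <mpi.h>

/* Poll for a pending message from any source and receive it into a freshly
 * allocated buffer. Returns the element count, or 0 if nothing was queued. */
int receive_if_pending(int **data, MPI_Comm comm)
{
    int flag = 0, count = 0;
    MPI_Status status;

    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (!flag)
        return 0;                               /* no matching message queued */

    MPI_Get_count(&status, MPI_INT, &count);    /* size of the matched message */
    *data = (int *)malloc(sizeof(int) * count);
    MPI_Recv(*data, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
             comm, &status);
    return count;
}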

B.5 Definitions

blocking send: does not return until the message data have been copied, either to a send buffer or to receiver memory; i.e., return implies that the user send buffer can be safely modified.

buffered send: copies the message from the sender buffer into a system buffer, thereby decoupling the send and receive operations.

standard communication mode: buffering is implementation dependent.

B.6 Send modes with the MPI API

Blocking Send mode: MPI_SEND(buf, count, datatype, dest, tag, comm)
  start: depends on whether the implementation buffers; might require the matching receive.
  complete requires: data copied from the send buffer.
  complete implies: send buffer available.
  locality: non-local, since it might depend on the matching receive.

Non-blocking Send mode: MPI_ISEND(buf, count, datatype, dest, tag, comm, request)
  start: depends on whether the implementation buffers; might require the matching receive.
  complete requires: nothing, returns immediately.
  complete implies: nothing.
  locality: local.

Buffered Send mode: MPI_BSEND(buf, count, datatype, dest, tag, comm)
  start: anytime, does not require a matching receive.
  complete requires: data copied from the send buffer.
  complete implies: send buffer available.
  locality: local.

Synchronous Send mode: MPI_SSEND(buf, count, datatype, dest, tag, comm)
  start: anytime, does not require a matching receive.
  complete requires: matching receive posted and receiving data.
  complete implies: send buffer available; sender-receiver rendezvous.
  locality: non-local.

Ready Send mode: MPI_RSEND(buf, count, datatype, dest, tag, comm)
  start: only when the matching receive is posted.
  complete requires: data copied from the send buffer.
  complete implies: send buffer available.
  locality: non-local.

Blocking Receive operation: MPI_RECV(buf, count, datatype, source, tag, comm, status)
  Matches any of the send modes; blocking receive with standard-mode semantics. A receive can complete before the matching send has completed (of course, it can complete only after the matching send has started).
  complete requires: receive buffer contains the new message.
  complete implies: matching send started (maybe not completed); message available.
  locality: non-local.

Non-blocking Receive operation: MPI_IRECV(buf, count, datatype, source, tag, comm, request)
  Matches any of the send modes. A receive can complete before the matching send has completed (of course, it can complete only after the matching send has started).
  complete requires: nothing, returns immediately.
  complete implies: nothing; need to probe using the MPI request object.
  locality: local.

A buffered-send usage sketch follows.
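A hedged sketch of buffered mode in practice (not from the workshop code): the user attaches a buffer, sends with MPI_Bsend, and detaches the buffer again. MPI_Buffer_attach, MPI_Buffer_detach, and MPI_BSEND_OVERHEAD are standard MPI calls and constants not listed above; the helper name is hypothetical.

#include <stdlib.h>
#include <mpi.h>

/* Buffered-mode send of 'count' integers to 'dest': the message is copied
 * into the attached buffer, so MPI_Bsend can return without waiting for
 * the matching receive. */
void bsend_ints(const int *data, int count, int dest, MPI_Comm comm)
{
    int size  = count * sizeof(int) + MPI_BSEND_OVERHEAD;
    void *buf = malloc(size);

    MPI_Buffer_attach(buf, size);          /* hand MPI the system buffer */
    MPI_Bsend((void *)data, count, MPI_INT, dest, 0, comm);
    MPI_Buffer_detach(&buf, &size);        /* blocks until buffered data are sent */
    free(buf);
}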

B.7 Semantics of Point-to-Point Communication

1. Order: messages are non-overtaking (determinism). If a sender sends two messages in succession that match the same receive, the receive cannot take the second message before the first. If a receiver posts two receives in succession that match the same message, the second receive cannot be satisfied by this message before the first.

2. Progress: given an initiated matching send and receive, at least one will complete. For non-blocking sends and receives, the guarantee of completion moves to the MPI_WAIT calls, one each for the non-blocking send and the non-blocking receive.

3. Fairness: MPI makes no guarantee of fairness. For example, given a posted send, it is possible for a receiver posting matching receives to never receive the message, because each time the given send is overtaken by another message.

4. Resource limitations: any pending communication operation consumes limited system resources such as buffers. Errors may occur when a lack of resources prevents execution of a communication attempt. Non-blocking communication reduces the buffering requirements for progress to occur.

B.8 References

MPI Standard references: MPI: The Complete Reference [SOHL+95], with MPI version 2.

C  Code Listings

C.1 Ping-pong Codes

The full listing of the ping-pong code from Section 4.1.

pingpong.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <uparms.h>
#include <misc.h>

#define TRACE(x)
#define NMLEN 128

int myproc, nproc;
char thishost[NMLEN];

int main(int argc, char **argv)
{
    int niter, msgsz;
    double clockticks;

    (void) getcmdline(argc, argv, &niter, &msgsz);
    (void) gethost(thishost, NMLEN);
    (void) initmpi(argc, argv);
    (void) echoparms(niter, msgsz);
    (void) pingpong(niter, msgsz, &clockticks);
    (void) summarise(niter, msgsz, clockticks);
    (void) zshutdown(0);
    return(0);
}

int summarise(int niter, int msgsz, double ticks)
{
    int iproc;
    double ntotdat    = (double)msgsz * (double)niter * sizeof(int);
    double throughput = (ntotdat / ticks) / (1024.0 * 1024.0);
    double latency    = ((ticks / 2.0) / niter) * 1000000.0;

    for (iproc = 0; iproc < nproc; iproc++) {
        if (myproc == iproc && myproc == 0) {
            fprintf(stdout, "p=%d\ttime=%fsec\tthru=%fMB\tlate=%fusec\n",
                    myproc, ticks, throughput, latency);
            fflush(stdout);
        }
        /* MPI_Barrier(MPI_COMM_WORLD); */
    }
    return(0);
}

int pingpong(int niter, int msgsz, double *clockticksPtr)
{
    MPI_Status status;
    double clockticks = 0.0;
    int iter = 0;
    int *outcargo = (int *)malloc(sizeof(int) * msgsz);
    int *incargo  = (int *)malloc(sizeof(int) * msgsz);

    (void) define_cargo(msgsz, outcargo);

    do {
        if (myproc == 0) {
            clockticks -= MPI_Wtime();
            MPI_Send(outcargo, 1, MPI_INT, 1, 999, MPI_COMM_WORLD);
            MPI_Recv(incargo, msgsz, MPI_INT, 1, 888, MPI_COMM_WORLD, &status);
            clockticks += MPI_Wtime();
        } else if (myproc == 1) {
            clockticks -= MPI_Wtime();
            MPI_Recv(incargo, 1, MPI_INT, 0, 999, MPI_COMM_WORLD, &status);
            MPI_Send(outcargo, msgsz, MPI_INT, 0, 888, MPI_COMM_WORLD);
            clockticks += MPI_Wtime();
        }
    } while (++iter < niter);

    *clockticksPtr = clockticks;
    return(0);
}

int define_cargo(int msgsz, int *cargo)
{
    int ic;
    for (ic = 0; ic < msgsz; ic++) {
        *(cargo + ic) = hide(ic);
    }
    return(0);
}

int echoparms(int niter, int msgsz)
{
    int iproc;
    for (iproc = 0; iproc < nproc; iproc++) {
        if (myproc == iproc) {
            fprintf(stdout, "xp=%d\thost=%s\tniter=%d\tmsgsz=%d\n",
                    myproc, thishost, niter, msgsz);
            fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
    return(0);
}

int gethost(char *string, int stringlength)
{
    getHostname(string, stringlength);
    TRACE(printf("thishost=%s\n", thishost);)
    TRACE(fflush(stdout);)
    return(0);
}

int getcmdline(int argc, char **argv, int *niter, int *msgsz)
{
    if (argc != 3) (void) usage_and_exit(1);
    sscanf(argv[1], "%d", niter);
    sscanf(argv[2], "%d", msgsz);
    TRACE(printf("cmdline niter=%d\tmsgsz=%d\n", *niter, *msgsz);)
    TRACE(fflush(stdout);)
    return(0);
}

int zshutdown(int status)
{
    int flag;
    MPI_Initialized(&flag);   /* only MPI routine callable before MPI_Init */
    if (flag) {
        if (status != 0) printf("p%d\t MPI shutdown\n", myproc);
        MPI_Finalize();
    }
    if (status != 0) printf("p%d\t exit status=%d\n", myproc, status);
    exit(status);
}

int initmpi(int argc, char **argv)
{
    initializeMPIprocesses(argc, argv, &myproc, &nproc);
    if (nproc > 2) (void) usage_and_exit(1);
    return(0);
}

int usage_and_exit(int status)
{
    fprintf(stderr, "usage: [mpirun -np 2] pingpong <niter> <msgsz>\n");
    fflush(stderr);
    (void) zshutdown(status);
    exit(status);
}

References

[SGI12]    SGI. Message Passing Toolkit (MPT) User Guide, 2012. Reference for Barrine MPT.

[SOHL+95]  Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The Complete Reference. 1995. ISBN 0-262-69184-1.

Index

Amdahl's law, 17
baton.c, 14
batonpskips.c, 29
busywait.c, 31
busywaitprint.c, 32
cluster, see NOW
code listings, 39
collective communication: global sum, 6
efficiency, 17
global sum, 6
Makefile: debug flags, 30; Intel MPI and gcc, 13; Intel MPI and icc, 13; MPICH and gcc, 14; MPICH2 and gcc, 14; MPT and gcc, 14; MPT and icc, 14; OpenMPI and gcc, 14; OpenMPI and icc, 14
multiprocessors, 1
network throughput, 22, 23, 25, 26
node code, 35
nodes, 1
NOW, 1
PBS environment variables, 33
performance: diminishing returns, 20
ping-pong code, 22, 23, 25, 26
PMS notation, 1
point-to-point MPI functions, 35; MPI_Sendrecv(), 35
shared-memory multiprocessor, 1
speedup, 17
TotalView example, 29, 31, 32
WRF, 21
