Date: Friday, 8 June 2012
Time: 1:00 - 4:50 PM
Location: Sir James Foot Building (Bldg. 47A), Room 241
Instructor: Terry Clark, Ph.D. (t.clark3@uq.edu.au)
Organisation: UQ Research Computing Centre
In this workshop we introduce parallel computing with the Message Passing Interface (MPI), a standardized message-passing system designed for writing parallel programs in sequential languages such as C and Fortran. Since its release in 1994, and now with several reliable implementations, MPI continues as a principal parallel programming model. Central to its success are its portability and the performance achievable for a significant share of applications in the technical computing community.

MPI uses a model wherein each process in a parallel execution runs exactly the same program, with each process operating on different parts of the calculation. The logic required to coordinate multiple processes through the program adds significant complexity, and the programming task is intensified by MPI's low-level interprocess communication primitives. Consequently, MPI program development tends to be tedious and error prone.

Parallel applications need to operate efficiently over a practical range of inputs and computing platforms. Within this operational range, performance varies, often widely, as a function of input, number of processors, types of resources, and MPI runtime parameters. It follows that to achieve good performance, users of parallel applications need to make informed choices when selecting and configuring resources. The supporting information usually involves measurements made to assess the application on target computer systems with suitable data.

This workshop covers concepts and methods pertaining to the issues described above. The topics include MPI program development, parallel program debugging and profiling, and running parallel applications. The aim is a comprehensive introduction, necessarily of limited depth but without major gaps, that will enable attendees to pursue topics specific to their research needs.
Site-specific content for the UQ HPC cluster, Barrine, is a critical adjunct that provides attendees at all levels of expertise with a useful digest of protocols, tools, and systems. The programs discussed in this workshop, along with additional material, can be retrieved from the web site and software repository at http://hpc-curlew.hpcu.uq.edu.au/.
Contents

1 Introduction
  1.1 Types of Parallel Systems
  1.2 Simple Sum
      1.2.1 Node codes to define simple-sum data
      1.2.2 Node codes to compute partial sum
      1.2.3 Node codes for simple sum
      1.2.4 Global sum approach
  1.3 SPMD model for Simple Sum
  1.4 Simple Sum using Basic MPI
2 MPI Standard and Implementations
  2.1 Message Passing Interface Standard
  2.2 MPI Implementations on Barrine
      2.2.1 Basic Information
      2.2.2 Features affecting program deployment
  2.3 Compilers on Barrine
  2.4 Makefiles and PBS scripts for Barrine
      2.4.1 Intel MPI with gcc and icc
      2.4.2 MPICH2 with gcc and icc
      2.4.3 SGI MPT with gcc and icc
      2.4.4 Open MPI with gcc and icc
3 Parallel Program Performance
  3.1 Performance measures
  3.2 Amdahl's Law assessment of speedup
  3.3 Performance examples
      3.3.1 Effect of serial part on Speedup and Efficiency
      3.3.2 Effect of communication on Speedup and Efficiency
      3.3.3 Effect of serial part and communication on Speedup and Efficiency
      3.3.4 Parallel performance with diminishing returns
  3.4 Barrine performance example
4 Barrine Network and I/O
  4.1 Infiniband and Gigabit Ethernet Networks
  4.2 Local file systems
5 Parallel Debuggers
  5.1 Totalview Debugger
  5.2 Example 1. Process skips baton pass
      5.2.1 Example 2. MPI process busy waiting
      5.2.2 Example 3. Process busy waiting with print statements
A PBS Environment Variables
B MPI Point-to-Point Communication Functions
  B.1 Blocking Send and Receive
  B.2 Non-blocking Send and Receive
  B.3 Combined Send and Receive
  B.4 Operations on Messages and Queues
  B.5 Definitions
  B.6 Send modes with the MPI API
  B.7 Semantics of Point-to-Point Communication
  B.8 References
1 Introduction

1.1 Types of Parallel Systems
PMS Notation. We use PMS notation to describe the key components of a computer system, where P is a processor, M is a memory, and S is a switch, or network.
Figure 1: Shared-memory architectures: (a) the symmetric multiprocessor (SMP), or "dancehall", approach places all memory uniformly far from the processors, limiting the scalability of the approach. (b) NUMA distributes the memory so that the latency to access local memory is fixed and independent of the number of processors.
[Figure 2 diagrams: (a) NOW based on SMP nodes; (b) NOW based on NUMA nodes — each node's processors and memories joined by a switch.]

Figure 2: Networks of workstations, also called NOWs or clusters, are message-passing architectures using complete computers as building blocks (the nodes). The high-level diagram for a NOW is the same as the multiprocessor NUMA illustrated in figure 1(b).
2 The main difference between figure 1(b) multiprocessors and figure 2(b) message-passing systems is in how non-local memory is accessed: multiprocessor NUMA integrates communication into the memory system, whereas message-passing architectures perform explicit I/O operations.
1.2 Simple Sum

\[
\sum_{i=1}^{n} a_i = a_1 + a_2 + a_3 + \dots + a_{n-1} + a_n \tag{1}
\]

\[
S = (a_{01} + a_{02} + a_{03} + a_{04}) + (a_{05} + a_{06} + a_{07} + a_{08}) + (a_{09} + a_{10} + a_{11} + a_{12}) + (a_{13} + a_{14} + a_{15} + a_{16}) \tag{2}
\]
The Goal

[Figure 3 diagram: timeline comparing the sequential addition of 16 terms with the 4-way parallel addition.]

Figure 3: The time for the 4-way parallel addition, shown above the x-axis, is 1/4 of the sequential time.
A Detail

\[
S = S_0 + S_1 + S_2 + S_3 \tag{3}
\]
1.2.1 Node codes to define simple-sum data

[Figures 4 and 5: pseudo-code at processes p0, p1, p2, p3 that defines the simple-sum data, and the resulting memory state at each process — e.g. A = {1, 2, 3, 4} at P0 and A = {9, 10, 11, 12} at P1, with R and S still 0.]
1.2.2 Node codes to compute partial sum

    1  int A[4], R
    2  int nP = 4
    3  int thisProc = 0
    4
    5  do i = 1, 4
    6      A[i] = (thisProc * 4) + i
    7  enddo
    8
    9  do i = 1, 4
    10     R = R + A[i]
    11 enddo

Figure 6: Pseudo-code with partial sum (lines 9-11) at processes p0, p1, p2, p3 for the simple sum; the four listings differ only in the value of thisProc (0, 1, 2, 3 respectively).
Figure 7: Memory state after the partial sum: at p0, p1, p2, p3 the array A holds {1..4}, {5..8}, {9..12}, {13..16} and R = 10, 26, 42, 58 respectively; S is still 0.
1.2.3 Node codes for simple sum

    1  int A[4], R, S
    2  int nP = 4
    3  int thisProc = 0
    4  do i = 1, 4
    5      A[i] = (thisProc * 4) + i
    6  enddo
    7
    8  do i = 1, 4
    9      R = R + A[i]
    10 enddo
    11 S = +{R}

Figure 8: Pseudo-code at processes p0, p1, p2, p3 for the simple sum; S = +{R} denotes a global sum over the partial sums R, and the four listings differ only in the value of thisProc (0, 1, 2, 3 respectively).
Figure 9: Memory state after the global sum: R remains 10, 26, 42, 58 at p0, p1, p2, p3, and every process now holds S = 136 (= 1 + 2 + ... + 16).
1.2.4 Global sum approach
Figure 10: Initial state of the processes after the partial sum (see figure 7): P0, P1, P2, P3 hold R0, R1, R2, R3 respectively.

Figure 11: Global sum, phase 1: exchange data from the initial state (figure 10) — P0 and P1 swap partial sums, as do P2 and P3, so each process holds a two-term sum.

Figure 12: Global sum, phase 2: exchange the summed data from phase 1 (figure 11) — P0 exchanges its two-term sum with P2, and P1 with P3.

Figure 13: Global sum, final state: every process holds the final sum S = R0 + R1 + R2 + R3, accumulated in a different order at each process (e.g. R1 + R0 + R3 + R2 at P1).
1.3 SPMD model for Simple Sum

    Program SPMDSum
        int i, A[4], R, S;
        int (nP, thisProc) = initProcs()
        do i = 1, 4
            A[i] = (thisProc * 4) + i
        enddo
        do i = 1, 4
            R = R + A[i]
        enddo
        S = globalSum(R)
    end
Figure 14: The SPMD build and run cycle: spmdsum.c (source code) is compiled to spmdsum.o (object code) and linked into the executable spmdsum; one copy of the executable then runs as process Pi with memory Mi on each assigned node (e.g. b01a22, b01a23, b02b29).
1.4 Simple Sum using Basic MPI

Figure 15: The simple sum in C with basic MPI (spmdsum.c).

    #include "mympinc.h"
    #define N 4

    int main(int argc, char **argv)
    {
        int A[N], R, S, i;
        int nP, thisProc;

        initProcesses(argc, argv, &thisProc, &nP);
        for (i = 1; i <= N; i++) {
            A[i-1] = (thisProc * 4) + i;
        }
        for (R = 0, i = 0; i < N; i++) {
            R = R + A[i];
        }
        S = globalSum(R, thisProc);
        MPI_Finalize();
        exit(0);
    }

Figure 16: The support routines initProcesses and globalSum.

    #include <stdio.h>
    #include <mpi.h>

    void initProcesses(int argc, char **argv, int *myproc, int *nproc)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, myproc);
        MPI_Comm_size(MPI_COMM_WORLD, nproc);
        printf("myProc=%d\tnProc=%d\t\tprogram=%s\n", *myproc, *nproc, argv[0]);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    int globalSum(int R, int myProc)
    {
        int S;
        int status = MPI_Allreduce(&R, &S, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("p%d\tR=%d\tS=%d\n", myProc, R, S);
        MPI_Barrier(MPI_COMM_WORLD);
        return (S);
    }
Figure 17: Makefile for the simple sum program.

    CC      = mpiicc
    AR      = ar
    ARFLAGS = r
    CFLAGS  = -O1
    LIBDIR  = .
    LDFLAGS = -L$(LIBDIR)
    LIBS    = -lmympi

    program   = spmdsum
    mylibrary = libmympi.a

    $(program): $(program).o $(mylibrary)
            $(CC) $(LDFLAGS) -o $@ $< $(LIBS)

    $(mylibrary): mympi.o
            $(AR) $(ARFLAGS) $@ $<

    clean:; $(RM) core *.o *.trace *.a
    realclean:; $(RM) core *.o* *.e* *.trace $(program) *.a

    .c.o:; $(CC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=1:mpiprocs=1:NodeType=large
    #PBS -l walltime=00:00:30
    #PBS -N spmdmsg
    #PBS -q workq

    cd ${PBS_O_WORKDIR}
    module load intel-mpi/3.2.2.006
    mpirun -np 4 spmdsum

Figure 18: PBS script for the simple sum program of figure 15.
Six functions:

    MPI_Init         Initiate an MPI computation.
    MPI_Finalize     Terminate an MPI computation.
    MPI_Comm_size    Determine the number of MPI processes.
    MPI_Comm_rank    Determine the MPI identifier (rank) of the calling process.
    MPI_Send         Send a message to an MPI process.
    MPI_Recv         Receive a message from an MPI process.
2 MPI Standard and Implementations

2.1 Message Passing Interface Standard

The definitive reference for MPI is the MPI Forum Web site.

2.2 MPI Implementations on Barrine
2.2.1 Basic Information

Installed MPI information:

    MPI        version   module files                                   documentation
    Intel MPI  MPI-2.2   intel-mpi/3.2.2.006, intel-mpi/4.0.0.027,      reference manual and introduction
                         intel-mpi/4.0.1.007                            guide for each version
    MPICH2               mpich2/1.4.1p1-intel, mpich-ch4-p4,            MPICH2 User's Guide; MPICH2 commands
                         mpich-ch4-p4mpd                                and routines; User's Guide to MPICH, v1.2.6
    Open MPI             OpenMPI/1.2.8, OpenMPI/1.4.3, OpenMPI/1.5.3    Open MPI v1.6 commands and routines
    SGI MPT    MPI-2.2   mpt/2.00, mpt/2.02                             [SGI12]
2.2.2 Features affecting application deployment (includes Barrine-specific)

    Intel MPI
        Infiniband:     yes (default)
        program build:  use the MPI compiler wrapper with the sequential-compiler option (see figure 20); mpiicc invokes icc
        program launch: mpirun, mpiexec
    MPICH2
        Infiniband:     3rd party, not on Barrine
        program build:  use the MPI compiler wrapper (mpicc) with the sequential-compiler option (see figure 22)
        program launch: mpirun
    SGI MPT
        Infiniband:     yes [SGI12]; use export MPI_USE_IB=1
        program build:  icc, gcc, etc. (no mpicc; link with -lmpi)
    Open MPI
        Infiniband:     check...
2.3 Compilers on Barrine

Compiler-related documentation:

    Intel    Intel Fortran Reference; C++ Compiler (http); Fortran Compiler (http)
    GNU      GNU Compiler Collection; GNU OpenMP Manual; GCC online documentation
    OpenMP   Developing Threaded Applications; Cluster OpenMP Manual
    Other    Intel 64 / IA-32 Optimization Manual
2.4 Makefiles and PBS scripts for Barrine

Shown here are Makefiles and PBS scripts for the program baton.c for each combination of compiler and MPI system, {GNU, Intel} x {Intel MPI, MPICH2, Open MPI, MPT}. Before invoking make on the Makefiles it is necessary to run module loads for the compiler and MPI types; these module loads are shown in each accompanying PBS script listing on lines 8-9. The source code listing for baton.c is shown below in figure 27.

2.4.1 Intel MPI with gcc and icc
    # GNU C and Intel MPI makefile
    CC      = gcc
    MPICC   = mpicc
    CFLAGS  = -O2
    LDFLAGS =
    LIBS    =
    program = baton

    $(program): $(program).o
            $(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

    .c.o:; $(MPICC) -cc=$(CC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
    #PBS -l walltime=00:01:00
    #PBS -q workq
    #PBS -N thebaton

    module load compiler/gcc-4.5.2
    module load intel-mpi/3.2.2.006
    cd $PBS_O_WORKDIR
    mpirun -np 8 baton

Figure 19: Intel MPI with gcc: (left) Makefile, (right) PBS script.
    # Intel C and Intel MPI makefile
    CC      = icc
    MPICC   = mpicc
    CFLAGS  = -O2
    LDFLAGS =
    LIBS    =
    program = baton

    $(program): $(program).o
            $(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

    .c.o:; $(MPICC) -cc=$(CC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
    #PBS -l walltime=00:01:00
    #PBS -q workq
    #PBS -N thebaton

    module load intel-cc-11/11.1.072
    module load intel-mpi/3.2.2.006
    cd $PBS_O_WORKDIR
    mpirun -np 8 baton

Figure 20: Intel MPI with the icc compiler: (left) Makefile, (right) PBS script.
2.4.2 MPICH2 with gcc and icc

    # GNU C and MPICH2 makefile
    CC      = gcc
    MPICC   = mpicc
    CFLAGS  = -O2
    LDFLAGS =
    LIBS    =
    program = baton

    $(program): $(program).o
            $(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

    .c.o:; $(MPICC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
    #PBS -l walltime=00:01:00
    #PBS -q workq
    #PBS -N thebaton

    module load compiler/gcc-4.5.2
    module load mpich2/1.4.1p1-intel
    cd $PBS_O_WORKDIR
    mpirun -np 8 baton

Figure 21: MPICH2 with gcc: (left) Makefile, (right) PBS script.
    # Intel C and MPICH2 makefile
    CC      = icc
    MPICC   = mpicc
    CFLAGS  = -O2
    LDFLAGS =
    LIBS    =
    program = baton

    $(program): $(program).o
            $(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

    .c.o:; $(MPICC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
    #PBS -l walltime=00:01:00
    #PBS -q workq
    #PBS -N thebaton

    module load intel-cc-11/11.1.072
    module load mpich2/1.4.1p1-intel
    cd $PBS_O_WORKDIR
    mpirun -np 8 baton

Figure 22: MPICH2 with the Intel icc compiler: (left) Makefile, (right) PBS script.
2.4.3 SGI MPT with gcc and icc

    # GNU C and MPT makefile
    CC      = gcc
    MPICC   = $(CC)
    CFLAGS  = -O2
    LDFLAGS =
    LIBS    = -lmpi
    program = baton

    $(program): $(program).o
            $(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

    .c.o:; $(MPICC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
    #PBS -l walltime=00:01:00
    #PBS -q workq
    #PBS -N thebaton

    module load mpt/2.00
    module load compiler/gcc-4.5.2
    cd $PBS_O_WORKDIR
    mpiexec_mpt -np 8 baton

Figure 23: SGI MPT with gcc: (left) Makefile, (right) PBS script.
    # Intel C and MPT makefile
    CC      = icc
    MPICC   = $(CC)
    CFLAGS  = -O2
    LDFLAGS =
    LIBS    = -lmpi
    program = baton

    $(program): $(program).o
            $(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

    .c.o:; $(MPICC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
    #PBS -l walltime=00:01:00
    #PBS -q workq
    #PBS -N thebaton

    module load intel-cc-11/11.1.072
    module load mpt/2.00
    cd $PBS_O_WORKDIR
    mpiexec_mpt -np 8 baton

Figure 24: SGI MPT with the Intel icc compiler: (left) Makefile, (right) PBS script.
2.4.4 Open MPI with gcc and icc

    # GNU C and Open MPI makefile
    CC      = gcc
    MPICC   = mpicc
    CFLAGS  = -O2
    LDFLAGS =
    LIBS    =
    program = baton

    $(program): $(program).o
            $(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

    .c.o:; $(MPICC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
    #PBS -l walltime=00:01:00
    #PBS -q workq
    #PBS -N thebaton

    module load compiler/gcc-4.5.2
    module load OpenMPI/1.5.3
    cd $PBS_O_WORKDIR
    mpirun -np 8 baton

Figure 25: Open MPI with gcc: (left) Makefile, (right) PBS script.
    # Intel C and Open MPI makefile
    CC      = icc
    MPICC   = mpicc
    CFLAGS  = -O2
    LDFLAGS =
    LIBS    =
    program = baton

    $(program): $(program).o
            $(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

    .c.o:; $(MPICC) $(CFLAGS) -c $*.c

    #!/bin/bash
    #PBS -A sf-Admin
    #PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
    #PBS -l walltime=00:01:00
    #PBS -q workq
    #PBS -N thebaton

    module load intel-cc-11/11.1.072
    module load OpenMPI/1.5.3
    cd $PBS_O_WORKDIR
    mpirun -np 8 baton

Figure 26: Open MPI with the Intel icc compiler: (left) Makefile, (right) PBS script.
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    MPI_Status status;
    int myProc, nProc;

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myProc);
        MPI_Comm_size(MPI_COMM_WORLD, &nProc);
        check_mybaton_value(pass_baton());
        exit(MPI_Finalize());
    }

    int pass_baton()
    {
        int toProc  = (myProc + 1) % nProc;
        int frmProc = (myProc - 1 + nProc) % nProc;
        int ibaton  = 0;

        if (myProc > 0)
            MPI_Recv(&ibaton, 1, MPI_INT, frmProc, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
        ibaton += myProc;
        if (myProc < nProc-1)
            MPI_Send(&ibaton, 1, MPI_INT, toProc, 7, MPI_COMM_WORLD);
        return ibaton;
    }

    check_mybaton_value(int mybaton)
    {
        int myanswer = myAnswer();
        wait_in_order();
        printf("p%d %s\n", myProc, ((mybaton - myanswer) ? "NOT OK" : "OK"));
        fflush(stdout);
        system("sleep 1");
        notifynext();
    }

    wait_in_order()
    {
        if (myProc < nProc-1)
            MPI_Recv(0, 0, MPI_INT, myProc+1, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
    }

    notifynext()
    {
        if (myProc > 0)
            MPI_Send(0, 0, MPI_INT, myProc-1, 3, MPI_COMM_WORLD);
    }

    int myAnswer()
    {
        int i, sum;
        for (i = 1, sum = 0; i <= myProc; i++)
            sum += i;
        return (sum);
    }

Figure 27: baton.c; a copy of this program is in each directory for section 2.4.
3 Parallel Program Performance

3.1 Performance measures

3.2 Amdahl's Law assessment of speedup

\[
T_P = f\,T_1 + \frac{(1-f)\,T_1}{P} \tag{6}
\]

\[
S_P = \frac{T_1}{T_P} = \frac{T_1}{f\,T_1 + \frac{(1-f)\,T_1}{P}} \tag{7}
\]

\[
S_P = \frac{1}{f + \frac{1-f}{P}} \tag{8}
\]
3.3 Performance examples

3.3.1 Effect of serial part on Speedup and Efficiency

\[
T_P = \frac{T_1}{P} + f\,T_1
\]

[Figure: log-log plot of compute time versus number of processes (nProcs = 1 to 64), showing the total time together with its parallel part and its fixed sequential part.]
3.3.2 Effect of communication on Speedup and Efficiency

\[
T_P = \frac{T_1}{P} + T_c
\]

    nProcs       1      2     4     8     16    32     64
    time         100    51    27    16    10    8      7
    speedup      1.0    2.0   3.8   6.4   9.8   12.3   13.2
    efficiency   100%   98%   93%   81%   61%   38%    21%

[Figure: log-log plot of compute time versus number of processes for this model.]
3.3.3 Effect of serial part and communication on Speedup and Efficiency

\[
T_P = \frac{T_1}{P} + f\,T_1 + T_c
\]

    nProcs   1     2     4     8     16    32    64
    time     110   61    37    26    20    18    17

[Figure: log-log plot of compute time versus number of processes for this model.]
3.3.4 Parallel performance with diminishing returns

\[
T_P = \frac{T_1}{P} + f\,T_1 + T_c
\]

    nProcs   time    speedup   efficiency
    1        110.0   1.0       100.0%
    2        61.0    1.8       90.2%
    4        37.0    3.0       74.3%
    8        25.5    4.3       54.0%
    16       20.3    5.4       34.0%
    32       18.1    6.1       19.0%
    64       17.6    6.3       10.0%
    128      17.8    6.2       4.8%
    256      18.4    6.0       2.3%
    512      19.2    5.7       1.1%
    1024     20.1    5.5       0.5%
    2048     20.0    5.2       0.3%

[Figure: log-log plot of compute time versus number of processes; the time reaches a minimum near nProcs = 64 and then rises.]
3.4 Barrine performance example

[Figures: WRF benchmark on Barrine — run time in minutes, speedup, and efficiency versus number of processes (nProcs = 1 to 64).]
4 Michael Hewson, of Professor Hamish McGowan's laboratory, provided the data for these WRF benchmarks. Michael applies WRF to climate modelling and orchestrates WRF and its associated programs on Barrine.
4 Barrine Network and I/O

4.1 Infiniband and Gigabit Ethernet Networks

[Figure: log-log plot of measured throughput (MB/s) on the Barrine networks.]
Figure 30: Ping-pong code to measure throughput. See Appendix C.1 for the complete listing.

    int myproc, nproc;

    int main(int argc, char **argv)
    {
        int niter, msgsz;
        double clockticks;

        (void) getcmdline(argc, argv, &niter, &msgsz);
        (void) gethost(thishost, NMLEN);
        (void) initmpi(argc, argv);
        (void) echoparms(niter, msgsz);
        (void) pingpong(niter, msgsz, &clockticks);
        if (myproc == 0)
            (void) summarise(niter, msgsz, clockticks);
        (void) zshutdown(0);
        return (0);
    }

    int pingpong(int niter, int msgsz, double *clockticksPtr)
    {
        MPI_Status s;
        double clockticks = 0.0;
        int iter = 0;
        int *outcargo = (int *) malloc(sizeof(int) * msgsz);
        int *incargo  = (int *) malloc(sizeof(int) * msgsz);

        (void) define_cargo(msgsz, outcargo);
        do {
            if (myproc == 0) {
                clockticks -= MPI_Wtime();
                MPI_Send(outcargo, 1, MPI_INT, 1, 999, MPI_COMM_WORLD);
                MPI_Recv(incargo, msgsz, MPI_INT, 1, 888, MPI_COMM_WORLD, &s);
                clockticks += MPI_Wtime();
            } else if (myproc == 1) {
                clockticks -= MPI_Wtime();
                MPI_Recv(incargo, 1, MPI_INT, 0, 999, MPI_COMM_WORLD, &s);
                MPI_Send(outcargo, msgsz, MPI_INT, 0, 888, MPI_COMM_WORLD);
                clockticks += MPI_Wtime();
            }
        } while (++iter < niter);
        *clockticksPtr = clockticks;
        return (0);
    }

    int summarise(int niter, int msgsz, double ticks)
    {
        double ntotdat    = (double) msgsz * (double) niter * sizeof(int);
        double throughput = (ntotdat / ticks) / (1024.0 * 1024.0);
        double latency    = ((ticks / 2.0) / niter) * 1000000.0;

        fprintf(stdout, "p=%d\ttime=%fsec\tthru=%fMB\tlate=%fusec\n",
                myproc, ticks, throughput, latency);
        fflush(stdout);
        return (0);
    }

    int define_cargo(int msgsz, int *cargo)
    {
        int ic;
        for (ic = 0; ic < msgsz; ic++) {
            *(cargo + ic) = hide(ic);
        }
        return (0);
    }

    int zshutdown(int status)
    {
        int flag;
        MPI_Initialized(&flag);   /* the only MPI function callable before MPI_Init() */
        if (flag) {
            if (status != 0) printf("p%d\t MPI shutdown\n", myproc);
            MPI_Finalize();
        }
        if (status != 0) printf("p%d\t exit status=%d\n", myproc, status);
        exit(status);
    }

    int initmpi(int argc, char **argv)
    {
        initializeMPIprocesses(argc, argv, &myproc, &nproc);
        if (nproc > 2)
            (void) usage_and_exit(1);
        return (0);
    }
Estimates of throughput for node pairs in two randomly selected node lists, one each for the x-axis and y-axis, with different lists for the two figures. Units are MB/second.

Figure 32: Infiniband throughput with Intel MPI; early-AM Barrine measurements. [Heatmap of throughput, roughly 500-2500 MB/s, for each (I-partner node, J-partner node) pair.]

Figure 33: Infiniband throughput with Intel MPI; midday Barrine measurements. [Heatmap of throughput, roughly 500-2500 MB/s, for each (I-partner node, J-partner node) pair.]
Figure 34: Each curve shows Infiniband throughput for an I-partner in Figure 32. [Curves of throughput (MB/s), roughly 500-3500, across the J-partner nodes.]

Figure 35: Each curve shows Infiniband throughput for a J-partner in Figure 32. [Curves of throughput (MB/s), roughly 500-3500, across the I-partner nodes.]
4.2 Local file systems
5 Parallel Debuggers

5.1 Totalview Debugger
The TotalView User's Guide provides detailed instructions for running TotalView for this section. The first example, section 5.2, uses a program that causes processes to hang on communication that will never occur; TotalView can be applied with breakpoints to track the processes. The second example, section 5.2.1, uses a program with processes hung on missing communication, but with the missing process in an indefinite loop. The third example, section 5.2.2, incorporates print statements for debugging. It is also useful to try the Linux command gstack to get at the same information. The Makefiles are shown with compilation flags for Intel MPI, set for debugging.
5.2 Example 1. Process skips baton pass
The program batonpskips.c (see figure 36) is written so that the MPI process with rank=3 skips the baton passing. Suppose nProc=8; then processes 0-2 and 4-7 hang because they are waiting on process rank=3.

Figure 36: batonpskips.c
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>
    #include <uparms.h>
    #include <misc.h>

    #define ROGUE 3

    main(int argc, char **argv)
    {
        int myProc, nProc;
        int localbaton;
        int pass_baton(int myProc, int nProc, MPI_Comm comm);

        initializeMPIprocesses(argc, argv, &myProc, &nProc);
        if (myProc != ROGUE) {
            localbaton = pass_baton(myProc, nProc, MPI_COMM_WORLD);
            check_allbatons(myProc, nProc, MPI_COMM_WORLD, localbaton);
        }
        cleanup();
    }

    cleanup()
    {
        int status = MPI_Finalize();
        exit(status);
    }

    /* baton pass in ring from 0, 1, ..., nProc-1 */
    int pass_baton(int myProc, int nProc, MPI_Comm comm)
    {
        MPI_Status status;
        int msgtag  = 0;
        int toProc  = (myProc + 1) % nProc;
        int frmProc = (myProc - 1 + nProc) % nProc;
        int ibaton  = 0;

        if (myProc > 0)
            checkMPIerror(MPI_Recv(&ibaton, 1, MPI_INT, frmProc, msgtag,
                                   comm, &status));
        ibaton += myProc;
        if (myProc < nProc-1)
            checkMPIerror(MPI_Send(&ibaton, 1, MPI_INT, toProc, msgtag, comm));
        return ibaton;
    }
Figure 37: Intel MPI with icc: (left) Makefile, (right) shell script to run. Note the flags set for debugging, -O0 -g.

    PROG    = batonpskips
    CC      = mpicc
    PREFIX  = ..
    INCDIR  = $(PREFIX)/include
    LIBDIR  = $(PREFIX)/lib
    LDFLAGS = -L$(LIBDIR)
    LIBS    = -lu -lmisc
    CFLAGS  = -O0 -g -I$(INCDIR)

    $(PROG): $(PROG).o libu.a
            $(CC) $(LDFLAGS) -o $@ $@.o -lm $(LIBS)

    libu.a:;

    clean:
            $(RM) core *.o *.trace
    realclean:
            $(RM) core *.o *.trace busywait

    .c.o:; $(CC) $(CFLAGS) -c $*.c
5.2.1 Example 2. MPI process busy waiting

    #include <stdio.h>
    #include <mpi.h>
    #include <uparms.h>

    main(int argc, char **argv)
    {
        int myProc, nProc;
        int takebranch;

        initializeMPIprocesses(argc, argv, &myProc, &nProc);
        takebranch = (nProc % 2 == 0) ? 0 : 1;
        if (takebranch == 0) {
            send_and_recv(myProc, nProc);
        } else {
            loopawhile();
        }
        MPI_Finalize();
        return 0;
    }

    send_and_recv(int myProc, int nProc)
    {
        int error, value;

        if (myProc == 0) {
            MPI_Status status;
            error = MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG,
                             MPI_COMM_WORLD, &status);
            checkMPIerror(error);
        } else if (1 == 2) {   /* never taken: the matching send never happens */
            value = 9;
            error = MPI_Send(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
            checkMPIerror(error);
        }
    }
5.2.2 Example 3. Process busy waiting with print statements

    #define TRACE(x) x

    main(int argc, char **argv)
    {
        int myProc, nProc;
        int takebranch;

        TRACE(procprint(0, "start busywaitprt()");)
        initializeMPIprocesses(argc, argv, &myProc, &nProc);
        TRACE(procprint(myProc, "initialized()");)
        takebranch = (nProc % 2 == 0) ? 0 : 1;
        if (takebranch == 0) {
            send_and_recv(myProc, nProc);
        } else {
            loopawhile(myProc);
        }
        MPI_Finalize();
        return 0;
    }

    void send_and_recv(int myProc, int nProc)
    {
        int error, value;

        TRACE(procprint(myProc, "enter send_and_recv()");)
        if (myProc == 0) {
            MPI_Status status;
            error = MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG,
                             MPI_COMM_WORLD, &status);
            checkMPIerror(error);
        } else if (1 == 2) {
            value = 9;
            error = MPI_Send(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
            checkMPIerror(error);
        }
        TRACE(procprint(myProc, "leave send_and_recv()");)
    }

    void loopawhile(int myProc)
    {
        int i;
        TRACE(procprint(myProc, "enter loopawhile()");)
        while (1)
            i = 0;
        TRACE(procprint(myProc, "leave loopawhile()");)
    }
33
PBS environment variables are dened from the login shell and PBS specic variables. Here is a partial list of the PBS specic variables. NCPUS Number of threads, defaulting to number of CPUs, on a compute node. OMP NUM THREADS Same as NCPUS. PBS ARRAY ID Array identier of subjob in job array. PBS ARRAY INDEX Index number of subjob in job array. PBS CONF PATH Path to pbs.conf PBS CPUSET DEDICATED Set by mpiexec to assert exclusive use of resources in assigned cpuset. PBS ENVIRONMENT When set indicates PBS job calling mpiexec. Job types are PBS BATCH or PBS INTERACTIVE. PBS JOBID The job identier assigned to the job or job array by the batch system. PBS JOBNAME The job name supplied by the user. PBS NODEFILE The lename containing the list of nodes assigned to the job. PBS O HOME Value of HOME directory from submission environment. PBS O HOST The host name on which the qsub command was executed. PBS O LOGNAME Value of users login name in the submission environment. PBS O MAIL Value of MAIL from submission environment. PBS O PATH Value of PATH from submission environment. PBS O QUEUE The original queue name to which the job was submitted.
PBS_O_SHELL - Value of SHELL from the submission environment.
PBS_O_SYSTEM - The operating system name where qsub was executed.
PBS_O_WORKDIR - The absolute path of the directory where qsub was executed.
PBS_QUEUE - The name of the queue that executed the job.
TMPDIR - The job-specific temporary directory for the job.
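A job script typically uses several of these variables together. The sketch below is hypothetical (the job name, resource request, and pingpong program are placeholders, not from this workshop's exercises); the ${VAR:-default} fallbacks let the same script run outside PBS for testing.

```shell
#!/bin/sh
#PBS -N pingpong
#PBS -l select=2:ncpus=4

# Start in the directory where qsub was executed (falls back to $PWD outside PBS).
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

echo "job=${PBS_JOBID:-interactive} queue=${PBS_QUEUE:-none}"

# Count the hosts PBS assigned to this job; 0 when no nodefile is present.
if [ -r "${PBS_NODEFILE:-/nonexistent}" ]; then
    NP=$(wc -l < "$PBS_NODEFILE")
else
    NP=0
fi
echo "hosts=$NP"
```

Outside a PBS job the script reports queue=none and hosts=0; inside a job, PBS supplies every variable shown.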
B Point-to-point MPI Functions

B.1 Blocking Send and Receive
MPI_SEND(buf, count, datatype, dest, tag, comm) - sends in standard mode.
MPI_BSEND(buf, count, datatype, dest, tag, comm) - sends in buffered mode.
MPI_SSEND(buf, count, datatype, dest, tag, comm) - sends in synchronous mode.
MPI_RSEND(buf, count, datatype, dest, tag, comm) - sends in ready mode.
MPI_RECV(buf, count, datatype, source, tag, comm, status) - starts a standard mode receive.
B.2 Nonblocking Send and Receive
MPI_ISEND(buf, count, datatype, dest, tag, comm, request) - starts a standard mode, nonblocking send.
MPI_IBSEND(buf, count, datatype, dest, tag, comm, request) - starts a buffered mode, nonblocking send.
MPI_ISSEND(buf, count, datatype, dest, tag, comm, request) - starts a synchronous mode, nonblocking send.
MPI_IRSEND(buf, count, datatype, dest, tag, comm, request) - starts a ready-mode nonblocking send.
MPI_IRECV(buf, count, datatype, source, tag, comm, request) - starts a nonblocking receive.
B.3 Combined Send and Receive
MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status) - executes a blocking send and a blocking receive using the same communicator, but possibly different tags. The send and receive buffers must be disjoint, and may have different lengths and datatypes.
MPI_SENDRECV_REPLACE(buf, count, datatype, dest, sendtag, source, recvtag, comm, status) - executes a blocking send and receive. The same buffer is used both for the send and for the receive, so the message sent is replaced by the message received.
B.4 Completion, Probe, and Cancel
MPI_WAIT(request, status) - returns when the operation identified by request is complete (it blocks).
MPI_TEST(request, flag, status) - returns flag=true if the operation identified by request is complete; otherwise it returns flag=false (nonblocking).
MPI_IPROBE(source, tag, comm, flag, status) - returns flag=true if the queue contains a receivable message matching the pattern specified by the arguments source, tag, and comm; the message itself is not received.
MPI_PROBE(source, tag, comm, status) - behaves like MPI_IPROBE except that it is a blocking call, returning only after a matching message has been found.
MPI_CANCEL(request) - marks for cancellation a pending, nonblocking communication operation (send or receive). The cancel call is local; it returns immediately, possibly before the communication is actually cancelled.
B.5 Definitions
blocking send - does not return until the message data have been copied, either to a system buffer or to receiver memory; i.e., return implies that the user send buffer can be safely modified.
buffered send - copies the message from the sender's buffer into a system buffer, thereby decoupling the send and receive operations.
standard communication mode - whether a send is buffered is implementation dependent.
B.6
B.7 Semantics of Point-to-point Communication
1. Order - messages are non-overtaking: if a sender posts two messages that match the same receive, the receive cannot get the second message while the first is still pending.

2. Progress - given an initiated matching send and receive, at least one will complete. For nonblocking sends and receives, the guarantee of completion moves to the MPI_WAIT calls, one each for the nonblocking send and the nonblocking receive.

3. Fairness - MPI makes no guarantee of fairness. For example, a posted send may never be received even though the receiver repeatedly posts matching receives, because each time the send is overtaken by a message from another source.

4. Resource limitations - any pending communication operation consumes limited system resources such as buffers. Errors may occur when a lack of resources prevents a communication attempt from executing. Nonblocking communication reduces the buffering required for progress to occur.
B.8 References
MPI standard reference: MPI: The Complete Reference [SOHL+95], with the MPI version 2 extensions.
C Code Listings

C.1 Ping-pong Codes
pingpong.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <uparms.h>
#include <misc.h>

#define TRACE(x)
#define NMLEN 128

int myproc, nproc;
char thishost[NMLEN];

int main(int argc, char **argv)
{
    int niter, msgsz;
    double clockticks;

    (void) getcmdline(argc,argv,&niter,&msgsz);
    (void) gethost(thishost,NMLEN);
    (void) initmpi(argc,argv);
    (void) echoparms(niter,msgsz);
    (void) pingpong(niter,msgsz,&clockticks);
    (void) summarise(niter,msgsz,clockticks);
    (void) zshutdown(0);
    return(0);
}

int summarise(int niter, int msgsz, double ticks)
{
    int iproc;
    double ntotdat    = (double)msgsz * (double)niter * sizeof(int);
    double throughput = (ntotdat / ticks) / (1024.0*1024.0);
    double latency    = ((ticks/2.0) / niter) * 1000000.0;

    for (iproc=0; iproc<nproc; iproc++) {
        if ( myproc == iproc && myproc == 0 ) {
            fprintf(stdout,"p=%d\ttime=%fsec\tthru=%fMB\tlate=%fusec\n",
                    myproc,ticks,throughput,latency);
            fflush(stdout);
        }
        /* MPI_Barrier(MPI_COMM_WORLD); */
    }
    return(0);
}
int pingpong(int niter, int msgsz, double *clockticksPtr)
{
    MPI_Status status;
    double clockticks = 0.0;
    int iter = 0;
    int *outcargo = (int *)malloc(sizeof(int)*msgsz);
    int *incargo  = (int *)malloc(sizeof(int)*msgsz);

    (void) define_cargo(msgsz,outcargo);
    do {
        if (myproc == 0) {
            clockticks -= MPI_Wtime();   /* subtract start, add end: accumulates elapsed time */
            MPI_Send(outcargo,1,MPI_INT,1,999,MPI_COMM_WORLD);
            MPI_Recv(incargo,msgsz,MPI_INT,1,888,MPI_COMM_WORLD,&status);
            clockticks += MPI_Wtime();
        } else if (myproc == 1) {
            clockticks -= MPI_Wtime();
            MPI_Recv(incargo,1,MPI_INT,0,999,MPI_COMM_WORLD,&status);
            MPI_Send(outcargo,msgsz,MPI_INT,0,888,MPI_COMM_WORLD);
            clockticks += MPI_Wtime();
        }
    } while (++iter < niter);

    *clockticksPtr = clockticks;
    return(0);
}

int define_cargo(int msgsz, int *cargo)
{
    int ic;
    for (ic=0; ic<msgsz; ic++) {
        *(cargo+ic) = hide(ic);
    }
    return(0);
}
int echoparms(int niter, int msgsz)
{
    int iproc;
    for (iproc=0; iproc<nproc; iproc++) {
        if ( myproc == iproc ) {
            fprintf(stdout,"xp=%d\thost=%s\tniter=%d\tmsgsz=%d\n",myproc,thishost,niter,msgsz);
            fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
    return(0);
}

int gethost(char *string, int stringlength)
{
    getHostname(string,stringlength);
    TRACE(printf("thishost=%s\n",thishost);)
    TRACE(fflush(stdout);)
    return(0);
}

int getcmdline(int argc, char **argv, int *niter, int *msgsz)
{
    if (argc != 3)
        (void) usage_and_exit(1);
    sscanf(argv[1],"%d",niter);
    sscanf(argv[2],"%d",msgsz);
    TRACE(printf("cmdline niter=%d\tmsgsz=%d\n",*niter,*msgsz);)
    TRACE(fflush(stdout);)
    return(0);
}

int zshutdown(int status)
{
    int flag;
    MPI_Initialized(&flag);   /* only MPI routine callable before MPI_Init */
    if ( flag ) {
        if (status != 0) printf("p%d\t MPI shutdown\n",myproc);
        MPI_Finalize();
    }
    if (status != 0) printf("p%d\t exit status=%d\n",myproc,status);
    exit(status);
}

int initmpi(int argc, char **argv)
{
    initializeMPIprocesses(argc,argv,&myproc,&nproc);
    if ( nproc > 2 )
        (void) usage_and_exit(1);
    return(0);
}

int usage_and_exit(int status)
{
    fprintf(stderr,"usage: [mpirun -np 2] pingpong <niter> <msgsz>\n");
    fflush(stderr);
    (void) zshutdown(status);
    exit(status);
}
References
[SGI12] SGI. Message Passing Toolkit (MPT) User Guide, 2012. Reference for the barrine MPT.
[SOHL+95] Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The Complete Reference. MIT Press, 1995. ISBN 0-262-69184-1.
Index
Amdahl's law, 17
baton.c, 14
batonpskips.c, 29
busywait.c, 31
busywaitprint.c, 32
cluster, see NOW
code listings, 39
collective communication
  global sum, 6
efficiency, 17
global sum, 6
Makefile
  debug flags, 30
  Intel MPI and gcc, 13
  Intel MPI and icc, 13
  MPICH and gcc, 14
  MPICH2 and gcc, 14
  MPT and gcc, 14
  MPT and icc, 14
  OpenMPI and gcc, 14
  OpenMPI and icc, 14
multiprocessors, 1
network throughput, 22, 23, 25, 26
node code, 35
nodes, 1
NOW, 1
PBS environment variables, 33
performance
  diminishing returns, 20
pingpong code, 22, 23, 25, 26
PMS notation, 1
point-to-point
  MPI functions, 35
  MPI_SENDRECV(), 35
shared-memory multiprocessor, 1
speedup, 17
TotalView example, 29, 31, 32
WRF, 21