
MPI Workshop Notes1 DRAFT WORKING COPY

Date:         Friday, 8 June 2012
Time:         1:00 - 4:50 PM
Location:     Sir James Foot Building (Bldg. 47A), Room 241
Instructor:   Terry Clark, Ph.D. (t.clark3@uq.edu.au)
Organisation: UQ Research Computing Centre

In this workshop we introduce parallel computing with the Message Passing Interface (MPI), a standardized message-passing system designed for writing parallel programs in sequential languages such as C and Fortran. Since its release in 1994, and now with several reliable implementations, MPI remains a principal parallel programming model. Central to its success are its portability and the performance achievable for a significant share of applications in the technical computing community. MPI uses a model in which every process in a parallel execution runs exactly the same program, with each process operating on a different part of the calculation. The logic required to coordinate multiple processes through the program adds significant complexity, and this task is intensified by MPI's low-level interprocess communication primitives. Consequently, MPI program development tends to be tedious and error prone.

Parallel applications need to operate efficiently over a practical range of inputs and computing platforms. Within this operational range the performance varies, often widely, as a function of input, number of processors, types of resources, and MPI runtime parameters. It follows that to achieve good performance the users of parallel applications need informed choices to select and configure resources appropriately. The supporting information usually involves measurements made to assess the application on target computer systems with suitable data.

This workshop covers concepts and methods pertaining to the issues described above. The topics include MPI program development, parallel program debugging and profiling, and running parallel applications. The aim is a comprehensive introduction, necessarily of limited depth but without major gaps, that will enable attendees to pursue topics specific to their research needs. Site-specific content for the UQ HPC cluster, Barrine, is a critical adjunct which will provide attendees at all levels of expertise with a useful digest of protocols, tools, and systems. The programs discussed in this workshop and additional material can be retrieved from the web site and software repository at http://hpc-curlew.hpcu.uq.edu.au/.

$LastChangedDate: 2012-06-08 12:26:32 +1000 (Fri, 08 Jun 2012) $


1 This three-hour workshop is part of ongoing HPC training conducted by the UQ Research Computing Centre. The material in this section is suitable for HPC users at all levels.

Contents

1 Introduction
  1.1 Types of Parallel Systems
  1.2 Simple Sum
    1.2.1 Node codes to define simple-sum data
    1.2.2 Node codes to compute partial sum
    1.2.3 Node codes for simple sum
    1.2.4 Global sum approach
  1.3 SPMD model for Simple Sum
  1.4 Simple Sum using Basic MPI
2 MPI Standard and Implementations
  2.1 Message Passing Interface Standard
  2.2 MPI Implementations on Barrine
    2.2.1 Basic Information
    2.2.2 Features affecting program deployment
  2.3 Compilers on Barrine
  2.4 Makefiles and PBS scripts for Barrine
    2.4.1 Intel MPI with gcc and icc
    2.4.2 MPICH2 with gcc and icc
    2.4.3 SGI MPT with gcc and icc
    2.4.4 Open MPI with gcc and icc
3 Parallel Program Performance
  3.1 Performance measures
  3.2 Amdahl's Law assessment of speedup
  3.3 Performance examples
    3.3.1 Effect of serial part on Speedup and Efficiency
    3.3.2 Effect of communication on Speedup and Efficiency
    3.3.3 Effect of serial part and communication on Speedup and Efficiency
    3.3.4 Parallel performance with diminishing returns
  3.4 Barrine performance example
4 Barrine Network and I/O
  4.1 Infiniband and Gigabit Ethernet Networks
  4.2 Local file systems
5 Parallel Debuggers
  5.1 Totalview Debugger
  5.2 Example 1. Process skips baton pass
    5.2.1 Example 2. MPI process busy waiting
    5.2.2 Example 3. Process busy waiting with print statements
A PBS Environment Variables
B MPI Point-to-Point Communication Functions
  B.1 Blocking Send and Receive
  B.2 Non-blocking Send and Receive
  B.3 Combined Send and Receive
  B.4 Operations on Messages and Queues
  B.5 Definitions
  B.6 Send modes with the MPI API
  B.7 Semantics of Point-to-Point Communication
  B.8 References
C Code Listings
  C.1 Ping-pong Codes

1 Introduction

1.1 Types of Parallel Systems

PMS Notation. We use PMS notation to describe the key components of a computer system, where P is a processor, M is a memory, and S is a switch, or network.

Figure 1: Shared-memory architectures: (a) the symmetric multiprocessor (SMP, or dancehall) approach places all memory uniformly far from the processors, limiting the scalability of the approach; (b) NUMA (non-uniform memory access) distributes memory so that the latency to access local memory is fixed and independent of the number of processors.

Figure 2: Networks of workstations, also called NOWs or clusters, are message-passing architectures that use complete computers as building blocks (the nodes). The high-level diagram for a NOW is the same as for the NUMA multiprocessor illustrated in figure 1(b).²

² The main difference between the figure 1(b) multiprocessors and NOW message-passing systems is how non-local memory is accessed: a NUMA multiprocessor integrates communication into the memory system, whereas message-passing architectures perform explicit I/O operations.

1.2 Simple Sum

$$\sum_{i=1}^{n} a_i = a_1 + a_2 + a_3 + \cdots + a_{n-1} + a_n \qquad (1)$$

Suppose we use nP = 4 processors to calculate the sum S of n = 16 elements:

$$S = (a_{01} + a_{02} + a_{03} + a_{04}) + (a_{05} + a_{06} + a_{07} + a_{08}) + (a_{09} + a_{10} + a_{11} + a_{12}) + (a_{13} + a_{14} + a_{15} + a_{16}) \qquad (2)$$

The Goal

Figure 3: Time for the 4-way parallel addition (shown above the x-axis) is 1/4 of the sequential time.

A Detail

$$S = S_0 + S_1 + S_2 + S_3 \qquad (3)$$

where the subscripts are the process numbers for p0, p1, p2, p3.
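With n = 16 and the initialisation used in the node codes below (A[i] = thisProc*4 + i), the partial sums and the total work out as

$$S_0 = 1+2+3+4 = 10,\quad S_1 = 5+6+7+8 = 26,\quad S_2 = 9+10+11+12 = 42,\quad S_3 = 13+14+15+16 = 58,$$
$$S = 10 + 26 + 42 + 58 = 136.$$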

1.2.1 Node codes to define simple-sum data

(a) p0 code:

    int A[4]
    int nP = 4
    int thisProc = 0

    do i = 1, 4
        A[i] = (thisProc * 4) + i
    enddo

(b) p1 code: identical except thisProc = 1
(c) p2 code: identical except thisProc = 2
(d) p3 code: identical except thisProc = 3

Figure 4: Pseudo-code to initialise array A[0:3] on processes p0, p1, p2, p3 for the simple sum.

Figure 5: Initial data distribution for the 4-process simple sum, n = 16: each process holds four elements of A plus local accumulators R = 0 and S = 0.

1.2.2 Node codes to compute partial sum

(a) p0 code:

     1  int A[4], R
     2  int nP = 4
     3  int thisProc = 0
     4
     5  do i = 1, 4
     6      A[i] = (thisProc * 4) + i
     7  enddo
     8
     9  do i = 1, 4
    10      R = R + A[i]
    11  enddo

(b) p1 code: identical except thisProc = 1
(c) p2 code: identical except thisProc = 2
(d) p3 code: identical except thisProc = 3

Figure 6: Pseudo-code with the partial sum (lines 9-11) at processes p0, p1, p2, p3 for the simple sum.

Figure 7: Processes' memory contents after the partial sum: each process holds its local partial sum in R (10, 26, 42, and 58 across the four processes), with S still 0.

1.2.3 Node codes for simple sum

(a) p0 code:

     1  int A[4], R, S
     2  int nP = 4
     3  int thisProc = 0
     4
     5  do i = 1, 4
     6      A[i] = (thisProc * 4) + i
     7  enddo
     8  do i = 1, 4
     9      R = R + A[i]
    10  enddo
    11  S = +{R}

(b) p1 code: identical except thisProc = 1
(c) p2 code: identical except thisProc = 2
(d) p3 code: identical except thisProc = 3

Figure 8: Simple sum pseudo-code with the global sum on line 11.

Figure 9: Processes' memory contents after the global sum: each process now holds the total S = 136 alongside its partial sum R.

1.2.4 Global sum approach

Figure 10: Initial state of processes after the partial sum (see figure 7): process Pi holds its partial sum Ri.

Figure 11: Global sum, phase 1: exchange data from the initial state (figure 10).

Figure 12: Global sum, phase 2: exchange the summed data from phase 1 (figure 11).

Figure 13: Global sum, final state: all processes have the final sum, S = R0 + R1 + R2 + R3.
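The exchange pattern in figures 10-13 is recursive doubling: in each phase every process swaps its running partial sum with a partner whose rank differs in one bit. The sketch below illustrates that pattern with basic point-to-point MPI calls. It assumes the number of processes is a power of two, and it is only an illustration; the workshop's own globalSum() (figure 16) uses MPI_Allreduce instead.

#include <mpi.h>

/* Recursive-doubling global sum, as in figures 10-13.
 * Assumes nProc is a power of two. Illustrative sketch only. */
int globalSumButterfly(int R, MPI_Comm comm)
{
    int myProc, nProc, mask, partner, recvval;
    MPI_Status status;

    MPI_Comm_rank(comm, &myProc);
    MPI_Comm_size(comm, &nProc);

    for (mask = 1; mask < nProc; mask <<= 1) {
        partner = myProc ^ mask;          /* phase 1: neighbour; phase 2: two away; ... */
        MPI_Sendrecv(&R, 1, MPI_INT, partner, 0,
                     &recvval, 1, MPI_INT, partner, 0,
                     comm, &status);
        R += recvval;                     /* accumulate the partner's partial sum */
    }
    return R;                             /* every process now holds the full sum S */
}

With four processes the loop performs exactly the two exchange phases of figures 11 and 12.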

1.3 SPMD model for Simple Sum
Program SPMDSum (pseudo-code):

    Program SPMDSum
        int i, A[4], R, S;
        int (nP, thisProc) = initProcs()

        do i = 1, 4
            A[i] = (thisProc * 4) + i
        enddo

        do i = 1, 4
            R = R + A[i]
        enddo

        S = globalSum(R)
    end

Figure 14: The SPMD model for the simple sum. The single source file spmdsum.c is compiled to object code spmdsum.o and linked into the executable spmdsum; the runtime system then starts four copies of spmdsum as processes on cluster nodes (b01a22, b01a23, b02b29 in this illustration), each process with its own memory and processor.

1.4 Simple Sum using Basic MPI
#include "mympinc.h"
#define N 4

int main(int argc, char **argv)
{
    int A[N], R, S, i;
    int nP, thisProc;

    initProcesses(argc, argv, &thisProc, &nP);

    for (i = 1; i <= N; i++) {
        A[i-1] = (thisProc * 4) + i;
    }

    for (R = 0, i = 0; i < N; i++) {
        R = R + A[i];
    }

    S = globalSum(R, thisProc);

    MPI_Finalize();
    exit(0);
}

Figure 15: C program implementing the simple sum outlined above.

#include <stdio.h>
#include <mpi.h>

void initProcesses(int argc, char **argv, int *myproc, int *nproc)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, myproc);
    MPI_Comm_size(MPI_COMM_WORLD, nproc);
    printf("myProc=%d\tnProc=%d\t\tprogram=%s\n", *myproc, *nproc, argv[0]);
    MPI_Barrier(MPI_COMM_WORLD);
}

int globalSum(int R, int myProc)
{
    int S;
    int status = MPI_Allreduce(&R, &S, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("p%d\tR=%d\tS=%d\n", myProc, R, S);
    MPI_Barrier(MPI_COMM_WORLD);
    return(S);
}

Figure 16: C program functions completing the code in figure 15.

CC      = mpiicc
AR      = ar
ARFLAGS = r
CFLAGS  = -O1
LIBDIR  = .
LDFLAGS = -L$(LIBDIR)
LIBS    = -lmympi

program   = spmdsum
mylibrary = libmympi.a

$(program): $(program).o $(mylibrary)
	$(CC) $(LDFLAGS) -o $@ $< $(LIBS)

$(mylibrary): mympi.o
	$(AR) $(ARFLAGS) $@ $<

clean:;      $(RM) core *.o *.trace *.a
realclean:;  $(RM) core *.o* *.e* *.trace $(program) *.a

.c.o:;       $(CC) $(CFLAGS) -c $*.c

Figure 17: Makefile for the simple sum program in figure 15.

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=1:mpiprocs=1:NodeType=large
#PBS -l walltime=00:00:30
#PBS -N spmdmsg
#PBS -q workq

cd ${PBS_O_WORKDIR}
module load intel-mpi/3.2.2.006
mpirun -np 4 spmdsum

Figure 18: PBS script for the simple sum program of figure 15.

Six basic MPI functions:

MPI_Init         Initiate an MPI computation.
MPI_Finalize     Terminate an MPI computation.
MPI_Comm_size    Determine the number of MPI processes.
MPI_Comm_rank    Determine the MPI identifier of the calling process.
MPI_Send         Send a message to an MPI process.
MPI_Recv         Receive a message from an MPI process.
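As a self-contained illustration of these six calls (not a program from the workshop repository), the following minimal C program passes a single integer from rank 0 to rank 1; run it with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int nProc, myProc, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);                    /* initiate the MPI computation */
    MPI_Comm_size(MPI_COMM_WORLD, &nProc);     /* number of MPI processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);    /* identifier of the calling process */

    if (myProc == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (myProc == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("p%d received %d from p0 (of %d processes)\n", myProc, value, nProc);
    }

    MPI_Finalize();                            /* terminate the MPI computation */
    return 0;
}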

2 MPI Standard and Implementations

2.1 Message Passing Interface Standard

MPI 2.0 Standard (pdf)
MPI 1.1 Standard (pdf)

The definitive reference for MPI is the MPI Forum Web site.

2.2 MPI Implementations on Barrine

Installed MPI information:

Intel MPI (MPI-2.2)
  module files:   intel-mpi/3.2.2.006, intel-mpi/4.0.0.027, intel-mpi/4.0.1.007
  documentation:  reference manual and introduction guide for 3.2.2.006, 4.0.0.027, 4.0.1.007

MPICH2 (MPI-2.2)
  module file:    mpich2/1.4.1p1-intel
  documentation:  MPICH2 User's Guide; MPICH2 commands and routines

MPICH1 (MPI-1.1)
  module files:   mpich-ch4-p4, mpich-ch4-p4mpd
  documentation:  User's Guide to MPICH, v1.2.6

Open MPI (MPI-2.1)
  module files:   OpenMPI/1.2.8, OpenMPI/1.4.3, OpenMPI/1.5.3
  documentation:  Open MPI v1.6 commands and routines

SGI MPT (MPI-2.2)
  module files:   mpt/2.00, mpt/2.02
  documentation:  MPT User Guide (supports 2.06)

Features Affecting Application Deployment (includes Barrine-specific notes):

Intel MPI
  Infiniband:     yes (default)
  program build:  use the MPI compiler wrapper with the sequential compiler option (see figure 20); mpiicc invokes icc
  program launch: mpirun, mpiexec

MPICH2
  Infiniband:     3rd party, not on Barrine
  program build:  use the MPI compiler wrapper with the sequential compiler option (see figure 22)
  program launch: mpirun, mpiexec (mpiexec cannot run a non-MPI program)

SGI MPT
  Infiniband:     yes [SGI12]; use export MPI_USE_IB=1
  program build:  sequential compiler (icc, gcc, etc.); no mpicc wrapper
  program launch: mpiexec_mpt (under PBS), mpirun (outside PBS)

Open MPI
  Infiniband:     check...
  program build:  mpicc
  program launch: mpirun

2.3 Compilers on Barrine

Compiler-related documentation:

Intel:   C++ Compiler (http); Fortran Compiler (http); Fortran Reference, etc.
GNU:     GNU Compiler Collection; GNU OpenMP Manual; GCC online documentation
OpenMP:  Developing Threaded Applications; Cluster OpenMP Manual
Other:   Intel 64 and IA-32 Optimization Manual

2.4 Makefiles and PBS scripts for Barrine

Programs in this section are online at the software repository.

Shown here are Makefiles and PBS scripts for the program baton.c for each combination of compiler and MPI system: {GNU, Intel} x {Intel MPI, MPICH2, Open MPI, MPT}.³ Before invoking make on the Makefiles it is necessary to load modules for the compiler and MPI system; these module loads are shown in each accompanying PBS script. The source code listing for baton.c is shown below.

2.4.1 Intel MPI with gcc and icc

Makefile:

# GNU C and Intel MPI makefile
CC      = gcc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) -cc=$(CC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load compiler/gcc4.5.2
module load intel-mpi/3.2.2.006
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 19: Intel MPI with gcc: Makefile (top) and PBS script (bottom).

Makefile:

# Intel C and Intel MPI makefile
CC      = icc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) -cc=$(CC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load intelcc11/11.1.072
module load intel-mpi/3.2.2.006
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 20: Intel MPI with the Intel icc compiler: Makefile (top) and PBS script (bottom).

³ The GNU and Intel compilers demonstrated are gcc and icc.

2.4.2 MPICH2 with gcc and icc

Makefile:

# GNU C and MPICH2 makefile
CC      = gcc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load compiler/gcc4.5.2
module load mpich2/1.4.1p1-intel
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 21: MPICH2 with gcc: Makefile (top) and PBS script (bottom).

Makefile:

# Intel C and MPICH2 makefile
CC      = icc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load intelcc11/11.1.072
module load mpich2/1.4.1p1-intel
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 22: MPICH2 with the Intel icc compiler: Makefile (top) and PBS script (bottom).

2.4.3 SGI MPT with gcc and icc

Makefile:

# GNU C and MPT makefile
CC      = gcc
MPICC   = $(CC)
CFLAGS  = -O2
LDFLAGS =
LIBS    = -lmpi
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load mpt/2.00
module load compiler/gcc4.5.2
cd $PBS_O_WORKDIR
mpiexec_mpt -np 8 baton

Figure 23: SGI MPT with gcc: Makefile (top) and PBS script (bottom).


Makefile:

# Intel C and MPT makefile
CC      = icc
MPICC   = $(CC)
CFLAGS  = -O2
LDFLAGS =
LIBS    = -lmpi
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load intelcc11/11.1.072
module load mpt/2.00
cd $PBS_O_WORKDIR
mpiexec_mpt -np 8 baton

Figure 24: SGI MPT with the Intel icc compiler: Makefile (top) and PBS script (bottom).

2.4.4 Open MPI with gcc and icc

Makefile:

# GNU C and Open MPI makefile
CC      = gcc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load compiler/gcc4.5.2
module load OpenMPI/1.5.3
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 25: Open MPI with gcc: Makefile (top) and PBS script (bottom).

Makefile:

# Intel C and Open MPI makefile
CC      = icc
MPICC   = mpicc
CFLAGS  = -O2
LDFLAGS =
LIBS    =
program = baton

$(program): $(program).o
	$(MPICC) $(LDFLAGS) -o $@ $< $(LIBS)

.c.o:;	$(MPICC) $(CFLAGS) -c $*.c

PBS script:

#!/bin/bash
#PBS -A sf-Admin
#PBS -l select=4:ncpus=2:mpiprocs=2:NodeType=large
#PBS -l walltime=00:01:00
#PBS -q workq
#PBS -N thebaton

module load intelcc11/11.1.072
module load OpenMPI/1.5.3
cd $PBS_O_WORKDIR
mpirun -np 8 baton

Figure 26: Open MPI with the Intel icc compiler: Makefile (top) and PBS script (bottom).


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>

MPI_Status status;
int myProc, nProc;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myProc);
    MPI_Comm_size(MPI_COMM_WORLD, &nProc);

    check_mybaton_value(pass_baton());

    exit(MPI_Finalize());
}

int pass_baton()
{
    int toProc  = (myProc + 1) % nProc;
    int frmProc = (myProc - 1 + nProc) % nProc;
    int ibaton  = 0;

    if (myProc > 0)
        MPI_Recv(&ibaton, 1, MPI_INT, frmProc, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    ibaton += myProc;

    if (myProc < nProc-1)
        MPI_Send(&ibaton, 1, MPI_INT, toProc, 7, MPI_COMM_WORLD);

    return ibaton;
}

check_mybaton_value(int mybaton)
{
    int myanswer = myAnswer();

    wait_in_order();
    printf("p%d %s\n", myProc, ((mybaton - myanswer) ? "NOT OK" : "OK"));
    fflush(stdout);
    system("sleep 1");
    notifynext();
}

wait_in_order()
{
    if (myProc < nProc-1)
        MPI_Recv(0, 0, MPI_INT, myProc+1, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
}

notifynext()
{
    if (myProc > 0)
        MPI_Send(0, 0, MPI_INT, myProc-1, 3, MPI_COMM_WORLD);
}

int myAnswer()
{
    int i, sum;
    for (i = 1, sum = 0; i <= myProc; i++)
        sum += i;
    return(sum);
}

Figure 27: A copy of this baton.c is in each directory for section 2.4.

3 Parallel Program Performance

3.1 Performance measures

Speedup, S_P, using P processors is the ratio of the sequential time, T_1, to the parallel time, T_P:

$$S_P = \frac{T_1}{T_P} \qquad (4)$$

Efficiency, E_P, using P processors is the ratio of the speedup, S_P, to P:

$$E_P = \frac{S_P}{P} = \frac{T_1}{P\,T_P} \qquad (5)$$

3.2 Amdahl's Law assessment of speedup

Let f be the fraction of the work that executes serially. The parallel time is then

$$T_P = f\,T_1 + \frac{(1-f)\,T_1}{P} \qquad (6)$$

so the speedup is

$$S_P = \frac{T_1}{T_P} = \frac{T_1}{f\,T_1 + \frac{(1-f)\,T_1}{P}} \qquad (7)$$

$$S_P = \frac{1}{f + \frac{1-f}{P}} \qquad (8)$$
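A small illustration (not part of the workshop programs) that evaluates equation (8) for an assumed serial fraction; the value f = 0.1 is chosen only for the example.

#include <stdio.h>

/* Amdahl's Law, equation (8): S_P = 1 / (f + (1 - f)/P). */
static double amdahlSpeedup(double f, int P)
{
    return 1.0 / (f + (1.0 - f) / (double)P);
}

int main(void)
{
    const double f = 0.1;   /* assumed serial fraction, for illustration only */
    int P;

    for (P = 1; P <= 64; P *= 2) {
        double S = amdahlSpeedup(f, P);
        printf("P=%3d  speedup=%5.2f  efficiency=%5.1f%%\n", P, S, 100.0 * S / P);
    }
    return 0;
}

As P grows, the speedup is bounded above by 1/f (here 10), which is the pattern the examples in section 3.3 explore.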


3.3 Performance examples

3.3.1 Effect of serial part on Speedup and Efficiency

parallel part: parallelizes perfectly, i.e., contributes T_1(par)/P
serial part: executes serially, contributing T_1(ser)

Total: T_P = T_1(par)/P + T_1(ser)

nProcs   time (m)   speedup   efficiency
  1        110        1.0       100%
  2         60        1.8        92%
  4         35        3.1        79%
  8         23        4.9        61%
 16         16        6.7        42%
 32         13        8.3        26%
 64         12        9.5        15%

(Plot: compute time versus number of processes on log-log axes, with curves for the total, the parallel part, and the sequential part.)

3.3.2 Effect of communication on Speedup and Efficiency

communication: contributes k log2 P

Total: T_P = T_1/P + k log2 P

nProcs   time   speedup   efficiency
  1      100      1.0       100%
  2       51      2.0        98%
  4       27      3.8        93%
  8       16      6.4        81%
 16       10      9.8        61%
 32        8     12.3        38%
 64        7     13.2        21%

(Plot: compute time versus number of processes, with curves for the total, the parallel part, and the communication.)

3.3.3 Effect of serial part and communication on Speedup and Efficiency

parallel part: parallelizes perfectly, i.e., contributes T_1(par)/P
serial part: executes serially, contributing T_1(ser)
communication: contributes k log2 P

Total: T_P = T_1(par)/P + T_1(ser) + k log2 P

nProcs   time   speedup   efficiency
  1      110      1.0       100%
  2       61      1.8        90%
  4       37      3.0        74%
  8       26      4.3        54%
 16       20      5.4        34%
 32       18      6.1        19%
 64       17      6.3        10%

(Plot: compute time versus number of processes, with curves for the total, the parallel part, the sequential part, and the communication.)

3.3.4 Parallel performance with diminishing returns

Same model as section 3.3.3, T_P = T_1(par)/P + T_1(ser) + k log2 P, evaluated out to 2048 processes.

nProcs    time    speedup   efficiency
   1     110.0      1.0      100.0%
   2      61.0      1.8       90.2%
   4      37.0      3.0       74.3%
   8      25.5      4.3       54.0%
  16      20.3      5.4       34.0%
  32      18.1      6.1       19.0%
  64      17.6      6.3       10.0%
 128      17.8      6.2        4.8%
 256      18.4      6.0        2.3%
 512      19.2      5.7        1.1%
1024      20.1      5.5        0.5%
2048      20.0      5.2        0.3%

(Plot: compute time versus number of processes, with curves for the total, the parallel part, the sequential part, and the communication.)
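The tables in sections 3.3.1-3.3.4 follow the simple model above. The sketch below evaluates it directly; the parameter values (100 minutes of perfectly parallel work, 10 minutes of serial work, 1 minute per log2 communication step) are assumptions inferred from the tables rather than values stated in the text.

#include <stdio.h>
#include <math.h>

/* Evaluate T_P = T1_par/P + T1_ser + k*log2(P) and the derived speedup
 * and efficiency. Parameter values are assumptions for illustration. */
int main(void)
{
    const double T1_par = 100.0;  /* perfectly parallel work, minutes (assumed) */
    const double T1_ser = 10.0;   /* serial work, minutes (assumed) */
    const double k      = 1.0;    /* cost per log2 communication step (assumed) */
    const double T1     = T1_par + T1_ser;
    int P;

    for (P = 1; P <= 2048; P *= 2) {
        double TP = T1_par / P + T1_ser + k * log2((double)P);
        double S  = T1 / TP;
        printf("P=%5d  time=%6.1f  speedup=%4.1f  efficiency=%5.1f%%\n",
               P, TP, S, 100.0 * S / P);
    }
    return 0;
}

Compiled with -lm, this approximately reproduces the diminishing-returns behaviour tabulated above.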

3.4 Barrine performance example

Weather Research and Forecasting Model (WRF)⁴

nProcs   time (m)   speedup   efficiency
  1        530        1.0       100%
  4        245        2.2        54%
  8        150        3.5        44%
 16        120        4.4        28%
 32         84        6.3        20%
 64        100        5.3         8%

(Plots: elapsed time in minutes, speedup, and efficiency versus number of processes.)

⁴ Michael Hewson, with Professor Hamish McGowan's laboratory, provided the data for these WRF benchmarks. Michael applies WRF to climate modelling and orchestrates WRF and its associated programs on Barrine.

4 Barrine Network and I/O

4.1 Infiniband and Gigabit Ethernet Networks

Figure 28: Throughput estimates of Barrine's Infiniband and gigabit ethernet networks (throughput in MB/s versus message size in bytes, measured with Intel MPI and MPICH2).

Figure 29: Send-and-receive pairs between two processes used to estimate latency and throughput.

Figure 30: Ping-pong code to measure throughput. See Appendix C.1 for the complete listing.

int myproc, nproc;

int main(int argc, char **argv)
{
    int niter, msgsz;
    double clockticks;

    (void) getcmdline(argc, argv, &niter, &msgsz);
    (void) gethost(thishost, NMLEN);
    (void) initmpi(argc, argv);
    (void) echoparms(niter, msgsz);
    (void) pingpong(niter, msgsz, &clockticks);
    if (myproc == 0) summarise(niter, msgsz, clockticks);
    (void) zshutdown(0);
    return(0);
}

int pingpong(int niter, int msgsz, double *clockticksPtr)
{
    MPI_Status s;
    double clockticks = 0.0;
    int iter = 0;
    int *outcargo = (int *)malloc(sizeof(int) * msgsz);
    int *incargo  = (int *)malloc(sizeof(int) * msgsz);

    (void) define_cargo(msgsz, outcargo);

    do {
        if (myproc == 0) {
            clockticks -= MPI_Wtime();
            MPI_Send(outcargo, 1, MPI_INT, 1, 999, MPI_COMM_WORLD);
            MPI_Recv(incargo, msgsz, MPI_INT, 1, 888, MPI_COMM_WORLD, &s);
            clockticks += MPI_Wtime();
        } else if (myproc == 1) {
            clockticks -= MPI_Wtime();
            MPI_Recv(incargo, 1, MPI_INT, 0, 999, MPI_COMM_WORLD, &s);
            MPI_Send(outcargo, msgsz, MPI_INT, 0, 888, MPI_COMM_WORLD);
            clockticks += MPI_Wtime();
        }
    } while (++iter < niter);

    *clockticksPtr = clockticks;
    return(0);
}


Figure 31: Summarize point-to-point communication measurements. Full listing in Appendix C.1.

int summarise(int niter, int msgsz, double ticks)
{
    double ntotdat    = (double)msgsz * (double)niter * sizeof(int);
    double throughput = (ntotdat / ticks) / (1024.0 * 1024.0);
    double latency    = ((ticks / 2.0) / niter) * 1000000.0;

    fprintf(stdout, "p=%d\ttime=%fsec\tthru=%fMB\tlate=%fusec\n",
            myproc, ticks, throughput, latency);
    fflush(stdout);
    return(0);
}

int define_cargo(int msgsz, int *cargo)
{
    int ic;
    for (ic = 0; ic < msgsz; ic++) {
        *(cargo + ic) = hide(ic);
    }
    return(0);
}

int zshutdown(int status)
{
    int flag;
    MPI_Initialized(&flag);   /* only MPI function callable before MPI_Init() */
    if (flag) {
        if (status != 0) printf("p%d\t MPI shutdown\n", myproc);
        MPI_Finalize();
    }
    if (status != 0) printf("p%d\t exit status=%d\n", myproc, status);
    exit(status);
}

int initmpi(int argc, char **argv)
{
    initializeMPIprocesses(argc, argv, &myproc, &nproc);
    if (nproc > 2) (void) usage_and_exit(1);
    return(0);
}
4 BARRINE NETWORK AND I/0

25

Estimates of throughput for node pairs in two randomly selected node lists, one each for the x-axis and y-axis, and dierent lists for the two gures. Units are MB/second. Figure 32: Inniband throughput with Intel MPI; early AM Barrine measurements.
b07b27 b07b07 b07a29 b07a25 b07a21 b07a14 b07a08 b06b32 b06b29 b06b15 b06b08 b06b07 b06a15 b06a14 b06a13 b03b14 b03b01 b03a16 b02b32 b02b31 b02b26 b02b15 b02b13 b02b01 b02a34 b02a25 b02a01 b01b11 b01b05 b01b03 b01a33 b01a30 b 1 0 a 0 1 5 0 b 1 0 a 0 9 b 1 0 b 1 2 b 0 1 a 1 7 b 0 1 a 2 3 5 b 0 1 a 2 5 b 0 1 a 2 6 b 0 1 b 0 3 bb 0 0 1 1 bb 0 1 5 3 10 b 0 1 b 2 5 b 0 1 b 3 7 b 0 2 a 0 1 bb 0 0 2 2 a a 0 2 2 9 15 b 0 2 b 2 5 b 0 2 b 3 1 b 0 2 b 3 2 bb 0 0 3 3 a a 0 1 5 4 20 b 0 3 a 1 8 b 0 3 a 2 3 b 0 3 a 3 3 bb 0 0 6 6 a a 2 2 2 9 25 b 0 6 b 0 1 b 0 6 b 2 1 b 0 6 b 2 3 bb 0 0 6 7 bb 3 0 0 3 30 b 0 7 b 2 0 b 0 7 b 2 2

30

2500

25

20

2000

Ipartner node

15

1500

10

1000

500

Jpartner node
Figure 33: Inniband throughput with Intel MPI; midday Barrine measurements.
b10b11 b10a10 b10a07 b10a06 b07b09 b07b02 b07a23 b06b26 b06b25 b06a04 b03b28 b03b15 b03b04 b03a36 b03a32 b03a01 b02b29 b02b15 b02b12 b02a36 b02a35 b02a32 b02a23 b02a15 b01b27 b01b23 b01b18 b01b13 b01b12 b01b05 b01a33 b01a29 b 0 1 a 0 5 5 0 b 0 1 a 2 1 b 0 1 a 2 8 b 0 1 b 0 6 b 0 1 b 1 5 5 b 0 1 b 3 7 b 0 2 a 2 4 b 0 2 a 2 6 bb 0 0 2 2 a a 3 3 2 4 10 b 0 2 b 0 5 b 0 2 b 0 9 b 0 2 b 2 1 bb 0 0 2 3 ba 2 0 4 3 15 b 0 3 a 1 6 b 0 3 a 2 6 b 0 3 b 1 6 bb 0 0 3 3 bb 1 2 7 5 20 b 0 3 b 2 9 b 0 3 b 3 1 b 0 3 b 3 5 bb 0 0 6 6 a b 2 3 6 2 25 b 0 7 a 0 6 b 0 7 a 0 8 b 0 7 b 1 8 bb 0 1 7 0 ba 2 0 3 9 30 b 1 0 a 1 7 b 1 0 b 1 3

30

2500

25

20

2000

Ipartner node

15

1500

10

1000

500

Jpartner node

Figure 34: Each curve shows Infiniband throughput (MB/s) for an I-partner in Figure 32, plotted against the J-partner nodes.

Figure 35: Each curve shows Infiniband throughput (MB/s) for a J-partner in Figure 32, plotted against the I-partner nodes.

4.2 Local file systems

5 Parallel Debuggers

5.1 Totalview Debugger

The TotalView User's Guide provides detailed instructions for running TotalView for this section. The first example, section 5.2, uses a program that causes processes to hang on communication that will not occur; TotalView can be applied with breakpoints to track the processes. The second example, section 5.2.1, uses a program whose processes also hang on missing communication, but with the missing processes in an indefinite loop. The third example, section 5.2.2, incorporates print statements for debugging. It is also useful to try the Linux command gstack to get at the same information. Makefiles are shown with compilation flags for Intel MPI and with the compilation flags set for debugging.

5.2 Example 1. Process skips baton pass

The program batonpskips.c (see figure 36) is written so that the MPI process with rank=3 skips the baton passing. Suppose nProc=8; then processes 0-2 and 4-7 will hang because they are waiting on process rank=3.

Figure 36: batonpskips.c

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <uparms.h>
#include <misc.h>

#define ROGUE 3

main(int argc, char **argv)
{
    int myProc, nProc;
    int localbaton;
    int pass_baton(int myProc, int nProc, MPI_Comm comm);

    initializeMPIprocesses(argc, argv, &myProc, &nProc);

    if (myProc != ROGUE) {
        localbaton = pass_baton(myProc, nProc, MPI_COMM_WORLD);
        check_allbatons(myProc, nProc, MPI_COMM_WORLD, localbaton);
    }
    cleanup();
}

cleanup()
{
    int status = MPI_Finalize();
    exit(status);
}

/* baton pass in ring from 0, 1, ... nProc-1 */
int pass_baton(int myProc, int nProc, MPI_Comm comm)
{
    MPI_Status status;
    int msgtag  = 0;
    int toProc  = (myProc + 1) % nProc;
    int frmProc = (myProc - 1 + nProc) % nProc;
    int ibaton  = 0;

    if (myProc > 0)
        checkMPIerror(MPI_Recv(&ibaton, 1, MPI_INT, frmProc, msgtag, comm, &status));

    ibaton += myProc;

    if (myProc < nProc-1)
        checkMPIerror(MPI_Send(&ibaton, 1, MPI_INT, toProc, msgtag, comm));

    return ibaton;
}

Figure 37: Intel MPI with icc: Makefile (top) and shell script to run (bottom). Note the flags set for debugging, -O0 -g.

Makefile:

PROG    = batonpskips
CC      = mpicc
PREFIX  = ..
INCDIR  = $(PREFIX)/include
LIBDIR  = $(PREFIX)/lib
LDFLAGS = -L$(LIBDIR)
LIBS    = -lu -lmisc
CFLAGS  = -O0 -g -I$(INCDIR)

$(PROG): $(PROG).o libu.a
	$(CC) $(LDFLAGS) -o $@ $@.o -lm $(LIBS)

libu.a:;	cd $(LIBDIR); make -f $(LIBDIR)/makefile

clean:
	$(RM) core *.o *.trace
realclean:
	$(RM) core *.o *.trace busywait

.c.o:;	$(CC) $(CFLAGS) -c $*.c

Shell script:

#!/bin/bash
module load intel-mpi/3.2.2.006
mpirun -n 4 ./batonpskips

5.2.1 Example 2. MPI process busy waiting

Figure 38: busywait.c

#include <stdio.h>
#include <mpi.h>
#include <uparms.h>

main(int argc, char **argv)
{
    int myProc, nProc;
    int takebranch;

    initializeMPIprocesses(argc, argv, &myProc, &nProc);

    takebranch = (nProc % 2 == 0) ? 0 : 1;
    if (takebranch == 0) {
        send_and_recv(myProc, nProc);
    } else {
        loopawhile();
    }

    MPI_Finalize();
    return 0;
}

send_and_recv(int myProc, int nProc)
{
    int error, value;

    if (myProc == 0) {
        MPI_Status status;
        error = MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        checkMPIerror(error);
    } else if (1 == 2) {
        value = 9;
        error = MPI_Send(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
        checkMPIerror(error);
    }
}

loopawhile()
{
    int i;
    while (1)
        i = 0;
}

5.2.2 Example 3. Process busy waiting with print statements

Figure 39: busywaitprint.c; includes have been removed for space.

#define TRACE(x) x

main(int argc, char **argv)
{
    int myProc, nProc;
    int takebranch;

    TRACE(procprint(0,"start busywaitprt()");)
    initializeMPIprocesses(argc, argv, &myProc, &nProc);
    TRACE(procprint(myProc,"initialized()");)

    takebranch = (nProc % 2 == 0) ? 0 : 1;
    if (takebranch == 0) {
        send_and_recv(myProc, nProc);
    } else {
        loopawhile(myProc);
    }

    MPI_Finalize();
    return 0;
}

void send_and_recv(int myProc, int nProc)
{
    int error, value;

    TRACE(procprint(myProc,"enter send_and_recv()");)
    if (myProc == 0) {
        MPI_Status status;
        error = MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        checkMPIerror(error);
    } else if (1 == 2) {
        value = 9;
        error = MPI_Send(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD);
        checkMPIerror(error);
    }
    TRACE(procprint(myProc,"leave send_and_recv()");)
}

void loopawhile(int myProc)
{
    int i;
    TRACE(procprint(myProc,"enter loopawhile()");)
    while (1)
        i = 0;
    TRACE(procprint(myProc,"leave loopawhile()");)
}

Appendix A  PBS Environment Variables

PBS environment variables are defined from the login shell and from PBS-specific variables. Here is a partial list of the PBS-specific variables.

NCPUS: Number of threads, defaulting to the number of CPUs, on a compute node.
OMP_NUM_THREADS: Same as NCPUS.
PBS_ARRAY_ID: Array identifier of a subjob in a job array.
PBS_ARRAY_INDEX: Index number of a subjob in a job array.
PBS_CONF_PATH: Path to pbs.conf.
PBS_CPUSET_DEDICATED: Set by mpiexec to assert exclusive use of resources in the assigned cpuset.
PBS_ENVIRONMENT: When set, indicates a PBS job calling mpiexec. Job types are PBS_BATCH or PBS_INTERACTIVE.
PBS_JOBID: The job identifier assigned to the job or job array by the batch system.
PBS_JOBNAME: The job name supplied by the user.
PBS_NODEFILE: The filename containing the list of nodes assigned to the job.
PBS_O_HOME: Value of HOME from the submission environment.
PBS_O_HOST: The host name on which the qsub command was executed.
PBS_O_LOGNAME: Value of the user's login name in the submission environment.
PBS_O_MAIL: Value of MAIL from the submission environment.
PBS_O_PATH: Value of PATH from the submission environment.
PBS_O_QUEUE: The original queue name to which the job was submitted.
PBS_O_SHELL: Value of SHELL from the submission environment.
PBS_O_SYSTEM: The operating system name where qsub was executed.
PBS_O_WORKDIR: The absolute path of the directory where qsub was executed.
PBS_QUEUE: The name of the queue that executed the job.
TMPDIR: The job-specific temporary directory for the job.

B  MPI Point-to-Point Communication Functions

B.1 Blocking Send and Receive

MPI_SEND(buf, count, datatype, dest, tag, comm) sends in standard mode.
MPI_BSEND(buf, count, datatype, dest, tag, comm) sends in buffered mode.
MPI_SSEND(buf, count, datatype, dest, tag, comm) sends in synchronous mode.
MPI_RSEND(buf, count, datatype, dest, tag, comm) sends in ready mode.
MPI_RECV(buf, count, datatype, source, tag, comm, status) starts a standard-mode receive.

B.2 Non-blocking Send and Receive

MPI_ISEND(buf, count, datatype, dest, tag, comm, request) starts a standard-mode, nonblocking send.
MPI_IBSEND(buf, count, datatype, dest, tag, comm, request) starts a buffered-mode, nonblocking send.
MPI_ISSEND(buf, count, datatype, dest, tag, comm, request) starts a synchronous-mode, nonblocking send.
MPI_IRSEND(buf, count, datatype, dest, tag, comm, request) starts a ready-mode, nonblocking send.
MPI_IRECV(buf, count, datatype, source, tag, comm, request) starts a nonblocking receive.

A short usage sketch of the nonblocking calls follows.
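A minimal sketch (not from the workshop code) of the usual nonblocking pattern: post MPI_Irecv and MPI_Isend, optionally overlap other work, then complete both with MPI_Wait. The function and its partner argument are hypothetical names used only for illustration.

#include <mpi.h>

/* Exchange one integer with 'partner' using nonblocking calls. */
void exchange(int myvalue, int *theirvalue, int partner, MPI_Comm comm)
{
    MPI_Request sendreq, recvreq;
    MPI_Status  status;

    MPI_Irecv(theirvalue, 1, MPI_INT, partner, 0, comm, &recvreq);
    MPI_Isend(&myvalue,   1, MPI_INT, partner, 0, comm, &sendreq);

    /* ... unrelated computation could overlap the communication here ... */

    MPI_Wait(&recvreq, &status);   /* completion guarantees the message has arrived */
    MPI_Wait(&sendreq, &status);   /* completion guarantees myvalue may be reused */
}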

B.3 Combined Send and Receive

MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status) executes a blocking send and a blocking receive operation using the same communicator, but possibly different tags.

MPI_SENDRECV_REPLACE(buf, count, datatype, dest, sendtag, source, recvtag, comm, status) executes a blocking send and receive. The same buffer is used both for the send and for the receive, so that the message sent is replaced by the message received.

B.4 Operations on Messages and Queues

MPI_WAIT(request, status) returns when the operation identified by request is complete (it blocks).

MPI_TEST(request, flag, status) returns flag=true if the operation identified by request is complete; otherwise the call returns flag=false (non-blocking).

MPI_IPROBE(source, tag, comm, flag, status) returns flag=true if the queue contains a receivable message matching the pattern specified by the arguments source, tag, and comm.

MPI_PROBE(source, tag, comm, status) behaves like MPI_IPROBE except that it is a blocking call, returning only after a matching message has been found.

MPI_CANCEL(request) marks for cancellation a pending, nonblocking communication operation (send or receive). The cancel call is local; it returns immediately.

A probe-and-receive sketch follows.
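A hedged sketch (not from the workshop code) of the usual probe-then-receive idiom, sizing the receive buffer from the probed status. MPI_Get_count is a standard MPI call not listed above, and the helper name is hypothetical.

#include <stdlib.h>
#include <mpi.h>

/* Poll for a pending message from any source and receive it into a freshly
 * allocated buffer. Returns the element count, or 0 if nothing was queued. */
int receive_if_pending(int **data, MPI_Comm comm)
{
    int flag = 0, count = 0;
    MPI_Status status;

    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (!flag)
        return 0;                               /* no matching message queued */

    MPI_Get_count(&status, MPI_INT, &count);    /* size of the matched message */
    *data = (int *)malloc(sizeof(int) * count);
    MPI_Recv(*data, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
             comm, &status);
    return count;
}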

B.5 Definitions

blocking send: does not return until the message data have been copied, either to a send buffer or to receiver memory; i.e., return implies that the user send buffer can be safely modified.

buffered send: copies the message from the sender buffer into a system buffer, thereby decoupling the send and receive operations.

standard communication mode: buffering is implementation dependent.

B.6 Send modes with the MPI API

Blocking Send mode: MPI_SEND(buf, count, datatype, dest, tag, comm)
  start: depends on whether the implementation buffers; might require the matching receive.
  complete requires: data copied from the send buffer.
  complete implies: send buffer available.
  locality: non-local, since it might depend on the matching receive.

Non-blocking Send mode: MPI_ISEND(buf, count, datatype, dest, tag, comm, request)
  start: depends on whether the implementation buffers; might require the matching receive.
  complete requires: nothing, returns immediately.
  complete implies: nothing.
  locality: local.

Buffered Send mode: MPI_BSEND(buf, count, datatype, dest, tag, comm)
  start: anytime, does not require a matching receive.
  complete requires: data copied from the send buffer.
  complete implies: send buffer available.
  locality: local.

Synchronous Send mode: MPI_SSEND(buf, count, datatype, dest, tag, comm)
  start: anytime, does not require a matching receive.
  complete requires: matching receive posted and receiving data.
  complete implies: send buffer available; sender-receiver rendezvous.
  locality: non-local.

Ready Send mode: MPI_RSEND(buf, count, datatype, dest, tag, comm)
  start: only when the matching receive is posted.
  complete requires: data copied from the send buffer.
  complete implies: send buffer available.
  locality: non-local.

Blocking Receive operation: MPI_RECV(buf, count, datatype, source, tag, comm, status)
  Matches any of the send modes; blocking receive with standard-mode semantics. A receive can complete before the matching send has completed (of course, it can complete only after the matching send has started).
  complete requires: receive buffer contains the new message.
  complete implies: matching send started (maybe not completed); message available.
  locality: non-local.

Non-blocking Receive operation: MPI_IRECV(buf, count, datatype, source, tag, comm, request)
  Matches any of the send modes. A receive can complete before the matching send has completed (of course, it can complete only after the matching send has started).
  complete requires: nothing, returns immediately.
  complete implies: nothing; need to probe using the MPI request object.
  locality: local.

A buffered-send usage sketch follows.
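A hedged sketch of buffered mode in practice (not from the workshop code): the user attaches a buffer, sends with MPI_Bsend, and detaches the buffer again. MPI_Buffer_attach, MPI_Buffer_detach, and MPI_BSEND_OVERHEAD are standard MPI calls and constants not listed above; the helper name is hypothetical.

#include <stdlib.h>
#include <mpi.h>

/* Buffered-mode send of 'count' integers to 'dest': the message is copied
 * into the attached buffer, so MPI_Bsend can return without waiting for
 * the matching receive. */
void bsend_ints(const int *data, int count, int dest, MPI_Comm comm)
{
    int size  = count * sizeof(int) + MPI_BSEND_OVERHEAD;
    void *buf = malloc(size);

    MPI_Buffer_attach(buf, size);          /* hand MPI the system buffer */
    MPI_Bsend((void *)data, count, MPI_INT, dest, 0, comm);
    MPI_Buffer_detach(&buf, &size);        /* blocks until buffered data are sent */
    free(buf);
}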

B.7 Semantics of Point-to-Point Communication

1. Order: messages are non-overtaking (determinism). If a sender sends two messages in succession that match the same receive, the receive cannot take the second message before the first. If a receiver posts two receives in succession that match the same message, the second receive cannot be satisfied by this message before the first.

2. Progress: given an initiated matching send and receive, at least one will complete. For non-blocking sends and receives, the guarantee of completion moves to the MPI_WAIT calls, one each for the non-blocking send and the non-blocking receive.

3. Fairness: MPI makes no guarantee of fairness. For example, given a posted send, it is possible for a receiver posting matching receives to never receive the message, because each time the given send is overtaken by another message.

4. Resource limitations: any pending communication operation consumes limited system resources such as buffers. Errors may occur when a lack of resources prevents execution of a communication attempt. Non-blocking communication reduces the buffering requirements for progress to occur.

B.8 References

MPI Standard references: MPI: The Complete Reference [SOHL+95], with MPI version 2.

C  Code Listings

C.1 Ping-pong Codes

The full listing of the ping-pong code from Section 4.1.

pingpong.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <uparms.h>
#include <misc.h>

#define TRACE(x)
#define NMLEN 128

int myproc, nproc;
char thishost[NMLEN];

int main(int argc, char **argv)
{
    int niter, msgsz;
    double clockticks;

    (void) getcmdline(argc, argv, &niter, &msgsz);
    (void) gethost(thishost, NMLEN);
    (void) initmpi(argc, argv);
    (void) echoparms(niter, msgsz);
    (void) pingpong(niter, msgsz, &clockticks);
    (void) summarise(niter, msgsz, clockticks);
    (void) zshutdown(0);
    return(0);
}

int summarise(int niter, int msgsz, double ticks)
{
    int iproc;
    double ntotdat    = (double)msgsz * (double)niter * sizeof(int);
    double throughput = (ntotdat / ticks) / (1024.0 * 1024.0);
    double latency    = ((ticks / 2.0) / niter) * 1000000.0;

    for (iproc = 0; iproc < nproc; iproc++) {
        if (myproc == iproc && myproc == 0) {
            fprintf(stdout, "p=%d\ttime=%fsec\tthru=%fMB\tlate=%fusec\n",
                    myproc, ticks, throughput, latency);
            fflush(stdout);
        }
        /* MPI_Barrier(MPI_COMM_WORLD); */
    }
    return(0);
}

int pingpong(int niter, int msgsz, double *clockticksPtr)
{
    MPI_Status status;
    double clockticks = 0.0;
    int iter = 0;
    int *outcargo = (int *)malloc(sizeof(int) * msgsz);
    int *incargo  = (int *)malloc(sizeof(int) * msgsz);

    (void) define_cargo(msgsz, outcargo);

    do {
        if (myproc == 0) {
            clockticks -= MPI_Wtime();
            MPI_Send(outcargo, 1, MPI_INT, 1, 999, MPI_COMM_WORLD);
            MPI_Recv(incargo, msgsz, MPI_INT, 1, 888, MPI_COMM_WORLD, &status);
            clockticks += MPI_Wtime();
        } else if (myproc == 1) {
            clockticks -= MPI_Wtime();
            MPI_Recv(incargo, 1, MPI_INT, 0, 999, MPI_COMM_WORLD, &status);
            MPI_Send(outcargo, msgsz, MPI_INT, 0, 888, MPI_COMM_WORLD);
            clockticks += MPI_Wtime();
        }
    } while (++iter < niter);

    *clockticksPtr = clockticks;
    return(0);
}

int define_cargo(int msgsz, int *cargo)
{
    int ic;
    for (ic = 0; ic < msgsz; ic++) {
        *(cargo + ic) = hide(ic);
    }
    return(0);
}

int echoparms(int niter, int msgsz)
{
    int iproc;
    for (iproc = 0; iproc < nproc; iproc++) {
        if (myproc == iproc) {
            fprintf(stdout, "xp=%d\thost=%s\tniter=%d\tmsgsz=%d\n",
                    myproc, thishost, niter, msgsz);
            fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
    return(0);
}

int gethost(char *string, int stringlength)
{
    getHostname(string, stringlength);
    TRACE(printf("thishost=%s\n", thishost);)
    TRACE(fflush(stdout);)
    return(0);
}

int getcmdline(int argc, char **argv, int *niter, int *msgsz)
{
    if (argc != 3) (void) usage_and_exit(1);
    sscanf(argv[1], "%d", niter);
    sscanf(argv[2], "%d", msgsz);
    TRACE(printf("cmdline niter=%d\tmsgsz=%d\n", *niter, *msgsz);)
    TRACE(fflush(stdout);)
    return(0);
}

int zshutdown(int status)
{
    int flag;
    MPI_Initialized(&flag);   /* only MPI routine callable before MPI_Init */
    if (flag) {
        if (status != 0) printf("p%d\t MPI shutdown\n", myproc);
        MPI_Finalize();
    }
    if (status != 0) printf("p%d\t exit status=%d\n", myproc, status);
    exit(status);
}

int initmpi(int argc, char **argv)
{
    initializeMPIprocesses(argc, argv, &myproc, &nproc);
    if (nproc > 2) (void) usage_and_exit(1);
    return(0);
}

int usage_and_exit(int status)
{
    fprintf(stderr, "usage: [mpirun -np 2] pingpong <niter> <msgsz>\n");
    fflush(stderr);
    (void) zshutdown(status);
    exit(status);
}

References

[SGI12]    SGI. Message Passing Toolkit (MPT) User Guide, 2012. Reference for Barrine MPT.

[SOHL+95]  Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The Complete Reference. 1995. ISBN 0-262-69184-1.

Index

Amdahl's law, 17
baton.c, 14
batonpskips.c, 29
busywait.c, 31
busywaitprint.c, 32
cluster, see NOW
code listings, 39
collective communication: global sum, 6
efficiency, 17
global sum, 6
Makefile: debug flags, 30; Intel MPI and gcc, 13; Intel MPI and icc, 13; MPICH and gcc, 14; MPICH2 and gcc, 14; MPT and gcc, 14; MPT and icc, 14; OpenMPI and gcc, 14; OpenMPI and icc, 14
multiprocessors, 1
network throughput, 22, 23, 25, 26
node code, 35
nodes, 1
NOW, 1
PBS environment variables, 33
performance: diminishing returns, 20
ping-pong code, 22, 23, 25, 26
PMS notation, 1
point-to-point MPI functions, 35; MPI_Sendrecv(), 35
shared-memory multiprocessor, 1
speedup, 17
TotalView example, 29, 31, 32
WRF, 21
