
Advanced MPI

Shamjith K V
shamjithkv@cdac.in
Hybrid Computing Group
CDAC, Bangalore

Overview

MPI in Action
Hybrid Programming
MPI Standards


MPI IN ACTION


MPI Ecosystem

MPI Application
MPI implementation
Compilers & Linkers
Schedulers / Resource Managers

Cluster Interconnects
Communication Stacks
And of course, the physical resources and the OS
Hence, to get the most out of MPI, the user should know:
  some details of the MPI implementation
  the system architecture


Typical Message Flow in MPI


(Diagram) On Rank 0, the message passes from the application down through the MPI implementation and the OS onto the network; on Rank 1 it travels up the same stack to the receiving application.

Message Flow with Interconnects

(Diagram) With a fast interconnect, the MPI implementation on each rank hands the message to the interconnect directly, so it does not pass through the OS network stack.

InfiniBand

Zero-copy mechanism
RDMA
Reliable transport services
Virtual lanes
  16 virtual lanes, including a dedicated lane for management operations
High link speeds
  1X, 4X, and 12X links yielding 2.5 Gbps, 10 Gbps, and 30 Gbps respectively
  Work is under way to provide 60 Gbps and 120 Gbps

Myrinet
High-speed LAN system by Myricom
Initially proposed for building HPC clusters
Two fibre-optic cables: upstream and downstream
Fault-tolerant features
Myri-10G: 10 Gbps data rate
  compatible with 10 Gigabit Ethernet at the PHY level


PARAMNet 3

Developed by C-DAC
GEMINI Communication co-processor
Supports 8-48 ports
Each port supports 10 Gbps full duplex
KSHIPRA Software stack
RDMA centric
MVAPICH and Intel MPI support

Deployed in PARAM Yuva


Case Study
Application Case Study: the N-Body Problem
Wikipedia says: "In physics, the n-body problem is an ancient, classical problem of predicting the individual motions of a group of celestial objects interacting with each other gravitationally."


Gravitational Force


GalaxSee
F = G * M1 * M2 / D^2
  where G is the gravitational constant,
  M1, M2 are the masses of the two bodies, and
  D is the distance between them.
The acceleration of an object is the sum of the forces acting on it divided by its mass:
  a = F / M
Acceleration is the change in velocity over time: a = dv/dt
Velocity is the change in position over time: v = dx/dt
New position: new = old + change
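
A minimal serial sketch of one time step of a direct-sum n-body update, illustrating the formulas above. This is not GalaxSee's actual code; the Body struct and the time step dt are made up for the example.

#include <math.h>

typedef struct { double x, y, z, vx, vy, vz, mass; } Body;

#define G 6.674e-11   /* gravitational constant (SI units) */

void step(Body *b, int n, double dt)
{
    for (int i = 0; i < n; i++) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double dz = b[j].z - b[i].z;
            double d2 = dx*dx + dy*dy + dz*dz;
            double d  = sqrt(d2);
            double f  = G * b[j].mass / d2;   /* F/M_i = G*M_j/D^2      */
            ax += f * dx / d;                 /* direction towards body j */
            ay += f * dy / d;
            az += f * dz / d;
        }
        b[i].vx += ax * dt;                   /* change in velocity = a*dt */
        b[i].vy += ay * dt;
        b[i].vz += az * dt;
    }
    for (int i = 0; i < n; i++) {             /* new position = old + v*dt */
        b[i].x += b[i].vx * dt;
        b[i].y += b[i].vy * dt;
        b[i].z += b[i].vz * dt;
    }
}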


GalaxSee
The GalaxSee program lets the user model a number of bodies in space moving under the influence of their mutual gravitational attraction.
It is effective for relatively small numbers of bodies (on the order of a few hundred).


GalaxSee Simulation

./GalaxSee {number of bodies} {mass of the body} {final time in mega years}

Example: ./GalaxSee 4000 400 1000 1


HYBRID PROGRAMMING


Programming Paradigms

Single Threaded Programming


Multi-Threaded Programming
Multi-Process Programming


Single Threaded Program

(Diagram) A single thread of execution running on one CPU, accessing its own memory.


Multi-Threaded Program

(Diagram) A single process whose threads run on the cores of one or more CPUs, all accessing a common shared memory.


Multi-Process Program
(Diagram) Several multi-core nodes, each with its own shared memory; one or more processes run on each node, and processes on different nodes communicate over a network.

Distributed Memory
Many nodes - distributed memory
  each node has its own local memory
  not directly addressable from other nodes
Multiple sockets per node
  each node has 2 sockets (chips)
Multiple cores per socket
  each socket (chip) has 4 cores
Memory spans all 8 cores of a node - shared memory
  a node's full local memory is addressable from any core in any socket
Memory is attached to sockets
  the 4 cores sharing a socket have the fastest access to the memory attached to it

How to Exploit Both Distributed & Shared Memory
Threads for shared memory
  the parent process uses pthreads or OpenMP to fork multiple threads
  threads share the same virtual address space
  also known as SMP = Symmetric MultiProcessing
Message passing for distributed memory
  processes use MPI to pass messages (data) between each other
  each process has its own virtual address space
Combining both models gives hybrid programming
  try to exploit the whole shared/distributed memory hierarchy

Why Hybrid?
Eliminates domain decomposition at the node level
Lower memory latency and less data movement within a node
Improved application performance; reduced turnaround time


Motivation for Hybrid

Balance the computational load


Reduce memory traffic, especially for memory-bound
applications
Better resource utilization


Conventional Ways to Write Parallel Programs
OpenMP (or pthreads) only
  launch one process per node
  have each process fork one thread (or maybe more) per core
  share data using shared memory
  can't share data with a different process (except maybe via file I/O)
MPI only
  launch one process per core, on one node or on many
  pass messages among processes without concern for location
  (maybe create different communicators intra-node vs. inter-node)
  ignore the potential for any memory to be shared
With hybrid OpenMP/MPI programming, we want each MPI process to launch multiple OpenMP threads that can share local memory

MPI + OpenMP/Thread Combinations
Treat each node as an SMP
  launch a single MPI process per node
  create parallel threads sharing full-node memory
  typically 8 threads/node
Treat each socket as an SMP
  launch one MPI process on each socket
  create parallel threads sharing same-socket memory
  typically 4 threads/socket

Application Categories that Can Exploit Hybrid Parallelism

Nested parallelism
Nested loops
Principles:
  Limited parallelism at the outer level
  Additional inner level of parallelism
  Inner level not suitable for MPI
  Inner level may be suitable for OpenMP


Nested Loops

for (int i = 0; i < 1000; i++)             /* outer loop: distribute across MPI processes */
{
    for (int j = 0; j < 10000000; j++)     /* inner loop: parallelize with OpenMP threads */
    {
        c[i][j] = a[i][j] + b[i][j];
    }
}
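
A minimal hybrid sketch of the loop above (not the deck's original code): the outer (i) range is block-distributed across MPI ranks and the inner (j) loop is parallelized with OpenMP. The array sizes N and M, the flat allocation, and the skipped initialization are assumptions made for this example.

#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

#define N 1000
#define M 10000        /* smaller than the slide's 10000000 to keep memory modest */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block-distribute the outer (i) loop across MPI ranks. */
    int chunk = (N + size - 1) / size;     /* ceiling division */
    int i_lo  = rank * chunk;
    int i_hi  = i_lo + chunk;
    if (i_lo > N) i_lo = N;
    if (i_hi > N) i_hi = N;
    int rows  = i_hi - i_lo;

    double *a = malloc((size_t)rows * M * sizeof *a);
    double *b = malloc((size_t)rows * M * sizeof *b);
    double *c = malloc((size_t)rows * M * sizeof *c);
    /* ... initialize a and b ... */

    for (int i = 0; i < rows; i++) {
        /* Parallelize the inner (j) loop with OpenMP threads. */
        #pragma omp parallel for
        for (int j = 0; j < M; j++)
            c[(size_t)i * M + j] = a[(size_t)i * M + j] + b[(size_t)i * M + j];
    }

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}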


Sample Hello Hybrid Program

#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}
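
Note: the example calls plain MPI_Init. When OpenMP threads are involved it is safer to declare the required level of thread support with MPI-2's MPI_Init_thread; a minimal variant, assuming only the master thread makes MPI calls, is:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
/* 'provided' reports the thread-support level actually granted;
   check it before relying on the requested level. */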

Compiling and Running the Hybrid Program

Compilation:
mpicc -fopenmp hello-hybrid.c -o hello-hybrid

Run:
mpirun -np 2 ./hello-hybrid
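
The number of OpenMP threads created inside each MPI process is normally controlled with the standard OMP_NUM_THREADS environment variable (how it is propagated to remote nodes depends on the MPI implementation), for example:

export OMP_NUM_THREADS=4
mpirun -np 2 ./hello-hybrid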


MPI STANDARDS


Contents of MPI-2

One-sided communication (put/get)
Dynamic process management
Parallel I/O (MPI-IO)
Miscellaneous
Extended collective communication operations
C++ bindings


One-Sided Communication

Traditionally, the parameters of an interprocess communication had to be known by both processes, and both had to issue matching send/receive calls.
One-sided communication obviates the need to exchange parameters before the real data transfer and to poll periodically for data-exchange requests.
Can be used to simplify or eliminate time-consuming global communications.


Operations
MPI_Put()
for remote writes
MPI_Get()
for remote reads
MPI_Accumulate()
for remote updates.
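
A minimal sketch of these calls (not from the slides), to be run with at least two processes: rank 0 puts a value directly into a window exposed by rank 1, using fence synchronization.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double local = 0.0;          /* memory exposed through the window */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one double through the window. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);       /* open the access epoch */
    if (rank == 0) {
        double value = 3.14;
        /* Write 'value' into rank 1's window at displacement 0;
           rank 1 does not issue any matching receive. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);       /* close the epoch: the put is now complete */

    if (rank == 1)
        printf("Rank 1 received %f via MPI_Put\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}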


I/O Calls
MPI-1 relied on OS I/O functions, but MPI-2 provides MPI_File functions for dedicated parallel I/O:

int MPI_File_open(MPI_Comm comm, char *name, int amode, MPI_Info info, MPI_File *fh);
int MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence);
int MPI_File_read / MPI_File_write(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *status);
int MPI_File_close(MPI_File *fh);
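
A minimal sketch using the calls above (the file name and block size are made up for the example): every rank writes its own block of 100 integers into a shared file at a rank-dependent offset.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, buf[100];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 100; i++)
        buf[i] = rank;                      /* something to write */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Seek to this rank's region of the file, then write the block. */
    MPI_File_seek(fh, (MPI_Offset)rank * 100 * sizeof(int), MPI_SEEK_SET);
    MPI_File_write(fh, buf, 100, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}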


Parallel I/O

MPI-IO also supports parallel I/O for non-contiguous data,
non-blocking parallel I/O, and shared file pointers.


What is Parallel I/O?


Multiple processes of a parallel program accessing
data (reading or writing) from a common file
Alternatives to parallel I/O:
All processes send data to rank 0, and rank 0
writes it to a file
Each process opens a separate file and writes to it


Parallelizing I/O Blocks

Case 1 (simplest case): every process reads the file from a shared file system.

(Diagram) Rank 0, Rank 1, and Rank 2 each read directly from the shared file system.


Case 2: File Read

One process reads the input file and distributes it to the other processes:

if (myrank == 0)
{
    Input_data = read(...);
}
MPI_Bcast(Input_data, ...);

(Diagram) Rank 0 reads from the shared file system and broadcasts the data to Rank 1 and Rank 2.

Case 2: File Writes

One process gathers the data and writes it to the file:

MPI_Gather(output_data, ...);
if (myrank == 0)
{
    write(output_data);
}

(Diagram) Rank 1 and Rank 2 send their data to Rank 0, which writes it to the shared file system.


Dynamic Process Management

In the MPI-1 standard, the number of processes in a given MPI job is fixed.
MPI-2 supports dynamic process management, allowing:
  New MPI processes to be spawned while an MPI program is running
  New MPI processes to connect to other MPI processes that are already running


Dynamic Process Management

MPI_Comm_spawn creates a new group of tasks and returns an intercommunicator:
  MPI_Comm_spawn(command, argv, numprocs, info, root, comm, intercomm, errcodes)
Tries to start numprocs processes running command, passing them command-line arguments argv
The operation is collective over comm
Spawnees are in the remote group of intercomm
Errors are reported on a per-process basis in errcodes
info can optionally specify hostname, archname, wdir, path, file
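
A minimal sketch (not from the slides, and assuming a separate worker executable named ./worker that also calls MPI_Init): the parent job spawns 4 workers and receives an intercommunicator to them.

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);

    /* Collective over MPI_COMM_WORLD; rank 0 is the root of the spawn. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

    /* ... communicate with the children through the intercommunicator ... */

    MPI_Finalize();
    return 0;
}

/* Inside the worker, the matching intercommunicator is obtained with:
       MPI_Comm parent;
       MPI_Comm_get_parent(&parent);   // MPI_COMM_NULL if not spawned
*/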

C++ Language Bindings

C++ bindings match the new C bindings
MPI objects are C++ objects
MPI functions are methods of C++ classes
Users must use MPI create and free functions instead of default constructors and destructors
Shallow copy semantics are used (except for MPI::Status objects)
C++ exceptions are used instead of returned error codes
Everything is declared within an MPI namespace (MPI::...)
C++/C mixed-language interoperability


Extended Collective Operations

In MPI-1, collective operations are restricted to ordinary (intra-)communicators.
In MPI-2, most collective operations are extended with additional functionality for intercommunicators
  e.g., Bcast on a parent-children intercommunicator sends data from one parent process to all children
Two new collective routines:
  generalized all-to-all
  exclusive scan
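
A minimal sketch of such an intercommunicator broadcast (variable names are assumptions: 'children' could be the intercommunicator returned by MPI_Comm_spawn in the earlier example, 'parent' the one returned by MPI_Comm_get_parent on the child side):

/* Parent side: the parent with rank 0 is the broadcast root. */
int data = 42;
if (parent_rank == 0)
    MPI_Bcast(&data, 1, MPI_INT, MPI_ROOT, children);       /* I am the root */
else
    MPI_Bcast(&data, 1, MPI_INT, MPI_PROC_NULL, children);  /* other parents */

/* Child side: pass the rank of the root within the parent group. */
int data;
MPI_Bcast(&data, 1, MPI_INT, 0, parent);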


MPI-2 Miscellany
Standard startup with mpiexec
  recommended but not required
Implementations are allowed to pass NULL to MPI_Init rather than argc, argv
MPI_Finalized(flag) added for library writers
New predefined datatypes:
  MPI_WCHAR
  MPI_SIGNED_CHAR
  MPI_UNSIGNED_LONG_LONG


External Interfaces
Generalized Requests
users can create new non-blocking operations
Naming objects for debuggers and profilers
label communicators, windows, datatypes
Allow users to add error codes, classes and strings
Specifies how threads are to be handled if the
implementation chooses to provide them


MPI-3

MPI-3 Scope
Includes, but is not limited to, issues associated with scalability (performance and robustness), multi-core support, cluster support, and application support.
Backwards compatibility may be maintained; routines may be deprecated.


MPI_Count: Larger Types

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
Counts are expressed as int / INTEGER
  usually limited to 2^31
Proposal: a new type, MPI_Count
  can be larger than an int / INTEGER
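
Until larger counts are available, a common workaround (a hedged sketch, not from the slides; buf, dest, tag, and comm are assumed to exist) is to wrap the data in a derived datatype so the int count stays small:

/* Send 2^32 doubles even though 'count' is a 32-bit int:
   group the data into chunks of 2^20 doubles and send 4096 chunks. */
MPI_Datatype chunk;
MPI_Type_contiguous(1 << 20, MPI_DOUBLE, &chunk);
MPI_Type_commit(&chunk);
MPI_Send(buf, 4096, chunk, dest, tag, comm);   /* 4096 * 2^20 = 2^32 elements */
MPI_Type_free(&chunk);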


Fault Tolerance Issues

Items being discussed:
  Define a consistent error response and reporting across the standard
  Clearly define the failure response for current MPI dynamics (master/slave fault tolerance)
  Recovery of:
    communicators
    file handles
    RMA windows
  Data piggybacking
  Dynamic communicators
  Asynchronous dynamic process control


The MPI_T Performance Interface

Goal: provide tools with access to MPI-internal information
  Access to configuration/control and performance variables
  MPI-implementation agnostic: tools query the available information

Examples of performance variables:
  Number of packets sent
  Time spent blocking
  Memory allocated
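
A minimal sketch of how a tool can query what MPI_T exposes, as standardized in MPI-3 (the counts reported depend entirely on the MPI implementation):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided, num_cvars = 0, num_pvars = 0;

    /* MPI_T can be initialized independently of MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_cvar_get_num(&num_cvars);   /* configuration/control variables */
    MPI_T_pvar_get_num(&num_pvars);   /* performance variables */

    printf("This MPI exposes %d control and %d performance variables\n",
           num_cvars, num_pvars);

    MPI_T_finalize();
    return 0;
}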


Hybrid Programming
Ensure that MPI has the features necessary to facilitate efficient hybrid programming
Investigate what changes are needed in MPI to better support:
  Traditional thread interfaces (e.g., Pthreads, OpenMP)
  Emerging interfaces (like TBB, OpenCL, CUDA, and Ct)
  PGAS (UPC, CAF, etc.)
  Shared memory


References & Acknowledgements

http://bccd.net/wiki/index.php/GalaxSee
http://www.montgomerycollege.edu/Departments/planet/GalaxSEE/help_docs/win_galaxy_tutorial.html
http://www.shodor.org/
http://www.slac.stanford.edu/comp/unix/farm/mpi_and_openmp.html
http://openmp.org/sc13/HybridPP_Slides.pdf
https://docs.loni.org/wiki/Introduction_to_Programming_Hybrid_Applications_Using_OpenMP_and_MPI
http://www.mpi-forum.org/docs/docs.html


THANK YOU!

Questions?
