
Advanced MPI

Shamjith K V
shamjithkv@cdac.in
Hybrid Computing Group
CDAC, Bangalore

Overview

MPI in Action
Hybrid Programming
MPI Standards


MPI IN ACTION


MPI Ecosystem

MPI Application
MPI implementation
Compilers & Linkers
Schedulers / Resource Managers

Cluster Interconnects
Communication Stacks
And of course, the physical resources and the OS
Hence, to get the most out of MPI, the user should know:
  some details of the MPI implementation
  the system architecture


Typical Message Flow in MPI


(Diagram) On Rank 0, the message passes from the application down through the MPI implementation and the OS onto the network; on Rank 1 it travels up the same stack to the receiving application.

Message Flow with Interconnects

(Diagram) With a fast interconnect, the MPI implementation on each rank hands the message to the interconnect directly, so it does not pass through the OS network stack.

InfiniBand

Zero-copy mechanism
RDMA
Reliable transport services
Virtual lanes
  16 virtual lanes, including a dedicated lane for management operations
High link speeds
  1X, 4X, and 12X links yielding 2.5 Gbps, 10 Gbps, and 30 Gbps respectively
  Work is under way to provide 60 Gbps and 120 Gbps

Myrinet
High-speed LAN system by Myricom
Initially proposed for building HPC clusters
Two fibre-optic cables: upstream and downstream
Fault-tolerant features
Myri-10G: 10 Gbps data rate
  compatible with 10 Gigabit Ethernet at the PHY level


PARAMNet 3

Developed by C-DAC
GEMINI Communication co-processor
Supports 8-48 ports
Each port supports 10 Gbps full duplex
KSHIPRA Software stack
RDMA centric
MVAPICH and Intel MPI support

Deployed in PARAM Yuva


Case Study
Application Case Study: the N-Body Problem
Wikipedia says: "In physics, the n-body problem is an ancient, classical problem of predicting the individual motions of a group of celestial objects interacting with each other gravitationally."


Gravitational Force


GalaxSee
F = G * M1 * M2 / D^2
  where G is the gravitational constant,
  M1, M2 are the masses of the two bodies, and
  D is the distance between them.
The acceleration of an object is the sum of the forces acting on it divided by its mass:
  a = F / M
Acceleration is the change in velocity over time: a = dv/dt
Velocity is the change in position over time: v = dx/dt
New position: new = old + change
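
A minimal serial sketch of one time step of a direct-sum n-body update, illustrating the formulas above. This is not GalaxSee's actual code; the Body struct and the time step dt are made up for the example.

#include <math.h>

typedef struct { double x, y, z, vx, vy, vz, mass; } Body;

#define G 6.674e-11   /* gravitational constant (SI units) */

void step(Body *b, int n, double dt)
{
    for (int i = 0; i < n; i++) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double dz = b[j].z - b[i].z;
            double d2 = dx*dx + dy*dy + dz*dz;
            double d  = sqrt(d2);
            double f  = G * b[j].mass / d2;   /* F/M_i = G*M_j/D^2      */
            ax += f * dx / d;                 /* direction towards body j */
            ay += f * dy / d;
            az += f * dz / d;
        }
        b[i].vx += ax * dt;                   /* change in velocity = a*dt */
        b[i].vy += ay * dt;
        b[i].vz += az * dt;
    }
    for (int i = 0; i < n; i++) {             /* new position = old + v*dt */
        b[i].x += b[i].vx * dt;
        b[i].y += b[i].vy * dt;
        b[i].z += b[i].vz * dt;
    }
}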


GalaxSee
The GalaxSee program lets the user model a number of bodies in space moving under the influence of their mutual gravitational attraction.
It is effective for relatively small numbers of bodies (on the order of a few hundred).


GalaxSee Simulation

./GalaxSee {number of bodies} {mass of the body} {final time in mega years}

Example: ./GalaxSee 4000 400 1000 1


HYBRID PROGRAMMING


Programming Paradigms

Single Threaded Programming


Multi-Threaded Programming
Multi-Process Programming


Single Threaded Program

(Diagram) A single thread of execution running on one CPU, accessing its own memory.


Multi-Threaded Program

(Diagram) A single process whose threads run on the cores of one or more CPUs, all accessing a common shared memory.


Multi-Process Program
(Diagram) Several multi-core nodes, each with its own shared memory; one or more processes run on each node, and processes on different nodes communicate over a network.

Distributed Memory
Many nodes - distributed memory
  each node has its own local memory
  not directly addressable from other nodes
Multiple sockets per node
  each node has 2 sockets (chips)
Multiple cores per socket
  each socket (chip) has 4 cores
Memory spans all 8 cores of a node - shared memory
  a node's full local memory is addressable from any core in any socket
Memory is attached to sockets
  the 4 cores sharing a socket have the fastest access to the memory attached to it

How to Exploit Both Distributed & Shared Memory
Threads for shared memory
  the parent process uses pthreads or OpenMP to fork multiple threads
  threads share the same virtual address space
  also known as SMP = Symmetric MultiProcessing
Message passing for distributed memory
  processes use MPI to pass messages (data) between each other
  each process has its own virtual address space
Combining both models gives hybrid programming
  try to exploit the whole shared/distributed memory hierarchy

Why Hybrid?
Eliminates domain decomposition at the node level
Lower memory latency and less data movement within a node
Improved application performance; reduced turnaround time


Motivation for Hybrid

Balance the computational load


Reduce memory traffic, especially for memory-bound
applications
Better resource utilization


Conventional Ways to Write Parallel Programs
OpenMP (or pthreads) only
  launch one process per node
  have each process fork one thread (or maybe more) per core
  share data using shared memory
  can't share data with a different process (except maybe via file I/O)
MPI only
  launch one process per core, on one node or on many
  pass messages among processes without concern for location
  (maybe create different communicators intra-node vs. inter-node)
  ignore the potential for any memory to be shared
With hybrid OpenMP/MPI programming, we want each MPI process to launch multiple OpenMP threads that can share local memory

MPI + OpenMP/Thread Combinations
Treat each node as an SMP
  launch a single MPI process per node
  create parallel threads sharing full-node memory
  typically 8 threads/node
Treat each socket as an SMP
  launch one MPI process on each socket
  create parallel threads sharing same-socket memory
  typically 4 threads/socket

Application Categories that Can Exploit Hybrid Parallelism

Nested parallelism
Nested loops
Principles:
  Limited parallelism at the outer level
  Additional inner level of parallelism
  Inner level not suitable for MPI
  Inner level may be suitable for OpenMP


Nested Loops

for (int i = 0; i < 1000; i++)             /* outer loop: distribute across MPI processes */
{
    for (int j = 0; j < 10000000; j++)     /* inner loop: parallelize with OpenMP threads */
    {
        c[i][j] = a[i][j] + b[i][j];
    }
}
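
A minimal hybrid sketch of the loop above (not the deck's original code): the outer (i) range is block-distributed across MPI ranks and the inner (j) loop is parallelized with OpenMP. The array sizes N and M, the flat allocation, and the skipped initialization are assumptions made for this example.

#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

#define N 1000
#define M 10000        /* smaller than the slide's 10000000 to keep memory modest */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block-distribute the outer (i) loop across MPI ranks. */
    int chunk = (N + size - 1) / size;     /* ceiling division */
    int i_lo  = rank * chunk;
    int i_hi  = i_lo + chunk;
    if (i_lo > N) i_lo = N;
    if (i_hi > N) i_hi = N;
    int rows  = i_hi - i_lo;

    double *a = malloc((size_t)rows * M * sizeof *a);
    double *b = malloc((size_t)rows * M * sizeof *b);
    double *c = malloc((size_t)rows * M * sizeof *c);
    /* ... initialize a and b ... */

    for (int i = 0; i < rows; i++) {
        /* Parallelize the inner (j) loop with OpenMP threads. */
        #pragma omp parallel for
        for (int j = 0; j < M; j++)
            c[(size_t)i * M + j] = a[(size_t)i * M + j] + b[(size_t)i * M + j];
    }

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}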


Sample Hello Hybrid Program

#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}
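
Note: the example calls plain MPI_Init. When OpenMP threads are involved it is safer to declare the required level of thread support with MPI-2's MPI_Init_thread; a minimal variant, assuming only the master thread makes MPI calls, is:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
/* 'provided' reports the thread-support level actually granted;
   check it before relying on the requested level. */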

Compiling and Running the Hybrid Program

Compilation:
mpicc -fopenmp hello-hybrid.c -o hello-hybrid

Run:
mpirun -np 2 ./hello-hybrid
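
The number of OpenMP threads created inside each MPI process is normally controlled with the standard OMP_NUM_THREADS environment variable (how it is propagated to remote nodes depends on the MPI implementation), for example:

export OMP_NUM_THREADS=4
mpirun -np 2 ./hello-hybrid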


MPI STANDARDS


Contents of MPI-2

One-sided communication (put/get)
Dynamic process management
Parallel I/O (MPI-IO)
Miscellaneous
Extended collective communication operations
C++ bindings


One-Sided Communication

Traditionally, the parameters of an interprocess communication had to be known by both processes, and both had to issue matching send/receive calls.
One-sided communication obviates the need to exchange parameters before the real data transfer and to poll periodically for data-exchange requests.
Can be used to simplify or eliminate time-consuming global communications.


Operations
MPI_Put()
for remote writes
MPI_Get()
for remote reads
MPI_Accumulate()
for remote updates.
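
A minimal sketch of these calls (not from the slides), to be run with at least two processes: rank 0 puts a value directly into a window exposed by rank 1, using fence synchronization.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double local = 0.0;          /* memory exposed through the window */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one double through the window. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);       /* open the access epoch */
    if (rank == 0) {
        double value = 3.14;
        /* Write 'value' into rank 1's window at displacement 0;
           rank 1 does not issue any matching receive. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);       /* close the epoch: the put is now complete */

    if (rank == 1)
        printf("Rank 1 received %f via MPI_Put\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}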


I/O Calls
MPI-1 relied on OS I/O functions, but MPI-2 provides MPI_File functions for dedicated parallel I/O:

int MPI_File_open(MPI_Comm comm, char *name, int amode, MPI_Info info, MPI_File *fh);
int MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence);
int MPI_File_read / MPI_File_write(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *status);
int MPI_File_close(MPI_File *fh);
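
A minimal sketch using the calls above (the file name and block size are made up for the example): every rank writes its own block of 100 integers into a shared file at a rank-dependent offset.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, buf[100];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 100; i++)
        buf[i] = rank;                      /* something to write */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Seek to this rank's region of the file, then write the block. */
    MPI_File_seek(fh, (MPI_Offset)rank * 100 * sizeof(int), MPI_SEEK_SET);
    MPI_File_write(fh, buf, 100, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}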


Parallel I/O

MPI-IO also supports parallel I/O for non-contiguous data,
non-blocking parallel I/O, and shared file pointers.


What is Parallel I/O?


Multiple processes of a parallel program accessing
data (reading or writing) from a common file
Alternatives to parallel I/O:
All processes send data to rank 0, and rank 0
writes it to a file
Each process opens a separate file and writes to it


Parallelizing I/O Blocks

Case 1 (simplest case): every process reads the file from a shared file system.

(Diagram) Rank 0, Rank 1, and Rank 2 each read directly from the shared file system.


Case 2: File Read

One process reads the input file and distributes it to the other processes:

if (myrank == 0)
{
    Input_data = read(...);
}
MPI_Bcast(Input_data, ...);

(Diagram) Rank 0 reads from the shared file system and broadcasts the data to Rank 1 and Rank 2.

Case 2: File Writes

One process gathers the data and writes it to the file:

MPI_Gather(output_data, ...);
if (myrank == 0)
{
    write(output_data);
}

(Diagram) Rank 1 and Rank 2 send their data to Rank 0, which writes it to the shared file system.


Dynamic Process Management

In the MPI-1 standard, the number of processes in a given MPI job is fixed.
MPI-2 supports dynamic process management, allowing:
  New MPI processes to be spawned while an MPI program is running
  New MPI processes to connect to other MPI processes that are already running


Dynamic Process Management

MPI_Comm_spawn creates a new group of tasks and returns an intercommunicator:
  MPI_Comm_spawn(command, argv, numprocs, info, root, comm, intercomm, errcodes)
Tries to start numprocs processes running command, passing them command-line arguments argv
The operation is collective over comm
Spawnees are in the remote group of intercomm
Errors are reported on a per-process basis in errcodes
info can optionally specify hostname, archname, wdir, path, file
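
A minimal sketch (not from the slides, and assuming a separate worker executable named ./worker that also calls MPI_Init): the parent job spawns 4 workers and receives an intercommunicator to them.

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);

    /* Collective over MPI_COMM_WORLD; rank 0 is the root of the spawn. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

    /* ... communicate with the children through the intercommunicator ... */

    MPI_Finalize();
    return 0;
}

/* Inside the worker, the matching intercommunicator is obtained with:
       MPI_Comm parent;
       MPI_Comm_get_parent(&parent);   // MPI_COMM_NULL if not spawned
*/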

C++ Language Bindings

C++ bindings match the new C bindings
MPI objects are C++ objects
MPI functions are methods of C++ classes
Users must use MPI create and free functions instead of default constructors and destructors
Shallow copy semantics are used (except for MPI::Status objects)
C++ exceptions are used instead of returned error codes
Everything is declared within an MPI namespace (MPI::...)
C++/C mixed-language interoperability


Extended Collective Operations

In MPI-1, collective operations are restricted to ordinary (intra-)communicators.
In MPI-2, most collective operations are extended with additional functionality for intercommunicators
  e.g., Bcast on a parent-children intercommunicator sends data from one parent process to all children
Two new collective routines:
  generalized all-to-all
  exclusive scan
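
A minimal sketch of such an intercommunicator broadcast (variable names are assumptions: 'children' could be the intercommunicator returned by MPI_Comm_spawn in the earlier example, 'parent' the one returned by MPI_Comm_get_parent on the child side):

/* Parent side: the parent with rank 0 is the broadcast root. */
int data = 42;
if (parent_rank == 0)
    MPI_Bcast(&data, 1, MPI_INT, MPI_ROOT, children);       /* I am the root */
else
    MPI_Bcast(&data, 1, MPI_INT, MPI_PROC_NULL, children);  /* other parents */

/* Child side: pass the rank of the root within the parent group. */
int data;
MPI_Bcast(&data, 1, MPI_INT, 0, parent);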


MPI-2 Miscellany
Standard startup with mpiexec
  recommended but not required
Implementations are allowed to pass NULL to MPI_Init rather than argc, argv
MPI_Finalized(flag) added for library writers
New predefined datatypes:
  MPI_WCHAR
  MPI_SIGNED_CHAR
  MPI_UNSIGNED_LONG_LONG


External Interfaces
Generalized Requests
users can create new non-blocking operations
Naming objects for debuggers and profilers
label communicators, windows, datatypes
Allow users to add error codes, classes and strings
Specifies how threads are to be handled if the
implementation chooses to provide them


MPI-3

MPI-3 Scope
Includes, but is not limited to, issues associated with scalability (performance and robustness), multi-core support, cluster support, and application support.
Backwards compatibility may be maintained; routines may be deprecated.


MPI_Count: Larger Types

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
Counts are expressed as int / INTEGER
  usually limited to 2^31
Proposal: a new type, MPI_Count
  can be larger than an int / INTEGER
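
Until larger counts are available, a common workaround (a hedged sketch, not from the slides; buf, dest, tag, and comm are assumed to exist) is to wrap the data in a derived datatype so the int count stays small:

/* Send 2^32 doubles even though 'count' is a 32-bit int:
   group the data into chunks of 2^20 doubles and send 4096 chunks. */
MPI_Datatype chunk;
MPI_Type_contiguous(1 << 20, MPI_DOUBLE, &chunk);
MPI_Type_commit(&chunk);
MPI_Send(buf, 4096, chunk, dest, tag, comm);   /* 4096 * 2^20 = 2^32 elements */
MPI_Type_free(&chunk);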


Fault Tolerance Issues

Items being discussed:
  Define a consistent error response and reporting across the standard
  Clearly define the failure response for current MPI dynamics (master/slave fault tolerance)
  Recovery of:
    communicators
    file handles
    RMA windows
  Data piggybacking
  Dynamic communicators
  Asynchronous dynamic process control


The MPI_T Performance Interface

Goal: provide tools with access to MPI-internal information
  Access to configuration/control and performance variables
  MPI-implementation agnostic: tools query the available information

Examples of performance variables:
  Number of packets sent
  Time spent blocking
  Memory allocated
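
A minimal sketch of how a tool can query what MPI_T exposes, as standardized in MPI-3 (the counts reported depend entirely on the MPI implementation):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided, num_cvars = 0, num_pvars = 0;

    /* MPI_T can be initialized independently of MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_cvar_get_num(&num_cvars);   /* configuration/control variables */
    MPI_T_pvar_get_num(&num_pvars);   /* performance variables */

    printf("This MPI exposes %d control and %d performance variables\n",
           num_cvars, num_pvars);

    MPI_T_finalize();
    return 0;
}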


Hybrid Programming
Ensure that MPI has the features necessary to facilitate efficient hybrid programming
Investigate what changes are needed in MPI to better support:
  Traditional thread interfaces (e.g., Pthreads, OpenMP)
  Emerging interfaces (like TBB, OpenCL, CUDA, and Ct)
  PGAS (UPC, CAF, etc.)
  Shared memory


References & Acknowledgements

http://bccd.net/wiki/index.php/GalaxSee
http://www.montgomerycollege.edu/Departments/planet/GalaxSEE/help_docs/win_galaxy_tutorial.html
http://www.shodor.org/
http://www.slac.stanford.edu/comp/unix/farm/mpi_and_openmp.html
http://openmp.org/sc13/HybridPP_Slides.pdf
https://docs.loni.org/wiki/Introduction_to_Programming_Hybrid_Applications_Using_OpenMP_and_MPI
http://www.mpi-forum.org/docs/docs.html


THANK YOU!

Questions?
