
OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform, shared-memory multiprocessing programming in C, C++, and Fortran on most platforms, processor architectures, and operating systems, including Solaris, AIX, HP-UX, Linux, OS X, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
OpenMP is managed by the nonprofit technology consortium OpenMP Architecture Review Board (or OpenMP ARB), jointly defined by a group of major computer hardware and software vendors, including AMD, IBM, Intel, Cray, HP, Fujitsu, Nvidia, NEC, Red Hat, Texas Instruments, Oracle Corporation, and more.
OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.

OpenMP is an Application Program Interface (API) that may be used to explicitly direct multi-threaded,
shared memory parallelism.

It comprises three primary API components:

Compiler Directives

Runtime Library Routines

Environment Variables
OpenMP Is Not:

Meant for distributed memory parallel systems (by itself)

Necessarily implemented identically by all vendors

Guaranteed to make the most efficient use of shared memory

Required to check for data dependencies, data conflicts, race conditions, deadlocks, or code sequences that cause a
program to be classified as non-conforming

Designed to handle parallel I/O. The programmer is responsible for synchronizing input and output.
Goals of OpenMP:

Standardization:

Provide a standard among a variety of shared memory architectures/platforms

Jointly defined and endorsed by a group of major computer hardware and software vendors

Lean and Mean:

Establish a simple and limited set of directives for programming shared memory machines.

Significant parallelism can be implemented by using just 3 or 4 directives.

This goal is becoming less meaningful with each new release, apparently.

Ease of Use:

Provide the capability to incrementally parallelize a serial program, unlike message-passing libraries which typically require an all-or-nothing approach

Provide the capability to implement both coarse-grain and fine-grain parallelism

Portability:

The API is specified for C/C++ and Fortran

Public forum for API and membership

Implementations exist for most major platforms, including Unix/Linux platforms and Windows.

History:

In the early 90's, vendors of shared-memory machines supplied similar, directive-based Fortran programming extensions:

The user would augment a serial Fortran program with directives specifying which loops were to be parallelized

The compiler would be responsible for automatically parallelizing such loops across the SMP processors

Implementations were all functionally similar, but were diverging (as usual)

First attempt at a standard was the draft for ANSI X3H5 in 1994. It was never adopted, largely due to waning interest as distributed memory machines became popular.

However, not long after this, newer shared memory machine architectures started to become prevalent, and interest
resumed.

The OpenMP standard specification started in the spring of 1997, taking over where ANSI X3H5 had left off.

Led by the OpenMP Architecture Review Board (ARB), a group of major hardware and software vendors.

Release History

OpenMP continues to evolve - new constructs and features are added with each release.

Initially, the API specifications were released separately for C and Fortran. Since 2005, they have been released together.

The table below chronicles the OpenMP API release history.

Version       Date
Fortran 1.0   Oct 1997
C/C++ 1.0     Oct 1998
Fortran 1.1   Nov 1999
Fortran 2.0   Nov 2000
C/C++ 2.0     Mar 2002
OpenMP 2.5    May 2005
OpenMP 3.0    May 2008
OpenMP 3.1    Jul 2011
OpenMP 4.0    Jul 2013
OpenMP 4.5    Nov 2015

OpenMP Programming Model
Shared Memory Model:

OpenMP is designed for multi-processor/core, shared memory machines. The underlying architecture can be shared
memory UMA or NUMA.

Thread Based Parallelism:

OpenMP programs accomplish parallelism exclusively through the use of threads.

A thread of execution is the smallest unit of processing that can be scheduled by an operating system. The idea of a
subroutine that can be scheduled to run autonomously might help explain what a thread is.

Threads exist within the resources of a single process. Without the process, they cease to exist.

Typically, the number of threads matches the number of machine processors/cores. However, the actual use of threads is up to the application.
Explicit Parallelism:

OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.

Parallelization can be as simple as taking a serial program and inserting compiler directives....

Or as complex as inserting subroutines to set multiple levels of parallelism, locks and even nested locks.
Fork - Join Model:

OpenMP uses the fork-join model of parallel execution:

All OpenMP programs begin as a single process: the master thread. The master thread executes sequentially until the
first parallel region construct is encountered.
FORK: the master thread then creates a team of parallel threads.
The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the
various team threads.
JOIN: When the team threads complete the statements in the parallel region construct, they synchronize and terminate,
leaving only the master thread.
The number of parallel regions and the threads that comprise them are arbitrary.
Compiler Directive Based:

Most OpenMP parallelism is specified through the use of compiler directives which are embedded in C/C++ or Fortran source code.
Nested Parallelism:

The API provides for the placement of parallel regions inside other parallel regions.
Implementations may or may not support this feature.
Dynamic Threads:

The API allows the runtime environment to dynamically alter the number of threads used to execute parallel regions. This is intended to promote more efficient use of resources, where possible.

Implementations may or may not support this feature.


I/O:

OpenMP specifies nothing about parallel I/O. This is particularly important if multiple threads attempt to write/read
from the same file.

If every thread conducts I/O to a different file, the issues are not as significant.

It is entirely up to the programmer to ensure that I/O is conducted correctly within the context of a multi-threaded
program.
Memory Model: FLUSH Often?

OpenMP provides a "relaxed-consistency" and "temporary" view of thread memory (in their words). In other words, threads can "cache" their data and are not required to maintain exact consistency with real memory all of the time.
When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is FLUSHed by all threads as needed.

The core elements of OpenMP are the constructs for thread creation, workload distribution (work sharing), data-environment management, thread
synchronization, user-level runtime routines and environment variables.
In C/C++, OpenMP uses #pragmas. The OpenMP specific pragmas are listed below.

Thread creation
The pragma omp parallel is used to fork additional threads to carry out the work enclosed in the construct in parallel. The original
thread will be denoted as master thread with thread ID 0.
Example (C program): Display "Hello, world." using multiple threads.

#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    printf("Hello, world.\n");
    return 0;
}

Use the -fopenmp flag to compile with GCC:

$ gcc -fopenmp hello.c -o hello

Output on a computer with two cores, and thus two threads:

Hello, world.

Hello, world.

Work-sharing constructs
Used to specify how to assign independent work to one or all of the threads.

omp for or omp do: used to split up loop iterations among the threads; also called loop constructs.

sections: assigns consecutive but independent code blocks to different threads

single: specifies a code block that is executed by only one thread; a barrier is implied at the end

master: similar to single, but the code block is executed by the master thread only, and no barrier is implied at the end.

Example: initialize the value of a large array in parallel, using each thread to do part of the work

int main(int argc, char **argv)
{
    int a[100000];

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        a[i] = 2 * i;
    }
    return 0;
}

The loop counter i is declared inside the parallel for loop in C99 style, which gives each thread a unique and private version of the variable.

OpenMP clauses
Since OpenMP is a shared memory programming model, most variables in OpenMP code are visible to all threads by default. However, private variables are sometimes necessary to avoid race conditions, and values must be passed between the sequential part and the parallel region (the code block executed in parallel). Data environment management is therefore provided through data sharing attribute clauses, which are appended to the OpenMP directive. The different types of clauses are:

Data sharing attribute clauses

shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By
default, all variables in the work sharing region are shared except the loop iteration counter.

private: the data within a parallel region is private to each thread, which means each thread will have a local copy and
use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel
region. By default, the loop iteration counters in the OpenMP loop constructs are private.

default: allows the programmer to state that the default data scoping within a parallel region will be either shared or none for C/C++, or shared, firstprivate, private, or none for Fortran. The none option forces the programmer to declare each variable in the parallel region using the data sharing attribute clauses.

firstprivate: like private, except each thread's copy is initialized to the variable's original value.

lastprivate: like private, except the original variable is updated with the value from the last iteration (or last section) after the construct.

reduction: a safe way of combining work from all threads after the construct.

Synchronization clauses

critical: the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by
multiple threads. It is often used to protect shared data from race conditions.

atomic: the memory update (write, or read-modify-write) in the next instruction will be performed atomically. It does
not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware
instructions for better performance than when using critical.

ordered: the structured block is executed in the order in which iterations would be executed in a sequential loop

barrier: each thread waits until all of the other threads of a team have reached this point. A work-sharing construct has
an implicit barrier synchronization at the end.

nowait: specifies that threads completing assigned work can proceed without waiting for all threads in the team to
finish. In the absence of this clause, threads encounter a barrier synchronization at the end of the work sharing construct.

Scheduling clauses

schedule(type, chunk): This is useful if the work sharing construct is a do-loop or for-loop. The iteration(s) in the work
sharing construct are assigned to threads according to the scheduling method defined by this clause. The three types of
scheduling are:

1. static: All iterations are allocated to threads before they execute the loop. By default, the iterations are divided among the threads equally. However, specifying an integer for the parameter chunk allocates chunk contiguous iterations to a particular thread.
2. dynamic: Some of the iterations are allocated to threads up front. Once a particular thread finishes its allocated iterations, it returns to get more from the iterations that are left. The parameter chunk defines the number of contiguous iterations allocated to a thread at a time.
3. guided: A large chunk of contiguous iterations is allocated to each thread dynamically (as above). The chunk size decreases exponentially with each successive allocation, down to a minimum size specified in the parameter chunk.

Environment variables
A method to alter the execution features of OpenMP applications. Used to control loop iteration scheduling, the default number of threads, etc. For example, OMP_NUM_THREADS is used to specify the number of threads for an application.
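For example (the ./hello binary name is illustrative; the environment variables are standard OpenMP):

```shell
# Request 4 threads for this run
OMP_NUM_THREADS=4 ./hello

# Other commonly used variables:
export OMP_SCHEDULE="dynamic,4"   # schedule for loops declared schedule(runtime)
export OMP_DYNAMIC=true           # allow the runtime to adjust thread counts
```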
