Use OpenMP for programming parallel threads in multicore
applications: Part 1
The challenges of threading a loop

By Shameem Akhter and Jason Roberts, Intel Corp.


Embedded.com
(08/15/07, 10:00:00 PM EDT)

The major CPU vendors are shifting gears, choosing to add parallelism support on-chip with multi-core
processors in order to avoid many of the technological hurdles of boosting clock speeds, while still offering
better-performing processors.

However, if your software does not take advantage of these multiple cores, it may not run any faster. That is
where OpenMP plays a key role by providing an easy method for threading applications without burdening
the programmer with the complications of creating, synchronizing, load balancing, and destroying threads.

The OpenMP standard was formulated in 1997 as an API for writing portable, multithreaded applications. It
started as a Fortran-based standard, but later grew to include C and C++. The current version is OpenMP
Version 2.5, which supports Fortran, C, and C++.

Intel C++ and Fortran compilers support the OpenMP Version 2.5 standard. The OpenMP programming model
provides a platform-independent set of compiler pragmas, directives, function calls, and environment
variables that explicitly instruct the compiler how and where to use parallelism in the application.

Many loops can be threaded by inserting only one pragma right before the loop, as demonstrated by the
examples in this series. By leaving the nitty-gritty details to the compiler and the OpenMP runtime library, you
can spend more time determining which loops should be threaded and how best to restructure the algorithms
for performance on multi-core processors. The full potential of OpenMP is realized when it is used to thread
the most time-consuming loops; that is, the hot spots.

Tackling the topic of OpenMP is an intimidating task. Therefore, this series of articles serves as a bridge for
you, allowing you to reach a point where you have a fundamental understanding of threading with OpenMP
from which you can build your broader practical knowledge.

The power and simplicity of OpenMP can be demonstrated by looking at an example. The following loop
converts each 32-bit RGB (red, green, blue) pixel in an array into an 8-bit grayscale pixel. The one pragma,
which has been inserted immediately before the loop, is all that is needed for parallel execution under
OpenMP.
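
The code below is a representative sketch of such a loop (the structure layout, the names pRGB, pGray,
and numPixels, and the grayscale weights are illustrative assumptions, not the original listing):

    /* Convert 32-bit RGB pixels to 8-bit grayscale. The single pragma
       before the loop is all that OpenMP needs to split the iterations
       across the available threads. */
    typedef struct { unsigned char red, green, blue, pad; } rgb_pixel_t;

    void rgb_to_gray(const rgb_pixel_t *pRGB, unsigned char *pGray, int numPixels)
    {
        #pragma omp parallel for
        for (int i = 0; i < numPixels; i++) {
            /* Weighted sum of the color channels (BT.601 luma weights). */
            pGray[i] = (unsigned char)(pRGB[i].red   * 0.299 +
                                       pRGB[i].green * 0.587 +
                                       pRGB[i].blue  * 0.114);
        }
    }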

Let's take a closer look at the loop. First, the example uses work-sharing, which is the general term that
OpenMP uses to describe distributing work across threads. When work-sharing is used with the for construct,
as shown in this example, the iterations of the loop are distributed among multiple threads.

The OpenMP implementation determines how many threads to create and how best to manage them. All the
programmer needs to do is tell OpenMP which loop should be threaded; there is no need to write a lot of
code for creating, initializing, managing, and destroying threads in order to exploit parallelism.

The OpenMP compiler and runtime library take care of these and many other details behind the scenes. The
current OpenMP specification, Version 2.5, places the following five restrictions on which loops can be
threaded:

1) The loop variable must be of type signed integer. Unsigned integers will not work. (Note: this restriction is
to be removed in the future OpenMP specification Version 3.0).

2) The comparison operation must be in the form loop_variable <, <=, >, or >= loop_invariant_integer.

3) The third expression, or increment portion, of the for loop must be either integer addition or integer
subtraction by a loop-invariant value.

4) If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if
the comparison operation is > or >=, the loop variable must decrement on every iteration.

5) The loop must be a single-entry, single-exit loop, meaning that no jumps from the inside of the loop to the
outside, or from the outside to the inside, are permitted, with the exception of the exit statement, which
terminates the whole application. If the statements goto or break are used, they must jump within the loop,
not outside it. The same goes for exception handling: exceptions must be caught within the loop.

Although these restrictions may sound somewhat limiting, most loops can easily be rewritten to conform to
them. The restrictions listed above must be observed so that the compiler can parallelize loops via OpenMP.
However, even when the compiler parallelizes the loop, you must still ensure the loop is functionally correct
by watching out for the issues in the next section.
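
For instance, a hypothetical loop that uses an unsigned counter and a != test violates rules 1 and 2;
switching to a signed index and a < comparison against a loop-invariant bound makes it threadable
(a sketch with illustrative names):

    /* Non-conforming form:
     *     for (unsigned int i = 0; i != n; i++)
     *         a[i] *= scale;
     * Conforming rewrite: signed loop variable, < comparison against a
     * loop-invariant bound, integer increment by 1. */
    void scale_array(float *a, int n, float scale)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= scale;
    }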

Challenges in Threading a Loop


Threading a loop means converting independent loop iterations into threads and running those threads in
parallel. In some sense, this is a re-ordering transformation in which the original order of loop iterations is
replaced by an undetermined order.

In addition, because the loop body is not an atomic operation, statements in two different iterations may
run simultaneously. In theory, it is valid to convert a sequential loop to a threaded loop if the loop carries no
dependence.

Therefore, the first challenge for you is to identify or restructure the hot loop to make sure that it has no
loop-carried dependence before adding OpenMP pragmas.

Loop-carried Dependence. Even if the loop meets all five criteria and the compiler threads the loop, it may
still not work correctly because of data dependences that the compiler ignores due to the presence of OpenMP
pragmas.

The theory of data dependence imposes two requirements that must be met for a statement S2 to be
data dependent on statement S1:

1) There must exist a possible execution path such that statement S1 and S2 both reference the same
memory location L.

2) The execution of S1 that references L occurs before the execution of S2 that references L.

In order for S2 to depend upon S1, it is necessary for some execution of S1 to write to a memory location L
that is later read by an execution of S2. This is also called flow dependence.

Other dependencies exist when two statements write the same memory location L, called an output
dependence, or a read occurs before a write, called an anti-dependence. This pattern can occur in one of two
ways:

1) S1 can reference the memory location L on one iteration of a loop; on a subsequent iteration S2 can
reference the same memory location L.

2) S1 and S2 can reference the same memory location L on the same loop iteration, but with S1 preceding
S2 during execution of the loop iteration.

The first case is an example of loop-carried dependence, since the dependence exists when the loop is
iterated. The second case is an example of loop-independent dependence; the dependence exists because of
the position of the code within the loops.
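
A short sketch makes the distinction concrete (the arrays and statements below are hypothetical, not taken
from the original):

    void dependence_kinds(int *a, const int *b, int *c, int n)
    {
        /* Loop-independent dependence: S2 reads the value S1 wrote in the
         * same iteration. Each iteration is self-contained, so this loop
         * can still be threaded safely. */
        for (int k = 1; k < n; k++) {
            a[k] = b[k] + 1;      /* S1 */
            c[k] = a[k] * 2;      /* S2 */
        }

        /* Loop-carried dependence (distance d = 1): S2 reads the value S1
         * wrote on the previous iteration, so the iterations cannot be run
         * independently. */
        for (int k = 1; k < n; k++) {
            a[k] = b[k] + 1;      /* S1 */
            c[k] = a[k - 1] * 2;  /* S2 */
        }
    }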

Table 6.1 below shows three cases of loop-carried dependences with dependence distance d, where
1 ≤ d ≤ n, and n is the loop upper bound.


Table 6.1 The Different Cases of Loop-carried Dependences

Let's take a look at the following example, where d = 1 and n = 99. The write to location x[k] at iteration k
in S1 and the read from it at iteration k+1 in S2 create a loop-carried flow dependence.

Furthermore, with the read from location y[k-1] at iteration k in S1 and a write to it at iteration k+1
in S2, a loop-carried anti-dependence also exists. In this case, if a parallel for pragma is inserted for
threading this loop, you will get wrong results.
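
A representative reconstruction of the kind of recurrence being described follows (the statements are
assumptions, not the original listing):

    void dependent_recurrence(void)
    {
        int x[100], y[100];
        x[0] = 0;
        y[0] = 1;

        /* Do NOT do this: the compiler obeys the pragma and threads the
         * loop, but each iteration reads values written by the previous
         * iteration, so dependent iterations may run concurrently and
         * produce wrong results. */
        #pragma omp parallel for
        for (int k = 1; k < 100; k++) {
            x[k] = y[k-1] + 1;   /* S1 */
            y[k] = x[k-1] + 2;   /* S2 */
        }
    }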

Because OpenMP directives are commands to the compiler, the compiler will thread this loop. However, the
threaded code will fail because of loop-carried dependence. The only way to fix this kind of problem is to
rewrite the loop or to pick a different algorithm that does not contain the loop-carried dependence.

With this example, you can first predetermine the initial values of x[49] and y[49]; then, you can apply the
loop strip-mining technique to create a strip loop m that carries no dependence.

Finally, you can insert the parallel for pragma to parallelize loop m. By applying this transformation, the
original loop can be executed by two threads on a dual-core processor system.
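
Continuing the sketch above, the strip-mined form might look like the following; the boundary values
x[49] = 74 and y[49] = 74 follow from the recurrence assumed in that sketch (x[k] = x[k-2] + 3 and
y[k] = y[k-2] + 3):

    void strip_mined_recurrence(void)
    {
        int x[100], y[100];
        x[0] = 0;
        y[0] = 1;
        x[49] = 74;   /* precomputed so the second strip can start independently */
        y[49] = 74;

        /* The outer loop m over the two strips carries no dependence, so it
         * can be threaded; the remaining dependences stay inside each strip. */
        #pragma omp parallel for
        for (int m = 0; m < 2; m++) {
            int k_begin = (m == 0) ? 1  : 50;
            int k_end   = (m == 0) ? 49 : 100;   /* exclusive bound */
            for (int k = k_begin; k < k_end; k++) {
                x[k] = y[k-1] + 1;   /* S1 */
                y[k] = x[k-1] + 2;   /* S2 */
            }
        }
    }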

Besides the parallel for pragma, you can also use the parallel sections pragma to parallelize the original
loop that has the loop-carried dependence for a dual-core processor system.
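
The same split can be expressed with the parallel sections pragma, again as a sketch continuing the
example above:

    void recurrence_with_sections(void)
    {
        int x[100], y[100];
        x[0] = 0;
        y[0] = 1;
        x[49] = 74;   /* precomputed boundary values, as before */
        y[49] = 74;

        #pragma omp parallel sections
        {
            #pragma omp section
            {
                for (int k = 1; k < 49; k++) {     /* first half */
                    x[k] = y[k-1] + 1;
                    y[k] = x[k-1] + 2;
                }
            }

            #pragma omp section
            {
                for (int k = 50; k < 100; k++) {   /* second half */
                    x[k] = y[k-1] + 1;
                    y[k] = x[k-1] + 2;
                }
            }
        }
    }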


Data-race Conditions. Data-race conditions are often due to output dependences, in which multiple
threads attempt to update the same memory location, or variable, after threading.

In general, the OpenMP C++ and Fortran compilers honor OpenMP pragmas and directives when they
encounter them during compilation; however, the compiler does not detect data-race conditions.

Thus, a loop similar to the following example, in which multiple threads update the variable x, will lead
to undesirable results. In such a situation, the code needs to be modified via privatization or synchronized
using mechanisms such as mutexes.

For example, you can simply add the private(x) clause to the parallel for pragma to eliminate the data-race
condition on variable x for this loop.
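
The following sketch shows the kind of loop being described and the fix (the computation and names are
illustrative, not the original listing):

    #include <math.h>
    #include <stdio.h>

    void race_example(void)
    {
        int x;   /* shared by default inside a parallel region */

        /* Racy version - do NOT do this: every thread reads and writes the
         * single shared x, so another thread may overwrite it between the
         * assignment and the printf.
         *
         *     #pragma omp parallel for
         *     for (int k = 0; k < 80; k++) {
         *         x = (int)(sin(k * 2.0) * 100.0) + 1;
         *         printf("x[%d] = %d\n", k, x);
         *     }
         */

        /* Fixed version: private(x) gives each thread its own copy of x,
         * eliminating the data race. */
        #pragma omp parallel for private(x)
        for (int k = 0; k < 80; k++) {
            x = (int)(sin(k * 2.0) * 100.0) + 1;
            printf("x[%d] = %d\n", k, x);
        }
    }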

As discussed previously, data-race conditions can sometimes be difficult to spot; that is, more difficult than
shown in this example. When using the full thread synchronization tools in the Windows API or in Pthreads,
developers are more likely to avoid these issues because data is designed from the start to be managed with
threading contention and race conditions in mind.

However, when using OpenMP, it is easier to overlook data-race conditions. One tool that helps identify such
situations is Intel Thread Checker, which is an add-on to Intel VTune Performance Analyzer.

Managing Shared and Private Data


In writing multithreaded programs, understanding which data is shared and which is private becomes
extremely important, not only to performance, but also for program correctness.

OpenMP makes this distinction apparent to the programmer through a set of clauses such as shared, private,
and default, and it is something that you can set manually. With OpenMP, it is the developer's responsibility
to indicate to the compiler which pieces of memory should be shared among the threads and which pieces
should be kept private.

When memory is identified as shared, all threads access the exact same memory location. When memory is
identified as private, however, a separate copy of the variable is made for each thread to access in private.
When the loop exits, these private copies become undefined. By default, all the variables in a parallel region
are shared, with three exceptions.

1) In parallel for loops, the loop index is private. In the example below, the k variable is private.

2) Variables that are local to the block of the parallel region are private.
3) Any variables listed in the private, firstprivate, lastprivate, or reduction clauses are private. The
privatization is done by making a distinct copy of each of these variables for each thread.
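
A brief sketch illustrates these defaults (the function and variable names are hypothetical):

    /* k is the loop index, so it is private (exception 1); t is declared
     * inside the parallel region, so it is private (exception 2); sum is
     * named in a reduction clause, so it is privatized and combined at the
     * end (exception 3). The pointers a and b and the count n stay shared. */
    double dot_product(const double *a, const double *b, int n)
    {
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (int k = 0; k < n; k++) {
            double t = a[k] * b[k];
            sum += t;
        }
        return sum;
    }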

Each of the four clauses takes a list of variables, but their semantics are all different. The private clause says
that each variable in the list should have a private copy made for each thread.

This private copy is created using the variable's default constructor where appropriate; for built-in types such
as int, the private copy starts out uninitialized, so it must be assigned a value before it is read (alternatively,
the firstprivate clause initializes each copy with the value of the original variable). In OpenMP, memory can be
declared as private in the following three ways:

1) Use the private, firstprivate, lastprivate, or reduction clause to specify variables that need to be private for
each thread.

2) Use the threadprivate pragma to specify the global variables that need to be private for each thread (see
the sketch after this list).

3) Declare the variable inside the loop - really inside the OpenMP parallel region - without the static keyword.
Because static variables are statically allocated in a designated memory area by the compiler and linker, they
are not truly private like other variables declared within a function, which are allocated within the stack frame
for the function.
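
As a sketch of the second approach, the threadprivate pragma can be applied to a file-scope variable (the
variable and function names are hypothetical):

    /* Each thread gets its own persistent copy of scratch_count, so the
     * increments below do not race across threads. */
    static int scratch_count = 0;
    #pragma omp threadprivate(scratch_count)

    void count_iterations(int n)
    {
        #pragma omp parallel for
        for (int k = 0; k < n; k++) {
            scratch_count++;   /* updates this thread's own copy */
        }
    }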

The following loop fails to function correctly because the variable x is shared when it needs to be private.
In the example below, the loop fails due to the loop-carried output dependence on the variable x.

The x is shared among all threads under the OpenMP default sharing rule, so there is a data-race condition
on x: while one thread is reading x, another thread might be writing to it.

This problem can be fixed in either of the following two ways, which both declare the variable x as private
memory.
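
A sketch of the failing loop and the two fixes (do_work and the array are placeholders, not the original
listing):

    int do_work(int v);   /* placeholder for the real computation */

    void fix_shared_temporary(int *array, int n)
    {
        /* Broken version - x is shared by default, so threads race on it:
         *
         *     int x;
         *     #pragma omp parallel for
         *     for (int k = 0; k < n; k++) {
         *         x = array[k];
         *         array[k] = do_work(x);
         *     }
         */

        /* Fix 1: list x in a private clause so each thread has its own copy. */
        int x;
        #pragma omp parallel for private(x)
        for (int k = 0; k < n; k++) {
            x = array[k];
            array[k] = do_work(x);
        }

        /* Fix 2: declare the temporary inside the parallel region, which
         * makes it private automatically. */
        #pragma omp parallel for
        for (int k = 0; k < n; k++) {
            int val = array[k];
            array[k] = do_work(val);
        }
    }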

Every time you use OpenMP to parallelize a loop, you should carefully examine all memory references,
including the references made by called functions.

Next in Part 2: Loop scheduling, partitioning and thread overhead

This article was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts.
Copyright © 2006 Intel Corporation. All rights reserved.

Shameem Akhter is a platform architect at Intel Corporation, focusing on single socket multi-core architecture
and performance analysis. Jason Roberts is a senior software engineer at Intel, and has worked on a number
of different multi-threaded software products that span a wide range of applications targeting desktop,
handheld, and embedded DSP platforms.

To read more on Embedded.com about the topics discussed in this article, go to "More about
multicores, multiprocessors and multithreading."
