COMPUTING

CROSS-SITE COMPUTATIONS ON THE TERAGRID
The TeraGrid’s collective computing resources can help researchers perform very-large-scale
simulations in computational fluid dynamics (CFD) applications, but doing so requires
tightly coupled communications among different sites. The authors examine a scaled-down
turbulent flow problem, investigating the feasibility and scalability of cross-site simulation
paradigms, targeting grand challenges such as blood flow in the entire human arterial tree.
SUCHUAN DONG AND GEORGE EM KARNIADAKIS, Brown University
NICHOLAS T. KARONIS, Northern Illinois University and Argonne National Laboratory

1521-9615/05/$20.00 © 2005 IEEE
Copublished by the IEEE CS and the AIP

Launched by the US National Science Foundation in August 2001, the TeraGrid (www.teragrid.org) integrates the most powerful open resources in the US, providing 50 Tflops of processing power and 1.5 Pbytes of online storage connected via a 40-Gbits-per-second network. Although the TeraGrid offers potentially unlimited scalability, the key question facing computational scientists is how to effectively adapt their applications to such a complex and heterogeneous network. Indeed, parallel scientific computing is currently at a crossroads. The emergence of the message-passing interface (MPI) as well as domain decomposition algorithms and their corresponding freeware (such as ParMETIS1) has made parallel computing available to the wider scientific community, allowing first-principles simulations of turbulence at very fine scales,2 blood flow in the human heart,3 and global climate change.4 Unfortunately, simulations designed to capture detailed physicochemical, mechanical, or biological processes have also demonstrated widely varying characteristics.5,6 Some applications are computation intensive, requiring extremely powerful computing systems, whereas others are data intensive,7 necessitating the creation or mining of multiterabyte data archives.

Efficiently and effectively harnessing grid computing's power requires applications that can exploit ensembles of supercomputers, which in turn requires the ability to match application requirements and characteristics with grid resources. The challenges in developing grid-enabled applications lie primarily in the high degree of system heterogeneity and the grid environment's dynamic behavior. For example, a grid can have a highly heterogeneous and unbalanced communication network whose bandwidth and latency characteristics vary widely over time and space. Computers in grid environments can also have radically different operating systems and utilities.

Grid technology—primarily in the form of Globus-family services (www.globus.org)—has largely overcome the difficulties in managing such a heterogeneous environment. With these services' uniform mechanisms for user authentication, accounting, resource access, and data transfer, users and applications can discover and utilize disparate resources in coordinated ways. In particular, the emergence of scientific-application-oriented grid …cal sciences.
Grand Challenges

Our work is motivated by two grand-challenge problems in biological and physical sciences: the simulation of blood flow in the entire human arterial tree, and the direct numerical simulation (DNS) of bluff-body turbulent wake flows. Both problems are significant from both fundamental and application viewpoints, and their resolution …

[Figure 1: drag coefficient versus Reynolds number; legend entries: Experiment; Henderson (1997), 2D simulations; Ma et al. (2000), Re = 3,900; Dong and Karniadakis (2005), Re = 10,000.]

SEPTEMBER/OCTOBER 2005 15
…scribed requires a total memory of 3 to 7 Tbytes, which is beyond any single supercomputer's current capacity. The TeraGrid is the only possible way to accommodate such a simulation.

The second problem, DNS of the "drag crisis" in turbulent bluff-body flows, is a fundamental grand-challenge problem in fluid dynamics. (The drag crisis is the sudden drop of drag force near Reynolds number Re = 300,000; see Figure 1b.) The need to resolve all the energetic scales in the DNS down to the Kolmogorov scale dictates that the number of grid points should be on the order of Re^(9/4), or roughly a trillion grid points at drag-crisis conditions. Concentration of turbulence in the bluff-body wake and nonuniform meshing effectively reduces the required number of grid points to a few billion. The appropriate mesh now consists of approximately 512 to 768 Fourier modes along the cylinder axis and 50,000 to 80,000 spectral elements in the nonhomogeneous planes, with a spectral polynomial order of 6 to 10 on each element. A monolithic simulation with such resolutions requires more than 4 Tbytes of memory, exceeding any open supercomputer's current capacity. Just as in the human arterial tree simulation, the TeraGrid is the only viable option.

These extremely large biological and physical simulations share a common characteristic: the solution process requires tightly coupled communications among different TeraGrid sites. This is in sharp contrast to other grid application scenarios in which a monolithic application runs on one grid site while the data it produces is moved to another site for visualization or postprocessing (for example, the TeraGyroid project, www.realitygrid.org/TeraGyroid.html). A big question here is the scalability of an application involving multiple TeraGrid sites and the slowdown factor of cross-site runs compared to single-site runs under otherwise identical conditions.

Using Code

To investigate these issues and the feasibility of cross-site runs on the TeraGrid, we'll look at a scaled-down version of the drag-crisis problem—specifically, we'll use the simulation of turbulent flow past a circular cylinder at lower Reynolds numbers (Re = 3,900 and 10,000) as a prototype. We'll employ Fourier spectral expansions in the homogeneous direction and a spectral element discretization in the nonhomogeneous planes to efficiently handle a multiconnected computational domain.10

To conduct such a DNS, we use a high-order CFD code called Nektar (www.nektar.info/2nd_edition/) in our computations. It employs a spectral/hp element method to discretize in space and a semi-implicit scheme to discretize in time.10 The mesh consists of structured or unstructured grids (or a combination of both) similar to those used in standard finite element and finite volume methods. We employ Jacobi polynomial expansions to represent flow variables; these expansions provide multiresolution, a way to hierarchically refine numerical solutions by increasing the order of the expansion (p-refinement) within each element without needing to regenerate the mesh, thus avoiding a significant overhead cost.

We can use MPICH-G2 for the cross-site communication: it's a Globus-based MPI library that extends MPICH to use the Globus Toolkit's services.8 During the computation, MPICH-G2 selects the most efficient communication method possible between two processes, using vendor-supplied MPI whenever available. MPICH-G2 uses information in the Globus Resource Specification Language (RSL) script to create multilevel clustering of the processes based on the underlying network topology; it stores this information as attributes in the MPI communicators for applications to retrieve and exploit.

Computation Algorithms

Two cross-site parallel algorithms based on different data distribution strategies can both minimize the number of cross-site communications and overlap cross-site communications with in-site computations and communications. Both algorithms are designed based on a two-level parallelization strategy.11 Let's consider the turbulent flow past a circular cylinder. The following incompressible Navier-Stokes equations govern the flow:

    ∂u_i/∂t + Σ_{j=1}^{3} u_j ∂u_i/∂x_j = −∂p/∂x_i + (1/Re) Σ_{j=1}^{3} ∂²u_i/∂x_j²,   i = 1, 2, 3,

    Σ_{i=1}^{3} ∂u_i/∂x_i = 0,

where x_i (i = 1, 2, 3) are the interchangeable coordinates used with x, y, z; u_i (i = 1, 2, 3) are the three velocity components; and p is the pressure. We normalize all length scales with the cylinder diameter D, all velocity components with the free-stream velocity U0, and the pressure with ρU0² (where ρ is the fluid density). The Reynolds number Re is expressed by …
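The resolution estimate and the nondimensionalization above can be sketched numerically. This is a minimal illustration, assuming the standard definition Re = U0·D/ν (the article's own expression for Re is truncated at the page break):

```python
# Sketch of the resolution estimate discussed above: the DNS grid-point
# count scales as N ~ Re^(9/4). The Reynolds number definition assumes
# the standard form Re = U0 * D / nu, since the article's expression is
# cut off in this extract.

def dns_grid_points(re: float) -> float:
    """Order-of-magnitude grid-point count for DNS, N ~ Re^(9/4)."""
    return re ** 2.25

def reynolds_number(u0: float, d: float, nu: float) -> float:
    """Re based on free-stream velocity U0, cylinder diameter D,
    and kinematic viscosity nu."""
    return u0 * d / nu

# At drag-crisis conditions, Re ~ 300,000 gives on the order of a
# trillion grid points, as stated in the text.
print(f"{dns_grid_points(300_000):.1e}")
```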
Figure 2. Turbulent flow past a circular cylinder. At Reynolds number Re = 10,000, we can compare the streamwise root-mean-square (RMS) velocity u′/U0 between (a) the particle-image-velocimetry (PIV) experiment and (b) the current direct numerical simulation (DNS). Contours are plotted on the same levels for the experiment and the DNS: u′min/U0 = 0.1, with an increment of 0.025 between contour lines.
…results, but this article's emphasis is on the computing aspect (algorithm and performance), so we've chosen to include only one comparison with experiment here. Specifically, let's examine some simulation results for turbulent flows past bluff bodies obtained on the TeraGrid clusters. We can compare the statistical characteristics in the turbulent wake of a circular cylinder at Reynolds number Re = 10,000 between our 3D DNS and the particle-image-velocimetry (PIV) experiments.12 This Reynolds number is the highest that the DNS has achieved for this flow so far.

Figure 2 shows a comparison of the streamwise root-mean-square (RMS) velocity fluctuation u′ normalized by the free-stream velocity U0 between the experiment (Figure 2a) and the simulation (Figure 2b). We plotted experimental and DNS results on identical contour levels, with a minimum RMS value u′/U0 = 0.1 and an incremental value of 0.025 between contour lines. The distribution patterns show strong fluctuations in the separating shear layers and two maxima associated with the vortex formation. The downstream locations of the RMS maxima are essentially the same for both experiment and simulation (at x = 1.14 for the experiment and at x = 1.13 for the simulation), and the respective peak values are also the same: u′max/U0 = 0.5.

At Reynolds numbers in the subcritical range (above 1,000 and below the "drag crisis" Reynolds number), the separating shear layers behind the cylinder become unstable and small-scale vortices develop in the shear layers (so-called shear-layer vortices). Our simulation data shows that the frequency of shear-layer vortices follows a scaling of Re^0.67 with respect to the Reynolds number, in agreement with experimental observations.17 Furthermore, the values of shear-layer frequencies from our computation agree with the experimental values; for example, at Re = 10,000, our simulation predicts a value (normalized by the Strouhal frequency) of 11.83 whereas it's 11.25 from experiment.17

Performance Results

To evaluate the efficiency of the computation algorithms, we conducted single-site runs on three TeraGrid sites—the US National Center for Supercomputing Applications (NCSA), the San Diego Supercomputer Center (SDSC), and the Pittsburgh Supercomputing Center (PSC). We also ran cross-site runs between the SDSC and the NCSA, and between the two TeraGrid machines at the PSC. The NCSA and SDSC TeraGrid machines have Intel IA-64 processors (Itanium-2, 1.5 GHz) whereas those at the PSC have Compaq Alpha processors (Alpha EV68, 1 GHz).

Single-Site Performance

Figure 3a demonstrates Nektar's scalability, showing parallel speedup with respect to the number of processors on the PSC TeraGrid cluster for a fixed problem size with 300 million degrees of freedom. The parallel efficiency exceeds 95 percent on 1,024 processors and 85 percent on 1,536 processors. The test problem here is the turbulent cylinder
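The Re^0.67 scaling reported above can be illustrated with a simple log-log least-squares fit. This is a sketch only: the data below are synthetic, generated from the reported exponent, not the article's measurements.

```python
# Illustrative sketch: recover a power-law exponent f ~ Re^alpha from
# frequency data via linear regression in log-log space. The data here
# are synthetic, generated from the reported scaling exponent 0.67.
import math

def fit_power_law(re_values, freq_values):
    """Fit f = c * Re^alpha in log-log space; returns (alpha, c)."""
    xs = [math.log(r) for r in re_values]
    ys = [math.log(f) for f in freq_values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    c = math.exp(my - alpha * mx)
    return alpha, c

# Synthetic measurements following the reported Re^0.67 scaling:
res = [2_000, 3_900, 5_000, 10_000]
freqs = [0.005 * r ** 0.67 for r in res]
alpha, _ = fit_power_law(res, freqs)
print(round(alpha, 3))  # recovers 0.67
```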
Figure 3. Nektar’s performance on the Pittsburgh Supercomputing Center’s TeraGrid cluster. (a) Parallel speedup with respect
to the number of processors for a fixed problem size, and (b) wall-clock time per step (in seconds) as a function of the number
of processors for a fixed workload per processor. As the problem size increases, the number of processors increases
proportionally to keep the workload per processor constant.
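The two measures in Figure 3 can be stated precisely in a few lines. This is a generic sketch of the standard definitions, with made-up timings, not code or measurements from Nektar:

```python
# Generic scalability measures used in Figure 3 (illustrative only).

def speedup(t_base: float, p_base: int, t_p: float) -> float:
    """Speedup on p processors relative to a baseline run on p_base
    processors that took t_base seconds per step."""
    return p_base * t_base / t_p

def efficiency(t_base: float, p_base: int, t_p: float, p: int) -> float:
    """Parallel efficiency: achieved speedup divided by the ideal speedup p."""
    return speedup(t_base, p_base, t_p) / p

# Made-up example: halving the time per step while quadrupling the
# processor count corresponds to 50 percent parallel efficiency.
print(efficiency(10.0, 256, 5.0, 1024))  # 0.5
```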
flow at Reynolds number Re = 10,000 based on the free-stream velocity and cylinder diameter. We used a spectral element mesh with 9,272 triangular elements in the nonhomogeneous planes; the number of Fourier planes in the spanwise direction is 256 in this test.

Scalability for a fixed workload per processor is another important measure. In this set of tests, as the problem size increases, the number of processors increases proportionally such that the workload on each processor remains unchanged. The test problem is still the turbulent cylinder flow at Re = 10,000; in the nonhomogeneous plane, we use the same grid resolution, but vary the problem size by varying the number of Fourier planes in the spanwise direction. In this test, the number of Fourier planes increases from 8 to 128, and the number of processors increases proportionally from 32 to 512 to keep the workload per processor constant. For the case of 512 processors (128 Fourier planes), we use 64 processors for computations in the homogeneous direction and eight processors for computations in the nonhomogeneous planes of the two-level parallelization. Figure 3b shows the wall-clock time per step (in seconds) as a function of the number of processors from the test. The ideal result would be a constant wall-clock time per step for any number of processors (a flat curve), but we see only a slight increase in wall-clock time as the number of processors increases from 32 to 512, indicating good scalability.

We used native (vendor) MPI libraries in all these tests, but because cross-site communications are based on the MPICH-G2 library, we also want to test the performance differences between MPICH-G2 and the native MPI implementation in a single-site environment. MPICH-G2 hides low-level operational details from the application, including communication channel selection (vendor or TCP), data conversion, resource allocation, and computation management. MPICH-G2 has taken measures such as eliminating memory copies and unnecessary message polling to minimize the overhead cost.8

Figure 4 shows a performance comparison of Nektar compiled with MPICH-G2 and native MPI on the SDSC's TeraGrid cluster (we expect the results would be valid for all TeraGrid machines, particularly the ones that have identical compute and switch hardware). The test problem is the turbulent cylinder flow at Reynolds number Re = 3,900. We plot the wall-clock time per step as a function of the total number of processors for a fixed problem size, and we use a spectral element mesh with 902 triangular elements in the nonhomogeneous planes and 128 planes in the spanwise direction; the polynomial order is 8 on all elements. MPICH-G2 demonstrates performance virtually identical to the native MPI, indicating a negligible overhead cost.

Figure 5 compares the performances of the Fourier modal-based and physical variable-based algorithms discussed earlier—specifically, it compares the wall-clock time per step and the speedup factor with respect to the number of processors for a fixed problem size with 24 Fourier planes in the spanwise direction. Due to the configuration of the physical variable-based algorithm, the number of
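The 512-processor layout described above (64 groups along the homogeneous Fourier direction, eight processors per nonhomogeneous plane) amounts to a simple two-level rank mapping. The sketch below is hypothetical: the article does not show Nektar's actual mapping, which MPICH-G2's topology attributes can inform.

```python
# Hypothetical sketch of a two-level process layout: a flat rank is
# split into (group along the homogeneous/Fourier direction, rank
# within the nonhomogeneous plane). Not Nektar's actual code.

def two_level_map(rank: int, procs_per_plane: int) -> tuple:
    """Return (fourier_group, plane_rank) for a flat rank."""
    return divmod(rank, procs_per_plane)

# 512 processors with 8 per plane gives 64 groups in the Fourier
# direction, matching the 64 x 8 decomposition described in the text.
groups = {two_level_map(r, 8)[0] for r in range(512)}
print(len(groups))  # 64
```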
the processors are split between the NCSA and SDSC TeraGrid machines—in a 256-CPU cross-site run, for example, 128 processors are from both the NCSA and the SDSC …

[Figure 4: wall-clock time per step (seconds) versus number of processors; MPICH-G2 versus native MPI.]
Figure 5. Algorithm performance. Comparing the Fourier modal-based and physical variable-based algorithms for a fixed
problem size on the Pittsburgh Supercomputing Center’s TeraGrid cluster, we can match up (a) wall-clock time and (b) parallel
speedup as a function of the total number of processors. Speedup is calculated based on the wall-clock time on three processors.
[Figure: wall-clock time per step (seconds) versus number of processors; MPICH-G2, NCSA/SDSC cross-site versus MPICH-G2, NCSA single-site.]

…cross-site runs, again, half the processors are from the NCSA and the other half are from the SDSC; we used MPICH-G2 in both cross- and single-site runs. In single-site runs, the wall-clock time increases very slightly as the number of processors increases from eight to 128, indicating excellent scalability. In cross-site runs, we observe a larger increase in wall-clock time as the number of …

Compared to single-site runs, the slowdown ratio of cross-site runs decreases dramatically as the number of processors increases. We performed the cross-site runs with a Fourier modal-based algorithm, which is characterized by a stressful all-to-all type of cross-site communication and in a sense represents the worst-case scenario. For applications …
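A toy cost model, with invented numbers rather than measurements, shows why overlapping cross-site communication with in-site computation, as the algorithms above are designed to do, reduces the cross-site slowdown:

```python
# Toy per-step cost model (invented numbers, not measurements):
# without overlap, computation and cross-site communication costs add;
# with perfect overlap, only the larger of the two is paid.

def step_time(comp: float, comm: float, overlap: bool) -> float:
    """Idealized wall-clock time per step."""
    return max(comp, comm) if overlap else comp + comm

print(step_time(1.0, 0.8, overlap=False))  # 1.8
print(step_time(1.0, 0.8, overlap=True))   # 1.0
```

In this idealized picture, communication that is fully hidden behind computation costs nothing extra, which is why the article points to multithreading as a way to "truly overlap" the two.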
Table 1. The performance of cross-site runs for flow past a cylinder at Reynolds number Re = 10,000.

Cross-site run       CPUs from the US National Center      CPUs from the San Diego    Wall-clock time
                     for Supercomputing Applications       Supercomputer Center       per step (seconds)
Total of 256 CPUs    128                                   128                        16.307
                     144                                   112                        16.467
                     160                                   96                         15.935
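The Table 1 measurements can be read programmatically; the helper below is illustrative, with the values copied from the table:

```python
# Pick the NCSA/SDSC processor split with the lowest wall-clock time
# per step, using the measurements from Table 1.

runs = [
    # (CPUs at NCSA, CPUs at SDSC, wall-clock seconds per step)
    (128, 128, 16.307),
    (144, 112, 16.467),
    (160, 96, 15.935),
]

best = min(runs, key=lambda r: r[2])
print(best)  # the uneven 160/96 split is slightly fastest
```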
Acknowledgments

The US National Science Foundation, the Office of Naval Research, and the Defense Advanced Research Projects Agency (DARPA) supported this work. Computer time for the TeraGrid was provided through the US National Center for Supercomputing Applications (NCSA), the San Diego Supercomputer Center (SDSC), and the Pittsburgh Supercomputing Center (PSC). We thank John Towns (NCSA), David O'Neal (PSC), Donald Frederick (SDSC), Frank Wells (NCSA), Dongju Choi (SDSC), and the TeraGrid support team for their assistance in setting up the TeraGrid for the benchmark tests.
…characterized by less stressful communication patterns, we can achieve even better performance for cross-site runs. The human arterial tree simulation shows a much less demanding communication characteristic; we're currently developing and testing this application in cross-site computations on the TeraGrid as well as on machines between the US and UK. Several techniques can further boost cross-site performance to possibly even match single-site performance—for example, multithreading to truly overlap cross-site communications with in-site computations, and User Datagram Protocol (UDP)-based messaging to improve cross-site communication's bandwidth utilization. Developers are currently incorporating these approaches into the next-generation implementation of MPICH-G2. Grid computing enabled by Globus/MPICH-G2 and other grid services and middleware (see NaReGi, www.naregi.org; DEISA, www.deisa.org; and PACX-MPI16) hold the key to the solution of …

Figure 7. Benchmarking for a fixed workload per processor. In comparing the US National Center for Supercomputing Applications (NCSA)/San Diego Supercomputer Center (SDSC) cross-site runs with the NCSA single-site runs, the problem size increases proportionally as the number of processors increases such that the workload per processor remains unchanged.

References

1. G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM J. Scientific Computing, vol. 20, no. 1, 1998, pp. 359–392.
2. M. Yokokawa et al., "16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator," Proc. Supercomputing 2002, IEEE CS Press, 2002; www.sc-conference.org/sc2002/.
3. S.J. Kovacs, D.M. McQueen, and C.S. Peskin, "Modelling Cardiac Fluid Dynamics and Diastolic Function," Philosophical Trans. Royal Soc. London, Series A: Mathematical, Physical and Eng. Sciences, vol. 359, 15 June 2001, pp. 1299–1314.
4. S. Shingu et al., "A 26.58 Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator," Proc. Supercomputing 2002, IEEE CS Press, 2002; www.sc-conference.org/sc2002/.
5. G. Allen et al., "Supporting Efficient Execution in Heterogeneous Distributed Computing Environments with Cactus and Globus," Proc. Supercomputing 2001, IEEE CS Press, 2001; www.sc2001.org/.
6. M. Russel et al., "The Astrophysics Simulation Collaboratory: A Science Portal Enabling Community Software Development," Cluster Computing, vol. 5, no. 3, 2002, pp. 297–304.
7. M.S. Allen and R. Wolski, "The Livny and Plank-Beck Problems: Studies in Data Movement on the Computational Grid," Proc. Supercomputing 2003, IEEE CS Press, 2003; www.sc-conference.org/sc2003/.
8. N.T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface," J. Parallel and Distributed Computing, vol. 63, no. 5, 2003, pp. 551–563.
9. S.J. Sherwin et al., "Computational Modeling of 1D Blood Flow with Variable Mechanical Properties in the Human Arterial System," Int'l J. Numerical Methods in Fluids, vol. 43, nos. 6–7, 2003, pp. 673–700.
10. G.E. Karniadakis and S.J. Sherwin, Spectral/hp Element Methods …