COMPUTING

CROSS-SITE COMPUTATIONS ON THE TERAGRID
The TeraGrid’s collective computing resources can help researchers perform very-large-scale
simulations in computational fluid dynamics (CFD) applications, but doing so requires
tightly coupled communications among different sites. The authors examine a scaled-down
turbulent flow problem, investigating the feasibility and scalability of cross-site simulation
paradigms, targeting grand challenges such as blood flow in the entire human arterial tree.
SUCHUAN DONG AND GEORGE EM KARNIADAKIS, Brown University
NICHOLAS T. KARONIS, Northern Illinois University and Argonne National Laboratory

1521-9615/05/$20.00 © 2005 IEEE
Copublished by the IEEE CS and the AIP

Launched by the US National Science Foundation in August 2001, the TeraGrid (www.teragrid.org) integrates the most powerful open resources in the US, providing 50 Tflops of processing power and 1.5 Pbytes of online storage connected via a 40-Gbits-per-second network. Although the TeraGrid offers potentially unlimited scalability, the key question facing computational scientists is how to effectively adapt their applications to such a complex and heterogeneous network. Indeed, parallel scientific computing is currently at a crossroads. The emergence of the message-passing interface (MPI) as well as domain decomposition algorithms and their corresponding freeware (such as ParMETIS1) has made parallel computing available to the wider scientific community, allowing first-principles simulations of turbulence at very fine scales,2 blood flow in the human heart,3 and global climate change.4 Unfortunately, simulations designed to capture detailed physicochemical, mechanical, or biological processes have also demonstrated widely varying characteristics.5,6 Some applications are computation intensive, requiring extremely powerful computing systems, whereas others are data intensive,7 necessitating the creation or mining of multiterabyte data archives.

Efficiently and effectively harnessing grid computing's power requires applications that can exploit ensembles of supercomputers, which in turn requires the ability to match application requirements and characteristics with grid resources. The challenges in developing grid-enabled applications lie primarily in the high degree of system heterogeneity and the grid environment's dynamic behavior. For example, a grid can have a highly heterogeneous and unbalanced communication network whose bandwidth and latency characteristics vary widely over time and space. Computers in grid environments can also have radically different operating systems and utilities.

Grid technology—primarily in the form of Globus-family services (www.globus.org)—has largely overcome the difficulties in managing such a heterogeneous environment. With these services' uniform mechanisms for user authentication, accounting, resource access, and data transfer, users and applications can discover and utilize disparate resources in coordinated ways. In particular, the emergence of scientific-application-oriented grid …cal sciences.
Grand Challenges

Our work is motivated by two grand-challenge problems in biological and physical sciences: the simulation of blood flow in the entire human arterial tree, and the direct numerical simulation (DNS) of bluff-body turbulent wake flows. Both problems are significant from both fundamental and application viewpoints, and their resolution …

[Figure 1: drag coefficient versus Reynolds number; legend entries: Experiment; Henderson (1997), 2D simulations; Ma et al. (2000), Re = 3,900; Dong and Karniadakis (2005), Re = 10,000.]

SEPTEMBER/OCTOBER 2005 15
…scribed requires a total memory of 3 to 7 Tbytes, which is beyond any single supercomputer's current capacity. The TeraGrid is the only possible way to accommodate such a simulation.

The second problem, DNS of the "drag crisis" in turbulent bluff-body flows, is a fundamental grand-challenge problem in fluid dynamics. (The drag crisis is the sudden drop of drag force near Reynolds number Re = 300,000; see Figure 1b.) The need to resolve all the energetic scales in the DNS down to the Kolmogorov scale dictates that the number of grid points should be on the order of Re^(9/4), or roughly a trillion grid points at drag-crisis conditions. Concentration of turbulence in the bluff-body wake and nonuniform meshing effectively reduces the required number of grid points to a few billion. The appropriate mesh now consists of approximately 512 to 768 Fourier modes along the cylinder axis and 50,000 to 80,000 spectral elements in the nonhomogeneous planes, with a spectral polynomial order of 6 to 10 on each element. A monolithic simulation with such resolutions requires more than 4 Tbytes of memory, exceeding any open supercomputer's current capacity. Just as in the human arterial tree simulation, the TeraGrid is the only viable option.

These extremely large biological and physical simulations share a common characteristic: the solution process requires tightly coupled communications among different TeraGrid sites. This is in sharp contrast to other grid application scenarios in which a monolithic application runs on one grid site while the data it produces is moved to another site for visualization or postprocessing (for example, the TeraGyroid project, www.realitygrid.org/TeraGyroid.html). A big question here is the scalability of an application involving multiple TeraGrid sites and the slowdown factor of cross-site runs compared to single-site runs under otherwise identical conditions.

Using Code

To investigate these issues and the feasibility of cross-site runs on the TeraGrid, we'll look at a scaled-down version of the drag-crisis problem—specifically, we'll use the simulation of turbulent flow past a circular cylinder at lower Reynolds numbers (Re = 3,900 and 10,000) as a prototype. We'll employ Fourier spectral expansions in the homogeneous direction and a spectral element discretization in the nonhomogeneous planes to efficiently handle a multiconnected computational domain.10

To conduct such a DNS, we use a high-order CFD code called Nektar (www.nektar.info/2nd_edition/) in our computations. It employs a spectral/hp element method to discretize in space and a semi-implicit scheme to discretize in time.10 The mesh consists of structured or unstructured grids (or a combination of both) similar to those used in standard finite element and finite volume methods. We employ Jacobi polynomial expansions to represent flow variables; these expansions provide multiresolution, a way to hierarchically refine numerical solutions by increasing the order of the expansion (p-refinement) within each element without needing to regenerate the mesh, thus avoiding a significant overhead cost.

We can use MPICH-G2 for the cross-site communication: it's a Globus-based MPI library that extends MPICH to use the Globus Toolkit's services.8 During the computation, MPICH-G2 selects the most efficient communication method possible between two processes, using vendor-supplied MPI whenever available. MPICH-G2 uses information in the Globus Resource Specification Language (RSL) script to create multilevel clustering of the processes based on the underlying network topology; it stores this information as attributes in the MPI communicators for applications to retrieve and exploit.

Computation Algorithms

Two cross-site parallel algorithms based on different data distribution strategies can both minimize the number of cross-site communications and overlap cross-site communications with in-site computations and communications. Both algorithms are designed based on a two-level parallelization strategy.11 Let's consider the turbulent flow past a circular cylinder. The following incompressible Navier-Stokes equations govern the flow:

    ∂u_i/∂t + Σ_{j=1}^{3} u_j ∂u_i/∂x_j = −∂p/∂x_i + (1/Re) Σ_{j=1}^{3} ∂²u_i/∂x_j²,   i = 1, 2, 3,

    Σ_{i=1}^{3} ∂u_i/∂x_i = 0,

where x_i (i = 1, 2, 3) are the interchangeable coordinates used with x, y, z; u_i (i = 1, 2, 3) are the three velocity components; and p is the pressure. We normalize all length scales with the cylinder diameter D, all velocity components with the free-stream velocity U0, and the pressure with ρU0² (where ρ is the fluid density). The Reynolds number Re is expressed by …
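The resolution estimate and the nondimensionalization above can be sketched numerically. This is a minimal illustration, assuming the standard definition Re = U0·D/ν (the article's own expression for Re is truncated at the page break):

```python
# Sketch of the resolution estimate discussed above: the DNS grid-point
# count scales as N ~ Re^(9/4). The Reynolds number definition assumes
# the standard form Re = U0 * D / nu, since the article's expression is
# cut off in this extract.

def dns_grid_points(re: float) -> float:
    """Order-of-magnitude grid-point count for DNS, N ~ Re^(9/4)."""
    return re ** 2.25

def reynolds_number(u0: float, d: float, nu: float) -> float:
    """Re based on free-stream velocity U0, cylinder diameter D,
    and kinematic viscosity nu."""
    return u0 * d / nu

# At drag-crisis conditions, Re ~ 300,000 gives on the order of a
# trillion grid points, as stated in the text.
print(f"{dns_grid_points(300_000):.1e}")
```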
Figure 2. Turbulent flow past a circular cylinder. At Reynolds number Re = 10,000, we can compare the streamwise root-mean-square (RMS) velocity u′/U0 between (a) the particle-image-velocimetry (PIV) experiment and (b) the current direct numerical simulation (DNS). Contours are plotted on the same levels for the experiment and the DNS: u′min/U0 = 0.1, with an increment of 0.025 between contour lines.
…results, but this article's emphasis is on the computing aspect (algorithm and performance), so we've chosen to include only one comparison with experiment here. Specifically, let's examine some simulation results for turbulent flows past bluff bodies obtained on the TeraGrid clusters. We can compare the statistical characteristics in the turbulent wake of a circular cylinder at Reynolds number Re = 10,000 between our 3D DNS and the particle-image-velocimetry (PIV) experiments.12 This Reynolds number is the highest that the DNS has achieved for this flow so far.

Figure 2 shows a comparison of the streamwise root-mean-square (RMS) velocity fluctuation u′ normalized by the free-stream velocity U0 between the experiment (Figure 2a) and the simulation (Figure 2b). We plotted experimental and DNS results on identical contour levels, with a minimum RMS value u′/U0 = 0.1 and an incremental value of 0.025 between contour lines. The distribution patterns show strong fluctuations in the separating shear layers and two maxima associated with the vortex formation. The downstream locations of the RMS maxima are essentially the same for both experiment and simulation (at x = 1.14 for the experiment and at x = 1.13 for the simulation), and the respective peak values are also the same: u′max/U0 = 0.5.

At Reynolds numbers in the subcritical range (above 1,000 and below the "drag crisis" Reynolds number), the separating shear layers behind the cylinder become unstable and small-scale vortices develop in the shear layers (so-called shear-layer vortices). Our simulation data shows that the frequency of shear-layer vortices follows a scaling of Re^0.67 with respect to the Reynolds number, in agreement with experimental observations.17 Furthermore, the values of shear-layer frequencies from our computation agree with the experimental values; for example, at Re = 10,000, our simulation predicts a value (normalized by the Strouhal frequency) of 11.83 whereas it's 11.25 from experiment.17

Performance Results

To evaluate the efficiency of the computation algorithms, we conducted single-site runs on three TeraGrid sites—the US National Center for Supercomputing Applications (NCSA), the San Diego Supercomputer Center (SDSC), and the Pittsburgh Supercomputing Center (PSC). We also ran cross-site runs between the SDSC and the NCSA, and between the two TeraGrid machines at the PSC. The NCSA and SDSC TeraGrid machines have Intel IA-64 processors (Itanium-2, 1.5 GHz) whereas those at the PSC have Compaq Alpha processors (Alpha EV68, 1 GHz).

Single-Site Performance

Figure 3a demonstrates Nektar's scalability, showing parallel speedup with respect to the number of processors on the PSC TeraGrid cluster for a fixed problem size with 300 million degrees of freedom. The parallel efficiency exceeds 95 percent on 1,024 processors and 85 percent on 1,536 processors. The test problem here is the turbulent cylinder
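The Re^0.67 scaling reported above can be illustrated with a simple log-log least-squares fit. This is a sketch only: the data below are synthetic, generated from the reported exponent, not the article's measurements.

```python
# Illustrative sketch: recover a power-law exponent f ~ Re^alpha from
# frequency data via linear regression in log-log space. The data here
# are synthetic, generated from the reported scaling exponent 0.67.
import math

def fit_power_law(re_values, freq_values):
    """Fit f = c * Re^alpha in log-log space; returns (alpha, c)."""
    xs = [math.log(r) for r in re_values]
    ys = [math.log(f) for f in freq_values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    c = math.exp(my - alpha * mx)
    return alpha, c

# Synthetic measurements following the reported Re^0.67 scaling:
res = [2_000, 3_900, 5_000, 10_000]
freqs = [0.005 * r ** 0.67 for r in res]
alpha, _ = fit_power_law(res, freqs)
print(round(alpha, 3))  # recovers 0.67
```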
Figure 3. Nektar’s performance on the Pittsburgh Supercomputing Center’s TeraGrid cluster. (a) Parallel speedup with respect
to the number of processors for a fixed problem size, and (b) wall-clock time per step (in seconds) as a function of the number
of processors for a fixed workload per processor. As the problem size increases, the number of processors increases
proportionally to keep the workload per processor constant.
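The two measures in Figure 3 can be stated precisely in a few lines. This is a generic sketch of the standard definitions, with made-up timings, not code or measurements from Nektar:

```python
# Generic scalability measures used in Figure 3 (illustrative only).

def speedup(t_base: float, p_base: int, t_p: float) -> float:
    """Speedup on p processors relative to a baseline run on p_base
    processors that took t_base seconds per step."""
    return p_base * t_base / t_p

def efficiency(t_base: float, p_base: int, t_p: float, p: int) -> float:
    """Parallel efficiency: achieved speedup divided by the ideal speedup p."""
    return speedup(t_base, p_base, t_p) / p

# Made-up example: halving the time per step while quadrupling the
# processor count corresponds to 50 percent parallel efficiency.
print(efficiency(10.0, 256, 5.0, 1024))  # 0.5
```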
flow at Reynolds number Re = 10,000 based on the free-stream velocity and cylinder diameter. We used a spectral element mesh with 9,272 triangular elements in the nonhomogeneous planes; the number of Fourier planes in the spanwise direction is 256 in this test.

Scalability for a fixed workload per processor is another important measure. In this set of tests, as the problem size increases, the number of processors increases proportionally such that the workload on each processor remains unchanged. The test problem is still the turbulent cylinder flow at Re = 10,000; in the nonhomogeneous plane, we use the same grid resolution, but vary the problem size by varying the number of Fourier planes in the spanwise direction. In this test, the number of Fourier planes increases from 8 to 128, and the number of processors increases proportionally from 32 to 512 to keep the workload per processor constant. For the case of 512 processors (128 Fourier planes), we use 64 processors for computations in the homogeneous direction and eight processors for computations in the nonhomogeneous planes of the two-level parallelization. Figure 3b shows the wall-clock time per step (in seconds) as a function of the number of processors from the test. The ideal result would be a constant wall-clock time per step for any number of processors (a flat curve), but we see only a slight increase in wall-clock time as the number of processors increases from 32 to 512, indicating good scalability.

We used native (vendor) MPI libraries in all these tests, but because cross-site communications are based on the MPICH-G2 library, we also want to test the performance differences between MPICH-G2 and the native MPI implementation in a single-site environment. MPICH-G2 hides low-level operational details from the application, including communication channel selection (vendor or TCP), data conversion, resource allocation, and computation management. MPICH-G2 has taken measures such as eliminating memory copies and unnecessary message polling to minimize the overhead cost.8

Figure 4 shows a performance comparison of Nektar compiled with MPICH-G2 and native MPI on the SDSC's TeraGrid cluster (we expect the results would be valid for all TeraGrid machines, particularly the ones that have identical compute and switch hardware). The test problem is the turbulent cylinder flow at Reynolds number Re = 3,900. We plot the wall-clock time per step as a function of the total number of processors for a fixed problem size, and we use a spectral element mesh with 902 triangular elements in the nonhomogeneous planes and 128 planes in the spanwise direction; the polynomial order is 8 on all elements. MPICH-G2 demonstrates performance virtually identical to the native MPI, indicating a negligible overhead cost.

Figure 5 compares the performances of the Fourier modal-based and physical variable-based algorithms discussed earlier—specifically, it compares the wall-clock time per step and the speedup factor with respect to the number of processors for a fixed problem size with 24 Fourier planes in the spanwise direction. Due to the configuration of the physical variable-based algorithm, the number of
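The 512-processor layout described above (64 groups along the homogeneous Fourier direction, eight processors per nonhomogeneous plane) amounts to a simple two-level rank mapping. The sketch below is hypothetical: the article does not show Nektar's actual mapping, which MPICH-G2's topology attributes can inform.

```python
# Hypothetical sketch of a two-level process layout: a flat rank is
# split into (group along the homogeneous/Fourier direction, rank
# within the nonhomogeneous plane). Not Nektar's actual code.

def two_level_map(rank: int, procs_per_plane: int) -> tuple:
    """Return (fourier_group, plane_rank) for a flat rank."""
    return divmod(rank, procs_per_plane)

# 512 processors with 8 per plane gives 64 groups in the Fourier
# direction, matching the 64 x 8 decomposition described in the text.
groups = {two_level_map(r, 8)[0] for r in range(512)}
print(len(groups))  # 64
```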
the processors are split between the NCSA and SDSC TeraGrid machines—in a 256-CPU cross-site run, for example, 128 processors are from both the NCSA and the SDSC …

[Figure 4: wall-clock time per step (seconds) versus number of processors; MPICH-G2 versus native MPI.]
Figure 5. Algorithm performance. Comparing the Fourier modal-based and physical variable-based algorithms for a fixed
problem size on the Pittsburgh Supercomputing Center’s TeraGrid cluster, we can match up (a) wall-clock time and (b) parallel
speedup as a function of the total number of processors. Speedup is calculated based on the wall-clock time on three processors.
[Figure: wall-clock time per step (seconds) versus number of processors; MPICH-G2, NCSA/SDSC cross-site versus MPICH-G2, NCSA single-site.]

…cross-site runs, again, half the processors are from the NCSA and the other half are from the SDSC; we used MPICH-G2 in both cross- and single-site runs. In single-site runs, the wall-clock time increases very slightly as the number of processors increases from eight to 128, indicating excellent scalability. In cross-site runs, we observe a larger increase in wall-clock time as the number of …

Compared to single-site runs, the slowdown ratio of cross-site runs decreases dramatically as the number of processors increases. We performed the cross-site runs with a Fourier modal-based algorithm, which is characterized by a stressful all-to-all type of cross-site communication and in a sense represents the worst-case scenario. For applications …
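A toy cost model, with invented numbers rather than measurements, shows why overlapping cross-site communication with in-site computation, as the algorithms above are designed to do, reduces the cross-site slowdown:

```python
# Toy per-step cost model (invented numbers, not measurements):
# without overlap, computation and cross-site communication costs add;
# with perfect overlap, only the larger of the two is paid.

def step_time(comp: float, comm: float, overlap: bool) -> float:
    """Idealized wall-clock time per step."""
    return max(comp, comm) if overlap else comp + comm

print(step_time(1.0, 0.8, overlap=False))  # 1.8
print(step_time(1.0, 0.8, overlap=True))   # 1.0
```

In this idealized picture, communication that is fully hidden behind computation costs nothing extra, which is why the article points to multithreading as a way to "truly overlap" the two.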
Table 1. The performance of cross-site runs for flow past a cylinder at Reynolds number Re = 10,000.

Cross-site run       CPUs from the US National Center      CPUs from the San Diego    Wall-clock time
                     for Supercomputing Applications       Supercomputer Center       per step (seconds)
Total of 256 CPUs    128                                   128                        16.307
                     144                                   112                        16.467
                     160                                   96                         15.935
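The Table 1 measurements can be read programmatically; the helper below is illustrative, with the values copied from the table:

```python
# Pick the NCSA/SDSC processor split with the lowest wall-clock time
# per step, using the measurements from Table 1.

runs = [
    # (CPUs at NCSA, CPUs at SDSC, wall-clock seconds per step)
    (128, 128, 16.307),
    (144, 112, 16.467),
    (160, 96, 15.935),
]

best = min(runs, key=lambda r: r[2])
print(best)  # the uneven 160/96 split is slightly fastest
```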
Acknowledgments

The US National Science Foundation, the Office of Naval Research, and the Defense Advanced Research Projects Agency (DARPA) supported this work. Computer time for the TeraGrid was provided through the US National Center for Supercomputing Applications (NCSA), the San Diego Supercomputer Center (SDSC), and the Pittsburgh Supercomputing Center (PSC). We thank John Towns (NCSA), David O'Neal (PSC), Donald Frederick (SDSC), Frank Wells (NCSA), Dongju Choi (SDSC), and the TeraGrid support team for their assistance in setting up the TeraGrid for the benchmark tests.
…characterized by less stressful communication patterns, we can achieve even better performance for cross-site runs. The human arterial tree simulation shows a much less demanding communication characteristic; we're currently developing and testing this application in cross-site computations on the TeraGrid as well as on machines between the US and UK. Several techniques can further boost cross-site performance to possibly even match single-site performance—for example, multithreading to truly overlap cross-site communications with in-site computations, and User Datagram Protocol (UDP)-based messaging to improve cross-site communication's bandwidth utilization. Developers are currently incorporating these approaches into the next-generation implementation of MPICH-G2. Grid computing enabled by Globus/MPICH-G2 and other grid services and middleware (see NaReGi, www.naregi.org; DEISA, www.deisa.org; and PACX-MPI16) hold the key to the solution of …

Figure 7. Benchmarking for a fixed workload per processor. In comparing the US National Center for Supercomputing Applications (NCSA)/San Diego Supercomputer Center (SDSC) cross-site runs with the NCSA single-site runs, the problem size increases proportionally as the number of processors increases such that the workload per processor remains unchanged.

References

1. G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM J. Scientific Computing, vol. 20, no. 1, 1998, pp. 359–392.
2. M. Yokokawa et al., "16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator," Proc. Supercomputing 2002, IEEE CS Press, 2002; www.sc-conference.org/sc2002/.
3. S.J. Kovacs, D.M. McQueen, and C.S. Peskin, "Modelling Cardiac Fluid Dynamics and Diastolic Function," Philosophical Trans. Royal Soc. London, Series A: Mathematical, Physical and Eng. Sciences, vol. 359, 15 June 2001, pp. 1299–1314.
4. S. Shingu et al., "A 26.58 Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator," Proc. Supercomputing 2002, IEEE CS Press, 2002; www.sc-conference.org/sc2002/.
5. G. Allen et al., "Supporting Efficient Execution in Heterogeneous Distributed Computing Environments with Cactus and Globus," Proc. Supercomputing 2001, IEEE CS Press, 2001; www.sc2001.org/.
6. M. Russel et al., "The Astrophysics Simulation Collaboratory: A Science Portal Enabling Community Software Development," Cluster Computing, vol. 5, no. 3, 2002, pp. 297–304.
7. M.S. Allen and R. Wolski, "The Livny and Plank-Beck Problems: Studies in Data Movement on the Computational Grid," Proc. Supercomputing 2003, IEEE CS Press, 2003; www.sc-conference.org/sc2003/.
8. N.T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface," J. Parallel and Distributed Computing, vol. 63, no. 5, 2003, pp. 551–563.
9. S.J. Sherwin et al., "Computational Modeling of 1D Blood Flow with Variable Mechanical Properties in the Human Arterial System," Int'l J. Numerical Methods in Fluids, vol. 43, nos. 6–7, 2003, pp. 673–700.
10. G.E. Karniadakis and S.J. Sherwin, Spectral/hp Element Methods …