
PARALLELIZATION AND PERFORMANCE OF THE NIM WEATHER MODEL ON CPU, GPU, AND MIC PROCESSORS

Mark Govett, Jim Rosinski, Jacques Middlecoff, Tom Henderson, Jin Lee, Alexander MacDonald, Ning Wang, Paul Madden, Julie Schramm, and Antonio Duarte

Next-generation supercomputers containing millions of processors will require weather prediction models to be designed and developed by scientists and software experts to ensure portability and efficiency on increasingly diverse HPC systems.

A new generation of high-performance computing (HPC) has emerged called fine grain or massively parallel fine grain (MPFG). The term "massively parallel" refers to large-scale HPC systems containing tens of thousands to millions of processing cores. "Fine grain" refers to loop-level parallelism that must be exposed in applications to permit thousands to millions of arithmetic operations to be executed every clock cycle. Two general classes of MPFG chips are available: Many Integrated Core (MIC) from Intel and graphics processing units (GPUs) from NVIDIA and Advanced Micro Devices (AMD) (see "Many-core and GPU computing explained" sidebar). In contrast to up to 18 cores on the latest-generation Intel Broadwell CPUs, these MPFG chips contain hundreds to thousands of processing cores. They provide 10–20 times greater peak performance than CPUs, and they appear in systems that increasingly dominate the list of top supercomputers in the world (Strohmaier et al. 2016). Peak performance does not translate to real application performance, however. Good performance can only be achieved if fine-grain parallelism can be found and exploited in the applications. Fortunately, most weather and climate codes contain a high degree of parallelism, making them good candidates for MPFG computing.

As a result, research groups worldwide have begun parallelizing their weather and climate prediction models for MPFG processors. The Swiss National Supercomputing Center (CSCS) has done the most comprehensive work so far. They parallelized the dynamical core of the Consortium for Small-Scale Modeling (COSMO) model for GPUs in 2013 (Fuhrer et al. 2014). At that time, no viable commercial FORTRAN GPU compilers were available, so the code was rewritten in C++ to enhance performance and portability. They reported the C++ version gave a 2.9-times speedup over the original FORTRAN code using same-generation dual-socket Intel Sandy Bridge CPU and Kepler K20x GPU chips. Parallelization of model physics in 2014 preserved the original FORTRAN code by using industry standard open accelerator (OpenACC) compiler directives for parallelization (Lapillonne and Fuhrer 2014). The entire model, including data assimilation, is now running operationally on GPUs at the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss).


Most atmospheric modeling groups exploring MPFG focused on the parallelization of model dynamics. The German Weather Service (DWD) and Max Planck Institute for Meteorology developed the Icosahedral Nonhydrostatic (ICON) dynamical core, which has been parallelized for GPUs. Early work by Sawyer et al. (2011) converted the FORTRAN code using NVIDIA-specific Compute Unified Device Architecture (CUDA) FORTRAN and the open computing language (OpenCL) and demonstrated a 2-times speedup over dual-socket CPU nodes. The invasive, platform-specific code changes were unacceptable to domain scientists, so current efforts are focused on minimal changes to the original code using OpenACC for parallelization (Sawyer et al. 2014). Another dynamical core, the Nonhydrostatic Icosahedral Atmospheric Model (NICAM), has been parallelized for GPUs, with a reported 7–8-times performance speedup comparing two 2013-generation K20x GPUs to one 2011-generation, dual-socket Intel Westmere CPU (Yashiro et al. 2014). Other dynamical cores parallelized for the GPU, including the finite volume cubed (FV3) model (Lin 2004) used in the Goddard Earth Observing System Model, version 5 (GEOS-5) (Putnam 2011), and the High-Order Method Modeling Environment (HOMME) (Carpenter et al. 2013), have shown some speedup versus the CPU.

Collectively, these experiences show that porting codes to GPUs can be challenging, but most users have reported speedups over CPUs. Over time, more mature GPU compilers have simplified parallelization and improved application performance. However, reporting of results has not been uniform and can be misleading. Ideally, comparisons should be made using the same source code, with optimizations applied faithfully to the CPU and GPU, and run on same-generation processors. When codes are rewritten, it becomes harder to make fair comparisons as multiple versions must be maintained and optimized. When different-generation hardware is used (e.g., 2010 CPUs vs 2013 GPUs), adjustments should be made to normalize reported speedups. Similarly, when comparisons are made with multiple GPUs attached to a single node, further adjustments should be made. Finally, comparisons between a GPU and a single CPU core give impressive speedups of 50–100 times, but such results are not useful or fair and require adjustment to factor in use of all cores available on the CPU.

When Intel released its MIC processor, called Knights Corner (KNC), in 2013, a new influx of researchers began exploring fine-grain computing. Research teams from the National Center for Atmospheric Research (NCAR)'s Community Earth System Model (CESM) (Kim et al. 2013), the Weather Research and Forecasting (WRF) Model (Michalakes et al. 2016), and the FV3 (Nguyen et al. 2013) reported little to no performance gain compared to the CPU. A more comprehensive parallelization for the MIC with National Oceanic and Atmospheric Administration (NOAA)'s Flow-Following Finite-Volume Icosahedral Model (FIM) (Bleck et al. 2015) included dynamics and physics running on the MIC (Rosinski 2015). Execution of the entire model on the KNC gave no performance benefit compared to the CPU. A common sentiment in these efforts is that porting applications to run on the MIC is easy, but getting good performance with KNC was difficult. This all changed with the release of Intel's Knights Landing (KNL) processor in early 2016. Research groups are now reporting 2-times or more improvement in application performance for KNL versus the CPU.

This paper describes the development of the nonhydrostatic icosahedral model (NIM), a dynamical core that was designed to exploit MPFG processors. The NIM was initially designed for NVIDIA GPUs in 2009. Since commercial FORTRAN GPU compilers were not available at that time, the FORTRAN-to-CUDA accelerator (F2C-ACC) (Govett et al. 2010) was codeveloped with NIM to convert FORTRAN code into CUDA, a high-level programming language used on NVIDIA GPUs (NVIDIA 2015). The F2C-ACC compiler has been the primary compiler used for execution of NIM on NVIDIA GPUs and has served as a benchmark for evaluation of commercial OpenACC compilers from Cray and The Portland Group International (PGI). Using the same source code, the NIM was ported to Intel MIC in 2013 when these processors became available.

We believe NIM is currently the only weather model that runs on CPU, GPU, and MIC processors with a single-source code. The dynamics portion of NIM uses open multiprocessing (OpenMP) (CPU and MIC), OpenACC (GPU), and F2C-ACC (GPU) directives for parallelization. Scalable Modeling System (SMS) directives and run-time library support MPI-based distributed-memory parallelism, including domain decomposition, interprocess communications, and input/output (I/O) operations (Govett et al. 2003). Collectively, these directives allow a single-source code to be maintained that is capable of running on CPU, GPU, and MIC processors for serial or parallel execution. Further, the NIM demonstrates efficient parallel performance and scalability to tens of thousands of compute nodes and has been useful for comparisons between CPU, GPU, and MIC processors (Govett et al. 2014, 2015, 2016).

AFFILIATIONS: Govett, Lee,* and MacDonald*—Global Systems Division, NOAA/Earth System Research Laboratory, Boulder, Colorado; Rosinski, Middlecoff, Henderson,* Wang, and Schramm—Cooperative Institute for Research in the Atmosphere, Colorado State University, Fort Collins, Colorado; Madden* and Duarte*—Cooperative Institute for Research in Environmental Sciences, University of Colorado Boulder, Boulder, Colorado
*CURRENT AFFILIATIONS: Henderson, Lee, MacDonald, and Madden—Spire Global, Inc., Boulder, Colorado; Duarte—Cheesecake Laboratories, Florianopolis, Brazil
CORRESPONDING AUTHOR: Mark Govett, mark.w.govett@noaa.gov
The abstract for this article can be found in this issue, following the table of contents.
DOI:10.1175/BAMS-D-15-00278.1
In final form 5 March 2017
©2017 American Meteorological Society
For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy.

MODEL DESIGN. NIM is a multiscale model, which has been designed, developed, and run globally at 3-km resolution with a goal to improve medium-range weather forecasts. The model was designed to explicitly permit convective cloud systems without the cumulus parameterizations typically used in models run at coarser scales. In addition, NIM has extended the conventional two-dimensional finite-volume approach into three-dimensional finite-volume solvers designed to improve pressure gradient calculation and orographic precipitation over complex terrain.

NIM uses the following innovations in the model formulation:

• a local coordinate system that remaps a spherical surface to a plane (Lee and MacDonald 2009),
• indirect addressing of grid cells to simplify the code and improve performance (MacDonald et al. 2011),
• flux-corrected transport formulated on finite-volume operators to maintain conservative and monotonic transport (Lee et al. 2010),
• all differentials evaluated as finite-volume integrals around the cells, and
• icosahedral–hexagonal grid optimization (Wang and Lee 2011).

The icosahedral–hexagonal grid is a key part of the model. This formulation approximates a sphere with a varying number of hexagons but always includes 12 pentagons (Sadourny et al. 1968; Williamson 1971). The key advantage of this formulation is the nearly uniform grid areas that are possible over a sphere, as illustrated in Fig. 1. This is in contrast to the latitude–longitude models that have dominated global weather and climate prediction for 30 years. The nearly uniform grid represents the poles without the notorious "pole problem" inherent in latitude–longitude grids, where meridians converge toward the poles.

Fig. 1. An illustration contrasting the converging grid points of a lat–lon grid vs the nearly uniform grid spacing of an icosahedral–hexagonal grid.



MANY-CORE AND GPU COMPUTING EXPLAINED

Many core and GPUs represent a new class of computing called MPFG. In contrast to CPU chips with up to 18 cores, these fine-grain processors contain hundreds to thousands of computational cores. Each individual core is slower than a traditional CPU core, but there are many more of them available to execute instructions simultaneously. This has required model calculations to become increasingly fine grained.

GPUs are designed for compute-intensive, highly parallel execution. GPUs contain up to 5,000 compute cores that execute instructions simultaneously. As a coprocessor to the CPU, work is given to the GPU in routines or regions of code called kernels. Loop-level calculations are typically executed in parallel in kernels. The OpenACC programming model designates three levels of parallelism for loop calculations: gang, worker, and vector that are mapped to execution threads and vector blocks on the GPU. Gang parallelism is for coarse-grain calculations. Worker-level parallelism is fine grain, where each gang will contain one or more workers. Vector parallelism is for single instruction multiple data (SIMD) or vector parallelism that is executed on the hardware simultaneously.

MIC hardware from Intel also provides the opportunity to exploit more parallelism than traditional CPU architectures. Like GPUs, the clock rate of the chips is 2–3 times slower than current-generation CPUs, with higher peak performance provided by additional processing cores, wider processing units, and a fused multiply–add (FMA) instruction. The programming model used to express parallelism on MIC hardware is traditional OpenMP threading along with vectorization. User code can be written to offload computationally intensive calculations from the CPU to the MIC (similar to GPU), run in MIC-only mode, or shared between MIC and CPU host.

NIM uses a fully three-dimensional finite-volume discretization scheme designed to improve pressure gradient calculations over complex terrain. Three-dimensional finite-volume operators also provide accurate and efficient tracer transport essential for next-generation global atmospheric models. Prognostic variables are collocated at horizontal cell centers (Arakawa and Lamb 1977). This simplifies looping constructs and reduces data dependencies in the code.

The numerical scheme uses a local coordinate system remapped from the spherical surface to a plane at each grid cell. All differentials are evaluated as finite-volume integrals around each grid cell. Passive tracers are strictly conserved to the round-off limit of single-precision floating-point operations. NIM governing equations are cast in conservative flux forms with mass flux to transport both momentum and tracer variables.
Computational design. NIM is a FORTRAN code containing a mix of FORTRAN 77 and FORTRAN 90 language constructs. It does not use derived types, pointers, or other constructs that can be challenging for compilers to support or run efficiently.1 The SMS library used by NIM for coarse-grain parallelism employs the message passing interface (MPI) library to handle domain decomposition, interprocess communications, reductions, and other MPI operations.

1 The OpenACC specification only recently added support for derived types; pointer abstractions may limit the ability of compilers to fully analyze and optimize calculations.

NIM was designed from the outset to maximize fine-grain or loop-level parallel computational capability of both NVIDIA GPU and Intel MIC architectures. Primary model computations are organized as simple dot products or vector operations and loops with no data-dependent conditionals or branching. The NIM dynamical core requires only single-precision floating-point computations and runs well on the CPU, achieving 10% of peak performance on an Intel Haswell CPU.

Grid cells can be stored in any order because a lookup table is used to indirectly access neighboring grid cells and edges on the icosahedral–hexagonal grid. The model's loop and array structures are organized with the vertical dimension innermost in dynamics routines. This organization effectively amortizes the cost of the indirect access of grid cells over the 96 vertical levels. Testing during model development verified there was a less than 1% performance penalty using this approach (MacDonald et al. 2011).
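The sketch below illustrates this loop organization. The array names (prox, nprox, u, div) and the edge convention are illustrative assumptions rather than NIM's actual variable names; the point is the indirect neighbor lookup in the horizontal and the independent, innermost vertical loop over which the lookup cost is amortized.

```fortran
! Sketch of NIM-style loop organization (illustrative names, not the
! actual NIM source). Horizontal cells are addressed indirectly through
! a neighbor lookup table; the vertical dimension is innermost so the
! lookup cost is amortized over the nz levels.
subroutine edge_sum(nz, nip, nprox, prox, u, div)
  implicit none
  integer, intent(in)  :: nz, nip          ! vertical levels, horizontal cells
  integer, intent(in)  :: nprox(nip)       ! number of edges per cell (5 or 6)
  integer, intent(in)  :: prox(6, nip)     ! lookup table of neighbor cell indices
  real,    intent(in)  :: u(nz, nip)       ! prognostic field, vertical innermost
  real,    intent(out) :: div(nz, nip)     ! accumulated edge differences
  integer :: ipn, isn, ipp, k

  do ipn = 1, nip                          ! horizontal loop over grid columns
    div(:, ipn) = 0.0
    do isn = 1, nprox(ipn)                 ! loop over the cell's edges
      ipp = prox(isn, ipn)                 ! indirect access of the neighbor cell
      do k = 1, nz                         ! innermost vertical loop, vectorizable
        div(k, ipn) = div(k, ipn) + u(k, ipp) - u(k, ipn)
      end do
    end do
  end do
end subroutine edge_sum
```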
NIM dynamics executes completely on the GPU. Model state remains resident in GPU global memory. Data are only copied between CPU and GPU for model initialization, interprocess communications, and output. GPU-to-GPU interprocess communications are handled via SMS directives and initiated by the CPU. Since physical parameterizations have not yet been ported to the GPU, data must also be moved between the GPU and CPU every physics time step. This constraint can be removed once the physics is also running on the GPU.

Parallelization of NIM for the MIC was trivial since the code had already been modified to run on the CPU and GPU. As a result, few code changes and optimizations were needed to run efficiently on the MIC processor.

PARALLELIZATION. NIM uses standards-compliant OpenMP (for CPU and MIC) and OpenACC (for GPU) directives for parallelization. OpenMP is the de facto standard for shared memory programming on the CPU and MIC processors, with recent extensions to support attached devices such as GPUs. OpenACC was developed initially to support GPUs, with more recent support for CPU (x86) and MIC processors. Both standards are striving toward performance portability, where a single set of directives is sufficient to run efficiently on CPU, GPU, MIC, and other processors.

Until recently, F2C-ACC was the primary compiler being used to parallelize and run NIM on NVIDIA GPUs. F2C-ACC was an effective way to push for improvements in commercial FORTRAN GPU compilers. Prior evaluation of OpenACC compilers and their predecessors was done in 2011 [Compiler and Architecture for Embedded and Superscalar Processors (CAPS), PGI] (Henderson et al. 2011), 2013 (PGI, Cray) (Govett 2013), and 2014 (PGI, Cray) (Govett et al. 2014). These evaluations exposed bugs and performance problems in the compilers. The problems identified have been corrected, making F2C-ACC no longer necessary.

OpenACC. GPU parallelization can be done in three phases: 1) define GPU kernels and identify loop-level parallelism, 2) minimize data movement, and 3) optimize performance. GPU kernels are regions of code, identified with the parallel or kernels directive, that are executed on the GPU. Loop-level parallelism is prescribed using the loop directive, with the optional keywords gang, worker, or vector, to identify the type of parallelism desired. These directives are generally sufficient to parallelize and run applications on GPUs. Further work involves optimization to minimize data movement and improve parallel performance.

Data movement between the CPU and GPU is handled automatically by the run-time system. However, copying data between the host (CPU) and device (GPU) is slow, so minimizing data movement is an important optimization needed to improve performance. The data directive can be used to manage data movement between the CPU and GPU explicitly. The need to manage data movement explicitly is expected to diminish with the introduction of unified memory in Pascal-generation chips. Unified memory is a way to programmatically treat CPU and GPU memory as a single large memory on NVIDIA hardware. Using NVIDIA's proprietary hardware called NVLink, the GPU can access CPU memory at the same speed as the CPU would, further reducing the requirement to manage data movement explicitly.
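The sketch below shows how these OpenACC phases look on a loop of the shape used in NIM. It is an illustrative example with made-up array names (u, tend), not code from the model: a data region keeps the fields resident in GPU memory across time steps, the parallel directive defines the kernel, and the loop directives map the horizontal and vertical loops to gang and vector parallelism.

```fortran
! Illustrative OpenACC parallelization (hypothetical arrays u and tend).
! Phase 1 defines the kernel and its loop-level parallelism; phase 2
! minimizes data movement by keeping the fields resident on the device.
subroutine step_gpu(nz, nip, nsteps, u, tend)
  implicit none
  integer, intent(in)    :: nz, nip, nsteps
  real,    intent(inout) :: u(nz, nip)      ! prognostic field
  real,    intent(in)    :: tend(nz, nip)   ! precomputed tendency
  real,    parameter     :: dt = 10.0       ! illustrative time step (s)
  integer :: ipn, k, it

  !$acc data copy(u) copyin(tend)           ! model state stays in GPU memory
  do it = 1, nsteps
    !$acc parallel                          ! kernel executed on the GPU
    !$acc loop gang                         ! coarse-grain parallelism over cells
    do ipn = 1, nip
      !$acc loop vector                     ! fine-grain parallelism over levels
      do k = 1, nz
        u(k, ipn) = u(k, ipn) + dt * tend(k, ipn)
      end do
    end do
    !$acc end parallel
  end do
  !$acc end data                            ! data copied back to the host once
end subroutine step_gpu
```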
OpenMP. Parallelization for the CPU and MIC involves two steps: 1) insert OpenMP directives to identify thread-level parallelism and 2) optimize performance. Loop calculations are organized in NIM with threading over the single horizontal dimension and vectorization over the generally independent vertical dimension.

Threading of the horizontal loop is normally outside of the vertical loops and, if applicable, loops over cell edges. Most OpenMP loops in NIM contain sufficient work to amortize the overhead of assigning work to threads on loop start-up and thread synchronization at the end of the threaded region. These costs are generally higher on the MIC than the CPU because there are more threads to manage.

Vectorization is an optimization where independent calculations executed serially within a loop can be executed simultaneously in hardware by specially designated vector registers available to each processing core. Intel compilers automatically attempt vectorization, with compiler flags available for further optimization on specific hardware. The number of operations that can be executed simultaneously is based on the length of the vector registers. On the CPU, vector registers are currently 256 bits in length; the KNC MIC coprocessor contains 512-bit vector registers. As a result, vectorization provided some benefit on the host, but in most cases, it provided a greater improvement on the MIC.
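A corresponding OpenMP sketch, again with illustrative names rather than NIM's, follows this organization: the directive threads the single horizontal loop once, outside the vertical loop, and the contiguous vertical loop is left to the compiler to vectorize. With single-precision data, a 256-bit register holds 8 values per instruction and a 512-bit KNC register holds 16, which is why the same loop gains more from vectorization on the MIC.

```fortran
! Illustrative OpenMP threading for CPU and MIC (hypothetical arrays).
! The horizontal loop is threaded once, outside the vertical loop, so
! thread start-up and synchronization costs are amortized; the vertical
! loop is contiguous in memory and vectorizes.
subroutine step_cpu(nz, nip, u, tend)
  implicit none
  integer, intent(in)    :: nz, nip
  real,    intent(inout) :: u(nz, nip)
  real,    intent(in)    :: tend(nz, nip)
  real,    parameter     :: dt = 10.0       ! illustrative time step (s)
  integer :: ipn, k

  !$omp parallel do private(k)              ! thread the horizontal dimension
  do ipn = 1, nip
    !$omp simd                              ! vectorize the independent vertical loop
    do k = 1, nz
      u(k, ipn) = u(k, ipn) + dt * tend(k, ipn)
    end do
  end do
  !$omp end parallel do
end subroutine step_cpu
```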
PERFORMANCE. The NIM has demonstrated good performance and scaling on both CPUs and GPUs on Titan,2 where it has run on more than 250,000 CPU cores and more than 15,000 GPUs. It has also been run on up to 320 Intel MIC (Xeon Phi) processors at the Texas Advanced Computing Center (TACC).3 Optimizations targeting Xeon Phi and GPU have also improved CPU performance.

Since NIM has been optimized for the CPU, GPU, and MIC, it is a useful way to make comparisons between chips.4 Every attempt was made to make fair comparisons between same-generation hardware, using identical source code optimized for all architectures. Given the increasing diversity of hardware solutions, results are shown in terms of device, node, and multinode performance.

2 Titan is an AMD CPU–NVIDIA GPU system containing over 17,000 GPUs, managed by the U.S. Department of Energy's Oak Ridge National Laboratory (ORNL).
3 Runs were made on Stampede, an Intel CPU–MIC system supported by the National Science Foundation (NSF).
4 GPU performance relied on the F2C-ACC compiler. Based on our evaluations, we believe OpenACC compilers would yield similar results.



Fig. 2. Run times for the NIM running at 240-km resolution (10,242 horizontal points, 96 vertical levels) for 100 time steps using CPU, GPU, and MIC chips identified in Table 1.

Device performance. Single-device performance is the simplest and most direct comparison of chip technologies. Figure 2 shows performance running the entire NIM dynamical core on five generations of CPU, GPU, and MIC hardware (see Table 1). CPU results are based on standard two-socket node configurations. A roughly 2-times performance benefit favoring accelerators is observed for 2010–16-generation GPU chips. CPU performance has continued to benefit from increasing cores per chip, improvements in memory speeds, and the introduction of advanced vector instructions. Both the KNC and KNL processors are faster than same-generation CPU chips, with the 2016 KNL processor giving a 2-times performance benefit versus the CPU. The NVIDIA Pascal processor is even better, giving a 2.5-times speedup over the CPU and running 1.3 times faster than the KNL.

While device comparisons are useful, they do not include the cost of a CPU host that is required by the GPU accelerator. This practical and economic consideration motivates further examination and performance comparisons with up to eight GPUs attached to a single CPU.

Table 1. 2010–16-generation CPU, GPU, and MIC chips with corresponding numbers of processing cores. The number of cores for the CPU chips is based on two sockets.

Year    | CPU: two sockets | Cores | GPU         | Cores | MIC             | Cores
2010/11 | Westmere         | 12    | Fermi       | 448   |                 |
2012    | Sandy Bridge     | 16    | Kepler K20x | 2,688 |                 |
2013    | Ivy Bridge       | 20    | Kepler K40  | 2,880 | Knights Corner  | 61
2014    | Haswell          | 24    | Kepler K80  | 4,992 |                 |
2016    | Broadwell        | 30    | Pascal      | 3,584 | Knights Landing | 68

Single-node performance (GPU only). Compute nodes normally have two CPU sockets, memory, a network interconnect (NIC), a peripheral component interconnect express (PCIe) bus, and a motherboard. Deviations from this basic configuration are available but more expensive since the volumes manufactured are lower. Therefore, most computing centers use standard, high-volume parts that offer the best price performance. GPUs can be attached to these nodes and communicate with the CPU host via the PCIe bus.

When more than two GPUs are attached to the host, they must share the PCIe bus, which can impact performance. More specialized solutions are available that improve communications performance. Figure 3 illustrates the architecture of a Cray Storm node, with eight attached accelerators (GPUs are shown, but MIC processors can also be used) and additional PCIe hardware. Communications between sockets are handled with Intel's QuickPath Interconnect (QPI).

Fig. 3. Illustration of the Cray Storm node architecture containing eight accelerators per node. NVIDIA GPUs are shown, but other PCIe-compatible devices can be used. IB refers to a type of node interconnect called "InfiniBand"; others are also available.

Figure 4 shows weak scaling performance as the number of GPUs per CPU node increases from two to eight on a Cray Storm system. These results primarily indicate PCIe bandwidth limitations on the Cray Storm system. An additional performance bottleneck may be the limited bandwidth of the single communications path off node that is shared by the eight attached accelerators. Alternative node configurations are available, including ones with multiple InfiniBand (IB) connections, nested PCIe architectures, and solutions that avoid use of QPI because of reported latency issues (Ellis 2014; Cirrascale 2015). While such solutions can increase performance, testing remains the best way to measure cost–benefit.

Fig. 4. Communications scaling using 40 Pascal GPUs with 2–8 GPUs per node. Computation times for each of the 15- and 30-km runs are not shown but were consistently 28.6 and 7.5 s, respectively.

Scaling. To run efficiently on hundreds to thousands of processors requires good scaling. Both strong and weak scaling measures are useful for performance comparisons. Strong scaling is measured by applying increasing numbers of compute resources to a fixed problem size. This metric is particularly important for operational weather prediction, where forecasts should run in under 1% of real time. The requirement is normally achieved by increasing the number of processors until the given time threshold is met. For example, a 1-day forecast that runs in 15 min represents 1% of real time; therefore, a run at 2% of real time would take 30 min.

Figure 5 shows multinode scaling results for 20–160 Haswell CPUs (2015), NVIDIA Pascal GPUs (2016), and Intel KNL MIC (2016) processors.



Fig. 5. NIM strong scaling comparison with dual-socket Haswell CPU, NVIDIA Pascal GPU, and Intel KNL (MIC) processors. The horizontal axis gives the number of nodes used for a fixed problem size. The "cols/node" numbers indicate computational workload per node. Speedup efficiency compared to the 10-node CPU and GPU run times appear as numeric values in each performance bar. An MIC baseline run with 10 processors could not be run because of insufficient memory.

Up to 2.5-times speedup for GPU versus CPU is observed for the 10-node result when 65,536 columns of work are given to each node or GPU. The decrease in scaling efficiency is almost completely due to interprocess communications overhead. For example, when communications are removed from the 80-GPU run, scaling efficiency increases from 63% to over 90%. CPU and MIC scaling also show similar degrading communications performance.
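For reference, the efficiencies quoted here and plotted in Fig. 5 are consistent with the usual strong-scaling definition (our notation, not the paper's): with run time T(N) on N nodes and the 10-node baseline,

```latex
% Strong-scaling efficiency relative to the N_0 = 10 node baseline.
E(N) = \frac{N_0\,T(N_0)}{N\,T(N)}, \qquad N_0 = 10 .
```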
Weak scaling is a measure of how solution time varies with increasing numbers of processors when the problem size per processor and the number of model time steps remain fixed. It is considered a good way to determine how a model scales to high numbers of processors and is particularly useful for measuring communications overhead.

Table 2 gives performance results for a single node with 20,482 columns per GPU for 120- and 60-km resolution runs using two and eight GPUs. NVIDIA K80s packaged with two GPUs were used for the runs. Computation time was nearly identical for all runs, with communications time increasing to 3.19 s for the one-node, eight-GPU run. An additional run using four nodes illustrates the substantial increase in off-node communications time. Given that communications time within a node (3.19 s) is less than the off-node time (7.23 s), the results show that more GPUs could be added to each node without adversely affecting model run times. This is because all processes must wait for the slowest communication to complete before model execution can continue.

Spiral grid order. To run efficiently on hundreds to thousands of nodes requires efficient interprocess communications. For most models, communications normally include gathering and packing data to be sent to neighboring processes, MPI communications of the data, and then unpacking and distributing the received data. Analysis of NIM dynamics performance showed that message packing and unpacking accounted for 50% of inter-GPU communications time (Middlecoff 2015). Since NIM relies on a lookup table to reference horizontal grid points, data can be reorganized to eliminate packing and unpacking. This optimization, configured during model initialization, is called "spiral grid order." Figure 6 illustrates the spiral grid ordering used in NIM. In the figure, points are organized according to where data must be sent (as interior points) or received (as halo points). Each point in the figure represents an icosahedral grid column that contains 96 vertical levels. The section labeled "spiral grid ordering" illustrates the method used to order points within each MPI task. The "data storage layout" section illustrates how grid points are organized in memory for optimal communications and computation. Use of the spiral grid order gave a performance benefit on all architectures, with a 20% improvement in model run times on the GPU, 16% on the MIC, and 5% on the CPU.

Fig. 6. An illustration of the spiral layout. The upper portion of the figure, titled "spiral data layout," shows a traversal of icosahedral grid cells (hexagons) for MPI tasks 4, 5, and 6. "Data storage layout" illustrates how data are organized to be contiguous in memory. Line color indicates who owns the cells (e.g., task 5 is in black). The orange, red, green, and blue lines in task 5 are halo cells, duplicated in task 5 memory but owned by tasks 4, 2, 6, and 8. Arrows indicate MPI interprocess communications to update these halo cells.
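The sketch below illustrates the idea behind the spiral grid optimization; it is not the SMS implementation, and the names are hypothetical. Because spiral ordering makes the points exchanged with a given neighbor contiguous in memory, the MPI calls can operate directly on sections of the model array, and the pack and unpack buffers disappear.

```fortran
! Illustrative halo exchange enabled by spiral grid ordering (hypothetical
! names, not the SMS library). The interior points this task must send to
! a given neighbor occupy the contiguous column range is1:is2, and the halo
! points received from that neighbor occupy ir1:ir2, so MPI can transfer
! the data in place with no pack/unpack buffers.
subroutine halo_exchange(nz, ncols, u, is1, is2, ir1, ir2, neighbor, comm)
  use mpi
  implicit none
  integer, intent(in)    :: nz, ncols, is1, is2, ir1, ir2, neighbor, comm
  real,    intent(inout) :: u(nz, ncols)    ! field with vertical innermost
  integer :: req(2), ierr

  call MPI_Irecv(u(1, ir1), nz*(ir2 - ir1 + 1), MPI_REAL, neighbor, 1, &
                 comm, req(1), ierr)
  call MPI_Isend(u(1, is1), nz*(is2 - is1 + 1), MPI_REAL, neighbor, 1, &
                 comm, req(2), ierr)
  call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
end subroutine halo_exchange
```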
Table 2. Weak scaling performance for a single node (first two rows) and multiple nodes (last row) for 100 time steps on K80 GPUs. For a fixed computational workload (20,482 columns), single-node communications time increases from 0.56 to 3.19 s as the number of GPUs increases from two to eight. A further increase to 7.23 s is observed when four nodes are used.

GPUs per node | No. of nodes | Model resolution (km) | Columns per GPU | Computation time (s) | Communications time (s) | Total run time (s)
2             | 1            | 120                   | 20,482          | 25.13                | 0.56                    | 25.71
8             | 1            | 60                    | 20,482          | 25.16                | 3.19                    | 28.35
2             | 4            | 60                    | 20,482          | 25.22                | 7.23                    | 33.45

Cost–benefit. Cost–benefit is determined using list prices from Intel and NVIDIA as given in Table 3. The CPU node estimate was based on a standard two-socket, 24-core, Intel Haswell node, which includes the processor, memory, network interconnect, and warranty. The system interconnect was not included in cost calculations, based on the assumption that the cost for each system would be similar. While significant discounts are normally offered to customers, it would be impossible to fairly represent them in any cost–benefit evaluation here.

Figure 7 shows a cost–benefit comparison based on running NIM dynamics at 30-km model resolution. Each of the five system configurations shown produced a 3-h forecast in 23 s, or 0.20% of real time. The CPU-only configuration (upper-left point) required 960 cores or 40 Haswell nodes. The rightmost configurations used 20 NVIDIA K80 GPUs that were attached to 20, 10, 8, and 5 CPUs, respectively. The execution time of 23 s can be extrapolated to 1.6% of real time for a 3.75-km-resolution model when per-process workload remains fixed (weak scaling).5

5 Each doubling in horizontal resolution requires 4 times more compute power and a 2-times increase in the number of model time steps. Assuming perfect scaling, the three successive doublings of model resolution from 30 to 3.75 km require 64 times (4³) more GPUs and an 8-times (2³) increase in the number of model time steps. Therefore, scaling to 3.75 km is calculated as 8 × 0.20 = 1.6% of real time. Additional increases in compute power and time to solution are expected when physics calculations are included.

Based on list prices in Table 3, a 40-node CPU system would cost $260,000. Systems configured with 1–4



NVIDIA K80s per CPU are shown that lower the price of the system from $230,000 to $132,500, respectively. For these tests, 20 NVIDIA K80s were used, packaged with two GPUs per K80. No changes in run times were observed for the four CPU–GPU configurations. Systems such as Cray Storm support up to eight GPUs per node, which could give additional cost benefit.

Table 3. List prices for Intel Haswell CPU, Intel MIC, and NVIDIA K80 GPU processors. The CPU node is based on the cost of a Dell R430 rack-mounted system.

Chip             | Part           | Cores | Power (W) | OEM price
Haswell          | E5-2690-V3 (2) | 24    | 270       | $4,180 (a)
NVIDIA K80       | K80            | 4,992 | 300       | $5,000 (b)
Intel MIC (KNC)  | 7120P          | 61    | 300       | $4,129 (c)
Haswell CPU node | Dell R430      | 24    | —         | $6,500 (d)

(a) http://ark.intel.com/products/81713/Intel-Xeon-Processor-E5-2690-v3-30M-Cache-2_60-GHz
(b) www.anandtech.com/show/8729/nvidia-launches-tesla-k80-gk210-gpu
(c) http://ark.intel.com/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core
(d) This price quote, from 2 Mar 2015, is for a rack-mounted Dell PowerEdge R430 server.

Fig. 7. Cost comparison for CPU-only and CPU–GPU systems needed to run 100 time steps of NIM dynamics in 23 s. Run times do not include model initialization or I/O. Cost estimates are based on list prices for hardware given in Table 3. The CPU-only system used 40 Haswell CPU nodes. Four CPU–GPU configurations were used, where "numCPUs" indicates the total number of CPUs used, and "K80s per CPU" indicates the number of accelerators attached to each node.

DISCUSSION. The NIM demonstrates that weather prediction codes can be designed for high performance and portability targeting CPU, GPU, and MIC architectures. Inherent in the design of NIM has been the simplicity of the code, the use of basic FORTRAN language constructs, and minimal branching in loop calculations. Use of FORTRAN pointers, derived types, and other constructs that are not well supported or are challenging for compilers to analyze and optimize was avoided. NIM's icosahedral–hexagonal grid permits grid cells to be treated identically, which minimizes branching in gridpoint calculations. Further, the code design separated fine-grain and coarse-grain (MPI) parallelism. This was primarily due to limitations in F2C-ACC but had the benefit of organizing calculations to avoid creation and execution of small parallel regions, where synchronization and thread start-up (CPU, MIC) or kernel start-up (GPU) time can be significant.

The choice to organize arrays and loop calculations with an innermost vertical dimension and to use indirect addressing to access neighboring grid cells simplified code design without sacrificing performance. It also improved code portability and performance in unanticipated ways. First, the innermost vertical dimension of 96 levels was sufficient for CPU and MIC vectorization but essential for the GPU's high-core-count devices. With few dependencies in the vertical dimension, vectorization (CPU, MIC) and thread parallelism (GPU) were consistently available in dynamics routines. Second, indirect addressing of grid cells gave flexibility and benefit in how they could be organized. As a result, spiral grid reordering eliminated MPI message packing and unpacking and decreased run times by up to 20%.

Optimizations benefitting one architecture also helped the others. In the rare event that performance degraded on one or more architectures, changes were reformulated to give a positive benefit on all. OpenACC compilers continue to mature, benefiting from F2C-ACC comparisons that exposed bugs and performance issues that were corrected. Parallelization is becoming simpler with OpenACC because data movement between CPU and GPU is managed by the run-time system. Unified memory on the GPU is expected to further simplify parallelization, narrowing the ease-of-use gap versus OpenMP.

The scope of this paper primarily focused on the dynamical core, largely because domain scientists had not decided which physics suite to use for high-resolution (<4 km) runs. Parallelization of select microphysics and radiation routines improved performance on all architectures, but lower speedups over the CPU were observed than for the dynamics routines (Henderson et al. 2015; Michalakes et al. 2016). This is likely due to more branching (i.e., if statements) in the code and less available parallelism in model physics than dynamics.

The paper gives a cost–benefit calculation for NIM dynamics that shows increasing value as more accelerators per node are used. However, there are several limitations in the value of these results. First, the comparison was only for model dynamics; when physics is included, model performance and the cost–benefit favoring the GPU are expected to decrease. Second, use of list price is naïve as vendors typically offer significant discounts, particularly for large installations. Third, calculations did not include the cost of the system interconnect. For small systems with tens of nodes, this was deemed acceptable for comparison as there would be little difference in price or performance. However, comparisons with hundreds to thousands of nodes would amplify the role of the interconnect, and it would need to be included in cost–benefit calculations.

CONCLUSIONS. The NIM is currently the only weather model capable of running on CPU, GPU, and MIC architectures with a single-source code. Performance of the NIM dynamical core was described. CPU, GPU, and MIC comparisons were made for device, node, and multinode performance. Device comparisons showed that NIM ran on the MIC and GPU 2.0 and 2.5 times faster, respectively, than the same-generation CPU hardware. The 2.0-times MIC speedup for KNL versus a dual-socket Broadwell CPU is a significant improvement over the previous-generation KNC. Multinode scaling targeted a goal of running NIM at 3-km resolution in 1% of real time. The spiral grid ordering was described that eliminated data packing and unpacking and gave a performance benefit on all architectures. Finally, a cost–benefit analysis demonstrated increasing benefits favoring the K80 GPUs when up to eight accelerators are attached to each CPU host. Further analysis of cost–benefit using the latest Pascal and KNL chips is planned.

A critical element in achieving good performance and portability was the design of NIM. The simplicity of the code, looping, and array structures and the indirect addressing of the icosahedral grid were all chosen to expose the maximum parallelism to the underlying hardware. The work reported here represents a successful development effort by a team of domain and computer scientists and software engineers working together during design and development. Scientists wrote the code, computer scientists were responsible for the directive-based parallelization and optimization, and software engineers maintained the software infrastructure capable of supporting development, testing, and running the model on diverse supercomputer systems.

The selection of the FV3 dynamical core to be part of the Next Generation Global Prediction System (NGGPS) run by NWS in 2020 has accelerated efforts to port it to MPFG processors. Experience with the NIM on achieving performance portability will guide these efforts. Evaluation of FV3 performance in 2015 indicated good scaling to 130,000 CPU cores (Michalakes et al. 2015). While these results indicate sufficient parallelism is available, significant work is expected to adapt FV3 to run efficiently on GPU and MIC processors (Govett and Rosinski 2016).

In the next decade, HPC is expected to become increasingly fine grained, with systems containing potentially hundreds of millions of processing cores. To take advantage of these systems, new weather prediction models will need to be codeveloped by scientific and computational teams to incorporate parallelism in model design, code structure, algorithms, and underlying physical processes.

ACKNOWLEDGMENTS. Thanks to the technical teams at Intel, Cray, PGI, and NVIDIA who were responsible for fixing bugs and providing access to the latest hardware and compilers. Thanks also to the staff at ORNL and TACC for providing system resources and helping to resolve system issues. This work was also supported in part by the Disaster Relief Appropriations Act of 2013 and the NOAA HPCC program.

REFERENCES
Arakawa, A., and V. R. Lamb, 1977: Computational design of the basic dynamical processes of the UCLA general circulation model. General Circulation Models of the Atmosphere, J. Chang, Ed., Vol. 17, Methods in Computational Physics: Advances in Research and Applications, Academic Press, 173–265.
Bleck, R., and Coauthors, 2015: A vertically flow-following icosahedral grid model for medium-range and seasonal prediction. Part I: Model description. Mon. Wea. Rev., 143, 2386–2403, doi:10.1175/MWR-D-14-00300.1.
Carpenter, I., and Coauthors, 2013: Progress towards accelerating HOMME on hybrid multi-core systems. Int. J. High Perform. Comput. Appl., 27, 335–347, doi:10.1177/1094342012462751.



Cirrascale, 2015: Scaling GPU compute performance. Cirrascale Rep., 11 pp. [Available online at www.cirrascale.com/documents/whitepapers/Cirrascale_ScalingGPUCompute_WP_M987_REVA.pdf.]
Ellis, S., 2014: Exploring the PCIe bus routes. Cirrascale. [Available online at www.cirrascale.com/blog/index.php/exploring-the-pcie-bus-routes/.]
Fuhrer, O., C. Osuna, X. Lapillonne, T. Gysi, B. Cumming, M. Bianco, A. Arteaga, and T. Schulthess, 2014: Towards a performance portable, architecture agnostic implementation strategy for weather and climate models. Supercomput. Front. Innovations, 1, 45–62, doi:10.14529/jsfi140103.
Govett, M., 2013: Using OpenACC compilers to run FIM and NIM on GPUs. Third NCAR Multi-Core Workshop, Boulder, CO, NCAR, 6b. [Available online at https://www2.cisl.ucar.edu/sites/default/files/govett_6b.pdf.]
—, and J. Rosinski, 2016: Evaluation of the FV3 dynamical core. NGGPS Supplementary Rep., 10 pp. [Available online at www.esrl.noaa.gov/gsd/ato/FV3_Analysis-final.pdf.]
—, L. Hart, T. Henderson, and D. Schaffer, 2003: The scalable modeling system: Directive-based code parallelization for distributed and shared memory computers. Parallel Comput., 29, 995–1020, doi:10.1016/S0167-8191(03)00084-X.
—, J. Middlecoff, and T. Henderson, 2010: Running the NIM next-generation weather model on GPUs. 10th IEEE/ACM Int. Conf. on Cluster, Cloud and Grid Computing, Melbourne, Australia, Institute of Electrical and Electronics Engineers/Association for Computing Machinery, 792–796, doi:10.1109/CCGRID.2010.106.
—, —, and —, 2014: Directive-based parallelization of the NIM weather model for GPUs. First Workshop on Accelerator Programming Using Directives, New Orleans, LA, Institute of Electrical and Electronics Engineers, 55–61, doi:10.1109/WACCPD.2014.9.
—, T. Henderson, J. Rosinski, J. Middlecoff, and P. Madden, 2015: Parallelization and performance of the NIM for CPU, GPU and MIC. First Symp. on High Performance Computing for Weather, Water, and Climate, Phoenix, AZ, Amer. Meteor. Soc., 1.3. [Available online at https://ams.confex.com/ams/95Annual/webprogram/Paper262515.html.]
—, —, J. Middlecoff, and J. Rosinski, 2016: A cost benefit analysis of CPU, GPU and MIC chips using NIM performance as a guide. Second Symp. on High Performance Computing for Weather, Water, and Climate, New Orleans, LA, Amer. Meteor. Soc., 1.3. [Available online at https://ams.confex.com/ams/96Annual/webprogram/Paper286165.html.]
Henderson, T., M. Govett, and J. Middlecoff, 2011: Applying Fortran GPU compilers to numerical weather prediction. 2011 Symp. on Application Accelerators in High Performance Computing, Knoxville, TN, Institute of Electrical and Electronics Engineers, 34–41, doi:10.1109/SAAHPC.2011.9.
—, J. Michalakes, I. Gokhale, and A. Jha, 2015: Optimizing numerical weather prediction. Multicore and Many-Core Programming Approaches, J. Reinders and J. Jeffers, Eds., Vol. 2, High Performance Parallelism Pearls, Morgan Kaufmann, 7–23, doi:10.1016/B978-0-12-803819-2.00016-1.
Kim, Y., 2013: Performance tuning techniques for GPU and MIC. Third NCAR Multi-Core Workshop, Boulder, CO, NCAR, 1. [Available online at https://www2.cisl.ucar.edu/sites/default/files/youngsung_1_2013.pdf.]
Lapillonne, X., and O. Fuhrer, 2014: Using compiler directives to port large scientific applications to GPUs: An example from atmospheric science. Parallel Process. Lett., 24, 1450003, doi:10.1142/S0129626414500030.
Lee, J.-L., and A. E. MacDonald, 2009: A finite-volume icosahedral shallow water model on local coordinate. Mon. Wea. Rev., 137, 1422–1437, doi:10.1175/2008MWR2639.1.
—, R. Bleck, and A. E. MacDonald, 2010: A multistep flux-corrected transport scheme. J. Comput. Phys., 229, 9284–9298, doi:10.1016/j.jcp.2010.08.039.
Lin, S.-J., 2004: A "vertically Lagrangian" finite-volume dynamical core for global models. Mon. Wea. Rev., 132, 2293–2307, doi:10.1175/1520-0493(2004)132<2293:AVLFDC>2.0.CO;2.
MacDonald, A. E., J. Middlecoff, T. Henderson, and J. Lee, 2011: A general method for modeling on irregular grids. Int. J. High Perform. Comput. Appl., 25, 392–403, doi:10.1177/1094342010385019.
Michalakes, J., M. Govett, R. Benson, T. Black, H. Juang, A. Reinecke, and B. Skamarock, 2015: NGGPS level-1 benchmarks and software evaluation. Advanced Computing Evaluation Committee Rep., 22 pp. [Available online at www.nws.noaa.gov/ost/nggps/DycoreTestingFiles/AVEC%20Level%201%20Benchmarking%20Report%2008%2020150602.pdf.]
—, M. Iacono, and E. Jessup, 2016: Optimizing weather model radiative transfer physics for Intel's Many Integrated Core (MIC) architecture. Parallel Process. Lett., 26, 1650019, doi:10.1142/S0129626416500195.
Middlecoff, J., 2015: Optimization of MPI message passing in a multi-core NWP dynamical core running on NVIDIA GPUs. Fifth NCAR Multi-Core Workshop, Boulder, CO, NCAR, 4. [Available online at https://www2.cisl.ucar.edu/sites/default/files/Abstract_Middlecoff.pdf.]
Nguyen, H. V., C. Kerr, and Z. Liang, 2013: Performance of the cubed-sphere atmospheric dynamical core on Intel Xeon and Xeon Phi architectures. Third NCAR Multi-Core Workshop, Boulder, CO, NCAR, 3a. [Available online at https://www2.cisl.ucar.edu/sites/default/files/vu_3a.pdf.]
NVIDIA, 2015: CUDA C programming guide. NVIDIA. [Available online at http://docs.nvidia.com/cuda/cuda-c-programming-guide/.]
Putnam, B., 2011: Graphics processing unit (GPU) acceleration of the Goddard Earth Observing System atmospheric model. NASA Tech. Rep., 15 pp.
Rosinski, J., 2015: Porting and optimizing NCEP's GFS physics package for unstructured grids on Intel Xeon and Xeon Phi. Fifth NCAR Multi-Core Workshop, Boulder, CO, NCAR, 4. [Available online at https://www2.cisl.ucar.edu/sites/default/files/Rosinski_Slides.pdf.]
Sadourny, R., A. Arakawa, and Y. Mintz, 1968: Integration of non-divergent barotropic vorticity equation with an icosahedral-hexagonal grid for the sphere. Mon. Wea. Rev., 96, 351–356, doi:10.1175/1520-0493(1968)096<0351:IOTNBV>2.0.CO;2.
Sawyer, W., C. Conti, and X. Lapillonne, 2011: Porting the ICON non-hydrostatic dynamics and physics to GPUs. First NCAR Multi-Core Workshop, Boulder, CO, NCAR, 19 pp. [Available online at https://www2.cisl.ucar.edu/sites/default/files/2011_09_08_ICON_NH_GPU.pdf.]
—, G. Zaengl, and L. Linardakis, 2014: Towards a multi-node OpenACC implementation of the ICON model. European Geosciences Union General Assembly 2014, Vienna, Austria, European Geosciences Union, ESSI2.1. [Available online at http://meetingorganizer.copernicus.org/EGU2014/EGU2014-15276.pdf.]
Strohmaier, E., J. Dongarra, H. Simon, and M. Meuer, 2016: Top 500 supercomputers. Top500, accessed 22 January 2017. [Available online at www.top500.org/lists/2016/11/.]
Wang, N., and J. Lee, 2011: Geometric properties of the icosahedral-hexagonal grid on the two-sphere. SIAM J. Sci. Comput., 33, 2536–2559, doi:10.1137/090761355.
Williamson, D., 1971: A comparison of first- and second-order difference approximations over a spherical geodesic grid. J. Comput. Phys., 7, 301–309, doi:10.1016/0021-9991(71)90091-X.
Yashiro, H., A. Naruse, R. Yoshida, and H. Tomita, 2014: A global atmosphere simulation on a GPU supercomputer using OpenACC: Dry dynamical cores tests. TSUBAME ESJ, Vol. 12, Tokyo Institute of Technology Global Scientific Information and Computing Center, Tokyo, Japan, 8–12. [Available online at www.gsic.titech.ac.jp/sites/default/files/TSUBAME_ESJ_12en_0.pdf.]



ABSTRACT
The design and performance of the nonhydrostatic icosahedral model (NIM) global weather prediction model
is described. NIM is a dynamical core designed to run on central processing unit (CPU), graphics processing unit
(GPU), and Many Integrated Core (MIC) processors. It demonstrates efficient parallel performance and scalability to
tens of thousands of compute nodes and has been an effective way to make comparisons between traditional CPU and
emerging fine-grain processors. The design of the NIM also serves as a useful guide in the fine-grain parallelization
of the finite volume cubed (FV3) model recently chosen by the National Weather Service (NWS) to become its next
operational global weather prediction model.
This paper describes the code structure and parallelization of NIM using standards-compliant open multiprocessing
(OpenMP) and open accelerator (OpenACC) directives. NIM uses the directives to support a single, performance-
portable code that runs on CPU, GPU, and MIC systems. Performance results are compared for five generations of
computer chips including the recently released Intel Knights Landing and NVIDIA Pascal chips. Single and multinode
performance and scalability are also shown, along with a cost–benefit comparison based on vendor list prices.
