lab_41:
@CIV0 = 0;
do {
    28 |   ((float *)x)[@CIV0] = 5.40000009E+00 + ((float *)y)[@CIV0];
    27 |   @CIV0 = @CIV0 + 1;
} while ((unsigned)@CIV0 < (unsigned)iter_count);  /* ~43 */
11 VSX and Altivec Programming
Currently, there isn't a widely accepted recipe for deciding whether to convert a code to
exploit vectorization. Many of the concepts that allow compilers to auto-vectorize code
also apply to hand-coded approaches to vectorizing code. In addition, manually
converting a code to use vector data types and operations raises additional issues to be
aware of.
11.1 Handling Data Loads
The VSU has two load/store execution pipelines with a 2-cycle load-to-use latency;
32 bytes can be loaded from the D-cache (into vector registers) in each cycle. The
D-cache can also be loaded from the L2 at the same rate (32 bytes per cycle).
The best way to exploit the LSU for double precision (VSX) vectors is to load
(and work on) vectors in pairs. This is different from earlier SIMD architectures,
in which Altivec supported at most one load per clock.
11.2 Performance Improvement of VSX/Altivec-enabled Code Over
Scalar Code
Given the various data types that can be used as vectors, the following performance
gains are possible:
On POWER7, there are 2 scalar floating point units (FPUs) and 2 scalar fixed point units
(FXUs) but only one VSU (with one embedded Altivec unit). So at peak performance,
VSX/Altivec code can be 2 times faster than scalar code for 64-bit floating point, 32-bit
floating point and integer arithmetic, 4 times faster for 16-bit (short) integer arithmetic and
8 times faster for 8-bit (byte) integer arithmetic.
These performance gains assume both corresponding scalar functional units are fully
utilized. Performance is limited by how well the application can keep the VSU busy as
well as cache reuse, memory bandwidth and memory alignment considerations. These
issues are addressed in later sections. Less efficient scalar code will result in measuring
a greater speedup than expected. The most valid scalar-SIMD comparison is between
optimum scalar and SIMD code.
11.3 Memory Alignment
Many issues are already covered in section 10.1.5. Here are some additional suggestions
for handling dynamically allocated data that is unaligned.
11.3.1 AIX
The malloc() library call should already be allocating one-dimensional arrays on 16-byte
boundaries by default. But just in case, AIX 6L has a useful routine that explicitly
forces malloc()'d one-dimensional arrays to align on 16-byte boundaries: vec_malloc().
It can be used in IBM XL C/C++ routines in place of any standard malloc() call; in fact,
an easy way to ensure arrays are malloc'd properly is to use the C preprocessor:
#define malloc vec_malloc
Automatic and local arrays are already properly aligned; i.e., a[0] will be on a 16-byte
address boundary (an address of the form 0xXXXXXXX0).
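As a quick sanity check, the low four bits of a returned address reveal whether an
allocation is 16-byte aligned. The following is a minimal sketch, assuming vec_malloc()
is available as described above (it is taken here to be declared via <stdlib.h>; the
array length is arbitrary):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    float *a = (float *) vec_malloc(1024 * sizeof(float));
    if (a == NULL) return 1;
    /* 16-byte alignment means the low 4 address bits are zero */
    printf("a = %p, 16-byte aligned: %s\n", (void *) a,
           ((unsigned long) a & 0xFUL) == 0 ? "yes" : "no");
    return 0;   /* memory is reclaimed at process exit */
}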
For XL Fortran programs running on AIX, an easy way to allocate aligned one-
dimensional arrays is to provide a Fortran wrapper that calls vec_malloc().
An alternative for any executable program is to set the MALLOCALIGN environment
variable:
export MALLOCALIGN=16
This forces all dynamically allocated arrays to align on 16-byte boundaries.
11.3.2 Linux
For Linux on POWER, programmers can use the memalign() library routine to align
dynamically allocated one-dimensional arrays on 16-byte boundaries.
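A minimal sketch of a 16-byte-aligned allocation on Linux follows; it uses
posix_memalign(), the standardized counterpart of memalign() (the array length and
variable names are illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double *a = NULL;
    /* request 1024 doubles on a 16-byte boundary */
    if (posix_memalign((void **) &a, 16, 1024 * sizeof(double)) != 0)
        return 1;
    printf("a = %p, low 4 bits: 0x%lx\n", (void *) a, (unsigned long) a & 0xFUL);
    free(a);   /* posix_memalign() memory is released with free() */
    return 0;
}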
11.3.3 Multiple Array Offsets in a Loop
A common problem arises when array references with different offsets (e.g., A(i) and A(i+1)
for single precision data) appear in the same loop. Usually, extra loads and permutes are
needed to manage the data. There is no guarantee that vector performance will be better
than scalar performance; the extra instruction overhead can offset any advantage of
using SIMD instructions.
11.3.4 Multidimensional Arrays
Working with multidimensional arrays is more of a challenge. Those arrays whose
leading dimensions will not allow a row (for C) or a column (for Fortran) to load evenly
into vectors require the overhead of additional instructions to handle partial vectors at the
matrix boundaries. This overhead can offset any potential performance gains from using
SIMD instructions.
To make it easier to align a multidimensional array, one suggestion is to allocate a large
enough one-dimensional array. Then, use an array of pointers or another language-
appropriate mechanism to reference the allocated space as if it were a multi-dimensional
array.
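One hedged sketch of this suggestion, in C: back an n-by-m matrix with a single aligned
1-D allocation and an array of row pointers (posix_memalign() is used here as in the
Linux example above; on AIX, vec_malloc() would serve the same purpose). Padding each
row length up to a multiple of the vector length keeps every row 16-byte aligned:

#include <stdlib.h>

double **alloc2d(int n, int m)
{
    /* pad the row length up to a multiple of 2 doubles (16 bytes) */
    int mpad = (m + 1) & ~1;
    double *flat = NULL;
    double **rows;
    int i;

    if (posix_memalign((void **) &flat, 16,
                       (size_t) n * mpad * sizeof(double)) != 0)
        return NULL;
    rows = malloc(n * sizeof(double *));
    if (rows == NULL) {
        free(flat);
        return NULL;
    }
    for (i = 0; i < n; i++)
        rows[i] = flat + (size_t) i * mpad;  /* rows[i][j] indexes flat storage */
    return rows;
}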
Vector Programming Strategies
Considerable confusion arises when discussing whether SIMD programming works and is
worth the effort. While Altivec and VSX differ in programming details, they share
similar criteria for deciding how to convert a candidate (scalar) program to exploit
SIMD instructions. To understand some of the tradeoffs involved, this section classifies
SIMD programming approaches into three categories: local loop changes, local algorithmic
changes, and global data restructuring. The order goes from easiest to implement to
hardest.
It costs nothing to first try auto-vectorization and see if it improves code performance,
but oftentimes it won't. Providing the compiler with the flags -qreport -qlist will describe
the reasons the compiler fails to auto-vectorize any particular loop. This is further
discussed in the introduction to chapter 10.
Note that the examples that follow focus on high calculation rates. They assume that the
dominant floating point arithmetic operations are multiplies and adds and similarly
pipelined floating point instructions. But the pursuit of high gigaflop rates is not the only
situation where Altivec/VSX operations have a potential performance advantage over
their scalar equivalents. If a significant fraction of the calculations are floating point
divides and/or square roots, the higher execution latency and lack of pipelining of the
scalar forms give a performance advantage to the Altivec/VSX versions of these operations.
Two examples of codes that take advantage of VSX operations to improve performance are
the SPEC CPU2006 versions of WRF and CactusADM.
11.3.5 Some Requirements for Efficient Loop Vectorization
One way to look at SIMD programming is to list all of the things that can go wrong, and
the methods used to solve the problems.
Besides the aforementioned data flow dependencies, issues to be handled are:
1. Make sure the arrays that provide data to form Altivec/VSX vectors are aligned
as discussed in section 10.1.5.
2. Iteration-dependent branching inside loops can remove the loop as a candidate
for SIMD conversion. The vec_sel() intrinsic can handle conditions that depend
on the loop iteration (see the sketch after this list), but the performance can
fall off rapidly with every if-test.
3. Iterate through arrays with a unit stride (i++). There are common situations
where iteration is not unit stride (e.g. red-black lattice order). In these cases, the
array elements can be reordered so that the most common iteration patterns can
be done over contiguous chunks of memory. In the case of red-black ordering,
the even elements could all be grouped in the first half of the array, and the odd
elements in the second half.
4. Minimize the number of loads relative to all other operations (loads are often the
main performance bottleneck in SIMD programming). A high load count doesn't
inhibit vectorization, but it does limit performance. This can happen in nested
loops, like those found in matrix multiplies.
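Here is a hedged sketch of the vec_sel() technique from item 2, for a simple lanewise
conditional; the arrays and names are illustrative, and arraylen is assumed to be a
multiple of 4. Both candidate values are computed, and the mask produced by the
comparison selects between them per lane:

/* scalar form: for (i...) c[i] = (a[i] > 0.0f) ? a[i] : b[i]; */
vector float vZero = vec_splats(0.0f);
for (i = 0; i < arraylen; i += 4) {
    vector float vA = *(vector float *) &a[i];
    vector float vB = *(vector float *) &b[i];
    vector bool int mask = vec_cmpgt(vA, vZero);     /* lanewise a[i] > 0 */
    *(vector float *) &c[i] = vec_sel(vB, vA, mask); /* a[i] where true, else b[i] */
}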
11.3.6 OpenMP Loops
The most straightforward approach to enabling a code for VSX is to find the (hopefully
few) loops that use the most execution time and vectorize them. In legacy applications,
these loops are often the ones that are parallelized with OpenMP directives or similar
approaches. Many loops that can be successfully parallelized (in the thread-safe sense of
OpenMP) are also candidates for vectorization.
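For example, a loop like the following generic sketch (not taken from any particular
application) is both thread-safe for OpenMP and a straightforward SIMD candidate,
since each iteration touches disjoint array elements:

#pragma omp parallel for
for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];   /* no references across iterations */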
There are some differences between vectorizable loops and OpenMP loops.
1. A loop that includes branch conditions that depend on the iterator (loop index)
value is perfectly OK to implement in parallel threads, but can have a substantial
performance penalty when executed in a SIMD context.
2. A loop that references array elements across more than one iteration will create
race conditions in parallel execution, but may be perfectly fine for SIMD
execution. For example (assuming an incrementing iterator), consider the code
fragment:
A(I) = A(I+1) + B(I)
In a scalar loop, any given element of A() is modified after it is read. This
fragment vectorizes on I (though it does have alignment issues). In contrast, the
vectorized version of
A(I) = A(I-1) + B(I)
does not give the same results as the scalar form. The data to set A(I) comes
from an earlier write of A(I). This is an example of a recursive data
dependence. Note that if the index offset were 4 instead of 1, the data
dependence wouldn't be an issue for Altivec instructions, and if the offset were 2,
it wouldn't be an issue for VSX instructions.
Even though there are differences to watch out for between OpenMP-parallel and
vectorizable loops, these types of loops are the top-of-the-list candidates for
vectorization.
11.3.7 Example: Vectorizing a Simple Loop
As an example, let's see how the SAXPY/DAXPY loop is transformed. This could be
auto-vectorized, but hand coding is used for illustration.
Here is the scalar code snippet (only the highlights are shown; the supporting
code is assumed):
a = (float *) malloc(arraylen * sizeof(float));
b = (float *) malloc(arraylen * sizeof(float));
for (i = 0; i < arraylen; i++) {
    b[i] = alpha * a[i] + b[i];
}
And here is its [Altivec] counterpart:
a = (float *) vec_malloc(arraylen * sizeof(float));
b = (float *) vec_malloc(arraylen * sizeof(float));
vAlpha = vec_splats(alpha);
for (i = 0; i < arraylen; i += 4) {
    vA = (vector float *) &a[i];
    vB = (vector float *) &b[i];
    *vB = vec_madd(vAlpha, *vA, *vB);
}
Notes:
1. vec_malloc() is used in place of malloc() to force arrays to align on 16-byte
boundaries. Depending on the AIX release level, malloc() may also force arrays
to align properly, but using vec_malloc() always will.
2. vec_splats() is a new VSX intrinsic. It can be used to splat (replicate) scalar data
to all vector data types. Another (now obsolete) way to do the same thing is the
vec_loadAndSplatScalar() function found at the Apple (Altivec) website.
3. Assigning the vector float pointers to appropriate elements of the arrays is one
way to have the compiler load the data into vector registers. For this case, no
explicit load intrinsics have to be included. However, explicit vector loads may
need to be included in other situations.
4. This is not the fastest way to code a SAXPY calculation. For example, unrolling
the loop (by 4 seems to be a good choice) yields better performance. Other
techniques can further improve performance. The loop shown here will not
achieve the highest performance possible on a Power 755.
5. The speed measured depends on the size of the arrays. As the arrays get longer,
both the SIMD and scalar calculation rates decrease.
And here is its double-precision (VSX) analog.
a = (double *) vec_malloc(arraylen * sizeof(double));
b = (double *) vec_malloc(arraylen * sizeof(double));
vAlpha = vec_splats(alpha);
for (int i = 0; i < arraylen; i += 4) {
    vA1 = (vector double *) &a[i];
    vB1 = (vector double *) &b[i];
    *vB1 = vec_madd(vAlpha, *vA1, *vB1);
    vA2 = (vector double *) &a[i+2];
    vB2 = (vector double *) &b[i+2];
    *vB2 = vec_madd(vAlpha, *vA2, *vB2);
}
Notes:
1. A double can be loaded and splatted similarly to a float.
2. The vector multiply-add intrinsic for double precision arithmetic (VSX) is the
same as for single precision (Altivec). Not all corresponding VSX and Altivec
instructions share the same intrinsic.
3. As noted in section 9.3 and elsewhere, a vector double holds only 2 doubles, but
VSX can still do 4 double precision FMAs at a time, by issuing two vec_madd()s
per cycle. The base loop is unrolled by 2 to make sure this happens. This is
analogous to the base SAXPY loop above.
4. Like the Altivec loop, this loop could benefit from more unrolling.
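To illustrate the additional unrolling suggested in note 4 (and in note 4 of the Altivec
example), here is a hedged sketch of the Altivec loop above unrolled by 4; it assumes
arraylen is a multiple of 16, and the pointer names are illustrative:

for (i = 0; i < arraylen; i += 16) {
    vector float *vA0 = (vector float *) &a[i];
    vector float *vA1 = (vector float *) &a[i+4];
    vector float *vA2 = (vector float *) &a[i+8];
    vector float *vA3 = (vector float *) &a[i+12];
    vector float *vB0 = (vector float *) &b[i];
    vector float *vB1 = (vector float *) &b[i+4];
    vector float *vB2 = (vector float *) &b[i+8];
    vector float *vB3 = (vector float *) &b[i+12];
    /* four independent FMAs give the pipelines more work per iteration */
    *vB0 = vec_madd(vAlpha, *vA0, *vB0);
    *vB1 = vec_madd(vAlpha, *vA1, *vB1);
    *vB2 = vec_madd(vAlpha, *vA2, *vB2);
    *vB3 = vec_madd(vAlpha, *vA3, *vB3);
}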
11.3.8 Local Algorithms
Experience has shown that many hot routines are VSX opportunities where auto-
vectorization doesn't work well enough. To get optimal performance, the routines have to
be rewritten from scratch, probably using a different, more SIMD-friendly algorithm.
For example, sometimes there are loop data flow dependencies that inhibit vectorization
and have to be reworked. In any event, the changes needed to get a large fraction of the
runtime executing SIMD instructions are local to the routines involved, so they are
relatively quick to implement, though not as quick as auto-vectorization, when that works.
The local algorithms approach has been applied to DGEMM (of course), most (if not all)
bioinformatics codes that have SIMD versions of the dynamic programming algorithm,
FFTs, specialized matrix multiplies (like those found in LQCD codes), DSP, video
processing, encryption, and other signal processing tasks. Notable examples are the
Smith-Waterman algorithm in FASTA and the HMMER hidden Markov model algorithm.
Freescale maintains a collection of web pages with many useful SIMD algorithms,
techniques, and code examples, originally targeted at 32-bit data but readily adaptable
to handle double-precision VSX.
All local algorithms have one weakness: they can't speed up code that requires at least
one load per arithmetic operation, like the matrix-vector and vector-vector multiply
kernels. Such code is simply bounded by the POWER7 system memory bandwidth once the
problem size grows beyond the L2 cache. This means that unit tests that stay in cache
often predict too optimistic a performance gain compared to real-life workloads. Caution
is in order.
11.3.9 Global Restructuring
Frequently, a program can't reach peak theoretical VSX performance with either of
the first two approaches. For example, many programs create arrays of structures (e.g.
3D cells in a finite element program, sites in a QCD lattice) and loop over selected
attributes of each array element. This requires accessing data with a non-unit (but
usually constant) stride through memory, which requires aggressive prefetching and
wastes memory bandwidth (because adjacent structure attributes are loaded along with
the requested data when they share the same cache line). If an application programmer
is willing to put in enough effort (or if an appropriate tool is available), the program
can be transformed from an array of structures (AoS) to an array of structures of
vectors (AoSoV) approach.
The structures in AoSoV are purposely built so that each element is the size of a 16-byte
vector type, whether the underlying data is single or double precision (32- or 64-bit). This
allows the AoSoV structures to exploit the same scalar algorithm already present, just by
changing the scalar variables to their vector equivalents and adjusting the loop iterators.
For example, the SU3 matrix from MILC (found in the SPEC benchmark suite, as well as
academic MILC) is a scalar structure:
typedef struct {
    float real;
    float imag;
} complex;
typedef struct { complex e[3][3]; } su3_matrix;
The program loops over arrays of a lattice structure (which in turn contains this
structure) for most of its run time. This structure's size (9 complex values of 8 bytes
each, 72 bytes in all) does not allow all array elements to align on 16-byte boundaries.
To illustrate the local algorithm change method, we will pad the structure out so that
each row is a multiple of 16 bytes. This is required for Altivec-enabled code, since the
structure has to be 16-byte aligned to get results that agree with the scalar version. An
altered structure that would work is:
typedef struct { complex e[3][4]; } su3_matrix_vec1;
Alternatively, the original structure can be modified by adding 8 bytes of padding (this
minimizes unused data in a cache line). This forces more data juggling when loading the
rows of the matrix, but saves 2*8 = 16 bytes over a su3_matrix_vec1 structure.
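A hedged sketch of that padded layout (the field name pad is illustrative):

typedef struct {
    complex e[3][3];   /* 72 bytes */
    char    pad[8];    /* 72 + 8 = 80 bytes, a multiple of 16 */
} su3_matrix_pad;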
Similar arguments apply to a double-precision version of the su3_matrix.
But a better choice for a structure to enable Altivec is:
#define VECTOR_LENGTH 4
typedef struct {
    float real[3][3][VECTOR_LENGTH];
    float imag[3][3][VECTOR_LENGTH];
} su3_matrix_blk_vmx;
and, for VSX,
#define VECTOR_LENGTH 2
typedef struct {
    double real[3][3][VECTOR_LENGTH];
    double imag[3][3][VECTOR_LENGTH];
} su3_matrix_blk_vsx;
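To see why these layouts map directly onto the scalar algorithm, consider a hedged
sketch of one complex element multiply on the Altivec (float) variant, performed for
VECTOR_LENGTH lattice sites at once. The function name and the particular intrinsic
sequence are illustrative, not taken from MILC:

/* c->e[i][j] = a->e[i][j] * b->e[i][j] (complex), for 4 sites at once */
void mult_elem(su3_matrix_blk_vmx *a, su3_matrix_blk_vmx *b,
               su3_matrix_blk_vmx *c, int i, int j)
{
    vector float ar = *(vector float *) a->real[i][j];
    vector float ai = *(vector float *) a->imag[i][j];
    vector float br = *(vector float *) b->real[i][j];
    vector float bi = *(vector float *) b->imag[i][j];
    vector float z  = vec_splats(0.0f);
    /* (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br) */
    vector float cr = vec_nmsub(ai, bi, vec_madd(ar, br, z)); /* ar*br - ai*bi */
    vector float ci = vec_madd(ar, bi, vec_madd(ai, br, z));  /* ar*bi + ai*br */
    *(vector float *) c->real[i][j] = cr;
    *(vector float *) c->imag[i][j] = ci;
}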
Changing a basic structure within a code forces global changes that can require modifying
most of the code. It can increase the performance gains, but it is a costly approach in
both time and effort.
11.4 Conclusions
First, as discussed in chapter 10 and elsewhere, since auto-vectorization can be invoked
with little effort, it is always worthwhile to try. The -qreport -qlist -qsource options
will indicate why loops are or are not vectorized. The amount of effort that should be
invested in helping auto-vectorization succeed depends on the expected benefit, but it
should be less than the effort needed to hand-code Altivec/VSX intrinsics.
Keep in mind the criteria presented in section 11.3.5 while inspecting the hot
computational loops for vectorization opportunities.
For any given application, the performance gain that can be achieved from using
Altivec/VSX rather than scalar instructions is not guaranteed to reach the maximum. The
ultimate performance will depend on the specific application, the algorithms used for the
scalar and Altivec/VSX versions of the application, and several factors such as bandwidth
from caches and memory, data alignment, etc. For most HPC applications it is unlikely
that VSX speedup will approach the maximum of 2x.
In many cases where SIMD opportunities exist to improve code performance, the IBM
ESSL library can save the time and effort of coding vectorized versions of standard
algorithms, like FFTs.
It is important to be aware of codes where there are opportunities to benefit from
exploiting the Altivec/VSX capabilities of POWER7 systems like the Power 755, whether
it is through auto-vectorization or hand-coding. It is equally important to have realistic
expectations of the potential speedup available from vectorizing an application.
12 Power Consumption
In industries across the board, energy consumption has become a top priority. The HPC
segment is no exception. In a business where customers often look to cluster densely
populated nodes, power and cooling can rapidly escalate operating expenses. In
response to these challenges, IBM offers a technology called IBM EnergyScale,
which is available for IBM POWER7 processor-based systems. The goal of this chapter is
to highlight the features of this technology and communicate what kind of behavior
users can expect with respect to power consumption and performance.
EnergyScale offers the user three power management modes, configurable via IBM
Systems Director Active Energy Manager (AEM). The available modes are Static Power
Saver (SPS), Dynamic Power Saver (DPS), and Dynamic Power Saver - Favor
Performance (DPS-FP).
xCAT 2 (Extreme Cluster Administration Toolkit) is a scalable distributed computing
management and provisioning tool that can be used to deploy and manage Power 755
clusters. For xCAT 2.3 and later, an Energy Management Plugin is available that gives
the administrator of Power clusters the ability to query power and change the
EnergyScale modes for each 755 server in the cluster. The actions are invoked using an
xCAT command called renergy that can be used either in scripts or from the command
line.
Figure 12-1 shows the GUI for AEM.
12.1 Static Power Saver: "SPS"
The firmware sets the processor frequency to a predetermined fixed value 30% below
the shipping nominal frequency, hence the term "static" power saver. This mode also
enables a feature called "folding." Folding triggers the OS to scan processor core
utilization. If any cores are running idle, the OS will issue a command to the
hypervisor to either "Nap" or "Sleep" the cores, depending on the OS level the user is
running (AIX 6.1H supports Nap only; AIX 6.1J supports Sleep). For Power 755 systems,
when a core is being napped its frequency is set to 1.65 GHz. The sleep frequency is
set to 0 MHz. There is a minimal lag time when cores come out of sleep or nap mode.
If the user is concerned that this might impact the performance of their application, the
following command can be issued from AIX to disable folding, and thus disable nap or
sleep:
schedo -o vpm_fold_policy=0
SPS mode offers the maximum power savings for the system at the cost of sacrificing
performance. This mode is ideal for periods when there is little or no activity on the
system such as weekends or evenings when the user is looking to maximize power
savings.
12.2 Dynamic Power Saver: "DPS"
The firmware alters the processor frequency based upon the utilization of the
POWER7 cores. The processor frequency is ramped up as utilization goes up. When
processor cores are not utilized, their frequency drops to 1.65 GHz. The maximum core
frequency that can be achieved in this mode is 90% of the nominal ship frequency of the
system, given 100% core utilization. This mode prefers power savings over
performance.
12.3 Dynamic Power Saver - Favor Performance: "DPS-FP"
The firmware alters the processor frequency based upon the utilization of the
POWER7 cores. The processor frequency is ramped up as utilization goes up. When
processor cores are not utilized, their frequency drops to 1.65 GHz. The maximum core
frequency that can be achieved in this mode is 107% of the nominal ship frequency of the
system, given 100% core utilization. This mode prefers maximum performance over
power savings.
Figure 12-1 Active Energy Manager GUI
12.4 Performance Versus Power Consumption
The degree of power consumption on a system depends on the performance
characteristics of the application. HPC applications display a very diverse set of
performance characteristics ranging from very core intensive applications with very little
data movement from memory over the course of computation to very memory bandwidth
intensive behavior where data is constantly moved to and from system memory.
In order to understand the correlation between performance and power consumption on
POWER7 based systems, we considered a set of 6 application benchmarks from the
SPEC CFP2006 suite of benchmarks. Three of the applications are very memory
bandwidth intensive and the other three are core intensive. The applications are shown in
Table 12-1.
We measured the system-level power consumption of a Power 755, including all
components such as processor chips, DIMMs, fans, etc., with the system in two
different modes: (1) nominal and (2) SPS. Each application is run in a throughput mode,
whereby 32 serial copies of a given application are started simultaneously with the
Power 755 system booted in single thread (ST) mode.
In Figure 12-2, the x-axis represents the reduction in performance relative to nominal
mode when the system is switched to SPS mode. Similarly, the y-axis represents the
reduction in power relative to nominal mode when the system is switched to SPS mode.
As can be seen, the core intensive applications suffer a reduction in performance of
about 30%, though power consumption goes down by about 30%, too. For many HPC
users this may not be attractive, since their applications suffer in performance when
using SPS mode. On the other hand, memory bandwidth intensive applications suffer
less than 5% reduction in performance while resulting in power savings of about 20%-
25%. Our study shows that users and data center managers with knowledge of
application performance behavior can fully utilize the EnergyScale power management
features provided in POWER7 based systems to reduce power consumption with minimal
impact to performance.
Table 12-1 Performance characteristics of selected SPEC applications

SPEC Benchmark   Performance Characteristic
416.gamess       Core intensive
433.milc         Mem. bandwidth intensive
435.gromacs      Core intensive
437.leslie3d     Mem. bandwidth intensive
444.namd         Core intensive
459.GemsFDTD     Mem. bandwidth intensive
Figure 12-2 Correlation of performance and power consumption. (Chart: "Nominal to SPS:
Performance vs Power"; x-axis: reduction in performance (%), y-axis: reduction in power
(%), both from 0% to 35%; series: memory bandwidth intensive and core intensive.)
Appendix A: POWER7 and POWER6 Hardware
Comparison
Table A-1 Core Features

Feature           POWER7                                         POWER6
----------------  ---------------------------------------------  ---------------------------------------------
Registers         64 FPR/GPR (with renaming)                     64 FPR/GPR (no renaming)
Cache             32KB 4W I-Cache                                64KB 4W I-Cache
                  32KB 8W D-Cache (2RD/1WR)                      64KB 8W D-Cache (2RD/1WR)
                  Dedicated L2 reload buses for I and D          Shared L2 reload bus for I and D
                  32B D-Cache reload bus at core frequency       32B reload bus at core frequency
                  4 MB L2, not shared                            4 MB L2, not shared
                  32 MB L3                                       32 MB L3
Functional Units  2FX, 2LS, 4FP, 1BR, 1CR, 1DP, 1 VSX/AltiVec    2FX, 2LS, 2FP, 1BR/CR, 1DP, 1 AltiVec
FPU Pipeline      2 eight-stage (6 execution)                    2 eight-stage (6 execution)
Threading         1-, 2- or 4-thread SMT                         2-thread SMT
                  Priority-based dispatch                        Priority-based dispatch
                  Alternating dispatch from 2 or 4 threads       Simultaneous dispatch from 2 threads
                  (6 instructions)                               (7 instructions)
Instruction       6-instruction dispatch per thread:             5-instruction dispatch per thread;
Dispatch          two branches,                                  7-instruction dispatch for 2 threads (SMT)
                  four non-branch instructions                   1 branch at any location in group
                                                                 2 threads per cycle
                                                                 In flight: 1 thread - 120; 2 threads - 184
Instruction       8-instruction issue per thread:                2FX, 2LS, 2FP/1DP, 1BR/CR
Issue             two load or store ops,                         AltiVec uses FPQ and VIQ
                  two fixed-point ops,
                  two scalar floating-point, two VSX, two
                  AltiVec ops (one must be a permute op)
                  or one DFP op,
                  one branch op,
                  one condition register op
Rename            Yes                                            No
                                                                 Load Target Buffer (up to 20 loads)
Translation       I-ERAT = 64 entries, 2W (4KB, 64KB page)       I-ERAT = 128 entries, 2W (4KB, 64KB page)
                  D-ERAT = 64 entries, fully associative         D-ERAT = 128 entries, fully associative
                  (4KB, 64KB, 16MB page)                         (4KB, 64KB, 16MB page)
                  SLB = 32 entries per thread                    SLB = 64 entries per thread
                  68-bit VA, 46-bit RA                           68-bit VA, 48-bit RA
                  Page size = 4KB, 64KB, 16MB, 16GB              Page size = 4KB, 64KB, 16MB, 16GB
Appendix B: IBM System Power 755 Compute Node
Figure B-1 Schematic of Power 755 Node
Appendix C: Script petaskbind.sh
#!/bin/sh
# Script is designed to bind tasks using the bindprocessor command
# for MPI tasks started via poe.
#
# First draft - no checking is done so be careful
# TODO : make determination of ncpu more robust
#        i.e. check output of lparstat.
#
# Usage : poe petaskbind.sh a.out <args>
#
# Assumed env variables from PE :
# MP_CHILD        - from POE, effectively the task rank
# MP_COMMON_TASKS - from POE, colon delimited string containing number
#                   and list of mpi task ids running on the same node
#
# Envs to control binding
# PEBND_PE_STRIDE - Stride between successive MPI tasks
#                   value of -1 will set stride = ncpus/ntasks
#                   Default value of -1
# PEBND_PE_START  - Desired logical processor to start PE tasks
#                   Default value 0
#
getmin ()
{
  xmin=$1
  # xlist is colon delimited list of MPI tasks sharing same node
  xlist=$2
  for x in `echo $xlist | sed 's/:/ /g'`; do
    if [ $x -lt $xmin ]; then
      xmin=$x
    fi
  done
  echo "$xmin"
}
# Set defaults
PEBND_PE_STRIDE=${PEBND_PE_STRIDE:--1}
PEBND_PE_START=${PEBND_PE_START:-0}
# Get number of common tasks
ncom=${MP_COMMON_TASKS%%:*}
ntasks=`expr $ncom + 1`
# Get number of logical processors on node; assumes lparstat is available
ncpu=`lparstat | grep System | awk '{ print $6 }' | awk -F= '{ print $2 }'`
# Get list of common tasks; 1st element in this list is number of common tasks
# unless it is the only task
comlist=${MP_COMMON_TASKS#*:}
if [ $ncom -eq 0 ]; then
  comlist=""
fi
mytask=$MP_CHILD
# Determine smallest task id on node
mintask=`getmin $mytask $comlist`
# local index
start_index=`expr $mytask - $mintask`
if [ "x$PEBND_PE_STRIDE" = "x-1" ]; then
  stride=`expr $ncpu / $ntasks`
else
  stride="$PEBND_PE_STRIDE"
fi
start_proc=`expr $PEBND_PE_START + $start_index \* $stride`
# Debugging
debug=0
if [ $debug = 1 ]; then
  echo "start_proc $start_proc"
  echo "stride = $stride"
  echo "PEBND_PE_STRIDE $PEBND_PE_STRIDE"
  echo "PEBND_PE_START $PEBND_PE_START"
  # echo "MP_COMMON_TASKS $MP_COMMON_TASKS"
  # echo "comlist $comlist"
  # echo "ncom $ncom"
fi
# Do the binding.
bindprocessor $$ $start_proc
# Execute command
exec "$@"
Appendix D: Script petaskbind-rset.sh
#!/bin/sh
# Script is designed to bind tasks using the execrset command
# for MPI tasks started via poe.
#
# First draft - no checking is done so be careful
# TODO : make determination of ncpu more robust
#        i.e. check output of lparstat.
#
# Usage : poe petaskbind-rset.sh a.out <args>
#
# Assumed env variables from PE :
# MP_CHILD        - from POE, effectively the task rank
# MP_COMMON_TASKS - from POE, colon delimited string containing number
#                   and list of mpi task ids running on the same node
#
# Envs to control binding
# PEBND_PE_STRIDE - Stride between successive MPI tasks
#                   value of -1 will set stride = ncpus/ntasks
#                   Default value of -1
# PEBND_PE_START  - Desired logical processor to start PE tasks
#                   Default value 0
#
getmin ()
{
  xmin=$1
  # xlist is colon delimited list of MPI tasks sharing same node
  xlist=$2
  for x in `echo $xlist | sed 's/:/ /g'`; do
    if [ $x -lt $xmin ]; then
      xmin=$x
    fi
  done
  echo "$xmin"
}
# Set defaults
PEBND_PE_STRIDE=${PEBND_PE_STRIDE:--1}
PEBND_PE_START=${PEBND_PE_START:-0}
# Get number of common tasks
ncom=${MP_COMMON_TASKS%%:*}
ntasks=`expr $ncom + 1`
# Get number of logical processors on node; assumes lparstat is available
ncpu=`lparstat | grep System | awk '{ print $6 }' | awk -F= '{ print $2 }'`
# Get list of common tasks; 1st element in this list is number of common tasks
# unless it is the only task
comlist=${MP_COMMON_TASKS#*:}
if [ $ncom -eq 0 ]; then
  comlist=""
fi
mytask=$MP_CHILD
# Determine smallest task id on node
mintask=`getmin $mytask $comlist`
# local index
start_index=`expr $mytask - $mintask`
if [ "x$PEBND_PE_STRIDE" = "x-1" ]; then
  stride=`expr $ncpu / $ntasks`
else
  stride="$PEBND_PE_STRIDE"
fi
start_proc=`expr $PEBND_PE_START + $start_index \* $stride`
# Debugging
debug=0
if [ $debug = 1 ]; then
  echo "start_proc $start_proc"
  echo "stride = $stride"
  echo "PEBND_PE_STRIDE $PEBND_PE_STRIDE"
  echo "PEBND_PE_START $PEBND_PE_START"
  # echo "MP_COMMON_TASKS $MP_COMMON_TASKS"
  # echo "comlist $comlist"
  # echo "ncom $ncom"
fi
# Do the binding.
# bindprocessor $$ $start_proc
attachrset -F -c $start_proc $$ > /dev/null 2>&1
# Execute command
# or replace below with
# execrset -c $start_proc -e "$@"
exec "$@"
Appendix E: Enabling Huge Pages on SLES11 Power 755
systems
This is a small overview of how to set up Huge Pages on a system. See also this wiki
page.
We strongly recommend performing these actions immediately after a reboot of
the system!
How to allocate Huge Pages:
#!/bin/bash
# verify the locality of the memory on the memory pools:
numactl --hardware | tee numactl_out0
# verify the number of Huge Pages allocated on the system
cat /proc/meminfo | grep Huge
# allocate Huge Pages (just after a reboot to have a clear memory)
# first reset everything
echo 0 > /proc/sys/vm/nr_hugepages
# allocate X GB of Huge Pages (16 MB per huge page on POWER)
export X=64
nbhp=$(echo "$X * 1024 / 16" | bc)
#echo <integer value from above command line> > /proc/sys/vm/nr_hugepages
echo $nbhp > /proc/sys/vm/nr_hugepages
# verify the amount of Huge Pages allocated
cat /proc/meminfo | grep Huge
# verify the locality of the memory on the memory pools:
numactl --hardware | tee numactl_out1
# now create the filesystem used to access these Huge Pages
mkdir /libhugetlbfs
# then mount the filesystem
mount -t hugetlbfs hugetlbfs /libhugetlbfs
# create a user group to restrict access to Huge Pages
groupadd libhuge
chmod 770 /libhugetlbfs
chgrp libhuge /libhugetlbfs/
chmod g+w /libhugetlbfs/
# add the sara user id to the Huge Pages group
usermod -G libhuge sara
How to use Huge Pages with your application:
For codes using the malloc (C) or ALLOCATE (Fortran) functions:
you don't need to recompile. At execution time, use the following:
LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./${EXE}
For codes using static arrays:
you must recompile and use the following flags at link time:
-B /usr/share/libhugetlbfs/ -tl -Wl,--hugetlbfs-link=BDT
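For example, a hypothetical XL C link step using the flags given above (the program
and file names are illustrative):
xlc -o myapp myapp.c -B /usr/share/libhugetlbfs/ -tl -Wl,--hugetlbfs-link=BDT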
There is no way to use Huge Pages for codes using both static arrays and malloc.
Appendix F: Flushing Linux I/O buffers
The total memory used by I/O buffers during an application run is not released by default.
OS tuning can improve this behavior, but there is also a way to manually flush these
buffers to free the memory they use.
This is quite important when an application Y is launched just after an application X
that has been using a lot of local memory. The memory pools available to allocate data
may not be local, and then an MPI task/process of Y will allocate data on a memory card
in a remote location. This can dramatically impact the performance of memory intensive
codes.
Command:
echo 1 > /proc/sys/vm/drop_caches : to free pagecache
echo 2 > /proc/sys/vm/drop_caches : to free dentries and inodes
echo 3 > /proc/sys/vm/drop_caches : to free pagecache, dentries and inodes
As this is a non-destructive operation, and dirty objects are not freeable, the user should
run "sync" first in order to make sure all cached objects are freed.
Example:
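For instance, to synchronize dirty pages and then drop all caches (run as root, using
the commands given above):
sync
echo 3 > /proc/sys/vm/drop_caches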
Appendix G: Compiler Flags and Environment Settings for
NAS Parallel Benchmarks
Following are the compiler flags and the environment settings used in the NAS
benchmark runs on the Power 755 cluster system.
Different compiler options are used for different benchmarks:
For ft and lu:
-O5 -q64 -qnohot -qarch=pwr7 -qtune=pwr7
For bt and sp:
-O3 -q64 -qnohot -qarch=pwr7 -qtune=pwr7
For cg and mg:
-O5 -q64 -qarch=pwr7 -qtune=pwr7
SYSTEM CONFIGURATION:
Node: Power 755
---------------------
MEMORY: 249.25GB (32 x 8GB DIMM) - No large pages
CPUs: 64, clock speed 3300 MHz - SMT-2 enabled
AIX: 6.1 (6100-04-01) - 64 bit kernel
LoadLeveler: The BACKFILL scheduler is in use
Switch: Qlogic Infiniband
2 x 144-port SilverStorm 9120, DDR
2 links per network adapter to each Qlogic switch
Installed software:
LOADL: 4.1.0.1
LAPI: 3.1.4.1
PPE-POE: 5.2.0.1
ESSL: 5.1.0.0
PESSL: 3.3.0.2
GPFS: 3.3.0.2
VAC: 11.01.0000.0000
XLF: 13.01.0000.0000
The following MPI and other environment variables were used in the runs:
export OMP_NUM_THREADS=1
export MP_PROCS=32
export MP_HOSTFILE=hf
export MP_USE_BULK_XFER=yes
export MEMORY_AFFINITY=MCM
export MP_PULSE=0
export MP_EAGER_LIMIT=65536
export MP_INFOLEVEL=4
export MP_EUILIB=us
export MP_EUIDEVICE=sn_all
export MP_SHARED_MEMORY=yes
export MP_SINGLE_THREAD=yes
export MP_INSTANCES=2
export MP_RETRANSMIT_INTERVAL=5000
export TARGET_CPU_LIST=-1
Appendix H: Example Program Listing for Using the
dscr_ctl System Call
Applications may exhibit better performance using a DSCR setting different from the
system default. AIX provides a dscr_ctl subroutine that can be used to set the DSCR
register for an application. The prototype for the routine is in the file
/usr/include/sys/machine.h. Below is an example program demonstrating how to query
and set the DSCR register for a user application.
#include <stdio.h>
#include <stdlib.h>
#include <sys/machine.h>