[Figure: the Cell BE processor; each SPE consists of an SXU, a Local Store (LS) and an MFC, attached to the EIB]

… two floating point instructions (FLOP). As all double precision arithmetic instructions need the same number of clock cycles, these instructions yield the best floating point operation per second (Flop/s) ratio.

1.2 ABINIT

The software package ABINIT [12] is used to perform ab initio electronic structure calculations by solving the time-independent Schrödinger equation

    Ĥ_tot φ = E_tot φ
3.1 Profiling ABINIT

Profiling the application shows that only 2 % of ABINIT's code makes up over 80 % of its run-time. The most important part turned out to be ZGEMM, a double precision matrix multiplication routine from BLAS, which contributed about 25 % of the total run time in our test runs. Fourier transformations are another important part of ABINIT; the functions which perform those contribute another 25 % of the total run time. Another important function is the ABINIT-internal application of the non-local Hamiltonian operator, which is realized in the routines opernl4*. Together these contribute about 35 % of ABINIT's run time. Similar profiling results have been obtained in [13] with an older version of ABINIT, with the exception that ZGEMM was not used in that version, since it is leveraged by the …

[Figure 2. Fundamentals of our acceleration model: in the original code, ALLOCATE(A) is followed by CALL FUNCTION(A); in our contribution the PPE instead starts SPE tasks (SPE1 ... SPE8), each of which gets its input, does the calculations and writes back, while the PPE waits for their completion]

… the executed SPE program does not do anything besides waiting until all used SPEs are running (which is ensured via fast SPE-to-SPE communication) and handing the control flow back to the PPE, which will just destroy the SPE contexts after all SPEs completed their little task. This is done 10^5 times. The overall time for all iterations is then divided by 10^5, so we get a good lower bound for the minimal overhead induced through SPE context creation and destruction by the PPE, including the overhead of the PPE function call. We found that this overhead increases almost linearly as the number of SPEs involved in this micro-benchmark rises. See Figure 3 for the detailed results.

[Figure 3: overhead of SPE context creation and destruction over the number of SPEs involved]

4 Dense Matrix Multiplication

Basic Linear Algebra Subprograms (BLAS) is a widely used application programming interface for libraries to perform basic dense linear algebra operations such as matrix multiplication. They were first published in 1979 [19]. Highly optimized implementations of the BLAS interface have been developed by different vendors or groups for many architectures.
    C := α op(A) op(B) + β C

where op(A) specifies whether the normal, transposed or conjugated version of the matrix is to be used. A, B and C are matrices consisting of complex numbers, and α and β are complex scalars. The FORTRAN interface is:

    SUBROUTINE ZGEMM(TRANSA, TRANSB, M, N, K, ALPHA,
                     A, LDA, B, LDB, BETA, C, LDC)

TRANSA and TRANSB contain the operator to be used on matrix A and B as a single character, which can be n (normal), t (transposed) or c (transposed, conjugated). op(A) is an M by K matrix, op(B) is a K by N matrix, and C is an M by N matrix. Note that M, N and K refer to the matrices after the operators are applied, not to the original input matrices. ALPHA and BETA correspond to α and β in the equation above. LDA, LDB and LDC specify the first dimension of the input matrices, so it is possible to apply ZGEMM to the top-left part of the input matrices only.

The input matrices A, B and C are stored in column-major order, as they come from a program written in FORTRAN. Figure 4 illustrates the meaning of the different ZGEMM parameters which deal with the representation of the input matrices.

[Figure 4: the M by K part of A used by ZGEMM and its column-major memory layout; columns 1 ... K are each stored with a stride of LDA elements, of which M are used]

… column of op(B), we will always have to consider the operators for our memory access.

We investigated works that use Strassen's or Winograd's implementation to reduce the asymptotic complexity of the matrix multiplication [7]. However, those optimized algorithms work only with well-conditioned matrices, which we cannot guarantee in the general case. Thus, we chose to implement a traditional O(N^3) algorithm for our ZGEMM.

4.1 Our ZGEMM implementation

We had to apply two important concepts to be able to design a well-performing ZGEMM implementation: we partitioned the input data and distributed it among the Local Stores of the available SPEs to minimize memory latencies during calculation, and we vectorized all calculations in order to exploit the SIMD architecture.

4.1.1 Data Partitioning

As the Local Store of an SPE is limited to 256 KiB, the goal should be to save space and memory transfers. A first idea was to load parts of a row of op(A) and a column of op(B) and to compute exactly one element of C. There are some problems with this: depending on the operator, the rows (or columns) of the matrices are stored sequentially in memory or scattered with a displacement (of LDx), forcing us to get each element separately. This would decrease performance, as the MFC operates best with memory chunks that are multiples of 128 bytes in size.

A better idea is to load blocks instead of lines, and to perform small matrix-matrix multiplications instead of scalar products. This gives us independence from the operator: the decision whether rows or columns should be used in the scalar product of the matrix multiplications on the SPEs does not affect performance, as we have random access to the Local Store. Another advantage is the number of operations. For n elements which fit in each input buffer of our Local Store, O(n) multiply and add operations can be done with the scalar product, but O(sqrt(n^3)) = O(n^1.5) operations can be achieved with small matrix multiplications. Of course, with more operations on the same amount of local data the total number of memory transfers is reduced.
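To make the interface description above concrete, the semantics (C := alpha * op(A) * op(B) + beta * C on column-major data) can be pinned down with a naive C99 reference routine. This is an illustrative sketch, not the paper's SPE implementation, and the helper names are invented here:

```c
#include <assert.h>
#include <complex.h>

/* op-aware element access for a column-major matrix X with leading
 * dimension ld: 'n' reads X(row,col), 't' the transpose, 'c' the
 * conjugate transpose. */
static double complex op_elem(char trans, const double complex *x,
                              int ld, int row, int col) {
    switch (trans) {
    case 't': return x[row * ld + col];
    case 'c': return conj(x[row * ld + col]);
    default:  return x[col * ld + row];
    }
}

/* C := alpha * op(A) * op(B) + beta * C, where op(A) is M by K,
 * op(B) is K by N and C is M by N, all stored column-major. */
static void zgemm_ref(char ta, char tb, int m, int n, int k,
                      double complex alpha,
                      const double complex *a, int lda,
                      const double complex *b, int ldb,
                      double complex beta,
                      double complex *c, int ldc) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double complex s = 0;
            for (int l = 0; l < k; l++)
                s += op_elem(ta, a, lda, i, l) * op_elem(tb, b, ldb, l, j);
            c[j * ldc + i] = alpha * s + beta * c[j * ldc + i];
        }
}
```

The blocked SPE version has to reproduce exactly these semantics; blocking only changes the order in which the partial products of the innermost sum are accumulated.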
… op(A) and the shaded column in op(B).

[Figure 5. SPE Block assignment: blocks of the result matrix are assigned cyclically (1 ... 6) to the SPEs, each SPE multiplying a block row of op(A) with a block column of op(B)]

We investigated the use of the PPE with an experimental implementation. The PPE has a theoretical peak performance of 6.4 GFlop/s. Our code spawns N threads on the PPE, each of them computing the same chunk of op(C) as an SPE does², using a PPC970-optimized BLAS implementation to perform the computation. Despite the given peak performance, we achieved only 1.7 GFlop/s with ATLAS on the PPE, which makes this partitioning scheme suboptimal. Thus, we did not include the PPE measurements in our benchmarks.

² Theoretically, N = 4 should be optimal.

4.1.3 Vectorization

In our matrix multiplication, each element is a 128 bit complex number, consisting of 64 bit double precision floating point values for the real part and the imaginary part. We can safely assume that only fused multiply add operations are used, as two elements of each matrix are multiplied and added to the temporary scalar product. One multiply-add operation of complex numbers a and b added to y (y = y + a·b) is split up like this for its real and imaginary parts:

    y_re := y_re + a_re·b_re − a_im·b_im
    y_im := y_im + a_re·b_im + a_im·b_re

This makes 4 fused multiply add operations with 64 bit operands. With the SIMD ability of the SPU, two complex multiply-adds can be done instead of one. To use SIMD instructions, the real parts and imaginary parts have to be split and packed into separate registers. This can be done with the SPU shuffle instruction. Now the calculation can be done as described above, and the only thing left to do is to separate the real and imaginary parts into the result registers before we write back into C.

One little obstacle remains: the fused multiply subtract operation on the SPU, spu_msub(a, b, c), calculates a·b − c, but we would need c − a·b. To achieve this without adding further instructions to change the sign, the real part can be calculated as follows:

    y_re := a_re·b_re − ((a_im·b_im) − y_re)

Figure 6 shows how the block-wise multiplication can be implemented in C, using the SPU intrinsics.³

    #define VPTR (vector double *)
    vector char high_dbl = { 0, 1, 2, 3, 4, 5, 6, 7,
                            16,17,18,19,20,21,22,23};
    vector char low_dbl  = { 8, 9,10,11,12,13,14,15,
                            24,25,26,27,28,29,30,31};
    vector double rre = {0,0}, rim = {0,0}, tre, tim, sre, sim;
    for (k = 0; k < klen; k++, aa += astp, bb += bstp) {
        fim = spu_shuffle(*(VPTR aa), *(VPTR (aa + astp)), low_dbl);
        gim = spu_shuffle(*(VPTR bb), *(VPTR bb), low_dbl);
        fre = spu_shuffle(*(VPTR aa), *(VPTR (aa + astp)), high_dbl);
        gre = spu_shuffle(*(VPTR bb), *(VPTR bb), high_dbl);
        tre = spu_nmsub(fim, gim, sre);
        tim = spu_madd(fre, gim, sim);
        sre = spu_msub(fre, gre, tre);
        sim = spu_madd(fim, gre, tim);
    }
    rre = spu_shuffle(sre, sim, high_dbl);
    rim = spu_shuffle(sre, sim, low_dbl);
    *(VPTR cc) = spu_add(*(VPTR cc), rre);
    *(VPTR (cc + 1)) = spu_add(*(VPTR (cc + 1)), rim);

Figure 6. Inner loop of the blockwise matrix multiplication, implemented in C

³ Our code and the tests that were used to obtain the presented benchmark results can be fetched from http://files.perlplexity.org/zgemm.tar.gz.

4.2 Benchmarks

This section provides a performance evaluation of our implementation and a qualitative and quantitative comparison to BLAS implementations on other modern architectures.

[Figure 7. Performance Comparison: GFlop/s over matrix size (N x N, up to 2000) for RefBLAS, CellBLAS on the PS3 (6 SPUs), CellBLAS on the QS20 (8 and 16 SPUs) and IBM BLAS (DGEMM) on the QS20 (8 SPUs)]

The current Cell BE chip's SPEs are capable of issuing one double precision arithmetic instruction every six clock cycles. This instruction needs another seven cycles until the result is available in the target register. But if we assume
to execute a very large number of data-independent double precision operations, we would get a cycles-per-instruction (CPI) value of 6. Considering FMADD operations and a vector size of two, the theoretical peak performance of a single Cell BE CPU with 8 SPEs and a clock rate of 3.2 GHz is

    R_peak = (3.2 * 10^9 Hz / 6) * 8 SPE * 4 Flop/SPE = 17.07 GFlop/s

This is the number in theory; in practical tests (back-to-back execution of fused multiply add instructions with no data dependencies) we were able to measure up to 14.5 GFlop/s. This number is said to be the Cell BE double precision peak performance [22].

Even though our implementation supports arbitrary matrices, we benchmarked square matrices to enable easy comparisons to other publications. We used ppu-gcc version 4.1.1 with the flags -O3 -mabi=altivec -maltivec to compile all PPE code and spu-gcc version 4.1.1 with -O3 for the SPE code. The Cell BE specific benchmarks were run on a 3.2 GHz IBM QS20 Cell Blade, which contains 2 Cell BE processors with 8 SPEs per processor and two 512 MiB RAM banks, and on a Playstation 3 running at 3.2 GHz with 200 MiB memory. Both systems run Linux 2.6 (with IBM patches applied).

[Figure 8: GFlop/s over matrix size (N x N, up to 2000) for cellblas with 8 SPEs, with and without DMA transfers]

… refblas by using different numactl configurations (numactl controls which CPU uses which memory bank), we were not able to achieve more than one GFlop/s. This is due to the fact that the current compilers do not automatically generate code for the SPUs; thus, the refblas implementation used only the rather slow PPC core. We outperform the IBM DGEMM implementation by a large margin for all matrix sizes, and our code scales very well up to 16 SPUs. We can also reproduce similar performance on the specialized Playstation 3 (PS3) hardware (only 6 SPEs are accessible under Linux).

Another optimization technique that has been proposed [6] is to overlap memory (DMA) accesses with computation. However, this increases the code complexity significantly. To evaluate the potential benefit, we removed all memory (DMA) accesses from our implementation to simulate the overlap. This invalidates the results but provides an upper bound to the performance gain due to overlap. Figure 8 shows the comparison to our implementation. Our experiments show that we could gain up to one GFlop/s performance with this overlap technique.

[Figure 9. Absolute efficiency of different BLAS implementations: GFlop/s over matrix size (N x N, up to 7000) for RefBLAS, CellBLAS (PS3, 8 SPEs, 16 SPEs), Big Red, Jeltz, Odin and Sif]
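The peak number derived above is plain arithmetic and easy to re-check; the following small C helper simply restates the formula (the parameter names are ours, not from the paper):

```c
#include <assert.h>

/* Peak rate in GFlop/s when one vector FMADD (2 lanes * 2 Flop
 * = 4 Flop per instruction) can issue every `cpi` cycles on each
 * of `spes` SPEs. */
static double rpeak_gflops(double clock_hz, double cpi,
                           int spes, int flop_per_instr) {
    return clock_hz / cpi * spes * flop_per_instr / 1e9;
}
```

rpeak_gflops(3.2e9, 6.0, 8, 4) yields about 17.07, so the measured 14.5 GFlop/s quoted above is roughly 85 % of this bound.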
… are described in the following: a node in Big Red has two dual-core PowerPC 970 MP processors (2.5 GHz) with 8 GB RAM per node. The peak performance (with FMADD) is 40 GFlop/s, and we ran the IBM ESSL library. We used the Goto BLAS [16] library 1.19 on Odin, a dual-CPU dual-core Opteron running at 2 GHz with a peak performance of 16 GFlop/s, and on Sif, a dual-CPU quad-core 1.86 GHz Intel Xeon with 59.5 GFlop/s peak. The theoretically fastest tested system, Jeltz, has two quad-core Intel Xeon 3.0 GHz processors and a peak performance of 96 GFlop/s. Jeltz runs Mac OS X Tiger, and we used the vendor-supplied Veclib for our experiments. The absolute performance results for all those systems are plotted in Figure 9.

Due to memory and CPU time limits, not all matrix sizes could be run on all systems (e.g., the PS3 had only 200 MiB). Our benchmarks show that the current generation Cell BE is not really suited to perform double precision floating point calculations, because it is largely outperformed by systems in the same and lower price range. However, the specialized low-cost Playstation 3 makes a big difference in this price-performance game, but its limited memory might be a big obstacle to scientific use.

Those absolute value comparisons do not allow any qualitative comparisons between the different libraries. The main problem is the high variance in peak performance. To compare our implementation to other BLAS libraries, we normalized the measured performance to the peak performance of the architecture to get an estimate of the efficiency of use of the floating point units. We expect a pretty high efficiency on the standard superscalar and cache-based architectures due to the high spatial and temporal locality in matrix multiplication algorithms and decades of development. However, the Cell BE represents a completely new approach with its "explicit cache" (the Local Store). In addition, the Cell architecture introduces extra overheads for loading the code to the SPUs. The relative performance results are presented in Figure 10. The highly optimized Goto BLAS implementation delivers the best performance on the available architectures. IBM's Engineering and Scientific Subroutine Library (ESSL) delivers good performance on PowerPC. Our implementation, which explores a new CPU architecture, performs very well in comparison to the well-established ones and even better than Apple's Veclib.

[Figure 10: fraction of the peak performance reached by the different BLAS implementations]

5 Fast Fourier Transformation on Cell BE

ABINIT 5.4.4 contains two different FFT implementations …

To compare the performance capabilities of different FFT algorithms on the Cell architecture, we employed benchFFT [4], which is an FFT benchmark suite. It contains and supports benchmarking a large number of FFT algorithms like ACML (AMD Core Math Library, [2]), FFTW2, FFTW3.1 ([4], [9]), Intel MKL (Intel Math Kernel Library, [5]) and goedecker. A list of all supported FFT libraries can be obtained from [3]. The benchmarks have been carried out on an Opteron 244 (1.8 GHz) machine and an IBM QS20 Cell Blade.

By the time of writing, only FFTW3.2-alpha3 could leverage the SPEs during a three-dimensional fast Fourier transformation. Therefore we use the term FFTW3 as a synonym for FFTW3.2-alpha3 in the following text. As we are interested in finding possible candidates for FFT algorithms in ABINIT on Cell, we did not benchmark algorithms which are only available for x86 architectures (like ACML, Intel MKL) or do not support a three-dimensional FFT.

As you can see in Figure 11, FFTW3.2-alpha3 is the fastest algorithm on the Opteron. On the Cell's PPE (Figure 12) FFTW3.2-alpha3, ffte and goedecker are the fastest available algorithms, depending on the problem size. If one allows the usage of the SPEs, FFTW3.2-alpha3 is by far the fastest algorithm available (see Figure 13).

[Figure 11. 3D-FFT on Opteron 244 (1.8 GHz)]

[Figure 12. 3D-FFT on IBM BladeCenter QS20 (PPE only)]

[Figure 13. 3D-FFT on IBM BladeCenter QS20, FFTW3 (with SPE support)]

5.2 ABINIT and FFTW3

We have shown that FFTW3 is the fastest FFT implementation available for Cell BE. On the other hand FFTW3, unlike the goedecker2002 implementation, performs a full transformation for every dimension. Due to the Nyquist-Shannon sampling theorem this is not necessary for all FFTs which occur in ab initio calculations. The optimized transformations which make use of this fact are illustrated in Figure 14. Note that only the shaded part of the FFT box really …

[Figure 14. 3D-FFT with zero padding: only the shaded parts of the box are transformed in y and in z]
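The benefit of skipping transforms over all-zero lines can be estimated by counting 1D FFTs. The sketch below is only an illustration: it assumes that just an m x m x m corner of the n^3 box is nonzero and that the dimensions are transformed one after the other, which may differ from the exact padding pattern used by ABINIT:

```c
#include <assert.h>

/* A full 3D FFT of an n^3 box performs n*n 1D transforms per
 * dimension, i.e. 3*n*n in total. */
static long full_3dfft_lines(long n) {
    return 3 * n * n;
}

/* If only an m x m x m corner is nonzero (m <= n): the first pass
 * (say, along z) only touches the m*m nonzero columns; afterwards
 * the data is nonzero for all z, so the y pass needs m*n lines,
 * and the final x pass needs all n*n of them. */
static long padded_3dfft_lines(long n, long m) {
    return m * m + m * n + n * n;
}
```

For m = n/2 this gives 1.75*n*n instead of 3*n*n 1D transforms, a saving of about 42 %.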
5.2.1 Benchmarks

For the final benchmarks we installed FFTW3.2-alpha3 and ABINIT 5.4.4 on an IBM QS20 Blade. The following compiler flags were used:

    CC="ppu-gcc"
    CFLAGS="-O3 -fomit-frame-pointer -mcpu=cell"
    CXXFLAGS="-O3 -fomit-frame-pointer -mcpu=cell"
    FC="gfortran -m64 -O3 -fomit-frame-pointer"
    FFLAGS=""

The input file which was used for the benchmarks specifies a 54^3 FFT box. All calculations have been done on one PPE with 8 SPEs. For this benchmark we varied the fftalg parameter, which means we used a number of different FFT algorithms. The wfoptalg parameter was set to 4, which means that the LOBPCG solver was used. A complete list of ABINIT parameters can be obtained from [1].

… Cell.

Our ZGEMM implementation shows the best performance of all publicly available ZGEMM or DGEMM implementations for Cell BE. Thus, our work may serve as a guideline for implementing similar algorithms.

With FFTW3 there is a very efficient FFT library available which supports the SPEs in Cell systems. We managed to make it possible to use this library together with ABINIT. This led to a reduced runtime (14 % in our example). Currently we focused on the sequential variant of ABINIT, but as the FFTW3 library comes with MPI support and employs a data layout similar to the one used in the goedecker2002 algorithm implementation, it should be feasible to achieve a similar performance improvement in the parallel case. This could be evaluated in further studies. Another approach could be to port the goedecker2002 implementations to the Cell architecture. This could prove to be a bit more difficult, as the goedecker algorithms are hardly documented and specifically crafted for ABINIT.
[12] X. Gonze, J.-M. Beuken, R. Caracas, F. Detraux, M. Fuchs,
G.-M. Rignanese, L. Sindic, M. Verstraete, G. Zerah, F. Jol-
let, M. Torrent, A. Roy, M. Mikami, P. Ghosez, J.-Y. Raty,
and D. C. Allan. First-principles computation of material
properties: the ABINIT software project. Computational
Materials Science, 25:478–493, 2002.
[13] T. Hoefler, R. Janisch, and W. Rehm. A Performance Anal-
ysis of ABINIT on a Cluster System. In Parallel Algo-
rithms and Cluster Computing, pages 37–51. Springer, Lec-
ture Notes in Computational Science and Engineering, 12
2005.
[14] T. Hoefler and G. Zerah. Transforming the high-
performance 3d-FFT in ABINIT to enable the use of non-
blocking collective operations. Technical report, Commis-
sariat a l’Energie Atomique - Direction des applications mil-
itaires (CEA-DAM), 2 2007.
[15] P. Hohenberg and W. Kohn. Inhomogeneous Electron Gas.
Physical Review, 136:B864, 1964.
[16] K. Goto and R. van de Geijn. On reducing TLB misses in matrix multipli-
cation. Technical Report TR-2002-55, The University of Texas
at Austin, Department of Computer Sciences, 2002.
[17] A. Knyazev. Toward the optimal preconditioned eigen-
solver: Locally optimal block preconditioned conjugate gra-
dient method. SIAM Journal on Scientific Computing,
23(2):517–541, 2002.
[18] W. Kohn and L. Sham. Self-Consistent Equations Includ-
ing Exchange and Correlation Effects. Physical Review,
140(4A):1133–1138, 1965.
[19] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh.
Basic Linear Algebra Subprograms for FORTRAN usage.
ACM Transactions on Mathematical Software, 5(3):308–323, 1979.
[20] B. Minor, G. Fossum, and V. To. Terrain rendering engine
(tre): Cell broadband engine optimized real-time ray-caster.
[21] J. C. R. and B. D. A. Introduction to the cell broadband
engine architecture. IBM Journal of Research and Develop-
ment, 51:503–519, 2007.
[22] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and
K. Yelick. The potential of the cell processor for scientific
computing. In CF ’06: Proceedings of the 3rd conference
on Computing frontiers, pages 9–20, New York, NY, USA,
2006. ACM.