[Figure: the Cell BE processor; each SPE consists of an SXU, a Local Store (LS) and an MFC, attached to the EIB]

… two floating point instructions (FLOP). As all double precision arithmetic instructions need the same number of clock cycles, these instructions yield the best floating point operation per second (Flop/s) ratio.

1.2 ABINIT

The software package ABINIT [12] is used to perform ab initio electronic structure calculations by solving the time-independent Schrödinger equation

    Ĥ_tot φ = E_tot φ
3.1 Profiling ABINIT

Profiling the application shows that only 2 % of ABINIT's code makes up over 80 % of its run-time. The most important part turned out to be ZGEMM, a double precision matrix multiplication routine from BLAS, which contributed about 25 % of the total run time in our test runs. Fourier transformations are another important part of ABINIT; the functions which perform those contribute another 25 % of the total run time. Another important function is the ABINIT-internal application of the non-local Hamiltonian operator, which is realized in the routines opernl4*. Together these contribute about 35 % of ABINIT's run time. Similar profiling results have been obtained in [13] with an older version of ABINIT, with the exception that ZGEMM was not used in that version, since it is leveraged by the …

[Figure 2. Fundamentals of our acceleration model: in the original code, ALLOCATE(A) is followed by CALL FUNCTION(A); in our contribution the PPE instead starts SPE tasks (SPE1 ... SPE8), each of which gets its input, does the calculations and writes back, while the PPE waits for their completion]

… the executed SPE program does not do anything besides waiting until all used SPEs are running (which is ensured via fast SPE-to-SPE communication) and handing the control flow back to the PPE, which will just destroy the SPE contexts after all SPEs completed their little task. This is done 10^5 times. The overall time for all iterations is then divided by 10^5, so we get a good lower bound for the minimal overhead induced through SPE context creation and destruction by the PPE, including the overhead of the PPE function call. We found that this overhead increases almost linearly as the number of SPEs involved in this micro-benchmark rises. See Figure 3 for the detailed results.

[Figure 3: overhead of SPE context creation and destruction over the number of SPEs involved]

4 Dense Matrix Multiplication

Basic Linear Algebra Subprograms (BLAS) is a widely used application programming interface for libraries to perform basic dense linear algebra operations such as matrix multiplication. They were first published in 1979 [19]. Highly optimized implementations of the BLAS interface have been developed by different vendors or groups for many architectures.
    C := α op(A) op(B) + β C

where op(A) specifies whether the normal, transposed or conjugated version of the matrix is to be used. A, B and C are matrices consisting of complex numbers, and α and β are complex scalars. The FORTRAN interface is:

    SUBROUTINE ZGEMM(TRANSA, TRANSB, M, N, K, ALPHA,
                     A, LDA, B, LDB, BETA, C, LDC)

TRANSA and TRANSB contain the operator to be used on matrix A and B as a single character, which can be n (normal), t (transposed) or c (transposed, conjugated). op(A) is an M by K matrix, op(B) is a K by N matrix, and C is an M by N matrix. Note that M, N and K refer to the matrices after the operators are applied, not to the original input matrices. ALPHA and BETA correspond to α and β in the equation above. LDA, LDB and LDC specify the first dimension of the input matrices, so it is possible to apply ZGEMM to the top-left part of the input matrices only.

The input matrices A, B and C are stored in column-major order, as they come from a program written in FORTRAN. Figure 4 illustrates the meaning of the different ZGEMM parameters which deal with the representation of the input matrices.

[Figure 4: the M by K part of A used by ZGEMM and its column-major memory layout; columns 1 ... K are each stored with a stride of LDA elements, of which M are used]

… column of op(B), we will always have to consider the operators for our memory access.

We investigated works that use Strassen's or Winograd's implementation to reduce the asymptotic complexity of the matrix multiplication [7]. However, those optimized algorithms work only with well-conditioned matrices, which we cannot guarantee in the general case. Thus, we chose to implement a traditional O(N^3) algorithm for our ZGEMM.

4.1 Our ZGEMM implementation

We had to apply two important concepts to be able to design a well-performing ZGEMM implementation: we partitioned the input data and distributed it among the Local Stores of the available SPEs to minimize memory latencies during calculation, and we vectorized all calculations in order to exploit the SIMD architecture.

4.1.1 Data Partitioning

As the Local Store of an SPE is limited to 256 KiB, the goal should be to save space and memory transfers. A first idea was to load parts of a row of op(A) and a column of op(B) and to compute exactly one element of C. There are some problems with this: depending on the operator, the rows (or columns) of the matrices are stored sequentially in memory or scattered with a displacement (of LDx), forcing us to get each element separately. This would decrease performance, as the MFC operates best with memory chunks that are multiples of 128 bytes in size.

A better idea is to load blocks instead of lines, and to perform small matrix-matrix multiplications instead of scalar products. This gives us independence from the operator: the decision whether rows or columns should be used in the scalar product of the matrix multiplications on the SPEs does not affect performance, as we have random access to the Local Store. Another advantage is the number of operations. For n elements which fit in each input buffer of our Local Store, O(n) multiply and add operations can be done with the scalar product, but O(sqrt(n^3)) = O(n^1.5) operations can be achieved with small matrix multiplications. Of course, with more operations on the same amount of local data the total number of memory transfers is reduced.
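To make the interface description above concrete, the semantics (C := alpha * op(A) * op(B) + beta * C on column-major data) can be pinned down with a naive C99 reference routine. This is an illustrative sketch, not the paper's SPE implementation, and the helper names are invented here:

```c
#include <assert.h>
#include <complex.h>

/* op-aware element access for a column-major matrix X with leading
 * dimension ld: 'n' reads X(row,col), 't' the transpose, 'c' the
 * conjugate transpose. */
static double complex op_elem(char trans, const double complex *x,
                              int ld, int row, int col) {
    switch (trans) {
    case 't': return x[row * ld + col];
    case 'c': return conj(x[row * ld + col]);
    default:  return x[col * ld + row];
    }
}

/* C := alpha * op(A) * op(B) + beta * C, where op(A) is M by K,
 * op(B) is K by N and C is M by N, all stored column-major. */
static void zgemm_ref(char ta, char tb, int m, int n, int k,
                      double complex alpha,
                      const double complex *a, int lda,
                      const double complex *b, int ldb,
                      double complex beta,
                      double complex *c, int ldc) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double complex s = 0;
            for (int l = 0; l < k; l++)
                s += op_elem(ta, a, lda, i, l) * op_elem(tb, b, ldb, l, j);
            c[j * ldc + i] = alpha * s + beta * c[j * ldc + i];
        }
}
```

The blocked SPE version has to reproduce exactly these semantics; blocking only changes the order in which the partial products of the innermost sum are accumulated.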
… op(A) and the shaded column in op(B).

[Figure 5. SPE Block assignment: blocks of the result matrix are assigned cyclically (1 ... 6) to the SPEs, each SPE multiplying a block row of op(A) with a block column of op(B)]

We investigated the use of the PPE with an experimental implementation. The PPE has a theoretical peak performance of 6.4 GFlop/s. Our code spawns N threads on the PPE, each of them computing the same chunk of op(C) as an SPE does², using a PPC970-optimized BLAS implementation to perform the computation. Despite the given peak performance, we achieved only 1.7 GFlop/s with ATLAS on the PPE, which makes this partitioning scheme suboptimal. Thus, we did not include the PPE measurements in our benchmarks.

² Theoretically, N = 4 should be optimal.

4.1.3 Vectorization

In our matrix multiplication, each element is a 128 bit complex number, consisting of 64 bit double precision floating point values for the real part and the imaginary part. We can safely assume that only fused multiply add operations are used, as two elements of each matrix are multiplied and added to the temporary scalar product. One multiply-add operation of complex numbers a and b added to y (y = y + a·b) is split up like this for its real and imaginary parts:

    y_re := y_re + a_re·b_re − a_im·b_im
    y_im := y_im + a_re·b_im + a_im·b_re

This makes 4 fused multiply add operations with 64 bit operands. With the SIMD ability of the SPU, two complex multiply-adds can be done instead of one. To use SIMD instructions, the real parts and imaginary parts have to be split and packed into separate registers. This can be done with the SPU shuffle instruction. Now the calculation can be done as described above, and the only thing left to do is to separate the real and imaginary parts into the result registers before we write back into C.

One little obstacle remains: the fused multiply subtract operation on the SPU, spu_msub(a, b, c), calculates a·b − c, but we would need c − a·b. To achieve this without adding further instructions to change the sign, the real part can be calculated as follows:

    y_re := a_re·b_re − ((a_im·b_im) − y_re)

Figure 6 shows how the block-wise multiplication can be implemented in C, using the SPU intrinsics.³

    #define VPTR (vector double *)
    vector char high_dbl = { 0, 1, 2, 3, 4, 5, 6, 7,
                            16,17,18,19,20,21,22,23};
    vector char low_dbl  = { 8, 9,10,11,12,13,14,15,
                            24,25,26,27,28,29,30,31};
    vector double rre = {0,0}, rim = {0,0}, tre, tim, sre, sim;
    for (k = 0; k < klen; k++, aa += astp, bb += bstp) {
        fim = spu_shuffle(*(VPTR aa), *(VPTR (aa + astp)), low_dbl);
        gim = spu_shuffle(*(VPTR bb), *(VPTR bb), low_dbl);
        fre = spu_shuffle(*(VPTR aa), *(VPTR (aa + astp)), high_dbl);
        gre = spu_shuffle(*(VPTR bb), *(VPTR bb), high_dbl);
        tre = spu_nmsub(fim, gim, sre);
        tim = spu_madd(fre, gim, sim);
        sre = spu_msub(fre, gre, tre);
        sim = spu_madd(fim, gre, tim);
    }
    rre = spu_shuffle(sre, sim, high_dbl);
    rim = spu_shuffle(sre, sim, low_dbl);
    *(VPTR cc) = spu_add(*(VPTR cc), rre);
    *(VPTR (cc + 1)) = spu_add(*(VPTR (cc + 1)), rim);

Figure 6. Inner loop of the blockwise matrix multiplication, implemented in C

³ Our code and the tests that were used to obtain the presented benchmark results can be fetched from http://files.perlplexity.org/zgemm.tar.gz.

4.2 Benchmarks

This section provides a performance evaluation of our implementation and a qualitative and quantitative comparison to BLAS implementations on other modern architectures.

[Figure 7. Performance Comparison: GFlop/s over matrix size (N x N, up to 2000) for RefBLAS, CellBLAS on the PS3 (6 SPUs), CellBLAS on the QS20 (8 and 16 SPUs) and IBM BLAS (DGEMM) on the QS20 (8 SPUs)]

The current Cell BE chip's SPEs are capable of issuing one double precision arithmetic instruction every six clock cycles. This instruction needs another seven cycles until the result is available in the target register. But if we assume
to execute a very large number of data-independent double precision operations, we would get a cycles-per-instruction (CPI) value of 6. Considering FMADD operations and a vector size of two, the theoretical peak performance of a single Cell BE CPU with 8 SPEs and a clock rate of 3.2 GHz is

    R_peak = (3.2 * 10^9 Hz / 6) * 8 SPE * 4 Flop/SPE = 17.07 GFlop/s

This is the number in theory; in practical tests (back-to-back execution of fused multiply add instructions with no data dependencies) we were able to measure up to 14.5 GFlop/s. This number is said to be the Cell BE double precision peak performance [22].

Even though our implementation supports arbitrary matrices, we benchmarked square matrices to enable easy comparisons to other publications. We used ppu-gcc version 4.1.1 with the flags -O3 -mabi=altivec -maltivec to compile all PPE code and spu-gcc version 4.1.1 with -O3 for the SPE code. The Cell BE specific benchmarks were run on a 3.2 GHz IBM QS20 Cell Blade, which contains 2 Cell BE processors with 8 SPEs per processor and two 512 MiB RAM banks, and on a Playstation 3 running at 3.2 GHz with 200 MiB memory. Both systems run Linux 2.6 (with IBM patches applied).

[Figure 8: GFlop/s over matrix size (N x N, up to 2000) for cellblas with 8 SPEs, with and without DMA transfers]

… refblas by using different numactl configurations (numactl controls which CPU uses which memory bank), we were not able to achieve more than one GFlop/s. This is due to the fact that the current compilers do not automatically generate code for the SPUs; thus, the refblas implementation used only the rather slow PPC core. We outperform the IBM DGEMM implementation by a large margin for all matrix sizes, and our code scales very well up to 16 SPUs. We can also reproduce similar performance on the specialized Playstation 3 (PS3) hardware (only 6 SPEs are accessible under Linux).

Another optimization technique that has been proposed [6] is to overlap memory (DMA) accesses with computation. However, this increases the code complexity significantly. To evaluate the potential benefit, we removed all memory (DMA) accesses from our implementation to simulate the overlap. This invalidates the results but provides an upper bound to the performance gain due to overlap. Figure 8 shows the comparison to our implementation. Our experiments show that we could gain up to one GFlop/s performance with this overlap technique.

[Figure 9. Absolute efficiency of different BLAS implementations: GFlop/s over matrix size (N x N, up to 7000) for RefBLAS, CellBLAS (PS3, 8 SPEs, 16 SPEs), Big Red, Jeltz, Odin and Sif]
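The peak number derived above is plain arithmetic and easy to re-check; the following small C helper simply restates the formula (the parameter names are ours, not from the paper):

```c
#include <assert.h>

/* Peak rate in GFlop/s when one vector FMADD (2 lanes * 2 Flop
 * = 4 Flop per instruction) can issue every `cpi` cycles on each
 * of `spes` SPEs. */
static double rpeak_gflops(double clock_hz, double cpi,
                           int spes, int flop_per_instr) {
    return clock_hz / cpi * spes * flop_per_instr / 1e9;
}
```

rpeak_gflops(3.2e9, 6.0, 8, 4) yields about 17.07, so the measured 14.5 GFlop/s quoted above is roughly 85 % of this bound.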
… are described in the following: a node in Big Red has two dual-core PowerPC 970 MP processors (2.5 GHz) with 8 GB RAM per node. The peak performance (with FMADD) is 40 GFlop/s, and we ran the IBM ESSL library. We used the Goto BLAS [16] library 1.19 on Odin, a dual-CPU dual-core Opteron running at 2 GHz with a peak performance of 16 GFlop/s, and on Sif, a dual-CPU quad-core 1.86 GHz Intel Xeon with 59.5 GFlop/s peak. The theoretically fastest tested system, Jeltz, has two quad-core Intel Xeon 3.0 GHz processors and a peak performance of 96 GFlop/s. Jeltz runs Mac OS X Tiger, and we used the vendor-supplied Veclib for our experiments. The absolute performance results for all those systems are plotted in Figure 9.

Due to memory and CPU time limits, not all matrix sizes could be run on all systems (e.g., the PS3 had only 200 MiB). Our benchmarks show that the current generation Cell BE is not really suited to perform double precision floating point calculations, because it is largely outperformed by systems in the same and lower price range. However, the specialized low-cost Playstation 3 makes a big difference in this price-performance game, but its limited memory might be a big obstacle to scientific use.

Those absolute value comparisons do not allow any qualitative comparisons between the different libraries. The main problem is the high variance in peak performance. To compare our implementation to other BLAS libraries, we normalized the measured performance to the peak performance of the architecture to get an estimate of the efficiency of use of the floating point units. We expect a pretty high efficiency on the standard superscalar and cache-based architectures due to the high spatial and temporal locality in matrix multiplication algorithms and decades of development. However, the Cell BE represents a completely new approach with its "explicit cache" (the Local Store). In addition, the Cell architecture introduces extra overheads for loading the code to the SPUs. The relative performance results are presented in Figure 10. The highly optimized Goto BLAS implementation delivers the best performance on the available architectures. IBM's Engineering and Scientific Subroutine Library (ESSL) delivers good performance on PowerPC. Our implementation, which explores a new CPU architecture, performs very well in comparison to the well-established ones and even better than Apple's Veclib.

[Figure 10: fraction of the peak performance reached by the different BLAS implementations]

5 Fast Fourier Transformation on Cell BE

ABINIT 5.4.4 contains two different FFT implementations …

To compare the performance capabilities of different FFT algorithms on the Cell architecture, we employed benchFFT [4], which is an FFT benchmark suite. It contains and supports benchmarking a large number of FFT algorithms like ACML (AMD Core Math Library, [2]), FFTW2, FFTW3.1 ([4], [9]), Intel MKL (Intel Math Kernel Library, [5]) and goedecker. A list of all supported FFT libraries can be obtained from [3]. The benchmarks have been carried out on an Opteron 244 (1.8 GHz) machine and an IBM QS20 Cell Blade.

By the time of writing, only FFTW3.2-alpha3 could leverage the SPEs during a three-dimensional fast Fourier transformation. Therefore we use the term FFTW3 as a synonym for FFTW3.2-alpha3 in the following text. As we are interested in finding possible candidates for FFT algorithms in ABINIT on Cell, we did not benchmark algorithms which are only available for x86 architectures (like ACML, Intel MKL) or do not support a three-dimensional FFT.

As you can see in Figure 11, FFTW3.2-alpha3 is the fastest algorithm on the Opteron. On the Cell's PPE (Figure 12) FFTW3.2-alpha3, ffte and goedecker are the fastest available algorithms, depending on the problem size. If one allows the usage of the SPEs, FFTW3.2-alpha3 is by far the fastest algorithm available (see Figure 13).

[Figure 11. 3D-FFT on Opteron 244 (1.8 GHz)]

[Figure 12. 3D-FFT on IBM BladeCenter QS20 (PPE only)]

[Figure 13. 3D-FFT on IBM BladeCenter QS20, FFTW3 (with SPE support)]

5.2 ABINIT and FFTW3

We have shown that FFTW3 is the fastest FFT implementation available for Cell BE. On the other hand FFTW3, unlike the goedecker2002 implementation, performs a full transformation for every dimension. Due to the Nyquist-Shannon sampling theorem this is not necessary for all FFTs which occur in ab initio calculations. The optimized transformations which make use of this fact are illustrated in Figure 14. Note that only the shaded part of the FFT box really …

[Figure 14. 3D-FFT with zero padding: only the shaded parts of the box are transformed in y and in z]
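The benefit of skipping transforms over all-zero lines can be estimated by counting 1D FFTs. The sketch below is only an illustration: it assumes that just an m x m x m corner of the n^3 box is nonzero and that the dimensions are transformed one after the other, which may differ from the exact padding pattern used by ABINIT:

```c
#include <assert.h>

/* A full 3D FFT of an n^3 box performs n*n 1D transforms per
 * dimension, i.e. 3*n*n in total. */
static long full_3dfft_lines(long n) {
    return 3 * n * n;
}

/* If only an m x m x m corner is nonzero (m <= n): the first pass
 * (say, along z) only touches the m*m nonzero columns; afterwards
 * the data is nonzero for all z, so the y pass needs m*n lines,
 * and the final x pass needs all n*n of them. */
static long padded_3dfft_lines(long n, long m) {
    return m * m + m * n + n * n;
}
```

For m = n/2 this gives 1.75*n*n instead of 3*n*n 1D transforms, a saving of about 42 %.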
5.2.1 Benchmarks

For the final benchmarks we installed FFTW3.2-alpha3 and ABINIT 5.4.4 on an IBM QS20 Blade. The following compiler flags were used:

    CC="ppu-gcc"
    CFLAGS="-O3 -fomit-frame-pointer -mcpu=cell"
    CXXFLAGS="-O3 -fomit-frame-pointer -mcpu=cell"
    FC="gfortran -m64 -O3 -fomit-frame-pointer"
    FFLAGS=""

The input file which was used for the benchmarks specifies a 54^3 FFT box. All calculations have been done on one PPE with 8 SPEs. For this benchmark we varied the fftalg parameter, which means we used a number of different FFT algorithms. The wfoptalg parameter was set to 4, which means that the LOBPCG solver was used. A complete list of ABINIT parameters can be obtained from [1].

… Cell.

Our ZGEMM implementation shows the best performance of all publicly available ZGEMM or DGEMM implementations for Cell BE. Thus, our work may serve as a guideline for implementing similar algorithms.

With FFTW3 there is a very efficient FFT library available which supports the SPEs in Cell systems. We managed to make it possible to use this library together with ABINIT. This led to a reduced runtime (14 % in our example). Currently we focused on the sequential variant of ABINIT, but as the FFTW3 library comes with MPI support and employs a data layout similar to the one used in the goedecker2002 algorithm implementation, it should be feasible to achieve a similar performance improvement in the parallel case. This could be evaluated in further studies. Another approach could be to port the goedecker2002 implementations to the Cell architecture. This could prove to be a bit more difficult, as the goedecker algorithms are hardly documented and specifically crafted for ABINIT.
[12] X. Gonze, J.-M. Beuken, R. Caracas, F. Detraux, M. Fuchs,
G.-M. Rignanese, L. Sindic, M. Verstraete, G. Zerah, F. Jol-
let, M. Torrent, A. Roy, M. Mikami, P. Ghosez, J.-Y. Raty,
and D. C. Allan. First-principles computation of material
properties: the ABINIT software project. Computational
Materials Science, 25:478–493, 2002.
[13] T. Hoefler, R. Janisch, and W. Rehm. A Performance Anal-
ysis of ABINIT on a Cluster System. In Parallel Algo-
rithms and Cluster Computing, pages 37–51. Springer, Lec-
ture Notes in Computational Science and Engineering, 12
2005.
[14] T. Hoefler and G. Zerah. Transforming the high-
performance 3d-FFT in ABINIT to enable the use of non-
blocking collective operations. Technical report, Commis-
sariat a l’Energie Atomique - Direction des applications mil-
itaires (CEA-DAM), 2 2007.
[15] P. Hohenberg and W. Kohn. Inhomogeneous Electron Gas.
Physical Review, 136:B864, 1964.
[16] K. Goto and R. van de Geijn. On reducing TLB misses in matrix multipli-
cation. Technical Report TR-2002-55, The University of Texas
at Austin, Department of Computer Sciences, 2002.
[17] A. Knyazev. Toward the optimal preconditioned eigen-
solver: Locally optimal block preconditioned conjugate gra-
dient method. SIAM Journal on Scientific Computing,
23(2):517–541, 2002.
[18] W. Kohn and L. Sham. Self-Consistent Equations Includ-
ing Exchange and Correlation Effects. Physical Review,
140(4A):1133–1138, 1965.
[19] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh.
Basic Linear Algebra Subprograms for FORTRAN usage.
ACM Transactions on Mathematical Software, 5(3):308–323, 1979.
[20] B. Minor, G. Fossum, and V. To. Terrain rendering engine
(tre): Cell broadband engine optimized real-time ray-caster.
[21] J. C. R. and B. D. A. Introduction to the cell broadband
engine architecture. IBM Journal of Research and Develop-
ment, 51:503–519, 2007.
[22] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and
K. Yelick. The potential of the cell processor for scientific
computing. In CF ’06: Proceedings of the 3rd conference
on Computing frontiers, pages 9–20, New York, NY, USA,
2006. ACM.