
Novel Architectures

Graphical Processing Units for Quantum Chemistry


The authors provide a brief overview of electronic structure theory and detail their experiences implementing quantum chemistry methods on a graphical processing unit. They also analyze algorithm performance in terms of floating-point operations and memory bandwidth, and assess the adequacy of single-precision accuracy for quantum chemistry applications.

In 1830, Auguste Comte wrote in his work Philosophie Positive: "Every attempt to employ mathematical methods in the study of chemical questions must be considered profoundly irrational and contrary to the spirit of chemistry. If mathematical analysis should ever hold a prominent place in chemistry (an aberration which is happily almost impossible), it would occasion a rapid and widespread degeneration of that science." Fortunately, Comte's assessment was far off the mark and, instead, the opposite has occurred. Detailed simulations based on the principles of quantum mechanics now play a large role in suggesting, guiding, and explaining experiments in chemistry and materials science. In fact, quantum chemistry is a major consumer of CPU cycles at national supercomputer centers. The field's rise to prominence is largely due to two developments that Comte could not have foreseen: early demonstrations of quantum mechanics applied to chemical problems and the tremendous advances in computing power over the past decades.
1521-9615/08/$25.00 © 2008 IEEE. Copublished by the IEEE CS and the AIP.

However, limited computational resources remain a serious obstacle to the application of quantum chemistry in problems of widespread importance, such as the design of more effective drugs to treat diseases or new catalysts for use in applications such as fuel cells or environmental remediation. Thus, researchers have a considerable impetus to relieve this bottleneck in any way possible, both by developing new and more effective algorithms and by exploring new computer architectures. In our own work, we've recently begun exploring the use of graphical processing units (GPUs), and this article presents some of our experiences with GPUs for quantum chemistry.

GPU Architecture

Ivan S. Ufimtsev and Todd J. Martínez


University of Illinois at Urbana-Champaign

Low precision (generally, 24-bit arithmetic) and limited programmability stymied early attempts to use GPUs for general-purpose scientific computing.1–4 However, the release of the Nvidia G80 series and the compute unified device architecture (CUDA) application programming interface (API) have ushered in a new era in which these difficulties are largely ameliorated. The CUDA API lets developers control the GPU via an extension of the standard C programming language (as opposed to specialized assembler or graphics-oriented APIs, such as OpenGL and DirectX). The G80 supports 32-bit floating-point arithmetic,
Computing in Science & Engineering


This article has been peer-reviewed.

which is largely (but not entirely) compliant with the IEEE-754 standard. In some cases, this might not be sufficient precision (many scientific applications use 64-bit or double precision), but Nvidia has already released its next generation of GPUs, which supports 64-bit arithmetic in hardware.

We performed all the calculations presented in this article on a single Nvidia GeForce 8800 GTX card using the CUDA API (www.nvidia.com/object/cuda_home.html offers a detailed overview of the hardware and API). We sketch some of the important concepts here. As depicted schematically in Figure 1, the GeForce 8800 GTX consists of 16 independent streaming multiprocessors (SMs), each comprising eight scalar units operating in SIMD fashion and running at 1.35 GHz. The device can process a large number of concurrent threads; these threads are organized into a one- or two-dimensional (1D or 2D) grid of 1D, 2D, or 3D blocks with up to 512 threads in each block. Threads in the same block are guaranteed to execute on the same SM and have access to fast on-chip shared memory (16 Kbytes per SM) for efficient data exchange. Threads belonging to different blocks can execute on different SMs and must exchange data through the GPU DRAM (768 Mbytes), which has much larger latency than shared memory. There's no efficient way to synchronize thread block execution (that is, to control in which order or on which SM the blocks will be processed), so an efficient algorithm should avoid communication between thread blocks as much as possible.

The thread block grid can contain up to 65,535 blocks in each dimension. Every thread block has its own unique serial number (or two numbers, if the grid is 2D); likewise, each thread has a set of indices identifying it within a thread block. Together, the thread block and thread serial numbers provide enough information to precisely identify a given thread and thereby split computational work in an application.
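
The block and thread serial numbers described above can be combined into a single global index, which is a common way of splitting work across a grid. The following Python sketch emulates that arithmetic on the CPU for a 1D grid of 1D blocks. The function names and the strided work assignment are illustrative assumptions, not taken from the authors' code.

```python
# CPU-side emulation of CUDA-style indexing for a 1D grid of 1D blocks.
# The limits below are the G80 figures quoted in the text.
MAX_THREADS_PER_BLOCK = 512
MAX_BLOCKS_PER_GRID_DIM = 65535


def global_thread_id(block_idx, thread_idx, block_dim):
    """Unique index of a thread, built from its block and in-block indices."""
    return block_idx * block_dim + thread_idx


def assign_work(n_items, grid_dim, block_dim):
    """Map each (block, thread) pair to the work items it would process.

    Hypothetical strided scheme: thread g handles items g, g + T, g + 2T, ...
    where T is the total number of threads in the grid. No thread needs
    data from any other thread.
    """
    if not 1 <= block_dim <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block_dim outside hardware limit")
    if not 1 <= grid_dim <= MAX_BLOCKS_PER_GRID_DIM:
        raise ValueError("grid_dim outside hardware limit")
    total_threads = grid_dim * block_dim
    work = {}
    for b in range(grid_dim):
        for t in range(block_dim):
            g = global_thread_id(b, t, block_dim)
            work[(b, t)] = list(range(g, n_items, total_threads))
    return work


if __name__ == "__main__":
    work = assign_work(n_items=10, grid_dim=2, block_dim=3)
    for key in sorted(work):
        print(key, work[key])
```

Every work item is owned by exactly one thread, so the threads never have to communicate, which is the property the text asks of an efficient mapping.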
Much of the challenge in developing an efficient algorithm on the GPU involves determining an ideal mapping between computational tasks and the grid/thread block/thread hierarchy. Ideal mappings lead to load balance with interthread communication restricted to within the thread blocks. Because the number of threads running on a GPU is much larger than the total number of processing units (to fully utilize the device's computational capabilities and efficiently hide the DRAM's access latency, the program must spawn at least 10^4 threads), the hardware executes all threads in time-slicing fashion. The thread scheduler splits the thread blocks into 32-thread warps, which SMs
November/December 2008

Figure 1. Schematic block diagram of the Nvidia GeForce 8800 GTX. It has 16 streaming multiprocessors (SM 1 to SM 16) attached to 768 Mbytes of GPU DRAM. Each SM contains an instruction unit, eight SIMD streaming processors (SP 1 to SP 8), and 16 Kbytes of shared memory.

then process in SIMD fashion, with all 32 threads executed by eight scalar processors in four clock cycles. The thread scheduler periodically switches between warps, maximizing overall application performance. An important point here is that once all the threads in a warp have completed, this warp is no longer scheduled for execution, and no load-balancing penalty is incurred.

Another important consideration is the amount of on-chip resources available for an active thread. "Active" means that the thread has GPU context (registers and so on) attached to it and is included in the thread scheduler's to-do list. Once a thread is activated, it won't be deactivated until all of its instructions are executed. A G80 SM can support up to 768 active threads (24 warps), and the warp-switching overhead is negligible compared to the time required to execute a typical instruction. Such cost-free switching is possible because every thread has its own context, which implies that the whole register space (32 Kbytes per SM) is evenly distributed among active threads (for 768 active threads, every thread has 10 registers available). If the threads need more registers, fewer threads are activated, leading to partial SM occupation. This important parameter determines whether the GPU DRAM access latency can be efficiently hidden (a large number of active threads means that while some threads are waiting for data, others can execute instructions, and vice versa). Thus, any GPU kernel (a primitive routine each GPU thread executes) should consume as few registers as possible to maximize SM occupation.
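
The relation between register use and SM occupation can be made concrete with a small calculation. The sketch below is a simplified model that uses only the figures quoted above (32 Kbytes of registers and at most 768 active threads per SM). The 4-byte register size, the 64-thread block size, and the function name are assumptions made for illustration, and other hardware limits are ignored.

```python
REGISTER_FILE_BYTES = 32 * 1024   # per SM, from the text
BYTES_PER_REGISTER = 4            # assumption: 32-bit registers
MAX_ACTIVE_THREADS = 768          # per SM, from the text (24 warps)


def active_threads_per_sm(registers_per_thread, block_size=64):
    """Estimate how many threads one SM can keep active.

    Whole thread blocks are activated together, so the register-limited
    thread count is rounded down to a multiple of block_size (assumed 64).
    """
    total_registers = REGISTER_FILE_BYTES // BYTES_PER_REGISTER  # 8192
    limit_by_registers = total_registers // registers_per_thread
    limit = min(limit_by_registers, MAX_ACTIVE_THREADS)
    return (limit // block_size) * block_size


if __name__ == "__main__":
    for regs in (10, 20, 24, 56):
        n = active_threads_per_sm(regs)
        print(f"{regs:2d} registers/thread -> {n:3d} active threads "
              f"({100 * n / MAX_ACTIVE_THREADS:.0f}% occupancy)")
```

Under these assumptions, 10 registers per thread allows the full 768 active threads, while 20, 24, and 56 registers allow 384, 320, and 128. The latter three values agree with the active thread counts reported for the kernels in Table 2.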

Quantum Chemistry Overview

Two of the most basic questions in chemistry are, "where are the electrons?" and "where are the nuclei?" Electronic structure theory (that is, quantum chemistry) focuses on the first one. Because the electrons are very light, we must apply the laws of quantum mechanics, which are described with an electronic wave function determined by solving the time-independent Schrödinger equation. As usual in quantum mechanics, this wave function's absolute square is interpreted as a probability distribution for electron positions. Once we know the electronic distribution for a fixed nuclear configuration, it's straightforward to calculate the resulting forces on the nuclei. Thus, the answer to the second question follows from the answer to the first, through either a search for the nuclear arrangement that minimizes energy (molecular geometry optimization) or solution of the classical Newtonian equations of motion. (It's also possible, and in some cases necessary, to solve quantum mechanical equations of motion for the nuclei, but we don't consider this further here.) The great utility of quantum chemistry comes from the resulting ability to predict molecular shapes and chemical rearrangements.

Denoting the set of all electronic coordinates as $\mathbf{r}$ and all nuclear coordinates as $\mathbf{R}$, we can write the electronic time-independent Schrödinger equation for a molecule as

$$\hat{H}(\mathbf{r}, \mathbf{R})\, \psi_{\text{elec}}(\mathbf{r}, \mathbf{R}) = E(\mathbf{R})\, \psi_{\text{elec}}(\mathbf{r}, \mathbf{R}), \tag{1}$$

where $\hat{H}(\mathbf{r}, \mathbf{R})$ is the electronic Hamiltonian operator describing the electronic kinetic energy as well as the Coulomb interactions between all electrons and nuclei, $\psi_{\text{elec}}(\mathbf{r}, \mathbf{R})$ is the electronic wave function, and $E(\mathbf{R})$ is the total energy. This total energy depends only on the nuclear coordinates and is often referred to as the molecule's potential energy surface.

For most molecules, exact solution of Equation 1 is impossible, so we must invoke approximations. Researchers have developed numerous such approximate approaches, the simplest of which is the Hartree-Fock (HF) method.5 In HF theory, we write the electronic wave function as a single antisymmetrized product of orbitals, which are functions describing a single electron's probability distribution. Physically, this means that the electrons see each other only in an averaged sense while obeying the appropriate Fermi statistics. We can obtain higher accuracy by including many antisymmetrized orbital products or by using density functional theory (DFT) to describe the detailed electronic correlations present in real molecules.6 We consider only HF theory in this article because it illustrates many key computational points. We express each electronic orbital $\phi_i(r)$ in HF theory as a linear combination of $K$ basis functions $\chi_\mu(r)$ specified in advance:

$$\phi_i(r) = \sum_{\mu=1}^{K} C_{\mu i}\, \chi_\mu(r). \tag{2}$$

Notice that we no longer write the electronic coordinates in boldface, to emphasize that these are a single electron's coordinates. The computational task is then to determine the linear coefficients $C_{\mu i}$. Collecting these coefficients into a matrix $\mathbf{C}$ and introducing the overlap matrix $\mathbf{S}$ with elements

$$S_{\mu\nu} = \int \chi_\mu(r)\, \chi_\nu(r)\, d^3r, \tag{3}$$

we can write the HF equations as

$$\mathbf{F}(\mathbf{C})\, \mathbf{C} = \mathbf{S}\, \mathbf{C}\, \boldsymbol{\varepsilon}, \tag{4}$$

where $\boldsymbol{\varepsilon}$ is a diagonal matrix of one-electron orbital energies. The Fock matrix $\mathbf{F}$ is a one-electron analog of the Hamiltonian operator in Equation 1, describing the electronic kinetic energy, the electron-nuclear Coulomb attraction, and the averaged electron-electron repulsion. Because $\mathbf{F}$ depends on the unknown matrix $\mathbf{C}$, we solve for the unknowns $\mathbf{C}$ and $\boldsymbol{\varepsilon}$ with a self-consistent field (SCF) procedure. After guessing $\mathbf{C}$, we construct the Fock matrix and solve the generalized eigenvalue problem in Equation 4, giving $\boldsymbol{\varepsilon}$ and a new set of coefficients $\mathbf{C}$. The process iterates until $\mathbf{C}$ remains unchanged within some tolerance, that is, until $\mathbf{C}$ and $\mathbf{F}$ are self-consistent.

The dominant effort in the formation of $\mathbf{F}$ lies in the evaluation of the two-electron repulsion integrals (ERIs) representing the Coulomb ($\mathbf{J}$) and exchange ($\mathbf{K}$) interactions between pairs of electrons. The exchange interaction is a nonclassical Coulomb-like term arising from the electronic wave function's antisymmetry. Specifically, we construct the Fock matrix for a molecule with $N$ electrons as

$$\mathbf{F}(\mathbf{C}) = \mathbf{H}^{\text{core}} + \mathbf{J}(\mathbf{C}) - \tfrac{1}{2}\mathbf{K}(\mathbf{C}). \tag{5}$$
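
Equation 4 is a generalized eigenvalue problem because the basis functions are not orthogonal, so the overlap matrix S is not the identity. One standard way to solve it within each SCF iteration, shown here as an illustration rather than as the approach the authors necessarily used, is symmetric orthogonalization. The matrix X and the primed quantities are notation introduced only for this sketch.

```latex
\begin{aligned}
\mathbf{X} &= \mathbf{S}^{-1/2}, \\
\mathbf{F}' &= \mathbf{X}^{T}\, \mathbf{F}\, \mathbf{X}, \\
\mathbf{F}'\, \mathbf{C}' &= \mathbf{C}'\, \boldsymbol{\varepsilon}, \\
\mathbf{C} &= \mathbf{X}\, \mathbf{C}'.
\end{aligned}
```

Substituting C = XC' into Equation 4 and multiplying from the left by the transpose of X gives the third line, because the transformed overlap matrix (X transposed times S times X) is the identity. The third line is an ordinary symmetric eigenvalue problem, and the last line recovers the coefficients in the original basis.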

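In Equation 5, $\mathbf{H}^{\text{core}}$ includes the electronic kinetic energy and the electron-nuclear attraction. (The equations given here are specific to closed-shell singlet molecules, with no unpaired electrons.) The Coulomb and exchange matrices are given by

$$J_{\mu\nu} = \sum_{\lambda\sigma} P_{\lambda\sigma}\, (\mu\nu|\lambda\sigma) \tag{6}$$

$$K_{\mu\nu} = \sum_{\lambda\sigma} P_{\lambda\sigma}\, (\mu\lambda|\nu\sigma), \tag{7}$$

in terms of the two-electron integrals $(\mu\nu|\lambda\sigma)$ and the density matrix $\mathbf{P}$,

$$P_{\lambda\sigma} = 2 \sum_{i=1}^{N/2} C_{\lambda i}\, C_{\sigma i}. \tag{8}$$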
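
Equations 5 through 8 translate directly into nested loops. The following pure-Python sketch builds P, J, K, and F for a tiny made-up example. The function names and the toy integral values are hypothetical, and `eri[m][n][l][s]` is assumed to hold the two-electron integral (mn|ls).

```python
def density_matrix(C, n_electrons):
    """Closed-shell density matrix, Eq. 8: P[l][s] = 2 * sum_i C[l][i] * C[s][i]."""
    K = len(C)
    n_occ = n_electrons // 2  # number of doubly occupied orbitals
    return [[2.0 * sum(C[l][i] * C[s][i] for i in range(n_occ))
             for s in range(K)] for l in range(K)]


def coulomb_and_exchange(P, eri):
    """Coulomb (Eq. 6) and exchange (Eq. 7) matrices from P and the ERIs."""
    K = len(P)
    J = [[0.0] * K for _ in range(K)]
    X = [[0.0] * K for _ in range(K)]  # exchange matrix (called K in the text)
    for m in range(K):
        for n in range(K):
            for l in range(K):
                for s in range(K):
                    J[m][n] += P[l][s] * eri[m][n][l][s]
                    X[m][n] += P[l][s] * eri[m][l][n][s]
    return J, X


def fock_matrix(h_core, J, X):
    """Eq. 5: F = H_core + J - K / 2."""
    K = len(h_core)
    return [[h_core[m][n] + J[m][n] - 0.5 * X[m][n] for n in range(K)]
            for m in range(K)]


if __name__ == "__main__":
    # Toy data (not physical): two basis functions, two electrons,
    # (mn|ls) = 1 if m == n and l == s, else 0.
    K = 2
    eri = [[[[1.0 if (m == n and l == s) else 0.0 for s in range(K)]
             for l in range(K)] for n in range(K)] for m in range(K)]
    C = [[1.0, 0.0], [0.0, 1.0]]
    h_core = [[0.0, 0.0], [0.0, 0.0]]
    P = density_matrix(C, n_electrons=2)
    J, X = coulomb_and_exchange(P, eri)
    print(fock_matrix(h_core, J, X))
```

The quadruple loop makes the fourth-power scaling of Fock matrix formation with the number of basis functions explicit.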

The two-electron integrals are defined as

$$(\mu\nu|\lambda\sigma) = \iint \frac{\chi_\mu(r_1)\, \chi_\nu(r_1)\, \chi_\lambda(r_2)\, \chi_\sigma(r_2)}{|r_1 - r_2|}\, d^3r_1\, d^3r_2. \tag{9}$$

Usually, we choose the basis functions as Gaussians centered on the nuclei, which leads to analytic expressions for the two-electron integrals. Nevertheless, $K^4$ such integrals must be evaluated, where $K$ grows linearly with the size of the molecule under consideration. In practice, many of these integrals are small and can be neglected, but the number of non-negligible integrals still grows faster than $O(K^2)$, making their evaluation a critical bottleneck in quantum chemistry. We can calculate and store the ERIs for use in constructing the J and K matrices during the iterative SCF procedure (conventional SCF), or we can recalculate them in each iteration (direct SCF). The direct SCF method is often more efficient in practice because it minimizes the I/O associated with reading and writing ERIs.

The form of the basis functions $\chi_\mu(r)$ is arbitrary in principle, as long as the basis set is sufficiently flexible to describe the electronic orbitals. The natural choice for these functions is the Slater-type orbital (STO), which comes from the exact analytic solution of the electronic structure problem for the hydrogen atom:

$$\chi_\mu^{\text{STO}}(r) \propto (x - x_\mu)^l\, (y - y_\mu)^m\, (z - z_\mu)^n\, e^{-\alpha_\mu |r - R_\mu|}, \tag{10}$$

where $R_\mu$ is the position of the nucleus on which the basis function is centered (with components $x_\mu$, $y_\mu$, and $z_\mu$), $\alpha_\mu$ is the orbital exponent, and the integers $l$, $m$, and $n$ represent the orbital's angular momentum. The orbital's total angular momentum, $l_{\text{total}}$, is given by the sum of these integers, and orbitals are often referred to as s, p, and d for $l_{\text{total}}$ = 0, 1, and 2, respectively. Unfortunately, it's difficult to evaluate the required two-electron integrals using these basis functions, so we use Gaussian-type orbitals (GTOs) in their stead:

$$\chi_\mu^{\text{GTO}}(r) \propto (x - x_\mu)^l\, (y - y_\mu)^m\, (z - z_\mu)^n\, e^{-\alpha_\mu |r - R_\mu|^2}. \tag{11}$$

Relatively simple analytic expressions are available for the required integrals when using GTO basis sets, but the functional form is qualitatively different. To mimic the more physically motivated STOs, we typically contract these GTOs as

$$\chi_\mu^{\text{STO}}(r) \approx \chi_\mu^{\text{GTO, contracted}}(r) = \sum_{i=1}^{N_\mu} d_{\mu i}\, \chi_i^{\text{GTO, primitive}}(r). \tag{12}$$

This procedure leads to two-electron integrals over contracted basis functions, which are given as sums of integrals over primitive basis functions. Unlike the elements of the C matrix, the contraction coefficients $d_{\mu i}$ aren't allowed to vary during the SCF iterative process. The number of primitives in a contracted basis function, $N_\mu$, is the contraction length and usually varies between one and eight. Given this basis set construction, we can now talk about ERIs over contracted or primitive basis functions:

$$(\mu\nu|\lambda\sigma) = \sum_{p=1}^{N_\mu} \sum_{q=1}^{N_\nu} \sum_{r=1}^{N_\lambda} \sum_{s=1}^{N_\sigma} d_{\mu p}\, d_{\nu q}\, d_{\lambda r}\, d_{\sigma s}\, [pq|rs], \tag{13}$$

where square brackets denote integrals over primitive basis functions and parentheses denote integrals over contracted basis functions. Several algorithms can evaluate the primitive integrals $[pq|rs]$ for GTO basis sets. We won't discuss these in any detail here except to say that we've used the McMurchie-Davidson scheme,7 which requires relatively few intermediates per integral. The resulting low memory requirements for the kernels let us maximize SM occupancy. The operations involved in evaluating the primitive integrals include evaluation of reciprocals, square roots, and exponentials, in addition to simple arithmetic operations such as addition and multiplication.

GPU Algorithms for ERI Evaluation

In other work, we've explored three different algorithms to evaluate the O(K^4) ERIs over contracted basis functions and store them in the GPU memory.8 Here, we summarize the algorithms and comment on several aspects of their performance. As a test case, we consider a molecular system composed of 64 hydrogen atoms arranged on a 4 × 4 × 4 lattice. We use two basis sets: the first (denoted STO-6G) has six primitive functions for each contracted basis function with one contracted basis function per atom, and the second (denoted 6-311G) has three contracted functions per atom in combinations of three, one, and one primitive basis functions, respectively. These two basis sets represent highly contracted and relatively uncontracted basis sets, respectively, and serve to show how the degree of contraction in the basis set affects algorithm performance. For the hydrogen atom lattice

Figure 2. Schematic of three different mapping schemes for evaluating ERIs on the GPU: 1 block to 1 contracted integral (1B1CI), 1 thread to 1 contracted integral (1T1CI), and 1 thread to 1 primitive integral (1T1PI). The large square represents the matrix of contracted integrals, with rows and columns labeled by the index pairs 11, 12, 13, 22, 23, and 33. Small squares below the main diagonal (blue) represent redundant integrals that don't need to be computed because the integral matrix is symmetric. Each of the contracted integrals is a sum over primitive integrals. The mapping schemes differ in how the computational work is apportioned: red squares superimposed on the integral matrix denote work done by a representative thread block, and the three blow-ups show how the work is apportioned to threads within the thread block.

test case, the number of contracted basis functions is 64 and 192 for the STO-6G and 6-311G basis sets, respectively, which leads to O(10^6) and O(10^8) ERIs over contracted basis functions. As we can see in Equation 9, the ERIs have several permutation symmetries. For example, interchanging the first two or the last two indices in the (μν|λσ) ERI doesn't change the integral's value. Thus, we can represent the contracted ERIs as a square matrix of dimension K(K + 1)/2 × K(K + 1)/2, as Figure 2 shows; here, the rows and columns represent unique μν and λσ index pairs. Furthermore, we can interchange the first pair of indices with the last pair without changing the ERI value, that is, (μν|λσ) = (λσ|μν). This implies that the ERI matrix is symmetric, and only the ERIs on or above the main diagonal need to be calculated. Figure 2 shows the primitive integrals contributing to each contracted integral as small squares (see the blow-up labeled "primitive integrals"). We've simplified here to the case in which each contracted basis function is a linear combination of the same number of primitives. In realistic cases, each of the contracted basis functions can involve a different number of primitive basis functions.
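
The size of the symmetric ERI matrix follows from simple counting. The sketch below computes the number of unique index pairs and unique contracted ERIs for the two test basis sets, along with a pair-to-row mapping that follows the ordering drawn in Figure 2. The function names are illustrative, and the storage estimate assumes 4 bytes per single-precision integral.

```python
def n_pairs(K):
    """Number of unique index pairs (mu <= nu) for K basis functions."""
    return K * (K + 1) // 2


def pair_index(mu, nu, K):
    """Row or column of the ERI matrix for the pair (mu, nu), 0-based.

    Pairs are ordered 00, 01, ..., 0(K-1), 11, 12, ... which corresponds
    to the 11, 12, 13, 22, 23, 33 ordering of Figure 2.
    """
    if mu > nu:
        mu, nu = nu, mu  # the integral is unchanged when the pair is swapped
    return mu * K - mu * (mu - 1) // 2 + (nu - mu)


def n_unique_eris(K):
    """ERIs on or above the main diagonal of the pair-by-pair matrix."""
    p = n_pairs(K)
    return p * (p + 1) // 2


if __name__ == "__main__":
    for name, K in (("STO-6G", 64), ("6-311G", 192)):
        n = n_unique_eris(K)
        mbytes = n * 4 / 2**20  # assumption: 4 bytes per integral
        print(f"{name}: K = {K}, {n_pairs(K)} pairs, {n} unique ERIs, "
              f"about {mbytes:.0f} Mbytes")
```

For the test case this gives 2,164,240 unique ERIs for STO-6G and 171,652,656 for 6-311G, consistent with the O(10^6) and O(10^8) estimates. At 4 bytes each, the larger set occupies roughly 650 Mbytes (counting 2^20 bytes per Mbyte), close to the card's 768 Mbytes of DRAM.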

This organization of the contracted ERIs immediately suggests three different mappings of the computational work to thread blocks. We could assign a thread to each contracted ERI (1T1CI, "1 Thread-1 Contracted Integral" in Figure 2); a thread block to each contracted ERI (1B1CI, "1 Block-1 Contracted Integral" in Figure 2); or a thread to each primitive ERI (1T1PI, "1 Thread-1 Primitive Integral" in Figure 2). We've implemented all three of these schemes on the GPU; the grain of parallelism and the degree of load balancing differ among them. The 1T1PI scheme is the most fine-grained and provides the largest number of threads for calculation, and the 1T1CI scheme is the least fine-grained, providing a larger amount of work for each active thread. In the 1T1CI scheme, each thread calculates its contracted integral by directly looping over all primitive ERIs and accumulating the results according to Equation 13. Once the primitive ERI evaluation and summation complete, the contracted integral is stored in the GPU memory.

Neighboring integrals can have different numbers of contributing primitives and hence a different number of loop cycles. When the threads responsible for these neighboring integrals belong to the same warp, they execute in SIMD fashion, which produces load imbalance. We can minimize the impact by further organizing the integrals into subgrids according to the contraction length of the basis functions involved. In this case, all threads in a warp have similar workloads (thus minimizing load-balancing issues), but it requires further reorganization of the computation with both programming and runtime overhead; the latter, however, is usually small when compared to typical computation timings.

The 1B1CI mapping scheme is finer-grained and maps each contracted integral to a whole thread block rather than a single thread. This organization avoids the load-imbalance issues inherent to the 1T1CI algorithm because distinct thread blocks never share common warps. Within a block, we have several ways to assign primitive integrals to GPU threads. We chose to assign them cyclically, with each successive term in the sum of Equation 13 mapped to a successive GPU thread; when the last thread is reached, the subsequent integral is assigned to the first thread, and so on. After all threads compute their integrals, the results are summed using the shared on-chip memory, and the final result is stored in GPU DRAM. Unfortunately, the 1B1CI scheme sometimes experiences load-balancing issues that are difficult to eliminate. Consider a contracted integral consisting of just one primitive integral. Such a situation is possible when all the basis functions have unit contraction length (that is, they aren't contracted at all). In this case, only one thread in the whole block will have work assigned to it, but because the warps are processed in SIMD fashion, the other 31 threads in the warp will execute the same set of instructions and waste computational time.
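
The load imbalance of the 1T1CI scheme, and the benefit of grouping integrals by contraction length, can be illustrated with a toy cost model. The sketch assumes that a 32-thread warp takes as long as its slowest thread. The function name and the loop counts are made up for illustration.

```python
WARP_SIZE = 32


def simd_cost(loop_counts, warp_size=WARP_SIZE):
    """Total warp-iterations when consecutive threads are packed into warps.

    Each entry of loop_counts is the number of primitive integrals one
    thread must loop over; a warp is busy until its longest loop finishes.
    """
    total = 0
    for start in range(0, len(loop_counts), warp_size):
        total += max(loop_counts[start:start + warp_size])
    return total


if __name__ == "__main__":
    # 64 contracted integrals: half need 1 primitive, half need 81 (= 3**4).
    interleaved = [1, 81] * 32
    grouped = sorted(interleaved)
    print(simd_cost(interleaved))  # both warps wait for 81-iteration threads
    print(simd_cost(grouped))      # only one warp runs the long loops
```

With these numbers the interleaved ordering costs 162 warp-iterations and the grouped ordering costs 82, which is the effect the subgrid reorganization aims for.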
Direct tests, performed on a system with a large number of weakly contracted integrals, confirm this prediction. The 1T1PI mapping scheme exhibits the finest-grained level of parallelism of all the schemes presented. Unlike the two previous approaches, the target integral grid has primitive rather than contracted integrals, and each GPU thread calculates just one primitive integral, no matter which contracted integral it contributes to. As soon as we calculate and store all the primitives on the GPU, another GPU kernel further transforms them to the final array of contracted integrals. The second step isn't required in the 1T1CI and 1B1CI algorithms because all required

primitives are stored either in registers or shared memory and thus are easily assembled into a contracted integral. In contrast, in the 1T1PI scheme, those primitives constituting the same contracted integral can belong to different thread blocks running on different SMs. In this case, data exchange is possible only through the GPU DRAM, incurring hundreds of clock cycles of latency.

Table 1 shows benchmark results for the 64 hydrogen atom lattice. As mentioned earlier, we used two different basis sets to determine the contraction length's effect. For the weakly contracted basis set (6-311G), we found that 1B1CI mapping performs poorly (as predicted), mostly because of the large number of empty threads that still execute instructions due to the SIMD hardware model. For this case, we estimated that the 1B1CI algorithm possesses 4.2X computational overhead, assuming each warp contains 32 threads. The 1T1PI mapping was the fastest, but two considerations are important here. First, the summation of Equation 13 doesn't produce much overhead because most of the contracted basis functions consist of a single primitive. Second, the GPU integral evaluation kernel is relatively simple and consumes a small number of registers, which allows more active threads to run on an SM and hence provides better instruction pipelining.

For the highly contracted basis set (STO-6G), the situation is reversed: the 1T1PI algorithm is the slowest because of the summation of the primitive integrals, which is more likely to require communication across thread blocks. The 1B1CI scheme avoids this overhead and distributes the work more evenly (all contracted integrals require the same number of primitive ERIs because all basis functions have the same contraction length). We found that the 1T1CI algorithm represents a compromise that's less sensitive to the degree of basis set contraction. In both cases, it's either the fastest or nearly the fastest and thus would be recommended for conventional SCF.
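
The overhead quoted for the 1B1CI scheme can be approximated with a back-of-the-envelope model. The sketch below assumes that each basis-function index of the 6-311G hydrogen lattice is equally likely to be any of the three contracted functions on an atom (contraction lengths 3, 1, and 1), that primitives are dealt cyclically to the threads of a 32-thread warp, and that permutational symmetry can be ignored. These modeling choices and all names are assumptions, not the authors' procedure.

```python
from itertools import product
from math import ceil

WARP_SIZE = 32


def cyclic_partial_sums(primitives, n_threads):
    """Deal primitive integrals cyclically to threads; return per-thread sums."""
    return [sum(primitives[t::n_threads]) for t in range(n_threads)]


def one_block_overhead(contraction_lengths, warp_size=WARP_SIZE):
    """Ratio of thread slots executed to primitive integrals actually needed.

    Every combination of four contraction lengths is treated as one
    equally likely contracted integral, handled by whole warps.
    """
    useful = 0
    executed = 0
    for combo in product(contraction_lengths, repeat=4):
        n_prim = combo[0] * combo[1] * combo[2] * combo[3]
        useful += n_prim
        executed += warp_size * ceil(n_prim / warp_size)
    return executed / useful


if __name__ == "__main__":
    # Summing the per-thread partial results recovers the contracted integral.
    prims = [0.5, 0.25, 0.125, 1.0, 2.0]
    assert sum(cyclic_partial_sums(prims, 4)) == sum(prims)
    # 6-311G on hydrogen: contraction lengths 3, 1, 1 per atom.
    print(f"6-311G: {one_block_overhead((3, 1, 1)):.2f}")
    # STO-6G on hydrogen: one function of contraction length 6 per atom.
    print(f"STO-6G: {one_block_overhead((6,)):.2f}")
```

Under these assumptions the ratio is about 4.25 for 6-311G, in line with the roughly 4.2X overhead quoted above, and about 1.01 for STO-6G, where every contracted integral keeps the warps almost fully occupied.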
An additional issue is the time required to move the contracted integrals between the GPU and CPU main memory (in practice, the integrals rarely fit in the GPU DRAM). Table 1 shows that the time for this GPU-CPU transfer can exceed the integral evaluation time for weakly contracted basis sets. An alternate approach that avoids transferring the ERIs would clearly be advantageous. By substituting Equation 13 into Equations 6 and 7, we can avoid the formation of the contracted ERIs completely. This is the usual strategy when we use direct SCF methods to re-evaluate ERIs in every step of the SCF procedure. In

Table 1. Two-electron integral evaluation of the 64 hydrogen atom lattice on a GPU using three algorithms (1B1CI, 1T1CI, and 1T1PI).

Basis set   GPU 1B1CI   GPU 1T1CI   GPU 1T1PI   GPU-CPU transfer*   GAMESS**
6-311G      7.086 s     0.675 s     0.428 s     0.883 s             170.8 s
STO-6G      1.608 s     1.099 s     2.863 s     0.012 s             90.6 s

*The amount of time required to copy the contracted integrals from the GPU to CPU memory.
**The same test case using the GAMESS program package on a single Opteron 175 CPU, for comparison.

this case, we avoid the Achilles' heel of the 1T1PI scheme (formation of the contracted ERIs), so it becomes the recommended scheme. We've implemented construction of the J and K matrices on the GPU concurrent with ERI evaluation via the 1T1PI scheme. This has the added advantage of avoiding CPU-GPU transfer of the ERIs: the J and K matrices contain only O(K^2) elements, compared to the O(K^4) ERIs. Due to limited space, we won't discuss the details of the algorithms here, but we will present some results that demonstrate the accuracy and performance achieved so far.

As mentioned earlier, the basis functions used in quantum chemistry have an associated angular momentum, that is, the polynomial prefactor in Equations 10 and 11. In the hydrogen lattice test case, all basis functions were of s type, meaning no polynomial prefactor. For atoms heavier than hydrogen, it becomes essential to also include higher angular momentum functions. Treating these efficiently requires computing all components, such as px, py, and pz, simultaneously. In the context of the GPU, this means that we should write separate kernels for ERIs that have different angular momentum combinations. These kernels will involve more arithmetic operations as the angular momentum increases, simply because more ERIs are computed simultaneously. We've written such kernels for all ERIs in basis sets including s- and p-type basis functions.

To better quantify the GPU's performance, we've investigated our algorithms for J matrix construction. The GPU's peak performance is 350 Gflops, and we were curious to see how close our algorithms came to this theoretical limit. Table 2 shows performance results for a subset of the kernels we coded. For each kernel, we counted the corresponding number of floating-point instructions it executed. We counted every instruction as 1 flop except MAD (multiply-add), which we counted as 2 flops.
We then hand-counted the resulting floating-point operations from the compiler-generated PTX file (an intermediate assembler-type code that the compiler transforms to actual machine instructions). To evaluate the application's DRAM

bandwidth, we also counted the number of memory operations (Mops) each thread needed to execute to load data from the GPU main memory for integral batch evaluation. We counted each 32-bit load instruction as 1 Mop and the 64- and 128-bit load instructions as 2 and 4 Mops, respectively. Because we use texture memory (which can be cached), the resulting bandwidth is likely overestimated. In our algorithm, we found that using texture memory was even more efficient than the textbook scheme of global memory load → shared memory store → synchronization → broadcast through shared memory. This is due to synchronization overhead, which hinders effective parallelization. We also determined the number of active threads (the hardware supports 768 at most) that are actually launched on every streaming multiprocessor. Finally, we determined the number of registers each active thread requires, which, in turn, determines GPU occupancy as discussed earlier.

As expected, the kernels involving higher angular momentum functions require more floating-point operations and registers, and the need for more registers per thread leads to fewer threads being active. Although sustained performance is less than 30 percent of the theoretical peak value, the GPU performance is still impressive compared to commodity CPUs. For example, a single AMD Opteron core demonstrates 1 to 3 Gflops in the Linpack benchmark. Given that a general quantum chemistry code is far less optimized than Linpack, we can estimate 1 Gflop as the upper bound for integral generation performance on this CPU. In contrast, we achieve 70 to 100 Gflops on the GPU. Comparing the performance in Table 2 for the sspp and pppp kernels, we can see that the GPU performance grows with arithmetic complexity for a fixed number of memory accesses, which suggests that our application is memory-bound on the GPU.
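
The figures in Table 2 can be related to one another with simple arithmetic. The sketch below estimates the peak rate from the hardware description (16 SMs, eight scalar units each, 1.35 GHz), assuming one MAD, counted as 2 flops, per unit per clock. It then derives the memory bandwidth implied by the flop and Mop counts per thread. The helper names are hypothetical, and interpreting a Gbyte as 2^30 bytes is an assumption.

```python
# Per-thread counts and measured rates for four kernels, copied from Table 2:
# kernel -> (floating-point operations, memory operations, Gflops)
KERNELS = {
    "ssss": (30, 12, 88),
    "sssp": (55, 15, 70),
    "sspp": (84, 21, 69),
    "pppp": (387, 21, 97),
}
BYTES_PER_MOP = 4  # one Mop is a 32-bit load


def peak_gflops(n_sm=16, units_per_sm=8, clock_ghz=1.35, flops_per_clock=2):
    """Peak rate, assuming every scalar unit issues one MAD (2 flops) per clock."""
    return n_sm * units_per_sm * clock_ghz * flops_per_clock


def implied_bandwidth(flops, mops, gflops):
    """Bandwidth in Gbytes/s (2**30 bytes, an assumption) implied by a flop rate."""
    threads_per_second = gflops * 1e9 / flops
    return threads_per_second * mops * BYTES_PER_MOP / 2**30


if __name__ == "__main__":
    peak = peak_gflops()
    for name, (flops, mops, gflops) in KERNELS.items():
        print(f"{name}: {flops / mops:5.1f} flops per Mop, "
              f"{implied_bandwidth(flops, mops, gflops):6.1f} Gbytes/s, "
              f"{100 * gflops / peak:4.1f}% of peak")
```

With these assumptions the peak comes out at about 346 Gflops, close to the 350 Gflops quoted, and the implied bandwidths round to the values listed in the bandwidth column of Table 2. The flops-per-Mop ratio rises from 2.5 for ssss to more than 18 for pppp, which is the growth in arithmetic complexity per memory access that the comparison above relies on.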
Furthermore, the total memory bandwidth observed (although sometimes overestimated due to texture caching) is close to the 80 Gbytes/s peak value (in practice, we usually got 40 to 70 Gbytes/s bandwidth in global memory reads). To further verify the

Table 2. Integral evaluation GPU kernel specifications and performance results.

Kernel   Floating-point   Memory       Registers    Active threads   Performance   Bandwidth
         operations       operations   per thread   per SM           (Gflops)      (Gbytes/s)
ssss     30               12           20           384              88 (175)      131
sssp     55               15           24           320              70 (174)      71
sspp     84               21           24           320              69 (227)      64
pppp     387              21           56           128              97 (198)      20

Table 3. Performance and accuracy of GPU algorithms for direct self-consistent field (SCF) benchmarks.

              Time per direct SCF      Electronic energy
              iteration (seconds)      (atomic units)
Molecule      GPU       GAMESS         GPU (32 bit)    GAMESS          Speedup
Caffeine      0.168     4.4            -1605.91830     -1605.91825     26
Cholesterol   1.23      68.0           -3898.82158     -3898.82189     55
Buckyball     5.71      332.0          -10521.6414     -10521.6491     58
Taxol         4.45      279.6          -12560.6840     -12560.6828     63
Valinomycin   8.09      750.6          -20351.9855     -20351.9904     93

conclusion that our application is memory-bound, we performed a direct test. Out of the 48 to 84 bytes required to evaluate each batch, we left only the 24 bytes that were vitally important for the application to run and replaced the other quantities with constants. We evaluated the resulting performance and present it in parentheses in Table 2's performance column. Although the number of arithmetic operations was unchanged, the Gflops achieved increased by a factor of two or more, which clearly supports our conclusion.

In anticipation of upcoming double-precision hardware, we're pleased that our algorithms are currently memory-bound. When the next generation of GPUs uses double precision, the effective memory bandwidth will decrease by a factor of two (due to 64-bit instead of 32-bit number representation), but the more dramatic decrease will come in arithmetic performance. However, we anticipate that the slower arithmetic won't much affect our algorithms; instead, they will be only roughly a factor of two slower in double precision.

Our code, which is still under development, successfully competes with modern, well-optimized, general-purpose quantum chemistry programs such as GAMESS.9 We performed benchmark tests on the following molecules using the 3-21G basis set: caffeine (C8N4H10O2), cholesterol (C27H46O), buckyball (C60), taxol (C45NH49O15), and valinomycin (C54N6H90O18). Figure 3 represents all these molecules, and Table
november/deCember 2008

3 summarizes the benchmark results. The GPU is up to 93 times faster than a single 3.0-GHz Intel Pentium D CPU for these molecules. The GPU we used for these tests supports only 32-bit arithmetic operations, meaning that we can only expect six or seven significant figures of accuracy in the final results. This might not always be sufficient for quantum chemistry applications, where chemical accuracy is typically considered to be 10 3 atomic units. As we can see by comparing the GPU and GAMESS electronic energies in Table 3, this level of accuracy isnt always achieved. Fortunately, the next generation of GPUs will provide hardware support for double-precision, and we expect our algorithms will only be two times slower because theyre currently limited by the GPUs memory bandwidth and dont saturate the GPUs floating-point capabilities.
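
The limit of six or seven significant figures can be made concrete by looking at the gap between adjacent 32-bit values at the magnitude of the total energies in Table 3. The sketch below assumes IEEE 754 binary32 storage (a 24-bit significand); the helper names are hypothetical, and the calculation covers only the resolution of a single stored number, not the rounding error accumulated over a full SCF calculation.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float32_spacing(x):
    """Gap between adjacent binary32 values near x (normal, nonzero x)."""
    _, exponent = math.frexp(abs(x))  # abs(x) = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 24)     # binary32 keeps 24 significant bits


# Energy magnitudes (atomic units) taken from the GAMESS column of Table 3.
ENERGIES = {
    "Caffeine": 1605.91825,
    "Cholesterol": 3898.82189,
    "Buckyball": 10521.6491,
    "Taxol": 12560.6828,
    "Valinomycin": 20351.9904,
}

for name, energy in ENERGIES.items():
    stored = to_float32(energy)
    print(
        f"{name}: spacing {float32_spacing(energy):.6f} a.u., "
        f"storage error {abs(stored - energy):.6f} a.u."
    )
```

For an energy near 20,352 atomic units, the spacing is 2^-9, or roughly 0.002 atomic units, which is already coarser than the 10^-3 threshold. Near 1,606 atomic units it is 2^-13, or roughly 0.0001 atomic units.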

In this article, we've demonstrated that GPUs can significantly outpace commodity CPUs in the central bottleneck of most quantum chemistry problems: the evaluation of two-electron repulsion integrals and the subsequent formation of the Coulomb and exchange operator matrices. Speedups on the order of 100 times are readily achievable for chemical systems of practical interest, and the inherent high level of parallelism results in complete elimination of interblock communication during Fock matrix formation, making further parallelization over multiple GPUs an obvious step in the near future.

For very large molecules, the 32-bit precision provided by the Nvidia G80 series hardware isn't sufficient because the total energy grows with the molecule's size, whereas in chemical problems relative energies are the primary objects of interest. In fact, using an incremental Fock matrix scheme to compute only the difference between Fock matrices in successive iterations and accumulating the Fock matrix on the CPU with double-precision accuracy can improve the final result's precision by up to a factor of 10. Nevertheless, we still require higher precision as the molecules being studied get larger. To maintain chemical accuracy for energy differences in 32-bit precision, we're limited in practice to molecules with fewer than 100 atoms. Fortunately, Nvidia recently released the next generation of GPUs that support 64-bit precision in hardware. Because 32-bit arithmetic will remain significantly faster than 64-bit arithmetic, we anticipate that a mixed-precision computational model will be ideal. In this case, the program will process a small fraction of ERIs (those with the largest absolute value) using 64-bit arithmetic and evaluate the vast majority of ERIs using the faster 32-bit arithmetic. Because the number of ERIs that require double-precision accuracy scales linearly with system size, the impact of double-precision calculations on overall computational performance should be low for large molecules.

The computational methods presented here can be easily augmented to allow calculations within the framework of DFT, which is known to be significantly more accurate than HF theory. We're currently implementing a general-purpose electronic structure code including DFT that runs almost entirely on the GPU in anticipation of upcoming hardware advances. There is good reason to believe that these advances will enable the calculation of structures for small proteins directly from quantum mechanics, as well as the computational design of new small-molecule drugs targeted to specific proteins, with unprecedented accuracy and speed.

Figure 3. Molecules used to test GPU performance. The set of molecules used spans the size range from 20 to 256 atoms.

References

1. J. Bolz et al., "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," ACM Trans. Graph., vol. 22, no. 3, 2003, p. 917.
2. J. Hall, N. Carr, and J. Hart, "GPU Algorithms for Radiosity and Subsurface Scattering," tech. report UIUCDCS-R-2003-2328, Univ. of Illinois, Urbana-Champaign, 2003.
3. K. Fatahalian, J. Sugerman, and P. Hanrahan, Graphics Hardware, T. Akenine-Moller and M. McCool, eds., Wellesley, 2004, p. 133.
4. A.G. Anderson, W.A. Goddard III, and P. Schroder, "Quantum Monte Carlo on Graphical Processing Units," Computer Physics Comm., vol. 177, no. 3, 2007, p. 298.
5. A. Szabo and N.S. Ostlund, Modern Quantum Chemistry, Dover, 1996.
6. R.G. Parr and W. Yang, Density-Functional Theory of Atoms and Molecules, Oxford, 1989.
7. L.E. McMurchie and E.R. Davidson, "One- and Two-Electron Integrals Over Cartesian Gaussian Functions," J. Computational Physics, vol. 26, no. 2, 1978, p. 218.
8. I.S. Ufimtsev and T.J. Martínez, "Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation," J. Chemical Theory and Computation, vol. 4, no. 2, 2008, p. 222.
9. M.W. Schmidt et al., "General Atomic and Molecular Electronic Structure System," J. Computational Chemistry, vol. 14, no. 11, 1993, p. 1347.

Ivan S. Ufimtsev is a graduate student and research assistant in the chemistry department at the University of Illinois. His research interests include leveraging non-traditional architectures for scientific computing. Contact him at iufimts2@uiuc.edu.

Todd J. Martínez is the Gutgsell Chair of Chemistry at the University of Illinois. His research interests center on understanding the interplay between electronic and nuclear motion in molecules, especially in the context of chemical reactions initiated by light. Martínez became interested in computer architectures and videogame design at an early age, writing and selling his first game programs (coded in assembler for the 6502 processor) in the early 1980s. Contact him at toddjmartinez@gmail.com.

Computing in Science & Engineering