
2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing

Programming GPU Clusters with Shared Memory Abstraction in Software


Konstantinos I. Karantasis, Eleftherios D. Polychronopoulos
High Performance Information Systems Lab, Department of Computer Engineering and Informatics, University of Patras, 26500 Rio, Greece
Email: {kik, edp}@hpclab.ceid.upatras.gr
http://pdsgroup.hpclab.ceid.upatras.gr

Abstract: As many-core graphics processors gain an increasingly important position concerning the advancements on modern highly concurrent processors, we are experiencing the deployment of the first heterogeneous clusters that are based on GPUs. The attempts to match future expectations in computational power and energy saving with hybrid, GPU-based clusters are expected to grow in the next years, and much of their success will depend on the provision of the appropriate programming tools. In the current paper we propose a programming model for GPU clusters that is based on shared memory abstraction. We give evidence for the applicability of the proposed model under two cases. In the first case we describe an implementation procedure that involves the utilization of Intel Cluster OpenMP, a cluster-enabled OpenMP implementation. Subsequently, we present an extended version of Pleiad, a cluster middleware which is based on the Java platform. The evaluation of these schemes under two characteristic computationally intensive applications on a 4-node multi-GPU cluster reveals that such approaches can easily enhance existing GPU software development tools, such as CUDA, and they can lead to a significant acceleration of applications that can benefit from many-core GPU clusters.

Index Terms: GPU Clusters, Software DSM, CUDA, OpenMP, Pleiad

I. INTRODUCTION

During the last decade, we have experienced a major shift in microprocessor design technology [1]. In order to produce next-generation microprocessors that could exhibit respectable performance gains while preserving an acceptable rate of power consumption, in comparison to the high-frequency processors of that time, the decision was to unfold the potential of parallelism inside the chip. Along that process, which is often described as the multicore revolution [2], modern GPUs have, up to now, the lead concerning the number of cores that they deploy. Due to their streamlined design, which excludes several features, most notably memory coherence, out-of-order execution, and branch prediction, modern GPUs are able to encompass hundreds of cores on a single chip, while the number of cores in general purpose CPUs reaches a few dozen at the experimental level [3].
This work is supported by the Karatheodori Grant no. C-141 of the University of Patras


For instance, the NVIDIA Tesla GPU [4] consists of 30 multiprocessors, where each multiprocessor contains 8 cores, resulting in an aggregate of 240 cores inside a single GPU. Nevertheless, since the introduction of multi-core and many-core architectures, the burden of utilizing the available resources has been transferred mostly to the software stack. In the case of parallel processing with GPUs, synergetic execution schemes have to be implemented between CPU and GPU threads in order to benefit from the computing power afforded by many-core GPUs. That process has been facilitated by the introduction of simple programming environments, such as CUDA [5] and OpenCL [6], that rely on the C/C++ programming languages and their respective multithreaded libraries. However, until now, well-established, sophisticated programming environments and optimization tools have not been available. Under these circumstances, the exploration of the performance capabilities of the newly introduced multiprocessors has relied mostly on the implementation, porting and hand-optimization of a wide range of applications that could benefit from the proliferation of computational resources. It is expected that such a thorough study at the application level will drive the development of high performance compilers, runtime systems and the respective middleware.

While the interest in GPU accelerators grows, clusters are expected to remain the main architectural organization of supercomputers in the near future. The advent of many-cores is not expected to replace clusters. On the contrary, it has already begun to enhance their architecture, and the result of such a transition is evident in the deployment of the first heterogeneous accelerator clusters. Following that anticipation, in the current paper we present an alternative approach to programming high performance GPU clusters. As opposed to the common practice that employs MPI or a related message passing programming model, we present the implementation procedure and the first results of a programming approach for GPU clusters that relies on shared memory abstraction in software between the nodes.

The rest of the paper is organized as follows. In Section II we refer to the research efforts that relate to the presented approach. Section III describes the implementation platforms that form the basis of the GPU cluster middleware that offers shared memory abstraction and refers to the necessary extensions that took place. In Section IV we discuss the application benchmarks that were selected and implemented to evaluate shared memory middleware on GPU clusters, referring to the most important implementation details of the parallelization process. In Section V we provide the performance evaluation of the proposed scheme, and finally in Section VI we draw our conclusions and refer to our future work.

II. RELATED WORK

As the first GPU clusters are deployed [7] and application porting aims to utilize multiple GPUs, most codes that target GPU clusters tend to use a programming model that is based on message passing. Particularly, in most cases an implementation of MPI is used to distribute and coordinate cooperating processes across the GPU cluster, and quite a few applications have been ported in that way [8][9][10]. In parallel with application porting, certain efforts try to integrate GPU programming environments and MPI in a unified development platform [11][12].

Research efforts that follow an alternative approach, based on a software distributed shared memory (SDSM) layer across the cluster, have not yet been thoroughly evaluated, although such an approach could transparently extend the memory hierarchy of a GPU application and efficiently match its respective memory access patterns. Gelado et al. [13] have recently presented ADSM, moving towards the evaluation of SDSM on GPU accelerators; however, ADSM currently concentrates on the communication between host and device memories and, for the moment, cannot be used to utilize GPUs in a distributed environment. On the other hand, Zippy [14] is based on Global Arrays to provide shared memory abstraction on CUDA-enabled clusters. It does so by exposing part of the data sharing mechanisms to the application programmer, and in that way bypasses the need to provide a memory consistency protocol. Strengert et al. [15] have presented CUDASA, which operates as a language extension to CUDA. CUDASA is supported by a source-to-source compiler, and in order to realize inter-node communication on a cluster it uses MPI calls. Lastly, Barak et al. recently presented MGP [16] as a preliminary implementation of OpenCL for clusters that is based on the MOSIX [17] cluster-oriented operating system.

A highly active research area that shares common targets with our effort is the area of Partitioned Global Address Space (PGAS) languages. Certain results up to now have shown that, under certain circumstances, these languages can compete equally with their message passing counterparts. Nevertheless, the research efforts that aim to provide a programming environment appropriate for GPU cluster utilization under PGAS languages are still at a preliminary stage.

III. SHARED MEMORY ABSTRACTION MIDDLEWARE

In this section we describe the implementation effort at the middleware level, including the programming platforms that formed the basis for the presented application acceleration and the necessary extensions that were made to support shared memory abstraction on GPU clusters.

A. CUDA

Targeting GPU clusters that are supplied with NVIDIA graphics processors, the CUDA [5] programming environment was used to implement the portion of the algorithm that executes on the GPU device side and to realize the communication between the code executing on the GPUs and the local CPU threads executing on the host side. CUDA is a simple programming environment that includes a run-time library and a compiler driver. Currently, CUDA is the most advanced tool that can efficiently utilize the data parallel resources afforded by NVIDIA graphics processors. Much of the adoption of CUDA as a toolkit to program many-core GPUs has taken place because it successfully extends the C programming language. The application programmer actually writes C code, and the NVCC compiler driver bifurcates the code into two portions. One portion is delivered to the CPU, the so-called host side, mainly for the coordination of the computation, while the other portion, involving the intensive computations, is delivered to the GPU, the so-called device side, which executes the code in a data parallel manner. The execution model of CUDA is characterized as single-instruction, multiple-thread (SIMT) and bears some resemblance to the SIMD category of parallel programming models on vector processors.

B. Intel Cluster OpenMP

The current practice concerning the utilization of GPU clusters uses, almost exclusively, some type of message passing as a cluster-wide programming model, and in most cases that model is specifically an implementation of MPI [9]. In order to evaluate an alternative approach that would provide shared memory abstraction across the cluster, in the current implementation we have first integrated into our scheme a cluster-enabled OpenMP implementation. This specific implementation is provided as part of the Intel C/C++ compiler and is based on the well-established software DSM TreadMarks [18]. Its main difference from regular OpenMP implementations for multicore systems, in relation to the application programming interface, concerns the default data sharing policy. In Cluster OpenMP, data are not shared by default, but have to be explicitly declared shared and have their memory allocated through special function calls.

In order to carry out an implementation with CUDA and Intel Cluster OpenMP, the use of the low-level CUDA Driver API is imposed. This is, currently, due to the lack of support for the Intel C/C++ compiler by the NVCC CUDA compiler driver. In that case, the cooperation is feasible if the source code that is destined to run on the host is compiled with icc and the source code of the CUDA kernels that will operate on the GPU device side is compiled with the NVIDIA CUDA compiler driver (nvcc).
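To make the cooperation between the two toolchains more concrete, the following host-side sketch outlines the structure described above: data that must be visible across nodes is allocated through Cluster OpenMP's sharable allocation call, and each OpenMP thread binds one CUDA Driver API context to one local GPU. This is a minimal sketch under several assumptions and not the code used in our evaluation: the module and kernel names are hypothetical, the mapping of thread numbers to local device indices is simplified, error checking is omitted, and the actual kernel launch (cuFuncSetBlockShape, cuParamSet* and cuLaunchGrid in CUDA 3.x) is elided.

    #include <cuda.h>   /* CUDA Driver API */
    #include <omp.h>
    #include <stdlib.h>

    /* Cluster OpenMP sharable allocation (declared by the Intel toolchain;
       written out here by hand for the sketch). */
    extern void *kmp_sharable_malloc(size_t size);
    extern void  kmp_sharable_free(void *ptr);

    #define N              (1 << 20)
    #define GPUS_PER_NODE  2        /* assumption matching our test cluster */

    int main(void)
    {
        /* Cluster-wide sharable data; ordinary malloc'd memory would stay
           private to each node. */
        float *objects = (float *) kmp_sharable_malloc(N * sizeof(float));

        #pragma omp parallel        /* one CPU thread per GPU context */
        {
            int dev_id = omp_get_thread_num() % GPUS_PER_NODE; /* simplified mapping */

            CUdevice    dev;
            CUcontext   ctx;
            CUmodule    mod;
            CUfunction  kern;
            CUdeviceptr d_objects;

            cuInit(0);
            cuDeviceGet(&dev, dev_id);
            cuCtxCreate(&ctx, 0, dev);

            /* Kernels are compiled separately with nvcc into a module. */
            cuModuleLoad(&mod, "kernels.cubin");                  /* hypothetical */
            cuModuleGetFunction(&kern, mod, "assign_membership"); /* hypothetical */

            cuMemAlloc(&d_objects, N * sizeof(float));
            cuMemcpyHtoD(d_objects, objects, N * sizeof(float));

            /* ... set up parameters and launch kern (omitted) ... */

            cuMemcpyDtoH(objects, d_objects, N * sizeof(float));
            cuMemFree(d_objects);
            cuCtxDestroy(ctx);
        }

        kmp_sharable_free(objects);
        return 0;
    }

A real implementation would of course copy only the partition of the sharable region that its GPU is responsible for; the consistency protocol of the underlying SDSM keeps the host-side copies coherent across nodes.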


TABLE I: Pleiad Programming API


Type            Return   Basic Methods
PleiadObject    -        abstract base class
SharedObject    Object   get()
                void     set(Object obj)
SharedArray     Object   get(int i), get(int i, int j)
                Object   get(int[] ndims)
                void     set(int i, Object obj)
                void     set(int i, int j, Object obj)
                void     set(int[] ndims, Object obj)
LockObject      void     lock()
                void     unlock()
BarrierObject   void     await()
PleiadThread    void     run()
PleiadGPU       void     run()

Algorithm 1 K-means multi-GPU algorithm
Require: Object set E = {e1, ..., en}; number of clusters k
Ensure: Set of cluster centroids C = {c1, ..., ck}
procedure K-MEANS
    C <- (initialize centroids)
    repeat
        for all obj in E do                                      (GPU)
            update object membership
        end for
        for all obj in E do                                      (GPU)
            check changes in membership and update cluster size
        end for
        C <- (new centroids via reduction)                       (CPU)
    until #membership changes <= delta
end procedure
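As an illustration of the first GPU step of Algorithm 1, a membership-update kernel can be sketched along the following lines. This is our own simplified illustration rather than the exact kernel used in the evaluation: it assumes a flat row-major layout of objects and centroids, uses 1D indexing instead of the 2D thread blocks described in the text, and leaves the counting of membership changes to a later pass.

    /* Sketch: each GPU thread classifies one object against the k current
       centroids using the squared Euclidean distance. */
    __global__ void assign_membership(const float *objects,    /* n x f, row-major */
                                      const float *centroids,  /* k x f, row-major */
                                      int *membership, int n, int k, int f)
    {
        int obj = blockIdx.x * blockDim.x + threadIdx.x;
        if (obj >= n) return;

        int   best      = 0;
        float best_dist = 3.402823466e38f;   /* FLT_MAX */

        for (int c = 0; c < k; ++c) {
            float dist = 0.0f;
            for (int j = 0; j < f; ++j) {
                float d = objects[obj * f + j] - centroids[c * f + j];
                dist += d * d;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        membership[obj] = best;   /* changes and cluster sizes are counted later */
    }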


C. Pleiad

The second alternative in terms of cluster middleware that is considered in the current paper is Pleiad [19]. Pleiad is a cluster middleware that is based on the Java platform and enables transparent multithreaded execution across physically distributed nodes, such as the nodes of a GPU-based accelerator cluster. Data sharing in Pleiad is based on objects and, instead of being tightly coupled with a specific consistency protocol, Pleiad incorporates several implementations of consistency maintenance that can potentially be interchanged even at run-time. In order to give Pleiad the ability to utilize GPUs through CUDA, the collection of native and Java wrapper methods that are available through the JCuda [20] package has been used. The essential objects of Pleiad's API and their respective methods are presented in Table I. Pleiad was extended to incorporate threads that can handle GPU contexts, and that process resulted in the addition of the PleiadGPU class. Moreover, the distributed SharedArrays were also enhanced in order to supply CUDA kernels with data portions of a particular size.

IV. MULTI-GPU APPLICATION BENCHMARKS

Although there is a considerable number of applications that have been ported to GPUs, up to now the process of choosing characteristic codes and forming equivalent benchmark suites is at its early stages. Consequently, the few benchmark suites that are publicly available target the evaluation of a single GPU [21][22]. Therefore, due to the lack of widely accepted multi-GPU application benchmarks, we present the results that we have obtained through the extension of two characteristic applications, K-means [23] and a WENO solver [24], for multi-GPU execution. These applications come from the areas of data clustering and computational fluid dynamics respectively and, without being embarrassingly parallel, they exhibit considerable speedup under execution on a single GPU.

Next, we briefly describe the main issues of their optimization in order to run on GPU clusters under shared memory abstraction.

A. K-means clustering

The plain version of K-means describes an algorithm that operates iteratively, classifying a given data set into a number of clusters that is supplied a priori. The classification is based on the attributes/features of each object, and the number of such features is arbitrary. K-means starts by selecting K objects, one for each cluster, that are called centroids. There are several selection strategies that aim to reduce iterations, but in the current study we have considered random selection. On every iteration, K-means assigns each object to its nearest centroid according to a certain metric, e.g. the Euclidean distance. Subsequently, the algorithm computes the new centroid of every cluster as the mean of all the objects that belong to that cluster. The algorithm terminates when the overall changes in the centroids fall below a certain threshold.

In order to utilize the several GPUs that are available across the cluster, our implementation follows a two-level hierarchical parallelization scheme (Alg. 1). At the first level, the partitioning of the object space takes place, and each group of objects is assigned for examination to each CPU thread (either a PleiadThread or an OpenMP thread) that is created. Every such thread also corresponds to the manipulation of a GPU context that is going to run on the device side. On CUDA architectures that do not support concurrent kernel execution, such as the one we have used in the experimental evaluation, the reasonable choice is to spawn one such thread for each available GPU. The second level of parallelism is spawned at the GPU level, where the GPU threads are organized into 2D blocks, which in turn are structured on a 1D grid of thread blocks.


Fig. 1: 3D domain decomposition. (a) Cluster level. (b) GPU level (grid mapping of the 3D thread blocks onto a 2D grid).

void RungeKutta()
{
    for (int m = 0; m < ORDER; m++) {
        if (omp_get_num_threads() > 1) {
            copyData(FROM_DEVICE, DIRECTION);
            #pragma omp barrier
            copyData(TO_DEVICE, DIRECTION);
        }
        launchKernel(BC);         // Boundary conditions update
        launchKernel(XI);         // Right-hand side along I
        launchKernel(ETA);        // Right-hand side along J
        launchKernel(ZETA);       // Right-hand side along K
        launchKernel(UPDATE, m);  // Update mutable data
    }
    copyData(FROM_DEVICE, DIRECTION);
}

Because the WENO solver operates on structured meshes, domain decomposition is a convenient way to assign the computational load to the available GPUs. Again, a two-level hierarchical parallelization strategy is enforced. At the cluster level, a band-based domain decomposition is applied, as shown in Fig. 1a. Every CPU thread corresponds to a particular GPU execution context. The necessary changes at the boundaries of each subdomain are propagated transparently through the consistency mechanisms of the shared memory abstraction layer. At the GPU level, each subdomain has to be fragmented in a way that the resulting number of computational portions is large enough to allow overlapping of thread blocks and efficient utilization of the stream multiprocessors of every GPU. Therefore, every band of the initial domain is further subjected to a 3D decomposition that results in the formation of the computational thread blocks that will be scheduled on each GPU. The implementation of that 3D scheme follows the practice described by Cohen et al. in [25]. Every data point of the mesh is represented by an aligned structure of single precision floating-point values, and its manipulation is assigned to a single GPU thread. In order to achieve a proper mapping of the 3D thread blocks onto the 2D grid that is supplied on every kernel launch, the respective subdomain is decomposed into a 2-dimensional grid with dimensions (I_Size/BlockSize_x, (J_Size/BlockSize_y) * (Z_Size/BlockSize_z)). A schematic representation of the applied scheme is shown in Fig. 1b.

The above decomposition scheme leads to a largely coalesced access pattern for the mutable data structures that correspond to the structured mesh. These data structures are placed in the global memory of the GPU, and their respective parts that need to be communicated between multiple GPUs are placed in the sharable data region of the SDSM.
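The following sketch illustrates the kind of index arithmetic implied by this mapping; the block dimensions, array layout and kernel name are illustrative assumptions and not the exact values used in the solver. The x dimension of the grid covers the I direction of the subdomain, while the y dimension packs the J and K block counts together, exactly as in the formula above.

    #define BLOCK_X 32
    #define BLOCK_Y 4
    #define BLOCK_Z 4   /* 32*4*4 = 512 threads per block, the limit on a T10 */

    __global__ void rhs_kernel(float *field, int isize, int jsize, int ksize)
    {
        /* Recover the 3D block coordinates from the 2D grid. */
        int blocks_j = (jsize + BLOCK_Y - 1) / BLOCK_Y;

        int bi = blockIdx.x;
        int bj = blockIdx.y % blocks_j;
        int bk = blockIdx.y / blocks_j;

        int i = bi * BLOCK_X + threadIdx.x;
        int j = bj * BLOCK_Y + threadIdx.y;
        int k = bk * BLOCK_Z + threadIdx.z;

        if (i < isize && j < jsize && k < ksize) {
            /* ... operate on field[(k * jsize + j) * isize + i] ... */
        }
    }

    /* Host side: grid dimensions, matching the formula in the text. */
    void grid_dims(int isize, int jsize, int ksize, int *gx, int *gy)
    {
        *gx = (isize + BLOCK_X - 1) / BLOCK_X;
        *gy = ((jsize + BLOCK_Y - 1) / BLOCK_Y) * ((ksize + BLOCK_Z - 1) / BLOCK_Z);
    }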

Fig. 2: Execution flow of the Runge-Kutta time advancement on the GPU

Each thread in every block is responsible for the assignment of a certain data object to the appropriate cluster. When the assignment part is completed, a global reduction of the newly computed clusters has to take place. Because the new cluster centers have to be propagated across the cluster, in the current implementation we have chosen to perform that operation on the CPU side, since it does not exhibit a high degree of parallelism. Therefore, the reduction computation takes place across the CPU threads, and the new centroids are propagated to every node at the start of the next iteration through the activation of the memory consistency mechanisms.

B. WENO solver

This specific application involves a numerical simulation of turbulence in high-speed, compressible flows using a Weighted Essentially Non-Oscillatory (WENO) scheme [24]. The WENO solver is a high order accurate method that, in the particular application benchmark, refers to the simulation of the Rayleigh-Taylor (R-T) instability in a 3-dimensional domain.


TABLE II: Experimental Environment and Settings


                     CPU                                    GPU
#Cluster nodes       4
#units per node      1                                      2
#cores per node      4                                      480 (60 SM)
Type                 Intel Xeon E5504 @ 2.00 GHz            Tesla T10 @ 1.30 GHz (S1070 1U system)
Memory               4096 KB (cache), 4 GB (host memory)    4 GB (device global)
Interconnect         Gigabit Ethernet via Gigabit switch    PCI-E x16
Cluster Middleware   Intel Cluster OpenMP (ICC 11.1), Pleiad (Java 1.6.0_16)
CUDA SDK             CUDA Driver API 3.1

A. Benchmark Setup

The experimental evaluation of the current implementation took place on a 4-node GPU-based accelerator cluster. The 4 nodes of the cluster were externally connected to 2 NVIDIA Tesla 1U computing blades, establishing one connection per node that supplied each node with 2 Tesla T10 graphics processors. Thus, each node was able to utilize 60 stream multiprocessors and a total number of 480 cores on the GPU side. In total, this specific configuration resulted in a GPU cluster with an aggregate of 240 stream multiprocessors and 1920 cores. The hardware and software aspects of the configuration are summarized in Table II.

Concerning the application benchmark setup, we have conducted experiments using various configurations. For K-means, the experimental runs included files with real data sets that are supplied by the UCI Knowledge Discovery in Databases repository [26]. In the following graphs we present the results that concern executions of K-means with input sizes of 1.6 million objects. On every run a number of 20 clusters was requested, and these clusters were permanently determined after 106 iterations. The execution times refer to the average execution time of 5 runs.

The simulation of compressible flows involved execution of the WENO solver on a 480x120x120 mesh. In the current setting, a heavy fluid resides on the left side of the domain and has density ρL = 2, while a light fluid resides on the right with density ρR = 1. The interface between the two fluids is at x = 1/2 and the variation of the initial pressure is linear throughout the domain. The initial pressure in the domain of the heavy fluid on the left is pL(x) = 1 + 2x, while the variation of the pressure on the right is pR(x) = 1.5 + x. The computational domain of the simulation refers to the box (1 x 0.25 x 0.25) in three dimensions. The execution times refer to the average execution time of 4 simulations, with each simulation performing 8000 iterations.

B. Performance evaluation on GPU clusters

Three basic configurations are compared in terms of execution time (Fig. 3) and their respective speedup (Fig. 4). CUDA-LOCAL refers to local multi-GPU evaluation on a single node of the cluster. This implementation is solely based on CUDA and does not involve interactions with any cluster middleware. CUDA-SDSM refers to a cluster-wide evaluation over SDSM through the Intel Cluster OpenMP implementation. Under that scheme, a single process is started on every node, and in the case of utilization of 2 or more GPUs per node, each process uses internal local multithreading. Specifically, on the available cluster, the hardware resources dictate the creation of at most 2 threads per process. Finally, CUDA-Pleiad corresponds to the evaluation of the Pleiad cluster middleware, using a variant of home-based lazy release consistency (HLRC) based on invalidations. In the case of Pleiad, internal multithreading is also exercised transparently per node.

The two applications share common results. Specifically, the scheme under Intel Cluster OpenMP shows the best performance across the whole scale of the given cluster platform.


On every simulation step, these boundary points are transferred between host memory and device global memory with direct memory copies, and among cluster nodes with propagations that correspond to barrier synchronization points. As far as the immutable data on the GPU are concerned, plain variables are placed in constant memory, and 3D read-only data structures that have been produced during initialization are placed in texture memory. In that way we are able to benefit from the caching mechanisms of texture memory, which are optimized for spatial locality and less coalesced accesses.

Concerning the execution flow of the simulation, all the computations that are required by the Runge-Kutta time stepping take place on the GPU. On every iteration, there are 5 distinct kernels that are launched and realize the simulation step, as depicted in Fig. 2. The kernels that discretize the right-hand side of the equations operate on the entire extent of the subdomain, while the BC and UPDATE kernels perform updates that do not exhibit a high degree of parallelism. Still, their execution on the GPU side is more efficient than a potential execution on the CPU if we take into account the necessary memory copies that would have to take place. Lastly, the synchronization that is imposed by the computation scheme is restricted to barrier synchronization between the threads of the same thread block inside each CUDA kernel. At these points, barrier synchronization ensures that the required updates of auxiliary local variables, such as Roe's averaging, or the right and left eigenvectors that are evaluated at this average state, have been accomplished. The need for mutual exclusion is minimized through the use of an atomic max operation that is used to compute the maximum eigenvalue required for the construction of the Lax-Friedrichs numerical flux.
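Because CUDA exposes atomicMax only for integer types, one common way to realize such an atomic maximum over non-negative single precision values, such as the spectral radius needed here, is to reinterpret the bit patterns as integers, whose ordering coincides with the floating-point ordering for non-negative IEEE-754 values. The sketch below is our own illustration of that technique under hypothetical names, not the solver's actual code; the accumulator is assumed to be initialized to zero before the launch.

    /* Atomic maximum of a non-negative float via its integer bit pattern. */
    __device__ void atomic_max_nonneg(float *addr, float value)
    {
        atomicMax((int *) addr, __float_as_int(value));
    }

    __global__ void max_eigenvalue(const float *eig, int n, float *max_eig)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomic_max_nonneg(max_eig, eig[i]);   /* eigenvalues assumed >= 0 */
    }

In practice one would combine this with a per-block reduction in shared memory so that only one atomic update is issued per thread block; the sketch shows only the atomic part.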

V. BENCHMARK EVALUATION

In this section we first describe the benchmark setup and the experimental platform that were used, and next we present the evaluation of the proposed schemes under the application benchmarks.

(a) K-means

(b) WENO

Fig. 3: Execution times

(a) K-means

(b) WENO

Fig. 4: Speedup against sequential execution

Pleiad achieves a lower speedup, which is, however, positive and follows, in terms of scaling, the speedup of Intel Cluster OpenMP. CUDA-LOCAL was included in the evaluation to form a basis of comparison between sole CUDA execution and execution that involves a cluster middleware, when local multithreaded execution is applied on a single node. Because no node has more than 2 GPUs, the performance of CUDA-LOCAL degrades significantly when two or more GPU contexts execute sequentially on the same device. That effect is evident when 4, 6 and 8 GPU contexts are created. This is, however, expected behavior, and these results are included here for the sake of completeness.

If we examine each application separately, we observe that the maximum speedup for K-means is approximately 56 times, and it is obtained with 8 GPUs under Intel Cluster OpenMP. The speedups are presented in comparison to the sequential execution of the K-means algorithm that is available through the NU-MineBench benchmark suite [27]. The OpenMP bar corresponds to the speedup that is achieved through a CPU-only OpenMP execution.

The detailed presentation of execution times in Fig. 5a reveals that the reduction phase of the computation, which is executed on the CPU side, has the greatest impact on performance after the implementation of cluster assignment on the GPU. Nevertheless, both CPU and GPU computations scale efficiently in the presence of more computational resources. The data transfers between host memory and device memory impose low overhead, while the communication at the cluster level and the time that is spent on synchronization, which are reported combined, have an increased impact only in the case of Pleiad. This effect is observed mainly due to the unavoidable object serialization process. However, that cost does not surpass the overall benefit of GPU acceleration across the cluster.

Concerning the WENO solver, the maximum speedup is again achieved by Intel Cluster OpenMP and ranges between 30x and 90x. However, in that application the right-hand side (RHS) computations that take place on the CPU side have a negligible effect (Fig. 5b). The whole computational effort is assigned to the GPU side and, as long as there are enough GPUs, the speedup is sufficient.


(a) K-means

(b) WENO

Fig. 5: Execution time details

Again, the data transfers over the PCI-E interface between host and device memory have a smaller impact, though slightly bigger than in K-means, in comparison to the impact of communication on the SDSM layer. The difference in cluster-wide communication between Intel Cluster OpenMP and Pleiad was smaller in the case of the WENO solver, and this fact implies that the memory consistency mechanisms of Pleiad succeeded in overlapping the serialization cost more efficiently with the communication pattern imposed by the WENO solver.

VI. CONCLUSIONS AND FUTURE WORK

In the current paper we have presented a programming approach for modern GPU clusters that is based on shared memory abstraction. Our approach was evaluated under two cluster middleware platforms, Pleiad and Intel Cluster OpenMP. We argue that such an implementation can pose an interesting alternative to mainstream message passing. First results from the experimental evaluation with two computationally intensive applications show that the presented approach is valid and can result in considerable acceleration of such applications. The proposed approach extends the memory hierarchy of a GPU cluster in a more consistent and transparent way than message passing implementations do. Therefore, these preliminary results can encourage the implementation of appropriate middleware that will be based on the concept of shared memory abstraction and will also be able to offer an efficient programming model for heterogeneous GPU clusters.

Our future research will focus on further exploring the implications of programming GPU clusters using shared memory abstraction and on enhancing the respective middleware. In parallel, we aim to thoroughly evaluate our scheme with more applications that present high demands in terms of both data handling and computational cost.

REFERENCES
[1] K. Olukotun and L. Hammond, "The Future of Microprocessors," Queue, vol. 3, no. 7, pp. 26-29, 2005.

[2] H. Sutter and J. Larus, "Software and the Concurrency Revolution," Queue, vol. 3, no. 7, pp. 54-62, 2005.
[3] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. V. D. Wijngaart, and T. Mattson, "A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," in IEEE International Solid-State Circuits Conference, San Francisco, California, USA, Feb. 2010.
[4] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[5] NVIDIA CUDA, Compute Unified Device Architecture, Programming Guide (Version 3.1), NVIDIA Corporation, Jun. 2010. [Online]. Available: http://developer.nvidia.com/object/cuda_3_1_downloads.html
[6] OpenCL, OpenCL - the Open Standard for Parallel Programming of Heterogeneous Systems, Khronos Group, 2009. [Online]. Available: http://www.khronos.org/opencl/
[7] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover, "GPU Cluster for High Performance Computing," in SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2004, p. 47.
[8] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. McCormick, S. H. M. Buijssen, M. Grajewski, and S. Turek, "Exploring weak scalability for FEM calculations on a GPU-enhanced cluster," Parallel Computing, vol. 33, no. 10-11, pp. 685-699, 2007.
[9] D. A. Jacobsen, J. C. Thibault, and I. Senocak, "An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters," in 48th AIAA Aerospace Sciences Meeting, Orlando, Florida, USA, Jan. 2010.
[10] D. Göddeke, R. Strzodka, J. Mohd-Yusof, P. S. McCormick, H. Wobker, C. Becker, and S. Turek, "Using GPUs to Improve Multigrid Solver Performance on a Cluster," International Journal of Computational Science and Engineering, vol. 4, no. 1, pp. 36-55, Nov. 2008.
[11] J. C. Phillips, J. E. Stone, and K. Schulten, "Adapting a message-driven parallel application to GPU-accelerated clusters," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1-9.
[12] E. P. Mancini, G. Marsh, and D. K. Panda, "An MPI-Stream Hybrid Programming Model for Computational Clusters," in Cluster Computing and the Grid, IEEE International Symposium on, 2010, pp. 323-330.
[13] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," in ASPLOS '10: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming


languages and operating systems. New York, NY, USA: ACM, 2010, pp. 347-358.
[14] Z. Fan, F. Qiu, and A. E. Kaufman, "Zippy: A Framework for Computation and Visualization on a GPU Cluster," Comput. Graph. Forum, vol. 27, no. 2, pp. 341-350, 2008.
[15] M. Strengert, C. Müller, C. Dachsbacher, and T. Ertl, "CUDASA: Compute Unified Device and Systems Architecture," in Eurographics 2008 Symposium on Parallel Graphics and Visualization (EGPGV08), J. M. Favre, K.-L. Ma, and D. Weiskopf, Eds. Eurographics Association, 2008, pp. 49-56.
[16] A. Barak, T. Ben-Nun, E. Levy, and A. Shiloh, "A package for OpenCL based heterogeneous computing on clusters with many GPU devices," in Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Conference on, Sep. 2010, pp. 1-7.
[17] A. Barak and O. Laadan, "The MOSIX multicomputer operating system for high performance cluster computing," Future Gener. Comput. Syst., vol. 13, pp. 361-372, Mar. 1998.
[18] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of Workstations," Computer, vol. 29, no. 2, pp. 18-28, 1996.
[19] K. I. Karantasis and E. D. Polychronopoulos, "Pleiad: a cross-environment middleware providing efficient multithreading on clusters," in CF '09: Proceedings of the 6th ACM Conference on Computing Frontiers. New York, NY, USA: ACM, 2009, pp. 109-116.
[20] Y. Yan, M. Grossman, and V. Sarkar, "JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA," in Euro-Par '09: Proceedings of the 15th International Euro-Par Conference on Parallel Processing. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 887-899.

[21] The IMPACT Research Group, "Parboil benchmark suite," 2009. [Online]. Available: http://impact.crhc.illinois.edu/parboil.php
[22] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The Scalable Heterogeneous Computing (SHOC) benchmark suite," in GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. New York, NY, USA: ACM, 2010, pp. 63-74.
[23] J. B. MacQueen, "Some Methods for Classification and Analysis of MultiVariate Observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281-297.
[24] G.-S. Jiang and C.-W. Shu, "Efficient implementation of weighted ENO schemes," J. Comput. Phys., vol. 126, no. 1, pp. 202-228, 1996.
[25] J. M. Cohen and J. Molemaker, "A Fast Double Precision CFD Code Using CUDA," in 21st International Conference on Parallel Computational Fluid Dynamics (ParCFD 2009), 2009.
[26] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth, "The UCI KDD archive of large data sets for data mining research and experimentation," SIGKDD Explor. Newsl., vol. 2, no. 2, pp. 81-85, 2000.
[27] J. Pisharath, Y. Liu, W.-k. Liao, A. Choudhary, G. Memik, and J. Parhi, "NU-MineBench 2.0," Center for Ultra-Scale Computing and Information Security (CUCIS), Northwestern University, Tech. Rep. CUCIS-2005-08-01, Aug. 2005. [Online]. Available: http://cucis.ece.northwestern.edu/techreports/pdf/CUCIS-2004-08001.pdf

