Table of contents
List of figures
List of tables
Summary
1 Introduction
    1.1 Motivation
    1.2 Reading guide
    1.3 Problem definition
    1.4 Method
    1.5 Scope
        1.5.1 Algorithms
        1.5.2 Numerical stability
        1.5.3 IEEE 754 and double-precision
        1.5.4 BLAS
2 Background
    2.1 Linear algebra
    2.2 GPU computing
3 Parallel platforms
    3.1 Cuda
        3.1.1 History
        3.1.2 Version
        3.1.3 Cuda program
        3.1.4 Architecture
        3.1.5 Limitations
4 Hardware platform
    4.1 Analysis
    4.2 Benchmarking
        4.2.1 Memory performance
        4.2.2 Arithmetic performance
5 Implementation
    5.1 Development environment
    5.2 Design decisions
    5.3 Optimisation
        5.3.1 Strategy
6 Matrix-multiplication
    6.1 Analysis
        The sequential algorithm
        Parallelism
    6.2 Simple algorithm
        The algorithm
        Test and results
    6.3 Optimisation
        Unroll loop with threads
        Tiling v1
        Tiling v2 with latency hiding
        Tiling v3 with prefetching
        Tiling v4 and v5 with more output per thread
        Cuda compute capability
    6.4 Evaluation
7 LU-decomposition
    7.2 Simple algorithm
        The algorithm
        Test and results
    7.3 Block LU-decomposition
        The block algorithm
        Implementation
        Test and results
        Optimising round 1
        Test and results
        Optimising round 2
        Further optimisation
        Large matrices
    7.4 Evaluation
8 QR decomposition
    8.1 Analysis
        The sequential algorithm
        Parallelism
    8.2 Simple algorithm
        The algorithm
        Test and results
    8.5 Evaluation
    9.2 GPU.NET
10 Discussion and future work
    10.1 Project
    10.2 Cuda
    10.3 Hardware
    10.4 Future of GPGPU
11 Conclusion
Bibliography and references
Appendix A Project evaluation
Appendix B Implementation considerations
    Cuda thread organisation
    SIMT and warp size
    Elapsed time
    Pinned or page-locked memory
    Matrix structure
Appendix C Hardware specification description and analysis
    Platform #1
    Platform #2
    Platform #3
    Platform evaluation
    Specifications
    Evaluation
Appendix D Development environment problems and solution model
    Development model
    Cuda C and C++
Appendix E CGMA and Cuda profiler
Appendix F Matrix-multiplication CC levels
Appendix G Report page count
List of figures
Figure 1 - Cuda program sequence diagram
Figure 2 - Cuda architecture with four multiprocessors
Figure 3 - How GPU.NET works as described on TidePowerd.com
Figure 4 - Simplified diagram of a chipset
Figure 5 - Matrix-multiplication process depicted
Figure 6 - The output of the console testing program
Figure 7 - Performance of kernels executed for different CC levels on platform #4
Figure 8 - Performance of simple LU-decomposition on different platforms
Figure 9 - Matrix A being decomposed by block LU-decomposition in steps
Figure 10 - Performance of block LU-decomposition v1 on different platforms
Figure 11 - Computing time of each kernel in block LU-decomposition v1 on platform #4
Figure 12 - Performance of block LU-decomposition v2 on different platforms
Figure 13 - Computing time of each kernel in block LU-decomposition v2 on platform #4
Figure 14 - Performance of block LU-decomposition v3 on different platforms
Figure 15 - Showing the sub-matrix part of the triangular solve method
Figure 16 - A 10.000 x 10.000 matrix LU-decomposed on platform #3 and #4
Figure 17 - Peak performance of LU-decomposition v3 on platform #3
Figure 18 - Storage strategy for the compressed Householder QR-factorisation
Figure 19 - Matrix A being decomposed by block QR-decomposition in steps
Figure 24 - Cuda thread organisation [4]
Figure 20 - Block diagram of a chipset. Source: Intel
Figure 21 - CPU and bus details of platform #1
Figure 22 - System memory details of platform #1
Figure 23 - GPU details of platform #1
List of tables
Table 1 - Hardware specifications for the four platforms
Table 2 - Measured bandwidth of Cuda memory transfer operations
Table 3 - Measured gigaflops performance of GPU
Table 4 - Test result of outer loops matrix-multiplication on platform #1
Table 5 - Test result of outer loops matrix-multiplication no structure on platform #1
Table 6 - Test result of matrix-multiplication for resulting matrix on platform #1
Table 7 - Test result of matrix-multiplication for tiling strategy on platform #1
Table 8 - Tiling with 2 and 4 outputs per thread comparison for different platforms
Table 9 - Kernel invocation overhead ratio of total running time
Table 13 - Cuda built-in variables
Table 10 - GPU specifications for Nvidia GeForce 9400m, platform #1
Table 11 - GPU specifications for Nvidia GeForce 8800 GS, platform #2
Table 12 - GPU specification for Nvidia Tesla C1060, platform #3
Table 14 - Selected profile counter from Compute Visual Profiler User Guide
Summary
The purpose of this project was to uncover characteristics, features and limitations of the Cuda architecture. An optimisation strategy was formed, containing methods and techniques that were expected to increase performance. Three frequently used linear algebra algorithms were chosen: matrix-multiplication, LU-decomposition and QR-decomposition. These algorithms were first implemented in a simple version, and performance and correctness tests were performed. GPU.NET was used as a frame of reference where applicable. The optimisation strategy was then used to improve the performance of the implemented algorithms. It was found that a linear block algorithm could achieve better performance than a regular algorithm.

The main output from this project was a list of recommendations and experiences from the tests performed on the linear algebra algorithms. The findings suggested that tiling was the best optimisation strategy, followed by latency hiding and coalesced memory access. In addition to these points, the following recommendations are based on the testing:

- Avoid using structures as parameters in kernel definitions; use simple types or pointers to them instead.
- Target the highest possible Compute Capability level. Among other things, the precision of instructions is better and the result will be more accurate.
- Unroll loops by making the threads fine-grained. Thread generation and scheduling are cheap.
- Thread block size should be a multiple of the warp size (currently 32).
- Be aware of the overhead of invoking a kernel.
- Note that default instructions deviate from IEEE 754. Use the specific IEEE 754 functions for increased precision, at the cost of speed.

Besides the recommendations above, there were also methods with doubtful results:

- The Volkov suggestion yielded performance gains on some systems, but lower performance on others. It can be useful for low-occupancy kernels, but should be tested and evaluated.
- Data prefetching can both increase and lower performance.
It was pointed out that the underlying hardware and its capabilities played an important role in whether an optimisation technique affected performance. Some methods had a positive effect on some GPUs and a negative effect on others. Analysis and testing should therefore always be performed. The purpose described in the problem definition was achieved, and the learning goals were reached satisfactorily.
1 Introduction
This 30 ECTS thesis project was produced by Mikkel Bundgaard-Ovesen from 1 February 2011 to 1 August 2011 at ITU Copenhagen. The project builds on the results from the report Documentation of the GPU's usability in advanced parallel calculations [1], and has been supervised by Peter Sestoft.
1.1 Motivation
The speed of computers has increased over the years as a result of increased demand for processing power. The CPU has, from the beginning, been the preferred architecture for computing. But during the last decade an additional computing architecture has evolved, namely the graphics processing unit (GPU). A GPU, also called a massively parallel processor, offers tremendous performance in gigaflops at a relatively low cost. Different parallel computing architectures, such as Nvidia's Compute Unified Device Architecture (Cuda), Open Computing Language (OpenCL) and Microsoft's DirectCompute, have been developed to serve as platforms for general purpose programming on the GPU (GPGPU), to enable massively parallel programs.

Utilising the immense GPU power is not a trivial task. The execution model of the GPU is becoming more and more flexible, but being a SIMD model puts restrictions on utilisation. Data-parallel algorithms that have a simple execution path and high arithmetic intensity are usually well suited for processing by the GPU architecture. There are, however, indications that other algorithms, which do not share these characteristics, can in fact be optimised in such a way that they are accelerated by the GPU.

The huge performance offered at a relatively low cost makes it interesting to find out how this power can be harnessed. In this project, I will look into how the linear algebra operation matrix-multiplication and the decompositions LU and QR can be implemented and optimised on the Nvidia Cuda architecture.
1.2 Reading guide
The report is divided into 11 chapters. The first chapter describes the purpose and the goals of the project. The second chapter gives a short introduction to linear algebra and the history of GPU computing; readers can skip this chapter. The third chapter describes the parallel platforms Cuda and GPU.NET. Sections 3.1.4 and 3.1.5 are the most important. Chapter 4 focuses on the hardware platform and its influence on performance. The different development and test systems are described, and the importance of the chipset's North Bridge is explained. Section 4.2.2 holds a description of the CGMA term, and an analysis is performed. Chapter 5 describes the development environment together with some design decisions. The most important section is 5.3, which holds the optimisation strategy used later.

Chapter 6 analyses the matrix-multiplication algorithm, and describes its implementation and optimisation. The results of the improvements are found throughout the chapter, and an evaluation can be found in section 6.4. Chapter 7 deals with LU-decomposition. The chapter describes the algorithm, its implementation and optimisation, along with the test results. An evaluation can be found in section 7.4. Chapter 8 looks into QR-decomposition. The algorithm is described and analysed, after which it is implemented, improved and tested. Results are found throughout the chapter, and an evaluation can be found in section 8.5.

Chapter 9 summarises the results from all three algorithms and compares them with the initial optimisation strategy. This chapter is important and serves as an evaluation and conclusion. Chapter 10 looks at the work done so far and discusses possible extensions to the project. A broader perspective is also discussed, looking into Cuda, hardware and GPGPU in general. Chapter 11 is solely the conclusion to the problem definition. For an evaluation of the project's results, please refer to chapter 9.
Learning goals
- Knowledge of linear algebra and linear algebra algorithms
- Understanding of the Cuda architecture and platform
- Obtaining skills in C/C++ and Cuda C
- Ability to implement linear algebra algorithms using C/C++ and C for Cuda
1.4 Method
1. Study literature on linear algebra, C and C++ and Cuda architecture development.
2. Implement basic versions of linear algebra algorithms in C/C++ and C for Cuda using Visual Studio and Nvidia Nsight. Develop tests and benchmarks and compare results with comparable CPU implementations.
3. Implement optimisations for the algebra algorithms and compare results with CPU implementations.

As mentioned before, this thesis builds on the experiences and results of the project Documentation of the GPU's usability in advanced parallel calculations [1]. One of the goals of that project was to uncover how the GPU could be utilised from .NET. This is not a specific goal for this thesis; however, I regard it as an important perspective. During the thesis research period, I discovered GPU.NET by TidePowerd, a framework and tool whose main feature is to bridge Cuda and .NET. In this project, GPU.NET will be used, where it makes sense, to compare algorithm implementations and their performance with the pure Cuda implementations. It will be interesting to see how GPU.NET performs compared to pure Cuda C/C++ and, furthermore, whether GPU.NET is easier to use. The correctness of the algorithms in both GPU.NET and Cuda will be tested by comparing their results with results computed by the CPU.
1.5 Scope
Not all areas of this project can be analysed and documented. Prioritising is important, so that the parts that are covered are treated in adequate depth.
1.5.1 Algorithms
This project is an empirical study that should document implementation, optimisation and performance of existing linear algebra algorithms on the Cuda platform. It is not part of this project to develop new algorithms, but merely to base the testing on existing ones. The algorithms selected are designed for dense linear algebra, and are well known and well documented.
a correct result. All algorithms are implemented for both the GPU and the CPU, and tests are performed on both platforms to compare the results. The maximum difference between the results indicates how precise the GPU implementation is compared to the CPU implementation.
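The comparison described above can be sketched as a small C helper. This is a minimal sketch, not the test code used in the project: the name max_abs_diff and the use of single-precision arrays are assumptions made for illustration.

```c
#include <stddef.h>

// Hypothetical helper: largest absolute element-wise difference between
// a CPU result and a GPU result of the same length.
// Assumption: both results are stored as single-precision arrays.
double max_abs_diff(const float *cpu, const float *gpu, size_t count)
{
    double max_diff = 0.0;
    for (size_t i = 0; i < count; ++i) {
        // Widen to double so the subtraction itself adds no rounding error.
        double d = (double)cpu[i] - (double)gpu[i];
        if (d < 0.0) {
            d = -d;
        }
        if (d > max_diff) {
            max_diff = d;
        }
    }
    return max_diff;
}
```

A result of zero means the two implementations agree exactly; in practice a small non-zero value is expected, and it is the size of that value that is of interest.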
1.5.4 BLAS
Basic Linear Algebra Subprograms (BLAS) is an interface for linear algebra operations, and it offers optimised operations for vectors and matrices. Many linear algebra algorithms are designed on the basis of these operations, but the implementations in this project will not use any BLAS API, even though Cublas would be the obvious choice. To really uncover the architecture's capabilities it is necessary to experiment with it directly. For that reason I implement all algorithms without the use of such math libraries. This means that the full performance potential of the algorithms will not be achieved, but it will give better insight.
2 Background
This chapter serves as an introduction to the ideas that are used throughout the report. First, linear algebra is briefly described, followed by the concept of parallel computing in relation to the GPU.
Programmable vertex and pixel shaders were intended solely for graphics rendering; however, they were actually small programs that performed a programmed computation on some input and then returned the output. The computational power of the GPU, combined with the programmable vertex and pixel shaders, made developers look into how the GPU could solve problems other than graphics rendering.
3 Parallel platforms
This chapter takes a deeper look at Cuda and GPU.NET, and describes their usage, features and performance-limiting factors. GPU.NET uses Cuda, so most of the effort goes into describing and analysing Cuda.
3.1 Cuda
Cuda stands for Compute Unified Device Architecture and is a generic term covering the GPU architecture of Nvidia's graphics cards, the development platform and the tools. It can be described as a parallel computing architecture and development platform that enables the GPU to solve general purpose computational problems [4].
3.1.1 History
The first programmable GPU was released in 2001, and in the early stages the only way to access the GPU was through a graphics API, such as OpenGL or DirectX. This meant that general use of the GPU was difficult. Nvidia saw the potential of the GPU as another computing platform, and initiated the development of a completely new architecture. This architecture was to overcome the limitations of earlier GPUs by allowing General-Purpose computation on Graphics Processing Units (GPGPU) without the need to use a graphics API. In 2006 Nvidia released the GeForce 8800 GTX, the first GPU to support the Cuda architecture. Later, in June 2007, the first version of the Cuda development toolkit was released. Over the years the architecture and toolkit have undergone development and improvements, with the latest toolkit released in May 2011.
3.1.2 Version
The latest version of Cuda when this project was initiated was version 3.2, released in November 2010. Much has happened since, and the current toolkit version, as of May 2011, is version 4.0. Some of the new features include "Share GPUs across multiple threads", "Use all GPUs in the system concurrently from a single host thread" and "No-copy pinning of system memory, a faster alternative to cudaMallocHost()". Even though the last feature is interesting, none of these newly added features brings any major benefit to this project, so an upgrade during the project phase was deemed unnecessary. Hence, version 3.2 is used throughout this project and report.
    // Copy data from host -> device
    cudaMemcpy( d_base, base, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy( d_n, n, N * sizeof(int), cudaMemcpyHostToDevice);

    // Execute kernel
    power<<<blocks,THREADS_PER_BLOCK>>>(d_base, d_n, d_output, N);

    // Copy data from device -> host
    cudaMemcpy( out, d_output, N * sizeof(int), cudaMemcpyDeviceToHost);

    // Free memory on device
    cudaFree( d_base );
    cudaFree( d_n );
    cudaFree( d_output );

    // Let the Cuda runtime know that we are finished
    cudaThreadExit();
The host sends commands and messages to the device by invoking functions. A sequence diagram, based on the program illustrated above, is shown in Figure 1.
Lines 6 to 8 in the host code allocate memory on the device, lines 11 and 12 copy data to the device, and the kernel is invoked in line 15. When the kernel is done processing, the host copies the data back from the device memory in line 18, after which the memory is released again in lines 21 to 23.
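How a kernel launch such as the one in the host code covers N elements can be sketched sequentially in plain C. This is a sketch under assumptions, not the kernel used in the report: the kernel body is assumed to compute base[i] raised to n[i], the value 256 for THREADS_PER_BLOCK is assumed, and the names int_pow, blocks_for and power_emulated are hypothetical.

```c
#define THREADS_PER_BLOCK 256  // assumed value; the host code does not show it

// Assumed work of one thread: base raised to the (non-negative) power n.
static int int_pow(int base, int n)
{
    int result = 1;
    for (int k = 0; k < n; ++k) {
        result *= base;
    }
    return result;
}

// Number of blocks needed to cover count elements, rounded up.
int blocks_for(int count)
{
    return (count + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

// Sequential emulation of power<<<blocks, THREADS_PER_BLOCK>>>(base, n, out, count).
// On the GPU the two loops disappear: every (block, thread) pair runs as its own thread.
void power_emulated(const int *base, const int *n, int *out, int count)
{
    int blocks = blocks_for(count);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < THREADS_PER_BLOCK; ++thread) {
            int i = block * THREADS_PER_BLOCK + thread;  // global element index
            if (i < count) {                             // guard the last, partly filled block
                out[i] = int_pow(base[i], n[i]);
            }
        }
    }
}
```

The guard on the index is needed because the number of launched threads is rounded up to a whole number of blocks, so the last block may contain threads without an element to process. The same function can also serve as a CPU reference when checking the output copied back from the device.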
3.1.4 Architecture
Cuda is an architecture consisting of both the physical layout of the GPU and the logical structure of threads in the Cuda runtime. The exact physical layout and specifications differ between chip versions (for instance, G80 allows 768 threads per multiprocessor, while GT200 allows 1024), and the capabilities of these chips are defined by the Compute Capability version (CC). The first chips were released with CC 1.0, and the latest version is 2.1.

Physical layout
The GPU shown in Figure 2 is a simplified G86 GPU with CC version 1.1. It consists of four streaming multiprocessors (SM), and each SM contains 8 streaming processors (SP) and two special function units (SFU). In an SM, the SPs process normal instructions like add and multiply, and in this case 8 SPs are able to process 8 normal instructions per clock cycle. The SFUs process instructions related to square root, sine and cosine, logarithms and exponentials, so a kernel with heavy usage of these instructions will be limited to only two instructions per clock cycle.

An SM has, besides the SPs and SFUs, access to different memory types. The registers and the shared memory are limited in size, but very fast, and they are local to each SM. The registers are the fastest memory type and are local to a thread. The number of 32-bit registers of an SM with CC 1.1 is 8192, or 8K. The shared memory is a bit slower, but there is more of it and it is shared between the threads in a block. The shared memory size of an SM with CC 1.1 is 16KB. All SMs of a GPU have read access to the constant memory, which for all current CC versions is 64KB. Access to the constant memory is cached and generally faster than access to the global memory, which is the device's main memory. All SMs have shared access to the global memory, and its exact size and speed are device dependent.
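The throughput figures from the physical layout description can be turned into a rough estimate of how many clock cycles one SM needs to issue the instructions of a group of threads. The sketch below is a simplification under stated assumptions: it uses the 8 SPs and 2 SFUs of the CC 1.1 example, assumes that normal and special instructions do not overlap, and ignores memory latency. The name issue_cycles is hypothetical.

```c
#define SP_PER_SM 8    // streaming processors per SM (CC 1.1 example in the text)
#define SFU_PER_SM 2   // special function units per SM (CC 1.1 example in the text)

// Rough estimate of the clock cycles one SM needs to issue all instructions
// of `threads` threads, each executing `normal_per_thread` normal instructions
// and `special_per_thread` special-function instructions.
// Assumptions: no overlap between SP and SFU work, no memory latency.
long issue_cycles(long threads, long normal_per_thread, long special_per_thread)
{
    long normal = (threads * normal_per_thread + SP_PER_SM - 1) / SP_PER_SM;
    long special = (threads * special_per_thread + SFU_PER_SM - 1) / SFU_PER_SM;
    return normal + special;
}
```

Under these assumptions, one normal instruction for 32 threads takes 4 cycles, while one special-function instruction for the same 32 threads takes 16 cycles, which illustrates why a kernel dominated by such instructions is limited by the SFUs.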
Memory
As described above, there are different memory types, but the common denominator is that they are all typically based on dynamic random access memory (DRAM). Accessing a single bit in a DRAM cell is a slow process, and to improve performance DRAM controllers read several consecutive bits in parallel [4]. This means that truly random access to DRAM memory will yield low performance. So to achieve the highest memory performance possible, the kernel should access consecutive memory locations as much as possible. This is also called coalesced memory access.

Accessing memory in a coalesced manner is important for all memory types. This also holds for shared memory, even though it is on-chip and fast. In addition, access to shared memory should minimise bank conflicts. Shared memory on CC version 1.1 has 16 banks, and the bandwidth of each bank is 32 bits per two clock cycles. If two or more threads access the same bank, the accesses will be handled sequentially and hence impact performance. From CC version 2.0 and higher, simultaneous access to the same bank has been optimised. Multiple reads from a single bank only result in a single read instruction being performed, after which the value is broadcast to all threads.

Cuda threads
Cuda threads are very different from CPU threads; the only similarity is the fact that they process data in parallel. The GPU can be classified as SIMD, which makes its applicability differ from that of a CPU. A task, or a set of instructions, can be performed on different data in parallel, but the SIMD model means that two independent tasks cannot be performed in parallel by the GPU. Furthermore, threads in Cuda are very lightweight compared to threads on the CPU. A typical Cuda program uses several thousand Cuda threads, and thread generation and scheduling should therefore not be considered a limitation.

Cuda threads are organised into blocks, and the threads in a block can share memory and be synchronised. The blocks are in turn organised into a grid. This logical thread structure allows threads to be organised in several dimensions, so that the structure of the threads can correlate directly with specific data structures; a matrix, for instance, is defined in two dimensions. For more details about how threads are organised and how this affects usage in a kernel, please refer to appendix B.

Thread scheduling
The presumption is that the threads of a block are grouped and processed in parallel. This is conceptually true, but it is not what actually happens. The current implementation of thread scheduling in e.g. G80 and GT200 chips schedules threads in units called warps. A warp is a bundle of 32 threads being executed in parallel, and a block with, for instance, 128 threads is partitioned into 4 warps. The threads of a warp execute the same instruction at any given time, hence Cuda is a SIMD architecture. This is a design decision to reduce hardware cost and to enable optimisation techniques, and it is not without relevance to the developer. The size of a warp has a direct impact on the recommended size of blocks.
Consider the example where a problem is organised into 20 blocks each with 10 threads, giving a total of 20 x 10 = 200 threads. Cuda executes 32 threads in a warp in parallel. In the example above, only 10 threads are available per block. Cuda will in this case fill up the warp with 22 empty threads, resulting in 20 x 22 = 440 empty threads being created. It is advisable to set the block size to a multiple of the warp size, currently 32 [4].
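The arithmetic of the example above can be written as a small C helper. This is only a sketch; the names warps_per_block and padding_threads are hypothetical, and the warp size of 32 is taken from the text.

```c
#define WARP_SIZE 32  // current warp size, as stated in the text

// Number of warps needed to hold one block, rounded up.
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

// Total number of empty threads created when each block is padded
// up to a whole number of warps.
int padding_threads(int blocks, int threads_per_block)
{
    int padded = warps_per_block(threads_per_block) * WARP_SIZE;
    return blocks * (padded - threads_per_block);
}
```

For 20 blocks of 10 threads, padding_threads returns 440, matching the example. For a block size that is a multiple of 32, such as 128, it returns 0.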
Occupancy and latency
An SM with a CC of version 1.1 is able to handle 768 resident threads, and as the current warp size is 32, the maximum number of resident warps per SM is 24. The actual number depends on the kernel's consumption of registers and shared memory. The Cuda occupancy is, for a given kernel, the ratio of active warps to the maximum number of warps supported by the SM. In other words, the occupancy indicates how many active warps and threads an SM can hold. The number of clock cycles it takes for a warp to be ready to execute its next instruction is called latency. Some instructions incur latency, for instance global memory access, which incurs high latency before the data is supplied. The execution of a warp does not halt due to a memory access; execution continues until the data is actually needed, and if the data is still unavailable the scheduler switches to another warp. Whenever a warp incurs latency, the SM should switch to another warp and start processing it to achieve full utilisation of the SM. The Cuda occupancy ratio can therefore indicate whether the performance of a kernel suffers from high latency. Vasily Volkov [6][7] has, however, shown that high performance does not necessarily require a high occupancy, so improvements based solely on the occupancy ratio should be carefully evaluated. Optimisation that aims at full utilisation of the SM is called latency hiding.
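The occupancy ratio described above is simple arithmetic, sketched below in plain C. The function names are hypothetical, and the number of active warps has to come from elsewhere (for example the Cuda occupancy calculator), since it depends on the register and shared memory usage of the kernel.

```c
#include <assert.h>

/* Maximum number of warps that can reside on one SM. */
int max_resident_warps(int max_resident_threads, int warp_size) {
    assert(max_resident_threads >= 0 && warp_size > 0);
    return max_resident_threads / warp_size;
}

/* Occupancy: ratio of active warps to the maximum supported by the SM. */
double occupancy(int active_warps, int max_warps) {
    assert(max_warps > 0 && active_warps >= 0 && active_warps <= max_warps);
    return (double)active_warps / (double)max_warps;
}
```

For a CC 1.1 device, `max_resident_warps(768, 32)` gives the 24 warps mentioned above; a kernel that, as an assumed example, keeps 12 warps active would then have an occupancy of 0.5.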
3.1.5 Limitations
Using Cuda can be advantageous, but there are also limitations that influence the implementation and optimisation of algorithms. In the following I describe a couple that are relevant to this project. One should be aware that the Cuda architecture was developed for speed at the expense of precision. For that reason, an algorithm implemented on the GPU can be numerically less stable than the same algorithm implemented on the CPU. For example, the operations multiply and add can be contracted to a single FMAD (multiply-add) instruction, which specifically deviates from the IEEE 754 standard. FMAD instructions are, for instance, often used in linear algebra algorithms to calculate dot-products, vector norms and more. Nvidia has been focusing on this, and later CC versions should comply better with the IEEE 754 standard. Latency, warps and memory access, described in the architecture section, are all factors influencing the computational performance. One should therefore not expect to get close to the theoretical performance of a device, as these factors will limit the performance of an algorithm. The theoretical properties can, on the other hand, be of assistance in the analysis and optimisation of a kernel.
Cuda C is an extended version of ANSI C, and is the language in which the device code, or kernels, are written. A kernel function is able to call other device functions, but recursion is currently not supported. Cuda is developed by Nvidia, and can only be used on Nvidia GPUs.
3.2 GPU.NET
The framework GPU.NET consists of a runtime and a compiler, which are integrated with Visual Studio. This framework makes it possible to develop host and kernel code directly in .NET with all the benefits that .NET and Visual Studio provide. The version being used for this project is GPU.NET v1.0.3.5.
3.2.1 Overview
GPU.NET currently only supports Cuda, but is expected to support other parallel architectures in the future. GPU.NET allows a developer to write host and device code directly in .NET using the API from the provided assembly, and thereby to perform computations on hardware-accelerated architectures. Accelerating .NET code is achieved in two steps: first the .NET code is written, decorated and compiled, then the GPU.NET runtime accelerates the program during execution, as shown in Figure 3.
3.2.2 Development
Visual Studio 2010 is supported for development as well as .NET 4. The kernel is annotated with a KernelAttribute that also holds the name of the CPU method to be used if no acceleration hardware is present. ThreadIndex and BlockIndex hold the same values as when used in Cuda directly.
[Kernel(CustomFallbackMethod = "MatrixMultiplication_CPU")]
private static void MatrixMultiplicationSimpleNS_GPU(float[] a, float[] b, float[] c, int aheight, int awidth, int bwidth)
{
    // Thread ID
    int tid = ThreadIndex.X + BlockIndex.X * BlockDimension.X;

    if (tid < aheight)
    {
        for (int j = 0; j < bwidth; ++j)
        {
            float sum = 0;

            for (int k = 0; k < awidth; ++k)
            {
                float av = a[tid * awidth + k];
                float bv = b[k * bwidth + j];
                sum += av * bv;
            }
            c[tid * bwidth + j] = sum;
        }
    }
}
The kernel returns void and is private, and can therefore not be called directly. Another public and static method is created, which allocates the resulting array and calls the kernel:

public static float[] MatrixMultiplicationSimpleNS(float[] a, float[] b, int aheight, int awidth, int bwidth)
{
    var c = new float[aheight * bwidth];

    MatrixMultiplicationSimpleNS_GPU(a, b, c, aheight, awidth, bwidth);

    return c;
}
The .NET code is compiled to a normal assembly, in which the GPU.NET compiler then injects calls to the GPU.NET runtime. The result is a modified .NET assembly where calls to any kernel method are being redirected to the GPU.NET runtime, and hence the GPU.
3.2.3 Execution
When the program is being executed, the GPU.NET runtime detects the availability of any supported hardware. When a call to a kernel is detected, the kernel code is passed to the correct vendor plug-in, which in turn JIT compiles the code to the hardware vendor's instruction set architecture. Lastly, the runtime executes the compiled device code and transfers any data back to the .NET runtime. If no hardware acceleration is present, the CPU version of the kernel is called.
3.2.5 Evaluation
GPU.NET can definitely be used for testing and experimenting with GPU acceleration of programs. One should, however, expect bugs and minor problems with the current version; the framework is far from mature at this point. Furthermore, by using JIT compilation GPU.NET incurs a performance hit compared to Cuda, as Cuda kernels are already compiled when the program runs. This is a design decision made by TidePowerd, a decision that makes the framework flexible at the expense of performance. GPU.NET does, however, cache the JIT compiled kernel in memory, so subsequent calls to the same kernel will not incur the same performance hit.
4 Hardware platform
The parallel architecture software has been described above, but there are also hardware specifications that are important to the performance of Cuda. This chapter analyses the hardware specifications and performs two simple tests. Cuda requires Nvidia GPUs, so all development and test computers must be equipped with an Nvidia GPU. I used two development and two test computers for this project. All machines are running Windows 7 and have Cuda v3.2 installed, including the matching drivers. The following table shows selected hardware specifications for the platforms.
Table 1 Selected hardware specifications for the platforms

System                                     #1                      #2                      #3                      #4
Usage                                      Development             Development             Testing                 Testing
Graphics bus                               8.00 GB/s (PCI-E v2.0)  4.00 GB/s (PCI-E v1.1)  8.00 GB/s (PCI-E v2.0)  4.00 GB/s (PCI-E v1.1)
Host memory                                16.60 GB/s (DDR3)       6.23 GB/s (DDR2)        16.60 GB/s (DDR3)       6.25 GB/s (DDR2)
GPU                                        GeForce 9400m (G86)     GeForce 8800 GS (G92)   Tesla C1060 (GT200)     GeForce GT440 (GF108)
Cores                                      16                      96                      240                     96
Shader clock                               1100 MHz                1250 MHz                1300 MHz                1645 MHz
Device memory                              8.00 GB/s (DDR3) ²      37.45 GB/s (GDDR3)      102.40 GB/s (GDDR3)     51.20 GB/s (GDDR5)
Processing power, MUL+ADD+SF (gigaflops)   51.56                   351.56                  914.06                  462.66
Processing power, MUL+ADD (gigaflops)      34.38                   234.38                  609.38                  308.44
Compute Capability                         1.1                     1.1                     1.3                     2.1

² The actual memory speed is 16.60 GB/s, but system #1 has no dedicated device memory and uses host memory, so the device is limited by the speed of the graphics bus.
Two processing powers are stated for the GPU of each system because the theoretical peak performance in gigaflops can be estimated in two ways. The first is based on the GPU architecture design, which says that a GPU is capable of performing a multiply-add instruction dual-issued with a special function instruction per operation cycle. The second is a more realistic estimate, in which an operation cycle performs a multiply-add instruction only. For a detailed description of the different platforms, please refer to appendix C.
4.1 Analysis
This section digs a little deeper into the important hardware specifications, and describes the theoretical performance limits. When dealing with GPUs the important factors are memory transfer rates and GPU processing power. Figure 4 shows a simplified chipset diagram, which highlights the important elements, namely the processor, DDR ram, GPU and the chipset's north bridge.
In Table 1, the graphics bus indicates the maximum transfer rate between the north bridge and the GPU device. Host memory is the peak bandwidth between the DDR ram and the north bridge. A chain is not stronger than its weakest link, and the same holds for data transfer between the host and device, and vice versa. Consider platform #1: the bandwidth of the host memory is 16.60 GB/s, but the graphics bus is limited to 8.00 GB/s, which is then the theoretical peak transfer rate between the GPU device and the host system.
4.2 Benchmarking
The specifications are theoretical, and to give a more realistic performance target I have tested actual data transfer and processing power.
Table 2 Measured memory transfer rates (P+WC denotes pinned and write-combined memory)

System  Host -> Device  Host -> Device (P+WC)  Device -> Host  Device -> Host (P+WC)  Device -> Device  Device -> Device (P+WC)
#1      1,584.5 MB/s    5,224.9 MB/s           1,365.9 MB/s    5,096.2 MB/s           6,935.4 MB/s ³    6,951.3 MB/s ³
#2      1,434.7 MB/s    2,513.1 MB/s           1,178.0 MB/s    1,687.9 MB/s           28,525.7 MB/s     28,529.0 MB/s
#3      4,233.7 MB/s    5,761.9 MB/s           3,864.3 MB/s    5,297.6 MB/s           73,463.8 MB/s     73,527.3 MB/s
#4      1,578.6 MB/s    2,509.8 MB/s           1,235.1 MB/s    1,857.9 MB/s           21,338.1 MB/s     21,339.2 MB/s

³ System #1 does not have any dedicated device memory, so the rates from host to device are actually the relevant ones when a kernel needs to access device memory.
The measured transfer rates between host and device are, as expected, faster when using pinned and write-combined memory. For paged memory transfers the measured bandwidth is between 17% and 40% of the theoretical bandwidth; this span increases to between 47% and 72% when pinned and write-combined memory is used. The results also show that, in general, device to host transfers are slower than their host to device counterparts.

The memory speed for copying data from host to device and vice versa is mostly important for hybrid algorithms, meaning algorithms that solve a problem by using both the CPU and the GPU. An implementation that requires transfers between the host and the device should be designed with caution, as these put a restriction on performance. Based on this, the strategy for the algorithm implementations of this project is to keep the data processing solely on the GPU, and to limit the number of data transfers between host and device. Consider the worst case memory transfer scenario. System #3 has 4GB of device memory, and if this GPU was installed in the slowest system, it would be possible to copy all 4GB in about 3.57 seconds.

Global memory access is limited by device memory bandwidth, so the device to device memory transfer rate is interesting and relevant to the performance of a kernel. The results are between 41% and 75% of the theoretical limit. Paged or pinned/write-combined memory does not have any impact here, as this is an operating system resource and hence only relevant for the host memory.

The result of the device to device memory transfer for system #1 is a bit misleading. System #1 does not have any dedicated device memory, and a device to device transfer rate then only indicates the peak performance of the DDR3 ram on the host system. Before the data could actually be processed by the GPU, it would have to pass the north bridge and graphics bus, which is the same path as for the host to device memory transfers.
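Table 3 Measured performance of the two test kernels in gigaflops

System  First kernel  Second kernel
#1      2.19          15.41
#2      9.70          60.09
#3      27.87         62.29
#4      8.18          29.08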
In reality, these kernels do not say anything about the maximum expected performance. An implementation of an algorithm can be optimised in several ways, and so could these kernels. However, they do suggest that global memory access is indeed a limiting factor. This factor is expressed by the Compute to Global Memory Access (CGMA) ratio. The first kernel has one memory load from the input and one write to the output. The number of operations is three (multiply, add and power). Based on these numbers, the calculated CGMA is 1.5. Consider system #3, where the measured device memory peak performance is 73,463.8 MB/s. The kernel uses single-precision floating-point values of 4 bytes each, so the system is able to transfer about 18,365.95 million single-precision values per second. The CGMA is 1.5; hence the peak performance of this kernel is about 27 gigaflops, which the result from Table 3 also shows. The memory transfer rates, together with an estimated CGMA, can be an important tool when analysing a kernel for optimisation.

With these results in mind, how do they compare to the performance of a CPU? Consider that a Pentium 4 3.06 GHz CPU computes a single-precision dot-product at between 1.8 gigaflops (single thread) and 3.08 gigaflops (multiple threads) [8]. In that light, even though the results from Table 3 are far lower than the theoretical processing power, it is evident that even un-optimised kernels can have a similar or even higher peak performance when compared to the CPU.
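The CGMA estimate for the first kernel can be written out as a small calculation. The following is a sketch in plain C with hypothetical function names; it assumes that a kernel is purely memory-bound and that 1,000 megaflops correspond to one gigaflop.

```c
#include <assert.h>

/* CGMA: floating-point operations per global memory access. */
double cgma_ratio(int operations, int global_accesses) {
    assert(operations >= 0 && global_accesses > 0);
    return (double)operations / (double)global_accesses;
}

/* Upper bound in gigaflops for a memory-bound kernel.
   bandwidth_mb_s:  device memory bandwidth in MB/s
   bytes_per_value: size of one value, 4 for single precision
   cgma:            operations per global memory access */
double memory_bound_gigaflops(double bandwidth_mb_s, double bytes_per_value,
                              double cgma) {
    assert(bandwidth_mb_s >= 0.0 && bytes_per_value > 0.0 && cgma >= 0.0);
    double mega_values_per_s = bandwidth_mb_s / bytes_per_value;
    return mega_values_per_s * cgma / 1000.0;
}
```

Using the measured 73,463.8 MB/s of system #3, 4 bytes per value and `cgma_ratio(3, 2)`, the bound comes out at roughly 27.5 gigaflops, in line with the measured 27.87 gigaflops.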
5 Implementation
In the following chapter I describe the development environment and some design decisions, but more importantly, I form an optimisation strategy used for the algorithms.
As mentioned earlier, support for double-precision operations is not a common denominator for the development and test machines. Algorithms are therefore implemented using single-precision floating-point values, which are supported by all GPU devices, the CPU and GPU.NET.
5.3 Optimisation
The aim is to implement and optimise three linear algebra algorithms for the Cuda architecture. The method for doing so is composed of the following steps:
1. Use an existing, well-known and well-documented algorithm for implementation in C/C++ for CPU processing.
2. Analyse and port the CPU implementation to Cuda C, while making simple improvements that exploit the parallel architecture.
3. Test the implementations.
4. Optimise based on the test results, and test again.
5.3.1 Strategy
The Nvidia paper on Analysis Driven Optimization [10] identifies four categories of what can limit a kernel's performance: memory throughput, instruction throughput, latency, or a combination of the above. There are some methods that can be helpful in finding the limiting factors of a Cuda program. To determine whether memory throughput is a limiting factor, the CGMA of a kernel can help determine the theoretical maximum performance of the kernel. When it comes to instructions, the Nvidia profiler can give valuable information about undesirable code. Please refer to appendix E for a description of the Cuda profiler and CGMA. There exist different optimisation techniques and methods, and some have already been described in chapters 3.1.4 and 3.1.5. In the following I will describe the methods that form the optimisation strategy.

Algorithms that process data rely on memory to perform well. Coalescing memory access is important for all memory types, and in addition to this, shared memory access should avoid bank conflicts as much as possible. A loop structure in a kernel adds extra control flow instructions, which consume arithmetic resources. The organisation of threads in several dimensions can enable the unrolling of a loop by increasing thread granularity. The compiler already unrolls small loop structures, but doing it manually can help make a kernel run faster.
Whether the block size should be high or low depends on the kernel, but it should, where possible, be a multiple of the warp size (currently 32) to avoid empty threads. Hiding the latency of slow instructions can be achieved by reorganising the kernel, exploiting data prefetching or making sure that the Cuda occupancy of the kernel is high. Notice that a high occupancy is not equal to high performance [6][7]. Vasily Volkov has shown that a kernel can increase performance by computing several results per thread instead of a single result. It has also been shown that, using this method, high performance can be achieved with a low occupancy.

Another technique for increasing performance focuses on using fast memory, such as registers or shared memory. Updating an implementation so that it divides data into smaller pieces, called tiles, that fit into caches or shared memory can be very effective. The cost of copying data is amortised, and the kernel processes the cached data.

Some algorithms are not designed for parallel processing, and the performance they can deliver on a parallel architecture is not very high. Instead, for some algorithms, a block version has been designed. A block algorithm usually has three advantages over a normal algorithm. Firstly, it is able to solve much larger problems, by dividing the problem into smaller pieces and solving them independently. Secondly, dividing a problem into smaller sizes is the core of the tiling implementation strategy, so using a block algorithm can automatically enable tiling. Lastly, block algorithms sometimes rely on other linear algebra operations that are highly parallel, for instance matrix-multiplication. So to clarify, tiling refers to a specific implementation that exploits a faster memory type, whereas block refers to the algorithm. For some algorithms block and tiling are almost the same (e.g. matrix-multiplication), for others they are not.
6 Matrix-multiplication
A matrix is essentially a rectangular array of numbers, and is often denoted with a capital letter. Here the matrix A with two rows and three columns is shown.

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \end{pmatrix}$$

The numbers or values in a matrix are called elements, and are by convention denoted $a_{r,c}$, where r is the row index and c is the column index. The row index indicates the row in which the element lies, and the column index indicates the column in which the element lies. Matrix-multiplication, also called matrix product, is a linear algebra matrix operation consisting of the operations multiplication and addition. Elements in the respective matrices are aligned, multiplied and added, and the grand sum is then placed into the resulting matrix. Matrix-multiplication of two matrices is only possible if their dimensions conform for multiplication, meaning that the number of columns of the first matrix must be equal to the number of rows of the second. The resulting matrix will be an $n \times p$ matrix, where $n$ is the number of rows of the first matrix and $p$ is the number of columns of the second.
Matrix-multiplication is, in general, not commutative, except for special cases. These cases are however not described in any further detail, as they are outside the scope of this report.

The naive process of matrix-multiplication is rather simple. The data size of two square matrices is $2n^2$ and the running time is $O(n^3)$, where n is the width and height of the matrices. This shows that the running time increases faster than the data size. The simple or naive implementation will be discussed and shown later, but there exist other algorithms which are more efficient, for instance the Strassen or Coppersmith-Winograd algorithms [11]. However, these algorithms add complexity to the implementation, and require extra attention to handling numerical stability issues. The focus of the matrix-multiplication in this project is on optimisations for the GPU platform, and not on the algorithm itself. For that reason, only the simple implementation will serve as a base for analysis, implementation and testing, and not the other algorithms mentioned. Optimisations applied to the simple algorithm will focus on capabilities and properties of the GPU platform, and the essence of the original matrix-multiplication algorithm will be kept. Furthermore, Cuda is the focus of this project, meaning the implementation and optimisations will be focused on Cuda and the C/C++ implementations. GPU.NET will be used in the result section as a perspective and for comparison.

In the following, the matrix named A will always reference the first matrix of the matrix-multiplication process. The second matrix will be named B and the resulting matrix C, like so: $C = AB$
6.1 Analysis
Parallel processing on a GPU platform is stream based, and supports the parallelisation of data very well. The simple nature of the matrix-multiplication algorithm makes the implementation for processing on a GPU platform straightforward. The fact that the algorithm has a running time of $O(n^3)$ makes optimisations and performance gains easier to test and time on different platforms, and with different data sizes.
for (int i = 0; i < A.rowCount; ++i) {
    for (int j = 0; j < B.columnCount; ++j) {
        double sum = 0;
        for (int k = 0; k < A.columnCount; ++k) {
            double a = A[i][k];
            double b = B[k][j];
            sum += a * b;
        }
        C[i][j] = (float)sum;
    }
}
This algorithm consists of three loops and the inner loop has an addition and a multiplication operation. This also shows that the running time is $O(nmp)$, where $n$ is the number of rows in A, $m$ the number of columns of A and rows in B, and lastly $p$ is the number of columns in B. For square matrices where $n = m = p$ the running time is $O(n^3)$.
The inner loop computes the dot-product of the vectors of A and B. The two outer loops are responsible for iterating through the rows of A and columns of B, and their mutual order does not influence the running time.
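As a runnable counterpart to the three-loop algorithm, the following plain C sketch stores the matrices as flat, row-major arrays. The function name `matmul_naive` and the parameter names n, m and p are assumptions made for this illustration; like the algorithm above, it accumulates in double precision and stores the result as a single-precision value.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B, where A is n x m, B is m x p and C is n x p.
   All matrices are row-major: element (r, c) of A is a[r * m + c]. */
void matmul_naive(const float *a, const float *b, float *c,
                  size_t n, size_t m, size_t p) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {           /* rows of A */
        for (size_t j = 0; j < p; ++j) {       /* columns of B */
            double sum = 0.0;
            for (size_t k = 0; k < m; ++k) {   /* dot-product */
                sum += (double)a[i * m + k] * (double)b[k * p + j];
            }
            c[i * p + j] = (float)sum;
        }
    }
}
```

Swapping the two outer loops visits the elements of C in a different order, but performs the same n * m * p multiply-add steps.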
6.1.2 Parallelism
The simple matrix-multiplication algorithm consists of three loops. One adjustment to induce concurrency is to perform the outer loop in parallel; another is to calculate each value of the resulting matrix in parallel. In any case, there is not just one single solution to making matrix-multiplication work in parallel, but multiple. These different approaches will be discussed in the following.

The outer loop
A simple approach to make the outer loop work in parallel is to make each thread handle one row in A. This is possible as there are no synchronisation issues to handle, but it means that the total number of required threads will be equal to the number of rows in A. This adjustment is doable; however, any performance gains or losses are dependent on the data size. Consider the case where A is a column matrix and B a row matrix; this would mean many threads that each do little work.

Resulting matrix values
Calculating the different values of the resulting matrix is another way of making the algorithm work in parallel. By using this method the required number of threads will be equal to the size of the resulting matrix C, which is: $n \cdot p$
This shows that this approach also is prone to the problems where A is a column matrix and B a row matrix. Again, many threads do little work. For simplicity, I will not handle this case specifically, but will later present an algorithm that performs better with different data sizes.
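The thread counts of the two approaches, and the number of blocks needed to cover them, can be sketched in plain C as follows. The function names are hypothetical, and the block dimension is an assumed parameter that would normally be chosen per kernel.

```c
#include <assert.h>

/* Outer-loop approach: one thread per row of A (A has n rows). */
long threads_outer_loop(long n) {
    assert(n >= 0);
    return n;
}

/* Resulting-matrix approach: one thread per element of C (n x p). */
long threads_result_matrix(long n, long p) {
    assert(n >= 0 && p >= 0);
    return n * p;
}

/* Blocks needed along one grid axis to cover `extent` elements with
   `block_dim` threads per block; rounded up, so the last block may
   contain threads that fall outside the matrix. */
long blocks_along_axis(long extent, long block_dim) {
    assert(extent >= 0 && block_dim > 0);
    return (extent + block_dim - 1) / block_dim;
}
```

For an assumed 1000 x 1000 result matrix and 16 threads per block axis, `blocks_along_axis(1000, 16)` gives 63 blocks per axis. The rounding up is the reason a kernel needs a boundary check before writing to C.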
19. 20. }
} }
What is important to note is that the kernel has a loop in a loop, making the running time of a single thread $O(nm)$, where n is the number of columns in B and m the number of columns in A.
The kernel that calculates the values of the resulting matrix does not have this double loop.
__global__ void matrixMultiplicationRM(matrix *a, matrix *b, matrix *c) {

    // Matrix C coordinates
    int c_column = blockIdx.x * blockDim.x + threadIdx.x;
    int c_row = blockIdx.y * blockDim.y + threadIdx.y;
    double sum, av, bv;

    // Make sure not to exceed C boundaries
    if (c_row < c->height && c_column < c->width) {

        sum = 0;

        for (int i = 0; i < a->width; i++) {

            av = a->n[c_row * a->width + i];
            bv = b->n[i * b->width + c_column];
            sum += av * bv;
        }

        c->n[c_row * b->width + c_column] = (float)sum;
    }
}
The two kernels are different in the sense that each thread of the first does more work than each thread of the last. In the last kernel, a loop structure was unrolled. Firstly, this makes the threads more fine-grained, which gives a higher parallel potential. Secondly, the control flow instructions from the loop are not performed, releasing more resources for the kernel.
Figure 6 The output of the console testing program

Matrix-multiplication is tested with fixed dimensions for the matrices A and B, and hence for the resulting matrix C.

Outer loop
The Cuda occupancy calculator showed that the simple outer loop implementation would have a multiprocessor occupancy of 83%. The outer loop implementation was tested with the matrix structure as a parameter for the kernel. The kernel running time is, for Cuda, the direct GPU calculation time; hence it excludes the time to perform data transfer. The Cuda kernel running time does not say anything about whether it is feasible to perform matrix-multiplication on the GPU compared to the CPU, but only whether the GPU calculates the resulting matrix faster than the CPU. By measuring just the calculation time, it is easy to directly compare the computation time on the GPU with that of the CPU.

Platform  Operations/ms  Gigaflops  CPU time   GPU time
Cuda      366,609        0.37       250.90 ms  349.15 ms
Some interesting results have emerged from this initial test. First of all, there is a difference between the result calculated on the GPU and that calculated on the CPU. The maximum difference in the resulting values is 0.010742, measured on platform #1. The GPU architecture was initially designed for increased speed at the cost of precision, which partly explains the difference in the resulting values. Newer architectures implement an instruction set with increased precision; this kernel has been tested on an architecture with compute capability v2.0, where the difference between GPU and CPU results was 0.0. Another surprise from the test is that the GPU calculation has a peak performance of just 0.3666 gigaflops, and takes longer than on the CPU. What causes such bad performance, one might ask?

Outer loop without structure
Global reads are expensive, and coalesced memory reads should be achieved to optimise performance. Structures can, if not aligned, produce non-coalesced memory access. It would therefore be interesting to test whether using structures as parameters for the kernel has any impact on performance. So, minor adjustments were made to the code to eliminate structures as parameters, and the updated kernel function definition now looked like this:
__global__ void
matrixMultiplicationSimpleNS(float *a, float *b, float *c, int aheight, int awidth, int bwidth)
The adjustment meant that the occupancy of the multiprocessor rose from 83% to 100%, an increase which suggested that better performance could be expected. But the kernel calculation running time was measured to 357.88 ms, a running time that is approximately the same as when using structure parameters. Even though a higher occupancy suggested increased performance, no performance gain was achieved. This result agrees with Vasily Volkov's work on better performance at lower occupancy [6]. So the first optimisation will look into reorganising the threads, to test whether Cuda performs better when more threads each perform less work than when few threads each perform more. But before doing so, testing and comparing performance with GPU.NET would indicate whether the GPU.NET API performs on the same level as using Cuda directly.

GPU.NET
The GPU.NET platform has different limitations; one is that kernel methods only support primitive types as parameters. For that reason, testing matrix-multiplication with the Matrix structure is not possible. So the outer loop matrix-multiplication method was implemented without structures, and can therefore be directly compared to the similar Cuda implementation. It is furthermore not possible, on the GPU.NET platform, to measure solely the direct calculation time; the kernel running time includes data transfer and JIT compilation.
Platform  Operations/ms  Gigaflops  CPU time
Cuda      357,661        0.36       381.23 ms
GPU.NET*  312,958        0.31       387.00 ms

* Inclusive data transfer and JIT compilation

Taking data transfer and JIT compilation into account, the performance of GPU.NET and Cuda are almost identical. It will be interesting to see whether this is also the case when different optimisation techniques and features are exploited.
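The two performance columns of the result tables are related by a simple unit conversion, sketched below in plain C. The function names are hypothetical; the only assumption is that one gigaflop is 10^9 floating-point operations per second.

```c
#include <assert.h>

/* Operations per millisecond from a total operation count and a
   measured running time in milliseconds. */
double operations_per_ms(double total_operations, double elapsed_ms) {
    assert(total_operations >= 0.0 && elapsed_ms > 0.0);
    return total_operations / elapsed_ms;
}

/* Gigaflops from operations per millisecond:
   ops/ms * 1000 ms/s / 1e9 ops per gigaflop. */
double gigaflops_from_ops_per_ms(double rate) {
    assert(rate >= 0.0);
    return rate * 1000.0 / 1.0e9;
}
```

For example, the 357,661 operations/ms of the Cuda kernel convert to about 0.36 gigaflops and the 312,958 operations/ms of GPU.NET to about 0.31 gigaflops, as listed in the table.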
6.3 Optimisation
The Cuda architecture has different characteristics and capabilities, and to optimise performance different features and techniques can be utilised. First the unrolling of a loop will be tried, after which the tiling and other methods from the strategy will be applied.
Test and result
A similar modification was made to the kernel in GPU.NET and the running times are shown in the following table.

Platform           Operations/ms  Gigaflops  CPU time   GPU time
Cuda (outer loop)  366,609        0.37       250.90 ms  349.15 ms
Cuda               2,413,297      2.41       253.14 ms
GPU.NET*           895,104        0.90       372.00 ms

* Inclusive data transfer and JIT compilation

The running time of both the GPU.NET and the Cuda implementation has decreased. GPU.NET performs about 2.86 times better than the GPU.NET outer loop implementation; however, the performance increase is disappointing when compared to the performance gain achieved when purely using Cuda. Looking solely at the Cuda implementation, the performance is almost 6.6 times better than with the outer loop approach. So even though the GPU.NET performance, compared to Cuda, is disappointing, the performance gains are significant, and indeed indicate that more threads doing less, by unrolling a loop, is a reasonable approach.

When programming for parallel execution on the CPU platform, it is important to use the correct number of threads to solve the problem optimally, and as spawning a thread is expensive, not too many threads should be used. The results from these tests show that this rule of thumb does not apply to the GPU platform. The overhead of creating a thread in Cuda is far less than that of creating a thread on the CPU.

Another factor is the number of global memory reads. Reading from global memory is expensive and very slow [5]. Looking at the code of the kernel, the inner loop makes two global memory reads and performs one multiplication and one addition. This equals a CGMA ratio of approximately 1.0. On platform #1 the global memory has a peak bandwidth of 16.6 GB/sec. With 4 bytes in each single-precision floating-point value, the expected number of giga single-precision values per second is 4.15 (16.6/4). With a CGMA ratio of 1.0, this kernel will execute at no more than 4.15 gigaflops [4]. So in short, this kernel is memory-bound, and to optimise a memory-bound kernel the focus should be on global memory access. One method for doing this is to
6.3.2 Tiling v1
One of the fastest memory types on a Cuda device is the shared memory. The shared memory is on-chip and very fast, but also limited. Shared memory is accessible to and shared by all threads in a block, so it is an obvious choice for a block cache. One strategy for reducing global memory traffic is to partition data into tiles that will fit into the shared memory, then load a tile of data from device memory into shared memory, process the data, and lastly write the results back to device memory [2]. One important criterion is that the computation on each tile of data can be performed independently of the other tiles. The strategy also requires the threads in a block to be synchronised, as shown in the following kernel code:
__global__
void matrixMultiplicationTILINGns(float* a, float* b, float* c, int aWidth, int bWidth) {

    // blockDim.x = TILING_DIM (last is defined and hence faster)
    // blockDim.y = TILING_DIM (last is defined and hence faster)

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Matrix C coordinates
    int c_column = bx * TILING_DIM + tx;
    int c_row = by * TILING_DIM + ty;

    // Calculate the first index of the row in a, and the last for the
    // current thread
    int aIdxBegin = c_row * aWidth + tx;
    int aIdxEnd = aIdxBegin + aWidth - 1;
    int bIdxBegin = c_column + bWidth * ty;

    float sum = 0.0;

    for (int aIdx = aIdxBegin,
             bIdx = bIdxBegin; aIdx <= aIdxEnd;) {

        __shared__ float ac[TILING_DIM][TILING_DIM]; // A cache
        __shared__ float bc[TILING_DIM][TILING_DIM]; // B cache

        // Load values to cache
        ac[tx][ty] = a[aIdx];
        bc[tx][ty] = b[bIdx];

        // Synchronise to make sure all threads in block have saved
        // values to the shared memory for this phase
        __syncthreads();

        for (int i = 0; i < TILING_DIM; ++i) {
            sum += ac[i][ty] * bc[tx][i];
        }

        // Synchronise to make sure that computations are done
        __syncthreads();

        aIdx += TILING_DIM;          // Add index by phase dimension
        bIdx += TILING_DIM * bWidth; // Add index by phase dimension and
                                     // b width
    }

    // Insert dot-product in resulting matrix
    c[c_row * bWidth + c_column] = sum;
}
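The phase structure of the tiled kernel can be followed on the CPU with a plain C sketch. This is not the Cuda kernel itself: the function name `matmul_tiled` and the tile width `TILE` are hypothetical, the matrices are assumed to be square with a width that is a multiple of the tile width, and the two local arrays only play the role of the shared memory caches.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4 /* hypothetical tile width, standing in for TILING_DIM */

/* C = A * B for square, row-major n x n matrices, n a multiple of TILE. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    assert(n % TILE == 0);
    float ac[TILE][TILE]; /* tile of A, stands in for shared memory */
    float bc[TILE][TILE]; /* tile of B, stands in for shared memory */
    for (size_t by = 0; by < n; by += TILE) {
        for (size_t bx = 0; bx < n; bx += TILE) {
            float sum[TILE][TILE] = {{0.0f}};
            for (size_t ph = 0; ph < n; ph += TILE) {
                /* Load phase: copy one tile of A and one tile of B. */
                for (size_t ty = 0; ty < TILE; ++ty)
                    for (size_t tx = 0; tx < TILE; ++tx) {
                        ac[ty][tx] = a[(by + ty) * n + ph + tx];
                        bc[ty][tx] = b[(ph + ty) * n + bx + tx];
                    }
                /* Compute phase: partial dot-products from the tiles. */
                for (size_t ty = 0; ty < TILE; ++ty)
                    for (size_t tx = 0; tx < TILE; ++tx)
                        for (size_t i = 0; i < TILE; ++i)
                            sum[ty][tx] += ac[ty][i] * bc[i][tx];
            }
            /* Write the finished tile of C. */
            for (size_t ty = 0; ty < TILE; ++ty)
                for (size_t tx = 0; tx < TILE; ++tx)
                    c[(by + ty) * n + bx + tx] = sum[ty][tx];
        }
    }
}
```

Each value loaded into a tile is reused TILE times in the compute phase, which is where the reduction in global memory traffic comes from. On the GPU the load and compute phases are separated by `__syncthreads()`; in this sequential sketch the loop order provides the same guarantee.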
In the CGMA expression for this kernel, one symbol stands for the block dimension size, which is the axis size of one dimension in the block, and the other stands for global memory reads, which is the number of accesses to global memory.

The block dimension was set to 20, giving a CGMA of 20. With 4.15 giga single-precision values per second for platform #1, the immediate kernel peak performance is calculated to be 83 gigaflops. This is an impressive theoretical peak performance for this kernel, taking into account that the global memory has a bandwidth of 16.6 GB/sec. However, the GPU of platform #1 has a peak performance of 34.38 gigaflops and the kernel is limited by that, so the theoretical maximum performance of this kernel on this platform is 34.38 gigaflops.

Showing that the kernel on platform #1 is limited to 34.38 gigaflops shows that the tile strategy kernel algorithm is no longer memory-bound, but arithmetic-bound. This is theoretically true, but the picture might be different when the test has been performed and the result is ready.

Test and result
To test whether using the matrix structure as parameter had any impact on performance in Cuda, I implemented two tiled kernels. The first one used the matrix structure as parameter and the second used pointer arrays. GPU.NET supports shared memory as well, but only arrays with one dimension were supported, so the shared memory indexes in the source code were adjusted to lay the arrays out sequentially. Besides this minor adjustment the Cuda kernel was easy to port to GPU.NET, and the test results are shown in the following table.
Platform                  Operations/ms    Gigaflops/sec
            39.52 ms      3,238,368        3.24             257.92 ms
            35.96 ms      3,559,595        3.56             254.48 ms
            76.00 ms*     1,684,210        1.68             357.00 ms
The block has two dimensions with a length of 20, which gives 20 × 20 = 400 threads per block. The normal recommendation is to make the block size divisible by the warp size, currently 32. However, these tests were also performed with a block size of 16 × 16 = 256, but the peak performance for the Cuda kernel was then about 1.6 gigaflops. The conclusion is that it sometimes pays off not to follow the recommendation. In this case, the overhead of filling the last warp with empty padded threads is insignificant compared to the larger amount of coalesced memory reads that the larger block size results in.

By using the tile strategy and shared memory, it was possible to perform matrix-multiplication even faster than the resulting matrix algorithm. Cuda was about 1.47 times faster and GPU.NET was faster by a factor of 1.88. Looking at the peak performance, the result indicates that even though the algorithms are the same, GPU.NET has no chance of performing on the same level as when Cuda is used directly. This is most likely due to the fact that GPU.NET JIT compiles the device code. The Cuda kernel has a peak performance of 3.56 gigaflops, which is remarkably slow compared to the theoretical 34.38 gigaflops. This gives an actual performance that is just 10.35% of the theoretically possible. And even though the kernel algorithm is arithmetic-bound in theory, this significantly slower performance suggests that the performance limiting factors can in fact be both arithmetic and memory.
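The figures quoted in this section can be re-derived with a few lines of host code. The following is only a sketch of the arithmetic, assuming 4 bytes per single-precision value; the function names are hypothetical and are not part of the tested implementation.

```c
#include <assert.h>

/* Memory-bound estimate in gigaflops: giga values delivered per second
 * multiplied by the CGMA ratio of the kernel. */
double memory_bound_gflops(double bandwidth_gb_per_s, double bytes_per_value,
                           double cgma)
{
    assert(bytes_per_value > 0.0);
    return bandwidth_gb_per_s / bytes_per_value * cgma;
}

/* A kernel cannot exceed the arithmetic peak of the device. */
double kernel_bound_gflops(double memory_bound, double arithmetic_peak)
{
    return memory_bound < arithmetic_peak ? memory_bound : arithmetic_peak;
}

/* Number of warps needed to hold one thread block (rounded up). */
int warps_per_block(int threads_per_block, int warp_size)
{
    assert(warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}

/* Empty padded threads in the last, partially filled warp of a block. */
int padded_threads(int threads_per_block, int warp_size)
{
    return warps_per_block(threads_per_block, warp_size) * warp_size
           - threads_per_block;
}
```

With the values for platform #1, memory_bound_gflops(16.6, 4.0, 20.0) gives 83 and kernel_bound_gflops then returns the arithmetic peak of 34.38. A 20 × 20 block needs 13 warps and leaves 16 padded threads, whereas a 16 × 16 block fills 8 warps exactly.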
    15. const int aIdxEnd = aIdxBegin + aWidth - 1;
    16. float sum = 0.0;
    17.
    18. for (int aIdx = aIdxBegin,
    19.          bIdx = c_column + bWidth * threadIdx.y; aIdx <= aIdxEnd;)
    20. {
    21.     ac[threadIdx.y][threadIdx.x] = a[aIdx];
    22.     aIdx += TILING_DIM; // Increase a index
    23.
    24.
    25.     bc[threadIdx.y][threadIdx.x] = b[bIdx];
    26.     bIdx += TILING_DIM*bWidth; // Increase b index
    27.
    28.     // Synchronise to make sure all threads in block have saved
    29.     // values to the shared memory for this phase
    30.     __syncthreads();
    31.
    32.     // Compute dot-product
    33.     for (int i=0; i < TILING_DIM; ++i) {
    34.         sum += ac[threadIdx.y][i]*bc[i][threadIdx.x];
    35.     }
    36.
    37.     // Synchronise to make sure that computations are done
    38.     __syncthreads();
    39. }
    40. // Insert dot-product in resulting matrix
    41. c[cidx] = sum;
    42. }
To optimise this v2 kernel, four register variables have been removed to increase the number of possible active warps. Furthermore, the shared memory accesses in lines 21 and 25 have been optimised for coalescing. This now yields a peak performance of 5.028 gigaflops on platform #1. Nvidia provides a code example implementation of matrix-multiplication using the tile strategy. That kernel has a peak performance of 4.91 gigaflops, so with these minor updates it is possible to get a kernel that performs better.
    __global__ void mm_prefecth(float* a, float* b, float* c, int aWidth, int bWidth) {

        // Load data from global memory to register variables

        while(data to process) {

            // Insert register values to shared memory

            // Synchronise threads

            // Prefetch next values to register

            // Calculate the dot-product

            // Synchronise to make sure that computations are done
        }

        // Insert dot-product in resulting matrix
    }
By exploiting prefetching the peak performance of matrix-multiplication increased for system #1 and #4. 5.68 gigaflops was reached for system #1, while system #3 surprisingly incurred a performance loss of 2.2 gigaflops.
    // Calculate the dot-product
    for (int i=0; i < TILING_DIM; ++i) {
        sum[0] += ac[threadIdx.y][i] *
        sum[1] += ac[threadIdx.y+5][i] *
        sum[2] += ac[threadIdx.y+10][i] *
        sum[3] += ac[threadIdx.y+15][i] *
    }
I made two tests, one where a single thread computes two dot-products, and another where a thread computes four dot-products. The results of platform #1 were a bit surprising and have therefore also been tested on platform #4.
#1 (CC v1.1) 5.68 gigaflops (Occupancy: 54%) 4.45 gigaflops (Occupancy: 58%) 5.89 gigaflops (Occupancy: 67%)
96.51 gigaflops
113.83 gigaflops
Table 8 - Tiling with 2 and 4 outputs per thread comparison for different platforms
The performance expectations set by the Volkov slides say that several outputs per thread could perform better than one output per thread. The results from platform #4 are the only ones following this pattern, ending with a peak performance of 37.13 gigaflops. This performance is achieved with an occupancy level of 67%, confirming that higher performance can be achieved with a lower occupancy rate. But it is evident that using Volkov's suggestion is no guarantee of success. It is interesting to see that the performance of the 2 outputs/thread method was actually lower than expected for system #1 and #3, and that the 4 outputs/thread method on these systems yielded results similar to those obtained without Volkov's suggestions at all.

The major difference between system #1 and #3 on one side and #4 on the other is the compute capability level of the GPU devices. System #1 and #3 belong to the 1.x generation, whereas #4 belongs to the 2.x generation. This might be the reason that the algorithm performs relatively differently on different architectures. In the next paragraph I will look into whether the CC level has any influence on the performance of an algorithm.
[Figure: peak performance in gigaflops (axis from 0 to 40) for CC 1.1, CC 1.3 and CC 2.0]
It was expected that the kernels would perform best on higher CC levels, and the rule of thumb is to target the highest CC level possible when compiling kernels, to take advantage of the newest optimisations and features. This holds except for the last kernel, where CC levels 1.1 and 1.3 have a peak performance of 39.47 gigaflops, which is faster than the 37.06 gigaflops that CC 2.0 delivers. Kernels executed on CC level 2.0 are generally between 5% and 10% faster than on lower levels, except for the last case described above, which is 6.11% slower. I have not been able to find any explanation of why the rule of thumb is not valid for this specific case, but even though this exception breaks the rule, I still recommend compiling for the highest CC level possible.
6.4 Evaluation
The key to a well-performing kernel is memory coalescing and latency hiding. The first step should be to structure the algorithm so that as much memory coalescing as possible is achieved. Tiling has proven a very good strategy for increasing memory coalescing. One memory access limitation is due to the DRAM memory design, so memory coalescing optimisation techniques should also be applied to shared memory.

Using matrix structures as parameters was initially thought of as a good abstraction, but tests showed that performance losses were the result. It is therefore recommended to use primitive variables or pointers in the kernel function definitions.

Data prefetching combined with reordering of operations in the kernel was used to hide latency, and gave diverging results. On three systems a performance gain of about 1 gigaflop was achieved, but on the Tesla C1060 device a loss of 2.2 gigaflops was the result. Maybe the hardware of the Tesla card is already optimised with regard to hiding this type of latency, so that trying to handle it in the kernel counteracts these hardware optimisations. Either way, this shows that data prefetching should be applied to a kernel with care.

Volkov presented the idea that latency hiding and the exploitation of registers can achieve a higher peak performance. This proved to be true, and by making a thread do more work it was possible to make a kernel perform even better. This optimisation technique furthermore showed that the occupancy rate should not necessarily be relied on as an optimisation strategy.

Control flow is another factor to keep in mind when designing a kernel. As the Cuda architecture is SIMT, a kernel with a complex control flow can require several passes by the warp scheduler to complete a warp. This can result in an unnecessarily long computing time.
The same operations processed on a GPU and a CPU do not always yield the same numerical result. This is especially true for older architectures that have a compute capability level between v1.0 and v1.3. Newer architectures support the IEEE 754 standard better and in many cases yield a result with better precision. In the tests of matrix-multiplication the maximum difference in values was 0.013. To minimise this inaccuracy, special and slower intrinsic functions can be used in the kernel. These functions deviate less from IEEE 754 and force the compiler not to use FMAD instructions, which are fast but imprecise multiply-add instructions.
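Deviations of this kind can be quantified by comparing the result copied back from the device with a CPU reference, element by element. The helpers below are a minimal sketch of such a comparison; the names max_abs_diff and mean_abs_diff are hypothetical, and the code only assumes that both results are available as float arrays of equal length on the host.

```c
#include <assert.h>

/* Largest absolute difference between two result arrays of equal length. */
float max_abs_diff(const float *gpu, const float *cpu, int count)
{
    float worst = 0.0f;
    for (int i = 0; i < count; ++i) {
        float d = gpu[i] - cpu[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Average absolute difference between two result arrays of equal length. */
float mean_abs_diff(const float *gpu, const float *cpu, int count)
{
    assert(count > 0);
    double total = 0.0;
    for (int i = 0; i < count; ++i) {
        double d = (double)gpu[i] - (double)cpu[i];
        total += d < 0.0 ? -d : d;
    }
    return (float)(total / count);
}
```

The sum in mean_abs_diff is accumulated in double precision so that the comparison itself adds as little rounding error as possible.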
7 LU decomposition
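LU-decomposition, also called LU-factorisation, is a linear algebra matrix decomposition of a matrix $A$ in the form:

$$A = LU$$

where $L$ and $U$ are lower and upper triangular matrices [17]. If the LU factorisation is known, it can be used to solve the matrix-vector linear equation $Ax = b$ in two steps:

1: $Ly = b$
2: $Ux = y$

Decomposing matrix $A$ to a product of $L$ and $U$ can be achieved by using an enhanced version of Gauss elimination. Only a square matrix can be decomposed, and the $L$ and $U$ matrices are of the same size, as shown here for a 3 × 3 matrix:

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}$$

Note that $L$ is a unit one matrix, meaning the diagonal elements are one. Stewart [17] provides a sequential algorithm that builds on Gauss elimination, which also creates a lower unit one matrix.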
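The two solving steps can be sketched in a few lines of host code. The example below is an illustration under stated assumptions and not the implementation used in the tests: it assumes the usual convention that the first step is a forward substitution with the unit lower triangle and the second a back substitution with the upper triangle, it stores both triangles in one row-major array, and it omits pivoting. The function names are hypothetical.

```c
#include <assert.h>

/* In-place LU factorisation without pivoting of an n x n row-major matrix.
 * Afterwards the strict lower triangle holds L (its unit diagonal is
 * implied) and the upper triangle, including the diagonal, holds U. */
void lu_decompose_inplace(float *m, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++) {
            float rik = (m[i * n + k] /= m[k * n + k]);
            for (int c = k + 1; c < n; c++)
                m[i * n + c] -= rik * m[k * n + c];
        }
}

/* Solves A x = b given the packed factors produced above. */
void lu_solve(const float *lu, int n, const float *b, float *x)
{
    /* Step 1: forward substitution, L y = b (y is stored in x). */
    for (int i = 0; i < n; i++) {
        float sum = b[i];
        for (int j = 0; j < i; j++)
            sum -= lu[i * n + j] * x[j];
        x[i] = sum;
    }

    /* Step 2: back substitution, U x = y. */
    for (int i = n - 1; i >= 0; i--) {
        float sum = x[i];
        for (int j = i + 1; j < n; j++)
            sum -= lu[i * n + j] * x[j];
        x[i] = sum / lu[i * n + i];
    }
}
```

Once the factors are known, each additional right-hand side only costs the two substitution passes, which is why the factorisation is worth keeping.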
7.1 Analysis
Stewart designs an LU-decomposition algorithm and provides code that overwrites the matrix A with its LU factorisation [17]. Pivoting, or row interchanges, may be required for two reasons: firstly to ensure the existence of an LU factorisation, and secondly to increase the numerical stability of the Gaussian elimination algorithm [18]. For simplicity, algorithms without pivoting will initially be analysed, but when testing, only algorithms that implement partial pivoting will be used. This makes sure that the performance and correctness of the individual algorithms can be compared.
    // Core algorithm for LU Decomposition
    for (int k = 0; k < n; k++)
    {
        for (int i = k + 1; i < n; i++)
        {
            // Compute scale factor Rik
            float Rik = (m[i * mWidth + k] /= m[k * mWidth + k]);

            // Subtract row k elements from row i elements with the
            // Rik scale factor
            for (int c = k + 1; c < n; c++)
            {
                m[i * mWidth + c] -= Rik * m[k * mWidth + c];
            }
        }
    }
The code shown above is without pivoting for simplicity reasons. The sequential implementation is simple, consisting of three loops using the operations divide, multiplication and addition. The data size of the square matrix is $n^2$, and the algorithm requires approximately $n^3/3$ additions and as many multiplications, and approximately $n^2/2$ divisions to complete. Ignoring the lower order term, the running time is $O(n^3)$, where $n$ is both the width and height of the matrix [19]. This shows that the
7.1.2 Parallelism
Matrix-multiplication and the characteristics of its data access meant that inducing concurrency and exploiting data-parallelism was straightforward. The same cannot be said about LU-decomposition, in which data dependencies between the loops make parallelising more complicated. The sequential algorithm consists of three loops, and the operations performed by the two inner loops determine the asymptotic running time, expressed in n, where n is the width or height of the matrix. The outer loop cannot be directly performed in parallel, as each iteration depends on the results from the preceding iterations.
Parallelism is not impossible, but the order of the outer loop is vital. Taking the outer loop into account, the operations part of the algorithm can be written as a sum, over the steps of the outer loop, of the operations performed by the two inner loops. For each step these operations are performed, and they can be performed in parallel. The required number of operations when taking parallelisation into account therefore depends on the number of processes. The optimal execution performs all tasks possible in parallel, but for simplicity a fixed number of processes is assumed here.

So this algorithm does have a parallel potential. I will now look into whether this potential can be exploited.
The parts of the algorithm that exhibit little or no parallelism should be processed on the CPU. This means that the outer loop is processed by the CPU and the inner parts that exhibit parallelism are processed by the GPU, so the LU-decomposition implementation is processed by the CPU and GPU in cooperation. The interesting point will be to see whether it is possible, when also considering data transfer, to make the GPU assist the CPU in accelerating the execution of the LU-decomposition algorithm.

To make the initial implementation simple, I have divided the calculation of multipliers and the row operations into separate tasks. For each step of the outer loop, the multipliers are calculated for the current column (line 2 to 4), after which the multipliers are used to calculate the elements in the upper triangular matrix (line 5 to 9). These are individual tasks that can be performed in parallel, as shown in the following pseudo code.
     1. For k from 1 To n-1
     2.     For i in { k+1, ..., n }
     3.         LU[i][k] = LU[i][k] / LU[k][k]
     4.     End
     5.     For j in { k+1, ..., n }
     6.         For i from k+1 To n
     7.             LU[i][j] = LU[i][j] - LU[i][k] * LU[k][j]
     8.         End
     9.     End
    10. End
This algorithm is not the only method for creating the LU-factorisation. There are other algorithms that structure the operations in the outer loop differently, and the main difference is their memory access patterns. The performance of different memory access patterns may vary depending on the memory types used in the algorithm; another factor is whether the tasks are fine-grained or coarse-grained. For now I recognise the existence of other algorithms, but use the one described for the simple implementation.
    10. }
    11. // Calculate scale factors for column k
    12. lud_simple_calc_scale_factor<<< gridX, THREADS_PER_BLOCK >>>(d_lu, lu->width, lu->height, k);
    13.
    14. // Calculate new column values with scale factor
    15. lud_simple_compute_row<<< gridX, THREADS_PER_BLOCK >>>(d_lu, lu->width, lu->height, k);
    16. }
    17. }
The function call in line 12 calculates the multipliers of a given column on the device. Line 15 performs the row operations with the multipliers. The kernels being called and their logic are shown here:
    __global__ void lud_simple_calc_scale_factor(float *lu, int luWidth, int luHeight, int k) {

        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int i = k + 1 + tid;

        if (i < luHeight)
        {
            // Calculate rik scale factor and insert to Lower triangle
            lu[i * luWidth + k] /= lu[k * luWidth + k];
        }
    }

    __global__ void lud_simple_compute_row(float *lu, int luWidth, int luHeight, int k) {

        // Id of the row
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int i = k + 1 + tid;

        if (i < luHeight) {

            // Load rik scale factor, can be cached in shared memory
            float rik = lu[i * luWidth + k];

            // Subtract row k elements from row i elements with the Rik scale factor
            for (int c = k + 1; c < luWidth; c++)
            {
                lu[i * luWidth + c] -= rik * lu[k * luWidth + c];
            }
        }
    }
[Figure: performance in gigaflops (0.00 to 0.60) against data size (0 to 10,000) for platforms #1 to #4]

Figure 8 Performance of simple LU-decomposition on different platforms.

The graph indicates that the peak performance measured in gigaflops is dependent on the data size. The performance increases for matrices of increasing sizes up to about 2000 × 2000, and after that the performance result levels out for all platforms. The peak GPU performance of the fastest platform was about 0.5 gigaflops, which compared to a peak CPU performance of 2.44 gigaflops is slow.
10,000 matrix this equals 30,000 kernel invocations. Naturally, kernel invocations incur overhead, but how much will 30,000 invocations influence the total result? Vasily Volkov et al. have measured the kernel launch overhead for various systems and GPUs [20]. For synchronised kernel invocations the measured times were between 10 and 14 µs; for asynchronous kernel invocations the timings were 3 to 7 µs. The following table shows the kernel invocation overhead as a ratio of the fastest running times on system #3 and #4.
System Fastest result (n=10,000) Asynchronous (low=900 ms) Asynchronous (high=2,100 ms) Synchronous (low=3,000 ms) Synchronous (high=4,200 ms)
This table shows that kernel invocations should not be disregarded when implementing an algorithm, because their contribution to the total running time can be relatively high. This obviously depends on the algorithm, but for this LU-decomposition implementation the contribution is as high as 26.69%. It is evident that asynchronous invocations are faster than synchronous ones, and hence represent a lower percentage of the total running time. So where possible, asynchronous functions should be used.

GPU.NET
The LU-decomposition implementation requires a matrix to be copied to device memory initially; several kernel calls then compute the result and update the data, before the matrix is copied back to the host. GPU.NET does not currently allow data on the device to be modified by multiple kernel calls, so testing LU-decomposition through GPU.NET would not be relevant, as data transfers would severely impact performance.
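The usual rules of matrix-multiplication hold for block matrices, so we can write:

1. $A_{11} = L_{11} U_{11}$
2. $A_{21} = L_{21} U_{11}$
3. $A_{12} = L_{11} U_{12}$
4. $A_{22} = L_{21} U_{12} + L_{22} U_{22}$

where $A_{11}$ is $b \times b$, $A_{12}$ is $b \times (n-b)$, $A_{21}$ is $(n-b) \times b$ and $A_{22}$ is $(n-b) \times (n-b)$.

The first step is based on lemma 1 and 2: by performing a normal LU-decomposition on $A_{11}$ and $A_{21}$ combined, the result is then $L_{11}$, $L_{21}$ and $U_{11}$, which are then known.

Step 2 uses lemma 3 and a triangular solve method, which results in the matrix $U_{12}$.

In step 3, rearranging lemma 4 gives $L_{22} U_{22} = A_{22} - L_{21} U_{12}$, so $L_{22}$ and $U_{22}$ can be found by LU-decomposing $A_{22} - L_{21} U_{12}$.

In $n/b$ number of steps the matrix $A$ has been decomposed by using a block LU-decomposition, as depicted here. The white parts have already been solved.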
width is obviously the dimension of the current block being processed (here the green submatrix). Step 1 solves the green and purple sub-matrix by regular LU-decomposition, then the lower triangular matrix of the block (L of the green block) is used to triangular solve the cyan sub-matrix, as the second step. In the third step, the blue sub-matrix is found by regular matrix-multiplying the purple and cyan sub-matrices and subtracting the element values from the current elements in the blue sub-matrix. The steps are then continued for the remaining parts until the whole matrix is processed.
As this algorithm makes it possible to partition large matrices and solve smaller parts, and therefore exploit shared memory, this algorithm will be implemented using Cuda and used for testing.
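Before turning to the Cuda implementation, the three steps can be followed in a sequential host sketch. The code below is an illustration under stated assumptions and not the tested implementation: it omits pivoting, works in place on a row-major n × n matrix, and the names lu_blocked, lu_unblocked and the block width parameter bs are hypothetical. The unblocked routine is only included as a reference to compare against.

```c
#include <assert.h>

/* Reference: unblocked in-place LU without pivoting (n x n, row-major). */
void lu_unblocked(float *a, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++) {
            float rik = (a[i * n + k] /= a[k * n + k]);
            for (int c = k + 1; c < n; c++)
                a[i * n + c] -= rik * a[k * n + c];
        }
}

/* Blocked in-place LU without pivoting; bs is the width of one block. */
void lu_blocked(float *a, int n, int bs)
{
    assert(n > 0 && bs > 0);

    for (int k = 0; k < n; k += bs) {
        int w = (n - k < bs) ? n - k : bs;  /* width of the current block */
        int e = k + w;                      /* first column right of it */

        /* Step 1: regular LU-decomposition of the block column, that is
         * the diagonal block and everything below it. */
        for (int j = k; j < e; j++)
            for (int i = j + 1; i < n; i++) {
                float rij = (a[i * n + j] /= a[j * n + j]);
                for (int c = j + 1; c < e; c++)
                    a[i * n + c] -= rij * a[j * n + c];
            }

        /* Step 2: triangular solve with the unit lower triangle of the
         * diagonal block, one column to the right of the block at a time. */
        for (int c = e; c < n; c++)
            for (int r = k; r < e; r++)
                for (int j = k; j < r; j++)
                    a[r * n + c] -= a[r * n + j] * a[j * n + c];

        /* Step 3: matrix-multiply the sub-matrix below the block with the
         * one to the right of it, and subtract from the remaining part. */
        for (int i = e; i < n; i++)
            for (int c = e; c < n; c++)
                for (int j = k; j < e; j++)
                    a[i * n + c] -= a[i * n + j] * a[j * n + c];
    }
}
```

Because the factorisation without pivoting is unique, both routines should produce the same factors up to rounding, which makes the unblocked version a convenient check. In step 2 every column is independent of the others, and step 3 is an ordinary matrix-multiplication, which is what makes these two steps attractive for the GPU.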
7.3.2 Implementation
This algorithm, as shown above, consists of three steps, which the implementation must also follow. The first step, to LU-decompose the green and purple sub-matrices, is covered by an optimised kernel of the simple algorithm (lud_block_scale). This part also includes the optimised pivoting kernels (lud_block_pivot, lud_block_pivot_L2 and lud_block_swap).

The second step requires a triangular solving kernel (lud_block_triangular_solve), and the last step is a regular matrix-multiplication kernel (lud_block_matrixMultiplication), which has already been implemented and optimised in chapter 6 from page 29.

Pivoting
In LU-decomposition, pivoting is performed for each column. Instead of the simple algorithm that had a running time proportional to n, the parallel nature of Cuda can be exploited to implement a reduction pivoting algorithm with a running time of O(log(n)). This required two pivoting kernels; the first reduces the current column of the matrix and saves the result to a temporary pivoting array on the device. The second kernel does the same, but works on the temporary pivot array instead of the matrix.

The first kernel is shown here, and has already been optimised with focus on memory coalescing and a sort of tiling strategy. In lines 20 and 21 the individual threads load a value from global memory to shared memory. This data is then processed from line 27 to 37, while the threads synchronise data access for consistency. In line 42, the first thread of each thread block in the grid writes the pivoting index to the temporary array.
     1. __global__ void lud_block_pivot(int *out, float *a, int M, int k, int max)
     2. {
     3.     extern __shared__ float shared[];
     4.     float* max_cache = (float*)shared;
     5.     int* idx_cache = (int*)&shared[blockDim.x];
     6.
     7.     unsigned int tx = threadIdx.x;
     8.     unsigned int i = blockIdx.x * blockDim.x + tx + k; // Get row index
     9.
    10.     unsigned int idx = i * M;
    11.
    12.     // Clear cache for threads that exceeds max + they should not
    13.     // influence result
    14.     max_cache[tx] = 0;
    15.     idx_cache[tx] = -1;
    16.
    17.     if (i < M)
    18.     {
    19.         // Read value + set row index
    20.         max_cache[tx] = abs(a[idx + k]);
    21.         idx_cache[tx] = i;
    22.
    23.         // Sync threads to make sure all other also have loaded values
    24.         __syncthreads();
    25.
    26.         // Do the actual pivot finding
    27.         for(unsigned int stride = blockDim.x/2; stride>0; stride>>=1)
    28.         {
    29.             if (tx < stride && (stride+tx+k) < M
                        && max_cache[tx] < max_cache[tx + stride])
    30.             {
    31.                 max_cache[tx] = max_cache[tx + stride]; // Update value
    32.                 idx_cache[tx] = idx_cache[tx + stride]; // Update index
    33.             }
    34.
    35.             // Sync threads
    36.             __syncthreads();
    37.         }
    38.
    39.         // The first thread should write result from block to output
    40.         if (tx == 0)
    41.         {
    42.             out[blockIdx.x] = idx_cache[0]; // Load index to output
    43.         }
    44.     }
    45. }
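The reduction pattern of the kernel can be traced on the host with a small emulation of one thread block. This is only a sketch for illustration: the name reduce_argmax_abs, the fixed upper limit MAX_BLOCK and the steps output parameter are hypothetical, the block size is assumed to be a power of two, and the loop over tx stands in for the threads that the GPU would run in parallel.

```c
#include <assert.h>

#define MAX_BLOCK 1024  /* hypothetical upper limit for the emulated block */

/* Emulates the tree reduction of one thread block: returns the index of the
 * element with the largest absolute value among the first count elements of
 * column, and stores the number of reduction steps in *steps. */
int reduce_argmax_abs(const float *column, int count, int block, int *steps)
{
    float max_cache[MAX_BLOCK];
    int idx_cache[MAX_BLOCK];

    assert(block > 0 && block <= MAX_BLOCK && (block & (block - 1)) == 0);
    assert(count > 0 && count <= block);

    /* Every emulated thread loads one value; unused slots are cleared so
     * that they cannot influence the result. */
    for (int tx = 0; tx < block; ++tx) {
        if (tx < count) {
            float v = column[tx];
            max_cache[tx] = v < 0.0f ? -v : v;
            idx_cache[tx] = tx;
        } else {
            max_cache[tx] = 0.0f;
            idx_cache[tx] = -1;
        }
    }

    /* Halve the number of active threads in every step. Within one step a
     * thread only reads slot tx + stride and writes slot tx, so running the
     * threads one after another gives the same result as running them in
     * parallel. */
    *steps = 0;
    for (int stride = block / 2; stride > 0; stride >>= 1) {
        for (int tx = 0; tx < stride; ++tx) {
            if (max_cache[tx] < max_cache[tx + stride]) {
                max_cache[tx] = max_cache[tx + stride];
                idx_cache[tx] = idx_cache[tx + stride];
            }
        }
        ++(*steps);
    }

    return idx_cache[0];
}
```

For a block of 8 emulated threads the loop runs 3 steps, which illustrates the logarithmic number of steps compared to a linear scan of the column.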
Swapping rows
If a pivoting row has been identified, the indices of the two rows are transferred to the device by calling a kernel for swapping rows. By swapping the rows on the device, a transfer of the matrix to and from the host is avoided. Several threads can be used to swap the rows, and by aligning the memory access correctly, the memory access is coalesced.

LU-factorisation
The algorithm described by Stewart [17] is parallelised by making threads process individual rows. To optimise the performance the k row is loaded to shared memory and accessed by all threads, as the grid size is 1 × 1. This means that the k row is only loaded to shared memory once, and the values are accessed by all threads. With just one block, the disadvantage is that Cuda is not able to hide memory access latency by switching to other active blocks. I will later test whether this approach performs well, or whether another approach with several thread blocks performs better.

Triangular solving
Triangular solving for the cyan sub-matrix can be performed row-wise or column-wise. I have chosen column-wise, as the memory access of the threads in a block will be coalesced. This part is well suited for GPU processing, because each column can be processed independently.
     1. __global__ void lud_block_triangular_solve(float *a, int M, int k, int LU_BlockDim)
     2. {
     3.     extern __shared__ float y[];
     4.
     5.     int tx = threadIdx.x;
     6.     int tid = blockIdx.x * blockDim.x + tx;
     7.     int column = tid + k + LU_BlockDim;
     8.
     9.     if (column < M)
    10.     {
    11.         for (int r = 0; r < LU_BlockDim; r++) // For each row in block
    12.         {
    13.             float res = a[(r+k) * M + column];
    14.             for (int c = 0; c < r; c++) // 0<=c<r, so below diagonal
    15.                 res -= a[(r+k) * M + c + k] * y[tx * LU_BlockDim + c];
    16.             y[tx * LU_BlockDim + r] = res;
    17.         }
    18.
    19.         for (int r = 0; r < LU_BlockDim; r++)
    20.             a[(r+k) * M + column] = y[tx * LU_BlockDim + r];
    21.     }
    22. }
Each thread uses shared memory to calculate the resulting column values (line 3, 15, 16 and 20). The size of the thread blocks, and thereby the needed shared memory, is not known at compile time. But as shared memory can be dynamically allocated, this is not a problem. Shared memory is fast, but registers are even faster. If the thread block size were known at compile time, it could prove beneficial to use registers instead of shared memory.

Matrix-multiplication
The kernel for performing matrix-multiplication is based on tiling v3, which includes tiling, pre-fetching and memory coalescing optimisations. For details about this kernel please turn to paragraph 6.3.4 on page 42.

The test with the largest matrix was not performed on platform #1, due to memory limitations; neither was it tested by the CPU, as the running time would be too high. For comparison I have added a qualified projection on the graph, which shows that platform #3 had a peak performance of 14.37 gigaflops for a 10,000 × 10,000 matrix.
[Figure: performance in gigaflops (0.0 to 16.0) against data size (0 to 10,000) for platforms #1 to #4 and the CPU]

The graph shows two important things. Firstly, the GPU architecture is in fact able to perform LU-decomposition faster than the CPU for larger matrices. The specific speed of the platform determines for which data sizes the GPU is faster, but looking at the graph, this happens somewhere between 1,000 and 3,000. Secondly, the peak performance of the algorithm is almost proportional to the data size for these tests. Obviously the peak performance cannot keep increasing proportionally to the data size; there must be an upper limit. But it makes sense that when the data size increases, even more operations can be performed in parallel.

Several of the tests were also performed by the CPU as a comparison. The average difference was 0.104211 and the maximum and minimum differences were respectively 0.332855 and 0.0. The 0.0 differences were only achieved on platform #4 with a compute capability of 2.1.

Profiling
The Nvidia Compute Visual Profiler is a tool that allows profiling of a Cuda program. The GPU time summary plot indicates which part of an algorithm could be optimised with most effect. The following figure shows how much computing time each kernel uses.
Almost 90% of the time is spent in the regular LU-decomposing kernel, so optimising this part should have the best effect on the total running time.
in threads, which normally is 20. This means that only 20 threads are running in parallel at any given time, but as the warp size for the G80 and GT200 architecture is 32, the active warp is padded with 12 empty threads that do not process any data.
     1. __global__ void lud_block_scale(float *a, int M, int k)
     2. {
     3.     extern __shared__ float ac[];
     4.
     5.     int aWidth = M;
     6.     int tx = threadIdx.x;
     7.     int end = min( blockDim.x, M-k );
     8.
     9.     ac[tx] = a[k * aWidth + k + tx]; // Load k row to shared memory, as
                                             // it is used across threads
    10.
    11.     // Sync threads to make sure all other also have loaded values
    12.     __syncthreads();
    13.
    14.     for(int i = k+1 + tx; i < M; i+=blockDim.x) { // Foreach row
    15.
    16.         // Compute scale factor Rik, 1 operation=divide
    17.         float rik = (a[i * aWidth + k] /= ac[0]);
    18.
    19.         for (int c = 1; c < end; c++) // Foreach column value in row
Another factor in this kernel is its dependency on global memory. All threads load a value from the k row into shared memory in line 9. Both the memory read from global memory and the write to shared memory are coalesced, so this is good. The cached values are then used to calculate the upper and lower triangular matrices from line 14 to 20. These loops rely heavily on global memory access, and have only few operations to hide latency.

CGMA
The core parts of the kernel are lines 17 and 20. Consider line 17: a memory load and a write, combined with a single divide operation, gives a CGMA of 0.5. Line 20 has a memory load and a write, together with the operations addition and multiply, which gives a CGMA of 1.0. On platform #1 the global memory has a peak bandwidth of 16.6 GB/sec. The expected number of giga single-precision values per second is 4.15 (16.6 / 4). Line 20 is the dominant part, but line 17 cannot be ignored, so taking the CGMA ratios of 1.0 and 0.5 into account, this kernel will execute at no more than between 2.075 and 4.15 gigaflops on platform #1 [4].

Hiding latency
The low CGMA suggests that this kernel is limited by memory, but in the current form it is possible to improve on latency hiding. One way is to assign more work to the streaming processors and let them continue working on another warp while the first warp waits for data. The updated thread block size is 64, and the needed number of blocks would be (M - k) / 64, where M is the height of the matrix and k the current iteration. The kernel looks like this:
     1. __global__ void lud_block_scale_v2(float *a, int M, int k, int end)
     2. {
     3.     extern __shared__ float ac[];
     4.
     5.     int aWidth = M;
     6.     int tid = blockIdx.x * blockDim.x + threadIdx.x;
     7.
     8.     // Load k row to shared memory, as it is used across threads
     9.     ac[threadIdx.x] = a[k * aWidth + k + threadIdx.x];
    10.
    11.     // Sync threads to make sure all other also have loaded values
    12.     __syncthreads();
    13.
    14.     int i = k+1 + tid; // Row index
    15.     if (i < M)
    16.     {
    17.         // Compute scale factor Rik, 1 operation=divide
    18.         float rik = (a[i * aWidth + k] /= ac[0]);
    19.
    20.         for (int c = 1; c < end-k; c++) // Foreach column value in row
    21.             a[i * aWidth + k + c] -= rik * ac[c];
    22.     }
    23. }
But this is not the only benefit from this update. A for loop is a control flow element that is often part of a kernel. When doing operation counting analysis of a kernel, the operations contributed by for loops are often overlooked. Consider line 20 in the kernel above: for every iteration the c++ operation and the c < end-k comparison are performed. Unrolling loops is another way of increasing the performance of a kernel, and this is exactly what has been achieved with this kernel, compared to line 14 of the former version.
[Figure: performance in gigaflops (0 to 35) against data size (0 to 10,000) for platforms #1 to #4 and the CPU]
This optimised algorithm is faster than the former: the peak performance of platform #3 is now 31.51 gigaflops, which is 2.19 times as fast. This also shows that even smaller matrices can be processed with benefit by the GPU architecture.

Several of the tests were also performed by the CPU as a comparison. The average difference was 0.103360 and the maximum and minimum differences were respectively 0.314285 and 0.002296. The Cuda architectures with higher compute capability do not seem to have smaller deviations from the CPU reference result.

Profiling
Focus should be on hiding latency, which can be achieved by increasing the number of active blocks or by data pre-fetching. The following graph shows the updated computing time, when the number of threads and blocks has been adjusted.
This profiling result indicates that optimisation of the lud_block_scale and the
lud_block_matrixMultiplication kernels could have the highest performance effect. So this
The result of the optimised algorithm is shown below. The peak performance for platform #3 is now 42.89 gigaflops for a matrix, which is an increase of 1.32 times.
[Figure: performance in gigaflops versus data size (400 to 10,000) for platforms #1 to #4 and the CPU]
Triangular solve
The first version of the triangular solve implementation was well balanced with regard to grid and thread block size. Even though, according to the profiling result above, an optimisation would only affect about 3% of the GPU running time, I chose to try to optimise this kernel further. I was fully aware that any optimisation would have a limited impact on peak performance, but this would still be a good exercise and give insight into the analysis and improvement of a kernel.
The focus was on unrolling a loop (line 20 was performed in a separate loop in the former version) and coalescing memory access (lines 17 and 19 are now coalesced). As expected, the optimisation did not yield any significant change in peak performance.
1. __global__ void lud_block_triangular_solve_v2(float *a, int M, int k, int LU_BlockDim)
2. {
3.     extern __shared__ float y[];
4.
5.     int tid = blockIdx.x * blockDim.x + threadIdx.x;
6.     int column = tid + k + LU_BlockDim;
7.
8.     if (column < M)
9.     {
10.         for (int r = 0; r < LU_BlockDim; r++) // For each row in block
11.         {
12.             int rkM = (r + k) * M; // Offset of row k+r in the row-major matrix
13.             float res = a[rkM + column];
14.
15.             for (int c = 0; c < r; c++) // 0<=c<r, so below diagonal
16.                 res -= a[rkM + c + k] *
17.                        y[c * LU_BlockDim + threadIdx.x];
18.
19.             y[r * LU_BlockDim + threadIdx.x] = res;
20.             a[rkM + column] = res;
21.         }
22.     }
23. }
The result was limited as expected, but if I were to improve this kernel further, I would focus on testing whether Volkov's suggestion (1 thread = 2 outputs) would have any positive effect. Another improvement would focus on the global memory access in lines 13 and 16. The elements of the lower triangular matrix of the current block being processed (L of the orange sub-matrix) could be copied to shared memory, so that the reads in lines 13 and 16 could come from the faster shared memory instead of global memory.
Regular LU-decomposition
Focusing on the kernel that accounts for the highest GPU consumption time makes sense if this part of the algorithm can be further optimised. The former version of the kernel focused on hiding latency by increasing the number of thread blocks. Hiding latency can also be achieved by using data prefetching and by minimising the need for global memory access. I attempted to improve the performance of the kernel lud_block_scale by using registers to hold indices computed several times, by using data prefetching and by applying the Volkov suggestion (1 thread = 2 outputs). Unfortunately no performance gains were established compared to the former versions; in fact, a minor performance loss proved to be the reality.
[Figure: peak performance in gigaflops of kernel editions v4, v5 and v6, on platform #3 (axis range 42.0 to 42.4) and platform #4 (axis range 20.2 to 20.4)]
So based on these results, applying an optimisation method sometimes actually results in a performance loss. The reason is that these methods add restrictions to the kernel, which require extra boundary checks and thereby increase the flow control complexity and the number of operations. So improvements should always be implemented with care and followed by testing, to determine whether the improvement actually yields a better performance.
Correctness
Several tests with different data sizes were performed by the CPU as a comparison for the GPU computed results. The average deviation was 0.062963, and the maximum and minimum were 0.309113 and 0.0 respectively. The 0.0 was only achieved by platform #4 with a Cuda compute capability of 2.1. The deviations increased proportionally to the data size, which makes good sense. Any inaccuracy affects the result of every following iteration, and the larger the matrix size, the more iterations are needed.
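A comparison of this kind can be sketched in C as shown below. The sketch assumes that the deviation is measured as the element-wise absolute difference between the CPU and GPU results, which is not stated explicitly here; the names deviation_stats and compare_results are hypothetical.

```c
#include <stddef.h>

/* Summary of the element-wise absolute differences between two results. */
typedef struct {
    double average;
    double maximum;
    double minimum;
} deviation_stats;

/* Compares a GPU result with a CPU reference of the same length.
   Assumption: deviation = |cpu[i] - gpu[i]| for each element. */
deviation_stats compare_results(const float *cpu, const float *gpu, size_t count)
{
    deviation_stats s = { 0.0, 0.0, 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < count; i++) {
        double diff = (double)cpu[i] - (double)gpu[i];
        if (diff < 0.0)
            diff = -diff;
        if (i == 0 || diff > s.maximum)
            s.maximum = diff;
        if (i == 0 || diff < s.minimum)
            s.minimum = diff;
        sum += diff;
    }
    if (count > 0)
        s.average = sum / (double)count;
    return s;
}
```

Accumulating in double precision keeps the comparison itself from adding noticeable rounding error to the single-precision results being compared.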
about 1525 MB of both host and device memory, which only platform #3 matches with 4GB.
[Figure: peak performance in gigaflops (0 to 60) versus data size (0 to 35,000)]
Figure 17 - Peak performance of LU-decomposition v3 on platform #3
The graph above shows the peak performance in gigaflops for different data sizes. From the results I have added a qualified projection for matrices up to 35,000 × 35,000 in size. The results and projections show that the v3 algorithm should have a peak performance of about 51-52 gigaflops on system #3.
It is difficult to calculate the theoretical performance of the LU-decomposition block implementation, because the computations are divided into 6 different kernels. Each kernel has its own CGMA and its share in solving the full problem, but I will try to approximate. The kernels with a CGMA between 0.5 and 1.0 take up 48% of the running time, and the matrix-multiplication kernel has a CGMA of 20 and makes up about 17%. The remaining kernels have a CGMA of about 1.0. These numbers have been retrieved from the Nvidia profiler shown in Figure 13, and the approximated result is found as:
$0.5 \cdot 48\% + 20.0 \cdot 17\% + 1.0 \cdot 35\% = 3.99 \approx 4.0$
So the approximated CGMA is 4.0. The peak global memory performance for platform #3 is 102.4 GB/sec, which gives 25.6 giga single-precision values per second (102.4 / 4). The theoretical peak performance of this algorithm on this platform is 102.4 gigaflops, for fully coalesced memory access. The actual is about half, namely 52 gigaflops.
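The approximation above can be reproduced with a few lines of C. This is only a sketch of the arithmetic, using the numbers quoted in the text; the function names weighted_cgma and bandwidth_bound_gigaflops are hypothetical.

```c
/* CGMA weighted by each kernel group's share of the running time.
   share[i] is a fraction, e.g. 0.48 for 48%. */
double weighted_cgma(const double *cgma, const double *share, int groups)
{
    double total = 0.0;
    for (int i = 0; i < groups; i++)
        total += cgma[i] * share[i];
    return total;
}

/* Upper bound on gigaflops for a kernel limited by global memory:
   values loaded per second (in giga) multiplied by the operations
   performed per loaded value (CGMA). */
double bandwidth_bound_gigaflops(double bandwidth_gb_per_s,
                                 double bytes_per_value,
                                 double cgma)
{
    double giga_values_per_s = bandwidth_gb_per_s / bytes_per_value;
    return giga_values_per_s * cgma;
}
```

With the three groups (0.5, 20.0, 1.0) weighted by (0.48, 0.17, 0.35), a bandwidth of 102.4 GB/sec and 4 bytes per value, the two functions give the CGMA of about 4.0 and the bound of about 102.4 gigaflops used above.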
There are several factors influencing this result. One is the fact that not all memory loads and writes are coalesced; another is the extra instructions processed due to control flow complexity. But the result indicates that memory access is not the limiting factor on performance for these kernels, and that something else is.
7.4 Evaluation
The LU-decomposition algorithms have given valuable insight into some optimisation methods that work and some that do not. Reducing the number of kernel invocations, or using asynchronous functions, can reduce the total running time. With 30,000 kernel calls, the total invocation time could be reduced from 3-4.2 seconds to 0.9-2.1 seconds. I do not think that reducing kernel calls should be a primary focus, but it is something that a developer should be aware of when implementing an algorithm for Cuda. Basing the implementation on a block algorithm increased the performance for two reasons. First, the problem size is reduced to pieces that can exploit faster memory types, and second, the matrix-multiplication operation is highly parallel and had already been optimised for the Cuda architecture. These tests also showed that when a kernel is invoked, using several thread blocks is better than using just one. One reason could be that the warp scheduler can utilise multiple SMs for solving the problem. Unrolling a loop together with Volkov's suggestion for matrix-multiplication also helped increase performance. The last part is a bit surprising, because Volkov's suggestion on matrix-multiplication actually led to a performance decrease on system #3.
Other tests revealed that data prefetching and the tiling strategy did not actually increase performance, but left it without any major change. This supports the notion described above, that memory is not the limiting factor. The correctness tests also confirmed that instructions have a higher degree of precision on CC v2.0 than on earlier versions. If I were to optimise LU-decomposition even further, I would focus on arithmetic optimisations. This could be achieved by, among other things, unrolling loops, minimising control flow complexity and removing unnecessary synchronisation points.
8 QR decomposition
QR-decomposition, also known as QR-factorisation, is a decomposition of the matrix A in the form:
$A = QR$
Where Q is an orthogonal matrix and R is an upper triangular matrix. Which implies:
$Q^T Q = I$
There are different methods for calculating the QR factorisation, which can be used to solve linear systems and least squares problems [23][24].
8.1 Analysis
The different methods for decomposing matrix A into a QR factorisation include Gram-Schmidt, Householder reflections and Givens rotations. The classic Gram-Schmidt process is considered to be subject to numerical instability. The modified Gram-Schmidt algorithm overcomes this numerical instability, but at the expense of adding extra operations [23][25]. For these reasons I will consider neither the classic nor the modified version. Operation count analysis of both Householder reflections and Givens rotations shows that Givens rotations require about 50% more operations than Householder transformations [26]. Besides that, Givens rotations rely heavily on sine and cosine instructions, which will be processed by the limited SFU. I have therefore decided to base the QR-decomposition algorithm on Householder transformations. There are also other advantages. Firstly, the parallelisation is similar to LU-decomposition, so I expect to draw on the parallel optimisation experiences from the LU-decomposition chapter [24]. Secondly, Householder QR can use a compressed data storage form, by using the original matrix A and an additional array for the diagonal values of R [23]. Consider the matrix A in
The diagonal of R is stored in an extra vector. If the actual be computed from this compressed representation [25].
or
21.         }
22.         qr[k * n + k] += 1.0;
23.
24.         // Apply transformation to remaining columns.
25.         for (int j = k + 1; j < n; j++)
26.         {
27.             float s = 0.0;
28.             for (int i = k; i < m; i++)
29.             {
30.                 s += qr[i * n + k] * qr[i * n + j];
31.             }
32.             s = (-s) / qr[k * n + k];
33.             for (int i = k; i < m; i++)
34.             {
35.                 qr[i * n + j] += s * qr[i * n + k];
36.             }
37.         }
38.     }
39.     d[k] = -nrm;
40. }
This implementation is based on the Householder QR factorisation algorithm, which in the central part has the following operation count per iteration [26]:
Dot products (lines 7 and 30): $2(m-k)(n-k)$
Outer product (lines 14-22): $(m-k)(n-k)$
Subtraction (line 35): $(m-k)(n-k)$
Including the outer loop, the total running time is:
$\sum_{k} 4(m-k)(n-k) \sim 2mn^2 - \frac{2}{3}n^3$
Pivoting can be used to increase numerical stability, but for simplicity, this has not been included in this implementation.
8.1.2 Parallelism
QR-decomposition shares the outer loop structure with LU-decomposition, meaning that the order of the outer loop is important and requires sequential processing. The algorithm can be divided into the following tasks:
// Tasks in the algorithm for QR Decomposition
for (int k = 0; k < n; k++) // For each column
{
    // Task 1: Compute 2-norm of k-th column

    // Task 2: Compute the k-th Householder vector.

    // Task 3: Apply transformation to remaining columns.
}
Each of the tasks above can be performed with a varying degree of parallelism. So there is a parallel potential for this algorithm, and the details will be described later.
and 2, a number of empty threads are generated. This will only have an effect when the last rows and columns are being processed, and for large matrices this will constitute a relatively small amount of the total running time.
Task 1 - Two-norm
The data size for each step is d = m - k. In the sequential implementation, the running time is proportional to d. In this version, one thread block with 128 threads processes any size, meaning the running time is proportional to d/128.
Task 2 - Householder vector
Each element in the Householder vector can be calculated independently, and this task processes the remaining rows. The data size for each step is d = m - k. The sequential implementation has a running time proportional to d, while this version, as for task 1, has a running time proportional to d/128.
Task 3 - Transform columns
When task 1 and 2 have been performed, the rest of the matrix must be updated, which is the remaining columns and rows. Each column can be generated independently, and each of the 128 threads processes, for each step, a share of the remaining columns.
[Figure: performance in gigaflops (0 to 0.70) versus data size (0 to 6000) for systems #1 to #4 and the CPU]
All systems were used in the tests and the CPU was used to calculate a reference result. The CPU results indicate that the CPU is slower when the matrix gets bigger. This makes good sense, as the CPU is not able to exploit its caches for large problem sizes. The performance of the GPU gets better when the matrix size increases up to about 2000, after which the performance evens out. System #3 is the only architecture that performs better than the CPU, so this specific implementation does not yield an acceptable performance on the GPU. The maximum difference from the CPU reference result was 0.000049, so the implementation is considered acceptably accurate.
GPU.NET
QR-decomposition is not tested with GPU.NET, for the same reasons as described in the LU-decomposition chapter.
8.3 Optimisation
Task 1 - Two-norm
This task can be improved by using a parallel reduction algorithm. Doing so makes it possible to decrease the asymptotic running time to O(log d).
Task 2 - Householder vector
Each step has a data size of d, and the running time of the sequential implementation is proportional to d. Each element in the Householder vector can be calculated independently. The maximum possible number of processes therefore equals the data size, making the asymptotic parallel running time O(1) if at least that many processes are available; if not, the asymptotic running time is O(d/p), where p is the number of processors.
Task 3 - Transform columns
The number of operations required to update the remaining columns and rows of the matrix is, for each step, proportional to the number of remaining rows times the number of remaining columns. The columns can be processed in parallel, so the parallel running time for each step is reduced by a factor equal to the number of columns that can be processed in parallel.
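The parallel reduction suggested for task 1 can be illustrated with a sequential C sketch of the same access pattern. Each pass of the outer loop corresponds to one parallel step, in which all additions are independent of each other and could be carried out by separate threads. The function name tree_reduce_sum_of_squares is hypothetical, and the square root needed for the two-norm is assumed to be taken by the caller afterwards.

```c
#include <stddef.h>

/* Pairwise (tree) reduction of the squared values in work, done in place.
   Stores the sum of squares in *sum_out and returns the number of
   reduction steps, which is ceil(log2(count)) for count > 1. */
int tree_reduce_sum_of_squares(float *work, size_t count, float *sum_out)
{
    int steps = 0;
    size_t active = count;

    /* Square every element; fully independent per element. */
    for (size_t i = 0; i < count; i++)
        work[i] *= work[i];

    /* Halve the number of active elements in each step. */
    while (active > 1) {
        size_t half = (active + 1) / 2;
        for (size_t i = 0; i + half < active; i++)
            work[i] += work[i + half];
        active = half;
        steps++;
    }

    *sum_out = (count > 0) ? work[0] : 0.0f;
    return steps;
}
```

For 8 values the sketch needs 3 steps and for 5 values also 3, compared with 7 and 4 sequential additions, which is the logarithmic behaviour referred to above.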
With these optimisations implemented, the tests were performed again for 400 × 400,
[Figure: performance in gigaflops (0 to 1.60) versus data size (0 to 6000) for systems #1 to #4 and the CPU]
The optimisations have increased performance on all systems. Matrices larger than approximately 1000 × 1000 now benefit from being computed using the Cuda architecture.
The peak performance for system #3 reached 1.42 gigaflops, not as impressive as the results achieved by matrix-multiplication and LU-decomposition, but still about 5 times as fast as the CPU. One of the reasons for this low performance is that the algorithm is not that well suited for a parallel architecture.
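The first step is to perform a regular QR factorisation of the current block, using Householder transformations of the form $H = I - \tau v v^T$, where $v$ is the Householder vector and $\tau = 2 / (v^T v)$. It can be shown that the product of these transformations can be written as $Q = I - V T V^T$, where $V$ holds the Householder vectors of the block and $T$ is a triangular factor. So in step 2, the triangular factor $T$ is calculated. The triangular factor is used together with the transformation above to update the remaining matrix. After the final step the matrix A has been decomposed by using a block QR-decomposition, as depicted here. The white parts have already been solved.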
dimension of the current block being processed (here the green and purple sub-matrix). Step 1 QR-decomposes the green and purple sub-matrix by regular QR-decomposition, then the triangular factor is used to transform the remaining columns in the cyan and blue sub-matrix. The steps are then continued for the remaining parts until the whole M x M matrix is processed. This algorithm makes it possible to partition large matrices and solve smaller parts. Furthermore, it uses matrix operations that can utilise the parallel Cuda architecture.
8.4.2 Implementation
Implementation of this block algorithm has been challenging. The structure of the algorithm resembles the block LU-decomposition, which was implemented and performed well. This made me hope that the block QR algorithm could also be implemented and perform well. Unfortunately this has not been the case. Implementation of the algorithm was initially attempted for CPU processing. Thorough debugging and testing have revealed that most of the algorithm works and generates the expected result. Regrettably, not all parts work as hoped. The problematic part is related to the transformation of the remaining columns. This transformation can be divided into three steps, using the explanation and figures from above. The first step is a matrix-multiplication between the transposed block of Householder vectors and the sub-matrix consisting of the remaining columns in the matrix A. The result is then written to the matrix W.
When the final matrix W is computed, it is used in a matrix-multiplication with the block of Householder vectors. The elements are then subtracted from the sub-matrix, which should give the transformed sub-matrix.
These steps should then be repeated until the complete matrix has been decomposed. But numerous attempts at calculating the triangular factor have failed, and the resulting matrix and array contain wrong values.
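As a point of comparison, the three steps can be sketched in C as below. This is not the implementation from this project: it assumes the compact WY form known from LAPACK, in which the update applies I - V T^T V^T to the remaining columns, and the middle step (the multiplication by the transposed triangular factor) is part of that assumption. The function name apply_block_reflector and the argument layout are hypothetical.

```c
/* Applies (I - V * T^T * V^T) to the m x c sub-matrix a2. All matrices
   are row-major. v is the m x b block of Householder vectors, t is the
   b x b upper triangular factor and w is a caller-provided b x c
   workspace. */
void apply_block_reflector(const float *v, const float *t, float *a2,
                           float *w, int m, int b, int c)
{
    /* Step 1: W = V^T * A2 */
    for (int i = 0; i < b; i++)
        for (int j = 0; j < c; j++) {
            float s = 0.0f;
            for (int r = 0; r < m; r++)
                s += v[r * b + i] * a2[r * c + j];
            w[i * c + j] = s;
        }

    /* Step 2: W = T^T * W, in place. Row i of the result only needs
       rows 0..i of the old W, so the rows are processed bottom-up. */
    for (int i = b - 1; i >= 0; i--)
        for (int j = 0; j < c; j++) {
            float s = 0.0f;
            for (int r = 0; r <= i; r++)
                s += t[r * b + i] * w[r * c + j];
            w[i * c + j] = s;
        }

    /* Step 3: A2 = A2 - V * W */
    for (int r = 0; r < m; r++)
        for (int j = 0; j < c; j++) {
            float s = 0.0f;
            for (int i = 0; i < b; i++)
                s += v[r * b + i] * w[i * c + j];
            a2[r * c + j] -= s;
        }
}
```

A simple way to test such a routine is a block of width one: with v = (1, 1) and a factor of 2 / (v^T v) = 1, the column (3, 4) must be reflected to (-4, -3).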
8.5 Evaluation
The algorithms for LU- and QR-decomposition have a similar structure, so ideas from the LU implementation were also applied to the QR implementations. Optimising the running time of the different tasks proved to increase performance by a factor of 3.84. Unfortunately, due to the lacking implementation of a block QR algorithm, no further tests were performed. This means that the full potential of QR on the Cuda architecture is still to be unfolded. Jack Dongarra, Susan Ostrouchov and others have designed this block QR algorithm. They are highly competent people who have made contributions to Eispack, Linpack, BLAS, Lapack and ScaLapack. The challenge with the triangular factor is more than likely related to my implementation rather than the algorithm.
9 Evaluation
The optimisation strategy described some methods and techniques that could be applied when improving the implementation of the linear algebra algorithms. This evaluation chapter will summarise the findings and evaluate the strategy.
9.1 Cuda
The Nvidia profiler can show relevant counters for both arithmetic and memory performance. CGMA source code analysis can give valuable information about memory bandwidth as a limiting factor. The results from the tests suggest that a block linear algorithm is best suited for the Cuda architecture. Such an algorithm is designed to divide data into sizes that fit into caches, such as shared memory. When implementations are to be optimised, the findings from this project suggest that tiling is the best strategy, followed by latency hiding and coalescing memory access.
With regard to coalescing memory access, it should be mentioned that GPU architecture designers are aware of the importance of this limiting factor, so newer GPUs are designed with built-in optimised memory access. The impact of non-coalesced memory access should therefore be of less importance in the future, and hence make porting of existing algorithms easier.
In addition to the points above, here is a list of recommendations based on the findings of the tests performed in this project:
- Avoid using structures as parameters in the kernel definitions; use simple types or pointers to them instead.
- Target the highest possible Compute Capability level. Among other things, the precision of instructions is better and the result will be more accurate.
- Unroll loops by making the threads fine-grained. Thread generation and scheduling are cheap.
- Thread block size should be a multiple of the warp size (currently 32).
- Be aware of the overhead of invoking a kernel.
- Note that default instructions deviate from IEEE 754. Use specific IEEE 754 functions for increased precision, but at the cost of speed.
Besides the list and suggestions above, there were also methods with doubtful results:
- The Volkov suggestion yielded performance gains on some systems, but losses on others. It can be useful for low occupancy kernels, but should be tested and evaluated.
- Data prefetching can both increase and lower performance.
The underlying hardware and its capabilities play an important role in whether an optimisation technique affects performance. Some methods have a positive effect on some GPUs and a negative effect on others. Analysis and testing should therefore always be performed.
9.2 GPU.NET
GPU.NET v1.0.3.5 was not mature and suffered from several bugs. The number of problems means that it cannot be recommended for production environments. However, the latest release is v2.0.14, which solves many of the bugs and problems I encountered. The JIT compilation of kernels is a design decision that applies to all current versions of GPU.NET. A JIT compilation is cached in memory, and subsequent calls from the same process will be served from this cache. It is therefore recommended, when using GPU.NET for large or numerous problems, to warm up both Cuda and GPU.NET. Do this by calling the kernel with a small data size; subsequent calls will then be served faster.
10.1 Project
A more thorough correctness test and analysis could further clarify the numerical stability of the implementations used in this project, for example by comparing this project's results with results from the widely recognised Matlab. For this project I insisted on implementing all parts of the algorithms. A lot of work and research has gone into the development of standard math libraries supporting the BLAS interface. Implementing all parts gave valuable insight into the inner workings of the algorithms, but possibly at the expense of performance. Using these libraries (e.g. Cublas or Cula) could reveal the full performance potential of the different algorithms on the Cuda architecture. Testing the performance of other linear algebra algorithms could serve as a frame of reference. For example, how would Givens rotations affect the performance of QR-decomposition, compared to the Householder transformation method chosen? A more thorough analysis and testing of the QR block algorithm would also be beneficial. The optimisation strategy and the optimisation experiences could be applied to several other linear algebra algorithms. An obvious extension would be the Singular Value Decomposition (SVD).
10.2 Cuda
Cuda C and GPU.NET currently represent two different directions for utilising the Cuda architecture. Cuda C is C/C++ and complex, whereas GPU.NET is .NET, uses code generation, and is easier to use. You might say that GPU.NET is for developers who want to accelerate their applications using parallel architectures without too much trouble. Cuda C, on the other hand, is for developers who are not intimidated by C/C++ and tweaking. Cuda C offers more flexibility, which enables better optimisation and higher performance. It does not, however, have to be a choice between the two sets of advantages. Cudafy.NET is a set of libraries and tools supporting both directions.
Cudafy.NET can be used in the same way as GPU.NET, using full code generation. But it can also just work as a bridge from .NET to Cuda C kernels. Cuda C optimisations are then possible, while the invocation is carried out by the .NET runtime. Uncovering the performance characteristics of Cudafy.NET, e.g. using the linear algebra algorithms from this project, could be another valuable next step. It is expected that Cuda will continuously be improved, e.g. by making NVCC support C++ language features in kernels, allowing better debugging in Nsight, and increasing the language support features in IDEs, to make development smoother. With Cuda v4.0 the tools and drivers have been updated, and now enable a grid of machines and GPUs to work together to solve large problems. This makes Cuda able to solve even larger problems than with former versions.
10.3 Hardware
The newer Cuda GPUs are becoming increasingly accurate, meaning the instructions are performed with better numerical precision, at even faster speeds. Double-precision instructions have been supported from Compute Capability 1.3, and it is expected that these will also become more precise, together with faster processing times. The future will surely also bring GPUs with even more cores and faster memory. Currently the architecture of the Nvidia GF100 chips supports up to 512 cores, but the dedicated GPU computing system Tesla S2050 has 4 GPUs with a total of 1,792 cores. Nvidia is not the only player when it comes to GPGPU. AMD has the FireStream architecture, and the top model FireStream 9370 has 1,600 cores delivering 2,640 gigaflops. Looking at the latest TOP500 supercomputers list, 3 of the top 5 are using Nvidia GPUs. So Nvidia is a strong player, and I expect Nvidia and Cuda to play an important role in the GPGPU field in the future.
A graphics card with a high performing GPU is a relatively cheap commodity, and many regular computer systems are today equipped with a high performing GPU. Some application developers have spotted this opportunity and now allow their application to be optionally accelerated by the GPU. This is often completely transparent to the end user, but delivers improved application responsiveness, which gives the user a better experience. Applications that currently exploit this possibility include, but are not limited to, different browsers, such as Internet Explorer, Chrome and Firefox, and different video editing applications.
11 Conclusion
Three frequently used linear algebra algorithms, for matrix-multiplication, LU- and QR-decomposition, were chosen for this project. They were described, analysed, and then initially implemented using C/C++ for the CPU architecture. The Cuda architecture and development platform were subsequently analysed and described. Important features, characteristics and limitations were uncovered and an optimisation strategy was formed. Based on the analysis of the linear algebra algorithms and Cuda, implementation procedures were designed. Then the algorithms were implemented targeting the Cuda architecture, using C/C++ and Cuda C, after which they were tested. During this process different findings were made, which were subsequently used in combination with the Cuda optimisation strategy to improve performance. GPU.NET was used, where applicable, as a perspective on how to use Cuda from .NET. Correctness tests were performed by comparing the results from the CPU with the results from the GPU. The maximum differences documented the accuracy of the different algorithms processed on various systems and GPUs. The learning goals have all been achieved and the complete process has been documented in this report.
15. John J. Barton and Lee R. Nackman, Scientific and Engineering C++: An Introduction with Advanced Techniques and Examples, 19th August 1994
16. Jens Eising, Lineær Algebra, 1999
17. G. W. Stewart, Afternotes on Numerical Analysis, 1996
18. E. E. Santos and M. Muraleetharan, Analysis and Implementation of Parallel LU-Decomposition with Different Data Layouts, June 2000
19. Prof. Michael T. Heath, Parallel Numerical Algorithms: LU-Decomposition (slides), 2010
20. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra, November 2008
21. Vasily Volkov and James W. Demmel, LU, QR and Cholesky Factorisations using Vector Capabilities of GPUs, 2008
22. Jack Dongarra et al., Derivation of a Block Algorithm for LU Factorization, 9th February 1997
23. Peter J. Olver, Orthogonal Bases and the QR Algorithm, 5th June 2010
24. Prof. Michael T. Heath, Parallel Numerical Algorithms: QR-Factorization (slides), 2010
25. Walter Gander, Algorithms for the QR-Decomposition, April 1980
26. Radu Trîmbiţaş, Householder Reflectors and Givens Rotations: Why orthogonality is fine, 11th March 2009
27. Susan Ostrouchov, QR Factorization (a block algorithm), 28th April 1995
Variable | Description
gridDim.x | Holds the number of blocks in the first dimension of the grid. Values are valid in the range 1-65535.
gridDim.y | Holds the number of blocks in the second dimension of the grid. Values are valid in the range 1-65535.
blockDim.x | Holds the number of threads in the first dimension of the block. Values are valid in the range 1-512.
blockDim.y | Holds the number of threads in the second dimension of the block. Values are valid in the range 1-512.
blockDim.z | Holds the number of threads in the third dimension of the block. Values are valid in the range 1-64.
blockIdx.x | Holds the current block's first dimension position in the grid. Values are valid in the range 0 to gridDim.x - 1.
blockIdx.y | Holds the current block's second dimension position in the grid. Values are valid in the range 0 to gridDim.y - 1.
threadIdx.x | Holds the current thread's first dimension position in the current block. Values are valid in the range 0 to blockDim.x - 1.
threadIdx.y | Holds the current thread's second dimension position in the current block. Values are valid in the range 0 to blockDim.y - 1.
threadIdx.z | Holds the current thread's third dimension position in the current block. Values are valid in the range 0 to blockDim.z - 1.
Why has Nvidia designed a thread structure in up to five dimensions? Would it not be easier to just use a single dimension?
For simple algorithms that only require a thread structure in one dimension, this can be achieved. But there exist problems that naturally belong to a space of two dimensions or more, e.g. a matrix. The structure is optional, meaning that the developer, within some hardware limitations, decides how many dimensions are used. The total number of threads T is the result of the following:
$T = gridDim.x \cdot gridDim.y \cdot blockDim.x \cdot blockDim.y \cdot blockDim.z$
The size of the grid and blocks is often defined directly in the source code, but the optimal size is in many cases directly dependent on the data size. This is not very flexible, as it means that grid and block size would have to be adjusted, in the source code, for different data sizes, and afterwards recompiled before execution. There are different solutions to this. One way is to set the number high to cover most cases. In the kernel one would have to check if the current thread actually has data to process like so in line 6:
1. __global__ void kernel(float *data, int dataSize) {
2.
3.     // Thread ID
4.     int tid = threadIdx.x + blockIdx.x * blockDim.x;
5.
6.     if (tid < dataSize) {
7.
8.         // Process data
9.     }
10. }
This is inefficient, as many threads will be spawned without any actual data to process. Another way is to define the number of threads per block (e.g. 128), and then calculate the number of required blocks from the data size. This makes sure that at most (threadsPerBlock - 1) threads are created without any data to process. A third way is to calculate the grid and block size dynamically from the data size. This is however difficult, as the optimal setting is influenced by both the data size and the structure of the algorithm.
Either the second or the third method can prove feasible. Both have pros and cons, and which method to use should be determined on a case-by-case basis.
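The second method amounts to a rounded-up integer division on the host side, which can be sketched in C as follows. The function names blocks_for and idle_threads are hypothetical and only serve the illustration.

```c
/* Smallest number of blocks so that blocks * threads_per_block covers
   data_size (rounded-up integer division). */
unsigned int blocks_for(unsigned int data_size, unsigned int threads_per_block)
{
    return (data_size + threads_per_block - 1) / threads_per_block;
}

/* Number of threads launched without data to process; this is never
   more than threads_per_block - 1. */
unsigned int idle_threads(unsigned int data_size, unsigned int threads_per_block)
{
    return blocks_for(data_size, threads_per_block) * threads_per_block - data_size;
}
```

With 128 threads per block, a data size of 1000 gives 8 blocks and 24 idle threads, while a data size of 1024 gives 8 blocks and no idle threads. The bounds check in line 6 of the kernel above is still needed for the idle threads.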
Elapsed time
Measuring elapsed time is essential to measuring performance. Normal event timing in C and C++ is CPU based, which is insufficient when dealing with the GPU. The GPU and CPU are physically two independent processors, which run in parallel. The Cuda toolkit provides an API for measuring GPU events and elapsed time. The Cuda API will be used to measure memory allocation, copy of data from host to device, the kernel execution time, copy of data from device to host and the release of memory. These different timers will not just give the elapsed times of different operations, but also valuable insight into the GPU performance. It will for instance be possible to calculate memory transfer rates as well as the actual peak performance in gigaflops of the kernel. In addition, the timings can be used to measure relative performance gains or losses when certain properties or capabilities of the Cuda architecture have been applied to the algorithms. The GPU timing will also serve as a base for comparison with the similar linear algebra processes on GPU.NET and the CPU.
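Once an elapsed time is available, the derived figures are simple to compute. The C sketch below assumes that the time is given in milliseconds, which is the unit reported by the Cuda event function cudaEventElapsedTime; the helper names gigaflops and transfer_rate_gb_per_s are hypothetical, and the operation count has to be supplied by the caller from the analysis of the algorithm.

```c
/* Gigaflops from an operation count and an elapsed time in milliseconds:
   operations / (ms * 1e-3 seconds) / 1e9. */
double gigaflops(double operations, double elapsed_ms)
{
    return operations / (elapsed_ms * 1.0e6);
}

/* Memory transfer rate in GB/sec (1 GB taken as 1e9 bytes) from the
   number of bytes copied and an elapsed time in milliseconds. */
double transfer_rate_gb_per_s(double bytes, double elapsed_ms)
{
    return bytes / (elapsed_ms * 1.0e6);
}
```

As an example, 2 billion operations completed in 1000 ms correspond to 2 gigaflops, and 4 billion bytes copied in 2000 ms correspond to 2 GB/sec.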
Matrix structure
Matrices are two-dimensional mathematical structures. In computer memory a matrix can be represented either by a 2-dimensional array or by an array of arrays. Even though 2-dimensional structures are available, it is better to vectorise the matrix by placing the rows one after another in a single array. A specific value in the vector of matrix A is accessed like so: v[3 * Width + 2], where v is the vector of matrix A and Width is the column count of A; this addresses the element in row 3, column 2, counting from zero. The Cuda architecture is designed to be stream based, so by vectorising data for processing on the GPU platform, one uses Cuda as it was designed and intended. For the code I use the following matrix structure to hold the vector and details about the matrix.
typedef struct
{
    float *n;
    unsigned int width;
    unsigned int height;
    unsigned int size;
} matrix;
n is the pointer to the vector of float values, width is the number of columns, height is the number of rows and size is the length of the vector (height * width).
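A minimal host-side sketch of how this structure might be used is shown below. The struct is repeated so that the example is self-contained; the helpers matrix_create, matrix_index and matrix_free are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
    float *n;
    unsigned int width;
    unsigned int height;
    unsigned int size;
} matrix;

/* Hypothetical helper: allocate a zero-initialised height x width matrix. */
matrix matrix_create(unsigned int height, unsigned int width)
{
    matrix m;
    m.width = width;
    m.height = height;
    m.size = height * width;
    m.n = (float *)calloc(m.size, sizeof(float));
    return m;
}

/* Hypothetical helper: position of element (row, col) in the vector,
   with the rows stored one after another. */
unsigned int matrix_index(const matrix *m, unsigned int row, unsigned int col)
{
    assert(row < m->height && col < m->width);
    return row * m->width + col;
}

/* Hypothetical helper: release the vector and reset the size. */
void matrix_free(matrix *m)
{
    free(m->n);
    m->n = NULL;
    m->size = 0;
}
```

For a matrix with 4 rows and 5 columns, matrix_index(&a, 3, 2) evaluates to 3 * 5 + 2 = 17, which is the same expression as v[3 * Width + 2] above.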
Platform #1
Apple MacBook 13 with an Intel Core 2 Duo P8700 2.53 GHz processor, 4 GB DDR3 RAM at 533 MHz, an Nvidia GeForce 9400M and a front-side bus (FSB) at 1066 MHz. The GPU in the machine has the following specifications:
Cores: 16
Memory interface: 128-bit
Memory bandwidth (internal/external): 8 GB/sec / 16.6 GB/sec
Graphics bus interface (PCI-E v2.0): 8 GB/sec
Transistors: 282 million
Core clock: 450 MHz
Shader clock: 1100 MHz
Memory clock: 1066 MHz (533 MHz double pumped)
Gigaflops: 51.56
Cuda Compute Capability: 1.1
Platform #2
Apple iMac 24 with an Intel Core 2 Duo E8435 3.06 GHz processor, 4 GB DDR2 RAM at 399 MHz, an Nvidia GeForce 8800 GS and an FSB at 1066 MHz. The GPU in the machine has the following specifications:
Cores: 96
Memory interface: 256-bit
Memory bandwidth (internal): 49.94 GB/sec
Memory bandwidth (external): 6.23 GB/sec
Graphics bus interface (PCI-E v1.1): 8 GB/sec
Transistors: 754 million
Core clock: 500 MHz
Shader clock: 1250 MHz
Memory clock: 800 MHz
Gigaflops: 234.38
Cuda Compute Capability: 1.1
Platform #3
Platform #3 is a machine with an Nvidia Tesla C1060 GPU. The exact machine specifications have not been available; however, the specifications of the C1060 GPU give some indication of the performance.
Cores: 240
Memory interface: 512-bit
Memory bandwidth (internal): 102.4 GB/sec
Transistors: 754 million
Core clock: 602 MHz
Shader clock: 1300 MHz
Memory clock: 1600 MHz
Gigaflops: 933.12 for Total (Mul+Add+Special Function), 622.08 for Total (Mul+Add)
Cuda Compute Capability: 1.3
Platform evaluation
You may wonder why platform #3 has two different gigaflops figures. The first is based on the specifications of the G80 and its descendant architectures, which state that a GPU is capable of performing a Multiply-Add instruction dual-issued with a special function instruction per operation cycle. The second is based on the newer Fermi architecture specifications, in which an operation cycle can perform a dual-issued Multiply-Add instruction. That a newer architecture is supposedly slower than an older one contradicts the logic of development and improvement, and it is in fact not the case. The G80-based architectures are equipped with streaming processors (SP) and separate special function units (SFU). The SP combined with the SFU theoretically gives 3 operations per clock cycle; however, basing a gigaflops calculation on these specifications makes the result very theoretical. The calculation may be formally correct, but it does not yield an achievable result. The reason is most likely Nvidia's competition with other GPU manufacturers to produce the GPU with the highest gigaflops count. Most development and testing are performed on platform #1, so this platform will serve as a base.
Specifications
This paragraph digs a little deeper into the specifications of the hardware and describes the theoretical performance limits. When dealing with GPUs, the most important figures are the memory transfer rates and the GPU gigaflops. The relevant elements are the chipset, the front-side bus (FSB), the memory speeds and the GPU.
Chipset
The chipset consists of a north and a south bridge. The north bridge is responsible for handling the exchange of data between the CPU, memory and the graphics adapter. The south bridge handles the exchange of data with external devices like audio, network, hard discs and USB devices. The north bridge is the most data intensive and the most relevant for this project, whereas the south bridge is not used for GPU accelerated applications. The bus speed of platform #1 is 266 MHz with a multiplier of 4, making the rated FSB about 1066 MHz. The width of the bus is 64-bit, making the transfer rate:

$$\frac{1066\ \text{MHz} \times 64\ \text{bit}}{8\ \text{bit/byte} \times 1024} \approx 8.33\ \text{GB/s}$$
Memory
Memory transfer occurs when data is copied from host to device and again when the result is copied from device to host. This data is transferred via the chipset's north bridge from the CPU/system memory to the device memory. The GPU of platform #1 has no dedicated memory and uses the system memory. The transfer rate is for that reason equal to that of the system memory. The system memory consists of two DDR3 modules whose peak transfer rate is double that of the FSB (double data rate), meaning 16.66 GB/s.
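The peak rates quoted for the FSB and the system memory can be reproduced with a small calculation. The sketch below assumes that the rate is the rated clock multiplied by the bus width in bytes and divided by 1024 to go from MB/s to GB/s; the function name and the transfersPerClock parameter are hypothetical and only used for illustration.

```c
#include <assert.h>

/* Hypothetical helper: peak transfer rate in GB/s of a bus.
   clockMHz          rated clock in MHz (1066 for the FSB of platform #1)
   busWidthBits      width of the bus in bits (64 here)
   transfersPerClock 1 for the FSB figure, 2 for the doubled memory figure
   The division by 1024 (MB/s to GB/s) is an assumption that matches the
   figures quoted in the text. */
double bus_transfer_rate_gb_per_s(double clockMHz,
                                  unsigned int busWidthBits,
                                  unsigned int transfersPerClock)
{
    assert(busWidthBits % 8 == 0 && transfersPerClock > 0);
    return clockMHz * transfersPerClock * (busWidthBits / 8) / 1024.0;
}
```

Under these assumptions, bus_transfer_rate_gb_per_s(1066, 64, 1) gives about 8.33 GB/s for the FSB and bus_transfer_rate_gb_per_s(1066, 64, 2) about 16.66 GB/s for the memory.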
Graphics adapter
The software GPU-Z reports that the GPU of platform #1 runs on a PCI port. This cannot be true, as the transfer rate would then be about 2 GB/sec. My guess is that, as the Nvidia specification says, it runs on a PCI Express 2.0 bus interface with a peak transfer rate of 8 GB/s one way, which incidentally is the same as the memory transfer rate.
Platform #1 has 16 cores with a shader clock of 1100 MHz. The theoretical number of operations per cycle is 3 (Mul+Add+SF), which in turn results in a gigaflops count of 51.56.
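The gigaflops figure can be reproduced as cores times shader clock times operations per cycle. The sketch below assumes a division by 1024 to go from megaflops to gigaflops, which is inferred from the quoted 51.56 and is not stated in the specifications; the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: theoretical peak gigaflops of a GPU.
   cores           number of streaming processors
   shaderClockMHz  shader clock in MHz
   opsPerCycle     3 for Mul+Add+SF, 2 for Mul+Add only
   The division by 1024 is an assumption inferred from the quoted figure. */
double theoretical_gigaflops(unsigned int cores,
                             double shaderClockMHz,
                             unsigned int opsPerCycle)
{
    assert(cores > 0 && opsPerCycle > 0);
    return cores * shaderClockMHz * opsPerCycle / 1024.0;
}
```

For platform #1, theoretical_gigaflops(16, 1100, 3) gives about 51.56. Counting only Mul+Add (2 operations per cycle) gives about 34.38, which illustrates how much of the headline figure depends on the special function unit.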
Evaluation
Development and testing will mainly be performed on platforms #1 and #2, even though they lack the extreme computing power that platform #3 possesses. The purpose of this project is to test the applicability of GPGPU for solving different problems, and the focus is furthermore on testing the relative performance gains or losses of different optimisation techniques. Platform #3 will, however, give important insight into the performance of solving these problems on a massively parallel architecture. Platform #3 furthermore supports a higher compute capability, which makes even more optimisation techniques available, as well as double-precision operations.
Development model
Cuda toolkit version 3.2 supports Microsoft Visual Studio 2005 (VS2005) and Visual Studio 2008 (VS2008). It is possible to enable development in Visual Studio 2010 (VS2010), but this has proven difficult to set up. Among other reasons, this is because the Nvidia Cuda compiler (NVCC) requires either a Visual C++ version 8 or version 9 compiler. I have tried to set Cuda up for VS2010, but the trouble has led me to the conclusion that the problems and minor inconveniences far exceed any gains achieved by using VS2010. The Nvidia GPU Computing SDK, a separate package, provides help, tutorials, utility helpers and code examples. With this package, all the hard work of configuring Visual Studio and setting up paths and environment variables is done for you. However, it also brings libraries packed with utility and helper functions and references to other libraries. Performance is important in this project, and it is not possible to say what impact any referenced libraries or utility functions might have. I have therefore decided to create a new and clean project model that can serve as a base for the performance tests in Cuda. By doing so, I also gain valuable insight into the structure of the toolkit and its applicability.
A description of what steps I had to take, and the project model can be found on my blog: http://blog.ovesens.net/2011/05/cuda-v3-2-template-project-using-cpp/
Description
Number of divergent branches within a warp. This counter is incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch. The counter is incremented by one at each point of divergence in a warp.
Number of instructions executed.
Number of non-coalesced global memory loads.
Number of non-coalesced global memory stores.
Number of coalesced global memory loads.
Number of coalesced global memory stores.
local load
Number of local memory load transactions. Each local load request will generate one transaction irrespective of the size of the transaction.
local store
Number of local memory store transactions; incremented by 2 for each 32-byte transaction, by 4 for each 64-byte transaction and by 8 for each 128-byte transaction for compute devices having compute capability 1.x. It is incremented by 1 irrespective of the size of the transaction for compute devices having compute capability 2.0.
Table 14 - Selected profile counters from the Compute Visual Profiler User Guide
These profile counters can give valuable insight into what a kernel actually does, but they cannot be used without consideration. Nvidia writes the following: "Compute Visual Profiler values are best used to identify relative performance differences between un-optimized and optimized code." Nevertheless, comparing the profiled numbers with the analysed numbers gives a good estimate of how much bandwidth is wasted by suboptimal coalescing of memory access [9].
31.97 gigaflops
31.95 gigaflops
33.57 gigaflops
32.83 gigaflops
32.84 gigaflops
34.57 gigaflops
33.46 gigaflops
33.43 gigaflops
36.90 gigaflops
39.47 gigaflops
39.43 gigaflops
37.06 gigaflops
Results from the matrix-multiplication compute capability levels test on platform #4.