Abstract – Central Processing Units (CPUs) are task-parallel, latency-oriented processors, while Graphics Processing Units (GPUs) are data-parallel, throughput-oriented processors. Besides their traditional use as graphics coprocessors, GPUs have in recent years also been used for general-purpose computations. The rapid development of graphics hardware has led to its extensive use in both scientific and commercial applications, and numerous papers report high speedups in various domains. This paper presents an effort to bring GPU computing closer to programmers and a wider community of users. GPU computing is explored through the NVIDIA Compute Unified Device Architecture (CUDA), currently the most mature application programming interface (API) for general-purpose computation on GPUs.

I. INTRODUCTION
From their beginning, graphics processing units (GPUs) have been used for specialized, intensive computations in the domain of computer graphics. Rendering of 3D graphics is a good example of highly intensive parallel computation: it includes computations for both geometry (vertices) and rasterization (pixels). Strong pressure from the fast-growing gaming industry and the demand for fast, high-definition graphics led to constant innovation and the evolution of GPUs into highly parallel, programmable processors with high GFLOPS ratings and high throughput.
Central processing units (CPUs) are task-parallel, latency-oriented processors with transistors devoted to caching and sophisticated flow control. In contrast, GPUs are data-parallel, throughput-oriented processors that hide relatively expensive global memory accesses through the extensive use of parallel threads. Contemporary CPUs can be considered multicore processors, since they need only a few threads to reach their full capacity, while GPUs are manycore processors that need thousands of threads to be fully utilized. GPUs can be seen as coprocessing units for CPUs, suitable for problems that exhibit high regularity and arithmetic intensity. Typically, the sequential parts of a program are executed on the CPU, while the compute-intensive parts are offloaded to the GPU to speed up the whole process.
GPUs are found in many applications. At the very beginning they were used mostly in academia, but since they brought significant speedups in several domains, they were adopted by the wider research community. GPUs are nowadays used in computational physics, computational chemistry, the life sciences, and signal processing, as well as in finance, oil and gas exploration, etc.

The rest of the paper is organized as follows. The second section presents a short history of modern GPUs and further discusses general-purpose computation on graphics processing units (GPGPU). The third section presents an overview of the Compute Unified Device Architecture (CUDA), which exploits GPU computing power for general-purpose computation. The fourth section concentrates on issues related to CUDA program execution and performance optimization. The fifth section gives a brief discussion of application domains suitable for GPU computing and presents some programming primitives and common algorithms. The sixth section reveals some of the experiences and lessons learned by the authors of this paper. The final section briefly concludes the paper and discusses future trends in GPU computing.

This work has been partially funded by the Ministry of Education and Science of the Republic of Serbia (III44009 and TR32039). The authors gratefully acknowledge the financial support.

II. TOWARDS GENERAL PURPOSE COMPUTATION ON GPUS

In the early stages of their existence, GPUs were fixed-function processors used primarily for 3D graphics rendering. The entire execution was built around a graphics pipeline that was configurable to some extent, but not programmable [1]. GPUs were specialized to render a screen picture through several steps, as shown in Figure 1. Around the same time, 3D graphics application programming interfaces (APIs) became popular, notably OpenGL and the Direct3D component of Microsoft DirectX. They exposed several programmable features to programmers. The move towards more programmable processors was driven by the fact that programmers wanted to create more complex visual effects on the screen than those provided by the fixed-function graphics hardware. In these early days of GPU programming, the programmer had to send short programs to the GPU by calling API functions. These programs processed graphics primitives.

A. Graphics hardware evolution

In graphics pipelines, certain stages do a great deal of floating-point arithmetic on completely independent data, such as transforming the positions of triangle vertices or generating pixel colors. This data independence, as the dominant application characteristic, is the key difference between the design assumptions for GPUs and CPUs [1]. It allows substantial hardware parallelism to be used to process bulks of data.

The first attempts to make commercially available graphics processors more programmable exposed vertex shader programmability, in the NVIDIA GeForce 3 [2]. Further development provided programmability of the pixel shaders and full 32-bit floating-point arithmetic.
On the other hand, the GPU memory organization had a very restricted memory access policy. Since the output data of the shader programs are pixels, they have predefined positions on the screen (and thus in the frame buffer). The output address of a pixel is always determined before the pixel is processed, and the processor cannot change the output location of a pixel, so the pixel processors were incapable of memory “scatter” operations [3]. For several generations, GPUs lacked an efficient way to perform a scatter operation, i.e., to write to a calculated (arbitrary) memory address.

B. Software support for general purpose computation

In the early stages of GPU development, some researchers noticed the computational power of these processors and tried to use that power to solve general, compute-intensive problems. Because typical graphics scenes consist of primitives defined by fewer vertices than the number of pixels on modern displays, the programmable stage with the highest arithmetic rates in the modern GPU pipeline is the pixel shader stage. Therefore, a typical GPGPU program used the pixel processor as the computation engine for general-purpose computation on the GPU [3].

The GeForce 8800 was also the first GPU to use scalar thread processors rather than vector processors [5]. It brought a wider instruction set to support C and other general-purpose languages, including integer and IEEE 754 floating-point arithmetic. The GeForce 8800 also eliminated memory access restrictions and provided load and store memory access instructions with byte addressing, as well as new hardware and instructions to support parallel computation, communication, and synchronization. Contemporary NVIDIA GPUs are programmed through a simple C language extension and a corresponding application programming interface or, more recently, through the OpenCL API.

Similarly to NVIDIA, ATI (now AMD) introduced a unified shader architecture with the Radeon 2000/3000 series of processors. Unlike NVIDIA, which exposed GPU programmability through a C language extension, AMD first exposed programmability through a low-level programming interface called Close-To-Metal, which did not gain much success. Its successor, the Stream SDK, included the open-source Brook+ language for GPGPU. More recently, AMD switched to fully supported OpenCL as the main platform for programming its GPUs [6].
A. Programming model

Essentially, the processing units of the GPU follow a single-program multiple-data (SPMD) programming model, since many elements are processed in parallel using the same program. Processing elements (threads) can read data from a shared global memory (a “gather” operation) and write back to arbitrary locations in shared global memory (a “scatter” operation). Generally, threads execute the same code, which is inherent to the SIMD execution model. However, since threads can follow different execution paths through branching in the same program, this leads to the more general SPMD programming model.

Programs on the CUDA platform are executed in co-processing mode. Typically, the application consists of two parts. Sequential parts of the application are executed on the CPU (host), which is responsible for data management, memory transfers, and the GPU execution configuration. Compute-intensive, parallel parts of the code are executed on the GPU (device) as special functions (kernels) that are called by the CPU.
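To make the co-processing model concrete, the following minimal sketch (our illustration, not taken from the paper; the kernel name vecAdd and all sizes are hypothetical) shows both parts of a CUDA C application: the host allocates device memory, copies the inputs, launches the kernel over a grid of thread blocks, and copies the result back.

#include <stdio.h>
#include <cuda_runtime.h>

/* Device code (kernel): each thread processes one element. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)                                       /* guard: grid may exceed n */
        c[i] = a[i] + b[i];
}

/* Host code: data management, memory transfers, execution configuration. */
int main(void)
{
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);

    float *h_a = (float *)malloc(size), *h_b = (float *)malloc(size);
    float *h_c = (float *)malloc(size);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;                       /* execution configuration */
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);                   /* expected: 3.000000 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}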
A kernel is executed by a large number of lightweight threads that run on processing units called streaming multiprocessors (SMs). Every streaming multiprocessor is a SIMD processor, currently consisting of up to 32 scalar processors. Threads are organized into blocks executed on a single multiprocessor, and kernel execution is organized as a grid of thread blocks, as shown in Figure 2. Thread blocks executed on the same SM share the available resources, such as registers and shared memory. Depending on the commercial target niche, modern GPUs have from 2 up to 32 streaming multiprocessors. Due to this organization of thread execution, a kernel scales very well across any number of parallel processors. Efficient thread management in GPUs allows applications to expose a much larger amount of parallelism than is available through hardware execution resources, with little or no penalty [1]. A compiled CUDA program executes on a GPU of any size, automatically using more parallelism on GPUs with more processor cores and threads [5].
The number of threads in a block and the number of blocks in a grid are specified through the execution configuration of the kernel. Both the block and the grid can be multidimensional (1D, 2D, or 3D) to support different memory access patterns. Every thread has a unique thread index within a block, and every block has a unique block identifier within a grid. Threads can access these identifiers and dimensions through special built-in variables. Typically, every thread uses its own indices to make branching decisions and to simplify memory addressing when processing multidimensional data.
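For illustration (a sketch of ours, not from the paper; the scalePixels kernel and the image dimensions are made up), a kernel processing a 2D image can use two-dimensional blocks and grids, with each thread deriving its pixel coordinates from the built-in variables:

__global__ void scalePixels(float *img, int width, int height, float s)
{
    /* Built-in variables give each thread its position in the grid. */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)     /* branching decision based on indices */
        img[y * width + x] *= s;     /* simplified multidimensional addressing */
}

/* Host-side execution configuration, e.g. for a 1920x1080 image:
   dim3 block(16, 16);
   dim3 grid((1920 + 15) / 16, (1080 + 15) / 16);
   scalePixels<<<grid, block>>>(d_img, 1920, 1080, 0.5f);            */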
Figure 2. CUDA execution model

CUDA provides a way to synchronize threads in the same block using the barrier synchronization primitive. However, threads in different thread blocks cannot cooperate, since they could be executed on different streaming multiprocessors. Global synchronization is possible only through repeated kernel calls, which is one of the inherent limitations of manycore architectures.
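A typical use of the barrier is sketched below (our example, assuming 256-thread blocks and an input length that is a multiple of the block size): a tree-style partial sum in which __syncthreads() guarantees that no thread reads a shared-memory element before it has been written.

__global__ void blockSum(const float *in, float *blockResults)
{
    __shared__ float buf[256];               /* per-block shared memory */
    int tid = threadIdx.x;

    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                         /* barrier: all loads finished */

    /* Tree reduction: each step halves the number of active threads. */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                     /* barrier before the next step */
    }

    if (tid == 0)                            /* thread 0 holds the block total */
        blockResults[blockIdx.x] = buf[0];
}

Combining the per-block results then requires a second kernel launch (or a host-side sum), which illustrates the global-synchronization limitation mentioned above.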
B. Memory hierarchy

CUDA offers a specific memory hierarchy to support fast, parallel execution. Threads have access to registers, local memory, shared memory, constant memory, texture memory, and global memory. All threads have access to the global device memory (DRAM). However, access to global memory is slow (approximately 200 cycles) in comparison to the other memories, so the smaller memories can be used for temporary storage and reuse of data. Every thread has exclusive access to its allocated registers and local memory. Although private to the thread, local memory is off-chip; hence, access to local memory is as expensive as access to global memory [7]. Threads in a block can share data through per-block shared memory. Shared memory can be seen as a user-managed cache (scratchpad) memory, as it can be accessed very fast (3-4 cycles). The programmer is responsible for bringing data into and moving it out of the shared memory. Threads can also access constant memory, which is read-only and cached, and texture memory, through special hardware units.
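These memory spaces appear in CUDA C through qualifiers, as in the hypothetical fragment below (our sketch, assuming 256-thread blocks; block-boundary handling is omitted for brevity): filter coefficients live in cached constant memory, a tile of the input is staged in shared memory and reused, and scalar locals reside in registers.

__constant__ float coeff[3];      /* constant memory: read-only, cached */

__global__ void smooth(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[258];   /* shared memory: per-block scratchpad */
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  /* scalar locals use registers */

    /* Stage data from slow global memory into fast shared memory. */
    if (gid < n)
        tile[threadIdx.x + 1] = g_in[gid];
    __syncthreads();

    /* Reuse the staged neighbors instead of re-reading global memory
       (interior threads only; halo handling omitted). */
    if (gid > 0 && gid < n - 1 && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        g_out[gid] = coeff[0] * tile[threadIdx.x]
                   + coeff[1] * tile[threadIdx.x + 1]
                   + coeff[2] * tile[threadIdx.x + 2];
}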
IV. PERFORMANCE CONSIDERATIONS

Although it is not difficult to write a correct CUDA program that can run on any CUDA device, performance can vary greatly depending on the resource constraints of the particular device architecture. Performance-conscious programmers therefore still need to be aware of these constraints to make good use of contemporary hardware.

A. Thread scheduling and execution

Logically, threads are executed in thread batches called thread blocks. Physically, however, every thread is executed on an SM in 32-thread units called warps. Threads in a block are partitioned into warps based on their indices, so consecutive threads can be considered to reside in the same warp. Threads in a block can execute in any order relative to each other, so synchronization is needed to avoid race conditions.
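The cost of branch divergence can be seen in a pair of hypothetical kernels (our sketch, not from the paper) that perform the same mix of operations. In the first, even and odd threads of the same warp take different branches, so each warp executes both paths serially; in the second, the branch changes only at warp (32-thread) boundaries, so every warp follows a single path.

/* Divergent: threads within one warp follow different paths. */
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

/* Warp-aligned: the branch granularity is a whole 32-thread warp. */
__global__ void warpAligned(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}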
Several libraries implement common algorithms on the GPU. These include an FFT library (cuFFT), a BLAS library (cuBLAS), pseudorandom-number generation (cuRAND), and the CUDPP (CUDA Data Parallel Primitives) library, which implements primitives such as scan, sorting, and reduction.

[Figure: execution time in ms (logarithmic scale) versus input size, 10^3 to 10^7]
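As a brief sketch of the library approach (our illustration; error handling is omitted, and d_x and d_y are assumed to be device arrays already filled with data), the cuBLAS v2 API reduces a SAXPY computation to a single call:

#include <cublas_v2.h>

/* Computes y = alpha * x + y on the GPU without a hand-written kernel. */
void saxpyOnGpu(int n, float alpha, const float *d_x, float *d_y)
{
    cublasHandle_t handle;
    cublasCreate(&handle);                           /* initialize the library */
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  /* unit strides: dense vectors */
    cublasDestroy(handle);
}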