
StreamCg : A Stream-based Framework for

Programmable Graphics Hardware

Alex Radeski
School of Computer Science and Software Engineering
University of Western Australia
35 Stirling Highway, Crawley, Western Australia, 6009
radesa01@cs.uwa.edu.au
alex@radeski.net

November 26, 2003

This report is submitted as partial fulfilment


of the requirements for the Graduate Diploma Programme of the
Department of Computer Science and Software Engineering,
The University of Western Australia,
2003
Abstract

The thesis of this research states that a modern programmable Graphics Pro-
cessing Unit (GPU) can be used for media processing tasks, such as signal
and image processing, resulting in significant performance gains over tra-
ditional CPU-based systems. This is made possible by innovative research
in the areas of programmable rendering pipelines, shader programming lan-
guages, and graphics hardware. The GPU uses data locality and parallelism
to achieve these performance improvements. Recent advances in mainstream
programmable GPU technology have led many to research the suitability of
these devices for more generic programming tasks.
This work explores the difficulties associated with using a GPU for media
processing and the performance gains made over traditional CPU imple-
mentations. StreamCg was developed as a C++ framework to simplify the
development of media processing applications. The NVIDIA Cg language
and runtime infrastructure are used to provide a high level programming en-
vironment for the GPU. StreamCg simplifies the development of GPU-based
programs by providing a simple stream-based programming model that fa-
cilitates reuse, and encapsulates the complexity of the underlying OpenGL
rendering system. As part of this research I also developed the EmuCg
framework. EmuCg assists in the execution of NVIDIA Cg programs on a
traditional CPU. This aids in the normally difficult or impossible task of de-
bugging Cg programs. The remainder of this thesis discusses the evolution
of programmable rendering pipelines and graphics hardware.
A Discrete Wavelet Transform (DWT) is implemented as a non-trivial exam-
ple to assist in the performance analysis. The DWTs were designed, imple-
mented and tested using StreamCg. Three different implementations were
used, including a CPU-based algorithm, a GPU-based algorithm that was
executed on a GPU and the GPU-based algorithm executed on a CPU using
EmuCg. The experimental test results show significant performance gains
are made by GPU-based kernels when compared to CPU-based implemen-
tations. The StreamCg programming model encourages loose coupling that
allows the arbitrary combination of CPU and GPU kernel implementations.
Keywords: stream processors, programmable hardware, GPU, Cg, shader
languages, Discrete Wavelet Transform, stream kernel
CR Classification: C.1.2 [PROCESSOR ARCHITECTURES]: Multiple
Data Stream Architectures (Multiprocessors), I.3.1 [COMPUTER GRAPH-
ICS]: Hardware Architecture, I.3.6 [COMPUTER GRAPHICS]: Methodol-
ogy and Techniques, I.3.7 [COMPUTER GRAPHICS]: Three-Dimensional
Graphics and Realism

Acknowledgements

I would like to thank Dr Karen Haines, my supervisor, for her enthusiasm,


motivation, and guidance. This research would not have been possible if
Karen had not provided the NVIDIA GeforceFX 5900 graphics card needed,
for which I am sincerely grateful.
I am very grateful for the consideration given to my work/study situation by
Dr Richard Thomas, the honours/fourth year coordinator.
I would not be able to judge a good scientific paper, let alone write one, if
it were not for the excellent lessons taught by Professor Robyn Owens. I hope
this paper is an example of how much I valued those classes.
I would like to thank all my work colleagues at ADI Limited for all your
interest and encouragement. Also, thanks to my managers who didn’t watch
the clock when I disappeared during the day to go to university.
Finally, I would like to thank my partner Alison, my family, and friends for
all your love, support and understanding.

Contents

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Programmable Rendering Systems . . . . . . . . . . . . 3
1.1.2 Hardware-based Rendering Systems . . . . . . . . . . . 6
1.1.3 Hardware-based Shader Languages . . . . . . . . . . . 9
1.1.4 High Level Shader Languages . . . . . . . . . . . . . . 12
1.2 Generic Programming using GPUs . . . . . . . . . . . . . . . 13

2 The StreamCg Framework 15


2.1 The EmuCg Framework . . . . . . . . . . . . . . . . . . . . . 20

3 Testing 21
3.1 Test System Configuration . . . . . . . . . . . . . . . . . . . . 21
3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 24

4 Discussion 26
4.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

A Original Research Proposal 29

A.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
A.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
A.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
A.4 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

B Cg Programming 32
B.1 The Cg Language . . . . . . . . . . . . . . . . . . . . . . . . . 32
B.2 The Cg Runtime . . . . . . . . . . . . . . . . . . . . . . . . . 34

C Cg Discrete Wavelet Transform Implementation 36


C.1 Forward DWT Cg Implementation . . . . . . . . . . . . . . . 36
C.2 Inverse DWT Cg Implementation . . . . . . . . . . . . . . . . 39

List of Figures

1.1 The Application interacts with the graphics pipeline using the
Command system; this is typically done through a computer
graphics API, such as OpenGL. The Command system sup-
ports the specification of the 3-dimensional scene, for example
specifying geometry, textures, lighting, and cameras. The Ge-
ometry system handles the transformation, clipping, culling,
texture coordinates, lighting, and primitive assembly opera-
tions. The Rasterisation system samples the geometry into
colour fragments and performs colour interpolation. The Tex-
ture system performs texture transformation, projection, and
filtering. The Fragment system performs alpha, stencil, and
depth testing along with fog and blending to produce pixel
colours. The Display system performs gamma correction and
the output signal for the display. . . . . . . . . . . . . . . . . 9
1.2 A logical view of the render pipeline operations performed by
the CPU and GPU. This figure shows that only the Applica-
tion and Command stages of the pipeline are performed in the
CPU. The remainder are executed by the GPU, therefore re-
moving the intensive processing from the CPU. The GPU has
its own local video RAM and can also access the main RAM
via the Advanced Graphics Port (AGP). It is important to
note that the AGP bus is the primary bottleneck when trans-
mitting data to and from the GPU; the higher the transfer
speed of the AGP bus, the better the throughput. . . . . . . 10
1.3 The deep execution pipelines are often executed using stream
processors. A stream is essentially comprised of a sequence of
kernels that process data in one direction. Kernels are limited
to processing only the data passed down, or a limited number of
high speed global registers. This is referred to as data locality. 10

1.4 The data flow of a Programmable Rendering Pipeline. This
pipeline replaces parts of the fixed pipeline with Vertex Pro-
gram and the Fragment Program stages. . . . . . . . . . . . . 11

2.1 The StreamCg inheritance hierarchy. The Kernel class can be


used for CPU-based Kernel implementations; these Kernels
will execute on the CPU. The CgKernelFP subclass provides
the additional infrastructure to execute a Cg Fragment Pro-
gram on the GPU. The EmuCgKernelFP subclass provides
the emulation layer required to emulate a subset of the Cg
language. EmuCg is discussed in more detail later in this sec-
tion. These classes are extended and appropriate methods can
be overridden to specialise the object's behaviour. . . . . . . 16
2.2 Kernel A is the data source where the data may be obtained
from numerous sources including local files, databases or over
a network; Kernel D is the data sink where the output data
may be displayed on the screen, written to a file or transferred
over a network. Kernels B and C are typical Kernels insofar
as they receive data as input, process the data, and pass the
data downstream. . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Illustration of the four modes of data transfer supported by
StreamCg. It has four parts representing different source and
target Kernel implementations. In part A, the data transfer
remains in video RAM as both are GPU Kernels. In part B,
the data remains in main RAM as both are CPU Kernels. In
part C, the data is converted from a texture in video RAM into
a pixel buffer in main RAM, and in part D the reverse occurs,
where the data is converted into a texture in video RAM. . . 18

3.1 The DWT Stream Program is comprised of 4 kernels: an image


loader, forward DWT, inverse DWT, and image viewer. The
image loader reads PNG image files from disk. The images
are already resident in memory before execution, therefore the
load time from disk is not included. The forward DWT pro-
cesses the data from the image loader, the output of which is
fed into the inverse DWT. . . . . . . . . . . . . . . . . . . . . 22

3.2 DWTs are applied to 2-dimensional images in two passes, parts
A and B. The first pass, the vertical pass, performs a high
and low band filter on a row by row basis, resulting in a high
and low column. The second pass, the horizontal pass, again
performs a high and low band filter on the data, however this
time it is on a column by column basis. The result, seen in part
C, is four quadrants with a mixture of high and low filtered
data. The algorithm then recursively transforms the upper
left quadrant to produce part D. . . . . . . . . . . . . . . . . 23

B.1 A logical view of a Fragment Program that illustrates the per-


fragment input, per-fragment output and constant data. The
per-fragment input changes with each fragment processed,
however the constant data is the same for all fragments. . . . . 34
B.2 An example main function for a FP30 Fragment Program,
using direction modifiers and semantics. . . . . . . . . . . . . 34
B.3 When a Fragment Program is executed it processes each frag-
ment on a per-row basis, from the min screen position to the
max screen position. This means that at each fragment the Cg
program knows nothing of its adjacent neighbours; this data
locality is one of the features that enables the high performance
parallelism of GPUs. . . . . . . . . . . . . . . . . . . . . . . . 35

List of Tables

3.1 The execution times are presented from fastest to slowest. The


GPU shows the fastest times that scale very well with data size
increases. The raw GPU execution times follow an interesting
pattern. The first time the DWT Stream Program is run, it
takes approximately 230 milliseconds. Subsequent executions
averaged in the four to seven millisecond range.
The CPU has the next fastest times, which scale significantly
worse than the GPU. Finally, the worst times are achieved
using EmuCg on the CPU. . . . . . . . . . . . . . . . . . . . 25

Chapter 1

Introduction

The data processing demands of many industries, such as medicine, mining,


and entertainment, are not being easily met by traditional microprocessor
systems alone, such as Intel-compatible processors [11] (referred to as CPUs
from here on). This is due to the scalar processor architecture of the CPU and
the significant increases in the volume of data. The data volume increases can
be attributed to the demand for greater accuracy and improved realism. This
research focuses on data that are typically used in signal processing, image
processing and computer graphics, generally referred to as media processing.
Developments in 3-dimensional computer graphics hardware industries have
resulted in devices that utilise Graphics Processing Units (GPUs). The GPU
offers unprecedented performance gains in mainstream computer graphics.
This is primarily due to the employment of large-scale parallelism. Many
leading modern GPUs offer increased programmability, thus enabling many
more complex rendering techniques. This research looks at applying these
programmable GPUs to more general media processing tasks. There are
numerous difficulties that need to be overcome, including determining what
parts of the GPU to utilise, what language to use to program the GPU,
and how to transfer data between GPU programs.
These problems resulted in the development of StreamCg, a C++ software
development framework. I developed StreamCg utilising a layered approach.
The lowest layer wraps the complexity of the programming interface to the
GPU; this includes wrapping OpenGL and the NVIDIA Cg runtime. The
highest layer provides a simplified stream oriented programming model for de-
veloping StreamCg programs. In addition, the supporting framework EmuCg
was developed. EmuCg assists in the execution of NVIDIA Cg programs on
a traditional CPU. This aids in the normally difficult or impossible task of
debugging Cg programs.
I implemented a Discrete Wavelet Transform (DWT) as a non-trivial exam-
ple to assist in the performance analysis. The DWTs were designed, imple-
mented and tested using StreamCg. Three different implementations were
used, including a CPU-based algorithm, a GPU-based algorithm that was
executed on a GPU and the GPU-based algorithm executed on a CPU using
EmuCg. The experimental test results show that significant performance
gains are made by GPU-based kernels when compared to CPU-based im-
plementations. The StreamCg programming model encourages loose coupling that
allows the arbitrary combination of CPU and GPU kernel implementations.
The significant contributions of this research are the development of the
StreamCg framework, which allows easier development of loosely coupled
and highly reusable software kernels; the EmuCg framework, which enables the
execution of Cg algorithms on a CPU from within StreamCg; and finally the
development of a forward and inverse DWT algorithm using the Cg language.

1.1 Background

The following section outlines the relevant background regarding the evolu-
tion of rendering systems that resulted in modern programmable graphics
hardware. The first part of this section outlines programmable rendering
systems in general; these were traditionally software-based renderers. The
second part highlights how hardware-based rendering systems have become
more programmable and have reached a point where they are flexible enough
to be used for more generic tasks.

1.1.1 Programmable Rendering Systems

For many years, off-line (non-realtime) rendering systems have supported


programmable components. These programmable components are commonly
referred to as shaders. Shaders offer a high degree artistic control over the
rendering process and have long been employed by the movie industry to pro-
duce cinematic quality computer graphics. Historically, pro-
grammable rendering systems have sacrificed performance for quality, result-
ing in slow rendering times. For example, in the late 1980s Pixar considered
it reasonable to render a feature film in one year [7]. The reason for such long
computation times is that very few quality compromises can be made when
rendering cinematic quality effects such as ray tracing, anti-aliasing, pro-
cedural surface materials, transparency, reflection and refraction, shadows,
lighting, and atmospheric effects [18].

Shade Trees

The evolution of shader languages began with the seminal work done by
Robert L. Cook [6] on Shade Trees. Shade Trees provide a foundation for
flexible procedural shading techniques, with the aim of integrating discrete
shading techniques into a unified rendering model. This allowed shading
techniques to be combined in novel ways.
A Shade Tree is defined as a tree structure where nodes define operations
and leaves contain appearance parameters, such as color and surface normal.
Nodes can be nested to produce composite operations where the outputs of
the children operations are the inputs of the parent operation. Ultimately,
the root node of a Shade Tree provides the RGB pixel color for the current
point. A Shade Tree is associated with one or more objects in the scene;
this means each object has a specific shading program used to evaluate its
pixels. This was revolutionary at the time as many other rendering systems
used fixed shading models for all objects in a scene. The use of Shade Trees
enabled distinct surfaces to be shaded according to the lighting model that
best synthesised that surface. For example, polished wood interacts with
light very differently to brushed metal.
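The structure just described can be sketched in C++. This is an illustrative reconstruction of the idea, not Cook's implementation; the ShadeNode, Constant and Multiply names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <memory>

// An RGB triple, the value produced by every node in the tree.
using Color = std::array<float, 3>;

// Base class: every node, operation or leaf, evaluates to a Color.
struct ShadeNode {
    virtual ~ShadeNode() = default;
    virtual Color eval() const = 0;
};

// Leaf: an appearance parameter such as a surface colour.
struct Constant : ShadeNode {
    Color value;
    explicit Constant(Color v) : value(v) {}
    Color eval() const override { return value; }
};

// Operation: component-wise product of its children, e.g.
// modulating a surface colour by a light colour.
struct Multiply : ShadeNode {
    std::unique_ptr<ShadeNode> a, b;
    Multiply(std::unique_ptr<ShadeNode> x, std::unique_ptr<ShadeNode> y)
        : a(std::move(x)), b(std::move(y)) {}
    Color eval() const override {
        Color ca = a->eval(), cb = b->eval();
        return {ca[0] * cb[0], ca[1] * cb[1], ca[2] * cb[2]};
    }
};
```

Evaluating the root of such a tree yields the pixel colour for the current point, and nesting further operation nodes composes shading techniques in the novel ways Cook's model enables.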
In addition to Shade Trees are Light Trees and Atmosphere Trees. A Light
Tree describes a specific type of light, such as spotlight or point light. Each
type of light source is accessed by a Shade Tree to perform the appropriate
light shading calculations. Atmosphere Trees affect the light output from a
Shade Tree by performing some computation to simulate atmospheric effects,
such as fog. This transforms the light that reaches the virtual camera.

An Image Synthesizer

Ken Perlin's [39] subsequent work extended the programming model of Shade
Trees with a language called the Pixel Stream Editing (PSE) language.
The PSE language included conditions, loops, function definitions, arithmetic
and logical operators, and mathematical functions. In the PSE language
variables are either scalars or vectors and operators can work on both types.
For example, the expression a + b could be adding two scalars or two vectors.
Another important contribution of the PSE language is the famous Perlin
Noise function. The noise function is used to create “natural” looking tex-
tures that are devoid of repeating patterns. Using noise as a base function,
complex functions are used to produce very realistic effects such as water,
crystal, fire, ice, marble, wood, metal and rock. These effects are examples of
solid textures, which generate distinct features in 3-dimensional space. This
was a major step forward for procedural texturing, which was previously
only computed in 2D.
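The construction Perlin describes, a repeatable noise base combined into richer functions, can be sketched as follows. This uses a simple hash-based value noise rather than Perlin's gradient noise, and the function names are hypothetical; the turbulence sum is the standard octave construction used as a base for marble and similar solid textures.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// A cheap integer hash mapped into [-1, 1]; stands in for the
// pseudo-random lattice values of a noise function.
float lattice(int32_t i) {
    uint32_t h = static_cast<uint32_t>(i) * 2654435761u;
    h ^= h >> 16;
    return (h & 0xFFFF) / 65535.0f * 2.0f - 1.0f;
}

// Smoothly interpolated 1-D value noise: repeatable and free of
// obvious repeating patterns over practical ranges.
float noise(float x) {
    int32_t i = static_cast<int32_t>(std::floor(x));
    float f = x - std::floor(x);
    float t = f * f * (3.0f - 2.0f * f);  // smoothstep fade
    return lattice(i) * (1.0f - t) + lattice(i + 1) * t;
}

// Turbulence: octaves of noise summed at doubling frequencies and
// halving amplitudes, a common base for "natural" looking textures.
float turbulence(float x, int octaves) {
    float sum = 0.0f, amp = 1.0f;
    for (int o = 0; o < octaves; ++o) {
        sum += amp * std::fabs(noise(x));
        x *= 2.0f;
        amp *= 0.5f;
    }
    return sum;
}
```

Feeding the turbulence value into a colour ramp or a phase shift of a sine wave is the usual route from this base function to effects such as marble or wood grain.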

The RenderMan Shader Language

RenderMan [24] is one of the most popular shader languages in use today.
RenderMan builds on the foundation laid by Shade Trees and PSE. The
design goals of RenderMan were to develop a unified rendering system that
supports shading models for global and local illumination, define the interface
between the rendering system and shader programs, and provide a high level
language that is expressive and easy to use.
The RenderMan shader language [41] defines a number of different shader
types, each handled at a different stage in the rendering process. This is made pos-
sible by standardising the shader behaviour and defining an interface to the
greater rendering system. These shader types include

• Surface shaders are attached to geometric primitives and compute the


light reflected by a surface point as a function of the incoming light and
the surface properties. These shaders synthesise how the light interacts
with the distinct surface type, such as wood or metal.

• Displacement shaders modify the position and normal vector of a point


on the surface of an object to modify the visible features of the surface.
For example, creating bump mapped effects.

• Light shaders are attached to a point in space and compute the color
of light that is emitted to an illuminated surface.

• Volume shaders compute how light is modified when it travels through


a volume, such as light refraction. Atmospheric effects can be created
using a global volume that contains the entire scene.

• Imager shaders perform operations on the rasterised pixels before they


are displayed using image processing techniques.

There is a strong correlation between the RenderMan shader types and the
Shade Tree types mentioned previously. For example, surface and displace-
ment shaders closely resemble the functionality of the base Shade Tree, and
light and volume shaders perform similar operations to Light and Atmosphere
Trees. The RenderMan system is far more flexible than the specialised graph-
ics hardware discussed later in the chapter. Typically RenderMan systems
utilise networked clusters of computers, called render farms, which distribute
the work load to improve performance.
The shader language has a C-like syntax and supports a specialised set of
types including floats, RGB colours, points, vectors, strings and arrays. The
language also supports loops and conditions, which are also present in Perlin’s
PSE language. Shaders make use of a wide array of trigonometric and
mathematical functions, as well as other special purpose functions, such as
interpolation, noise and lighting functions. A shader is implemented as a
named function with inputs and outputs. RenderMan defines a specific set
of input and output parameters for each shader class. For example, the
surface shader is required to output the Ci parameter that represents the
new surface color. One problem is that parameters tend to have cryptic
names, so it is a good idea to keep the reference manual handy.
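The surface shader contract described above, incoming light and surface properties in, reflected colour (the Ci output) out, can be illustrated with a minimal Lambertian evaluation. This is a C++ sketch of the idea, not RenderMan SL; the names used here are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<float, 3>;

float dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// A minimal "surface shader": computes the light reflected at a
// surface point (the role of RenderMan's Ci output) as a function
// of the incoming light and the surface properties, here using a
// simple Lambertian (diffuse) model.
Vec3 surfaceShader(const Vec3& surfaceColor,  // surface property (Cs)
                   const Vec3& lightColor,    // incoming light (Cl)
                   const Vec3& normal,        // unit surface normal
                   const Vec3& toLight) {     // unit vector to the light
    float diffuse = std::max(0.0f, dot(normal, toLight));
    return {surfaceColor[0] * lightColor[0] * diffuse,
            surfaceColor[1] * lightColor[1] * diffuse,
            surfaceColor[2] * lightColor[2] * diffuse};
}
```

Swapping this function body for a different reflection model is precisely the per-surface flexibility that distinguishes programmable shading from a fixed shading model.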
The RenderMan implementation referred to in [24] utilises a virtual Single
Instruction Multiple Data (SIMD) [40] array architecture. The scene is split
into regions and each region has the associated shaders executed over the
region by the rendering system. Even though this was a software renderer,
the SIMD architecture facilitates the utilisation of high speed parallel graph-
ics hardware. This leads into the following discussion on the evolution of
hardware-based rendering systems.

1.1.2 Hardware-based Rendering Systems

The programming model of hardware-based rendering systems has evolved


with the capabilities of the graphics hardware. These capabilities have pro-
gressed from a fixed function system to the programmable graphics hardware
of today. Consumer level graphic hardware devices were not readily avail-
able until the mid-1990s. Up to this point most rendering was performed in
software, except for some high-end specialised graphics systems. For interac-
tive applications, with a frame rate above 20 Hz, this resulted in a trade-off
between image quality and responsiveness. Even a current day two giga-
hertz CPU, such as a Pentium IV, cannot render very high quality computer
graphics at interactive rates without help.
There are a number of reasons why the CPU cannot keep up with the de-
mands of computer graphics [19]. Firstly, a CPU is primarily a scalar pro-
cessor as it essentially processes one instruction at a time on a single set of
data. Secondly, a CPU has a single memory interface. This increases access
latency as memory access contention increases. Thirdly, only a small portion
of the entire chip is actually dedicated to Arithmetic Logic Units (ALUs).
This means fewer instructions can be executed in one chip clock
cycle. Most of the chip is actually dedicated to caching instructions and
data. So as a general rule, CPUs are optimised for complex logic, not high
bandwidth throughput. The following section outlines the evolution of how
specialised graphics devices offload the intensive graphics processing from the
main CPU.
One of the first commercially available high performance hardware acceler-
ated graphics systems was developed by Silicon Graphics Incorporated (SGI)
in the 1980s [3]. This system could render 100,000 polygons per second at a
refresh rate of 10Hz; relatively speaking this was a very high performance sys-
tem. This system defined a polygon as 4 sided, 10x10 pixel, RGB, Gouraud
shaded, z-buffered, clipped and screen projected.
The SGI system tightly coupled the main CPU and RAM with the graphics
system into one complete unit. From a usability stand point, this allowed
the graphics system to integrate with a user’s window-based desktop envi-
ronment. The SGI graphics system provided the hardware acceleration of
a fixed function pipeline and was comprised of the geometry [5], scan con-
version and rasterisation subsystems. Subsequent work by SGI on hardware
graphics systems enabled realtime texture mapping and anti-aliasing [1], and
made further improvements on refresh rates and overall performance [36].
The DN10000VS system took an alternative approach to the single specialised
graphics device: holistic changes were made throughout the system
to handle high-performance graphics [28].
tem designers was to have most of the hardware usable most of the time.
This could only be achieved through efficient load balancing and minimising
latency system wide. Performance was primarily achieved through utilising
four processors, with custom graphics instructions and improved hardware
bus speeds.
In the mid-1990s, Lastra et al. [30] outlined the requirements for programmable
graphics hardware that could achieve interactive frame rates. These require-
ments covered programmability, memory layout, and computational power
that formed part of the experimental graphics system called PixelFlow. This
work built on the previous work by Molnar et al. [35] and followed
the achievements of early graphics systems [20, 21].
To achieve interactive frame rates PixelFlow employed large-scale parallelism.
To highlight the scale required, during that time existing commercial graphics
systems used hundreds of processors for a fixed rendering pipeline [36],
a programmable pipeline would require many more. Parallel architectures
provided greater performance gains over single pipeline architectures as data
was processed simultaneously across an array of processors. Single pipeline
architectures are constrained to the clock speed of the single processor; per-
formance increases could only be achieved through advances in technology
that improve clock speed.
PixelFlow used a SIMD array of 128 x 64 pixel processors and two general
purpose RISC processors. The general purpose processors fed instructions
into the SIMD array where they were executed simultaneously. Parallelism
was applied to a scene by subdividing the screen into 128 x 64 pixel regions
and each region was then processed in parallel.
The modern rendering pipeline evolved as a result of common patterns found
in the processes used for rendering 3-dimensional scenes. This pipeline was
devised to provide a flexible and simple programming model to the graph-
ics hardware. A high level example of a modern rendering pipeline can be
seen in Figure 1.1. This render pipeline is comprised of the Application,
Command, Geometry, Rasterisation, Texture, Fragment and Display sub-
systems [2]. The Application interacts with the graphics pipeline using the
Command system; this is typically done through a computer graphics API,
such as OpenGL. The Command system supports the specification of the
3-dimensional scene, for example specifying geometry, textures, lighting, and
cameras. The Geometry system handles the transformation, clipping, culling,
texture coordinates, lighting, and primitive assembly operations. The Ras-
terisation system samples the geometry into colour fragments and performs
colour interpolation. The Texture system performs texture transformation,
projection, and filtering. The Fragment system performs alpha, stencil, and
depth testing along with fog and blending to produce pixel colours. The
Display system performs gamma correction and the output signal for the
display.
At the core of modern graphics hardware is the Graphics Processing Unit
(GPU). The term GPU was first introduced in 1999 in the NVIDIA Geforce
series of chips [17]. Other hardware vendors, such as ATI, 3D Labs, and
Matrox, also used the same or a similar term for their graphics processing chip. The
first generations of GPU provided hardware acceleration of the fixed function
pipeline mentioned earlier. Figure 1.2 provides a logical view of the render
pipeline operations performed by the CPU and GPU. This figure shows that
only the Application and Command stages of the pipeline are performed in
the CPU. The rest is executed by the GPU, therefore removing the intensive
processing from the CPU. The GPU has its own local video RAM and can
also access the main RAM via the Advanced Graphics Port (AGP) [10]. The
standard 1x AGP speed is approximately 267 megabytes (MB) per second
throughput. The standard AGP speed at the time of this research is 4x AGP,
which is about 1 gigabyte (GB) per second throughput. Newer generations
of computers will have 8x AGP, which is approximately 2 gigabytes per sec-
ond throughput. It is important to note that the AGP bus is the primary
bottleneck when transmitting data to and from the video RAM accessed by
the GPU.

Figure 1.1: The Application interacts with the graphics pipeline using the
Command system; this is typically done through a computer graphics API,
such as OpenGL. The Command system supports the specification of the
3-dimensional scene, for example specifying geometry, textures, lighting, and
cameras. The Geometry system handles the transformation, clipping, culling,
texture coordinates, lighting, and primitive assembly operations. The Ras-
terisation system samples the geometry into colour fragments and performs
colour interpolation. The Texture system performs texture transformation,
projection, and filtering. The Fragment system performs alpha, stencil, and
depth testing along with fog and blending to produce pixel colours. The
Display system performs gamma correction and generates the output signal
for the display.

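The AGP bandwidth figures quoted above translate directly into transfer cost. A back-of-envelope sketch (illustrative arithmetic only; the function name is hypothetical):

```cpp
#include <cassert>

// Back-of-envelope AGP transfer time for a texture, using the
// approximate bus rates quoted above (about 267 MB/s for 1x AGP,
// about 1 GB/s for 4x AGP).
double transferMillis(int width, int height,
                      int bytesPerTexel, double busBytesPerSec) {
    double bytes = static_cast<double>(width) * height * bytesPerTexel;
    return bytes / busBytesPerSec * 1000.0;
}
```

For example, a 1024 x 1024 texture of four-component 32-bit floats is 16 MB, so a one-way transfer over 4x AGP (about 1 GB/s) takes roughly 17 milliseconds, a substantial cost at interactive frame rates, which is why minimising traffic over the AGP bus matters.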
There are a number of reasons why GPUs are better suited to computer
graphics than CPUs [19]. Firstly, GPUs employ large-scale parallelism, re-
sulting in fewer clocks per instruction. This is enabled through data locality,
where processors essentially have exclusive access over their allocated data.
Secondly, GPUs have multiple and wide memory interfaces, which means
that the GPU is optimised for high throughput, thus reducing data access
latency. Thirdly, GPUs have deep execution pipelines that help to amortize
latencies over the entire processing time. The deep execution pipelines are
often executed using stream processors. A stream, shown in Figure 1.3, is
essentially comprised of a sequence of kernels that process data in one di-
rection [25]. Kernels are limited to processing only the data passed down
to them; this is referred to as data locality. The momentum behind stream
processing is increasing, particularly in the area of media processing [27] [46].
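The kernel sequence of Figure 1.3 can be sketched as a minimal C++ stream. This illustrates the stream model only; it is not the actual StreamCg API described in Chapter 2, and the Kernel, Scale, Offset and run names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Stream = std::vector<float>;

// A kernel sees only the data passed down to it (data locality)
// and hands its output to the next kernel in the chain.
struct Kernel {
    virtual ~Kernel() = default;
    virtual Stream process(const Stream& in) = 0;
};

// Two example kernels: one scales every element, one offsets it.
struct Scale : Kernel {
    float factor;
    explicit Scale(float f) : factor(f) {}
    Stream process(const Stream& in) override {
        Stream out;
        out.reserve(in.size());
        for (float v : in) out.push_back(v * factor);
        return out;
    }
};

struct Offset : Kernel {
    float delta;
    explicit Offset(float d) : delta(d) {}
    Stream process(const Stream& in) override {
        Stream out;
        out.reserve(in.size());
        for (float v : in) out.push_back(v + delta);
        return out;
    }
};

// Data flows in one direction through the sequence of kernels.
Stream run(const std::vector<std::unique_ptr<Kernel>>& kernels, Stream data) {
    for (const auto& k : kernels) data = k->process(data);
    return data;
}
```

Because each kernel touches only the data handed to it, kernels can be reordered, replaced or moved between processors without affecting their neighbours, which is the loose coupling the stream model provides.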

Figure 1.2: A logical view of the render pipeline operations performed by the
CPU and GPU. This figure shows that only the Application and Command
stages of the pipeline are performed in the CPU. The remainder are executed
by the GPU, therefore removing the intensive processing from the CPU. The
GPU has its own local video RAM and can also access the main RAM via
the Advanced Graphics Port (AGP). It is important to note that the AGP
bus is the primary bottleneck when transmitting data to and from the GPU;
the higher the transfer speed of the AGP bus, the better the throughput.

Figure 1.3: The deep execution pipelines are often executed using stream
processors. A stream is essentially comprised of a sequence of kernels that
process data in one direction. Kernels are limited to processing only the
data passed down, or a limited number of high speed global registers. This
is referred to as data locality.

1.1.3 Hardware-based Shader Languages

A major challenge to developing a shader language for hardware is provid-
ing an easy to use programming model that also works within the cost and
complexity constraints of manufacturing the hardware devices. The GPU
programming model is developed as an extension to the fixed function ren-
der pipeline of the Microsoft Direct3D [34] and OpenGL [44] Application
Programmer Interfaces (APIs).
The data flow of a Programmable Rendering Pipeline is illustrated in Figure
1.4. This pipeline replaces parts of the fixed pipeline with programmable
processors.

Figure 1.4: The data flow of a Programmable Rendering Pipeline. This
pipeline replaces parts of the fixed pipeline with the Vertex Program and
Fragment Program stages.

Vertex Programs

A Vertex Program, also known as a Vertex Shader, processes the properties of a single vertex as input to produce a single transformed vertex as output. This model was first supported in the mainstream by the NVIDIA Geforce3 GPU [32]. In the context of a fixed function pipeline, Vertex Programs are executed before view space clipping and screen space scaling. This ensures that the programs do not cause an invalid operation during the rasterisation process. Vertex Programs allow effects such as procedural animation (including interpolation and morphing), lens effects (such as the fish eye lens) and fog effects (such as elevation-based fog).
The basic data type of a Vertex Program is a four-component 32-bit floating point vector (x,y,z,w). The inputs and outputs of a Vertex Program are named four-component registers; for example, COL0 is the diffuse output colour, and TEX0-7 are the eight possible output texture coordinates for the vertex. The input registers are read-only and the output registers are write-only; this simplifies their hardware implementation. Recent generations of GPUs, such as the NVIDIA Geforce4, only support 16 input registers, 96 constant registers, 12 output registers, and a maximum of 128 instructions per shader. The input registers default to vertex attributes such as position, colour, normal and texture coordinates; however, the register contents can be overridden by the programmer. The constant registers are available for application-specific values; these values are the same for each execution of the Vertex Program.
The Vertex Program uses a low-level machine language with operations including move (MOV), multiply (MUL), distance (DST), minimum (MIN), four-component dot product (DP4) and others. With each new generation of graphics hardware these limitations are being eliminated; for example, the latest NVIDIA GeforceFX allows shaders to have up to 1024 instructions and many more registers.

Fragment Programs

Fragment Programs, also known as Pixel Shaders, allow per-fragment operations during the rasterisation process. A fragment is essentially a pixel with associated metadata, such as colour, texture coordinates and screen coordinates. Previously, per-pixel operations were achieved through register combiners [12] and texture shaders. These provided a limited set of operations that enabled simple arithmetic to be performed with RGBA colours.
The Fragment Program language is similar to the Vertex Program language.
Fragment Programs supported by the NVIDIA Geforce4 provided only a limited subset of the operations supported by Vertex Programs, and used reduced-precision fixed point numbers instead of the 32-bit floating point numbers used in Vertex Programs. These limitations have been lifted in the GeforceFX,
which provides a uniform set of operations and 32-bit floating point numbers
across Vertex and Fragment Programs.

1.1.4 High Level Shader Languages

Low level shader languages share many of the same problems as CPU machine
languages. They are not easy to read or write, and are tightly coupled to
the hardware. New shading languages are being designed to counter these
problems; two hardware-based languages that are still in development are “C
for Graphics” (Cg) [13] and OpenGL Shader Language (GLslang) [26]. The

remainder of this paper will focus on Cg; further details on programming
with Cg can be found in Appendix B.
Shader language compilers are designed to produce machine language that is compliant with the target shader specification. The compiler can emulate missing functionality wherever possible. However, under certain circumstances the compiler may reject or ignore functions that are not supported by the graphics hardware.
These shader languages share similarities with the RenderMan shader language by design. These include a C-like syntax, support for high-precision data types (such as 32-bit floating point numbers), noise and turbulence functions, and a rich set of mathematical functions. However, RenderMan provides more shader types, applied at more stages of the rendering pipeline. The current generation of high level shader languages is still limited to the capabilities of the programmable graphics hardware; only the Vertex Program or the Fragment Program can be used to customise the rendering pipeline of programmable graphics hardware.

1.2 Generic Programming using GPUs

There is an ever-increasing body of research into using GPUs for a variety of tasks. GPUs are currently used in computer graphics for accelerating radiosity calculations [8] and ray tracing [43]. The remainder of this section provides an overview of the work done with GPUs in non-graphics research areas.
Moreland et al. [38] implement a Fast Fourier Transform (FFT) using a similar GPU to the one used in my research. The FFT implementation did not use Cg, but was instead implemented as a Fragment Program in machine language. The results of the research showed that a 512 by 512 image was synthesised by conventional means, the FFT performed, the image filtered, and the inverse FFT applied, all in well under one second.
Moravanszky [37] discusses implementing dense matrix multiplication using Microsoft DirectX instead of OpenGL and Cg. This has the disadvantage of tying the implementation to the Windows platform. Moravanszky's research differs from my research in that he used an ATI Radeon GPU. Moravanszky's results showed significant performance gains for larger datasets that can absorb the cost of transferring the data to the GPU.
Krueger et al. [29] implement a comprehensive suite of linear algebra operations on a GPU, including vector arithmetic, matrix-vector products, sparse matrices, banded matrices and many more. Krueger et al. also used an ATI Radeon GPU, which did not provide floating-point support as comprehensive as the NVIDIA GeforceFX used in my research. Higher performance can be achieved when less accurate data types are used. However, their research shows that even with consideration given to lower precision, the performance gains were significant in comparison to the CPU-based implementation.
Much of the research discussed in the previous paragraphs is based on ad-hoc systems developed to test a narrow domain. McCool et al. [33] developed a metaprogramming system that enabled all code, including the shader machine language, to be specified in C++. The metaprogramming system would
manage the underlying graphics API, in this case OpenGL, and also handle
loading and executing the shader machine language on the CPU. Although
the research primarily focuses on computer graphics examples, this kind of
system could be used for more generic tasks.
Buck et al. [4] emphasise that a GPU is a form of stream processor. This led to the development of Brook, a new language that shares many similarities with Cg, but differs in that it provides greater support for generalised GPU programming. The Brook programming model is based on streams of kernels. The research done by Buck et al. bears some resemblance to my research. The main similarity lies in the motivation to create a programming model based on stream processors. The main difference is that my research tried to work within the limitations imposed by established technologies such as C++ and Cg, rather than creating a new language.

Chapter 2

The StreamCg Framework

The StreamCg framework is a C++ framework that simplifies the development of high-performance media processing applications. The goals of StreamCg are to provide a simple programming model and to hide the underlying graphics system complexity. Another driving factor when developing StreamCg was to leverage established technologies that are familiar to developers, primarily C++ and Cg.
The StreamCg programming model borrows heavily from the stream architecture of parallel computing. StreamCg is used to assemble Stream Programs, which are comprised of two fundamental building blocks: the Kernel and the Channel. A StreamCg Kernel encapsulates the logic to be performed on a given dataset. Kernels have named inputs and outputs; however, at present only one input and one output channel are supported. A Channel is a connection between an output and an input, and is used to transfer data downstream to the next Kernel. Kernels can only access the data passed downstream or constant data specified before execution. This constraint is aligned with the model used by GPU shaders discussed previously.
The UML diagram in Figure 2.1 shows the inheritance hierarchy of the StreamCg framework. The Kernel class can be used for CPU-based Kernel implementations; these Kernels will execute on the CPU. The CgKernelFP subclass provides the additional infrastructure to execute a Cg Fragment Program on the GPU. The StreamCg framework currently only supports Cg Fragment Program Kernels because they allow the easiest access to the output data: the Fragment Program outputs its data directly to the frame buffer. In contrast, Vertex Programs produce output in the middle of the pipeline that is not easily accessible. Therefore, Vertex Programs are not widely used for generic programming tasks. The EmuCgKernelFP subclass provides the emulation layer required to emulate a subset of the Cg language. EmuCg is discussed in more detail later in this section. These classes are extended and appropriate methods can be overridden to specialise the object's behaviour.

Figure 2.1: The StreamCg inheritance hierarchy. The Kernel class can be used for CPU-based Kernel implementations; these Kernels will execute on the CPU. The CgKernelFP subclass provides the additional infrastructure to execute a Cg Fragment Program on the GPU. The EmuCgKernelFP subclass provides the emulation layer required to emulate a subset of the Cg language. These classes are extended and appropriate methods can be overridden to specialise the object's behaviour.
A Kernel is configured to transfer its output to a downstream Kernel using the writeTo(next:Kernel, inputName:String) method. This method is passed the next Kernel instance and the name of the input. StreamCg only supports transferring data to one input at a time; however, there may be other unused inputs specified by a Kernel. An example Stream Program is assembled in Figure 2.2. In this figure, Kernel A is the data source, where the data may be obtained from numerous sources including local files, databases or a network; Kernel D is the data sink, where the output data may be displayed on the screen, written to a file, or transferred over a network. Kernels B and C are typical Kernels in so far as they receive data as input, process the data, and pass the data downstream.
The Kernel::execute() method is called on the root Kernel, which in turn executes the entire Stream Program one Kernel at a time. The execution sequence of a single Kernel is as follows: Kernel::initialise() performs user initialisation of values prior to processing; Kernel::process() performs the actual data processing, which may be repeated a predefined number of times; and finally Kernel::transfer() transfers the output data to the next Kernel. The behaviour of the Kernel::transfer() method is dependent on the type of Kernel. Kernels that execute on the CPU output their data to PixelBufferChannels, which store data in main RAM and transfer data downstream using Channel::write(buf:PixelBuffer). GPU-based Kernels output data to TextureBufferChannels, which store data in video RAM and transfer data using Channel::write(). Details of the significance of the two write methods are discussed later in this section.

kernelA->writeTo(kernelB, "InputB");
kernelB->writeTo(kernelC, "InputC");
kernelC->writeTo(kernelD, "InputD");

Figure 2.2: Kernel A is the data source, where the data may be obtained from numerous sources including local files, databases or a network; Kernel D is the data sink, where the output data may be displayed on the screen, written to a file or transferred over a network. Kernels B and C are typical Kernels in so far as they receive data as input, process the data, and pass the data downstream.
StreamCg Kernels are constrained to the types of data inputs and outputs
provided by a Fragment Program. As outlined in Appendix B, a Fragment
Program can take 2-dimensional textures as input, and its output is always
written to the colour buffer of the current OpenGL render context [44]. By default OpenGL clamps the values of buffers and textures to the range 0.0 to 1.0. Under normal use this is sufficient, as the values typically represent RGBA colours. However, for more generic data processing this is a serious limitation. The solution is to use the NVIDIA OpenGL extension for Float Buffers [14]. This extension allows the creation of render contexts containing buffers and textures in video RAM that do not enforce any limitation on the values beyond that of a float type.
The complexity of managing textures and buffers is encapsulated in the in-
frastructure of StreamCg. The Channel::write() method will copy the
colour buffer contents of the current render context into either a texture
(video RAM) or pixel buffer (main RAM). The Channel::write(buf:PixelBuffer)
method will copy the contents of a pixel buffer into either a texture or another pixel buffer. Based on the configuration of the Stream Program Kernels, StreamCg will therefore perform the appropriate conversion of channel data as required.

Figure 2.3: Illustration of the four modes of data transfer supported by StreamCg. It has four parts representing different source and target Kernel implementations. In part A, the data transfer remains in video RAM as both are GPU Kernels. In part B, the data remains in main RAM as both are CPU Kernels. In part C, the data is converted from a texture in video RAM into a pixel buffer in main RAM, and in part D the reverse occurs: the data is converted into a texture in video RAM.

Figure 2.3 illustrates these four modes of data transfer between different source and target Kernel implementations.
To ensure compatibility of data sizes when transferring the colour buffer
to a texture and in reverse, StreamCg enforces a uniform data size for all
Kernels in a Stream Program. The size limitations are influenced by the size
constraints of textures in OpenGL (up to 4096 x 4096), and the available
video RAM. The size is set as part of the OpenGL render context; in typical OpenGL applications this is the viewport size. Pure CPU Stream Programs are least impacted by these limitations as they do not use video RAM.
StreamCg provides the KernelContext class to manage the creation of Float
Buffers. A KernelContext is specified on an individual Kernel basis; this allows multiple contexts to be utilised in a single Stream Program. However,
context switching is demanding on resources. A CPU Kernel executes in the
default context or null context, which can be considered main RAM. GPU
Kernels must execute within a KernelContext otherwise the data values will
be clamped.

The Kernels of a Stream Program are executed over the entire data set from top to bottom, row by row. This execution pattern is the one supported by Fragment Programs, as illustrated by Figure B.3 in Appendix B. It is achieved by rendering a GL_QUAD that spans the entire display area, which results in the execution of the Fragment Program for every pixel in the colour buffer. Due to the data locality constraints imposed by the GPU, the row and column of the pixel being processed are not immediately available. StreamCg specifies texture coordinates for each corner of the rendered GL_QUAD. These are passed to the Fragment Program using the TEX0 semantic and serve as an indexing mechanism for calculations that need to know the row and column being processed. The CPU-based Kernels emulate this behaviour to ensure a consistent programming model.
The underlying StreamCg infrastructure comprises a number of layers: the OpenGL Object layer, the Cg Object layer and the StreamCg layer. The OpenGL Object layer wraps the OpenGL API into convenience objects, and the Cg Object layer wraps the underlying Cg API. These subsystems provide convenience classes essential to overcoming problems, discussed in the following paragraphs, encountered during the construction of StreamCg.
Many problems were encountered when developing the StreamCg framework; the primary issue was that OpenGL has a C API. This is problematic because logical elements within OpenGL (textures, meshes, etc.) are object-like, but are not exposed as objects. Consequently, it is easy to forget to set a property, or not realise that properties you did set are incompatible. For example, when using the NVIDIA Float Buffer extension, textures must have their internal format set to GL_FLOAT_RGBA_NV [14] instead of the standard GL_RGBA, which is easy to forget when developing a system for the first time.
Another significant development issue is that OpenGL and Cg are really two separate systems with a loose coupling via the Cg Runtime. Limited diagnostics and validation are performed, so it is important to perform manual validation regularly. A related issue encountered during development concerns silent incompatibilities that cause things not to work. For example, the typical Cg type for a texture is sampler2D, which requires that the texture is specified in OpenGL using the GL_TEXTURE_2D type [44]. Using the NVIDIA Float Buffer extension requires the texture type to be GL_TEXTURE_RECTANGLE_NV. This is not compatible with the Cg sampler2D type, therefore samplerRECT [14] must be used instead.

2.1 The EmuCg Framework

The EmuCg framework is a supporting framework that is primarily intended for debugging Cg programs to be used with StreamCg. Cg programs cannot easily be traced step-by-step as they are executed on the GPU; at the time of this research there is no facility that allows tracing instructions on a GPU. EmuCg helps minimise the serious problems encountered when developing generic GPU programs. Cg code can be ported to the EmuCg framework with relative ease, which allows it to be compiled with a C++ compiler, enabling the use of common C++ debugging tools.
EmuCg emulates the Cg runtime and Cg language. The Cg runtime support
is limited to the execution model of a Fragment Program that processes a
single four-sided polygon that covers the viewport. This is currently the only
execution model supported by the StreamCg framework and is a more limited
version of what is possible using a general Fragment Program. Refer to
Appendix B for more details on Cg and Fragment Programs. Comprehensive
emulation of the Cg runtime would be a difficult task and would require
reproducing much of the rendering pipeline behaviour externally.
Emulating the Cg language is a little easier; this is primarily due to the varying degrees of similarity between Cg, Java [22] and C/C++ [45]. The basic primitive types, such as float and int, are supported with little effort. EmuCg also supports the vector types float2, float3 and float4; the associated arithmetic operators, such as addition and division; and type conversions between the vector sizes. This is possible in C++ primarily through the use of classes and operator overloading.

Chapter 3

Testing

3.1 Test System Configuration

The hardware comprised an Athlon Thunderbird 1.2GHz with 768MB main memory on an Asus A7V266 mainboard supporting 4x AGP. The graphics device was a GeforceFX Ultra 5900 with 256MB video memory. The mainboard was configured to give the graphics device an additional 64MB of main memory as AGP video RAM.
The operating system used was Redhat Linux Version 9.0 with Linux Kernel
Version 2.4.22 [47]. The display system used included XFree86 Version 4.3.0
with the NVIDIA Linux Driver Version 1.0-4496, using the internal NVAGP
AGP driver.
The software was compiled with GCC Version 3.2.2, using the following software development libraries:

• Cg Toolkit for Linux Version 1.1, provided the Cg Runtime and lan-
guage compiler [13];

• libSDL Version 1.2.5, for OpenGL display configuration, input handling and performance timer;

• libPNG Version 1.2.2, for loading PNG image files;

• and the GNU Wavelet Image Codec Version 0.1 (GWIC) [31], for an easy-to-follow DWT reference implementation.

Figure 3.1: The DWT Stream Program is comprised of 4 kernels: an image
loader, forward DWT, inverse DWT, and image viewer. The image loader
reads PNG image files from disk. The images are already resident in memory before execution; therefore the load time from disk is not included. The
forward DWT processes the data from the image loader, the output of which
is fed into the inverse DWT.

The source code was compiled using GCC 3.2.3 to produce an optimised binary executable using the following configuration:
-march=athlon-tbird -mmmx -m3dnow -Wall -O3 -pipe -fomit-frame-pointer.

3.2 Method

The aim of this experiment is to compare the execution times of CPU- and GPU-based stream programs implemented using StreamCg. The test stream program, illustrated in Figure 3.1, is comprised of four kernels: an image loader, forward DWT, inverse DWT, and image viewer. The image loader reads PNG image files from disk. The forward DWT processes the data from the image loader, the output of which is fed into the inverse DWT. This collection of stream kernels will be referred to as the DWT Stream Program. The expected output is that the image viewer will display an image identical to the input image. Only a working knowledge of DWTs was required for this research, and only the salient aspects of DWTs are discussed in this paper. Further details on DWTs can be found in the existing body of knowledge [23] [42].
The DWT algorithm implemented as part of this research is designed for use in image compression. The process of the DWT is illustrated in Figure 3.2. DWTs are applied to 2-dimensional images in two passes, parts A and B. The first pass, the vertical pass, performs a high and low band filter on a row-by-row basis, resulting in a high and a low column. The second pass, the horizontal pass, again performs a high and low band filter on the data, however this time on a column-by-column basis. The result, seen in part C, is four quadrants with a mixture of high and low filtered data. The algorithm then recursively transforms the upper left quadrant to produce part D.

Figure 3.2: DWTs are applied to 2-dimensional images in two passes, parts A and B. The first pass, the vertical pass, performs a high and low band filter on a row-by-row basis, resulting in a high and a low column. The second pass, the horizontal pass, again performs a high and low band filter on the data, however this time on a column-by-column basis. The result, seen in part C, is four quadrants with a mixture of high and low filtered data. The algorithm then recursively transforms the upper left quadrant to produce part D.
Three versions of the DWT Stream Program were implemented: a CPU-based algorithm, implemented as the ForwardWaveletKernel and InverseWaveletKernel classes; a GPU-based algorithm running on the GPU, implemented as the CgForwardWaveletKernel and CgInverseWaveletKernel classes; and a GPU-based algorithm running on the CPU (using EmuCg), implemented as the EmuCgForwardWaveletKernel and EmuCgInverseWaveletKernel classes. The ForwardWaveletKernel and InverseWaveletKernel provide
the base reference implementation for developing the Cg DWT kernels. The
reference implementation provided a means of comparing the CPU-based
version with the GPU-based version for both correctness and performance.
The CPU-based algorithm was used as a reference implementation using
third party code from the GWIC DWT implementation; this code was used
directly with only minor modifications.
The CgForwardWaveletKernel and CgInverseWaveletKernel kernels were

developed to execute on the GPU; the Cg code can be seen in Appendix C. The algorithm processes logical rows, where a logical row may be a physical row or column depending on the orientation. This is required, as Fragment Programs only process data on a row-by-row basis. The isH and isV parameters are used to set the logical row orientation to horizontal or vertical, respectively.
If one value is set to 1.0 the other must be set to 0.0. An alternative approach
could have been to have two different algorithms handle the horizontal and
vertical processing, however this would result in a significant performance
penalty caused by rapidly unloading and loading of the Fragment Programs.
The Cg DWT algorithm is optimised for how the GPU handles branching. On a current GPU, both paths of a boolean condition are executed. If the condition is true, the results of the true path are multiplied by 1.0 and those of the false path by 0.0, and the values are added together. The opposite occurs if the condition is false. The result is a weighted sum of both paths based on the condition. This is not immediately obvious to a developer, and is counter to how branching is performed on a CPU, where only the taken branch is executed. The algorithm is optimised by collocating the common parts of each branch, thus minimising the actual code for each condition.
The EmuCgForwardWaveletKernel and EmuCgInverseWaveletKernel kernels were developed to assist in debugging the algorithm during development. This was achieved by prototyping the forward and inverse DWT algorithms and tracing through them with a debugger. This is not possible with Cg code executed on the GPU, so it is very difficult to develop complex algorithms without something like EmuCg.
The StreamCg framework enables any forward DWT implementation to be
used with any other inverse DWT implementation. This is achieved by changing only the starting configuration of the kernels, as illustrated in Figure 2.2.
This highlights the modularity of StreamCg kernels and the improved possi-
bility of kernel reuse. This is supported by the strict object contract between
the kernels and the StreamCg framework that enforces a loose coupling.

3.3 Experimental Results

The results were obtained by executing the DWT Stream Program on 256 by 256,
512 by 512, and 1024 by 1024 data sizes. This research intended to explore
the usage of larger data sets, including 2048 by 2048 and 4096 by 4096.
However, the graphics hardware could not support such large data sets as it ran out of video RAM.

Implementation   Data Size    Average Time (ms)
GPU              256x256      10
                 512x512      11
                 1024x1024    11
CPU              256x256      89
                 512x512      450
                 1024x1024    2783
CPU (EmuCg)      256x256      920
                 512x512      4191
                 1024x1024    18656

Table 3.1: The execution times are presented from fastest to slowest. The GPU shows the fastest times, which scale very well with data size increases. The raw GPU execution times follow an interesting pattern: the first time the DWT Stream Program is run, it takes approximately 230 milliseconds; subsequent executions averaged in the four to seven millisecond range. The CPU has the next fastest times, which scale significantly worse than the GPU. Finally, the worst times are achieved using EmuCg on the CPU.


The DWT stream programs were executed 30 times to compute the average
execution time. This experiment used the SDL_GetTicks() timer provided
by libSDL. The SDL timer was configured to use the Read Time-Stamp
Counter (RDTSC) machine code instruction supported by Intel Pentium
compatible processors [9]. The time-stamp counter keeps an accurate count
of every cycle that occurs on the processor.
The execution times, shown in Table 3.1, are recorded in milliseconds; timing starts when the first kernel writes the input data to the forward DWT kernel, and stops when the last kernel receives the inverse DWT data. Each run of the DWT Stream Program is executed as a single render frame using OpenGL.
At the end of each frame the OpenGL glFinish() function is called to
allow the graphics hardware to finalise any internal rendering processes. The
time taken for this task is not included in the experiment times, as it is a side-effect of using OpenGL. This research assumes that the timer stops
immediately after the last kernel receives the data, and subsequent processing
required by OpenGL is not significant.

Chapter 4

Discussion

The introduction of high-level shader languages, such as Cg, is encouraging the development of a wider range of applications that utilise the processing power of GPUs. The NVIDIA Cg toolkit provides a runtime system that works within OpenGL, and a language for programming GPUs. StreamCg utilises the Cg toolkit to provide a simple programming model, abstract the underlying implementation complexities, and enable the development of reusable software components.
Many limitations are imposed on StreamCg due to the strong dependency on
OpenGL. A major problem is that OpenGL is not being used in the manner
for which it was originally designed, in so far as it is not being used for
computer graphics. This means that vendor extensions, such as the NVIDIA
Float Buffer extension need to be used to enable more generic programming.
StreamCg is also limited to a fixed data size for all kernels within an execution
context. This limitation is due to the use of textures and the colour buffer
to store input and output data respectively.
EmuCg assists in the execution of NVIDIA Cg programs on a traditional
CPU. This aids in the normally difficult or impossible task of debugging Cg
programs. EmuCg only supports the execution of Cg Fragment Programs in
a limited manner. However, it provides the foundations for a more complete
emulation of the Cg language.
The GPU DWT Stream Program shows the best performance. Not only are
the execution times significantly faster than the CPU times, but the execution
times scale very well and are almost constant as the data size increases.

The next best times are those of the CPU DWT Stream Program using the GNU Wavelet Image Codec (GWIC) code, of which the fastest time is still nine times worse than the slowest GPU time. This Stream Program does not scale well as the data sizes increase. The execution times are approximately four times slower with each step in size, highlighting an approximately linear relationship between the number of pixels (which quadruples at each step) and the execution time.
The worst times are recorded for the EmuCg DWT Stream Program; these are about ten times worse than the GWIC algorithm. This is due to emulating the runtime behaviour of Fragment Programs using EmuCg. Also, the Cg DWT algorithm is designed to execute on a GPU and performs badly on a CPU, as it relies on parallel execution and high-bandwidth memory throughput.
The summary of results above highlights that by using parallelism and high-bandwidth memory throughput the GPU achieves significant performance gains over CPU-based programs. This is primarily because a CPU is a scalar processor and has a single, relatively narrow memory interface. However, current graphics hardware is limited by the size of video RAM, thus limiting the amount of data that can be processed. Also, OpenGL only supports textures, used as inputs in StreamCg, up to 4096 by 4096. As a result the GPU is not limited by processing power, but by the capacity to store data. The CPU has the inverse relationship, as it is primarily bandwidth and processor limited, and limited less by storage capacity.
Better results could be achieved if the motherboard used supported 8x AGP,
which offers about 2 GB per second of throughput. This would reduce the initial
spike seen in the GPU results, which is caused by uploading the texture to the
graphics hardware. Any growth in video RAM would also need to be accom-
panied by an increase in the AGP bus speed, necessary to transfer larger data
to the graphics hardware in a timely manner.

4.1 Future Work

There are numerous enhancements that could be made to the StreamCg
framework. These include:

• Support for more Cg types, such as matrices.

• Port to MS Windows platform.

• Support more GPU profiles other than FP30. This includes investigat-
ing using Vertex Programs.

• Implement more image processing kernels, such as edge detection algorithms.

• Improve the programming model to make it simpler and more flexible.

It is possible to further optimise the DWT Stream Program. Fine-tuning
the Cg program to use integers where possible would improve performance,
as integer and float operations are performed in parallel.

Appendix A

Original Research Proposal

Title: High Performance Generic Programming using Programmable Graphics Hardware
Author: Aleksandar Radeski
Supervisor: Dr Karen Haines

A.1 Background

In recent times the speed of the modern computer processor (such as an
AMD Athlon XP) has exceeded 2 GHz. However, in the area
of cinematic quality computer graphics the raw processing speed of a single
processor is not enough to render these scenes at interactive rates (above
25Hz). This is due to the large volumes of data that need to be processed
many hundreds of times a second.
The aim of modern graphics hardware is to render cinematic quality computer
graphics at interactive rates. This is achieved through developing highly par-
allel hardware devices that efficiently process large volumes of data. Early
generations of graphics hardware only supported a fixed function rendering
pipeline. The fixed function rendering pipeline simplified both the hard-
ware design and programming the hardware device; this was at the cost of
flexibility.
Modern graphics hardware, such as the NVIDIA GeForceFX, supports the
execution of user programs called shaders. The two forms of shader are

the vertex shader and the pixel shader. The vertex shader is executed per
vertex with the purpose of transforming world coordinates into view-space
coordinates prior to view frustum clipping. The pixel shader is executed per
pixel during the rasterisation phase prior to screen coordinate clipping.
Earlier generations of shader languages resembled machine languages and
were limited and difficult to use; for example, branching and looping oper-
ations were not supported. Current shader languages are becoming more
powerful and less specialised. These C-like languages, such as NVIDIA Cg,
support a wide range of types as well as looping and branching operations. As
these languages become more generalised, it has become apparent that they
could be used for high performance processing of generic program-
ming tasks.

A.2 Aim

The goal of my research is to investigate modern programmable graphics
hardware and shader languages and how they can be utilised for high per-
formance generic programming tasks.

A.3 Method

The first step is to build an understanding of modern programmable graphics
hardware. My research will not specifically focus on the physical design of
the hardware, but the aim is to provide enough background to give a better
understanding of the difficulties in programming such devices.
Following the hardware research I will investigate the origins of shader lan-
guages, focusing primarily on the current generation of C-like languages, such as Cg,
and how they were influenced by early shader languages, such as RenderMan.
To better understand shader languages I will prototype a number of simple
typical shader examples.
Once I have a good understanding of conventional shaders I will explore the
possibilities of using programmable graphics hardware in more generic pro-
gramming tasks. The example I will implement is a projection shader using
Cg, that will project data from one coordinate system into another. This
technique can be used in Geographic Information Systems (GIS) that require
data to be transformed from geographic (latitude, longitude, elevation)
into Cartesian (x,y,z) coordinates, and back. The shader is expected to be
accurate and fast. The goal is to devise an implementation that can project
a complex scene consisting of terrain and entities at interactive rates. Other
applications beyond GIS will also be explored.

A.4 Requirements

The application programming language used will be C++; the shader lan-
guage used will be NVIDIA Cg. Software requirements include Linux, GCC
3.x, OpenGL, and the NVIDIA Cg Toolkit.
Hardware requirements include a standard PC of around 2 GHz with about
512 MB of RAM and an NVIDIA GeForceFX graphics device.

Appendix B

Cg Programming

The Cg Toolkit consists of two parts, the Cg Language and Cg Runtime ser-
vices [13]. The Cg language is used to implement GPU programs, the generic
form of Vertex and Fragment Programs. The Cg Runtime services form part
of the CPU-based host application and manage the execution context of a
GPU program. The Cg Runtime services are used to specify the GPU pro-
gram profile; perform loading, compiling and binding of GPU programs; and
specify the input parameters. This discussion outlines many relevant aspects
of the Cg Language, but does not attempt to cover the entire language.

B.1 The Cg Language

The Cg language is designed to be general purpose and hardware-oriented.
As a general purpose programming language, Cg may support features be-
yond that of the available hardware. Cg uses profiles to group supported
capabilities for a specific GPU. The GPU profiles discussed in this paper are
the OpenGL NVIDIA Vertex Program 2 profile (VP30) [16] and Fragment
Program profile (FP30) [15]. A GPU profile is used by the Cg compiler to
generate the GPU machine language. If a Cg program uses features not
supported by the target profile, the Cg compiler will produce an error. The
GPU profile defines the supported data types, flow control operations (such
as conditions and loops) and other special purpose operations (such as light-
ing calculations). This research will not cover the VP30 profile in great detail
as the primary discussion concerns the FP30 profile.
The Cg language is based on ANSI C and incorporates certain desirable fea-
tures from C++ and Java. Cg does not support full ANSI C; a number of
limitations are imposed. For example, pointers are not supported, and
therefore arrays are a first-class type; function overloading is supported; and
variables may be defined anywhere before being used, not just at the begin-
ning of a scope. The limitations of Cg are a direct result of the limitations
imposed by current GPU hardware. The enhancements made to ANSI C
are intended to provide specialised GPU support, such as swizzling, and to
reduce the programming effort for certain tasks.
The supported data types include float, a 32-bit IEEE floating-point num-
ber; half, a 16-bit IEEE-like floating-point number; int, a 32-bit integer;
fixed, a 12-bit fixed-point number; bool, a boolean type; and sampler*, a tex-
ture object with six variants including sampler1D, sampler2D, samplerRECT and
sampler3D. Cg also supports more complex types, such as float2, float3
and float4, which are two, three, and four component floating-point vectors re-
spectively. Although not used in this paper, Cg also supports matrix types,
such as float4x4, which is the largest matrix supported. These complex
types act very similarly to C++ classes with a broad range of overloaded
arithmetic operators.
A Cg program defines a main function that specifies the parameters for varying
inputs and outputs, and for constant data, referred to as uniform data. The value
of an input parameter and the destination of an output parameter can be
specified using semantic tags. Semantic tags relate to a specific GPU profile
and are declared with the parameter name and type. An example of a main
function using semantics is shown in Figure B.2. The input parameter incol
uses the COLOR semantic, which implies that for each pixel this Fragment
Program visits, the parameter receives the colour value. Some input and
output semantics are required, otherwise the GPU program is invalid. If a
parameter has no semantic defined, then it is up to the host application to
specify the value of the parameter. Uniform data is constant for the entire
execution of the Cg program and can be of any of the types discussed
previously. Figure B.1 illustrates the inputs, outputs and constant
data for a Fragment Program. The constant data is set using the Cg Runtime,
discussed later.
Function parameters can also have a direction modifier specified: one of in,
out or inout. Parameters are declared in by default if no direction is spec-
ified. An in parameter is passed by value; an out parameter is passed out
when the function exits but has no initial value; and an inout parameter
has a value on function entry, and if it is modified in the function the new value
is reflected outside the function. The inout direction modifier is meant to

Figure B.1: A logical view of a Fragment Program that illustrates the per-
fragment input, per-fragment output and constant data. The per-fragment
input changes with each fragment processed, whereas the constant data is
the same for all fragments.

void main(in float3 incol : COLOR,
          in float4 winPos : WPOS,
          out float4 outcol : COLOR) {
    //function body
}
Figure B.2: An example main function for a FP30 Fragment Program, using
direction modifiers and semantics.

provide some of the functionality missing due to the lack of pointers in Cg.
The Cg language supports flow control constructs, such as looping and branch-
ing, which are relatively difficult to implement in parallel systems due to la-
tency and synchronisation issues. At present only the VP30 profile supports
both looping and branching; the FP30 profile only supports branching.

B.2 The Cg Runtime

The Cg Runtime provides Cg program compilation and inspection opera-
tions. The Cg Runtime supports the pre-compilation of Cg programs into
the target profile machine language, or on-demand compilation that is per-
formed at runtime by the graphics driver. On-demand compilation has the
advantage of enabling new driver implementations to generate more efficient

Figure B.3: When a Fragment Program is executed it processes each frag-
ment on a per-row basis, from the min screen position to the max screen
position. This means that at each fragment the Cg program knows nothing of its
adjacent neighbours; this data locality is one of the features that enables the high
performance parallelism of GPUs.

machine language as GPU technology matures. With on-demand
compilation the Cg Runtime also supports inspecting a Cg program's in-
terface, including its inputs, outputs and constant data.
The Cg Runtime also manages setting the value of constant input parame-
ters and binding a Cg program to the current rendering context. Values of
constant input parameters, of the various Cg types discussed earlier, are set
via an appropriately typed function call. The value of an input parameter
remains constant until it is changed or the state of the rendering context
changes. Only one Vertex and Fragment Program can be bound to a render-
ing context at any given time.
When correctly configured, Cg programs are executed after the OpenGL
glBegin() function call is made to draw a primitive. When a Fragment
Program is executed it processes each fragment on a per-row basis, from the
min screen position to the max screen position, as shown in Figure B.3. This
means that at each fragment the Cg program knows nothing of its adjacent neigh-
bours; this data locality is one of the features that enables the high performance
parallelism of GPUs. If Cg programs could access neighbouring values, this
would introduce significant performance penalties due to the synchronisation
management of shared data.

Appendix C

Cg Discrete Wavelet Transform


Implementation

C.1 Forward DWT Cg Implementation

//parameters:
//icol - required for Fragment Program but not used
//tex0 - the execution position passed as texture coords
//subframeSize - the size of the subframe of the data buffer to process
//isH - use horizontal logical orientation
//isV - use vertical logical orientation
//hSize - half the size of the subframe
//imageData - the input data as an RGBA texture
//daub4High - the DWT high band co-efficients
//daub4Low - the DWT low band co-efficients
//ocol - the colour buffer output value

void main(
in float3 icol : COLOR,
in float2 tex0 : TEX0,
uniform float subframeSize,
uniform float isH,
uniform float isV,
uniform float hSize,
uniform samplerRECT imageData,
uniform float4 daub4High,

uniform float4 daub4Low,
out float4 ocol : COLOR)
{
//ensure we get a floored value to avoid interpolation issues
float2 tex = floor(tex0);

//copy default colour to fill-in unprocessed quadrants


float3 val = texRECT(imageData, tex).xyz;

//set logical row/col position based on orientation (H/V)


float rowPos = (tex.x * isH) + (tex.y * isV);
float colPos = (tex.y * isH) + (tex.x * isV);

//set sample step based on orientation (H/V)


float uStep = (2 * isH) + (1 * isV);
float vStep = (1 * isH) + (2 * isV);

float2 uv0;
float2 uv1;
float2 uv2;
float2 uv3;
float uvMax;
float4 daub;
bool doFilter = false;

//process the low and high band halves


if((rowPos < hSize) &&
(colPos < subframeSize)) {
//low band filter
float u = (tex.x * uStep);
float v = (tex.y * vStep);
uv0 = float2(u, v);
daub = daub4Low;
doFilter = true;
}
else if((rowPos >= hSize) &&
(rowPos < subframeSize) &&
(colPos < subframeSize)) {
//high band filter
float u = (tex.x - (hSize * isH)) * uStep;
float v = (tex.y - (hSize * isV)) * vStep;

uv0 = float2(u, v);
daub = daub4High;
doFilter = true;
}

if(doFilter) {
//compute the next 3 input uv coords
//in the appropriate direction (H or V)
uv1 = float2(uv0.x + isH, uv0.y + isV);
uv2 = float2(uv1.x + isH, uv1.y + isV);
uv3 = float2(uv2.x + isH, uv2.y + isV);

//wrap values to within the subframeSize


uv0.x = (uv0.x >= subframeSize ? (uv0.x - subframeSize) : uv0.x);
uv0.y = (uv0.y >= subframeSize ? (uv0.y - subframeSize) : uv0.y);
uv1.x = (uv1.x >= subframeSize ? (uv1.x - subframeSize) : uv1.x);
uv1.y = (uv1.y >= subframeSize ? (uv1.y - subframeSize) : uv1.y);
uv2.x = (uv2.x >= subframeSize ? (uv2.x - subframeSize) : uv2.x);
uv2.y = (uv2.y >= subframeSize ? (uv2.y - subframeSize) : uv2.y);
uv3.x = (uv3.x >= subframeSize ? (uv3.x - subframeSize) : uv3.x);
uv3.y = (uv3.y >= subframeSize ? (uv3.y - subframeSize) : uv3.y);

//fetch 4 input data elements


float3 a = texRECT(imageData, uv0).xyz;
float3 b = texRECT(imageData, uv1).xyz;
float3 c = texRECT(imageData, uv2).xyz;
float3 d = texRECT(imageData, uv3).xyz;

//perform the high or low band calculation


val.x = (a.x * daub.x) + (b.x * daub.y) +
(c.x * daub.z) + (d.x * daub.w);
val.y = val.x;
val.z = val.x;
}

//if the current execution position is not within the subframe then the
//existing value at the current position is copied to the output without
//being processed.

//set the output colour


ocol = float4(val, 1);

}

C.2 Inverse DWT Cg Implementation

//parameters:
//icol - required for Fragment Program but not used
//tex0 - the execution position passed as texture coords
//subframeSize - the size of the subframe of the data buffer to process
//isH - use horizontal logical orientation
//isV - use vertical logical orientation
//hSize - half the size of the subframe
//imageData - the input data as an RGBA texture
//daub4High - the DWT high band co-efficients
//daub4Low - the DWT low band co-efficients
//ocol - the colour buffer output value
void main(in float3 icol : COLOR,
in float2 tex0 : TEX0,
uniform float subframeSize,
uniform float isH,
uniform float isV,
uniform float hSize,
uniform samplerRECT imageData,
uniform float4 daub4High,
uniform float4 daub4Low,
out float4 ocol : COLOR)
{
//ensure we get a floored value to avoid interpolation issues
float2 tex = floor(tex0);

//copy default colour to fill in unprocessed quadrants


float3 val = texRECT(imageData, tex).xyz;

//set logical row/col position based on orientation (H/V)


float rowPos = (tex.x * isH) + (tex.y * isV);
float colPos = (tex.y * isH) + (tex.x * isV);

//set sample step based on orientation (H/V)


float uStep = (0.5 * isH) + (1 * isV);

float vStep = (1 * isH) + (0.5 * isV);

float uLimit = (hSize * isH) + (subframeSize * isV);


float vLimit = (subframeSize * isH) + (hSize * isV);

if((rowPos < subframeSize) && (colPos < subframeSize)) {


//determine if this is an odd position index
float odd = fmod(rowPos, 2.0);

//compute daub coefficients for even (0,2) or odd (1,3) case


float4 daube = float4(daub4Low.x, daub4High.x,
daub4Low.z, daub4High.z);
float4 daubo = float4(daub4Low.y, daub4High.y,
daub4Low.w, daub4High.w);

float4 daub = (daube * (1 - odd)) + (daubo * odd);


float ul = (tex.x - (1 * isH * odd * 0)) * uStep;
float vl = (tex.y - (1 * isV * odd * 0)) * vStep;
float uh = ul + (hSize * isH);
float vh = vl + (hSize * isV);
float lo = texRECT(imageData, float2(ul, vl)).x;
float hi = texRECT(imageData, float2(uh, vh)).x;
float a = (lo.x * daub.x) + (hi.x * daub.y);

ul = (tex.x - ((1+odd) * isH)) * uStep;


vl = (tex.y - ((1+odd) * isV)) * vStep;
ul = (ul < 0 ? uLimit + ul : ul);
vl = (vl < 0 ? vLimit + vl : vl);
uh = ul + (hSize * isH);
vh = vl + (hSize * isV);
lo = texRECT(imageData, float2(ul, vl)).x;
hi = texRECT(imageData, float2(uh, vh)).x;
a += (lo.x * daub.z) + (hi.x * daub.w);

val = a;
}

ocol = float4(val, 1.0);


}

Bibliography

[1] Kurt Akeley. Reality Engine graphics. In Proceedings of the 20th annual
conference on Computer graphics and interactive techniques, pages 109–
116. ACM Press, 1993.
[2] Kurt Akeley and Pat Hanrahan. Cs448a: Lecture notes on real-
time graphics architectures, 2001. http://graphics.stanford.edu/
courses/cs448a-01-fall/index.html.
[3] Kurt Akeley and Tom Jermoluk. High-performance polygon rendering.
In Proceedings of the 15th annual conference on Computer graphics and
interactive techniques, pages 239–246. ACM Press, 1988.
[4] Ian Buck and Pat Hanrahan. Data parallel computation on graphics
hardware, 2003.
[5] James H. Clark. The Geometry Engine: A VLSI geometry system for
graphics. In Proceedings of the 9th annual conference on Computer
graphics and interactive techniques, pages 127–133. ACM Press, 1982.
[6] Robert L. Cook. Shade trees. In Proceedings of the 11th annual confer-
ence on Computer graphics and interactive techniques, pages 223–231.
ACM Press, 1984.
[7] Robert L. Cook, Loren Carpenter, and Edwin Catmull. The reyes image
rendering architecture. In Proceedings of the 14th annual conference
on Computer graphics and interactive techniques, pages 95–102. ACM
Press, 1987.
[8] Greg Coombe, Mark J. Harris, and Anselmo Lastra. Radiosity on graph-
ics hardware, 2003.
[9] Intel Corporation. Using the RDTSC instruction for performance
monitoring, 1997. http://developer.intel.com/drg/pentiumII/
appnotes/RDTSCPM1.HTML.

[10] Intel Corporation. AGP V3.0 Interface Specification, 2002. http://
www.intel.com/technology/agp/.

[11] Intel Corporation. Intel corporation website, 2003. http://www.intel.com/.

[12] NVIDIA Corporation. Geforce 256 and Riva TNT combiners, 1999.
http://developer.nvidia.com/.

[13] NVIDIA Corporation. Cg Toolkit user's manual: a developer's guide to
programmable graphics, Release 1.1, 2003. http://developer.nvidia.com/Cg.

[14] NVIDIA Corporation. NV Float Buffer OpenGL Extension,


2003. http://oss.sgi.com/projects/ogl-sample/registry/NV/
float_buffer.txt.

[15] NVIDIA Corporation. NV Fragment Program OpenGL Extension,
2003. http://oss.sgi.com/projects/ogl-sample/registry/NV/fragment_program.txt.

[16] NVIDIA Corporation. NV Vertex Program2 OpenGL Extension,
2003. http://oss.sgi.com/projects/ogl-sample/registry/NV/vertex_program2.txt.

[17] NVIDIA Corporation. Nvidia website, 2003. http://www.nvidia.com/.

[18] James D. Foley, Andries van Dam, Steven K. Feiner, and John F.
Hughes. Computer Graphics : Principles and Practice. Addision-Wesley,
second edition, 1996.

[19] William Dally. EE482 : Lecture notes on stream processor architecture,


2002. http://cva.stanford.edu/ee482s/notes.html.

[20] Michael Deering, Stephanie Winner, Bic Schediwy, Chris Duffy, and Neil
Hunt. The triangle processor and normal vector shader: a VLSI system
for high performance graphics. ACM SIGGRAPH Computer Graphics,
22(4):21–30, 1988.

[21] Henry Fuchs, John Poulton, John Eyles, Trey Greer, Jack Goldfeather,
David Ellsworth, Steve Molnar, Greg Turk, Brice Tebbs, and Laura
Israel. Pixel-Planes 5: a heterogeneous multiprocessor graphics sys-
tem using processor-enhanced memories. ACM SIGGRAPH Computer
Graphics, 23(3):79–88, 1989.

[22] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Lan-
guage Specification Second Edition. Addison-Wesley, Boston, Mass.,
2000.

[23] Amara Graps. An introduction to wavelets. IEEE Comput. Sci. Eng.,


2(2):50–61, 1995.

[24] Pat Hanrahan and Jim Lawson. A language for shading and lighting
calculations. In Proceedings of the 17th annual conference on Computer
graphics and interactive techniques, pages 289–298. ACM Press, 1990.

[25] Ujval J. Kapasi, Scott Rixner, William J. Dally, Brucek Khailany,


Jung Ho Ahn, Peter Mattson, and John D. Owens. Programmable
stream processors. IEEE Computer, pages 54–62, August 2003.

[26] John Kessenich, Dave Baldwin, and Randi Rost. OpenGL Shading
Language draft v1.05, 2003. http://www.opengl.org/developers/
documentation/gl2_workgroup/.

[27] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter
Mattson, Jinyung Namkoong, John D. Owens, Brian Towles, and An-
drew Chang. Imagine: Media processing with streams. IEEE Micro,
pages 35–46, 2001.

[28] David Kirk and Douglas Voorhies. The rendering architecture of the
DN10000VS. In Proceedings of the 17th annual conference on Computer
graphics and interactive techniques, pages 299–307. ACM Press, 1990.

[29] Jens Krueger and Ruediger Westermann. Linear algebra operators for
GPU implementation of numerical algorithms. ACM Transactions on
Graphics (TOG), 22(3):908–916, 2003.

[30] Anselmo Lastra, Steven Molnar, Marc Olano, and Yulan Wang. Real-
time programmable shading. In Proceedings of the 1995 symposium on
Interactive 3D graphics, pages 59–ff. ACM Press, 1995.

[31] Joonas Lehtinen. GWIC - GNU Wavelet Image Codec version 0.1, 1998.
http://jole.fi/research/gwic/.

[32] Erik Lindholm, Mark J. Kilgard, and Henry Moreton. A user-
programmable vertex engine. In Proceedings of the 28th annual confer-
ence on Computer graphics and interactive techniques, pages 149–158.
ACM Press, 2001.

[33] Michael D. McCool, Zheng Qin, and Tiberiu S. Popa. Shader metapro-
gramming. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS
conference on Graphics hardware, pages 57–68. Eurographics Associa-
tion, 2002.

[34] Microsoft. Microsoft DirectX website, 2003. http://www.microsoft.


com/windows/directx/.

[35] Steven Molnar, John Eyles, and John Poulton. PixelFlow: high-speed
rendering using image composition. In Proceedings of the 19th annual
conference on Computer graphics and interactive techniques, pages 231–
240. ACM Press, 1992.

[36] John S. Montrym, Daniel R. Baum, David L. Dignam, and Christo-


pher J. Migdal. InfiniteReality: a real-time graphics system. In Proceed-
ings of the 24th annual conference on Computer graphics and interactive
techniques, pages 293–302. ACM Press/Addison-Wesley Publishing Co.,
1997.

[37] Adam Moravanszky. Dense matrix algebra on the GPU, 2003. http:
//www.shaderx2.com/shaderx.PDF.

[38] K. Moreland and E. Angel. The FFT on a GPU. In Proceedings of


the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hard-
ware, pages 112–119. Eurographics Association, 2003.

[39] Ken Perlin. An image synthesizer. In Proceedings of the 12th annual


conference on Computer graphics and interactive techniques, pages 287–
296. ACM Press, 1985.

[40] Gregory F. Pfister. In Search of Clusters, chapter 9.7. Prentice-Hall,


second edition, 1998.

[41] Pixar. RenderMan Interface, Version 3.2. Pixar, 2000.

[42] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P.


Flannery. Numerical Recipes in C: The Art of Scientific Computing.
Cambridge University Press, 1992.

[43] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan.
Ray tracing on programmable graphics hardware. In Proceedings of the
29th annual conference on Computer graphics and interactive techniques,
pages 703–712. ACM Press, 2002.

[44] Mark Segal and Kurt Akeley. The OpenGL graphics system: A spec-
ification (version 1.5), 2003. http://www.opengl.org/developers/
documentation/specs.html.

[45] B. Stroustrup. The C++ Programming Language. Addison-Wesley,


Reading, Massachusetts, USA, 1991.

[46] Michael Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat,
Benjamin Greenwald, Henry Hoffman, Jae-Wook Lee, Paul Johnson,
Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman,
Volker Strumpen, Matthew Frank, Saman Amarasinghe, and Anant
Agarwal. The Raw microprocessor: A computational fabric for soft-
ware circuits and general purpose programs. IEEE Micro, May 2002.

[47] Linus Torvalds and Open Source Hackers. Linux kernel website, 2003.
http://www.kernel.org/.
