Alex Radeski
School of Computer Science and Software Engineering
University of Western Australia
35 Stirling Highway, Crawley, Western Australia, 6009
radesa01@cs.uwa.edu.au
alex@radeski.net
The thesis of this research is that a modern programmable Graphics Pro-
cessing Unit (GPU) can be used for media processing tasks, such as signal
and image processing, resulting in significant performance gains over tra-
ditional CPU-based systems. This is made possible by innovative research
in the areas of programmable rendering pipelines, shader programming lan-
guages, and graphics hardware. The GPU uses data locality and parallelism
to achieve these performance improvements. Recent advances in mainstream
programmable GPU technology have led many to investigate the suitability of
these devices for more generic programming tasks.
This work explores the difficulties associated with using a GPU for media
processing and the performance gains made over traditional CPU imple-
mentations. StreamCg was developed as a C++ framework to simplify the
development of media processing applications. The NVIDIA Cg language
and runtime infrastructure are used to provide a high-level programming
environment for the GPU. StreamCg simplifies the development of GPU-based
programs by providing a simple stream-based programming model that fa-
cilitates reuse, and encapsulates the complexity of the underlying OpenGL
rendering system. As part of this research I also developed the EmuCg
framework. EmuCg assists in the execution of NVIDIA Cg programs on a
traditional CPU. This aids in the normally difficult or impossible task of de-
bugging Cg programs. The remainder of this thesis discusses the evolution
of programmable rendering pipelines and graphics hardware.
A Discrete Wavelet Transform (DWT) is implemented as a non-trivial exam-
ple to assist in the performance analysis. The DWTs were designed, imple-
mented and tested using StreamCg. Three different implementations were
used, including a CPU-based algorithm, a GPU-based algorithm that was
executed on a GPU and the GPU-based algorithm executed on a CPU using
EmuCg. The experimental test results show that significant performance gains
are made by GPU-based kernels when compared to CPU-based implementations.
The StreamCg programming model encourages loose coupling that allows
the arbitrary combination of CPU and GPU kernel implementations.
Keywords: stream processors, programmable hardware, GPU, Cg, shader
languages, Discrete Wavelet Transform, stream kernel
CR Classification: C.1.2 [PROCESSOR ARCHITECTURES]: Multiple
Data Stream Architectures (Multiprocessors), I.3.1 [COMPUTER GRAPH-
ICS]: Hardware Architecture, I.3.6 [COMPUTER GRAPHICS]: Methodol-
ogy and Techniques, I.3.7 [COMPUTER GRAPHICS]: Three-Dimensional
Graphics and Realism
Acknowledgements
Contents

1 Introduction
1.1 Background
1.1.1 Programmable Rendering Systems
1.1.2 Hardware-based Rendering Systems
1.1.3 Hardware-based Shader Languages
1.1.4 High Level Shader Languages
1.2 Generic Programming using GPUs
3 Testing
3.1 Test System Configuration
3.2 Method
3.3 Experimental Results
4 Discussion
4.1 Future Work
A.1 Background
A.2 Aim
A.3 Method
A.4 Requirements
B Cg Programming
B.1 The Cg Language
B.2 The Cg Runtime
List of Figures
1.1 The Application interacts with the graphics pipeline using the
Command system; this is typically done through a computer
graphics API, such as OpenGL. The Command system supports
the specification of the 3-dimensional scene, for example
specifying geometry, textures, lighting, and cameras. The
Geometry system handles the transformation, clipping, culling,
texture coordinates, lighting, and primitive assembly operations.
The Rasterisation system samples the geometry into colour
fragments and performs colour interpolation. The Texture
system performs texture transformation, projection, and
filtering. The Fragment system performs alpha, stencil, and
depth testing along with fog and blending to produce pixel
colours. The Display system performs gamma correction and
generates the output signal for the display.
1.2 A logical view of the render pipeline operations performed by
the CPU and GPU. This figure shows that only the Application
and Command stages of the pipeline are performed in the
CPU. The remainder are executed by the GPU, therefore
removing the intensive processing from the CPU. The GPU has
its own local video RAM and can also access the main RAM
via the Advanced Graphics Port (AGP). It is important to
note that the AGP bus is the primary bottleneck when
transmitting data to and from the GPU; the higher the transfer
speed of the AGP bus, the better the throughput.
1.3 The deep execution pipelines are often executed using stream
processors. A stream is essentially comprised of a sequence of
kernels that process data in one direction. Kernels are limited
to processing only the data passed down to them, or a limited
number of high-speed global registers. This is referred to as
data locality.
1.4 The data flow of a Programmable Rendering Pipeline. This
pipeline replaces parts of the fixed pipeline with the Vertex
Program and Fragment Program stages.
3.2 DWTs are applied to 2-dimensional images in two passes, parts
A and B. The first pass, the vertical pass, performs a high
and low band filter on a row by row basis, resulting in a high
and low column. The second pass, the horizontal pass, again
performs a high and low band filter on the data; however, this
time it is on a column by column basis. The result, seen in part
C, is four quadrants with a mixture of high and low filtered
data. The algorithm then recursively transforms the upper
left quadrant to produce part D.
List of Tables
Chapter 1
Introduction
a traditional CPU. This aids in the normally difficult or impossible task of
debugging Cg programs.
I implemented a Discrete Wavelet Transform (DWT) as a non-trivial exam-
ple to assist in the performance analysis. The DWTs were designed, imple-
mented and tested using StreamCg. Three different implementations were
used, including a CPU-based algorithm, a GPU-based algorithm that was
executed on a GPU and the GPU-based algorithm executed on a CPU using
EmuCg. The experimental test results show that significant performance
gains are made by GPU-based kernels when compared to CPU-based
implementations. The StreamCg programming model encourages loose coupling that
allows the arbitrary combination of CPU and GPU kernel implementations.
The significant contributions of this research are the development of the
StreamCg framework, which allows easier development of loosely coupled
and highly reusable software kernels; the EmuCg framework, which enables the
execution of Cg algorithms on a CPU from within StreamCg; and finally the
development of a forward and inverse DWT algorithm using the Cg language.
1.1 Background
The following section outlines the relevant background regarding the evolu-
tion of rendering systems that resulted in modern programmable graphics
hardware. The first part of this section outlines programmable rendering
systems in general; these were traditionally software-based renderers. The
second part highlights how the programming capabilities of hardware-based
rendering systems have become more programmable and have reached a point
where they are flexible enough to be used for more generic tasks.
Shade Trees
The evolution of shader languages began with the seminal work done by
Robert L. Cook [6] on Shade Trees. Shade Trees provide a foundation for
flexible procedural shading techniques, with the aim of integrating discrete
shading techniques into a unified rendering model. This allowed shading
techniques to be combined in novel ways.
Shade Trees are defined as a tree structure where nodes define operations
and leaves contain appearance parameters, such as color and surface normal.
Nodes can be nested to produce composite operations where the outputs of
the children operations are the inputs of the parent operation. Ultimately,
the root node of a Shade Tree provides the RGB pixel color for the current
point. A Shade Tree is associated with one or more objects in the scene,
which means each object has a specific shading program used to evaluate its
pixels. This was revolutionary at the time as many other rendering systems
used fixed shading models for all objects in a scene. The use of Shade Trees
enabled distinct surfaces to be shaded according to the lighting model that
best synthesised that surface. For example, polished wood interacts with
light very differently to brushed metal.
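Cook's node-and-leaf structure can be illustrated with a small C++ sketch. All names here (ShadeNode, LeafColour, ScaleNode, AddNode) are invented for this example rather than taken from Cook's paper, and a real Shade Tree would carry richer appearance parameters such as surface normals.

```cpp
#include <memory>
#include <utility>

// Illustrative Shade Tree: leaves hold appearance parameters, interior
// nodes combine their children's outputs, and the root yields the RGB
// value for the current point.
struct Colour { float r, g, b; };

struct ShadeNode {
    virtual ~ShadeNode() = default;
    virtual Colour evaluate() const = 0;
};

// A leaf supplying a constant appearance parameter, e.g. a diffuse colour.
struct LeafColour : ShadeNode {
    Colour c;
    explicit LeafColour(Colour col) : c(col) {}
    Colour evaluate() const override { return c; }
};

// An interior node scaling its child's output, e.g. by a light intensity.
struct ScaleNode : ShadeNode {
    float k;
    std::unique_ptr<ShadeNode> child;
    ScaleNode(float s, std::unique_ptr<ShadeNode> ch)
        : k(s), child(std::move(ch)) {}
    Colour evaluate() const override {
        Colour c = child->evaluate();
        return {k * c.r, k * c.g, k * c.b};
    }
};

// An interior node summing two children, e.g. diffuse plus specular terms.
struct AddNode : ShadeNode {
    std::unique_ptr<ShadeNode> a, b;
    AddNode(std::unique_ptr<ShadeNode> x, std::unique_ptr<ShadeNode> y)
        : a(std::move(x)), b(std::move(y)) {}
    Colour evaluate() const override {
        Colour ca = a->evaluate(), cb = b->evaluate();
        return {ca.r + cb.r, ca.g + cb.g, ca.b + cb.b};
    }
};

// Evaluating the root shades the current point.
Colour shadePoint(const ShadeNode& root) { return root.evaluate(); }
```

Nesting a ScaleNode over an AddNode of two leaves mirrors Cook's composite operations: the outputs of the children become the inputs of the parent.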
In addition to Shade Trees are Light Trees and Atmosphere Trees. A Light
Tree describes a specific type of light, such as spotlight or point light. Each
type of light source is accessed by a Shade Tree to perform the appropriate
light shading calculations. Atmosphere Trees affect the light output from a
Shade Tree by performing some computation to simulate atmospheric effects,
such as fog. This transforms the light that reaches the virtual camera.
An Image Synthesizer
Ken Perlin’s [39] subsequent work extended the programming model of Shade
Trees; this language was called the Pixel Stream Editing (PSE) language.
The PSE language included conditions, loops, function definitions, arithmetic
and logical operators, and mathematical functions. In the PSE language
variables are either scalars or vectors and operators can work on both types.
For example, the expression a + b could be adding two scalars or two vectors.
Another important contribution of the PSE language is the famous Perlin
Noise function. The noise function is used to create “natural” looking tex-
tures that are devoid of repeating patterns. Using noise as a base function,
complex functions are used to produce very realistic effects such as water,
crystal, fire, ice, marble, wood, metal and rock. These effects are examples of
solid textures, which generate distinct features in 3-dimensional space. This
was a major step forward for procedural textures, which were previously
only computed in 2D.
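Perlin's original noise interpolates pseudo-random gradients on a 3-dimensional lattice; as a simpler illustration of the same idea, the sketch below implements 1-dimensional value noise, which smoothly interpolates hashed values at integer lattice points. The hash constant and function names are choices made for this example only, not Perlin's construction.

```cpp
#include <cmath>
#include <cstdint>

// Deterministic hash of a lattice coordinate to the range [0, 1).
// The multiplier is Knuth's multiplicative hash constant; any good
// integer hash would serve.
float latticeValue(int32_t x) {
    uint32_t h = static_cast<uint32_t>(x) * 2654435761u;
    h ^= h >> 16;
    return static_cast<float>(h & 0xffffffu) / 16777216.0f;
}

// Smoothstep fade curve with zero slope at t = 0 and t = 1, so the
// noise shows no visible creases at lattice points.
float fade(float t) { return t * t * (3.0f - 2.0f * t); }

// 1-D value noise: interpolate between the hashed values at the two
// surrounding lattice points.
float valueNoise(float x) {
    int32_t x0 = static_cast<int32_t>(std::floor(x));
    float t = x - static_cast<float>(x0);
    float a = latticeValue(x0);       // value at the left lattice point
    float b = latticeValue(x0 + 1);   // value at the right lattice point
    return a + (b - a) * fade(t);     // smooth interpolation between them
}
```

Summing several octaves of such a function at increasing frequencies and decreasing amplitudes yields the turbulence used for effects like marble and fire.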
RenderMan [24] is one of the most popular shader languages in use today.
RenderMan builds on the foundation laid by Shade Trees and PSE. The
design goals of RenderMan were to develop a unified rendering system that
supports shading models for global and local illumination, define the interface
between the rendering system and shader programs, and provide a high level
language that is expressive and easy to use.
The RenderMan shader language [41] defines a number of different shader
types handled at different stages in the rendering process. This is made pos-
sible by standardising the shader behaviour and defining an interface to the
greater rendering system. These shader types include
• Light shaders are attached to a point in space and compute the color
of light that is emitted to an illuminated surface.
There is a strong correlation between the RenderMan shader types and the
Shade Trees types mentioned previously. For example, surface and displace-
ment shaders closely resemble the functionality of the base Shade Tree, and
light and volume shaders perform similar operations to Light and Atmosphere
Trees. The RenderMan system is far more flexible than the specialised graph-
ics hardware discussed later in the chapter. Typically RenderMan systems
utilise networked clusters of computers, called render farms, which distribute
the work load to improve performance.
The shader language has a C-like syntax and supports a specialised set of
types including floats, RGB colours, points, vectors, strings and arrays. The
language also supports loops and conditions, which are also present in Perlin’s
PSE language. Shaders make use of the wide array of trigonometric and
mathematical functions, as well as other special purpose functions, such as
interpolation, noise and lighting functions. A shader is implemented as a
named function with inputs and outputs. RenderMan defines a specific set
of input and output parameters for each shader class. For example, the
surface shader is required to output the Ci parameter that represents the
new surface color. One problem is that parameters tend to have cryptic
names, so it is a good idea to keep the reference manual handy.
The RenderMan implementation referred to in [24] utilises a virtual Single
Instruction Multiple Data (SIMD) [40] array architecture. The scene is split
into regions and each region has the associated shaders executed over the
region by the rendering system. Even though this was a software renderer,
the SIMD architecture facilitates the utilisation of high speed parallel graph-
ics hardware. This leads into the following discussion on the evolution of
hardware-based rendering systems.
ated graphics systems was developed by Silicon Graphics Incorporated (SGI)
in the 1980s [3]. This system could render 100,000 polygons per second at a
refresh rate of 10Hz; relatively speaking this was a very high performance sys-
tem. This system defined a polygon as 4 sided, 10x10 pixel, RGB, Gouraud
shaded, z-buffered, clipped and screen projected.
The SGI system tightly coupled the main CPU and RAM with the graphics
system into one complete unit. From a usability stand point, this allowed
the graphics system to integrate with a user’s window-based desktop envi-
ronment. The SGI graphics system provided the hardware acceleration of
a fixed function pipeline and was comprised of the geometry [5], scan con-
version and rasterisation subsystems. Subsequent work by SGI on hardware
graphics systems enabled realtime texture mapping and anti-aliasing [1], and
made further improvements on refresh rates and overall performance [36].
The DN10000VS system used an alternative approach to using a single spe-
cialised graphics device. This was to make holistic changes to the system
to handle high-performance graphics [28]. One of the main goals of the sys-
tem designers was to have most of the hardware usable most of the time.
This could only be achieved through efficient load balancing and minimising
latency system wide. Performance was primarily achieved through utilising
four processors, with custom graphics instructions and improved hardware
bus speeds.
In the mid-1990s, Lastra et al. [30] outlined the requirements for programmable
graphics hardware that could achieve interactive frame rates. These require-
ments covered programmability, memory layout, and computational power
that formed part of the experimental graphics system called PixelFlow. This
work was built on the previous work done by Molnar et al. [35] and followed
the achievements of early graphics systems [20, 21].
To achieve interactive frame rates PixelFlow employed large-scale parallelism.
To highlight the scale required, during that time existing commercial graphics
systems required hundreds of processors for a fixed rendering pipeline [36],
a programmable pipeline would require many more. Parallel architectures
provided greater performance gains over single pipeline architectures as data
was processed simultaneously across an array of processors. Single pipeline
architectures are constrained to the clock speed of the single processor; per-
formance increases could only be achieved through advances in technology
that improve clock speed.
PixelFlow used a SIMD array of 128 x 64 pixel processors and two general
purpose RISC processors. The general purpose processors fed instructions
into the SIMD array where they were executed simultaneously. Parallelism
was applied to a scene by subdividing the screen into 128 x 64 pixel regions
and each region was then processed, all of its pixels at once.
The modern rendering pipeline evolved as a result of common patterns found
in the processes used for rendering 3-dimensional scenes. This pipeline was
devised to provide a flexible and simple programming model to the graph-
ics hardware. A high level example of a modern rendering pipeline can be
seen in Figure 1.1. This render pipeline is comprised of the Application,
Command, Geometry, Rasterisation, Texture, Fragment and Display sub-
systems [2]. The Application interacts with the graphics pipeline using the
Command system; this is typically done through a computer graphics API,
such as OpenGL. The Command system supports the specification of the
3-dimensional scene, for example specifying geometry, textures, lighting, and
cameras. The Geometry system handles the transformation, clipping, culling,
texture coordinates, lighting, and primitive assembly operations. The Ras-
terisation system samples the geometry into colour fragments and performs
colour interpolation. The Texture system performs texture transformation,
projection, and filtering. The Fragment system performs alpha, stencil, and
depth testing along with fog and blending to produce pixel colours. The
Display system performs gamma correction and generates the output signal
for the display.
At the core of modern graphics hardware is the Graphics Processing Unit
(GPU). The term GPU was first introduced in 1999 in the NVIDIA Geforce
series of chips [17]. Other hardware vendors, such as ATI, 3D Labs and
Matrox, also used the same or a similar term for their graphics processing chip. The
first generations of GPU provided hardware acceleration of the fixed function
pipeline mentioned earlier. Figure 1.2 provides a logical view of the render
pipeline operations performed by the CPU and GPU. This figure shows that
only the Application and Command stages of the pipeline are performed in
the CPU. The rest is executed by the GPU, therefore removing the intensive
processing from the CPU. The GPU has its own local video RAM and can
also access the main RAM via the Advanced Graphics Port (AGP) [10]. The
standard 1x AGP speed is approximately 267 megabytes (MB) per second
throughput. The standard AGP speed at the time of this research is 4x AGP,
which is about 1 gigabyte (GB) per second throughput. Newer generations
of computers will have 8x AGP, which is approximately 2 gigabytes per sec-
ond throughput. It is important to note that the AGP bus is the primary
bottleneck when transmitting data to and from the video RAM accessed by
the GPU.

Figure 1.1: The Application interacts with the graphics pipeline using the
Command system; this is typically done through a computer graphics API,
such as OpenGL. The Command system supports the specification of the
3-dimensional scene, for example specifying geometry, textures, lighting, and
cameras. The Geometry system handles the transformation, clipping, culling,
texture coordinates, lighting, and primitive assembly operations. The Ras-
terisation system samples the geometry into colour fragments and performs
colour interpolation. The Texture system performs texture transformation,
projection, and filtering. The Fragment system performs alpha, stencil, and
depth testing along with fog and blending to produce pixel colours. The
Display system performs gamma correction and generates the output signal
for the display.
There are a number of reasons why GPUs are better suited to computer
graphics than CPUs [19]. Firstly, GPUs employ large-scale parallelism, re-
sulting in fewer clocks per instruction. This is enabled through data locality,
where processors essentially have exclusive access over their allocated data.
Secondly, GPUs have multiple and wide memory interfaces, which means
that the GPU is optimised for high throughput, thus reducing data access
latency. Thirdly, GPUs have deep execution pipelines that help to amortize
latencies over the entire processing time. The deep execution pipelines are
often executed using stream processors. A stream, shown in Figure 1.3, is
essentially comprised of a sequence of kernels that process data in one di-
rection [25]. Kernels are limited to processing only the data passed down
to them; this is referred to as data locality. The momentum behind stream
processing is increasing, particularly in the area of media processing [27] [46].
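The stream-of-kernels model described above can be sketched in a few lines of C++: each stage sees only the buffer handed to it and passes its result downstream, never reaching back. The two stages here (a gain and an offset) are invented placeholders for real media processing kernels.

```cpp
#include <cstddef>
#include <vector>

// A stream is a sequence of stages; data flows through them in one
// direction, mirroring the data-locality constraint described above.
using Buffer = std::vector<float>;
using Stage = Buffer (*)(const Buffer&);

// Placeholder stage: scale every sample by a fixed gain.
Buffer gainStage(const Buffer& in) {
    Buffer out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * 2.0f;
    return out;
}

// Placeholder stage: add a fixed offset to every sample.
Buffer offsetStage(const Buffer& in) {
    Buffer out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] + 1.0f;
    return out;
}

// Run the data through the stages in order; no stage reaches back
// upstream or shares state with another.
Buffer runStream(Buffer data, const std::vector<Stage>& stages) {
    for (Stage s : stages) data = s(data);
    return data;
}
```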
Figure 1.2: A logical view of the render pipeline operations performed by the
CPU and GPU. This figure shows that only the Application and Command
stages of the pipeline are performed in the CPU. The remainder are executed
by the GPU, therefore removing the intensive processing from the CPU. The
GPU has its own local video RAM and can also access the main RAM via
the Advanced Graphics Port (AGP). It is important to note that the AGP
bus is the primary bottleneck when transmitting data to and from the GPU;
the higher the transfer speed of the AGP bus, the better the throughput.
Figure 1.3: The deep execution pipelines are often executed using stream
processors. A stream is essentially comprised of a sequence of kernels that
process data in one direction. Kernels are limited to processing only the
data passed down to them, or a limited number of high-speed global registers.
This is referred to as data locality.
Figure 1.4: The data flow of a Programmable Rendering Pipeline. This
pipeline replaces parts of the fixed pipeline with Vertex Program and the
Fragment Program stages.
Vertex Programs
constant registers, and 12 output registers, and a maximum of 128 instructions
per shader. The input registers default to vertex attributes such as position,
colour, normal and texture coordinates, however, the register contents can
be overridden by the programmer. The constant registers are available for
application specific values, these values are the same for each execution of
the Vertex Program.
The Vertex Program uses a low-level machine language with operations in-
cluding move (MOV), multiply (MUL), distance (DST), minimum (MIN), four-
component dot product (DP4) and others. With each new generation of
graphics hardware, these limitations are being eliminated; for example, the
latest NVIDIA GeforceFX allows shaders to have up to 1024 instructions
and many more registers.
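The flavour of these register-based instructions can be conveyed with C++ equivalents. DP4, for instance, computes a four-component dot product, as used for one row of a matrix-vector transform of a vertex position. The Reg4 name is an invention of this sketch, not part of any vertex program specification.

```cpp
#include <array>

// Stand-in for a four-component hardware register.
using Reg4 = std::array<float, 4>;

// DP4: four-component dot product, e.g. one row of a 4x4 matrix-vector
// transform applied to a vertex position.
float dp4(const Reg4& a, const Reg4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// MUL: component-wise multiply, e.g. modulating a colour by a light term.
Reg4 mul(const Reg4& a, const Reg4& b) {
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}
```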
Fragment Programs
Low level shader languages share many of the same problems as CPU machine
languages. They are not easy to read or write, and are tightly coupled to
the hardware. New shading languages are being designed to counter these
problems; two hardware-based languages that are still in development are “C
for Graphics” (Cg) [13] and OpenGL Shader Language (GLslang) [26]. The
remainder of this thesis will focus on Cg; further details on programming
with Cg can be found in Appendix B.
Shader language compilers are designed to produce the machine language
that is compliant with the target shader specification. The compiler can emu-
late missing functionality wherever possible. However under certain circum-
stances the compiler may reject or ignore functions that are not supported
by the graphics hardware.
These shader languages share similarities with the RenderMan shader lan-
guage by design. These include a C-like language, support for high precision
data types (such as 32-bit floating point numbers), provide noise and turbu-
lence functions, and provide a rich set of mathematical functions. However,
RenderMan provides more shader types applied to more stages in the ren-
dering pipeline. The current generation of high level shader languages is
still limited to the capabilities of the programmable graphics hardware; only
the Vertex Program or the Fragment Program can be used to customise the
rendering pipeline of programmable graphics hardware.
There is an ever-increasing body of research into using GPUs for various
tasks. GPUs are currently used in computer graphics for accelerating ra-
diosity calculations [8], and ray tracing [43]. However, the remainder of this
section provides an overview of the work done with GPUs on non-computer
graphics research.
Moreland et al. [38] implement a Fast Fourier Transform (FFT) using a
similar GPU to the one used in my research. The FFT implementation
did not use Cg, but was instead implemented as a Fragment Program in
machine language. The results of the research showed that a 512 by 512
image was synthesised by conventional means, the FFT performed, the image
filtered, and finally the inverse FFT applied, all in well under one second.
Moravanszky [37] discusses implementing dense matrix multiplication using
Microsoft DirectX, instead of OpenGL and Cg. This has the disadvantage that
it ties the implementation to the Windows platform. Moravanszky’s research
differs to my research in that he used an ATI Radeon GPU. Moravanszky’s
results showed significant performance gains for larger datasets that can
absorb the cost of transferring the data to the GPU.
Krueger et al. [29] implement a comprehensive suite of linear algebra opera-
tions on a GPU, including vector arithmetic, matrix-vector products, sparse
matrices, banded matrices and many more. Krueger et al. also used an ATI
Radeon GPU that did not provide as comprehensive floating-point number
support as the NVIDIA GeforceFX used in my research. Higher perfor-
mance can be achieved when less accurate data types are used. However,
their research shows that even with consideration given to lower precision,
the performance gains were significant in comparison to the CPU-based im-
plementation.
Much of the research discussed in the previous paragraphs is based on ad-
hoc systems developed to test a narrow domain. McCool et al. [33] developed
a metaprogramming system that enabled all code, including the shader ma-
chine language, to be specified in C++. The metaprogramming system would
manage the underlying graphics API, in this case OpenGL, and also handle
loading and executing the shader machine language on the CPU. Although
the research primarily focuses on computer graphics examples, this kind of
system could be used for more generic tasks.
Buck et al. [4] emphasise that a GPU is a form of stream processor. This led
to the development of Brook, a new language that shares many similarities
with Cg, but differs in that it provides greater support for generalised GPU
programming. The Brook programming model is based on streams of kernels.
The research done by Buck et al. bears some resemblance to my research.
The main similarity lies in the motivation to create a programming model
based on stream processors. The main difference is that my research tried
to work within the limitations imposed by established technologies such as
C++ and Cg, rather than creating a new language.
Chapter 2
Figure 2.1: The StreamCg inheritance hierarchy. The Kernel class can be
used for CPU-based Kernel implementations, these Kernels will execute on
the CPU. The CgKernelFP subclass provides the additional infrastructure to
execute a Cg Fragment Program on the GPU. The EmuCgKernelFP subclass
provides the emulation layer required to emulate a subset of the Cg language.
EmuCg is discussed in more detail later in this section. These classes are
extended and appropriate methods can be overridden to specialise the
object’s behaviour.
Vertex Programs are not widely used for generic programming tasks. The
EmuCgKernelFP subclass provides the emulation layer required to emulate
a subset of the Cg language. EmuCg is discussed in more detail later in
this section. These classes are extended and appropriate methods can be
overridden to specialise the object’s behaviour.
A Kernel is configured to transfer its output to a down stream Kernel using
the writeTo(next:Kernel,inputName:String) method. This method is
passed the next Kernel instance and the name of the input. StreamCg only
supports transferring data to one input at a time; however, there may be
other unused inputs specified by a Kernel. An example Stream Program is
assembled in Figure 2.2. In this figure, Kernel A is the data source where the
data may be obtained from numerous sources including local files, databases
or over a network; Kernel D is the data sink where the output data may
be displayed on the screen, written to a file, or transferred over a network.
Kernels B and C are typical Kernels in so far as they receive data as input,
process the data, and pass on the data down stream.
kernelA->writeTo(kernelB, "InputB");
kernelB->writeTo(kernelC, "InputC");
kernelC->writeTo(kernelD, "InputD");

Figure 2.2: Kernel A is the data source where the data may be obtained from
numerous sources including local files, databases or over a network; Kernel
D is the data sink where the output data may be displayed on the screen,
written to a file or transferred over a network. Kernels B and C are typical
Kernels in so far as they receive data as input, process the data, and pass on
the data down stream.

The Kernel::execute() method is called on the root Kernel, which in turn
executes the entire Stream Program one Kernel at a time. The execution
sequence of a single Kernel is as follows: Kernel::initialise() performs
user initialisation of values prior to processing; Kernel::process() performs
the actual data processing, which may be repeated a predefined number
of times; and finally Kernel::transfer() transfers the output data to the
next Kernel. The behaviour of the Kernel::transfer() method is dependent
on the type of Kernel. Kernels that execute on the CPU output their
data to PixelBufferChannels, which store data in main RAM and transfer
data downstream using Channel::write(buf:PixelBuffer). GPU-based
Kernels output data to TextureBufferChannels, which store data in video
RAM and transfer data using Channel::write(). Details of the significance
of the two write methods are discussed later in this section.
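The execution protocol just described, initialise(), then process() (possibly repeated), then transfer() to the downstream Kernel, can be sketched as follows. This is a deliberately simplified illustration: Channel handling is reduced to a plain vector, and ScaleKernel is a made-up example standing in for a real media processing Kernel, not part of the actual StreamCg implementation.

```cpp
#include <string>
#include <vector>

// Simplified sketch of the Kernel execution protocol described above.
class Kernel {
public:
    virtual ~Kernel() = default;

    // Configure this Kernel to feed the named input of the next Kernel.
    void writeTo(Kernel* nextKernel, const std::string& inputName) {
        next = nextKernel;
        input = inputName;
    }

    // Runs this Kernel, then the remainder of the Stream Program.
    void execute() {
        initialise();
        for (int i = 0; i < repeatCount(); ++i) process();
        transfer();
        if (next) next->execute();
    }

    std::vector<float> data;   // stands in for a Channel's buffer

protected:
    virtual void initialise() {}                  // user set-up before processing
    virtual void process() {}                     // the actual data processing
    virtual int repeatCount() const { return 1; } // process() may repeat

    // Hand this Kernel's output downstream.
    virtual void transfer() {
        if (next) next->data = data;
    }

    Kernel* next = nullptr;
    std::string input;
};

// A trivial CPU Kernel that doubles every element of its buffer.
class ScaleKernel : public Kernel {
protected:
    void process() override {
        for (float& v : data) v *= 2.0f;
    }
};
```

Calling execute() on the root Kernel runs the whole chain, matching the writeTo() wiring shown in Figure 2.2.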
StreamCg Kernels are constrained to the types of data inputs and outputs
provided by a Fragment Program. As outlined in Appendix B, a Fragment
Program can take 2-dimensional textures as input, and its output is always
written to the colour buffer in the current OpenGL render context [44]. By
default OpenGL clamps the values of buffers and textures, to the range of 0.0
to 1.0. Under normal use this is sufficient as the values typically represent
RGBA colours. However, for more generic data processing this is a serious
limitation. The solution is to use the NVIDIA OpenGL extension for Float
Buffers [14]. This extension allows the creation of render contexts containing
buffers and textures in video RAM that do not enforce any limitation of the
value beyond that of a float type.
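The clamping problem can be seen with a one-line stand-in for the default buffer write: any intermediate result outside 0.0 to 1.0 is silently lost, which is why the float buffer extension matters for generic data. clampedStore is a hypothetical helper for illustration, not an actual OpenGL call.

```cpp
#include <algorithm>

// Stand-in for the default OpenGL buffer write: values outside the
// [0.0, 1.0] colour range are clamped, losing any out-of-range result
// of a generic computation. An unclamped float buffer avoids this.
float clampedStore(float v) {
    return std::clamp(v, 0.0f, 1.0f);
}
```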
The complexity of managing textures and buffers is encapsulated in the in-
frastructure of StreamCg. The Channel::write() method will copy the
colour buffer contents of the current render context into either a texture
(video RAM) or pixel buffer (main RAM). The Channel::write(buf:PixelBuffer)
method will copy the contents of a pixel buffer into either a texture or an-
other pixel buffer. So, based on the configuration of the Stream Program
Kernels, StreamCg will perform the appropriate conversion of channel data
as required.

Figure 2.3: Illustration of the four modes of data transfer supported by
StreamCg. It has four parts representing different source and target Kernel
implementations. In part A, the data transfer remains in video RAM as both
are GPU Kernels. In part B, the data remains in main RAM as both are
CPU Kernels. In part C, the data is converted from a texture in video RAM
into a pixel buffer in main RAM, and in part D the reverse occurs where the
data is converted into a texture in video RAM.

Figure 2.3 illustrates the four modes of
data transfer supported by StreamCg. It has four parts representing differ-
ent source and target Kernel implementations. In part A, the data transfer
remains in video RAM as both are GPU Kernels. In part B, the data remains
in main RAM as both a CPU Kernels. In part C, the data is converted from
a texture in video RAM into to pixel buffer in main RAM, and part D the
reverse occurs where the data is converted into a texture in video RAM.
To ensure compatibility of data sizes when transferring the colour buffer
to a texture and in reverse, StreamCg enforces a uniform data size for all
Kernels in a Stream Program. The size limitations are influenced by the size
constraints of textures in OpenGL (up to 4096 x 4096) and the available
video RAM. The size is set as part of the OpenGL render context; in typical
OpenGL applications this is the viewport size. Pure CPU Stream Programs
are least affected by these limitations as they do not use video RAM.
StreamCg provides the KernelContext class to manage the creation of Float
Buffers. A KernelContext is specified on a per-Kernel basis, which allows
multiple contexts to be utilised in a single Stream Program; however,
context switching is demanding on resources. A CPU Kernel executes in the
default, or null, context, which can be considered main RAM. GPU
Kernels must execute within a KernelContext, otherwise the data values will
be clamped.
The Kernels of a Stream Program are executed over the entire data set from
top to bottom, row by row. This is the execution pattern supported by
Fragment Programs, as illustrated by Figure B.3 in Appendix B. It is
achieved by rendering a GL_QUAD that spans the entire display area, which
results in the execution of the Fragment Program for every pixel in the
colour buffer. Due to the data locality constraints imposed by the GPU, the
row and column of the pixel being processed are not immediately available.
StreamCg therefore specifies texture coordinates for each corner of the
rendered GL_QUAD. These are passed to the Fragment Program using the TEX0
semantic and serve as an indexing mechanism for calculations that need to
know the row and column being processed. The CPU-based Kernels emulate this
behaviour to ensure a consistent programming model.
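As a sketch of this execution pattern (the names here are illustrative, not the actual StreamCg API), a CPU Kernel can emulate a Fragment Program by walking the buffer top to bottom, row by row, handing each fragment the texture coordinate it would otherwise receive via TEX0:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative emulation of the Fragment Program execution pattern: the
// fragment function is invoked once per pixel, top to bottom, row by row,
// and receives its own row/column via an interpolated texture coordinate.
struct TexCoord { float u, v; };

template <typename FragmentFn>
void runOverQuad(std::size_t width, std::size_t height, FragmentFn fragment) {
    for (std::size_t row = 0; row < height; ++row)      // top to bottom
        for (std::size_t col = 0; col < width; ++col)   // left to right within the row
            fragment(TexCoord{static_cast<float>(col), static_cast<float>(row)});
}
```

The texture coordinate is the only per-fragment index available; the fragment function cannot inspect its neighbours, mirroring the data locality constraint of the GPU.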
The underlying StreamCg infrastructure comprises a number of layers:
the OpenGL Object layer, the Cg Object layer and the StreamCg layer.
The OpenGL Object layer wraps the OpenGL API into convenience objects,
and the Cg Object layer wraps the underlying Cg API. These subsystems
provide convenience classes essential to overcoming the problems, discussed
in the following paragraphs, encountered during the construction of StreamCg.
Many problems were encountered when developing the StreamCg framework;
the primary issue was that OpenGL has a C API. This is problematic
because logical elements within OpenGL (textures, meshes, etc.) are object-
like, but they are not exposed as objects. It is therefore easy to forget to
set a property, or not to realise that the properties set are incompatible.
For example, when using the NVIDIA Float Buffer extension, textures must
have their internal format set to GL_FLOAT_RGBA_NV [14] instead of the
standard GL_RGBA, which is easy to forget when developing a system for the
first time.
Another significant development issue is that OpenGL and Cg are essentially
separate systems with a loose coupling via the Cg Runtime. Limited
diagnostics and validation are performed, so it is important to perform
manual validation regularly. One issue encountered during development
relates to silent incompatibilities that cause things not to work.
For example, the typical Cg type for a texture is sampler2D, which requires
that the texture is specified in OpenGL using the GL_TEXTURE_2D type [44].
Using the NVIDIA Float Buffer extension, however, requires the texture type
to be GL_TEXTURE_RECTANGLE_NV. This is not compatible with the Cg sampler2D
type, so samplerRECT [14] must be used instead.
2.1 The EmuCg Framework
Chapter 3
Testing
• Cg Toolkit for Linux Version 1.1, which provided the Cg Runtime and
language compiler [13];
• and the GNU Wavelet Image Codec Version 0.1 (GWIC) [31], for an
easy-to-follow DWT reference implementation.
Figure 3.1: The DWT Stream Program comprises four kernels: an image
loader, forward DWT, inverse DWT, and image viewer. The image loader
reads PNG image files from disk. The images are already resident in memory
before execution, so the load time from disk is not included. The
forward DWT processes the data from the image loader, the output of which
is fed into the inverse DWT.
The source code was compiled using GCC 3.2.3 to produce an optimised
binary executable, using the following configuration:
-march=athlon-tbird -mmmx -m3dnow -Wall -O3 -pipe -fomit-frame-pointer
3.2 Method
The aim of this experiment is to compare the execution times of CPU and GPU
based stream programs implemented using StreamCg. The test stream pro-
gram, illustrated in Figure 3.1, comprises four kernels: an image loader,
forward DWT, inverse DWT, and image viewer. The image loader reads
PNG image files from disk. The forward DWT processes the data from the
image loader, the output of which is fed into the inverse DWT. This col-
lection of stream kernels will be referred to as the DWT Stream Program.
The expected output is that the image viewer will display an image identical
to the input image. Only a working knowledge of DWTs was required for
this research, and only the salient aspects of DWTs are discussed in this
paper. Further details on DWTs can be found in the existing body of knowl-
edge [23, 42].
The DWT algorithm implemented as part of this research is designed for
Figure 3.2: DWTs are applied to 2-dimensional images in two passes, parts
A and B. The first pass, the vertical pass, performs a high and low band
filter on a row-by-row basis, resulting in a high and a low column. The second
pass, the horizontal pass, again performs a high and low band filter on the
data, this time on a column-by-column basis. The result, seen in
part C, is four quadrants with a mixture of high- and low-filtered data. The
algorithm then recursively transforms the upper left quadrant to produce
part D.
developed to execute on the GPU; the Cg code can be seen in Appendix C.
The algorithm processes logical rows, where a logical row may be a physical
row or column depending on the orientation. This is required because
Fragment Programs only process data on a row-by-row basis. The isH and isV
parameters are used to set the logical row orientation to horizontal or
vertical, respectively. If one value is set to 1.0 the other must be set to
0.0. An alternative approach would have been to use two different algorithms
for horizontal and vertical processing; however, this would incur a
significant performance penalty caused by rapid unloading and loading of the
Fragment Programs.
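To make the filtering itself concrete, here is a CPU sketch of one forward pass over a single logical row. The coefficients are the textbook orthonormal Daubechies-4 bank (assumed, not taken from the thesis code), and the half-split output layout is chosen purely for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard orthonormal Daubechies-4 filter bank. The thesis passes such
// coefficients to the shader as daub4Low/daub4High; the values below are
// the textbook ones and are an assumption of this sketch.
static const double S3 = std::sqrt(3.0);
static const double NORM = 4.0 * std::sqrt(2.0);
static const double daub4Low[4]  = { (1.0 + S3) / NORM, (3.0 + S3) / NORM,
                                     (3.0 - S3) / NORM, (1.0 - S3) / NORM };
static const double daub4High[4] = { daub4Low[3], -daub4Low[2],
                                     daub4Low[1], -daub4Low[0] };

// One forward pass over a single logical row: low-band outputs fill the
// first half of the result, high-band outputs the second half, with a
// periodic boundary. The row length is assumed even and at least 4.
std::vector<double> daub4Pass(const std::vector<double>& row) {
    const std::size_t n = row.size();
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n / 2; ++i) {
        double lo = 0.0, hi = 0.0;
        for (std::size_t k = 0; k < 4; ++k) {
            double x = row[(2 * i + k) % n];  // wrap at the row boundary
            lo += daub4Low[k] * x;
            hi += daub4High[k] * x;
        }
        out[i] = lo;          // low band in the first half
        out[n / 2 + i] = hi;  // high band in the second half
    }
    return out;
}
```

On a constant row the high band is zero and the low band is the constant scaled by √2, which is a quick sanity check on the filter bank.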
The Cg DWT algorithm is optimised for how the GPU handles branching.
On a current GPU both paths of a boolean condition are executed. If the
condition is true, the results of the true path are multiplied by 1.0 and
those of the false path by 0.0, and the values are then added together; the
opposite occurs if the condition is false. The result is a weighted sum of
both paths based on the condition. This is not immediately obvious to a
developer, and is counter to how branching is performed on a CPU, where only
the taken branch is executed. The algorithm is optimised by collocating the
common parts of each branch, thus minimising the actual code for each
condition.
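The weighted-sum pattern described above can be sketched in a few lines (C++ here for illustration; on the GPU the weight is derived from the condition automatically):

```cpp
#include <cassert>

// Both paths are evaluated regardless of the condition; the condition only
// selects the weights. This mirrors how a current fragment pipeline
// realises an if/else without actual control-flow divergence.
float selectWeighted(bool condition, float truePath, float falsePath) {
    float w = condition ? 1.0f : 0.0f;  // 1.0 when true, 0.0 when false
    return w * truePath + (1.0f - w) * falsePath;
}
```

Because both paths always execute, moving work common to the two branches outside the conditional, as the Cg DWT algorithm does, directly reduces the total instruction count.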
The EmuCgForwardWaveletKernel and EmuCgInverseWaveletKernel ker-
nels were developed to assist in debugging the algorithm during develop-
ment. This was achieved by prototyping the forward and inverse DWT
algorithms and tracing through them with a debugger. This is not possible
with Cg code executed on the GPU, so it is very difficult to develop complex
algorithms without something like EmuCg.
The StreamCg framework enables any forward DWT implementation to be
used with any other inverse DWT implementation. This is achieved by
changing only the starting configuration of the kernels, as illustrated in
Figure 2.2. This highlights the modularity of StreamCg kernels and the
improved possibility of kernel reuse, supported by the strict object contract
between the kernels and the StreamCg framework that enforces loose coupling.
The results were obtained by executing the DWT Stream Program on 256 by 256,
512 by 512, and 1024 by 1024 data sizes. This research intended to explore
the usage of larger data sets, including 2048 by 2048 and 4096 by 4096.
However, the graphics hardware could not support such large data sets as it
Implementation   Data Size   Average Time (ms)
GPU              256x256     10
                 512x512     11
                 1024x1024   11
CPU              256x256     89
                 512x512     450
                 1024x1024   2783
CPU (EmuCg)      256x256     920
                 512x512     4191
                 1024x1024   18656
Table 3.1: The execution times are presented from fastest to slowest. The GPU
shows the fastest times, which scale very well with increases in data size.
The raw GPU execution times follow an interesting pattern: the first time
the DWT Stream Program is run, it takes approximately 230 milliseconds,
while subsequent executions average in the four to seven millisecond range.
The CPU has the next fastest times, which scale significantly worse than the
GPU. Finally, the worst times are recorded using EmuCg on the CPU.
Chapter 4
Discussion
The introduction of high level shader languages, such as Cg, is encouraging
the development of a wider range of applications that utilise the processing
power of GPUs. The NVIDIA Cg toolkit provides a runtime system that
works within OpenGL, and a language for programming GPUs. StreamCg
utilises the Cg toolkit to provide a simple programming model, abstract
the underlying implementation complexities, and enable the development of
reusable software components.
Many limitations are imposed on StreamCg due to the strong dependency on
OpenGL. A major problem is that OpenGL is not being used in the manner
for which it was originally designed, in so far as it is not being used for
computer graphics. This means that vendor extensions, such as the NVIDIA
Float Buffer extension, must be used to enable more generic programming.
StreamCg is also limited to a fixed data size for all kernels within an execution
context. This limitation is due to the use of textures and the colour buffer
to store input and output data respectively.
EmuCg assists in the execution of NVIDIA Cg programs on a traditional
CPU. This aids in the normally difficult or impossible task of debugging Cg
programs. EmuCg only supports the execution of Cg Fragment Programs in
a limited manner. However, it provides the foundations for a more complete
emulation of the Cg language.
The GPU DWT Stream Program shows the best performance. Not only are
the execution times significantly faster than the CPU times, but they also
scale very well, remaining almost constant as the data size increases.
The next best times are those of the CPU DWT Stream Program using the
GNU Wavelet Image Codec (GWIC) code, of which the fastest time is still
roughly eight times worse than the slowest GPU time. This Stream Program
does not scale well as the data sizes increase: the execution times are five
to six times slower with each fourfold increase in data size, indicating an
approximately linear relationship between the data size and the execution
time.
The worst times are recorded for the EmuCg DWT Stream Program; these
are about ten times worse than the GWIC algorithm. This is due to the cost
of emulating the runtime behaviour of Fragment Programs using EmuCg. In
addition, the Cg DWT algorithm is designed to execute on a GPU and
performs badly on a CPU, as it relies on parallel execution and high
bandwidth memory throughput.
The summary of results above highlights that, by using parallelism and high
bandwidth memory throughput, the GPU achieves significant performance
gains over CPU-based programs. This is primarily because a CPU is a
scalar processor with a single, relatively narrow memory interface. However,
current graphics hardware is limited by the size of video RAM, thus limiting
the amount of data that can be processed. Also, OpenGL only supports
textures, used as inputs in StreamCg, up to 4096 by 4096. As a result
the GPU is not limited by processing power, but by the capacity to store
data. The CPU has the inverse relationship: it is primarily bandwidth and
processor limited, and limited less by storage capacity.
Better results could be achieved if the motherboard used supported 8x AGP,
with about 2 GB per second of throughput. This would minimise the initial
spike seen in the GPU results, which is caused by uploading the texture to
the graphics hardware. An increase in video RAM size would also need to be
accompanied by an increase in AGP bus speed, necessary to transfer larger
data sets to the graphics hardware in a timely manner.
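As a back-of-the-envelope check (illustrative assumptions, not measurements), the raw transfer cost of the initial texture upload can be estimated from the bus throughput alone:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Estimated upload time for a width x height RGBA float texture over a bus
// with the given throughput. Driver and copy overheads are ignored, so the
// real cost (e.g. the ~230 ms first-run spike) is considerably higher.
double uploadMillis(std::size_t width, std::size_t height, double gigabytesPerSec) {
    double bytes = static_cast<double>(width) * height
                 * 4.0   // RGBA channels
                 * 4.0;  // bytes per float component
    return bytes / (gigabytesPerSec * 1e9) * 1000.0;
}
```

For a 1024 by 1024 RGBA float texture at 2 GB/s this gives roughly 8 ms of raw bus time, suggesting the measured first-run spike is dominated by driver and setup overheads rather than raw bandwidth.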
• Support GPU profiles other than FP30. This includes investigating the
use of Vertex Programs.
Appendix A
A.1 Background
the vertex shader and the pixel shader. The vertex shader is executed per
vertex with the purpose of transforming world coordinates into view-space
coordinates prior to view frustum clipping. The pixel shader is executed per
pixel during the rasterisation phase prior to screen coordinate clipping.
Earlier generations of shader languages resembled machine languages and
were limited and difficult to use; for example, branching and looping
operations were not supported. Current shader languages are becoming more
powerful and less specialised. These C-like languages, such as NVIDIA Cg,
support a wide range of types, and looping and branching operations. As
these languages have become more generalised, it has become apparent that
they could be used for high performance processing of generic programming
tasks.
A.2 Aim
A.3 Method
quire data to be transformed from geographic (latitude, longitude, elevation)
into Cartesian (x,y,z) coordinates, and back. The shader is expected to be
accurate and fast. The goal is to devise an implementation that can project
a complex scene consisting of terrain and entities at interactive rates. Other
applications beyond GIS will also be explored.
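As a CPU sketch of the transform the proposed shader would perform (a spherical Earth is assumed here for brevity; a production GIS shader would use an ellipsoidal model such as WGS84, and all names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Geographic (latitude, longitude, elevation) to Cartesian (x, y, z)
// conversion under a spherical Earth model.
struct Vec3 { double x, y, z; };

const double kPi = 3.14159265358979323846;
const double kEarthRadius = 6371000.0;  // metres, mean radius (assumption)

Vec3 geographicToCartesian(double latDeg, double lonDeg, double elevation) {
    double lat = latDeg * kPi / 180.0;
    double lon = lonDeg * kPi / 180.0;
    double r = kEarthRadius + elevation;
    return { r * std::cos(lat) * std::cos(lon),
             r * std::cos(lat) * std::sin(lon),
             r * std::sin(lat) };
}
```

Per-vertex evaluation of this transform involves only trigonometric functions and multiplies, which is why it is a plausible candidate for a vertex shader.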
A.4 Requirements
The application programming language used will be C++, and the shader
language will be NVIDIA Cg. Software requirements include Linux, GCC
3.x, OpenGL, and the NVIDIA Cg Toolkit.
Hardware requirements include a standard PC of around 2 GHz with about
512 MB of RAM and an NVIDIA GeForce FX graphics device.
Appendix B
Cg Programming
The Cg Toolkit consists of two parts: the Cg Language and the Cg Runtime
services [13]. The Cg language is used to implement GPU programs, the
generic form of Vertex and Fragment Programs. The Cg Runtime services form
part of the CPU-based host application and manage the execution context of a
GPU program. They are used to specify the GPU program profile; perform
loading, compiling and binding of GPU programs; and specify the input
parameters. This discussion outlines many relevant aspects of the Cg
Language, but does not attempt to cover the entire language.
tures from C++ and Java. Cg does not support full ANSI C; a number of
limitations are imposed. For example, pointers are not supported, and
arrays are therefore a first-class type. Among the enhancements, function
overloading is supported, and variables may be defined anywhere before being
used, not just at the beginning of a scope. The limitations of Cg are a
direct result of the limitations imposed by current GPU hardware. The
enhancements made to ANSI C are intended to provide specialised GPU
support, such as swizzling, and to reduce the programming effort for
certain tasks.
The supported data types include float, a 32-bit IEEE floating-point num-
ber; half, a 16-bit IEEE-like floating point number; int, a 32-bit integer;
fixed, a 12-bit fixed-point number; bool, a boolean type; and sampler*, a
texture object with six variants including sampler1D, sampler2D, samplerRECT
and sampler3D. Cg also supports more complex types, such as float2, float3
and float4, which are two-, three- and four-component floating point vectors
respectively. Although not used in this paper, Cg also supports matrix types,
such as float4x4, the largest matrix supported. These complex types act very
similarly to C++ classes, with a broad range of overloaded arithmetic
operators.
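As an illustration of how these vector types behave (a C++ analogue, not Cg itself; Cg additionally supports swizzling such as v.xyz, which this sketch does not model), the component-wise operators can be reproduced with a small struct:

```cpp
#include <cassert>

// A C++ analogue of Cg's float4: arithmetic acts component-wise, much like
// the overloaded operators of a C++ class, which is how these built-in
// types present themselves to the Cg programmer.
struct Float4 {
    float x, y, z, w;
};

Float4 operator+(Float4 a, Float4 b) { return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w}; }
Float4 operator*(float s, Float4 a)  { return {s * a.x, s * a.y, s * a.z, s * a.w}; }
```

On the GPU such operations map onto vector hardware, so a four-component add costs the same as a scalar one.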
A Cg program defines a main function that specifies the parameters for
varying inputs and outputs, and for constant data, referred to as uniform
data. The value of an input parameter and the destination of an output
parameter can be specified using semantic tags. Semantic tags relate to
specific GPU profiles and are declared with the parameter name and type. An
example of a main function using semantics is shown in Figure B.2. The input
parameter incol uses the COLOR semantic, which implies that for each pixel
this Fragment Program visits, the parameter receives the colour value. Some
input and output semantics are required, otherwise the GPU program is
invalid. If a parameter has no semantics defined, it is up to the host
application to specify the value of the parameter. Uniform data is constant
for the entire execution of the Cg program, and can be of any of the types
discussed previously. Figure B.1 illustrates the inputs, outputs and constant
data of a Fragment Program. The constant data is specified using the Cg
Runtime, discussed later.
Function parameters can also have a direction modifier specified: one of in,
out or inout. Parameters are declared in by default if no direction is
specified. An in parameter is passed by value; an out parameter is passed
out when the function exits but has no initial value; and an inout parameter
has a value on function entry, and if it is modified in the function the
change is reflected outside the function. The inout direction modifier is
intended to provide some of the functionality missing due to the lack of
pointers in Cg.
Figure B.1: A logical view of a Fragment Program that illustrates the per-
fragment input, per-fragment output and constant data. The per-fragment
input changes with each fragment processed, whereas the constant data is
the same for all fragments.
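The three direction modifiers have a natural C++ analogue using references (illustrative names; this is a sketch, not Cg code):

```cpp
#include <cassert>

// "in" corresponds to pass-by-value; "out" to a reference that is written
// before return but whose initial value is ignored; "inout" to a reference
// that is both read on entry and written back on exit.
void magnitudeOf(float value, float& outResult /* ~ out */) {
    outResult = value < 0.0f ? -value : value;  // initial value is never read
}

void accumulate(float increment, float& total /* ~ inout */) {
    total += increment;  // value on entry is read; the change is visible outside
}
```
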
The Cg language supports flow control constructs, such as looping and branch-
ing, which are relatively difficult to implement in parallel systems due to la-
tency and synchronisation issues. At present only the VP30 profile supports
both looping and branching; the FP30 profile supports only branching.
Figure B.3: When a Fragment Program is executed it processes each frag-
ment on a per-row basis, from the minimum screen position to the maximum
screen position. This means that at each fragment the Cg program knows
nothing of its adjacent neighbours; this data locality is one of the features
that enables the high performance parallelism of GPUs.
Appendix C
//parameters:
//icol - required for Fragment Program but not used
//tex0 - the execution position passed as texture coords
//subframeSize - the size of the subframe of the data buffer to process
//isH - use horizontal logical orientation
//isV - use vertical logical orientation
//hSize - half the size of the subframe
//imageData - the input data as an RGBA texture
//daub4High - the DWT high band coefficients
//daub4Low - the DWT low band coefficients
//ocol - the colour buffer output value
void main(
in float3 icol : COLOR,
in float2 tex0 : TEX0,
uniform float subframeSize,
uniform float isH,
uniform float isV,
uniform float hSize,
uniform samplerRECT imageData,
uniform float4 daub4High,
uniform float4 daub4Low,
out float4 ocol : COLOR)
{
//ensure we get a floored value to avoid interpolation issues
float2 tex = floor(tex0);
float2 uv0;
float2 uv1;
float2 uv2;
float2 uv3;
float uvMax;
float4 daub;
bool doFilter = false;
uv0 = float2(u, v);
daub = daub4High;
doFilter = true;
}
if(doFilter) {
//compute the next 3 input uv coords
//in the appropriate direction (H or V)
uv1 = float2(uv0.x + isH, uv0.y + isV);
uv2 = float2(uv1.x + isH, uv1.y + isV);
uv3 = float2(uv2.x + isH, uv2.y + isV);
//if the current execution position is not within the subframe then the
//existing value at the current position is copied to the output without
//being processed.
}
//parameters:
//icol - required for Fragment Program but not used
//tex0 - the execution position passed as texture coords
//subframeSize - the size of the subframe of the data buffer to process
//isH - use horizontal logical orientation
//isV - use vertical logical orientation
//hSize - half the size of the subframe
//imageData - the input data as an RGBA texture
//daub4High - the DWT high band coefficients
//daub4Low - the DWT low band coefficients
//ocol - the colour buffer output value
void main(in float3 icol : COLOR,
in float2 tex0 : TEX0,
uniform float subframeSize,
uniform float isH,
uniform float isV,
uniform float hSize,
uniform samplerRECT imageData,
uniform float4 daub4High,
uniform float4 daub4Low,
out float4 ocol : COLOR)
{
//ensure we get a floored value to avoid interpolation issues
float2 tex = floor(tex0);
float vStep = (1 * isH) + (0.5 * isV);
val = a;
}
Bibliography
[1] Kurt Akeley. RealityEngine graphics. In Proceedings of the 20th annual
conference on Computer graphics and interactive techniques, pages 109–
116. ACM Press, 1993.
[2] Kurt Akeley and Pat Hanrahan. CS448A: Lecture notes on real-time
graphics architectures, 2001. http://graphics.stanford.edu/
courses/cs448a-01-fall/index.html.
[3] Kurt Akeley and Tom Jermoluk. High-performance polygon rendering.
In Proceedings of the 15th annual conference on Computer graphics and
interactive techniques, pages 239–246. ACM Press, 1988.
[4] Ian Buck and Pat Hanrahan. Data parallel computation on graphics
hardware, 2003.
[5] James H. Clark. The Geometry Engine: A VLSI geometry system for
graphics. In Proceedings of the 9th annual conference on Computer
graphics and interactive techniques, pages 127–133. ACM Press, 1982.
[6] Robert L. Cook. Shade trees. In Proceedings of the 11th annual confer-
ence on Computer graphics and interactive techniques, pages 223–231.
ACM Press, 1984.
[7] Robert L. Cook, Loren Carpenter, and Edwin Catmull. The reyes image
rendering architecture. In Proceedings of the 14th annual conference
on Computer graphics and interactive techniques, pages 95–102. ACM
Press, 1987.
[8] Greg Coombe, Mark J. Harris, and Anselmo Lastra. Radiosity on graph-
ics hardware, 2003.
[9] Intel Corporation. Using the RDTSC instruction for performance
monitoring, 1997. http://developer.intel.com/drg/pentiumII/
appnotes/RDTSCPM1.HTML.
[10] Intel Corporation. AGP V3.0 Interface Specification, 2002. http://
www.intel.com/technology/agp/.
[12] NVIDIA Corporation. Geforce 256 and Riva TNT combiners, 1999.
http://developer.nvidia.com/.
[18] James D. Foley, Andries van Dam, Steven K. Feiner, and John F.
Hughes. Computer Graphics: Principles and Practice. Addison-Wesley,
second edition, 1996.
[20] Michael Deering, Stephanie Winner, Bic Schediwy, Chris Duffy, and Neil
Hunt. The triangle processor and normal vector shader: a VLSI system
for high performance graphics. ACM SIGGRAPH Computer Graphics,
22(4):21–30, 1988.
[21] Henry Fuchs, John Poulton, John Eyles, Trey Greer, Jack Goldfeather,
David Ellsworth, Steve Molnar, Greg Turk, Brice Tebbs, and Laura
Israel. Pixel-Planes 5: a heterogeneous multiprocessor graphics sys-
tem using processor-enhanced memories. ACM SIGGRAPH Computer
Graphics, 23(3):79–88, 1989.
[22] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Lan-
guage Specification Second Edition. Addison-Wesley, Boston, Mass.,
2000.
[24] Pat Hanrahan and Jim Lawson. A language for shading and lighting
calculations. In Proceedings of the 17th annual conference on Computer
graphics and interactive techniques, pages 289–298. ACM Press, 1990.
[26] John Kessenich, Dave Baldwin, and Randi Rost. OpenGL Shading
Language draft v1.05, 2003. http://www.opengl.org/developers/
documentation/gl2_workgroup/.
[27] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter
Mattson, Jinyung Namkoong, John D. Owens, Brian Towles, and An-
drew Chang. Imagine: Media processing with streams. IEEE Micro,
pages 35–46, 2001.
[28] David Kirk and Douglas Voorhies. The rendering architecture of the
DN10000VS. In Proceedings of the 17th annual conference on Computer
graphics and interactive techniques, pages 299–307. ACM Press, 1990.
[29] Jens Krueger and Ruediger Westermann. Linear algebra operators for
GPU implementation of numerical algorithms. ACM Transactions on
Graphics (TOG), 22(3):908–916, 2003.
[30] Anselmo Lastra, Steven Molnar, Marc Olano, and Yulan Wang. Real-
time programmable shading. In Proceedings of the 1995 symposium on
Interactive 3D graphics, pages 59–ff. ACM Press, 1995.
[31] Joonas Lehtinen. GWIC - GNU Wavelet Image Codec version 0.1, 1998.
http://jole.fi/research/gwic/.
[33] Michael D. McCool, Zheng Qin, and Tiberiu S. Popa. Shader metapro-
gramming. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS
conference on Graphics hardware, pages 57–68. Eurographics Associa-
tion, 2002.
[35] Steven Molnar, John Eyles, and John Poulton. PixelFlow: high-speed
rendering using image composition. In Proceedings of the 19th annual
conference on Computer graphics and interactive techniques, pages 231–
240. ACM Press, 1992.
[37] Adam Moravanszky. Dense matrix algebra on the GPU, 2003.
http://www.shaderx2.com/shaderx.PDF.
[43] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan.
Ray tracing on programmable graphics hardware. In Proceedings of the
29th annual conference on Computer graphics and interactive techniques,
pages 703–712. ACM Press, 2002.
[44] Mark Segal and Kurt Akeley. The OpenGL graphics system: A spec-
ification (version 1.5), 2003. http://www.opengl.org/developers/
documentation/specs.html.
[46] Michael Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat,
Benjamin Greenwald, Henry Hoffman, Jae-Wook Lee, Paul Johnson,
Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman,
Volker Strumpen, Matthew Frank, Saman Amarasinghe, and Anant
Agarwal. The Raw microprocessor: A computational fabric for soft-
ware circuits and general purpose programs. IEEE Micro, May 2002.
[47] Linus Torvalds and Open Source Hackers. Linux kernel website, 2003.
http://www.kernel.org/.