A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Brucek Khailany
June 2003
© Copyright by Brucek Khailany 2003
All Rights Reserved
I certify that I have read this dissertation and that, in my opin-
ion, it is fully adequate in scope and quality as a dissertation
for the degree of Doctor of Philosophy.
William J. Dally
(Principal Adviser)
Mark Horowitz
Teresa Meng
Abstract
Media applications such as image processing, signal processing, and graphics require tens
to hundreds of billions of arithmetic operations per second of sustained performance for
real-time application rates, yet also have tight power constraints in many systems. For
this reason, these applications often use special-purpose (fixed-function) processors, such
as graphics processors in desktop systems. These processors provide several orders of
magnitude higher performance efficiency (performance per unit area and performance per
unit power) than conventional programmable processors.
In this dissertation, we present the VLSI implementation and evaluation of stream pro-
cessors, which reduce this performance efficiency gap while retaining full programmability.
Imagine is the first implementation of a stream processor. It contains 48 32-bit arithmetic
units supporting floating-point and integer data-types organized into eight SIMD arithmetic
clusters. Imagine executes stream programs, which consist of a sequence of computation
kernels operating on streams of data records. The prototype Imagine processor is
a 21-million transistor chip, implemented in a 0.15 micron CMOS process. At 232 MHz,
a peak performance of 9.3 GFLOPS is achieved while dissipating 6.4 Watts with a die size
measuring 16 mm on a side.
Furthermore, we extend these experimental results from Imagine to stream processors
designed in more area- and energy-efficient custom design methodologies and to future
VLSI technologies where thousands of arithmetic units on a single chip will be feasible.
Two techniques for increasing the number of arithmetic units in a stream processor are pre-
sented: intracluster and intercluster scaling. These scaling techniques are shown to provide
high performance efficiencies to tens of ALUs per cluster and to hundreds of arithmetic
clusters, demonstrating the viability of stream processing for many years to come.
Acknowledgments
During the course of my studies at Stanford University, I have been fortunate to work with
a number of talented individuals. First and foremost, thanks goes to my research advisor,
Professor William J. Dally. Through his vision and leadership, Bill has always been an
inspiration to me and everyone else on the Imagine project. He also provided irreplaceable
guidance when I eventually needed to find a dissertation topic. Professor Dally
provided me with the opportunity to take a leadership role on the VLSI implementation of
the Imagine processor, an invaluable experience for which I will always be grateful. I would
also like to thank the other members of my reading committee, Professor Mark Horowitz
and Professor Teresa Meng, for their valuable feedback regarding the work described in
this dissertation and interactions over my years at Stanford.
The Imagine project was the product of the hard work of many graduate students in the
Concurrent VLSI Architecture group at Stanford. Most notably, I would like to thank Scott
Rixner, Ujval Kapasi, John Owens, and Peter Mattson. Together, we formed a team that
took the Imagine project from a research idea to a working silicon prototype. More recently,
Jung-Ho Ahn, Abhishek Das, and Ben Serebrin have helped with laboratory measurements.
Thanks also goes to all of the other team members who helped with the Imagine VLSI
implementation, including Jinyung Namkoong, Brian Towles, Abelardo Lopez-Lagunas,
Andrew Chang, Ghazi Ben Amor, and Mohamed Kilani.
I would also like to thank all of the other members of the CVA group at Stanford,
especially my officemates over the years: Ming-Ju Edward Lee, Li-Shiuan Peh, and Patrick
Chiang. Many thanks also goes to Pamela Elliot and Shelley Russell, the CVA group
administrators while I was a graduate student here.
The research described in this dissertation would not have been possible without the
generous funding provided by a number of sources. I would like to specifically thank the
Intel Foundation for a one-year fellowship in 2001-2002 to support this research. The
remainder of my time as a graduate student, I was supported by the Imagine project, which
was funded by the Defense Advanced Research Projects Agency under ARPA order E254
and monitored by the Army Intelligence Center under contract DABT63-96-C0037, by
ARPA order L172 monitored by the Department of the Air Force under contract F29601-
00-2-0085, by Intel Corporation, by Texas Instruments, and by the Interconnect Focus
Center Program for Gigascale Integration under DARPA Grant MDA972-99-1-0002.
Finally, I cannot say enough about the support provided by my friends and family. My
parents, Asad (the first Dr. Khailany) and Laura, have been my biggest supporters, and for
that I am forever grateful. Now that they will no longer be able to ask me when my thesis
will be done, we will have to find a new subject to discuss on the telephone. My sister and
brother, Raygar and Sheilan, have always provided timely encouragement and advice. To
all of my friends and family members who have helped me in one way or another over the
years, I would like to say thanks.
Contents

Abstract
Acknowledgments
1 Introduction
    1.1 Contributions
    1.2 Outline
2 Background
    2.1 Media Applications
        2.1.1 Compute Intensity
        2.1.2 Parallelism
        2.1.3 Locality
    2.2 VLSI Technology
    2.3 Media Processing
        2.3.1 Special-purpose Processors
        2.3.2 Microprocessors
        2.3.3 Digital Signal Processors and Programmable Media Processors
        2.3.4 Vector Microprocessors
        2.3.5 Chip Multiprocessors
    2.4 Stream Processing
        2.4.1 Stream Programming
        2.4.2 Stream Architecture
        2.4.3 Stream Processing Related Work
        2.4.4 VLSI Efficiency of Stream Processors
5 Imagine: Experimental Results
    5.1 Operating Frequency
    5.2 Power Dissipation
    5.3 Energy Efficiency
    5.4 Sustained Application Performance
    5.5 Summary
8 Conclusions
    8.1 Future Work
Bibliography

List of Tables

List of Figures

5.1 Die Photograph
5.2 Measured Operating Frequency
5.3 Measured Ring Delay
5.4 Measured Core Power Dissipation
5.5 Csw Distribution during Active Operation
5.6 Measured Energy Efficiency
Chapter 1
Introduction
Computing devices and applications have recently emerged to interface with, operate on,
and process data from real-world samples classified as media. As media applications
operating on these data types have come to the forefront, the design of processors optimized
for these applications has emerged as an important research area. Traditional
microprocessors have been optimized to execute applications from desktop computing
workloads. Media applications form a workload with significantly different characteristics,
meaning that large improvements in performance, cost, and power efficiency
can be achieved by improving media processors.
Media applications include workloads from the areas of signal processing, image pro-
cessing, video encoding and decoding, and computer graphics. These workloads require a
large and growing amount of arithmetic performance. For example, many current computer
graphics and image processing applications in desktop systems require tens to hundreds of
billions of arithmetic operations per second for real-time performance [Rixner, 2001]. As
scene complexity, screen resolutions, and algorithmic complexity continue to grow, this
demand for absolute performance will continue to increase. Similar examples of large and
growing performance requirements can be drawn in the other application areas, such as
the need for higher communication bandwidth rates in signal processing and higher video
quality in video encoding and decoding algorithms. As a result, media processors must be
designed to provide large amounts of absolute performance.
While high performance is necessary to meet the computational requirements of media
applications, many media processors will need to be deployed in mobile systems and other
systems where cost and power consumption are key concerns. For this reason, low power
consumption and high energy efficiency, or high performance per unit power (low aver-
age energy dissipated per arithmetic operation), must be a key design goal for any media
processor.
Fixed-function processors have been able to provide both high performance and good
energy-efficiency when compared to their programmable counterparts on media applica-
tions. For example, the Nvidia Geforce3 [Montrym and Moreton, 2002; Malachowsky,
2002], a recent graphics processor, provides 1.2 Teraops per second of peak performance
at 12 Watts for an energy-efficiency of 10 picoJoules per operation. In comparison, pro-
grammable digital signal processors and microprocessors are several orders of magnitude
worse in absolute performance and in energy efficiency. However, programmability is a key
requirement in many systems where algorithms are too complex or change too rapidly to
be built into fixed-function hardware. Using programmable rather than fixed-function pro-
cessors also enables fast time-to-market. Finally, the cost of building fixed-function chips
is growing significantly in deep sub-micron technologies, meaning that programmable so-
lutions also have an inherent cost advantage since a single programmable chip can be used
in many different systems. For these reasons, a programmable media processor which can
provide the performance and energy efficiency of fixed-function media processors is desir-
able.
Stream processors have recently been proposed as a solution that can provide all three
of the above: performance, energy efficiency, and programmability. In this dissertation,
the design and evaluation of a prototype stream processor, called Imagine, is presented.
This 21-million transistor processor is implemented in a 5-level metal 0.15 micron CMOS
technology with a die size measuring 16 millimeters on a side. At 232 MHz, a peak per-
formance of 9.3 GFLOPS is achieved while dissipating 6.4 Watts. Furthermore, in future
VLSI technologies, the scalability of stream processors to Teraops per second of peak per-
formance is demonstrated.
1.1 Contributions
This dissertation makes several contributions to the fields of computer architecture and
media processing:
• The design and evaluation of the Imagine stream processor. This is the first VLSI
implementation of a stream architecture and provides experimental verification of
the VLSI feasibility and performance of stream processors.
• Analytical models for the area, power, and delay of key components of a stream
processor. These models are used to demonstrate the scalability of stream processors
to thousands of arithmetic units in future VLSI technologies.
1.2 Outline
Recently, media processing has gained attention in both commercial products and academic
research. The important recent trends in media processing are presented in Chapter 2. One
such trend which has gained prominence in the research community is stream processing.
In Chapter 2, we introduce and explain stream processing, which consists of a programming
model and architecture that enables high performance on media applications with fully-
programmable processors.
In order to explore the performance and efficiency of stream processing, a prototype
stream processor, Imagine, was designed and implemented in a modern VLSI technology.
In Chapter 3, the instruction set architecture, microarchitecture, and key arithmetic circuits
from Imagine are described. In Chapter 4, the design methodology is presented, and finally,
in Chapter 5, experimental results are provided. Also in Chapter 5, the energy efficiency of
Imagine and a comparison to existing processors are presented.
This work on Imagine was then extended to study the scalability of stream processors
to future VLSI technologies, in which thousands of arithmetic units could fit on a single chip.
In Chapter 6, analytical models for the area, power, and delay of key components of a
stream processor are presented. These models are then used to explore how area and energy
efficiency scales with the number of arithmetic units. In Chapter 7, performance scalability
is studied by exploring the available parallelism in media applications and by exploring the
tradeoffs between different methods of scaling.
Finally, conclusions and future work are presented in Chapter 8.
Chapter 2
Background
Media applications and media processors have recently become an active and important
area of research. In this chapter, background and previous work on media processing are
presented. First, media application characteristics and previous work on processors for
these applications are reviewed. Then, stream processors are introduced. Stream
processors have recently been proposed as an architecture that exploits media application
characteristics to achieve better performance, area efficiency, and energy efficiency than
existing programmable processors.
[Figure: the stereo depth extractor kernel pipeline — the left and center camera images each pass through a convolution filter, and a sum-of-absolute-differences kernel produces the depth map.]
estimate the disparity between objects in the images. From the disparity calculated at each
image pixel, the depth of objects in an image can be approximated. This stereo depth
extractor will be used to demonstrate the three important characteristics common to most
media applications.
2.1.2 Parallelism
Not only do these applications require large numbers of arithmetic operations per memory
reference, but many of these arithmetic operations can be executed in parallel. This avail-
able parallelism in media applications can be classified into three categories: instruction-
level parallelism (ILP), data-level parallelism (DLP), and task-level parallelism (TLP).
The most plentiful parallelism in media applications is at the data level. DLP refers to
computation on different data elements occurring in parallel. Furthermore, DLP in media
applications can often be exploited with SIMD execution since the same computation is
typically applied to all data elements. For example, in the stereo depth extractor, all output
pixels in the depth map could theoretically be computed in parallel by the same fixed-
function hardware element since there are no dependencies between these pixels and the
computation required for every pixel is the same. Other media applications also contain
large degrees of DLP.
Some parallelism also is available at the instruction level. In the stereo depth extractor,
ILP refers to the parallel execution of individual arithmetic instructions in the convolution
filter or sum-of-absolute differences calculation. For example, the convolution filter com-
putes the product of a coefficient matrix with a sequence of pixels. This matrix-vector
product includes a number of multiplies and adds that could be performed in parallel. Such
fine-grained parallelism between individual arithmetic operations operating on one data el-
ement is classified as ILP and can be exploited in many media applications. As will be
shown later in Chapter 7, available ILP in media applications is usually limited to a few in-
structions per cycle due to dependencies between instructions. Although other researchers
have shown that out-of-order superscalar microprocessors are able to execute up to 4.2 in-
structions per cycle on some media benchmarks [Ranganathan et al., 1999], this is largely
due to DLP being converted to ILP with compiler or hardware techniques rather than the
true ILP that exists in these applications.
Finally, the stereo depth extractor and other media applications also contain task-level,
or thread-level, parallelism. TLP refers to different stages of a computation pipeline being
overlapped. For example, in the stereo depth extractor, there are four execution stages: load
image data, convolution filter, sum-of-absolute differences, and store output data. TLP is
available in this application because these execution stages could be set up as a pipeline
where each stage concurrently processes different portions of the dataset. For example, a
pipeline could be set up where each stage operates on a different row: the fourth image rows
are loaded from memory, the convolution filter operates on the third rows, sum-of-absolute
differences is computed between the second rows, while the first output row is stored back
to memory. Note that ILP, DLP, and TLP are all orthogonal types of parallelism, meaning
that all three could theoretically be supported simultaneously.
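The skewed row-by-row schedule described above can be sketched as a small model. This is an illustrative sketch, not the dissertation's toolchain: the stage names and the `pipeline_schedule` helper are hypothetical.

```python
def pipeline_schedule(n_rows, stages=("load", "convolve", "sad", "store")):
    """For each cycle, report which row each pipeline stage works on.

    Stage k lags the first stage by k cycles, so at cycle t it handles
    row t - k (None when that row does not exist). This reproduces the
    text's example: while row 3 is loaded, row 2 is convolved, row 1
    goes through SAD, and output row 0 is stored.
    """
    sched = []
    for t in range(n_rows + len(stages) - 1):  # fill + drain cycles
        sched.append({s: t - k if 0 <= t - k < n_rows else None
                      for k, s in enumerate(stages)})
    return sched
```

With four rows, cycle 3 shows all four stages busy on different rows at once, which is exactly the task-level parallelism the text describes.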
2.1.3 Locality
In addition to compute intensity and parallelism, the other important media application
characteristic is locality of reference for data accesses. This locality can be classified into
kernel locality and producer-consumer locality. Kernel locality is temporal and refers to
reuse of coefficients or data during the execution of computation kernels such as the con-
volution filter. Producer-consumer locality is also a form of temporal locality that exists
between different stages of a computation pipeline or kernels. It refers to data that is
produced (written) by one kernel, consumed (read) by another kernel, and never read
again. This form of locality is seen very frequently in media applications [Rixner, 2001]. In
a traditional microprocessor, kernel locality would most often be captured in a register file
or a small first-level cache. Producer-consumer locality, on the other hand, is not as easily
captured by traditional cache hierarchies in microprocessors, since it is not well matched to
the least-recently-used replacement policies typically used in caches.
0.18 µm technology measures 0.486 mm² and dissipates 185 pJ per multiply (0.185 mW
per MHz). A thousand of these multipliers could fit on a single die in a 0.13 µm technology.
While arithmetic itself is cheap, handling the data and control communication between
arithmetic units is expensive. On-chip communication between such arithmetic units re-
quires storage and wires. Small distributed storage elements are not too expensive compared
to arithmetic: in the same 0.18 µm technology, a 16-word, 32-bit SRAM with one
read port and one write port occupies 0.0234 mm² and dissipates 15 pJ per access cycle,
assuming both ports are active. However, as additional ports are added to this memory, the area
cost increases significantly. Furthermore, the drivers and wires for a 32-bit 5 millimeter
bus dissipate 24 pJ per transfer on average [Ho et al., 2001]. If each multiply requires
three multi-ported memory accesses and three 5 millimeter bus transfers (two reads and
one write), then the cost of the communication is very similar to the cost of a multiply.
Architectures must therefore manage this communication effectively in order to keep its
area and energy costs from dominating those of the computation itself. Off-chip communication is
an even more critical resource, since there are only hundreds of pins available in large chips
today. In addition, each off-chip communication dissipates far more energy (typically over 1
nJ for a 32-bit transfer) than an arithmetic operation.
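The claim that communication costs rival arithmetic costs follows directly from the figures quoted above. A back-of-envelope check (all numbers are the 0.18 µm values from the text):

```python
# Energy figures quoted in the text (0.18 um technology).
MULTIPLY_PJ = 185      # per 32-bit multiply
SRAM_ACCESS_PJ = 15    # 16-word 32-bit SRAM, per access cycle
BUS_TRANSFER_PJ = 24   # 32-bit, 5 mm bus, per transfer (average)
OFFCHIP_PJ = 1000      # >1 nJ per 32-bit off-chip transfer

# Two operand reads plus one result write, each needing a memory
# access and a 5 mm bus transfer:
comm_pj = 3 * SRAM_ACCESS_PJ + 3 * BUS_TRANSFER_PJ  # 117 pJ

# Communication (~117 pJ) is indeed comparable to one multiply (185 pJ),
# and a single off-chip transfer costs several multiplies:
offchip_in_multiplies = OFFCHIP_PJ / MULTIPLY_PJ
```

The 117 pJ versus 185 pJ comparison is the quantitative basis for the statement that communication cost is "very similar" to the cost of a multiply.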
Although handling the cost of communication in modern VLSI technology is a challenge,
media application characteristics are well-suited to exploiting cheap computation
with highly distributed storage and local communication. Cheap computation
can be exploited with large numbers of arithmetic units to take advantage of both compute
intensity and parallelism in these applications. Furthermore, producer-consumer locality
can be exploited to keep communication local as much as possible, thereby minimizing
communication costs.
power, energy or power efficiency. These metrics are often more important than raw performance
in many media processing systems, since higher area efficiency leads to lower cost
and better manufacturability, both important in embedded systems. Energy efficiency
implies that executing a fixed computation task draws less energy from a power source such as a
battery, leading to longer battery life and lower packaging costs in mobile products.
In this section, we present previous work on fixed-function and programmable processors
for media applications, with data on both performance and performance efficiency.
operation when normalized to a 0.13 µm technology. The other processors in Table 2.1 are
all programmable. Although area efficiencies are not provided in the table, comparisons
between processors for energy efficiency should be similar to area efficiency. As can be
seen, there is an efficiency gap of several orders of magnitude between the special-purpose
and programmable processors. The remainder of this section provides background on
these programmable processors and explains their performance-efficiency limitations.
2.3.2 Microprocessors
The second section of Table 2.1 includes two microprocessors, a 3.08 GHz Intel Pentium
4¹ [Sager et al., 2001; Intel, 2002] and a SiByte SB-1250, which consists of two on-chip
SB-1 CPU cores [Sibyte, 2000]. The Pentium 4 is designed for high performance through
deep pipelining and a high clock rate. The SiByte processor is targeted specifically at energy-efficient
operation through extensive use of low-power design techniques, and has efficiencies
similar to other low-power microprocessors, such as XScale [Clark et al., 2001]. These
¹Gate length for this process is actually 60-70 nanometers because of poly profiling engineering [Tyagi et
al., 2000; Thompson et al., 2001].
energy efficiencies competitive with DSPs at higher performance rates because of its ability
to efficiently exploit DLP and its embedded memory system.
Vector processors directly exploit data parallelism by executing vector instructions such
as vector adds or multiplies out of a vector register file. These vector instructions are similar
to SIMD extensions in that they exploit inner-loop data parallelism in media applications;
however, vector lengths are not constrained by the width of the vector units, allowing even
more DLP to be exploited. Furthermore, vector memory systems are well suited to media
processing because they are optimized for bandwidth and predictable strided accesses,
unlike conventional processors, whose memory systems are optimized to reduce latency.
For these reasons, vector processors are able to exploit significant data parallelism and
compute intensity in media applications.
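The key distinction above — vector length decoupled from datapath width — can be sketched as follows. The constants and the `vadd` helper are illustrative, not any real ISA.

```python
# Sketch of vector-instruction semantics: one vector add covers the
# full software-visible vector length even though the hardware has only
# LANES parallel units, so exploitable DLP is not capped by datapath
# width (unlike fixed-width SIMD extensions).
LANES = 4  # hypothetical number of hardware lanes

def vadd(va, vb):
    vl = len(va)                 # vector length set by software, not LANES
    out = [0] * vl
    for base in range(0, vl, LANES):         # hardware steps in lane-wide chunks
        for i in range(base, min(base + LANES, vl)):
            out[i] = va[i] + vb[i]           # all lanes run the same operation
    return out
```

A single `vadd` over a 10-element vector issues as one instruction but executes over three lane-wide chunks, which is exactly how vector machines amortize instruction overhead over long vectors.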
As shown above, there are a wide variety of processors that can be used to run media
applications. Special-purpose processors are inflexible, but are matched to both VLSI tech-
nology and media application characteristics. As a result, there is a large and growing gap
between the performance efficiency of these fixed-function processors and programmable
processors. The next section introduces stream processors as a way to bridge this efficiency
gap.
Stream programs expose the locality and parallelism in the algorithm to the com-
piler and hardware. Two key types of locality are exposed: kernel locality and producer-
consumer locality. Kernel locality refers to intermediate data values that are live for only a
short time during kernel execution, such as temporaries during a convolution filter compu-
tation. Producer-consumer locality refers to streams produced by one kernel and consumed
by subsequent kernels. Finally, parallelism is exposed because a kernel typically applies
the same computation to every element of an input stream. By casting media applications
as stream programs, hardware is able to take advantage of the abundant parallelism,
compute intensity, and locality in media applications.
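The stream programming model described above can be illustrated with a minimal sketch: kernels are functions applied to every record of a stream, and intermediate streams pass from producer kernel to consumer kernel without returning to memory. The kernel bodies (`blur`, `sad`) are stand-ins, not the dissertation's actual kernels.

```python
def kernel(fn):
    """Lift a per-record function into a stream-to-stream kernel."""
    def run(*streams):
        # Apply the same computation to every record: exposed DLP.
        return [fn(*recs) for recs in zip(*streams)]
    return run

blur = kernel(lambda px: px // 2)       # stand-in for a convolution filter
sad = kernel(lambda a, b: abs(a - b))   # sum-of-absolute-differences

left, center = [8, 4, 6], [2, 4, 2]
# blur's outputs are consumed once by sad and never read again:
# producer-consumer locality that can stay on-chip.
depth = sad(blur(left), blur(center))
```

The intermediate streams `blur(left)` and `blur(center)` exist only between the two kernels, which is the producer-consumer locality a stream register file is designed to capture.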
[Figure 2.3: block diagram of a stream processor — a host processor issues instructions through a stream controller; an SDRAM-backed streaming memory system and a stream register file feed eight ALU clusters (clusters 0-7) under the control of a microcontroller.]
Instructions sent to the stream processor from the host are sequenced through a stream con-
troller. The stream register file (SRF) is a large on-chip storage for streams. The microcon-
troller and ALU clusters execute kernels from a stream program. As shown in Figure 2.4,
each cluster consists of ALUs fed by two local register files (LRFs) each, external ports for
accessing the SRF, and an intracluster switch that connects the outputs of the ALUs and
external ports to the inputs of the LRFs. In addition, there is a scratchpad (SP) unit, used
for small indexed addressing operations within a cluster, and an intercluster communica-
tion (COMM) unit, used to exchange data between clusters. Imagine is a stream processor
recently designed at Stanford University that contains six floating-point ALUs per cluster
(three adders, two multipliers, and one divide-square-root unit) and eight clusters [Khailany
et al., 2001], and was fabricated in a CMOS technology with 0.18 micron metal spacing
rules and 0.15 micron drawn gate length.
[Figure 2.4: an arithmetic cluster — ALUs and local register files joined by an intracluster switch, a scratchpad (SP) unit, a COMM unit connecting to/from other clusters, and ports to/from the SRF.]
Stream processors directly execute stream programs. Streams are loaded and stored
from off-chip memory into the SRF. SIMD execution of kernels occurs in the arithmetic
clusters. Although the stream processor in Figure 2.3 contains eight arithmetic clusters,
in general, the stream processor architecture can contain an arbitrary number of arithmetic
clusters, represented by the variable C. For each iteration of a loop in a kernel, C clus-
ters will read C elements in parallel from an input stream residing in the SRF, perform
the exact same series of computations as specified by the kernel inner loop, and write C
output elements in parallel back to an output stream in the SRF. Kernels repeat this for
several loop iterations until all elements of the input stream have been read and operated
on. Data-dependent conditionals in kernels are handled with conditional streams which,
like predication, keep control flow in the kernel simple [Kapasi et al., 2000]. However,
conditional streams eliminate the extra computation required by predication by converting
data-dependent control flow decisions into data-routing decisions.
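The SIMD kernel-execution loop described above — C clusters reading C elements per iteration, running the same inner loop, and writing C outputs — can be sketched as follows. The `run_kernel` helper is illustrative, not Imagine's actual microcode.

```python
C = 8  # number of arithmetic clusters (Imagine uses eight)

def run_kernel(inner_loop, in_stream):
    """SIMD execution sketch: each iteration reads C elements from the
    input stream in the SRF, applies the same kernel inner loop in
    every cluster, and writes C outputs back to the SRF."""
    out_stream = []
    for i in range(0, len(in_stream), C):
        batch = in_stream[i:i + C]  # C elements read in parallel
        # All clusters execute the exact same series of computations:
        out_stream.extend(inner_loop(x) for x in batch)
    return out_stream

squared = run_kernel(lambda x: x * x, list(range(10)))
```

Iterations repeat until the whole input stream has been consumed; here a 10-element stream takes two iterations, the second with a partially filled batch.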
Stream processors exploit parallelism and locality at both the kernel level and applica-
tion level. During kernel execution, data-level parallelism is exploited with C clusters con-
currently operating on C elements and instruction-level parallelism is exploited by VLIW
execution within the clusters. At the application level, stream loads and stores can be over-
lapped with kernel execution, providing more concurrency. Kernel locality is exploited
by stream processors because all temporary values produced and consumed during a ker-
nel are stored in the cluster LRFs without accessing the SRF. At the application level,
producer-consumer locality is exploited when streams are passed between subsequent ker-
nels through the SRF, without going back to external memory.
The data in media applications that exhibits kernel locality and producer-consumer
locality also has high data bandwidth requirements when compared to available off-chip
memory bandwidth. Stream processors are able to support these large bandwidth require-
ments because their register files provide a three-tiered data bandwidth hierarchy. The first
tier is the external memory system, optimized to take advantage of the predictable memory
access patterns found in streams [Rixner et al., 2000a]. The available bandwidth in this
stage of the hierarchy is limited by pin bandwidth and external DRAM bandwidth. Typi-
cally, during a stream program, external memory is only referenced for global data accesses
such as input/output data. Programs are strip-mined so that the processor reads only one
batch of the input dataset at a time. The second tier of the bandwidth hierarchy is the SRF,
which is used to transfer streams between kernels in a stream program. Its bandwidth is
limited by the available bandwidth of on-chip SRAMs. The third tier of the bandwidth
hierarchy is the cluster LRFs and the intracluster switch between the LRFs which forwards
intermediate data in a kernel between the ALUs in each cluster during kernel execution.
The available bandwidth in this tier of the hierarchy is limited by the number of ALUs one
can fit on a chip and the size of the intracluster switch between the ALUs.
The peak bandwidth rates of the three tiers of the data bandwidth hierarchy are matched
to the bandwidth demands in typical media applications. For example, the Imagine proces-
sor contains 40 fully-pipelined ALUs and provides 2.3 GB/s of external memory band-
width, 19.2 GB/s of SRF bandwidth, and 326.4 GB/s of LRF bandwidth. As discussed in
Section 2.1, some media applications such as the stereo depth extractor require over 400 in-
herent ALU operations per memory reference. Imagine supports a ratio of ALU operations
to memory words referenced of 28. Therefore, not only are stream processors in today’s
technology with tens of ALUs able to exploit this compute intensity, but as VLSI capac-
ity continues to scale at 70% annually and as memory bandwidth continues to increase at
25% annually, this suggests that stream processors with thousands of ALUs could provide
significant speedups on media applications without becoming memory bandwidth limited.
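The shape of the bandwidth hierarchy is visible directly in the Imagine figures quoted above. A quick check of the ratios (peak rates only; sustained rates differ):

```python
# Peak bandwidths of Imagine's three-tier hierarchy, from the text (GB/s).
MEM_GBPS = 2.3     # external memory
SRF_GBPS = 19.2    # stream register file
LRF_GBPS = 326.4   # cluster local register files

srf_over_mem = SRF_GBPS / MEM_GBPS   # ~8x: inter-kernel streams vs memory
lrf_over_srf = LRF_GBPS / SRF_GBPS   # ~17x: intra-kernel temporaries vs SRF
lrf_over_mem = LRF_GBPS / MEM_GBPS   # ~142x overall
```

The roughly 142:8:1 spread is why kernel locality and producer-consumer locality must be captured on-chip: data that fell through to external memory would face two orders of magnitude less bandwidth.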
file organization is the area and energy efficiency derived from partitioning the register file
storage into stream register files, arithmetic clusters, and local register files within the arith-
metic clusters. This partitioning enables stream processors to scale to thousands of ALUs
with modest area and energy costs.
The area of a register file is the product of three terms: the number of registers R, the
bits per register, and the size of a register cell. Asymptotically, with a large number of
ports, each register cell has an area that grows with p² because one wire is needed in the
word-line direction and another in the bit-line direction per register file port. Register
file energy per access follows similar trends. Therefore, a highly multi-ported register
file has area and power that grow asymptotically with Rp² [Rixner et al., 2000b]. A
general-purpose processor containing N arithmetic units with a single centralized register
file requires approximately 3N ports (two read ports for the operands and one write port
for the result per ALU). Furthermore, as N increases, working set sizes also increase,
meaning that R should grow linearly with N. As a result, a single centralized multi-
ported register file interconnecting N arithmetic units in a general-purpose microprocessor
has area and power that grow with N³, and would quickly begin to dominate processor
area and power. As a result, partitioning register files is necessary in order to efficiently
scale to large numbers of arithmetic units per processor.
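The cubic growth argument can be made concrete with a toy area model (the constants below are arbitrary; only the asymptotic exponents reflect the analysis above):

```python
def centralized_rf_area(n, regs_per_alu=16):
    """Toy model of one register file feeding n ALUs: ports grow as 3n
    (two reads, one write per ALU), registers as regs_per_alu * n, and
    cell area as ports**2, so total area ~ R * p**2 ~ n**3."""
    ports = 3 * n
    regs = regs_per_alu * n
    return regs * ports ** 2

def partitioned_lrf_area(n, regs_per_lrf=16):
    """Toy model of the partitioned alternative: one two-ported LRF per
    ALU input (area linear in n) plus an n-wide crossbar term (n**2)."""
    lrf_area = 2 * n * regs_per_lrf * 2 ** 2   # 2n two-ported LRFs
    switch_area = n ** 2                       # intracluster switch wiring
    return lrf_area + switch_area

# Doubling the ALU count multiplies the centralized area by 8 (cubic
# growth), while the partitioned organization grows far more slowly.
assert centralized_rf_area(16) == 8 * centralized_rf_area(8)
```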
Historically, register file partitioning has been used extensively in programmable pro-
cessors in order to improve scalability, area and energy efficiency, and to reduce wire delay
effects. For example, the TI C6x [Agarwala et al., 2002] is a VLIW architecture split
into two partitions, each containing a single multi-ported register file connected to four
arithmetic units. Even in high-performance microprocessors not necessarily targeted for
energy efficient operation, such as the Alpha 21264 [Gieseke et al., 1997], register file par-
titioning has been used. In the stream architecture, register file partitioning occurs along
three dimensions: distributed register files within the clusters, SIMD register files across
the clusters, and the stream register organization between the clusters and memory. In the
remainder of this section, we explain how the register file partition of Imagine along these
three dimensions improves area and energy efficiency and is related to previous work on
partitioned register files.
The first register file partitioning in the stream architecture is along the ILP dimension
within a cluster. Given N ALUs per cluster, a VLIW cluster with one centralized register
file connected to all of the ALUs would grow with N³ as explained above. However,
by splitting this centralized multi-ported register file into an organization with one two-
ported LRF per ALU input within each arithmetic cluster, the area and power of the LRFs
only grows with N, and the intracluster switch connecting the ALU outputs to the LRF
inputs grows with N² asymptotically. The exact area efficiency, energy efficiency, and
performance when scaling N on a stream architecture will be explored in more detail in
Chapter 6.
The disadvantage of this approach is that the VLIW compiler must explicitly manage
communications across this switch and must deal with replication of data across various
LRFs [Mattson et al., 2000]. However, using asymptotic models for area and energy of
register files, Rixner et al. showed that for N = 8, this distributed register organization
provides a 6.7x reduction in area and an 8.7x reduction in energy in the ALUs, register
files, and switches² [Rixner et al., 2000b].
Partitioned register files in VLIW processors and explicitly scheduled communications
between these partitions were proposed in a number of previous processors. For example,
the TI C6x [Agarwala et al., 2002] contains two partitions with four arithmetic units per
partition. In addition, a number of earlier architectures used partitioned register files of
various granularities. The Polycyclic architecture [Rau et al., 1982], the Cydra [Rau et
al., 1989], and Transport-triggered architectures [Janssen and Corporaal, 1995] all had
distributed register file organizations.
Whereas the distributed register partitioning was along the ILP dimension and was handled
by the VLIW compiler, the next partitioning in the stream architecture occurs along the
DLP dimension.
² Implementation details such as design methodology or available wiring layers would affect the efficiency
advantage of certain DRF organizations. For instance, comparing the efficiency of one four-ported LRF
per ALU to that of one two-ported LRF per ALU input would give different results depending on these
implementation details.
The third and final partition in the stream architecture register file is a split between stor-
age for loads and stores and storage for intermediate buffering between individual ALU
operations. This is accomplished by separating the SRF storage from the LRFs within each
cluster. This splitting between the SRF and LRFs has two main advantages. First, stag-
ing data for loads and stores is capacity-limited because of long memory latencies, rather
than bandwidth-limited, meaning that large memories with few ports can be used for the
SRF whereas the capacity of the LRFs can be kept relatively small. Second, data can be
staged in the SRF as streams, meaning that accesses to the SRF will be sequential and pre-
dictable. As a result, streambuffers can be used to prefetch data into and out of the SRF,
much like streambuffers are often used to prefetch data from main memory in micropro-
cessors [Jouppi, 1990]. As explained in Section 3.2.4, these streambuffers allow accesses
to a stream from each SRF client to be aggregated into larger portions of a stream before
they are read or written from the SRF, leading to a much more efficient use of the SRF
bandwidth and a more area- and energy-efficient design.
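The aggregation behavior described above can be sketched with a hypothetical streambuffer model (the class and block size below are illustrative, not the actual Imagine design):

```python
class StreamBuffer:
    """Sketch of SRF access aggregation: a client writes one word at a
    time, but the wide SRF array is only accessed once a full block of
    sequential stream data has been accumulated."""

    def __init__(self, block_words=8):
        self.block_words = block_words
        self.pending = []        # words buffered locally in the SB
        self.srf_accesses = 0    # wide SRF array accesses performed

    def write_word(self, word):
        self.pending.append(word)
        if len(self.pending) == self.block_words:
            self.srf_accesses += 1   # one wide access moves the block
            self.pending.clear()

sb = StreamBuffer()
for w in range(32):      # 32 single-word client writes...
    sb.write_word(w)
# ...cost only 4 wide SRF accesses (32 / 8)
```

Because stream accesses are sequential and predictable, this aggregation wastes no bandwidth on unused words, which is what makes the few-ported SRF design efficient.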
The stream architecture register file organization can be viewed as a combination of the
above three register partitionings. Overall, these partitions each provide a large benefit in
area and energy efficiency. When compared to a 48-ALU processor with a single unified
register file, a C = 8, N = 6 stream processor takes 195 times less area and 430 times
less energy. A performance degradation of 8% over a hypothetical centralized register
file architecture is incurred due to SIMD instruction overheads and explicit data transfers
between partitions [Rixner et al., 2000b].
In summary, there is a large and growing gap between the area and energy efficiency
of special-purpose and programmable processors on media applications. The stream ar-
chitecture attempts to bridge that gap through its ability to exploit important application
characteristics and its efficient register file organization.
Chapter 3
In the previous chapter, a stream processor architecture [Rixner et al., 1998; Rixner, 2001]
was introduced to bridge the efficiency gap between special-purpose and programmable
processors. A stream processor's efficiency is derived from several architectural advantages
over other programmable processors. The first advantage is a data bandwidth hierarchy
for effectively dealing with limited external memory bandwidth that can also exploit com-
pute intensity and producer-consumer locality in media applications. The next advantage
is SIMD arithmetic clusters and multiple arithmetic units per cluster that can exploit both
DLP and ILP in media processing kernels. Finally, the bandwidth hierarchy and SIMD
arithmetic clusters are built around an area- and energy-efficient register file organization.
Although the previous analysis qualitatively demonstrates the efficiency of the stream
architecture, in order to truly evaluate its performance efficiency, a VLSI prototype Imag-
ine stream processor [Khailany et al., 2001] was developed so that performance, power
dissipation, and area could be measured. Not only did this prototype provide a vehicle for
experimental measurements, but also, by implementing a stream processor in VLSI, key
insights into the effect of technology on the microarchitecture were gained. These insights
were then used to study the scalability of stream processors in Chapter 6 and Chapter 7.
The next few chapters discuss the Imagine prototype in detail. This chapter presents
the instruction set architecture, microarchitecture, and circuits of key components from the
Imagine stream processor. Chapter 5 discusses the design methodology used for Imagine,
and finally, in Chapter 6, experimental results for Imagine are presented.
CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 27
• CLUSTER OP executes a kernel in the arithmetic clusters that reads input streams
from the SRF, computes output streams, and writes the output streams to the SRF.
In addition to the six main instructions listed above, there are other instructions for
writes and reads to on-chip control registers, which are inserted as needed by the stream-
level compiler. Streams must have lengths that are a multiple of eight (the number of
clusters), and lengths from 0 to 8K words are supported, where each word is 32 bits. Stream
instructions are fetched and dispatched by a host processor to a scoreboard in the on-chip
stream controller. As will be described in Section 3.2.8, the stream controller issues stream
instructions to the various on-chip units as their dependencies become satisfied and their
resources become available.

Figure 3.1: Arithmetic Cluster Block Diagram (three ADD, two MUL, DSQ, SP, JB/VAL,
and COMM units connected by the intracluster switch, with IO ports to the SRF and the
other clusters)
The kernel-level instructions also control register file accesses and the intracluster switch.
Register file reads are handled with an address field in the kernel-level ISA, while writes
require both an address field and a software pipeline stage field. Finally, the kernel-level
ISA controls the intracluster switch with a bus select field for each write port. This field
specifies which function unit output or input port should be written into the register file for
this instruction.
3.2 Microarchitecture
In the previous chapter, the architecture of the Imagine stream processor, shown in Fig-
ure 2.3, and the basic execution of a stream processor were presented. In this section,
this discussion is extended with microarchitectural details from the key components of
the Imagine architecture. First, the microarchitecture and pipeline diagrams of the micro-
controller and arithmetic clusters are presented. These units execute instructions from the
kernel-level ISA. Next, both the stream register file microarchitecture and its pipeline di-
agram are described. Finally, we present the stream controller and the streaming memory
system, the other major components of a stream processor.
3.2.1 Microcontroller
The microcontroller provides storage for the kernels’ VLIW instructions, and sequences
and issues these instructions to the arithmetic clusters during kernel execution. A block di-
agram of the Imagine microcontroller is shown in Figure 3.3. It is composed of nine banks
of microcode storage as well as blocks for loading the microcode, sequencing instructions
using a program counter, and instruction decode.
Each bank of microcode storage contains a single-ported SRAM where 64 bits of each
576-bit VLIW kernel instruction are stored. Since each bank contains a 128Kb SRAM, a
total of 2K instructions can be stored at one time. In order to allow for microcode to be
loaded during kernel execution without a performance penalty, two instructions are read
at one time from the SRAM array. The first of these instructions is passed directly to the
instruction decoder. The second is stored in a register, so that it can be decoded in the next
clock cycle without accessing the SRAM array again.
The microcode loader handles the loading of kernel instructions from the SRF to the
microcode storage arrays. Since microcode is read from the SRF one word at a time,
and 1152 bits of microcode must be written at a time, the microcode loader reads words
from a stream in the SRF, then sends them to local buffers in one of the microcode store
banks. Once these buffers have all been filled, the microcode loader requests access to write
two instructions into the microcode storage banks. A controller, not shown in Figure 3.3,
handles this arbitration and also controls the reading of instructions from the microcode
storage and intermediate registers during kernel execution.
The instruction sequencer contains the program counter which is used to compute the
addresses to be read from the microcode storage. At kernel startup, the program counter
is loaded with the address of the first kernel instruction, specified by the stream controller.
As kernel execution proceeds, the program counter is either incremented or, on condi-
tional branch instructions, a new address is computed and loaded into the program counter.
Conditional branches are handled with the CHK and LOOP/NLOOP instructions. CHK
instructions store a true or false value into a register inside the instruction sequencer. Based
on the value of this register, LOOP instructions conditionally branch to a relative offset
specified in the instruction field.
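A minimal behavioral sketch of this branching mechanism (the names and encoding are illustrative, not the actual Imagine ISA encoding):

```python
class Sequencer:
    """Sketch of the CHK / LOOP conditional branch mechanism in the
    instruction sequencer."""

    def __init__(self):
        self.pc = 0
        self.chk = False   # register written by CHK instructions

    def do_chk(self, value):
        # CHK stores a true or false value (e.g. a comparison result).
        self.chk = bool(value)
        self.pc += 1

    def do_loop(self, offset):
        # LOOP branches by a relative offset when the CHK register is true.
        self.pc = self.pc + offset if self.chk else self.pc + 1

seq = Sequencer()
seq.do_chk(True)
seq.do_loop(-1)    # condition true: branch back one instruction
```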
The final component of the microcontroller is the instruction decoder, which handles
the squashing of register file writes, a key part of the software pipeline mechanism on
Imagine. In the VLIW instruction, each register file write has a corresponding stage field,
which allows the kernel scheduler to easily implement software pipeline priming and drain-
ing without a loop pre-amble and post-amble. The kernel scheduler assigns all register file
writes to a software pipelining stage, and encodes this stage in the VLIW instruction as
the LRF Stg. sub-field from Figure 3.2. During loops, the instruction decoder keeps track
of which stages are currently active, and squashes register file writes from inactive stages.
In addition to squashing register file writes, the instruction decoder provides pipeline
registers and buffers for each ALU's and LRF's opcodes before they are distributed to the
SIMD ALU clusters. The instruction decoder also handles reads and writes from the
microcontroller register file, which is used to store constants and cluster permutations in
many kernels.
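The stage-based squashing can be modeled with a simple predicate (a simplified model, assuming one new logical loop trip enters the pipeline per iteration; this is not the actual decoder logic):

```python
def write_enabled(stage, iteration, trips):
    """Sketch of software-pipeline write squashing: a register write
    tagged with pipeline stage `stage`, issued during overall iteration
    `iteration` of a loop that runs `trips` logical trips, belongs to
    logical trip (iteration - stage) and is squashed unless that trip
    actually exists (priming and draining fall out automatically)."""
    trip = iteration - stage
    return 0 <= trip < trips

# A 3-stage software pipeline over 5 loop trips issues 5 + 3 - 1 = 7
# iterations; writes outside the valid range are squashed, so exactly
# 5 writes per stage (15 total) actually commit.
executed = sum(write_enabled(s, i, trips=5)
               for i in range(7) for s in range(3))
```

This is what lets the kernel scheduler avoid emitting an explicit loop preamble and postamble: the same VLIW loop body is issued every iteration, and the decoder suppresses the writes that do not apply.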
Figure 3.4: Detailed View of a Function Unit, its LRFs, and the Local Copy of the CCRF
Latches were used as the basic storage element for the LRFs, as shown in Figure 3.5. The
multiplexer before the LRF output flip-flop enables register file bypassing within the LRFs
so that data written on one cycle can be read correctly by the FU in the subsequent cycle.
Flip-flop writes can be disabled by selecting the top feedback path through the multiplexer.
Each FU also contains a copy of the condition code register file (CCRF), not shown
in Figure 3.1, but shown in the detailed view of Figure 3.4. Condition codes (CCs) are
special data values generated by comparison instructions such as IEQ and FLT and are
used with SELECT instructions and with conditional streams. Although there is only one
CCRF in the ISA, each FU contains a local copy of the CCRF. During writes to the CCRF,
data and write addresses are broadcast to each CCRF copy, whereas during reads, each FU
reads locally from its own CCRF copy. This structure allows for a CCRF with as many
read ports as there are FUs, yet does not incur any wire delay when accessing CCs shared
between all of the FUs in a cluster.
Finally, data is exchanged between FUs via the intracluster switch. This switch is
implemented as a full crossbar where each FU broadcasts its result bus(es) to every LRF
in an arithmetic cluster. A multiplexer uses the bus select field for its associated LRF write
port to select the correct FU result bus for the LRF write.
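The crossbar's bus-select behavior can be sketched functionally (the FU and port names here are illustrative):

```python
def intracluster_switch(fu_results, bus_selects):
    """Sketch of the full-crossbar intracluster switch: every FU
    broadcasts its result bus, and each LRF write port's bus-select
    field picks one of those broadcast buses."""
    return {port: fu_results[src] for port, src in bus_selects.items()}

# Hypothetical cycle: three FU results broadcast on the crossbar...
results = {"ADD0": 7, "MUL0": 42, "SP": 3}
# ...and each LRF write port selects the bus named in its instruction
# field, so any FU output can reach any LRF in the cluster.
writes = intracluster_switch(results, {"lrf_a": "MUL0", "lrf_b": "ADD0"})
```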
Figure: Kernel Execution Pipeline (FETCH1, FETCH2, DECODE/DIST, REG READ,
EX 1 through EX N, and WB stages)
occurs and LRFs are accessed locally in each arithmetic cluster. This is followed by the
execute (EX) pipeline stages, which vary in length depending on the operation being exe-
cuted. The last half-cycle of each function unit’s last execute stage is used to traverse the
intracluster switch, and then in the writeback (WB) stage, the register write occurs.
Although the clusters are statically scheduled by a VLIW compiler and sequenced by
a single microcontroller, dynamic events during execution can cause the kernel execution
pipeline to stall. Stalls are caused by one of three conditions: the SRF not being ready for
a write to an output stream, the SRF not being ready for a read from an input stream, or
a SYNCH instruction being executed by the microcontroller for synchronization with the
host processor. When one of these stall conditions is encountered, all pipeline registers in
the clusters and microcontroller are disabled and writes to machine state are squashed until
a later cycle when the stall condition is no longer valid.
The microcontroller and arithmetic clusters work together to execute kernels from an
application’s stream program. They execute VLIW instructions made up of operations
from the kernel-level ISA in a six-stage (or more for some operations) execution pipeline.
The other main blocks in the Imagine processor are used to sequence and execute stream
transfers from the stream-level ISA.
Figure: SRF Block Diagram and Pipeline (eight 4K-word SRF banks; 22 streambuffers
of 8 words per bank with a 22:1 arbiter; 4-word transfers between the SRF banks and
SBs; pipeline stages SEL, MEM, and WB)
frequency of the kernel pipeline, in order to ease timing constraints, and therefore reduce
overall design effort.
The SRF pipeline consists of three stages: stream select (SEL), memory access (MEM),
and streambuffer writeback (WB). During the SEL stage, SBs arbitrate for access to the
SRAM array, and one of the SBs is granted access. Meanwhile the arbiter state is updated
using a last-used-last-served scheme to ensure fairness among SB accesses. During the MEM
stage, the SB that was granted access transfers data between its local storage and the SRAM
array. Finally, during the WB stage, which only occurs on SRF reads, data from the SRAM
array is written locally to the eight SB banks. While the SRF storage and control operate
at half speed, the SBs operate at full speed, so the WB stage only takes one additional clock
cycle to complete.
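The last-used-last-served policy can be sketched as follows (an illustrative model; the hardware maintains a priority ordering, modeled here with a reordered list):

```python
class LastUsedLastServedArbiter:
    """Sketch of the SRF streambuffer arbiter: among the requesting SBs,
    grant the one that was served longest ago, then demote it to the
    lowest priority so every SB is served fairly over time."""

    def __init__(self, n_clients):
        # Front of the list = served longest ago = highest priority.
        self.order = list(range(n_clients))

    def grant(self, requests):
        for client in self.order:
            if client in requests:
                self.order.remove(client)
                self.order.append(client)  # just served -> lowest priority
                return client
        return None   # no SB requested this cycle

arb = LastUsedLastServedArbiter(4)
g1 = arb.grant({0, 2})   # 0 starts ahead of 2, so 0 is granted
g2 = arb.grant({0, 2})   # now 2 has priority over the just-served 0
```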
reverse with data being read from the streambuffers and packaged into flits as they are sent
into the network interface injection queue.
are free and its dependencies have been satisfied, it makes a request to an arbiter to be
issued this cycle. One instruction is granted access and is sent from the operation buffer to
the issue and decode logic. The issue and decode logic converts the instruction into control
information that starts the stream instruction in the individual execution units. A stream
controller register file (SCTRF) is used to transfer scalar data such as stream lengths and
scalar outputs from kernels if necessary. Once the stream instruction execution completes,
its scoreboard entry is freed and subsequent instructions dependent on it can be issued.
By using dynamic scheduling of stream instructions, the stream controller ensures that
stream execution units can stay highly utilized. This allows Imagine to exploit task-level
parallelism by efficiently overlapping memory operations and kernel operations. Further-
more, the 32-entry operation buffer also allows the host processor to work ahead of the
stream processor since the host can issue up to 32 stream instructions until it is forced to
stall waiting for more scoreboard entries to be free. This buffering mitigates any effect the
latency of sending stream instructions to the stream processor would have on performance.
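The scoreboard's dynamic issue policy can be sketched behaviorally (hypothetical instruction tuples of unit, dependence list, and latency; at most one grant per cycle, as described above):

```python
def run_scoreboard(instrs):
    """Behavioral sketch of the stream controller scoreboard: each
    entry is (unit, dependence indices, latency).  An instruction is
    issued once its dependencies have completed and its unit is free."""
    done, busy_until, issued = set(), {}, {}
    cycle = 0
    while len(done) < len(instrs):
        # Retire instructions whose latency has elapsed.
        for i, (unit, deps, lat) in enumerate(instrs):
            if i in issued and i not in done and cycle >= issued[i] + lat:
                done.add(i)
        # Grant at most one ready instruction whose unit is free.
        for i, (unit, deps, lat) in enumerate(instrs):
            if i not in issued and all(d in done for d in deps) \
                    and busy_until.get(unit, 0) <= cycle:
                issued[i] = cycle
                busy_until[unit] = cycle + lat
                break
        cycle += 1
    return issued

# Hypothetical stream program: a memory load, a kernel that depends on
# it, and a second load that overlaps with the kernel's execution.
prog = [("mem", [], 3), ("clusters", [0], 4), ("mem", [], 3)]
issue_cycles = run_scoreboard(prog)
```

In this trace the second load issues while the kernel is still running, illustrating the task-level overlap of memory and kernel operations described above.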
additions. The ALU X23 sub-block contains two pipeline stages and implements inte-
ger additions and the addition portion of floating-point adds. Rounding also occurs in the
ALU X23 stage during floating-point adds. Finally, the ALU X4 sub-block executes a
normalizing shift operation.
Operations requiring floating-point additions, such as FADD, FSUB, and others, are 4-
cycle operations and therefore use all three major sub-blocks. The ALU supports floating-
point arithmetic adhering to the IEEE 754 standard, although only the round-to-nearest-
even rounding mode and limited support for denormals and NaNs are supported [Coonen,
1980]. Additions supporting this standard can be implemented with an alignment shifter,
a carry-select adder for summing the mantissas and doing the rounding, and a normalizing
shifter [Goldberg, 2002; Kohn and Fu, 1989]. This basic architecture was used in the ALU
unit.
Floating-point operands consist of a sign bit, eight bits for an exponent, and
23 bits for a fraction with an implied leading one. In the ALU X1 block, a logarithmic
shifter [Weste and Eshraghian, 1993] is used to shift the operand with the smaller exponent
to the right by the difference between the two exponents. If the sign bits of the two operands
are different, then the shifted result is also bitwise inverted, so that subtraction rather than
addition will be computed in the ALU X23 stage. Furthermore, both the unshifted and
shifted fractions are then shifted to the left by two bits such that the leading one of the
unshifted operand is at bit position 25 in the datapath (there are 32 bit positions numbered
0 to 31). This is necessary because guard, round, and sticky bits must also be added into
the two operands in the ALU X23 stage [Goldberg, 2002; Santoro et al., 1989].
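The alignment step can be sketched as follows (a simplified model of the ALU X1 sub-block that omits the two-bit guard/round/sticky pre-shift described above):

```python
def align_operands(frac_a, exp_a, frac_b, exp_b, signs_differ):
    """Sketch of ALU_X1 alignment: shift the fraction with the smaller
    exponent right by the exponent difference; if the operand signs
    differ, bitwise-invert the shifted fraction so the X23 adder
    computes a subtraction (32-bit datapath, GRS bits omitted)."""
    if exp_a >= exp_b:
        big, small, shift = frac_a, frac_b, exp_a - exp_b
    else:
        big, small, shift = frac_b, frac_a, exp_b - exp_a
    shifted = small >> shift
    if signs_differ:
        shifted = ~shifted & 0xFFFFFFFF   # invert for subtraction
    return big, shifted

# 1.5 * 2^3 plus 1.0 * 2^1: the smaller operand's fraction is shifted
# right by the exponent difference of 2 before the add.
big, small = align_operands(0xC00000, 3, 0x800000, 1, False)
```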
In the ALU X23 stage, the shifted and unshifted operands are added together using a
carry-select adder [Goldberg, 2002]. A block diagram of this adder is shown in Figure 3.11.
For each byte in the result, the adder computes two additions in parallel, one assuming the
carry-in to the byte was zero and the other assuming it was one. Meanwhile, a two-level
tree computes the actual carry-ins to each byte. For integer additions, the carry-ins are
based on the results of the group PGKs, the operation type, and the result sign bits. For
floating-point adds, the carry-ins are based on the group PGKs and the overflow bit.
32-bit integer and lower-precision subword data-types also use the carry-select adder in
the ALU X23 stage to compute fast additions, subtractions and absolute difference com-
putations. During these operations, the adder also computes two additions in parallel for
each byte, but the global carry chain takes into account both the data-type and the opera-
tion being executed to determine whether the carry-in to each byte should be zero or one.
Furthermore, when a subtraction occurs, the B operand must be inverted (not shown in the
figure). Using this carry-select adder architecture, it was possible to design one adder that
could be used for floating-point, 32-bit, 16-bit, and 8-bit additions and subtractions with
little additional area or complexity over an adder that supports only integer additions.
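The conditional-sum scheme can be sketched in software (per-byte sums computed for both carry-in values; the parallel two-level carry tree is modeled here as a simple scan for clarity):

```python
def carry_select_add32(a, b):
    """Sketch of the carry-select adder: for each byte, compute the sum
    assuming carry-in 0 and carry-in 1 in parallel, then let the carry
    chain pick the correct one.  In hardware the per-byte carries come
    from a two-level tree; here they are scanned sequentially."""
    result, carry = 0, 0
    for i in range(4):                       # four bytes, LSB first
        byte_a = (a >> (8 * i)) & 0xFF
        byte_b = (b >> (8 * i)) & 0xFF
        sum0 = byte_a + byte_b               # conditional sum, carry-in 0
        sum1 = sum0 + 1                      # conditional sum, carry-in 1
        chosen = sum1 if carry else sum0     # select with actual carry-in
        result |= (chosen & 0xFF) << (8 * i)
        carry = chosen >> 8                  # carry-out of this byte
    return result

assert carry_select_add32(0x0000FFFF, 1) == 0x00010000
```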
Figure: MUL Unit Block Diagram (Booth encoder, upper and lower partial-product half
arrays with shifting/buffering, sign extension / two's complement logic, 7:2 combiner,
saturation/rounding, and output buffers, spanning the X1 through X4 stages)
control information is sent to the two half arrays. Based on this control information, each
partial product contains a shifted version of -2, -1, 0, 1, or 2 times the multiplicand, which
can easily be computed with a few logic gates per bit and a 1-bit shifter within the half
arrays. Once the partial products have been computed, each half array sums eight of the
partial products with 6 rows of full adders. The first row combines 3 of the partial products
and each of the other 5 rows adds in one more partial product. Three of these additions
occur in the X1 pipeline stage and the other three occur in the X2 stage.
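Radix-4 Booth encoding of a 32-bit multiplier produces 16 signed digits in {-2, -1, 0, 1, 2}, which is where the eight partial products per half array come from. A functional sketch of the encoding and summation (this models the arithmetic only, not the carry-save array structure):

```python
def booth4_multiply(multiplier, multiplicand, bits=32):
    """Functional sketch of radix-4 Booth multiplication: scan the
    multiplier two bits at a time (plus the previous bit) to form
    digits in {-2, -1, 0, 1, 2}, then sum the shifted multiples of the
    multiplicand.  The multiplier is treated as a signed `bits`-bit
    value."""
    m = multiplier & ((1 << bits) - 1)
    acc, prev, n_digits = 0, 0, 0
    for i in range(0, bits, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        digit = prev + b0 - 2 * b1           # Booth digit
        acc += (digit * multiplicand) << i   # shifted multiple of B
        prev = b1
        n_digits += 1
    return acc, n_digits

prod, n = booth4_multiply(-3, 100)
# 16 digits for a 32-bit multiplier: 8 partial products per half array
```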
Once the half arrays have summed the 8 products, each half array sends two 48-bit out-
puts to a 7:2 combiner. This combiner sums these four values with three other buses from
the sign extension and two’s complement logic. These three buses ensure a correctly sign
extended result and also add a one into the LSB location of partial products that were -2 or -1
times the multiplicand during booth encoding. To keep the half arrays modular and simple,
this occurs here rather than in the half arrays. The 7:2 combiner is implemented with 5
full adders: three of the adders are in the X2 stage and two are in the X3 stage. The 7:2
combiner outputs two 64-bit buses that are converted back into non-redundant form with
a 64-bit carry-select adder. Its architecture is similar to the 32-bit integer adder shown in
Figure 3.11, but is extended to 64 bits. The adder spans two pipeline stages: the actual
additions and carry propagation occurs in X3 while the carry select and final multiplexing
occurs in X4. This result is then analyzed and sent through muxes which handle alignment
shifting during floating-point operations and saturation during some integer operations be-
fore it is buffered and broadcast across the intracluster switch.
Like the ALU unit, the MUL unit is also designed to execute 16-bit, 32-bit, and floating-
point multiplications. 8-bit multiplications were not implemented to reduce design complexity.
During floating-point or 32-bit multiplications, the multiplier operates as described above.
However, during 16-bit multiplications, some parts of the multiplier half array must be
disabled, otherwise partial products from the upper half-word would be added into the
result from the lower half-word and vice versa. To avoid this problem, a mode bit is sent
to both half arrays so that during 16-bit operation, the upper 16 bits of the multiplicand are
set to zero in the lower half array and the lower 16 bits of the multiplicand are set to zero
in the upper half array. Although lower-latency 16-bit multiplications could be achieved by
summing fewer partial products together, this optimization was not made in order to minimize
MUL units, the DSQ is not fully pipelined, but more than one operation can be executed
concurrently because once an operation has passed through the pre-processor and into one
of the cores, a new operation can be issued and executed in the other core as long as the
operations will not conflict in the post-processor stage.
3.3.4 SP Unit
While the ALU, MUL, and DSQ units support all of the arithmetic operations in a cluster,
several important non-arithmetic operations are supported by the SP, COMM, and JB/VAL
units. The scratchpad (SP) unit provides a small indexable memory within the clusters.
This 256-word memory contains one read port and one write port and supports base plus
index addressing, where the base is specified in the VLIW instruction word and the index
comes from a local LRF. This allows small table lookups to occur in each cluster without
using LRF storage or sacrificing SRF bandwidth.
Table 3.3: Conditional Output Stream Example

Cluster                  7   6   5   4   3   2   1   0
Loop Iteration 1
  Condition codes        0   1   1   0   1   1   1   0
  COMM source cluster    X   X   X   6   5   3   2   1
  Next cluster pointer: 5; Ready bit: 0
Loop Iteration 2
  Condition codes        1   0   0   1   1   0   1   0
  COMM source cluster    4   3   1   X   X   X   X   7
  Next cluster pointer: 1; Ready bit: 1
Loop Iteration 3
  Condition codes        0   0   0   0   0   1   0   1
  COMM source cluster    X   X   X   X   X   2   0   X
  Next cluster pointer: 3; Ready bit: 0
used as a double buffer in order to stage data between the streambuffers and the COMM
unit. Finally, the JB/VAL functional unit manages the control wires that are sent to the
streambuffers, the COMM unit, and the SP unit in each cluster during conditional streams.
To explain the operation of conditional output streams, consider the example shown in
Table 3.3. In this example, single-word records are assumed, so there are five instructions
involved with each conditional output stream during each loop iteration: GEN COSTATE,
COMM, SPCWR, SPCRD, and COND OUT D. During the first iteration through the loop,
condition codes specify that only five clusters have valid data to send to the output stream.
In each cluster, the GEN COSTATE instruction in the JB/VAL unit reads these condition
codes and computes a COMM source cluster, a next cluster pointer, and a ready bit (the
values for the next cluster pointer and the ready bit are the same across all eight clusters).
In this case, the five clusters with valid data (clusters 1, 2, 3, 5, and 6) will send their data
to the first five clusters (0 through 4). When the COMM is executed, each cluster uses its
COMM source cluster value to read the appropriate data from the intercluster switch and
buffers this data locally in its scratchpad using SPCWR. The next cluster pointer keeps
track of where the next valid element should be written during subsequent loop iterations.
The ready bit keeps track of whether eight new valid elements have been accumulated and
should be written to the output streambuffer from the scratchpad. During the first loop
iteration, since only five valid elements have been stored in the scratchpad, the next cluster
pointer is set to 5 and the ready bit is set to zero. When the SPCRD and COND OUT D
are executed this loop iteration, the write to the streambuffer is squashed because the ready
bit was set to zero.
During the second iteration, four clusters have valid data. In this case, when the JB/VAL
unit executes GEN COSTATE, it uses the next cluster pointer (set to 5 by the previous
iteration) and new condition codes to compute the source clusters to be used during the
COMM. Again, the data is buffered locally in the scratchpad with SPCWR. However, this
time since eight valid elements have been accumulated across the clusters (five from the first
iteration and three from the second), the ready bit is set to one. When the COND OUT D
instruction is executed, these eight values stored in the scratchpad are written to the output
streambuffer. Double buffering is used in the scratchpad so that the values written into
cluster 0 during the first two iterations do not conflict. The third and final iteration in the
example contains only two valid elements from clusters 0 and 2, and in this case, those
elements are written into clusters 1 and 2.
Subsequent iterations continue in a similar manner, with the JB/VAL unit providing the
control information for the streambuffers, COMM unit, and SP unit. Figure 3.14 shows
the circuit used in the JB/VAL unit to compute the COMM source cluster. Each cluster
computes this by subtracting the next cluster pointer from its cluster number, then using
that difference to select one of the source clusters with a valid CC. For example, if the
difference were three, then this cluster is looking for the third cluster starting from cluster
0 with a CC set to 1. The selection occurs by converting the 3-bit difference into a one-hot
8-bit value, then using each CC to conditionally shift this one-hot value by one position.
Once enough valid CCs have been encountered, the lone one in the one-hot value will
have been shifted off the end. Since only one row will shift a one off the end, the COMM
source index can be easily computed by encoding the bits shifted off the end back into
Figure 3.14: Computing the COMM Source Index in the JB/VAL unit
CHAPTER 3. IMAGINE: MICROARCHITECTURE AND CIRCUITS 54
a 3-bit value. The computations required for the next cluster pointer and ready bit are not
shown in Figure 3.14, but they can be computed by simply adding the eight 1-bit CC values
together.
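The one-hot shift scheme in Figure 3.14 can be sketched in software. This is an illustrative model; in particular, the exact indexing convention (whether a difference of d selects the d-th or the (d+1)-th valid cluster counting from cluster 0) is our assumption:

```python
def comm_source(cluster_num, next_ptr, cc, n=8):
    """Model of the Figure 3.14 logic: decode the difference between this
    cluster's number and the next-cluster pointer into a one-hot value, then
    conditionally shift it past each set condition code. The row whose CC
    shifts the lone one off the end identifies the COMM source cluster.
    Returns None when fewer valid elements exist than this cluster needs."""
    diff = (cluster_num - next_ptr) % n   # position within the compacted data
    onehot = 1 << diff                    # 3:8-decoder equivalent
    for src in range(n):                  # one conditional-shift row per CC
        if cc[src]:
            if onehot & 1:                # the lone one falls off the end here
                return src                # 8:3-encoder output
            onehot >>= 1                  # shift by one position
    return None

cc = [1, 1, 0, 1, 0, 0, 0, 0]             # clusters 0, 1, and 3 hold valid data
# With next_ptr = 0, destination clusters 0, 1, 2 read from sources 0, 1, 3.
```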
In addition to computing the COMM source index, next cluster pointer, and ready bit,
the JB/VAL unit also keeps track of when the stream ends. This is necessary for padding
streams when the total number of valid elements in a conditional stream is not a multiple of
the number of clusters. Conditional input streams function similarly to conditional output
streams, except buffering in the scratchpad occurs before traversing the intercluster switch
rather than vice versa.
3.4 Summary
The six function units described above along with the LRFs, CCRFs, and intracluster switch
are the components of an arithmetic cluster on the Imagine stream processor. The main de-
sign goals for these arithmetic units were low design complexity, low area, high throughput,
and low power. Although it was important to keep latency to a reasonable value, latency
was not a primary design goal. As described in the next chapter, these arithmetic cluster
components were implemented in a standard-cell CMOS technology with 0.15 micron
drawn-gate-length transistors and five layers of aluminum with metal spacing typical of a
0.18 micron process. For a number of these arithmetic units and other key components,
Table 3.4 shows their silicon area (both in mm^2 and in wire grids³), number of standard
cells, and total standard-cell area with the additional area required for wiring between
standard cells discounted (normalized to the area of a NAND2 standard cell).
In summary, the ISA, microarchitecture, and functional-unit circuits of the Imagine
stream processor are designed to execute stream programs directly in an area- and
energy-efficient manner. The next two chapters will describe the design methodology and
performance efficiency results achieved when this microarchitecture was implemented in
modern VLSI technology.
³A wire grid in this process is 0.40 square microns.
Chapter 4
To demonstrate the applicability of the Imagine stream processor to modern VLSI tech-
nology, a prototype Imagine processor was designed by a collaboration between Stanford
University and Texas Instruments (TI). Stanford completed the microarchitecture specifi-
cation, logic design, logic verification, and did the floorplanning and cell placement. TI
completed the layout and layout verification. Imagine was implemented in a standard cell
CMOS technology with 0.15 micron drawn-gate-length transistors and five layers of
aluminum with metal spacing typical of a 0.18 micron process.
The key challenge with the VLSI implementation of Imagine was working with the
limited resources afforded by a small team of fewer than five graduate students, yet without
sacrificing performance. In total, the final Imagine design included 701,000 unique
placeable instances and had an operating frequency of 45 fan-out-of-4 inverter delays, as
reported by static timing analysis tools. This was accomplished with a total design effort
of 11 person-years on logic design, floorplanning and placement, significantly smaller than
the design effort typical of comparable industrial designs [Malachowsky, 2002].
This chapter provides an overview of the design process and experiences for the Imag-
ine processor. Section 4.1 presents the design schedule for Imagine followed by back-
ground on the standard-cell design methodologies typically used for large digital VLSI
circuits in Section 4.2. Section 4.3 introduces a tiled region design methodology, the ap-
proach used for Imagine, where the designer is given fine-grained control over placement
of small regions of standard cells in a datapath style. Finally, the clocking and verification
methodologies used on Imagine are described.
CHAPTER 4. IMAGINE: DESIGN METHODOLOGY 57
4.1 Schedule
By summer 1998, the Imagine architecture specification had been defined and a cycle-
accurate C++ simulator for Imagine was completed and running. In November 1998, logic
design had begun with one Stanford graduate student writing the RTL for an ALU cluster.
By December 2000, the team working on Imagine implementation had grown to five grad-
uate students and the entire behavioral RTL model for Imagine had been completed and
functionally verified.
The Imagine floorplanning, placement, and layout was carried out by splitting the de-
sign into five unique subchips and one top-level design. In November 2000, the first trial
placement of one of these subchips, an ALU cluster, was completed by Stanford. By Au-
gust 2001, the final placement of all five subchips and the full-chip design was complete
and Stanford handed the design off to TI for layout and layout verification. In total, between
November 1998 when behavioral RTL was started and August 2001 when the placed de-
sign was handed off to TI, Stanford expended 11 person-years of work on the logic design,
floorplanning, and placement of the Imagine processor. Imagine parts entered a TI fab in
February 2002. First silicon was received in April 2002, and full functionality was verified
in the laboratory in subsequent months.
[Figure 4.1: The typical ASIC tool flow — RTL → Synthesis → Netlist → Place & Route → Layout, with statistical wire models and the standard-cell library as inputs to synthesis]
Figure 4.1 shows the typical ASIC tool flow. RTL is written in a hardware description
language such as Verilog and is mapped to a standard-cell library with a logic synthesis tool
such as Synopsys Design Compiler [Synopsys, 2000a]. Wire lengths are estimated from
statistical models and timing violations are fixed by resynthesizing with new timing con-
straints or by restructuring the logic. After pre-placement timing convergence, designs are
then passed through an automatic place and route tool, which usually uses a timing-driven
placement algorithm. After placement, wire lengths from the placed design are extracted
and back-annotated to a static timing analysis (STA) tool. However, when actual wire
lengths do not match the statistically predicted pre-placement wire lengths, timing
problems can arise, leading to costly design iterations, shown in the bottom feedback
loop.
Recent work in industry and academia has addressed many of the inefficiencies in ASIC
flows. This work can be grouped in two categories: improving timing convergence and
incorporating datapath-style design in ASIC flows.
Physically-aware synthesis approaches [Synopsys, 2000b] attempt to address the short-
comings of timing convergence in traditional flows by concurrently optimizing the logical
and physical design, rather than relying on statistically-based wire-length models. The
principal benefit of these techniques is to reduce the number of iterations required for tim-
ing convergence, and as a result, deliver modest improvement in timing performance and
area.
Datapaths are examples of key design structures that ASIC flows handle poorly. There
are three limitations. First, aggregating many simple standard cells to create a complex
function is inefficient. Second, the typical logical partitions (functional) often differ from
the desirable physical partitions (bit-slices). Finally, since the “correct” bit-sliced datap-
ath solution is very constrained, small errors in placement and routing during automated
optimization can result in spiraling congestion and can quickly destroy the inherent reg-
ularity. When developing the design methodology for Imagine, the goal was to keep the
inherent advantages of standard-cell design, but to eliminate some of the inefficiencies of
ASIC methodologies by retaining datapath structure.
Many researchers have demonstrated that identifying and exploiting regularity yields
significant improvements in density and performance for datapath structures in comparison
to standard ASIC place and route results [Chinnery and Keutzer, 2002]. In particular,
researchers have shown numerous automated techniques for extracting datapath structures
from synthesized designs and doing datapath-style placement [Kutzschebauch and Stok,
2000] [Nijssen and van Eijk, 1997] [Chowdhary et al., 1999]. However, widespread
adoption of these techniques into industry-standard tools had not yet occurred by the time
the VLSI design for the Imagine processor was started.
The Imagine design was split into five unique subchips and one top-level design. In the ASIC methodology used on Imagine, flat placement within
a subchip is used, where all of the standard cells in each subchip are placed at once. This
is in contrast to hierarchical placement techniques where subcomponents of a subchip are
placed first and larger designs are built from smaller sub-designs. After routing each sub-
chip, the top-level design then includes instances of the placed and routed subchips as well
as additional standard cells. Table 4.1 shows the number of instances, area, and gate area
in equivalent NAND2 gates for each of the five subchips: the ALU cluster (CLUST), the
micro-controller (UC), the stream register file (SRF), the host interface / stream controller
/ network interface (HISCNI), and the memory bank (MBANK). Each of these subchips
corresponds directly to units in Figure 2.3 except the MBANK. The streaming memory sys-
tem is composed of 4 MBANK units: 1 per SDRAM channel. Also shown is the top-level
design, which includes glue logic between subchips and I/O interfaces.
In addition to the gates listed in Table 4.1, some of the subchips also contain SRAM’s
instantiated from the TI ASIC library. The UC contains storage for 2048 576-bit VLIW
instructions organized as 9 banks of single-ported, 1024-word, 128-bit SRAM’s. The SRF
contains 128 KBytes of storage for stream data, organized as 8 banks of single-ported,
1024-word, 128-bit SRAM’s. There is a dual-ported, 256-word, 32-bit SRAM in each ALU
cluster for scratchpad memory. Finally, the HISCNI subchip contains SRAM’s for input
buffers in the network interface and for stream instruction storage in the stream controller.
Several of the subchips listed above benefit from using datapath-style design. Specifi-
cally, each ALU cluster contains six 32-bit floating-point arithmetic units and fifteen 32-bit
register files. Exploiting the datapath regularity for these units keeps wire lengths within
a bitslice very short, which in turn leads to smaller buffers, and therefore a more compact
design. In addition, control wires are distributed across a bitslice very efficiently since cells
controlled by the same control wires can be optimally aligned. The SRF, which contains
22 8-entry 256-bit streambuffers, also benefits from the use of datapaths. The 256 bits in
the streambuffers align to the 8 clusters’ 32-bit-wide datapath, keeping wires predictable
and short and allowing for efficient distribution of control wires.
The tiled-region basic flow used on Imagine is shown in Figure 4.2. It is similar to the
typical ASIC methodology shown previously in Figure 4.1. However, several key addi-
tional steps, shown in gray, have been added in order to allow for datapath-style placement
and to reduce costly design iterations. First, in order to make sure that datapath structure
is maintained all the way through the flow, two RTL models were used. A second RTL
model, labeled structured RTL, was written. It is logically equivalent to the behavioral
RTL, but contains additional logical hierarchy in the RTL model. Datapath units such as
adders, multipliers, and register files contain submodules that correspond to datapath bit-
slices. These bitslices correspond to a physical location along the datapath called a region.
Regions provide a hard boundary during placement, meaning cells assigned to that region
will only be placed within the associated datapath bitslice. Regions are often used in typical
ASIC design methodologies in order to provide constraints on automatic place and route
tools, but the tiled-region flow has a much larger number of smaller regions (typically 10
to 50 instances per region) when compared to timing-driven placement flows.
In addition to the floorplanning of regions, the subchip designer also must take into
account the wire plan for a subchip. The wire plan involves manually annotating all wires
of length greater than one millimeter with an estimated capacitance and resistance based
on wire length between regions. By using these manual wire-length annotations during
synthesis and timing analysis runs, statistical wire models generated during synthesis are
restricted to short wires. Manual buffers and repeaters were also inserted in the structured
RTL for long wires. With wire planning, pre-placement timing more closely matches post-
placement timing with annotated wire resistance and capacitance.
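A wire plan of this kind amounts to a simple table computation. The sketch below uses invented per-millimeter resistance and capacitance values (the actual TI process parasitics are not given in the text):

```python
# Assumed per-mm wire parasitics -- illustrative values only.
R_PER_MM = 75.0       # ohms per millimeter (assumption)
C_PER_MM = 0.20e-12   # farads per millimeter (assumption)

def wire_plan(wires, threshold_mm=1.0):
    """wires: {net_name: estimated length in mm between regions}.
    Returns manual (R, C) annotations for every wire longer than the threshold;
    shorter wires are left to the synthesis tool's statistical models."""
    return {net: (length * R_PER_MM, length * C_PER_MM)
            for net, length in wires.items() if length > threshold_mm}

annotations = wire_plan({"srf_to_clust": 2.0, "local_bypass": 0.3})
# Only the 2 mm wire is annotated: R = 150 ohms, C = 0.4 pF.
```

The net names here are hypothetical; the point is that only wires above the one-millimeter threshold receive manual annotations, matching the flow described above.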
A more detailed view of the floorplanning and placement portion of the tiled-region
methodology is shown in Figure 4.3. Consider an 8-bit adder. It would be modeled with
the statement y = a + b in behavioral RTL. However, the structured RTL is split up by hand
into bitslices as shown in Figure 4.3. The structured RTL is then either mapped by hand
or synthesized into a standard-cell netlist using Synopsys Design Compiler [Synopsys,
2000a].
In conjunction with the netlist generation, before placement can be run, floorplanning
has to be completed. In the tiled-region design methodology, this is done by writing a
tile file. An example tile file containing two 8-bit adders is shown in the upper right of
Figure 4.3. The tile file contains a mapping between logical hierarchy in the standard cell
netlist and a bounding box on the datapath given in x-y coordinates. The example tile file
shows how the eight bitslices in each adder would be tiled if the height of each bitslice
was 30 units. Arbitrary levels of hierarchy are allowed in a tile file, allowing one to take
advantage of modularity in a design when creating the floorplan. In this example, two levels
of hierarchy are used, so cells belonging to the adder 1/slice5 region would be placed in
the bounding box given by 40 < x < 80 and 150 < y < 180.
Figure 4.3: Structured RTL and tile file for the 8-bit adder example in the tiled-region flow — the structured RTL and the tile file feed synthesis and tileparse, which together drive region-based placement of the floorplan.

Structured RTL:

    module adder (a,b,y);
    input [7:0] a,b;
    output [7:0] y;
    wire [6:0] c;
    ...
    adder_slice slice2 (a[3],b[3],c[2],c[3],y[3]);
    adder_slice slice3 (a[4],b[4],c[3],c[4],y[4]);
    ...
    endmodule

    module adder_slice (a,b,ci,co,y);
    input a,b,ci;
    output co,y;
    assign y = a^b^ci;
    assign co = (a&b)|(a&ci)|(b&ci);
    endmodule

Tile file:

    Module adder {
    region slice0 x1=0 x2=40 y1=0 y2=30
    region slice1 x1=0 x2=40 y1=30 y2=60
    region slice2 x1=0 x2=40 y1=60 y2=90
    region slice3 x1=0 x2=40 y1=90 y2=120
    region slice4 x1=0 x2=40 y1=120 y2=150
    region slice5 x1=0 x2=40 y1=150 y2=180
    region slice6 x1=0 x2=40 y1=180 y2=210
    region slice7 x1=0 x2=40 y1=210 y2=240
    }
    inst adder adder_0 x=0 y=0
    inst adder adder_1 x=40 y=0

Once the floorplan has been completed using a tile file, it is then passed through a tool
developed by Stanford called tileparse. Tileparse flattens the hierarchy of the tile file and
outputs scripts which are later run by the placer to set up the regions. Once the regions
have been set up, but before running placement, the designer can look at the number of
cells in a region and iterate by changing region sizes and shapes until a floorplan that fits
is found. Finally, the Avant! Apollo-II automatic placement and global route tool [Chen,
1999] is used to generate a trial placement on the whole subchip. These steps are then
iterated until a floorplan and placement with satisfactory wiring congestion and timing has
been achieved. The steps following placement in the tiled-region design methodology do
not differ from the typical ASIC design methodology.
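The flattening that tileparse performs can be sketched as a pass over the hierarchical tile file. This is a hypothetical reimplementation for illustration, not the Stanford tool itself:

```python
def flatten_tiles(modules, instances):
    """modules: {module_name: [(region_name, x1, x2, y1, y2), ...]} in local
    coordinates; instances: [(module_name, instance_name, x_origin, y_origin)].
    Returns flat region bounding boxes in absolute subchip coordinates,
    as the placer scripts would receive them."""
    flat = {}
    for mod, inst, ox, oy in instances:
        for region, x1, x2, y1, y2 in modules[mod]:
            flat[f"{inst}/{region}"] = (x1 + ox, x2 + ox, y1 + oy, y2 + oy)
    return flat

# The two 8-bit adders from the example tile file, 30 units per bitslice.
adder = [(f"slice{i}", 0, 40, 30 * i, 30 * (i + 1)) for i in range(8)]
boxes = flatten_tiles({"adder": adder},
                      [("adder", "adder_0", 0, 0), ("adder", "adder_1", 40, 0)])
# boxes["adder_1/slice5"] is (40, 80, 150, 180), i.e. 40 < x < 80, 150 < y < 180.
```

The second adder instance is offset by x = 40, which is what produces the bounding box quoted in the text for adder_1/slice5.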
By using tiled regioning, large subchips such as the SRF and CLUST with logic conducive
to datapath-style placement were easily managed by the designer. For example, placement
runs for the SRF, which contained over 300,000 instances, took only around one hour on a
450 MHz UltraSPARC II processor. This meant that when using tiled-region placement on
these large subchips, design iterations proceeded very quickly. Furthermore, the
designer had fine-grained control over the placement of regions to easily fix wiring con-
gestion problems. For example, the size and aspect ratio of datapath bitslices could be
modified as necessary to provide adequate wiring resources.
Timing results for each of these subchips are included in Table 4.3. Maximum clock
frequency and critical path for each clock domain in fan-out-of-4 inverter delays (FO4s) are
shown. Results were measured using standard RC extraction and STA tools at the typical
process corner.
used on the subchips was used to synthesize a balanced clock tree to all of the inputs of the
delay elements and the leaf-level clock loads for clocked elements in the top-level design.
Imagine must interface with several different types of I/O each running at different
clock speeds. For example, the memory controller portion of each MBANK runs at the
SDRAM clock speed. Rather than coupling the SDRAM clock speed to an integer multiple
of the Imagine core clock speed, completely separate clock trees running at arbitrarily dif-
ferent frequencies were used. In total, Imagine has 11 clock domains: the core clock (iclk),
a clock running at half the core clock speed (sclk), the memory controller clock (mclk), the
host interface clock (hclk), four network input channel clocks (nclkin_n, nclkin_s, nclkin_e,
nclkin_w), and four network output channel clocks (nclkout_n, nclkout_s, nclkout_e,
nclkout_w). These clocks and the loads for each clock are shown in Table 4.3, but for clarity,
only one of the network channel clocks is shown. The maximum speed of the network
clocks was architecturally constrained to be the same speed as iclk, but they can operate slower
if needed in certain systems. Mclk and hclk are also constrained by the frequency of other
chips in the system such as SDRAM chips, rather than the speed of the logic on Imagine.
Sclk was used to run the SRF and stream controller at half the iclk speed. The relaxed
timing constraints significantly reduced the design effort in those blocks and architectural
experiments showed that running these units at half-speed would have little impact on over-
all performance.
The decoupling provided by Imagine’s 11 independent clock domains reduces the com-
plexity of the clock distribution problem. Also, non-critical timing violations within one
clock domain can be waived without affecting performance of the others. To facilitate these
many clock domains, a synchronizing FIFO was used to pass data back and forth between
different clock domains. Figure 4.4 shows the FIFO design used [Dally and Poulton, 1998].
In this design, synchronization delay is only propagated to the external inputs and outputs
when going from the full to non-full state or vice versa, and similarly with the empty to
non-empty state. Brute force synchronizers were used to do the synchronization. By mak-
ing the number of entries in the FIFO large enough, write and read bandwidths are not
affected by the FIFO design.
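One way to model the full/empty comparisons is with stale, synchronized copies of the opposite side's pointer. The class below is a simplified single-threaded sketch (names and structure are ours, and real hardware would synchronize pointers continuously rather than on demand):

```python
class SyncFIFO:
    """Model of a synchronizing FIFO between two clock domains. Each side
    compares its own pointer against a synchronized (possibly stale) copy of
    the other side's pointer, so full/empty are conservative until the
    synchronizers catch up -- the FIFO can look fuller or emptier than it is,
    but never unsafely so."""

    def __init__(self, depth):
        self.buf = [None] * depth
        self.depth = depth
        self.wp = 0        # write pointer (write-clock domain)
        self.rp = 0        # read pointer (read-clock domain)
        self.wp_sync = 0   # write pointer as seen by the read side
        self.rp_sync = 0   # read pointer as seen by the write side

    def full(self):        # evaluated in the write-clock domain
        return self.wp - self.rp_sync >= self.depth

    def empty(self):       # evaluated in the read-clock domain
        return self.wp_sync == self.rp

    def write(self, item):
        assert not self.full()
        self.buf[self.wp % self.depth] = item
        self.wp += 1

    def read(self):
        assert not self.empty()
        item = self.buf[self.rp % self.depth]
        self.rp += 1
        return item

    def synchronize(self):
        # Models the brute-force synchronizers carrying pointers across domains.
        self.wp_sync, self.rp_sync = self.wp, self.rp

fifo = SyncFIFO(depth=4)
fifo.write("a")
fifo.synchronize()     # the read side now observes the new write pointer
value = fifo.read()
```

Because stale pointers only err toward "more full" or "more empty", synchronization delay shows up as conservatism at the full/empty boundaries, matching the behavior described above.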
Focused tests covering corner cases were used for testing Imagine's floating-point adder,
multiplier, divide-square-root (DSQ) unit, memory controller, and network interface. In
each of these units, significant random testing was also used. For example, in the memory
controller, large sequences
of random memory reads and writes were issued. In addition, square-root functionality in
the DSQ unit was tested exhaustively.
Chip-level tests were used to target modules whose control was highly coupled to other
parts of the chip and for running portions of real applications. Rather than relying only on
end-to-end correctness comparisons in these chip-level tests, a more aggressive compari-
son methodology was used for these tests. A cycle-accurate C++ simulator had already
been written for Imagine. During chip-level tests, a comparison checker verified that the
identical writes had occurred to architecturally-visible registers and memory in both the
C++ simulator and the RTL model. This technique was very useful due to the large number
of architecturally-visible registers on Imagine. Also, since this comparison occurred every
cycle, it simplified debugging since any bugs would be seen immediately as a register-write
mismatch. A number of chip-level tests were written to target modules such as the stream
register file and microcontroller. In order to generate additional test coverage, insertion
of random stalls and timing perturbations of some of the control signals were included in
nightly regression runs.
In total, there were 24 focused tests, 10 random tests, and 11 application portions run
nightly as part of a regression suite. Some focused tests included random timing perturba-
tions. Every night 0.7 million cycles of focused tests, 3.6 million cycles of random tests,
and 1.3 million cycles of application portions were run as part of the functional verification
test suite on the C++ simulator, the behavioral RTL and the structured RTL. These three
simulators ran at 600, 75, and 3 Imagine cycles per second, respectively, when run on
a 750 MHz UltraSPARC III processor.
In summary, the design, clocking, and verification methodologies used on Imagine en-
abled the design of a 0.7M-instance ASIC without sacrificing performance, and with a
considerably smaller design team than comparable industrial designs.
Chapter 5
In this chapter, experimental results measured from the Imagine stream processor are
presented. Imagine was fabricated in a Texas Instruments CMOS process with metal spacing
typical of a 0.18 micron process and with 0.15 micron drawn-gate-length transistors.
Figure 5.1 shows a die photograph of the Imagine processor with the five subchips
presented in Chapter 4 highlighted. Its die size is 16 mm × 16 mm. The I/Os are peripherally
bonded in a 792-pin BGA package. There are 456 signal pins (140 network, 233 memory
system, 45 host, 38 core clock and debug), 333 power pins (136 1.5V-core, 158 3.3V-IO,
39 1.5V-IO), and 3 voltage reference pins. The additional empty area in the chip plot is
either glue logic and buffers between subchips or is devoted to power distribution.
CHAPTER 5. IMAGINE: EXPERIMENTAL RESULTS 70
[Figure 5.1: Die photograph of the Imagine processor with the five subchips highlighted — eight arithmetic clusters (CLUST0–CLUST7), the SRF, the microcontroller (UC), four MBANKs, and the host interface / stream controller / network interface (HI SC NI)]
[Figure 5.2: Operating frequency (MHz) versus supply voltage (V) — predicted frequency at typical and worst-case process corners compared with measured frequency]

[Figure 5.3: Scaled ring-oscillator delay and measured cycle time (ns) versus supply voltage (V)]
Further insight into reasons for a lower-than-predicted operating frequency can be pro-
vided by comparing the voltage dependence of Imagine’s cycle time to the voltage depen-
dence of the ring oscillator delay. This data is graphed in Figure 5.3. The scaled ring
oscillator delay, shown in the curve on the left, is the ring oscillator delay multiplied by a
constant factor, so that at 1.5 Volts, it equals 3.79ns (for a frequency of 264 MHz). This
delay is what would be expected from static timing analysis given the 11% performance
degradation due to process variation measured on the ring oscillator. Therefore, the scaled
ring oscillator delay shows the cycle time predicted by static timing analysis across a range
of supply voltages.
As other researchers have shown [Chen et al., 1997], the effect supply voltage has on
gate delay in modern CMOS processes where transistors are typically velocity saturated
can be modeled by:
td = kC · V / (V − Vth)^1.25        (5.1)
The measured delay through the ring oscillator follows this gate delay model closely if a
threshold voltage, Vth , of 0.38 Volts is assumed.
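Equation (5.1) is easy to check numerically. In the sketch below, k and C are lumped into one arbitrary constant, so only delay ratios between voltages are meaningful:

```python
def gate_delay(v, vth=0.38, kc=1.0):
    """Velocity-saturated gate-delay model of Eq. (5.1):
    td = k * C * V / (V - Vth)^1.25.
    kc is an arbitrary normalization constant; absolute delays here are
    not calibrated to the process."""
    return kc * v / (v - vth) ** 1.25

# Delay grows as the supply voltage approaches the threshold voltage:
ratio = gate_delay(1.2) / gate_delay(1.5)   # roughly 1.18x slower at 1.2 V
```

With Vth = 0.38 V, the model predicts a modest slowdown from 1.5 V to 1.2 V, far less than the 3x measured on Imagine at 1.2 V, which is what motivates the supply-degradation explanations below.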
On the other hand, the measured cycle time, shown in the curve on the right in Fig-
ure 5.3, degrades at significantly higher voltages than the ring oscillator predicts; in fact,
Imagine stops functioning correctly below 1.2 Volts. At 1.5 Volts, the actual cycle time
is twice the delay predicted by the scaled ring oscillator delay. However, this relative dif-
ference is greater for lower voltages and smaller for higher voltages. For example, at 1.2
Volts, the difference is 3x and at 2.0 Volts, the difference is 1.28x.
Several factors could explain the discrepancies between predicted and measured cycle
times. First, internal IR drop across large wire resistances in the core power supply
could contribute to significantly lower voltages at standard cells located on critical paths. In
addition, if IR drop led to low supply voltages for gates with feedback elements or circuits
that have delays not accurately modeled by (5.1), then their performance at low voltages
could be degraded more severely than the ring oscillator, and they could even stop function-
ing correctly. Finally, the lack of sufficient bypass capacitance on the power supply in areas
of the chip with highly varying current draw could lead to additional supply degradation not
modeled by IR drop and could further degrade performance. These factors could explain
the measured cycle time’s steep slope at voltages significantly higher than that predicted
by the ring oscillator delay measurements. Unfortunately, without sophisticated measure-
ment techniques, it is difficult to conclusively verify the exact factors contributing to this
behavior. Nevertheless, if thorough analysis and careful redesign of the on-chip power
distribution network were able to eliminate the discrepancy between predicted and measured
operating frequency, a 1.42x performance improvement would be observed at worst-case
operating conditions, without increasing voltage or degrading energy efficiency. Improving
to typical operating conditions would yield a 2x performance improvement over current
behavior.
14
12
10
Power(W)
8
6
4
2
0
1 1.2 1.4 1.6 1.8 2 2.2
Voltage (V)
frequencies from Figure 5.2. The core power dissipation was measured during a test appli-
cation written to keep Imagine fully occupied with floating-point arithmetic instructions.
Imagine’s power dissipation is dominated by dynamic power dissipation, given by:
P = Csw · V^2 · f

where Csw is the average capacitance switched per clock cycle. At 1.5 Volts and 132 MHz,
a power dissipation of 3.07 Watts was observed, implying a Csw of 10.3 nF.
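The Csw figure follows directly by inverting the dynamic-power relation:

```python
def switched_capacitance(power_w, volts, freq_hz):
    """Invert P = Csw * V^2 * f to recover the average capacitance
    switched per clock cycle (in farads)."""
    return power_w / (volts ** 2 * freq_hz)

csw = switched_capacitance(3.07, 1.5, 132e6)   # Imagine at 1.5 V, 132 MHz
# csw comes out to about 10.3 nF, matching the measured value quoted above.
```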
Although it is difficult to determine exactly how this 10.3 nF of Csw is distributed
throughout Imagine, estimates can be made with Synopsys Design Compiler [Synopsys,
2000a] using exact capacitances and resistances extracted from actual layout and toggle
rates captured from a test application. These estimates are presented in the pie chart in
Figure 5.5 and sum to the 10.3 nF measured experimentally. Total power dissipation is
separated into three major categories: clock, arithmetic clusters, and other. Clock power
includes capacitance switched during idle operation when only the Imagine core clock is
toggling, and is further subdivided into three categories: Csw in the clock tree, Csw in the
loads of the clock tree, and Csw internally within the flip flops.

[Figure 5.5: Estimated distribution of switched capacitance — flip-flop internal power 29%, cluster RFs, wires, and control 21%, cluster ALUs 19%, clock loads 14%, clock tree 11%, SRF/UC/MBanks 6%]

Together, clock power comprises around 55% of the total core power, verified experimentally
by idle power measurements. The arithmetic cluster power is also subdivided into two
categories: capacitance switched in the ALUs and capacitance switched elsewhere in the cluster such as
in the LRFs and intracluster switch. The final remaining category labeled SRF, UC, and
Mbanks includes the switched capacitance not accounted for in the arithmetic clusters or
clock network.
Note that on Imagine, a large percentage of power dissipation is devoted to arithmetic
units. Not only is 20% of active power dissipated in the ALUs, but over one third of
the internal flip flop clock power and clock load power is in pipeline registers within the
ALUs. In total, nearly 40% of the chip power is spent directly on providing high-bandwidth
arithmetic units. Furthermore, power dissipation could be significantly reduced by lowering
clock power: reducing the number of pipeline registers, or using latch-based clocking and
more efficient clock trees as would typically be done in custom design methodologies,
would greatly reduce Imagine's core power dissipation.
operation through extensive use of clock gating and other design techniques for low power.
These processors demonstrate the range of energy efficiencies typically provided by micro-
processors, over 500 pJ per instruction in a 0.13 micron technology.
Digital signal processors are listed next in Table 5.1. The first DSP, the TI C67x [TI,
2003], an 8-way VLIW operating at 225 MHz, targets floating-point applications and has an
energy efficiency of 889 pJ per instruction, similar to the SB-1250 when normalized to the
same technology and voltage. The TI C64x [Agarwala et al., 2002], a 600 MHz 8-way
VLIW DSP targeted for lower-precision fixed-point operation, is able to provide improved
energy efficiency over floating-point DSPs at 150-250 pJ per 16b operation. This improved
efficiency is due to arithmetic units optimized for 16b operation and with architectures
designed to efficiently exploit parallelism, such as SIMD operations in the C64x.
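Such normalizations are typically first-order scalings. A common constant-field rule of thumb, sketched below, scales energy linearly with feature size and quadratically with supply voltage; this is an assumption for illustration and not necessarily the exact normalization used for Table 5.1:

```python
def normalize_energy(e_pj, l_um, v, l_ref=0.13, v_ref=1.2):
    """First-order scaling of energy per op to a reference technology node and
    voltage: energy scales roughly linearly with drawn feature size and
    quadratically with supply voltage. The reference node and voltage are
    assumed values for illustration."""
    return e_pj * (l_ref / l_um) * (v_ref / v) ** 2

# A design already at the reference node and voltage is unchanged;
# halving the feature size at the same voltage halves the energy per op.
```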
Although fixed-point DSPs are able to provide significant improvements over micro-
processors in energy efficiency, special-purpose processors are still one to two orders of
magnitude better, as demonstrated by the Nvidia Geforce3 [Montrym and Moreton, 2002;
Malachowsky, 2002] at 10pJ per operation, or 5 pJ per operation when scaled to 0.13 µm
technology. The energy efficiency of graphics processors is due to their highly parallel ar-
chitectures that provide a large amount of arithmetic performance with low VLSI overhead.
In addition, graphics processors are able to exploit producer-consumer locality by feeding
the output of one stage of the graphics pipeline directly to the next stage of the pipeline,
avoiding global data transfers.
Finally, we present the energy efficiency of Imagine in Table 5.1. When normalized to
the same voltage, Imagine dissipates nearly half the energy per floating-point op dissipated
by the SB-1250, the most energy-efficient fully programmable floating-point
processor listed. When normalized to the same technology, Imagine provides energy ef-
ficiencies 2.7x better than the C67x on floating-point operations. On 16-bit operations,
Imagine is comparable to the C64x, even though Imagine contains arithmetic units opti-
mized for floating-point and 32-bit performance and a less aggressive design methodology.
In the next section, we will explore the energy efficiency of stream processors optimized for
16-bit fixed-point applications rather than floating-point performance. In addition, we will
demonstrate the potential of stream processors to achieve much higher energy efficiency by
employing more-aggressive custom design methodologies rather than standard cells and by
using low-power circuit techniques. This would also provide a more fair comparison to the
DSPs and microprocessors listed here, which also use custom design methodologies.
Some researchers have proposed using the energy-delay product as an alternate met-
ric for energy efficiency [Horowitz et al., 1994]. Since processors designed to operate at
slower performance rates can reduce their energy per task with techniques such as smaller
transistor sizing and less pipelining to reduce clock power (at the cost of higher delay per
task), the energy-delay product metric allows one to compare the energy efficiency of two
processors operating at different performance rates.
We evaluated energy-delay product on the same range of processors as shown in Ta-
ble 5.2 (lower energy-delay product is better). As with energy efficiency, energy-delay is
estimated from peak performance and power dissipation, given by the ratio of energy per
operation to the peak performance on that operation type.
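In code, the estimate amounts to the following; the 200 pJ and 10 GOPS figures are hypothetical, chosen only to show the units.

```python
def energy_delay(energy_per_op_pj, peak_gops):
    """Energy-delay product per operation: energy per op (pJ) times the
    delay per op (ns), taking delay as the reciprocal of the peak rate."""
    delay_ns = 1.0 / peak_gops
    return energy_per_op_pj * delay_ns  # units: pJ*ns (lower is better)

# A hypothetical 200 pJ/op processor peaking at 10 GOPS scores 20 pJ*ns;
# halving energy per op or doubling the peak rate would halve the score.
score = energy_delay(200.0, 10.0)
```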
The energy-delay product is affected more than energy efficiency or performance alone
by design methodologies and technology scaling. That is because the advantages in raw
performance and energy efficiency provided by smaller device sizes and custom circuits
are compounded in the energy-delay product. Nevertheless, when normalized to the same
technology, Imagine still has less than half the energy-delay of the most energy-efficient
DSPs on 16b operations. However, on floating-point, it is slightly worse than the SB-1250.
Several factors in the Imagine implementation lead to higher energy-delay products and
energy per operation than would be possible with a more optimal implementation of the
Imagine architecture. First, as described in Section 5.1, if Imagine operated at 1.5 V at
the worst-case frequency of 188 MHz predicted by static timing analysis, rather than the
measured frequency of 132 MHz, it would see a 1.42x improvement in performance without
affecting energy efficiency. This would translate directly to a reduction in energy-delay
product. Furthermore, by using more aggressive design methodologies and low-power
circuit techniques, Imagine could easily operate at much higher frequencies, and could
therefore provide additional reductions in its energy-delay product.
5.5 Summary
A prototype Imagine processor has been shown to provide a peak performance of 11.5
GFLOPS at its maximum frequency and to dissipate 423 pJ per floating-point operation
at its most energy-efficient operating condition, more than twice as efficient as other
programmable floating-point processors when normalized to the same technology. However,
there is still a large gap in performance, area efficiency, and energy efficiency between
stream processors and special-purpose processors, which typically dissipate less than
10 pJ per operation in a 0.13 µm technology. Stream processors have the potential to
further close this gap by utilizing more aggressive custom design methodologies,
low-power circuit techniques, and energy-efficient ALU designs for 16-bit operations.
In the following chapters, we extend stream processors to custom design methodologies
and to stream processors tailored for low-power embedded systems, demonstrating the
potential for highly area- and energy-efficient stream processors.
Chapter 6
Stream Processor Scalability: VLSI Costs
The previous chapters of this thesis described the VLSI implementation and evaluation of
the Imagine stream processor and demonstrated its capability of efficiently supporting 48
ALUs in a 0.15 µm standard-cell technology. In the following chapters, we extend this work
by exploring the capability of scaling stream processors to many more ALUs per chip in
future technologies and by exploring efficiency improvements with more aggressive design
methodologies.
With CMOS technology scaling and improved design methodologies, increasing numbers
of ALUs can fit onto one chip. On the Imagine stream processor, a multiplier supporting
single-precision floating-point, 32-bit integer, and dual 16-bit integer multiplies has an
area of 0.486 mm² (1224K grids) and an average energy per multiply of 185 pJ, including
internal flip-flop power in pipeline registers. In comparison, custom implementations
of similar multipliers scaled to the same technology would have an area of less than
0.26 mm² (655K grids) and an energy of less than 50 pJ per multiply [Huang and Ercegovac,
2002; Nagamatsu et al., 1990]. By only supporting 16-bit data-types, this custom
multiplier's area and energy could be further reduced, to less than 15 pJ per multiply
[Goldovsky et al., 2000]. Other components, such as register files, scale similarly with
custom methodologies. This demonstrates the potential for large area and energy savings
by employing custom design methodologies, rather than standard cells, and by tailoring
datapaths to smaller widths if that is all that is required.
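The savings quoted above follow directly from the cited figures; a quick calculation makes the ratios explicit.

```python
# Imagine's standard-cell multiplier versus published custom designs
asic_area_mm2, asic_energy_pj = 0.486, 185.0
cust_area_mm2, cust_energy_pj = 0.26, 50.0   # custom 32-bit multiplier
cust16_energy_pj = 15.0                      # custom 16-bit-only multiplier

area_savings = asic_area_mm2 / cust_area_mm2        # ~1.9x from custom cells
energy_savings = asic_energy_pj / cust_energy_pj    # ~3.7x from custom cells
energy_savings_16b = asic_energy_pj / cust16_energy_pj  # ~12x, 16-bit only
```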
CHAPTER 6. STREAM PROCESSOR SCALABILITY: VLSI COSTS 83
track of length (0.093 fJ per wire track in a 0.18 micron technology [1]). Measured delays
for on-chip wire propagation and key gates which will be used to construct large switches
are presented in fan-out-of-4 inverter delays (FO4s), a process-independent measure of
device speed. As technology scales, wire propagation velocity v0 stays relatively constant
with optimal repeatering [Ho et al., 2001]. A clock cycle of 45 FO4s, measured from
the Imagine stream processor, was used. Typical microprocessors designed with custom
methodologies have clock cycles closer to 20 FO4s [Agarwal et al., 2000]. Adapting the
cost analysis to results for custom processors will be addressed in Section 6.3.
A stream processor can be constructed from the building blocks in Table 6.1. However,
appropriate sizes and bandwidths must be chosen for structures such as the local register
files, stream register file, microcontroller, and others. The values shown in Table 6.2 were
used to govern how such structures should scale. These values were empirically determined
from the inner loop characteristics of a variety of key media processing kernels, shown in
Table 6.3 with the number of accesses per ALU operation shown in parentheses. Based on
these inner-loop characteristics, reasonable values for G_SRF, G_SB, G_SP, and G_COMM
were used to ensure that average application performance was not limited by
microarchitectural assumptions. Also included in Table 6.2 are C and N, variables that
will be varied throughout the chapter as the number of ALUs is scaled.

[1] Calculated from an assumed wire capacitance of 0.26 fF per micron, including repeater
capacitance [Ho et al., 2001], with a 25% 1-to-0 transition probability.
[Figure 6.1: stream processor floorplan, showing the microcontroller, the grid of
arithmetic clusters and SRF banks, and intercluster buses of width C^(1/2)·N_COMM·b]
a cluster will read during SRF reads and writes. The only communication between the
clusters or SRF banks is in the memory system ports to the SRF (not shown) and the
intercluster switch, with buses and cross-point switches represented as lines and dots in
Figure 6.1.
The area of the SRF, A_SRF, contains two components: the stream storage and the
streambuffers (SBs). The SRF is used to stage streams passed between kernels. An SB
automatically prefetches sequential data for its associated stream out of the stream
storage. All SBs share a single port into the stream storage, allowing that single port to
act as many logical ports. The stream storage is a large single-ported on-chip SRAM,
organized as r_m·T·N words of b bits.
Microcontroller
The microcontroller, listed next in Table 6.4, provides storage for the kernels' VLIW
instructions, and sequences and issues these instructions during kernel execution. The
microcode storage is a large single-ported memory. The microcontroller area comprises
the microcode storage area and the area for control-wire distribution between the
microcontroller and the clusters. The microcode storage holds r_uc VLIW instructions,
sufficient for kernel storage in typical applications. Although, as shown in Section 7.3,
increasing N results in higher inner-loop performance, the number of instructions in a
kernel stays relatively constant with N, since more loop unrolling is often used with
higher N to provide more ILP and
[2] Splitting multi-word-record streams into multiple streams was done by hand to
optimize performance for the experiments in Section 7.3.
Element                      Equation
COMMs per cluster            N_COMM = G_COMM·N
SPs per cluster              N_SP = G_SP·N
FUs per cluster              N_FU = N + N_SP + N_COMM
Cluster SBs                  N_CLSB = L_C + L_N·N
Total SBs                    N_SB = L_O + N_CLSB
External cluster ports       P_e = N_CLSB
Total area                   A_TOT = C·A_SRF + A_UC + C·A_CLST + A_COMM
SRF bank area                A_SRF = r_m·T·N·A_SRAM·b + (2·G_SRF·N)·N_SB·A_SB·b
Microcontroller area         A_UC = r_uc·(I_0 + I_N·N_FU)·A_SRAM + (I_N·N_FU)·√(A_SRF + A_CLST + A_COMM)
Cluster area                 A_CLST = N_FU·w_LRF·h + N·w_ALU·h + N_SP·w_SP·h + A_SW
Intracluster switch area     A_SW = √N_FU·(√N_FU·b)·(2·√N_FU·b + h + 2·w_ALU + 2·w_LRF) + √N_FU·(3·√N_FU·b + h + w_ALU + w_LRF)·P_e·b
Intercluster switch area     A_COMM = C·N_COMM·b·√C·(N_COMM·b·√C + 2·√A_CLST + √A_SRF)
Intracluster wire delay      t_intra = √N_FU·(h + 2·√N_FU·b + w_ALU + w_LRF + √N_FU·b)/v_0 + t_mux·(log2(√N_FU) + √N_FU)
Intercluster wire delay      t_inter = t_intra + 2·√(C·A_CLST + C·A_SRF + A_COMM)/v_0 + t_mux·(log2(C·N_COMM) + √C)
Total energy                 E_TOT = C·E_SRF + E_UC + C·E_CLST + G_COMM·N·C·b·E_inter
SRF bank energy              E_SRF = r_m·T·N·b·E_SRAM·G_SB/G_SRF + (G_SB·N·b)·(E_SB + E_intra/2)
Microcontroller energy       E_UC = r_uc·(I_0 + I_N·N_FU)·E_SRAM + (I_N·N_FU)·E_w·√C·√(C·A_SRF + C·A_CLST + A_COMM)
Cluster energy               E_CLST = N_FU·E_LRF + N·E_ALU + G_SP·N·E_SP + N_FU·b·E_intra
Intracluster switch energy   E_intra = E_w·√N_FU·((h + 2·√N_FU·b) + 2·(w_ALU + w_LRF + √N_FU·b))
Intercluster switch energy   E_inter = E_w·(2·√C)·(√A_CLST + √A_SRF + N_COMM·b·√C)
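A few rows of Table 6.4 can be written in executable form; the parameter values in this sketch (the G ratios and the unit dimensions) are illustrative placeholders, not the calibrated values of Table 6.2.

```python
import math

def fu_counts(N, G_SP=0.2, G_COMM=0.2):
    """Per-cluster unit counts from Table 6.4: N ALUs plus scratchpad (SP)
    and intercluster-communication (COMM) units in fixed ratios."""
    N_SP = G_SP * N
    N_COMM = G_COMM * N
    N_FU = N + N_SP + N_COMM
    return N_FU, N_SP, N_COMM

def intracluster_switch_area(N_FU, b=1.0, h=1.0, w_ALU=1.0, w_LRF=1.0, P_e=2.0):
    """A_SW from Table 6.4: row/column bus tracks plus external-port
    tracks, all dimensions in arbitrary technology-independent units."""
    r = math.sqrt(N_FU)
    rows = r * (r * b) * (2 * r * b + h + 2 * w_ALU + 2 * w_LRF)
    ports = r * (3 * r * b + h + w_ALU + w_LRF) * P_e * b
    return rows + ports
```

With these placeholder ratios, N = 5 ALUs yields N_FU = 7 functional units per cluster; the N_FU^(3/2) bus term makes the switch area grow superlinearly as clusters widen.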
because loop prologues and epilogues in kernels are critical-path limited, not
arithmetic-bandwidth limited. The width of each VLIW instruction is given by
I_0 + I_N·N_FU bits.
I_0 bits are required for microcontroller instruction sequencing, conditional stream
instructions, immediate data, and for interfacing with the SRF. I_N bits per ALU per
cluster are required to encode ALU operations, to control LRF reads and writes, and to
control the intracluster switch. Area and energy for distributing the instructions from
the microcode storage to the grid of clusters are accounted for in the second term of
both formulae in Table 6.4. In addition, repeaters and pipeline registers are required
within the cluster grid for further instruction distribution, but this area is accounted
for in the area measured for the components in Table 6.2.
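The instruction-width relationship just described can be sketched directly; the default bit counts used here are placeholders, not Imagine's actual encoding.

```python
def vliw_width_bits(n_fu, i0=96, i_n=48):
    """Width of one VLIW instruction: I0 fixed bits (sequencing,
    immediates, SRF interfacing) plus IN bits per functional unit."""
    return i0 + i_n * n_fu

def microcode_bits(n_fu, r_uc=2048, i0=96, i_n=48):
    """Microcode storage capacity: r_uc instructions of I0 + IN*N_FU bits."""
    return r_uc * vliw_width_bits(n_fu, i0, i_n)
```

With these placeholder counts, a 7-FU cluster would need 432-bit instructions; widening clusters widens every stored instruction, which is why instruction distribution appears in both the area and energy formulae.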
Arithmetic Clusters
Each cluster comprises area devoted to the LRFs, the ALUs, a scratchpad, and the
intracluster switch. This switch is a full crossbar that connects the outputs of the FUs
and the streambuffers to the inputs of the LRFs and the streambuffers. In this study, the
ALUs are assumed to be arranged in a square grid as shown in Figure 6.2, where each row
contains a bus for each ALU output in that row and each column contains a bus for each
LRF input in that column. The row-column intersections contain program-controlled
cross-point switches that connect rows to columns. This grid structure minimizes the area
and wire delay of the intracluster switch when the number of ALUs per cluster is
large [3]. The area devoted to the intracluster switch includes wire tracks for the wires
and repeaters in the rows and columns of the grid and for the cross-points between the
rows and columns, as shown in Figure 6.2. Additional area for the external ports from the
cluster streambuffers is included as well. Area for the control wires of the cross-points
is ignored for simplicity, since it is small when compared to the area for the data wires.
Table 6.4 also includes equations for the energy dissipated in an arithmetic cluster [4]
and for the intracluster wire delay. For wire delay, the first term in t_intra models the
worst-case wire propagation delay incurred in the intracluster switch, (width + height of
a cluster)/v_0, and the second term models the logic delay through the cross-points. The
logic delay includes a √N_FU:1 mux at each row-column intersection to select which ALU to
read from on that row, followed by
[3] For smaller numbers of ALUs per cluster, a linear floorplan has comparable area and
delay, but for simplicity, only grid floorplans are considered in this study.
[4] E_CLST from Table 6.4 includes a correction from [Khailany et al., 2003] for the term
modeling energy dissipated in the scratchpad.
[Figure 6.2: grid floorplan of an arithmetic cluster: register files of width w_LRF and
height h, row buses of width √N_FU·b, column buses of width 2·√N_FU·b, and external port
buses of width P_e·b]
an additional 2:1 mux delay at each additional row in the column to choose between the
current row or the adjacent rows. As N increases, the VLSI costs of the arithmetic
clusters are dominated by the N_FU^(3/2) term in the intracluster switch area.
Intercluster Switch
The final component of the stream processor area is the intercluster switch, shown in
Figure 6.1. Since each cluster can only access stream elements from its own SRF bank, the
intercluster switch allows kernels that are not completely data parallel to communicate
data with each other without going back to memory. It is also used by conditional streams
to route data to and from the SRF [Kapasi et al., 2000]. A two-dimensional grid structure
similar to the intracluster switch is assumed for the floorplan of the arithmetic
clusters. This layout minimizes the area, delay, and energy overhead of the intercluster
switch when the number of arithmetic clusters becomes large. Each cluster has N_COMM
buses it writes to in each row and reads from in each column, so there is a bus width of
N_COMM·b·√C between each arithmetic cluster. As shown in E_COMM, on average,
[Figure: area per ALU, normalized, versus ALUs per cluster (0 to 128), broken into
cluster, microcontroller, and intercluster switch components]
G_COMM·N·C intercluster communications will occur for every N·C ALU operations, where
each intercluster communication switches the capacitance of a bus in its row and in its
destination's column.
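Under the Table 6.4 model, this communication rate translates into an average intercluster-switch energy per ALU operation; the sketch below uses illustrative unit parameters, not calibrated values.

```python
import math

def e_inter(C, A_CLST, A_SRF, N_COMM, b, E_w):
    """Energy of one intercluster communication (per the Table 6.4 form):
    a traversal of 2*sqrt(C) cluster pitches, where one pitch spans a
    cluster, an SRF bank, and a bus bundle of width N_COMM*b*sqrt(C)."""
    pitch = math.sqrt(A_CLST) + math.sqrt(A_SRF) + N_COMM * b * math.sqrt(C)
    return E_w * 2.0 * math.sqrt(C) * pitch

def comm_energy_per_alu_op(G_COMM, b, e):
    """G_COMM*N*C communications per N*C ALU ops means each ALU op pays,
    on average, for G_COMM communications of b bits each."""
    return G_COMM * b * e
```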
[Figure: energy per ALU operation, normalized, versus ALUs per cluster (0 to 128),
showing microcontroller and intercluster switch overheads]
Area and energy per ALU are minimized for N = 5, the most area- and energy-efficient
configuration. For small N, the overhead from the I_0 bits of microcode storage and from
the COMM and SP units contributes to a larger area per ALU. The area per ALU then stays
within 16% of the minimum up to 16 ALUs per cluster, at which point the intracluster and
intercluster switches start to reduce the area efficiency. The energy efficiency follows
a similar trend, although by 16 ALUs per cluster the energy per ALU op has grown to 1.22x
the minimum, due to the intracluster switch and microcontroller instruction distribution
to the large arithmetic clusters.
[Figure: area per ALU (normalized to 8 clusters) versus number of clusters (8 to 256),
broken into SRF, cluster, microcontroller, and intercluster switch components]
At C = 128, the area per ALU is only 2% worse than for C = 8, mostly due to area in the
intercluster switch. As shown in Figure 6.6, energy overhead grows slightly faster than
area: a C = 128 configuration dissipates 7% more energy per ALU operation than a C = 8
configuration.
[Figure 6.6: energy per ALU operation (normalized to 8 clusters) versus number of
clusters (8 to 256), showing SRF, cluster, microcontroller, and intercluster switch
components]
[Figure: area per ALU (normalized to 32 clusters with 5 ALUs per cluster) versus total
ALUs (up to 1024), for N = 2, N = 5, and N = 16]
As technology enables more ALUs to fit on a single chip, architectures must efficiently
utilize bandwidth in order to achieve large performance gains. Intracluster scaling was
shown to be effective from a cost standpoint up to 10 ALUs per cluster, although it was
most area- and energy-efficient at 5 ALUs per cluster. Intercluster scaling was shown to
be effective up to 128 clusters with only a slight decrease in area and energy
efficiency, although it was most area-efficient at 32 clusters. Together, these two
scaling techniques enable area- and energy-efficient stream processors with thousands of
ALUs.
Table 6.5: Building block Areas, Energies, and Delays for ASIC, CUST, and LP
these units suggests improvements applicable to the full stream processor design through
custom circuit methodologies; however, the analysis is simplified by only studying these
units, which comprise over 70% of the active chip area and power dissipation.
In order to model area and energy costs across the various configurations, it suffices
to vary the area and energy of the building-block parameters across the configurations
and then use the analytical models presented in Section 6.1. The parameters that were
varied are presented in Table 6.5, again normalized to technology-independent units. For
the ASIC configuration, the values are those presented in Table 6.1, taken from the
Imagine stream processor. Two CUST configurations are considered: the aggressive (Aggr)
numbers assume custom cells for the ALUs and register files (LRFs and SBs), whereas the
conservative (Cons) numbers assume custom cells only for the register files, with
the ASIC ALUs. Register file areas and energies were taken from custom register files
implemented in the same technology as Imagine. ALU data was taken from published 32b
multiplier designs [Huang and Ercegovac, 2002; Nagamatsu et al., 1990]. Note that the
width of various units also changed, since a smaller datapath height is assumed with the
custom configuration. Although memories and register files are well documented, it is
difficult to predict the exact area and energy of custom ALU designs due to a lack of
published data, so carrying out the analysis with both the CUST AGGR and CUST CONS
configurations demonstrates the sensitivity of area and energy to individual ALU designs.
For the LP processor, custom cells were assumed for both the ALUs and the register files.
A dual 16b fused multiply-adder is used, with area and energy taken from published 16b
multipliers [Goldovsky et al., 2000]. Finally, the CUST AGGR configuration is assumed to
have a clock cycle of 20 FO4 inverter delays, more typical of custom processors [Agarwal
et al., 2000]. This difference in clock cycle time also affects memory latency, although
not exactly by 2.5x, since much of the memory latency is incurred through cycles in the
on-chip memory controller. The LP configuration is less aggressively pipelined than
CUST AGGR, at 45 FO4 delays per cycle, in order to reduce power overhead
in pipeline registers.
Using the models presented in Section 6.1, estimates for area and energy per operation
in the clusters, SRF, and microcontroller were generated. These results are shown in
Table 6.6, assuming a 1.2 V, 0.13 µm technology. In this technology, the minimum wire
pitch is 0.46 µm and an FO4 inverter has a delay of 65 ps. The CUST CONS configuration
is 80% the size of ASIC, due to the smaller LRFs and SBs. Further area reduction occurs
when moving to custom ALUs in the CUST AGGR configuration. The LP configuration has an
additional 36% improvement in area over CUST AGGR because w_ALU is significantly
smaller, since the 16-bit multiply-add unit is only required to sum half as many partial
products as the multiplier in CUST AGGR. Supporting multiply-add instructions in the LP
configuration also enables twice the peak 16b performance when compared to CUST CONS.
The energy-efficiency savings from moving to custom design methodologies can be seen
when comparing the power of ASIC and CUST AGGR. Although it has 14% higher power
dissipation, CUST AGGR operates at more than twice the frequency, meaning that
significantly less energy is dissipated per clock cycle. Furthermore, the LP
configuration is more than twice as energy-efficient as CUST because each multiply-add
operation consumes less energy while providing twice the GOPS. Additional
energy-efficiency improvements can be achieved in all configurations by operating at a
lower voltage. Note that the energy efficiencies in Table 6.6 cannot be compared directly
to those of other processors, since they do not take into account other sources of power
dissipation in a processor, such as the clock tree. However, the 4.8x improvement in
energy efficiency suggests that similar energy savings could be achieved in the clock
tree and other parts of the chip by employing custom design methodologies and other
custom circuit techniques.
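The comparison works because energy per operation is just power divided by throughput; the round numbers below are hypothetical, chosen only to mirror the 14%-more-power, more-than-twice-the-frequency relationship.

```python
def energy_per_op_pj(power_w, gops):
    """Energy per operation in picojoules: power (W) / throughput (GOPS)."""
    return power_w * 1000.0 / gops

# Hypothetical illustration: 14% more power at 2.3x the throughput
asic_pj = energy_per_op_pj(4.0, 10.0)         # 400 pJ/op
cust_pj = energy_per_op_pj(4.0 * 1.14, 23.0)  # ~198 pJ/op, roughly 2x better
```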
These results for custom and low-power processors can be further extended by combining
them with the scalability models from Section 6.1 and the technology scaling parameters
shown in Table 6.7. The parameters in the top part of the table are based on projections
for future technologies [Ho et al., 2001; SIA, 2001]. The wire capacitance listed assumes
no Miller capacitance from adjacent wires, the assumption used for calculating
average-case power dissipation. Architectural scaling assumptions are shown in the bottom
part of the table. Since intercluster scaling enables scaling with near-constant area
efficiency, each
technology generation allows for a doubling of the number of arithmetic clusters. ALUs
per cluster in Table 6.7 are held constant at 5, the most area- and energy-efficient
organization. The increase in ALU count with each technology generation can be coupled
with an increase in clock frequency due to decreasing gate delay, allowing for a total
improvement in peak performance of 64x across 5 generations of technology scaling.
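The 64x figure follows from the two factors just named: spanning the five nodes from 180 nm down to 45 nm is four generation steps.

```python
steps = 4                     # 180 -> 130 -> 90 -> 65 -> 45 nm (five nodes)
cluster_growth = 2 ** steps   # clusters double each generation: 16x ALUs
freq_growth = 180 / 45        # clock frequency tracks gate delay: 4x
peak_improvement = cluster_growth * freq_growth  # 16 * 4 = 64x
```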
The results for area and power scaling across technology generations are shown in
Figure 6.8. Note that the area stays relatively constant for all configurations. With
each generation, approximately twice the number of transistors can fit into the same die
area. Since area efficiency stays near constant with intercluster scaling, area also
stays near constant. In contrast, power dissipation increases gradually with technology.
Although energy efficiency stays near constant with intercluster scaling in the same
technology, voltage is not projected to scale aggressively enough to counteract the
additional switched capacitance of a larger number of clusters.
Energy efficiency scaling is shown in Figure 6.9. For both floating-point and 16b op-
erations, energy efficiency shows dramatic improvements. Since energy-efficiency stays
[Figure 6.8 comprises two panels: die area (20-80 mm²) and power dissipation (0-15 W)
versus technology node (180, 130, 90, 65, 45 nm), for the ASIC, CUST_CONS, CUST_AGGR,
and LP configurations.]
Figure 6.8: Effect of Technology Scaling on Die Area and Power Dissipation
[Figure 6.9: energy per 16b operation (0-240 pJ) and energy per floating-point operation
(0-500 pJ) versus technology node (180 to 45 nm), for the ASIC, CUST_CONS, CUST_AGGR,
and LP configurations.]
near constant with the number of clusters, this improvement is due solely to technology
advances. And since energy efficiency is independent of frequency, it improves with CV²
scaling, allowing for an improvement of 29x across five future technology generations.
In summary, by utilizing custom methodologies and the fixed-point arithmetic units
commonly found in DSPs, the area and energy efficiency of stream processors can be
greatly improved. SRF, microcontroller, and cluster area was shown to be reduced by 30%
when using custom methodologies, while energy efficiency was improved by 2.0x. Energy is
further reduced with dual-16b multiply-add units, for an average energy savings of 4.8x
per operation. These area and energy efficiency improvements suggest that total stream
processor efficiency would scale accordingly and would therefore be an order of magnitude
better in raw performance, performance per unit area, and energy per operation than
current programmable architectures.
Chapter 7
Stream Processor Scalability: Performance
As presented in the previous chapter, the efficiency of the stream processor register
organization enables scaling to thousands of ALUs, providing teraops per second of peak
performance with only a small degradation in area and power efficiency. However, peak
performance and VLSI efficiency alone do not prove that a stream processor can scale
effectively. Performance efficiency (performance per unit power or performance per unit
area) is achieved by combining VLSI efficiency with high sustained performance.
In this chapter, we evaluate how sustained application and kernel performance scales as
the number of ALUs per stream processor is increased with intracluster and intercluster
scaling. First, we explore how the technology trends of limited off-chip memory bandwidth
and increasing on-chip wire delay affect microarchitecture and performance. Then, we
demonstrate how a stream processor is able to effectively exploit both instruction-level
and data-level parallelism in key media processing kernels and applications to take
advantage of both intracluster and intercluster scaling.
CHAPTER 7. STREAM PROCESSOR SCALABILITY: PERFORMANCE 105
In order for a stream processor to effectively scale to hundreds of arithmetic units by
the end of the decade, there must be enough memory bandwidth available to meet media
application requirements. Fortunately, due to the advent of high-bandwidth memory systems
and the high compute intensity of media applications, hundreds of arithmetic units could
be supported in future technologies without media applications becoming memory limited.
While available on-chip arithmetic bandwidth is increasing at 70% annually, off-chip
pin bandwidth is only increasing by 25% each year [Dally and Poulton, 1998] if only
standard scaling techniques are used. As explained in Section 2.4, Imagine deals with
this bandwidth gap by mapping stream programs onto a three-tiered bandwidth hierarchy
and by efficiently exploiting locality. Producer-consumer locality is exploited by
passing streams between kernels through the second tier of the bandwidth hierarchy, the
SRF. Kernel locality is exploited by keeping all temporary data accesses during kernel
execution in the LRFs within the arithmetic clusters, the third tier of the hierarchy.
Using these techniques, Imagine is able to sustain high arithmetic performance with
significantly lower off-chip bandwidth. Imagine contains 40 fully-pipelined ALUs and, at
232 MHz, provides 2.3 GB/s of external memory bandwidth, 19.2 GB/s of SRF bandwidth, and
326.4 GB/s of LRF bandwidth. This bandwidth hierarchy on Imagine provides a ratio of ALU
operations to memory words referenced of 28.
Although Imagine can execute applications with compute intensities of greater than 28
without becoming memory-bandwidth limited, this threshold will increase for future
stream processors as off-chip bandwidth grows more slowly than arithmetic bandwidth.
Assuming the scaling factors for arithmetic and off-chip bandwidth above, this threshold
will grow at 36% annually, meaning that if stream processors' arithmetic performance
increases at 70% annually, in five years applications with compute intensities of less
than 130 would be memory-bandwidth limited. Fortunately, with the advent of memory
systems optimized for bandwidth and the large and growing compute intensity of media
applications, this
problem can be mitigated. For example, a stream processor with a 16 GB/s memory system,
achievable with eight Rambus channels [Rambus, 2001], could support 400 arithmetic units
running at 1 GHz without being memory-bandwidth limited on applications requiring greater
than 100 operations per memory reference. This suggests that the bandwidth hierarchy in
stream processors should be able to scale effectively to hundreds of ALUs and provide
large speedups without becoming memory-bandwidth limited.
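Both thresholds can be reproduced from the stated growth rates and bandwidths; the arithmetic below only restates numbers already given in the text.

```python
ratio_growth = 1.70 / 1.25             # arithmetic (70%/yr) vs. pin (25%/yr)
threshold_5yr = 28 * ratio_growth**5   # ~130 ops per memory word in 5 years

# 16 GB/s of memory bandwidth is 4 Gwords/s of 32-bit words, and
# 400 ALUs at 1 GHz is 400 GOPS: a break-even compute intensity of 100
break_even = 400e9 / (16e9 / 4)
```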
[Figure 7.1: intracluster and intercluster switch delay (FO4s, 0 to 270) versus ALUs per
cluster (0 to 128)]
With the two-dimensional grid floorplan for the intercluster switch shown in Figure 6.1,
fully pipelined operation can be supported with the insertion of pipeline registers under
the wires (along with buffers and repeaters) and a few control wires, not shown in the
figure. Within each row, each cluster broadcasts its data across the row, for a total of
N_COMM·C buses. If more than one cycle is required to account for wire delay within a
row, then pipeline registers must be inserted at this stage. Meanwhile, each destination
cluster read port broadcasts its associated source cluster across its column (not shown
in the figure). Again, if this traversal requires more than one cycle, pipeline registers
must also be inserted for these control wires. Furthermore, additional pipeline registers
may be required at some cross-points in order to make all switch delays equal to the
worst-case pipeline delay. The vertical control and horizontal data wires then meet at
the cross-points, and the appropriate row buses are muxed onto the vertical data buses.
Similarly, pipeline registers along vertical wires may also be required for various
configurations. In this manner, a fully pipelined crossbar between clusters is
achievable. Similar pipelined operation can also be
[Figure 7.2: intracluster and intercluster switch delay (FO4s, 0 to 270) versus number of
clusters (8 to 256)]
implemented in the intracluster switch. For simplicity, it is assumed that the area and
power of these pipeline registers are small when compared to those of the buffers and
repeaters required for the switches.
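The number of pipeline stages a traversal needs follows from the modeled wire delay and the clock period; the 120 FO4 delay below is an illustrative input, not a value computed from the Table 6.4 model.

```python
import math

def switch_pipeline_stages(delay_fo4, cycle_fo4=45.0):
    """Whole clock cycles (45 FO4s each, as measured from Imagine) needed
    to cover a switch traversal of the given delay."""
    return math.ceil(delay_fo4 / cycle_fo4)

# e.g., an illustrative 120 FO4 intercluster traversal needs 3 cycles
stages = switch_pipeline_stages(120.0)
```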
With pipelining, longer switch delays do not affect the clock rate, but they do have a
significant effect on the microarchitecture. Additional delay in the intracluster switch
affects operation latency during kernel execution. On the Imagine processor,
approximately one half of each execution unit's last pipeline-stage clock cycle was
allocated for traversing the intracluster switch. As shown in Figure 7.1, when N
increases beyond 14 ALUs, the worst-case delay across the intracluster switch is greater
than half of a clock cycle. For N > 14, additional pipeline stages must be added to all
ALU operations in order to account for this wire delay in the intracluster switch. In
Section 7.3, the effect this operation latency has on performance will be presented.
Note that in configurations with multi-cycle worst-case switch traversals, ideally the
VLIW kernel compiler could exploit locality in the placement of operations onto the ALUs
so that most communications would
take place in a single clock cycle and only rarely would data have to be communicated all
the way across the cluster. However, this compiler optimization was not available at the
time of this study, so it was not considered in the performance analysis.
Wire delay in intercluster communications also has an effect on microarchitecture. At
C = 128 in Figure 7.2, the worst case delay between clusters requires three clock cycles to
traverse the switch. As these multi-cycle worst-case intercluster switch traversals become
necessary, three types of instructions must be modified to account for this delay. First, all
COMM unit instructions must include additional latency. Second, latency must be also
added to many of the instructions used to implement conditional streams. For example, la-
tency must be added to the GEN CISTATE and GEN COSTATE instructions executed by
the JB since they require communicating 1-bit CC values between all C clusters. Finally,
the CHK ANY and CHK ALL instructions executed by the microcontroller require addi-
tional latency for broadcasting 1-bit CC’s from all C clusters to the microcontroller. These
changes can easily be handled without affecting instruction throughput by adding execu-
tion unit pipeline stages to the microarchitecture, a change easily supported by the VLIW
kernel scheduler. In addition to increased operation latency, additional pipeline stages must
be added to the instruction distribution (DECODE/DIST) stage in the kernel execution
pipeline shown in Figure 3.6. The performance impact for both increased operation latency
and pipeline depths will be presented in Section 7.3.
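The instruction-latency changes described here can be sketched as a simple table adjustment. The instruction names are those given in the text; the base latency values are hypothetical placeholders, since the dissertation does not list them here:

```python
# Hypothetical base latencies in cycles; placeholders for illustration only.
BASE_LATENCY = {"COMM": 2, "GEN_CISTATE": 3, "GEN_COSTATE": 3,
                "CHK_ANY": 2, "CHK_ALL": 2, "ADD": 1}

# The instruction types named in the text that cross the intercluster switch.
INTERCLUSTER_OPS = {"COMM", "GEN_CISTATE", "GEN_COSTATE", "CHK_ANY", "CHK_ALL"}

def scaled_latencies(switch_cycles: int) -> dict:
    """Add extra cycles to every instruction that traverses the
    intercluster switch; a single-cycle traversal adds nothing."""
    extra = switch_cycles - 1
    return {op: lat + (extra if op in INTERCLUSTER_OPS else 0)
            for op, lat in BASE_LATENCY.items()}

# With the 3-cycle worst-case traversal at C = 128, COMM gains 2 cycles
# while purely intracluster operations such as ADD are unchanged.
lat = scaled_latencies(3)
```

Because only latencies change, not issue rates, the VLIW scheduler can absorb these adjustments without losing instruction throughput, as the text notes.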
In summary, communication delay across switches must be accounted for as intracluster
and intercluster scaling are used. With a 45 FO4 delay clock cycle and using the analytical
models presented in Section 6.1, these delays are shown to be manageable with pipelined
switches and by managing this switch latency with the VLIW kernel compiler. For exam-
ple, when scaling to a C = 128, N = 5 processor, only 3 cycles are required for traversing
the intercluster switch. In the following sections, we account for this wire delay when
measuring the performance of intracluster and intercluster scaling on a set of key media
processing kernels and applications.
stream processor, over 80% of execution time is spent in kernel inner loops.
In order to study the effect of intercluster and intracluster scaling on kernel inner-loop
performance, a suite of kernels was compiled for various stream processor sizes. Func-
tional unit latencies were taken from latencies in the Imagine stream processor and the
latencies of communications were taken from the results presented in Section 6.2. In the
Imagine design, half of a 45 FO4 cycle was allocated for intracluster communication delay.
Therefore, for configurations with N > 12, where more than a half-cycle is required for
intracluster communication, an additional pipeline stage was added to ALU operations and
streambuffer reads to cover this latency. Similarly, the COMM unit operation latency and
instruction issue pipeline depth was determined by the intercluster communication delay.
Intracluster Scaling
[Figure 7.3: IPC per cluster (top) and ALU utilization (bottom) versus ALUs per cluster, N = 2 to 14]
Each category affects the tradeoffs between intracluster and intercluster scaling in slightly
different ways. True ILP can only be exploited by intracluster scaling and therefore in-
tracluster scaling that exploits true ILP does not degrade intercluster scaling. Software
pipelining and loop unrolling are loop transformations that convert DLP between loop iter-
ations into ILP within kernel inner-loops. These transformations benefit intracluster scal-
ing without affecting inner-loop performance with intercluster scaling. However, note that
software pipelining and loop unrolling exploit parallelism which could have also been ex-
ploited directly by intercluster scaling. For this reason, these different forms of ILP must
be treated separately in order to explore the tradeoffs between intracluster and intercluster
scaling.
First, intracluster scaling exploiting true ILP in kernel inner-loops is presented. Fig-
ure 7.3 shows this inner-loop performance without the use of software pipelining or loop
unrolling. In other words, each inner loop iteration processes only one element of a stream.
For example, the FFT kernel executes one butterfly operation and the noise kernel runs a
Perlin noise function shader for one fragment. All kernels were scheduled with intracluster
and intercluster switch latencies for a C = 8 processor, with N varied from 2 to 14. The
top graph shows the average instructions per cycle (IPC) executed in each cluster within
the kernel inner loops. IPC ranges from 0.90 to 1.39 for N = 2 and from 0.89 to 4.17 for
N = 12, the highest performing configuration. The harmonic mean across all kernels is
shown in bold.[1] Most kernels have average IPCs of less than 2.5 for all ranges of ALUs.
The one exception is the convolve kernel, which has a significant amount of ILP within
the inner loop. However, for the other kernels, the minimum loop length is limited by the
length of the dependency chain in the processing of stream elements in the inner loops,
making these kernel schedules critical-path limited rather than arithmetic-throughput
limited, leading to limited IPC.
[1] The instructions included in the IPC shown are those executed on the ALU, MUL, and
DSQ units only, and do not include COMM or SP operations. However, IPC does not
translate directly to GOPS since non-arithmetic operations such as SELECTs and
SHUFFLEs are included.
In the lower graph in Figure 7.3, the same kernel inner-loop performance data is pre-
sented as average arithmetic unit utilization. On this graph, flat lines would correspond to
linear speedups for intracluster scaling. Without the use of loop transformations such as
software pipelining or loop unrolling, the IPC and utilization data suggests that there would
be little advantage to intracluster scaling beyond 2 or 3 ALUs since little additional speedup
is achieved for larger clusters. Furthermore, when going from 12 ALUs to 13 ALUs per
cluster, a slowdown is observed, due to the additional cycle of latency incurred in travers-
ing the intracluster switch. Without the use of software pipelining or loop unrolling, little
ILP can be exploited with intracluster scaling and performance is affected by intracluster
switch delay.
Once software pipelining is used, the dependency chain that limited ILP in the kernels
above is broken across several iterations of the inner loop, allowing independent process-
ing for several stream elements to occur within the same loop iteration. This significantly
increases the amount of ILP available within the inner loops. Average IPC and ALU uti-
lization with software pipelining in these inner loops are shown in Figure 7.4. For N = 2,
high ALU utilization of over 75% is achieved for all kernels except for Irast. With this
small cluster size, an average IPC of 1.59 is sustained, a speedup of 1.34x over the non-
software-pipelined kernels.
As the number of ALUs per cluster is increased, the advantage of these kernels over the
non-software-pipelined kernels dramatically increases. In fact, two of the kernels, Noise
and Convolve, containing 269 and 146 operations per loop iteration respectively, show
near-linear speedups up to 14 ALUs. Irast has the worst average IPC of all kernels because
its inner-loop performance is limited by COMM unit throughput, not ALU throughput.
For this reason, its performance improves when going from 5 to 6 ALUs and from 10 to
11 ALUs, thresholds for adding COMM units in the scaling model. The remaining three
kernels scale to maximum IPCs between 4.1 and 6.4 for 12 ALUs per cluster. These three
kernels only contain between 49 and 64 operations per loop iteration and in some cases
contain loop-carried state with long dependency chains that can not be broken across loop
iterations with software pipelining.
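The effect of breaking the dependency chain can be illustrated with a simple initiation-interval model: the inner-loop length is bounded below both by ALU throughput and by the critical path, and software pipelining with S stages divides the critical path's contribution by roughly S. The kernel numbers below are made up for illustration:

```python
import math

def inner_loop_ipc(ops: int, critical_path: int, n_alus: int, stages: int = 1) -> float:
    """Per-cluster IPC for a kernel inner loop.  The loop length is bounded
    below both by ALU throughput (ops / n_alus) and by the dependency
    chain; software pipelining with `stages` stages divides the chain's
    contribution by roughly that factor."""
    loop_len = max(math.ceil(ops / n_alus), math.ceil(critical_path / stages))
    return ops / loop_len

# Made-up kernel: 48 operations with a 24-cycle dependency chain.
# Without pipelining the schedule is critical-path limited; with 4
# stages the chain is broken and all 8 ALUs become usable.
print(inner_loop_ipc(48, 24, 8))            # critical-path limited: 2.0
print(inner_loop_ipc(48, 24, 8, stages=4))  # throughput limited: 8.0
```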
[Figure 7.4: Intracluster Scaling with Software Pipelining. IPC per cluster (top) and ALU utilization (bottom) versus ALUs per cluster, N = 2 to 14]
Software pipelining is also effective at creating enough ILP in some kernels to hide the
latency of the intracluster switch. Noise and Convolve continue to demonstrate speedups
when scaling from 12 to 13 ALUs, in contrast to the slow-downs observed without software
pipelining. However, slow-downs are still observable for the Update and FFT kernels,
which do not have quite as much ILP.
As shown above, the Blocksad, FFT, and Update kernels scale well to around 5 ALUs
per cluster, but beyond that become limited by the amount of ILP and total operation count
in each loop iteration. Loop unrolling is a transformation that can help to overcome this
by allowing the VLIW compiler to schedule more than one loop iteration at one time. This
allows the compiler to interleave ALU operations from subsequent iterations in order to
achieve higher ALU utilizations. IPC and ALU utilization results with both loop unrolling
and software pipelining on the same suite of kernels are shown in Figure 7.5. From N = 2
to N = 5, the kernel results with software pipelining are nearly identical with and without
loop unrolling. However, for N > 5, loop unrolling is effective at improving speedups
for the Blocksad, FFT, and Update kernels. For N = 14, loop unrolling improved the
average IPC from 5.71 to 6.87. In addition, delay in the intracluster switch does not have a
noticeable effect with the use of loop unrolling.
The above data shows that near-linear speedups can be achieved on media processing
kernels up to around 10 ALUs per cluster by applying loop transformations and intracluster
scaling. However, in order to fairly evaluate the tradeoffs between intracluster and inter-
cluster scaling, performance efficiency (performance per unit area or per unit power) must
be considered as well. Figure 7.6 shows the ratio of sustained IPC in kernel inner-loops
to processor area for 8-cluster processors. Area is scaled to dimensions typical of a 45
nanometer technology. The three graphs from top to bottom present IPC per square mil-
limeter with no loop transformations, only software pipelining, and both software pipelin-
ing and loop unrolling, respectively. Results are shown for all six kernels with the harmonic
means in bold. The lower two graphs contain two bold lines: the lower bold line is the har-
monic mean of all six kernels while the upper bold line excludes the Irast kernel from the
harmonic mean.
[Figure 7.5: Intracluster Scaling with Software Pipelining and Loop Unrolling. IPC per cluster (top) and ALU utilization (bottom) versus ALUs per cluster, N = 2 to 14]
Without loop transformations, the peak IPC per square millimeter is 3.82, achieved with
2 ALUs per cluster when averaged over the six kernels. However, the peak performance
efficiency improves by 47% when software pipelining is used, providing a peak efficiency
of 5.62 with 3 ALUs per cluster. Loop unrolling further increases the peak to 5.76 at 3
ALUs per cluster. In Chapter 6, it was shown that the optimal ratio of peak performance to
area was at 5 ALUs per cluster with intracluster scaling. The same result, with an optimal
performance efficiency at 5 ALUs, is measured when the Irast kernel is excluded from the
analysis. Irast is a kernel limited by COMM unit bandwidth, so it does not benefit as much
from increasing the ALUs per cluster until more COMM units are also added. The optimal
intracluster scaling for performance per unit power (average energy per ALU operation)
was presented previously in Chapter 6 and was also shown to be at 5 ALUs.
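The quoted efficiency gains can be checked directly from the peak values given in the text:

```python
# Peak IPC per square millimeter values quoted in the text.
peak_no_transform = 3.82   # 2 ALUs per cluster, no loop transformations
peak_swp = 5.62            # 3 ALUs per cluster, software pipelining
peak_unroll = 5.76         # 3 ALUs per cluster, unrolling + pipelining

improvement = (peak_swp - peak_no_transform) / peak_no_transform
print(round(improvement * 100))   # the 47% gain from software pipelining
```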
The kernel performance data presented above demonstrates the effectiveness of intra-
cluster scaling at exploiting various forms of ILP. By applying software pipelining, near-
linear speedups were shown up to 5 ALUs across a broad range of media processing ker-
nels. Further scaling to around 10 ALUs provided good speedups on Noise and Convolve
and other kernels with the use of loop unrolling. Performance efficiency measurements
show that with the use of software pipelining, the optimal intracluster scaling occurs at be-
tween 3 and 5 ALUs per cluster depending on the kernel. Due to limitations in the VLIW
compiler, loop-carried state within kernels, intracluster wire delay for N > 12, and reduced
area efficiency for N > 5, intracluster scaling is not as effective beyond 5 ALUs per cluster
with the use of software pipelining. Although performance efficiencies improve slightly
for 4 to 10 ALUs with the use of loop unrolling, this is achieved by converting DLP into
ILP.
Intercluster Scaling
Intercluster scaling, on the other hand, is able to directly exploit DLP more efficiently.
Intercluster scaling directly exploits DLP by executing more iterations of kernel inner-loops
simultaneously in a SIMD fashion. However, just as wire delay and available ILP affected
sustained kernel inner-loop performance with intracluster scaling, for intercluster scaling,
we have to account for wire delay in the intercluster switch and available DLP. In this
section, we explore how intercluster switch delay affects kernel inner-loop performance
and performance efficiency. Limitations on DLP affect application stream lengths, but
not kernel inner-loop performance, so the effect of stream lengths on performance will be
explored later when discussing application performance in Section 7.3.3.
[Figure 7.6: IPC per square millimeter versus ALUs per cluster (N = 2 to 14) for 8-cluster processors, with true ILP only (top), software pipelining (middle), and unrolling plus software pipelining (bottom)]
[Figure 7.7: Speedup over an 8-cluster processor versus number of clusters (C = 8 to 128) for the Blocksad, Convolve, Update, FFT, Noise, and Irast kernels and their harmonic mean]
Holding N fixed at 5 ALUs per cluster, kernel speedups over an 8-cluster processor are
shown in Figure 7.7. This data shows the effectiveness of intercluster scaling with the most
efficient number of ALUs per cluster, although results would be similar for clusters with
different numbers of ALUs. In addition, software pipelining was used for all configura-
tions since this was shown to provide the best efficiencies. As shown in Figure 7.7, as C
increases, some kernels, such as Noise, are perfectly data-parallel and demonstrate perfect
speedup. Even kernels such as Irast, which rely heavily on conditional stream and inter-
cluster switch bandwidth, are able to hide intercluster switch latency by taking advantage
of available ILP within kernel inner-loops. Based on this kernel inner-loop performance,
intercluster scaling is able to achieve near-linear speedups when scaling to 128 clusters.
Performance efficiency results with intercluster scaling are shown in Table 7.2. The ta-
ble shows total IPC from all C clusters per square millimeter in a 45 nanometer technology.
Although peak performance efficiency is achieved with the C = 32 N = 5 configuration
at 5.31 IPC per square millimeter, performance per area is relatively unaffected by inter-
cluster scaling until around 64 clusters. Performance per unit area starts to degrade slightly
Table 7.2: Total IPC per square millimeter in a 45 nanometer technology with intercluster scaling

          Clusters (C)
  N      8     16     32     64    128
  2   5.23   5.16   5.18   5.05   5.00
  5   5.29   5.29   5.31   5.22   4.95
 10   4.39   4.47   4.20   4.08   3.81
 14   2.68   3.30   3.00   2.88   2.62
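The optimum and the degradation at 128 clusters discussed in the surrounding text can be checked directly; the values below are transcribed from Table 7.2:

```python
# IPC per square millimeter, transcribed from Table 7.2 (rows: N, columns: C).
ipc_per_mm2 = {
    2:  {8: 5.23, 16: 5.16, 32: 5.18, 64: 5.05, 128: 5.00},
    5:  {8: 5.29, 16: 5.29, 32: 5.31, 64: 5.22, 128: 4.95},
    10: {8: 4.39, 16: 4.47, 32: 4.20, 64: 4.08, 128: 3.81},
    14: {8: 2.68, 16: 3.30, 32: 3.00, 64: 2.88, 128: 2.62},
}

# Most area-efficient configuration.
best_n, best_c = max(((n, c) for n in ipc_per_mm2 for c in ipc_per_mm2[n]),
                     key=lambda nc: ipc_per_mm2[nc[0]][nc[1]])
print(best_n, best_c)   # 5 32

# Degradation of the 128-cluster, N = 5 machine relative to the best.
loss = 1 - ipc_per_mm2[5][128] / ipc_per_mm2[best_n][best_c]
print(round(loss * 100))   # 7
```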
when scaling to 128 clusters. However, with performance per area of 4.95, the 640-ALU
C = 128 N = 5 processor is only 7% worse than the best C = 32 N = 5 processor.
The kernel inner-loop performance data presented above shows that given a fixed area bud-
get, without the use of software pipelining, it would be most efficient to only scale to 2
ALUs per cluster and then to utilize intercluster scaling to provide additional performance.
However, even with only 2 ALUs per cluster, the use of software pipelining was shown to
provide a 36% improvement in inner-loop performance, meaning that this is a loop trans-
formation that provides significant performance gains for all cluster sizes. Once software
pipelining is used, intracluster scaling has near-linear speedups up to 10 ALUs on some
kernels, although it was most efficient at 5 ALUs per cluster for most kernels. Therefore, with
the use of software pipelining, it would be most efficient to use intracluster scaling up to
5 ALUs per cluster and to use intercluster scaling, which provides near-linear speedups up
to 128 clusters on kernel inner loops, to further improve performance. An optimal perfor-
mance efficiency was found with 32 clusters and 5 ALUs per cluster, although performance
efficiency degraded by only 7% when scaling to 128 clusters.
Whereas software pipelining was shown to be a beneficial loop transformation for all
cluster sizes, the case for loop unrolling is less convincing. Loop unrolling shows the most
benefit for intracluster scaling for N > 5 and has a negligible impact for N <= 5. Even
though loop unrolling increases ILP for these larger clusters, the increase in ILP does not
overcome degradations in area efficiency and IPC per square millimeter gradually degrades
from 5.33 to 4.50 when scaling from N = 5 to N = 10. However, loop unrolling and
intracluster scaling for the larger clusters exploit DLP that intercluster scaling could have
exploited with much better performance efficiency. This data suggests
that intracluster scaling beyond N = 5 would not be as efficient from a performance effi-
ciency standpoint as intercluster scaling to up to 128 clusters.
[Figure 7.8: IPC per cluster versus stream length (16 to 240 elements); panels include the Update, FFT, and Noise kernels]
since the loop prologues, bodies, and epilogues are all constrained by limits on true ILP and
inter-instruction dependencies within the kernels. Next, results with software pipelining for
three cluster sizes (N = 2, 5, and 10) are shown as SWP N2, SWP N5, and SWP N10. In
these cases, the kernel scheduler chose the number of software pipeline stages that would
minimize the inner loop length. In some of the kernels, up to 6 pipeline stages were used.
When software pipelining is used, IPC is much more dependent on stream length than in
the none N2 case (no loop transformations), due to the overhead of software pipelining: given stream length L, a
kernel inner loop with S software pipeline stages executes L/C + (S − 1) times because
S − 1 iterations are required to prime a software pipelined loop. Finally, two other results
are shown: SWP2 N5 and Unroll N10. SWP2 N5 limits S to a maximum of two stages,
meaning kernel inner-loop length is greater than SWP N5, leading to better IPC for short
streams but worse IPC for long streams. Unroll N10 shows results with the use of software
pipelining and loop unrolling for N = 10. It performs worse than SWP N10 for the same
size streams for most kernels, with the exception being Update. Although Unroll N10
executes half as many inner-loop iterations as SWP N10, each unrolled iteration is longer,
so the priming iterations of the software pipeline cost more cycles, leading to more
software-pipelining overhead.
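The iteration-count formula above can be sketched directly. The loop lengths used below (10 and 13 cycles) are illustrative assumptions, chosen only to show the short-stream versus long-stream tradeoff between light and heavy pipelining:

```python
import math

def swp_loop_executions(L: int, C: int, S: int) -> int:
    """Inner-loop executions of a software-pipelined kernel:
    L/C iterations of useful work plus S - 1 priming iterations."""
    return math.ceil(L / C) + (S - 1)

def kernel_cycles(L: int, C: int, S: int, loop_len: int) -> int:
    """Total inner-loop cycles: executions times the inner-loop length."""
    return swp_loop_executions(L, C, S) * loop_len

# Deeper pipelining (larger S) shortens the loop body but adds priming
# overhead, so it only pays off for long enough streams.
print(kernel_cycles(L=32, C=8, S=2, loop_len=13))   # light pipelining: 65
print(kernel_cycles(L=32, C=8, S=4, loop_len=10))   # heavy pipelining: 70
print(kernel_cycles(L=512, C=8, S=2, loop_len=13))  # light pipelining: 845
print(kernel_cycles(L=512, C=8, S=4, loop_len=10))  # heavy pipelining: 670
```

For the short stream the lightly pipelined loop wins; for the long stream the heavily pipelined loop wins, mirroring the SWP2 N5 versus SWP N5 behavior described above.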
Across the four kernels, for N = 2, the crossover point at which software pipelining
improves average kernel IPC is between 16 (Update) and 112 elements (Convolve). Al-
though not shown in the graph, with intracluster scaling, this crossover point shifts down:
for N = 5, software pipelining improves IPC across all four kernels for stream lengths of
32 elements or higher. This data suggests that our conclusion that N = 5 is the most
efficient cluster organization holds for reasonably sized streams: software pipelining
should be used for streams longer than 32 elements, and once it is used, N = 5 was shown
previously in Section 7.3.1 to be the most efficient cluster organization.
Further scaling to N = 10 and using loop unrolling to improve inner-loop IPC was shown
to be even less efficient when stream lengths are taken into account.
Although the results from Figure 7.8 are specific to a C = 8 processor, from this data,
the overall effect intercluster scaling has on kernel performance becomes clearer when
stream lengths are taken into account. During applications where dataset size limits stream
lengths, such as QRD, intercluster scaling with fixed dataset sizes would reduce stream
lengths. With streams significantly longer than the number of clusters, this would have
a negligible effect on performance, but for short streams, this could reduce the achieved
speedup. For example, in the Update2 kernel (a key part of QRD), as shown in Figure 7.7,
an inner-loop speedup of 1.8x was achieved when scaling from 8 to 16 clusters. However,
when accounting for short stream effects, the overall kernel speedup would vary from 1.30x
to 1.72x for streams of length 32 elements and 512 elements, respectively.
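A rough model of this short-stream erosion follows from the iteration count L/C + (S - 1). The stage count S = 4 and the per-iteration time model below are illustrative assumptions; the sketch reproduces the qualitative trend rather than the exact Update2 numbers:

```python
import math

def cluster_speedup(L: int, c1: int, c2: int, ideal: float, S: int) -> float:
    """Speedup from c1 to c2 clusters on a stream of L elements for a
    software-pipelined loop with S stages.  `ideal` is the asymptotic
    inner-loop speedup; the S - 1 priming iterations erode it for
    short streams."""
    iters = lambda c: math.ceil(L / c) + (S - 1)
    # Per-iteration time ratio, back-solved so that the speedup
    # approaches `ideal` as L grows without bound.
    t_ratio = (c2 / c1) / ideal
    return iters(c1) / (iters(c2) * t_ratio)

# Hypothetical S = 4: a 32-element stream sees much less of the 1.8x
# inner-loop speedup than a 512-element stream does.
short = cluster_speedup(32, 8, 16, ideal=1.8, S=4)
long_ = cluster_speedup(512, 8, 16, ideal=1.8, S=4)
```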
Although short streams could limit speedup on some applications with intercluster scal-
ing with fixed dataset sizes, it is important to note that not all applications incur this perfor-
mance degradation with intercluster scaling. RENDER, for example, has enough available
DLP such that its stream lengths are limited by the capacity of the SRF, not the application
dataset. Therefore, its stream lengths are able to scale with the number of clusters and
the percentage of runtime in inner loops remains high. Applications such as these or with
large datasets (and therefore long streams) will have overall speedups similar to inner-loop
performance speedups.
[Figure 7.9: Sustained application performance (GOPS, log scale) versus number of clusters; labeled RENDER points range from 15.4 to 311 GOPS]
leads to little application-level speedup or even slow-downs in some cases when increasing
N from 10 to 14.
With intercluster scaling, speedups vary considerably depending on the application.
With large numbers of clusters and relatively small dataset sizes, some applications suffer
from short stream effects. In addition to the short stream effects exhibited during kernel ex-
ecution described above in Section 7.3.1, as stream lengths decrease relative to C, memory
latency and host processor bandwidth also begin to affect performance.
The effect short streams have on application performance with intercluster scaling is
evident from the breakdown of execution cycles in RENDER, DEPTH, and QRD, shown
in Figure 7.10. Execution cycles are grouped into four categories: kernel inner-loop cycles,
kernel non-inner-loop cycles, cycles when kernels are stalled waiting for SRF streams, and
cycles when kernels are not active because of memory or host bottlenecks. RENDER is
very data-parallel and contains stream lengths limited only by the total number of triangles
in a scene. Since this number stays large compared to C, the ratio of kernel inner-loop
iterations to kernel invocations stays high, and over 80% of runtime is devoted to processing
from kernel inner loops. As a result, RENDER scales very well to large numbers of clusters.
DEPTH also contains abundant data parallelism; however, it does not scale quite as well
as RENDER due to short stream effects and SIMD overheads. When streams contain only
pixels from one row of the input image, SIMD overheads (extra instructions executed in the
kernel inner loops per cluster) are small, but short stream effects cause a large percentage
of run-time to be spent in non-inner-loop kernel cycles, as seen in Figure 7.10 when going
from 8 to 16 clusters. For 32 clusters, the application is restructured so that each stream
contains pixels from 16 rows at a time, rather than the single row which was used for 8 and
16 clusters. The result is that over 95% of runtime is spent in the inner loops. However,
the operation count per inner loop is higher when clusters operate on streams that contain
pixels from more than one row in SIMD. Consequently, there is a sub-linear speedup from 8
to 32 clusters on DEPTH. The advantage of multiple-row DEPTH becomes apparent when
scaling to 64 and 128 clusters: short stream effects are avoided and large amounts of data
parallelism can be exploited.
With QRD, the matrix block update kernels scale well and if the datasets grew with
C, QRD performance would scale similarly.
[Figure 7.10: Breakdown of execution cycles versus number of clusters (8 to 128) for RENDER, DEPTH, and QRD: kernel inner-loop cycles, SRF stalls, kernel non-inner-loop cycles, and clusters idle (memory or host stalls)]
However, with a fixed-size dataset, the larger
machines spend an increasing fraction of their runtime computing the orthogonal bases for
the decomposition, a step which scales poorly, therefore limiting speedup. In addition to
algorithmic inefficiencies leading to poor speedups, QRD with a fixed 256x256 dataset size
also suffers from short stream effects, as can be seen in Figure 7.10, with a large percentage
of cycles spent outside of kernel inner-loops.
Direct evidence for performance degradation due to short stream effects is apparent
when comparing FFT4K to FFT1K in Figure 7.9. Although FFT4K has lower performance
than FFT1K at C = 8 N = 5 because its large working set requires spilling from the SRF
to memory, at C = 128 N = 10, the difference in raw performance between FFT4K and
FFT1K is due purely to stream length. On a C = 128 N = 10 processor, FFT4K sustains
211 GFLOPS, while FFT1K, containing shorter streams, only sustains 103 GFLOPS.
In summary, the advantages of intracluster scaling to exploit ILP and provide optimal
efficiency at 5 ALUs per cluster were apparent from kernel inner-loop performance. Similar
speedups in the intracluster scaling dimension were also achieved with full applications.
With intercluster scaling, kernel inner-loop performance results showed the ability to take
advantage of DLP in kernels with near-linear speedups up to 128 clusters on many kernels.
These kernel performance speedup numbers suggest how media application performance
would scale if dataset size scaled with machine size. However, with fixed dataset sizes,
limited DLP in some applications leads to short streams. Nevertheless, even with these
datasets, a 1280-ALU C = 128 N = 10 processor is able to sustain an average of 200
GOPS over six applications, a speedup of 10.4x over a 40-ALU C = 8 N = 5 processor.
[Figure 7.11: LRF, SRF, and external memory bandwidth (GB/s, log scale) versus number of clusters for RENDER, DEPTH, and QRD]
This scalable data bandwidth hierarchy in the stream architecture enables GFLOPS of
arithmetic performance to be sustained with modest off-chip memory bandwidth requirements.
The scalability of the data bandwidth hierarchy is shown on a logarithmic scale in Fig-
ure 7.11. With eight arithmetic clusters, between 380 and 565 GB/s of LRF bandwidth,
between 11 and 42 GB/s of SRF bandwidth, and between 0.33 and 2.87 GB/s of memory
bandwidth are necessary in the RENDER, DEPTH, and QRD applications. As intercluster
scaling is used to scale to 128 clusters, the LRF bandwidth demands grow by over an order
of magnitude to between 5.9 and 7.0 TB/s. However, since the SRF bandwidth and capacity
are also increased with the number of ALUs, the bandwidth hierarchy is able to handle this
increased LRF bandwidth. During RENDER, memory bandwidth is used only for input
and output operations, so its memory and SRF bandwidth scale at the same rate as the
LRFs. However, for DEPTH and QRD, with increased SRF capacity, additional locality
can be captured in the SRF, avoiding memory spills required in the smaller processors.
Since LRF and SRF bandwidth and capacity are increased appropriately with intraclus-
ter and intercluster scaling, the data bandwidth hierarchy is able to scale effectively to hun-
dreds of ALUs per stream processor. This scalability allows for high sustained arithmetic
throughputs with hundreds of ALUs per stream processor with modest off-chip memory
systems. Furthermore, the scalability of the data bandwidth hierarchy means that over 95%
of data accesses in these future stream processors are made to the LRFs, not to the SRF or
external memory, leading to energy-efficient execution of these media applications.
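A back-of-the-envelope sketch shows why LRF demand grows so quickly with scaling. The 1 GHz clock, 4-byte words, and three LRF accesses per ALU operation (two reads, one write) are illustrative assumptions, not measured Imagine parameters:

```python
# Back-of-the-envelope LRF bandwidth demand model.
WORD_BYTES = 4            # assumed 32-bit words
FREQ_HZ = 1e9             # assumed 1 GHz clock
LRF_ACCESSES_PER_OP = 3   # assumed two reads and one write per ALU operation

def lrf_bandwidth(clusters: int, alus_per_cluster: int) -> float:
    """Aggregate LRF bandwidth in bytes per second at full utilization."""
    ops_per_sec = clusters * alus_per_cluster * FREQ_HZ
    return ops_per_sec * LRF_ACCESSES_PER_OP * WORD_BYTES

# Scaling from 8 to 128 clusters at N = 5 multiplies LRF demand by 16,
# pushing it from hundreds of GB/s into the terabytes-per-second range.
small = lrf_bandwidth(8, 5)     # 4.8e11 B/s
large = lrf_bandwidth(128, 5)   # 7.68e12 B/s
```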
[Figure 7.12: Intercluster Switch Locality with 8x8 Cluster Grid Floorplan. Histogram of the number of communications versus number of hops (0 to 14)]
which must be traveled between clusters. Note that in an 8x8 grid, the maximum number
of hops is 14. This histogram shows that some locality does exist: the most common
case is 4 hops, and only 6.5% of communications require more than 8 hops.
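The hop metric can be sketched as Manhattan distance on the 8x8 grid. Note that the histogram below assumes uniform all-to-all traffic, whereas the dissertation's Figure 7.12 reflects the kernels' actual communication patterns, so only the worst case of 14 hops carries over exactly:

```python
from collections import Counter

def hops(src: int, dst: int, width: int = 8) -> int:
    """Hop count between clusters on a width x width grid, assuming a
    communication travels along its row and then its column
    (Manhattan distance)."""
    r1, c1 = divmod(src, width)
    r2, c2 = divmod(dst, width)
    return abs(r1 - r2) + abs(c1 - c2)

# Histogram over all ordered cluster pairs under uniform traffic.
hist = Counter(hops(s, d) for s in range(64) for d in range(64) if s != d)
print(max(hist))   # worst case on an 8x8 grid: 14 hops
```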
Future research could explore intercluster switch topologies that could provide better
area and energy efficiency by exploiting the locality between clusters during communica-
tions. One example of a switch topology for an 8x8 cluster grid that exploits this locality
is shown in Figure 7.13. Similar to the full crossbar shown in Figure 6.1, in this figure,
each cluster broadcasts its output across its row bus and reads an input onto its column bus.
However, in this topology, row and column buses span only 2 clusters in each direction,
meaning communications are limited to four hops maximum. Permutations with maxi-
mum hop counts greater than 4 would require more than one communication to route the
data successfully, leading to more required switch bandwidth in some kernels. However, this
topology has an area-efficiency advantage over a full crossbar because it only requires five
buses per column and per row, whereas a full crossbar requires eight buses per column and
[Figure 7.13: Intercluster switch topology for an 8x8 cluster grid (clusters numbered 0 to 63), with row and column buses spanning two clusters in each direction]
per row. Future research could explore the tradeoff between performance and efficiency for
this and other switch topologies.
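The routing cost of the reduced topology can be sketched under the assumption that a single communication covers at most two row hops and two column hops:

```python
import math

def comms_needed(src: int, dst: int, reach: int = 2, width: int = 8) -> int:
    """Number of communications needed in the reduced topology, assuming
    one communication covers at most `reach` hops along the row and
    `reach` hops along the column."""
    r1, c1 = divmod(src, width)
    r2, c2 = divmod(dst, width)
    return max(1,
               math.ceil(abs(r1 - r2) / reach),
               math.ceil(abs(c1 - c2) / reach))

# Corner to corner (7 row hops, 7 column hops) needs 4 communications;
# nearest neighbors need only 1.
print(comms_needed(0, 63))   # 4
print(comms_needed(0, 1))    # 1

# Fraction of ordered pairs needing more than one communication, under
# a uniform all-to-all traffic assumption.
pairs = [(s, d) for s in range(64) for d in range(64) if s != d]
frac_multi = sum(comms_needed(s, d) > 1 for s, d in pairs) / len(pairs)
```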
Conclusions
In this dissertation, stream processors were shown to achieve performance rates and per-
formance efficiencies significantly higher than other programmable processors on media
applications, and approaching the efficiencies of fixed-function processors. The Imagine
architecture, the first VLSI implementation of a stream processor, is able to achieve high
performance and efficiencies by exploiting large amounts of parallelism, by exploiting lo-
cality to effectively manage data bandwidth, and by using an area- and energy-efficient
register organization.
Furthermore, the architectural concepts and performance measurements validated with
the Imagine stream processor can be extended to future technologies using the intercluster
and intracluster scaling techniques presented in this dissertation. The analysis presented
in this work demonstrates the scalability of stream processors to thousands of arithmetic
units in future VLSI technologies with comparable performance efficiencies to stream pro-
cessors containing tens of arithmetic units. These future stream processors will be capable
of sustaining hundreds of billions of arithmetic operations per second while maintaining
high performance efficiencies, enabling a large set of new and existing real-time media
applications in a wide variety of mobile, desktop, and server systems, while retaining the
advantages of programmability.
CHAPTER 8. CONCLUSIONS 138
Imagine System
Although this dissertation presents the design and implementation of the Imagine processor,
one important area of future work is in system design for the Imagine stream processor. The
experimental measurements described in Chapter 5 were obtained with a PCI board con-
taining two Imagine processors, a host processor, glue-logic FPGAs, and DRAM chips. At
the time this dissertation was submitted, this board was not able to run all available media
applications in real time due to host-processor and glue-logic limitations (not because of
host interface bandwidth limitations on the Imagine processor). Future work is underway to
redesign the board to eliminate these limitations, enabling real-time execution of a larger
range of media applications. Another ongoing area of research is multiprocessor systems
using the Imagine network interface. This research would lead to a number of key insights
into performance, software tools, and host processor requirements in these multi-Imagine
systems.
Applications
In this dissertation, the evaluation of stream processing was restricted to media processing
requirements with today’s workloads and datasets. However, as stream processors scale
to Teraops of performance with high performance efficiencies, such processors will pro-
vide the potential for running more complex media processing algorithms and much larger
datasets at real-time rates. For example, in computer graphics, polygon rendering with
higher scene complexity and screen resolutions would require larger datasets and more
performance to achieve real-time rates. Furthermore, raytracing and image-based rendering
have been emerging as alternative techniques for computer graphics that also require
more performance than polygon rendering. These trends suggest that as available
media processor performance increases, media applications and their datasets will evolve
to take advantage of this available performance while sustaining real-time rates. Therefore,
an important area of future work would be to explore how media processors can scale to
provide the performance required by future media applications, rather than by today's.
Another area of future work for applications of stream processors is in broadening the
application domains. Recent work has started to explore the effectiveness of stream pro-
cessing to both scientific workloads [Dally et al., 2001] and signal processing for wireless
communications [Rajagopal et al., 2002]. Such application domains share many of the
same characteristics as media applications but differ slightly, so enhancements and
optimizations to the stream processor architecture and microarchitecture could improve
performance and VLSI efficiency on these workloads.
Register Organization
The VLSI efficiency of stream processing stems from the stream register organization,
which allows for efficient exploitation of parallelism and locality in media applications.
One area of future work would be to explore enhancements to this register organization that could further
improve the VLSI efficiency of stream processors. Such enhancements were briefly dis-
cussed in Chapter 7 with studies into non-fully-interconnected intercluster switches. How-
ever, other possibilities for improvement include a non-fully-interconnected intracluster
switch and slightly modified local register file structures. For example, rather than using
two dual-ported register files per arithmetic unit, other structures such as one multi-ported
register file per arithmetic unit could be explored. Although such a structure would require
more interconnectivity in the read and write ports, it might be more efficient if this
interconnect could be overlapped with the memory cells and other transistors in the register
file. This and other refinements to the register organization could yield small additional
gains in the VLSI efficiency of stream processing.
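The trade-off above can be illustrated with the usual first-order VLSI model in which each register-file port adds a wordline and a bitline, so cell area grows roughly quadratically with port count. The sketch below is purely illustrative: the function name, register counts, port counts, and unit-area constants are all assumptions, not measurements from the Imagine design.

```python
def rf_area(regs, bits, ports, w0=1.0, h0=1.0, dw=0.5, dh=0.5):
    """First-order register-file area model: each port adds one wordline
    and one bitline, so cell width and height each grow linearly with
    port count, making cell area roughly quadratic in ports."""
    cell = (w0 + ports * dw) * (h0 + ports * dh)
    return regs * bits * cell

# Config A: two dual-ported (1R/1W) files per arithmetic unit, 16 registers each
a = 2 * rf_area(regs=16, bits=32, ports=2)
# Config B: one shared file with the same total capacity but 3 ports (2R/1W)
b = 1 * rf_area(regs=32, bits=32, ports=3)
print(a, b, b / a)  # under these constants, B is larger (ratio 1.5625)
```

Under this naive model the single multi-ported file loses, which is consistent with the text's caveat: the multi-ported design only wins if the extra port interconnect can be overlapped with the memory cells rather than paid for in dedicated area.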
Scalability
A final area of future work involves scalability to large numbers of arithmetic units per
chip in future technologies. In this dissertation, the two techniques of intracluster and
intercluster scaling were presented and evaluated. These two dimensions of scaling are
able to exploit instruction-level and data-level parallelism, respectively. However, a third
dimension of scaling is also possible: the task-level (or thread-level) dimension. This
scaling technique would involve either multiple kernel execution units (microcontroller
with some number of clusters) connected to a single stream register file or multiple stream
processor cores on a single chip. In these architectures, multiple kernels could be executing
simultaneously, thereby exploiting task-level parallelism. As software tools for exploiting
these scaling techniques mature, the performance and cost advantages of exploiting task-
level parallelism could be explored and compared to intracluster and intercluster scaling.
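The three scaling dimensions compose multiplicatively in total arithmetic units, which is why modest growth in each dimension reaches the thousands of units discussed earlier. The baseline numbers below follow Imagine (6 ALUs per cluster, 8 clusters, one kernel execution unit); the "future" configuration is a hypothetical example, not a proposed design.

```python
def total_alus(alus_per_cluster, clusters, cores):
    """Total arithmetic units from the three orthogonal scaling dimensions:
    intracluster (ILP) x intercluster (DLP) x task-level (TLP)."""
    return alus_per_cluster * clusters * cores

# Imagine baseline: 6 ALUs per cluster, 8 SIMD clusters, 1 kernel execution unit
base = total_alus(6, 8, 1)      # 48 arithmetic units
# Hypothetical future part scaling all three dimensions
future = total_alus(8, 16, 16)  # 2048 arithmetic units
print(base, future)
```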
In summary, the work presented in this thesis describes the first VLSI implementation
of a stream processor and describes scaling techniques that show the viability of stream
processing for many years to come. This is an area of research that is just beginning and
will lead to new and exciting results in the areas of real-time media processing and power-
efficient computer architecture.
Bibliography
[Agarwal et al., 2000] Vikas Agarwal, M.S. Hrishikesh, Stephen W. Keckler, and Doug
Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures.
In 27th Annual International Symposium on Computer Architecture, pages 248–259,
June 2000.
[Asanovic, 1998] Krste Asanovic. Vector Microprocessors. PhD thesis, University of Cal-
ifornia at Berkeley, 1998.
[Bove and Watlington, 1995] V. Michael Bove and John A. Watlington. Cheops: A re-
configurable data-flow system for video processing. IEEE Transactions on Circuits and
Systems for Video Technology, 3(2):140–149, April 1995.
[Brooks and Shearer, 2000] Thomas Brooks and Findlay Shearer. Communications core
meets 3G wireless handset challenges. Wireless Systems Design, pages 51–56, October
2000.
[Burd and Brodersen, 1995] Thomas D. Burd and Robert W. Brodersen. Energy-efficient
CMOS microprocessor design. In 28th Hawaii International Conference on System Sci-
ences, pages 288–297, January 1995.
[Caspi et al., 2001] Eylon Caspi, André DeHon, and John Wawrzynek. A streaming multi-
threaded model. In Proceedings of the Third Workshop on Media and Stream Processors,
pages 21–28, Austin, TX, Dec 2001.
[Chang, 1998] Andrew Chang. VLSI datapath choices: Cell-based versus full-custom.
Master’s thesis, MIT, 1998.
[Chen et al., 1997] Kai Chen, Chenming Hu, Peng Fang, Min Ren Lin, and Donald L.
Wollesen. Predicting CMOS speed with gate oxide and voltage scaling and interconnect
loading effects. IEEE Transactions on Electron Devices, 44(11):1951–1957, November
1997.
[Chen, 1999] David Chen. Apollo II adds power capabilities, speeds VDSM place and
route. Electronics Journal, page 25, July 1999.
[Chinnery and Keutzer, 2000] D. Chinnery and K. Keutzer. Closing the gap between ASIC
and custom: An ASIC perspective. In Proceedings of 37th Design Automation Confer-
ence, pages 637–641, June 2000.
[Chinnery and Keutzer, 2002] D. Chinnery and K. Keutzer. Closing the Gap Between
ASIC and Custom: Tools and Techniques for High-Performance ASIC Design. Kluwer
Academic Publishers, May 2002.
[Chowdhary et al., 1999] A. Chowdhary, S. Kale, P.K. Saripella, N.K. Sehgal, and R.K.
Gupta. Extraction of functional regularity in datapath circuits. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 18(9):1279–1296, Septem-
ber 1999.
[Clark et al., 2001] Lawrence T. Clark, Eric J. Hoffman, Jay Miller, Manish Biyani, Yuyun
Liao, Stephen Strazdus, Michael Morrow, Kimberley E. Velarde, and Mark A. Yarch. An
embedded 32-b microprocessor core for low-power and high-performance applications.
IEEE Journal of Solid-State Circuits, pages 1599–1608, November 2001.
[Cradle, 2003] Cradle Technologies white paper: The software scalable system on a chip
(3SOC) architecture, 2003.
[Dally and Chang, 2000] William J. Dally and Andrew Chang. The role of custom designs
in ASIC chips. In Proceedings of 37th Design Automation Conference, pages 643–647,
June 2000.
[Dally and Poulton, 1998] William J. Dally and John Poulton. Digital Systems Engineer-
ing, pages 12–22. Cambridge University Press, 1998.
[Dally et al., 2001] William J. Dally, Pat Hanrahan, and Ron Fedkiw. A streaming super-
computer. Stanford Computer Systems Laboratory White Paper, September 2001.
[Dally, 1992] William J. Dally. Virtual channel flow control. IEEE Transactions on Par-
allel and Distributed Systems, pages 194–205, March 1992.
[Davis et al., 2001] W. Rhett Davis, Ning Zhang, Kevin Camera, Fred Chen, Dejan
Markovic, Nathan Chan, Borivoje Nikolic, and Robert W. Brodersen. A design envi-
ronment for high throughput, low power dedicated signal processing systems. In IEEE
2001 Conference on Custom Integrated Circuits, pages 545–548, May 2001.
[de Kock et al., 2000] E.A. de Kock, G. Essink, W.J.M. Smits, R. van der Wolf, J.-Y.
Brunel, W.M. Kruijtzer, P. Lieverse, and K.A. Vissers. YAPI: Application modeling
for signal processing systems. In Proceedings of the 37th Design Automation Conference,
June 2000.
[Gieseke et al., 1997] Bruce A. Gieseke, Randy L. Allmon, Daniel W. Bailey, Bradley J.
Benschneider, Sharon M. Britton, John D. Clouser, Harry R. Fair III, James A. Far-
rell, Michael K. Gowan, Christopher L. Houghton, James B. Keller, Thomas H. Lee,
Daniel L. Leibholz, Susan C. Lowell, Mark D. Matson, Richard J. Matthew, Victor
Peng, Michael D. Quinn, Donald A. Priore, Michael J. Smith, and Kathryn E. Wilcox.
A 600 MHz superscalar RISC microprocessor with out-of-order execution. In 1997 In-
ternational Solid-State Circuits Conference Digest of Technical Papers, pages 176–177,
1997.
[Gokhale et al., 2000] Maya Gokhale, Jan Stone, Jeff Arnold, and Mirek Kalinowski.
Stream-oriented FPGA computing in the Streams-C high level language. In 2000 IEEE
Symposium on Field-Programmable Custom Computing Machines, pages 49–56, 2000.
[Goldovsky et al., 2000] Alexander Goldovsky, Bimal Patel, Michael Schulte, Ravi Ko-
lagotla, Hosahalli Srinivas, and Geoffrey Burns. Design and implementation of a 16
by 16 low-power two’s complement multiplier. In IEEE International Symposium on
Circuits and Systems, volume 5, pages 345–348, May 2000.
[Gordon et al., 2002] Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin,
Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David
Maze, and Saman Amarasinghe. A stream compiler for communication-exposed archi-
tectures. In Proceedings of the Tenth International Conference on Architectural Support
for Programming Languages and Operating Systems, pages 82–92, October 2002.
[Ho et al., 2001] Ron Ho, Ken Mai, and Mark Horowitz. The future of wires. Proceedings
of the IEEE, April 2001.
[Horowitz et al., 1994] Mark Horowitz, Thomas Indermaur, and Ricardo Gonzalez. Low-
power digital design. In Symposium on Low Power Electronics, pages 102–103, October
1994.
[Huang and Ercegovac, 2002] Zhijun Huang and Milos D. Ercegovac. Two dimensional
signal gating for low-power array multiplier design. In IEEE International Symposium
on Circuits and Systems, volume 1, pages I-489–I-492, 2002.
[Intel, 2002] Intel Corp. Intel Pentium 4 Processor with 512-KB L2 Cache on 0.13 Micron
Process at 2 GHz - 3.06 GHz, with Support for Hyper-Threading Technology at 3.06
GHz, document number 298643-005, November 2002.
[Janssen and Corporaal, 1995] Johan Janssen and Henk Corporaal. Partitioned register
files for TTAs. In Proceedings of the 28th Annual IEEE/ACM International Symposium
on Microarchitecture, pages 303–312, November 1995.
[Kanade et al., 1996] Takeo Kanade, Atsushi Yoshida, Kazuo Oda, Hiroshi Kano, and
Masaya Tanaka. A stereo machine for video-rate dense depth mapping and its new
applications. In Proceedings of the 15th Computer Vision and Pattern Recognition Con-
ference, pages 196–202, San Francisco, CA, June 18–20, 1996.
[Kapadia et al., 1995] Hema Kapadia, Katayoun Falakshahi, and Mark Horowitz. Array-
of-arrays architecture for floating point multiplication. In Advanced Research in VLSI,
pages 150–157, March 1995.
[Kapasi et al., 2000] Ujval J. Kapasi, William J. Dally, Scott Rixner, Peter R. Mattson,
John D. Owens, and Brucek Khailany. Efficient conditional operations for data-parallel
architectures. In Proceedings of the 33rd Annual IEEE/ACM International Symposium
on Microarchitecture, pages 159–170, December 2000.
[Kapasi et al., 2001] Ujval J. Kapasi, Peter Mattson, William J. Dally, John D. Owens, and
Brian Towles. Stream scheduling. In Proceedings of the Third Workshop on Media and
Stream Processors, pages 101–106, Austin, TX, Dec 2001.
[Khailany et al., 2001] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi,
Peter Mattson, Jin Namkoong, John D. Owens, Brian Towles, and Andrew Chang. Imag-
ine: Media processing with streams. IEEE Micro, pages 35–46, Mar/Apr 2001.
[Khailany et al., 2002] Brucek Khailany, William J. Dally, Andrew Chang, Ujval J. Ka-
pasi, Jinyung Namkoong, and Brian Towles. VLSI design and verification of the Imagine
processor. In Proceedings of the IEEE International Conference on Computer Design,
pages 289–296, September 2002.
[Khailany et al., 2003] Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi,
John D. Owens, and Brian Towles. Exploring the VLSI scalability of stream processors.
In Proceedings of the Ninth International Symposium on High Performance Computer
Architecture, pages 153–164, February 2003.
[KleinOsowski et al., 2000] AJ KleinOsowski, John Flynn, Nancy Meares, and David J.
Lilja. Adapting the SPEC 2000 benchmark suite for simulation-based computer archi-
tecture research. In Workshop on Workload Characterization, International Conference
on Computer Design (ICCD), September 2000.
[Kohn and Fu, 1989] Leslie Kohn and Sai-Wai Fu. A 1,000,000 transistor microprocessor.
In 1989 International Solid-State Circuits Conference Digest of Technical Papers, pages
54–55, 290, 1989.
[Kozyrakis and Patterson, 2002] Christos Kozyrakis and David Patterson. Vector vs. su-
perscalar and VLIW architectures for embedded multimedia benchmarks. In Proceed-
ings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture,
pages 283–293, November 2002.
[Kozyrakis and Patterson, 2003] Christos Kozyrakis and David Patterson. Overcoming the
limitations of conventional vector processors. In 30th Annual International Symposium
on Computer Architecture, June 2003.
[Kutzschebauch and Stok, 2000] T. Kutzschebauch and L. Stok. Regularity driven logic
synthesis. In Proceedings of the International Conference on Computer Aided Design,
pages 439–446, November 2000.
[Lee and Parks, 1995] Edward A. Lee and Thomas M. Parks. Dataflow process networks.
Proceedings of the IEEE, 83(5), May 1995.
[Lee and Stoodley, 1998] Corinna G. Lee and Mark G. Stoodley. Simple vector micro-
processors for multimedia applications. In Proceedings of the 31st Annual IEEE/ACM
International Symposium on Microarchitecture, pages 25–36, December 1998.
[Lee, 1996] Ruby B. Lee. Subword parallelism with MAX-2. IEEE Micro, pages 51–59,
August 1996.
[Mai et al., 2000] Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, and
Mark Horowitz. Smart memories: A modular reconfigurable architecture. In 27th An-
nual International Symposium on Computer Architecture, pages 161–171, June 2000.
[Malachowsky, 2002] Chris Malachowsky. When 10M gates just isn’t enough....the GPU
challenge. In Proceedings of 39th Design Automation Conference, June 2002.
[Mattson et al., 2000] Peter Mattson, William J. Dally, Scott Rixner, Ujval J. Kapasi, and
John D. Owens. Communication scheduling. In Proceedings of the International Con-
ference on Architectural Support for Programming Languages and Operating Systems,
pages 82–92, November 2000.
[Mattson, 2001] Peter Mattson. A Programming System for the Imagine Media Processor.
PhD thesis, Stanford University, 2001.
[Montrym and Moreton, 2002] John Montrym and Henry Moreton. Nvidia GeForce4. In
Hotchips 14, August 2002.
[Nagamatsu et al., 1990] Masato Nagamatsu, Shigeru Tanaka, Junji Mori, Katsusi Hirano,
Tatsuo Noguchi, and Kazuhisa Hatanaka. A 15-ns 32x32-b CMOS multiplier with an
improved parallel structure. IEEE Journal of Solid-State Circuits, pages 494–497,
April 1990.
[Nickolls et al., 2002] John Nickolls, L. J. Madar III, Scott Johnson, Viresh Rustagi, Ken
Unger, and Mustafiz Choudhury. Broadcom Calisto: A multi-channel multi-service
communications platform. In Hotchips 14, August 2002.
[Nijssen and van Eijk, 1997] Raymond X.T. Nijssen and C.A.J. van Eijk. Regular layout
generation of logically optimized datapaths. In Proceedings of the International Sym-
posium on Physical Design, pages 42–47, 1997.
[Ohashi et al., 2002] Masahiro Ohashi, T. Hashimoto, S.I. Kuromaru, M. Matsuo, T. Mori-
iwa, M. Hamada, Y. Sugisawa, M. Arita, H. Tomita, M. Hoshino, H. Miyajima, T. Naka-
mura, K.I. Ishida, T. Kimura, Y. Kohashi, T. Kondo, A. Inoue, H. Fujimoto, K. Watada,
T. Fukunaga, T. Nishi, H. Ito, and J. Michiyama. A 27 MHz 11.1 mW MPEG-4 video
decoder LSI for mobile application. In 2002 IEEE International Solid-State Circuits
Conference Digest of Technical Papers, pages 366–367, February 2002.
[Olofsson and Lange, 2002] Andreas Olofsson and Fredy Lange. A 4.32GOPS 1W
general-purpose DSP with an enhanced instruction set for wireless communication. In
2002 International Solid-State Circuits Conference Digest of Technical Papers, pages
54–55, 443, 2002.
[Owens et al., 2002] John D. Owens, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Ben
Serebrin, and William J. Dally. Media processing applications on the Imagine stream
processor. In Proceedings of the IEEE International Conference on Computer Design,
pages 295–302, September 2002.
[Owens, 2002] John Owens. Computer Graphics on a Stream Architecture. PhD thesis,
Stanford University, 2002.
[Peleg and Weiser, 1996] Alex Peleg and Uri Weiser. MMX technology extension to the
Intel architecture. IEEE Micro, pages 42–50, August 1996.
[Phillip, 1998] Mike Phillip. Altivec: A second generation SIMD microprocessor archi-
tecture. In Hotchips 10, August 1998.
[Rajagopal et al., 2002] Sridhar Rajagopal, Scott Rixner, and Joseph R. Cavallaro. A pro-
grammable baseband processor design for software defined radios. In 45th IEEE In-
ternational Midwest Symposium on Circuits and Systems, volume 3, pages 413–416,
August 2002.
[Rambus, 2001] Rambus. 512/576 Mb 1066 MHz RDRAM Datasheet, DL-0117-030 ver-
sion 0.3, 3.6MB, 11/01 edition, 2001.
[Rathnam and Slavenburg, 1996] Selliah Rathnam and Gerrit A. Slavenburg. An architec-
tural overview of the programmable media processor, TM-1. In Proceedings of COMP-
CON, pages 319–326, February 1996.
[Rau et al., 1982] B. Ramakrishna Rau, Christopher D. Glaeser, and Raymond L. Picard.
Efficient code generation for horizontal architectures: Compiler techniques and architec-
tural support. In Proceedings of the International Symposium on Computer Architecture,
pages 131–139, April 1982.
[Rau et al., 1989] B. Ramakrishna Rau, David W. L. Yen, Wei Yen, and Ross A. Towle.
The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-
offs. Computer, pages 12–35, January 1989.
[Rixner et al., 1998] Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany,
Abelardo Lopez-Lagunas, Peter Mattson, and John D. Owens. A bandwidth-efficient
architecture for media processing. In Proceedings of the 31st Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, pages 3–13, November 1998.
[Rixner et al., 2000a] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter R. Mattson,
and John D. Owens. Memory access scheduling. In 27th Annual International Sympo-
sium on Computer Architecture, pages 128–138, June 2000.
[Rixner et al., 2000b] Scott Rixner, William J. Dally, Brucek Khailany, Peter Mattson, Uj-
val J. Kapasi, and John D. Owens. Register organization for media processing. In
Proceedings of the Sixth International Symposium on High Performance Computer Ar-
chitecture, pages 375–387, January 2000.
[Rixner, 2001] Scott Rixner. Stream Processor Architecture. Kluwer Academic Publish-
ers, Boston, MA, 2001.
[Russell, 1978] Richard M. Russell. The Cray-1 computer system. Communications of the
ACM, 21(1):63–72, January 1978.
[Sager et al., 2001] David Sager, Glenn Hinton, Michael Upton, Terry Chappell,
Thomas D. Fletcher, Samie Samaan, and Robert Murray. A 0.18 µm CMOS IA32
microprocessor with a 4GHz integer execution unit. In 2001 International Solid-State
Circuits Conference Digest of Technical Papers, pages 324–325, 2001.
[Santoro et al., 1989] Mark R. Santoro, Gary Bewick, and Mark Horowitz. Rounding al-
gorithms for IEEE multipliers. In Proceedings of the 9th Symposium on Computer Arith-
metic, pages 176–183, September 1989.
[Sibyte, 2000] Sibyte. SB-1250 Product Data Sheet, rev 0.2 edition, October 2000.
[Synopsys, 2000a] Synopsys. Design Compiler User Guide, 2000.11 edition, 2000.
[Synopsys, 2000b] Synopsys. Physical Compiler User Guide, 2000.11 edition, 2000.
[Thakkar and Huff, 1999] Shreekant Thakkar and Tom Huff. The internet streaming SIMD
extensions. Intel Technology Journal, Q2, 1999.
[Tremblay et al., 1996] Marc Tremblay, J. Michael O’Connor, Venkatesh Narayanan, and
Liang He. VIS speeds new media processing. IEEE Micro, pages 10–20, August 1996.
[Waingold et al., 1997] Elliot Waingold, Michael Taylor, Devabhaktuni Srikrishna, Vivek
Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Rajeev Barua,
Jonathan Babb, Saman Amarasinghe, and Anant Agarwal. Baring it all to software:
Raw machines. Computer, pages 86–93, September 1997.
[Wawrzynek et al., 1996] John Wawrzynek, Krste Asanovic, Brian Kingsbury, David
Johnson, James Beck, and David Morgan. Spert II: A vector microprocessor system.
Computer, pages 79–86, March 1996.
[Weste and Eshraghian, 1993] Neil H. E. Weste and Kamran Eshraghian. Principles of
CMOS VLSI Design: A Systems Perspective, Second Edition, pages 560–563. Addison-
Wesley Publishing Company, 1993.