A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
Abstract
Today's feature-rich multimedia products require embedded system solutions with complex
Systems-on-Chip (SoC) to meet market expectations of high performance at low cost and
low energy consumption. SoCs are complex designs with multiple embedded processors,
memory subsystems, and application-specific peripherals. The memory architecture of
embedded SoCs strongly influences the area, power, and performance of the entire system.
Further, the memory subsystem constitutes a major part (typically up to 70%) of the
silicon area of a current-day SoC.
The on-chip memory organization of embedded processors varies widely from one
SoC to another, depending on the application and market segment for which the SoC is
deployed. Embedded designers have a wide variety of choices, ranging from simple
on-chip SPRAM-based architectures to more complex hybrid cache-SPRAM architectures.
The performance of a memory architecture also depends on how the data variables of the
application are placed in memory. For each memory architecture there are multiple data
layouts that are efficient from a power and performance viewpoint. Further, the designer
is interested in multiple optimal design points to address various market segments. Hence
memory architecture exploration for an embedded system involves evaluating a large
design space, on the order of 100,000 design points, with each design point having several
tens of thousands of data layouts. Due to its large impact on system performance
parameters, the memory architecture is often hand-crafted by experienced designers who
explore only a very small subset of this design space. The vast memory design space
prohibits any possibility of manual analysis.
In this work, we propose an automated framework for on-chip memory architecture
Acknowledgments
There are many people I would like to thank who have helped me in various ways.
First and foremost, I would like to thank my supervisors, Prof. R. Govindarajan and
Dr. C.P. Ravikumar, who have guided and supported me in various ways through the
entire journey of completing my thesis work. I profusely thank them for the encouragement
they provided and for their perseverance in keeping me focused on the Ph.D. work.
I would like to express my gratitude to Texas Instruments for giving me the time
and opportunity to pursue my studies. I would like to thank my colleagues at Texas
Instruments for their support and reviews, in particular my manager Balaji Holur.
I would also like to thank my previous managers Pamela Kumar and Manohar Sambandam.
Last but not least, I would like to thank my dearest family members for the
encouragement they provided and the sacrifices they made to help me achieve my goals.
Contents

Abstract
Acknowledgments

1 Introduction
   1.2 Memory Subsystem
   1.3 Data Layout
   1.6 Contributions
   1.7 Thesis Overview
2 Background
   Software Optimizations
3.1 Introduction
   3.2.1 Method Overview
   3.2.2 Problem Statement
3.3 ILP Formulation
   3.3.1 Basic Formulation
   3.3.5 Swapping of Data
3.5 Heuristic Algorithm
   3.6.1 Experimental Methodology
3.7 Related Work
3.8 Conclusions
4.1 Introduction
4.2 Method Overview
4.5 Experimental Results
   4.5.1 Experimental Methodology
   4.5.2 Experimental Results
4.6 Related Work
4.7 Conclusions
5.1 Introduction
5.2 Problem Definition
   5.3.1 Method Overview
5.6 Conclusions
6.1 Introduction
6.6 Conclusions
7.1 Introduction
7.2 Solution Overview
   7.4.1 Overview
7.7 Conclusions
8 Conclusions
Bibliography
List of Tables

3.3 Experimental Results
List of Figures

7.10 AAC: Power consumed for different hybrid memory architectures
7.11 MPEG: Power consumed for different hybrid memory architectures
7.12 JPEG: Power consumed for different hybrid memory architectures
7.13 AAC: Non-dominated Solutions
7.14 MPEG: Non-dominated Solutions
7.15 JPEG: Non-dominated Solutions
Chapter 1
Introduction
1.1
Today's VLSI technology allows us to integrate tens of processor cores on the same chip
along with embedded memories, application-specific circuits, and interconnect infrastructure. As a result, it is possible to integrate an entire system onto a single chip. The single-chip
phone, which has been introduced by several semiconductor vendors, is an example of
such a system-on-chip; it includes the modem, radio transceiver, power management functionality, a multimedia engine, and security features, all on the same chip. An embedded
system is an application-specific system which is optimized to perform a single function
or a small set of functions [70]. We distinguish this from a general-purpose system, which
is software-programmable to perform multiple functions. A personal computer is an example of a general-purpose system; depending on the software we run on the computer,
it can be useful for playing games, word processing, database operations, scientific computation, etc. On the other hand, a digital camera is an example of an embedded system,
which can perform a limited set of functions such as taking pictures, organizing them, or
transferring them to another device through a suitable I/O interface. Other examples of
embedded systems include mobile phones, audio/video players, video game consoles, set-top boxes, car infotainment systems, personal digital assistants, telephone central-office
switches, dedicated network routers and bridges. Note that a large number of embedded
systems are built for the consumer market. As a result, in order to be competitive, the
cost of an embedded system cannot be very high. Yet consumers demand higher performance and more features from embedded system products. It is easy to appreciate
this point if we compare the performance and feature set offered by mobile phones that
cost Rs 5000 (or $100) today with those that cost the same a few years ago. We also see that
a large number of embedded systems are being built for the mobile market. This trend
is not surprising: the number of mobile phone subscribers increased from 500 million in the
year 2000 to 2.6 billion in 2007 [7]. Because of such high volumes, embedded systems are
extremely cost sensitive and their design demands careful silicon-area optimization. Since
mobile devices use batteries as the main source of power, embedded systems must also be
optimized for energy dissipation. Power, which represents the rate at which energy is consumed, must also be kept low to avoid heating and to improve reliability. In summary, the
designer of an embedded system must simultaneously consider and optimize price, performance, energy, and power dissipation. Application-specific embedded systems designed
today demand innovative methods to optimize these system cost functions [11, 19].
Many of today's embedded systems are based on system-on-chip platforms [16], which,
in turn, consist of one or more embedded microcontrollers, digital signal processors (DSP),
application specific circuits and read-only memory, all integrated into a single package.
These blocks are available from vendors of intellectual property (IP) as hard cores or soft
cores [42, 28]. A hard core, or hard IP block, is one where the circuit is available at a
lower level of abstraction such as the layout-level [42, 28]; it is impossible to customize a
hard IP to suit the requirements of the embedded system. As a result, there are limited
opportunities in optimizing the cost functions by modifying the hard IP. For example, if
some functionality included in the IP is not required in the present application, we cannot
remove the function to save area. Soft IP refers to circuits which are available at a higher
level of abstraction, such as register-transfer level [28, 42]. It is possible to customize the
soft IP for the specific application. The designer of an embedded SoC integrates the IP
cores for processors, memories, and application-specific hardware to create the SoC.
Figure 1.1 illustrates the architecture of an embedded system-on-chip (SoC). As can
be seen in the figure, there are four principal components in such an SoC.
1. An Analog Front End, which includes the analog/digital and digital/analog converters.
2. Programmable Components, which include microprocessors, microcontrollers, and
DSPs. The number of embedded processors is increasing every year: of the nine billion
processors manufactured in 2005, less than 2% were used for general-purpose computers;
the other 8.8 billion went into embedded systems [13]. The microcontroller/microprocessor
is useful for handling interrupts, house-keeping, and timing-related functions. The DSP
is useful for processing audio and video information, e.g., compression and decompression
of audio and video streams. The application software is normally preloaded in memory
and is not user-programmable, unlike general-purpose processor-based systems.
3. Application-specific components, which include hardware accelerators for compute-intensive functions. Examples of hardware accelerators include digital image processors, which are useful in cameras.
1.2 Memory Subsystem

1.2.1 On-chip Memory Organization
The memory architecture of an embedded processor core is complex and is custom designed to improve run-time performance and power consumption. In this section we
describe only the memory architecture of the DSP, as this is the focus of the thesis.
The memory architecture of the DSP is more complex than that of a microcontroller
(MCU) for the following reasons: (a) DSP applications are more data-dominated than
the control-dominated software executed on an MCU; memory bandwidth requirements
for DSP applications range from 2 to 3 memory accesses per
processor clock cycle. For an MCU, this figure is, at best, one memory access per cycle.
(b) It is critical in DSP applications to extract maximum performance from the memory
subsystem in order to meet the real-time constraints of the embedded application. As a
consequence, DSP software for critical kernels is mostly developed as hand-optimized
assembly code. In contrast, software for an MCU is typically developed in high-level
languages. The memory architecture of a DSP is also unique in that the DSP has multiple
on-chip buses and multiple address generation units to service higher bandwidth needs. The
on-chip memory of embedded processors can include (a) only Level-1 cache (L1-cache)
(e.g., [1]), (b) only scratch-pad RAM (SPRAM) (e.g., [75, 76]), or (c) a combination of
L1-cache and SPRAM (e.g., [2, 77]).
1.2.2
1.2.3
On-chip memory organization based only on scratch-pad memory ensures single-cycle
access times and guarantees worst-case execution times for data that resides in scratch-pad
RAM (SPRAM). However, it is the responsibility of the programmer to identify the data
sections that should be placed in SPRAM, or to place code in the program to appropriately
move data from off-chip memory to SPRAM. A DSP core can include the following types of
memories: static RAM (SRAM), ROM, and/or dynamic RAM (DRAM). The scratch-pad
memory in the DSP core is organized into multiple memory banks to facilitate multiple
simultaneous data accesses. A memory bank can be organized as a single-access RAM
(SARAM) or a dual-access RAM (DARAM) to provide single or dual access to the memory
bank in a single cycle. Also, the on-chip memory banks can be of different sizes; smaller
memory banks consume less power per access than larger ones. The embedded
system may also be interfaced to off-chip memory, which can include SRAM and DRAM.
A purely SPRAM-based on-chip organization is suitable only for low- to medium-complexity
embedded applications. SPRAM-based systems do not use the on-chip RAM efficiently,
since they require the data sections that are currently accessed to be placed exclusively
in the SPRAM. It is possible to accommodate different data sections in the SPRAM at
different points in execution time by moving data dynamically between off-chip memory
and SPRAM, but this incurs run-time overhead and increases code size.
For medium to large applications, which have a large number of critical data variables, a
large amount of on-chip RAM becomes necessary to meet the real-time performance
constraints. Hence, for such applications, pure SPRAM architectures are not preferred.
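As a rough illustration of the placement decision the programmer faces, the sketch below greedily reserves SPRAM for the sections with the highest access density. All section names, sizes, and access counts are invented for this example; a real flow would obtain them from a profiler or simulator, and a real placement would also weigh bank conflicts and lifetimes.

```python
# Hypothetical sketch: greedily reserve SPRAM for the data sections
# with the highest access density (accesses per byte), subject to the
# SPRAM capacity. Profile numbers are invented for illustration.

def partition_for_spram(sections, spram_bytes):
    ranked = sorted(sections, key=lambda s: s["accesses"] / s["size"],
                    reverse=True)
    spram, offchip, used = [], [], 0
    for s in ranked:
        if used + s["size"] <= spram_bytes:
            spram.append(s["name"])
            used += s["size"]
        else:
            offchip.append(s["name"])
    return spram, offchip

sections = [
    {"name": "twiddle",   "size": 2048, "accesses": 900_000},
    {"name": "frame_buf", "size": 8192, "accesses": 120_000},
    {"name": "coeffs",    "size": 1024, "accesses": 600_000},
    {"name": "log_buf",   "size": 4096, "accesses": 2_000},
]
spram, offchip = partition_for_spram(sections, spram_bytes=4096)
print(spram)    # most frequently accessed sections that fit on-chip
print(offchip)  # everything else stays in off-chip memory
```

Greedy density ranking is only a heuristic; it ignores the overlay and swapping opportunities discussed later, which is precisely why a more systematic formulation is needed.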
1.3 Data Layout
To use the on-chip memory efficiently, the critical data variables of the application need to
be identified and mapped to the on-chip RAM. The memory architecture may contain both
on-chip cache and SPRAM. In such a case, it is important to partition the data sections and
assign them appropriately to the on-chip cache and SPRAM such that the memory performance
of the application is optimized. Further, among the data sections assigned to on-chip
cache and SPRAM, a proper placement of the data sections is required to ensure that
cache misses are reduced and that the multiple memory banks of the SPRAM and the
dual-ported SPRAMs are efficiently utilized. Identifying such a placement of data sections,
referred to as the data layout problem, is a complex and critical step [10, 53]. This task
is typically performed manually, as the compiler cannot assume that the code under
compilation represents the entire system [10].
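To make the data layout problem concrete, the following sketch scores a candidate layout with a simplified stall model: a parallel conflict between two sections costs cycles unless they sit in different banks or in a dual-access (DARAM) bank, and a self conflict costs cycles unless the section is in a DARAM bank. The conflict pairs, weights, and bank names are hypothetical.

```python
# Simplified, hypothetical stall model for a candidate data layout.
# `daram` is the set of dual-access (DARAM) banks.

def layout_cost(layout, parallel_conflicts, self_conflicts, daram):
    cost = 0
    for a, b, weight in parallel_conflicts:
        # Two sections accessed in the same cycle stall unless they sit
        # in different banks or share a dual-access bank.
        if layout[a] == layout[b] and layout[a] not in daram:
            cost += weight
    for a, weight in self_conflicts:
        # Two same-cycle accesses to one section need a DARAM bank.
        if layout[a] not in daram:
            cost += weight
    return cost

parallel = [("x", "y", 500), ("x", "h", 300)]   # weights = conflict counts
self_c = [("h", 400)]
daram = {"B0"}

bad = {"x": "B1", "y": "B1", "h": "B1"}         # everything in one SARAM bank
good = {"x": "B1", "y": "B2", "h": "B0"}        # conflicts separated
print(layout_cost(bad, parallel, self_c, daram))    # -> 1200
print(layout_cost(good, parallel, self_c, daram))   # -> 0
```

Data layout exploration is then the search for an assignment that minimizes such a cost subject to bank capacities; a realistic model would also account for latencies and per-access power.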
The application program in a modern embedded system is complex since it must
support a variety of device interfaces such as networking interfaces, credit card readers,
USB interfaces, parallel ports, and so on. The application also has many multimedia
components like MP3, AAC and MIDI [8]. This necessitates an IP reuse methodology
[74], where software modules developed and optimized independently by different vendors
are integrated. Figure 1.2 illustrates the typical flow in embedded application development. This integration is a very challenging job with multiple objectives: (a) it has to
be done under tight time-to-market constraints, (b) it has to be repeated
for different variants of SoCs with different custom memory architectures, and (c) it has
to be performed in such a way that the embedded application is optimized for performance.
Since the IPs/modules are independently optimized, the integrator is under pressure
to deliver the complete product with the expectation that each component performs at the
same level as it did in isolation. This is a major challenge. When a module is optimized
independently, the developer has all the resources of the SoC (MIPS and Memory) to
optimize the module. When these modules are integrated at the system-level, the system
resources are shared among the modules. So the application integrator needs to know
the MIPS and memory requirements of the modules unambiguously to be able to allocate
the shared resources to critical needs [74]. Usually, the modules' memory requirements
are given only at a high level. To be able to optimize the whole application/system, the
integrator will need detailed memory analysis at the module level, e.g., which data buffers
need to be placed in dual-ported memories and which data buffers should not be placed
in the same memory bank; this data is usually not available. Moreover, the critical code
is usually written in low-level assembly language to meet real-time constraints and/or
due to legacy reasons. For the reasons mentioned above, application integration and
optimization, that is, analyzing the application and mapping software modules in order to
obtain optimal cost and performance, takes a significant amount of time (approximately 1-2
person-months). Currently, in most SoC designs, data layout is also performed manually,
and this has two major problems: (1) the development time is significant, which is not
acceptable for current-day time-to-market requirements, and (2) the quality of the
solution varies with the designer's expertise.
1.4
In modern embedded systems, the area and power consumed by the memory subsystem is
up to 10 times that of the data path, making memory a critical component of the design
[11]. Further, the memory subsystem constitutes a large part (typically up to 70%) of
the silicon area of a current-day SoC, and this is expected to grow to 94% by 2014, as
shown in Figure 1.3 [6]. The main reason for this is that embedded memory has
a relatively small per-area design cost in terms of man-power, time-to-market,
and power consumption [60]. Hence memory plays an important role in the
design of embedded SoCs. Further, the memory architecture strongly influences the cost,
performance, and power dissipation of an embedded SoC.
As discussed earlier, the on-chip memory organization of embedded processors varies
widely from one SoC to another, depending on the application and market segment for
which the SoC is deployed. Embedded designers have a wide variety of choices, ranging
from simple on-chip SPRAM-based architectures to more complex hybrid cache-SPRAM
architectures. To begin with, the system designer needs to decide whether the SoC
requires a cache and what the right size of on-chip RAM is. Once the high-level memory
organization is decided, the finer parameters need to be defined to complete the memory
architecture definition. For an on-chip SPRAM-based architecture, the parameters,
namely size, latency, number of memory banks, number of read/write ports per memory
bank, and connectivity, collectively define the memory organization and strongly influence
the performance, cost, and power consumption. For a cache-based on-chip RAM,
the finer parameters are the size of the cache, associativity, line size, miss latency, and write
policy. Due to its large impact on system performance parameters, the memory architecture is often hand-crafted by the designer based on the targeted applications. However,
with the combination of on-chip SPRAM and cache, the memory design space is too large
for a manual analysis [31]. Also, with the projected growth in the complexity of embedded systems and the vast design space in memory architecture, hand optimization of the
memory architecture will soon become impossible. This warrants an automated framework which can explore the memory architecture design space and identify interesting
design points that are optimal from a performance, power consumption and VLSI area
(and hence cost) perspective. As the memory architecture design space itself is vast, a
brute-force design space exploration tool may take a prohibitively long time and hence is
unlikely to be useful in meeting tight time-to-market constraints. Further, for each
given memory architecture, there are several possible data-section layouts that are optimal
in terms of performance and power. This further compounds the memory architecture
exploration problem.
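The size of this design space is easy to underestimate. The sketch below enumerates one hypothetical, deliberately modest set of parameter ranges; even so, the cross product runs into thousands of architectures, and each point still carries its own large space of data layouts.

```python
# Back-of-the-envelope count of memory architecture design points,
# using invented (and deliberately modest) parameter ranges.
from itertools import product

cache_sizes   = [0, 4, 8, 16, 32, 64]       # KB (0 = no cache)
associativity = [1, 2, 4, 8]
line_sizes    = [16, 32, 64, 128]           # bytes
spram_sizes   = [0, 16, 32, 64, 128, 256]   # KB
num_banks     = [1, 2, 4, 8]
ports         = [1, 2]                      # single- vs dual-access banks

space = list(product(cache_sizes, associativity, line_sizes,
                     spram_sizes, num_banks, ports))
print(len(space))   # 6 * 4 * 4 * 6 * 4 * 2 = 4608 architectures
```

Widening only a few of these ranges, or letting bank sizes be chosen independently, quickly pushes the count toward the 100,000 design points mentioned in the abstract.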
1.5
In this section, we present our view of embedded system design flow to set the context
for our work. For this purpose, we introduce the notion of the X-chart, which is inspired
by the well-known Y-chart introduced by Gajski to capture the process of VLSI system
design [29].
In a Y-chart, the three levels of design abstraction form the three dimensions of the
figure Y; these are (a) design behavior, (b) design structure and (c) physical aspects of
the design. A design flow starts from a behavior specification, which is then mapped to
a structure, which in turn is mapped to a physical realization. We can view the process
of transforming a behavior to a physical realization as a successive refinement process.
Optimization of design metrics such as area, performance, and power are the goals of
each of these refinement steps. The design process may spiral from the behavioral axis to
structural axis to physical design axis in multiple stepwise refinement steps.
We introduce the notion of the X-chart, which is illustrated in Figure 1.4. The X-chart representation has four axes: (a) Behavior, (b) Logical Architecture, (c) Physical
Architecture and (d) Software Data Layout. The logical memory architecture (LMA)
defines the embedded cache size, cache associativity, cache block size, size of the scratch
pad memory, number of memory banks, and the number of ports. The physical memory
architecture (PMA) is an actual realization of an LMA using the memory library components provided by the semiconductor vendor. The fourth dimension, namely Software
Data Layout, is necessary for capturing the process of embedded system design. We have
identified several steps in the embedded system design flow and marked them with circled
numbers. Table 1.1 explains the individual steps in the X-chart representation.
The design of an embedded system begins with a behavioral description (Point (1)
in Figure 1.4, which is shown on the behavioral axis). Today, there are many languages
available to capture the system behavior, e.g., System Verilog [5], System C [4], and so
on. Hardware-software partitioning is performed to identify which functionalities of the
description are best performed in hardware and which are best implemented in software.
Hardware implementation is cost-intensive, but improves the performance.
We show point (2) on the LMA axis, since hardware-software partitioning adds a considerable amount of detail needed to decide the LMA parameters. The next step is to select
hardware and software IP blocks. Depending on the time schedule (for designing the
embedded system) and the cost constraint, the designer may wish to use readily available
IP blocks from a vendor or implement a custom version of the IP. The target platform
is then defined to implement the embedded system. As mentioned earlier, a platform includes one or more processors, memory, and hardware accelerators for specific functions.
Platforms also come with software tools such as compilers and simulators, so that the
development cycle can be accelerated. In other words, one does not need to wait for the
hardware implementation to complete before trying out the software. We show point (4)
on the software data layout axis, since the selection of a platform defines many aspects
of software implementation. Software partitioning is now performed to decide which software IP blocks are executed on which processor. This completes one spiral cycle in the
design life cycle of the embedded system. To recapitulate, the following components are
defined at the end of the first cycle: (a) the platform on which the embedded system
will be built, (b) the hardware and software IP blocks that are selected for the target
application, (c) assignment of software IP blocks to target processors where the software
will be executed. We show point (5) on the behavioral axis, since the next spiral cycle
will begin from here.
The next step is to define the logical memory architecture for the memory subsystem.
Guided by considerations such as cost, performance, and power, the designer must decide
basic architectural parameters of the memory sub-system, such as whether or not to
provide cache memory, how many memory banks are provided, whether or not dual-ported memories are necessary for guaranteeing performance, etc. The next step is to
perform design space exploration in the logical space. Each logical memory architecture
is also characterized by the selection of values for parameters such as cache size, cache
associativity, cache block size, etc. There is often a cost/performance tradeoff between two
solutions in the architectural space. Hence the designer must consider different Pareto-optimal solutions that exhibit this cost/performance tradeoff. This results in point (6) in
Figure 1.4.
Figure 1.4: Application Specific SoC Design Flow Illustration with X-chart
of the physical memory architecture. Once again, there are multiple solutions for data
layout for a given PMA. These solutions may exhibit tradeoffs in power, performance,
and area.
In this thesis, we use the phrase Physical Memory Architecture Exploration (PMAE)
to refer to the search for Pareto-optimal LMA/PMA/DL solutions. We capture this in
the form of an equation that follows.
PMAE = Logical Memory Architecture Exploration
     + Memory Allocation Exploration
     + Data Layout Exploration        (1.1)
1.6 Contributions
First, we propose methods for data layout optimization, assuming a fixed memory architecture for a DSP-based embedded system. Data layout is a critical component of the
embedded design cycle and decides the final configuration of the embedded system.
Data layout happens at the final stage in the life cycle of an embedded system, as
illustrated in the X-chart of Figure 1.4, and forms the foundation for memory subsystem
optimization. Hence, we first formulate data-section layout as an Integer Linear
Programming (ILP) problem. The proposed ILP formulation can handle: (i) partitioning
of data between on-chip and off-chip memory, (ii) handling simultaneously accessed data
variables (parallel conflict) in different on-chip memory banks, (iii) placing data variables
that are accessed concurrently (self conflict) in dual-access RAMs, (iv) overlay of data sections with non-overlapping life times, and (v) swapping of data sections from/to off-chip
memory.
An important contribution of this work is the development of a simple unified ILP
formulation to handle all the above-mentioned optimizations. The ILP-based approach
is very effective for many moderately complex applications and delivers optimal results.
However, as the application complexity increases, the execution time of the ILP method
increases drastically, making it unsuitable for large applications and for situations (such
as memory architecture exploration) where the data layout problem needs to be solved repeatedly.
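The scalability problem can be seen even in a toy setting. The sketch below finds the optimal bank assignment by exhaustive enumeration over an invented conflict and capacity model; the search visits |banks| to the power |sections| layouts, which is why exact methods become impractical as applications grow.

```python
# Toy exhaustive data layout search with an invented cost model.
# Optimal, like an exact ILP solution, but it visits
# len(BANKS) ** len(SECTIONS) layouts, which explodes for realistic
# applications (2**40 for 40 sections and two banks).
from itertools import product

SECTIONS = ["x", "y", "h"]
SIZES = {"x": 1024, "y": 1024, "h": 1024}       # bytes (hypothetical)
BANKS = ["B0", "B1"]                            # B0 is dual-access (DARAM)
CAP = {"B0": 1024, "B1": 4096}                  # bank capacities
PARALLEL = [("x", "y", 500), ("x", "h", 300)]   # parallel conflicts
SELF = [("h", 400)]                             # self conflicts

def cost(layout):
    used = {}
    for s, bank in layout.items():
        used[bank] = used.get(bank, 0) + SIZES[s]
    if any(used.get(bank, 0) > CAP[bank] for bank in CAP):
        return float("inf")                     # infeasible: bank overflow
    c = sum(w for a, b, w in PARALLEL
            if layout[a] == layout[b] and layout[a] != "B0")
    return c + sum(w for a, w in SELF if layout[a] != "B0")

best = min((dict(zip(SECTIONS, assign))
            for assign in product(BANKS, repeat=len(SECTIONS))),
           key=cost)
print(best, cost(best))
```

Here the optimum keeps `x` in the small dual-access bank, accepting the cheaper self-conflict penalty on `h`; a heuristic must reach such trade-offs without enumerating everything.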
Next, we explore the data layout design space for a given physical memory architecture
in order to optimize the performance and power consumption of the memory subsystem.
Note that data layout exploration forms the step from (8) to (9) in the X-chart representation.
We propose MODLEX, a Multi Objective Data Layout EXploration framework based
on a Genetic Algorithm that explores the data layout design space for a given logical and
physical memory architecture and obtains a list of Pareto-optimal data layout solutions
from performance and power perspectives. Most of the existing work in the literature
assumes that performance and power are non-conflicting objectives with respect to data
layout. However, we show that a significant trade-off (up to 70%) is possible
between power and performance.
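A minimal genetic-algorithm loop over bank assignments can illustrate the idea, in the spirit of (but far simpler than) MODLEX. The sections, banks, conflict weights, and GA parameters below are all invented, and the fitness is a single objective rather than the multi-objective search used in the thesis.

```python
# Minimal, hypothetical GA sketch for data layout exploration.
import random

SECTIONS = ["x", "y", "h", "w"]
BANKS = ["B0", "B1", "B2"]                      # B0 is dual-access (DARAM)
PARALLEL = [("x", "y", 500), ("x", "h", 300), ("y", "w", 200)]
SELF = [("h", 400)]

def cost(genome):
    layout = dict(zip(SECTIONS, genome))
    stalls = sum(w for a, b, w in PARALLEL
                 if layout[a] == layout[b] and layout[a] != "B0")
    return stalls + sum(w for a, w in SELF if layout[a] != "B0")

def evolve(pop_size=30, generations=40, seed=1):
    rng = random.Random(seed)
    pop = [[rng.choice(BANKS) for _ in SECTIONS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[:pop_size // 2]           # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(SECTIONS))
            child = a[:cut] + b[cut:]           # one-point crossover
            if rng.random() < 0.2:              # occasional mutation
                child[rng.randrange(len(SECTIONS))] = rng.choice(BANKS)
            children.append(child)
        pop = parents + children
    best = min(pop, key=cost)
    return dict(zip(SECTIONS, best)), cost(best)

layout, stall_cycles = evolve()
print(layout, stall_cycles)
```

A genetic search like this scales to layout spaces far beyond exhaustive enumeration, at the price of giving up the optimality guarantee of an exact formulation.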
Our next step is physical memory architecture exploration (steps (5) to (8) in Figure 1.4).
We propose two different methods for physical memory exploration. The first approach is
an extension of the Logical Memory Architecture Exploration (LMAE) method described
in Chapter 4 and represented in the X-chart by steps (5) to (6). Physical memory exploration
is performed by taking the output of LMAE and, for each of the Pareto-optimal logical
memory architectures, performing a memory allocation exploration (steps (6) to (7)) with
the objective of optimizing power and area in the physical memory space. Note that the data
layout is fixed at the logical memory exploration stage itself and hence the performance
does not change at this step. The memory allocation exploration is formulated as a multi-objective Genetic search to explore the design space with power and area as objectives.
We refer to this approach as LME2PME.
The second approach is a direct and integrated approach for Physical Memory Exploration, which we refer to as DirPME. This approach corresponds to a direct move
from point (5) to point (8) in Figure 1.4. In this approach, we integrate three critical
components: (i) Logical Memory Architecture Exploration, (ii) Memory Allocation
Exploration, and (iii) Data Layout Exploration. The core engine of the memory architecture
exploration framework is formulated as a Multi-objective Non-Dominated Sorting Genetic
Algorithm (NSGA) [25]. For the data layout problem, which needs to be solved for thousands of memory architectures, we use our fast and efficient heuristic data layout method.
Our integrated memory architecture exploration framework searches the design space by
exploring thousands of memory architectures and lists 200-300 Pareto-optimal design
solutions that are interesting from an area, power, and performance viewpoint.
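The notion of a non-dominated (Pareto-optimal) set used throughout this exploration can be sketched as follows; the (area, power, cycles) triples below are invented, and all objectives are minimized.

```python
# Sketch: keep only the non-dominated design points among
# (area, power, cycles) triples, all objectives to be minimized.

def dominates(p, q):
    """p dominates q if p is no worse in every objective, better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

designs = [
    (10.0, 5.0, 1_000_000),   # small and low-power but slow
    (12.0, 6.0,   900_000),
    (12.0, 4.5,   950_000),
    (20.0, 9.0,   600_000),   # big but fast
    (22.0, 9.5,   650_000),   # dominated by the point above it
]
print(pareto_front(designs))  # the last design is filtered out
```

This quadratic filter is fine for a few hundred candidates; NSGA-style algorithms additionally rank whole populations by non-domination level during the search.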
Next, we address the memory architecture exploration problem for hybrid memory architectures that have a combination of SPRAM and cache. For such a hybrid architecture, a critical step is to partition the data between on-chip SPRAM and external RAM. Data partitioning aims at improving the overall memory sub-system performance by placing in SPRAM the data that have the following characteristics: (a) high access frequency, (b) a lifetime that overlaps with those of many other data, and (c) poor spatial access characteristics. Placing all data that exhibit the above characteristics in SPRAM reduces the number of potentially conflicting data in the cache, reducing cache misses and thereby improving overall memory sub-system performance.
However, the SPRAM size is typically small, and it is not possible to accommodate all the data identified for SPRAM placement. Hence, even after data partitioning, a significant number of potentially conflicting data sections need to be placed in external RAM. These data must be placed such that the conflict misses between them are reduced. Cache-conscious data layout addresses this problem and aims at placing data in external RAM (off-chip RAM) with the objective of reducing cache misses. This is achieved by an efficient data layout heuristic that is independent of instruction caches, optimizes run-time, and keeps the off-chip memory address space usage under check. We extend the above approach and perform hybrid memory architecture exploration with the objective of optimizing run-time performance, power consumption, and area.
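The conflict-miss reasoning behind a cache-conscious layout can be made concrete with the set-mapping arithmetic of a direct-mapped cache. The cache parameters below are illustrative assumptions, not those of any processor discussed here:

```python
# Set-mapping arithmetic of a direct-mapped cache; parameters are
# illustrative assumptions (32-byte lines, 256 sets => 8KB cache).
LINE_SIZE = 32
NUM_SETS = 256

def cache_set(addr):
    # The set an address maps to in a direct-mapped cache.
    return (addr // LINE_SIZE) % NUM_SETS

def conflicts(base_a, base_b, size):
    """Count line-sized offsets at which equal offsets of two arrays map to
    the same cache set -- a source of conflict misses if both are live."""
    return sum(1 for off in range(0, size, LINE_SIZE)
               if cache_set(base_a + off) == cache_set(base_b + off))
```

With these parameters the cache spans 8KB, so two arrays whose base addresses differ by a multiple of 8KB collide line for line; a cache-conscious layout chooses base addresses that avoid such alignments.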
The salient features of our work are as follows.
We provide a unified framework for logical memory exploration, memory allocation exploration, and data layout.
Our work addresses power, performance, and area optimization in an integrated framework.
Our work addresses memory architecture exploration for a hybrid memory architecture involving on-chip SPRAM and cache.
Our work does not rely on source-code optimization for power and performance optimization; hence it is suitable for platform-based/IP-based system design.
1.7 Thesis Overview
The rest of the thesis is organized as follows. In the following chapter, we provide the
background material for the thesis. We begin by explaining the memory architecture of
a DSP and an MCU. We summarize the software optimizations used in the literature
to improve memory access efficiency. We explain cache-based embedded SoCs and their
challenges with respect to predictability. Finally, we introduce the concepts of a Genetic
Algorithm (GA) for optimization, since GA is used in our optimization framework in later chapters.
In Chapter 3, we propose different methods to address the data layout problem for on-chip SPRAM based memory architectures. First, we propose an Integer Linear Programming (ILP) based approach. Further, we propose a fast and efficient heuristic for the data layout problem. Finally, we formulate the data layout problem as a Genetic Algorithm (GA) search.
In Chapter 4, we present a multi-objective memory architecture exploration framework
to search the memory design space for the on-chip memory architecture with performance
and memory cost as two objectives. We address the memory architecture exploration
problem at the logical level.
The Multi-objective Data Layout Exploration problem is addressed in Chapter 5. Here, the data layout design space is explored for a given logical memory architecture and application with respect to performance and power.
In Chapter 6, we address the memory architecture exploration problem at the physical memory level. In this chapter we propose two different approaches for addressing the physical memory architecture exploration problem.
Chapter 2
Background
In this chapter we provide the necessary background information that is useful for understanding the rest of the thesis. The following section explains the on-chip memory architecture of Digital Signal Processors (DSPs) and Microcontrollers (MCUs). Section
2.2 presents the software optimizations used in embedded applications that are targeted
at using on-chip memory efficiently. Section 2.3 describes cache based on-chip memory
architectures and motivates the need for cache-SPRAM based hybrid architectures for
embedded SoCs. In Section 2.4, an overview of Genetic Algorithms is presented. Finally, in Section 2.5, the importance of multiple multi-objective design solutions for platform-based design is explained.
2.1
2.1.1
DSP processor based embedded systems have an on-chip memory which typically has a
single cycle access time [49]. The on-chip memory, also referred to as scratch pad memory,
is mapped into an address space disjoint from the off-chip memory but connected to
the same address and data buses.¹ Typically the scratch-pad memory is organized into
multiple memory banks to facilitate multiple simultaneous data accesses. DSP Processors
typically have 2 or more address generation units and multiple on-chip buses to facilitate
multiple memory accesses.
Further, each on-chip memory bank can be organized either as a single-access RAM (SARAM) or as a dual-access RAM (DARAM), to provide single or dual accesses to the same memory bank in a single cycle. For example, the Texas Instruments TMS320C54X digital signal processor has two data read buses and one data write bus [75], and the Texas Instruments TMS320C55X processor has three data read buses and two data write buses, since concurrent accesses to the same array are common in DSP applications [76]. Figure 2.1 presents the memory map of C55X DSPs, where multiple SARAM and DARAM banks form part of the memory map, and MMR represents the memory-mapped registers, which typically contain control registers, status registers and stack pointers. The DARAM and SARAM regions are organized as multiple memory banks to enable two concurrent accesses.
¹ We use the terms scratch pad memory, on-chip memory and internal memory interchangeably. Similarly, off-chip memory and external memory are also used interchangeably.
2.1.2
This is because the memory architecture of a DSP is more complex than that of a microcontroller (MCU) due to the following reasons: (a) DSP applications are more data dominated than the control-dominated software executed on an MCU. Memory bandwidth requirements for DSP applications range from 2 to 3 memory accesses per processor clock cycle. For an MCU, this figure is, at best, one memory access per cycle. (b) It is critical in DSP applications to extract maximum performance from the memory subsystem in
order to meet the real-time constraints of the embedded application. As a consequence,
the DSP software for critical kernels is developed mostly as hand optimized assembly
code. In contrast, the software for MCU is typically developed in high-level languages.
The memory architecture for a DSP is unique since the DSP has multiple on-chip buses
and multiple address generation units to service higher bandwidth needs. The on-chip
memory of embedded processors can include (a) only Level-1 cache (L1-cache) (e.g., [1]),
(b) only scratch-pad RAM (SPRAM) (e.g., [75, 76]), or (c) a combination of L1-cache and
SPRAM (e.g., [2, 77]).
2.2 Software Optimizations
2.2.1
To take advantage of the multiple on-chip memory buses provided by the underlying processor architecture, software application developers must carefully partition the data into
several independent sections. A data section typically holds an array or a set of program
data structures and is placed contiguously in a memory bank. The data structures that
are used in the same instruction cycle are said to be mutually conflicting and are ideally
assigned to different sections so that they can be placed in different memory banks. Assigning data structures to separate sections increases the number of placement decisions
drastically.
Several software optimization techniques for improving the performance have been
proposed in the literature [10, 14, 37, 40, 44, 53, 58, 74, 79], including:
Placing frequently accessed data variables in on-chip SPRAM and less frequently accessed data variables in off-chip RAM [10, 53].
Partitioning data arrays that are accessed simultaneously in the same processor cycle
into different on-chip memory banks. This way multiple data can be simultaneously
accessed in the same cycle without incurring any additional memory stalls [40, 58].
Mapping a data array that must support multiple simultaneous accesses to DARAM. This avoids additional memory stalls for two simultaneous accesses [44].
Overlaying data structures, typically arrays, to share the same on-chip memory space. These arrays are referred to as scratch buffers [74]. The lifetime of these buffers is limited to a software module. Hence scratch buffers corresponding to different modules, which are not live simultaneously, can share the same on-chip memory space.
Swapping critical code and data sections from off-chip memory to on-chip memory before the execution of the appropriate code segment. This facilitates efficient access to the code/data currently being accessed. The benefits of swapping (on-chip access and reduced memory stalls) should more than compensate for the cost of swapping [37, 79].
Except for the swapping technique, which works on both code and data, all the other
techniques concentrate only on data. Managing data is very important because most of
the embedded applications are data dominated [19].
Towards achieving this goal, critical code and data that are accessed frequently are identified by performing extensive simulation and profiling of the application. The decision to place a data structure in the on-chip SRAM is taken after analyzing the access frequency of the variable in the application.
The ideal case is where all the critical code and data sections can be placed in the on-chip memory. While this can result in very high performance, in terms of fewer memory stall cycles, it is also prohibitively expensive to support such a large on-chip SRAM.
Hence to achieve a good performance/cost ratio, a careful data layout for the memory
architecture is mandatory.
Taking the above optimizations into consideration, a code and data section layout can
be defined as a mapping which specifies where (i.e., in which memory type) the various
code and data sections reside, the memory bank(s) on which the sections reside, the type
of memory access (single or dual access) supported in the memory bank, whether or not
certain code (or data) sections are overlayed, and whether or not certain code (or data)
sections are swapped.
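Such a mapping can be pictured as a per-section placement record. The field names below are illustrative assumptions, not an actual data structure from this work:

```python
# Hypothetical record capturing the layout decisions enumerated above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SectionPlacement:
    section: str
    memory_type: str     # "on-chip" or "off-chip"
    bank: Optional[int]  # memory bank index; None for off-chip placement
    dual_access: bool    # bank supports two accesses per cycle (DARAM)
    overlaid: bool       # shares space with other scratch buffers
    swapped: bool        # copied in from off-chip memory before use

# One candidate layout: a coefficient table in a DARAM bank, a large
# frame buffer kept off-chip and swapped in on demand.
layout = [
    SectionPlacement("coeffs", "on-chip", 0, True, False, False),
    SectionPlacement("frame_buf", "off-chip", None, False, False, True),
]
```

A complete data layout is then simply a list of such records, one per code or data section.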
2.2.2
Typically, embedded applications running on an MCU are control-oriented and not very computation intensive. The primary objective is to use the on-chip SPRAM efficiently. Towards this, the application is profiled to obtain the access frequency of all the data variables. Frequently accessed variables are placed in on-chip SPRAM and less frequently accessed variables are placed in off-chip RAM.
With the objective of optimizing power consumption, a non-uniform bank size based on-chip SPRAM architecture is used in [14]. The key idea is that smaller banks, which consume less energy per access, are used to accommodate the most frequently accessed variables, and this placement optimizes the system power. For example, let a and b be two data variables, each of size 1KB, accessed 100000 times and 20000 times respectively. For an on-chip SPRAM of size 16KB organized as four 2KB banks and one 8KB bank, placing a in one of the 2KB banks and placing b in the 8KB bank is more power optimal than the other way around.
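The arithmetic behind this example can be sketched as follows. The per-access energy values are invented for illustration; only their relative ordering (smaller bank, cheaper access) matters:

```python
# Back-of-the-envelope energy model for the non-uniform-bank example.
# Per-access energies are invented; smaller banks cost less per access.
ENERGY = {2: 1.0, 8: 2.5}   # energy units per access, keyed by bank size in KB

def total_energy(placement):
    # placement: list of (access_count, bank_size_kb) pairs
    return sum(count * ENERGY[size] for count, size in placement)

good = total_energy([(100000, 2), (20000, 8)])  # a in a 2KB bank, b in the 8KB bank
bad  = total_energy([(100000, 8), (20000, 2)])  # the other way around
assert good < bad
```

The hot variable a dominates the total, so assigning it to the cheaper small bank wins regardless of the exact energy ratio.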
2.3
All programs exhibit the property of locality of reference [68], and cache memories exploit this property to give improved performance. Programs exhibit two types of locality: temporal and spatial. Temporal locality indicates that a recently accessed memory location is likely to be accessed again, while spatial locality implies that a recently accessed memory location's neighboring locations are likely to be accessed.
In cache-based architectures, data is placed in off-chip RAM and copied at run-time into the cache by a hardware cache controller. Cache controllers increase the silicon area, but eliminate the requirement of explicit data placement and management and the associated run-time overhead. The mapping of data from off-chip RAM to L1-cache is dictated by the
cache associativity scheme and can create potential side effects like thrashing. Therefore,
a careful analysis of data access characteristics and understanding of temporal access
pattern of the data structures is required to improve the cache performance.
From a power, performance and area perspective, direct mapped caches are preferred
over set-associative caches. However, direct mapped caches incur much more off-chip
memory traffic [36], which, when not handled properly, can lead to very high power
consumption and lower performance. In [36], the traffic inefficiency of direct mapped
caches is evaluated for different embedded and multimedia applications from Mediabench
[43]. The traffic (data movement from off-chip RAM to cache and vice versa) inefficiency is a factor of 10 or more even for large cache sizes, and is mainly attributed to conflict
misses. However, for application specific systems, the code is known a priori and an
optimal cache-conscious data layout will be able to reduce the number of conflict misses
and improve the performance and power consumption by reducing the off-chip memory
traffic.
2.3.1
2.4
Genetic Algorithms (GA) [30] belong to the class of stochastic search methods [69]. Other stochastic search methods include simulated annealing [57], threshold acceptance [27], and some forms of branch and bound [24]. Most stochastic search methods operate on a single solution to the problem at hand, whereas genetic algorithms operate on a set of solutions, which can lead to faster convergence.
To use a genetic algorithm, the problem at hand needs to be encoded as an object. In GA terminology, the encoded object is called a chromosome. A population consists of a set of such chromosomes. A GA combines randomness and survival of the fittest to perform an effective search in the solution space [30].
In Figure 2.3, the basic flow of a GA is explained. To start with, the solution to the problem at hand needs to be modeled as a chromosome; a better chromosome means a better solution. The next step is to create a set of P chromosomes, referred to as the population, initialized by the initialization step in Figure 2.4. The objective of the GA is to keep operating on the chromosomes in the current population to generate new chromosomes and select the P fittest chromosomes. The GA uses operators like selection, crossover and mutation for this purpose. Observe that there are two nested loops in the GA, as shown in Figure 2.4. The outer loop corresponds to the evolution of different generations and the inner
loop constitutes the GA operations that generate a certain set of new chromosomes within a generation.
The inner loop starts with the selection operation, which picks two of the best individuals for mating. Some of the more commonly used selection methods [30] are (i) roulette wheel selection, (ii) tournament selection and (iii) rank selection. The probability of selecting a chromosome in roulette wheel selection is proportional to the fitness function of the chromosome. In tournament selection, a set of chromosomes is selected based on roulette wheel selection and then the top two chromosomes are picked from the selected set. Rank selection always picks the best two chromosomes based on the fitness function. We have used roulette wheel selection in our work.
each of the parents and generates two children. Thus, the children chromosomes are expected to have a combination of characteristics from both parents. Since the parents are the best chromosomes from the current population, the children are expected to be better than the parents, by the theory of evolution. Typically a 3-point crossover is performed, as illustrated in Figure 2.3. After the crossover operation, the mutation operation is performed with a certain probability. The mutation operation randomly changes certain elements (flips a bit) of the chromosome. The mutation operator introduces a certain amount of randomness into the search; it can help the search find solutions that crossover alone might not encounter. For each of the new chromosomes, the objective functions are computed. There can be more than one objective. A fitness value is assigned based on the set of objectives; the fitness function represents how good a chromosome is (in other words, how good a solution is). This set of operations is repeated (inner loop) till M new chromosomes are generated. For each of the new M chromosomes, the objective functions and fitness values are computed.
The last step in a generation (outer loop) is the annihilation step, which implements the survival-of-the-fittest concept. At the end of the inner loop there will be a total of P + M chromosomes, where P are the parents and M are the newly generated children. Out of the P + M chromosomes, the top P chromosomes with respect to the fitness function are selected and passed on to the next generation, and the remaining chromosomes are discarded. The outer loop is repeated for a given number of generations.
2.5
Platform-based design is a way to address the complexity of embedded system design under tight deadlines. It is common to build systems around the same computational platform, which includes a microprocessor or microcontroller for running the operating system and a DSP for running media-related applications. The same platform will therefore need to cater to different application characteristics. The OMAP platform from Texas Instruments comes in several flavors to address this market diversity. Similarly, Texas Instruments offers two variants of the C55X DSP: the C5510 (with 320KB of SRAM and
64KB of DARAM) and the C5503 (with 64KB of SRAM and 64KB of DARAM) for high-end and mid-range applications. As a consequence, the platform designer is interested not just in a single optimal design point but in a set of design points. These design points are termed non-dominated design points [30], as no design point is better than any of the other non-dominated points on all objective criteria. The non-dominated points form the Pareto-optimal set. The condition of Pareto optimality [30] is mathematically defined as follows. Let a vector x be partially less than y, symbolically x <p y, when the following conditions hold:
(x <p y) ⇔ (∀i)(xi ≤ yi) ∧ (∃i)(xi < yi)
Using the partial relation <p, we say that if x <p y then x dominates y, or y is a dominated point. If the set of all dominated points is removed from the set of all points in the design space, we get the non-dominated set, or the Pareto-optimal design points. In other words, the complement of the dominated set with respect to the design space gives the Pareto-optimal set.
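The dominance test and the removal of dominated points follow directly from the definition (for minimization objectives):

```python
# Pareto dominance per the partial order above (minimization objectives):
# x dominates y when x is no worse in every objective and strictly better
# in at least one.
def dominates(x, y):
    return (all(xi <= yi for xi, yi in zip(x, y))
            and any(xi < yi for xi, yi in zip(x, y)))

def pareto_set(points):
    """Remove every dominated point; what remains is the Pareto-optimal set."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, among the (cost, delay) points (1,5), (2,2), (3,1) and (4,4), only (4,4) is dominated (by (2,2)); the other three form the Pareto front.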
Each non-dominated point is an optimal design point with a specific price-performance factor. Thus platform-based design requires that a set of non-dominated design points be computed automatically and in reasonable computation time.
Chapter 3
Data Layout for Embedded Applications
3.1 Introduction
As discussed in Chapter 1, embedded applications are highly performance critical, and the processor resources need to be utilized optimally to extract maximum performance. One of the most critical steps in the embedded application development flow is system integration, where all the software modules are integrated and mapped to a given target memory architecture. This step has large performance implications depending on how the memory architecture is used. The memory architecture of embedded DSPs is heterogeneous and
contains memories with different access times. For example, an embedded system may contain on-chip and off-chip memory modules with different access times, single- and dual-ported memories, and multiple memory banks to support many simultaneous accesses.
During system integration, the decision is made to map critical data onto faster memories and non-critical data onto slower memories. But it is not easy to classify data into critical and non-critical, for the following reasons: (a) typically 70 to 80% of the code and data is legacy and may not have a clear specification, (b) most of the code for embedded DSPs is developed in assembly and hence compiler-based analysis is not possible [49], and (c) because of faster time-to-market constraints, many of the software
for current day time-to-market requirements, and (2) the quality of the solution varies based on the designer's expertise.
The data layout optimization methods [10, 40, 44, 53, 58] vary significantly between applications built for microcontrollers (MCUs) and Digital Signal Processors (DSPs) for the following reasons: (a) DSP applications are more data dominated than the control software executed on an MCU. Memory bandwidth requirements for DSP applications range from 2 to 3 memory accesses per processor clock cycle, while an MCU at best needs only one memory access per cycle. (b) The DSP software for critical kernels is developed mostly as hand-optimized assembly code, whereas the MCU software is developed in high-level languages. Hence compiler-based optimizations may not be directly applicable to the DSP kernels.
In this chapter we address the data layout problem for DSP memory architectures using different methods. First, we formulate the data section layout as an Integer Linear Programming (ILP) problem. The proposed ILP formulation can handle: (i) on-chip and off-chip memory, (ii) multiple on-chip memory banks, (iii) single and dual access RAMs, (iv) overlay of data sections with non-overlapping life times, and (v) swapping of data (from/to off-chip memory). The main contribution of this work is the development of a simple unified ILP formulation. The formulation can optimize performance or cost, although in our work we concentrate on performance. We have developed a framework that automatically generates the ILP formulation for an embedded application. The ILP formulation is solved using a public domain LP solver, viz., lp_solve.
The ILP based approach is very effective for many moderately complex test cases and delivers optimal results. However, as the application complexity increases, the execution time of the ILP method becomes an issue: for some of the test cases the run-time is more than 24 hours, and in some cases the ILP does not yield a valid solution even after running for 30 hours. Hence, we also formulate the data layout problem as a Genetic Algorithm (GA) search. Finally, since the data layout problem is the kernel of the memory architecture exploration problem and needs to be invoked several times, we looked at developing faster methods to solve it. In this chapter we also propose a
heuristic algorithm that maps the data sections to the given memory architecture and
reduces the number of memory access conflicts (both self conflicts and parallel conflicts).
We compare the results of the heuristic, GA and ILP.
The rest of this chapter is organized as follows. The following section deals with the
necessary background and the problem statement. The ILP formulation is presented in
Section 3.3. In Section 3.4 we present the Genetic Algorithm Formulation of the data
layout problem. The Greedy back-tracking heuristic is discussed in Section 3.5. We
report the experimental results in Section 3.6. In Section 3.7 we discuss the related work.
Finally concluding remarks are presented in Section 3.8.
3.2
3.2.1
Figure 3.1 explains the data layout method in a block diagram. Initially, the application's data is grouped into logical sections. This is done to reduce the number of individual items and thereby reduce the complexity. This step is important because once the data is grouped into a section, the section can only be assigned a single location, and all the data variables inside a section will be placed contiguously starting from the given memory address. The order of data placement within a section can be arbitrary and may not affect the performance. Note that a section cannot contain both code and data. There is a trade-off in combining different variables into a section. If too many data variables are combined into one section, the flexibility of placement in memory is negatively impacted. On the other hand, if each data variable is mapped to its own section, there are too many sections to handle, increasing the data layout complexity. In practice, an embedded development engineer makes a judicious choice when mapping a set of data variables into a section. Typically, each large data array is mapped into an individual section, and all scalar data variables belonging to a module are mapped into one section. Note that this process is performed manually.
Once the grouping of data into sections is done, the code is compiled and executed on a cycle-accurate software simulator. From the software simulator, the profile data (access frequencies) of the data sections is obtained. In addition, the simulator generates a conflict matrix that represents the parallel and self conflicts. Parallel conflicts refer to simultaneous accesses of two different data sections, while self conflicts refer to simultaneous accesses of the same data section. Consider the code segment in Figure 3.2.
In this code segment data sections a and b need to be accessed together and therefore represent a parallel conflict. Accesses to a[i] and a[i-1] represent a self conflict. If these arrays (a, b) are placed in different memory banks or in a memory bank with multiple ports, then these accesses can be made concurrently without incurring additional stall cycles. However, note that the data array (a), which has a self conflict, must be placed in a memory bank with multiple ports to avoid additional stall cycles.
The conflict relations among data sections are represented by an n × n matrix, where n is the number of data sections. The (i, j)th element represents the conflicts, or concurrent accesses, between data sections i and j. The diagonal elements represent self conflicts. The conflict matrix for this example is:

        | 100   40  2000 |
  C =   |  40  500  2000 |
        | 600   50    50 |
Data section sizes, the access frequencies of data sections, the conflict matrix and the memory architecture are given as inputs to data layout. The objective of data layout is to use the memory architecture efficiently by placing the most critical data sections in on-chip RAM and reducing bank conflicts by placing conflicting data in different memory banks. Data layout assigns memory addresses to all the data sections.
3.2.2 Problem Statement
incurring stall cycles. Cii represents the number of times two accesses to data section i
are made in the same cycle. Self-conflicting data sections need to be placed in DARAM
memory banks, if available, to avoid stalls. The objective of the data layout problem is
to place the data sections in memory modules such that the following are minimized:
Number of memory stalls incurred due to conflicting accesses of data sections placed
in the same memory bank
Self-conflicting accesses placed in SARAM banks
Number of off-chip memory accesses.
Note that the sum of the sizes of the data sections placed in a memory bank cannot
exceed the size of the memory bank.
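A candidate layout can be costed against these objectives as in the following sketch; the conflict matrix, bank assignment and the unit stall cost are illustrative assumptions, not data from this work:

```python
# Illustrative stall-cycle model for a candidate data layout, following
# the minimization objectives above. Inputs are invented for the sketch.
def layout_stalls(bank_of, is_daram, C):
    """bank_of[i]: bank index of data section i;
    is_daram[k]: True if bank k supports dual access;
    C: n x n conflict matrix (diagonal = self conflicts)."""
    n = len(C)
    stalls = 0
    for i in range(n):
        if not is_daram[bank_of[i]]:
            stalls += C[i][i]            # self conflicts in a SARAM bank
        for j in range(i + 1, n):
            if bank_of[i] == bank_of[j]:
                stalls += C[i][j]        # parallel conflicts in the same bank
    return stalls
```

A data layout search evaluates such a cost for each candidate bank assignment (subject to the bank capacity constraint) and keeps the cheapest feasible one.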
3.3 ILP Formulation
As mentioned earlier, in our discussion, we follow the convention that each data section
refers to a single array. Let the size of data section s in module j be denoted by SDjs .
The access count for a data section is denoted by ADjs . During memory layout, a data
section occupies a block of contiguous memory locations. The value for some of the above
parameters, e.g., the access counts, can be obtained by profiling the application. In our
framework the profile data is collected using an Instruction Level Simulator.
For the ILP formulation we also require memory architecture parameters. The sizes of internal (on-chip) and external (off-chip) memory are denoted by SM_i and SM_e respectively. We also need the number of stall cycles We incurred for each access to external memory.
Table 3.1 summarizes the list of symbols used in our formulation. With this, we are ready
to describe the basic formulation.
3.3.1 Basic Formulation
The objective function minimizes the total number of stall cycles incurred by accesses to data sections placed in external memory:

min  Σ_{j=1}^{N} Σ_{s=1}^{Ndj} ADjs · We · (1 − IDjs)        (3.1)

where IDjs = 1 if data section s of module j resides in internal memory, and 0 otherwise.
Next we specify the memory constraints. Equations (3.2) and (3.3) enforce the constraint that the total size of the code and data sections placed in the external and internal memory does not exceed the available external and internal memory, respectively.

Σ_{j=1}^{N} Σ_{s=1}^{Ndj} SDjs · (1 − IDjs) ≤ SM_e        (3.2)

Σ_{j=1}^{N} Σ_{s=1}^{Ndj} SDjs · IDjs ≤ SM_i        (3.3)

Lastly, we add the constraint that the IDjs are 0-1 integer variables.
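Since the basic formulation (3.1)-(3.3) is small, it can be checked by brute-force enumeration of the 0-1 variables. The sketch below does this for a single module with invented data values; it is a toy stand-in for the solver, not the actual framework:

```python
# Brute-force check of the basic formulation for one module: enumerate
# all 0-1 assignments of IDjs, keep the feasible assignment minimizing
# objective (3.1) under constraints (3.2)-(3.3). Data values are invented.
from itertools import product

sections = ["s1", "s2", "s3"]
AD = {"s1": 100000, "s2": 20000, "s3": 500}   # access counts
SD = {"s1": 1024, "s2": 2048, "s3": 4096}     # section sizes (bytes)
We = 10                                        # stalls per external access
SM_i, SM_e = 3072, 1 << 20                     # internal / external capacity

def solve():
    best, best_cost = None, float("inf")
    for bits in product((0, 1), repeat=len(sections)):
        ID = dict(zip(sections, bits))
        if sum(SD[s] * ID[s] for s in sections) > SM_i:          # (3.3)
            continue
        if sum(SD[s] * (1 - ID[s]) for s in sections) > SM_e:    # (3.2)
            continue
        cost = sum(AD[s] * We * (1 - ID[s]) for s in sections)   # (3.1)
        if cost < best_cost:
            best, best_cost = ID, cost
    return best, best_cost
```

Enumeration is only viable at this toy scale; the full formulation, with multiple modules, banks and the derived variables introduced below, is what lp_solve handles.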
3.3.2
Embedded DSP applications are data intensive; typically two to three data sections are
accessed simultaneously in a cycle. DSP processors are designed to handle multiple data
accesses. DSPs have internal memory with multiple banks and multiple internal data
buses. The data variables that are accessed simultaneously need to be placed in different
memory banks to avoid memory stalls.
This section handles the partitioning of concurrently accessed data arrays across multiple memory banks to avoid additional stalls. Only two simultaneous data accesses are considered, but this can be extended easily to more than two accesses. To represent the number of simultaneous accesses to two different data sections in a module j, we use a 2-dimensional matrix B^j of size Ndj × Ndj. An element of this matrix, B^j_st, for s ≠ t, represents the number of simultaneous accesses to data sections s and t. Note that B^j_st refers to the total number of simultaneous accesses to the different elements of data sections s and t. For example, if two data sections s1 and s2, each of size 100 elements, are accessed simultaneously (as in s1[i]+s2[i]), then B^j_{s1,s2} = 100.
B^j_ss refers to the number of simultaneous accesses to the same data section s. We will consider this in our formulation in the next subsection. For the time being, we will assume B^j_ss = 0 for all data sections s and for all modules j. The elements of the B^j matrix are obtained by profiling the application. The internal memory is now organized as Nb banks, where SM_ik denotes the size of the kth internal bank, so that

SM_i = Σ_{k=1}^{Nb} SM_ik
Further, let IDkjs represent whether data section s of module j resides in the kth internal bank. Lastly, we use a (derived) 0-1 variable Zkjst to represent whether data sections s and t of module j are both placed in internal bank k:

Zkjst = 1 if and only if IDkjs = 1 and IDkjt = 1        (3.4)
Note that Zkjss = 1 for all sections. We replace Equations (3.2) and (3.3) in the basic
formulation with:
Σ_{j=1}^{N} Σ_{s=1}^{Ndj} SDjs · EDjs ≤ SM_e        (3.5)

Σ_{j=1}^{N} Σ_{s=1}^{Ndj} SDjs · IDkjs ≤ SM_ik        (3.6)
where EDjs is 1 if the sth data section of module j resides in off-chip memory. These variables can be expressed in terms of IDkjs as:

EDjs = 1 − Σ_{k=1}^{Nb} IDkjs        (3.7)
Note that inequality (3.6) must hold for all k from 1 to Nb, the number of internal memory banks. To enforce that each data section resides in at most one internal memory bank, we add the constraint:

Σ_{k=1}^{Nb} IDkjs ≤ 1        (3.8)
The objective function now becomes:

min  Σ_{j=1}^{N} Σ_{s=1}^{Ndj} ADjs · We · EDjs + Σ_{j=1}^{N} Σ_{k=1}^{Nb} Σ_{s=1}^{Ndj} Σ_{t=s+1}^{Ndj} B^j_st · Zkjst        (3.9)

subject to constraints (3.4) to (3.8). Note that the first term includes all accesses (simultaneous and non-simultaneous) to data section s, while the second term excludes non-simultaneous accesses.
3.3.3
In this formulation we account for the cost of simultaneous accesses to the same data section. Let B^j_ss denote the number of such accesses. These accesses will incur an additional stall cycle if the data section s does not reside in a memory bank that supports dual access. Likewise, a simultaneous access to data sections s and t will incur an additional stall cycle when both are in a memory bank k that is single ported. Let DPk = 1 denote that memory bank k is dual ported, and DPk = 0 otherwise. Note that for a given memory architecture DPk is a constant (0 or 1) and known a priori.
min  Σ_{j=1}^{N} Σ_{s=1}^{Ndj} ADjs · We · EDjs + Σ_{j=1}^{N} Σ_{k=1}^{Nb} Σ_{s=1}^{Ndj} Σ_{t=s}^{Ndj} B^j_st · (1 − DPk) · Zkjst        (3.10)
3.3.4
As mentioned earlier data sections that have non-overlapping life-times can share the same
on-chip memory space. These arrays are commonly referred to as scratch buffers. In our
discussion, we assume that scratch buffers are identified by the application developer. Let
Sjs = 1 denote that data section s is a scratch buffer; Sjs = 0 otherwise. The memory
used by a scratch buffer can be reused across different modules, but not within the same
module.
We account for the internal memory required for the scratch buffers in the following
way. For each module j, we compute SBkj , the sum of the sizes of the scratch buffers
in module j that are stored in the kth internal memory bank. The memory required for
scratch buffers in the kth internal bank corresponds to the maximum of SBkj over all
modules. That is:

SB_k = \max_j \sum_{s=1}^{Nd_j} S_{js} \cdot SD_{js} \cdot ID_{kjs}    (3.11)
Further, the individual memory requirements of the scratch buffers stored in the kth internal memory bank can be excluded from the internal memory constraint (Inequality (3.6)). Thus Inequality (3.6) is replaced by:

\sum_{j=1}^{N} \sum_{s=1}^{Nd_j} (1 - S_{js}) \cdot SD_{js} \cdot ID_{kjs} + SB_k \le SM_{ik}    (3.12)
The constraint for external memory remains the same (Inequality (3.5)). Thus the ILP formulation in this case has the same objective function (Equation (3.10)) subject to constraints (3.4), (3.5), (3.8), (3.11), and (3.12).
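The reservation of Equation (3.11) can be computed directly from a placement: the space set aside for scratch buffers in bank k is the maximum over modules of the scratch-buffer bytes each module puts there. A minimal sketch with hypothetical data structures:

```python
def scratch_space(modules, k):
    """SB_k: maximum, over all modules, of the total size of the
    scratch buffers that the module places in internal bank k.
    modules: list of modules, each a list of (size, bank, is_scratch)."""
    return max(sum(size for size, bank, is_scratch in module
                   if is_scratch and bank == k)
               for module in modules)
```

Because only the per-module maximum is reserved, two modules with non-overlapping lifetimes reuse the same on-chip words.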
3.3.5
Swapping of Data
Swapping of a data section is generally applied in embedded DSP systems that have
large external memory and very small internal memory. Here the data that is identified
for swapping resides in external memory and copied into the internal memory (on-chip
RAM) only for the duration of execution/access of a section. A data section is identified
for swapping by carefully weighing the swapping cost against the performance benefit that
results from accessing the section from internal memory. To model swapping, we assume
that one common swap memory space SWk is allocated in the kth internal memory. The
size of SWk is the maximum of the total size of all swapped sections in a module, where
the maximum is taken across all modules. The formulation for swapping proceeds in a
manner similar to scratch buffers, where the swapped sections share the same memory area in the on-chip memory bank. Additionally, we have to account for the off-chip memory requirement of all swapped sections (\sum_{k=1}^{N_b} SW_k) and the cost of swapping.
3.4 Genetic Algorithm for Data Layout
Genetic Algorithms (GAs) have been used to solve hard optimization problems [30]. Genetic algorithms simulate the natural process of evolution using genetic operators such as natural selection, survival of the fittest, mutation, and crossover in order to search the solution space.
To map an optimization problem to the GA framework, we need the following: chromosomal representation, fitness computation, selection function, genetic operators, the
creation of the initial population and the termination criteria.
For the memory layout problem, each individual chromosome should represent a memory placement. A chromosome is a vector of d elements, where d is the number of data sections. Each element of a chromosome can take a value in (0..m), where 1..m represent on-chip memory banks (including both SARAM and DARAM memory banks) and 0 represents off-chip memory. Thus if element i of a chromosome has the value k, then the ith data section is placed in memory bank k. Thus a chromosome represents a memory
placement for all data sections. Note that a chromosome may not always represent a
valid memory placement, as the size of data sections placed in a memory bank k may
exceed the size of k. Thus the genetic algorithm should consider only valid chromosomes for evolution. This is achieved by giving a low fitness value to invalid chromosomes. Our
initial experiments demonstrated that the above chromosome representation (vector of
decimal numbers) is more effective than the conventional bit vector representation [30]
as the latter will lead to assignment of non-existent memory banks when the number of
memory banks is not a power of 2.
Genetic operators provide the basic search mechanism by creating new solutions based on existing solutions. The selection of the individuals to produce the successive generation plays an extremely important role. The selection approach assigns a probability of
selection to each individual, depending on its fitness. An individual with a higher fitness
has a higher probability of contributing one or more offspring to the next generation. In
the selection process a given individual can be chosen more than once. Let us denote the
size of the population (number of individuals) as P . Reproduction is the operation of producing offspring for the next generation. This is an iterative process. In every generation,
from the P individuals of the current generation, M more offspring are generated. This
results in a total population of P + M . From this total population of P + M , P fittest
individuals survive to the next generation. The remaining M individuals are annihilated.
In our data layout problem, for each of the individuals, the fitness function computes the
number of resulting memory conflicts. Since GAs typically solve a maximization problem, we convert our problem into a maximization problem by negation and normalization. Recall
that a chromosome may represent an invalid solution. To discourage invalid individuals,
we associate a very low fitness value to them.
The Crossover operator takes two individuals and produces two new individuals by
merging the characteristics of the two parents at a random point (named crossover site).
Mutation is applied after crossover to each individual with a given probability. Mutation
changes an individual to produce a new one by changing some of its genes. Lastly, the
GA must be provided with an initial population that is created randomly. GAs move
from generation to generation until a pre-determined number of generations has elapsed or the change in the best fitness value is below a certain threshold. In our implementation we
have used a fixed number of generations as the termination criterion.
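The chromosome encoding, penalty-based fitness, crossover, and mutation described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the conflict and size data are placeholders, self-conflicts are not distinguished from parallel conflicts, and survivor selection is done by truncation (keeping the P fittest of P + M) rather than fitness-proportional sampling.

```python
import random

def fitness(chrom, sizes, bank_caps, conflicts):
    """Fitness of a chromosome: negated unresolved conflicts.
    chrom[i] is the bank of section i (0 = off-chip, 1..m = on-chip banks)."""
    # Penalize invalid layouts: an on-chip bank whose sections overflow it.
    used = {}
    for sec, k in enumerate(chrom):
        used[k] = used.get(k, 0) + sizes[sec]
    if any(k != 0 and space > bank_caps[k] for k, space in used.items()):
        return -10**9                      # very low fitness for invalid layouts
    # Conflicting sections mapped to the same on-chip bank stay unresolved.
    unresolved = sum(b for (s, t), b in conflicts.items()
                     if chrom[s] == chrom[t] != 0)
    return -unresolved

def evolve(d, m, sizes, bank_caps, conflicts, P=30, M=30, gens=50, pmut=0.1):
    """Evolve a placement for d data sections over m on-chip banks."""
    pop = [[random.randint(0, m) for _ in range(d)] for _ in range(P)]
    for _ in range(gens):
        offspring = []
        for _ in range(M // 2):
            a, b = random.sample(pop, 2)
            site = random.randint(1, d - 1)            # crossover site
            c1, c2 = a[:site] + b[site:], b[:site] + a[site:]
            for c in (c1, c2):                         # mutate one gene
                if random.random() < pmut:
                    c[random.randrange(d)] = random.randint(0, m)
            offspring += [c1, c2]
        # From the P + M individuals, the P fittest survive.
        pop = sorted(pop + offspring, reverse=True,
                     key=lambda c: fitness(c, sizes, bank_caps, conflicts))[:P]
    return pop[0]
```

The decimal gene range (0..m) directly avoids the non-existent-bank problem of a bit-vector encoding when m + 1 is not a power of 2.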
We have also developed a simulated annealing (SA) approach for the data layout problem and experimented with it on some of the applications. The performance of SA is comparable to that of GA; however, SA takes more time to arrive at the solution.
Hence we did not consider the SA approach for data layout further in this thesis.
3.5
Heuristic Algorithm
As mentioned earlier, the data layout problem is NP-complete. Further, the ILP and GA methods described in the previous sections consume significant run-time to arrive at a solution, and these methods are suitable only for obtaining an optimal data layout for a fixed memory architecture. But to perform memory architecture exploration, a problem addressed in the following chapters, data layout needs to be performed for thousands of memory architectures, and it is critical to have a fast heuristic method for data layout. Using an exact solution method such as Integer Linear Programming (ILP), or an evolutionary approach such as GA or SA, which takes as much as 20 to 25 minutes of computation time for each data layout problem, may be prohibitively expensive for the memory architecture exploration problem. Hence in this section we propose a 3-step heuristic method for data placement.
3.5.1 Identifying Data for Internal Memory
The first step in data layout is to identify and place all the data sections that are frequently
accessed in the internal memory. Data sections are sorted in descending order of frequency per byte (FPB_i), defined as the ratio of the number of accesses to the size of the data section. Based on the sorted order, data sections are greedily identified for placement in internal memory while free space is available. We refer to all the on-chip memory banks together as internal memory. Note that the data sections are not placed at this point but only identified for internal memory placement. The actual placement decisions are taken later as explained below.
Once all the data sections to be placed in internal memory are identified, the remaining
sections are placed in external memory. The cost of placing data section i in external
memory is computed by multiplying the access frequency of data i with the wait-states
of external memory. The placement cost is computed for all the data sections placed in
the external memory.
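The first step can be sketched as a simple greedy pass; the section data below is illustrative:

```python
def identify_internal(sections, capacity):
    """Greedy step 1: mark sections for internal memory in decreasing
    frequency-per-byte (FPB) order while free space remains.
    sections: dict name -> (accesses, size)."""
    ranked = sorted(sections,
                    key=lambda n: sections[n][0] / sections[n][1],
                    reverse=True)
    internal, external, free = [], [], capacity
    for name in ranked:
        size = sections[name][1]
        if size <= free:
            internal.append(name)     # identified for internal placement
            free -= size
        else:
            external.append(name)     # spills to external memory
    return internal, external
```

Sections returned in `external` then accrue the external-memory cost (access frequency times wait states) described above.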
3.5.2 DARAM and SARAM Placement
The objective of the next two steps is to resolve as many conflicts (self-conflicts and parallel conflicts) as possible by utilizing the DARAM memory and the multiple banks of SARAM. Self-conflicts can only be avoided if the corresponding data section is placed in DARAM. On the other hand, parallel conflicts can be avoided in two ways: either by placing the conflicting data a and b in two different SARAM banks, or by placing the conflicting data a and b in any DARAM bank. But the former solution is attractive, as the SARAM area cost is much less than the DARAM area cost. Considering that self-conflicts can only be avoided by placement in DARAM and that the cost of DARAM is very high, data placement decisions in DARAM need to be made very carefully. Also, many DSP applications have large self-conflicting data, and DARAM placement is crucial for reducing the run-time of an application.
The heuristic algorithm considers placement of data in DARAM as the first step in
internal memory placements. Data sections that are identified for placement in internal
memory are sorted based on the self-conflicts per byte (SPB_i), defined as the ratio of self-conflicts to the size of the data section. Data sections in the sorted order of SPB_i are placed in DARAM memory until all DARAM banks are exhausted. The cost of placing data section i in DARAM is computed and added as part of the overall placement cost.
Once the DARAM data placements are complete, SARAM placement decisions are
made. Figure 3.3 explains the SARAM placement algorithm. Parallel conflicts between
data sections i and j can be resolved by placing conflicting data sections in different
SARAM banks. The SARAM placement starts by sorting all the data sections identified for placement in internal memory based on the total number of conflicts (TC_i), the sum of all conflicts of this data section with all other data sections, including self-conflicts. Note that all data sections, including the ones that are already placed in DARAM,
are considered while sorting for SARAM placement. This is because the data placement
in DARAM is only tentative and may be backtracked in the backtracking step if there is
a larger gain (i.e., more parallel conflicts resolved) in placing a data section i in DARAM
instead of one or more data sections that are already placed in DARAM.
During the SARAM placement step, if the data section under consideration is already
placed in DARAM then it is ignored and the next data section in the sorted order is
considered for SARAM placement. The placement cost for placing data section i in SARAM bank b is computed considering all the data sections already placed in the DARAM and SARAM banks. Among these, the memory bank b that results in the minimum cost is chosen.
Next, the heuristic backtracks to find if there is any gain in placing data i in DARAM
by removing some of the already placed data sections from DARAM. This is done by
considering the size of data section i and the minimum placement cost of data i in SARAM.
If there are one or more data sections (refer to this set of data as the daram-remove-set) in DARAM with total size more than the size of data section i, and the sum of self-conflicts of all these data sections is less than the minimum placement cost of data i in SARAM, then there is potentially a gain in placing data i into DARAM by removing the daram-remove-set. Note that it is only a possibility and not a certain gain.
Once a daram-remove-set is identified, to ensure that there is a gain in the backtracking step, the data sections that are part of the daram-remove-set need to be placed in SARAM banks and the minimum placement cost has to be computed again for each of the data sections.
If the sum of the minimum placement costs of all data sections in the daram-remove-set is greater than the original minimum cost of placing data i in SARAM bank b, then there is no gain in backtracking, and data section i is placed in the SARAM banks. Otherwise, there is a gain in backtracking: the daram-remove-set is removed from DARAM and placed in SARAM, and data section i is placed in DARAM. The overall placement cost is updated. This
process is repeated for all data sections identified to be placed in the internal memory.
The overall placement cost gives the memory cycles (M_cyc) for placing application data for a given memory architecture.
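The gain test at the heart of the backtracking step can be sketched as follows; the cost inputs are placeholders for the placement-cost computation described above:

```python
def backtrack_gain(saram_cost_i, remove_set, saram_cost_of):
    """Decide whether evicting remove_set from DARAM to make room for
    section i is worthwhile.
    saram_cost_i : minimum cost of placing section i in any SARAM bank
    remove_set   : candidate daram-remove-set of sections to evict
    saram_cost_of: function giving each evicted section's minimum SARAM cost
    Returns True when the evicted sections cost less in SARAM than
    section i would, i.e., i should move into DARAM.
    """
    evicted_cost = sum(saram_cost_of(s) for s in remove_set)
    return evicted_cost < saram_cost_i
```

In the full heuristic this check is re-run with freshly computed SARAM costs for the evicted sections, since their best banks may change once they leave DARAM.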
3.6 Experimental Results
3.6.1 Methodology
In this section, we explain the methodology used in our experiments. For our experiments,
the main inputs are the access characteristics of the data sections. We need the sizes of
the data sections, access frequency of each of the data sections and the conflict matrix.
The access frequency and the conflict matrix are obtained from a software profiler. Since
the DSP applications typically have simple control flow, the profile information on the
access characteristics does not change very much from run to run.
We have used the Texas Instruments TMS320C55XX processor [76] for our experiments. This processor has three 16-bit memory read buses and two 16-bit memory write buses, and has the capability to read three 16-bit data and write two 16-bit data in the
same clock cycle. The memory architecture of the 55X device is given in Table 3.2. Note
that the total memory size is 72 Kwords and is large enough to fit each of the instances
of all the four applications reported in Table 3.4.
We have used the Texas Instruments Code Composer Studio V2.2 [73] to run the
applications. Initially the applications are compiled with the CCS2.2 compiler with the
default memory placement made by the compiler. The compiled application is loaded
and simulated in the simulator to obtain the profile information and the conflict matrix,
which are inputs to the heuristic and the Genetic algorithms.
We have developed a framework which automatically generates the ILP formulation for
an embedded application. The input for the ILP formulation generator, specified in XML
format, are application parameters, memory configuration parameters, and the profile
data parameters. To obtain profile information, we developed a profiler and integrated it
with the C5510 Instruction Set Simulator (ISS) [73].
The ILP formulation is solved using a public domain LP solver [3], viz., lp_solve.
3.6.2 Results of the ILP Formulations
To compare the performance of the different layouts, we consider the number of memory stall cycles and sometimes the MIPS consumed, a metric commonly used in embedded systems design. MIPS consumed refers to the processing capability required to guarantee real-time performance for a given application. Thus, the higher the MIPS consumed, the lower the performance of the layout. Conversely, the lower the MIPS consumed, the more applications, or the more instances of the same application, can be run on the embedded device,
guaranteeing real-time performance for all of these instances. MIPS consumed for a given
data layout is obtained by running the application on the simulator with the given layout
and for the given memory architecture. We report these numbers when a single instance
of the application is run on the embedded system.
First, we report the MIPS consumed by the optimal solution obtained using the following formulations for the four applications.
The basic formulation is used to get a data layout considering only the internal and
external memory. The internal memory is considered as one single SARAM bank
of size 12K words. We use this as the baseline model. In Table 3.3, we report the
normalized MIPS consumed, the number of variables and the number of constraints
for each ILP formulation, and the time taken on a 900 MHz Pentium III machine^1
to solve the ILP problem.
The on-chip memory is split into multiple SARAM banks (three 4K word SARAM
banks) and the formulation that handles multiple memory banks was used. The
results for this case (refer to Table 3.3) show 14%, 16%, 3.8%, and 7.1% performance
improvement over the baseline case for the four applications.
Next, the basic formulation is extended to handle SARAM and DARAM banks. For
this formulation we assumed that the internal memory consists of 2 banks, one, of 8K
word size, supporting single access (SARAM) and another, of 4K word, as DARAM.
This optimization gives a 16%, 18%, 4.8% and 7.4% performance improvement over
the baseline case for the four applications.
The last experiment is performed by enhancing the basic formulation to handle both
multiple banks, and bank types (SARAM and DARAM). The memory configuration
considered here consists of two 4K word SARAM banks and one 4K word DARAM
^1 These experiments were run in 2002 on a set of proprietary benchmarks with a desktop configuration which was state of the art at that time. We are unable to repeat the experiments on a more modern platform due to portability reasons. We have, however, run another set of benchmarks on a recent desktop configuration, and the results are reported in Section 3.6.3.
bank. This optimization exploits both multiple memory banks and dual access
capabilities of the scratch pad memory and gives a significant reduction (28%, 30%,
9% and 13.8%) in MIPS consumption over the baseline case.
We remark that the somewhat lower performance improvement in Appln. 3 and Appln. 4
could be due to the fact that these are kernel codes where multiple simultaneous memory
accesses are not fully exploited.
Table 3.3: Experimental Results

Application  Parameter                   Baseline  Multi-bank  SARAM+DARAM  Multi-bank+DARAM
Appln. 1     Normalized MIPS consumed    1.0       0.86        0.84         0.72
             ILP number of variables     22        60          60           60
             ILP number of constraints   13        37          37           37
             Time taken to solve         2 sec     10 sec      10 sec       10 sec
Appln. 2     Normalized MIPS consumed    1.0       0.84        0.82         0.72
             ILP number of variables     60        152         152          152
             ILP number of constraints   32        88          88           88
             Time taken to solve         90 sec    480 sec     480 sec      480 sec
Appln. 3     Normalized MIPS consumed    1.0       0.96        0.95         0.91
             ILP number of variables     16        38          38           38
             ILP number of constraints   10        24          24           24
             Time taken to solve         110 sec   260 sec     260 sec      260 sec
Appln. 4     Normalized MIPS consumed    1.0       0.93        0.93         0.86
             ILP number of variables     14        29          29           29
             ILP number of constraints   9         12          12           12
             Time taken to solve         1 sec     1 sec       1 sec        1 sec
Next, we compare our solution with hand-optimized data layout. For the first two applications (referred to as Appln. 1 and Appln. 2), we were able to obtain the hand-optimized memory placement done by application developers. The application developers had performed hand-optimization to come up with a number of different code and data layouts for a fixed cost, measured in terms of DARAM size, a standard practice followed in the industry. For each application, the application developers consider different configurations, differing in terms of DARAM sizes, and obtain different optimal code and data layouts for each of the configurations. Among these, they pick the one which maximizes the performance without incurring excessive cost. This process took approximately one man-month for each of the two applications. This includes the time to analyze
all the layouts and come up with the best solution that delivers maximum performance
by occupying minimum DARAM size. To compare the hand-optimized solution, we used
the best case memory configuration, different for different applications, and obtained the
optimal layout using our ILP formulation. We compared the hand-optimized layout with
our optimal solution. The result generated by our formulation marginally betters the
hand-optimized memory placement in terms of MIPS consumption for the same DARAM
size. More importantly, it should be noted here that our approach can be automated to
solve the optimal data layout problem for different memory configurations and pick the
most appropriate one.
Lastly, we measure how the time taken to obtain an optimal solution in our approach increases with the increasing complexity of the application. For this purpose, we considered Application 2 (3 instances of a standard DSP application) to be running simultaneously on the embedded device. We considered the same memory architecture with 80K of off-chip memory, one 4K DARAM bank and two 4K SARAM banks. The complete formulation is considered in this case, which can handle multiple memory banks, DARAM, overlay and swapping. The number of variables and the number of constraints in the formulation increased to 152 and 88 respectively. Even with the increased complexity, the time for obtaining an optimal solution remained within a few minutes (8 minutes on a 900 MHz Pentium III PC and 3 minutes on an UltraSPARC-2). Even after including
the one-time simulation and profiling cost, the total time taken by our approach is within
30 minutes. Thus the proposed ILP approach can result in a significant reduction in the
system integration time (from a few man-months to a few hours) even for moderately
complex systems.
From the experiments it is clear that our unified approach opens up many optimization possibilities. We observe that when applications are used in multi-channel systems
(instantiated multiple times), the overlay optimization can result in significant reduction
in the on-chip memory usage with the same MIPS consumption.
Thus, we anticipate that for a more complex application, the performance of our
optimal data section layout solution will exceed the hand-optimized performance.
3.6.3 Performance of the Heuristic and the GA
To evaluate the performance of the heuristic and the GA, we used the same four embedded DSP applications described in the previous section. Table 3.4 reports the performance of our heuristic method and our GA. Column 1 shows the benchmark and column 2 indicates the number of instances of this module in the application. Column 3 shows the number of data sections in the application. Column 4 is the number of conflicts (sum of both parallel and self conflicts) without any optimization. Column 5 indicates the unresolved conflicts when the heuristic placement algorithm is used for data layout. Similarly, column 6 shows the number of unresolved conflicts for the genetic algorithm. We observe that both methods eliminate more than 90% of the total number of conflicts. In the case
of the JPEG decoder, the algorithms resolved all the conflicts and obtained an optimal
memory placement. Although the performances, in terms of the unresolved conflicts, of
the heuristic and GA method are comparable, the GA method performs better for moderate and large problems. We observe that for large problems the ILP method could
not get the optimal solution even after hours of computation. Further, we believe that
by tuning some of the parameters of the GA method (e.g., the cross-over and mutation
probabilities, the size of the population, and the number of generations), the GA method can be made to perform significantly better even for large applications, and obtain close to optimal solutions.
Table 3.4: Results from Heuristic Placement (HP) and Genetic Placement (GP) on 4
Embedded Applications. VE = Voice Encoder, JP = JPEG Decoder, LLP = Levinson's
Linear Predictor, 2D = 2D Wavelet Transform.

App  # instances  # data sections  # conflicts  # conflicts (HP)  # conflicts (GP)  # conflicts (ILP)
VE   1            12               356813       436               130               0
VE   2            24               713626       1132              606               112
VE   3            36               1070439      13449             11883             no result
VE   4            48               1427252      67721             62390             no result
VE   5            60               1784065      129931            126365            no result
VE   6            72               2140878      190134            180591            no result
JP   1            14               94275        0                 0                 0
JP   2            28               188440       0                 0                 0
JP   3            42               282825       0                 0                 0
JP   4            56               377100       0                 0                 0
LLP  16           80               556992       19639             19386             no result
LLP  32           160              1113984      165865            164395            no result
2D   4            23               5632707      33368             32768             32768
2D   6            31               6727827      35768             33968             32768
2D   8            39               7821867      37568             36368             32768
2D   10           47               8916447      45638             38168             36368
3.6.4 Comparison of Heuristic and GA Data Layouts
We compared our heuristic data layout performance with the GA's data layout. Figure 3.5 presents the normalized performance of the heuristic data layout. We randomly picked 100 different architectures for obtaining data layouts for the Voice Encoder application. For these 100 points, we ran both the GA and our heuristic algorithm. The x-axis represents the test
Figure 3.4: Relative performance of the Genetic Algorithm w.r.t. Heuristic, for Varying
Number of Generations
Figure 3.5: Comparison of Heuristic Data Layout Performance with GA Data layout
case identifier from 1 to 100. The y-axis presents the memory stall cycles of the heuristic normalized by the GA's memory stall cycles (M_cyc^heu / M_cyc^ga). It can be observed that the heuristic method performs as well as the GA for most of the points. The worst performance of the heuristic is approximately 25% below the GA's performance, for two of the test cases. However, on average the heuristic performs at 98% efficiency, in terms of solution quality, as compared to the GA. The execution time of the GA for completing all 100 data layouts is approximately 22 hours, whereas the execution time of the heuristic for completing all 100 placements is less than a second on a Pentium 4 desktop machine with 1 GB of main memory operating at 1.7 GHz. Thus the heuristic method is an attractive option, providing efficient solutions in very little execution time, for solving the large number of data layout problems required in memory architecture exploration. Note that for some of the points the heuristic performs better than the GA; this may be due to the termination of the GA after a fixed number of generations. Based on the above results we can conclude that the heuristic algorithm is fast and also very efficient.
3.6.5 Qualitative Comparison of the Three Approaches
In Table 3.5, we provide a qualitative comparison of the heuristic algorithm, the genetic algorithm, and the ILP-based approach for the data layout problem. We see that the run-time
of the heuristic is the lowest among the three approaches. The run-time of the Genetic
algorithm depends on the number of generations, the population size P , and the number
of offspring per generation M . For the four large test cases of Table 3.4, the run-time
of the heuristic algorithm was of the order of 1 second, whereas the GA took about 20
minutes to complete. The ILP approach, on the other hand, required several hours to
converge to the optimal solution. As a matter of fact, a public domain ILP solver could not converge in 24 hours for the 6-instance Voice Encoder and the 32-instance Levinson's LPC applications. This clearly demonstrates that the GA and the heuristic methods are attractive from the point of view of quickly solving the data layout problem. From the viewpoint of optimality of solutions, the ILP is guaranteed to converge to the optimal solution and is hence ranked the best. The GA comes second since it provides better solutions for
larger problem instances. From the viewpoint of flexibility, we believe GA is most flexible,
since the cost function can be easily reflected in the fitness measure. For example, if the
power or energy dissipation must be minimized, we can modify the fitness function to
be a weighted average of the power and performance metrics and still reuse the genetic
algorithm framework. This becomes difficult for the heuristic algorithm to simultaneously
optimize both performance and power.
From Table 3.5, we see that each of the three approaches for the data placement
problem has a definite advantage over the other two methods in terms of run-time, quality,
or flexibility. This point can be exploited as follows. The algorithms presented in the
previous section are intended to optimize the placement of data sections for a given
memory architecture. Often, the designers of the SoC have the flexibility to change the
memory architecture. It would be ideal if the memory architecture optimization and
the data section placement were to be done concurrently. This is a classic example of
hardware-software codesign. We propose a multi-objective Genetic Algorithm technique
for memory architecture exploration, where the number and size of the SARAM and
DARAM banks can be determined through a combinatorial search process. For each of the memory architectures considered, a quick fitness can be computed based on the cost of the best placement obtained using the heuristic algorithm. The heuristic is an ideal choice for computing this bound since it is the fastest approach, requiring less than one
second of run time. After a small number of competing memory architectures have been
shortlisted through this procedure, the GA can be used to explore the best placement of
data sections for each of the memory configurations.
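The two-phase flow suggested above can be sketched as follows, with `heuristic_cost` and `ga_cost` standing in for the actual data layout algorithms:

```python
def explore(architectures, heuristic_cost, ga_cost, shortlist_size=5):
    """Screen candidate memory architectures with the fast heuristic,
    then refine only a shortlist with the slower, higher-quality GA."""
    # Phase 1: cheap bound (heuristic data layout) for every candidate.
    ranked = sorted(architectures, key=heuristic_cost)
    shortlist = ranked[:shortlist_size]
    # Phase 2: expensive GA data layout only for the shortlisted few.
    return min(shortlist, key=ga_cost)
```

With a sub-second heuristic and a roughly 20-minute GA, this keeps the exploration of thousands of architectures tractable while still applying the stronger optimizer where it matters.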
3.7
Related Work
Several efficient heuristic approaches for data layout have been published in the literature
[10, 34, 40, 44, 53, 55, 58, 67, 71, 79]. These can be classified as static and dynamic
methods. In static data layout, the memory addresses for all data variables are decided at
compile time and do not change at run-time. In dynamic data layout [79], on-chip SPRAM
is reused by overlaying many data variables to the same address. Thus, two addresses are
assigned to a variable at compile time, namely, a load address and a run address. A variable is loaded at the load address and copied to the run address at run-time. At the cost of
important as typically programmable processors have equal-sized memory banks. This work uses benchmarks written in C, and hence the conflict graph is very sparse and only bipartite graphs are obtained from the compiler. Because of this, they could resolve all the conflicts, and their main focus is only on balancing the data partitions. But typically DSP applications have dense conflict graphs, as the critical part of the software is developed in hand-optimized assembly. Their work does not address parallel conflicts between more than two arrays. Also, they do not consider dual-access RAMs. Their objective is to reduce the data conflicts and improve run-time.
In [71], Sundaram et al. present an efficient data partitioning approach for data arrays on limited-memory embedded systems. They perform compile-time partitioning of data segments based on the data access frequency. The partitioned data footprints are placed in local or remote memory with the help of a 0/1 knapsack algorithm. Here the data partitioning is performed at a finer granularity, and because of this, the address computation needs to be modified for functional correctness. In contrast, in our work, the data partitioning is performed at the data section level, and the data layout optimization is performed by considering a more complex on-chip memory architecture with multiple single- and dual-port memory banks. Further, no additional address computation or modification to address computation is required in our approach.
Kulkarni et al. [41] present formal and heuristic algorithms to organize the data in the main memory with the objective of reducing cache conflict misses. In [55], a
data partitioning technique is presented that places data into on-chip SRAM and data
cache with the objective of maximizing performance. Based on the life times and access
frequencies of array variables, the most conflicting arrays are identified and placed in
scratch pad RAM to reduce the conflict misses in the data cache. This work addresses
the problem of limiting the number of memory stalls by reducing the conflict misses
in the data cache through efficient data partitioning. Our work addresses the problem
of reducing the memory stalls by efficient data partitioning within the on-chip scratch
pad RAM itself. Also our work addresses the data layout for DSP applications, where
resolving self and parallel conflicts by efficient partitioning of data variables is very critical
for achieving real-time performance. Lastly, the memory architecture considered in the
initial part of our thesis does not have data cache; in Chapter 7, we consider memory
architecture with on-chip RAM and caches.
3.8
Conclusions
In this chapter, we described three approaches to solve the data placement problem in embedded systems. Given a memory architecture, the placement of data sections is crucial to the performance of the system. Badly placed data can result in a large number of memory stalls. We consider a memory architecture that consists of on-chip single-access RAM with multiple memory banks, on-chip dual-access RAM, and external RAM. We
analyze the application for data conflicts using a profiling tool and create a matrix representation of the conflict information. We present three different methods to address
data layout problem: (a) ILP formulation, (b) Genetic Algorithm and (c) Greedy backtracking heuristic algorithm. The greedy back-tracking heuristic and Genetic Algorithm
approaches outperform the ILP-based formulation in terms of the time taken to solve the data layout problem. However, the ILP and the GA methods produce better-quality results
especially for large-sized applications. The framework of the GA is generic enough to
permit other cost functions such as power dissipation [64] to be incorporated. In Chapter 5 we extend the GA formulation to consider performance and power minimization.
Similarly, the GA can also be extended to concurrently explore alternative memory architectures [54]; this is possible by changing the representation of the chromosome and
reworking the crossover and mutation operations.
Chapter 4
Logical Memory Exploration
4.1 Introduction
In the previous chapter, we discussed data layout methods to find optimal and near-optimal placement of data for a given fixed memory architecture for embedded DSP
processors. In this chapter we focus on memory architecture exploration for a given application, with memory performance (reduced memory stalls) and memory area as the objectives. In Chapter 5 we extend our approach to consider power consumption as an additional objective.
As discussed in Chapter 1, embedded systems are application specific and hence embedded designers study the target application to understand the memory architecture requirements. DSP applications are typically data intensive and require very high memory
bandwidth to meet real-time requirements. There are two steps to designing an optimal
memory architecture for a given application. The first step is to find the right memory architecture parameters that are important for improving the target application's performance
and the second step is to optimally map the given application on to the memory architecture under consideration. This leads to a two-level optimization problem with multiple
objectives. At the first level, an appropriate memory architecture must be chosen which
includes determining the number and size of each memory bank, the number of memory
ports per bank, the types of memory (scratch pad RAM or cache), wait-states/latency
etc. Thus the number of memory architectures possible for an SoC for a given application is large. The objective functions at this level are the memory system cost, performance,
and power dissipation. However, the performance for a given application for a given architecture depends on the appropriate placement of code and data sections in the various
on-chip memory banks or the off-chip memory modules. Hence, at the next level, for a
given application, the code and data sections must be placed optimally in memory to
minimize the number of stall cycles. As discussed in the previous chapter, the number
of placements for a given architecture are also many. Thus the number of memory architectures and the number of data placements are formidably large. Hence finding an optimal solution for the memory space exploration problem involves exploring a very large design space. A performance-optimal solution may not be optimal in terms of cost or power
consumption. In this solution space there are several interesting design points known as
Pareto-optimal points for the embedded system design. This is especially the case as an
embedded system designer typically designs multiple variants of an embedded product (to
meet different market segments) and hence would want to obtain several good solutions
which may make sense for different application segments. Hence the memory space exploration problem should identify multiple Pareto-optimal design points. Further, since
embedded system products are often designed under tight time-to-market constraints, the
resources available for such an optimization process are limited. To make the problem
more complex, the market space is volatile and frequently the top-level specification and
architecture may be redefined during the life cycle of a product.
In this chapter, we propose an efficient methodology for the memory architecture
exploration of the DSP core¹. We concentrate mainly on the DSP core as it largely
determines the performance of the embedded application. We consider both on-chip and
off-chip memory space exploration for the DSP core. The memory architecture exploration
problem involves identifying the appropriate memory architectures for a given application,
in terms of performance, power consumption and cost. As mentioned earlier, this involves
¹ In addition to the DSP core, an embedded SoC will have a micro-controller which has embedded memory. In this thesis we do not focus on the memory system design of embedded microcontrollers, though many of the methods proposed may be applicable to microcontrollers as well.
solving two interacting problems: (a) memory architecture exploration and (b) data layout
optimization for the architecture considered.
Previous work on data layout [10, 41, 44, 53, 58, 71] has focused on addressing the
layout problem independently, for a given memory architecture, with the objective of improving either application run-time or energy consumption. Also the previous
work in this area has addressed the data layout either for memory architecture on the
embedded side (microcontroller), where they do not consider dual-port memories, or on
the DSP side, where the on-chip/off-chip partitioning is not considered. A detailed comparison with related work is presented in Section 4.6. To the best of our knowledge,
there is no work which considers integrating memory architecture exploration and data
layout to explore memory design space by targeting multiple objectives. This integrated
approach is very critical for navigating the search space in the right direction and obtaining multiple Pareto-optimal design points.
In this chapter we propose an iterative two level integrated approach for the data layout
and memory exploration problem. At the outer level, for the architecture exploration, we have used a multi-objective evolutionary algorithm. We propose both a Genetic Algorithm
(GA) formulation and a Simulated Annealing (SA) formulation for this problem. For
the inner level, i.e., the data layout problem, we have used a simple and fast heuristic
algorithm described in Section 3.5; this is because the data layout problem is solved
for several thousands of memory architectures. As discussed in Chapter 3 the heuristic
algorithm proposed there performs reasonably well in reducing the memory stalls and at
the same time obtains the data layout in very little computation time (less than 1msec).
In comparison, the GA or the ILP approach takes a few minutes to a few hours for each data layout problem, which becomes prohibitively expensive when solving for a large number of memory architectures.
The main contributions of this chapter are: (a) proposing an iterative two-level solution to address the data layout and architecture exploration as an integrated problem; (b) proposing performance (in terms of memory stalls) and memory area as two objectives for the memory exploration framework; and (c) proposing a memory exploration
framework that is fully automatic. The proposed memory exploration framework is flexible and can be configured to explore additional memory architecture parameters. Also
the framework is scalable and additional objectives like power consumption can be added
easily. We have used four different multimedia and communication applications for our experiments. Our proposed memory exploration method gives 130-200 Pareto-optimal
design choices (memory architectures) for each of the applications.
The rest of the chapter is organized as follows. Section 4.2 provides necessary background on the data layout and memory architecture exploration. Section 4.3 describes the
multi-objective Genetic Algorithm (GA) formulation of the memory exploration problem.
Section 4.4 explains the Simulated Annealing (SA) formulation of the memory architecture exploration problem. Section 4.5 reports the experimental results. Section 4.6 describes
the related work. We present conclusions and future work in Section 4.7.
4.2 Method Overview

4.2.1 Memory Architecture Parameters
As discussed in section 2.1.1, the memory architecture of a DSP processor has to support a
high bandwidth to satisfy the needs of data memory intensive DSP applications. As shown
in the Figure 4.1, the memory architecture of a DSP processor is organized as multiple
memory banks, where each bank can be accessed independently to enable parallel accesses.
In addition, each of the banks can be a single-port or a dual-port memory. For now we
assume that the memory banks with single ports have the same size and similarly the
memory banks with dual-ports have the same size. Also, at this point, we only consider
a logical view of the memory architecture. How the different (logical) memory banks are
realised using different physical memories from a given ASIC design database, and how
they impact the power, performance and cost of the memory architecture will be discussed
in the next chapter. Choosing the appropriate physical memory architecture is a design
space exploration process. We use the terms logical memory exploration and physical
memory exploration to clearly distinguish between the two. This chapter concentrates on
the former whereas the following chapter deals with the latter.
Table 4.1 describes the memory types and parameters. There are three types of memory: single-ported on-chip memory (Sp), dual-ported on-chip memory (Dp) and external memory. The parameters to be explored are the number of Sp memory banks (Ns), the
Sp bank size (Bs ), the number of Dp memory banks (Nd ), the Dp bank size (Bd ) and the
size of external memory (Es). For example, the 64KB on-chip memory of the C5503 DSP is organized as 12 single-port memory banks of 4KB each and 4 dual-port memory banks of 4KB each. For
this example, the parameters can be specified as Ns = 12, Bs = 4096, Nd = 4, Bd = 4096
and Es = 0.
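As an illustration, the parameter set of Table 4.1 can be captured in a few lines of code. The Python sketch below is ours, not part of the thesis framework; names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class MemArch:
    """Logical memory architecture: the 5-tuple (Ns, Bs, Nd, Bd, Es) of Table 4.1."""
    Ns: int  # number of single-port (SARAM) banks
    Bs: int  # single-port bank size in bytes
    Nd: int  # number of dual-port (DARAM) banks
    Bd: int  # dual-port bank size in bytes
    Es: int  # external memory size in bytes

    def on_chip_size(self) -> int:
        # all single-port banks share one size, as do all dual-port banks
        return self.Ns * self.Bs + self.Nd * self.Bd

# The C5503 organization from the text: 12 x 4KB single-port and 4 x 4KB dual-port banks.
c5503 = MemArch(Ns=12, Bs=4096, Nd=4, Bd=4096, Es=0)
assert c5503.on_chip_size() == 64 * 1024   # 64KB of on-chip memory
```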
4.2.2
We consider the following two objectives while exploring the (logical) memory design
space: (a) Memory Stall Cycles (which is a critical component of system performance)
and (b) Memory cost. A meaningful estimation of the power consumed can be made only
with the physical memory architecture, and hence we defer the power objective to the
following chapter on physical memory exploration.
Memory cycles (Mcyc ) is the sum of all memory stall cycles where the CPU is waiting
for memory. This includes stall cycles spent in on-chip memory bank conflicts and off-chip
memory latency. Our objective is to minimize the number of stall cycles (Mcyc ). It is very
critical to have an efficient data layout algorithm to obtain a valid Mcyc . Note that if an
efficient data layout algorithm is not used then the data mapping may not be optimal
leading to a higher Mcyc even for a good memory architecture. This may lead the memory architecture exploration search in a completely wrong direction.
Memory cost is directly proportional to the silicon area occupied by memory. Since
the memory silicon area is dependent on the silicon technology, memory implementation,
and the ASIC cell library that is used, instead of considering the absolute silicon area
numbers, for now, we consider the relative (logical) area. The memory cost is defined by equation (4.1):

Mcost = (Ws · Ns · Bs + Wd · Nd · Bd + We · Es) / Ds    (4.1)

where Ws, Wd and We are the relative cost weights of single-port, dual-port and external memory respectively, and Ds is the total application data size; the external memory access latency, used in computing the memory cycles, is set to 10. We normalize the memory cost with the total data size in equation (4.1).
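A minimal sketch of equation (4.1), using the weight values Ws = 1, Wd = 3 and We = 0.05 reported with the experiments later in this chapter (the function name and Python rendering are ours):

```python
def memory_cost(Ns, Bs, Nd, Bd, Es, data_size, Ws=1.0, Wd=3.0, We=0.05):
    """Relative (logical) memory cost of equation (4.1), normalized by total data size."""
    return (Ws * Ns * Bs + Wd * Nd * Bd + We * Es) / data_size

D = 64 * 1024   # illustrative total application data size
# All data off-chip -> cost 0.05; all data in DARAM banks -> cost 3.0,
# matching the 0.05-3.0 range of the x-axis discussed in Section 4.5.
print(memory_cost(0, 0, 0, 0, Es=D, data_size=D))      # -> 0.05
print(memory_cost(0, 0, 16, D // 16, 0, data_size=D))  # -> 3.0
```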
The objective of the memory architecture exploration is to find a set of Pareto optimal
memory architectures that are optimal in terms of performance for a given area. The
two objectives, memory architecture performance and memory area, are conflicting and
hence this is not a simple minimization or maximization problem. For such multi-objective problems, it is not always possible to compare two solutions. For example, if we are evaluating a set of solutions on a pair of objectives, say M1 and M2, and both objectives need to be minimised, then two solutions cannot be compared if each is better than the other in
terms of one of the objectives. To evaluate a design solution we use the Pareto Optimality
condition described in section 2.5.
4.2.3
Solving the memory architecture exploration problem involves solving two interacting
problems: (a) architecture exploration and (b) data layout optimization for the architecture considered. The data layout determines the performance of the given application in
the considered architecture, which in turn determines whether the memory architecture is to be considered further in the design space exploration. We propose a two-level iterative
approach for this problem.
At the outer level, for the memory architecture exploration, we have tried two evolutionary approaches: (a) a multi-objective Genetic Algorithm formulation and (b) Simulated Annealing. We have compared the results from GA and SA and find that GA performs
better for our multi-objective problem. At the inner level, for the data layout, we use the
Greedy back-tracking heuristic algorithm described in section 3.5.
In the next two sections the GA and SA formulations for memory architecture exploration are described.
4.3
In this section, first, we describe the GA formulation for the memory architecture exploration. Section 4.3.2 deals with how the non-dominated GA [25] approach has been used
to perform the multi-objective search. The multiple objectives considered in this chapter
are, reducing memory cycles (Mcyc ) and memory cost (Mcost ) as explained in Section 4.2.2.
4.3.1
To map an optimization problem to the GA framework, we need the following: chromosomal representation, fitness computation, selection function, genetic operators, the
creation of the initial population and the termination criteria.
For the memory exploration problem, each individual chromosome represents a memory architecture. A chromosome is a vector of 5 elements: (Ns , Bs , Nd , Bd , Es ). The
parameters are explained in Section 4.2.1 and they represent the number and size of single
and dual-ported memory banks and the size of the external memory.
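A hypothetical sketch of the chromosome encoding with a simple mutation operator; the candidate bank sizes, the bank limit, and the repair rule for the external memory size are illustrative assumptions, not the thesis implementation:

```python
import random

BANK_SIZES = [1024, 2048, 4096, 8192]   # candidate bank sizes (assumed)
MAX_BANKS = 16                          # assumed upper bound on banks per type

def random_chromosome(data_size):
    """A chromosome is the 5-tuple (Ns, Bs, Nd, Bd, Es)."""
    Ns = random.randint(0, MAX_BANKS)
    Bs = random.choice(BANK_SIZES)
    Nd = random.randint(0, MAX_BANKS)
    Bd = random.choice(BANK_SIZES)
    Es = max(0, data_size - (Ns * Bs + Nd * Bd))  # external memory holds the remainder
    return (Ns, Bs, Nd, Bd, Es)

def mutate(ch, data_size):
    """Perturb one randomly chosen gene, then restore capacity validity."""
    Ns, Bs, Nd, Bd, Es = ch
    gene = random.randrange(4)
    if gene == 0:
        Ns = random.randint(0, MAX_BANKS)
    elif gene == 1:
        Bs = random.choice(BANK_SIZES)
    elif gene == 2:
        Nd = random.randint(0, MAX_BANKS)
    else:
        Bd = random.choice(BANK_SIZES)
    Es = max(0, data_size - (Ns * Bs + Nd * Bd))
    return (Ns, Bs, Nd, Bd, Es)
```

The repair step mirrors the constraint stated later that total memory must cover the application data size.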
The fitness function computes the fitness for each of the individual chromosomes. For the memory exploration problem there are two objectives: Memory Cost (Mcost) and Memory Cycles (Mcyc). For each of the individuals, the fitness function computes Mcost and
Mcyc. The memory cost (Mcost) is computed from equation (4.1) based on the memory architecture parameters (Ns, Bs, Nd, Bd, Es) that define each chromosome.
The memory stall cycles (Mcyc) are obtained from the data layout that maps the application data buffers onto the given memory architecture, defined by the chromosome's parameters, with the objective of reducing memory stall cycles. We use the greedy back-tracking heuristic algorithm described in Section 3.5 for the data layout. The data layout
algorithm estimates the memory stall cycles after placing the application data buffers in
the given memory architecture.
Once Mcyc and Mcost are computed for all the individuals in the population, the individuals need to be ranked, so that chromosomes with lower fitness can be weeded out. The non-dominated points at the end of the evolution represent a set of solutions which provide interesting trade-offs between the two objectives. For a single-objective optimization problem, the ranking process is straightforward and is proportional to the objective. But for a multi-objective optimization problem, the ranking needs to be computed based on all the objectives. We describe how we do this in the following
subsection.
4.3.2

First we define Pareto optimality. Let (M_cost^a, M_cyc^a) be the memory cost and memory cycles of chromosome A and (M_cost^b, M_cyc^b) be the memory cost and memory cycles of chromosome B. Then A dominates B if

((M_cost^a < M_cost^b) ∧ (M_cyc^a ≤ M_cyc^b)) ∨ ((M_cyc^a < M_cyc^b) ∧ (M_cost^a ≤ M_cost^b))    (4.2)
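The dominance test of equation (4.2), and the resulting non-dominated filter, can be sketched as follows for (cost, cycles) pairs, both to be minimized:

```python
def dominates(a, b):
    """Eq. (4.2): solution a dominates solution b; each solution is (cost, cycles)."""
    (ca, ya), (cb, yb) = a, b
    return (ca < cb and ya <= yb) or (ya < yb and ca <= cb)

def non_dominated(points):
    """Points not dominated by any other point in the set."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

pts = [(1.0, 100), (2.0, 50), (1.5, 120), (3.0, 40)]
print(non_dominated(pts))   # -> [(1.0, 100), (2.0, 50), (3.0, 40)]
```

Here (1.5, 120) drops out because (1.0, 100) is better on both objectives.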
The non-dominated points at a given generation are those which are not dominated by any other point in the population. To maintain diversity among the solutions, a sharing and niche-count mechanism is applied within each rank, as follows.

Step 1: Compute the distance d_ij between every pair of solutions i and j in rank k as the normalized Euclidean distance:

d_ij = sqrt( ((M_cost^i − M_cost^j) / (M_cost^u − M_cost^l))² + ((M_cyc^i − M_cyc^j) / (M_cyc^u − M_cyc^l))² )    (4.3)

where M_cost^u and M_cost^l represent respectively the highest and lowest values for M_cost seen across all solutions; similarly M_cyc^u and M_cyc^l denote the highest and lowest number of stall cycles.
Step 2: The distance d_ij is compared with a pre-defined parameter σshare and the following sharing function value is computed [25]:

Sh(d_ij) = 1 − (d_ij / σshare), if d_ij < σshare; and Sh(d_ij) = 0 otherwise.
Step 3: Calculate the niche count for the i-th solution in rank k as:

m_i = Σ_{j=1..n_k} Sh(d_ij)

Step 4: Reduce the fitness f_k of the i-th solution in the k-th rank as:

f_i' = f_k / m_i
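Steps 1-4 can be sketched together as below. This is a simplified rendering: the extreme values M^u and M^l are computed over the points passed in, and the niche count includes the solution itself through Sh(0) = 1, so m_i ≥ 1:

```python
import math

def shared_fitness(rank_points, f_k, sigma_share):
    """Reduce fitness f_k of each (cost, cycles) solution in a rank by its niche count."""
    cu, cl = max(p[0] for p in rank_points), min(p[0] for p in rank_points)
    yu, yl = max(p[1] for p in rank_points), min(p[1] for p in rank_points)

    def dist(i, j):
        # normalized Euclidean distance, eq. (4.3); guard against zero ranges
        dc = (rank_points[i][0] - rank_points[j][0]) / ((cu - cl) or 1)
        dy = (rank_points[i][1] - rank_points[j][1]) / ((yu - yl) or 1)
        return math.hypot(dc, dy)

    def sh(d):
        # sharing function of Step 2
        return 1 - d / sigma_share if d < sigma_share else 0.0

    out = []
    for i in range(len(rank_points)):
        m_i = sum(sh(dist(i, j)) for j in range(len(rank_points)))  # niche count, Step 3
        out.append(f_k / m_i)   # crowded solutions receive lower fitness, Step 4
    return out
```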
The above steps have to be repeated for all the ranks. Note that the niche count
(mi ) will be greater than one for solution i that has many neighboring points. For a
lone distant point the niche count will be approximately 1. Thus greater fitness values
are assigned to points that do not have many close neighboring solutions, encouraging
them to get selected for reproduction. Once all the nk individuals in rank k are assigned
fitness based on the above steps, the minimum of fitness is taken as the starting fitness
and assigned to all the individuals in rank k + 1.
After some experimentation we fixed σshare at 0.6 as the initial value and decrease it to 0.25 based on the number of generations and the number of non-dominated points in rank 1.
The GA must be provided with an initial population that is created randomly. GAs
move from generation to generation until a pre-determined number of generations is seen
or the change in the best fitness value is below a certain threshold. In our implementation
4.4
4.4.1
Recall the architectural parameters used in memory architecture exploration and their
notations as summarized in Table 4.1.
A solution S to the memory subsystem optimization problem is a tuple (A, L) where
A represents the memory architecture and L represents the data layout. Two objective functions² are associated with each solution, namely, Memory Cost (Mcost) and
² Energy consumption (Menergy) can be considered as a third objective function if power is also considered.
Table 4.2: Assessment of a new solution based on the change in each objective

    Mcost \ Mcyc        Improvement      same             Deterioration
    Improvement         Improvement      Improvement      need analysis
    same                Improvement      same solution    Deterioration
    Deterioration       need analysis    Deterioration    Deterioration
Memory Cycles (Mcyc). Memory cost is computed using equation (4.1). Memory cycles are obtained using the same greedy back-tracking heuristic described in Section 3.5 for the data layout.
The initial temperature is computed by initially randomizing the solution space. The mean and the variance of the change in the objectives (improvement or deterioration) are computed through the initial iterations during the randomization process. The temperature is initialized with the standard deviation calculated during these initial iterations.
To generate a new solution S 0 from the current solution S, we use controlled randomization to ensure that S 0 is in the neighborhood of S and that the new solution represents
a valid solution. For example, we change the number of banks in S 0 by adding a randomly
generated offset to the number of banks in S while ensuring that the total does not exceed the maximum number of banks. We also ensure that the memory size is greater than
the data size. Let us denote the memory cost and memory cycles associated with S as
Mcost(S) and Mcyc(S); similarly, let Mcost(S′) and Mcyc(S′) correspond to solution S′. The new solution is a definite improvement if Mcost(S′) ≤ Mcost(S) and Mcyc(S′) ≤ Mcyc(S), and we say S′ dominates S. When the new solution does not dominate the existing
solution, there are many possibilities, as illustrated in Table 4.2.
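The controlled randomization described above might look as follows; the bank limits, offset range, and the capacity-repair rule are illustrative assumptions rather than the thesis implementation:

```python
import random

MAX_BANKS = 16                          # assumed maximum number of banks per type
BANK_SIZES = [1024, 2048, 4096, 8192]   # assumed candidate bank sizes

def neighbor(S, data_size):
    """Generate S' in the neighborhood of S = (Ns, Bs, Nd, Bd, Es):
    perturb one bank count by a small random offset, clamp it to the valid
    range, and repair Es so total memory still covers the data size."""
    Ns, Bs, Nd, Bd, Es = S
    if random.random() < 0.5:
        Ns = min(MAX_BANKS, max(0, Ns + random.randint(-2, 2)))
    else:
        Nd = min(MAX_BANKS, max(0, Nd + random.randint(-2, 2)))
    Es = max(0, data_size - (Ns * Bs + Nd * Bd))   # memory must cover all data
    return (Ns, Bs, Nd, Bd, Es)
```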
We maintain an upper and lower threshold for each of the objective functions; let (M_cyc^ut, M_cyc^lt) be the limits for memory cycles and (M_cost^ut, M_cost^lt) be the limits for memory cost. The combined change in the objectives is computed as

Δ = (Mcost(S′) − Mcost(S)) / (M_cost^ut − M_cost^lt) + (Mcyc(S′) − Mcyc(S)) / (M_cyc^ut − M_cyc^lt)    (4.4)
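Equation (4.4) and an acceptance test can be sketched as below; the Metropolis-style exp(−Δ/T) rule is a standard simulated annealing choice assumed here, since the text does not spell out the acceptance probability:

```python
import math
import random

def delta(S_cost, S_cyc, Sp_cost, Sp_cyc, cost_lim, cyc_lim):
    """Normalized combined change of eq. (4.4).
    cost_lim and cyc_lim are (upper, lower) threshold pairs."""
    return ((Sp_cost - S_cost) / (cost_lim[0] - cost_lim[1]) +
            (Sp_cyc - S_cyc) / (cyc_lim[0] - cyc_lim[1]))

def accept(d, T):
    """Always accept improvements (d <= 0); accept deteriorations with
    probability exp(-d/T), which shrinks as the temperature T cools."""
    return d <= 0 or random.random() < math.exp(-d / T)
```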
Since our objective is to present to the designer a list of all good solutions, we maintain
a list of competitive solutions seen during the course of optimization. Each of these
solutions is assigned a weight. When a new locally good solution is encountered, we
compare its weight with that of all the globally competitive solutions that have been
seen so far. There is a fixed amount of room in the data structure that stores globally
competitive solutions; as a result, we will remove a solution from the list if its weight is
lower than that of all others, including the new entrant.
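The fixed-size list of competitive solutions can be sketched as below; the weight-based eviction follows the description above, while the capacity value is illustrative:

```python
def update_archive(archive, solution, weight, capacity=100):
    """Keep a fixed-size list of globally competitive solutions; when the list
    overflows, evict the entry with the lowest weight, which may well be the
    new entrant itself."""
    archive.append((weight, solution))
    if len(archive) > capacity:
        archive.remove(min(archive, key=lambda e: e[0]))
    return archive
```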
4.5 Experimental Results

4.5.1 Experimental Methodology
application data size. This configuration is selected because it does not resolve any of the parallel or self conflicts, so the conflict matrix can be obtained from this simulated
memory configuration. The output profile data contain (a) frequency of access for all
data sections and (b) the conflict matrix. The other input required for our method is the application data section sizes, which were obtained from the C55X linker.
4.5.2 Experimental Results
This section presents our results on the memory architecture exploration. We have applied
GA and SA for the memory architecture exploration. Both GA- and SA-based methods use the same data layout heuristic described in Section 3.5. The reason for trying two
different evolutionary schemes for memory architecture exploration is to identify the better
approach between GA and SA for the multi-objective problem at hand. A better approach
will be able to search the design space uniformly and identify non-dominated points that
are globally Pareto-optimal. In this section, first we compare the results from GA and
SA, and then we describe our observations of the exploration process based on the better
approach (GA or SA).
4.5.2.1
The objective is to obtain the set of Pareto-optimal points that minimizes either memory
cost or the memory cycles. For one of the benchmarks, Vocoder, Figure 4.3 plots all the memory architectures explored by GA and SA respectively; each point represents a
memory architecture and the non-dominated points or the Pareto optimal points are also
plotted in the same figure. Note that each of the non-dominated points represents the
best memory architecture for a given Mcyc or Mcost .
In Figure 4.3, the x-axis represents the normalised memory cost as calculated by
equation 4.1. We have used Ws = 1, Wd = 3 and We = 0.05 in our experiments. Based
on these values and from equation (4.1), the important data range for x-axis (Mcost ) is
from 0.05 to 3.0. It can be seen that Mcost = 0.05 corresponds to an architecture where
all the memory is only off-chip memory, while Mcost = 3.0 corresponds to a memory
architecture that has only on-chip memory composed of DARAM memory banks. The
y-axis represents the memory stall cycles, which is the number of processor cycles spent in
data accesses. This includes the memory bank conflicts and also the additional wait-states
for data accessed from external memory.
From Figure 4.3, we observe that the multi-objective GA explores the design points
uniformly in all regions of memory cost, whereas SA explores a large number of design
points only in the region of Mcost < 1. Our observation is that the sharing and niche
formation methods used in GA lead to better solution diversity than SA. This trend is
observed in other benchmarks as well.
Table 4.3 presents the number of memory architectures explored and the number of
non-dominated points obtained from the GA and SA based approaches. For each of the
applications, the GA and SA are run for a fixed time, so as to compare the efficiency of the
two approaches. The execution time reported in Table 4.3 is the time taken on a Pentium
P4 Desktop machine with 1GB main memory operating at 1.7 GHz. From Table 4.3 we
observe that both GA and SA explore a large number of design points (a few thousand) for each of the benchmarks and identify a few hundred Pareto-optimal design points which
are interesting from a platform based design [59] viewpoint. The total computation time
taken by these methods for each benchmark varies from 3 hours to 11 hours. Compared
to this, the memory design space exploration typically done manually in industry can take several man-months and may explore only a few design points of interest.
We observe that GA produces most of the non-dominated points in the first 25% of
time and slowly improves the solution quality after that. On the other hand, SA gives
the best results only towards the end when the annealing temperature approaches zero.
Hence, given sufficient time, SA catches up with GA, but the time taken by SA to reach the solution quality of GA is 2-3 times the GA's run-time.
We observe that GA explores significantly more points in the design space (almost
by a factor of 2 to 3) than SA for all applications except Mpeg Enc. This is due to the higher execution time of 11 hours: SA's performance improves with time, and we observe that the number of global non-dominated points found by SA is highest for Mpeg Enc.
However, the number of non-dominated points identified by these methods is nearly the same. Interestingly, the non-dominated design points identified by the two methods only partly overlap. Note that our definition of a non-dominated point with respect to the GA and SA approaches refers to those design points that are not dominated by any other point seen so far by the respective method. Thus it is possible that a point identified as non-dominated
in one approach may be dominated by design points identified in the other approach.
We observe that the non-dominated points from SA and GA are very close. This
can be observed from Figure 4.4 where the non-dominated points of GA and SA are
plotted together in one graph for two of the applications. Table 4.4 presents data on the
total number of non-dominated points obtained from GA and SA. The table presents the
number of common non-dominated points between GA and SA in column 4. These are
the same Pareto optimal design points identified by both GA and SA. The number of
unique non-dominated points represents the solutions that are globally non-dominated
but present only in one of GA/SA approaches. The presence of unique non-dominated
points in one approach means that this point is missing in the other approach. Column
6 reports the global non-dominated points; this is the sum of column 4 and column 5.
The ratio of column 6 to column 3, in a way represents the efficiency of an approach. We
[Table 4.3: number of memory architectures explored, non-dominated points obtained, and execution time of the GA and SA approaches for Mpeg Enc, Vocoder, Jpeg and DSL]
Table 4.4: Non-dominated (ND) points obtained by the GA and SA approaches

Application  Method  Num of ND  Common  Unique  Global  Dominated  Avg min dist
                     points     ND pts  ND pts  ND pts  pts        from unique NDs
Mpeg Enc     GA      270        115     143     258     12         1.3%
             SA      287        115     99      214     73         2.1%
Vocoder      GA      105        56      45      101     4          0.77%
             SA      104        56      22      78      26         3.1%
Jpeg         GA      90         32      63      95      2          1.6%
             SA      89         32      2       34      55         2.1%
DSL          GA      133        71      54      125     8          1.8%
             SA      149        71      34      105     44         5.5%
observe that the number of common points increases if time allotted to SA is increased.
Further, column 7 reports the number of non-dominated points identified by one method which are dominated by points in the other method. This also is an indicator of the efficiency of the approach: the more the dominated points, the lower the efficiency.
For example, for the MPEG encoder benchmark, 73 of the non-dominated design points
reported by SA are in fact dominated by certain design points seen by the GA approach.
As a consequence the global non-dominated points reduces to 214 for this benchmark. In
contrast, GA fares 270 non-dominated points of which 258 are globally non-dominated.
In fact this trend is observed for almost all benchmarks. Thus the experimental data indicate that GA performs a better job than SA. One concern that still remains is the set of
unique non-dominated points identified by SA but not by GA. If these design points are
interesting from a platform-based design viewpoint, then to be competitive the GA approach should
at least find a close-enough design point. In order to assess this analytically, we find the minimum of the Euclidean distance between each unique non-dominated point
reported by SA to all the non-dominated points reported by GA. The minimum distance
is normalised with respect to the distance between the unique non-dominated point and the origin. This metric in some sense presents a close-enough design point for each Pareto
optimal point missed by GA. If we could find an alternate non-dominated point in GA
at a very close distance to the unique non-dominated point reported by SA, then the
GA's solution space can be considered as an acceptable superset. In column 8, we report
the average (arithmetic mean) minimum distance of all unique non-dominated points in
SA to the non-dominated points in GA. A similar metric is reported for the unique non-dominated points identified by GA. We also report the maximum of the minimum distance
for all unique non-dominated points in column 9. The worst case average distance from
unique non-dominated points is 1.8% for GA and 5.5% for SA. Thus for every unique non-dominated point reported by SA, the GA method can find a corresponding non-dominated point within a distance of 1.8%.
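The normalized minimum-distance metric described above can be sketched as (the function name is ours):

```python
import math

def avg_min_distance(unique_nd, other_front):
    """For each unique non-dominated point of one method, find the nearest point
    in the other method's front; normalize that distance by the point's distance
    to the origin, and average the ratios over all unique points."""
    ratios = []
    for p in unique_nd:
        d_min = min(math.dist(p, q) for q in other_front)
        ratios.append(d_min / math.dist(p, (0, 0)))
    return sum(ratios) / len(ratios)
```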
In summary, we observe that GA finds more non-dominated points in general and
results in better solution quality for a given time. Only a few non-dominated points of GA are dominated by SA. Also, GA searches the design space more uniformly. This may
be due to the sharing and niche count based approach used in multi-objective GA which
facilitates better solution diversity and explores a larger number of Pareto-optimal points.
4.5.2.2
Figures 4.5, 4.6, 4.7 and 4.8 plot all the memory architectures explored for each of the 4
applications using the GA approach. The figure also plots the non-dominated points with
respect to memory cost and memory cycles. Note that each non-dominated point is a Pareto-optimal memory architecture for a given memory cost or memory cycles. The results present 150-200 non-dominated solutions (representing optimal architectures)
Figure 4.5: Vocoder: Memory Exploration (All Design Points Explored and Nondominated Points)
Figure 4.6: MPEG: Memory Exploration (All Design Points Explored and Non-dominated
Points)
Figure 4.7: JPEG: Memory Exploration (All Design Points Explored and Non-dominated
Points)
Figure 4.8: DSL: Memory Exploration (All Design Points Explored and Non-dominated
Points)
4.6 Related Work
Broadly, there are two types of approaches that are attempted for memory design space
exploration: (i) Architecture Description Language (ADL) based approaches that use
simulation as a means to evaluate different design choices and (ii) exhaustive search or
evolutionary based approaches for memory architecture exploration with analytical model
based estimation to evaluate memory architectures.
There are architecture description language based approaches like LISA [56], EXPRESSION [46], and ISDL [32] that capture processor architecture details in a high-level language as front-end and use a generator as back-end to generate C models that simulate the processor architecture configuration. Specifically, LISA and EXPRESSION capture the micro-architectural details of memory organization in a high-level language format. From the specification, both EXPRESSION and LISA generate C models that
simulate the memory behavior. To evaluate a specific memory configuration for the given
application, the application has to be compiled and run on the generated C model to get
the performance numbers in terms of number of memory stalls. LISA allows the flexibility to capture memory architecture details at different abstraction levels like functional
and cycle accurate specification. A functional C model will be 1-2 orders of magnitude
faster, in terms of run-time, as compared to a cycle accurate simulation model. Though
ADLs provide an elegant means to capture the memory architecture details and further
provide a platform to evaluate a given configuration by means of simulation, there are
some open issues that needs to be addressed. One, simulation is an expensive method
in terms of run-time and this limits the number of configurations that can be evaluated.
Two, the memory configurations needs to be fed as inputs manually. To evaluate significantly different memory organizations, developing the specification is a time consuming
task. Further, these methods do not address the problem of configuration selection itself.
Providing new configurations is a manual task and based on the designer, who modifies
the specification, the type of configurations evaluated could be different. While these
methods are very effective in evaluating a given memory architecture in an accurate way,
it is not suitable for exploring the design space with thousands of configurations because
of the following reasons: (i) for every configuration that needs to be explored, the input
specification needs to be modified and this is a manual process and (ii) since these are
simulation based approaches, even with a functional simulator, the number of architecture
90
configuration that can be evaluated for a large application is very limited because of the
large time taken by the simulator.
The second type of approach is estimation-based methods. In [54], Panda et al. present
a heuristic algorithm for SPRAM-cache based local memory exploration. The objective of
this work is to determine the right size of on-chip memory for a given application. Their
algorithm partitions the on-chip memory into scratch-pad RAM and data cache and also
computes the right line size for the data cache. The algorithm searches the entire memory
space to find the combination of scratch-pad RAM, data cache, and line size that
gives the best memory performance. This approach is very useful for architectures that
contain both SPRAM and cache. Our work differs from this work in many respects.
We address a different memory architecture class, which consists of an on-chip SPRAM
with multiple SARAM banks and DARAM banks, but without cache memory. We have
proposed a two-level iterative approach for memory architecture exploration. The main
advantage of our method is that it integrates data layout and memory exploration into one
problem. To the best of our knowledge, there is no prior work that considers the integration of
memory exploration and data layout as one single problem and optimises for performance
and area (memory cost). The memory exploration strategy presented in [64] explores the
design space to find optimal configurations considering cache size, processor cycles,
and energy consumption; the authors propose an enumeration-based search. Our approach, on
the other hand, uses evolutionary methods and is efficient, in terms of computation time,
in exploring complex memory architectures with multiple objectives. There are also other
memory design space exploration approaches that consider cache-based target memory
architectures [9, 52, 51]. In this chapter, our work addresses the exploration of DSP
memory architectures, which are typically organized as multiple memory
banks where each bank can consist of single- or dual-port memories of different
sizes. We consider non-uniform memory bank sizes. Our work uses an integrated data-layout and memory architecture exploration approach, which is key for guiding the GA's
search path in the right direction. The cost functions and the solution search space will
be very different for a cache-based memory architecture as used in [9, 52] and an on-chip
scratch-pad-based DSP memory architecture as used in our work. Although the approach
presented in this chapter does not address cache-based architectures, we deal with them in
Chapter 7.
In summary, the unique contributions of our work are the following: (a) integrating
memory architecture exploration and data layout in an iterative framework to explore the
memory design space; (b) addressing the class of memory architectures for DSPs, which are
more complex and heterogeneous; and (c) solving the design space exploration problem for
multiple objectives (memory architecture performance and memory area) to obtain
a set of Pareto-optimal design solutions.
4.7
Conclusions
In this chapter we addressed the multi-level, multi-objective memory architecture exploration problem through a combination of evolutionary algorithms (for memory architecture exploration) and an efficient heuristic data placement algorithm. More specifically,
for the outer-level memory exploration problem, we have used two different evolutionary
algorithms: (a) a multi-objective Genetic Algorithm and (b) Simulated Annealing. We have
addressed two key system design objectives: (i) performance in terms of memory
stall cycles and (ii) memory cost. Our approach explores the design space and produces a
few hundred Pareto-optimal memory architectures at various system design points within a
few hours of run time on a standard desktop. Each of these Pareto-optimal design points
is interesting to the system designer from a platform-based design viewpoint. We have
presented a fully automated approach in order to meet time-to-market requirements.
We extend the methodology to handle energy consumption in Chapter 6.
Chapter 5
Data Layout Exploration
5.1
Introduction
more memory modules taken from a semiconductor vendor memory library. For example, a logical memory bank of 16KB×16 can be constructed with four 4KB×16, eight
2KB×16, eight 4KB×8, or sixteen 1KB×16 memory units. Each of these options, for
different process technologies and different memory unit organizations, results in different
performance, area, and energy consumption trade-offs. Hence the memory allocation process is performed with the objective of reducing the memory area, in terms of silicon gates,
and the energy consumption. The memory allocation problem in general is NP-complete [35].
Earlier approaches for the data layout step typically use a logical memory architecture
as input [10, 53], and as a consequence power consumption data for the memory banks is
not available. By considering the physical memory architecture, the data layout method
proposed in this chapter can optimize for power as well. Also, a common assumption in earlier design approaches [10] is that, for data layout, power and performance
are non-conflicting objectives, and therefore optimizing performance will also result in
lower power. However, we show that this assumption is not valid for all classes
of memory architectures. Specifically, we show here that for DSP memory architectures,
power and performance are conflicting objectives and a significant trade-off (up to
70%) is possible. Hence this factor needs to be carefully accounted for in the data layout method
in order to choose an optimal power-performance point in the design space.
When we extend these problems to take the physical memory architecture into account,
there are two possible approaches. One approach is to solve the data layout and memory
architecture exploration problem for the logical memory architecture, as described in the
previous chapters, and then map the logical memory architecture to a physical memory
architecture. Alternatively, the problem can be solved directly for the physical memory
architecture. We evaluate both approaches and demonstrate that the latter is more
beneficial. In this process, we develop a comprehensive automatic memory architecture
exploration framework which can explore both logical and physical memory architectures. We
do this in a systematic way, first addressing the data layout problem for physical memory
architecture in this chapter. The following chapter deals with the memory architecture
exploration problem considering the physical memory architecture.
5.2
Problem Definition
We are given a logical memory architecture Me with m on-chip SARAM memory banks,
n on-chip DARAM memory banks, and an off-chip memory, together with the memory access
characteristics of the application in terms of the conflict matrix defined in Section 3.2.1, which
specifies the number of concurrent accesses between each pair of data sections i and j as well
as self-conflicts, and the frequency of access of individual data sections. The problem at
hand is to realise the logical memory architecture Me in terms of physical memory modules, available in the ASIC memory library for a given technology or process node, and to
obtain a suitable data layout for the physical memory architecture Mp such that the number of memory stalls incurred and the energy consumed by the memory architecture are
minimised.
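As a concrete illustration, the inputs above can be captured in a small data structure. This is a minimal sketch: the field names, bank sizes, and profile numbers are invented for illustration and are not values from the thesis.

```python
from dataclasses import dataclass

@dataclass
class LogicalMemoryArch:
    """The logical memory architecture Me. Field names are illustrative."""
    saram_banks: list        # sizes (bytes) of the m single-port banks
    daram_banks: list        # sizes (bytes) of the n dual-port banks
    off_chip_latency: int    # extra stall cycles per off-chip access

# conflict[i][j] = concurrent accesses between data sections i and j
# (the diagonal holds the self-conflicts); freq[i] = profiled accesses to i.
me = LogicalMemoryArch(saram_banks=[8192, 8192],
                       daram_banks=[4096],
                       off_chip_latency=10)
conflict = [[0, 40], [40, 12]]
freq = [5000, 3000]
```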
More specifically, we consider the data layout problem for physical memory architecture with the following two objectives:
(i) the number of memory stalls incurred due to conflicting accesses (parallel and self
conflicts) and the additional cycles incurred in accessing off-chip memory;
(ii) the total memory power, calculated as the sum of the memory power of all memory
banks over the various memory accesses. The memory power of each bank is computed
by multiplying the number of read/write accesses (based on the data placed in the
bank) by the power per read/write access of the specific memory module accessed.
We defer the consideration of memory area optimization, from a physical memory
architecture exploration perspective, to the following chapter.
5.3
5.3.1
Method Overview
map is obtained using a greedy heuristic method, which is explained in the following section. The core engine of the MODLEX framework is the multi-objective data layout,
which is implemented as a Genetic Algorithm (GA). The data layout block takes the
application data and the logical memory architecture as input and outputs a data placement. The cost of a data placement in terms of memory stalls is computed as explained
in Chapter 3. To compute the memory power, we use the physical memory architecture
and the power per read/write access obtained from the ASIC memory library. The memory
power computation is further explained in Section 5.3.3.3. The overall fitness function
used by the GA is a combination of the memory stall cost and the memory power cost. Based on
the fitness function, the GA evolves by selecting the fittest individuals (the data placements with the lowest cost) into the next generation. To handle multiple objectives, the
fitness function is computed by ranking the chromosomes based on the non-domination
criteria (as explained in Section 2.5). This process is repeated for a maximum number of
generations, specified as an input parameter.
5.3.2
To obtain memory power and area numbers, the logical memories have to be mapped
to physical memory modules available in an ASIC memory library for a specific technology/process node. As mentioned earlier, each logical memory bank can be implemented physically in many ways. For example, a logical memory bank of 4K×16 bits
can be formed with two physical memories of size 2K×16 bits or four physical memories of
size 2K×8 bits. Different approaches have been proposed for mapping logical memory to
physical memories [35, 61]. The memory mapping problem in general is NP-complete [35].
However, since the logical memory architecture is already organized as multiple memory
banks, most of the mapping turns out to be a direct one-to-one mapping. In this chapter
a simple greedy heuristic is used to map logical to physical memory
with the objective of reducing silicon area. This is achieved by first sorting the memory
modules by area/byte and then choosing the smallest area/byte physical memory to form the required logical memory bank size. Though this heuristic is very simple,
it results in an efficient physical memory architecture. Further, in the following chapter,
we consider the exploration of physical memory architecture with the added objective of
area optimization.
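The greedy mapping just described can be sketched as follows. The module list, sizes, and area numbers are invented placeholders, not values from the TI ASIC library used in the thesis.

```python
def map_bank_greedy(bank_size, library):
    """Compose one logical bank from physical modules, repeatedly taking
    the module with the lowest area/byte that still fits the remainder."""
    ranked = sorted(library, key=lambda m: m[1] / m[0])   # by area per byte
    chosen, remaining = [], bank_size
    while remaining > 0:
        fit = next((m for m in ranked if m[0] <= remaining), None)
        if fit is None:                  # nothing fits: take the smallest module
            fit = min(ranked, key=lambda m: m[0])
        chosen.append(fit)
        remaining -= fit[0]
    return chosen

# (size_bytes, area_units) pairs; the numbers are illustrative only.
library = [(4096, 10.0), (2048, 6.0), (1024, 3.5)]
mapping = map_bank_greedy(16384, library)       # compose a 16KB logical bank
```

Here the 4KB module happens to have the lowest area/byte, so the 16KB bank is composed of four copies of it; with different library numbers the same heuristic would mix module sizes.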
5.3.3
To map the data layout problem to the GA framework, we use the chromosomal representation, fitness computation, selection function, and genetic operators defined in Section
3.4. For easy reference and completeness, we briefly describe them in the following subsections.
5.3.3.1
Chromosome Representation
For the data memory layout problem, each individual chromosome represents a memory
placement. A chromosome is a vector of d elements, where d is the number of data sections.
Each element of a chromosome can take a value in (0 .. m), where 1..m represent the on-chip logical memory banks (including both SARAM and DARAM banks) and 0
represents off-chip memory. For the purpose of data layout it is sufficient to consider the
logical memory architecture, from which the number of memory stalls can be computed.
However, for computing the power consumption of a given placement produced by data layout,
the corresponding physical memory architecture, obtained from our heuristic mapping
algorithm, needs to be considered. If element i of a chromosome has the value k, then
data section i is placed in memory bank k; thus a chromosome represents a memory
placement for all data sections. Note that a chromosome may not always represent a valid
memory placement, as the size of the data sections placed in a memory bank k may exceed
the size of k. Such a chromosome is marked as invalid and assigned a low fitness value.
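The encoding and its validity check might look like the following minimal sketch; the section sizes and bank capacities are invented for illustration.

```python
def is_valid_placement(chromosome, section_sizes, bank_capacity):
    """chromosome[i] = k places data section i in bank k; 0 means off-chip
    (treated as unbounded here). Valid iff the sections assigned to every
    on-chip bank fit within its capacity."""
    used = {}
    for section, bank in enumerate(chromosome):
        if bank == 0:
            continue                     # off-chip: no capacity limit here
        used[bank] = used.get(bank, 0) + section_sizes[section]
    return all(total <= bank_capacity[b] for b, total in used.items())

sizes = [512, 1024, 256, 2048]       # bytes per data section (invented)
caps = {1: 2048, 2: 2048}            # capacities of on-chip banks 1..m
```

An invalid chromosome would not be repaired but simply assigned a low fitness, as the text describes.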
5.3.3.2
The strongest individuals in a population are used to produce new offspring. The
selection of an individual depends on its fitness: an individual with a higher fitness has
a higher probability of contributing one or more offspring to the next generation. In
every generation, from the P individuals of the current generation, M new offspring are
generated, resulting in a total population of (P + M). From this, the P fittest individuals
survive to the next generation; the remaining M individuals are annihilated. The crossover
and mutation operators are implemented as explained in Section 3.4.
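The (P + M) survival scheme can be sketched as follows, assuming a binary-tournament parent selection and lower-is-better fitness; the crossover and mutation callables are stand-ins for the operators of Section 3.4, and the toy chromosomes are invented.

```python
import random

def next_generation(population, fitness, P, M, crossover, mutate):
    """One (P + M) generation: breed M offspring, then keep the P fittest
    individuals of the combined pool (lower fitness value = better here)."""
    offspring = []
    for _ in range(M):
        # Binary tournament: the fitter of two random individuals is a parent.
        a, b = random.sample(population, 2)
        p1 = a if fitness(a) < fitness(b) else b
        c, d = random.sample(population, 2)
        p2 = c if fitness(c) < fitness(d) else d
        offspring.append(mutate(crossover(p1, p2)))
    pool = population + offspring          # size P + M
    return sorted(pool, key=fitness)[:P]   # survivors; the rest are annihilated

# Toy run: chromosomes are bank-assignment vectors; sum() is a toy fitness.
random.seed(1)
pop = [[2, 2, 2, 2], [0, 0, 0, 0], [1, 1, 1, 1], [2, 0, 1, 0]]
survivors = next_generation(pop, sum, P=4, M=3,
                            crossover=lambda x, y: x[:2] + y[2:],
                            mutate=lambda c: list(c))
```

Because survivors are drawn from the combined pool, the best individual found so far can never be lost between generations.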
5.3.3.3
For each individual, corresponding to a data layout, the fitness function computes
the power consumed by the memory architecture (Mpow) and the performance in terms of
memory stall cycles (Mcyc). The computation of Mcyc is similar to the cost function
used in our heuristic algorithm described in Section 3.5, and is explained briefly below.
The number of memory stalls incurred in a memory bank j can be computed by summing
the number of conflicts between pairs of data sections that are placed in j. For each pair of
conflicting data sections, the number of conflicts is given by the conflict matrix. Thus
the number of conflicts in memory bank j is given by Σ Cx,y over all (x, y) such that data
sections x and y are placed in memory bank j. As DARAM banks support concurrent
accesses, DARAM bank conflicts Cx,y between data sections x and y placed in a DARAM
bank, as well as self-conflicts Cx,x, do not incur any memory stalls. Note that our model
assumes only up to two concurrent accesses in any cycle. The total memory stalls incurred
in bank j are computed by multiplying the number of conflicts by the bank latency.
The total memory stalls for the complete memory architecture are computed by summing
the memory stalls incurred by all the individual memory banks.
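A minimal sketch of the per-bank stall computation, assuming `conflicts` is the symmetric conflict matrix of Section 3.2.1 with self-conflicts on the diagonal; the matrix values are invented.

```python
def bank_stalls(sections_in_bank, conflicts, latency, is_daram):
    """Stalls for one bank: sum C[x][y] over unordered pairs of sections
    placed in the bank (including self-conflicts C[x][x]), times the bank
    latency. A DARAM bank serves two accesses per cycle, so its parallel
    and self conflicts cost nothing (the model allows at most two
    concurrent accesses per cycle)."""
    if is_daram:
        return 0
    s = list(sections_in_bank)
    conflict_count = sum(conflicts[x][y]
                         for i, x in enumerate(s)
                         for y in s[i:])       # y from s[i:] includes C[x][x]
    return conflict_count * latency

# Illustrative symmetric conflict matrix for 3 data sections.
C = [[2, 5, 0],
     [5, 1, 3],
     [0, 3, 0]]
```

The architecture-wide Mcyc would then be the sum of `bank_stalls` over all banks, plus the off-chip access penalty.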
Memory power corresponding to a chromosome is computed as follows. Assume each
logical memory bank j is mapped to a set of physical memory banks mj,1, mj,2, ..., mj,nj.
If Pj,k is the power per read/write access of memory module mj,k, and AFi,j,k is the
number of accesses to data section i that map to physical memory bank mj,k, then the
total on-chip power consumed is given by

Ponchip = Σi Σj Σk AFi,j,k · Pj,k    (5.1)

Note that AFi,j,k is 0 if data section i is either not mapped to logical memory bank
j or not mapped to physical memory bank k. Also, AFi,j,k and AFi,j,k′ would both
account for an access to data section i that is mapped to logical memory bank j, when
j is implemented using multiple banks k and k′; for example, a logical memory bank of
2K×16 implemented using two physical memory modules of size 2K×8.
Thus the total power Mpow for all the memory banks, including off-chip memory, is
given by

Mpow = Ponchip + Σi AFi,off · Poff

where AFi,off represents the number of accesses to off-chip memory from data section
i, and Poff is the power per access of the off-chip memory.
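Equation (5.1) plus the off-chip term can be sketched as follows; all access counts and power numbers are illustrative. Note how a section split across two width-forming modules of the same bank is counted once per module, as described above.

```python
def memory_power(accesses, power, off_chip_accesses, p_off):
    """Total memory power: Ponchip per Eq. (5.1) plus the off-chip term.
    accesses[i][(j, k)] = AF(i,j,k), accesses of data section i served by
    physical module k of logical bank j; power[j][k] = power per access."""
    p_onchip = sum(af * power[j][k]
                   for per_bank in accesses.values()
                   for (j, k), af in per_bank.items())
    return p_onchip + sum(off_chip_accesses.values()) * p_off

# Bank 1 is width-split into two modules, so section 0's 100 accesses are
# counted once per module; section 1 lives in bank 2's single module.
AF = {0: {(1, 0): 100, (1, 1): 100},
      1: {(2, 0): 50}}
P = {1: [0.2, 0.2], 2: [0.5]}
total = memory_power(AF, P, off_chip_accesses={0: 10}, p_off=2.0)
```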
Once the memory cost and memory cycles are computed for all the individuals in
the population, the individuals are ranked according to the Pareto optimality conditions on
power consumption (Mpow) and performance in terms of memory stall cycles (Mcyc). More
specifically, if (Mpow^a, Mcyc^a) and (Mpow^b, Mcyc^b) are the memory power and memory cycles
of two individuals a and b, then a dominates b if a is no worse than b in both objectives
and strictly better in at least one.
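The non-domination test on (Mpow, Mcyc) pairs can be sketched as follows; the layout values are invented.

```python
def dominates(a, b):
    """a, b are (Mpow, Mcyc) pairs. a dominates b if it is no worse in
    both objectives and differs, i.e. is strictly better in at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def non_dominated(points):
    """Keep only the rank-1 (Pareto-optimal) layouts."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# Illustrative (Mpow, Mcyc) values for four candidate data layouts.
layouts = [(10, 400), (12, 300), (11, 500), (15, 250)]
front = non_dominated(layouts)
```

Here (11, 500) is dominated by (10, 400), while the other three layouts trade power against stalls and all survive as Pareto-optimal points.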
5.4
Experimental Results
5.4.1
Experimental Methodology
We have used the same set of benchmark programs and profile information as in the earlier
chapters. For the memory allocation step, we have used TI's ASIC memory
library; the area and power numbers are obtained from this library.
We consider a set of six different logical memory architectures, listed in Table 5.1. The
corresponding physical memory architectures and the normalized area required by the
physical memory for the different architectures are also given in Table 5.1. (As the ASIC
library is proprietary to Texas Instruments, we present only normalized power and area
numbers.) Note that the
memory size used for each of the memory architectures is 96KB, which is enough to
fit the data of each of the applications considered. Further, the architectures A1 to A5
are sorted by physical memory area in descending order. Architecture A6 will be
used in Section 5.4.3 for comparison with other related work. In column 3 of Table 5.1, the
physical memory banks marked 1P and 2P represent single- and dual-port memory banks,
respectively. Architectures A1 to A5 are selected such that the memory configuration, in terms of the number of memory banks and the bank types (SARAM and DARAM),
is varied. In all of these configurations, the data width is 16 bits in both the logical architecture and the physical memory banks. From the table it can be observed that the memory
area increases with the DARAM size and the number of banks. A1 has the highest number of memory banks and the largest DARAM size; hence A1 consumes the largest area.
A2 and A3 have the same DARAM size but different SARAM configurations. A3
and A4 present a non-uniform-bank-size SARAM architecture. Non-uniform bank
sizes allow the use of memory banks of multiple sizes and hence
present opportunities to optimize memory area and power consumption: larger memory
banks optimize area, whereas smaller memory banks reduce power consumption. A5
has the fewest memory banks and uses larger memories with a reduced memory
area. In summary, we would expect architecture A1 to perform very well in terms of
performance, because of its large DARAM memory, and architecture A4 to perform better
in terms of power consumption, because of its smaller DARAM size and the presence of
non-uniform bank sizes. Note that architecture A4 has more memory area than A5
even though it has only half of A5's DARAM; this is due to the higher number of
banks in A4. Note also that A6 has the lowest area because it has 32KB of off-chip RAM,
and off-chip memory is not included in the area.
5.4.2
Experimental Results
This section presents the experimental results of the multi-objective data layout for physical memory architectures. Figures 5.2, 5.3, and 5.4 show the sets of non-dominated points,
each corresponding to a Pareto-optimal data layout from a power and performance viewpoint, for the three applications on architectures A1-A5. Note that architectures A1 to A5
correspond to fixed physical memory architectures with known silicon area. Figure 5.2
presents the data layout solution space from a power consumption and performance (memory stalls) viewpoint. Each point in the plot represents a
data layout for a given architecture; observe that several data layout solutions are
presented for each of the architectures considered.
in terms of power and performance in the mid-region. Observe that A4's solution points
dominate all the other solution points in the mid-region. A5's solution points are
notably inferior compared to the rest of the solutions, because of the fewer
memory banks in A5. From this, it can be deduced that MPEG performs
multiple simultaneous memory accesses, and thus, for MPEG, multiple memory banks are
more important than DARAM banks for achieving better solution points.
Figure 5.3 presents the results for the Voice Encoder application for the five different
architectures A1-A5. Unlike MPEG, the solution points of A1 are clearly superior here,
mainly in terms of performance. Observe that the solution points of architectures A1,
A2 and A4 dominate some of the power-performance regions of the data layout space:
solutions of A1 dominate the high-performance space, solutions of A2 and A4 dominate
the middle space in both performance and power, and solutions of A2 again dominate
the low power-performance region. From the results, it can be deduced that for the voice
encoder, DARAM and multiple memory banks are equally critical. With only a
small increase in area compared to A5, A3 achieves much better performance than A5;
this is due to the higher number of banks in A3, which resolves more parallel conflicts.
Figure 5.4 presents the results for the multi-channel DSL application for the five different architectures A1-A5 described in Table 5.1. Observe that all the architectures give a
solution point with near-zero memory stalls. This indicates that the application does not
require more than 16K of DARAM (the smallest DARAM size used among
architectures A1-A5, used in A4). Also, it can be deduced that this application does
not need more than 3 banks to resolve all the parallel conflicts (note that A5 has only 3
banks). A significant portion of the DSL application was developed in C, and this is one reason for the smaller number of parallel and self conflicts. Typically,
hand-optimized assembly code will try to exploit DSP architectures by using multiple
simultaneous accesses and self accesses; compiler-generated assembly code may
not be as efficient as hand-optimized code, mainly in terms of parallel memory accesses.
Interestingly, the solution points of A4 dominate most of the other solution points. This
is mainly due to the non-uniform bank sizes of A4, which present opportunities for data
5.4.3
In this section we present results for all the stand-alone optimizations and compare them with
our integrated approach, MODLEX, in which all the optimizations are considered together.
For this purpose we consider the following optimizations.
O1 Optimization O1 corresponds to performing just the on-chip/off-chip data partition, similar to the approach proposed in [10, 67]
O2 Optimization O2 corresponds to performing O1 and also resolving parallel memory
conflicts by utilizing only multiple memory banks [44, 40]
O3 Optimization O3 corresponds to the MODLEX approach, which integrates O1 and O2,
resolves self-conflicts, and also exploits non-uniform sized memory banks
Figure 5.5 presents the results for MPEG on memory architecture A6, described
in Table 5.1. There are six different plots and each plot represents a specific data layout
optimization. Note that the plots O1 and O2 use only the optimizations O1 and O2
respectively. In comparison, our MODLEX framework with optimization O3 presents
different solution points from a power and performance viewpoint. Observe that, for the
same memory architecture, the MODLEX approach presents a wide range of solutions,
from the high-performance region that resolves almost all the memory stalls down to the
low-performance region. Note that, from a power and performance perspective, the solution
points of the integrated approach completely dominate the solution points of the other
two plots. Methods like [10, 67] give power/performance close to point P1, point
P2 corresponds to the works [44, 40, 58], and the data layout that optimizes power
[15] is represented by point P3. From the results we can conclude that our integrated
approach gives better solution points with respect to both power and performance. Also,
from the experimental results it can be concluded that a wide range of design
points with respect to power and performance can be obtained from multi-objective
data layout optimization. The computation cost involved in our approach is very small:
less than an hour on a standard desktop.
5.5
Related Work
The data layout problem [10, 15, 40, 44, 53, 67] has been widely researched in the literature, from either a performance or a power perspective individually. In [18], a low-energy
memory design method referred to as VAbM is proposed; it optimizes the memory area by
allocating multiple memory banks with variable bit-widths to optimally fit the application
data. In [15], Benini et al. present a data layout method that aims at energy reduction.
The main idea of this work is to use the access frequency of the memory address space as
a starting point and to design smaller (larger) bank sizes for the most (least) frequently accessed memory addresses. In [40], the authors present a heuristic algorithm to efficiently
partition data to avoid parallel conflicts in DSP applications. Their objective is to
partition the data into multiple chunks of the same size so that they can fit in a memory
architecture with uniform bank sizes. This approach works well if we consider only performance as an objective. However, if the objective is to optimize both performance and
power, then a memory architecture with non-uniform banks is very attractive.
All the above optimizations are very effective individually for the class of memory architecture they target. However, a complete data layout approach has to combine many or all
of these approaches in order to address the problem comprehensively. Moreover, simply
chaining different optimizations may not be optimal; an integrated approach
is likely to yield a better result. Our MODLEX framework accomplishes this. Further,
our data layout approach can effectively partition data to resolve parallel conflicts and
also exploit non-uniform bank architectures to save power. To the best
of our knowledge, no prior work in the literature addresses this problem.
5.6
Conclusions
In this chapter we presented MODLEX, a Multi-Objective Data Layout
EXploration framework for physical memory architectures. Our approach yields many data layouts
that are Pareto-optimal with respect to power and performance, which is important from
a platform design viewpoint. We demonstrated that a significant trade-off (up to
70%) is possible between power and performance. In the next chapter we extend our
framework to explore the memory architecture design space along with the data layout.
Chapter 6
Physical Memory Exploration
6.1
Introduction
method described in Chapter 4. The output of LME is a set of design points (logical memory architectures) that are Pareto-optimal with respect to performance
and (logical) memory cost. Also, as part of LME, the data layout step generates data
placement details for each of the logical memory architectures explored in
LME. The non-dominated points from LME and the placement details for each
non-dominated point form the inputs to the physical memory architecture exploration
step. The mapping of logical memory architecture to physical memory architecture
is formulated as a Multi-Objective Genetic Algorithm to explore the design space
with power and area as the objectives. Area and power numbers for the physical
memory modules are obtained from a semiconductor vendor memory library. The
physical memory exploration step is performed for every non-dominated point from
LME. Note that performance was one of the objectives at the LME level and does
not change during the physical memory exploration step. Hence, at the output of the physical memory exploration step, for every non-dominated point generated by
LME, a set of non-dominated points is identified that is optimal with respect to
power and area. We refer to this approach as LME2PME.
The second approach is a direct approach to Physical Memory Exploration (PME). In
this approach we integrate three critical components: (i) memory architecture exploration; (ii) memory allocation, which constructs a logical memory by
picking memory modules from a semiconductor vendor memory library; and (iii)
data layout exploration, which is critical for estimating performance. The
memory allocation step is critical and influences the power per read, the power per write,
and the memory area of all the memory modules. This integrated approach is
shown in Figure 6.2. For memory architecture exploration, we use a multi-objective
non-dominated sorting Genetic Algorithm approach [25]. For the data layout problem, which needs to be solved for each of the thousands of memory architectures, we
use the fast and efficient heuristic method described in Section 3.5. For memory
allocation, we use an exhaustive search algorithm. Thus the overall framework uses
a two-level iterative approach, with memory architecture exploration and memory
packing at the outer level and data layout at the inner level. We propose a fully
automated framework for this integrated approach and refer to it as
DirPME.
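The two-level loop can be sketched abstractly as follows. All the callables are placeholders for the components named in the text (the outer GA step, the memory allocation, and the inner data layout heuristic), and the toy run uses integers as stand-ins for architectures.

```python
def dirpme_search(seed_archs, evolve, allocate, layout_cost, generations):
    """Two-level DirPME sketch: the outer loop evolves memory architectures
    and packs them from library modules; the inner call estimates the cost
    of each candidate via the fast data layout heuristic (Section 3.5)."""
    population = seed_archs
    history = []                               # every (architecture, cost) seen
    for _ in range(generations):
        scored = []
        for arch in population:
            physical = allocate(arch)          # memory allocation (outer level)
            scored.append((arch, layout_cost(physical)))  # data layout (inner)
        history.extend(scored)
        population = evolve(scored)            # GA step over architectures
    return history

# Toy run with stand-in callables; integers play the role of architectures.
trace = dirpme_search([1, 2],
                      evolve=lambda scored: [a + 1 for a, _ in scored],
                      allocate=lambda a: a,
                      layout_cost=lambda p: p * 10,
                      generations=2)
```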
frameworks. Section 6.5 covers related work from the literature. Finally, in
Section 6.6, we conclude by summarizing the work of this chapter.
6.2
6.2.1
Method Overview
The LME2PME method extends the Logical Memory Exploration (LME) process described in Chapter 4 by considering memory power and memory area in addition to the
memory performance objective addressed by LME. Note that LME works on minimizing the number of memory stalls for a given logical memory cost, where the logical memory
cost is a factor proportional to memory area. For a given application, LME finds a list of
Pareto-optimal logical memory architectures with performance and logical memory
area as the objective criteria. This is shown in the top right portion of Figure 6.3.
At the LME step, the memory is not yet mapped to physical modules, and hence the
actual silicon area and power consumption numbers are not known. Also, for a given
logical memory architecture there are many possible ways to implement the actual physical
memory architecture. As shown in Figure 6.3, the non-dominated points from LME are
taken as inputs to the memory allocation exploration step. The output of Physical
Memory Exploration is a set of Pareto-optimal points with memory power, memory
area, and memory stalls as the objective criteria. For each non-dominated logical memory
architecture generated by LME, there are multiple physical memory architectures with
different power-area operating points but the same memory stalls. This is shown in
Figure 6.3, where the design solution LM1 in LME's output corresponds to a memory
stall count of ms1 and generates a set of Pareto-optimal points (denoted by PM1s in the
lower half of Figure 6.3) with respect to memory area and memory power. Similarly,
LM2, which incurs memory stalls ms2, results in a set PM2s of physical memory
architectures. Note that ms1 and ms2, the memory stall counts determined by
LME, do not change during the Physical Memory Exploration step. Different physical
memory architectures are explored, with different area-power operating points, for a given
memory performance.
6.2.2
In a traditional HW-SW codesign method, once the logical memory architecture is finalized, the HW and SW designs proceed independently. The SW design teams focus on performance optimization of the application on the given logical memory architecture, and the HW design teams focus on area optimization during the memory allocation step. In the process, power optimization, which requires both a HW and a SW perspective, is not considered. Our LME2PME method addresses this problem by taking the required inputs from the data layout step, which helps in optimizing power consumption and area at the same time during memory allocation. Figure 6.4 describes the LME2PME method.
The top part of Figure 6.4, the logical memory architecture exploration (LME), is the same as described in Chapter 4. The bottom part of Figure 6.4 shows the physical
memory exploration. As shown in Figure 6.4, the Physical Memory Exploration (PME)
step takes two inputs from the LME. The first input is the set of Non-dominated points
generated by LME and the second input is the data placement details, which is the output
of the data layout step and provides information on what data-section is placed in which
memory bank. From the data placement and the profile data, the PME computes the
number of memory accesses per logical memory bank. This is an important information
and this can be used to decide on using larger or smaller memories while mapping a
logical memory bank. As discussed in Chapter 2, a smaller memory consumes less power
per read/write access as compared to a larger memory. Hence, if a logical memory bank
is known to have a data that is accessed higher number of times, it is power-optimal
to design this logical memory bank with many smaller physical memories. However this
comes with a higher silicon area cost and hence results in a area-power trade-off.
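This area-power trade-off can be made concrete with a small sketch; the module sizes, area units and per-access energy figures below are invented for illustration and are not taken from any real memory library:

```python
# Hypothetical 8-bit memory modules: depth (words), area units, energy per access.
# All numbers are illustrative only, not from a real semiconductor memory library.
MODULES = {
    "1Kx8": {"words": 1024, "area": 1.0, "energy": 0.5},
    "4Kx8": {"words": 4096, "area": 3.2, "energy": 0.9},
}

def bank_cost(modules, accesses_per_module):
    """Total area and access energy for a logical bank built from 'modules'."""
    area = sum(MODULES[m]["area"] for m in modules)
    energy = sum(MODULES[m]["energy"] * a
                 for m, a in zip(modules, accesses_per_module))
    return area, energy

# A 4K-word logical bank holding hot data (1M accesses), built two ways;
# assume the accesses spread evenly over the bank.
area_big, energy_big = bank_cost(["4Kx8"], [1_000_000])
area_small, energy_small = bank_cost(["1Kx8"] * 4, [250_000] * 4)
```

With these assumed figures, building the bank from four 1K modules costs 25% more area but cuts the access energy by roughly 45%, which is exactly the trade-off the exploration step has to navigate.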
We formulate the memory allocation exploration as a Multi-Objective Genetic Algorithm problem. To map an optimization problem to the GA framework, we need the
following: chromosomal representation, fitness computation, selection function, genetic
operators, the creation of the initial population and the termination criteria. Figure 6.5
explains the GA formulation of the Physical Memory Mapping problem.
6.2.3
GA Formulation
6.2.3.1
Chromosome Representation
memories selected may not be used if the given logical memory bank can be constructed with k − m physical memories¹. For example, if k=4, the 4 elements are 2K*8bits, 2K*8bits, 1K*8bits, and 16K*8bits, and the logical memory bank is 2K*16bits, then our memory allocator builds a 2K*16bits logical memory bank from the two 2K*8bits modules and ignores the remaining two. Note that the 16K*8bit memory and the 1K*8bit memory are removed from the configuration, as the logical memory bank can be constructed optimally with the two 2K*8bit memory modules. Here, the memory area of this logical memory bank is the sum of the memory area of the two 2K*8bit physical memory modules². This process is repeated for each of the Nl logical memory banks. The memory area of a memory architecture is the sum of the area of all the logical memory banks.

¹ This approach of using only the required k − m physical memory modules relaxes the constraint that the chromosome representation has to exactly match a given logical memory architecture. This, in turn, facilitates the GA approach to explore many physical memory architectures efficiently.
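The allocation step described above can be sketched as follows; the allocator is deliberately simplified (it only combines modules of matching depth into the required word width), so it illustrates the idea rather than reproducing the thesis allocator:

```python
from itertools import combinations

def build_bank(candidates, words, bits, module_bits=8):
    """Build a logical bank of 'words' depth and 'bits' width from 8-bit
    modules; candidates whose depth does not match are simply ignored."""
    need = bits // module_bits                       # modules side by side
    usable = [m for m in candidates if m[0] == words]  # (depth, bits) pairs
    for subset in combinations(usable, need):
        return list(subset)       # first feasible subset (all interchangeable)
    return None                   # bank cannot be constructed

# k = 4 candidate modules, target logical bank 2K*16bits:
candidates = [(2048, 8), (2048, 8), (1024, 8), (16384, 8)]
chosen = build_bank(candidates, words=2048, bits=16)
# The two 2K*8bit modules are used; the 1K*8 and 16K*8 modules are ignored.
```

The logical bank's area would then be the sum of the areas of only the chosen modules, matching the footnoted rule that unused candidates do not count towards the fitness.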
6.2.3.2
Selection and Genetic Operators
The strongest individuals in a population are used to produce new offspring. The selection of an individual depends on its fitness; an individual with a higher fitness has a higher probability of contributing one or more offspring to the next generation. In every generation, from the P individuals of the current generation, M new offspring are generated using mutation and crossover operators, resulting in a total population of (P + M). The crossover operation is performed as illustrated in Figure 6.5. From this total population of (P + M), the P fittest individuals survive to the next generation. The remaining M individuals are annihilated. Crossover and mutation operators are implemented in the standard way.
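This is the classic (P + M) survivor-selection scheme. A sketch, assuming integer-vector chromosomes and treating rank as a callable where lower values mean fitter individuals:

```python
import random

def next_generation(population, rank, M):
    """One (P + M) step: breed M offspring, keep the P fittest, discard the
    rest. 'rank' maps an individual to a score (lower means fitter)."""
    P = len(population)
    offspring = []
    for _ in range(M):
        a, b = random.sample(population, 2)
        cut = len(a) // 2
        child = a[:cut] + b[cut:]            # one-point crossover
        i = random.randrange(len(child))
        child[i] = random.randint(0, 9)      # mutate one gene
        offspring.append(child)
    combined = population + offspring        # (P + M) individuals
    combined.sort(key=rank)                  # fittest first
    return combined[:P]                      # survivors; M are annihilated

pop = [[random.randint(0, 9) for _ in range(6)] for _ in range(8)]
new_pop = next_generation(pop, rank=sum, M=4)  # toy rank: smaller gene sum
```

In the actual method the rank would come from the non-dominated sorting of Section 4.3 rather than the toy gene-sum used here.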
6.2.3.3
Fitness Computation
For each of the individuals, the fitness function computes Marea and Mpow . Note that
Mcyc is not computed as it is already available from LME. The Marea is obtained from
the memory mapping block, which is the sum of area of all the physical memory modules
used in the chromosome. Mpow is computed based on two factors: (a) access frequency
of data-sections and the data-placement information and (b) power per read/write access
information derived from the semiconductor vendor memory library for all the physical
memory modules.
To compute the memory power, the method uses the data layout information provided by the LME step. Based on the data layout and the physical memories required to form the logical memory (obtained from the chromosome representation), the accesses to each data section are mapped to the respective physical memories². From the power per access of each physical memory and the number of accesses to a data section, the total memory power consumed by all accesses to that data section is determined. The total memory power consumed by the entire application on a given physical memory architecture is then computed by summing the power consumed by all the data sections.

² Although the chromosome representation may have more physical memories than required to construct the given logical memory, the fitness function (area and power estimates) is derived only from the required physical memories.
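The power computation described above is a weighted sum over data sections. A sketch with invented access counts, placement and per-access power values:

```python
# Hypothetical inputs: access counts per data section, placement of sections
# onto physical memories, and per-access power of each memory (invented
# numbers standing in for the vendor memory library).
accesses = {"buf_a": 500_000, "buf_b": 120_000, "coeffs": 40_000}
placement = {"buf_a": "mem0", "buf_b": "mem0", "coeffs": "mem1"}
power_per_access = {"mem0": 0.4, "mem1": 0.9}

def total_memory_power(accesses, placement, power_per_access):
    """Sum over data sections: accesses(ds) * power_per_access(memory(ds))."""
    return sum(n * power_per_access[placement[ds]]
               for ds, n in accesses.items())

m_pow = total_memory_power(accesses, placement, power_per_access)
```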
Once the memory area, memory power and memory cycles are computed for all the
individuals in the population, the individuals are ranked according to the Pareto optimality condition given in the following equation, which is similar to the Pareto optimality condition discussed in Chapter 4 but considers all three objective functions. Let (Mpow^a, Mcyc^a, Marea^a) and (Mpow^b, Mcyc^b, Marea^b) be the memory power, memory cycles and memory area of two physical memory architectures a and b. Then a dominates b if:

((Mpow^a < Mpow^b) ∧ (Mcyc^a ≤ Mcyc^b) ∧ (Marea^a ≤ Marea^b))
∨ ((Mcyc^a < Mcyc^b) ∧ (Mpow^a ≤ Mpow^b) ∧ (Marea^a ≤ Marea^b))
∨ ((Marea^a < Marea^b) ∧ (Mcyc^a ≤ Mcyc^b) ∧ (Mpow^a ≤ Mpow^b))
For ranking of the chromosomes, we use the non-dominated sorting process described
in Section 4.3. The GA must be provided with an initial population that is created
randomly. In our implementation we have used a fixed number of generations as the
termination criterion.
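The dominance condition is the standard Pareto test over the three objectives (no worse everywhere, strictly better somewhere). A sketch of the test and of the rank-0 filter used by non-dominated sorting, with invented design points:

```python
def dominates(a, b):
    """True if design point a = (Mpow, Mcyc, Marea) Pareto-dominates b:
    no worse in every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(points):
    """Keep only the points that no other point dominates (rank-0 front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Invented (power, cycles, area) design points for illustration:
pts = [(2.0, 100, 5.0), (1.5, 120, 5.5), (2.5, 100, 5.0), (1.5, 130, 6.0)]
front = non_dominated(pts)
# (2.5, 100, 5.0) is dominated by (2.0, 100, 5.0);
# (1.5, 130, 6.0) is dominated by (1.5, 120, 5.5).
```

Repeatedly removing the current front and re-applying the filter yields the successive ranks used by the non-dominated sorting of Section 4.3.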
6.3
Direct Physical Memory Exploration (DirPME)
6.3.1
Method Overview
In the LME2PME approach described in the previous section, the physical memory exploration is done in two steps. In this section we describe the DirPME framework, which explores physical memory architectures directly.
Figure 6.6 explains our DirPME framework. The core engine of the framework is
the multi-objective memory architecture exploration, which takes the application data
size and semi-conductor vendor memory library as inputs and forms different memory
architectures. The memory allocation procedure builds the complete memory architecture
from the memory modules chosen by the exploration block. If the memory modules together do not form a proper memory architecture, the memory allocation block rejects the chosen memory architecture as invalid. The memory allocation step also checks the access time of the on-chip memory modules and rejects those whose cycle time is greater than the required access time. The exploration process using the genetic algorithm and the chromosome representation is discussed in detail in the following section. Once the memory modules are selected, the memory mapping block computes the total memory area, which is the sum of the areas of all the individual memory modules.
Details on the selected memory architecture, like the on-chip memory size, number of
memory banks, number of ports, off-chip memory bank latency are passed to the data
layout procedure. The application data buffers and the application profile information are also given as inputs to the data layout. The application itself consists of multiple modules,
including several third-party IP modules as shown in Figure 6.6. With these inputs
the data layout maps the application data buffers to the memory architecture; the data
layout heuristic is the same as explained in Section 3.5. The output of data layout is
a valid placement of application data buffers, from the data layout, and the application
memory access characteristic the memory stalls are determined. The memory power is
also computed using the application characteristic and power per access available from the
semi-conductor vendor memory library. Lastly, the memory cost is computed by summing
the cost of the individual physical memories. Thus the fitness function for the memory
exploration is computed with the memory area, performance and power.
Based on the fitness function the GA evolves by selecting the fittest individuals to
the next generation. Since the fitness function contains multiple objectives, the fitness
function is computed by ranking the chromosomes based on the non-dominated criteria
(explained in Section 6.3.2). This process is repeated for a maximum number of generations specified as an input parameter.
6.3.2
GA Formulation
6.3.2.1
Chromosome Representation
For the memory architecture exploration problem in DirPME, each individual chromosome
represents a physical memory architecture. As shown in Figure 6.7, a chromosome consists
of two parts: (a) number of logical memory banks (Li ), and (b) list of physical memory
modules that form the logical memory bank. Once again we assume that each logical
memory bank is constructed using at most k physical memories. It is important to note
here that a key difference between the LME2PME and DirPME approaches is that, in
the LME2PME approach the number of logical memory banks is fixed (equal to Nl) and hence the chromosomes are all of the same size, whereas in DirPME each chromosome can have a different Li and hence the chromosomes are of different sizes. Thus, a chromosome is a vector of d elements, where d = Li × k + 1 and Li is the number of logical memory banks for the ith chromosome. The first element of a chromosome is Li and it can take a value in (0 .. maxbanks), where maxbanks is the maximum number of logical banks given as an input parameter.
parameter. The remaining elements of a chromosome can take a value in (0 .. m), where
1 .. m represent the physical memory module id in the semiconductor vendor memory
library. Here the index 0 represents a void memory (size zero bits) to help the memory
allocation step to construct physical memories.
For decoding a chromosome, first Li is read and then, for each of the Li logical banks, the chromosome has k elements. Each of the k elements is an integer used to index into the semiconductor vendor memory library. With the k physical memory modules corresponding to a logical memory bank, a rectangular memory bank is formed. We have used the same memory allocator (described in Section 6.2.3.1), which performs exhaustive combinations with the k physical memory modules to get the largest logical memory with the required word size. In this process some of the physical memory modules may be wasted. For example, if k=4, the 4 elements are 2K*8bits, 2K*8bits, 1K*8bits, and 16K*8bits, and the bit-width requirement is 16 bits, then our memory allocator builds a 5K*16bits logical memory bank from the given 4 memory modules. Note that 11K*8bits is wasted in this configuration; this architecture will have a low fitness, as the memory area will be very high, but it is considered in the exploration process
nonetheless. The memory area of a logical memory bank is the sum of the memory area
of all the physical memory modules. This process is repeated for each of the Li logical
memory banks. The memory area of a memory architecture is the sum of the area of all
the logical memory banks.
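Decoding can be sketched as below, under simplifying assumptions: a three-entry toy memory library, and the allocator reduced to collecting each bank's non-void modules:

```python
# Toy memory library: id -> (depth_in_words, width_in_bits); id 0 is void.
LIBRARY = {1: (1024, 8), 2: (2048, 8), 3: (16384, 8)}
K = 4  # at most k physical memories per logical bank

def decode(chromosome):
    """chromosome = [Li, then Li groups of K library indices].
    Returns the list of physical modules per logical bank, voids dropped."""
    li = chromosome[0]
    banks = []
    for b in range(li):
        genes = chromosome[1 + b * K : 1 + (b + 1) * K]
        banks.append([LIBRARY[g] for g in genes if g != 0])
    return banks

# 2 logical banks: the first built from two 2K*8 modules (giving a 16-bit
# word), the second from a single 16K*8 module; remaining genes are void.
chrom = [2, 2, 2, 0, 0, 3, 0, 0, 0]
banks = decode(chrom)
```

The real decoder would additionally run the rectangular-bank allocator on each group; this sketch stops at the grouping step.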
6.3.2.2
Selection and Genetic Operators
The strongest individuals in a population are used to produce new offspring. The selection of an individual depends on its fitness; an individual with a higher fitness has a higher probability of contributing one or more offspring to the next generation. In every generation, from the P individuals of the current generation, M more offspring are generated using mutation and crossover operators, resulting in a total population of (P + M). From this, the P fittest individuals survive to the next generation. The remaining M individuals are annihilated. Crossover and mutation operators are implemented in the standard way. The crossover operation is illustrated in Figure 6.7.
6.3.2.3
Fitness Computation
For each of the individuals, the fitness function computes Marea , Mpow and Mcyc . The
value of Mcyc is computed by data layout using the heuristic explained in Section 3.5.
The Marea is obtained from the memory mapping block, which is the sum of area of all
the memory modules used in the chromosome.
Memory power computation is performed in the same way as described in Section 6.2.3.3. Once Marea, Mpow and Mcyc are computed, the chromosomes are ranked as
per the process described in Section 6.2.3.3.
6.4
Experimental Results
6.4.1
Experimental Methodology
We have used the same set of benchmark programs and profile information as in the earlier chapters. For the memory allocation step, we have used TI's ASIC memory library. The area and power numbers are obtained from this library.
The results from the LME2PME method are presented in the following section. After
that we present the results from the DirPME framework. Finally we compare the results
from LME2PME and DirPME.
6.4.2
Results of the LME2PME Approach
As discussed in Section 6.2, the LME2PME approach performs memory allocation exploration on the set of Pareto-optimal logical memory architectures obtained from the LME, with the objective of obtaining Pareto-optimal physical memory architectures that are interesting from an area, power and performance viewpoint. Figures 6.8, 6.9 and 6.10 present the results of the LME2PME approach for all the 3 applications. In these
figures, the x-axis represents the total memory area (normalized) required by a physical
memory architecture and the y-axis represents the total power (normalized) consumed
by the memory accesses. In the figures, each plot corresponds to a set of performance
operating points from the LME. Note that the performance points are grouped to reduce
the number of plots so that it is easier to analyze the results. The performance band 0–0.1 corresponds to operating points that resolve > 90% of the memory stalls (arising from on-chip memory bank conflicts) and hence are high performance operating points. Similarly, the performance band 0.8–0.9 corresponds to operating points that resolve less than 20% of the memory stalls and hence are low performance operating points.
For each of the Pareto-optimal logical memory architectures, the memory allocation exploration step constructs a set of physical memory architectures that have different area-power operating points. Note that the performance (number of memory stalls) remains unchanged from the LME step. Each point in Figures 6.8, 6.9 and 6.10 represents a physical memory architecture. It can be observed from these figures that each plot presents a wide choice of area-power operating points in the physical view. Note that the plots are arranged from the high performance band to the low performance band. Each plot starts from a high-power, low-area region and ends in a low-power, high-area region. Observe that all the high performance (low memory stalls) plots operate in a high area-power region, while the low performance (high memory stalls) operating points have lower area-power values. Thus, from a platform design viewpoint, a system designer needs to be clear on the critical factor among area, power and performance. Based on this information, the system designer can select the appropriate set of operating points that are interesting from the system design perspective.
6.4.3
Results of the DirPME Framework
This section presents the experimental results of the multi-objective Memory Architecture Exploration. The objective is to explore the memory design space to obtain all the non-dominated design solutions (memory architectures) that are Pareto optimal with respect to area, power and performance.
The Pareto-optimal design points identified by our framework for the voice encoder
application are shown in Figure 6.11. It should be noted that the non-dominated points seen by the multi-objective GA are only near optimal, as the evolutionary method may produce design points in future generations that dominate them. One can observe
a set of points for each x-z plane (memory power - memory stalls) corresponding to a given
area. These represent the trade-off in power and performance for a given area. The same
graph is plotted in a 2D-graph in Figure 6.12 where architectures which require an area
within a specific range are plotted using the same color. These correspond to the points
in a set of x-z planes for the area range.
Figure 6.8: Voice Encoder: Memory Architecture Exploration - Using LME2PME Approach
Figures 6.12, 6.13 and 6.14 show the set of non-dominated points each corresponding
to a Pareto Optimal Memory Architecture for the 3 applications. It can be observed
from Figures 6.12, 6.13 and 6.14 that an increase in memory area results in improved performance and power. Increased area translates to one or more of more on-chip memory, an increased number of memory banks, and more dual-port memories, all of which are essential for improved performance. We look at the optimal memory architectures derived by our
framework. In particular we consider (a) R1, (b) R2 and (c) R3 in each of the figures.
The region R1 corresponds to (high performance, high area, high power); R2 corresponds
to (low performance, high area, low power); and the region R3 corresponds to (medium
performance, low area, medium power). Since the memory exploration design space is very
large, it is important to focus on regions that is critical to the targeted applications. The
region R1 has memory architectures with large dual-port memory that aids in improved
performance but also is a cause for high power consumption. The region R2 has large
128
number of memory banks of different size. This helps in reducing the power consumption
by keeping the data sections with higher access frequency to smaller memory banks.
However, the region R2 does not have dual-port memory modules and hence results in low performance, while the presence of a higher number of memory banks increases the area. The region R3 does not have dual-port memory modules and also has a smaller number of on-chip memory banks. Since the memory banks are large, the power per access is high, resulting in higher power consumption. Note that for a given area there can be more
than one memory architecture. Also it can be observed that for a fixed memory area, the
design points are Pareto optimal with respect to power and performance. Observe the
wide range of trade-off available between power and performance for a given area. We
observe that by trading off performance, power consumed can be reduced by as much as
70-80%.
Table 6.1 gives details on the run-time, the total number of memory architectures explored and the number of non-dominated (near-optimal) points for each of the applications. Note that the number of non-dominated design solutions is also large. Hence, to select an optimal memory architecture for a targeted application, the system designer
needs to follow a clear top down approach of narrowing down the region (area, power,
performance) of interest and then focus on specific memory architectures. The table also
reports the execution time taken on a standard desktop (Pentium 4 at 1.7 GHz). As can be seen, the execution time for each of these applications is fairly low.
[Figure: 3D plot with axes Memory Area, Memory Stalls and Memory Power]
Figure 6.11: Voice Encoder (3D view): Memory Architecture Exploration - Using DirPME
Approach
6.4.4
Comparison of LME2PME and DirPME
In this section we compare the non-dominated points from the LME2PME and DirPME
approaches. Table 6.2 presents data on the total number of non-dominated points obtained
from LME2PME and DirPME. The number of unique non-dominated points listed in
column 4 represents the solutions that are globally non-dominated but present only in
one of LME2PME or DirPME approach. The presence of unique non-dominated points
in one approach means that this point is missing in the other approach. The ratio of column 4 to column 3 in a way represents the efficiency of an approach. We observe that this ratio is lower for the DirPME approach than for LME2PME. The number of unique non-dominated points in DirPME increases if the time allotted to DirPME is increased.
Figure 6.12: Voice Encoder: Memory Architecture Exploration - Using DirPME Approach
Further, column 5 of Table 6.2 reports the number of non-dominated points identified by one method that are dominated by points from the other method. For example, for the MPEG encoder benchmark, 709 of the non-dominated design points reported by DirPME are in fact dominated by design points found by the LME2PME approach. As a consequence, the number of unique non-dominated points reduces to 26 for this benchmark. In contrast, LME2PME fares better with 175 non-dominated points, of which none are dominated by the DirPME approach. This trend is observed for almost all benchmarks. Thus the experimental data indicate that LME2PME performs better than DirPME.
One concern that still remains is the set of unique non-dominated points identified by DirPME but not by LME2PME. If these design points are interesting from a platform-based design perspective, then to be competitive the LME2PME approach should at least find a close enough design point.

Figure 6.13: MPEG Encoder: Memory Architecture Exploration - Using DirPME Approach

In order to quantitatively assess this, we find the minimum of the Euclidean distances between each unique non-dominated point reported by DirPME and all the non-dominated points reported by LME2PME. The minimum distance is normalised with respect to the distance from the unique non-dominated point to the origin. This metric in some sense represents how close a non-dominated point in the DirPME approach is to a point in LME2PME. If we could find an alternate non-dominated point in LME2PME at a very close distance to the unique non-dominated point reported by DirPME, then the LME2PME solution space can be considered an acceptable superset. In column 6, we report the average (arithmetic mean) minimum distance of all unique
non-dominated points in DirPME to the non-dominated points in LME2PME. A similar metric is reported for the unique non-dominated points identified by LME2PME. We also report the maximum of the minimum distances for all unique non-dominated points in column 7 of Table 6.2. The worst case average distance from unique non-dominated points is 0.46% for LME2PME and 0.49% for DirPME. Thus for every unique non-dominated point
reported by DirPME, the LME2PME method can find a corresponding non-dominated point within a distance of 0.46%. In column 7, we report the maximum of the minimum distances of all non-dominated points in DirPME to the non-dominated points in LME2PME. The same metric is presented the other way, i.e. the maximum of the minimum distances of all non-dominated points in LME2PME to the non-dominated points in DirPME. Observe from column 7 that for every non-dominated point that is missing in LME2PME but reported in DirPME, we can find a close enough non-dominated point in LME2PME at most within 4.1% distance from the missing point for the MPEG benchmark. Similarly, for every new
non-dominated point reported by LME2PME, we can find a close enough non-dominated
point in DirPME at most within 6.2% distance from the missing point. Finally, in column
8, the run-time for all the benchmarks for both approaches are reported. Note that the
DirPME approach takes significantly more time than the LME2PME approach.
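The closeness metric can be sketched directly from its definition; the design points below are invented for illustration:

```python
import math

def min_normalised_distance(point, other_front):
    """Minimum Euclidean distance from 'point' to any point of the other
    approach's front, normalised by the point's distance to the origin."""
    norm = math.dist(point, (0.0,) * len(point))
    return min(math.dist(point, q) for q in other_front) / norm

# Invented unique DirPME point and LME2PME front for illustration:
unique_pt = (10.0, 20.0, 5.0)
lme2pme_front = [(10.1, 20.0, 5.0), (12.0, 18.0, 6.0)]
closeness = min_normalised_distance(unique_pt, lme2pme_front)
# a small value means the other approach offers a near-equivalent design point
```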
In summary, we observe that LME2PME finds more non-dominated points in general and offers better solution quality for a given time. However, since for every unique non-dominated point in LME2PME we can find a very close non-dominated point in DirPME, and vice versa, we can conclude that both approaches perform very closely. Further, the DirPME approach operates on a much bigger search space. Hence we expect the DirPME approach to catch up and fare equally well, or even better, compared to the LME2PME approach when sufficient time is given.
6.5
Related Work
Memory architecture exploration is performed in [18] using a low-energy memory design method, referred to as VAbM, that optimizes the memory area by allocating multiple memory banks with variable bit-width to optimally fit the application data. Their work addresses custom memory design for application specific hardware accelerators, whereas our work focuses on defining the memory architecture for the programmable processors of embedded SoCs. In [15], Benini et al. present a method that combines memory allocation and data layout to optimize the power consumption and area of the memory architecture.
They start from a given data layout and design smaller (larger) bank size for the most
(least) frequently accessed memory addresses. In our method the data layout is not fixed
and hence it explores the complete design space with respect to area, performance and
power. Performance-energy design space exploration is presented in [72]. They present
a branch and bound algorithm which produces Pareto trade-off points representing different performance-energy execution options. In [62], an integrated memory exploration
approach is presented which combines scheduling and memory allocation. They consider different speeds of memory access during memory exploration. They consider only performance and area as objectives, and they output only one design point. In our work we
consider area, power and performance as objectives and we explore the complete design
space to output several hundreds of Pareto optimal design points.
There are other methods for memory architecture exploration for target architectures
involving on-chip caches [9, 51, 52, 64]. We compare them with our memory architecture
exploration approach for hybrid architectures described in the next chapter.
The memory allocation exploration step of LME2PME approach is an extension of
memory packing or memory allocation process [63]. The memory allocation step typically constructs a logical memory architecture from a set of physical memories with minimizing memory area as the objective criterion. But in our approach, the memory allocation exploration has to consider two inputs: (a) area optimization by picking the right
set of memory modules and (b) power optimization by considering the memory access
frequency of data-sections placed in a logical bank. Note that these are conflicting objectives and our approach outputs Pareto-optimal design points which present interesting
trade-offs for these objectives.
6.6
Conclusions
In this chapter we presented two different approaches for Physical Memory Architecture
Exploration. The first method, called LME2PME, is a two-step process and an extension of the LME method described in Chapter 4. The LME2PME method offers flexibility with respect to exploring the design spaces of the logical and physical memory architectures independently. This enables system designers to start the memory architecture definition process without locking down the technology node and semiconductor vendor memory library. The second method is a direct physical memory architecture exploration (DirPME) framework that integrates memory exploration, logical to physical memory mapping and data layout.
Chapter 7
Cache Based Architectures
7.1
Introduction
In the previous chapters, memory architecture exploration frameworks and data layout
heuristics are presented for target architectures that are primarily Scratch-Pad RAM
(SPRAM) based. Many SoC designs, on the other hand, also include a cache in their memory architecture [77], as caches provide performance benefits comparable to SPRAM but with lower software overhead [83], both at program development time (requiring very little data layout and management responsibility from the application developer) and at run time (the movement of data from off-chip memory to the cache is transparent and managed by hardware). Hence in this chapter we consider memory architectures with both SPRAM and cache. The work in this chapter also applies to memory architectures with on-chip memories that can be configured both as cache and as Scratch-Pad RAM.
We discussed in Chapter 6 how the presence of caches alters the objective functions for memory architecture exploration and for the data layout heuristics. In a cache architecture, if two different data sections that are accessed alternately are mapped to the same cache sets, a large number of conflict misses results [33], potentially negating any benefit from the cache. Hence it is important to address memory exploration and
data layout approaches for cache based architectures. Further, the memory exploration
problem becomes more challenging if the target architecture consists of both SPRAM and cache. In this chapter, we address the memory architecture exploration problem for
hybrid memory architectures that have a combination of SPRAM and Cache.
As discussed in Chapter 4, the evaluation of a memory architecture cannot be separated from the problem of data layout, which physically places the application data in
the memory. A non-optimal data layout will yield an inferior performance even on a very
good memory architecture platform, thereby leading the memory exploration search path
to go in a wrong direction. Hence, before addressing the memory architecture exploration problem for a cache-based memory architecture, it is important to have an efficient data layout heuristic.
For SPRAM-Cache based architectures, a critical step is to partition the data placement between on-chip SPRAM and external RAM. Data partitioning aims at improving the overall memory sub-system performance by placing in SPRAM the data sections that have the following characteristics: (a) high access frequency, (b) lifetimes overlapping with many other data, and (c) poor spatial access characteristics. Placing data that exhibit these characteristics in SPRAM reduces the number of potentially conflicting data in the cache; the resulting reduction in cache misses improves the overall memory sub-system performance.
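One hedged way to combine these three characteristics into a placement priority is sketched below; the weighting scheme and the per-section statistics are assumptions for illustration, not the heuristic of Section 7.3:

```python
# Invented per-section statistics: access frequency, number of other sections
# with overlapping lifetime, and a spatial-locality score (0 = poor, 1 = good).
SECTIONS = {
    "hot_buf": {"freq": 900_000, "overlaps": 6, "locality": 0.2},
    "table":   {"freq": 150_000, "overlaps": 1, "locality": 0.9},
    "scratch": {"freq": 400_000, "overlaps": 4, "locality": 0.3},
}

def spram_priority(stats):
    """Higher score -> better SPRAM candidate: frequently accessed, many
    lifetime overlaps, poor spatial locality (weights are illustrative)."""
    return stats["freq"] * (1 + stats["overlaps"]) * (1.0 - stats["locality"])

order = sorted(SECTIONS, key=lambda s: spram_priority(SECTIONS[s]),
               reverse=True)
# sections are then placed into SPRAM in this order until it is full
```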
Typically the SPRAM size is small and hence it is not possible to accommodate all the
data identified for SPRAM placement. Hence, even after data partitioning, there will be a
significant number of potentially conflicting data placed in external RAM. If these data are not carefully placed in the off-chip RAM, there will be a significant number of cache misses, resulting in lower system performance. Cache-conscious data layout addresses this problem and aims at placing data in external RAM (off-chip RAM) with the objective of reducing cache misses. The mapping of data from off-chip RAM to the L1 cache is dictated by the cache size and associativity. Hence the data sections which map to the same cache set can incur a large number of conflict misses when accessed alternately. Careful analysis of data access characteristics and an understanding of the temporal access patterns of the data structures are required to come up with a cache-conscious data layout that minimizes conflict misses. A number of earlier approaches address the problem of data
layout for cache architectures in embedded systems [17, 21, 22, 41, 50, 53].
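The set mapping that causes such conflicts can be sketched for a direct-mapped cache; the cache parameters and the addresses are illustrative:

```python
CACHE_SIZE = 8 * 1024   # bytes, direct-mapped (illustrative parameters)
LINE_SIZE = 32          # bytes per cache line
NUM_SETS = CACHE_SIZE // LINE_SIZE

def cache_set(addr):
    """Set index of a byte address in a direct-mapped cache."""
    return (addr // LINE_SIZE) % NUM_SETS

a_base = 0x10000
b_base = a_base + CACHE_SIZE   # arrays laid exactly one cache size apart
# Every element of array A then conflicts with the corresponding element of B:
conflicts = all(cache_set(a_base + i) == cache_set(b_base + i)
                for i in range(1024))
# Shifting B by one cache line breaks the per-element conflict:
b_shifted = b_base + LINE_SIZE
```

This is why a layout that ignores the cache geometry can place two alternately accessed arrays into the same sets, and why shifting their base addresses is a standard conflict-avoidance move.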
In this chapter our aim is to perform memory architecture exploration and data layout
in an integrated manner, assuming a hybrid architecture which includes both on-chip
SPRAM and data cache (see Figure 7.1). As a first step, we address the data layout
problem for Cache-SPRAM based architectures. We address this problem using a two-step approach for each memory architecture: (a) data partitioning, to divide the data between SPRAM and cache with the objective of improving overall memory sub-system performance and power, and (b) cache conscious data layout, to minimize the number of cache misses within a given external memory address space.
7.2 Solution Overview
Figure 7.2 presents our memory architecture exploration framework. Our proposed memory exploration framework consists of two levels. The outer level explores various memory architectures, while the inner level explores the placement of data sections (the data layout problem) to minimize memory stalls. More specifically, the outer level, the memory architecture exploration phase, targets the optimization of cache and SPRAM size and the organization of the cache architecture, including cache-line size and associativity. We use an exhaustive search1 for memory architecture exploration by imposing certain practical constraints (such as, the memory bank size is always a power of 2) on the architectural parameters. Although these constraints limit the search space, they still allow all practical architectures to be considered and at the same time help to reduce the run-time of the memory exploration phase drastically. The exploration module takes the application's total data size as input and provides an instance of memory architecture by defining (a) cache size, (b) cache block size, (c) cache associativity and (d) SPRAM size2. Based on the SPRAM size and the application access characteristics, the data partitioning heuristic identifies the data sections to be placed in SPRAM. The remaining data sections are placed in off-chip RAM. The details of the data partitioning heuristic are presented in Section 7.3.
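The outer-level enumeration can be sketched as follows. This is an illustrative sketch, not the thesis implementation: the function name is our own, and the exact parameter ranges (power-of-2 cache sizes, 8-byte block-size steps, 4 KB SPRAM steps) are assumptions modelled on the constraints described in this chapter.

```python
from itertools import product

def enumerate_architectures(app_data_size_kb):
    """Sketch of the outer exploration loop: enumerate cache size, cache
    block size, associativity and SPRAM size under practical constraints.
    The exact parameter ranges here are illustrative assumptions."""
    cache_sizes = [s for s in (4, 8, 16, 32, 64, 128) if s <= app_data_size_kb]
    block_sizes = range(8, 65, 8)                       # bytes, 8-byte steps
    assocs = (1, 2, 4)
    spram_sizes = range(0, app_data_size_kb + 1, 4)     # KB, 4 KB steps
    for c, b, a, s in product(cache_sizes, block_sizes, assocs, spram_sizes):
        yield {"cache_kb": c, "block_b": b, "assoc": a, "spram_kb": s}
```

Each yielded dictionary is one candidate architecture instance, to be handed to the inner-level data partitioning and data layout steps.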
The cache conscious data layout heuristic assigns addresses to the data sections placed
in off-chip RAM such that these data do not conflict in the cache. The data layout heuristic
uses the temporal access information as input to find the optimal data placement. The
objective is to minimize the number of cache misses. In Section 7.4 we discuss the proposed cache conscious data layout.
The data partitioning heuristic and data layout heuristic together place the application
data in SPRAM and off-chip RAM respectively. From the temporal access information
1. Alternative approaches such as genetic algorithms or simulated annealing could also be used here. However, we found that the exhaustive approach does explore all practical memory architectures in a reasonable amount of computation time.
2. The proposed framework can easily be extended to consider SPRAM organization parameters such as the number of banks, number of ports, etc. We do not consider these here as they were extensively dealt with in the earlier chapters.
of data sections and access frequency information, the run-time performance in terms of memory stall cycles is computed. The memory stalls include stall cycles due to concurrent accesses to the same single-ported SPRAM bank, and stall cycles due to cache misses and the miss penalty (the off-chip memory access to fetch the cache block). The software eCacti [45] is used to obtain the power per cache read-hit, read-miss, write-hit and write-miss. The SPRAM power per read access and power per write access are obtained from the semiconductor vendor's ASIC memory library. The area for a given cache architecture is computed using eCacti [45] and the area for SPRAM is obtained from the memory library.
The exploration process is repeated for all valid memory architectures and the area,
power and performance are computed for each of these. The last step is to identify the list
of optimal architectures. Since this is a multi-objective problem, all the solution points
are evaluated according to the Pareto optimality conditions given by Equation 6.1 in Section 6.2.3.3. According to this equation, (Mpow^a, Mcyc^a, Marea^a) and (Mpow^b, Mcyc^b, Marea^b) are the memory power, memory cycles and memory area for memory architectures A and B respectively.
From the set of solutions generated by the memory architecture exploration module, all the dominated solutions are identified and removed. The non-dominated solutions form the Pareto optimal set, which represents the set of good architectural solutions that provide interesting design trade-off points from a power, performance and cost viewpoint.
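The Pareto filtering step described above can be sketched as follows; this is an illustrative implementation (the function name is our own), with each solution represented as a (power, cycles, area) tuple where all three metrics are minimized.

```python
def pareto_filter(solutions):
    """Keep the non-dominated (power, cycles, area) points. A point p
    dominates q if p is no worse in every metric and strictly better
    in at least one (all metrics are minimized)."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]
```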
7.3 Data Partitioning
As the cache structure has associated tag overheads, SPRAM consumes much less area than caches on a per-bit basis [12]. Further, an SPRAM memory access consumes less power than a memory access that is a cache-hit [12]. While the data sections mapped to off-chip memory share the cache space dynamically and in a transparent manner, SPRAM space is assigned to data sections exclusively if dynamic data layout is not used. As a result, the usage of SPRAM is costly from a system perspective, as it gets locked to specific data after data section layout, unlike in caches, where the space is effectively reused through dynamic mapping of data by hardware. Hence, SPRAM has to be carefully utilized and the objective in a memory architecture exploration should be to minimize the SPRAM size.
The objective of data partitioning is to identify data sections that must be placed in
SPRAM for best performance. We refer to a set of data (one or more scalar variables
or array variables) that are grouped together as one data-section. A data-section forms
an atomic unit that will be assigned a memory address. All data that are part of a data
section are placed in memory contiguously. An example of a data section is an array data
structure.
In order to identify data sections that should be mapped to SPRAM, our heuristic
uses different characteristics of the data section. These include the access frequency, the
temporal access pattern and the spatial locality pattern. These are explained below.
To model the temporal access pattern of different data sections, a temporal relationship
graph (TRG) representation has been proposed in [17]. A TRG is an undirected graph, where nodes represent data sections and an edge between a pair of nodes indicates that two successive references to either of the data sections are interleaved by a reference to the other. The weight associated with an edge (a, b) represents the number of times such interleaved accesses of a and b have occurred in the access pattern. We illustrate these
ideas with the help of an example.
Let there be 4 data-sections a, b, c and d and the access pattern of these data sections
in the application be:
aaabcbcbcbcdddddaaaaaaacacaacac
For this access pattern the TRG is shown in Figure 7.3. Given a trace of data memory references, the weight associated with (a, b), denoted by TRG(a, b), is the number of times that two successive occurrences of a are intervened by at least one reference to b, or vice versa. As an example, for the pattern bcbcbcb, TRG(b, c) = 5. Note that a reference to c intervenes successive references to b on three occasions, and a reference to b intervenes successive references to c twice, making TRG(b, c) = 5. For the given pattern, TRG(b, d) = 0 as there are no interleaved accesses. Hence no edge exists between b and d. The TRG is computed for all the data sections from the address trace collected from an instruction set simulator. We define STRG(i) as the sum of all TRG weights on the edges connected to node i. As an example, from Figure 7.3, STRG(a) = 10.
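The TRG construction can be sketched as a single pass over the access trace; this is an illustrative implementation with names of our own choosing, and it reproduces the worked values above (TRG(b, c) = 5 for the pattern bcbcbcb, and STRG(a) = 10 for the example trace).

```python
from collections import defaultdict

def build_trg(trace):
    """Temporal Relationship Graph: TRG(x, y) counts how many times two
    successive references to x (or to y) are intervened by at least one
    reference to the other."""
    trg = defaultdict(int)
    pending = {}  # section -> sections seen since its last reference
    for x in trace:
        for y in pending.get(x, ()):      # close the interval since x's last reference
            trg[frozenset((x, y))] += 1
        pending[x] = set()
        for k in pending:
            if k != x:
                pending[k].add(x)
    return trg

def strg(trg, node):
    """STRG(i): sum of TRG weights on all edges incident to node i."""
    return sum(w for edge, w in trg.items() if node in edge)
```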
Next, we define a term, the spatial locality factor, which gives a measure of spatial locality in the access trace for each data section. The spatial locality is influenced by the stride in accessing different elements of the data section. The spatial locality factor is computed by determining the number of misses incurred by that data section on a cache with a single block, using the filtered access trace that contains only accesses pertaining to that data section. The spatial locality factor is the ratio of the number of such misses to the size of the data section. For example, if the accesses to data section b in the filtered trace bbbb correspond to cache blocks b1 b2 b1 b1, where b1 and b2 correspond to different blocks (determined by the cache block size), and the size of data section b is Sb cache blocks, then the spatial locality factor is 3/Sb.
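The single-block cache simulation behind this metric can be sketched as below (an illustrative implementation; the function name is ours, and the example assumes Sb = 2 cache blocks for data section b).

```python
def spatial_locality_factor(block_trace, section_size_blocks):
    """SLF sketch: misses of a one-block cache on the section's filtered
    block trace, divided by the section's size in cache blocks."""
    misses, resident = 0, None
    for blk in block_trace:
        if blk != resident:     # single-block cache: any block change is a miss
            misses += 1
            resident = blk
    return misses / section_size_blocks
```

For the trace b1 b2 b1 b1 this yields 3 misses, i.e. an SLF of 3/Sb.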
There are three parameters that control the decision to keep a data section in an
on-chip SPRAM.
1. Access Frequency (AF): Placing the most frequently accessed data sections in SPRAM gives better power consumption and better run-time performance.
2. Temporal Access Characteristics: A data section is said to be conflicting if it gets accessed along with many other data sections. Placing the most conflicting data sections in SPRAM reduces the number of cache conflict misses and hence improves the overall memory subsystem performance. This parameter is computed from the TRG. The STRG factor is a direct indication of the extent to which a data section's life-time overlaps with other data sections.
3. Spatial Locality Factor (SLF): Data sections that have less spatial locality (i.e., a higher spatial locality factor) use more cache lines simultaneously and thereby reduce the available cache space for other data. Also, such data exhibit less spatial reuse, causing more cache misses, which in turn increases the power consumption due to off-chip memory accesses. Hence, it is both power and performance efficient to place a data section that has a high spatial locality factor in SPRAM.
Thus, a frequently accessed data section that conflicts most with the rest of the data and also exhibits less spatial locality is an ideal candidate to be placed in SPRAM, as this gives the best performance from an overall memory subsystem perspective. For each of the data sections, a conflict index is computed using the three parameters mentioned above. The conflict index of a node corresponding to data section s is computed as follows.
nSTRG(s) = STRG(s) / Σ_{i=1}^{N} STRG(i)    (7.1)

nAF(s) = AF(s) / Σ_{i=1}^{N} AF(i)    (7.2)

nSLF(s) = SLF(s) / Σ_{i=1}^{N} SLF(i)    (7.3)
conflict-index(s) = nSTRG(s) + nAF(s) + nSLF(s)    (7.4)
In the above equations, SLF(s) and AF(s) correspond to the spatial locality factor and access frequency of s respectively. The terms on the LHS of Equations 7.1, 7.2 and 7.3 are normalized factors. The higher the conflict index, the more suitable the data section is for SPRAM placement. Our data partitioning heuristic algorithm is explained in Figure 7.4. The greedy heuristic sorts the data sections based on the conflict index and assigns the data section that has the highest conflict index to SPRAM. The corresponding node is removed from the TRG and the conflict index for the remaining data sections is recomputed. Note that the above step is performed for every data section identified to be placed in SPRAM. This process is repeated either until the SPRAM space is full or until there are no more data sections to be placed.
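The greedy loop can be sketched as below. This is an illustrative sketch, not the thesis implementation: the function name is ours, the conflict index is assumed to combine the normalized factors by summation, and a section that does not fit in the remaining SPRAM space is simply left for off-chip placement.

```python
def partition_to_spram(sections, spram_size, trg, af, slf):
    """Greedy data partitioning sketch. Repeatedly pick the section with
    the highest conflict index, place it in SPRAM, remove its node from
    the TRG and recompute the index. The conflict index is assumed here
    to be the sum of the normalized factors."""
    remaining = dict(sections)              # section name -> size
    placed = []

    def strg(s):                            # sum of TRG weights incident to s
        return sum(w for (x, y), w in trg.items() if s in (x, y))

    while remaining and spram_size > 0:
        tot_t = sum(strg(s) for s in remaining) or 1
        tot_a = sum(af[s] for s in remaining) or 1
        tot_f = sum(slf[s] for s in remaining) or 1
        best = max(remaining,
                   key=lambda s: strg(s) / tot_t + af[s] / tot_a + slf[s] / tot_f)
        size = remaining.pop(best)
        if size <= spram_size:              # place it and drop its TRG node
            spram_size -= size
            placed.append(best)
            trg = {e: w for e, w in trg.items() if best not in e}
        # else: it does not fit; it stays in off-chip RAM (handled later)
    return placed
```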
7.4 Cache Conscious Data Layout
7.4.1
The data partitioning step places the most conflicting data in SPRAM and thereby reduces the possible conflict misses in the cache. However, the SPRAM size typically is very small and only a few data sections would have been placed in SPRAM3. The remaining data sections still need to be placed carefully in the cache to reduce the cache misses. In this section we discuss the cache conscious data layout.
The problem of cache-conscious data layout is to find an optimal data placement in off-chip RAM with the following objectives: (a) to reduce the number of cache misses and (b) to reduce the address space used in off-chip RAM. In other words, the objective is to reduce the holes in off-chip RAM after placement. By this, we mean that the data sections are placed in the off-chip RAM in such a manner that the gaps between data sections, which are left to reduce conflict misses, are reduced. These gaps lead to wasted memory space and hence increase hardware cost. To the best of our knowledge, reducing cache misses (the first objective) has been the sole objective targeted by all earlier published data layout approaches [17, 22, 41, 50, 53]. But it is very important to consider objective (a) in the context of objective (b) for the following reasons.
3. As mentioned earlier, data placement within the SPRAM can be done in a subsequent phase using any of the data layout methods discussed in Chapter 3. We do not experiment with this as it has been extensively dealt with in the previous chapters.
For SoC architectures with an instruction cache and data cache that share the same off-chip RAM, a data layout approach that optimizes only the data cache misses, without considering optimization of the off-chip RAM address space, will use up too much address space by spreading the data placement, leaving many holes. This will place severe constraints on code placement, requiring the code to be placed across the holes and in the remaining off-chip RAM. This may potentially result in additional instruction cache misses. Hence, there is a chance that all the gains achieved by optimizing the data cache misses are lost.
A data layout approach which optimizes the data placement in off-chip RAM without any holes will be independent of instruction cache placement. Hence, the architecture exploration of the data cache can be done independently of the instruction cache. For example, an application with 96K of data will have around 2700 hybrid architectures that are worth exploring. If the code placement is not independent of the data layout and the code segments are placed in the holes created, then the memory exploration process needs to consider both instruction and data cache configurations together. This will increase the number of architectures considered. In such a scenario, the number of architectures explored could increase to 50000+. Hence, it is important to design a data layout algorithm that is independent of the instruction cache.
We formulate the cache conscious data layout problem as a graph partitioning problem [38]. Inputs to the data layout algorithm are (i) the application's data section sizes and (ii) the Temporal Relationship Graph. The data layout algorithm is explained in a block diagram in Figure 7.5. The first step in the data layout problem is modelled as a graph partitioning problem, where data sections are grouped into disjoint subsets, such that the memory requirement for the data sections in a disjoint subset is less than the cache size. More specifically, the first step is a k-way graph partitioning, where k = ⌈application data size / cache-size⌉. The data sections in each of the partitions are selected such that they have intervening accesses and hence can cause potential conflict misses. Thus the output of the graph partitioning step is k partitions, with each partition
having a set of data sections that conflict among themselves the most, and the partition size is less than the cache size. Since each of the k partitions is smaller than the cache size, each of these partitions can be mapped into an off-chip RAM address space that corresponds to one cache page. This step eliminates all the conflicts between data sections that are in the same partition. The graph partitioning method is discussed in detail in Section 7.4.2.
The next step in the data layout is to minimize the possible conflicts between data sections that are in two different partitions. This is handled by the offset-computation step. The details of the offset computation are presented in Section 7.4.3. Once the offset-computation step assigns cache-block offsets to each of the data sections, the address assignment step allocates unique off-chip addresses to all the data sections. Finally, using the address assignment, the number of cache misses and the power consumed for cache and off-chip memory accesses are computed, which are used for identifying the Pareto-optimal solutions. The following subsections detail the graph partitioning heuristic and the offset computation heuristic.
7.4.2 Graph Partitioning
Let the TRG be a graph G = (V, E), where each node u ∈ V is a data section of size s(u) and each edge e carries the weight w(e) defined in Section 7.3. An m-way partitioning divides V into disjoint sub-graphs G1, . . . , Gm with node sets V1, . . . , Vm such that ∪_{i=1}^{m} Vi = V. An edge is internal to a partition if both its end points lie in the same sub-graph, and external otherwise. The total external cost of a partitioning is

Σ_i Σ_{e_ext ∈ Gi} w(e_ext)    (7.5)

and each partition must satisfy the size constraint

Σ_{uj ∈ Gi} s(uj) ≤ cache-size    (7.6)

The graph partitioning problem is to find a partition with minimum external cost. Alternatively, the graph partitioning problem can also be formulated as maximizing the total internal cost, i.e., Σ_i Σ_{e_int ∈ Gi} w(e_int), subject to Σ_{uj ∈ Gi} s(uj) ≤ cache-size for each Gi.
The optimal partitioning problem is NP-Complete [38, 66]. There are a number of heuristic approaches [26, 47] to this problem, including the well known Kernighan-Lin heuristic [38] for two partitions. We extend the heuristic proposed in [38, 66] to solve our problem. The Kernighan-Lin heuristic aims at finding a minimal external cost partition of a graph into two equally sized sub-graphs. The heuristic achieves this by starting with a random partition and repeatedly swapping the pair of nodes that gives the maximum gain. The gain is computed from the difference between external and internal costs. Let us consider two nodes a and b present in two different sub-graphs A and B respectively. We define the external cost (ECost) of a as Ea = Σ_{x ∈ B} w(a, x) and the internal cost (ICost) of a as Ia = Σ_{y ∈ A} w(a, y), for each a ∈ A. Similarly the ECost and ICost of b are defined as Eb and Ib respectively. Let Da = Ea − Ia be the difference between ECost and ICost for each a ∈ A. A result proved by Kernighan and Lin [38] shows that for any a ∈ A and b ∈ B, if they are interchanged, the reduction in partitioning cost is given by Rab = Da + Db − 2·w(a, b). The nodes a and b are interchanged to partitions B and A respectively if Rab > 0.
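The Kernighan-Lin gain computation can be sketched as below (an illustrative implementation; the function and parameter names are ours, with edge weights stored as an unordered-pair dictionary and missing edges treated as weight 0).

```python
def kl_gain(a, b, part_a, part_b, w):
    """Kernighan-Lin swap gain R_ab = D_a + D_b - 2*w(a,b), where
    D_x = ECost(x) - ICost(x). Edge weights are looked up by the
    unordered pair, defaulting to 0 for absent edges."""
    def cost(x, side):
        return sum(w.get(frozenset((x, y)), 0) for y in side if y != x)
    d_a = cost(a, part_b) - cost(a, part_a)   # external minus internal cost of a
    d_b = cost(b, part_a) - cost(b, part_b)
    return d_a + d_b - 2 * w.get(frozenset((a, b)), 0)
```

A swap of a and b is profitable exactly when this gain is positive.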
In [66], the graph partitioning heuristic is generalized to an m-way partition. It starts with a random set of m partitions, picks any two of the partitions, and applies the Kernighan-Lin heuristic repeatedly on this pair until no more profitable exchanges are possible. Then these two partitions are marked as pair-wise optimal. The algorithm then picks two other partitions to apply the heuristic. This process is repeated until all the partitions are pair-wise optimal.
2. if a data-section size s(a) > cache-size, then this data section is placed in a partition and marked optimal; and
3. nodes a and b are interchanged to partitions B and A respectively only if Rab > 0 and the size constraints Σ_{a ∈ A} s(a) ≤ cache-size and Σ_{b ∈ B} s(b) ≤ cache-size still hold after the interchange.
The output of the graph partitioning step is a collection of sub-graphs that maximizes the internal cost, minimizes the external cost, and ensures that no partition has a size larger than the cache size4. Thus, each of the partitions can be placed in the off-chip RAM address space that maps to a cache page, such that none of the data sections that are part of the same partition will conflict in the cache. Now we are left with optimizing the cache conflicts that might arise from data sections belonging to two different partitions. Since the external cost is already minimized, the number of such conflicts will already be small. The offset computation step, described in the following subsection, aims at reducing conflicts caused by data sections belonging to different partitions.
7.4.3 Offset Computation
The cache offset computation step aims at reducing cache conflict misses between data sections that are part of two different partitions. Each partition is placed in the off-chip RAM address space that corresponds to one cache page. It may be noted that the
ordering of the partitions does not have any impact on the cache misses. For each of the
data sections in a partition, a cache-block offset needs to be assigned which in turn is
used to determine a unique off-chip memory address for the data section.
4. Obviously, a partition containing a data section whose size is larger than the cache size will not obey this property. But such a data section can be considered to form l = ⌈data section size / cache-size⌉ consecutive partitions, each less than or equal to the cache size.
To decide the offset that gives the least number of conflicts, we compute the placement cost for all possible placements of the data section inside a cache page. To compute the placement cost, we use a fine-grained version of the TRG. Note that the TRG computed in Section 7.3 is at the granularity of data sections. But to determine at which offset to place a data section, the temporal access pattern needs to be computed at a finer granularity. We illustrate these ideas with the help of an example.
Let there be 2 data sections a and b of size 128 bytes and 64 bytes respectively. Consider the following access pattern: a[0] b[0] a[60] b[1] a[61] b[2] a[62] b[3]. For this access pattern TRG(a, b) is 6, as explained in Section 7.3. Basically it means data sections a and b are accessed 6 times in an interleaving way. However, for a direct mapped cache of size 4KB with a 32 byte block size, if a is placed at address k and b is placed at off-chip address k + 4KB, this will not result in any conflict misses even though TRG(a, b) = 6. This is because a[60], a[61] and a[62] will map to a cache line (C + 1), while a[0], b[0], b[1], b[2] and b[3] will map to cache line C. Further, if a is placed at address k and b is placed at address k + 4KB + 32B, then it will result in 5 conflict misses.
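This example can be checked with a small direct-mapped cache simulation. The sketch below is illustrative (function names are ours); it assumes byte-sized elements, a placed at address k = 0, and counts conflict misses as total misses minus distinct blocks touched, which is valid here since the trace is too short for capacity misses.

```python
def cache_misses(addresses, cache_size=4096, block=32):
    """Direct-mapped cache simulation. Returns (total_misses,
    distinct_blocks); conflict misses = total_misses - distinct_blocks
    for this short trace."""
    lines = cache_size // block
    tags, blocks_seen, total = {}, set(), 0
    for addr in addresses:
        blk = addr // block
        line = blk % lines
        if tags.get(line) != blk:
            total += 1
            tags[line] = blk
        blocks_seen.add(blk)
    return total, len(blocks_seen)

def trace(b_base):
    """Access pattern a[0] b[0] a[60] b[1] a[61] b[2] a[62] b[3],
    with a placed at address 0 and b at b_base."""
    order = [("a", 0), ("b", 0), ("a", 60), ("b", 1),
             ("a", 61), ("b", 2), ("a", 62), ("b", 3)]
    return [idx if sec == "a" else b_base + idx for sec, idx in order]

no_conflict = cache_misses(trace(4096))        # b at k + 4KB
conflict = cache_misses(trace(4096 + 32))      # b at k + 4KB + 32B
```

The first placement incurs only the three compulsory misses, while the second adds exactly the 5 conflict misses discussed above.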
Hence, to determine the cost, in terms of conflict misses, of placing a data section, the TRG values are needed at a more granular level. For the above example, if we keep the granularity level at 1 cache block, then data section a is divided into 4 data blocks and data section b is divided into 2 data blocks. We define a new term TRGblk that represents the temporal access pattern among data blocks. This is similar to the approach described in [17]. The above access sequence results in a0, b0, a1, b0, a1, b0, a1, b0, where a0 and a1 represent the first two (cache-block sized) blocks of data section a and b0 represents the first block of data section b. For the above example, TRGblk will consist of nodes a0, a1, a2, a3, b0 and b1. For the access pattern given above, TRGblk(a1, b0) = 5 and all other TRGblk values are 0. We use the TRGblk values to compute the cost of placement, C(s, l), for a data section s at a cache offset l.
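The block-granularity TRG can be sketched by mapping each element access to a (section, block index) node and then counting interleavings exactly as for the section-level TRG; this is an illustrative implementation with a function name of our own, assuming accesses are given as (section, byte offset) pairs.

```python
def build_trg_blk(access_trace, block_size=32):
    """Block-granularity TRG: map each element access (section, byte
    offset) to a (section, block index) node, then count interleavings
    as in the section-level TRG."""
    blk_trace = [(sec, off // block_size) for sec, off in access_trace]
    trg, pending = {}, {}   # pending: node -> nodes seen since its last reference
    for node in blk_trace:
        for other in pending.get(node, ()):
            key = frozenset((node, other))
            trg[key] = trg.get(key, 0) + 1
        pending[node] = set()
        for k in pending:
            if k != node:
                pending[k].add(node)
    return trg
```

On the example trace this yields TRGblk(a1, b0) = 5 and no other edges, matching the values above.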
The offset computation algorithm is explained in Figure 7.6. To begin with, the partitions are ordered based on the total external cost (Ei). The partition Gi with the highest external cost is selected first for offset computation. Data sections that are part of partition Gi are ordered based on the external cost of the corresponding nodes in Gi. The data section uj with the highest external cost (Ei,uj) is taken up for offset computation first. Data section uj is placed in each of the allowable cache lines and the placement cost is computed with the help of TRGblk. Here, by allowable, we mean there should be enough contiguous free cache lines in the cache page to accommodate the data section uj. For example, if the data section size is 128 bytes and the cache block size is 32 bytes, then an allowable cache line means 4 contiguous lines are free. Note that at this point no offset is assigned to the data section uj. The cost of placement C(uj, l) for data section uj is computed for every allowable cache line l from 1 to Nl, where Nl is the total number of cache lines. The cache line l that has the minimum cost is assigned to data section uj, and the cache lines from l to l + size(uj)/line-size − 1 are marked as full so that these cache lines are not available for any other data section in Gi. Note that this restriction is put in place to ensure that the cache offsets for all data sections in a partition Gi are assigned within one cache page, and this ensures that the amount of external address space used is close to the application data size. The above process is repeated for all data sections in partition Gi. After this, the partition Gi+1 with the next highest external cost is selected for offset computation. This process continues until all partitions are handled.
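The per-partition inner loop of this algorithm can be sketched as below. This is an illustrative sketch, not the thesis implementation: names are ours, cache lines are 0-indexed, the section order is assumed to be pre-sorted by decreasing external cost, and the placement cost C(s, l) is passed in as a callable.

```python
def assign_offsets(partition, sizes, placement_cost, n_lines, line_size=32):
    """Greedy offset computation sketch: for each data section, evaluate
    every allowable run of free cache lines in the page and take the
    offset with minimum placement cost C(s, l)."""
    free = [True] * n_lines
    offsets = {}
    for s in partition:
        need = -(-sizes[s] // line_size)          # ceil: lines required
        best_l, best_c = None, None
        for l in range(n_lines - need + 1):
            if all(free[l:l + need]):             # allowable: contiguous free lines
                c = placement_cost(s, l)
                if best_c is None or c < best_c:
                    best_l, best_c = l, c
        if best_l is None:
            raise ValueError("no allowable offset left in this cache page")
        offsets[s] = best_l
        for i in range(best_l, best_l + need):    # mark the chosen lines as full
            free[i] = False
    return offsets
```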
7.5 Experimental Results
7.5.1
We have used the Texas Instruments TMS320C64x processor for our experiments. This processor has a 16K data cache, and we have used the Texas Instruments Code Composer Studio (CCS) environment for obtaining profile data and data memory address traces, and also for validating data layout placements. We have used 3 different applications - AAC (Advanced Audio Codec), MPEG video encoder and JPEG image compression from MediaBench [43] - for performing the experiments. We compute the TRG, STRG, and spatial locality factor from the data memory address traces obtained from CCS. We used eCacti [45] to obtain the area and power numbers for different cache configurations. First, we report experimental results demonstrating the benefits of our cache-conscious data layout method. Subsequently, in Section 7.5.4, we report the results pertaining to cache-SPRAM memory architecture exploration.
7.5.2
In this section we present results on our cache conscious data layout, and we compare our results with the approach proposed by Calder [17]. We have used the above 3 MediaBench applications and 4 different cache sizes. In this experiment, for all the cache sizes we have used a 32 byte cache-block size and a direct mapped cache configuration. Table 7.2 presents the results of the data layout. Column 4 in Table 7.2 presents the number of cache misses incurred when the data layout approach of [17] is used, and Column 5 gives the number of cache misses incurred when our data layout approach is applied. Our approach performs consistently better and reduces the number of cache misses, especially for AAC and MPEG. Our method achieves up to a 34% reduction (for AAC with a 16KB cache size) in cache misses. Also, our approach consumes an off-chip memory address space that is very close to the application data size. This is by construction of the graph-partitioning approach and the avoidance of gaps during data layout, as explained in Section 7.4. In contrast, Calder's [17] approach consumes 1.5 to 2.6 times the application data size in the off-chip address space to achieve the performance given in Table 7.2. This is a significant advantage of our approach, as increased off-chip address space implies increased memory cost for the SoC.
In Table 7.3, we present the results of our approach for different cache configurations (direct mapped, 2-way and 4-way set associative caches). Note that these experiments are performed with a cache-only architecture and no SPRAM. Observe that for all the applications, the reduction in misses is significant for 2-way and 4-way set associative caches. However, for the 4KB cache configuration for MPEG, the reduction in cache misses is small. This is due to the large data set (footprint) requirement of MPEG. Also, observe that the data set for JPEG is much smaller and hence a direct mapped 16K cache or a 4-way set associative 8KB cache can resolve most of the conflict misses.
[Table 7.2: cache misses and improvement (%) of our data layout over [17] for AAC, MPEG and JPEG across the evaluated cache sizes; the improvements range from -1% to 34%.]
7.5.3
In this section we present the results from our cache-SPRAM data partitioning method. Figures 7.7, 7.8 and 7.9 present the results of the data partitioning heuristic. In these figures, the x-axis represents the SPRAM size and the y-axis represents the performance in terms of memory stalls. Experiments were performed for three different cache sizes (4KB, 8KB and 16KB). For each of the cache sizes, the SPRAM size is increased from 0 to the application data size. For each of the memory configurations, data partitioning and cache conscious data layout are performed to obtain the memory stalls. The memory stalls refer to the number of stalls due to the external memory accesses caused by cache misses.
Observe that for all the applications, when the SPRAM size is increased, a significant performance improvement is achieved for all the cache sizes. However, the performance improvement is more pronounced for 4KB and 8KB caches than for 16KB caches. Observe that for AAC, the 8KB cache + 24KB of SPRAM gives the same performance as a 16KB cache with 4KB of SPRAM. The 16KB cache and 4KB SPRAM consume more area than the 8KB cache + 24KB SPRAM configuration. Similarly, for JPEG, a 4KB cache with 20KB of SPRAM gives the same performance as a 16KB cache with no SPRAM. This gives an architecture choice to the designers to select a configuration that suits
the target application. As we discussed earlier, both caches and SPRAM have their own advantages. For instance, caches offer hardware-managed reusable on-chip memory space that provides feature extendability to the system, whereas SPRAM provides predictable performance and lower power consumption. Hence, the selection of an architecture needs careful analysis from different viewpoints.
Now we present the power consumption numbers for all the applications in Figures 7.10, 7.11 and 7.12. In these figures, the x-axis represents the SPRAM size and the y-axis represents the total power consumed by the memory subsystem. There are three plots, one for each cache size (4KB, 8KB, and 16KB). As expected, the power numbers for the 16KB cache configurations are higher than for the other two. However, in all the figures, observe that the power numbers converge towards the end for higher SPRAM sizes. This is because, for higher SPRAM sizes, most of the application's critical data sections are mapped to SPRAM and hence not much activity happens in the cache. Thus, the power numbers are mostly influenced by the SPRAM accesses. Observe that for the 16KB cache, the power numbers are higher for smaller SPRAM sizes and gradually decrease as the SPRAM size increases.
Figure 7.10: AAC: Power consumed for different hybrid memory architecture
Figure 7.11: MPEG: Power consumed for different hybrid memory architecture
Figure 7.12: JPEG: Power consumed for different hybrid memory architecture
In summary, the system designer needs to look at the performance graphs for the application, similar to those presented in Figures 7.7, 7.8 and 7.9, and also study the power graphs, similar to the ones presented in Figures 7.10, 7.11 and 7.12, to arrive at a suitable architecture. One more dimension that is not covered here is the memory area. The next section looks at the memory architecture exploration, where the system designer can look at the memory design space from a power, performance and area viewpoint.
7.5.4 Memory Architecture Exploration
In this section we present the results from our memory architecture exploration. As mentioned in Section 7.2, we explore the Cache-SPRAM solution space with the following parameters: (a) cache size, (b) cache block size, (c) cache associativity and (d) SPRAM size. Again we have used the same 3 benchmark applications. As mentioned earlier, we use an exhaustive search method for memory exploration by varying the above parameters. We start with no SPRAM and a 4KB cache, and keep increasing the cache size up to the application data size (88KB, 108KB and 40KB for AAC, MPEG and JPEG respectively). For each cache size explored, we then increase the SPRAM size from 0 to the application data size in 4KB steps. Also, for each of the cache configurations, we vary the block size from 8 bytes to 64 bytes in 8-byte steps and the associativity from 1 to 4. Based on the application data size, the number of memory configurations evaluated varies from 1200 to 2800. From the total memory configurations evaluated, we compute
the non-dominated solutions based on the Pareto optimality criteria explained in Section 7.2. Figures 7.13, 7.14, and 7.15 present the non-dominated solutions for AAC, MPEG and JPEG respectively. In these figures, the x-axis represents the number of memory stall cycles and the y-axis represents the power consumption. We have presented the power vs. performance graph for different area bands5. We observe from Figure 7.13 that as the area band increases, we get better power and performance. Note that the solution points converge from the top-right portion of the graph (which is a high power and low performance region) to the lower-left portion of the graph (which is the low power and high performance region) as the area is increased. In Figure 7.14, the solution in the top right corner has a memory configuration of 4K cache size, direct mapped with a 32 byte cache-block and no SPRAM. As we can observe, this is a very conservative architecture giving very low performance and high power consumption. On the other
5 Again, due to proprietary reasons, we present normalized area values for the different configurations, instead of absolute values.
hand, the solution in the lower-left corner has a memory configuration with an 8KB cache
with 2-way set associativity, 16-byte cache blocks and 128KB of SPRAM. This is a very
high-end architecture that consumes a lot of area, but gives the best performance and power
consumption. Thus the set of Pareto Optimal design points gives designers a critical view
from which to pick memory configurations that suit the application and system
requirements.
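The sweep-and-filter procedure described above can be sketched as follows. The cost model here is a stand-in (in the thesis, stall counts come from the data layout heuristic, and power/area from eCacti and a semiconductor memory library), so only the enumeration structure and the non-dominance filter reflect the actual method:

```python
from itertools import product

# Hypothetical cost model: NOT the thesis's actual estimators, just placeholder
# trends (bigger cache/SPRAM -> fewer stalls, more power and area).
def evaluate(cache_kb, block_bytes, assoc, spram_kb):
    stalls = 1e6 / (cache_kb * assoc * block_bytes) + 5e5 / (spram_kb + 1)
    power = 0.4 * cache_kb * assoc + 0.1 * spram_kb + 0.01 * block_bytes
    area = 1.2 * cache_kb + spram_kb
    return (stalls, power, area)

def dominates(a, b):
    """a dominates b if a is no worse in every objective, better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def explore(app_data_kb):
    points = []
    for cache_kb, spram_kb, block, assoc in product(
            range(4, app_data_kb + 1, 4),   # cache: 4KB up to data size
            range(0, app_data_kb + 1, 4),   # SPRAM: 0 up to data size, 4KB steps
            range(8, 65, 8),                # block size: 8 to 64 bytes
            (1, 2, 4)):                     # associativity: 1 to 4
        cfg = (cache_kb, block, assoc, spram_kb)
        points.append((cfg, evaluate(*cfg)))
    # Keep only the non-dominated (Pareto Optimal) configurations.
    return [(cfg, obj) for cfg, obj in points
            if not any(dominates(other, obj) for _, other in points
                       if other != obj)]

pareto = explore(app_data_kb=16)  # small sweep, for illustration only
```

With the step sizes above, a 40KB application yields a few thousand configurations, consistent with the 1200-2800 range reported in this section.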
7.6 Related Work
7.6.1 Cache Conscious Data Layout
Many earlier works propose source code level transformations with the objective of improving data locality. Loop transformation based data locality optimizing
algorithms are proposed by Wolf et al. [82]. They describe a locality optimization algorithm that applies a combination of loop interchange, skewing, reversal and tiling to
improve the data locality of loop nests. Earlier works [21, 37, 53] propose source level
transformations such as array tiling, re-ordering data structures and loop unrolling to improve cache performance. In contrast, we focus on optimizing object module level data placement
without any code modifications. We emphasize that this is important, as the application
development flow in embedded systems typically involves integration of many IPs whose
source code may not be available. The data layout optimization proposed by Panda
et al. addresses the scenario where data arrays are placed at off-chip RAM addresses that are
multiples of the cache size, which results in thrashing due to cache conflict misses in a direct-mapped cache. They propose introducing dummy words (padding) between the data
arrays to avoid cache conflicts.
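The conflict scenario that padding resolves can be illustrated with a small address calculation (the cache parameters and base addresses below are illustrative, not tied to any specific processor):

```python
# Two arrays whose base addresses differ by a multiple of the cache size map
# every corresponding element to the same cache set, so an interleaved access
# pattern (a[i], b[i], a[i+1], b[i+1], ...) thrashes a direct-mapped cache.
CACHE_SIZE, BLOCK = 4096, 32           # bytes; illustrative values only
NUM_SETS = CACHE_SIZE // BLOCK

def cache_set(addr):
    # Set index of a direct-mapped cache: block number modulo number of sets.
    return (addr // BLOCK) % NUM_SETS

a_base = 0x10000
b_base = a_base + CACHE_SIZE           # worst case: b conflicts element-for-element
assert cache_set(a_base) == cache_set(b_base)

# Inserting a pad (dummy words) between the arrays shifts b into other sets.
pad = BLOCK                            # one cache block of padding
assert cache_set(a_base) != cache_set(b_base + pad)
```

This is why padding between arrays, at a small cost in memory footprint, can eliminate a systematic stream of conflict misses.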
Data layout heuristics that aim at minimizing cache conflict misses have been proposed
in [17, 41]. The problem has been formulated as an Integer Linear Program (ILP) in
[41]. They also propose a heuristic method to avoid the long run-times of ILP solvers.
Calder et al. [17] use a Temporal Relationship Graph (TRG) that captures the temporal
access characteristics of data and propose a greedy algorithm for cache conscious data
layout. While the approaches in [41, 17] target only the minimization of conflict misses en
masse, our approach aims at minimizing conflict misses within a certain off-chip memory
address space. The constraint of working within a certain external memory address space
is very important for memory architecture exploration, since it makes the instruction-cache performance independent of the data cache for architectures where the external memory
address space is common to both data and instruction caches, thereby reducing the
memory architecture search space.
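As a rough illustration of the TRG idea (a simplified sketch, not Calder et al.'s exact algorithm), one can weight variable pairs by temporal proximity in an access trace and then greedily assign each variable to the cache set where it conflicts least with already-placed neighbors:

```python
from collections import Counter

NUM_SETS = 8  # illustrative direct-mapped cache with 8 sets

def build_trg(trace, window=4):
    """Edge weight = how often two variables are accessed close together in
    time; heavy edges mark pairs that should not map to the same cache set."""
    trg = Counter()
    for i, v in enumerate(trace):
        for u in trace[max(0, i - window):i]:
            if u != v:
                trg[frozenset((u, v))] += 1
    return trg

def greedy_layout(variables, trg):
    """Place variables one at a time, choosing for each the cache set that
    minimizes the TRG weight to variables already placed in that set."""
    placement = {}
    # Handle the most heavily connected variables first.
    order = sorted(variables,
                   key=lambda v: -sum(w for e, w in trg.items() if v in e))
    for v in order:
        def cost(s):
            return sum(w for e, w in trg.items()
                       if v in e and any(placement.get(u) == s
                                         for u in e if u != v))
        placement[v] = min(range(NUM_SETS), key=cost)
    return placement

trace = list("abacabadbcbd") * 10
layout = greedy_layout(set(trace), build_trg(trace))
```

In this toy trace, the heavily interleaved variables end up in different cache sets, which is exactly the effect a cache conscious layout seeks.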
Chilimbi et al. [22] propose two cache friendly placement techniques, coloring and
clustering, that improve the spatial and temporal locality of data structures, thereby
improving cache performance. Their approach works mainly for trees and tree-like data
structures. They also propose a cache conscious heap allocation method which allocates
memory closer to contemporaneously accessed data objects based on the programmer's input.
This reduces the number of conflict misses. However, this approach is expensive
in terms of performance, as decisions need to be taken at run-time. Embedded systems are
performance sensitive and hence the usage of dynamic heap objects is usually discouraged. Any additional run-time performance overhead in memory allocation will take away
the benefit that comes from reduced conflict misses. Further, critical sections of embedded applications are typically developed in hand-written assembly language. Hence any
modifications in the layout of structures cannot be completely handled by compilers.
A greedy data layout heuristic is proposed in [65] which optimizes energy consumption
in horizontally partitioned cache architectures. Their approach uses the idea that the
energy consumed per access in a small cache is less than that in a larger cache. Hence,
for cache architectures that have a main cache and a smaller mini cache, the authors
show that a simple greedy data partitioning heuristic, which partitions data between the
main cache and the mini cache, performs well to reduce the overall energy consumption of
the memory subsystem. Our work addresses a different target memory architecture with
SPRAM and cache.
Palem et al. [50] propose a compile-time data remapping method to reorganize record
data types with the objective of improving the temporal access characteristics of data objects.
Their method analyzes program traces to mark data objects of records whose access
characteristics and field layout do not exhibit temporal locality. The authors propose a
heuristic algorithm to remap the fields of the data objects that were marked during the
analysis phase. The heuristic remaps the fields in data objects to improve temporal
locality and thus avoids additional cache misses. Their approach is very efficient for record
type data structures like linked lists and trees. However, their approach requires compiler
support to reorganize the fields of data structures, and the corresponding code changes to
access the remapped fields. In contrast, our work focuses on the layout of data structures
without requiring code changes, which is an important constraint in the IP-based embedded
application development flow.
7.6.2 Data Partitioning Between SPRAM and Cache
An Integer Linear Programming (ILP) based approach to partition instruction traces between SPRAM and instruction cache with the objective of reducing energy consumption
has been proposed in [80]. Our work focuses on data partitioning between SPRAM and data
cache. Further, we consider DSP applications, which typically have multiple simultaneous
memory accesses leading to parallel and self conflicts.
To the best of our knowledge, only [53] addresses data partitioning for SPRAM-cache
based hybrid architectures. They present a data partitioning technique that
places data into on-chip SRAM and data cache with the objective of maximizing performance. Based on the lifetimes and access frequencies of array variables, the most
conflicting arrays are identified and placed in scratch-pad RAM to reduce the conflict
misses in the data cache. This work addresses the problem of limiting the number of
memory stalls by reducing the conflict misses in the data cache through efficient data
partitioning. They also demonstrate memory exploration of hybrid architectures with
their proposed data partitioning heuristic. However, their memory exploration framework does not have an integrated cache-conscious data layout. They propose a model to
estimate the number of cycles spent in cache access. Our approach proposes data partitioning based on three factors: (i) access frequency, (ii) temporal access characteristics and
(iii) spatial access characteristics. Our proposed method is a comprehensive data layout
approach for SPRAM-cache based architectures, as we perform data partitioning followed
by cache conscious data layout. Also, our approach works on all the key system design
objectives: area, power and performance.
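A minimal sketch of frequency- and conflict-driven partitioning in the spirit described above (the per-variable statistics and the scoring formula are hypothetical, not the thesis's actual heuristic):

```python
# Hypothetical per-variable statistics; in practice these come from profiling.
# conflict_score stands in for the temporal/spatial conflict behaviour of the
# variable when it is left in the cache (higher = more conflict misses).
variables = [
    # (name, size_bytes, access_freq, conflict_score)
    ("coeff",   2048, 90000, 0.9),
    ("input",   8192, 40000, 0.2),
    ("state",   1024, 70000, 0.7),
    ("scratch", 4096, 10000, 0.1),
]

def partition(variables, spram_bytes):
    """Greedy sketch: variables that are both hot and conflict-prone earn the
    most by moving to SPRAM; everything that does not fit stays cached."""
    ranked = sorted(variables,
                    key=lambda v: -(v[2] * v[3]) / v[1])  # benefit per byte
    spram, cached, free = [], [], spram_bytes
    for v in ranked:
        if v[1] <= free:
            spram.append(v[0]); free -= v[1]
        else:
            cached.append(v[0])
    return spram, cached

spram, cached = partition(variables, spram_bytes=4096)
```

Here the small, hot, conflict-prone buffers win the limited SPRAM, while the large streaming array stays in the cached region, mirroring the intuition behind the three-factor partitioning described above.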
7.6.3 Memory Architecture Exploration
Panda et al. propose a local memory exploration method for SPRAM-cache based memory organization. They propose a simple and efficient heuristic to walk through the
SPRAM-cache space. For each memory architecture configuration considered, the
performance of the memory configuration is estimated by an analytical model. In [64],
an exhaustive search based exploration approach is proposed for a cache based memory
architecture, which explores the memory design space based on parameters like on-chip
memory size, cache size, line size and associativity. The authors extend the work by
Panda et al. to consider energy consumption as the performance metric for the memory
architecture exploration. Memory exploration for cache based memory architectures is also
considered in [51].
The main difference between the above works and our method is that our memory
architecture exploration framework integrates an efficient data layout heuristic as part
of the framework to evaluate each memory architecture. Without an efficient data layout, a
random mapping of the application may result in poor performance even for a good memory
architecture. Further, our memory architecture exploration framework considers multiple
objectives such as performance, area and power.
The memory hierarchy exploration problem is formulated in a Genetic Algorithm framework in [52] and [9]; their target architecture consists of separate L1 caches for instruction
and data, and a unified L2 cache. Their objective function is a single formula which combines area, average access time and power. In [52], additional parameters such as bus
width and bus encoding are considered, and the problem is modeled in a multi-objective
GA framework. The main difference between their work and ours is the integration
of data layout as part of the memory architecture exploration framework. The absence of a
cache conscious data layout means that the application data may not have been efficiently
placed in off-chip RAM, which leads to poor performance. A point to note here
is that this poor performance is a result of inefficient data placement and not of the
cache configuration. The other main difference is that [52, 9] use simulation based fitness
function evaluation, which limits the number of evaluations due to the large run-time. In
comparison, our approach uses an analytical model to compute the fitness functions.
7.7 Conclusions
Our memory architecture exploration framework generates several Pareto Optimal solutions within a few hours on a standard desktop. Our solution
is fully automated and meets the time-to-market requirements.
Chapter 8
Conclusions
In this chapter, we present a summary of the thesis and outline possible extensions to
this work.
8.1 Thesis Summary
In this work, we presented methods and a framework to address the memory subsystem
optimization problem for embedded SoCs.
In Chapter 3, we presented three different methods to address the data layout problem
for Digital Signal Processors. Multiple methods are required for addressing the problems
in the embedded design flow. For instance, data layout during memory architecture
exploration needs to be very fast, as data layout is used several thousand times to evaluate
different memory architectures. On the other hand, the data layout method used for final
system production needs to generate a highly optimized solution irrespective of the run-time.
Hence, we proposed three different approaches for data layout in Chapter 3: (i) an Integer
Linear Programming (ILP) based approach, (ii) a Genetic Algorithm (GA) formulation of
data layout and (iii) a fast and efficient heuristic method. We compared the results of all
three approaches. The heuristic method performs very efficiently both in terms of the
quality of the data layout and in terms of run-time. The quality of data layout (the
number of memory stalls reduced) generated by the heuristic algorithm is within 5% of
that of the GA's output. The ILP approach gives the best quality solution, but its run-time
is very high.
In Chapter 4, we addressed the logical memory architecture exploration problem for
embedded DSP processors. As discussed in Chapter 1, the logical view is closer to the behavior and helps in reducing the search space by abstracting the problem. We formulated
the logical memory architecture exploration (LME) problem in both multi-objective GA and
multi-objective SA frameworks. The multiple objectives include performance (in terms of memory
stalls) and cost (in terms of logical memory area). Both GA and SA produce 100-250
Pareto-optimal design points for all application benchmarks. Our experiments showed
that the multi-objective GA performed better than the SA approach in terms of (i) the quality of
solutions, measured by the number of non-dominated solutions generated in a given time,
and (ii) uniformly searching the design space (diversity of solutions). Both the GA and SA
based approaches take approximately 30 minutes of run-time to generate the Pareto-optimal
solutions for one benchmark.
Chapter 5 addressed the data layout exploration problem from a physical memory
architecture perspective. Again, the target memory architecture is for embedded
DSP processors. We proposed a Multi Objective Data Layout EXploration (MODLEX)
framework that searches the data layout design space from a performance and power
consumption viewpoint for a given physical memory architecture. We showed that our
method effectively uses the multiple memory banks, single/dual-ported memories, and
non-uniform banks to produce around 100-200 data layout solutions that are Pareto-optimal with respect to performance and power consumed. We also showed that a large
trade-off, up to 70%, between power and performance is possible by using different data
layout solutions, specifically for DSP based memory architectures.
In Chapter 6, we addressed memory architecture exploration for embedded DSP
processors from a physical memory perspective. We proposed two different approaches to
physical memory exploration. The first approach extends the logical memory architecture
exploration described in Chapter 4 to address the physical memory architecture exploration problem. This approach is referred to as LME2PME. As part of the steps to extend
LME to address PME, we proposed a memory allocation exploration framework that
takes the Pareto-optimal logical memory architectures and their corresponding data layouts
as input, and explores the physical memory space by constructing the given logical memory architecture from physical memories in different possible ways, with the objective of
optimizing area and power consumption. The memory allocation exploration is formulated
as a multi-objective GA.
The second approach proposed in Chapter 6 is an integrated approach that directly
addresses the physical memory architecture exploration problem. This approach is known
as DirPME. It formulates the physical memory exploration problem directly
as a multi-objective GA. This approach works on data layout, memory exploration and
memory allocation simultaneously, and hence its search space is much larger than that of
the LME2PME approach. We showed that both approaches, LME2PME and DirPME,
provide several hundreds of Pareto-optimal points that are interesting from an area, power and
performance viewpoint. Further, we showed that for a given run-time LME2PME provides better
solutions than the DirPME approach. However, the solutions of DirPME and LME2PME
are very close, and hence both approaches are useful depending on the needs of system
designers.
Finally, in Chapter 7 we extended our memory architecture exploration framework
to address SPRAM-cache based on-chip memory architectures. We proposed an efficient
data partitioning heuristic to partition data sections between on-chip SPRAM and cache.
A graph partitioning based cache conscious data layout heuristic is proposed with the
objective of reducing cache conflict misses. An exhaustive search method is applied to explore
the SPRAM-cache design space. Each memory architecture is evaluated by mapping a target
application to the memory architecture under consideration, using the data partitioning
heuristic and the cache conscious data layout heuristic, to obtain the performance in terms
of the number of memory stalls. We used eCacti [45] to obtain the area and power-per-access
numbers for the cache, and used a semiconductor memory library to obtain the
area and power numbers for SPRAM. Based on the area, power and performance, and by
applying the Pareto-optimal conditions, the list of Pareto-optimal memory architectures
is identified.
8.2 Future Work
8.2.2 Platform Impact Analysis
The impact of platform change on system parameters like area, power and performance
can be studied for a given application, a given semiconductor memory library and process
node. The impact analysis is critical to identify where to spend the effort in improving
the platform such that overall system performance improvement is high.
8.2.3 Software IP Rework Analysis
We have addressed the memory architecture exploration problem for a given set of applications. At this stage, the make, buy or reuse decisions are made, and the list of IP
modules to be used as part of the system is known, as shown in Figure 1.4. We could
extend our memory architecture exploration framework to analyze the impact of rework
or design improvement of one or more software IPs on the memory power, performance and
area. This analysis could direct the IP optimization efforts properly, with the objective
of improving system area, power and performance.
8.2.4 Semiconductor Memory Library Evaluation
Our memory architecture exploration framework can be extended to study the suitability
of a specific semiconductor memory library for a specific embedded system. Further, the
impact of rework of a semiconductor memory library on memory system area and power
can be studied to decide and prioritize the areas of rework.
8.2.5 Multiprocessor Architectures
Our work on data layout and memory architecture exploration focuses mainly on optimizing the on-chip memory organization of a processor (DSP or microcontroller) in an SoC.
Our work can be extended to optimize shared memory subsystems in a multiprocessor
based SoC.
Bibliography
[1] ARM920T and ARM922T: ARM9 Family of Embedded Processors.
http://www.arm.com/products/CPUs/families/ARM9Family.html.
[2] ARM926EJ-S and ARM926E-S: ARM9E Family of Embedded Processors.
http://www.arm.com/products/CPUs/families/ARM9Family.html.
[3] lp solve.
http://lpsolve.sourceforge.net/5.5/.
[4] SystemC Language for System-Level Modeling, Design and Verification.
http://www.systemc.org/home.
[5] Verilog Hardware Description Language.
http://www.verilog.com/index.html.
[6] International Technology Roadmap for Semiconductors. SEMATECH, 3101 Industrial Terrace, Suite 106, Austin, TX 78758, 2001.
[7] 2007 global mobile communications - statistics, trends and forecasts. Technical
report,
http://www.reportbuyer.com/telecoms/mobile/2007 global mobile trends.html,
2007.
[8] 1st IEEE International Symposium on Industrial Embedded Systems. Panel Discussion: Open Issues in SoC Design.
http://www.iestcfa.org/panel discussions.htm, 2006.
[9] G. Ascia, V. Catania, and M. Palesi. Parameterised system design based on genetic
algorithms. In Proceedings of ACM 2nd International Conference on Compilers,
Architectures and Synthesis for Embedded Systems (CASES), November 2001.
[10] O. Avissar, R. Barua, and D. Stewart. Heterogeneous memory management for embedded systems. In Proceedings of ACM 2nd International Conference on Compilers,
Architectures and Synthesis for Embedded Systems (CASES), November 2001.
[11] F. Balasa, F. Catthoor, and H. De Man. Background memory area estimation for
multidimensional signal processing systems. IEEE Trans. VLSI Systems, 3:157–172,
June 1995.
[12] R. Banakar, S. Steinke, B-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad
Memory: A design alternative for Cache On-chip memory in Embedded Systems.
In Tenth International Symposium on Hardware/Software Codesign (CODES), Estes
Park, Colorado, May 2002. ACM.
[13] M. Barr. Embedded Systems Gallery.
http://www.netrino.com/Publications/Glossary/index.php.
[14] L. Benini, L. Macchiarulo, A. Macii, E. Macii, and M. Poncino. From architecture
to layout: Partitioned memory synthesis for embedded systems-on-chip. In Design
Automation Conference, 2001.
[15] L. Benini, L. Macchiarulo, A. Macii, and M. Poncino. Layout driven memory synthesis for embedded systems-on-chip. In Proceedings of ACM 3rd International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES),
November 2002.
[16] Broadcom,
http://www.broadcom.com/collateral/pb/2702-PB02-R.pdf. BCM2702: High Performance Mobile Multimedia Processor, 2006.
[26] W.W. Donaldson and R.R. Meyer. A dynamic-programming heuristic for regular
grid-graph partitioning. Technical report,
http://pages.cs.wisc.edu/~wwd/rev4.pdf, 2007.
[27] G. Dueck and T. Scheuer. Threshold accepting: A general purpose optimization
algorithm appearing superior to simulated annealing. Journal of Computational Physics,
90:161–175, 1990.
[28] R. Fehr. Intellectual property: A solution for system design. In Technology Leadership
Day, October 2000.
[29] D. Gajski. Design methodology for systems-on-chip. Technical report, Centre for
Embedded Computer Systems, University of California, Irvine, California,
http://www.cecs.uci.edu/eve presentations.htm, 2002.
[30] D. E. Goldberg. Genetic Algorithms in Search, Optimizations and Machine Learning.
Addison-Wesley, 1989.
[31] P. Grun, N. Dutt, and A. Nicolau.
ment algorithm for retargetable compilers. In Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, pages 182–191. ACM Press, 2004.
[35] P. K. Jha and N. D. Dutt. Library mapping for memories. In EuroDesign, 1997.
[36] B. Juurlink and P. Langen. Dynamic techniques to reduce memory traffic in embedded systems. In Conference On Computing Frontiers, pages 192–201, 2004.
[37] M. Kandemir, J. Ramanujam, and A. Choudhary. Improving cache locality by a
combination of loop and data transformations. IEEE Transactions on Computers,
1999.
[38] B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs.
The Bell System Technical Journal, 49(2):291–307, 1970.
[39] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing.
Science, 220, 1983.
[40] M. Ko and S. S. Bhattacharyya. Data partitioning for DSP software synthesis. In
Proceedings of the International Workshop on Software and Compilers for Embedded
Processors, September 2003.
[41] C. Kulkarni, C. Ghez, M. Miranda, F. Catthoor, and H. De Man. Cache conscious
data layout organization for embedded multimedia applications. In Design, Automation and Test in Europe, pages 686–691, 2001.
[42] Bernard Laurent and Thierry Karger. A system to validate and certify soft and hard
IP. In Design, Automation and Test in Europe Conference and Exhibition, 2003.
[43] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In International
Symposium on Microarchitecture, 1997.
[44] R. Leupers and D. Kotte. Variable partitioning for dual memory bank DSPs. In
International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
Salt Lake City (USA), May 2001.
[45] M. Mamidipaka and N. Dutt. eCACTI: An enhanced power estimation model for on-chip caches. Technical report, Centre for Embedded Computer Systems, University
of California, Irvine, California, 2004.
[55] P. R. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. off-chip memory: The data
partitioning problem in embedded processor-based systems. ACM Trans. Design
Automation of Electronic Systems, 5(3):682–704, July 2000.
[56] S. Pees, A. Hoffmann, V. Zivojnovic, and H. Meyr. LISA - Machine Description
Language for Cycle-Accurate Models of Programmable DSP Architectures. In Design
Automation Conference, 1999.
[57] R. Rutenbar. Simulated annealing algorithms: an overview. IEEE Circuits and
Devices Magazine, January 1989.
[58] M. A. R. Saghir, P. Chow, and C. G. Lee. Exploiting dual data-memory banks
in digital signal processors. In Proceedings of the 7th Intl. Conference on Architectural
Support for Programming Languages and Operating Systems, pages 234–243, October
1996.
[59] A. Sangiovanni-Vincentelli, L. Carloni, F. De Bernardinis, and M. Sgroi. Benefits
and challenges for platform based design. In Design Automation Conference, 2004.
[60] E. Schmidt. Power Modelling of Embedded Memories. PhD thesis, 2003.
[61] H. Schmit and D. Thomas. Array mapping in behavioral synthesis. In Proc. of the
International Symposium on System Synthesis (ISSS), 1995.
[62] J. Seo, T. Kim, and P. Panda. An integrated algorithm for memory allocation
and assignment in high-level synthesis. In Proceedings of 39th Design Automation
Conference, pages 608–611, 2002.
[63] J. Seo, T. Kim, and P. Panda. Memory allocation and mapping in high-level synthesis:
an integrated approach. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 11(5), October 2003.
[64] W.T. Shiue and C. Chakrabarti. Memory exploration for low power, embedded
systems. In Design Automation Conference, pages 140–145, New York, 1999. ACM
Press.
[65] A. Shrivastava, I. Issenin, and N. D. Dutt. Compilation techniques for energy reduction in horizontally partitioned cache architectures. In Proceedings of ACM 6th
International Conference on Compilers, Architectures and Synthesis for Embedded
Systems (CASES), September 2005.
[66] K. Shyam and R. Govindarajan. Compiler directed power optimization for partitioned memory architectures. In Proc. of the Compiler Construction Conference
(CC-07), 2007.
[67] J. Sjodin and C. Platen. Storage allocation for embedded processors. In Proceedings
of ACM 2nd International Conference on Compilers, Architectures and Synthesis for
Embedded Systems (CASES), November 2001.
[68] A.J. Smith. Cache memories. ACM Computing Surveys, 1982.
[69] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.
[70] S. Sriram and S. S. Bhattacharyya. Embedded Multiprocessors: Scheduling and
Synchronization. Embedded computer systems, 2000.
[71] A. Sundaram and S. Pande. An efficient data partitioning method for limited memory
embedded systems. In ACM SIGPLAN Workshop on Languages, Compilers and
Tools for Embedded Systems (in conjunction with PLDI 98), pages 205–218, 1998.
[72] R. Szymanek, F. Catthoor, and K. Kuchcinski. Time-energy design space exploration
for multi-layer memory architectures. In Design, Automation and Test in Europe, 2004.
[73] Texas Instruments,
http://focus.ti.com/dsp/docs/. Code Composer Studio (CCS) IDE.
[74] Texas Instruments,
http://dspvillage.ti.com/. TMS320 DSP Algorithm Standard, 2001.
[84] D.F. Wong, H.W. Leong, and C.L. Liu. Simulated Annealing for VLSI Design. Kluwer
Academic Publishers, 1988.