1 Introduction
Embedded systems are becoming increasingly heterogeneous in terms of software. A current high-end cell phone runs a considerable number of applications, most of them installed during the product lifetime. Since current embedded systems have hard design constraints, applications must be executed efficiently to provide the lowest possible energy consumption. To support this demand, current embedded platforms (e.g., OMAP) comprise one or two simple general-purpose processors surrounded by several dedicated hardware accelerators (communication, graphics and audio processors), each with its particular instruction set architecture (ISA). While today one can find up to 21 accelerators in an embedded platform, it is expected that 600 different dedicated accelerators will be needed by the year 2024 [1], considering the growth in the complexity of embedded applications.
The foreseen scenario is not favorable for future embedded products if the current strategy of platform development is maintained. The four quadrants plotted in Figure 1 show the strengths and weaknesses of the existing hardware strategies used to design a platform, considering organization and architecture. The main strategy used in current embedded platforms, illustrated in the lower left quadrant of Figure 1, is area inefficient, since it relies on dedicated accelerators (such as ASIPs or graphics accelerators). Moreover, although it is energy efficient, it covers only software with restricted behavior in terms of ILP and TLP. Releases of such a platform are not transparent to software developers, since each new platform requires a new version of its tool chain, with particular libraries and compilers. Besides the obvious deleterious effects on software productivity and compatibility for any hardware upgrade, there are also the intrinsic costs of new hardware and software development for every new product.
On the other hand, the upper right quadrant of Figure 1 illustrates multiprocessing systems conceived as the replication of identical processors, in terms of both architecture and organization. Commonly, this strategy is employed in general-purpose platforms where performance is mandatory. However, energy consumption is also becoming relevant in this domain (e.g., to reduce energy costs in datacenters). To cope with this drawback, the combination of homogeneous architecture and heterogeneous organization, shown in the upper left quadrant of Figure 1, has been emerging to provide better energy and area efficiency than the other two aforementioned platforms, at the cost of a longer design validation time, since many different processor organizations are used. However, it has the advantage of implementing a single ISA, so the software development process is not penalized: it is possible to generate assembly code using the very same tool chain for any platform version, while maintaining full binary compatibility for already developed applications. Unfortunately, since most applications have a limited amount of TLP [26], none of the above strategies provides the energy and performance optimization required together with the software productivity necessary in the constrained embedded systems domain.
[Figure 1 (diagram not reproduced): four quadrants organized by homogeneous/heterogeneous architecture and organization, contrasting general-purpose MPSoCs (replicated GPPs) with embedded MPSoCs (GPP, DSP, ASIPs and a graphics accelerator); the weaknesses listed per quadrant cover SW productivity, energy consumption, chip area, design time and SW behavior coverage.]
Figure 2. (a) Example of a CReAMS Platform (b) Acceleration Process
The entire structure of the reconfigurable data path is combinational: there is no temporal barrier between the functional units. The only exceptions are the entry and exit points. The entry point keeps the input context, which is connected to the processor register file. Fetching the input operands from the SparcV8 register file is the first step to configure the data path before actual execution. After execution, results are stored in the output context registers through the exit point of the data path. The values stored in the output context are sent to the SparcV8 register file on demand: if a value is produced at some data path level (a level corresponds to one cycle of the SparcV8 processor) and will not be changed in the subsequent levels, it is written back in the cycle after it was produced. In the current implementation, the SparcV8 register file has two write/read ports. The interconnections among the functional units and the input and output contexts are made through multiplexers.
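The on-demand write-back policy just described can be sketched as follows. This is an illustrative Python model, not the actual VHDL; register names and per-level contents are hypothetical.

```python
# Illustrative model of the DAP's on-demand write-back: a result leaves the
# output context in the cycle after the last data-path level that produces it.

def writeback_cycles(levels):
    """levels[k] is the set of register names written at level k
    (one level advances per SparcV8 cycle). Returns {reg: writeback cycle}."""
    last_write = {}
    for cycle, regs in enumerate(levels):
        for reg in regs:
            last_write[reg] = cycle          # a later level overrides the value
    # each final value is written back in the cycle after its producing level
    return {reg: cycle + 1 for reg, cycle in last_write.items()}

# r1 is produced at level 0 and never changed -> written back at cycle 1;
# r2 is overwritten at level 2 -> only its final value leaves, at cycle 3.
wb = writeback_cycles([{"r1", "r2"}, set(), {"r2"}])
```

This captures why intermediate values of r2 never reach the register file: only the last producer of each register triggers a write-back.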
We have coupled sleep transistors [15] to each functional unit of the reconfigurable data path to switch its power on and off. The dynamic reconfiguration process is responsible for managing the sleep transistors: their states are stored in the reconfiguration memory, together with the reconfiguration data. Thus, for a given configuration, idle functional units are set to the off state, which avoids leakage and dynamic power dissipation, since the incoming bits do not produce switching activity in the disconnected circuit. Although the sleep transistors are larger than, and placed in series with, the regular transistors of the implemented circuit, they have been designed so that their delays do not significantly impact the critical path or the reconfiguration time.
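The effect of the per-configuration sleep-transistor states can be illustrated with a simple power model. The per-unit power figures below are hypothetical, chosen only to show the mechanism.

```python
# Hedged sketch: sleep-transistor state bits, stored with each configuration,
# gate idle functional units off. FU power numbers are illustrative only.

FU_POWER_MW = {"alu": 1.2, "ldst": 2.0, "mul": 3.5}   # hypothetical per-unit power

def gated_power(units, active_mask):
    """units: list of FU types in the data path; active_mask: sleep-transistor
    states read from the reconfiguration memory (True = powered on).
    Gated-off units contribute no switching or leakage power in this model."""
    return sum(FU_POWER_MW[u] for u, on in zip(units, active_mask) if on)

units = ["alu", "alu", "ldst", "mul"]
full = gated_power(units, [True, True, True, True])    # everything powered: 7.9 mW
cfg  = gated_power(units, [True, False, True, False])  # this config uses 2 units: 3.2 mW
```

A configuration that uses only an ALU and a load/store unit thus dissipates less than half the power of a fully powered row in this toy model.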
c) Dynamic Detection Hardware (Block 4)
The hardware responsible for code detection, named Dynamic Detection Hardware (DDH), was implemented as a 4-stage pipelined circuit to avoid increasing the critical path of the original SparcV8 processor. The four stages are: Instruction Decode (ID), Dependence Verification (DV), Resource Allocation (RA) and Update Tables (UT). They are responsible for decoding the incoming instruction and allocating it to the right functional unit in the data path, according to its dependences on the already allocated instructions. At the end of the process, the configuration bits are updated in the reconfiguration memory.
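The core of the DV/RA stages — placing each instruction at the earliest data-path level compatible with its data dependences — can be sketched as follows. This is a simplification: the real allocator must also respect the number and type of functional units available per level, which is omitted here.

```python
# Sketch of dependence-driven allocation: an instruction is placed one level
# below the deepest producer of its source operands (level 0 if none).

def allocate(instructions):
    """instructions: list of (dest, [sources]) in program order.
    Returns the data-path level assigned to each instruction."""
    producer_level = {}          # register -> level of its latest producer
    levels = []
    for dest, srcs in instructions:
        lvl = max((producer_level[s] + 1 for s in srcs if s in producer_level),
                  default=0)
        producer_level[dest] = lvl
        levels.append(lvl)
    return levels

# Two independent instructions share level 0; the one consuming both their
# results is pushed to level 1.
lv = allocate([("r1", ["r5"]), ("r2", ["r6"]), ("r3", ["r1", "r2"])])
```

Independent instructions thus execute in the same combinational level, which is how the data path exploits ILP.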
Figure 2 (b) shows a simple example of how a DAP dynamically accelerates a code segment of a single thread. The DAP works in four modes: probing, detecting, reconfiguring and accelerating. The flow works as follows.
At the beginning of the time bar shown in Figure 2 (b), the DAP searches for an already translated code segment to accelerate, comparing the address cache entries to the content of the program counter register. When the first loop iteration appears (i=0), the DDH detects that there is a new code segment to translate and changes to detecting mode.
In the detecting mode, concurrently with instruction execution in the SparcV8 pipeline, the instructions are also translated into a configuration by the DDH pipeline (the process stops when a branch instruction is found). When the second loop iteration starts (i=1), the DDH is still finishing the detection that started at i=0: it takes a few cycles to store the configuration bits into the reconfiguration memory and to update the address cache with the memory address of the first detected instruction.
Then, when the first instruction of the third loop iteration reaches the fetch stage of the SparcV8 pipeline (i=2), the probing mode detects a valid configuration in the reconfiguration memory: the program counter content is found in an address cache entry.
After that, the DAP enters the reconfiguring mode, in which it feeds the reconfigurable data path with the operands. For example, if 8 operands are needed, 4 cycles are necessary, since 2 read ports are available in the register file. In parallel with the operand fetch, the reconfiguration bits are loaded from the reconfiguration memory. The reconfiguration memory is accessed on demand: at each clock cycle, only the bits necessary to configure one data path level are fetched, instead of all the reconfiguration bits at once. This approach decreases the reconfiguration memory port width, which is one of the main sources of power consumption in memories [16].
Finally, the accelerating mode is activated, and the subsequent loop iterations (up to the 99th) are efficiently executed by taking advantage of the reconfigurable logic.
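The mode sequence of the example above can be rendered as a toy state machine. The timing (detection spanning the first two iterations, the probing hit at i=2) follows the description in the text; everything else is a simplification for illustration.

```python
# Toy rendering of the probing/detecting/reconfiguring/accelerating flow
# of Figure 2 (b) for a single loop whose first instruction is at loop_addr.

def dap_modes(loop_addr, iterations):
    address_cache = set()
    modes = []
    for i in range(iterations):
        if loop_addr in address_cache:
            # probing hit: reconfigure once, then accelerate from then on
            modes.append("reconfiguring" if "reconfiguring" not in modes
                         else "accelerating")
        elif i == 0:
            modes.append("detecting")        # new code segment found
        else:
            modes.append("detecting")        # still saving the configuration
            address_cache.add(loop_addr)     # config stored, address cache updated

    return modes

m = dap_modes(0x4000, 6)
# i=0,1: detecting; i=2: reconfiguring (address-cache hit); i>=3: accelerating

# Operand fetch during reconfiguration: 8 operands over 2 register-file
# read ports -> ceil(8/2) = 4 cycles, as stated in the text.
fetch_cycles = -(-8 // 2)
```

The one-time cost of detection and reconfiguration is thus amortized over the remaining iterations, which run entirely in accelerating mode.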
d) Storage Components (Block 3)
Two storage components take part in the DAP acceleration process: the address cache and the reconfiguration memory. The address cache holds the memory address of the first instruction of every configuration built by the Dynamic Detection Hardware. It is used to verify the existence of a configuration in the reconfiguration memory: an address cache hit indicates that a configuration was found. The address cache is implemented as a 4-way set-associative table containing 64 entries. The reconfiguration memory stores the routing bits and the information necessary to fire a configuration, such as the input and output contexts and immediate values.
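A 64-entry, 4-way set-associative table means 16 sets of 4 ways. The sketch below illustrates the probing lookup; the indexing function (word address modulo the number of sets) and the FIFO replacement policy are assumptions, as the text does not specify them.

```python
# Sketch of the 64-entry, 4-way set-associative address cache used in probing.

NUM_SETS, WAYS = 16, 4            # 16 sets * 4 ways = 64 entries

class AddressCache:
    def __init__(self):
        self.sets = [[] for _ in range(NUM_SETS)]   # each set holds up to 4 tags

    def _index(self, pc):
        return (pc >> 2) % NUM_SETS                 # word-aligned instructions (assumed)

    def lookup(self, pc):
        """A hit means a configuration exists in the reconfiguration memory."""
        return pc in self.sets[self._index(pc)]

    def insert(self, pc):
        s = self.sets[self._index(pc)]
        if not self.lookup(pc):
            if len(s) == WAYS:
                s.pop(0)                            # FIFO eviction (assumed policy)
            s.append(pc)

cache = AddressCache()
cache.insert(0x4000)                # first instruction of a detected segment
```

Probing then reduces to one lookup per program-counter value: a hit switches the DAP to reconfiguring mode, a miss leaves it probing (or detecting, if the segment is new).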
Besides the two storage components described above, the current DAP implementation has a private 32 KB 4-way set-associative L1 data cache and a private 8 KB 4-way set-associative L1 instruction cache. According to our experiments, the SparcV8 inside the DAP achieves, with an instruction cache a quarter of the size, the same hit rate as the standalone SparcV8 with its 32 KB L1 instruction cache. This happens because, as translated instructions are stored in the reconfiguration memory, the SparcV8 within the DAP performs fewer L1 instruction cache accesses than the standalone SparcV8 processor. Thus, the additional area of the address cache and the reconfiguration memory in the DAP design is amortized.
4 CReAMS Evaluation
In all experiments we compared CReAMS to a multiprocessing platform built by replicating standalone SparcV8 processors, named MPSparcV8. The configurations of the DAP and of the standalone SparcV8 processor used in this evaluation are shown in Table 1 (b). Both CReAMS and MPSparcV8 have an on-chip unified 512 KB 8-way set-associative L2 shared cache.
Table 1. (a) Area (µm²) of the DAP and the SparcV8; (b) configuration of both basic processors
4.1 Benchmarks
To measure the performance and energy efficiency of CReAMS on a heterogeneous application set, benchmarks from different suites were selected to cover a wide range of behaviors in the type (TLP or ILP) and amount of available parallelism. The goal is to mimic the complex future embedded applications that will run on portable devices. From the parallel suites [17] [18] we selected md, jacobi and lu, applications in which TLP is dominant. Three SPEC OMPM2001 [19] applications (apsi, equake and ammp) were also chosen. Finally, we selected four applications (susan edges, susan smoothing, susan corners and patricia) from the MiBench suite [20], which reflects a traditional embedded scenario.
The benchmarks were parallelized using OpenMP and POSIX threads. These libraries provide methods that discover, at run time, the number of processors of the underlying multiprocessor architecture, so the applications can take full advantage of the available resources even when the platform changes (e.g., processors are added), with no need for recompilation.
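The run-time discovery that makes recompilation unnecessary (omp_get_num_procs in OpenMP, sysconf under POSIX) can be illustrated with the Python standard library; the workload and chunking scheme below are hypothetical.

```python
# Run-time processor discovery: the worker count adapts to the machine,
# so the same binary exploits however many cores the platform provides.
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data):
    n = os.cpu_count() or 1                  # discovered at run time
    chunk = max(1, len(data) // n)
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return sum(pool.map(sum, chunks))    # one worker per available core

total = parallel_sum(list(range(100)))       # same result on any core count
```

The result is independent of the core count; only the degree of concurrency changes when processors are added.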
4.2 Methodology
We used the scheme presented in [21] for the performance evaluation. The base platform is Simics [22], an instruction-level simulator. While the application runs on Simics, the instructions of each software thread are sent to cycle-accurate timing simulators: one per DAP (in the CReAMS simulation) or per standalone SparcV8 processor (when simulating the MPSparcV8). Special attention is given to synchronization mechanisms, such as locks and barriers, so that the time spent on blocking synchronization and memory transfers is precisely accounted for.
We described the CReAMS architecture, including the sleep-transistor power management technique, in VHDL; the MPSparcV8 VHDL description was obtained from [23]. The Synopsys Design Compiler and Power Compiler tools, with a 90nm CMOS technology, were employed to synthesize the VHDL descriptions to standard cells and to gather data on power, critical path and area. Data on the memories was obtained with CACTI 6.5 [24].
In addition, we considered CReAMS setups with different numbers of DAPs (4, 8, 16 and 64). The same was done with the MPSparcV8 system (4, 8, 16 and 64 SparcV8s). Each DAP has a reconfigurable data path (Block 1 of Figure 3) composed of 6 arithmetic and logic units, 4 load/store units and 2 multipliers per column; the entire reconfigurable data path has 24 columns. A 46 KB reconfiguration memory, implemented as a DRAM, stores up to 64 configurations, indexed by the 64-entry 4-way set-associative address cache. As already discussed in Section 3.1, the DAP needs only a quarter of the standalone SparcV8's L1 I-cache capacity to achieve the same hit ratio. Thus, to keep the memory hierarchy from biasing the results in our favor, we reduced the DAP L1 I-cache to 8 KB, which achieves the same hit ratio as the 32 KB L1 I-cache of the standalone SparcV8.
Instead of presenting absolute energy consumption, we use the energy-delay product (EDP) metric [25], which better captures the real contribution of the proposed approach to the embedded systems domain.
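The rationale for EDP, sketched below with illustrative numbers: a design can save energy simply by running slowly, so raw energy alone is misleading; EDP (energy times delay, lower is better) rewards designs that are both fast and frugal.

```python
# Energy-delay product: the metric used throughout the evaluation.
# All numbers here are illustrative, not measured values from the paper.

def edp(energy_j, delay_s):
    return energy_j * delay_s          # lower is better

baseline = edp(10.0, 2.0)      # 20.0 J*s
slower   = edp(6.0, 4.0)       # uses less energy, but its EDP (24.0) is worse
better   = edp(6.0, 1.5)       # wins on both axes: EDP 9.0
```

This is why a platform that is both faster and lower-energy, as CReAMS is claimed to be in the equivalent-area comparisons, shows multiplicative gains in EDP.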
4.3 Performance and Energy-Delay Product Evaluation
Figure 4 shows the speedup (y-axis) provided by the MPSparcV8 and by CReAMS over a standalone SparcV8 processor as the number of processors varies (x-axis). Considering only the MPSparcV8 speedups (striped bars), the applications that contain massive TLP (md, jacobi and susan_s) scale almost linearly as the number of processors increases: for these applications, the 64-processor MPSparcV8 achieves a mean speedup of 46 times. On the other hand, only small gains are obtained for the applications with little TLP (equake, apsi, ammp, susan_e, susan_c, lu and patricia): with the 64-processor MPSparcV8, the improvement is of only 2.3 times. This reinforces the need for an adaptable ILP exploitation mechanism to achieve balanced performance in future embedded systems, since they will mix applications from different domains, with different amounts of ILP and TLP.
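The 2.3x ceiling for the low-TLP applications is the behavior Amdahl's law predicts when only a modest fraction of the work is parallel. The parallel fractions below are illustrative; the paper does not report them per benchmark.

```python
# Amdahl's law: speedup on n cores given the parallelizable fraction p.
# Illustrative fractions only, chosen to bracket the behaviors in Figure 4.

def amdahl(parallel_fraction, cores):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

low_tlp  = amdahl(0.60, 64)    # ~2.4x: adding cores barely helps
high_tlp = amdahl(0.99, 64)    # ~39x: scales almost linearly
```

With p around 0.6, even 64 cores yield only about 2.4x, matching the observed plateau; with p near 0.99 the scaling approaches the near-linear behavior of md, jacobi and susan_s.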
[Figure 4 (chart not reproduced): speedup of the MPSparcV8 (striped bars) and CReAMS (solid bars) over a standalone SparcV8, for 4, 8, 16 and 64 processors, on equake, apsi, ammp, susan_e, patricia, susan_c, susan_s, md, jacobi and lu; bars exceeding the 40x axis limit are annotated with their values (49, 93, 74, 51 and 70).]
CReAMS (solid bars) outperforms the MPSparcV8 with the same number of processors, even for the applications with massive TLP. However, a DAP is almost four times bigger than the SparcV8 processor, as can be seen in Table 1 (a). Therefore, let us compare both architectures at equivalent area: CReAMS with 4 DAPs against the MPSparcV8 with 16 SparcV8s, and CReAMS with 16 DAPs against the MPSparcV8 with 64 SparcV8s.
As can be observed, the 4-DAP CReAMS outperforms the 16-SparcV8 MPSparcV8 in execution time on six benchmarks (equake, apsi, ammp, susan edges, patricia and lu), even for applications that present high TLP, such as lu and susan edges. ammp is the application that benefits most from CReAMS: its execution time is 41% smaller than on the MPSparcV8.
Table 2 shows the energy-delay product of both platforms; designs with equivalent area are highlighted with the same color. Consider the 4-DAP CReAMS setup running ammp: it saves 32% of the energy and improves performance by 41% compared to the 16-SparcV8 MPSparcV8, so CReAMS reduces the energy-delay product by a factor of almost four in this case. Over the whole set of benchmarks, at the same chip area, the average reduction in energy-delay product provided by CReAMS is 33%, while energy is reduced by 82% and performance is improved by 4% on average.
The main sources of CReAMS energy savings are:
• Fewer instruction memory accesses: although additional power is spent in the DDH hardware and in the reconfigurable data path, once instructions have been translated into a data path configuration they reside in the reconfiguration memory, so fewer instruction fetches are performed;
• The reconfiguration strategy: in the loop example of Figure 2 (b), the data path is reconfigured only once to execute 98 loop iterations, so the acceleration process avoids many accesses to the L1 instruction cache and to the reconfiguration memory;
• Shorter execution time: as CReAMS accelerates the code, it consumes less energy;
• Sleep transistors, which prevent idle functional units from dissipating unnecessary power.
Table 2. Energy-Delay product, in J*1e-3s, of the MPSparcV8 and CReAMS (lower is better)
MPSparcV8 CReAMS
         4 SparcV8   8 SparcV8   16 SparcV8  64 SparcV8    4 DAPs    8 DAPs    16 DAPs   64 DAPs
equake 293 305 287 343 169 175 162 149
apsi 10768 10386 10169 10282 6361 6143 6023 5355
ammp 19698 22155 20764 12814 5452 6282 5761 2984
susan_e 7.97 7.15 6.55 6.12 3.46 2.85 2.43 2.12
patricia 2.91 1.96 2.54 2.65 1.47 0.77 1.02 1.06
susan_c 1.0296 0.8134 0.6274 0.4624 0.5160 0.3761 0.2600 0.1600
susan_s 177.2 102.5 59.8 18.7 41.7 24.9 14.7 4.7
md 0.000282 0.000171 0.000098 0.000039 0.000066 0.000042 0.000026 0.000013
jacobi 35.1 20.7 11.3 3.6 17.4 10.3 5.7 1.9
lu 0.000124 0.000091 0.000127 0.000215 0.000041 0.000028 0.000043 0.000076
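The EDP factors quoted in the text can be checked directly against the Table 2 values (in J*1e-3 s) for the two equivalent-area comparisons:

```python
# Equivalent-area EDP gains, computed from the Table 2 entries above
# (a DAP occupies roughly the area of four SparcV8s).

edp_table = {
    "ammp": {("MPSparcV8", 16): 20764, ("CReAMS", 4): 5452},
    "lu":   {("MPSparcV8", 64): 0.000215, ("CReAMS", 16): 0.000043},
}

ammp_gain = edp_table["ammp"][("MPSparcV8", 16)] / edp_table["ammp"][("CReAMS", 4)]
lu_gain   = edp_table["lu"][("MPSparcV8", 64)] / edp_table["lu"][("CReAMS", 16)]
# ammp: ~3.8x ("almost four times"); lu: ~5.0x ("a factor of five")
```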
The gains in energy and performance in the second equivalent-area comparison (16-DAP CReAMS against 64-SparcV8 MPSparcV8) are greater than in the first one, since the exploitable TLP tends to stagnate as the number of cores increases. Now, CReAMS outperforms the MPSparcV8 on seven benchmarks (equake, apsi, ammp, susan_e, susan_c, patricia and lu). In this case, lu executes 57% faster on the 16-DAP CReAMS than on the 64-SparcV8 MPSparcV8 and consumes 50% less energy, which translates into an energy-delay product reduction by a factor of five.
Only three benchmarks (susan smoothing, md and jacobi) perform better on the MPSparcV8, since they have perfect load balancing among their threads. However, even with worse performance, CReAMS provides energy-delay product reductions of 26% and 33% for susan smoothing and md, respectively. Due to its massive TLP and perfect load balancing, jacobi is the only application for which CReAMS provides gains neither in performance nor in energy at the same chip area. Nevertheless, using the energy-delay product metric, we have demonstrated the efficiency of CReAMS in dynamically adapting to applications with different levels of parallelism.
5 Conclusions
In this work we have proposed CReAMS, which aims to offer the advantages of a heterogeneous multiprocessor architecture while improving software productivity, since neither new tool chains nor code recompilation are necessary to use it. Considering designs with equivalent chip area, CReAMS offers considerable improvements in performance and in the energy-delay product over a traditional homogeneous multiprocessor environment, for a wide range of applications.
References
1. International Technology Roadmap for Semiconductors: http://www.itrs.net/links/2009ITRS/2009Chapters_2009Tables/2009_SysDrivers.pdf
2. Clark, N., Kudlur, M., Park, H. Mahlke, S., Flautner, K.,"Application-Specific Processing on a
General-Purpose Core via Transparent Instruction Set Customization". In MICRO-37, pp. 30-40, Dec.
2004
3. Lysecky, R., Stitt, G., Vahid, F., "Warp Processors". In ACM Transactions on Design Automation of
Electronic Systems, pp. 659-681, July 2006
4. Compton K., Hauck, S., “Reconfigurable computing: A survey of systems and software”. In ACM
Computing Surveys, vol. 34, no. 2, pp. 171-210, June 2002.
5. Hauck, S., Fry, T., Hosler, M., Kao, J., “The Chimaera reconfigurable functional unit”. In Proc. IEEE
Symp. FPGAs for Custom Computing Machines, Napa Valley, CA, pp. 87–96, 1997.
6. Hauser, J. R., Wawrzynek, J., “Garp: a MIPS processor with a reconfigurable coprocessor”. In Proc.
1997 IEEE Symp. Field Programmable Custom Computing Machines, pp. 12–21, 1997.
7. Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Burger, D., Keckler, S. W., Moore C.
R., “Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture”. In Proc. of the 30th
Int. Symp. on Computer Architecture, pp. 422-433, June 2003.
8. Swanson, S., Michelson, K., Schwerin, A., Oskin. M. “WaveScalar”. In MICRO-36, Dec. 2003
9. Goldstein, S. C., Schmit, H., Budiu, M., Cadambi, S., Moe, M., Taylor, R.R., "PipeRench: A
Reconfigurable Architecture and Compiler". In IEEE Computer, pp. 70-77, April, 2000
10. Vassiliadis, S., Gaydadjiev, G. N., Bertels, K.L.M., Panainte, E. M., “The Molen Programming
Paradigm”. In Proceedings of the Third International Workshop on Systems, Architectures, Modeling,
and Simulation, pp. 1-10, Greece, July 2003
11. Watkins, M.A.; Albonesi, D.H., "Enabling Parallelization via a Reconfigurable Chip Multiprocessor",
Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures, 2010. 37th
International Symposium on Computer Architecture, June 2010.
12. Cumming, P. The TI OMAP Platform Approach to SoC. Kluwer Academic Publ, Boston, MA
13. Koenig, R.; Bauer, L.; Stripf, T.; Shafique, M.; Ahmed, W.; Becker, J.; Henkel, J.; , "KAHRISMA: A
Novel Hypermorphic Reconfigurable-Instruction-Set Multi-grained-Array Architecture," Design,
Automation & Test in Europe Conference, pp.819-824, 2010
14. Lee, J., Wu, H., Ravichandran, M., and Clark, N. 2010. Thread tailor: dynamically weaving threads
together for efficient, adaptive parallel applications. ISCA '10. pp. 270-279.
15. Shi, K.; Howard, D. “Challenges in Sleep Transistor Design and Implementation in Low-Power
Designs”. In Proceedings of Design Automation Conference, 43. 2006, pp. 113 – 116.
16. Berticelli Lo, T.; Beck, A.C.S.; Rutzig, M.B.; Carro, L.; , "A low-energy approach for context memory
in reconfigurable systems," Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW),
2010, pp.1-8, 19-23 April 2010
17. Dorta, A. J., Rodriguez, C., Sande, F. d., and Gonzalez-Escribano, A. 2005. The OpenMP Source Code
Repository. In Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-
Based Processing (February 09 - 12, 2005). PDP. IEEE Computer Society, Washington, DC, 244-250.
18. Woo, S. C., Ohara, M., Torrie, E., Singh, J. P., and Gupta, A. 1995. The SPLASH-2 programs:
characterization and methodological considerations. In Proceedings of the 22nd Annual international
Symposium on Computer Architecture (S. Margherita Ligure, Italy, June 22 - 24, 1995). ISCA '95.
ACM, New York, NY, 24-36.
19. Kaivalya M. Dixit. 1993. The SPEC benchmarks. In Computer benchmarks, Jack J. Dongarra and
Wolfgang Gentzsch (Eds.). Elsevier Advances In Parallel Computing Series, Vol. 8. pp. 149-163.
20. Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B., "MiBench: A Free,
Commercially Representative Embedded Benchmark Suite". 4th Workshop on Workload
Characterization, Austin, TX, Dec. 2001
21. Monchiero, M., Ahn, J., Falcón, A., Ortega, D., and Faraboschi, P. 2009. How to simulate 1000 cores.
SIGARCH Comput. Archit. News 37, 2 (Jul. 2009), 10-19.
22. Magnusson, P. S., Christensson, M., Eskilson, J., Forsgren, D., Hållberg, G., Högberg, J., Larsson, F.,
Moestedt, A., and Werner, B. 2002. Simics: A Full System Simulation Platform. Computer 35, 2 (Feb.
2002), 50-58.
23. Aeroflex Gaisler: http://www.gaisler.com
24. Wilton, S.J.E.; Jouppi, N.P.; , "CACTI: an enhanced cache access and cycle time model," Solid-State
Circuits, IEEE Journal of , vol.31, no.5, pp.677-688, May 1996
25. Gonzalez, R.; Horowitz, M.; , "Energy dissipation in general purpose microprocessors," Solid-State
Circuits, IEEE Journal of , vol.31, no.9, pp.1277-1284, Sep 1996
26. Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, and Krisztián Flautner. 2010. Evolution
of thread-level parallelism in desktop applications. SIGARCH Comput. Archit. News 38, 3 (June
2010), 302-313.