
A Synthesis Algorithm for Customized Heterogeneous Multi-processors

Rahim Soleymanpour
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
r.soleymanpour@ut.ac.ir

Siamak Mohammadi
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
smohammadi@ece.ut.ac.ir

Hamed Rajabi
Shomal University, Amol, Iran
h.rajabi@shomal.ac.ir

Abstract: Today, applications running on embedded systems demand more computation to satisfy requirements such as performance and power consumption. Multiprocessor systems on chip (MPSoC) provide a practical solution by having a significant effect on throughput. The application-specific instruction-set processor (ASIP) concept can be employed to customize each processor in an MPSoC platform based on the application mapped onto it. Here, we propose an algorithm to synthesize an optimal heterogeneous MPSoC with ASIPs and to map an application onto the MPSoC platform. The synthesis algorithm and the scheduling are executed simultaneously. Additionally, the synthesis algorithm identifies the number of processors needed to reach maximum performance. Experimental results show an average speedup of 42.71% and an average reduction of 30.77% in Network-on-Chip (NoC) energy consumption compared to a homogeneous MPSoC.

Keywords: hardware/software codesign, synthesis algorithm, application-specific instruction-set processor (ASIP), heterogeneous MPSoC.
I. Introduction

The ASIP approach provides sufficient flexibility for design. In this technique, new instructions are added to the base processor to enhance performance and reduce power consumption and code size. Algorithms in this field search for patterns of basic instructions that occur frequently in the application code, exploring a huge space to identify custom instructions. Then, from the potential custom-instruction candidates, they select the ones that offer the greatest merit. Without an automatic tool, this procedure becomes tedious and may not attain the desired performance.

To devise an optimal MPSoC, designers must properly investigate all factors that affect MPSoC performance [1], such as decomposing an application program into subtasks, mapping and scheduling the tasks onto processing elements (PEs), and interprocessor communication. Consequently, to customize an MPSoC, one must consider these parameters simultaneously; otherwise the result might be suboptimal. Researchers need heuristic approaches to explore this vast design space and reach practical solutions.

We present a synthesis algorithm to construct optimal heterogeneous MPSoCs for applications in embedded systems. This algorithm takes application code and a directed acyclic task graph (DAG) as inputs and makes use of an iterative procedure. Selecting custom instructions for a task depends on two things: the potential candidates identified in previous iterations and the data communication between the tasks. Lastly, it selects the configuration that offers an appropriate tradeoff among speedup, number of processors, area overhead of the custom instructions, and energy dissipation of the network on chip (NoC). To evaluate our proposed algorithm, various applications are used, such as multimedia, encryption and networking. The experimental results show that the synthesized heterogeneous MPSoC achieves better performance.


In the rest of the paper, we discuss previous work related to the synthesis of heterogeneous MPSoCs in Section II. In Section III, we describe the proposed algorithm and present its pseudocode. Finally, experimental results and the conclusion are presented in Sections IV and V, respectively.
II. Related Work

The ASIP approach is used to reduce time to market and also has lower complexity than application-specific integrated circuits (ASICs). The authors of [2] offer a method to design a processor with a special instruction set: they investigate the behavior of the target applications, analyze various configurations, and finally derive the requirements for hardware resources. The algorithms in [3] and [4] describe methods to extract custom instructions for the target program. The platform in [5] automatically searches for custom-instruction templates in the basic blocks of applications; it also shows the effect of the number of register-file inputs and outputs of the custom function unit (CFU) on energy consumption.

Scheduling on platforms with homogeneous processing elements has been widely studied. In [6] and [7], different heuristics have been introduced to reduce the completion time of applications on a homogeneous platform. The authors of [8] categorize several scheduling algorithms for homogeneous MPSoCs and compare their performance. For heterogeneous processing elements, the DCPD algorithm [9] duplicates some tasks to improve performance, and the HEFT and CPOP algorithms [10] schedule an application graph on heterogeneous processors.

Issues encountered in MPSoC designs that use extensible and configurable processors are reviewed in [11], where a tool for data-intensive applications is also presented. The method described in [12] uses an iterative process that first completes the scheduling in each iteration and then randomly picks a critical path; in the following step, it adds custom instructions to the tasks on the selected critical path and continues until the selected path is no longer critical. For task sequences, a dynamic programming model that customizes processors is proposed in [13]. It identifies the appropriate number of partitions for an application and selects suitable custom instructions to improve performance for streaming applications such as MP3; the number of partitions equals the number of processors.


III. Proposed Algorithm


Here, we propose an algorithm that executes the repetitive and time-consuming parts of an application on a custom function unit (CFU). The inputs of the proposed algorithm are: 1) a directed acyclic task graph (DAG), denoted {T1, T2, ..., TN}, where N is the number of tasks and exeTime(TJ) and commCost(TJ, TI) state the execution time of task TJ and the data communication cost from TJ to TI, respectively; and 2) the task code, used to profile execution times and to extract custom instructions. This approach improves the completion time and the energy consumption on the NoC.
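As a minimal sketch of how such an input could be represented (our own illustration, not the paper's code; all type and field names are assumptions), the task graph might be stored as follows:

#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory form of the DAG input {T1, ..., TN}.
struct Task {
    int id;                                    // index J of task TJ
    double exeTime;                            // exeTime(TJ): profiled execution time
    std::unordered_map<int, double> commCost;  // commCost(TJ, TI) for each successor TI
    std::vector<int> customInstructions;       // CIs selected so far; attached to the task,
                                               // not to a processor, so they migrate with it
    std::string code;                          // task code used for profiling and CI extraction
};

using TaskGraph = std::vector<Task>;           // tasks indexed by id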
Generally, the proposed algorithm uses an incremental and iterative process and performs the task scheduling in each step of the synthesis. If a task is critical, the task customization process is executed at the same time as the scheduling. Since a task may migrate to a different processor in the next iteration, the algorithm assigns the selected custom instructions to the task rather than to the processor. Such migrations may occur because customization reduces task execution times, so the scheduling and the task mapping may differ from those of previous iterations. The synthesis algorithm is now described in more detail.
Algorithm 1: First of all, the tasks are profiled to identify hot spots and the execution time of the application. For scheduling we use the dynamic critical path (DCP) algorithm [6], which utilizes the two concepts of absolute earliest start time (AEST) and absolute latest start time (ALST) to identify the tasks on the critical path: a task is critical when its AEST and ALST are equal. In each step, the algorithm computes initial values of AEST and ALST for all tasks. The second (inner) loop performs the scheduling, determines whether the selected task is critical, and selects custom instructions accordingly.
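The following sketch illustrates how AEST and ALST could be computed with a forward and a backward pass over the DAG and how a critical task is detected; this is our own illustration of the concept from [6], not the paper's implementation, and it ignores processor-dependent refinements such as zeroing the communication cost of tasks mapped to the same processor:

#include <algorithm>
#include <cmath>
#include <vector>

struct Edge { int to; double comm; };  // edge TJ -> TI carrying commCost(TJ, TI)

// Computes AEST/ALST for a DAG whose nodes are given in topological order.
void computeAestAlst(const std::vector<double>& exeTime,
                     const std::vector<std::vector<Edge>>& succ,  // successors of each task
                     const std::vector<int>& topoOrder,
                     std::vector<double>& aest,
                     std::vector<double>& alst) {
    const int n = static_cast<int>(exeTime.size());
    aest.assign(n, 0.0);
    // Forward pass: a task can start once every predecessor has finished and its data has arrived.
    for (int u : topoOrder)
        for (const Edge& e : succ[u])
            aest[e.to] = std::max(aest[e.to], aest[u] + exeTime[u] + e.comm);

    double makespan = 0.0;
    for (int u = 0; u < n; ++u)
        makespan = std::max(makespan, aest[u] + exeTime[u]);

    // Backward pass: latest start time that still meets the current makespan.
    alst.assign(n, 0.0);
    for (int u = 0; u < n; ++u)
        alst[u] = makespan - exeTime[u];
    for (auto it = topoOrder.rbegin(); it != topoOrder.rend(); ++it)
        for (const Edge& e : succ[*it])
            alst[*it] = std::min(alst[*it], alst[e.to] - e.comm - exeTime[*it]);
}

// A task lies on the dynamic critical path when AEST and ALST coincide (up to rounding).
inline bool isCritical(double aest, double alst) {
    return std::fabs(aest - alst) < 1e-9;
}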
In the next step, the synthesis algorithm repeats the scheduling with the new execution times of the tasks. Tasks that were critical in the previous step may not be critical in the current iteration, and vice versa. If a task is still critical, the procedure can try to insert more custom instructions.
At the end of each step, the completion time of the application under the current configuration is determined, and best_merit is calculated for the next step. Additionally, the traffic and the energy dissipation in the network are calculated. We use the following formula, presented in [14], to estimate the energy consumption in the NoC:
Ebit = ESbit + EBbit + EWbit    (1)

In this equation, Ebit is the energy consumed by each bit traversing the switch fabric; it consists of the bit energy consumed on the node switches, ESbit, on the internal buffers, EBbit, and on the interconnect wires, EWbit.
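As an illustration of how Eq. (1) could be turned into an energy estimate for the whole NoC (the per-flow aggregation below, bits x hops x Ebit, is our own assumption and is not spelled out in the paper):

#include <vector>

// Per-bit, per-hop energy of Eq. (1): Ebit = ESbit + EBbit + EWbit [14].
struct BitEnergy { double eSbit, eBbit, eWbit; };

inline double perHopBitEnergy(const BitEnergy& e) {
    return e.eSbit + e.eBbit + e.eWbit;
}

// One inter-processor traffic flow produced by the mapping; tasks mapped to the
// same processor generate no flow and therefore no NoC energy.
struct Flow { double bits; int hops; };

double estimateNocEnergy(const std::vector<Flow>& flows, const BitEnergy& e) {
    double total = 0.0;
    for (const Flow& f : flows)
        total += f.bits * f.hops * perHopBitEnergy(e);  // assumed aggregation
    return total;
}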
As depicted in Algorithm 1, the first (outer) loop terminates when there is no further improvement. The termination condition is determined by systematically analyzing the improvement over the last iterations: the process continues as long as the average recent improvement remains above a constraint value. Finally, all points found in all steps are explored to select the best possible solution according to the constraints.
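A minimal sketch of such a termination test (the window size and threshold are assumptions; the paper only states that the average improvement over the last iterations must stay above a constraint):

#include <cstddef>
#include <deque>
#include <numeric>

class ImprovementMonitor {
public:
    ImprovementMonitor(std::size_t window, double threshold)
        : window_(window), threshold_(threshold) {}

    // Record the relative improvement achieved by the latest iteration.
    void record(double improvement) {
        history_.push_back(improvement);
        if (history_.size() > window_) history_.pop_front();
    }

    // "improvement == true" in Algorithm 1: average recent gain is still above the constraint.
    bool keepGoing() const {
        if (history_.size() < window_) return true;  // not enough history yet
        const double avg =
            std::accumulate(history_.begin(), history_.end(), 0.0) / history_.size();
        return avg > threshold_;
    }

private:
    std::size_t window_;
    double threshold_;
    std::deque<double> history_;
};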


Algorithm 1: synthesisMultiProcessor(task graph, code)

consumption_area <- 0;
best_merit <- 0;
improvement <- true;
profile all tasks;
while (consumption_area < area_budget && improvement == true) do
    initialize AEST and ALST for all tasks;
    mark all tasks as unscheduled;
    temp_merit <- 0;
    while not (all nodes scheduled) do
        task <- identifyTask();
        if task is on the critical path then
            paramCI <- select_processor(task, critical, best_merit);
        else
            select_processor(task, non-critical);
        end if
        update AEST and ALST for all nodes;
        if paramCI.merit > temp_merit then
            // record the best CI merit for the next iteration
            swap(paramCI.merit, temp_merit);
        end if
        compute area consumption;
    end while
    update best_merit for the next iteration;
    compute traffic and energy consumption in the NoC;
    compute improvement;  // based on history
    save intermediate result;
end while
select best solution;

Algorithm 2: The select_processor function described in Algorithm 2 selects the best processor and calls the custom-instruction extraction and selection routines. It is called for every task in every iteration. If the selected task is not critical, the algorithm only attempts to select a processor among all available processors. Otherwise, if the task is critical, after detecting the appropriate processor it calls the customization routine. The algorithm utilizes information obtained in the previous iteration to pick the best custom instructions over the whole process. Assume a task is critical: to apply the customization process, some custom instructions are selected for this task. However, tasks that have not yet been scheduled, and that will become critical in following steps, might have better custom instructions than those of previous steps. This must be considered so that efficiency is not lost. With this in mind, the algorithm uses a variable (best_merit) to select the best set of custom instructions in each iteration: it picks the custom instructions with the best merit among all tasks and then updates best_merit for the next iteration. The returned merit value is used to determine best_merit for the next iteration.
It is clear that if some tasks have large data communication, they had better be mapped onto the same processor; here, we ignore the data communication cost of tasks that execute on the same processor. Algorithm 2 therefore tries to map the selected task onto the processor that hosts the tasks with the most data communication with it. Also, if possible, the critical child of the selected task is mapped onto the same processor. The algorithm uses custom instructions to reduce the completion time and also to satisfy the above condition.




The following example illustrates this situation: assume there is a task with heavy data communication with other tasks, so they should run on the same processor, but there is no sufficient idle time slot. In this case, we can remedy the situation by reducing the execution time through the extraction of custom instructions. We use the following equation from [6] to check whether there is a sufficiently long idle time slot on the processor onto which the task is mapped:

min{ALST(TJ), ALST(TK)} − max{AEST(TI) + exeTime(TI), AEST(TK)} ≥ exeTime(TK)    (2)

where TI and TJ are the tasks scheduled immediately before and after the candidate idle slot on that processor and TK is the task to be inserted.
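A small sketch of this idle-slot test (our reading of Eq. (2), following the insertion condition of the DCP algorithm [6]; the names are ours):

#include <algorithm>

struct TaskTimes { double aest, alst, exeTime; };

// Returns true when task tk fits into the idle slot between ti and tj, which are
// already scheduled back-to-back on the candidate processor (Eq. (2)).
bool fitsInIdleSlot(const TaskTimes& ti, const TaskTimes& tj, const TaskTimes& tk) {
    const double slotStart = std::max(ti.aest + ti.exeTime, tk.aest);
    const double slotEnd   = std::min(tj.alst, tk.alst);
    return slotEnd - slotStart >= tk.exeTime;
}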

As shown in Algorithm 2, the processor that yields the minimum AEST is selected. If there is a sufficient time slot on the desired processor for the critical task, the algorithm assigns the task temporarily to a processor called the real processor; otherwise, the task is assigned to a so-called temporary processor. In the following step, the algorithm identifies the best processor between the real and temporary processors, if possible.


Finally, the algorithm returns values such as the merit of the last selected custom instruction, the area overhead, and the achieved speedup for the selected task. At the end of the algorithm, the task is assigned to the appropriate processor.
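A possible shape for these return values (the struct layout is our assumption based on this description):

// Hypothetical bundle of the values returned by select_processor() for a critical task.
struct CIResult {
    double merit;         // merit of the last selected custom instruction
    double areaOverhead;  // additional CFU area required by the selected CIs
    double speedup;       // speedup achieved for the selected task
};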


Algorithm 2: select_processor(task, location, best_merit)

for each nameProc in processorList do
    state <- real;
    thisAEST <- findSlot(task, nameProc, location);
    if task is critical then
        CCR <- findMaxCCR(task, nameProc, location);
        if thisAEST < 0 then
            diff <- thisAEST;
            temporarily set exeTime(task) <- exeTime(task) + diff;
            thisAEST <- findSlot(task, nameProc, location);
            state <- temp;
        end if
        update maxCCR;
    end if
    if thisAEST > 0 then
        nc <- child node of task with the smallest difference between ALST and AEST;
        temporarily map task onto nameProc;
        childAEST <- findSlot(nc, nameProc);
        if childAEST + thisAEST < best_Composite_AEST && state != temp then
            best_real_processor <- nameProc;
            best_Composite_AEST <- childAEST + thisAEST;
        end if
        if childAEST + thisAEST < best_temp_Composite_AEST && state == temp then
            best_temp_processor <- nameProc;
            best_temp_Composite_AEST <- childAEST + thisAEST;
            best_diff <- diff;
            undo exeTime(task) <- exeTime(task) - diff;
        end if
    end if
end for
if task is critical then
    coff_comm <- compute coefficient of communication;
    paramCIs <- select_CIs(task, best_merit, coff_comm);
    if best_temp_processor != best_real_processor then
        map task onto best_temp_processor;
    end if
end if
assign task to the best processor;
return paramCIs;

IV. Experimental Results

Our proposed algorithm has been implemented in about 7500 lines of C/C++, and we have constructed a platform to evaluate it. The platform consists of MIPS [15] processors implemented with a commercial design tool and a 90 nm CMOS library, and a crossbar network paradigm is used for communication between the processors. The platform uses the algorithm described in [5] to extract custom instructions automatically. We have also implemented the method proposed in [12], mentioned above, in our platform to compare the performance of both methods. We have used different benchmarks [16][17][18] from various scopes such as multimedia, networking and encryption.

[Figure 1. Effect of adding custom instructions on execution time for different benchmarks (Network, MPEG2DEC, App1, App2). X axis: area consumption for custom instructions (nm2); Y axis: normalized execution time; series: algorithm of [12] and our proposed algorithm.]

[Figure 2. Effect of adding custom instructions on network traffic for the same benchmarks. X axis: area consumption for custom instructions (nm2); Y axis: normalized network traffic; series: algorithm of [12] and our proposed algorithm.]


[TABLE 1. The optimal points obtained from the platform. For each application (MPEG2DEC, APP1, APP2, Networking), the table compares our algorithm with the algorithm of [12] on four metrics: area consumption for custom instructions (nm2), speedup, network traffic, and number of processors. The reported values span CI area consumptions from 0.2709 to 10.5814 nm2, speedups from 5.88% to 69.26%, network traffic figures from -10.94% to 56%, and processor counts from 2 to 5.]
Fig. 1 depicts how adding custom instructions affects the speedup. Adding custom instructions usually improves the performance on both platforms, while our algorithm reaches better performance. Our algorithm explores more points in the design space and is therefore more likely to find better solutions. For example, in Fig. 1 for App2, both algorithms go through the same part of the path, but the method of [12] stops earlier, while our algorithm continues the customization procedure. Sometimes adding custom instructions does not improve performance; this happens for benchmark App1.

Energy consumption in the NoC is a considerable portion of the total energy. Since it depends directly on the network traffic, the task mapping can alter the energy consumption drastically. Fig. 2 illustrates the effect of the customization procedure on the network traffic. It also shows ups and downs in the network traffic during the customization routine, which come from changes in the task mapping on the processors. The X and Y axes show the increase in custom instructions and the traffic between processors, respectively. Since our algorithm intends to reduce the network traffic, it creates the opportunity to decrease energy consumption. However, the customization procedure may have no effect on the network traffic; for App2, for example, customization does not change the task mapping on the processors. On the other hand, the method of [12] does not consider the data communication and consequently loses efficiency in terms of power dissipation in the network on chip.
The optimal points reached by the implemented algorithms are shown in TABLE 1. Our algorithm becomes stable or terminates within at most 38 iterations. We obtain, on average, a 42.71% speedup and a 30.77% reduction in energy dissipation, compared to 24.75% and 3.99% obtained with [12]. The results demonstrate that considering scheduling and customization together with the communication cost leads to better efficiency in the synthesis of heterogeneous MPSoCs.


V. Conclusion

After discussing the issues that affect MPSoC performance, we have introduced a synthesis algorithm that performs the scheduling and customization procedures simultaneously and attempts to map tasks with heavy communication onto the same processor. It utilizes an incremental and iterative process to select the best possible custom instructions over all iterations. Experimental results confirm that the proposed algorithm improves efficiency, and the obtained results can be used as a golden model for implementations.




References

[1] G. Martin, "Overview of the MPSoC design challenge," in Proceedings of the 43rd Annual Design Automation Conference, 2006, pp. 274-279.
[2] A. De Gloria and P. Faraboschi, "An evaluation system for application specific architectures," in Proceedings of the 23rd Annual Workshop on Microprogramming and Microarchitecture (MICRO-23), 1990, pp. 80-89.
[3] L. Pozzi, K. Atasu, and P. Ienne, "Exact and approximate algorithms for the extension of embedded processor instruction sets," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 7, pp. 1209-1229, 2006.
[4] N. Clark, H. Zhong, and S. Mahlke, "Processor acceleration through automated instruction set customization," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, p. 129.
[5] A. Yazdanbakhsh, M. E. Salehi, and S. M. Fakhraie, "Architecture-aware graph-covering algorithm for custom instruction selection," in 2010 5th International Conference on Future Information Technology (FutureTech), pp. 1-6.
[6] Y. K. Kwok and I. Ahmad, "Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 5, pp. 506-521, 1996.
[7] M. Y. Wu and D. D. Gajski, "Hypertool: A programming aid for message-passing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 3, pp. 330-343, 1990.
[8] A. Khan, C. L. McCreary, and M. Jones, "A comparison of multiprocessor scheduling heuristics," 1994.
[9] C. H. Liu, C. F. Li, K. C. Lai, and C. C. Wu, "A dynamic critical path duplication task scheduling algorithm for distributed heterogeneous computing systems," in 12th International Conference on Parallel and Distributed Systems (ICPADS 2006), 2006, vol. 1, 8 pp.
[10] H. Topcuoglu, S. Hariri, and M. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," IEEE Transactions on Parallel and Distributed Systems, pp. 260-274, 2002.
[11] G. Martin, "Multi-processor SoC-based design methodologies using configurable and extensible processors," Journal of Signal Processing Systems, vol. 53, no. 1, pp. 113-127, 2008.
[12] F. Sun, S. Ravi, A. Raghunathan, and N. Jha, "A framework for extensible processor based MPSoC design," Designing Embedded Processors, pp. 65-95, 2007.
[13] L. Chen, N. Boichat, and T. Mitra, "Customized MPSoC synthesis for task sequence," in 2011 IEEE 9th Symposium on Application Specific Processors (SASP), pp. 16-21.
[14] T. T. Ye, G. De Micheli, and L. Benini, "Analysis of power consumption on switch fabrics in network routers," in Proceedings of the 39th Annual Design Automation Conference, 2002, pp. 524-529.
[15] MIPS Technologies Home. [Online]. Available: http://www.mips.com/. [Accessed: 13-Nov-2011].
[16] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "MediaBench: A tool for evaluating and synthesizing multimedia and communications systems," in MICRO, 1997, p. 330.
[17] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in 2001 IEEE International Workshop on Workload Characterization (WWC-4), 2001, pp. 3-14.
[18] R. Ramaswamy and T. Wolf, "PacketBench: A tool for workload characterization of network processing," in 2003 IEEE International Workshop on Workload Characterization (WWC-6), 2003, pp. 42-50.

