
Hardware Implementation on FPGA for Task-level Parallel Dataflow Execution Engine

Chao Wang, Member, IEEE, Junneng Zhang, Student Member, IEEE, Xi Li, Member, IEEE, Aili Wang, and Xuehai Zhou, Member, IEEE

C. Wang is with the University of Science and Technology of China, Hefei, 230027, Anhui, China. E-mail: saintwc@mail.ustc.edu.cn.
J. Zhang is with the University of Science and Technology of China, Hefei, 230027, Anhui, China. E-mail: zjneng@mail.ustc.edu.cn.
X. Li is with the University of Science and Technology of China, Hefei, 230027, Anhui, China. E-mail: llxx@ustc.edu.cn.
A. Wang is with the University of Science and Technology of China, Suzhou, 215123, Jiangsu, China. E-mail: wangal@ustc.edu.cn.
X. Zhou is with the University of Science and Technology of China, Hefei, 230027, Anhui, China. E-mail: xhzhou@ustc.edu.cn.

Manuscript received XXXXX, 2015.

Abstract—Heterogeneous multicore platforms have been widely used in various areas to achieve both power efficiency and high performance. However, they pose significant challenges for researchers to uncover more coarse-grained, task-level parallelism. In order to support automatic task-parallel execution, this paper proposes an FPGA implementation of a hardware out-of-order scheduler on a heterogeneous multicore platform. The scheduler is capable of exploring potential inter-task dependencies, leading to a significant acceleration of dependence-aware applications. With the help of a renaming scheme, task dependencies are detected automatically during execution, and task-level Write-After-Write (WAW) and Write-After-Read (WAR) dependencies can then be eliminated dynamically. We extended instruction-level renaming techniques to perform task-level out-of-order execution, and implemented a prototype on a state-of-the-art Xilinx Virtex-5 FPGA device. Given the reconfigurable characteristic of FPGAs, our scheduler supports changing accelerators at runtime to improve flexibility. Experimental results demonstrate that our scheduler is efficient in both performance and resource usage.

Index Terms—FPGA, Dataflow, Dependency Analysis, Out-of-order, Parallel Architecture

1 INTRODUCTION

Task-level parallelism (TLP) has motivated research into simplified parallel programming models [1]. However, a drawback of common task-based models such as Cilk [2], OpenMP [3], MPI, Intel TBB, CUDA and OpenCL [4] is that they burden the programmer with the non-trivial assignment of resolving inter-task data dependencies. To resolve this problem, automatic parallelization has been widely researched at different levels, e.g. the programming model [5], the compiler and runtime library [6], and the microarchitecture [7]. Most programming models perform well for regular operations (e.g. loop structures [8]); however, the results may be unsatisfactory for most irregular operations. At the compiler level, it poses significant challenges to detect dependencies statically. Alternatively, using a special architecture to support task-level Out-of-Order (OoO) execution is becoming effective and efficient in performance [9]. However, how to make the architecture flexible enough to suit various systems remains unresolved, especially for heterogeneous systems with different types of processing elements (PEs).

Meanwhile, with diverse types of accelerators integrated, a heterogeneous multicore platform is able to achieve high performance for a large variety of applications. However, the software for such heterogeneous systems can be quite complex due to the management of low-level aspects of the computation. To maintain correctness, the software must decompose an application into parallel tasks and synchronize the tasks automatically. As a consequence, data dependences within an application may seriously affect the overall performance. Generally, data dependences are classified into three categories: read-after-write (RAW), write-after-write (WAW) and write-after-read (WAR) dependences, of which WAW and WAR can be eliminated using renaming techniques.

To tackle this problem, in this paper we propose a hardware scheduler that supports task-level OoO parallel execution. The fundamental system is an FPGA-based platform that contains different types of PEs: one or several general purpose processors (GPPs) and a variety of Intellectual Property (IP) cores. An earlier version of this paper was presented in [10]. We have extended and implemented a demonstrating hardware scheduler for heterogeneous platforms to support OoO task execution. The scheduler detects task-level data dependencies and eliminates WAW and WAR dependencies automatically at runtime using a renaming scheme. We claim the following contributions and highlights:

1) This work implements a hardware scheduler on FPGA to support parallel dataflow execution and dynamic accelerator reconfiguration. By eliminating the WAW and WAR dependences among the dataflow, applications get as much parallelism as possible. This is especially important for dependence-aware applications in real-time systems.

2) Reconfigurability is supported so as to improve the flexibility of the heterogeneous multicore platform. Our scheduler is as efficient as Task Superscalar [7] in its ability to eliminate WAW and WAR dependences, while having much lower scheduling overhead thanks to the efficiency of the FPGA. Furthermore, as our scheduler is implemented on a practical FPGA board instead of in simulation, it is more practical than the Task Superscalar scheme.

The rest of the paper is organized as follows: Section 2 summarizes the related work and the motivation. Section 3 gives a high-level view of the proposed framework, including the problem description and the definition of tasks. Section 4 details the programming model layer. Section 5 illustrates the composition of the scheduler, and Section 6 gives the hardware implementation of the scheduler. Experimental methods and results are presented in Section 7. Finally, Section 8 concludes the paper and outlines future work.

2 RELATED WORKS

Data dependency and synchronization problems have long posed a significant challenge to parallelism. Traditional algorithms, such as Scoreboarding and Tomasulo [11], explore instruction-level parallelism (ILP) with multiple arithmetic units, which can dynamically schedule instructions for out-of-order execution. Much effort has been made to extract task-level parallelism from programs in the last decade, and schemes at different levels have been proposed to solve TLP problems.

With the increasing popularity of MPSoC platforms, parallelism is shifting from the instruction level to the task level. There are already some credible FPGA-based research platforms, such as RAMP [12], Platune [13] and MOLEN [14]. These studies focus on providing reconfigurable FPGA-based environments and related tools that can be utilized to construct application-specific MPSoCs. Alternatively, processor products and prototypes are designed to increase TLP with coarser-grained parallelism, such as RAMPSoC [15], MLCA [16], Multiscalar [17], Trace Processors [18], IBM CELL [19], the RAW Processor [20], Intel Terascale and Hydra [21]. These design paradigms present thread-level or individual cores which can split a group of applications into small, speculatively independent threads. Some other works like TRIPS [22] and WaveScalar [23] combine both static and dynamic dataflow analysis in order to exploit more parallelism. Yoga [24] is a hybrid dynamic processor with out-of-order execution and VLIW instruction sets. The concept of Yoga is to leverage the trade-off between high-performance OoO and low-power VLIW, so it cannot be applied to general heterogeneous frameworks. Furthermore, a common concept of these works is to split a large task window into small threads that can be executed in parallel. However, the performance is seriously constrained by inter-task data dependencies.

At the programming model level, MPI, Pthreads, OpenMP, Cilk, and TBB are the state-of-the-art parallel programming frameworks. It is now common knowledge that these non-speculative models rely upon developer experience to parallelize programs. When an irregular program written in these programming models is executed, the decomposition into parallel tasks can be difficult. During dynamic execution of the statically-parallel program, a multitude of complexities may arise that make program development onerous. Deterministic, speculative models offer simple programming models, but incur potentially significant dynamic overheads to support recovery from mis-speculation and to dynamically check for potential conflicts. Of the cutting-edge programming paradigms, CellSs [5] and StarSs [25] are programming models that uncover TLP, in particular supporting irregular tasks. Besides, dependency detection and elimination are left to the hardware scheduler, which greatly simplifies the design of the programming model and runtime library. Harmony [26] is also an execution model and runtime for heterogeneous manycore systems. However, it only considers tasks that run on GPPs, while tasks that run on accelerators (e.g. DSP, IP) are not taken into account.

Along with the prototype platforms targeting specific hardware, there are some well-known reconfigurable hardware infrastructures: ReconOS [27], for instance, demonstrates a hardware/software multithreading methodology on a host OS running on the PowerPC core using modern FPGA platforms. Hthreads [28], RecoBus [29] and [30] are also state-of-the-art FPGA-based reconfigurable platforms. Besides FPGA-based research platforms, FlexCore [9] proposes a heterogeneous platform which supports run-time monitoring and incorporates bookkeeping techniques.

At the architectural level, Task Superscalar [7], Multiscalar [17], Stanford Hydra [21] and Program Demultiplexing [31] are hardware schedulers for multi-processor systems based on renaming techniques. McFarlin [32] presents a speculative out-of-order execution architecture for heterogeneous multicore architectures. Similarly, Sridharan [33] presents an efficient parallel execution scheme for parallel programs, which can adapt to different running applications. But the scheduling overhead of software speculation is significantly higher than that of hardware schedulers. In summary, none of them has been implemented as hardware circuits.

It is a trend that in the foreseeable future most computing systems will contain a considerable number of heterogeneous computing resources. So far in industry, GPUs have been widely used to support graphics processing and thereby achieve performance speedups. It is highly possible that more FPGA-based heterogeneous accelerators will be integrated into the computer architecture. We therefore focus on OoO execution that supports heterogeneous systems, trying to find an efficient way to design the programming model, compiler and scheduler. In order to summarize the differences between related works and our approach, we list the strengths and weaknesses of the typical parallel scheduling algorithms and engines. The differences are illustrated in Table 1, which includes general parallel programming models, out-of-order programming models and task-level scheduling mechanisms.


TABLE 1
SUMMARY FOR STATE-OF-THE-ART PARALLEL EXECUTION ENGINES ON FPGA

General parallel programming models
  Typical: OpenMP [3], Intel's TBB, Ct, CnC, MapReduce, OpenCL [34], Cilk [35]
  Strength: general model for CMP processors
  Weakness: brings burden to programmers

Out-of-order programming models
  Typical: StarSs [25], CellSs [36], Oscar [37]
  Strength: support automatic OoO execution
  Weakness: applied only to the Cell BE architecture and SMP servers

Task-level out-of-order scheduling
  Typical: CEDAR [38], MLCA [16], Multiscalar [17], Trace [18], IBM CELL [19], RAW [20], Hydra [39], WaveScalar [23], TRIPS [22]
  Strength: perform well for traditional superscalar machines; run tasks in parallel on different processor cores
  Weakness: OoO not supported; programmers handle task data dependencies manually

  Typical: Task Superscalar [40], Yoga [24]
  Strength: support OoO automatic parallel execution
  Weakness: not applicable to FPGA with reconfiguration

Dataflow-based OoO execution model
  Typical: DSP [41]
  Strength: register renaming technologies of Tomasulo
  Weakness: limited to DSP architectures

  Typical: OoOJava [42], Dataflow [1]
  Strength: an OoO compiler for Java runtime with determinate parallel execution
  Weakness: not applicable to hardware execution engines

3 HIGH LEVEL VIEW

3.1 Problem Description
It has been widely acknowledged that traditional programming models for TLP such as OpenMP and MPI perform well for regular programs, especially for loop-based code. However, for irregular programs the degree of parallelism achieved with OpenMP and MPI depends on the programmer, which places too much burden on programmers, and the performance may be dramatically unsatisfactory. Consequently, the motivation of this paper is to pursue an automatic OoO task-parallel execution method, especially for heterogeneous systems.

In order to compare three programming models, namely the serial model, OpenMP [3] and CellSs [5], we illustrate an example code snippet for each.

(a) Serial Code:
Task_1(b,a);
...
Task_2(b,c);

(b) OpenMP Code:
#pragma omp section
Task_1(b,a);
#pragma omp section
Task_2(e,c);

(c) CellSs:
#pragma ..
#pragma ..
Task_1(b,a);
Task_2(b,c);

Fig. 1 Sample code snippet of different programming models.

Fig. 1 illustrates the three code segments of the same functionality. Task 1 takes a as an input parameter and writes its result to b, while task 2 takes c as input and also writes its result to b. Fig. 1 (a) presents the normal serialized code, in which task 2 cannot be issued until task 1 is done, even assuming that task 1 and task 2 could run in parallel using different computing resources. Fig. 1 (b) shows the sample code in the OpenMP pattern. Because both task 1 and task 2 write b, the programmer must rename b of task 2 manually to support parallel execution. Consequently, all tasks after task 2 that use b also need to rename their parameter accordingly, which greatly limits the programmability. If b is to be written to non-volatile storage devices (e.g. disk), then task 1 and task 2 cannot be executed in parallel by OpenMP. In order to release programmers from such burden, CellSs was proposed, as illustrated in Fig. 1 (c). The first two annotated lines of Fig. 1 (c) inform the compiler that task 1 and task 2 can be executed simultaneously, enabling task 1 and task 2 to be sent to different computing resources for parallel execution. The programmers are only required to add simple annotations to the tasks; the parallelization is then automated by the compiler and hardware support.

As mentioned above, both CellSs and Task Superscalar use a renaming scheme to detect and eliminate WAW and WAR dependencies. However, they are both designed for CMP architectures, in which all processors are able to run all kinds of tasks. This characteristic brings higher flexibility and scalability, but also causes significant difficulties for task-level OoO execution on heterogeneous systems. Besides, Task Superscalar detects inter-task dependencies at a fine-grained level, which makes it too sophisticated for resource-constrained systems. In order to make task-level parallelization efficient for heterogeneous systems, in this paper we propose a variable-based OoO framework for heterogeneous systems.

3.2 Definition of Tasks and Task-level Dependencies
Throughout this paper, tasks refer to the execution of actors, and actors can be GPPs or IP cores. Each task has a set of input parameters and a set of output parameters, and is defined as task_name({ the set of output parameters }, { the set of input parameters }).

Exploiting task-level concurrency is especially important in a heterogeneous environment due to the requirement to match the execution requirements of different parts of the program with the computational capabilities of the different platforms.


Some tasks may require special-purpose hardware, either because the hardware can execute that task more efficiently or because the hardware has some unique functionality that the task requires. Consequently, the motivation of this paper is to pursue an automatic OoO task-parallel execution method, especially for heterogeneous systems.

TABLE II
DEFINITIONS OF TASK-LEVEL DEPENDENCIES

RAW
  Definition: when an issuing task reads a parameter in the output set of an issued task, a RAW dependency occurs.
  Example: task_1({b},{a}); task_2({c},{b});

WAW
  Definition: when an issuing task writes a parameter in the output set of an issued task, a WAW dependency occurs.
  Example: task_1({c},{a}); task_2({c},{b});

WAR
  Definition: when an issuing task writes a parameter in the input set of an issued task, a WAR dependency occurs.
  Example: task_1({b},{a}); task_2({a},{c});

In Table II we define the three types of task-level dependencies; in each example, the shared parameter is the one that leads to the data dependency. Tasks that have a RAW dependency should always run in order; meanwhile, WAW and WAR dependencies can be eliminated using the renaming scheme.
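To make the renaming idea concrete, the following minimal C sketch applies it to the WAR example of Table II. The task bodies and the copy name a_renamed are illustrative assumptions; only the task signatures follow Table II.

/* Task-level WAR elimination by renaming (sketch). */
#include <stdio.h>

static void task_1(int *b, const int *a) { *b = *a + 1; } /* task_1({b},{a}): reads a, writes b */
static void task_2(int *a, const int *c) { *a = *c * 2; } /* task_2({a},{c}): writes a, reads c */

int main(void) {
    int a = 10, c = 3, b;

    /* WAR: task_2 writes 'a' while task_1 still has to read it, so a
     * naive scheduler must run task_2 strictly after task_1. Renaming
     * gives task_2 a fresh copy 'a_renamed' as its output; the two
     * tasks then touch disjoint storage and may run in parallel, and
     * later readers of 'a' are redirected to 'a_renamed'. */
    int a_renamed;
    task_1(&b, &a);          /* may execute concurrently ... */
    task_2(&a_renamed, &c);  /* ... with this task           */

    printf("b=%d, renamed a=%d\n", b, a_renamed); /* b=11, renamed a=6 */
    return 0;
}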

Fig. 2 Task-level OoO framework. The framework is divided into a software part and a hardware part, and both parts have two individual layers.

Fig. 2 presents the framework, which is composed of four layers. The top layer is the programming model layer, which makes the details of the scheduler and the hardware platform transparent to the programmer. The second layer is the library layer, which consists of the software library and the hardware library. The software library includes all the task interface definitions, while the hardware library contains the bit files of the IP cores. The third layer is the scheduler, which is in charge of task mapping, task-level OoO scheduling and reconfiguration. The bottom layer is the hardware platform layer, which is composed of interconnections and processing elements (PEs). The PEs include GPPs and IP cores, and each IP core can do only a specific type of task.

4 PROGRAMMING MODEL LAYER

In this section, we first introduce the programming model layer, which facilitates high-level programming for software developers. The programming model uses annotations to define tasks. In our framework, computing resources are divided into two categories: hardware computing resources and software computing resources. Each type of hardware computing resource can only do a specific kind of task, while software computing resources have the capability to do all kinds of tasks. In our framework, the interfaces to call hardware computing resources are regulated in runtime libraries. To use software computing resources, programmers need to define the tasks prior to the hardware execution.

The programming model layer makes all the details of the platform transparent to the programmers. In order to simplify the programming, programmers only have a sequential view of the system; in other words, the programmers do not need to know what the underlying platform is composed of or how tasks execute in parallel.

In order to reduce the burden of the scheduler, tasks are divided into two categories: those that pass through the scheduler and those that execute on the local GPP. Thus the program is divided into two parts: the kernel part and the non-kernel part. A kernel is defined as a sequence of tasks that all pass through the scheduler, while the non-kernel part is defined as a sequence of tasks that all run locally.

Fig. 3 Code snippet in our programming model.

When compiling the program, the compiler identifies the kernel parts first. Then synchronizations are inserted at the head and the end of each kernel part. The synchronizations make sure that the data used by tasks has the latest value. The synchronizations are inserted into the code by a source-to-source compiler automatically, so programmers do not have the burden to do such things themselves.

After the source-to-source compilation, the program is then compiled into executable binaries. Tasks in the kernel parts are sent to the scheduler for OoO execution.


Fig. 3 illustrates a sample code snippet in our programming model. The code is divided into two segments: a declaration segment and a normal segment. The declaration segment (Line 2 - Line 6) contains header file inclusion (Line 2) and task definitions (Line 3 - Line 6). All interfaces to call hardware computing resources are included in Hardware.h. When hardware computing resources are designed, the corresponding interfaces should be added to Hardware.h. Tasks defined by programmers start with annotations, as Line 3 shows. Each annotation includes three aspects of a parameter: direction, type and size. The direction can be chosen from input, output, and inout, which is detailed as follows:

Input: The parameter will be read by the task, and when the task is completed, the parameter will not be updated.

Output: The task starts without considering the value of the parameter, and when the task is completed, the parameter will be updated.

Inout: The task will use the data stored in the parameter, and when the task is finished, the parameter will be updated.

Line 4 and Line 6 are normal task definitions written in the C language. Line 8 to Line 15 present sample code of the main function. Hd_idct and Hd_aes are tasks defined in Hardware.h, while Idct and Aes are defined by the programmer.
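Since Fig. 3 itself is not reproduced here, the sketch below illustrates the flavor of such a program. The exact annotation spelling for direction, type and size is an assumption; only Hardware.h, Hd_idct, Hd_aes, Idct and Aes are taken from the description above, and the parameter sizes and placeholder task bodies are hypothetical.

#include "Hardware.h"  /* declares the hardware task interfaces Hd_idct and Hd_aes */

/* User-defined tasks: each parameter annotation carries direction, type and size. */
#pragma task output(out, int, 64) input(in, int, 64)
void Idct(int *out, const int *in) { for (int i = 0; i < 64; i++) out[i] = in[i]; } /* placeholder body */

#pragma task output(out, char, 16) input(in, char, 16)
void Aes(char *out, const char *in) { for (int i = 0; i < 16; i++) out[i] = in[i]; } /* placeholder body */

int main(void) {
    int  blk_in[64] = {0}, blk_out[64];
    char buf_in[16] = {0}, buf_out[16];

    Hd_idct(blk_out, blk_in); /* kernel task: dispatched to an IDCT IP core */
    Hd_aes(buf_out, buf_in);  /* kernel task: dispatched to an AES IP core  */
    Idct(blk_out, blk_in);    /* user task: runs on a GPP                   */
    Aes(buf_out, buf_in);
    return 0;
}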

Fig. 4 Task-level data dependencies.

In this paper, we bring instruction-level dependency analysis concepts to the task level. Each task is treated as an abstract instruction, and the parameters of a task are equivalent to operands at the instruction level, with the expansion that each task can have more than three parameters (in a standard x86 instruction set, an instruction has three operands at most). All the parameters must be defined as variables. Fig. 4 (a) describes a sample code pattern, in which wr_set and rd_set stand for the write set and read set of the function, respectively. Each function in the loop is viewed as a task, i.e., wr_set contains the variables defined as Output and Inout, while rd_set contains the variables defined as Input and Inout. Fig. 4 (b) is the task sequence obtained when the code in Fig. 4 (a) actually executes. Fig. 4 (c) shows all inter-task dependencies, in which different shapes of edges stand for different types of dependencies; the direction of an edge is from the task that brings about the dependency to the task that is affected. Inter-task dependencies are divided into three categories as follows:

RAW: The solid edges indicate RAW dependencies. All solid edges compose a normal Directed Acyclic Graph (DAG). When an issuing task reads a variable in the wr set of an issued task, a RAW dependency occurs; for example, task 4 takes variable D as input while task 2 writes D, so there is a solid edge from task 2 to task 4.

WAW: The dashed edges indicate WAW dependencies. When an issuing task writes a variable in the wr set of an issued task, a WAW dependency occurs. Task 4 and task 1 both write variable B, so there is a dashed edge.

WAR: The dotted edges indicate WAR dependencies. When an issuing task writes a variable in the rd set of an issued task, there is a WAR dependency. Task 3 writes variable A while task 1 reads A; therefore a dotted edge goes from task 3 to task 1.

It must be noted that tasks that have a RAW dependency should always run in order. Meanwhile, WAW and WAR dependencies can be eliminated using special techniques. Fig. 5 illustrates two different execution models for the tasks in Fig. 4 (b). The bottom refers to the mode used in [1], where WAW and WAR dependencies are detected but not eliminated, while the top stands for the ideal OoO mode presented in our scheduler, in which only RAW dependencies affect the parallelism. It can be clearly seen that the OoO mode gains a great performance improvement (the time between FT1 and FT2). In this paper, we propose a scheduling algorithm that automatically eliminates task-level WAW and WAR dependencies, especially for heterogeneous platforms.

Fig. 5 Execution comparison. The bottom part is the execution time chart in [1], which finishes at time FT2, while the top part is the OoO mode presented in our scheduler, which finishes at time FT1.

5 SCHEDULER LAYER

The scheduler layer is in charge of scheduling tasks to exploit the potential parallelism. In this paper, the OoO task scheduler is implemented as a hardware layer to uncover TLP. For demonstration we have implemented the MP-Tomasulo algorithm in hardware, which extends the instruction-level Tomasulo algorithm to the task level for parallel execution.


The MP-Tomasulo algorithm is able to dynamically detect and eliminate inter-task WAW and WAR dependencies, and thus speeds up the execution of the whole program. With the help of the MP-Tomasulo algorithm, programmers need not take care of inter-task data dependencies.

5.1 Hardware Scheduler Structure

Fig. 6 High level structure of the MP-Tomasulo module. The part surrounded by the dashed line is named the MP-Tomasulo module, which is the critical part of the framework.

At the instruction level, each instruction has one destination operand and two source operands at most. However, a task may have more than one output parameter, or more than two input parameters. We extend the traditional Tomasulo algorithm to support multiple outputs and more than two inputs. The MP-Tomasulo algorithm is designed as a hardware module, so the maximum numbers of outputs and inputs are limited to a certain value. Assume that wr_set refers to the output set and rd_set represents the input set, while num_wr_set and num_rd_set stand for the maximum numbers of outputs and inputs, respectively. Fig. 6 illustrates the MP-Tomasulo module, which is composed of the following components:

Task Issuing Queue: In our prototype, each task has num_wr_set outputs and num_rd_set inputs, so each task in the Task Issuing Queue can be formally regulated as <Type; wr_set[num_wr_set]; rd_set[num_rd_set]>, in which Type identifies the type of the task, and wr_set and rd_set stand for the set of output parameters and the set of input parameters, respectively.

Mapper: The Mapper sends tasks to different RS Tables according to the Type of each tuple stored in the Task Issuing Queue. The Mapper works as a black box, and therefore system designers can design alternative mapping algorithms. If the RS Tables do not have enough entries to store the task, the system is blocked until some RS Table entry is free.

RS Table: RS Table refers to the reservation station table, which contains an implicit dependency graph of tasks and uses the automatic renaming scheme to eliminate WAW and WAR dependencies. In order to send different tasks simultaneously, each PE has a private RS Table. Once all the input parameters of a task are prepared, the task is offloaded to the associated PE for execution. Each entry of an RS Table is a tuple of <ID; Busy; V[num_rd_set]; Q[num_rd_set]>, in which ID is used to identify different entries, Busy indicates whether the entry is in use or not, V is used to temporarily store the input parameters of a task, and Q indicates whether the value in V is valid.

Task Receiving Queue: The Task Receiving Queue buffers the finished tasks.

VS Table: The VS Table is regarded as a register status table that holds the status of the variables used in applications. Each variable is mapped to an entry of the VS Table. Each entry of the VS Table is a tuple of <ID; PE_ID; Value>. ID stands for the primary unique key. PE_ID represents the PE that will produce the latest value of the variable. Value stores the actual value.
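A C rendering of these entry formats may help fix the layout in mind. The tuple shapes come from the component descriptions above; the field widths and the NUM_WR_SET/NUM_RD_SET values are illustrative assumptions.

#include <stdint.h>

#define NUM_WR_SET 8  /* assumed maximum number of outputs per task */
#define NUM_RD_SET 8  /* assumed maximum number of inputs per task  */

/* Task Issuing Queue entry: <Type; wr_set[num_wr_set]; rd_set[num_rd_set]> */
typedef struct {
    uint8_t type;                /* identifies the type of the task        */
    uint8_t wr_set[NUM_WR_SET];  /* IDs of the output parameters           */
    uint8_t rd_set[NUM_RD_SET];  /* IDs of the input parameters            */
} issue_entry_t;

/* RS Table entry: <ID; Busy; V[num_rd_set]; Q[num_rd_set]> */
typedef struct {
    uint8_t  id;                 /* identifies the entry                   */
    uint8_t  busy;               /* whether the entry is in use            */
    uint32_t v[NUM_RD_SET];      /* temporary storage for input parameters */
    uint8_t  q[NUM_RD_SET];      /* whether the value in v[i] is valid     */
} rs_entry_t;

/* VS Table entry: <ID; PE_ID; Value> */
typedef struct {
    uint8_t  id;                 /* primary unique key                     */
    uint8_t  pe_id;              /* PE producing the latest value          */
    uint32_t value;              /* the actual value of the variable       */
} vs_entry_t;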
The MP-Tomasulo algorithm is divided into three stages as follows:

Issue: The head task of the Task Issuing Queue is in the Issue stage if an RS Table entry is available.

Execute: If all the input parameters of a task are prepared, the task can be distributed to the associated PE for execution immediately. Otherwise the RS Table builds an implicit data dependency graph indicating which task will produce the needed parameters. Once all the input parameters of a task are ready, the task is spawned to the corresponding PE.

Write Results: When a PE completes a task, it sends the results back to the Task Receiving Queue of the MP-Tomasulo module and updates the RS Table. For each task in the RS Table, if its input parameters are produced by the completed task, the associated RS Table entry is updated with the results.

Compared to the software scheduler presented in [44], we remove the reorder buffer (ROB) in this hardware implementation, for the following reasons:

1) The ROB structure is designed for speculation. Since the scheduler only deals with task sequences that always should execute, there is no need for any speculation.

2) The ROB structure would make a significant contribution to the hardware utilization. Therefore we save the ROB area, considering the area and cost of the FPGA chip.

Furthermore, in order to make the scheduler flexible, the sizes of the VS Table and RS Table and the number of PEs can be configured according to the characteristics of the applications. These configurations can be changed when the system starts and remain fixed during execution. To make efficient use of the on-chip resources, the IP cores are allowed to be dynamically reconfigured at runtime.

5.2 Adaptive Task Mapping
The scheduling module decides when a task can be executed according to its data dependencies; furthermore, when a task is ready, only one target function unit is selected from multiple options. The task mapping scheme should decide the target for each task, as described in Algorithm 1.


Algorithm 1. Task scheduling and mapping algorithm
Input: generated task set T
Output: IP core ID set S for each task
1   foreach t in T do
2     snum = ValidIPCoreNumber(t.opcode)   // get candidate IP cores
3     switch (snum)
4     case snum = 0:                       // no available IP cores
5       return null;
6     case snum = 1:                       // only one available
7       add s to the IP core ID set
8       return s;
9     default:
10      for j = 0 to snum do
11        Tfinish = Twaiting + Texecution + Ttransfer   // calculate execution time
12      end
13      TNum = number of candidates with the minimum Toverall
14      if TNum > 1
15        select the s using Round-Robin
16      else
17        select the s with the minimum Toverall        // select a target IP core
18      add s to the IP core ID set
19  end
5.3 FCFS Queue Based Scheduling

Fig. 7 Queue Based Task Scheduling.

The task scheduler architecture is shown in Fig. 7. We implement a queue-based task scheduling algorithm using a first-come-first-serve (FCFS) policy. The scheduler is responsible for task partitioning, scheduling and distribution to different computing resources. The major work includes the following subjects:

1. Receive task requests. Task requests are transferred to the scheduler during execution. Furthermore, the current status of each task is rendered by the scheduler to the programming models. The status checking interface can obtain the status of each IP core of the system.

2. Task partitioning and allocation among hardware and software. The scheduler classifies tasks into two types: the kernel part and the non-kernel part. For the sub-tasks requesting the kernel part, the scheduler needs to distribute the data to the related hardware IP cores through the on-chip interconnection. Otherwise, the control tasks and glue functions are executed on the local scheduler.

6 HARDWARE SCHEDULER DETAILS

In order to make the dataflow run in parallel, the renaming method is introduced to eliminate WAW and WAR dependences. The main reason for WAW and WAR dependences is that a sub-flow may update its write set before a previous sub-flow reads or writes part of it. The critical idea of the renaming method is to make a copy of the data that may cause dependences.

Fig. 8 Hardware implementation of the scheduler.

Fig. 8 illustrates the hardware implementation of the scheduler, which is redesigned from the high-level structure in Fig. 6. Applications always start from the GPP shown in gray, and the dataflow of execution is sent to the Dataflow FIFO (DF). The critical part of the hardware scheduler is the Dataflow Mapper (DM). The DM receives the dataflow from the DF, identifies the write set and read set of each sub-flow, searches for a proper PE to execute the sub-flow, and updates the Variable Set (VS) and the found PE's Map Table (MT). An MT contains the parameters of the sub-flow, or which PE will produce the needed parameters. When a PE finishes a sub-flow, an interrupt occurs and the Interrupt Controller (ICtr) reports it to the DM. The DM updates the MTs and writes the results to the Ordering Unit (OU). The OU writes the results to the memory in the order of the sub-flows in the DF. These modules are detailed below.

6.1 Dataflow FIFO (DF)
The DF stores the dataflow, which consists of a series of sub-flows. The detailed structure of the DF is illustrated in Fig. 9. The communication interface between the DF and the GPP contains a 1-bit DF_not_full signal indicating the status of the DF and a 32-bit data port for transmission.


The communication interface between the DF and the DM contains a 1-bit DM_enable signal indicating the status of the DM and a 32-bit data port. Each sub-flow starts with start_flag and ends with end_flag. Out_num and in_num refer to the sizes of the write set and the read set, respectively, both measured in 32-bit words. Between in_num and end_flag are the actual parameters.

Fig. 9 Dataflow FIFO Hardware Implementation.

6.2 Variable Set (VS) and Ordering Unit (OU)
The Variable Set (VS) keeps the latest value of each parameter that a sub-flow in the OU produces. When a sub-flow passes through the DM, the information of its write-set parameters is updated in the VS, indicating a new producer of these parameters. Assume that each entry of the VS has N bits; then the OU has 2^N entries.

Fig. 10 Circuit Implementation of Variable Set and Ordering Unit.

The OU is a FIFO that stores all the executing sub-flows in the order they pass through the DM. As soon as the head sub-flow in the OU finishes, its entry is released and the VS is updated. The results of the head entry of the OU are written to the memory. Communication between the VS and the OU is controlled by the DM.

Fig. 10 (a) shows the detailed view of the VS and the OU. The VS and OU are both composed of Block RAM (BRAM) and BRAM controllers. Assuming that the VS has 256 entries and the OU has 64 entries (throughout this paper we use this configuration), the widths of the Addr wires of the VS and the OU are 8 and 6 bits, respectively. As the content of each entry of the VS is associated with a certain position of the OU, the content width of the VS is 6 bits. Supposing that each sub-flow has no more than 16 parameters (parameters in the write set and read set are both limited to no more than 8), each of which is no more than 32 bits (usually used as 32-bit, 16-bit or 8-bit), the width of an OU entry is 256 bits, because only the write set needs to be stored while the read set is ignored. Fig. 10 (b) presents how the VS and the OU are updated when the head sub-flow of the OU finishes. In the left part of Fig. 10 (b), the head sub-flow of the OU is the latest producer of the No.1 entry of the VS. After communication through the DM, both the VS and the OU are updated. The content of the No.1 entry of the VS is changed to N, meaning that the entry is free. The content of the No.0 entry of the OU is also changed to N. Besides, the head of the OU moves to the No.1 entry.

6.3 Map Table (MT)
The MTs hold an implicit DAG of all the executing sub-flows and eliminate WAW and WAR dependences through copies of the sub-flows' parameters. Fig. 11 shows the detailed view of an MT. In the proposed design, each PE has a separate MT and there are 4 entries in each MT. An entry of an MT consists of a 6-bit OU_ID, a 72-bit Read_Set_Status and a 256-bit Parameters_Value. OU_ID indicates the location of the sub-flow in the OU. The parameters in the read set of a sub-flow are recorded in the MT. Each parameter has a 9-bit Read_Set_Status, of which the first 8 bits represent the parameter ID and the last bit indicates whether the parameter is prepared. If the last bit is 1, then the associated Parameters_Value contains the exact value of the parameter. Otherwise, Parameters_Value contains the ID of the OU entry that will produce the actual value.

Fig. 11 Circuit Implementation of Mapping Table.

The address of an MT has three components. The first two bits index the entry, the following bit is a tag indicating whether to write Read_Set_Status or Parameters_Value, and the last three bits indicate which part of Parameters_Value is to be updated.
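The following C sketch mirrors the MT entry layout and the 6-bit address decoding just described. The struct is a software model of the hardware fields; its packing is illustrative, while the widths (6 + 72 + 256 bits, split as 2 + 1 + 3 address bits) are from the text.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint8_t  ou_id;               /* 6 bits: location of the sub-flow in the OU        */
    uint16_t read_set_status[8];  /* 8 x 9 bits: parameter ID (8b) + prepared bit (1b) */
    uint32_t parameters_value[8]; /* 8 x 32 bits: value, or producing OU entry ID      */
} mt_entry_t;

/* MT address layout: [entry(2b) | tag(1b) | part(3b)] */
static void decode_mt_addr(uint8_t addr, int *entry, int *tag, int *part) {
    *entry = (addr >> 4) & 0x3; /* which of the 4 entries                        */
    *tag   = (addr >> 3) & 0x1; /* 0: write Read_Set_Status, 1: Parameters_Value */
    *part  =  addr       & 0x7; /* which 32-bit part of Parameters_Value         */
}

int main(void) {
    int entry, tag, part;
    decode_mt_addr(0x2B, &entry, &tag, &part);             /* 0b10'1'011 */
    printf("entry=%d tag=%d part=%d\n", entry, tag, part); /* 2 1 3      */
    return 0;
}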
6.4 Interrupt Controller (ICtr)
When a PE finishes the execution of a sub-flow, an interrupt occurs and the ICtr handles the interrupt. The interrupt results are transferred to the DM. Each PE is indexed by a unique PE_ID and has a fixed priority in the ICtr.


A PE with a smaller PE_ID has a higher interrupt priority. In order to get better overall performance, IP cores always have higher interrupt priority than GPPs, meaning that when interrupts occur simultaneously, IP interrupt requests (IPR) are always handled prior to GPP interrupt requests (GPPR). When all interrupt requests pass through the ICtr, only one output wire is set to 1, indicating that the associated interrupt has the highest priority and its results can be received while the others are pending. Interrupt nesting is not supported, because interrupt nesting needs extra hardware resources to store the interrupt stack. An entry of an MT or the OU is released upon completion so that the DM can receive new sub-flows.
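A compact way to model this fixed-priority grant in C is to isolate the lowest set bit of the pending-request vector. Assigning the smaller PE_IDs to the IP cores, which is an assumption about the ID layout, then realizes the rule that IP requests are handled before GPP requests.

#include <stdint.h>
#include <stdio.h>

/* pending: bit i set means PE i raised an interrupt. Returns a one-hot
 * grant mask with only the highest-priority (lowest PE_ID) bit set. */
static uint8_t ictr_grant(uint8_t pending) {
    return (uint8_t)(pending & (uint8_t)-pending); /* isolate lowest set bit */
}

int main(void) {
    uint8_t pending = 0x0C; /* PE2 (IP core) and PE3 (GPP) pending at once */
    printf("grant = 0x%02X\n", ictr_grant(pending)); /* 0x04: PE2 wins */
    return 0;
}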

6.5 Dataflow Mapper (DM)
The DM is the critical part of the hardware scheduler, connecting the DF, VS, MTs, ICtr and OU. The DM is called (x,y)-channel if it supports at most x IP cores and y GPPs. Fig. 12 illustrates the details of the DM. On receiving sub-flows from the DF, the DM first generates the addresses of the VS, OU and MT. VS_addr is included in the sub-flow so that the VS can be easily updated. In order to accelerate the generation of OU_addr, a register named Tail Reg is utilized to keep the address of the first free entry of the OU. Head Reg indicates the first used entry of the OU, and together with Tail Reg it is easy to check whether the OU is full. The DM is responsible for mapping sub-flows to different PEs. The strategy for mapping is summarized in Algorithm 2. Please note that the Mapper works as a black box, and the system designer can design their own mapping algorithm.

Algorithm 2 Mapping Algorithm
input: a sub-flow
output: a PE ID
1: if (an IP core p can execute the sub-flow)
2:   if (the MT of p has a free entry)
3:     update MT, OU, VS;
4:     return p;
5: else select the GPP g that has the most free entries N
6:   if (N is larger than 0)
7:     update MT, OU, VS;
8:     return g;
9:   else pend the sub-flow;

It can be seen that IP cores have higher priority than GPPs for executing sub-flows. In order to record the occupancy of the MTs, each MT has an associated 3-bit counter. When the DM sends a sub-flow to a PE, the counter increments by 1, and when an interrupt from a PE is accepted, the associated counter decrements by 1. A decoder is used to identify whether a sub-flow can be executed by some IP core of the platform. To avoid deadlocks, a sub-flow can be sent to the DM only when the MTs, VS and OU all have a free entry, as presented in the bottom part of Fig. 12. The MT_full signal is set to 1 if none of the PEs that can execute the sub-flow has a free entry.

Fig. 12 Hardware Implementation of Dataflow Mapper.

The multi-channel design of the DM can easily support the reconfigurability of the FPGA platform using Xilinx Early Access Partial Reconfiguration (EAPR) technology. For example, in a (4, 4)-channel DM, when replacing an IP core, the associated counter is set to the number of entries of an MT, meaning that it can accommodate no more sub-flows. The only logic that needs to be changed is the decoder. Fig. 13 illustrates the reconfigurable flow, in which bitstream0 is replaced by other bitstreams (a bitstream is the routed circuit for a PE).

Fig. 13 Reconfigurable Flow.

7 EXPERIMENTS AND RESULTS

Our experimental platform is built on a Xilinx Virtex-5 XC5VLX110T FPGA. The structure of the platform is presented in Fig. 14, including four PEs, one scheduler and peripherals. The PEs can be classified into three types:


(1) The upper left part is a MicroBlaze processor that is responsible for issuing tasks and receiving data; besides, it is also used as a computing resource. (2) The upper right part is a MicroBlaze processor used only for computing. (3) The lower middle part contains two IP cores used as accelerators, which accelerate the execution of the CC/DCT/Quant stages of the JPEG application. The upper middle part is the hardware scheduler that detects RAW/WAW/WAR dependencies and eliminates WAW/WAR dependencies using the renaming scheme. The lower left part is the peripherals, including the timer, the interrupt controller (intc), etc.

Fig. 14 Experimental platform built on Xilinx Virtex-5 FPGA.

7.1 JPEG Case Study
We first use JPEG to test the efficiency of our framework. JPEG can be divided into four stages: Color space Convert (CC), two-dimensional Discrete Cosine Transform (DCT), Quantization (Quant) and Entropy coding (Huffman). The first three stages have fixed numbers of input parameters and output parameters; therefore they are appropriate to be implemented as hardware IP cores. In contrast, the Huffman stage runs on the MicroBlaze because it does not have a fixed number of output parameters and only takes 1.94% of the total execution time of the whole program on average. There are two IP cores on our platform, both named CC-DCT-Quant and used to accelerate the execution of the first three stages of the JPEG application. The JPEG algorithm takes a BMP picture as input and outputs a picture in JPEG format. The whole picture is divided into 8*8 blocks, and only one block is processed at a time. The scheduler can be configured according to different applications.

With the help of the MP-Tomasulo module, the sequential execution mode of the JPEG program is converted to an OoO mode: only tasks that have RAW dependencies with others need to run in order, while WAW dependencies are dynamically eliminated by the MP-Tomasulo module. We select 30 pictures of 6 different sizes for the experiments; for each size we randomly picked 5 pictures in BMP format. Fig. 15 presents the experimental results. We use the sequential execution time on a GPP as the baseline and compute the speedup using different strategies. It can be easily concluded that our algorithm actually eliminates the WAW/WAR dependencies and achieves more than 95% of the ideal speedup.

Fig. 15 Experimental results. The X-axis refers to the picture size, and the Y-axis is the average speedup. The blue (first) bar is the average speedup if the WAW dependency is not eliminated. The red (second) bar is the actual speedup using our proposed method. The green (third) bar is the ideal speedup ignoring the scheduling time.

7.2 Comparison using Parallel and Sequential Applications
Reviewing the data dependency problems (RAW, WAW and WAR) at the instruction level, Scoreboarding and Tomasulo are both effective methods used in superscalar pipeline machines for out-of-order instruction execution. Of these two methods, Tomasulo is more widely used because it can eliminate WAW and WAR dependencies, while Scoreboarding only handles them by running the tasks in sequence. Therefore, in this paper we choose Tomasulo instead of Scoreboarding at the task level. In particular, we have compared our approach with the Scoreboarding technique that has been applied in [45].

[Figure: "Speedup using Parallel and Sequential Applications" — bar chart of speedup (0-4) vs. task execution time (5000, 10000, 25000, 50000, 100000 cycles); series: Parallel-Tomasulo, Parallel-Scoreboarding, Sequential-Tomasulo, Sequential-Scoreboarding.]

Fig. 16 Speedup using Parallel and Sequential Applications.

We evaluated the speedup of the Tomasulo algorithm and compared it with the Scoreboarding technique. Both sequential and parallel applications from [2] are used as benchmarks, and the execution time of the applications is configured from 5000 to 100000 cycles. Fig. 16 clearly illustrates that the peak speedup of the Tomasulo algorithm on parallel applications achieves 3.74X, which is significantly higher than that of Scoreboarding (0.97X). The reason is that, due to the renaming techniques employed by Tomasulo, the WAW and WAR dependencies can be automatically eliminated. In contrast, the Scoreboarding scheme is only able to detect the dependency. Regarding the speedup on sequential applications, Tomasulo and Scoreboarding have similar performance, as the data dependencies cannot be eliminated; the speedups are 0.95X and 0.98X, respectively.


7.3 Comparison on Regular Tasks
We measured the speedups of typical task sequences by partitioning the test tasks in [2]. In this test case, we select a regular test with 11 different random tasks. The sample task sequence is presented in Table III.

TABLE III
SAMPLE TASK SEQUENCE

Task number | Task type | Source variables | Destination variable
T1  | JPEG    | a    | c
T2  | IDCT    | b    | a
T3  | JPEG    | J    | i
T4  | AES_ENC | d, e | F
T5  | AES_ENC | h, e | d
T6  | DES_DEC | E, e | g
T7  | JPEG    | a    | c
T8  | DES_DEC | H, e | g
T9  | DES_DEC | H, e | a
T10 | JPEG    | f    | e
T11 | IDCT    | b    | a

Fig. 17 depicts the comparison between the theoretical and experimental results for the task sequence. The length of the sequence increases from 1 to 11, and the curve of the experimental values is consistent with the theoretical values, but slightly larger. This is because the theoretical values do not account for scheduling and communication overheads. The average overhead is less than 4.9% of the task running time itself. The execution time remains flat when the length of the queue increases from 2 to 6. This result is caused by out-of-order execution and completion. Taking tasks No.2 and No.4 as instances, No.2 takes more clock cycles to finish than No.4. As there are no data dependencies among these tasks, they can be issued at the same time. After task No.4 is finished, it must wait until all the predecessor tasks are finished.

[Figure: "Running Time Comparison on Regular Tasks" — running time (K cycles, 0-500) vs. sequence length (T1-T11); series: Tomasulo Theoretical, Tomasulo Experimental, Scoreboarding Theoretical, Scoreboarding Experimental.]

Fig. 17 Experimental Results of Regular Tasks.

Compared to Scoreboarding, Tomasulo has higher scheduling overheads, which leads to a bigger gap between the experimental and theoretical values. However, since MP-Tomasulo can not only detect WAW and WAR hazards but also eliminate them by register renaming, the overall speedup is significantly larger than Scoreboarding's.

7.4 Discussion
In this section we use the timing diagram to show the inter-task data dependencies and how they are resolved.

Fig. 18 Timing Diagram. The horizontal axis denotes the model time, while the vertical axis denotes the execution flow of tasks. The executing and result-writing events of each task are represented by bars outlined with solid lines, while the stalls of tasks are represented by bars outlined with dashed lines. The length of a bar represents the duration of each event. Besides, the arrowed lines represent different types of inter-task dependences.

Fig. 18 illustrates the timing diagrams both with and without the Tomasulo scheduling algorithm. Fig. 18(a) presents the timing graph with the Tomasulo scheduler. At the model startup, the token representing task T1 is generated and can then be dispatched. The transition checks the state of the computational kernels in the modeled system and assigns the JPEG IP core to task T1. In the second time unit, task T2 is generated. At that time, the source variable a is not ready because T1 is not finished, thus a read-after-write (RAW) dependence is identified and task T2 stalls. In the third time unit, task T3 is generated. As T3 has no data dependencies with T1, it can be assigned to the GPP immediately. Similarly, tasks T4 and T6 can also be issued immediately because there are no inter-task dependencies, while the other tasks are stalled until their dependencies are eliminated. Finally, all the tasks are finished at time 300k cycles.

By contrast, Fig. 18(b) illustrates the timing graph without the Tomasulo scheduling algorithm.


scheduling methods, all the tasks have to be issued in at 32 bits. In order to improve the task level parallelism,
sequential.Asisclearly describedinFigure19[b],allthe we use 8 MT in total, each of which contains 4 entries at
tasksarefinishedat450kcycles,whichmeanstheToma 334-bit width.
Table V illustrates the resources consumption for our
suloalgorithmcanefficientlyreducetheexecutiontimeof
hardware scheduler and the MicroBlaze processor. Our
thetasksequencingbyabout33%.
scheduler takes about the same Look-Up Tables (LUTs) as
By investigating the timing diagram, we gain a deep
a MicroBlaze while significantly less number of less regis-
insight on how tasks interact with each other and main
ters. Most of the LUTs in a MicroBlaze are used as logic,
tain the data dependences in our scheme. Furthermore,
while about one third are used as RAM in our scheduler.
wecanverifythecorrectnessofsimulationresultthrough
Furthermore, to calculate the area cost, we use the area
comparing the timing diagram with task execution flow
utilization model that is presented by Jonathan Rose in
on a prototype system. We have studied simulation re
[46]. The area cost of the hardware scheduler is similar to
sultsinanumberofcases.Thesimulationresultsareex
the Microblaze processor based on the LUTs consumption.
actlyconsistentwiththeactualresultsobtainedfrompro
totype systems. To this end, we confirm that our Outof TABLE V.
order scheduling scheme can correctly schedule tasks RESOURCES USAGE
withvariousdependencestoexploitparallelism.
Hardware
MicroBlaze
Scheduler
Gap Between Theorectial Execution Time and Actual Time
Registers 295(0.4%) 1445(2.1%)
15.00%
LUTs used as Logic 1158(1.7%) 1434(2.1%)
Execution Time and Actual Time

Communication Overheads Scheduling Overheads


LUTs used as RAM 510(0.7%) 84(0.1%)
Gap Between Theorectial

10.00% Total LUTs 1668(2.4%) 1518(2.2%)


Area Utilization 1.35 MM2 1.23 MM2
5.00%
(*%): Resources used/Total on-chip resources

In order to show the performance of our scheduler,


0.00%
1 2 3 4 5 6 7 8 9 10 11
7.5 Hardware Utilization
Our experimental platform is constructed on a Xilinx Virtex-5 XC5VLX110T FPGA. A (4,4)-channel hardware scheduler is prototyped, and the results are shown in the following tables.

TABLE IV.
SCHEDULER CONFIGURATION

        Number of   Number of   Entry Width
        Units       Entries     (bits)
  DF    1           64          32
  VS    1           256         6
  OU    1           64          256
  MT    8           4           334

Table IV illustrates the scheduler configuration. In the JPEG case, we integrate 1 DF, 1 VS, 1 OU and 8 MT modules. Each DF has 64 entries, and the data width is configured at 32 bits. In order to improve the task-level parallelism, we use 8 MTs in total, each of which contains 4 entries at 334-bit width.
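As a quick sanity check on the on-chip storage this configuration implies, the sketch below multiplies units x entries x entry width for each module; this formula is our reading of Table IV, not a statement about the exact memory layout of the RTL.

```c
/* Storage implied by the Table IV configuration (assumed formula:
 * bits = units * entries * entry_width per module type). */
#include <stdio.h>

struct Unit { const char *name; int units, entries, width; };

int main(void) {
    struct Unit cfg[] = {
        {"DF", 1, 64, 32},  {"VS", 1, 256, 6},
        {"OU", 1, 64, 256}, {"MT", 8, 4, 334},
    };
    long total = 0;
    for (int i = 0; i < 4; i++) {
        long bits = (long)cfg[i].units * cfg[i].entries * cfg[i].width;
        printf("%-2s: %6ld bits\n", cfg[i].name, bits);
        total += bits;
    }
    printf("total: %ld bits (~%.1f KB)\n", total, total / 8.0 / 1024.0);
    return 0;
}
```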
Table V illustrates the resource consumption of our hardware scheduler and of the MicroBlaze processor. Our scheduler takes about the same number of Look-Up Tables (LUTs) as a MicroBlaze while using significantly fewer registers. Most of the LUTs in a MicroBlaze are used as logic, whereas about one third of the LUTs in our scheduler are used as RAM. Furthermore, to calculate the area cost, we use the area utilization model presented by Jonathan Rose in [46]. Based on the LUT consumption, the area cost of the hardware scheduler is similar to that of the MicroBlaze processor.

TABLE V.
RESOURCES USAGE

                        Hardware Scheduler   MicroBlaze
  Registers             295 (0.4%)           1445 (2.1%)
  LUTs used as Logic    1158 (1.7%)          1434 (2.1%)
  LUTs used as RAM      510 (0.7%)           84 (0.1%)
  Total LUTs            1668 (2.4%)          1518 (2.2%)
  Area Utilization      1.35 mm2             1.23 mm2

  (%): resources used / total on-chip resources
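The percentages in Table V can be cross-checked against the device capacity. The sketch below assumes the XC5VLX110T totals of 69,120 LUTs and 69,120 registers (a device-family figure, not one stated in the paper) and reproduces both the utilization column and the "about one third used as RAM" observation.

```c
/* Cross-check of Table V: utilization relative to the assumed XC5VLX110T
 * capacity (69,120 LUTs / 69,120 registers) and the logic/RAM split. */
#include <stdio.h>

int main(void) {
    const double total_luts = 69120.0, total_regs = 69120.0;
    /* hardware scheduler column of Table V */
    const double regs = 295.0, lut_logic = 1158.0, lut_ram = 510.0;
    const double luts = lut_logic + lut_ram;

    printf("registers:   %.1f%% of chip\n", 100.0 * regs / total_regs);
    printf("total LUTs:  %.1f%% of chip\n", 100.0 * luts / total_luts);
    printf("LUTs as RAM: %.0f%% of the scheduler's LUTs\n",
           100.0 * lut_ram / luts);
    return 0;
}
```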
In order to show the performance of our scheduler, Table VI compares our work with the software MP-Tomasulo scheduler, which runs on a MicroBlaze microprocessor at 100 MHz. On average, the software scheduler takes more than 10,000 cycles for scheduling while our hardware scheduler only takes about 20 cycles, which means the speedup of the scheduling achieves up to 500X. Furthermore, we have evaluated the scheduling time on an Intel Core i5 4460 processor at 3.2 GHz. The speedup of the HW scheduler against the Intel SW scheduler on the Core i5 is approximately 15.65X.

TABLE VI.
SCHEDULING OVERHEAD COMPARISON

                               Average Scheduling   Frequency   Scheduling
                               Overhead (cycles)                Time
  SW Scheduler on MicroBlaze   >10000               100 MHz     >100 us
  SW Scheduler on Intel Core
  i5 4460                      10000                3.2 GHz     3.13 us
  HW Scheduler                 ~20                  100 MHz     0.2 us
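The entries in Table VI follow directly from time = cycles / frequency, as the short check below shows; the 500X figure compares cycle counts at the same 100 MHz clock, while the 15.65X figure compares wall-clock scheduling times against the Core i5.

```c
/* Scheduling-time arithmetic behind Table VI: time = cycles / frequency. */
#include <stdio.h>

int main(void) {
    double hw_us    = 20.0    / 100e6 * 1e6; /* HW scheduler @ 100 MHz */
    double sw_mb_us = 10000.0 / 100e6 * 1e6; /* SW on MicroBlaze       */
    double sw_i5_us = 10000.0 / 3.2e9 * 1e6; /* SW on Core i5 4460     */

    printf("HW: %.2f us, MicroBlaze SW: %.0f us, Core i5 SW: %.2f us\n",
           hw_us, sw_mb_us, sw_i5_us);
    printf("vs MicroBlaze SW: %.0fX\n", sw_mb_us / hw_us); /* ~500X  */
    printf("vs Core i5 SW:    %.2fX\n", sw_i5_us / hw_us); /* ~15.6X */
    return 0;
}
```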

8 CONCLUSIONS
Out-of-order scheduling plays a key role in exploiting task-level parallelism. In this paper, we focused on the hardware implementation of automatic out-of-order execution engines on an FPGA-based heterogeneous platform. We have proposed a flexible hardware scheduler to fit reconfigurable heterogeneous systems, and then detailed one of the OoO task execution methods. WAW and WAR dependences can be dynamically eliminated so as to greatly accelerate the execution of applications on heterogeneous platforms. Experimental results demonstrate that our framework is efficient and that the average performance achieves 95% of the theoretical speedup. Furthermore, the experimental results also show that our design is efficient in both performance and resource usage: the performance is 500X faster than the software scheduler, while the resource usage is not more than that of a MicroBlaze. Besides, our hardware scheduler supports the dynamic reconfiguration of the FPGA platform by changing a small part of the logic.

Although the experimental results are promising, there are a few future directions worth pursuing. First, improved task partitioning and further adaptive mapping schemes are essential to support automatic task-level parallelization. Second, we also plan to study the out-of-order task execution paradigm to explore the potential parallelism in data-intensive applications.
ACKNOWLEDGMENT
This work was supported by the National Science Foundation of China under grants (No. 61379040, No. 61272131, No. 61202053, No. 61222204, No. 61221062), the Jiangsu Provincial Natural Science Foundation (No. SBK201240198), the Natural Science Foundation of Fujian Province of China (No. 2014J05075), and the Open Project of the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences. The authors deeply appreciate the reviewers for their insightful comments and suggestions.
REFERENCES
[1] Gagan Gupta and Gurindar S. Sohi. Dataflow execution of sequential imperative programs on multicore architectures. in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. 2011. Porto Alegre, Brazil: ACM: p. 59-70.
[2] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. in Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1995: ACM: p. 207-216.
[3] OpenMP Architecture Review Board. OpenMP C and C++ application program interface, version 1.0. 1998; http://www.openmp.org.
[4] OpenCL. http://www.khronos.org/opencl/.
[5] Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta. CellSs: a programming model for the Cell BE architecture. in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. 2006. Tampa, Florida: ACM: p. 86.
[6] Chao Wang, Junneng Zhang, Xuehai Zhou, Xiaojing Feng, and Xiaoning Nie. SOMP: Service-Oriented Multi Processors. in Proceedings of the 2011 IEEE International Conference on Services Computing. 2011: IEEE Computer Society: p. 709-716.
[7] Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, and Mateo Valero. Task Superscalar: An Out-of-Order Task Pipeline. in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. 2010: IEEE Computer Society: p. 89-100.
[8] Yongjoo Kim, Jongeun Lee, Toan X. Mai, and Yunheung Paek. Improving performance of nested loops on reconfigurable array processors. ACM Transactions on Architecture and Code Optimization (TACO), 2012. 8(4): p. 1-23.
[9] Martin Thuresson, Magnus Själander, Magnus Björk, Lars Svensson, Per Larsson-Edefors, and Per Stenström. FlexCore: Utilizing Exposed Datapath Control for Efficient Computing. 2009, Kluwer Academic Publishers: p. 5-19.
[10] Junneng Zhang, Chao Wang, Xi Li, and Xuehai Zhou. FPGA implementation of a scheduler supporting parallel dataflow execution. in IEEE International Symposium on Circuits and Systems. 2013: p. 1216-1219.
[11] R. M. Tomasulo. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, 1967. 11(1): p. 25-33.
[12] J. Wawrzynek, D. Patterson, M. Oskin, Shin-Lien Lu, C. Kozyrakis, J. C. Hoe, D. Chiou, and K. Asanovic. RAMP: Research Accelerator for Multiple Processors. IEEE Micro, 2007: IEEE Computer Society: p. 46-57.
[13] T. Givargis and F. Vahid. Platune: a tuning framework for system-on-a-chip platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2002. 21(11): p. 1317-1327.
[14] G. Kuzmanov, G. Gaydadjiev, and S. Vassiliadis. The MOLEN processor prototype. in 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2004). 2004: p. 296-299.
[15] D. Gohringer, M. Hubner, V. Schatz, and J. Becker. Runtime adaptive multi-processor system-on-chip: RAMPSoC. in IEEE International Symposium on Parallel and Distributed Processing. 2008: p. 1-7.
[16] Faraydon Karim, Alain Mellan, Anh Nguyen, Utku Aydonat, and Tarek Abdelrahman. A Multilevel Computing Architecture for Embedded Multimedia Applications. 2004, IEEE Computer Society Press: p. 56-66.
[17] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. in Intl. Symp. on Computer Architecture. 1995: p. 414-425.
[18] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith. Trace processors. in Intl. Symp. on Microarchitecture. 1997: p. 138.
[19] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 2005. 49.
[20] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, Jae-Wook Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro, 2002. 22(2): p. 25-35.
[21] L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, 2000. 20(2): p. 71-84.
[22] K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. S. Govindan, P. Gratz, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, S. W. Keckler, and D. Burger. Distributed microarchitectural protocols in the TRIPS prototype processor. in Intl. Symp. on Microarchitecture. 2006: p. 480-491.
[23] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. in Intl. Symp. on Microarchitecture. 2003: p. 291.
[24] C. Villavieja, J. A. Joao, R. Miftakhutdinov, and Y. N. Patt. Yoga: A hybrid dynamic VLIW/OoO processor. 2014, High Performance Systems Group, Department of Electrical and Computer Engineering, The University of Texas at Austin.
[25] Judit Planas, Rosa M. Badia, Eduard Ayguade, and Jesus Labarta. Hierarchical Task-Based Programming With StarSs. International Journal of High Performance Computing Applications, 2009. 23(3): p. 284-299.
[26] Gregory F. Diamos and Sudhakar Yalamanchili. Harmony: an execution model and runtime for heterogeneous many core systems. in Proceedings of the 17th International Symposium on High Performance Distributed Computing. 2008. Boston, MA, USA: ACM: p. 197-200.
[27] Enno Lubbers and Marco Platzner. ReconOS: Multithreaded programming for reconfigurable computers. ACM Transactions on Embedded Computing Systems, 2009. 9(1): p. 1-33.
[28] W. Peck, E. Anderson, J. Agron, J. Stevens, F. Baijot, and D. Andrews. Hthreads: A Computational Model for Reconfigurable Devices. in International Conference on Field Programmable Logic and Applications (FPL '06). 2006. Madrid, Spain: p. 1-4.
[29] Dirk Koch, Christian Beckhoff, and Juergen Teich. A communication architecture for complex runtime reconfigurable systems and its implementation on Spartan-3 FPGAs. in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 2009. Monterey, California, USA: ACM: p. 253-256.
[30] Kyle Rupnow, Keith D. Underwood, and Katherine Compton. Scientific Application Demands on a Reconfigurable Functional Unit Interface. ACM Trans. Reconfigurable Technol. Syst., 2011. 4(2): p. 1-30.
[31] S. Balakrishnan and G. Sohi. Program demultiplexing: Dataflow based speculative parallelization of methods in sequential programs. in 33rd International Symposium on Computer Architecture. 2006: p. 302-313.

[32] D. S. McFarlin, C. Tucker, and C. Zilles. Discerning the dominant out-of-order performance advantage: Is it speculation or dynamism? in International Conference on Architectural Support for Programming Languages and Operating Systems. 2013: p. 241-252.
[33] Srinath Sridharan, Gagan Gupta, and Gurindar S. Sohi. Adaptive, efficient, parallel execution of parallel programs. in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2014. Edinburgh, United Kingdom: ACM: p. 169-180.
[34] KHRONOS Group. OpenCL. 2010; http://www.khronos.org/opencl/.
[35] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. 1995, ACM: p. 207-216.
[36] Pieter Bellens, Josep M. Perez, Felipe Cabarcas, Alex Ramirez, Rosa M. Badia, and Jesus Labarta. CellSs: Scheduling techniques to better exploit memory hierarchy. Scientific Programming - High Performance Computing with the Cell Broadband Engine, 2009. 17(1-2): p. 77-95.
[37] Akihiro Hayashi, Yasutaka Wada, Takeshi Watanabe, Takeshi Sekiguchi, Masayoshi Mase, Jun Shirako, Keiji Kimura, and Hironori Kasahara. Parallelizing compiler framework and API for power reduction and software productivity of real-time heterogeneous multicores. in Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing. 2010. Houston, TX: Springer-Verlag: p. 184-198.
[38] D. Kuck, E. Davidson, D. Lawrie, A. Sameh, C. Q. Zhu, A. Veidenbaum, J. Konicek, P. Yew, K. Gallivan, W. Jalby, H. Wijshoff, R. Bramley, U. M. Yang, P. Emrath, D. Padua, R. Eigenmann, J. Hoeflinger, G. Jayson, Z. Li, T. Murphy, and J. Andrews. The Cedar system and an initial performance study. in 25 Years of the International Symposia on Computer Architecture (Selected Papers). 1998, ACM: Barcelona, Spain.
[39] Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, and Kunle Olukotun. The Stanford Hydra CMP. IEEE Micro, 2000. 20: p. 71-84.
[40] Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, and Mateo Valero. Task Superscalar: An Out-of-Order Task Pipeline. in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. 2010: IEEE Computer Society: p. 89-100.
[41] Mladen Berekovic and Tim Niggemeier. A distributed, simultaneously multi-threaded (SMT) processor with clustered scheduling windows for scalable DSP performance. 2008, Kluwer Academic Publishers: p. 201-229.
[42] James Christopher Jenista, Yong hun Eom, and Brian Charles Demsky. OoOJava: software out-of-order execution. in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming. 2011. San Antonio, TX, USA: ACM: p. 57-68.
[43] Feng Liu, Heejin Ahn, Stephen R. Beard, Taewook Oh, and David I. August. DynaSpAM: Dynamic Spatial Architecture Mapping using Out of Order Instruction Schedules. in 42nd International Symposium on Computer Architecture (ISCA). 2015.
[44] Chao Wang, Xi Li, Junneng Zhang, Xuehai Zhou, and Xiaoning Nie. MP-Tomasulo: A Dependency-Aware Automatic Parallel Execution Engine for Sequential Programs. ACM Trans. Archit. Code Optim., 2013. 10(2).
[45] Chao Wang, Xi Li, Junneng Zhang, Peng Chen, Yunji Chen, Xuehai Zhou, and Ray C. C. Cheung. Architecture Support for Task Out-of-order Execution in MPSoCs. IEEE Transactions on Computers, 2014. 64(5): p. 1296-1310.
[46] Ian Kuon and Jonathan Rose. Area and delay trade-offs in the circuit and architecture design of FPGAs. in Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays. 2008, ACM: Monterey, California, USA.

BIOGRAPHY
Chao Wang (M'11) received the B.S. and Ph.D. degrees from University of Science and Technology of China, in 2006 and 2011 respectively, both in computer science. He is an Associate Professor in the Embedded System Lab in Suzhou Institute of University of Science and Technology of China, Suzhou, China. His research interests focus on multicore and reconfigurable computing. He serves as an editorial board member for MICPRO, IET CDT, and IJHPSA. He is also the publicity chair of HiPEAC 2015 and ISPA 2014.

Junneng Zhang received the B.S. degree in computer science from University of Science and Technology of China in 2009. He is a Ph.D. student in the Embedded System Lab in Suzhou Institute of University of Science and Technology of China, Suzhou, China. His research interests focus on multiprocessor system-on-chip, reconfigurable computing and operating systems. He is a student member of the IEEE.

Xi Li is a Professor and vice dean in the School of Software Engineering, University of Science and Technology of China. There he directs the research programs in the Embedded System Lab, examining various aspects of embedded systems with a focus on performance, availability, flexibility and energy efficiency. He has led several national key projects of China, several national 863 projects and NSFC projects.

Aili Wang is a lecturer in the School of Software Engineering, University of Science and Technology of China. She serves as a guest editor of Applied Soft Computing and the International Journal of Parallel Programming. Meanwhile she is a reviewer for the International Journal of Electronics. She has published over 10 international journal and conference articles in the areas of software engineering, operating systems, and distributed computing systems.

Xuehai Zhou is the executive dean of the School of Software Engineering, University of Science and Technology of China, and a Professor in the School of Computer Science. He serves as general secretary of the steering committee of computer college fundamental lessons, and of the technical committee of Open Systems, China Computer Federation. His research interests include various aspects of multicore and distributed systems. He has led many national 863 projects and NSFC projects. Prof. Zhou has published over 100 international journal and conference articles in the areas of software engineering, operating systems, and distributed computing systems.
