This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2015.2487346, IEEE Transactions on Parallel and Distributed Systems
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. X, NO. X, XXX 2015
Abstract—Heterogeneous multicore platforms have been widely used in various areas to achieve both power efficiency and high performance. However, they pose significant challenges to researchers in uncovering coarse-grained, task-level parallelism. In order to support automatic parallel task execution, this paper proposes an FPGA implementation of a hardware out-of-order scheduler on a heterogeneous multicore platform. The scheduler is capable of exploring potential inter-task dependencies, leading to a significant acceleration of dependence-aware applications. With the help of a renaming scheme, task dependencies are detected automatically during execution, and task-level Write-After-Write (WAW) and Write-After-Read (WAR) dependencies can then be eliminated dynamically. We extended instruction-level renaming techniques to perform task-level out-of-order execution, and implemented a prototype on a state-of-the-art Xilinx Virtex-5 FPGA device. Given the reconfigurable characteristics of FPGAs, our scheduler supports changing accelerators at runtime to improve flexibility. Experimental results demonstrate that our scheduler is efficient in both performance and resource usage.
1 INTRODUCTION
Task-level parallelism (TLP) has motivated research into simplified parallel programming models [1]. However, a drawback of common task-based models such as Cilk [2], OpenMP [3], MPI, Intel TBB, CUDA and OpenCL [4] is burdening the programmer with the non-trivial assignment of resolving inter-task data dependencies. To resolve this problem, automatic parallelization has been widely researched at different levels, e.g. the programming model [5], the compiler and runtime library [6], and the microarchitecture [7]. Most of the programming models perform well for regular operations (e.g. loop structures [8]); however, the results may be unsatisfying for most irregular operations. At the compiler level, it is significantly challenging to detect dependencies statically. Alternatively, using special architectures to support task-level Out-of-Order (OoO) execution is becoming effective and efficient in performance [9]. However, how to make the architecture flexible enough to suit various systems remains unresolved, especially for heterogeneous systems with different types of processing elements (PEs).
Meanwhile, with diverse types of accelerators integrated, a heterogeneous multicore platform is able to achieve high performance for a large variety of applications. However, the software for such heterogeneous systems can be quite complex due to the management of low-level aspects of the computation. To maintain correctness, the software must decompose an application into parallel tasks and synchronize the tasks automatically. As a consequence, data dependences within an application may seriously affect the overall performance. Generally, data dependences are classified into three categories: read-after-write (RAW), write-after-write (WAW) and write-after-read (WAR) dependences, of which WAW and WAR can be eliminated using renaming techniques.
To tackle this problem, in this paper we propose a hardware scheduler that supports task-level OoO parallel execution. The fundamental system is an FPGA-based platform that contains different types of PEs: one or several general purpose processors (GPPs) and a variety of Intellectual Property (IP) cores. An earlier version of this paper was presented in [10]. We have extended and implemented a demonstration hardware scheduler for heterogeneous platforms to support OoO task execution. The scheduler detects task-level data dependencies and eliminates WAW and WAR dependencies automatically at runtime using a renaming scheme. We claim the following contributions and highlights:
1) This work implements a hardware scheduler on FPGA to support parallel dataflow execution and dynamic accelerator reconfiguration. By eliminating the WAW and WAR dependences among the dataflow, applications obtain as much parallelism as possible. This is especially important for dependence-aware applications in
C. Wang is with the University of Science and Technology of China, Hefei, 230027, Anhui, China. Email: saintwc@mail.ustc.edu.cn.
J. Zhang is with the University of Science and Technology of China, Hefei, 230027, Anhui, China. Email: zjneng@mail.ustc.edu.cn.
X. Li is with the University of Science and Technology of China, Hefei, 230027, Anhui, China. Email: llxx@ustc.edu.cn.
A. Wang is with the University of Science and Technology of China, Suzhou, 215123, Jiangsu, China. Email: wangal@ustc.edu.cn.
X. Zhou is with the University of Science and Technology of China, Hefei, 230027, Anhui, China. Email: xhzhou@ustc.edu.cn.
Manuscript received XXX XX. 2015.
For information on obtaining reprints of this article, please send email to: tpds@computer.org, and reference IEEE CS Log Number TPDS01970706.
Digital Object Identifier no. 10.1109/TPDS.2007.1068.
1045-9219 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
CHAO WANG ET AL.: HARDWARE IMPLEMENTATION ON FPGA FOR TASK-LEVEL PARALLEL DATAFLOW EXECUTION ENGINE 3
TABLE 1
SUMMARY FOR STATE-OF-THE-ART PARALLEL EXECUTION ENGINES ON FPGA

Category | Representative work | Strengths | Limitations
General parallel programming models | OpenMP [3], Intel TBB, Ct, CnC, MapReduce, OpenCL [34], Cilk [35] | General model for CMP processors | Bring burden to programmers
Out-of-order programming models | StarSs [25], CellSs [36], Oscar [37] | Support automatic OoO execution | Applied only to Cell BE architecture and SMP servers
Task-level out-of-order scheduling | CEDAR [38], MLCA [16], Multiscalar [17], Trace [18], IBM CELL [19], RAW [20], Hydra [39], WaveScalar [23], TRIPS [22] | Perform well for traditional superscalar machines; run tasks in parallel on different processor cores | OoO not supported; programmers handle task data dependencies manually
Task-level out-of-order scheduling | Task Superscalar [40], Yoga [24] | Support OoO automatic parallel execution | Not applicable to FPGA with reconfiguration
Dataflow-based OoO execution model | DSP [41] | With register renaming technologies of Tomasulo | Limited to DSP architectures
Dataflow-based OoO execution model | OoOJava [42], Dataflow [1] | An OoO compiler for Java runtime with determinate parallel execution | Not applicable to hardware execution engines
gramming models and task level scheduling mechanisms.

3 HIGH LEVEL VIEW
3.1 Problem Description
It has been widely acknowledged that traditional programming models for TLP, such as OpenMP and MPI, perform well for regular programs, especially loop-based codes. However, the degree of parallelism of irregular programs under OpenMP and MPI depends on the programmer, which places too much burden on programmers, and the performance may be dramatically unsatisfying. Consequently, the motivation of this paper is to pursue an automatic OoO task parallel execution method, especially for heterogeneous systems.
In order to compare the differences among three programming models — the serial model, OpenMP [3] and CellSs [5] — we illustrate an example code snippet for each.

(a) Serial Code:    (b) OpenMP Code:        (c) CellSs:
Task_1(b,a);        #pragma omp section     #pragma ..
...                 Task_1(b,c);            #pragma ..
Task_2(b,c);        #pragma omp section     Task_1(b,a);
                    Task_2(e,c);            Task_2(b,c);

Fig. 1 Sample code snippets of the different programming models.

Fig. 1 illustrates three code segments with the same functionality. Task 1 takes a as input parameter and writes results to b, while task 2 takes c as input and also writes results to b. Fig. 1 (a) presents the normal serialized code, in which task 2 cannot be issued until task 1 is done. Assume that task 1 and task 2 can run in parallel using different computing resources. Fig. 1 (b) shows the sample code in the OpenMP pattern. Because both task 1 and task 2 write b, the programmer must rename b of task 2 manually to support parallel execution. Consequently, all tasks after task 2 that use b also need to rename their parameter b to c, which greatly limits the programmability. If b is to be written to non-volatile storage devices (e.g. disk), then task 1 and task 2 cannot be executed in parallel by OpenMP. In order to release programmers from such burden, CellSs was proposed, as illustrated in Fig. 1 (c). The first two annotated lines of Fig. 1 (c) inform the compiler that task 1 and task 2 can be executed simultaneously, enabling task 1 and task 2 to be sent to different computing resources for parallel execution. The programmers are only required to add simple annotations to the tasks, and then the parallelization is automated by compiler and hardware support.
As mentioned above, both CellSs and Task Superscalar use a renaming scheme to detect and eliminate WAW and WAR dependencies. However, they are both designed for CMP architectures, in which all processors are able to run all kinds of tasks. This characteristic brings higher flexibility and scalability, but also causes significant difficulties for task-level OoO execution on heterogeneous systems. Besides, Task Superscalar detects inter-task dependencies at a fine-grained level, which makes it too sophisticated for resource-constrained systems. In order to make task-level parallelization efficient for heterogeneous systems, in this paper we propose a variable-based OoO framework for heterogeneous systems.

3.2 Definition of Tasks and Task-level Dependencies
Throughout this paper, tasks refer to the execution of actors, and actors can be GPPs or IP cores. Each task has a set of input parameters and a set of output parameters, and is defined as task_name({the set of output parameters}, {the set of input parameters}).
Exploiting task-level concurrency is especially important in a heterogeneous environment due to the requirement to match the execution requirements of different
4 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. X, NO. X, XXX 2015
parts in the program with computational capabilities of composed of interconnections and processing elements
the different platforms. Some tasks may require special- (PEs). The PEs includes GPPs and IP cores, and each IP
purpose hardware, either because the hardware can exe- core can do only a specific type of task.
cute that task more efficiently or because the hardware
has some unique functionality that the task requires. Con-
sequently the motivation of this paper is to pursue the
4. PROGRAMMING MODEL LAYER
automatic OoO task parallel execution method especially In this section, we first introduce the programming
for heterogeneous systems. model layer, which is able to facilitate software develop-
ers for high level programming. The programming model
TABLE II uses annotations to define tasks. In our framework, com-
DEFINITIONS OF TASK-LEVEL DEPENDENCIES puting resources are divided into two categories: hard-
ware computing resources and software computing re-
sources. Each type hardware computing resource can
Type Definition Example only do a specific kind of task, while software computing
When an issuing task reads a pa- task_1({b}{a}); resources have the capability to do all kinds of tasks. In
rameter in the output set of an is- task_2({c},{b}); our framework, the interfaces to call hardware computing
RAW resources are regulated in runtime libraries. To use soft-
sued task, a RAW dependency oc-
curs. ware computing resources, programmers need to define
When an issuing task writes a pa- task_1({c},{a}); the tasks in prior to the hardware execution.
rameter in the output set of an is- task_2({c},{b}); The programming model layer makes all the details of
WAW the platform transparent to the programmers. In order to
sued task, a WAW dependency
occurs. simplify the programming, programmers only have a
When an issuing task writes a pa- task_1({b},{a}); sequential view of the systems, in other words, the pro-
WAR rameter in the input set of an issued task_2({a},{c}); grammers do not need to know what the underlying plat-
task, a WAR dependency occurs. form is composed of or how tasks execute in parallel.
In order to reduce the burden of the scheduler, tasks
are divided into two categories: those that pass through
In Table II we define three types of task-level depend-
the scheduler and those that execute on the local GPP.
encies. The bold parameters indicate the parameters those
Thus the program is divided into two parts: kernel part
lead to data dependencies. Tasks have RAW dependency
and non-kernel part. A kernel is defined as a sequence of
should always run in order, meanwhile, WAW and WAR
tasks that all pass through the scheduler while the non-
dependencies can be eliminated using renaming scheme. kernel part is defined as a sequence of tasks that all run
locally.
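The rules of Table II can be written as a small predicate over the output set (wr) and input set (rd) of an already issued task and a later issuing task. The Python helper below is purely illustrative; the set-based task encoding is our own, not part of the hardware interface:

```python
# Illustrative check of the Table II dependency rules between two tasks.
# Each task is modeled as (wr_set, rd_set); `issued` runs first,
# `issuing` is the newcomer examined against it.

def classify(issued, issuing):
    issued_wr, issued_rd = issued
    issuing_wr, issuing_rd = issuing
    deps = set()
    if issuing_rd & issued_wr:
        deps.add("RAW")   # issuing task reads an output of the issued task
    if issuing_wr & issued_wr:
        deps.add("WAW")   # issuing task writes an output of the issued task
    if issuing_wr & issued_rd:
        deps.add("WAR")   # issuing task overwrites an input of the issued task
    return deps

# The three examples from Table II, written as ({wr}, {rd}):
print(classify(({"b"}, {"a"}), ({"c"}, {"b"})))   # RAW on b
print(classify(({"c"}, {"a"}), ({"c"}, {"b"})))   # WAW on c
print(classify(({"b"}, {"a"}), ({"a"}, {"c"})))   # WAR on a
```

Only the RAW case forces ordered execution; the other two are candidates for renaming.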
kernel parts are sent to the scheduler for OoO execution. Fig. 3 illustrates a sample code snippet in our programming model. The code is divided into two segments: the declaration segment and the normal segment. The declaration segment (Line 2 - Line 6) contains the header file inclusion (Line 2) and the task definitions (Line 3 - Line 6). All interfaces for calling hardware computing resources are included in Hardware.h. When hardware computing resources are designed, the corresponding interfaces should be added to Hardware.h. Tasks defined by programmers start with annotations, as Line 3 shows. Each annotation includes three aspects of a parameter: direction, type and size. The direction can be chosen from input, output, and inout, which are detailed as follows:
Input: The parameter will be read by the task, and when the task is completed, the parameter will not be updated.
Output: The task starts without considering the value of the parameter, and when the task is completed, the parameter will be updated.
Inout: The task will use the data stored in the parameter, and when the task is finished, the parameter will be updated.
Line 4 and Line 6 are normal task definitions written in the C language. Line 8 to Line 15 present sample code of the main function. Hd_idct and Hd_aes are tasks defined in Hardware.h, while Idct and Aes are defined by the programmer.
shapes of edges stand for different types of dependencies; the direction of an edge is from the task that brings about the dependency to the task that is affected. In particular, inter-task dependencies are divided into three categories as follows:
RAW: The solid edges indicate RAW dependencies. All solid edges compose a normal Directed Acyclic Graph (DAG). When an issuing task reads a variable in the wr set of an issued task, a RAW dependency occurs; for example, task 4 takes variable D as input while task 2 writes D, so there is a solid edge from task 2 to task 4.
WAW: The dashed edges indicate WAW dependencies. When an issuing task writes a variable in the wr set of an issued task, a WAW dependency occurs. Task 4 and task 1 both write variable B, so that there is a dashed edge.
WAR: The dotted edges indicate WAR dependencies. When an issuing task writes a variable in the rd set of an issued task, there is a WAR dependency. Task 3 writes variable A while task 1 reads A, therefore a dotted edge goes from task 3 to task 1.
It must be noted that tasks that have a RAW dependency should always run in order. Meanwhile, WAW and WAR dependencies can be eliminated using special techniques. Fig. 5 illustrates two different execution models for the tasks in Fig. 4(b). The bottom refers to the mode used in [1], where WAW and WAR dependencies are detected but not eliminated, while the top stands for the ideal OoO mode presented in our scheduler, in which only RAW dependencies affect the parallelism. It can be obviously derived that the OoO mode will gain great performance improvement (the time between FT1 and FT2). In this paper, we propose a scheduling algorithm that automatically eliminates task-level WAW and WAR dependencies, especially for heterogeneous platforms.
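The "special techniques" for removing WAW and WAR conflicts are variants of register renaming lifted to variables. A minimal Python sketch of the idea follows (illustrative only; the versioned-name encoding is our own, not the scheduler's actual implementation):

```python
# Illustrative sketch of eliminating WAW/WAR by renaming: every write creates
# a fresh version of the variable, so only true RAW dependencies remain.
from itertools import count

def rename(tasks):
    version = {}                 # variable -> id of its latest version
    fresh = count()
    out = []
    for name, writes, reads in tasks:
        # inputs bind to the latest existing version (the true RAW dependency)
        rd = [f"{v}.{version[v]}" if v in version else v for v in reads]
        # each output gets a brand-new version, removing WAW/WAR conflicts
        wr = []
        for v in writes:
            version[v] = next(fresh)
            wr.append(f"{v}.{version[v]}")
        out.append((name, wr, rd))
    return out

# The Fig. 1 example: both tasks write b (a WAW conflict), but after renaming
# they touch distinct versions b.0 and b.1 and may run in parallel.
tasks = [("task_1", ["b"], ["a"]), ("task_2", ["b"], ["c"])]
print(rename(tasks))
```

After renaming, a scheduler need only honour the remaining RAW edges, which is exactly the ideal OoO mode at the top of Fig. 5.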
Tomasulo algorithm in hardware, which extends the instruction-level Tomasulo algorithm to task level for parallel execution. The MP-Tomasulo algorithm dynamically detects and eliminates inter-task WAW and WAR dependencies, and thus speeds up the execution of the whole program. With the help of the MP-Tomasulo algorithm, programmers need not take care of inter-task data dependencies.

5.1 Hardware Scheduler Structure
Fig. 6 High level structures of the MP-Tomasulo module. The part surrounded by the dashed line is named the MP-Tomasulo module, which is the critical part of the framework.
At the instruction level, each instruction has one destination operand and at most two source operands. However, a task may have more than one output parameter, or more than two input parameters. We extend the traditional Tomasulo algorithm to support multiple outputs and more than two inputs. The MP-Tomasulo algorithm is designed as a hardware module, so the maximum numbers of outputs and inputs are limited to certain values. Assume that wr_set refers to the output set and rd_set represents the input set, while num_wr_set and num_rd_set stand for the maximum numbers of outputs and inputs, respectively. Fig. 6 illustrates the MP-Tomasulo module, which is composed of the following components:
Task Issuing Queue: In our prototype, each task has num_wr_set outputs and num_rd_set inputs, so each task in the Task Issuing Queue can be formally represented as <Type; wr_set[num_wr_set]; rd_set[num_rd_set]>, in which Type identifies the type of the task, while wr_set and rd_set stand for the set of output parameters and the set of input parameters, respectively.
Mapper: The Mapper sends tasks to different RS Tables according to the Type of each tuple stored in the Task Issuing Queue. The Mapper works as a black box, and therefore system designers can design alternative mapping algorithms. If the RS Tables do not have enough entries to store the task, the system is blocked until some RS Table entry is free.
RS Table: The RS Table refers to the reservation station table, which contains an implicit dependency graph of the tasks and uses an automatic renaming scheme to eliminate WAW and WAR dependencies. In order to send different tasks simultaneously, each PE has a private RS Table. Once all the input parameters of a task are prepared, the task is offloaded to the associated PE for execution. Each entry of the RS Table is a tuple of <ID; Busy; V[num_rd_set]; Q[num_rd_set]>, in which ID is used to identify different entries, Busy indicates whether the entry is in use or not, V is used to temporarily store the input parameters of a task, and Q indicates whether the value in V is valid.
Task Receiving Queue: The Task Receiving Queue buffers the finished tasks.
VS Table: The VS Table is regarded as a register status table that holds the status of the variables used in applications. Each variable is mapped to an entry of the VS Table. Each entry of the VS Table is a tuple of <ID; PE_ID; Value>. ID stands for the primary unique key. PE_ID represents the PE that will produce the latest value of the variable. Value stores the actual value.
The MP-Tomasulo algorithm is divided into three stages as follows:
Issue: The head task of the Task Issuing Queue is in the Issue stage if an RS Table entry is available.
Execute: If all the input parameters of a task are prepared, the task can be distributed to the associated PE for execution immediately. Otherwise the RS Table builds an implicit data dependency graph indicating which task will produce the needed parameters. Once all the input parameters of a task are ready, the task is spawned to the corresponding PE.
Write Results: When a PE completes a task, it sends the results back to the Task Receiving Queue of the MP-Tomasulo module and updates the RS Table. For each task in the RS Table, if its input parameters are produced by the completed task, the associated RS Table entry is updated with the results.
Compared to the software scheduler presented in [44], we remove the reorder buffer (ROB) in this hardware implementation, for the following reasons:
1) The ROB structure is designed for speculation. In fact, because the scheduler only deals with task sequences that should always execute, there is no need for speculation.
2) The ROB structure would contribute significantly to the hardware utilization. Therefore we save the ROB area, considering the area and cost of the FPGA chip.
Furthermore, in order to make the scheduler flexible, the sizes of the VS Table and RS Table and the number of PEs can be configured according to the characteristics of the applications. These configurations can be changed when the system starts and remain fixed during execution. To make efficient use of the on-chip resources, the IP cores are allowed to be dynamically reconfigured at runtime.

5.2 Adaptive Task Mapping
The scheduling module decides when a task can be executed according to its data dependencies; furthermore, when a task is ready, only one target function unit is selected from multiple options. The task mapping scheme should decide the target for each task, as is described in Algorithm 1.
ICtr. A PE with a smaller PE_ID has a higher interrupt priority. In order to get better overall performance, IP cores always have higher interrupt priority than GPPs, meaning that when requests occur simultaneously, IP interrupt requests (IPR) are always handled prior to GPP interrupt requests (GPPR). When all interrupt requests pass through the ICtr, only one output wire is set to 1, indicating that the associated interrupt has the highest priority and its results can be received while the others remain pending. Interrupt nesting is not supported, because it would need extra hardware resources to store the interrupt stack. An entry of the MT or the OU is released upon completion so that the DM can receive sub-flows.
Fig. 12. The MT_full is set to 1 if none of the PEs that can execute the sub-flow has a free entry.
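The fixed-priority selection performed by the ICtr can be modelled as below (an illustrative software model under our own request encoding; the real ICtr is combinational logic with one output wire per request):

```python
# Model of the ICtr arbitration: IP-core requests always beat GPP requests,
# and within each class the PE with the smaller PE_ID wins.

def arbitrate(requests):
    # requests: list of (pe_id, is_ip_core) pairs with a pending interrupt
    if not requests:
        return None
    # sort key: IP cores first, then ascending PE_ID
    pe_id, _ = min(requests, key=lambda r: (not r[1], r[0]))
    return pe_id

# GPP 0 and IP cores 3 and 5 raise requests at once: IP core 3 is served
# first, the others stay pending.
print(arbitrate([(0, False), (3, True), (5, True)]))
```

Because exactly one winner is selected per round and nesting is disallowed, the pending requests are simply re-arbitrated after the current handler returns.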
the upper left part is a MicroBlaze processor that is responsible for issuing tasks and receiving data; besides, it is also used as a computing resource. (2) The upper right part is a MicroBlaze processor used only for computing. (3) The lower middle part contains two IP cores used as accelerators, which accelerate the execution of the CC/DCT/Quant stages of the JPEG application. The upper middle part is the hardware scheduler, which detects RAW/WAW/WAR dependencies and eliminates WAW/WAR dependencies using the renaming scheme. The lower left part comprises peripherals including the timer, the interrupt controller (intc), etc.
WAW/WAR dependencies, and achieves more than 95% of the ideal speedup.
Tomasulo and Scoreboarding have similar performance, as the data dependency cannot be eliminated; the speedups are 0.95X and 0.98X respectively.

7.3 Comparison on Regular Tasks
We measured the speedups of typical task sequences by partitioning the test tasks in [2]. In this test case, we select a regular test with 11 different random tasks. The sample task sequence is presented in Table III.

TABLE III
SAMPLE TASK SEQUENCE

Task number | Task type | Source variables | Destination variable
T1 | JPEG | a | c
T2 | IDCT | b | a
T3 | JPEG | J | i
T4 | AES_ENC | d,e | F
T5 | AES_ENC | h,e | d
T6 | DES_DEC | E,e | g
T7 | JPEG | a | c
T8 | DES_DEC | H,e | g
T9 | DES_DEC | H,e | a
T10 | JPEG | f | e
T11 | IDCT | b | a

overall speedup is significantly larger than Scoreboarding.

7.4 Discussion
In this section we use the timing diagram to show the inter-task data dependencies and how they are resolved.
scheduling methods, all the tasks have to be issued sequentially. As is clearly described in Figure 19(b), all the tasks are finished at 450k cycles, which means the Tomasulo algorithm can efficiently reduce the execution time of the task sequence by about 33%.
By investigating the timing diagram, we gain deep insight into how tasks interact with each other and how the data dependences are maintained in our scheme. Furthermore, we can verify the correctness of the simulation results by comparing the timing diagram with the task execution flow on a prototype system. We have studied the simulation results in a number of cases. The simulation results are exactly consistent with the actual results obtained from the prototype systems. To this end, we confirm that our out-of-order scheduling scheme can correctly schedule tasks with various dependences to exploit parallelism.

[Figure: Gap between theoretical execution time and actual execution time.]

at 32 bits. In order to improve the task-level parallelism, we use 8 MTs in total, each of which contains 4 entries of 334-bit width.
Table V illustrates the resource consumption of our hardware scheduler and the MicroBlaze processor. Our scheduler takes about the same number of Look-Up Tables (LUTs) as a MicroBlaze, while using significantly fewer registers. Most of the LUTs in a MicroBlaze are used as logic, while about one third are used as RAM in our scheduler. Furthermore, to calculate the area cost, we use the area utilization model presented by Jonathan Rose in [46]. Based on the LUT consumption, the area cost of the hardware scheduler is similar to that of the MicroBlaze processor.

TABLE V
RESOURCES USAGE

                    | Hardware Scheduler | MicroBlaze
Registers           | 295 (0.4%)         | 1445 (2.1%)
LUTs used as Logic  | 1158 (1.7%)        | 1434 (2.1%)
dependences can be dynamically eliminated so as to greatly accelerate the execution of applications on heterogeneous platforms. Experimental results demonstrate that our framework is efficient and that the average performance achieves 95% of the theoretical speedup. Furthermore, the experimental results also show that our design is efficient in both performance and resource usage: the performance is 500X faster than the software scheduler, while the resource usage is no more than that of a MicroBlaze. Besides, our hardware scheduler supports dynamic reconfiguration of the FPGA platform by changing only a small part of the logic.

Although the experimental results are promising, there are a few future directions worth pursuing. First, improved task partitioning and further adaptive mapping schemes are essential to support automatic task-level parallelization. Second, we also plan to study the out-of-order task execution paradigm to explore the potential parallelism in data-intensive applications.

ACKNOWLEDGMENT

This work was supported by the National Science Foundation of China under grants No. 61379040, No. 61272131, No. 61202053, No. 61222204, and No. 61221062; the Jiangsu Provincial Natural Science Foundation (No. SBK201240198); the Natural Science Foundation of Fujian Province of China (No. 2014J05075); and the Open Project of the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences. The authors sincerely thank the reviewers for their insightful comments and suggestions.

REFERENCES

[1] Gagan Gupta and Gurindar S. Sohi. Dataflow execution of sequential imperative programs on multicore architectures. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, Porto Alegre, Brazil, 2011, pp. 59-70.
[2] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995, pp. 207-216.
[3] OpenMP Architecture Review Board. OpenMP C and C++ application program interface, version 1.0, 1998; http://www.openmp.org.
[4] OpenCL. http://www.khronos.org/opencl/.
[5] Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta. CellSs: a programming model for the Cell BE architecture. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, Tampa, Florida, 2006, p. 86.
[6] Chao Wang, Junneng Zhang, Xuehai Zhou, Xiaojing Feng, and Xiaoning Nie. SOMP: Service-Oriented Multi Processors. In Proceedings of the 2011 IEEE International Conference on Services Computing, 2011, pp. 709-716.
[7] Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, and Mateo Valero. Task Superscalar: An Out-of-Order Task Pipeline. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010, pp. 89-100.
[8] Yongjoo Kim, Jongeun Lee, Toan X. Mai, and Yunheung Paek. Improving performance of nested loops on reconfigurable array processors. ACM Transactions on Architecture and Code Optimization (TACO), 2012, 8(4), pp. 1-23.
[9] Martin Thuresson, Magnus Själander, Magnus Björk, Lars Svensson, Per Larsson-Edefors, and Per Stenström. FlexCore: Utilizing Exposed Datapath Control for Efficient Computing. Kluwer Academic Publishers, 2009, pp. 5-19.
[10] Junneng Zhang, Chao Wang, Xi Li, and Xuehai Zhou. FPGA implementation of a scheduler supporting parallel dataflow execution. In IEEE International Symposium on Circuits and Systems, 2013, pp. 1216-1219.
[11] R. M. Tomasulo. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, 1967, 11(1), pp. 25-33.
[12] J. Wawrzynek, D. Patterson, M. Oskin, Shin-Lien Lu, C. Kozyrakis, J. C. Hoe, D. Chiou, and K. Asanovic. RAMP: Research Accelerator for Multiple Processors. IEEE Micro, 2007, pp. 46-57.
[13] T. Givargis and F. Vahid. Platune: a tuning framework for system-on-a-chip platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2002, 21(11), pp. 1317-1327.
[14] G. Kuzmanov, G. Gaydadjiev, and S. Vassiliadis. The MOLEN processor prototype. In 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2004), 2004, pp. 296-299.
[15] D. Gohringer, M. Hubner, V. Schatz, and J. Becker. Runtime adaptive multi-processor system-on-chip: RAMPSoC. In IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1-7.
[16] Faraydon Karim, Alain Mellan, Anh Nguyen, Utku Aydonat, and Tarek Abdelrahman. A Multilevel Computing Architecture for Embedded Multimedia Applications. IEEE Computer Society Press, 2004, pp. 56-66.
[17] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In International Symposium on Computer Architecture, 1995, pp. 414-425.
[18] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith. Trace processors. In International Symposium on Microarchitecture, 1997, p. 138.
[19] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 2005, 49.
[20] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, Jae-Wook Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro, 2002, 22(2), pp. 25-35.
[21] L. Hammond, B. A. Hubbert, M. Siu, M. K. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, 2000, 20(2), pp. 71-84.
[22] K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. S. Govindan, P. Gratz, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, S. W. Keckler, and D. Burger. Distributed microarchitectural protocols in the TRIPS prototype processor. In International Symposium on Microarchitecture, 2006, pp. 480-491.
[23] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In International Symposium on Microarchitecture, 2003, p. 291.
[24] C. Villavieja, J. A. Joao, R. Miftakhutdinov, and Y. N. Patt. Yoga: A hybrid dynamic VLIW/OoO processor. High Performance Systems Group, Department of Electrical and Computer Engineering, The University of Texas at Austin, 2014.
[25] Judit Planas, Rosa M. Badia, Eduard Ayguade, and Jesus Labarta. Hierarchical Task-Based Programming With StarSs. International Journal of High Performance Computing Applications, 2009, 23(3), pp. 284-299.
[26] Gregory F. Diamos and Sudhakar Yalamanchili. Harmony: an execution model and runtime for heterogeneous many core systems. In Proceedings of the 17th International Symposium on High Performance Distributed Computing, Boston, MA, USA, 2008, pp. 197-200.
[27] Enno Lubbers and Marco Platzner. ReconOS: Multithreaded programming for reconfigurable computers. ACM Transactions on Embedded Computing Systems, 2009, 9(1), pp. 1-33.
[28] W. Peck, E. Anderson, J. Agron, J. Stevens, F. Baijot, and D. Andrews. Hthreads: A Computational Model for Reconfigurable Devices. In International Conference on Field Programmable Logic and Applications (FPL '06), Madrid, Spain, 2006, pp. 1-4.
[29] Dirk Koch, Christian Beckhoff, and Juergen Teich. A communication architecture for complex runtime reconfigurable systems and its implementation on Spartan-3 FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, California, USA, 2009, pp. 253-256.
[30] Kyle Rupnow, Keith D. Underwood, and Katherine Compton. Scientific Application Demands on a Reconfigurable Functional Unit Interface. ACM Transactions on Reconfigurable Technology and Systems, 2011, 4(2), pp. 1-30.
[31] S. Balakrishnan and G. Sohi. Program demultiplexing: Dataflow based speculative parallelization of methods in sequential programs. In 33rd International Symposium on Computer Architecture, 2006, pp. 302-313.