
Communication-time Hardware Accelerator for Data Sort and Priority Management

Abstract: The paper is dedicated to hardware accelerators for data sort and priority management in real-time applications based on FPGAs. Two basic components, the processing unit running software and the programmable logic executing time-critical operations, interact through on-chip interfaces with predefined bandwidth. The main distinctive feature of the proposed technique is that data processing is combined with interactions between the basic components in such a way that operations in hardware are executed in real time during data exchange through communication circuits. The results of implementation and comparison are given for embedded soft processing units (such as MicroBlaze) and an embedded high-performance processing system (such as the ARM Cortex-A9 in Xilinx all programmable systems-on-chip from the Zynq-7000 family).
Keywords: real-time system; parallel processing; sorting networks; FPGA; hardware/software co-design

I. INTRODUCTION
Distributed embedded real-time systems are widely exploited in different areas (such as defense, industry, aerospace, home automation and so on) with their increasing capabilities, unique characteristics, and resource constraints [1]. Sorting is a frequently required procedure, and for real-time systems an increase in throughput is important [2]. To better satisfy performance requirements, fast accelerators based on different design techniques have been investigated. Many such techniques can be combined and efficiently implemented in FPGAs and in all programmable systems-on-chip (APSoC) devices, such as Zynq-7000 [3], which contains a multi-core processing unit (PU), namely the dual-core ARM Cortex™-A9 MPCore™, and programmable logic (PL) on the same microchip. Interactions between the PU and PL are supported by different interfaces and other signals through over 3,000 on-chip connections [3]. The four available 32/64-bit high-performance (HP) Advanced eXtensible Interfaces (AXI) and the 64-bit AXI Accelerator Coherency Port (ACP) enable fast data exchange with the theoretical bandwidths shown in [3]. For many practical applications (especially in the scope of embedded systems) sorting is applied to relatively small sets of data items, which may rapidly be transferred from the PU to the PL and vice versa through general-purpose ports [4]. A similar technique is efficient for priority management, such as that described in [5].

Fig. 1 demonstrates the technique which is proposed and will be discussed in the paper. We have studied two implementations, which are shown in Fig. 1,a and in Fig. 1,b. The first one is applicable to an FPGA that contains a microprocessor, such as the soft MicroBlaze core [6], interacting with programmable logic. The second one can be used in Zynq-7000 devices [3] for accelerating ARM-based applications using highly parallel networks. In both cases we use the Xilinx AXI_GPIO intellectual property (IP) core [7], which supports interactions between either the MicroBlaze soft core (see Fig. 1,a) or the ARM Cortex-A9 (see Fig. 1,b) and programmable logic.
Fig. 1. Two basic implementations that are studied in the paper: a) MicroBlaze processor, general-purpose ports, network-based hardware accelerator; b) ARM Cortex A9, general-purpose ports, network-based hardware accelerator

The majority of known hardware accelerators for data sorters use Batcher even-odd and bitonic mergers [8,9], which are the fastest because they have the lowest latency L(N), measured by the number of levels of basic network elements through which signals propagate from the inputs to the outputs. Such elements are comparators/swappers for data items. Let $p = \log_2 N$, where N is the number of K-bit data items that have to be sorted. It is known that the latency L(N) for both mergers referenced above [8,9] is equal to $p(p+1)/2$. The cost C(N) (the number of comparators/swappers) for the even-odd merger is smaller than for the bitonic merger and is equal to $(p^2 - p + 4) \cdot 2^{p-2} - 1$.
A review of recent results in hardware accelerators for data sort can be found in [10], which demonstrates that the resources of modern reconfigurable microchips only allow circuits to be constructed that can handle a very limited number of items because of the large values of C(N). We suggest here circuits that can handle a significantly larger number N of data items and, in addition, sort items in parallel with communication between the PU and PL.
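As a quick numeric check of the latency and cost expressions quoted above for the even-odd merge network, the following minimal sketch evaluates both for several values of N. The function names are illustrative and do not come from the paper; N is assumed to be a power of two.

```python
def _log2_exact(n: int) -> int:
    """Return p such that n == 2**p; reject values that are not powers of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("N must be a power of two")
    return n.bit_length() - 1


def even_odd_merge_latency(n: int) -> int:
    """L(N) = p(p+1)/2 levels of comparators/swappers, with p = log2(N)."""
    p = _log2_exact(n)
    return p * (p + 1) // 2


def even_odd_merge_cost(n: int) -> int:
    """C(N) = (p^2 - p + 4) * 2^(p-2) - 1 comparators/swappers."""
    p = _log2_exact(n)
    # 2**p // 4 equals 2**(p-2) and keeps the arithmetic in integers for small p.
    return (p * p - p + 4) * 2**p // 4 - 1


if __name__ == "__main__":
    for n in (64, 128, 256, 512, 1024):
        print(n, even_odd_merge_latency(n), even_odd_merge_cost(n))
```

For N=64 (p=6) the expressions give a latency of 21 levels and a cost of 543 comparators/swappers.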
The remainder of this paper contains four sections. Section II suggests fast sorting networks for single-port data transfers. It demonstrates that as soon as all data are received the results of sorting are ready, and this is the main distinctive feature of the proposed methods. Section III describes a technique for multiple ports. Section IV presents the results of experiments and comparisons. It explicitly indicates the scope of the proposed methods. The conclusion is given in Section V.

II. REAL-TIME SORTING NETWORK FOR SINGLE-PORT DATA TRANSFERS

The proposed network, which is based on the circuit for discovering the minimum and maximum values from [11], is shown in Fig. 2,a. It contains N K-bit registers R0,…,RN-1 and N-1 comparators/swappers. For the sake of simplicity N is set to 8. Clearly, other values may also be chosen.

Fig. 2. Real-time accumulator/sorter (a), an example (b), iterations for data acquisition (c). Panel a: K-bit input, multiplexer M (communicates either the K-bit input vector or max), N K-bit registers R0,…,RN-1, comparators/swappers a,…,g, K-bit output, clock. Panel b: input (unsorted) data, register contents in clock cycles c1,…,c8, output (sorted) data. Panel c: swaps performed in clock cycles c1,…,c5.

At the initialization step, all the registers R0,…,RN-1 are set to the minimum possible value for data items. For the sake of clarity this value is assumed to be 0. Any other value may also be chosen. Data items are received sequentially from interfacing circuits through the multiplexer M. Since all the registers are set to the minimum value, all input items with non-minimum values will be moved up and accommodated in the registers R0,…,RN-1. Fig. 2,b demonstrates how N=8 K-bit items are accommodated using an example with items arriving in the following sequence: 1) 55; 2) 99; 3) 21; 4) 14; 5) 77; 6) 16; 7) 14; 8) 17. Fig. 2,c shows which comparators/swappers (marked with symbols a, b, c, d, e, f, g) swap inputs in each clock cycle c1, c2, …. The circuit in Fig. 2,a composed of the comparators/swappers is combinational, and all the comparators/swappers a,…,g operate in parallel, handling input data from the registers R0,…,RN-1. Outputs of the circuit composed of the comparators/swappers are written to the registers R0,…,RN-1 through feedback connections; only the bottom output (marked as K-bit output) does not have a feedback. Note that data may be received from the PU and accommodated in the registers R0,…,RN-1 during communication time in N clock cycles, indicated in Fig. 2,b by symbols c1,…,c8 (N=8). As soon as N data items are received, the sorted result can be transferred immediately to the PU as shown in Fig. 3,a.

Fig. 3,b demonstrates data propagation through the network of comparators/swappers for the first three clock cycles. It is important to note that after the clock cycle c2 all data items have already been positioned in sorted order, which is easily visible (the upper value 16 is the smallest and the bottom value 99 is the largest). This is because the proposed network always moves the maximum value m=max to the upper positions. Thus, sorting is completed almost immediately after all input data have been received from a single port by the circuit. Consequently, more than one sorted item can be transferred to the PU.
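The acquisition and read-out phases described in this section can be modeled in software. The sketch below is a behavioral model only, not the hardware description used in the paper. The comparator wiring for N=8 (pairs a..d on neighboring registers, e and f on R1/R3 and R5/R7, g on R3/R7, with the larger value moved to the upper register) is inferred from the register values shown in the figures for the example sequence; extending the same pattern to other powers of two is an assumption, and all function names are hypothetical.

```python
def comparator_levels(n):
    """Comparator/swapper index pairs, level by level, for n = 2**p registers.

    For n = 8 this yields a..d = (0,1),(2,3),(4,5),(6,7); e,f = (1,3),(5,7);
    g = (3,7): n-1 comparators arranged in p combinational levels.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    levels, step = [], 1
    while step < n:
        levels.append([(i - step, i) for i in range(2 * step - 1, n, 2 * step)])
        step *= 2
    return levels


def clock_cycle(regs, item, levels):
    """One clock cycle: M feeds `item` into the bottom register (which has no
    feedback), then every comparator moves the larger value up, the smaller down."""
    regs = regs[:-1] + [item]
    for level in levels:
        for up, down in level:
            if regs[up] < regs[down]:
                regs[up], regs[down] = regs[down], regs[up]
    return regs


def stream_sort(items, k_bits=32):
    """Acquire len(items) values in as many cycles, then read them out in order."""
    n = len(items)
    levels = comparator_levels(n)
    regs = [0] * n                      # initialization: minimum value
    for item in items:                  # acquisition during communication time
        regs = clock_cycle(regs, item, levels)
    after_acquisition = list(regs)
    maximum = (1 << k_bits) - 1         # value m fed by M during read-out
    out = [regs[-1]]                    # bottom register holds the smallest item
    for _ in range(n - 1):
        regs = clock_cycle(regs, maximum, levels)
        out.append(regs[-1])
    return after_acquisition, out


if __name__ == "__main__":
    state, result = stream_sort([55, 99, 21, 14, 77, 16, 14, 17])
    print(state)   # register contents R0..R7 after clock cycle c8
    print(result)  # items in the order they leave the bottom register
```

For the example sequence from the text, the model leaves 16, 14, 21, 17, 77, 55, 99, 14 in R0,…,R7 after c8 and delivers 14, 14, 16, 17, 21, 55, 77, 99 from the bottom register. Each pass places the smallest register value at the bottom, which is why the first sorted item is available as soon as the last input item has been accepted.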
III. REAL-TIME SORTING NETWORK FOR DATA TRANSFERS THROUGH MULTIPLE PORTS

The proposed circuits may also be used for multiple ports. For example, in Zynq-7000 devices there are five high-performance ports between the PU and PL (four HP AXI ports, each of which can be configured for 32- or 64-bit data transfers, and the 64-bit AXI ACP). The theoretical bandwidth for read/write operations through one port is 1200 MB/s [3]. Thus, transferring data through multiple ports permits additional acceleration to be achieved.
Fig. 4 depicts the proposed network for Q ports. The circuit is decomposed into Q autonomous sub-circuits (segments) of equal size (i.e. up to N/Q items can be acquired by each sub-circuit). The network in Fig. 4 assumes Q=4. Clearly, other values may be chosen.

Fig. 3. Transmitting sorted data items (a), an example for the first three clock cycles (b). Panel a: register contents in clock cycles c1,…,c8 while M communicates the maximum value (m) and the sorted items leave through the K-bit output of RN-1.

Fig. 4. Transferring data items through multiple ports (Q=4): the input from port i feeds segment i (i = 1,…,4), and the segments deliver the sorted data

Data items are received in parallel from Q ports in such a way that port i supplies inputs for segment i (i = 1,2,…,Q). Each segment is a circuit like the one shown in Fig. 2,a with a multiplexer M. Different segments are linked by comparators/swappers (such as e, f, g in Fig. 2,a). Data items cannot be swapped between different segments during any transfer from the Q input ports except the last one, for which such swapping is indeed needed for transferring the smallest item. As soon as all input data are saved in the registers R0,…,RN-1, exactly the same functionality as in Fig. 2,a is provided. Note that more clock cycles than in Fig. 3 will be needed to make all items sorted. Completion of sorting may be recognized by additional comparators verifying that any upper item is smaller than or equal to the neighboring lower item. The circuit in Fig. 4 is well suited for devices from the Zynq-7000 family, which provide five high-performance ports (four HP AXI ports and the ACP).

IV. IMPLEMENTATION, EXPERIMENTS AND COMPARISONS
The number of combinational levels in the proposed circuit is equal to p, which is less than for the networks [8,9], where it is equal to p(p+1)/2. Thus, the proposed circuits may operate at a higher clock frequency. The number of comparators/swappers in the proposed circuit is N-1, which is considerably less than in the networks [8,9]. Table I presents the results of comparisons for different values of N, where subscripts eom and p indicate the known even-odd merge and the proposed circuits, respectively.

TABLE I. COMPARISON OF THE PROPOSED CIRCUIT WITH EVEN-ODD MERGE NETWORK

          N=64   N=128   N=256   N=512   N=1,024   N=2,048   N=4,096   N=8,192
Lp(N)     6      7       8       9       10        11        12        13
Leom(N)   21     28      36      45      55        66        78        91
Cp(N)     63     127     255     511     1,023     2,047     4,095     8,191
Ceom(N)   543    1,471   3,839   9,727   24,063    58,367    139,263   327,679

Note that although the even-odd merge network does not involve sequential operations (unlike the proposed circuits), additional time and components are needed, such as those to prepare long input vectors and to transmit long output vectors through interfaces with a limited number of lines. The proposed circuit does not require such components.
Experiments were done for the Artix-7 FPGA available on the Nexys-4 prototyping board [12] and the Zynq APSoC xc7z020 available on the Avnet ZedBoard [13]. The proposed circuits were designed and implemented in the Xilinx Vivado 2015.1 design suite. The architecture shown in Fig. 1,a was used for the board [12]. The MicroBlaze processor received data generated in software developed in the C language in the Xilinx software development kit (SDK 2015.1). Interaction with the proposed circuits was organized through the Xilinx IP core AXI_GPIO [7]. The following two scenarios were tested: 1) sorting data randomly generated in software by the C function rand; 2) selecting the highest-priority tasks from a set of given tasks. In the latter case each task has a priority number. All the priorities are sorted in the proposed network and the highest-priority task is chosen. The set of tasks is changed arbitrarily. Thus, the highest priority is selected dynamically. The architecture shown in Fig. 1,b was used for the board [13]. The embedded ARM processor is used similarly to the MicroBlaze processor in the previous implementation. The same Xilinx IP core AXI_GPIO [7] is used.

It is found that the proposed circuits permitted data sort to be significantly accelerated because: 1) larger subsets can be sorted in the programmable logic section (the maximum value of N was increased up to 512; K = 32); and 2) sorting is done in parallel with data transfers, i.e. it does not require additional time beyond that needed for communication. The PU sorts larger data sets by merging the subsets sorted in hardware. The details of merging can be found in [10]. We also tested the known even-odd and bitonic merge networks [8,9] and found that they can be implemented in the same device only for N<64 due to the lack of hardware resources. Additional experiments were done with data transfers through single and multiple high-performance ports (for Q=4 32-bit HP AXI ports). Table II indicates the average number of additional clock cycles (from 100 runs over randomly generated unsorted data items) needed to produce the sorted set after data acquisition is completed.
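The division of work mentioned above, in which the PU merges subsets that were sorted in hardware, might be sketched as follows. This is only an illustration under stated assumptions: the merging procedure actually used is described in [10] and may differ, `hardware_sort_block` is a hypothetical stand-in for the PL sorter (emulated here with Python's built-in `sorted`), and the standard-library `heapq.merge` is used for a k-way merge.

```python
import heapq


def hardware_sort_block(block):
    """Hypothetical stand-in for the sorter in programmable logic.

    In the real system the block would be sent to the PL and read back sorted;
    here the behavior is emulated in software.
    """
    return sorted(block)


def sort_large(data, n=512):
    """Sort a large data set as blocks of up to n items merged in software."""
    blocks = [hardware_sort_block(data[i:i + n]) for i in range(0, len(data), n)]
    # heapq.merge lazily merges any number of already sorted sequences.
    return list(heapq.merge(*blocks))


if __name__ == "__main__":
    import random
    values = [random.randrange(2**32) for _ in range(2000)]
    print(sort_large(values) == sorted(values))
```

The block size n=512 mirrors the largest N reported for the hardware sorter; any other block size supported by the circuit could be used in the same way.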
TABLE II. AVERAGE NUMBER OF ADDITIONAL CLOCK CYCLES FROM 100 RUNS OVER RANDOMLY GENERATED DATA

Single port / Multiple ports (Q=4)   N=32   N=64   N=128   N=256   N=512
                                     14.7   19.9   26.3    36.0    46.8

We also compared the proposed circuits with the networks from [14] and conclude the following:

1) The networks from [14] cannot process data during transmission, but they have smaller propagation delays and require fewer resources. They can be recommended for non-real-time large-scale data sorters and data exchange through AXI high-performance ports.

2) The proposed networks process data during transmission, but they have larger propagation delays and require more resources. This is because the depth of the network is larger and the regularity of the circuit is lower. However, the result is ready immediately after the last item is transmitted. So, the proposed networks are recommended for real-time data sorters and real-time priority management. In the latter case the highest priority number (which has either the smallest or the largest value [5]) is ready immediately, which is not possible for the known methods [14].

V. CONCLUSION
The paper suggests a real-time sorting circuit which can receive and transmit data items through widely used interfaces (such as AXI) and sort the data within the time needed for communication. Both single-port and multiple-port data transfers can be involved. The circuit achieves a good compromise between the resources it occupies and performance. It is faster and less resource consuming than the best known alternatives, such as even-odd and bitonic mergers.

ACKNOWLEDGMENT
Removed for blind review.

REFERENCES

[1] Aziz MW et al. Service based meta-model for the development of distributed embedded real-time systems. Real-Time Syst, 2013, Vol. 49, pp. 563-579.
[2] Mueller R. Data Stream Processing on Embedded Devices. Ph.D. thesis, ETH, Zurich, 2010.
[3] Xilinx, Inc. Zynq-7000 All Programmable SoC Technical Reference Manual, 2014. Available at: http://www.xilinx.com/support/documentation/user_guides/ug585-Zynq7000-TRM.pdf
[4] Silva J., Sklyarov V., Skliarova I. Comparison of On-chip Communications in Zynq-7000 All Programmable Systems-on-Chip. IEEE Embedded Systems Letters, Vol. 7, Issue 1, March 2015, pp. 31-34.
[5] Sklyarov V., Skliarova I. Modeling, Design, and Implementation of a Priority Buffer for Embedded Systems, Proc. of the 7th Asian Control Conference ASCC2009, Hong Kong, August 2009, pp. 9-14.
[6] UG984 MicroBlaze Processor Reference Guide, 2014. Available at: http://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_2/ug984-vivado-microblaze-ref.pdf
[7] LogiCORE IP AXI GPIO v2.0 Product Guide, 2014. Available at: http://www.xilinx.com/support/documentation/ip_documentation/axi_gpio/v2_0/pg144-axi-gpio.pdf
[8] Batcher K.E. Sorting networks and their applications, Proc. AFIPS Spring Joint Computer Conf., USA, 1968, pp. 307-314.
[9] Al-Haj Baddar S.W., Batcher K.E. Designing Sorting Networks. A New Paradigm, Springer, 2011.
[10] Sklyarov V., Skliarova I. High-performance implementation of regular and easily scalable sorting networks on an FPGA, Microprocessors and Microsystems, 2014, Vol. 38, Issue 5, pp. 470-484.
[11] Sklyarov V., Skliarova I. Fast regular circuits for network-based parallel data processing, Adv Electr Comput Eng, 2013, Vol. 13, Issue 4, pp. 47-50.
[12] Nexys4 FPGA Board Reference Manual, 2013. Available at: https://www.digilentinc.com/Data/Products/NEXYS4/Nexys4_RM_VB2_Final_5.pdf
[13] Avnet, Inc. ZedBoard. Zynq™ Evaluation and Development Hardware User's Guide, 2014. Available at: http://zedboard.org/sites/default/files/documentations/ZedBoard_HW_UG_v2_2.pdf
[14] Sklyarov V., Skliarova I., Barkalov A., Titarenko L. Synthesis and Optimization of FPGA-based Systems. Springer, 2014.
