I. INTRODUCTION
Distributed embedded real-time systems are widely used in areas such as defense, industry, aerospace, and home automation, where they combine growing capabilities with unique characteristics and tight resource constraints [1]. Sorting is a frequently required procedure, and for real-time systems increasing its throughput is important [2]. To better satisfy performance requirements, fast accelerators based on different design techniques have been investigated. Many of these techniques can be combined and efficiently implemented in FPGAs and in all programmable systems-on-chip (APSoC) devices, such as the Zynq-7000 [3], which integrates a multi-core processing unit (PU), the dual-core ARM Cortex™ MPCore™, and programmable logic (PL) on the same microchip. Interactions between the PU and the PL are supported by different interfaces and other signals through over 3,000 on-chip connections [3]. The four available 32/64-bit high-performance (HP) Advanced eXtensible Interface (AXI) ports and the 64-bit AXI Accelerator Coherency Port enable fast data exchange, with theoretical bandwidths given in [3]. In many practical applications, especially in embedded systems, sorting is applied to relatively small sets of data items, which can be transferred rapidly between the PU and the PL through general-purpose ports [4]. A similar technique is efficient for priority management, such as that described in [5].
Fig. 1. Real-time accumulator/sorter (a), an example (b), iterations for data acquisition (c)
Fig. 2. Transmitting sorted data items (a), an example for the first three clock cycles (b)
input ports except the last one, for which such swapping is indeed needed to transfer the smallest item. As soon as all input data are saved in the registers R0,...,RN-1, exactly the same functionality as in Fig. 2,a is provided. Note that more clock cycles than in Fig. 3 will be needed to sort all items. Completion of sorting may be recognized by additional comparators verifying that every upper item is smaller than or equal to its neighboring lower item. The circuit in Fig. 4 is well suited for devices from the Zynq-7000 family, which have 5 HP AXI ports.

IV. IMPLEMENTATION, EXPERIMENTS AND COMPARISONS

The number of combinational levels in the proposed circuit is equal to p, which is less than in the networks [8,9], where it is equal to p(p+1)/2. Thus, the proposed circuits may operate at a higher clock frequency. The number of comparators/swappers in the proposed circuit is N-1, which is considerably less than in the networks [8,9]. Table I presents the results of comparisons for different values of N, where the subscripts eom and p indicate the known even-odd merge and the proposed circuits, respectively.

TABLE I.
N        Lp(N)    Leom(N)    Cp(N)    Ceom(N)
64       6        21         63       543
256      8        36         255      3,839
512      9        45         511      9,727
1,024    10       55         1,023    24,063
2,048    11       66         2,047    58,367
4,096    12       78         4,095    139,263
8,192    13       91         8,191    327,679
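The entries of Table I can be reproduced from closed-form expressions, which makes the comparison easy to check for other values of N. The sketch below assumes N = 2^p (an inference from the Lp(N) column), takes Lp(N) = p, Cp(N) = N-1 and Leom(N) = p(p+1)/2 from the text, and, as an assumption not stated in this section, uses the standard comparator count of Batcher's even-odd merge sorting network, (p^2 - p + 4)·2^(p-2) - 1, for Ceom(N). All function names are illustrative.

```python
def proposed(p):
    """Levels and comparators/swappers of the proposed circuit for N = 2**p.

    Both values are taken from the text: p levels and N - 1 comparators.
    """
    n = 2 ** p
    return p, n - 1


def even_odd_merge(p):
    """Levels and comparators of an even-odd merge network for N = 2**p.

    Levels: p(p+1)/2, as given in the text.
    Comparators: (p^2 - p + 4) * 2^(p-2) - 1, the standard count for
    Batcher's network (assumed here; the text does not state it).
    """
    levels = p * (p + 1) // 2
    # Written as (...) * 2**p // 4 so that the arithmetic stays in integers.
    comparators = (p * p - p + 4) * 2 ** p // 4 - 1
    return levels, comparators


def table_row(n):
    """Return (N, Lp, Leom, Cp, Ceom) for a power-of-two N."""
    p = n.bit_length() - 1
    if n < 1 or 2 ** p != n:
        raise ValueError("N must be a power of two")
    l_p, c_p = proposed(p)
    l_eom, c_eom = even_odd_merge(p)
    return n, l_p, l_eom, c_p, c_eom


if __name__ == "__main__":
    print("N      Lp  Leom  Cp     Ceom")
    for n in (64, 256, 512, 1024, 2048, 4096, 8192):
        print("{:<6} {:<3} {:<5} {:<6} {}".format(*table_row(n)))
```

For N = 64 the functions return 6 and 21 levels and 63 and 543 comparators, which agrees with the first row of Table I.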
TABLE II.
N: 32, 64, 128, 256, 512
Rows: Single port; Multiple ports (Q=4)
Values: 14.7, 19.9, 26.3, 36.0, 46.8

V. CONCLUSION

The paper proposes a real-time sorting circuit that can receive and transmit data items through widely used interfaces (such as AXI) and sort the data within the time needed for communication. Both single-port and multiple-port data transfers are supported. The circuit achieves a good compromise between occupied resources and performance. It is faster and consumes fewer resources than the best known alternatives, such as even-odd and bitonic mergers.

ACKNOWLEDGMENT

Removed for blind review.

REFERENCES

[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]