
Performance Efficient FPGA Implementation of Parallel 2-D MRI
Image Filtering Algorithms using Xilinx System Generator

Sami Hasan, Alex Yakovlev and Said Boussakta
School of Electrical, Electronic and Computer Engineering,
University of Newcastle upon Tyne, UK
{sami.hasan, alex.yakovlev, s.boussakta}@ncl.ac.uk


Abstract: Currently, Field Programmable Gate Array (FPGA)
design is moving beyond low-level, line-by-line hardware
description language programming for implementing parallel
multidimensional image filtering algorithms. A high-level,
abstract, hardware-oriented parallel programming method
can structurally bridge this gap. This paper proposes a first
step toward such a method to efficiently implement parallel
2-D MRI image filtering algorithms using the Xilinx System
Generator. The implementation method consists of five
simple steps that provide fast FPGA prototyping for high
performance computation with excellent quality of
results. Results are obtained for nine 2-D image filtering
algorithms. Behaviourally, two Virtex-6 FPGA boards,
namely xc6vlX240Tl-1lff1759 and xc6vlX130Tl-1lff1156, are
targeted, achieving power consumption of 1.57 W and
0.97 W respectively at a maximum sampling frequency of
up to 230 MHz. One of the nine MRI image filtering
algorithms is then empirically improved to produce
enhanced MRI image filtering with moderately lower
power consumption at a higher maximum frequency.
I. INTRODUCTION
FPGAs are increasingly used in modern parallel
algorithm applications such as medical imaging [1], DSP
[2], image filtering [3], power consumption reduction in
portable image processing [4], MPEG-4 motion estimation in
mobile applications [5], satellite data processing [6], the new
Mersenne Number Transform [7][8], high-speed wavelet-
based image compression [9] and even global
communication links [10]. However, most of these
FPGA-based solutions are typically programmed with
low-level hardware description languages (HDL) inherited
from ASIC design methodologies [11].
On the other hand, parallel multidimensional image
filtering algorithms [12], for the aerospace, defence, digital
communications, multimedia, video and imaging
industries, demand computationally complex
operations [13][14] at maximum sampling frequency.
Traditional DSP processor arrays, with fixed architectures
and relatively short lifetimes, are costly to program line
by line with thousands of lines of code [15][16].
Alternatively, this paper presents a high-level abstract
implementation method to fill the present programming
gap between parallel algorithm coding and final FPGA
implementation.
The proposed FPGA implementation method is
architecturally based on the Xilinx System Generator
development tool [17] within the ISE 11.3 development
suite. This tool is a system-level block diagram modeling
environment that facilitates FPGA hardware
implementation from bit-accurate and cycle-true,
performance-efficient specifications of the parallel
multidimensional filtering algorithms.
The new method is tested on the performance-efficient
implementation of nine 2-D digital image filtering
algorithms: Edge, Sobel X, Sobel Y, Sobel X-Y, Blur,
Smooth, Sharpen, Gaussian and Identity [13][14],
targeting two Virtex-6 FPGA [18] boards, namely
xc6vlX240Tl-1lff1759 and xc6vlX130Tl-1lff1156.
II. PARALLEL 2-D IMAGE FILTERING ALGORITHMS
Parallel 2-D MRI filtering algorithms are image processing
algorithms based on 5x5 convolutional kernel masks.
Generally, the parallel architecture of these
algorithms is constructed of two input matrices, a 2-D
convolution array for processing and a parallel-to-serial
reconstructed output matrix, as shown in Fig. 1.

Figure 1. The parallel 2-D MRI image filtering algorithms architecture
SIP-6
978-1-86135-369-6/10/$25.00 2010 IEEE
765 CSNDSP 2010
Generally, let the original image, $x(n_1, n_2)$, be of size
(N x N), and the kernel, $h(m_1, m_2)$, of size (M x M); then
the output image, $y(n_1, n_2)$, can be expressed by the 2-D
convolution formula:

$$y(n_1, n_2) = \sum_{m_1=0}^{N-1} \sum_{m_2=0}^{N-1} x(m_1, m_2)\, h(n_1 - m_1, n_2 - m_2) \qquad (1)$$

where $0 \le n_1, n_2 < N + M - 1$. Moreover, the 2-D image
is equally subdivided into small sub-sequences of size
((N/n) x (N/n)) which are independently convolved:

$$y(n_1, n_2) = \sum_{m_1=0}^{(N/n)-1} \sum_{m_2=0}^{(N/n)-1} x(m_1, m_2)\, h(n_1 - m_1, n_2 - m_2) \qquad (2)$$

where $0 \le n_1, n_2 < (N/n) + M - 1$.
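As a rough illustration of the convolution sum and of the subdivision into independently convolved sub-images, the following NumPy sketch may help. The function names `conv2d_full` and `conv2d_blocks` are hypothetical, a square image whose side is divisible by n is assumed, and the overlap-add step used to recombine the sub-image results is an assumption of this sketch rather than something stated in the paper.

```python
import numpy as np

def conv2d_full(x, h):
    """Direct full 2-D convolution:
    y[n1, n2] = sum over (m1, m2) of x[m1, m2] * h[n1 - m1, n2 - m2]."""
    n_rows, n_cols = x.shape
    m_rows, m_cols = h.shape
    y = np.zeros((n_rows + m_rows - 1, n_cols + m_cols - 1))
    for m1 in range(n_rows):
        for m2 in range(n_cols):
            # Each input pixel adds a scaled copy of the kernel to the output.
            y[m1:m1 + m_rows, m2:m2 + m_cols] += x[m1, m2] * h
    return y

def conv2d_blocks(x, h, n):
    """Split x into n x n equal sub-images, convolve each one independently,
    and recombine by overlap-add (assumed recombination step)."""
    size = x.shape[0]           # assumes a square image
    m = h.shape[0]              # assumes a square kernel
    b = size // n               # sub-image side, assumes size % n == 0
    y = np.zeros((size + m - 1, size + m - 1))
    for i in range(n):
        for j in range(n):
            block = x[i * b:(i + 1) * b, j * b:(j + 1) * b]
            part = conv2d_full(block, h)   # (b + m - 1) x (b + m - 1)
            y[i * b:i * b + b + m - 1, j * b:j * b + b + m - 1] += part
    return y
```

Because convolution is linear, the block-wise result matches the direct result, which is what makes the sub-images suitable for parallel processing.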
Nine 5x5 convolutional kernels are utilized for the
parallel 2-D MRI image filtering algorithms. One of the
nine algorithms, namely the Edge algorithm, is
empirically modified by a new Edge enhancement
orthogonal kernel matrix to enhance fine detail in images:
$$\text{New Edge} = \begin{bmatrix}
0 & 0 & -0.125 & 0 & 0 \\
0 & 0 & -0.125 & 0 & 0 \\
-0.125 & 0.125 & 2.00 & -0.125 & 0.125 \\
0 & 0 & -0.125 & 0 & 0 \\
0 & 0 & -0.125 & 0 & 0
\end{bmatrix} \qquad (3)$$
The Edge algorithm was selected after the first round of
performance results in Table I, where it shows a
noticeably wide performance span between the two boards.
III. THE PARALLEL 2-D ALGORITHMS CAPTURE
These parallel 2-D MRI image filtering algorithms can
be behaviorally captured as a stream model-based
synchronous dataflow system using System Generator
libraries. The clock and its corresponding enable logic do
not appear in the System Generator block diagram but are
internally generated when the FPGA implementation is
behaviorally compiled within the Xilinx/Simulink
environment.
The 2-D convolution operation, in (1), can be
functionally implemented as an n-tap MAC FIR filter [13]
[14] [17]. Consequently, the parallel 2-D image filtering
algorithms can be efficiently realized using n-tap MAC
FIR filters with nine programmable coefficient sets.
A more highly abstracted implementation can be achieved
using a 5x5 filter image block, as in Fig. 2.
TABLE I.
PERFORMANCE INDICES USING TWO VIRTEX-6 BOARDS

2-D image     Power consumption (W)   Maximum frequency (MHz)
filtering
algorithm     X240T     X130T         X240T     X130T
Edge          1.57      0.97          194       230
SobelX        1.57      0.97          222       228
SobelY        1.57      0.97          202       230
SobelXY       1.56      0.97          223       230
Blur          1.57      0.97          227       226
Smooth        1.57      0.97          230       207
Sharpen       1.57      0.97          214       230
Gaussian      1.57      0.97          230       230
Identity      1.57      0.97          230       230

Figure 2. Xilinx System Generator capture of the nine parallel 2-D
image filtering algorithms.

The implementation diagram consists of three stages:
MRI input, processing and output. In the first stage, the
magnetic resonance imaging (MRI) pixels are sequentially
sub-streamed into five Virtex line buffers via a pipelined
gateway block. Each line is delayed by 64 samples, and
line 5 is a copy of the MRI scan. The second stage
consists of five parallel n-tap MAC FIR filters and a
structure of four adder blocks, which can be abstractly
provided by the 5x5 filter block, as shown in Fig. 2, to
filter the 64x64 grayscale MRI scan.
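The structure of the second stage can be mimicked in software to see why five row filters and four adders are enough for a 5x5 mask. The sketch below is a behavioural NumPy illustration only, not the System Generator design: the function name `filter5x5_as_row_firs` is hypothetical, five consecutive image rows stand in for the line-buffer outputs, and only the fully overlapping ("valid") output region is produced.

```python
import numpy as np

def filter5x5_as_row_firs(image, kernel):
    """Filter an image with a 5x5 kernel using five 5-tap FIR filters
    (one per buffered line) whose outputs are summed by four additions."""
    rows, cols = image.shape
    out = np.zeros((rows - 4, cols - 4))
    for r in range(rows - 4):
        # Five consecutive rows play the role of the five line-buffer outputs.
        lines = image[r:r + 5, :]
        # One 5-tap FIR per line. np.convolve flips the taps along the row;
        # pairing line k with kernel row 4 - k flips the kernel vertically,
        # so together they implement a true 2-D convolution.
        fir_out = [np.convolve(lines[k], kernel[4 - k], mode="valid")
                   for k in range(5)]
        # Four adders combine the five FIR outputs.
        acc = fir_out[0]
        for k in range(1, 5):
            acc = acc + fir_out[k]
        out[r] = acc
    return out
```

With an identity mask (a single 1 at the centre of the 5x5 kernel), the output equals the input image with a two-pixel border removed, which is a convenient sanity check.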
Nine different 2-D FIR filters can be applied via the
5x5 filter block: Edge, SobelX, SobelY, SobelXY, Blur,
Smooth, Sharpen, Gaussian and Identity. This 2-D FIR
filter offers compile-time mask parameters, so each of the
nine 2-D filter types can be either selected by changing
the mask parameter on the 5x5 filter block or modified.
The 2-D filter coefficients are stored in a block RAM.
Thus, the stored coefficients can be modified by changing
the mask of the 5x5 FIR filter. Each n-tap MAC FIR filter
is clocked 5 times faster than the input rate, and the
5x5 filter operates at 213 MHz [17]. Therefore the
throughput of the design is 213 MHz / 5 = 42.6 million
pixels/second. For the 64x64 MRI image, this is
42.6x10^6 / (64x64) = 10,400 frames/second (approximately).
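The throughput arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation; the function and parameter names are illustrative, and the 213 MHz clock, the factor of 5 and the 64x64 image size are the values quoted in the text.

```python
def throughput(clock_hz, clocks_per_pixel, width, height):
    """Return (pixels per second, frames per second) for a filter that
    needs a fixed number of clock cycles per input pixel."""
    pixels_per_second = clock_hz / clocks_per_pixel
    frames_per_second = pixels_per_second / (width * height)
    return pixels_per_second, frames_per_second

# Values quoted in the text: 213 MHz filter clock, 5 clocks per pixel,
# 64x64 grayscale MRI image.
pixels, frames = throughput(213e6, 5, 64, 64)
print(f"{pixels / 1e6:.1f} Mpixels/s, {frames:.0f} frames/s")
```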
The third stage is pipelined by inserting a delay block
between the 5x5 filter and the gateway boundary block.
The output is displayed via a Simulink block, Fig. 2, that
pops up the original MRI image together with the filtered
result, as shown in Fig. 3, Fig. 4 and Fig. 5.

Figure 3. The 2-D MRI images filtered, via Virtex-6 X240T, using 2-D
filter types: A. Edge, B. SobelX, C. SobelY, D. SobelXY, E. Blur, F.
Smooth, G. Sharpen, H. Gaussian, I. Identity.

Figure 4. The 2-D MRI images filtered, via Virtex-6 X130T, using 2-D
filter types: a. Edge, b. SobelX, c. SobelY, d. SobelXY, e. Blur, f.
Smooth, g. Sharpen, h. Gaussian, i. Identity.

The single System Generator diagram in Fig. 2 is
behaviorally equivalent to 7140 lines of VHDL code
and 8423 lines of Verilog code. Those
thousands of code lines must be manually verified, refined
and re-entered line by line, which can waste
valuable time. Consequently, this paper proposes an
FPGA implementation method.
IV. AN FPGA IMPLEMENTATION METHOD
The developed method is a high-level FPGA
implementation method for DSP algorithms, intended to
avoid the drawbacks of traditional HDL programming.
The method has only five simple steps, namely:
1. State the DSP algorithm.
2. Structure the DSP algorithm architecture.
3. Capture the algorithm using System Generator
from Xilinx.
4. Verify, refine and improve the quality of
results.
5. Generate the FPGA bit stream.
V. RESULTS
The goal of this paper is a new FPGA implementation
method that provides fast FPGA prototyping for high
performance computation of parallel 2-D MRI image
filtering algorithms. A timing analysis tool is
needed to evaluate the speed and power consumption
performance indices. Thus the Xilinx Timing Analyzer is
utilized to generate timing statistics, total power analysis
and histogram charts of the FPGA implementation path
delays. These help to identify the bottlenecks in the
design and to focus optimization on the slow-path
outliers.
Performance-efficient implementation here means
behaviorally achieving low power consumption at
maximum frequency for the parallel 2-D MRI image
filtering algorithms. Consequently, comparative results for
two Virtex-6 FPGA boards, xc6vlX240Tl-1lff1759 and
xc6vlX130Tl-1lff1156, are compiled for the nine 2-D
filters with two sets of 5x5 coefficient masks. The first set is
the stored mask within the 5x5 filter block, and the second
set is obtained by empirically modifying the 5x5 Edge
coefficients to a new 5x5 Edge Enhancement Orthogonal
Kernel as in (3). The results are presented in three forms:
a performance index table, grayscale MRI filtered images
and histogram charts of path delay distribution.
Behaviorally, the results from the first set show that the
parallel 2-D MRI image filtering algorithms have better
performance when implemented via the X130T board
compared to the X240T board.

Figure 5. The 2-D MRI images filtered using the new 2-D MRI Edge
filter for both FPGA boards.
Furthermore, the results from the second set reveal an
observable MRI filtering improvement compared to those
of the first set.
Noticeably, the performance indices in Table I show that
the X130T FPGA implementation outperforms the
X240T FPGA implementation, with a lower total power
consumption (around 0.97 W) and a maximum
frequency that is mostly around 230 MHz. This
difference is most apparent for the 2-D MRI Edge filter
algorithm. Thus, the modification is empirically
conducted on the 5x5 convolutional Edge operator.
The filtered 2-D MRI images of Fig. 3 and Fig. 4 are
generated from the implementations of the nine parallel
filtering algorithms using Virtex-6 X240T and X130T
FPGAs respectively. By inspection, the two figures show a
slight improvement in the 2-D MRI images filtered via the
X130T FPGA compared to the X240T.
The histogram timing charts in Fig. 6 and Fig. 7 depict
the slow-path distributions of the 2-D MRI Edge filter
captured behaviourally via the X240T and X130T FPGA
boards respectively. Each histogram chart is a useful
metric for analyzing the FPGA implementation: Where are
the slowest paths concentrated? How many slow paths
are in each bin? How efficiently does the implementation
meet timing? The FPGA implementation can be adjusted
accordingly.
These histograms are grouped into regions of paths that
each roughly form a normal distribution. The numbers at
the top of the bins show the number of paths in each bin.
Fig. 6 shows 308 paths that roughly form five
groups. These groups probably come from different
portions of the System Generator architecture, as in Fig. 2,
or from different clock-region timing constraints. Most
of the slow paths are concentrated
around 2.81 ns, and the slowest path is about 6.15 ns.
There is an outlier group of slow paths in the time range
6.13 ns to 6.30 ns with empty bins to the right of it. That is
because the FPGA implementation frequency, from Table
I, is the slowest (194 MHz) for this 2-D MRI Edge filter.
However, there are no red or pink bins, that is, no portions
that fail to meet the timing constraints.

Figure 6. Histogram chart of the total path delay distribution of the
2-D MRI Edge filter captured behaviourally via the X240T FPGA board.

Fig. 7 shows a shorter histogram chart of 308 paths that
form a completely different distribution, with
roughly only three normally distributed path groups
between 2.2 ns and 4.36 ns. That is because the FPGA
implementation frequency, from Table I, is the highest
(230 MHz) for the same 2-D MRI Edge filter.
The slow paths are concentrated between 2.2 ns and
2.8 ns. The slowest path is about 4.2 ns. Moreover, the
larger number of bins holding only one path, distributed
throughout the delay range, reflects the
efficient implementation at the 230 MHz
maximum frequency. Consequently, there are no
red or pink bins, that is, no portions that fail to meet the
timing constraints.
The second result set is generated by targeting the
same two Virtex-6 FPGA boards after modifying the
Xilinx stored Edge coefficient matrix to the new
empirical Edge Enhancement Orthogonal Kernel of (3).
Fig. 5, Fig. 8 and Fig. 9 depict those results.
The new Edge filtering algorithm noticeably
improves the MRI image filtering, as
depicted in Fig. 5, compared to the MRI Edge filtered
images in Fig. 3.A and Fig. 4.a.


Figure 7. Histogram chart of the total path delay distribution of the
2-D MRI Edge filter captured behaviourally via the X130T FPGA board.

Figure 8. Histogram chart of the total path delay distribution of the
new 2-D MRI Edge filter captured behaviourally via the X240T FPGA board.
Furthermore, the X240T FPGA-based implementation
frequency increased from 194 MHz to 229 MHz with
almost the same total power consumption of 1.56 W.
On the other hand, the X130T FPGA power
consumption is slightly lowered to 0.96 W at a
maximum frequency of 228 MHz.
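To put these changes in proportion, the relative differences between the stored Edge kernel (Table I) and the new Edge kernel can be worked out directly from the reported figures. The short Python sketch below only restates that arithmetic; the helper name `percent_change` and the dictionary layout are illustrative choices, not part of the paper.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

# (stored Edge kernel, new Edge kernel) values reported in the text.
edge_results = {
    "X240T": {"freq_mhz": (194, 229), "power_w": (1.57, 1.56)},
    "X130T": {"freq_mhz": (230, 228), "power_w": (0.97, 0.96)},
}

for board, metrics in edge_results.items():
    freq = percent_change(*metrics["freq_mhz"])
    power = percent_change(*metrics["power_w"])
    print(f"{board}: frequency {freq:+.1f} %, power {power:+.1f} %")
```

On these figures the main gain is the X240T maximum frequency, which rises by about 18 %, while the power changes on both boards are around 1 % or less.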
The histogram charts in Fig. 8 and Fig. 9 show
the effect of the new maximum sampling
frequencies on the slow-path concentration for the new
Edge filter FPGA implementations on the X240T and X130T
respectively.
The chart in Fig. 8 shows a shorter histogram than
that of Fig. 6, because of the new maximum frequency
(229 MHz). This chart depicts 308 paths grouped
roughly into four bell-curve regions. Most of the slow
paths are concentrated around 2.4 ns. The slowest
path is about 4 ns. Consequently, the outlier group of
the slowest paths is shifted to the time range of 3.88 ns to
4.20 ns with empty bins to the right of it. There are no
red or pink bins, that is, no portions that fail to meet the
timing constraints.


Figure 9. Histogram chart of the path delay distribution of the
new 2-D MRI Edge filter captured behaviourally via the X130T FPGA board.

The histogram in Fig. 9 distributes 308 slow paths into
roughly three bell-shaped distributions between 2 ns
and 4.2 ns. The slowest path is about 4.09 ns. There
are fewer one-path bins compared to those of Fig. 7.
Consequently, there are no red or pink bins, that is, no
portions that fail to meet the timing constraints.

VI. CONCLUSION
This paper developed a new FPGA implementation
method that provides fast FPGA prototyping for high
performance computation. The methodology is a high-
level, abstract, hardware-oriented parallel programming
approach intended to outperform low-level line-by-line HDL
programming, with excellent quality of performance results
for nine parallel 2-D MRI image filtering algorithms: power
consumption down to 0.96 W at a maximum frequency of up
to 230 MHz.
The FPGA implementation behaviourally targets
two Virtex-6 FPGA boards, namely xc6vlX240Tl-1lff1759
and xc6vlX130Tl-1lff1156, using the updated
Xilinx System Generator within the ISE 11.3 development
suite.
The X130T board outperforms the X240T board in
parallel MRI filtering by consuming the lowest power at
maximum sampling frequency.
One of the nine parallel filtering algorithms, the Edge
algorithm, is empirically improved by a new enhanced
orthogonal 5x5 kernel which generates excellent MRI
filtering results compared to the previous filtering run,
with moderately lower power consumption at a higher
maximum sampling frequency.
Future work will focus on high
performance efficient FPGA implementation of
parallel 3-D image filtering algorithms for the next
generation of advanced DSP applications within the
aerospace, defence, digital communications, multimedia,
video and imaging industries.
REFERENCES
[1] S. Coric, M. Leeser, E. Miller and M. Trepanier, "Parallel-Beam
Back projection: an FPGA implementation optimized for
medical imaging," Journal of VLSI Signal Processing Systems for
Signal, Image, and Video Technology 39 (3), 2005, pp. 295-311.
[2] O. Maslennikow and A. Sergiyenko, "Mapping DSP Algorithms into
FPGA," Parallel Computing in Electrical Engineering, PAR
ELEC 2006, International Symposium on, 13-17 Sept. 2006,
pp. 208-213.
[3] M. Kiran, K. M. War, L. M. Kuan, L. K. Meng and L. W. Kin,
"Implementing image processing algorithms using Hardware in
the loop approach for Xilinx FPGA," Electronic Design, ICED
2008, International Conference, Dec. 2008, pp. 1-6.
[4] W. Atabany and P. Degenaar, "Parallelism to reduce power
consumption on FPGA Spatiotemporal image processing," Proc.
IEEE International Symposium on Circuits and Systems, ISCAS
2008, pp. 1476-1479.
[5] R. Gao, D. Xu and J. P. Bentley, "Reconfigurable Hardware
Implementation of an Improved Parallel Architecture for MPEG-
4 Motion Estimation in Mobile Applications," IEEE Transactions
on Consumer Electronics, Vol. 49, 2003, pp. 1383-1390.
[6] K. R. Nataraj, S. Ramachandran and B. S. Nagabushan,
"Development of Algorithm, Architecture and FPGA
Implementation of Demodulator for Processing Satellite Data
Communication," IJCSNS International Journal of Computer
Science and Network Security, Vol. 9, 2009, pp. 137-147.
[7] O. Nibouche, S. Boussakta and M. Darnell, "Pipeline
architectures for radix-2 new Mersenne number transform,"
IEEE Transactions on Circuits and Systems I: Regular Papers 56
(8), 2009, pp. 1668-1680.
[8] O. Nibouche, S. Boussakta and M. Darnell, "A new architecture
for radix-2 new Mersenne number transform," IEEE International
Conference on Communications 2006, pp. 3219-3222.
[9] A. Masoudnia, H. Sarbazi-Azad and S. Boussakta, "Design and
performance of a pixel-level pipelined-parallel architecture for
high speed wavelet-based image compression," Computers and
Electrical Engineering 31 (8), 2005, pp. 572-588.
[10] T. Mak, et al., "Implementation of wave-pipelined interconnects in
FPGAs," Proceedings - Second IEEE International Symposium
on NOCS 2008, pp. 213-214.
[11] C. Chang, "Design and application of a reconfigurable computing
System for High Performance Digital Signal Processing," Ph.D.
thesis, University of California, Berkeley, 2005.
[12] S. Boussakta, "A novel method for parallel image Processing
applications," Journal of Systems, 45 (10), 1999, pp. 825-839.
[13] Clive Maxfield, FPGAs: World Class Design, 2009, Elsevier Inc.
[14] R. Woods, J. McAllister, G. Lightbody and Y. Yi, FPGA-based
Implementation of Signal Processing Systems, 2008, John Wiley
& Sons, Ltd.
[15] M. Aziz, "Parallel Digital Filtering Algorithms for Multiprocessor
DSP systems," Ph.D. thesis, 2004, University of Leeds.
[16] O. Alshibami, S. Boussakta and M. Aziz, "Fast algorithm for the
2-D new Mersenne number transform," Signal Processing 81 (8),
2001, pp. 1725-1735.
[17] System Generator for DSP user guides, 2010, downloadable from:
http://www.xilinx.com/support/sw_manuals/sysgen_bklist.pdf
[18] Virtex-6 FPGA Xilinx documentation, 2010, downloadable from:
http://www.xilinx.com/support/documentation/virtex-6.htm




