Sei sulla pagina 1di 9

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI: REGULAR PAPERS

A Generalized Algorithm and Reconfigurable


Architecture for Efficient and Scalable Orthogonal
Approximation of DCT
Maher Jridi, Member, IEEE, Ayman Alfalou, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

AbstractApproximation of discrete cosine transform (DCT)


is useful for reducing its computational complexity without significant impact on its coding performance. Most of the existing
algorithms for approximation of the DCT target only the DCT of
small transform lengths, and some of them are non-orthogonal.
This paper presents a generalized recursive algorithm to obtain
orthogonal approximation of DCT where an approximate DCT
could be derived from a pair of DCTs of length
of length
at the cost of
additions for input preprocessing. We
perform recursive sparse matrix decomposition and make use of
the symmetries of DCT basis vectors for deriving the proposed
approximation algorithm. Proposed algorithm is highly scalable
for hardware as well as software implementation of DCT of higher
lengths, and it can make use of the existing approximation of
8-point DCT to obtain approximate DCT of any power of two
. We demonstrate that the proposed approximation
length,
of DCT provides comparable or better image and video compression performance than the existing approximation methods.
It is shown that proposed algorithm involves lower arithmetic
complexity compared with the other existing approximation algorithms. We have presented a fully scalable reconfigurable parallel
architecture for the computation of approximate DCT based on
the proposed algorithm. One uniquely interesting feature of the
proposed design is that it could be configured for the computation
of a 32-point DCT or for parallel computation of two 16-point
DCTs or four 8-point DCTs with a marginal control overhead. The
proposed architecture is found to offer many advantages in terms
of hardware complexity, regularity and modularity. Experimental
results obtained from FPGA implementation show the advantage
of the proposed method.
Index TermsAlgorithm-architecture codesign, DCT approximation, discrete cosine transform (DCT), high efficiency video
coding (HEVC).

I. INTRODUCTION

HE DISCRETE cosine transform (DCT) is popularly used


in image and video compression. Since the DCT is computationally intensive, several algorithms have been proposed
in the literature to compute it efficiently [1][3]. Recently, significant work has been done to derive approximate of 8-point
DCT for reducing the computational complexity [4][9]. The
main objective of the approximation algorithms is to get rid of

Manuscript received April 29, 2014; revised July 02, 2014 and August 27,
2014; accepted September 18, 2014. This paper was recommended by Associate
Editor H. Johansson.
M. Jridi and A. Alfalou are with the Vision Team, ISEN Brest, 29228 Brest
Cedex 2, France (e-mail: maher.jridi@isen.fr; ayman.al-falou@isen-bretagne.
fr).
P. K. Meher is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798. (e-mail: ASPKMeher@ntu.edu.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2014.2360763

multiplications which consume most of the power and computation-time, and to obtain meaningful estimation of DCT as well.
8
Haweel [8] has proposed the signed DCT (SDCT) for 8
blocks where the basis vector elements are replaced by their
sign, i.e, 1. Bouguezel-Ahmad-Swamy (BAS) have proposed
a series of methods. They have provided a good estimation of
the DCT by replacing the basis vector elements by 0, 1/2, 1
[7]. In the same vein, Bayer and Cintra [5], [6] have proposed
two transforms derived from 0 and 1 as elements of transform
kernel, and have shown that their methods perform better than
the method in [7], particularly for low- and high-compression
ratio scenarios.
The need of approximation is more important for higher-size
DCT since the computational complexity of the DCT grows
nonlinearly. On the other hand, modern video coding standards
such as high efficiency video coding (HEVC) [10] uses DCT
of larger block sizes (up to 32 32) in order to achieve higher
compression ratio. But, the extension of the design strategy used
in H264 AVC for larger transform sizes, such as 16-point and
32-point is not possible [11]. Besides, several image processing
applications such as tracking [12] and simultaneous compression and encryption [13] require higher DCT sizes. In this context, Cintra has introduced a new class of integer transforms applicable to several block-lengths [14]. Cintra et al. have proposed a new 16 16 matrix also for approximation of 16-point
DCT, and have validated it experimentally [15]. Recently, two
new transforms have been proposed for 8-point DCT approximation: Cintra et al. have proposed a low-complexity 8-point
approximate DCT based on integer functions [16] and Potluri
et al. have proposed a novel 8-point DCT approximation that
requires only 14 addition [17]. On the other hand, Bouguezel et
al. have proposed two methods for multiplication-free approxi, 16 and
mate form of DCT. The first method is for length
32; and is based on the appropriate extension of integer DCT
[18]. Also, a systematic method for developing a binary version of high-size DCT (BDCT) by using the sequency-ordered
Walsh-Hadamard transform (SO-WHT) is proposed in [4]. This
transform is a permutated version of the WHT which approximates the DCT very well and maintains all the advantages of
the WHT.
A scheme of approximation of DCT should have the following features:
i) It should have low computational complexity.
ii) It should have low error energy in order to provide compression performance close to the exact DCT, and preferably should be orthogonal.

1549-8328 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
2

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI: REGULAR PAPERS

iii) It should work for higher lengths of DCT to support


modern video coding standards, and other applications
like tracking, surveillance, and simultaneous compression and encryption.
But the existing DCT algorithms do not provide the best of all
the above three requirements. Some of the existing methods are
deficient in terms of scalability [18], generalization for higher
sizes [15], and orthogonality [14]. We intend to maintain orthogonality in the approximate DCT for two reasons. Firstly, if the
transform is orthogonal, we can always find its inverse, and the
kernel matrix of the inverse transform is obtained by just transposing the kernel matrix of the forward transform. This feature
of inverse transform could be used to compute the forward and
inverse DCT by similar computing structures. Moreover, in case
of orthogonal transforms, similar fast algorithms are applicable
to both forward and inverse transforms [19], [20].
In this paper, we propose an algorithm to derive approximate
form of DCTs which satisfy all the three features. We obtain
the proposed approximate form of DCT by recursive decomposition of sparse DCT matrix. It is observed that proposed algorithm involves less arithmetic complexity than the existing DCT
approximation algorithms. The proposed approximate form of
DCT of different lengths are orthogonal, and result in lower
error-energy compared to the existing algorithms for DCT approximation. The decomposition process allows generalization
of the proposed transform for higher-size DCTs. Interestingly,
proposed algorithm is easily scalable for hardware as well as
software implementation of DCT of higher lengths, and it can
make use of the best of the existing approximations of 8-point
DCT. Based on the proposed algorithm, we have proposed a
fully scalable, reconfigurable, and parallel architecture for approximate DCT computation. One uniquely interesting feature
of proposed design is that the structure for the computation of
32-point DCT could be configured for parallel computation of
two 16-point DCTs or four 8-point DCTs. The proposed algorithm is found to be better than the existing methods in terms of
energy compaction and hardware complexity, as well.
The remainder of this paper is organized as follows. In Section II, we derive the proposed algorithm for the generation
of kernel matrices for the approximate DCT. In Section III we
provide the proposed configurable parallel architecture and discuss the performance evaluation of the proposed architecture
in terms hardware complexity. In Section IV, the application
of the proposed method in image and video compression, and
compression performances are discussed. Finally, conclusions
are presented in Section V.
II. PROPOSED DCT APPROXIMATION
The elements of

-point DCT matrix

are given by:


(1)

,
, and
for
where
. The DCT given by (1) is referred to as exact DCT in order
to distinguish it from approximated forms of DCT. For
and
, for any even value of
we can
find that
(2)

since

, (2) can be rewritten as:


(3)

Hence, the cosine transform kernel on the right-hand side of


(3) corresponds to
-point DCT and its elements can be assumed to be
, for
. Therefore, the
first
elements of even rows of DCT matrix of size
correspond to the
-point DCT matrix. Accordingly, the recursive decomposition of
can be performed as detailed in
(4)(8). Using the even/odd symmetries of its row vectors, DCT
matrix
can be represented by the following matrix product
(4)
is a block sparse matrix expressed by:

where

(5)
where
matrix
.

is the
zero matrix. Block subconsists of odd rows of the first
columns of
is a permutation matrix expressed by:
(6)

where

is a row of
zeros and
matrix defined by its row vectors as:

is a

(7)
is the
th row vector of the
identity matrix. Finally, the last matrix in (4),
defined by:

where

is
(8)

where
is an
matrix having all ones on
the anti-diagonal and zeros elsewhere.
To reduce the computational complexity of DCT, the computational cost of matrices presented in (4) is required to be
assessed. Since
does not involve any arithmetic or logic
requires
additions and
subtracoperation, and
tions, they contribute very little to the total arithmetic complexity and cannot be reduced further. Therefore, for reducing
the computational complexity of -point DCT, we need to approximate
in (5). Let
and
denote the approximation matrices of
and
, respectively. To find these
approximated submatrices we take the smallest size of DCT matrix to terminate the approximation procedure to 8, since 4-point
DCT and 2-point DCT can be implemented by adders only. Consequently, a good approximation of
, where is an integral
power of two, for
, leads to a proper approximations of
and . For approximation of
we can choose the 8-point
DCT given in [6] since that presents the best trade-off between
the number of required arithmetic operators and quality of the
reconstructed image. The trade-off analysis given in [6] shows
that approximating
by
where
denotes the
rounding-off operation outperforms the current state-of-the-art
of 8-point approximation methods.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
JRIDI et al.: A GENERALIZED ALGORITHM AND RECONFIGURABLE ARCHITECTURE FOR EFFICIENT AND SCALABLE ORTHOGONAL APPROXIMATION OF DCT 3

When we closely look at (4) and (5), we note that


operates
on sums of pixel pairs while
operates on differences of the
same pixel pairs. Therefore, if we replace
by
, we shall
have two main advantages. Firstly, we shall have good comand secondly
pression performance due to the efficiency of
the implementation will be much simpler, scalable and reconfigurable. For approximation of
we have investigated two
other low-complexity alternatives, and in the following we discuss here three possible options of approximation of :
i) The first one is to approximate
by null matrix, which
implies all even-indexed DCT coefficients are assumed
to be zero. The transform obtained by this approximation
is far from the exact values of even-indexed DCT coefficients, and the odd coefficients do not contain any information.
ii) The second solution is obtained by approximating
by
an 8 8 matrix where each row contains one 1 and all
other elements are zeros. Here, elements equal to 1 correspond to the maximum of elements of the exact DCT in
each row. The approximate transform in this case is closer
to the exact DCT than the solution obtained by null matrix.
iii) The third solution consists of approximation of
by .
Since
as well as
are submatrices of
and operate on matrices generated by sum and differences of
pixel pairs at distance of 8, approximation of
by
has attractive computational properties: regularity of the
is orthogonalsignal-flow graph, orthogonality since
izable, and good compression efficiency, other than scalability and scope for reconfigurable implementation.
We have not done exhaustive search of all possible solutions. So
there could be other possible low-complexity implementation
of . But other solutions are not expected to have the potential
for reconfigurablity what we achieve by replacement of
by
. Based on this third possible approximation of , we have
of
as:
obtained the proposed approximation

(9)

Fig. 1. Signal flow graph (SFG) of (


tions by 1.

). Dashed arrows represent multiplica-

Algorithm 1 Generation of proposed DCT matrix


function PROPOSED DCT(N)

while

power of 2,
is the number of 8-sample blocks

do
,

Eq (6), (8)
Eq (9)

end while
Eq (10)
return
end function
III. SCALABLE AND RECONFIGURABLE ARCHITECTURE FOR
DCT COMPUTATION
In this section, we discuss the proposed scalable architecture
and 32. We
for the computation of approximate DCT of
have derived the theoretical estimate of its hardware complexity
and discuss the reconfiguration scheme.
A. Proposed Scalable Design

As stated before, matrix


each
we can calculate

is orthogonalizable. Indeed, for


given by:

(10)
denotes matrix transposition. For data compreswhere
sion, we can use
instead of
since
. Since
is a diagonal matrix, it
can be integrated into the scaling in the quantization process
(without additional computational complexity). Therefore, as
adopted in [4][8], the computational cost of
is equal
. Moreover, the term of
in (9) can be
to that of
integrated in the quantization step in order to have multiplerless
architecture. The procedure for the generation of the proposed
orthogonal approximated DCT is stated in Algorithm 1.

The basic computational block of algorithm for the proposed


is given in [6]. The block diagram of
DCT approximation,
is shown in Fig. 1. For a
the computation of DCT based on
,
, the approximate
given input sequence
DCT coefficients are obtained by
. An example
is illustrated in Fig. 2, where two
of the block diagram of
units for the computation of
are used along with an input
adder unit and output permutation unit. The functions of these
two blocks are shown respectively in (8) and (6). Note that structures of 16-point DCT of Fig. 2 could be extended to obtain the
DCT of higher sizes. For example, the structure for the computation of 32-point DCT could be obtained by combining a pair
of 16-point DCTs with an input adder block and output permutation block.
B. Complexity Comparison
To assess the computational complexity of proposed -point
approximate DCT, we need to determine the computational

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
4

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI: REGULAR PAPERS

ASSESSMENT

Fig. 2. Block diagram of the proposed DCT for

).

cost of matrices quoted in (9). As shown in Fig. 1 the approximate 8-point DCT involves 22 additions. Since
has no
computational cost and
requires additions for -point
DCT, the overall arithmetic complexity of 16-point, 32-point,
and 64-point DCT approximations are 60, 152, and 368 additions, respectively. More generally, the arithmetic complexity
additions.
of -point DCT is equal to
Moreover, since the structures for the computation of DCT of
different lengths are regular and scalable, the computational
DCT coefficients can be found to be
time for
where
is the addition-time. The number of arithmetic operations involved in proposed DCT approximation of different
lengths and those of the existing competing approximations are
shown in Table I. It can be found that the proposed method requires the lowest number of additions, and does not require any
shift operations. Note that shift operation does not involve any
combinational components, and requires only rewiring during
hardware implementation. But it has indirect contribution to the
hardware complexity since shift-add operations lead to increase
in bit-width which leads to higher hardware complexity of
arithmetic units which follow the shift-add operation. Also, we
note that all considered approximation methods involve significantly less computational complexity over that of the exact
DCT algorithms. According to the Loeffler algorithm [2], the
exact DCT computation requires 29, 81, 209, and 513 additions
along with 11, 31, 79, and 191 multiplications, respectively for
8, 16, 32, and 64-point DCTs.
Pipelined and non-pipelined designs of different methods are
developed, synthesized and validated using an integrated logic
analyzer. The validation is carried out by using the Digilent EB
of Spartan6-LX45. We have used 8-bit inputs, and we have allowed the increase of output size (without any truncations). For
the 8-point transform of Fig. 1, we have 11-bit and 10-bit outputs. The pipelined design are obtained by insertion of registers
in the input and output stages along with registers after each
adder stage, while the no pipeline registers are used within the
non-pipelined designs. The synthesis results obtained from XST
synthesizer are presented in Table II. It shows that pipelined
designs provide significantly higher maximum operating frequency (MOF). It also shows that the proposed design involves
nearly 7%, 6%, and 5% less area compared to the BDCT design for equal to 16, 32, and 64, respectively. Note that both
pipelined and non-pipelined designs involve the same number

OF

TABLE I
REQUIRED ARITHMETIC OPERATIONS
APPROXIMATION ALGORITHMS

FOR

SEVERAL

TABLE II
SYNTHESIS RESULTS OF PIPELINED AND NON-PIPELINED DESIGNS
SEVERAL APPROXIMATION ALGORITHMS

FOR

Fig. 3. Proposed reconfigurable architecture for approximate DCT of lengths


and 16.

of LUTs since pipeline registers do not require additional LUTs.


For 8-point DCT, we have used the approximation proposed
in [6] which forms the basic computing block of the proposed
method. Also, we underline that all designs have the same critical path; and accordingly have the same MOFs. Most importantly, the proposed designs are reusable for different transform
lengths.
C. Proposed Reconfiguration Scheme
As specified in the recently adopted HEVC [10], DCT of
, 16, 32 are required to be
different lengths such as
used in video coding applications. Therefore, a given DCT
architecture should be potentially reused for the DCT of different lengths instead of using separate structures for different

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
JRIDI et al.: A GENERALIZED ALGORITHM AND RECONFIGURABLE ARCHITECTURE FOR EFFICIENT AND SCALABLE ORTHOGONAL APPROXIMATION OF DCT 5

Fig. 4. Proposed reconfigurable architecture for approximate DCT of lengths

lengths. We propose here such reconfigurable DCT structures


which could be reused for the computation of DCT of different
lengths. The reconfigurable architecture for the implementation
of approximated 16-point DCT is shown in Fig. 3. It consists
of three computing units, namely two 8-point approximated
DCT units and a 16-point input adder unit that generates
and
,
. The input to the first 8-point DCT
approximation unit is fed through 8 MUXes that select either
or
, depending
on whether it is used for 16-point DCT calculation or 8-point
DCT calculation. Similarly, the input to the second 8-point
DCT unit (Fig. 3) is fed through 8 MUXes that select either
or
, depending
on whether it is used for 16-point DCT calculation or 8-point
DCT calculation. On the other hand, the output permutation
unit uses 14 MUXes to select and re-order the output depending
is used as control input
on the size of the selected DCT.
of the MUXes to select inputs and to perform permutation
according to the size of the DCT to be computed. Specifically,
enables the computation of 16-point DCT and
enables the computation of a pair of 8-point DCTs
in parallel. Consequently, the architecture of Fig. 3 allows the
calculation of a 16-point DCT or two 8-point DCTs in parallel.
A reconfigurable design for the computation of 32-, 16-, and
8-point DCTs is presented in Fig. 4. It performs the calculation of a 32-point DCT or two 16-point DCTs in parallel or
four 8-point DCTs in parallel. The architecture is composed of
32-point input adder unit, two 16-point input adder units, and
four 8-point DCT units. The reconfigurability is achieved by

, 16 and 32.

three control blocks composed of 64 2:1 MUXes along with 30


3:1 MUXes. The first control block decides whether the DCT
, the selection of input data
size is of 32 or lower. If
is done for the 32-point DCT, otherwise, for the DCTs of lower
lengths. The second control block decides whether the DCT size
the length of the DCT to be comis higher than 8. If
puted is higher than 8 (DCT length of 16 or 32), otherwise, the
length is 8. The third control block is used for the output permutation unit which re-orders the output depending on the size
and
are used as control sigof the selected DCT.
equal
nals to the 3:1 MUXes. Specifically, for
to {00}, {01} or {11} the 32 outputs correspond to four 8-point
parallel DCTs, two parallel 16-point DCTs, or 32-point DCT,
respectively. Note that the throughput is of 32 DCT coefficients
per cycle irrespective of the desired transform size.
IV. EXPERIMENTAL VALIDATION
We discuss here the performance of the proposed DCT approximation algorithm in terms of energy compaction characteristics compared to the DCT approximations suggested in [4],
[14], [15] and [18]. Note that the method in [15] was proposed
. The method in [18] was defined for
for a fixed size
, 16 and 32, but it is not suitable to be extended for sizes
higher than 32. Also, we have not compared with the SDCT
method [8]. Although the SDCT method is the first one related
to the approximation of DCT algorithms, its reconstruction capabilities are significantly lower than others in [4], [14], [15]
and [18].

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
6

Fig. 5. Average PSNR: exact (solid line), proposed (


[18] is defined for maximum size of
.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI: REGULAR PAPERS

), BAS [8] (

A. Application for Image Compression


Image compression is a typical application to validate the energy compaction capability of DCT. We have used the compression scheme of [8] to compare the performance of the competing DCT approximation schemes. According to this method,
only coefin each block of the 2D-transform of size
ficients are retained to reconstruct the image according to the
zigzag sequence (all the other coefficients were set to zero).
, we have used
and
, where
For
, which corresponds to compression ratio in
the range 50% to 98.43%. In order to keep the same compression
and
are multiplied by 4
ratio for different DCT sizes,
when is doubled. Accordingly, the number of retained coefficients is varied in the range [4,128], [16,512], and [64,2048],
respectively, for block-sizes (16 16), (32 32), and (64
64) to vary the compression ratio from 50% to 98.43% equivalently for all block-sizes. In this context, we have introduced the
notation of normalized which varies in the range [1,32] for all
block sizes. For a given block-size, the normalized is the ratio
of to the number of times the block-size is larger than 8 8.
Accordingly, normalized remains the same for all block sizes.
The idea behind this notation is to have the same compression
ratio for different block sizes for the same normalized .
In Fig. 5 we have plotted the average PSNR of reconstructed
512 8-bit greyscale images
image from a set of forty 512
(Miscellaneous and Aerials) obtained from a standard database
[21]. For comparison, we have taken the method in [14] with
the zeroth order approximation which leads to a lower hard, the DCT matrix given
ware complexity. Note that for
by [6] is the same one used in [14]. Moreover, our algorithm
is based on the 8-point DCT approximation given in [6]. For
that reason, we find the same PSNR performance of proposed

), Cintra [14] (

), BDCT [4] (

) and BC [15] (

) DCT. BAS transform

approximation as that of [6]. It is also shown that for high com, method in [6] presents a higher PSNR,
pression ratio
, BAS method [18] outperforms all 8-point apand for
, the proposed method often presents
proximations. For
. Finally, for
a better results especially for normalized
and
the proposed method has a higher PSNR at
all compression ratios. All these results are obtained by Matlab
simulations and are in conformity with those presented in [4]
where comparison is made between methods in [4] and [18].
We have considered the structural similarity (SSIM) index [22],
(which is an improved version of the universal quality index
(UQI) [23], and one of the best objective metrics for image
distortion). In Fig. 6 we have shown the plot of mean SSIM
values obtained for different approximation methods relative to
the exact DCT. We have used different compression ratios for
and
for the same images of PSNR assessment. It is found that the proposed algorithm outperforms
and for compression ratios higher
existing methods for all
. For lower compression ratios, the SSIM
than 84.37%
values are similar. Finally, subjective evaluation is presented in
and for
. As expected, it
Fig. 7 for normalized
is found that the reconstructed images using the proposed algorithm have less blocking artifacts than methods in [4], [8]
and [14] while method of [18] presents a reconstructed image
closely similar to the proposed method.
B. Application for Video Compression
To evaluate the performance of the proposed algorithm for
video coding we have integrated the proposed approximated
DCT into HEVC reference software HM12.1, in the same way
as has been done in [17]. Moreover, we have integrated the existing methods to have a comparison in real time video coding.
One key feature of HEVC is that it involves DCTs of different

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
JRIDI et al.: A GENERALIZED ALGORITHM AND RECONFIGURABLE ARCHITECTURE FOR EFFICIENT AND SCALABLE ORTHOGONAL APPROXIMATION OF DCT 7

Fig. 6. Mean SSIM relative to exact DCT of proposed (


transform [18] is defined for maximum size of
.

Fig. 7. Reconstructed images using several algorithms for

), BAS [18] (

), Cintra [14] (

and normalized

sizes such as 4, 8, 16, and 32. Therefore, we select the BDCT in


[4] and BAS [18] methods since orthogonal approximate DCTs
for transform sizes up to 32 32 are defined in [4] and BAS
[18]. All these competing transforms are implemented in the
encoder to produce HEVC-compliant bit-stream. The inverse
DCT defined in the HEVC final draft international standard [24]
is used in the decoder.
We performed the tests using JCT-VC HM 12.1 for the main
profile, and all-intra configuration over Qp values of 22, 27,
32, and 37. We have taken the PeopleOnStreet 2560 1600,

), BDCT [4] (

) and BC [15] (

) DCT. BAS

CrowdRun 1920 1080, and OldTownCross 1280 720 test


sequences for comparison. All simulations are performed under
Desktop PCs with the Intel Core2 processor family with 2 GHz
of processor frequency and 1 Gbit of RAM. We have measured
as the percentage difference of bitbit-rate difference
, and from the bitrate of each of the methods
rate of the anchor
with the same value of PSNR.
values shown in Table III are calculated as:
The

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
8

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI: REGULAR PAPERS

TABLE III
BRD PERCENTAGE OF APPROXIMATED TRANSFORMS FOR BDCT [4], BAS
[18], AND PROPOSED METHOD

where positive values indicate coding loss compared to the anchor. Note that the bitrate is calculated by taking into account
the frame rate of the test sequence, the total number of tested
frames and the size of the bitstream. Assessment of video coding
performance of approximated transforms compared to HM12.1
are summarized in Table III. The approximated DCT algorithms
give coding loss between 3.43% and 6.40% for different test sequences. This coding loss can be explained by the entropy coder
used after the DCT. The entropy coder is designed for coding
the exact DCT coefficients and not approximated coefficients.
The coding loss could be reduced by an appropriate modification of context models used in CABAC coder, which we can
take up as a future work.
V. CONCLUSION
In this paper, we have proposed a recursive algorithm to
obtain orthogonal approximation of DCT where approximate
could be derived from a pair of DCTs of
DCT of length
at the cost of additions for input preprocessing.
length
The proposed approximated DCT has several advantages, such
as of regularity, structural simplicity, lower-computational
complexity, and scalability. Comparison with recently proposed competing methods shows the effectiveness of the
proposed approximation in terms of error energy, hardware
resources consumption, and compressed image quality. We
have also proposed a fully scalable reconfigurable architecture
for approximate DCT computation where the computation of
32-point DCT could be configured for parallel computation of
two 16-point DCTs or four 8-point DCTs.

[7] S. Bouguezel, M. Ahmad, and M. N. S. Swamy, Low-complexity 8


8 transform for image compression, Electron. Lett., vol. 44, no. 21,
pp. 12491250, Oct. 2008.
[8] T. I. Haweel, A new square wave transform based on the DCT, Signal
Process., vol. 81, no. 11, pp. 23092319, Nov. 2001.
[9] V. Britanak, P. Y. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. London, U.K.: Academic, 2007.
[10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, Overview of the
high efficiency video coding (HEVC) standard, IEEE Trans. Circuits
Syst. Video Technol., vol. 22, no. 12, pp. 16491668, Dec. 2012.
[11] F. Bossen, B. Bross, K. Suhring, and D. Flynn, HEVC complexity and
implementation analysis, IEEE Trans. Circuits Syst. Video Technol.,
vol. 22, no. 12, pp. 16851696, 2012.
[12] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang, Incremental learning of 3D-DCT compact representations for robust visual
tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, pp.
863881, Apr. 2013.
[13] A. Alfalou, C. Brosseau, N. Abdallah, and M. Jridi, Assessing the performance of a method of simultaneous compression and encryption of
multiple images and its resistance against various attacks, Opt. Express, vol. 21, no. 7, pp. 80258043, 2013.
[14] R. J. Cintra, An integer approximation method for discrete sinusoidal transforms, Circuits, Syst., Signal Process., vol. 30, no. 6, pp.
14811501, 2011.
[15] F. M. Bayer, R. J. Cintra, A. Edirisuriya, and A. Madanayake, A
digital hardware fast algorithm and FPGA-based prototype for a novel
16-point approximate DCT for image compression applications,
Meas. Sci. Technol., vol. 23, no. 11, pp. 110, 2012.
[16] R. J. Cintra, F. M. Bayer, and C. J. Tablada, Low-complexity 8-point
DCT approximations based on integer functions, Signal Process., vol.
99, pp. 201214, 2014.
[17] U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera,
and A. Edirisuriya, Improved 8-point approximate DCT for image and
video compression requiring only 14 additions, IEEE Trans. Circuits
Syst. I, Reg. Papers, vol. 61, no. 6, pp. 17271740, Jun. 2014.
[18] S. Bouguezel, M. Ahmad, and M. N. S. Swamy, A novel transform
for image compression, in Proc. 2010 53rd IEEE Int. Midwest Symp.
Circuits Syst. (MWSCAS), pp. 509512.
[19] K. R. Rao and N. Ahmed, Orthogonal transforms for digital signal
processing, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.
(ICASSP), Apr. 1976, vol. 1, pp. 136140.
[20] Z. Mohd-Yusof, I. Suleiman, and Z. Aspar, Implementation of two
dimensional forward DCT and inverse DCT using FPGA, in Proc.
TENCON 2000, vol. 3, pp. 242245.
[21] USC-SIPI image database, Univ. Southern California, Signal and
Image Processing Institute [Online]. Available: http://sipi.usc.edu/
database/, 2012
[22] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, Image
quality assessment: From error visibility to structural similarity, IEEE
Trans. Image Process., vol. 13, no. 4, pp. 600612, 2004.
[23] Z. Wang and A. C. Bovik, A universal image quality index, IEEE
Signal Process. Lett., vol. 9, no. 3, pp. 8184, 2002.
[24] B. Bross, W.-J. Han, J.-R. Ohm, G. J. Sullivan, Y.-K. Wang, and T.
Wiegand, High efficiency video coding (HEVC) text specification
draft 10 (for FDIS and consent) associated resources, in Proc.
JCT-VC, Geneva, Switzerland, Jan. 2013, Doc. JCTVC-L1003.

REFERENCES
[1] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, NEDA:
A low-power high-performance DCT architecture, IEEE Trans.
Signal Process., vol. 54, no. 3, pp. 955964, 2006.
[2] C. Loeffler, A. Lightenberg, and G. S. Moschytz, Practical fast 1-D
DCT algorithm with 11 multiplications, in Proc. Int. Conf. Acoust.,
Speech, Signal Process. (ICASSP), May 1989, pp. 988991.
[3] M. Jridi, P. K. Meher, and A. Alfalou, Zero-quantised discrete cosine
transform coefficients prediction technique for intra-frame video encoding, IET Image Process., vol. 7, no. 2, pp. 165173, Mar. 2013.
[4] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, Binary discrete
cosine and Hartley transforms, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 4, pp. 9891002, Apr. 2013.
[5] F. M. Bayer and R. J. Cintra, DCT-like transform for image compression requires 14 additions only, Electron. Lett., vol. 48, no. 15, pp.
919921, Jul. 2012.
[6] R. J. Cintra and F. M. Bayer, A DCT approximation for image compression, IEEE Signal Process. Lett., vol. 18, no. 10, pp. 579582,
Oct. 2011.

Maher Jridi received the Engineering degree from


the Graduate Telecommunication School SupCom,
Tunisia, in 2003. He received the Ph.D. degree in
electronics from the University of Science and Technology, Bordeaux I, France, in 2007 at IMS laboratory. He worked in digital calibration for interleaved
data converters. Currently, he is an Associate Professor at ISEN-Brest, France, and he is member of
the vision group. He has published around 30 journal
articles or conference papers, His research interests
focus on digital electronics design for VLSI signal/
image/video processing and more particularly in algorithm architecture co-design for resource-constrained systems.

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
JRIDI et al.: A GENERALIZED ALGORITHM AND RECONFIGURABLE ARCHITECTURE FOR EFFICIENT AND SCALABLE ORTHOGONAL APPROXIMATION OF DCT 9

Ayman Alfalou (SM03) received the Ph.D. degree


in telecommunications and signal processing from
the French National Telecommunication Graduate
Engineering School of Brittany (ENSTB-France)
and the University of Rennes (1), France, in 1999.
He held a Postdoc position for one year at French
National Telecommunications graduate Engineering
School of Brittany (ENSTB-France) (DGA, French
Army), consisting in designing and realizing an
optical compact and high rate correlator. Partners:
THOMSN-lcr-tco-detexis, ENST of Brittany, and
ENSP of Marseille. Since June 2000, he has been a Professor of Telecommunications and Signal Processing at ISEN-Brest (Institut Suprieur de
lElectronique et du Numrique), France. At ISEN, he created the Optical
Signal and Image Processing Laboratory (VISION). His research interests
are Signal processing and image processing, Telecommunications, optical
systems, optical Processing, opt-electronics, Laser, polarization. He has some
Ph.D., Masters, and Engineering students (recognition forms, compression,
encryption, polarization). He has published over 130 refereed journal articles
or conferences on a wide variety of theoretical and experimental topics. He
has organized or co-organized many conferences and special sessions; he
was the chairman or a member of scientific committees of many international
conferences. In September 2006 he passed the HDR degree (Full professor
at ISEN-Brest). He is a Senior Member of OSA, Senior Member of SPIE, and
elected member of the Institute of Physics.

Pramod Kumar Meher (SM03) received the B.Sc.


(Honours) and M.Sc. degree in physics, and the Ph.D.
degree in science from Sambalpur University, India,
in 1976, 1978, and 1996, respectively.
Currently, he is a Senior Research Scientist with
Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications
with Utkal University, India, from 1997 to 2002, and
a Reader in electronics with Berhampur University,
India, from 1993 to 1997. His research interest includes design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and
video processing, communication, bio-informatics and intelligent computing.
He has contributed more than 200 technical papers to various reputed journals
and conference proceedings.
Dr. Meher has served as a speaker for the Distinguished Lecturer Program
(DLP) of IEEE Circuits Systems Society during 2011 and 2012 and Associate
Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSPART II:
EXPRESS BRIEFS during 2008 to 2011, and Associate Editor for the IEEE
TRANSACTIONS ON CIRCUITS AND SYSTEMSPART I: REGULAR PAPERS
during 20122013. Currently, he is serving as Associate Editor for the IEEE
TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and
the Journal of Circuits, Systems, and Signal Processing. Dr. Meher is a Fellow
of the Institution of Electronics and Telecommunication Engineers, India.
He was the recipient of the Samanta Chandrasekhar Award for excellence in
research in engineering and technology for 1999.

Potrebbero piacerti anche