J RealTime Image Proc (2008) 3:217–229 DOI 10.1007/s1155400800878
SPECIAL ISSUE 
G. Seetharaman · B. Venkataramani · G. Lakshminarayanan
Received: 12 July 2007 / Accepted: 19 May 2008 / Published online: 10 June 2008 © Springer-Verlag 2008
Abstract In the literature, techniques such as pipelining and wave-pipelining (WP) are proposed for increasing the operating frequency of a digital circuit. In general, pipelining gives higher speed at the cost of increased area and clock routing complexity. Wave-pipelining, on the other hand, requires less area and less clock routing complexity but allows the circuit to be operated only at moderate speeds. In this paper, a hybrid wave-pipelining scheme is proposed to get the benefits of both techniques. The major contributions of this paper are a proposal for the implementation of 2D DWT using the lifting scheme by adopting hybrid wave-pipelining, and a proposal for automating the choice of the clock frequency and of the clock skew between the input and output registers of a wave-pipelined circuit using built-in self-test (BIST) and system-on-chip (SOC) approaches. In the hybrid scheme, the different lifting blocks are interconnected using pipelining registers and the individual blocks are implemented using WP. To evaluate the proposed schemes, the system for the computation of one level 2D DWT is implemented using the following techniques: pipelining, non-pipelining and hybrid wave-pipelining. The BIST approach is used for the implementation on a Xilinx Spartan-II device. The SOC approach is adopted for implementation on Altera and Xilinx field programmable gate array (FPGA) based SOC kits with Nios II or MicroBlaze soft-core processors. From the implementation results, it is verified that the hybrid WP circuit is faster than the non-pipelined circuit by a factor of 1.25–1.39. The pipelined circuit is in turn faster than the hybrid wave-pipelined circuit by a factor of 1.15–1.38, and this is achieved with an increase in the number of registers by a factor of 1.79–3.15 and an increase in the number of LEs by a factor of 1.11–1.65. The soft-core processor based automation scheme considerably reduces the effort required for the design and testing of the hybrid wave-pipelined circuit. The techniques proposed in this paper are also applicable to ASICs, and the optimization schemes are also applicable to the computation of other image transforms such as DCT and DHT.

G. Seetharaman (corresponding author) · B. Venkataramani · G. Lakshminarayanan
Department of ECE, National Institute of Technology, Tiruchirappalli, India
e-mail: gsraman@nitt.edu

B. Venkataramani
e-mail: bvenki@nitt.edu

G. Lakshminarayanan
e-mail: laksh@nitt.edu
Keywords DWT · Lifting · SOC · Wave-pipelining · Pipelining · Self test
1 Introduction
Programmable logic devices such as FPGAs offer an alternative solution for the computationally intensive functions traditionally performed by digital signal processors with Harvard architecture. The ability to design, fabricate and test application specific integrated circuits (ASICs) as well as FPGAs with gate counts of the order of a few tens of millions has led to the development of complex embedded systems-on-chip. The development of intellectual property (IP) cores for FPGAs for a variety of standard functions, including processors, enables a multimillion gate FPGA to be configured to contain all the components of a complete system. Development tools from
FPGA vendors such as Altera or Xilinx enable the integration of IP cores and user designed custom blocks with soft-core processors such as the MicroBlaze or Nios II processors [1, 2]. Systems designed by integrating IP cores and user designed custom blocks with soft-core processors are far more flexible than those based on hard-core processors, and they can be enhanced with custom hardware to optimize them for a specific application [3]. The increased performance available with SOC based FPGAs makes them well suited for the implementation of area as well as speed intensive image processing applications such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). For example, the study in [4] shows that an FPGA based image processing system is faster by 8–800 times compared to one using a Pentium III processor. For image processing applications, in addition to the DCT, the wavelet transform is increasingly used. It is a part of the joint photographic experts group (JPEG) 2000 standard for still image compression. The VLSI implementation of image encoders with DWT has been addressed in a number of previous works. The implementation of 2D DWT using the lifting scheme and compression using the EZT algorithm is reported in [5], taking advantage of the flexible memory configuration available in FPGAs. In [5], the image is partitioned into subimages of size 32 × 32, and external memory is used for storing the subimages and the transform coefficients. Block RAMs in FPGAs are proposed for storing the subimages and 2D DWT coefficients in [6]. A new multiplier algorithm denoted as the Baugh–Wooley pipelined constant coefficient multiplier (BWPKCM), which combines the KCM with the Baugh–Wooley multiplication algorithm, is proposed in [6] and used there for the study and comparison of the distributed arithmetic algorithm and the lifting scheme [7] for 2D DWT on FPGAs.
Even though pipelining is adopted for high speed applications such as that in [6], pipelined systems have a number of disadvantages such as increased power dissipation, clock routing complexity and clock skew between different parts of the system. Wave-pipelining is one of the circuit design techniques proposed for achieving high speed without the above limitations. A wave-pipelined circuit dispenses with the registers used for storing intermediate results and instead uses the inherent capacitance at the inputs of the various combinational blocks. A number of systems have been implemented using wave-pipelining on ASICs and FPGAs [8, 9]. The concept of wave-pipelining has been described in a number of previous works [10–14]. One of the limitations of wave-pipelined circuits is that their highest operating frequency reduces with the complexity of the circuit or, equivalently, the logic depth [14]. In order to combine the advantages of both pipelining and wave-pipelining, a hybrid scheme is proposed in this paper. A complex circuit is split into a number of smaller circuits, which are pipelined. Each of the smaller circuits is realized using wave-pipelining. The rest of the paper is organized as follows. In Sect. 2, previous work on lifting based 2D DWT with the BW multiplier is reviewed. In Sect. 3, previous work related to wave-pipelining and the challenges involved in the design of wave-pipelined circuits are described. In Sect. 4, automation schemes for wave-pipelined circuits are presented. In Sect. 5, the architecture used and the assumptions made for the implementation of the 2D DWT are presented. In Sect. 6, the implementation results of the pipelined and hybrid wave-pipelined 2D DWT are presented. Section 7 summarizes the conclusions.
2 Review of previous work on lifting based 2D DWT with BW multiplier
The hybrid scheme is proposed to be used for the computation of 2D DWT. The DWT decomposes a signal into different subbands so that the lower frequency subbands have finer frequency resolution and coarser time resolution compared to the higher frequency subbands. A survey of VLSI architectures for the computation of 2D DWT is given in [15]. The 2D DWT may be computed using filter banks. Figure 1 shows how an N × M image can be decomposed using subband decomposition for one level 2D DWT. The samples corresponding to the image pixels are passed through two stages of analysis filters. The elements of the pixel matrix are read row wise and are first processed by the low pass h[n] and high pass g[n] horizontal filters. The transform coefficient matrices are then subsampled by two along the rows to obtain two N × M/2 matrices L1 and H1. Subsequently, the outputs (L1, H1) are processed by low pass and high pass vertical filters to obtain four N/2 × M/2 transform coefficient matrices. Out of these four matrices, denoted as LL1, LH1, HH1 and HL1, respectively, LL1 represents a coarse approximation of the original image [15, 16]. For the two level 2D DWT, the LL1 component is processed by both horizontal and vertical filters and subsampled to obtain four more matrices LL2, LH2, HH2 and HL2. This process is continued until the desired level of subband structure is obtained. The horizontal and vertical filters shown in Fig. 1 may be implemented by adopting the lifting scheme [7], which uses a factorization of the polyphase matrix corresponding to the analysis filter. The main feature of the lifting based DWT scheme is that it breaks up the high pass and low pass wavelet filters into a sequence of smaller filters. This scheme requires about 50% less computation compared to that using
Fig. 1 Subband decomposition of an N × M image
the convolution-based approach [7]. It has other advantages, including "in-place" computation of the DWT, integer-to-integer wavelet transform and symmetric hardware architecture for the computation of both the forward and the inverse transform [15]. In the lifting scheme for a filter bank with low pass and high pass filters of nine and seven taps, respectively, the odd and even input samples are processed by five lifting blocks [a, b, c, d, n (n_1, n_2)] in cascade as shown in Fig. 2, where n_1 and n_2 are scaling blocks. The internal diagrams of the a and b blocks are shown in Figs. 3 and 4. The c and d blocks are obtained by replacing the constants a, b with c, d. In Figs. 3 and 4, since the output from one block is fed as the input to the next block, the maximum rate at which the input can be fed to the system depends on the sum of the delays in all the four stages. The speed may be increased by introducing pipelining at the points indicated by dotted lines in Figs. 3 and 4. In this case, the input rate is determined by the largest delay among all the four blocks. The delay in the individual stages may be reduced further by using a constant coefficient multiplier (KCM), which uses a look up table (LUT) for finding the product of a constant and a variable. The variable is fed as the address to the LUT, which contains the products corresponding to all possible values of the variable. FPGAs normally contain four input LUTs. When a LUT with more inputs is required, it has to be implemented using a number of stages of four input LUTs and adders. For example, a 12 × 12 bit KCM is implemented using three
Fig. 2 Simpliﬁed block diagram of lifting scheme for 9/7 ﬁlter
4 × 12 bit KCMs and two stages of 16 bit adders. The speed of the KCM can be increased by introducing pipelining registers at the outputs of the LUTs and adders. The content of the LUT corresponding to the multiplication of signed numbers can be computed using three approaches: (a) assuming unsigned multiplication and using 2's complement blocks (the resulting multiplier is referred to as the conventional 2's complement multiplier (C2CM)), (b) using sign extension, and (c) using the Baugh–Wooley (BW) multiplier. The pipelined constant coefficient multiplier (PKCM) using the BW content is referred to as BWPKCM and it is
shown to be superior to the other two approaches [6]. Hence, only this multiplier is considered for wave-pipelining in this paper. The detailed diagram of the a block implemented using BWPKCM is shown in Fig. 5. The same scheme can be adopted for the b, c, d, n_1 and n_2 blocks. The dotted lines indicate points where registers may be inserted for pipelining. For wave-pipelining, all the stages are directly connected without registers, and registers are used only at the inputs and outputs. In hybrid wave-pipelining, registers are used between adjacent lifting blocks, and the stages within each individual lifting block are connected without registers.
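The LUT-based KCM described in this section can be sketched in a few lines of Python. This is a minimal behavioural model, assuming unsigned operands: the 12-bit variable is split into three 4-bit groups, each group addresses a 16-entry table of pre-computed products, and the three partial products are combined in two adder stages. The function names and the example constant are illustrative only, and the Baugh–Wooley handling of signed operands is not modelled.

```python
def build_kcm_lut(constant, group_bits=4):
    """Table of constant * v for every value v of one 4-bit input group."""
    return [constant * v for v in range(1 << group_bits)]


def kcm_multiply(x, lut, width=12, group_bits=4):
    """Multiply the unsigned value x by the constant baked into lut."""
    mask = (1 << group_bits) - 1
    # One LUT access per 4-bit group; the shift restores the group's weight.
    partials = [lut[(x >> shift) & mask] << shift
                for shift in range(0, width, group_bits)]
    stage1 = partials[0] + partials[1]   # first adder stage
    return stage1 + partials[2]          # second adder stage


K = 1623                     # hypothetical 12-bit constant, for illustration
lut = build_kcm_lut(K)
print(kcm_multiply(2048, lut) == K * 2048)
```

In hardware, registers placed after the LUT outputs and after each adder stage would give the pipelined variant, while omitting them gives the form used inside a wave-pipelined block.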
3 Review of previous work on wave-pipelining
In this section, the technique used for wave-pipelining the a block in Fig. 5 is considered. An RTL model of a circuit consists of a combinational logic circuit placed between input and output registers. The combinational logic circuit may be considered to be a wave-pipelined circuit if a number of waves are made to propagate through it simultaneously, as shown in Fig. 6 [10]. In other words, at any point of time, a sequence of data is processed in the combinational logic block. In the case of pipelining, only one data item is processed in the combinational logic block at a time. Further, the maximum data rate in the pipelined circuit depends only on D_max, the maximum propagation delay in the combinational logic block. Figure 7 shows the temporal/spatial diagram of combinational logic circuits [11]. If D_min denotes the minimum propagation delay of the signal through the combinational logic block, the maximum data rate of the wave-pipelined circuit depends on
(D_max − D_min). In the case of the a block, D_max corresponds to the processing and propagation delay between the even samples and the a_0 output (this involves three adder delays, one LUT
Fig. 6 Multiple coherent waves of data sent through combinational logic acting as pipeline in WP
Fig. 7 Temporal/spatial diagram of combinational logic circuits
delay and four interconnect delays); D_min corresponds to the processing and propagation delay between the odd samples and the a_0 output (this involves two adder delays and two interconnect delays). Traditionally, in a wave-pipelined circuit, higher speeds are achieved by equalizing D_max and D_min [10]. The output of the wave-pipelined circuit alternates between unstable and stable states. The stable period decreases as the logic depth increases. By adjusting the latching instant at the output register to lie in the stable period, the wave-pipelined circuit can be made to work properly. But, for large logic depths, there may not be any stable period. Hence, adjusting the latching instant by itself may not be adequate for storing the correct result at the output register.
Fig. 5 a block using BWPKCM
For such cases, the clock period has to be increased to increase the stable period. Equalization of path delays, adjustment of the clock period and adjustment of the clock skew are the three tasks carried out for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and altered if required. Layout editors, such as FPGA Editor from Xilinx or Floorplanner from Altera, may be used for this purpose. These tasks are carried out manually in [13, 14]. The wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing due to the difference between the actual delays and the delays calculated by the layout editor. This is because the layout editor considers only the worst case delays, and the actual delays may be significantly different due to fabrication variations. This difference
becomes important as the logic depth of the circuit increases. Hence, in [14], the design is downloaded to the actual FPGA and its operation is checked using a personal computer (PC) based test system. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until correct results are obtained. Designing a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in the next section.
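As a rough numerical illustration of the timing relations in this section, the following Python sketch treats the output of a wave-pipelined block as unstable between D_min and D_max after each input is launched, so that the stable window in each clock period is T − (D_max − D_min). This is an idealized model assumed here for illustration (register setup and hold times and clock uncertainty are ignored), and the function names are hypothetical. The delay values are the D_max and D_min figures quoted for the a block in Sect. 6.1.

```python
import math


def stable_window_ns(t_clk, d_max, d_min):
    """Idealized stable period of the output within one clock period (ns).

    Assumption: the output is unstable from d_min to d_max after each launch.
    A negative result means there is no stable period at this clock period.
    """
    return t_clk - (d_max - d_min)


def waves_in_flight(t_clk, d_max):
    """Approximate number of data waves simultaneously inside the block."""
    return math.ceil(d_max / t_clk)


D_MAX, D_MIN = 15.302, 7.34   # ns, values quoted for the a block

for t_clk in (7.0, 8.4, 10.0, 15.302):
    window = stable_window_ns(t_clk, D_MAX, D_MIN)
    print(f"T = {t_clk} ns: stable window = {window:.3f} ns, "
          f"waves in flight = {waves_in_flight(t_clk, D_MAX)}")
```

Under this model, equalizing the path delays (reducing D_max − D_min) widens the stable window at a given clock period, which is why it is listed as one of the three tuning tasks.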
4 Automation schemes for wave-pipelined circuits
Equalization of the path delays of combinational logic blocks such as the a block is considered first. This cannot be completely automated, as the commercially available synthesis tools do not support the specification of interconnect delays. However, the difference in path delays can be minimized by specifying the physical location of the logic cells (referred to as slices in Xilinx FPGAs) or logic elements used for the implementation, through either the user constraints file (UCF) or the LogicLock feature supported by the FPGA CAD tools [2, 3]. The UCF approach is proposed for Xilinx FPGAs in [14]. The LogicLock feature is adopted for the Altera FPGAs in this paper. The adjustment of the clock skew and clock period can be automated by using a programmable clock generator, a programmable skew generator and a processor. A clock generator using LUTs and interconnects (nets) is proposed for the first time in [14], where the LUTs are programmed as non-inverting buffers and the interconnects are chosen manually using the FPGA layout editor. In this paper, the programmable clock generator uses multiplexers in addition to the LUTs and nets, as shown in Fig. 8. The interconnect delays are selected using the multiplexers. The number of possible interconnect delays (D_i) is restricted to minimize the overhead due to the additional LUTs required for introducing the delays and the multiplexers. Hence, only a coarse variation in the delay values can be achieved, whereas much smaller variations may be achieved using manual routing. In Fig. 8, the inputs C0–C3 are the programmable select inputs, which determine the actual clock frequency. The diagram of the programmable clock skew circuit is given in Fig. 9, where D_i denotes the ith interconnect (net) specifically introduced to vary the delay. In addition to these, there are interconnects between the output of one multiplexer and the input of another multiplexer and also between the LUTs, but their delay values are not controlled by the program. The select inputs S0–S3 are the programmable delay inputs.
Fig. 8 Programmable clock generator
Fig. 9 Programmable clock skew circuit
The clocks required for the wave-pipelined circuit may also be derived using the internal system clock generator of Altera and Xilinx system-on-programmable-chip (SOPC) devices. The maximum operating frequency in this case is limited by the system bus. Alternatively, an external clock may be multiplied by an arbitrary number using the Altera mega core function altclkclock or pllclock. Similarly, in the case of Xilinx Spartan-III family FPGAs, the delay-locked loop (DLL)/digital clock manager (DCM) module may be used for clock multiplication. However, the multiplication factor has to be specified at synthesis time, and hence the clock frequency cannot be altered dynamically as in the scheme given in Fig. 8. The circuit using the programmable clock and skew generator is a suboptimal wave-pipelined circuit, but it can operate at a higher frequency than that reported by the commercially available synthesis tools, which use D_max for fixing the operating frequency. The clock and skew generator may be programmed using either an off-chip processor
or an on-chip processor. In order to minimize the time required for adjusting the parameters of the wave-pipelined circuit (clock frequency and skew), the BIST approach for design for testability [17, 18] may be used. In the BIST approach, a finite state machine (FSM) is assumed to be available off-chip and is used for adjusting the parameters of the wave-pipelined circuit [19, 20]. In the SOC approach, a processor is assumed to be available on-chip and is used for adjusting the parameters of the wave-pipelined circuit.
4.1 BIST approach for wave-pipelined circuit
Testing a large chip requires a large test sequence, and applying these test sequences to the circuit under test (CUT) using external testers is time consuming. The built-in self-test scheme is an alternative for minimizing the testing time. In the BIST scheme, the test sequences are internally generated and applied to the CUT at full speed, and a signature is generated for finding whether the CUT is good or bad. The block diagram of a wave-pipelined circuit with BIST is given in Fig. 10. This is obtained by adding the FSM block and the self-test circuit to the wave-pipelined circuit. The self-test circuit contains the programmable clock generator, the clock skew generator, the signature analyzer and the test vector RAM.
4.1.1 FSM block
The flow chart given in Fig. 11 describes the function performed by the FSM. In Fig. 11, {T_i : i = 0, 1, 2, …, N − 1} and {d_j : j = 0, 1, 2, …, M − 1} denote the sets of clock periods and clock skews, respectively. The FSM block generates the control signal to choose between the normal mode and the self-test mode, and this is applied to the select
input of the multiplexer. In the self-test mode, the FSM systematically varies the clock skews and clock periods. For each clock frequency and skew, the self-test circuit generates the test inputs, applies them, generates the signature, compares it with the expected result and finally generates a flag indicating whether they match. The FSM continues testing until it finds the frequency at which the circuit under test works for at least three skew values. The operating skew value is chosen to be the middle value so that the CUT works reliably even if the delays change due to environmental conditions. For example, in Fig. 7, when the skew is chosen so that it corresponds to either t_2 or t_2', the circuit would work reliably during its entire lifetime. In order to minimize the time required to determine the correct values of clock skew and clock period, a two step procedure is adopted. First, the clock frequency is varied in large steps to determine the range of frequencies in which the circuit works. This is achieved by varying only the two higher order bits of the select inputs of the programmable clock. After the range is determined, fine tuning is achieved by varying the lower order bits. For every frequency at which the circuit is tested, the clock skew is varied gradually, the results are tested for correctness, and the clock skews for which the circuit works satisfactorily are noted. The testing time can be minimized by using an optimal test vector set and a signature analyzer [17].
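The two step search performed by the FSM can be sketched in Python as follows. This is only a behavioural sketch under stated assumptions, not the authors' FSM: the clock and skew select inputs are taken to be 4 bits each, a larger clock select value is assumed to give a longer clock period, and the hypothetical predicate `passes(clk_sel, skew_sel)` stands for one complete self-test run (apply the test vectors, compare the signature). The requirement that the working skew values be consecutive is also an assumption of this sketch.

```python
def longest_passing_run(passes, clk_sel, n_skews=16):
    """Longest run of consecutive skew settings for which the test passes."""
    best, run = [], []
    for skew_sel in range(n_skews):
        if passes(clk_sel, skew_sel):
            run.append(skew_sel)
            if len(run) > len(best):
                best = list(run)
        else:
            run = []
    return best


def tune(passes, low_bits=2, high_bits=2, n_skews=16, min_skews=3):
    """Return (clk_sel, skew_sel) for the fastest usable setting, or None."""
    band_size = 1 << low_bits

    def usable(clk_sel):
        run = longest_passing_run(passes, clk_sel, n_skews)
        return run if len(run) >= min_skews else None

    # Coarse step: vary only the two higher order bits, testing the
    # slowest setting of each band (all lower order bits set to 1).
    for band in range(1 << high_bits):
        if usable(band * band_size + band_size - 1):
            # Fine step: sweep the lower order bits, fastest setting first.
            for clk_sel in range(band * band_size, (band + 1) * band_size):
                run = usable(clk_sel)
                if run:
                    return clk_sel, run[len(run) // 2]   # middle skew value
    return None


def mock_passes(clk_sel, skew_sel):
    """Hypothetical circuit: works from clock setting 6 on, skews 8 to 10."""
    return clk_sel >= 6 and 8 <= skew_sel <= 10


print(tune(mock_passes))   # (6, 9)
```

In the SOC approach of Sect. 4.2, the same loop could run as software on the on-chip processor, with `passes` reading the outputs back and comparing them directly with the expected values.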
4.1.2 Signature generator
For testing the correctness of the circuit, N test vectors may be fed one after another and the N outputs obtained should be compared with the expected outputs. In order to minimize the number of comparisons, a unique signature is generated out of the N outputs and it is compared with the
Fig. 10 Self-tuned wave-pipelined circuit
Fig. 11 Flowchart of FSM operation
signature corresponding to the expected outputs. The signature generator consists of a pseudo random binary sequence (PRBS) generator with multiple data inputs [17], as shown in Fig. 12. Each successive output of the output register is XORed with the state of the PRBS generator to produce the next state. If the test vector set consists of N vectors, the PRBS generator output contains the signature after the application of N clock pulses. However, due to the propagation delay in
Fig. 12 Signature generator
the random access memory (RAM), I/O registers and the combinational logic block, the time at which signature generation begins should be delayed with respect to the time at which the application of test vectors begins. The delay depends on the depth of the combinational logic blocks.
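A signature generator of this kind can be modelled in software as a multiple-input signature register. The Python sketch below is illustrative only: the 11-bit width matches the word size used later for the DWT coefficients, but the feedback taps, the seed and the function name are assumptions and are not taken from Fig. 12.

```python
def misr_signature(outputs, width=11, taps=0b101, seed=0):
    """Compress a sequence of output words into a single signature.

    On every clock the register is shifted left by one bit with feedback
    (PRBS step), and the current output word is XORed into the state.
    """
    mask = (1 << width) - 1
    state = seed & mask
    for word in outputs:
        msb = (state >> (width - 1)) & 1
        state = (state << 1) & mask
        if msb:
            state ^= taps            # feedback taps (assumed choice)
        state ^= word & mask         # fold in the circuit output
    return state


# Stand-in data: 64 arbitrary 11-bit "expected outputs" of the CUT.
expected = [(37 * i + 11) % 2048 for i in range(64)]
faulty = list(expected)
faulty[10] ^= 0b100                  # a single wrong bit in one output word

print(misr_signature(expected) == misr_signature(list(expected)))
print(misr_signature(faulty) == misr_signature(expected))
```

Only one comparison per test run is then needed: the final register contents against the signature computed in advance from the expected outputs.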
4.1.3 Test vector generation
In principle, the number of test vectors required for an M input combinational logic circuit is 2^M. If the value of M is small, exhaustive testing of the circuit may be carried out by generating the test inputs through an M bit counter and checking the signature after the counter completes one full cycle. However, some of the inputs may contribute more to D_max than the others. For example, in the case of multipliers, the maximum propagation delay occurs only when the MSBs of both operands are 1. If the multiplier works for this case, it will work for the other cases where at least one of the MSBs is zero. Hence, an (M − 2) bit counter is adequate for testing. For circuits with a large number of inputs, exhaustive testing would require a very large testing time. A minimal test vector set, which reduces the testing time without compromising the quality of fault detection, may be obtained using automatic test pattern generator (ATPG) algorithms [17]. Computer-aided design tools may also be used for generating the minimal test vectors using an ATPG algorithm and assessing their fault coverage ratio. However, the generation of test patterns for a wave-pipelined circuit is nontrivial because data dependent delays have to be accounted for (the delay for 001 is different from that for 101) [11], and this is compounded by the absence of accurate models for interconnects in FPGAs. Since the conventional ATPG techniques are not applicable to wave-pipelined circuits, we have to be content with random test vectors. By choosing different test vector sets consisting of different combinations and different orderings of test vectors, the confidence level can be improved.
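The reduction from an M bit counter to an (M − 2) bit counter for a multiplier can be illustrated with a short Python generator. It is a sketch under the assumption that the M inputs are two operands of equal width; the function name is hypothetical. The counter supplies all bits except the two MSBs, which are held at 1.

```python
def multiplier_test_vectors(operand_bits):
    """Yield (a, b) operand pairs with both MSBs forced to 1.

    With M = 2 * operand_bits inputs, an (M - 2)-bit counter drives the
    remaining bits, giving 2**(M - 2) vectors instead of 2**M.
    """
    free_bits = operand_bits - 1
    msb = 1 << free_bits
    for count in range(1 << (2 * free_bits)):
        a = msb | (count >> free_bits)        # upper half of the counter
        b = msb | (count & (msb - 1))         # lower half of the counter
        yield a, b


vectors = list(multiplier_test_vectors(4))    # M = 8 inputs
print(len(vectors), 2 ** 8)                   # 64 vectors instead of 256
```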
4.2 SOC approach for wave-pipelined circuits
As mentioned in Sect. 4.1, the BIST approach requires a number of overheads such as FSM, signature generator and test vector RAM. These blocks are useful only when the clock frequency and skew are to be varied. If the operating
frequency is chosen so that the stable period in Fig. 7 is at least twice the worst case variation in the delay due to temperature, neither the clock frequency nor the skew needs to be adjusted again. After this initial selection, the 2D DWT blocks require no further tuning and work satisfactorily without any external intervention. Instead of using a dedicated circuit such as BIST, a processor may be used to carry out the above tuning task. For example, an FPGA based speech recognition system with SOC may perform the various tasks required by optimally partitioning them between hardware and software [21]. The tasks performed in software use the on-chip processor. The hardware block may use wave-pipelining and may be tuned by the on-chip processor at the beginning. For the SOC approach, the PRBS generator and signature comparator blocks in Fig. 8 may be replaced by a block RAM which is used to store the outputs of the CUT corresponding to the test inputs. Since the communication interface between the on-chip processor and the circuit under test is faster, the outputs can be directly read and compared with the expected outputs for every combination of skew and clock frequency. The flow chart in Fig. 11 can be modified accordingly. The select inputs for the clock as well as the skew blocks and the data inputs to the wave-pipelined circuit may be applied and varied through the on-chip processor. A variety of choices exist for the implementation of SOC. The SOC may consist of a hard-core processor such as a PowerPC or ARM processor and an FPGA coprocessor or DSP block. Alternatively, it may consist of a soft-core processor such as Nios II or MicroBlaze and a custom DSP block implemented in the FPGA. In this paper, FPGA based SOCs consisting of either the Nios II or the MicroBlaze soft-core processor are used for the implementation. Figure 13 shows the interface diagram of a Nios II processor along with the custom block (hybrid wave-pipelined circuit).
5 Architecture for the computation of 2D DWT using lifting scheme
The automation schemes proposed in the previous section are used for tuning the hybrid scheme for 2D DWT. The details of the architecture used and the assumptions made about the individual blocks of the 2D DWT are presented in this section. Subimages of size 32 × 32 with 8 bits per pixel are used for the computation. The DWT coefficients are assumed to be represented using 11 bits. Each pixel is extended to 11 bits by adding three zeros at the most significant positions. This is done so that the inputs to the horizontal and vertical filters have the same word size, which enables the same hardware or program to be reused for the computation of
Fig. 13 Adding custom logic to the Nios II ALU
the outputs of both horizontal and vertical filters. The inputs to the horizontal filter are the pixel intensity values, whereas the inputs to the vertical filters are DWT coefficients. The lifting multiplier constants (a, b, c, d, n) are assumed to be of 11 bits each. The block diagram of one level 2D DWT is shown in Fig. 14. For the horizontal filters, the even and odd inputs are applied from two block RAMs of size 512 × 11. The result is written into four block RAMs of size 256 × 11. For the vertical filters, the inputs are applied from these four block RAMs and the outputs are written into another four block RAMs. For testing, the image is assumed to be loaded into the block RAMs using a memory initialization file (MIF).
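The block RAM depths quoted above follow directly from the subimage size, as the following Python sketch shows. It is only an illustration of the data layout assumed here: the pixels of each row are split into even and odd indexed samples, and each 8-bit pixel is zero-extended to an 11-bit word. The image contents and the helper name are made up for the example.

```python
N = 32           # subimage is N x N pixels
PIXEL_BITS = 8
WORD_BITS = 11

# Stand-in test image with 8-bit pixel values.
image = [[(r * N + c) % 256 for c in range(N)] for r in range(N)]


def zero_extend(pixel):
    """Extend an 8-bit pixel to an 11-bit word (three leading zeros)."""
    assert 0 <= pixel < (1 << PIXEL_BITS)
    return pixel & ((1 << WORD_BITS) - 1)   # value unchanged, wider word


# Even and odd sample streams, read row wise.
even_ram = [zero_extend(row[c]) for row in image for c in range(0, N, 2)]
odd_ram = [zero_extend(row[c]) for row in image for c in range(1, N, 2)]

print(len(even_ram), len(odd_ram))   # 512 words each, i.e. 512 x 11 bits
```

Each of the two RAMs thus holds half of the 1,024 pixels of a 32 × 32 subimage.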
5.1 Block diagram of two level 2D DWT
The block diagram of two level 2D DWT is shown in Fig. 15. In order to minimize the area required for implementation, the horizontal filter and the vertical filters are reused to compute the multilevel 2D DWT. Block RAMs E1 and O1 contain the even and odd streams of the initial data to be transformed. Block RAMs E2E/E2O, E3, E4 and E5 hold the output of one level 2D DWT. The even and odd numbered coefficients of the LL1 component are stored in two block RAMs, E2E and E2O, and are used as inputs for the 2nd level DWT. The outputs of the 2nd level DWT are stored in block RAMs E6, E7, E8 and E9. The output of the horizontal filter is stored in four block RAMs E10, O10,
Fig. 14 Overall block diagram of one level 2D DWT
Fig. 15 Block diagram of two level 2D DWT
E11, O11. If only LL2, the low pass band corresponding to the two level DWT, is required, only one demultiplexer and seven block RAMs (E1, E2E, O1, E2O, E10, O10, E3) are required. For the purpose of verification, only LL2 is computed and compared for the different schemes of computation of two level 2D DWT. For the computation of the LL1 component of one level 2D DWT, only block RAMs E1, O1, E2E, E2O, E10 and O10 are used.
5.2 Overlapping scheme for the computation of 2D DWT of complete image
The architecture proposed in Sect. 5.1 for subimages of size 32 × 32 may be used for the computation of the 2D DWT of a larger image by splitting it into a number of overlapping subimages of size 32 × 32. The advantage of splitting the image into a number of subimages is that the 2D DWT can be computed in parallel in a number of computational engines. Further, it also reduces the memory required for storing the image and its transform. In the overlapping scheme, the image blocks are formed such that the number of pixels overlapped between adjacent blocks along the vertical and horizontal directions is equal to the order of the filter. For example, for the 9/7 biorthogonal filter used for the 2D DWT, the number of overlap pixels between horizontal blocks should be equal to four on the left and four on the right. Similarly, the number of overlap pixels between vertical blocks should be equal to four on the top and four on the bottom. For the blocks on the boundary, overlapping needs to be done only on the non-boundary edges.
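One possible reading of the overlapping scheme is sketched below in Python for a single image dimension. The interpretation is an assumption made for this example: every 32-pixel block extends four pixels beyond its core region on each non-boundary side, so that adjacent blocks share eight pixels and block origins advance in steps of 24. The function names are hypothetical, and image sizes that do not fit this pattern are simply rejected.

```python
def tile_starts(length, tile=32, margin=4):
    """Start offsets of overlapping tiles along one image dimension."""
    stride = tile - 2 * margin
    if length < tile or (length - tile) % stride != 0:
        raise ValueError("length does not fit a whole number of tiles")
    return list(range(0, length - tile + 1, stride))


def core_range(start, length, tile=32, margin=4):
    """Pixels [lo, hi) for which this tile's transform output is kept.

    The margin is dropped only on non-boundary edges.
    """
    lo = start if start == 0 else start + margin
    hi = length if start + tile == length else start + tile - margin
    return lo, hi


starts = tile_starts(128)
print(starts)
print([core_range(s, 128) for s in starts])
```

The same offsets would be used along rows and columns, and each resulting 32 × 32 block could then be processed by a separate computational engine.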
6 Implementation results
In order to demonstrate the applicability of the automation approaches for both Xilinx and Altera FPGAs, the 2D DWT is implemented using both Xilinx Spartan and Altera Cyclone FPGAs and the results are presented in this section. In each of the FPGAs, the 2D DWT is computed using three multiplication schemes: hybrid WPP BWKCM, non-pipelined BWKCM and BWPKCM.
6.1 Programmable clock and skew generators
The operating frequency of the wavepipelined circuit is expected to lie between those of the nonpipelined and pipelined circuits. Hence, the minimum and maximum frequencies of the clock generator should correspond to the maximum operating frequencies of the nonpipelined and pipelined circuits, respectively. The approximate values of the clock periods of these circuits for the implementation of the b block on Cyclone FPGA are 5.6 and 7.4 ns, respectively. The values of D_{max} and D_{min} for the a block are 15.302 and 7.34 ns, respectively. The programmable clock and skew generators are designed such that the clock period can be varied from 8.4 to 20.6 ns in steps of 0.8 ns and the skew can be varied from 12.3 to 26.2 ns in steps of 0.9 ns, approximately. The same exercise is carried out for the b, c and d blocks using the synthesis report. A single clock generator is used for all four blocks, while a separate skew generator is used for each block. To remove glitches in the clock signal, the use of a “Majority Logic Gate” is suggested in [23]. The operating frequency and skews are chosen using the FSM such that all the blocks work satisfactorily. A similar procedure is adopted for the implementation on the other two FPGAs.

The location of the logic elements and the interconnects used for the implementation of the clock and skew blocks should be fixed so that the interconnect delays are not altered when these blocks are integrated with the 2D DWT or the softcore processor. This is achieved by using the Logic lock feature in Altera. In the case of Xilinx FPGAs, this is achieved by using Macros.
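The range of the programmable generators can be checked against the usual first-order wave-pipelining requirement that the clock period exceed the spread between the longest and shortest path delays, D_{max} − D_{min}. The Python sketch below lists the nominal settings implied by the quoted ranges and steps and checks them against that bound for the a block. The helper name `nominal_settings` is hypothetical, the step sizes are treated as exact although the text gives them as approximate, and register setup and hold times and clock uncertainty are ignored.

```python
def nominal_settings(start_ns, stop_ns, step_ns):
    """Values start, start + step, ... not exceeding stop.

    The arithmetic is done in integer units of 0.1 ns so that repeated
    floating-point addition does not drift.
    """
    tenths = range(round(start_ns * 10), round(stop_ns * 10) + 1, round(step_ns * 10))
    return [t / 10 for t in tenths]

D_MAX_NS, D_MIN_NS = 15.302, 7.34      # delays quoted in the text for the a block

clock_periods = nominal_settings(8.4, 20.6, 0.8)
skews = nominal_settings(12.3, 26.2, 0.9)

# First-order lower bound on the clock period (setup/hold margins ignored).
min_period = D_MAX_NS - D_MIN_NS
usable = [t for t in clock_periods if t > min_period]

print(len(clock_periods), len(skews), round(min_period, 3), len(usable))
```

Under these assumptions each generator offers sixteen nominal settings, and even the shortest clock period of 8.4 ns lies above the bound of about 7.96 ns.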
6.2 Implementation of 2D DWT using BIST approach
The one level 2D DWT is implemented on Xilinx Spartan II XC2S100 FPGA using BIST approach. It may be noted that the BIST approach is also applicable for Altera
Table 1 Implementation of 9/7 biorthogonal filters with 11 × 8 multipliers using the various schemes

| Multiplier (BWKCM) | Slices | Number of registers | Speed (MHz) |
|---|---|---|---|
| Nonpipelined | 253 | 176 | 57.3 |
| Pipelined | 453 | 803 | 149.18 |
| Hybrid WPP | 253 | 176 | 75.75 |
FPGAs. A personal computer (PC) is used for the realization of the FSM. The interface used between the PC and the FPGA is the same as that described in [14]. The 11-bit output of the hybrid circuit is XORed with the output of the 11-bit PRBS generator and the signature is obtained.

The implementation results of the 9/7 horizontal filters for one level 2D DWT on Xilinx SpartanII XC2S100 FPGA are given in Table 1. Multipliers of size 11 × 8 are implemented. From Table 1, it may be concluded that, for the filter, the method using hybrid WPP BWKCM is faster than nonpipelined BWKCM by a factor of 1.32 and requires the same area. The pipelined BWPKCM is in turn faster than the hybrid WPP BWKCM by a factor of 1.97, and this is achieved with an increase in the number of registers by a factor of 4.6 and an increase in the number of slices by a factor of 1.79.

The implementation results of one level 2D DWT for a subimage of size 32 × 32 using the BIST approach are shown in Table 2. In order to make the horizontal and vertical filters identical, multipliers of size 11 × 11 are used for both of them. Three zeros are appended before the input samples as discussed in Sect. 5. The overheads required for the wavepipelined circuits are also shown in Table 2. It may be noted that the overhead required is about 22.5%. From Table 2, it may be concluded that, for the lifting scheme, the method using hybrid WPP BWKCM is faster than nonpipelined BWKCM by a factor of 1.4 and requires the same area. The pipelined BW(P) KCM is in turn faster than the hybrid WPP BWKCM by a factor of 1.2, and this is achieved with an increase in the number of registers by a factor of 2.73 and an increase in the number of slices by a factor of 1.32.
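The signature step of the BIST approach can be illustrated with a multiple-input signature register (MISR), in which an 11-bit LFSR is advanced once per sample and XORed with the 11-bit circuit output. The Python sketch below is only one plausible reading of the description above: the feedback polynomial x^11 + x^9 + 1, the seed, and the names `lfsr_step` and `signature` are assumptions, since the text does not specify them.

```python
WIDTH = 11
MASK = (1 << WIDTH) - 1

def lfsr_step(state):
    """Advance an 11-bit Fibonacci LFSR by one shift.

    The taps (stages 11 and 9) are an assumed choice, not taken from the paper.
    """
    feedback = ((state >> 10) ^ (state >> 8)) & 1
    return ((state << 1) | feedback) & MASK

def signature(outputs, seed=1):
    """Compact a stream of 11-bit output words into one 11-bit signature."""
    state = seed
    for word in outputs:
        # Shift the register, then fold the circuit output in with XOR.
        state = lfsr_step(state) ^ (word & MASK)
    return state

golden = signature([5, 1023, 42, 2047, 0, 7])
faulty = signature([5, 1023, 43, 2047, 0, 7])   # one bit flipped in the third word
print(golden != faulty)
```

In such a scheme the signature obtained from the circuit under test would be compared with a reference signature computed from known-good outputs. Because each step of this register is an invertible linear map, a single corrupted word always changes the final signature, although multiple errors can in principle cancel.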
Table 3 shows the implementation results for the two level 2D DWT for the pipelined scheme. The implementation of the hybrid WP two level 2D DWT using the BIST approach is in progress.
6.3 Implementation of 2D DWT using SOC approach
For the hybrid WP block, the optimal clock period and clock skews are determined using the procedure described in Sect. 6.1. The hybrid wavepipelined 2D DWT unit (obtained by adding the input and output block RAMs to the nonpipelined circuit along with the programmable clock and clock skew blocks) is first tested using simulation. As mentioned in Sect. 3, simulation is inadequate to test the hybrid wavepipelined circuit. Hence, this circuit is implemented along with the Nios II or Micro blaze softcore processor, to which it is added as a custom block using SOPC builder or embedded design kit (EDK) builder, respectively. The program to be executed by the Nios II or Micro blaze is written in C/C++ and the custom block is invoked as a function in the program. A C++ program is written to read from and write to the block RAM in the custom block. The program is compiled and the executable code, along with the configuration bits corresponding to the Nios II or Micro blaze integrated with the custom block, is downloaded to the FPGA. When the program is run, it systematically varies the select inputs for the clock and clock skew blocks, and uploads the content of the output block RAM. The clock and skew are adjusted until a match occurs for at least three consecutive clock skews. The operating
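Table 2 Implementation results on one level 2D DWT

| Approach and device | Lifting scheme | Number of slices or LEs (1 slice = 2 LUTs) | Number of registers | Speed (MHz) |
|---|---|---|---|---|
| BIST approach with SpartanII XC2S100PQ2085 | Nonpipelining | 836 | 611 | 54.45 |
| BIST approach with SpartanII XC2S100PQ2085 | Pipelining | 1,110 | 1,670 | 87.54 |
| BIST approach with SpartanII XC2S100PQ2085 | Hybrid WPP | 836 ^a (188) | 611 ^a (85) | 75.75 |
| SOC approach with CycloneII EP2C35F672C6 | Nonpipelining | 703 | 375 | 117.83 |
| SOC approach with CycloneII EP2C35F672C6 | Pipelining | 782 | 671 | 203.92 |
| SOC approach with CycloneII EP2C35F672C6 | Hybrid WPP | 703 ^a (30) | 375 ^a (8) | 147.5 |
| SOC approach with SpartanIII XC3S200FT2564 | Nonpipelining | 897 | 730 | 67.9 |
| SOC approach with SpartanIII XC3S200FT2564 | Pipelining | 1,381 | 2,305 | 114.19 |
| SOC approach with SpartanIII XC3S200FT2564 | Hybrid WPP | 897 ^a (32) | 730 ^a (8) | 82.6 |

^a Denotes additional overhead for testing WP circuits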
Table 3 Area and speed performance of two level forward 2D DWT on Xilinx SpartanII XC2S200PQ2085

| Lifting scheme | Slices used | Speed (MHz) | Number of registers |
|---|---|---|---|
| Pipelined | 1,511 | 61.42 | 2,506 |
clock and clock skew of the wavepipelined circuit are then fixed at the middle value and, from then on, the custom block works without any intervention from the Nios II or Micro blaze processor.
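The tuning procedure described above can be sketched as a simple search. In this Python sketch, `outputs_match(clk, skew)` is a hypothetical callback standing in for the whole hardware round trip (writing the select inputs, running the custom block, uploading the output block RAM and comparing it with the expected result). Trying the clock settings in index order, with index 0 as the first period tried, and picking the middle of the longest run of matching skews are assumptions of the sketch, not details confirmed by the text.

```python
def longest_run(flags):
    """Return (start, length) of the longest run of True values in flags."""
    best_start, best_len = 0, 0
    start = None
    for i, ok in enumerate(flags):
        if ok:
            if start is None:
                start = i
            if i - start + 1 > best_len:
                best_start, best_len = start, i - start + 1
        else:
            start = None
    return best_start, best_len

def tune(outputs_match, n_clock, n_skew, min_run=3):
    """Find a clock setting with at least min_run consecutive good skews.

    Returns (clock_index, skew_index) with the skew taken from the middle
    of the matching run, or None if no clock setting qualifies.
    """
    for clk in range(n_clock):
        flags = [outputs_match(clk, skew) for skew in range(n_skew)]
        start, length = longest_run(flags)
        if length >= min_run:
            return clk, start + length // 2
    return None

# Stand-in for the hardware: only clock settings 2 and above work,
# and only for skew settings 5 to 8.
def fake_hardware(clk, skew):
    return clk >= 2 and 5 <= skew <= 8

print(tune(fake_hardware, 16, 16))
```

Choosing the middle of the run leaves a margin of at least one skew step on either side, which is the apparent purpose of requiring three consecutive matches.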
6.3.1 Implementation results on one level 2D DWT using CycloneII EP2C35F672C6
The one level 2D DWT is implemented on CycloneII EP2C35F672C6 with and without pipelining. A single filter is implemented and time shared for the computation of the outputs of both the horizontal and vertical filters. The 2D DWT block is added as a custom block to the Nios II CPU and downloaded to the CycloneII. The 2D DWT is also computed using the inbuilt instruction set of Nios II [22]. The numbers of CPU clock cycles for both cases are tabulated in Table 4. (The clock frequency obtained using the above device is 40 MHz.) For the hybrid wavepipelined circuit, the number of logic elements, number of registers, maximum operating frequency and power dissipated are computed using CycloneII FPGA and the results are given in Table 2. It may be noted that the overhead required for the wavepipelined circuit is about 4%. From Table 2, it may be concluded that, for the lifting scheme, the method using the hybrid WPP BWKCM is faster than nonpipelined BWKCM by a factor of 1.25. The scheme with the Baugh–Wooley Pipelined Constant Coefficient Multiplier is in turn faster than the hybrid WPP BWKCM by a factor of 1.38, and this is achieved with an increase in the number of registers by a factor of 1.78 and an increase in the number of LEs by a factor of 1.11.

Pipelining may be used either for increasing the operating frequency of a circuit or for reducing the power dissipation [12]. Pipelining requires more registers and area, but this does not automatically lead to more power dissipation. In order to assess whether the hybrid wavepipelining is superior with regard to power dissipation, both the hybrid wavepipelined and pipelined circuits are operated at the
Table 4 Computation time for 2D DWT
same frequency (corresponding to the maximum operating frequency of the hybrid wavepipelined circuit) and the power dissipated for the two approaches is given in Table 5. From Table 5, it may be noted that the pipelined circuit dissipates 11% less power than the hybrid wavepipelined 2D DWT.
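The figures in Tables 4 and 5 translate into a speed-up and a power saving with a little arithmetic. The Python sketch below does this under the assumption that the 40 MHz clock quoted above is the CPU clock to which both cycle counts refer; the variable names are illustrative, and the power values are used in whatever unit the table reports.

```python
CPU_CLOCK_HZ = 40e6            # clock frequency quoted in the text
SOFTWARE_CYCLES = 73_280       # Table 4, software approach
CUSTOM_BLOCK_CYCLES = 814      # Table 4, custom block

# Ratio of cycle counts and the corresponding wall-clock times.
speedup = SOFTWARE_CYCLES / CUSTOM_BLOCK_CYCLES
software_time_ms = SOFTWARE_CYCLES / CPU_CLOCK_HZ * 1e3
custom_time_us = CUSTOM_BLOCK_CYCLES / CPU_CLOCK_HZ * 1e6

POWER_PIPELINED = 158.97       # Table 5, pipelined circuit
POWER_HYBRID = 179.58          # Table 5, hybrid circuit

# Fraction by which the pipelined circuit undercuts the hybrid circuit.
saving = 1 - POWER_PIPELINED / POWER_HYBRID

print(f"speed-up {speedup:.1f}x: {software_time_ms:.3f} ms vs {custom_time_us:.2f} us")
print(f"pipelined circuit dissipates {saving:.1%} less power than the hybrid circuit")
```

The custom block is thus roughly 90 times faster than the software approach in cycle count, and the computed power saving is consistent with the figure of about 11% quoted above.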
6.3.2 Implementation results on one level 2D DWT using SpartanIII XC3S200
Implementation results for one level 2D DWT on Xilinx SpartanIII XC3S200 using all three approaches are given in Table 2. The programmable clock and clock skew blocks are implemented as Macro blocks using Xilinx ISE 8.1i project navigator. For tuning the hybrid wavepipelined circuit, the Micro blaze softcore processor is used. Xilinx EDK software is used to integrate the custom block with the Micro blaze processor. The rest of the steps are similar to those used for the Altera SOC kit. For all three schemes, the number of logic elements, number of registers and maximum operating frequency are computed and the results are given in Table 2. It may be noted that the overhead required for the wavepipelined circuit is about 3.5%. It may also be noted that dedicated filters are used for the computation of the outputs of the horizontal and vertical filters. Hence, the area required for this scheme is higher than that using Cyclone II devices. From Table 2, it may be concluded that, for the lifting scheme, the method using hybrid WPP BWKCM is faster than nonpipelined BWKCM by a factor of 1.21. The scheme with the Baugh–Wooley Pipelined Constant Coefficient Multiplier is in turn faster than the hybrid WPP BWKCM by a factor of 1.38, and this is achieved with an increase in the number of registers by a factor of 3.15 and an increase in the number of LEs by a factor of 1.53.
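The ratios quoted for the three devices can be reproduced directly from Table 2. The Python sketch below recomputes them; the dictionary layout and the names `TABLE2` and `ratios` are illustrative only, and the computed values agree with the quoted factors to within rounding or truncation of the second decimal.

```python
# (slices or LEs, registers, speed in MHz) per scheme, copied from Table 2.
TABLE2 = {
    "SpartanII": {"non": (836, 611, 54.45), "pipe": (1110, 1670, 87.54), "hybrid": (836, 611, 75.75)},
    "CycloneII": {"non": (703, 375, 117.83), "pipe": (782, 671, 203.92), "hybrid": (703, 375, 147.5)},
    "SpartanIII": {"non": (897, 730, 67.9), "pipe": (1381, 2305, 114.19), "hybrid": (897, 730, 82.6)},
}

def ratios(device):
    """Speed, register and area ratios between the schemes for one device."""
    non, pipe, hybrid = (TABLE2[device][k] for k in ("non", "pipe", "hybrid"))
    return {
        "hybrid_vs_non_speed": hybrid[2] / non[2],
        "pipe_vs_hybrid_speed": pipe[2] / hybrid[2],
        "pipe_vs_hybrid_registers": pipe[1] / hybrid[1],
        "pipe_vs_hybrid_area": pipe[0] / hybrid[0],
    }

for device in TABLE2:
    print(device, {name: round(value, 2) for name, value in ratios(device).items()})
```

The overhead figures excluded here (the values marked ^a in Table 2) would raise the hybrid area and register counts slightly if they were included in the comparison.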
6.4 Validation of the scheme for 2D DWT
To verify the correctness of the schemes proposed for the computation of the 2D DWT, the Lena image of size 128 × 128 with blocks (subimages) of size 32 × 32 pixels is used. The 128 × 128 image is shown in Fig. 16 and is obtained by subsampling the standard image of size 512 × 512 by a factor of four along both dimensions. As mentioned in Sect. 5.2, an overlap of four pixels is used between adjacent blocks. A total of 36 image blocks are used for the
Table 4 Computation time for 2D DWT

| Function | Number of CPU clock cycles for software approach | Equivalent CPU clock cycles for custom block |
|---|---|---|
| 2D DWT | 73,280 | 814 |

Table 5 Power dissipated by pipelined and hybrid wavepipelined one level 2D DWT at normalized frequency

| Description of the circuit | Pipelined circuit | Hybrid circuit |
|---|---|---|
| Power at normalized frequency including additional overhead | 158.97 | 179.58 |
Fig. 16 LL1 component compared with input image
128 × 128 image. The 2D DWT of the image is computed both in software, using the high-level language C, and in hardware, using the FPGA. For the implementation in C, the lifting multiplier constants (a, b, c, d, n_{1}, n_{2}) and the filter coefficients for the distributed arithmetic algorithm are declared as “double” type (64 bits) variables. The pixel intensities are declared as “short” type (16 bits) variables. The analysis filter outputs corresponding to the 36 image blocks are merged suitably and the LL1 component of the image is shown in Fig. 16.

The implementation of the forward 2D DWT for an image block of size 32 × 32 is carried out for the lifting scheme with BWhybrid WPPKCM. For the implementation, the Xilinx Spartan XC2S100PQ2085 device is used. The block RAMs are configured suitably for storing the image input, the outputs of the horizontal filter and the outputs of the vertical filters. The image is loaded into the block RAMs through the UCF of the implementation tool. The one level 2D DWT is computed using the above scheme for all 36 image blocks and the results are merged suitably. The LL1 component of the image is shown in Fig. 16.

From these figures, it may be concluded that the LL1 components obtained through the FPGA implementation match well with those obtained using C. The LL1 components also match well with the original image. In order to make a quantitative comparison of the LL1 component with the original image, the original LENA image is subsampled to a size of 64 × 64. Treating the LL1 component itself as the compressed image, the PSNR values of the compressed images using BWhybrid WPKCM and C are computed and are found to be 28.22 and 33.33 dB, respectively.
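The PSNR figures can be computed with the standard definition, PSNR = 10 log10(peak² / MSE), where MSE is the mean squared difference between the two images. The following Python sketch assumes 8-bit pixels (peak value 255) and that the LL1 component has been scaled to the same range as the subsampled 64 × 64 reference; neither detail is stated in the text, and the function name `psnr` is illustrative.

```python
import math

def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized 2-D lists of pixel values."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for ref_pixel, test_pixel in zip(ref_row, test_row):
            squared_error += (ref_pixel - test_pixel) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf            # identical images
    return 10 * math.log10(peak ** 2 / mse)

# Toy example: a flat 64 x 64 image and a copy offset by 10 grey levels.
reference = [[100] * 64 for _ in range(64)]
shifted = [[110] * 64 for _ in range(64)]
print(round(psnr(reference, shifted), 2))
```

A higher value means a closer match, so the figures above indicate that the C result, computed in double precision, is closer to the reference than the fixed-point FPGA result.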
7 Conclusion
Two automation schemes are proposed in this paper for the implementation of the 9/7 biorthogonal filters using the hybrid WPP constant coefficient multiplier with the Baugh–Wooley multiplication algorithm. Nios II and Micro blaze softcore processors are integrated with the 2D DWT blocks successfully and the optimum clock period and clock skews for the 2D DWT blocks are selected using them. After this initial selection, the 2D DWT blocks work satisfactorily without any external intervention and the processors are free to do other tasks.

The 9/7 biorthogonal filters are implemented on both Xilinx and Altera devices using the lifting scheme with the following three multipliers: BWPKCM, BWKCM and hybrid WPP BWKCM. From the implementation results, it is verified that hybrid WPP BWKCM is faster than nonpipelined BWKCM. The scheme with BWPKCM is in turn faster than the hybrid WPP BWKCM, and this is achieved with an increase in the number of registers and in the number of LEs. The custom instruction for 2D DWT is found to be faster than the implementation using C. The correctness of the procedure for the computation of the 2D DWT of an image, using the 2D DWT of subimages, is verified by computing the 2D DWT using both hardware and software approaches (using C) and displaying the LL1 components for an image of size 128 × 128.

The automation schemes proposed in this paper have also been successfully employed in [23] for the implementation of wavepipelined filters using the distributed arithmetic algorithm and a sine wave generator using CORDIC. Work on the computation of multilevel 2D DWT and real time computation of 2D DWT using the hybrid scheme is in progress. One of the challenges in the design of FPGA based wavepipelined circuits is the accurate modeling of the interconnect and device delays and their temperature dependence. In the absence of these models, the wavepipelined circuits can only be operated at moderate speeds.
References
1. Xilinx documentation library, Xilinx Corporation, USA
2. Altera documentation library 2003, Altera Corporation, USA
3. Sheldon, D., Kumar, R., Vahid, F., Tullsen, D., Lysecky, R.: Conjoining softcore FPGA processors. In: IEEE/ACM International Conference on Computer Aided Design, 2006, ICCAD, pp. 694–701 (2006)
4. Draper, B.A., Beveridge, J.R., Willem Bohm, A.P., Ross, C., Chawathe, M.: Accelerated image processing on FPGAs. IEEE Trans. Image. Process. 12(12), 1543–1551 (2003)
5. Ritter, J., Molitor, P.: A pipelined architecture for partitioned DWT based lossy image compression using FPGA’s. In: Proceedings ACM Conference FPGA 2001, pp. 201–206 (2001)
6. Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K., Sriram, G.: Design and FPGA implementation of image block encoders with 2DDWT. Proceedings TENCON 2003. 3, 1015–1019 (2003)
7. Daubechies, I., Sweldens, W.: Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl. 4, 247–269 (1998)
8. Nyathi, J., DelgadoFrias, J.G.: A hybrid wavepipelined network router. IEEE Trans. Circuits Syst. I, Fundam. Theory Appl. 49(12), 1764–1772 (2002)
9. Hauck, O., Katoch, A., Huss, S.A.: VLSI system design using asynchronous wave pipelines: a 0.35 μm CMOS 1.5 GHz elliptic curve public key cryptosystem chip. In: Proceedings of the sixth international symposium on advanced research in asynchronous circuits and systems 2000 (ASYNC 2000), pp. 188–197 (2000)
10. Burleson, W.P., Ciesielski, M., Klass, F., Liu, W.: Wavepipelining: a tutorial and research survey. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 6(3), 464–474 (1998)
11. Gray, T.C., Liu, W., Cavin, R.K., III.: WavePipelining: Theory and CMOS Implementation. Kluwer, Boston (1994)
12. Parhi, K.K.: VLSI Signal Processing Systems. Wiley, New York (1999)
13. Boemo, E.I., LopezBuedo, S., Meneses, J.M.: Wavepipelines via lookup tables. IEEE Int. Symp. Circuits Syst. (ISCAS ’1996). 4, 85–88 (1996)
14. Lakshminarayanan, G., Venkataramani, B.: Optimization techniques for FPGA based wavepipelined DSP blocks. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 13(7), 783–793 (2005)
15. Acharya, T., Tsai, P.S.: JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures. Wiley, New York (2005)
16. Sayood, K.: Introduction to Data Compression. Morgan Kaufmann, Menlo Park (2000). An Imprint of Elsevier
17. Smith, M.J.S.: Application Speciﬁc Integrated Circuits. Pearson Education Asia Pvt. Ltd, Singapore (2003)
18. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of self tuned wavepipelined filters. IETE J. Res. 52(4), 281–286 (2006)
19. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of wavepipelined image block encoders using 2DDWT. In: Proceedings of VLSI design and test symposium VDAT 2005, pp. 12–20 (2005)
20. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of wavepipelined distributed arithmetic based ﬁlters. In: Proceedings of VLSI Design and Test workshop VDAT 2004, pp. 216–220 (2004)
21. Amudha, V., Venkataramani, B., Vinoth Kumar, R., Ravishankar, S.: SOC Implementation of HMM Based Speaker Independent Isolated Digit Recognition System. In: 20th International IEEE Conference on VLSI Design (VLSID’07), pp. 1–6 (2007)
22. Seetharaman, G., Venkataramani, B., Amudha, V., Saundattikar, A.: System on chip implementation of 2D DWT using lifting scheme. In: Proceedings of the International Asia and South Paciﬁc Conference on Embedded SOCs (ASPICES 2005), (2005)
23. Seetharaman, G., Venkataramani, B.: SOC implementation of wavepipelined circuits. In: Proceedings of IEEE International Conference on Field Programmable Technology 2007 (ICFPT 2007), pp. 9–16 (2007)
Author Biographies
G. Seetharaman received his B.E. and M.E. degrees in Electronics and Communication Engineering from Regional Engineering College, Tiruchirappalli in 1997 and 2002, respectively. Presently, he is carrying out his doctoral thesis work in the National Institute of Technology, Tiruchirappalli. Previously, he worked as faculty in the Jayaram College of Engineering and Technology, Tiruchirappalli, for 6 years and as Research Associate for three semesters in the National Institute of Technology, Tiruchirappalli. Presently, he is working as Laboratory Engineer in National Institute of Technology, Tiruchirappalli. His current research interests include embedded system design using field-programmable gate arrays (FPGAs) and system-on-chip (SOC).
B. Venkataramani received his B.E. degree in Electronics and Communication Engineering from Regional Engineering College, Tiruchirappalli in 1979 and M.Tech. and Ph.D. degrees in Electrical Engineering from Indian Institute of Technology, Kanpur in 1984 and 1996, respectively. He worked as Deputy Engineer in Bharat Electronics Limited, Bangalore, India, and as a Research Engineer in Indian Institute of Technology, Kanpur, each for approximately 3 years. Since 1987 he has been a faculty member of the National Institute of Technology (formerly Regional Engineering College), Tiruchirappalli. Presently, he is working as Professor and Head of the Department of Electronics and Communication in National Institute of Technology. He has published two books and numerous papers in journals and international conferences. His current research interests include FPGA applications and SOC based system design and performance analysis of high speed computer networks.
G. Lakshminarayanan received his M.E. and Ph.D. degrees in Electronics and Communication Engineering from Bharathidasan University, Tiruchirappalli in 1995 and 2005, respectively. He previously worked as a Service Engineer for 5 years and as a scientist and Research Associate for 4 years in Regional Engineering College, Tiruchirappalli. He was a faculty member in SASTRA, Tanjore, for two semesters and an Assistant Professor in Saranathan College of Engineering, Tiruchirappalli, for 1 year. Presently he is working as Assistant Professor in National Institute of Technology, Tiruchirappalli. His current research interests include FPGA based system design and VLSI front end design.