
J Real-Time Image Proc (2008) 3:217–229 DOI 10.1007/s11554-008-0087-8

SPECIAL ISSUE

 

Automation techniques for implementation of hybrid wave-pipelined 2D DWT

G. Seetharaman · B. Venkataramani · G. Lakshminarayanan

Received: 12 July 2007 / Accepted: 19 May 2008 / Published online: 10 June 2008 © Springer-Verlag 2008

Abstract In the literature, techniques such as pipelining and wave-pipelining (WP) are proposed for increasing the operating frequency of a digital circuit. In general, pipelining achieves higher speed at the cost of increased area and clock routing complexity. WP, on the other hand, reduces clock routing complexity and area but allows the digital circuit to be operated only at moderate speeds. In this paper, a hybrid wave-pipelining scheme is proposed to get the benefits of both pipelining and WP techniques. The major contributions of this paper are a proposal for the implementation of 2D DWT using the lifting scheme by adopting hybrid wave-pipelining, and a proposal for automating the choice of clock frequency and clock skew between the input and output registers of a wave-pipelined circuit using built-in self test (BIST) and system-on-chip (SOC) approaches. In the hybrid scheme, different lifting blocks are interconnected using pipelining registers and the individual blocks are implemented using WP. To evaluate the schemes proposed in this paper, the system for the computation of one level 2D DWT is implemented using the following techniques: pipelining, non-pipelining and hybrid wave-pipelining. The BIST approach is used for the implementation on a Xilinx Spartan-II device. The SOC approach is adopted for implementation on Altera and Xilinx field programmable gate array (FPGA) based SOC kits with Nios II or MicroBlaze soft-core processors. From the implementation results, it is verified that the hybrid WP circuit is faster than the non-pipelined circuit by a factor of 1.25–1.39. The pipelined circuit is in turn faster than the hybrid wave-pipelined circuit by a factor of 1.15–1.38, and this is achieved with an increase in the number of registers by a factor of 1.79–3.15 and an increase in the number of LEs by a factor of 1.11–1.65. The soft-core processor based automation scheme considerably reduces the effort required for the design and testing of the hybrid wave-pipelined circuit. The techniques proposed in this paper are also applicable to ASICs. The optimization schemes proposed in this paper are also applicable to the computation of other image transforms such as DCT and DHT.

G. Seetharaman (✉) · B. Venkataramani · G. Lakshminarayanan
Department of ECE, National Institute of Technology, Tiruchirappalli, India
e-mail: gsraman@nitt.edu

B. Venkataramani
e-mail: bvenki@nitt.edu

G. Lakshminarayanan
e-mail: laksh@nitt.edu

Keywords DWT · Lifting · SOC · Wave-pipelining · Pipelining · Self test

1 Introduction

Programmable logic devices such as FPGAs offer an alternative solution for the computationally intensive functions performed traditionally by digital signal processors with Harvard architecture. The ability to design, fabricate and test application specific integrated circuits (ASICs) as well as FPGAs with gate counts of the order of a few tens of millions has led to the development of complex embedded systems-on-chip. The development of intellectual property (IP) cores for FPGAs for a variety of standard functions, including processors, enables a multimillion gate FPGA to be configured to contain all the components of a complete system. Development tools from


FPGA vendors such as Altera or Xilinx enable the integration of IP cores and user-designed custom blocks with soft-core processors such as the MicroBlaze or Nios II processors [1, 2]. Systems designed by integrating IP cores and user-designed custom blocks with soft-core processors are far more flexible than those based on hard-core processors, and they can be enhanced with custom hardware to optimize them for a specific application [3]. The increased performance available with SOC based FPGAs makes them well suited for the implementation of area as well as speed intensive image processing applications such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). For example, the study in [4] shows that an FPGA based image processing system is faster by 8–800 times compared to one using a Pentium III processor. For image processing applications, in addition to DCT, the wavelet transform is increasingly used. It is a part of the joint photographic experts group (JPEG) 2000 standard for still image compression. The VLSI implementation of image encoders with DWT has been addressed in a number of previous works. The implementation of 2D DWT using the lifting scheme and compression using the EZT algorithm is reported in [5], taking advantage of the flexible memory configuration available in FPGAs. In [5], the image is partitioned into sub-images of size 32 × 32, and external memory is used for storing the sub-images and the transform coefficients. Block RAMs in FPGAs are proposed for storing the sub-images and 2D DWT coefficients in [6]. A new multiplier algorithm denoted as the Baugh–Wooley pipelined constant coefficient multiplier (BW-PKCM), which combines the KCM with the Baugh–Wooley multiplication algorithm, is proposed in [6] and used for the study and comparison of the distributed arithmetic algorithm and the lifting scheme [7] for 2D DWT on FPGAs.
Even though pipelining is adopted for high speed applications such as that in [6], pipelined systems have a number of disadvantages such as increased power dissipation, clock routing complexity and clock skews between different parts of the system. Wave-pipelining is one of the circuit design techniques proposed for achieving high speed without the above limitations. A wave-pipelined circuit dispenses with the need for registers for storing the intermediate results and instead uses the inherent capacitance at the input to the various combinatorial blocks. A number of systems have been implemented using wave-pipelining on ASICs and FPGAs [8, 9]. The concept of wave-pipelining has been described in a number of previous works [10–14]. One of the limitations of wave-pipelined circuits is that their highest operating frequency reduces with the complexity of the circuit or, equivalently, the logic depth [14]. In order to combine the advantages of both pipelining and wave-pipelining, a hybrid scheme is proposed in this paper. A complex circuit is split into a number of smaller circuits and is pipelined. Each of the smaller circuits is realized using wave-pipelining.

The organization of the rest of the paper is as follows: in Sect. 2, previous work on lifting based 2D DWT with the BW multiplier is reviewed. In Sect. 3, the previous work related to wave-pipelining and the challenges involved in the design of wave-pipelined circuits are described. In Sect. 4, automation schemes for wave-pipelined circuits are presented. In Sect. 5, the architecture used and the assumptions made for the implementation of the 2D DWT are presented. In Sect. 6, the implementation results of the pipelined and hybrid wave-pipelined 2D DWT are presented. Section 7 summarizes the conclusions.

2 Review of previous work on lifting based 2D DWT with BW multiplier

The hybrid scheme is proposed to be used for the computation of 2D DWT. The DWT decomposes a signal into different sub-bands so that the lower frequency sub-bands have finer frequency resolution and coarser time resolution compared to the higher frequency sub-bands. A survey of VLSI architectures for the computation of 2D DWT is given in [15]. The 2D DWT may be computed using filter banks. Figure 1 shows how an N × M image can be decomposed using sub-band decomposition for one level 2D DWT. The samples corresponding to the image pixels are passed through two stages of analysis filters. The elements of the pixel matrix are read row wise and are first processed by the low pass h[n] and high pass g[n] horizontal filters. The transform coefficient matrices are then sub-sampled by two along the rows to obtain two N × M/2 matrices L1 and H1. Subsequently, the outputs (L1, H1) are processed by low pass and high pass vertical filters to obtain four N/2 × M/2 transform coefficient matrices. Out of these four matrices, denoted as LL1, LH1, HH1 and HL1, respectively, LL1 represents a coarse approximation of the original image [15, 16]. For the two level 2D DWT, the LL1 component is processed by both horizontal and vertical filters and sub-sampled to obtain four more matrices LL2, LH2, HH2 and HL2. This process is continued until the desired level of sub-band structure is obtained. The horizontal and vertical filters shown in Fig. 1 may be implemented by adopting the lifting scheme [7], which uses a factorization scheme for the poly-phase matrix corresponding to the analysis filter. The main feature of the lifting based DWT scheme is to break up the high pass and low pass wavelet filters into a sequence of smaller filters. This scheme requires about 50% less computational complexity compared to that using

Fig. 1 Sub-band decomposition of an N × M image

the convolution-based approach [7]. It has other advantages, including "in-place" computation of the DWT, integer to integer wavelet transform and symmetric hardware architecture for the computation of both forward and inverse transforms [15]. In the lifting scheme for a filter bank with low pass and high pass filters of nine and seven taps, respectively, the odd and even input samples are processed by five lifting blocks [a, b, c, d, n ($n_1$, $n_2$)] in cascade as shown in Fig. 2; $n_1$ and $n_2$ are scaling blocks. The internal diagrams of the a and b blocks are shown in Figs. 3 and 4. The c and d blocks are obtained by replacing the constants a, b with c, d. In Figs. 3 and 4, since the output from one block is fed as the input to the next block, the maximum rate at which the input can be fed to the system depends on the sum of the delays in all the four stages. The speed may be increased by introducing pipelining at the points indicated by dotted lines in Figs. 3 and 4. In this case, the input rate is determined by the largest delay among all the four blocks. The delay in the individual stages may be reduced further by using a constant coefficient multiplier (KCM), which uses a look up table (LUT) for finding the product of a constant and a variable. The variable is fed as the address to the LUT, which contains the products corresponding to all possible values of the variable. FPGAs normally contain four input LUTs. When a LUT with a larger number of inputs is required, it has to be implemented using a number of stages of four input LUTs and adders. For example, a 12 × 12 bit KCM is implemented using three

Fig. 2 Simplified block diagram of lifting scheme for 9/7 filter

Fig. 3 a Block

Fig. 4 b Block

4 × 12 bit KCMs and two stages of 16 bit adders. The speed of the KCM can be increased by introducing pipelining registers at the outputs of the LUTs and adders. The content of the LUT corresponding to multiplication of signed numbers can be computed using three approaches: (a) assuming unsigned multiplication and using 2's complement blocks (the resulting multiplier is referred to as the conventional 2's complement multiplier, C2CM), (b) using sign extension, and (c) using the Baugh–Wooley (BW) multiplier. The pipelined constant coefficient multiplier (PKCM) using the BW content is referred to as BW-PKCM and it is


shown to be superior compared to the other two approaches [6]. Hence, only this multiplier is considered for wave-pipelining in this paper. The detailed diagram of the a block implemented using BW-PKCM is shown in Fig. 5. The same scheme can be adopted for the b, c, d, $n_1$, $n_2$ blocks. The dotted lines indicate points where registers may be inserted for pipelining. For wave-pipelining, all the stages are directly connected without registers. The registers are used only at the inputs and outputs. In hybrid wave-pipelining, registers are used between adjacent lifting blocks, and the stages inside each individual lifting block are connected without registers.
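The LUT-based constant coefficient multiplier described in this section can be illustrated with a minimal software sketch. It assumes an unsigned 12 bit variable split into three 4 bit nibbles, each addressing a 16-entry table of products, followed by two adder stages. The constant `K`, the function names and the unsigned-only treatment are assumptions made for illustration; the signed Baugh–Wooley LUT contents used by BW-PKCM are not modelled here.

```python
def build_nibble_lut(constant):
    """Return a 16-entry table holding constant * n for every 4-bit value n."""
    return [constant * n for n in range(16)]


def kcm_multiply(x, lut):
    """Multiply an unsigned 12-bit x by the constant baked into lut.

    Each 4-bit nibble of x addresses the table (modelling a four input LUT);
    the partial products are aligned by shifting and summed in two stages.
    """
    if not 0 <= x < (1 << 12):
        raise ValueError("x must be an unsigned 12-bit value")
    low = lut[x & 0xF]
    mid = lut[(x >> 4) & 0xF]
    high = lut[(x >> 8) & 0xF]
    partial = low + (mid << 4)      # first adder stage
    return partial + (high << 8)    # second adder stage


K = 1234  # hypothetical constant, chosen only for the example
lut = build_nibble_lut(K)
product = kcm_multiply(100, lut)
```

In a pipelined version, registers would sit after the table look-ups and after each adder stage; in the wave-pipelined version those stages are connected directly.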

3 Review of previous work on wave-pipelining

In this section, the technique used for wave-pipelining the a block in Fig. 5 is considered. An RTL model of a circuit consists of a combinational logic circuit placed between input and output registers. The combinational logic circuit may be considered to be a wave-pipelined circuit if a number of waves are made to propagate through it simultaneously, as shown in Fig. 6 [10]. In other words, at any point of time, a sequence of data is being processed in the combinational logic block. In the case of pipelining, only one data item is processed in the combinational logic block at a time. Further, the maximum data rate in the pipelined circuit depends only on $D_{\max}$, the maximum propagation delay in the combinational logic block. Figure 7 shows the temporal/spatial diagram of combinational logic circuits [11]. If $D_{\min}$ denotes the minimum propagation delay of the signal through the combinational logic block, the maximum data rate of the wave-pipelined circuit depends on $(D_{\max} - D_{\min})$. In the case of the a block, $D_{\max}$ corresponds to the processing and propagation delay between the even samples and the a′ output (this involves three adder delays, one LUT

Fig. 6 Multiple coherent waves of data sent through combinational logic acting as pipeline in WP

Fig. 7 Temporal/spatial diagram of combinational logic circuits

delay and four interconnect delays); $D_{\min}$ corresponds to the processing and propagation delay between the odd samples and the a′ output (this involves two adder delays and two interconnect delays). Traditionally, in a wave-pipelined circuit, higher speeds are achieved by equalizing $D_{\max}$ and $D_{\min}$ [10]. The output of the wave-pipelined circuit alternates between unstable and stable states. The stable period decreases as the logic depth increases. By adjusting the latching instant at the output register to lie in the stable period, the wave-pipelined circuit can be made to work properly. However, for large logic depths, there may not be any stable period. Hence, adjusting the latching instant by itself may not be adequate for storing the correct result at the output register.

Fig. 5 a Block using BW-PKCM

For such cases, the clock period has to be increased to lengthen the stable period. Equalization of path delays, adjustment of the clock period and adjustment of the clock skew are the three tasks carried out for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and altered if required. Layout editors, such as FPGA Editor from Xilinx or Floorplanner from Altera, may be used for this purpose. These tasks are carried out manually in [13, 14]. The wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because of the difference between the actual delays and the delays calculated by the layout editor. This is because the layout editor considers only the worst case delays, and the actual delays may be significantly different due to fabrication variations. This difference

becomes important as the logic depth of the circuit increases. Hence, the design is downloaded to the actual FPGA and its operation is checked using a personal computer (PC) based test system in [14]. If correct results are not obtained, delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until the correct results are obtained. The design of a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in the next section.
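The dependence of the data rate on $D_{\max}$ and $(D_{\max} - D_{\min})$ discussed in this section can be made concrete with a small calculation. The sketch below is an idealized lower bound only: register setup and hold times, clock skew and data dependent delay variations are lumped into a single hypothetical `overhead_ns` term that defaults to zero. The delay values are those reported later in the paper for the a block on the Cyclone FPGA; the function names are assumptions.

```python
def min_period_conventional(d_max_ns, overhead_ns=0.0):
    """Idealized minimum clock period when only one wave is in flight."""
    return d_max_ns + overhead_ns


def min_period_wave_pipelined(d_max_ns, d_min_ns, overhead_ns=0.0):
    """Idealized minimum clock period when several waves are in flight."""
    if d_min_ns > d_max_ns:
        raise ValueError("d_min_ns must not exceed d_max_ns")
    return (d_max_ns - d_min_ns) + overhead_ns


# Delays of the a block as reported in Sect. 6.1 (in ns).
D_MAX_NS = 15.302
D_MIN_NS = 7.34

conventional = min_period_conventional(D_MAX_NS)
wave = min_period_wave_pipelined(D_MAX_NS, D_MIN_NS)
ideal_gain = conventional / wave  # upper limit on the speed-up in this model
```

Under these assumptions the wave-pipelined bound is about 7.96 ns against 15.3 ns for the conventional case, which illustrates why equalizing the path delays raises the achievable clock rate.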

4 Automation schemes for wave-pipelined circuits

Equalization of the path delays of combinational logic blocks such as the a block is considered first. This cannot be completely automated, as the commercially available synthesis tools do not support the specification of interconnect delays. However, the difference in path delays can be minimized by specifying the physical location of the logic cells (referred to as slices in Xilinx FPGAs) or logic elements used for the implementation, through either the user constraints file (UCF) or the LogicLock feature supported by the FPGA CAD tools [2, 3]. The UCF approach is proposed for Xilinx FPGAs in [14]. The LogicLock feature is adopted for the Altera FPGAs in this paper. The adjustment of the clock skew and clock period can be automated by using a programmable clock generator, a programmable skew generator and a processor. A clock generator using LUTs and interconnects (nets) is proposed for the first time in [14] (the LUTs are programmed as non-inverting buffers). The interconnects are manually chosen using the FPGA layout editor in [14]. The programmable clock proposed in this paper uses multiplexers in addition to the LUTs and nets, as shown in Fig. 8. The interconnect delays are selected using the multiplexers. The number of possible interconnect delays ($D_i$) is restricted to minimize the overheads due to the multiplexers and the additional LUTs required for the introduction of the delay. Hence, only a coarse variation in the delay values can be achieved. Using manual routing, much smaller variations in the delay may be achieved. In Fig. 8, inputs C0–C3 are the programmable select inputs, which determine the actual clock frequency. The diagram of the programmable clock skew circuit is given in Fig. 9. In Fig. 9, $D_i$ denotes the ith interconnect (net) specifically introduced to vary the delay. In addition to this, there are interconnects between the output of one multiplexer and the input of another multiplexer and also between the LUTs, but their delay values are not controlled by the program. The select inputs S0–S3 are the programmable delay inputs.

Fig. 8 Programmable clock generator

Fig. 9 Programmable clock skew circuit

The clocks required for the wave-pipelined circuit may also be derived using the internal system clock generator of Altera and Xilinx system-on-programmable chip (SOPC) devices. The maximum operating frequency in this case is limited by the system bus. Alternatively, an external clock may be multiplied by an arbitrary number using the Altera mega core function altclkclock or pllclock. Similarly, in the case of Xilinx Spartan-III family FPGAs, the delay-locked loop (DLL)/digital clock manager (DCM) module may be used for clock multiplication. However, the multiplication factor has to be specified at synthesis time, and hence the clock frequency cannot be dynamically altered as in the scheme given in Fig. 8. The circuit using the programmable clock and skew generator is a suboptimal wave-pipelined circuit, but it can operate at a higher frequency than that reported by the commercially available synthesis tools, which use $D_{\max}$ for fixing the operating frequency. The clock and skew generator may be programmed using either off-chip processor


or on-chip processor. In order to minimize the time required for adjustment of the parameters of the wave-pipelined circuit (clock frequency and skew), the BIST approach for design for testability [17, 18] may be used. In the BIST approach, a finite state machine (FSM) is assumed to be available off-chip and is used for adjustment of the parameters of the wave-pipelined circuit [19, 20]. In the SOC approach, a processor is assumed to be available on-chip and it is used for adjustment of the parameters of the wave-pipelined circuit.

4.1 BIST approach for wave-pipelined circuit

Testing a large chip requires a large test sequence, and applying these test sequences to the circuit under test (CUT) using external testers is time consuming. The built-in self test scheme is an alternative for minimizing the testing time. In the BIST scheme, the test sequences are internally generated and applied to the CUT at full speed, and a signature is generated for finding whether the circuit is good or bad. The block diagram of a wave-pipelined circuit with BIST is given in Fig. 10. This is obtained by adding the FSM block and the self-test circuit to the wave-pipelined circuit. The self-test circuit contains the programmable clock, clock skew generator, signature analyzer and test vector RAM.

4.1.1 FSM block

The flow chart given in Fig. 11 describes the function performed by the FSM. In Fig. 11, $\{T_i : i = 0, 1, \ldots, 2^N - 1\}$ and $\{d_j : j = 0, 1, \ldots, 2^M - 1\}$ denote the sets of clock periods and clock skews, respectively. The FSM block generates the control signal to choose between the normal mode and the self test mode and this is applied to the select

input of the multiplexer. In the self test mode, the FSM systematically varies the clock skews and clock periods. For each clock frequency and skew, the self test circuit generates the test inputs, applies them, generates the signature, compares it with the expected result and finally generates a flag indicating the match. The FSM continues testing until it finds a frequency at which the circuit under test works for at least three skew values. The operating skew value is chosen to be the middle value so that the CUT works reliably even if the delays change due to environmental conditions. For example, in Fig. 7, when the skew is chosen so that it corresponds to either $t_2$ or $t_2'$, the circuit would work reliably during its entire lifetime. In order to minimize the time required to determine the correct values of clock skew and clock period, a two step procedure is adopted. The clock frequencies are first varied in large steps to determine the range of frequency in which the circuit works. This is achieved by varying only the two higher order bits of the select inputs of the programmable clock. After the range is determined, fine tuning is achieved by varying the lower order bits. For every frequency at which the circuit is tested, the clock skews are varied gradually, the results are tested for correctness, and the clock skews for which the circuit works satisfactorily are noted. The testing time can be minimized by using an optimal test vector set and a signature analyzer [17].
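The two step search performed by the FSM can be sketched in software as follows. This is a minimal model under stated assumptions, not the FSM of Fig. 11: `passes(clock_code, skew_code)` is a hypothetical callback standing for one self-test run (apply vectors, compare signature), clock code 0 is assumed to select the shortest period with larger codes selecting longer periods, the circuit is assumed to keep working once the period is long enough, and the three or more working skew values are required to be consecutive codes.

```python
def longest_run(values):
    """Return the longest run of consecutive integers in a sorted list."""
    best, current = [], []
    for v in values:
        if current and v == current[-1] + 1:
            current.append(v)
        else:
            current = [v]
        if len(current) > len(best):
            best = list(current)
    return best


def tune(passes, clock_bits=4, skew_bits=4, min_window=3):
    """Return (clock_code, skew_code) for the fastest usable clock, or None."""
    n_codes = 1 << clock_bits
    n_skews = 1 << skew_bits
    coarse_step = 1 << (clock_bits - 2)  # only the two MSBs change

    def middle_skew(code):
        working = [s for s in range(n_skews) if passes(code, s)]
        run = longest_run(working)
        return run[len(run) // 2] if len(run) >= min_window else None

    # Step 1: coarse scan, fastest setting first.
    hit = None
    for code in range(0, n_codes, coarse_step):
        if middle_skew(code) is not None:
            hit = code
            break

    # Step 2: fine scan over the lower order bits around the coarse result.
    if hit is None:
        fine = range(n_codes - coarse_step + 1, n_codes)
    else:
        fine = range(max(0, hit - coarse_step + 1), hit + 1)
    for code in fine:
        skew = middle_skew(code)
        if skew is not None:
            return code, skew
    return None


# Hypothetical circuit: works for clock codes >= 6 and skew codes 5..9.
result = tune(lambda c, s: c >= 6 and 5 <= s <= 9)
```

Picking the middle of the working skew window leaves margin on both sides, which is the stated reason for requiring at least three working skew values.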

4.1.2 Signature generator

For testing the correctness of the circuit, N test vectors may be fed one after another and the N outputs obtained should be compared with the expected outputs. In order to minimize the number of comparisons, a unique signature is generated out of the N outputs and it is compared with the

Fig. 10 Self-tuned wave-pipelined circuit

Fig. 11 Flowchart of FSM operation

signature corresponding to the expected outputs. The signature generator consists of a pseudo random binary sequence (PRBS) generator with multiple data inputs [17], as shown in Fig. 12. Each successive value of the output register is XORed with the state of the PRBS generator to produce the next state. If the test vector set consists of N vectors, the PRBS generator output contains the signature after the application of N clock pulses. However, due to the propagation delay in

Fig. 12 Signature generator

the random access memory (RAM), I/O registers and the combinational logic block, the time at which signature generation begins should be delayed with respect to the time at which the application of test vectors begins. The delay depends on the depth of the combinational logic blocks.
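A PRBS generator with multiple data inputs behaves like a multiple-input signature register, which can be modelled in a few lines. The sketch below is illustrative only: the 11 bit width matches the coefficient width used later in the paper, but the feedback mask `TAPS`, the zero seed and the `skip` parameter (standing for the delay before signature generation begins) are assumptions, not values taken from Fig. 12.

```python
WIDTH = 11      # register width in bits (assumed to match the output word)
TAPS = 0b101    # hypothetical feedback mask; bit 0 must be set


def misr_step(state, data, width=WIDTH, taps=TAPS):
    """Advance the register by one clock and fold in one output word."""
    mask = (1 << width) - 1
    msb = (state >> (width - 1)) & 1
    state = (state << 1) & mask
    if msb:
        state ^= taps              # feedback of the shifted-out bit
    return state ^ (data & mask)   # XOR the output word into the state


def signature(outputs, skip=0, seed=0):
    """Compress a sequence of output words into one signature.

    The first `skip` words are ignored, modelling the latency of the RAM,
    I/O registers and combinational logic before valid outputs appear.
    """
    state = seed
    for word in outputs[skip:]:
        state = misr_step(state, word)
    return state


expected = [5, 17, 1023, 2047, 0, 300]
golden = signature(expected)
faulty = signature([5, 17, 1022, 2047, 0, 300])  # one corrupted word
```

Only the final signature is compared with the stored golden value, so a run of N outputs needs a single comparison instead of N.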

4.1.3 Test vector generation

In principle, the number of test vectors required for an M input combinational logic circuit is $2^M$. If the value of M is small, exhaustive testing of the circuit may be carried out by generating the test inputs through an M bit counter and checking the signature after the counter completes one full cycle. However, some of the inputs may contribute more to $D_{\max}$ than the others. For example, in the case of multipliers, the maximum propagation delay occurs only when the MSBs of the operands are 1. If the multiplier works for this case, it will work for the other cases where at least one of the MSBs is zero. Hence, an (M − 2) bit counter is adequate for testing. For circuits with a large number of inputs, exhaustive testing would require a very large testing time. A minimal test vector set, which reduces the testing time without compromising the quality of fault detection, may be obtained using automatic test pattern generator (ATPG) algorithms [17]. Computer-aided design tools may also be used for generating the minimal test vectors using an ATPG algorithm and assessing their fault coverage ratio. However, the generation of test patterns for a wave-pipelined circuit is non-trivial because data dependent delays have to be accounted for (the delay for 001 is different from that for 101) [11], and this is compounded by the absence of accurate models for interconnects in FPGAs. Since the conventional ATPG techniques are not applicable to wave-pipelined circuits, we have to be content with random test vectors. By choosing different test vector sets consisting of different combinations and different orderings of test vectors, the confidence level can be improved.
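The reduced counter and the reordered vector sets mentioned above can be sketched as follows. The example assumes a two operand multiplier with `width` bits per operand (so M = 2 × width inputs), fixes both MSBs to 1 and sweeps the remaining M − 2 bits with a counter. The function names, the operand packing and the seeded shuffling are assumptions for illustration.

```python
import random


def multiplier_test_vectors(width):
    """Yield (a, b) operand pairs with both MSBs forced to 1."""
    free_bits = width - 1
    msb = 1 << free_bits
    # An (M - 2)-bit counter, where M = 2 * width is the total input count.
    for count in range(1 << (2 * free_bits)):
        a = msb | (count >> free_bits)
        b = msb | (count & (msb - 1))
        yield a, b


def shuffled_sets(vectors, n_sets, seed=0):
    """Return n_sets copies of the vector list, each in a different order."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        order = list(vectors)
        rng.shuffle(order)
        sets.append(order)
    return sets


vectors = list(multiplier_test_vectors(4))   # 4-bit operands, M = 8
orderings = shuffled_sets(vectors, n_sets=3)
```

Because the delay of a wave-pipelined circuit depends on the preceding data as well as the current data, replaying the same vectors in several orders exercises different transitions.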

4.2 SOC approach for wave-pipelined circuits

As mentioned in Sect. 4.1, the BIST approach requires a number of overheads such as FSM, signature generator and test vector RAM. These blocks are useful only when the clock frequency and skew are to be varied. If the operating


frequency is chosen so that the stable period in Fig. 7 is at least twice the worst case variation in the delay due to temperature, neither the clock frequency nor the skew needs to be adjusted again. After this initial selection, the 2D DWT blocks require no further tuning and work satisfactorily without any external intervention. Instead of using a dedicated circuit such as BIST, a processor may be used to carry out the above tuning task. For example, an FPGA based speech recognition system with SOC may perform the various tasks required by optimally partitioning them between hardware and software [21]. The tasks performed in software use the on-chip processor. The hardware block may use wave-pipelining, and it may be tuned by the on-chip processor at the beginning. For the SOC approach, the PRBS generator and signature comparator blocks in Fig. 8 may be replaced by a block RAM which is used to store the outputs of the CUT corresponding to the test inputs. Since the communication interface between the on-chip processor and the circuit under test is faster, the outputs can be directly read and compared with the expected output for every combination of skew and clock frequency. The flow chart in Fig. 11 can be modified accordingly. The select inputs for the clock as well as skew blocks and the data inputs to the wave-pipelined circuit may be applied and varied through the on-chip processor. A variety of choices exist for the implementation of the SOC. The SOC may consist of a hard-core processor such as a PowerPC or ARM processor and an FPGA coprocessor or DSP block. Alternatively, it may consist of a soft-core processor such as Nios II or MicroBlaze and a custom DSP block implemented in the FPGA. In this paper, FPGA based SOCs consisting of either a Nios II or a MicroBlaze soft-core processor are used for the implementation. Figure 13 shows the interface diagram of a Nios II processor along with the custom block (hybrid wave-pipelined circuit).

5 Architecture for the computation of 2D DWT using lifting scheme

The automation schemes proposed in the previous section are used for tuning the hybrid scheme for 2D DWT. The details of the architecture used and the assumptions made about the individual blocks of the 2D DWT are presented in this section. Sub-images of size 32 × 32 with 8 bits per pixel are used for the computation. The DWT coefficients are assumed to be represented using 11 bits. Each pixel is extended to 11 bits by adding three zeros at the most significant positions. This is done in order to make the word size of the inputs to the horizontal and vertical filters the same. This enables the same hardware or program to be reused for the computation of

Fig. 13 Adding custom logic to the Nios II ALU

the outputs of both horizontal and vertical filters. The inputs to the horizontal filter are the pixel intensity values, whereas the inputs to the vertical filters are DWT coefficients. The lifting multiplier constants (a, b, c, d, n) are assumed to be of 11 bits each. The block diagram of one level 2D DWT is shown in Fig. 14. For the horizontal filters, the even and odd inputs are applied from two block RAMs of size 512 × 11. The result is written into four block RAMs of size 256 × 11. For the vertical filters, the inputs are applied from these four block RAMs and the outputs are written into another four block RAMs. For testing, the image is assumed to be loaded into the block RAMs using a memory initialization file (MIF).
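The data flow of one level 2D DWT (horizontal lifting on rows, then vertical lifting on columns) can be sketched in floating point as follows. This is a software illustration under stated assumptions, not the fixed point hardware of Fig. 14. The constants called a, b, c, d and n in the text are taken here to be the widely published lifting coefficients of the 9/7 biorthogonal filter, which the paper itself does not list. Symmetric extension at the borders, the scaling convention (low band multiplied by the scaling constant, high band divided by it) and the sub-band naming are also assumptions.

```python
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def lift97_forward(x):
    """One level 1D 9/7 lifting on an even-length list; returns (low, high)."""
    s = list(x[0::2])  # even samples
    d = list(x[1::2])  # odd samples
    n = len(d)

    def predict(coef):
        # d[i] += coef * (s[i] + s[i+1]), mirroring s at the right border
        for i in range(n):
            right = s[i + 1] if i + 1 < n else s[i]
            d[i] += coef * (s[i] + right)

    def update(coef):
        # s[i] += coef * (d[i-1] + d[i]), mirroring d at the left border
        for i in range(n):
            left = d[i - 1] if i > 0 else d[i]
            s[i] += coef * (left + d[i])

    predict(ALPHA)
    update(BETA)
    predict(GAMMA)
    update(DELTA)
    return [v * ZETA for v in s], [v / ZETA for v in d]


def dwt2_one_level(img):
    """Apply the 1D transform to every row, then to every column."""
    rows = [lift97_forward(r) for r in img]
    low_rows = [lo for lo, _ in rows]    # L1: N x M/2
    high_rows = [hi for _, hi in rows]   # H1: N x M/2

    def columns(mat):
        cols = [lift97_forward(list(c)) for c in zip(*mat)]
        lo = [list(r) for r in zip(*[c[0] for c in cols])]
        hi = [list(r) for r in zip(*[c[1] for c in cols])]
        return lo, hi

    ll, lh = columns(low_rows)
    hl, hh = columns(high_rows)
    return ll, lh, hl, hh


image = [[100.0] * 32 for _ in range(32)]  # flat 32 x 32 test sub-image
LL, LH, HL, HH = dwt2_one_level(image)
```

For a flat image the three detail sub-bands come out essentially zero and all of the energy lands in LL, which is a convenient sanity check for any implementation of the filter pair.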

5.1 Block diagram of two level 2D DWT

The block diagram of two level 2D DWT is shown in Fig. 15. In order to minimize the area required for implementation, the horizontal filter and the vertical filters are reused to compute the multilevel 2D DWT. Block RAMs E1, O1 contain the even and odd streams of the initial data to be transformed. Block RAMs E2E/E2O, E3, E4, E5 denote the output of one level 2D DWT. The even and odd numbered coefficients of the LL1 component are stored in two block RAMs E2E and E2O and are used as inputs for the 2nd level DWT. The outputs of the 2nd level DWT are stored in block RAMs E6, E7, E8 and E9. The output of the horizontal filter is stored in four block RAMs E10, O10,

Fig. 14 Overall block diagram of one level 2D DWT

Fig. 15 Block diagram of two level 2D DWT

E11, O11. If only LL2, the low pass band corresponding to the two level DWT, is required, only one demultiplexer and seven block RAMs (E1, E2E, O1, E2O, E10, O10, E3) are required. For the purpose of verification, only LL2 is computed and compared for the different schemes of computation of two level 2D DWT. For the computation of the LL1 component of one level 2D DWT, only block RAMs E1, O1, E2E, E2O, E10, O10 are used.

5.2 Overlapping scheme for the computation of 2D DWT of complete image

The architecture proposed in Sect. 5.1 for sub-images of size 32 × 32 may be used for the computation of the 2D DWT of a larger image by splitting it into a number of overlapping sub-images of size 32 × 32. Splitting the image in this way allows the 2D DWT to be computed in parallel in a number of computational engines. It also reduces the memory required for storing the image and its transform. In the overlapping scheme, the image blocks are formed such that the number of pixels overlapped between adjacent blocks along the vertical and horizontal directions is equal to the order of the filter. For example, for the 9/7 bi-orthogonal filter used for the 2D DWT, the number of overlap pixels should be equal to four on the left and four on the right between horizontal blocks. Similarly, the number of overlap pixels between vertical blocks should be equal to four on the top and four on the bottom. For the blocks on the boundary, overlapping needs to be done only on the non-boundary edges.
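The overlapping scheme can be illustrated along one image dimension with a minimal Python sketch. It assumes that each block discards four pixels on every interior edge and none on an image boundary, and that the last block is placed flush with the boundary. The function names `block_starts` and `valid_range` are hypothetical, and this tiling is only one possible reading of the scheme, so its block count is not necessarily the one reported in Sect. 6.4.

```python
def block_starts(length, block=32, overlap=4):
    """Start offsets of overlapping blocks covering one image dimension.

    Assumption: interior block edges lose `overlap` pixels to filter
    transients, image boundaries lose none.
    """
    if block <= 2 * overlap:
        raise ValueError("block must be wider than twice the overlap")
    if length <= block:
        return [0]
    stride = block - 2 * overlap            # useful width of an interior block
    starts = list(range(0, length - block, stride))
    starts.append(length - block)           # last block sits flush with the boundary
    return starts


def valid_range(start, length, block=32, overlap=4):
    """Half-open range of samples in a block that are free of edge effects."""
    lo = start if start == 0 else start + overlap
    hi = length if start + block >= length else start + block - overlap
    return lo, hi


if __name__ == "__main__":
    for s in block_starts(128):
        print(s, valid_range(s, 128))
```

Applying the same offsets along rows and columns gives the 2D tiling. The valid ranges of neighbouring blocks meet or overlap, so every output sample is produced by at least one block.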

6 Implementation results

In order to demonstrate the applicability of the automation approaches to both Xilinx and Altera FPGAs, the 2D DWT is implemented using both Xilinx Spartan and Altera Cyclone FPGAs and the results are presented in this section. In each of the FPGAs, the 2D DWT is computed using three multiplication schemes: hybrid WP-P BW-KCM, non-pipelined BW-KCM and BW-PKCM.

6.1 Programmable clock and skew generators

The operating frequency of the wave-pipelined circuit is expected to lie between those of the non-pipelined and pipelined circuits. Hence, the minimum and maximum frequencies of the clock generator should correspond to the maximum operating frequencies of the non-pipelined and pipelined circuits, respectively. The approximate values of the clock periods of these circuits for the implementation of the b block on Cyclone FPGA are 5.6 and 7.4 ns, respectively. The values of Dmax and Dmin for the a block are 15.302 and 7.34 ns, respectively. The programmable clock and skew generators are designed such that the clock period can be varied from 8.4 to 20.6 ns in steps of 0.8 ns and the skew can be varied from 12.3 to 26.2 ns in steps of 0.9 ns, approximately. The same exercise is carried out for the b, c and d blocks using the synthesis report. A single clock generator is used for all the four blocks, while separate skew generators are used for each of them. To remove glitches in the clock signal, the use of a majority logic gate is suggested in [23]. The operating frequency and skews are chosen using an FSM such that all the blocks work satisfactorily. A similar procedure is adopted for the implementation on the other two FPGAs.

The location of the logic elements and the interconnects used for the implementation of the clock and skew blocks should be fixed so that, when these blocks are integrated with the 2D DWT or the soft-core processor, the interconnect delays are not altered. This is achieved by using the LogicLock feature in Altera. In the case of Xilinx FPGAs, this is achieved by using macros.
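The relation between the quoted delays and the clock generator range can be checked with a short Python sketch. In wave-pipelining the clock period must exceed the spread between the longest and shortest path delays, Dmax − Dmin, plus register and clock overheads (see [10]); the sketch ignores those overheads. The helper `settings` is hypothetical, and a uniform grid is an assumption, since the step sizes are given only approximately. The upper end of the grid may therefore fall slightly short of the quoted maximum.

```python
def settings(start_ns, stop_ns, step_ns):
    """Nominal values selectable from a programmable generator (uniform grid assumed)."""
    count = int(round((stop_ns - start_ns) / step_ns)) + 1
    return [round(start_ns + i * step_ns, 3) for i in range(count)]


# Delays quoted in the text for one lifting block on the Cyclone FPGA.
D_MAX_NS, D_MIN_NS = 15.302, 7.34

# Lower bound on the wave-pipelined clock period, ignoring setup/hold
# times and clock uncertainty (these would raise the bound).
min_period_ns = D_MAX_NS - D_MIN_NS

clock_periods = settings(8.4, 20.6, 0.8)
skews = settings(12.3, 26.2, 0.9)

# Selectable periods that satisfy the simplified constraint.
usable = [t for t in clock_periods if t > min_period_ns]

if __name__ == "__main__":
    print("lower bound on period (ns):", round(min_period_ns, 3))
    print("selectable periods:", clock_periods)
    print("selectable skews:", skews)
```

Under these assumptions the shortest selectable period lies just above the Dmax − Dmin bound, which is consistent with the generator range given in the text.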

6.2 Implementation of 2D DWT using BIST approach

The one level 2D DWT is implemented on Xilinx Spartan-II XC2S100 FPGA using the BIST approach. It may be noted that the BIST approach is also applicable for Altera FPGAs. A personal computer (PC) is used for the realization of the FSM. The interface used between the PC and the FPGA is the same as that described in [14]. The output of the hybrid circuit (11 bits) is EXORed with the output of the 11 bit PRBS generator and the signature is obtained.

The implementation results of the 9/7 horizontal filters for one level 2D DWT on Xilinx Spartan-II XC2S100 FPGA are given in Table 1. Multipliers of size 11 × 8 are implemented. From Table 1, it may be concluded that for the filter, the method using hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.32 and requires the same area. The pipelined BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.97 and this is achieved with the increase in the number of registers by a factor of 4.6 and increase in the number of slices by a factor of 1.79.

Table 1 Implementation of 9/7 bi-orthogonal filters with 11 × 8 multipliers using the various schemes

| Multiplier (BW-KCM) | Slices | Number of registers | Speed (MHz) |
|---|---|---|---|
| Non-pipelined | 253 | 176 | 57.3 |
| Pipelined | 453 | 803 | 149.18 |
| Hybrid WP-P | 253 | 176 | 75.75 |

The implementation results of one level 2D DWT for a sub-image of size 32 × 32 using the BIST approach are shown in Table 2. In order to make the horizontal and vertical filters identical, multipliers of size 11 × 11 are used for both of them. Three zeros are appended before the input samples as discussed in Sect. 5. The overheads required for the wave-pipelined circuits are also shown in Table 2. It may be noted that the overhead required is about 22.5%. From Table 2, it may be concluded that for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.4 and requires the same area. The pipelined BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.2 and this is achieved with the increase in the number of registers by a factor of 2.73 and increase in the number of slices by a factor of 1.32.

Table 3 shows the implementation results for two level 2D DWT for the pipelined scheme. The implementation of the hybrid WP two level 2D DWT using the BIST approach is in progress.
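The signature step of the BIST approach in this section can be sketched in Python as follows. The feedback polynomial x^11 + x^9 + 1 (a commonly used 11 bit PRBS polynomial), the seed, and the way each output word is folded into the register are all assumptions made for illustration. The text does not specify them, and the names `prbs11_step` and `signature` are hypothetical.

```python
MASK = 0x7FF  # 11 bit register


def prbs11_step(state):
    """Advance an 11 bit Fibonacci LFSR with polynomial x^11 + x^9 + 1."""
    feedback = ((state >> 10) ^ (state >> 8)) & 1
    return ((state << 1) | feedback) & MASK


def signature(words, seed=0x7FF):
    """Compact a stream of 11 bit output words into one 11 bit signature.

    Assumed scheme: the register is stepped once per word and the word
    is EXORed into the new state.
    """
    state = seed
    for word in words:
        state = prbs11_step(state) ^ (word & MASK)
    return state


if __name__ == "__main__":
    golden = signature([5, 1023, 77, 2047, 0])
    faulty = signature([5, 1023, 76, 2047, 0])   # one word differs
    print(hex(golden), hex(faulty), golden == faulty)
```

Comparing the signature of the circuit under test with a known-good signature replaces a word-by-word comparison of the outputs. Because the register update is linear and invertible, a single corrupted word always changes the signature in this sketch.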

6.3 Implementation of 2D DWT using SOC approach

For the hybrid WP block, the optimal clock period and clock skews are determined using the procedure described in Sect. 6.1. The hybrid wave-pipelined 2D DWT unit (obtained by adding the input and output block RAMs to the non-pipelined circuit along with the programmable clock and clock skew blocks) is first tested using simulation. As mentioned in Sect. 3, simulation is inadequate to test the hybrid wave-pipelined circuit. Hence, this circuit is implemented along with the Nios II or MicroBlaze soft-core processor, to which it is added as a custom block using SOPC Builder or the Embedded Development Kit (EDK), respectively. The program to be executed by the Nios II or MicroBlaze is written in C/C++ and the custom block is invoked as a function in the program. A C++ program is written to read from and write to the block RAMs in the custom block. The program is compiled and the executable code, along with the configuration bits corresponding to the Nios II or MicroBlaze integrated with the custom block, is downloaded to the FPGA. When the program is run, it systematically varies the select inputs for the clock and clock skew blocks, and uploads the content of the output block RAM. The clock and skew are adjusted until a match occurs for at least three consecutive clock skews. The operating clock and clock skew of the wave-pipelined circuit are then fixed at the middle value and, from then on, the custom block works without any intervention from the Nios II or MicroBlaze processor.

Table 2 Implementation results on one level 2D DWT

| Device and approach | Lifting scheme | Number of slices or LEs | Number of registers | Speed (MHz) |
|---|---|---|---|---|
| Spartan-II XC2S100PQ208-5 (BIST approach) | Non-pipelining | 836 | 611 | 54.45 |
| Spartan-II XC2S100PQ208-5 (BIST approach) | Pipelining | 1,110 | 1,670 | 87.54 |
| Spartan-II XC2S100PQ208-5 (BIST approach) | Hybrid WP-P | 836 (188) | 611 (85) | 75.75 |
| Cyclone-II EP2C35F672C6 (SOC approach) | Non-pipelining | 703 | 375 | 117.83 |
| Cyclone-II EP2C35F672C6 (SOC approach) | Pipelining | 782 | 671 | 203.92 |
| Cyclone-II EP2C35F672C6 (SOC approach) | Hybrid WP-P | 703 (30) | 375 (8) | 147.5 |
| Spartan-III XC3S200FT256-4 (SOC approach) | Non-pipelining | 897 | 730 | 67.9 |
| Spartan-III XC3S200FT256-4 (SOC approach) | Pipelining | 1,381 | 2,305 | 114.19 |
| Spartan-III XC3S200FT256-4 (SOC approach) | Hybrid WP-P | 897 (32) | 730 (8) | 82.6 |

Values in parentheses are the overheads required for the wave-pipelined circuit. 1 slice = 2 LUTs.

Table 3 Area and speed performance of two level forward 2D DWT on Xilinx Spartan-II XC2S-200PQ208-5

| Lifting scheme | Slices used | Speed (MHz) | Number of registers |
|---|---|---|---|
| Pipelined | 1,511 | 61.42 | 2,506 |
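The tuning procedure of this section (sweep the select inputs, look for a match over at least three consecutive clock skews, fix the middle value) can be sketched in Python as follows. The callback `matches` is hypothetical: it stands for loading the selects, running the custom block and comparing the output block RAM with the expected result. The sketch also assumes that clock select 0 corresponds to the shortest period, so the first clock setting that works is the fastest one.

```python
def tune(matches, n_clock, n_skew, min_run=3):
    """Return (clock_sel, skew_sel) for the first clock setting that works.

    A clock setting is accepted when `matches(clock_sel, skew_sel)` is true
    for at least `min_run` consecutive skew selects; the skew returned is
    the middle of the longest such run. Returns None if nothing works.
    """
    for clock_sel in range(n_clock):
        best = None          # (run_start, run_length) of the longest valid run
        run_start = None
        # One extra iteration acts as a sentinel that closes a trailing run.
        for skew_sel in range(n_skew + 1):
            ok = skew_sel < n_skew and matches(clock_sel, skew_sel)
            if ok and run_start is None:
                run_start = skew_sel
            elif not ok and run_start is not None:
                run_len = skew_sel - run_start
                if run_len >= min_run and (best is None or run_len > best[1]):
                    best = (run_start, run_len)
                run_start = None
        if best is not None:
            return clock_sel, best[0] + (best[1] - 1) // 2
    return None


if __name__ == "__main__":
    # Toy model: the circuit works for clock selects >= 2 and skews 5..9.
    model = lambda c, s: c >= 2 and 5 <= s <= 9
    print(tune(model, n_clock=16, n_skew=16))
```

Choosing the middle of the run leaves a margin on both sides of the selected skew, which is the reason the text asks for several consecutive matching settings rather than a single one.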

6.3.1 Implementation results on one level 2D DWT using Cyclone-II EP2C35F672C6

The one level 2D DWT is implemented on Cyclone-II EP2C35F672C6 with and without pipelining. A single filter is implemented and time shared for the computation of the outputs of both horizontal and vertical filters. The 2D DWT block is added as a custom block to the Nios II CPU and downloaded to the Cyclone-II. The 2D DWT is also computed using the in-built instruction set of Nios II [22]. The computation time for the 2D DWT, in CPU clock cycles, is tabulated for both cases in Table 4. (The clock frequency obtained using the above device is 40 MHz.) For the hybrid wave-pipelined circuit, the number of logic elements, number of registers, maximum operating frequency and power dissipated are computed using the Cyclone-II FPGA and the results are given in Table 2. It may be noted that the overhead required for the wave-pipelined circuit is about 4%. From Table 2, it may be concluded that for the lifting scheme, the method using the hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.25. The scheme with the Baugh–Wooley pipelined constant coefficient multiplier is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.38 and this is achieved with the increase in the number of registers by a factor of 1.78 and increase in the number of LEs by a factor of 1.11.

Pipelining may be used either for increasing the operating frequency of a circuit or for reducing the power dissipation [12]. Pipelining requires more registers and area, but this does not automatically lead to more power dissipation. In order to assess whether hybrid wave-pipelining is superior with regard to power dissipation, both the hybrid wave-pipelined and pipelined circuits are operated at the same frequency (corresponding to the maximum operating frequency of the hybrid wave-pipelined circuit) and the power dissipated for the two approaches is given in Table 5. From Table 5, it may be noted that the pipelined circuit dissipates 11% less power than the hybrid wave-pipelined 2D DWT.

6.3.2 Implementation results on one level 2D DWT using Spartan-III XC3S200

Implementation results for one level 2D DWT on Xilinx Spartan-III XC3S200 using all the three approaches are given in Table 2. The programmable clock and clock skew blocks are implemented as macro blocks using Xilinx ISE 8.1i Project Navigator. For tuning the hybrid wave-pipelined circuit, the MicroBlaze soft-core processor is used. Xilinx EDK software is used to integrate the custom block with the MicroBlaze processor. The rest of the steps are similar to those used for the Altera SOC kit. For all the three schemes, the number of slices, number of registers and maximum operating frequency are computed and the results are given in Table 2. It may be noted that the overhead required for the wave-pipelined circuit is about 3.5%. It may also be noted that dedicated filters are used for the computation of the outputs of the horizontal and vertical filters. Hence, the area required for this scheme is higher than that using Cyclone-II devices. From Table 2, it may be concluded that for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.21. The scheme with the Baugh–Wooley pipelined constant coefficient multiplier is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.38 and this is achieved with the increase in the number of registers by a factor of 3.15 and increase in the number of slices by a factor of 1.53.
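The speed, register and area factors quoted for Table 2 can be recomputed from the tabulated values with a short Python sketch. The dictionary layout and the function name `ratios` are illustrative choices; the numbers are those listed in Table 2. The computed factors agree with the ones quoted in the text to within rounding or truncation.

```python
# (slices or LEs, registers, speed in MHz) for each scheme, from Table 2.
TABLE2 = {
    "Spartan-II (BIST)": {
        "non": (836, 611, 54.45), "pipe": (1110, 1670, 87.54), "hybrid": (836, 611, 75.75),
    },
    "Cyclone-II (SOC)": {
        "non": (703, 375, 117.83), "pipe": (782, 671, 203.92), "hybrid": (703, 375, 147.5),
    },
    "Spartan-III (SOC)": {
        "non": (897, 730, 67.9), "pipe": (1381, 2305, 114.19), "hybrid": (897, 730, 82.6),
    },
}


def ratios(rows):
    """Factors comparing the hybrid scheme with the other two schemes."""
    non, pipe, hyb = rows["non"], rows["pipe"], rows["hybrid"]
    return {
        "hybrid_vs_non_speed": hyb[2] / non[2],
        "pipe_vs_hybrid_speed": pipe[2] / hyb[2],
        "pipe_vs_hybrid_registers": pipe[1] / hyb[1],
        "pipe_vs_hybrid_area": pipe[0] / hyb[0],
    }


if __name__ == "__main__":
    for device, rows in TABLE2.items():
        print(device, {k: round(v, 2) for k, v in ratios(rows).items()})
```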

6.4 Validation of the scheme for 2D DWT

Table 4 Computation time for 2D DWT

| Function | Number of CPU clock cycles for software approach | Equivalent CPU clock cycles for custom block including additional overhead |
|---|---|---|
| 2D DWT | 73,280 | 814 |

Table 5 Power dissipated by pipelined and hybrid wave-pipelined one level 2D DWT at normalized frequency

| Description of the circuit | Pipelined circuit | Hybrid circuit |
|---|---|---|
| Power at normalized frequency | 158.97 | 179.58 |

To verify the correctness of the schemes proposed for the computation of 2D DWT, the Lena image of size 128 × 128 with blocks (sub-images) of size 32 × 32 pixels is used. The 128 × 128 image is shown in Fig. 16 and is obtained by subsampling the standard image of size 512 × 512 by a factor of four along both dimensions. As mentioned in Sect. 5.2, an overlap of four pixels is used between adjacent blocks. In total, 36 image blocks are used for the 128 × 128 image. The 2D DWT of the image is computed using both a software approach (a C program) and a hardware approach (FPGA).

Fig. 16 LL1 component compared with input image

For the implementation in C, the lifting multiplier constants (a, b, c, d, n1, n2) and the filter coefficients for the distributed arithmetic algorithm are declared as "double" type (64 bit) variables. The pixel intensities are declared as "short" type (16 bit) variables. The analysis filter outputs obtained for the 36 image blocks are merged suitably and the LL1 component of the image is shown in Fig. 16.

The forward 2D DWT for an image block of size 32 × 32 is implemented using the lifting scheme with BW-hybrid WP-PKCM. For the implementation, the Xilinx Spartan-II XC2S100PQ208-5 device is used. The block RAMs are configured suitably for storing the image input, the outputs of the horizontal filter and the outputs of the vertical filters. The image is loaded into the block RAMs through the UCF of the implementation tool. The one level 2D DWT is computed using the above scheme for all the 36 image blocks and the results are merged suitably. The resulting LL1 component of the image is also shown in Fig. 16. From these figures, it may be concluded that the LL1 components obtained through the FPGA implementation match well with those obtained using C. The LL1 components also match well with the original image. In order to make a quantitative comparison of the LL1 component with the original image, the original Lena image is subsampled to a size of 64 × 64. Treating the LL1 component itself as the compressed image, the PSNR values of the compressed images obtained using BW-hybrid WPKCM and C are computed and are found to be 28.22 and 33.33, respectively.
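The quantitative comparison above relies on the peak signal-to-noise ratio. A minimal Python sketch of the standard PSNR definition is given below. It assumes that both images are stored as equally sized 2D lists on the same 8 bit scale (peak value 255); how the LL1 component was scaled before the comparison is not stated in the text, and the function name `psnr` is illustrative.

```python
import math


def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized 2D lists of pixel values."""
    if len(reference) != len(test) or any(
        len(r) != len(t) for r, t in zip(reference, test)
    ):
        raise ValueError("images must have the same shape")
    count = sum(len(row) for row in reference)
    sq_err = sum(
        (a - b) ** 2 for r, t in zip(reference, test) for a, b in zip(r, t)
    )
    if sq_err == 0:
        return math.inf                      # identical images
    # PSNR = 10 * log10(peak^2 / MSE), with MSE = sq_err / count.
    return 10.0 * math.log10(peak * peak * count / sq_err)


if __name__ == "__main__":
    ref = [[10, 20], [30, 40]]
    out = [[11, 19], [30, 42]]
    print(round(psnr(ref, out), 2), "dB")
```

A higher value indicates a closer match to the reference, so the software result with 64 bit constants would be expected to score above the fixed-point hardware result, as reported.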

7 Conclusion

Two automation schemes are proposed in this paper for the implementation of the 9/7 bi-orthogonal filters using the hybrid WP-P constant coefficient multiplier with the Baugh–Wooley multiplication algorithm. Nios II and MicroBlaze soft-core processors are successfully integrated with the 2D DWT blocks, and the optimum clock period and clock skews for the 2D DWT blocks are selected using them. After this initial selection, the 2D DWT blocks work satisfactorily without any external intervention and the processors are free to do other tasks. The 9/7 bi-orthogonal filters are implemented on both Xilinx and Altera devices using the lifting scheme with the following three multipliers: BW-PKCM, BW-KCM and hybrid WP-P BW-KCM. From the implementation results, it is verified that hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM. The scheme with BW-PKCM is in turn faster than the hybrid WP-P BW-KCM and this is achieved with an increase in the number of registers and in the number of LEs. The custom instruction for 2D DWT is found to be faster than the implementation using C. The correctness of the procedure for the computation of the 2D DWT of an image using the 2D DWT of sub-images is verified by computing the 2D DWT using both hardware and software (C) approaches and displaying the LL1 components for an image of size 128 × 128. The automation schemes proposed in this paper have also been successfully employed in [23] for the implementation of wave-pipelined filters using the distributed arithmetic algorithm and a sine wave generator using CORDIC. The work on the computation of multi level 2D DWT and real time computation of 2D DWT using the hybrid scheme is in progress.

One of the challenges in the design of FPGA based wave-pipelined circuits is the accurate modeling of the interconnect and device delays and their temperature dependence. In the absence of these models, the wave-pipelined circuits can only be operated at moderate speeds.

References

1. Xilinx documentation library, Xilinx Corporation, USA
2. Altera documentation library-2003, Altera Corporation, USA
3. Sheldon, D., Kumar, R., Vahid, F., Tullsen, D., Lysecky, R.: Conjoining soft-core FPGA processors. In: IEEE/ACM International Conference on Computer Aided Design, 2006, ICCAD, pp. 694–701 (2006)
4. Draper, B.A., Beveridge, J.R., Willem Bohm, A.P., Ross, C., Chawathe, M.: Accelerated image processing on FPGAs. IEEE Trans. Image. Process. 12(12), 1543–1551 (2003)
5. Ritter, J., Molitor, P.: A pipelined architecture for partitioned DWT based lossy image compression using FPGA's. In: Proceedings ACM Conference FPGA 2001, pp. 201–206 (2001)
6. Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K., Sriram, G.: Design and FPGA implementation of image block encoders with 2D-DWT. Proceedings TENCON 2003. 3, 1015–1019 (2003)
7. Daubechies, I., Sweldens, W.: Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl. 4, 247–269 (1998)
8. Nyathi, J., Delgado-Frias, J.G.: A hybrid wave-pipelined network router. IEEE Trans. Circuits. Syst-I, Fundam. Theory. Appl. 49(12), 1764–1772 (2002)
9. Hauck, O., Katoch, A., Huss, S.A.: VLSI system design using asynchronous wave pipelines: a 0.35 µm CMOS 1.5 GHz elliptic curve public key cryptosystem chip. In: Proceedings of the sixth international symposium on advanced research in asynchronous circuits and systems 2000 (ASYNC 2000), pp. 188–197 (2000)
10. Burleson, W.P., Ciesielski, M., Klass, F., Liu, W.: Wave-pipelining: a tutorial and research survey. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 6(3), 464–474 (1998)
11. Gray, T.C., Liu, W., Cavin, R.K., III.: Wave-Pipelining: Theory and CMOS Implementation. Kluwer, Boston (1994)
12. Parhi, K.K.: VLSI Signal Processing Systems. Wiley, New York (1999)
13. Boemo, E.I., Lopez-Buedo, S., Meneses, J.M.: Wave-pipelines via look-up tables. IEEE Int. Symp. Circuits Syst. (ISCAS '1996). 4, 85–88 (1996)
14. Lakshminarayanan, G., Venkataramani, B.: Optimization techniques for FPGA based wave-pipelined DSP blocks. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 13(7), 783–793 (2005)
15. Acharya, T., Tsai, P.-S.: JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures. Wiley, New York (2005)
16. Sayood, K.: Introduction to Data Compression. Morgan Kaufmann, Menlo Park (2000). An Imprint of Elsevier
17. Smith, M.J.S.: Application Specific Integrated Circuits. Pearson Education Asia Pvt. Ltd, Singapore (2003)
18. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of self tuned wave-pipelined filters. IETE J. Res. 52(4), 281–286 (2006)
19. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of wave-pipelined image block encoders using 2D-DWT. In: Proceedings of VLSI design and test symposium VDAT 2005, pp. 12–20 (2005)
20. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of wave-pipelined distributed arithmetic based filters. In: Proceedings of VLSI Design and Test workshop VDAT 2004, pp. 216–220 (2004)
21. Amudha, V., Venkataramani, B., Vinoth Kumar, R., Ravishankar, S.: SOC Implementation of HMM Based Speaker Independent Isolated Digit Recognition System. In: 20th International IEEE Conference on VLSI Design (VLSID'07), pp. 1–6 (2007)
22. Seetharaman, G., Venkataramani, B., Amudha, V., Saundattikar, A.: System on chip implementation of 2D DWT using lifting scheme. In: Proceedings of the International Asia and South Pacific Conference on Embedded SOCs (ASPICES 2005), (2005)
23. Seetharaman, G., Venkataramani, B.: SOC implementation of wave-pipelined circuits. Proceedings of IEEE International conference on Field Programmable Technology 2007 (ICFPT 2007), pp. 9–16 (2007)

Author Biographies

G. Seetharaman received his B.E. and M.E. degrees in Electronics and Communication Engineering from Regional Engineering College, Tiruchirappalli in 1997 and 2002, respectively. Presently, he is carrying out his doctoral thesis work in the National Institute of Technology, Tiruchirappalli. Previously, he worked as faculty in the Jayaram College of Engineering and Technology, Tiruchirappalli, for 6 years and as Research Associate for three semesters in the National Institute of Technology, Tiruchirappalli. Presently, he is working as Laboratory Engineer in National Institute of Technology, Tiruchirappalli. His current research interests include embedded system design using field-programmable gate arrays (FPGAs) and system-on-chip (SOC).

B. Venkataramani received his B.E. degree in Electronics and Communication Engineering from Regional Engineering College, Tiruchirappalli in 1979 and M.Tech. and Ph.D. degrees in Electrical Engineering from Indian Institute of Technology, Kanpur in 1984 and 1996, respectively. He worked as Deputy Engineer in Bharat Electronics Limited, Bangalore, India, and as a Research Engineer in Indian Institute of Technology, Kanpur, each for approximately 3 years. Since 1987 he has been a faculty member of National Institute of Technology (formerly Regional Engineering College), Tiruchirappalli. Presently, he is working as Professor and Head of the Department of Electronics and Communication in National Institute of Technology. He has published two books and numerous papers in journals and international conferences. His current research interests include FPGA applications, SOC based system design and performance analysis of high speed computer networks.

G. Lakshminarayanan received his M.E. and Ph.D. degrees in Electronics and Communication Engineering from Bharathidasan University, Tiruchirappalli in 1995 and 2005, respectively. He previously worked as a Service Engineer for 5 years and as a Scientist and Research Associate for 4 years in Regional Engineering College, Tiruchirappalli. He was a faculty member in SASTRA, Tanjore, for two semesters and an Assistant Professor in Saranathan College of Engineering, Tiruchirappalli, for 1 year. Presently, he is working as Assistant Professor in National Institute of Technology, Tiruchirappalli. His current research interests include FPGA based system design and VLSI front end design.
