
J Real-Time Image Proc (2008) 3:217–229
DOI 10.1007/s11554-008-0087-8

SPECIAL ISSUE

Automation techniques for implementation of hybrid wave-pipelined 2D DWT


G. Seetharaman · B. Venkataramani · G. Lakshminarayanan

Received: 12 July 2007 / Accepted: 19 May 2008 / Published online: 10 June 2008
© Springer-Verlag 2008

Abstract In the literature, techniques such as pipelining and wave-pipelining (WP) are proposed for increasing the operating frequency of a digital circuit. In general, pipelining yields higher speed at the cost of increased area and clock routing complexity. WP, on the other hand, requires less area and less clock routing complexity, but enables the digital circuit to be operated only at moderate speeds. In this paper, a hybrid wave-pipelining scheme is proposed to get the benefits of both pipelining and WP. The major contributions of this paper are: a proposal for the implementation of the 2D DWT using the lifting scheme by adopting hybrid wave-pipelining, and a proposal for automating the choice of the clock frequency and of the clock skew between the input and output registers of a wave-pipelined circuit using built-in self test (BIST) and system-on-chip (SOC) approaches. In the hybrid scheme, the different lifting blocks are interconnected using pipelining registers and the individual blocks are implemented using WP. To evaluate the schemes proposed in this paper, the system for the computation of the one level 2D DWT is implemented using three techniques: pipelining, non-pipelining and hybrid wave-pipelining. The BIST approach is used for the implementation on a Xilinx Spartan-II device. The SOC approach is adopted for the implementation on Altera and Xilinx field programmable gate array (FPGA) based SOC kits with Nios II or MicroBlaze soft-core processors. From the implementation results, it is verified that the hybrid WP circuit is faster than the non-pipelined circuit by a factor of 1.25–1.39. The pipelined circuit is in turn faster than the hybrid wave-pipelined circuit by a factor of 1.15–1.38, and this is achieved with an increase in the number of registers by a factor of 1.79–3.15 and an increase in the number of LEs by a factor of 1.11–1.65. The soft-core processor based automation scheme considerably reduces the effort required for the design and testing of the hybrid wave-pipelined circuit. The techniques proposed in this paper are also applicable to ASICs, and the optimization schemes are also applicable to the computation of other image transforms such as the DCT and DHT.

Keywords DWT · Lifting · SOC · Wave-pipelining · Pipelining · Self test

G. Seetharaman (✉) · B. Venkataramani · G. Lakshminarayanan
Department of ECE, National Institute of Technology, Tiruchirappalli, India
e-mail: gsraman@nitt.edu

B. Venkataramani
e-mail: bvenki@nitt.edu

G. Lakshminarayanan
e-mail: laksh@nitt.edu

1 Introduction

Programmable logic devices such as FPGAs offer an alternative solution for the computationally intensive functions traditionally performed by digital signal processors with Harvard architecture. The ability to design, fabricate and test application specific integrated circuits (ASICs) as well as FPGAs with gate counts of the order of a few tens of millions has led to the development of complex embedded systems-on-chip. The development of intellectual property (IP) cores for FPGAs for a variety of standard functions, including processors, enables a multimillion gate FPGA to be configured to contain all the components of a complete system. Development tools from FPGA vendors such as Altera or Xilinx enable the integration of IP cores and user designed custom blocks with soft-core processors such as the MicroBlaze or Nios II [1, 2]. Systems designed in this way are far more flexible than those based on hard-core processors, and they can be enhanced with custom hardware to optimize them for a specific application [3].

The increased performance available with SOC based FPGAs makes them well suited for the implementation of area as well as speed intensive image processing applications such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). For example, the study in [4] shows that an FPGA based image processing system is faster by 8–800 times than one using a Pentium III processor. For image processing applications, the wavelet transform is increasingly used in addition to the DCT. It is a part of the joint photographic experts group (JPEG) 2000 standard for still image compression. The VLSI implementation of image encoders with DWT has been addressed in a number of previous works. The implementation of the 2D DWT using the lifting scheme and compression using the EZT algorithm is reported in [5], taking advantage of the flexible memory configuration available in FPGAs. In [5], the image is partitioned into sub-images of size 32 × 32 and external memory is used for storing the sub-images and the transform coefficients. Block RAMs in FPGAs are proposed for storing the sub-images and 2D DWT coefficients in [6]. A new multiplier algorithm, denoted the Baugh–Wooley pipelined constant coefficient multiplier (BW-PKCM), which combines the KCM with the Baugh–Wooley multiplication algorithm, is also proposed in [6] and used for the study and comparison of the distributed arithmetic algorithm and the lifting scheme [7] for the 2D DWT on FPGAs.

Even though pipelining is adopted for high speed applications such as that in [6], pipelined systems have a number of disadvantages, such as increased power dissipation, clock routing complexity and clock skew between different parts of the system. Wave-pipelining is one of the circuit design techniques proposed for achieving high speed without the above limitations. A wave-pipelined circuit dispenses with the registers for storing the intermediate results and instead uses the inherent capacitance at the inputs of the various combinational blocks. A number of systems have been implemented using wave-pipelining on ASICs and FPGAs [8, 9]. The concept of wave-pipelining has been described in a number of previous works [10–14]. One of the limitations of wave-pipelined circuits is that their highest operating frequency reduces with the complexity of the circuit or, equivalently, the logic depth [14]. In order to combine the advantages of both pipelining and wave-pipelining, a hybrid scheme is proposed in this paper. A complex circuit is split into a number of smaller circuits and is pipelined. Each of the smaller circuits is realized using wave-pipelining.

The rest of the paper is organized as follows. In Sect. 2, previous work on the lifting based 2D DWT with the BW multiplier is reviewed. In Sect. 3, the previous work related to wave-pipelining and the challenges involved in the design of wave-pipelined circuits are described. In Sect. 4, automation schemes for wave-pipelined circuits are presented. In Sect. 5, the architecture used and the assumptions made for the implementation of the 2D DWT are presented. In Sect. 6, the implementation results of the pipelined and hybrid wave-pipelined 2D DWT are presented. Sect. 7 summarizes the conclusions.

2 Review of previous work on lifting based 2D DWT with BW multiplier

The hybrid scheme is proposed to be used for the computation of the 2D DWT. The DWT decomposes a signal into different sub-bands so that the lower frequency sub-bands have finer frequency resolution and coarser time resolution compared to the higher frequency sub-bands. A survey of VLSI architectures for the computation of the 2D DWT is given in [15]. The 2D DWT may be computed using filter banks. Figure 1 shows how an N × M image can be decomposed using sub-band decomposition for the one level 2D DWT. The samples corresponding to the image pixels are passed through two stages of analysis filters. The elements of the pixel matrix are read row wise and are first processed by the low pass h[n] and high pass g[n] horizontal filters. The transform coefficient matrices are then sub-sampled by two along the rows to obtain two N × M/2 matrices L1 and H1. Subsequently, the outputs (L1, H1) are processed by low pass and high pass vertical filters to obtain four N/2 × M/2 transform coefficient matrices. Out of these four matrices, denoted as LL1, LH1, HH1 and HL1, respectively, LL1 represents a coarse approximation of the original image [15, 16]. For the two level 2D DWT, the LL1 component is processed by both horizontal and vertical filters and sub-sampled to obtain four more matrices LL2, LH2, HH2 and HL2. This process is continued until the desired level of sub-band structure is obtained.

Fig. 1 Sub-band decomposition of an N × M image

The horizontal and vertical filters shown in Fig. 1 may be implemented by adopting the lifting scheme [7], which uses a factorization of the poly-phase matrix corresponding to the analysis filter. The main feature of the lifting based DWT scheme is that it breaks up the high pass and low pass wavelet filters into a sequence of smaller filters. This scheme requires about 50% less computation than the convolution-based approach [7]. It has other advantages, including in-place computation of the DWT, integer to integer wavelet transform and symmetric hardware architecture for the computation of both the forward and inverse transforms [15]. In the lifting scheme for a filter bank with low pass and high pass filters of nine and seven taps, respectively, the odd and even input samples are processed by five lifting blocks [a, b, c, d, n (n1, n2)] in cascade as shown in Fig. 2, where n1 and n2 are scaling blocks. The internal diagrams of the a and b blocks are shown in Figs. 3 and 4. The c and d blocks are obtained by replacing the constants a, b with c, d.

Fig. 2 Simplified block diagram of lifting scheme for 9/7 filter

Fig. 3 a Block

Fig. 4 b Block

In Figs. 3 and 4, since the output from one block is fed as the input to the next block, the maximum rate at which the input can be fed to the system depends on the sum of the delays in all the four stages. The speed may be increased by introducing pipelining at the points indicated by dotted lines in Figs. 3 and 4. In this case, the input rate is determined by the largest delay among the four blocks. The delay in the individual stages may be reduced further by using a constant coefficient multiplier (KCM), which uses a look up table (LUT) for finding the product of a constant and a variable. The variable is fed as the address to the LUT, which contains the products corresponding to all possible values of the variable operand. FPGAs normally contain four input LUTs. When an LUT with a larger number of inputs is required, it has to be implemented using a number of stages of four input LUTs and adders. For example, a 12 × 12 bit KCM is implemented using three 4 × 12 bit KCMs and two stages of 16 bit adders. The speed of the KCM can be increased by introducing pipelining registers at the outputs of the LUTs and adders.

The content of the LUT corresponding to the multiplication of signed numbers can be computed using three approaches: (a) assuming unsigned multiplication and using 2's complement blocks (the resulting multiplier is referred to as the conventional 2's complement multiplier (C2CM)), (b) using sign extension, and (c) using the Baugh–Wooley (BW) multiplier. The pipelined constant coefficient multiplier (PKCM) using the BW content is referred to as BW-PKCM and it is shown to be superior to the other two approaches [6]. Hence, only this multiplier is considered for wave-pipelining in this paper. The detailed diagram of the a block implemented using BW-PKCM is shown in Fig. 5. The same scheme can be adopted for the b, c, d, n1 and n2 blocks. The dotted lines indicate the points where registers may be inserted for pipelining. For wave-pipelining, all the stages are directly connected without registers, and registers are used only at the inputs and outputs. In hybrid wave-pipelining, registers are used between adjacent lifting blocks and the stages within each individual lifting block are connected without registers.
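The cascade of lifting blocks in Fig. 2 can be sketched in software as follows. This is a floating-point illustration only, not the fixed-point hardware of the paper. The coefficient values are the commonly quoted 9/7 lifting constants (the paper does not list them), the scaling convention (which output is multiplied and which is divided by the scale factor) varies between references, and the symmetric boundary handling and the function name `lift97_forward` are assumptions made for this sketch.

```python
# Commonly quoted lifting constants for the 9/7 biorthogonal filter.
# Assumed values: the paper does not list them.
A = -1.586134342    # constant of the a block
B = -0.05298011854  # constant of the b block
C = 0.8829110762    # constant of the c block
D = 0.4435068522    # constant of the d block
N = 1.149604398     # scale factor used by the scaling blocks n1, n2


def lift97_forward(x):
    """One level of the 1D 9/7 DWT of an even-length sequence by lifting."""
    even = [float(v) for v in x[0::2]]
    odd = [float(v) for v in x[1::2]]
    n = len(even)
    if n == 0 or n != len(odd):
        raise ValueError("input length must be even and non-zero")

    def update_odd(k):
        # odd[i] += k * (even[i] + even[i + 1]), mirrored at the right edge
        for i in range(n):
            right = even[i + 1] if i + 1 < n else even[i]
            odd[i] += k * (even[i] + right)

    def update_even(k):
        # even[i] += k * (odd[i - 1] + odd[i]), mirrored at the left edge
        for i in range(n):
            left = odd[i - 1] if i > 0 else odd[i]
            even[i] += k * (left + odd[i])

    update_odd(A)    # a block
    update_even(B)   # b block
    update_odd(C)    # c block
    update_even(D)   # d block
    low = [v * N for v in even]   # scaling of the low pass output
    high = [v / N for v in odd]   # scaling of the high pass output
    return low, high
```

Each of the four update calls depends on the result of the previous one, which is why the unpipelined input rate is set by the sum of the block delays. Applying the function to every row and then to every column of the results would give the one level 2D decomposition of Fig. 1.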

Fig. 6 Multiple coherent waves of data sent through combinational logic acting as pipeline in WP
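Returning briefly to the constant coefficient multiplier of Sect. 2, the decomposition of a 12 bit variable into three 4 bit LUT addresses followed by two adder stages can be sketched as below. The sketch assumes unsigned operands only; the Baugh–Wooley table content used for signed operands is not reproduced, and the function names are hypothetical.

```python
def build_kcm_luts(constant, slices=3):
    """One 16-entry table per 4-bit slice of the variable operand.

    Each table holds constant * digit for digit = 0..15, which is what a
    group of four-input LUTs would store for one 4-bit slice.
    """
    return [[constant * digit for digit in range(16)] for _ in range(slices)]


def kcm_multiply(luts, variable):
    """Multiply a 12-bit unsigned variable by the constant stored in luts."""
    if not 0 <= variable < (1 << (4 * len(luts))):
        raise ValueError("variable does not fit the available LUT slices")
    partials = []
    for i, lut in enumerate(luts):
        address = (variable >> (4 * i)) & 0xF    # 4-bit LUT address
        partials.append(lut[address] << (4 * i))  # weight by slice position
    stage1 = partials[0] + partials[1]  # first adder stage
    return stage1 + partials[2]         # second adder stage
```

In hardware the shifts are plain wiring, so only the table look-ups and the two adder stages contribute delay. In a pipelined KCM, registers would follow the look-ups and each adder stage.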

3 Review of previous work on wave-pipelining

In this section, the technique used for wave-pipelining the a block in Fig. 5 is considered. An RTL model of a circuit consists of a combinational logic circuit placed between input and output registers. The combinational logic circuit may be considered to be a wave-pipelined circuit if a number of waves are made to propagate through it simultaneously, as shown in Fig. 6 [10]. In other words, at any point of time, a sequence of data is being processed in the combinational logic block. In the case of pipelining, only one data item is processed in the combinational logic block at a time. Further, the maximum data rate in the pipelined circuit depends only on Dmax, the maximum propagation delay in the combinational logic block. Figure 7 shows the temporal/spatial diagram of combinational logic circuits [11]. If Dmin denotes the minimum propagation delay of the signal through the combinational logic block, the maximum data rate of the wave-pipelined circuit depends on (Dmax − Dmin). In the case of the a block, Dmax corresponds to the processing and propagation delay between the even samples and the a0 output (this involves three adder delays, one LUT delay and four interconnect delays); Dmin corresponds to the processing and propagation delay between the odd samples and the a0 output (this involves two adder delays and two interconnect delays). Traditionally, in a wave-pipelined circuit, higher speeds are achieved by equalizing Dmax and Dmin [10].

Fig. 5 a Block using BW-PKCM

Fig. 7 Temporal/spatial diagram of combinational logic circuits

The output of the wave-pipelined circuit alternates between unstable and stable states. The stable period decreases as the logic depth increases. By adjusting the latching instant at the output register to lie in the stable period, the wave-pipelined circuit can be made to work properly. For large logic depths, however, there may not be any stable period, so adjusting the latching instant by itself may not be adequate for storing the correct result at the output register. In such cases, the clock period has to be increased to lengthen the stable period. Equalization of path delays, adjustment of the clock period and adjustment of the clock skew are the three tasks carried out for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and altered if required. Layout editors, such as the FPGA editor from Xilinx or the Floor planner from Altera, may be used for this purpose. These tasks are carried out manually in [13, 14].

The wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because of the difference between the actual delays and the delays calculated by the layout editor. The layout editor considers only the worst case delays, and the actual delays may be significantly different due to fabrication variations. This difference becomes important as the logic depth of the circuit increases. Hence, the design is downloaded to the actual FPGA and its operation is checked using a personal computer (PC) based test system in [14]. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until the correct results are obtained. Designing a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in the next section.
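The dependence of the data rate on Dmax for a conventional stage and on (Dmax − Dmin) for a wave-pipelined one can be illustrated with a small calculation. This is an idealized sketch: the `overhead` term, which would lump together register setup and hold times and clock uncertainty, is an assumption and defaults to zero, so real circuits need a longer period than the bound computed here. The example values are the Dmax and Dmin quoted for the a block in Sect. 6.1.

```python
import math


def clock_period_bounds(d_max, d_min, overhead=0.0):
    """Idealized minimum clock periods, in the same unit as the delays.

    Returns (conventional, wave_pipelined): a registered stage must wait
    for the slowest path, whereas a wave-pipelined block is limited by the
    spread between the slowest and fastest paths.
    """
    if d_min > d_max:
        raise ValueError("d_min cannot exceed d_max")
    conventional = d_max + overhead
    wave_pipelined = (d_max - d_min) + overhead
    return conventional, wave_pipelined


def waves_in_flight(d_max, period):
    """Approximate number of data waves inside the logic block at once."""
    return math.ceil(d_max / period)


# Delays of the a block from Sect. 6.1, in nanoseconds.
conventional_ns, wave_ns = clock_period_bounds(15.302, 7.34)
```

With these delays the idealized wave-pipelined bound is about 7.96 ns against 15.3 ns for the conventional case, and two waves share the logic block at that period. Equalizing Dmax and Dmin shrinks the spread and therefore the bound.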

4 Automation schemes for wave-pipelined circuits

Equalization of the path delays of combinational logic blocks such as the a block is considered first. This cannot be completely automated, as the commercially available synthesis tools do not support the specification of interconnect delays. However, the difference in path delays can be minimized by specifying the physical location of the logic cells (referred to as slices in Xilinx FPGAs) or logic elements used for the implementation, through either the user constraints file (UCF) or the LogicLock feature supported by the FPGA CAD tools [2, 3]. The UCF approach is proposed for Xilinx FPGAs in [14]. The LogicLock feature is adopted for the Altera FPGAs in this paper.

The adjustment of the clock skew and clock period can be automated by using a programmable clock, a skew generator and a processor. A clock generator using LUTs and interconnects (nets) is proposed for the first time in [14], where the LUTs are programmed as non-inverting buffers and the interconnects are chosen manually using the FPGA layout editor. The programmable clock proposed in this paper uses multiplexers in addition to the LUTs and nets, as shown in Fig. 8. The interconnect delays are selected using the multiplexers. The number of possible interconnect delays (Di) is restricted to minimize the overhead due to the multiplexers and the additional LUTs required for introducing the delay. Hence, only a coarse variation in the delay values can be achieved; much smaller variations may be achieved using manual routing. In Fig. 8, the inputs C0–C3 are the programmable select inputs, which determine the actual clock frequency. The diagram of the programmable clock skew circuit is given in Fig. 9. In Fig. 9, Di denotes the ith interconnect (net) specifically introduced to vary the delay. In addition, there are interconnects between the output of one multiplexer and the input of another multiplexer, and also between the LUTs, but their delay values are not controlled by the program. The select inputs S0–S3 are the programmable delay inputs.
Fig. 8 Programmable clock generator

Fig. 9 Programmable clock skew circuit
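A behavioural model of the two programmable blocks helps to see what the select inputs control. The sketch below assumes that the 4 bit codes C0–C3 and S0–S3 map linearly onto the delay; this is an idealization, since multiplexed nets on a real device give unequal steps. The base and step values are the approximate Cyclone figures quoted in Sect. 6.1, and the function names are hypothetical.

```python
# Approximate ranges reported in Sect. 6.1 for the Cyclone implementation.
CLOCK_BASE_NS = 8.4   # shortest selectable clock period
CLOCK_STEP_NS = 0.8   # nominal increment per select code
SKEW_BASE_NS = 12.3   # smallest selectable clock skew
SKEW_STEP_NS = 0.9    # nominal increment per select code
SELECT_BITS = 4       # C0-C3 for the clock, S0-S3 for the skew


def selected_delay_ns(select, base_ns, step_ns, bits=SELECT_BITS):
    """Delay chosen by a multiplexer select code (assumed linear mapping)."""
    if not 0 <= select < (1 << bits):
        raise ValueError("select code out of range")
    return base_ns + select * step_ns


def clock_period_ns(c):
    """Clock period for the select code applied to C0-C3."""
    return selected_delay_ns(c, CLOCK_BASE_NS, CLOCK_STEP_NS)


def clock_skew_ns(s):
    """Clock skew for the select code applied to S0-S3."""
    return selected_delay_ns(s, SKEW_BASE_NS, SKEW_STEP_NS)
```

Sixteen codes with these nominal steps span 8.4 to 20.4 ns for the period and 12.3 to 25.8 ns for the skew, close to the approximate ranges reported in Sect. 6.1.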

The clocks required for the wave-pipelined circuit may also be derived using the internal system clock generator of Altera and Xilinx system-on-programmable chip (SOPC) devices. The maximum operating frequency in this case is limited by the system bus. Alternatively, an external clock may be multiplied by an arbitrary number using the Altera mega core function altclkclock or pllclock. Similarly, in the case of Xilinx Spartan-III family FPGAs, the delay-locked loop (DLL)/digital clock manager (DCM) module may be used for clock multiplication. However, the multiplication factor has to be specified at synthesis time, and hence the clock frequency cannot be altered dynamically as in the scheme given in Fig. 8. The circuit using the programmable clock and skew generator is a suboptimal wave-pipelined circuit, but it can operate at a higher frequency than that reported by the commercially available synthesis tools, which use Dmax for fixing the operating frequency.

The clock and skew generator may be programmed using either an off-chip processor or an on-chip processor. In order to minimize the time required for the adjustment of the parameters of the wave-pipelined circuit (clock frequency and skew), the BIST approach for design for testability [17, 18] may be used. In the BIST approach, a finite state machine (FSM) is assumed to be available off-chip and is used for the adjustment of the parameters of the wave-pipelined circuit [19, 20]. In the SOC approach, a processor is assumed to be available on-chip and is used for the adjustment of the parameters of the wave-pipelined circuit.

4.1 BIST approach for wave-pipelined circuit

Testing a large chip requires a large test sequence, and applying these test sequences to the circuit under test (CUT) using external testers is time consuming. The built-in self test scheme is an alternative that minimizes the testing time. In the BIST scheme, the test sequences are internally generated and applied to the CUT at full speed, and a signature is generated for finding whether the CUT is good or bad. The block diagram of a wave-pipelined circuit with BIST is given in Fig. 10. This is obtained by adding the FSM block and the self-test circuit to the wave-pipelined circuit. The self-test circuit contains the programmable clock, clock skew generator, signature analyzer and test vector RAM.

4.1.1 FSM block

The flow chart given in Fig. 11 describes the function performed by the FSM. In Fig. 11, {Ti: i = 0, 1, 2, …, N − 1} and {dj: j = 0, 1, 2, …, M − 1} denote the sets of clock periods and clock skews, respectively. The FSM block generates the control signal to choose between the normal mode and the self test mode, and this is applied to the select input of the multiplexer. In the self test mode, the FSM systematically varies the clock skews and clock periods. For each clock frequency and skew, the self test circuit generates the test inputs, applies them, generates the signature, compares it with the expected result and finally generates a flag indicating the match. The FSM continues testing until it finds a frequency at which the circuit under test works for at least three skew values. The operating skew value is chosen to be the middle value so that the CUT works reliably even if the delays change due to environmental conditions. For example, in Fig. 7, when the skew is chosen so that it corresponds to either t2 or t2′, the circuit would work reliably during its entire lifetime.

In order to minimize the time required to determine the correct values of the clock skew and clock period, a two step procedure is adopted. The clock frequency is first varied in large steps to determine the range of frequencies in which the circuit works. This is achieved by varying only the two higher order bits of the select inputs of the programmable clock. After the range is determined, fine tuning is achieved by varying the lower order bits. For every frequency at which the circuit is tested, the clock skew is varied gradually, the results are tested for correctness and the clock skews for which the circuit works satisfactorily are noted. The testing time can be minimized by using the optimal test vector set and a signature analyzer [17].

4.1.2 Signature generator
Fig. 10 Self-tuned wave-pipelined circuit

For testing the correctness of the circuit, N test vectors may be fed one after another and the N outputs obtained should be compared with the expected outputs. In order to minimize the number of comparisons, a unique signature is generated out of the N outputs and is compared with the signature corresponding to the expected outputs. The signature generator consists of a pseudo random binary sequence (PRBS) generator with multiple data inputs [17], as shown in Fig. 12. Each successive output of the output register is XORed with the state of the PRBS generator to produce the next state. If the test vector set consists of N vectors, the PRBS generator output contains the signature after the application of N clock pulses. However, due to the propagation delay in the random access memory (RAM), I/O registers and the combinational logic block, the time at which signature generation begins should be delayed with respect to the time at which the application of test vectors begins. The delay depends on the depth of the combinational logic blocks.

Fig. 11 Flowchart of FSM operation

Fig. 12 Signature generator

4.1.3 Test vector generation

In principle, the number of test vectors required for an M input combinational logic circuit is 2^M. If the value of M is small, exhaustive testing of the circuit may be carried out by generating the test inputs through an M bit counter and checking the signature after the counter completes one full cycle. However, some of the inputs may contribute more to Dmax than the others. For example, in the case of the multipliers, the maximum propagation delay occurs only when the MSBs of the operands are 1. If the multiplier works for this case, it will work for the other cases where at least one of the MSBs is zero. Hence, an (M − 2) bit counter is adequate for testing. For circuits with a large number of inputs, exhaustive testing would require a very large testing time. A minimal test vector set, which reduces the testing time without compromising the quality of fault detection, may be obtained using automatic test pattern generator (ATPG) algorithms [17]. Computer aided design tools may also be used for generating the minimal test vectors using an ATPG algorithm and assessing their fault coverage ratio. However, the generation of test patterns for a wave-pipelined circuit is non-trivial because data dependent delays have to be accounted for (the delay for 001 is different from that for 101) [11], and this is compounded by the absence of accurate models for the interconnects in FPGAs. Since the conventional ATPG techniques are not applicable to wave-pipelined circuits, we have to be content with random test vectors. By choosing different test vector sets consisting of different combinations and different orderings of test vectors, the confidence level can be improved.

4.2 SOC approach for wave-pipelined circuits

As mentioned in Sect. 4.1, the BIST approach requires a number of overheads such as the FSM, signature generator and test vector RAM. These blocks are useful only when the clock frequency and skew are to be varied. If the operating frequency is chosen so that the stable period in Fig. 7 is at least twice the worst case variation in the delay due to temperature, neither the clock frequency nor the skew needs to be adjusted again. After this initial selection, the 2D DWT blocks require no further tuning and work satisfactorily without any external intervention. Instead of using a dedicated circuit such as BIST, a processor may be used to carry out the above tuning task. For example, an FPGA based speech recognition system with SOC may perform the various tasks required by optimally partitioning them between hardware and software [21]. The tasks performed in software use the on-chip processor. The hardware block may use wave-pipelining and may be tuned by the on-chip processor at the beginning.

For the SOC approach, the PRBS generator and signature comparator blocks in Fig. 8 may be replaced by a block RAM which is used to store the outputs of the CUT corresponding to the test inputs. Since the communication interface between the on-chip processor and the circuit under test is faster, the outputs can be directly read and compared with the expected outputs for every combination of skew and clock frequency. The flow chart in Fig. 11 can be modified accordingly. The select inputs for the clock and skew blocks and the data inputs to the wave-pipelined circuit may be applied and varied through the on-chip processor.

A variety of choices exist for the implementation of the SOC. The SOC may consist of a hard-core processor such as a PowerPC or ARM processor and an FPGA coprocessor or DSP block. Alternatively, it may consist of a soft-core processor such as Nios II or MicroBlaze and a custom DSP block implemented in the FPGA. In this paper, FPGA based SOCs consisting of either a Nios II or a MicroBlaze soft-core processor are used for the implementation. Figure 13 shows the interface diagram of a Nios II processor along with the custom block (hybrid wave-pipelined circuit).

Fig. 13 Adding custom logic to the Nios II ALU
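The tuning search described in Sects. 4.1.1 and 4.2, whether run by the FSM or by the on-chip processor, can be sketched as a simple loop. The sketch is a plain linear search and omits the two step coarse and fine procedure. The callback `works` stands in for applying the test vectors and checking the outputs on real hardware; the toy model `toy_cut_works` used here, which accepts a skew only when the latching instant falls between Dmax and the clock period plus Dmin, is an idealized assumption and not a measurement.

```python
def tune(periods, skews, works, min_run=3):
    """Find the shortest period with at least min_run consecutive good skews.

    Returns (period, skew), with the skew taken from the middle of the
    working run, or None when no period qualifies.
    """
    for period in sorted(periods):
        run = []
        for skew in skews:
            if works(period, skew):
                run.append(skew)
            elif len(run) >= min_run:
                break      # a long enough run has just ended
            else:
                run = []   # run too short, start counting again
        if len(run) >= min_run:
            return period, run[len(run) // 2]
    return None


def toy_cut_works(period, skew, d_max=15.302, d_min=7.34):
    """Hypothetical stand-in for a hardware test of one (period, skew) pair."""
    return d_max <= skew <= period + d_min


# Select-code grids using the nominal steps quoted in Sect. 6.1 (ns).
PERIODS_NS = [round(8.4 + 0.8 * k, 1) for k in range(16)]
SKEWS_NS = [round(12.3 + 0.9 * k, 1) for k in range(16)]
result = tune(PERIODS_NS, SKEWS_NS, toy_cut_works)
```

Choosing the middle skew of the working run leaves margin on both sides, which is the reason the text gives for requiring at least three working skew values.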

5 Architecture for the computation of 2D DWT using lifting scheme

The automation schemes proposed in the previous section are used for tuning the hybrid scheme for the 2D DWT. The details of the architecture used and the assumptions made about the individual blocks of the 2D DWT are presented in this section. Sub-images of size 32 × 32 with 8 bits per pixel are used for the computation. The DWT coefficients are assumed to be represented using 11 bits. Each pixel is extended to 11 bits by adding three zeros at the most significant positions. This is done in order to make the word size of the inputs to the horizontal and vertical filters the same, which enables the same hardware or program to be reused for the computation of the outputs of both horizontal and vertical filters. The inputs to the horizontal filter are the pixel intensity values whereas the inputs to the vertical filters are DWT coefficients. The lifting multiplier constants (a, b, c, d, n) are assumed to be of 11 bits each. The block diagram of the one level 2D DWT is shown in Fig. 14. For the horizontal filters, the even and odd inputs are applied from two block RAMs of size 512 × 11. The result is written into four block RAMs of size 256 × 11. For the vertical filters, the inputs are applied from these four block RAMs and the outputs are written into another four block RAMs. For testing, the image is assumed to be loaded into the block RAMs using a memory initialization file (MIF).

Fig. 14 Overall block diagram of one level 2D DWT

5.1 Block diagram of two level 2D DWT

The block diagram of the two level 2D DWT is shown in Fig. 15. In order to minimize the area required for the implementation, the horizontal filter and the vertical filters are reused to compute the multilevel 2D DWT. Block RAMs E1 and O1 contain the even and odd streams of the initial data to be transformed. Block RAMs E2E/E2O, E3, E4 and E5 hold the output of the one level 2D DWT. The even and odd numbered coefficients of the LL1 component are stored in the two block RAMs E2E and E2O and are used as inputs for the 2nd level DWT. The outputs of the 2nd level DWT are stored in block RAMs E6, E7, E8 and E9. The output of the horizontal filter is stored in four block RAMs E10, O10, E11 and O11. If only LL2, the low pass band corresponding to the two level DWT, is required, only one demultiplexer and seven block RAMs (E1, E2E, O1, E2O, E10, O10, E3) are required. For the purpose of verification, only LL2 is computed and compared for the different schemes of computation of the two level 2D DWT. For the computation of the LL1 component of the one level 2D DWT, only block RAMs E1, O1, E2E, E2O, E10 and O10 are used.

Fig. 15 Block diagram of two level 2D DWT

5.2 Overlapping scheme for the computation of 2D DWT of complete image

The architecture proposed in Sect. 5.1 for sub-images of size 32 × 32 may be used for the computation of the 2D DWT of a larger image by splitting it into a number of overlapping sub-images of size 32 × 32. The advantage of splitting the image into a number of sub-images is that the 2D DWT can be computed in parallel in a number of computational engines. It also reduces the memory required for storing the image and its transform. In the overlapping scheme, the image blocks are formed such that the number of pixels overlapping between adjacent blocks along the vertical and horizontal directions is equal to the order of the filter. For example, for the 9/7 biorthogonal filter used for the 2D DWT, the number of overlap pixels should be equal to four on the left and four on the right between horizontal blocks. Similarly, the number of overlap pixels between vertical blocks should be equal to four on the top and four on the bottom. For the blocks on the boundary, overlapping needs to be done only on the non-boundary edge.
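One way to lay out the overlapping 32 × 32 sub-images is sketched below. The paper does not give the tiling rule, so the choices here are assumptions: each tile extends four pixels into each neighbour, which makes adjacent tiles share eight pixels and gives a stride of 24, and the last tile in each direction is clamped to the image edge. The function names are hypothetical.

```python
def tile_origins(length, tile=32, overlap=4):
    """Start offsets of overlapping tiles along one image dimension."""
    if length < tile:
        raise ValueError("image dimension is smaller than one tile")
    stride = tile - 2 * overlap  # new pixels contributed by an interior tile
    last = length - tile         # the final tile is clamped to the image edge
    origins = list(range(0, last, stride))
    origins.append(last)
    return origins


def tile_grid(height, width, tile=32, overlap=4):
    """Top-left corners (row, column) of all sub-images of an image."""
    rows = tile_origins(height, tile, overlap)
    cols = tile_origins(width, tile, overlap)
    return [(r, c) for r in rows for c in cols]
```

Each corner returned by `tile_grid` identifies one sub-image that could be handed to a separate computational engine. When the image size does not fit the stride exactly, the clamped last tile simply overlaps its neighbour by more than eight pixels.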

6 Implementation results

In order to demonstrate the applicability of the automation approaches to both Xilinx and Altera FPGAs, the 2D DWT is implemented using both Xilinx Spartan and Altera Cyclone FPGAs and the results are presented in this section. In each of the FPGAs, the 2D DWT is computed using three multiplication schemes: hybrid WP-P BW-KCM, non-pipelined BW-KCM and BW-PKCM.

6.1 Programmable clock and skew generators

The operating frequency of the wave-pipelined circuit is expected to lie between those of the non-pipelined and pipelined circuits. Hence, the minimum and maximum frequencies of the clock generator should correspond to the maximum operating frequencies of the non-pipelined and pipelined circuits, respectively. The approximate values of the clock periods of these circuits for the implementation of the b block on the Cyclone FPGA are 5.6 and 7.4 ns, respectively. The values of Dmax and Dmin for the a block are 15.302 and 7.34 ns, respectively. The programmable clock and skew generators are designed such that the clock period can be varied from 8.4 to 20.6 ns in steps of approximately 0.8 ns and the skew can be varied from 12.3 to 26.2 ns in steps of approximately 0.9 ns. The same exercise is carried out for the b, c and d blocks using the synthesis report. A single clock generator is used for all four blocks. Separate skew generators are used for each of the four blocks. A majority logic gate is suggested in [23] for removing glitches in the clock signal. The operating frequency and skews are chosen using the FSM such that all the blocks work satisfactorily. A similar procedure is adopted for the implementation on the other two FPGAs. The locations of the logic elements and the interconnects used for the implementation of the clock and skew blocks should be fixed so that the interconnect delays are not altered when these blocks are integrated with the 2D DWT or the soft-core processor. This is achieved by using the LogicLock feature in Altera. In the case of Xilinx FPGAs, this is achieved by using Macros.

6.2 Implementation of 2D DWT using BIST approach

The one level 2D DWT is implemented on a Xilinx Spartan-II XC2S100 FPGA using the BIST approach. It may be noted that the BIST approach is also applicable for Altera

123

226 Table 1 Implementation of 9/7 bi-orthogonal lters with 11 9 8 multipliers using the various schemes Multiplier (BW-KCM) Non-pipelined Pipelined Hybrid WP-P Slices Number of registers 176 803 176 Speed (MHz) 57.3 149.18 75.75

J Real-Time Image Proc (2008) 3:217229

253 453 253

pipelined BW-(P) KCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.2 and this is achieved with the increase in the number of registers by a factor of 2.73 and increase in the number of slices by a factor of 1.32. Table 3, shows the implementation results for two level 2D DWT for the pipelined scheme. The implementation of hybrid WP two level 2D DWT using BIST approach is under progress. 6.3 Implementation of 2D DWT using SOC approach For the hybrid WP block, the optimal clock period and clock skews are determined using the procedure described in Sect. 6.1. The hybrid wave-pipelined 2D DWT unit (obtained by adding the input and output block RAMs to the non-pipelined circuit along with the programmable clock and clock skew blocks) is tested rst using simulation. As mentioned in Sect. 3, simulation is inadequate to test the hybrid wave-pipelined circuit. Hence, this circuit is implemented along with the Nios II or Micro blaze soft-core processor and the former is added as the custom block to the Nios II or Micro blaze using SOPC builder or embedded design kit (EDK) builder. The program to be executed by the Nios II or Micro blaze is written in C/C++ and the custom block is invoked as a function in the C/C++ program. A C++ program is written to read and write from the block RAM in the custom block. The C++ program is compiled and the executable code along with the conguration bits corresponding to Nios II or Micro blaze integrated with the custom block is down loaded to the FPGA. When the C program is run, it systematically varies the select inputs for the clock and clock skew blocks, and uploads the content of the output block RAM. The clock and skew are adjusted till the match occurs for at least three consecutive clock skews. The operating

FPGAs. A personal computer (PC) is used for the realization of the FSM. The interface used between PC and FPGA is the same as that described in [14]. The output of the hybrid circuit (11 bits) is EXORed with the 11 bit PRBS generator and the signature is obtained. The implementation results of the 9/7 horizontal lters for one level 2D DWT on Xilinx Spartan-II XC2S100 FPGA are given in Table 1. Multipliers of size 11 9 8 are implemented. From Table 1, it may be concluded that for the lter, the method using hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.32 and requires the same area. The pipelined BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.97 and this is achieved with the increase in the number of registers by a factor of 4.6 and increase in the number of slices by a factor of 1.79. The implementation results of one level 2D DWT for a sub-image of size 32 9 32 using BIST approach are shown in Table 2. In order to make the horizontal and vertical lters to be identical, multipliers of size 11 9 11 are used for both of them. Three zeros are appended before the input samples as discussed in Sect. 5. The overheads required for the wave-pipelined circuits are also shown in Table 2. It may be noted that overhead required is about 22.5%. From Table 2, it may be concluded that for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than non-pipelined BWKCM by a factor of 1.4 and requires the same area. The
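The tuning procedure used in the SOC approach can be sketched in a few lines. This is a hedged illustration rather than the authors' program (which is written in C/C++ for the soft-core processor): `run_test` is a hypothetical callback that applies one pair of clock and skew select inputs and reports whether the output block RAM matches the expected result, and trying the clock settings in index order is an assumption. The choice of the middle of the matching run follows the text, which fixes the operating point at the middle value.

```python
from typing import Callable, Optional, Sequence, Tuple


def middle_of_first_run(matches: Sequence[bool], min_run: int = 3) -> Optional[int]:
    """Middle index of the first run of at least `min_run` consecutive passes."""
    start = None
    # A trailing False acts as a sentinel so a run ending at the last
    # setting is still closed and examined.
    for i, ok in enumerate(list(matches) + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            length = i - start
            if length >= min_run:
                return start + (length - 1) // 2
            start = None
    return None


def tune(run_test: Callable[[int, int], bool],
         n_clock: int, n_skew: int, min_run: int = 3) -> Optional[Tuple[int, int]]:
    """Sweep clock and skew select inputs; return the chosen (clock, skew) pair.

    Assumption: clock setting 0 is tried first and the first clock setting
    with a wide enough window of passing skews is accepted.
    """
    for clk in range(n_clock):
        matches = [run_test(clk, skew) for skew in range(n_skew)]
        mid = middle_of_first_run(matches, min_run)
        if mid is not None:
            return clk, mid
    return None


if __name__ == "__main__":
    # Stand-in for the hardware: passes only for clk >= 2 and skew in 5..7.
    def fake_hardware(clk: int, skew: int) -> bool:
        return clk >= 2 and 5 <= skew <= 7

    print(tune(fake_hardware, n_clock=16, n_skew=16))
```

Requiring a window of several consecutive passing skews, and operating at its centre, leaves margin on both sides of the chosen skew.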

Table 2 Implementation results on one level 2D DWT (1 slice = 2 LUTs)

Approach and device                               Lifting scheme    Number of slices or LEs    Number of registers    Speed (MHz)
BIST approach with Spartan-II XC2S100PQ208-5      Non-pipelining    836                        611                    54.45
                                                  Pipelining        1,110                      1,670                  87.54
                                                  Hybrid WP-P       836 (188)                  611 (85)               75.75
SOC approach with Cyclone-II EP2C35F672C6         Non-pipelining    703                        375                    117.83
                                                  Pipelining        782                        671                    203.92
                                                  Hybrid WP-P       703 (30)                   375 (8)                147.5
SOC approach with Spartan-III XC3S200FT256-4      Non-pipelining    897                        730                    67.9
                                                  Pipelining        1,381                      2,305                  114.19
                                                  Hybrid WP-P       897 (32)                   730 (8)                82.6

Values in parentheses denote the additional overhead for testing WP circuits
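The speed, register and area factors quoted in the text can be recomputed from the Table 2 entries. The short script below is a sketch for that check; the dictionary layout and the function name `ratios` are illustrative choices, and the hybrid figures exclude the parenthesized test overhead.

```python
# Table 2 entries as (slices or LEs, registers, speed in MHz).
# Hybrid WP-P values exclude the parenthesized test overhead.
TABLE2 = {
    "Spartan-II (BIST)": {
        "non": (836, 611, 54.45),
        "pipe": (1110, 1670, 87.54),
        "hybrid": (836, 611, 75.75),
    },
    "Cyclone-II (SOC)": {
        "non": (703, 375, 117.83),
        "pipe": (782, 671, 203.92),
        "hybrid": (703, 375, 147.5),
    },
    "Spartan-III (SOC)": {
        "non": (897, 730, 67.9),
        "pipe": (1381, 2305, 114.19),
        "hybrid": (897, 730, 82.6),
    },
}


def ratios(entry):
    """Return the four comparison factors for one device."""
    non, pipe, hyb = entry["non"], entry["pipe"], entry["hybrid"]
    return {
        "hybrid_vs_non_speed": hyb[2] / non[2],
        "pipe_vs_hybrid_speed": pipe[2] / hyb[2],
        "pipe_vs_hybrid_registers": pipe[1] / hyb[1],
        "pipe_vs_hybrid_area": pipe[0] / hyb[0],
    }


if __name__ == "__main__":
    for device, entry in TABLE2.items():
        summary = ", ".join(f"{k}={v:.2f}" for k, v in ratios(entry).items())
        print(f"{device}: {summary}")
```

For the Spartan-II row the pipelined-to-hybrid speed ratio evaluates to about 1.16, which the text quotes as 1.2.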

Table 3 Area and speed performance of two level forward 2D DWT on Xilinx Spartan-II XC2S-200PQ208-5

Lifting scheme    Slices used    Number of registers    Speed (MHz)
Pipelined         1,511          2,506                  61.42

The operating clock and clock skew of the wave-pipelined circuit are fixed at the middle value and, from then on, the custom block works without any intervention from the Nios II or MicroBlaze processor.

6.3.1 Implementation results on one level 2D DWT using Cyclone-II EP2C35F672C6

The one level 2D DWT is implemented on Cyclone-II EP2C35F672C6 with and without pipelining. A single filter is implemented and time shared for the computation of the outputs of both horizontal and vertical filters. The 2D DWT block is added as a custom block to the Nios II CPU and downloaded to the Cyclone-II. The 2D DWT is also computed using the in-built instruction set of Nios II [22]. The numbers of CPU clock cycles for both cases are tabulated in Table 4. (The clock frequency obtained using the above device is 40 MHz.) For the hybrid wave-pipelined circuit, the number of logic elements, number of registers, maximum operating frequency and power dissipated are computed using the Cyclone-II FPGA and the results are given in Table 2. It may be noted that the overhead required for the wave-pipelined circuit is about 4%. From Table 2, it may be concluded that for the lifting scheme, the method using the hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.25. The scheme with the Baugh-Wooley pipelined constant coefficient multiplier is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.38 and this is achieved with an increase in the number of registers by a factor of 1.78 and an increase in the number of LEs by a factor of 1.11.

Table 4 Computation time for 2D DWT

Function    Number of CPU clock cycles for software approach    Equivalent CPU clock cycles for custom block
2D DWT      73,280                                              814

Pipelining may be used either for increasing the operating frequency of a circuit or for reducing the power dissipation [12]. Pipelining requires more registers and area. However, it does not necessarily lead to more power dissipation. In order to assess whether hybrid wave-pipelining is superior with regard to power dissipation, both the hybrid wave-pipelined and pipelined circuits are operated at the same frequency (corresponding to the maximum operating frequency of the hybrid wave-pipelined circuit) and the power dissipated for the two approaches is given in Table 5. From Table 5, it may be noted that the pipelined circuit dissipates 11% less power than the hybrid wave-pipelined 2D DWT.

Table 5 Power dissipated by pipelined and hybrid wave-pipelined one level 2D DWT at normalized frequency

Description of the circuit    Power at normalized frequency including additional overhead
Pipelined circuit             158.97
Hybrid circuit                179.58

6.3.2 Implementation results on one level 2D DWT using Spartan-III XC3S200

Implementation results for one level 2D DWT on Xilinx Spartan-III XC3S200 using all three approaches are given in Table 2. The programmable clock and clock skew blocks are implemented as Macro blocks using Xilinx ISE 8.1i Project Navigator. For tuning the hybrid wave-pipelined circuit, the MicroBlaze soft-core processor is used. Xilinx EDK software is used to integrate the custom block with the MicroBlaze processor. The rest of the steps are similar to those used for the Altera SOC kit. For all three schemes, the number of logic elements, number of registers and maximum operating frequency are computed and the results are given in Table 2. It may be noted that the overhead required for the wave-pipelined circuit is about 3.5%. Dedicated filters are used for the computation of the outputs of the horizontal and vertical filters. Hence, the area required for this scheme is higher than that using Cyclone-II devices. From Table 2, it may be concluded that for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM by a factor of 1.21. The scheme with the Baugh-Wooley pipelined constant coefficient multiplier is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.38 and this is achieved with an increase in the number of registers by a factor of 3.15 and an increase in the number of LEs by a factor of 1.53.

6.4 Validation of the scheme for 2D DWT

To verify the correctness of the schemes proposed for the computation of 2D DWT, the Lena image of size 128 × 128 with blocks (sub-images) of size 32 × 32 pixels is used. The 128 × 128 image is shown in Fig. 16 and is obtained by subsampling the standard image of size 512 × 512 by a factor of four along both dimensions. As mentioned in Sect. 5.2, an overlap of four pixels is used between adjacent blocks. In total, 36 image blocks are used for the 128 × 128 image. The 2D DWT of the image is computed both in software, using a C program, and in hardware, using the FPGA.

Fig. 16 LL1 component compared with input image

For the implementation in C, the lifting multiplier constants (a, b, c, d, n1, n2) and the filter coefficients for the distributed arithmetic algorithm are declared as double type (64 bits) variables. The pixel intensities are declared as short type (16 bits) variables. The analysis filter outputs obtained for the 36 image blocks are merged suitably and the LL1 component of the image is shown in Fig. 16.

The implementation of the forward 2D DWT for an image block of size 32 × 32 is carried out for the lifting scheme with BW-hybrid WP-PKCM. For the implementation, the Xilinx Spartan XC2S100PQ208-5 device is used. For storing the image input, the outputs of the horizontal filter and the outputs of the vertical filters, the block RAMs are configured suitably. The image is loaded into the block RAMs through the UCF of the implementation tool. The one level 2D DWT is computed using the above scheme for all 36 image blocks and the results are merged suitably. The LL1 component of the image is shown in Fig. 16. From these figures, it may be concluded that the LL1 components obtained through the FPGA implementation match well with those obtained using C. The LL1 components also match well with the original image. In order to make a quantitative comparison of the LL1 component with the original image, the original Lena image is subsampled to size 64 × 64. Treating the LL1 component itself as the compressed image, the PSNR values of the compressed images using BW-hybrid WPKCM and C are computed and are found to be 28.22 and 33.33, respectively.
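For reference, the PSNR comparison can be sketched as below. The text does not state the peak value or the exact formula used, so the conventional definition for 8-bit images (peak value 255, result in dB) is assumed here; the function name `psnr` and the list-of-rows image representation are illustrative.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[Sequence[float]],
         test: Sequence[Sequence[float]],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two equally sized images.

    Assumes the conventional definition 10*log10(peak^2 / MSE); the peak
    of 255 corresponds to 8-bit pixel intensities.
    """
    if len(reference) != len(test) or any(
            len(r) != len(t) for r, t in zip(reference, test)):
        raise ValueError("images must have the same dimensions")
    n = sum(len(row) for row in reference)
    if n == 0:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixel pairs.
    mse = sum((a - b) ** 2
              for r, t in zip(reference, test)
              for a, b in zip(r, t)) / n
    if mse == 0:
        return math.inf                 # identical images
    return 10.0 * math.log10(peak * peak / mse)


if __name__ == "__main__":
    original = [[100, 100], [100, 100]]
    approximation = [[101, 99], [100, 102]]
    print(round(psnr(original, approximation), 2))
```

In the setting of this section, `reference` would be the 64 × 64 subsampled original and `test` the LL1 component scaled to the same intensity range.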

7 Conclusion

Two automation schemes are proposed in this paper for the implementation of the 9/7 bi-orthogonal filters using the hybrid WP-P constant coefficient multiplier with the Baugh-Wooley multiplication algorithm. Nios II and MicroBlaze soft-core processors are integrated with 2D DWT blocks successfully and the optimum clock period and clock skews for the 2D DWT blocks are selected using them. After this initial selection, the 2D DWT blocks work satisfactorily without any external intervention and the processors are free to do other tasks. The 9/7 bi-orthogonal filters are implemented on both Xilinx and Altera devices using the lifting scheme with the following three multipliers: BW-PKCM, BW-KCM and hybrid WP-P BW-KCM. From the implementation results, it is verified that hybrid WP-P BW-KCM is faster than non-pipelined BW-KCM. The scheme with BW-PKCM is in turn faster than the hybrid WP-P BW-KCM and this is achieved with an increase in the number of registers and an increase in the number of LEs. The custom instruction for 2D DWT is found to be faster than the implementation using C. The correctness of the procedure for the computation of the 2D DWT of an image, using the 2D DWT of sub-images, is verified by computing the 2D DWT using both hardware and software approaches (using C) and displaying the LL1 components for an image of size 128 × 128. The automation schemes proposed in this paper have also been successfully employed in [23] for the implementation of wave-pipelined filters using the distributed arithmetic algorithm and a sine wave generator using CORDIC. Work on the computation of multi-level 2D DWT and real-time computation of 2D DWT using the hybrid scheme is in progress. One of the challenges in the design of FPGA based wave-pipelined circuits is the accurate modeling of the interconnects as well as device delays and their temperature dependence. In the absence of these models, the wave-pipelined circuits can only be operated at moderate speeds.

References
1. Xilinx documentation library, Xilinx Corporation, USA
2. Altera documentation library-2003, Altera Corporation, USA
3. Sheldon, D., Kumar, R., Vahid, F., Tullsen, D., Lysecky, R.: Conjoining soft-core FPGA processors. In: IEEE/ACM International Conference on Computer Aided Design, 2006, ICCAD, pp. 694–701 (2006)
4. Draper, B.A., Beveridge, J.R., Willem Bohm, A.P., Ross, C., Chawathe, M.: Accelerated image processing on FPGAs. IEEE Trans. Image. Process. 12(12), 1543–1551 (2003)
5. Ritter, J., Molitor, P.: A pipelined architecture for partitioned DWT based lossy image compression using FPGAs. In: Proceedings ACM Conference FPGA 2001, pp. 201–206 (2001)
6. Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K., Sriram, G.: Design and FPGA implementation of image block encoders with 2D-DWT. Proceedings TENCON 2003. 3, 1015–1019 (2003)
7. Daubechies, I., Sweldens, W.: Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl. 4, 247–269 (1998)
8. Nyathi, J., Delgado-Frias, J.G.: A hybrid wave-pipelined network router. IEEE Trans. Circuits. Syst-I, Fundam. Theory. Appl. 49(12), 1764–1772 (2002)
9. Hauck, O., Katoch, A., Huss, S.A.: VLSI system design using asynchronous wave pipelines: a 0.35 μm CMOS 1.5 GHz elliptic curve public key cryptosystem chip. In: Proceeding of sixth international symposium on advanced research in asynchronous circuits and systems 2000 (ASYNC 2000), pp. 188–197 (2000)
10. Burleson, W.P., Ciesielski, M., Klass, F., Liu, W.: Wave-pipelining: a tutorial and research survey. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 6(3), 464–474 (1998)
11. Gray, T.C., Liu, W., Cavin, R.K., III.: Wave-Pipelining: Theory and CMOS Implementation. Kluwer, Boston (1994)
12. Parhi, K.K.: VLSI Signal Processing Systems. Wiley, New York (1999)
13. Boemo, E.I., Lopez-Buedo, S., Meneses, J.M.: Wave-pipelines via look-up tables. IEEE Int. Symp. Circuits Syst. (ISCAS 1996). 4, 85–88 (1996)
14. Lakshminarayanan, G., Venkataramani, B.: Optimization techniques for FPGA based wave-pipelined DSP blocks. IEEE Trans. Very Large Scale Integration (VLSI) Syst. 13(7), 783–793 (2005)
15. Acharya, T., Tsai, P.-S.: JPEG2000 Standard for Image Compression Concepts Algorithms and VLSI Architectures. Wiley, New York (2005)
16. Sayood, K.: Introduction to Data Compression. Morgan Kaufmann, Menlo Park (2000). An Imprint of Elsevier
17. Smith, M.J.S.: Application Specific Integrated Circuits. Pearson Education Asia Pvt. Ltd, Singapore (2003)
18. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of self tuned wave-pipelined filters. IETE J. Res. 52(4), 281–286 (2006)
19. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of wave-pipelined image block encoders using 2D-DWT. In: Proceedings of VLSI design and test symposium VDAT 2005, pp. 12–20 (2005)
20. Seetharaman, G., Venkataramani, B., Lakshminarayanan, G.: Design and FPGA implementation of wave-pipelined distributed arithmetic based filters. In: Proceedings of VLSI Design and Test workshop VDAT 2004, pp. 216–220 (2004)
21. Amudha, V., Venkataramani, B., Vinoth Kumar, R., Ravishankar, S.: SOC Implementation of HMM Based Speaker Independent Isolated Digit Recognition System. In: 20th International IEEE Conference on VLSI Design (VLSID07), pp. 1–6 (2007)
22. Seetharaman, G., Venkataramani, B., Amudha, V., Saundattikar, A.: System on chip implementation of 2D DWT using lifting scheme. In: Proceedings of the International Asia and South Pacific Conference on Embedded SOCs (ASPICES 2005), (2005)
23. Seetharaman, G., Venkataramani, B.: SOC implementation of wave-pipelined circuits. Proceedings of IEEE International conference on Field Programmable Technology 2007 (ICFPT 2007), pp. 9–16 (2007)

Author Biographies

G. Seetharaman received his B.E. and M.E. degrees in Electronics and Communication Engineering from Regional Engineering College, Tiruchirappalli in 1997 and 2002, respectively. Presently, he is carrying out his doctoral thesis work in the National Institute of Technology, Tiruchirappalli. Previously, he worked as faculty in the Jayaram College of Engineering and Technology, Tiruchirappalli, for 6 years and as Research Associate for three semesters in the National Institute of Technology, Tiruchirappalli. Presently, he is working as Laboratory Engineer in National Institute of Technology, Tiruchirappalli. His current research interests include embedded system design using field-programmable gate arrays (FPGAs) and system-on-chip (SOC).

B. Venkataramani received his B.E. degree in Electronics and Communication Engineering from Regional Engineering College, Tiruchirappalli in 1979 and M.Tech. and Ph.D. degrees in Electrical Engineering from Indian Institute of Technology, Kanpur in 1984 and 1996, respectively. He worked as Deputy Engineer in Bharat Electronics Limited, Bangalore, India, and as a Research Engineer in Indian Institute of Technology, Kanpur, each for approximately 3 years. Since 1987 he has been a faculty member of National Institute of Technology (formerly Regional Engineering College), Tiruchirappalli. Presently, he is working as Professor and Head of the Department of Electronics and Communication in National Institute of Technology. He has published two books and numerous papers in journals and international conferences. His current research interests include FPGA applications, SOC based system design and performance analysis of high speed computer networks.

G. Lakshminarayanan received his M.E. and Ph.D. degrees in Electronics and Communication Engineering from Bharathidasan University, Tiruchirappalli in 1995 and 2005, respectively. He previously worked as a Service Engineer for 5 years and as a Scientist and Research Associate for 4 years in Regional Engineering College, Tiruchirappalli. He was a faculty member in SASTRA, Tanjore, for two semesters and an Assistant Professor in Saranathan College of Engineering, Tiruchirappalli, for 1 year. Presently he is working as Assistant Professor in National Institute of Technology, Tiruchirappalli. His current research interests include FPGA based system design and VLSI front end design.
