

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 11, NOVEMBER 2006

VLSI Design of a Wavelet Processing Core


Sze-Wei Lee, Member, IEEE, and Soon-Chieh Lim
AbstractA processing core architecture for the implementation of the discrete wavelet transform (DWT), optimized for throughput, scalability and programmability is proposed. The architecture is based on the RISC architecture with an instruction set specically designed to facilitate the implementation of wavelet-based applications and a memory controller optimized for the memory access pattern of DWT processing. Index TermsDiscrete wavelet transform (DWT), parallel computing, RISC architecture, VLSI design.

Fig. 1. Example of the 5/3 fast-lifting wavelet filter.

I. INTRODUCTION

DISCRETE wavelet transform (DWT) is a class of transforms for analyzing and representing a signal, and has been found to be one of the most powerful techniques in digital signal processing applications. The DWT can be performed using the fast-lifting approach [1]. The basic concept of the fast-lifting DWT is to represent the DWT-implementing filter as a polyphase matrix, which in turn is factorized into a series of smaller filtering steps. Fig. 1 shows an example of the 5/3 wavelet filter being factorized into two lifting steps. Instead of the eight multiplications and six additions of the original 5/3 wavelet filter equations, the lifting wavelet transform reduces the complexity by almost half, as it requires only four additions and four multiplications (only two multiplications if the terms are grouped before multiplication). Furthermore, the inverse transform can be readily obtained by reversing the signal flow (arrow directions) and inverting the signs of the coefficients (positive to negative, and vice versa), as shown in Fig. 2. Another important property of the fast-lifting DWT is that it allows rounding to be performed on the filtered outputs and intermediate results while still achieving perfect reconstruction [2]. Calderbank, Daubechies, Sweldens, and Yeo [3] designed a series of different wavelet filters that map integers to integers. These filters enable efficient VLSI implementations because they require only integer arithmetic operations. Table I shows some of the DWT filters proposed in [3]. There have been several propositions for the VLSI implementation of the DWT. The VLSI implementations proposed in [4] and [5] are optimized for the Daubechies wavelet and are not scalable to support other types of wavelets. The systolic array and single-instruction multiple-data (SIMD) processor array architectures proposed in [6] are based on the direct-form DWT and require a large number of multipliers, which results in a large chip area.

Fig. 2. Signal flow diagram of the fast-lifting forward and inverse transform.
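To make the lifting structure of Figs. 1 and 2 concrete, the following minimal Python sketch implements one level of the integer 5/3 lifting transform and its inverse (function and variable names are ours; symmetric boundary extension is assumed, as is common for this filter). The assertion checks the perfect-reconstruction property discussed above, despite the floor-rounding in each step.

def fwd_53(x):
    # One level of the integer 5/3 fast-lifting DWT (even-length input).
    N = len(x)
    half = N // 2
    d = [0] * half
    s = [0] * half
    for n in range(half):                       # predict step -> high-pass d[n]
        right = x[2*n + 2] if 2*n + 2 < N else x[N - 2]   # symmetric edge
        d[n] = x[2*n + 1] - (x[2*n] + right) // 2
    for n in range(half):                       # update step -> low-pass s[n]
        left = d[n - 1] if n > 0 else d[0]                # symmetric edge
        s[n] = x[2*n] + (left + d[n] + 2) // 4
    return s, d

def inv_53(s, d):
    # Inverse: same lifting steps, in reverse order and with inverted signs.
    half = len(s)
    x = [0] * (2 * half)
    for n in range(half):                       # undo update
        left = d[n - 1] if n > 0 else d[0]
        x[2*n] = s[n] - (left + d[n] + 2) // 4
    for n in range(half):                       # undo predict
        right = x[2*n + 2] if 2*n + 2 < 2*half else x[2*half - 2]
        x[2*n + 1] = d[n] + (x[2*n] + right) // 2
    return x

sig = [81, 60, 92, 13, 27, 44, 70, 5]
assert inv_53(*fwd_53(sig)) == sig    # perfect reconstruction despite rounding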

TABLE I EXAMPLES OF FAST LIFTING INTEGER WAVELET TRANSFORM FILTER

Manuscript received March 8, 2005; revised March 22, 2006. This paper was recommended by Associate Editor L.-G. Chen. S.-W. Lee is with the Faculty of Engineering, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia (e-mail: swlee@mmu.edu.my). S.-C. Lim was with the Faculty of Engineering, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia. He is now with eAsic Corporation, Penang, Malaysia (e-mail: limsoonchieh@yahoo.com). Digital Object Identifier 10.1109/TCSVT.2006.883507

Furthermore, the systolic array is not configurable on-the-fly to suit different forms of wavelets (with different coefficients and lengths). More recent architectures, such as those proposed in [7]–[9], are based on systems comprising field-programmable gate arrays (FPGAs) and digital signal processors. These systems are configurable and programmable. However, they also require large chip areas because of the use of off-the-shelf commercial components and large amounts of off-chip memory. Furthermore, the performance of these designs depends very much on the performance of the FPGAs and digital signal processors used. In this paper, we propose a highly programmable and scalable processing core architecture for the implementation of the DWT based on the fast-lifting approach. The high programmability of the architecture is achieved through an instruction set that enables not only DWT processing but also other signal processing tasks to be implemented efficiently and flexibly on the processing core for different applications.




TABLE II FILTERING OPERATIONS FOR VARIOUS IMAGE SIZES AND FRAME RATES

Fig. 3. Two-level, 2-D pyramid algorithm. At each level, the input samples are transformed and down-sampled twice (horizontal and vertical passes).

The processing core is readily programmable for the implementation of different filters with different coefficients. The high scalability of the architecture is achieved through the flexibility and ease of choosing the number of processing engines (PEs), the clock frequency, and the memory configuration to suit different application requirements. Furthermore, the architecture can be used as a building block for system-on-chip (SOC) integration.

II. REQUIREMENTS

A. Pyramid Algorithm

For two-dimensional (2-D) data, such as image data, a single-level 2-D DWT requires two passes of one-dimensional (1-D) transforms, i.e., a vertical and a horizontal transform. In each pass, a 1-D transform is performed on each row (for the horizontal pass) or column (for the vertical pass) of the data. Hence, for an image of M rows and N columns of pixels, a 2-D transform requires 2MN transform operations. Furthermore, for moving pictures (e.g., video) with a frame rate of F frames per second (fps), we require 2MNF transform operations per second to carry out a 1-level 2-D DWT on each frame. In the pyramid algorithm for multiresolution analysis [10], multiple levels of transform are performed on the 2-D input data, and the resulting coefficients are effectively down-sampled by four after each level of 2-D transform (Fig. 3). Since each transform operation is implemented using a filter, we shall subsequently refer to a transform operation as a filtering operation. For a K-level 2-D DWT, the total number of filtering operations per second (FOPS), N_{FOPS}, required is

N_{FOPS} = \sum_{k=0}^{K-1} \frac{2MNF}{4^{k}}  (1)

         = \frac{8}{3} MNF \left( 1 - 4^{-K} \right).  (2)

TABLE III ARITHMETIC OPERATIONS IN VARIOUS FAST-LIFTING DWT FILTERS

Table II lists the number of filtering operations required for gray-scale (8-bit per pixel) videos of different image sizes and frame rates, computed using (2).
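As a quick numerical check of (1) and (2), the short Python script below computes N_FOPS with the variable names defined above; the 640 × 480, 30-fps case reproduces the 24.58 × 10^6 figure quoted in Section II-C.

# Filtering operations per second for a K-level 2-D DWT on an
# M x N image at F frames/s, per (1) and (2) above.
def fops_sum(M, N, F, K):
    return sum(2 * M * N * F / 4 ** k for k in range(K))        # eq. (1)

def fops_closed(M, N, F, K):
    return (8 / 3) * M * N * F * (1 - 4 ** (-K))                # eq. (2)

assert abs(fops_sum(480, 640, 30, 3) - fops_closed(480, 640, 30, 3)) < 1e-6
print(fops_closed(480, 640, 30, 3) / 1e6)   # 3 levels: ~24.19 million FOPS
print((8 / 3) * 480 * 640 * 30 / 1e6)       # infinite-level limit: ~24.58 million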

B. Arithmetic Operations

Adams and Kossentini [11] evaluated some popular fast-lifting DWT filters and found the computational requirements (the number of additions, shifts, and multiplications per filtering operation) listed in Table III. Each filtering operation of a fast-lifting filter generates one high-pass and one low-pass coefficient and is thus equivalent to two of the transform operations mentioned above. Note that all divisions are implemented as right shifts. Adams et al. also concluded that the 5/3 filter performs best for lossless compression (lowest bit rate) and has the lowest computational complexity, while the 9/7-F filter performs best for lossy compression [highest peak signal-to-noise ratio (PSNR)] but is significantly higher in complexity. From Table III, we observe that the 9/7-F filter requires 12 additions, 4 shifts, and 4 multiplications for each filtering operation. The largest multiplier coefficient used in the 9/7-F wavelet is 1817, which is also the largest among the coefficients of all the DWT filters in Table III; the coefficients can therefore be represented using 11 bits. For image data, the bit depth is normally 8 bits per pixel. The product of an 8-bit number and an 11-bit coefficient yields a 19-bit result, which is further right-shifted by 13 bits (a division by 8192), yielding a final result of only 6 bits. Therefore, a memory width of 16 bits is sufficient for the storage of all DWT coefficients. Our finding here is consistent with the finding in [11].

C. Hardware Performance Requirements

If each filtering operation were carried out within a clock cycle, an infinite-level 2-D DWT on a gray-scale 640 × 480 image at 30 fps would require 24.58 million clock cycles per second, i.e., a clock frequency of 24.58 MHz. A filtering operation, however, consists of a series of basic arithmetic operations; for example, the 9/7-F filter in Table III involves 20 additions, shifts, and multiplications per filtering operation. Instead of a filtering operation being carried out within a clock cycle, we design for each basic arithmetic operation to be completed within a single clock cycle.


Each filtering operation produces two coefficients: a high-pass and a low-pass coefficient. Thus, we require ten clock cycles per coefficient, or a 24.58 × 10 = 245.8 MHz clock, for an infinite-level 2-D DWT on 30-fps, 640 × 480 video. The required clock frequency can be halved if the number of coefficients produced per filtering operation is doubled using SIMD techniques.

For practical implementations, the wavelet processing system requires not only a sufficiently fast processor to support the frame-rate requirement, but also a fast memory for storing the intermediate results of the DWT operations. The required memory bandwidth is directly proportional to the required FOPS listed in Table II. In fact, the total memory bandwidth B_{mem} can be computed as

B_{mem} = 2 N_{FOPS} b  (3)

where N_{FOPS} and b are the FOPS and the number of bits per coefficient of the 2-D DWT on a video. The factor of 2 in (3) accounts for bidirectional access (reads and writes) on a single-ported memory. For example, if b = 16 (16-bit coefficients) and N_{FOPS} = 24.58 × 10^6 (30-fps, 640 × 480 video; Table II), B_{mem} is found to be 786.4 Mbps. To support such a high memory bandwidth requirement, on-chip internal SRAM or off-chip external DRAM implementations should be considered. The VLSI implementations in [13]–[15] used internal SRAM buffers for storing parts of an image, where an image is split into smaller parts, called tiles, that are processed independently. The tile sizes are arbitrary but normally vary from 128 × 128 to 256 × 256. Such an approach, though it can reduce the memory bandwidth requirement, may result in blocky effects or discontinuities at the boundaries of the tiles. Furthermore, an internal SRAM implementation is feasible only for small images because of the sheer size of the SRAM required. For example, an 8-bit 640 × 480 image requires a memory space of at least two times its image size, about 1.2 MB (for temporary storage of 16-bit DWT coefficients), which could be too large for internal SRAM in embedded system implementations. For an external DRAM implementation, a 266-MHz DDR-DRAM with a 32-bit data bus can support an average of 5107.2 Mbps of memory bandwidth (at an average 60% bus efficiency), which is sufficient for the processing of gray-scale, 8-bit, 640 × 480 videos at frame rates of up to 194 fps.
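The figures quoted above can be reproduced with a few lines of Python (a sketch; function and variable names are ours):

# Total memory bandwidth per (3): each coefficient is written once and
# read back once on a single-ported memory, hence the factor of 2.
def mem_bandwidth_bps(n_fops, bits_per_coeff):
    return 2 * n_fops * bits_per_coeff

b30 = mem_bandwidth_bps(24.58e6, 16)
print(b30 / 1e6)                      # ~786.4 Mbps for 30-fps, 640 x 480 video

# DDR-266 with a 32-bit data bus at ~60% average bus efficiency:
ddr_bps = 266e6 * 32 * 0.60           # ~5107.2 Mbps
print(30 * ddr_bps / b30)             # bandwidth scales to ~194 fps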

III. ARCHITECTURE

Fig. 4 shows the proposed wavelet processing core architecture. The core consists of three major components: multiple instantiations of processing engines (PEs), a system bus, and a memory controller. The PEs are programmable and can be used to perform the wavelet filtering functions. The system bus connects the PEs together. The double-data-rate (DDR) memory controller supports the memory bandwidth required for wavelet processing, which depends on the type and computational complexity of the wavelet filter function. The PEs are connected in a multiple-instruction multiple-data (MIMD) structure for scalable performance. The internal architecture of the PE is shown in Fig. 5.

Fig. 4. Architecture of the wavelet processing core.

Fig. 5. Internal architecture of the PE.

The major component of the PE is the control module, which implements a five-stage pipeline (plus an optional sixth stage). At each cycle, a new instruction enters the pipeline. The pipeline stages are as follows.

1) Instruction fetch (P0): The control module fetches the instruction to be executed from the code_store module at the address pointed to by the program_counter.

2) Instruction decode (P1): The instruction is decoded. Most instructions require source or input operands; thus, read-enable and address signals are generated to read the source operands from the register_file module. The register_file module consists of two banks of registers, with each bank having 16 entries of 32-bit words. The register_file is dual-ported, as two source operands can be read out in any cycle.

3) Operand read (P2): The source operands from the register_file are sent to the ALU module. Read-after-write hazards are prevented using the data bypass mechanisms in the bypass module.

4) Execution (P3): The ALU, which consists of three submodules (adder, shifter, and multiplier), executes the instruction. The program counter is updated at this stage for the next instruction fetch.

5) Write back (P4): The result from the ALU module is written back to the register_file to update the architectural state. The ALU result is also used as the address for accessing the data_memory (internal SRAM for data) or the io_interface module (external memory via the system bus).

6) Multiply/load (P5): An optional pipeline stage which is operational when a load from memory or a multiply instruction is executed.
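The timing relationships among these stages can be visualized with a toy occupancy model (illustrative Python, not the RTL): with one instruction entering per cycle, an instruction at P3 has its result ready exactly when the next instruction is reading operands at P2, which is why the bypass paths described in Section III-C are needed.

# Toy occupancy trace of the PE pipeline (P0-P4; the optional P5 is omitted).
STAGES = ["P0", "P1", "P2", "P3", "P4"]

def trace(instrs):
    for cycle in range(len(instrs) + len(STAGES) - 1):
        slots = []
        for s, name in enumerate(STAGES):
            i = cycle - s          # index of the instruction occupying stage s
            slots.append(f"{name}:{(instrs[i] if 0 <= i < len(instrs) else '--'):>4}")
        print(f"cycle {cycle}: " + "  ".join(slots))

trace(["add", "sub", "xor"])
# In cycle 3, add is at P3 (result just computed) while sub reads at P2:
# the register file cannot yet supply the add result, hence the bypass.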


TABLE IV SUBSET OF THE PE INSTRUCTION SET

Fig. 6. Program_counter module and code_store module with connection to the control module.

A. Instruction Set

A set of 32-bit fixed-length instructions has been designed in line with the 32-bit data path of the PE. All instructions perform integer arithmetic operations, such as add, subtract, multiply, and shift; these are the operations most commonly used in fast-lifting DWT algorithms. Furthermore, these instructions, being SIMD instructions, support parallel computation on two pairs of 16-bit operands, thus doubling the throughput of arithmetic operations. Almost all instructions are designed to execute in a single cycle to maximize throughput, except the multiply instruction, which takes two cycles to complete. However, this does not affect the throughput significantly, as the multiply is pipelined with other instructions to mask its two-cycle latency. Furthermore, 32 general-purpose registers have been implemented for fast storage of the intermediate results of the DWT process. A subset of the instruction set commonly used for the DWT is shown in Table IV; descriptions of its usage are given in the subsequent sections.
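As an illustration of the SIMD mode described above, the sketch below packs two 16-bit operands into one 32-bit word and performs a lane-wise add, so one instruction yields two results (a Python model; the helper names are ours, not PE mnemonics).

MASK16 = 0xFFFF

def pack(hi, lo):
    # Two 16-bit values packed into one 32-bit word.
    return ((hi & MASK16) << 16) | (lo & MASK16)

def simd_add(a, b):
    # Lane-wise 16-bit addition; carries do not cross the lane boundary.
    lo = (a + b) & MASK16
    hi = ((a >> 16) + (b >> 16)) & MASK16
    return (hi << 16) | lo

assert simd_add(pack(1000, 2000), pack(30, 40)) == pack(1030, 2040)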

Fig. 7. Pipeline registers in the control module.

B. Control Path

The control path consists of the control, program_counter, and code_store modules, and it implements an in-order-issue, in-order-execution pipeline to achieve a cycles-per-instruction (CPI) value of 1.0. It has also been designed to use a delayed-branch strategy instead of dynamic branch prediction to reduce branch penalties. In the program_counter module (Fig. 6), one of five sources of the instruction address for accessing the code_store is selected: the branch address, jump address, debug address, stalled address, or incremented address. The code_store module stores the instructions of a program as 32-bit words. The instruction word, or opcode, is read out of the memory location addressed by the program_counter and then sent to the control module.

The control module contains the 32-bit pipeline registers p1_opcode, p2_opcode, p3_opcode, and p4_opcode (Fig. 7). These pipeline registers are decoded to generate control signals for the respective subblocks of the PE. The instruction opcode from the code_store is latched into the p1_opcode pipeline register. At the P1 stage, the register_file read addresses and read enables are generated by extracting the relevant fields from the instruction opcode stored in the p1_opcode register. At the P2 stage, the source operands of the current instruction are compared with the target operands of the previous two instructions in order to generate the multiplexer selects that route the correct operands to the ALU module for execution. This resolves read-after-write (RAW) hazards among sequential instructions and thus prevents pipeline stalls. At the P3 stage, the condition codes from the ALU are used to determine branch execution. Furthermore, for load and store instructions, read and write enables are generated for the data_memory and io_interface modules, and write enables and a write address are sent to the register_file for storing the ALU output. At the P4 stage, another set of write enables and a write address are sent to the register_file for storing the output from the data_memory or the io_interface. The control module also implements delayed branches and jumps: it allows one extra instruction to be executed following an unconditional jump, and three extra instructions to be executed following a conditional branch.

1354

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 11, NOVEMBER 2006

Fig. 8. PE data path with bypass multiplexers, where each operand (A and B) is 32 bits and can be either a 32-bit word or a pair of 16-bit words for SIMD-mode instructions.

Fig. 9. ALU and its submodules.

For load instructions, additional cycles are required to wait for data to be fetched from external memory. However, this does not stall the pipeline as long as the load target operand is not used as a source operand by the subsequent instruction.

C. Data Path

The data path of a PE consists of the register_file, the bypass multiplexers, the ALU, the data_memory for internal storage, and the io_interface for loading from and storing to external memory (see Fig. 8). The data bus is 32 bits wide. The source operands (operands A and B) of an instruction can be immediate data (embedded in the instruction) or a data register in the register_file module. However, due to RAW hazards, where the current instruction uses a register that is also the target register of one of the previous two instructions, the register may not be updated with the correct value in time. Consider the case of an add instruction immediately followed by a sub instruction that reads the add target register. The sub instruction accesses the register_file at the P2 stage to read the register, while the add result has just been evaluated at the P3 stage; hence, the data read out of the register_file is not the latest. Consider another case, in which an xor instruction reads, two instructions later, a register written by an add. The xor instruction accesses the register_file at the P2 stage while the add result is being written to the register at the P4 stage. The xor instruction does not get the latest data in time, and the old value of the register is read out instead. To prevent these RAW hazards and the resulting pipeline stalls, the bypass module includes multiplexers that recirculate the output from the ALU module, the data_memory module, or the io_interface module (or that output delayed by one cycle) back to the ALU inputs for the next instruction. The multiplexer selects are generated by the control module at the P1 and P2 stages.

D. ALU

The ALU module consists of three submodules: adder, shifter, and multiplier (Fig. 9). The adder performs the add and subtract instructions, while the shifter is used for shift operations as well as logic (and, or, xor) operations. The adder and shifter execute in two modes: a normal mode, in which a pair of 32-bit operands is used to generate a 32-bit result, and a SIMD mode, in which two pairs of 16-bit operands are used to generate a pair of 16-bit results. The multiplier is a 16-bit Booth multiplier that multiplies two 16-bit operands to generate a 32-bit result. The multiplier requires two cycles to complete a 16-bit multiplication: the carry and sum vectors are generated by the multiplier in the first cycle and are sent to the adder for a final addition that produces a single 32-bit result. The output of the multiplier is circulated to the adder through a multiplexer at the input of the adder. Furthermore, a sneak path from the adder output to the input of the ALU is enabled by an extra set of multiplexers. This sneak path is used to pipeline the multiply and shift operations commonly used in the DWT.


Fig. 10. Inner loop of the 9/7-F filter, which forms the basic group of arithmetic operations in the fast-lifting DWT shown in Fig. 2.

Fig. 11. The io_interface module.

An example of the operations involved in the inner loop of the 9/7-F wavelet filter is shown in Fig. 10.

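The original instruction sequence for this inner loop is not reproduced here, but its arithmetic pattern (neighbour sum, integer multiply, rounding, right shift, update) can be sketched in Python as follows. The coefficient 1817 and the 13-bit shift are the 9/7-F values cited in Section II-B; the rounding offset and the function shape are illustrative assumptions, not the paper's exact listing.

COEF, SHIFT = 1817, 13           # largest 9/7-F coefficient and its shift (Sec. II-B)

def lifting_step(left, centre, right):
    t = left + right              # neighbour sum (one add)
    t = COEF * t                  # integer multiply (two cycles on the Booth multiplier)
    t = t + (1 << (SHIFT - 1))    # rounding offset (assumed)
    return centre - (t >> SHIFT)  # right shift replaces the division; update the sample

print(lifting_step(100, 50, 102))   # 50 - 45 = 5 for these sample values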

Fig. 12. The 2-D logical memory access on sequential physical memory through DDR controller.

Fig. 13. Data packing and unpacking.

The operations are completed in a total of six cycles using six instructions, and two 16-bit intermediate coefficients are generated in parallel in the SIMD mode.

E. Interface Unit and System Bus

The io_interface module (Fig. 11) interfaces the PE with the other components in the wavelet processing core. It responds to load/store instructions by initiating read/write transactions on the system bus. It has two first-in first-out (FIFO) queues: an input FIFO and an output FIFO. A direct-memory-access (DMA) mechanism is implemented whereby the PE core configures local registers with the read or write start address and the transfer length, and the io_interface then initiates read/write transactions to the external memory. A load or store instruction from the PE is used to pop or push data between the PE and the FIFOs.

The DMA mechanism takes parameters such as the number of rows and columns to read or write from the external memory. The memory address is automatically incremented to the next row when the DMA has reached the column count. An example is shown in Fig. 12, where a PE accesses the external memory via its io_interface starting at address 0x0a0. Here, the column count is specified as 0x100 and the row count is also specified as 0x100. When the address reaches 0x1a0 (start address 0x0a0 + column count 0x100), it is automatically incremented by 0x100 (the row count) to 0x2a0. Hence, the DMA of the io_interface accesses the external memory as if it were a 2-D array, and this approach helps reduce the number of explicit instructions needed to implement a 2-D DWT algorithm on the PE.

The byte-swapping logic performs data packing/unpacking by swapping bytes or 16-bit words within a 32-bit word. For example, two 32-bit words can be packed into four pairs of 8-bit operands (Fig. 13). This is useful for preparing the input operands upfront in SIMD form for the execution of SIMD instructions in the PE.
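The DMA walk described above can be modelled in a few lines of Python (a sketch; the generator and its parameter names are ours). With the Fig. 12 values, the second row begins at 0x2a0, as described.

def dma_addresses(start, col_count, row_jump, rows):
    # Walk external memory as a 2-D array: sequential within a row,
    # then jump past 'row_jump' locations to the next row.
    addr = start
    for _ in range(rows):
        for off in range(col_count):
            yield addr + off
        addr += col_count + row_jump

addrs = list(dma_addresses(0x0a0, 0x100, 0x100, rows=2))
assert addrs[0] == 0x0a0 and addrs[0x100] == 0x2a0   # matches Fig. 12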


Fig. 14. Read and write transactions on the system bus.

The system bus comprises a system command bus, a read bus, and a write bus, and all transactions are essentially split-read or split-write transactions. When a read command is issued from a PE to the DDR controller via the system command bus, the DDR controller does not need to respond to the command immediately; it only needs to return the read data to the PE some time later via the system read bus. During that time, the DDR controller can accept other commands from other PEs. Hence, throughput is maximized with zero wait states. The system bus arbiter performs bus arbitration for the PEs. Fig. 14 shows an example in which three PEs send system commands to the DDR controller via the system bus. PE0 is granted the bus first and sends a read command, but only receives the read data six cycles later. During this time, PE1 is granted the bus, followed by PE2. PE2 is able to send a write command followed by the write data to the DDR controller while PE0 and PE1 are still waiting for their read data.

F. DDR Controller

In our implementation, we target high frame rates for the DWT in the worst case, where no tiling is involved. The DDR controller plays an important role in the wavelet processing core by providing adequate memory bandwidth to the PEs for storing and loading data. DDR memory technology requires overhead cycles for row and page activation, precharge, refresh, etc. A typical read request with a minimum burst length of 2 requires 16 cycles (from activation to activation), and a write request with a burst length of 2 consumes about 20 cycles. This translates to an average efficiency of only 2/18, or 11.11%. This means that out of the possible 200 000 000 32-bit transfers per second, only 11.11%, or 22 220 000, involve actual data transfer. For an 8-bit 640 × 480 image, this translates to a maximum throughput of only 27.55 fps. In order to increase the efficiency of the memory controller, we have implemented a few techniques, listed below.

1) Increase the number of memory banks: Increasing the number of memory banks enables more memory pages to remain open simultaneously, thus reducing the frequency of reactivating and precharging memory pages and saving overhead cycles.

2) Perform prefetched reads and posted writes: The number of external memory accesses is reduced by prefetching and caching 16-byte lines instead of 32-bit words for reads, and by combining four 32-bit words into a 16-byte line for posted writes to external memory.

Fig. 15. Direct and special logical-to-physical memory address mapping.

Fig. 16. DDR controller architecture.

3) Special logical-to-physical address mapping: In the 2-D DWT there are occasions on which data is read or written at nonsequential logical memory addresses [access pattern (2) in Fig. 15]. If a direct logical-to-physical memory address mapping is adopted, this often results in memory row misses and wastes precious cycles on row precharge and activation. However, we can specially map portions of different logical memory rows to a single physical memory row, such that accesses to different logical memory rows result in accesses to a single physical row, or to similar rows of different memory banks (see Fig. 15). The special mapping scheme is implemented by scrambling the connection of the address pins of the memory controller to the memory address bus.

The DDR controller (Fig. 16) contains a command FIFO that receives commands from the PEs and the host bridge via the system command bus. If the command is a read command, the read address is first sent to the content-addressable memory (CAM) for lookup. The CAM and the read buffer serve as a 16-entry cache for read data. If there is a hit, the associated data from the read buffer is sent to the output FIFO. Since each line in the read buffer is 16 bytes, the latencies of reads on the same line are greatly reduced, as no DDR read transaction is required; this happens when the PEs are reading inputs from sequential logical memory locations. If there is a miss, the read address is queued in the read FIFO; if there is already a read command to the same address in the read FIFO, the read pipeline stalls.


Fig. 17. Parallel-row horizontal reading with multiple-row jump of the input image data or DWT coefficients.

Fig. 18. Vertical writing with multiple-column jump of DWT coefficients.

The DDR interface state machine takes an address from the read FIFO and initiates a read transaction over the DDR bus to fetch a line of 16 bytes, which is then entered into the CAM and the read buffer. The data is also sent to the output FIFO to be returned to the requesting PE via the system read bus. In the case of a write command, the write address is first compared with all entries in the write FIFO. If a write command with the same address already exists in the write FIFO, the write data received via the system write bus is written to the same 16-byte line in the write-combine buffer, essentially coalescing the current write data with the existing line; this happens when the PEs are writing the transform coefficients sequentially to memory. Otherwise, the write command is queued in the write FIFO and a new line is allocated for the write data. Entries in the write FIFO and their associated lines in the write-combine buffer are retired to the external memory by the DDR interface state machine.

The DDR interface state machine implements the DDR protocol to read from and write to the external DDR memory. Each read or write address presented to the state machine is also compared with a table that contains the addresses of the active pages of the external memory. If the address matches an entry in the table, a transaction is initiated immediately, without the need to precharge and activate a new memory page. This reduces the average memory access latency. At a 200-MHz clock with a 32-bit bus, the DDR controller achieves an average efficiency of about 40% to 60%, or 2560–3840 Mbps, depending on the 2-D DWT memory access pattern.
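The efficiency figures quoted in this section can be checked numerically (a Python sketch; the cycle counts are the ones stated above):

# Naive controller: 16 cycles per 2-beat read burst, 20 per 2-beat write.
naive_eff = 2 / ((16 + 20) / 2)                  # = 2/18, i.e., ~11.11%
print(naive_eff * 200e6 / 1e6)                   # ~22.2 M useful 32-bit transfers/s

# Optimized controller on a 32-bit bus at a 200-MHz transfer rate:
for eff in (0.40, 0.60):
    print(eff * 200e6 * 32 / 1e6, "Mbps")        # 2560 and 3840 Mbps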

Fig. 19. Image before (on the left) and after (on the right) 3-level 2-D DWT with 10 subbands.

IV. EXAMPLE

Consider a gray-scale 640 × 480 image stored in external memory starting at location (0, 0), with each 8-bit pixel stored in a memory location of 16-bit word size. If the core contains four PEs, each PE reads from the external memory in the horizontal direction (Fig. 17) and writes back in the vertical direction (Fig. 18). The io_interface of each PE is configured to load two rows of inputs from the memory and then skip the next six rows. After each PE has performed its filtering operations, the io_interface writes the high-pass and low-pass coefficients vertically, to columns in two different memory regions, with a skip of three columns. This essentially transposes the coefficients (rows to columns) and divides them into high-pass and low-pass regions. The process is performed automatically by the DMA engine of the io_interface of each PE. The sequential horizontal reads carried out by the PEs will most probably result in hits in the CAM of the DDR controller, while the write operations sent by the four PEs to the controller will most probably be combined into a single write line in the write-combine buffer. The purpose of the above mechanisms is to minimize the number of reads and writes to the external memory, and thus the memory access latencies and pipeline stalls. In a 1-level 2-D DWT process, the entirety of the original image is transformed and transposed in the manner described above. No tiling is used, and thus the scenario considered here presents the worst-case requirement in terms of memory bandwidth. Repeating the same process with the LL subband coefficients as input corresponds to the 2-level 2-D pyramid algorithm in Fig. 3. A test image undergoing a 3-level 2-D DWT using the 9/7 filter results in the ten subband regions depicted in Fig. 19. Note that the throughput performance of the proposed architecture is independent of the image content.
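The row interleaving in this example can be written down explicitly (a Python sketch; the helper is ours): with four PEs each loading two rows and skipping the six rows handled by its peers, PE 0 reads rows 0-1, 8-9, and so on, while PE 3 reads rows 6-7, 14-15, and so on.

def pe_rows(pe_id, total_rows, n_pes=4, rows_per_fetch=2):
    # Rows of the image read by a given PE: two rows, then skip six.
    stride = n_pes * rows_per_fetch
    for base in range(pe_id * rows_per_fetch, total_rows, stride):
        yield from range(base, min(base + rows_per_fetch, total_rows))

assert list(pe_rows(0, 480))[:4] == [0, 1, 8, 9]
assert list(pe_rows(3, 480))[:4] == [6, 7, 14, 15]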


TABLE V CIRCUIT PERFORMANCE FOR THE WAVELET PROCESSING CORE IN 0.18-μm CMOS TECHNOLOGY

Fig. 21. Simulation: test code and test image being loaded into the PE Verilog model and the DRAM model, respectively.

Fig. 20. Pseudocode for the implementation of a 9/7 DWT filter on a PE.

Pseudocode for the implementation of a 9/7-F DWT filter on the PE is shown in Fig. 20. Loading of input data is carried out by popping data from the input FIFO, and storing of output data is carried out by pushing the data into the output FIFO. As long as the DMA mechanism in the io_interface block is programmed to fetch the data from external memory appropriately, very little memory-instruction overhead is involved; i.e., no computation of read/write addresses is needed in the PE code. Additionally, the handling of boundary extension at the left and right edges has been incorporated in the pseudocode, and hence no additional instruction overhead is incurred elsewhere.

V. RESULTS

A. Circuit Performance

A wavelet processing core with four PEs was synthesized with the Synopsys Design Compiler, placed and routed using a 0.18-μm CMOS standard cell library with Synopsys Astro, and timed with the Synopsys PrimeTime static timing analysis tool. The performance parameters are listed in Table V. A global clock skew of 350 ps was used during static timing analysis. Power was estimated using Sequence PowerTheater with the gate-level netlist and the simulation waveform for the test case in Section IV.

B. Simulation Results

The wavelet processing core with four PEs was simulated, using the Mentor Graphics ModelSim HDL simulator, on a test case based on the example described in Section IV. The test case was simulated by writing the DWT code using the PE instruction set and loading the code into the PE code_store (Fig. 21). The image in Fig. 19 was also loaded into the DDR DRAM HDL model. We varied the operating frequency of the PEs (100 and 200 MHz) and the number of memory banks (4, 8, and 16), obtaining the throughput results in Tables VI and VII. From Tables VI and VII, we observe that deploying more memory banks increases the maximum throughput of the wavelet processing core. This is because, with more memory banks, fewer memory precharge and activation cycles are required to close and open memory pages, which results in smaller memory access latency. We have also found that the throughput of the wavelet processing core saturates at 58 fps, or 14.5 fps per PE, for the 9/7-F forward and inverse transforms at a PE clock speed of 100 MHz. Increasing the PE clock speed to 200 MHz almost doubles the throughput. However, for the 5/3 forward and inverse transforms, the throughput of the core saturates at 130 fps, or 32.5 fps per PE, at both 100- and 200-MHz clock speeds.


TABLE VI SIMULATED THROUGHPUT FOR THE WAVELET PROCESSING CORE WITH 4 PES AT 100-MHz OPERATING FREQUENCY

TABLE IX SIMULATED THROUGHPUT FOR THE WAVELET PROCESSING CORE WITH 2, 4, AND 8 PES

TABLE VII SIMULATED THROUGHPUT FOR THE WAVELET PROCESSING CORE WITH 4 PES AT 200-MHz OPERATING FREQUENCY

TABLE X FEATURE COMPARISON WITH OTHER ARCHITECTURES

TABLE VIII SIMULATED THROUGHPUT FOR THE WAVELET PROCESSING CORE WITH A 64-BIT DDR CONTROLLER

This is because the performance is limited by the memory bandwidth. To increase the memory bandwidth further, we increased the memory bus width from 32 to 64 bits. The simulation results in Table VIII show that using a 64-bit DDR controller increases the throughput of the 5/3 forward and inverse transforms to a maximum of 192 fps. However, the 9/7-F forward and inverse transform throughputs saturate at 116 fps, as the performance becomes limited by the PE itself and not by the memory bandwidth.

Furthermore, the number of memory stall cycles observed for the 5/3 filter is significantly higher than for the 9/7-F filter. Hence, we can conclude that the performance of the 5/3 filter in this case is limited by the memory bandwidth. We next varied the number of PEs used in the simulation. The test code used was the same as in Fig. 22, except that the DMA of each PE's io_interface module was programmed with different image attributes. The simulation results listed in Table IX show that the maximum throughput saturates at around 190 fps. This is due to the memory bandwidth bottleneck at the memory controller. The results also show that for simple filters, such as the 5/3 filter, deploying four PEs is sufficient to achieve the maximum throughput, whereas for more complex filters, such as the 9/7-F filter, eight PEs are needed. Table X shows a comparison of the proposed architecture with existing architectures: Liu's architecture [13], Chen's architecture [15], and Wu's architecture [16]. Our proposed architecture is software programmable, and hence it can support various types of filters as well as any finite number of DWT levels. Furthermore, it relies on external DRAM, and hence it imposes no limitation on tile size and needs only a minimal amount of internal memory. The main disadvantage of DWT processing systems with off-chip memory, compared to those with on-chip memory, is the lower memory bandwidth. Memory bandwidth can be expressed in terms of bus efficiency, which is defined as follows:

Efficiency = \frac{\text{Total data transferred}}{\text{Total bus cycles consumed} \times \text{Data bus width}}.  (4)


TABLE XI BANDWIDTH EFFICIENCY FOR THE WAVELET PROCESSING CORE WITH A 64-BIT DDR CONTROLLER

A bus efficiency of 1.0 is achieved when the memory data bus is used to read or write a word (32 or 64 bits) in every cycle. This is equivalent to the bus efficiency of a system with single-ported, on-chip SRAM. For systems with DDR DRAM, a bus efficiency of 1.0 will never be achieved, as there are always overhead bus cycles incurred by the DDR DRAM bus protocol. Table XI shows the bus efficiencies of the proposed 2-D DWT processing core implementing different DWT filters with 32-bit and 64-bit buses and different numbers of memory banks. The efficiency range of 0.42–0.55 achieved by the proposed architecture in the case of a 32-bit DDR data bus is significantly higher than the efficiency of 0.11 achieved by conventional systems, as explained in Section III. For the case of a 64-bit DDR data bus, the bus efficiency is lower. This is because doubling the bus width reduces the number of bus cycles by less than a factor of two, owing to fixed overhead cycles in bus accesses that are independent of the bus width.

The novelty of the proposed architecture is that the PEs designed for it are small (only 50K gates) and can be deployed in the implementation of a 2-D DWT processing core with highly scalable performance through the easily implemented connection of multiple PEs (2, 4, 8, or 16). The instructions have been optimized for DWT operation with a CPI of 1.0, and they are also generic enough for the implementation of any DWT filter. Furthermore, no data cache (required in conventional DSPs) is needed, because the input coefficients are serialized and loaded by the DMA mechanism in each PE. Hence, the design of the PE instruction set is simplified, without the need for complex memory addressing modes, and the same code can be used for different image sizes. Moreover, only a single memory controller is needed to serve up to 16 PEs, and the controller has been designed with the memory access pattern of the DWT taken into consideration to optimize throughput, without the need to buffer image tiles in the PEs, as opposed to the use of tiles in many of the existing 2-D DWT architectures.

C. Color Images

The previous throughput simulation results were obtained by carrying out the 3-level fast-lifting DWT on an 8-bit, 640 × 480 image. For 24-bit color images, the estimated throughput can be obtained by linearly scaling the throughputs in Tables VI, VII, VIII, and IX by a factor of 1/3, since a 24-bit image can be represented by three separate 8-bit images in the red, green, and blue color planes. Hence, the maximum throughput attainable for the 3-level 2-D DWT of 24-bit, 640 × 480 images is estimated at 192/3 = 64 fps.

VI. CONCLUSION

A highly programmable and scalable wavelet processing core architecture has been proposed. The flexibility of the architecture, in using a variable number of PEs, different clock frequencies, and different DDR controller configurations, enables the performance to be scaled according to application requirements. The programmability of the individual PEs makes the wavelet processing core highly reconfigurable for other signal processing functions. We have shown through simulation that a wavelet processing core based on the architecture is capable of providing a throughput of up to 192 fps for a 3-level 2-D DWT on an 8-bit, 640 × 480 image using a 200-MHz clock, without fragmenting the image into tiles. For practical applications, which only require 20–30 fps, this leaves significant spare capacity for other functions to be implemented on the wavelet processing core.

REFERENCES
[1] W. Sweldens and I. Daubechies, "Factoring wavelet transforms into lifting steps," J. Fourier Anal. Applicat., vol. 4, no. 3, pp. 247–269, 1998.
[2] C. Valens, "A really friendly guide to wavelets," 1999. [Online]. Available: http://www.perso.wanadoo.fr/polyvalens/clemens/download/arfgtw.pdf
[3] A. R. Calderbank, I. Daubechies, W. Sweldens, and B. L. Yeo, "Lossless image compression using integer to integer wavelet transforms," in Proc. Int. Conf. Image Process., 1997, pp. 596–599.
[4] A. S. Lewis and G. Knowles, "VLSI architecture for 2-D Daubechies wavelet transform without multipliers," Electron. Lett., vol. 27, no. 2, pp. 171–173, 1991.
[5] K. K. Parhi and T. Nishitani, "VLSI architectures for discrete wavelet transform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 1, no. 2, pp. 191–202, Mar. 1993.
[6] C. Chakrabarti and M. Vishwanath, "Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mappings on SIMD array computers," IEEE Trans. Signal Process., vol. 43, no. 3, pp. 759–771, Mar. 1995.
[7] M. Martina, G. Masera, G. Piccinini, and M. Zamboni, "A VLSI architecture for IWT (integer wavelet transform)," in Proc. 43rd Midwest Symp. Circuits Syst., 2000, vol. 3, pp. 1174–1177.
[8] P. Jamkhandi, A. Mukherjee, K. Mukherjee, and R. Franceschini, "Parallel hardware-software architecture for computation of discrete wavelet transform using the recursive merge filtering algorithm," in Proc. Int. Parallel Distrib. Process. Symp. Workshop, 2000, pp. 250–256.
[9] A. Petrovsky, T. Laopoulos, V. Golovko, R. Sadykhov, and A. Sachenko, "Dynamic instruction set computer architecture of wavelet packet processor for real-time audio signal compression systems," in Proc. 2nd ICNNA, Feb. 2004, pp. 422–424.
[10] S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, Jul. 1989.
[11] M. D. Adams and F. Kossentini, "Reversible integer-to-integer wavelet transforms for image compression: Performance evaluation and analysis," IEEE Trans. Image Process., vol. 9, no. 6, pp. 1010–1024, Jun. 2000.
[12] M. Weeks, "Precision for 2-D discrete wavelet transform processors," in Proc. IEEE Workshop SiPS, 2000, pp. 80–89.
[13] Y. Liu and E. M.-K. Lai, "Design and implementation of an RNS-based 2-D DWT processor," IEEE Trans. Consum. Electron., vol. 50, no. 1, pp. 376–384, Feb. 2004.
[14] B. Das and S. Banerjee, "A memory efficient 3-D DWT architecture," in Proc. 16th Int. Conf. VLSI Design (VLSI'03), 2003, pp. 208–213.



[15] C.-Y. Chen, Z.-L. Yang, T.-C. Wang, and L.-G. Chen, "A programmable VLSI architecture for 2-D discrete wavelet transform," in Proc. IEEE Int. Symp. Circuits Syst., May 2000, pp. 619–622.
[16] B.-F. Wu and Y.-Q. Hu, "An efficient VLSI implementation of the discrete wavelet transform using embedded instruction codes for symmetric filters," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 9, pp. 936–943, Sep. 2003.

Sze-Wei Lee (M'99) was born in Malaysia in 1970. He received the B.Eng. (Hons.) degree in electronics and optoelectronics and the M.Phil. and Ph.D. degrees from the Institute of Science and Technology, University of Manchester, Manchester, U.K., in 1995, 1996, and 1998, respectively. He joined the Faculty of Engineering, Multimedia University, Cyberjaya, Selangor, Malaysia, in 1999, where he is currently an Associate Professor. His research interests include digital signal processing and digital communications.

Soon-Chieh Lim was born in Malaysia in 1977. He received the B.Eng. (Hons.) degree in electrical and electronic engineering from the University of Technology, Johor, Malaysia, in 2000, and the M.Eng.Sc. degree from Multimedia University, Cyberjaya, Selangor, Malaysia, in 2005. He is currently a Senior Design Engineer at eAsic Corporation, Penang, Malaysia. His research interests include microprocessor architectures and parallel processing.
