
Enabling Real-Time JPEG2000 with FPGA Architectures

Olivier Cantineau
Barco Silex
Rue du bosquet 7, B-1348 Louvain-la-Neuve, Belgium
+32 10 454904
olivier.cantineau@barco.com

Brian Jentz
Altera
101 Innovation Dr, San Jose, CA 95134
(408) 544-7000
bjentz@altera.com

Abstract

The JPEG2000 standard was developed to address a wide range of applications, including medical imaging, military, security systems, and digital cinema. To enable these applications, JPEG2000 has many unique features, including scalability, support for regions of interest, lossless support, and low latency. This paper describes an FPGA-based JPEG2000 implementation and demonstrates the performance, cost, and integration benefits that can be derived from FPGA and structured ASIC implementations.

I. Introduction

Many imaging applications are moving from the analog to the digital domain for a number of reasons, including perfect copies, controllable transmission quality, easy storage, and easy manipulation. One major challenge in making this transition is the large size of the images. For instance, a 640x480 VGA RGB image is nearly 1 Mbyte; a 2048x1536 8-bit grayscale medical image is 3 Mbytes; and a 4096x2160 36-bit digital cinema image is 38 Mbytes. There is a need, therefore, for high-quality image compression, and each application has slightly different requirements.
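As a sanity check on these figures, the short sketch below computes the raw storage for each example format from width, height, and bit depth; it is only illustrative arithmetic, not part of the JPEG2000 algorithm.

```python
# A quick check of the raw (uncompressed) sizes quoted above.
def raw_size_mib(width, height, bits_per_sample, components=1):
    """Uncompressed image size in MiB: width x height x components x bit depth."""
    return width * height * components * bits_per_sample / 8 / 2**20

print(f"640x480 24-bit RGB VGA:    {raw_size_mib(640, 480, 8, components=3):.1f} MB")
print(f"2048x1536 8-bit grayscale: {raw_size_mib(2048, 1536, 8):.1f} MB")
print(f"4096x2160 36-bit cinema:   {raw_size_mib(4096, 2160, 36):.1f} MB")
# -> roughly 0.9 MB, 3.0 MB, and 38.0 MB respectively
```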

JPEG2000 offers superior benefits to JPEG for a wide variety of applications. It supports both lossy and lossless compression in a single algorithm. It offers improved quality at the same compression ratio, thanks to its removal of block artifacts, support for regions of interest, and non-iterative optimal rate control. It has been designed to facilitate on-screen display and computer imagery. It has significantly improved bit-stream scalability, which is defined at the image level; this scalability includes embedded fast preview with further refinements and adaptability to instantaneously available bandwidth. All these features and the compression efficiency, however, come at the expense of algorithm complexity: JPEG2000 is up to 6 times more complex to implement than JPEG. Hardware acceleration is, therefore, required for an efficient solution.


II. JPEG2000 Overview

The JPEG2000 algorithm is illustrated in Figure 1. The processing is divided into two separate stages, highlighted with dashed boxes in the figure. The first stage performs the encoding, while the second stage (Tier-2) builds up the stream and includes an a-posteriori rate allocator.


[Figure 1: JPEG2000 Block Diagram. Tiles feed the DWT, optional quantization, and entropy (arithmetic) encoding, which produces compressed blocks and distortion metrics; the Tier-2 rate allocator then builds the JP2K stream.]

First Stage: Discrete Wavelet Transform (DWT) Based Compression
To apply JPEG2000 compression, the image is divided into rectangular tiles of configurable size. Figure 2 illustrates the JPEG2000 operations on a given tile of pixels, showing two levels of wavelet decomposition resulting in 7 sub-bands. Each tile separately undergoes the 2-D wavelet transform, which splits the frequency information of the tile into a series of pictures, named sub-bands. This is the decorrelation transform of the JPEG2000 algorithm. Each sub-band is the result of the 2-D filtering of the original tile for a given frequency range.

The wavelet transform is a recursive operation that can be applied a configurable number of times, called the number of decomposition levels. Each application of the transform generates 4 sub-bands from its input image by combining high-pass and low-pass filtering operations along the lines and the columns of the picture. This generates sub-bands marked as LL, LH, HL, and HH, where L represents low-pass filtering and H represents high-pass filtering (refer to Figure 2). The two letters are grouped for row-column combinations. Each level of wavelet decomposition applies to the LL result of the previous decomposition. The level of decomposition for a given sub-band is included in its name; this is the number appearing in the #LL, #LH, #HL, and #HH marks in Figure 2.

Each sub-band can then undergo selective quantization by a programmable factor for lossy compression. Bypassing the quantization gives lossless operation. The resulting quantized sub-bands are further divided into smaller rectangular blocks, named code blocks. Each code block passes through an entropy encoder. This is the compression engine of the JPEG2000 algorithm, which reduces the number of bits needed to represent the code blocks. All bit planes of the current code block are examined, starting from the most significant one. In each plane, the bits are scanned in a zigzag order and their context (information on the predominant value of the surrounding bits) is determined. Finally, an arithmetic encoder uses the value of each bit and its context to generate the code stream representing the compressed code block. The arithmetic encoder also computes distortion metrics, which reflect the image distortion that would be encountered when reconstructing the code block from its currently encoded portion.
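To make the recursive sub-band structure described above concrete, the following sketch performs a two-level separable decomposition of a tile. For brevity it uses the simple averaging/differencing (Haar) filter pair rather than the 5/3 or 9/7 filters actually specified by JPEG2000, but the LL/LH/HL/HH naming and the recursion on the LL band follow the description above.

```python
import numpy as np

def haar_1d(x):
    """One level of 1-D Haar analysis along the last axis: (low-pass, high-pass)."""
    x = x.astype(float)
    low = (x[..., 0::2] + x[..., 1::2]) / 2.0
    high = (x[..., 0::2] - x[..., 1::2]) / 2.0
    return low, high

def dwt2d_level(tile):
    """One level of separable 2-D decomposition: filter rows, then columns."""
    low_r, high_r = haar_1d(tile)                  # horizontal filtering
    ll, lh = haar_1d(low_r.swapaxes(0, 1))         # vertical filtering of the L half
    hl, hh = haar_1d(high_r.swapaxes(0, 1))        # vertical filtering of the H half
    return (ll.swapaxes(0, 1), lh.swapaxes(0, 1),
            hl.swapaxes(0, 1), hh.swapaxes(0, 1))

def dwt2d(tile, levels=2):
    """Recursive decomposition: each level re-decomposes the previous LL band."""
    subbands = {}
    current = np.asarray(tile)
    for level in range(1, levels + 1):
        ll, lh, hl, hh = dwt2d_level(current)
        subbands[f"{level}HL"] = hl
        subbands[f"{level}LH"] = lh
        subbands[f"{level}HH"] = hh
        current = ll                               # recurse on the LL band
    subbands[f"{levels}LL"] = current
    return subbands

tile = np.random.randint(0, 256, (128, 128))
for name, band in sorted(dwt2d(tile, levels=2).items()):
    print(name, band.shape)   # 7 sub-bands: 64x64 at level 1, 32x32 at level 2
```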

[Figure 2: Overview of the JPEG2000 Encoding Stage. One tile passes through the DWT, producing the 2LL, 2HL, 2LH, 2HH, 1HL, 1LH, and 1HH sub-bands (7 sub-bands); the quantized sub-bands are split along the code-block grid and each code block is processed by an entropy encoder (modeler plus arithmetic coder).]

Second Stage (Tier-2): Packet Selection and Reordering
The code stream generated by the arithmetic encoder, together with the distortion metrics, allows the JPEG2000 post-processing stage to selectively build the final bit stream. This process is driven by two user-defined parameters, detailed below.

The compression ratio - The Tier-2 stage selects incoming packets to attain the compression ratio specified by the user. The algorithm rejects packets that do not contribute a sufficient reduction in distortion. This mechanism allows precise control of the generated compressed file size, while maintaining good image quality.

The progression order - JPEG2000 allows an initial preview of a picture with the first portion of the bit stream. With the subsequent parts of the compressed file, the image is progressively refined. JPEG2000 standardizes various refinement orders by prioritizing an image characteristic, for example, quality or resolution. The Tier-2 stage attains the desired progression order by reordering the incoming packets.
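The packet selection can be pictured as a greedy rate-distortion trade-off. The sketch below is a simplified illustration of that idea rather than the Tier-2 procedure defined by the standard: it keeps the contributions with the largest distortion reduction per byte until a user-specified byte budget is reached; the packet sizes and distortion values are made up for the example.

```python
# Hypothetical packet descriptors: (size in bytes, distortion reduction if included).
packets = [(1200, 9000.0), (800, 2500.0), (500, 2400.0), (1500, 1200.0), (300, 90.0)]

def allocate(packets, byte_budget):
    """Greedy selection by distortion reduction per byte until the budget is met."""
    ranked = sorted(enumerate(packets),
                    key=lambda item: item[1][1] / item[1][0], reverse=True)
    chosen, used = [], 0
    for index, (size, gain) in ranked:
        if used + size <= byte_budget:        # reject packets that do not fit
            chosen.append(index)
            used += size
    return sorted(chosen), used

selected, total = allocate(packets, byte_budget=2600)
print(selected, total)   # keeps the packets with the best gain per byte: [0, 1, 2] 2500
```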

III. Implementation on FPGA


Due to its powerful features, JPEG2000 requires more computational resources than the classic JPEG standard to achieve similar encoding and decoding speeds. To increase JPEG2000 performance, this paper proposes an architecture where the computationally intensive tasks are offloaded to an FPGA co-processor, as illustrated in Figure 3. JPEG2000 processing is accelerated by executing the wavelet transform, quantization, and entropy encoding on the FPGA.

[Figure 3: Co-Processing Architecture. The CPU runs the Tier-2 stage while the FPGA co-processor implements the DWT, optional quantization, and entropy encoding.]
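In this partition the host software keeps control of tile sequencing and of Tier-2, while the per-tile number crunching runs in hardware. The outline below is only a sketch of that division of labor under the split shown in Figure 3; the driver-style function names are hypothetical and do not correspond to an actual API.

```python
def encode_image(tiles, fpga, rate_allocator):
    """Hypothetical host-side loop for the CPU/FPGA split of Figure 3."""
    compressed_blocks = []
    for tile in tiles:
        fpga.load_tile(tile)                      # push pixels to the co-processor
        fpga.run_dwt_quant_entropy()              # DWT, quantization, entropy in hardware
        # Each block comes back with its code stream and distortion metrics.
        compressed_blocks.extend(fpga.read_compressed_blocks())
    # Tier-2 (packet selection and reordering) stays in software on the CPU.
    return rate_allocator.build_stream(compressed_blocks)
```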
Figure 4 illustrates a software benchmark of the JPEG2000 algorithm for lossless and lossy compression, where a large part of the processor time is spent on entropy encoding. This is particularly true for lossless encoding, which requires many encoding passes. Hardware FPGA implementations can accelerate the wavelet transform and quantization by pipelining these operations. Entropy encoding, however, is more difficult to optimize due to its bit-serial structure. This is illustrated by Figure 5, which shows the amount of time spent on the DWT and entropy operations by the FPGA co-processor proposed in Figure 3. The bottleneck is the entropy-encoding stage, showing permanent activity (100 percent on the graph). The hardware wavelet transform is only active during 7 to 10 percent of the time needed to entropy encode the corresponding data. To compensate for this slow entropy encoding and to better balance the activity of the various blocks of the JPEG2000 encoder, several entropy encoders must be placed in parallel to independently process the code blocks generated by a single wavelet engine.
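A back-of-the-envelope balance calculation shows why: if the hardware DWT is busy only 7 to 10 percent of the time one entropy encoder needs for the same data, roughly 10 to 15 parallel chains would be required to fully hide the entropy stage behind the wavelet stage (the benchmarked configurations later in this paper use up to 8). The sketch below simply turns the quoted activity figures into a chain count.

```python
import math

def encoders_needed(dwt_busy_fraction):
    """Entropy chains required so the entropy stage keeps up with the DWT."""
    # One entropy encoder takes 1 / dwt_busy_fraction times longer than the DWT
    # on the same data, so that many chains are needed to match its throughput.
    return math.ceil(1.0 / dwt_busy_fraction)

for fraction in (0.07, 0.10):
    print(f"DWT active {fraction:.0%} of entropy time -> "
          f"{encoders_needed(fraction)} parallel entropy encoders")
# -> about 15 chains at 7% activity and 10 chains at 10%
```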
[Figure 4: Software JPEG2000 Benchmarking. CPU load (%) split among DWT, entropy encoding, and other operations, for lossless and lossy compression.]

[Figure 5: Hardware JPEG2000 Benchmarking. Time activity (%) of the DWT and entropy-encoding stages, for lossless and lossy compression.]

Figure 6 shows a block diagram of the Barco Silex JPEG2000 encoder core (BA112JPEG2000E). This figure illustrates the main functional modules and provides a simplified view of the interfaces. It also gives an overview of the logic and memory usage for each module on Altera's Stratix and Stratix II devices. The block diagram shows the parallel structure of the core, where several entropy encoders are implemented to process the data generated by the wavelet engine. Altera's Stratix and Stratix II FPGA features, namely fast and numerous RAM blocks and a large amount of logic and hardware multipliers, make these FPGAs excellent choices for implementing JPEG2000 solutions. The presence of large on-chip M-RAM blocks allows the implementation of a large on-chip tile buffer, increasing the overall performance and integration level of the core.

Pixel data is input through the pixel interface, and compressed streams are made available at the compressed interfaces, together with distortion metrics. The core features a simple generic CPU interface suited for use as a bus peripheral to various processors. The following sections describe the modules constituting the BA112JPEG2000E core as depicted in Figure 6.

[Figure 6: Block Diagram of the Barco Silex JPEG2000 Encoder. A pixel interface feeds a line buffer (2 M4K) and the 2D DWT (4400 LEs, 12 multipliers, 26 M4K), which writes to the tile buffer (2 M-RAM); the quantizer (1000 LEs, 2 multipliers) and tile splitter (2000 LEs) dispatch code blocks to parallel entropy encoders (2000 LEs, 4 M4K each) with code-block buffers (8 M4K each), whose compressed data and distortion metrics exit at the compressed interfaces.]

2D DWT


The first module of the core is the wavelet-transform engine. This module can be configured to accept tiles of pixels of any size up to 128 by 128. It performs two-dimensional discrete wavelet decomposition on the incoming data with up to five programmable decomposition levels. The wavelet transform can be programmed to be lossy, lossless, or bypassed. The DWT module accepts incoming pixels of any size up to 12 bits (10 bits for lossy). Finally, it stores its results in the on-chip tile buffer, ready to undergo quantization and code-block decomposition.

Quantizer
The quantizer fetches the sub-bands available from the tile buffer and applies a programmable quantization step. Different quantization steps can be programmed for each sub-band; lower-frequency sub-bands can thus be weighted differently from higher-frequency ones. The quantizer can be bypassed for lossless operation.
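The quantization itself is a simple per-coefficient operation. The sketch below shows dead-zone scalar quantization with illustrative, hypothetical step sizes per sub-band; the standard's derivation of step sizes from exponent/mantissa fields is omitted.

```python
import numpy as np

def quantize_subband(coefficients, step):
    """Dead-zone scalar quantization: q = sign(c) * floor(|c| / step).
    Bypassing this step (integer transform path) gives lossless operation."""
    coefficients = np.asarray(coefficients, dtype=float)
    return np.sign(coefficients) * np.floor(np.abs(coefficients) / step)

# Hypothetical per-sub-band steps: lower-frequency bands are quantized more gently.
steps = {"2LL": 1.0, "2HL": 2.0, "2LH": 2.0, "2HH": 4.0,
         "1HL": 4.0, "1LH": 4.0, "1HH": 8.0}
band = np.array([-13.7, -2.1, 0.4, 5.9, 21.3])
print(quantize_subband(band, steps["1HH"]))   # -> [-1. -0.  0.  0.  2.]
```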

Tile Splitter
This unit further divides the quantized sub-bands into rectangular code blocks of programmable size (up to 32 by 32), ready for entropy encoding by an arithmetic encoder. The core features a configurable number of entropy encoders placed in parallel in order to sustain high encoding rates. The number of implemented chains is selected during the IP synthesis process. Each entropy chain processes a code block independently from neighboring chains. The tile-splitter module is responsible for arbitrating between the available chains, dispatching the various code blocks to be encoded. It stores the code blocks in the local code-block buffers.
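The arbitration performed by the tile splitter can be pictured as handing each new code block to the first entropy chain that becomes free. The sketch below is a behavioral model of that dispatch, not the RTL of the core; the chain count and per-block cycle counts are arbitrary.

```python
import heapq

def dispatch(code_blocks, num_chains):
    """Assign each code block to the entropy chain that becomes free first.
    code_blocks: list of (block_id, encoding_cycles)."""
    # Heap of (time_when_free, chain_id); all chains start idle at time 0.
    chains = [(0, chain_id) for chain_id in range(num_chains)]
    heapq.heapify(chains)
    schedule = []
    for block_id, cycles in code_blocks:
        free_at, chain_id = heapq.heappop(chains)      # first chain to become idle
        schedule.append((block_id, chain_id, free_at))
        heapq.heappush(chains, (free_at + cycles, chain_id))
    return schedule

blocks = [("cb0", 900), ("cb1", 450), ("cb2", 700), ("cb3", 300), ("cb4", 820)]
for block_id, chain, start in dispatch(blocks, num_chains=3):
    print(f"{block_id} -> chain {chain} (starts at cycle {start})")
```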

Modeler and Arithmetic Encoder
The modeler performs the first part of the entropy encoding. It examines the code block bit plane by bit plane and extracts the relevant bits in zigzag order within each plane. It also computes the context information needed by the arithmetic encoder, as well as the distortion metrics; these are made available at the compressed interface and are used by the Tier-2 part of the JPEG2000 algorithm. The arithmetic encoder processes the bits and contexts and makes the resulting stream available at the compressed interface.
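To illustrate the bit-plane scanning, the sketch below extracts the magnitude bit planes of a small code block from the most significant plane downward; the actual stripe-oriented scan, the three coding passes per plane, and the standard's context-formation rules are omitted.

```python
import numpy as np

def bit_planes(code_block):
    """Yield (plane_index, bits) from the most significant magnitude plane down."""
    magnitudes = np.abs(code_block).astype(np.int64)
    top = int(magnitudes.max()).bit_length() - 1
    for plane in range(top, -1, -1):
        yield plane, (magnitudes >> plane) & 1

block = np.array([[ 5, -3,  0],
                  [ 2,  7, -1],
                  [ 0,  4,  6]])
for plane, bits in bit_planes(block):
    print(f"plane {plane}:\n{bits}")
# A modeler would scan each plane in zigzag order, derive a context from the
# significance of neighboring samples, and feed (bit, context) pairs to the
# arithmetic coder while accumulating distortion metrics.
```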

Performance Analysis
Table 1 illustrates JPEG2000 decoding capabilities benchmarked on different Stratix II devices. The following information is given in the table for each of three members of the family:
- LE usage for the JPEG2000 decoder core (with device utilization percentages)
- Decoding configuration: the number of entropy channels (configured at synthesis time)
- Resultant sample rate (the left number is for typical lossy compression; the right number is for typical lossless compression)
- Resultant VGA frame rate (640x480, 24-bit RGB)
- Resultant decoding time for monochrome 8-bit 3-Mpixel medical images (2048x1536)

Table 1: JPEG2000 Decoder Performance on Stratix II

Stratix II Device | Area (#LEs, Usage %) | # Entropy Channels | Sample Rate (lossy / lossless) | VGA (Hz, lossy / lossless) | PACS 3M (ms, lossy / lossless)
EP2S15C5          | 10500 (67%)          | 2                  | 14M / 10M                      | 15 / 10                    | 225 / 315
EP2S30C5          | 25000 (74%)          | 8                  | 50M / 37M                      | 54 / 40                    | 63 / 85
EP2S60C5          | 50000 (83%)          | 8                  | 100M / 74M                     | 108 / 80                   | 32 / 43
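The frame rates and decoding times in Table 1 follow directly from the sample rates. For example, the check below reproduces the EP2S30C5 row (54 Hz VGA and 63 ms for the 3-Mpixel image in lossy mode, 40 Hz and 85 ms in lossless mode), using only the table's sample rates.

```python
def vga_frame_rate(samples_per_second, width=640, height=480, components=3):
    """Frames per second for a 24-bit RGB VGA stream at the given sample rate."""
    return samples_per_second / (width * height * components)

def decode_time_ms(samples_per_second, width=2048, height=1536, components=1):
    """Milliseconds to process one monochrome image at the given sample rate."""
    return 1000.0 * width * height * components / samples_per_second

print(round(vga_frame_rate(50e6)), round(decode_time_ms(50e6)))   # lossy:    54 Hz, 63 ms
print(round(vga_frame_rate(37e6)), round(decode_time_ms(37e6)))   # lossless: 40 Hz, 85 ms
```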

These results can be compared to the estimated performance of software implementations. A mid-range Stratix II device, the EP2S60C5, can achieve 100 MSPS. This compares to 6 MSPS for a 600 MHz Texas Instruments TMS320DM642-600 DSP and 10 MSPS for a 3 GHz Pentium IV. These results can be used to estimate a price/performance comparison between an FPGA implementation and a DSP solution, as shown in Figure 7. The DSP price is based on the Texas Instruments TMS320DM642 10k-unit price of $45; the Altera price is based on the EP2S30C5 10k-unit price of $80. An FPGA-based solution using Stratix II is 4.5 times more cost-efficient than a DSP solution, while a structured-ASIC solution using an Altera HardCopy device is 11.5 times more cost-efficient. These results show the advantages offered by FPGAs and structured ASICs for implementing the highly complex bit-serial operations involved in the JPEG2000 compression algorithm.
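The bars in Figure 7 are simply the 10k-unit price divided by the achievable sample rate. The check below reproduces the two bars whose prices are quoted in the text; the HardCopy bar ($0.64 per MSPS) is taken directly from the figure, since its unit price is not given here.

```python
def dollars_per_msps(unit_price_usd, mega_samples_per_second):
    """Price/performance metric used in Figure 7."""
    return unit_price_usd / mega_samples_per_second

print(f"TI TMS320DM642-600:  ${dollars_per_msps(45, 6):.2f} per MSPS")   # $7.50
print(f"Stratix II EP2S30C5: ${dollars_per_msps(80, 50):.2f} per MSPS")  # $1.60
# The HardCopy point ($0.64 per MSPS) comes straight from Figure 7.
```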

[Figure 7: Price/Performance Ratio Comparison ($/MSPS): TI DM642-600 at $7.50, Stratix II EP2S30C5 at $1.60, HardCopy at $0.64.]

Conclusion
The JPEG2000 standard defines an algorithm that offers a broad spectrum of features, such as a progressive bit stream, precise rate control, regions of interest, and high-quality lossless and lossy compression. For these reasons, JPEG2000 is being considered for a variety of applications, including medical imaging, military, security systems, and digital cinema. These features come at the expense of algorithm complexity. This paper has demonstrated the performance, cost, and integration benefits that can be derived from FPGA and structured ASIC implementations.
