
Enabling Real-Time JPEG2000 with FPGA Architectures

Olivier Cantineau
Barco Silex
Rue du bosquet 7, B-1348 Louvain-la-Neuve, Belgium
+32 10 454904
olivier.cantineau@barco.com

Brian Jentz
Altera
101 Innovation Dr, San Jose, CA 95134
(408) 544-7000
bjentz@altera.com

Abstract

The JPEG2000 standard was developed to address a wide range of applications, including medical imaging, military, security systems, and digital cinema. To enable these applications, JPEG2000 has many unique features, including scalability, support for regions of interest, lossless support, and low latency. This paper describes an FPGA-based JPEG2000 implementation and demonstrates the performance, cost, and integration benefits that can be derived from FPGA and structured ASIC implementations.

I. Introduction

Many imaging applications are moving from the analog to the digital domain for a number of reasons, including perfect copies, controllable transmission quality, easy storage, and easy manipulation. One major challenge in making this transition is the large size of the images. For instance, a 640x480 VGA RGB image is nearly 1 Mbyte; a 2048x1536 8-bit grayscale medical image is 3 Mbytes; and a 4096x2160 36-bit digital cinema image is 38 Mbytes. There is a need, therefore, for high-quality image compression, and each application has slightly different requirements.
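As a sanity check on these figures, the short sketch below computes the raw storage for each example format from width, height, and bit depth; it is only illustrative arithmetic, not part of the JPEG2000 algorithm.

```python
# A quick check of the raw (uncompressed) sizes quoted above.
def raw_size_mib(width, height, bits_per_sample, components=1):
    """Uncompressed image size in MiB: width x height x components x bit depth."""
    return width * height * components * bits_per_sample / 8 / 2**20

print(f"640x480 24-bit RGB VGA:    {raw_size_mib(640, 480, 8, components=3):.1f} MB")
print(f"2048x1536 8-bit grayscale: {raw_size_mib(2048, 1536, 8):.1f} MB")
print(f"4096x2160 36-bit cinema:   {raw_size_mib(4096, 2160, 36):.1f} MB")
# -> roughly 0.9 MB, 3.0 MB, and 38.0 MB respectively
```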

JPEG2000 offers superior benefits to JPEG for a wide variety of applications. It supports both lossy and lossless compression in a single algorithm. It offers improved quality at the same compression ratio, thanks to its removal of block artifacts, support for regions of interest, and non-iterative optimal rate control. It has been designed to facilitate on-screen display and computer imagery. It has significantly improved bit-stream scalability, which is defined at the image level; this scalability includes embedded fast preview with further refinements and adaptability to instantaneously available bandwidth. All these features and the compression efficiency, however, come at the expense of algorithm complexity: JPEG2000 is up to 6 times more complex to implement than JPEG. Hardware acceleration is, therefore, required for an efficient solution.


II. JPEG2000 Overview

The JPEG2000 algorithm is illustrated in Figure 1. The processing is divided into two separate stages, highlighted with dashed boxes in the figure. The first stage performs the encoding, while the second stage (Tier-2) builds up the stream and includes an a-posteriori rate allocator.


[Figure 1: JPEG2000 Block Diagram. Tiles feed the DWT, optional quantization, and entropy (arithmetic) encoding, which produces compressed blocks and distortion metrics; the Tier-2 rate allocator then builds the JP2K stream.]

First Stage: Discrete Wavelet Transform (DWT) Based Compression
To apply JPEG2000 compression, the image is divided into rectangular tiles of configurable size. Figure 2 illustrates the JPEG2000 operations on a given tile of pixels, showing two levels of wavelet decomposition resulting in 7 sub-bands. Each tile separately undergoes the 2-D wavelet transform, which splits the frequency information of the tile into a series of pictures, named sub-bands. This is the decorrelation transform of the JPEG2000 algorithm. Each sub-band is the result of the 2-D filtering of the original tile for a given frequency range.

The wavelet transform is a recursive operation that can be applied a configurable number of times, called the number of decomposition levels. Each application of the transform generates 4 sub-bands from its input image by combining high-pass and low-pass filtering operations along the lines and the columns of the picture. This generates sub-bands marked as LL, LH, HL, and HH, where L represents low-pass filtering and H represents high-pass filtering (refer to Figure 2). The two letters are grouped for row-column combinations. Each level of wavelet decomposition applies to the LL result of the previous decomposition. The level of decomposition for a given sub-band is included in its name; this is the number appearing in the #LL, #LH, #HL, and #HH marks in Figure 2.

Each sub-band can then undergo selective quantization by a programmable factor for lossy compression. Bypassing the quantization gives lossless operation. The resulting quantized sub-bands are further divided into smaller rectangular blocks, named code blocks. Each code block passes through an entropy encoder. This is the compression engine of the JPEG2000 algorithm, which reduces the number of bits needed to represent the code blocks. All bit planes of the current code block are examined, starting from the most significant one. In each plane, the bits are scanned in a zigzag order and their context (information on the predominant value of the surrounding bits) is determined. Finally, an arithmetic encoder uses the value of each bit and its context to generate the code stream representing the compressed code block. The arithmetic encoder also computes distortion metrics, which reflect the image distortion that would be encountered when reconstructing the code block from its currently encoded portion.
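To make the recursive sub-band structure described above concrete, the following sketch performs a two-level separable decomposition of a tile. For brevity it uses the simple averaging/differencing (Haar) filter pair rather than the 5/3 or 9/7 filters actually specified by JPEG2000, but the LL/LH/HL/HH naming and the recursion on the LL band follow the description above.

```python
import numpy as np

def haar_1d(x):
    """One level of 1-D Haar analysis along the last axis: (low-pass, high-pass)."""
    x = x.astype(float)
    low = (x[..., 0::2] + x[..., 1::2]) / 2.0
    high = (x[..., 0::2] - x[..., 1::2]) / 2.0
    return low, high

def dwt2d_level(tile):
    """One level of separable 2-D decomposition: filter rows, then columns."""
    low_r, high_r = haar_1d(tile)                  # horizontal filtering
    ll, lh = haar_1d(low_r.swapaxes(0, 1))         # vertical filtering of the L half
    hl, hh = haar_1d(high_r.swapaxes(0, 1))        # vertical filtering of the H half
    return (ll.swapaxes(0, 1), lh.swapaxes(0, 1),
            hl.swapaxes(0, 1), hh.swapaxes(0, 1))

def dwt2d(tile, levels=2):
    """Recursive decomposition: each level re-decomposes the previous LL band."""
    subbands = {}
    current = np.asarray(tile)
    for level in range(1, levels + 1):
        ll, lh, hl, hh = dwt2d_level(current)
        subbands[f"{level}HL"] = hl
        subbands[f"{level}LH"] = lh
        subbands[f"{level}HH"] = hh
        current = ll                               # recurse on the LL band
    subbands[f"{levels}LL"] = current
    return subbands

tile = np.random.randint(0, 256, (128, 128))
for name, band in sorted(dwt2d(tile, levels=2).items()):
    print(name, band.shape)   # 7 sub-bands: 64x64 at level 1, 32x32 at level 2
```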

[Figure 2: Overview of the JPEG2000 Encoding Stage. One tile passes through the DWT, producing the 2LL, 2HL, 2LH, 2HH, 1HL, 1LH, and 1HH sub-bands (7 sub-bands); the quantized sub-bands are split along the code-block grid and each code block is processed by an entropy encoder (modeler plus arithmetic coder).]

Second Stage (Tier-2): Packet Selection and Reordering
The code stream generated by the arithmetic encoder, together with the distortion metrics, allows the JPEG2000 post-processing stage to selectively build the final bit stream. This process is driven by two user-defined parameters, detailed below.

The compression ratio - The Tier-2 stage selects incoming packets to attain the compression ratio specified by the user. The algorithm rejects packets that do not contribute a sufficient reduction in distortion. This mechanism allows precise control of the generated compressed file size, while maintaining good image quality.

The progression order - JPEG2000 allows an initial preview of a picture with the first portion of the bit stream. With the subsequent parts of the compressed file, the image is progressively refined. JPEG2000 standardizes various refinement orders by prioritizing an image characteristic, for example, quality or resolution. The Tier-2 stage attains the desired progression order by reordering the incoming packets.
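The packet selection can be pictured as a greedy rate-distortion trade-off. The sketch below is a simplified illustration of that idea rather than the Tier-2 procedure defined by the standard: it keeps the contributions with the largest distortion reduction per byte until a user-specified byte budget is reached; the packet sizes and distortion values are made up for the example.

```python
# Hypothetical packet descriptors: (size in bytes, distortion reduction if included).
packets = [(1200, 9000.0), (800, 2500.0), (500, 2400.0), (1500, 1200.0), (300, 90.0)]

def allocate(packets, byte_budget):
    """Greedy selection by distortion reduction per byte until the budget is met."""
    ranked = sorted(enumerate(packets),
                    key=lambda item: item[1][1] / item[1][0], reverse=True)
    chosen, used = [], 0
    for index, (size, gain) in ranked:
        if used + size <= byte_budget:        # reject packets that do not fit
            chosen.append(index)
            used += size
    return sorted(chosen), used

selected, total = allocate(packets, byte_budget=2600)
print(selected, total)   # keeps the packets with the best gain per byte: [0, 1, 2] 2500
```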

III. Implementation on FPGA


Due to its powerful features, JPEG2000 requires more computational resources than the classic JPEG standard to achieve similar encoding and decoding speeds. To increase JPEG2000 performance, this paper proposes an architecture where the computationally intensive tasks are offloaded to an FPGA co-processor, as illustrated in Figure 3. JPEG2000 processing is accelerated by executing the wavelet transform, quantization, and entropy encoding on the FPGA.

[Figure 3: Co-Processing Architecture. The CPU runs the Tier-2 stage while the FPGA co-processor implements the DWT, optional quantization, and entropy encoding.]
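In this partition the host software keeps control of tile sequencing and of Tier-2, while the per-tile number crunching runs in hardware. The outline below is only a sketch of that division of labor under the split shown in Figure 3; the driver-style function names are hypothetical and do not correspond to an actual API.

```python
def encode_image(tiles, fpga, rate_allocator):
    """Hypothetical host-side loop for the CPU/FPGA split of Figure 3."""
    compressed_blocks = []
    for tile in tiles:
        fpga.load_tile(tile)                      # push pixels to the co-processor
        fpga.run_dwt_quant_entropy()              # DWT, quantization, entropy in hardware
        # Each block comes back with its code stream and distortion metrics.
        compressed_blocks.extend(fpga.read_compressed_blocks())
    # Tier-2 (packet selection and reordering) stays in software on the CPU.
    return rate_allocator.build_stream(compressed_blocks)
```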
Figure 4 illustrates a software benchmark of the JPEG2000 algorithm for lossless and lossy compression, where a large part of the processor time is spent on entropy encoding. This is particularly true for lossless encoding, which requires many encoding passes. Hardware FPGA implementations can accelerate the wavelet transform and quantization by pipelining these operations. Entropy encoding, however, is more difficult to optimize due to its bit-serial structure. This is illustrated by Figure 5, which shows the amount of time spent on the DWT and entropy operations by the FPGA co-processor proposed in Figure 3. The bottleneck is the entropy-encoding stage, showing permanent activity (100 percent on the graph). The hardware wavelet transform is only active during 7 to 10 percent of the time needed to entropy encode the corresponding data. To compensate for this slow entropy encoding and to better balance the activity of the various blocks of the JPEG2000 encoder, several entropy encoders must be placed in parallel to independently process the code blocks generated by a single wavelet engine.
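A back-of-the-envelope balance calculation shows why: if the hardware DWT is busy only 7 to 10 percent of the time one entropy encoder needs for the same data, roughly 10 to 15 parallel chains would be required to fully hide the entropy stage behind the wavelet stage (the benchmarked configurations later in this paper use up to 8). The sketch below simply turns the quoted activity figures into a chain count.

```python
import math

def encoders_needed(dwt_busy_fraction):
    """Entropy chains required so the entropy stage keeps up with the DWT."""
    # One entropy encoder takes 1 / dwt_busy_fraction times longer than the DWT
    # on the same data, so that many chains are needed to match its throughput.
    return math.ceil(1.0 / dwt_busy_fraction)

for fraction in (0.07, 0.10):
    print(f"DWT active {fraction:.0%} of entropy time -> "
          f"{encoders_needed(fraction)} parallel entropy encoders")
# -> about 15 chains at 7% activity and 10 chains at 10%
```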
[Figure 4: Software JPEG2000 Benchmarking. CPU load (%) split among DWT, entropy encoding, and other operations, for lossless and lossy compression.]

[Figure 5: Hardware JPEG2000 Benchmarking. Time activity (%) of the DWT and entropy-encoding stages, for lossless and lossy compression.]

Figure 6 shows a block diagram of the Barco Silex JPEG2000 encoder core (BA112JPEG2000E). This figure illustrates the main functional modules and provides a simplified view of the interfaces. It also gives an overview of the logic and memory usage for each module on Altera's Stratix and Stratix II devices. The block diagram shows the parallel structure of the core, where several entropy encoders are implemented to process the data generated by the wavelet engine. Altera's Stratix and Stratix II FPGA features, namely fast and numerous RAM blocks and a large amount of logic and hardware multipliers, make these FPGAs excellent choices for implementing JPEG2000 solutions. The presence of large on-chip M-RAM blocks allows the implementation of a large on-chip tile buffer, increasing the overall performance and integration level of the core.

Pixel data is input through the pixel interface, and compressed streams are made available at the compressed interfaces, together with distortion metrics. The core features a simple generic CPU interface suited for use as a bus peripheral to various processors. The following sections describe the modules constituting the BA112JPEG2000E core as depicted in Figure 6.

[Figure 6: Block Diagram of the Barco Silex JPEG2000 Encoder. A pixel interface feeds a line buffer (2 M4K) and the 2D DWT (4400 LEs, 12 multipliers, 26 M4K), which writes to the tile buffer (2 M-RAM); the quantizer (1000 LEs, 2 multipliers) and tile splitter (2000 LEs) dispatch code blocks to parallel entropy encoders (2000 LEs, 4 M4K each) with code-block buffers (8 M4K each), whose compressed data and distortion metrics exit at the compressed interfaces.]

2D DWT


The first module of the core is the wavelet-transform engine. This module can be configured to accept tiles of pixels of any size up to 128 by 128. It performs two-dimensional discrete wavelet decomposition on the incoming data with up to five programmable decomposition levels. The wavelet transform can be programmed to be lossy, lossless, or bypassed. The DWT module accepts incoming pixels of any size up to 12 bits (10 bits for lossy). Finally, it stores its results in the on-chip tile buffer, ready to undergo quantization and code-block decomposition.

Quantizer
The quantizer fetches the sub-bands available from the tile buffer and applies a programmable quantization step. Different quantization steps can be programmed for each sub-band; lower-frequency sub-bands can thus be weighted differently from higher-frequency ones. The quantizer can be bypassed for lossless operation.
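The quantization itself is a simple per-coefficient operation. The sketch below shows dead-zone scalar quantization with illustrative, hypothetical step sizes per sub-band; the standard's derivation of step sizes from exponent/mantissa fields is omitted.

```python
import numpy as np

def quantize_subband(coefficients, step):
    """Dead-zone scalar quantization: q = sign(c) * floor(|c| / step).
    Bypassing this step (integer transform path) gives lossless operation."""
    coefficients = np.asarray(coefficients, dtype=float)
    return np.sign(coefficients) * np.floor(np.abs(coefficients) / step)

# Hypothetical per-sub-band steps: lower-frequency bands are quantized more gently.
steps = {"2LL": 1.0, "2HL": 2.0, "2LH": 2.0, "2HH": 4.0,
         "1HL": 4.0, "1LH": 4.0, "1HH": 8.0}
band = np.array([-13.7, -2.1, 0.4, 5.9, 21.3])
print(quantize_subband(band, steps["1HH"]))   # -> [-1. -0.  0.  0.  2.]
```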

Tile Splitter
This unit further divides the quantized sub-bands into rectangular code blocks of programmable size (up to 32 by 32), ready for entropy encoding by an arithmetic encoder. The core features a configurable number of entropy encoders placed in parallel in order to sustain high encoding rates. The number of implemented chains is selected during the IP synthesis process. Each entropy chain processes a code block independently from neighboring chains. The tile-splitter module is responsible for arbitrating between the available chains, dispatching the various code blocks to be encoded. It stores the code blocks in the local code-block buffers.
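The arbitration performed by the tile splitter can be pictured as handing each new code block to the first entropy chain that becomes free. The sketch below is a behavioral model of that dispatch, not the RTL of the core; the chain count and per-block cycle counts are arbitrary.

```python
import heapq

def dispatch(code_blocks, num_chains):
    """Assign each code block to the entropy chain that becomes free first.
    code_blocks: list of (block_id, encoding_cycles)."""
    # Heap of (time_when_free, chain_id); all chains start idle at time 0.
    chains = [(0, chain_id) for chain_id in range(num_chains)]
    heapq.heapify(chains)
    schedule = []
    for block_id, cycles in code_blocks:
        free_at, chain_id = heapq.heappop(chains)      # first chain to become idle
        schedule.append((block_id, chain_id, free_at))
        heapq.heappush(chains, (free_at + cycles, chain_id))
    return schedule

blocks = [("cb0", 900), ("cb1", 450), ("cb2", 700), ("cb3", 300), ("cb4", 820)]
for block_id, chain, start in dispatch(blocks, num_chains=3):
    print(f"{block_id} -> chain {chain} (starts at cycle {start})")
```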

Modeler and Arithmetic Encoder
The modeler performs the first part of the entropy encoding. It examines the code block bit plane by bit plane and extracts the relevant bits in zigzag order within each plane. It also computes the context information needed by the arithmetic encoder, as well as the distortion metrics; these are made available at the compressed interface and are used by the Tier-2 part of the JPEG2000 algorithm. The arithmetic encoder processes the bits and contexts and makes the resulting stream available at the compressed interface.
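To illustrate the bit-plane scanning, the sketch below extracts the magnitude bit planes of a small code block from the most significant plane downward; the actual stripe-oriented scan, the three coding passes per plane, and the standard's context-formation rules are omitted.

```python
import numpy as np

def bit_planes(code_block):
    """Yield (plane_index, bits) from the most significant magnitude plane down."""
    magnitudes = np.abs(code_block).astype(np.int64)
    top = int(magnitudes.max()).bit_length() - 1
    for plane in range(top, -1, -1):
        yield plane, (magnitudes >> plane) & 1

block = np.array([[ 5, -3,  0],
                  [ 2,  7, -1],
                  [ 0,  4,  6]])
for plane, bits in bit_planes(block):
    print(f"plane {plane}:\n{bits}")
# A modeler would scan each plane in zigzag order, derive a context from the
# significance of neighboring samples, and feed (bit, context) pairs to the
# arithmetic coder while accumulating distortion metrics.
```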

Performance Analysis
Table 1 illustrates JPEG2000 decoding capabilities benchmarked on different Stratix II devices. The following information is given in the table for each of three members of the family:
- LE usage for the JPEG2000 decoder core (with device utilization percentages)
- Decoding configuration: the number of entropy channels (configured at synthesis time)
- Resultant sample rate (the left number is for typical lossy compression; the right number is for typical lossless compression)
- Resultant VGA frame rate (640x480, 24-bit RGB)
- Resultant decoding time for monochrome 8-bit 3-Mpixel medical images (2048x1536)

Table 1: JPEG2000 Decoder Performance on Stratix II

Stratix II Device | Area (#LEs, Usage %) | # Entropy Channels | Sample Rate (lossy / lossless) | VGA (Hz, lossy / lossless) | PACS 3M (ms, lossy / lossless)
EP2S15C5          | 10500 (67%)          | 2                  | 14M / 10M                      | 15 / 10                    | 225 / 315
EP2S30C5          | 25000 (74%)          | 8                  | 50M / 37M                      | 54 / 40                    | 63 / 85
EP2S60C5          | 50000 (83%)          | 8                  | 100M / 74M                     | 108 / 80                   | 32 / 43
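The frame rates and decoding times in Table 1 follow directly from the sample rates. For example, the check below reproduces the EP2S30C5 row (54 Hz VGA and 63 ms for the 3-Mpixel image in lossy mode, 40 Hz and 85 ms in lossless mode), using only the table's sample rates.

```python
def vga_frame_rate(samples_per_second, width=640, height=480, components=3):
    """Frames per second for a 24-bit RGB VGA stream at the given sample rate."""
    return samples_per_second / (width * height * components)

def decode_time_ms(samples_per_second, width=2048, height=1536, components=1):
    """Milliseconds to process one monochrome image at the given sample rate."""
    return 1000.0 * width * height * components / samples_per_second

print(round(vga_frame_rate(50e6)), round(decode_time_ms(50e6)))   # lossy:    54 Hz, 63 ms
print(round(vga_frame_rate(37e6)), round(decode_time_ms(37e6)))   # lossless: 40 Hz, 85 ms
```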

These results can be compared to the estimated performance of software implementations. A mid-range Stratix II device, the EP2S60C5, can achieve 100 MSPS. This compares to 6 MSPS for a 600 MHz Texas Instruments TMS320DM642-600 DSP and 10 MSPS for a 3 GHz Pentium IV. These results can be used to estimate a price/performance comparison between an FPGA implementation and a DSP solution, as shown in Figure 7. The DSP price is based on the Texas Instruments TMS320DM642 10k-unit price of $45; the Altera price is based on the EP2S30C5 10k-unit price of $80. An FPGA-based solution using Stratix II is 4.5 times more cost-efficient than a DSP solution, while a structured-ASIC solution using an Altera HardCopy device is 11.5 times more cost-efficient. These results show the advantages offered by FPGAs and structured ASICs for implementing the highly complex bit-serial operations involved in the JPEG2000 compression algorithm.
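The bars in Figure 7 are simply the 10k-unit price divided by the achievable sample rate. The check below reproduces the two bars whose prices are quoted in the text; the HardCopy bar ($0.64 per MSPS) is taken directly from the figure, since its unit price is not given here.

```python
def dollars_per_msps(unit_price_usd, mega_samples_per_second):
    """Price/performance metric used in Figure 7."""
    return unit_price_usd / mega_samples_per_second

print(f"TI TMS320DM642-600:  ${dollars_per_msps(45, 6):.2f} per MSPS")   # $7.50
print(f"Stratix II EP2S30C5: ${dollars_per_msps(80, 50):.2f} per MSPS")  # $1.60
# The HardCopy point ($0.64 per MSPS) comes straight from Figure 7.
```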

[Figure 7: Price/Performance Ratio Comparison ($/MSPS): TI DM642-600 at $7.50, Stratix II EP2S30C5 at $1.60, HardCopy at $0.64.]

Conclusion
The JPEG2000 standard defines an algorithm that offers a broad spectrum of features, such as a progressive bit stream, precise rate control, regions of interest, and high-quality lossless and lossy compression. For these reasons, JPEG2000 is being considered for a variety of applications, including medical imaging, military, security systems, and digital cinema. These features come at the expense of algorithm complexity. This paper has demonstrated the performance, cost, and integration benefits that can be derived from FPGA and structured ASIC implementations.
