the large size of the images. For instance, a 640x480 VGA RGB image is about 1 Mbyte; a 2048x1536 8-bit grayscale medical image is 3 Mbytes; and a 4096x2160 36-bit digital cinema image is 38 Mbytes. There is a need, therefore, for high-quality image compression. Each application has slightly different requirements. JPEG2000 offers superior benefits to JPEG for a wide variety of applications. It supports both lossy and lossless compression in a single algorithm. It offers improved quality at the same compression ratio, due to its removal of block artifacts, support for regions of interest, and non-iterative optimal rate control. It has been designed to facilitate onscreen display and computer imagery. It has significantly improved bit-stream scalability, which is defined at the image level. This scalability includes embedded fast preview with further refinements and adaptability to instantaneously available bandwidth. All the features and compression efficiency, however, come at the expense of algorithm complexity. JPEG2000 is up to 6 times more complex to implement than JPEG. Hardware acceleration is, therefore, required for an efficient solution.
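The raw image sizes quoted above follow directly from width, height, and bit depth. The short sketch below reproduces the arithmetic; the function name `raw_image_bytes` is illustrative only, and the VGA case assumes 24 bits per pixel, as stated later in the performance section.

```python
def raw_image_bytes(width, height, bits_per_pixel):
    """Uncompressed size in bytes of a width x height image."""
    return width * height * bits_per_pixel // 8


MIB = 1024 * 1024  # the "Mbytes" figures match binary megabytes

examples = {
    "VGA RGB (24-bit)": (640, 480, 24),
    "Medical grayscale (8-bit)": (2048, 1536, 8),
    "Digital cinema (36-bit)": (4096, 2160, 36),
}

for name, (w, h, bpp) in examples.items():
    size = raw_image_bytes(w, h, bpp)
    print(f"{name}: {size} bytes = {size / MIB:.2f} MiB")
```

The three cases come out at roughly 0.88, 3.00, and 37.97 MiB, which is where the rounded figures of 1, 3, and 38 Mbytes come from.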
The JPEG2000 standard was developed to address a wide range of applications, including medical imaging, military, security systems, and digital cinema. To enable these applications, JPEG2000 has many unique features, including scalability, support for regions of interest, lossless support and low latency. This paper will describe an FPGA-based JPEG2000 implementation and will demonstrate the performance, cost, and integration benefits that can be derived by FPGA and structured ASIC implementations.
CF-JPG031505-1.0
[Figure 1: JPEG2000 Block Diagram. Blocks shown: tiles, DWT, [Quant] (optional), Entropy (arithmetic), Distortion Metrics, Compressed Blocks, Tier-2 Rate Allocator, JP2K Stream.]
First Stage: Discrete Wavelet Transform (DWT) Based Compression
This encoding stage is illustrated in Figure 2, which shows 2 levels of wavelet decomposition resulting in 7 sub-bands. To apply JPEG2000 compression, the image is divided into rectangular tiles of configurable size. Figure 2 illustrates the JPEG2000 operations on a given tile of pixels. Each tile separately undergoes the 2-D wavelet transform, which splits the frequency information of the tile into a series of pictures, named sub-bands. This is the decorrelation transform of the JPEG2000 algorithm. Each sub-band is the result of the 2-D filtering of the original tile for a given frequency range.
The wavelet transform is a recursive operation that can be applied a configurable number of times, called the number of decomposition levels. Each application of the transform generates 4 sub-bands from its input image by combining high-pass and low-pass filtering operations along the lines and the columns of the picture. This generates sub-bands marked as LL, LH, HL, and HH, where L represents low-pass filtering and H represents high-pass filtering (refer to Figure 2). The two letters are grouped for row-column combinations. Each level of wavelet decomposition is applied to the LL result of the previous level. The level of decomposition for a given sub-band is included in its name: it is the number appearing in place of # in the #LL, #LH, #HL, and #HH marks in Figure 2.
Each sub-band can then undergo selective quantization by a programmable factor for lossy compression. Bypassing the quantization gives lossless operation. The resultant quantized sub-bands are further divided into smaller rectangular blocks, named code blocks. Each code block passes through an entropy encoder. This is the compression engine of the JPEG2000 algorithm, which reduces the number of bits needed to represent the code blocks. All bit planes of the current code block are examined, starting from the most significant one. In each plane, the bits are scanned in a zigzag order and their context (information on the predominant value of the surrounding bits) is determined. Finally, an arithmetic encoder uses the value of the bit and its context to generate the code stream representing the compressed code block. The arithmetic encoder also computes distortion metrics. These reflect the image distortion that would be encountered when reconstructing the code block from its currently encoded portion.
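The recursive sub-band split described above can be sketched in a few lines. This is a minimal illustration, not the filter bank of the standard: it uses a simple Haar average/difference pair as a stand-in for the 5/3 and 9/7 wavelet filters that JPEG2000 actually specifies, and it assumes tile dimensions divisible by 2 at every level. All function names are illustrative.

```python
def haar_1d(samples):
    """Split an even-length sequence into low-pass and high-pass halves."""
    low = [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    high = [(samples[i] - samples[i + 1]) / 2 for i in range(0, len(samples), 2)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_rows(matrix):
    """Apply the 1-D filter pair along every row."""
    lows, highs = [], []
    for row in matrix:
        lo, hi = haar_1d(row)
        lows.append(lo)
        highs.append(hi)
    return lows, highs


def dwt2_level(tile):
    """One 2-D decomposition level; returns (LL, HL, LH, HH)."""
    row_low, row_high = filter_rows(tile)
    # Filtering along columns equals filtering the rows of the transpose.
    ll_t, lh_t = filter_rows(transpose(row_low))
    hl_t, hh_t = filter_rows(transpose(row_high))
    return transpose(ll_t), transpose(hl_t), transpose(lh_t), transpose(hh_t)


def decompose(tile, levels):
    """Recursive decomposition; keys are named like '2LL' or '1HH'."""
    subbands = {}
    current = tile
    for level in range(1, levels + 1):
        # Each level is applied to the LL result of the previous level.
        current, hl, lh, hh = dwt2_level(current)
        subbands[f"{level}HL"] = hl
        subbands[f"{level}LH"] = lh
        subbands[f"{level}HH"] = hh
    subbands[f"{levels}LL"] = current
    return subbands


tile = [[(x + y) % 16 for x in range(8)] for y in range(8)]
bands = decompose(tile, levels=2)
for name, band in sorted(bands.items()):
    print(name, f"{len(band)}x{len(band[0])}")
```

With 2 levels on an 8x8 tile, the sketch yields the 7 sub-bands of Figure 2: three 4x4 detail bands at level 1, three 2x2 detail bands at level 2, and the 2x2 2LL band.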
[Figure 2: Overview of the JPEG2000 Encoding Stage (showing 7 sub-bands). One tile passes through the DWT to produce sub-bands 2LL, 2HL, 2LH, 2HH, 1HL, 1LH, and 1HH.]
Second Stage (Tier-2): Packet Selection and Reordering
The code stream generated by the arithmetic encoder, together with the distortion metrics, allows the JPEG2000 post-processing stage to selectively build the final bit stream. This process is driven by two user-defined parameters, detailed below.
The compression ratio: The Tier-2 stage selects incoming packets to attain the compression ratio specified by the user. The algorithm rejects packets that do not contribute enough to reducing the compression distortion. This mechanism allows precise control of the generated compressed file size while maintaining good image quality.
The progression order: JPEG2000 allows an initial preview of a picture from the first portion of the bit stream. With the subsequent parts of the compressed file, the image is progressively refined. JPEG2000 standardizes various refinement orders, each prioritizing an image characteristic such as quality or resolution. The Tier-2 stage attains the desired progression order by reordering the incoming packets.
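The two Tier-2 decisions described above, which packets to keep and in what order to emit them, can be illustrated with a deliberately simplified sketch. It is not the rate-control algorithm of the standard or of any particular core: it uses a plain greedy choice by distortion gain per byte, ignores the dependency between successive coding passes of a code block, and the `Packet` fields, function names, and sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    name: str
    size_bytes: int
    distortion_gain: float  # distortion reduction if the packet is kept
    resolution: int         # 0 = coarsest resolution
    layer: int              # 0 = first quality layer


def select_packets(packets, budget_bytes):
    """Greedy selection: highest distortion gain per byte first."""
    ranked = sorted(packets, key=lambda p: p.distortion_gain / p.size_bytes,
                    reverse=True)
    kept, used = [], 0
    for packet in ranked:
        if used + packet.size_bytes <= budget_bytes:
            kept.append(packet)
            used += packet.size_bytes
    return kept


def reorder(packets, progression):
    """Order packets so the chosen characteristic is refined first."""
    if progression == "quality":
        return sorted(packets, key=lambda p: (p.layer, p.resolution))
    if progression == "resolution":
        return sorted(packets, key=lambda p: (p.resolution, p.layer))
    raise ValueError(f"unknown progression: {progression}")


packets = [
    Packet("r0-l0", 100, 900.0, resolution=0, layer=0),
    Packet("r1-l0", 200, 800.0, resolution=1, layer=0),
    Packet("r0-l1", 100, 200.0, resolution=0, layer=1),
    Packet("r1-l1", 300, 150.0, resolution=1, layer=1),
]

kept = select_packets(packets, budget_bytes=400)
print([p.name for p in reorder(kept, "quality")])
print([p.name for p in reorder(kept, "resolution")])
```

With a 400-byte budget the least useful packet (`r1-l1`) is rejected, and the same three surviving packets are then emitted in two different orders depending on the progression requested.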
[Figure 3: Co-Processing Architecture. Blocks shown: DWT, [Quant], Entropy, and Tier-2, partitioned between the CPU and the FPGA.]
Figure 4 illustrates a software benchmark of the JPEG2000 algorithm for lossless and lossy compression, where a large part of the processor time is spent on entropy encoding. This is particularly true for lossless encoding, which requires many encoding passes. Hardware FPGA implementations can accelerate the wavelet transform and quantization by pipelining these operations. Entropy encoding, however, is more difficult to optimize because of its bit-serial structure. This is illustrated by Figure 5, which shows the amount of time spent on the DWT and entropy operations by the FPGA co-processor proposed in Figure 3. The bottleneck is the entropy-encoding stage, which shows permanent activity (100 percent on the graph). The hardware wavelet transform is active for only 7 to 10 percent of the time needed to entropy encode the corresponding data. To compensate for this slow entropy encoding and to better balance the activity of the various blocks of the JPEG2000 encoder, several entropy encoders must be placed in parallel to independently process the code blocks generated by a single wavelet engine.
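The 7 to 10 percent activity figure gives a rough idea of how many entropy encoders one wavelet engine could keep busy. The sketch below is a back-of-the-envelope estimate only, assuming that entropy throughput scales perfectly with the number of encoders and that nothing else limits the pipeline; the function name is illustrative.

```python
def encoders_to_balance(dwt_active_percent):
    """Entropy encoders needed to keep one DWT engine fully busy.

    Assumes ideal scaling: N encoders finish a tile N times faster.
    """
    if not 0 < dwt_active_percent <= 100:
        raise ValueError("activity must be in (0, 100]")
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-100 // dwt_active_percent)


for percent in (7, 10):
    print(f"DWT active {percent}% -> {encoders_to_balance(percent)} encoders")
# DWT active 7% -> 15 encoders
# DWT active 10% -> 10 encoders
```

In practice the number of encoders is also bounded by the logic available on the device, so a real configuration may settle for fewer encoders than this ideal balance point.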
[Figure 4: Software JPEG2000 Benchmarking. Processor time spent on DWT and Entropy for lossless and lossy compression.]
Figure 6 shows a block diagram of the Barco Silex JPEG2000 encoder core (BA112JPEG2000E). This figure illustrates the main functional modules and provides a simplified view of the interfaces. It also gives an overview of the logic and memory usage for each module on Altera's Stratix
of logic and hardware multipliers make these FPGAs excellent choices for implementing JPEG2000 solutions. The presence of large on-chip M-RAM blocks allows the implementation of a large on-chip tile buffer, increasing the overall performance and integration level of the core.
Pixel data is input through the pixel interface, and compressed streams are made available at the compressed interfaces, together with distortion metrics. The core features a simple generic CPU interface suited for use as a bus peripheral to various processors. The following sections describe the modules constituting the BA112JPEG2000E core as depicted in Figure 6.
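[Figure 6: Block diagram of the BA112JPEG2000E encoder core. Modules shown, with resource usage where legible: Pixel input; Line Buffer (2 M4K); 2D DWT (4400 LEs, 12 MULT, 26 M4K); Tile Buffer (2 MRAM); Q (1000 LEs, 2 MULT); Tile Splitter (2000 LEs); CBlock Buffer; Entropy Encoder; Compressed Data output.]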
Quantizer
Tile Splitter
This unit further divides the quantized sub-bands into rectangular code blocks of programmable size (up to 32 by 32), ready for the entropy encoding by an arithmetic encoder. The cores feature a configurable number of entropy encoders placed in parallel in order to sustain high encoding rates. The number of implemented chains is selected during the IP synthesis process. Each entropy chain processes a code block independently from neighboring chains. The tile-splitter module is responsible for arbitrating between the available chains, dispatching the various code blocks to be encoded. It stores the code blocks in the local code-block buffers.
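The splitting and dispatching role described above can be sketched in software. This is a behavioral illustration only: the source does not specify the arbitration policy, so plain round-robin assignment is an assumption, and the function names and the 80x48 sub-band size are hypothetical.

```python
def split_into_code_blocks(width, height, block_size=32):
    """Return (x, y, w, h) rectangles that cover a width x height sub-band."""
    blocks = []
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            # Blocks on the right and bottom edges may be smaller.
            w = min(block_size, width - x)
            h = min(block_size, height - y)
            blocks.append((x, y, w, h))
    return blocks


def dispatch_round_robin(blocks, num_chains):
    """Assign code blocks to entropy chains in turn (assumed policy)."""
    queues = [[] for _ in range(num_chains)]
    for index, block in enumerate(blocks):
        queues[index % num_chains].append(block)
    return queues


blocks = split_into_code_blocks(80, 48)
queues = dispatch_round_robin(blocks, num_chains=4)
for chain, queue in enumerate(queues):
    print(f"chain {chain}: {queue}")
```

A hardware arbiter would more likely hand each code block to whichever chain becomes free first; round-robin is used here only because it is the simplest policy that shows independent chains sharing the work.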
Performance Analysis
Table 1 illustrates JPEG2000 decoding capabilities benchmarked on different Stratix II devices. For each of three members of the family, the table gives the following information: LE usage for the JPEG2000 decoder core (with device utilization percentages); the decoding configuration, that is, the number of entropy channels (configured at the synthesis stage); the resultant sample rate (the left number is for typical lossy compression, the right number for typical lossless compression); the resultant VGA frame rate (640x480 24-bit RGB); and the resultant decoding time for monochrome 8-bit 3-Mpixel medical images (2048x1536).
Table 1 (surviving rows; one column per benchmarked device, values given as lossy / lossless)
# Entropy Channels    2            8          8
VGA (Hz)              15 / 10      54 / 40    108 / 80
PACS 3M (ms)          225 / 315    63 / 85    32 / 43
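The frame rates and decoding times in Table 1 can be cross-checked against each other by converting both to a sample rate. The sketch below assumes that one "sample" is one 8-bit component of one pixel, so a 24-bit RGB VGA frame holds three samples per pixel; that interpretation is an assumption, as are the function names. The input values are the lossy entries of the table.

```python
VGA_SAMPLES = 640 * 480 * 3   # 24-bit RGB: three samples per pixel (assumed)
PACS_SAMPLES = 2048 * 1536    # monochrome: one sample per pixel


def msps_from_frame_rate(frames_per_second, samples_per_frame=VGA_SAMPLES):
    """Sample rate in MSPS implied by a frame rate."""
    return frames_per_second * samples_per_frame / 1e6


def msps_from_decode_time(milliseconds, samples=PACS_SAMPLES):
    """Sample rate in MSPS implied by a per-image decoding time."""
    return samples / (milliseconds / 1000) / 1e6


lossy_vga_hz = [15, 54, 108]
lossy_pacs_ms = [225, 63, 32]

for hz, ms in zip(lossy_vga_hz, lossy_pacs_ms):
    print(f"{hz} Hz VGA -> {msps_from_frame_rate(hz):.1f} MSPS, "
          f"{ms} ms PACS -> {msps_from_decode_time(ms):.1f} MSPS")
```

Under this assumption the two rows agree closely in every column (about 14, 50, and 100 MSPS), and the largest value is in line with the 100 MSPS quoted in the text for a mid-range Stratix II device.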
These results can be compared to the estimated performance of software implementations. A mid-range Stratix II device, EP2S60C5, can achieve 100 MSPS. This compares to a 600 MHz Texas Instruments TMS320DM642-600 DSP at 6 MSPS and a 3 GHz Pentium IV at 10 MSPS. These results can be used to estimate a price-performance ratio comparison between an FPGA implementation and a DSP solution as shown in Figure 7. The DSP price is based on the Texas Instruments TMS320DM642 10k-unit price of $45. The Altera price is based on the EP2S30C5 10k-unit price of $80. An FPGA-based solution using Stratix is 4.5 times more efficient than a DSP solution, while a structured-ASIC based solution using an Altera Stratix HardCopy device is 11.5 times more efficient. These results show the advantages offered by FPGAs and structured ASICs for implementing the highly complex bit-serial operations involved in the JPEG2000 compression algorithm.
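[Figure 7: Price-performance comparison between the FPGA implementation and the DSP solution. Surviving value labels: $1.60, $0.64.]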
Conclusion
The JPEG2000 standard defines an algorithm that is able to offer a large spectrum of features, such as progressive bit stream, precise rate control, region of interest, and high quality lossless and lossy compression. For these reasons, JPEG2000 is being considered in a variety of applications, including medical imaging, military, security systems, and digital cinema. These features come at the expense of algorithm complexity. This paper demonstrates the performance, cost, and integration benefits that can be derived by FPGA and structured ASIC implementations.