
CHAPTER 2 VIDEO COMPRESSION TECHNIQUES AND STANDARDS


2.1 Introduction

Digital video techniques have been used for a number of years, for example in the television broadcasting industry. However, until recently a number of factors have prevented the widespread use of digital video. An analog video signal typically occupies a bandwidth of a few megahertz. However, when it is converted into digital form, at an equivalent quality, the digital version typically has a bit rate well over 100 Mbps. This bit rate is too high for most networks or processors to handle. Therefore, the digital video information has to be compressed before it can be stored or transmitted. Over the last couple of decades, digital video compression techniques have been constantly improving. Many international standards that specialize in different digital video applications have been developed or are being developed. At the same time, processor technology has improved dramatically in recent years. The availability of cheap, high-performance processors together with the development of international standards for video compression has enabled a wide range of video communications applications. All video coding standards make use of the redundancy inherent within digital video information in order to substantially reduce its bit rate. A still image, or a single frame within a video sequence, contains a significant amount of spatial redundancy. To eliminate some of this redundancy, the image is first transformed. The transform domain provides a more succinct way of representing the visual information. Furthermore, the human visual system is less sensitive to certain (usually high frequency) components of the transformed information. For this reason, these components can be eliminated without seriously reducing the visual quality of the decoded image. The remaining information can then be efficiently encoded using entropy encoding (for example, variable length coding such as Huffman coding).

In a moving video sequence, successive frames of video are usually very similar. This is called temporal redundancy. Removing temporal redundancy can result in further compression. To do this, only the parts of the new frame that have changed from the previous frame are sent. In most cases, changes between frames are due to movement in the scene that can be approximated as simple linear motion. From the previously transmitted frames, we can predict the motion of regions and send only the prediction error (motion prediction). In this way the video bit rate is further reduced.

In this chapter, we describe the main international standards for image and video coding. We explain how these standards achieve video compression and what image and video coding techniques are used within them. Among these coding standards, the Joint Photographic Experts Group (JPEG) standard describes techniques for compressing still images or individual frames of video. A typical JPEG encoder compresses images by a factor of between 10 and 20 without seriously reducing the visual quality of the reconstructed image. The H.261 standard supports motion video coding for videoconferencing and videotelephony applications. It is optimized for video communications at the bit rates supported by ISDNs. The more recently developed H.263 draft standard specializes in very low bit rate videoconferencing (less than 64 Kbps). The Moving Picture Experts Group (MPEG) standards address video coding for entertainment and broadcast purposes. MPEG1 is optimized for coding of video and associated audio for digital storage media such as CD-ROM. MPEG2 enhances the techniques of MPEG1 to support video coding for a range of video communication applications, including broadcast digital television (at an equivalent resolution and quality to analog television and also at higher resolutions). The MPEG4 initiative is addressing generic, integrated video communications.

2.2 Digital Video

2.2.1 The RGB and YUV Representation of Video Signals

A color can be synthesized by combining the three primary colors red, green, and blue (RGB). The RGB color system is one means of representing color images. Alternatively, the luminance (brightness) and chrominance (color) information can be represented separately. By calculating a weighted sum of the three colors R, G, and B, we can obtain the luminance signal Y, which represents the brightness of the color. We can also compute color difference signals Cr, Cb, and Cg by subtracting the luminance from each primary component as follows,

Cr = Wr * (R - Y)
Cb = Wb * (B - Y)
Cg = Wg * (G - Y)

where Wr, Wb, and Wg are weights. Of the three color difference signals, only two are linearly independent; the third one can always be expressed as a linear combination of the other two. Therefore, we only need the luminance Y and any two of the color difference signals to represent the original color. The three major standards for analog color television in current use are PAL, SECAM, and NTSC. All three systems use three components: luminance Y, blue color difference U (equivalent to Cb above), and red color difference V (equivalent to Cr above) to represent a color. We call this the YUV system. To convert RGB to YUV, the computer graphics community mostly uses the following formulas,

Y = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16
U = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128
V = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128

This YUV representation has certain advantages over the RGB system. Since the human visual system (HVS) is less sensitive to chrominance than to brightness, the chrominance signals can be represented with a lower resolution than the luminance without significantly affecting the visual quality. This by itself achieves some degree of data compression.
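As a concrete illustration, the following Python sketch applies these formulas to a single 8-bit RGB pixel. The clipping to the 0-255 range is an implementation choice for illustration, not part of the formulas themselves.

```python
def rgb_to_yuv(r, g, b):
    """Convert 8-bit R, G, B values to Y, U (Cb), V (Cr) using the
    formulas above. Inputs are assumed to lie in the range 0-255."""
    y = 0.257 * r + 0.504 * g + 0.098 * b + 16
    u = -0.148 * r - 0.291 * g + 0.439 * b + 128   # Cb: blue color difference
    v = 0.439 * r - 0.368 * g - 0.071 * b + 128    # Cr: red color difference
    clip = lambda x: max(0, min(255, int(round(x))))  # keep legal 8-bit values
    return clip(y), clip(u), clip(v)

# Example: a pure white pixel has maximum luminance and no color difference.
print(rgb_to_yuv(255, 255, 255))  # -> (235, 128, 128)
```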

2.2.2 CCIR 601 Standard

Analog video has to be converted into the digital domain before the video information can be transmitted. The CCIR 601 standard [13] (proposed by the former CCIR, now the International Telecommunications Union-Radio, or ITU-R) provides a standard method of encoding television information in digital form. The luminance and color difference components YUV are sampled with a precision of 8 bits. The sampling rate is chosen to give acceptable quality compared with the original analog television signal. For example, the luminance component Y of an NTSC frame is sampled to produce an image of 525 lines, each containing 858 samples. The active area of the digitized frame is 720 x 486 pixels. The color difference signals are sampled at a lower rate in CCIR 601: the vertical resolution is the same but the horizontal resolution is halved, i.e., only the odd-numbered luminance pixels in each line have associated color difference pixels. This sampling structure is described as a 4:2:2 component system. The bit rate of a CCIR 601 digital television signal is 216 Mbps. As an example, we show how this number is obtained for the NTSC system. NTSC has 30 frames per second, 858 x 525 luminance samples, 429 x 525 x 2 chrominance samples, and 8 bits per sample. Therefore, the bit rate = 30 x 8 x ((858 x 525) + (429 x 525 x 2)) = 216.216 Mbps.
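The arithmetic above can be checked with a few lines of Python:

```python
# NTSC CCIR 601 bit rate: 30 frames/s, 8 bits/sample,
# 858 x 525 luminance samples plus two 429 x 525 chrominance components.
frames_per_second = 30
bits_per_sample = 8
luma_samples = 858 * 525
chroma_samples = 429 * 525 * 2
bit_rate = frames_per_second * bits_per_sample * (luma_samples + chroma_samples)
print(bit_rate / 1e6, "Mbps")  # 216.216 Mbps
```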

2.2.3 The Need for Compression

A single digital television signal in CCIR 601 format requires a transmission rate of 216 Mbps. This bit rate is too high for most existing practical communication networks. For example, most local area networks (LANs) offer data transmission at rates on the order of 10 Mbps, and most wide area networks (WANs) support much lower data rates than this. The emerging ATM networks are capable of transmitting higher bit rates. However, distributing an uncompressed CCIR 601 bitstream over these networks is still prohibitively expensive. This means that the digital video information must be compressed (encoded) prior to transmission to accommodate the capabilities of different transmission media. At the receiver's end, the compressed bitstream is first decompressed (decoded) and then displayed. A number of video coding techniques and standards have been developed within the last few years that exploit the inherent redundancy [14] in still images and moving video sequences to provide significant data compression. Some compression can be achieved by exploiting the statistical redundancy within the data. For example, video data is often highly correlated both spatially and temporally. This redundancy can be removed by coding the data with entropy encoders (e.g., Huffman coding or arithmetic coding). Compression of this nature does not sacrifice any visual information carried by the original data and is hence a reversible process. We call such compression lossless compression. The degree of compression achievable by lossless compression is quite limited. To achieve higher compression, we need to remove subjectively redundant information, which is information that is not visually obvious to the viewer and can be removed without severely reducing the subjective quality of the decoded video signal. This type of compression destroys some of the original image information, which cannot later be recovered. We call such compression lossy compression. In the rest of this chapter we will elaborate on the two main categories of image coding: still image coding and motion video coding. Still image coding exploits the spatial redundancy within images (intra-frame). Motion video coding takes into account temporal as well as spatial redundancy (inter-frame).

2.3 Coding of Still Images

A typical photographic-quality still image contains a large amount of spatial redundancy. There are basically two kinds of spatial redundancy: statistical redundancy and subjective redundancy. Statistical redundancy refers to the fact that neighboring pixel values in an image are often highly correlated. This redundancy can be removed by using some form of entropy coding such as Huffman coding. Entropy coding is reversible in that the original image can be fully recovered by a proper entropy decoder. Subjective redundancy refers to the fact that the human visual system is not sensitive to certain components of the visual information. Therefore, such components can be removed without causing severe degradation in subjective quality. To do this, the image is first transformed and then those transform coefficients that correspond to the less important components are quantized. By combining these two forms of coding, the capacity required to store or transmit the image can be greatly reduced. The general procedure for compressing image information is shown in Figure 2.1. The encoder model models the image in some way to remove redundancy. It then produces symbols that represent the remaining information in the original image. These symbols are then entropy encoded to further reduce the bit rate. The decoder carries out the reverse procedure to recreate a copy of the original image.

Figure 2.1 Image Compression (encoder: image data -> encoder model -> entropy encoder -> coded data, which is transmitted or stored; decoder: entropy decoder -> decoder model -> image data).

2.3.1 Predictive Coding

A very popular form of predictive coding is called Differential Pulse Code Modulation (DPCM). The image is stored in raster scan order (i.e., from left to right and from top to bottom). Each pixel is represented as a number with a certain precision (e.g., 8 bits). For each pixel, a DPCM encoder predicts its value based on previously transmitted pixel values. It then quantizes the prediction error between the pixel's predicted value and its actual value. The quantized prediction errors are then transmitted. Because most images contain significant spatial redundancy, neighboring pixels tend to be highly correlated and the prediction error tends to be small. These quantized prediction errors can be further coded with variable-length coding (entropy encoding). Variable-length coding encodes the more common values with shorter codes and the less common values with longer codes. In a DPCM system, since small errors are more likely to occur, they are represented by shorter codes. The decoder uses the same prediction method as the encoder. It then adds the received prediction error to the predicted value to obtain the current pixel's value. The coding efficiency of this system depends to a large degree on the prediction accuracy. The more accurate the prediction is, the smaller the prediction errors will be, and hence fewer bits are required to represent them. The simplest way to predict the current pixel is to use its immediate neighboring pixel to the left, i.e., the pixel a in Figure 2.2. However, such a prediction is rather crude. A slightly better alternative is to use a weighted average of the neighboring pixels a, b, and c (see Figure 2.2) to predict the current pixel.

Figure 2.2 DPCM Prediction (the pixel a lies immediately to the left of the current pixel; b and c lie in the row above).

To make predictions even more accurate, more neighboring pixels should be taken into account. The choice of the predictor and the weights of the neighboring pixels have a direct bearing on the efficiency of the algorithm. For this reason, in adaptive prediction, the predictor is modified on the fly according to the statistics of the current image.
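The following Python sketch illustrates DPCM along a single raster line using the simple left-neighbor predictor described above; the initial prediction value and the uniform quantizer step size are arbitrary choices for illustration.

```python
def dpcm_encode_row(pixels, step=4):
    """DPCM-encode one raster line with the left-neighbor predictor.
    `step` is a hypothetical uniform quantizer step size."""
    recon_prev = 128                        # assumed initial prediction
    symbols = []
    for p in pixels:
        err = int(p) - recon_prev           # prediction error
        q = int(round(err / step))          # quantize the error
        symbols.append(q)
        recon_prev = recon_prev + q * step  # track the decoder's reconstruction
    return symbols

def dpcm_decode_row(symbols, step=4):
    recon_prev = 128
    out = []
    for q in symbols:
        recon_prev = recon_prev + q * step
        out.append(recon_prev)
    return out

row = [100, 102, 101, 105, 110, 111]
enc = dpcm_encode_row(row)
print(enc)                   # small prediction errors, cheap to entropy-code
print(dpcm_decode_row(enc))  # close to the original row
```

Note how the encoder tracks the decoder's reconstruction rather than the original pixel values, so that the encoder's and decoder's predictions stay in step despite the quantization.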

2.3.2 Discrete Cosine Transform Coding

Transform coding is extensively used in image coding. In transform-based image coding, pixels are first grouped into blocks. A block of pixels is then transformed into a set of transform coefficients. The actual coding then happens in the transform domain. An effective transform should compact the energy in the block of pixels into only a few of the corresponding coefficients. Compression is achieved by quantizing the coefficients so that only coefficients with large enough amplitudes (i.e., useful coefficients) are transmitted; other coefficients are discarded after quantization because they will have zero amplitude. The most effective energy compaction transform is the Karhunen-Loeve Transform (KLT) [15]. However, the KLT is very computationally intensive in practice and does not have fast algorithms. Hence, the KLT cannot be used in practical image coding systems. The Discrete Cosine Transform (DCT) is a popular alternative to the KLT for image coding [16]. For most continuous-tone photographic images, the DCT provides energy compaction that is close to the optimum. Furthermore, a number of fast algorithms exist for the DCT [17]. These factors have led to the prevalent use of the DCT in image and video compression systems. A typical DCT-based image coding system carries out the following steps:

1. Grouping the image into blocks: This step groups the image pixels into blocks of fixed size, e.g., blocks of 8 x 8 or 16 x 16 pixels. A larger block size leads to more efficient coding, but is also more computationally expensive. Partitioning images into fixed-size blocks poses a fundamental limitation on DCT-based coding systems. The use of uniformly sized blocks simplifies the coding system, but does not take into account the irregular shapes within real images. Better compression efficiency can be achieved by using a combination of blocks of different shapes, but at the expense of increased system complexity [18].

2. Discrete Cosine Transform: The DCT converts a block of image pixels into a block of transform coefficients of the same dimension. These DCT coefficients represent the original pixel values in the frequency domain. Any gray-scale 8 x 8 pixel block can be fully represented by a weighted sum of 64 DCT basis functions, where the weights are just the corresponding DCT coefficients. The two-dimensional DCT of an N x N pixel block is described in (2.1), where f(i, j) is the pixel value at position (i, j) and F(u, v) is the transform coefficient at position (u, v):

F(u, v) = \frac{2}{N} G(u) G(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i, j) \cos\left(\frac{(2i+1) u \pi}{2N}\right) \cos\left(\frac{(2j+1) v \pi}{2N}\right)    (2.1)

The corresponding inverse DCT is given by

f(i, j) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} G(u) G(v) F(u, v) \cos\left(\frac{(2i+1) u \pi}{2N}\right) \cos\left(\frac{(2j+1) v \pi}{2N}\right)    (2.2)

where

G(x) = \begin{cases} \frac{1}{\sqrt{2}}, & x = 0 \\ 1, & \text{otherwise} \end{cases}
Using formulae (2.1) and (2.2), the forward and inverse transforms require a large number of floating point computations. A simple calculation shows that a total of 64 x 64 = 4,096 computations are needed to transform a block of 8 x 8 pixels directly. Note that the forward and inverse DCT transforms are separable, meaning that the two-dimensional transform coefficients can be obtained by applying a one-dimensional transform first along the horizontal direction and then along the vertical direction separately. This reduces the number of computations required for each 8 x 8 block from 4,096 to 1,024. The computational complexity can be further reduced by replacing the direct evaluation of (2.1) and (2.2) with a fast DCT algorithm such as that described in [19], which reduces the operation to a short series of multiplications and additions.
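The separable structure noted above can be written directly in matrix form. The sketch below builds the DCT basis matrix implied by (2.1) and applies it to the rows and columns of a block; this is a direct, non-fast implementation intended only to illustrate the transform.

```python
import numpy as np

def dct_matrix(n=8):
    """Build the N x N DCT basis matrix implied by (2.1):
    C[u, i] = sqrt(2/N) * G(u) * cos((2i+1) * u * pi / (2N))."""
    c = np.zeros((n, n))
    for u in range(n):
        g = 1 / np.sqrt(2) if u == 0 else 1.0
        for i in range(n):
            c[u, i] = np.sqrt(2 / n) * g * np.cos((2 * i + 1) * u * np.pi / (2 * n))
    return c

def dct2(block):
    """Separable 2-D DCT: one 1-D transform along rows, one along columns."""
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

def idct2(coeffs):
    c = dct_matrix(coeffs.shape[0])
    return c.T @ coeffs @ c

block = np.full((8, 8), 100.0)            # a flat block: all energy in the dc term
coeffs = dct2(block)
print(round(coeffs[0, 0]))                # large dc coefficient, ac terms ~ 0
print(np.allclose(idct2(coeffs), block))  # inverse transform recovers the block
```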

3. Quantization: For a typical block in a photographic image, most of the high-frequency DCT coefficients will be near zero. On average, the dc coefficient and the other low-frequency coefficients have larger amplitudes. This is because, in an image of a smooth natural scene, most blocks tend to contain little high-frequency content; in general only a few of the DCT coefficients have significant values. The DCT coefficients are quantized so that the near-zero coefficients are set to zero and the remaining coefficients are represented with reduced precision. To quantize each coefficient, it is divided by the quantizer step size and the result is rounded to the nearest integer. Larger quantizer step sizes therefore mean coarser quantization. This results in information loss, but also in compression, since most of the coefficient values in each block are now zero. Coarser quantization (i.e., a larger quantizer step size) gives higher compression and poorer decoded image quality.
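A minimal sketch of this quantization step, using a single hypothetical step size for every coefficient (practical coders, such as JPEG with the table in Figure 2.4, vary the step per coefficient):

```python
import numpy as np

def quantize(coeffs, step=16):
    """Divide each DCT coefficient by the step size and round to the
    nearest integer; small coefficients collapse to zero."""
    return np.round(coeffs / step).astype(int)

def dequantize(levels, step=16):
    """Rescale at the decoder; the rounding error is the information lost."""
    return levels * step

coeffs = np.array([800.0, -45.0, 21.0, 7.0, -3.0, 1.0])
levels = quantize(coeffs)
print(levels)               # [50 -3  1  0  0  0] -- high-frequency terms vanish
print(dequantize(levels))   # [800 -48 16 0 0 0] -- an approximation of the input
```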

4. Entropy coding: After quantization, nonzero coefficients are further encoded using an entropy coder such as a Huffman coder. In Huffman coding (and in other entropy coding schemes), the more frequent values are represented with shorter codes and the less frequent values with longer codes. Zero coefficients can be efficiently encoded using run-length encoding. Instead of transmitting all the zero values one by one, run-length coding simply transmits the length of the current run of zeros. The result is a compressed representation of the original image. To decode the image, the reverse procedure is carried out. First, the variable-length codes (entropy codes) are decoded to recover the quantized coefficients. These are then multiplied by the appropriate quantizer step size to obtain an approximation to the original DCT coefficients. These coefficients are put through the inverse DCT to get back the pixel values in the spatial domain. The decoded pixel values will not be identical to the original image pixels, since a certain amount of information is lost during quantization. A lossy DCT CODEC produces characteristic distortions due to the quantization process. These include blocking artifacts, where the block structure used by the encoder becomes apparent in the decoded image, and mosquito noise, where lines and edges in the image are surrounded by fine lines. DCT-based image coding systems can provide compression ratios of between 10 and 20 while maintaining reasonably good image quality. The actual efficiency depends to some extent on the image content: images with lots of detail contain many nonzero high-frequency DCT coefficients and are therefore coded at higher rates than images with less detail. Compression can be improved by increasing the quantization step size. In general, higher compression is obtained at the expense of poorer decoded image quality.

2.4 Coding of Moving Images

The concept of differential prediction discussed in Section 2.3.1 can be extended to coding moving video sequences. Most video sequences are highly redundant in the temporal domain. Within one video sequence, successive frames are usually very similar. The differences between successive frames are usually due to movement in the scene. This movement can often be closely approximated by linear functions, for example due to a camera pan across the scene. By using differential prediction (DPCM) in the temporal rather than the spatial domain, only the difference frame needs to be transmitted for each frame. Because two successive frames in a sequence are usually very similar, most of the blocks within the difference frame contain no information and do not need to be transmitted. This substantially reduces the amount of information that needs to be transmitted. Further compression can be achieved by using motion prediction. Each block in the current frame is matched with a number of neighboring blocks in the previous frame. The offset between the two blocks is called a motion vector. The pixel-by-pixel error between the two blocks is encoded and transmitted along with the motion vector for the block. At the receiver, the original block can be reconstructed by decoding the error block and adding it to the offset block in the previous frame.
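A sketch of this block-matching step is shown below, using an exhaustive search over a small window and the sum of absolute differences (SAD) as the matching criterion; the block size, search range, and SAD metric are common choices rather than requirements of any particular standard.

```python
import numpy as np

def find_motion_vector(cur_frame, ref_frame, bx, by, block=16, search=7):
    """Exhaustively search ref_frame for the best match to the `block` x `block`
    region of cur_frame whose top-left corner is (bx, by).
    Returns the (dx, dy) offset minimizing the sum of absolute differences."""
    h, w = ref_frame.shape
    target = cur_frame[by:by + block, bx:bx + block].astype(int)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + block > w or y + block > h:
                continue  # candidate block falls outside the reference frame
            cand = ref_frame[y:y + block, x:x + block].astype(int)
            sad = np.abs(target - cand).sum()
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad

cur = np.random.randint(0, 256, (64, 64))
ref = np.roll(cur, shift=(2, 3), axis=(0, 1))  # simulate a simple translation
print(find_motion_vector(cur, ref, 16, 16))    # expect a vector near (3, 2)
```

The residual (the target block minus the best-matching reference block) is what gets transform coded, together with the returned motion vector.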

2.4.1 Generic DCT/DPCM CODEC

Figure 2.3 shows the block diagram of a generic CODEC based on motion-compensated DPCM and DCT-based encoding of the difference frames. This generic CODEC will serve as a useful reference for the discussion of specific video coding standards.
Figure 2.3 Generic DCT/DPCM CODEC (encoder: motion estimation and compensation, DCT, quantizer, and entropy encoder, with an inverse quantizer, IDCT, and frame store in the reconstruction loop; decoder: entropy decoder, inverse quantizer, IDCT, motion compensation, and frame store).

Encoder: The frame store contains a reconstructed copy of the previous transmitted frame; it is used as the reference for temporal prediction. The motion estimator calculates motion vectors for each block in the current frame. A motion-compensated version of the previous frame is subtracted from the current frame to create a difference or error frame. Each block of this difference frame is then DCT transformed. The DCT coefficients are then quantized, entropy coded, and transmitted together with the motion vectors (also entropy coded). At the same time, to ensure that the encoder and decoder use identical reference frames for motion compensation, the quantized DCT coefficients are scaled back by the quantization step size (see the Iquant block in Figure 2.3) and inverse transformed (IDCT) to create a local copy of the reconstructed current frame. This is used as the prediction reference for the next frame.

Decoder: The coded data is first entropy decoded. Then the DCT coefficients are scaled back by the quantization step size and put through the inverse DCT transform. This gives back the reconstructed difference frame. A motion-compensated reference frame is created using the previous decoded frame and the motion vectors for the current frame. The current frame is reconstructed by adding the difference frame to the reference frame. This frame is displayed and is also stored in the decoder frame store to be used in the decoding of the next frame.

2.4.2 Bi-directional Prediction

Motion estimation based on only the previous frame does not work in every situation. For example, a scene may contain movements where the moving objects reveal parts of the background that were previously hidden. These hidden parts cannot be successfully predicted from the previous frame. In such cases, prediction can be improved by using a combination of motion prediction from the previous frame and from a future frame. This is known as interpolated or bi-directional motion prediction. In bi-directional motion prediction, the encoder searches for a matching block in the previous and in the future frame. The appropriate motion vector is calculated with regard to the best match. This technique can give a significant improvement in compression efficiency over forward motion prediction, since the number of potentially successful matching blocks is increased. However, it also comes with several problems. Firstly, the complexity of the system (including both encoder and decoder) is increased. Secondly, a delay is introduced into the system. An interpolated frame cannot be encoded until a future frame has been read into the encoder; likewise, such a frame cannot be decoded until its future reference frame has been decoded. In some real-world applications, e.g., two-way videoconferencing, a delay of even a few frames can cause problems.
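A sketch of the selection between forward, backward, and interpolated prediction for one block, assuming the two candidate blocks have already been located by motion search in the previous and future reference frames:

```python
import numpy as np

def choose_bidirectional_prediction(target, fwd_block, bwd_block):
    """Pick the best prediction for `target` among the forward candidate,
    the backward candidate, and their average (interpolated prediction)."""
    candidates = {
        "forward": fwd_block.astype(float),
        "backward": bwd_block.astype(float),
        "interpolated": (fwd_block.astype(float) + bwd_block.astype(float)) / 2,
    }
    sads = {name: np.abs(target.astype(float) - blk).sum()
            for name, blk in candidates.items()}
    best = min(sads, key=sads.get)
    return best, target.astype(float) - candidates[best]  # mode and residual

tgt = np.array([[5, 5], [5, 5]])
fwd = np.array([[0, 0], [0, 0]])
bwd = np.array([[10, 10], [10, 10]])
print(choose_bidirectional_prediction(tgt, fwd, bwd)[0])  # -> 'interpolated'
```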

2.5 JPEG: Still Image Coding

ISO International Standard 10918 provides a standard format for coding and compressing still images [20]. This standard is commonly known as JPEG, after the Joint Photographic Experts Group that developed it. An overview of the standard is given in [21]. The standard defines four coding modes; an application can choose one of these modes depending on its particular requirements. The color components (see Section 2.2.1) are processed separately.

Sequential encoding: Each component is encoded in a single pass or scan. The scanning order is left to right, top to bottom. The encoding process is based on the DCT.

Progressive encoding: Each component is encoded in multiple scans. Each scan contains a portion of the encoded information of the image. Using only the first scan, a rough version of the reconstructed image can be quickly decoded and displayed. Using subsequent scans, this rough version can be built up to versions with fuller detail.

Hierarchical encoding: Each component is encoded at multiple resolutions. A lower-resolution version of the original image can be decoded and displayed without decompressing the full-resolution image.

Lossless encoding: Unlike the previous three modes, the lossless encoding mode is based on a DPCM system. As the name suggests, this mode provides compression without any loss of quality, i.e., a truly faithful version of the original image is recovered at the decoder's side. However, the compression efficiency is considerably reduced.

The standard defines a baseline CODEC that implements a minimum set of features (a subset of the sequential encoding mode). The baseline CODEC provides sufficient features for most general-purpose image coding applications and has been widely adopted as the most common implementation of JPEG. The baseline CODEC uses the sequential encoding mode. Each color component of the input image has a precision of 8 bits per pixel. Each image component is encoded as follows.

DCT: Each 8 x 8 block of pixels is transformed using the DCT. The result is an 8 x 8 block of DCT coefficients.

Quantization: Each DCT coefficient is quantized to reduce its precision. The quantizer step size can be (and generally is) different for different coefficients in the block. Figure 2.4 shows an example of a quantization step size table used in JPEG. Prior to quantization, the DCT coefficients lie in the range of -1,023 to +1,023. Each coefficient is divided by the corresponding step size and is then rounded to the nearest integer. Notice that the step size increases as the spatial frequency increases (to the right and down). This is because the HVS is less sensitive to quality loss in the higher frequency components.

Zigzag ordering: The lower spatial frequency coefficients are more likely to be nonzero than the higher frequency coefficients. For this reason, the quantized coefficients are reordered in a zigzag scanning order, starting with the dc coefficient and ending with the highest frequency ac coefficient. Figure 2.5 shows the zigzag ordering of the 64 coefficients in each block. Reordering in this way tends to create long runs of zero-valued coefficients.

Figure 2.4 JPEG Quantization Table.


Figure 2.5 The Zigzag Scan Pattern in JPEG.

Encoding: The dc coefficients are coded separately from the ac coefficients. The dc coefficients from adjacent blocks usually have similar values; therefore, each dc coefficient is encoded differentially from the dc coefficient in the previous block. In a given block, most of the ac coefficients are likely to be zero, so the 63 ac coefficients are converted into a set of (run-length, value) symbol pairs. The run-length stands for the number of zero coefficients preceding the current nonzero coefficient; the value is the nonzero coefficient itself. The set of symbols is then encoded using Huffman coding. The end result is a series of variable-length codes that describe the quantized coefficients in a compressed form. The baseline decoder carries out the reverse procedure to reconstruct the decoded image. The quality of the decoded image depends on the quantization step size used at the encoder side. The compression efficiency depends mainly on two factors: the quantization step size and the original image content.
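The zigzag reordering and (run-length, value) symbol formation can be sketched as follows. The scan order is generated by walking the anti-diagonals rather than copied from the standard's table, and the final Huffman coding of the symbols is omitted.

```python
import numpy as np

def zigzag_indices(n=8):
    """Generate the zigzag scan order for an n x n block: positions are sorted
    by anti-diagonal, alternating traversal direction on each diagonal."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def run_length_symbols(block):
    """Scan a quantized block in zigzag order and emit (run, value) pairs for
    the ac coefficients, where `run` counts preceding zeros; the dc value is
    returned separately, since JPEG codes it differentially from the previous
    block."""
    scan = [block[i, j] for i, j in zigzag_indices(block.shape[0])]
    dc, ac = int(scan[0]), scan[1:]
    symbols, run = [], 0
    for v in ac:
        if v == 0:
            run += 1
        else:
            symbols.append((run, int(v)))
            run = 0
    symbols.append("EOB")  # end-of-block marker for the trailing zero run
    return dc, symbols

block = np.zeros((8, 8), dtype=int)
block[0, 0], block[0, 1], block[1, 0], block[2, 2] = 50, -3, 2, 1
print(run_length_symbols(block))
# -> (50, [(0, -3), (0, 2), (9, 1), 'EOB'])
```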

2.6 H.261: Motion Video Coding for Videoconferencing

ITU-T Recommendation H.261, as a part of the H.320 group of standards, supports video coding for videoconferencing applications. Recommendation H.320 is an umbrella standard that describes the various components of a videoconferencing system [22], including the following:

H.261: video CODEC
G.711, G.722, or G.728: audio CODEC
H.230: audiovisual control and synchronization
H.221: frame structure for the ISDN channel

Among these, H.261 describes the video coding component of an H.320 system [23]. The H.261 algorithm was developed with the aim of supporting videoconferencing and videotelephony over ISDNs that provide data rates at multiples of 64 Kbps. An overview of H.261 can be found in [24]. In H.261, only two image resolutions are supported: common intermediate format (CIF) and quarter CIF (QCIF). Video information is represented by Y, Cr, and Cb components. The resolution of the luminance (Y) component is 352 x 288 pixels for CIF and 176 x 144 pixels for QCIF. In both resolutions, the chrominance components (Cr and Cb) have exactly half the horizontal and vertical resolution of the Y component; this is called 4:2:0 sampling. The maximum picture rate is 29.97 frames per second. When required, this frame rate can be reduced by dropping up to three frames between each transmitted frame. Many H.261 applications involve video communications over a 64 or 128 Kbps ISDN connection. At these bit rates, substantial data compression is required. Therefore, QCIF resolution is often adopted and the source frame rate is restricted to around 10 frames per second. The H.261 CODEC design is similar to the generic CODEC shown in Figure 2.3. It is also based on DCT transform coding and motion prediction. The image data is processed in macroblocks. Each macroblock consists of four 8 x 8 blocks (i.e., one 16 x 16 block) of Y samples, one 8 x 8 block of Cr samples, and one 8 x 8 block of Cb samples. There are two coding modes: intracoding, which uses no motion prediction, and intercoding, which uses motion prediction. Each macroblock is coded using one of the two modes. In the first frame in an H.261 sequence, all macroblocks are intracoded, as there are no previous frames to predict from. Intracoded macroblocks are coded similarly to baseline JPEG encoding.

Each 8 x 8 block of Y, Cr, and Cb samples is transformed using the DCT. The coefficients are quantized and then variable-length encoded. In order to form predictions for future frames, the quantized coefficients are rescaled and inverse-transformed to provide a local copy of the frame, which is identical to the receiver's decoded version. This copy is stored in a frame store. Macroblocks in subsequent encoded frames are usually intercoded. Each macroblock is motion-predicted from a nearby macroblock in the previous frame. A 16 x 16 block of Y samples is compared with adjacent 16 x 16 blocks in the previous frame, and the closest matching block is chosen as the reference for motion prediction. The position offset between the reference and current macroblocks is encoded as a motion vector for the current macroblock. If the prediction error (i.e., the difference macroblock) is less than a certain threshold, no further information is encoded. Otherwise, the prediction error is encoded using the DCT, quantization, and variable-length encoding steps just as before. If the prediction error is above a second, larger threshold, then the motion compensation process is not considered useful for compression; hence, the macroblock will be intracoded. Again, the predicted frame is reconstructed (by carrying out the decoding procedure at the encoder) and stored in the frame store for future use. The bitstream produced by an H.261 encoder has a hierarchical structure, shown in Figure 2.6. Blocks of variable-length coded coefficients are collected together to form macroblocks. Macroblocks are then further collected to form a group of blocks (GOB). A complete picture is made up of several GOBs.
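The macroblock mode decision described above might be sketched as follows; the two thresholds are purely illustrative values, since H.261 leaves the decision strategy to the encoder designer.

```python
import numpy as np

SKIP_THRESHOLD = 256     # hypothetical: prediction is good enough, send nothing
INTRA_THRESHOLD = 65536  # hypothetical: prediction is useless, intra-code instead

def choose_macroblock_mode(current_mb, predicted_mb):
    """Decide how to code a 16x16 luminance macroblock given its
    motion-compensated prediction, following the thresholding idea above."""
    sad = np.abs(current_mb.astype(int) - predicted_mb.astype(int)).sum()
    if sad < SKIP_THRESHOLD:
        return "skip"    # no further information coded for this macroblock
    if sad > INTRA_THRESHOLD:
        return "intra"   # motion compensation not useful for compression
    return "inter"       # DCT-code the prediction error plus the motion vector
```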


Figure 2.6 Structure of H.261 Bitstream.

H.261 is typically used to send coded data over a constant bit rate (CBR) channel, such as an ISDN channel. The encoder output bit rate, if left unconstrained, varies depending on the current activity in the scene. A scene containing a lot of motion will generate much more encoded data than a scene with little motion. To map this varying bit rate onto the CBR channel, a rate control mechanism is required. Rate control is the main topic of this thesis; we will discuss it in detail in Chapters 4 and 5. H.261 provides a relatively straightforward system for encoding and decoding video information for two-way video communications. Its main drawback is the poor decoded video quality, particularly when bit rates are low.

2.7 H.263: Low Bit Rate Video Coding

Some applications require video communications with adequate quality at around 30 Kbps. For example, homes and small businesses are usually connected to the Internet through the public switched telephone network (PSTN) with V.34 modems, which typically operate at 28.8 Kbps. This area is also addressed by the ITU-T. The low bit rate coding group within ITU-T study group 15 took two approaches. The near term approach was to develop a new video coding standard based on existing techniques; the longer term approach aims to make use of fundamentally new coding techniques. The near term standard, H.263, is largely based on H.261, with a number of improvements that can provide higher quality video at low bit rates [25]. These improvements include the following:

Motion compensation with half-pixel prediction (as opposed to full-pixel prediction in H.261). Half-pixel values for motion compensation are obtained by interpolation between neighboring pixel values. This can give improved motion estimation and hence a lower prediction error. (A sketch of half-pixel interpolation is given at the end of this section.)

Unrestricted motion vectors (optional). The motion-compensated macroblock can lie partly outside of the picture. The pixel values outside the boundary are constructed by extrapolating from the edge pixels. This can reduce the prediction error if movement occurs across the frame edges.

Arithmetic coding rather than Huffman coding (optional).

Advanced prediction mode (optional). Four motion vectors, one for each 8 x 8 luminance block within a 16 x 16 macroblock, are used instead of one motion vector for the macroblock. The four vectors can reduce the prediction error but also require more bits to encode. The prediction error for using four vectors is compared with that for using a single vector; if the error is significantly smaller, then four motion vectors are used.

PB frames mode (optional). Two frames are coded as one unit: the next P frame (forward predicted from the previous P frame) together with a B frame bi-directionally predicted between the previous and next P frames. This mode is based on the B pictures used in the MPEG standards (see Section 2.8.1).

The use of some or all of these features can lead to a significant improvement in decoded video quality over H.261, particularly at low bit rates. They require a more complex CODEC; however, video coding and processing hardware is continually improving, so higher complexity can be justified if it leads to improved video quality. The long-term initiative of the ITU-T will address low bit rate video communications in a wider sense. It is intended to make use of new techniques that provide considerable performance improvements and to support video communications in other low bit rate environments such as mobile communication networks.
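As promised above, a sketch of half-pixel interpolation for motion compensation. The bilinear averaging is the essential idea; the exact rounding rules of the recommendation are simplified here.

```python
import numpy as np

def half_pel_sample(frame, x2, y2):
    """Return the sample at half-pixel position (x2/2, y2/2), where x2 and y2
    are coordinates in half-pixel units. Integer positions are returned as-is;
    half positions are bilinear averages of neighboring integer samples."""
    x0, y0 = x2 // 2, y2 // 2
    xf, yf = x2 % 2, y2 % 2
    if not xf and not yf:                       # integer position
        return int(frame[y0, x0])
    if xf and not yf:                           # horizontal half position
        return (int(frame[y0, x0]) + int(frame[y0, x0 + 1]) + 1) // 2
    if yf and not xf:                           # vertical half position
        return (int(frame[y0, x0]) + int(frame[y0 + 1, x0]) + 1) // 2
    return (int(frame[y0, x0]) + int(frame[y0, x0 + 1]) +      # diagonal half
            int(frame[y0 + 1, x0]) + int(frame[y0 + 1, x0 + 1]) + 2) // 4

ref = np.array([[10, 20], [30, 40]])
print(half_pel_sample(ref, 1, 0))   # halfway between 10 and 20 -> 15
print(half_pel_sample(ref, 1, 1))   # centre of the four samples -> 25
```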

2.8 MPEG: Motion Video Coding for Entertainment and Broadcast

H.261 and H.263 are optimized for low line speeds, down to a single 64 Kbps channel, whereas the MPEG standards are designed to operate at around 0.9 to 1.5 Mbps and above. The Moving Picture Experts Group (MPEG) is part of the International Standards Organization working group ISO-IEC/JTC1/SC2/WG11. The MPEG committee started its activities in 1988. Since then, the efforts of this working group have led to two international standards, informally known as MPEG1 and MPEG2. Another standard, MPEG4, is currently under development. These standards address the encoding and decoding of video and audio information for a range of applications.

2.8.1 MPEG1 for CD Storage

The first of the MPEG standards to be released was ISO 11172 (1993), popularly known as MPEG1 [26]. An overview of this standard can be found in [27]. MPEG1 supports coding of video and associated audio at a bit rate of about 1.5 Mbps. The video coding techniques are optimized for coded bit rates of between 1.1 and 1.5 Mbps, but can also be applied at a range of other bit rates. The bit rate of the coded video stream together with its associated audio information matches the data transfer rate of about 1.4 Mbps in CD-ROM systems. Though MPEG1 is optimized for CD-based video/audio entertainment, it can also be applied to other storage or communications systems. The requirements of common MPEG1 applications are different from those of two-way video conferencing. This leads to a number of key differences between MPEG1 and H.261. Rather than specifying a particular design of video encoder, MPEG1 specifies the syntax of compliant bitstreams. While a model decoder is described in the standard, many of the encoder design issues are left to the developer. However, in order to comply with the standard, MPEG1 developers must follow the correct syntax so that the coded bitstreams are decodable by the model decoder. Frames of video are coded as pictures.

The resolution of the source video information is not restricted. However, a resolution of 352 x 240 pixels in the luminance component, at about 30 frames per second, gives an encoded bit rate of around 1.2 Mbps. MPEG1 is optimized for this approximate bit rate; it may not encode video sequences at higher resolutions as efficiently. A resolution of 352 x 240 pixels provides image quality similar to VHS video. MPEG1 allows a significant amount of flexibility in choosing encoding parameters. It also defines a specific set of parameters (the constrained parameters) that provide sufficient functionality for many applications. There are three types of coded pictures in MPEG1: I pictures, P pictures, and B pictures. I pictures (intra-pictures) are intraframe encoded without any temporal prediction. For I pictures MPEG1 uses a coding technique similar to that used in H.261. Blocks of pixel values are transformed using the DCT, quantized, reordered (in a zigzag scan as described in Section 2.5), and variable-length encoded. Coded blocks are grouped together in macroblocks, where each macroblock consists of four 8 x 8 Y blocks, one 8 x 8 Cr block, and one 8 x 8 Cb block. The Cr and Cb components have half the horizontal and vertical resolution of the Y component, i.e., 4:2:0 sampling is used. P pictures (forward predicted pictures) are interframe encoded using motion prediction from the previous I or P picture in the sequence, as in the H.261 standard. For example, the Y component of each macroblock is matched with a number of neighboring 16 x 16 blocks in the previous I or P picture (the reference picture). The prediction error, together with the motion vector, is encoded and transmitted. Macroblocks in P pictures may optionally be intracoded if the prediction error is too high, signaling ineffective motion prediction. B pictures (bi-directionally predicted pictures) are interframe encoded using interpolated motion prediction between the previous I or P picture and the next I or P picture in the sequence. Each macroblock is compared with the neighboring area in the previous and the next I or P picture. The best matching macroblock is chosen from the previous picture, from the next picture, or by calculating the average of the two. The forward and/or reverse motion vectors of the best matching macroblock, along with the corresponding prediction error, are encoded and transmitted. Again, intracoding may be used if motion prediction is not effective. B pictures are not used as a reference for further predicted pictures.

The three picture classes are grouped together into groups of pictures (GOPs). Each GOP consists of one I picture followed by a number of P and B pictures. Figure 2.7 gives an example of a GOP. In the example, each I or P picture is followed by two B pictures. The structure and size of each GOP is not specified in the standard; it can be chosen by a particular application to suit its needs. I pictures have the lowest compression efficiency because they do not use motion prediction; P pictures have a higher compression efficiency; and B pictures have the best compression efficiency of the three due to the efficiency of bi-directional motion estimation. In general, a larger GOP size gives more efficient compression since fewer pictures are coded as I pictures. However, I pictures provide useful access points into the sequence, which is particularly important for applications that require random access. The encoded MPEG1 bitstream is arranged into the hierarchical structure shown in Figure 2.8. The highest level is the video sequence level. The video sequence header describes basic parameters such as spatial and temporal resolution. Coded frames are grouped together into GOPs at the group of pictures level. Within each GOP there are a number of pictures. The picture header describes the type of the coded picture as I, P, or B. It also carries a temporal reference number that gives the picture's relative position within the GOP. The next level is the slice level. Each slice consists of a continuous series of macroblocks. A complete coded picture is made up of one or more slices. The slice header is not variable-length coded; therefore the decoder can use the slice header to resynchronize if an error has occurred. The remaining levels in the hierarchy are the macroblock level and the block level.


Figure 2.7 MPEG GOP Structure.

Figure 2.8 Structure of MPEG Bitstream.

In most MPEG1 applications, a video sequence is encoded once and decoded many times. For example, an MPEG video is prerecorded on a CD-ROM and then viewed by the end user many times. Therefore, we often need the decoding process to be as simple as possible, while we may be willing to increase the encoder's complexity. The use of B pictures leads to a delay in the encoding and decoding processes, since both the previous and the next I/P pictures must be stored in order for a B picture to be encoded or decoded. For the above reason, the delay at the decoder is the greater concern. To minimize this delay, the encoder reorders the coded pictures prior to transmission or storage, so that the I and/or P pictures required to decode each B picture are placed before that B picture. For example, the original order of the pictures in Figure 2.7 is as follows: I1 B2 B3 P4 B5 B6 P7 B8 B9 I10. It is reordered to: I1 P4 B2 B3 P7 B5 B6 I10 B8 B9. The decoder receives and decodes I1 and P4. It can display I1 since this is the first frame in display order. B2 is then received and immediately decoded and displayed. B3 is decoded and displayed next, P4 can then be displayed, and so on. The decoder only needs to store at most two decoded reference frames at a time, and it can display each frame after at most a one-frame delay. The decoder processing requirements are relatively simple, as it does not need to perform motion estimation for each macroblock. A real-time decoder can be implemented in hardware at a low cost. The encoder, however, has a much more demanding task. It must perform the expensive block matching and motion estimation processes; it also must buffer more frames during encoding. MPEG1 encoding is often carried out off-line, i.e., not in real time. Real-time MPEG1 encoders are usually very expensive.
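The reordering described above can be sketched as a simple pass over the display-order sequence; it assumes the straightforward GOP pattern of Figure 2.7, in which every B picture depends only on the nearest surrounding I/P pictures.

```python
def reorder_for_transmission(display_order):
    """Reorder pictures so each B picture follows the two reference (I or P)
    pictures it depends on. Input is a list like ['I1', 'B2', 'B3', 'P4', ...]."""
    coded, pending_b = [], []
    for pic in display_order:
        if pic[0] in ("I", "P"):
            coded.append(pic)        # send the reference first...
            coded.extend(pending_b)  # ...then the B pictures that needed it
            pending_b = []
        else:                        # B picture: wait for its future reference
            pending_b.append(pic)
    coded.extend(pending_b)          # any trailing B pictures (open-ended GOP)
    return coded

gop = ["I1", "B2", "B3", "P4", "B5", "B6", "P7", "B8", "B9", "I10"]
print(reorder_for_transmission(gop))
# -> ['I1', 'P4', 'B2', 'B3', 'P7', 'B5', 'B6', 'I10', 'B8', 'B9']
```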

2.8.2 MPEG2 for Broadcasting

It was quickly established that MPEG1 would not be an ideal video and audio coding standard for television broadcasting applications. Television quality video information (e.g., CCIR 601 video) produces a higher encoded bit rate and requires different coding techniques than those provided by MPEG1. In addition, MPEG1 cannot efficiently encode interlaced fields. For these reasons, MPEG2 was developed. The MPEG2 standard comes in three main parts: systems [28], video [29], and audio [30]. MPEG2 extends the functions provided by MPEG1 to enable efficient encoding of video and associated audio at a wide range of resolutions and bit rates. Table 2.1 shows some of the bit rates and applications.

Table 2.1 MPEG2 Bit Rates and Applications

Application                                Approx. Resolution      Approx. Coded Bit Rate
Home entertainment video (MPEG1 quality)   352 x 240, 30 Hz        1.5 Mbps
Digital TV                                 720 x 486, 30 Hz        5-10 Mbps
Extended definition TV                     1,920 x 1,080, 30 Hz    30-40 Mbps

Part 1 of the MPEG2 standard (systems) specifies two types of multiplexed bitstreams: the program stream and the transport stream. The program stream is analogous to the systems part of MPEG1. It is designed for flexible processing of the multiplexed stream and for environments with low error probabilities. The transport stream is constructed in a different way and includes a number of features that are designed to support video communications or storage in environments with significantly higher error probabilities. MPEG2 video [29] is based on the MPEG1 video encoding techniques but includes a number of enhancements and extensions. For example, the constrained parameters subset of MPEG1 is also incorporated into MPEG2, and MPEG2 has a hierarchical bitstream structure similar to that of MPEG1 (see Figure 2.8). MPEG2 also adds a number of extra features, including the following:

Features to support interlaced video as well as progressive (non-interlaced) video. These include the ability to encode video directly in interlaced form as well as a number of new motion prediction modes. In addition to motion prediction from a macroblock in a previous or future frame, MPEG2 supports prediction from a previous or future field. This increases the prediction efficiency.

More chrominance sampling modes. 4:2:2 sampling produces chrominance components with the same vertical resolution as the luminance component and half of its horizontal resolution. This provides better color resolution than 4:2:0 sampling. An even higher resolution is supported by 4:4:4 sampling, where the chrominance components are encoded at the same resolution as the luminance component.

An alternative block scanning pattern to the zigzag scanning order. The alternative pattern can improve the coding performance for interlaced video.

Scalable coding modes. These modes enable video information to be coded into two or more layers. There are four scalable coding modes in MPEG2:

Spatial scalability is analogous to the hierarchical coding mode in JPEG, where each frame is encoded at a range of resolutions that can be built up to the full resolution. For example, a standard decoder could decode a CCIR 601 resolution picture from the stream, while a high definition television (HDTV) decoder could decode the full HDTV resolution from the same transmitted bitstream.

Data partitioning enables the coded data to be separated into two streams. A high-priority stream contains essential information such as headers, motion vectors, and perhaps low-frequency DCT coefficients. A low-priority stream contains the remaining information. The encoder can choose which components of the syntax to place in each stream.

Signal to noise ratio (SNR) scalability is similar to the successive approximation mode of JPEG: the picture is encoded in two layers, the lower of which contains the information needed to decode a coarse version of the video and the higher of which contains enhancement information needed to decode the video at its full quality.

Temporal scalability is a hierarchical coding mode in the temporal domain. The base layer is encoded at a lower frame rate. To give the full frame rate, intermediate frames are interpolated between successive base layer frames. The difference between the interpolated frames and the actual intermediate frames is encoded as a second layer.

MPEG2 describes a range of profiles and levels that provide encoding parameters for a range of applications. A profile is a subset of the full MPEG2 syntax that specifies a particular set of coding features. Each profile is a superset of the preceding profiles. Within each profile, one or more levels specify a subset of spatial and temporal resolutions that can be handled. The profiles defined in the standard are shown in Table 2.2. Each level puts an upper limit on the spatial and temporal resolution of the sequence, as shown in Table 2.3. Only a limited number of profile/level combinations are recommended in the standard, as summarized in Table 2.4.

Table 2.2 MPEG2 Profiles

Profile    Features
Simple     4:2:0 sampling, I/P pictures only, no scalable coding
Main       As above, plus B pictures
SNR        As above, plus SNR scalability
Spatial    As above, plus spatial scalability
High       As above, plus 4:2:2 sampling

Table 2.3 MPEG2 Levels

Level       Maximum Resolution
Low         352 x 288 luminance samples, 30 Hz
Main        720 x 576 luminance samples, 30 Hz
High-1440   1,440 x 1,152 luminance samples, 60 Hz
High        1,920 x 1,152 luminance samples, 60 Hz

Table 2.4 Recommended Profile/Level Combinations

Level       Simple   Main   SNR   Spatial   High
Low                  X      X
Main        X        X      X               X
High-1440            X            X         X
High                 X                      X

Particular profile/level combinations are designed to support particular categories of applications. Simple profile/main level is suitable for conferencing applications, as no B pictures are transmitted, leading to low encoding and decoding delay. Main profile/main level is suitable for most digital television applications; the majority of currently available MPEG2 encoders and decoders support main profile/main level coding. The two high levels are designed to support HDTV applications; they can be used with either non-scalable coding (main profile) or spatially scalable coding (spatial/high profiles). Note that the profiles and levels are only recommendations and that other combinations of coding parameters are possible within the MPEG2 standard.

2.8.3 MPEG4 for Integrated Visual Communications

The MPEG1 and MPEG2 standards are successful because they enable digital audiovisual (AV) information communications with high performance in both quality and compression efficiency. After MPEG1 and MPEG2, the MPEG committee has developed a new video coding standard called MPEG4. ISO/IEC MPEG4 standardization started in July 1993. MPEG4 version 1 became an International Standard in February 1999, with version 2 targeted for November 1999. Starting from the original goal of providing an audiovisual coding standard for very low bit rate channels, such as those found in mobile applications or today's Internet, MPEG4 has over time developed into a standard that goes much further than achieving more compression and lower bit rates. The following set of functionalities is defined in MPEG4 [34]:

Content-based multimedia data access tools
Content-based manipulation and bitstream editing
Hybrid coding of natural and synthetic data, for both video and audio
Improved temporal random access
Improved coding efficiency
Coding of multiple concurrent data streams
Robustness in error-prone environments
High content scalability

MPEG4's content-based approach allows the flexible decoding, representation, and manipulation of video objects in a scene. In MPEG4 the bitstreams are object layered. The receiver can reconstruct the original sequence in its entirety by decoding all object layers and displaying the objects at their original size and location. Alternatively, it can manipulate the video scene by simple operations on the bitstreams. For example, some objects can be ignored in reconstruction, while new objects that did not belong to the original scene can be included. Such capabilities are provided for natural objects, synthetic objects, and hybrids of both. MPEG4 video aims at providing standardized core technologies allowing efficient storage, transmission, and manipulation of video data in multimedia environments. The focus of MPEG4 video is the development of Video Verification Models (VMs) that evolve over time by means of core experiments. The VM is a common platform with a precise definition of encoding and decoding algorithms that can be presented as tools addressing specific functionalities. New algorithms/tools are added to the VM and old algorithms/tools are replaced in the VM by successful core experiments [35]. So far, the MPEG4 video group has focused its efforts on a single VM that has gradually evolved from version 1.0 to version 12.0, and in the process has addressed an increasing number of desired functionalities: content-based object and temporal scalabilities, spatial scalability, error resilience, and compression efficiency. The core experiments in the video group cover the following major classes of tools and algorithms:

Compression efficiency. For most applications involving digital video, such as videoconferencing, Internet video games, or digital TV, coding efficiency is essential. MPEG4 has evaluated over a dozen methods intended to improve the coding efficiency of the preceding standards.

Error resilience. The ongoing work in error resilience addresses the problem of accessing video information over a wide range of storage and transmission media. In particular, due to the rapid growth of mobile communications, it is extremely important that access to video and audio information is available via wireless networks. This implies a need for useful operation of video and audio compression algorithms in error-prone environments at low bit rates (that is, less than 64 Kbps).

Shape and alpha map coding. Alpha maps describe the shapes of 2D objects. Multilevel alpha maps are frequently used to blend different layers of image sequences for the final film. Other applications that benefit from associating binary alpha maps with images are content-based image representation for image databases, interactive games, surveillance, and animation.

Arbitrarily shaped region texture coding. Texture coding for arbitrarily shaped regions is required for achieving an efficient texture representation for arbitrarily shaped objects. Hence, these algorithms are used for objects whose shape is described with an alpha map.

Multifunctional tools and algorithms. Multifunctional coding provides tools to support a number of content-based objects as well as other functionalities. For instance, for Internet and database applications, spatial and temporal scalabilities are essential for channel bandwidth scaling and robust delivery. Multifunctional coding also addresses multiview and stereoscopic applications as well as representations that enable simultaneous coding and tracking of objects for surveillance and other applications. Besides the mentioned applications, a number of tools are developed for segmentation of a video scene into objects.


Figure 2.9 MPEG4 General Overview.

Figure 2.9 gives a very general overview of the MPEG4 system. The objects that make up a scene are sent or stored together with information about their spatio-temporal relationships (that is, composition information). The compositor uses this information to reconstruct the complete scene. Composition information is used to synchronize different objects in time and to give them the right position in space. Coding different objects separately makes it possible to change the speed of a moving object in the same scene or make it rotate, and to influence which objects are sent, at what quality, and with what error protection. In addition, it permits composing a scene with objects that arrive from different locations. The MPEG4 standard does not prescribe how a given scene is to be organized into objects (that is, segmented). Segmentation is usually assumed to take place before encoding, as a preprocessing step, and is not a standardization issue. By not specifying preprocessing and encoding, the standard leaves room for systems manufacturers to distinguish themselves from their competitors by providing better quality or more options. It also allows the use of different encoding strategies for different applications and leaves room for technological progress in analysis and encoding strategies.
