Abstract
In this Master Thesis, current video compression techniques are examined and an implementation of a video compression algorithm based on the H.263 standard is presented. All video compression standards that are available today are lossy: the algorithm reduces the size of the video stream significantly at the cost of some quality loss. Motion-JPEG is currently the most widely used, but to make the video stream even smaller, Motion Estimation algorithms are applied, as found in the H.261, MPEG and H.263 standards. These motion estimation algorithms must be applied correctly to make them efficient with regard to compression ratio and computing speed. For Desktop Video Conferencing over low-bandwidth lines, the new H.263 standard is most appropriate due to its high compression ratio. To make an implementation of this standard applicable for Real-Time purposes, the Motion Estimation, DCT, and Quantization algorithms are examined carefully, and fast implementations that do not sacrifice much quality are integrated in the encoder. Other advanced techniques that reduce the size of the stream that must be compressed are also examined, and implementations based on skin detection, background subtraction, and change detection are tested and evaluated. The result is a video compression library that compresses QCIF-sized (176×144) video streams at near Real-Time speed on fast desktop platforms. An application that sets the correct video compression parameters based on system load is also implemented.
Preface
Welcome to video compression! In the next couple of chapters, I will describe what video compression is, why it is required, and how it can be done effectively. This thesis is the last part of my study Computer Science at Twente University, and I hope my work will be the first part of further development in the area of Real-Time video compression research. I have looked at video compression for Desktop Video Conferencing (DVC) in particular. Large-size video streams of high quality cannot be compressed enough to transmit them through, for instance, a telephone line. As desktop PCs and small workstations also lack the computing power for compression of these large video streams, DVC is the best choice for these systems. The emphasis of my work has been put on software video compression as opposed to hardware compression, for a number of reasons. In the first place, it is much more flexible than a hardware implementation. To change an algorithm, the code only has to be changed and recompiled. For a hardware implementation, a whole new design must be made, and implementation of a chip is also quite expensive. Another reason for using software compression is that it can be used to test algorithms effectively before the actual (intended) implementation in hardware is done. The final reason is the compatibility of software compression compared to hardware compression. Software can be run on a variety of systems, while a hardware compression device often requires different drivers or is only available for one or a few different systems. The result is a software library that can be compiled on many systems, independent of their internal architectures. This is also the reason I avoided using a mixture of a software and a hardware solution, as it would make the software implementation dependent on (at some point in the future) outdated hardware.
I am personally not against a mixture of hardware and software, but only if the hardware provides a clearly delimited functional part of the encoding that replaces some time-consuming software routines. An object-oriented approach, where a software object is replaced by a hardware object, would in this case be a solution. An example of a software/hardware combination that I did use was the ability of the AVA to transmit frames in the YUV color space, instead of the more common RGB color space.1
1 As will be explained later, video compression is applied on YUV frames only. If the encoder receives RGB frames, these must first be converted to YUV.
Due to the nature of video streams, large amounts of data must be processed by the processor for any video algorithm, including compression. In the first period of my graduation work, I was a bit sceptical about Real-Time software compression, and thought that the implementation would not be faster than 5-10 fps. However, developments in the performance of PCs (Pentiums that are as fast as DEC Alpha workstations), the right motion analysis, and the use of fast implementations made it clear that Real-Time video conferencing becomes feasible very soon now. Another surprise is the size to which a video stream can be compressed. It might turn out that audio, and not video, will be the bottleneck of video conferencing.2
2 Although I have not done much research in the area of audio compression
Contents
Preface

1 Introduction
1.1 A brief history of computing power
1.2 The Pegasus project
1.3 Goals
1.4 Problems and Constraints
1.5 Areas of research

5.1 Introduction
5.2 Basics of Motion Estimation
5.3 Block-comparison alternatives
5.3.1 SAD
5.3.2 Summation of columns/rows (SOCR)
5.4 Estimation Alternatives
5.4.1 Exhaustive search with search window
5.4.2 Logarithmic search
5.4.3 Half-pixel prediction
5.4.4 Other techniques
5.5 Motion in video sequences
5.5.1 Motion vectors
5.5.2 Frame-rate and motion vectors
5.5.3 Size of the search window
5.5.4 Real-Time and frame rate
5.5.5 Results on SOCR block comparison
5.6 Quality of a Real-Time video stream
5.7 Summary

6.1 Introduction
6.2 Implementations of DCTs
6.3 Comparison of DCT algorithms
6.4 Quantization
6.4.1 Scaled DCT
6.5 Summary

7.1 Introduction
7.2 Skin selection
7.2.1 Implementation of skin selection under H.263
7.2.2 Very fast skin selection implementation
7.3 Background subtraction
7.3.1 Background replacement (Chroma-key with virtual blue room)
7.4 Change detection
7.5 Summary

10 Conclusion
11 Acknowledgements
List of Figures
2.1 The 4:2:0-format
2.2 The 4:2:2-format
2.3 Quantization and de-quantization
2.4 Example of a Discrete Wavelet Transform
2.5 Overview of the JPEG codec
2.6 Zig-zag coefficients sequence
2.7 Component interleaving using MCUs
2.8 JPEG Lossless prediction for pixel X
2.9 Hierarchical encoding in JPEG
3.1 Forward prediction
3.2 Average prediction
3.3 Composition of an H.261 CIF
3.4 B-frame of H.263
5.1 SOCR block comparison using 8×8 blocks
5.2 Logarithmic search algorithm for 9×9 blocks
5.3 Example of interpolation of 2×2 pixels
5.4 Average length of X and Y motion vectors for a video sequence
5.5 Maximum length found per frame of X and Y motion vectors for a video sequence
5.6 Size of MV against fraction of MVs with this size
5.7 Size of MV against cumulative percentage of MVs within this size
6.1 Plot of compression ratio versus SNR (dB)
7.1 Overview of background subtraction
7.2 Steps for encoding of one Macro-Block
7.3 Macro-block encoding with block change detection
9.1 Model for a real-time video compression application
List of Tables
2.1 LZW encoding of word "bananas"
2.2 LZW dictionary after encoding/decoding of word "bananas"
2.3 Lossless prediction formula table
3.1 MPEG-1 Constraint Parameter Bit stream
4.1 Effect of H.263 advanced options on compression ratio (Miss America sequence, 150 frames)
5.1 Search window size for Miss America video stream at 15 fps
5.2 Search window size for Miss America video stream at 7.5 fps
5.3 SOCR block comparison on Miss America sequence at 15 fps
5.4 Quality of video stream compared with real frame rate of 30.0 fps
6.1 SNR comparison of DCT algorithms
6.2 Performance comparison of DCT algorithms
6.3 Compression comparison of DCT algorithms
6.4 Quantization versus Quality for Miss America sequence
8.1 Comparison between different compression methods for a QCIF frame
9.1 Comparison of Miss America encoding (75 frames) with and without motion detection (MD)
9.2 The streams that are used for compression
9.3 The streams that are used for compression
9.4 Profile of encoding
Chapter 1
Introduction
1.1 A brief history of computing power
Computer technology has improved in a revolutionary way since the first systems were developed in the 50s and 60s: the size of computers decreased significantly, while at the same time computing power increased by factors of two each year. These developments led to new applications for each new generation of computers. After the introduction of computer systems in businesses in the 70s, small computer systems entered households in the 80s. First, text-based applications were replaced by graphics-based What You See Is What You Get (WYSIWYG) applications. At the end of the 80s, more and more applications made use of the available graphical user interface (GUI) for cartoon animations and to display pictures. This development encouraged manufacturers of video cards and monitors to support up to true-color1 images on standard computer monitors. The next logical step was the development of technology to display motion fragments on a computer monitor. In summary, video technology and computing power drove each other to new and higher standards in the computer industry. Video processing requires a large amount of resources. Data storage and network capacity are currently the main bottlenecks for more widespread use of video applications. Waiting for new technology in these fields is not an option, and other approaches must be taken to make more effective use of the available resources. Compression might be the solution in this situation. Compression algorithms make it possible to store the same data in less storage space. Compression is not only useful for storage on media such as hard disks, CDs and tape streamers; it is also useful for transmission of data over a network: less bandwidth is used for the same data. However, compression does not come free: the initial data must be transformed to its compressed equivalent. To use the initial data again, the compressed data must be decompressed.
This transformation, forwards and backwards, takes time, may reduce the quality of the data, uses system memory, and above all costs CPU power.
1 True color is 24-bit color, which means that 8 bits per pixel are reserved for each of the red, green and blue components.
1.3 Goals
The current algorithms used for video transport and storage are not as efficient as newer, advanced algorithms. In the Pegasus multimedia system, video streams are transmitted as Motion-JPEG streams. These streams consist of video frames that are compressed individually by a factor of 10 with negligible quality loss (invisible to the human eye). This compression is done Real-Time (30 frames per second). More advanced compression algorithms, however, make use of the temporal redundancy found in subsequent video frames. These algorithms compress video up to a factor of 400. Unfortunately, this cannot (at this time) be done real-time without advanced, multiprocessor, dedicated computer systems. To make video conferencing possible between two persons at two different locations, these advanced compression algorithms must be used to reduce the required network bandwidth. Unfortunately, the computing power of a desktop personal computer is significantly lower than that of a high-end workstation. The purpose of my work is to examine the different video compression algorithms that are available, find the conditions and parameters that apply to DVC, and create a working implementation of a video compression algorithm.
Computing power: The computing power that is available in a system determines the number of instructions that can be executed in a specified time. The more instructions are executed, the more redundancies in the data can be found. Processor (CPU) speed, the type of instruction set (e.g. multimedia instructions), main, virtual and cache memory in the system, and the system bus all determine the computing power of a system.

Network bandwidth: The bandwidth of the network determines the amount of data that can be transported in a certain time. Not only the average bandwidth of the network is important; the minimum bandwidth and the network latency also matter. The minimum bandwidth is the bandwidth that is guaranteed by the network. In some networks, e.g. the Internet, this number is equal to zero. In other networks, it is equal to the average bandwidth (ISDN). In ATM networks, service classes determine this behaviour (see also Section 8.3.4).

Latency: The final constraint for a real-time compression algorithm is the time between capturing a video frame from a camera and displaying this frame after sending it over a network. The latency depends on both the computing power and the network latency. For some applications other than DVC, latency is less important, for instance when a video stream is stored on hard disk solely for later use. The total latency is the sum of all partial latencies that occur: the time between the moment the frame is grabbed and the moment the data becomes available, the time it takes to compress this frame, the time it takes to transmit and receive a frame, and the time it takes to decompress and display the frame. If all resources of a system are used for compression, the compression latency is equal to the time it takes to compress a frame2. Latency is especially important for situations where interaction is required: the response time must not be too long, because that would disrupt the interaction.
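As a back-of-the-envelope illustration of the latency constraint: the end-to-end latency is simply the sum of the per-stage latencies described above. The stage values below are hypothetical, chosen only to show the bookkeeping, not measurements from the thesis.

```python
# Hypothetical latency budget for a DVC pipeline (all values in ms).
stages = {
    "grab":       33,   # frame becomes available (one frame period at 30 fps)
    "compress":   40,   # time to encode one frame
    "network":    20,   # transmit + receive
    "decompress": 15,   # time to decode one frame
    "display":    5,    # time to put the frame on screen
}

# Total latency is the sum of all partial latencies.
total_ms = sum(stages.values())
print(total_ms)  # 113
```

A budget like this makes it easy to see which stage dominates and where an optimization (e.g. a faster DCT) actually pays off in perceived responsiveness.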
See also Chapter 8 for a closer look at the characteristics of different networks and the requirements for video communications.
Current developments in the area of computing indicate that CPUs become faster and that real-time encoding will be feasible in a couple of years. Other developments in the area of data networks and communication indicate that network capacity will grow and that the number of people connected with a faster-than-POTS3 connection will also increase significantly over the next years. These developments give hope for teleconferencing in the future, but to make real-time compression possible now, new approaches must be taken. To make real-time encoding possible, a number of sacrifices must be made. The frame rate could be decreased to a still-acceptable rate, the screen size may be reduced, and the size of the compressed video stream might be increased to win valuable computing power. Another research topic is the concentration of computing power on areas of the video screen that are important. These areas can be detected by finding parts of the screen that contain skin color [Zwaal95] or foreground. All these issues influence each other and make it difficult to find a standard solution in the form of one piece of hardware and one off-the-shelf software package. Finding an appropriate video compression standard and creating a software codec that satisfies this standard is the first part of my work. The second part is the analysis of a DVC-specific video stream in order to optimize the codec for this kind of data. The final, and most ambitious, part of this research is the development and implementation of a codec that proves that near Real-Time encoding is possible for modest-sized video sequences. In the next two chapters, image and video compression, respectively, are discussed. These two chapters were published earlier as a Pegasus Paper [Aalmoes95]. In Chapter 4, the arguments are given for using the H.263 standard for video compression of low-bandwidth streams. In Chapter 5, a video stream is analyzed for its movements, and alternative motion estimation algorithms are discussed.
In Chapter 6, a variety of Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) algorithms are examined. Quantization, which is sometimes combined with the DCT operation, is also discussed there. In Chapter 7, a number of methods are discussed to reduce the video stream size or encoding time in an unconventional way, by only compressing the parts of the screen that are most important. To connect the technology with reality, Chapter 8 discusses a number of different computer and telecommunications networks and their implications for video encoding. The implementation of a fast H.263 encoder is discussed in Chapter 9, where its corresponding results are also given. Finally, in Chapter 10, the conclusions are given on Real-Time video compression for low-bandwidth networks.
3 POTS stands for Plain Old Telephone System, and represents the interconnected telephone network of the national PTTs
Chapter 2
2.1 Introduction
In the past years, a number of compression standards have emerged, and more are now being developed. Although it would be useful to use only one general video compression standard, a growing number of standards is developed because of enhanced processing power, dedicated hardware, new compression techniques, and networks with different bandwidths. Each compression standard supports a specific video application. It is difficult to choose the correct compression standard for a specific application. Just as there is no single best compression algorithm in general, there is no best video compression standard. Some applications require fast real-time encoding at the cost of the compression factor (video conferencing), while other applications want maximum compression and accept encoding that is not done real-time, as long as decoding is real-time (e.g. compressing a video stream onto CD-ROM). This paper describes the different compression techniques used in the available standards on video compression. This simplifies the choice of which standard is most suitable for a certain application. Furthermore, an estimate can be made of the computational costs and the size of the video stream of a video compression standard. A division is made between still-picture and video compression techniques. In Section 2.2, some commonly used terms and compression techniques for still-picture compression are explained. These techniques include ways to remove redundant information and information that is not visible to the human eye. In Section 2.3, the most important still-picture compression standards are discussed. Video compression techniques rely strongly on the techniques used in still-picture compression, but also incorporate prediction and motion-compensation algorithms. These algorithms, which remove redundant information
between the different frames of a video stream, are discussed in Section 3.1. In Section 3.2, some established and emerging video compression standards are discussed.
To convert an RGB color space to a YCbCr color space, the three color component intensities of RGB determine the luminance Y. The Y value is a weighted sum of the three color intensities: the green component appears brighter than the red component, and the red component appears brighter than the blue component, for the same value. The Cb component is the blue component minus the total luminance (B − Y). The Cr component is the red component minus the total luminance (R − Y). The Hue-Saturation-Brightness (HSB) color space, also called Hue-Lightness-Saturation (HLS) or Hue-Saturation-Value (HSV), is based on specifying the colors numerically. The Hue component describes the pure color, the Saturation component describes the degree to which the pure color is diluted by white light, and the Brightness describes the brightness or luminance. The problem with this color space is that no reference is made to the linearity or non-linearity of the colors. To determine the lightness from RGB, the three color component values are averaged ((R + G + B)/3), while the visual luminance of green is much higher than the visual luminance of blue. For (lossy) image compression, it is advised to convert the RGB color space to YCbCr [Wallace91]. The human eye is more sensitive to luminance components than to chrominance components, and by separating them, the luminance component can be encoded in a higher resolution than the chrominance components. In other words, fewer bits need to be encoded for the chrominance components. The relation between the resolution of the luminance component and the chrominance components determines the picture format. A luminance component accompanied by two chrominance components that are down-sampled in both the horizontal and vertical dimensions by two is called the 4:2:0-format (see Figure 2.1) [Kleine95, Filippini95]. If the chrominance components are only down-sampled in the horizontal direction by 2, the format is called the 4:2:2-format (see Figure 2.2). Finally, the 4:1:1-format has its chrominance components horizontally down-sampled by 4 and has no down-sampling in the vertical dimension.
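The color conversion and down-sampling described above can be sketched as follows. The luminance weights are taken from ITU-R BT.601 (an assumption; the text only states that green contributes most and blue least), and the chrominance differences follow the text's Cb = B − Y and Cr = R − Y.

```python
def rgb_to_ycbcr(r, g, b):
    # Luminance as a weighted sum of the components (BT.601 weights,
    # assumed here): green weighs heaviest, blue lightest.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, b - y, r - y   # Cb = B - Y, Cr = R - Y

def subsample_420(plane):
    # 4:2:0: down-sample a chrominance plane by two in both dimensions,
    # averaging each 2x2 block of samples.
    h, w = len(plane), len(plane[0])
    return [[(plane[i][j] + plane[i][j + 1] +
              plane[i + 1][j] + plane[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]
```

With this sketch, an 8×8 chrominance plane becomes the 4×4 plane of Figure 2.1; down-sampling only the horizontal dimension would instead give the 4×8 plane of the 4:2:2-format.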
Figure 2.1: The 4:2:0-format (one Y block of 8×8 samples accompanied by Cb and Cr blocks of 4×4 samples)
Figure 2.2: The 4:2:2-format (one Y block of 8×8 samples accompanied by Cb and Cr blocks of 4×8 samples)
Figure 2.3: Quantization and de-quantization (reproduced values: 0 0 2 2 4 4 6 6)

After quantization, only the values 0, 2, 4 and 6 have the same value as before the quantization; the other values are approximated by the nearest reproduced value.
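The staircase of Figure 2.3 can be reproduced with a minimal uniform quantizer sketch. Note that this sketch truncates rather than rounds, which is what yields exactly the reproduced values 0, 0, 2, 2, 4, 4, 6, 6 of the figure; JPEG proper divides by the quantization value and rounds to the nearest integer.

```python
def quantize(value, step):
    # Truncating uniform quantizer -- this is the lossy step.
    return value // step

def dequantize(index, step):
    # De-quantization simply scales the index back up.
    return index * step

# Round trip for inputs 0..7 with step size 2: only 0, 2, 4 and 6
# survive exactly; odd inputs are mapped to the value below them.
reproduced = [dequantize(quantize(v, 2), 2) for v in range(8)]
print(reproduced)  # [0, 0, 2, 2, 4, 4, 6, 6]
```

The loss is irreversible: once two inputs share a quantization index, no de-quantizer can tell them apart, which is why quantization is the step that trades quality for compression.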
density in a message.
LZW compression
Lempel-Ziv-Welch (LZW) is an entropy encoding technique, developed by Terry Welch [Nelson91]. The best known implementations of LZW are the UNIX "compress" utility and CompuServe's Graphics Interchange Format (GIF). LZW is based on LZ77 and LZ78, which were developed by Lempel and Ziv. LZ77 and LZ78 are dictionary-based algorithms: they build up a dictionary of previously used strings of characters. The output stream of these encoders consists of characters or references to the dictionary. A combination of a reference with a character generates a new reference in the dictionary. For example, a reference to "Hi" in the dictionary followed by the character "s" results in a new reference "His". LZW is an improvement over LZ78. LZW uses a table of entries with an index field and a substitution-string field. This dictionary is pre-loaded with every possible symbol in the alphabet. Thus, every symbol can be found in
the dictionary by using a reference. The encoder searches the dictionary for the largest possible reference to the string at the input. This reference, plus the first symbol of the input stream after the reference, is stored in the output stream. An example of the encoding of the word "bananas" is given in Table 2.1 and Table 2.2. The decoder reads the encoded stream and replaces each reference by the substitution string that is stored in the associated entry of the dictionary. The symbol that follows the reference is directly stored in the decoded stream. The reference and the symbol are also used to create a new entry in the dictionary.

Table 2.1: LZW encoding of word "bananas"

Input stream:      'b' 'a' 'n' 'a' 'n' 'a' 's'
Generated entries: 256 = "ba", 257 = "an", 258 = "na", 259 = "ana", 260 = "as"
Output stream:     'b' 'a' 'n' 257 'a' 's'

Table 2.2: LZW dictionary after encoding/decoding of word "bananas"

Index  Substitution string
0      (char) 0
1      (char) 1
...    ...
255    (char) 255
256    "ba"
257    "an"
258    "na"
259    "ana"
260    "as"
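The encoding walk shown in Tables 2.1 and 2.2 can be sketched as a small LZW encoder. This is a minimal illustration, not the thesis's implementation; the output is given as integer codes, where codes below 256 are the pre-loaded single symbols.

```python
def lzw_encode(data):
    # Dictionary pre-loaded with every single-symbol string (codes 0..255).
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""          # longest string matched so far
    out = []
    for c in data:
        wc = w + c
        if wc in dictionary:      # keep extending the current match
            w = wc
        else:                     # emit longest match, add new entry
            out.append(dictionary[w])
            dictionary[wc] = next_code
            next_code += 1
            w = c
    if w:
        out.append(dictionary[w])
    return out

# "bananas" encodes to 'b', 'a', 'n', <257 = "an">, 'a', 's',
# matching the output stream of Table 2.1.
print(lzw_encode("bananas"))  # [98, 97, 110, 257, 97, 115]
```

The generated entries 256..260 after the run are exactly those of Table 2.2, which is what lets the decoder rebuild the same dictionary without it ever being transmitted.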
and was capable of compression from 8:1 to 50:1 while retaining reasonable quality. This implementation searches for a combination of transformations that best represent the image. Unfortunately, the search to find these transformations is very computationally intensive, which makes it unattractive for real-time image compression. Iterated Systems developed and sells a fractal-based compressor/decompressor, mainly used for CD-ROM encyclopedia applications.
Figure 2.4: Example of a Discrete Wavelet Transform

An example of a wavelet transform is given in Figure 2.4. In this example, a Discrete Wavelet Transform (DWT) is applied to an array of 8 coordinates. The result is four approximation transform coefficients S0..S3 (also called the smooth vector) and four detail transform coefficients D0..D3 (also called the detail vector). Next, the DWT is applied again to the approximation transform coefficients.
All of the resulting coefficients, together with the detail transform coefficients from Figure 2.4, form the final wavelet coefficients. Wavelet compression is obtained by storing only those coefficients of the wavelet transformation that have an amplitude above a certain threshold, together with the position of those coefficients in the transformed domain. Because the coefficients also retain time-domain information, high-contrast edges are maintained at the cost of low-contrast areas. By using quantization and entropy encoding in combination with the wavelet transform, the number of bits needed to store the wavelet coefficients is further reduced.
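As a concrete sketch of the smooth/detail split, consider one level of the simplest wavelet, the Haar DWT (an assumption for illustration; the text does not name a particular wavelet). It computes pairwise averages and half-differences of the input samples.

```python
def haar_step(signal):
    # One level of a Haar DWT: pairwise averages form the smooth
    # (approximation) vector S, pairwise half-differences the detail
    # vector D. The input length must be even.
    S = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    D = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return S, D

# Eight samples X0..X7 yield S0..S3 and D0..D3 as in Figure 2.4;
# the transform is then applied again to the smooth vector only.
S, D = haar_step([2, 4, 6, 8, 10, 12, 14, 16])
S2, D2 = haar_step(S)
print(S, D)    # [3.0, 7.0, 11.0, 15.0] [-1.0, -1.0, -1.0, -1.0]
print(S2, D2)  # [5.0, 13.0] [-2.0, -2.0]
```

For this smooth ramp input, the detail coefficients are small and constant, which illustrates why thresholding them away compresses well while preserving the overall shape of the signal.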
2.3.1 JPEG
The JPEG standard [Wallace91] was developed by the Joint Photographic Experts Group, a collaboration between the former International Telegraph and Telephone Consultative Committee (CCITT) and the International Standardization Organization (ISO). The JPEG standard is now widely adopted in the world. There are four modes of operation:

Sequential encoding: This is the general mode, in which a picture is encoded from top to bottom.

Progressive encoding: In this mode, the picture builds up in multiple scans. After each scan, the picture gets sharper.

Lossless encoding: In this mode, the picture is compressed in such a way that no data is lost after decompression. The algorithm used for lossless encoding is rather different from the one used in the sequential and progressive modes of operation.

Hierarchical encoding: In this mode, the image is encoded in different resolutions. Accessing a low-resolution version does not require decompression of the full-resolution version.

The JPEG encoder works on one color component at a time. For gray-scale pictures, which consist of only one component, the encoding is straightforward. For color pictures, every component is encoded separately, just like a gray-scale picture. The color components can be interleaved with each other or can be sent after one another, see Section 2.3.1.
Sequential encoding
The most common way to encode a JPEG picture is by using sequential encoding. An overview of the codec (compressor/de-compressor) is given in Figure 2.5.
Figure 2.5: Overview of the JPEG codec (encoder: original picture → DCT → quantization → entropy encoding → compressed picture; decoder: entropy decoding → de-quantization → IDCT → decompressed picture; the quantization table and Huffman table are shared by encoder and decoder)

For every component, the picture is divided in blocks of 8×8 pixels. Each block is transformed into another 8×8 block using a DCT function. The resulting transformed block consists of 64 unique two-dimensional spatial frequency coefficients, of which the higher-frequency coefficients are very small or zero. After the DCT transformation, the transformed block is quantized using an 8×8 quantization table. This means that every coefficient is divided by its corresponding quantization value and the result is rounded to the nearest integer. Note that this step is lossy and removes data that may not be visible to the human eye. The resulting block of coefficients contains even more small or zero values. This block of coefficients is stored in a sequence according to a zig-zag route defined in the block, see Figure 2.6. This zig-zag sequence is chosen in such a way that the low-frequency coefficients are stored first and the high-frequency coefficients last. Putting the high-frequency coefficients next to each other results in a series of low or zero values at the end of the sequence. This sequence is encoded efficiently by the entropy encoder. The final step is entropy encoding of the created sequence. The quantized DC coefficient is treated a little differently from the other (AC) coefficients: because the DC coefficients of adjacent blocks correlate strongly, not the value itself but the difference with the previous DC coefficient is used for entropy encoding. The entropy encoder is a mixture of a variable-length encoder and the Huffman or arithmetic encoder.
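The zig-zag route of Figure 2.6 can be generated programmatically: the coefficients on each anti-diagonal (constant i + j) are visited in alternating directions. A minimal sketch:

```python
def zigzag_order(n=8):
    # Generate the zig-zag scan order for an n x n block. Anti-diagonals
    # (constant i + j) are traversed in alternating directions, starting
    # at the DC coefficient (0, 0).
    order = []
    for s in range(2 * n - 1):
        i_range = range(max(0, s - n + 1), min(s, n - 1) + 1)
        diagonal = [(i, s - i) for i in i_range]
        if s % 2 == 0:
            diagonal.reverse()   # even diagonals run bottom-left to top-right
        order.extend(diagonal)
    return order

# The first entries of the scan, as flat indices into the 8x8 block:
flat = [i * 8 + j for i, j in zigzag_order()]
print(flat[:10])  # [0, 1, 8, 16, 9, 2, 3, 10, 17, 24]
```

Since spatial frequency grows with i + j, walking the diagonals in this order is exactly what places the low-frequency coefficients first and clusters the near-zero high-frequency coefficients at the tail of the sequence.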
The interleaved sequence of data units is then:

Y0,0; Y1,0; Y0,1; Y1,1; CB0,0; CR0,0; Y2,0; Y2,1; Y3,0; Y3,1; CB1,0; CR1,0; ...
[Figure: two MCUs of interleaved data, each consisting of four 8 × 8 luminance (Y) blocks and two down-sampled 4 × 4 chrominance (CB and CR) blocks.]
Progressive encoding
Progressive encoding allows a user to transmit a picture in a number of scans. The first scan is a rough approximation of the picture, but every next scan improves the picture. Progressive encoding uses the same compression techniques found in sequential encoding. The progressive encoding mode, however, introduces a buffer between the quantization and entropy encoding steps, large enough to store the whole DCT-encoded and quantized picture. The buffer is then entropy encoded in a number of scans. Two methods can be chosen to select the information per scan:

Spectral selection: a selection is made of the quantized coefficients that are transmitted. For example, in the first scan, only the DC coefficient and the first three AC coefficients (according to the zig-zag ordering) are transmitted, in the second scan the next 16 AC coefficients are transmitted, and in the final scan the last 44 AC coefficients are transmitted.

Successive approximation: a bit selection of every quantized coefficient is sent per scan instead of the whole quantized coefficient. For instance, in the first scan, the three most significant bits of all the quantized coefficients are transmitted, and in the second and final scan the remaining bits of the quantized coefficients are transmitted.

The two methods can be mixed, which enables the user to choose the "progression" in a very flexible way. The drawback of progressive encoding compared to sequential encoding is the extra buffer that is introduced in the encoder and the decoder, and a more computation-intensive decoder, as for each scan the de-quantization and IDCT processes must be executed again.
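The spectral-selection example above can be sketched as follows; the scan boundaries (4, then 16, then 44 coefficients) follow the example in the text and are only an illustration.

```python
# Sketch of progressive JPEG spectral selection: the 64 quantized
# coefficients of one block (already in zig-zag order) are split over
# several scans. The boundaries below are the example from the text,
# not values mandated by the standard.

def spectral_scans(zigzag_coeffs, bounds=(4, 20, 64)):
    """Split one block's zig-zag sequence into per-scan slices."""
    scans, start = [], 0
    for end in bounds:
        scans.append(zigzag_coeffs[start:end])
        start = end
    return scans
```

Concatenating all scans returns the original sequence, which is why the decoder can refine the picture incrementally without losing information relative to the sequential mode.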
Lossless encoding
The JPEG lossless mode does not make use of the DCT transformation. Instead of DCT and quantization, it uses a prediction process that determines the value based on the values of the pixels to the left of and above the current pixel, see Figure 2.8. The selection value for the prediction (see Table 2.3) and the difference with the actual pixel value are then sent to the entropy encoder. The entropy encoder can be either a Huffman encoder or an arithmetic encoder. For the Huffman encoder, the entropy encoding stage is almost identical to the DC-coefficient encoder in the sequential mode.
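The prediction formulas of Table 2.3 can be sketched as follows. A is the pixel to the left, B the pixel above, and C the pixel above-left of X; plain integer division is assumed here, while the standard defines its own integer arithmetic.

```python
# Sketch of the JPEG lossless predictors of Table 2.3. The encoder
# sends the selection value plus the difference between X and the
# chosen prediction to the entropy coder.

def predict(selection, a, b, c):
    """Prediction for pixel X given selection value 0..7."""
    predictors = {
        0: 0,                 # no prediction
        1: a,
        2: b,
        3: c,
        4: a + b - c,
        5: a + (b - c) // 2,  # integer division assumed in this sketch
        6: b + (a - c) // 2,
        7: (a + b) // 2,
    }
    return predictors[selection]
```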
Figure 2.8: JPEG lossless prediction for pixel X

Selection value   Prediction for X
0                 no prediction
1                 A
2                 B
3                 C
4                 A + B - C
5                 A + (B - C)/2
6                 B + (A - C)/2
7                 (A + B)/2

Table 2.3: Lossless prediction formula table

Hierarchical encoding

Hierarchical encoding allows a user to decode a low-resolution version of a picture without decoding and down-sampling the whole encoded picture. The hierarchical mode is used in combination with the lossless, sequential, or progressive encoding mode. A number of steps are done by the hierarchical encoder, see Figure 2.9: first, the picture is down-sampled a desired number of times by a factor of two in the horizontal dimension, in the vertical dimension, or in both dimensions. The result is the minimal resolution of the picture that can be retrieved by the decoder. Then, the encoder compresses this down-sampled picture using one of the sequential, progressive, or lossless compression modes. This compressed image is sent to the outgoing video stream. After that, the encoder decompresses the compressed image, so that it has the same image as the decoder2. This image is up-sampled by a factor of two in either the horizontal, the vertical, or both dimensions, using an interpolation filter that is also used in the decoder. The result is then compared with the original image down-sampled to the same resolution, or with the original image itself (without down-sampling) if it already has the same resolution. The difference resulting from this comparison is encoded using the same compression method as mentioned before. In this encoding, a different quantization table can be used, because the difference of two images has other (statistical) characteristics than an image itself. If the up-sampled image still has a lower resolution than the original image, the encoder can up-sample, interpolate, compare with the (down-sampled) original image, and calculate a compressed difference image again, until the full resolution of the original image has been sent over. The drawback of hierarchical encoding is the need for picture buffers at the encoder for each resolution that is sent over, and one extra buffer at the decoder. Furthermore, if the decoder wants the picture with the highest resolution, many more calculations must be made at the encoder and at the decoder than with sequential coding.
2 This decoding step can be optimized when lossless mode is used, or when an intermediate result is stored before the lossless entropy encoding during encoding.
Figure 2.9: Hierarchical encoding in JPEG

The Independent JPEG Group (IJG) has developed public domain source code that supports the lossless, sequential, and progressive operation modes. The hierarchical mode is not (yet) supported.

2.3.2 GIF

GIF is a lossless, 256-color, still-picture compression standard [Gailly95]. It uses a variation of the LZW compression technique called variable-length LZW. GIF is most suitable for images that have a small number of colors, such as computer-generated graphics and cartoons. It is also useful for small images. The difference between the LZW compression method and the variable-length LZW used in GIF is that in the latter the size of the code that represents an entry in the table is increased by a bit when the table is full. If this code is 12 bits long and the table is full, a special symbol (a clear code) is encoded to indicate that the table must be emptied and rebuilt from scratch. There are two widely used versions of GIF: 87a and 89a [Compuserv90]. The 89a version has some extensions to insert text into the picture, and comments and application-specific codes in a GIF file, but the LZW algorithm used is not different.
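A minimal LZW encoder sketch illustrates the table-based compression GIF builds on. Real GIF emits variable-length codes that grow as the table fills, plus a clear code to reset the table; both are omitted here for brevity.

```python
# Minimal LZW encoder sketch for byte strings. The table starts with
# all 256 single-byte strings; longer strings are added as they are
# first seen, so repeated patterns are emitted as single codes.

def lzw_encode(data: bytes):
    table = {bytes([i]): i for i in range(256)}
    next_code, current, out = 256, b"", []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate          # keep extending the match
        else:
            out.append(table[current])   # emit code for longest match
            table[candidate] = next_code # learn the new string
            next_code += 1
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out
```

Repetitive input compresses well: a run of ten identical bytes encodes to only four codes, because each emitted code stands for an ever longer learned string.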
Chapter 3
3.1.1 Prediction
The most basic form of prediction checks if a block of n × m data in the current frame is the same as the block at the same place in the previous frame. If there is no change, the data of this block is not encoded. Although this is an easy example, the implementation still requires quite some thought: what size is chosen for the block that is compared, and must the blocks be exactly the same, or is there a threshold value before the block is marked as changed? In most implementations, prediction is combined with motion-compensated interpolation. If a block of data is not identical to a block of data in a previous frame, the best-matching block is found and the difference is used for further compression. The resulting block is compressed better than the original block of data. The area that is used for comparison of the block determines the quality of the final prediction. The larger the area that is searched to find a matching block, the larger the chance that it is actually found. But most matching blocks are
found around the place of the original block, and increasing the search area also increases the computation time to find a matching block. Bi-directional prediction not only searches a previous frame for a close-matching block, it also searches a next frame in the video stream. Another advantage of bi-directional prediction is that it can combine the prediction from a previous frame with the prediction from a next frame into an average prediction image block.
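The basic change-detection step described above can be sketched as follows; blocks are plain lists of pixel rows, and the threshold is an illustrative parameter, not a value from any standard.

```python
# Sketch of the basic prediction step: a block is only re-encoded when
# it differs from the co-located block in the previous frame by more
# than a threshold on the summed absolute pixel difference.

def block_changed(prev_block, cur_block, threshold=0):
    """True if the block should be (re-)encoded."""
    diff = sum(abs(p - c)
               for prow, crow in zip(prev_block, cur_block)
               for p, c in zip(prow, crow))
    return diff > threshold
```

With threshold 0 the blocks must match exactly; a higher threshold trades a little quality for skipping more blocks, which is the design question raised in the paragraph above.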
3.2.2 MPEG-1
MPEG stands for Motion Picture Experts Group, which is concerned with the development of video compression standards [Gall91]. Although the MPEG group is also developing audio and synchronization standards as part of the MPEG standard, we only look at the video compression techniques used. MPEG-1 is the first standard of the MPEG group. It describes the way a video stream must be stored, but it does not give specifications on how the coding (and decoding) must be done. The MPEG-1 standard is designed for (encoded) video streams of 1.5 Mbps, which is sufficient for CD-ROM applications. MPEG-1 supports encoding of the Standard Interchange Format (SIF), which has a resolution of 352 × 240 pixels at a rate of 30 frames per second for NTSC, and a resolution of 352 × 288 at a rate of 25 frames per second for PAL and SECAM. An MPEG video stream uses three different types of frames:

I-frames: The I or intra-picture frames are compressed independently of the other frames in the video stream.
P-frames: P or predicted frames are frames that store the difference between the current frame and the previously encoded P or I frame.

B-frames: B or bidirectional prediction frames use both the previous I or P frame and the next I or P frame to predict the current frame.

I, B, and P frames are compressed differently. In I frames, compression is achieved by reducing spatial redundancy in the frame. P and B frames also use temporal redundancy reduction to improve the compression factor. Because B-frames also make use of the next I or P frame as reference, B frames have the highest compression factor. An MPEG video stream consisting of only I frames has, except for some quantization and Huffman-encoding details, the same compression factor as a motion-JPEG video stream (using the same video format). However, I frames are important for random access, the ability to decode individual frames without decoding the whole video stream. A frame is divided in a number of 16 × 16 blocks of pixels called macro-blocks. A macro-block can be encoded in four different ways:

Intra-block encoding: no prediction is used.

Forward predicted encoding: A 16 × 16 block of pixels is searched in the next I or P-frame that most closely resembles the current macro-block, see Figure 3.1. The difference between these blocks is used for further compression.

Backward predicted encoding: This encoding is the same as forward predicted encoding, but with the difference that blocks are searched in the previous I or P frame instead of the next frame.

Average encoding: Backward and forward predicted encoding are used to find the two blocks of pixels that resemble the current macro-block best, see Figure 3.2. These two blocks are averaged and the difference with the current macro-block is used for further compression.
Figure 3.2: Average prediction

I frames only use intra-block coding; P frames use either intra-block coding or backward predicted coding. B frames use any of the encoding modes. After motion prediction, a macro-block must be compressed to reduce spatial redundancy. An 8 × 8 DCT is used, similar to the one found in JPEG. After the DCT, coefficients are stored in zig-zag order. These coefficients are quantized depending on the original encoding mode. For intra-block encoding, low spatial frequency coefficients are quantized with a lower quantization factor than high spatial frequency coefficients. For the other encoding modes, the coefficients are DCT-transformed differences of pixel blocks. Low frequencies of these blocks will be close to zero, because of the applied prediction. Therefore, another quantization matrix must be used than for intra-block-encoded DCT blocks. MPEG also allows different quantization step sizes for different blocks. This is independent of the encoding mode (intra or predictive encoding). Different quantization step sizes allow the encoder to code certain blocks more accurately than others. In general, an MPEG video stream consists of many B-frames, some P-frames, and a few I-frames. The I-frames guarantee random access in the video stream. P frames are also important because B-frames can only refer to I or P-frames, not to B-frames. After motion prediction, DCT, and quantization, the output stream is entropy encoded by a variant of the variable-length compression technique found in JPEG. MPEG-1 is tuned for compression of video streams that comply with a Constraint Parameter Bit stream (CPB), see Table 3.1. Video streams that use more bandwidth than this CPB allows may be encoded with MPEG-1, but the encoding is not necessarily efficient and support is not guaranteed by MPEG-1 hardware.
Horizontal size in pixels
Vertical size in pixels
Total macro-blocks per picture
Total macro-blocks per second
Frame rate
Bit rate
Decoder buffer

Table 3.1: MPEG-1 Constraint Parameter Bit stream

3.2.3 MPEG-2

The MPEG-2 standard is developed for high-end video applications that need a compressed video stream from 4 Mbps up to 100 Mbps [Kleine95] [Okubo95]. MPEG-1 may not be efficient for these video streams, but it is for video streams that conform to the CPB of MPEG-1. Furthermore, interlaced video streams, which are common in the television industry, are not easily converted to MPEG-1; MPEG-2 is more suited for these interlaced video streams. Video streams of MPEG-2 are nevertheless compatible with MPEG-1. The MPEG-2 standard deals with different resolution video streams that are divided in profiles and levels. The lowest level format is 352 × 288 pixels (PAL format) and the highest is 1920 × 1152 pixels (PAL format). The simplest profile does not use B-frames, is not scalable, and uses a 4:2:0 luminance/chrominance format, while the high profile uses B-frames, is scalable, and uses either a 4:2:0 or a 4:2:2 luminance/chrominance format.
3.2.4 MPEG-4
At this moment, MPEG-4 is still in development and no concrete algorithms or methods have been determined yet [Filippini95]. However, an outline of the goals of MPEG-4 is available. MPEG-4 is not just a compression standard; it will incorporate a description language that determines the contents of a video stream. It also distinguishes different objects, which enables the user to set priorities so that, for example, the foreground of a picture has a higher priority than the background. MPEG-4 intends to support a wide variety of video streams, from low-bandwidth to 3-dimensional video streams. An MPEG-4 stream combines tools, algorithms, and profiles. These will determine how data is stored. For example, subtitles will be coded differently than other video objects. MPEG-4 is scheduled to become a standard at the end of 1998.
3.2.5 H.261
The CCITT developed the H.261 video compression standard, which is designed for video communications over ISDN networks [Liou91] [Turletti93]. H.261 can handle p × 64 Kbps (where p = 1, 2, ..., 30) video streams, which matches the possible bandwidths in ISDN. The H.261 standard supports the following two video formats:

Common Intermediate Format (CIF). This format has a resolution of 352 × 288 pixels for the luminance (Y) part of the video stream and a
resolution of 176 × 144 pixels for the two chrominance parts (CB and CR) of the video stream;

Quarter-CIF (QCIF). This format contains a quarter of the information of a CIF video stream. This means that the luminance resolution is 176 × 144 pixels and the two chrominance resolutions are 88 × 72 pixels.

The maximum frame rate for an H.261 video stream is 30 frames per second. A CIF or QCIF stream consists of pictures for each frame, and each picture consists of Groups Of Blocks or GOBs, see Figure 3.3. A QCIF has 3 GOBs, while a CIF has 12 GOBs. Each GOB consists of 3 × 11 macro-blocks (MBs). A macro-block is composed of four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks (CB and CR). A macro-block can be compared to an MCU in JPEG.
Figure 3.3: Composition of an H.261 CIF

The H.261 encoder can operate in two modes. In the intra-frame mode, every 8 × 8 block is DCT-transformed, linearly quantized, and sent to the video multiplexer. In the inter-frame mode, every 8 × 8 block is also DCT-transformed and linearly quantized, but the result is first sent to a motion compensator before it is sent to the video multiplexer. The motion compensator compares the macro-block of the current frame with blocks of data from the previously sent frame. If the difference is below a pre-determined threshold, no data is sent for this block. Otherwise, the difference is DCT-transformed and linearly quantized. The final encoding step is the video multiplexer, which uses a variable word-length coder to reduce the bit stream even more. After the video multiplexer, the result is inserted in a transmission buffer, which controls the linear quantizer in order to regulate the outgoing bit stream. H.261 is similar to MPEG with respect to the DCT encoding and quantization. During the standardization of MPEG-1, this was done on purpose to simplify implementations that encode or decode both H.261 and MPEG. The main difference between H.261 and MPEG-1 is that motion vectors in H.261 are restricted to at most 15 pixels away from the original place in the picture. Furthermore, no future prediction is used in H.261, which means that H.261 has no equivalent of the B-frames in MPEG.
INRIA Video conferencing System (IVS) is an implementation of a video conferencing tool that uses H.261 for video compression. Vic is also a video conferencing tool that supports H.261.
3.2.6 H.263
The ITU H.263 draft [ITULBC95] is an improvement over H.261. H.263 is developed for low-bandwidth communications over Plain Old Telephone Systems (POTS), in particular 28.8 Kbps modems. Compared to H.261, the number of available picture formats is increased, the motion-compensation algorithm has improved, better entropy encoding is used, and a new frame type is introduced that allows a simple form of forward prediction. Three new video formats are added in H.263: a sub-QCIF format (128 × 96), a 4CIF format (704 × 576), and a 16CIF format (1408 × 1152). As in H.261, the number of chrominance pixels is always half the number of (luminance) pixels. This means that for 2 × 2 luminance pixels, one CB and one CR pixel is used. H.263 also supports an unrestricted motion vector mode. In the default (restricted) motion vector mode, the block that is referenced must be fully inside the picture. In unrestricted motion vector mode, an arbitrary number of pixels may be outside the picture. For every reference to these pixels, the closest edge pixel is used instead. The Advanced Prediction mode is also new in H.263. Instead of one motion vector for a 16 × 16 macro-block, four motion vectors for 8 × 8 blocks are used for prediction. Although this encoding uses more bits than one motion vector, the quality of the prediction improves significantly. Another improvement in H.263 is the use of motion vectors that refer to half-pixel displacements instead of displacements with an integer number of pixels. To calculate the referenced sub-pixels, the values of surrounding pixels are interpolated. Besides intra- and inter-encoded frames, H.263 introduces PB-frames. The name is derived from the MPEG P and B frames. A PB-frame (see Figure 3.4) consists of two mixed frames: one "normal" P-frame and one bi-directional prediction frame. When a PB-frame is decoded, first the B-frame and after that the P-frame is shown.
The P-frame can use the inter- or intra-encoding modes, but the B-frame can only use the forward or backward prediction modes and not the intra-encoding mode. The B-frame can refer to the associated P-frame in the PB-frame, and to an average of the associated P-frame and the previously encoded P-frame. Test Model Near (TMN) is an implementation of the upcoming H.263 standard and is used as the test model for this standard. It claims a factor of two better compression than H.261 [Telenor95]. Source code of TMN is also available.
Figure 3.4: A PB-frame, consisting of a B-frame (at T0) and a P-frame (at T1), predicted from the previously encoded frame (at T-1)
3.3 Summary
A number of picture and video compression techniques have been discussed. For picture compression, a distinction is made between lossless and lossy methods. Lossless compression generates an exact copy of the compressed data after decompression. Lossy compression methods give up this requirement to obtain a much higher compression factor. Thus, the quality of lossy compression depends not only on the compression factor (as with lossless compression), but also on how closely the decompressed image resembles the original picture. Lossless methods for compression are LZW, (adaptive) Huffman, and DCT, although the latter loses this property during storage of the rounded coefficients. DCT transforms a block of n × n pixels into a matrix of n × n coefficients which represent spatial frequencies. Because most high-frequency coefficients are (near) zero, compression is attained. Quantization reduces the number of bits for data by reducing the density of the domain. Scalar quantization does this by dividing data by a quantization factor; decompression is done by multiplying with the same quantization factor. Vector quantization methods use a codebook to translate data to an index in the codebook; the collection of indexes and the codebook together are used to retrieve the original picture. The wavelet transform is another technique, which transforms a time domain into a time-frequency domain. Compression is done by storing only some of the generated coefficients. JPEG has four different operating modes. The lossy modes use a combination of the lossy DCT and quantization methods together with lossless entropy encoding methods to compress pictures. The progressive mode allows a user to send the picture in a number of scans, so that the picture improves after each scan. The JPEG lossless mode uses a prediction-based method to compress a picture. Hierarchical mode enables a user to send different resolutions of a picture at the same time; low-end decoders will only decode the first scan or the first couple of scans, while high-end decoders decode all scans.
GIF uses the LZW algorithm to compress 256-color pictures losslessly. Video compression techniques not only make use of spatial redundancy to reduce the bitstream, but also use the temporal redundancy found in consecutive video frames. In this way, the current frame is predicted from the previous and sometimes future frames, and only the difference of these frames with the current frame is encoded. All video compression standards discussed here use a DCT transformation followed by quantization and entropy encoding. A number of video compression standards are available. Motion-JPEG is a series of JPEG-compressed images stored after each other. Only spatial redundancy and not temporal redundancy is reduced. The advantage of this method is the easy implementation and random access to individual frames. The disadvantage is the poor compression factor of video streams. MPEG-1 is a standard for (compressed) bitstreams of 1.5 Mbps. It uses three different types of frames: I-frames that store a frame independently of the others in a stream, P-frames that store the difference between the current and the previous I- or P-frame, and B-frames that use both the previous I- or P-frame and the next I- or P-frame in the video stream for prediction of the current frame. P- and B-frames use motion compensation techniques for their prediction. MPEG-2 is an enhancement of MPEG-1 that is optimized for higher bitstreams and better video resolutions than MPEG-1. MPEG-4 is still in development, but will make it possible to distinguish individual objects in video streams. The standards designed for ISDN and POTS telecommunication are H.261 and H.263, respectively. H.261 has two picture formats: CIF has a resolution of 352 × 288 and QCIF is a quarter of the CIF resolution. H.261 frames operate in either of the following modes: In intra-frame mode, the frame is individually compressed. In inter-frame mode, the difference with the previous frame is calculated (using motion compensation) and is stored in the outgoing video stream. H.263 is an improvement over H.261 that supports two more resolutions: 4CIF (704 × 576) and 16CIF (1408 × 1152). Furthermore, it introduces a PB-frame that enables a simple form of forward prediction. The motion prediction algorithm is also improved in H.263.
Chapter 4
of half-pixel prediction for motion estimation. As movements are small, a good estimation is made by comparing the macro-blocks with an interpolation of the luminance data. A consequence of this feature is the extra time required to calculate an interpolation. For example, a 150 MHz DEC Alpha takes approximately 8 ms to make an interpolation of a Quarter-CIF (QCIF) sized luminance component.
The advanced options influence compression significantly in a number of aspects. In the first place, the compression ratio is increased by using more of the advanced options. An influence that is not shown here is the extra compression time for more advanced options. Especially the slowness of the advanced prediction mode makes practical use impossible. Options D and G also require more motion estimation time, as both use a larger search area. The time penalty caused by arithmetic coding instead of Huffman coding is relatively smaller than with the other options, and also depends on the size of the output bitstream.
4.5 Summary
After examination of the different encoding standards and techniques, H.263 is currently the best choice for DVC. This standard not only allows a number of different ways of encoding (intra, inter) in a flexible way, it is developed for low-bandwidth lines and it also uses advanced techniques not found in other standards (such as half-pel estimation and the advanced options). This reserves some capabilities for video encoding in the future, when faster computers are available and the advanced options become more usable than they are now. In the next three chapters, alternatives are discussed for the various components found in an H.263 encoder. After this discussion, in Chapter 8 the implications
of network characteristics on compression are given and in Chapter 9 the implementation and the results of the developed H.263 encoder are discussed.
Chapter 5
two pixels by a Mean Square Error (MSE) and adding them up. The block with the lowest sum of MSEs is considered the match. Now, only the coordinates of this reference block and the difference between this reference block and the macro-block are stored in the video stream. After all the macro-blocks have been processed, the frame is transmitted and the current frame is used as reference frame for the next frame to be encoded. If not applied effectively, motion estimation requires an enormous amount of CPU power at a relatively small reduction of the video stream size. The following calculations display the amount of processing power that is required:

Comparison: To compare two blocks of pixels, the following calculation is used:

    MSE(x, y, k, l) = (1/N^2) * sum_{i,j=0}^{N-1} (F-1(i + x, j + y) - F0(i + k, j + l))^2

Where:
- F0 and F-1 are respectively the current and the previous frame;
- k, l is the position of the macro-block in the current frame (a multiple of N);
- x, y is the position of the reference block in the previous frame.

Estimation: The number of comparisons to compute depends on the frame size. For a frame size of U × V pixels, where U and V are multiples of N, (U - N) × (V - N) comparisons must be made for a single macro-block, and thus

    No. of comparisons = (U - N) × (V - N) × (U/N) × (V/N)

for a whole frame. If we apply this algorithm to the small Quarter-CIF format (176 × 144), we need more than 5 billion pixel comparison operations (including the square operation) per frame, without even including the control operations (such as increasing the index into the image data) that are also required.
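The MSE comparison above can be sketched as follows; frames are stored as lists of pixel rows, and the row-major indexing convention is an assumption of this sketch, not prescribed by the formula.

```python
# Sketch of the MSE block comparison defined above. `prev` and `cur`
# are frames as lists of pixel rows; (x, y) addresses the reference
# block in the previous frame and (k, l) the macro-block in the
# current frame.

def mse(prev, cur, x, y, k, l, n=16):
    """Mean square error between two n x n blocks."""
    total = sum((prev[y + j][x + i] - cur[l + j][k + i]) ** 2
                for j in range(n) for i in range(n))
    return total / (n * n)
```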
5.3.1 SAD
The Sum of Absolute Differences (SAD) replaces the square in the MSE by an absolute value, which is cheaper to compute:

    SAD(x, y, k, l) = sum_{i,j=0}^{N-1} |F-1(i + x, j + y) - F0(i + k, j + l)|

All popular compression implementations found for MPEG, H.261, and H.263 use SAD to compare blocks.
Figure 5.1: SOCR block comparison using 8 × 8 blocks

Because consecutive pixel values can cancel each other out, this method is less precise than the SAD method. However, since it is done both horizontally and vertically, the chance of this occurring is somewhat reduced. The original idea behind this implementation was the reduced time to compare blocks. For more comparisons per macro-block, this method becomes more suitable, as after an initial computation of the complete row and column sums of the frame, it requires N/2 times fewer comparisons than full block comparison. Results of this method are discussed in Section 5.5.5.
To make the block search more realistic, a search window is introduced that limits the block-match algorithm. Reducing the search to 16 × 16 pixels in the neighbourhood reduces the complexity of the example given in Section 5.2 to approximately 6 million pixel comparison operations. For high frame rates and low movement, the search window can even be reduced to 4 × 4, which results in fewer than half a million pixel comparisons per frame. This way of motion estimation is called exhaustive search, because all blocks (within the search window) are examined. Most video compression systems, both software and hardware, use this strategy, as it guarantees that the best-matching block is found and thus gives the highest compression factor.
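Exhaustive search within a window can be sketched as follows, here with the SAD criterion; the clipping to frame borders and the tie-breaking order are choices of this sketch rather than part of any standard.

```python
# Sketch of exhaustive block matching inside a search window. The best
# match for the n x n macro-block at (k, l) in `cur` is searched among
# all positions in `prev` at most `window` pixels away, clipped so the
# candidate block stays inside the frame.

def sad(prev, cur, x, y, k, l, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(prev[y + j][x + i] - cur[l + j][k + i])
               for j in range(n) for i in range(n))

def exhaustive_search(prev, cur, k, l, n, window):
    """Return the (x, y) of the best-matching block in `prev`."""
    h, w = len(prev), len(prev[0])
    candidates = [(x, y)
                  for y in range(max(0, l - window),
                                 min(h - n, l + window) + 1)
                  for x in range(max(0, k - window),
                                 min(w - n, k + window) + 1)]
    return min(candidates,
               key=lambda p: sad(prev, cur, p[0], p[1], k, l, n))
```

For example, a bright pixel that moved one position right and down between frames is recovered as a displacement of (-1, -1) relative to the current block.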
Figure 5.2: Logarithmic search algorithm for 9 × 9 blocks

An alternative to the exhaustive search algorithm is the logarithmic search algorithm, see Figure 5.2. Instead of comparing every block in the search window, it only compares a number of blocks that are evenly distributed over the search area (e.g. one block in the center and eight blocks between the border of the search window and the block in the center). For the best-matching block, this
2 Only luminance data is used, for two reasons: first, the chrominance data is down-sampled, which means that it has a lower resolution than the luminance data; second, it requires less time to use only the luminance part.
step is repeated with the search window set around the pixels that are closer to this block than to the other blocks. This step is repeated until the best-matching block is found among only neighbouring blocks. The block found is used for the motion compensation. The advantage of a logarithmic search algorithm is the reduction in complexity. Instead of a complexity of O(N^2), with N the size of the search window, the complexity is reduced to O(log N). Even for a small search window of 9 × 9 as given in Figure 5.2, the number of required comparisons is reduced from 9 × 9 = 81 to 9 + 8 + 8 = 25. A drawback of the logarithmic search is that not all blocks are used for comparison. It is therefore possible that the algorithm gets "stuck" at a local minimum. Compression tests confirm this behaviour with a slight reduction of the compression ratio compared with the exhaustive search, see further Section 5.5.
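A step-search variant of the logarithmic idea can be sketched as follows, assuming a cost function such as the SAD between the macro-block and the candidate at (x, y). The nine-candidate pattern with a halving step is one common formulation; like the algorithm described above, it can get stuck at a local minimum.

```python
# Sketch of a logarithmic (step) search. Nine candidates around the
# current center are evaluated per step; the search recenters on the
# winner and the step size is halved, instead of testing every
# position in the window.

def log_search(cost, cx, cy, step):
    """Return the (x, y) with the lowest cost found by step search."""
    while step >= 1:
        candidates = [(cx + dx * step, cy + dy * step)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        cx, cy = min(candidates, key=lambda p: cost(*p))
        step //= 2
    return cx, cy
```

On a smooth cost surface the search converges to the minimum in a handful of steps; on a surface with several minima it may settle on the wrong one, which matches the reduction in compression ratio reported above.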
Figure 5.3: Example of interpolation of 2 × 2 pixels

Although half-pixel search increases the size of the search window, it improves prediction significantly. This technique in particular makes H.263 compression superior to H.261 (see Section 3.1). Half-pixel prediction can be combined with both exhaustive and logarithmic search. In the Telenor implementation of H.263, even a combination of full- and half-pixel prediction is used: first, an exhaustive full-pixel search is performed for blocks in the search window. After that, for the best-matching block, a search is done among the eight neighbouring half-pixel interpolated blocks.
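The interpolation of half-pixel positions can be sketched as follows. Coordinates are given in half-pixel units; truncating integer division is used here, whereas H.263 defines its own rounding rules.

```python
# Sketch of bilinear interpolation at half-pixel positions: between
# two horizontal or vertical neighbours the average of two pixels is
# used, and at the centre position the average of four. `frame` is a
# list of pixel rows.

def half_pel(frame, x2, y2):
    """Sample the frame at half-pixel coordinates (x2/2, y2/2)."""
    x, y = x2 // 2, y2 // 2
    dx, dy = x2 % 2, y2 % 2          # 1 when between two full pixels
    taps = [frame[y + j][x + i]
            for j in range(1 + dy) for i in range(1 + dx)]
    return sum(taps) // len(taps)
```

Building a full interpolated luminance plane this way (roughly four times the pixels) is the cost referred to in Chapter 4, where interpolating a QCIF luminance component took about 8 ms.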
Figure 5.4 shows the average X and Y motion vectors per frame. A positive value means the block in the previous frame with the best prediction was to the right of or under the current block, while a negative value means that the best-matching block was found in the previous frame to the left or above the position of the current block.
Figure 5.4: Average length of the X and Y motion vectors for a video sequence.

If these figures are compared with the real video sequence, movements of the head to the left and right are detected by the peaks in the graph of the average X value. The absolute values of the motion vectors that have the highest value for a frame are given in Figure 5.5. For both the X and Y lines, these data give a distorted view of the behaviour in the video sequence. This is caused by a number of predictions in each frame that are uncorrelated with the other motion vectors found in the frame. In this source video sequence, it is unlikely that the maximum motion vector is near 15 pixels between each frame. To explain this phenomenon, a matrix of motion vectors is taken from an arbitrary frame in the encoding process:
( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0,14.5) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
( 0.0, 0.0) ( 0.0, 0.0) (-1.0, 1.0) (-1.0, 1.0) (-0.5, 0.5) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
( 0.0, 0.0) (-1.5, 1.5) (-1.5, 1.5) (-1.0, 1.0) (-1.0, 0.5) (-1.0, 1.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
(-1.5, 0.0) (-1.5, 0.5) (-1.5, 1.0) (-1.5, 1.0) (-1.5, 1.0) (-1.0, 1.0) (-1.0, 1.5) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
(-1.5, 0.0) (-1.5, 0.5) (-1.0, 0.5) (-1.0, 0.5) (-1.5, 1.0) (-1.0, 0.5) (-1.0, 1.5) (-0.5, 0.5) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
( 0.0, 0.0) (-1.0, 0.5) (-1.0, 0.5) (-1.0, 0.0) (-0.5, 0.0) (-1.0, 0.5) (-0.5, 0.5) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) (-0.5, 0.5) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0) ( 0.0, 0.0)
In the fifth column of the first row, a motion vector occurs that is out of line with most other vectors. Examining the rest of the motion vectors shows a maximum motion of 1.5 pixels, while this distorted value gives this frame a maximum motion of 14.5 pixels. These distortions appear in about every frame
Figure 5.5: Maximum length found per frame of X and Y motion vectors for a video sequence.

at arbitrary places (not only at the border motion vectors!). Although thorough research has not been done on these vectors, they are probably caused by tiny distortions in the camera image which are enhanced by the nature of motion estimation: a block that matches the reference block only slightly better is selected instead of the "real" reference block. The selection of the SAD instead of the MSE criterion might also influence this wrong choice by the motion estimation algorithm.
[Figure: fraction of motion vectors with a given size (in full pixels), for the Miss America sequence at 7.5, 15, and 30 fps]
Table 5.1: Search window size for Miss America video stream at 15 fps
The logarithmic algorithm is more flexible: increasing the search window costs only a fraction of the computational costs. It is also shown that if we compare the result without motion estimation (a search window of zero) and with motion estimation, the output video stream is approximately halved with only 5-15% extra processing time required. As with exhaustive search, it is advisable not to make the search window too large: although the calculation costs even decrease, the size of the output stream increases. This phenomenon is explained by the nature of the logarithmic search: the larger the search area, the larger the chance that the algorithm is trapped in a local minimum. The search time is also reduced for blocks that do not match well: during the search, a block comparison is aborted if the preliminary SAD value is already higher than a previously calculated SAD (such as the SAD at position (0,0), which is always calculated first). Comparison of the 7.5 fps and 15 fps video streams shows the decreased encoding time for sequences with higher frame rates. This is caused by the
Figure 5.7: Size of MV against cumulative percentage of MVs within this size
Algorithm     Search window   Output size   Quality (Y / Cb / Cr)   Total time to encode   Time per frame
Exhaustive    [0..0]          23113         37.4 / 38.2 / 37.2      4.4 s                  119 ms
Exhaustive    [-1..1]         14976         37.6 / 38.3 / 37.6      5.0 s                  135 ms
Exhaustive    [-2..2]         11843         37.9 / 38.4 / 37.6      6.5 s                  176 ms
Exhaustive    [-3..3]         10482         38.0 / 38.5 / 37.7      8.4 s                  227 ms
Exhaustive    [-6..6]          9907         38.1 / 38.5 / 37.7      17.3 s                 468 ms
Logarithmic   [0..0]          23113         37.4 / 38.2 / 37.2      4.4 s                  119 ms
Logarithmic   [-1..1]         15873         37.1 / 38.4 / 37.6      4.5 s                  122 ms
Logarithmic   [-3..3]         12556         37.6 / 38.4 / 37.7      4.7 s                  127 ms
Logarithmic   [-6..6]         12170         37.8 / 38.4 / 37.5      4.9 s                  132 ms
Logarithmic   [-12..12]       13881         37.5 / 38.3 / 37.5      4.8 s                  130 ms
Table 5.2: Search window size for Miss America video stream at 7.5 fps
(0,0) motion vector that occurs more frequently in a higher frame rate sequence than in a lower frame rate sequence. For the same reason as mentioned in the previous paragraph, encoding at these higher frame rates takes less time.
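The comparison-abort optimization mentioned earlier (stop a SAD computation as soon as the partial sum exceeds the best value found so far) can be sketched as follows; the per-row abort granularity and the function name are assumptions of the sketch:

```c
#include <stdlib.h>

/* SAD over a 16x16 block with early abort: once the running sum
   exceeds the best SAD found so far, this candidate cannot win, so
   the remaining rows are skipped. */
long sad_abort(const unsigned char *a, const unsigned char *b,
               int stride, long best_so_far)
{
    long s = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            s += abs(a[y*stride + x] - b[y*stride + x]);
        if (s > best_so_far)   /* candidate already worse: abort */
            return s;
    }
    return s;
}
```

Seeding `best_so_far` with the SAD at position (0,0), which is always calculated first, makes most poor candidates abort after only a few rows.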
Halving the frame rate requires increasing the size of the search window to obtain an efficient compression ratio for the new circumstances. For an exhaustive algorithm, this requires doubling the width and height of the search window, which is computationally expensive. Where for 15 fps a search window of [-1..1] seems most efficient4, for 7.5 fps a search window of at least [-2..2] is required. With these search windows, encoding a single frame at 7.5 fps takes 35% more time than encoding a single 15 fps frame. If we compare the CPU usage needed to compress a stream at 15 fps with that for a stream at 7.5 fps, compression of the 15 fps stream takes only 50% more CPU usage than compression of the 7.5 fps stream. If we compare the logarithmic algorithm with the exhaustive algorithm, the logarithmic algorithm performs better under real-time conditions. Compression of one frame at 7.5 fps takes only 10% more CPU time than compression of one frame at 15 fps. So, the CPU time required to compress a 15 fps video is here 80% more than for a 7.5 fps video. In summary, compression of low frame rate videos requires less CPU usage than high frame rate videos (as expected), but this reduction is less than the ratio between the number of frames that are compressed.
For exhaustive and logarithmic search, the SOCR method delivers a video stream that is between 15% and 20% larger than with a full block comparison. The quality, on the other hand, is only 0.3 dB lower. The performance of the SOCR method depends heavily on the search parameters. As the initial build-up of the SOCR table takes most of the time, this method is inefficient for small search windows. The method is also not very suited for half-pixel prediction, as building the initial summation table then takes a factor of eight more processing time. For instance, in the case of a QCIF encoding, it takes nearly 3.5 times as much time
4 This captures 95% of the real motion vectors, and thus gives good compression; see Figure 5.7.
to calculate the summation table as to calculate a luminance interpolation. Situations where the SOCR comparison might work better are circumstances where half-pixel prediction is not used or where multiple frames refer to a single frame. In the latter case, a summation table needs to be built only once. As these circumstances apply to MPEG encoding, this algorithm might improve the speed of that encoding technique significantly. As we concentrate on real-time encoding of DVC videos, this is left for further study.
In Table 5.4, the signal-to-noise ratio is calculated using this method for
two Miss America sequences at 15 and 7.5 fps. The compression used is an exhaustive search with a full search window of [-1..1], the same setting as given in row two of Table 5.2 and Table 5.1.

Table 5.4: Quality of video stream compared with real frame rate of 30.0 fps
Encoding frame rate (fps)   Quality of encoded frames Y / Cb / Cr (dB)   Quality of all frames Y / Cb / Cr (dB)
7.5                         37.6 / 38.3 / 37.6                           34.8 / 38.1 / 36.9
15                          37.9 / 38.5 / 37.7                           36.9 / 38.4 / 37.6
30                          38.0 / 38.5 / 37.8                           38.0 / 38.5 / 37.8
As expected, the SNR reduces significantly for lower frame rates, especially for the 7.5 fps video. To keep the quality of video compression as high as possible, video streams must be compressed at high frame rates. Increasing the speed of the video compression will therefore also increase video quality, as higher frame rates are attained.
5.7 Summary
Motion Estimation, the technique to reduce temporal redundancy between successive video frames, requires a large amount of computing power. To reduce this demand for computing resources, algorithms and parameters must be selected that make motion estimation usable in Real-Time video encoding. Analyses of video data give a good indication of efficient parameter settings. Choosing a search algorithm partly depends on the parameter settings. The exhaustive algorithm is only practically useful for small search windows, while the logarithmic algorithm achieves slightly less compression than its exhaustive counterpart but scales better over different search window sizes. In Chapter 7, a number of techniques are discussed that reduce the number of macro-blocks that are encoded. As Motion Estimation takes a significant amount of time, the techniques used here become more effective when combined with the techniques found in Chapter 7.
Chapter 6
implementation was taken from the MPEG-1 encoder from Berkeley and will be referred to as "Fast DCT". The added IDCT operation was taken from the H.263 decoder and is also known as the Chen-Wang algorithm. This IDCT algorithm will be referred to as "Very fast IDCT" in this chapter.
To compare the different DCT algorithms, a number of tests were performed that measure the quality, the time, and the influence on the compression ratio for combinations of different FDCT and IDCT implementations. The results are given in Table 6.1 through Table 6.3. The source video used is the Miss America sequence of 150 frames at 30 fps. The compression parameters used are half-pixel exhaustive search with a search window of +/- 3 full pixels and a quantization factor of 8. Table 6.1 shows that quality is hardly affected by the different implementations. In fact, the Fast DCT operation even performs slightly better on quality; this is probably caused by rounding and quantization effects, as the difference in quality is insignificant. The fast IDCT operation gives the same results as the standard IDCT operation. Due to the realistic quantization factor of 8, quality is not sacrificed for speed. Table 6.2 gives a good indication of the different (real) transform speeds of the algorithms. The time given is the time that the compression algorithm spends on the FDCT plus IDCT transformations. Using the combination of Fast DCT and Very fast IDCT instead of the standard DCT/IDCT gives an improvement of almost 3.5 times for the (I)DCT transformations. Finally, Table 6.3 shows the size of the compressed video stream for the different implementations. It shows that the standard IDCT and the fast IDCT give the same output size, and examination of both streams shows that they are exactly the same. Using the fast DCT and very fast IDCT combination gives only a 2.5 percent video stream size penalty over the standard DCT/IDCT. These tests show that, using the Fast DCT/Very fast IDCT combination,
a large performance improvement can be attained without sacrificing quality and while hardly compromising the compression ratio.
6.4 Quantization
Quantization is the step in image compression that causes both the high compression ratios and some loss of data. In short, quantization reduces the resolution of the data, as determined by the quantization factor. Although some data may be lost during the DCT (due to fast DCT algorithms and limited storage for transformed coordinates), the main trade-off between compression and loss is made here. Too much quantization causes a heavily distorted video sequence, while too little quantization causes a relatively large amount of data for the kind of video sequence. Blocking artifacts in a decoded video stream are probably an indication of too much quantization. A very large video stream for a sequence that is not significantly better than a video stream with a higher quantization factor indicates too little quantization.

Table 6.4: Quantization versus Quality for Miss America sequence
Quantization factor   Compression ratio   SNR for Y / Cb / Cr (dB)
1                     4.2:1               48.5 / 48.5 / 48.5
2                     21.3:1              44.8 / 42.5 / 44.3
4                     66.8:1              41.1 / 40.6 / 41.2
6                     124.8:1             38.8 / 39.4 / 39.2
8                     191.7:1             37.4 / 38.4 / 37.7
12                    345.3:1             35.7 / 37.4 / 35.9
16                    497.8:1             34.7 / 37.1 / 35.1
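The resolution-reducing step itself can be sketched as a dead-zone quantizer. This follows the general style of H.263 inter quantization (step size 2*QP, mid-rise reconstruction), but the standard's exact formulas differ per coefficient type, so treat this as an illustration only:

```c
#include <stdlib.h>

/* Dead-zone quantizer sketch in the style of H.263 inter coding:
   the step size is 2*QP, and truncation toward zero creates a dead
   zone that maps small coefficients to zero. */
int quantize(int coef, int qp)            /* qp in 1..31 */
{
    return coef / (2 * qp);
}

/* Reconstruction places the value near the middle of the bin. */
int dequantize(int level, int qp)
{
    if (level == 0) return 0;
    int mag = qp * (2 * abs(level) + 1);
    return level > 0 ? mag : -mag;
}
```

For example, with QP = 8 a coefficient of 100 quantizes to level 6 and reconstructs to 104; a coefficient of 10 falls into the dead zone and is lost.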
A comparison of different quantization factors is given in Table 6.4 and Figure 6.1. The tests were performed on a 150-frame encoding of the Miss America sequence. Higher quantization factors yield smaller video streams but also reduce quality. The encoding time for each of these sequences is equal, except for the entropy encoding, which takes longer for larger output video streams. A comparison of the profile outputs of the encodings with quantization factors 1 and 16 shows that the factor-1 encoding takes 57 percent longer due to the extra entropy encoding. In practical video conferencing situations, however, the quantization factor will vary between 4 and 12. Except for the amount of data that must be encoded (and decoded), the quantization level does not influence the speed of the encoder (and decoder). If network bandwidth is not an issue (e.g. the bandwidth is otherwise not used), the lowest possible quantization factor that still fills up the bandwidth may be chosen to optimize bandwidth usage and the resulting video quality.
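The rule "choose the lowest quantization factor that still fits the bandwidth" can be sketched as a lookup over measured compression ratios. The ratios below are those of Table 6.4 (Miss America); the raw bit rate passed in by the caller (e.g. QCIF 4:2:0 at 15 fps, about 4.56 Mbit/s) is an assumption of the sketch:

```c
/* Pick the smallest quantization factor whose expected bit rate fits
   the available bandwidth, using measured compression ratios. */
typedef struct { int qp; double ratio; } QpEntry;

static const QpEntry table_64[] = {   /* from Table 6.4 */
    { 1,   4.2 }, { 2,  21.3 }, { 4,  66.8 }, { 6, 124.8 },
    { 8, 191.7 }, {12, 345.3 }, {16, 497.8 },
};

int pick_qp(double raw_bits_per_sec, double bandwidth_bits_per_sec)
{
    for (int i = 0; i < 7; i++)
        if (raw_bits_per_sec / table_64[i].ratio <= bandwidth_bits_per_sec)
            return table_64[i].qp;
    return 31;   /* fall back to the coarsest quantization */
}
```

For example, with 4.56 Mbit/s of raw QCIF video and a 64 kbit/s channel, this lookup selects a quantization factor of 6.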
Figure 6.1: Plot of compression ratio versus SNR (dB)

One of the optimizations used in video compression is the combination of DCT and quantization (or IDCT and de-quantization). In these so-called scaled-FDCT and scaled-IDCT operations, the quantization is done during the transformation. This is possible because, for example for an FDCT, delivering a scaled version of the actual frequency coordinates requires fewer operations. After this scaled operation, the quantization factor must be adjusted for the scaled DCT. Although a version of a scaled DCT was not implemented for testing, an implementation of a decoder using a scaled IDCT is given in [Bhaskaran95b]. The IDCT used there is the Feig-Winograd algorithm, discussed in [Feig92].
6.5 Summary
The DCT is the most suitable transformation available for video compression. A number of different implementations are available, and tests show that the fast alternatives do not sacrifice quality for speed. Variation of the quantization directly influences the quality and size of the video stream. It is therefore the instrument to fine-tune and adapt compression to user or network constraints. Nevertheless, quantization must be kept within practical limits to prevent an explosion of network bandwidth requirements or a strong degradation of quality.
Chapter 7
zation steps that are otherwise performed on these blocks before detecting that they really are unchanged.
Figure 7.1: Overview of background subtraction

The background subtraction method (see Figure 7.1) first takes a snapshot of the view at the current camera position, without the presence of moving objects. It averages a number of successive frames so that this reference background frame is free of camera noise. This step must be repeated for every new camera position or for large background changes. After this initialization phase, background subtraction may begin. Every captured frame is divided into macro-blocks that are compressed independently by the encoder. For every macro-block, the number of pixels that differ from the background by more than a certain threshold is counted. Only the macro-blocks with more than a predetermined number of foreground pixels are encoded and transmitted to the receiver, to prevent camera noise from being transmitted. The resulting frames are processed by the compression unit and transmitted to the receiver. At the receiver, the encoded macro-blocks of a frame are decompressed and combined with the macro-blocks that have not changed. The advantage of this approach compared with skin selection is that any difference with the background is detected, including shadows, clothes of persons, and objects that persons are holding. The background subtraction method works well in an environment with suitable lighting. But in situations where the person is close to the camera and blocks the light from the background, the background subtraction method behaves considerably worse. Logic in the camera that changes the output brightness in darker circumstances is also unsuitable for this method. Further analysis of these issues might improve background subtraction for these kinds of circumstances, but this is left for further study.
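The per-macro-block foreground test described above (count the pixels that differ from the averaged background by more than a threshold) can be sketched as follows; the function and parameter names are assumptions, not the thesis code:

```c
#include <stdlib.h>

/* Count how many pixels of a 16x16 macro-block at (bx,by) differ from
   the averaged background reference by more than `pixel_thresh`.  The
   block is marked as foreground, and thus encoded, when the count
   exceeds `mb_thresh`; this filters out isolated camera noise. */
int is_foreground(const unsigned char *frame, const unsigned char *bg,
                  int stride, int bx, int by,
                  int pixel_thresh, int mb_thresh)
{
    int count = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++) {
            int p = (by + y) * stride + bx + x;
            if (abs(frame[p] - bg[p]) > pixel_thresh)
                count++;
        }
    return count > mb_thresh;
}
```

Only blocks for which this returns non-zero are handed to the encoder; the rest are reconstructed at the receiver from the unchanged background.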
Figure 7.2: Steps for encoding of one Macro-Block

Although detection of change is already done in the encoder, this is quite cumbersome, see Figure 7.2: first, a motion estimation is made. Then, the difference image is DCT transformed and quantized, followed by a de-quantization and an IDCT. Finally, when the coordinates are all zero, the block is considered "not changed", and no information is transmitted. When change detection is applied, a short cut is taken for blocks that do not change, see Figure 7.3: first, the block is immediately compared with the block at the same position in the previous frame. If the comparison shows that
Figure 7.3: Macro-block encoding with block change detection

the change is below a threshold, the block will not be encoded; otherwise, the block passes through the usual encoding steps. As most blocks do not change, the extra time required to detect the changed blocks is more than compensated by the time it would otherwise cost to process the unchanged blocks in the encoder.
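The short cut can be sketched as a thresholded comparison against the co-located block of the previous frame; the early-out inside the loop mirrors the SAD-abort idea from motion estimation. The threshold semantics and names below are assumptions of the sketch:

```c
#include <stdlib.h>

/* Change-detection short cut: compare the 16x16 macro-block at (bx,by)
   with the block at the same position in the previous frame.  If the
   accumulated absolute difference stays below `threshold`, the whole
   encoding pipeline (motion estimation, DCT, quantization) is skipped
   for this block. */
int block_changed(const unsigned char *cur, const unsigned char *prev,
                  int stride, int bx, int by, long threshold)
{
    long diff = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++) {
            int p = (by + y) * stride + bx + x;
            diff += abs(cur[p] - prev[p]);
            if (diff > threshold)   /* early out: block must be coded */
                return 1;
        }
    return 0;
}
```

In a typical head-and-shoulders scene most blocks return 0, so the cost of this cheap test is paid back many times over by the skipped encoding work.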
7.5 Summary
A number of techniques have been discussed that reduce the input for the encoder. This makes it possible to encode at higher frame rates while maintaining the same quality for the important areas of the screen. Skin detection is based on the detection of human skin colour, and its implementation is fast because the YUV colour space is used. Background subtraction requires a snapshot of the background, but performs better on foreground objects than skin detection. Finally, change detection does a brief comparison of the current frame with the previous frame to detect blocks that have changed.
Chapter 8
a high throughput, but also a low latency to make interactive communication possible. Therefore, the combination of network throughput and latency gives a good indication of the performance of a network. Another classification that is often used is the separation between connection-oriented and connection-less networks. In the first case, a network session is first set up, after which data is transmitted. When the network connection is no longer needed, the session is closed. In connection-less networks, no connections are set up, but information is transmitted directly to the destination. In these networks, every piece of information that is transmitted (called a packet) contains the destination address, and a route to this destination must be found per packet by the network.
8.3.2 ISDN
ISDN is designed for telecommunication between computers, faxes, and telephones. ISDN comes in two connection types: the Basic Rate Interface (BRI) delivers two bidirectional 64 kilobit/s channels and one 16 kilobit/s control channel; the Primary Rate Interface (PRI) delivers 23 64-kilobit/s channels and one 64 kilobit/s control channel. Like POTS, ISDN is connection-oriented. Although ISDN is not as widespread as POTS, it is already available in large areas of Europe. Based on its network capacity, ISDN is a suitable infrastructure for video conferencing applications.
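As a back-of-the-envelope check (using the roughly 252 bytes per frame at 15 fps measured later in Table 9.3 as an assumption), a compressed QCIF stream fits comfortably within a single 64 kilobit/s B-channel:

```c
/* Rough feasibility check: bit rate of a compressed stream given its
   frame rate and average compressed frame size. */
double stream_bits_per_sec(double fps, double bytes_per_frame)
{
    return fps * bytes_per_frame * 8.0;
}
```

For example, 15 fps x 252 bytes x 8 = 30240 bit/s, i.e. less than half of one 64 kilobit/s B-channel.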
8.3.3 Internet
The largest and most widely used network for computer communication is Internet. Internet is actually a collection of interconnected networks that developed in the 90s from a research network for computer scientists into a general public network. On top of its connection-less IP protocol, it supports both unreliable connection-less UDP/IP connections and reliable connection-oriented TCP/IP connections. The throughput and latency of the network vary depending on the locations and the route between two sites, as well as on the time of
day (the load on that part of the network). The reliable TCP/IP does not give guarantees about when a packet will arrive, only that it will arrive, and in order of transmission. Unless average network performance improves significantly and protocols that support reservation of resources are introduced, high-quality video conferencing is not feasible on Internet. But because of its widespread usage, Internet is a fine test-bed for video conferencing applications, such as vic [McCanne95] and CU-SeeMe.
8.3.4 ATM
In the science community, ATM receives a lot of attention. ATM is based on the transmission of 53-byte cells and is connection-oriented. Large ATM networks are being set up, mainly for testing purposes. ATM makes it possible to guarantee a certain bandwidth, which makes it ideal for a number of applications that require such guarantees. Currently, there is not "one ATM network", but this new technology makes it possible to do high-quality communication over long distances. In ATM networks, different service classes are available for different types of data: Constant Bit Rate (CBR), Real-Time Variable Bit Rate (Real-Time VBR), non-Real-Time Variable Bit Rate (non-Real-Time VBR), Unspecified Bit Rate (UBR), and Available Bit Rate (ABR). These service classes influence both the network bandwidth and the latency characteristics of the communication link. For video compression, Real-Time VBR is most appropriate. In the Pegasus lab, a number of ATM devices are available, such as workstation interface cards, an ATM network switch, video grabbers (AVAs) which send video over ATM, and video displayers (ATVs) which show video streams from an ATM network.
Inter compression without motion estimation: This encoding strategy compares the current frame with the previous frame. The difference between these frames is compressed. This encoder may require some intra frame encodings, for instance for the first frame or for blocks that have changed a lot.

Inter compression with motion estimation: The most drastic compression strategy is the use of motion estimation within inter compression. Results show better compression for sequences with high motion or low frame rates (see Section 5.5).
Current software encoders available for Internet, such as vic [McCanne95] and the INRIA H.261 encoder [Turletti93], are primarily based on intra-only encoding, sometimes combined with motion detection to skip macro-blocks that have not changed. The main reasons for using only intra encoding are that packet loss is high, which makes inter encoding less appropriate, and that intra encoding takes less time to compress. For network connections with low bandwidth, such as POTS and ISDN, inter encoding becomes more important, and therefore a fast H.263 encoder is essential for video conferencing over these networks.
8.6 Latency
Before the implementation of the encoder is discussed, first one of the constraints named in Chapter 1 will be considered: latency. The latency in a video conference determines its usefulness for interactive communication. If latency is high (larger than, for instance, 1 second), visual reactions might not be understood or might be misinterpreted. On the other hand, an unnecessarily low latency constraint might make a video conferencing system too expensive to implement, as better networks and faster computers are required. To give an overview of where latency occurs, the total latency is divided into a number of smaller latency components: Video Grab Latency: this is the latency between the actual time something happens before the camera and the time the data of this frame is received in the computer. Because most intermediate steps are done here
in hardware (in the camera or the video grabber device), the maximum time is the time between two frame grabs. Compress Latency: this is the time it takes to compress a frame. This latency depends heavily on the encoder that is used and the performance of the system it runs on. If the encoder uses all computing time for encoding (and compresses at the maximum frame rate it can handle), the latency is the time it takes to compress one frame. If the encoder works at a frame rate of 10 fps, the compress latency is 1/(10 fps) = 100 ms. Network Latency: this is the time required before data transmitted from one end of the network arrives at the other end. This latency depends heavily on the kind of network used. For instance, for some ATM implementations this latency is negligible compared to the compress latency. Other networks, like Internet (TCP/IP), may have latencies that depend heavily on the network traffic. Measurements have shown that the network latency between Enschede (NL) and Michigan (USA) could be up to 10 seconds. Decompress and display latency: although both decompress latency and display latency could be named independently, for ease of use they are combined. Decompression requires less time than compression (because motion estimation searches need not be done at decoding), and the display latency is also small. The two main contributors to the total latency are the compress and network latency. If frames are encoded at a high frame rate (> 10 fps), the compress latency will not be high enough to disturb the video interaction. Network latency is difficult to influence, but on some networks (ATM) it may be specified in the Quality of Service call during communication set-up. In the case of Internet, other protocols must be used that handle loss and corruption of packets better for video data than TCP/IP does. A final solution to reduce latency is to overlap the compression and the transmission of a frame.
In this case, it is not required to wait until a frame is fully compressed. In fact, and this will not be discussed further, this is already done by the H.263 encoder, which outputs bytes immediately after their contents are known.
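The latency components above can be combined in a small model. The compress-latency term 1/f follows the text; the figures in the usage example below are illustrative assumptions only:

```c
/* Total one-way latency from the four components described in the
   text.  At a steady frame rate f, the compress latency is 1/f. */
double total_latency_ms(double grab_ms, double fps,
                        double network_ms, double decode_ms)
{
    double compress_ms = 1000.0 / fps;   /* 1/f seconds, in ms */
    return grab_ms + compress_ms + network_ms + decode_ms;
}
```

For a 10 fps encoder with an assumed 33 ms grab, 20 ms network, and 15 ms decompress/display latency, the model gives 33 + 100 + 20 + 15 = 168 ms total, well under the 1-second bound mentioned above.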
8.7 Summary
To transmit video over a network, the compression strategy is determined by the capabilities of the network and the computing power that is available. The more bandwidth is available or the more loss occurs, the more intra frames (as opposed to inter frames) must be transmitted. The protocol involved also determines the choice between inter frames and intra frames. Finally, the latency determines whether video conferencing is a useful way of interaction. For fast networks, the latency is mainly determined by the latency of the compression, which is at most the time to encode one frame.
Chapter 9
Optimization flags are useful for making the code faster without spending much time rewriting inefficient implementations. During testing of the functionality of the encoder itself, it may not be wise to choose "heavy" speed optimization: some optimization options make it harder to find bugs in the code, and compilation takes more time. The clearest example of the choice between "debug-ability" and speed is the use of the debug compiler flag (in gcc, this is flag -g), which compiles symbol information into the code. Using this option makes it easier to find bugs with the help of the debugger, but it has a dramatic effect on the performance of the code. Another flag that has some impact on the speed is the profiling capability (in gcc, this is flag -p), but without this option, it is difficult to monitor the real performance of parts of the code.
The most useful compiler options for increasing performance are the optimization flags (in gcc, the -O flag, which may be followed by a number indicating how much optimization must be done). The native compiler also has similar optimization options that improve speed without much effort (for the DEC Alpha, the options -O5 -migrate make efficient code). The use of two different compilers makes it possible to compare not only the compilers themselves, but also the parts of the code that might need optimization. For instance, although the gcc compiler generally performed better than the native cc compiler for the DEC Alpha, the two profiles showed that cc created code for a procedure with a lot of if instructions that was almost five times faster than the code gcc generated. Rewriting this code results in a more balanced performance for both compilers.1
9.2.2 Fine-tuning
Just using the optimization flags of the C compiler will optimize the code, but much more efficiency is achieved by hand-tuning the code to a specific machine platform. This may make the code more efficient for some architectures than for others, but most techniques applied here, like copying data in the largest possible memory register size, will optimize the code for any architecture. The best technique to improve the speed of the code is to optimize the parts of the code that the encoder spends the most time in. This is done by generating profile data on a representative run of the encoder. For video encoding, using a representative run is especially important because the frame rate strongly determines which parts of the code are used most. For instance, high frame rate encodings rely heavily on the DCT operation, while lower frame rate encodings spend more time in motion estimation. The profile information of the H.263 encoder showed that the encoder spent a lot of time in the motion estimation, the DCT and IDCT, and copy operations. Therefore, these functions were the primary target for optimization.
1 Another option is to use a mix of two compilers to generate the fastest code. However, this approach makes the code very specific to a system and requires continuous monitoring of the speed of each function.
The abs function calculates the absolute value and, after compilation, it is inserted in-line, so no function-call overhead is introduced. The new, optimized code is given below:
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-(*curr++)); ii += 2;
sad += abs(*ii-*curr);
As both pointers must be updated for the next line anyway, resetting the pointers does not cost extra time. Splitting up the calculation of the sad value into multiple instructions and replacing the indexed pointer by a moving pointer improved the speed of the procedure by 40%. For a procedure that took more than 20% of the CPU time for encoding (for some full-search options even more than 60%), this reduction is significant.
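For reference, the row computation above can be packaged as a self-contained routine (written here as a loop; the encoder unrolls it as shown above). The stride of 2 on ii reflects walking a half-pixel interpolated reference line, so only full-pixel positions are compared:

```c
#include <stdlib.h>

/* One 16-pixel row of the SAD computation: `curr` walks the current
   block, `ii` walks the interpolated reference line with stride 2.
   The encoder unrolls this loop by hand, as shown in the text. */
long sad_row16(const unsigned char *ii, const unsigned char *curr)
{
    long sad = 0;
    for (int i = 0; i < 15; i++) {
        sad += abs(*ii - (*curr++));
        ii += 2;
    }
    sad += abs(*ii - *curr);   /* last pixel: no pointer advance needed */
    return sad;
}
```

Keeping a straightforward indexed version of the same routine around is a cheap way to verify that such hand optimizations did not change the result.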
Table 9.1: Comparison of Miss America encoding (75 frames) with and without motion detection (MD)

Stream          Pixel threshold   MB threshold   Size (bytes)   Quality (SNR) Y / Cb / Cr   Encode time
Original        -                 -              15786          36.1 / 38.3 / 37.3          6.6 s
Encoding + MD   2                 2              15564          36.1 / 38.3 / 37.3          4.7 s
Encoding + MD   2                 4              15157          35.8 / 38.2 / 37.1          4.1 s
Encoding + MD   4                 2              14640          35.7 / 38.2 / 36.9          3.8 s
Encoding + MD   4                 4              13658          34.5 / 38.1 / 36.1          3.1 s

An implementation was made to examine this method, and the results are given in Table 9.1. The Miss America sequence is encoded for 75 frames at 15 fps with a logarithmic search window of [-2..2] and a quantization factor of 8. Initially, as pixel threshold, an absolute luminance difference of more than two was chosen; for the macro-block threshold, more than two changed pixels per macro-block was chosen. With this setting, approximately 60% of the macro-blocks on the screen were used for encoding, which seems correct for this DVC-type video. This test shows that quality is not sacrificed for speed, probably due to a good threshold setting. The size of the stream produced using change detection is also not affected in a negative way, and speed is improved by almost 30% (and this includes the time of the change detection procedure itself). Tests with higher threshold values give higher speed, but at the cost of some quality. Two tests, with respectively the macro-block and the pixel threshold set to four, encode approximately 45% and 40% of all macro-blocks. In the reconstructed video, some artifacts of these settings are visible: some background pixels are not restored when the person in the video moves over them. A final test with both thresholds set to four gives even worse results. Only 25% of the macro-blocks are encoded, and the results are especially bad on the border between the face and the background. The change detection method used in IVS is useful for encoding when the right thresholds are chosen.
It is also possible to manipulate the threshold values to reduce compression time. Finally, the time needed to compress a frame can be estimated by looking at the number of macro-blocks that are marked as changed.
taking every second frame from the A stream.

Table 9.2: The streams that are used for compression

  Stream  Video         Capture frame rate  No. of frames
  A       Miss America  30 fps              150 frames
  B       Miss America  15 fps              75 frames
  C       Miss America  7.5 fps             38 frames

The results are given in Table 9.3. In column 3, the average time is given for compression of a frame on a DEC Alpha 150 MHz. The time to encode a frame is slightly longer for streams with lower frame rates. This is logical if we look at the movement between two consecutive frames: for the C stream, the movement is up to 4 times larger than for the A stream. Therefore, more blocks must be compared to find the best-matching block. For the A stream, at most 9 block comparisons are done, for the B stream at most 9 + 4 = 13 comparisons, and for the C stream at most 9 + 9 = 18 comparisons per macro-block. The time to encode a frame at 30 fps is actually larger than the time that is available, as 30 fps means that at most 1 s / 30.0 fps = 33.3 ms may be used per frame. Therefore, real-time encoding at 30.0 fps is not feasible on this system. If the frame rate is dropped to 15.0 fps, real-time encoding is just possible. The size per frame, given in column 5, also shows that much more information is transmitted between two frames in the lower frame rate sequences than in a higher frame rate sequence. Nevertheless, as 30 frames per second must be transmitted for the A stream against 7.5 for the C stream, only double the bandwidth is required for a video stream with four times the frame rate.

Table 9.3: Compression results for the three streams

  Stream  Motion Estimation       Frame encode  Quality (SNR)       Frame size
          parameters              time          Y / Cb / Cr         (bytes)
  A       Exhaustive [-0.5,0.5]   59 ms         38.0 / 38.5 / 37.7  162
  B       Logarithmic [-1,1]      60 ms         37.3 / 38.5 / 37.6  252
  C       Logarithmic [-2,2]      66 ms         37.7 / 38.3 / 37.4  325

In Table 9.4, the encoding time is split into a number of components.
The Motion Estimation component includes the search algorithm with its block-matching algorithm, a luminance interpolation function required for half-pixel prediction, and the change detection algorithm discussed in Section 7.4. The DCT & Quantization component contains the fast DCT algorithm discussed in Chapter 6 and a quantization-by-8 function. In fact, the quantization function also delivers some information for the variable-length encoder, as this was efficient to combine, but the extra time this takes is negligible. The reconstruction component includes all operations that are required to generate a copy of the reconstructed frame that was just encoded. This is required for making a good motion estimation for the next frame. It is also easy to use this copy at the encoder to verify the quality of the video stream. In some video compression algorithms
where good quality is transmitted and Intra frames are transmitted regularly (such as MPEG), this step is omitted and the source video stream is used as reference frame instead of the reconstructed frame. For low bit-rate encoding, however, these circumstances do not apply. The variable-length encoding component consists of the Huffman encoding and the delivery of an H.263 stream. The I/O component is the time required to read a source frame from disk and write the output stream. Finally, the control component is the time needed to manage the encoding process itself. By selecting a good parameter setting for compression, a nicely balanced encoding is achieved. Even the time to do more motion estimation at lower frame rates is spread across the different components, because more macro-blocks are encoded at lower frame rates, and so components other than motion estimation will also take more time.

Table 9.4: Profile of encoding

  Stream  Motion      DCT &         Recon-     VLE    I/O   Control
          Estimation  Quantization  struction
  A       39.6%       21.3%         18.4%      8.4%   3.7%  8.6%
  B       37.2%       21.7%         21.1%      8.2%   3.1%  8.7%
  C       41.7%       17.8%         19.5%      9.9%   3.2%  7.9%
9.5 H.263 Compression
Figure 9.1: Model for a real-time video compression application

In this section, a model is discussed that allows real-time video compression under varying system and network loads. This model is given in Figure 9.1. The implementation of this model is also used as a sample application to show that the R263 compression library can be applied and controlled in an easy way. The Compression Control Unit (CCU) is the central unit that directs the compression. It receives QCIF frames from the camera and determines the time between the arrival of this frame and the previous frame that was received. Based on this information and control information from the network, it determines the compression parameters used for compression of this frame. The CCU may also decide not to compress this frame, because network bandwidth or processing time is not sufficient. Parameters that may be set by the control unit are the type of frame encoding (inter or intra coding), the search algorithm (Exhaustive or Logarithmic), and the size of the search window (larger for lower frame rates and smaller for high frame rates). It is also possible to deliver an array of the macro-blocks that must be encoded, which makes it possible to apply a skin detection or background subtraction technique prior to the compression. The frame grabber consists of a camera, an AVA connected to an ATM network, an ATM interface card, and a thread in the application that grabs frames and converts them to QCIF frames. Although the AVA delivers YUV-type data, it must be converted from an 8x8-based tile format to a scan line-order format, and the chrominance data must also be down-sampled from 4:2:2 to 4:2:0 (see Section 2.2.2). When the CCU is finished processing a frame, it receives a new frame together with a number indicating how many frames have been skipped and the time between the last two frame grabs. This information is used to calculate the appropriate compression parameters, using the findings in Chapter 5. The compression engine converts the QCIF frame using the compression parameters specified by the CCU and delivers a bitstream of compressed data. This bitstream is transmitted over the network using TCP/IP. The decoder at the receiving end is the standard tmn-1.7 decoder, which is available under the GNU License. It is adapted to receive H.263 streams over TCP/IP sockets and uses the same IDCT operation that is found in the encoder. The decoder uses much less time than the encoder (almost 3 times less), and most of it (30%) is actually spent on dithering and not on the decoding itself.
To show the performance of the CCU on a non-dedicated system, a video was taken with low, but representative, movement. In this video, there are some seconds where there is hardly any movement and some seconds with a little more movement than in the rest of the sequence. These characteristics are directly related to the average number of frames that can be compressed. The output of the CCU is given here:
10.00 fps | 1828.25 bpf | 18.28 Kbps
 8.00 fps | 2077.06 bpf | 16.62 Kbps
 8.00 fps | 2325.75 bpf | 18.61 Kbps
 8.50 fps | 2217.88 bpf | 18.85 Kbps
10.00 fps |  838.85 bpf |  8.39 Kbps
10.50 fps |  671.71 bpf |  7.05 Kbps
 8.50 fps | 1307.94 bpf | 11.12 Kbps
 8.50 fps |  672.53 bpf |  5.72 Kbps
12.00 fps |  250.42 bpf |  3.01 Kbps
10.50 fps | 1136.33 bpf | 11.93 Kbps
 8.00 fps | 1540.50 bpf | 12.32 Kbps
 8.00 fps | 2565.44 bpf | 20.52 Kbps
 9.00 fps | 1592.00 bpf | 14.33 Kbps
 6.00 fps | 4087.25 bpf | 24.52 Kbps
 8.00 fps | 1413.75 bpf | 11.31 Kbps
 8.50 fps |  757.59 bpf |  6.44 Kbps
 7.00 fps | 2166.14 bpf | 15.16 Kbps

Every row gives a summary of the compression results that are achieved in a predetermined interval, which is set to 2 seconds in this example. The first column gives the number of frames that are compressed per second (excluding skipped frames), and the second and third columns give the average number of bits per frame and the number of kilobits per second that is transmitted. This result shows that on a non-dedicated machine, video compression is possible at a medium frame rate. For a computer running a Real-Time operating system, the actual encoding frame rate will be higher, as the scheduler can give the encoder more time to compress. Then, the results given in Section 9.4 are feasible. The test also shows that the size of the video stream is low: with a 28.8 Kbps modem, the stream could be transmitted easily. For ISDN lines of 64 Kbps, this test gives good prospects for transmission of video and audio combined over one line. It might turn out that audio, and not video, is the bottleneck in video conferencing transmissions over low-bandwidth networks.
9.6 Summary
In this chapter, the results of the R263 encoder are presented. Selection of suitable compression parameters, optimization of the code both by automatic tools and by hand, and the implementation of advanced video techniques show that Real-Time encoding of QCIF frames will be possible in the near future on desktop PCs. To make the encoder useful for a variety of applications, it can be used as a library function. A sample application that uses this library to compress video based on system load is also implemented.
Chapter 10
Conclusion
Technical developments in the computer industry have created a situation where, under some constraints, video communication becomes feasible. Unfortunately, video communication requires many resources, such as processing speed, camera and screen resolution, but above all network bandwidth. As some of these developments lag behind the others required for ideal video communication circumstances, sophisticated algorithms must be applied to overcome technical constraints. One of these techniques, which reduces network bandwidth at the cost of computing power, is video compression. Compression of a video stream is done either frame by frame or on the stream as a whole, where compression of the current frame depends on the compression of the previous frames. Although frame-by-frame compression methods, such as Motion-JPEG, are popular, much more compression can be attained by compressing the video as a whole. Motion Estimation is the proven technique to reduce the temporal redundancy between the frames. Unfortunately, when it is applied with the wrong set of parameters, the video compression ratio is too small, or the computing power required is too high. Another factor that must be taken into account is the quality of the video stream that is transmitted, as lossy compression algorithms are used here. This research has determined the appropriate settings (or parameters) for different video streams to be compressed most effectively. As this research is aimed at low-bandwidth communication, the choice for the H.263 standard is obvious: the compression ratio is higher than that of MPEG for low-bandwidth communication at the same quality. H.263 is also preferable over H.261, because more advanced techniques are integrated in the standard. Examination of the advanced options of H.263 showed that some of these options have little effect on the compression ratio, or that they do improve the compression ratio but require too much computing power.
Half-pixel prediction, not found in the H.261 standard, is used, however. To make the H.263 implementation usable, drastic steps had to be taken to improve performance. Compared to the Telenor functional test implementation of H.263, some parts are rewritten, replaced, or added: the discrete cosine transformations, the motion estimation algorithms, and most of the copying and control code.
By using high frame rates, the compression latency that occurs is kept as small as possible: the maximum latency here is equal to the time it takes to compress a frame. During encoding, compressed data that is generated is immediately ready for usage, and the application is not required to wait until the full frame is compressed. In this way, the total latency between video grabbing and display is kept as low as possible to improve interaction. Three new techniques were tested to focus compression on the most important part of the picture, the person in the video: skin selection, background subtraction, and change detection. A fast implementation of skin selection is developed that only requires one test per pixel for determining skin color. The background subtraction uses a previously stored image of the background to determine foreground objects that are more important than the background. The final method is change detection. Research has shown that most macro-blocks in the video frame do not have motion, but normally this is detected only after quite a number of complex operations. Change detection determines these blocks in advance so that only the remaining blocks are fed into the encoder. Application of change detection shows a significant improvement of the speed of the encoder. The encoder is implemented as a library function, which is easy to use and to integrate into other applications. A sample application under UNIX, called the Compression Control Unit, is also implemented that uses this library to compress images from the AVA and to transmit them over TCP/IP. This application determines the time between the last two received frames and applies the best compression parameters. Results of this application show that QCIF video (176x144 pixels) can be transmitted over POTS networks (28.8 Kbps) at 10 fps, using a (non-dedicated) desktop PC (Pentium 150 MHz). Profiling information shows that dedicated machines, or systems that run real-time operating systems, will compress at up to 15-20 fps¹.
When faster computers are available, this encoder will be suitable for real-time (25-30 fps) transmission over ISDN. This research has shown that networks with large bandwidth (> 1 Mbps) are not a requirement for video communication: new, advanced standards such as H.263 are able to compress video streams drastically if enough computing power is available. Although dedicated hardware implementations are the fastest solution, this work shows that the more flexible way of software encoding is a serious option. With the introduction of new processors capable of handling more data in parallel (multimedia instructions), and faster chip technology, the prospects of real-time software video compression are more than hopeful. I therefore recommend further research on the usefulness of this new generation of processors for video encoding.
¹ Numbers taken for sequences with moderate movement: when there is no movement at all, the encoder already compresses at the maximum of 25 fps.
Chapter 11
Acknowledgements
A large number of people have helped me during my study in one way or another. I will try to give everyone who helped me some credit, but the order of the credits given here is irrelevant, as classification of the different kinds of "help" might be a master thesis on its own. I would like to thank my graduation committee: Prof. Sape Mullender, for giving me such an interesting assignment, Peter Bosch for being my first first "begeleider" (I bought Elements of Style!) and of course Lars Vognild for being my second first "begeleider". I would also like to thank the other group members: Arne, Martijn, Paul, Pierre and Tatjana, and everyone I forget. I also like to thank everyone who has given me work as a student-assistant: Martin Beusekamp, Mr. Bonnema, Hans Scholten, Albert Schoute, and everyone else. I would also like to thank all students who made working at SPA inspiring and pleasant. Thanks to Erwin: the word "Hacker" is named after you. Thanks BJ for your excellent taste in music. Thanks Mars for your AVA code. Thanks Robert (at least somebody else also working on H.263). And not to forget: Eelco, Berend, Rob, Tjipke (how about all your schedules?), Jing-Xu (success with your promotion), Nick, Michiel, and all others. There are a lot of people who helped me enormously, whom I have never met or seen. With the result of my work, we might have a video conference in the future. Of course I would like to thank Karl Lillevold for making the H.263 code freely available. Without him, it would have taken several more months before my work was finished. I also thank him for his clear implementation of the code. I also want to thank Leonid Kasperovich for an interesting discussion on motion estimation in H.261; your encoder looks promising! Also thanks to Burkhard Neidecker-Lutz, I will try to give you a copy. In particular I would like to thank Edwin, with whom I worked for quite some courses. However, Coca Cola (TM) is it!
The other part of my life at Twente can be summarized in a six-letter word: KRONOS. Thank you all! Training has been the best distraction there is, and I will miss you. Finally, and last but definitely not least, I would like to thank my family for supporting me during my study, in every way possible. Even in a small country like the Netherlands, Twente is a long way from Schagen, and the Dutch public transportation does not always help to bridge the gap (and neither does the Secretary of Education).
Roalt Aalmoes