Jens-Rainer Ohm
Chair and Institute of Communications Engineering, RWTH Aachen University
Melatener Str. 23, 52074 Aachen, Germany
phone: + (49) 241-80-27671, fax: + (49) 241-80-22196, email: ohm@ient.rwth-aachen.de
web: www.ient.rwth-aachen.de
ABSTRACT

The market for digital video is growing, with compression as a core enabling technology. Standardized methods such as motion-compensated DCT coding, which have reached a level of maturity over decades, dominate today's products. During the past years, another wave of innovation has occurred in video compression research and development, which leads to the assumption that the field is still far from reaching its final bounds. Key factors are further improvements in motion compensation, better understanding of spatio-temporal coherences over shorter and longer distances, and more advanced encoding methods, as e.g. implemented in the new Advanced Video Coding (AVC) standard. While the adoption of successful research trends into standardization and products seems to occur almost seamlessly, new trends are already showing up which may lead to more paradigm shifts in the video compression arena. In general, the interrelationship between transmission networks and compression technology bears many problems yet to be solved. Efficient scalable representation of video is becoming more interesting now, providing flexible multi-dimensional resolution adaptation to support various network and terminal capabilities and better error robustness. This has finally become possible by the advent of open-loop temporal compression methods, denoted as motion-compensated temporal filtering (MCTF), which are presently investigated for standardization. More improvements seem to be possible in the fields of motion compensation and texture representation, but new application domains are also emerging. 3D and multi-view video seems likely to see more realistic use in applications, due to the availability of appropriate displays. Mobile and sensor networks are raising a future demand for low-complexity encoders. The talk will analyse some of these trends and investigate perspectives and potential developments.

1. INTRODUCTION

The most important series of recommendations and standards for video compression were defined by groups of the ITU (ITU-T Rec. H.261/262/263/264) for application domains of telecommunications, and by the Moving Pictures Experts Group (MPEG) of the ISO/IEC (ISO standards MPEG-1/-2/-4) for applications in computers, consumer electronics, entertainment etc. They reflect leading-edge research results of their respective time. There are commonalities between the standards defined by both groups, and some work has also been done jointly. For example, ITU-T Rec. H.262 is identical to the ISO standard MPEG-2, and ITU-T Rec. H.264 is identical to part 10 of the ISO standard MPEG-4, 'Advanced Video Coding'.

Proprietary solutions beyond the open standards also exist, but mostly use very similar methods as the open standards, which were developed by strong groups of researchers both from industry and academia. The basic principle behind all compression standards relevant today in Internet transmission, broadcast and storage of digital video is the so-called hybrid coding approach: a combination of motion-compensated prediction over the time axis with spatial (2D) block transform coding of the prediction error residual, or of newly encoded intraframe information. All necessary components such as motion vectors, switching modes and transform coefficients are typically variable-length coded for maximum compression performance. Advances have been made over time in all those elements and tools of video codecs, their interrelationships becoming better understood. As an example of recent developments, the youngest member in the series of open standards, known as H.264 or MPEG-4 part 10 "Advanced Video Coding" (AVC), is described in more detail in the following section. Other standards, like MPEG-4 part 2 "Visual", target different functionalities, such as object-based coding (arbitrary shape) or scalability of bit streams. Section 3 reflects more recent progress in Scalable Video Coding (SVC) standardization, which allows more flexibility than ever achieved before in post-compression adaptation and tailoring of video streams, supporting specific needs of applications, networks and terminals, while retaining good compression performance. In section 4, some visible and possible trends for future developments in video compression are discussed.

2. ADVANCED VIDEO CODING

The demand for ever increasing compression performance has initiated the creation of a new part of the MPEG-4 standard, ISO/IEC 14496-10: 'Coding of Audiovisual Objects Part 10: Advanced Video Coding', which is identical in text with ITU-T Rec. H.264. The development of AVC was performed by the Joint Video Team (JVT), which consists of members of MPEG and of the ITU-T Video Coding Experts Group. To achieve the highest possible compression performance, any kind of backward or forward compatibility with
earlier standards was given up. Nevertheless, the basic approach is again a hybrid video codec (MC prediction + 2D transform), as in the predecessors. The most relevant new tools and elements are as follows:

− Motion compensation using variable block sizes (denoted as sub-macroblock partitions) of 4x4, 4x8, 8x4, 8x8, 8x16, 16x8 or 16x16; motion vectors are encoded by hierarchical prediction starting at the 16x16 macroblock level;
− Motion compensation is performed with quarter-sample accuracy, using high-quality interpolation filters;
− Usage of an integer transform of block size 4x4, optionally switchable to 8x8. This is not a DCT, but can be interpreted as an integer approximation thereof. The transform is orthogonal, with appropriate normalization to be observed during quantization. For the entire building block of transform and quantization, implementation with 16-bit integer arithmetic precision is possible both in encoding and decoding. To compensate for the negative impact of possibly small block sizes on compression performance, extensive usage of spatial inter-block prediction is made, and a sophisticated de-blocking filter is applied in the prediction loop.
− Intraframe coding is performed by first predicting the entire block from the boundary pixels of adjacent blocks. Prediction is possible for 4x4 and 16x16 blocks, where for 16x16 blocks only horizontal, vertical and DC prediction (each pixel predicted by the same mean value of boundary pixels) is allowed. In 4x4 block prediction, direction-adaptive prediction is used, where eight different prediction directions are selectable. The integer transform is finally applied to the residual resulting from intra prediction.
− An adaptive de-blocking filter is applied within the prediction loop. The optimum selection of this filter highly depends on the amount of quantization error fed back from the previous frame. The adaptation process of the filter is non-linear, with the lowpass strength of the filter steered firstly by the quantization parameter (step size). Further parameters considered in the filter selection are the difference between motion vectors at the respective block edges, the coding mode used (e.g. stronger filtering is made for intra mode), the presence of coded coefficients, and the differences between reconstruction values across the block boundaries. The filter kernel itself represents a linear 1D filter with directional orientation perpendicular to the block boundary; impulse response lengths are between 3 and 5 taps, depending on the filter strength.
− Multiple reference picture prediction allows defining references for prediction of a macroblock from any of up to F previously decoded pictures; the number F itself depends on the maximum amount of frame memory available in a decoder. Typically, values around F=5 are used, but in some cases F=15 could also be realized.
− The bi-directional prediction approach (denoted as bi-predictive or B-type slices) is generalized compared to previous standards. In particular, this allows defining structures of prediction from two previous or two subsequent pictures, provided that a causal processing order is observed. Furthermore, follow-up prediction of B-type slices from other B-type slices is possible, which e.g. allows implementation of a B-frame based temporal-differences pyramid. Different weighting factors can be used for the reference frames in the prediction (formerly, this was restricted to averaging by weighting factors of 0.5 each). In combination with multi-reference prediction, this allows a high degree of flexibility, but could also incur less regular memory accesses and increased complexity of the encoder and decoder.
− Two different entropy coding mechanisms are defined: Context-adaptive VLC (CAVLC) and Context-adaptive Binary Arithmetic Coding (CABAC). Both are universally applicable to all elements of the code syntax, based on a systematic construction of variable-length code tables. By proper definition of the contexts, it is possible to exploit non-linear dependencies between the different elements to be encoded. As CABAC is a coding method for binary signals, a binarization of multi-level values such as transform coefficients or motion vectors must be performed before it can be applied; four different basic context models are defined, where the usage depends on the specific values to be encoded [1].
− A Network Abstraction Layer (NAL) is defined for simple interfacing of the video stream with different network transport mechanisms, e.g. for access unit definition, error control etc.

The key improvements are indeed made in the area of motion compensation. The sophisticated loop filter allows a significant increase of subjective quality at low and very low data rates, and also compensates the potential blocking artifacts produced by the block transform. State-of-the-art context-based entropy coding drives compression to the limits. On the other hand, the high degrees of freedom in mode selection, reference-frame selection, motion block-size selection, context initialization etc. will only provide significant improvement of compression performance when appropriate optimization decisions, in particular based on rate-distortion criteria, are made by the encoder. Such methods have been included in the reference encoder software (Joint Model, JM) used by the JVT during the development of the standard.

The combination of all these different methods has led to a significant increase of compression performance compared to previous standard solutions. Reductions of the bit rate at the same quality level by up to 50% compared to H.263 or MPEG-4 Simple Profile, and up to 30% compared to MPEG-4 Advanced Simple Profile, have been reported [2]. On the other hand, the complexity of encoders must be increased significantly as well to achieve such performance. Regarding decoder complexity, the integer transform and the usage of systematic VLC designs clearly reduce the complexity compared to previous (DCT-based) solutions, while memory accesses become more irregular due to the usage of smaller motion block sizes and the possibility of multi-frame access. The loop filter and the arithmetic decoder also add complexity.
3. SCALABLE VIDEO CODING

Video and motion pictures are frequently transmitted over variable-bandwidth channels, both in wireless and cable networks. They have to be stored on media of different capacity, ranging e.g. from memory sticks to next-generation DVD. They have to be displayed on a variety of devices, ranging from small mobile terminals up to high-resolution projection systems. Scalable video coding (SVC) schemes are intended to encode the signal once at the highest rate and resolution, but enable decoding from partial streams in any situation, depending on the specific rate and resolution required by a certain application. This enables a simple and flexible solution for transmission over heterogeneous networks, additionally providing adaptability for bandwidth [...] motion-compensated prediction and block transform encoding. A theoretical investigation shows that this is mainly caused by the recursive structure of the prediction loop, which causes a drift problem whenever incomplete information is decoded [3][4]. As a consequence, wider acceptance of scalable video coding in the market of its prospective applications had never occurred in the past.

Within the last 5 years, a breakthrough was made in developing open-loop concepts, designated as motion-compensated temporal filtering (MCTF). Originally based on wavelet concepts, any infinite recursions in encoding and decoding are abandoned. This has finally enabled the implementation of highly efficient scalable video codecs. As a consequence, MCTF schemes can provide flexible spatial, temporal, SNR and complexity scalability with fine granularity over a large range of bit rates, while maintaining very good compression performance. The inherent prioritization of data in this framework leads to added robustness and considerably improved error concealment properties as a by-product.

Early implementations of MCTF did not provide the capability for perfect reconstruction under constraints of arbitrary motion vector fields [5]. A very efficient implementation of pairs of bi-orthogonal wavelet filters is made by the lifting structure [6]. The first step of the lifting filter is a decomposition of the signal into its even- and odd-indexed polyphase components. Then, the two basic operations are prediction steps P(z) and update steps U(z). The prediction and update filters are primitive kernels, typically of filter lengths 2-3. As the lifting method is inherently reversible and guarantees perfect reconstruction, the application of this approach for MCTF in video compression was straightforward, and it was first proposed in [7],[8] and [9]. Here, the polyphase components are the even- and odd-indexed frames "A" and "B". An illustration for the case of the 5/3 bi-orthogonal filter is given in Fig. 1. The result of the prediction step is a highpass frame "H", which is generated as the motion-compensated prediction difference between a frame A and its two (averaged) neighbours B. The complementary update step shall usually revert the motion compensation and leads to a lowpass filtered (weighted average) representation "L" of a frame B, including information from its two previous and two subsequent neighbours.

[Figure: a frame sequence B A B A B A B is decomposed by prediction steps (weights 1/2) into H frames and by update steps (weights 1/4) into L frames, yielding the lowpass sequence.]
Fig. 1. MCTF lifting scheme for decomposition of a video sequence into lowpass (L) and highpass (H) frames.

The lifting process needs to be adapted in cases where motion estimation and compensation are likely to fail, e.g. in the case of fast inhomogeneous motion or occlusions. This can e.g. be implemented by switching to uni-directional prediction and update, by omitting the update step, or by omitting the prediction step (intraframe coding modes). Typically, in a wavelet pyramid, the MCTF decomposition is iteratively performed over several hierarchy levels of L frames. In fact, the B-slice pyramid as specified in the AVC standard can be interpreted as a simple variant thereof, omitting the update step.

The MCTF process, including prediction and update steps, establishes an orthogonal or bi-orthogonal transform with a beneficial effect on the equal distribution of coding errors among the different frames, including a de-noising effect in case of discarded highpass frame information. Due to the non-recursive (open-loop) structure, scalability can easily be implemented
− temporally, by discarding H frames;
− quality-wise, by reducing the quality of H and L frames without drift or leakage effects;
− spatially, by running a pyramid (wavelet-based or layered differences) over several spatial resolutions.
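The 5/3 lifting steps described above can be sketched as follows. This is a minimal zero-motion illustration under simplifying assumptions: motion compensation is omitted, each frame is represented by a single sample value instead of a 2D pixel array, boundaries are handled by clamping, and an even number of frames is assumed. It is not the standardized algorithm, but it shows the open-loop property: the synthesis exactly mirrors the analysis, so perfect reconstruction holds for any input.

```python
def analyze(frames):
    """One MCTF level: split frames into lowpass L and highpass H."""
    B = frames[0::2]                     # even-indexed frames -> updated to L
    A = frames[1::2]                     # odd-indexed frames  -> predicted to H
    n = len(A)
    # Prediction step (weights 1/2): H = A minus average of neighbour B frames.
    H = [A[k] - 0.5 * (B[k] + B[min(k + 1, len(B) - 1)]) for k in range(n)]
    # Update step (weights 1/4): L = B plus quarter-weighted neighbour H frames.
    L = [B[k] + 0.25 * (H[max(k - 1, 0)] + H[min(k, n - 1)])
         for k in range(len(B))]
    return L, H

def synthesize(L, H):
    """Inverse lifting: revert the update step, then the prediction step."""
    n = len(H)
    B = [L[k] - 0.25 * (H[max(k - 1, 0)] + H[min(k, n - 1)])
         for k in range(len(L))]
    A = [H[k] + 0.5 * (B[k] + B[min(k + 1, len(B) - 1)]) for k in range(n)]
    frames = []
    for k in range(len(B)):              # re-interleave the polyphase components
        frames.append(B[k])
        if k < n:
            frames.append(A[k])
    return frames
```

Iterating `analyze` on the returned L sequence would build the multi-level wavelet pyramid mentioned in the text; dropping the update step in `analyze`/`synthesize` turns the scheme into the B-frame pyramid variant.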
The layered concept is presently implemented in the Working Draft of the most recent SVC standardization activity of ISO-MPEG's and ITU-VCEG's Joint Video Team (JVT), which is planned to become an amendment to the AVC standard. The most important properties are:

− SVC supports scalable bit streams with typically up to 4 spatial resolutions, up to 4 different frame rates, and fine granularity of quality scalability at the various spatio-temporal resolutions;
− SVC will become an extension of AVC with the capability to implement an AVC-compatible base layer;
− Motion-compensated temporal filtering (MCTF) with adaptive prediction and update steps is implemented for efficient open-loop compression;
− Encoder and decoder are run in a layered structure with "bottom-up" prediction from lower layers; this is simply implemented by allowing additional coding modes that switch between intra-layer and inter-layer prediction. For a simpler implementation, only one motion compensation loop is used;
− The fine granularity of scalability (FGS) functionality is [...]

[Figure: rate-distortion comparison, PSNR [dB] over rate [kbit/s], for the sequences CITY and HARBOUR (704x576, 60 Hz), single-layer AVC (JM94 EPZS) versus scalable JSVM-0.]

[...] "typical", indicating that the new codec does not dramatically penalize either high or low rate points (as previous scalable solutions did). On average, even though additional syntax overhead is necessary to allow scalable stream access, the rate increase compared to non-scalable coding is not higher than 10%. Therefore, the new SVC scheme offers a clear compression advantage compared to simulcast or parallel storage of non-scalable streams, as would otherwise [...]
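The advantage over simulcast claimed above can be illustrated with a back-of-the-envelope calculation. The operating-point rates below are hypothetical examples; only the "not higher than 10%" overhead bound is taken from the text.

```python
# Hypothetical bit rates of three non-scalable operating points (kbit/s).
single_layer_rates = [256, 1024, 4096]
scalable_overhead = 0.10   # upper bound on the SVC syntax overhead (see text)

# Simulcast must carry/store every operating point in parallel.
simulcast_total = sum(single_layer_rates)
# One scalable stream covers all points at the highest rate plus overhead.
scalable_total = max(single_layer_rates) * (1 + scalable_overhead)

print(f"simulcast: {simulcast_total} kbit/s")      # prints 5376 kbit/s
print(f"scalable:  {scalable_total:.0f} kbit/s")   # prints 4506 kbit/s
```

Even under the worst-case 10% overhead, the single scalable stream stays below the simulcast total whenever more than one operating point must be served.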