Editors
Ce Zhu
School of Electronic Engineering
University of Electronic Science
and Technology of China
Chengdu
People's Republic of China
Lu Yu
Department of Information Science
and Electronic Engineering
Zhejiang University
Hangzhou
People's Republic of China
Masayuki Tanimoto
Department of Electrical Engineering
and Computer Science
Graduate School of Engineering
Nagoya University
Nagoya
Japan
Yin Zhao
Department of Information Science
and Electronic Engineering
Zhejiang University
Hangzhou
People's Republic of China
ISBN 978-1-4419-9963-4
ISBN 978-1-4419-9964-1 (eBook)
DOI 10.1007/978-1-4419-9964-1
Preface
to make better use of available bandwidth. Since view synthesis, coupled with
errors introduced by depth generation and compression, may introduce new types
of artifacts that are different from those of conventional video acquisition and
compression systems, it is also necessary to understand the visual quality of the
views produced by DIBR techniques, which is critical to ensure a comfortable,
realistic, and immersive 3D experience.
This book focuses on this depth-based 3D-TV system, which is expected to be deployed in the near future as a more attractive alternative to the current stereoscopic 3D-TV system. Following an open call for chapters and a few rounds of extensive peer review, 15 high-quality chapters were finally accepted, ranging from technical reviews and literature surveys on the whole system or a particular topic, to solutions to specific technical issues, to implementations
of some prototypes. According to the scope of these chapters, this book is organized into four sections, namely system overview, content generation, data compression and transmission, and 3D visualization and quality assessment, with the
chapters in each section summarized below.
Part I (Chap. 1) provides an overview of the depth-based 3D-TV system.
Chapter 1, entitled "An overview of 3D-TV system using depth-image-based rendering", covers key technologies involved in this depth-based 3D-TV system
using the DIBR technique, including content generation, data compression and
transmission, 3D visualization, and quality evaluation. It also compares the conventional stereoscopic 3D with the new depth-based 3D systems, and reviews
some standardization efforts for 3D-TV systems.
Part II (Chaps. 2–7) focuses on 3D video content creation, specifically targeting depth map generation and view synthesis technologies.
As the leading chapter of the section, Chap. 2, entitled "Generic content creation for 3D displays", discusses future 3D video applications and presents a generic, display-agnostic production workflow that supports the full range of existing and anticipated 3D displays.
Chapter 3, entitled "Stereo matching and viewpoint synthesis FPGA implementation", introduces a real-time implementation of stereo matching and view synthesis algorithms, and describes Stereo-In to Multiple-Viewpoint-Out functionality on a general FPGA-based system, demonstrating real-time, high-quality depth extraction and viewpoint synthesis as a prototype toward a future chipset for 3D-HDTV.
Chapter 4, entitled "DIBR-based conversion from monoscopic to stereoscopic and multi-view video", provides an overview of 2D-to-3D video conversion that exploits depth-image-based rendering (DIBR) techniques. The basic principles and various methods for the conversion, including depth extraction strategies and DIBR-based view synthesis approaches, are reviewed. Furthermore, the evaluation of conversion quality and conversion artifacts is discussed in this chapter.
cues to depth and their interactions with disparity, and development and adaptability of the binocular system.
Chapter 13, entitled "Stereoscopic and autostereoscopic displays", first explains the fundamentals of stereoscopic perception and some of the artifacts associated with 3D displays. Then, a description of the basic 3D displays is given. A brief history is followed by a state-of-the-art survey covering displays from glasses-based, through volumetric, light-field, multi-view, and head-tracked, to holographic displays.
Chapter 14, entitled "Subjective and objective visual quality assessment in the context of stereoscopic 3D-TV", discusses current challenges in relation to subjective and objective visual quality assessment for stereo-based 3D-TV (S-3DTV).
Two case studies are presented to illustrate the current state of the art and some of
the remaining challenges.
Chapter 15, entitled "Visual quality assessment of synthesized views in the context of 3D-TV", addresses the challenges of evaluating synthesized content and presents two experiments, one on the assessment of still images and the other on video sequence assessment. The two experiments question the reliability of the usual subjective and objective tools when assessing the visual quality of synthesized views in a 2D context.
As can be seen from the above introductions, this book systematically spans a number of important and emerging topics in the depth-based 3D-TV system. In short, we aim to provide scholars and practitioners involved in the research and development of depth-based 3D-TV systems with an up-to-date reference on a wide range of related topics. The target audience comprises those interested in various aspects of 3D-TV using DIBR, such as data capture, depth map generation, 3D video coding, transmission, human factors, 3D visualization, and quality assessment. The book is meant to be accessible to researchers, developers, engineers, and innovators working in the relevant areas. It can also serve as a solid advanced-level course supplement on 3D-TV technologies for senior undergraduates and postgraduates.
On the occasion of the completion of this edited book, we would like to thank all the authors for contributing their high-quality work. Without their expertise and contributions, this book would never have come to fruition. We would also like to thank all the reviewers for their insightful and constructive comments, which helped to improve the quality of this book. Our special thanks go to the editorial assistants of this book, Elizabeth Dougherty and Brett Kurzman, for their tremendous guidance and patience throughout the whole publication process. This project was supported in part by the National Basic Research Program of China (973) under Grant No. 2009CB320903 and the Singapore Ministry of Education Academic Research Fund Tier 1 (AcRF Tier 1 RG7/09).
Acknowledgment
The editors would like to thank the following reviewers for their valuable suggestions and comments, which improved the quality of the chapters.
A. Aydın Alatan, Middle East Technical University
Ghassan AlRegib, Georgia Institute of Technology
Holger Blume, Leibniz Universität Hannover
Ismael Daribo, National Institute of Informatics
W. A. C. Fernando, University of Surrey
Anatol Frick, Christian-Albrechts-University of Kiel
Thorsten Herfet, Saarland University
Yo-Sung Ho, Gwangju Institute of Science and Technology (GIST)
Peter Howarth, Loughborough University
Quan Huynh-Thu, Technicolor Research & Innovation
Peter Kauff, Fraunhofer Heinrich Hertz Institute (HHI)
Sung-Yeol Kim, The University of Tennessee at Knoxville
Reinhard Koch, Christian-Albrechts-University of Kiel
Martin Köppel, Fraunhofer Heinrich Hertz Institute (HHI)
Po-Lin Lai, Samsung Telecommunications America
Chao-Kang Liao, IMEC Taiwan Co.
Wen-Nung Lie, National Chung Cheng University
Yanwei Liu, Chinese Academy of Sciences
Anush K. Moorthy, The University of Texas at Austin
Antonio Ortega, University of Southern California
Goran Petrovic, Saarland University
Z. M. Parvez Sazzad, University of Dhaka
Wa James Tam, Communications Research Centre (CRC)
Masayuki Tanimoto, Nagoya University
Patrick Vandewalle, Philips Research Eindhoven
Anthony Vetro, Mitsubishi Electric Research Laboratories (MERL)
Jia-ling Wu, National Taiwan University
Junyong You, Norwegian University of Science and Technology
Lu Yu, Zhejiang University
Contents

Part I  System Overview
Part II  Content Generation
Part III  Data Compression and Transmission
  3D Video Compression (Karsten Müller, Philipp Merkle and Gerhard Tech)
Part IV  3D Visualization and Quality Assessment
Index
Part I
System Overview
Chapter 1
An Overview of 3D-TV System Using Depth-Image-Based Rendering
Y. Zhao (✉) · L. Yu
Department of Information Science and Electronic Engineering,
Zhejiang University, 310027 Hangzhou, People's Republic of China
e-mail: zhaoyin@zju.edu.cn
L. Yu
e-mail: yul@zju.edu.cn
C. Zhu
School of Electronic Engineering, University of Electronic Science
and Technology of China, 611731 Chengdu, People's Republic of China
e-mail: eczhu@uestc.edu.cn
M. Tanimoto
Department of Electrical Engineering and Computer Science,
Graduate School of Engineering, Nagoya University,
Nagoya 464-8603, Japan
e-mail: tanimoto@nuee.nagoya-u.ac.jp
1.1 Introduction
The first television (TV) service was launched by the British Broadcasting Corporation (BBC) in 1936 [1]. Since then, with advances in video technologies
(e.g., capture, coding, communication, and display), TV broadcasting has evolved
from monochrome to color, analog to digital, CRT to LCD, and also from passive
one-to-all broadcasts to interactive Video on Demand (VOD) services. Nowadays,
it has been moving in two different directions toward a realistic, immersive experience: ultra high definition TV (UHDTV) [2] and three-dimensional TV (3D-TV)
[3, 4]. The former aims to provide 2D video services of extremely high quality
with a resolution (up to 7,680 × 4,320, 60 frames per second) much higher than
that of the current high definition TV (HD-TV). The latter vividly extends the
conventional 2D video into a third dimension (e.g., stereoscopic 3D-TV), making
users feel that they are watching real objects through a window instead of looking
at plain images on a panel. Free-viewpoint Television (FTV) [5] is considered the ultimate 3D-TV; it gives users a more immersive experience by allowing them to view a 3D scene from freely chosen viewpoints, as if they were there. This
chapter (and also this book) focuses on an upcoming and promising 3D-TV system
using a depth-image-based rendering (DIBR) technique, which is one phase in the
evolution from the conventional stereoscopic 3D-TV to the ultimate FTV.
Success of a commercial TV service stands mainly on three pillars: (1) abundant content resources to be presented at terminals, (2) efficient content compression methods and transmission networks capable of delivering images of decent visual quality, and (3) cost-affordable displays and auxiliary devices
(e.g., set-top box). Among these, display technology plays an important role and,
to some extent, greatly drives the development of TV broadcasting, since it is
natural and straightforward to capture and compress/transmit what is required by
the displays. In the context of 3D-TV, different 3D displays may demand diverse
data processing chains that finally lead to various branches of 3D-TV systems.
Figure 1.1a depicts block diagrams of a 3D-TV system including content capture,
data compression, storage and transmission, 3D display or visualization, and
quality evaluation.
Although there is no consensus on the classification of various 3D displays [6–8], there are at least five types of 3D visualization techniques: stereoscopy [8, 9],
multi-view autostereoscopy [8, 9], integral imaging [10], holography [11], and
volumetric imaging [12]. Stereoscopic displays basically present two slightly
Fig. 1.1 Illustration of the frameworks of (a) the conventional stereoscopic 3D system, and (b) the depth-based 3D system using depth-image-based rendering (DIBR), which considers (c) depth-based formats obtained from stereoscopic or multi-view video. Note that the MVD and LDV samples of the computer-generated Mobile sequences [170], shown mainly to illustrate the two formats, are not fully converted from the multi-view video
camera configuration, pixels in a camera view can be projected into the image
plane of a virtual camera, resulting in a virtual-view texture image. A 3D representation consisting of one texture and one depth video is often known as the 2D + Z format, and one consisting of multiple texture and depth videos is referred to as the Multi-view Video plus Depth (MVD) format [20], as illustrated in Fig. 1.1c. Another, more advanced depth-based representation, called Layered Depth Video (LDV) [21], contains not only a layer of texture plus depth for the foreground objects exposed to the current camera viewpoint, but also additional layer(s) of hidden background scene information occluded by the foreground objects at that viewpoint. The MVD and LDV formats can be viewed as augmented versions of the basic 2D + Z format, with more scene descriptions contained in additional camera views and hidden layers, respectively. These supplementary data can complement, in view synthesis, the information that is missing in single-view 2D + Z data, as will be discussed thoroughly in Sect. 1.2.2.
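As a rough illustration of the projection step described above, the sketch below forward-warps a camera view into a virtual view for the common rectified (parallel-camera) setup, where 3D warping reduces to a horizontal disparity shift computed from the depth map. The focal length, baseline, and 8-bit depth quantization range are illustrative assumptions, not values taken from this chapter.

```python
import numpy as np

def forward_warp(texture, depth8, f, baseline, z_near, z_far):
    """Warp a texture view to a virtual view (rectified cameras).

    texture : (H, W, 3) uint8 color image of the reference camera
    depth8  : (H, W) uint8 depth map quantized between z_near and z_far
    f, baseline : focal length (pixels) and camera distance (illustrative)
    Returns the warped image and a hole mask for disoccluded pixels.
    """
    h, w = depth8.shape
    # Recover metric depth from the 8-bit quantized (inverse-depth) map.
    z = 1.0 / (depth8 / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
    # For parallel cameras, warping reduces to a horizontal disparity shift.
    disparity = np.round(f * baseline / z).astype(np.int32)

    warped = np.zeros_like(texture)
    zbuf = np.full((h, w), np.inf)       # keep the nearest point per pixel
    hole = np.ones((h, w), dtype=bool)   # True where no source pixel lands

    ys, xs = np.mgrid[0:h, 0:w]
    xt = xs - disparity                  # target column in the virtual view
    valid = (xt >= 0) & (xt < w)
    for y, x, x_new in zip(ys[valid], xs[valid], xt[valid]):
        if z[y, x] < zbuf[y, x_new]:     # z-buffering resolves occlusions
            zbuf[y, x_new] = z[y, x]
            warped[y, x_new] = texture[y, x]
            hole[y, x_new] = False
    return warped, hole
```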
This chapter presents an overview of a 3D-TV system that employs the depth-based 3D representations and DIBR technique to support both stereoscopic and
multi-view autostereoscopic displays. Although a TV service usually involves both
audio and video, only topics related to the video part are addressed in this chapter
(as well as in this book). Interested readers may refer to [22–27] for additional
information on the 3D audio part.
After the introduction, Sect. 1.2 elaborates on the key technologies that are crucial to the success of the depth-based 3D-TV system, ranging from content generation, compression, and transmission to view synthesis, 3D display, and quality
assessment. Section 1.3 discusses the pros and cons of the depth-based 3D system
in comparison with the conventional stereoscopic 3D system, as well as the
remaining challenges and industry movements related to this 3D-TV framework.
The status and prospects of this new 3D system are concluded in Sect. 1.4.
rearranged into a new pattern before being presented on the screen. Since visual
artifacts may be induced by lossy data compression and error-prone networks, 3D
visual quality should be monitored to adapt video compression and error control
mechanisms, to maintain acceptable visual quality at the decoder side. These
aspects, from capture to display, will be discussed with their fundamental problems and major solutions.
disparities as long as the camera parameters are known. Post-processing of the depth
map may be required to improve boundary alignment, to reduce estimation errors,
and to fill occlusions [16]. A detailed description of stereo matching algorithms is
provided in Chaps. 2 and 3.
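For a rectified camera pair, the conversion from a matched disparity to a depth value mentioned above follows directly from triangulation. A minimal sketch, assuming a parallel camera setup with focal length f (in pixels) and baseline b (both illustrative):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth Z = f * b / d for a rectified stereo pair.

    disparity_px : horizontal disparity of a matched pixel pair (pixels)
    focal_px     : focal length expressed in pixels
    baseline_m   : distance between the two camera centers (meters)
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px

# Example: f = 1000 px, b = 0.1 m, d = 20 px  ->  Z = 5 m
print(depth_from_disparity(20, 1000, 0.1))
```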
Depth sensing and aligning with multi-view video
Stereo matching aims to find pairs of correspondences with the closest features. It often fails in homogeneous image regions, where multiple candidates exhibit similar features and the best matching point is ambiguous. Also, correspondences may not be located correctly in occlusion regions, where parts of one image are not visible in the other.
As an alternative to stereo matching for depth map generation, physical ranging methods have been introduced; they do not rely on scene texture to obtain depth information and can tackle occlusion conflicts. Examples of such physical ranging methods are the time-of-flight (TOF) sensor [38] and the structured light scanner [39].
Using depth sensing equipment, however, gives rise to other technical problems. First, the depth and video cameras are placed at different positions and
output different types of signals. It is necessary to calibrate the depth camera with
the video camera based on two diverse scene descriptions [42]. Second, the output
depth map generally has lower spatial resolution than that of the captured color
image. This means the measured depth map should be transformed and properly
upsampled to align with the texture image [40, 41]. Third, the measured depth
maps are often contaminated by random noise [41] which should be removed for
quality enhancement.
The TOF camera itself also has several inherent limitations. Since a TOF camera captures depth information by measuring the phase shift between self-emitted and reflected infrared light, the capture range is limited to around 7 meters. Besides, interference from surrounding light in the same spectrum may occur; due to these limitations, outdoor scenes cannot be correctly measured by current TOF solutions. Moreover, surrounding materials may absorb the emitted infrared light or direct it away from the sensor, so that part of the rays never returns to the sensor, thus affecting the depth calculation. Using multiple TOF cameras at the same time may also introduce interference among the cast rays of similar frequencies.
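The phase-shift principle mentioned above also explains the limited range: distance is recovered from the phase difference between the emitted and reflected modulated light, and distances beyond half a modulation wavelength wrap around. A minimal sketch, assuming a 20 MHz modulation frequency purely for illustration:

```python
import math

C = 299_792_458.0  # speed of light (m/s)

def tof_distance(phase_shift_rad, mod_freq_hz=20e6):
    """Distance from the measured phase shift of a continuous-wave TOF camera.

    d = c * delta_phi / (4 * pi * f_mod); the light travels to the object
    and back, hence the extra factor of 2 folded into the 4*pi.
    """
    return C * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

def unambiguous_range(mod_freq_hz=20e6):
    """Maximum distance before the phase wraps around (delta_phi = 2*pi)."""
    return C / (2.0 * mod_freq_hz)

print(unambiguous_range())  # ~7.5 m at 20 MHz, consistent with the ~7 m figure above
```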
The structured light scanner, such as the Kinect by Microsoft, is not perfect either. Its accuracy degrades as the distance increases, and it also encounters the problems of limited range, missing rays, and interference from multiple emitters and ambient light.
Since depth sensing approaches have different error sources and limitations from stereo matching methods, fusing depth data obtained from the two avenues appears promising for enhancing the accuracy of the depth information [41–43]. More details on depth map generation using TOF sensors are elaborated in Chap. 7.
2D-to-3D conversion based on a single-view video
Apart from using stereo matching and depth sensors, a depth map can also be generated from the video sequence itself by extracting the monoscopic depth cues contained in it. This kind of depth generation is often studied in the research on 2D-to-3D video conversion [44–46]. This approach sounds more appealing, as a video shot by a single camera can be converted into stereoscopic
3D. Given the abundant 2D video data like films and TV programs, 2D-to-3D
conversion is very popular in the media industry and post-production studios. Disney is reported to have been producing 3D versions of earlier 2D movies such as The Lion King.
It is known that the human visual system perceives depth information from both
binocular cues and monocular cues. The former are often referred to as stereopsis
(an object appearing at different positions in two eyes) and convergence (two eyes
rotating inward toward a focused point). Stereo matching simulates the stereopsis
effect that senses depth information from retinal disparities. 2D-to-3D conversion,
however, exploits monocular cues, including (but not limited to):
1. linear perspective from vanishing lines: converging lines that are actually parallel indicate a surface that recedes in depth [47];
2. blur from defocus: the object in focus (often the one closer to the camera) appears sharp, while the blur of defocused regions increases with the distance from the focal plane [48];
3. velocity of motion: for two objects moving at a similar velocity, the nearby
object corresponds to a larger displacement in the captured video [49];
4. occlusion from motion: background region occluded by a moving foreground
object will be exposed in another frame [50].
Apart from making use of those monocular cues, the depth map can be estimated using a machine learning algorithm (MLA) [44]. Some important pixels are selected as a training set, and the MLA learns the relationship between these samples (position and color value) and their predefined depth values; the depth of the remaining pixels is then determined by the trained model. This method requires much less effort than manual depth map painting [16, 64]. Since depth is usually consistent over time, manual assistance may be employed only at several key frames in a sequence, while the other depth frames can be obtained automatically by propagating the key frames [43, 65]. It should be noted that 2D-to-3D conversion aims to (or rather, is only capable of) estimating an approximate depth map that properly reflects the depth ordering of the major 3D surfaces, rather than recovering real depth values as stereo matching and depth sensing do. Fortunately, our eyes are robust enough to perceive a natural 3D sensation as long as there are no severe conflicts among depth cues. More in-depth discussions on state-of-the-art 2D-to-3D conversion techniques are provided in Chap. 4.
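As a toy illustration of the key-frame propagation idea mentioned above (actual propagation algorithms are described in the cited works and in Chap. 4), the sketch below copies depth along block-matching motion vectors from an annotated key frame to the next frame; the block size and search range are arbitrary choices.

```python
import numpy as np

def propagate_depth(prev_gray, cur_gray, prev_depth, block=8, search=4):
    """Propagate a key-frame depth map to the next frame by block matching.

    prev_gray, cur_gray : (H, W) float arrays, consecutive luma frames
    prev_depth          : (H, W) depth map annotated for the key frame
    Returns an estimated depth map for the current frame.
    """
    h, w = cur_gray.shape
    cur_depth = np.zeros_like(prev_depth)
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            target = cur_gray[y:y + block, x:x + block]
            best, best_dy, best_dx = np.inf, 0, 0
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        ref = prev_gray[yy:yy + block, xx:xx + block]
                        cost = np.abs(target - ref).sum()
                        if cost < best:
                            best, best_dy, best_dx = cost, dy, dx
            # Depth is assumed constant along the motion trajectory.
            cur_depth[y:y + block, x:x + block] = \
                prev_depth[y + best_dy:y + best_dy + block,
                           x + best_dx:x + best_dx + block]
    return cur_depth
```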
Synthetic texture and depth data from 3D model
Computer-generated imagery (CGI) is another important source of 3D content.
It describes a virtual scene in a complete 3D model, which can be converted into
the texture plus depth data easily. For example, the texture images of a target view
can be rendered using the conventional ray casting technique [66], in which the color of each pixel is determined by tracing which surface of the computer-generated models is intersected by a corresponding ray cast from the virtual camera. The associated depth maps can be generated by calculating the orthogonal distances from the intersected 3D points to the image plane of the virtual camera. Occlusion can be resolved by comparing the depth values of multiple points projected to the same location, which is known as z-buffering.
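A minimal sketch of the z-buffering idea just described: when several scene points project to the same pixel of the virtual camera, only the nearest one keeps its color, and its distance becomes the depth value. The point list and intrinsic matrix are placeholders for illustration.

```python
import numpy as np

def render_points(points_xyz, colors, K, width, height):
    """Rasterize 3D points with z-buffering.

    points_xyz : (N, 3) points in the virtual camera's coordinate frame (z > 0)
    colors     : (N, 3) colors of the points
    K          : 3x3 intrinsic matrix of the virtual camera
    Returns a color image and the associated depth map.
    """
    image = np.zeros((height, width, 3), dtype=np.uint8)
    depth = np.full((height, width), np.inf)       # the z-buffer
    for p, c in zip(points_xyz, colors):
        x, y, z = p
        u = int(round(K[0, 0] * x / z + K[0, 2]))  # perspective projection
        v = int(round(K[1, 1] * y / z + K[1, 2]))
        if 0 <= u < width and 0 <= v < height and z < depth[v, u]:
            depth[v, u] = z                        # keep the nearest point only
            image[v, u] = c
    return image, depth
```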
Fig. 1.4 LDV converted from MVD using different approaches [52] for Ballet [171]. (a) Foreground layer, and (b) background layer using the z-buffer-based approach. (c) Background layer using incremental residual insertion. White regions in (b) and (c) are foreground masks
Fig. 1.5 View synthesis with 2D + Z data or two-view MVD data of Mobile, using VSRS [70, 172]. (a) The left view (view 4). (b) After warping the left view to the central virtual point (view 5), where holes are marked in black. (c) Applying hole filling on (b) using inpainting [73]. (d) The right view (view 6). (e) After warping the right view to the central virtual point. (f) Merging the warped left and right views (b) and (e); all pixels are assigned color values. For more complicated scenes, we may have to fill a few small holes in the merged image using simple interpolation of surrounding pixels. Inpainting may fail to create natural patterns for the disoccluded regions, while merging the two warped views (b) and (e) is effective in making up the holes
interpolation), and thus is required in high-quality rendering applications. Background/occlusion layers of LDV provide the same functionality of supplying
occlusion information.
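A minimal sketch of the merging step illustrated in Fig. 1.5: two views warped to the same virtual position complement each other's disocclusion holes, and any pixel still uncovered is filled by a simple neighbor copy. The function is an illustrative stand-in, not code from VSRS or any reference implementation.

```python
import numpy as np

def merge_warped_views(left_warp, left_hole, right_warp, right_hole):
    """Merge two views warped to the same virtual viewpoint.

    left_warp/right_warp : (H, W, 3) warped color images
    left_hole/right_hole : (H, W) boolean masks, True where disoccluded
    """
    merged = left_warp.copy()
    # Pixels missing in the left warp are taken from the right warp.
    merged[left_hole] = right_warp[left_hole]
    remaining = left_hole & right_hole      # holes present in both warped views

    # Fill any remaining small holes with the nearest valid pixel to the left.
    out = merged.copy()
    h, w, _ = out.shape
    for y in range(h):
        for x in range(w):
            if remaining[y, x] and x > 0:
                out[y, x] = out[y, x - 1]
    return out
```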
It is obvious that inaccurate depth values cause camera-view pixels to be
projected to wrong positions in a target virtual view. These kinds of geometry
distortions are most evident around object boundaries where incorrect depth values
Fig. 1.6 (a) Illustration of a (lossy) video codec. (b) Rate-distortion (R-D) performance of two codecs on a (depth) video. Codec 2 presents superior R-D performance to Codec 1, since the R-D curves show that at the same level of distortion (suggested by the PSNR value, where a higher PSNR means a lower level of distortion), Codec 2 needs less bitrate to compress the video. For example, compared with Codec 1, Codec 2 saves about 34 % rate at PSNR = 45 dB
data) can be compressed more efficiently with the Multi-view Video Coding
(MVC) [81–83] standard, an amendment to H.264/AVC with adoption of inter-view prediction [82].
The conventional video coding standards, however, were developed for texture video compression and have been shown to be inefficient for coding depth videos, which have very different statistical characteristics. As discussed above,
both texture and depth data contribute to the quality of a synthesized image, where
texture distortions cause color value deviations at fixed positions and depth distortions may lead to geometry changes [84]. This evokes the need for a different
distortion measurement for depth coding [85] and a proper bit allocation strategy
between texture and depth [86]. Compared with texture images, depth maps are
generally much smoother within an object, while presenting poorer temporal and inter-view consistency (limited by current depth acquisition technologies). Therefore, some conventional coding tools, such as the inter-view prediction in MVC, may turn out to be less effective as the inter-view correlation changes accordingly.
More efficient methods are desired to preserve the fidelity of depth information,
which may be realized through any of the following ways or their combinations:
1. adapt the existing coding techniques (e.g., intra prediction [87], inter
prediction [88], and in-loop filter [89]) to fit the statistical features of depth
data;
2. measure or quantify the influence of depth data distortions on view synthesis quality [85, 90] and incorporate the effect in the rate-distortion optimization of depth coding [85] (a small numerical sketch of this relation is given after this list);
3. use other coding frameworks (e.g., platelet coding [84] and wavelet coding
[91]);
4. exploit the correlation between texture and depth (e.g., employing the structural
similarity [92]).
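A small numerical sketch of the relation behind item 2 above: for a rectified camera pair with the usual 8-bit inverse-depth quantization, a depth coding error translates into a horizontal position error of the warped pixel, which is what the cited distortion models exploit. The focal length, baseline, and depth range below are illustrative values only, not parameters from the cited works.

```python
def warp_position_error(delta_d, focal_px=1000.0, baseline=0.05,
                        z_near=2.0, z_far=10.0):
    """Horizontal rendering error (pixels) caused by an 8-bit depth error.

    With 1/Z quantized linearly into the 8-bit depth value d, a coding error
    delta_d shifts the warped pixel by
        f * b * (1/z_near - 1/z_far) * delta_d / 255.
    """
    return focal_px * baseline * (1.0 / z_near - 1.0 / z_far) * delta_d / 255.0

# Example: a depth error of 5 levels displaces the warped pixel by ~0.4 px.
print(warp_position_error(5))
```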
In addition, the depth data can also help to enhance the coding efficiency of texture data (e.g., using view synthesis prediction [93, 94]) or to reduce encoder
complexity (e.g., fast mode decision [95]). More details on the current status of 3D
video coding, advances on depth compression techniques, and wavelet coding for
depth maps will be provided in Chaps. 8, 9 and 10, respectively.
Combining all texture and depth bitstreams into one TS under the bitrate constraint
will lead to lower quality texture videos for 2D playback. Thus, the recent Blu-ray
3D specification considers a mode of using one main-TS for the base view for 2D
viewing and using another sub-TS for the dependent view for stereoscopic
viewing, which does not compromise the 2D visual quality. Although no depth-based 3D video services have been launched worldwide, these concerns from the stereoscopic 3D system may also be considered in developing the new 3D transmission system. Chapter 11 provides a study on transmitting 3D video over cable,
terrestrial, and satellite broadcasting networks.
Two other research topics in transmission are error resilience and concealment
for unreliable networks (e.g., wireless and Internet Protocol (IP) networks).
The former, employed at encoders, spends overhead bits to protect a compressed
bitstream, while the latter, utilized by decoders, attempts to recover lost or corrupted information during transmission [97–99]. Apart from using the conventional error resilience and concealment methods for 2D [97–102] and stereoscopic 3D [103–105], some studies exploit the correlation/similarity between texture and depth, and develop new approaches for depth-based 3D video [106–109]. For example, since motion vectors (MV) for texture and depth of the
same view are highly correlated, they are jointly estimated and then transmitted
twice (once in each bitstream) to protect the important motion information [106].
Motion vector sharing is also used in error concealment. For a missing texture
block, MV of its associated depth block is adopted as a candidate MV [107].
The neighboring blocks with depth information similar to that of the missing block
indicate that they likely belong to the same object and should have the same
motion. On the other hand, depth information also distinguishes object boundaries
from homogeneous regions, and the two objects on either side of a boundary may have different motion. In that case, the assumption of a smooth MV field turns out to be invalid, and the block needs to be split into foreground and background parts with
two (different) MVs [108]. In addition, by exploiting inter-view geometry
relationship, missing texture in one view can be recovered by DIBR with texture
and depth data of another view [109].
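A rough sketch of the depth-assisted recovery idea just described: the motion vector for a lost texture block is taken from the candidate whose associated depth is most similar to that of the missing block, on the assumption that similar depth implies the same object and hence similar motion. The data structures are simplified stand-ins, not any codec's API.

```python
def recover_motion_vector(missing_depth_avg, candidates):
    """Pick an MV for a lost texture block using depth similarity.

    missing_depth_avg : mean depth of the co-located depth block
    candidates        : list of (mv, depth_avg) pairs gathered from neighboring
                        blocks and from the co-located depth block's own MV
    Returns the MV whose source block is closest in depth.
    """
    best_mv, best_diff = None, float("inf")
    for mv, depth_avg in candidates:
        diff = abs(depth_avg - missing_depth_avg)
        if diff < best_diff:
            best_mv, best_diff = mv, diff
    return best_mv

# Example: three candidate blocks with their MVs and average depth values.
print(recover_motion_vector(120, [((3, 0), 118), ((0, 0), 40), ((-2, 1), 200)]))
```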
Fig. 1.7 (a) Illustration of convergence, positive, and negative disparities. The two eyes fixate at point F, and the image of F falls at the center of each retina (the fovea) with zero disparity. Point P (or N) has a positive/uncrossed (or negative/crossed) disparity when the left-eye projection position P1 (or N1) is at the right (or left) side of the right-eye position P2 (or N2). In the context of stereoscopic viewing, point N appears in front of the screen and point P appears to be behind the display. (b) Illustration of the stereoscopic comfort zone, in which the most comfortable regions are close to the screen plane
Binocular depth perception provides more accurate depth based on two main factors: eye convergence and binocular parallax. In practice, the two eyes first rotate inward toward a fixated point in space, and the convergence angle reflects the actual distance of the point, as shown in Fig. 1.7a. Then, each 3D point falls on the two
retinal images which are further combined in the brain with binocular fusion or
rivalry [110]. The difference between the retinal positions of one object results in a
zero, negative, or positive disparity, which is interpreted by our visual system as
the relative depth of this point with respect to the fixation point. More details of
binocular depth perception as well as other binocular vision properties (e.g.,
binocular fusion) are covered in Chap. 12.
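The geometry of Fig. 1.7a can be written down directly: by similar triangles, a screen disparity d (positive/uncrossed) seen from viewing distance V with inter-pupillary distance e is perceived at distance Z = e*V/(e - d), so zero disparity lies on the screen plane, negative disparity in front of it, and positive disparity behind it. A small sketch with typical, purely illustrative values:

```python
def perceived_depth(disparity_m, eye_sep=0.065, view_dist=3.0):
    """Perceived distance of a point shown with a given screen disparity.

    disparity_m : on-screen separation of the right- and left-eye image points
                  (meters); positive = uncrossed (behind screen), negative = crossed.
    eye_sep     : inter-pupillary distance (~65 mm); view_dist in meters.
    Z = e * V / (e - d), from similar triangles between the eyes and the screen.
    """
    return eye_sep * view_dist / (eye_sep - disparity_m)

print(perceived_depth(0.0))    # 3.0 m   -> on the screen plane
print(perceived_depth(0.02))   # ~4.33 m -> behind the screen
print(perceived_depth(-0.02))  # ~2.29 m -> in front of the screen
```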
The visualization of 3D effects in the depth-based system is achieved by presenting reconverted stereoscopic or multi-view video on a stereoscopic display
(SD) or multi-view autostereoscopic display (MASD) [6–9]. The former provides binocular parallax by showing spatially mixed or temporally alternating left and right images on the screen and by separating the light of the two images with passive (anaglyph and polarized) or active (shutter) glasses [6, 9]. The latter provides binocular parallax without the need for wearing special glasses, by directing rays from different views into separate viewing regions using optical devices such as lenticular sheets and parallax barriers [6, 9]. Usually, a series of discrete views are
shown on an MASD simultaneously, and viewers can feel monocular motion
parallax since they will receive a new stereo pair when shifting their heads slightly.
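As a much-simplified sketch of the spatial multiplexing mentioned above, the snippet below interleaves N views column by column for a hypothetical lenticular panel; real autostereoscopic displays use slanted lenticulars and per-product sub-pixel mappings, so this is only a schematic illustration.

```python
import numpy as np

def interleave_views(views):
    """Column-wise interleaving of N equally sized views (toy model).

    views : list of (H, W, 3) arrays; actual autostereoscopic panels use
            slanted, sub-pixel view maps, so this is only illustrative.
    """
    n = len(views)
    h, w, c = views[0].shape
    panel = np.zeros((h, w, c), dtype=views[0].dtype)
    for x in range(w):
        panel[:, x, :] = views[x % n][:, x, :]   # column x shows view (x mod n)
    return panel
```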
The binocular head-mounted display that places one screen in front of each eye in
a helmet can be considered an SD [9]. When an SD tracks a viewer's head motion and generates two appropriate views for his/her eyes, the motion parallax is
fulfilled as well [6].
Although SDs and MASDs are relatively inexpensive compared with holographic and volumetric displays, neither of them delivers a true 3D experience by reproducing all rays of the captured scene. Instead, they drive the brain to fuse two given images, thus evoking depth perception from binocular cues. This mechanism requires extra effort to watch stereoscopic images, which explains why we may need a few seconds to perceive the 3D effect after putting on 3D glasses, unlike the daily-life experience of immediate depth perception. As a consequence, the
(driven by disparity) and accommodation (driven by blur) are considered two
parallel feedback control systems with cross-links [111–114]. In stereoscopic
viewing, the accommodation should be kept at the screen plane in order to perceive a clear/sharp image, whereas the vergence is directed at the actual distance
of the gazed object for binocular fusion. The discrepancy between the accommodation and vergence pushes the accommodation back and forth, thus tiring the
eyes. To minimize this conflict and reduce visual discomfort, the disparities in a
stereoscopic image should be controlled in a small range, which makes the mismatch of accommodation and vergence not evident and ensures all perceived 3D
objects fall in a stereoscopic comfort zone [115], as shown in Fig. 1.7b. Visual
experiments reveal that positive disparities (i.e., object behind the screen) may be
more comfortable than negative ones [116], and hence the comfort zone or
comfortable degree may be asymmetric on both sides of a screen.
There also exist perceptual and cognitive conflicts [114] in which the depth from disparity contradicts other cues. The window violation is a typical example, in which part of an object appearing in front of the screen hits the display margin. The occlusion cue implies that the object is behind the screen, whereas the convergence
suggests the opposite. This paradox can be solved either by using a floating
window technique that adds a virtual black border perceptually occluding the
object [117], or by pushing the whole scene backwards (by shifting the stereoscopic images or by remapping disparities of the scene) to make the object appear
on or behind the screen [115].
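Pushing the whole scene backwards by shifting the stereoscopic images, as described above, amounts to adding a constant uncrossed disparity to every point; the sketch below is a simplified illustration that crops the uncovered border columns rather than re-rendering them.

```python
import numpy as np

def shift_scene_back(left, right, shift_px):
    """Add a constant uncrossed disparity by shifting the two views apart.

    left, right : (H, W, 3) images of a stereo pair
    shift_px    : total disparity offset added to every point (pixels)
    Shifting the left image leftward and the right image rightward increases
    x_R - x_L for every point, pushing the whole scene behind the screen.
    """
    half = shift_px // 2
    w = left.shape[1]
    left_s = np.roll(left, -half, axis=1)    # left-view content moves left
    right_s = np.roll(right, half, axis=1)   # right-view content moves right
    # Crop the wrapped border columns instead of re-rendering them.
    return left_s[:, half:w - half], right_s[:, half:w - half]
```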
Other sources of unnatural 3D feeling include (1) cross talk where rays for one
eye leak into the other eye, generating ghosting effects such as double edges [118],
(2) disparity discontinuity from scene cut, forcing the eyes to readapt to the new
disparities [115], (3) flickering artifacts from view synthesis with temporally
inconsistent depth data [114, 119], (4) the puppet-theater effect and cardboard effect, in which a 3D object looks overly small or as thin as paper [120], and (5) motion tearing, in which the eyes lose track and coordination for a fast-moving object on time-sequential displays, thus seeing two objects at different positions [121]. More
details on state-of-the-art SD and MASD as well as other 3D displays will be
surveyed in Chap. 13.
(or other combinations) of the two views' 2D quality scores cannot achieve satisfactory evaluation performance [128, 137, 138]. Prediction accuracy can be improved by further incorporating disparity/depth distortions, which reflect geometric changes or depth sensation degradation [138–140]. Moreover, the two eyes' images are finally combined in the brain to form a cyclopean image, as if perceived by a single eye placed in the middle between the two eyes [110, 141]. To further account for this important visual property, which is neglected in the above-mentioned schemes, a few attempts have been made at quality evaluation using
cyclopean images [142, 143]. However, these metrics always apply binocular
fusion (also using a rough model with simple averaging) for all correspondences,
without considering whether the distorted video signals evoke binocular rivalry.
In other words, the modeling of cyclopean images in these algorithms is hardly
complete. Readers can refer to Chap. 14 for the recent advances in subjective and objective QA of stereoscopic videos.
On the other hand, it is necessary to understand the quality of the synthetic
images, since the presented stereoscopic images in this depth-based 3D system can
be either synthesized or captured. The artifacts induced by view synthesis are distinct from those of 2D video processing (e.g., transmission and compression), and state-of-the-art 2D metrics seem to be inadequate for assessing synthesized views [144]. A few preliminary studies have been made to tackle this problem, e.g., evaluating the contour shifting [144], disocclusion regions [144, 145], and temporal noise [119] in synthesized videos, which lead to a better correlation with subjective evaluation
than 2D metrics (e.g., PSNR). More details on assessing DIBR-synthesized content
can be found in Chap. 15.
inclusion of depth data, and some displayed views may be synthesized based on
the DIBR technique. In a broader view, view synthesis (e.g., in 2D-to-3D
conversion and 3D animation) can also be considered a useful technique to
create the second view for stereoscopic 3D.
2. The transmitted data are different. Compared with the stereoscopic system that
compresses and transmits texture videos only, the depth-based system also
delivers the depth information. Accordingly, the receivers further need a view
synthesis module to recover the stereoscopic format before 3D display, and the rendered views are free from (in single-view-based rendering) or exhibit less (in multiple-view-based rendering with view merging) inter-view color difference.
The depth-based system offers three obvious advantages, though it is more
complex than the stereoscopic counterpart.
1. Depth-based system can adjust the baseline distance of the presented stereo
pair, which cannot be achieved easily in stereoscopic 3D. Stereoscopic video
services present two fixed views to viewers. These two views, which are
captured by professional stereo cameras, usually have a wider baseline than our
pupil distance (typically 65 mm). This mismatch may lead to unsuitably large
disparities that make the displayed 3D object fall outside the stereoscopic
comfort zone (as mentioned in Sect. 1.2.5), thus inducing eye strain and visual
discomfort. Moreover, viewers may require different degrees of binocular
disparities based on their preference for the intensity of 3D perception, which increases with disparity. The stereoscopic video system fails to meet this requirement, unless there is feedback to the sender requesting another
two views. The depth-based system can solve these problems by generating a
new virtual view with a user-defined baseline, through which the 3D sensation can be conveniently controlled by each viewer.
2. Depth-based system is suitable for autostereoscopic display, while stereoscopic
3D system is not. Multi-view autostereoscopic displays, which are believed to
provide a more natural 3D experience, are stepping into some niche markets including advertising and exhibitions. As mentioned above, a series of discrete views are simultaneously presented on the screen, and rays from different
views are separated by a specific optical system into non-overlapping viewing
regions. To fit the delicate imaging system, multiple-view images are spatially
multiplexed into a complex pattern. Therefore, the view composition rules for
different multi-view displays are usually distinct, and thus it is impossible to
transmit one view-composite video that adapts to any multi-view displays.
Moreover, multi-view displays may use varying numbers of views, such as the 28-view Dimenco display [146] and the 8-view Alioscopy display [147], which makes the multi-view video transmission solution impractical as well. The depth-based system typically uses 1–3 views of MVD, or a Depth Enhanced Stereo (DES) [148] representation composed of two views of one-background-layer LDV. Then, the other required views can be synthesized based on the received data. View
synthesis can be conducted regardless of views actually delivered (i.e., view
merging may be absent and hole filling may be different), and the synthetic
quality increases with the number of views. In this sense, the new framework is
more flexible to support the multi-view displays.
3. The depth-based framework exhibits higher transmission efficiency. With current compression techniques, it is reported that depth can generally be compressed with around 25 % of the texture bitrate for rendering a good-quality virtual view under one-view 2D + Z, especially when the disoccluded regions are simple to fill. In comparison, an MVC-coded stereo pair may require 160 % of the base-view bitrate [17] (since the second view can be compressed more efficiently with inter-view prediction). In this case, the 2D + Z solution consumes less bandwidth. However, even when the disocclusion texture is so complex that a second view or layer is required to supply the missing information for high-quality rendering, the stereoscopic MVD format increases the total bitrate by roughly 31 %, while the functionality of an adjustable baseline is achieved (as it is with 2D + Z). For multi-view displays, this advantage becomes more evident. It is perhaps too demanding to capture, store, and transmit 8 or even 28 views over current networks. In contrast, the depth-based system requires much fewer bits to prepare and deliver the 3D information.
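The bitrate figures quoted above can be tied together with a little arithmetic. One consistent reading, taking the single-view texture bitrate as 100 % and the quoted 25 % and 160 % figures at face value, reproduces the roughly 31 % increase of stereoscopic MVD over an MVC-coded stereo pair; the absolute numbers are of course only illustrative.

```python
texture = 100.0                    # single-view texture bitrate (reference)
depth = 0.25 * texture             # ~25 % of the texture bitrate per depth map

one_view_2d_plus_z = texture + depth        # 125 %
mvc_stereo_pair = 1.60 * texture            # 160 % (MVC-coded two views)
stereo_mvd = mvc_stereo_pair + 2 * depth    # 210 %: two textures + two depths

print(one_view_2d_plus_z)                        # 125.0 -> cheaper than MVC stereo
print(stereo_mvd)                                # 210.0
print((stereo_mvd / mvc_stereo_pair - 1) * 100)  # ~31 % increase over MVC stereo
```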
The two points, data format and compression technique, are always tightly connected and standardized together. To meet the different requirements of various applications, a compression standard may have many profiles supporting slightly different data formats (e.g., YUV 4:2:0 and YUV 4:4:4) and coding tools. The Moving Picture Experts Group (MPEG) has been establishing a 3D video compression standard for the depth-based formats, and issued a call for proposals on 3D video coding technology [156] in March 2011. Both MVD and LDV are highlighted in this group, while MVD is currently supported by more members. Various compression solutions have been submitted for evaluation and competition at the time of this writing, and the standard is scheduled to be finalized by 2013 or early 2014 [157]. Stereoscopic videos can basically be coded using two-view simulcast or frame-compatible stereo (where the left and right views are spatially multiplexed, e.g., side-by-side or top-bottom) with H.264/AVC [80], or
using more efficient MVC to exploit inter-view correlation [83]. MPEG is also
developing a more advanced frame-compatible format [158], using an enhancement
layer to supply information missing in the frame-compatible base view, which is
oriented at higher quality applications while ensuring backward compatibility with
existing frame-compatible 3D services [159, 17].
The third part is the standardization for 3D video transmission. A broadcast
standard often specifies the spectrum, bandwidth, carrier modulation, error correcting code, audio and video compression method, data multiplexing, etc. Different
nations adopt different digital television broadcast standards [160], such as the DVB
[161] in Europe and ATSC [162] in North America. At present, there are no standards
for broadcasting depth-based formats, while some organizations (e.g., DVB [163] and ATSC [164]) have been working on the stereoscopic format. So far, more than 30 channels have been broadcasting stereoscopic 3D around the world since 2008, many of which actually started before the DVB 3D-TV standard [162] was issued. It is encouraging that Study Group 6 of the International Telecommunication Union (ITU) released a report in early 2010 outlining a roadmap for future 3D-TV implementation
[165], which considered the plano-stereoscopic and depth-based frameworks as the
first and second generation of 3D-TV, respectively.
In addition to the three major aspects, standards may be established for specific
devices or interfaces to facilitate a video service. Typical examples are the Blu-ray
specification for Blu-ray discs and players, and the HDMI specification for connecting two local devices (e.g., a set-top box and a display). These standards have been put into use to support stereoscopic 3D, such as the Blu-ray 3D specification [166] and HDMI 1.4 [167]. Generally speaking, however, the depth-based system is still at a nascent stage. On
the other hand, since the two compared 3D systems share a few common components
(e.g., video capture and display), some standards contribute to both frameworks. For
example, the Consumer Electronics Association (CEA), a standards and trade organization for the consumer electronics industry in the United States, has been developing a standard for active 3D glasses with infrared synchronized interface to
strengthen the interoperability of shutter glasses in the near-future market [168].
Besides, the Video Quality Experts Group (VQEG) has been working on standards for assessing 3D quality of service. Its short-term goals include measuring different
aspects of 3D quality (e.g., image quality, depth perception, naturalness, and visual fatigue) and improving subjective test procedures for 3D video.
Moreover, international forums and organizations serve as an active force to accelerate the progress of 3D video services. For example, 3D@Home [169] is a non-profit consortium targeting faster adoption of quality 3D at home. It consists of six steering teams, covering content creation, storage and transmission, 3D promotion, consumer products, human factors, and mobile 3D. The consortium has been producing numerous whitepapers and expert recommendations for good 3D content.
1.4 Conclusion
This chapter sheds some light on the development of the depth-based 3D-TV system, focusing on the technical challenges, typical solutions, standardization efforts, and a performance comparison with the stereoscopic 3D system. Research trends and detailed discussions on many aspects of the depth-based 3D-TV framework will be touched upon in the following tutorial chapters. In sum, the new system appears to be more flexible and efficient in supporting stereoscopic and multi-view 3D displays, and is believed to bring a more comfortable 3D visual sensation, at the cost of extra computational complexity, such as depth estimation and view synthesis. These new technologies also slightly compromise backward compatibility with the current 2D broadcast.
The first generation of stereoscopic 3D is clearly on the roadmap. Limited by
insufficient 3D content and immature depth acquisition, the second generation
3D-TV with depth-based representations and DIBR technique is still under
development, and it may take a few years for some key technologies to flourish
and mature. Meanwhile, some technical solutions within this system, especially
the depth-based view rendering and 3D content editing with human factors, are
also useful for the current stereoscopic system. Stereoscopic 3D is becoming a pilot to gain public acceptance of the 3D-TV service, and its momentum may also propel the demand for and maturity of the depth-based 3D-TV system.
Acknowledgment The authors thank Philips and Microsoft for kindly providing the Mobile
and Ballet sequences. They are also grateful to Dr. Vincent Jantet for preparing the LDI images
in Fig. 1.4. This work was partially supported by the National Basic Research Program of China (973) under Grant No. 2009CB320903 and the Singapore Ministry of Education Academic Research Fund Tier 1 (AcRF Tier 1 RG7/09).
References
1. Television Invention Timeline. Available: http://www.history-timelines.org.uk/eventstimelines/08-television-invention-timeline.htm
2. Ito T (2010) Future television – super hi-vision and beyond. In: Proceedings of IEEE Asian solid-state circuits conference, Nov 2010, Beijing, China, pp 1–4
73. Bertalmio M, Bertozzi AL, Sapiro G (2001) Navier-Stokes, fluid dynamics, and image and video inpainting. In: Proceedings of IEEE international conference on computer vision and pattern recognition, pp 355–362
74. Oh K, Yea S, Ho Y (2009) Hole-filling method using depth based in-painting for view synthesis in free viewpoint television (FTV) and 3D video. In: Picture coding symposium (PCS), Chicago, pp 233–236
75. Daribo I, Pesquet-Popescu B (2010) Depth-aided image inpainting for novel view synthesis. In: Proceedings of IEEE international workshop on multimedia signal processing (MMSP)
76. Ndjiki-Nya P, Köppel M, Doshkov D, Lakshman H, Merkle P, Müller K, Wiegand T (2011) Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Trans Multimedia 13(3):453–465
77. Schmeing M, Jiang X (2010) Depth image based rendering: a faithful approach for the disocclusion problem. In: Proceedings of 3DTV conference, pp 1–4
78. Zhao Y, Zhu C, Chen Z, Tian D, Yu L (2011) Boundary artifact reduction in view synthesis of 3D video: from perspective of texture-depth alignment. IEEE Trans Broadcast 57(2):510–522
79. Zhao Y, Yu L (2010) A perceptual metric for evaluating quality of synthesized sequences in 3DV system. In: Proceedings of visual communications and image processing (VCIP), July 2010
80. Wiegand T, Sullivan GJ, Bjøntegaard G, Luthra A (2003) Overview of the H.264/AVC video coding standard. IEEE Trans Circuits Syst Video Technol 13(7):560–576
81. Vetro A, Yea S, Zwicker M, Matusik W, Pfister H (2007) Overview of multiview video coding and anti-aliasing for 3D displays. In: Proceedings of international conference on image processing, vol 1, pp I-17–I-20, Sept 2007
82. Merkle P, Smolic A, Müller K, Wiegand T (2007) Efficient prediction structures for multiview video coding. IEEE Trans Circuits Syst Video Technol 17(11):1461–1473
83. Chen Y, Wang Y-K, Ugur K, Hannuksela M, Lainema J, Gabbouj M (2009) The emerging MVC standard for 3D video services. EURASIP J Adv Sig Process 2009(1), Jan 2009
84. Merkle P, Morvan Y, Smolic A, Farin D, Müller K, de With PHN, Wiegand T (2009) The effects of multiview depth video compression on multiview rendering. Sig Process: Image Commun 24(1–2):73–88
85. Kim W-S, Ortega A, Lai P, Tian D, Gomila C (2010) Depth map coding with distortion estimation of rendered view. In: Proceedings of SPIE visual information processing and communication, vol 7543, pp 75430B–75430B-10
86. Tikanmaki A, Gotchev A, Smolic A, Muller K (2008) Quality assessment of 3D video in rate allocation experiments. In: Proceedings of IEEE international symposium on consumer electronics
87. Kang M-K, Ho Y-S (2010) Adaptive geometry-based intra prediction for depth video coding. In: Proceedings of IEEE international conference on multimedia and expo (ICME), July 2010, pp 1230–1235
88. Kim W-S, Ortega A, Lai P, Tian D, Gomila C (2009) Depth map distortion analysis for view rendering and depth coding. In: Proceedings of international conference on image processing
89. Oh K-J, Vetro A, Ho Y-S (2011) Depth coding using a boundary reconstruction filter for 3D video systems. IEEE Trans Circuits Syst Video Technol 21(3):350–359
90. Zhao Y, Zhu C, Chen Z, Yu L (2011) Depth no-synthesis error model for view synthesis in 3D video. IEEE Trans Image Process 20(8):2221–2228, Aug 2011
91. Daribo I, Tillier C, Pesquet-Popescu B (2008) Adaptive wavelet coding of the depth map for stereoscopic view synthesis. In: Proceedings of IEEE international workshop on multimedia signal processing (MMSP'08), Cairns, Australia, pp 34–39, Oct 2008
92. Liu S, Lai P, Tian D, Chen CW (2011) New depth coding techniques with utilization of corresponding video. IEEE Trans Broadcast 57(2):551–561
93. Shimizu S, Kitahara M, Kimata H, Kamikura K, Yashima Y (2007) View scalable multiview video coding using 3-D warping with depth map. IEEE Trans Circuits Syst Video Technol 17(11):1485–1495
94. Yea S, Vetro A (2009) View synthesis prediction for multiview video coding. Sig Process Image Commun 24(1+2):89–100
95. Lin YH, Wu JL (2011) A depth information based fast mode decision algorithm for color plus depth-map 3D videos. IEEE Trans Broadcast 57(2):542–550
96. Merkle P, Wang Y, Müller K, Smolic A, Wiegand T (2009) Video plus depth compression for mobile 3D services. In: Proceedings of 3DTV conference
97. Wang Y, Zhu Q-F (1998) Error control and concealment for video communication: a review. Proc IEEE 86(5):974–997
98. Wang Y, Wenger S, Wen J, Katsaggelos A (2000) Error resilient video coding techniques. IEEE Signal Process Mag 17(4):61–82
99. Stockhammer T, Hannuksela M, Wiegand T (2003) H.264/AVC in wireless environments. IEEE Trans Circuits Syst Video Technol 13(7):657–673
100. Zhang R, Regunathan SL, Rose K (2000) Video coding with optimal inter/intra-mode switching for packet loss resilience. IEEE J Sel Areas Commun 18(6):966–976
101. Zhang J, Arnold JF, Frater MR (2000) A cell-loss concealment technique for MPEG-2 coded video. IEEE Trans Circuits Syst Video Technol 10(6):659–665
102. Agrafiotis D, Bull DR, Canagarajah CN (2006) Enhanced error concealment with mode selection. IEEE Trans Circuits Syst Video Technol 16(8):960–973
103. Xiang X, Zhao D, Wang Q, Ji X, Gao W (2007) A novel error concealment method for stereoscopic video coding. In: Proceedings of international conference on image processing (ICIP 2007), pp 101–104
104. Akar GB, Tekalp AM, Fehn C, Civanlar MR (2007) Transport methods in 3DTV – a survey. IEEE Trans Circuits Syst Video Technol 17(11):1622–1630
105. Tan AS, Aksay A, Akar GB, Arikan E (2009) Rate-distortion optimization for stereoscopic video streaming with unequal error protection. EURASIP J Adv Sig Process, vol 2009, Article ID 632545, Jan 2009
106. De Silva DVSX, Fernando WAC, Worrall ST (2010) 3D video communication scheme for error prone environments based on motion vector sharing. In: Proceedings of IEEE 3DTV-CON, Tampere, Finland
107. Yan B (2007) A novel H.264 based motion vector recovery method for 3D video transmission. IEEE Trans Consum Electron 53(4):1546–1552
108. Liu Y, Wang J, Zhang H (2010) Depth image-based temporal error concealment for 3-D video transmission. IEEE Trans Circuits Syst Video Technol 20(4):600–604
109. Chung TY, Sull S, Kim CS (2011) Frame loss concealment for stereoscopic video plus depth sequences. IEEE Trans Consum Electron 57(3):1336–1344
110. Howard IP, Rogers BJ (1995) Binocular vision and stereopsis. Oxford University Press,
Oxford
111. Yano S, Ide S, Mitsuhashi T, Thwaites H (2002) A study of visual fatigue and visual comfort for 3D HDTV/HDTV images. Displays 23(4):191–201
112. Hoffman DM, Girshick AR, Akeley K, Banks MS (2008) Vergence-accommodation conflicts hinder visual performance and cause visual fatigue. J Vis 8(3):1–30
113. Lambooij MTM, IJsselsteijn WA, Fortuin M, Heynderickx I (2009) Visual discomfort and visual fatigue of stereoscopic displays: a review. J Imaging Sci Technol 53(3):030201–030201-14, May–Jun 2009
114. Tam WJ, Speranza F, Yano S, Shimono K, Ono H (2011) Stereoscopic 3D-TV: visual comfort. IEEE Trans Broadcast 57(2):335–346
115. Lang M, Hornung A, Wang O, Poulakos S, Smolic A, Gross M (2010) Nonlinear disparity mapping for stereoscopic 3D. ACM Trans Graph 29(4):75:1–75:10, July 2010
116. Nojiri Y, Yamanoue H, Ide S, Yano S, Okana F (2006) Parallax distribution and visual comfort on stereoscopic HDTV. In: Proceedings of IBC, pp 373–380
117. Gunnewiek RK, Vandewalle P (2010) How to display 3D content realistically. In: Proceedings
of international workshop video processing quality metrics consumer electronics (VPQM),
Jan 2010
118. Daly SJ, Held RT, Hoffman DM (2011) Perceptual issues in stereoscopic signal processing. IEEE Trans Broadcast 57(2):347–361
119. Zhao Y, Yu L (2010) A perceptual metric for evaluating quality of synthesized sequences in 3DV system. In: Proceedings of visual communications and image processing (VCIP), July 2010
120. Yamanoue H, Okui M, Okano F (2006) Geometrical analysis of puppet-theatre and cardboard effects in stereoscopic HDTV images. IEEE Trans Circuits Syst Video Technol 16(6):744–752
121. Wittlief K (2007) Stereoscopic 3D film and animation – getting it right. Comput Graph 41(3), Aug 2007. Available: http://www.siggraph.org/publications/newsletter/volume/stereoscopic-3d-film-and-animationgetting-it-right
122. Sheikh HR, Sabir MF, Bovik AC (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans Image Process 15(11):3440–3451
123. Engelke U, Zepernick HJ (2007) Perceptual-based quality metrics for image and video services: a survey. In: 3rd EuroNGI conference on next generation internet networks, pp 190–197
124. Seshadrinathan K, Soundararajan R, Bovik AC, Cormack LK (2010) Study of subjective and objective quality assessment of video. IEEE Trans Image Process 19(6):1427–1441
125. Chikkerur S, Vijay S, Reisslein M, Karam LJ (2011) Objective video quality assessment methods: a classification, review, and performance comparison. IEEE Trans Broadcast 57(2):165–182
126. Zhao Y, Yu L, Chen Z, Zhu C (2011) Video quality assessment based on measuring perceptual noise from spatial and temporal perspectives. IEEE Trans Circuits Syst Video Technol 21(12):1890–1902
127. IJsselsteijn W, de Ridder H, Hamberg R, Bouwhuis D, Freeman J (1998) Perceived depth and the feeling of presence in 3DTV. Displays 18(4):207–214
128. Yasakethu SLP, Hewage CTER, Fernando WAC, Kondoz AM (2008) Quality analysis for 3D video using 2D video quality models. IEEE Trans Consum Electron 54(4):1969–1976
129. ITU-R Rec. BT.1438 (2000) Subjective assessment of stereoscopic television pictures. International Telecommunication Union
130. ITU-R Rec. BT.500-11 (2002) Methodology for the subjective assessment of the quality of television pictures. International Telecommunication Union
131. ITU-R (2008) Digital three-dimensional (3D) TV broadcasting. Question ITU-R 128/6
132. Xing L, You J, Ebrahimi T, Perkis A (2010) An objective metric for assessing quality of experience on stereoscopic images. In: Proceedings of IEEE international workshop on multimedia signal processing (MMSP), pp 373–378
133. Goldmann L, Lee JS, Ebrahimi T (2010) Temporal synchronization in stereoscopic video: influence on quality of experience and automatic asynchrony detection. In: Proceedings of international conference on image processing (ICIP), Hong Kong, pp 3241–3244, Sept 2010
134. Levelt WJ (1965) Binocular brightness averaging and contour information. Brit J Psychol 56:1–13
135. Stelmach LB, Tam WJ (1998) Stereoscopic image coding: effect of disparate image-quality in left- and right-eye views. Sig Process: Image Commun 14:111–117
136. Zhao Y, Chen Z, Zhu C, Tan Y, Yu L (2011) Binocular just-noticeable-difference model for stereoscopic images. IEEE Signal Process Lett 18(1):19–22
137. Hewage CTER, Worrall ST, Dogan S, Villette S, Kondoz AM (2009) Quality evaluation of color plus depth map based stereoscopic video. IEEE J Sel Top Sig Process 3(2):304–318
138. You J, Xing L, Perkis A, Wang X (2010) Perceptual quality assessment for stereoscopic images based on 2D image quality metrics and disparity analysis. In: Proceedings of 5th international workshop on video processing and quality metrics for consumer electronics (VPQM), Scottsdale, AZ, USA
139. Benoit A, Le Callet P, Campisi P, Cousseau R (2008) Quality assessment of stereoscopic images. EURASIP J Image Video Process, vol 2008, Article ID 659024
140. Lambooij M (2011) Evaluation of stereoscopic images: beyond 2D quality. IEEE Trans Broadcast 57(2):432–444
34
Y. Zhao et al.
141. Julesz B (1971) Foundations of cyclopean perception. The University of Chicago Press, Chicago
142. Boev A, Gotchev A, Egiazarian K, Aksay A, Akar GB (2006) Towards compound stereovideo quality metric: a specific encoder-based framework. In: Proceedings of IEEE
southwest symposium on image analysis and interpretation, pp 218222
143. Maalouf A, Larabi M-C (2011) CYCLOP: a stereo color image quality assessment metric.
In: Proceedings of IEEE international conference on acoustics, speech and signal processing
(ICASSP), pp 11611164
144. Bosc E, Pepion R, Le Callet P, Koppel M, Ndjiki-Nya P, Pressigout M, Morin L (2011)
Towards a new quality metric for 3-D synthesized view assessment. IEEE J Sel Top Sig
Process 5(7):13321343
145. Shao H, Cao X, Er G (2009) Objective quality assessment of depth image based rendering in
3DTV system. In: Proceedings of 3DTV conference, pp 14
146. Dimenco display Available: http://www.dimenco.eu/displays/
147. Alioscopy display Available: http://www.alioscopy.com/3d-solutions-displays
148. Smolic A, Muller K, Merkle P, Kauff P, Wiegand T (2009) An overview of available and
emerging 3D video formats and depth enhanced stereo as efficient generic solution. In:
Proceedings of picture coding symposium (PCS), pp 389392
149. Grau O, Borel T, Kauff P, Smolic A, Tanger R (2011) 3D-TV R&D activities in Europe.
IEEE Trans Broadcast 57(2):408420
150. Seventh Framework Programme (FP7) Available: http://cordis.europa.eu/fp7/home_en.html
151. 3D4YOU Available: http://www.3d4you.eu/
152. 2020 3D Media Available: http://www.20203dmedia.eu/
153. Mobile 3DTV Available: http://sp.cs.tut.fi/mobile3dtv/
154. 3DPHONE Available: http://www.3dphone.org/
155. Report of SMPTE task force on 3D to the Home Available: http://store.smpte.org/product-p/
tf3d.htm
156. Video and Requirement Group (2011) Call for proposals on 3d video coding technology.
ISO/IEC JTC1/SC29/WG11 Doc. N12036, Mar 2011
157. Video Group (2011) Standardization tracks considered in 3D video coding. ISO/IEC JTC1/
SC29/WG11 Doc. N12434, Dec 2011
158. Video and Requirement Group (2011) Draft call for proposals on mpeg frame-compatible
enhancement. ISO/IEC JTC1/SC29/WG11 Doc. N12249, Jul 2011
159. Tourapis AM, Pahalawatta P, Leontaris A, He Y, Ye Y, Stec K, Husak W (2010) A frame
compatible system for 3D delivery. ISO/IEC JTC1/SC29/WG11 Doc. M17925, Jul 2010
160. Wu Y, Hirakawa S, Reimers U, Whitaker J (2006) Overview of digital television
development worldwide. Proc IEEE 94(1):821
161. Reimers U (2006) DVBthe family of international standards for digital video broadcasting.
Proc IEEE 94(1):173182
162. Richer MS, Reitmeier G, Gurley T, Jones GA, Whitaker J, Rast R (2006) The ATSC digital
television system. Proc IEEE 94(1):3742
163. European Telecommunications Standard Institute ETSI (2011) Digital video broadcasting
(DVB): frame compatible plano-stereoscopic 3DTV (DVB-3DTV). DVB Document A154,
Feb 2011
164. ATSC begins work on broadcast standard for 3D-TV transmissions Available: http://
www.atsc.org/cms/index.php/communications/press-releases/257-atsc-begins-work-onbroadcast-standard-for-3d-tv-transmissions
165. Report ITU-R BT.2160 (2010) Features of three-dimensional television video systems for
broadcasting. International Telecommunication Union
166. Final 3-D Blu-ray specification announced Available: http://www.blu-ray.com/news/
?id=3924
167. Specification Available: http://www.hdmi.org/manufacturer/specification.aspx
168. CEA begins standards process for 3D glasses Available: http://www.ce.org/Press/
CurrentNews/press_release_detail.asp?id=12067
169. Steering teamsoverview Available: http://www.3dathome.org/steering-overview.aspx
35
170. Bruls F, Gunnewiek RK, van de Walle P (2009) Philips response to new call for 3DV test
material: arrive book and mobile. ISO/IEC JTC1/SC29/WG11 Doc. M16420, Apr 2009
171. Microsoft 3D video test sequences Available: http://research.microsoft.com/ivm/
3DVideoDownload/
172. Tanimoto M, Fujii T, Suzuki K (2009) View synthesis algorithm in view synthesis reference
software 2.0 (VSRS2.0). ISO/IEC JTC1/SC29/WG11 Doc. M16090, Lausanne, Switzerland,
Feb 2009
Part II
Content Generation
Chapter 2
Keywords 3D display · 3D production · 3D representation · 3D videoconferencing · Auto-stereoscopic multi-view display · Content creation · Depth estimation · Depth map · Depth-image-based rendering (DIBR) · Display-agnostic production · Extrapolation · Stereo display · Stereoscopic video · Stereo matching
2.1 Introduction
The commercial situation of 3D video has changed dramatically during the last
couple of years. Having long been confined to niche markets like IMAX theatres and theme parks, 3D video is now migrating into a mass market. One main reason is the introduction of Digital Cinema: it is the high performance of digital cameras, post-production, and screening that makes it possible, for the first time, to show 3D video with acceptable and convincing quality. As a consequence, installations of 3D screens in cinema theatres are increasing exponentially (see Fig. 2.1). Numerous 3D movies have been released to theatres and have generated large revenues, with success growing year after year. The value-added chain of 3D cinema is under
intensive worldwide discussion.
This process also includes the repurposing of 3D productions for home entertainment. In 2010, the Blu-ray Disc Association and the HDMI Consortium specified interfaces for 3D home systems. Simultaneously, the first 3D Blu-ray recorders and 3D-TV sets entered the market. Global players in consumer electronics expect that by 2015 more than 30 % of all HD panels at home will be equipped with 3D capabilities. Major live events have been broadcast in 3D, and the first commercial 3D-TV channels are now on air. More than 100 new 3D-TV channels are expected to be launched during the next few years.
The rapid development of 3D Cinema and 3D-TV is now going to find its way
into many different fields of applications. Gamers will enjoy their favorite entertainment in a new dimension. Mobile phones, PDAs, laptops, and similar devices
will provide the extended visual 3D sensation anytime and anywhere. The usage of
3D cameras is not restricted to professional high-end productions any longer, but
low-budget systems are now also available for private consumers or semi-professional users. Digital signage and commercials are exploiting 3D imaging to
increase their attractiveness. Medical applications use 3D screens to enhance the
visualization in operating rooms. Other non-entertainment applications like tele-education, augmented reality, visual analytics, tele-presence systems, shared virtual environments, or video conferencing use 3D representations to increase
effectiveness, naturalism, and immersiveness.
The main challenge of this development is that all these applications will use
quite different stereo representations and 3D displays. Thus, future 3D productions
for 3D-TV, digital signage, commercials, and other 3D video applications will
cope with the problem that they have to address a wide range of different 3D
displays, ranging from glasses-based standard stereo displays to auto-stereoscopic
multi-view displays, or even future integral imaging, holographic, and light-field
displays. The challenge will be to serve all these display types with sufficient
quality and appealing content by using one generic representation format and one
common production workflow.
Against this background, this chapter discusses flexible solutions for 3D capture, generic 3D representation formats using depth maps, robust methods for reliable depth estimation, required pre-processing of captured multi-view footage, post-processing of estimated depth maps, and, finally, depth-image-based rendering (DIBR) for creating missing virtual views at the display side.
Fig. 2.1 Worldwide development of 3D screen installations in cinema theatres over the last 5 years (Source FLYING EYE, PRIME project, funded by the German federal ministry of economics and technology, grant no. 01MT08001)
Section 2.2 will first discuss future 3D video applications and related requirements. Then, Sect. 2.3 will review the functional concept of auto-stereoscopic
displays, and afterwards, in Sect. 2.4 a generic display-agnostic production workflow that supports the wide range of all existing and anticipated 3D displays will be
presented. Subsequently, Sects. 2.5, 2.6, and 2.7 focus on details of this generic
display-agnostic production workflow, such as rectification, stereo correction,
depth estimation, and DIBR. Finally, Sect. 2.8 will discuss an extension of this
workflow toward multi-view capturing and processing with more than two views.
Section 2.9 will summarize the chapter and give an outlook on future work and
challenges.
Fig. 2.2 Immersive 3D video conferencing as an application example for using 3D displays in
non-entertainment business segments (Source European FP7 project 3D presence)
Fig. 2.3 Function of lenticular lenses and parallax barriers at auto-stereoscopic 3D displays
Fig. 2.4 Repeating of viewing cones with a given number of stereo views at an autostereoscopic multi-view multi-user display (left) and an example for an interweaving pattern of
slanted lenticular lenses or parallax barriers (right)
horizontal resolution of one ninth per view compared to that of the underlying image panel (e.g. 213 pixels per view in the case of an HD panel with 1,920 pixels). To compensate for this systematic loss of resolution, the lenticular lenses or parallax barriers are often arranged in a slanted direction. By doing so, the loss of resolution can be distributed equally in the horizontal and vertical directions as well as over
the three color components.
To explain it in more detail, Fig. 2.4 (right) shows an example of a slanted
raster for a display with eight stereo pairs per viewing cone. Each colored rectangle denotes a sub-pixel of the corresponding RGB color component at the
underlying image panel. The number inside each rectangle indicates the corresponding stereo view. The three adumbrated lenses illustrate the slope of the
optical system mounted in front of the panel. The emitted light of a sub-pixel will
be deviated by a specific angle according to its position behind the lens. Consequently, the image data corresponding to the different views need to be loaded into
the raster such that all sub-pixels corresponding to the same view are located at the
same position and angle relative to the covering lens. That way, the perceived loss
of resolution per stereo view can substantially be reduced.
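To make the idea of loading view data into such a raster concrete, the following sketch interleaves N rendered views into one panel image using a simple, hypothetical slanted assignment; the view-index formula, the slant value, and the function names are illustrative assumptions only, since the real sub-pixel rasters are proprietary and display specific.

import numpy as np

def interleave_views(views, slant=1.0 / 3.0):
    """Interleave N views (each H x W x 3, RGB) into one panel image.

    Hypothetical slanted assignment: the view index of a sub-pixel depends on
    its horizontal sub-pixel position, its row, and the slant of the lenses.
    Real displays use proprietary, calibrated rasters.
    """
    n_views = len(views)
    h, w, _ = views[0].shape
    panel = np.zeros((h, w, 3), dtype=views[0].dtype)
    for y in range(h):
        for x in range(w):
            for c in range(3):                       # R, G, B sub-pixels
                sub = 3 * x + c                      # horizontal sub-pixel index
                view_id = int(sub + y * slant * 3) % n_views
                panel[y, x, c] = views[view_id][y, x, c]
    return panel

# Example: interleave eight synthetic views of a small test image
views = [np.full((8, 16, 3), v * 30, dtype=np.uint8) for v in range(8)]
panel = interleave_views(views)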
The opening angle of the viewing cone, the number of views per cone, the slanted
positioning of the optical system, and the corresponding interweaved sub-pixel
raster differ significantly between the display manufacturers. Usually, the details of
the sub-pixel raster are not public as they determine the resulting 3D-quality of the
display. In this sense, every display manufacturer has its own philosophy and finds it
own way to optimize the relevant system parameters and the resulting trade-off. As a
consequence, the envisaged generic display-agnostic 3D production workflow has to
be that flexible that it can support each of the existing auto-stereoscopic 3D display,
independently of the used number of views and the other system design parameters.
Moreover, a generic workflow also has to be future proof in this sense. The
number of stereo pairs per viewing cone, the angle of viewing cones and the
density of stereo views will increase considerably with increasing resolution of
LCD panels. First prototypes using 4k panels are already available and will enter
the market very soon. The current availability of 4k panels is certainly not the end
of this development. First 8k panels are on the way and panels with even higher
resolution will follow. It is generally agreed that this development is the key
technology for auto-stereoscopic displays and will end up with 50 views and more.
Today the usual number of views is around ten, but the first displays with up to 28 views have recently been introduced onto the market. An experimental display with 28 views was already described in 2000 by Dodgson et al. [18].
Fig. 2.5 Processing chain of a generic display-agnostic production and distribution workflow
In the next step, depth information is extracted from the stereo signals. This
depth estimation results in dense pixel-by-pixel depth maps for each camera view
and video frame, a representation format often called multiple-video-plus-depth
(MVD). Depending on the number of views required by a specific 3D display, this MVD data is used at the receiver to interpolate or extrapolate missing views by means of DIBR.
Hence, using MVD as an interim format for 3D production is the most important
feature of the envisaged display-agnostic workflow because it decouples the stereo
geometry of the acquisition and display side.
If needed for transmission or storage, depth and video data of MVD can then be
encoded separately or jointly by suitable coding schemes. For this purpose, MVD is
often transformed into other representation formats like LDV or DES that are more
suitable for coding. Details of these coding schemes and representation formats are
explained in more detail in Chap. 1 and Chap. 8 of this book and, hence, will not be
addressed again in this chapter.
At the display side, original or decoded MVD data are used to adapt the
transmitted stereo signals to the specific viewing conditions and the 3D display in
use. On one hand, this processing step includes the calculation of missing views by
DIBR if the number of transmitted stereo signals is lower than the number of
required views at the 3D display. On the other hand, it provides the capability to
adapt the available depth range to the special properties of the display or the
viewing conditions in the living room. The latter aspect is not only important for
auto-stereoscopic displays, it is also useful for standard stereo displays. For
instance, the depth range of a 3D cinema movie might be too low if watched at a
3D-TV set and a new stereo signal with enlarged depth range can be rendered by
using the MVD format.
The above production workflow and the related processing chain allow for
supporting a variety of 3D-displays, from conventional glasses-based displays to
sophisticated auto-stereoscopic displays. The following sections will describe
more details of the different processing steps, starting with the pre-processing
including calibration and rectification in Sect. 2.5, followed by depth estimation in
Sect. 2.6 and ending with DIBR in Sect. 2.7.
Fig. 2.6 Robust feature point correspondences for the original stereo images of test sequence
BEER GARDEN, kindly provided by European research project 3D4YOU [53]
stereo images have to respect the epipolar constraint, where F denotes the fundamental matrix defined by a set of geometrical parameters like orientations,
relative positions, focal lengths, and principal points of the two stereo cameras:
m′ᵀ F m = 0    (2.1)
term of the matrix F where cx, cy and cz denote the orientation angles of the right
camera:
F ≈ [linearized 3 × 3 fundamental matrix expressed in the orientation angles cx, cy, cz, the focal length f, and the relative focal-length deviation Δf]    (2.3)
Note that the above preconditions are generally fulfilled in case of a proper
stereo set-up using professional rigs and prime lenses. Based on this linearization,
the epipolar equation from Eq. (2.3) can also be written as follows:
v′ − v  =  [y-shift] + [cy-keystone] + [roll] + [Δ-zoom] + [tilt offset] + [cx-keystone] + [z-shift]    (2.4)

where the left-hand side is the vertical disparity and each bracketed term on the right-hand side is linear in one of the set-up parameters (y-shift, cy-keystone, roll, Δ-zoom, tilt offset, cx-keystone, z-shift) and in the image coordinates (u, v) and (u′, v′) of the matching points.
Once the fundamental matrix F has been estimated from matching feature point
correspondences, the coefficients from the linearization in Eq. (2.3) can be derived
from Eq. (2.4). For this purpose the relation from Eq. (2.4) is used to build up a
system of linear equations on the basis of the same robust feature point correspondences. Solving this linear equation system by suitable optimization procedures (e.g. RANSAC [23]) enables a robust estimation of the existing stereo
geometry. The resulting parameters can then be exploited to steer and correct
geometrical and optical settings in case of motorized rig and lenses.
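As an illustration of this estimation step, the following sketch uses OpenCV's generic feature matching and RANSAC-based fundamental matrix estimation instead of the specific feature descriptors and linearized solver discussed in this chapter; the detector choice, thresholds, and function names are assumptions made for illustration only.

import cv2
import numpy as np

def estimate_fundamental(img_left, img_right):
    """Estimate F from robust feature correspondences (illustrative sketch)."""
    orb = cv2.ORB_create(2000)                        # any feature detector could be used here
    kp1, des1 = orb.detectAndCompute(img_left, None)
    kp2, des2 = orb.detectAndCompute(img_right, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # RANSAC rejects outlier correspondences while estimating F
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    inliers = inlier_mask.ravel() == 1
    return F, pts1[inliers], pts2[inliers]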
For example, the D-zoom parameter indicates a small deviation in focal lengths
between the two stereo cameras and can be used to correct it in case of motorized
zoom lenses. Furthermore, detected roll, tilt-offset, and cx-keystone parameters can
be used to correct unwanted roll and tilt angles at the rig, if the rig allows this
adjustment, either by motors or manually. The same holds for the parameters y-shift
and z-shift indicating a translational displacement of the cameras in vertical
direction and along the optical axes, respectively. The cy-keystone parameter refers
to an existing toe-in of converging cameras and its correction can be used to ensure
a parallel set-up or to perform a de-keystoning electronically.
Perfect control of geometrical and optical settings will, however, not always be possible. Some stereo rigs are not motorized, and adjustments have to be done manually with limited mechanical accuracy. When changing the focus, the focal length
of lenses might be affected. In addition, lenses are exchanged during shootings
and, if zoom lenses are used, motors do not synchronize exactly and lens control
suffers from backlash hysteresis.
As a consequence, slight geometrical distortions may remain in the stereo
images. These remaining distortions can be corrected electronically by means of
image rectification. The process of image rectification is well known from the literature [22, 24–26]. It describes 2D warping functions H and H′ that are applied to the left and right stereo images, respectively, to compensate for deviations from the ideal
case of parallel stereo geometry. In the present case, H and H′ are derived from a set of constraints that have to be defined by the given application scenario.
One major constraint in any image rectification is that multiplying the corresponding image points m and m′ in Eq. (2.1) with the sought 2D warping matrices H and H′ has to result in a new fundamental matrix that is equal to the rectified state in Eq. (2.2). Clearly, this is not enough to determine all 16 degrees of freedom in the two matrices H and H′ and, hence, further constraints have to be defined for the particular application case.
One further constraint in the given application scenario is that the horizontal
shifts of the images have to respect the user-defined specification of the convergence plane. Furthermore, the 2D warping matrix H for the left image has to be chosen
such that the horizontal and vertical deviations cy and cz of the right camera are
eliminated, i.e. the left camera has to be rotated such that the new baseline after
rectification goes through the focal point of the right camera:
H = [3 × 3 warping matrix expressed in the parameters cy and cz/f]    (2.5)
Based on this determination, the 2D warping matrix H′ for the right image can be calculated in a straightforward way by taking into account the additional side constraints that the left and right cameras have the same orientation after rectification, that both cameras have the same focal length, and that the x-axis of the right camera has the same orientation perpendicular to the new baseline:
H′ = [3 × 3 warping matrix expressed in the parameters cx, cy, cz, the focal length f, and the relative focal-length deviation Δf]    (2.6)
Figure 2.7 shows results from an application of image rectification to the non-rectified originals from Fig. 2.6. Note that the vertical disparities have been almost completely eliminated.
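A minimal sketch of the final warping step, assuming the two 3 × 3 rectifying homographies H and H′ have already been derived as described above (their derivation is not reproduced here):

import cv2

def rectify_pair(img_left, img_right, H_left, H_right):
    """Warp both stereo images with their 3x3 rectifying homographies."""
    h, w = img_left.shape[:2]
    rect_left = cv2.warpPerspective(img_left, H_left, (w, h))
    rect_right = cv2.warpPerspective(img_right, H_right, (w, h))
    return rect_left, rect_right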
Fig. 2.7 Results of image rectification for the images of test sequence from Fig. 2.6
E(d) = Edata(d) + Esmooth(d)    (2.7)
The term Edata(d) takes care of the constant-brightness constraint (color constancy assuming Lambertian surfaces) by measuring matching point correspondences between the stereo images, whereas the term Esmooth(d) exploits the smoothness constraint by penalizing disparity discontinuities. The first image in the second row shows a result of Dynamic Programming, which minimizes the energy function for each scan line independently [31–33]. As can be seen in Fig. 2.8d, this approach leads to the well-known streaking artifacts. In contrast, the two other images in the second row refer to global methods that minimize a 2D energy function for the whole image, such as Graph-Cut [34–38] and Belief Propagation [39–41]. As shown in Fig. 2.8e and f, these methods provide much more consistent depth maps than Dynamic Programming.
Due to these results, global optimization methods are often referred to as top-performing stereo algorithms that clearly outperform local methods [27]. This conclusion is also confirmed by a comparison of the results from Graph-Cut and Belief Propagation in Fig. 2.8e and f with that of a straightforward block matcher in
Fig. 2.8 Comparison of stereo-algorithms on basis of the Tsukuba test image of the Middlebury
stereo data set [27]: a, b original left and right image, c ground truth depth map, d dynamic
programming, e belief propagation, f graph-cut, g block matching without post-processing,
h block matching with post-processing
Fig. 2.8g, which is the most prominent representative of local methods and is also known from the context of motion estimation. An in-depth comparison of block matching
and other disparity estimation techniques is given in [42].
However, the advantage of global methods over local ones has so far only been proven for still images. Moreover, recent research has shown that local methods with adequate post-processing of the resulting depth maps, e.g. by cross-bilateral filtering [43, 44], can provide results that are almost comparable with those of global methods. In this context, Fig. 2.8h shows an example of block matching with post-processing. In addition, global methods suffer from high computational complexity and a considerable lack of temporal consistency. For all these reasons, local approaches are usually preferred for 3D video processing for the time being, especially if real-time processing is addressed.
Against this background, Fig. 2.9 depicts as an example the processing scheme
of a real-time depth estimator for 3D video using hybrid recursive matching
(HRM) in combination with motion estimation and adaptive cross-trilateral
median filtering (ACTMF) for depth map post-processing.
The HRM algorithm is a local stereo matching method that is used for initial
depth estimation in the processing scheme from Fig. 2.9. It is based on a hybrid
solution using both spatially and temporally recursive block matching and pixel-recursive depth estimation [45]. Due to its recursive structure, the HRM algorithm produces largely smooth and temporally consistent pixel-by-pixel disparity maps.
In the structure from Fig. 2.9 two independent HRM processes (stereo matching
from right to left and, vice versa, from left to right view) estimate two initial
disparity maps, one for each stereo view. The confidence of disparity estimation is
then measured by checking the consistency between the two disparity maps and by
computing a confidence kernel from normalized cross-correlation used by HRM.
In addition, a texture analysis detects critical regions with ambiguities (homogeneities, similarities, periodicities, etc.) that might cause mismatches, and a motion estimator provides temporally predicted estimates by applying motion compensation to the disparities of the previous frame to further improve temporal consistency.
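The consistency check between the two disparity maps can be sketched generically as follows; this is not the exact HRM confidence measure, and the one-pixel tolerance is an illustrative assumption.

import numpy as np

def lr_consistency_mask(disp_left, disp_right, tol=1.0):
    """Mark pixels whose left and right disparities agree within a tolerance.

    disp_left maps left -> right and disp_right maps right -> left (both in pixels).
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    # Position in the right image that each left pixel maps to
    x_right = np.clip((xs - disp_left).round().astype(int), 0, w - 1)
    disp_back = disp_right[np.arange(h)[:, None], x_right]
    return np.abs(disp_left - disp_back) <= tol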
Finally, the two initial disparity maps are post-processed by using ACTMF. The
adaptive filter is applied to the initial disparity values Di(p) from the HRM at pixel positions p within a filter region ns around the center position s:
Do(s) = weighted median p∈ns { wp , Di(p) }    (2.8)

where the weights wp are defined in Eq. (2.9).
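A simple sketch of such a weighted median filter, assuming the per-pixel weights have already been computed (the actual ACTMF weights, which combine color similarity, spatial distance, and confidence, are not reproduced here):

import numpy as np

def weighted_median(values, weights):
    """Return the weighted median of a set of disparity samples."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def filter_disparity(disp, weight_map, radius=3):
    """Apply a weighted median filter in a (2*radius+1)^2 window around each pixel."""
    h, w = disp.shape
    out = disp.copy()
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            patch = disp[y - radius:y + radius + 1, x - radius:x + radius + 1].ravel()
            wts = weight_map[y - radius:y + radius + 1, x - radius:x + radius + 1].ravel()
            out[y, x] = weighted_median(patch, wts)
    return out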
Fig. 2.10 Results of depth estimation applying the processing chain from Fig. 2.9 to stereo
images of test sequence BAND06 kindly provided by European research project MUSCADE [46]
Fig. 2.11 Depth-based 3D representation format consisting of two stereo images as known from
conventional S3D production and aligned dense pixel-by-pixel depth maps
from the reproduction geometry at the display side. Due to the delivered depth
information, new views can be interpolated or rendered along the stereo baseline.
Dedicated rendering techniques known as DIBR are used for the generation of
virtual views [50].
Figure 2.12 illustrates the general concept of DIBR. The two pictures in the first
row show a video image and corresponding dense depth map. To calculate a new
virtual view along the stereo baseline, each sample in the video image is shifted
horizontally in relation to the assigned depth value. The amount and the direction of the shift depend on the position of the virtual camera with respect to the original camera. Objects in the foreground are moved by a larger horizontal shift than objects in the background. The two pictures in the second row of Fig. 2.12 show examples of the DIBR process. The left of the two pictures corresponds to a virtual camera position to the right of the original one and, vice versa, the right picture to a virtual camera position to the left of the original one.
These examples demonstrate that the depth-dependent shifting of samples during the DIBR process leads to exposed image areas, which are marked in black in Fig. 2.12. The width of these areas decreases with the distance of objects to the camera, i.e. the nearer an object is to the camera, the larger the width of the exposed areas. In addition, their width grows with the distance between the original and the virtual camera position. As there is no texture information for these areas in the original image, they are kept black in Fig. 2.12. The missing texture has to be extrapolated from the existing texture, taken from other camera views, or generated by specialized in-painting techniques.
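A minimal sketch of this per-pixel horizontal shifting (forward warping with a z-buffer so that nearer samples win, and holes left marked for later filling); the disparity scaling used here is an illustrative assumption.

import numpy as np

def render_virtual_view(image, depth, baseline_shift, max_disparity=40.0):
    """Forward-warp an image to a virtual camera along the stereo baseline.

    depth is normalized to [0, 1] with 1 = nearest; baseline_shift in [-1, 1]
    moves the virtual camera left (negative) or right (positive).
    Exposed (hole) pixels are returned as a boolean mask.
    """
    h, w = depth.shape
    virtual = np.zeros_like(image)
    zbuf = np.full((h, w), -np.inf)
    hole = np.ones((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = depth[y, x] * max_disparity * baseline_shift
            xt = int(round(x - d))
            if 0 <= xt < w and depth[y, x] > zbuf[y, xt]:
                zbuf[y, xt] = depth[y, x]        # nearer samples overwrite farther ones
                virtual[y, xt] = image[y, x]
                hole[y, xt] = False
    return virtual, hole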
In a depth-based representation format with two views like the one from
Fig. 2.11, the missing texture information can usually be extracted from the other
stereo view. This applies especially to cases where the new virtual camera position
lies between the left and right stereo camera. Figure 2.13 illustrates the related
process. First, a preferred original view to render the so-called primary view is
selected among the two existing camera views. This selection depends on the
position of the virtual camera. To minimize the size of exposed areas, the nearer of
Fig. 2.12 Synthesis of two virtual views using DIBR: a original view, b aligned dense depth
map, c virtual view rendered to the left, d virtual view rendered to the right
the two cameras is selected and the primary view is rendered from it as described
above. Subsequently, the same virtual view is rendered from the other camera, the
so-called secondary view. Finally, both rendered views are merged by filling the
exposed areas in the primary view with texture information from the secondary
view, resulting in a merged view. To reduce the visibility of remaining colorimetric or geometric deviations between primary and secondary view, the pixels
neighboring the exposed areas will be blended between both views.
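A sketch of the merging step, assuming the primary and the secondary view have already been rendered to the same virtual position together with their hole masks (the blending of pixels neighboring the holes is omitted for brevity):

import numpy as np

def merge_views(primary, primary_holes, secondary, secondary_holes):
    """Fill holes of the primary rendered view with the secondary rendered view."""
    merged = primary.copy()
    fill = primary_holes & ~secondary_holes           # holes we can fill from the other view
    merged[fill] = secondary[fill]
    remaining = primary_holes & secondary_holes       # holes visible in neither view
    return merged, remaining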
Following this rendering concept, one can generate an arbitrary number of
virtual views between the two original camera positions. One possible application
is the generation of a new virtual stereo pair. This is especially interesting if one
aims to adapt the stereoscopic playback and depth range on a glasses-based system
to the respective viewing conditions, like screen size or viewing distance. However, creating a new virtual stereo pair is not the only target application. In the case
of auto-stereoscopic multi-view multi-user displays, DIBR can be used to calculate
the needed number of views and to adapt the overall depth range to the specific
properties of the 3D display. In this general case it might occur that not all of the
virtual views are located between the two original views but have to be positioned
outside of the stereo baseline. That especially applies if the total number of
required views is high (e.g. large viewing cone) or if one wishes to display more
depth on the auto-stereoscopic display. In fact, the existing depth range might be
limited if the production rules for conventional stereoscopic displays have been
taken into account (see Sect. 2.2). Thus, the available depth budget has to be
shared between all virtual views of an auto-stereoscopic display if DIBR is
restricted to view interpolation only, or, in other words, additional depth budget for
auto-stereoscopic displays can only be achieved by extrapolating beyond the
original stereo pair.
Fig. 2.13 View merging to fill exposed areas. a Primary view rendered from the left camera
view, b secondary view rendered from the right camera view, c merged view
This extrapolation process produces exposed areas in the rendered images that
cannot be filled with texture information from the original views. Hence, other
texture synthesis methods need to be applied. The suitability of the different
methods depends on the scene properties. If, for instance, the background or
occluded far-distant objects mainly contain homogeneous texture, a simple repetition of neighboring pixels in the occluded image parts is a good choice to fill
exposed regions. In case of more complex texture, however, this method fails and
annoying artifacts that considerably degrade the image quality of the rendered
views become visible. More sophisticated techniques, such as patch-based
in-painting methods, are required in this case [51, 52]. These methods reconstruct
missing texture by analyzing adjacent or corresponding image regions and by
generating artificial texture patches that seamlessly fit into the exposed areas.
Suitable texture patches can be found in the neighborhood, but also in other parts
of the image, or even in another frame of the video sequence.
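The simple background repetition mentioned above can be sketched as follows, filling each hole with the nearest valid pixel on the same scan line; patch-based in-painting is considerably more involved and is not reproduced here.

import numpy as np

def fill_by_repetition(image, hole_mask):
    """Fill exposed regions by horizontally repeating the nearest valid neighbor."""
    out = image.copy()
    h, w = hole_mask.shape
    for y in range(h):
        for x in range(w):
            if hole_mask[y, x]:
                # search left, then right, for the nearest non-hole pixel on this line
                for offset in range(1, w):
                    if x - offset >= 0 and not hole_mask[y, x - offset]:
                        out[y, x] = out[y, x - offset]
                        break
                    if x + offset < w and not hole_mask[y, x + offset]:
                        out[y, x] = image[y, x + offset]
                        break
    return out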
Figure 2.14 shows the results of the two different texture synthesis methods
applied to a critical background. The top row depicts the two original camera
views, followed by extrapolated virtual views with large exposed regions in the
second row. The exposed regions are marked by black pixels. The third row
shows the result of a simple horizontal repetition of the background samples.
Clearly, this simple method yields annoying artifacts due to the critical texture in
some background regions. The bottom row presents results of a patch-based in-painting algorithm, providing a considerable improvement of the rendering quality
compared to the simple pixel repetition method.
Fig. 2.14 Extrapolated virtual views with critical content. First row the original stereo pair.
Second row extrapolated virtual views with exposed regions marked in black, third row simple
algorithm based on pixel repetition, fourth row sophisticated patch-based in-painting algorithm.
Test images kindly provided by European research project 3D4YOU [53]
Fig. 2.15 Outline of camera positions (left), CAD drawing of the 4-camera system used within
MUSCADE [46] (center), 4-camera system in action (right). CAD drawing provided by KUK
Filmproduktion
bring all four cameras in a position, such that every pair of two cameras is rectified, i.e. that Eq. (2.2) is valid (cf. Sect. 2.5). On the other hand, one can take
advantage of the additional cameras and estimate the trifocal tensor instead of a set
of fundamental matrices. A convenient approach for the estimation of the geometry is to select camera 1, 2, and 3 (cf. Fig. 2.15) as one trifocal triplet and camera
2, 3, and 4 as another trifocal triplet. That way, both triplets consist of cameras which still have a good overlap in their respective fields of view. This is especially important for a feature point-based calibration. In contrast, a camera triplet involving cameras 1 and 4 would suffer from a low number of feature points, as the overlap in the field of view is small due to the large baseline between the two cameras. All three cameras of a triplet are aligned on a common baseline if the trifocal tensor, in matrix notation, simplifies to
{T1, T2, T3} = { [ cx − c′x  0  0 ; 0  0  0 ; 0  0  0 ],  [ 0  cx  0 ; −c′x  0  0 ; 0  0  0 ],  [ 0  0  cx ; 0  0  0 ; −c′x  0  0 ] }    (2.10)

(each 3 × 3 matrix is written row by row, with rows separated by semicolons)
where cx and c′x denote the stereo baseline between the first and the second camera and the baseline between the first and the third camera, respectively. One can now develop the trifocal tensor around this rectified state in a Taylor expansion, comparable to Eq. (2.3), assuming that all deviations from the ideal geometric position
are small enough to linearize them. A feature point triplet consisting of corresponding points m, m′, and m″ then has to fulfil the incidence relation

[m′]× ( Σi mi Ti ) [m″]× = 03×3    (2.11)
Fig. 2.16 Disparity maps for camera 2 generated from the inner stereo pair (a), and involving the
satellite camera 1 (b). The two disparity maps have opposite sign and a different scaling due to
the two different baselines between cameras 1 and 2 on one hand, and cameras 2 and 3 on the
other hand
discussed in Sect. 2.6. Apparently, the occlusions for the wide baseline system (b) are larger than for the narrow baseline system (a). In addition, the two
disparity maps have opposite signs and a different scaling, as the two baselines
differ. We can now apply the trifocal constraint to the disparity maps.
Figure 2.17a shows the result of the trifocal consistency check. Only pixels available in both disparity maps (Fig. 2.16a, b) can be trifocally consistent. In the general case, the trifocal consistency is checked by performing a point transfer of a candidate pixel into the third view, involving the trifocal tensor and a 3D re-projection. In our case, the trifocal tensor degenerates according to Eq. (2.10). Consequently, a candidate pixel lies on the same line in all three views. Moreover, a simple multiplication of the disparities by the ratio cx/c′x of the two baselines is sufficient to perform the point transfer. This greatly improves the speed and the accuracy of the trifocal consistency check. The same baseline ratio can be used to normalize all disparity maps, allowing us to merge them into a single disparity map as shown in Fig. 2.17b. In a final step, we convert the disparity maps into depth maps. Please note that all pixels which are trifocally consistent have a high depth resolution, which will improve the quality of the later virtual view generation.
A generic 3D representation format for wide baseline applications should consequently allow for depth maps with a resolution of more than 8 bits per pixel.
Certainly, a respective encoding scheme would need to reflect this requirement.
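The simplified point transfer and consistency check for collinear cameras can be sketched as follows; the sign conventions and the consistency threshold are illustrative assumptions.

import numpy as np

def merge_collinear_disparities(disp_inner, disp_satellite, baseline_inner,
                                baseline_satellite, tol=1.0):
    """Check trifocal consistency for collinear cameras and merge two disparity maps.

    The satellite disparity is rescaled by the baseline ratio so that both maps
    refer to the inner baseline; consistent pixels are averaged.
    """
    ratio = baseline_inner / baseline_satellite
    rescaled = disp_satellite * ratio
    consistent = np.abs(disp_inner - rescaled) <= tol
    merged = np.where(consistent, 0.5 * (disp_inner + rescaled), disp_inner)
    return merged, consistent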
Once we have a set of four depth maps, we can generate virtual views along the
whole baseline using view interpolation. The concept of the depth-image-based
rendering has been discussed in Sect. 2.7. We will now extend the rendering using
DES toward the multi-camera geometry, i.e. MVD4 in our case.
Figure 2.18 illustrates the rendering process using DIBR and MVD4. We have
four cameras with associated depth maps. By inspection, we can identify the inner
stereo pair with a narrow baseline and small parallax (between cameras 2 and 3).
These views are suitable for being shown on a stereoscopic display without view interpolation. There is a noticeable perspective change between cameras 1 and 2
as well as between cameras 3 and 4. However, the views generated using DIBR
have an equidistant virtual baseline. These views are suitable for auto-stereoscopic
displays.
Fig. 2.17 a Trifocal consistent disparity maps. b Merged disparity maps. The two disparity maps
from Fig. 2.16 are normalized and merged into one resulting disparity map
Fig. 2.18 Depth-image-based rendering using MVD4. An arbitrary number of views can be
generated using a wide baseline setup
We will now describe how the DES rendering concept is applied to MVD4
content. The main idea is to perform a piece-wise DES rendering. For every virtual
camera position, two original views are selected as DES stereo pair. In the simplest
case, these are the two original views nearest to the virtual view. Figure 2.19
illustrates the selection process of primary and secondary view.
However, the quality of the depth maps might differ between the views. In this
case, one might want to use a high quality view as primary view, even if it is
farther away from the position of the virtual view. As shown in Fig. 2.19, the
rendering order can be visualized in a simple table. View A, for instance, lies between cameras 1 and 2. It is nearest to view 1 (blue) and second nearest to view 2
Fig. 2.19 Each original view can serve as primary or secondary view depending on the position
of the virtual view. Generally, the two nearest views serve as primary and secondary view for the
DES rendering (cf. text and Sect. 2.7)
2.9 Conclusions
In this chapter, we have described experiences and results of different conversion
processes from stereo to multi-view. The workflow has been extended toward a
four-camera production workflow in the framework of the European MUSCADE
project, which aims to perform field trials and real-time implementations needed by a generic 3D production chain.
We presented the requirements of a generic 3D representation format and a
generic display-agnostic production workflow, which should support different types of displays as well as current and future 3D applications.
Subsequently, calibration strategies for stereo- and multi-camera setups have
been discussed which allow an efficient setup of the rigs not only in laboratory environments but also in the field. A successful introduction of the display-agnostic 3D workflow relies on the existence of a cost-effective production
workflow. In this sense, a fast and reliable calibration is one important component.
The stereo calibration has been discussed in Sect. 2.5 while the extension toward
multi-camera-rig calibration has been discussed within Sect. 2.6.
Different approaches for efficient depth map generation have been discussed for
stereo- (Sect. 2.6) and multi-view setups (Sect. 2.8). As the quality of the depth
maps is critical for the whole DIBR production chain, we presented strategies to
improve their quality by using more than two cameras, i.e. a multi-camera rig (see
Sect. 2.8).
Finally, the process of DIBR using DES has been presented in Sect. 2.7. We
extended the workflow toward MVD4 and similar generic 3D representation formats in Sect. 2.8.
References
1. Feldmann I, Schreer O, Kauff P, Schäfer R, Fei Z, Belt HJW, Divorra Escoda O (2009) Immersive multi-user 3D video communication. In: Proceedings of international broadcast conference (IBC 2009), Amsterdam, NL, Sept 2009
2. Divorra Escoda O, Civit J, Zuo F, Belt H, Feldmann I, Schreer O, Yellin E, Ijsselsteijn W, van Eijk R, Espinola D, Hagendorf P, Waizennegger W, Braspenning R (2010) Towards 3D-aware telepresence: working on technologies behind the scene. In: Proceedings of ACM conference on computer supported cooperative work (CSCW), new frontiers in telepresence, Savannah, Georgia, USA, 06–10 Feb 2010
3. Feldmann I, Atzpadin N, Schreer O, Pujol-Acolado J-C, Landabaso J-L, Divorra Escoda O (2009) Multi-view depth estimation based on visual-hull enhanced hybrid recursive matching for 3D video conference systems. In: Proceedings of 16th international conference on image processing (ICIP 2009), Cairo, Egypt, Nov 2009
4. Waizenegger W, Feldmann I, Schreer O (2011) Real-time patch sweeping for high-quality depth estimation in 3D videoconferencing applications. In: SPIE 2011 conference on real-time image and video processing, San Francisco, CA, USA, 23–27 Jan 2011, Invited Paper
5. Pastoor S (1991) 3D-Television: a survey of recent research results on subjective requirements. Signal Process Image Commun 4(1):21–32
6. IJsselsteijn WA, de Ridder H, Vliegen J (2000) Effects of stereoscopic filming parameters and display duration on the subjective assessment of eye strain. In: Proceedings of SPIE stereoscopic displays and virtual reality systems, San Jose, Apr 2000
7. Mendiburu B (2008) 3D movie making: stereoscopic digital cinema from script to screen. Elsevier, ISBN: 978-0-240-81137-6
8. Yeh YY, Silverstein LD (1990) Limits of fusion and depth judgement in stereoscopic color pictures. Hum Factors 32(1):45–60
9. Holliman N (2004) Mapping perceived depth to regions of interest in stereoscopic images. In: Stereoscopic displays and applications XV, San Jose, California, Jan 2004
10. Jones G, Lee D, Holliman N, Ezra D (2001) Controlling perceived depth in stereoscopic images. In: Proceedings SPIE stereoscopic displays and virtual reality systems VIII, San Jose, CA, USA, Jan 2001
11. Woods A, Docherty T, Koch R (1993) Image distortions in stereoscopic video systems. Proc SPIE 1915:36–48
12. Faubert J (2000) Motion parallax, stereoscopy, and the perception of depth: practical and theoretical issues. In: Proceedings of SPIE three-dimensional video and display: devices and systems, Boston, MA, USA, pp 168–191, Nov 2000
13. Zilly F, Kluger J, Kauff P (2011) Production rules of 3D stereo acquisition. In: Proceedings of the IEEE (PIEEE), special issue on 3D media and displays, vol 99, issue 4, pp 590–606, Apr 2011
14. Johnson RB, Jacobsen GA (2005) Advances in lenticular lens arrays for visual display. In: Current developments in lens design and optical engineering VI, proceedings of SPIE, vol 5874, Paper 5874-06, San Diego, Aug 2005
15. Börner R (1993) Auto-stereoscopic 3D imaging by front and rear projection and on flat panel displays. Displays 14(1):39–46
16. Omura K, Shiwa S, Kishino F (1995) Development of lenticular stereoscopic display systems: multiple images for multiple viewers. In: Proceedings of SID 95 digest, pp 761–763
17. Hamagishi G et al (1995) New stereoscopic LC displays without special glasses. Proc Asia Disp 95:921–927
18. Dodgson NA, Moore JR, Lang SR, Martin G, Canepa P (2000) A time sequential multi-projector autostereoscopic display. J Soc Inform Display 8(2):169–176
19. Lowe D (2004) Distinctive image features from scale invariant keypoints. IJCV 60(2):91–110
20. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) SURF: speeded up robust features. Comput Vis Image Underst (CVIU) 110(3):346–359
21. Zilly F, Riechert C, Eisert P, Kauff P (2011) Semantic kernels binarized: a feature descriptor for fast and robust matching. In: Conference on visual media production (CVMP), London, UK, Nov 2011
22. Hartley R, Zisserman A (2000) Multiple view geometry in computer vision. Cambridge University Press, Cambridge
23. Fischler M, Bolles R (1980) Random sample consensus: a paradigm for model fitting applications to image analysis and automated cartography. In: Proceedings of image understanding workshop, April 1980, pp 71–88
24. Fusiello A, Trucco E, Verri A (2000) A compact algorithm for rectification of stereo pairs. Mach Vis Appl 12(1):16–22
25. Mallon J, Whelan PF (2005) Projective rectification from the fundamental matrix. Image Vis Comput 23(7):643–650
26. Wu H-H, Yu Y-H (2005) Projective rectification with reduced geometric distortion for stereo vision and stereoscopic video. J Intell Rob Syst 42:71–94
27. Scharstein D, Szeliski R (2001) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47(1/2/3):7–42, Apr–June 2002. Microsoft Research Technical Report MSR-TR-2001-81, Nov 2001
28. Brown MZ, Burschka D, Hager GD (2003) Advances in computational stereo. IEEE Trans Pattern Anal Mach Intell (PAMI) 25(8):993–1008
29. Aschwanden P, Guggenbuhl W (1993) Experimental results from a comparative study on correlation-type registration algorithms. In: Forstner W, Ruwiedel St (eds) Robust computer vision. Wickmann, Karlsruhe, pp 268–289
30. Wegner K, Stankiewicz O (2009) Similarity measures for depth estimation. In: 3DTV conference, the true vision capture transmission and display of 3D video
31. Birchfield S, Tomasi C (1996) Depth discontinuities by pixel-to-pixel stereo. In: Technical report STAN-CS-TR-96-1573, Stanford University, Stanford
32. Belhumeur PN (1996) A Bayesian approach to binocular stereopsis. IJCV 19(3):237–260
33. Cox IJ, Hingorani SL, Rao SB, Maggs BM (1996) A maximum likelihood stereo algorithm. CVIU 63(3):542–567
34. Boykov Y, Veksler O, Zabih R (2001) Fast approximate energy minimization via graph cuts. IEEE Trans Pattern Anal Mach Intell 23(11):1222–1239
35. Kolmogorov V, Zabih R (2001) Computing visual correspondence with occlusions using graph cuts. Proc Int Conf Comput Vis 2:508–515
36. Kolmogorov V, Zabih R (2005) Graph cut algorithms for binocular stereo with occlusions. In: Mathematical models in computer vision: the handbook. Springer, New York
37. Kolmogorov V, Zabih R (2004) What energy functions can be minimized via graph cuts? IEEE Trans Pattern Anal Mach Intell 26(2):147–159
38. Bleyer M, Gelautz M (2007) Graph-cut-based stereo matching using image segmentation with symmetrical treatment of occlusions. Signal Process Image Commun 22(2):127–143
39. Sun J, Shum HY, Zheng NN (2002) Stereo matching using belief propagation. ECCV
40. Yang Q, Wang L, Yang R, Wang S, Liao M, Nister D (2006) Real-time global stereo matching using hierarchical belief propagation. In: Proceedings of the British machine vision conference, 2006
41. Felzenszwalb PF, Huttenlocher DP (2006) Efficient belief propagation for early vision. Int J Comput Vis 70(1) (October)
42. Brown MZ, Burschka D, Hager GD (2003) Advances in computational stereo. IEEE Trans Pattern Anal Mach Intell 25(8):993–1008
43. Kopf J, Cohen M, Lischinski D, Uyttendaele M (2007) Joint bilateral upsampling. In: Proceedings of the SIGGRAPH conference on ACM Transactions on Graphics, vol 26, no 3
44. Riemens AK, Gangwal OP, Barenbrug B, Berretty R-PM (2009) Multi-step joint bilateral depth upsampling. In: Proceedings of SPIE visual communications and image processing, vol 7257, article M, Jan 2009
45. Atzpadin N, Kauff P, Schreer O (2004) Stereo analysis by hybrid recursive matching for real-time immersive video conferencing. In: IEEE Transactions on circuits and systems for video technology, special issue on immersive telecommunications, vol 14, no 3, pp 321–334, Jan 2004
46. Muscade (MUltimedia SCAlable 3D for Europe), European FP7 research project. http://www.muscade.eu/
47. Ziegler M, Falkenhagen L, ter Horst R, Kalivas D (1998) Evolution of stereoscopic and three-dimensional video. Signal Process Image Commun 14(1–2):173–194
48. Redert A, Op de Beeck M, Fehn C, IJsselsteijn W, Pollefeys M, Van Gool L, Ofek E, Sexton I, Surman P (2002) ATTEST: advanced three-dimensional television systems technologies. In: Proceedings of first international symposium on 3D data processing, visualization, and transmission, Padova, Italy, pp 313–319, June 2002
49. Mohr R, Buschmann R, Falkenhagen L, Van Gool L, Koch R (1998) Cumuli, panorama, and vanguard project overview. In: 3D structure from multiple images of large-scale environments, lecture notes in computer science, vol 1506/1998, pp 1–13. doi:10.1007/3-540-49437-5_1
50. Fehn C (2004) Depth-image based rendering (DIBR), compression and transmission for a new approach on 3D-TV. In: Proceedings of SPIE stereoscopic display and virtual reality systems XI, San Jose, CA, USA, pp 93–104, Jan 2004
51. Köppel M, Ndjiki-Nya P, Doshkov D, Lakshman H, Merkle P, Mueller K, Wiegand T (2010) Temporally consistent handling of disocclusions with texture synthesis for depth-image-based rendering. In: Proceedings of IEEE ICIP, Hong Kong
52. Ndjiki-Nya P, Köppel M, Doshkov D, Lakshman H, Merkle P, Mueller K, Wiegand T (2010) Depth-image based rendering with advanced texture synthesis. In: Proceedings of IEEE international conference on multimedia & expo, Singapore
53. 3D4YOU, European FP7 research project. http://www.3d4you.eu/
54. Zilly F, Müller M, Kauff P (2010) The stereoscopic analyzer: an image-based assistance tool for stereo shooting and 3D production. In: Proceedings of ICIP 2010, special session on image processing for 3D cinema production, Hong Kong, 26–29 Sept 2010
Chapter 3
Abstract With the advent of 3D-TV, the increasing interest in free viewpoint TV in MPEG, and the inevitable evolution toward high-quality and higher resolution TV (from SDTV to HDTV and even UDTV) with a comfortable viewing experience, there is a need to develop low-cost solutions addressing the 3D-TV market. Moreover, it is believed that in a not too distant future 2D-UDTV display technology will support a reasonable-quality 3D-TV autostereoscopic display mode (no need for 3D glasses) where up to a dozen intermediate views are rendered between the extreme left and right stereo video input views. These intermediate views can be synthesized by using viewpoint synthesis techniques with the left and/or right image and the associated depth map. With the increasing penetration of 3D-TV broadcasting with left and right images as the straightforward 3D-TV broadcasting method, extracting high-quality depth maps from these stereo input images becomes mandatory to synthesize the other intermediate views. This chapter describes such Stereo-In to Multiple-Viewpoint-Out
C.-K. Liao (&)
IMEC Taiwan Co., Hsinchu, Taiwan
e-mail: ckliao@imec.be
H.-C. Yeh · K. Zhang · V. Geert · G. Lafruit
IMEC vzw, Leuven, Belgium
e-mail: lalalachi@gmail.com
K. Zhang
e-mail: zhangke@imec.be
V. Geert
e-mail: vanmeerb@imec.be
T.-S. Chang
National Chiao Tung University, Hsinchu, Taiwan
e-mail: tschang@twins.ee.nctu.edu.tw
G. Lafruit
e-mail: lafruit@imec.be
3.1 Introduction
In recent years, people have become more and more attracted by the wide range of applications enabled by 3D sensing and display technology. The basic principle is that a vision device providing a pair of stereo images to the left and right eye, respectively, is sufficient to provide a 3D depth sensation through the brain's natural depth interpretation ability. Many applications like free viewpoint TV and 3D-TV
benefit from this unique characteristic.
Fig. 3.1 Synthesized virtual images with different levels of depth control
viewers' eyes into the two corresponding adjacent viewing cones, giving the viewer the illusion of turning around the projected 3D content.
Though it is conceivable that the accompanying viewpoint interpolation process, which requires scene depth information, might be based on sender-precalculated depth maps broadcast through the network, it is more likely that today's left/right stereo video format, used in 3D digital cinema and 3D Blu-ray with polaroid/shutter-glasses display technology, will become legacy in tomorrow's 3D-TV broadcasting market. Each autostereoscopic 3D-TV (or set-top box) receiver will then have to include a chipset for depth estimation and viewpoint synthesis at reasonable cost.
Fig. 3.3 Free viewpoint TV: the images of a user-defined viewing direction are rendered
Fig. 3.4 Depth measurement for 3D gesture recognition. The monitor shows the depth-image
captured and calculated by a FullHD stereo-camera with the proposed system
projects a pattern onto the scene, out of which the player's 3D movements can be
extracted.
The same function can be achieved using a stereo camera where depth is extracted by stereo matching (Fig. 3.4). There are several benefits to using a stereo camera system: the device allows higher depth precision (active depth sensing devices typically exhibit a low output image resolution), and the system can be used outdoors, even under intense sunlight conditions, which is out of reach for any active IR sensing device. Finally, compared with active IR lighting systems, the stereo approach exhibits lower energy consumption. Therefore, using stereo cameras with a stereo matching technique has a high potential for replacing Kinect-like devices.
Although stereo matching provides depth information that adds value to a gesture recognition application, it remains a challenge to achieve good image quality at reasonable computational complexity. Since even today's GPUs, with their impressive processing power, are hardly capable of performing such tasks in real time, we propose the development of a stereo matching and viewpoint interpolation processing chain, implemented on an FPGA, paving the way to future 3D-TV chipset solutions. In particular, we address the challenge of proper quality/complexity tradeoffs satisfying user requirements at minimal implementation cost (memory footprint and gate count). The following describes a real-time FPGA implementation of such stereo matching and viewpoint synthesis functionality. It is also a milestone that eventually paves the way to full-HD 3D on autostereoscopic 3D displays.
Fig. 3.5 The schematic model of a stereo camera capturing an object P. The cameras with focal
length f are assumed to be identical and are located on an epipolar line with a spacing S
of different light paths from the object to respectively the left and right cameras
with their corresponding location and viewing angle [1], mimicking the human
visual system with its depth impression and 3D sensation.
Figure 3.5 shows an illustration of disparity and the corresponding geometrical
relation between an object and a stereo camera rig. Two images from the object are
captured by two cameras which are assumed to be perfectly parallel and aligned to
each other on a lateral line, corresponding to the so-called epipolar line [2]. A
setup with converging cameras can be transformed back and forth to this scenario
through image rectification. To simplify the discussion, the camera parameters
(except their location) are identical.
In this chapter, we assume that the rectification has been done in advance. The
origins of the cameras are shown to be on their rear side, corresponding to their
respective focal point. When an object image is projected to the camera, each pixel
senses the light along the direction from the object to the camera origin. Consequently, any camera orientation has its own corresponding object image and
therefore the object is at a different pixel location in the respective camera views.
For example, the object at point P is detected at pixel Xleft = +5 and Xright = -7 in the left and right cameras, respectively (see Fig. 3.5). The disparity is calculated as the difference of these two lateral distances (i.e., 5 - (-7) = 12). The disparity is a function of the depth z, the spacing s between the two cameras, and the focal length f of the camera, as shown in Eq. (3.1).
z = s · f / d    (3.1)
According to this equation the object depth can be calculated by measuring the
disparity from the stereo images. In general, textures of an object on the respective
stereo images exhibit a high image correlation, both in color and shape, since they
originate from the same object with the same illumination environment and a
similar background and are assumed to be captured by identical cameras. Therefore, it is suitable to measure the disparity by detecting the image texture displacement with a texture similarity matching metric.
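A direct sketch of Eq. (3.1), assuming rectified cameras and disparity given in pixels; the guard against zero disparity is added here only for robustness.

import numpy as np

def depth_from_disparity(disparity, spacing_s, focal_f):
    """Convert a disparity map to depth via z = s * f / d (Eq. 3.1)."""
    d = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(d, np.inf)
    valid = d > 0
    depth[valid] = spacing_s * focal_f / d[valid]
    return depth

# Example: the object at point P with a disparity of 12 pixels
# depth_from_disparity(np.array([12.0]), spacing_s=0.1, focal_f=1200.0) -> 10.0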
According to the taxonomy summarized in [3] and [4], algorithms for stereo matching have been investigated for almost 40 years and can be categorized into two classes, the local-approach and the global-approach algorithms. The stereo
matching computational flow can be summarized in four steps:
1. Matching cost computation
2. Cost (support) aggregation
3. Disparity computation/optimization
4. Disparity refinement
Table 3.1 Matching cost definitions with their point operations [5]

NCC:  Σx,y (Ir(x, y) − Īr) (It(x + d, y) − Īt) / √[ Σx,y (Ir(x, y) − Īr)² · Σx,y (It(x + d, y) − Īt)² ]   (point operation: multiplication)
SAD:  Σx,y | Ir(x, y) − It(x + d, y) |   (point operation: subtraction)
SSD:  Σx,y ( Ir(x, y) − It(x + d, y) )²   (point operation: multiplication)
CT:   Ik(x, y) → Bitstring m,n [ Ik(m, n) < Ik(x, y) ],  cost = Σx,y Hamming( Ir(x, y), It(x + d, y) )   (point operation: XOR)
have been proposed to calculate the matching cost. The definitions of these
methods with their point operation are shown in Table 3.1 [5].
Cross-correlation is a commonly used operation to calculate the similarity in the
standard statistics methods [6]. Furthermore, NCC is also reported in order to prevent
the influence of radiometric differences between the stereo images on the cost estimation result [7, 8]. Cross-correlation is, however, computationally complex, and thus
only a few studies use this method to calculate the matching cost in a real-time
system [9]. Similar to SSD, SAD mainly measures the intensity difference between
the reference region and the target region. Compared with SSD, SAD is more often used
for system implementation since it is more computationally efficient. Different from
these approaches, the CT provides outstanding similarity measurements at acceptable cost [10]. The CT method first translates the luminance comparison result
between the processed anchor pixel (the central pixel of the moving window) and the
neighboring pixels into a bit string. Then it generates the matching cost by computing
the Hamming distance between these bit strings on the reference and target images.
As such, the CT matching cost encodes relative luminance information rather than absolute
luminance values, making it more tolerant to luminance error/bias between the two regions
to match, while increasing the discrimination level by recording a statistical distribution
rather than individual pixel values.
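To make the CT and Hamming-distance cost concrete, here is a minimal Python sketch using NumPy with a hypothetical 3 × 3 census window. It is an illustrative reading of the method described above, not the exact RTL design; edge pixels are handled by wrap-around for brevity.

```python
import numpy as np

def census_transform(img, win=1):
    """Census transform: for each pixel, encode whether each neighbor in a
    (2*win+1)^2 window is darker than the anchor pixel, as a bit string."""
    h, w = img.shape
    bits = np.zeros((h, w), dtype=np.uint32)
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            if dy == 0 and dx == 0:
                continue
            # wrap-around shift: acceptable for a sketch, not for the real border handling
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            bits = (bits << 1) | (shifted < img).astype(np.uint32)
    return bits

def hamming_cost(census_ref, census_tgt, d):
    """Raw matching cost at disparity d: Hamming distance between the census word
    of each reference pixel and the census word of the target pixel shifted by d."""
    shifted = np.roll(census_tgt, d, axis=1)
    xor = census_ref ^ shifted
    # per-pixel popcount (slow but clear)
    return np.array([bin(int(v)).count("1") for v in xor.ravel()]).reshape(xor.shape)
```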
Fig. 3.7 The CT and Hamming distance calculation resulting in a raw matching cost in terms of
disparities. a The image intensity map. b The CT. c The census vector aggregated from neighboring
pixels, and d the calculated raw matching cost. Note that the processed anchor pixel (65) is at
coordinates (x, y)
Fig. 3.8 Categories of local stereo matching algorithms based on the level of detail in the
support region of neighboring pixels around the processed anchor pixel. Four categories are listed
from coarse to fine, i.e., fixed window methods, multiple windows methods, adaptive shape
methods, and adaptive weight methods. One possible support region for the pixel p is presented for
each category (brightness denotes the weight value for the adaptive weight methods)
the same disparity value. Typically, a Gaussian filter is used to decide the weight of the
aggregated matching cost based on the distance to the processed anchor pixel.
Although the adaptive weight method is able to achieve very accurate disparity maps,
its computational complexity is considerably higher than that of the other approaches.
Conventional cost aggregation within the support region requires a computational
complexity of O(n²), where n is the size of the support region window.
Wang et al. [15] propose a two-pass cost aggregation method, which efficiently
reduces the computational complexity to O(2n). Other techniques for hardware
complexity reduction use data reuse. Chang et al. [16] propose the mini-CT and
partial column reuse techniques to reduce the memory bandwidth and computational
complexity by caching column data across the overlapping windows. In our previous
hardware implementation work, contributed by Lu et al. [17–19], the two-pass cost
aggregation approach was chosen and combined with the cross-based support region
information. Data-reuse techniques were also applied to both the horizontal and
vertical cost aggregation computations to construct a hardware-efficient architecture.
A more recent implementation has modified the disparity computations as
explained in the following sections.
In the global approaches, the disparity map is obtained by minimizing an energy function of the form

E(d) = E_data(d) + E_smooth(d)    (3.2)

The first term represents the sum of matching costs for the image pixels p at
their respective disparities d_p. Equation (3.3) shows the first term of Eq. (3.2):
E_data(d) = Σ_{p∈N} C(p, d_p)    (3.3)
where N represents the set of image pixels, and d_p ranges from 0 to a chosen
maximum disparity value. The second term of Eq. (3.2) represents the smoothness
function, given in Eq. (3.4). We compare two models, the Linear and Potts models, to
implement the smoothness function:
E_smooth(d) = k · Σ_{(p,q)∈N} S(d_p, d_q)    (3.4)
where k is a scaling coefficient that adapts to the local intensity changes in order
to preserve disparity discontinuities. d_p and d_q are the disparities of adjacent
pixels p and q. Equation (3.5) shows the Linear model for the smoothness function.
The smoothness cost of the Linear model depends on the difference between the
disparity of the current pixel and all candidate disparities of the previous pixel:
a larger disparity difference introduces a higher smoothness cost penalty.
Figure 3.9 shows an example of the smoothness cost penalties in the Linear model.
This mechanism helps to preserve the smoothness of the disparity map when
performing global stereo matching.
S(d_p, d_q) = |d_p − d_q|    (3.5)

where d_p, d_q ∈ [0, disparity range − 1].
Unfortunately, the Linear model smoothness function requires high computational
resources. Assuming that the maximum disparity range is D, the computational
complexity of the smoothness cost is O(D²).
Equation (3.6) shows the Potts model smoothness function, which has lower complexity
than the Linear model. The Potts model introduces the same smoothness cost penalty
for all candidate disparities. Figure 3.10 shows an example of the smoothness cost
penalties in the Potts model. If the disparity value is different, a smoothness cost
penalty is added to the energy function to reduce the probability that a different
disparity wins the winner-take-all computation. With the Potts model smoothness
function, the computational complexity is O(D).
S(d_p, d_q) = { 0,  if d_p = d_q
              { c,  otherwise                       (3.6)

where c is a constant that introduces the smoothness penalty.
Figure 3.11 shows an example of disparity maps generated with the Linear and Potts
models individually. The Linear model clearly performs better on slanted surfaces and
reveals more detail. However, it also blurs the disparity map in discontinuous regions
(such as object edges).
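A compact way to see the difference between the two models is to compute the smoothness penalty vectors directly. The Python sketch below mirrors Eqs. (3.5) and (3.6); the penalty constant and the disparity range are illustrative assumptions, not the fixed-point values used in the hardware.

```python
import numpy as np

def linear_smoothness(d_prev_candidates, d_current, k=1.0):
    """Linear model, Eq. (3.5): penalty grows with the disparity difference."""
    return k * np.abs(d_prev_candidates - d_current)

def potts_smoothness(d_prev_candidates, d_current, c=2.0):
    """Potts model, Eq. (3.6): zero penalty for equal disparities, constant c otherwise."""
    return np.where(d_prev_candidates == d_current, 0.0, c)

disparities = np.arange(8)                            # hypothetical disparity range D = 8
print(linear_smoothness(disparities, d_current=3))    # penalties 3, 2, 1, 0, 1, 2, 3, 4
print(potts_smoothness(disparities, d_current=3))     # penalties 2, 2, 2, 0, 2, 2, 2, 2
```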
The state-of-the-art global stereo matching approaches include dynamic programming (DP) [21],
graph cut (GC) [22], and belief propagation (BP) [23]. These approaches differ in the way the
energy function is minimized. In the following, we choose the DP approach as the example for the
hardware implementation studied later.
DP is performed after the calculation of the raw matching cost. To reduce the
computational complexity, DP is generally executed within an image scanline (an
epipolar line after image rectification). It consists of a forward updating of the raw
matching cost and a backward minimum tracking. The DP function helps to choose
the accurate minimum disparity value
Fig. 3.11 Disparity map results by using different smoothness function models
Fig. 3.12 DP consists of a forward cost update and a backward tracking. In the forward process,
the costs are updated based on the cost distribution of the previous pixel, and therefore, the final
disparity result is not only determined by its original cost map, but also affected by neighboring
pixels
where the function C() represents the original raw matching cost at disparity d_q for
pixel j on scanline W, the function C_w() is the accumulated cost updated from the
original cost, and S() is the smoothness function that introduces a penalty on the
disparity difference between the current and previous scanline pixels (Fig. 3.12).
After updating the cost of every pixel on the scanline, the optimal disparity
result of a scanline can be harvested by backward selecting the disparity with
minimum cost, according to:
d* = arg min_{d} C_w(j, d)    (3.8)
r* = max_{r∈[1,L]} ( r · Π_{i∈[1,r]} δ(p, p_i) )    (3.10)

where r* indicates the largest span for the four direction arms. In the equation,
p_i = (x_p − i, y_p) and L corresponds to the predetermined maximum arm length.
δ(p_1, p_2) = { 1,  if max_{c∈{R,G,B}} |I_c(p_1) − I_c(p_2)| ≤ τ
              { 0,  otherwise                        (3.11)

where τ is the color-similarity threshold.
After the arm lengths are calculated, the four arms (h_p^−, h_p^+, v_p^−, v_p^+) are used to
define the horizontal segment H(p) and the vertical segment V(p). Note that the full
support region U(p) is the integration of the horizontal segments of all the pixels
residing on the vertical segment.
Once the support region (indicating the texture-less region) has been determined, the
disparity voting method can be performed to provide a better image quality. It
accumulates the disparity values within the support region of the central pixel into a
histogram and then chooses the winner as the final disparity value.
Figure 3.14 demonstrates an example of such 2D voting.
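The voting step amounts to a histogram majority over the support region. The following Python sketch shows this idea for a single anchor pixel; the support-region mask and disparity values are hypothetical inputs, and the real design performs the equivalent operation with separable horizontal and vertical passes in hardware.

```python
import numpy as np

def vote_disparity(disparities, support_mask, max_disp=64):
    """2D disparity voting: histogram the disparities inside the support region
    of the anchor pixel and return the most frequent value."""
    votes = disparities[support_mask]                 # disparities belonging to the region
    hist = np.bincount(votes, minlength=max_disp)     # one bin per disparity level
    return int(np.argmax(hist))                       # winner-take-all over the histogram

# Hypothetical 3x3 neighborhood around an anchor pixel
disp = np.array([[5, 5, 7],
                 [5, 6, 5],
                 [9, 5, 5]])
mask = np.array([[True, True, False],
                 [True, True, True],
                 [False, True, True]])
print(vote_disparity(disp, mask))  # disparity 5 wins the vote
```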
Fig. 3.14 Example of 2D disparity voting. All the disparities in the support region are taken into
account in the histogram to choose the most voted disparity. a Disparities in the support region.
b Disparity histogram of the support region. c Voted disparity
format. For instance, the video signal on a television screen is scanned and transferred
in either the progressive or the interlaced raster scan format, and both scan formats
follow a horizontal direction. All information is extracted from a horizontal strip of
successive lines, gradually moving from the top to the bottom of the image. There is
hence no need to store the entire image frame in the processing memory; only the very
limited number of scan lines needed for the processing are stored. Second, the disparity
is naturally expressed in the horizontal direction, since it originates from the position
difference between the viewer's left and right eyes, which are horizontally spaced. For
these reasons, it is more practical to build a stereo matching system achieving real-time
performance in a line-based structure. Since the depth extraction is simplified to
harvesting information from local data rather than from whole-frame data, the
implementation of the stereo matching on the FPGA prototyping system becomes more efficient.
Figure 3.15 shows the proposed example in a more detailed flow graph of the
stereo matching implementation on FPGA. It performs the following steps:
• Support region builder, raw cost generation with CT, and Hamming distance.
• DP computation.
• Disparity map refinement with consistency check, disparity voting, and median filter.
The hardware architecture of these different steps will be discussed in further
details in the following sections.
3.3.1.1 CT
In this section, an example design of the CT is shown in Fig. 3.16. The memory
architecture uses two line buffers to keep the luminance data of the vertical scanline
pixels locally, in order to provide two-dimensional pixel information for the CT
computation. This memory architecture writes and reads the luminance information
of the input pixel streams simultaneously in a circular manner, which keeps the
scanline pixel data local through a data-reuse technique. Successive horizontal scanlines
are grouped into vertical pixels through additional shift registers. This example
illustrates a 3 × 3 CT computational array. The results from the different comparators CMP are concatenated into a census bitstream c_i for each pixel i.
matching cost measurement for each pixel). As shown in Fig. 3.17, the raw matching cost
result is stored in an additional, independent dimension with a size equal to the maximum
number of disparity levels.
The raw matching cost indicates the similarity of the reference pixel to the target pixels
at the various disparity positions. Therefore, it should be easy to obtain the most similar
pixel by taking the disparity with the minimum raw matching cost. However, in most
practical situations, the extracted disparity candidates need to be improved by using
global matching algorithms such as DP, as explained in the next section.
Fig. 3.18 Example hardware design of the cross-based support region builder
C_Assum^min = min_{d_p∈W} ( C_w(j−1, d_p) + c )

C_w(j, d_q) = C(j, d_q) + min( C_Assum^min, C_w(j−1, d_q) )    (3.12)
d(j−1) = { Backward_min( C_path(j) ),  if BackwardPath(j, d) = 1
         { d(j),                        if BackwardPath(j, d) = 0    (3.14)
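To illustrate the forward cost update and backward tracking, here is a minimal Python sketch of scanline DP with a Potts penalty. It follows the structure of Eqs. (3.12) and (3.14) in spirit, but the variable names and the penalty value are illustrative assumptions rather than the exact hardware formulation.

```python
import numpy as np

def scanline_dp(raw_cost, penalty=2.0):
    """Scanline dynamic programming.

    raw_cost: array of shape (num_pixels, num_disparities) with the raw matching cost.
    Returns the per-pixel disparity after the forward update and backward tracking.
    """
    n, d = raw_cost.shape
    acc = raw_cost.astype(float).copy()
    back = np.zeros((n, d), dtype=int)          # best predecessor disparity per pixel

    # Forward pass: accumulate cost from the previous pixel with a Potts penalty
    for j in range(1, n):
        prev = acc[j - 1]
        jump = prev.min() + penalty             # cheapest cost if the disparity changes
        for dq in range(d):
            stay = prev[dq]                     # no penalty if the disparity is kept
            if stay <= jump:
                acc[j, dq] = raw_cost[j, dq] + stay
                back[j, dq] = dq
            else:
                acc[j, dq] = raw_cost[j, dq] + jump
                back[j, dq] = int(prev.argmin())

    # Backward pass: start from the minimum accumulated cost and follow predecessors
    disparity = np.zeros(n, dtype=int)
    disparity[-1] = int(acc[-1].argmin())
    for j in range(n - 1, 0, -1):
        disparity[j - 1] = back[j, disparity[j]]
    return disparity
```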
Fig. 3.20 Example design of forward- (left) and backward-pass (right) functions
from streak effects and speckle noise. They also show some mismatching problems. In the next subsection, the postprocessor is described to circumvent these
problems with a disparity map refinement approach.
Fig. 3.21 Tsukuba left (a) and right (b) stereo image sources, and their resulting left (c) and
right (d) disparity maps from the DP
One strategy is to replace each mismatching pixel by the nearest good disparity value.
Another approach is to tag the mismatching pixels to form occlusion maps; the
information of the occlusion pixels is then used in the disparity voting function, and
the occlusion pixels are replaced by the most popular good disparity values within their
support region. In the next subsection, the RTL design of the disparity voting is further explained.
Figure 3.24 shows the disparity map results that are captured from the output of
the consistency check function, where the occlusion solving strategy is replacing
the mismatching disparity pixel by the nearest good disparity value.
Fig. 3.24 Tsukuba left and right disparity maps after consistency check
and assigns the corresponding disparity bit flag where 1 represents equal and 0
represents not equal. Then the support region mask filters the disparity flags that do
not belong to the support region by setting them to 0. Afterwards, the population
counter function accumulates the disparity bit flags into a histogram for the
disparity counter comparator tree and selects the most popular disparity value.
Figure 3.27 is an example design of the vertical voting function which selects
the most popular disparity value in the vertical support region. In general, the
architecture is almost the same as the horizontal voting function, except for the
memory architecture. The memory architecture accumulates the incoming
disparity scanline stream into line buffers and utilizes data-reuse techniques to
extract vertical pixels for disparity voting.
Figure 3.28 demonstrates the disparity map results captured from the output of the
disparity voting function. Compared with Fig. 3.24, the disparity mismatches, the
occlusions, and the streaking effects are largely alleviated.
Fig. 3.28 Tsukuba left and right disparity maps after voting (a), and the following median filtering (b)
Fig. 3.31 Algorithm flowchart of virtual view synthesis. It mainly consists of the forward
warping, reverse warping and a blending process
Given the warped VDL and VDR from the forward warping stage, the reverse warping
fetches the texture data from the original input left and right images (IL and IR)
to produce the two virtual view frames (VIL and VIR). This stage warps in a similar way
as the forward warping stage. However, instead of using the depth map as a reference
image, the reverse warping stage uses the left/right image as a reference and warps these
images based on the virtual view depth maps. After this stage, some of the occluded
regions again cause blank regions, which are dealt with in the next stage. Besides
the warping, the hole dilation expands the hole region by one pixel, so that the dilated
result can be used in the next stage to improve synthesis quality.
The blending process, which is the final stage of the proposed synthesis process,
merges the left/right virtual images (VIL and VIR) into one to create the synthesized
result. Because most pixels of the synthesized result are visible in both left/right
virtual images, the blending process averages the texture values from VL and VR. This
improves the color consistency between the two input views (L and R) and gives higher
quality. For the occlusion regions, because most of them can be seen by either one of
the two views, VL and VR are used to complement each other and to generate a
synthesized view. In addition, the preceding hole-dilation step ensures that more
background information is copied from the complementary view.
After the blending process, there are still some holes that cannot be filled from either
input view. This problem can be solved by inpainting methods. Here, a simplified
method proposed by Horng [35] is employed for the inpainting. Unlike other methods,
which suffer from high computational complexity and a time-consuming iterative process,
this hole-filling process adopts a 5-by-9 weighted array filter as a bilinear
interpolation to fill the hole region with less computational complexity.
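The overall per-scanline flow (forward depth warping, reverse texture warping, blending) can be sketched in Python as follows. The scanlines are treated as disparity values, the scaling factor alpha and the hole marker are hypothetical, and a real implementation adds the hole dilation and inpainting steps described above.

```python
import numpy as np

HOLE = -1  # marker for pixels with no warped data

def forward_warp_depth(disparity_row, alpha):
    """Warp one depth/disparity scanline to the virtual viewpoint.
    alpha scales the disparity (0 = reference view, 1 = the other input view)."""
    warped = np.full(disparity_row.shape, HOLE, dtype=int)
    for x, d in enumerate(disparity_row):
        xv = x + int(round(alpha * d))
        if 0 <= xv < warped.size and d > warped[xv]:   # keep the closest (largest) disparity
            warped[xv] = d
    return warped

def reverse_warp_texture(image_row, warped_disparity_row, alpha):
    """Fetch texture from the original image using the warped disparity scanline."""
    out = np.full(image_row.shape, float(HOLE))
    for xv, d in enumerate(warped_disparity_row):
        if d == HOLE:
            continue
        xo = xv - int(round(alpha * d))                # back to the original view
        if 0 <= xo < image_row.size:
            out[xv] = image_row[xo]
    return out

def blend(left_virtual, right_virtual):
    """Average where both views are available, otherwise take the visible one."""
    both = (left_virtual != HOLE) & (right_virtual != HOLE)
    out = np.where(both, (left_virtual + right_virtual) / 2.0,
                   np.maximum(left_virtual, right_virtual))
    return out  # remaining HOLE values would be inpainted in a final pass
```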
Fig. 3.32 Forward warping (top) and reverse warping (bottom). In the forward warping, depth
maps are warped based on the disparity value, and holes are generated after the values have been
warped to new locations. In the reverse warping, a similar process is performed on the image,
pixel by pixel, based on the warped disparity map. Note that the warping direction is
opposite to that of the left image
The proposed hardware architecture is shown in Fig. 3.33. Our view synthesis engine
adopts a scanline-based pipelined architecture. Because the input texture images and
depth maps are both rectified by the preceding stereo-matching method, the process
reduces to scanline-by-scanline processing. During viewpoint synthesis, all rows of a
given input frame are processed independently to create the final output. Under these
considerations, we synthesize the virtual view row by row.
Because of the scanline-based architecture, the input and output of our view synthesis
engine can receive and deliver a sequential line-by-line data stream. For example, for
1024 × 768 video frames, the input stream has a length of 768 lines of 1024 data
packets. Under this scanline-based architecture, the forward-warped depth information
can be stored in internal memory, so that the bandwidth for warped depth data access
from the external memory is greatly reduced. For the row-based architecture, the size
of the internal memory buffers
depends on the frame width for storing a frame row. In our design, we adopted an
SRAM with 1920 × 8 bits of memory space to support up to HD1080p video.
In the view synthesis engine, the data flow follows Fig. 3.31. In stage I, DL and DR
are warped toward the requested viewpoint. This is done in scanline order from the
depth map and two scanline buffers (L/R buffers) for the left and right images,
respectively. Holes in the warped depth images are filled in the next two stages:
stage II handles small 1-pixel-wide cracks with hole dilation, and stage III handles
more extensive holes before outputting the L/R synthesized views.
Based on this structure, the whole viewpoint synthesis process can be segmented into
three sequential and independent stages. Since no recursive data are fed back from the
last stage, the whole process can be pipelined. For example, while stage II handles
scanline i, scanline i + 1 is being processed in stage I and scanline i − 1 in
stage III. Consequently, three rows are simultaneously active in the engine, and the
process latency is greatly reduced by operating every stage of the engine within a
predetermined execution time.
Fig. 3.34 Disparity results of the proposed implementation for the Tsukuba, Venus, Teddy, and
Cones stereo images (from left to right)
would possibly impact the performance of the tested algorithm. In the following,
we describe the quality estimation of the stereo matching and viewpoint synthesis,
respectively.
Table 3.2 Quantitative results of the stereo matching for the original Middlebury stereo database

Algorithm            Tsukuba                Venus                  Teddy                  Cones                  Average percent bad pixels
                     Nonocc  All    Disc    Nonocc  All    Disc    Nonocc  All    Disc    Nonocc  All    Disc
Proposed             2.54    3.09   11.1    0.19    0.42   2.36    6.74    12.4   17.1    4.42    10.2   11.5    6.84
RT-ColorAW [38]      1.4     3.08   5.81    0.72    1.71   3.8     6.69    14     15.3    4.03    11.9   10.2    6.55
SeqTreeDP [39]       2.21    2.76   10.3    0.46    0.6    2.44    9.58    15.2   18.4    3.23    7.86   8.83    6.82
MultiCue [40]        1.2     1.81   6.31    0.43    0.69   3.36    7.09    14     17.2    5.42    12.6   12.5    6.89
AdaptWeight [41]     1.38    1.85   6.9     0.71    1.19   6.13    7.88    13.3   18.6    3.97    9.79   8.26    6.67
InteriorPtLP [42]    1.27    1.62   6.82    1.15    1.67   12.7    8.07    11.9   18.7    3.92    9.68   9.62    7.26
Fig. 3.35 The synthesized virtual images from various viewing angles by using the ground truth
(top row) and the measured depth map (bottom row)
Table 3.3 PSNR of the synthesized virtual images using the depth map from the ground truth
(GroundT) and from the proposed stereo-matching method (StereoM). Note that each number is an
average value from IM3 to IM5

PSNR (dB)   Venus   Teddy   Sawtooth   Poster   Cones
GroundT     45.07   44.76   44.77      44.24    43.12
StereoM     45.09   44.53   44.57      44.07    42.92
create the left/right depth maps (i.e., DL and DR). The calculated depth map and the
ground truth were then used to synthesize virtual views at exactly the viewpoints IM3,
IM4, and IM5, as shown in Fig. 3.35. Images on the top row show the virtual views
synthesized using the ground truth (i.e., VIM3, VIM4, and VIM5) and those on the bottom
row show the results using the depth map obtained from stereo matching. As a comparison,
PSNRs were calculated between the virtual views and their corresponding images, as listed
in Table 3.3. The PSNR of the virtual images synthesized from the ground truth and from
the calculated depth are around 43–45 and 42–45 dB, respectively.
Table 3.4 The predefined engine spec of the proposed system in two cases

                              HD       FullHD
Image width                   1024     960
Image height                  768      1080
Frame rate (FPS)              60       60
Maximum disparity level       128      128
Speed performance (MDE/s)     6039     7962
Table 3.5 Estimated system resources and hardware costs based on the TSMC 250 nm process,
using the Cadence RTL compiler (image size 1920 × 1080, maximum disparity 64 levels)

Process                                   Gate count (k gates)   Memory size (k bytes)
Stereo matching       CTSR                65.6                   62
                      Rawcost             88.2                   0
                      DP                  79.3                   43
                      Post                98                     54
Viewpoint synthesis   Forward_warp        41.8                   4
                      Reversed_warp       38.2                   1
                      Hole_filling        58.6                   2
Total                                     469.7                  159
been reported to be 2796.2 MDE/s (HD at 27.8 FPS with a disparity range of 128) [26].
In addition, Jin et al. report an FPGA stereo matching system with a speed performance
of 4522 MDE/s (VGA at 230 FPS with a disparity range of 64) [27]. Since most of the
algorithms described in the previous section focus on both the quality of the depth map
and the hardware implementation, the question "How fast is the system?" is very relevant.
Here, we synthesized the engine in the Altera Quartus II design framework and implemented
it on the Stratix III FPGA with the spec listed in Table 3.4.
In Table 3.4, two cases have been implemented, for the HD and FullHD (i.e., 1080p)
video formats. In the HD case, the input image resolution is set to 1024 by 768 with a
frame rate of 60 FPS. The maximum disparity in both cases is defined as 128 levels to
cover a large dynamic depth range in the video. Note that the configuration in the FullHD
case is a preliminary test for the FullHD video stream and was configured for the half
side-by-side format. In this scenario, a speed performance of 7,962 MDE/s is achieved.
Furthermore, we also studied how this algorithm would perform on an ASIC
(application-specific integrated circuit). This includes an estimation of the complexity,
the internal memory resource usage, and the possible execution speed. An ASIC platform
synthesis and simulation have been performed to obtain estimated figures of merit. We use
the TSMC 250 nm process library with the Cadence RTL compiler at medium area optimization
for logic synthesis as a reference for porting to an ASIC implementation.
Table 3.5 lists the estimated gate count and memory size. The gate count is below
500 k gates, which is very hardware-friendly for ASIC implementation in a low-cost
technology node. For FullHD 1080p, the gate count of the system remains unchanged. The
disparity level is predefined to be 64, which satisfies the depth budget of FullHD
(~3 %) for high-quality 3D content [43].
3.7 Conclusion
In this chapter, we have demonstrated the application of user-defined depth scaling in
3D-TV, where a new left/right stereo image pair is calculated and rendered on the 3D
display, providing a reduced depth impression and a better viewing experience with less
eye strain. Calculating more than two output views by replicating some digital signal
processing kernels supports the Stereo-In to Multiple-Viewpoint-Out functionality for 3D
autostereoscopic displays. Depth is extracted from the pair of left/right input images by
stereo matching, and multiple new viewpoints are synthesized and rendered to provide the
proper depth impression in all viewing directions. However, there is always a tradeoff
between system performance and cost. For instance, an accurate cost calculation needs a
larger aggregation size (Sect. 3.2.1), which leads to a higher gate count and memory
budget in the system and degrades the system performance. Consequently, building a
high-quality, real-time system remains a challenge. We proposed a way to achieve this
target. For every step in the proposed stereo matching system, we employ efficient
algorithms to provide high-quality results at acceptable implementation cost, demonstrated
on FPGA. We have demonstrated the efficiency and quality of the proposed system both for
stereo matching and viewpoint synthesis, achieving more than 6,000 million disparity
estimations per second at HD resolution and a 60 Hz frame rate, with a BPER of no more
than 7 % and a PSNR of around 44 dB on the MPEG and Middlebury test sequences.
References
1. Hartley R, Zisserman A (2004) Multiple view geometry in computer vision. Cambridge
University Press, Cambridge
2. Papadimitriou DV, Dennis TJ (1996) Epipolar line estimation and rectification for stereo
image pairs. IEEE Trans Image Process 5(4):672–676
3. Scharstein D, Szeliski R, Zabih R (2001) A taxonomy and evaluation of dense two-frame
stereo correspondence algorithms. In: Proceedings IEEE workshop on stereo and multibaseline vision (SMBV 2001)
4. Brown MZ, Burschka D, Hager GD (2003) Advances in computational stereo. IEEE Trans
Pattern Anal Mach Intell 25(8):993–1008
5. Porter RB, Bergmann NW (1997) A generic implementation framework for FPGA based
stereo matching. In: IEEE region 10 annual conference. Speech and image technologies for
computing and telecommunications (TENCON 97), Proceedings of IEEE
6. Lane RA, Thacker NA (1998) Tutorial: overview of stereo matching research. Available
from: http://www.tina-vision.net/docs/memos/1994-001.pdf
7. Zhang K et al (2009) Robust stereo matching with fast normalized cross-correlation over
shape-adaptive regions. In: 16th IEEE international conference on image processing (ICIP)
8. Chang NYC, Tseng Y-C, Chang TS (2008) Analysis of color space and similarity measure
impact on stereo block matching. In: IEEE Asia Pacific conference on circuits and systems
(APCCAS)
9. Hirschmuller H, Scharstein D (2009) Evaluation of stereo matching costs on images with
radiometric differences. IEEE Trans Pattern Anal Mach Intell 31(9):1582–1599
10. Humenberger M, Engelke T, Kubinger W (2010) A census-based stereo vision algorithm using
modified semi-global matching and plane fitting to improve matching quality. In: IEEE
computer society conference on computer vision and pattern recognition workshops (CVPRW)
11. Zhang K et al (2011) Real-time and accurate stereo: a scalable approach with bitwise fast
voting on CUDA. IEEE Trans Circuits Syst Video Technol 21(7):867–878
12. Zhang K, Lu J, Gauthier L (2009) Cross-based local stereo matching using orthogonal
integral images. IEEE Trans Circuits Syst Video Technol 19(7):1073–1079
13. Yoon K-J, Kweon I-S (2005) Locally adaptive support-weight approach for visual
correspondence search. In: IEEE computer society conference on computer vision and
pattern recognition (CVPR)
14. Hosni A et al (2009) Local stereo matching using geodesic support weights. In: 16th IEEE
international conference on image processing (ICIP)
15. Wang L et al (2006) High-quality real-time stereo using adaptive cost aggregation and dynamic
programming. In: Third international symposium on 3D data processing, visualization, and
transmission
16. Chang NYC et al (2010) Algorithm and architecture of disparity estimation with mini-census
adaptive support weight. IEEE Trans Circuits Syst Video Technol 20(6):792–805
17. Zhang L et al (2011) Real-time high-definition stereo matching on FPGA. In: Proceedings of
the 19th ACM/SIGDA international symposium on field programmable gate arrays (FPGA)
18. Zhang L (2010) Design and implementation of real-time high-definition stereo matching SoC
on FPGA. In: Department of microelectronics and computer engineering, Delft university of
technology, The Netherlands
19. Yi GY (2011) High-quality, real-time HD video stereo matching on FPGA. In: Department of
microelectronics and computer engineering, Delft university of technology: The Netherlands
20. Hirschmuller H (2005) Accurate and efficient stereo processing by semi-global matching and
mutual information. In: IEEE computer society conference on computer vision and pattern
recognition (CVPR)
21. Tanimoto M (2004) Free viewpoint television (FTV). In: Proceedings of picture coding
symposium
22. Boykov Y, Veksler O, Zabih R (2001) Fast approximate energy minimization via graph cuts.
IEEE Trans Pattern Anal Mach Intell 23(11):1222–1239
23. Sun J, Zheng NN, Shum H-Y (2003) Stereo matching using belief propagation. IEEE Trans
Pattern Anal Mach Intell 25(7):787–800
24. Berdnikov Y, Vatolin D (2011) Real-time depth map occlusion filling and scene background
restoration for projected pattern-based depth cameras. Available from:
http://gc2011.graphicon.ru/files/gc2011/proceedings/conference/gc2011berdnikov.pdf
25. Wang L et al (2008) Stereoscopic inpainting: joint color and depth completion from stereo
images. In: IEEE conference on computer vision and pattern recognition (CVPR)
26. Banz C et al (2010) Real-time stereo vision system using semi-global matching disparity
estimation: architecture and FPGA-implementation. In: International conference on
embedded computer systems (SAMOS)
27. Jin S et al (2010) FPGA design and implementation of a real-time stereo vision system.
IEEE Trans Circuits Syst Video Technol 20(1):15–26
28. Gehrig SK, Meyer FET (2009) A real-time low-power stereo vision engine using semi-global
matching. Lect Notes Comput Sci 5815:134–143
Chapter 4
DIBR-Based Conversion
from Monoscopic to Stereoscopic
and Multi-View Video
Liang Zhang, Carlos Vázquez, Grégory Huchet
and Wa James Tam
4.1 Introduction
The development of digital broadcasting infrastructures and technologies around
the world as well as rapid advances in stereoscopic 3D (S3D) display technologies
have created an unprecedented momentum in the push for the standardization and
delivery of S3D to both television receivers in the home and mobile devices on the
road. Three-dimensional television (3D-TV) based on a minimum of two views
that are taken from slightly different viewpoints has for the last few years become
the common goal of private, regional, and international standardization organizations (e.g., ATSC, EBU, ITU, MPEG, SMPTE) [1–3]. However, standardization
efforts are also being made to encompass multi-view (MV) displays (e.g., ITU,
MPEG) [4]. For brevity and ease of discussion, in this manuscript S3D will be
used to refer to both stereoscopic and MV imaging as the case may apply.
The fundamental principle behind 3D-TV in its most basic form is the transmission of two streams of video images. The images depict vantage points that
have been captured with two cameras from slightly offset positions, simulating the
two separate viewpoints of a viewer's eyes. Appropriately configured TV receivers
then process the transmitted S3D signals and display them to viewers in a manner
that ensures the two separate streams are presented to the correct eyes. In addition,
images for the left and right eyes are synchronized to be seen simultaneously, or
near-simultaneously, so that the human visual system (HVS) can fuse them into a
stereoscopic 3D depiction of the transmitted video contents.
Given the experience of the initially slow rollout of high-definition television
(HDTV), it is not surprising that entrepreneurs and advocates of 3D-TV in the
industry have envisioned the need to ensure an ample supply of S3D program
contents for the successful deployment of 3D-TV. Although equipment manufacturers have been relatively quick in developing video monitors and television
sets that are capable of presenting video material in S3D, the development of
professional equipment for reliable and acceptable stereoscopic capture has not
been as swift. Several major challenges must be overcome, especially synchronization of the two video streams, increased storage, alignment of the images for
the separate views, and matching of the two sets of camera lenses for capturing the
separate views. Even now (2011), most professional S3D capture equipment is bulky,
expensive, and not easy to operate.
Aside from accelerating the development of technologies for the proper and
smooth capture of S3D program material for 3D-TV, there is one approach that has
provided some visible success in generating suitable material for stereoscopic
display without the need for stereoscopic camera rigs. Given the vast amount of
program material available in standard two-dimensional (2D) format, a
number of production houses have successfully developed workflows for the
creation of convincing S3D versions of 2D video programs and movies. The
substantial success of the conversion of Hollywood movies to S3D has been
reflected in box office receipts, especially of computer-generated (animation)
movies. Most welcomed by the media industry and post-production houses, the
approach of converting 2D-to-S3D has not only provided a promising avenue for
increasing program material for 3D-TV, but has also opened up a new source of
revenue from repurposing of existing 2D video and film material for new release.
It has also been put forward, by a few commercially successful post-production
houses, that the process of conversion of 2D-to-S3D movies provides flexibility and a
secure means of ensuring that S3D material is comfortable for viewing [5]. As is
often raised in popular media, stereoscopic presentation of film and video material
can lead to visual discomfort, headaches, and even nausea and dizziness if the
contents are displayed or created improperly [6]. Through careful conversion,
the conditions that give rise to these symptoms could be minimized and even
eradicated because guidelines to avoid situations of visual discomfort can be easily
followed. Furthermore, different options for parallax setting and for the depth layout of
various objects and their structures can be explored by trial and error, not only for
safety and health concerns but also for artistic purposes. That is, the depth
distribution can be customized for different scenes to meet specific director
requirements. Thus, the conversion from monoscopic to stereoscopic and MV video
is an important technology for several reasons: accelerating 3D-TV deployment,
creating a new source of revenue from existing 2D program material, providing a
means for ensuring visual comfort of S3D material, and allowing for repurposing and
customization of program material for various venues and artistic intent.
Having presented the motivation for 2D-to-3D video conversion in Sect. 4.1,
the rest of this chapter is organized into topical sections. Section 4.2 describes the
theory and the basis underlying 2D-to-3D video conversion. It also introduces a
framework for conversion that is based on the generation of depth information
from the 2D source images. Section 4.3 presents several strategies for generating
the depth information, including scene modeling and depth estimation from
pictorial and motion depth cues that are contained in the 2D images. The use of
surrogate depth information based on visual cognition is also described.
Section 4.4 presents the important topic of how new views are synthesised.
Procedures for preprocessing of the estimated depth information, pixel shifting and
methods for filling in newly exposed areas in the synthesised views are explained
in detail. Section 4.5 discusses the type of conversion artifacts that can occur and
how they might affect both picture and depth quality. Issues associated with the
measurement of conversion quality are briefly discussed. Section 4.6 concludes
with a discussion of some of the important issues that are related to the long-term
prospects of 2D-to-3D video conversion technologies and suggests areas in which
research could be beneficial for advancing current implementations.
stereoscopic perception of depth by the viewer. The main goal in 2D-to-3D video
conversion is the generation of that second, or higher, stream of video from the
original images of the 2D (monoscopic) stream.
but also on other visual cues that provide information about the depth of objects in
the scene that our brains have learned. For example, using our learned perceptive
knowledge we can easily place objects in depth in the scene that is depicted by a
flat 2D image in Fig. 4.2 even if there is no stereoscopic information available.
scope of this chapter, since they use image warping instead of DIBR for virtual
view synthesis.
From Fig. 4.4, the first step is to specify the virtual camera configuration,
i.e., where to locate the virtual cameras to generate the stereoscopic content. The
relative positions of the cameras, together with the viewing conditions, determine
the parameters for converting depth values into disparity between the two images;
this process creates a disparity map. Parallel cameras are normally preferred since
they do not introduce vertical disparity which can be a source of discomfort for
viewers. There are two possible camera configurations to choose from: use the original
2D content as one of the cameras and position the second camera close to it to form a
virtual stereo pair, or treat the original 2D content as being captured from a central
view and generate two new views, one on each side of the original content.
Generating only one single-view stream of images is less computationally intensive and ensures that at least one view (consisting of the original images) has high
quality. Generating images for two views has the advantage that both of the
generated images are closer to the original one and as a consequence the disocclusions or newly exposed regions, which are a source of artifacts, are smaller and
spread over the two images.
The viewing conditions also have to be taken into account. The distance to the
display, the width of the display, and the viewer's interocular distance are three of
the main parameters to take into account while converting depth information into
disparity values used in the DIBR process to create the new image that forms the
stereoscopic pair.
The second main operation in the DIBR process is the shifting of pixels to their
positions in the new image. The original depth values are preprocessed to remove
noise and reduce the amount of small disocclusions in the rendered image. Then the
depth values are converted into disparity values by taking into account the viewing
conditions. The process of shifting can be interpreted theoretically as a reprojection
of 3D points onto the image plane of the new virtual camera. The points from the
original camera, for which the depth is available, are positioned in the 3D space and
then reprojected onto the image plane of the new virtual camera. In practice,
however, what really happens is that the shift is only horizontal because of the
Fig. 4.5 Disoccluded regions (holes) are produced at sharp disparity edges in the newly
generated image which have to be properly filled. See main text for details. Original images from
Middlebury Stereo Datasets [18]
parallel configuration of the camera pair and, consequently, pixels are shifted
horizontally to their new positions without having to find their real positions in the 3D
space [17]. This horizontal shifting of pixels is driven by the disparities that were
computed using the camera configuration and by the simple rule that objects in front
must occlude objects in the back. Horizontal shifting of pixels also makes the
geometry simpler and allows for linewise parallelized computer processing.
The last main operation in the DIBR process is to fill in the holes that are
created by sharp edges in the disparity map. Whenever a pixel with a large disparity is
followed (in the horizontal direction) by a pixel with a smaller disparity, the
difference in disparity will create either an occlusion (superposition of objects) or a
disocclusion (lack of visual information). These disocclusions are commonly
referred to as holes, since they appear in the rendered image as regions without any
color or texture information.
In Fig. 4.5, we present an example of the process that leads to disoccluded
regions. The images in the top row, starting from the left, represent the original left
image, the disparity map image and the original right image (ground-truth),
respectively. The second row of images, starting from the left, represents an
enlarged view of a small segment of the original left image and its corresponding
area in the disparity image. The last three images represent, respectively: the
segment in the newly generated right image with the hole (disocclusions region),
the segment with the shape of the hole shown in white, and the segment with the
hole (outlined by the red and blue pixels) that has been filled with color and texture
information from the original right image. As pixels in the image are shifted to
generate the new virtual view, the difference in depth at the edge produces a
hole in the rendered image for which no visual information is available. The
rightmost image in the bottom row shows the real right-eye view (ground-truth)
with the information that should be placed in the hole to make it look realistic or
natural. This illustration gives an idea of the complexity of the task of filling holes
with consistent information. In reality, the content of the hole must be computed
from the left image since the ground-truth right image is not available.
One of the main steps in this approach of using scene models is the extraction of
features from the images that allow the scene to be properly classified and
eventually correctly assigned one of the predefined models. In [20] for example,
the first step is a classification of the scene into an indoor, an outdoor or an outdoor
with geometric elements scene. Other meaningful classifications that allow for the
selection of the appropriate model to apply to the scene are presented in [19, 21].
However, the depth models as discussed do not accurately represent the depth of
the scene because frequently objects that are in the scene do not conform to the
general assumption that changes in depth are of a gradual nature. This is the main
drawback of this technique because it does not provide the variations in depth that
make a scene look natural. The main advantage of this method, on the other hand,
is that it is fast and simple to implement. Thus, because of its speed and simplicity,
this method is mainly used for real-time applications.
It should be noted that the depth model approaches to defining depth in 2D-to-3D
video conversion are normally used as a first and rough approximation of the depth
[22, 23], and then complemented with information extracted from other depth cues in
the scene. For example, the detection of vertical features in the image can lead to the
extraction of objects that stand in front of the background, allowing for the correct
representation in depth of objects whose depth is not encompassed by the selected
model.
There are several depth cues in the 2D images that can help improve the depth
model to better represent the depth of the underlying scene. Among the depth cues
that are used is the position of objects as a function of the height in an image [24].
The higher an object is in the image, the farther away it is assumed to be. Linear
perspective [14], horizon location [20], and geometric structure [21] are features in
the image that can also be used to improve the predefined depth scene model
approach.
There are some examples of depth scene models being used for 2D-to-3D video
conversion in real time. Among them, the commercialized box from JVC™ uses
three different depth scene models as a base for the depth in the conversion [22], and
then adds some modifications to this depth based on color information. The proposed
method starts with three models: a spherical model, a cylindrical model, and a planar
model and assigns one of them (or a combination of them) to the image based on the
amount of spatial activity (texture) found in different regions of the image. The
models are blended and a color-based improvement is used to recede cool colors
and advance warm colors.
of discrepancies among visual depth cues and still be able to make sense of the
scene from the depth organization point of view [9, 25]. Most of the methods used
in this approach generate a depth map based on general features of the images,
such as luminance intensity, color, edge-based saliency or other image features
that provide an enhanced perception of depth, compared to a corresponding 2D
image, without explicitly aiming to model accurately the real depth of the scene
[26]. The depth maps thus generated are referred to as surrogate depth maps
because they are based on perceptual effects rather than estimation of actual depths
to arrive at a stable percept of a visual scene [27].
To better explain how this type of method works, we need to understand how
the HVS functions in depth perception. The perception of depth is an active
process that depends not only on the binocular depth cues (parallax, convergence),
but also on all available monocular depth cues including knowledge from past
experiences. By providing a stereoscopic image pair, we are providing the HVS
with not only the binocular parallax information, but also all other depth cues, i.e.,
the monocular depth cues that are contained in the images. Of course, the images
also provide the stimulus to activate past experiences that are stored in the brain
and that are readily available to the viewer. The HVS will take all the available
information to create a consistent perception of the scene. In the particular case of
stereoscopic viewing, when the binocular parallax information is present (based on
the information from a surrogate depth map), but is inconsistent with a number of
the other cues, the HVS will try to resolve the conflict and generate a percept that
makes the most sense from all the available information. Observations indicate
that any conflicting parallax information is simply modified or overridden by the
other cues, particularly by the cue for interposition [2628]. On the other hand,
when the surrogate depth is consistent with all the other depth cues, then the depth
effect is reinforced and the depth quality is improved. A major condition for this
surrogate-based approach to be effective is that the depth provided by the synthesised binocular parallax cue in the rendered stereoscopic images is not so strong
that it dominates the other depth cues. When the conflict introduced is too strong,
viewers would experience visual discomfort.
Several methods in the literature exploit this approach for the conversion of 2D
material to the S3D format. In particular, methods based on luminance, color, and
saliency rely on this process to provide depth. For example, the methods
proposed in [26, 29] rely on the detection of edges in the image to generate a depth
map. Depth edges are normally correlated to image edges in defining the borders
of objects, so by providing depth to edges in the image, objects are separated from
the surrounding background and the depth is enhanced. A color segmentation
assisted by motion segmentation is used in [30] to generate a depth map for 2D-to-3D video conversion. Each segment in the image is assigned a depth following a
local analysis of edges to classify each pixel as near or far.
Another method that has been proposed uses the color information to generate
a depth map that is then used for 2D-to-3D video conversion [27, 31]. The method
proposed by [27] employs the red chromatic component (Cr) of the YCbCr color
space as a surrogate for the depth. It is based on the observation that the gray-level
image of the Cr component of a color image looks similar to what a real depth map
might look like of the scene depicted in the image. That is, objects or regions with
bluish or greenish tint appear darker in the Cr component than objects or regions
with reddish tint. Furthermore, subtle depth information is also reflected in the
gray-intensity shading of areas within objects of the Cr component image, e.g., the
folds in the clothing of people in an image. In general, the direction of the gray-level gradient and the intensity of the gray-level shading of the Cr component,
especially of naturalistic images, tend to roughly match that expected of an actual
depth map of the scene depicted by the image. In particular, faces which have a
strong red tint tend to show depth that is closer to the viewer while the sky and
landscapes, because of having a stronger bluish or greenish tint, are receded into
the background. The variation in shading observed in the Cr component, such as
within images of faces and of trees, also provides relative depth information that
reflects depth details within objects. Studies have shown that this method is
effective in practice and has the advantage of simplicity of implementation [27].
Figure 4.6 presents an example of a surrogate depth map computed from color
information. The main structure of the scene is present and objects are well defined
in depth.
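As an illustration, a surrogate depth map of this kind can be extracted in a few lines of Python using the BT.601 definition of the Cr component. The normalization range is an assumption chosen for display purposes; this is only a sketch of the idea of using Cr as a depth surrogate, not the exact procedure reported in [27].

```python
import numpy as np

def cr_surrogate_depth(rgb_image):
    """Use the Cr chromatic component (BT.601 YCbCr) as a surrogate depth map:
    reddish regions (large Cr) are treated as near, bluish/greenish regions as far."""
    rgb = rgb_image.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cr = 128.0 + 0.5 * r - 0.4187 * g - 0.0813 * b                 # BT.601 Cr component
    cr = (cr - cr.min()) / max(float(cr.max() - cr.min()), 1e-6)    # stretch to [0, 1]
    return (cr * 255.0).astype(np.uint8)                            # brighter = closer
```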
Another approach that has received much attention recently is based on visual
attention or saliency maps [32] used as surrogates for depth. The main hypothesis
supporting these approaches is that humans concentrate their attention mainly on
foreground objects and objects with salient features [33] such as high contrast or
textures. Based on this hypothesis some researchers are proposing to use a saliency
map to represent the depth of a given scene in such a way that the most interesting
objects, from the point of view of the viewer, are brought to the front of the scene
and less interesting objects are pushed to the back as background.
In general, these methods of using different types of surrogates as a source of
depth information are simple to use and provide reasonably good results for the
conversion of 2D content to 3D format. Because they are based on the robustness
of the HVS for consolidating all available visual information to derive the perception of depth, they tend to produce more natural-looking images than
methods relying on predefined models. These surrogate depth-based methods are
mainly used for fast (real-time) conversion of video material because of their
simplicity and performance.
Fig. 4.7 Schematic of shearing motion: scene structure and camera motion shown on the left,
and shearing motion in the captured image shown on the right
visually pop-out more when the direction of motion is opposite to that of the camera
(see Fig. 4.7). Future studies will be required to examine the effects of IMOs on the
visual comfort experienced by viewers when they view such conversions.
particular object in the scene, shearing motion in the image plane is created [34].
Shearing motion is schematically shown in Fig. 4.7, where the camera moves around
the tree in the scene. In such a case, objects that lie between the tree and the moving
camera will have a motion in the image plane that is opposite in direction to that of the
camera. Objects that lie beyond the tree move in the same direction as the camera.
Furthermore, the farther in depth the objects are with respect to the tree, the larger the
magnitude of movement. To tap this depth information from shearing motion, the
dominant camera motion direction is included in the computation for the motion-to-depth mapping [37].
For mapping of image motion into depth values when camera motion is
complicated or when more realistic depth values are desired, the paths taken by a
moving camera, called tracks, and the camera parameters are required [36, 47].
The tracks and camera parameters, including intrinsic and extrinsic parameters,
can be estimated by using structure-from-motion techniques [38]. Once the camera
parameters are obtained, the depth value for each pixel is calculated by triangulation of the corresponding pixels between two images that are selected from the
sequence [36]. Such a mapping generates a depth map that more closely
approximates the correct relative depths of the scene.
generated within a newly rendered image is to heavily smooth the depth information before DIBR [28]. In other words, filtering simplifies the rendering by
removing possible sources of artifacts and improves the quality of the rendered
scene. In [51], it was demonstrated that smoothing potentially sharp depth
transitions across object boundaries reduces the number of holes generated during
the virtual view synthesis and helps to improve the image quality of virtual views.
Furthermore, it was shown in [52] that the use of an asymmetrical Gaussian filter,
with a stronger filter along the vertical than the horizontal dimension of the image,
is better at preserving the vertical structures in the rendered images.
Whereas a smoothing effect reduces the number of holes generated, it has the
effect of diminishing the perceived depth contrast in the scene because it softens
the depth transitions at boundaries of objects. It is interesting to note that although
smoothing across object boundaries reduces the depth contrast between object and
background, it has the beneficial effect of reducing the extent of the cardboard
effect in which objects appear flat [53]. In [54], a bilateral filter is used to smooth
the areas inside the objects, reducing depth-related artifacts inside objects while
keeping sharp depth transitions at object boundaries, and thus preserving the depth
contrast. Nevertheless, this is hardly an ideal solution, since large holes between
objects are still present and need to be filled. In general, when applying a
smoothing filter, lowering the strength of the smoothing leaves large holes that are
difficult to fill, which in turn, degrades the quality of the rendered scene [55]
because any artifact would encompass a wider region and would likely be more
visible. For these reasons, a tradeoff is essential to balance a small quantity of holes
against a good depth impression, without causing the quality of the rendered scene to
degrade too much.
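A minimal sketch of such depth-map preprocessing is shown below, using a separable Gaussian blur that is stronger vertically than horizontally, in the spirit of the asymmetric filtering reported in [52]. The kernel radii and sigma values are illustrative assumptions, not the parameters of any published implementation.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1, dtype=np.float32)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_depth_asymmetric(depth, sigma_h=3.0, sigma_v=9.0):
    """Smooth a depth map with a separable Gaussian that is stronger along the
    vertical axis, to better preserve vertical structures in the rendered views."""
    kh = gaussian_kernel(sigma_h, radius=int(3 * sigma_h))
    kv = gaussian_kernel(sigma_v, radius=int(3 * sigma_v))
    out = np.apply_along_axis(lambda row: np.convolve(row, kh, mode="same"),
                              1, depth.astype(np.float32))   # horizontal pass
    out = np.apply_along_axis(lambda col: np.convolve(col, kv, mode="same"),
                              0, out)                         # stronger vertical pass
    return out
```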
4.4.2 3D Rendering
3D rendering or pixel shifting is the next main operation in the DIBR process. Let
us consider the case in which a parallel-camera configuration is used to generate
the virtual S3D image pair, the original content is used as the left-eye view and the
right-eye view is the rendered one. Under these circumstances, pixel shifting
Fig. 4.9 The diagram on the left shows the horizontal disparity value of each pixel in a row of
pixels contained in the left image. The diagram on the right shows how pixels in the row from the
left image are horizontally shifted, based on the pixel depth value, to render a right image and
form a stereoscopic image pair
becomes a one-dimensional operation that is applied row-wise in the camera-captured images. Pixels have only to be shifted horizontally from the left image
position to the right image position to generate the right-eye view.
of the discontinuity. In the diagram, that is represented by the dotted lines. The
difference in depth between pixels 5 and 7 defines a region in which no pixel, aside
from pixel 6 which defines a depth edge, is located, defining a disoccluded region or
hole. This region is indicated in the figure by the green rectangle.
In most cases, pixels in the source image have a corresponding pixel in the new
image with the location of each of the pixels reflecting its associated horizontal
disparity. The generated new image can be either a left-eye or a right-eye view, or
both. Figure 4.10 is a schematic of a model of a stereoscopic viewing system
shown with a display screen and the light rays from a depicted object being
projected to the left eye and the right eye [56].
With respect to Fig. 4.10, the following expression provides the horizontal
disparity, p, presented by the display according to the perceived depth z_p of a pixel:

p = x_R − x_L = t_c (1 − D/(D − z_p)),    (4.1)

where t_c corresponds to the interpupillary distance, i.e., the human eye separation,
usually assumed to be 63 mm, and D represents the viewing distance from the
display. Hence, the pixel disparity is expressed as:

p_pix = p N_pix / W,    (4.2)

where N_pix is the horizontal pixel resolution of the display and W its width in the
same units as p.
In this particular case, it is important to note that the perceived depth z_p is
taken relative to the location of the screen, which lies at the Zero Parallax Plane. This
means that z_p = 0 for objects lying in the screen plane, 0 < z_p < D for objects in front of
the screen, and z_p < 0 for objects behind the screen. A positive parallax means that
the object will be perceived behind the display, and a negative parallax means that
the object will be perceived in front of the display.
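As a numeric illustration (the specific values are ours, chosen only for the example): with t_c = 63 mm, D = 2 m, and a pixel perceived 10 cm in front of the screen (z_p = 0.1 m), Eq. (4.1) gives p = 63 mm × (1 − 2/1.9) ≈ −3.3 mm; on a display of width W = 1 m with N_pix = 1920 pixels, Eq. (4.2) converts this to p_pix ≈ −6.4 pixels, the negative sign indicating crossed disparity and hence an object in front of the screen.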
Other viewing system models exist in which the distance is taken
from the camera to the position of the object. Such models are generally better
adapted for a multi-camera setup used for the generation of stereoscopic or MV
images [10].
4.4.2.2 Depth Mapping
A depth map can be construed as a collection of values m ∈ {0, ..., 2^N − 1 : N ∈ ℕ},
where N is the number of bits encoding a given value. Conventionally,
the higher the value, the closer an object is to the viewer and the brighter is the
appearance of the object in a gray-level image of the depth map. These depth maps
need to be converted into disparity so the pixels of one image can be shifted to new
positions based on the disparity values to generate the new image.
For a given pixel disparity, the depth impression can be different from one person
to another because viewers have different inter-pupillary distances. Viewers might
also have different personal preferences and different tolerance levels in terms of
viewing comfort. The individual differences suggest that viewers might differ in their
preference for a more reduced or a more exaggerated depth structure [57]. Therefore,
it is desirable to have the option of being able to control the magnitude of depth in a
2D-to-3D video conversion process. One method to accomplish this is to have each
depth value m mapped to a real depth according to a set of parameters provided by the
viewer. As an example, the volume of the scene can be controlled directly by
adjusting the farthest and closest distances, Z_far and Z_near. If the depth is taken relative
to the viewer, the range of depth R is defined as:

R = Z_far − Z_near.    (4.3)

The depth mapping function is then:

z_p = R (1 − m/(2^N − 1)) + Z_near.    (4.4)

Here we can easily verify that z_p = Z_far for m = 0, and z_p = Z_near for m = 2^N − 1.
A similar technique is used in [56]. However, the mapping function is created
relative to the screen position as in Fig. 4.10. Given two parameters knear and kfar,
which represent the distance (as a percentage of the display width W) that is in
front of and at the back of the screen, respectively, we can define the range of
depth R produced as:
R = W (k_near + k_far).    (4.5)

Hence, z_p can be defined as:

z_p = W (m (k_near + k_far)/(2^N − 1) − k_far).    (4.6)
The parameters knear and kfar control the amount of depth perceived in the scene.
The advantage of this method is that the volume of the scene is defined relative to
the viewing conditions, such as size and position of the display with respect to the
the viewer, instead of being defined relative to the capturing conditions. This makes it
relatively simple to change the amount of depth in the scene by changing
parameters under the control of the viewer.
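The following sketch chains the screen-relative mapping of Eqs. (4.5)–(4.6) with the disparity computation of Eqs. (4.1)–(4.2); the default parameter values are illustrative assumptions only, not recommendations from the chapter.

```python
import numpy as np

def depth_to_pixel_disparity(m, n_bits=8, k_near=0.02, k_far=0.08,
                             t_c=0.063, D=2.0, W=1.0, n_pix=1920):
    """Map depth-map values to pixel disparities for a given display-viewer setup.

    m      : array of depth-map values in [0, 2**n_bits - 1], larger = closer
    k_near : depth budget in front of the screen, as a fraction of the display width W
    k_far  : depth budget behind the screen, as a fraction of W
    t_c    : eye separation (m); D : viewing distance (m); W : display width (m)
    """
    levels = 2 ** n_bits - 1
    # Eq. (4.6): perceived depth relative to the screen plane (positive = in front)
    z_p = W * (m / levels * (k_near + k_far) - k_far)
    # Eq. (4.1): screen parallax in metres (negative for objects in front of the screen)
    p = t_c * (1.0 - D / (D - z_p))
    # Eq. (4.2): convert to pixels for a display with n_pix horizontal pixels
    return p * n_pix / W

m = np.array([0, 128, 255])               # far, mid, near depth-map values
print(depth_to_pixel_disparity(m))        # negative values indicate crossed disparity
```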
There are also other possible mappings of the value in the depth map to the real
depth in the scene. In the previously presented mapping procedures, the conversions were linear. Nonlinear mappings of depth and nonlinear processing of the
disparity have also been proposed [58] to improve the depth perception of the
scene or to improve visual comfort levels for incorrectly captured video material.
Fig. 4.11 Horizontal shifting problem. The disoccluded area behind the triangle is incorrectly
filled. Correctly labeled disoccluded areas are in black
the objects have to be extracted. However, in most cases, the disparity is small
enough to avoid this problem.
Fig. 4.12 The top row shows an original stereoscopic image pair. The bottom row shows the
stereoscopic image pair after modification with a floating window, which is visible as black bars
in the left and right borders of the images
One of the best and simplest techniques known so far for disocclusion filling
is the patented exemplar-based image inpainting of Criminisi et al., as
presented in [62, 63]. It was later adapted to DIBR [64], using the depth
information as a guide to maintain consistency along object boundaries next to a
disoccluded region; in this case, only pixels that share the same depth value are
used to fill disocclusions. In [65] and [66], methods are proposed to improve the
spatial and temporal consistency across video frames.
With respect to disocclusions located along the left and right edges of
the image boundary, the commonly used solution is to scale the image up so
that the disoccluded part falls outside the visible image. Another
efficient way to mask them is to add a floating window around the image;
a width of only a few pixels along each border is usually sufficient to be effective,
as shown in Fig. 4.12. (N.B. the black bars that constitute the floating window are drawn much wider than they
actually are, to make them more visible in the figure.)
Aside from filling in the blank regions, a floating window is also helpful for
avoiding conflicting depth cues that can occur at the screen borders. For
instance, when an object with a negative parallax, located in front of the screen,
is in contact with the display border, an edge violation is produced (see Fig. 4.12,
top row). In this case the object is occluded by the physical display border,
which is located behind the virtual object (the right arm of the girl). The response of the
Human Visual System to this impossible situation is to "push" the object back
to the depth of the screen, resulting in a reduction of the perceived depth of the
object. A floating window resolves such a situation if it is placed in front of
the object that is in contact with the display border (see Fig. 4.12 bottom row).
The viewer will then have the impression that a window is floating in front
of the object and partially occluding it in a natural way. In Fig. 4.12, the portion
of the girl's arm that is not visible in the right image is covered by the black
floating window, effectively removing the discrepancy between the left and right
images. The floating window does not need to be strictly vertical; it could lie
in a plane that is closer at the bottom than at the top. Figure 4.13 shows
schematically how the floating window should look once placed in the image.
A frame with black borders will occlude the closest part of the image, in this
case the girl, and prevent the border violation problem.
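A floating window of the kind described above can be simulated simply by blackening thin vertical strips at the image borders. The sketch below masks only one edge per eye (the left border of the left view and the right border of the right view), which is one common arrangement for moving the window plane forward; a full floating window would typically mask both lateral borders of both views with asymmetric widths. Strip widths and function names are our own illustrative choices.

```python
import numpy as np

def apply_floating_window(left, right, width_left=12, width_right=12):
    """Add a simple floating window to a stereoscopic pair.

    Blackening the left border of the left-eye image and the right border of the
    right-eye image gives the window frame a negative parallax, so it is perceived
    in front of the screen and hides edge violations at the display borders.
    """
    left_out, right_out = left.copy(), right.copy()
    left_out[:, :width_left] = 0          # black bar on the left border of the left view
    right_out[:, -width_right:] = 0       # black bar on the right border of the right view
    return left_out, right_out
```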
by repeating those steps and with each synthesised view having an appropriately
different inter-camera baseline distance with respect to the original image.
Given that multiple images are generated in the MV case, the new images could
be generated with the original image positioned as either the first or the last image
in the range of views to be synthesised. However, this procedure is not recommended:
the view farthest from the original would then contain larger disoccluded regions
than the farthest view would if the original image were positioned in the middle of
the range of views. That is, the farther the synthesised view is from the original
image, the more likely it is that rendering artifacts, specifically those from the
hole-filling process in disoccluded regions, will be noticeable.
This problem is unique to generating MV images versus stereoscopic images
from DIBR and must be taken into consideration. That is, researchers often point
out that a larger inter-camera baseline is generally required for MV than for
stereoscopic 3D displays. Multi-view is intended for providing motion parallax
information through self-motion and/or for multiple viewers standing side-by-side
when viewing the S3D scene offered by MV auto-stereoscopic displays [67].
A wider baseline is required to handle the wider viewing angle that has to be
covered, and the wider baseline between the synthesised view and the original
image will lead to larger disoccluded regions that have to be filled when generating
the synthesised view.
Cheng et al. [68] proposed an algorithm to address this specific issue of large
baseline for view synthesis from a depth image. The large disoccluded regions in
the color image are filled using a depth-guided, exemplar-based image inpainting
algorithm. An important aspect of the algorithm is that it uses the structural
strength of the color gradient to preserve the image structure in the restored, filled-in regions.
displacements and hole filling. Factors (1) and (3) mainly contribute to the range
and magnitude of conversion artifacts, and factor (2) can contribute significantly to
viewers' visual discomfort and to perceived distortions of the global scene
(e.g., space curvature) and of objects in the scene (i.e., puppet-theater and cardboard
effects). These unintended changes in the images and their contents can lead to
reduced image and depth quality, as well as to reductions along other perceptual
dimensions such as naturalness and immersion.
the neighboring pixels that are part of the background and extend structures and/or
textures into the exposed regions [52]. In any case, there is no actual information
for properly patching these regions, and so inaccuracies in filling these regions
have a high probability of occurrence. When visible, the artifacts will appear as a
halo around the left or right borders of objects, depending on whether the left-eye or right-eye image is being rendered, respectively.
As suggested earlier, aside from rendering artifacts that are visible even when
viewed monocularly, there are depth distortion artifacts that can be perceived only
binocularly. The advantage of having a depth map for rendering an image as if
taken from a new camera viewpoint is also a feature that can lead to depth
distortion artifacts. There are two types of such perceptual artifacts. There is the
cardboard effect, in which objects appear to be flattened in depth. There is
also the puppet-theater effect, in which objects are perceived to be much smaller
than in real life. In both cases, these higher-level artifacts are created when the ratio
between the depth reproduction magnification and the lateral reproduction magnification
does not match the ratio observed in the real world [71].
Because the depth information can be scaled to an arbitrary range in the rendering
process, the cardboard effect and/or the puppet-theater effect can arise.
Selection of where to locate objects in a depth scene for a given display-viewer
configuration can lead to perceptual effects that can diminish the depth quality.
In particular, objects that are cut off by the left or right edges of a display should be
located behind the screen plane; otherwise there is an inconsistency that can
cause visual discomfort, because objects that are blocked by the edge of the screen
in the real world cannot be in front of the screen. This is referred to as "edge
violation" [72]. Furthermore, a poor choice of rendering parameter values, defined
by the viewing conditions (see Sect. 4.4.2), can lead to large disparity magnitudes
that are difficult to fuse binocularly by viewers [6, 73]. This will lead to visual
discomfort and, at the extreme, to headaches and dizziness.
In summary, errors contained in the depth map, poor choice of rendering
parameters, and improper processing of the depth information can translate into
visible artifacts that can occur around boundary regions of objects, visual
discomfort, and perceptual effects in terms of size and depth distortions.
of parameters for the camera–viewer configuration, and (3) the rendering quality.
Thus, it is the presence, magnitude, extent, and frequency of visible artifacts in the
monocular images from these processes (such as depth ringing and halos)
that determine the conversion quality. There are also conversion artifacts that
occur after binocular integration of the left-eye and right-eye inputs consisting of
space distortion artifacts such as curvature, puppet, and cardboard effects. These
are dependent on the choice of rendering parameters and conscious placement of
objects in the depth scene. These also contribute to the conversion quality. Thus,
because of the wide range of factors contributing to and the multidimensional
nature of perceptual tasks, the evaluation of conversion quality is difficult to
conduct with an objective method.
Evaluation of conversion quality is not straightforward because there is no
reference against which to compare; that is, the original scene with all its depth
information in stereoscopic view is not available. The conversion quality is ultimately
based on the absence of noticeable and/or annoying artifacts and on the quality
of the depth experience. With DIBR, errors around objects give rise to annoying
artifacts that degrade the quality of the images. The quality of the depth
experience can suffer if content creators generate depth maps that have noticeable
errors in them and if poor parameter values are chosen for the displayed depth.
"Noticeable" is very subjective and depends on how the images are viewed.
That is, the artifacts will be more visible if they are scrutinized under freeze frame
mode as opposed to being played back naturally. This is because the depth percept
can take time to build up [74].
Nevertheless, subjective tests involving viewers' ratings of image quality, depth
quality, sense of presence, naturalness and visual comfort provide a basis for
evaluating conversion quality. This can be done using a five-section continuous
quality rating scale [75]. The labels for the different sections could be changed
based on the perceptual dimension that is to be assessed. Examples are shown in
Fig. 4.14. Another possible measurement of conversion quality that does not
require a reference and that does not require the assessment of any of the
perceptual dimensions indicated so far is that of pseudo-stereopsis. For a
standard stereoscopic image pair, when the left-eye and right-eye designated images
are inadvertently reversed during display, viewers often report a strange experience
that they find hard to pinpoint and describe. They would simply say: "Yes, there is
depth, but there is something wrong." This experience of depth, accompanied by the
feeling that something is wrong and by an odd sensation in the eyes, is the detection
of errors in the scene. Thus, one could evaluate conversion quality by examining how
easily pseudo-stereopsis is detected. The assumption is that a reversed real stereo
image pair would be easier to detect than a reversed converted pair. The measure is
therefore the viewers' detection rate: a detection rate of 100 % when the converted
image pair is reversed would indicate that the conversion has been very effective,
whereas a detection rate of 50 % would mean that the conversion was poorly done.
In the latter case, the conversion is such that the human visual system cannot
really tell whether the converted view is intended for the left or the right eye.
Fig. 4.14 Rating scales with different labels for assessing visual comfort (a), (b), and (c) and for
rating image or depth quality of a converted video or film sequence (d)
The overview indicates that the underlying principle for the conversion of 2D
program material to stereoscopic and MV 3D is simple and straightforward. On the
other hand, the strategies and methods that have been proposed and employed to
implement 2D-to-S3D conversion are quite varied. Despite the fact that progress
has been made, there is no one singular strategy that stands out from the rest as
being most effective. Although it seems easy for the human visual system to utilize
the depth information in 2D images, there is still much to be researched in terms of
how best to extract and utilize the pictorial and motion information that is
available in the 2D sources, for transformation into new stereoscopic views for
depth enhancement. In addition, more effective methods that do not require
intensive manual labor are required to fill in newly exposed areas in the
synthesised stereoscopic views. Beyond the study of different strategies and the
details and effectiveness of different methods of S3D conversion technologies, at a
higher level there are more general issues. Importantly, are 2D-to-3D video
conversion technologies here to stay?
Although S3D conversion technologies are useful in producing content to
augment the trickle of stereoscopic material created in the initial stages of
3D-TV deployment, conversion technologies can be exploited for rejuvenating the
vast amount of existing 2D program material as a new source for revenue. For this
reason alone, 2D-to-3D video conversion is going to stay for quite some time
beyond the early stages of 3D-TV. Given that 2D-to-3D video conversion allows
for depth adjustments of video and film contents to be made that follow visual
comfort guidelines, it is highly likely that the 3D conversion process will become
ingrained as one of the tools or stages of the production process for prevention of
visual discomfort. This is true for even source video and film material that has
been originally captured with actual stereoscopic camera rigs. For example, a
stereoscopic scene might have been unwisely captured with a much larger baseline
than typically used, and it is too late to do a re-shoot. The only method to save the
scene might be to create a 2D-to-3D converted version, otherwise the other option
would be to delete the scene altogether (if horizontal shifting of the stereoscopic
image pairs in opposite directions is not an acceptable alternative). As another
example, when a significant number of frames for only one eye are inadvertently
damaged, they must be restored or regenerated. A practical method of restoring the
damaged stereoscopic clip would be to make use of available 3D conversion
technologies. Lastly, DIBR conversion technologies are likely to stay as part of the
post-production tools for repurposing, as an example, of movies that were originally created for a 10-m wide screen for display on a 60 television screen or a
7 9 4 cm mobile device. The same technologies can be artfully used to manipulate objects in depth to optimize the creative desires of the video or film director
for different perceptual effects of personal space, motion, and object size.
Other than the issue of whether 2D-to-3D video conversion is going to stay after
the S3D industry has matured, there has been an ongoing and intense debate as to
whether movies should be originally shot with stereoscopic camera rigs or whether
they should be shot in 2D with the intention of conversion afterwards to 3D as a
post-production process. The main argument for the latter is that the director and
post-production crews can have full control of the 3D effects without surprises of
either visual discomfort or unintended distracting perceptual effects. Another
major advantage is that the production equipment during image capture can be
reduced to that of standard video or film capture, thus providing savings in terms
of human resources and equipment costs. Nevertheless, there have been strong
advocates against 2D-to-3D video conversion of video and film material. This
largely stems from the poor conversions of hastily generated S3D movies that have
been rushed out to meet impractical deadlines. Perceptual artifacts, such as size
(e.g., puppet effect) and depth distortions (e.g., cardboard effect) can be rampant if
the 3D conversion is not done properly. As well, loss of depth details and even
distracting depth errors can turn off not only the ardent S3D movie buff, but also
the general public who might have been unwittingly foxed into expecting more
through aggressive advertisements by stakeholders in the 3D entertainment field.
In summary, 2D-to-3D video conversion can generate outstanding movies and
video program material, but the conversion quality is highly dependent on the
contents and the amount of resources (of both time and money) put into the
process. It is for this very same reason that the quality of conversion for real-time applications is not the same as that for off-line production.
Finally, as discussed under the various approaches and methods on 2D-to-3D
video conversion, there are pros and cons to each methodology used. For example,
the use of predefined scene models can be very useful for reducing computation
time and increasing accuracy when the contents are known in advance. Surrogate
depth maps with heavy filtering can significantly reduce computation time and
simplify rendering, but the conversion quality might not be suitable for demanding
applications, such as cinematic production. Thus, future research requires a novel
combination of the best of different worlds. Importantly, there is a lack of
experimental data on how the human visual system combines the output of the
hypothesized depth processing modules involving depth cues, such as texture,
motion, linear perspective, and familiar size, in arriving at a final and stable
percept of the visual world around us. Identification and selection of the depth cues
that are more stable and relevant for most situations need to be analyzed. Recent
studies have also started investigating the efficacy of utilizing higher level
modules, such as attention, in deriving more reliable and more accurate estimation of depth for conversion of monoscopic to stereoscopic and MV video.
However, the solutions do not appear to be achievable in the foreseeable future.
Even the relatively simple use of motion information for depth perception, which
humans are so good at, is still far from reach in computational-algorithmic form.
In conclusion, future research should not only focus on algorithm implementation
of various conversion methodologies, but also try to better understand human
depth perception to find clues that will enable a much faster, more reliable, and
more stable 2D-to-3D video conversion methodology.
Acknowledgment We would like to express our sincere thanks to Mr. Robert Klepko for
constructive suggestions during the preparation of this manuscript. Thanks are also due to NHK
for providing the Balloons, Tulips, and Redleaf sequences.
References
1. Advanced Television Systems Committee (ATSC) (2011) Final report of the ATSC planning team on 3D-TV. PT1-049r1. Advanced Television Systems Committee (ATSC), Washington DC, USA
2. International Organisation for Standardisation (ISO) (2009) Vision on 3D video. ISO/IEC
JTC1/SC29/WG11 N10357, International Organisation for Standardisation (ISO), Lausanne,
Switzerland
3. Society of Motion Picture and Television Engineers (SMPTE) (2009) Report of SMPTE task
force on 3D to the home. TF3D, Society of Motion Picture and Television Engineers
4. Smolic A, Mueller K, Merkle P, Vetro A (2009) Development of a new MPEG standard for
advanced 3D video applications. TR2009-068, Mitsubishi Electric Research Laboratories,
Cambridge, MA, USA
5. Valentini VI (2011) Legend3D sets the transformers 2D-3D conversion record straight.
In: indiefilm3D. Available at: http://indiefilm3d.com/node/518
6. Tam WJ, Speranza F, Yano S, Ono K, Shimono H (2011) Stereoscopic 3D-TV: visual comfort. IEEE Trans Broadcast 57(2):335–346, part II
7. Kauff P, Atzpadin N, Fehn C, Müller M, Schreer O, Smolic A, Tanger R (2007) Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability. Signal Processing: Image Communication (Special issue on three-dimensional video and television) 22(2):217–234
8. Zhang L, Vázquez C, Knorr S (2011) 3D-TV content creation: automatic 2D-to-3D video conversion. IEEE Trans Broadcast 57(2):372–383
9. Tam WJ, Zhang L (2006) 3D-TV content generation: 2D-to-3D conversion. In IEEE
International Conference on Multimedia and Expo, Toronto, Canada
10. Fehn C (2003) A 3D-TV approach using depth-image-based rendering (DIBR). In: 3rd conference on visualization, imaging and image processing, Benalmadena, Spain
11. Ostnes R, Abbott V, Lavender S (2004) Visualisation techniques: an overview, Part 1. Hydrogr J 113:4–7
12. Shimono K, Tam WJ, Nakamizo S (1999) Wheatstone–Panum limiting case: occlusion, camouflage, and vergence-induced disparity cues. Percept Psychophys 61(3):445–455
13. Ens J, Lawrence P (1993) An investigation of methods for determining depth from focus. IEEE Trans Pattern Anal Mach Intell 15(2):97–107
14. Battiato S, Curti S, La Cascia M, Tortora M, Scordato E (2004) Depth map generation by image classification. Proc SPIE 5302:95–104
15. Hudson W (1967) The study of the problem of pictorial perception among unacculturated groups. Int J Psychol 2(2):89–107
16. Knorr S, Kunter M, Sikora T (2008) Stereoscopic 3D from 2D video with super-resolution capabilities. Signal Process: Image Commun 23(9):665–676
17. Mancini A (1998) Disparity estimation and intermediate view reconstruction for novel
applications in stereoscopic video, McGill University, Canada
18. Scharstein D, Szeliski R (2003) High-accuracy stereo depth maps using structured light. In: IEEE Computer Society conference on computer vision and pattern recognition (CVPR 2003), vol 1, Madison, WI, USA, pp 195–202
19. Yamada K, Suehiro K, Nakamura H (2005) Pseudo 3D image generation with simple depth models. In: International conference on consumer electronics, Las Vegas, NV, pp 277–278
20. Battiato S, Capra A, Curti S, La Cascia M (2004) 3D stereoscopic image pairs by depth-map generation. In: 3D data processing, visualization and transmission, pp 124–131
21. Nedovic V, Smeulders AWM, Redertand A, Geusebroek JM (2007) Depth information
by stage classification. In: International conference on computer vision
22. Yamada K, Suzuki Y (2009) Real-time 2D-to-3D conversion at full HD 1080P resolution. In: 13th IEEE international symposium on consumer electronics, Las Vegas, NV, pp 103–106
23. Huang X, Wang L, Huang J, Li D, Zhang M (2009) A depth extraction method based on
motion and geometry for 2D-to-3D conversion. In: Third international symposium on
intelligent information technology application
24. Jung Y-J, Baik A, Park D (2009) A novel 2D-to-3D conversion technique based on relative height depth cue. In: SPIE conference on stereoscopic displays and applications XX, San Jose, CA, vol 7237, p 72371U
25. Tam WJ, Speranza F, Zhang L (2009) Depth map generation for 3-D TV: importance of edge and boundary information. In: Javidi B, Okano F, Son J-Y (eds) Three-dimensional imaging, visualization and display. Springer, New York, pp 153–181
26. Tam WJ, Yee AS, Ferreira J, Tariq S, Speranza F (2005) Stereoscopic image rendering based on depth maps created from blur and edge information. In: Proceedings of the stereoscopic displays and applications, vol 5664, pp 104–115
27. Tam WJ, Vázquez C, Speranza F (2009) 3D-TV: a novel method for generating surrogate depth maps using colour information. In: SPIE conference on stereoscopic displays and applications XX, San José, USA, vol 7237, p 72371A
28. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV. IEEE Trans Broadcast 51:191–199
29. Ernst FE (2003) 2D-to-3D video conversion based on time-consistent segmentation. In:
Proceedings of the immersive communication and broadcast systems workshop, Berlin,
Germany
30. Chang Y-L, Fang C-Y, Ding L-F, Chen S-Y, Chen L-G (2007) Depth map generation for 2D-to-3D conversion by short-term motion assisted color segmentation. In: IEEE international conference on multimedia and expo, pp 1958–1961
31. Vázquez C, Tam WJ (2010) CRC-CSDM: 2D to 3D conversion using colour-based surrogate depth maps. In: International conference on 3D systems and applications (3DSA 2010), Tokyo, Japan, May 2010
32. Kim J, Baik A, Jung YJ, Park D (2010) 2D-to-3D conversion by using visual attention analysis. In: Proceedings SPIE, vol 7524, p 752412
33. Nothdurft H (2000) Salience from feature contrast: additivity across dimensions. Vis Res 40:1183–1201
34. Rogers B-J, Graham M-E (1979) Motion parallax as an independent cue for depth perception. Perception 8:125–134
35. Ferris S-H (1972) Motion parallax and absolute distance. J Exp Psychol 95(2):258–263
36. Matsumoto Y, Terasaki H, Sugimoto K, Arakawa T (1997) Conversion system of monocular image sequence to stereo using motion parallax. In: SPIE conference on stereoscopic displays and virtual reality systems IV, San Jose, CA, vol 3012, pp 108–112
37. Zhang L, Lawrence B, Wang D, Vincent A (2005) Comparison study of feature matching and block matching for automatic 2D to 3D video conversion. In: 2nd IEEE European conference on visual media production, London, UK, pp 122–129
38. Hartley R, Zisserman A (2000) Multiple view geometry in computer vision. Cambridge
University Press, Cambridge, UK
39. Choi S, Woods J (1999) Motion-compensated 3-D subband coding of video. IEEE Trans Image Process 8(2):155–167
40. Kim MB, Song MS (1998) Stereoscopic conversion of monoscopic video by the transformation of vertical to horizontal disparity. Proc SPIE 3295:65–75
41. Ideses I, Yaroslavsky LP, Fishbain B (2007) Real-time 2D to 3D video conversion. J Real-Time Image Process 2(1):3–7
42. Pourazad M-T, Nasiopoulos P, Ward R-K (2009) An H.264-based scheme for 2D-to-3D video conversion. IEEE Trans Consum Electron 55(2):742–748
43. Pourazad M-T, Nasiopoulos P, Ward R-K (2010) Generating the depth map from the motion information of H.264-encoded 2D video sequence. EURASIP J Image Video Process
44. Kim D, Min D, Sohn K (2008) A stereoscopic video generation method using stereoscopic display characterization and motion analysis. IEEE Trans Broadcast 54(2):188–197
45. Po L-M, Xu X, Zhu Y, Zhang S, Cheung K-W, Ting C-W (2010) Automatic 2D-to-3D video conversion technique based on depth-from-motion and color segmentation. In: IEEE international conference on signal processing, Hong Kong, China, pp 1000–1003
46. Xu F, Er G, Xie X, Dai Q (2008) 2D-to-3D conversion based on motion and color mergence.
In: 3DTV Conference, Istanbul, Turkey
47. Zhang G, Jia J, Wong TT, Bao H (2009) Consistent depth maps recovery from a video sequence. IEEE Trans Pattern Anal Mach Intell 31(6):974–988
48. Chang YL, Chang JY, Tsai YM, Lee CL, Chen LG (2008) Priority depth fusion for 2D-to-3D
conversion systems. In: SPIE Conference on three-dimensional image capture and
applications, San Jose, CA, vol 6805, p 680513
49. Cheng C-C, Li C-T, Tsai Y-M, Chen L-G (2009) Hybrid depth cueing for 2D-to-3D conversion system. In: SPIE conference on stereoscopic displays and applications XX, San Jose, CA, USA, vol 7237, p 723721
50. Chen Y, Zhang R, Karczewicz M (2011) Low-complexity 2D-to-3D video conversion. In:
SPIE Conference on stereoscopic displays and applications XXII, vol 7863, p 78631I
51. Tam WJ, Alain G, Zhang L, Martin T, Renaud R (2004) Smoothing depth maps for improved stereoscopic image quality. In: Three-dimensional TV, video and display III (ITCOM 2004), Philadelphia, PA, vol 5599, p 162
52. Vázquez C, Tam WJ, Speranza F (2006) Stereoscopic imaging: filling disoccluded areas in depth image-based rendering. In: SPIE conference on three-dimensional TV, video and display V, Boston, MA, vol 6392, p 63920D
53. Shimono K, Tam WJ, Speranza F, Vázquez C, Renaud R (2010) Removing the cardboard effect in stereoscopic images using smoothed depth maps. In: Stereoscopic displays and applications XXI, San José, CA, vol 7524, p 75241C
54. Mori Y, Fukushima N, Yendo T, Fujii T, Tanimoto M (2009) View generation with 3D warping using depth information for FTV. Signal Process: Image Commun 24(1–2):65–72
55. Chen W-Y, Chang Y-L, Lin S-F, Ding L-F, Chen L-G (2005) Efficient depth image based rendering with edge dependent filter and interpolation. In: IEEE international conference on multimedia and expo, Amsterdam, The Netherlands
56. International Organization for Standardization / International Electrotechnical Commission (2007) Representation of auxiliary video and supplemental information. ISO/IEC FDIS 23002-3:2007(E), International Organization for Standardization / International Electrotechnical Commission, Lausanne
57. Daly SJ, Held RT, Hoffman DM (2011) Perceptual issues in stereoscopic signal processing. IEEE Trans Broadcast 57(2):347–361
58. Lang M, Hornung A, Wang O, Poulakos S, Smolic A, Gross M (2010) Nonlinear disparity
mapping for stereoscopic 3D. In: ACM SIGGRAPH, Los Angeles, CA
59. Vázquez C, Tam WJ (2008) 3D-TV: coding of disocclusions for 2D+Depth representation of multi-view images. In: Tenth international conference on computer graphics and imaging (CGIM), Innsbruck, Austria
60. Tauber Z, Li Z-N, Drew M-S (2007) Review and preview: disocclusion by inpainting for image-based rendering. IEEE Trans Syst Man Cybern Part C: Appl Rev 37(4):527–540
61. Azzari L, Battisti F, Gotchev A (2010) Comparative analysis of occlusion-filling techniques in depth image-based rendering for 3D videos. In: 3rd workshop on mobile video delivery, Firenze, Italy
62. Criminisi A, Perez P, Toyama K (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Trans Image Process 13:1200–1212
63. Criminisi A, Perez P, Toyama K, Gangnet M, Blake A (2006) Image region filling by exemplar-based inpainting. Patent No: 6,987,520, United States
64. Daribo I, Pesquet-Popescu B (2010) Depth-aided image inpainting for novel view synthesis. In: International workshop on multimedia signal processing, Saint-Malo, France, pp 167–170
65. Gunnewiek R-K, Berrety R-PM, Barenbrug B, Magalhaes J-P (2009) Coherent spatial and
temporal occlusion generation. In: Proceedings SPIE, vol 7237, p 723713
66. Cheng C-M, Lin S-J, Lai S-H (2011) Spatio-temporal consistent novel view synthesis algorithm from video-plus-depth sequences for autostereoscopic displays. IEEE Trans Broadcast 57(2):523–532
67. Holliman NS, Dodgson NA, Favalora GE, Pockett L (2011) Three-dimensional displays: a review and applications analysis. IEEE Trans Broadcast 57(2):362–371
68. Cheng CM, Lin SJ, Lai SH, Yang JC (2008) Improved novel view synthesis from depth image with large baseline. In: International conference on pattern recognition, Tampa, FL
69. Seymour M (2011) Art of stereo conversion: 2D-to-3D. In: fxguide. Available at: http://
www.fxguide.com/featured/art-of-stereo-conversion-2d-to-3d/
70. Boev A, Hollosi D, Gotchev A (2008) Classification of stereoscopic artefacts. Mobile3DTV (Project No. 216503) http://sp.cs.tut.fi/mobile3dtv/results/tech/D5.1_Mobile3DTV_v1.0.pdf. Accessed 22 Jun 2011
71. Yamanoue H, Okui M, Okano F (2006) Geometrical analysis of puppet-theater and cardboard effects in stereoscopic HDTV images. IEEE Trans Circuits Syst Video Technol 16(6):744–752
72. Mendiburu B (2009) Fundamentals of stereoscopic imaging. In: Digital cinema summit, NAB
Las Vegas. Available at: http://www.3dtv.fr/NAB09_3D-Tutorial_BernardMendiburu.pdf
73. Yeh Y-Y, Silverstein LD (1990) Limits of fusion and depth judgment in stereoscopic color displays. Hum Factors: J Hum Factors Ergon Soc 32:45–60
74. Tam WJ, Stelmach LB (1998) Display duration and stereoscopic depth discrimination. Can J Exp Psychol 52(1):56–61
75. International Telecommunication Union (2010) Methodology for the subjective assessment
of the quality of television pictures, ITU-R
76. Tam WJ, Vincent A, Renaud R, Blanchfield P, Martin T (2003) Comparison of stereoscopic and non-stereoscopic video images for visual telephone systems. In: Stereoscopic displays and virtual reality systems X, San José, CA, vol 5006, pp 304–312
Chapter 5
Abstract: With texture and depth data, virtual views are synthesized to produce a
disparity-adjustable stereo pair for stereoscopic displays, or to generate multiple
views required by autostereoscopic displays. View synthesis typically consists of
three steps: 3D warping, view merging, and hole filling. However, simple synthesis
algorithms may yield visual artifacts, e.g., texture flickering, boundary artifacts, and smearing effects, and many efforts have been made to suppress these synthesis artifacts. Some employ spatial/temporal filters to smooth depth maps, which
mitigate depth errors and enhance temporal consistency; some use a cross-check
technique to detect and prevent possible synthesis distortions; some focus on
removing boundary artifacts and others attempt to create natural texture patches for
the disoccluded regions. In addition to rendering quality, real-time implementation
is necessary for view synthesis. So far, the basic three-step rendering process has
been realized in real time through GPU programming and ASIC design.
Y. Zhao (✉) · L. Yu
Department of Information Science and Electronic Engineering, Zhejiang University,
310027 Hangzhou, China
e-mail: zhaoyin@zju.edu.cn
L. Yu
e-mail: yul@zju.edu.cn
C. Zhu
School of Electronic Engineering, University of Electronic Science
and Technology of China, 611731 Chengdu, People's Republic of China
e-mail: eczhu@uestc.edu.cn
5.1 Introduction
View synthesis is an important component of 3D content generation; it is employed
to create, from several real camera views, virtual views that appear as if they were
captured by virtual cameras at different positions, as shown in Fig. 5.1a. A camera
view contains both color and depth information of the captured scene. Based on
the depth information, pixels in a camera view can be projected into a novel
viewpoint, and a virtual view is synthesized. Then, the generated virtual views as
well as the captured camera views will be presented on stereoscopic or multi-view
autostereoscopic displays to visualize the 3D effect.
View synthesis contributes to a more natural 3D viewing experience with stereoscopic displays. New stereo pairs with flexible baseline distances can be formed with
the synthesized views, which enables the disparity range (corresponding to
the intensity of perceived depth) of the displayed stereo videos to be changed. Besides, view synthesis can provide the series of views required by multi-view autostereoscopic displays, which is much more efficient than transmitting all the required views.
View synthesis employs a Depth-Image-Based Rendering (DIBR) technique [2]
which utilizes both texture and depth information to create novel views. Currently,
there are two prevalent texture plus depth data formats, Multi-view Video plus Depth
(MVD) and Layered Depth Video (LDV). MVD includes multiple views of texture
videos and their corresponding depth maps that record the distance of each pixel to
the camera. LDV contains multiple layers of texture plus depth data, as introduced in
Chap. 1. Accordingly, view synthesis procedures for the two types of 3D representations differ slightly. This chapter only covers the MVD-based rendering, and the
view synthesis procedure with LDV data will be elaborated in Chap. 7.
Given M (M ≥ 1) input camera views (also called reference views), a virtual
view can be synthesized through the following three major steps: (1) project pixels
in a reference view to a target virtual view, which is termed 3D warping; (2)
merge pixels projected to the same position in the virtual view from different
reference views (if M ≥ 2), called view merging; and (3) make up the remaining
holes (i.e., positions without any projected pixel) in the virtual view by creating
texture patterns that visually match the neighborhood, known as hole filling. More
details of view synthesis algorithms are provided in Sect. 5.2.
However, the basic DIBR scheme may not guarantee a synthesized view of perfect
quality, especially when errors are present in the depth maps. Depth errors cause the
associated pixels to be warped to wrong positions in the virtual view, yielding geometric
distortions. The most noticeable geometric distortions appear at object boundaries, owing to
error-prone depth data along these areas, and show up as broken edges and a spotted background.
Besides, the appearance and disappearance of synthesis artifacts over time evokes
texture flickering in the virtual view, which greatly degrades visual quality. Moreover,
simple hole filling algorithms using interpolation with the surrounding pixels may fail to
recover the missing texture information in holes, especially at highly textured regions.
To alleviate those rendering artifacts, many algorithms have been proposed to enhance
different stages in the rendering process, which will be reviewed in Sect. 5.3.
View synthesis typically operates at the user end of the 3D-TV system
(as mentioned in Chap. 1), which requires the virtual views to be synthesized in
real time. 3D-TV terminals include TV sets, computers, mobile phones, and so on.
On computer platforms, the GPU has been used to assist the CPU in carrying out parallel
processing for view rendering; for devices without flexible computational capability,
e.g., a TV set (or set-top box), specific hardware accelerators have been designed for
real-time view synthesis. Detailed information is given in Sect. 5.4.
5.2.1 3D Warping
Based on input depth values and predetermined camera models (which are
obtained in the multi-view video acquisition process as mentioned in Chap. 2),
3D warping maps pixels in a reference camera view to a target virtual view [1–3].
Assuming that 3D surfaces in the captured scene exhibit Lambertian reflection, the
virtual-view positions will possess the same color values as their correspondences
in the reference view. For a point P = (x, y, z)ᵀ in 3D space that is projected into both
the reference view and the virtual view, we denote the pair of projection positions
in the image planes of the two cameras as p_r = (u_r, v_r, 1)ᵀ and p_v = (u_v, v_v, 1)ᵀ in
homogeneous notation, respectively, as illustrated in Fig. 5.2. Thus, we have two
perspective projection equations that warp the 3D point in the world coordinate
system into the two camera systems:

z_r p_r = A_r (R_r P + t_r),    (5.1)

z_v p_v = A_v (R_v P + t_v),    (5.2)

where A, R, and t denote the intrinsic matrix, rotation matrix, and translation vector of each camera.
Then, we project the 3D point into the target virtual view and obtain the pixel
position in the virtual camera coordinate system, which is equivalent to substituting Eq. (5.4) into Eq. (5.2). Accordingly, we have

p_v = (1/z_v) A_v [R_v R_r^{-1} (z_r A_r^{-1} p_r − t_r) + t_v].    (5.5)

It is clear that the projection becomes incorrect once the depth value or the
camera parameters are inaccurate.
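A direct transcription of Eqs. (5.1), (5.2) and (5.5) for a single pixel might look as follows. This is only a sketch: in practice the inverse matrices would be precomputed once per frame, and the function and variable names are our own.

```python
import numpy as np

def warp_pixel(p_r, z_r, A_r, R_r, t_r, A_v, R_v, t_v):
    """Warp one reference-view pixel into the virtual view (Eq. 5.5).

    p_r : homogeneous pixel position (u_r, v_r, 1) in the reference image
    z_r : depth of the pixel in the reference camera coordinate system
    A, R, t : intrinsic matrix, rotation, and translation of each camera,
              following the convention z * p = A (R P + t) of Eqs. (5.1)-(5.2).
    Returns the (u_v, v_v) position and the depth z_v in the virtual view.
    """
    # Back-project to the world point P (inverse of Eq. 5.1)
    P = np.linalg.inv(R_r) @ (z_r * np.linalg.inv(A_r) @ p_r - t_r)
    # Re-project into the virtual camera (Eq. 5.2): q = z_v * p_v
    q = A_v @ (R_v @ P + t_v)
    z_v = q[2]
    return q[:2] / z_v, z_v
```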
There is an equivalent way to perform 3D warping, which uses multiple homographies.
Two images of the same plane in space are related through a homography (also
known as a projective transformation). A 3D scene can be sliced into multiple planes,
each with the same distance (i.e., depth value) to the reference camera. For 3D points in
such a plane (called a depth plane), the projected points in the reference view and
the virtual view can be associated via a homography,

p_v = A_v H_{vr,z} A_r^{-1} p_r.    (5.6)

When the virtual camera shares the rotation and the intrinsic parameters of the
reference camera and is displaced from it only by a translation t_v, the warping of
Eq. (5.5) reduces to

p_v = (z_r/z_v) p_r + (1/z_v) A_v t_v,    (5.7)

that is, for a purely horizontal displacement t_v = (t_x, 0, 0)ᵀ,

z_v (u_v, v_v, 1)ᵀ = z_r (u_r, v_r, 1)ᵀ + [f_x 0 o_x; 0 f_y o_y; 0 0 1] (t_x, 0, 0)ᵀ,    (5.8)

where f_x and f_y are the focal lengths and (o_x, o_y) is the principal point of the camera.
The third row of Eq. (5.8) shows that z_v = z_r, so each pixel is simply shifted
horizontally by the disparity

f_x t_x / z_r.    (5.9)
With the 3D warping (also known as forward warping), each pixel in the
original view is projected to a floating point coordinate in the virtual view. Then,
this point is commonly rounded to the nearest position of an integer or a subpixel
sample raster (if subpixel mapping precision is available) [35]. If several pixels
are mapped to the same position on the raster, which implies occlusion, the pixel
that is closest to the camera will occlude the others and be selected for this
position, which is known as the z-buffer method [35]. After the forward warping,
most pixels (typically over 90 % in a narrow baseline case) in a warped view can
be determined, as shown in Fig. 5.3c and d, and the remaining positions on the
image grid without corresponding pixels from the reference view are called holes.
Holes are generated mostly from disocclusion or non-overlap visual field of the
cameras, either of which means some regions in the virtual view are not visible in
the reference view. Thus, the warped reference view lacks the information of the
newly exposed areas. In addition, limited by the precision of pixel position
rounding in forward warping or insufficient sampling rate of the reference views,
some individual pixels may be left blank, typically causing one-pixel-wide cracks.
Moreover, depth errors cause the associated pixels to deviate from their target
positions, which also leaves holes. Holes in warped views are then eliminated by
view merging and hole filling techniques.
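For the rectified, parallel-camera case of Eqs. (5.8)–(5.9), forward warping with a z-buffer reduces to a per-row horizontal shift. A minimal sketch (integer-pixel precision, our own variable names) is given below.

```python
import numpy as np

def forward_warp_1d(texture, depth_z, f_x, t_x):
    """Forward-warp a rectified reference view into a virtual view (Eqs. 5.8-5.9).

    texture : H x W x 3 reference image
    depth_z : H x W metric depth z_r of each pixel (distance to the camera)
    f_x, t_x: horizontal focal length (pixels) and horizontal camera translation
    Returns the warped view and a hole mask.
    """
    h, w = depth_z.shape
    warped = np.zeros_like(texture)
    zbuf = np.full((h, w), np.inf)       # keep the pixel closest to the camera
    holes = np.ones((h, w), dtype=bool)

    disparity = f_x * t_x / depth_z      # Eq. (5.9): horizontal shift per pixel
    for y in range(h):
        for x in range(w):
            xt = int(round(x + disparity[y, x]))
            if 0 <= xt < w and depth_z[y, x] < zbuf[y, xt]:
                warped[y, xt] = texture[y, x]
                zbuf[y, xt] = depth_z[y, x]
                holes[y, xt] = False
    return warped, holes
```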
Fig. 5.3 View synthesis with two-view MVD data. a Texture image of the left reference view.
b Associated depth map of the left reference view (a higher luminance value means that the
object is closer to the camera). c Warped left view (with holes induced by disocclusion and non-overlapping visual fields, marked in white). d Warped right view. e Blended view with remaining
holes. f The final synthesized view after hole filling
Fig. 5.4 Samples of view synthesis artifacts. a The magnified chair leg presents eroded edges
(one type of boundary artifact) and a smearing effect at instant t. Though the leg turns intact in the
following instants, the quality variation over time yields a flicker. The smearing effect,
evident at t and t + 3 while less noticeable at t + 1 and t + 2, also evokes a flicker.
b Background noises around the phone (another type of boundary artifact). c Unnatural synthetic
texture by hole filling in single-view synthesis, also shown as the smearing effect
filling. It has also been shown that a symmetric Gaussian filter causes certain geometric
distortions, in that vertically straight object boundaries become curved, depending on
the depth in the neighboring regions [16]. Accordingly, asymmetric smoothing
with a stronger filtering strength in the vertical direction overcomes the disocclusion
problem while also reducing bent lines. More details can be found in Chap. 6.
Besides, depth maps are often noisy with irregular changes on the same object,
which may cause unnatural-looking pixels in the synthesized view [5]. Smoothing
the depth map with a low-pass filter can suppress the noises and improve the
rendering quality. However, low-pass filtering will blur the sharp depth edges along
object boundaries, which are critical for high-quality view synthesis. Therefore,
the bilateral filter [28], known for its effectiveness in smoothing plain regions while
preserving discontinuities, has been demonstrated to be a superior alternative [5, 13]:

h(x) = (1/k(x)) ∬_D f(ξ) c(ξ, x) s(f(ξ), f(x)) dξ,    (5.11)

k(x) = ∬_D c(ξ, x) s(f(ξ), f(x)) dξ,    (5.12)

where k(x) is the normalization factor and D represents the filtering domain; the
factors c(ξ, x) and s(f(ξ), f(x)) measure the geometric closeness and the photometric
similarity between the neighborhood center x and a nearby point ξ, respectively.
Another edge-preserving filter is proposed in [14], in which the filter coefficient w(ξ, x) is designed as the inverse of the image gradient from the center point
x to the filter tap position ξ. Thus, pixels in the homogeneous region around x are
assigned large weights, while the edge-crossing ones are almost bypassed in
filtering. It shall be noted that complex filtering schemes (e.g., with multiple
iterations to finalize a depth map), though showing superior performance, will
greatly compromise real-time implementation of view synthesis. Therefore, they
are preferred to be deployed in depth production instead of for user-end view
rendering.
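A discrete version of the bilateral filter of Eqs. (5.11)–(5.12), applied to a depth map, might be sketched as below. The Gaussian choices for c and s follow common practice, and the window radius and sigma values are illustrative assumptions of ours.

```python
import numpy as np

def bilateral_filter_depth(depth, radius=5, sigma_s=3.0, sigma_r=10.0):
    """Edge-preserving smoothing of a depth map (discrete form of Eqs. 5.11-5.12).

    c(.) : spatial Gaussian on the distance between pixel positions
    s(.) : range Gaussian on the difference between depth values
    Plain regions are smoothed while sharp depth discontinuities are preserved.
    """
    d = depth.astype(np.float32)
    h, w = d.shape
    out = np.zeros_like(d)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))          # c(xi, x)

    padded = np.pad(d, radius, mode='edge')
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            rng = np.exp(-((window - d[y, x])**2) / (2 * sigma_r**2))  # s(f(xi), f(x))
            weights = spatial * rng
            out[y, x] = np.sum(weights * window) / np.sum(weights)     # Eqs. (5.11)-(5.12)
    return out
```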
To enhance the consistency of the depth maps over time, an adaptive temporal filter has been proposed [15]:

d'(i, n) = ( κ d(i, n) + Σ_{k=1}^{N} a_n(i, n−k) d'(i, n−k) ) / ( κ + Σ_{k=1}^{N} a_n(i, n−k) ),    (5.13)

where d(i, n) and d'(i, n) are the original and enhanced depth values of the i-th pixel
in the n-th frame, respectively; N (N ≥ 1) is the filter order and κ is a constant weight
for the current original depth value; the remaining filter parameters a_n(i, n−k)
(0 ≤ a_n(i, n−k) ≤ 1) are determined as the product of a temporal decay factor
w(n−k) for frame n−k and a temporal stationary probability p_n(i, n−k) for
pixel (i, n−k). The decay factor is introduced on the assumption that the correlation
between two frames decreases with their temporal interval, and the probability
factor increases monotonically with the texture structural similarity (SSIM)
between the corresponding texture patches.
Fig. 5.5 Comparison on temporal consistency of the original and enhanced depth sequences of
Book arrival [37]: a frame 3 of texture sequence at view 9; b, c, and d differences between
frame 2 and 3 of the texture sequence, that of the original depth sequence, and that of the
enhanced depth sequence, respectively (×25 for printing purposes). Fake depth variations in (c)
are eliminated by the temporal filtering. (Reproduced with permission from [15])
Specifically, the stationary probability is modeled as a sigmoid function of the
structural similarity,

p_n(i, n−k) = 1 / (1 + e^{C1 − SSIM_n(i, n−k)/C2}),    (5.14)

where C1 and C2 are two empirical constants. It is assumed that a local area is very
likely to be stationary (i.e., p_n(i, n−k) is near one) if the two corresponding
texture patterns have a high structural similarity (e.g., SSIM_n(i, n−k) > 0.9), and
that an area is probably moving (i.e., p_n(i, n−k) approaches zero) when the
structural similarity is lower (e.g., SSIM_n(i, n−k) < 0.7). Thus, the two constants are
set to C1 = 20 and C2 = 0.04 to make the sigmoid function p_n(i, n−k)
greater than 0.9 when SSIM_n(i, n−k) > 0.9 and smaller than 0.1 when SSIM_n(i, n−k) < 0.7.
The adaptive temporal filter assigns high weights at stationary regions in
adjacent frames, and it significantly enhances temporal consistency of the depth
maps, as shown in Fig. 5.5. Accordingly, flickering artifacts in the synthesized
view are suppressed due to the removal of temporal noise in the depth maps.
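A much-simplified, single-previous-frame sketch of this idea is given below: the current depth map is blended with the previous enhanced one wherever the texture is locally stationary, with the stationarity probability derived from a per-pixel SSIM map. The sigmoid parameters and the one-frame filter order are our own simplifications of the N-frame filter described above, so this is an illustration of the principle rather than the cited method.

```python
import numpy as np
from skimage.metrics import structural_similarity

def temporal_depth_filter(depth_curr, depth_prev_enh, tex_curr, tex_prev,
                          base_weight=1.0, decay=0.8, ssim_gain=20.0, ssim_mid=0.8):
    """Blend the current depth map with the previous enhanced one in stationary areas.

    depth_curr, depth_prev_enh : 2D float arrays (current original / previous enhanced)
    tex_curr, tex_prev         : 2D grayscale texture frames (uint8) used to judge motion
    """
    # Per-pixel structural similarity between the two texture frames
    _, ssim_map = structural_similarity(tex_curr, tex_prev, full=True, data_range=255)
    # Stationarity probability: high where the texture barely changed
    p_stationary = 1.0 / (1.0 + np.exp(-ssim_gain * (ssim_map - ssim_mid)))
    a = decay * p_stationary                 # weight of the previous enhanced depth
    return (base_weight * depth_curr + a * depth_prev_enh) / (base_weight + a)
```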
Fig. 5.6 The Inter-View Cross Check (IVCC) approach [17]. The dashed lines denote the cross
check to determine pixels with unreliable depth values. Afterwards, projections of the unreliable
pixels to the virtual view are withdrawn to avoid synthesis artifacts. (Reproduced with permission
from [22])
(or the left) reference view, and the color difference between the projected pixel and
the corresponding original pixel in the other camera view is checked. A pixel with a
suprathreshold color difference is considered unreliable (e.g., the pixel 1R in
Fig. 5.6), because the color mismatch is probably induced by an incorrect depth
value; otherwise, the pixel is reliable (e.g., the pixel 1L). Finally, virtual-view pixels
projected from unreliable reference-view pixels are discarded from the warped
images, i.e., to withdraw all unreliable projections to the virtual view.
Though IVCC improves view synthesis quality significantly [17], it still has
some limitations. First, the color difference threshold in the cross check must be
large enough to accommodate illumination differences between the original
views or color distortions due to video coding; otherwise, many pixels will be
wrongly treated as unreliable owing to a small threshold, which may in turn result
in a great number of holes in the warped view. Thus, the IVCC method is unable to
detect the unreliable pixels below the color difference threshold. In addition, at
least two views are required for the cross check, which is not applicable in view
synthesis with a single view input.
In addition to intelligently detecting and excluding unreliable pixels in view
synthesis, IVCC can also benefit view merging [18]. Conventional view blending
employs the weight for each view mainly based on the baseline distance between
the original and the virtual views [5], as mentioned in Sect. 5.2.2. With the reliability check, depth quality can be inferred from the number of wrong projections.
It is common with MVD data that one view's depth map is more accurate
than another's, since the better depth map is generated with user assistance or even
manual editing. Therefore, it is advisable in view blending to assign a higher
weight to pixels from the more reliable view [18].
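The cross check itself amounts to projecting a pixel from one reference view into the other and thresholding the resulting color difference. A simplified, rectified-camera sketch in the spirit of [17] is shown below; the threshold value and the names are our own assumptions.

```python
import numpy as np

def cross_check_reliability(tex_left, tex_right, depth_left, f_x, baseline,
                            color_thresh=20.0):
    """Mark left-view pixels whose depth values look unreliable (IVCC-style check).

    A left pixel is projected into the right view using its depth; if the color at
    the projected position differs too much, the depth value is probably wrong and
    the pixel should not be warped into the virtual view. Rectified views are
    assumed, so the projection is a horizontal shift (cf. Eq. 5.9).
    """
    h, w = depth_left.shape
    reliable = np.zeros((h, w), dtype=bool)
    disparity = f_x * baseline / depth_left          # shift from left to right view
    for y in range(h):
        for x in range(w):
            xr = int(round(x - disparity[y, x]))     # corresponding column in the right view
            if 0 <= xr < w:
                diff = np.abs(tex_left[y, x].astype(np.float32)
                              - tex_right[y, xr].astype(np.float32)).mean()
                reliable[y, x] = diff < color_thresh
    return reliable
```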
Fig. 5.7 a Illustration of boundary artifacts in the warped view due to incorrect depth values of
some foreground pixels [22]. (FG: foreground, BG: background, H: hole). Pixels in areas
a and c are misaligned with background depth values. After warping, foreground area a (or c) is
separated from the foreground and is projected to background area b (or d) due to the incorrect
background depth values, which yields background noises as well as foreground erosion [22].
b Illustration of the Background Contour Region Replacement (BCRR) method [23]. Background
contour region with a predetermined width is replaced by pixels from the other view, and
background noises in b and e are eliminated. Failures (e.g., area f) occur when the background
noises are beyond the background contour region. (Reproduced with permission from [22])
erosion artifacts, as shown in Fig. 5.8a. Several techniques for boundary artifact
removal have been developed, which are reviewed as follows.
Fig. 5.8 Samples of synthesized images of Art [31] with depth maps generated by DERS [32].
From left to right: (1) basic view synthesis by VSRS 1D mode [3], (2) with BCRR [23], (3) with
IVCC [17], and (4) with SMART [22]. BCRR and IVCC clean part of the background noise
but ignore the foreground erosion. SMART tackles both foreground erosion and background
noise, making the boundary of the pen straighter and smoother. (Reproduced with permission
from [22])
foreground object, whereas the corresponding areas in the other warped view are
prone to be free from distortions. Thus, the background contour regions are
intentionally replaced by more reliable pixels from the other view, and most
background noises are eliminated (e.g., area b and e in Fig. 5.8b). Similar
explanation and solution appear in [6]. Limitations of BCRR are clear: (1) it fails
to clean the background noises beyond the predefined background contour regions
(e.g., area f in Fig. 5.8b); and (2) foreground erosion artifacts are ignored.
Fig. 5.9 a Boundary artifacts due to texture-depth misalignment at object boundaries can be
reduced by SMART [22]. In this example of a left original view, foreground misalignment (FM)
and color transition (CT) regions are associated with background depth values. Accordingly, the
pixels in the two regions are erroneously warped to the background, yielding foreground erosion
and background noise. SMART corrects the depth values in the FM to enforce foreground
alignment, while it suppresses the warping of the unreliable pixels in the CT. As a result, the
wrong pixel projections are eliminated, and foreground boundaries are kept intact. b The
framework of the SMART algorithm. (Reproduced with permission from [22])
the idea of using directional image inpainting [25] to recover patterns of holes in a
non-homogeneous background region. Some algorithms [27] also utilize spatial
similarity for texture synthesis. It is assumed that the background texture patterns
may be duplicated (e.g., wallpaper), and thus it is possible to find a similar patch
for the hole at a nearby background region. Accordingly, the best continuation
patch out of all the candidate patches is obtained by minimizing a cost function of
texture closeness.
There is another approach to hole filling: the disoccluded region missing in the
current frame may be found in another frame [26, 27]. This is plausible when the scene is
recorded by a fixed camera, such that regions occluded by a moving foreground object
will appear at another time instant. Thus, the background information is accumulated
over time to build up a complete background layer (or sprite), which will be copied
to complement the missing texture in the holes of a virtual view.
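A minimal sketch of this temporal background-accumulation idea is given below; it assumes a fixed camera and uses a simple depth threshold (our own choice) to decide what counts as background.

```python
import numpy as np

class BackgroundSprite:
    """Accumulate background texture over time for later hole filling.

    For a fixed camera, pixels whose depth indicates background are copied into a
    persistent layer; regions hidden by a moving foreground object at one instant
    can then be recovered from frames in which they were visible.
    """
    def __init__(self, height, width, bg_depth_thresh=64):
        self.sprite = np.zeros((height, width, 3), dtype=np.uint8)
        self.known = np.zeros((height, width), dtype=bool)
        self.bg_depth_thresh = bg_depth_thresh   # depth-map values below this = background

    def update(self, texture, depth):
        bg = depth < self.bg_depth_thresh        # larger depth value = closer (MVD convention)
        self.sprite[bg] = texture[bg]
        self.known |= bg

    def fill_holes(self, virtual_view, holes):
        """Copy accumulated background into hole positions of a synthesized view."""
        usable = holes & self.known
        out = virtual_view.copy()
        out[usable] = self.sprite[usable]
        return out
```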
Fig. 5.11 Reducing energy of synthesis color error with error feedback on smooth texture. Color
intensity of pixel A is denoted as I(A). With an inaccurate depth value, pixel A is warped to B and
C in the virtual view and right view, respectively. From the right view, the color error due to
position deviation is known. Assuming smooth texture varies almost linearly, the corresponding
color error in the virtual view can be estimated. More accurate color value is then obtained by
adding the estimated error
pixels are associated with unreliable depth values. Accordingly, holes in warped
views are reduced due to fewer pixels being discarded in warping.
of the view synthesis algorithm in VSRS has been proposed [36]. The chip design
tackles the high computational complexity by substituting a hardware-friendly
bilinear hole-filling algorithm for the image inpainting in VSRS, as well as by
employing multiple optimization techniques. As a result, the hardware view
synthesis engine achieves a throughput of more than 30 frames per second for
HD1080p sequences. Interested readers may further refer to Chap. 3, which
elaborates on the real-time implementation of view synthesis on FPGA.
Applications always highlight the balance between performance and cost.
The aforementioned implementations, whether by GPU assistance or ASIC design,
target the basic view synthesis framework. Synthesis quality can be further
improved by integrating the enhancement techniques introduced in Sect. 5.3, at the
cost of reduced rendering speed due to the added computational complexity.
However, for portable devices with constrained power and processing capability,
the complexity should be carefully kept within a reasonable range, and some
power-consuming algorithms may not be appropriate for such application
scenarios (e.g., 3D-TV on mobile phones).
5.5 Conclusion
View synthesis is the last component of 3D content generation, producing the multi-view
images required by 3D displays. Deployed at 3D-TV terminals, the synthesis task needs
to be completed in real time. The basic view synthesis framework, which consists of 3D
warping, view merging, and hole filling, can fulfill the real-time constraint through
techniques such as GPU acceleration or dedicated circuits. However, this simple
scheme is prone to yield visual distortions in the synthesized view, e.g., flickering and
boundary artifacts. Accordingly, many enhancement algorithms have been developed
to adapt the different components of the synthesis framework, which significantly suppress
visual artifacts and improve the synthesis quality.
View synthesis is changing from a passive standard three-step process into an
intelligent engine with versatile analysis tools. It may not always trust the input depth
information and simply transform the pixels into a virtual view; instead, advanced
synthesis schemes may examine texture and depth information to realize reliable
projections and to minimize or eliminate warping errors. Besides, they may derive
more natural texture patterns for missing samples in the virtual view by further
considering color patches and foreground-background relationships. On
the other hand, different frames are rendered individually and independently by the
conventional scheme. New engines are expected to make good use of texture and
depth information over time for more powerful and dynamic view synthesis with
the highest rendering quality.
Acknowledgement The authors would like to thank Middlebury College, Fraunhofer Institute
for Telecommunications Heinrich Hertz Institute (HHI), and Philips for kindly providing the
multi-view images and the Book_arrival and Mobile sequences. This work is partially supported by
the National Basic Research Program of China (973) under Grant No. 2009CB320903 and the
Singapore Ministry of Education Academic Research Fund Tier 1 (AcRF Tier 1 RG7/09).
References
1. Mark WR, McMillan L, Bishop G (1997) Post-rendering 3D warping. In: Proceedings of the symposium on interactive 3D graphics, Providence, Rhode Island, Apr 1997, pp 7–16
2. Fehn C (2003) A 3D-TV approach using depth-image-based rendering (DIBR). In: Proceedings of visualization, imaging and image processing (VIIP), pp 482–487
3. Tian D, Lai P, Lopez P, Gomila C (2009) View synthesis techniques for 3D video. In: Proceedings of applications of digital image processing XXXII, vol 7443, pp 74430T-1–11
4. Müller K, Smolic A, Dix K, Merkle P, Kauff P, Wiegand T (2008) View synthesis for advanced 3D video systems. EURASIP J Image Video Process, vol 2008, Article ID 438148
5. Mori Y, Fukushima N, Yendo T, Fujii T, Tanimoto M (2009) View generation with 3D warping using depth information for FTV. Sig Process Image Commun 24(1–2):65–72
6. Zinger S, Do L, de With PHN (2010) Free-viewpoint depth image based rendering. J Vis Commun Image Represent 21:533–541
7. Domanski M, Gotfryd M, Wegner K (2009) View synthesis for multiview video transmission. In: International conference on image processing, computer vision, and pattern recognition, Las Vegas, USA, Jul 2009, pp 13–16
8. Bertalmio M, Sapiro G, Caselles V, Ballester C (2000) Image inpainting. In: Proceedings of ACM conference on computer graphics (SIGGRAPH), New Orleans, LA, Jul 2000, pp 417–424
9. Criminisi A, Perez P, Toyama K (2004) Region filling and object removal by exemplar-based inpainting. IEEE Trans Image Process 13(9):1200–1212
10. Telea A (2004) An image inpainting technique based on the fast marching method. J Graph Tools 9(1):25–36
11. Oh K, Yea S, Ho Y (2009) Hole-filling method using depth based in-painting for view synthesis in free viewpoint television (FTV) and 3D video. In: Proceedings of the picture coding symposium (PCS), Chicago, pp 233–236
12. Zhao Y, Yu L (2010) A perceptual metric for evaluating quality of synthesized sequences in 3DV system. In: Proceedings of visual communications and image processing (VCIP), Jul 2010
13. Daribo I, Saito H (2010) Bilateral depth-discontinuity filter for novel view synthesis. In: Proceedings of IEEE international workshop on multimedia signal processing (MMSP), Saint-Malo, France, Oct 2010, pp 145–149
14. Park JK, Jung K, Oh J, Lee S, Kim JK, Lee G, Lee H, Yun K, Hur N, Kim J (2009) Depth-image-based rendering for 3DTV service over T-DMB. J Vis Commun Image Represent 24(1–2):122–136
15. Fu D, Zhao Y, Yu L (2010) Temporal consistency enhancement on depth sequences. In: Proceedings of the picture coding symposium (PCS), Dec 2010, pp 342–345
16. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV. IEEE Trans Broadcast 51(2):191–199
17. Yang L, Yendo T, Tehrani MP, Fujii T, Tanimoto M (2010) Artifact reduction using reliability reasoning for image generation of FTV. J Vis Commun Image Represent 21:542–560, Jul–Aug 2010
18. Yang L, Yendo T, Tehrani MP, Fujii T, Tanimoto M (2010) Error suppression in view synthesis using reliability reasoning for FTV. In: Proceedings of 3DTV conference (3DTV-CON), Tampere, Finland
19. Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vision 47(1–3):7–42
20. Vandewalle P, Gunnewiek RK, Varekamp C (2010) Improving depth maps with limited user input. In: Proceedings of the stereoscopic displays and applications XXI, vol 7524
21. Fieseler M, Jiang X (2009) Registration of depth and video data in depth image based rendering. In: Proceedings of 3DTV conference (3DTV-CON), pp 1–4
22. Zhao Y, Zhu C, Chen Z, Tian D, Yu L (2011) Boundary artifact reduction in view synthesis of 3D video: from perspective of texture-depth alignment. IEEE Trans Broadcast 57(2):510–522
23. Lee C, Ho YS (2008) Boundary filtering on synthesized views of 3D video. In: Proceedings of the international conference on future generation communication and networking symposia, Sanya, pp 15–18
24. Zitnick CL, Kang SB, Uyttendaele M, Winder S, Szeliski R (2004) High-quality video view interpolation using a layered representation. In: Proceedings of ACM SIGGRAPH, pp 600–608
25. Daribo I, Pesquet-Popescu B (2010) Depth-aided image inpainting for novel view synthesis. In: Proceedings of IEEE international workshop on multimedia signal processing (MMSP)
26. Schmeing M, Jiang X (2010) Depth image based rendering: a faithful approach for the disocclusion problem. In: Proceedings of 3DTV conference, pp 1–4
27. Ndjiki-Nya P, Köppel M, Doshkov D, Lakshman H, Merkle P, Müller K, Wiegand T (2011) Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Trans Multimedia 13(3):453–465
28. Tomasi C, Manduchi R (1998) Bilateral filtering for gray and color images. In: Proceedings of IEEE international conference on computer vision, pp 839–846
29. Wang Z, Sheikh HR (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):1–14
30. Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 8(6):679–698
31. Middlebury Stereo Vision Page (2007) Available: http://vision.middlebury.edu/stereo/
32. Tanimoto M, Fujii T, Tehrani MP, Suzuki K, Wildeboer M (2009) Depth estimation reference software (DERS) 3.0. ISO/IEC JTC1/SC29/WG11 Doc. M16390, Apr 2009
33. Tong X, Yang P, Zheng X, Zheng J, He Y (2010) A sub-pixel virtual view synthesis method for multiple view synthesis. In: Proceedings of the picture coding symposium (PCS), Dec 2010, pp 490–493
34. Furihata H, Yendo T, Tehrani MP, Fujii T, Tanimoto M (2010) Novel view synthesis with residual error feedback for FTV. In: Proceedings of stereoscopic displays and applications XXI, vol 7524, Jan 2010, pp 75240L-1–12
35. Shin H, Kim Y, Park H, Park J (2008) Fast view synthesis using GPU for 3D display. IEEE Trans Consum Electron 54(4):2068–2076
36. Rogmans S, Lu J, Bekaert P, Lafruit G (2009) Real-time stereo-based view synthesis algorithms: a unified framework and evaluation on commodity GPUs. Sig Process Image Commun 24(1–2):49–64
37. Feldmann I et al (2008) HHI test material for 3D video. ISO/IEC JTC1/SC29/WG11 Doc. M15413, Apr 2008
38. Schechner YY, Kiryati N, Basri R (2000) Separation of transparent layers using focus. Int J Comput Vis 39(1):25–39
39. Zhao Y (2011) Depth no-synthesis-error model for view synthesis in 3-D video. IEEE Trans Image Process 20(8):2221–2228
Chapter 6
I. Daribo (&)
Division of Digital Content and Media Sciences,
National Institute of Informatics, 2-1-2 Hitotsubashi,
Chiyoda-ku 101-8430, Tokyo, Japan
e-mail: daribo@nii.ac.jp
H. Saito
Department of Information and Computer Science,
Keio University, Minato, Tokyo, Japan
e-mail: saito@hvrl.ics.keio.ac.jp
I. Daribo · R. Furukawa · S. Hiura · N. Asada
Faculty of Information Sciences,
Hiroshima City University,
Hiroshima, Japan
e-mail: ryo-f@hiroshima-cu.ac.jp
S. Hiura
e-mail: hiura@hiroshima-cu.ac.jp
N. Asada
e-mail: asada@hiroshima-cu.ac.jp
depth video before DIBR, while for larger disocclusions an inpainting approach is
proposed to retrieve the missing pixels by leveraging the given depth information.
Keywords 3D warping · Bilateral filter · Contour of interest · Depth-image-based rendering · Disocclusion · Distance map · Hole filling · Multi-view video · Patch matching · Priority computation · Smoothing filter · Structural inpainting · Texture inpainting · View synthesis
6.1 Introduction
A well-suited 3D video data representation and its multi-view extension are known
respectively as video-plus-depth and multi-view video-plus-depth (MVD).
They provide regular 2D videos enriched with their associated depth data. The 2D
video provides the texture information, the color intensity, whereas the depth video
represents the per-pixel Z-distance between the camera and a 3D point in the visual
scene. With the recent evolution of acquisition technologies, including 3D depth cameras
(time-of-flight and Microsoft Kinect) and multi-camera systems, to name a few,
depth-based systems have gained significant interest, particularly in terms
of view synthesis approaches. In particular, the depth-image-based rendering (DIBR)
technique is recognized as a promising tool for supporting advanced 3D video services,
by synthesizing novel views from either the video-plus-depth data representation or its
multi-view extension. Let us distinguish two scenarios: (1) generate a
second shifted view from one reference viewpoint, or (2) synthesize any desired intermediate
view from at least two neighboring reference viewpoints for free viewpoint
scene observation. The problem, however, is that every pixel does not necessarily
exist in every view, which results in the occurrence of holes when a novel view is
synthesized from another one. View synthesis then exposes the parts of the scene that
are occluded in the reference view and makes them visible in the targeted views. This
process is known as disocclusion and is a consequence of the occlusion of points in the
reference viewpoint, as illustrated in Fig. 6.1.
One solution would be to increase the number of captured camera viewpoints so as to make every
point visible from at least one captured viewpoint. For example, in Fig. 6.1 the point
B4 is visible from neither camera cam1 nor camera cam2. However, that increases
the amount of captured data to process, transmit, and render. This chapter gives
more attention to the single video-plus-depth scenario; more details on
the multi-view case can be found in Chap. 5. Another solution may consist in relying on more
complex multi-dimensional data representations, like the layered depth video (LDV) data
representation, which allows storing additional depth and color values for pixels that are
occluded in the original view. This extra data provides the information needed
to fill in disoccluded areas in the synthesized views. This means,
however, increasing the overhead complexity of the system. In this chapter, we first
investigate two camera configurations: small and large baseline, i.e., small and large
distance between the cameras. The baseline affects the disocclusion size: the larger the
Fig. 6.1 Stereo configuration wherein not all pixels are visible from all camera viewpoints. For
example, when transferring the view from camera cam2 to cam1, points B2 and B3 are (a) occluded
in camera cam2, and (b) disoccluded in camera cam1
baseline is, the bigger the disocclusions become. We then address the disocclusion
problem through a framework that consists of two strategies at different places in the
DIBR flowchart: (1) disocclusion removal is achieved by applying a low-pass filter to
preprocess the depth video before DIBR, and (2) the synthesized view is postprocessed to
fill in larger missing areas with plausible color information. The process of filling in
the disocclusions is also known as hole filling.
This chapter starts by introducing the general formulation of the 3D image
warping view synthesis equations in Sect. 6.2. Afterwards, the disocclusion problem
is discussed along with related works.
Section 6.3 introduces a prefiltering framework based on the local properties of
the depth map to remove the discontinuities that provoke the aforementioned disocclusions. Those discontinuities are identified and smoothed through an adaptive filter.
The recovery problem of the larger disoccluded regions is addressed in Sect. 6.4.
To this end, an inpainting-based postprocessing of the warped image is proposed.
Moreover, a texture and structure propagation process improves the novel view
quality and preserves object boundaries.
Fig. 6.2 3D image warping: projection of a 3D point on two image planes in homogeneous
coordinates
used to project $I_1$ into the second view $I_2(u_2, v_2)$ with the given depth data
$Z(u_1, v_1)$. Conceptually, the 3D image warping process can be separated into two
steps: a backprojection of the reference image into the 3D world, followed by a
projection of the backprojected 3D scene into the targeted image plane [1]. If we
look at the pixel location $(u_1, v_1)$ first, a backprojection per pixel is performed
from the 2D reference camera image plane $I_1$ to the 3D-world coordinates. Next, a
second projection is performed from the 3D world to the image plane $I_2$ of the
target camera at pixel location $(u_2, v_2)$, and so on for each pixel location.
To perform these operations, three quantities are needed: $K_1$, $R_1$, and $t_1$, which
denote the $3 \times 3$ intrinsic matrix, the $3 \times 3$ orthogonal rotation matrix, and
the $3 \times 1$ translation vector of the reference view $I_1$, respectively. The 3D-world
backprojected point $M = (x, y, z)^T$ is expressed in non-homogeneous coordinates as
$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = R_1^{-1} K_1^{-1} \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} \lambda_1 - R_1^{-1} t_1, \qquad (6.1)$$
where $\lambda_1$ is a positive scaling factor. Looking at the target camera quantities
$K_2$, $R_2$, and $t_2$, the backprojected 3D-world point $M = (x, y, z)^T$ is then mapped
into the targeted 2D image coordinates $(u'_2, v'_2, w'_2)^T$ in homogeneous coordinates as
$$\begin{pmatrix} u'_2 \\ v'_2 \\ w'_2 \end{pmatrix} = K_2 R_2 \begin{pmatrix} x \\ y \\ z \end{pmatrix} + K_2 t_2. \qquad (6.2)$$
We can therefore express the targeted coordinates as a function of the reference
coordinates by
$$\begin{pmatrix} u'_2 \\ v'_2 \\ w'_2 \end{pmatrix} = K_2 R_2 R_1^{-1} K_1^{-1} \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} \lambda_1 - K_2 R_2 R_1^{-1} t_1 + K_2 t_2. \qquad (6.3)$$
It is common to attach the world coordinate system to the first camera system, so
that $R_1 = I_3$ and $t_1 = 0_3$, which simplifies Eq. 6.3 into
$$\begin{pmatrix} u'_2 \\ v'_2 \\ w'_2 \end{pmatrix} = K_2 R_2 K_1^{-1} \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} \lambda_1 + K_2 t_2. \qquad (6.4)$$
In the final step, the homogeneous result is converted into a pixel location as
$$(u_2, v_2) = \left( \frac{u'_2}{w'_2},\ \frac{v'_2}{w'_2} \right). \qquad (6.5)$$
Note that $z$ is the third component of the 3D-world point $M$;
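A compact NumPy sketch of the per-pixel warping of Eqs. 6.1–6.5 might look as follows, under the usual simplification that the world frame is attached to the reference camera so that the scaling factor equals the stored depth; the function name and array conventions are illustrative, not part of the chapter.

```python
import numpy as np

def warp_to_virtual_view(Z, K1, K2, R2, t2):
    """Per-pixel 3D warping (Eqs. 6.1-6.5) with the world frame attached to camera 1,
    i.e. R1 = I and t1 = 0, so the scaling factor equals the depth Z(u1, v1).

    Z  : (H, W) depth map of the reference view, in metric units
    K1 : (3, 3) intrinsics of the reference camera
    K2, R2, t2 : intrinsics / rotation / translation of the target camera
    Returns the target pixel coordinates (u2, v2) for every reference pixel."""
    H, W = Z.shape
    v1, u1 = np.mgrid[0:H, 0:W]                      # pixel grid of the reference view
    ones = np.ones_like(Z, dtype=np.float64)
    pix = np.stack([u1, v1, ones], axis=-1).reshape(-1, 3).T   # 3 x (H*W) homogeneous pixels

    # Eq. 6.4: m2' = K2 R2 K1^{-1} m1 * lambda1 + K2 t2, with lambda1 = Z(u1, v1)
    M = np.linalg.inv(K1) @ pix * Z.reshape(1, -1)   # backprojected 3D points (camera-1 frame)
    m2 = K2 @ (R2 @ M) + (K2 @ t2.reshape(3, 1))     # projection into the target camera

    # Eq. 6.5: convert the homogeneous result to pixel locations
    u2 = (m2[0] / m2[2]).reshape(H, W)
    v2 = (m2[1] / m2[2]).reshape(H, W)
    return u2, v2
```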
Fig. 6.3 In magenta: example of disoccluded regions (from the ATTEST test sequence
Interview). (a) reference texture (b) reference depth (c) synthesized view
intensity variation in color space, and in consistent boundaries between texture and
depth images.
Nevertheless, after depth preprocessing, larger disocclusions may still remain,
which requires a next stage that consists in interpolating the missing values. To
this end, an average filter has been commonly used [3]. The average filter, however, does
not preserve the edge information of the interpolated area, which results in obvious
artifacts in highly textured areas. Mark et al. [10] and Zhan-wei et al. [11]
proposed merging two warped images provided by two different spatial or
temporal viewpoints, where the pixels in the reference images are processed in an
occlusion-compatible order. In a review of inpainting techniques, Tauber et al.
argue that inpainting can provide an attractive and efficient framework to fill the
disocclusions with texture and structure propagation, and should be integrated with
image-based rendering (IBR) techniques [12].
Although these methods are effective to some extent, there still exist
problems such as the degradation of non-disoccluded areas in the depth map, the
depth-induced distortion of warped images, and undesirable artifacts in inpainted
disoccluded regions. To overcome these issues we propose an adaptive depth map
preprocessing that operates mainly on the edges, and an inpainting-based postprocessing that uses depth information in the case of a large distance between the
cameras.
$$(Z * g_{2D})(u, v) = \sum_{x=-w/2}^{w/2} \; \sum_{y=-h/2}^{h/2} Z(u - x, v - y)\, g_{2D}(x, y), \qquad (6.6)$$
where $w$ and $h$ denote the filter window width and height, and the two-dimensional approximation of the discrete Gaussian function $g_{2D}$ is
separable into $x$ and $y$ components, and expressed as follows:
$$g_{2D}(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-\frac{x^2}{2\sigma_x^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-\frac{y^2}{2\sigma_y^2}}, \qquad (6.7)$$
where $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian
distributions along the horizontal and vertical directions, respectively. In the case
of a symmetric Gaussian distribution, Eq. 6.7 is updated to
$$g_{2D}(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}, \quad \text{where } \sigma = \sigma_x = \sigma_y. \qquad (6.8)$$
Nonetheless, the asymmetric nature of the distribution may help reduce the
geometric distortion in the warped picture, by applying a stronger
smoothing in one direction.
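As an illustration of Eqs. 6.6–6.8, the separable (and possibly asymmetric) Gaussian prefiltering of the depth map can be reproduced with an off-the-shelf filter; the standard deviation values below are placeholders, not the chapter's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_depth(Z, sigma_x=10.0, sigma_y=30.0):
    """Asymmetric Gaussian prefiltering of a depth map (Eqs. 6.6-6.8).

    Different standard deviations realize the asymmetric case, i.e. stronger
    smoothing in one direction than in the other."""
    # scipy expects sigmas in (row, column) = (vertical, horizontal) order
    return gaussian_filter(Z.astype(np.float32), sigma=(sigma_y, sigma_x))
```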
$$\tilde{Z}(s) = \frac{1}{k(s)} \sum_{p \in \Omega} f(p, s)\, g(Z_p, Z_s)\, Z_p, \qquad (6.9)$$
with the normalization term
$$k(s) = \sum_{p \in \Omega} f(p, s)\, g(Z_p, Z_s). \qquad (6.10)$$
In practice, a discrete Gaussian function is used for the spatial filter $f$ in the spatial
domain and for the range filter $g$ in the intensity domain, as follows:
$$f(p, s) = e^{-\frac{d(p, s)^2}{2\sigma_d^2}}, \quad \text{with } d(p, s) = \|p - s\|_2, \qquad (6.11)$$
$$g(Z_p, Z_s) = e^{-\frac{(Z_p - Z_s)^2}{2\sigma_r^2}}, \qquad (6.12)$$
Fig. 6.5 Example of contours of interest (CI) from the depth map discontinuities. (a) depth map,
(b) CI
where $\sigma_d$ and $\sigma_r$ are the standard deviations of the spatial filter $f$ and the range filter
$g$, respectively. The filter extent is then controlled by these two input parameters.
Therefore, a bilateral filter can be considered as a product of two Gaussian filters,
where the value at a pixel location $s$ is computed as a weighted average of its
neighbors, with a spatial component that favors close pixels and a range component
that penalizes pixels with different intensity.
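A direct, unoptimized implementation of the bilateral filter of Eqs. 6.9–6.12 could look like the following sketch; the window radius and the two standard deviations are placeholder values.

```python
import numpy as np

def bilateral_filter_depth(Z, radius=5, sigma_d=3.0, sigma_r=10.0):
    """Brute-force bilateral filtering of a depth map Z (Eqs. 6.9-6.12).

    sigma_d controls the spatial Gaussian f, sigma_r the range Gaussian g."""
    Z = Z.astype(np.float64)
    H, W = Z.shape
    out = np.zeros_like(Z)
    # precompute the spatial weights f(p, s) on the local window
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    f = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma_d ** 2))
    Zp = np.pad(Z, radius, mode='edge')
    for y in range(H):
        for x in range(W):
            win = Zp[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            g = np.exp(-((win - Z[y, x]) ** 2) / (2.0 * sigma_r ** 2))  # range weights
            w = f * g
            out[y, x] = np.sum(w * win) / np.sum(w)                     # Eqs. 6.9-6.10
    return out
```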
Hysteresis is used to track the more relevant pixels along the contours. Hysteresis uses two
thresholds: if the magnitude is below the low threshold, the pixel is set to zero (made a non-edge);
if the magnitude is above the high threshold, it is made an edge; and if the magnitude lies between
the two thresholds, it is set to zero unless the pixel is located near an edge detected by the high
threshold.
Fig. 6.6 Example of a distance map derived from the contours of interest (CI). (a) CI, (b) distance
map
according to the distance from the closest CI. The distance information then has to
be computed (Fig. 6.5).
We define the distance map $D$ of the depth map $Z$, with respect to the given input
CI, by the following function:
$$D(u, v) = \min_{p \in \mathrm{CI}} d\big(Z(u, v),\, p\big). \qquad (6.14)$$
It is possible to take into account the spatial propagation of the distance and
compute it successively from neighboring pixels within a reasonable computing
time, with an average complexity linear in the number of pixels. The propagation
of the distance relies on the assumption that it is possible to deduce the distance of
For a given $k$, the distance between two pixel locations $(u_1, v_1)$ and $(u_2, v_2)$ can be defined from $|u_1 - u_2|^k + |v_1 - v_2|^k$.
a pixel location from the value of its neighbors, which fits well both sequential and
parallel algorithms. One example of a distance map is shown in Fig. 6.6.
$$\frac{\min\{D(u, v),\, D_{\max}\}}{D_{\max}}, \qquad (6.15)$$
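The quantity in Eq. 6.15 can be read as a normalized distance-based weight. A minimal sketch, assuming the CI mask is given as a boolean image, using SciPy's Euclidean distance transform for Eq. 6.14, and treating the use of the result as a smoothing weight as our own assumption (the value of D_max and the function names are illustrative):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_weight(ci_mask, d_max=40.0):
    """Distance map D(u, v) to the closest contour-of-interest pixel (Eq. 6.14)
    and the normalized quantity min(D, D_max) / D_max (Eq. 6.15).

    ci_mask : boolean array, True on contour-of-interest (CI) pixels."""
    # distance_transform_edt measures the distance to the nearest zero entry,
    # so CI pixels are encoded as zeros
    D = distance_transform_edt(~ci_mask)
    w = np.minimum(D, d_max) / d_max
    return D, w
```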
Fig. 6.7 Example of different depth prefiltering strategies. (a) original, (b) all-smoothing,
(c) CI-dependent
Fig. 6.8 Error comparison between the original depth map and the preprocessed one. Gray color
corresponds to a no-error value. (a) all-smoothing, (b) CI-dependent
Fig. 6.9 Novel synthesized images after depth preprocessing. (top) Disocclusion removal is
achieved by depth preprocessing. (bottom) Right-side information is preserved and vertical lines
do not bend, due to better depth data preservation. (a) original, (b) all-smoothing, (c) CI-dependent
mean time the disoccluded regions are removed in the warped image, as we can see in
Fig. 6.9. In addition, the main gain of the adaptive approach also comes from the
conservation of the non-disoccluded regions. Indeed, by introducing the concept of
CI, any unnecessary filtering-induced distortion is limited.
6.3.5.1 PSNR Comparison
Let us consider the original depth map $Z$ and the preprocessed depth map $\tilde{Z}$, and
also the warped pictures $I_{\mathrm{virt}}$ and $\tilde{I}_{\mathrm{virt}}$, obtained respectively from $Z$ and $\tilde{Z}$. In order
to measure the filtering-induced distortion in the depth map and in the warped
pictures, we define two objective peak signal-to-noise ratio (PSNR) measurements
that take place before and after the 3D image warping (see Fig. 6.10), as follows:
$$\mathrm{PSNR}_{\mathrm{depth}} = \mathrm{PSNR}(Z, \tilde{Z}) \quad \text{and} \quad \mathrm{PSNR}_{\mathrm{virt}} = \mathrm{PSNR}(I_{\mathrm{virt}}, \tilde{I}_{\mathrm{virt}})\big|_{D \setminus (O \cup \tilde{O})},$$
where $D \subset \mathbb{N}^2$ is the discrete image support, and $O$ and $\tilde{O}$ are the disocclusion
image supports of the 3D image warping using $Z$ and $\tilde{Z}$, respectively.
$\mathrm{PSNR}_{\mathrm{depth}}$ is calculated between the original depth map and the filtered one.
Hence, $\mathrm{PSNR}_{\mathrm{depth}}$ only considers the filtering-induced artifacts in the depth map. It does not,
however, reflect the overall quality of the synthesized view. $\mathrm{PSNR}_{\mathrm{virt}}$ is then
calculated between the view warped with the original depth map and the view
warped with the preprocessed one. In order not to introduce the warping-induced
distortion into the $\mathrm{PSNR}_{\mathrm{virt}}$ measurement, $\mathrm{PSNR}_{\mathrm{virt}}$ is computed only on the non-disoccluded
areas $D \setminus (O \cup \tilde{O})$.
We can observe in Fig. 6.11 the significant quality improvement obtained with
the proposed method. Subjectively, one can also notice less degradation in
the reconstructed images, owing to the fact that the proposed method preserves more
details in the depth map.
Fig. 6.11 PSNR comparison (a) depth video (b) synthesized view
broadly classified as structural inpainting or textural inpainting. Structural inpainting reconstructs using prior assumptions about the smoothness of structures in
the missing regions and boundary conditions, while textural inpainting considers
only the available data from texture exemplars or other templates in non-missing
regions. Initially introduced by Bertalmio et al. [16], structural inpainting uses
either isotropic diffusion or more complex anisotropic diffusion to propagate
boundary data in the isophote direction, together with prior assumptions about the
smoothness of structures in the missing regions. Textural inpainting considers a
statistical or template knowledge of patterns inside the missing regions, commonly
modeled by Markov random fields (MRF). Thus, Levin et al. suggest in [17]
extracting some relevant statistics about the known part of the image and combining
them in an MRF framework. Besides spatial image inpainting, other works that
combine both spatial and temporal consistency can be found in the literature [18, 19].
In this chapter, we start from the work of Criminisi et al. [20], in which they attempted to
Fig. 6.12 Removing large objects from photographs using Criminisi's inpainting algorithm.
(a) Original image, (b) the target region (10 % of the total image area) has been blanked out,
(c–e) intermediate stages of the filling process, (f) the target region has been completely filled
and the selected object removed (from [20])
$$P(p) = C(p)\,D(p), \qquad (6.16)$$
where $C(p)$ is the confidence term that indicates the reliability of the current patch,
and $D(p)$ is the data term that gives special priority to the isophote direction.
These terms are defined as follows:
$$C(p) = \frac{1}{|\Psi_p|} \sum_{q \in \Psi_p \cap \Phi} C(q) \quad \text{and} \quad D(p) = \frac{|\nabla^{\perp} I_p \cdot n_p|}{\alpha}, \qquad (6.17)$$
where $|\Psi_p|$ is the area of $\Psi_p$ (in terms of the number of pixels within the patch $\Psi_p$), $\Phi$ is the
source (non-missing) region, $\alpha$ is a normalization factor (e.g., $\alpha = 255$ for a typical gray-level image), $n_p$ is a unit
vector orthogonal to $\delta\Omega$ at point $p$, and $\nabla^{\perp} = (-\partial_y, \partial_x)$ is the direction of the
isophote. $C(q)$ represents the percentage of non-missing pixels in the patch $\Psi_p$ and is
set at initialization to $C(q) = 0$ for missing pixels in $\Omega$, and $C(q) = 1$ everywhere else.
Once all the priorities on $\delta\Omega$ are computed, a block-matching algorithm
derives the best exemplar $\Psi_{\hat{q}}$ to fill in the missing pixels under the highest-priority
patch $\Psi_{\hat{p}}$, previously selected, as follows:
$$\Psi_{\hat{q}} = \arg\min_{\Psi_q \subset \Phi} d(\Psi_{\hat{p}}, \Psi_q), \qquad (6.18)$$
where $d(\cdot, \cdot)$ is the distance between two patches, defined as the sum of squared
differences (SSD). After finding the optimal source exemplar $\Psi_{\hat{q}}$, the value of each
pixel-to-be-filled $\hat{p} \in \Psi_{\hat{p}} \cap \Omega$ is copied from its corresponding pixel in $\Psi_{\hat{q}}$. After
the patch $\Psi_{\hat{p}}$ has been filled, the confidence term $C(p)$ is updated as follows:
$$C(p) = C(\hat{p}), \quad \forall p \in \Psi_{\hat{p}} \cap \Omega. \qquad (6.19)$$
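As a rough illustration of Eqs. 6.16–6.17 (not a faithful reimplementation of Criminisi's original code), the confidence and data terms for a boundary pixel could be computed as in the following sketch, assuming a gray-level image, square patches, and finite-difference gradients; all names and default values are placeholders.

```python
import numpy as np

def priority(I, known, p, half=4, alpha=255.0):
    """Priority P(p) = C(p) * D(p) of Eq. 6.16 for a boundary pixel p = (y, x).

    I     : gray-level image (float)
    known : boolean mask of non-missing pixels (the source region)
    half  : half-size of the square patch Psi_p."""
    y, x = p
    patch_known = known[y - half:y + half + 1, x - half:x + half + 1]
    C = patch_known.sum() / patch_known.size             # confidence term (Eq. 6.17)

    # isophote direction: image gradient rotated by 90 degrees, evaluated at p
    gy, gx = np.gradient(I)
    isophote = np.array([-gx[y, x], gy[y, x]])
    # boundary normal n_p: gradient of the known-region mask at p
    my, mx = np.gradient(known.astype(float))
    n = np.array([my[y, x], mx[y, x]])
    n = n / (np.linalg.norm(n) + 1e-8)
    D = abs(isophote @ n) / alpha                         # data term (Eq. 6.17)
    return C * D
```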
of the texture matching, and then a trilateral filter utilizes the spatial and depth
information to filter the depth map, thus enhancing the view synthesis quality. In a
similar study, Oh et al. [22] proposed replacing the foreground boundaries with
background ones located on the opposite side. They intentionally manipulated the
disocclusion boundary so that it only contained pixels from the background;
existing inpainting techniques are then applied. Based on these works, we propose a
depth-aided texture inpainting method using the principles of Criminisi's algorithm
that gives background pixels higher priority than foreground ones.
6.4.2.1 Priority Computation
Given the associated depth patch $Z_p$ in the targeted image plane, in our definition of
priority computation, we suggest weighting the previous priority computation of
Eq. 6.16 by adding a third multiplicative term:
$$P(p) = C(p)\,D(p)\,L(p), \qquad (6.20)$$
where $L(p)$ is the level regularity term, defined as the inverse variance of the depth
patch $Z_p$:
$$L(p) = \frac{|Z_p|}{|Z_p| + \sum_{q \in \Psi_p \cap \Phi} (Z_q - \bar{Z}_q)^2}, \qquad (6.21)$$
where $|Z_p|$ is the area of $Z_p$ (in terms of the number of pixels) and $\bar{Z}_q$ is the mean depth value.
We then give more priority to the patch that lies at the same depth level,
which naturally favors background pixels over foreground ones.
6.4.2.2 Patch Matching
Considering the depth information, we update Eq. 6.18 as follows:
$$\Psi_{\hat{q}} = \arg\min_{\Psi_q \subset \Phi} \big[\, d(\Psi_{\hat{p}}, \Psi_q) + \beta\, d(Z_{\hat{p}}, Z_q) \,\big], \qquad (6.22)$$
where the block-matching algorithm is processed in both the texture and depth domains
through the parameter $\beta$, which allows us to control the importance given to the depth
distance minimization. By updating the distance measure, we favor the search for
patches located at the same depth level, which naturally makes more sense.
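Under the same simplified assumptions, the level regularity term of Eq. 6.21 and the depth-aided patch distance of Eq. 6.22 might be sketched as follows; the value of beta, the patch representation, and the handling of unknown pixels are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def level_regularity(Zp, known_p):
    """Level regularity term L(p) of Eq. 6.21: derived from the inverse variance of the
    depth patch Z_p over its known pixels, so that patches lying on a single depth
    level (typically background) receive a higher priority."""
    z = Zp[known_p].astype(np.float64)
    area = Zp.size
    return area / (area + np.sum((z - z.mean()) ** 2))

def patch_distance(tex_p, tex_q, Zp, Zq, known_p, beta=0.5):
    """Depth-aided block-matching distance of Eq. 6.22: texture SSD plus a
    depth SSD weighted by beta, both computed on the known pixels of Psi_p."""
    d_tex = np.sum((tex_p[known_p] - tex_q[known_p]) ** 2)
    d_depth = np.sum((Zp[known_p] - Zq[known_p]) ** 2)
    return d_tex + beta * d_depth
```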
Fig. 6.14 Recovery example of large disocclusions. (a) large disocclusions (b) conventional
inpainting (c) depth-aided inpainting
supplied with the sequences. The depth video provided for each camera was
estimated via a color-based segmentation algorithm [24]. The Ballet sequence
shows two dancers at two different depth levels. Due to the large baseline
between the central camera and the side cameras, more disocclusions appear
during the 3D warping process.
Criminisi's algorithm [20] has been utilized to demonstrate the necessity of
considering the boundaries of the disocclusions. When inpainting from
boundary pixels located in different parts of the scene (i.e., foreground and
background), it can be observed that the foreground part of the scene suffers from
strong geometric deformations, as shown in Fig. 6.14b. As we can see in
Fig. 6.14a, the disocclusion boundaries belong both to the foreground and to the background
parts of the scene, which makes conventional inpainting methods less efficient.
Comparing the two methods clearly demonstrates that the proposed
depth-based framework better preserves the contours of foreground objects and can
enhance the visual quality of the inpainted images. This is achieved by prioritizing
the propagation of texture and structure from background regions, whereas
conventional inpainting techniques, such as Criminisi's algorithm, make no
distinction between boundaries. The objective performance is investigated through the
quality-loss curves plotted in Fig. 6.15, measured by the PSNR; a significant
quality improvement can be observed.
allows processing differently the disocclusion boundaries located in either the foreground or the background part of the scene. Clearly indicating which disocclusion
contour is close to the object of interest and which one lies in the background
neighborhood significantly improves the inpainting algorithm in this context.
Acknowledgments This work is partially supported by the National Institute of Information and
Communications Technology (NICT), the Strategic Information and Communications R&D Promotion
Programme (SCOPE) No. 101710002, Grant-in-Aid for Scientific Research No. 21200002 in Japan,
the Funding Program for Next Generation World-Leading Researchers No. LR030 (Cabinet Office,
Government of Japan) in Japan, and the Japan Society for the Promotion of Science (JSPS) Program
for Foreign Researchers.
References
1. McMillan L Jr (1997) An image-based approach to three-dimensional computer graphics. PhD thesis, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
2. Tam WJ, Alain G, Zhang L, Martin T, Renaud R (2004) Smoothing depth maps for improved stereoscopic image quality. In: Proceedings of the SPIE international symposium ITCOM on three-dimensional TV, video and display III, Philadelphia, USA, vol 5599, pp 162–172
3. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV. IEEE Trans Broadcast 51(2):191–199, Jun 2005
4. Chen W-Y, Chang Y-L, Lin S-F, Ding L-F, Chen L-G (2005) Efficient depth image based rendering with edge dependent depth filter and interpolation. In: Proceedings of the IEEE international conference on multimedia and expo (ICME), pp 1314–1317, 6–8 Jul 2005
5. Daribo I, Tillier C, Pesquet-Popescu B (2007) Distance dependent depth filtering in 3D warping for 3DTV. In: Proceedings of the IEEE workshop on multimedia signal processing (MMSP), Crete, Greece, pp 312–315, Oct 2007
6. Lee S-B, Ho Y-S (2009) Discontinuity-adaptive depth map filtering for 3D view generation. In: Proceedings of the 2nd international conference on immersive telecommunications (IMMERSCOM), ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium
7. Tomasi C, Manduchi R (1998) Bilateral filtering for gray and color images. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 839–846
8. Cheng C-M, Lin S-J, Lai S-H, Yang J-C (2008) Improved novel view synthesis from depth image with large baseline. In: Proceedings of the international conference on pattern recognition, Tampa, FL, USA, pp 1–4, Dec 2008
9. Gangwal OP, Berretty RP (2009) Depth map post-processing for 3D-TV. In: Digest of technical papers, international conference on consumer electronics (ICCE), pp 1–2, Jan 2009
10. Mark WR, McMillan L, Bishop G (1997) Post-rendering 3D warping. In: Proceedings of the symposium on interactive 3D graphics (SI3D), ACM Press, New York, pp 7–16
11. Zhan-wei L, Ping A, Su-xing L, Zhao-yang Z (2007) Arbitrary view generation based on DIBR. In: Proceedings of the international symposium on intelligent signal processing and communication systems (ISPACS), pp 168–171
12. Tauber Z, Li Z-N, Drew MS (2007) Review and preview: disocclusion by inpainting for image-based rendering. IEEE Trans Syst Man Cybern Part C Appl Rev 37(4):527–540
13. Ge Y, Fitzpatrick JM (1996) On the generation of skeletons from discrete Euclidean distance maps. IEEE Trans Pattern Anal Mach Intell 18(11):1055–1066
14. Fehn C, Schüür K, Feldmann I, Kauff P, Smolic A (2002) Distribution of ATTEST test sequences for EE4 in MPEG 3DAV. ISO/IEC JTC1/SC29/WG11, Doc. M9219, Dec 2002
Chapter 7
Abstract The technology around 3D-TV is evolving rapidly. Different stereo displays are already
available, and auto-stereoscopic displays promise 3D without glasses in the near future. All of the
commercially available content today is purely image-based. Depth-based content, on the other hand,
provides better flexibility and scalability regarding future 3D-TV requirements and, in the long
term, is considered to be a better alternative for 3D-TV production. However, depth
estimation is a difficult process, which threatens to become the main bottleneck in
the whole production chain. Sophisticated depth-based formats such as LDV (layered depth video)
or MVD (multi-view video plus depth) already exist, but no reliable production techniques
for these formats are available today. Usually camera systems consisting of multiple color
cameras are used for capturing. These systems, however, rely on stereo matching for depth
estimation, which often fails in the presence of repetitive patterns or textureless regions. Newer,
hybrid systems offer a better alternative here. Hybrid systems incorporate active sensors
in the depth estimation process and make it possible to overcome the difficulties of standard
multi-camera systems. In this chapter, a complete production chain for the 2-layer
LDV format, based on a hybrid camera system of 5 color cameras and 2 time-of-flight
cameras, is presented. It includes real-time preview capabilities for
quality control during the shooting, and post-production algorithms to generate
high-quality LDV content consisting of foreground and occlusion layers.
Keywords 3D-TV · Alignment · Depth estimation · Depth image-based rendering (DIBR) · Foreground layer · Grab-cut · Hybrid camera system · Layered depth video (LDV) · LDV compliant capturing · Multi-view video plus depth (MVD) · Occlusion layer · Post-production · Bilateral filtering · Stereo matching · Thresholding · Time-of-flight (ToF) camera · Warping
7.1 Introduction
The 3D-TV production pipeline consists of three main blocks: content acquisition,
transmission, and display. While all of the blocks are important and the display
technology certainly plays a crucial role in user acceptance, content acquisition
is currently the most problematic part of the whole pipeline.
Most of the content today is shot with standard stereo systems of two cameras,
often combined through a mirror system to provide a narrow baseline, and is purely
image-based. While relatively easy to produce, such content does not
fulfill all the 3D-TV requirements of today and will certainly not fulfill the
requirements of future 3D-TV. The main drawbacks are its poor scalability
with respect to different display types and its inflexibility with respect to modifications. For
example, content shot for a big cinema display cannot easily be adapted to a
small display in the home environment without distorting or destroying the 3D
effect. This makes it necessary to consider all possible target geometries during the
shooting, for example, by using several baselines, one for the cinema and one for
the home environment. Such practices, however, make the content acquisition more
complex and expensive. Also, the 3D effect cannot easily be changed after capturing. It is, for example, not possible to adapt the content to conform to the
viewer's personal settings, such as the distance to the screen or the eye distance. Another
problem is that the content can only be viewed on a stereo display. It is not
possible to view the content on a multi-view display, because only two views are
available. The naive solution would be to shoot the content with as many cameras
as views required, but different displays require different numbers of views, so that
such content also would not scale well. Additionally, the complexity and cost of
shooting with a multi-camera rig grow with the number of cameras, which makes this
approach infeasible for real production.
An alternative to purely image-based content is depth-based content, where
additional information in the form of depth images, which describe the scene geometry, is
used. With this information, virtual views can be constructed from the existing ones
using depth image-based rendering methods [1–3]. This makes this kind of content
independent of the display type and offers flexibility for later modifications. One can
produce a variable number of views, as required by the specific displays, and render these
views according to user preferences and display geometry. Depth estimation, however,
is a difficult problem. Traditional capture systems use multiple color cameras and rely
on stereo matching [4] for depth estimation. However, although much progress was
achieved in this area in recent years, stereo matching still remains unreliable in the presence
of repetitive patterns and textureless regions, and it is slow if acceptable quality
is required. Hybrid systems [5, 6], on the contrary, use additional active sensors for
depth measurement. Such sensors are, for instance, laser scanners or time-of-flight
(ToF) cameras. While laser scanners are restricted to static scenes only, the operational
principle of ToF cameras allows the handling of dynamic scenes, which makes them very
suitable for 3D-TV production. ToF cameras measure the distance to the scene points by
emitting infrared light and determining the time this light needs to travel to the
object and back to the camera. Different types of ToF cameras are
available; some are based on pulse measurement and some on the correlation between the
emitted and reflected light. The current resolution of ToF cameras varies between
64 × 48 and 204 × 204 pixels. ToF cameras measure depth from a single point
of view. Therefore, the accuracy of the depth measurement does not depend on the
resolution of the cameras or on a baseline between the cameras, as is the case with stereo
matching. In addition to the depth images, ToF cameras provide a reflectance image,
which can be used to estimate the intrinsic and extrinsic camera parameters, allowing a
straightforward integration of ToF cameras into a multi-camera setup. However, the
operational range of ToF cameras is limited, for example, to about 7.5 m for correlating cameras, so that a 3D-TV system using time-of-flight cameras is typically
limited to indoor scenarios. ToF cameras also suffer from low resolution and a low
signal-to-noise ratio, so that additional processing is required for high-quality results.
For more information on ToF cameras refer to [7–9].
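As a small worked example of the correlation principle, the measured per-pixel phase shift can be converted to a distance as d = c·φ/(4π·f_mod), with an unambiguous range of c/(2·f_mod), which indeed evaluates to about 7.5 m at a 20 MHz modulation frequency; the sketch below assumes a phase image given in radians and a 20 MHz modulation, both illustrative assumptions.

```python
import numpy as np

C = 299_792_458.0          # speed of light in m/s

def phase_to_depth(phase, f_mod=20e6):
    """Convert a per-pixel phase image (radians) of a correlation ToF camera to depth.

    d = c * phase / (4 * pi * f_mod); the unambiguous range is c / (2 * f_mod),
    i.e. about 7.5 m for f_mod = 20 MHz."""
    return C * phase / (4.0 * np.pi * f_mod)

print(f"unambiguous range at 20 MHz: {C / (2 * 20e6):.2f} m")   # ~7.49 m
```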
One of the first European research projects demonstrating the use of depth-based
content in 3D-TV was ATTEST [1, 10]. It was also one of the first
projects to use a ToF camera (ZCam) in a 3D-TV capture system. The main goal of
ATTEST was to establish a complete production chain based on a video format
consisting of one color and one corresponding depth image (single video plus depth).
However, with single video plus depth, no proper occlusion handling could be
performed, which results in a decreasing quality of rendered views with increasing
view distance.
As a follow-up, the European project 3D4YOU [11] focused on developing a
3D-TV production chain for the more sophisticated LDV [11–15] format, extending
the work of ATTEST. The MVD [11–16] format and conversion techniques from
MVD to LDV were also investigated. Both formats are straightforward extensions of
the single video plus depth format but are conceptually quite different. MVD represents the scene from different viewpoints, each consisting of a color and a corresponding depth image. LDV represents the scene from a single viewpoint, the
so-called reference viewpoint. It consists of multiple layers representing different
parts of the scene, each consisting of a color and a corresponding depth image. The
first layer of the LDV format is called the foreground layer. All successive layers
are called occlusion layers and represent the parts of the scene which are
uncovered by rendering a previous layer to the new viewpoint. While multiple
layers are possible, two are in most cases sufficient to produce high-quality results.
Therefore, in practice a 2-layer LDV format, consisting of one foreground and one
occlusion layer, is used.
Fig. 7.1 Picture of the camera system: (left) real picture (right) schematic representation.
(Reproduced with permission from [5])
Fig. 7.2 (Left) A schematic representation of the ToF cameras with the reference camera in the
center: on the left, with parallel aligned optical axes, there is no full coverage of the reference camera view;
on the right, with the ToF cameras rotated outwards, full coverage of the reference camera view is achieved. (Right)
A diagram representing the workflow of the system. (Reproduced with permission from [5])
196
Fig. 7.3 Image of the reference camera, together with the ToF depth images. Both ToF cameras
are rotated outwards to provide optimal scene coverage for the view of the reference camera.
(Reproduced with permission from [5])
are contained in this constellation: (C1, T1, C2), (C3, T2, C4), (C1, C3), (C2, C4),
(C1, C5, C4) and (C2, C5, C3). These baselines can be used to combine ToF depth
measurements with traditional stereo matching techniques.
Due to different baseline orientations, ambiguities in stereo matching can be
resolved better. The system was designed with the modules symmetrically around the
reference camera so that the occlusion information in the left and right direction can
be estimated by a symmetric process. This allows generation of virtual views from
the LDV content in both directions (like in the Philips WoW-display).
During the shooting one has to concentrate on the reference camera only,
whereby other cameras play only a supportive role. This makes the 3D shooting
comparable to a standard 2D shooting.
To use the captured data for LDV generation, the camera system has to be
calibrated. Due to the bad signal-to-noise ratio, small resolution, and systematic
depth measurement errors, ToF cameras are difficult to calibrate with traditional
calibration methods. To provide a reliable calibration, the approach from [9] can be
used, where ToF cameras are calibrated together with the color cameras, incorporating depth measurements in the calibration process.
The camera system is operated from a single PC with an Intel(R) Core(TM) i7-860
CPU and 8 GB RAM. The cameras C1, C2, C3, and C4 are connected through a
FireWire interface and deliver images in Bayer pattern format. The reference
camera C5 is connected through the HD-SDI interface and delivers images in
YUV format. The ToF cameras are connected through USB 2.0 and provide phase
images, from which depth images can be calculated in an additional step.
The system's processing architecture is described in Fig. 7.2 (right). It consists
of active elements in the form of arrows (processes) and passive elements, such as
image buffers, which are used for communication between the active elements.
Data processing starts with image acquisition, where the data from all cameras are
captured in parallel through multiple threads at 25 frames/s, which results in a
total amount of ca. 300 MB/s. Following the image acquisition, two processes run
in parallel: the image conversion process and the save-to-disk process. The
image conversion process converts the color images to RGB and the phase images
from the ToF cameras to depth images. The save-to-disk process is responsible for
storing the captured data permanently. To handle the large throughput of data and
ensure the required frame rate, a solid-state drive (OCZ Z-Drive p84) of 256 GB
capacity, connected over an x8 PCI-Express interface, is used, with a writing speed of up
to 640 MB/s. This allows capturing ca. 14 min of content in a single shot. The
filtering process is an optional step. If the filtering process is running, a number of
predefined filters are applied to selected image streams before the images are copied to
the preview buffer. Due to the bad signal-to-noise ratio of the PMD cameras, the
filtering step is important for further processing. Currently a 3 × 3 or 5 × 5
median filter can be applied to the ToF depth images, but more sophisticated
filtering techniques are possible, for example, filtering over time [18, 19]. The
median filter also runs in parallel to ensure real-time performance. In the last step,
the data are transferred through the GPUProcessing process to the GPU memory, where multiple shaders are run for preview generation. Currently a GeForce
GTX 285 is used as the GPU. The whole system is capable of capturing and
simultaneously providing a 3D preview at 25 frames/s.
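The optional filtering step can be reproduced, for instance, with a small median filter applied to each ToF depth frame; the function name is ours, and the filter size is one of the two values mentioned above.

```python
from scipy.ndimage import median_filter

def filter_tof_depth(depth, size=3):
    """Apply a 3x3 (or 5x5) median filter to a ToF depth image to suppress
    the strong noise before it is copied to the preview buffer."""
    return median_filter(depth, size=size)
```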
Fig. 7.4 Overview over the preview system. (Reproduced with permission from [5].)
7.3.1 Warping
The purpose of warping is to transform the left and right ToF depth images into the
view of the reference camera. Changing the viewpoint, however, causes new
regions to become visible, the disocclusion regions, where no depth information is
available in the reference view. When warping both ToF cameras to the reference
view, such regions appear on the left and right sides of the foreground objects: on
the right side for the left ToF camera and on the left side for the right ToF camera (Fig. 7.5).
As one can clearly see from Fig. 7.5, both ToF cameras complement each other in
the overlap area, so that the disocclusion regions caused by the warping of the left
ToF image can be partially filled with depth information from the right ToF
image and vice versa. For warping the ToF depth images to the view of the
reference camera, the method from [5, 7] is applied, where several depth images
can be warped to the view of the target camera by a 3D mesh warping technique.
Special care has to be taken when combining the different meshes from the left
and right ToF cameras. The problem with warping two meshes from different
viewpoints simultaneously is that the disocclusions caused by the viewpoint
change of one of the ToF cameras are filled linearly, because of the triangle mesh
properties, and cannot be filled correctly by the other ToF camera, causing
annoying rubber band effects on the object boundaries, see Fig. 7.6 (right). While
the artifacts may seem minor, they become worse with shorter distance to
Fig. 7.5 ToF depth images warped to the reference camera. Left, the ToF depth image warped from
the left ToF camera; right, the ToF depth image warped from the right ToF camera. (Reproduced
with permission from [5])
Fig. 7.6 Combined ToF depth images warped to the view of the reference camera, left, with
mesh cutting and right, without mesh cutting. Notice the rubber band effects when no mesh cutting is
performed. (Reproduced with permission from [5])
the camera and are difficult to control. To avoid this, the mesh is cut at object
boundaries in a geometry shader before the transformation, by calculating the
normal for each triangle and discarding the triangle if the angle between the
normal and the viewing direction of the ToF camera exceeds a predefined
threshold.
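The cutting criterion can be expressed compactly: compute the normal of each depth-mesh triangle and discard the triangle when its angle to the ToF viewing direction exceeds a threshold. The CPU-side sketch below only illustrates the idea (the system itself does this in a geometry shader), and the threshold value is a placeholder.

```python
import numpy as np

def keep_triangles(vertices, triangles, view_dir, max_angle_deg=75.0):
    """Mesh cutting at object boundaries: discard triangles whose normal deviates
    from the ToF viewing direction by more than max_angle_deg.

    vertices  : (N, 3) 3D points, triangles : (M, 3) vertex indices,
    view_dir  : unit viewing direction of the ToF camera."""
    v0, v1, v2 = (vertices[triangles[:, i]] for i in range(3))
    n = np.cross(v1 - v0, v2 - v0)
    n /= (np.linalg.norm(n, axis=1, keepdims=True) + 1e-12)
    cos_angle = np.abs(n @ view_dir)                 # orientation-independent
    return triangles[cos_angle >= np.cos(np.radians(max_angle_deg))]
```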
Fig. 7.7 (Left) Depth image after background extrapolation. (Right) Depth image from left after
low-pass filtering. (Reproduced with permission from [5])
To reduce this artifact, the depth image can be low-pass filtered, as proposed in [3].
In [3] the authors use a Gaussian filter. In the proposed system an asymmetric box
filter is used, because it provides results of sufficient quality while having lower
computational complexity. The results of filtering the depth image after the
interpolation step with an asymmetric box filter of size 20 × 30 are shown in
Fig. 7.7 (right). These data are fed to the rendering stage as a 1-layer LDV preview
and used to assess the quality of depth acquisition during shooting.
Fig. 7.8 Data provided by the hybrid camera system. Notice that the modules provide not only
color and depth for the foreground but also for the occlusion layer
instead first warped to the corresponding color cameras in the modules. The high-resolution color images are then used to refine the depth discontinuities of the warped
depth images, which is described in more detail in Sect. 7.5. This step has two
reasons. First, the module color cameras have a much shorter distance to the ToF
cameras than the reference camera. Therefore, errors due to the viewpoint change
are smaller and thus easier to correct. Secondly, through the alignment of color and
depth, more consistent occlusion layer information can be created.
To create the occlusion layer, data for three virtual views are created, consisting
of color and corresponding depth images (Fig. 7.9 (left)). The original reference
camera view is used as the central virtual view. Thereby the color image of the
reference camera is used directly, while the corresponding depth image is constructed from the refined depth images of both side modules. The left virtual view
is constructed from the refined depth images of the left-side module and the right
virtual view from the refined depth images of the right-side module. The transformation of depth from a module's corner cameras to one of the virtual views
is done by the 3D warping technique described in Sect. 7.3. The corresponding
color is transformed through backward mapping, from the virtual views to the
corner cameras. All virtual views have the same internal camera parameters as the
reference camera, i.e., resolution, aspect ratio, and so on. Notice that the left and
right virtual views are incomplete. This is due to the different viewing frustums of the
reference and the side module cameras. However, all the necessary information for
Fig. 7.9 (Left) Three virtual views. The depth image from the central virtual view is warped to
the left and right virtual view to get the occlusion information. Black regions in the warped depth
images correspond to disoccluded areas. (Right) Generated LDV Frame, with central virtual view
as foreground layer and occlusion layer cut out from the left and right virtual images and
transformed back to the central view
Fig. 7.10 Views rendered from the LDV frame from Fig. 7.9 (right). Center view is the original
view, left and right views are purely virtual. Notice that the right virtual view shows an object
held by the person in the background, which is completely occluded by the foreground person in
the original view
occlusion layer generation is contained. Depth and color are also well aligned by
means of the previously applied refinement step. Before further processing, the depth
for the central virtual view is again aligned to the corresponding color images,
yielding the foreground layer of the LDV frame (Fig. 7.9 (right)). To create the
occlusion layer, the depth map from the foreground layer is warped to the left and
right virtual views, see Fig. 7.9 (left). By doing so, the disocclusion regions are
revealed. However, due to the viewpoint difference between C1–C4 and C5, all the
necessary information is now available in the left and right virtual views. The
relevant depth and texture information can simply be masked out by the disocclusion areas coded in black and transformed to the view of the reference camera.
This information is then combined into the occlusion layer. Note that when rendering a
virtual view from the LDV frame, the reverse operation will be performed, so that
the disocclusions will be filled with the correct texture information.
Figure 7.9 (right) shows the full LDV frame constructed with the proposed method.
From this LDV frame, multiple views can then be rendered along the baseline.
Figure 7.10 shows three views rendered from the LDV frame in Fig. 7.9 (right).
whereby $N$ is the neighborhood defined on a pixel grid and $\hat{C}(d; x, y)$ the updated
cost value. To reduce the filtering complexity, a separable approximation of the
bilateral filtering may be used [25]. Figure 7.12 (left) shows the result after
refinement of the depth image through the bilateral filter approach. To the right of the
refined depth image is the corresponding color image with the background removed
(black). The background was removed by applying a threshold to the pixel
depth values: all pixels with a depth value below the threshold were considered
background and marked black. The bilateral filtering approach provides good results
(compare the result to the originally warped depth image), but it has several limitations.
Due to the local nature of the bilateral filter, some oversmoothing artifacts may
appear on the object boundaries if the color contrast is poor. Another problem is
that fine details, once lost, cannot be recovered. For example, notice that the arms
of the chair are still missing after refinement. Another problem occurs during the
processing of video sequences: if all images in a sequence are processed separately,
no temporal consistency can be guaranteed, which can result in annoying
flickering artifacts.
Fig. 7.12 (Left) Refined depth image, (Right) color image segmented through depth thresholding. (Reproduced with permission from [5])
Fig. 7.13 (Left) Depth image corrected after Grab-Cut segmentation, (Right) result of Grab-Cut
segmentation
background in the scene. The transition between the foreground and background
regions then identifies the correct object borders. Using the foreground as a mask,
the discontinuities in the depth map can be adjusted to the discontinuities defined
by the mask borders. Figure 7.13 (right) shows a segmentation result from the
same image as used for the local approach. The depth map to the left of the segmented
image was produced by cutting out the foreground depth corresponding to the
segmentation result and by filling the missing depth values from the neighborhood.
One can clearly see that no oversmoothing is present and that fine details, like the arms
of the chair, could be restored. Another way to improve depth images based on
segmentation results is to apply a restricted bilateral filtering. This means that the
bilateral filtering is performed only on the mask. Therefore, precise object borders
stay preserved also in regions with poor local contrast. In the following, first a grab-cut segmentation in a video volume and a depth-based foreground extraction
technique are presented. Next, an application of the presented foreground extraction
method to the depth refinement in the 2-layer LDV generation process is introduced. Figure 7.14 shows a diagram of the global refinement scheme for one depth
map and corresponding color image. Note that the application of the global
refinement to the 2-layer LDV generation process is a bit more complicated,
because of warping operations between the refinement steps and the multiple depth
and color images involved (see Sect. Depth map refinement).
Fig. 7.14 A diagram of the proposed global depth map refinement scheme
Fig. 7.15 (Left) Result of the normal estimation through the PCA, color coded; (Right)
normals for the pixels in the outermost histogram bin after refinement, color coded. (Reproduced
with permission from [24])
Fig. 7.16 (Left) Image from the reference camera; (Right) combined ToF images, warped to the
view of the reference camera. (Reproduced with permission from [24])
matrix from the given point cloud and by performing a singular value decomposition.
The calculated eigenvector with the smallest eigenvalue is then the plane normal in a
least-squares sense. Figure 7.15 (left) shows the result of the normal estimation
through the PCA, for the depth image from Fig. 7.16 (right).
After normal estimation, the set of normals is large; in the worst case, the
number of different normals is the product of the width and height of the depth image. To
further reduce the set of potential candidates for the plane orientations, the clustering
method from [26] is used. The method performs iterative splitting of the clusters
orthogonal to their greatest variance axis. The clustering is performed hierarchically,
starting with one cluster constructed from all image points and proceeding to split the
cluster with the biggest variance until a maximal cluster number is reached. To
determine the optimal cluster number adaptively, the splitting is stopped if no cluster
with a variance over a certain threshold exists. At the end, the normals are oriented to
point to the inside of the room, which is important for later processing.
Fig. 7.17 (Left) A schematic representation of a room with objects; (Right) Histogram corresponding to the schematic room on the left, with Bin0 being the outermost bin of the histogram. (Reproduced with permission from [24])
The estimated cluster normals are not guaranteed to be correct. Therefore, an additional normal refinement step is performed. To increase robustness in the refinement process, the normals are sorted in decreasing order of the size of their corresponding clusters (largest cluster first). After that, all points in the image are projected into 3D space using the depth values from the depth map, and the following steps are applied iteratively for each normal in the set:
A discrete histogram is built by projecting all 3D points in the scene onto the line defined by the normal. Because the normal is oriented toward the inside of the room, the first bin of the histogram (Bin0) is the outermost bin of the room. Therefore, if the normal corresponds to the normal of an outer wall, all the wall points are projected into the first bin (Fig. 7.17). The bin size is a manually set parameter introduced to compensate for depth measurement errors and noise. For the results in this section it was set to 20 cm.
Using RANSAC [27], a plane is estimated from the 3D points projected into the first bin. The points are then classified into inliers and outliers, and an optimal fitting plane is estimated from the inlier set using PCA. Finally, the cluster normal is substituted by the normal of the fitted plane. The idea here is to refine the normal orientation to be perfectly perpendicular to the planes defined by the walls, ceiling, and floor.
The number of iterations in the refinement process has to be defined by the user,
but three iterations are generally sufficient. After the refinement, all 3D points in
the first bin are projected to the fitting plane of the last iteration. Figure 7.15 (right)
shows the normal vectors for the pixels corresponding to the 3D points in the first
bin for each refined normal. The pixels colored white correspond to 3D points that do not lie in the first bin of any normal and are thus foreground candidates.
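A minimal sketch of one such refinement step for a single cluster normal is given below. It projects the points onto the normal, collects the Bin0 candidates, fits a plane with a simple hand-rolled RANSAC, and refits the plane from the inliers via PCA; all names, the iteration count, and the inlier tolerance are illustrative assumptions rather than the parameters used in [24].

```python
import numpy as np

def refine_normal(points, normal, bin_size=0.2, iters=200, tol=0.05):
    """One refinement step for a cluster normal (illustrative sketch).

    points   : (N, 3) array of 3D scene points
    normal   : (3,) unit vector, oriented toward the room interior
    bin_size : histogram bin size in metres (0.2 m as in the text)
    """
    # Project all points onto the line defined by the normal; the first
    # (outermost) bin Bin0 contains the candidates for the wall plane.
    proj = points @ normal
    bin0 = points[proj <= proj.min() + bin_size]

    # Simple RANSAC plane fit on the Bin0 points.
    rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(iters):
        sample = bin0[rng.choice(len(bin0), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(n) < 1e-9:
            continue
        n /= np.linalg.norm(n)
        dist = np.abs((bin0 - sample[0]) @ n)
        inliers = bin0[dist < tol]
        if best_inliers is None or len(inliers) > len(best_inliers):
            best_inliers = inliers

    # Optimal plane from the inlier set via PCA; its normal replaces
    # the cluster normal.
    centered = best_inliers - best_inliers.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    refined = vt[-1]
    # Keep the orientation consistent with the input normal.
    if refined @ normal < 0:
        refined = -refined
    return refined
```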
Final Thresholding
For the actual thresholding, a histogram is constructed for each refined normal by projecting all 3D points onto the line defined by the normal. All pixels corresponding to the 3D points in the first bin of the histogram (Bin0) are then removed from the depth image. The bin size thereby defines the position of the thresholding plane along the histogram direction. Figure 7.18 (left) shows the result after thresholding the depth image from Fig. 7.16 (right). To identify the foreground for further processing, a binary mask can be constructed from the thresholded depth image, with 255 marking the foreground and 0 the background.
Fig. 7.18 (Left) Thresholded depth image; (Right) Trimap constructed from the binarized thresholded depth image on the left. (Reproduced with permission from [24])
7.5.2.2 Grab-Cut: Segmentation in a Video Volume
The grab-cut algorithm was originally proposed in [28]. It requires a trimap, which divides an image into a definitive foreground, a definitive background, and an uncertainty region, whereby the definitive foreground region can be omitted in the initialization. Based on the provided trimap, two Gaussian mixture color models, one for the foreground and one for the background, are constructed. After that, all pixels in the uncertainty region are classified as foreground or background by iteratively updating the corresponding color models. The transition between the foreground and background pixels inside the uncertainty region defines the object border. In order to work properly, the grab-cut algorithm requires that the correct object borders be contained inside the uncertainty region and that the definitive foreground and background regions contain no falsely classified pixels.
The approach presented here [24] extends the standard grab-cut algorithm to the
video volume. Similar to the standard grab-cut, trimaps are used to divide an image in
definitive foreground, definitive background, and uncertainty region. The Gaussian
color mixture models are used to represent foreground and background regions, but
are built and updated for a batch B of images simultaneously. To achieve the temporal
consistency, a 3D graph is constructed from all images in B, connecting pixels in different images through temporal edges. To handle the complexity, an image pyramid is used, and only the pixels in the uncertainty region, together with their direct neighbors in the definitive foreground and background, are included in the segmentation process. In the following, the segmentation scheme is described in more detail.
Trimap Generation
In the original paper [28], the trimap creation is performed by the user, so that the quality of the segmentation depends on the user's accuracy. This interactive trimap generation is, however, a time-consuming operation, especially for long video sequences. Based on the binary foreground mask $M^k$ from the initial foreground segmentation step, the trimap $T^k$ can be created automatically for each image $k$ in the batch $B$. For the automatic trimap generation, the morphological operations erosion and dilation are applied to each binary mask $M^k$. Let $E^k$ be the $k$th binary image after erosion, $D^k$ the $k$th binary image after dilation, and $DD^k$ the binary image after dilation applied to $D^k$. The definitive foreground for the $k$th batch image is then defined as $T_f^k = E^k$, the uncertainty region as $T_u^k = D^k \setminus E^k$, and the definitive background as $T_b^k = DD^k \setminus D^k$. The trimap for the batch image $k$ is defined as $T^k = T_f^k \cup T_b^k \cup T_u^k$. Additionally, a reduced trimap is defined as
$$\tilde{T}^k = T_u^k \cup \left\{ p^k \in T_b^k \cup T_f^k \;\middle|\; \exists\, q^k \in T_u^k : \operatorname{dist}(p^k, q^k) \le 2 \right\},$$
which contains the pixels in the uncertainty region together with their direct neighbors in the background and foreground. The number of erosion and dilation operations defines the sizes of $T_u^k$, $T_b^k$, and $T_f^k$ in the trimap. Figure 7.18 (right) shows the generated trimap for the thresholded image from Fig. 7.18 (left).
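Below is a minimal sketch of this automatic trimap construction using SciPy's morphological operators; the function name, the iteration counts, and the integer label encoding are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def make_trimap(mask, n_erode=5, n_dilate=5):
    """Build a trimap from a binary foreground mask M^k (sketch).

    Returns an integer map: 2 = definitive foreground T_f,
    1 = uncertainty region T_u, 0 = definitive background T_b,
    restricted to the twice-dilated support as described in the text.
    """
    e = binary_erosion(mask, iterations=n_erode)    # E^k
    d = binary_dilation(mask, iterations=n_dilate)  # D^k
    dd = binary_dilation(d, iterations=n_dilate)    # DD^k

    trimap = np.full(mask.shape, -1, dtype=np.int8)  # -1 = outside trimap
    trimap[dd & ~d] = 0       # T_b^k = DD^k \ D^k
    trimap[d & ~e] = 1        # T_u^k = D^k  \ E^k
    trimap[e] = 2             # T_f^k = E^k
    return trimap
```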
Hierarchical Segmentation Scheme
To reduce computational complexity, the segmentation is performed in an image pyramid of N levels. In the first level, the original color images are used. In each successive level, the resolution of the images from the previous level is reduced by a factor of 2. The segmentation starts with the coarsest level N. In this level, the trimaps are generated from the thresholded depth images (Fig. 7.18) as described before. In each level j < N, the segmentation result from level j + 1 is then used for the trimap creation. The sizes of the regions $T_u$, $T_b$, and $T_f$ in level N should be set appropriately to compensate for depth measurement errors. In each level j < N, the sizes of the trimap regions are fixed to compensate for the upscaling. All images in a level are processed in batches of size $|B|$ as follows.
For each batch B:
1. A trimap is created for each image $I^k$ in the batch.
2. Using the definitive foreground and definitive background from all trimaps in the batch, two Gaussian mixture color models are created, one for the foreground, $GM_f$, and one for the background, $GM_b$ (see the sketch after this list). To create a Gaussian mixture model, the clustering method from [26] with a specified variance threshold as a stopping criterion is used (compare to Initial plane estimation). The clusters calculated by this algorithm are used to determine the individual components of a Gaussian mixture model. By using a variance threshold as a stopping criterion, the number of clusters, and hence the number of Gaussian components, can be determined automatically instead of being set to a fixed number as in the original paper [28].
3. A 3D graph is created from all batch images, and the pixels in the uncertainty regions are classified as foreground or background using graph-cut optimization [29]. The classification of a pixel is stored in a map $A$, where $A(p^k) = FG$ if $p^k$ is classified as a foreground pixel and $A(p^k) = BG$ otherwise.
4. The color models are updated based on the new pixel classification.
5. Steps 3 and 4 are repeated until the number of pixels changing their classification falls below a certain threshold or until a fixed number of iterations is reached.
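As a rough illustration of step 2, the sketch below fits the two color models for one batch from the definitive trimap regions. It uses scikit-learn's GaussianMixture with a fixed component count as a stand-in for the adaptive variance-threshold clustering of [26]; the function name and trimap encoding are assumptions carried over from the trimap sketch above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def batch_color_models(images, trimaps, n_components=5):
    """Fit foreground/background Gaussian mixture color models for a batch.

    images  : list of (H, W, 3) float arrays (one batch B)
    trimaps : list of (H, W) int arrays (2 = T_f, 1 = T_u, 0 = T_b)
    Note: a fixed component count is used here for brevity; the text
    determines it adaptively via a variance-threshold clustering [26].
    """
    fg = np.concatenate([im[tm == 2] for im, tm in zip(images, trimaps)])
    bg = np.concatenate([im[tm == 0] for im, tm in zip(images, trimaps)])
    gm_f = GaussianMixture(n_components=n_components).fit(fg)
    gm_b = GaussianMixture(n_components=n_components).fit(bg)
    return gm_f, gm_b

# gm_f.score_samples(pixels) then yields ln Pr(p; GM_f) for the data term.
```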
Graph-Cut Optimization
The classification of the pixels in a video volume (batch B) into foreground and background can be formulated as a binary labeling problem on the map $A$ and expressed in the form of the following energy functional:
$$F(A, GM_f, GM_b) = V(A, GM_f, GM_b) + \lambda_S\, E_S(A) + \lambda_T\, E_T(A). \qquad (7.4)$$
$V(A, GM_f, GM_b)$ is the data term penalizing the pixel affiliation to the color models and is defined as
$$V(A, GM_f, GM_b) = \sum_{p^k \in \tilde{T}^k,\; k \le |B|} D(p^k, A, GM_f, GM_b). \qquad (7.5)$$
For $p^k \in T_u^k$, $D(p^k, A, GM_f, GM_b)$ is defined as
$$D(p^k, A, GM_f, GM_b) = \begin{cases} -\ln \Pr(p^k;\, GM_b), & A(p^k) = BG \\ -\ln \Pr(p^k;\, GM_f), & A(p^k) = FG. \end{cases} \qquad (7.6)$$
$E_S(A)$ is the spatial smoothness term, defined over the spatial neighborhood system $N_S$ of pixel pairs within each image:
$$E_S(A) = \sum_{(p^k, q^k) \in N_S} \frac{\delta\!\left(A(p^k), A(q^k)\right)}{d(p^k, q^k)}\, \exp\!\left(-\frac{\left(I^k(p^k) - I^k(q^k)\right)^2}{2\sigma_S^2}\right), \qquad (7.7)$$
$$\delta(a, b) = \begin{cases} 1, & a \neq b \\ 0, & \text{else.} \end{cases} \qquad (7.8)$$
$E_T(A)$ is the corresponding temporal smoothness term over the temporal neighborhood system $N_T$:
$$E_T(A) = \sum_{(p^i, q^j) \in N_T} \delta\!\left(A(p^i), A(q^j)\right) \exp\!\left(-\frac{\left(I^i(p^i) - I^j(q^j)\right)^2}{2\sigma_T^2}\right), \qquad (7.9)$$
$$N_T = \left\{ (p^k, q^j) \;\middle|\; p^k \in \tilde{T}^k \wedge q^j \in \tilde{T}^j \wedge \left(p^k \in T_u^k \vee q^j \in T_u^j\right) \wedge d(p^k, q^j) \le n \right\}. \qquad (7.10)$$
The graph $G = (V, E)$ for the minimization is constructed from the reduced trimaps of all batch images:
$$V = \bigcup_{k=0}^{|B|-1} \tilde{T}^k \cup \{s, t\}, \qquad (7.11)$$
$$E = N_S \cup N_T \cup \underbrace{\left\{ (s, p^k), (t, p^k) \;\middle|\; p^k \in V \setminus \{s, t\} \right\}}_{N_D}, \qquad (7.12)$$
whereby $s$ and $t$ are two additional nodes representing the background ($t$) and the foreground ($s$). The capacity function for an edge $e = \{p, q\} \in E$ is then defined as follows:
$$\operatorname{cap}(e) = \begin{cases}
\exp\!\left(-\left(I^k(p) - I^k(q)\right)^2 / 2\sigma_S^2\right), & e \in N_S \\
\exp\!\left(-\left(I^k(p) - I^{k+1}(q)\right)^2 / 2\sigma_T^2\right), & e \in N_T \\
-\ln \Pr(p;\, GM_b), & e \in N_D \wedge q = s \wedge p \in T_u^k \\
-\ln \Pr(p;\, GM_f), & e \in N_D \wedge q = t \wedge p \in T_u^k \\
0, & \left(\left(q = s \wedge p \in T_b^k\right) \vee \left(q = t \wedge p \in T_f^k\right)\right) \wedge e \in N_D \\
\infty, & \left(\left(q = s \wedge p \in T_f^k\right) \vee \left(q = t \wedge p \in T_b^k\right)\right) \wedge e \in N_D.
\end{cases} \qquad (7.13)$$
To minimize the functional a minimum cut is calculated on the graph G using
the algorithm from [30], where minimum cut capacity is equivalent to the minimum of the functional.
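The toy example below shows how such t-link and n-link capacities map onto an s-t graph whose minimum cut yields the labeling. It uses a generic max-flow routine from networkx (assumed available) on a two-pixel graph rather than the batch 3D graph or the dedicated algorithm of [30]; all node names and capacities are illustrative.

```python
import networkx as nx

# Toy s-t graph for two uncertainty pixels p1, p2 (capacities illustrative).
G = nx.DiGraph()
s, t = "s", "t"                        # s = foreground, t = background
for p, cost_fg, cost_bg in [("p1", 2.0, 0.5), ("p2", 0.4, 1.8)]:
    G.add_edge(s, p, capacity=cost_bg)  # t-link, cost of a background label
    G.add_edge(p, t, capacity=cost_fg)  # t-link, cost of a foreground label
G.add_edge("p1", "p2", capacity=1.0)    # n-link (spatial/temporal smoothness)
G.add_edge("p2", "p1", capacity=1.0)

cut_value, (src_side, sink_side) = nx.minimum_cut(G, s, t)
foreground = src_side - {s}             # pixels still connected to s
print(cut_value, foreground)
```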
Segmentation Results
For the experimental evaluation, a sequence of 800 frames (color + depth) was used.
Fig. 7.19 (Left) Grab-Cut segmentation result; (Right) Schematic representation of the 3D graph
for grab-cut segmentation. (Reproduced with permission from [24])
Fig. 7.20 Results from the video segmentation (frames: 0, 200, 430, 700). (Reproduced with
permission from [24])
Figure 7.20 shows the results after the grab-cut segmentation for images 0, 200,
430, and 700 in the sequence. One can see that in almost all cases the foreground was
reliably identified throughout the whole sequence. Some artifacts remain, for
example, small parts of the plant are missing, but most object borders are detected
accurately.
Notice that the camera was moved from left to right and back during the
capturing. This demonstrates that the proposed approach can be applied to general
dynamic scenes with non-stationary cameras. The results from automatic segmentation were additionally compared to the manual segmentation results for
images 286–296. Without time edges the percentage of falsely classified pixels compared to the manual segmentation is 1.97 %; with time edges the number of falsely classified pixels decreases to 1.57 %. While the difference is only about 0.4 %, the flickering on the object borders decreases significantly.
Fig. 7.21 From left to right: binary foreground mask from the left ToF depth image and from the right ToF depth image; combined binary foreground mask warped to the view of the top color camera of the left module
Fig. 7.22 (Left) Color image from the top camera of the left module. (Right) Refined binary
foreground mask
Fig. 7.23 (Left) Depth image warped to the top color camera of the left module. (Right) Depth
image refined through restricted bilateral filtering
Fig. 7.25 (Left) Warped depth image from the central virtual view. (Right) Refined depth image
from the central virtual view (depth from the foreground layer)
views. Notice that the images are not complete due to different viewing frustums
of the ToF cameras, module cameras, and the reference camera. Although the
images are purely virtual the texture quality is quite good. This demonstrates the
quality of the refinement process.
The central virtual view corresponds to the view of the reference camera.
Therefore the original color image of the reference camera is used as the color
image for the view. The corresponding depth image is constructed through the
combination of the refined images of the four corner cameras. Figure 7.25 (left)
shows the constructed central depth image. One can see that the object borders
already fit well, but some artifacts are still present. These can occur due to discrepancies in depth between different views, imperfections of the depth refinement in the previous step, or calibration errors in conjunction with the viewpoint change. To reduce the remaining artifacts, the global depth refinement is performed again in the view of the reference camera. The foreground mask for this refinement is constructed as a unified foreground mask from all refined foreground masks of the corner cameras, transformed to the view of the reference camera. The result of the refinement is shown in Fig. 7.25 (right).
Figure 7.26 shows the 2-layer LDV frame constructed from the three virtual
views.
Using this frame, novel views can be rendered to the left and to the right of the
view of the reference camera. Figure 7.27 shows two novel views left (-180 mm)
and right (+180 mm) from the reference camera view.
Fig. 7.27 Two views rendered left (-180 mm) and right (+180 mm) from the view of the
reference camera
The quality of the novel views is already good, but there are still some issues
like color discrepancy between the foreground and the occlusion layer or artifacts
on the right side of the box, which diminish the achieved quality. The color discrepancy is a technical problem, which can be solved by using similar cameras. The rendering artifacts on the right border of the box are, however, a more fundamental problem. While depth and color discontinuities in the virtual views are precisely aligned, depth values on the object borders are determined through a filtering process, i.e., simply put, propagated from the neighborhood. In most cases this is sufficient, but sometimes no correct depth is available in the neighborhood, or it cannot be correctly propagated, which leads to discrepancies between the foreground and the
occlusion layer during the occlusion layer creation process. This may lead to
imperfect alignment of discontinuities between two layers and later rendering
artifacts.
References
1. Fehn C (2004) Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV. In: Stereoscopic displays and virtual reality systems XI. Proceedings of the SPIE 5291, pp 93–104, May 2004
2. Kauff P, Atzpadin N, Fehn C et al (2007) Depth map creation and image-based rendering for
advanced 3DTV services providing interoperability and scalability. Sig Process: Image
Commun 22:217234
3. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV.
Broadcast IEEE Trans 51(2):191199
4. Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. Int J Comput Vision 47(1/2/3):742
5. Frick A, Bartczak B, Koch R (2010) Real-time preview for layered depth video in 3D-TV.
In: Proceedings of SPIE, vol 7724, p 77240F
6. Lee E-K, Ho Y-S (2011) Generation of high-quality depth maps using hybrid camera system
for 3-D video. J Visual Commun Image Represent (JVCI) 22:7384
7. Bartczak B, Schiller I, Beder C, Koch R (2008) Integration of a time-of-flight camera into
a mixed reality system for handling dynamic scenes, moving viewpoints and occlusions in
real-time. In: Proceedings of the 3DPVT Workshop, 2008
8. Kolb A, Barth E, Koch R, Larsen R (2009) Time-of-flight sensors in computer graphics. In: Proceedings of Eurographics 2009, state of the art reports, pp 119–134
9. Schiller I, Beder C, Koch R (2008) Calibration of a PMD-camera using a planar calibration pattern together with a multi-camera setup. In: Proceedings of the XXXVII congress of the International Society for Photogrammetry and Remote Sensing
10. Fehn C, Kauff P, Op de Beeck M, et al, An evolutionary and optimised approach on 3D-TV.
In: IBC 2002, International broadcast convention, Amsterdam, Netherlands, Sept 2002
11. Bartczak B, Vandewalle P, Grau O, Briand G, Fournier J, Kerbiriou P, Murdoch M et al
(2011) Display-independent 3D-TV production and delivery using the layered depth video
format. IEEE Trans Broadcast 57(2):477490
12. Barenbrug B (2009) Declipse 2: multi-layer image and depth with transparency made practical. SPIE 7237:72371G
13. Klein Gunnewiek R, Berretty R-P M, Barenbrug B, Magalhães JP (2009) Coherent spatial and temporal occlusion generation. In: Proceedings of SPIE vol 7237, p 723713
14. Shade J, Gortler S, He L, Szeliski R (1998) Layered depth images. In: Proceedings of the
25th annual conference on Computer graphics and interactive techniques (SIGGRAPH 98).
ACM, New York, pp 231242
15. Smolic A, Mueller K, Merkle P, Kauff P, Wiegand T (2009) An overview of available and
emerging 3D video formats and depth enhanced stereo as efficient generic solution. In:
Picture coding symposium, 2009, PCS 2009
16. Merkle P, Smolic A, Muller K, Wiegand T (2007) Multi-View video plus depth
representation and coding. In: Image processing, 2007. ICIP 2007. IEEE International
conference on, vol 1, pp 201204
17. Frick A, Bartczak B, Koch R (2010) 3D-TV LDV content generation with a hybrid ToF-multicamera rig. In: 3DTV-Conference: the true vision - capture, transmission and display of 3D video, June 2010
18. Frick A, Kellner F, Bartczak B, Koch R (2009) Generation of 3D-TV LDV-content with time-of-flight camera. In: 3DTV-Conference: the true vision - capture, transmission and display of 3D video, May 2009
19. Xu M, Ellis T (2001) Illumination-invariant motion detection using colour mixture models.
In: British machine vision conference (BMVC 2001), Manchester, pp 163172
20. Yang Q, Yang R, Davis J, Nister D (2007) Spatial-depth super resolution for range images.
In: Computer vision and pattern recognition, CVPR 07, IEEE conference on, pp 18, June 2007
21. Kim S-Yl, Cho J-H, Koschan A, Abidi MA (2010) 3D Video generation and service based on
a TOF depth sensor in MPEG-4 multimedia framework. IEEE Trans Consum Electron
56(3):17301738
22. Diebel J, Thrun S (2005) An application of markov random fields to range sensing.
In: Advances in neural information processing systems, pp. 291298
23. Chan D, Buisman H, Theobalt C, Thrun S (2008) A noise-aware filter for real-time depth
upsampling. In: Workshop on multicamera and multi-modal sensor fusion, M2SFA2
24. Frick A, Franke M, Koch R (2011) Time-consistent foreground segmentation of dynamic
content from color and depth video. In: DAGM 2011, LNCS 6835. Springer, Heidelberg,
2011, pp 296305
25. Pham TQ, van Vliet LJ (2005) Separable bilateral filtering for fast video preprocessing.
In: Multimedia and Expo, 2005. ICME 2005. IEEE International conference on, 2005
26. Orchard M, Bouman C (1991) Color quantization of images. IEEE Trans Signal Process
39(12):26772690
27. Fischler M, Bolles R (1981) Random sample consensus: a paradigm for model fitting with
applications to image analysis and automated cartography. Commun ACM 24(6):381395
28. Rother C, Kolmogorov V, Blake A (2004) Grabcut: interactive foreground extraction using
iterated graph cuts. ACM Trans Graph 23(3):309314
29. Boykov Y, Jolly M (2000) Interactive organ segmentation using graph cuts. In: Medical
image computing and computer-assisted-intervention (MICCAI), pp 276286
30. Boykov Y, Kolmogorov V (2004) An experimental comparison of min-cut/max- flow
algorithms for energy minimization in vision. IEEE Trans Pattern Anal Mach Intell
26(9):11241137
31. Bartczak B, Koch R (2009) Dense depth maps from low resolution time-of-flight depth and
high resolution color views. In: Advances in visual computing, vol 5876, Springer, Berlin,
pp 228239
Part III
Chapter 8
3D Video Compression
Karsten Müller, Philipp Merkle and Gerhard Tech
Abstract In this chapter, compression methods for 3D video (3DV) are presented.
This includes data formats, video and depth compression, evaluation methods, and
analysis tools. First, the fundamental principles of video coding for classical 2D
video content are reviewed, including signal prediction, quantization, transformation, and entropy coding. These methods are extended toward multi-view video
coding (MVC), where inter-view prediction is added to the 2D video coding
methods to gain higher coding efficiency. Next, 3DV coding principles are
introduced, which are different from previous coding methods. In 3DV, a generic
input format is used for coding, and a dense set of output views is generated for different types of autostereoscopic displays. This influences the format selection, encoder optimization, and evaluation methods, and requires new modules, such as decoder-side view generation, as discussed in this chapter. Finally, different 3DV formats are compared and discussed with respect to their applicability for 3DV systems.
8.1 Introduction
3D video (3DV) systems have entered different application areas in recent years,
like digital cinema, home entertainment, and mobile services. Especially for
3D-TV home entertainment, a variety of 2-view (stereoscopic) and N-view autostereoscopic displays have been developed [1, 2]. For these displays, 3DV formats are required, which are generic enough to provide any number of views at
dense spatial positions. As the majority of 3DV content is currently produced with
few (mostly two) cameras, mechanisms for extracting the required additional
views at the display are required. Here, the usage of geometry information of a
recorded 3D scene, e.g., in the form of depth or range data has been studied and
found to be a suitable and generic approach. Therefore, depth-enhanced video
formats, like multi-view video plus depth (MVD) have been studied, where 2 or 3
recorded views are amended with per-pixel depth data. The depth data may
originate from different sources, like time-of-flight cameras, depth estimation from
the original videos, or graphics rendering for computer-generated synthetic content. Such formats thus include additional information to generate the required
high number of views for autostereoscopic displays.
For the compression of such formats, new approaches beyond current 2D video
coding methods need to be investigated. Therefore, this chapter explains the
requirements for the compression of video and depth data and looks into new
principles of 3DV coding.
First, a short review on basic principles of video coding, which build the basis
for today's advanced coding methods, is given in Sect. 8.2.
Next, the extension toward multi-view video coding (MVC) is explained in
Sect. 8.3. To achieve higher coding gains, inter-view prediction was added in this
extension. For this, evaluation of coding results showed that no major changes are
required, such that the coding mechanisms for 2D can directly be applied to MVC.
For autostereoscopic displays with different number and position of views,
3DV coding methods are required, which efficiently transmit stereo and multi-view
video content and are able to provide any number of output views for the variety
of autostereoscopic displays. For this, new assumptions and coding principles
are required, which are explained in Sect. 8.4. One of the novelties in the new
3DV coding is the usage of supplementary data, e.g., in the form of depth maps, in
order to generate additional intermediate views at the display. For depth information, different coding methods are required, as shown by the depth data analysis
in Sect. 8.4.
Besides video and depth data, additional supplementary information, like
occlusion data, may also be used. Therefore, different 3DV formats exist, which
are explained and discussed in terms of coding efficiency in Sect. 8.5.
$$p_{\mathrm{opt}} = \arg\min_{p_b}\; \left[ D(p_b) + \lambda\, R(p_b) \right] \qquad (8.1)$$
Here, rate $R$ and distortion $D$ are related via the Lagrange multiplier $\lambda$ to form the functional $D(p_b) + \lambda R(p_b)$. Rate and distortion depend on the parameter vector $p_b$ of a coding mode $b$. The optimal coding mode is determined by the parameter vector $p_{\mathrm{opt}}$, which minimizes the functional. For this, the best mode among a number of candidates is selected, as explained in the coding principles below.
A video coding system consists of encoder and decoder, as shown in Fig. 8.1.
The encoder applies video compression to the input data and generates a bit
stream. This bit stream can then be transmitted and is used by the decoder to
reconstruct the video. The compressed bit stream has to meet system requirements,
like a maximum given transmission bandwidth (and thus maximum bit rate).
Therefore, the encoder has to omit certain information within the original video, a
process, called lossy compression. Accordingly, the reconstructed video after
decoding has a lower quality in comparison with the original input video at the
encoder. The quality comparison between reconstructed output and original video
input, as well as the bit rate of the compressed bit stream, as shown in Fig. 8.1, is
used by the rate-distortion-optimization process in the encoder.
Video coding systems incorporate fundamental principles of signal processing,
such as signal prediction, quantization, transformation, and entropy coding to
reduce temporal as well as spatial signal correlations, as described in detail in [5].
As a result, a more compact video representation can be achieved, where important
information is carried by fewer bits than in the original video, in order to reduce
the data rate of the compressed bit stream.
Fig. 8.1 Video coding system with encoder and decoder, objective quality measure of
reconstructed output versus original input video at certain bit rates
$$e(n) = x(n) - \hat{x}(n) \qquad (8.2)$$
Typically, the error $e(n)$ is much smaller than the original pixel value $x(n)$; therefore, a transmission of $e(n)$ is usually more efficient. The original signal is then reconstructed from the error and the prediction signal: $x(n) = e(n) + \hat{x}(n)$.
The second fundamental principle in video compression is quantization. One example for this is already given by the most widely used YUV 8-bit video format, where each pixel is represented by a luminance value (Y) and two chrominance values (U and V), each with 8-bit precision. Therefore, each pixel value component is quantized into 256 possible values. In video compression, the error signal $e(n)$ is quantized into a discrete number of values $e_q(n)$. The combination of signal prediction and error signal quantization is shown in the encoder in Fig. 8.2.
With the quantization of the error signal, the original and quantized version differ, such that $e_q(n) \neq e(n)$. Consequently, the reconstructed signal $\tilde{x}(n)$ also differs from the original signal, as $\tilde{x}(n) = e_q(n) + \hat{x}(n) \neq x(n)$. In prediction scenarios, previously coded and decoded data are often used as an input to the predictor. These data contain quantization errors from the previous coding cycle, such that an error accumulation occurs over time, which is known as drift. Therefore, the typical basic encoder structure consists of a differential pulse code modulation (DPCM) loop, shown in Fig. 8.2. This backward prediction loop guarantees that the current quantized prediction error $e_q(n)$ is used for locally reconstructing the output signal $\tilde{x}(n)$, which is then used for the next prediction, such that quantization errors cannot accumulate over time. Furthermore, the encoder uses $\tilde{x}(n)$ to calculate the distortion error for the rate-distortion optimization.
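The toy 1D sketch below illustrates this DPCM loop: the predictor always uses the locally reconstructed value, so encoder and decoder stay synchronized and no drift accumulates. The previous-sample predictor, the uniform quantizer step, and the example signal are illustrative assumptions, not part of any particular codec.

```python
import numpy as np

def dpcm_encode_decode(x, step=4):
    """Toy 1D DPCM with previous-sample prediction and uniform quantization.

    The predictor uses the locally *reconstructed* value (as in the DPCM
    loop of Fig. 8.2), so encoder and decoder stay synchronized and
    quantization errors cannot accumulate over time (no drift).
    """
    e_q = np.zeros(len(x), dtype=int)      # transmitted quantized errors
    x_rec = np.zeros(len(x), dtype=float)  # locally reconstructed signal
    pred = 0.0
    for n in range(len(x)):
        e = x[n] - pred                    # prediction error e(n)
        e_q[n] = int(np.round(e / step))   # quantization index
        x_rec[n] = pred + e_q[n] * step    # local reconstruction x~(n)
        pred = x_rec[n]                    # next prediction from x~(n)
    return e_q, x_rec

x = np.array([100, 104, 110, 111, 130, 131, 129], dtype=float)
indices, reconstruction = dpcm_encode_decode(x)
print(indices)         # small numbers: cheaper to entropy-code than x
print(reconstruction)  # close to x, error bounded by the step size
```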
Fig. 8.2 Differential pulse code modulation structure with encoder and decoder. The decoder structure is also part of the encoder (gray box)
The third fundamental principle in video compression is signal transformation. Here, a number of adjacent signal samples are transformed from their original domain into another domain. For images and videos, usually a frequency transformation is applied to a rectangular block of pixels. Thus, an image block is transformed from its original spatial domain into the frequency domain. The purpose of this frequency transformation is to concentrate most of a signal's energy, and thus the most relevant information, in few large-value frequency coefficients. As the samples of an original image block usually have a high correlation, mostly the low-frequency coefficients have large values. Consequently, data reduction can be carried out by simply omitting high-frequency coefficients with small values while still preserving a certain visual quality for the entire image block. Adding a frequency transformation to signal prediction and quantization leads to the basic structure of today's 2D video encoders, shown in Fig. 8.3.
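A minimal sketch of this energy-compaction idea is given below, using a 2D DCT (via scipy.fft) on a smooth 8x8 block and keeping only the low-frequency corner of the coefficient block; the block content and the retained 4x4 corner are illustrative choices, not codec settings.

```python
import numpy as np
from scipy.fft import dctn, idctn

# 8x8 block with a smooth gradient (high spatial correlation)
block = np.outer(np.linspace(50, 100, 8), np.linspace(1.0, 1.2, 8))

coeffs = dctn(block, norm="ortho")       # spatial -> frequency domain

# Keep only the 4x4 low-frequency corner, zero the rest: most of the
# signal energy is concentrated there for correlated image content.
trunc = np.zeros_like(coeffs)
trunc[:4, :4] = coeffs[:4, :4]

approx = idctn(trunc, norm="ortho")      # back to the spatial domain
print(np.max(np.abs(block - approx)))    # small reconstruction error
```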
In such an encoder, an image is tiled into a number of blocks, which are sequentially processed. For each block $x(n)$, a predicted block $\hat{x}(n)$ is obtained by advanced prediction mechanisms. Here, the encoder selects among a number of different prediction modes, e.g., intra-frame prediction, where spatially neighboring content is used, or motion-compensated temporal prediction, where similar content in preceding pictures is used. The difference between the original and predicted block, $e(n)$, is transformed (T) and quantized. This residual block $Q_q(n)$ is inversely transformed (T$^{-1}$) into $e_q(n)$ and used to generate the locally reconstructed output block $\tilde{x}(n)$ (see Fig. 8.3). A deblocking filter is further applied to reduce visible blocking artifacts. The locally reconstructed output image is used for the temporal prediction of the following encoder cycle. The temporal prediction consists of the motion compensation as well as the motion estimation. Here, motion vectors between the input $x(n)$ and the reconstructed image blocks $\tilde{x}(n)$ are estimated. Each encoder cycle produces a set of residual block data $Q_q(n)$ and motion vectors, required for decoding. These data are fed into the entropy coder. The coder control is responsible for testing different prediction modes and determining the overall optimal prediction mode in terms of minimum distortion and bit rate after entropy coding.
The fourth fundamental principle in video compression is entropy coding. If a signal is represented by a set of different source symbols, an entropy coding stage can exploit their unequal occurrence probabilities, e.g., by assigning shorter code words to more frequent symbols, as in Huffman or arithmetic coding [7, 8].
Fig. 8.3 Block-based hybrid video encoder with DPCM Loop, transformation (T), quantization,
inverse transformation (T-1), prediction (intra frame prediction, motion compensation, motion
estimation), entropy coding, and deblocking filter modules
The distortion between an original and a reconstructed image block is typically measured by the mean squared error (MSE):
$$\mathrm{MSE} = \frac{1}{I} \sum_{i=1}^{I} \left( x_i(n) - \tilde{x}_i(n) \right)^2 \qquad (8.3)$$
Here, $x_i(n)$ and $\tilde{x}_i(n)$ represent the single pixel values of the blocks $x(n)$ and $\tilde{x}(n)$, respectively, with $I$ being the number of pixels per block. Thus, the MSE is calculated from the averaged squared pixel differences in Eq. (8.3). For the calculation of objective image and video quality, the peak signal-to-noise ratio (PSNR) is calculated from the MSE, as shown in Eq. (8.4). The PSNR gives the logarithmic ratio between the squared maximum value and the MSE. As an example, the PSNR for 8-bit signals (maximum value: 255) is also given in (8.4):
$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MaxValue}^2}{\mathrm{MSE}}, \qquad \mathrm{PSNR}_{8\,\mathrm{bit}} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}} \qquad (8.4)$$
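The sketch below computes the PSNR of Eqs. (8.3) and (8.4) for an 8-bit image pair; the function name and the random test images are illustrative assumptions.

```python
import numpy as np

def psnr(original, reconstructed, max_value=255.0):
    """PSNR in dB between an original and a reconstructed 8-bit image."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64))
noisy = np.clip(img + rng.normal(0, 3, img.shape), 0, 255)
print(round(psnr(img, noisy), 2))
```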
For image and video coding, the PSNR is widely used, as it represents subjective quality very well and can easily be calculated. When a new coding technology is developed, extensive subjective assessment tests are carried out with a large number of participants. Here, measurements like the mean opinion score (MOS) are used. In MOS, a quality scale is specified, ranging, e.g., from 0 to 5 for very bad to perfect quality. Participants are then asked to rate their subjective impression within this scale. For (2D) video coding, the MOS value is highly correlated with the objective PSNR: a higher PSNR value is also rated higher in MOS. Thus, the automatic encoder rate-distortion optimization can maximize the objective PSNR value and thereby also guarantees a high perceived subjective quality.
Fig. 8.4 Block relations in stereo video with respect to the lower right block at (t2,v2): temporal
relation via motion vectors m, inter-view relation via disparity vectors d
Fig. 8.5 Multi-view video coding prediction structure for stereo video: inter-view prediction
(vertical-arrow relations) combined with hierarchical B pictures for temporal prediction
(horizontal-arrow relations), source [20]
prediction. The temporal prediction structure in MVC uses a hierarchy of bipredictive pictures, known as hierarchical B pictures [16]. Here, different quantization
parameters can be applied to each hierarchy level of B pictures in order to further
increase the coding efficiency. Vertical arrows indicate inter-view dependencies,
i.e., video v2 requires v1 as reference. Note that the dependency arrows in Fig. 8.5
point in the direction opposite to the motion and disparity vectors from Fig. 8.4.
The coding efficiency of MVC gains from using both temporal and inter-view
reference pictures. However, the inter-view decorrelation loses some efficiency, as
the algorithms and methods optimized for temporal signal decorrelation are
applied unmodified. Although motion and disparity both reflect content displacement between images at different temporal and spatial positions, they also differ in
their statistics. Consider a static scene, recorded with a stereo camera: Here, no
motion occurs, while the disparity between the two views is determined by the
intrinsic setting of the stereo cameras, such as spacing, angle, and depth of the
scene. Therefore, an encoder is optimized for a default motion of m = 0. This is
not optimal for disparity, where default values vary according to the parameters,
mentioned before. This is also reflected by the occurrence of coding decisions for
temporally predicted picture blocks versus inter-view predicted blocks, where the
latter only occur in up to 20 % [11]. This only leads to a moderate increase in
coding efficiency of current MVC versus single view coding, where coding gains
of up to 40 % have been measured in experiments for multiple views [11].
Although such gains can be achieved by MVC, the bit rate is still linearly
dependent on the number of views.
The MVC encoder applies the same rate-distortion optimization principles, as
2D video coding. Here, PSNR is used as an objective measure between the locally
reconstructed and the original uncoded image. For two views of a stereo sequence,
a similar quality is usually aimed for, although unequal quality coding has also
been investigated. For the PSNR calculation of two or more views, the mean
squared error (MSE) is separately calculated for each view. The MSE values of all
views are then averaged and the PSNR value calculated, as shown for the stereo
(2-view) and N-view case in (8.7):
$$\mathrm{PSNR}_{2\,\mathrm{Views}} = 10 \log_{10} \frac{\mathrm{MaxValue}^2}{0.5\,\left(\mathrm{MSE}_{\mathrm{view1}} + \mathrm{MSE}_{\mathrm{view2}}\right)},$$
$$\mathrm{PSNR}_{N\,\mathrm{Views}} = 10 \log_{10} \frac{\mathrm{MaxValue}^2}{\tfrac{1}{N}\left(\mathrm{MSE}_{\mathrm{view1}} + \mathrm{MSE}_{\mathrm{view2}} + \cdots + \mathrm{MSE}_{\mathrm{view}N}\right)}. \qquad (8.7)$$
Note that averaging the individual PSNR values of all views would lead to a wrong encoder optimization toward maximizing the PSNR in one view only, while neglecting the others, especially for unequal view coding.
Based on the PSNR calculation in (8.7), MVC gives the best possible quality at a
given bit rate. Similar to 2D video coding, the PSNR correlates with subjective
MOS, although initially for the 2D quality of the separate views. In addition, tests
have also been carried out, where subjective 3D perception in stereo video was
assessed and a correlation with PSNR values was shown [17]. Therefore, rate-distortion optimization methods from 2D video coding are also directly applicable to stereo and MVC.
systems, such that only two views are available. Therefore, new coding methods
are required for 3DV coding, which decouple the production and coding format
from the display format.
Fig. 8.6 3D video coding system with single generic input format and multi-view output range
for N-view displays. For overall evaluation of different 3DVC methods with compression and
view generation methods, quality and data rate measurements are indicated by dashed boxes
In the 3DV system, the encoder has to be designed in such a way that a high
quality of the dense range of output views is ensured and that the rate-distortion
optimization can be carried out automatically. Therefore, some of the view generation functionality also has to be emulated at the encoder. For comparison, the
3DV encoder can generate uncompressed intermediate views from the original
video and depth data. These views can then be used in the rate-distortion-optimization as a reference, as shown in Sect. 8.4.3.
Fig. 8.7 Multi-view video plus depth format for two input cameras v1 and v2, Book_Arrival
sequence
content may lead to erroneous depth assignments. This, however, is only problematic, if visible artifacts occur in synthesized intermediate views. Here,
advanced view synthesis methods can reduce depth estimation errors to some
degree [34, 35].
For existing stereo content, where only two color views are available, depth data has to be estimated. Newer recording devices are able to record depth information by sending out light beams, which are reflected back by a scene object, as described in detail in [36]. These cameras measure the time of flight of the light in order to calculate the depth information.
In contrast to recorded natural 3DV data, animated films are computer generated. For these, extensive scene models with 3D geometry are available [37, 38].
From the 3D geometry, depth information with respect to any given camera
viewpoint can be generated and converted into the required 3DV depth format.
For 3DVC, a linear parallel camera setup in horizontal direction is typically
used. In addition, preprocessing of the video data is carried out for achieving exact
vertical alignment [39] in order to avoid viewing discomfort and fatigue. Consequently, a simple relation between the depth information z(n) at a pixel position
n and the associated disparity vector d(n) can be formulated:
$$\|\mathbf{d}(n)\| = \frac{f \, \Delta s}{z(n)} \qquad (8.8)$$
Here, $f$ is the focal length of the cameras (assuming identical focal lengths) and $\Delta s$ the camera distance between the two views v1 and v2. Due to the vertical alignment, the disparity vector $\mathbf{d}$ only has a horizontal component, and its length is determined by the inverse scene depth value, as also shown in detail in [20].
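A minimal sketch of Eq. (8.8) as a per-pixel conversion is given below; the function name and the numeric example values are illustrative assumptions.

```python
import numpy as np

def depth_to_disparity(z, focal_length, baseline):
    """Per-pixel disparity magnitude from depth, Eq. (8.8): |d| = f*Delta_s / z.

    z            : (H, W) array of scene depth values (same unit as baseline)
    focal_length : camera focal length f in pixels
    baseline     : camera distance Delta_s between views v1 and v2
    """
    return focal_length * baseline / z

z = np.array([[2.0, 4.0], [8.0, 16.0]])      # metres
print(depth_to_disparity(z, focal_length=1000.0, baseline=0.1))
# nearer points get larger disparities (50, 25, 12.5, 6.25 pixels)
```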
$$x_{v_S}(n_{v_S}) = \kappa\, x_{v_1}(n_{v_1}) + (1-\kappa)\, x_{v_2}(n_{v_2}) \qquad (8.9)$$
Here, $\kappa$ specifies the spatial position between the original views v1 and v2. For instance, a value of $\kappa = 0.5$ indicates that $v_S$ is positioned in the middle between both original views, as also shown in Fig. 8.8. The two values $\kappa = 0$ and $\kappa = 1$ determine the original positions, where $v_S = v_2$ and $v_S = v_1$, respectively. The positions of corresponding blocks in the original and synthesized views are related via the disparity vector $\mathbf{d}$, such that $n_{v_S} = n_{v_1} - (1-\kappa)\,\mathbf{d}$ as well as $n_{v_S} = n_{v_2} + \kappa\,\mathbf{d}$. Thus, Eq. (8.9) can be modified by substituting the corresponding positions, using the $\kappa$-scaled disparity shifts:
$$x_{v_S}(n_{v_S}) = \kappa\, x_{v_1}\!\left(n_{v_S} + (1-\kappa)\,\mathbf{d}\right) + (1-\kappa)\, x_{v_2}\!\left(n_{v_S} - \kappa\,\mathbf{d}\right). \qquad (8.10)$$
gradual adaptation of the color information when navigating across the viewing range of intermediate dense views from v1 to v2. Equation (8.10) assumes that both original blocks $x_{v_1}$ and $x_{v_2}$ are visible. As the original cameras have recorded the scene from slightly different viewpoints, background areas exist which are only visible in one view, while being occluded by foreground objects in the second view. Therefore, (8.10) has to be adapted for both occlusion cases in order to obtain the intermediate picture block, i.e., $x_{v_S}(n_{v_S}) = x_{v_1}(n_{v_S} + (1-\kappa)\,\mathbf{d})$ if $x_{v_2}$ is occluded, and $x_{v_S}(n_{v_S}) = x_{v_2}(n_{v_S} - \kappa\,\mathbf{d})$ if $x_{v_1}$ is occluded.
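The simplified 1D sketch below follows the structure of Eq. (8.10) for one image row with integer horizontal disparities and nearest-neighbor sampling; occlusion handling and sub-pixel filtering are omitted, and the function name and test data are illustrative assumptions.

```python
import numpy as np

def blend_row(row_v1, row_v2, disparity, kappa):
    """Interpolate one image row at position kappa between v2 (kappa=0)
    and v1 (kappa=1), following the structure of Eq. (8.10).

    row_v1, row_v2 : 1D arrays of color values from the original views
    disparity      : 1D integer array of per-pixel disparities d (v1 vs. v2)
    kappa          : interpolation position in [0, 1]
    """
    w = len(row_v1)
    out = np.zeros(w)
    for n in range(w):
        d = disparity[n]
        n1 = np.clip(n + int(round((1 - kappa) * d)), 0, w - 1)
        n2 = np.clip(n - int(round(kappa * d)), 0, w - 1)
        out[n] = kappa * row_v1[n1] + (1 - kappa) * row_v2[n2]
    return out

row1 = np.linspace(0, 255, 32)
row2 = np.roll(row1, 2)                 # v2 shifted by a constant disparity
print(blend_row(row1, row2, np.full(32, 2, dtype=int), kappa=0.5)[:5])
```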
Equation (8.10) represents the general case of intermediate view generation for uncompressed data. As shown in Sect. 8.2, video compression introduces coding errors due to quantization, such that an original data block $x(n)$ and its reconstructed version after decoding $\tilde{x}(n)$ are different. In previous coding approaches, the color difference between both values determined the reconstruction quality for 2D and MVC [see (8.4) and (8.7), respectively]. In 3DVC, color and depth data are encoded. This leads to different reconstructed color values $\tilde{x}(n) \neq x(n)$ as well as different reconstructed depth values, where the latter cause different disparity values $\tilde{\mathbf{d}}(n) \neq \mathbf{d}(n)$. As shown in (8.10), disparity data is used for intermediate view synthesis and causes a position shift between the original and intermediate views. Therefore, depth coding errors result in a disparity offset or shift error $\Delta\mathbf{d} = \tilde{\mathbf{d}}(n) - \mathbf{d}(n)$. Consequently, Eq. (8.10) is subject to color as well as depth coding errors and becomes, for the coding case:
$$x_{v_S}(n_{v_S}) = \kappa\, \tilde{x}_{v_1}\!\left(n_{v_S} + (1-\kappa)(\mathbf{d} + \Delta\mathbf{d})\right) + (1-\kappa)\, \tilde{x}_{v_2}\!\left(n_{v_S} - \kappa(\mathbf{d} + \Delta\mathbf{d})\right). \qquad (8.11)$$
Equation (8.11) shows that coding of color data changes the interpolation value,
while coding of depth data causes disparity offsets, which reference neighboring
coded blocks at different positions in the horizontal direction. These neighboring
blocks may have a very different color contribution for the interpolation of $\tilde{x}_{v_S}(n_{v_S})$ in the color blending process (8.11). Especially at color edges, completely
different color values are thus used, which lead to strong sample scattering and
color bleeding in synthesized views.
For the compression of 3DV data, color and depth need to be jointly coded and
evaluated with respect to the intermediate views. One possibility for a joint coding
optimization can be obtained by comparing the original color contributions from
v1 and v2 in (8.10) with their coded and reconstructed versions in (8.11). Thus, the
following MSE distortion measure could be derived:
Over the whole range of intermediate view positions $\kappa$, the worst-case values are taken:
$$\mathrm{MSE}_{C,v2} = \max_{\forall \kappa} \mathrm{MSE}_{C,v2}(\kappa) \quad \text{and} \quad \mathrm{MSE}_{D,v2} = \max_{\forall \kappa} \mathrm{MSE}_{D,v2}(\kappa). \qquad (8.14)$$
In the literature, sometimes also the PSNR for the reconstructed intermediate views $\tilde{x}_{v_S}(n_{v_S})$ is calculated, assuming the uncoded synthesized view $x_{v_S}(n_{v_S})$ as a reference. This, however, does not reflect the subjective quality adequately, as synthesis errors in the uncoded intermediate view are not considered. Therefore, the final reconstruction quality has to be subjectively evaluated, as shown in the 3DV system output in Fig. 8.6.
The first-order co-occurrence matrix $C_{\Delta i, \Delta j}(k, l)$ of a picture $x$ is defined as
$$C_{\Delta i, \Delta j}(k, l) = \sum_{i=1}^{I-1} \sum_{j=1}^{J-1} \begin{cases} 1, & \text{if } x(i, j) = k \text{ and } x(i + \Delta i,\, j + \Delta j) = l \\ 0, & \text{else.} \end{cases} \qquad (8.15)$$
Here, $\Delta i \ge 0$ and $\Delta j \ge 0$ are the distances between pixel positions in the vertical and horizontal directions of a picture with $I \times J$ pixels. With this, $H_{\mathrm{spatial}}(k, l)$ accumulates both first-order neighborhood co-occurrence matrices for the vertical and horizontal directions over the entire sequence of pictures from $t = 1 \ldots T$:
$$H_{\mathrm{spatial}}(k, l) = \sum_{t=1}^{T} \left[ C_{0,1}(k, l, t) + C_{1,0}(k, l, t) \right]. \qquad (8.16)$$
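A minimal sketch of this accumulation for 8-bit pictures is given below; the function name and the random test frames are illustrative assumptions.

```python
import numpy as np

def spatial_correlation_histogram(frames, levels=256):
    """Accumulate first-order co-occurrence matrices C_{0,1} and C_{1,0}
    over a sequence of pictures, as in Eqs. (8.15) and (8.16).

    frames : iterable of (I, J) integer arrays with values in [0, levels)
    """
    h = np.zeros((levels, levels), dtype=np.int64)
    for x in frames:
        # horizontal neighbors (delta_i = 0, delta_j = 1)
        np.add.at(h, (x[:, :-1].ravel(), x[:, 1:].ravel()), 1)
        # vertical neighbors (delta_i = 1, delta_j = 0)
        np.add.at(h, (x[:-1, :].ravel(), x[1:, :].ravel()), 1)
    return h

rng = np.random.default_rng(0)
depth_like = rng.integers(0, 256, size=(4, 64, 64))
H = spatial_correlation_histogram(depth_like)
log_H = np.log1p(H)   # logarithmic scale as used for the plots in Fig. 8.9
print(H.sum())        # total number of accumulated neighbor pairs
```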
Fig. 8.9 Normalized logarithmic correlation histograms log(Hspatial(k, l)) for original luminance
and depth data of views v1 and v2, Book_Arrival sequence (see Fig. 8.7). The grayscale codes are
in logarithmic scale. The gray values at the main diagonal refer to the higher values up to 10.
With increasing distance from the diagonal, the histogram values decrease down to 0
CHs can be used to analyze the coded data and compare it to the original data.
Especially, the preservation of edge information and larger homogeneous areas
can be studied. An example for MVC coding is shown in Fig. 8.10 for the
Book_Arrival sequence at a medium data rate of 650 kbit/s.
The comparison of the decoded CHs in Fig. 8.10 with the original CHs in
Fig. 8.9 shows the differences in luminance and depth coding using MVC. The
luminance CHs show a stronger concentration along the diagonal due to typical
video coding artifacts, like low pass filtering. For the depth data, the discrete CH
values in the original data in Fig. 8.9 have spread out, such that a more continuous
CH is obtained for the coded version in Fig. 8.10. CH values and value clusters
toward the upper left and lower right corner of the CH represent important depth
edges. In the CH of the coded version, these values have disappeared or enlarged,
such as the lighter gray isolated areas of discrete values in Fig. 8.9, bottom.
This indicates that important details have been altered. The different changes in
Fig. 8.10 Normalized logarithmic correlation histograms log(Hspatial(k, l)) for decoded luminance and depth data of views v1 and v2, Book_Arrival sequence (see Fig. 8.7). The grayscale
codes are in logarithmic scale. The gray values at the main diagonal refer to the higher values up
to 10. With increasing distance from the diagonal, the histogram values decrease down to 0
luminance and depth CHs show that MVC was optimized for video coding, while
its application to depth coding removes important features.
Consequently, a number of alternative depth coding approaches have been
investigated for preserving the most important depth features for good intermediate view synthesis [45–49, 52]. These are discussed in detail in Chap. 9.
have been used, such as layered depth video (LDV) and depth-enhanced stereo
(DES) [50]. These formats are shown in Fig. 8.11.
Both LDV and DES can be derived from the MVD format. The LDV format
consists of one (central) video and depth map, as well as associated occlusion
information for video and depth data. In Fig. 8.11, this is shown as background
information, which needs to be extracted beforehand in a preprocessing step.
Alternatively, the background information can be reduced to pure occlusion or
difference information, which only contains background information behind
foreground objects. The LDV view synthesis generates the required output views
for stereoscopic and autostereoscopic displays by projecting the central video with
the depth information to the required viewing positions and filling the disoccluded
areas in the background with the occlusion information [51].
For LDV, higher compression efficiency might be obtained, as only one full
view and occlusion information at the same position needs to be encoded. In
comparison, a minimum of two views from different viewpoints and thus differentiated by disparity offset needs to be coded for MVD. On the other hand, the
background data in LDV that is revealed behind foreground objects has no original
reference. This information is originally obtained from the second view of an
MVD format or further original views, if available. Therefore, both formats
originate from the same input data. The two formats differ, however, in the stage of the encoding process at which the inter-view correlation between adjacent input views is exploited: In LDV, a redundancy reduction between the input views
already takes place at a preprocessing stage for the creation of this format, such
that the encoding cannot significantly reduce the compressed data size, especially
in disoccluded areas. For MVD, the redundancy reduction takes place in the
encoding process, where a precise inter-view prediction can better reduce the
compressed data size for areas, which are similar in adjacent views. Thus, similar
compression results are expected for LDV and MVD by fully optimized 3DV
coding methods.
As LDV only contains one original view, view synthesis always needs to be
carried out in order to obtain the second view. This can be problematic for pure
stereo applications that want to use two original or decoded views without further
view generation. Here, an alternative format is DES, which combines MVD and
LDV into a two-view representation with video, depth, occlusion video, and
occlusion depth signals for two views. The DES format enables pure stereo processing on one hand and additionally contains occlusion information for additional
view extrapolation toward either side of the stereo pair for multi-view displays on
the other hand. For pure stereo content, only limited occlusion information is
available, such that the MVD2 format is most suitable for stereo content. If, however, original multi-view content is available from a number of input cameras, the
creation of occlusion information from additional cameras is rather useful to generate a compact format that can be used to synthesize a wide range of output views.
Fig. 8.11 Different 3D video formats: MVD with 2 video and 2 depth signals, LDV with 1 video,
1 depth, 1 occlusion video and 1 occlusion depth signal, DES with 2 video, 2 depth, 2 occlusion
video, and 2 occlusion depth signals
When encoding video and depth data, a high quality of the entire viewing range of
synthesized views needs to be provided. Therefore, this chapter has shown how the
3DV encoder optimization emulates view synthesis functionality by analyzing the
coding distortion for intermediate views. Accordingly, color and depth MSE
consider the different types of video and depth coding errors. For the overall 3DV
system evaluation, subjective MOS needs to be obtained, either for the entire
viewing range or for stereo pairs out of this range. In addition, the generic 3DV
format is reconstructed before view synthesis for evaluating the coding efficiency
only by objective PSNR measures. Consequently, interdependencies between
compression technology and view synthesis methods are resolved. For the
assessment of depth coding methods, correlation histograms (CHs) have been
introduced. The differences between original and reconstructed CH reveal whether
a specific coding algorithm is capable of preserving important features. For depth
data coding, especially the important edges between foreground and background
need to be preserved, as they can lead to pixel displacement and consequently
wrong color shifts in intermediate views.
Finally, alternative formats, such as LDV and DES, have been discussed with
respect to application areas and coding efficiency in comparison with the MVD
format.
This chapter has summarized the fundamentals of 3DV coding, where coding
and view generation methods are applied to provide a range of high-quality output
views for any stereoscopic and multi-view display from a generic 3DV input at a
limited bit rate.
References
1. Benzie P, Watson J, Surman P, Rakkolainen I, Hopf K, Urey H, Sainov V, von Kopylow C
(2007) A survey of 3DTV displays: techniques and technologies. IEEE Trans Circuits Syst
Video Technol 17(11):16471658
2. Konrad J, Halle M (2007) 3-D displays and signal processing: an Answer to 3-D Ills? IEEE
Signal Proces Mag 24(6):21
3. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J
27(3):21632177
4. Berger T (1971) Rate distortion theory. Prentice-Hall, Englewood Cliffs
5. Wiegand T, Schwarz H (2011) Source coding: part I of fundamentals of source and video
coding. Found Trends Signal Proces 4(1-2):1222, Jan 2011. http://dx.doi.org/10.1561/
2000000010
6. Jayant NS, Noll P (1994) Digital coding of waveforms. Prentice-Hall, Englewood Cliffs
7. Huffman DA (1952) A method for the construction of minimum redundancy codes.
In: Proceedings IRE, pp 10981101, Sept 1952
8. Said A (2003) Arithmetic coding. In: Sayood K (ed) Lossless compression handbook.
San Diego, Academic, London
9. Chen Y, Wang Y-K, Ugur K, Hannuksela M, Lainema J, Gabbouj M (2009) The Emerging
MVC standard for 3D video services. EURASIP J Adv Sign Proces 2009(1)
10. ISO/IEC JTC1/SC29/WG11 (2008) Text of ISO/IEC 14496-10:200X/FDAM 1 multiview
video coding. Doc. N9978, Hannover, Germany, July 2008
11. Merkle P, Smolic A, Mueller K, Wiegand T (2007) Efficient prediction structures for
multiview video coding, invited paper. IEEE Trans Circuits Syst Video Technol
17(11):14611473
12. Shimizu S, Kitahara M, Kimata H, Kamikura K, Yashima Y (2007) View scalable multi-view
video coding using 3-d warping with depth map. IEEE Trans Circuits Syst Video Technol
17(11):14851495
13. Vetro A, Wiegand T, Sullivan GJ (2011) Overview of the stereo and multiview video coding
extensions of the H.264/AVC standard. Proc IEEE, Special issue on 3D Media and Displays
99(4):626642
14. ITU-T and ISO/IEC JTC 1 (2010) Advanced video coding for generic audiovisual services.
ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG-4 AVC), Version 10, March
2010
15. Wiegand T, Sullivan GJ, Bjøntegaard G, Luthra A (2003) Overview of the H.264/AVC video coding standard. IEEE Trans Circuits Syst Video Technol 13(7):560–576
16. Schwarz H, Marpe D, Wiegand T (2006) Analysis of hierarchical B pictures and MCTF,
ICME 2006. IEEE international conference on multimedia and expo, Toronto, July 2006
17. Strohmeier D, Tech G (2010) On comparing different codec profiles of coding methods for
mobile 3D television and video. In: Proceedings 3D systems and applications, Tokyo, May
2010
18. ISO/IEC JTC1/SC29/WG11 (2009) Vision on 3D video. Doc. N10357, Lausanne, Feb 2009
19. Müller K, Smolic A, Dix K, Merkle P, Wiegand T (2009) Coding and intermediate view synthesis of multi-view video plus depth. In: Proceedings IEEE international conference on image processing (ICIP'09), Cairo, pp 741–744, Nov 2009
20. Müller K, Merkle P, Wiegand T (2011) 3D video representation using depth maps. Proc IEEE, Special issue on 3D media and displays 99(4):643–656
21. Faugeras O (1993) Three-dimensional computer vision: a geometric viewpoint. MIT Press,
Cambridge
22. Hartley R, Zisserman A (2000) Multiple view geometry in computer vision. Cambridge University Press, Cambridge
23. Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. Int J Comput Vision 47(1):742
24. Bleyer M, Gelautz M (2005) A layered stereo matching algorithm using image segmentation
and global visibility constraints. ISPRS J Photogrammetry Remote Sens 59(3):128150
25. Szeliski R, Zabih R, Scharstein D, Veksler O, Kolmogorov V, Agarwala A, Tappen M,
Rother C (2006) A Comparative study of energy minimization methods for markov random
fields. European conference on computer vision (ECCV 2006), vol 2, pp 1629, Graz, May
2006
26. Atzpadin N, Kauff P, Schreer O (2004) Stereo analysis by hybrid recursive matching for realtime immersive video conferencing. IEEE Trans Circuits Syst Video Technol, Special issue
on immersive Telecommunications 14(3):321334
27. Cigla C, Zabulis X, Alatan AA (2007) Region-based dense depth extraction from multi-view
video. In: Proceedings IEEE international conference on image processing (ICIP07), San
Antonio, USA, pp 213216, Sept 2007
28. Felzenszwalb PF, Huttenlocher DP (2006) Efficient belief propagation for early vision. Int J
Comp Vision 70(1):41
29. Kolmogorov V (2006) Convergent tree-reweighted message passing for energy minimization.
IEEE Trans Pattern Anal Mach Intell 28(10):1568
30. Kolmogorov V, Zabih R (2002) Multi-camera scene reconstruction via graph cuts. European
conference on computer vision, May 2002
31. Lee S-B, Ho Y-S (2010) View consistent multiview depth estimation for three-dimensional
video generation. In: Proceedings IEEE 3DTV conference, Tampere, Finland, June 2010
32. Min D, Yea S, Vetro A (2010) Temporally consistent stereo matching using coherence
function. In: Proceedings IEEE 3DTV conference, Tampere, June 2010
33. Tanimoto M, Fujii T, Suzuki K (2008) Improvement of depth map estimation and view
synthesis. ISO/IEC JTC1/SC29/WG11, M15090, Antalya, Jan 2008
34. Müller K, Smolic A, Dix K, Merkle P, Kauff P, Wiegand T (2008) View synthesis for advanced 3D video systems. EURASIP J Image Video Proces, Special issue on 3D Image and Video Processing, vol 2008, Article ID 438148, 11 pages, 2008. doi:10.1155/2008/438148
35. Zitnick CL, Kang SB, Uyttendaele M, Winder S, Szeliski R (2004) High-quality video view
interpolation using a layered representation. ACM SIGGRAPH and ACM Transaction on
Graphics, Los Angeles, Aug 2004
36. Gokturk S, Yalcin H, Bamji C (2004) A time-of-flight depth sensor system description, issues
and solutions. In: Proceedings of IEEE computer vision and pattern recognition workshop,
vol 4, pp 3543
37. ISO/IEC DIS 14772-1 (1997) The virtual reality modeling language. April 1997
38. Würmlin S, Lamboray E, Gross M (2004) 3D video fragments: dynamic point samples for real-time free-viewpoint video. Computers and graphics, Special issue on coding, compression and streaming techniques for 3D and multimedia data, Elsevier, pp 3–14
39. Fusiello A, Trucco E, Verri A (2000) A compact algorithm for rectification of stereo pairs.
Mach Vis Appl 12(1):1622
40. Kauff P, Atzpadin N, Fehn C, Müller M, Schreer O, Smolic A, Tanger R (2007) Depth map creation and image based rendering for advanced 3DTV services providing interoperability and scalability. Signal processing: image communication. Special issue on 3DTV, Feb 2007
41. Redert A, de Beeck MO, Fehn C, IJsselsteijn W, Pollefeys M, Van Gool L, Ofek E, Sexton I, Surman P (2002) ATTEST – advanced three-dimensional television system techniques. In:
Proceedings of international symposium on 3D data processing, visualization and
transmission, pp 313–319, June 2002
42. Merkle P, Morvan Y, Smolic A, Farin D, Müller K, de With PHN, Wiegand T (2009) The effects of multiview depth video compression on multiview rendering. Signal Process: Image Commun 24(1–2):73–88
43. Liu Y, Huang Q, Ma S, Zhao D, Gao W (2009) Joint video/depth rate allocation for 3D video coding based on view synthesis distortion model. Signal Process: Image Commun 24(8):666–681
44. Merkle P, Singla J, Müller K, Wiegand T (2010) Correlation histogram analysis of depth-enhanced 3D video coding. In: Proceedings IEEE international conference on image processing (ICIP'10), Hong Kong, pp 2605–2608, Sept 2010
45. Choi J, Min D, Ham B, Sohn K (2009) Spatial and temporal up-conversion technique for
depth video. In: Proceedings IEEE international conference on image processing (ICIP09),
Cairo, Egypt, pp 741–744, Nov 2009
46. Daribo I, Tillier C, Pesquet-Popescu B (2008) Adaptive wavelet coding of the depth map for
stereoscopic view synthesis. In: Proceedings IEEE international workshop on multimedia
signal processing (MMSP'08), Cairns, Australia, pp 34–39, Oct 2008
47. Kim S-Y, Ho Y-S (2007) Mesh-based depth coding for 3D video using hierarchical decomposition of depth maps. In: Proceedings IEEE international conference on image processing (ICIP'07), San Antonio, pp V-117–V-120, Sept 2007
48. Kim W-S, Ortega A, Lai P, Tian D, Gomila C (2010) Depth map coding with distortion
estimation of rendered view. Visual information processing and communication, Proceedings
of the SPIE, vol 7543
49. Oh K-J, Yea S, Vetro A, Ho Y-S (2009) Depth reconstruction filter and down/up sampling for
depth coding in 3-D video. IEEE Signal Process Lett 16(9):747–750
50. Smolic A, Müller K, Merkle P, Kauff P, Wiegand T (2009) An overview of available and
emerging 3D video formats and depth enhanced stereo as efficient generic solution.
In: Proceedings picture coding symposium (PCS 2009), Chicago, May 2009
51. Müller K, Smolic A, Dix K, Kauff P, Wiegand T (2008) Reliability-based generation and view synthesis in layered depth video. In: Proceedings IEEE international workshop on multimedia signal processing (MMSP 2008), Cairns, pp 34–39, Oct 2008
52. Maitre M, Do MN (2009) Shape-adaptive wavelet encoding of depth maps. In: Proceedings
picture coding symposium (PCS09), Chicago, USA, May 2009
Chapter 9
G. Cheung (&)
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku,
Tokyo 101-8430, Japan
e-mail: cheung@nii.ac.jp
A. Ortega
University of Southern California, Los Angeles, CA, USA
e-mail: antonio.ortega@sipi.usc.edu
W.-S. Kim
Texas Instruments Inc., Dallas, TX, USA
e-mail: wskim@ti.com
V. Velisavljevic
University of Bedfordshire, Bedfordshire, UK
e-mail: vladan.velisavljevic@beds.ac.uk
A. Kubota
Chuo University, Hachio-ji, Tokyo, Japan
e-mail: kubota@elect.chuo-u.ac.jp
9.1 Introduction
As described in other chapters, 3D-TV systems often require multiple video channels
as well as depth information in order to enable synthesis of a large number of views at
the display. For example, some systems are meant to enable synthesis of any arbitrary
intermediate view chosen by the observer between two captured views, a media
interaction commonly known as free viewpoint. Synthesis can be performed via
depth-image-based rendering (DIBR).
Clearly, compression techniques will be required for these systems to be used in
practice. Extensive work has been done to investigate approaches for compression
of multiview video, showing that some gains can be achieved by jointly encoding
multiple views, using extensions of well-known techniques developed for single-view video. For example, well-established motion compensation methods can be
extended to provide disparity compensation (predicting a frame in a certain view
based on corresponding frames in neighboring views), thus reducing overall rate
requirements for multiview systems.
In this chapter, we consider systems where multiple video channels are transmitted, along with corresponding depth information for some of them. Because
techniques for encoding of multiple video channels, multiview video coding
(MVC), are based on well-known methods,1 here we focus on problems where new
coding tools are showing more promise.
In particular, we provide an overview of techniques for coding of depth data.
Depth data may require coding techniques different from those employed for
standard video coding, because of its different characteristics. For example, depth
signals tend to be piecewise smooth, unlike typical video signals which can
include significant texture information. Moreover, depth images are not displayed
directly and are instead used to synthesize intermediate views. Because of this,
distortion metrics that take the synthesis process into account need to be used in
the context of any rate-distortion (RD) optimization of encoding for these images.
Finally, since both depth and texture are encoded together, techniques for joint
encoding must be considered, including, for example, techniques to optimize the
bit allocation between video and depth. Note that some authors have suggested replacing or supplementing depth information in order to achieve better view synthesis
[2, 3]. These studies fall outside of the scope of this chapter.
This chapter is organized as follows. We start by discussing briefly, in Sect. 9.2,
those characteristics that make depth signals different from standard video signals.
This will help us provide some intuition to explain the different approaches that
have been proposed to deal with depth signals. In Sect. 9.3, we discuss approaches
that exploit specific characteristics of depth in the context of existing coding
algorithms. Then, in Sect. 9.4, we provide an overview of new methods that have
been proposed to specifically target depth signals. Because depth and texture are
1
As an example, the MPEG committee defined an MVC extension to an existing standard [1]
where no new coding tools were introduced.
Fig. 9.1 Temperature images of the absolute difference between the synthesized view with and without depth map coding, Champagne Tower
A more challenging problem arises in how these displacement errors lead to view
synthesis errors. Observe that a geometric error leads to a view synthesis error only
if the pixels used for interpolation in fact have different intensities. In an extreme case, if
pixel intensity is constant, even relatively large geometric errors may cause no view
synthesis error. Alternatively, if the intensity signal has a significant amount of texture,
even a minimal geometric error can lead to significant view synthesis error. This also
applies to occlusion. For example, some depth map areas will be occluded after
mapping to the target view. Depth map areas corresponding to occluding areas will
obviously have more impact on the rendered view quality than those in the occluded
area. Thus, the effect of depth errors on view synthesis is inherently local.
Moreover, this mapping is inherently non-linear, that is, we cannot expect
increases in view synthesis distortion to be proportional to increases in depth
distortion. Consider, for example, a target pixel within an object having constant
intensity. As the geometry error increases, the view synthesis error will remain
small and nearly constant, until the error is large enough that a pixel outside the
object is selected for interpolation, leading to a sudden large increase in view
synthesis distortion. Thus, depth information corresponding to pixels closer to the
object boundary will tend to be more sensitive; i.e., a large increase in synthesis
error will be reached with a smaller depth error.
Figure 9.1 illustrates these nonlinear and localized error characteristics. The
absolute difference between the views synthesized with and without depth map
coding is represented as a temperature image [9]. Although quantization is applied
uniformly throughout the frame, its effect on view interpolation is much more
significant near edges. As the quantization step size Q increases, the geometric error
increases, and thus the rendered view distortion also increases. Note, however, that
even though the quantization error in the depth map increases, the rendered view distortion
still remains localized to a relatively small portion of pixels in the frame.
These characteristics suggest that using depth error as a distortion metric may
lead to suboptimal behavior. Moreover, since the actual distortion is local in
nature, any alternative metrics to be developed for encoding are likely to be local
as well. These ideas will be further explored in Sect. 9.3.
The relationship between the actual depth value z and the 8-bit depth map value L_p(x_im, y_im) is given by

$$L_p(x_{im}, y_{im}) = \frac{\dfrac{1}{z} - \dfrac{1}{Z_{far}}}{\dfrac{1}{Z_{near}} - \dfrac{1}{Z_{far}}} \cdot 255, \qquad (9.1)$$

where Z_near and Z_far are the nearest and the farthest clipping planes, which correspond to the values 255 and 0 in the depth map, respectively, under the assumption that z, Z_near and Z_far are all positive or all negative [14].
Kim et al. [15] derived the geometric error due to a depth map error ΔL_p in the p-th view, using intrinsic and extrinsic camera parameters, as follows:

$$\begin{pmatrix}\Delta x_{im}\\ \Delta y_{im}\\ 1\end{pmatrix} = \Delta L_p(x_{im}, y_{im})\,\frac{1}{255}\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right) A_{p'}\,R_{p'}\left(T_p - T_{p'}\right), \qquad (9.2)$$
where Δx_im and Δy_im are the translation errors in the horizontal and vertical directions, respectively, and A_{p'}, R_{p'}, and T_{p'} are the camera intrinsic matrix, the rotation matrix, and the translation vector of the p'-th view, respectively. This reveals that there is a linear relationship between the depth map distortion ΔL and the translational rendering position error ΔP in the rendered view, which can be represented as

$$\Delta P = \begin{pmatrix}\Delta x_{im}\\ \Delta y_{im}\end{pmatrix} = \Delta L_p(x_{im}, y_{im})\begin{pmatrix}k_x\\ k_y\end{pmatrix}, \qquad (9.3)$$

where k_x and k_y are the scale factors determined from the camera parameters and
the depth ranges as shown in (9.2).
When the cameras are arranged in a parallel setting, a further simplification (rectification) can be made, so that there will be no translation other than in the horizontal direction [16]. In this case, there will be a difference only in the horizontal direction between the translation vectors, i.e., k_y = 0. In addition, the rotation matrix in (9.2) becomes an identity matrix. Neglecting radial distortion in the camera intrinsic matrix, the scaling factor k_x in (9.3) becomes
$$k_x = \frac{1}{255}\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right) f_x\,\Delta t_x, \qquad (9.4)$$
where f_x is the focal length divided by the effective pixel size in the horizontal direction, and Δt_x is the camera baseline distance. Note that the geometric distortion incurred will depend both on the camera parameters and on the depth characteristics of the scene: there will be increased distortion if the distance between cameras is larger or if the depth range is greater. Furthermore, the impact of a given geometric distortion on the view interpolation itself will depend on the local characteristics of the video. We describe coding techniques that take this into account next.
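To make these magnitudes concrete, the following minimal Python sketch evaluates (9.1) and (9.4) for a rectified camera pair; all parameter values are hypothetical, chosen only for illustration and not taken from any of the cited experiments.

```python
import numpy as np

def level_from_depth(z, z_near, z_far):
    """Quantize metric depth z to an 8-bit depth map value, per (9.1)."""
    return 255.0 * (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)

def depth_from_level(L, z_near, z_far):
    """Invert (9.1): recover metric depth z from an 8-bit depth map value L."""
    inv_z = (L / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z

def horizontal_scale_kx(z_near, z_far, fx, delta_tx):
    """Scale factor k_x of (9.4) for a rectified, parallel camera setup."""
    return (1.0 / 255.0) * (1.0 / z_near - 1.0 / z_far) * fx * delta_tx

# Hypothetical camera/scene parameters, for illustration only.
z_near, z_far = 0.5, 10.0           # clipping planes [m]
fx, delta_tx = 1000.0, 0.05         # focal length [px], camera baseline [m]

kx = horizontal_scale_kx(z_near, z_far, fx, delta_tx)
delta_L = 4                          # depth map coding error (levels)
print("k_x =", kx, "-> horizontal shift =", kx * delta_L, "px")
```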
$$\mathrm{SSE} = 2N\left(1 - \frac{1}{N}\sum_{n=1}^{N}\rho_1^{\,|\Delta P_n|}\right)\sigma^2_{X_n}, \qquad (9.5)$$
where N represents the number of pixels in the block, ρ_1 represents the video correlation when translated by one pixel, and σ²_{X_n} is the variance of the video block X_n. With this model, the estimated distortion increases as the correlation ρ_1 decreases or the variance increases, where these characteristics of the video signal are estimated on a block-by-block basis. As expected, distortion estimates obtained with this block-level model are less exact than a per-pixel analysis, but the accuracy is often sufficient, given that encoding decisions are performed at the block level [15].
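A minimal sketch of such a block-level estimate is shown below; how ρ_1 and σ²_{X_n} are estimated from the block is an assumption for illustration and may differ from the exact procedure in [15].

```python
import numpy as np

def block_sse_estimate(block, delta_p):
    """Estimate the rendered-view SSE for one video block, following (9.5).

    block   : 2D array of texture pixels (used to estimate rho_1 and sigma^2)
    delta_p : per-pixel geometric (disparity) errors |Delta P_n|, same shape
    """
    x = block.astype(np.float64)
    n = x.size
    var = x.var()
    # One-pixel horizontal correlation coefficient, estimated from the block.
    rho1 = np.corrcoef(x[:, :-1].ravel(), x[:, 1:].ravel())[0, 1]
    rho1 = np.clip(rho1, 0.0, 0.999)
    return 2.0 * n * (1.0 - np.mean(rho1 ** np.abs(delta_p).ravel())) * var

# Toy example: a textured 8x8 block and a uniform 1.5-pixel geometric error.
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8))
print(block_sse_estimate(block, np.full((8, 8), 1.5)))
```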
$$s = \sum_{i=1}^{N} a_i\,\phi_i. \qquad (9.6)$$
Only the non-zero quantized transform coefficients â_i are encoded and transmitted to the receiver for reconstruction of the approximate signal ŝ. The coefficients a_i are obtained using a complementary set of basis functions φ̃_i, i.e., a_i = ⟨s, φ̃_i⟩, where ⟨x, y⟩ denotes a well-defined inner product between two signals x and y in the Hilbert space R^N.
In general, coding gain can be achieved if the quantized transform coefficients â_i are sufficiently sparse relative to the original signal s, i.e., the number of non-zero coefficients â_i is small. Representations of typical image and video signals in popular transform domains such as the Discrete Cosine Transform (DCT) and the Wavelet Transform (WT) have been shown to be sparse in a statistical sense, given appropriate quantization, resulting in excellent compression performance.
In the case of depth map encoding, to further improve the representation sparsity of a given signal s in the transform domain, one can explicitly manipulate the values of less important depth pixels to sparsify the signal s, resulting in even fewer non-zero transform coefficients â_i, optimally trading off representation sparsity against its adverse effect on the synthesized view distortion.
In particular, in this section we discuss two methods to define depth-pixel importance, and the associated coding optimizations given the defined per-depth-pixel importance, to improve representation sparsity in the transform domain. In the first method, called don't care region (DCR) [22], one can define, for a given code block, a range of depth values for each pixel in the block, where any depth value within the range will worsen the synthesized view distortion by no more than a threshold value T. Given the defined per-pixel DCRs, one can formulate a sparsity maximization problem, where the objective is to find the sparsest representation in the transform domain while constraining the search space to be inside the per-pixel DCRs.
In the second method, for each depth pixel in a code block, [23] first defines a quadratic penalty function, where a larger deviation from its nadir (the ground truth depth value) results in a larger penalty.
Given a per-pixel synthesized view error function E_l(e; m, n), i.e., the error at pixel (m, n) of the left synthesized view when the depth value D_l(m, n) is perturbed by e, one can determine a DCR for left pixel D_l(m, n) given threshold T as follows: find the smallest lower bound depth value f(m, n) and the largest upper bound depth value g(m, n), with f(m, n) < D_l(m, n) < g(m, n), such that the resulting synthesized error E_l(e; m, n), for any depth error e with f(m, n) ≤ D_l(m, n) + e ≤ g(m, n), does not exceed E_l(0; m, n) + T.
See Fig. 9.3a for an example of a depth map D_l for the multiview sequence teddy [24], and Fig. 9.3b for an example of the ground truth depth D_l(m, n) (blue) and the DCR lower and upper bounds f(m, n) (red) and g(m, n) (black) for a pixel row in an 8 × 8 block in teddy. A similar procedure can be derived to find the DCR for the right depth map.
In general, a larger threshold T offers a larger subspace in which an algorithm can search for sparse representations in the transform domain, leading to compression gains, at the expense of a larger resulting synthesized distortion.
2
D_l(m, n) is more commonly called the disparity value, which is technically the inverse of the depth value. For simplicity of presentation, we assume this is understood from context and will refer to D_l(m, n) as the depth value.
Fig. 9.3 Depth map D_l (view 2) and DCR for the first pixel row in an 8 × 8 block for teddy at T = 7. a Depth map for teddy, b don't care region
In [22], this sparsity maximization is formulated as

$$\min_{s}\ \|\alpha\|_{0} \quad \text{s.t.} \quad \alpha = \Phi s,\;\; s(m,n) \in \mathrm{DCR}(m,n)\ \ \forall (m,n), \qquad (9.9)$$

and, replacing the l_0-norm with a weighted l_1-norm, as

$$\min_{s}\ \|\alpha\|_{l_1^w} \quad \text{s.t.} \quad \alpha = \Phi s,\;\; s(m,n) \in \mathrm{DCR}(m,n)\ \ \forall (m,n), \qquad (9.10)$$

where the weighted l_1-norm is defined as

$$\|\alpha\|_{l_1^w} = \sum_i w_i\,|a_i|. \qquad (9.11)$$
It is clear that if the weights are w_i = 1/|a_i| (for a_i ≠ 0), then the weighted l_1-norm is the same as the l_0-norm, and an optimal solution to (9.10) is also optimal for (9.9). Having fixed weights means that (9.10) can be solved using one of several known linear programming algorithms, such as Simplex [27]. Thus, it seems that if one can set appropriate weights w_i for (9.10) a priori, the weighted l_1-norm can promote sparsity, just like the l_0-norm. [25, 26] have indeed used an iterative algorithm in which the solution a_i of the previous iteration of (9.10) is used to set the weights for the optimization in the current iteration. See [22] for how the iterative algorithm in [25] is adapted to solve (9.9).
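As an illustration of this reweighted-l_1 strategy, the following Python sketch searches inside the per-pixel DCRs of one depth row for the representation with the smallest weighted l_1-norm, re-weighting after each linear program. The orthonormal DCT used as the transform Φ and the symmetric per-pixel bounds are assumptions made for the toy example; this is not the authors' implementation.

```python
import numpy as np
from scipy.fft import dct
from scipy.optimize import linprog

def sparsify_row(lower, upper, n_iter=4, eps=1e-3):
    """Reweighted-l1 sketch of (9.10): search inside the per-pixel DCRs
    [lower, upper] for the depth row whose DCT representation is sparsest."""
    n = lower.size
    Phi = dct(np.eye(n), axis=0, norm="ortho")   # orthonormal DCT matrix
    w = np.ones(n)                               # initial l1 weights
    for _ in range(n_iter):
        # Variables z = [s (n values), t (n values)] with t >= |Phi s|.
        c = np.concatenate([np.zeros(n), w])
        A_ub = np.block([[ Phi, -np.eye(n)],
                         [-Phi, -np.eye(n)]])
        b_ub = np.zeros(2 * n)
        bounds = [(lo, hi) for lo, hi in zip(lower, upper)] + [(0, None)] * n
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        s = res.x[:n]
        alpha = Phi @ s
        w = 1.0 / (np.abs(alpha) + eps)          # re-weight as in [25]
    return s, alpha

# Toy 8-pixel row: a hypothetical DCR of +/-2 levels around the ground truth.
gt = np.array([60, 60, 61, 61, 120, 120, 121, 121], dtype=float)
s_opt, coeffs = sparsify_row(gt - 2, gt + 2)
print(np.round(coeffs, 2))
```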
$$g_i(s_i) = a_i s_i^2 + b_i s_i + c_i, \qquad (9.12)$$

where s_i is the depth value corresponding to pixel location i, and a_i, b_i and c_i are the quadratic function parameters. The procedure used in [23] to fit g_i(s_i) to the error function is as follows. Given a threshold q, first seek the nearest depth value D_l(m, n) − e below the ground truth D_l(m, n) that results in an error E_l(e; m, n) exceeding q E_l(0; m, n). Using only the two data points at D_l(m, n) − e and D_l(m, n), and assuming that g_i(s_i) has its minimum at the ground truth depth value D_l(m, n), one can construct one quadratic function. A similar procedure is applied to construct another penalty function using the two data points at D_l(m, n) + e and D_l(m, n) instead. The sharper of the two constructed functions (larger a_i) is the chosen penalty function for this pixel.
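The small Python sketch below mimics this two-point fitting procedure under stated assumptions (a made-up per-pixel error function, and a penalty whose vertex value equals the error at the ground truth depth); it is illustrative only and not the exact construction of [23].

```python
import numpy as np

def fit_penalty(err_fn, d_gt, q, e_max=20):
    """Two-point quadratic penalty fit, sketched after the procedure of [23]:
    on each side of the ground truth depth d_gt, find the nearest depth whose
    synthesized error exceeds q times the error at d_gt, fit a quadratic with
    vertex at d_gt through that point, and keep the sharper of the two."""
    e0 = err_fn(d_gt)
    def curvature(sign):
        for e in range(1, e_max + 1):
            err = err_fn(d_gt + sign * e)
            if err > q * e0:
                return (err - e0) / float(e * e)   # a_i of the fitted quadratic
        return 0.0
    a = max(curvature(-1), curvature(+1))          # sharper quadratic (larger a_i)
    return lambda s: a * (s - d_gt) ** 2 + e0

def toy_error(d):            # made-up synthesized-error function for one pixel
    return 1.0 if abs(d - 100) < 4 else 1.0 + 2.0 * (abs(d - 100) - 3)

g = fit_penalty(toy_error, d_gt=100, q=3.0)
print(g(100), g(105))
```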
Continuing with our earlier example, we see in Fig. 9.4a that two quadratic functions (in dashed lines) with minimum at the ground truth depth value are constructed. The narrower of the two is chosen as the penalty function. In Fig. 9.4b, the per-pixel curvature (parameter a_i) of the penalty functions of the right depth map of Teddy is shown. We can clearly see that larger curvatures (larger penalties in white) occur at object boundaries, agreeing with our intuition that a synthesized view is more sensitive to depth pixels at object boundaries.

Fig. 9.4 Error and quadratic penalty functions constructed for one pixel in the right view (view 6), and curvature of penalty functions for the entire right view in Teddy. a Per-pixel penalty function, b penalty curvature for Teddy
Given the per-pixel penalty functions g_i, [23] then formulates the depth sparsification as

$$\min_{\alpha}\ \|\alpha\|_{0} + \lambda \sum_i g_i\!\left(\phi^{-1}_i\,\alpha\right), \qquad (9.13)$$

where φ^{-1}_i is the i-th row of the inverse transform Φ^{-1}, and λ is a weight parameter to trade off transform domain sparsity and the resulting synthesized view distortion. As mentioned earlier, minimizing the l_0-norm is combinatorial and non-convex, and so (9.13) is difficult to solve efficiently. Instead, [23] replaces the l_0-norm in (9.13) with a weighted l_2-norm [28]:
$$\min_{\alpha}\ \sum_i w_i a_i^2 + \lambda \sum_i g_i\!\left(\phi^{-1}_i\,\alpha\right). \qquad (9.14)$$
For a fixed set of weights w_i, (9.14) can be efficiently solved as an unconstrained quadratic program [29] (see [23] for details). The challenge again is how to choose the weights w_i such that, when (9.14) is solved iteratively, minimizing the weighted l_2-norm is sparsity promoting. To accomplish this, [23] adopts the iterative re-weighted least squares (IRLS) approach [26, 28]. The key point is that, after obtaining a solution α° in one iteration, each weight w_i is assigned 1/|a°_i|² if |a°_i| is sufficiently larger than 0, so that the contribution of the i-th non-zero coefficient, w_i |a°_i|², is roughly 1. See [23] for details of the iterative algorithm.
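A compact IRLS sketch of (9.14) is shown below, assuming an orthonormal DCT as the transform and hypothetical per-pixel curvatures a_i; it illustrates the alternation between solving the quadratic program and re-weighting, not the authors' implementation.

```python
import numpy as np
from scipy.fft import dct

def irls_sparsify(mu, curvature, lam=10.0, n_iter=10, eps=1e-6):
    """IRLS sketch of (9.14): trade DCT-domain sparsity against per-pixel
    quadratic penalties g_i(s_i) = a_i * (s_i - mu_i)**2.

    mu        : ground-truth depth values mu_i (1D array of length N)
    curvature : per-pixel penalty curvatures a_i (1D array of length N)
    """
    n = mu.size
    Phi = dct(np.eye(n), axis=0, norm="ortho")   # forward transform
    Psi = Phi.T                                  # inverse (orthonormal)
    A = np.diag(curvature.astype(float))
    w = np.ones(n)
    for _ in range(n_iter):
        # Normal equations of: sum_i w_i*a_i^2 + lam*(Psi a - mu)^T A (Psi a - mu)
        M = np.diag(w) + lam * Psi.T @ A @ Psi
        alpha = np.linalg.solve(M, lam * Psi.T @ A @ mu)
        w = 1.0 / np.maximum(alpha ** 2, eps)    # IRLS re-weighting
    return alpha, Psi @ alpha                    # coefficients, depth values

# Toy example: pixels near the object boundary get a large curvature a_i,
# so the sparsified block stays close to the ground truth exactly there.
mu = np.array([60, 60, 61, 61, 120, 120, 121, 121], dtype=float)
a = np.array([0.1, 0.1, 0.1, 5.0, 5.0, 0.1, 0.1, 0.1])
coeffs, s_hat = irls_sparsify(mu, a)
print(np.round(coeffs, 2))
```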
significant energy. This leads to a higher bit rate, or potentially highly visible coding artifacts when operating at low rates, due to coarse quantization of transform coefficients.
To solve this problem, variations of the DCT have been proposed, such as the shape-adaptive DCT [38], directional DCT [39–41], spatially varying transform [42, 43], variable block-size transform [44], direction-adaptive partitioned block transform [45], etc., in which the transform block size is changed according to the edge location, or the signal samples are rearranged to be aligned with the main direction of the dominant edges in a block. The Karhunen-Loève transform (KLT) has also been used for shape-adaptive transforms [33] or intra-prediction direction-adaptive transforms [46]. These approaches can be applied efficiently to certain patterns of edge shapes, such as straight lines with preset orientation angles; however, they are not efficient for edges with arbitrary shapes. The Radon transform has also been used for image coding [47, 48], but perfect reconstruction is only possible for binary images. Platelets [49] have been applied to depth map coding [50], approximating depth map images as piecewise planar signals. Since depth maps are not exactly piecewise planar, this representation will have an irreducible approximation error.
To solve these problems, a graph-based transform (GBT) has been proposed as an edge-adaptive block transform that represents signals using graphs, where no connection between nodes (or pixels) is set across an image edge.3 The GBT works well for depth map coding since depth maps consist of smooth regions with sharp edges between objects at different depths. Now, we describe how to construct the transform and apply it to depth map coding. Refer to [37, 51] for detailed properties and analysis of the transform.
The transform construction procedure consists of three steps: (1) edge detection
on a residual block, (2) generation of a graph from pixels in the block using the
edge map, and (3) construction of transform matrix from the graph.
In the first step, after the intra/inter-prediction, edges are detected in a residual
block based on the difference between the neighboring residual pixel values.
A simple thresholding technique can be used to generate the binary edge map.
Then, the edge map is compressed and included into a bitstream, so that the same
transform matrix can be constructed at the decoder side.
In the second step, each pixel position is regarded as a node in a graph G, and neighboring nodes are connected either by 4-connectivity or 8-connectivity, unless there is an edge between them. From the graph, the adjacency matrix A is formed, where A(i, j) = A(j, i) = 1 if pixel positions i and j are immediate neighbors not separated by an edge; otherwise, A(i, j) = A(j, i) = 0. The adjacency matrix is then used to compute the degree matrix D, where D(i, i) equals the number of non-zero entries in the i-th row of A, and D(i, j) = 0 for all i ≠ j.
3
Note that while edge can refer to a link or connection between nodes in graph theory, we only use the term edge to refer to an image edge, to avoid confusion.
Fig. 9.6 Example of a 2 × 2 block. Pixels 1 and 2 are separated from pixels 3 and 4 by a single vertical edge (shown as the thinner dotted line), so that only the pixel pairs (1, 2) and (3, 4) remain connected. The corresponding adjacency matrix A and degree matrix D are

$$A = \begin{pmatrix}0&1&0&0\\1&0&0&0\\0&0&0&1\\0&0&1&0\end{pmatrix}, \qquad D = \begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix},$$

along with the Laplacian matrix L = D − A and the resulting GBT E_t, whose rows are the normalized eigenvectors of L (with entries 0 and ±1/√2).
In the third step, from the adjacency and the degree matrices, the Laplacian matrix is computed as L = D − A [52]. Figure 9.6 shows an example of these three matrices. Then, projecting a signal defined on the graph G onto the eigenvectors of the Laplacian L yields a spectral decomposition of the signal, i.e., it provides a frequency domain interpretation of the signal on the graph. Thus, a transform matrix can be constructed from the eigenvectors of the Laplacian of the graph. Since the Laplacian L is symmetric, the eigenvector matrix E can be efficiently computed using the well-known cyclic Jacobi method [53], and its transpose, E^T, is taken as the GBT matrix. Note that the eigenvalues are sorted in ascending order, and the corresponding eigenvectors are put into the matrix in that order. This leads to transform coefficients ordered in ascending order in the frequency domain.
It is also possible to combine the first and second steps [51]. Instead of
generating the edge map explicitly, we can find the best transform kernel for the
given block signal by searching for the optimal adjacency matrix. While the
number of possible matrices is large, it is possible to use a greedy search to obtain
adjacency matrices that lead to better RD performance [51].
Transform coefficients are computed as follows. For an N × N block of residual pixels, form a one-dimensional input vector x by concatenating the columns of the block into a single N² × 1 vector, i.e., x_{Nj+i} = X(i, j) for all i, j = 0, 1, ..., N − 1. The GBT transform coefficients are then given by y = E^T x, where y is also an N² × 1 vector. The coefficients are quantized with a uniform scalar quantizer followed by entropy coding. Unlike the DCT, which uses a zigzag scan of transform coefficients for entropy coding, the GBT does not need any such rearrangement since its coefficients are already arranged in ascending order in the frequency domain.
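The construction can be summarized in a few lines of Python. The sketch below assumes a per-pixel segment label derived from the edge map (one possible way of realizing "no connection across an image edge"), uses 4-connectivity, and reproduces the 2 × 2 example of Fig. 9.6; it is an illustrative sketch, not the implementation of [37, 51, 54].

```python
import numpy as np

def gbt_matrix(segment_map):
    """Build a GBT for an N x N residual block.

    segment_map[i, j] is a per-pixel label derived from the binary edge map:
    neighboring pixels are connected only if they carry the same label, i.e.,
    no connection is set across an image edge (one possible convention)."""
    n = segment_map.shape[0]
    idx = lambda i, j: n * j + i                 # column-major, as in the text
    A = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            for di, dj in ((1, 0), (0, 1)):      # 4-connectivity
                ii, jj = i + di, j + dj
                if ii < n and jj < n and segment_map[i, j] == segment_map[ii, jj]:
                    A[idx(i, j), idx(ii, jj)] = A[idx(ii, jj), idx(i, j)] = 1
    D = np.diag(A.sum(axis=1))
    L = D - A                                    # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)               # ascending eigenvalues (DC first)
    return eigvecs.T                             # rows form the GBT basis E^T

def gbt_forward(block, Et):
    x = block.flatten(order="F")                 # concatenate the columns
    return Et @ x

# 2 x 2 example of Fig. 9.6: a vertical edge splits the block into two halves.
segment_map = np.array([[0, 1],
                        [0, 1]])
Et = gbt_matrix(segment_map)
block = np.array([[10.0, 50.0],
                  [12.0, 52.0]])
print(np.round(gbt_forward(block, Et), 2))
```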
To achieve the best RD performance, one can select between DCT and GBT on
a per-block basis in an optimal fashion. For example, for each block the RD cost
can be calculated for both DCT and GBT, and the smaller one can be selected.
Overhead indicating the transform that was chosen can be encoded into the bitstream for each block, and the edge map is provided only for blocks coded using
GBT. It has been reported in [54] that a coding efficiency improvement of 14 % on average can be achieved when the GBT is used to compress various depth map sequences.
to derive the appropriate motion vectors for code blocks in depth maps using
motion vectors of code blocks of different sizes in texture maps. Daribo et al. [5]
also employed this idea of motion information sharing, but the motion vectors are searched by minimizing the energy of both the texture and depth maps, where a parameter α is used to tune the relative weight of the two maps. Further, [5] included optimal bit allocation as well, where optimal quantization parameters are chosen for the texture and depth maps (to be discussed further in the next section).
Instead of sharing motion information between texture and depth maps, edge
information can also be shared, if an edge-adaptive coding scheme is adopted. For
compression of multi-view images (no time dimension) for DIBR at decoder, [36]
proposed to use edge-adaptive WT to code texture and depth maps at multiple
capturing viewpoints. Furthermore, the authors exploit the correlation of the edge
locations in texture and depth maps, relying on the fact that the edges in a depth
map reappear in the corresponding texture map of the same viewpoint and, hence,
these edges are encoded only once to save bit rate.
More recently, [56] proposed to use encoded depth maps as side information to
assist in the coding of texture maps. Specifically, observing that pixels of similar
depth have similar motion (first observed in [57]), a texture block containing pixels
of two different levels of depth (e.g., foreground object and background) was
divided into two corresponding subblocks, so that each can perform its own motion
compensation separately. Introducing this new subblock motion compensation mode, in which the subblocks have arbitrary shapes according to the depth boundary in the block, showed up to 0.7 dB performance gain in PSNR over a native H.264 implementation with variable-size but strictly rectangular blocks for motion compensation.
the encoded signal components for a given maximal bit budget. Depending on the
assumed model, the optimization can be analytical (leading to a closed-form
solution), numerical (resulting in a numerically optimal solution), or a mixture of the
two.
The distortion-related objective function can reflect the distortions of individual
captured and encoded views, the weighted sum (or average) of these distortions, or
the distortions of synthesized views. In the last case, the deployed view synthesis
tools need additional information about the scene to be encoded along with the
captured views (e.g., disparity or depth maps for DIBR), which is penalized by the
required overhead coding rate.
While the model-based coding for traditional single-camera image and video
signals has been well studied in the literature [58–61], it has become popular in
MVC only recently. In [62], the correlation between multi-view texture and depth
video frames is exploited in the concept of free viewpoint television (FTV). The
depth video frames are first encoded using joint multi-view video coding model
(JMVM) and view synthesis prediction [63] and, then, the multi-view texture
video sequences are processed using the same intermediate prediction results as
side information. However, the rate allocation strategy in this work has been
completely inherited from the JMVM coding without any adaptive rate distribution
between texture and depth frames.
The correlation between texture and depth frames has also been analyzed in [5]
such that the motion vector fields for these two types of sequences are estimated
jointly as a unique field. The rate is adaptively distributed between texture and
depth multi-view video sequences in order to minimize the objective function,
which is a weighted sum of the distortions of compressed texture and depth
frames. The separate RD relations for the compressed texture and depth frames are
adopted from [18] and they are verified for high rates and high spatial resolutions.
Finally, the MPEG coder is applied to the texture and depth sequences with the
optimized rates as an input.
Another method for optimally distributing the bit rate between the texture and
depth video stream has been proposed in [64]. The minimized objective function is
a view synthesis distortion at a specific virtual viewpoint, where this distortion is
influenced by the additive factors obtained by texture and depth compression and
geometric error in the reconstructions. The distortion induced by depth compression is characterized by a piecewise linear function with respect to the motion
warping error. Given this model and a fixed set of synthesis viewpoints, the
method computes the optimal quantization points for texture and depth sequences.
Within the concept of 3D-TV, with only two captured and coded views, the
work in [65] has proposed a solution for bit allocation among the texture and depth
sequences such that the synthesized view distortion at a fixed virtual viewpoint is
minimized. Here, the distortions of the left and right texture and depth frames are
modeled as linear functions of the corresponding quantization step sizes [66],
whereas the allocated rates are characterized by a fractional model [67]. The
Lagrangian multiplier-based optimization then results in the optimal quantization
step sizes for the texture and depth sequences.
In [68], the authors exploited the fact that depth video sequences consist mostly
of large homogeneous areas separated by sharp edges, exhibiting significantly
different characteristics as compared to their texture counterparts. Two techniques
for depth video sequence compression are proposed to reduce the coding artifacts
that appear around sharp edges. The first method uses trilateral filtering of depth
frames, where the filters are designed by taking into account correlation among
neighboring pixels such that the edges preserve their sharpness after compression. In the
second approach, the depth frames are segmented into blocks with approximately
the same depth and the entire block is represented by a single depth value that
minimizes the mean absolute error. For efficient encoding of the block shapes and
sizes, the edge locations in the depth frames are predicted from the same blocks in
the texture frames, assuming a high correlation between the two types of edges.
A detailed model for rate control in the H.264 MVC coder [69, 70] and rate
allocation between texture and depth frames across the view and temporal
dimensions is presented in [71]. In the model, three levels of rate control are
adopted. The view level rate control inherits the allocation across views from the
predictive coding used in the H.264/MVC coder, where, statistically, the I frames in the view dimension are commonly encoded with the most bits, followed by the B frames
and P frames. For the texture/depth level rate control, the allocated rates to depth
and texture frames are linearly related such that the depth rate is smaller than the
texture rate. Such a relation reflects the chosen quantization levels for both types of
frames in the H.264/MVC coder. Finally, in the frame level rate control, the rates
allocated to the I, B, and P frames across the temporal dimension are linearly
scaled, where the scaling factors are chosen empirically. Such a model allows for
fine tuning of the rate allocation across multiple dimensions in the MVC coder and
for providing the optimal solutions given various optimization criteria, not only the
synthesized view distortion.
When model-based rate allocation is applied to multi-view imaging instead of
video, the lack of temporal dimension leads to a simpler and smaller data structure.
This allows the models and allocation methods to become more complex, and also more
accurate. Davidoiu et al. [72] introduce a model of error variance between the
captured and disparity compensated left and right reference views. They decompose this error into three decorrelated additive terms and approximate the relation
of each with respect to the operational quantization step size. Finally, the available
rate is distributed among the encoded images such that the objective distortion,
which is the sum of the left and right reference view distortions, is minimized. The
source RD relation is adopted from [73]. The rate allocation algorithm is also
applied to multiview videos, where each subset of frames at the same temporal
instant is considered as a multi-view image data set.
Another method for rate allocation in multi-view imaging [74] approximates
the scene depth by a finite set of planar layers with associated constant depths or
disparity compensation vectors. The multi-dimensional WT is then applied to the
captured texture layers taking into account the disparity compensation across
views for each layer. Both the wavelet coefficients and layer contours are encoded,
where the corresponding RD relations are analytically modeled. The available bits
are distributed among encoding the wavelet coefficients and layer contours so that the aggregate distortion of the encoded images is minimized.

Fig. 9.7 Two examples of sampled virtual view distortion and the estimated cubic model using a linear least-squares estimator for the Middlebury [24] data sets Bowling2 (left) and Rocks2 (right)
In [75, 76], the authors model the distortion of synthesized views using DIBR at
any continuous viewpoint between the reference views as a cubic polynomial with
respect to the distance between the synthesis and reference viewpoints. They adopt
from [6] that the synthesized texture pixels are obtained in DIBR by blending the
warped pixels from the left and right reference views, linearly weighted by the
distance between the synthesized and reference viewpoints. Further, they show
that, after compression of the reference views, the resulting mean-square quantization error of the blended synthesized pixels consists of two multiplicative factors
related to the quantization errors of the corresponding reference texture and depth
images, respectively. The first factor depends on the quantization of the reference
textures and, due to the linear blending and the mean square error evaluation, it
can be expressed as a quadratic function of the distance between synthesis and
reference viewpoints. The second factor reflects the quantization of the reference
depth images, which results in a geometrical distortion in the synthesized image
because of erroneous disparity information. This geometrical distortion is
estimated by assuming a linear spatial correlation among the texture pixels [15]
leading to a linear relation between the synthesized view mean square error and the
view distances. Finally, multiplying these two factors gives the cubic behavior of
the distortion across the synthesis viewpoints. The experiments exhibit an accurate
matching between the obtained distortions and the model, as illustrated in Fig. 9.7.
Furthermore, the model is used to optimally allocate the rate among the encoded
texture and depth images and the resulting RD coding performance is compared in
Fig. 9.8 to the other related methods applied to two data sets.
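As a small illustration of the model, a cubic polynomial can be fitted to sampled synthesized-view distortions with an ordinary least-squares fit; the viewpoint positions and distortion values below are hypothetical, used only to show the fitting step.

```python
import numpy as np

# Sketch of the cubic synthesized-view distortion model of [75, 76]: fit a
# degree-3 polynomial D_s(x) to distortions sampled at a few virtual viewpoints
# x in [0, 1] between the two reference views. The sample values are made up.
x = np.array([0.1, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9])       # viewpoint position
d = np.array([40.0, 52.0, 60.0, 63.0, 61.0, 55.0, 44.0])  # measured MSE (toy)

coeffs = np.polyfit(x, d, deg=3)      # linear least-squares cubic fit
model = np.poly1d(coeffs)
print("D_s(0.5) ~", round(model(0.5), 1))
```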
Fig. 9.8 RD performance of the optimal rate allocation compression algorithm compared to the
performance of a simple uniform allocation and H.264/AVC. The results are obtained for two
data sets: Bowling2 (left) and Rocks2 (right) from the Middlebury database [24]. The optimal
allocation compression always outperforms the compression with the uniform allocation and has
a better RD performance than H.264/AVC at mid- and high-range rates. However, at lower rates,
the sophisticated motion compensation tools in H.264/AVC have a key influence on the RD
performance and thus lead to a better quality for the virtual views
9.6 Summary
In this chapter, toward the goal of compact representation of texture and depth
maps for free viewpoint image synthesis at decoder via DIBR, we discuss the
problem of depth map compression. We first study the unique characteristics of
depth maps. We then study new coding optimizations for depth maps using existing coding tools, followed by new compression algorithms designed specifically for depth maps. Finally, we overview proposed techniques for joint texture/depth map coding, including the important problem of bit allocation between texture and depth.
As depth map coding is still an emerging topic, many unresolved questions remain for future research. First, the question of how many depth maps should be encoded, and at what resolution, for a desired view synthesis quality at the decoder remains open. For example, if the communication bit budget is scarce, it is not clear that depth maps densely sampled across views, which are themselves only auxiliary data providing geometric information, should be transmitted at full resolution. This is especially questionable given that depth maps are in many cases estimated from texture maps (available at the decoder, albeit in lossily compressed form) using stereo-matching algorithms in the first place. Further, whether depth maps are the sole appropriate auxiliary data for view synthesis remains an open question. Alternatives to depth maps [2, 3] for view synthesis have already been proposed in the literature. Further investigation into more efficient representations of a 3D scene in motion is warranted.
References
1. Vetro A, Wiegand T, Sullivan GJ (2011) Overview of the stereo and multiview video coding
extensions of the H.264/MPEG-4 AVC standard. Proc IEEE 99(4):626–642
2. Kim WS, Ortega A, Lee J, Wey H (2011) 3D video quality improvement using depth
transition data. In: IEEE international workshop on hot topics in 3D. Barcelona, Spain
3. Farre M, Wang O, Lang M, Stefanoski N, Hornung A, Smolic A (2011) Automatic content
creation for multiview autostereoscopic displays using image domain warping. In: IEEE
international workshop on hot topics in 3D. Barcelona, Spain
4. Oh H, Ho YS (2006) H.264-based depth map sequence coding using motion information of
corresponding texture video. In: The Pacific-Rim symposium on image and video technology.
Hsinchu, Taiwan
5. Daribo I, Tillier C, Pesquet-Popescu B (2009) Motion vector sharing and bit-rate allocation
for 3D video-plus-depth coding. In: EURASIP: special issue on 3DTV in Journal on
Advances in Signal Processing, vol 2009
6. Merkle P, Morvan Y, Smolic A, Farin D, Muller K, de With P, Wiegand T (2009) The effects
of multiview depth video compression on multiview rendering. Signal Process Image
Commun 24:73–88
7. Leon G, Kalva H, Furht B (2008) 3D video quality evaluation with depth quality variations.
In: Proceedings of 3DTV-conference: the true vision - capture, transmission and display of
3D video, 3DTV-CON 2008. Istanbul, Turkey
8. Tanimoto M, Fujii T, Suzuki K (2009) View synthesis algorithm in view synthesis reference
software 2.0 (VSRS2.0). Document M16090, ISO/IEC JTC1/SC29/WG11
9. Kim WS, Ortega A, Lee J, Wey H (2010) 3-D video coding using depth transition data. In:
IEEE picture coding symposium. Nagoya, Japan
10. Müller K, Smolic A, Dix K, Merkle P, Wiegand T (2009) Coding and intermediate view
synthesis of multiview video plus depth. In: Proceedings of IEEE international conference on
image processing, ICIP 2009. Cairo, Egypt
11. Nguyen HT, Do MN (2009) Error analysis for image-based rendering with depth information.
IEEE Trans Image Process 18(4):703–716
12. Ramanathan P, Girod B (2006) Rate-distortion analysis for light field coding and streaming.
Signal Process Image Commun 21(6):462–475
13. Kim WS, Ortega A, Lai P, Tian D, Gomila C (2009) Depth map distortion analysis for view
rendering and depth coding. In: IEEE international conference on image processing. Cairo, Egypt
14. Video (2010) Report on experimental framework for 3D video coding. Document N11631,
ISO/IEC JTC1/SC29/WG11
15. Kim WS, Ortega A, Lai P, Tian D, Gomila C (2010) Depth map coding with distortion
estimation of rendered view. In: SPIE visual information processing and communication. San
Jose, CA
16. Lai P, Ortega A, Dorea C, Yin P, Gomila C (2009) Improving view rendering quality and coding
efficiency by suppressing compression artifacts in depth-image coding. In: Proceedings of SPIE
visual communication and image processing, VCIP 2009. San Jose, CA, USA
17. Ortega A, Ramchandran K (1998) Rate-distortion techniques in image and video
compression. IEEE Signal Process Mag 15(6):23–50
18. Sullivan G, Wiegand T (1998) Rate-distortion optimization for video compression. IEEE
Signal Process Mag 15(6):74–90
19. Wiegand T, Girod B (2001) Lagrange multiplier selection in hybrid video coder control. In:
IEEE international conference on image processing. Thessaloniki, Greece
20. Wiegand T, Sullivan G, Bjontegaard G, Luthra A (2003) Overview of the H.264/AVC video
coding standard. IEEE Trans Circuits Syst Video Technol 13(7):560–576
21. Mark W, McMillan L, Bishop G (1997) Post-rendering 3D warping. In: Symposium on
interactive 3D graphics. New York, NY
22. Cheung G, Kubota A, Ortega A (2010) Sparse representation of depth maps for efficient
transform coding. In: IEEE picture coding symposium. Nagoya, Japan
23. Cheung G, Ishida J, Kubota A, Ortega A (2011) Transform domain sparsification of depth
maps using iterative quadratic programming. In: IEEE international conference on image
processing. Brussels, Belgium
24. (2006) Middlebury stereo datasets. http://vision.middlebury.edu/stereo/data/scenes2006/
25. Candes EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted l1 minimization.
J Fourier Anal Appl 14(5):877–905
26. Wipf D, Nagarajan S (2010) Iterative reweighted l1 and l2 methods for finding sparse
solutions. IEEE J Sel Top Sign Process 4(2):317–329
27. Papadimitriou CH, Steiglitz K (1998) Combinatorial optimization: algorithms and
complexity. Dover, NY
28. Daubechies I, Devore R, Fornasier M, Gunturk S (2010) Iteratively re-weighted least squares
minimization for sparse recovery. Commun Pure Appl Math 63(1):1–38
29. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press,
Cambridge
30. Valenzise G, Cheung G, Galvao R, Cagnazzo M, Pesquet-Popescu B, Ortega A (2012)
Motion prediction of depth video for depth-image-based rendering using don't care regions.
In: Picture coding symposium. Krakow, Poland
31. Gilge M, Engelhardt T, Mehlan R (1989) Coding of arbitrarily shaped image segments based
on a generalized orthogonal transform. Signal Process Image Commun 1:153–180
32. Chang SF, Messerschmitt DG (1993) Transform coding of arbitrarily-shaped image
segments. In: Proceedings of 1st ACM international conference on multimedia. Anaheim,
CA, pp 83–90
33. Sikora T, Bauer S, Makai B (1995) Efficiency of shape-adaptive 2-D transforms for coding of
arbitrarily shaped image segments. IEEE Trans Circuits Syst Video Technol 5(3):254–258
34. Li S, Li W (2000) Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual
object coding. IEEE Trans Circuits Syst Video Technol 10(5):725–743
35. Freeman H (1961) On the encoding of arbitrary geometric configurations. IRE Trans Electron
Comput 10(2):260–268
36. Maitre M, Shinagawa Y, Do M (2008) Wavelet-based joint estimation and encoding of depth-image-based representations for free-viewpoint rendering. IEEE Trans Image Process
17(6):946–957
37. Shen G, Kim WS, Narang S, Ortega A, Lee J, Wey H (2010) Edge-adaptive transforms for
efficient depth map coding. In: IEEE picture coding symposium. Nagoya, Japan
38. Philips W (1999) Comparison of techniques for intra-frame coding of arbitrarily shaped video
object boundary blocks. IEEE Trans Circuits Syst Video Technol 9(7):1009–1012
39. Zeng B, Fu J (2006) Directional discrete cosine transforms for image coding. In: Proceedings of
IEEE international conference on multimedia and expo, ICME 2006. Toronto, Canada, pp 721–724
40. Fu J, Zeng B (2007) Directional discrete cosine transforms: a theoretical analysis. In:
Proceedings of IEEE international conference on acoustics, speech and signal processing,
ICASSP 2007, vol I. Honolulu, HI, USA, pp 1105–1108
41. Zeng B, Fu J (2008) Directional discrete cosine transforms – a new framework for image
coding. IEEE Trans Circuits Syst Video Technol 18(3):305–313
42. Zhang C, Ugur K, Lainema J, Gabbouj M (2009) Video coding using spatially varying
transform. In: Proceedings of 3rd Pacific Rim symposium on advances in image and video
technology, PSIVT 2009. Tokyo, Japan, pp 796–806
43. Zhang C, Ugur K, Lainema J, Gabbouj M (2009) Video coding using variable block-size
spatially varying transforms. In: Proceedings of IEEE international conference on acoustics,
speech and signal processing, ICASSP 2009, Taipei, Taiwan, pp 905–908
44. Wien M (2003) Variable block-size transforms for H.264/AVC. IEEE Trans Circuits Syst
Video Technol 13(7):604–613
45. Chang CL, Makar M, Tsai SS, Girod B (2010) Direction-adaptive partitioned block transform
for color image coding. IEEE Trans Image Proc 19(7):1740–1755
46. Ye Y, Karczewicz M (2008) Improved H.264 intra coding based on bi-directional intra
prediction, directional transform, and adaptive coefficient scanning. In: Proceedings of IEEE
international conference on image processing, ICIP 2008. San Diego, CA, USA,
pp 2116–2119
47. Soumekh M (1988) Binary image reconstruction from four projections. In: Proceedings of
IEEE international conference on acoustics, speech and signal processing, ICASSP 1988.
New York, NY, USA, pp 1280–1283
48. Ramesh GR, Rajgopal K (1990) Binary image compression using the radon transform. In:
Proceedings of XVI annual convention and exhibition of the IEEE in India, ACE 90.
Bangalore, India, pp 178–182
49. Willett R, Nowak R (2003) Platelets: a multiscale approach for recovering edges and surfaces
in photon-limited medical imaging. IEEE Trans Med Imaging 22(3):332–350
50. Morvan Y, de With P, Farin D (2006) Platelets-based coding of depth maps for the transmission
of multiview images. In: SPIE stereoscopic displays and applications. San Jose, CA
51. Kim WS (2011) 3-D video coding system with enhanced rendered view quality. Ph.D. thesis,
University of Southern California
52. Hammond D, Vandergheynst P, Gribonval R (2010) Wavelets on graphs via spectral graph
theory. Elsevier: Appl Comput Harmonic Anal 30:129–150
53. Rutishauser H (1966) The Jacobi method for real symmetric matrices. Numer Math 9(1):
54. Kim WS, Narang SK, Ortega A (2012) Graph based transforms for depth video coding. In:
Proceedings of IEEE international conference on acoustics, speech and signal processing,
ICASSP 2012. Kyoto, Japan
55. Grewatsch S, Muller E (2004) Sharing of motion vectors in 3D video coding. In: IEEE
International Conference on Image Processing, Singapore
56. Daribo I, Florencio D, Cheung G (2012) Arbitrarily shaped sub-block motion prediction in
texture map compression using depth information. In: Picture coding symposium. Krakow,
Poland
57. Cheung G, Ortega A, Sakamoto T (2008) Fast H.264 mode selection using depth information
for distributed game viewing. In: IS&T/SPIE visual communications and image processing
(VCIP08). San Jose, CA
58. Gray RM, Hashimoto T (2008) Rate-distortion functions for nonstationary Gaussian
autoregressive processes. In: IEEE data compression conference, pp 53–62
59. Sagetong P, Ortega A (2002) Rate-distortion model and analytical bit allocation for wavelet-based region of interest coding. In: IEEE international conference on image processing, vol 3,
pp 97–100
60. Hang HM, Chen JJ (1997) Source model for transform video coder and its application – part
I: fundamental theory. IEEE Trans Circuits Syst Video Technol 7:287–298
61. Lin LJ, Ortega A (1998) Bit-rate control using piecewise approximated rate-distortion
characteristics. IEEE Trans Circuits Syst Video Technol 8:446–459
62. Na ST, Oh KJ, Ho YS (2008) Joint coding of multi-view video and corresponding depth map.
In: IEEE international conference on image processing, pp 2468–2471
63. Ince S, Martinian E, Yea S, Vetro A (2007) Depth estimation for view synthesis in multiview
video coding. In: IEEE 3DTV conference
64. Liu Y, Huang Q, Ma S, Zhao D, Gao W (2009) Joint video/depth rate allocation for 3D video
coding based on view synthesis distortion model. Elsevier, Signal Process Image Commun
24(8):666–681
65. Yuan H, Chang Y, Huo J, Yang F, Lu Z (2011) Model-based joint bit allocation between
texture videos and depth maps for 3-D video coding. IEEE Trans Circuits Syst Video Technol
21(4):485–497
66. Wang H, Kwong S (2008) Rate-distortion optimization of rate control for H.264 with
adaptive initial quantization parameter determination. IEEE Trans Circuits Syst Video
Technol 18(1):140–144
67. Ma S, Gao W, Lu Y (2005) Rate-distortion analysis for H.264/AVC video coding and its
application to rate control. IEEE Trans Circuits Syst Video Technol 15(12):1533–1544
68. Liu S, Lai P, Tian D, Chen CW (2011) New depth coding techniques with utilization of
corresponding video. IEEE Trans Broadcast 57(2):551–561, part 2
69. Merkle P, Smolic A, Muller K, Wiegand T (2007) Efficient prediction structures for
multiview video coding. IEEE Trans Circuits Syst Video Technol 17(11):1461–1473
70. Shen LQ, Liu Z, Liu SX, Zhang ZY, An P (2009) Selective disparity estimation and variable
size motion estimation based on motion homogeneity for multi-view coding. IEEE Trans
Broadcast 55(4):761–766
71. Liu Y, Huang Q, Ma S, Zhao D, Gao W, Ci S, Tang H (2011) A novel rate control technique
for multiview video plus depth based 3D video coding. IEEE Trans Broadcast 57(2):562–571
(part 2)
72. Davidoiu V, Maugey T, Pesquet-Popescu B, Frossard P (2011) Rate distortion analysis in a
disparity compensated scheme. In: IEEE international conference on acoustics, speech and
signal processing. Prague, Czech Republic
73. Fraysse A, Pesquet-Popescu B, Pesquet JC (2009) On the uniform quantization of a class of
sparse sources. IEEE Trans Inf Theory 55(7):3243–3263
74. Gelman A, Dragotti PL, Velisavljevic V (2012) Multiview image coding using depth layers and
an optimized bit allocation. In: IEEE Transactions on Image Processing (to appear in 2012)
75. Velisavljevic V, Cheung G, Chakareski J (2011) Bit allocation for multiview image
compression using cubic synthesized view distortion model. In: IEEE international workshop
on hot topics in 3D (in conjunction with ICME 2011). Barcelona, Spain
76. Cheung G, Velisavljevic V, Ortega A (2011) On dependent bit allocation for multiview
image coding with depth-image-based rendering. IEEE Trans Image Process 20(11):
3179–3194
Chapter 10
10.1 Introduction
Three-dimensional television (3D-TV) has a long history, and over the years a consensus has been reached that 3D-TV broadcast services can only succeed if the perceived image quality and the viewing comfort are at least comparable to those of conventional two-dimensional television (2D-TV). Improvements in 3D technologies have raised further interest in 3D-TV [1] and in free viewpoint television (FTV) [2]. While 3D-TV offers depth perception of entertainment programs without wearing special additional glasses, FTV allows the user to freely change viewpoint position and viewing direction around a 3D reconstructed scene.
Although there is no doubt that high definition television (HDTV) has succeeded in largely increasing the realism of television, it still lacks one very important feature: the representation of natural depth sensation. At present, 3D-TV and FTV can be considered the logical next steps complementing HDTV by incorporating 3D perception into the viewing experience. In that sense, multi-view video (MVV) systems have gained significant interest recently, more specifically for the novel view synthesis enabled by depth-image-based rendering (DIBR) approaches, also called 3D image warping in the computer graphics literature.
A well-suited associated 3D video data representation is known as multi-view video plus depth (MVD), which provides regular two-dimensional (2D) videos enriched with their associated depth videos (see Fig. 10.1). The 2D video provides the texture information, the color intensity and the structure of the scene, whereas the depth video represents the per-pixel Z-distance between the camera optical center and a 3D point in the visual scene. In the following, the 2D video may be denoted as the texture video, as opposed to the depth video. In addition, we represent depth data in the color domain in order to highlight wavelet-induced distortions.
The benefit of this representation is that it can still satisfy stereoscopic viewing needs at the receiver side, as illustrated in Fig. 10.2. After decoding, intermediate views can be reconstructed from the transmitted MVD data by means of DIBR techniques [3, 4]. Therefore, the 3D impression and the viewpoint can be adjusted and customized after transmission. However, the rendering process does not allow creating perfect novel views in general. It is still prone to errors, in particular from the coding and transmission of the depth video, which is key side information for novel view synthesis.
Fig. 10.1 Example of texture image and its associated depth map (Microsoft Ballet MVV sequence)
Fig. 10.2 Efficient support of multi-view autostereoscopic displays based on MVD content
A first study of the impact of depth compression on view synthesis was conducted within the MPEG 3DAV AHG activities, in which an MPEG-4 compression scheme was used. One proposed solution consists, after decoding, in applying a median filter to the decoded depth video in order to limit the coding-induced artifacts in the view synthesis, similarly to what the deblocking filter does in H.264/AVC. Afterwards, a comparative study was carried out between H.264/AVC intra coding and platelet-based depth coding with respect to the quality of the view synthesis [5]. The platelet-based depth coding algorithm models smooth regions by using piecewise-linear functions and sharp boundaries by straight lines. The results indicate that a worse depth coding PSNR does not necessarily imply a worse synthesis PSNR. Indeed, platelet-based depth coding leads to the conclusion that preserving the depth discontinuities in the depth compression scheme yields a higher rendering quality than H.264/AVC intra coding.
In this chapter, we propose to extend these studies to the wavelet domain. Due to its unique spatial-frequency characteristics, the wavelet-based compression approach is considered an alternative to traditional video coding standards based on the discrete cosine transform (DCT) (e.g., H.264/AVC). The discrete wavelet transform (DWT) has mainly two advantages over the DCT that are important for compression: (1) the multiresolution representation of the signal by wavelet decomposition, which greatly facilitates sub-band coding; (2) the wavelet transform reaches a good compromise between the frequency and time (or space, for images) resolutions of the signal. Although wavelets are less widely used in broadcasting than the DCT, they can be considered a promising alternative for compressing depth information within a 3D-TV framework. For still image compression, the DWT outperforms the DCT by around 1 dB, and by less for video coding [6]. However, in the case of 3D video coding, the DWT presents worse performance than the DCT with regard to the 3D quality of the synthesized view. We therefore intend to understand the reason behind these poor 3D quality results by studying the effects of wavelet-based compression on the quality of the novel view synthesis by DIBR. This study includes the analysis of the wavelet transforms used for compressing the depth video, which tries to answer the question:
Which wavelet transform should be used to improve the final 3D quality?
To properly answer this question, first, a brief review of basic wavelet concepts is given in Sect. 10.2, followed by the lifting mechanism, which provides a framework for implementing classical wavelet transforms and the flexibility for developing adaptive ones.
Then, Sect. 10.3 investigates the impact of the choice of different classical wavelet transforms on the depth map and its influence on the novel view synthesis, in terms of both compression efficiency and quality.
Finally, an adaptive wavelet transform is proposed to illustrate the result of the aforementioned investigation. To this end, Sects. 10.4 and 10.5 address wavelet-based depth compression through different adaptive wavelet transforms based on the local properties of the depth map.
Fig. 10.4 Lifting scheme composed of the analysis (left) and the synthesis (right) steps
Update
An update stage U of the even values follows, such that
$l_i = x_{2i} + U(h_i), \quad i \in \mathbb{N}.$   (10.3)
At the synthesis side, the same lifting steps are simply undone in the reverse order.
Undo Update
$x_{2i} = l_i - U(h_i), \quad i \in \mathbb{N}.$   (10.4)
Undo Predict
$x_{2i+1} = h_i + P(x_{2i}), \quad i \in \mathbb{N}.$   (10.5)
Merging
$x = \{x_{2i}\} \cup \{x_{2i+1}\}.$   (10.6)
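For concreteness, the following is a minimal NumPy sketch of one lifting level, here for the Le Gall 5/3 filter discussed later in the chapter; the function names, the floating-point (non-integer) arithmetic, and the border handling are illustrative choices, not part of any standard implementation.

```python
import numpy as np

def legall53_forward(x):
    """One lifting level of the Le Gall 5/3 wavelet on a 1-D signal.

    Split into even/odd samples, then
      predict: h[i] = x[2i+1] - (x[2i] + x[2i+2]) / 2   (detail coefficients)
      update : l[i] = x[2i]   + (h[i-1] + h[i]) / 4     (approximation)
    Borders are handled by repeating the nearest neighbour.
    """
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2].copy(), x[1::2].copy()
    even_right = np.append(even[1:], even[-1])            # right neighbour of each even sample
    h = odd - 0.5 * (even[:len(odd)] + even_right[:len(odd)])
    h_left = np.insert(h[:-1], 0, h[0])                   # left neighbour of each detail
    l = even
    l[:len(h)] += 0.25 * (h_left + h)
    return l, h

def legall53_inverse(l, h):
    """Undo the lifting steps in reverse order: undo update, undo predict, merge."""
    even = l.copy()
    h_left = np.insert(h[:-1], 0, h[0])
    even[:len(h)] -= 0.25 * (h_left + h)                   # undo update -> even samples
    even_right = np.append(even[1:], even[-1])
    odd = h + 0.5 * (even[:len(h)] + even_right[:len(h)])  # undo predict -> odd samples
    x = np.empty(len(even) + len(odd))
    x[0::2], x[1::2] = even, odd                           # merging
    return x

# Perfect reconstruction check on a toy depth scan line with a sharp edge.
line = np.array([10, 10, 10, 10, 200, 200, 200, 200], dtype=float)
l, h = legall53_forward(line)
assert np.allclose(legall53_inverse(l, h), line)
```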
Fig. 10.5 Appearance of the Gibbs (ringing) effects along Ballet depth contours at 0.08 bpp
Fig. 10.7 Effect of the wavelet-based depth compression onto the novel view synthesis at 0.08 bpp
of the novel view synthesis by DIBR. The novel view synthesis experiments are partially realized with the software provided by the Tanimoto Laboratory of Nagoya University [15]. In this study, the novel view is generated between cameras 3 and 5, such that the viewpoint of camera 4 is reconstructed by DIBR from reference cameras 3 and 5. After transmission and decoding, the depth videos are preprocessed by first applying a median filter with an aperture linear size of 3, and then a bilateral filter with parameters 4 and 40 for the filter sigma in color space and in coordinate space, respectively. Disocclusions are filled in by using the reference views from cameras 3 and 5. Finally, remaining disocclusions are inpainted [16].
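As an illustration, the preprocessing chain described above can be sketched with OpenCV as follows; the filter diameter passed to the bilateral filter and the inpainting radius are assumptions of ours, since the text only specifies the median aperture and the two sigma values.

```python
import cv2

def preprocess_decoded_depth(depth_8bit):
    """Post-process a decoded 8-bit depth map before DIBR, as described above:
    a 3x3 median filter against isolated coding artifacts, then a bilateral
    filter (sigma_color = 4, sigma_space = 40) that smooths flat regions while
    keeping depth discontinuities."""
    dep = cv2.medianBlur(depth_8bit, 3)
    # Diameter 9 is an assumed value; OpenCV can also derive it from sigmaSpace.
    dep = cv2.bilateralFilter(dep, 9, 4, 40)
    return dep

def fill_remaining_disocclusions(synthesized_bgr, hole_mask):
    """Inpaint pixels left uncovered after warping both reference views,
    using the fast-marching inpainting of [16]; the radius is illustrative."""
    return cv2.inpaint(synthesized_bgr, hole_mask, 3, cv2.INPAINT_TELEA)
```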
At low bitrate, the Gibbs phenomenon becomes more visible, leading to a
visible degradation of the novel view as shown in Fig. 10.7, in particular near the
object boundaries, which correspond to the aforementioned edge-localized errors
in the depth map.
Previously, we observed that the shorter the wavelet filter support is, the better the edges of the disoccluded areas are preserved and, thus, the better the quality of the novel view. Despite a lower depth compression ratio, the Haar filter bank can then be considered the most efficient wavelet transform among those tested with respect to the rendering quality of the novel synthesized image (see Fig. 10.8).
In conclusion, depth edges point up the weakness of classical wavelet-based coding methods in efficiently preserving structures that are very well localized in the space domain but have a large frequency band, which leads to a degradation of the novel synthesis quality. A good depth compression does not necessarily yield better synthesized views. It can then be expected that the smaller the support of the wavelet transform is, the better the Gibbs (ringing) effects are reduced and, thus, the better the 3D quality, despite a lower compression ratio of the depth map compared with longer filters. Indeed, the depth data are mainly composed of smooth areas, which favors longer filters when considering the RD performance, at the cost of edge-localized errors in the depth video. A tradeoff between depth compression efficiency and novel view synthesis quality has to be found.
Fig. 10.9 An adaptive lifting scheme by using depth edges as side information
coefficients. Inspired by these still-image texture coding techniques, various depth coding methods have been proposed that require an edge-detection stage followed by an adaptive transform [21-24].
[Eq. (10.7): impulse response h of the high-pass filter referenced below]
Hysteresis is used to track the most relevant pixels along the contours. It uses two thresholds: if the magnitude is below the low threshold, the pixel is set to zero (made a non-edge); if the magnitude is above the high threshold, it is made an edge; and if the magnitude lies between the two thresholds, it is set to zero unless the pixel is located near an edge detected with the high threshold.
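The two-threshold rule can be written, in its usual connectivity-based form, as the following sketch; the threshold values and the 8-neighbourhood propagation are illustrative choices.

```python
import numpy as np
from collections import deque

def hysteresis(magnitude, low, high):
    """Two-threshold edge tracking: pixels above `high` seed the edge map, and
    pixels between `low` and `high` are kept only if they connect
    (8-neighbourhood) to a seed; everything else is discarded."""
    strong = magnitude >= high
    weak = (magnitude >= low) & ~strong
    edges = strong.copy()
    queue = deque(zip(*np.nonzero(strong)))
    rows, cols = magnitude.shape
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < rows and 0 <= nx < cols and weak[ny, nx] and not edges[ny, nx]:
                    edges[ny, nx] = True
                    queue.append((ny, nx))
    return edges

# Example on a toy gradient-magnitude map.
mag = np.array([[0, 0, 9, 0],
                [0, 5, 0, 0],
                [0, 0, 5, 0],
                [0, 0, 0, 0]], dtype=float)
print(hysteresis(mag, low=4, high=8).astype(int))
```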
Fig. 10.10 Adaptive lifting scheme using depth edges as side information
Fig. 10.11 Adaptive lifting scheme using texture edges as side information
Fig. 10.12 Example of edges of the texture image (left) and depth map (right)
Fig. 10.13 Adaptive lifting scheme using the approximated depth edges as side information
Fig. 10.14 Interpolated depth edges at different bitrates: (top) at the encoder, (middle) at the decoder, (bottom) difference between both
the location of the edges when using the two slightly different decompositions. The edges of the interpolated depth map are, however, very sensitive to the bitrate, as can be seen in Fig. 10.14, and the slight difference between the two interpolated depth edges does not allow a perfect reversibility of the adaptive lifting scheme.
Fig. 10.15 Adaptive lifting scheme using mixed texture-depth edges as side information
[Algorithm listing: pixel-wise loop over positions (i, j) that validates the texture edges against the interpolated depth edges]
where h denotes the impulse response of the high-pass filter described by Eq. 10.7. Moreover, the residual differences between the two interpolated maps are not crucial for the edge detection in the texture image, since only a neighborhood of the edges is used to validate the texture contours. As shown in Fig. 10.16, this allows us to retrieve, from the pair formed by the texture and the interpolated depth map, the location of the original edges of the depth map.
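A possible way to realize this validation is sketched below: texture edges are kept only where they fall inside a small dilated band around the edges of the interpolated depth map. The Canny thresholds and the band width are assumed values, not taken from the chapter.

```python
import cv2
import numpy as np

def mixed_texture_depth_edges(texture_gray, interp_depth_8bit,
                              canny_low=50, canny_high=150, neighborhood=5):
    """Keep only the texture edges lying in a small neighbourhood of the edges
    of the interpolated depth map, approximating the original depth edges."""
    tex_edges = cv2.Canny(texture_gray, canny_low, canny_high)
    depth_edges = cv2.Canny(interp_depth_8bit, canny_low, canny_high)
    kernel = np.ones((neighborhood, neighborhood), np.uint8)
    band = cv2.dilate(depth_edges, kernel)      # neighbourhood of the depth edges
    return cv2.bitwise_and(tex_edges, band)     # validated mixed edges
```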
Fig. 10.16 Mixed depth edges at different bitrates: (top) at the encoder, (middle) at the decoder, (bottom) difference between both
Fig. 10.17 Rate-distortion results of the depth maps and novel synthesized images. The Le Gall 5/3 DWT is compared with the adaptive support DWT [21]. The labels depth edges, texture edges, interpolated depth edges and mixed edges denote the different strategies presented in Sect. 10.5 for sending the edges as side information
to encode the depth map is equal to 20 % of the bitrate of the texture image. Note that in Fig. 10.17, the increase in bitrate related to the depth edge side information has been neglected when reporting the coding rate, which is equivalent to perfectly retrieving the depth edges at the decoder side without sending any extra bits.
The gain in depth map coding becomes more perceptible when measuring the PSNR of the warped image: the warped image PSNR measurements indicate a quality gain of around 1.8 dB. Thus, the adaptive schemes do not necessarily improve the overall PSNR of the transmitted depth map over a classical linear lifting scheme; however, at similar PSNR values, the support adaptivity clearly leads to a better preservation of the depth edges, and consequently to an improvement of the quality of the synthesized view.
It can be observed in Fig. 10.17 that, if the side information rate cost does not exceed 0.02 bpp, the strategy of sending the depth edges directly as side information provides the best RD performance. However, with state-of-the-art edge coders, such a rate is still difficult to obtain; for example, the lossless encoding of boundary shapes costs an average of 1.4 bits per boundary pel [26]. In conclusion, the mixed edges used as side information provide the best performance.
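A rough back-of-the-envelope check makes the 0.02 bpp figure concrete; the 1024 x 768 image size below is only an assumed example and any resolution can be substituted.

```python
# With lossless chain coding at about 1.4 bits per boundary pel [26],
# a 0.02 bpp side-information budget on a 1024 x 768 depth map allows roughly
# 0.02 * 1024 * 768 / 1.4 contour pels in total, a budget that depth maps
# with long object boundaries can easily exhaust.
width, height = 1024, 768
budget_bpp, bits_per_pel = 0.02, 1.4
max_contour_pels = budget_bpp * width * height / bits_per_pel
print(round(max_contour_pels))   # -> 11235
```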
References
1. Fehn C, Cooke E, Schreer O, Kauff P (2002) 3D analysis and image-based rendering for immersive TV applications. Signal Process Image Commun 17(9):705-715
2. Tanimoto M (2006) Overview of free viewpoint television. Signal Process Image Commun 21:454-461
3. McMillan L Jr (1997) An image-based approach to three-dimensional computer graphics. PhD thesis, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
4. Oliveira MM (2000) Relief texture mapping. PhD thesis, University of North Carolina at Chapel Hill, NC, USA
5. Morvan Y, Farin D, de With PHN (2005) Novel coding technique for depth images using quadtree decomposition and plane approximation. In: Visual communications and image processing, vol 5960, Beijing, China, pp 1187-1194
6. Xiong Z, Ramchandran K, Orchard MT, Zhang Y-Q (1999) A comparative study of DCT- and wavelet-based image coding. IEEE Trans Circuits Syst Video Technol 9(5):692-695
7. Mallat SG (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11(7):674-693
8. Rioul O, Duhamel P (1992) Fast algorithms for discrete and continuous wavelet transforms. IEEE Trans Inf Theory 38(2):569-586
9. Sweldens W (1995) The lifting scheme: a new philosophy in biorthogonal wavelet constructions. In: Proceedings of the SPIE, wavelet applications in signal and image processing III, vol 2569, pp 68-79
10. Daubechies I, Sweldens W (1998) Factoring wavelet transforms into lifting steps. J Fourier Anal Appl 4:247-269
11. Le Gall D, Tabatabai A (1988) Sub-band coding of digital images using symmetric short kernel filters and arithmetic coding techniques. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 761-764, 11-14 Apr 1988
12. Antonini M, Barlaud M, Mathieu P, Daubechies I (1992) Image coding using wavelet transform. IEEE Trans Image Process 1(2):205-220
13. Cohen A, Daubechies I, Feauveau J-C (1992) Biorthogonal bases of compactly supported wavelets. Commun Pure Appl Math 45:485-500
14. Microsoft sequence Ballet and Breakdancers (2004) [Online] Available: http://research.microsoft.com/en-us/um/people/sbkang/3dvideodownload/
15. Tanimoto M, Fujii T, Suzuki K, Fukushima N, Mori Y (2008) Reference softwares for depth estimation and view synthesis, M15377 doc., Archamps, France, Apr 2008
16. Telea A (2004) An image inpainting technique based on the fast marching method. J Graph GPU Game Tools 9(1):23-34
17. Do MN, Vetterli M (2005) The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans Image Process 14(12):2091-2106
18. Li S, Li W (2000) Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding. IEEE Trans Circuits Syst Video Technol 10(5):725-743
19. Peyré G, Mallat S (2000) Surface compression with geometric bandelets. In: Proceedings of the annual conference on computer graphics and interactive techniques (SIGGRAPH), New York, NY, USA, pp 601-608. ACM
20. Shukla R, Dragotti PL, Do MN, Vetterli M (2005) Rate-distortion optimized tree-structured compression algorithms for piecewise polynomial images. IEEE Trans Image Process 14(3):343-359
21. Daribo I, Tillier C, Pesquet-Popescu B (2008) Adaptive wavelet coding of the depth map for stereoscopic view synthesis. In: Proceedings of the IEEE workshop on multimedia signal processing (MMSP), Cairns, Queensland, Australia, pp 413-417, Oct 2008
22. Maitre M, Do MN (2009) Shape-adaptive wavelet encoding of depth maps. In: Proceedings of the picture coding symposium (PCS), Chicago, USA, pp 1-4, May 2009
23. Sanchez A, Shen G, Ortega A (2009) Edge-preserving depth-map coding using graph-based wavelets. In: Proceedings of the Asilomar conference on signals, systems and computers, Pacific Grove, CA, USA, pp 578-582, Nov 2009
24. Shen G, Kim W-S, Narang SK, Ortega A, Lee J, Wey H (2010) Edge-adaptive transforms for efficient depth map coding. In: Proceedings of the picture coding symposium (PCS), Nagoya, Japan, pp 566-569, Dec 2010
25. Freeman H (1961) On the encoding of arbitrary geometric configurations. IRE Trans Electron Comput 2:260-268
26. Eden M, Kocher M (1985) On the performance of a contour coding algorithm in the context of image coding part I: contour segment coding. Signal Process 8(4):381-386
Chapter 11
Transmission of 3D Video
over Broadcasting
Pablo Angueira, David de la Vega, Javier Morgade
and Manuel María Vélez
11.1 Introduction
This chapter provides a general perspective on the feasibility of using digital broadcasting networks for delivering three-dimensional TV (3D-TV) services. Currently, the production and consumption of 3D video is a reality in cinema and Blu-ray formats, but the general deployment of commercial 3D-TV services is not a reality yet. Satellite and some terrestrial broadcasters in countries like the United Kingdom have shown interest in providing these services in the short or mid-term. Specifically, several broadcasters continue to carry out experiments in stereoscopic 3D-TV production in various European countries. The pay-television (TV) operator BSkyB (UK) started a stereoscopic 3D-TV channel in 2010, and some consumer electronics manufacturers also announced stereoscopic TV receivers during 2010.
In parallel, the digital broadcast standards are being redesigned. The second
generation systems in different International Telecommunications Union (ITU)
Regions will have better spectral efficiency and higher throughput. Good examples
of this second generation of standards can be found in the DVB family, with
the second generation of satellite (DVB-S2), cable (DVB-C2) and terrestrial
(DVB-T2) systems.
Additionally, the advances in video coding and specifically in 3D video coding
have enabled the convergence between the bitrates associated to 3D streams and
the capacity of the wired and unwired broadcast systems under specific conditions.
Finally, the activity in the technical and commercial committees of different
standardization bodies and industry associations has been very intense in the last
couple of years focusing mainly on production and displays. Examples of this
work can be found in committees of the Society of Motion Picture and Television
Engineers (SMPTE), the European Broadcasting Union (EBU), the Advanced
Television Systems Committee (ATSC), and the Digital Video Broadcasting
Consortium (DVB). In all cases, the initial approach for short and mid-term
deployment is based on stereoscopic 3D, with different formats and coding options.
Except for a few autostereoscopic prototypes, 3D-TV services will require a specific display based on wearing special glasses.
The following sections provide a general description of the factors that need to
be accounted for in the deployment stages of 3D-TV services over broadcast
networks, with special emphasis on systems based on Depth-Image-Based
Rendering (DIBR) techniques.
Year    Report / Recommendation
1990    Report ITU-R BT.312-5
1995    Recommendation ITU-R BT.1198
1998    Report ITU-R BT.2017
2000    Recommendation ITU-R BT.1438
2006    Report ITU-R BT.2088
2008    Question ITU-R WP6A 128/6
2010    Report ITU-R BT.2160
This committee is now in the preliminary stages of developing a new family of coding tools for 3D video called 3DVideo. The target of this new task is to define a data format and associated compression technology to enable the high-quality reconstruction of synthesized views for 3D displays. The call for technology proposals was released in 2011 [3].
The SMPTE has also been one of the key organizations promoting 3D standardization worldwide. This organization produces standards (ANSI), reports, and studies as well as technical journals and conferences in the generic field of TV. In August 2008, SMPTE formed a Task Force on 3D to the Home. The report produced by this committee contains the definition of the 3D Home Master and provides a description of the terminology and use cases for 3D in the home environment [4].
The Digital Video Broadcasting (DVB) consortium activities should be mentioned here, considering the number of viewers that access TV services worldwide
by means of its standards. The DVB specifications relating to 3D-TV are currently
described by the DVB-3D-TV specification (DVB BlueBook A154 and A151)
[5, 6] and a 3D Subtitling addendum. The work in the technical committees has
focused on Stereoscopic Frame Compatible formats guided by the commercial
strategy of the DVB Commercial Module. The DVB plans propose to accomplish
the 3D-TV introduction in two phases. Phase 1 will be based on Frame Compatible
formats, and Phase 2 could be based either on a Frame Compatible Compatible or
a Service Compatible approach (the format definitions will be explained in Sect.
11.3.2).
Finally, it is worth mentioning that almost every commercial and non-commercial
broadcasting-related organization in all ITU regions has a working committee that
either develops or monitors aspects of 3D-TV policies and technology. Examples can
be found in the Advanced Television Standards Committee (ATSC), EBU, Asian
Broadcasting Union (ABU), Association of Radio Industries and Businesses (ARIB)
and Consumer Electronics Association (CEA).
either), especially taking into consideration that the second generation of digital broadcast systems has already been standardized [7-9]. These new systems have driven the performance curves close to the Shannon limit, and thus the challenge to solve in the following decades will be the commercial rollout of the technology.
11.3.1 Production
The first step in a 3D TV transmission chain is the generation of suitable content. Although the topic is not in the scope of this chapter, it is worth mentioning here the relevance of 3D production storytelling (also called production grammar). The production grammar of 3D should differ from that of 2D productions for a good 3D viewing experience. This can lead to some compromises for the 2D viewer and, thus, compatibility issues might be related more to the content and how it is shown than to formats, coding, or other technical matters.
Capturing 3D video scenes is one of the most complex aspects of the whole 3D
chain. Currently, the production of contents for broadcasting is based on 2D
cameras. The output is a 2D image which is the basis for different processing
techniques that produce additional views or layers required by each specific 3D
format. The simplest options are based on two camera modules that produce a stereoscopic output composed of two 2D images, one per eye. An example of this system is the SMPTE specification that recommends a production format with a resolution of 1,920 × 1,080 pixels at 60 frames per second per eye.
Another family of techniques is based on multi-camera systems, which provide higher flexibility in the sense that they allow capturing content that can be processed into virtually any 3D format. This advantage is limited by the extreme complexity of multi-camera arrangements, which restricts their application to very special situations (e.g., specific studios).
Focusing on systems based on depth information for constructing 3D images,
there are different approaches to obtain the depth data of the scene depending on
the production case. In the case of 3D production of new material, some camera
systems have the capacity of simultaneously capturing video and associated pixel
depth information by a system based on infrared sensing of the scene depth [10]. If
the camera system is not furnished with the modules that obtain the depth information
directly, the depth maps can be obtained by processing the different components of
the 3D image (stereoscopic or multi-view), as explained in Chaps. 2 and 7.
The case of computer generated video streams is less problematic, as full scene
models with 3D geometry are available and thus the depth information with
respect to any given camera view-point can be generated and converted into the
required 3DV depth format.
Finally, and especially during the introductory phases of 3D services, existing
adequate 2D material can be processed to obtain equivalent 3D content. In this
case, there have been different proposals that extract depth information from a 2D
image. A detailed survey of the state of the art of the conversion methods can be
found in [11].
Once a 3D video stream, in different formats, is captured or computer generated, it has to be coded to be either delivered or stored on different media. The
coding algorithm will remove time, spatial, or inter stream redundancies that will
make delivery more efficient. This efficiency is critical in digital terrestrial broadcast systems, where the throughput is always limited. Depth data-based systems show an interesting balance between achieved 3D perception quality and required bitrate.
11.3.2 Formats
The representation format used to produce a 3D image will influence the video
coder choice and thus the bitrate that will be delivered to the user, which has
obvious implications on the transport and broadcast networks. Last but not least,
the video format will have a close dependency with the consumer display.
can be related also to the 3D representation and coding of the images (see Table 11.3).
Table 11.3 Usual 3D format terminology used for different system features description
  Format: HD display compatible; HD frame compatible; HD frame compatible compatible; HD service compatible
  3D display type: conventional HD; HD frame compatible
  3D representation/coding (compatibility): anaglyph; frame compatible stereo; 2D + depth; multi-view plus depth (2 views); layered depth video; depth enhanced stereo; full resolution stereo (simulcast)
It should be mentioned that there is no general agreement in the literature on the correspondence between formats, displays, and 3D representation techniques. The 3D-TV specifications issued so far by SMPTE and DVB [5, 6] recommend an HD frame compatible format based on frame compatible stereo images, and it is assumed that displays will generally be based on wearing glasses. A different approach can be found in Korea, where the 3D representation and coding is based on 2D plus depth (3D is built through DIBR techniques). The system is compatible with existing receivers and displays, as the baseline image is the same 2D video sequence.
while the owners of a 3D decoder will be able to extract from the multiplex the
difference signal that will be used to modify the 2D video and create the complementary view. The output from the set-top box to the display would normally
be an L and R stereo pair with different resolutions depending on each case. The
difference signal can be compressed using a standard video encoder, e.g., the MPEG-4 Stereo High Profile [12].
Finally, the fourth group of representation formats adds one or multiple auxiliary views of depth information exploited through Depth-Image-Based Rendering (DIBR) techniques. The depth map associated with a specific scene contains geometric information for each image pixel. Since this information is geometry data, the reconstruction of the 3D image can be adapted to the display type, size, and spectator conditions [13, 14].
There are different formats based on image depth information. The simplest version contains a 2D image sequence and a depth information layer, also similar in shape to a 2D image. The problem associated with this format is the representation of occlusions caused by objects in a plane closer to the viewer. This problem is solved by formats called 2D plus DOT (Depth, Occlusion, and Transparency) or MVD (Multi-view Video plus Depth) [15]. MVD formats contain additional layers with occlusion areas of the scene, at the cost of the redundancy intrinsic to them. Some versions, such as Layered Depth Video (LDV) [16], contain the information of the original 2D scene, the image depth information, and a third layer with occlusion information, considering only first-order occlusions. Similar versions of depth information rendering formats are available in the literature. The depth information can be captured or produced with specially designed cameras or by post-processing a multi-view image stream.
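To make the layer structure tangible, a minimal container for an LDV-style frame could look as follows; the field names are ours and do not follow any particular standard syntax.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LayeredDepthVideoFrame:
    """Illustrative grouping of the layers discussed above for an LDV frame."""
    texture: np.ndarray            # colour image of the original 2D view
    depth: np.ndarray              # per-pixel depth map of that view
    occlusion_texture: np.ndarray  # colour of the first-order occluded background
    occlusion_depth: np.ndarray    # depth of the occluded background layer
```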
11.3.3 Transport
This section describes the existing methods for multiplexing 3D-TV (and 2D
HDTV) services into a bit stream structure. The objective of creating these
structures is two-fold. On the one hand, these structures enable a comprehensive
association of the different components of a multimedia service: the video component(s), the audio component(s) and associated data streams. Furthermore, this
association is extended to aggregate several individual multimedia services into a
multi-programme structure. The well-known MPEG Transport Stream is the most
widespread example.
The second objective of these multiplex structures is to allow the transportation
of the programme structure through different networks (microwave line of sight
links, fiber optic links, satellite contribution links) to the broadcast nodes (transmitters in the case of terrestrial networks, head-ends in cable networks and earth
satellite stations in satellite broadcasting). In some cases, these streams are further
encapsulated into higher level structures for transportation. An example of the
latter is IP encapsulation of MPEG-TS packets [17].
The contents of this section summarize the ways of conveying different 3D
video formats into transport structures. The information is organized into two
differentiated groups: solutions for Conventional Frame Compatible formats and
solutions for systems based on DIBR and thus conveying different versions of the
scene depth data.
H.264/AVC Simulcast
H.264/AVC Simulcast is specified as the individual application of an H.264/
AVC [19] conforming coder to several video sequences in a generic way. The
process is illustrated in Fig. 11.3.
Thus, this solution for delivering a stereoscopic 3D video stream would consist
of independent coding and aggregation of the bit stream into the transport structure
being used, either MPEG-TS or RTP protocols.
the apparent position of an object viewed along two different lines of sight. New values for additional data representations could be added to accommodate future coding technologies. MPEG-C Part 3 is directly applicable to broadcast video because it allows specifying the video stream as a 2D stream plus associated depth, where the single-channel video is augmented by the per-pixel depth attached as auxiliary data [21, 22, 24].
[Table: reference bitrate values (Mbps) for 720p, 1,080i and 1,080p HDTV services; the reported values range over 6, 7.5, 7.8, 8, 10, 11 and 13 Mbps]
Resolution                                      Ref. [33]   Ref. [34]   Ref. [5]
1,080p/50                                       200         200         200
1,080p/100 or 2,160p/50 (reduced resolution)    170-190     110-160     100-160
1,080p/50 + additional layer                    140-180     130-180     160 (a)
1,080p/50 + additional layer                    120-160                 160 (a)
1,080p/50 + additional layer                    180-220     130         160 (a)
(a) The DVB consortium is currently considering a maximum of 60 % overhead for phase 2 first generation 3D-TV formats
The coding output bitrate is still a matter of discussion in the research community [28-32]. Values from 6 to 18 Mbps can be found in different reports as a function of the perceptual quality and the delivery infrastructure (terrestrial, cable or satellite). The values proposed here are a summary of the values found in these references, including the future performance gains forecast by the same authors for the following 5-7 years; in the case of 1,080p the coding algorithm is MPEG-4/AVC.
The reference 3D-TV bitrate values used in the following sections for the terrestrial standards will be close to the maximum values, considering expected gains in statistical multiplexing and advances in coding technologies. The bitrate requirements associated with 3D services will depend on the format and coding choice, as described in Sect. 11.3.2 and Chaps. 8-10. Foresight studies in Europe have suggested the bitrate ranges of Table 11.5.
Table 11.5 shows the clear advantages of the depth information-based formats. Although still an area of challenging research, the 2D + Depth approach offers the prospect of considerable bitrate savings.
Table 11.6 Summary of expected capacities associated with different delivery media
System          Throughput (Mbps)
DVB-T           24
DVB-T2          35
DVB-S2          45
Cabled system   80
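A quick calculation along the lines of Table 11.6 shows how many 3D services each medium could carry; the 16 Mbps per-service figure below is an assumed example, not a value from the table.

```python
# Rough services-per-channel estimate from the expected capacities above.
capacity_mbps = {"DVB-T": 24, "DVB-T2": 35, "DVB-S2": 45, "Cabled system": 80}
per_3d_service_mbps = 16   # assumed bitrate for one 2D + depth HD 3D service
for system, capacity in capacity_mbps.items():
    print(f"{system}: {capacity // per_3d_service_mbps} 3D service(s)")
```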
11.3.5.1 Terrestrial
Terrestrial broadcast networks are composed of a number of transmitting centers
that send the DTV signal using UHF channels. These transmitter sites are usually
high power stations, with radiating systems placed well above the average terrain
height of the service area. In principle, the network configuration is independent of
the service being delivered, provided the throughput associated to the transmission
parameters is high enough to allocate the service. Currently, there are four main
digital terrestrial TV standards around the world:
1. ATSC. This is a North America-based DTV standards organization, which
developed the ATSC terrestrial DTV series of standards.
2. DVB Consortium. A European-based standards organization, which developed the
DVB series of DTV standards DVB-T/H, and more recently the second generation
systems like DVB-T2. The systems designed by DVB are all standardized by the
European Telecommunication Standard Institute (ETSI). Although initially
European, the scope of DVB encompasses today the three ITU Regions.
3. The ISDB standards, a series of DTV standards developed and standardized by
the ARIB and by the Japan Cable Television Engineering Association
(JCTEA).
4. DTMB (Digital Terrestrial Multimedia Broadcast), which is the TV standard
for mobile and fixed terminals used in the People's Republic of China, Hong
Kong, and Macau.
11.3.5.2 Cable
A cable broadcast network is a communication network that is composed of a number
of head-ends, a transport and distribution section based on fiber technologies, and a
final access section usually relying on coaxial cable. Historically, these networks have been referred to as HFC (Hybrid Fiber Coaxial) networks. The head-ends of the system
aggregate contents from different sources and content providers. These services are
then fed to the fiber distribution network, usually based on SDH/SONET switching.
The distribution network transports the services to a node close to the final user area
(usually a few blocks in a city). Within this local area, the services are delivered by a
coaxial network to the consumer premises. The coaxial section has a tree structure,
based on power splitters and RF amplifiers. The signals on the cable are transmitted
using the frequencies that range from a few MHz to the full UHF band. The system total
bandwidth is limited by the effective bandwidth of the coaxial cable section.
Unlike the terrestrial market, there are only two cable TV standards. ISDB-C (Integrated Services Digital Broadcasting - Cable) is used exclusively in Japan and was standardized by the JCTEA [35]. The DVB consortium standardized DVB-C in 1994; it is the widespread cable standard used in ITU-R Regions 1, 2, and 3. Recently, the second generation system DVB-C2 [36] has been approved
and endorsed by the European Telecommunications Standards Institute.
11.3.5.3 Satellite
A Broadcast Satellite Service network, also called Direct to the Home (DTH) or
Direct Broadcast Satellite (DBS) is a communications network (currently with
some degree of interactivity through dedicated return channels) that is based on a
Multi-view systems have a discrete number of views within the viewing field,
generating different regions where the different perspectives of the video scene can
be appreciated. In this case, some motion parallax is provided but it is restricted by
the limited number of views available. Nevertheless, there are adjacent view
fusion strategies that try to smooth the transition from a viewing position to the
next ones [38].
Holoform techniques try to provide a smooth motion parallax for a viewer moving
along the viewing field. Volumetric displays are based on generating the image
content inside a volume in space; these techniques are based on a visual representation of the scene in three dimensions [39]. One of the main difficulties associated with volumetric systems is the required high resolution of the source material. Finally, holographic techniques aim at representing exact replicas of the scenes that cannot be differentiated from the original. These techniques try to capture the light field of the scene, including all the associated physical light attributes, so that the spectator's eyes
receive the same light conditions as the original scene. These last techniques are still
in a very preliminary study phase [40].
11.4.1 Production
The advantages of depth-based systems for production derive from the pixel-by-pixel matching of the different layers of the picture. This structure facilitates 3D post-processing. An example can be found in object segmentation based on depth-keying, a technique that allows an easy integration of synthetic 3D objects into real sequences, for example for real-time 3D special effects [41]. Another important aspect is the suppression of the photometrical asymmetries associated with left- and right-eye view stereoscopic distribution, which create visual discomfort and might degrade the stereoscopic sensation [42].
The most widespread methods for 2D-to-3D conversion are based on extracting
depth information: Depth from blur, Vanishing Point-based Depth Estimation and
Depth from Motion Parallax [43, 44]. This is a major advantage of DIBR systems
as the success of any 3D-TV broadcast system will depend strongly on the
availability of attractive 3D video material.
Despite these advantages, there are also challenges that need to be addressed for
commercial implementation. The first one is associated with the occluded areas of the original image; suitable occlusion concealment algorithms are required in the postproduction stages. Also, in the simplest versions of DIBR-based systems, the effects associated with transparencies (atmospheric effects like fog or smoke, semi-transparent glass, shadows, refractions, reflections) cannot be handled adequately by a single depth layer, and additional transparency layers (or production
techniques processing multiple views with associated depth) are required. Finally,
it should be mentioned that creating depth maps with great accuracy is still a
challenge, particularly for real-time events.
depth information can be compressed much more efficiently than the 2D component. Nevertheless, this is not a general rule: depth information also contains very sharp transitions, which might be distorted by the coding and decoding processes. Finally, the fact that some of the transport mechanisms (see Sect. 11.3.3)
have been designed specifically to convey various depth information layers makes
this option very attractive for broadcasters, with a limited impact on transport and
distribution networks.
A Forward Error Correction scheme based on a concatenation of BCH (Bose-Chaudhuri-Hocquenghem) and LDPC (Low Density Parity Check) coding is used for better performance in the presence of high levels of noise and interference. The introduction of two FEC code block lengths (64,800 and 16,200 bits) was dictated by two opposite needs: the C/N performance is higher for long block lengths, but the end-to-end modem latency increases as well. For applications where end-to-end delay is not critical, such as TV broadcasting, the long frames are the best solution, as long block lengths provide an improved C/N performance.
Additionally, when used for different hierarchies of services (like TV, HDTV)
or for interactive point-to-point applications (like IP unicasting), Variable Coding
and Modulation (VCM) functionality allows different modulations and error protection levels that can be modified on a frame-by-frame basis. This may be
combined with the use of a return channel to achieve closed-loop Adaptive Coding
and Modulation (ACM), and thus allowing the transmission parameters to be
optimized for each individual user, depending on the particular conditions of the
delivery path. Optional backwards-compatible modes have been defined in
DVB-S2, intended to send two Transport Streams on a single satellite channel: the
High Priority (HP) TS, compatible with DVB-S and DVB-S2 receivers, and the
Low Priority (LP) TS, compatible with DVB-S2 receivers only. HP and LP
Transport Streams are synchronously combined by using a hierarchical modulation
on a non-uniform 8PSK constellation. The LP DVB-S2 compliant signal is BCH
and LDPC encoded, with LDPC code rates 1/4, 1/3, 1/2, or 3/5.
Modulation and code rate   Symbol rate (Mbaud)      Throughput (Mbps)   Service configuration
QPSK 3/4                   30.9 (roll-off 0.2)      46                  10 SDTV or 2 HDTV
8PSK 2/3                   29.7 (roll-off 0.25)     58.8                13 SDTV or 3 HDTV
For the same two modes (QPSK 3/4 at 46 Mbps and 8PSK 2/3 at 58.8 Mbps), the 3D service configurations listed are: 2 2D + Depth; 2 2D + Depth and 1 HDTV; 3 HDTV; 1 2D + Depth.
11.6.1 DVB-T2
The predecessor of DVB-T2 is DVB-T. DVB-T was the first European standard for digital terrestrial broadcasting and was approved by the European Telecommunications Standards Institute in 1997 [50, 51]. DVB-T is based on OFDM modulation and was designed to be broadcast using 6, 7, and 8 MHz wide channels in the VHF and UHF bands. The standard has two options for the number of carriers, namely 2K and 8K, and different carrier modulation schemes (QPSK, 16QAM, and 64QAM) that can be selected by the broadcaster. The channel coding is based on FEC techniques, with a combination of convolutional (1/2, 2/3, 3/4, 5/6, and 7/8 rates) and Reed-Solomon (204, 188) coders. The spectrum includes pilot carriers that enable simple channel estimation both in time and in frequency. Amongst all the usual configuration options, the most used ones targeting fixed receivers provide a throughput between 19 and 25 Mbps. DVB-T is based on the MPEG-TS structure, and the synchronization and signaling are strongly dependent on the multiplex timing. The video coder associated with DVB-T is MPEG-2.
DVB-T2 has been designed with the objective of increasing the throughput of DVB-T by 30 %, a figure considered to be the requirement for HDTV service rollout in countries where DVB-T does not provide enough capacity.
[Figure: DVB-T2 transmitter block diagram: input processing module; per-PLP BCH/LDPC encoder, QAM modulator and cell/time interleaver (PLP 1, PLP 2, ..., PLP N); frame builder and cell multiplexer; IFFT; guard interval insertion; output]
and maximizing the data payload depending on the FFT size and Guard Interval
values. The DVB-T2 physical layer data is divided into logical entities called
physical layer pipes (PLP). Each PLP will convey one logical data stream with
specific coding and modulation characteristics. The PLP architecture is designed to
be flexible so that arbitrary adjustments to robustness and capacity can be easily
done [7].
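The per-PLP flexibility can be pictured with a small configuration sketch like the one below; the fields and values are illustrative and only a tiny subset of what the real T2 signalling carries.

```python
from dataclasses import dataclass

@dataclass
class PLPConfig:
    """Simplified description of one DVB-T2 physical layer pipe."""
    plp_id: int
    constellation: str   # e.g. "256QAM" for a high-capacity HD/3D pipe
    code_rate: str       # e.g. "3/4"
    fec_block_bits: int  # 64800 (normal) or 16200 (short) FEC frames

# One high-capacity pipe for HD/3D content next to a more robust auxiliary pipe.
multiplex = [
    PLPConfig(plp_id=0, constellation="256QAM", code_rate="3/4", fec_block_bits=64800),
    PLPConfig(plp_id=1, constellation="QPSK", code_rate="1/2", fec_block_bits=16200),
]
```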
Table 11.9 Service configuration and bitrate requirements in a 2D HDTV compatible 3D-TV deployment scenario. Columns: number of services (COFDM 32 K, 1/128 guard interval) of type HD720p, HD1080i, HD1080p, 3D in 2D and 2D + Delta; total bitrate (Mbps); DVB-T2 mode (modulation QAM order and code rate). The listed configurations correspond to total bitrates of 42 Mbps (256QAM 3/4), 41.6 Mbps (256QAM 3/4), 28.6 Mbps (64QAM 5/6), 31.6 Mbps (256QAM 5/6), 46 Mbps (256QAM 4/5), 35.6 Mbps (64QAM 5/6) and 25.6 Mbps (64QAM 5/6), each carrying one or two services of the listed types.
11.6.2 ATSC
ATSC is a set of standards developed by the ATSC for digital TV transmission over
terrestrial, cable, and satellite networks. The ATSC DTV standard [54, 55] established in 1995, was the worlds first standard for DTV, and it was included as System
A in ITU-R Recommendations BT.1300 [56] and BT.1306 [57]. ATSC standard for
terrestrial broadcasting was adopted by the Federal Communications Commission
(FCC) in USA in December 1996, and the first commercial DTV/HDTV transmissions were launched in November 1998. In November 1997 this standard was
adopted also in Canada and Korea, and some time later in other countries.
Parameter                      Value
Occupied bandwidth             5.38 MHz
Pilot frequency                310 kHz
Modulation                     8-VSB (8-level vestigial sideband)
Transmission payload bitrate   19.39 Mbps
Symbol rate                    10.762 Msymbols/s
Channel coding (FEC)           Trellis coded (inner)/Reed-Solomon (outer)
MPEG-2 standard. The total bitrate of the two video streams is maintained below
the overall bitrate of an ATSC channel (17.5 Mbps). In this way, either current
DTV or new 3D-TV are accessible, and 2D and 3D programmes can be displayed
normally, complying with each service capability and solving the loss of resolution
in the frame compatible method. A 3D-TV display can show both left- and right-eye views in HD resolution (1,080i for each view), and 2D displays can show the
left view with a little sacrifice in image quality. A typical bitrate assignment could
be 12 Mbps (MPEG-2) for the left (primary) stream and an additional 6 Mbps
(H.264/AVC) stream for the right channel. It is important to design a signaling
mechanism in accordance with the multiplexing scheme used for each video
elementary stream. Other issues such as synchronization between MPEG-2
encoded left image and H.264/AVC encoded right image are under development.
This format has been adopted in Korea for providing 3D-TV services [66].
Depth map information may complement a single view to form a 3D programme, or it may be supplemental to the stereo pair. The depth maps may be
encoded with either MPEG-2 or advanced video codecs (AVC/MVC). In this
option, the encoding format can be integrated into broadcast multiplex [67].
11.6.3 ISDB-T
The ISDB system was designed to provide integrated digital information services
consisting of reliable high-quality audio/video/data via satellite, terrestrial, or
cable TV network transmission channels. Currently, the ISDB-T digital terrestrial
TV broadcasting system provides HDTV-based multimedia services, including
service for fixed receivers or moving vehicles, and TV service to cellular phones [68-72].
The ISDB-T system was developed for Japan. Afterwards, other countries adopted it, such as Brazil, which in 2006 adopted a modified version of the standard that uses H.264/AVC coding for video compression (High Profile for SDTV and HDTV, and Baseline Profile for One-Seg). Other South American countries have followed the decision of Brazil, like Peru, Argentina, and Chile
in 2009. The ISDB-T transmission system was included as System C in ITU-R
Recommendation BT.1306 [57].
Channel bandwidth      6/8 MHz
Transmission scheme    Bandwidth segmented transmission (BST)-OFDM
FFT modes              2, 4 and 8 K (2 and 4 K for mobile service)
Carrier modulation     DPSK, QPSK, 16QAM and 64QAM
Guard interval         1/4, 1/8, 1/16 and 1/32 of symbol duration
Channel coding (FEC)   Convolutional (inner)/Reed-Solomon (outer)
Video coding           MPEG-2/H.264/AVC
Channel bandwidth (MHz)   Throughput (Mbps)
6                         3.62-23.3
8                         4.9-31
As shown, the MPEG-2 coder imposes a limitation on image quality in a backwards
compatible scenario. In the ISDB-T version where the H.264/AVC video coder is
included (Brazilian standard) the delivery of 2 HDTV programmes is feasible in a
6 MHz channel (6 OFDM segments each). This option would allow the transmission of
right and left images in HDTV for 3D composition.
11.6.4 DTMB
After carrying out some test trials for evaluating different DTV standards (ATSC,
DVB-T and ISDB-T systems in 2000), China developed its own system for fixed,
mobile, and high-speed mobile reception: Digital Terrestrial Multimedia Broadcasting (DTMB). The standard was ratified in 2006 and adopted as GB20600
Standard [76].
In contrast, the combination of 4-QAM with an FEC code rate of 0.4, providing a throughput beyond 5.4 Mbps, is a good option to support the mobile reception application. Consequently, the high and ultra-high data rate modes are used for fixed reception, transmitting 10-12 SDTV programmes or 1-2 HDTV programmes in one 8 MHz channel, while the low and middle data rate modes are used for mobile reception, transmitting 2-5 SDTV programmes in one 8 MHz channel.
and view enhancement. One of the major outcomes of the trials was the quantification of the additional bitrate required for video plus depth formats: a 600 kbps 2D video service would require an increase in the range of 10-30 % (depending on the specific content) to provide acceptable 3D image quality [85, 86].
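In other words, under the 10-30 % overhead reported by those trials, a video-plus-depth mobile service would land in roughly the following bitrate range (simple arithmetic, shown only for illustration):

```python
base_kbps = 600                      # 2D mobile video service from the trials
low, high = base_kbps * 1.10, base_kbps * 1.30
print(f"{low:.0f}-{high:.0f} kbps")  # -> 660-780 kbps
```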
The mobile 3D activities in Korea have been focused around the T-DMB standard [87, 88]. The Telecommunications Technology Association (TTA) has started the standardization of a 3D video service for T-DMB and S-DMB, and the major remaining issue is how to determine the proper stereoscopic video codec. The options currently being studied are MVC (Multi-view Video Coding) [89], independent H.264/AVC streams, and HEVC (High Efficiency Video Coding) [90]. In the trials carried out so far, different formats have been analyzed (Frame Compatible 3D, dual channel, video plus depth, and partial 3D) and different MPEG-4 based coders have been tested [91-95]. Nevertheless, the deployment scenario assumes that the services will be 2D compatible and that one of the views will be regarded as the base layer and coded using MPEG-2.
Table 11.13 Main features of DVB-C and DVB-C2 systems (source: DVB-C2 standard). The compared features are the input interface (a single transport stream (TS) in DVB-C), modes, FEC, interleaving, modulation, pilots, guard interval, and modulation schemes.
with the requirements of 3D-TV content delivery. For these reasons, only DVB-C2 will be considered in this section.
Display generations: first generation, L + R; second generation, multi-view; third generation, continuous object wave.
available in this scenario, they would deliver 3D-TV content to a group of subscribers and provide a simulcast programme for non 3D-TV displays. Satellite and
cable operators are most likely to implement this option.
The terrestrial free-to-air TV business model is completely different. Given the scarce frequency resources, the major concern is to keep the services backward compatible with existing 2D HD receivers. The 3D-TV content offer would then necessarily be based on an additional information channel that would provide the data to reconstruct the second image for suitably equipped 3D receivers (Layer 4 model as described by ITU-R). There are two format choices for building this 3D scenario:
1. Formats following the 2D + delta scheme. This option would be based on SVC coders or even on a mixture of MPEG-2 and H.264/AVC coders, where one of the channels is the baseline for legacy receivers and the additional layer would provide service access for 3D receivers.
2. Formats based on DIBR techniques using any type of auxiliary depth information. In this case, either a 2D + DOT (data to represent depth, occlusion, and transparency information) or a 2D + depth coding scheme could allow multiple views to be generated for presentation on autostereoscopic displays.
Korea is an example of the coexistence of the two scenarios described above.
On one hand, the terrestrial broadcasters KBS, MBC (Munhwa Broadcasting
Corp.), SBS and EBS (Educational Broadcasting System), have prepared for 3D
trial broadcasting from October 2010 using dual stream coding (left image with MPEG-2, right image with H.264/AVC) at a resolution of 1,920 × 1,080 interlaced, 30 fps. On the other hand, the pay TV segment formed by cable broadcasters
CJHelloVision and HCN and Korea Digital Satellite Broadcasting, will also take
part in the 3D trial broadcasting service situation, using in this case a frame-based
solution.
11.9 Conclusions
The rollout of mass market 3D-TV services over the broadcast networks is still
under development. During the last couple of years, a remarkable standardization
and research activity has been carried out, with different degrees of success.
Among all the possibilities to represent the 3D video content, the choice will be strongly dependent on the business model. Free-to-air TV (mainly terrestrial) will have backwards compatibility as one of its major requirements, due to the scarce capacity resources inherent to terrestrial networks. Here, DIBR-based approaches are a good compromise between simplicity, perceived 3D quality, and 2D receiver and display compatibility. The requirement of 2D backwards compatibility is being assumed by different consortia and standardization bodies (DVB, ATSC and SMPTE among others).
In the case of pay TV broadcasters (especially satellite and cable) it is clear that the short-term deployments will be based on one of the versions of Frame-
Compatible 3D video. The next step would then be the enhancement of the content
resolution to achieve a quality that would be close to the Full Resolution format.
Again, DIBR techniques have in this area a special interest, as a way to produce a
complementary data layer that could upgrade a limited resolution 3D stream to an
HDTV 3D service.
As relevant as the format, at least from the broadcaster side, the coding algorithm will be either one of the enabling technologies or one of the obstacles for the fast deployment of 3D services. Currently, the bitrate needed to provide a 3D service is assumed to be around 60 % higher than for the equivalent 2D material. This increase is a challenge for the current broadcast standards if today's services are to be maintained. The solution for this obstacle might rely on the new generation of video coding standards.
References
1. International Telecommunications Union. Radiocommunications Sector (2010) Report
ITU-R. BT.2160 features of three-dimensional television video systems for broadcasting
2. International Organization for Standardization (2007) ISO/IEC 23002-3:2007 Information technology - MPEG video technologies - Part 3: Representation of auxiliary video and supplemental information
3. International Organization for Standardization (2011) Call for proposals on 3D video coding
technology, ISO/IEC JTC1/SC29/WG11 MPEG2011/N12036
4. Society of Motion Picture and Television Engineers (2009) Report of SMPTE Task Force on
3D to the Home
5. Digital Video Broadcasting (2010) DVB BlueBook A151 Commercial requirements for
DVB-3DTV
6. Digital Video Broadcasting (2011) DVB Frame compatible plano-stereoscopic 3DTV
(DVB-3DTV), DVB BlueBook A154
7. European Telecommunications Standard Institute (2011) ETSI EN 302 755 V1.2.1. Frame
structure channel coding and modulation for a second generation digital terrestrial television
broadcasting system (DVB-T2)
8. European Telecommunications Standard Institute (2011) ETSI EN 302 769 V1.2.1 Frame
structure channel coding and modulation for a second generation digital transmission system
for cable systems (DVB-C2)
9. European Telecommunications Standard Institute (2009) ETSI EN 302 307 V1.2.1. Second
generation framing structure, channel coding and modulation systems for broadcasting,
interactive services, news gathering and other broadband satellite applications
10. Müller K et al (2009) Coding and intermediate view synthesis of multiview video plus depth. In: 16th IEEE international conference on image processing (ICIP), pp 741-744
11. Li Sisi et al (2010) The overview of 2D to 3D conversion system. In: IEEE 11th international conference on computer-aided industrial design and conceptual design (CAIDCD), vol 2, pp 1388-1392
12. International Organization for Standardization (2009) ISO/IEC JTC1/SC29/WG11 N10540: Text of ISO/IEC 14496-10:2009 FDAM 1 (including stereo high profile)
13. Fehn C (2004) 3D-TV using depth-image-based rendering (DIBR). In: Proceedings of picture coding symposium
14. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D-TV. IEEE Trans Broadcast 51(2):191-199
15. Merkle P et al (2007) Multi-view video plus depth representations and coding technologies. In: IEEE conference on image processing, pp 201-204
16. Bartczak B et al (2011) Display-independent 3D-TV production and delivery using the layered depth video format. IEEE Trans Broadcast 57(2):477-490
17. European Telecommunications Standard Institute (2009) ETSI TS 102 034 V1.4.1. Transport
of MPEG-2 TS based DVB services over IP based networks (and associated XML)
18. Internet Engineering Task Force (IETF) (2011) RTP Payload format for H.264 video, RFC
6184, proposed standard
19. International Organization for Standardization (2010) ISO/IEC 14496-10:2010. Information technology - coding of audio-visual objects - Part 10: advanced video coding
20. Vetro A et al (2011) Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard. Proc IEEE 99(4):626-642
21. Smolic A et al (2009) Development of a new MPEG standard for advanced 3D video applications. In: Proceedings of 6th international symposium on image and signal processing and analysis, pp 400-407
22. Merkle P et al (2008) Adaptation and optimization of coding algorithms for mobile 3DTV.
MOBILE 3DTV, Technical report D2.2
23. International Organization for Standardization (2007) Text of ISO/IEC 13818-1:2003/
FPDAM2 Carriage of auxiliary video data streams and supplemental information, Doc No.
8799
24. Bourge A et al (2006) MPEG-C Part 3: Enabling the introduction of video plus depth contents. In: Proceedings of the workshop on content generation and coding for 3D-Television
25. Sukhee C et al (2005) Disparity-compensated stereoscopic video coding using the MAC in MPEG-4. ETRI J 27(3):326-329
26. Hewage CTER et al (2007) Comparison of stereo video coding support in MPEG-4 MAC,
H.264/AVC and H.264/SVC. In: 4th IET international conference on visual information
engineering
27. Hur N (2010) 3D DMB: A portable/mobile 3D-TV system. 3D-TV workshop, Shanghai
28. Hoffmann H et al (2006) Studies on the bit rate requirements for a HDTV format With 1920
1,080 pixel resolution, progressive scanning at 50 Hz frame rate targeting large flat panel
displays. IEEE Trans Broadcast 52(4)
29. Hoffmann H et al (2008) A novel method for subjective picture quality assessment and
further studies of HDTV formats. IEEE Trans Broadcast 54(1)
30. European Broadcasting Union (2006) Digital terrestrial HDTV broadcasting in Europe. EBU
Tech 3312
31. Klein K et al (2007) Advice on spectrum usage, HDTV and MPEG-4. http://www.bbc.co.uk/
bbctrust/assets/files/pdf/consult/hdtv/sagentia.pdf
32. Brugger R et al (2009) Spectrum usage and requirements for future terrestrial broadcast
applications. EBU technical review
33. McCann K et al (2009) Beyond HDTV: implications for digital delivery. An independent
report by Zeta cast Ltd commissioned by ofcom
34. Husak W (2009) Issues in broadcast delivery of 3D. EBU 3D TV workshop
35. Tagiri S et al (2006) ISDB-C: cable television transmission for digital broadcasting in Japan. Proc IEEE 94(1):303-311
36. Robert J et al (2009) DVB-C2: the standard for next generation digital cable transmission.
In: IEEE international symposium on broadband multimedia systems and broadcasting,
BMSB
37. Benzie P et al (2007) A survey of 3DTV displays: techniques and technologies. IEEE Trans
Circuits Syst Video Technol 17(11):
38. Onural L et al (2006) An assessment of 3DTV technologies. NAB BEC proceedings, pp 456-467
39. Surman P et al (2008) Chapter 13: Development of viable domestic 3DTV displays.
In: Ozaktas HM, Onural L (eds) Three-dimensional television, capture, transmission, display.
Springer, Berlin
66. Park S et al (2010) A new method of terrestrial 3DTV broadcasting system. IEEE broadcast
symposium
67. Advanced Television Systems Committee (2011) ATSC planning team interim report. Part2:
3D technology
68. Association of Radio Industries and Businesses (2005) Transmission system for digital
terrestrial television broadcasting, ARIB Standard STD-B31
69. Association of Radio Industries and Businesses (2006) Operational guidelines for digital
terrestrial television broadcasting, ARIB Tech. Rep. TR-B14
70. Association of Radio Industries and Businesses (2007) Video coding, audio coding and
multiplexing specifications for digital broadcasting, ARIB standard STD-B32
71. Association of Radio Industries and Businesses (2008) Data coding and transmission
specification for digital broadcasting, ARIB standard STD-B24
72. Asami H, Sasaki M (2006) Outline of ISDB systems. Proc IEEE 94:248-250
73. Uehara M (2006) Application of MPEG-2 systems to terrestrial ISDB (ISDB-T). Proc IEEE 94:261-268
74. Takada M, Saito M (2006) Transmission system for ISDB-T. Proc IEEE 94:251-256
75. Itoh N, Tsuchida K (2006) HDTV mobile reception in automobiles. Proc IEEE 94:274-280
76. Standardization Administration of the People's Republic of China (2006) Frame structure, channel coding and modulation for a digital television terrestrial broadcasting system, Chinese national standard GB 20600
77. Wu J et al (2007) Robust timing and frequency synchronization scheme for DTMB system. IEEE Trans Consum Electron 53(4):1348-1352
78. Zhang W et al (2007) An introduction of the Chinese DTTB standard and analysis of the PN595 working modes. IEEE Trans Broadcast 53(1):8-13
79. Song J et al (2007) Technical review on Chinese digital terrestrial television broadcasting standard and measurements on some working modes. IEEE Trans Broadcast 53(1):1-7
80. OFTA (2009) Technical specifications for digital terrestrial television baseline receiver
requirements
81. Ong C (2009) White paper on latest development of digital terrestrial multimedia broadcasting
(DTMB) technologies. Hong Kong Applied Science and Technology Research Institute
(ASTRI), Hong Kong
82. European Telecommunications Standards Institute (2009) ETSI TR 102 377 v1.3.1 digital
video broadcasting (DVB): DVB-H implementation guidelines
83. European Telecommunications Standards Institute (2004) ETSI EN 302 304 v1.1.1 Digital
video broadcasting (DVB); transmission system for handheld terminals (DVB-H)
84. Faria G et al (2006) DVB-H: digital broadcast services to handheld devices. Proc IEEE
94(1):194209
85. Atanas G et al (2010) Complete end-to-end 3DTV system over DVB-H. Mobile3DTV project
86. Atanas G et al (2011) Mobile 3DTV content delivery optimization over DVB-H system. Final
public summary. Mobile3DTV project
87. European Telecommunications Standards Institute (2006) ETSI EV 300401 v1.4.1, radio
broadcasting systems, digital audio broadcasting (DAB) to mobile, portable and fixed receivers
88. Telecommunications Technology Association (2005) TTASKO-07.0024 radio broadcasting
systems, Specification of the video services for VHF digital multimedia broadcasting (DMB)
to mobile, portable and fixed receivers
89. International Organization for Standardization (2008) ISO/IEC JTC1/SC29/WG11 joint draft
7.0 on multiview video coding
90. Baroncini V, Sullivan GJ and Ohm JR (2010) Report of subjective testing of responses to
joint call for proposals on video coding technology for high efficiency video coding (HEVC).
Document JCTVC-A204 of JCT-VC
91. Yun K et al (2008) Development of 3D video and data services for T-DMB. SPIE
Stereoscopic Disp Appl XIX 6803:2830
92. International Organization for Standardization (2010) ISO/IEC. JTC1/SC29/WG11 a frame
compatible system for 3D delivery. Doc. M17925
344
P. Angueira et al.
Part IV
Chapter 12
Keywords 3D-TV · Binocular correspondence · Binocular development · Binocular rivalry · Binocular visual system · Depth-image-based rendering (DIBR) · Disparity scaling · Dynamic depth cue · Fusion limit · Horopter · Monocular occlusion zone · Motion parallax · Size scaling · Stereoacuity · Visual cortex
12.1 Introduction
In the last decade, stereoscopic media such as 3D movies, 3D television (3D-TV),
and mobile devices have become increasingly common. This has been facilitated
by technical advances and reduced cost. Moreover, viewers report a preference for
3D content over the same 2D content in many contexts. Viewing media content in
stereoscopic 3D increases viewers' sense of presence (e.g. [1]), and 3D content
viewed on large screens and mobile devices is scored more favorably than 2D
content viewed on the same displays [2]. With the increasing demand for 3D
P. M. Grove (&)
School of Psychology, The University of Queensland, Brisbane, Australia
e-mail: p.grove@psy.uq.edu.au
disparity scaling with distance (Sect. 12.6), dynamic depth cues and their interaction
with binocular disparity (Sect. 12.7), and concludes with binocular development and
plasticity (Sect. 12.8). Section 12.9 concludes the chapter. This chapter aims to help
workers in 3D media to understand the biological side of the 3D media-consumer
relationship with the goal of eliminating artifacts and distortions from their displays,
and maximizing user comfort and presence.
Researchers in human visual perception specify the size of distal objects, the extent of visual
space, and binocular disparities in terms of their angular extent at the eye rather than linear
measurements such as display-screen units like pixels. See Harris [13] for a description. For
reference, a 1 cm wide object 57 cm from the eye subtends approximately 1 degree of visual
angle.
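For readers who want to convert between the linear sizes used to specify displays and the angular units used throughout this chapter, the following sketch (a hypothetical helper, not part of any cited work) computes the visual angle subtended by an object of a given width at a given distance:

```python
import math

def visual_angle_deg(width_cm: float, distance_cm: float) -> float:
    """Visual angle subtended by an object of a given width at a given viewing distance."""
    return math.degrees(2 * math.atan(width_cm / (2 * distance_cm)))

# A 1 cm wide object viewed from 57 cm subtends roughly 1 degree, as in the footnote above.
print(round(visual_angle_deg(1.0, 57.0), 3))  # ~1.005
```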
Fig. 12.3 In (a) the two eyes are fixating on point P. The larger arc of the circle intersects the
fixation point and the optical nodal points of the two eyes. Geometry dictates that the angle
subtended at Q is equal to the angle subtended at the fixation point (P) (the angles labeled x).
It follows that the image of object Q is an equal angular distance from the fovea in the two eyes.
This is true for the binocular images of any object lying on this arc. In (b) the geometric vertical
horopter is a line perpendicular to the plane containing the horizontal horopter that intersects the
fixation point. The object at R is equally distant from the two eyes and therefore its angular
elevation from the plane containing the horizontal horopter is the same for both eyes (the angles
labeled v). Therefore, the image of the square at R projects to a location on the retina an equal
angular extent below the fovea in each eye
line perpendicular to the plane containing the geometric horizontal horopter, intersecting the fixation point, referred to as the geometric vertical horopter. The geometric horopter (Fig. 12.3) is limited to these loci because locations away from the
median plane of the head and the horizontal plane of regard give rise to vertical image
size differences owing to the fact that eccentrically located objects are different
distances from the two eyes, precluding stimulation of corresponding points.
The empirical horopter is a map of the locations in space that observers perceive
in identical directions from both eyes for a given point of fixation. In general,
the empirical horopter is determined by having an observer fixate at a specific distance straight ahead and make judgments about the relative alignment of dichoptic
targets, presented at various eccentricities. The inference from these measurements
is that aligned targets stimulate corresponding points in the two retinas.
The loci of points comprising the empirical horopter differ from those making up
the geometric horopter. The portion of the empirical horopter in the horizontal plane
is characterized by a shallower curve than the geometric horopter [28, 29]
(Fig. 12.4). More striking, however, is the difference between the empirical vertical
horopter and its geometric counterpart. The vertical component of the empirical
horopter is inclined top away [30–35]. Moreover, the inclination of the empirical
vertical horopter increases with viewing distance such that it corresponds with the
ground plane at viewing distances greater than about 2 m.
Knowing the loci making up the geometric and empirical horopters is useful for
at least two reasons. First, these loci predict the regions in space at which we would
expect superior stereoacuity. As discussed in Sect. 12.4, stereoacuity degrades quickly
as the pair of objects is displaced farther in front of or behind the fixation point. Therefore,
for best stereoscopic performance, it would be desirable to present objects close to zero
disparity loci. Second, it is very likely that zero disparity loci are close to the middle
of the range of fusible disparities [36]. Knowing this locus can guide the positioning
of the zone of comfortable viewing [9].
As mentioned above, corresponding points cannot be determined for locations
in space away from the horizontal and vertical meridians. However, with fixation
straight ahead, a surface inclined top away will come close to stimulating corresponding points in the two eyes [35, 36]. One implication from these data is that
stereo performance should be superior around these loci and that the zone of
comfort for binocular disparities should be centered on the empirical horopters.
The shapes of the horizontal and vertical empirical horopters differ from the flat
surfaces of consumer 3D displays. Deviations from the horizontal horopter are likely to be small
for normal viewing distances of approximately 2 m. For example, a large
60-inch flat TV display viewed from 2 m would be beyond the horizontal horopter
for all locations except the fixation point. With fixation in the center of the display,
the outer edges would be approximately 10 cm behind the geometric horopter. The
discrepancy between the TV and the empirical horizontal horopter would be less
owing to the latter's shallower curve. However, TVs are usually upright and vertical.
Therefore, the typical 3D display deviates markedly from the inclination of the
empirical vertical horopter. This deviation grows with increasing viewing distance as
the backward inclination of the vertical horopter increases. Psychophysical studies
have shown that observers prefer to orient displays top away approaching the
empirical vertical horopter [33, 38]. Indeed, basic human perception studies have
shown that observers have a perceptual bias to see vertical lines as tilted top away
[39]. These findings combine nicely with the applied work by Nojiri et al. [6] who
reported that viewers find images with uncrossed disparities in the upper visual field
more comfortable to view than other distributions, broadly consistent with the noted
backward inclination of the empirical vertical horopter.
spatial frequency stimuli. It seems that the upper fusion limit is determined by the
highest spatial frequency contained in an image.
If eye movements are not controlled, as is the case for commercial 3D displays,
very large disparities can be introduced and a vergence eye movement can
effectively reduce them. For example, Howard and Duke [54], and later Grove
et al. [55] showed that observers could reliably match the depth of a square
displaced in depth with crossed disparities as large as 3°. Although it is possible to
fuse very large screen disparities with accompanying vergence movements, it is
demanding and generally leads to fatigue and discomfort. Therefore, one recommendation for entertainment 3D media is to limit the range of disparities in the
display [56].
Under conditions where observers must fixate on a static stimulus, the range of
fusible horizontal disparities is larger than the range of fusible vertical disparities.
For example, Grove et al. [52] found diplopia thresholds in central vision for
horizontal disparities were approximately 10 arc min but they were only 5 arc min
for vertical disparity.
The binocular system responds to vertical disparities resulting from vertical
misalignments of stereoscopic images by making compensatory eye movements to
align the eyes and eliminate the vertical disparities as much as possible. The
mechanism for vertical vergence eye movements integrates visual information
over a large part of the visual field [57]. This is probably why viewers can tolerate
rather large vertical offsets between left and right eye video sequences presented
on large displays [58, 17] but not on small displays [59]. The large displays will
stimulate compensatory vertical vergence eye movements, but the smaller displays
do not.
Vertical disparities arise in 3D media when the stereoscopic images are
acquired with two converged (toed-in) cameras. For example, filming a square
object placed directly in front of two converged cameras will result in two trapezoidal images, one in each eye. The left side of the trapezoid will be taller in the
left eye than the right eye and vice versa, introducing a gradient of vertical disparities across the width of the image. If these distortions are large enough, they
could lead to double images. However, viewers seem to tolerate these vertical
disparities with little reduction in viewing comfort [60].
The maximum disparity that can be fused depends on the horizontal and vertical
spacing between the objects in depth. This has been operationally defined as the
disparity gradient, the disparity between two images divided by their angular
separation. Burt and Julesz [61] reported that for small dots a disparity gradient of
one was the boundary between fusion and diplopia. That is, when the angular
separation between two dots was equal to or less than the angular disparity,
diplopia was experienced. The concept of disparity gradient is somewhat problematic, however, because it is not clear how to specify the separation between
images wider than a small dot. Nevertheless, the presence of adjacent objects in a
3D scene affects the fusion of disparate images.
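As a rough illustration of the Burt and Julesz criterion described above, the sketch below (hypothetical helper; thresholds vary with stimulus type) computes the disparity gradient for a pair of dots and flags gradients above one as likely to produce diplopia:

```python
def disparity_gradient(disparity_arcmin: float, separation_arcmin: float) -> float:
    """Disparity between two images divided by their angular separation."""
    return disparity_arcmin / separation_arcmin

# Dots 8 arc min apart carrying 10 arc min of relative disparity exceed a gradient of 1,
# so for small dots fusion would be expected to fail under Burt and Julesz's criterion.
g = disparity_gradient(10.0, 8.0)
print(g, "diplopia likely" if g > 1.0 else "fusion likely")
```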
Fig. 12.7 a Stimuli for binocular rivalry: left eye views vertical stripes and the right eye views
horizontal stripes or vice versa. In b different perceptions depending on image size are shown.
Small images alternate in their entirety. Larger images alternate in a piecemeal fashion (right
panel in b). The reader can experience rivalry by free fusing the images in a
strategy that removes high spatial frequencies from the image at the cost of fine
detail. Meegan et al. [67] showed that when an uncompressed image was presented
to one eye and a compressed image was presented to the other eye, the perceived
quality of the fused image was dependent on the type of compression strategy.
When one image was blurred, the perceived quality of the fused image was close
to the uncompressed image. When JPEG compression was applied to one eye's
image, the fused image was degraded compared to the uncompressed image. This
is likely due to high spatial frequency artifacts introduced in the blocky image
suppressing corresponding regions in the uncompressed image. Therefore, this
applied research study combines nicely with the visual perception studies on
binocular rivalry discussed above, suggesting that low-pass filtering of one eye's
image is a promising compression strategy for 3D media.
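A minimal sketch of the asymmetric-coding idea discussed above: low-pass filtering (blurring) one eye's view while leaving the other untouched. The helper below uses scipy's Gaussian filter and is only an illustration of the principle, not the procedure used in the cited studies:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def asymmetric_pair(left: np.ndarray, right: np.ndarray, sigma: float = 2.0):
    """Return a stereo pair in which only the right view is low-pass filtered.

    Blurring removes high spatial frequencies, which the fused percept tends to
    tolerate better than blocky, high-frequency coding artifacts in one eye.
    """
    return left, gaussian_filter(right, sigma=sigma)

left = np.random.rand(480, 640)
right = np.random.rand(480, 640)
left_out, right_blurred = asymmetric_pair(left, right, sigma=3.0)
```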
Fig. 12.8 Monocular occlusion zones. The near surface blocks portions of the far surface from
view for one of the black eyes but not the other. The gray eye in the center illustrates that
translating between the two eyes' positions results in portions of the background becoming
visible to that eye while other portions become invisible (see Sect. 12.7 for a discussion of
dynamic depth cues). Adapted from Vision Research, vol. 30, 11, Nakayama and Shimojo [76],
Da Vinci stereopsis: depth and subjective occluding contours from unpaired image points,
pp. 1811–1825, with permission from Elsevier. Copyright 1990
match in the other eye and therefore disparity is undefined. Moreover, these
monocular regions may differ in texture or luminance from the corresponding
region in the other eye. Considering the previous discussion on binocular rivalry,
monocular regions should constitute an impediment to stereopsis and potentially
induce rivalry.
Psychophysical investigations since the late 1980s have shown that monocular
occlusion zones consistent with naturally occurring occlusion surface arrangements resist rivalry [68] and contribute to binocular depth perception. For example
the presence, absence, or type of texture in monocular regions impacts the speed of
depth perception and the magnitude of depth perceived in relatively simple
laboratory stimuli as well as complex real world images [69]. When monocular
texture is present and can be interpreted as a continuation of the background
surface or object to which it belongs, perceived depth is accelerated [70] relative to
when that region is left blank. If, on the other hand, the monocular texture is very
different from the surrounding background texture, perceived depth is retarded or
even destroyed [71–73]. Furthermore, monocular occlusions can also elicit the
perception of depth between visible surfaces in the absence of binocular disparity
[74, 75]. Even more striking are demonstrations in which monocular features elicit
a 3D impression and the generation of an illusory surface created in perception
possibly to account for the absence of that feature in one eye's view [76–80]. For a
recent review, see Harris and Wilcox [81].
Fig. 12.9 A slanted line (a) or a partially occluded frontoparallel line (b) can generate identical
horizontal disparities. Black rectangles below each eye illustrate the difference in width between
the left and right eyes' images. The gray region in the right figure is a monocular occlusion zone.
From Vision Research, Vol. 44, 20, Gillam and Grove [86], Slant or occlusion: global factors
resolve stereoscopic ambiguity in sets of horizontal lines, pp. 2359–2366, with permission from
Elsevier. Copyright 2004
In addition to affecting the latency for stereopsis and the magnitude of depth,
differential occlusion in the two eyes can have a dramatic effect on how existing
binocular disparities in the scene are resolved [8285]. In some scenes, local horizontal
disparities are equally consistent with a slanted object and a flat object that is partially
occluded by a nearer surface. If a plausible occlusion solution is available, it is preferred.
For example, consider Fig. 12.9a. A contour that is slanted in depth generates
images of different widths in the two eyes. The corresponding endpoints are
matched and their disparities are computed. However, note that the same image
width differences of a frontoparallel line are generated when it is differentially
occluded in the two eyes by a nearer surface, as in Fig. 12.9b. In the latter case,
when an occluder is present, the right end of the line is noncorresponding.
Disparity computations are discarded here and the binocular images are interpreted
to be resulting from partial occlusion by the nearer surface [82, 83].
These human visual perception experiments complement applied work on 3D
displays and in 3D media production to solve the problem of unmatched features in
the two eyes resulting from objects entering and exiting at the edges of the 3D
display [86]. Depending on the image content, these unmatched features can elicit
rivalry, as discussed previously, or they can introduce spurious disparities resulting
in misperceived depth. A common technique is to apply masks to the left and right
sides of the display and stereoscopically move them forward in depth [87]. The
resulting floating window partially occludes the right eye's view of the right side of
the display and the left eye's view of the left side of the display.
In the context of DIBR, monocular occlusion zones are referred to as holes and
are a major source of display artifacts [12]. The synthesis of a second virtual
stereoscopic image from a 2D image and a depth map is possible for all parts of the
image that are visible to both eyes in the fused scene. However, there is no
information in the original 2D image or the depth map about the content of
monocular occlusion zones. Indeed, these regions present as blank holes, known as
disocclusions in the computer graphics literature, in the synthesized image and
must be filled with texture. However, it is not clear how to fill these regions.
Algorithms that interpolate pixel information in the background produce visible
artifacts of varying severity depending on image content [2]. Furthermore, visual
resolution in monocular regions is equal to that in binocularly visible regions [88].
Therefore, blurring these regions or reducing their contrast is not a viable strategy
to conceal the artifacts.
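To make the notion of holes concrete, here is a toy horizontal warp: each pixel of an image is shifted sideways by a disparity derived from its depth value, and target pixels that receive no source pixel are flagged as holes (disocclusions). This is a simplified sketch under assumed conventions (normalized depth, nearer pixels shifted farther, no visibility ordering), not a production DIBR renderer:

```python
import numpy as np

def warp_with_holes(image: np.ndarray, depth: np.ndarray, max_disp: int = 8):
    """Shift each pixel horizontally by a depth-dependent integer disparity.

    Returns the synthesized view and a boolean mask of holes (pixels that
    received no contribution), which correspond to monocular occlusion zones.
    """
    h, w = depth.shape
    out = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    # Assume depth is normalized to [0, 1], with 1 = nearest (largest disparity).
    disp = np.round(depth * max_disp).astype(int)
    for y in range(h):
        for x in range(w):
            nx = x + disp[y, x]
            if 0 <= nx < w:
                out[y, nx] = image[y, x]
                filled[y, nx] = True
    return out, ~filled

img = np.random.rand(4, 16)
dep = np.tile(np.linspace(0.0, 1.0, 16), (4, 1))
view, holes = warp_with_holes(img, dep)
print(holes.sum(), "hole pixels to be filled")
```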
The choice of texture to fill the holes could be informed by additional information provided in the depth map such that more than one depth and color is
stored for each image location [2]. However, this increases the computational load
and bandwidth requirements for transmission. Zhang and Tam [12] proposed a
method in which the depth map is preprocessed to smooth out sharp depth
discontinuities at occlusion boundaries. With increased smoothing, disocclusions
can be nearly eliminated without a significant decrease in subjective image quality.
Nevertheless, their informal observations revealed that object boundaries were less
sharp in images with the smoothed depth map and depth quality was somewhat
compromised.
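The pre-processing idea attributed to Zhang and Tam above can be sketched as a simple Gaussian smoothing of the depth map before warping; the function name and parameters below are illustrative, and the original method's exact filter is not reproduced here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_depth_map(depth: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Blur sharp depth discontinuities so that warping produces fewer and smaller
    holes, at the cost of softer object boundaries and reduced depth fidelity."""
    return gaussian_filter(depth.astype(np.float32), sigma=sigma)

depth = np.zeros((120, 160), dtype=np.float32)
depth[40:80, 60:100] = 1.0  # a near object against a far background
smoothed = smooth_depth_map(depth, sigma=8.0)
```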
As reviewed above, however, monocular occlusion zones have a major impact
on human stereopsis. Therefore, more research is needed to determine the optimal
balance between the benefits accrued from depth map smoothing and the costs
associated with reducing the visibility of monocular zone information from these
displays.
where h is the linear height of the object, d is the distance between the object and
the eye, and a is the angular subtense of the object at the retina. As can be deduced
from Eq. (12.1), for an object of a given size, changes in distance lead to changes
in the angular extent of retinal image such that a doubling of the viewing distance
results in halving the angular extent of the retinal image. Thought of another way,
to maintain a retinal image of a given size, the real object must shrink or grow as it
approaches or recedes from the viewer. Consider Fig. 12.10. The inverted retinal
image of the arrow corresponds to both the arrow at distance d and the arrow that
is twice the height at distance 2d. For a fixed retinal image size, changing the
distance of the object requires a change in size of the object. Size scaling is a
perceptual process leading to correctly perceiving an object's size as constant
despite large changes in the size of the retinal image due to changes in the viewing
distance [89].
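A small numerical check of the relation discussed above (Eq. 12.1, here taken in its exact form; for small angles it reduces to a ≈ h/d in radians, with symbol names following the text):

```python
import math

def angular_subtense_deg(h_m: float, d_m: float) -> float:
    """Angular subtense of an object of linear height h at distance d from the eye."""
    return math.degrees(2 * math.atan(h_m / (2 * d_m)))

# Doubling the viewing distance approximately halves the retinal (angular) size.
print(angular_subtense_deg(0.5, 2.0))   # ~14.25 degrees
print(angular_subtense_deg(0.5, 4.0))   # ~7.15 degrees (about half)
# A 1.0 m tall object at 4 m subtends about the same angle as a 0.5 m object at 2 m,
# matching the arrow example in Fig. 12.10.
print(angular_subtense_deg(1.0, 4.0))   # ~14.25 degrees
```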
Size scaling is a robust process in the real three-dimensional environment,
though it often breaks down in artificial environments. Indeed, when an object is
stereoscopically displaced nearer in depth, d is perceptually reduced but the retinal
image remains the same and the object perceptually shrinks. When the object is
stereoscopically displaced farther in depth, d perceptually increases and the
object's height is perceptually overestimated. The direction of the size distortions
is in the opposite direction to the expected change in size with distance. Therefore,
viewers could experience incorrect depth if they base their depth judgments on
size rather than disparity.
A well-known illusion related to size constancy is the puppet theater effect in
which familiar images, usually of humans, appear unusually small in 3D displays
as though they are puppets. The magnitude of this illusion is linked to the type of
camera configuration, with a greater illusion occurring for toed-in cameras than
parallel cameras [90, 91]. However, the flexibility of 3D reproduction afforded by
DIBR offers a solution to size distortions that can be implemented at any time after
capturing the images [2]. Potentially, size scaling of objects displaced in depth can
be programmed into the rendering algorithm and distortions associated with
camera positioning and configuration can also be corrected.
δ = aΔd / (D² + DΔd)   (in radians)   (12.2)
during acquisition are likely to be even greater when viewing 3D media on mobile
devices with small handheld displays.
Differences between parameters during image capture and those during image
viewing, combined with the fact that the relative depth from a given disparity
scales with the inverse of the square of viewing distance are likely contributing
factors to the so-called cardboard cutout phenomenon [40]. For example, a stereoscopic photograph of a group of people may, when viewed in a stereoscope,
yield the perception of a number of depth planes. However the volume of the
individual people is perceptually reduced such that they appear as cardboard
cutouts instead of their real volume. Typically, such photographs are taken from a
greater distance than the photos are viewed from. Therefore, when viewing these
photos, the vergence, accommodation, and other cues such as perspective signal a
shorter viewing distance. Depth is still appreciated between the individuals, but the
striking distortion (that the subjects appear as cardboard cutouts) results from the
disparities signaling the volumetric depth of their bodies being scaled down with
viewing distance. Attempts to eliminate this illusion by making the viewing
conditions as close to the capture conditions as possible have been partly
successful [92, 93] suggesting that at least part of the phenomenon is due to cue
conflicts arising from differences between shooting and viewing conditions.
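The scaling claim above can be checked numerically: for a fixed angular disparity, the predicted depth interval grows roughly with the square of the viewing distance. The small-angle relation Δd ≈ δD²/a is assumed here (δ the angular disparity, a the interocular separation), consistent with the form of Eq. (12.2):

```python
def depth_from_disparity_m(disparity_rad: float, distance_m: float, iod_m: float = 0.063) -> float:
    """Approximate depth interval signalled by a fixed angular disparity (small angles)."""
    return disparity_rad * distance_m ** 2 / iod_m

disp = 0.0005  # about 1.7 arc min of disparity
for d in (1.0, 2.0, 4.0):
    print(d, "m viewing distance ->", round(depth_from_disparity_m(disp, d), 3), "m of depth")
# Doubling the viewing distance quadruples the depth implied by the same disparity,
# which is one reason captured scenes can look flattened when viewed from too near.
```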
A promising feature of DIBR is its flexibility in 3D reproduction [2]. When
combined with stereoscopic image capture using a parallel camera configuration,
the main system variables such as camera separation and convergence distance
need not be fixed from the time of shooting. Instead, these parameters can be
optimized in the rendering process to adapt to specific viewing conditions. This
would enable the same 3D media to be optimally presented in varying contexts
from cinemas to mobile devices.
Harris [13] makes an important observation about the appropriateness of
modifying 3D content to maximize comfort versus accurate representation. In
entertainment settings such as movies, TV, and mobile devices it is reasonable to
sacrifice accuracy for viewing comfort. However, it is possible that DIBR
technology may be implemented in contexts where accuracy is very important
such as remote medical procedures and design CAD/CAM applications. Nevertheless, DIBR algorithms could be customized to support either entertainment or
industrial/medical applications.
dimensional object or scene. If the translation of the eye is equal in length to the
individual's interocular distance, depth from motion parallax is geometrically
identical to depth from stereopsis [94]. Figure 12.12 shows the relative motion of the
retinal images of two objects separated in depth as the viewer moves laterally. With
fixation on the closer black dot, as the eye translates from right to left, the image of the
dot remains on the fovea while the image of the more distant square moves across the
retina from its original position on the nasal side toward the temporal side. The total
distance moved by the image of the square is equal to the binocular disparity
generated by viewing these two objects simultaneously with two stationary eyes. Note
that with fixation on the closer of the two objects, the motion of the image of the more
distant object across the retina is against the motion of the eye. With fixation on the
far object, the image of the closer object moves across the retina in the same direction
as the movement of the eye.
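The geometric equivalence described above suggests a simple conversion between parallax motion and disparity: a head translation equal to the interocular distance produces relative image motion equal to the binocular disparity, so parallax from other translation amplitudes can be rescaled accordingly. A hypothetical helper, assuming a 6.3 cm interocular distance:

```python
def equivalent_disparity_arcmin(image_motion_arcmin: float,
                                head_translation_m: float,
                                iod_m: float = 0.063) -> float:
    """Rescale the retinal image motion produced by a lateral head translation to the
    binocular disparity that an equivalent static two-eye view would produce."""
    return image_motion_arcmin * (iod_m / head_translation_m)

# 30 arc min of relative image motion over a 12 cm head sweep corresponds to
# about 15.75 arc min of equivalent binocular disparity.
print(equivalent_disparity_arcmin(30.0, 0.12))
```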
The minimum detectable depth from motion parallax is about the same as for
binocular disparity [95] if the parallax is due to active lateral head motion. Depth
thresholds are slightly higher when passively viewing a translating display
containing a depth interval. This could be because the retinal image is more
accurately stabilized during self-motion than when tracking a translating image
[96]. This is an important consideration for 3D media developers when adding
motion parallax as a cue to depth in their content. Greater visual fidelity will be
achieved if the relative motion is yoked to the viewers head movement than if
they passively view the display.
Like binocular disparity, motion parallax has an upper limit to the relative
motion of images on the retina after which perceived depth is degraded. Depth
from motion parallax is metrical and precise for relative motions up to approximately 20 arc min (of equivalent disparity) [97]. Increasing the relative motion
beyond this point results in the perception of relative motion and depth.
Still further increasing the relative motion destroys the depth percept and only
relative motion is perceived. Viewers may be able to intuit the depth order from
the large relative motions, but depth is not directly perceived [98].
A closely related source of dynamic information is accretion and deletion of
background texture at the vertical edges of a near surface. For example, Fig. 12.8
shows monocular regions on a far surface owing to partial occlusion by a near
surface. Consider an eye starting at the position of the right eye in this figure and
translating to the position of the left eye. During this translation, parts of the
background that were visible to the eye in its initial position will gradually be
hidden from view by the closer occluder. Conversely, regions of the background
that were initially hidden from view gradually come into view. At a position
midway between the eyes (the gray eye in Fig. 12.8), part of the monocular zone
that was visible to the right is now partially hidden by the near occluder and part of
the monocular zone that was completely hidden to the right eye (the left eye
monocular zone) is now partially visible. Therefore, in the real 3D environment,
motion of the head yields an additional cue to depth, the gradual accretion and
deletion of background texture near the vertical edges of foreground objects and
surfaces [89, 99]. Dynamic accretion and deletion of texture is an additional
challenge for DIBR because filling in partially occluded texture regions will need
to be computed online. This introduces additional computational problems and
also requires an efficient strategy for choosing appropriate textures to avoid visible
artifacts (see [100]).
perception just from disparity has been demonstrated in children just over
3 months of age. Infants as young as 14 weeks will visually follow a disparity-defined contour moving from side to side. This test was conducted with dynamic
random-dot stereo displays in which the field of random dots was replaced on
every frame of the sequence, thus eliminating any monocular cues to the contour's
motion [102].
Visual development is marked by critical periods during which normal inputs
are required for normal development. In the case of binocular vision, it is assumed
that normal binocular inputs early in life are required for the appropriate cortical
development to take place and normal stereopsis to arise. A common example of
compromised binocular input is strabismus, the turning of one eye relative to the
other such that the visual axes do not intersect at the point of fixation. Often input
from the turned eye is suppressed. Surgical correction of the turned eye is
beneficial but it is widely accepted that this should occur before stereopsis is fully
developed. Otherwise, stereopsis will not develop even if normal binocular input is
achieved later in life. Fawcett et al. [103] reported that compromised binocular
input that occurs as early as 2.4 months of age and as late as 4.6 years severely
disrupts development of stereopsis.
Although there is less research on changes in binocular vision as a function of
normal aging, the evidence suggests that motor control of the eyes and stereoacuity
remain stable after maturity. There is a slight decline in stereoacuity after
approximately 45 or 50 years of age [23, 20]. Very recently, Susan Barry,
a research scientist who had strabismus as a child that was later corrected through
surgery, published her account of how she recovered stereopsis through an intense
regimen of visual therapy at age 50 [104]. This case implies that the binocular
system has the capacity to reorganize in response to correlated inputs, a neural
plasticity later in life [105].
12.9 Conclusion
The goal of this chapter was to introduce the reader to psychophysical research on
human binocular vision and link this research with issues related to stereoscopic
imaging and processing. Beginning with a brief overview of the physiology of the
binocular system, this chapter then discussed the theoretical and empirical loci in
space from which corresponding points are stimulated in the two eyes. These loci
have implications for the ergonomics of 3D display shape and position as well as
defining the distributions of binocular disparities in displays to maximize viewer
comfort. Next, the minimum and maximum disparities that can be processed by
the binocular system were discussed. These should be considered by 3D media
developers in order to avoid visible artifacts and user discomfort. This chapter then
defined and explored binocular rivalry, highlighting implications for the choice of
compression algorithm when transmitting stereoscopic media. Mismatches due to
artifacts in the rendering process such as holes were analyzed in the context of the
References
1. Freeman J, Avons S (2000) Focus group exploration of presence through advanced broadcast services. Proc SPIE 3959:530–539
2. Shibata T, Kurihara S, Kawai T, Takahashi T, Shimizu T, Kawada R, Ito A, Häkkinen J, Takatalo J, Nyman G (2009) Evaluation of stereoscopic image quality for mobile devices using interpretation based quality methodology. Proc SPIE 7237:72371E. doi:10.1117/12.807080
3. Fehn C, De La Barre R, Pastoor S (2006) Interactive 3D-TV: concepts and key technologies. Proc IEEE 94(3):524–538. doi:10.1109/JPROC.2006.870688
4. Zhang L, Vázquez C, Knorr S (2011) 3D-TV content creation: automatic 2D-3D video conversion. IEEE T Broad 57(2):372–383
5. Yano S, Ide S, Mitsuhashi T, Thwaites H (2002) A study of visual fatigue and visual comfort for 3D HDTV/HDTV images. Displays 23:191–201. doi:10.1016/S0141-9382(02)00038-0
6. Nojiri Y, Yamanoue H, Hanazato A, Okano F (2003) Measurement of parallax distribution, and its application to the analysis of visual comfort for stereoscopic HDTV. Proc SPIE 5006:195–205. doi:10.1117/12.474146
7. Nojiri Y, Yamanoue S, Ide S, Yano S, Okano F (2006) Parallax distributions and visual comfort on stereoscopic HDTV. Proc IBC 2006:373–380
8. Patterson R, Silzars A (2009) Immersive stereo displays, intuitive reasoning, and cognitive engineering. J SID 17(5):443–448. doi:10.1889/JSID17.5.443
9. Meesters LMJ, IJsselsteijn WA, Seuntiëns PJH (2004) A survey of perceptual evaluations and requirements of three-dimensional TV. IEEE T Circuits Syst 14(3):381–391. doi:10.1109/TCSVT.2004.823398
10. Lambooij M, IJsselsteijn W, Fortuin M, Heynderickx I (2009) Visual discomfort and visual fatigue of stereoscopic displays: a review. J Imaging Sci Tech 53(3):030201-1–030201-14. doi:10.2352/J.ImagingSci.Technol.2009.53.3.030201
11. Daly SJ, Held R, Hoffman DM (2011) Perceptual issues in stereoscopic signal processing. IEEE T Broadcast 57(2):347–361. doi:10.1109/TBC.2011.2127630
12. Tam WJ, Speranza F, Yano S, Shimono K, Ono H (2011) Stereoscopic 3D-TV: visual comfort. IEEE T Broad 57(2):335–346. doi:10.1109/TBC.2005.846190
13. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV. IEEE T Broad 51(2):191–199. doi:10.1109/TBC.2005.846190
14. Harris JM (2010) Monocular zones in stereoscopic scenes: a useful source of information for human binocular vision? Proc SPIE 7524:1–11. doi:10.1117/12.837465
15. Grove PM, Ashida H, Kaneko H, Ono H (2008) Interocular transfer of a rotational motion aftereffect as a function of eccentricity. Percept 37:1152–1159. doi:10.1068/p5771
16. Mather G (2006) Foundations of perception. Psychology Press, New York
17. Hennessy RT, Iida T, Shina K, Leibowitz HW (1976) The effect of pupil size on accommodation. Vis Res 16:587–589. doi:10.1016/0042-6989(76)90004-3
18. Allison RS (2007) Analysis of the influence of vertical disparities arising in toed-in stereoscopic cameras. J Imaging Sci and Tech 51(4):317–327
19. Kertesz AE, Sullivan MJ (1978) The effect of stimulus size on human cyclofusional response. Vis Res 18(5):567–571. doi:10.1016/0042-6989(78)90204-3
20. Curcio CA, Allen KA (1990) Topography of ganglion cells in human retina. J Comp Neurol 300:5–25. doi:10.1002/cne.903000103
21. Mariotte E (1665) A new discovery touching vision. Philos Trans 3:668–669
22. Steinman SB, Steinman BA, Garzia RP (2000) Foundations of binocular vision: a clinical perspective. McGraw-Hill, New York
23. Hubel DH, Wiesel TN (1959) Receptive fields of single neurons in the cat's visual cortex. J Physiol 148:574–591
24. Barlow HB, Blakemore C, Pettigrew JD (1967) The neural mechanisms of binocular depth discrimination. J Physiol 193:327–342
25. Howard IP (2002) Seeing in depth, vol 1: basic mechanisms. Porteous, Toronto
26. Nagahama Y, Takayama Y, Fukuyama H, Yamauchi H, Matsuzaki S, Magata MY, Shibasaki H, Kimura J (1996) Functional anatomy on perception of position and motion in depth. Neuroreport 7(11):1717–1721
27. Vieth G (1818) Über die Richtung der Augen. Ann Phys 58(3):233–253
28. Müller J (1826) Zur vergleichenden Physiologie des Gesichtssinnes des Menschen und der Thiere. Cnobloch, Leipzig
29. Howarth PA (2011) The geometric horopter. Vis Res 51:397–399. doi:10.1016/j.visres.2010.12.018
30. Ames A, Ogle KN, Gliddon GH (1932) Corresponding retinal points, the horopter, and size and shape of ocular images. J Opt Soc Am 22:575–631
31. Shipley T, Rawlings SC (1970) The nonius horopter. II. An experimental report. Vis Res 10(11):1263–1299. doi:10.1016/0042-6989(70)90040-4
32. Helmholtz H (1925) Helmholtz's treatise on physiological optics. In: Southall JPC (ed) Handbuch der physiologischen Optik, vol 3. Optical Society of America, New York
33. Ledgeway T, Rogers BJ (1999) The effects of eccentricity and vergence angle upon the relative tilt of corresponding vertical and horizontal meridian revealed using the minimum motion paradigm. Percept 28:143–153. doi:10.1068/p2738
34. Siderov J, Harwerth RS, Bedell HE (1999) Stereopsis, cyclovergence and the backwards tilt of the vertical horopter. Vis Res 39(7):1247–1357. doi:10.1016/S0042-6989(98)00252-1
35. Grove PM, Kaneko H, Ono H (2001) The backward inclination of a surface defined by empirical corresponding points. Percept 30:411–429. doi:10.1068/p3091
36. Schreiber KM, Hillis JM, Filippini HR, Schor CM, Banks MS (2008) The surface of the empirical horopter. J Vis 8(3):1–20. doi:10.1167/8.3.7
37. Cooper EA, Burge J, Banks MS (2011) The vertical horopter is not adaptable, but it may be adaptive. J Vis 11(3):1–19. doi:10.1167/11.3.20
38. Fischer FP (1924) III. Experimentelle Beiträge zum Begriff der Sehrichtungsgemeinschaft der Netzhäute auf Grund der binokularen Noniusmethode. In: Tschermak A (ed) Fortgesetzte Studien über Binokularsehen. Pflügers Archiv für die gesamte Physiologie des Menschen und der Tiere, vol 204, pp 234–246
39. Ankrum DR, Hansen EE, Nemeth KJ (1995) The vertical horopter and the angle of view. In: Grieco A, Molteni G, Occhipinti E, Piccoli B (eds) Work with display units '94. Elsevier, New York
40. Cogan A (1979) The relationship between the apparent vertical and the vertical horopter. Vis Res 19(6):655–665. doi:10.1016/0042-6989(79)90241-4
41. Howard IP, Rogers BJ (2002) Seeing in depth, vol 2: depth perception. Porteous, Toronto
42. Westheimer G, McKee SP (1978) Stereoscopic acuity for moving retinal images. J Opt Soc Am 68(4):450–455. doi:10.1364/JOSA/68.000450
43. Morgan MJ, Castet E (1995) Stereoscopic depth perception at high velocities. Nature 378:380–383. doi:10.1038/378380a0
44. Rawlings SC, Shipley T (1969) Stereoscopic acuity and horizontal angular distance from fixation. J Opt Soc Am 59:991–993
45. Fendick M, Westheimer G (1983) Effects of practice and the separation of test targets on foveal and peripheral stereoacuity. Vis Res 23(2):145–150. doi:10.1016/0042-6989(83)90137-2
46. Blakemore C (1970) The range and scope of binocular depth discrimination in man. J Physiol 211:599–622
47. Patterson R, Fox R (1984) The effect of testing method on stereoanomaly. Vis Res 24(5):403–408. doi:10.1016/0042-6989(84)90038-5
48. Tam WJ, Stelmach LB (1998) Display duration and stereoscopic depth discrimination. Can J Exp Psychol 52(1):56–61
49. Panum PL (1858) Physiologische Untersuchungen über das Sehen mit zwei Augen. Schwers, Kiel
50. Speranza F, Tam WJ, Renaud R, Hur N (2006) Effect of disparity and motion on visual comfort of stereoscopic images. Proc SPIE 6055:60550B. doi:10.1117/12.640865
51. Wöpking M (1995) Visual comfort with stereoscopic pictures: an experimental study on the subjective effects of disparity magnitude and depth of focus. J SID 3:1010–1103. doi:10.1889/1.1984948
52. Ogle KN (1952) On the limits of stereoscopic vision. J Exp Psychol 44(4):253–259. doi:10.1037/h0057643
53. Grove PM, Finlayson NJ, Ono H (2011) The effect of stimulus size on stereoscopic fusion limits and response criteria. i-Percept 2(4):401. doi:10.1068/ic401
54. Schor CM, Wood IC, Ogawa J (1984) Binocular sensory fusion is limited by spatial resolution. Vis Res 24(7):661–665. doi:10.1016/0042-6989(84)90207-4
55. Howard IP, Duke PA (2003) Monocular transparency generates quantitative depth. Vis Res 43(25):2615–2621. doi:10.1016/S0042-6989(03)00477-2
56. Grove PM, Sachtler WL, Gillam BJ (2006) Amodal completion with the background determines depth from monocular gap stereopsis. Vis Res 46:3771–3774. doi:10.1016/j.visres.2006.06.020
57. Siegel M, Nagata S (2000) Just enough reality: comfortable 3-D viewing via microstereopsis. IEEE T Circuits Syst 10(3):387–396. doi:10.1109/76.836283
58. Howard IP, Fang X, Allison RS, Zacher JE (2000) Effects of stimulus size and eccentricity on horizontal and vertical vergence. Exp Brain Res 130:124–132. doi:10.1007/s002210050014
59. Speranza F, Wilcox LM (2002) Viewing stereoscopic images comfortably: the effects of whole-field vertical disparity. Proc SPIE 4660:18–25. doi:10.1117/12.468047
60. Kooi FL, Toet A (2004) Visual comfort of binocular and 3D displays. Displays 25(2–3):99–108. doi:10.1016/j.displa.2004.07.004
61. Stelmach L, Tam WJ, Speranza F, Renaud R, Martin T (2003) Improving the visual comfort of stereoscopic images. Proc SPIE 5006:269–282. doi:10.1117/12.474093
62. Burt P, Julesz B (1980) A disparity gradient limit for binocular fusion. Science 208(4444):615–617. doi:10.1126/science.7367885
63. Levelt WJM (1968) On binocular rivalry. Mouton, The Hague
64. Alais D, Blake R (2005) Binocular rivalry. MIT Press, Cambridge
65. Humphriss D (1982) The psychological septum. An investigation into its function. Am J Optom Physiol Opt 59(8):639–641
66. Ono H, Lillakas L, Grove PM, Suzuki M (2003) Leonardo's constraint: two opaque objects cannot be seen in the same direction. J Exp Psychol: Gen 132(2):253–265. doi:10.1037/0096-3445.132.2.253
67. Arnold DH, Grove PM, Wallis TSA (2007) Staying focused: a functional account of perceptual suppression during binocular rivalry. J Vis 7(7):1–8. doi:10.1167/7.7.7
68. Seuntiens P, Meesters L, IJsselsteijn W (2006) Perceived quality of compressed stereoscopic images: effects of symmetric and asymmetric JPEG coding and camera separation. ACM Trans Appl Percept 3(2):95–109. doi:10.1145/1141897.1141899
69. Meegan DV, Stelmach LB, Tam WJ (2001) Unequal weighting of monocular inputs in binocular combination: implications for the compression of stereoscopic imagery. J Exp Psychol Appl 7:143–153. doi:10.1037/1076-898X.7.2.143
70. Shimojo S, Nakayama K (1990) Real world occlusion constraints and binocular rivalry. Vis Res 30:69–80. doi:10.1016/0042-6989(90)90128-8
71. Wilcox L, Lakra DC (2007) Depth from binocular half-occlusions in stereoscopic images of natural scenes. Percept 36:830–839. doi:10.1068/p5708
72. Gillam B, Borsting E (1988) The role of monocular regions in stereoscopic displays. Percept 17(5):603–608. doi:10.1068/p170603
73. Grove PM, Ono H (1999) Ecologically invalid monocular texture leads to longer perceptual latencies in random-dot stereograms. Percept 28:627–639. doi:10.1068/p2908
74. Grove PM, Gillam B, Ono H (2002) Content and context of monocular regions determine perceived depth in random dot, unpaired background and phantom stereograms. Vis Res 42(15):1859–1870. doi:10.1016/S0042-6989(02)00083-4
75. Grove PM, Brooks K, Anderson BL, Gillam BJ (2006) Monocular transparency and unpaired stereopsis. Vis Res 46(18):3042–3053. doi:10.1016/j.visres.2006.05.003
76. Gillam B, Blackburn S, Nakayama K (1999) Stereopsis based on monocular gaps: metrical encoding of depth and slant without matching contours. Vis Res 39(3):493–502. doi:10.1016/S0042-6989(98)00131-X
77. Forte J, Peirce JW, Lennie P (2002) Binocular integration of partially occluded surfaces. Vis Res 42(10):1225–1235. doi:10.1016/S0042-6989(02)00053-6
78. Nakayama K, Shimojo S (1990) Da Vinci stereopsis: depth and subjective occluding contours from unpaired image points. Vis Res 30:1811–1825. doi:10.1016/0042-6989(90)90161-D
79. Anderson BL (1994) The role of partial occlusion in stereopsis. Nature 367:365–368. doi:10.1038/367365a0
80. Liu L, Stevenson SB, Schor CM (1994) Quantitative stereoscopic depth without binocular correspondence. Nature 367(6458):66–69. doi:10.1038/367066a0
81. Gillam B, Nakayama K (1999) Quantitative depth for a phantom surface can be based on cyclopean occlusion cues alone. Vis Res 39:109–112. doi:10.1016/S0042-6989(98)00052-2
82. Tsirlin I, Wilcox LM, Allison RS (2010) Monocular occlusions determine the perceived shape and depth of occluding surfaces. J Vis 10(6):1–12. doi:10.1167/10.6.11
83. Harris JM, Wilcox LM (2009) The role of monocularly visible regions in depth and surface perception. Vis Res 49:2666–2685. doi:10.1016/j.visres.2009.06.021
84. Häkkinen J, Nyman G (1997) Occlusion constraints and stereoscopic slant. Percept 26:29–38. doi:10.1068/p260029
85. Grove PM, Kaneko H, Ono H (2003) T-junctions and perceived slant of partially occluded surfaces. Percept 32:1451–1464. doi:10.1068/p5054
86. Gillam B, Grove PM (2004) Slant or occlusion: global factors resolve stereoscopic ambiguity in sets of horizontal lines. Vis Res 44(20):2359–2366. doi:10.1016/j.visres.2004.05.002
87. Grove PM, Byrne JM, Gillam B (2005) How configurations of binocular disparity determine whether stereoscopic slant or stereoscopic occlusion is seen. Percept 34:1083–1094. doi:10.1068/p5274
88. Ohtsuka S, Ishigure Y, Janatsugu Y, Yoshida T, Usui S (1996) Virtual window: a technique for correcting depth-perception distortion in stereoscopic displays. Soc Inform Disp Symp Dig 27:893–898
89. Mendiburu B (2009) 3D movie making: stereoscopic digital cinema from script to screen. Focal Press, Oxford
90. Liu J (1995) Stereo image compression: the importance of spatial resolution in half occluded regions. Proc SPIE 2411:271–276. doi:10.1117/12.207545
91. Palmer SE (1999) Vision science: photons to phenomenology. MIT Press, Cambridge
92. Yamanoue H (1997) The relation between size distortion and shooting conditions for stereoscopic images. SMPTE J 106:225–232. doi:10.5594/L00566
93. Yamanoue H, Okui M, Okano F (2006) Geometrical analysis of puppet-theatre and cardboard effects in stereoscopic HDTV images. IEEE T Circuits Tech 16(6):744–752. doi:10.1109/TCSVT.2006.875213
94. Sato T, Kitazaki M (1999) Cardboard cut-out phenomenon in virtual-reality environment. Percept 28:125, ECVP abstract supplement
95. Rogers BJ (2002) Charles Wheatstone and the cardboard cut-out phenomenon. Percept 31:58, ECVP abstract supplement
96. Gillam B, Palmisano SA, Govan DG (2011) Depth interval estimates from motion parallax and binocular disparity beyond interaction space. Percept 40:39–49. doi:10.1068/p6868
97. Bradshaw MF, Rogers BJ (1999) Sensitivity to horizontal and vertical corrugations defined by binocular disparity. Vis Res 39(18):3049–3056. doi:10.1016/S0042-6989(99)00015-2
98. Cornilleau-Pérès V, Droulez J (1994) The visual perception of three-dimensional shape from self-motion and object-motion. Vis Res 34(18):2331–2336. doi:10.1016/0042-6989(94)90279-8
99. Ono H, Ujike H (2005) Motion parallax driven by head movements: conditions for visual stability, perceived depth, and perceived concomitant motion. Percept 24:477–490. doi:10.1068/p5221
100. Ono H, Wade N (2006) Depth and motion perceptions produced by motion parallax. Teach Psychol 33:199–202
101. Ono H, Rogers BJ, Ohmi M (1988) Dynamic occlusion and motion parallax in depth perception. Percept 17:255–266. doi:10.1068/p170255
102. Wilcox L, Tsirlin I, Allison RS (2010) Sensitivity to monocular occlusions in stereoscopic imagery: implications for S3D content creation, distribution and exhibition. In: Proceedings of SMPTE international conference on stereoscopic 3D for media and entertainment
103. Birch EE, Gwiazda J, Held R (1982) Stereoacuity development for crossed and uncrossed disparities in human infants. Vis Res 22(5):507–513. doi:10.1016/0042-6989(82)90108-0
104. Fox R, Aslin RN, Shea SL, Dumais ST (1980) Stereopsis in human infants. Science 207(4428):323–324. doi:10.1126/science.7350666
105. Fawcett SL, Wang Y, Birch EE (2005) The critical period for susceptibility of human stereopsis. Invest Ophth Vis Sci 46(2):521–525. doi:10.1167/iovs.04-0175
106. Barry SR (2009) Fixing my gaze: a scientist's journey into seeing in three dimensions. Basic Books, New York
107. Blake R, Wilson H (2011) Binocular vision. Vis Res 51(7):754–770. doi:10.1016/j.visres.2010.10.009
Chapter 13
Abstract This chapter covers the state of the art in stereoscopic and autostereoscopic displays. The coverage is not exhaustive, but it is intended to provide a reasonably comprehensive snapshot of the current state of the art in the relatively limited space available. In order to give a background to this, a brief
introduction to stereoscopic perception and a short history of stereoscopic displays
is given. Holography is not covered in detail here as it is really a separate area of
study and also is not likely to be the basis of a commercially viable display within
the near future.
P. Surman (&)
Imaging and Displays Research Group, De Montfort University, Leicester, UK
e-mail: psurman@dmu.ac.uk
Fig. 13.1 Oculomotor and visual cues: oculomotor cues (accommodation and convergence) involve the muscles controlling the eyes. Visual cues are either monocular, where depth information is determined by image content (static cues such as interposition, size, and perspective, plus motion parallax), or binocular, where disparity provides the information essential for stereo
through Pulfrich glasses; when the camera goes from being static to panning the
image appears to come alive as the viewer sees the same image in 3D (provided
the camera is moving in the correct direction). Pulfrich stereo is considered in
more detail later in this section. The first reference to the richness of monoscopic
cues was in the book The Theory of Stereoscopic Transmission first published in
1953 [1].
This section describes the ways in which stereo is perceived. It covers the
physical oculomotor cues of accommodation and convergence and then the visual
cues. The visual cues can be monocular where the images received by the brain are
interpreted using their content alone, or they can be binocular where the brain
utilizes the differences in the eyes' images as these are captured at two different
viewpoints.
The oculomotor cues are accommodation and convergence and involve the
actual positioning and focus of the eyes. Accommodation is the ability of the lens
of the eye to change its power in accordance with distance so that the region of
interest in the scene is in focus. Convergence is the ability of the eyes to adjust
their visual axes in order for the region of interest to focus on to the fovea of each
eye. Figure 13.1 shows the complete range of cues.
Monocular cues are determined only by the content of the images. Two-dimensional images provide a good representation of actual scenes, and paintings,
photographs, television, and cinema all provide realistic depictions of the real
world due to their wealth of monocular cues. There are many of these and a
selection of the principal cues is given below.
The most obvious monoscopic cue is that of occlusion where nearby objects
hide regions of objects behind them. Figure 13.2 shows some of the monocular
cues and it can be seen in Fig. 13.2a that the cube A is at the front as this occludes
sphere B which in turn occludes sphere C. In the upper figure the depth order
(from the front) has been changed to C A B by altering the occlusion.
Fig. 13.2 Monocular cues: a occlusion (relative distances of objects inferred from occlusions); b linear parallax (parallel lines converge at the horizon); c size constancy (relative distance inferred by assuming all these objects are around the same actual size)
In Fig. 13.2b the effect of linear parallax is shown where the lower end of the
beam appears closer to the observer as it is wider than the top end. This is in
accordance with the rules of perspective where parallel lines converge at a point on
the horizon known as the vanishing point and where it can be inferred that the
closer the lines appear to be, the closer they are to the vanishing point.
In the example of size constancy shown in Fig. 13.2c the smallest of the
matchstick figures appears to be the furthest as the assumption is made by the
observer that each of the figures is around the same actual size as the others so that
the one subtending the smallest angle is furthest away.
In each of the above examples certain assumptions are made by the observer
that are generally correct, but not necessarily true at all times. In the case of
occlusion for instance, a real object could be specially constructed to fool the
visual system from one viewpoint but not when observed from another. With
linear parallax the rules of perspective are assumed to apply; however, before
humans lived in structures comprising rectangular sides and were surrounded by
rectangular objects, the so-called rules of perspective probably had little meaning.
There are many other monocular cues that are too numerous to describe here;
these include texture gradient, aerial perspective, lighting and shading, and
defocus blur. The most important other monocular cue that is not dependent on the
static content of the image is motion parallax. This is the effect of a different
perspective being observed at different lateral positions of the head [2]. This
causes the images of objects that are closer to appear to move more rapidly on the
retina as the head is moved. In this way the relative position of an object can be
inferred by observing it against the background while moving head position.
Figure 13.3 shows the effect of observer position where it is apparent that depth
information can be implied from the series of observed images.
The strongest stereoscopic cue is binocular parallax where depth is perceived
by each eye receiving a different perspective of the scene so that fusion of the
Fig. 13.3 Motion parallax: closer objects show greater displacement in apparent position as the observation position moves laterally, so that the relative distances of objects are established
images by the brain gives the effect of depth. Different perspectives are illustrated
in Fig. 13.4a where in this case the appearance of the cube is slightly different for
each eye. Binocular parallax results in disparity in a stereoscopic image pair where
points in the scene are displayed at different lateral positions in each image.
Although the monocular cues mentioned previously provide an indication of
depth, the actual sensation of depth is provided by disparity. This sensation is
perceived even in the absence of any monocular cues and can be convincingly
illustrated with the use of Julesz random dot stereograms [3]. An example of these
is shown in Fig. 13.4b; in this figure there are no monocular cues whatsoever,
however, when the image pair is observed with a stereoscopic viewer a square can
be seen in front of a flat background.
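A minimal sketch of how a Julesz-style random dot stereogram can be generated: start from a field of random dots, then shift a central square region horizontally in one eye's image to introduce disparity (names and parameters below are illustrative, not taken from the cited work):

```python
import numpy as np

def random_dot_stereogram(size: int = 200, square: int = 80, disparity_px: int = 6, seed: int = 0):
    """Return a left/right dot-image pair in which only disparity defines a central square."""
    rng = np.random.default_rng(seed)
    left = rng.integers(0, 2, size=(size, size)).astype(float)
    right = left.copy()
    top = (size - square) // 2
    lo = (size - square) // 2
    # Copy the square region into the right image shifted sideways (crossed disparity,
    # so the square appears in front), then refill the uncovered strip with fresh noise.
    right[top:top + square, lo:lo + square - disparity_px] = \
        left[top:top + square, lo + disparity_px:lo + square]
    right[top:top + square, lo + square - disparity_px:lo + square] = \
        rng.integers(0, 2, size=(square, disparity_px))
    return left, right

left_img, right_img = random_dot_stereogram()
```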
It is also possible to see the square without the use of a stereoscopic viewer by
using the technique known as cross-eyed stereo. With this technique the eyes are
crossed so that the left and right images are combined in the brain. The ability to
do this can be quite difficult to acquire and can cause considerable eyestrain. The
stereoscopic effect is also reversed (pseudoscopic) so that in this case a square
located behind a flat background is observed.
The sensation of 3D can be produced by other means, among these being the Pulfrich
effect and the kinetic depth effect. The Pulfrich effect was first reported by the
German physicist Carl Pulfrich in 1922 [4]. It was noticed that when a pendulum is
observed with one eye covered by a dark filter the pendulum appears to follow an
elliptical path. The explanation for this is that the dimmer image effectively takes
longer to reach the brain. In Fig. 13.5, it can be seen that the actual position of the
pendulum is A but at that instant the apparent position lies at position B on line
Fig. 13.4 Binocular parallax. a Different perspectives of the cube are seen by each eye (plan view of the viewer with the left and right perspectives), giving the effect of depth. b Even with no content cues, depth is seen in random dot stereograms (in this case a square defined by the left and right views)
XX due to the delay in the visual system. This gives an apparent position at point
C where the lines XX and YY intersect. Although Pulfrich stereo can be shown on a
two-dimensional image without the loss of resolution or color rendition and without
the use of special cameras, its use is limited and it is only suitable for novelty
applications as either the cameras have to be continually panning or the subject must
keep moving in relation to its background.
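The geometry described above can be approximated numerically: if the filtered eye's signal is delayed by Δt, a target moving laterally at angular velocity ω acquires an effective interocular disparity of roughly ω·Δt, which the visual system interprets as depth. A rough sketch under these assumptions (the delay value is purely illustrative):

```python
def pulfrich_equivalent_disparity_deg(angular_velocity_deg_s: float, delay_s: float) -> float:
    """Effective disparity introduced when one eye's image is delayed (Pulfrich effect)."""
    return angular_velocity_deg_s * delay_s

# A pendulum sweeping at 20 deg/s seen through a filter adding ~15 ms of delay
# acquires roughly 0.3 deg (18 arc min) of spurious disparity, hence apparent depth.
print(pulfrich_equivalent_disparity_deg(20.0, 0.015))
```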
The kinetic depth effect is an illusion where the three-dimensional structure of
an object in a two-dimensional image is revealed by movement of the image [5].
The illusion is particularly interesting as images appear to be three dimensional
even when viewed with one eye. Unlike Pulfrich stereo, the kinetic depth effect has
some useful applications, for example, it can be used to enable images from airport
security X-ray scanners to be seen in 3D. The virtual position of the object is
moved in a rocking motion in order to reveal objects that might otherwise be
missed by merely observing the customary static false color images [6].
To summarize the usefulness of the depth effects on stereoscopic displays: motion parallax is useful but not absolutely necessary, and it should be borne in mind that in many viewing situations viewers tend to be seated and hence fairly static, so that the look-around capability provided by motion parallax adds very little to the effect. The sensation of depth is not particularly enhanced by the
oculomotor cues of accommodation and convergence but where there is conflict
between these two, discomfort can occur and this is considered in more detail in
the following section. The kinetic depth effect has restricted use and the Pulfrich
effect has no apparent practical use. Even if all of the aforementioned cues are
present, 3D will not be perceived unless binocular parallax is present.
Fig. 13.5 The Pulfrich effect (plan view): the actual and apparent positions of the pendulum, and its apparent path, when one eye is covered by a dark filter
13.2.1 Disparity
As described previously, binocular parallax is the most important visual depth cue.
In a display where a stereo pair is presented to the user's eyes, if a point in the
image is not located in the plane of the screen then it must appear in two different
lateral positions on the screen, one position for the left eye and one for the right
eye. This is referred to as disparity.
When an image point occupies the same position on the screen for both the left
and right eyes it has what is referred to as zero disparity and this is the condition
when a normal two-dimensional image is observed (Fig. 13.6a). In this case the
image point will obviously appear to be located at the plane of the screen
(Point Z).
When an object appears behind the plane of the screen as in Fig. 13.6b the
disparity is referred to as uncrossed (or positive). It can be seen that the object will
appear to be point U behind the screen where the lines passing through the pupil
centers and the displayed positions on the screen intersect. Similarly, for crossed (or negative) disparity, shown in Fig. 13.6c, the apparent position of the image point will be at the intersection at position C in front of the screen.
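The geometry just described can be expressed numerically. As a hedged illustration (the eye separation, viewing distance and disparity values are assumed example figures, not taken from the chapter), the apparent distance of a fused point follows from similar triangles:

    # Apparent distance of a fused image point from the viewer, derived from
    # similar triangles. All values are hypothetical example figures.
    def apparent_distance(disparity_mm, eye_sep_mm=65.0, screen_dist_mm=2000.0):
        """disparity > 0: uncrossed (behind screen); disparity < 0: crossed (in front)."""
        if disparity_mm >= eye_sep_mm:
            raise ValueError("disparity >= eye separation: sight lines do not converge")
        return eye_sep_mm * screen_dist_mm / (eye_sep_mm - disparity_mm)

    for p in (-20.0, 0.0, 20.0):   # crossed, zero and uncrossed disparity (mm)
        print(p, round(apparent_distance(p)))   # ~1529 mm, 2000 mm, ~2889 mm

With zero disparity the point lies at the screen distance, as described above.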
Fig. 13.6 Disparity: disparity gives the appearance of depth when two 2D images are fused. In uncrossed disparity the eyes converge behind the screen so the object appears to be further away than the screen; the opposite applies for crossed disparity
When a stereoscopic pair is viewed on a display the eyes will always focus on the screen; however, the apparent distance, and hence the convergence (also referred to as vergence), will invariably be different, as shown in Fig. 13.7. In this chapter accommodation/convergence conflict will be referred to as A/C conflict. If this conflict is too great, visual discomfort occurs; this is the case with cross-eyed stereo, where the eyes converge at half the focusing distance.
There are various criteria mentioned in the literature regarding the acceptable maximum level of conflict, and the tolerance will vary between individuals. Also, if objects occasionally jump excessively out of the screen this may be tolerable, but if it happens frequently viewer fatigue can occur.
There are two widely published criteria for the acceptable difference between accommodation and convergence. The first is the one degree difference between accommodation and convergence rule [7]. This rule of thumb states that the angular difference between convergence and accommodation (angle θ in Fig. 13.7) should not exceed 1° (θ = ψ - φ).
The other criterion is the one-half to one-third diopter rule [8]. In this case diopters, the reciprocal of the distance in meters, are used as the unit of measurement. Applying the rule in Fig. 13.7 gives:

1/C - 1/A ≤ 1/3    (13.1)

where C and A are the convergence and accommodation distances in meters.
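As a small worked check of Eq. 13.1 (a sketch only; the distances below are assumed example values and the magnitude of the difference is used):

    # Check the one-third diopter criterion of Eq. 13.1 for example distances.
    def ac_conflict_diopters(convergence_m, accommodation_m):
        # magnitude of the difference between convergence and accommodation, in diopters
        return abs(1.0 / convergence_m - 1.0 / accommodation_m)

    screen = 2.0                      # accommodation (screen) distance A in meters
    for obj in (1.0, 1.5, 4.0):       # apparent object (convergence) distances C in meters
        conflict = ac_conflict_diopters(obj, screen)
        print(obj, round(conflict, 2), "within rule" if conflict <= 1.0 / 3.0 else "excessive")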
Fig. 13.7 (plan view) The eyes focus on the plane of the screen while converging on the apparent position of the object
False rotation is the effect where the image appears to move with movement of the viewer's head. This is inevitable with a stereo pair, as an apparent image point must always lie on the axis between P, the center of the two image points on the screen, and the mid-point between the viewer's eyes. Figure 13.8a indicates how
this effect arises. As the viewer moves from position P1 to P2, the apparent image
point moves from position M1 to M2. At each viewer position the image point lies
on the line between the eye-center and the screen at approximately 0.4 of the
distance. Although the figure shows virtual images in front of the screen, the same
considerations apply to images behind the screen where the points appear on
the axis extended into this region.
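Since the apparent point is simply the intersection of the two sight lines (left eye through its screen point, right eye through its screen point), false rotation can be reproduced with elementary plan-view geometry. The dimensions below are hypothetical, chosen only to illustrate the effect:

    # False rotation: the fused point is the intersection of the sight lines from
    # each eye through its own screen image point (plan view, x lateral, z depth).
    def intersect(p1, p2, p3, p4):
        """Intersection of the line p1-p2 with the line p3-p4; points are (x, z)."""
        (x1, z1), (x2, z2), (x3, z3), (x4, z4) = p1, p2, p3, p4
        den = (x1 - x2) * (z3 - z4) - (z1 - z2) * (x3 - x4)
        px = ((x1 * z2 - z1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * z4 - z3 * x4)) / den
        pz = ((x1 * z2 - z1 * x2) * (z3 - z4) - (z1 - z2) * (x3 * z4 - z3 * x4)) / den
        return px, pz

    eye_sep, screen_z = 65.0, 2000.0                         # mm
    left_pt, right_pt = (40.0, screen_z), (-40.0, screen_z)  # crossed-disparity pair
    for head_x in (0.0, 100.0):                              # viewer moves 100 mm sideways
        left_eye = (head_x - eye_sep / 2.0, 0.0)
        right_eye = (head_x + eye_sep / 2.0, 0.0)
        print(head_x, intersect(left_eye, left_pt, right_eye, right_pt))
    # the apparent point shifts laterally with the head, i.e. false rotation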
Image plane pivoting is the effect where the apparent angle of a surface alters with viewpoint. In Fig. 13.8b consider surface A′A″ that is perceived by viewer PA: point A′ is located around 0.3 of the distance from the screen and point A″ half the distance. Similarly, for viewer PB the same relationships apply for points B′ and B″ that are at the ends of surface B′B″. It can be seen that as the viewer moves from PA to PB the apparent angle of the surface pivots in a clockwise direction.
Fig. 13.8 Geometrical distortions. a False rotation is caused by the apparent point always being located on the line between the eye center and point P. b Pivoting is the result of a combination of false rotation and variation of apparent distance
In Fig. 13.9 the sphere A appears natural even though it is partly cut off, because the part that is not seen is obscured by the left side of the screen as if by a window. The rectangular rod B will also appear natural as it is completely contained within a volume where it is not unnaturally truncated. The cylinder C will appear unnatural as it is cut off in space by an imaginary plane in front of the screen that could not be
there in practice. Occasional display of this truncation may be acceptable but it is
something that should be avoided if possible.
When stereo is captured with a camera pair there is a rule of thumb frequently
applied stating that the camera separation should be around 1/30th the distance of
the nearest object of interest in the scene. If this is the case and images are
captured with no post processing, for the subject to appear near to the plane of the
screen the cameras will have to be toed-in so that their axes intersect at around the
distance of the subject. This will cause geometrical distortion in the image known
as keystoning. For example, if the object in the scene is a rectangle at right angles
to the axis its images will be trapezoidal with parallel sides, hence the term
keystone. The detrimental effect of this is that the same point in the scene can appear at different heights when viewed, thus giving difficulty in fusing the
images. In the literature there are different schools of thought as to what is
acceptable [9].
When we observe a two-dimensional representation of a scene, the fact that the
angle subtended by an object in the scene may be considerably less than it would
be by direct viewing does not concern us. However, when the representation is
three dimensional the binocular cue conflict can make objects appear unnaturally
small, thus giving rise to the puppet theater effect [10]. It is possible that with
greater familiarity with 3D, viewers will become accustomed to this effect as they
have done with flat images.
There are many other artifacts and a useful list of these is given in a publication
by the 3D@Home Consortium [11].
Fig. 13.9 Viewing zone: the left side of sphere A is cut off by the window of the screen edge. Bar B protrudes into the virtual image pyramid in front of the screen. Rod C is cut off unnaturally by the edge of the pyramid
13.2.5 Crosstalk
In displays where the 3D effect is achieved by showing a stereo pair it is essential that as little as possible of the image intended for the left eye reaches the right eye, and vice versa. This unwanted bleeding of images is referred to as crosstalk; it creates difficulty in fusing the images, which can cause discomfort and headaches.
Crosstalk is expressed as a percentage and is most simply defined as:

Crosstalk (%) = (leakage / signal) × 100

where leakage is defined as the maximum luminance of light from the unintended channel into the intended channel and signal is the maximum luminance of the intended channel.
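Read literally (with made-up luminance figures purely for illustration), the definition gives:

    # Crosstalk as a percentage of the wanted signal, per the definition above.
    def crosstalk_percent(leakage_cd_m2, signal_cd_m2):
        return 100.0 * leakage_cd_m2 / signal_cd_m2

    print(crosstalk_percent(4.0, 200.0))    # 2.0, around the generally tolerable level
    print(crosstalk_percent(1.5, 200.0))    # 0.75, below the 1 % target for high-contrast content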
The simplified representation of crosstalk is shown in Fig. 13.10 where part of
each image is shown as bleeding into the other channel. The mechanism for this
can be due to several causes; in an anaglyph display it would be due to imperfect matching of the spectral characteristics of the glasses to those of the display. With shuttered
glasses it could be due to incomplete extinction of transmission by a lens when it
should be opaque or due to timing errors between the display and the glasses, and
in an autostereoscopic display it is due to some of the light rays from one of the
image pair traveling to the eye for which they were not intended.
A tolerable level of crosstalk is generally considered as being in the region of
2 %. The actual subjective effect of crosstalk is dependent on various factors; for
example, the dimmer the images, the higher the level of crosstalk that can be
accepted. Image content is also an important factor. When the contrast is greater,
the subjective effect of crosstalk is increased. If the image has vertical edges where one side is black and the other white, the effect of crosstalk is very pronounced, and in this case it should be in the region of 1 % or less.
Fig. 13.10 Crosstalk: simplified representation showing some of the left image bleeding into the right eye and some of the right image bleeding into the left eye. This occurs in free space in an autostereoscopic display and in the lenses of a glasses display
Fig. 13.11 Multi-view viewing zones. a Viewing zones overlap in a multi-view display. b Overlap causes softening of image points away from the plane of the screen, which causes a reduction in depth of field
Fig. 13.12 Early stereoscopes. a Two inverted virtual images are formed in space behind combining mirrors and fused in the brain. b Convex viewing lenses enable the eyes to focus at the convergence distance
The criteria above apply to horizontal parallax only displays where there is no
parallax in the vertical direction. This could conceivably produce an effect similar
to astigmatism where the image on the retina can be focused in one direction but
not at right angles to this. None of the references mention this potential effect so it
is not known at present whether or not it would be an issue.
Fig. 13.13 Integral imaging and parallax barrier. a Elemental images enable reconstruction of the input ray pattern; without correction this is pseudoscopic. b Vertical apertures in the barrier direct left and right images to the appropriate eyes
In early integral imaging the same elemental images that were captured also
reproduced the image seen by the viewer. If a natural orthoscopic scene is captured
a certain pattern of elemental images is formed and when the image is reproduced,
light beams travel back in the opposite direction to reproduce the shape of the
original captured surface. The viewer however sees this surface from the opposite
direction to which it was captured and effectively sees the surface from the inside
thus giving a pseudoscopic image. Methods have been developed to reverse this
effect either optically [17] or more recently by reversing the elemental images
electronically [18].
Another approach developed in the early twentieth century is the parallax barrier
where light directions are controlled by a series of vertical apertures. In 1903 Frederic
Ives patented the parallax stereogram where left and right images are directed to the
appropriate left and right eyes as in Fig. 13.13b.
The problem of limited head movement was addressed in a later development
of the display known as a parallax panoramagram. The first method of capturing a
series of images was patented by Clarence Kanolt in 1918 [19]. This used a camera
that moved laterally in order to capture a series of viewpoints. As the camera
moved a barrier changed position in front of the film so that the images were
recorded as a series of vertical strips.
Another recording method was developed by Frederic Ives' son Herbert who
captured the parallax information with a large diameter lens. A parallax barrier
located in front of the film in the camera was used in order to separate the
directions of the rays incident upon it.
It is not widely known that one of the pioneers of television, the British inventor
John Logie Baird, pioneered 3D television before the Second World War. The
apparatus used was an adaptation of his mechanically scanned system (Fig. 13.14a).
Fig. 13.14 Baird's mechanically scanned 3D television. a Transmitter and receiver apparatus with synchronized scanning discs, lenses, photoelectric cells and neon tube. b Scanning disc
As in his standard 30-line system the image was captured by illuminating it with
scanning light. In this case two scans were carried out sequentially, one for the left
image and one for the right. This was achieved with a scanning disk having two
sets of spiral apertures as shown in Fig. 13.14b. Images were reproduced by using
a neon tube illumination source as the output of this could be modulated sufficiently rapidly. Viewing was carried out by observing this through another double
spiral scanning disk running in synchronism with the capture disk.
Possibly of greater interest is Baird's work on a volumetric display that he referred
to as a Phantoscope. Image capture involved the use of the inverse square law to
determine the range of points on the scene surface and reproduction was achieved by
projecting an image on to a surface that moved at right angles to its plane [20, 21].
A 3D movie display system called the Stereoptiplexer was developed by Robert
Collender in the latter part of the twentieth century. This operates on the principle
of laterally scanning a slit whose appearance varies with viewing angle in the same way as it would if the natural scene were located behind the slit. As the slit moves across the screen a complete 3D representation is built up. The display can operate in two modes; these are inside looking out and outside looking in [22]. Figure 13.15a shows the former case, where a virtual image within the cylinder can be seen over 360°. The mechanism operates in a similar manner to the zoetrope
[23] that was an early method of viewing moving images.
The outside looking in display uses the method of so-called aerial exit pupils
where virtual apertures are generated in free space [24]. This embodiment of the
display gives the appearance of looking at the scene through a window.
A variant of this principle that does not require the use of film but generates the images electronically is Homer Tilton's parallactiscope [25], where the images are produced on a cathode ray tube (CRT). This does not produce real video images as does Collender's display but has the advantage of having a reduced number of
moving parts. In order to produce an effectively laterally moving aperture with the
minimum mass, a half wave retarder is moved between crossed polarizers as
shown in Fig. 13.15b. The retarder is moved with a voice coil actuator.
Fig. 13.15 Stereoptiplexer and parallactiscope. a Rapidly moving images from a film projector viewed through a rotating slit. b The same principle used in the parallactiscope, where images are formed on a CRT
Both Collender's and Tilton's displays, and also integral imaging, are early examples of what are termed light field displays; these are described in more detail in Sect. 13.4.3.
Fig. 13.16 Classification of 3D displays: glasses, stereoscope & HMD, holographic, volumetric (real image, virtual image) and multiple image (two-image, multiview, light field) types
Cinema-like: greater than six users and greater than 2 m diagonal. Autostereoscopic presentation is difficult, or maybe impossible, for a large number of users. However, the wearing of special glasses is acceptable in this viewing environment.
13.4.1 Glasses
The earliest form of 3D glasses viewing was anaglyph introduced by Wilhelm
Rollmann in 1853; this is the familiar red/green glasses method. Anaglyph operates by separating the left and right image channels by their color. Imagine a red
line printed on a white background; when the line is viewed through a red filter the
line is barely perceptible as it appears to be around the same color as the background. However, when the line is viewed through a green filter the background appears green and the line virtually black. The opposite effect occurs when a green line on a white background is viewed through red and green filters. In this
way different images can be seen by each eye, albeit with different colors. This
color difference does not prevent the brain fusing a stereoscopic pair.
Early anaglyph systems used red and green or red and blue. When liquid crystal
displays (LCD) are used where colors are produced by additive color mixing,
better color separation is obtained with red/cyan glasses. Typical spectral transmission curves for anaglyph glasses and LCD color filters are shown in Fig. 13.17.
It should be noted that the curves for the LCD filter transmission only give an
indication of the spectral output of the panel. In practice the spectra of the cold
cathode fluorescent lamp (CCFL) illumination sources generally used are quite
Fig. 13.17 Typical anaglyph transmission spectra: plots showing typical separation for red/cyan glasses and LCD filters (wavelength axis 400-700 nm). These only give an indication, as separation is also affected by the spectral output of the CCFL backlight
spiky. Examination of the curves shows that crosstalk can occur. For example, in
Fig. 13.17b it can be seen that some of the red light emitted by the LCD between
400 and 500 nm will pass to the left eye through the cyan filter. Extensive
investigation into anaglyph crosstalk has been carried out by Curtin University in
Australia [27].
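One way to see how such spectral leakage arises (a rough sketch, not the measurement method of [27]; the spectra below are crude made-up curves, not data for any real panel or glasses) is to weight the red primary's emission by the transmission of each filter:

    import math

    # Crude spectral estimate of anaglyph leakage: integrate the red primary's
    # emission against the right-eye (red) and left-eye (cyan) filters.
    # All curves are made-up shapes purely for illustration.
    def gaussian(wl, centre, width):
        return math.exp(-((wl - centre) / width) ** 2)

    wavelengths = range(400, 701, 5)                      # nm
    red_primary = [gaussian(w, 610, 30) + 0.2 * gaussian(w, 450, 25) for w in wavelengths]
    red_filter  = [1.0 if w > 580 else 0.05 for w in wavelengths]    # right eye
    cyan_filter = [1.0 if w < 560 else 0.05 for w in wavelengths]    # left eye

    signal  = sum(s * f for s, f in zip(red_primary, red_filter))
    leakage = sum(s * f for s, f in zip(red_primary, cyan_filter))
    print("estimated crosstalk %:", round(100.0 * leakage / signal, 1))

The short-wavelength lobe added to the red primary plays the role of the 400-500 nm emission mentioned above; removing it makes the estimated leakage fall sharply.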
Another method of separating the left and right image channels is with the use
of polarized light that can be either linearly or circularly polarized. In linearly
polarized light the electric field is oriented in one direction only and orthogonally
oriented polarizing filters in front of the left and right eyes can be used to separate
the left and right image channels.
Light can also be circularly polarized where the electric field describes a circle
as time progresses. The polarization can be left or right handed depending on the
relative phase of the field components. Again, the channels can be separated with
the use of left- and right-hand polarizers. Circular polarization has the advantage
that the separation of the channels is independent of the orientation of the polarizers, so that rotation of the glasses does not introduce crosstalk. It should be noted that although crosstalk does not increase with rotation of the head, the fusion of stereoscopic images becomes more difficult as corresponding points in the two images do not lie in the same image plane in relation to the axis between the centers of the user's eyes.
The other method of separating the channels is with the use of liquid crystal
shutter glasses, where left and right images are time multiplexed. In order to avoid
image flicker images must be presented at 60 Hz per eye, that is, a combined frame
rate of 120 Hz. The earliest shutter glasses displays used a CRT to produce the
images, as these can run at the higher frame rate required. It is only relatively recently that LCD displays have become sufficiently fast, and in 2008 the NVIDIA 3D Vision gaming kit was introduced.
Fig. 13.18 Shutter glasses timing: over the 16.7 ms period of a 120 Hz display there are two periods when both cells are opaque; these are when the LCD is being addressed and both left and right images are displayed at one time (0 ms: LCD addressed with left image; 4.17 ms: left image displayed over complete LCD; 8.33 ms: LCD addressed with right image; 12.5 ms: right image displayed over complete LCD)
The switching of the LC cells must be synchronized with the display as shown
in Fig. 13.18. This is usually achieved with an infrared link. When the display is
being addressed, both of the cells in the glasses must be switched off otherwise
parts of both images will be seen by one eye thus causing crosstalk.
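A minimal sketch of the repeating 120 Hz schedule implied by Fig. 13.18 (the timing points are those in the figure; the open/closed states of the two cells are the natural completion of the behaviour described above, shown only as an illustration):

    # One 16.7 ms cycle of a 120 Hz frame-sequential display with shutter glasses.
    # Both cells are kept opaque while the LCD is being addressed.
    schedule = [
        (0.00, "LCD addressed with left image",           "left closed", "right closed"),
        (4.17, "left image displayed over complete LCD",  "left open",   "right closed"),
        (8.33, "LCD addressed with right image",          "left closed", "right closed"),
        (12.5, "right image displayed over complete LCD", "left closed", "right open"),
    ]
    for t_ms, display_state, left, right in schedule:
        print(f"{t_ms:5.2f} ms  {display_state:<42s} {left:12s} {right}")
    # the cycle repeats every 16.7 ms, i.e. 60 Hz per eye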
13.4.2 Volumetric
Volumetric displays reproduce the surface of the image within a volume of space.
The 3D elements of the surface are referred to as voxels, as opposed to pixels on
a 2D surface. As volumetric displays create a picture in which each point of light
has a real point of origin in a space, the images may be observed from a wide range
of viewpoints. Additionally, the eye can focus at a real point within the image thus
giving the sense of ocular accommodation.
Volumetric displays are usually more suited for computer graphics than video
applications due to the difficulties in capturing the images. However, their most important drawback, in particular with regard to television displays, is that they generally suffer from image transparency, where parts of an image that are normally occluded are seen through the foreground object. Another limitation that
could give an unrealistic appearance to natural images is the inability in general to
display surfaces with a non-Lambertian intensity distribution. Some volumetric
displays do have the potential to overcome these problems and these are described
in Sect. 13.5.2.
Volumetric displays can be of two basic types; these are virtual image where
the voxels are formed by a moving or deformable lens or mirror and real image
where the voxels are on a moving screen or are produced on static regions. An
early virtual image method described is that of Traub [28] where a mirror of
varying focal length (varifocal) is used to produce a series of images at differing
apparent distances (Fig. 13.19a). The variable surface curvature of the mirror
entails smaller movement for a given depth effect than would be required from a
moving flat surface on to which an image is projected.
Fig. 13.19 Volumetric display types. a Moving virtual image plane formed by a deformable (varifocal) mirror. b Image built up of slices projected on to a stack of switchable screens. c Image projected on to a moving screen
In static displays voxels are produced on stationary regions in the image space
(Fig. 13.19b). A simple two-plane method by Floating Images Inc. [29] uses a partially reflecting mirror to combine the real foreground image with the reflected
background image behind it. The foreground image is brighter in order for it to appear
opaque. This type of display is not suitable for video but does provide a simple and
inexpensive display that can be effective in applications such as advertising.
The DepthCube Z1024 3D display consists of 20 stacked LCD shutter panels
that allow viewers to see objects in three dimensions without the need for glasses.
It enhances depth perception and creates a volumetric display where the viewer
can focus on the planes so the accommodation is correct [30].
A solid image can be produced in a volume of space by displaying slices of the
image on a moving screen. In Fig. 13.19c the display comprises a fast projector
projecting an image on a rotating spiral screen. The screen could also be flat and moving
in a reciprocating motion and if, for example, a sphere is to be displayed this can be
achieved by projecting in sequence a series of circles of varying size on to the surface.
Most attempts to create volumetric 3D images are based on swept volume techniques
because they can be implemented with currently available hardware and software.
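For the sphere example just given, the circle radius to project at each screen position follows directly from the sphere equation; a minimal sketch with hypothetical dimensions:

    import math

    # Radii of the circles projected on to a reciprocating flat screen in order to
    # sweep out a sphere of radius R; the dimensions are hypothetical.
    R = 50.0                                        # sphere radius in mm
    for i in range(11):                             # screen positions from -R to +R
        z = i * 10.0 - R
        r = math.sqrt(max(R * R - z * z, 0.0))
        print(f"screen at z = {z:6.1f} mm: circle radius {r:5.1f} mm")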
There are many volumetric display methods and a useful reference work on the
subject has been written by B G Blundell [31].
Fig. 13.20 Light field displays. a Optical modules create image points with intersecting beams (in front of or behind the vertically diffusing screen). b Scanning slit display operates on the same principle as the stereoptiplexer and parallactiscope
Light field approaches include projection-based integral imaging, optical modules, and dynamic aperture. Light field displays require large amounts of
information to be displayed. In methods that use projectors or light modules an
insufficient amount of information can produce effects such as lack of depth of
field or shearing of the images.
NHK in Japan has carried out research into integral imaging for several years
including one approach using projection [32]. In 2009, NHK announced they had
developed an integral 3D TV achieved by using a 1.34 mm pitch lens array that covers an ultra-high definition panel. Hitachi has demonstrated a 10″ full parallax 3D display that has a resolution of 640 × 480. This uses 16 projectors in conjunction with a lens array sheet [33] and provides vertical and horizontal parallax. There is a tradeoff between the number of viewpoints and the resolution; Hitachi uses sixteen 800 × 600 resolution projectors and in total there are 7.7 million pixels (equivalent to 4000 × 2000 resolution).
Some light field displays employ optical modules to provide multiple beams that
either intersect in front of the screen to form real image voxels (Fig. 13.20a) or
diverge to produce virtual voxels behind the screen. The term voxel is used here as
it best describes an image point formed in space whether this point is either real or
virtual. The screen diffuses the beams in the vertical direction only, therefore
allowing viewers vertical freedom of movement without altering horizontal beam
directions. As the projectors/optical modules are set back from the screen, mirrors are
situated either side in order to extend the effective width of the actual array.
The dynamic aperture type of light field display uses a fast frame rate projector
in conjunction with a horizontally scanned dynamic aperture as in Fig. 13.20b.
Although the actual embodiment appears to be totally different to the optical
module approach, the results they achieve are similar. In the case of the dynamic
aperture the beams are formed in a temporal manner. The Collender and Tilton
displays mentioned earlier are examples of this type.
Fig. 13.21 Multi-view display. a Views formed by collimated beams emerging from a lenticular screen where the LCD image is located at the focal plane. b Pixel and lens configuration for a 7-view slanted lenticular display
13.4.4 Multi-View
Multi-view displays present a series of discrete views across the viewing field. One
eye will lie in a region where one perspective is seen, and the other eye in a position
where another perspective is seen. The number of views is too small for continuous
motion parallax. Current methods use either lenticular screens or parallax barriers to
direct images in the appropriate directions. There have been complex embodiments
of this type of display in the past [34] where the views are produced in a dynamic
manner with the use of a fast display; this type of display employs a technique
referred to as Fourier-plane shuttering [35]. However, recent types using lenticular
screens [36] or parallax barriers provide a simple solution that is potentially inexpensive. These displays have the advantages of providing the look-around capability
(motion parallax) and a reasonable degree of freedom of movement.
Lenticular screens with the lenses running vertically can be used to direct the light
from columns of pixels on an LCD into viewing zones across the viewing field. The
principle of operation is shown in Fig. 13.21a. The liquid crystal layer lies in the focal plane of the lenses, and the lens pitch is slightly less than the combined horizontal pitch of the group of pixels behind each lens in order to give viewing zones at the chosen optimum distance from the screen. In the figure three columns of pixels contribute to three viewing zones.
A simple multi-view display with this construction suffers from two quite
serious drawbacks. First, the mask between the columns of pixels in the LCD gives
rise to the appearance of vertical banding on the image known as the picket fence
effect. Second, when a viewer's eye traverses the region between two viewing
zones the image appears to flip between views. These problems were originally
addressed by Philips Research Laboratories in the UK by simply slanting the
lenticular screen in relation to the LCD as in Fig. 13.21b where a 7-view display is
shown. An observer moving sideways in front of the display always sees a constant
amount of black mask. This renders it invisible and eliminates the appearance of
the picket fence effect, which is a moiré-like artifact where the LCD mask is
magnified by the lenticular screen.
The slanted screen also enables the transition between adjacent views to be
softened so that the appearance to the viewer is closer to the continuous motion
parallax of natural images. Additionally it enables the reduction of perceived resolution relative to the display's native resolution to be spread between the vertical and
horizontal directions. For example, in the Philips Wow display the production of nine
views reduces the resolution in each direction by a factor of three. The improvements
obtained with a slanted lenticular screen also apply to a slanted parallax barrier.
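The interleaving itself can be sketched as follows. This is a simplified version of the sub-pixel-to-view mapping commonly used for slanted lenticulars (in the spirit of [36]); the slant of tan(alpha) = 1/6, the zero offset and the 7-view count are assumed example values, and real designs differ in the exact formulation:

    # Simplified sub-pixel-to-view interleaving for a slanted lenticular panel.
    def view_number(k, l, n_views=7, tan_slant=1.0 / 6.0, k_offset=0.0):
        """k: sub-pixel column index, l: pixel row index."""
        # the view index advances by one per sub-pixel column and is shifted
        # with the row according to the lens slant (3 sub-pixels per pixel)
        return (k + k_offset - 3.0 * l * tan_slant) % n_views

    for l in range(4):                 # a few rows of the interleaving map
        print([round(view_number(k, l), 2) for k in range(10)])
    # fractional results are rounded, or used to blend adjacent views, in practice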
Fig. 13.22 Exit pupils. a Exit pupil pair located at the positions of the viewer's eyes. b Multiple exit pupil pairs under the control of a head tracker follow the positions of viewers' heads
is located in front of the screen. The German company SeeReal produces a head
tracked display where light is steered through an LCD with a prismatic arrangement [43].
13.5.1.2 Anaglyph: Infitec
A rather different approach is taken by the German company Infitec GmbH. This is
a projection system where narrowband dichroic interference filters are used to
separate the channels. The filter for each eye has three narrow pass bands that
provide the primary colors for each channel. In order to give separation the
wavelengths have to be slightly different for each channel and are typically: left
eye, Red 629 nm, Green 532 nm, Blue 446 nm; right eye, Red 615 nm, Green
518 nm, Blue 432 nm. The effect of the slight differences in the primary colors is
relatively small as it can be corrected to a certain extent and principally affects
highly saturated colors.
Ideally the light sources would be lasers but in practice an ultra-high pressure
(UHP) lamp is used [45] where spectral peaks are close to the desired wavelengths.
One advantage of this system is that polarized light is not used so that the screen
does not have to be metallized but can be white. A disadvantage is that the glasses
are relatively expensive (around $38 in early 2012) and are therefore not disposable. This requires additional manpower at the theater as the glasses have to be
collected afterwards and they also have to be sterilised before reuse.
The Infitec system has been adopted by Dolby where the projector can show
either 2D or 3D. When 3D is shown the regular color wheel used is replaced by
one with two sets of filters, one set for each channel.
The active polarization filter called a ZScreen was invented by Lenny Lipton
of the Stereographics Corporation and was first patented in 1988 [46]. It was
originally used in shutter glasses systems using CRT displays as at the time this
was the only technology sufficiently fast to support this mode of operation.
In the RealD XLS system used by Sony a 4 K projector (4096 × 2160 resolution) displays two 2 K (2048 × 858 resolution) images. Only a single projector is necessary as the two images are displayed one above the other in the 4 K frame. These
are combined by a special optical assembly that incorporates separating prisms. The
lenses for each channel are located one above the other at the projector output.
Displaying the images simultaneously avoids the artifacts produced by the alternative triple flash method where left and right images are presented sequentially.
Another polarization switching system is produced by DepthQ where a range of
modules is available that can support projector output up to 30 K lumens. The modules have a fast switching time of 50 µs and will give low crosstalk for frame rates in excess of 280 FPS. The system includes the liquid crystal modulator and its control unit with a standard synchronization input. Output from the graphics card or the projector's GPIO (general purpose input/output) port is supplied to the control unit,
which then converts the signal to match the required input of the modulator.
A mechanical version of the switched filter cell is produced by Masterimage where
a filter wheel in the output beam of the projector running at 4320 rpm enables triple
flash sequential images to be shown. Triple flash is where within a 1/24 s period the
left image is displayed three times and the right image displayed three times.
It is possible to combine a screen-sized active retarder with a direct-view LCD
to provide a passive glasses system so that there is no loss of vertical resolution.
Samsung and RealD showed prototypes of this system in 2011. The additional
weight and cost of the active shutter panel was seen as an issue but RealD stated
that they had a plan to make the panel affordable.
The left and right images can be spatially multiplexed with the use of a film
pattern retarder (FPR) where alternate rows of pixels have orthogonal polarization.
LG led the way with this technology and other manufacturers have followed suit.
There is an argument that says the perceived vertical resolution is halved but this
is countered with another argument suggesting that as each eye sees half the
resolution, the overall effect is of minimal resolution loss [47].
13.5.2 Volumetric
There are many different methods for producing a volumetric display and a large
proportion of these have been available for some years. As most of these suffer from
the disadvantages of having transparent image surfaces and the inability to display
anisotropic light distribution (cannot show a non-Lambertian distribution) from their
component voxels, these are not particularly relevant in a state-of-the-art survey.
It is clear that if the image is produced by any of the three basic methods
described in Sect. 13.4.2 the light distribution from the voxels will be isotropic as
the real surfaces from which the light radiates are diffusing and have a Lambertian
distribution. Images will therefore lack the realism of real-world scenes and this
isotropic distribution is also the cause of image surface transparency. Making the
display hybrid so that it is effectively a combination of volumetric and multi-view
or light field types can overcome this deficiency [48].
The limitations of a conventional volumetric display, referred to as an IEVD
(isotropically emissive volumetric display), can be partially mitigated by
employing a technique using lumi-volumes in the rendering. This has been
reported by a team comprising groups from Swansea University in the UK and
Purdue University in the US [49].
Isotropic emission limitations can be overcome with a display developed at
Purdue University that is effectively a combination of volumetric and multi-view
display [50]. This uses a Perspecta volumetric display that gives a spherical
viewing volume 250 mm diameter. The principal modification to the display
hardware is the replacement of the standard diffusing rotating screen with a mirror
and a 60° FWHM (full width half maximum) Luminit vertical diffuser. This
provides a wide vertical viewing zone. The results published in the paper clearly
show occlusions and the stated advantage of this approach over conventional
integral imaging is that the viewing angle is 180°. Although this is an interesting
approach the Perspecta display unfortunately is no longer available.
Two displays employing rotating screens have been reported. In the Transpost
system [51] multiple images of the object, taken from different angles, are projected onto a directionally reflective spinning screen with a limited viewing angle.
In another system called Live Dimension [52] the display comprises projectors
arranged in a circle. The Lambertian surfaces of the screens are covered by vertical
louvers that direct the light towards the viewer.
A completely different type of display, but one that is of interest is the
laser-produced plasma display of the National Institute of Advanced Industrial Science
and Technology (AIST) [53]. This produces luminous dots in free space by using
focussed infrared pulsed lasers with a pulse duration of around 1 nanosecond and a
repetition frequency of approximately 100 Hz. Objects can be formed in air by moving
the beam with an XYZ scanner and this enables around 100 dots per second to be
produced. The AIST paper shows images of various simple three-dimensional dot
matrix images. The display is marketed by the Japanese company Burton Inc. who has
announced that an RGB version will become available in the near future.
order to enable vertical movement of the viewer. Although the display is inherently horizontal parallax only, vertical viewer position tracking can be used to
render the images in accordance with position so that the appearance of vertical
parallax, with minor distortions, could be provided. Although this referred to as a
light field display its operating principle appears to be similar to the modified
Perspecta display. This demonstrates the point that the distinction between the
different types of display when hybrid technology is employed can become
blurred. The same group has also proposed another display where the screen
comprises a two-dimensional pico projector array [55]. The paper describes a
display that would actually work but with poor performance; the pixels of the
display would be the projector lenses so that the resolution would be very low.
13.5.3.2 Holografika
The Holografika display [56] uses optical modules to provide multiple beams that converge and intersect in front of the screen to form real image voxels, or diverge to produce
virtual voxels behind the screen. The screen diffuses the beams in the vertical direction
only, without altering horizontal beam directions. Holografika can currently supply a range of products including a 32″ display using 9.8 megapixels and a 72″ version with 34.5 megapixels [57]. A version of the display that has a higher angular resolution of the emergent light field will become available and will provide a 120° field of view.
13.5.3.3 Other
Hitachi has demonstrated a version of its display where a combination of mirrors
and lenses is used to overlay stereoscopic images within a real object [58]. The
images are made up by views from 24 projectors and enable 3D to be seen by
several users over a wide viewing angle. The dynamic aperture type of light field
display uses a fast frame rate projector in conjunction with a horizontally scanned
aperture. An early version of this was the Tilton display with the mechanically
scanned aperture. More recently this principle has been developed further at
Cambridge University with the use of a fast digital micromirror device (DMD)
projector and a ferroelectric shutter [59]. This is currently available from Setred as a 20″ XGA display intended for medical applications.
13.5.4 Multi-View
13.5.4.1 Two-View
The simplest type of multi-view display is one where a single image pair is
displayed. If the display hardware is only capable of showing a single image pupil
pair the sweet spot occupies only a very small part of the viewing field.
13.5.4.2 Projector
Although projection techniques do not provide the most compact form of multi-view display, they do have the potential to offer a practical image source where a large number of views is required. This approach has become more viable recently as
relatively inexpensive pico projectors are now available. There are two simple
ways projectors can be used. The first method is the use of a double lenticular
screen [60] where the region of contact between the sheets is the flat common focal
plane of the two outer layers of vertically aligned cylindrical lenses. This is an
example of a Gabor superlens [61]. The lenses of the projectors form real images
in the viewing field that are the exit pupils and the light must be diffused vertically
in order to enable vertical movement of the viewers.
Another method is the use of a retroreflecting screen that returns the light rays back in the same direction as they enter the screen [62]. The screen could comprise a lenticular arrangement operating in the same manner as the cat's eyes that are used as reflectors on UK roads. An alternative method is to use a corner cube structure
where three orthogonal reflecting surfaces return light entering back in the same
direction. Again, vertical diffusing is required to enable vertical viewer movement.
Hitachi has announced a 24-view system [63] where the images from 24 projectors are combined with the use of a half-silvered mirror that enables the 3D
image to be overlaid with an actual object. The viewing region of 30° in the vertical direction and 60° horizontally enables several people to view the images.
13.5.4.3 Lenticular/Barrier
Currently the most common form of multi-view display uses a slanted lenticular
view-directing screen. Philips [64] was active in this area for around 15 years but
pulled out in 2009. The work is being continued by a small company called
Dimenco [65] that consists of a few ex-Philips engineers. It appears that Philips
has again taken up marketing this display. An alternative method is the slanted
parallax barrier [66] where the result of the slanting has the same effect as for the
lenticular screen, that is, the elimination of the appearance of visible banding and
the reduction in resolution spread over both the horizontal and vertical direction.
The Masterimage display [67] uses an active twisted nematic (TN) LCD parallax barrier located in front of an image-forming thin film transistor (TFT) LCD
panel. The barrier is switchable to enable 3D to be seen in the portrait and the
landscape orientation and to switch between the 2D and 3D modes. Information
released by the company states that the system has no crosstalk. The display is
incorporated into the Hitachi Wooo mobile phone that was launched in 2009.
Light directions exiting the screen can also be controlled by using a color filter
array in a component that is referred to as a wavelength selective filter [68]. This
technique was developed around 2000 by 4D-Vision GmbH in Germany where it
was used to support an 8-view display.
There are many multi-view displays on the market and new products frequently
appear with others becoming discontinued. This is a relatively specialized market
and a convenient means of finding out what is still available, at least in the UK, is
to contact or to visit the website of Inition [69] who supply a wide range of 3D
display products.
13.5.5.2 SeeReal
SeeReal is developing a display that combines the advantages of head tracking and
holography [71]. When a holographic image is produced a large amount of
redundant information is required in order to provide images over a large region
that is only sparsely populated with viewers' eye pupils. This requires a screen
resolution that is in the order of a wavelength of light and even if vertical parallax
is not displayed the screen resolution is still too high to be handled by current
13.5.5.3 Microsoft
Microsoft Applied Sciences Group has announced a system where the Wedge
waveguide technology developed at Cambridge University is adapted to provide
steering of light beams from an array of LEDs [72]. The display has a thin form factor
of around 50 mm. Images are produced view sequentially so that the focussing of the
rays exiting the screen changes position each time a new view is displayed.
The display in operation can be seen via the link in the following Ref. [73].
13.5.5.4 Apple
Another means of directing light to viewers eyes is described in a patent assigned
to Apple [74]. This uses a corrugated reflecting screen where the angle of the
exiting beam is a function of its X position. The text of the patent mentions that
head tracking can be applied to this display. Proposed interactive applications for
the display, including gesture and presence detection, can be obtained on the
Patently Apple blog [75].
low étendue properties that enable accurate control of the light ray directions. It is not
used for its high coherence properties as holographic techniques are not employed in
the display. The front screen assembly includes a Gabor superlens which is a novel
type of lens that can provide angular magnification. This was invented by Denis
Gabor in the 1940s. The current HELIUM3D setup was intended to incorporate a fast
light engine that would have provided motion parallax to several users. A suitable
light engine is not yet available and the system is currently being adapted for operation as the backlight of a 120 Hz LCD where the display provides the same stereo
pair to several users. It will have a compact housing size therefore making it suitable
for 3D television use.
13.6 Conclusion
Currently the most common type of autostereoscopic display is multi-view;
however, these suffer from resolution loss, restricted depth of field, and a relatively
limited viewing region. These effects are reduced with increasing number of views
but this creates a tradeoff with resolutionunfortunately you cannot get something
for nothing. It is possible that eventually SMV displays could become viable with
ultra-high resolution panels used in conjunction with slanted lenticular screens in
the event of such panels becoming available.
If the criteria for the appearance of continuous motion parallax given in
Sect. 13.2.6 are adhered to then it would appear that a view pitch in the order of
2.5 mm would suffice. As an example, 100 parallax images would be required for a
continuous 250 mm viewing width. With a slanted lenticular screen the resolution
loss is approximately equal to the square root of the number of views so each
individual view would have a tenth of the native display resolution. Consider the
example of an 8 K display (7680 × 4320); the resolution of each view image would only be in the order of 768 × 432. Although the addition of 3D enhances
the perceived image quality, it is not clear that this resolution reduction would be
mitigated.
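The arithmetic above can be laid out explicitly (a sketch restating the chapter's own example figures):

    import math

    # Views needed for continuous motion parallax over a given viewing width, and
    # the per-view resolution of a slanted-lenticular 8 K panel, using the example
    # figures quoted above (2.5 mm view pitch, 250 mm width, 7680 x 4320 panel).
    viewing_width_mm, view_pitch_mm = 250.0, 2.5
    n_views = int(viewing_width_mm / view_pitch_mm)          # 100 views
    panel_w, panel_h = 7680, 4320
    loss_per_direction = math.sqrt(n_views)                  # ~10 in each direction
    print(n_views, int(panel_w / loss_per_direction), int(panel_h / loss_per_direction))
    # -> 100 views, each roughly 768 x 432 pixels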
The performance of a multi-view display can be improved with the addition of
head tracking. Toshiba demonstrated a 55″ autostereoscopic display at the 2012
Consumer Electronics Show (CES). Before the show the display had already been
available in Japan and Germany. The native resolution of the display is 4 K (3840 × 2160), which enables a series of nine 1280 × 720 images to be formed.
The addition of head tracking permits the viewing zones to be located in the
optimum position for the best viewing condition for several viewers. Toshiba
states that this can be up to nine viewers but best performance is obtained with four
or fewer. In addition to the resolution loss the other disadvantage of this display is
that viewers have to be close to a fixed distance from the display.
If SMV displays do become widely used it must be borne in mind that image
capture will be more complex. If the continuous viewing width is, for example
250 mm then the capture width should be in the same order. It is most likely that a
camera array would be used with intermediate views being produced by interpolation if necessary. Plenoptic techniques [78] are probably not suitable as a single
lens used at the first stage would have to be extremely large.
3D televisions were initially active glasses where the left and right images are
separated sequentially as described in Sect. 13.4.1. In early 2011 the Display Daily blog, which is prepared by the well-informed Insight Media, proposed the Wave Theory of 3D [79]. In this proposal the first wave of 3D television is active
glasses, the second wave passive glasses and the third wave glasses-free. Although
there are pros and cons regarding glasses displays, for example, cost and weight of
glasses, display weight, and image resolution, ultimately the acceptance of 3D
television will be determined by whether or not glasses will need to be worn at all.
Although glasses-free is shown as the third wave, interestingly there is no time scale given in Insight Media's report.
Head tracking can provide an autostereoscopic display that does not have to
compromise the image resolution. In its present form it does not appear that the
Microsoft head tracked display can control the exit pupil positions in the Z
direction so the use of this display could be limited in a television viewing situation. It is not apparent in the Apple patent how readily tracking could be controlled in the Z direction. The HELIUM3D display is fairly complex but the
complexity is principally due to its inherent ability to produce exit pupils over a
large region allowing several viewers to move freely over a room-sized area.
In September 2009, SeeReal announced that they had produced an 8″ holographic demonstrator with 35 micron pixels forming a phase hologram and that a 24″ product with a slightly smaller pixel size was planned. It does not appear at the
time of writing (February 2012) that a product is actually available yet.
Holographic displays may be some years away but research is being carried out
that is leading the way to its inception. The University of Arizona is developing a
holographic display that is updatable so that moving images can be shown [80].
In order to be suitable for this application the material displaying the hologram must
have a high diffraction efficiency, fast writing time, hours of image persistence,
capability for rapid erasure and the potential for large display area; this is a combination of properties that had not been realized before. A material has been developed
that is a photorefractive polymer written by a 50 Hz nanosecond pulsed laser. A proof-of-concept demonstrator has been built where the image is refreshed every 2 s.
Another holographic approach is taken by the object-based media group at
MIT. In this display a multi-view holographic stereogram is produced in what is
termed a diffraction specific coherent (DSC) panoramagram [81]. This provides a
sufficiently large number of discrete views to provide the impression of smooth
motion parallax. Frame rates of 15 frames per second have been achieved. So far
horizontal-only parallax has been demonstrated but the system can be readily
extended to full parallax.
To summarize the current situation with stereoscopic displays, it is fairly
certain that glasses will not be acceptable for television for much longer and that
autostereoscopic displays will replace glasses types. At present the multi-view
displays are the most common autostereoscopic displays but do not provide sufficiently
good quality for television use. The use of head tracking in conjunction with a
multi-view display with a high native resolution is one means of making multi-view displays more acceptable; however, the limitations of resolution loss and
limited depth of usable viewing field could still be an issue.
Head tracking applied to two-view displays offers a solution to the resolution and
restricted viewing issues, but these displays, for example the HELIUM3D display, still require further development before they become a viable consumer television product.
It is likely that glasses systems will continue to be used in cinemas for the
foreseeable future. The reasons for this are twofold; first the wearing of glasses is
more acceptable in a cinema where the audience tend not to keep putting on and
taking off the glasses. Second, the technical difficulties in placing images in the
correct positions over a large area would be extremely difficult to overcome.
Niche markets where cost is of less importance will be suitable for the other
types of display, for example, light field and volumetric. Eventually all 3D displays could be either true holographic or possibly hybrid conventional/holographic
as work is proceeding in this area; however, it is not likely to happen within the
next 10 years. If holographic displays eventually become the principal display
method it is still unlikely that the capture would be holographic. Illuminating a
scene with coherent light would be extremely difficult and also would not create
natural lighting conditions. It is more likely that the hologram would be synthesized from images captured on a camera array.
References
1. Spottiswoode N, Spottiswoode R (1953) The theory of stereoscopic transmission. University of California Press, p 13
2. Gibson EJ, Gibson JJ, Smith OW, Flock H (1959) Motion parallax as a determinant of perceived depth. J Exp Psychol 58(1):40–51
3. Julesz B (1960) Binocular depth perception of computer-generated patterns. Bell Syst Tech J 39:1125–1162
4. Scotcher SM, Laidlaw DA, Canning CR, Weal MJ, Harrad RA (1997) Pulfrich's phenomenon in unilateral cataract. Br J Ophthalmol 81(12):1050–1055
5. Wallach H, O'Connell DN (1953) The kinetic depth effect. J Exp Psychol 45(4):205–217
6. Evans JPO (2003) Kinetic depth effect X-ray (KDEX) imaging for security screening. In: Visual information engineering, VIE 2003 international conference, 7–9 July 2003
7. Lambooij M, Ijsselsteijn W, Fortuin M, Heynderickx I (2009) Visual discomfort and visual fatigue of stereoscopic displays: a review. J Imaging Sci Tech 53(3):030201
8. Hoffmann DM, Girshik AR, Akeley K, Banks MS (2008) Vergence-accommodation conflicts hinder visual performance and cause visual fatigue. J Vis 8(3), Article 33
9. Woods A, Docherty T, Koch R (1993) Image distortions in stereoscopic video systems. In: Proceedings of the SPIE, stereoscopic displays and applications IV, vol 1915. San Jose
10. Pastoor S (1991) 3D-television: a survey of recent research results on subjective requirements. Signal processing: image communication. Elsevier Science Publishers BV, Amsterdam, pp 21–32
11. McCarthy S (2010) Glossary for video & perceptual quality of stereoscopic video. White paper prepared by the 3D@Home Consortium and the MPEG Forum 3DTV working group. 17 Aug 2010. http://www.3dathome.org/files/ST1-01-01_Glossary.pdf. Accessed 24 Jan 2012
12. Kajiki Y (1997) Hologram-like video images by 45-view stereoscopic display. SPIE Proc Stereosc Disp Virtual Real Syst IV 3012:154–166
13. Lucente M, Benton SA, Hilaire P St (1994) Electronic holography: the newest. In: International symposium on 3D imaging and holography, Osaka
14. Hilaire P St (1995) Modulation transfer function of holographic stereograms. In: Proceedings of the SPIE, applications of optical holography
15. Pastoor S (1992) Human factors of 3DTV: an overview of current research at Heinrich-Hertz-Institut Berlin. IEE colloquium on stereoscopic television: Digest No 1992/173:11/3
16. Lippmann G (1908) Épreuves réversibles. Photographies intégrales. Comptes-Rendus de l'Académie des Sciences 146(9):446–451
17. McCormick M, Davies N, Chowanietz EG (1992) Restricted parallax images for 3D TV. IEE colloquium on stereoscopic television: Digest No 1992/173:3/1–3/4
18. Arai J, Kawai H, Okano F (2006) Microlens arrays for integral imaging system. Appl Opt 45(36):9066–9078
19. Kanolt CW (1918) US Patent 1260682
20. Brown D (2009) Images across space. Middlesex University Press, London
21. Baird JL (1945) Improvements in television. UK Patent 573,008. Applied for 26 Aug 1943 to 9 Feb 1944, Accepted 1 Nov 1945
22. Funk (2008) Stereoptiplexer cinema system: outside-looking in. Veritas et Visus
23. Bordwell D, Thompson K (2010) Film history: an introduction, 3rd edn. McGraw-Hill, New York. ISBN 978-0-07-338613-3
24. Collender R (1986) 3D television, movies and computer graphics without glasses. IEEE Trans Cons Electro CE-32(1):56–61
25. Tilton HB (1987) The 3D oscilloscope: a practical manual and guide. Prentice Hall Inc., New Jersey
26. 3D Display Technology Chart (2012) http://www.3dathome.org/. Accessed 24 Jan 2012
27. Woods AJ, Yuen KL, Karvinen KS (2007) Characterizing crosstalk in anaglyphic
stereoscopic images on LCD monitors and plasma displays. J SID 15(11):889898
28. Traub AC (1967) Stereoscopic display using rapid varifocal mirror oscillations. Appl Opt
6(6):10851087
29. Dolgoff G (1997) Real-depthTM imaging: a new imaging technology with inexpensive directview (no glasses) video and other applications. SPIE Proc Stereosc Disp Virtual Real Syst IV
3012:282288
30. Sullivan A (2004) DepthCube solid-state 3D volumetric display. Proc SPIE 5291(1).
ISSN:0277786X:279284. doi:10.1117/12.527543
31. Blundell BG, Schwartz AJ (2000) Volumetric three-dimensional display systems. WileyIEEE Press, New York. ISBN 0-471-23928-3
32. Okui M, Arai J, Okano F (2007) New integral imaging technique uses projector. doi:10.1117/
2.1200707.0620. http://spie.org/x15277.xml?ArticleID=x15277. Accessed 28 Jan 2012
33. Hitachi (2010) Hitachi shows 10 glasses-free 3D display. Article published in 3D-display-info
website: www.3d-display-info.com/hitachi-shows-10-glasses-free-3d-display. Accessed 24 Jan 2012
34. Travis ARL, Lang SR (1991) The design and evaluation of a CRT-based autostereoscopic
display. Proc SID 32/4:279283
35. Dodgson N (2011) Multi-view autostereoscopic 3D display. Presentation given at the
Stanford workshop on 3D imaging. 27 Jan 2011. http://scien.stanford.edu/pages/conferences/
3D%20Imaging/presentations/Dodgson%20%20%20Stanford%20Workshop%20on%203D%20Imaging.pdf. Accessed 29 Jan 2012
36. Berkel C, Parker DW, Franklin AR (1996) Multview 3D-LCD. SPIE Proc Stereosc Disp
Virtual Real Syst IV 2653:3239
37. Schwartz A (1985) Head tracking stereoscopic display. In: Proceedings of IEEE international
display research conference, pp 141144
38. Woodgate GJ, Ezra D, Harrold J, Holliman NS, Jones GR, Moseley RR (1997) Observer
tracking autostereoscopic 3D display systems. SPIE Proc Stereosc Disp Virtual Real Syst IV
3012:187198
410
P. Surman
39. Benton SA, Slowe TE, Kropp AB, Smith SL (1999) Micropolarizer-based multiple-viewer
autostereoscopic display. SPIE Proc Stereosc Disp Virtual Real Syst IV 3639:7683
40. Ichinose S, Tetsutani N, Ishibashi M (1989) Full-color stereoscopic video pickup and display
technique without special glasses. Proc SID 3014:319323
41. Eichenlaub J (1994) An autostereoscopic display with high brightness and power efficiency.
SPIE Proc Stereosc Disp Virtual Real Syst IV 2177:143149
42. Sandlin DJ, Margolis T, Dawe G, Leigh J, DeFanti TA (2001) Varrier autostereographic
display. SPIE Proc Stereosc Disp Virtual Real Syst IV 4297:204211
43. Schwerdtner A, Heidrich H (1998) The dresden 3D display (D4D). SPIE Proc Stereosc Disp
Appl IX 3295:203210
44. Sorensen SEB, Hansen PS, Sorensen NL (2001) Method for recording and viewing
stereoscopic images in color using multichrome filters. US Patent 6687003. Free patents
online. http://www.freepatentsonline.com/6687003.html
45. Jorke H, Fritz M (2012) Infiteca new stereoscopic visualization tool by wavelength
multiplexing. http://jumbovision.com.au/files/Infitec_White_Paper.pdf. Accessed 25 Jan 2012
46. Lipton L (1988) Method and system employing a pushpull liquid crystal modulator. US
Patent 4792850
47. Soneira RM (2012) 3D TV display technology shoot-out. http://www.displaymate.com/
3D_TV_ShootOut_1.htm. Accessed 27 Jan 2012
48. Favalora GE (2009) Progress in volumetric three-dimensional displays and their applications.
Opt Soc Am. http://www.greggandjenny.com/gregg/Favalora_OSA_FiO_2009.pdf. Accessed
28 Jan 2012
49. Mora B, Maciejewski R, Chen M (2008) Visualization and computer graphics on
isotropically emissive volumetric displays. IEEE Comput Soc. doi:10.1109/TVCG.2008.99.
https://engineering.purdue.edu/purpl/level2/papers/Mora_LF.pdf. Accessed 28 Jan 2012
50. Cossairt O, Napoli J, Hill SL, Dorval RK, Favalora GE (2007) Occlusion-capable multiview
volumetric three-dimensional display. Appl Opt 46(8). https://engineering.purdue.edu/purpl/
level2/papers/Mora_LF.pdf. Accessed 28 Jan 2012
51. Otsuka R, Hoshino T, Horry Y (2006) Transpost: 360 deg-viewable three-dimensional
display system. Proc IEEE 94(3). doi:10.1109/JPROC.2006.870700
52. Tanaka K, Aoki S (2006) A method for the real-time construction of a full parallax light field.
In: Proceedings of SPIE, stereoscopic displays and virtual reality systems XIII 6055, 605516.
doi:10.1117/12.643597
53. Shimada S, Kimura T, Kakehata M. Sasaki F (2006) Three dimensional images in the air.
Translation of the AIST press release of 7 Feb 2006. http://www.aist.go.jp/aist_e/
latest_research/2006/20060210/20060210.html. Accessed 28 Jan 2012
54. Jones A, McDowall I, Yamada H, Bolas M, Debevec P (2007) Rendering for an interactive
3608 light field display. Siggraph 2007 Emerging Technologies. http://gl.ict.usc.edu/
Research/3DDisplay/. Accessed 28 Jan 2012
55. Jurik j, Jones A, Bolas M, Debevec P (2011) Prototyping a light field display involving direct
observation of a video projector array. In: Proceedings of IEEE computer society conference
on computer vision and pattern recognition workshops (CVPRW), pp 1520
56. Baloch T (2001) Method and apparatus for displaying three-dimensional images. US Patent
6,201,565 B1
57. http://www.holografika.com/. Accessed 28 Jan 2012
58. http://www.3d-display-info.com/hitachi-shows-10-glasses-free-3d-display. Accessed 28 Jan 2012
59. Moller C, Travis A (2004) Flat panel time multiplexed autostereoscopic display using an
optical wedge waveguide. In: Proceedings of 11th international display workshops, Niigata,
pp 14431446
60. Okoshi T (1976) Three dimensional imaging techniques. Academic Press, New York, p 129
61. Hembd C, Stevens R, Hutley M (1997) Imaging properties of the gabor superlens. EOS
topical meetings digest series: 13 microlens arrays. NPL Teddington, pp 101104
62. Okoshi T (1976) Three dimensional imaging techniques. Academic Press, New York, p 140
13
411
63. Hitachi (2011) Stereoscopic display technology to display stereoscopic images superimposed
on real space. News release 30 Sept 2011. http://www.hitachi.co.jp/New/cnews/month/2011/
09/0930.html. Accessed 29 Jan 2012
64. IJzerman W et al (2005) Design of 2D/3D switchable displays. Proc SID 36(1):98101
65. Dimenco (2012) Productsdisplays 3D stopping power5200 proffesional 3D display. http://
www.dimenco.eu/displays/. Accessed 29 Jan 2012
66. Boev A, Raunio K, Gotchev A, Egiazarian K (2008) GPU-based algorithms for optimized
visualization and crosstalk mitigation on a multiview display. In: Proceedings of SPIE-IS&T
electronic imaging. SPIE, vol 6803, pp 24. http://144.206.159.178/FT/CONF/16408309/
16408328.pdf. Accessed 29 Jan 2012
67. Masterimage (2012) Autostereoscopic 3D LCD. http://www.masterimage3d.com/products/
3d-lcd0079szsxdrf. Accessed 29 Jan 2012
68. Schmidt A, Grasnick (2002) Multi-viewpoint autostereoscopic displays from 4D-vision. In:
Proceedings of SPIE photonics west 2002: electronic imaging, vol 4660, pp 212221
69. Inition (2012) http://www.inition.co.uk/search/node/autostereoscopic. Accessed 29 Jan 2012
70. Surman P, Hopf K, Sexton I, Lee WK, Bates R (2008) Solving the problemthe history and
development of viable domestic 3DTV displays. In: Three-dimensional television, capture,
transmission, display (edited book). Springer Signals and Communication Technology
71. Haussler R, Schwerdtner A, Leister N (2008) Large holographic displays as an alternative to
stereoscopic displays. In: Proceedings of SPIE, stereoscopic displays and applications XIX,
vol 6803(1)
72. Travis A, Emerton N, Large T, Bathiche S, Rihn B (2010) Backlight for view-sequential
autostereo 3D. SID 2010 Digest, pp 215217
73. Microsoft (2010) The wedgeseeing smart displays through a new lens. http://
www.microsoft.com/appliedsciences/content/projects/wedge.aspx. Accessed 4 Feb 2012
74. Krah C (2010) Three-dimensional display system. US Patent 7,843,449. www.freepatentsonline/
7843339.pdf. Accessed 4 Feb 2012
75. Purcher J (2011) Apple wins a surprise 3D display and imaging patent stunner. http://
www.patentlyapple.com/patently-apple/2011/09/whoa-apple-wins-a-3d-display-imagingsystem-patent-stunner.html
76. Surman P, Sexton I, Hopf K, Lee WK, Neumann F, Buckley E, Jones G, Corbett A, Bates R,
Talukdar S (2008) Laser-based multi-user 3D display. J SID 16(7):743753
77. Erden E, Kishore VC, Urey H, Baghsiahi H, Willman E, Day SE, Selviah DR, Fernandez FA,
Surman P (2009) Laser scanning based autostereoscopic 3D display with pupil tracking. Proc
IEEE Photonics 2009:1011
78. Ng R, Levoy M, Bredif M, Duval G, Horowitz M, Hanrahan P (2005) Light field photography
with a hand-held plenoptic camera. Stanford Tech Report CTSR 2005-02. http://
hci.stanford.edu/cstr/reports/2005-02.pdf. Accessed 6 Feb 2012
79. Chinnock C (2011) Here comes the second wave of 3D. Display Daily. http://
displaydaily.com/2011/01/05/here-comes-the-second-wave-of-3d/. Accessed 6 Feb 2012
80. Blanche P-A et al (2010) Holographic three-dimensional telepresence using large-area
photorefractive polymer. Nature 468(7320):8083. http://www.optics.arizona.edu/pablanche/
images/Articles/1010_Blanche_Nature468.pdf. Accessed 6 Feb 2012
81. Barabas J, Jolly S, Smalley DE, Bove VM (2011) Diffraction specific coherent
panoramagrams of real scenes. Proceedings of SPIE Practice Hologram XXV, vol 7957
Chapter 14
14.1 Introduction
There is a growing demand for three-dimensional TV (3D-TV) in various applications including entertainment, gaming, training, and education, to name a few. This in turn has spurred the need for methodologies that can reliably assess the quality of the user experience in the context of 3D-TV-based applications.
Subjective and objective quality assessment has been widely studied in the context of two-dimensional (2D) video and television. These assessment methods provide metrics to quantify the perceived visual quality of 2D video. Subjective quality assessment requires human observers to rate different aspects of the video quality, such as compression artifacts, sharpness, color and contrast, and the overall visual experience, through a controlled experimental setup. The average score given by the subjects is the typical output of such experiments and is considered a measurement of subjective quality, also called the mean opinion score (MOS). Subjective quality assessment methods are time-consuming and are hard to deploy
on a regular basis; hence, there is a need for objective quality assessment. Objective quality assessment methods provide metrics, or indices, that are computational models for predicting the visual quality perceived by humans, and they aim to correlate highly with the subjective scores. For extensive reviews of 2D objective quality metrics, the reader may refer to [1, 2].
3D video adds several new challenges compared to 2D video, from both the human-factors and the technological points of view. 3D video needs to capture the sense of presence or ultimate reality. Hence, the overall quality of the 3D visual experience is referred to as quality of experience (QoE). Consequently, both subjective and objective 2D visual quality assessment methods need to be revisited and augmented to measure 3D QoE. Toward this goal, several efforts have been initiated in academic labs working on 3D video QoE assessment. Simultaneously, standardization bodies and related groups (ITU, VQEG, IEEE P3333) have identified several open questions on 3D QoE assessment, highlighting the current difficulties of going beyond 2D quality assessment. The COST action IC1003 QUALINET (Network on QoE in Multimedia Systems and Services, www.qualinet.eu), which gathers effort from more than 25 countries, also considers 3D-TV QoE assessment an important challenge to tackle for wider acceptance of 3D video technology. It is, so far, one of the largest efforts to integrate and coordinate activities related to 3D-TV QoE assessment from different research groups. This chapter reflects some of the activities related to QUALINET. A discussion of current challenges related to subjective and objective visual quality assessment for stereo 3D video and a brief review of the current state of the art are presented. These challenges are then illustrated by two case studies on 3D subjective and objective quality measurement. The first case study deals with how to measure the effects of crosstalk, as an aspect of display technology, on 3D video QoE. The second case study examines the effects of both transmission and coding errors on 3D video QoE.
QoE. Separate quality assessment methods need to be developed for different application scenarios, both to measure and to improve the current technologies.
3D video quality is highly dependent on the sense of true presence or ultimate reality and can be significantly affected if discomfort is experienced. With current stereo 3D displays, visual discomfort is experienced by some viewers, typically inducing symptoms such as eye soreness, headache, and dizziness. 3D QoE should also measure depth perception and the added value of the depth dimension to the video. Hence, in order to measure the QoE of a 3D video system more reliably, the three components shown in Fig. 14.1 need to be considered. Subjective and objective metrics should measure the contribution of each of these factors to the 3D QoE.
known as the inter-pupillary distance (IPD), each eye sees a slightly different projection of the same scene. Binocular vision corresponds to the process in which the brain computes the disparity, that is, the difference between the positions of two corresponding points in the retinal images of the left and right eyes, measured relative to the center of fixation.
The images projected on the fixation center have zero disparity and correspond to objects on the fixation curve (Fig. 14.3). Objects with positive or negative disparity appear in front of or behind the fixation curve. Binocular vision is exploited in stereoscopic 3D displays, where the left-eye image is horizontally shifted with respect to the right-eye image. The brain tries to fuse these disparity
blur, blockiness, graininess, and ringing. Transmission errors can also cause degradations in quality. As the impact of compression and transmission artifacts may differ between the left-eye and right-eye images, spatial and temporal inconsistencies between the left and right views may occur.
Finally, the 3D video signal needs to be rendered on a display. Depending on the display technology, various artifacts may impair the perceived picture quality. A reduction in spatial and/or temporal resolution, a loss in luminance and/or color gamut, and the occurrence of crosstalk are typical display-related artifacts. Visual discomfort is a physiological problem experienced by many subjects when viewing 3D video. There are many causes of such visual complaints. One of the most important causes of discomfort is the accommodation-vergence conflict. Accommodation refers to the oculomotor ability of the human eye to adapt its lens to focus at different distances. Vergence refers to the oculomotor ability to rotate both eyes toward the fixated object. In natural viewing conditions, accommodation and vergence are always matched. On (most) 3D displays, however, they are mismatched because the eyes focus on the screen while they converge at a location in front of or behind the screen, where the object is rendered.
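To make the mismatch concrete, the sketch below estimates the accommodation-vergence conflict in diopters for a given viewing distance and on-screen disparity; the simple similar-triangles geometry and the default inter-pupillary distance of 63 mm are illustrative assumptions, not values taken from this chapter:

def vergence_accommodation_conflict(screen_distance_m, disparity_m, ipd_m=0.063):
    """Mismatch (in diopters) between accommodation and vergence on a stereoscopic display.

    screen_distance_m: viewing distance; the eyes focus (accommodate) on the screen plane.
    disparity_m: signed on-screen disparity; positive (uncrossed) values place the object
        behind the screen, negative (crossed) values place it in front of the screen.
    ipd_m: inter-pupillary distance, about 63 mm on average (assumed default).
    """
    # Distance at which the two visual axes intersect, from similar triangles.
    vergence_distance_m = screen_distance_m * ipd_m / (ipd_m - disparity_m)
    # Express both demands in diopters (1 / distance) and return the mismatch.
    return abs(1.0 / screen_distance_m - 1.0 / vergence_distance_m)

# Example: 3 m viewing distance with 20 mm of crossed disparity.
print(vergence_accommodation_conflict(3.0, -0.020))  # about 0.11 diopters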
introduced in the capturing stage, and convergence-accommodation rivalry, intraocular crosstalk, shear distortion, the puppet-theater effect, and the picket-fence effect in the displaying stage, as reviewed in [17–19]. Most of these artifacts cannot be eliminated completely because of the limitations of current 3D imaging techniques, although the causes and nature of these binocular artifacts have been extensively investigated.
In particular, crosstalk is produced by imperfect view separation, which causes a small proportion of one eye's image to be seen by the other eye as well. It is one of the most annoying artifacts in the visualization stage of stereoscopic imaging and exists in almost all stereoscopic displays [20]. Although it is well known that crosstalk artifacts are usually perceived as ghosts, shadows, or double contours by human subjects, it is still unclear which factors influence crosstalk perception, and how, both qualitatively and quantitatively. Subjective quality testing methodologies are often used to study these mechanisms. Some efforts have been devoted to this topic, and several factors have been found to have an impact on crosstalk perception. Specifically, the binocular parallax (camera baseline) [20, 21], crosstalk level [20], image contrast [21, 22], and hard edges [22] intensify users' perception of crosstalk, while the texture and details of the scene content [22] help to conceal crosstalk. Moreover, other factors (e.g., monocular cues of images [23]) and artifacts (e.g., blur and vertical disparity [24]) have also been found to play an important role in crosstalk perception. In this section, we report results from a case study on the perception of the crosstalk effect; both subjective and objective assessments are considered.
edges, and textures, as well as a wide range of depth structures. For each content, four consecutive cameras were selected from the multi-view sequences, forming three camera baselines: the leftmost camera always served as the left-eye view, while the other three cameras took turns as the right-eye view of the stereoscopic image. Finally, in order to simulate different levels of system-introduced crosstalk for different displays, four levels of crosstalk artifacts were added to each 3D image pair using the algorithm developed in [27]:
$R_{pc} = R_o + p\,L_o, \qquad L_{pc} = L_o + p\,R_o$    (14.1)
where Lo and Ro denote the original left and right views, Lpc and Rpc are the distorted views simulating system-introduced crosstalk distortions, and the parameter p adjusts the level of crosstalk distortion; it was set to 0, 5, 10, and 15 %, respectively. Taking the system-introduced crosstalk of the display system, P (3 %), into account, the actual crosstalk levels viewed by the subjects were therefore 3, 8, 13, and 18 %, respectively. The additive combination of the system-introduced crosstalk of the display system P and the simulated crosstalk p was justified in [28].
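A minimal Python sketch of this additive crosstalk simulation (Eq. 14.1) is given below; the array names and the clipping to an 8-bit range are illustrative assumptions, and the sketch does not reproduce the exact implementation of [27]:

import numpy as np

def simulate_crosstalk(left, right, p):
    """Leak a fraction p of each view into the other one, as in Eq. (14.1).

    left, right: original views L_o and R_o as arrays with values in [0, 255].
    p: simulated system-introduced crosstalk level (e.g., 0.05 for 5 %).
    Returns the distorted pair (L_pc, R_pc), clipped to the valid 8-bit range.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    l_pc = np.clip(left + p * right, 0.0, 255.0)
    r_pc = np.clip(right + p * left, 0.0, 255.0)
    return l_pc, r_pc

# The four simulated levels used in the test; with the display's own 3 % crosstalk,
# the levels actually seen by the subjects become 3, 8, 13, and 18 %.
simulated_levels = [0.00, 0.05, 0.10, 0.15]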
Test Methodology:
The single stimulus (SS) method was adopted in the test, since it is difficult to choose an original 3D image as the reference point when different camera baselines are combined. Moreover, a minor modification was made to the SS method so that the subjects could freely decide the viewing time for each image, and a test interface was implemented accordingly. A total of 28 subjects participated in the tests. Before the training sessions, visual-perception-related characteristics of the subjects were collected, including pupillary distance, normal or corrected binocular vision, color vision, and stereo vision.
During the training session, an example of each of the five categorical adjectival levels (Imperceptible; Perceptible but not annoying; Slightly annoying; Annoying; Very annoying) was shown to the subjects in order to benchmark and harmonize their measurement scales. The content Book Arrival, with camera baseline and crosstalk level combinations of (0 mm, 3 %), (58 mm, 3 %), (114 mm, 8 %), (114 mm, 13 %), and (172 mm, 18 %), was used in the training. When each image was displayed, the operator verbally explained the corresponding quality level to the subjects until they completely understood it.
During the test sessions, the subjects were first presented with three dummy 3D images from the content Book Arrival, which had not been used in the training sessions, in order to stabilize the subjects' judgment. Following the dummy images, 72 test images were shown to the subjects in random order. A new 3D image was shown once a subject had finished scoring the previous image.
Subjective Results Analysis:
Before computing the MOS and performing a statistical analysis of the subjective scores, a normality test and outlier detection were first carried out, as defined in ITU-R Rec. BT.500. A β2 test showed that the subjective scores followed a normal distribution; one outlier was detected and the corresponding results were excluded from the subsequent analysis.
According to the MOS values, two conclusions can be drawn: (1) crosstalk level and camera baseline have an impact on crosstalk perception, and (2) the impact of crosstalk level and camera baseline on crosstalk perception varies with the scene content. These conclusions were further verified by an ANOVA. The ANOVA results confirmed that the main effects of crosstalk level, camera baseline, and scene content were significant at the 0.05 level. The two-way interaction terms were also significant at the same level, whereas the three-way interaction was not.
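For readers who want to reproduce this kind of analysis, the following minimal sketch runs a three-way ANOVA on the raw opinion scores with statsmodels; the data-frame column names (score, crosstalk, baseline, content) are hypothetical and not taken from the study:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def crosstalk_anova(ratings: pd.DataFrame) -> pd.DataFrame:
    """Three-way ANOVA with all interaction terms on individual opinion scores.

    ratings: one row per vote, with hypothetical columns 'score', 'crosstalk',
    'baseline', and 'content'.
    Returns the ANOVA table; effects with p < 0.05 are considered significant.
    """
    model = ols('score ~ C(crosstalk) * C(baseline) * C(content)', data=ratings).fit()
    return sm.stats.anova_lm(model, typ=2)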
Fig. 14.5 Illustration of SSIM maps of Champagne (right: 100 mm, 3 %; left: 50 mm, 13 %)
$L_s = \mathrm{SSIM}(L_o, L_c)$    (14.2)
where SSIM denotes the SSIM algorithm proposed in [29] and Ls is the derived SSIM map of the left-eye view. Figure 14.5 is a representative illustration of the SSIM map derived from the crosstalk-distorted Champagne presentations.
The spatial position of crosstalk can characterize user perception of crosstalk in
3D space, because visible crosstalk of foreground objects might have a stronger
impact on perception than background objects. Therefore, in order to form a 3D
perceptual attribute map, the depth structure of scene content and regions of
visible crosstalk should be combined. We defined a filtered depth map as the 3D
perceptual attribute map:
$R_{dep} = \mathrm{DERS}(R_o)$    (14.3)

$R_{pdep}(i,j) = \begin{cases} R_{dep}(i,j), & \text{if } L_s(i,j) < 0.977 \\ 0, & \text{if } L_s(i,j) \geq 0.977 \end{cases}$    (14.4)
where DERS denotes the DERS algorithm proposed in [30], Rdep is the generated depth map of the right view, i and j are the pixel indices, and Rpdep denotes the filtered depth map corresponding to the visible crosstalk regions of the left-eye image. The threshold of 0.977 was obtained empirically from our experiments. Figure 14.6 gives examples of the filtered depth maps of Champagne and Dog.
Objective Metric for Crosstalk Perception:
The overall crosstalk perception is assumed to be an integration of the 2D and 3D perceptual attributes, which are represented by the SSIM map and the filtered depth map, respectively. Since the 3D perceptual attributes indicate that visible crosstalk on foreground objects has a stronger impact on perception than that on background objects, larger weights should be assigned to the visible crosstalk of the foreground than of the background. In other words, the SSIM map should be weighted by the filtered depth map. The integration is performed as follows:
Fig. 14.6 Illustration of the filtered depth maps of Champagne and Dog when the camera baseline is 150 mm and the crosstalk level is 3 %
$C_{pdep} = L_s \cdot \left(1 - \dfrac{R_{pdep}}{255}\right)$    (14.5)

$V_{pdep} = \mathrm{AVG}(C_{pdep})$    (14.6)
where Cpdep and Vpdep denote the combined map and the quality value predicted by the objective metric, respectively, and AVG denotes an averaging operation. In Eq. (14.5), the filtered depth map Rpdep is first normalized to the interval [0, 1] by the maximum depth value 255, and then subtracted from one to comply with the meaning of the SSIM map, in which a lower pixel value indicates a larger crosstalk distortion.
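The pooling described by Eqs. (14.3)-(14.6) can be sketched in a few lines of Python; the sketch assumes that the SSIM map of the left view and the depth map of the right view have already been produced by external tools (an SSIM implementation and a DERS-like depth estimator), which is an assumption made for illustration rather than part of the original description:

import numpy as np

def crosstalk_quality(ssim_map, depth_map, threshold=0.977):
    """Predict the crosstalk-related quality value V_pdep (Eqs. 14.4-14.6).

    ssim_map:  SSIM map L_s of the left view (values close to 1 mean little distortion).
    depth_map: depth map R_dep of the right view, 8-bit values in [0, 255].
    """
    ssim_map = np.asarray(ssim_map, dtype=np.float64)
    depth_map = np.asarray(depth_map, dtype=np.float64)
    # Eq. (14.4): keep depth only where the crosstalk is visible (SSIM below threshold).
    filtered_depth = np.where(ssim_map < threshold, depth_map, 0.0)
    # Eq. (14.5): weight the SSIM map by the inverted, normalized filtered depth map.
    combined = ssim_map * (1.0 - filtered_depth / 255.0)
    # Eq. (14.6): average pooling yields the predicted quality value.
    return float(combined.mean())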
Experimental Results:
The performance of an objective quality metric can be evaluated by comparing it with the MOS values obtained in subjective tests. The proposed metric shows promising performance: the Pearson correlation of Vpdep with the subjective quality scores obtained in the crosstalk subjective test described above is 88.4 %, which is much higher than that of traditional 2D metrics, e.g., PSNR (82.1 %) and plain SSIM (82.5 %). In addition, the scatter plot in Fig. 14.7 shows that the objective and subjective scores are strongly correlated, which further demonstrates the performance of the objective metric.
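Such a performance comparison can be reproduced by computing the Pearson linear correlation between each metric's outputs and the MOS values over the test images, for instance as in the sketch below; the variable names are hypothetical and the numeric arrays are placeholders, not the data of this study:

import numpy as np
from scipy.stats import pearsonr

def metric_performance(objective_scores, mos):
    """Pearson linear correlation (in percent) between objective scores and MOS values."""
    r, _ = pearsonr(np.asarray(objective_scores), np.asarray(mos))
    return 100.0 * r

# Placeholder arrays: one objective score and one MOS value per test image.
# In the study this comparison was made for V_pdep, PSNR, and SSIM over 72 test images.
v_pdep_scores = np.array([0.91, 0.85, 0.78, 0.72, 0.66])
mos_values = np.array([4.3, 3.9, 3.1, 2.6, 2.2])
print(metric_performance(v_pdep_scores, mos_values))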
14.4.3 Conclusion
Subjective tests on stereoscopic crosstalk perception have been conducted with varying scene content, camera baseline, and crosstalk level. By limiting the scope of the study to the perception of specific artifacts, it has been possible to obtain reliable data with an ad hoc methodology for the subjective experiment.
According to a statistical analysis, the above three factors have major impacts on the perception of crosstalk. By observing the visual variations of the stimuli as the three significant factors change, three perceptual attributes of crosstalk are identified: shadow degree, separation distance, and the spatial position of crosstalk. These perceptual attributes can be represented by the SSIM map and the filtered depth map. An integration of these two maps forms an effective objective metric for crosstalk perception, achieving more than 88 % correlation with the MOS results.
study summarizes a series of tests conducted on severely degraded video sequences targeted at a typical Full-HD service chain. Further, it details a procedure which may be used to analyze the validity of the ACR-HR methodology in this context.
[Flattened table: overview of the subjective experiments (including Exp. 3 [33]), listing the participating laboratories (IRCCyN, T-Labs, Acreo), the number of source contents per experiment (10 (1a), 11 (1b), 7 (2)), the coding algorithms used (Alg. A1, Alg. A2), and the conditions included (marked Yes); the exact column structure could not be recovered from the extracted text.]
visual acuity in 2D and 3D using Snellen charts and the Randot test, as well as an Ishihara color plate test, was performed for each subject. A training session preceded each subjective evaluation. The experiment session was split into two or three parts of approximately 15 min each, with a break after each part.
The standardized ACR method was used for measuring video quality [34]. A QoE scale was presented and the subjects were instructed to rate the overall perceived QoE. In addition, the observers provided ratings of visual discomfort, with different scales used in the different experiments. In Exp. 1, the observers could optionally indicate a binary opinion on visual discomfort. In Exp. 2, a balanced rating scale was used that allowed the observers to indicate both higher and lower comfort compared to the well-known 2D TV condition, using a forced choice on a five-point scale: much more comfortable/more comfortable/as comfortable as/less comfortable/much less comfortable than watching 2D television. In Exp. 3, an adaptation of the five-point ACR scale was used, with the levels excellent/good/fair/poor/bad visual comfort.
The two scales were presented at the same time, and each observer voted on the two scales in parallel. The presence of the two scales may have led the subjects not to include visual discomfort in their QoE votes.
Fig. 14.8 Correlation between different labs on the QoE scale. a Cross-lab Exp. 2, b cross-experiment Exp. 1 vs. Exp. 2
expected. In addition, small confidence intervals indicate that the observers do not disagree about the meaning of the terms on the scale. The results of all three experiments indicate that the intra-lab variability corresponds to the limits known from the ACR-HR methodology in 2D.
The inter-lab correlation is best measured by performing the exact same experiment in two different labs, as was done for Exp. 2. Slightly less reliable is a comparison of a subset of processed video sequences that have been evaluated in two different experiments, since the corpus effect comes into play [35]; such a comparison can be made between Exp. 1 and Exp. 2. A third method compares the mean subjective ratings for the HRCs that are common. This method may be applied even if the source content differs, provided that the source contents span approximately the same quality range, and it can therefore be used between all the different experiments.
Figure 14.8 shows scatter plots of the MOS obtained for the processed video sequences. Figure 14.8a shows the results of the fully reproduced test Exp. 2 in the two laboratories in Sweden and France. The regression curve indicates a slight offset and gradient. One explanation could be that this effect is due to different cultural influences, such as the translation of the scale items and the general acceptance of 3D in the two countries. The linear correlation of 0.95 corresponds to the correlation found in ACR tests for 2D [36]. Figure 14.8b compares the MOS of common PVSs between Exp. 1 and the linearly fitted results of the two runs of Exp. 2. The correlation of 0.97 compares well with those found for the common set in the VQEG studies [37, 38]. The HRCs of the coding scenarios can be compared pairwise between all three experiments, as shown in Table 14.2. Taking into account that Exp. 3 used a different set of source contents, the correlation may be considered acceptable.
While the ACR methodology seems to provide stable results across labs and experiments, questions arise regarding the correct interpretation of the QoE scale.
Table 14.2 Linear correlation coefficient for the common HRCs in the four experiments

            Exp. 1          Exp. 2a          Exp. 2b          Exp. 3
Exp. 1      1.00 (20 HRC)   0.987 (6 HRC)    0.987 (6 HRC)    0.964 (5 HRC)
Exp. 2a     0.987 (6 HRC)   1.00 (15 HRC)    0.988 (15 HRC)   0.909 (9 HRC)
Exp. 2b     0.987 (6 HRC)   0.988 (15 HRC)   1.00 (15 HRC)    0.939 (9 HRC)
Exp. 3      0.964 (5 HRC)   0.909 (9 HRC)    0.939 (9 HRC)    1.00 (23 HRC)
Since 3D video has not been on the market long enough to establish a solid category of user perception, as traditional television has, it can be assumed that features of the content play a large role in QoE judgments. All experiments included the 3D reference videos and, as a hidden reference, a condition where only one of the two 3D views was shown to both eyes in 3D mode, i.e., with zero disparity, corresponding to a 2D presentation but using the same viewing setup. In experiments 1 and 2, the 10 common source contents were judged by 72 observers; for three of these contents, the 2D presentation was preferred over the 3D presentation in a statistically significant way, while the opposite did not occur. It remains an open question whether this result stems from the ACR methodology, which lacks an explicit reference, so that subjects might judge only on the 2D QoE scale in the case of a 2D presentation; whether the reference 3D sequences did not excite the viewers; or whether the subjects simply preferred the 2D presentation even in the 3D context.
Fig. 14.9 Histogram of pairwise correlations between observers in each experiment, comparing the quality and comfort scales
in the same way. For comparison, the inter-observer correlation for the established quality scale is calculated. Histograms of the correlation achieved between any two observers, for the three tests and the two scales, are presented in Fig. 14.9; for example, in Exp. 2a, 133 pairs of observers reached a correlation in the interval 0.9–1.0 on the QoE scale.
It can be seen that the agreement among different observers on the quality scale is much higher than on the discomfort scale, indicating that the discomfort scale is not used in the same way by all observers, even though the mean values of the observer distributions are well correlated.
Another question is whether the subjects were able to clearly distinguish between the two scales. A certain correlation may be expected, as a sequence with a low comfort level is unlikely to be voted at a high QoE level. Conversely, a low QoE level does not necessarily induce a low comfort level, so a scatter plot may show a triangular behavior [31]. This effect was not observed in this study. In Fig. 14.10, the correlation between the two scales for each observer in the three experiments shows that for some observers the distinction was not clear, so the correlation is high, while others voted completely independently on the two scales. For example, three observers in Exp. 2a had a very high correlation, larger than 0.9, between the QoE and the visual comfort scale, indicating that they did not distinguish clearly between the two scales.
The presented analysis does not show a clear preference for either asking for comfort on a direct scale, as done in Exp. 3, or asking for a comparison with the 2D case, as in Exp. 2a, b.
Fig. 14.10 Histogram of the linear correlation between each observer's votes on QoE and comfort
Fig. 14.11 Comparison of MVC, spatial-reduction, and temporal-reduction performance with AVC coding, in terms of bitrate gain at the same quality level (left) and quality gain at the same bitrate level (right)
reference software of H.264. In terms of perceived quality, pausing the video was considered worst. Considering visual discomfort, pausing was the best solution, followed by switching to 2D, while using an error concealment algorithm on the impacted view led to the highest visual discomfort.
Overall, packet loss was noted to have a very strong influence on the perceived quality. An additional analysis is therefore provided in Fig. 14.12, which illustrates a bitrate redundancy factor for all packet-loss scenarios. This factor is introduced in a similar way as the bitrate gain in Fig. 14.11, but in this case the H.264 coding performance with the error concealment method under all packet-loss conditions is compared: for each packet-loss scenario, the resulting MOS is matched to a similar quality in the pure coding scenario. As expected, the bitrate required to transmit an error-free video at the same perceived quality is much lower. Figure 14.12 shows the factor obtained by dividing the bitrate used in the packet-loss case by the bitrate of a coding-only condition of similar quality. The higher the factor, the stronger the
impact of packet loss. In other words, in an error-prone transmission environment with a fixed bandwidth limit, the tested scenario corresponds to transmitting a high-quality encoded video and performing error concealment at the decoder side. Alternatively, it may be of interest to use error protection and correction methods, such as automatic repeat request (ARQ) or forward error correction (FEC), while transmitting the video content at a lower bitrate.
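The bitrate redundancy factor described above can be approximated from the measured rate-MOS points, for example as in the sketch below; the linear interpolation of the coding-only rate-quality curve is an assumption made here for illustration and is not necessarily the exact procedure used to produce Fig. 14.12:

import numpy as np

def bitrate_redundancy_factor(loss_rate_mbps, loss_mos, coding_rates_mbps, coding_mos):
    """Bitrate used in a packet-loss scenario divided by the bitrate at which the
    error-free, coding-only chain reaches the same MOS.

    coding_rates_mbps, coding_mos: measured operating points of the coding-only scenario.
    """
    # Sort the coding-only points by MOS so that they can be interpolated on.
    order = np.argsort(coding_mos)
    mos_sorted = np.asarray(coding_mos, dtype=float)[order]
    rates_sorted = np.asarray(coding_rates_mbps, dtype=float)[order]
    # Bitrate of the coding-only chain that yields the same perceived quality.
    equivalent_rate = np.interp(loss_mos, mos_sorted, rates_sorted)
    return loss_rate_mbps / equivalent_rate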
14.6 Conclusions
The purpose of this chapter was to present the status of 3D QoE assessment in the context of 3D-TV. It appears that new objective methods need to be developed that can capture the multidimensional nature of 3D video QoE. The main initial effort should be focused on revisiting subjective experiment protocols for 3D-TV. Even if there are urgent needs in the 3D-TV industry, it is mandatory to be very cautious when manipulating and using objective and subjective quality assessment tools. A wise approach is to limit the scope of the studies (e.g., coding conditions, display rendering, etc.) while developing a deep understanding of the underlying mechanisms. Following this principle, the two case studies reported in this chapter reached relevant results, overcoming current limitations of subjective assessment protocols.
The first case study made it possible to define a good model of the perception of the crosstalk effect after conducting ad hoc psychophysical experiments. This model represents a comprehensive way to measure one of the components of the overall QoE of 3D-TV services, taking the visualization conditions into account. In the second case study, methodological aspects were further addressed. The ACR methodology was evaluated for use in 3D subjective experiments using a QoE scale and a visual discomfort scale. For the validation of the methodology, a critical analysis was presented showing that the QoE scale seems to provide reliable results. The visual discomfort scale demonstrated some weaknesses, which may be counteracted by a larger number of observers in subjective experiments. The evaluation itself revealed some interesting preliminary conclusions for industry: at lower coding qualities, visual discomfort increased; spatial downsampling is advantageous on the transmission channel in a rate-distortion sense (except for very high quality content with fine details); and when transmission errors occur, the error concealment strategies commonly used in 2D may cause visual discomfort when applied to one channel of a 3D video. Switching to 2D when only one channel is affected was found to provide superior quality and comfort.
DIBR, as part of the 3D-TV context, is not an exception to this rule of caution. As for the two reported case studies, it requires a fine understanding in order to use an ad hoc methodology for subjective quality assessment and ad hoc objective quality metrics. Chapter 15 is fully dedicated to this context.
Acknowledgments This work was partially supported by the COST IC1003 European Network on QoE in Multimedia Systems and Services, QUALINET (http://www.qualinet.eu/).
References
1. Chikkerur S, Vijay S, Reisslein M, Karam LJ (2011) Objective video quality assessment methods: a classification, review, and performance comparison. IEEE Trans Broadcast 57(2):165–182
2. You J, Reiter U, Hannuksela MM, Gabbouj M, Perkis A (2010) Perceptual-based objective quality metrics for audio-visual services: a survey. Signal Process Image Commun 25(7):482–501
3. Huynh-Thu Q, Le Callet P, Barkowsky M (2010) Video quality assessment: from 2D to 3D challenges and future trends. In: IEEE ICIP, pp 4025–4028, Sept 2010
4. Mendiburu B (2009) 3D movie making: stereoscopic digital cinema from script to screen. Focal Press, Burlington
5. Zilly F, Muller M, Eisert P, Kauff P (2010) The stereoscopic analyzer: an image-based assistance tool for stereo shooting and 3D production. In: IEEE ICIP, pp 4029–4032, Sept 2010
6. Pastoor S (1991) 3D-television: a survey of recent research results on subjective requirements. Signal Process Image Commun 4(1):21–32
7. Vetro A, Wiegand T, Sullivan G (2011) Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard. Proc IEEE 99:626–642
8. Yamagishi K, Karam L, Okamoto J, Hayashi T (2011) Subjective characteristics for stereoscopic high definition video. In: IEEE QoMEX, Sept 2011
9. Kim D, Min D, Oh J, Jeon S, Sohn K (2009) Depth map quality metric for three-dimensional video. Image 5(6):7
10. Campisi P, Le Callet P, Marini E (2007) Stereoscopic images quality assessment. In: Proceedings of 15th European signal processing conference (EUSIPCO)
11. Benoit A, Le Callet P, Campisi P, Cousseau R (2008) Quality assessment of stereoscopic images. EURASIP J Image Video Process 2008:13
12. Gorley P, Holliman N (2008) Stereoscopic image quality metrics and compression. In: Proceedings of SPIE, vol 6803
13. Lu F, Wang H, Ji X, Er G (2009) Quality assessment of 3D asymmetric view coding using spatial frequency dominance model. In: IEEE 3DTV conference, pp 1–4
14. Sazzad Z, Yamanaka S, Kawayokeita Y, Horita Y (2009) Stereoscopic image quality prediction. In: IEEE QoMEX, pp 180–185
15. Sazzad Z, Yamanaka S, Horita Y (2010) Spatio-temporal segmentation based continuous no-reference stereoscopic video quality prediction. In: International workshop on quality of multimedia experience, pp 106–111
16. Ekmekcioglu E, Worrall S, De Silva D, Fernando W, Kondoz A (2010) Depth based perceptual quality assessment for synthesized camera viewpoints. In: Second international conference on user centric media, Sept 2010
17. Boev A, Hollosi D, Gotchev A, Egiazarian K (2009) Classification and simulation of stereoscopic artifacts in mobile 3DTV content. In: Proceedings of SPIE, pp 72371F-12
18. Woods A, Docherty T, Koch R (1993) Image distortions in stereoscopic video systems. In: Proceedings of SPIE, San Jose, pp 36–48
19. Meesters LMJ, Ijsselsteijn WA, Seuntiens PJH (2004) A survey of perceptual evaluations and requirements of three-dimensional TV. IEEE Trans Circuits Syst Video Technol 14(3):381–391
20. Seuntiens PJH, Meesters LMJ, IJsselsteijn WA (2005) Perceptual attributes of crosstalk in 3D images. Displays 26(4–5):177–183
21. Pastoor S (1995) Human factors of 3D images: results of recent research at Heinrich-Hertz-Institut Berlin. In: International display workshop, vol 3, pp 69–72
Chapter 15
Abstract Depth-image-based rendering (DIBR) is fundamental to 3D-TV applications because new viewpoints constantly have to be generated. Like any tool, DIBR methods need to be evaluated, through the assessment of the visual quality of the generated views. This assessment task is peculiar because DIBR can be used for different 3D-TV applications: either in a 2D context (Free Viewpoint Television, FTV) or in a 3D context (3D displays reproducing stereoscopic vision). Depending on the context, the factors affecting the visual experience may differ. This chapter concerns the use of DIBR in the 2D context. It addresses two particular use cases: the visualization of still images and the visualization of video sequences, in FTV in the 2D context. Through these two cases, the main issues of DIBR are presented in terms of visual quality assessment. Two experiments are proposed as case studies addressing the problems considered in this chapter: the first concerns the assessment of still images and the second the assessment of video sequences. The two experiments question the reliability of the usual subjective and objective tools when assessing the visual quality of synthesized views in a 2D context.
Keywords 3D-TV · Absolute categorical rating (ACR) · Blurry artifact · Correlation · DIBR · Distortion · Human visual system · ITU-R BT.500 · Objective metric · Paired comparisons · PSNR · Shifting effect · Subjective assessment · SSIM · Subjective test method · Synthesized view · UQI · Visual quality · VQM
15.1 Introduction
3D-TV technology has brought out new challenges, such as the question of synthesized view evaluation. The success of the two main applications referred to as 3D video, namely 3D television (3D-TV), which adds depth to the scene, and free viewpoint video (FVV), which enables interactive navigation inside the scene [1], relies on their ability to provide added value (depth or immersion) coupled with high-quality visual content. Depth-image-based rendering (DIBR) algorithms are used for virtual view generation, which is required in both applications. From depth and color data, novel views are synthesized with DIBR. This process induces new types of artifacts, so its impact on quality has to be identified for the various contexts of use. While many efforts have been dedicated to visual quality assessment over the last twenty years, some issues remain unsolved in the context of 3D-TV. In particular, DIBR opens new challenges because it mainly deals with geometric distortions, which have barely been addressed so far.
Virtual views synthesized either from decoded and distorted data, or from original data, need to be assessed. The best assessment tool remains human judgment, as long as the right protocol is used. Subjective quality assessment is still delicate when addressing new types of conditions, because one has to define the best way to obtain reliable data. Tests are time-consuming, so precise guidelines on how to conduct such experiments are needed to save time and limit the required number of observers. Since DIBR introduces new parameters, the right protocol for assessing visual quality with observers is still an open question. The adequate assessment protocol may vary according to the objective that researchers target (impact of compression, comparison of DIBR techniques, etc.).
Objective metrics are meant to predict human judgment, and their reliability is based on their correlation with subjective assessment results. As the way to conduct subjective quality assessment protocols is already questionable in a DIBR context, the correlation of objective quality metrics with subjective results (which is what validates the reliability of the objective metrics) in the same context is also questionable. Yet working groups partially base their future specifications concerning new strategies for 3D video on the outcome of objective metrics. Considering that the test conditions may rely on the usual subjective and objective protocols (because of their availability), wrong choices could result in a poor quality of experience for users. Therefore, new tests should be carried out to
Fig. 15.1 Shifting/resizing artifacts. Left: original frame. Right: synthesized frame. The shape of the leaves is slightly modified (thinner or larger), and the vase is also shifted
Fig. 15.2 Blurry artifacts (Book Arrival). a Original frame. b Synthesized frame
Fig. 15.3 Shifting effect from depth data compression results in distorted synthesized views (Breakdancers). a Original depth frame (top) and original color frame (bottom). b Distorted depth frame (top) and synthesized view (bottom)
Fig. 15.4 Crumbling effect in the depth data leads to distortions in the synthesized views. a Original depth frame (top) and original color frame (bottom). b Distorted depth frame (top) and synthesized frame (bottom)
were initially created to address specific, well-known distortions, they may be unsuitable for the problem of DIBR evaluation. This will be discussed in Sect. 15.3.
Another limitation of the usual objective metrics concerns the need for no-reference quality metrics. In particular use cases, such as FTV, references are unavailable because the generated viewpoint is virtual. In other words, there is no ground truth allowing a full comparison with the distorted view.
pause button; the case of 3D display advertising is also imaginable. These very likely cases are interesting since the image can be subject to meticulous observation. The second study addresses the evaluation of video sequences.
The two studies question the reliability of subjective and objective assessment methods when evaluating the quality of synthesized views. Most of the metrics proposed for assessing 3D media are based on 2D quality metrics. Previous studies
[9–11] already considered the reliability of the usual objective metrics. In [12], You et al. studied the assessment of stereoscopic images in stereoscopic conditions with usual 2D image quality metrics, but the distorted pairs did not include any DIBR-related artifacts. In such studies, the experimental protocols often involve depth and/or color compression, different 3D displays, and different 3D representations (2D + Z, stereoscopic video, MVD, etc.). In these cases, the quality scores obtained from subjective assessments are compared to the quality scores obtained through objective measurements, in order to find a correlation and validate the objective metric. The experimental protocols often assess both compression distortion and synthesis distortion at the same time, without distinction. This is problematic because there may be a combination of artifacts from various sources (compression and synthesis) whose effects are neither understood nor assessed separately. The studies presented in this chapter concern only views synthesized using DIBR methods from uncompressed image and depth data, observed in 2D conditions.
The rest of this section presents the experimental material, the subjective methodologies, and the objective quality metrics used in the studies.
Subjective test methods and the recommendations that define them:

Method    Reference
DSIS      [18]
DSQS      [18]
SSNCS     [18]
SSCQE     [18]
SDSCE     [18]
ACR       [20]
ACR-HR    [20]
DCR       [20]
PC        [20]
SAMVIQ    [20]

Five-grade ACR quality scale:

5  Excellent
4  Good
3  Fair
2  Poor
1  Bad
ACR-HR requires many observers to minimize contextual effects (previously presented stimuli influence the observers' opinion, i.e., the presentation order influences the opinion ratings). Accuracy increases with the number of participants.
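As a reminder of how differential scores are usually obtained with ACR-HR (following the common ITU-T P.910 formulation, which is assumed here rather than quoted from this chapter), the per-subject differential score subtracts the vote given to the hidden reference before averaging:

def acr_hr_dmos(votes_pvs, votes_ref):
    """DMOS of a processed sequence under ACR-HR on a 5-point scale.

    votes_pvs: per-subject votes for the processed video sequence.
    votes_ref: the same subjects' votes for the corresponding hidden reference.
    The per-subject differential score is DV = V(PVS) - V(REF) + 5; values above 5
    mean the processed sequence was rated better than its reference.
    """
    diffs = [v_pvs - v_ref + 5 for v_pvs, v_ref in zip(votes_pvs, votes_ref)]
    return sum(diffs) / len(diffs)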
[Flattened table: objective quality metrics considered in the studies, grouped into categories such as structure-based and HVS-based (PSNR, UQI, IFC, VQM, PVQM, SSIM, MSSIM, V-SSIM, MOVIE, PSNR-HVS, PSNR-HVS-M, VSNR, WSNR, VIF, MPQM), with a column indicating which of them were tested in the experiments; the exact row/column alignment could not be recovered from the extracted text.]
Thus, even if an error is not perceptible, it contributes to a decrease of the quality score. Studies (such as [28]) showed that, in the case of synthesized views, PSNR is not reliable, especially when comparing two images with low PSNR scores. PSNR also cannot be used to compare very different scenarios, as explained in [29].
UQI [30] is a perception-oriented metric. The quality score is the product of three terms: the correlation between the original and the degraded image, a term quantifying the luminance distortion, and a term quantifying the contrast distortion. The quality score is computed within a sliding window, and the final score is defined as the average of all local scores.
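A single-window version of this index can be written directly from that description (the full metric in [30] averages the same quantity over sliding windows); the sketch below is an illustrative global computation, not the reference implementation:

import numpy as np

def uqi_global(original, degraded):
    """Universal Quality Index computed over the whole image (no sliding window)."""
    x = np.asarray(original, dtype=np.float64).ravel()
    y = np.asarray(degraded, dtype=np.float64).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    correlation = cov / np.sqrt(vx * vy)              # structural correlation term
    luminance = 2.0 * mx * my / (mx ** 2 + my ** 2)   # luminance distortion term
    contrast = 2.0 * np.sqrt(vx * vy) / (vx + vy)     # contrast distortion term
    return correlation * luminance * contrast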
IFC [31] uses a distortion model to evaluate the information shared between the reference image and the degraded image. IFC indicates image fidelity rather than distortion. It is based on the hypothesis that, given a source channel and a distortion channel, an image is made of multiple independently distorted subbands. The quality score is the sum of the mutual information between the source and the distorted image over all the subbands.
VQM was proposed by Pinson and Wolf in [32]. It is an RR video metric that measures the perceptual effects of numerous video distortions. It includes a calibration step (to correct spatial/temporal shifts, contrast, and brightness with respect to the reference video sequence) and an analysis of perceptual features. The VQM score combines all the computed perceptual parameters. The VQM method is complex, but its correlation with subjective scores is good according to [33]. The method has been validated in video display conditions.
The Perceptual Video Quality Measure (PVQM) [34] is meant to detect perceptible distortions in video sequences. Various indicators are used. First, an edge-based indicator allows the detection of distorted edges in the images. Second, a motion-based indicator analyzes two successive frames. Third, a color-based indicator detects non-saturated colors. Each indicator is pooled separately across the video and incorporated in a weighting function to obtain the final score. As this method was not available, it was not tested in our experiments.
weighted function of the frames' SSIM scores (based on motion). The choice of the weights relies on the assumption that dark regions are less salient. However, this is questionable because the relative luminance may depend on the screen used.
MOVIE [37] is an FR video metric that involves several steps before computing the quality score. It includes the decomposition of both the reference and the distorted video using a multi-scale spatio-temporal Gabor filter bank. An SSIM-like method is used for spatial quality analysis, and an optical flow calculation is used for motion analysis. Spatial and temporal quality indicators determine the final score.
IFC has been improved by the introduction of an HVS model; the resulting method is called VIF [45]. VIFP is a pixel-based version of VIF. It uses a wavelet decomposition and computes the parameters of the distortion models, which increases the computational complexity. In [45], five distortion types are used to validate the performance of the method (JPEG- and JPEG 2000-related distortions, white and Gaussian noise over the entire image), which are quite different from DIBR-related artifacts.
MPQM [46] uses an HVS model. In particular, it takes into account the masking phenomenon and contrast sensitivity. It has a high complexity, and its correlation with subjective scores varies according to [33]. Since the method is not available, it is not tested in our experiments.
Only a few commonly used algorithms (in the 2D context) have been described above. Since they are all dedicated to 2D applications, they are optimized to detect and penalize the specific distortions of 2D image and video compression methods. As explained in Sect. 15.2, distortions related to DIBR are very different from known 2D artifacts. Many other algorithms for visual quality assessment exist that are not covered here.
Fig. 15.8 Experimental protocol for the fixed-image experiment and for the video experiment
[Flattened table: summary of the two experiments (stimuli, number of participants, subjective methods, objective measures). Recoverable entries include: synthesized video sequences for Experiment 2, 32 participants, the ACR-HR and PC methods, all available metrics of the MetriX MuX package, and VQM, V-SSIM and the still-image metrics; the exact row/column mapping could not be recovered from the extracted text.]
Table 15.5 (ACR-HR DMOS scores, PC scaled scores, and the corresponding rank orders for algorithms A1-A7)

                 A1      A2       A3      A4      A5      A6      A7
ACR-HR (DMOS)    3.32    2.278    3.539   3.386   3.145   3.40    3.496
Rank order       5       7        1       4       6       3       2
PC (scaled)      0.454   -2.055   1.038   0.508   0.207   0.531   0.936
Rank order       5       7        1       4       6       3       2
For the ACR-HR test, the first line gives the DMOS scores obtained from the MOS scores. For the PC test, the first line gives the hypothetical MOS scores obtained through the comparisons. For both tests, the second line gives the ranking of the algorithms obtained from the first line.
In Table 15.5, although the algorithms can be ranked from the scaled scores, there is no information concerning the statistical significance of the quality difference between two stimuli (one preferred to the other). Therefore, statistical analyses were conducted on the subjective measurements: a Student's t-test was performed over the ACR-HR scores and over the PC scores for each algorithm. This provides knowledge of the statistical equivalence of the algorithms. Tables 15.6 and 15.7 show the results of the statistical tests over the ACR-HR and PC values, respectively. In both tables, the number in parentheses indicates the minimum number of observers required for a statistical distinction (VQEG recommends 24 participants as a minimum [48]; values in bold are higher than 24 in the table).
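A minimal sketch of such a pairwise analysis is shown below: it applies an independent two-sample Student's t-test to the opinion scores of two algorithms and estimates the smallest number of observers for which the distinction becomes significant. Looping over growing subsets of observers is one plausible way to obtain that number and is an assumption here, not necessarily the exact procedure used in the study:

import numpy as np
from scipy.stats import ttest_ind

def min_observers_to_distinguish(scores_a, scores_b, alpha=0.05):
    """Smallest n such that the first n observers' scores for algorithms A and B
    differ significantly; returns None if no prefix of the data reaches significance."""
    scores_a = np.asarray(scores_a, dtype=np.float64)
    scores_b = np.asarray(scores_b, dtype=np.float64)
    for n in range(2, min(len(scores_a), len(scores_b)) + 1):
        _, p_value = ttest_ind(scores_a[:n], scores_b[:n])
        if p_value < alpha:
            return n
    return None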
A first analysis of these two tables indicates that the PC method leads to clear-cut decisions compared to the ACR-HR method: indeed, the distributions of the algorithms are statistically distinguished with fewer than 24 participants in 17 cases with PC (only 11 cases with ACR-HR). In one case (between A2 and A5), fewer than 24 participants are required with PC, whereas more than 43 participants are required to establish the statistical difference with ACR-HR. The latter case can be explained by the fact that the visual quality of the synthesized images may be perceived as very similar by non-expert observers. That is to say, the distortions, though different from one algorithm to another, are difficult to assess. The absolute rating task is more delicate for observers than the comparison task. These results indicate that it seems more difficult to assess the quality of synthesized views than to assess the quality of degraded images in other contexts (for instance, quality assessment of images distorted through compression).
assessment of images distorted through compression).

[Tables 15.6 and 15.7: results of the Student's t-tests over the ACR-HR and PC scores respectively. Each cell indicates whether the algorithm of the row is statistically superior (↑), inferior (↓), or equivalent (○) to the algorithm of the column, with the minimum number of observers required for the distinction given in parentheses (from fewer than 24 up to more than 43).]

The results with the ACR-HR method, in Table 15.6, confirm this observation: in most of the cases, more
than 24 participants (or even more than 43) are required to distinguish the classes
(Note that A7 is the synthesis with holes around the disoccluded areas).
However, as seen with rankings results above, methodologies give consistent
results: when the distinction between algorithms is clear, the ranking is the same
with either methodology.
Finally, these experiments show that fewer participants are required for a PC
test than for an ACR-HR test. However, as stated before, PC tests, while efficient,
are feasible only with a limited number of items to be compared. Another problem,
pointed out by these experiments, concerns the assessment of similar items: with
both methods, 43 participants were not always sufficient to obtain a clear and
reliable decision. Results suggest that observers had difficulties assessing the
different types of artifacts.
As a conclusion, this first analysis reveals that more than 24 participants may be
necessary for still image quality assessment.
Regarding the evaluation of the PC and ACR-HR methods, PC gives clear-cut decisions, due to its mode of assessment (preference), while the statistical distinction of the algorithms with ACR-HR is slightly less accurate. With ACR-HR, the task is not easy for the observers because the impairments among the tested images are small, though each DIBR algorithm induces specific artifacts. Thus, this aspect should be taken into account when evaluating the performances of different DIBR algorithms with this methodology.

[Table 15.8 Correlation coefficients between objective and subjective scores in percentage: matrix of values over PSNR, SSIM, MSSIM, VSNR, VIF, VIFP, UQI, IFC, NQM, WSNR, PSNR HVSM, and PSNR HVS.]

[Further table fragment: correlation coefficients, in percent, for the still image experiment; all values lie below 50 %.]
However, ACR-HR and PC are complementary: when assessing similar items, as in this case study, PC can provide a ranking, while ACR-HR gives the overall perceptual quality of the items.
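For readers who want to reproduce the PC scaling, the sketch below converts a matrix of pairwise preference counts into interval-scale values with a Thurstone Case V style conversion. This is one standard way of obtaining the "hypothetical MOS" values mentioned above, not necessarily the exact procedure used in these experiments, and the preference counts are invented.

import numpy as np
from scipy.stats import norm

def thurstone_scale(wins):
    # Thurstone Case V scaling of a paired-comparison experiment.
    # wins[i, j] = number of observers preferring stimulus i over stimulus j.
    # Returns one scale value per stimulus (larger = preferred).
    totals = wins + wins.T                              # trials per pair
    with np.errstate(divide="ignore", invalid="ignore"):
        p = np.where(totals > 0, wins / totals, 0.5)    # preference proportions
    p = np.clip(p, 0.01, 0.99)                          # avoid infinite z-scores
    z = norm.ppf(p)                                     # probit transform
    np.fill_diagonal(z, 0.0)
    return z.mean(axis=1)                               # average z-score per stimulus

# Illustrative counts for three synthesis algorithms, 30 observers per pair.
wins = np.array([[0, 22, 25],
                 [8,  0, 18],
                 [5, 12,  0]])
print(thurstone_scale(wins))  # e.g. [ 0.53 -0.12 -0.41 ]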
[Table 15.10: scores and rank orders of the seven synthesis algorithms in the still image experiment, according to ACR-HR, PC, and each objective metric (PSNR, SSIM, MSSIM, VSNR, VIF, VIFP, UQI, IFC, NQM, WSNR, PSNR HVSM, PSNR HVS).]
[Table 15.11: ACR-HR scores and rank orders of the seven algorithms in the video experiment.]

[Table 15.12: results of the Student's t-test over the video ACR-HR scores. Each cell indicates whether the algorithm of the row is statistically superior (↑), inferior (↓), or equivalent (○) to the algorithm of the column, with the minimum number of observers required for the distinction given in parentheses.]
comparing the ranks assigned to the algorithms. Table 15.10 presents the rankings of the algorithms obtained from the objective scores. Rankings from the subjective scores are given for comparison. They show a noticeable difference concerning the ranking of A1: ranked as the best algorithm out of the seven by the subjective scores, it is ranked as the worst by the whole set of objective metrics. Another comment refers to the assessment of A6: often regarded as the best algorithm, it is ranked as one of the worst by the subjective tests. The ensuing assumption is that objective metrics detect and penalize non-annoying artifacts.
As a conclusion, none of the tested metrics reaches a 50 % correlation with human judgment. The tested metrics have the same response when assessing DIBR-related artifacts. Given the inconsistencies with the subjective assessments, it is assumed that objective metrics detect and penalize non-annoying artifacts.
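One simple way to quantify such rank disagreement is to derive the rank orders from the scores and compute their Spearman correlation; the sketch below does this with invented scores and is only meant to illustrate the comparison, not to reproduce the study's numbers.

import numpy as np
from scipy.stats import rankdata, spearmanr

# Illustrative per-algorithm scores (A1..A7): subjective DMOS vs. one objective metric.
subjective = np.array([3.53, 3.24, 2.97, 2.96, 2.87, 2.79, 2.10])
objective = np.array([18.8, 26.0, 26.2, 26.1, 25.0, 23.2, 20.3])

# Rank orders (1 = best); both scales are taken as "higher is better" here.
print("subjective ranks:", rankdata(-subjective))
print("objective ranks: ", rankdata(-objective))

# Spearman correlation quantifies how well the two rankings agree.
rho, p = spearmanr(subjective, objective)
print("Spearman rho = %.2f" % rho)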
[Table 15.13: scores and rank orders of the seven synthesis algorithms in the video experiment, according to ACR-HR and each objective metric (PSNR, SSIM, MSSIM, VSNR, VIF, VIFP, UQI, IFC, NQM, WSNR, PSNR HVSM, PSNR HVS, VSSIM, VQM).]
[Table 15.14 Correlation coefficients between objective and subjective scores in percentage, for the video experiment (ACR-HR against PSNR, SSIM, MSSIM, VSNR, VIF, VIFP, UQI, IFC, NQM, WSNR, PSNR HVSM, PSNR HVS, VSSIM, VQM); all values are below 50 %, the highest being 47.3 % for VSNR.]
Fig. 15.10 Ranking of used metrics according to their correlation to human judgment
Although the values allow the ranking of the algorithms, they do not directly provide knowledge on the statistical equivalence of the results. Table 15.12 depicts the results of the Student's t-test performed on these values. Compared to the ACR-HR test with still images detailed in Sect. 15.4.1, distinctions between algorithms seem more obvious. The statistical significance of the difference between the algorithms, based on the ACR-HR scores, seems clearer for video sequences than for still images. This can be explained by the viewing time of the video sequences: watching the whole video, observers can refine their judgment, contrary to still images. Note that the same algorithms were not statistically differentiated: A4, A3, A5, and A6.
As a conclusion, though more than 32 participants are required to perform all the distinctions in the tested conditions, the ACR-HR test with video sequences gives clearer statistical differences between the algorithms than the ACR-HR test with still images. This suggests that additional elements, such as the presence of flickering and the longer viewing time, help the observers make a decision.
A1 performs the synthesis on a cropped image, and then enlarges it to reach the
original size. Consequently, signal-based metrics penalize it though it gives good
perceptual results.
Table 15.14 presents the correlation coefficients between the objective scores and the subjective scores, based on the whole set of measured points. None of the tested objective metrics reaches a 50 % correlation with the subjective scores. The metric obtaining the highest correlation coefficient is VSNR, with 47.3 %. Figure 15.10 shows the ranking of the metrics according to the correlation scores of Table 15.14. It can be observed that the top metrics are perception-oriented metrics (they include psychophysical approaches).
To conclude, the performances of the objective metrics, with respect to subjective scores, differ between video sequences and still images. Correlation coefficients between objective and subjective scores were higher in the case of video sequences, when comparing Table 15.14 with Table 15.9. However, human opinion also differed in the case of video sequences. For video sequences, perception-oriented metrics were the most correlated with the subjective scores. However, in either context, none of the tested metrics reached a 50 % statistical correlation with human judgment.
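The correlation figures above are plain Pearson coefficients computed between objective scores and subjective (DMOS) scores over the measured points. A minimal sketch of the computation is given below, with invented vectors standing in for the real measurements; VQEG practice sometimes also fits a logistic function to the objective scores first, and whether that was done here is not stated above.

import numpy as np
from scipy.stats import pearsonr

# Illustrative per-sequence scores: one objective metric vs. subjective DMOS.
objective_scores = np.array([26.1, 26.2, 25.0, 23.2, 20.3, 18.8, 26.0])
subjective_dmos = np.array([3.5, 3.4, 3.4, 3.1, 2.3, 3.5, 3.3])

r, p = pearsonr(objective_scores, subjective_dmos)
print("Pearson correlation: %.1f %%  (p = %.3f)" % (100 * r, p))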
but none of them target views synthesized from DIBR in 2D viewing conditions.
Therefore they do not apply to the issue targeted in this chapter.
Most of the proposed metrics for assessing 3D media are inspired by 2D quality metrics. It should be noted that the experimental protocols validating the proposed metrics often involve depth and/or color compression, different 3D displays, and different 3D representations (2D + Z, stereoscopic video, MVD, etc.). These protocols often assess both compression distortion and synthesis distortion at the same time, without distinction. This is problematic because there may be a combination of artifacts from various sources (compression and synthesis) whose effects are not clearly separated and assessed.
In the following, we present the new trends in objective metrics for 3D media assessment, distinguishing whether or not they make use of depth data in the quality score computation.
color views suffer two levels of quantization distortion; depth data suffer four different types of distortion (quantization, low-pass filtering, border shifting, and artificial local spot errors in certain regions). The study [56] shows that the proposed method improves the correlation of PSNR and SSIM with subjective scores.
Yasakethu et al. [57] proposed an adapted VQM for measuring 3D video quality. It combines 2D color information quality and depth information quality. Depth quality measurement includes an analysis of the depth planes. The final depth quality measure combines (1) the measure of distortion of the relative distance within each depth plane, (2) the measure of the consistency of each depth plane, and (3) the structural error of the depth. The color quality is based on the VQM score. In [57], the metric is evaluated on left and right views (rendered from 2D + Z encoded data) and compared to subjective scores obtained using an auto-stereoscopic display. Results show a higher correlation than plain VQM.
Solh et al. [58] introduced the 3D Video Quality Measure (3VQM) to predict the quality of views synthesized by DIBR algorithms. The method analyses the quality of the depth map against an ideal depth map. Three different analyses lead to three distortion measures: spatial outliers, temporal outliers, and temporal inconsistencies. These measures are combined to provide the final quality score. To validate the method, subjective tests were run in stereoscopic conditions. The stereoscopic pairs included views synthesized from compressed depth maps and color video, from depth obtained by stereo matching, and from depth obtained by 2D-to-3D conversion. The results showed accurate and consistent scores compared to the subjective assessments.
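As a rough illustration of the kind of depth analysis performed in [58], the sketch below computes three simple indicators from an ideal and a distorted depth video: the spatial spread of the depth error, its frame-to-frame variation, and the temporal inconsistency of the distorted depth. These definitions, and the absence of the final combination step, are simplifying assumptions of ours, not the published 3VQM formulation.

import numpy as np

def depth_distortion_indicators(ideal_depth, distorted_depth):
    # Toy indicators in the spirit of 3VQM [58] (simplified assumptions).
    # Both inputs are (frames, height, width) depth videos.
    error = ideal_depth.astype(float) - distorted_depth.astype(float)

    # Spatial outliers: per-frame spread of the depth error.
    spatial = np.std(error, axis=(1, 2)).mean()

    # Temporal outliers: spread of the frame-to-frame change of the error.
    temporal = np.std(np.diff(error, axis=0), axis=(1, 2)).mean()

    # Temporal inconsistency: frame-to-frame variation of the distorted depth.
    inconsistency = np.std(np.diff(distorted_depth.astype(float), axis=0),
                           axis=(1, 2)).mean()

    return spatial, temporal, inconsistency

# Illustrative random depth videos (8 frames of 64x64), distorted by noise.
rng = np.random.default_rng(0)
ideal = rng.integers(0, 255, size=(8, 64, 64))
distorted = ideal + rng.normal(0, 4, size=ideal.shape)
print(depth_distortion_indicators(ideal, distorted))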
As a conclusion, subjective and objective methods tend to take the added value of depth more and more into account. This makes the evaluation of depth an additional feature to assess, just like image quality. The proposed objective metrics still rely partially on the usual 2D methods, but they tend to include more tools for analyzing the depth component. These tools focus either on the depth structure or on the depth accuracy. Temporal consistency is also taken into account.
15.7 Conclusion
This chapter proposed a reflection on both subjective quality assessment protocols and the reliability of objective quality assessment methods in the context of DIBR-based media.
Typical distortions related to DIBR were introduced. They are geometric distortions, mainly located around the disoccluded areas. When compression-related distortions and synthesis-related distortions are combined, the errors are generally scattered over the whole image, increasing the perceptible annoyance.
Two case studies were presented, answering two questions relating, first, to the suitability of two efficient subjective protocols (designed for 2D) and, second, to the reliability of commonly used objective metrics. The experiments applied methods commonly used for assessing conventional images, either subjectively or objectively, to DIBR-based synthesized images obtained from seven different algorithms.
References
1. Smolic A et al (2006) 3D video and free viewpoint video – technologies, applications and MPEG standards. In: Proceedings of the IEEE international conference on multimedia and expo (ICME'06), pp 2161–2164
2. Fehn C (2004) Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV. In: Proceedings of SPIE stereoscopic displays and virtual reality systems XI, vol 5291, pp 93–104
3. Bosc E et al (2011) Towards a new quality metric for 3-D synthesized view assessment. IEEE J Sel Top Signal Process 5(7):1332–1343
4. De Silva DVS et al (2010) Intra mode selection method for depth maps of 3D video based on rendering distortion modeling. IEEE Trans Consum Electron 56(4):2735–2740
31. Sheikh HR et al (2005) An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Trans Image Process 14(12):2117–2128
32. Pinson MH et al (2004) A new standardized method for objectively measuring video quality. IEEE Trans Broadcast 50(3):312–322
33. Wang Y (2006) Survey of objective video quality measurements. EMC Corporation, Hopkinton, MA, vol 1748
34. Hekstra AP et al (2002) PVQM – a perceptual video quality measure. Sign Process Image Commun 17(10):781–798
35. Wang Z et al (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
36. Wang Z et al (2004) Video quality assessment based on structural distortion measurement. Sign Process Image Commun 19(2):121–132
37. Seshadrinathan K et al (2010) Motion tuned spatio-temporal quality assessment of natural videos. IEEE Trans Image Process 19(2):335–350
38. Boev A et al (2009) Modelling of the stereoscopic HVS
39. Yang J et al (1994) Spatiotemporal separability in contrast sensitivity. Vision Res 34(19):2569–2576
40. Winkler S (2005) Digital video quality: vision models and metrics. Wiley
41. Egiazarian K et al (2006) New full-reference quality metrics based on HVS. In: CD-ROM proceedings of the second international workshop on video processing and quality metrics, Scottsdale, USA
42. Ponomarenko N et al (2007) On between-coefficient contrast masking of DCT basis functions. In: CD-ROM proceedings of the third international workshop on video processing and quality metrics, vol 4
43. Chandler DM et al (2007) VSNR: a wavelet-based visual signal-to-noise ratio for natural images. IEEE Trans Image Process 16(9):2284–2298
44. Damera-Venkata N et al (2002) Image quality assessment based on a degradation model. IEEE Trans Image Process 9(4):636–650
45. Sheikh HR et al (2006) Image information and visual quality. IEEE Trans Image Process 15(2):430–444
46. Van C et al (1996) Perceptual quality measure using a spatio-temporal model of the human visual system
47. Video Quality Research (2011) [Online]. Available at http://www.its.bldrdoc.gov/vqm/. Accessed 19 Jul 2011
48. VQEG 3DTV Group (2010) VQEG 3DTV test plan for crosstalk influences on user quality of experience, 21 Oct 2010
49. Engelke U et al (2011) Towards a framework of inter-observer analysis in multimedia quality assessment
50. Haber M et al (2006) Coefficients of agreement for fixed observers. Stat Methods Med Res 15(3):255
51. Seuntiens P (2006) Visual experience of 3D TV. Doctoral thesis, Eindhoven University of Technology
52. ITU (2000) Subjective assessment of stereoscopic television pictures. Recommendation ITU-R BT.1438
53. Chen W et al (2010) New requirements of subjective video quality assessment methodologies for 3DTV. In: Fifth international workshop on video processing and quality metrics for consumer electronics, VPQM 2010, Scottsdale, Arizona, USA
54. Joveluro P et al (2010) Perceptual video quality metric for 3D video quality assessment. In: 3DTV-conference: the true vision – capture, transmission and display of 3D video (3DTV-CON), pp 1–4
55. Zhao Y et al (2010) A perceptual metric for evaluating quality of synthesized sequences in 3DV system. In: Proceedings of SPIE, vol 7744, p 77440X
56. Ekmekcioglu E et al (2010) Depth based perceptual quality assessment for synthesized camera viewpoints. In: Proceedings of second international conference on user centric media, UCMedia 2010, Palma de Mallorca
57. Yasakethu SLP et al (2011) A compound depth and image quality metric for measuring the effects of packet loss on 3D video. In: Proceedings of 17th international conference on digital signal processing, Corfu, Greece
58. Solh M et al (2011) 3VQM: a vision-based quality measure for DIBR-based 3D videos. In: IEEE international conference on multimedia and expo (ICME) 2011, pp 1–6
Index
A
Absolute Categorical Rating
(ACR), 427–431, 449, 450
Adaptive edge-dependent lifting, 288
Alignment, 194, 197, 201, 217, 218, 235
Analysis tool, 165, 223, 244
Autostereoscopic, 69, 72, 73, 94
B
Bilateral filter, 173, 176, 177, 187, 203
Bilateral filtering, 203–205, 214,
215, 218
Binocular correspondence, 348, 352
Binocular development, 349, 367
Binocular parallax, 19, 377, 379,
380, 422
Binocular rivalry, 348, 358–360, 368
Binocular visual system, 348, 351
Bit allocation, 250, 268
Blurry artifact, 43
Boundary artifact, 158
C
Cable, 300, 304, 306, 308, 312, 313–316, 320, 326, 329, 334–336, 339
Census Transform, 76–78, 84–87
Challenge, 7, 48, 104, 347
Characteristics of depth, 240, 251, 253
Coding error, 237, 238
Computational complexity, 73, 75, 77, 78, 80,
84, 87, 92, 97, 165, 200, 268, 456
Content creation, 194
Content generation, 3, 8, 22, 24, 25
Contour of interest, 175, 177
Conversion artifact, 109, 134, 136
Correlation, 440, 448, 454, 456, 462, 465, 466,
469, 470
Correlation histogram, 239, 241, 242, 245
Cross-check, 82, 90, 145, 153
Crosstalk, 384, 392, 393, 400, 404,
419–421
Crosstalk perception, 422, 424, 425
D
Data format, 223, 244
Depth camera, 10
Depth cue, 109, 111–113, 116–118, 121, 123, 124, 132, 139, 380
Depth estimation, 39–41, 47, 48, 51, 53, 55, 62, 109, 121, 191, 192
Depth map, 3, 6–9, 10, 11, 16, 17, 24, 29, 31, 33, 39, 40, 48, 51–53, 55–57, 61–64, 66, 69, 71, 86, 89, 95–100, 102, 103
Depth map coding, 253, 257, 263,
265, 296
Depth map preprocessing
Depth perception, 19, 20, 27
Depth video compression, 277
Depth-based 3D-TV, 3, 7, 27
Depth-enhanced stereo (DES), 223, 243
Depth-image-based rendering (DIBR), 3, 4,
28, 107, 192, 223, 236
Depth-image-based rendering methods (dibr),
170
Depth-of-field, 112, 113, 385, 406
DIBR, 440, 441, 444, 448, 455–457, 461–463, 466, 468–470
Digital broadcast, 299, 300, 302, 303
Disocclusion, 115, 124, 129–131, 169–171, 173, 174, 181, 185, 187, 188
Disparity, 73–75, 77–94, 96, 102–104, 378,
380, 385
Disparity scaling, 349, 362
Display-agnostic production, 41, 44, 46, 65
Distance map, 170, 178, 179
Distortion, 441, 444, 448, 449, 453–456, 468, 469
Distortion measure, 223, 237–239
Distortion model, 237–239
Don't care region, 258, 259
DVB, 300, 302, 305, 314–316, 320–326, 332–336, 338, 339
Dynamic depth cue, 349
Dynamic programming, 80
E
Edge detector, 160, 177, 290, 292
Edge-adaptive wavelet, 264
Entropy coding, 228, 266, 285, 294
Evaluation method, 223, 233, 244
Extrapolation, 55, 59
F
Focus, 113, 139, 349
Foreground layer, 193, 197–200, 202
FPGA, 69, 70, 73, 83–85, 103, 104
Frame compatible, 306, 307, 321, 328, 329,
331, 337, 338
Free viewpoint TV, 70, 72
Fusion limit, 347, 356, 357
G
Geometric error, 251, 252, 254–256
Geometrical distortion, 271, 381, 383
Glasses, 376, 384, 390–394, 398–400, 407, 408
GPU, 83, 102, 164, 197
Grab-cut, 204, 209
Graph-based transform, 263, 264, 289
Graph-based wavelet, 289
H
H.264/AVC, 301, 308–312, 318, 327–334, 339
Haar filter bank, 285, 287
Hamming distance, 85, 86
Hardware implementation
Head-tracked display
Hole filling, 96, 97, 146, 147, 150, 161, 162,
164, 165, 187
Horopter, 348, 352–355
Human visual system, 11, 108, 111, 112, 136,
138, 139, 455
Hybrid approach, 123, 124
Hybrid camera system, 191, 201, 218
I
Image artifact, 348, 382
Image inpainting, 169, 170, 182
Integral imaging, 380, 388, 390, 394, 395, 401
Inter-view prediction, 223, 224, 230, 236, 243,
244, 309
IPTV, 427, 428
ISDB, 315, 316, 324, 329232
ITU-R, 301, 304, 315, 326, 329, 337-339, 419,
427, 423, 469
ITU-R BT.500
J
Joint coding
L
Layered depth video (LDV), 3, 7, 191, 243
LDV compliant capturing, 194
Lifting scheme, 281–284, 288, 293,
294, 296
Light field display, 390, 394, 395, 402
Linear perspective, 113, 139
M
Matching cost, 75–81, 86, 87, 89
Mean opinion score, 21, 229, 414
Monocular cue, 11, 376–378
Monocular occlusion zone, 348, 358–362, 369
Motion parallax, 112, 120, 121, 123, 124,
133, 366
MPEG-2, 308, 312, 320, 323, 324,
327–334, 339
Multiresolution wavelet
decomposition, 294
Multi-view display, 385, 396, 397, 402–404,
406, 408
Multi-view video, 170, 448, 467
Multi-view video coding (MVC), 9, 16, 224,
231, 269, 309, 334
Multi-view video plus depth (MVD), 3, 146,
170, 191, 224
N
Network requirement, 322, 325, 327,
331, 336
O
Objective metric, 440, 441, 446, 448, 449, 452,
456, 461–463, 465, 466, 468–470
Objective visual quality, 413
Occlusion layer, 191, 193, 194,
199–203, 217
P
Paired comparisons, 451
Patch matching, 185
Perceptual issue, 18, 348
Pictorial depth cue, 112
Post-production, 11, 40, 138, 191
Priority computation, 185
PSNR, 452, 455, 462, 468
Q
Quadric penalty function
Quality assessment, 21, 414, 419, 462
Quality enhancement technique, 153
Quality evaluation, 3, 4, 21, 25, 28, 244
Quality of experience (QOE), 21, 415
R
Rate-distortion optimization, 225, 228, 229,
232, 234, 244, 257
Real-time, 70, 73, 75, 76, 84, 95, 104, 117
Real-time implementation, 65, 121, 145,
155, 165
Reliability, 157, 160
Rendered view distortion, 252–254, 257, 262
S
Satellite, 300, 304, 306, 308, 312–314, 316, 320–323, 326, 329, 334, 339
Scalable video coding (SVC), 312
Scaling, 282, 283
Shape-adaptive wavelet, 289
Shifting effect, 444
Side information, 277, 278, 285, 286, 288,
291, 293, 294, 296
Size scaling, 348, 363
Smart, 161
Smearing effect, 145, 161
Smoothing filter, 170, 176
Sparse representation, 259, 260, 262
Spatial filtering, 153
Splatting, 163
SSIM, 454, 455, 462, 468
Standardization, 3, 22, 25–27, 300–302, 334,
337, 339
Stereo display, 39, 40, 44, 48, 61
Stereo matching, 52, 53, 73, 75, 78–80, 83, 85,
90, 94, 95, 98, 100, 102, 104, 191, 192,
196, 218
Stereoacuity, 347, 354, 355, 367, 368
Stereoscopic 3D (S3D), 108
Stereoscopic 3D-TV, 3, 4, 300, 319, 338
Stereoscopic display, 19, 32, 375, 379, 407
Stereoscopic perception, 375
Stereoscopic video, 41
Structural inpainting, 170, 182
Subjective assessment, 440, 441, 446, 448,
449, 457, 461, 463, 465, 467, 469
Subjective test method, 450
Subjective visual quality, 413
Support region builder, 85, 87
Surrogate depth map, 118, 119
Synthesis artifact, 145, 146, 153, 158
Synthesized view, 440, 441, 444, 446, 447, 449,
452, 453, 456–458, 462, 466–468, 470
T
Temporal consistency, 53, 54, 157,
182, 468, 469
Temporal filtering, 155, 177
Terrestrial, 300, 304, 306, 308, 312–316,
322, 324, 326, 327, 329, 334,
336, 339
Texture flickering, 145, 146
Texture inpainting, 185
Thresholding, 206–208, 265
Time-of-flight (TOF) camera, 193–196
Transform, 148, 164, 198, 258
Transport, 302, 304, 306, 308–310, 315, 318, 322, 326–328, 330, 331, 334
U
UQI, 454, 455
V
Video coding, 223–225, 228, 229, 232, 233, 238, 239, 241, 243–245
View merging, 145–147, 150, 152, 158, 164, 165
View synthesis, 13, 94, 99, 100, 145–147, 153,
165, 171, 286
Viewing zone, 385, 386, 396, 401, 406
Virtual view, 145–153, 157, 158, 160, 161,
163, 165
Visual cortex, 351, 352
Visual quality, 440, 456, 458
Volumetric display, 389, 393, 394, 400, 401
VQM, 454, 457, 469
W
Warping, 72, 96, 97, 114, 198, 199, 201, 205,
214, 215, 269
Wavelet coding, 17
Wavelet filter bank, 286, 294
Wavelet transform, 280–282, 285, 287, 288, 290, 294
Wavelet transforms, 280–282