
3D-TV System with Depth-Image-Based

Rendering

Ce Zhu Yin Zhao Lu Yu


Masayuki Tanimoto

Editors

3D-TV System with


Depth-Image-Based
Rendering
Architectures, Techniques and Challenges


Editors
Ce Zhu
School of Electronic Engineering
University of Electronic Science
and Technology of China
Chengdu
People's Republic of China

Lu Yu
Department of Information Science
and Electronic Engineering
Zhejiang University
Hangzhou
People's Republic of China
Masayuki Tanimoto
Department of Electrical Engineering
and Computer Science
Graduate School of Engineering
Nagoya University
Nagoya
Japan

Yin Zhao
Department of Information Science
and Electronic Engineering
Zhejiang University
Hangzhou
People's Republic of China

ISBN 978-1-4419-9963-4        ISBN 978-1-4419-9964-1 (eBook)
DOI 10.1007/978-1-4419-9964-1

Springer New York Heidelberg Dordrecht London


Library of Congress Control Number: 2012942254
© Springer Science+Business Media New York 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed. Exempted from this legal reservation are brief
excerpts in connection with reviews or scholarly analysis or material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the
work. Duplication of this publication or parts thereof is permitted only under the provisions of
the Copyright Law of the Publisher's location, in its current version, and permission for use must always
be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright
Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Riding on the success of 3D cinema blockbusters and the advancements in
stereoscopic display technology, 3D video applications have been gathering
momentum in recent years, which further enhance visual experience by vividly
extending the conventional flat video into a third dimension. Several 3D video
prototypes have been developed based on distinct techniques in 3D visualization,
representation, and content production. Among them, stereoscopic 3D video
systems evoke 3D perception by binocular parallax, in which the scene is presented in
a fixed perspective defined by two transmitted views, while further manipulations
on depth perception require expensive computation with current technologies.
Depth-image-based rendering (DIBR) is being considered to significantly
enhance the 3D visual experience relative to the conventional stereoscopic
systems. With DIBR techniques, it becomes possible to generate additional
viewpoints using 3D warping techniques to adjust the perceived depth of stereoscopic videos or to provide the necessary input for auto-stereoscopic displays that
do not require glasses to view the 3D scene. This functionality is also useful for
free-viewpoint video (FVV), where the viewer has the freedom to move about in
front of the display, and is able to perceive natural perspective changes as if
looking through a window. In recognition of progress being made in this area and a
strong interest from the industry to provide equipment and services supporting
such applications, MPEG is also embarking on a new phase of 3D video standardization based on DIBR techniques.
The technologies surrounding DIBR-oriented 3D video systems, however, are
not mature enough at this stage to fully fulfill the above targets. Depth maps,
which are central to the synthesis of virtual views, need to be either captured with
specialized apparatus or estimated from scene textures using stereo matching.
Existing solutions are either costly or not sufficiently robust. Besides, there is a
strong need to achieve efficient storage and robust transmission of this additional
information. Knowing that the depth maps and scene textures are different in
nature, and that synthesized views are the ultimate information for display, DIBR-oriented depth and texture coding may employ different distortion measures for
rate-distortion or rate-quality optimization, and possibly different coding principles
to make better use of available bandwidth. Since view synthesis, coupled with
errors introduced by depth generation and compression, may introduce new types
of artifacts that are different from those of conventional video acquisition and
compression systems, it is also necessary to understand the visual quality of the
views produced by DIBR techniques, which is critical to ensure a comfortable,
realistic, and immersive 3D experience.
This book focuses on this depth-based 3D-TV system which is expected to be
put into applications in the near future as a more attractive alternative to the
current stereoscopic 3D-TV system. Following an open call for chapters and a few
rounds of extensive peer review, 15 high-quality chapters have finally been
accepted, ranging from technical reviews and literature surveys of the whole
system or of particular topics, to solutions to specific technical issues and
implementations of prototypes. According to the scope of these chapters, this book
is organized into four sections, namely system overview, content generation, data
compression and transmission, and 3D visualization and quality assessment, with
the chapters in each section summarized below.
Part I (Chap. 1) provides an overview of the depth-based 3D-TV system.
Chapter 1, entitled "An overview of 3D-TV system using depth-image-based
rendering", covers key technologies involved in this depth-based 3D-TV system
using the DIBR technique, including content generation, data compression and
transmission, 3D visualization, and quality evaluation. It also compares the conventional stereoscopic 3D with the new depth-based 3D systems, and reviews
some standardization efforts for 3D-TV systems.
Part II (Chaps. 2–7) focuses on 3D video content creation, specifically
targeting depth map generation and view synthesis technologies.
As the leading chapter in the section, Chap. 2, entitled "Generic content creation for 3D displays", discusses future 3D video applications and presents a
generic display-agnostic production workflow that supports the wide range of
existing and anticipated 3D displays.
Chapter 3, entitled "Stereo matching and viewpoint synthesis FPGA implementation", introduces real-time implementations of stereo matching and view
synthesis algorithms, and describes "Stereo-In to Multiple-Viewpoint-Out"
functionality on a general FPGA-based system, demonstrating real-time, high-quality depth extraction and viewpoint synthesis as a prototype toward a future
chipset for 3D-HDTV.
Chapter 4, entitled "DIBR-based conversion from monoscopic to stereoscopic
and multi-view video", provides an overview of 2D-to-3D video conversion that
exploits depth-image based rendering (DIBR) techniques. The basic principles and
various methods for the conversion, including depth extraction strategies and
DIBR-based view synthesis approaches, are reviewed. Furthermore, the evaluation of
conversion quality and of conversion artifacts is discussed in this chapter.

Chapter 5, entitled "Virtual view synthesis and artifact reduction techniques",
presents a tutorial on the basic view synthesis framework using DIBR and various
quality enhancement approaches to suppressing synthesis artifacts. The chapter
also discusses the requirements of and solutions to real-time implementation of
view synthesis.
Chapter 6, entitled "Hole filling for view synthesis", addresses the inherent
disocclusion problem in DIBR-based systems, i.e., the newly exposed areas
appearing in synthesized novel views. The problem is tackled in two ways:
preprocessing of the depth data, and image inpainting of the synthesized view.
Chapter 7, entitled "LDV generation from multi-view hybrid image and depth
video", presents a complete production chain for the 2-layer LDV format, based on a
hybrid camera system of five color cameras and two time-of-flight cameras.
It includes real-time preview capabilities for quality control during shooting, and
post-production algorithms to generate high-quality LDV content consisting of
foreground and occlusion layers.
Part III (Chaps. 8–11) deals with the compression and transmission of 3D video
data.
Chapter 8, entitled "3D video compression", first explains the basic coding
principles of 2D video compression, followed by coding methods for multi-view
video. Next, 3D video is described in terms of video and depth formats, special
requirements, and coding and synthesis methods for supporting multi-view 3D
displays. Finally, the chapter introduces the 3D video evaluation framework.
Chapter 9, entitled "Depth map compression for depth-image-based rendering",
focuses on depth map coding. It discusses unique characteristics of depth maps,
reviews recent depth map coding techniques, and describes how texture and depth
map compression can be jointly optimized.
Chapter 10, entitled "Effects of wavelet-based depth video compression", also
concentrates on the compression of depth data. This chapter investigates the
wavelet-based compression of the depth video and the coding impact on the
quality of the view synthesis.
Chapter 11, entitled "Transmission of 3D video over broadcasting", gives a
comprehensive survey on various standards for transmitting 3D data over different
kinds of broadcasting networks, including terrestrial, cable, and satellite networks.
The chapter also addresses the important factors in the deployment stages of 3D-TV services over broadcast networks, with special emphasis on the depth-based
3D-TV system.
Part IV (Chaps. 12–15) addresses 3D perception, visualization, and quality
assessment.
Chapter 12, entitled "The psychophysics of binocular vision", reviews psychophysical research on human stereoscopic processes and their relationship to DIBR.
Topics include basic physiology, binocular correspondence and the horopter,
stereo-acuity and fusion limits, non-corresponding inputs and rivalry, dynamic
cues to depth and their interactions with disparity, and development and adaptability of the binocular system.
Chapter 13, entitled "Stereoscopic and autostereoscopic displays", first explains
the fundamentals of stereoscopic perception and some of the artifacts associated
with 3D displays. Then, a description of the basic 3D displays is given. A brief
history is followed by a state-of-the-art survey covering glasses-based displays through volumetric, light-field, multi-view, and head-tracked displays to holographic displays.
Chapter 14, entitled "Subjective and objective visual quality assessment in the
context of stereoscopic 3D-TV", discusses current challenges in relation to subjective and objective visual quality assessment for stereo-based 3D-TV (S-3DTV).
Two case studies are presented to illustrate the current state of the art and some of
the remaining challenges.
Chapter 15, entitled "Visual quality assessment of synthesized views in the
context of 3D-TV", addresses the challenges of evaluating synthesized content, and
presents two experiments, one on the assessment of still images and the other on
video sequence assessment. The two experiments question the reliability of the
usual subjective and objective tools when assessing the visual quality of synthesized views in a 2D context.
As can be seen from the above introductions, this book systematically spans a
number of important and emerging topics in the depth-based 3D-TV system. In
conclusion, we aim to provide scholars and practitioners involved in the
research and development of depth-based 3D-TV systems with an up-to-date
reference on a wide range of related topics. The target audience of this book would
be those interested in various aspects of 3D-TV using DIBR, such as data capture,
depth map generation, 3D video coding, transmission, human factors, 3D visualization, and quality assessment. This book is meant to be accessible to audiences
including researchers, developers, engineers, and innovators working in the relevant areas. It can also serve as a solid advanced-level course supplement on 3D-TV
technologies for senior undergraduates and postgraduates.
On the occasion of the completion of this edited book, we would like to thank
all the authors for contributing their high-quality work. Without their expertise
and contribution, this book would never have come to fruition. We would also like
to thank all the reviewers for their insightful and constructive comments, which
helped to improve the quality of this book. Our special thanks go to the editorial
assistants of this book, Elizabeth Dougherty and Brett Kurzman, for their
tremendous guidance and patience throughout the whole publication process. This
project was supported in part by the National Basic Research Program of China (973)
under Grant No. 2009CB320903 and the Singapore Ministry of Education Academic
Research Fund Tier 1 (AcRF Tier 1 RG7/09).

Acknowledgment

The editors would like to thank the following reviewers for their valuable
suggestions and comments, which improved the quality of the chapters.
A. Aydın Alatan, Middle East Technical University
Ghassan AlRegib, Georgia Institute of Technology
Holger Blume, Leibniz Universität Hannover
Ismael Daribo, National Institute of Informatics
W. A. C. Fernando, University of Surrey
Anatol Frick, Christian-Albrechts-University of Kiel
Thorsten Herfet, Saarland University
Yo-Sung Ho, Gwangju Institute of Science and Technology (GIST)
Peter Howarth, Loughborough University
Quan Huynh-Thu, Technicolor Research & Innovation
Peter Kauff, Fraunhofer Heinrich Hertz Institute (HHI)
Sung-Yeol Kim, The University of Tennessee at Knoxville
Reinhard Koch, Christian-Albrechts-University of Kiel
Martin Köppel, Fraunhofer Heinrich Hertz Institute (HHI)
PoLin Lai, Samsung Telecommunications America
Chao-Kang Liao, IMEC Taiwan Co.
Wen-Nung Lie, National Chung Cheng University
Yanwei Liu, Chinese Academy of Sciences
Anush K. Moorthy, The University of Texas at Austin
Antonio Ortega, University of Southern California
Goran Petrovic, Saarland University
Z. M. Parvez Sazzad, University of Dhaka
Wa James Tam, Communications Research Centre (CRC)
Masayuki Tanimoto, Nagoya University
Patrick Vandewalle, Philips Research Eindhoven
Anthony Vetro, Mitsubishi Electric Research Laboratories (MERL)
Jia-ling Wu, National Taiwan University
Junyong You, Norwegian University of Science and Technology
Lu Yu, Zhejiang University
Liang Zhang, Communications Research Centre (CRC)


Yin Zhao, Zhejiang University
Ce Zhu, University of Electronic Science and Technology of China

Contents

Part I   System Overview

1   An Overview of 3D-TV System Using
    Depth-Image-Based Rendering . . . . . . . . . . . . . . . . . . . . . . . .
    Yin Zhao, Ce Zhu, Lu Yu and Masayuki Tanimoto

Part II   Content Generation

2   Generic Content Creation for 3D Displays . . . . . . . . . . . . . . . . . .    39
    Frederik Zilly, Marcus Müller and Peter Kauff

3   Stereo Matching and Viewpoint Synthesis
    FPGA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    69
    Chao-Kang Liao, Hsiu-Chi Yeh, Ke Zhang, Vanmeerbeeck Geert,
    Tian-Sheuan Chang and Gauthier Lafruit

4   DIBR-Based Conversion from Monoscopic to Stereoscopic
    and Multi-View Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   107
    Liang Zhang, Carlos Vázquez, Grégory Huchet and Wa James Tam

5   Virtual View Synthesis and Artifact Reduction Techniques . . . . .   145
    Yin Zhao, Ce Zhu and Lu Yu

6   Hole Filling for View Synthesis . . . . . . . . . . . . . . . . . . . . . . . .   169
    Ismael Daribo, Hideo Saito, Ryo Furukawa,
    Shinsaku Hiura and Naoki Asada

7   LDV Generation from Multi-View Hybrid Image
    and Depth Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   191
    Anatol Frick and Reinhard Koch

Part III   Data Compression and Transmission

8   3D Video Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   223
    Karsten Müller, Philipp Merkle and Gerhard Tech

9   Depth Map Compression for Depth-Image-Based Rendering . . . .   249
    Gene Cheung, Antonio Ortega, Woo-Shik Kim,
    Vladan Velisavljevic and Akira Kubota

10  Effects of Wavelet-Based Depth Video Compression . . . . . . . . . .   277
    Ismael Daribo, Hideo Saito, Ryo Furukawa, Shinsaku Hiura
    and Naoki Asada

11  Transmission of 3D Video over Broadcasting . . . . . . . . . . . . . . . .   299
    Pablo Angueira, David de la Vega, Javier Morgade
    and Manuel María Vélez

Part IV   3D Visualization and Quality Assessment

12  The Psychophysics of Binocular Vision . . . . . . . . . . . . . . . . . . . .   347
    Philip M. Grove

13  Stereoscopic and Autostereoscopic Displays . . . . . . . . . . . . . . . . .   375
    Phil Surman

14  Subjective and Objective Visual Quality Assessment
    in the Context of Stereoscopic 3D-TV . . . . . . . . . . . . . . . . . . . .   413
    Marcus Barkowsky, Kjell Brunnström, Touradj Ebrahimi,
    Lina Karam, Pierre Lebreton, Patrick Le Callet, Andrew Perkis,
    Alexander Raake, Mahesh Subedar, Kun Wang,
    Liyuan Xing and Junyong You

15  Visual Quality Assessment of Synthesized Views
    in the Context of 3D-TV . . . . . . . . . . . . . . . . . . . . . . . . . . . .   439
    Emilie Bosc, Patrick Le Callet, Luce Morin and Muriel Pressigout

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   475

Part I

System Overview

Chapter 1

An Overview of 3D-TV System Using Depth-Image-Based Rendering

Yin Zhao, Ce Zhu, Lu Yu and Masayuki Tanimoto

Abstract The depth-based 3D system is considered a strong candidate for the
second-generation 3D-TV, preceded by the stereoscopic 3D-TV. The data formats
involve one or several pairs of coupled texture images and depth maps, often known
as image-plus-depth (2D + Z), multi-view video plus depth (MVD), and layered
depth video (LDV). With the depth information, novel views at arbitrary viewpoints
can be synthesized with a depth-image-based rendering (DIBR) technique. In such
a way, the depth-based 3D-TV system can provide stereoscopic pairs with an
adjustable baseline or multiple views for autostereoscopic displays. This chapter
overviews key technologies involved in this depth-based 3D-TV system, including
content generation, data compression and transmission, 3D visualization, and quality
evaluation. We will also present some challenges that hamper the commercialization
of the depth-based 3D video broadcast. Finally, some international research
cooperation and standardization efforts are briefly discussed as well.

Y. Zhao (&) · L. Yu
Department of Information Science and Electronic Engineering,
Zhejiang University, 310027 Hangzhou, People's Republic of China
e-mail: zhaoyin@zju.edu.cn
L. Yu
e-mail: yul@zju.edu.cn
C. Zhu
School of Electronic Engineering, University of Electronic Science
and Technology of China, 611731 Chengdu, People's Republic of China
e-mail: eczhu@uestc.edu.cn
M. Tanimoto
Department of Electrical Engineering and Computer Science,
Graduate School of Engineering, Nagoya University,
Nagoya 464-8603, Japan
e-mail: tanimoto@nuee.nagoya-u.ac.jp

C. Zhu et al. (eds.), 3D-TV System with Depth-Image-Based Rendering,
DOI: 10.1007/978-1-4419-9964-1_1,
© Springer Science+Business Media New York 2013


Keywords 3D video coding · 3D visualization · Challenge · Content generation ·
Depth-based 3D-TV · Depth camera · Depth map · Depth perception · Depth-image-based rendering (DIBR) · Layered depth video (LDV) · Multi-view video plus depth (MVD) · Perceptual issue · Quality evaluation · Standardization · Stereoscopic display · 3D video transmission · View synthesis




1.1 Introduction
The first television (TV) service was launched by the British Broadcasting Corporation (BBC) in 1936 [1]. Since then, with advancements in video technologies
(e.g., capture, coding, communication, and display), TV broadcasting has evolved
from monochrome to color, analog to digital, CRT to LCD, and also from passive
one-to-all broadcasts to interactive Video on Demand (VOD) services. Nowadays,
it has been moving toward two different directions for realistic immersive experience: ultra high definition TV (UHDTV) [2] and three-dimensional TV (3D-TV)
[3, 4]. The former aims to provide 2D video services of extremely high quality
with a resolution (up to 7,680 × 4,320, 60 frames per second) much higher than
that of the current high definition TV (HD-TV). The latter vividly extends the
conventional 2D video into a third dimension (e.g., stereoscopic 3D-TV), making
users feel that they are watching real objects through a window instead of looking
at plain images on a panel. Free-viewpoint Television (FTV) [5] is considered the
ultimate 3D-TV, which gives users more immersive experience by allowing users
to view a 3D scene by freely changing the viewpoint as if they were there. This
chapter (and also this book) focuses on an upcoming and promising 3D-TV system
using a depth-image-based rendering (DIBR) technique, which is one phase in the
evolution from the conventional stereoscopic 3D-TV to the ultimate FTV.
Success of a commercial TV service stands mainly on three pillars: (1) abundant content resources to be presented at terminals, (2) efficient content compression methods and transmission networks which are capable to deliver images
of decent visual quality, and (3) cost-affordable displays and auxiliary devices
(e.g., set-top box). Among these, display technology plays an important role and,
to some extent, greatly drives the development of TV broadcasting, since it is
natural and straightforward to capture and compress/transmit what is required by
the displays. In the context of 3D-TV, different 3D displays may demand diverse
data processing chains that finally lead to various branches of 3D-TV systems.
Figure 1.1a depicts block diagrams of a 3D-TV system including content capture,
data compression, storage and transmission, 3D display or visualization, and
quality evaluation.
Although lacking a consensus on the classification of various 3D displays [6–8],
there are at least five types of 3D visualization techniques: stereoscopy [8, 9],
multi-view autostereoscopy [8, 9], integral imaging [10], holography [11], and
volumetric imaging [12]. Stereoscopic displays basically present two slightly

Fig. 1.1 Illustration of the frameworks of a the conventional stereoscopic 3D system, and b the
depth-based 3D system using Depth-Image-Based Rendering (DIBR), which considers c depth-based formats obtained from stereoscopic or multi-view video. Note that the MVD and LDV
samples of the computer-generated Mobile sequences [170], shown mainly to illustrate the two
formats, are not fully converted from the multi-view video

different images of a 3D scene, which are captured by a stereo camera with a
spacing desirably close to that of human eyes. Multi-view autostereoscopic displays
typically show more than two images of the scene at the same time. Integral
imaging systems replay a great number of 2D elemental images that are captured
through an array of microlenses [10]. Holographic displays project recorded
interference patterns of the wave field reflected from 3D objects with a reference
wave [11]. Volumetric displays reconstruct an illusion of a 3D scene within a
defined volume of space by displaying multiple computer-generated slices of
the scene [12].


Different display technologies require different 3D recording approaches for
capturing 3D signals. These 3D recording approaches (as well as the previous
reproducing methods) are associated with different 3D data formats which may not
be convertible from one to another. For example, stereoscopic images only record
the intensity of incident light waves, while holographic patterns store the optical
wave field (including the intensity plus phase relationships of the incident light
waves). In other words, one type of 3D displays may require more input information than another (meanwhile tending to offer more realistic 3D sensation), thus
demanding more complicated 3D information capture systems.
To deploy a successful 3D-TV service with an appropriate data format, factors
to be considered include the ease of 3D data capture, coding and transmission
efficiency of the 3D data over channels/networks, as well as the complexity and
cost of display devices, to mention a few. In addition to these technical considerations, backward compatibility to the current TV system is another crucial
factor. Consumers tend to have a TV set capable of displaying both 2D and 3D
content instead of one set for 2D and another for 3D. It is highly desirable that 3D
programs can also be presented on old TV sets by using its partial received data.
In this sense, integral imaging, volumetric imaging, and holographic systems,
which employ 3D representations completely different from 2D image sequences,
require numerous changes to the existing broadcasting infrastructure. This poor
backward compatibility will create technical complexity and difficulties in the
development of 3D-TV services such as data conversion and processing, and thus
will greatly compromise the market acceptance to the new 3D services.
Currently, stereoscopic 3D, the simplest 3D representation exhibiting excellent
backward compatibility, has been gaining momentum and establishing itself as a
favorable 3D representation in the first wave of 3D-TV. The Blu-ray Disc Association
(BDA) finalized a Blu-ray 3D specification with stereoscopic video representation in
December 2009 [13]. The World Cup 2010, held in South Africa, was broadcasted
using stereoscopic 3D [14]. Many content providers, like Disney, DreamWorks,
and other Hollywood studios, have released movies in stereo 3D.
Providing a tight coupling of capture and display, this straightforward 3D
format may not be the most efficient data representation (which will be discussed
in Sect. 1.3). The captured input and displayed output can be decoupled via an
intermediate representation that is constructed by the captured data and then
utilized by various displays [4], as shown in Fig. 1.1b. Presently, depth-based 3D
representations [1517], which are shown to be more efficient and flexible than the
stereoscopic format, have been increasingly attracting research attention from both
academia and industry.
Depth-based 3D representations enable the generation of a virtual view using
DIBR-based view synthesis [18, 19], as if the virtual-view image is captured by a
real camera at the target viewpoint. This desirable representation usually consists
of a sequence of color/texture images of the scene (a texture video) and a sequence
of associated depth maps that record the depth value (z-value) at each pixel
(a depth video). The depth values indicate the 3D scene structure and imply the
inter-view geometry relationships. With the depth information and a virtual

1 An Overview of 3D-TV System Using Depth-Image-Based Rendering

camera configuration, pixels in a camera view can be projected into the image
plane of a virtual camera, resulting in a virtual-view texture image. A 3D representation of one texture and one depth video is often known as the 2D ? Z
format, and one of multiple texture and depth videos is referred to as the Multiview Video plus Depth (MVD) format [20], as illustrated in Fig. 1.1c. Another
more advanced depth-based representation, called Layered Depth Video (LDV)
[21], contains not only a layer of texture plus depth of foreground objects exposed
to the current camera viewpoint, but also additional layer(s) of hidden background
scene information, which are occluded by the foreground objects at the current
camera viewpoint. The MVD and LDV formats can be viewed as augmented
versions of the basic 2D ? Z format with more scene descriptions contained in
additional camera views and hidden layers, respectively. These supplementary
data can complement in view synthesis the information that is missing in singleview 2D ? Z data, which will be discussed thoroughly in Sect. 1.2.2.
This chapter presents an overview of a 3D-TV system that employs the depthbased 3D representations and DIBR technique to support both stereoscopic and
multi-view autostereoscopic displays. Although a TV service usually involves both
audio and video, only topics related to the video part are addressed in this chapter
(as well as in this book). Interested readers may refer to [2227] for additional
information on the 3D audio part.
After the introduction, Sect. 1.2 elaborates the key technologies, which are
crucial to the success of the depth-based 3D-TV system, ranging from content
generation, compression, transmission, to view synthesis, 3D display and quality
assessment. Section 1.3 discusses the pros and cons of the depth-based 3D system
in comparison with the conventional stereoscopic 3D system, as well as the
remaining challenges and industry movements related to this 3D-TV framework.
The status and prospects of this new 3D system are concluded in Sect. 1.4.

1.2 Key Technologies of the Depth-Based 3D-TV System


3D-TV system with DIBR takes the texture plus depth representations, as shown in
Fig. 1.1b. Acquisition of the texture videos are generally the same as that of the
conventional 2D video system, e.g., captured by a camera or a multi-view camera
array [2830], or rendered from 3D models [31] (such as ray tracing [32]).
Compared with texture video acquisition, it is more difficult to generate highquality depth maps. Texture and depth data exhibit different signal characteristics.
The conventional video compression techniques, optimized for texture coding,
may not be suitable for coding depth maps. In addition, the introduction of depth
data also means extra data to be coded and new methods for error correction
and concealment in transmission to be developed. After receiving the delivered
content, 3D-TV terminals shall decode in real-time the compressed texture-depth
bitstream and produce the stereoscopic or multi-view video required by 3D
displays. To fit the optical system of a display, the texture images are usually

Y. Zhao et al.

Fig. 1.2 Four approaches to generating texture plus depth data

rearranged into a new pattern before being presented on the screen. Since visual
artifacts may be induced by lossy data compression and error-prone networks, 3D
visual quality should be monitored to adapt video compression and error control
mechanisms, to maintain acceptable visual quality at the decoder side. These
aspects, from capture to display, will be discussed with their fundamental problems and major solutions.

1.2.1 Content Generation


Both texture and depth data are needed in the depth-based 3D system. For real
scenes, texture videos can be captured by a single camera or a camera array.
Obtaining the associated depth data is much more complicated, and there are
typically three ways to generate depth maps based on different available resources
(as shown in Fig. 1.2):
1. stereo matching based on multi-view video [33–37];
2. depth sensing and further aligning with multi-view video [38–43];
3. 2D-to-3D conversion based on a single-view video [44–50].
For virtual 3D scene, the texture image and depth map can be rendered easily
from computer-generated 3D models. The LDV data (with additional background
layers) can be generated with some changes or be converted from MVD [51–54].
Multi-view video capture
Compared with single-view capture, a multi-view recording system encounters
more technical problems, such as synchronization of multi-view signals, mobility
of the hardware system, and huge data storage [55–57]. Due to different camera
photoelectric characteristics, captured images of different views may have significant color mismatches, which may degrade the performance of stereo matching
[58] (to be mentioned in the next section) and cause uncomfortable visual


Fig. 1.3 Two-view geometry: A 3D point can be located by its projections in two (or more) camera images

experience. Moreover, color inconsistency decreases the efficiency of inter-view
prediction in multi-view video coding [59–61]. The color mismatch problem can
be mitigated by color correction which adapts color values of each view to those of
a selected reference view. Commonly, a color transfer function (or a lookup table)
is developed by comparing the color values of a set of correspondences [59, 61] or
by matching histograms of two views [60]. Chapter 2 will discuss some other
issues on post-processing of multi-view videos (e.g., geometric correction).
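As an illustration of the lookup-table style color transfer mentioned above, the following minimal sketch matches the histogram of one color channel of a source view to that of a reference view. It is not the specific correction method of [59–61]; the function name and NumPy-based implementation are our own assumptions.

```python
import numpy as np

def match_histogram(src, ref):
    """Map the value distribution of one view's channel onto a reference view's.

    src, ref: 2D uint8 arrays (one color channel of the source and reference views).
    Returns the color-corrected source channel.
    """
    src_values, src_counts = np.unique(src.ravel(), return_counts=True)
    ref_values, ref_counts = np.unique(ref.ravel(), return_counts=True)

    # Normalized cumulative distributions of both channels
    src_cdf = np.cumsum(src_counts).astype(np.float64) / src.size
    ref_cdf = np.cumsum(ref_counts).astype(np.float64) / ref.size

    # For each source intensity, pick the reference intensity with the closest CDF value;
    # the result acts as a per-channel lookup table (color transfer function).
    lut = np.interp(src_cdf, ref_cdf, ref_values)
    return np.interp(src.ravel(), src_values, lut).reshape(src.shape).astype(np.uint8)
```

In practice the lookup table would be applied channel by channel to every view, with one camera chosen as the reference.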
Stereo matching based on multi-view video
Stereo matching is a popular solution to extracting depth information from
stereoscopic or multi-view images/videos. The cameras in a camera array are
calibrated [62] to determine their extrinsic (rotation and translation with respect to
a world coordinate system) and intrinsic parameters (focal length, principal point,
and skew of the horizontal and vertical image axes). With a determined camera
model, it is easy to know which position in the captured image a given 3D point in
the world coordinate is projected to (via a projection equation) [63], e.g., P is
projected to p1 on camera image 1 in Fig. 1.3. Intuitively, the position of a 3D
point can be calculated if we know its projections on two camera images, e.g., by
figuring out the intersection of lines o1p1 and o2p2 (or more commonly, by solving
the simultaneous equations of the two projection equations). Then, the depth value
of the 3D point is obtained.
We can figure out depth values of pixels in a camera image, provided that the
corresponding points in another view are known. A pair of inter-view correspondences share similar features (e.g., color value and luminance gradient), and can be
located by stereo matching techniques [33–37]. Normally, stereo matching sets up a
cost function that measures the feature similarity (e.g., Sum of Squared Differences,
or SSD) between one reference pixel in one view and a candidate corresponding pixel
in the other view. Some methods find the best candidate corresponding pixel within a
search window by looking for the lowest matching cost, i.e., this point is considered
the most similar to the reference pixel [33, 37]. Other methods further add
smoothness assumptions into the cost function and jointly minimize the costs that
measure the similarities of all pixels [34], thus determining all the correspondences
with a global optimization method such as belief propagation [35] and graph cuts
[36]. The position difference between each pair of correspondences (called disparity)
is recorded in a disparity map. The initial integer disparity map may be refined to
remove possible spurious values [34]. A depth map is converted from the estimated
disparities as long as the camera parameters are known. Post-processing of the depth
map may be required to improve boundary alignment, to reduce estimation errors,
and to fill occlusions [16]. Detailed description of stereo matching algorithms is
provided in Chaps. 2 and 3.
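To make the matching procedure concrete, the sketch below implements a minimal winner-takes-all local matcher with an SSD cost over a square window, followed by the disparity-to-depth conversion Z = f·B/d that holds for a rectified, parallel camera pair. It is illustrative only, with no refinement, occlusion handling, or global optimization; the function names and parameters are hypothetical.

```python
import numpy as np

def disparity_map_ssd(left, right, max_disp=64, half_win=4):
    """Winner-takes-all local stereo matching with an SSD cost (rectified pair).

    left, right: grayscale images as float arrays of the same size.
    Returns an integer disparity map for the left view.
    """
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half_win, h - half_win):
        for x in range(half_win + max_disp, w - half_win):
            ref_block = left[y - half_win:y + half_win + 1, x - half_win:x + half_win + 1]
            costs = []
            for d in range(max_disp):            # candidate correspondences in the right view
                cand = right[y - half_win:y + half_win + 1,
                             x - d - half_win:x - d + half_win + 1]
                costs.append(np.sum((ref_block - cand) ** 2))   # SSD matching cost
            disp[y, x] = int(np.argmin(costs))   # lowest cost = most similar candidate
    return disp

def disparity_to_depth(disp, focal_px, baseline_m):
    """Convert disparity to depth for a rectified, parallel camera pair: Z = f * B / d."""
    with np.errstate(divide="ignore"):
        return np.where(disp > 0, focal_px * baseline_m / disp, 0.0)
```

Real systems replace the brute-force loop with optimized (often hardware) implementations and add the cost aggregation, global optimization, and post-processing steps described above.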
Depth sensing and aligning with multi-view video
Stereo matching aims to find pairs of correspondences with the closest features.
It often fails in homogeneous image regions where multiple candidates exhibit
similar features and the best matching point is ambiguous. Also, correspondences
may not be located exactly at occlusion regions where parts of an image are not
visible in the other one.
As an alternative to stereo matching for depth map generation, physical ranging
methods are introduced, which do not rely on scene texture to obtain depth information, and can tackle occlusion conflicts. Some examples of such physical ranging
methods are time-of-flight (TOF) sensor [38] and structured light scanner [39].
Using depth sensing equipment, however, gives rise to other technical problems. First, the depth and video cameras are placed at different positions and
output different types of signals. It is necessary to calibrate the depth camera with
the video camera based on two diverse scene descriptions [42]. Second, the output
depth map generally has lower spatial resolution than that of the captured color
image. This means the measured depth map should be transformed and properly
upsampled to align with the texture image [40, 41]. Third, the measured depth
maps are often contaminated by random noise [41] which should be removed for
quality enhancement.
The TOF camera itself also has several inherent defects. Since a TOF camera captures
depth information by measuring the phase shift between self-emitted and reflected
infrared light, the capture range is limited to around 7 meters. Besides, interference from surrounding light in the same spectrum may occur. Due to these limitations,
outdoor scenes cannot be correctly measured by current TOF solutions.
Moreover, surrounding materials may absorb the emitted infrared light or direct it
away from the sensor. Part of the rays will never return to the sensor, thus affecting
the depth calculation. Using multiple TOF cameras at the same time may also
introduce interference from the cast rays of similar frequencies.
Structured light scanners such as the Microsoft Kinect are also not perfect. Their
accuracy worsens as the distance increases. They also encounter the problems of
limited range, missing rays, and interference from multiple emitters and ambient light.
Since depth sensing approaches have different error sources and limitations
from stereo matching methods, fusing depth data obtained from the two avenues
appears to be promising to enhance the accuracy of the depth information [41–43].
More details on depth map generation using TOF sensor are elaborated in Chap. 7.
2D-to-3D conversion based on a single-view video
Apart from using stereo matching and depth sensors, a depth map can also be
generated from the video sequence itself by extracting monoscopic depth cues
contained in it. This depth generation is often studied in
research on 2D-to-3D video conversion [44–46]. This approach sounds more
appealing, as a video shot by a single camera can be converted into stereoscopic

3D. Given the abundant 2D video data like films and TV programs, 2D-to-3D
conversion is very popular in the media industry and post-production studios. Disney
is reported to have been producing 3D versions of earlier 2D movies such as
The Lion King.
It is known that the human visual system perceives depth information from both
binocular cues and monocular cues. The former are often referred to as stereopsis
(an object appearing at different positions in two eyes) and convergence (two eyes
rotating inward toward a focused point). Stereo matching simulates the stereopsis
effect that senses depth information from retinal disparities. 2D-to-3D conversion,
however, exploits monocular cues, including (but not limited to):
1. linear perspective from vanishing lines: converging lines that are actually
parallel indicate a surface that recedes in depth [47];
2. blur from defocus: the object closer to the camera is clear while blur of the
defocused regions increases with the distance away from the focal plane [48];
3. velocity of motion: for two objects moving at a similar velocity, the nearby
object corresponds to a larger displacement in the captured video [49];
4. occlusion from motion: background region occluded by a moving foreground
object will be exposed in another frame [50].
Apart from making use of those monocular cues, the depth map can be estimated using a machine learning algorithm (MLA) [44]. Some important pixels are
selected as a training set, and the MLA learns the relationship between the samples
(position and color value) and their predefined depth values. Then, the remaining
pixels are determined by the training model. This method requires much less effort
than manual depth map painting [16, 64]. Since depth is usually consistent over
time, manual assistance may only be employed at several key frames in a
sequence, while other depth frames can be automatically obtained with the
propagation of the key frames [43, 65]. It should be noted that 2D-to-3D conversion aims to (or rather, is only capable of) estimating an approximate depth map that
properly reflects the depth ordering of major 3D surfaces, rather than to recover real
depth values as stereo matching and depth sensing do. Fortunately, our eyes are
robust enough to perceive a natural 3D sensation as long as there are no severe conflicts among
depth cues. More in-depth discussions on state-of-the-art 2D-to-3D conversion
techniques are provided in Chap. 4.
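As a toy example of the "velocity of motion" cue listed above, the sketch below turns the magnitude of a precomputed optical-flow field into a coarse, ordinal depth map: larger apparent motion is taken to indicate a nearer object. This is a simplified assumption of our own rather than a method from the cited works, and the flow input is assumed to come from any off-the-shelf optical-flow estimator.

```python
import numpy as np

def ordinal_depth_from_motion(flow):
    """Coarse ordinal depth from the 'velocity of motion' cue.

    flow: (H, W, 2) optical-flow field between consecutive frames, assumed to be
          precomputed by an external optical-flow estimator (hypothetical input).
    For objects moving at similar real-world velocity, a larger image displacement
    suggests a nearer object, so nearer pixels receive larger 8-bit values here.
    The result only reflects depth ordering, not metric depth.
    """
    magnitude = np.hypot(flow[..., 0], flow[..., 1])
    normalized = magnitude / (magnitude.max() + 1e-6)
    return (255.0 * normalized).astype(np.uint8)
```

A practical converter would combine several such cues and smooth the result spatially and temporally, as discussed in Chap. 4.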
Synthetic texture and depth data from 3D model
Computer-generated imagery (CGI) is another important source of 3D content.
It describes a virtual scene in a complete 3D model, which can be converted into
the texture plus depth data easily. For example, the texture images of a target view
can be rendered using the conventional ray casting technique [66], in which the
color of each pixel is determined by tracing which surface of the computer-generated models is intersected by a corresponding ray cast from the virtual
camera. The associated depth maps can be generated by calculating the orthogonal
distances from the intersected 3D points to the image plane of the virtual camera.
Occlusion can be solved based on comparing the depth values of multiple points
projected to the same location, which is known as z-buffering.
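A minimal sketch of the z-buffering test mentioned above is given below: when several projected points fall on the same pixel, the one with the smallest distance to the image plane wins. The point-list data layout and per-point colors are assumptions for illustration.

```python
import numpy as np

def zbuffer_render(points_px, z_values, colors, height, width):
    """Resolve occlusion with a z-buffer: keep the nearest point per pixel.

    points_px: (N, 2) integer pixel coordinates of projected 3D points.
    z_values:  (N,) distances from the 3D points to the image plane.
    colors:    (N, 3) colors of the points.
    """
    zbuf = np.full((height, width), np.inf)
    image = np.zeros((height, width, 3), dtype=np.uint8)
    for (x, y), z, c in zip(points_px, z_values, colors):
        if 0 <= x < width and 0 <= y < height and z < zbuf[y, x]:
            zbuf[y, x] = z          # a closer point wins the pixel
            image[y, x] = c
    return image
```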


Mixing synthetic content with real scenes is popular in post-production.
Advanced visual effects can be created by fusing real foreground action with a
computer-generated 3D background using chroma keying techniques, or by adding
a 3D object into a natural scene. In addition, it is also suitable for inserting 3D
subtitles or logos into 3D videos [54] for further content manipulation.
Generation of LDV content
Different from MVD, which consists of multiple views of 2D + Z, LDV supplies several background layers of texture plus depth data in addition to one-view
foreground 2D + Z. The enhancement data upon the foreground 2D + Z data are
needed, because some background information in a virtual view cannot be referred
from the camera view, mostly due to occlusion. In many applications, LDV data
adopt only one background layer (or called residual layer), which is sufficient to
provide the missing background texture, since a second background layer contains
little complement for ordinary scenes without complex multi-layer occlusion [52].
Generally speaking, MVD implicitly distributes occlusion information about a
camera view into its neighboring views, while LDV manages it explicitly and collectively in a background layer of the camera view. Thus, it is possible to convert
MVD into LDV by retrieving the occluded background information from other views
of MVD. A straightforward solution is to project several neighboring views into a
target view. Ideally, the foreground objects from different views correspond to each
other, and will overlap in the same place with the same z values (i.e., distances to the
camera) after being warped to the same viewpoint. Then, the background occluded
pixels are clearly those with larger z values than the overlapped foreground pixels
[51]. However, if the depth data are not accurate or consistent among views (e.g.,
depth errors occur), inconsistent z values of the same object from different views may be
mistreated as multiple objects in different depth planes. In such a case, the z-buffer-based approach is prone to generate a noisy background layer mixed with redundant
and erroneous foreground information, as shown in Fig. 1.4b, which makes the
residual layer less efficient for compression [52].
Accordingly, a better approach of incremental residual insertion is developed
to tackle the inter-view depth inconsistency. It first projects the central view into
the left (or right) view and finds empty regions (holes) along the left (or right)
side of foreground objects in the warped image. Later, the corresponding pixels in
the left (or right) view are warped back to the central view and inserted into the
background layer [52, 54, 67]. This strategy only accesses what lacks in the
warped central view, and thus prevents unnecessary foreground pixels from being
included into the background layer, as shown in Fig. 1.4c. In addition, a hole in
the warped image may resemble the surrounding texture, which means the
residual may be recovered using inpainting techniques, apart from being supplied
in the background layer. To further enhance the efficiency of the residual
representation, spatial correlation is exploited, in which inpainting-generated patterns for the
holes are subtracted from the residual layer to reduce the number of
required pixels [53]. Chapter 7 provides a tutorial on LDV generation, including
an LDV-compliant capture system [54] and advanced foreground and occlusion
layer creation.


Fig. 1.4 LDV converted from MVD using different approaches [52] for Ballet [171].
a Foreground layer, and b background layer using the z-buffer-based approach. c Background
layer using incremental residual insertion. White regions in b and c are foreground masks

1.2.2 View Synthesis with DIBR


View synthesis [18, 19, 68–72] reconverts the depth-based intermediate 3D
representations into stereoscopic 3D required by the stereoscopic or multi-view
autostereoscopic displays, as shown in Fig. 1.1b. It employs the DIBR [18, 19]
technology to generate any novel view as if the image is shot by a camera placed at
the virtual viewpoint. Basically, DIBR can be regarded as the reverse process of
stereo matching that first searches the correspondences and then calculates the
depth values. According to the two-view geometry as illustrated in Fig. 1.3, DIBR
traces back the corresponding position in a virtual view for each camera-view
pixel, based on its location and associated depth value. More specifically, knowing
where and how the real camera and virtual view are deployed (i.e., the camera
parameters), DIBR locates the object position P in a world coordinate (which is
often set as the camera-view coordinate system for simplicity) for a camera-view pixel
p1, and then translates P into the virtual-view camera to find the correspondence p2.
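The per-pixel warping step can be sketched as follows under a standard pinhole camera model: back-project the camera-view pixel to the 3D point P using its depth and the reference camera's intrinsics, transform P into the virtual camera's coordinate frame, and re-project. This is an illustrative simplification (metric depth, no rounding or occlusion handling), and the matrix names are our own assumptions.

```python
import numpy as np

def warp_pixel(u, v, z, K_ref, K_virt, R, t):
    """Map one camera-view pixel to the virtual view (pinhole camera model).

    (u, v): pixel position in the reference camera image; z: its depth value
    expressed in the reference-camera coordinate frame (assumed metric here).
    K_ref, K_virt: 3x3 intrinsic matrices; R, t: rotation and translation taking
    reference-camera coordinates to virtual-camera coordinates.
    Returns the corresponding (sub-pixel) position in the virtual view.
    """
    # Back-project: pixel + depth -> 3D point P in the reference-camera frame
    P_ref = z * np.linalg.inv(K_ref) @ np.array([u, v, 1.0])
    # Change of coordinates into the virtual-camera frame, then project
    P_virt = R @ P_ref + t
    p = K_virt @ P_virt
    return p[0] / p[2], p[1] / p[2]
```

A full renderer applies this mapping to every pixel, rounds the target positions, and resolves competing projections with a z-buffer before hole filling.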
After projecting the camera view into a virtual view using DIBR, some pixels in
the virtual-view image are not assigned, which are often called holes, as shown in
Fig. 1.5b and e. Holes are mostly introduced by disocclusion, where new background
regions are exposed in the novel view, and they need to be filled in order to obtain a
complete image. The philosophy of hole filling is that the missing areas can be
recovered using information from the surrounding pixels [73], especially the
background texture in the context of occlusion handling [74, 75], or can be found
somewhere else in the video (either in the same frame [76] or from another time
instant [77]). However, inpainting algorithms [73] may not always generate natural
patterns, especially for large holes, as shown in Fig. 1.5c. More details on hole
filling are given in Chap. 6.
If we have two camera views at different sides of the virtual view, holes in one
of the warped images most likely correspond to non-hole regions in the other one,
e.g., comparing Fig. 1.5b and e. Therefore, merging the two warped images can
drastically reduce the missing pixels and eliminate large holes which may be
difficult to be reconstructed. This is why MVD, compared with 2D + Z, can
improve the visual quality of a synthesized view in between (often called view


Fig. 1.5 View synthesis with 2D + Z data or two-view MVD data of Mobile, using VSRS
[70, 172]. a The left view (view 4). b After warping the left view to the central virtual point
(view 5), where holes are marked in black. c Applying hole filling on b using inpainting [73].
d The right view (view 6). e After warping the right view to the central virtual point. f Merging
the warped left and right views b and e, and all pixels are assigned with color values. For more
complicated scenes, we may have to fill a few small holes in the merged image using simple
interpolation of surrounding pixels. Inpainting may fail to create natural patterns for the
disoccluded regions, while merging the two warped views b and e is efficient to make up the
holes

interpolation), and thus is required in high-quality rendering applications. Background/occlusion layers of LDV provide the same functionality of supplying
occlusion information.
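A simple way to realize the merging described above is sketched below: a hole in one warped view is filled from the other view, and where both views contribute, the sample nearer to the camera (smaller warped z) is kept. The hole-marking convention and array layout are assumptions; practical renderers typically blend the two contributions rather than making a hard selection.

```python
import numpy as np

def merge_warped_views(img_l, z_l, img_r, z_r, hole_value=-1.0):
    """Merge two warped views: non-hole pixels fill holes of the other view;
    where both views contribute, the sample nearer to the camera wins.

    img_l, img_r: (H, W, 3) warped left/right textures.
    z_l, z_r:     (H, W) warped depth, with `hole_value` marking unassigned pixels.
    """
    merged = img_l.copy()
    hole_l = (z_l == hole_value)
    hole_r = (z_r == hole_value)

    merged[hole_l] = img_r[hole_l]          # fill left-view holes from the right view
    both = ~hole_l & ~hole_r
    use_right = both & (z_r < z_l)          # keep the nearer (smaller z) sample
    merged[use_right] = img_r[use_right]

    remaining = hole_l & hole_r             # small holes left for interpolation/inpainting
    return merged, remaining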
It is obvious that inaccurate depth values cause camera-view pixels to be
projected to wrong positions in a target virtual view. These kinds of geometry
distortions are most evident around object boundaries where incorrect depth values


appear frequently, resulting in so-called boundary artifacts [78]. Furthermore,
depth errors can hardly be consistent in the temporal domain, which makes the
corresponding texture patterns shift or flicker over time [79]. These rendering
artifacts can sometimes be very annoying and make the synthesized video unacceptable to watch. In Chap. 5, we will revisit MVD-based view synthesis with
more in-depth descriptions; meanwhile, we will review many quality enhancement
processing to suppress rendering artifacts. View synthesis with LDV data can be
found in Chap. 7.

1.2.3 3D Video Compression


The captured or generated texture and depth data are often stored in YUV color
space which records one luma (Y) and two chrominance (U and V) components for
every pixel. The luma component indicates brightness, while the two chrominance
ones denote color information. Depth videos consist of only gray images, where
the chrominance components U and V can be regarded as a constant, e.g., 128. The
most prevalent YUV format in TV broadcasting nowadays is YUV 4:2:0, in which
chrominance components are both horizontally and vertically downsampled by a
ratio of 1/2. For a standard-resolution video (720 × 576, 25 frames per second), the
raw YUV 4:2:0 stream is about 124 Megabits per second (Mbps). Directly
transmitting the raw data is obviously impossible for current commercial broadcasting networks (e.g., up to 19.4 Mbps for terrestrial broadcasting in North
America).
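The 124 Mbps figure can be verified with a few lines of arithmetic: YUV 4:2:0 stores 1.5 samples per pixel on average (one luma sample per pixel plus two chroma samples per four pixels), each sample taking 8 bits.

```python
# Raw bitrate of a YUV 4:2:0 standard-resolution video (720 x 576, 25 fps):
# 4:2:0 subsampling stores 1.5 samples per pixel on average (Y + U/4 + V/4).
width, height, fps, bits_per_sample = 720, 576, 25, 8
samples_per_frame = width * height * 1.5
bitrate_mbps = samples_per_frame * bits_per_sample * fps / 1e6
print(bitrate_mbps)   # ~124.4 Mbps, matching the figure quoted above
```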
A video contains a lot of spatial and temporal redundancy, which means the
actual information about the video can be expressed with far fewer bits than the
apparent raw data size. Video compression aims to squeeze a raw video into a
reduced size bitstream at an encoder, and to reconstruct the video at a decoder, as
shown in Fig. 1.6a. A high compression rate (in order to meet the bandwidth
budget) is often achieved at the cost of quality degradation of the decoded video,
since some necessary information may also be discarded in the encoding process,
which is often referred to as lossy compression. The compression efficiency of
different codecs is measured and compared in terms of Rate-Distortion (R-D)
performance, in which Peak Signal to Noise Ratio (PSNR) between the original
and reconstructed videos is the most common metric for the introduced signal
distortions, and the rate often refers to the bitrate of the encoded bitstream. When a
set of R-D sample points of a codec are connected in a figure, an R-D curve is
visualized, as shown in Fig. 1.6b. Using the same amount of bits, a codec of higher
compression efficiency introduces less significant signal distortions.
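For reference, the PSNR values plotted on such R-D curves are computed from the mean squared error between the original and reconstructed frames; a minimal sketch for 8-bit video (peak value 255) is given below.

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak Signal-to-Noise Ratio between an original and a reconstructed 8-bit frame."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")        # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```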
There are many ways to compress texture and depth data. The simplest
approach is to encode each texture or depth video independently (called simulcast)
using standard compression techniques, e.g., H.264/AVC [80]. Besides, there must
be some information shared among multiple views, since all views record the same
scene. By further exploiting inter-view redundancy, MVD (especially the texture


Fig. 1.6 a Illustration of (lossy) video codec. b Rate-distortion (R-D) performance of two codecs
on a (depth) video. Codec 2 presents superior R-D performance to Codec 1, since the R-D curves
show that at the same level of distortion (suggested by PSNR value; where a higher PSNR means
a lower level of distortion), Codec 2 needs less bitrate to compress the video. For example,
compared with Codec 1, Codec 2 saves about 34 % rate at PSNR = 45 dB

data) can be compressed more efficiently with the Multi-view Video Coding
(MVC) [81–83] standard, an amendment to H.264/AVC with adoption of inter-view prediction [82].
The conventional video coding standards, however, are developed for texture
video compression, and have been shown to be inefficient for coding depth
videos which have very different statistical characteristics. As discussed above,
both texture and depth data contribute to the quality of a synthesized image, where
texture distortions cause color value deviations at fixed positions and depth distortions may lead to geometry changes [84]. This evokes the need for a different
distortion measurement for depth coding [85] and a proper bit allocation strategy
between texture and depth [86]. Compared with texture images, depth maps are


generally much smoother within an object while presenting poorer temporal and
inter-view consistency (limited by current depth acquisition technologies).
Therefore, some conventional coding tools, like the inter-view prediction in MVC,
may turn out to be insufficient as the inter-view correlation changes accordingly.
More efficient methods are desired to preserve the fidelity of depth information,
which may be realized through any of the following ways or their combinations:
1. accommodate the existing coding techniques (e.g., intra prediction [87], inter
prediction [88], and in-loop filter [89]) to fit the statistical features of depth
data;
2. measure or quantify the influence of depth data distortions on view synthesis
quality [85, 90] and incorporate the effect in the rate-distortion optimization of
depth coding [85];
3. use other coding frameworks (e.g., platelet coding [84] and wavelet coding
[91]);
4. exploit the correlation between texture and depth (e.g., employing the structural
similarity [92]).
In addition, the depth data can also help to enhance the coding efficiency of
texture data (e.g., using view synthesis prediction [93, 94]) or to reduce encoder
complexity (e.g., fast mode decision [95]). More details on the current status of 3D
video coding, advances on depth compression techniques, and wavelet coding for
depth maps will be provided in Chaps. 8, 9 and 10, respectively.

1.2.4 3D Video Transmission and Storage


Once 3D videos are compressed, the bitstreams will be either saved in storage
devices (e.g., a Blu-ray disk) or transmitted over various networks such as cable,
terrestrial, and satellite broadcasting. Due to the small amount of bits required for
depth information, transmitting additional one-view depth data for basic-quality
applications and even one or two more views of texture plus depth for high-quality
services does not seem to be a disaster for current 2D broadcast infrastructure.
With more homogeneous regions, depth maps generally require much less bitrate
than the associated texture images using the same compression technique [86, 96].
Some preliminary studies show that depth can be assigned around 25 % of the
corresponding texture bitrate for achieving the best synthesis quality [86], given a
total bandwidth.
Then, a minor technical problem for 3D video storage and transmission is how
to organize the multiple bitstreams of texture and depth data into transport streams
(TS) such as the MPEG-2 TS, a widely used standard format for audio and video
transmission. In other words, two issues need to be solved: how many TS should
be used and how to multiplex (and signal) several bitstreams into one TS. Solutions may depend on the backward compatibility with legacy devices. For
example, the earlier Blu-ray players support a maximum video bitrate of 40 Mbps.


Combining all texture and depth bitstreams into one TS under the bitrate constraint
will lead to lower quality texture videos for 2D playback. Thus, the recent Blu-ray
3D specification considers a mode of using one main-TS for the base view for 2D
viewing and using another sub-TS for the dependent view for stereoscopic
viewing, which does not compromise the 2D visual quality. Although no depth-based 3D video services have been launched worldwide, these concerns from the
stereoscopic 3D system may also be considered in developing the new 3D transmission system. Chapter 11 provides a study on transmitting 3D video over cable,
terrestrial, and satellite broadcasting networks.
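As a toy illustration of the TS-arrangement question (not taken from any specification), the check below decides whether texture and depth bitstreams can share one transport stream under a legacy player's video bitrate cap, or whether a main-TS/sub-TS split along the lines of Blu-ray 3D is preferable; the 40 Mbps figure is the earlier Blu-ray limit mentioned above, everything else is assumed.

LEGACY_TS_CAP_MBPS = 40.0  # maximum video bitrate of earlier Blu-ray players

def ts_arrangement(texture_mbps, extra_mbps):
    # extra_mbps: combined bitrate of depth (and any dependent-view) streams.
    if texture_mbps + extra_mbps <= LEGACY_TS_CAP_MBPS:
        return "single TS (2D playback quality preserved)"
    return "main-TS for the base view + sub-TS for depth/dependent views"

print(ts_arrangement(32.0, 10.0))  # -> main-TS + sub-TS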
Two other research topics in transmission are error resilience and concealment
for unreliable networks (e.g., wireless and Internet Protocol (IP) networks).
The former, employed at encoders, spends overhead bits to protect a compressed
bitstream, while the latter, utilized by decoders, attempts to recover lost or corrupted information during transmission [97–99]. Apart from using the conventional error resilience and concealment methods for 2D [97–102] and stereoscopic 3D [103–105], some studies exploit the correlation/similarity between texture and depth, and develop new approaches for depth-based 3D video [106–109]. For example, since motion vectors (MV) for texture and depth of the
same view are highly correlated, they are jointly estimated and then transmitted
twice (once in each bitstream) to protect the important motion information [106].
Motion vector sharing is also used in error concealment. For a missing texture
block, MV of its associated depth block is adopted as a candidate MV [107].
Neighboring blocks whose depth information is similar to that of the missing block likely belong to the same object and should thus have the same motion. On the other hand, depth information also distinguishes object boundaries from homogeneous regions, and the two objects on either side of a boundary may have different motion. In that case, the assumption of a smooth MV field turns out to be invalid, and the block needs to be split into foreground and background parts with
two (different) MVs [108]. In addition, by exploiting inter-view geometry
relationship, missing texture in one view can be recovered by DIBR with texture
and depth data of another view [109].
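The depth-assisted concealment idea can be sketched as below, in the spirit of [107, 108] but not reproducing either method exactly: the smoothness threshold, the choice of candidate MVs and the foreground/background split rule are illustrative assumptions.

import numpy as np

def conceal_lost_block(depth_block, mv_depth, neighbor_mvs, split_threshold=10.0):
    # Returns (mv_foreground, mv_background, fg_mask) for a lost texture block.
    depth_block = np.asarray(depth_block, dtype=float)
    if depth_block.std() < split_threshold:
        # Smooth depth: the block likely covers one object, so reuse the MV of
        # the co-located depth block together with the neighbours' median MV.
        mv = np.median(np.vstack([mv_depth] + list(neighbor_mvs)), axis=0)
        return mv, mv, np.ones(depth_block.shape, dtype=bool)
    # A depth edge crosses the block: split it into foreground and background
    # and conceal each part with its own candidate MV (illustrative choice).
    fg_mask = depth_block > depth_block.mean()
    return np.asarray(neighbor_mvs[0]), np.asarray(neighbor_mvs[-1]), fg_mask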

1.2.5 3D Visualization and Related Perceptual Issues


As discussed above, the depth-based formats are intermediate representations that do not bring 3D sensation to humans by themselves. After 3D video delivery from content providers to consumers, the 3D data are visualized on special displays that exploit properties of human perception to create the 3D effect. Basically, humans perceive depth from monocular and binocular cues. Monocular cues include accommodation, perspective, occlusion, and motion parallax, which convey only rough relative positions among different objects. That is why we can roughly discern foreground
from background when viewing a 2D image, but fail to thread a needle with one
eye shut.


Fig. 1.7 a Illustration of convergence, positive, and negative disparities. The two eyes fixate at point F, whose image falls on the center of each retina (the fovea) with zero disparity. Point P (or N) has a
positive/uncrossed (or negative/crossed) disparity, when the left-eye projection position P1 (or
N1) is at the right (or left) side of the right-eye position P2 (or N2). In the context of stereoscopic
viewing, Point N appears in front of the screen and Point P seems to be behind the display.
b Illustration of the stereoscopic comfort zone, in which the most comfortable regions are close to
the screen plane

Binocular depth perception provides more accurate depth estimates based on two main factors: eye convergence and binocular parallax. In practice, the two eyes first rotate inward to fixate a point in space, and the convergence angle reflects the actual
distance of the point, as shown in Fig. 1.7a. Then, each 3D point falls on the two
retinal images which are further combined in the brain with binocular fusion or
rivalry [110]. The difference between the retinal positions of one object results in a
zero, negative, or positive disparity, which is interpreted by our visual system as
the relative depth of this point with respect to the fixation point. More details of
binocular depth perception as well as other binocular vision properties (e.g.,
binocular fusion) are covered in Chap. 12.
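The convergence geometry in Fig. 1.7a can be turned into a small worked example. The function below uses the standard intersection-of-rays relation (not specific to any particular display); the viewing distance and eye separation are assumed example values.

def perceived_distance(disparity_m, viewing_distance_m=2.0, eye_separation_m=0.065):
    # disparity_m: signed on-screen separation between the right-eye and
    # left-eye image points of the same scene point
    # (positive = uncrossed/behind the screen, negative = crossed/in front).
    return eye_separation_m * viewing_distance_m / (eye_separation_m - disparity_m)

print(perceived_distance(0.0))    # 2.00 m -> exactly on the screen plane
print(perceived_distance(0.02))   # ~2.89 m -> behind the screen (point P)
print(perceived_distance(-0.02))  # ~1.53 m -> in front of the screen (point N)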
The visualization of 3D effects in the depth-based system is achieved by presenting reconverted stereoscopic or multi-view video on a stereoscopic display
(SD) or a multi-view autostereoscopic display (MASD) [69]. The former provides binocular parallax by showing spatially mixed or temporally alternating left and right images on the screen and by separating the light of the two images with passive (anaglyph and polarized) or active (shutter) glasses [6, 9]. The latter satisfies binocular parallax without the need for special glasses: it directs rays from different views into separate viewing regions using optical devices such as
lenticular sheets and parallax barriers [6, 9]. Usually, a series of discrete views are
shown on an MASD simultaneously, and viewers can feel monocular motion
parallax since they will receive a new stereo pair when shifting their heads slightly.


The binocular head-mounted display that places one screen in front of each eye in
a helmet can be considered as an SD [9]. When an SD tracks a viewer's head motion and generates two appropriate views for his/her eyes, motion parallax is fulfilled as well [6].
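To illustrate how an MASD spatially multiplexes its discrete views, here is a deliberately simplified column-interleaving pattern in Python (a vertical parallax-barrier style layout). Real lenticular panels use slanted, sub-pixel-accurate, vendor-specific patterns, which is exactly why a single pre-composited multi-view video cannot serve arbitrary displays.

import numpy as np

def interleave_views(views):
    # views: list of N images of identical shape (H, W, 3); pixel column x of
    # the panel shows the corresponding column of view (x mod N).
    n = len(views)
    out = np.empty_like(views[0])
    for x in range(views[0].shape[1]):
        out[:, x, :] = views[x % n][:, x, :]
    return out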
Although SD and MASD are relatively less expensive compared with holographic and volumetric displays, neither of them delivers a true 3D experience by reproducing all rays of the captured scene. Instead, they drive the brain to fuse two given images, thus evoking depth perception from binocular cues. This mechanism requires extra effort to watch stereoscopic images, which explains why we may need a few seconds to perceive the 3D effect after putting on 3D glasses, unlike the everyday experience of immediate depth perception. As a consequence, the artificially evoked 3D sensation could cause visual discomfort in several ways. First, vergence
(driven by disparity) and accommodation (driven by blur) are considered two
parallel feedback control systems with cross-links [111–114]. In stereoscopic
viewing, the accommodation should be kept at the screen plane in order to perceive a clear/sharp image, whereas the vergence is directed at the actual distance
of the gazed object for binocular fusion. The discrepancy between the accommodation and vergence pushes the accommodation back and forth, thus tiring the
eyes. To minimize this conflict and reduce visual discomfort, the disparities in a stereoscopic image should be kept within a small range, which renders the mismatch between accommodation and vergence unnoticeable and ensures that all perceived 3D objects fall in a stereoscopic comfort zone [115], as shown in Fig. 1.7b. Visual
experiments reveal that positive disparities (i.e., object behind the screen) may be
more comfortable than negative ones [116], and hence the comfort zone or
comfortable degree may be asymmetric on both sides of a screen.
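A rough numerical feel for such a disparity budget can be obtained with the snippet below, which uses the frequently cited (and debated) rule of thumb of keeping screen parallax within about one degree of visual angle; the limit, screen size, resolution and viewing distance are all assumed example values, and, as noted above, the real comfort zone is asymmetric.

import math

def max_disparity_px(viewing_distance_m=3.0, screen_width_m=1.0,
                     horizontal_resolution=1920, limit_deg=1.0):
    # Convert an angular parallax limit into a pixel disparity budget.
    max_disparity_m = viewing_distance_m * math.tan(math.radians(limit_deg))
    return max_disparity_m * horizontal_resolution / screen_width_m

print(round(max_disparity_px()))  # ~101 pixels for this example set-up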
There also exist perceptual and cognitive conflicts [114] in which the depth from disparity contradicts other cues. The window violation is a typical example, where part of an object appearing in front of the screen hits the display margin. The occlusion cue implies that the object is behind the screen, whereas convergence suggests the opposite. This paradox can be solved either by using a floating
window technique that adds a virtual black border perceptually occluding the
object [117], or by pushing the whole scene backwards (by shifting the stereoscopic images or by remapping disparities of the scene) to make the object appear
on or behind the screen [115].
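The second remedy, shifting the stereoscopic images, amounts to adding a constant offset to every disparity. A minimal sketch in Python follows; note that np.roll wraps pixels around the image border, whereas a real implementation would crop or pad (e.g., with a floating window), and the shift value is an assumed input.

import numpy as np

def push_scene_back(left, right, extra_disparity_px):
    # Moving the left view to the left and the right view to the right by half
    # the offset each adds +extra_disparity_px of uncrossed disparity to every
    # scene point, moving the whole scene away from the viewer.
    s = extra_disparity_px // 2
    return np.roll(left, -s, axis=1), np.roll(right, s, axis=1)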
Other sources of unnatural 3D sensation include (1) cross talk, where rays intended for one eye leak into the other eye, generating ghosting effects such as double edges [118]; (2) disparity discontinuity at scene cuts, forcing the eyes to readapt to the new disparities [115]; (3) flickering artifacts from view synthesis with temporally inconsistent depth data [114, 119]; (4) the puppet-theater and cardboard effects, where a 3D object looks unnaturally small or appears as thin as paper [120]; and (5) motion tearing, where the eyes lose track of and coordination on a fast-moving object on time-sequential displays, thus seeing two objects at different positions [121]. More
details on state-of-the-art SD and MASD as well as other 3D displays will be
surveyed in Chap. 13.


1.2.6 3D Quality Evaluation


The last component in this systematic chain, as shown in Fig. 1.1b, is the evaluation of 3D video quality. Since visual distortions may be introduced in
transmission and compression, the received video signals are usually different
from the source content. Therefore, quality evaluation is employed to estimate the
overall visual quality of the delivered videos, based on which the content sender
may adjust the compression and error-resilience efforts for maintaining acceptable
Quality of Experience (QoE) at the receivers.
There have been two branches of quality evaluation, subjective and objective
quality assessment (QA). Subjective approaches use the quality scores averaged over human subjects (often called the Mean Opinion Score, MOS) to assess the quality of test videos degraded from the original videos. Objective methods develop
automatic metrics which extract possible distortions in a test video (no-reference
model) or analyze the differences between the test and reference (distortion-free)
videos (full-reference or reduced-reference model) and then translate them into
quality scores. Subjective evaluation is considered the most reliable way for QA,
but it is time-consuming and not practical in many applications. This triggers the
research on objective metrics that simulate the subjective evaluation [122–126].
Compared with 2D image/video QA, visual quality in 3D QA is more complex.
With the introduction of a new dimension, the overall visual quality may be related
to more attributes in addition to the texture quality concerned in 2D QA, such as
the sense of presence, perceived depth, and naturalness of the depth [127]. Therefore, at this early exploration stage, current 3D subjective evaluation experiments assess multiple quality factors in addition to an overall quality score [128]. Some subjective
evaluation methodologies for stereoscopic videos have been standardized, e.g.,
ITU-R Rec. BT 1438 [129], which basically uses the same methods for 2D QA
specified in ITU-R Rec. BT 500 [130] and adjusts the viewing conditions (e.g., the
recommended viewing distance and display size for mitigating the accommodation
and vergence conflict). New methodologies are under study in ITU-R SG06 [131].
Apart from visual quality degradation from signal corruption, some studies also
focus on the reduced QoE from inappropriate acquisition or display of stereoscopic
content, such as the baseline distance [132], temporal synchronization [133],
and cross talk [132]. Moreover, subjective experiments are employed to measure or re-estimate the binocular perception of stimuli prevalent in communication systems, e.g., blocking artifacts, blur, and noise. Some basic principles (e.g., the binocular averaging property [134]) discovered by the vision research community are adapted
into more practical formulas (e.g., a binocular contrast function [135]) or models
(e.g., a binocular just-noticeable difference model [136]) for video processing
researchers and engineers.
Objective 3D video assessment encounters two major challenges. On one hand,
videos are finally shown in the stereoscopic form, which requires new approaches
for stereoscopic 3D quality evaluation. This problem also exists in the conventional stereoscopic system. Simply extending a 2D metric by using the average (or other combinations) of the two views' 2D quality scores cannot achieve satisfactory evaluation performance [128, 137, 138]. Prediction accuracy can be
improved by further incorporating disparity/depth distortions, which reflect geometric changes or depth sensation degradation [138–140]. Moreover, the two eyes' images are finally combined in the brain to form a cyclopean image, as if it were perceived by a single eye placed at the middle of the two eyes [110, 141]. To further account for this important visual property, which is neglected in the above-mentioned schemes, a few attempts have been made at quality evaluation using
cyclopean images [142, 143]. However, these metrics always apply binocular
fusion (also using a rough model with simple averaging) for all correspondences,
without considering whether the distorted video signals evoke binocular rivalry.
In other words, the modeling of cyclopean images in these algorithms is hardly
complete. Readers can refer to Chap. 14 for recent advances on subjective and objective QA of stereoscopic videos.
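For illustration only, the fragment below shows the simplest kind of full-reference extension discussed above: averaging per-view 2D scores and penalizing disparity-map distortion. It is not a published metric; the weights are placeholders that would have to be trained on MOS data, and it deliberately ignores binocular fusion and rivalry, i.e., exactly the limitation pointed out in the text.

import numpy as np

def psnr(ref, test):
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def simple_stereo_quality(ref_l, ref_r, test_l, test_r, ref_disp, test_disp,
                          w_texture=0.8, w_disp=0.2):
    texture_score = 0.5 * (psnr(ref_l, test_l) + psnr(ref_r, test_r))
    disp_penalty = np.mean(np.abs(ref_disp.astype(float) - test_disp.astype(float)))
    return w_texture * texture_score - w_disp * disp_penalty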
On the other hand, it is necessary to understand the quality of the synthetic
images, since the presented stereoscopic images in this depth-based 3D system can
be either synthesized or captured. The artifacts induced by view synthesis are different from those of 2D video processing (e.g., transmission and compression), and state-of-the-art 2D metrics seem to be inadequate to assess synthesized views [144]. A few preliminary studies have been made to tackle this problem, e.g., evaluating the contour shifting [144], disocclusion regions [144, 145], and temporal noise [119] in synthesized videos, which leads to a better correlation with subjective evaluation
than 2D metrics (e.g., PSNR). More details on assessing DIBR-synthesized content
can be found in Chap. 15.

1.3 Depth-Based System Versus Stereoscopic System: Advantages, Challenges, and Perspectives

In this section, we compare the depth-based system and the conventional stereoscopic system, and discuss their respective advantages and drawbacks. In addition, related international research cooperation and standardization efforts are briefly addressed as well.

1.3.1 Differences and Advantages of the Depth-Based System


There are two major differences between the conventional stereoscopic system and
the depth-based system:
1. The content generation processes are different. Videos in the stereoscopic
system are typically obtained from stereoscopic/multi-view capture, while the
content generation for the depth-based system is more versatile with explicit


inclusion of depth data, and some displayed views may be synthesized based on
the DIBR technique. In a broader view, view synthesis (e.g., in 2D-to-3D
conversion and 3D animation) can also be considered a useful technique to
create the second view for stereoscopic 3D.
2. The transmitted data are different. Compared with the stereoscopic system that
compresses and transmits texture videos only, the depth-based system also
delivers the depth information. Accordingly, the receivers further need a view
synthesis module to recover the stereoscopic format before 3D displaying, and
the rendered views are free from inter-view color difference (in single-view-based rendering) or suffer less from it (in multiple-view-based rendering with view merging).
The depth-based system offers three obvious advantages, though it is more
complex than the stereoscopic counterpart.
1. The depth-based system can adjust the baseline distance of the presented stereo
pair, which cannot be achieved easily in stereoscopic 3D. Stereoscopic video
services present two fixed views to viewers. These two views, which are
captured by professional stereo cameras, usually have a wider baseline than our
pupil distance (typically 65 mm). This mismatch may lead to unsuitably large
disparities that make the displayed 3D object fall outside the stereoscopic
comfort zone (as mentioned in Sect. 1.2.5), thus inducing eye strain and visual
discomfort. Moreover, viewers may prefer different degrees of binocular disparity, depending on how intense a 3D perception they want, since the perceived depth increases with disparity. The stereoscopic video system fails to meet this requirement, unless there is a feedback channel to the sender to request another two views. The depth-based system can solve these problems by generating a new virtual view with a user-defined baseline, through which each viewer can conveniently control the 3D sensation (a small numerical sketch of this baseline scaling is given after this list).
2. The depth-based system is suitable for autostereoscopic displays, while the stereoscopic
3D system is not. Multi-view autostereoscopic displays, which are believed to
provide more natural 3D experience, are stepping into some niche markets
including advertisement and exhibition. As mentioned above, a series of discrete views are simultaneously presented on the screen, and rays from different
views are separated by a specific optical system into non-overlapping viewing
regions. To fit the delicate imaging system, multiple-view images are spatially
multiplexed into a complex pattern. Therefore, the view composition rules for
different multi-view displays are usually distinct, and thus it is impossible to
transmit one view-composite video that suits all multi-view displays. Moreover, multi-view displays may use a varying number of views, such as the 28-view Dimenco display [146] and the 8-view Alioscopy display [147], which makes the multi-view video transmission solution impractical as well. The depth-based system typically uses 1–3 views of MVD, or a Depth Enhanced Stereo (DES) [148] representation composed of two views of one-background-layer LDV.
Then, other required views can be synthesized based on the received data. View
synthesis can be conducted regardless of which views are actually delivered (i.e., view merging may be absent and hole filling may differ), and the synthesis
quality increases with the number of views. In this sense, the new framework is
more flexible to support the multi-view displays.
3. The depth-based framework exhibits higher transmission efficiency. With current compression techniques, it is reported that depth can generally be compressed with around 25 % of the texture bitrate for rendering a good-quality virtual view under one-view 2D + Z, especially when the disoccluded regions are simple to fill. In comparison, an MVC-coded stereo pair may require 160 % of the base-view bitrate [17] (since the second view can be compressed more efficiently with inter-view prediction). In this case, the 2D + Z solution consumes less bandwidth. Even when the disoccluded texture is so complex that a second view or layer is required to supply the missing information for high-quality rendering, the resulting stereoscopic MVD format increases the total bitrate by only roughly 31 %, while additionally providing the adjustable-baseline functionality (as 2D + Z does). For multi-view displays, this advantage becomes even more evident. It is perhaps too
demanding to capture, store, and transmit 8 or even 28 views over current
networks. In contrast, the depth-based system requires much fewer bits to
prepare and deliver the 3D information.
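Returning to advantage 1, the baseline adjustment can be sketched as follows under the usual parallel-camera assumption, where the pixel shift of a point is proportional to the camera baseline (d = f * b / Z). This is a naive forward warp without occlusion handling, hole filling or view merging; the camera parameters, the 8-bit inverse-depth quantization and the sign convention (virtual camera to the right of the reference view) are all assumptions for illustration.

import numpy as np

def render_scaled_baseline(texture, depth8, alpha, f_px, baseline_m, z_near, z_far):
    # Warp `texture` to a virtual view whose baseline is alpha * baseline_m.
    h, w = depth8.shape
    inv_z = depth8 / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far  # 1/Z per pixel
    shift = np.round(alpha * f_px * baseline_m * inv_z).astype(int)      # pixel disparity
    virtual = np.zeros_like(texture)
    for y in range(h):
        for x in range(w):
            xv = x - shift[y, x]  # sign flips for a virtual camera on the other side
            if 0 <= xv < w:
                virtual[y, xv] = texture[y, x]
    return virtual  # disoccluded pixels remain black (no hole filling)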

1.3.2 Challenges and Research Cooperation


Although the depth-based system seems attractive, the second-generation 3D applications mostly depend on the maturity of the content generation techniques. High-quality depth maps, which are spatially accurate, temporally consistent, and well aligned with the texture images, are indeed not easy to obtain at present. As mentioned in Sect. 1.2.1, the automatic stereo matching
and depth capture methods cannot robustly handle all kinds of scenes, thus producing depth errors in many cases. Combining the two solutions improves the
algorithm robustness, but manual assistance is still necessary for high-quality content.
Compared with content generation, other components like 3D content delivery
and visualization in this system are technically less challenging, given that the current solutions, though not optimal, can generally support the basic functionality. However, there are still quite a few issues to be improved or solved. More efficient depth coding (and/or depth-assisted texture coding) is still in demand. Since view synthesis is involved at the terminals, the extra complexity (perhaps similar to or even higher than that of video decoding) should be considered and reduced for low-power devices, like mobile phones. Health risks from stereoscopic
viewing have been attracting much attention, which stimulates the advances of
display technology and signal processing algorithms for creating visually


comfortable 3D content. 3D quality evaluation is still at an early stage of research, and it is not appropriate to simply use or extend the 2D methods.
The tough technical challenges have triggered much international research
cooperation. For example, the Information and Communication Technologies
(ICT) area of Framework 7 of the European Commission has funded about eight
projects on 3D-TV [149, 150]. Among them, there are four projects involving the
depth-based formats:
1. 3D4YOU [151], which is dedicated to the content generation and delivery of
the depth-based formats;
2. 2020 3D MEDIA [152], which develops a complete chain from capture (with
structured light devices for depth sensing) to display;
3. MOBILE 3D-TV [153], which aims at the transmission of 3D content to mobile
users and is evaluating different 3D formats for mobile 3D services;
4. 3DPHONE [154], a project for all fundamental elements of a 3D phone prototype, including the media display, user interface (UI), and personal information management (PIM) applications in 3D but usable without any stereo
glasses.

1.3.3 Standardization Efforts


Standardization is the process of developing technical standards, which is often
promoted by industry companies. After competing members in a standardization
organization consent to a common solution, a technical standard is released, which
may be a standard definition, data format, specification, test method, process, and
so on. In the context of telecommunication, the major purposes are (1) to make data flow smoothly among, and be understood by, different devices compliant with a standard, and (2) to develop a commonly accepted method for evaluating quality of service. A standard helps reduce fragmentation in a market, thus contributing to the industrialization of certain applications.
To facilitate the depth-based video service, at least three aspects need to be standardized. First, the 3D video data format(s) should be defined
in order to make different groups in the whole industry chain speak the same
language. Several formats are under consideration, such as the 2D ? Z, MVD,
LDV, and even the DES, which is a combination of MVD and LDV. The Society of Motion Picture and Television Engineers (SMPTE) established a Task Force on 3D to the Home [155] to assess different 3D formats for establishing a 3D Home Master standard, in which stereoscopic 3D is the main focus while the depth-based formats are considered as optional solutions.
Then, a data compression standard is needed which clearly specifies the decoding process (and thereby also implies how the video should be encoded), such that all receivers can decode any bitstream compressed in conformance with this standard.


The two points, data format and compression technique, are always connected tightly
and standardized together. To meet the different requirements by various applications,
a compression standard may have many profiles for supporting slightly different data
formats (e.g., YUV4:2:0 and YUV4:4:4) and coding tools. The Moving Picture Experts Group (MPEG) has been establishing a 3D video compression standard for the depth-based formats, and issued a call for proposals on 3D video coding technology [156] in March 2011. Both MVD and LDV are highlighted in this group, while MVD is currently supported by more members. Various compression solutions have been submitted for evaluation and competition at the time of this writing, and the standard is scheduled to be finalized by 2013 or early 2014 [157]. Stereoscopic videos can basically be coded using two-view simulcast or frame-compatible stereo (where the left and right views are spatially multiplexed, e.g., side-by-side or top-bottom) with H.264/AVC [80], or using the more efficient MVC to exploit inter-view correlation [83]. MPEG is also developing a more advanced frame-compatible format [158], using an enhancement layer to supply information missing from the frame-compatible base view, which is targeted at higher-quality applications while ensuring backward compatibility with existing frame-compatible 3D services [17, 159].
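Frame-compatible packing itself is easy to illustrate. The Python fragment below builds a side-by-side frame by simple column decimation (top-bottom packing would decimate rows instead); real systems apply anti-alias filtering before subsampling, and the function name is only illustrative.

import numpy as np

def side_by_side(left, right):
    # left, right: (H, W, 3) arrays with even W.  Each view loses half of its
    # horizontal resolution and the two halves share one legacy-resolution frame.
    half_l = left[:, ::2, :]
    half_r = right[:, ::2, :]
    return np.concatenate([half_l, half_r], axis=1)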
The third part is the standardization for 3D video transmission. A broadcast
standard often specifies the spectrum, bandwidth, carrier modulation, error correcting code, audio and video compression method, data multiplexing, etc. Different
nations adopt different digital television broadcast standards [160], such as the DVB
[161] in Europe and ATSC [162] in North America. At present, there are no standards
for broadcasting depth-based formats, while some of them (e.g., DVB [163] and
ATSC [164]) have been working on the stereoscopic format. So far, more than 30
channels have been broadcasting stereoscopic 3D over the world since 2008, many of
which actually started before the DVB-3D-TV standard [162] was issued. It is
encouraging that study group 6 of International Telecommunication Union (ITU)
released a report in early 2010 outlining a roadmap for future 3D-TV implementation
[165], which considered the plano-stereoscopic and depth-based frameworks as the
first and second generation of 3D-TV, respectively.
In addition to the three major aspects, standards may be established for specific
devices or interfaces to facilitate a video service. Typical examples are the Blu-ray
specification for Blu-ray discs and players, and the HDMI specification for connecting two local devices (e.g., a set-top box and a display). These standards have been extended to support stereoscopic 3D, such as the Blu-ray 3D specification [166] and HDMI 1.4 [167]. Overall, however, the depth-based system is still at a nascent stage. On
the other hand, since the two compared 3D systems share a few common components
(e.g., video capture and display), some standards contribute to both frameworks. For
example, the Consumer Electronics Association (CEA), a standards and trade organization for the consumer electronics industry in the United States, has been developing a standard for active 3D glasses with an infrared synchronization interface to strengthen the interoperability of shutter glasses in the near-future market [168].
Besides, the Video Quality Experts Group (VQEG) has been working on standards for assessing 3D quality of service. Its short-term goals include measuring different aspects of 3D quality (e.g., image quality, depth perception, naturalness, and visual fatigue) and improving subjective test procedures for 3D video.
Moreover, international forums/organizations serve as an active force to accelerate the progress of 3D video services. For example, 3D@Home [169] is a non-profit consortium targeting faster adoption of quality 3D at home. It consists of six steering teams, covering content creation, storage and transmission, 3D promotion, consumer products, human factors, and mobile 3D. This consortium has been producing numerous white papers and expert recommendations for good 3D content.

1.4 Conclusion
This chapter sheds some light on the development of the depth-based 3D-TV system, focusing on the technical challenges, typical solutions, standardization efforts, and performance comparison with the stereoscopic 3D system. Research trends and detailed discussions on many aspects of the depth-based 3D-TV framework will be touched upon in the following tutorial chapters. In sum, the new system appears to be more flexible and efficient in supporting stereoscopic and multi-view 3D displays, and is believed to bring more comfortable 3D visual sensation, at the cost of extra computational complexity, such as that of depth estimation and view synthesis. These new technologies also slightly jeopardize the backward compatibility with the current 2D broadcast infrastructure.
The first generation of stereoscopic 3D is clearly on the roadmap. Limited by
insufficient 3D content and immature depth acquisition, the second-generation
3D-TV with depth-based representations and DIBR technique is still under
development, and it may take a few years for some key technologies to flourish
and mature. Meanwhile, some technical solutions within this system, especially
the depth-based view rendering and 3D content editing with human factors, are
also useful for the current stereoscopic system. Stereoscopic 3D is acting as a pilot that gains public acceptance of the 3D-TV service, and its momentum may also propel the demand for, and maturity of, the depth-based 3D-TV system.
Acknowledgment The authors thank Philips and Microsoft for kindly providing the Mobile
and Ballet sequences. They are also grateful to Dr. Vincent Jantet for preparing the LDI images
in Fig. 1.4. This work is partially supported by the National Basic Research Program of China
(973) under Grant No.2009CB320903 and Singapore Ministry of Education Academic Research
Fund Tier 1 (AcRF Tier 1 RG7/09).

References
1. Television Invention Timeline Available: http://www.history-timelines.org.uk/eventstimelines/08-television-invention-timeline.htm
2. Ito T (2010) Future television – super hi-vision and beyond. In: Proceedings of IEEE Asian solid-state circuits conference, Nov 2010, Beijing, China, pp 1–4


3. Kubota A, Smolic A, Magnor M, Tanimoto M, Chen T, Zhang C (2007) Multiview imaging and 3DTV. IEEE Signal Process Mag 24(6):10–21
4. Onural L (2010) Signal processing and 3DTV. IEEE Signal Process Mag 27(5):142–144
5. Tanimoto M, Tehrani MP, Fujii T, Yendo T (2011) Free-viewpoint TV. IEEE Signal
Process Mag 28(1):6776
6. Konrad J, Halle M (2007) 3-D displays and signal processing. IEEE Signal Process Mag
24(7):97111
7. Benzie P, Watson J, Surman P, Rakkolainen I, Hopf K, Urey H, Sainov V, von Kopylow C
(2007) A survey of 3DTV displays: techniques and technologies. IEEE Trans Circuits Syst
Video Technol 17(11):16471658
8. Holliman NS, Dodgson NA, Favalora GE, Pockett L (2011) Three-dimensional displays:
a review and applications analysis. IEEE Trans Broadcast 57(2):362371
9. Urey H, Chellappan KV, Erden E, Surman P (2011) State of the art in stereoscopic and
autostereoscopic displays. Proc IEEE 99(4):540555
10. Cho M, Daneshpanah M, Moon I, Javidi B (2011) Three-dimensional optical sensing and
visualization using integral imaging. Proc IEEE 99(4):556575
11. Onural L, Yaraş F, Kang H (2011) Digital holographic three-dimensional video displays. Proc IEEE 99(4):576–589
12. Favalora GE (2005) Volumetric 3D displays and application infrastructure. Computer 38(8):37–44
13. Chen T, Kashiwagi Y (2010) Subjective picture quality evaluation of MVC stereo high
profile for full-resolution stereoscopic high-definition 3D video applications. In:
Proceedings of IASTED conference signal image processing, Maui, HI, Aug 2010
14. World Cup 2010 in 3D TV Available: http://www.itu.int/net/itunews/issues/2010/06/54.aspx
15. Müller K, Merkle P, Wiegand T (2011) 3-D video representation using depth maps. Proc IEEE 99(4):643–656
16. Smolic A, Kauff P, Knorr S, Hornung A, Kunter M, Müller M, Lang M (2011) Three-dimensional video postproduction and processing. Proc IEEE 99(4):607–625
17. Vetro A, Tourapis AM, Müller K, Chen T (2011) 3D-TV content storage and transmission. IEEE Trans Broadcast 57(2):384–394
18. Fehn C (2003) A 3D-TV approach using depth-image-based rendering (DIBR). In:
Proceedings of visualization, imaging and image processing (VIIP), pp 482487
19. Fehn C (2004) Depth-image-based rendering (DIBR), compression and transmission for a
new approach on 3D-TV. In: Proceedings of stereoscopic displays virtual reality systems
XI, San Jose, CA, USA, Jan 2004, pp 93104
20. Merkle P, Smolic A, Mller K, Wiegand T (2007) Multi-view video plus depth
representation and coding. In: Proceedings of international conference on image
processing, pp I-201-I-204
21. Shade J, Gortler S, He L, Szeliski R (1998) Layered depth images. In: Proceedings of the
25th annual conference on computer graphics and interactive techniques, New York, NY,
USA, pp 231242
22. Jot JM, Larcher V, Pernaux JM (1999) A comparative study of 3-D audio encoding and
rendering techniques. In: Proceedings of 16th AES international conference, Mar 1999
23. Poletti M (2005) Three-dimensional surround sound systems based on spherical harmonics.
J Audio Eng Soc 53(11):10041025
24. Fazi F, Nelson P, Potthast R (2009) Analogies and differences between three methods for
soundfield reproduction. In: Proceedings of ambisonics symposium, Graz, Austria, June 2009
25. Okamoto T, Cui ZL, Iwaya Y, Suzuki Y (2010) Implementation of a high-definition 3D
audio-visual display based on higher order ambisonics using a 157-loudspeaker array
combined with a 3D projection display. In: Proceedings of international conference on
network infrastructure and digital content (IC-NIDC), pp 179183
26. André C, Embrechts JJ, Verly JG (2010) Adding 3D sound to 3D cinema: identification and
evaluation of different reproduction techniques. In: Proceedings of international conference
on audio, language and image processing, Nov 2010, Shanghai, China, pp 130137


27. Strohmeier D, Jumisko-Pyykkö S (2008) How does my 3D video sound like? Impact of loudspeaker set-ups on audiovisual quality on mid-sized autostereoscopic display. In: Proceedings of 3DTV conference, pp 73–76
28. Narayanan PJ, Rander P, Kanade T (1995) Synchronous capture of image sequences from
multiple cameras. Technical report CMU-RI-TR-95-25, Robotics Institute, Carnegie Mellon
University, Dec 1995
29. Wilburn B, Joshi N, Vaish V, Talvala E-V, Antunez E, Barth A, Adams A, Levoy M,
Horowitz M (2005) High performance imaging using large camera arrays. In: Proceedings
of ACM SIGGRAPH, Los Angeles, CA, USA
30. Smolic A (2011) 3D video and free viewpoint video – from capture to display. Pattern Recognit 44(9):1958–1968
31. Owens JD, Luebke D, Govindaraju N, Harris M, Krüger J, Lefohn AE, Purcell T (2007) A survey of general-purpose computation on graphics hardware. Comput Graph Forum 26(1):80–113
32. Whitted T (1980) An improved illumination model for shaded display. Commun ACM
23(6):343349
33. Kanade T, Okutomi M (1994) A stereo matching algorithm with an adaptive window:
theory and experiment. IEEE Trans Pattern Anal Mach Intell 16(9):920932
34. Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. Int J Comput Vision 47(1–3):7–42
35. Sun J, Zheng N, Shum H (2003) Stereo matching using belief propagation. IEEE Trans
Pattern Anal Mach Intell 25(7):787800
36. Boykov Y, Kolmogorov V (2004) An experimental comparison of min-cut/max-flow
algorithms for energy minimization in vision. IEEE Trans Pattern Anal Mach Intell
26(9):11241137
37. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput
Vision 60(2):91–110
38. Lange R, Seitz P (2001) Solid-state time-of-flight range camera. IEEE J Quantum Electron
37(3):390397
39. Scharstein D, Szeliski R (2003) High-accuracy stereo depth maps using structured light.
In: Proceedings of computer vision and pattern recognition, pp 195202
40. Yang Q, Yang R, Davis J, Nistér D (2007) Spatial-depth super resolution for range images. In: Proceedings of international conference computer vision and pattern recognition, pp 1–8
41. Zhu J, Wang L, Yang R, Davis J (2008) Fusion of time-of-flight depth and stereo for high
accuracy depth maps. In: Proceedings of IEEE conference on computer vision and pattern
recognition, pp 231236, June 2008
42. Schiller I, Beder C, Koch R (2008) Calibration of a PMD-camera using a planar calibration
pattern together with a multi-camera setup. In: Proceedings of the society of photogrammetry
and remote sensing, pp 297302
43. Bartczak B, Koch R (2009) Dense depth maps from low resolution time-of-flight depth and
high resolution color views. In: Proceedings of international symposium on advanced visual
computing, pp 228239
44. Harman P, Flack J, Fox S, Dowley M (2002) Rapid 2D to 3D conversion. In: Proceedings of
SPIE, vol 4660. pp 7886
45. Tam WJ, Zhang L (2006) 3D-TV content generation: 2D-to-3D conversion. In: Proceedings
of IEEE international conference on multimedia and expo (ICME), Toronto, Canada
46. Zhang L, Vazquez C, Knorr S (2011) 3D-TV content creation: automatic 2D-to-3D video
conversion. IEEE Trans Broadcast 57(2):372383
47. Battiato S, Curti S, La Cascia M (2004) Depth map generation by image classification.
In: Proceedings of SPIE, vol 5302. pp 95104
48. Ens J, Lawrence P (1993) An investigation of methods for determining depth from focus.
IEEE Trans Pattern Anal Mach Intell 15(2):97108


49. Moustakas K, Tzovaras D, Strintzis MG (2005) Stereoscopic video generation based on


efficient layered structure and motion estimation from a monoscopic image sequence. IEEE
Trans Circuits Syst Video Technol 15(8):10651073
50. Feng Y, Ren J, Jiang J (2011) Object-based 2D-to-3D video conversion for effective
stereoscopic content generation in 3D-TV applications. IEEE Trans Broadcast 57(2):500509
51. Cheng X, Sun L, Yang S (2007) Generation of layered depth images from multi-view video.
In: Proceedings of IEEE international conference on image processing (ICIP07), San
Antonio, TX, USA, vol 5. pp 225228, Sept 2007
52. Jantet V, Morin L, Guillemot C (2009) Incremental-LDI for multi-view coding. In:
Proceedings of 3DTV conference, Potsdam, Germany, pp 14, May 2009
53. Daribo I, Saito H (2011) A novel inpainting-based layered depth video for 3DTV. IEEE
Trans Broadcast 57(2):533541
54. Bartczak B et al (2011) Display-independent 3D-TV production and delivery using the
layered depth video format. IEEE Trans Broadcast 57(2):477490
55. Lou J, Cai H, Li J (2005) A real-time interactive multi-view video system. In: Proceedings of
the 13th annual ACM international conference on multimedia, Hilton, Singapore, Nov 2005
56. Matusik WJ, Pfister H (2004) 3D TV: a scalable system for real-time acquisition, transmission,
and autostereoscopic display of dynamic scenes. ACM Trans Graph 23(3):814824
57. Cao X, Liu Y, Dai Q (2009) A flexible client-driven 3DTV system for real-time acquisition,
transmission, and display of dynamic scenes. EURASIP J Adv Sig Process, vol 2009.
Article ID 351452, pp 115
58. Stankowski J, Klimaszewski K, Stankiewicz O, Wegner K, Domanski M (2010)
Preprocessing methods used for Poznan 3D/FTV test sequences. ISO/IEC JTC1/SC29/
WG11 Doc. M17174, Jan 2010
59. Yamamoto K, Kitahara M, Kimata H, Yendo T, Fujii T, Tanimoto M, Shimizu S, Kamikura
K, Yashima Y (2007) Multiview video coding using view interpolation and color correction.
IEEE Trans Circuits Syst Video Technol 17(11):14361449
60. Fecker U, Barkowsky M, Kaup A (2008) Histogram-based prefiltering for luminance and
chrominance compensation of multiview video. IEEE Trans Circuits Syst Video Technol
18(9):12581267
61. Doutre C, Nasiopoulos P (2009) Color correction preprocessing for multi-view video
coding. IEEE Trans Circuits Syst Video Technol 19(9):14001405
62. Zhang Z (2000) A flexible new technique for camera calibration. IEEE Trans Pattern Anal
Mach Intell 22(11):13301334
63. Hartley R, Zisserman A (2000) Multiple view geometry in computer vision. Cambridge
University Press, Cambridge
64. Mendiburu B (2009) 3D movie making: stereoscopic digital cinema from script to screen.
Focal Press, Burlington
65. Varekamp C, Barenbrug B (2007) Improved depth propagation for 2D to 3D video
conversion using key-frames. In: Proceedings of 4th IET European conference on visual
media production, pp 17, Nov 2007
66. Roth SD (1982) Ray casting for modeling solids. Comput Graph Image Process 18(2):109144
67. Frick A, Bartczak B, Koch R (2010) Real-time preview for layered depth video in 3D-TV.
In: Proceedings of real-time image and video processing, vol 7724. pp 77240F-1-10
68. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV.
IEEE Trans Broadcast 51(2):191199
69. Müller K, Smolic A, Dix K, Merkle P, Kauff P, Wiegand T (2008) View synthesis for
advanced 3D video systems. EURASIP J Image Video Process, vol 2008. Article ID 438148
70. Tian D, Lai P, Lopez P, Gomila C (2009) View synthesis techniques for 3D video. In:
Proceedings of applications of digital image processing XXXII, vol 7443. pp 74430T-1-11
71. Mori Y, Fukushima N, Yendo T, Fujii T, Tanimoto M (2009) View generation with 3D
warping using depth information for FTV. Sig Process: Image Commun 24(1–2):65–72
72. Zinger S, Do L, de With PHN (2010) Free-viewpoint depth image based rendering. J Vis
Commun Image Represent 21:533541


73. Bertalmio M, Bertozzi AL, Sapiro G (2001) Navier-Stokes, fluid dynamics, and image and
video inpainting. In: Proceedings of IEEE international conference on computer vision and
pattern recognition, pp 355362
74. Oh K, Yea S, Ho Y (2009) Hole-filling method using depth based in-painting for view
synthesis in free viewpoint television (FTV) and 3D video. In: Picture coding symposium
(PCS), Chicago, pp 233236
75. Daribo I, Pesquet-Popescu B (2010) Depth-aided image inpainting for novel view synthesis.
In: Proceedings of IEEE international workshop on multimedia signal processing (MMSP)
76. Ndjiki-Nya P, Köppel M, Doshkov D, Lakshman H, Merkle P, Müller K, Wiegand T (2011) Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Trans Multimedia 13(3):453–465
77. Schmeing M, Jiang X (2010) Depth image based rendering: a faithful approach for the
disocclusion problem. In: Proceedings of 3DTV conference, pp 14
78. Zhao Y, Zhu C, Chen Z, Tian D, Yu L (2011) Boundary artifact reduction in view synthesis of 3D video: from perspective of texture-depth alignment. IEEE Trans Broadcast 57(2):510–522
79. Zhao Y, Yu L (2010) A perceptual metric for evaluating quality of synthesized sequences in 3DV
system. In: Proceedings of visual communications and image processing (VCIP), July 2010
80. Wiegand T, Sullivan GJ, Bjøntegaard G, Luthra A (2003) Overview of the H.264/AVC video coding standard. IEEE Trans Circuits Syst Video Technol 13(7):560–576
81. Vetro A, Yea S, Zwicker M, Matusik W, Pfister H (2007) Overview of multiview video
coding and anti-aliasing for 3D displays. In: Proceedings of international conference on
image processing, vol 1. pp I-17I-20, Sept 2007
82. Merkle P, Smolic A, Müller K, Wiegand T (2007) Efficient prediction structures for multiview video coding. IEEE Trans Circuits Syst Video Technol 17(11):1461–1473
83. Chen Y, Wang Y-K, Ugur K, Hannuksela M, Lainema J, Gabbouj M (2009) The emerging
MVC standard for 3D video services. EURASIP J Adv Sig Process 2009(1), Jan 2009
84. Merkle P, Morvan Y, Smolic A, Farin D, Müller K, de With PHN, Wiegand T (2009) The effects of multiview depth video compression on multiview rendering. Sig Process: Image Commun 24(1–2):73–88
85. Kim W-S, Ortega A, Lai P, Tian D, Gomila C (2010) Depth map coding with distortion
estimation of rendered view. In: Proceedings of SPIE visual information processing and
communication, vol 7543. pp 75430B75430B-10
86. Tikanmaki A, Gotchev A, Smolic A, Müller K (2008) Quality assessment of 3D video in
rate allocation experiments. In: Proceedings of IEEE international symposium on consumer
electronics
87. Kang M-K, Ho Y-S (2010) Adaptive geometry-based intra prediction for depth video
coding. In: Proceedings of IEEE international conference on multimedia and expo (ICME),
July 2010, pp 12301235
88. Kim W-S, Ortega A, Lai P, Tian D, Gomila C (2009) Depth map distortion analysis for
view rendering and depth coding. In: Proceedings of international conference on image
processing
89. Oh K-J, Vetro A, Ho Y-S (2011) Depth coding using a boundary reconstruction filter for 3D
video systems. IEEE Trans Circuits Syst Video Technol 21(3):350359
90. Zhao Y, Zhu C, Chen Z, Yu L (2011) Depth no-synthesis error model for view synthesis in
3D video. IEEE Trans Image Process 20(8):22212228, Aug 2011
91. Daribo I, Tillier C, Pesquet-Popescu B (2008) Adaptive wavelet coding of the depth map for
stereoscopic view synthesis. In: Proceedings of IEEE international workshop on multimedia
signal processing (MMSP08), Cairns, Australia, pp 3439, Oct 2008
92. Liu S, Lai P, Tian D, Chen CW (2011) New depth coding techniques with utilization of
corresponding video. IEEE Trans Broadcast 57(2):551561
93. Shimizu S, Kitahara M, Kimata H, Kamikura K, Yashima Y (2007) View scalable multiview video coding using 3-D warping with depth map. IEEE Trans Circuits Syst Video
Technol 17(11):14851495


94. Yea S, Vetro A (2009) View synthesis prediction for multiview video coding. Sig Process
Image Commun 24(1–2):89–100
95. Lin YH, Wu JL (2011) A depth information based fast mode decision algorithm for color
plus depth-map 3D videos. IEEE Trans Broadcast 57(2):542550
96. Merkle P, Wang Y, Müller K, Smolic A, Wiegand T (2009) Video plus depth compression
for mobile 3D services. In: Proceedings of 3DTV conference
97. Wang Y, Zhu Q-F (1998) Error control and concealment for video communication:
a review. Proc IEEE 86(5):974997
98. Wang Y, Wenger S, Wen J, Katsaggelos A (2000) Error resilient video coding techniques.
IEEE Signal Process Mag 17(4):6182
99. Stockhammer T, Hannuksela M, Wiegand T (2003) H.264/AVC in wireless environments.
IEEE Trans Circuits Syst Video Tech 13(7):657673
100. Zhang R, Regunathan SL, Rose K (2000) Video coding with optimal inter/intra-mode
switching for packet loss resilience. IEEE J Sel Areas Commun 18(6):966976
101. Zhang J, Arnold JF, Frater MR (2000) A cell-loss concealment technique for MPEG-2
coded video. IEEE Trans Circuits Syst Video Technol 10(6):659665
102. Agrafiotis D, Bull DR, Canagarajah CN (2006) Enhanced error concealment with mode
selection. IEEE Trans Circuits Syst Video Technol 16(8):960973
103. Xiang X, Zhao D, Wang Q, Ji X, Gao W (2007) A novel error concealment method for
stereoscopic video coding. In: Proceedings of international conference on image processing
(ICIP2007), pp 101104
104. Akar GB, Tekalp AM, Fehn C, Civanlar MR (2007) Transport methods in 3DTV – a survey. IEEE Trans Circuits Syst Video Technol 17(11):1622–1630
105. Tan AS, Aksay A, Akar GB, Arikan E (2009) Rate-distortion optimization for stereoscopic
video streaming with unequal error protection. EURASIP J Adv Sig Process, vol 2009.
Article ID 632545, Jan 2009
106. De Silva DVSX, Fernando WAC, Worrall ST (2010) 3D video communication scheme for
error prone environments based on motion vector sharing. In: Proceedings of IEEE 3DTV-CON, Tampere, Finland
107. Yan B (2007) A novel H.264 based motion vector recovery method for 3D video
transmission. IEEE Trans Consum Electron 53(4):15461552
108. Liu Y, Wang J, Zhang H (2010) Depth image-based temporal error concealment for 3-D
video transmission. IEEE Trans Circuits Syst Video Technol 20(4):600604
109. Chung TY, Sull S, Kim CS (2011) Frame loss concealment for stereoscopic video plus
depth sequences. IEEE Trans Consum Electron 57(3):13361344
110. Howard IP, Rogers BJ (1995) Binocular vision and stereopsis. Oxford University Press,
Oxford
111. Yano S, Ide S, Mitsuhashi T, Thwaites H (2002) A study of visual fatigue and visual
comfort for 3D HDTV/HDTV images. Displays 23(4):191201
112. Hoffman DM, Girshick AR, Akeley K, Banks MS (2008) Vergence-accommodation
conflicts hinder visual performance and cause visual fatigue. J Vis 8(3):1–30
113. Lambooij MTM, IJsselsteijn WA, Fortuin M, Heynderickx I (2009) Visual discomfort and visual fatigue of stereoscopic displays: a review. J Imaging Sci Technol 53(3):030201-1–030201-14, May–Jun 2009
114. Tam WJ, Speranza F, Yano S, Shimono K, Ono H (2011) Stereoscopic 3D-TV: visual
comfort. IEEE Trans Broadcast 57(2):335346
115. Lang M, Hornung A, Wang O, Poulakos S, Smolic A, Gross M (2010) Nonlinear disparity mapping for stereoscopic 3D. ACM Trans Graph 29(4):75:1–75:10, July 2010
116. Nojiri Y, Yamanoue H, Ide S, Yano S, Okano F (2006) Parallax distribution and visual
comfort on stereoscopic HDTV. In: Proceedings of IBC, pp 373380
117. Gunnewiek RK, Vandewalle P (2010) How to display 3D content realistically. In: Proceedings
of international workshop video processing quality metrics consumer electronics (VPQM),
Jan 2010


118. Daly SJ, Held RT, Hoffman DM (2011) Perceptual issues in stereoscopic signal processing.
IEEE Trans Broadcast 57(2):347361
119. Zhao Y, Yu L (2010) A perceptual metric for evaluating quality of synthesized sequences in 3DV
system. In: Proceedings of visual communications and image processing (VCIP), July 2010
120. Yamanoue H, Okui M, Okano F (2006) Geometrical analysis of puppet-theatre and
cardboard effects in stereoscopic HDTV images. IEEE Trans Circuits Syst Video Technol
16(6):744752
121. Wittlief K (2007) Stereoscopic 3D film and animation – getting it right. Comput Graph 41(3), Aug 2007. Available: http://www.siggraph.org/publications/newsletter/
volume/stereoscopic-3d-film-and-animationgetting-it-right
122. Sheikh HR, Sabir MF, Bovik AC (2006) A statistical evaluation of recent full reference
image quality assessment algorithms. IEEE Trans Image Process 15(11):34403451
123. Engelke U, Zepernick HJ (2007) Perceptual-based quality metrics for image and video
services: a survey. In: 3rd EuroNGI conference on next generation internet networks,
pp 190197
124. Seshadrinathan K, Soundararajan R, Bovik AC, Cormack LK (2010) Study of subjective
and objective quality assessment of video. IEEE Trans Image Process 19(6):1427–1441
125. Chikkerur S, Vijay S, Reisslein M, Karam LJ (2011) Objective video quality assessment
methods: a classification, review, and performance comparison. IEEE Trans Broadcast
57(2):165182
126. Zhao Y, Yu L, Chen Z, Zhu C (2011) Video quality assessment based on measuring
perceptual noise from spatial and temporal perspectives. IEEE Trans Circuits Syst Video
Technol 21(12):18901902
127. IJsselsteijn W, de Ridder H, Hamberg R, Bouwhuis D, Freeman J (1998) Perceived depth
and the feeling of presence in 3DTV. Displays 18(4):207214
128. Yasakethu SLP, Hewage CTER, Fernando WAC, Kondoz AM (2008) Quality analysis for
3D video using 2D video quality models. IEEE Trans Consum Electron 54(4):19691976
129. ITU-R Rec. BT.1438 (2000) Subjective assessment of stereoscopic television pictures.
International Telecommunication Union
130. ITU-R Rec. BT.500-11 (2002) Methodology for the subjective assessment of the quality of
television pictures. International Telecommunication Union
131. ITU-R (2008) Digital three-dimensional (3D) TV broadcasting. Question ITU-R 128/6
132. Xing L, You J, Ebrahimi T, Perkis A (2010) An objective metric for assessing quality of
experience on stereoscopic images. In: Proceedings of IEEE international workshop on
multimedia signal processing (MMSP), pp 373378
133. Goldmann L, Lee JS, Ebrahimi T (2010) Temporal synchronization in stereoscopic video:
influence on quality of experience and automatic asynchrony detection. In: Proceedings of
international conference on image processing (ICIP), Hong Kong, pp 32413244, Sept 2010
134. Levelt WJ (1965) Binocular brightness averaging and contour information. Brit J Psychol
56:113
135. Stelmach LB, Tam WJ (1998) Stereoscopic image coding: effect of disparate image-quality
in left- and right-eye views. Sig Process: Image Commun 14:111117
136. Zhao Y, Chen Z, Zhu C, Tan Y, Yu L (2011) Binocular just-noticeable-difference model for
stereoscopic images. IEEE Signal Process Lett 18(1):1922
137. Hewage CTER, Worrall ST, Dogan S, Villette S, Kondoz AM (2009) Quality evaluation of
color plus depth map based stereoscopic video. IEEE J Sel Top Sig Process 3(2):304318
138. You J, Xing L, Perkis A, Wang X (2010) Perceptual quality assessment for stereoscopic
images based on 2D image quality metrics and disparity analysis. In: Proceedings of 5th
international workshop on video processing and quality metrics for consumer electronics
(VPQM), Scottsdale, AZ, USA
139. Benoit A, Le Callet P, Campisi P, Cousseau R (2008) Quality assessment of stereoscopic
images. EURASIP J Image Video Process, vol 2008. Article ID 659024
140. Lambooij M (2011) Evaluation of stereoscopic images: beyond 2D quality. IEEE Trans
Broadcast 57(2):432444


141. Julesz B (1971) Foundations of cyclopean perception. The University of Chicago Press, Chicago
142. Boev A, Gotchev A, Egiazarian K, Aksay A, Akar GB (2006) Towards compound stereo-video quality metric: a specific encoder-based framework. In: Proceedings of IEEE
southwest symposium on image analysis and interpretation, pp 218222
143. Maalouf A, Larabi M-C (2011) CYCLOP: a stereo color image quality assessment metric.
In: Proceedings of IEEE international conference on acoustics, speech and signal processing
(ICASSP), pp 11611164
144. Bosc E, Pepion R, Le Callet P, Koppel M, Ndjiki-Nya P, Pressigout M, Morin L (2011)
Towards a new quality metric for 3-D synthesized view assessment. IEEE J Sel Top Sig
Process 5(7):13321343
145. Shao H, Cao X, Er G (2009) Objective quality assessment of depth image based rendering in
3DTV system. In: Proceedings of 3DTV conference, pp 14
146. Dimenco display Available: http://www.dimenco.eu/displays/
147. Alioscopy display Available: http://www.alioscopy.com/3d-solutions-displays
148. Smolic A, Müller K, Merkle P, Kauff P, Wiegand T (2009) An overview of available and
emerging 3D video formats and depth enhanced stereo as efficient generic solution. In:
Proceedings of picture coding symposium (PCS), pp 389392
149. Grau O, Borel T, Kauff P, Smolic A, Tanger R (2011) 3D-TV R&D activities in Europe.
IEEE Trans Broadcast 57(2):408420
150. Seventh Framework Programme (FP7) Available: http://cordis.europa.eu/fp7/home_en.html
151. 3D4YOU Available: http://www.3d4you.eu/
152. 2020 3D Media Available: http://www.20203dmedia.eu/
153. Mobile 3DTV Available: http://sp.cs.tut.fi/mobile3dtv/
154. 3DPHONE Available: http://www.3dphone.org/
155. Report of SMPTE task force on 3D to the Home Available: http://store.smpte.org/product-p/
tf3d.htm
156. Video and Requirement Group (2011) Call for proposals on 3D video coding technology.
ISO/IEC JTC1/SC29/WG11 Doc. N12036, Mar 2011
157. Video Group (2011) Standardization tracks considered in 3D video coding. ISO/IEC JTC1/
SC29/WG11 Doc. N12434, Dec 2011
158. Video and Requirement Group (2011) Draft call for proposals on MPEG frame-compatible
enhancement. ISO/IEC JTC1/SC29/WG11 Doc. N12249, Jul 2011
159. Tourapis AM, Pahalawatta P, Leontaris A, He Y, Ye Y, Stec K, Husak W (2010) A frame
compatible system for 3D delivery. ISO/IEC JTC1/SC29/WG11 Doc. M17925, Jul 2010
160. Wu Y, Hirakawa S, Reimers U, Whitaker J (2006) Overview of digital television
development worldwide. Proc IEEE 94(1):821
161. Reimers U (2006) DVB – the family of international standards for digital video broadcasting.
Proc IEEE 94(1):173182
162. Richer MS, Reitmeier G, Gurley T, Jones GA, Whitaker J, Rast R (2006) The ATSC digital
television system. Proc IEEE 94(1):3742
163. European Telecommunications Standards Institute (ETSI) (2011) Digital video broadcasting
(DVB): frame compatible plano-stereoscopic 3DTV (DVB-3DTV). DVB Document A154,
Feb 2011
164. ATSC begins work on broadcast standard for 3D-TV transmissions Available: http://
www.atsc.org/cms/index.php/communications/press-releases/257-atsc-begins-work-onbroadcast-standard-for-3d-tv-transmissions
165. Report ITU-R BT.2160 (2010) Features of three-dimensional television video systems for
broadcasting. International Telecommunication Union
166. Final 3-D Blu-ray specification announced Available: http://www.blu-ray.com/news/
?id=3924
167. Specification Available: http://www.hdmi.org/manufacturer/specification.aspx
168. CEA begins standards process for 3D glasses Available: http://www.ce.org/Press/
CurrentNews/press_release_detail.asp?id=12067
169. Steering teamsoverview Available: http://www.3dathome.org/steering-overview.aspx

1 An Overview of 3D-TV System Using Depth-Image-Based Rendering

35

170. Bruls F, Gunnewiek RK, van de Walle P (2009) Philips response to new call for 3DV test
material: arrive book and mobile. ISO/IEC JTC1/SC29/WG11 Doc. M16420, Apr 2009
171. Microsoft 3D video test sequences Available: http://research.microsoft.com/ivm/
3DVideoDownload/
172. Tanimoto M, Fujii T, Suzuki K (2009) View synthesis algorithm in view synthesis reference
software 2.0 (VSRS2.0). ISO/IEC JTC1/SC29/WG11 Doc. M16090, Lausanne, Switzerland,
Feb 2009

Part II

Content Generation

Chapter 2

Generic Content Creation for 3D Displays


Frederik Zilly, Marcus Müller and Peter Kauff

Abstract Future 3D productions in the fields of digital signage, commercials, and 3D Television will cope with the problem that they have to address a wide range of different 3D displays, ranging from glasses-based standard stereo displays to auto-stereoscopic multi-view displays or even light-field displays. The challenge will be to serve all these display types with sufficient quality and appealing content. Against this background this chapter discusses flexible solutions for 3D capture, generic 3D representation formats using depth maps, robust methods for reliable depth estimation, required pre-processing of captured multi-view footage, post-processing of estimated depth maps, and, finally, depth-image-based rendering (DIBR) for creating missing virtual views at the display side.

Keywords 3D display · 3D production · 3D representation · 3D videoconferencing · Auto-stereoscopic multi-view display · Content creation · Depth estimation · Depth map · Depth-image-based rendering (DIBR) · Display-agnostic production · Extrapolation · Stereo display · Stereoscopic video · Stereo matching







F. Zilly (&) · M. Müller · P. Kauff
Image Processing Department, Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, Einsteinufer 37, 10587 Berlin, Germany
e-mail: frederik.zilly@hhi.fraunhofer.de
M. Müller
e-mail: macus.mueller@hhi.fraunhofer.de
P. Kauff
e-mail: peter.kauff@hhi.fraunhofer.de


2.1 Introduction
The commercial situation of 3D video has changed dramatically during the last couple of years. Having been confined to niche markets like IMAX theatres and theme parks for a long time, 3D video is now migrating into a mass market. One main reason is the introduction of Digital Cinema. It is the high performance of digital cameras, post-production, and screening that makes it possible to show 3D video for the first time with acceptable and convincing quality. As a consequence, installations of 3D screens in cinema theatres are increasing exponentially (see Fig. 2.1). Numerous 3D movies have been released to theatres and have generated large revenues with growing success for years. The value-added chain of 3D cinema is under intensive worldwide discussion.
This process also includes the repurposing of 3D productions for home entertainment. In 2010, the Blu-ray Disc Association and the HDMI Consortium specified interfaces for 3D home systems. Simultaneously, the first 3D Blu-ray recorders and 3D-TV sets entered the market. Global players in consumer electronics assume that in 2015 more than 30 % of all HD panels at home will be equipped with 3D capabilities. Major live events have been broadcast in 3D, and the first commercial 3D-TV channels are on air now. More than 100 new 3D-TV channels are expected to be installed during the next few years.
The rapid development of 3D Cinema and 3D-TV is now going to find its way into many different fields of application. Gamers will enjoy their favorite entertainment in a new dimension. Mobile phones, PDAs, laptops, and similar devices will provide the extended visual 3D sensation anytime and anywhere. The usage of 3D cameras is no longer restricted to professional high-end productions, but low-budget systems are now also available for private consumers or semi-professional users. Digital signage and commercials are exploiting 3D imaging to increase their attractiveness. Medical applications use 3D screens to enhance the visualization in operating rooms. Other non-entertainment applications like tele-education, augmented reality, visual analytics, tele-presence systems, shared virtual environments, or video conferencing use 3D representations to increase effectiveness, naturalism, and immersiveness.
The main challenge of this development is that all these applications will use
quite different stereo representations and 3D displays. Thus, future 3D productions
for 3D-TV, digital signage, commercials, and other 3D video applications will
cope with the problem that they have to address a wide range of different 3D
displays, ranging from glasses-based standard stereo displays to auto-stereoscopic
multi-view displays, or even future integral imaging, holographic, and light-field
displays. The challenge will be to serve all these display types with sufficient
quality and appealing content by using one generic representation format and one
common production workflow.
Fig. 2.1 Worldwide development of 3D screen installations in cinema theatres over the last 5 years (Source: FLYING EYE, PRIME project, funded by the German Federal Ministry of Economics and Technology, grant no. 01MT08001)

Against this background the underlying book chapter discusses flexible solutions for 3D capture, generic 3D representation formats using depth maps, robust methods for reliable depth estimation, required pre-processing of captured multi-view footage, post-processing of estimated depth maps, and, finally, depth-image-based rendering (DIBR) for creating missing virtual views at the display side.
Section 2.2 will first discuss future 3D video applications and related requirements. Then, Sect. 2.3 will review the functional concept of auto-stereoscopic
displays, and afterwards, in Sect. 2.4 a generic display-agnostic production workflow that supports the wide range of all existing and anticipated 3D displays will be
presented. Subsequently, Sects. 2.5, 2.6, and 2.7 focus on details of this generic
display-agnostic production workflow, such as rectification, stereo correction,
depth estimation, and DIBR. Finally, Sect. 2.8 will discuss an extension of this
workflow toward multi-view capturing and processing with more than two views.
Section 2.9 will summarize the chapter and give an outlook on future work and
challenges.

2.2 Requirements on Future 3D Video Applications


For the time being, 3D cinema and television are certainly the most prominent and
successful representatives of stereoscopic video and 3D media. In the meantime,
however, they also pave the road for many other applications. One example is the
exploitation of 3D imaging for digital signage. Showing commercials in 3D clearly
increases their attractiveness, and hence, auto-stereoscopic 3D displays are frequently used as eye catchers in shopping malls, lobbies, and entrance halls, at
exhibitions and fairs or during fashion weeks. Other examples are medical
applications, especially for endoscopy and its 3D visualization in the operating
room. Moreover, stereoscopic video is more and more used in augmented 3D
visualization, integrating real 3D footage under the right perspective and correct stereo
geometry seamlessly into virtual environments. Further applications can be found
in the framework of education, gaming, or visual analytics. Finally, one promising and commercially interesting application is immersive 3D videoconferencing.

Fig. 2.2 Immersive 3D video conferencing as an application example for using 3D displays in non-entertainment business segments (Source: European FP7 project 3D Presence)

As an example, Fig. 2.2 shows an experimental system that has been developed within
a European FP7 research project called 3D Presence [1, 2]. It addresses a telepresence scenario where conferees from three remote sites can meet in a simulated
round table situation. The remote partners are reproduced in life-size at 3D displays. Special processing techniques enable eye contact and gesture awareness
between all involved persons [3, 4].
Today many of these applications still use glasses. In this case, the display
presents two images that have been captured with slightly different perspective
from horizontally separated camera positions, usually similar to the eye positions
of humans. The displays show both views on the same screen, but separated either
by optical means (polarization, color separation) or temporally interleaved. The
users have to wear special glasses that perform the view separation by optical
filters, either passively by polarization or color filters or actively by shuttered
glasses. These special glasses ensure that the left eye watches the left view only
and, vice versa, the right eye watches the right view only. As long as the stereo
images are produced properly, the two separate views can be merged by the human
brain into a single 3D perception.
At this point one has to be aware of the fact that the depth impression given by
stereoscopic imaging is an illusion created by the human visual system and not a
perfect 3D reconstruction. Improper creation of stereo content can therefore result
in a bad user experience [5]. Consequences might be eye strain and visual fatigue
[6]. Therefore, to create good stereo, the production workflow has to respect a
variety of requirements, guidelines, and rules. One main issue is to ensure that the
whole scene usually remains within a so-called comfortable viewing range.

The 3D experience is generally comfortable if all scene elements stay within a limited depth space close to the screen [7-10]. Another one is to avoid retinal
rivalry caused by geometric and photometric distortions and inconsistencies
between the two stereo images, such as key-stones, vertical misalignments, lens
distortions, color mismatches, or differences in sharpness, brightness, contrast, or
gamma [11]. Finally, it has to be taken into account that depth perception is always
a concert of different monocular and binocular depth cues [12]. Stereo images,
however, if not produced properly, may produce perception conflicts between
these different kinds of depth cues. A well-known example is stereoscopic window
violation where out-screening parts of the scene are cut off at the borders of the
image. It results in a very annoying conflict between two different depth cues: the
binocular depth cue telling the viewer that the object is in front of the screen and the
monocular cue of interposition indicating at the same time that the object must be
behind the screen because it is occluded by the image frame. Self-evidently, such conflicts must be avoided as a matter of principle. A good overview of requirements and production rules for good stereo production can be found in [13].
Apart from these general requirements on good stereo productions, further
special requirements have to be taken into account in the context of a generic display-agnostic representation format. As already mentioned, today most of the above
applications use glasses-based stereo reproduction. In conventional applications
like 3D cinema or theme parks it might be acceptable for a long time because
visitors are accustomed to this viewing situation in theatre-like venues and there is
no urgent demand to change. For other applications like 3D-TV the usage of
glasses is only considered as an interim solution to establish 3D entertainment as
soon as possible in the living room with currently available technology. However,
it is generally agreed that glasses will not be accepted in the long term for home entertainment environments. The viewing practice at home differs considerably from the one in cinema theatres because users at home do not permanently concentrate on the screen while watching TV but are used to doing other things simultaneously, like talking to family members, and, hence, do not like to wear special
glasses all the time. In some applications, such as digital signage, video conferencing, or medical environments the usage of glasses is even extremely annoying
or, to some extent, even impossible. For these cases it is absolutely necessary to
provide auto-stereoscopic displays that allow for watching stereo content without
glasses. Such displays are already on the market, but for the time being their overall 3D performance is lower than that of conventional glasses-based systems. They are not able to display the same quality in terms of depth range, spatial resolution, and image brilliance. Hence, they are not yet suitable for a mass market like 3D-TV, and at the moment they are more useful for special niche applications. Nevertheless, the technology of auto-stereoscopic displays is making a lot of progress, and it can be assumed that the performance gap between glasses-based and auto-stereoscopic 3D displays will be closed very soon. Obviously, auto-stereoscopic 3D displays will then also replace glasses-based systems in the mass market of home entertainment and 3D-TV after some time.

Fig. 2.3 Function of lenticular lenses and parallax barriers at auto-stereoscopic 3D displays

Following these considerations, future 3D productions for 3D-TV or other 3D video applications will cope with the problem that they have to address a wide
range of different 3D displays, ranging from glasses-based standard stereo displays to auto-stereoscopic multi-view displays. This results in three further baseline requirements on a generic display-agnostic production workflow: (a) the produced data representation should be backwards compatible to conventional glasses-based displays such that it can be watched on standard stereo displays without any extra processing, (b) the data representation should support any kind of auto-stereoscopic multi-view display, independent of the number of views that is used by the specific display, and (c) it should be possible to convert standard stereoscopic material into the generic representation format.
For a common understanding, the next section will first describe details on the
functional concept of auto-stereoscopic displays and the related 3D display technologies before solutions for a generic display-agnostic production workflow and
the needed processing tools respecting the above requirements are explained in
more detail.

2.3 Functional Concept of Auto-Stereoscopic Displays


As shown in Fig. 2.3, auto-stereoscopic 3D displays use optical components such
as lenticular lenses or parallax barriers in front of the image panel to avoid the
necessity of wearing glasses. In case of lenticular lenses, light emitted from a particular sub-pixel pattern of the panel is deflected in different directions such that the left and the right eyes watch separate sub-images. Similarly, parallax barriers block light emitted from a particular sub-pixel pattern of the panel for one eye whereas they let it pass for the other eye.
The concept of auto-stereoscopic imaging is not really new. In 1692, it was the
French painter G. A. Bois-Clair who discovered for the first time that the 3D sensation can be enhanced considerably by creating paintings containing two distinct
images, one for the left eye and one for the right eye, instead of presenting just one image. The separation between the two views was achieved by a grid of vertical laths in front of the painting, an antecessor of today's auto-stereoscopic displays using parallax barriers [14]. The same principle was used in 1930 at the Russian cinema Moskva for moving images. A fence with 30,000 copper wires was installed as a parallax barrier in front of the screen. The breakthrough, however, came in the 1990s with the advent of digital projectors, high-quality CRTs, and later, LCD displays [15-17]. For a few years now, auto-stereoscopic displays have been available as products on the market and are offered by a couple of manufacturers.

Fig. 2.4 Repetition of viewing cones with a given number of stereo views at an auto-stereoscopic multi-view multi-user display (left) and an example of an interweaving pattern of slanted lenticular lenses or parallax barriers (right)
A well-known disadvantage of conventional auto-stereoscopic approaches with
two views is that the depth sensation can only be perceived properly from certain
viewing positions and that annoying artifacts and visual degradations like cross
talk, pseudo-stereo, or moiré disturbances may appear at other positions. Therefore, state-of-the-art 3D displays are usually constructed in a way that allows for reproducing multiple adjacent stereo perspectives simultaneously. Ideally, one
person can watch a couple of continuous stereo views while changing the viewing
position, or the other way round, several persons at different viewing positions can
watch the same 3D content with sufficient quality.
For this purpose auto-stereoscopic multi-view multi-user displays are based on
a limited number of viewing cones, typically with a horizontal angle of around 10°, which in turn contain a certain number of neighboring stereo perspectives (see
Fig. 2.4, left). With a proper dimensioning of the whole system, a fluent transition
between the views can be achieved and, hence, a spectator can freely move the
head within each viewing cone. As a side effect, this technology also makes it possible to look slightly behind objects while moving the head, and the resulting monocular
depth cue of head-motion parallax further enhances the 3D sensation. Moreover,
the multi-user capability is mainly given by a periodic repetition of the 3D content
across the viewing cones. Within each viewing cone, one or more observers can
enjoy the 3D content from different viewing positions and angles.
To this end, the functional concept of auto-stereoscopic displays is always a
trade-off between the number of required views, free moving space of the
observers, and the multi-user capability of the system on the one hand, and the spatial resolution left for each view on the other hand. For instance, a display with eight stereo pairs per viewing cone and vertically arranged lenticular lenses results in a horizontal resolution per view of one-ninth of that of the underlying image panel (e.g. 213 pixels per view in case of an HD panel with 1,920 pixels).
To compensate for this systematic loss of resolution, the lenticular lenses or parallax
barriers are often arranged in a slanted direction. By doing so, the loss of resolution can be distributed equally in horizontal and vertical direction as well as over
the three color components.
To explain it in more detail, Fig. 2.4 (right) shows an example of a slanted
raster for a display with eight stereo pairs per viewing cone. Each colored rectangle denotes a sub-pixel of the corresponding RGB color component at the
underlying image panel. The number inside each rectangle indicates the corresponding stereo view. The three adumbrated lenses illustrate the slope of the
optical system mounted in front of the panel. The emitted light of a sub-pixel will
be deviated by a specific angle according to its position behind the lens. Consequently, the image data corresponding to the different views need to be loaded into
the raster such that all sub-pixels corresponding to the same view are located at the
same position and angle relative to the covering lens. That way, the perceived loss
of resolution per stereo view can substantially be reduced.
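To illustrate how such an interweaving raster can be addressed in software, the following sketch computes a view-index map for a slanted raster. It is a minimal sketch only: the slant, view count, offset, and sub-pixel ordering are illustrative assumptions and do not correspond to any particular manufacturer's (non-public) raster.

```python
import numpy as np

def view_index_map(width, height, n_views=9, slant=1.0 / 6.0, x_offset=0.0):
    """Illustrative sub-pixel-to-view assignment for a slanted lenticular raster.

    Each of the 3*width RGB sub-pixel columns in row y is mapped to one of
    n_views source views. All parameters are assumed for illustration only;
    real rasters are manufacturer specific and not public.
    """
    views = np.zeros((height, 3 * width), dtype=np.int32)
    for y in range(height):
        for k in range(3 * width):
            # fractional position of the sub-pixel under the slanted lens array
            phase = (k + x_offset - y * 3.0 * slant) % n_views
            views[y, k] = int(phase)
    return views

# Example: which source view feeds each sub-pixel of a small 4x4-pixel patch
print(view_index_map(4, 4))
```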
The opening angle of the viewing cone, the number of views per cone, the slanted
positioning of the optical system, and the corresponding interweaved sub-pixel
raster differ significantly between the display manufacturers. Usually, the details of
the sub-pixel raster are not public as they determine the resulting 3D-quality of the
display. In this sense, every display manufacturer has its own philosophy and finds its own way to optimize the relevant system parameters and the resulting trade-off. As a consequence, the envisaged generic display-agnostic 3D production workflow has to be flexible enough to support each of the existing auto-stereoscopic 3D displays, independently of the used number of views and the other system design parameters.
Moreover, a generic workflow also has to be future-proof in this sense. The
number of stereo pairs per viewing cone, the angle of viewing cones and the
density of stereo views will increase considerably with increasing resolution of
LCD panels. First prototypes using 4k panels are already available and will enter
the market very soon. The current availability of 4k panels is certainly not the end
of this development. First 8k panels are on the way and panels with even higher
resolution will follow. It is generally agreed that this development is the key
technology for auto-stereoscopic displays and will end up with 50 views and more.
Today the usual number of views is in the range of ten, but first displays with up to 28 views have recently been introduced onto the market. An experimental display with 28 views was already described in 2000 by Dodgson et al. [18].

2.4 Generic Display-Agnostic Production Workflow


Following the requirements from Sect. 2.2 and the considerations from the previous section, a generic display-agnostic production workflow has to cope with a
wide range of auto-stereoscopic displays using an arbitrary number of views and

Fig. 2.5 Processing chain of a generic display-agnostic production and distribution workflow

quite different system designs. Anticipating future developments, it should also be able to support upcoming technologies like light-field displays or integral imaging.
Finally, the production workflow should be backwards compatible to existing
glasses-based stereoscopic displays and should be able to take into account all
conventional rules of producing good 3D content as discussed in Sect. 2.2.
Figure 2.5 shows the processing chain of such a display-agnostic approach. In a
first step, the scene will be captured by a suitable multi-camera rig. It consists of at
least two cameras, i.e. a conventional stereo system, but can also use more than
two cameras if needed. In the latter case, the multi-camera rig should contain one
regular stereo camera to ensure direct backwards compatibility to standard stereo
processing, applications, and displays. The considerations on the production
workflow and processing tools in the next section will mainly focus on 3D content
captured with a regular two-camera stereo rig. Possible extensions toward multicamera systems will be discussed in Sect. 2.8.
During a pre-processing step, the camera rig needs to be calibrated and the
quality of the stereo content needs to be checked, either on-set, inside the OB van, or during post-production. If necessary, colorimetric and geometric corrections
have to be applied to the stereo images. This pre-processing is an important
requirement for a pleasant 3D sensation. It ensures that the basic rules of good 3D
production, as discussed in Sect. 2.2, are respected in a sufficient manner and
delivers well-rectified stereo images. However, in the context of the envisaged generic production workflow, it also plays a second important role. Due to
the rectification process it simplifies and enhances subsequent processing steps
such as depth estimation, related coding, and DIBR by reducing their complexity
and increasing their robustness.


In the next step, depth information is extracted from the stereo signals. This
depth estimation results in dense pixel-by-pixel depth maps for each camera view
and video frame, a representation format often called multiple-video-plus-depth
(MVD). Depending on the number of views required by a specific 3D display, it is
used at the receiver to interpolate or extrapolate missing views by means of DIBR.
Hence, using MVD as an interim format for 3D production is the most important
feature of the envisaged display-agnostic workflow because it decouples the stereo
geometry of the acquisition and display side.
If needed for transmission or storage, depth and video data of MVD can then be
encoded separately or jointly by suitable coding schemes. For this purpose, MVD is
often transformed into other representation formats like LDV or DES that are more
suitable for coding. Details of these coding schemes and representation formats are
explained in more detail in Chap. 1 and Chap. 8 of this book and, hence, will not be
addressed again in this chapter.
At the display side, original or decoded MVD data are used to adapt the
transmitted stereo signals to the specific viewing conditions and the 3D display in
use. On the one hand, this processing step includes the calculation of missing views by
DIBR if the number of transmitted stereo signals is lower than the number of
required views at the 3D display. On the other hand, it provides the capability to
adapt the available depth range to the special properties of the display or the
viewing conditions in the living room. The latter aspect is not only important for
auto-stereoscopic displays, it is also useful for standard stereo displays. For
instance, the depth range of a 3D cinema movie might be too low if watched on a 3D-TV set, and a new stereo signal with an enlarged depth range can be rendered by
using the MVD format.
The above production workflow and the related processing chain allow for
supporting a variety of 3D displays, from conventional glasses-based displays to
sophisticated auto-stereoscopic displays. The following sections will describe
more details of the different processing steps, starting with the pre-processing
including calibration and rectification in Sect. 2.5, followed by depth estimation in
Sect. 2.6 and ending with DIBR in Sect. 2.7.

2.5 Calibration, Rectification, and Stereo Correction


A main challenge of the pre-processing step in the generic workflow from Fig. 2.5
is to analyze the depth structure in the scene and to derive robust rectification and
stereo correction from it. One possibility to make depth analysis robust enough is
the detection of feature point correspondences between the two stereo images. Any
suitable feature detector like SIFT [19], SURF [20], or the recently proposed SKB [21] can be used for this purpose. However, as even these very distinctive descriptors will produce a certain amount of outliers, the search for robust point correspondences additionally has to be constrained by the epipolar equation from
Eq. (2.1). As known from the literature, corresponding image points m and m′ in two stereo images have to respect the epipolar constraint, where F denotes the fundamental matrix defined by a set of geometrical parameters like orientations, relative positions, focal lengths, and principal points of the two stereo cameras:

\[ m'^{T} \cdot F \cdot m = 0 \tag{2.1} \]

Fig. 2.6 Robust feature point correspondences for the original stereo images of the test sequence BEER GARDEN, kindly provided by the European research project 3D4YOU [53]

Based on this constraint, RANSAC estimation of the fundamental matrix F can be used to eliminate feature point outliers [22]. Figure 2.6 shows an example of related results for images of a stereo test shooting. Note that the cameras are not perfectly aligned in this case and the point correspondences still contain vertical disparities due to an undesired roll angle between the two cameras.
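As an illustration of this feature-based analysis, the sketch below detects and matches feature points and estimates F with RANSAC using OpenCV. It is a minimal sketch under the assumption of an OpenCV-based implementation (SIFT instead of the SKB detector mentioned above), and the image file names are placeholders.

```python
import cv2
import numpy as np

# Load a (placeholder) stereo pair as grayscale images.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Detect and describe feature points (SIFT here; SURF or SKB would work alike).
sift = cv2.SIFT_create()
kp_l, des_l = sift.detectAndCompute(left, None)
kp_r, des_r = sift.detectAndCompute(right, None)

# Match descriptors and keep the best candidates via Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des_l, des_r, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

pts_l = np.float32([kp_l[m.queryIdx].pt for m in good])
pts_r = np.float32([kp_r[m.trainIdx].pt for m in good])

# RANSAC estimation of the fundamental matrix rejects the remaining outliers.
F, inlier_mask = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC, 1.0, 0.999)
inliers_l = pts_l[inlier_mask.ravel() == 1]
inliers_r = pts_r[inlier_mask.ravel() == 1]
print("robust correspondences:", len(inliers_l))
```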
The robust depth analysis by feature points also allows for an estimation of the
camera geometry and a related mechanical calibration of the stereo rig. In general,
the derivation of the physical stereo geometry from the fundamental matrix F is a
numerically challenging problem. For stereo rigs, however, it can be assumed that
the cameras have already been mounted in an almost parallel set-up, i.e. the
cameras have almost the same orientation perpendicular to the stereo baseline.
Thus, the camera geometry is already close to the rectified state where F degenerates to the following simple relation:
\[ F = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix} \tag{2.2} \]
Hence, F can be linearized by developing a Taylor expansion around the rectified state from Eq. (2.2) and truncating after the first-order term. In addition, it can be assumed that the principal points are located in the centers of the image sensors, that the difference between the two focal lengths f and f′ is small (f′/f = 1 + r_f with |r_f| ≪ 1), that the stereo baseline is defined by the x-axis of the left stereo camera, and that the deviations c_y and c_z of the right stereo camera in Y- and Z-direction across the baseline are small compared to the inter-axial camera distance t_c along the baseline (c_y ≪ 1 and c_z ≪ 1 in case of a normalized inter-axial camera distance c_x = 1; note that all c-values have been normalized with respect to t_c). Under these preconditions, the linearization results in the following simplified term of the matrix F, where φ_x, φ_y, and φ_z denote the orientation angles of the right camera:
\[ F \approx \begin{pmatrix} 0 & -c_z/f & c_y \\ (c_z + \varphi_y)/f & -\varphi_x/f & -1 \\ \varphi_z - c_y & 1 + r_f & -f\,\varphi_x \end{pmatrix} \tag{2.3} \]
Note that the above preconditions are generally fulfilled in case of a proper
stereo set-up using professional rigs and prime lenses. Based on this linearization,
the epipolar constraint from Eq. (2.1), with F from Eq. (2.3), can also be written as follows:
\[ \underbrace{v' - v}_{\text{vert. disparity}} = \underbrace{c_y\,\Delta u}_{\text{y-shift}} + \underbrace{\varphi_y\,\frac{u\,v'}{f}}_{\varphi_y\text{-keystone}} + \underbrace{\varphi_z\,u'}_{\text{roll}} + \underbrace{r_f\,v'}_{\Delta\text{-zoom}} - \underbrace{f\,\varphi_x}_{\text{tilt offset}} - \underbrace{\varphi_x\,\frac{v\,v'}{f}}_{\varphi_x\text{-keystone}} + \underbrace{c_z\,\frac{u\,v' - u'\,v}{f}}_{\text{z-shift}} \tag{2.4} \]

Once the fundamental matrix F has been estimated from matching feature point
correspondences, the coefficients from the linearization in Eq. (2.3) can be derived
from Eq. (2.4). For this purpose the relation from Eq. (2.4) is used to build up a
system of linear equations on the basis of the same robust feature point correspondences. Solving this linear equation system by suitable optimization procedures (e.g. RANSAC [23]) enables a robust estimation of the existing stereo
geometry. The resulting parameters can then be exploited to steer and correct
geometrical and optical settings in case of motorized rigs and lenses.
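As a minimal sketch of this estimation step, the following NumPy code builds and solves such a linear system by least squares. The column layout follows the linearization reconstructed in Eq. (2.4), so the exact signs and the parameter ordering are assumptions tied to that form rather than the authors' implementation; in practice, a robust estimator such as RANSAC would wrap this least-squares core.

```python
import numpy as np

def estimate_stereo_geometry(pts_l, pts_r, f):
    """Least-squares fit of the linearized stereo-geometry parameters.

    pts_l, pts_r: (N, 2) arrays of matched image points (u, v) and (u', v'),
    given relative to the image centers; f: focal length in pixels.
    Returns (c_y, phi_y, phi_z, r_f, phi_x, c_z) under the linearization
    sketched in Eq. (2.4); signs follow that reconstruction.
    """
    u, v = pts_l[:, 0], pts_l[:, 1]
    up, vp = pts_r[:, 0], pts_r[:, 1]

    # One row per correspondence; one column per parameter.
    A = np.column_stack([
        up - u,                  # c_y   (y-shift)
        u * vp / f,              # phi_y (keystone from toe-in)
        up,                      # phi_z (roll)
        vp,                      # r_f   (zoom difference)
        -f * np.ones_like(u),    # phi_x (tilt offset)
        (u * vp - up * v) / f,   # c_z   (z-shift)
    ])
    # phi_x also appears in the small keystone term; fold it into the same column.
    A[:, 4] += -v * vp / f

    b = vp - v                   # measured vertical disparity
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params
```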
For example, the Δ-zoom parameter indicates a small deviation in focal lengths between the two stereo cameras and can be used to correct it in case of motorized zoom lenses. Furthermore, detected roll, tilt-offset, and φ_x-keystone parameters can be used to correct unwanted roll and tilt angles at the rig, if the rig allows this adjustment, either by motors or manually. The same holds for the parameters y-shift and z-shift, indicating a translational displacement of the cameras in vertical direction and along the optical axes, respectively. The φ_y-keystone parameter refers to an existing toe-in of converging cameras, and its correction can be used to ensure a parallel set-up or to perform a de-keystoning electronically.
A perfect control of geometrical and optical settings will not always be possible. Some stereo rigs are not motorized and adjustments have to be done manually with limited mechanical accuracy. When changing the focus, the focal length
of lenses might be affected. In addition, lenses are exchanged during shootings
and, if zoom lenses are used, motors do not synchronize exactly and lens control
suffers from backlash hysteresis.
As a consequence, slight geometrical distortions may remain in the stereo
images. These remaining distortions can be corrected electronically by means of
image rectification. The process of image rectification is well known from the literature [22, 24-26]. It describes 2D warping functions H and H′ that are applied to the left and right stereo images, respectively, to compensate for deviations from the ideal

case of parallel stereo geometry. In the particular case, H and H′ are derived from
a set of constraints that have to be defined by the given application scenario.
One major constraint in any image rectification is that multiplying the corresponding image points m and m′ in Eq. (2.1) with the sought 2D warping matrices H and H′ has to end up with a new fundamental matrix that is equal to the rectified state in Eq. (2.2). Clearly, this is not enough to calculate all 16 degrees of freedom in the two matrices H and H′ and, hence, further constraints have to be defined for the particular application case.
One further constraint in the given application scenario is that the horizontal
shifts of the images have to respect the user-defined specification of the convergence plane. Furthermore, the 2D warping matrix H for the left image has to be chosen such that the deviations c_y and c_z of the right camera in Y- and Z-direction are eliminated, i.e. the left camera has to be rotated such that the new baseline after rectification goes through the focal point of the right camera:
\[ H = \begin{pmatrix} 1 & c_y & 0 \\ -c_y & 1 & 0 \\ c_z/f & 0 & 1 \end{pmatrix} \tag{2.5} \]
Based on this determination, the 2D warping matrix H′ for the right image can be calculated in a straightforward way by taking into account the additional side constraints that the left and right cameras have the same orientation after rectification, that both cameras have the same focal length, and that the x-axis of the right camera has the same orientation perpendicular to the new baseline:
\[ H' = \begin{pmatrix} 1 - r_f & \varphi_z - c_y & 0 \\ -(\varphi_z - c_y) & 1 - r_f & f\,\varphi_x \\ (\varphi_y - c_z)/f & \varphi_x/f & 1 \end{pmatrix} \tag{2.6} \]
Figure 2.7 shows results from an application of image rectification to the non-rectified originals from Fig. 2.6. Note that vertical disparities have almost been eliminated.
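A minimal sketch of applying such rectifying homographies with OpenCV, assuming H and H′ have already been computed as above:

```python
import cv2

def rectify_pair(img_left, img_right, H, H_prime):
    """Warp a stereo pair with the rectifying homographies H and H'.

    H and H_prime are 3x3 matrices as derived above (assumed inputs);
    the output pair should contain (almost) no vertical disparities.
    """
    h, w = img_left.shape[:2]
    rect_left = cv2.warpPerspective(img_left, H, (w, h), flags=cv2.INTER_LINEAR)
    rect_right = cv2.warpPerspective(img_right, H_prime, (w, h), flags=cv2.INTER_LINEAR)
    return rect_left, rect_right
```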

2.6 Algorithms for Robust Depth Estimation


As already mentioned in Sect. 2.4, robust depth estimation is one of the most
important features of the processing chain in Fig. 2.5. Over the last decades a huge
number of algorithms for depth estimation have been developed in the computer
vision community, so even an incomplete review would be beyond the scope of
this chapter. The interested reader is referred to [27] and [28], which provide a good overview of the state of the art in computational stereo. The following considerations will only give a very short introduction to depth estimation, followed by a more detailed explanation of one particular approach to dense depth map generation that is suitable for DIBR in the given context.

Fig. 2.7 Results of image rectification for the images of the test sequence from Fig. 2.6

The aim of all stereo matching algorithms is to find corresponding pixels between the two images under some constraints which reflect assumptions about
the captured scene. The most common ones are the epipolar, uniqueness, constant-brightness, smoothness, and ordering constraints. Stereo algorithms can be classified by the manner in which these constraints are used during the estimation process. While local approaches estimate the disparity at each pixel independently, exploiting the constraints only for neighboring pixels within a correlation window, global approaches attempt to minimize a certain energy function that encodes these constraints for the whole image.
Local approaches mainly differ in the similarity metric as well as the size and
shape of the aggregation window (see [29] and [30] for more details) whereas
global methods vary in the energy function and its optimization. Figure 2.8 shows
a comparison of different global and local methods applied to the Tsukuba image
of the Middlebury stereo data set with ground truth depth maps [27]. The first row
from (a) to (c) shows the left and right image as well as related ground truth data.
The next row shows three different global methods. They all minimize an
energy function consisting of a data and a smoothness term:
\[ E(d) = E_{\mathrm{data}}(d) + \lambda\,E_{\mathrm{smooth}}(d) \tag{2.7} \]

The term E_data(d) takes care of the constant-brightness constraint (color constancy assuming Lambertian surfaces) by measuring matching point correspondences between the stereo images, whereas the term E_smooth(d) exploits the smoothness constraint by penalizing disparity discontinuities. The first image in the second row shows a result of Dynamic Programming, which minimizes the energy function for each scan-line independently [31-33]. As can be seen in Fig. 2.8d, this approach leads to the well-known streaking artifacts. In contrast, the two other images in the second row refer to global methods that minimize a 2D energy function for the whole image, such as Graph-Cut [34-38] and Belief Propagation [39-41]. As shown in Fig. 2.8e and f, these methods provide much more consistent depth maps than Dynamic Programming.
Due to these results, global optimization methods are often referred to as top-performing stereo algorithms that clearly outperform local methods [27]. This conclusion is also confirmed by a comparison of the results from Graph-Cut and Belief Propagation in Fig. 2.8e and f with that of a straightforward Block Matcher in Fig. 2.8g, which is the prominent representative of local methods and is also known in the context of motion estimation. An in-depth comparison of block matching and other disparity estimation techniques is given in [42].

Fig. 2.8 Comparison of stereo algorithms on the basis of the Tsukuba test image of the Middlebury stereo data set [27]: a, b original left and right image, c ground truth depth map, d dynamic programming, e belief propagation, f graph-cut, g block matching without post-processing, h block matching with post-processing
The advantage of global methods over local ones has so far only been proven for still images. Furthermore, recent research has shown that local methods with adequate post-processing of the resulting depth maps, e.g. by cross-bilateral filtering [43, 44], can provide results that are almost comparable with those of global methods. In this context, Fig. 2.8h shows an example of block matching with post-processing. Furthermore, global methods suffer from high computational complexity and a considerable lack of temporal consistency. For all these reasons, local approaches are usually preferred for 3D video processing for the time being, especially if real-time processing is addressed.
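For orientation, the following sketch implements a plain local block matcher with a left-right consistency check using OpenCV. It stands in for the class of local methods discussed here and is not the HRM algorithm described next, whose recursive structure and parameters are specific to the authors' system; the right matcher requires the opencv-contrib package.

```python
import cv2
import numpy as np

def local_disparity_with_lr_check(left, right, max_disp=64, block=15, tol=1):
    """Block matching in both directions plus a left-right consistency check.

    left, right: 8-bit grayscale rectified images. Pixels whose left->right and
    right->left disparities disagree by more than `tol` are marked invalid (-1),
    i.e. treated as mismatches to be filled later.
    """
    matcher_l = cv2.StereoBM_create(numDisparities=max_disp, blockSize=block)
    matcher_r = cv2.ximgproc.createRightMatcher(matcher_l)  # opencv-contrib

    disp_l = matcher_l.compute(left, right).astype(np.float32) / 16.0
    disp_r = matcher_r.compute(right, left).astype(np.float32) / 16.0

    h, w = disp_l.shape
    xs = np.arange(w)
    valid = np.zeros_like(disp_l, dtype=bool)
    for y in range(h):
        # Position of each left pixel in the right image, clipped to the frame.
        x_r = np.clip((xs - disp_l[y]).astype(int), 0, w - 1)
        valid[y] = np.abs(disp_l[y] + disp_r[y, x_r]) <= tol
    disp_l[~valid] = -1.0
    return disp_l
```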
Against this background, Fig. 2.9 depicts as an example the processing scheme
of a real-time depth estimator for 3D video using hybrid recursive matching
(HRM) in combination with motion estimation and adaptive cross-trilateral
median filtering (ACTMF) for depth map post-processing.
Fig. 2.9 Processing scheme of a depth estimator using hybrid recursive matching (HRM) and adaptive cross-trilateral median filtering (ACTMF) in combination with motion estimation

The HRM algorithm is a local stereo matching method that is used for initial depth estimation in the processing scheme from Fig. 2.9. It is based on a hybrid solution using both spatially and temporally recursive block matching and pixel-recursive depth estimation [45]. Due to its recursive structure, the HRM algorithm produces largely smooth and temporally consistent pixel-by-pixel disparity maps.
In the structure from Fig. 2.9, two independent HRM processes (stereo matching from right to left and, vice versa, from left to right view) estimate two initial disparity maps, one for each stereo view. The confidence of the disparity estimation is then measured by checking the consistency between the two disparity maps and by computing a confidence kernel from the normalized cross-correlation used by HRM. In addition, a texture analysis detects critical regions with ambiguities (homogeneities, similarities, periodicities, etc.) that might cause mismatches, and a motion estimator provides temporally predicted estimates by applying motion compensation to the disparities of the previous frame to further improve temporal consistency.
Finally, the two initial disparity maps are post-processed by using ACTMF. The
adaptive filter is applied to the initial disparity values D^i_p from the HRM at pixel positions p within a filter region N_s around the center position s:

\[ D^{o}_{s} = \underset{p \in N_s}{\operatorname{weighted\ median}}\left( w_p,\, D^{i}_{p} \right) \tag{2.8} \]

The filtered output disparity D^o_s at position s is computed with weighting factors w_p of the adaptive filter kernel:

\[ w_p = \mathrm{conf}\!\left(D^{i}_{p}\right) \cdot \mathrm{dist}\!\left(p - s\right) \cdot \mathrm{seg}\!\left(I_p - I_s\right) \tag{2.9} \]

Following the definition of a weighted median operator, the coefficients w_p are used to increase the frequency of a particular disparity value D^i_p before applying a standard median operation. The weighting factors depend on three adaptive smoothing terms, all ranging from 1 to 10. The first one aggregates all results from previous consistency checks, computed confidence kernels, and texture analysis into one overall confidence term conf() for the initial disparity value D^i_p at position p. A high confidence measure is scored by 10 and, vice versa, a low confidence by 1. The second one is a distance function dist() whose score approaches 10 the closer the position p is to the center position s within the filter area. The third one describes a segmentation term seg() that scores high if the input disparity D^i_p at position p belongs with high probability to the same image segment as the filtered output value D^o_s at position s, and low if not. As usual in conventional bilateral filtering, this term is driven by the difference between the color values I_p and I_s at positions p and s, respectively.
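To make the weighted median of Eqs. (2.8) and (2.9) concrete, here is a minimal per-pixel sketch. The three weighting terms are reduced to simple stand-ins (a precomputed confidence map, a Gaussian distance term, and a color-difference term), so all constants are illustrative assumptions rather than the tuned ACTMF parameters.

```python
import numpy as np

def weighted_median_filter_pixel(disp, conf, image, s_y, s_x, radius=3,
                                 sigma_dist=2.0, sigma_color=10.0):
    """Weighted median of Eq. (2.8) at the center position (s_y, s_x).

    disp:  initial disparity map D^i, conf: confidence map in [1, 10],
    image: gray-scale image used for the segmentation-like color term.
    The weights follow the structure of Eq. (2.9); constants are illustrative.
    """
    h, w = disp.shape
    values, weights = [], []
    for y in range(max(0, s_y - radius), min(h, s_y + radius + 1)):
        for x in range(max(0, s_x - radius), min(w, s_x + radius + 1)):
            dist_term = 1 + 9 * np.exp(-((y - s_y) ** 2 + (x - s_x) ** 2)
                                       / (2 * sigma_dist ** 2))
            seg_term = 1 + 9 * np.exp(-abs(float(image[y, x]) - float(image[s_y, s_x]))
                                      / sigma_color)
            values.append(disp[y, x])
            weights.append(conf[y, x] * dist_term * seg_term)

    # Weighted median: sort values and pick the one at half the total weight.
    order = np.argsort(values)
    cum = np.cumsum(np.asarray(weights)[order])
    idx = np.searchsorted(cum, cum[-1] / 2.0)
    return np.asarray(values)[order][idx]
```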
As an example, Fig. 2.10 shows results of the entire processing scheme for the
test sequence BAND06 of the European research project MUSCADE [46]. The top
row depicts original left and right images of the stereo test sequence. Corresponding disparity maps after initial HRM depth estimation are presented in the
second row, followed by results of a subsequent consistency check where all
mismatches have been detected and removed from the depth maps in the third row.
Finally, the bottom row shows the filtered disparity maps after the ACTMF process. Note that the final depth maps are well smoothed while the depth transitions
at object boundaries are preserved. All holes from mismatch removal are filled
such that a robust dense pixel-by-pixel depth map is obtained at this stage.
The final depth maps are suitable for interpolation and extrapolation of virtual
views by means of DIBR. This rendering process will be explained in the next
section. To this end, it should be mentioned that the good results from Fig. 2.10
can only be achieved if the initial stereo images are well rectified and processed as
described in Sect. 2.5.

2.7 Image-Based Rendering Using Depth Maps


The concept of depth-based 3D representation formats was already proposed around 20 years ago in the framework of several European research projects [47-49]. As an example, Fig. 2.11 illustrates a format that is delivered by the depth estimator from Fig. 2.9, described in detail in the previous section. It consists of the

Fig. 2.10 Results of depth estimation applying the processing chain from Fig. 2.9 to stereo images of the test sequence BAND06, kindly provided by the European research project MUSCADE [46]

two video streams of a conventional stereo representation plus two additional
depth maps, one for the left and one for the right stereo view. The depth maps
contain depth information for each pixel. The related depth samples refer to the
distance from the camera to the object. The depth information of the scene is
stored in an 8-bit gray scale format, where 0 denotes the far clipping plane and 255
the near clipping plane of the scene.
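A common convention for such 8-bit depth maps stores quantized inverse depth between the near and far clipping planes; the short sketch below follows that widely used convention, which is assumed here for illustration rather than taken from this chapter.

```python
import numpy as np

def quantize_depth(z, z_near, z_far):
    """Map metric depth z to 8-bit values: 255 at the near plane, 0 at the far plane.

    Inverse-depth quantization (a widely used convention, assumed here):
    closer objects receive finer quantization steps than distant ones.
    """
    z = np.clip(z, z_near, z_far)
    v = 255.0 * (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return np.round(v).astype(np.uint8)

def dequantize_depth(v, z_near, z_far):
    """Inverse mapping from 8-bit depth values back to metric depth."""
    return 1.0 / (v / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)
```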
Fig. 2.11 Depth-based 3D representation format consisting of two stereo images as known from conventional S3D production and aligned dense pixel-by-pixel depth maps

As already mentioned in Sect. 2.4, the crucial advantage of such a depth-based format is that it makes it possible to decouple the acquisition geometry of a 3D production from the reproduction geometry at the display side. Due to the delivered depth
information, new views can be interpolated or rendered along the stereo baseline.
Dedicated rendering techniques known as DIBR are used for the generation of
virtual views [50].
Figure 2.12 illustrates the general concept of DIBR. The two pictures in the first
row show a video image and corresponding dense depth map. To calculate a new
virtual view along the stereo baseline, each sample in the video image is shifted
horizontally in relation to the assigned depth value. The amount and the direction of the shift depend on the position of the virtual camera with respect to the original one. Objects in the foreground are moved by a larger horizontal shift than objects in the background. The two pictures in the second row of Fig. 2.12 show examples of the DIBR process. The left of the two pictures corresponds to a virtual camera position to the right of the original one and, vice versa, the right picture to a virtual camera position to the left of the original one.
These examples demonstrate that the depth-dependent shifting of samples during the DIBR process yields exposed image areas, which are marked in black in Fig. 2.12. The width of these areas decreases with the distance of objects to the camera, i.e. the nearer an object is to the camera, the larger the width of the exposed areas. In addition, their width grows with the distance between the original and the virtual camera position. As there is no texture information for these areas in the original image, they are kept black in Fig. 2.12. The missing texture has to be extrapolated from the existing texture, taken from other camera views, or generated by specialized in-painting techniques.
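The following sketch illustrates this horizontal, depth-dependent shifting for a single virtual view. The linear disparity model and the baseline scaling factor are simplifying assumptions for illustration, not the exact rendering used by the authors.

```python
import numpy as np

def render_virtual_view(image, depth8, alpha, max_disp=32):
    """Forward-warp an image into a virtual view at relative position `alpha`.

    depth8: 8-bit depth map (255 = near, 0 = far); alpha: signed position of the
    virtual camera along the baseline (e.g. 0.5 = halfway towards the other view).
    Returns the warped view and a hole mask for the exposed (disoccluded) areas.
    Disparity is modeled as alpha * max_disp * depth/255, an assumed simplification.
    """
    h, w = depth8.shape
    view = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    # Process far-to-near so that near samples overwrite far ones.
    for d in range(256):
        ys, xs = np.nonzero(depth8 == d)
        shift = int(round(alpha * max_disp * d / 255.0))
        xt = xs + shift
        ok = (xt >= 0) & (xt < w)
        view[ys[ok], xt[ok]] = image[ys[ok], xs[ok]]
        filled[ys[ok], xt[ok]] = True
    holes = ~filled
    return view, holes
```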
Fig. 2.12 Synthesis of two virtual views using DIBR: a original view, b aligned dense depth map, c virtual view rendered to the left, d virtual view rendered to the right

In a depth-based representation format with two views like the one from Fig. 2.11, the missing texture information can usually be extracted from the other stereo view. This applies especially to cases where the new virtual camera position lies between the left and right stereo cameras. Figure 2.13 illustrates the related process. First, a preferred original view to render the so-called primary view is selected among the two existing camera views. This selection depends on the position of the virtual camera. To minimize the size of exposed areas, the nearer of the two cameras is selected and the primary view is rendered from it as described
above. Subsequently, the same virtual view is rendered from the other camera, the
so-called secondary view. Finally, both rendered views are merged by filling the
exposed areas in the primary view with texture information from the secondary
view, resulting in a merged view. To reduce the visibility of remaining colorimetric or geometric deviations between the primary and the secondary view, the pixels
neighboring the exposed areas will be blended between both views.
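A minimal sketch of this merging step, reusing the hole masks produced by the rendering sketch above; the blending of pixels around the seams mentioned in the text is omitted for brevity.

```python
import numpy as np

def merge_views(primary, primary_holes, secondary, secondary_holes):
    """Fill exposed areas of the primary view with texture from the secondary view.

    Pixels that are holes in the primary but filled in the secondary are copied
    over; pixels missing in both remain holes for in-painting or extrapolation.
    """
    merged = primary.copy()
    fill_from_secondary = primary_holes & ~secondary_holes
    merged[fill_from_secondary] = secondary[fill_from_secondary]
    remaining_holes = primary_holes & secondary_holes
    return merged, remaining_holes
```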
Following this rendering concept, one can generate an arbitrary number of
virtual views between the two original camera positions. One possible application
is the generation of a new virtual stereo pair. This is especially interesting if one
aims to adapt the stereoscopic playback and depth range on a glasses-based system
to the respective viewing conditions, like screen size or viewing distance. However, creating a new virtual stereo pair is not the only target application. In the case
of auto-stereoscopic multi-view multi-user displays, DIBR can be used to calculate
the needed number of views and to adapt the overall depth range to the specific
properties of the 3D display. In this general case it might occur that not all of the
virtual views are located between the two original views but have to be positioned
outside of the stereo baseline. That especially applies if the total number of
required views is high (e.g. large viewing cone) or if one wishes to display more
depth on the auto-stereoscopic display. In fact, the existing depth range might be
limited if the production rules for conventional stereoscopic displays have been
taken into account (see Sect. 2.2). Thus, the available depth budget has to be
shared between all virtual views of an auto-stereoscopic display if DIBR is
restricted to view interpolation only, or, in other words, additional depth budget for
auto-stereoscopic displays can only be achieved by extrapolating beyond the
original stereo pair.


Fig. 2.13 View merging to fill exposed areas. a Primary view rendered from the left camera
view, b secondary view rendered from the right camera view, c merged view

This extrapolation process produces exposed areas in the rendered images that
cannot be filled with texture information from the original views. Hence, other
texture synthesis methods need to be applied. The suitability of the different
methods depends on the scene properties. If, for instance, the background or
occluded far-distant objects mainly contain homogeneous texture, a simple repetition of neighboring pixels in the occluded image parts is a good choice to fill
exposed regions. In case of more complex texture, however, this method fails and
annoying artifacts that considerably degrade the image quality of the rendered
views become visible. More sophisticated techniques, such as patch-based
in-painting methods, are required in this case [51, 52]. These methods reconstruct
missing texture by analyzing adjacent or corresponding image regions and by
generating artificial texture patches that seamlessly fit into the exposed areas.
Suitable texture patches can be found in the neighborhood, but also in other parts
of the image, or even in another frame of the video sequence.
Figure 2.14 shows the results of the two different texture synthesis methods
applied to a critical background. The top row depicts the two original camera
views, followed by extrapolated virtual views with large exposed regions in the
second row. The exposed regions are marked by black pixels. The third row shows the result of a simple horizontal repetition of the background samples. Clearly, this simple method yields annoying artifacts due to the critical texture in some background regions. The bottom row presents results of a patch-based in-painting algorithm, providing a considerable improvement of the rendering quality compared to the simple pixel repetition method.


Fig. 2.14 Extrapolated virtual views with critical content. First row the original stereo pair.
Second row extrapolated virtual views with exposed regions marked in black, third row simple
algorithm based on pixel repetition, fourth row sophisticated patch-based in-painting algorithm.
Test images kindly provided by European research project 3D4YOU [53]

2.8 Extension Toward Multi-Camera Systems


The previous sections described a workflow involving a generic 3D representation
format where two cameras inside a mirror rig were used to capture the scene. This
section addresses an extension of the workflow toward a multi-camera rig, i.e.
more than two cameras are used to capture the scene. Several multi-camera
approaches exist. In the simplest case, two or more cameras are positioned in a

side-by-side scenario. However, the stereo baseline or interaxial distance between two adjacent cameras would exceed the limits imposed by the stereoscopic production rules to be applied when generating content for classical stereoscopic displays. Consequently, in order to support a wide range of 3D displays while being backwards compatible to already established stereo displays, two of the captured views should coincide with a stereo pair which can directly be shown on a glasses-based display without any virtual view interpolation or other view manipulation. Further cameras should provide the information that is needed to create a generic depth-based 3D representation format from the captured videos and to inter- and extrapolate all missing views for a given auto-stereoscopic multi-view display from this generic intermediate format. For instance, a possible multi-camera set-up consists of two inner cameras inside a mirror box which capture the two backwards compatible stereo views, whereas two additional satellite cameras allow for wide-baseline applications (see Fig. 2.15).

Fig. 2.15 Outline of camera positions (left), CAD drawing of the 4-camera system used within MUSCADE [46] (center), and the 4-camera system in action (right). CAD drawing provided by KUK Filmproduktion
This approach is investigated by the European research projects MUSCADE
[46] and 3D4YOU [53]. A narrow baseline system is combined with a wide
baseline system. However, the MUSCADE project imposes additional constraints on the setup. In order to allow an efficient processing of the depth maps and virtual view rendering, all cameras within the setup of the MUSCADE project need to lie exactly on a common baseline. In order to meet this constraint, a careful calibration of the multi-camera rig needs to be performed during the setup process since, in a general configuration, the four camera centers would not lie on a single line. In the case of a conventional stereo production with two cameras, one can always apply two rectifying homographies to the left and the right view during post-production, resulting in a perfectly aligned stereo pair (cf. Sect. 2.5). However, in the case of four cameras, there is no set of four homographies which would lead to a configuration where all four cameras are pair-wise rectified. Instead, a depth-dependent re-rendering would be required.
The precise calibration of a multi-camera system is usually a time-consuming task as the system has many degrees of freedom. However, to keep the production costs within a reasonable range, a dedicated PC-based assistance system has been developed within the MUSCADE project. It is an extension of the stereoscopic analyzer [54] toward a quadrifocal setup. For the multi-camera setup, we wish to bring all four cameras into a position such that every pair of cameras is rectified, i.e. that Eq. (2.2) is valid (cf. Sect. 2.5). On the other hand, one can take
advantage of the additional cameras and estimate the trifocal tensor instead of a set
of fundamental matrices. A convenient approach for the estimation of the geometry is to select cameras 1, 2, and 3 (cf. Fig. 2.15) as one trifocal triplet and cameras 2, 3, and 4 as another trifocal triplet. That way, both triplets consist of cameras which still have a good overlap in their respective fields of view. This is especially important for a feature point-based calibration. In contrast, a camera triplet involving cameras 1 and 4 would suffer from a low number of feature points, as the overlap in the field of view is small due to the large baseline between the two cameras. We have aligned all three cameras of a triplet on a common baseline if the trifocal tensor in matrix notation simplifies to
\[ \{ T_1, T_2, T_3 \} = \left\{ \begin{pmatrix} c_x - c'_x & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},\; \begin{pmatrix} 0 & c_x & 0 \\ -c'_x & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},\; \begin{pmatrix} 0 & 0 & c_x \\ 0 & 0 & 0 \\ -c'_x & 0 & 0 \end{pmatrix} \right\} \tag{2.10} \]
where c_x and c'_x denote the stereo baseline between the first and the second camera and the baseline between the first and the third camera, respectively. One can now develop the trifocal tensor around this rectified state in a Taylor expansion comparable to Eq. (2.3), assuming that all deviations from the ideal geometric position are small enough to linearize them. A feature point triplet consisting of corresponding image points m, m′, and m″ has to satisfy the trifocal incidence relation

\[ [m']_{\times} \left( \sum_{i} m_i\,T_i \right) [m'']_{\times} = 0_{3\times3} \tag{2.11} \]

It follows a linear system of equations comparable to Eq. (2.4). According to Eq. (2.11), each feature point triplet gives rise to
nine additional equations. However, a detailed elaboration of the equations would
be beyond the scope of this chapter. Different strategies for computing the trifocal
tensor are given in [22].
We now discuss a suitable workflow for the estimation of depth maps for a set
of cameras as shown in Fig. 2.15. In Sect. 2.6, we described a workflow for the
depth estimation for two cameras. With the multi-camera rig, we can improve the
quality of the depth maps of the inner stereo pair in three ways. First, we can
eliminate outliers which survived the left-right consistency check by applying a
trifocal consistency check. Second, holes in the original depth map caused by
occlusions can mostly be filled using depth data from the respective satellite
camera. Finally, the wide baseline system greatly increases the depth resolution
compared to the inner narrow baseline, which allows sub-pixel accuracy for the
depth maps of the inner stereo pair. We will now describe this process in more
detail.
Figure 2.16 shows the result of the disparity estimation for camera 2 using
camera 3 (a), and camera 1 (b) after the respective left-right consistency check as


Fig. 2.16 Disparity maps for camera 2 generated from the inner stereo pair (a), and involving the
satellite camera 1 (b). The two disparity maps have opposite sign and a different scaling due to
the two different baselines between cameras 1 and 2 on one hand, and cameras 2 and 3 on the
other hand

discussed in Sect. 2.6. Apparently, the occlusions for the wide baseline system (b) are larger than for the narrow baseline system (a). In addition, the two
disparity maps have opposite signs and a different scaling, as the two baselines
differ. We can now apply the trifocal constraint to the disparity maps.
Figure 2.17a shows the result of the trifocal consistency check. Only pixels
available in both disparity maps (Fig. 2.16a, b) can be trifocal consistent. In a
general case, the trifocal consistency is checked by performing a point transfer of a candidate pixel into the third view involving the trifocal tensor and a 3D re-projection. In our case, the trifocal tensor is degenerate according to Eq. (2.10). Consequently, a candidate pixel lies on the same line in all three views. Moreover, a simple multiplication of the disparities by the ratio c_x/c'_x of the two baselines is
sufficient to perform the point transfer. This greatly improves the speed and the
accuracy of the trifocal consistency check. The same baseline ratio can be used to
normalize all disparity maps allowing us to merge them into a single disparity map
as shown in Fig. 2.17b. In a final step, we convert the disparity maps into depth maps. Please note that all pixels which are trifocal consistent have a high depth resolution, which will improve the quality of a later virtual view generation.
A generic 3D representation format for wide baseline applications should consequently allow for a resolution of more than 8 bits per pixel of the depth maps.
Certainly, a respective encoding scheme would need to reflect this requirement.
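The baseline-ratio point transfer and the subsequent merging step can be summarized in a few lines of code. The following Python sketch is only an illustration under simplifying assumptions (dense disparity maps with invalid pixels marked as NaN, a known baseline ratio c_x/c'_x, merging rules reduced to their essence); the function and parameter names are chosen for this example and are not part of the MUSCADE software.

import numpy as np

def merge_disparities(disp_inner, disp_wide, baseline_ratio, tol=1.0):
    """Sketch of the trifocal consistency check and disparity merging.
    disp_inner: disparity of camera 2 estimated against camera 3 (narrow baseline).
    disp_wide:  disparity of camera 2 estimated against camera 1 (wide baseline).
    baseline_ratio: c_x / c'_x, used to normalize scale and sign (cf. Fig. 2.16)."""
    # Bring the wide-baseline map to the scale and sign of the inner pair.
    wide_norm = -disp_wide * baseline_ratio
    valid = ~np.isnan(disp_inner) & ~np.isnan(wide_norm)
    # Trifocal consistency: after normalization both estimates must agree.
    consistent = valid & (np.abs(disp_inner - wide_norm) <= tol)
    # Consistent pixels keep the normalized wide-baseline value, which offers
    # higher depth resolution; holes of the inner map are filled from the satellite view.
    merged = np.where(consistent, wide_norm, disp_inner)
    merged = np.where(np.isnan(merged), wide_norm, merged)
    return merged, consistent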
Once we have a set of four depth maps, we can generate virtual views along the
whole baseline using view interpolation. The concept of the depth-image-based
rendering has been discussed in Sect. 2.7. We will now extend the rendering using
DES toward the multi-camera geometry, i.e. MVD4 in our case.
Figure 2.18 illustrates the rendering process using DIBR and MVD4. We have
four cameras with associated depth maps. By inspection, we can identify the inner
stereo pair with a narrow baseline and small parallax (between cameras 2 and 3).
These views are suitable for direct display on a stereoscopic display without view interpolation. There is a noticeable perspective change between cameras 1 and 2
as well as between cameras 3 and 4. However, the views generated using DIBR
have an equidistant virtual baseline. These views are suitable for auto-stereoscopic
displays.


Fig. 2.17 a Trifocal consistent disparity maps. b Merged disparity maps. The two disparity maps
from Fig. 2.16 are normalized and merged into one resulting disparity map

Fig. 2.18 Depth-image-based rendering using MVD4. An arbitrary number of views can be
generated using a wide baseline setup

We will now describe how the DES rendering concept is applied to MVD4
content. The main idea is to perform a piece-wise DES rendering. For every virtual
camera position, two original views are selected as the DES stereo pair. In the simplest
case, these are the two original views nearest to the virtual view. Figure 2.19
illustrates the selection process of primary and secondary view.
However, the quality of the depth maps might differ between the views. In this
case, one might want to use a high quality view as primary view, even if it is
farther away from the position of the virtual view. As shown in Fig. 2.19, the
rendering order can be visualized in a simple table. View A, for instance, lies between cameras 1 and 2. It is nearest to view 1 (blue) and second nearest to view 2


Fig. 2.19 Each original view can serve as primary or secondary view depending on the position
of the virtual view. Generally, the two nearest views serve as primary and secondary view for the
DES rendering (cf. text and Sect. 2.7)

(red). Consequently, view 1 serves as the primary view and view 2 as the secondary view. Please note that the renderer needs to know the relative positions of the original views, i.e. the stereo baselines c_x and c'_x from Eq. (2.10), in order to perform a proper merging of the primary and secondary view. The virtual baseline positions therefore need to be transmitted, as they are important meta-data of the generic 3D representation format. The rendering scheme is then flexible
enough to handle DES, LDV, MVD, or a combination of these formats.
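As a minimal illustration of the view selection of Fig. 2.19, the sketch below picks the primary and secondary view purely by distance along the common baseline; in the quality-aware variant described above, this ordering would be overridden when a farther view has a better depth map. The function name and example positions are assumptions of this sketch.

def select_des_pair(camera_positions, virtual_position):
    # Sort the original views by their distance to the virtual camera position.
    order = sorted(range(len(camera_positions)),
                   key=lambda i: abs(camera_positions[i] - virtual_position))
    primary, secondary = order[0], order[1]
    return primary, secondary

# Example: four cameras on a common baseline and a virtual view A between
# cameras 1 and 2 (indices 0 and 1); view 1 becomes primary, view 2 secondary.
print(select_des_pair([0.0, 1.0, 2.0, 3.0], 0.4))   # -> (0, 1)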

2.9 Conclusions
In this chapter, we have described experiences and results of different conversion
processes from stereo to multi-view. The workflow has been extended toward a
four-camera production workflow in the framework of the European MUSCADE
project which aims to perform field trials and real-time implementations needed by
a generic 3D production chain.
We presented the requirements of a generic 3D representation format and a
generic display-agnostic production workflow which should support different types
of displays, as well as current and future 3D applications.
Subsequently, calibration strategies for stereo- and multi-camera setups have
been discussed which allow an efficient setup of the rigs not only in laboratory environments but also in the field. A successful introduction of the display-agnostic 3D workflow relies on the existence of a cost-effective production
workflow. In this sense, a fast and reliable calibration is one important component.


The stereo calibration has been discussed in Sect. 2.5 while the extension toward
multi-camera-rig calibration has been discussed within Sect. 2.6.
Different approaches for efficient depth map generation have been discussed for
stereo- (Sect. 2.6) and multi-view setups (Sect. 2.8). As the quality of the depth
maps is critical for the whole DIBR production chain, we presented strategies to
improve their quality by using more than two cameras, i.e. a multi-camera rig (see
Sect. 2.8).
Finally, the process of the DIBR using DES has been presented in Sect. 2.7. We
extended the workflow toward MVD4 and similar generic 3D representation formats in Sect. 2.8.

References
1. Feldmann I, Schreer O, Kauff P, Schäfer R, Fei Z, Belt HJW, Divorra Escoda O (2009) Immersive multi-user 3D video communication. In: Proceedings of international broadcast conference (IBC 2009), Amsterdam, NL, Sept 2009
2. Divorra Escoda O, Civit J, Zuo F, Belt H, Feldmann I, Schreer O, Yellin E, IJsselsteijn W, van Eijk R, Espinola D, Hagendorf P, Waizenegger W, Braspenning R (2010) Towards 3D-aware telepresence: working on technologies behind the scene. In: Proceedings of ACM conference on computer supported cooperative work (CSCW), new frontiers in telepresence, Savannah, Georgia, USA, 06-10 Feb 2010
3. Feldmann I, Atzpadin N, Schreer O, Pujol-Acolado J-C, Landabaso J-L, Divorra Escoda O (2009) Multi-view depth estimation based on visual-hull enhanced hybrid recursive matching for 3D video conference systems. In: Proceedings of 16th international conference on image processing (ICIP 2009), Cairo, Egypt, Nov 2009
4. Waizenegger W, Feldmann I, Schreer O (2011) Real-time patch sweeping for high-quality depth estimation in 3D videoconferencing applications. In: SPIE 2011 conference on real-time image and video processing, San Francisco, CA, USA, 23-27 Jan 2011, Invited Paper
5. Pastoor S (1991) 3D-Television: a survey of recent research results on subjective requirements. Signal Process Image Commun 4(1):21-32
6. IJsselsteijn WA, de Ridder H, Vliegen J (2000) Effects of stereoscopic filming parameters and display duration on the subjective assessment of eye strain. In: Proceedings of SPIE stereoscopic displays and virtual reality systems, San Jose, Apr 2000
7. Mendiburu B (2008) 3D movie making: stereoscopic digital cinema from script to screen. Elsevier, ISBN: 978-0-240-81137-6
8. Yeh YY, Silverstein LD (1990) Limits of fusion and depth judgement in stereoscopic color pictures. Hum Factors 32(1):45-60
9. Holliman N (2004) Mapping perceived depth to regions of interest in stereoscopic images. In: Stereoscopic displays and applications XV, San Jose, California, Jan 2004
10. Jones G, Lee D, Holliman N, Ezra D (2001) Controlling perceived depth in stereoscopic images. In: Proceedings SPIE stereoscopic displays and virtual reality systems VIII, San Jose, CA, USA, Jan 2001
11. Woods A, Docherty T, Koch R (1993) Image distortions in stereoscopic video systems. Proc SPIE 1915:36-48
12. Faubert J (2000) Motion parallax, stereoscopy, and the perception of depth: practical and theoretical issues. In: Proceedings of SPIE three-dimensional video and display: devices and systems, Boston, MA, USA, pp 168-191, Nov 2000
13. Zilly F, Kluger J, Kauff P (2011) Production rules of 3D stereo acquisition. In: Proceedings of the IEEE (PIEEE), special issue on 3D media and displays, vol 99, issue 4, pp 590-606, Apr 2011
14. Johnson RB, Jacobsen GA (2005) Advances in lenticular lens arrays for visual display. In: Current developments in lens design and optical engineering VI, proceedings of SPIE, vol 5874, Paper 5874-06, San Diego, Aug 2005
15. Börner R (1993) Auto-stereoscopic 3D imaging by front and rear projection and on flat panel displays. Displays 14(1):39-46
16. Omura K, Shiwa S, Kishino F (1995) Development of lenticular stereoscopic display systems: multiple images for multiple viewers. In: Proceedings of SID 95 digest, pp 761-763
17. Hamagishi G et al (1995) New stereoscopic LC displays without special glasses. Proc Asia Disp 95:921-927
18. Dodgson NA, Moore JR, Lang SR, Martin G, Canepa P (2000) A time sequential multi-projector autostereoscopic display. J Soc Inform Display 8(2):169-176
19. Lowe D (2004) Distinctive image features from scale invariant keypoints. IJCV 60(2):91-110
20. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) SURF: speeded up robust features. Comput Vis Image Underst (CVIU) 110(3):346-359
21. Zilly F, Riechert C, Eisert P, Kauff P (2011) Semantic kernels binarized: a feature descriptor for fast and robust matching. In: Conference on visual media production (CVMP), London, UK, Nov 2011
22. Hartley R, Zisserman A (2000) Multiple view geometry in computer vision. Cambridge University Press, Cambridge
23. Fischler M, Bolles R (1980) Random sample consensus: a paradigm for model fitting applications to image analysis and automated cartography. In: Proceedings of image understanding workshop, April 1980, pp 71-88
24. Fusiello A, Trucco E, Verri A (2000) A compact algorithm for rectification of stereo pairs. Mach Vis Appl 12(1):16-22
25. Mallon J, Whelan PF (2005) Projective rectification from the fundamental matrix. Image Vis Comput 23(7):643-650
26. Wu H-H, Yu Y-H (2005) Projective rectification with reduced geometric distortion for stereo vision and stereoscopic video. J Intell Rob Syst 42:71-94
27. Scharstein D, Szeliski R (2001) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vis 47(1/2/3):7-42, Apr-June 2002. Microsoft Research Technical Report MSR-TR-2001-81, Nov 2001
28. Brown MZ, Burschka D, Hager GD (2003) Advances in computational stereo. IEEE Trans Pattern Anal Mach Intell (PAMI) 25(8):993-1008
29. Aschwanden P, Guggenbuhl W (1993) Experimental results from a comparative study on correlation-type registration algorithms. In: Förstner W, Ruwiedel S (eds) Robust computer vision. Wichmann, Karlsruhe, pp 268-289
30. Wegner K, Stankiewicz O (2009) Similarity measures for depth estimation. In: 3DTV conference: the true vision - capture, transmission and display of 3D video
31. Birchfield S, Tomasi C (1996) Depth discontinuities by pixel-to-pixel stereo. Technical report STAN-CS-TR-96-1573, Stanford University, Stanford
32. Belhumeur PN (1996) A Bayesian approach to binocular stereopsis. IJCV 19(3):237-260
33. Cox IJ, Hingorani SL, Rao SB, Maggs BM (1996) A maximum likelihood stereo algorithm. CVIU 63(3):542-567
34. Boykov Y, Veksler O, Zabih R (2001) Fast approximate energy minimization via graph cuts. IEEE Trans Pattern Anal Mach Intell 23(11):1222-1239
35. Kolmogorov V, Zabih R (2001) Computing visual correspondence with occlusions using graph cuts. Proc Int Conf Comput Vis 2:508-515
36. Kolmogorov V, Zabih R (2005) Graph cut algorithms for binocular stereo with occlusions. In: Mathematical models in computer vision: the handbook. Springer, New York
37. Kolmogorov V, Zabih R (2004) What energy functions can be minimized via graph cuts? IEEE Trans Pattern Anal Mach Intell 26(2):147-159
38. Bleyer M, Gelautz M (2007) Graph-cut-based stereo matching using image segmentation with symmetrical treatment of occlusions. Signal Process Image Commun 22(2):127-143
39. Sun J, Shum HY, Zheng NN (2002) Stereo matching using belief propagation. ECCV
40. Yang Q, Wang L, Yang R, Wang S, Liao M, Nister D (2006) Real-time global stereo matching using hierarchical belief propagation. In: Proceedings of the British machine vision conference, 2006
41. Felzenszwalb PF, Huttenlocher DP (2006) Efficient belief propagation for early vision. Int J Comput Vis 70(1), Oct 2006
42. Brown MZ, Burschka D, Hager GD (2003) Advances in computational stereo. IEEE Trans Pattern Anal Mach Intell 25(8):993-1008
43. Kopf J, Cohen M, Lischinski D, Uyttendaele M (2007) Joint bilateral upsampling. In: Proceedings of the SIGGRAPH conference, ACM Transactions on Graphics, vol 26, no 3
44. Riemens AK, Gangwal OP, Barenbrug B, Berretty R-PM (2009) Multi-step joint bilateral depth upsampling. In: Proceedings of SPIE visual communications and image processing, vol 7257, article M, Jan 2009
45. Atzpadin N, Kauff P, Schreer O (2004) Stereo analysis by hybrid recursive matching for real-time immersive video conferencing. IEEE Transactions on Circuits and Systems for Video Technology, special issue on immersive telecommunications, vol 14, no 3, pp 321-334, Jan 2004
46. Muscade (MUltimedia SCAlable 3D for Europe), European FP7 research project. http://www.muscade.eu/
47. Ziegler M, Falkenhagen L, ter Horst R, Kalivas D (1998) Evolution of stereoscopic and three-dimensional video. Signal Process Image Commun 14(1-2):173-194
48. Redert A, Op de Beeck M, Fehn C, IJsselsteijn W, Pollefeys M, Van Gool L, Ofek E, Sexton I, Surman P (2002) ATTEST: advanced three-dimensional television systems technologies. In: Proceedings of first international symposium on 3D data processing, visualization, and transmission, Padova, Italy, pp 313-319, June 2002
49. Mohr R, Buschmann R, Falkenhagen L, Van Gool L, Koch R (1998) Cumuli, panorama, and vanguard project overview. In: 3D structure from multiple images of large-scale environments, Lecture notes in computer science, vol 1506/1998, pp 1-13. doi:10.1007/3-540-49437-5_1
50. Fehn C (2004) Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV. In: Proceedings of SPIE stereoscopic displays and virtual reality systems XI, San Jose, CA, USA, pp 93-104, Jan 2004
51. Köppel M, Ndjiki-Nya P, Doshkov D, Lakshman H, Merkle P, Mueller K, Wiegand T (2010) Temporally consistent handling of disocclusions with texture synthesis for depth-image-based rendering. In: Proceedings of IEEE ICIP, Hong Kong
52. Ndjiki-Nya P, Köppel M, Doshkov D, Lakshman H, Merkle P, Mueller K, Wiegand T (2010) Depth-image-based rendering with advanced texture synthesis. In: Proceedings of IEEE international conference on multimedia & expo, Singapore
53. 3D4YOU, European FP7 research project. http://www.3d4you.eu/
54. Zilly F, Müller M, Kauff P (2010) The stereoscopic analyzer: an image-based assistance tool for stereo shooting and 3D production. In: Proceedings of ICIP 2010, special session on image processing for 3D cinema production, Hong Kong, 26-29 Sept 2010

Chapter 3

Stereo Matching and Viewpoint Synthesis FPGA Implementation
Chao-Kang Liao, Hsiu-Chi Yeh, Ke Zhang, Vanmeerbeeck Geert,
Tian-Sheuan Chang and Gauthier Lafruit

Abstract With the advent of 3D-TV, the increasing interest in free viewpoint TV in
MPEG, and the inevitable evolution toward high-quality and higher resolution TV
(from SDTV to HDTV and even UDTV) with comfortable viewing experience, there is
a need to develop low-cost solutions addressing the 3D-TV market. Moreover, it is
believed that in a not too distant future 2D-UDTV display technology will support a
reasonable quality 3D-TV autostereoscopic display mode (no need for 3D glasses)
where up to a dozen intermediate views are rendered between the extreme left and
right stereo video input views. These intermediate views can be synthesized by using
viewpoint synthesizing techniques with the left and/or right image and associated
depth map. With left and right images being the most straightforward 3D-TV broadcasting format and their penetration increasing, extracting a high-quality depth map from these stereo input images becomes mandatory to synthesize the other intermediate views. This chapter describes such Stereo-In to Multiple-Viewpoint-Out
C.-K. Liao (&)
IMEC Taiwan Co., Hsinchu, Taiwan
e-mail: ckliao@imec.be
H.-C. Yeh · K. Zhang · V. Geert · G. Lafruit
IMEC vzw, Leuven, Belgium
e-mail: lalalachi@gmail.com
K. Zhang
e-mail: zhangke@imec.be
V. Geert
e-mail: vanmeerb@imec.be
T.-S. Chang
National Chiao Tung University, Hsinchu, Taiwan
e-mail: tschang@twins.ee.nctu.edu.tw
G. Lafruit
e-mail: lafruit@imec.be


functionality on a general FPGA-based system demonstrating a real-time high-quality depth extraction and viewpoint synthesizer, as a prototype toward a future chipset for 3D-HDTV.




Keywords 3D-TV chipset · Autostereoscopic · Census transform · Computational complexity · Cross-check · Depth map · Disparity · Dynamic programming · Free viewpoint TV · FPGA · GPU · Hamming distance · Hardware implementation · Hole filling · Matching cost · Real-time · Stereo matching · Support region builder · View synthesis · Warping




3.1 Introduction
In recent years, people have become more and more attracted by the large range of applications enabled by 3D sensing and display technology. The basic principle is that a vision device providing a pair of stereo images to the left and right eye, respectively, is sufficient to provide a 3D depth sensation through the brain's natural
depth interpretation ability. Many applications like free viewpoint TV and 3D-TV
benefit from this unique characteristic.

3.1.1 Stereoscopic Depth Scaling


In the early days of 3D movie theaters, 3D content was often captured with exaggerated depth to give a spectacular depth sensation, at the cost of occasional visual discomfort. Nowadays, the depth perception has been greatly reduced to an average degree of comfort for all viewers. In any case, depth perception is subjective and also depends on the viewing distance and display size. Consequently, just as each 2D TV set offers contrast and color/luminance controls, future 3D TV sets will also add a 3D depth scale control, as shown in Fig. 3.1.
Such depth scaling can be achieved by calculating a new left/right image pair from
the original left/right stereo images. In essence, pixels in the images have to be shifted/
translated over a distance depending on their depth; hence, such a viewpoint synthesis
system is very similar to the free viewpoint TV of Fig. 3.2, with the difference that two
viewpoints are rendered instead of one, for achieving 3D stereoscopic rendering.
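A minimal sketch of such disparity-based pixel shifting is given below (Python with NumPy, grayscale images, no hole filling or occlusion handling); the scaling factor alpha is an assumption of this example: rendering a new stereo pair with a reduced alpha corresponds to a reduced depth impression.

import numpy as np

def shift_view(left, disparity_left, alpha=0.5):
    """Synthesize a virtual view by shifting each pixel of the left image
    horizontally by a fraction alpha of its disparity. alpha = 0 reproduces
    the left view; smaller alpha values reduce the rendered depth."""
    h, w = left.shape
    virtual = np.zeros_like(left)
    for y in range(h):
        for x in range(w):
            xs = int(round(x - alpha * disparity_left[y, x]))  # shift along the scanline
            if 0 <= xs < w:
                virtual[y, xs] = left[y, x]
    return virtual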

3.1.2 3D Autostereoscopic Displays


Undoubtedly, the next step in watching 3D content is to use a glasses-free 3D
display solution where many intermediate views between extreme left and right
captured views are projected toward the viewer. The eyes then capture two of these
dozens of viewing cones, and lateral movements of the viewer will position the


Fig. 3.1 Synthesized virtual images with different levels of depth control

Fig. 3.2 Schematic of an autostereoscopic display with multiple synthesized views

viewer's eyes into the two corresponding adjacent viewing cones, giving the viewer the illusion of turning around the projected 3D content.
Though it is conceivable that the accompanying viewpoint interpolation process
requiring scene depth information might be based on sender-precalculated depth
maps broadcast through the network, it is more likely that today's left/right stereo video format, used in 3D digital cinema and 3D Blu-ray with polarized/shutter-glasses display technology, will become legacy in tomorrow's 3D-TV broadcasting


Fig. 3.3 Free viewpoint TV: the images of a user-defined viewing direction are rendered

market. Each autostereoscopic 3D-TV (or set-top box) receiver will then have to
include a chipset for depth estimation and viewpoint synthesis at reasonable cost.

3.1.3 Free Viewpoint TV


Free viewpoint TV projects any viewer-selected viewpoint onto the screen, in a similar
way as one chooses the viewpoint in a 3D game with the mouse and/or joystick (or any
gesture recognition system), as shown in Fig. 3.3. The difference in free viewpoint TV
however is that the original content is not presented in a 3D polygonal format; the
original content consists only of the images captured by a stereo camera system, and
any intermediate viewpoint between the two extreme viewpoints (viewpoint 1 and 5 in
Fig. 3.3) has to be synthesized by first detecting depth and occlusions in the scene,
followed by a depth-dependent warping and occlusion hole filling (inpainting) process.
In Fig. 3.3, five viewpoints can be rendered and only one of the viewpoints 2, 3, and 4 is
actually synthesized as a new free viewpoint TV viewpoint.

3.1.4 3D Gesture Recognition


In recent years, gesture detection for interfacing with gaming consoles, such as
Nintendo Wii and Microsoft Kinect for Xbox360, has gained a high popularity.
Games are now controlled with simple gestures instead of using the old-fashioned
joystick. For instance, the Kinect console is equipped with an infrared camera that


Fig. 3.4 Depth measurement for 3D gesture recognition. The monitor shows the depth-image
captured and calculated by a FullHD stereo-camera with the proposed system

projects a pattern onto the scene, out of which the player's 3D movements can be
extracted.
The same function can be achieved using a stereo camera where depth is extracted
by stereo matching (Fig. 3.4). There is much benefit from using a stereo camera
system: the device allows higher depth precision (active depth sensing devices
always exhibit a low output image resolution), and the system can be used outdoors,
even under intense sunlight conditions, which is out of reach for any IR active sensing
device. Finally, compared with the active IR lighting systems, the stereo approach
exhibits lower energy consumption. Therefore, using stereo cameras with a stereo
matching technique has a high potential in replacing Kinect-like devices.
Although stereo matching provides depth information that adds value to a gesture recognition application, it remains a challenge to obtain good image quality
at a reasonable computational complexity. Since today even GPUs with their
incredible processing power are hardly capable of performing such tasks in real time,
we propose the development of a stereo matching and viewpoint interpolation processing chain, implemented onto FPGA, paving the way to future 3D-TV chipset
solutions. In particular, we address the challenge of proper quality/complexity
tradeoffs satisfying user requirements at minimal implementation cost (memory
footprint and gate count). The following describes a real-time FPGA implementation
of such stereo matching and viewpoint synthesis functionality. It is also a milestone
to eventually pave the way to full-HD 3D on autostereoscopic 3D displays.

3.2 Stereo Matching Algorithm Principles


Depth map extraction from the stereo input images is mostly based on measuring
the lateral position difference of an object shown on two horizontally separated
stereo images. The lateral position difference, called disparity, is a consequence


Fig. 3.5 The schematic model of a stereo camera capturing an object P. The cameras with focal
length f are assumed to be identical and are located on an epipolar line with a spacing S

of different light paths from the object to respectively the left and right cameras
with their corresponding location and viewing angle [1], mimicking the human
visual system with its depth impression and 3D sensation.
Figure 3.5 shows an illustration of disparity and the corresponding geometrical
relation between an object and a stereo camera rig. Two images from the object are
captured by two cameras which are assumed to be perfectly parallel and aligned to
each other on a lateral line, corresponding to the so-called epipolar line [2]. A
setup with converging cameras can be transformed back and forth to this scenario
through image rectification. To simplify the discussion, the camera parameters
(except their location) are identical.
In this chapter, we assume that the rectification has been done in advance. The
origins of the cameras are shown to be on their rear side, corresponding to their
respective focal point. When an object image is projected to the camera, each pixel
senses the light along the direction from the object to the camera origin. Consequently, any camera orientation has its own corresponding object image and
therefore the object is at a different pixel location in the respective camera views.
For example, the object at point P is detected at pixel X_left = +5 and X_right = -7 in the left and right cameras, respectively (see Fig. 3.5). The disparity is calculated as the difference of these two lateral distances (i.e., 5 + 7 = 12). The disparity is a function of the depth z, the spacing between these
two cameras s, and focal length f of the camera as shown in Eq. (3.1).
z = s \cdot f / d    (3.1)

According to this equation the object depth can be calculated by measuring the
disparity from the stereo images. In general, textures of an object on the respective
stereo images exhibit a high image correlation, both in color and shape, since they
originate from the same object with the same illumination environment and a


similar background and are assumed to be captured by identical cameras. Therefore, it is suitable to measure the disparity by detecting the image texture displacement with a texture similarity matching metric.
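As an illustrative numerical example (the values are assumptions of this sketch, not parameters of the system described later): with a camera spacing s = 6 cm and a focal length f = 1000 pixels, a measured disparity of d = 12 pixels corresponds, according to Eq. (3.1), to a depth of z = (0.06 m × 1000)/12 = 5 m, while an object at 10 m would only produce a disparity of 6 pixels.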
According to the taxonomy summarized by [3] and [4], algorithms for stereo
matching have been investigated for almost 40 years and can be categorized into
two classes, the local-approach and the global-approach algorithms. The stereo
matching computational flow can be summarized in four steps:
1. Matching cost computation.
2. Cost aggregation.
3. Disparity computation.
4. Disparity map refinement.

where the local-approach algorithms generally perform steps 1, 2, 3, and 4 but without global optimization, whereas global-approach algorithms commonly perform steps 1, 3, and 4.
In the local-approach algorithms, the disparity is calculated from the matching
cost information of matching candidates that are only included in a preset local
disparity range. In order to increase the matching accuracy, local approaches rely
on cost aggregation to include more matching cost information from neighborhood
pixels [3]. The global-approach algorithms, on the other hand, concentrate more on
disparity computation and global optimization by utilizing energy functions and
smoothness assumptions. The energy function includes the matching cost information from the entire image. It reduces the matching failure in some areas due to
the noise, occlusion, texture less, or repetitive pattern, etc. Although globalapproach methods render a good quality disparity map by solving these issues,
they always put higher computational complexity demands and memory resources,
compared to the local-approach methods, making a real-time system implementation challenging [3]. All in all, there exists a tradeoff between matching quality
and computational complexity.
In this section, we introduce the aforementioned four computation steps with
regard to their applicability in hardware system setups without compromising the resulting visual quality.

3.2.1 Matching Cost Calculation


The fundamental matching cost calculation is based on pixel-to-pixel matching
within one horizontal line over a certain disparity range. The image intensity of a pixel on the reference image is compared with that of the target stereo image at different disparities, as shown in Fig. 3.6.
The cost is a function of disparity that indicates how similar the chosen stereo regions are. Several methods, including the sum of absolute intensity differences (SAD), the sum of squared intensity differences (SSD), normalized cross-correlation (NCC), and the census transform (CT), etc. [3],


Fig. 3.6 Fundamental pixel-to-pixel matching method

Table 3.1 General matching cost calculation methods

NCC (point operation: multiplication):
  \sum_{x,y} (I_r(x,y) - \bar{I}_r)(I_t(x+d,y) - \bar{I}_t) \,/\, \sqrt{ \sum_{x,y} (I_r(x,y) - \bar{I}_r)^2 (I_t(x+d,y) - \bar{I}_t)^2 }

SAD (point operation: subtraction):
  \sum_{x,y} | I_r(x,y) - I_t(x+d,y) |

SSD (point operation: subtraction and squarer):
  \sum_{x,y} ( I_r(x,y) - I_t(x+d,y) )^2

CT (point operation: XOR):
  I'_k(x,y) = \mathrm{Bitstring}_{m,n}( I_k(m,n) < I_k(x,y) ),  cost = \sum_{x,y} \mathrm{Hamming}( I'_r(x,y), I'_t(x+d,y) )

have been proposed to calculate the matching cost. The definitions of these
methods with their point operation are shown in Table 3.1 [5].
Cross-correlation is a commonly used operation to calculate the similarity in the
standard statistics methods [6]. Furthermore, NCC is also reported in order to prevent
the influence of radiometric differences between the stereo images on the cost estimation result [7, 8]. Cross-correlation is, however, computationally complex, and thus only few studies use this method to calculate the matching cost in a real-time system [9]. Similar to SSD, SAD mainly measures the intensity difference between the reference region and the target region. Compared to SSD, SAD is often used for system implementation since it is more computationally efficient. Different from these approaches, the CT performs outstanding similarity measurements at acceptable cost [10]. The CT method first translates the luminance comparison result
between the processed anchor pixel (the central pixel of the moving window) and the
neighboring pixels into a bit string. Then it generates the matching cost by computing
the Hamming distance between these bit strings on the reference and target images. As such, the CT matching cost encodes relative luminance information, not absolute luminance values, making it more tolerant against luminance error/bias
between the two regions to match, while increasing the discrimination level by
recording a statistical distribution rather than individual pixel values.


Fig. 3.7 The CT and Hamming distance calculation resulting in a raw matching cost in terms of
disparities. a The image intensity map. b The CT. c The census vector aggregated from neighbor
pixels, and d the calculated raw matching cost. Note that the processed anchor pixel (65) is at
coordinates (x, y)

With a predetermined area of pixels the CT results in a census vector (a binary


array with the same number of bits as the number of pixels in the window region
around the pixel under study) that can be used as a representation of the processed
anchor pixel, which is the central pixel of the window (i.e., pixel at (x, y) in Fig. 3.7).
These bits are set to 1 if the luminance of the processed anchor pixel is greater than that of the corresponding neighboring pixel; otherwise they are set to 0 (see Fig. 3.7). Although this
transform cannot exactly represent the uniqueness of the window center pixel, the
involved calculations like comparators and vector forming are hardware friendly.
After the CT, both left/right images become left/right census arrays, as shown in
Fig. 3.7. Consequently, the pixel characteristic is no longer represented by its
luminance, but rather by an aggregated vector that also includes the relative luminance difference from the neighborhood of the processed anchor pixel.
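To make the CT and the Hamming-distance cost of Table 3.1 concrete, the following Python sketch computes the census bit strings of an image and the raw matching cost for a single disparity candidate. It is a simplified software model (window size, wrap-around border handling, and data types are assumptions of this example), not the stream-oriented FPGA design discussed in Sect. 3.3.

import numpy as np

def census_transform(img, radius=1):
    """Census transform with a (2*radius+1) x (2*radius+1) window: each neighbor
    contributes one bit, set to 1 if the neighbor is darker than the anchor pixel
    (fits in 32 bits for radius <= 2)."""
    h, w = img.shape
    census = np.zeros((h, w), dtype=np.uint32)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            census = (census << 1) | (neighbor < img).astype(np.uint32)
    return census

def raw_matching_cost(census_ref, census_tgt, d):
    """Hamming distance between the reference census vector at (x, y) and the
    target census vector at (x - d, y) for one disparity candidate d."""
    shifted = np.roll(census_tgt, d, axis=1)           # align target to reference
    xor = census_ref ^ shifted
    return np.array([[bin(v).count("1") for v in row] for row in xor])

For a full disparity search, raw_matching_cost would be evaluated for every candidate d in the disparity range, producing the cost volume on which the aggregation, disparity computation, and refinement steps described next operate.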
The cost aggregation step accumulates the neighboring matching costs within a
support region in order to increase the matching accuracy. For the hardware
implementation, the chosen cost aggregation method depends on the hardware cost
and the quality requirement of the disparity map. In general, there are four types of
cost aggregation strategies from coarse to fine: fixed window, multiple windows,
adaptive shape, and adaptive weight as shown in Fig. 3.8. The fixed window
approach sums up the matching cost within a simple window region. It yields the
lowest computational complexity but performs weakly in discontinuous, textureless, and repetitive regions. To overcome the problems of the fixed window method,
multiple windows is an advanced approach, combining several subwindows.
A number of subwindows are predefined to make up the support region, which is not
limited to be of rectangular shape. Another approach is the adaptive shape method,
where the matching cost aggregation is performed within a region of neighboring
pixels exhibiting similar color intensities. This method is able to preserve object
boundaries. Zhang et al. propose the cross-based method [11] which is an example of
the adaptive shape approach with acceptable computational complexity. Finally, the
most accurate method is the adaptive weight approach [12-14], where the nearest pixels with similar intensity around the central pixel have a higher probability of sharing


Fig. 3.8 Categories of local stereo matching algorithms based on the level of detail in the
support region of neighboring pixels around the processed anchor pixel. Four categories are listed
from coarse to fine, i.e., fixed window methods, multiple windows methods, adaptive shape
methods, and adaptive weight methods. One possible support region to the pixel p is presented for
each category, respectively (brightness denotes the weight value for the adaptive weight methods)

the same disparity value. Typically, a Gaussian filter is used to decide the weight of the aggregated matching cost based on the distance to the processed anchor pixel. Although the adaptive weight method is able to achieve very accurate disparity maps, its computational complexity is relatively higher than that of the other approaches.
Conventional cost aggregation within the support region has a computational complexity of O(n^2), where n is the size of the support region window. Wang et al. [15] propose a two-pass cost aggregation method, which efficiently reduces the computational complexity to O(2n). Other techniques for hardware complexity reduction use data reuse. Chang et al. [16] propose the mini-CT and partial column reuse techniques to reduce the memory bandwidth and computational complexity by caching each column's data over the overlapping windows. In our previous hardware implementation work, contributed by Lu et al. [17-19], the two-pass cost aggregation approach was chosen and associated with the cross-based support region information. Data-reuse techniques were also applied to both horizontal and vertical cost aggregation computations to construct a hardware-efficient architecture.
A more recent implementation has modified the disparity computations as
explained in the following sections.

3.2.2 Disparity Computation


Disparity computation approaches can be categorized into two types: local stereo matching and global optimization. Local stereo matching algorithms utilize winner-take-all (WTA) strategies to select the disparity which possesses the minimum matching cost value. The chosen displacement is regarded as the disparity value [20]. Global stereo matching optimization algorithms compute the disparity value with the help of an energy function, which introduces a smoothness assumption. Additional constraints are inserted to support smoothness, such as penalizing sudden


Fig. 3.9 Example of smoothness cost penalties in the Linear model

changes in disparity. The energy function can be simply represented as in Eq. (3.2):

E(d) = E_{\mathrm{data}}(d) + E_{\mathrm{smooth}}(d)    (3.2)

The first term represents the sum of matching costs for the image pixels p at their respective disparities d_p. Equation (3.3) gives the first term of Eq. (3.2):

E_{\mathrm{data}}(d) = \sum_{p \in N} C(p, d_p)    (3.3)
where N represents the total number of pixels, and d_p ranges from 0 to a chosen maximum disparity value. The second term of Eq. (3.2) represents the smoothness function as in Eq. (3.4). We compare two models, the Linear and the Potts model, to implement the smoothness function:

E_{\mathrm{smooth}}(d) = k \sum_{(p,q) \in N} S(d_p, d_q)    (3.4)
where k is a scaling coefficient, which adapts to the local intensity changes in order to preserve disparity-discontinuous regions, and d_p and d_q are the disparities at adjacent pixels p and q. Equation (3.5) shows the Linear model for the smoothness function. The smoothness cost of the Linear model depends on the difference between the disparity of the current pixel and all candidate disparities of the previous pixel. A higher disparity difference introduces a higher smoothness cost penalty. Figure 3.9 is an example demonstrating the smoothness cost penalties in the Linear model. This mechanism helps to preserve the smoothness of the disparity map when performing global stereo matching.

S(d_p, d_q) = |d_p - d_q|    (3.5)

where d_p, d_q \in [0, disparity range - 1].


Fig. 3.10 Example of smoothness cost penalties in the Potts model

Unfortunately, the Linear model smoothness function requires high computational resources. Assuming that the maximum disparity range is D, the computational complexity of the smoothness cost is O(D^2).
Equation (3.6) shows the Potts model smoothness function, which has lower complexity than the Linear model. The Potts model introduces the same smoothness cost penalty to all candidate disparities. Figure 3.10 is an example demonstrating the smoothness cost penalties in the Potts model. If the disparity value is different, the smoothness cost penalty is introduced into the energy function to reduce the probability of different disparities winning in the winner-take-all computation. In the Potts model smoothness function, the computational complexity is O(D).

S(d_p, d_q) = \begin{cases} 0, & d_p = d_q \\ c, & \text{otherwise} \end{cases}    (3.6)

where c is a constant that introduces the smoothness penalty.
Figure 3.11 shows an example of disparity maps generated with the Linear and the Potts model, respectively. Obviously, the Linear model performs better on slanted surfaces and reveals more detail. However, it also blurs the disparity map in
discontinuous regions (such as object edges).
The state-of-the-art global stereo matching approaches include dynamic programming (DP) [21], graph cut (GC) [22], and belief propagation (BP) [23]. The
principle of those approaches is based on the way one minimizes the energy
function. In the following, we choose the DP approach for the example of the
hardware implementation that we will study later.
DP is performed after the calculation of the raw matching cost. To reduce the
computational complexity, DP generally refers to executing the global stereo
matching within an image scanline (an epipolar line after image rectification). It
consists of a forward updating of the raw matching cost and a backward minimum
tracking. The DP function helps to choose the accurate minimum disparity value


Fig. 3.11 Disparity map results by using different smoothness function models

Fig. 3.12 DP consists of a forward cost update and a backward tracking. In the forward process,
the costs are updated based on the cost distribution of the previous pixel, and therefore, the final
disparity result is not only determined by its original cost map, but also affected by neighboring
pixels

among the somewhat noise-sensitive raw matching cost disparity candidates by considering the cost of previous pixels. The forward pass of DP can be expressed as:

C_w(j, d_q) = c(j, d_q) + \min_{d_p} \big[ C_w(j-1, d_p) + S(d_p, d_q) \big]    (3.7)

where the function c(·) represents the original raw matching cost at disparity d_q for pixel j on scanline W, and C_w(·) is the accumulated cost updated from the original cost. S(·) is the smoothness function that introduces a penalty for the disparity difference between the current and previous scanline pixels (Fig. 3.12).
After updating the cost of every pixel on the scanline, the optimal disparity
result of a scanline can be harvested by backward selecting the disparity with
minimum cost, according to:


d_W = \arg\min_{d' \in [0, D]} C_w(W, d')    (3.8)
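The forward/backward structure of Eqs. (3.7) and (3.8) with the Potts smoothness model can be illustrated with a short software model. The Python sketch below operates on a precomputed raw cost array; the penalty value and array shapes are assumptions of this example, and the code is an algorithmic illustration rather than the hardware architecture of Sect. 3.3.

import numpy as np

def scanline_dp(cost, penalty=2.0):
    """Scanline dynamic programming with a Potts smoothness term.
    cost[j, d] is the raw matching cost of pixel j at disparity d;
    returns one disparity per pixel of the scanline (Eqs. 3.7 and 3.8)."""
    W, D = cost.shape
    acc = np.zeros_like(cost)                  # accumulated cost C_w(j, d)
    acc[0] = cost[0]
    for j in range(1, W):
        prev = acc[j - 1]
        jump = prev.min() + penalty            # Potts model: constant penalty for any change
        acc[j] = cost[j] + np.minimum(prev, jump)      # forward pass, Eq. (3.7)
    # Backward pass: start from the minimum of the last pixel (Eq. 3.8) and trace back.
    disparity = np.zeros(W, dtype=int)
    disparity[-1] = int(acc[-1].argmin())
    for j in range(W - 2, -1, -1):
        d = disparity[j + 1]
        # The forward step either stayed at d or jumped to the minimum of acc[j].
        disparity[j] = d if acc[j][d] <= acc[j].min() + penalty else int(acc[j].argmin())
    return disparity

# Example usage with a random cost volume (640-pixel scanline, 64 disparities):
# disp = scanline_dp(np.random.rand(640, 64).astype(np.float32))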

3.2.3 Disparity Map Refinement


This is the final step that pushes the quality of the disparity map to an even higher
accuracy. Many articles have reported techniques for fine-tuning the resulting
disparity map based on reasonable constraints. In this subsection, we introduce
cross-checks, median filters, and disparity voting methods. Those techniques
contribute to a significant improvement of the quality of the disparity map [24].
3.2.3.1 Cross-Check
Cross-check techniques check the consistency between the left and right disparity maps. They are based on the assumption that the disparity should be identical from the left to the right image and vice versa. If the disparity values are not consistent, they either represent mismatching pixels or occlusion pixels. The cross-check formulas can be expressed as:

|D(x, y) - D'(x - D(x, y), y)| \le k   (left disparity)
|D'(x, y) - D(x + D'(x, y), y)| \le k   (right disparity)    (3.9)
where D and D' represent the left and right disparity maps from the disparity
computation step, respectively. The constant k is the cross-check threshold for
tolerating trivial mismatching. Once the occlusion regions are known, the simplest way to handle the inconsistent disparity pixels is to replace them by the nearest
good disparity value [25]. Another solution is to replace the inconsistent disparity
pixels by the most popular good disparity value within the support region (similar
texture) by disparity voting [25].
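A direct software transcription of the cross-check of Eq. (3.9) and of the nearest-good-value replacement reads as follows; it is written for clarity rather than speed, and the threshold k as well as the scan-from-the-left fill rule are assumptions of this example.

import numpy as np

def cross_check(disp_left, disp_right, k=1):
    """Mark pixels whose left and right disparities disagree by more than k."""
    h, w = disp_left.shape
    occluded = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xr = x - int(disp_left[y, x])      # corresponding position in the right map
            if 0 <= xr < w:
                occluded[y, x] = abs(disp_left[y, x] - disp_right[y, xr]) > k
            else:
                occluded[y, x] = True
    return occluded

def fill_nearest(disp, occluded):
    """Replace inconsistent pixels by the nearest good disparity value seen so far
    on the same scanline (the simplest occlusion handling mentioned above)."""
    out = disp.copy()
    for y in range(disp.shape[0]):
        last_good = None
        for x in range(disp.shape[1]):
            if occluded[y, x]:
                if last_good is not None:
                    out[y, x] = last_good
            else:
                last_good = out[y, x]
    return out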
3.2.3.2 Disparity Voting
Following the cross-check, a voting process is performed by selecting the most popular
disparity value within the support region. Disparity voting is based on the assumption
that the pixels within a similar texture-region share the same disparity value. This
method helps to improve the quality of the disparity map. In this refinement process, a
support region is used as the voting area, and only the pixels in this irregularly shaped
region (as shown in Fig. 3.13) are taken into account. In the following, we introduce the
cross-based support region generation method that is proposed by Zhang et al. [12].
The arms stretch in the left/right and top/bottom directions, respectively, and stop when either of two conditions is met. First, the arm is limited to a maximum length. Second, the arm stops when two adjacent pixels are not consistent in color,
according to:


Fig. 3.13 Support region U(p) of the processed anchor pixel p is used to take every pixel inside into account. It is a region defined by four stretched arms

r^* = \max_{r \in [1, L]} \Big( r \prod_{i \in [1, r]} \delta(p, p_i) \Big)    (3.10)

where r^* indicates the largest span for the four directional arms. In the equation, p_i = (x_p - i, y_p) and L corresponds to the predetermined maximum arm length.

\delta(p_1, p_2) = \begin{cases} 1, & \max_{c \in \{R,G,B\}} |I_c(p_1) - I_c(p_2)| \le \tau \\ 0, & \text{otherwise} \end{cases}    (3.11)
After the arm lengths are calculated, the four arms (h_p^-, h_p^+, v_p^-, v_p^+) are used to define the horizontal segment H(p) and the vertical segment V(p). Note that the full support region U(p) is the integration of all the horizontal segments of those pixels residing on the vertical segment.
Once the support region (indicating the texture-less region) has been determined, the disparity voting method can be performed to provide a better image quality. It accumulates the disparity values within the support region of the central pixel into a histogram, and then chooses the winner as the final disparity value.
Figure 3.14 demonstrates an example of such 2D voting.
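The arm construction of Eqs. (3.10) and (3.11) and the voting itself can be sketched as follows for a grayscale image (the chapter uses the maximum over the R, G, B channels); the maximum arm length L and the color threshold tau are assumed example values, and the full 2D region would be built from the horizontal segments of the pixels on the vertical segment as described above.

import numpy as np

def horizontal_arms(img, y, x, L=17, tau=20):
    """Left/right arm lengths of the cross-based support region for pixel (y, x)."""
    arms = []
    for step in (-1, +1):                       # left arm, then right arm
        r = 0
        while r < L:
            xn = x + step * (r + 1)
            if xn < 0 or xn >= img.shape[1]:
                break
            if abs(int(img[y, xn]) - int(img[y, x])) > tau:   # delta(p, p_i) = 0
                break
            r += 1
        arms.append(r)
    return arms                                  # [h_p^-, h_p^+]

def vote_disparity(disp, support_pixels):
    """Histogram-based 2D disparity voting over the support region U(p)."""
    values = np.asarray([disp[y, x] for (y, x) in support_pixels], dtype=int)
    return int(np.bincount(values).argmax())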

3.3 Co-Designing the Algorithm and Hardware Implementation

Most stereo matching implementation studies in the literature are realized on four commonly used platforms: CPU, GPU, DSP, and FPGA/ASIC. To decide on the
best implementation platform, different aspects such as matching accuracy,


Fig. 3.14 Example of 2D disparity voting. All the disparities in the support region are taken into
account in the histogram to choose the most voted disparity. a Disparities in support region.
b disparities in support region c voted disparity

robustness, real-time performance, computational complexity, and scalability


should be taken into consideration when implementing the algorithm. Implementations with CPU, GPU, and DSP are software-based approaches, which
require less development time. Zhang et al. [11] propose a real-time design for accurate stereo matching on CUDA, which achieves an acceptable tradeoff between matching accuracy and computational performance by introducing various sampling rates into the design. Banz et al. [26] apply the semiglobal algorithm that was proposed by Hirschmüller [20] on a GPU. However, the computational logic, instructions, and data paths are fixed. Besides, a high clock frequency and memory bandwidth are required. Therefore, a dedicated hardware design is an option to overcome the above-mentioned problems. FPGA and ASIC platforms allow parallelism exploration and a pipeline architecture to achieve a higher throughput at moderate clock speeds. Chang et al. [16] implement a high-performance stereo matching algorithm with mini-CT and adaptive support weight on a UMC 90 nm ASIC, achieving 352 × 288 @ 42 fps with a 64-level disparity range. Jin et al. [27] design a pipelined hardware architecture with the CT and the sum of Hamming distances, achieving 640 × 480 @ 30 fps at a 64 disparity range under 20 MHz. Banz et al. realize the semiglobal matching algorithm on a hybrid FPGA/RISC architecture, achieving 640 × 480 @ 30 fps at a 64 disparity range under 12 M*208 MHz [26]. Stefan et al. [28] also utilize the semiglobal matching algorithm to implement a stereo matching engine, which achieves 340 × 200 @ 27 fps with a disparity range of 64 levels. Zhang et al. [12] implement a local algorithm with a cross-based support region in the cost aggregation and refinement stages, achieving a resolution of 1024 × 768 @ 60 fps over a range of 64 disparity levels under 65 MHz. In this section, we will briefly introduce our
stereo matching algorithm and hardware implementation approach to reach an
optimal trade-off between efficiency and image quality.
Targeting low-cost solutions, we have reduced the memory size and computational complexity by extracting the necessary depth information from within a
small texture size that is sufficient to represent the particular object and its location. There are mainly two reasons to support the depth extraction in such a line-based structure. First, video signals are typically fed in a horizontal line-by-line


Fig. 3.15 The schematic flowchart of the stereo matching process

format. For instance, the video signals for the television screen are scanned and
transferred by either the progressive or the interlaced raster scan direction format,
and these scan formats both follow a horizontal direction. All information is
extracted from a horizontal strip of successive lines, gradually moving from top to
bottom of the image. There is hence no need to store the entire image frame in the
processing memory; only the very limited number of scan lines needed for the processing is stored. Second, the disparity is naturally expressed in the horizontal direction, since it finds its origin in the position difference between the viewer's left and right eyes, which are naturally spaced horizontally. For these reasons, it is more practical to build up a stereo matching system achieving real-time performance in a line-based structure. As long as the depth extraction is simplified to harvesting information from the local data rather than from the frame
data, the implementation of the stereo matching onto the FPGA prototyping system becomes more efficient.
Figure 3.15 shows the proposed example in a more detailed flow graph of the stereo matching implementation on FPGA. It performs the following steps:
• Support region builder, raw cost generation with CT, and Hamming distance.
• DP computation.
• Disparity map refinement with consistency check, disparity voting, and median filter.
The hardware architecture of these different steps will be discussed in further detail in the following sections.

3.3.1 Preprocessor Hardware Architecture


The preprocessor design contains three components: the CT, the Hamming distance calculation, and the support region builder.


Fig. 3.16 Example of 3 × 3 CT hardware architecture for stream processing

3.3.1.1 CT
In this section, an example design of the CT is shown in Fig. 3.16. The memory
architecture uses two line buffers to keep the luminance data of the vertical scanline
pixels locally in order to provide two-dimensional pixel information for the CT
computation. This memory architecture writes and reads the luminance information
of the input pixel streams simultaneously in a circular manner, which keeps the
scanline pixel data locally by a data reuse technique. Successive horizontal scanlines
are grouped as vertical pixels through additional shift registers. This example
illustrates a 3 × 3 CT computational array. The results from the different comparators (CMP) are concatenated into a census bitstream c_i for each pixel i.

3.3.1.2 Hamming Distance Computation


Once the census vectors are obtained, a large number of comparisons between the respective vectors of the left/right images is performed. For example, when
calculating the left depth map, the left census array is set as a reference and the
right array as the target. This process is reversed for the right depth map to be
calculated. The census vectors on the reference array and their locally corresponding vectors on the target array are compared and the differences (known as
the raw matching cost) are quantitatively measured by checking every bit in these
two arrays. A large raw matching cost corresponds to a mismatch between the
reference pixel and the chosen target pixel from the right census array. The raw
matching cost is measured several times depending on the maximum disparity
range of the system (e.g., a 64-level disparity system requires 64 raw


Fig. 3.17 Example of parallel Hamming distance hardware architecture

matching cost measurements for each pixel). As shown in Fig. 3.17, the raw
matching cost result is stored in an additional, independent dimension with a size
equal to the maximum number of disparity levels.
The raw matching cost indicates the similarity of the reference pixel to the target
pixels at various disparity positions. Therefore, it should be easy to get the most similar
pixel by taking the disparity with the minimum raw matching cost. However, in most
practical situations, there is a need to improve the extracted disparity candidates, by
using global matching algorithms such as DP, as explained in the next section.

3.3.1.3 Support Region Builder


The cross-based support region algorithm [12] is chosen to extract the support
region information. Figure 3.18 shows an example design of such a cross-based support region builder. It contains multiple line buffers, a shift register array, and
support arm region encoders. Since the cross-based support function requires 2D
pixel luminance information, we use multiple line buffers to keep the input pixel
luminance stream on-chip. Fortunately, this memory architecture for the support region builder overlaps with the one for the CT function. Hence, it is designed
to share data with the preprocessor. The encoded cross-arm information is stored
in a buffer and will also be used in the postprocessor for disparity voting.

3.3.2 DP Hardware Architecture


To reduce the hardware resource consumption of the DP processor, we propose a solution that takes advantage of the Potts model smoothness function. The Potts model approach has a computational complexity of O(D · W) in the energy function, whereas the Linear approach is of order O(D^2 · W), where D is the maximum disparity range and W is the number of pixels in the image scanline. Furthermore, the aforementioned forward pass Eq. (3.7) can be rewritten as Eq. (3.12), which requires fewer adder components when applying maximum parallelism in the VLSI design.


Fig. 3.18 Example hardware design of the cross-based support region builder

C_{\min\mathrm{Assum}} = \min_{d_p} C_w(j-1, d_p) + c

C_w(j, d_q) = c(j, d_q) + \min\big( C_{\min\mathrm{Assum}},\; C_w(j-1, d_q) \big)    (3.12)

One problem of DP is that it requires tremendous memory space to store the aggregated cost information C_w(j, d_q) for the backward pass process. Therefore, we take advantage of the Potts model again by only storing the decision of the backward pass. Indeed, the Potts model has only two possible backward path decisions: either jump to the disparity value with minimum cost or remain on the same disparity value. Therefore, the backward path information can be represented as in Eq. (3.13) instead of storing a complete aggregation cost array, requiring less memory utilization.

\mathrm{Backward\_path}(j, d) = \begin{cases} 1, & C_{\min\mathrm{Assum}} \le C_w(j-1, d) \\ 0, & C_{\min\mathrm{Assum}} > C_w(j-1, d) \end{cases}
\mathrm{Backward\_MinC\_path}(j) = \arg\min_{d_p} C_w(j-1, d_p)    (3.13)

where j \in [1, W], W is the number of image scanline pixels, and d_p lies within the maximum disparity range. In the backward pass function, Eq. (3.8) can be rewritten as Eq. (3.14). When the path decision is 0, the backward entry remains at the same disparity. When the path decision is 1, the backward entry points to the path which possesses the minimum aggregation cost assumption.


Fig. 3.19 Example of forward DP operations

d(j-1) = \begin{cases} \mathrm{Backward\_MinC\_path}(j), & \text{if } \mathrm{Backward\_path}(j, d) = 1 \\ d(j), & \text{if } \mathrm{Backward\_path}(j, d) = 0 \end{cases}    (3.14)

where j \in [1, W] and W represents the number of image scanline pixels.


Figure 3.19 is an example of DP that shows how the backward path information
is first explored in the forward pass step, and the path decision is then stored in 1
bit. Then the backward pass step traverses back through the path information to
obtain the optimal disparity result in the image scanline.
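The memory saving of Eqs. (3.13) and (3.14) can be made explicit in a small software model: instead of the full accumulated cost array, only a 1-bit path decision per (pixel, disparity) entry and one minimum-cost index per pixel are kept for the backward pass. As before, this is an algorithmic sketch with assumed names and data types, not the RTL design of Fig. 3.20.

import numpy as np

def scanline_dp_1bit(cost, penalty=2.0):
    """Potts-model scanline DP storing only the 1-bit backward path decisions."""
    W, D = cost.shape
    path = np.zeros((W, D), dtype=np.uint8)    # 1 = jump to the minimum, 0 = stay (Eq. 3.13)
    min_idx = np.zeros(W, dtype=int)           # Backward_MinC_path(j)
    acc = cost[0].astype(float)
    for j in range(1, W):
        c_min_assum = acc.min() + penalty      # minimum-cost assumption (Eq. 3.12)
        min_idx[j] = int(acc.argmin())
        path[j] = (c_min_assum <= acc).astype(np.uint8)
        acc = cost[j] + np.minimum(acc, c_min_assum)       # forward update
    # Backward pass (Eq. 3.14): follow the stored 1-bit decisions.
    disparity = np.zeros(W, dtype=int)
    disparity[-1] = int(acc.argmin())
    for j in range(W - 1, 0, -1):
        d = disparity[j]
        disparity[j - 1] = min_idx[j] if path[j, d] == 1 else d
    return disparity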
Based on Eqs. (3.13) and (3.14), Fig. 3.20 illustrates an example hardware
architecture design of the forward pass that is proposed for the stream processor on
VLSI. This architecture computes the input matching cost array with maximum parallelism in order to achieve 1:1 throughput, i.e., for each new pair of input pixels, one output pixel of the depth map is created. The calculated backward path information is then stored in a buffer for the backward pass.
While the forward pass loop is processing new incoming matching costs for the current image scanline, the backward pass loop concurrently reads out the backward path array information from the buffer and traces back the backward path along the image scanline. Since the disparity output of the backward pass function appears in the reverse order of the conventional input sequence, a sequence reorder circuit is needed. The scanline reorder circuit can be
simply achieved by a ping-pong buffer and some address counters.
Figure 3.21 shows a pair of disparity maps calculated from the Tsukuba image
pair [3]. The disparity maps are captured from the output of the DP
function. Compared to the original stereo image sources, the disparity maps suffer


Fig. 3.20 Example design of forward- (left) and backward-pass (right) functions

from streak effects and speckle noise. They also show some mismatching problems. In the next subsection, the postprocessor is described, which circumvents these
problems with a disparity map refinement approach.

3.3.3 Postprocessor Hardware Architecture


3.3.3.1 Cross-Check
Due to the occlusion and mismatching disparity problems that affect the quality of
the disparity map, we apply a cross-check function in the stereo matching algorithm to detect the mismatching regions. Figure 3.22 is an example that shows the
left and right occlusion maps.
An example RTL design of the consistency check function for the left disparity
map is shown in Fig. 3.23. The length of the line buffers is set equal
to the maximum disparity range; they can also be implemented with
registers. The disparity value D(x, y) from the left image is compared to the
disparity value D′(x − D(x, y), y) from the right disparity buffer in the consistency
check block. The consistency check block compares the disparity difference against a
threshold value. If the disparity difference is larger than the threshold, the pixel is replaced


Fig. 3.21 Tsukuba left (a) and right (b) stereo image sources, and their resulting left (c) and
right (d) disparity maps from the DP

Fig. 3.22 Tsukuba left and right occlusion maps

by the nearest good disparity value. Another approach is to tag the mismatching
pixels to form occlusion maps; the information on the occlusion pixels is then
used in the disparity voting function, where the occlusion pixels are replaced by the
most popular good disparity values within their support regions. The RTL design of the disparity voting is explained further in the next subsection.
Figure 3.24 shows the disparity map results captured from the output of
the consistency check function, where the occlusion handling strategy replaces
each mismatching disparity pixel by the nearest good disparity value.
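As a behavioural illustration (not the RTL), a left-right consistency check with the nearest-good-disparity fallback might be sketched in Python as follows; the threshold of one disparity level is an assumed choice.

    import numpy as np

    def cross_check(disp_left, disp_right, thresh=1):
        """Left-right consistency check on integer disparity maps.

        A left pixel (x, y) with disparity d should map to a right pixel
        (x - d, y) carrying (almost) the same disparity; otherwise it is
        tagged as occluded/mismatched.
        """
        h, w = disp_left.shape
        occluded = np.zeros((h, w), dtype=bool)
        for y in range(h):
            for x in range(w):
                d = int(disp_left[y, x])
                xr = x - d
                if xr < 0 or abs(d - int(disp_right[y, xr])) > thresh:
                    occluded[y, x] = True
        return occluded

    def fill_nearest(disp, occluded):
        """Replace tagged pixels with the nearest good disparity on the scanline."""
        out = disp.copy()
        h, w = disp.shape
        for y in range(h):
            last_good = 0
            for x in range(w):
                if occluded[y, x]:
                    out[y, x] = last_good
                else:
                    last_good = out[y, x]
        return out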


Fig. 3.23 Consistency check RTL example design

Fig. 3.24 Tsukuba left and right disparity maps after consistency check

3.3.3.2 Disparity Voting


Disparity voting is used to refine the disparity map. To reduce the computational
complexity, the voting process can be modified into a two-pass approach, horizontal
voting followed by vertical voting, using the cross-based support region approach of
Sect. 3.3.1 [11]. This makes the histogram calculation easier for hardware
implementation, and the computational complexity is reduced from O(N²) to
O(2N). Figure 3.25 illustrates how the 2D disparity voting procedure can be
approximated by two successive 1D voting procedures.
The proposed disparity voting RTL design first applies the horizontal voting
function and then the vertical voting function. Figure 3.26 is an example design which
selects the most popular disparity value within the horizontal arm length of the support
region. This architecture is composed of shift registers, a disparity comparator
array, a support region mask, a population counter, and a comparator tree. There
are 2L + 1 shift registers, where L represents the maximum support arm length.
The disparity comparator array compares the input disparities from the registers


Fig. 3.25 Two pass disparity map voting

Fig. 3.26 RTL example design of horizontal voting function

and assigns the corresponding disparity bit flags, where 1 represents equal and 0
represents not equal. Then the support region mask filters out the disparity flags that do
not belong to the support region by setting them to 0. Afterwards, the population
counter accumulates the disparity bit flags into a histogram, and the
disparity counter comparator tree selects the most popular disparity value.
Figure 3.27 is an example design of the vertical voting function which selects
the most popular disparity value in the vertical support region. In general, the
architecture is almost the same as the horizontal voting function, except for the
memory architecture. The memory architecture accumulates the incoming
disparity scanline stream into line buffers and utilizes data-reuse techniques to
extract vertical pixels for disparity voting.
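The two-pass voting described above can be modelled in software as shown in the Python sketch below; the per-pixel arm-length arrays are assumed to come from the cross-based support region builder, and the array names are illustrative.

    import numpy as np

    def vote_1d(disp, arm_minus, arm_plus, axis):
        """One voting pass: replace each disparity by the most frequent value
        inside its support arms along the given axis (0 = vertical, 1 = horizontal)."""
        out = disp.copy()
        h, w = disp.shape
        for y in range(h):
            for x in range(w):
                if axis == 1:
                    lo, hi = x - arm_minus[y, x], x + arm_plus[y, x]
                    window = disp[y, max(0, lo):min(w, hi + 1)]
                else:
                    lo, hi = y - arm_minus[y, x], y + arm_plus[y, x]
                    window = disp[max(0, lo):min(h, hi + 1), x]
                values, counts = np.unique(window, return_counts=True)
                out[y, x] = values[np.argmax(counts)]  # most popular disparity
        return out

    def two_pass_voting(disp, h_minus, h_plus, v_minus, v_plus):
        """Horizontal voting followed by vertical voting (O(2N) instead of O(N^2))."""
        horizontal = vote_1d(disp, h_minus, h_plus, axis=1)
        return vote_1d(horizontal, v_minus, v_plus, axis=0)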
Figure 3.28 demonstrates the disparity map results captured from the
disparity voting function. Compared to Fig. 3.24, the disparity mismatches,
occlusions, and streaking effects are greatly alleviated.


Fig. 3.27 Example design of the vertical voting function

3.3.3.3 Median Filter


In order to reduce speckle and impulse noise in the disparity map, a median
filter is applied in the final refinement stage. Figure 3.29 is an example design
showing the memory architecture of the median filter, which keeps all required scanlines on-chip.
Figure 3.30 shows the hardware architecture of the median sorting array
proposed by Vega-Rodríguez et al. [29]. To avoid critical path constraints, a
pipelining technique can be applied to the median sorting array.
After median filtering, the final disparity map results are
shown in Fig. 3.28b.
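For reference, a minimal Python model of a 3 × 3 scanline-buffered median filter is given below; the window size is an illustrative choice, and the hardware uses a pipelined sorting network rather than a full sort.

    import numpy as np

    def median3x3(disp):
        """3 x 3 median filter over a disparity map, mirroring the idea of
        keeping a few scanlines on-chip and sliding a small window along them."""
        h, w = disp.shape
        out = disp.copy()
        for y in range(1, h - 1):
            rows = disp[y - 1:y + 2]            # the three buffered scanlines
            for x in range(1, w - 1):
                window = rows[:, x - 1:x + 2]   # 3 x 3 neighbourhood
                out[y, x] = np.median(window)
        return out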

3.4 Viewpoint Synthesis Kernel


The viewpoint synthesis kernel generates virtual views in
applications such as free viewpoint TV and autostereoscopic 3D display systems
[21]. In these systems, the view synthesis engine is a back-end process that synthesizes the virtual view(s) after the depth extraction (e.g., stereo matching) from
the multi-view video decoder [30]. In the associated view synthesis algorithms, the


Fig. 3.28 Tsukuba left and right disparity maps after voting (a), and the following median filtering (b)

depth-image-based rendering (DIBR) algorithm is now a common approach [31].
It warps an image or video to another view according to the given depth map.
Various ways have been proposed to improve the synthesis quality in previous
DIBR research [32–34]. Ekmekcioglu et al. [32] first preprocess the depth map with a motion-adaptive median filter. Lin et al. propose a hybrid inpainting method to fill
holes in disocclusion regions. Tong et al. [34] perform view synthesis in the sub-pixel or quarter-pixel space to improve the subjective quality. With the above
improvements, the MPEG consortium developed the reference software for free
viewpoint TV, called the view synthesis reference software (VSRS), which synthesizes high-quality virtual views from a limited number of input views.
As mentioned in Sect. 3.1, it is important to achieve real-time performance for
HD and full-HD broadcasting, which represents a challenge in hardware design.
In the state of the art, many researchers are working on high-frame-rate
implementations, following two main approaches. Horng et al. developed a view
synthesis engine that supports real-time HD1080p synthesis of a single view [35]. Tsung
et al. designed a set-top-box system-on-chip for FTV and 3D-TV systems that supports
real-time 4K-by-2K processing of 9 views [36]. These algorithms consider input videos with camera rotation; therefore, they have to deal with
complex matrix and fractional computations to align the respective camera views,
resulting in many random memory accesses in the calculation process.
In order to overcome these problems and to obtain a cost-efficient solution, we
propose to perform the synthesis using the input video on a raster scanline basis,
as in the aforementioned stereo-matching method, and to exclude the effect of
camera rotation (the camera rotation is preprocessed in advance by a camera
calibration/registration technique). Both input and output are then in raster-scan
order. A view synthesis engine design based on this order significantly reduces
the hardware cost of buffers, and no reordering buffer is needed.
Figure 3.31 shows the algorithm flowchart of our hardware architecture. This
view synthesis algorithm is a simplified version of the VSRS algorithm; it is
assumed that the input videos have been rectified and normalized, so that no camera
rotation needs to be handled. The algorithm consists of three sequential stages:


Fig. 3.29 Memory architecture example design of the median filter

Fig. 3.30 Median filter sorting architecture [29]

• Forward warping stage.
• Reverse warping stage.
• Blending and hole-filling stage.
In the forward warping, the depth maps of the left view DL and right view DR
are loaded and warped forward to the depth maps of the virtual view to be synthesized,
VDL and VDR, separately. The warping location of each pixel is determined
by its disparity. When two or more pixels are warped to the same location, the
pixel with the greatest disparity value (which is assumed to belong to the foreground
scene) is chosen, giving foreground object pixels greater priority.
After the warping, there are some blank holes to which none of the image pixels
have been warped. These holes result from two causes: the occluded
regions of the reference image, and the reduced precision caused by the integer
truncation of the warping locations, as shown in Fig. 3.32. To reduce the impact of these
errors on the next stage, a general median filter is used to fill the small blank
holes with selected neighboring data, completing the virtual view depth maps
VDL and VDR.
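A minimal Python sketch of this forward-warping rule, assuming a parallel camera setup, an illustrative half-baseline scaling factor, and a simple hole marker, is given below; it is not the exact hardware behaviour.

    import numpy as np

    HOLE = -1  # marker for positions no pixel was warped to

    def forward_warp_depth(disp, alpha=0.5, to_right=True):
        """Warp a disparity/depth map toward the virtual viewpoint.

        Each pixel moves horizontally by alpha * disparity; when several pixels
        land on the same target, the largest disparity (foreground) wins.
        """
        h, w = disp.shape
        warped = np.full((h, w), HOLE, dtype=np.int32)
        sign = -1 if to_right else 1
        for y in range(h):
            for x in range(w):
                d = int(disp[y, x])
                xt = x + sign * int(round(alpha * d))
                if 0 <= xt < w and d > warped[y, xt]:
                    warped[y, xt] = d            # foreground has priority
        return warped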


Fig. 3.31 Algorithm flowchart of virtual view synthesis. It mainly consists of the forward
warping, reverse warping and a blending process

Given the warped VDL and VDR from the forward warping stage, the reverse
warping fetches texture data from the original input left and right images (IL and IR)
to produce the two virtual view frames (VIL and VIR). This stage warps in a similar way
to the forward warping stage. However, instead of using the depth map as the reference
image, the reverse warping stage uses the left/right image as reference and warps these
images based on the virtual view depth maps. After this stage, some of the occlusion
regions again cause blank regions, which are dealt with in the next stage. In addition,
a hole-dilation step expands the hole regions by one pixel,
so that the dilated result can be used in the next stage to improve the synthesis quality.
The blending process, the final stage of the proposed
synthesis process, merges the left/right virtual images (VIL and VIR) into one image to
create the synthesized result. Because most pixels of the synthesized result are
visible in both the left and right virtual images, the blending process averages the
texture values from VIL and VIR. This improves the color
consistency between the two input views (L and R). For the
occlusion regions, because most of them are visible in at least
one of the two views, VIL and VIR are used to complement each other to
generate the synthesized view. In addition, the preceding hole-dilation step ensures that more
background information is copied from the complementary view.
After the blending process, there are still some holes whose content cannot be found in either
input view. This problem can be solved with inpainting methods. Here, a simplified
method proposed by Horng et al. [35] is employed for the inpainting. Unlike other
methods, which often suffer from high computational complexity and time-consuming
iterative processing, the adopted hole-filling process uses a 5-by-9 weighted array filter
as a bilinear interpolation to fill the hole regions with low computational complexity.
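The blending and hole-complement logic can be summarised by the following Python sketch, in which a simple mean-value fill stands in for the 5-by-9 weighted filter of [35]; the masks and array names are illustrative.

    import numpy as np

    def blend_views(vil, vir, hole_l, hole_r):
        """Merge the two warped virtual views into one synthesized image.

        vil, vir:       warped left/right textures (H x W float arrays).
        hole_l, hole_r: boolean hole masks of the two warped views.
        """
        out = np.zeros_like(vil)
        both = ~hole_l & ~hole_r
        out[both] = 0.5 * (vil[both] + vir[both])   # average improves color consistency
        only_l = ~hole_l & hole_r
        only_r = hole_l & ~hole_r
        out[only_l] = vil[only_l]                    # complement from the visible view
        out[only_r] = vir[only_r]
        remaining = hole_l & hole_r                  # holes seen by neither view
        if remaining.any() and (~remaining).any():
            out[remaining] = out[~remaining].mean()  # crude stand-in for weighted inpainting
        return out, remaining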


Fig. 3.32 Forward warping (top) and reverse warping (bottom). In the forward warping, depth
maps are warped based on the disparity value, and holes are generated after this value has been
warped to a new location. In the reverse warping, a similar process is performed on the image in a
pixel-by-pixel process based on the warped disparity map. Note that the warping direction is
opposite to the left image

The proposed hardware architecture is shown in Fig. 3.33. Our view synthesis
engine adopts a scanline-based pipelined architecture. Because the input frames of
the texture images and depth maps are both rectified by the preceding stereo-matching method, the synthesis can proceed one scanline at a time. During the viewpoint synthesis, all rows of a given input frame are
processed independently to create the final output results. Under these considerations, we synthesize the virtual view row by row.
Because of the scanline-based architecture, the input and output of our view
synthesis engine can receive and deliver a sequential line-by-line data
stream. For example, for 1024 × 768 video frames, the input stream has a length
of 768 lines of 1024 data packets. Under this scanline-based architecture, the
forward-warped depth information can be stored in internal memory, so that the
bandwidth for accessing the warped depth data in external memory is greatly
reduced. For the row-based architecture, the size of the internal memory buffers


Fig. 3.33 Block diagram of the viewpoint synthesis hardware implementation

depends on the frame width for storing a frame row. In our design, we adopted an
SRAM with 1920 × 8 bits of memory space to support up to HD1080p video.
In the view synthesis engine, the data flow follows Fig. 3.31.
In stage I, DL and DR are warped toward the requested viewpoint. This process
is performed in scanline order from the depth map and two scanline buffers (L/R buffers) for the
left and right images, respectively. Holes in the warped depth images are filled in the
next two stages: stage II handles small 1-pixel-wide cracks with hole dilation, and
stage III handles more extensive holes before outputting the L/R synthesized views.
Based on this structure, the whole viewpoint synthesis process can be segmented into three sequential and independent stages. Since no recursive
data are fed back from the last stage, the whole process can be pipelined. For example,
while stage II handles scanline number i, scanline number i + 1 is in
stage I and i − 1 is in stage III. Consequently, three rows are active in the engine at the same time, and the process latency is greatly reduced by
operating every stage in the engine within a predetermined execution time.

3.5 System Quality Performance


In order to evaluate the performance of the proposed system, the Middlebury
stereo benchmark was adopted to quantitatively estimate the depth map quality
[3, 37]. In its database, four sets of stereo images, i.e., Tsukuba, Venus, Teddy, and
Cones, are provided as a reference for observing various shapes and textures that


Fig. 3.34 Disparity results of the proposed implementation for the Tsukuba, Venus, Teddy, and
Cones stereo images (from left to right)

would possibly impact the performance of the tested algorithm. In the following,
we describe the quality estimation of the stereo matching and viewpoint synthesis,
respectively.

3.5.1 Stereo Matching


To evaluate the system performance on depth quality, the four data sets of
stereo images were processed to create depth maps. The results are shown in
Fig. 3.34. The quality of the resulting depth maps was quantitatively measured
against the ground truth of the datasets by the Middlebury website and is listed in
Table 3.2. Each result set is reported with three values: 1. the nonocc value
(error computed only over the non-occluded areas of the image), 2. the all
value (error over the whole image), and 3. the disc value (error
only at the depth discontinuities of the image). All of these values are percentages,
and smaller values correspond to higher quality, i.e., a depth
map closer to the ground truth. As listed in the table, most nonocc values from the
proposed system are below 5 %, which is considered a very
high-quality result.
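These figures correspond to the standard Middlebury bad-pixel percentage; a minimal Python sketch of how such a value is computed, assuming the common error threshold of one disparity level, is:

    import numpy as np

    def bad_pixel_percentage(disp, ground_truth, mask=None, thresh=1.0):
        """Percentage of pixels whose disparity error exceeds the threshold.

        mask selects the evaluated region (e.g. non-occluded pixels for 'nonocc',
        all pixels for 'all', discontinuity pixels for 'disc')."""
        if mask is None:
            mask = np.ones_like(disp, dtype=bool)
        err = np.abs(disp.astype(np.float64) - ground_truth.astype(np.float64))
        return 100.0 * np.count_nonzero(err[mask] > thresh) / np.count_nonzero(mask)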

3.5.2 View Synthesis


Performance of the viewpoint synthesis was evaluated using five image sets
from Middlebury: Venus, Teddy, Sawtooth, Poster, and Cones [37]. Each
set has five images numbered from 2 to 6 (IM2–IM6) and two additional ground
truth images for images 2 and 6 (GT2 and GT6). Images 2 and 6 were used to

Table 3.2 Quantitative results of the stereo matching for the original Middlebury stereo database

Algorithm           Tsukuba                 Venus                   Teddy                   Cones                   Average percent
                    Nonocc  All    Disc     Nonocc  All    Disc     Nonocc  All    Disc     Nonocc  All    Disc     bad pixels
Proposed            2.54    3.09   11.1     0.19    0.42   2.36     6.74    12.4   17.1     4.42    10.2   11.5     6.84
RT-ColorAW [38]     1.4     3.08   5.81     0.72    1.71   3.8      6.69    14     15.3     4.03    11.9   10.2     6.55
SeqTreeDP [39]      2.21    2.76   10.3     0.46    0.6    2.44     9.58    15.2   18.4     3.23    7.86   8.83     6.82
MultiCue [40]       1.2     1.81   6.31     0.43    0.69   3.36     7.09    14     17.2     5.42    12.6   12.5     6.89
AdaptWeight [41]    1.38    1.85   6.9      0.71    1.19   6.13     7.88    13.3   18.6     3.97    9.79   8.26     6.67
InteriorPtLP [42]   1.27    1.62   6.82     1.15    1.67   12.7     8.07    11.9   18.7     3.92    9.68   9.62     7.26

nonocc: non-occluded regions


Fig. 3.35 The synthesized virtual images from various viewing angles by using the ground truth
(top row) and the measured depth map (bottom row)
Table 3.3 PSNR of the synthesized virtual images by using the depth map from the ground truth
and the proposed stereo-matching method. Note that each number is an average value from IM3
to IM5
PSNR (dB)    Venus    Teddy    Sawtooth    Poster    Cones
GroundT      45.07    44.76    44.77       44.24     43.12
StereoM      45.09    44.53    44.57       44.07     42.92

create the left/right depth maps (i.e., DL and DR). The calculated depth map and
the ground truth were then used to synthesize the virtual views at exactly the
viewpoints IM3, IM4, and IM5, as shown in Fig. 3.35. Images on the top row show
the virtual view synthesized by using ground truth (i.e., VIM3, VIM4, and VIM5)
and those on the bottom row show the result by using the resulting depth map from
stereo matching. As a comparison, PSNRs were calculated between the virtual views
and their corresponding original images, as listed in Table 3.3. The PSNR of the
virtual images synthesized from the ground truth and from the calculated depth maps is
around 43–45 dB and 42–45 dB, respectively.
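For reference, the PSNR values in Table 3.3 follow the usual definition for 8-bit images, which can be computed as in the short Python sketch below.

    import numpy as np

    def psnr(img, ref, peak=255.0):
        """Peak signal-to-noise ratio between a synthesized view and its reference."""
        mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
        return float('inf') if mse == 0 else 10.0 * np.log10(peak * peak / mse)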

3.6 System Speed Performance


In this section, we evaluate the system performance in MDE/s (million disparity estimations per second) for the stereo matching and in FPS (frames per second)
for the viewpoint synthesis. The performance of GPU-based stereo matching has


Table 3.4 The predefined engine spec of the proposed system in two cases

                                 HD        FullHD
Image width                      1024      960
Image height                     768       1080
Frame rate (FPS)                 60        60
Maximum disparity level          128       128
Speed performance (MDE/s)        6039      7962

Table 3.5 Estimated system resources and hardware costs based on TSMC 250 nm process by
using Cadence RTL compiler

Process                                 Gate count (k gates)    Memory size (k bytes)
Stereo matching      CTSR               65.6                    62
                     Rawcost            88.2                    0
                     DP                 79.3                    43
                     Post               98                      54
Viewpoint synthesis  Forward_warp       41.8                    4
                     Reversed_warp      38.2                    1
                     Hole_filling       58.6                    2
Total                                   469.7                   159
Image size: 1920 × 1080; Maximum disparity: 64 levels

been reported to be 2796.2 MDE/s (HD with 27.8 FPS and a disparity range of 128)
[26]. In addition, Jin et al. report an FPGA stereo matching system with a speed
performance of 4522 MDE/s (VGA with 230 FPS and a disparity range of 64) [27].
Since most of the algorithms described in the previous section focus on both the
quality of the depth map and the hardware implementation, the question "How
fast is the system?" is very relevant. Here, we synthesized the engine in the Altera
Quartus II design framework and implemented it on a Stratix III FPGA with
the specifications listed in Table 3.4.
In Table 3.4, two cases have been implemented, for the HD and FullHD (i.e.,
1080p) video formats. In the HD case, the input image resolution is set to 1024
by 768 with a frame rate of 60 FPS. The maximum disparity in both
cases is defined as 128 levels to cover a large dynamic depth range in the video.
Note that the configuration in the FullHD case is a preliminary test for the FullHD
video stream and was configured for the half side-by-side format. In this scenario, a
speed performance of 7,962 MDE/s is achieved.
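These MDE/s figures follow directly from the frame geometry; as a quick check (values taken from Table 3.4), a short Python calculation gives:

    def mde_per_second(width, height, disparity_levels, fps):
        """Million disparity estimations per second = pixels x levels x frames."""
        return width * height * disparity_levels * fps / 1e6

    print(mde_per_second(1024, 768, 128, 60))   # ~6040 for the HD case
    print(mde_per_second(960, 1080, 128, 60))   # ~7963 for the FullHD (half side-by-side) case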
Furthermore, we also studied how this algorithm would perform on an ASIC
(application-specific integrated circuit). This includes an estimation of the complexity, internal memory resource usage, and the achievable execution speed.
ASIC platform synthesis and simulation were performed to obtain estimated
figures of merit. We used the TSMC 250 nm process library with Cadence RTL
compilation at medium area optimization for logic synthesis as a reference toward an ASIC implementation.


Table 3.5 lists the estimated gate count and memory size. The gate count is
below 500 kgates, which is very hardware-friendly for ASIC implementation in a
low-cost technology node. For FullHD 1080p, the gate count of the system
remains unchanged. The disparity level is predefined as 64, which satisfies the
depth budget of FullHD (~3 %) for high-quality 3D content [43].

3.7 Conclusion
In this chapter, we have demonstrated the application of user-defined depth
scaling in 3D-TV, where a new left/right stereo image pair is calculated and
rendered on the 3D display, providing a reduced depth impression and a better
viewing experience with less eye strain. Calculating more than two output views
by replicating some digital signal processing kernels supports the stereo-in,
multiple-viewpoint-out functionality for 3D auto-stereoscopic displays. Depth
is extracted from the pair of left/right input images by stereo matching, and
multiple new viewpoints are synthesized and rendered to provide the proper depth
impression in all viewing directions. However, there is always a tradeoff between
system performance and cost. For instance, an accurate cost calculation needs a
larger aggregation size (Sect. 3.2.1), which leads to a higher gate count and memory
budget in the system and degrades the system performance. Consequently,
building a high-quality, real-time system remains a challenge. We have proposed a way
to achieve this target: for every step in the proposed stereo matching system, we
employ efficient algorithms that provide high-quality results at acceptable implementation cost, demonstrated on FPGA. We have demonstrated the efficiency and
quality of the proposed system both for the stereo matching and the viewpoint synthesis, achieving more than 6,000 million disparity estimations per second at HD
resolution and a 60 Hz frame rate, with a BPER of no more than 7 % and a PSNR of
around 44 dB on MPEG's and Middlebury's test sequences.

References
1. Hartley R, Zisserman A (2004) Multiple view geometry in computer vision. Cambridge
University Press, Cambridge
2. Papadimitriou DV, Dennis TJ (1996) Epipolar line estimation and rectification for stereo
image pairs. IEEE Trans Image Process 5(4):672–676
3. Scharstein D, Szeliski R, Zabih R (2001) A taxonomy and evaluation of dense two-frame
stereo correspondence algorithms. In: Proceedings IEEE workshop on stereo and multi-baseline vision (SMBV 2001)
4. Brown MZ, Burschka D, Hager GD (2003) Advances in computational stereo. IEEE Trans
Pattern Anal Mach Intell 25(8):993–1008
5. Porter RB, Bergmann NW (1997) A generic implementation framework for FPGA based
stereo matching. In: IEEE region 10 annual conference. Speech and image technologies for
computing and telecommunications (TENCON 97), Proceedings of IEEE


6. Lane RA, Thacker NA (1998) Tutorial: overview of stereo matching research. Available
from: http://www.tina-vision.net/docs/memos/1994-001.pdf
7. Zhang K et al (2009) Robust stereo matching with fast normalized cross-correlation over
shape-adaptive regions. In: 16th IEEE international conference on image processing (ICIP)
8. Chang NYC, Tseng Y-C, Chang TS (2008) Analysis of color space and similarity measure
impact on stereo block matching. In: IEEE Asia Pacific conference on circuits and systems
(APCCAS)
9. Hirschmuller H, Scharstein D (2009) Evaluation of stereo matching costs on images with
radiometric differences. IEEE Trans Pattern Anal Mach Intell 31(9):1582–1599
10. Humenberger M, Engelke T, Kubinger W (2010) A census-based stereo vision algorithm using
modified semi-global matching and plane fitting to improve matching quality. In: IEEE
computer society conference on computer vision and pattern recognition workshops (CVPRW)
11. Zhang K et al (2009) Real-time and accurate stereo: a scalable approach with bitwise fast
voting on CUDA. IEEE Trans Circuits Syst Video Technol 21(7):867–878
12. Zhang K, Lu J, Lafruit G (2009) Cross-based local stereo matching using orthogonal
integral images. IEEE Trans Circuits Syst Video Technol 19(7):1073–1079
13. Yoon K-J, Kweon I-S (2005) Locally adaptive support-weight approach for visual
correspondence search. In: IEEE computer society conference on computer vision and
pattern recognition (CVPR)
14. Hosni A et al (2009) Local stereo matching using geodesic support weights. In: 16th IEEE
international conference on image processing (ICIP)
15. Wang L et al (2006) High-quality real-time stereo using adaptive cost aggregation and dynamic
programming. In: Third international symposium on 3D data processing, visualization, and
transmission
16. Chang NYC et al (2010) Algorithm and architecture of disparity estimation with mini-census
adaptive support weight. IEEE Trans Circuits Syst Video Technol 20(6):792–805
17. Zhang L et al (2011) Real-time high-definition stereo matching on FPGA. In: Proceedings of
the 19th ACM/SIGDA international symposium on field programmable gate arrays (FPGA)
18. Zhang L (2010) Design and implementation of real-time high-definition stereo matching SoC
on FPGA. In: Department of microelectronics and computer engineering, Delft university of
technology, The Netherlands
19. Yi GY (2011) High-quality, real-time HD video stereo matching on FPGA. In: Department of
microelectronics and computer engineering, Delft university of technology: The Netherlands
20. Hirschmuller H (2005) Accurate and efficient stereo processing by semi-global matching and
mutual information. In: IEEE computer society conference on computer vision and pattern
recognition (CVPR)
21. Tanimoto M (2004) Free viewpoint television (FTV). In: Proceedings of picture coding
symposium
22. Boykov Y, Veksler O, Zabih R (2001) Fast approximate energy minimization via graph cuts.
IEEE Trans Pattern Anal Mach Intell 23(11):1222–1239
23. Sun J, Zheng NN, Shum H-Y (2003) Stereo matching using belief propagation. IEEE Trans
Pattern Anal Mach Intell 25(7):787–800
24. Berdnikov Y, Vatolin D (2011) Real-time depth map occlusion filling and scene background
restoration for projected pattern-based depth cameras. Available from: http://
gc2011.graphicon.ru/files/gc2011/proceedings/conference/gc2011berdnikov.pdf
25. Wang L et al (2008) Stereoscopic inpainting: joint color and depth completion from stereo
images. In: IEEE conference on computer vision and pattern recognition (CVPR)
26. Banz C et al (2010) Real-time stereo vision system using semi-global matching disparity
estimation: architecture and FPGA-implementation. In: International conference on
embedded computer systems (SAMOS)
27. Jin S et al (2010) FPGA design and implementation of a real-time stereo vision system.
IEEE Trans Circuits Syst Video Technol 20(1):15–26
28. Gehrig SK, Eberli F, Meyer T (2009) A real-time low-power stereo vision engine using semi-global
matching. Lect Notes Comput Sci 5815:134–143


29. Vega-Rodríguez MA, Sánchez-Pérez JM, Gómez-Pulido JA (2002) An FPGA-based
implementation for median filter meeting the real-time requirements of automated visual
inspection systems. In: Proceedings of the 10th Mediterranean conference on control and
automation, Lisbon
30. Smolic A (2008) Introduction to multiview video coding. ISO/IEC JTC 1/SC 29/WG 11,
Antalya, Turkey
31. Fehn C (2004) Depth-image-based rendering (DIBR), compression and transmission for a
new approach on 3D-TV. In: Proceedings of SPIE conference on stereoscopic displays and
virtual reality system
32. Ekmekcioglu E, Velisavljevic V, Worrall ST (2009) Edge and motion-adaptive median
filtering for multi-view depth map enhancement. In: Picture coding symposium (PCS)
33. Tsung P-K et al (2011) A 216fps 4096 × 2160p 3DTV set-top box SoC for free-viewpoint
3DTV applications. In: IEEE international solid-state circuits conference (ISSCC). Digest of
technical papers
34. Tong X et al (2010) A sub-pixel virtual view synthesis method for multiple view synthesis.
In: Picture coding symposium (PCS)
35. Horng Y, Tseng Y, Chang T (2011) VLSI architecture for real-time HD1080p view synthesis
engine. IEEE Trans Circuits Syst Video Technol 21(9):1329–1340
36. Tsung P-K et al (2011) A 216fps 4096 × 2160p 3DTV set-top box SoC for free-viewpoint
3DTV applications. In: IEEE international solid-state circuits conference (ISSCC), San
Francisco
37. Scharstein D, Szeliski R (2011) Middlebury stereo vision page. Available from: http://
vision.middlebury.edu/stereo/
38. Chang X et al (2011) Real-time accurate stereo matching using modified two-pass
aggregation and winner-take-all guided dynamic programming. In: International conference
on 3D imaging, modeling, processing, visualization and transmission (3DIMPVT)
39. Deng Y, Lin X (2006) A fast line segment based dense stereo algorithm using tree dynamic
programming. In: Proceedings of ECCV
40. Liu T, Luo L (2009) Robust context-based and separable low complexity local stereo
matching using multiple cues. Submitted to TIP
41. Yoon K-J, Kweon IS (2006) Adaptive support-weight approach for correspondence search.
IEEE Trans Pattern Anal Mach Intell 28(4):650–656
42. Bhusnurmath A, Taylor CJ (2008) Solving stereo matching problems using interior point
methods. In: Fourth international symposium on 3D data processing, visualization and
transmission (3DPVT)
43. Knorr S et al (2011) Basic rules for good 3d and the avoidance of visual discomfort in
stereoscopic vision. IBC, Amsterdam

Chapter 4

DIBR-Based Conversion
from Monoscopic to Stereoscopic
and Multi-View Video
Liang Zhang, Carlos Vázquez, Grégory Huchet
and Wa James Tam

Abstract This chapter aims to provide a tutorial on 2D-to-3D video conversion


methods that exploit depth-image-based rendering (DIBR) techniques. It is devoted
not only to university students who are new to this area of research, but also to
researchers and engineers who want to enhance their knowledge of video conversion
techniques. The basic principles and the various methods for converting 2D video
to stereoscopic 3D, including depth extraction strategies and DIBR-based view
synthesis approaches, are reviewed. Conversion artifacts and evaluation of conversion quality are discussed, and the advantages and disadvantages of the different
methods are elaborated. Furthermore, practical implementations for the conversion
from monoscopic to stereoscopic and multi-view video are drawn.

Keywords 3D-TV · 2D-to-3D video conversion · Conversion artifact · Depth cue ·
Depth estimation · Depth-of-field · Depth map preprocessing · Depth-image-based
rendering (DIBR) · Disocclusion · Focus · Hole filling · Human visual system ·
Hybrid approach · Linear perspective · Motion parallax · Pictorial depth cue ·
Stereoscopic 3D (S3D) · Surrogate depth map · View synthesis
L. Zhang (✉) · C. Vázquez · G. Huchet · W. J. Tam


Communications Research Centre Canada, 3701 Carling Ave,
Ottawa, ON K2H 8S2, Canada
e-mail: liang.zhang@crc.gc.ca
C. Vázquez
e-mail: carlos.vazquez@crc.gc.ca
G. Huchet
e-mail: gregory.huchet@crc.gc.ca
W. J. Tam
e-mail: james.tam@crc.gc.ca



4.1 Introduction
The development of digital broadcasting infrastructures and technologies around
the world as well as rapid advances in stereoscopic 3D (S3D) display technologies
have created an unprecedented momentum in the push for the standardization and
delivery of S3D to both television receivers in the home and mobile devices on the
road. Three-dimensional television (3D-TV) based on a minimum of two views
that are taken from slightly different viewpoints has for the last few years become
the common goal of private, regional, and international standardization organizations (e.g., ATSC, EBU, ITU, MPEG, SMPTE) [13]. However, standardization
efforts are also being made to encompass multi-view (MV) displays (e.g., ITU,
MPEG) [4]. For brevity and ease of discussion, in this manuscript S3D will be
used to refer to both stereoscopic and MV imaging as the case may apply.
The fundamental principle behind 3D-TV in its most basic form is the transmission of two streams of video images. The images depict vantage points that
have been captured with two cameras from slightly offset positions, simulating the
two separate viewpoints of a viewer's eyes. Appropriately configured TV receivers
then process the transmitted S3D signals and display them to viewers in a manner
that ensures the two separate streams are presented to the correct eyes. In addition,
images for the left and right eyes are synchronized to be seen simultaneously, or
near-simultaneously, so that the human visual system (HVS) can fuse them into a
stereoscopic 3D depiction of the transmitted video contents.
Given the experience of the initially slow rollout of high-definition television
(HDTV), it is not surprising that entrepreneurs and advocates of 3D-TV in the
industry have envisioned the need to ensure an ample supply of S3D program
contents for the successful deployment of 3D-TV. Although equipment manufacturers have been relatively quick in developing video monitors and television
sets that are capable of presenting video material in S3D, the development of
professional equipment for reliable and acceptable stereoscopic capture has not
been as swift. Several major challenges must be overcome, especially synchronization of the two video streams, increased storage, alignment of the images for
the separate views, and matching of the two sets of camera lenses for capturing the
separate views. Even now (2011), most professional S3D capture equipment
is bulky, expensive, and not easy to operate.
Aside from accelerating the development of technologies for the proper and
smooth capture of S3D program material for 3D-TV, there is one approach that has
provided some visible success in generating suitable material for stereoscopic
display without the need for stereoscopic camera rigs. Given the vast amount of
program material being available in standard two-dimensional (2D) format, a
number of production houses have successfully developed workflows for the
creation of convincing S3D versions of 2D video programs and movies. The
substantial success of the conversion of Hollywood movies to S3D has been
reflected in box office receipts, especially of computer-generated (animation)
movies. Most welcomed by the media industry and post-production houses, the


approach of converting 2D-to-S3D has not only provided a promising avenue for
increasing program material for 3D-TV, but has also opened up a new source of
revenue from repurposing of existing 2D video and film material for new release.
It has also been put forward, by a few commercially successful post-production
houses, that the process of conversion of 2D-to-S3D movies provides flexibility and a
secure means of ensuring that S3D material is comfortable for viewing [5]. As is
often raised in popular media, stereoscopic presentation of film and video material
can lead to visual discomfort, headaches, and even nausea and dizziness if the
contents are displayed or created improperly [6]. Through careful conversion,
the conditions that give rise to these symptoms could be minimized and even
eradicated because guidelines to avoid situations of visual discomfort can be easily
followed. Furthermore, different options for the parallax settings and the
depth layout of various objects and their structures can be explored by trial and error, not only for
safety and health concerns, but also for artistic purposes. That is, the depth
distribution can be customized for different scenes to meet specific director
requirements. Thus, the conversion from monoscopic to stereoscopic and MV video
is an important technology for several reasons: accelerating 3D-TV deployment,
creating a new source of revenue from existing 2D program material, providing a
means for ensuring visual comfort of S3D material, and allowing for repurposing and
customization of program material for various venues and artistic intent.
Having presented the motivation for 2D-to-3D video conversion in Sect. 4.1,
the rest of this chapter is organized into topical sections. Section 4.2 describes the
theory and the basis underlying 2D-to-3D video conversion. It also introduces a
framework for conversion that is based on the generation of depth information
from the 2D source images. Section 4.3 presents several strategies for generating
the depth information, including scene modeling and depth estimation from
pictorial and motion depth cues that are contained in the 2D images. The use of
surrogate depth information based on visual cognition is also described.
Section 4.4 presents the important topic of how new views are synthesised.
Procedures for preprocessing of the estimated depth information, pixel shifting and
methods for filling in newly exposed areas in the synthesised views are explained
in detail. Section 4.5 discusses the type of conversion artifacts that can occur and
how they might affect both picture and depth quality. Issues associated with the
measurement of conversion quality are briefly discussed. Section 4.6 concludes
with a discussion of some of the important issues that are related to the long-term
prospects of 2D-to-3D video conversion technologies and suggests areas in which
research could be beneficial for advancing current implementations.

4.2 Fundamentals of 2D-to-3D Video Conversion


3D-TV systems are based on the presentation of stereoscopic or MV content to
viewers [7]. This means that at least two streams of video obtained from slightly
different camera positions have to be fed to a 3D display system to create


Fig. 4.1 General framework for 2D-to-3D video conversion

stereoscopic perception of depth by the viewer. The main goal in 2D-to-3D video
conversion is the generation of that second, or higher, stream of video from the
original images of the 2D (monoscopic) stream.

4.2.1 What is 2D-to-3D Video Conversion?


2D-to-3D video conversion is a process that converts a monoscopic image sequence
into a stereoscopic image sequence or a MV video clip [8, 9]. This conversion entails
the extraction of depth information from a monoscopic image source and the generation of a new image that will represent the scene from a slightly different viewpoint in such a way that both images form a stereoscopic image pair. This general
conversion framework is schematically presented in Fig. 4.1.
This general framework can be viewed as the process of generating the image
that one of the eyes, say the right eye, would see given that the left eye is seeing
the original 2D image of an actual scene that is provided. Given that both eyes
see the same scene from different points of view, the conversion process
basically requires reproducing the point of view of the right eye. The extraction of
the depth is an essential part of this process because depending on the depth of
objects in the scene they will appear at different spatial locations in the newly
generated image [10].

4.2.2 Why is 2D-to-3D Video Conversion Possible?


The first question one might ask is why is this 2D-to-3D video conversion possible
given that there is apparently no depth information in the flat 2D images?
In fact, there are several indicators of depth in almost every 2D image. As part of
human development, we have learned to interpret visual information and to
understand the depth relations between objects in an image. It is possible to
generate the depth information because the HVS relies not only on stereopsis
(differences between the images perceived by both eyes) for depth perception,


Fig. 4.2 Depth can be perceived from viewing monoscopic 2D images

but also on other visual cues that provide information about the depth of objects in
the scene that our brains have learned. For example, using our learned perceptive
knowledge we can easily place objects in depth in the scene that is depicted by a
flat 2D image in Fig. 4.2 even if there is no stereoscopic information available.

4.2.3 How is 2D Video Converted into 3D?


The conversion of a 2D image into a stereoscopic 3D image pair is accomplished by
extracting or directly generating the depth structures of the scene from the analysis of
the monoscopic image source. This information is later used to generate a new image
to form a stereoscopic image pair or to render several views in the case of MV images.
The human visual system is able to interpret a variety of depth cues to generate a
representation of the surrounding world in 3D, providing volume to the two flat
representations acquired by the eyes [11]. The depth information found in images is
normally referred to as depth cues and are typically classified into binocular (coming
from two eyes) and monocular (coming from a single eye) depth cues. Binocular
depth cues exploit the small differences between the images perceived by both eyes in
order to compute the depth of a given feature. Monocular depth cues, on the other
hand, help the HVS extract the relative depth between objects in a single view of the
scene. A concise list of depth cues is presented in Fig. 4.3.

4.2.4 Monocular Depth Cues in Video Sequences


2D-to-3D video conversion is necessary when binocular depth cues are not
available and all the depth information has to be extracted from the monocular
depth cues in the original image. Monocular depth cues in images can be classified
into two main categories: motion-based and pictorial cues.


Fig. 4.3 Depth cues provided by images

Motion-based depth information in video sequences is perceived as motion
parallax, which can be defined as the difference in the spatial position of objects
across time. When objects move in the scene, their spatial position in the image
changes depending on their location in the scene; this apparent image motion is
induced by camera motion and/or object motion. A simple example of depth
perception from motion parallax is when viewing a landscape scene through a
window of a moving train. Objects that are far in the background appear to move
very slowly, while objects close to the train will pass by the window very fast. This
difference in speed, for objects that are basically static, is translated into depth
information by the human visual system. Motion parallax is a strong depth cue in
video sequences that can be used for extracting the depth information needed for
2D-to-3D video conversion.
In addition to the motion information that can be exploited for depth extraction
in 2D-to-3D video conversion, there are several pictorial depth cues in monoscopic
still images that can be used in the depth extraction process for 2D-to-3D video
conversion. Figure 4.3 lists most of the pictorial depth cues found in images.
It should be noted that all the depth cues in monoscopic sequences are available to
the human visual system to create a stable volumetric representation of the scene.
One of the main pictorial depth cues used by the human visual system is
interposition of objects. It is very simple and intuitive to see this cue. Whenever an
object is in front of another, the one that is closer to the viewer will obstruct the
view of the farther object. However, this depth cue appears to provide information
only on depth order and not magnitude [12].
Another important depth cue frequently found in images is related to the limited
depth of field of real cameras, which is reflected in the images in the form of
objects being focused or defocused. In the real world, we use the accommodation


mechanism of the eye to keep objects in focus. We tend to accommodate to the
distance of an object of interest, leaving all other objects at other depths out of
focus. This mechanism is reproduced in video sequences because of the limited
depth of field of the acquisition system's optics, such that objects of interest appear
in focus in the 2D image while all other objects appear out of focus. This difference
in focus information between objects in the image can be used to determine the
depth relationship between objects [13].
Linear perspective is also used to extract depth information from still monoscopic
images. Human-constructed scene structures frequently have linear features that run
from near to far. The convergence point of those lines will define a point in the
horizon that can be used to construct a predefined depth structure for the scene [14].
Other depth cues are related to how we interpret visual information. The variations
in size, texture, color, light, and other object characteristics in the image are indicators of how far objects are located in the scene. By analyzing these characteristics and
variations in the image, it is possible to determine the relative depth position of
objects. The main difficulty associated with the use of these depth cues in computer
algorithms is that all these characteristics are learnt and very difficult to extract from
the images using automated means [15].
In summary, both still images and moving image sequences contain a wide
range of depth information that can be extracted to create a S3D version of the
image. However, it is not easy to extract reliable and accurate information using
these depth cues from 2D images without human intervention.

4.2.5 DIBR Framework for 2D-to-3D Video Conversion


Depth-image-based rendering (DIBR) is a technique for producing an S3D
video sequence by generating new virtual views from an original (monoscopic)
video sequence and the depth information that has been extracted from the original
sequence. The process consists of using the per-pixel depth information to
place every pixel of the original image at the corresponding spatial location in the
newly generated image. There are three essential operations in this process:
(1) converting depth information into disparity values, in the form of horizontal pixel-coordinate differences of corresponding pixels in the original and rendered images,
for each pixel in the original image based on a specified viewing geometry;
(2) shifting pixels from their positions in the original image to new positions in the
newly generated image according to the disparity value; and (3) filling in disoccluded regions that might appear as a consequence of the shifting operation.
A schematic of the DIBR process is presented in Fig. 4.4. The process is controlled
by a set of parameters that depends on the viewing conditions. In the DIBR
framework, depth information is represented by a 2D data set in which each pixel
value provides the depth. However, Knorr et al. [16] have proposed another
representation of depth information that consists of a number of 3D real-world
coordinates (a sparse 3D scene structure). This type of representation is beyond the


Fig. 4.4 Schematic of the DIBR process

scope of this chapter, since they use image warping instead of DIBR for virtual
view synthesis.
From Fig. 4.4, the first step is to specify the virtual camera configuration,
i.e., where to locate the virtual cameras to generate the stereoscopic content. The
relative positions of the cameras, together with the viewing conditions, determine
the parameters for converting depth values into disparity between the two images;
this process creates a disparity map. Parallel cameras are normally preferred since
they do not introduce vertical disparity which can be a source of discomfort for
viewers. There are two possible camera positions to choose from: use the original
2D content as one of the cameras and position the second close to it to form a
virtual stereo pair, or treat the original 2D content as being captured from a central
view and generate two new views, one to each side of the original content.
Generating only one single-view stream of images is less computationally intensive and ensures that at least one view (consisting of the original images) has high
quality. Generating images for two views has the advantage that both of the
generated images are closer to the original one and as a consequence the disocclusions or newly exposed regions, which are a source of artifacts, are smaller and
spread over the two images.
The viewing conditions have also to be taken into account. The distance to the
display, the width of the display, and the viewer's interocular distance are three of
the main parameters to take into account while converting depth information into
disparity values used in the DIBR process to create the new image that forms the
stereoscopic pair.
The second main operation in the DIBR process is the shifting of pixels to their
positions in the new image. The original depth values are preprocessed to remove
noise and reduce the amount of small disocclusions in the rendered image. Then the
depth values are converted into disparity values by taking into account the viewing
conditions. The process of shifting can be interpreted theoretically as a reprojection
of 3D points onto the image plane of the new virtual camera. The points from the
original camera, for which the depth is available, are positioned in the 3D space and
then reprojected onto the image plane of the new virtual camera. In practice,
however, what really happens is that the shift is only horizontal because of the


Fig. 4.5 Disoccluded regions (holes) are produced at sharp disparity edges in the newly
generated image which have to be properly filled. See main text for details. Original images from
Middlebury Stereo Datasets [18]

parallel configuration of the camera pair and, consequently, pixels are shifted
horizontally to their new positions without having to find their real positions in the 3D
space [17]. This horizontal shifting of pixels is driven by the disparities that were
computed using the camera configuration and by the simple rule that objects in front
must occlude objects in the back. Horizontal shifting of pixels also makes the
geometry simpler and allows for linewise parallelized computer processing.
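As a concrete illustration of these operations for the parallel-camera case, the following Python sketch converts an 8-bit depth map into pixel disparities with a simple linear near/far parallax model, shifts pixels scanline by scanline with foreground priority, and marks the disocclusions left for hole filling; the conversion formula, parameter values, and function names are illustrative assumptions rather than a prescribed method.

    import numpy as np

    def depth_to_disparity(depth8, max_pos=8.0, max_neg=-8.0):
        """Map 8-bit depth (255 = nearest) linearly to a pixel disparity range.

        max_pos / max_neg bound the parallax in pixels for the nearest and
        farthest points; real systems derive them from the viewing distance,
        display width, and interocular distance.
        """
        z = depth8.astype(np.float64) / 255.0
        return max_neg + z * (max_pos - max_neg)

    def render_view(image, depth8):
        """Shift pixels horizontally by their disparity and mark holes."""
        h, w = depth8.shape
        disparity = np.round(depth_to_disparity(depth8)).astype(np.int32)
        out = np.zeros_like(image)
        filled = np.zeros((h, w), dtype=bool)
        z_buf = np.full((h, w), -np.inf)
        for y in range(h):
            for x in range(w):
                xt = x + disparity[y, x]
                # nearer pixels (larger depth value) must occlude farther ones
                if 0 <= xt < w and depth8[y, x] > z_buf[y, xt]:
                    out[y, xt] = image[y, x]
                    z_buf[y, xt] = depth8[y, x]
                    filled[y, xt] = True
        holes = ~filled          # disocclusions to be filled in the last operation
        return out, holes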
The last main operation in the DIBR process is to fill in the holes that are
created by sharp edges in the disparity map. Whenever a pixel with a large disparity is
followed (in the horizontal direction) by a pixel with a smaller disparity, the
difference in disparity will create either an occlusion (superposition of objects) or a
disocclusion (lack of visual information). These disocclusions are commonly
referred to as holes, since they appear in the rendered image as regions without any
color or texture information.
In Fig. 4.5, we present an example of the process that leads to disoccluded
regions. The images in the top row, starting from the left, represent the original left
image, the disparity map image and the original right image (ground-truth),
respectively. The second row of images, starting from the left, represents an


enlarged view of a small segment of the original left image and its corresponding
area in the disparity image. The last three images represent, respectively: the
segment in the newly generated right image with the hole (disocclusions region),
the segment with the shape of the hole shown in white, and the segment with the
hole (outlined by the red and blue pixels) that has been filled with color and texture
information from the original right image. As pixels in the image are shifted to
generate the new virtual view, the difference in depth at the edge produces a
hole in the rendered image for which no visual information is available. The
rightmost image in the bottom row shows the real right-eye view (ground-truth)
with the information that should be placed in the hole to make it look realistic or
natural. This illustration gives an idea of the complexity of the task of filling holes
with consistent information. In reality, the content of the hole must be computed
from the left image since the ground-truth right image is not available.

4.3 Strategies for DIBR-Based 2D-to-3D Video Conversion


2D-to-3D video conversion based on DIBR techniques requires the generation of
a depth map as a first operation in the process. Several strategies have been
explored over the years in order to provide the necessary depth information to the
DIBR process by relying on the information available from monocular depth cues
in the 2D images of the video sequence. In this section, we will explore some of
those strategies and present some representative examples of practical systems that
use those strategies for the generation of the depth information.

4.3.1 Predefined Scene Models


One of the first strategies employed in the generation of the depth information for
2D-to-3D video conversion is the use of predefined scene depth models. This technique
relies on the assumption that in most natural scenes the general depth structure can
be approximated with one of several predefined depth models that describe general
types of scenes. The most commonly used depth models assume that objects in the
center or at the top of the image are farther away from the viewer than objects at the
bottom of the scene [19]. This broad assumption, although not always valid, is basically
respected for many natural scenes. For outdoor scenes, for example, this assumption
could be accurate because in general we find that the ground under our feet is depicted
at the bottom of the image and the far sky at the top of the image. The depth then
gradually increases from the bottom of the image to the top. For indoor scenes, on the
other hand, the depth structure will depend on the shape of the room where the scene is
captured. In general, it can be assumed that there is a floor, a back wall, and a ceiling
above us. This assumption supports the use of a depth model that is in general spherical
with the bottom and top of the image being closer to the viewer than the central part
of the image which will represent the back wall and the farthest part of the scene.


One of the main steps in this approach of using scene models is the extraction of
features from the images that allow the scene to be properly classified and
eventually correctly assigned one of the predefined models. In [20] for example,
the first step is a classification of the scene into an indoor, an outdoor or an outdoor
with geometric elements scene. Other meaningful classifications that allow for the
selection of the appropriate model to apply to the scene are presented in [19, 21].
However, the depth models as discussed do not accurately represent the depth of
the scene because frequently objects that are in the scene do not conform to the
general assumption that changes in depth are of a gradual nature. This is the main
drawback of this technique because it does not provide the variations in depth that
make a scene look natural. The main advantage of this method, on the other hand,
is that it is fast and simple to implement. Thus, because of its speed and simplicity,
this method is mainly used for real-time applications.
It should be noted that the depth model approaches to defining depth in 2D-to-3D
video conversion are normally used as a first and rough approximation of the depth
[22, 23], and then complemented with information extracted from other depth cues in
the scene. For example, the detection of vertical features in the image can lead to the
extraction of objects that stand in front of the background, allowing for the correct
representation in depth of objects whose depth is not encompassed by the selected
model.
There are several depth cues in the 2D images that can help improve the depth
model to better represent the depth of the underlying scene. Among the depth cues
that are used is the position of objects as a function of the height in an image [24].
The higher an object is in the image, the farther away it is assumed to be. Linear
perspective [14], horizon location [20], and geometric structure [21] are features in
the image that can also be used to improve the predefined depth scene model
approach.
There are some examples of depth scene models being used for 2D-to-3D video
conversion in real time. Among them, the commercialized box from JVC™ uses
three different depth scene models as a base for the depth in the conversion [22], and
then adds some modifications to this depth based on color information. The proposed
method starts with three models: a spherical model, a cylindrical model, and a planar
model and assigns one of them (or a combination of them) to the image based on the
amount of spatial activity (texture) found in different regions of the image. The
models are blended and a color-based improvement is used to recede cool colors
and advance warm colors.
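
As a rough illustration of this kind of model selection (not the actual JVC algorithm of [22], whose details differ), the sketch below blends a planar gradient model and a spherical model, such as those from the earlier sketch, with a weight derived from a simple texture-activity measure; the region choices and the activity measure are assumptions made for the example only.

    import numpy as np

    def texture_activity(gray):
        """Mean absolute horizontal + vertical gradient as a crude texture measure."""
        g = gray.astype(np.float32)
        return np.abs(np.diff(g, axis=1)).mean() + np.abs(np.diff(g, axis=0)).mean()

    def blended_depth_model(gray, planar_model, spherical_model):
        """Blend two predefined 8-bit depth models according to texture activity."""
        h = gray.shape[0]
        top_act = texture_activity(gray[: h // 3])           # upper third of the frame
        bottom_act = texture_activity(gray[2 * h // 3 :])    # lower third of the frame
        # More activity at the top (no smooth sky) leans towards the spherical model.
        w = top_act / (top_act + bottom_act + 1e-6)
        blend = (w * spherical_model.astype(np.float32)
                 + (1.0 - w) * planar_model.astype(np.float32))
        return blend.astype(np.uint8)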

4.3.2 Surrogate Depth Information


Another solution to the problem of defining the depth for 2D-to-3D video
conversion makes use of general assumptions about the way humans perceive
depth. The idea behind this approach is that the Human Visual System (HVS) is
robust enough to support sparse depth information and even tolerate some amount

of discrepancies among visual depth cues and still be able to make sense of the
scene from the depth organization point of view [9, 25]. Most of the methods used
in this approach generate a depth map based on general features of the images,
such as luminance intensity, color, edge-based saliency or other image features
that provide an enhanced perception of depth, compared to a corresponding 2D
image, without explicitly aiming to model accurately the real depth of the scene
[26]. The depth maps thus generated are referred to as surrogate depth maps
because they are based on perceptual effects rather than estimation of actual depths
to arrive at a stable percept of a visual scene [27].
To better explain how this type of method works, we need to understand how
the HVS functions in depth perception. The perception of depth is an active
process that depends not only on the binocular depth cues (parallax, convergence),
but also on all available monocular depth cues including knowledge from past
experiences. By providing a stereoscopic image pair, we are providing the HVS
with not only the binocular parallax information, but also all other depth cues, i.e.,
the monocular depth cues that are contained in the images. Of course, the images
also provide the stimulus to activate past experiences that are stored in the brain
and that are readily available to the viewer. The HVS will take all the available
information to create a consistent perception of the scene. In the particular case of
stereoscopic viewing, when the binocular parallax information is present (based on
the information from a surrogate depth map), but is inconsistent with a number of
the other cues, the HVS will try to resolve the conflict and generate a percept that
makes the most sense from all the available information. Observations indicate
that any conflicting parallax information is simply modified or overridden by the
other cues, particularly by the cue for interposition [26–28]. On the other hand,
when the surrogate depth is consistent with all the other depth cues, then the depth
effect is reinforced and the depth quality is improved. A major condition for this
surrogate-based approach to be effective is that the depth provided by the synthesised binocular parallax cue in the rendered stereoscopic images is not so strong
that it dominates the other depth cues. When the conflict introduced is too strong,
viewers would experience visual discomfort.
Several methods in the literature exploit this approach for the conversion of 2D
material to the S3D format. In particular the methods that rely on luminance, color,
and saliency rely on this process to provide depth. For example, the methods
proposed in [26, 29] rely on the detection of edges in the image to generate a depth
map. Depth edges are normally correlated to image edges in defining the borders
of objects, so by providing depth to edges in the image, objects are separated from
the surrounding background and the depth is enhanced. A color segmentation
assisted by motion segmentation is used in [30] to generate a depth map for 2D-to-3D video conversion. Each segment in the image is assigned a depth following a
local analysis of edges to classify each pixel as near or far.
Another method that has been proposed uses the color information to generate
a depth map that is then used for 2D-to-3D video conversion [27, 31]. The method
proposed by [27] employs the red chromatic component (Cr) of the YCbCr color
space as a surrogate for the depth. It is based on the observation that the gray-level


Fig. 4.6 Example of color-based surrogate depth map

image of the Cr component of a color image looks similar to what a real depth map
might look like of the scene depicted in the image. That is, objects or regions with
bluish or greenish tint appear darker in the Cr component than objects or regions
with reddish tint. Furthermore, subtle depth information is also reflected in the
gray-intensity shading of areas within objects of the Cr component image, e.g., the
folds in the clothing of people in an image. In general, the direction of the gray-level gradient and intensity of the gray-level shading of the Cr component,
especially of naturalistic images, tend to roughly match that expected of an actual
depth map of the scene depicted by the image. In particular, faces which have a
strong red tint tend to show depth that is closer to the viewer while the sky and
landscapes, because of having a stronger bluish or greenish tint, are receded into
the background. The variation in shading observed in the Cr component, such as
within images of faces and of trees, also provides relative depth information that
reflects depth details within objects. Studies have shown that this method is
effective in practice and has the advantage of simplicity of implementation [27].
Figure 4.6 presents an example of a surrogate depth map computed from color
information. The main structure of the scene is present and objects are well defined
in depth.
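
In its simplest form, the Cr-based surrogate of [27] amounts to reusing the red-chroma plane of a YCbCr conversion as an 8-bit depth map. A minimal sketch is given below; the BT.601 full-range conversion is one common choice and is an assumption made here, since [27] does not hinge on a particular conversion matrix.

    import numpy as np

    def cr_surrogate_depth(rgb):
        """rgb: H x W x 3 uint8 image. Returns an 8-bit surrogate depth map."""
        r = rgb[..., 0].astype(np.float32)
        g = rgb[..., 1].astype(np.float32)
        b = rgb[..., 2].astype(np.float32)
        # BT.601 full-range Cr component, offset into the usual 0..255 range.
        cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
        # Brighter (reddish) regions are treated as closer, bluish/greenish as farther.
        return np.clip(cr, 0, 255).astype(np.uint8)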
Another approach that has received much attention recently is based on visual
attention or saliency maps [32] used as surrogates for depth. The main hypothesis
supporting these approaches is that humans concentrate their attention mainly on
foreground objects and objects with salient features [33] such as high contrast or
textures. Based on this hypothesis some researchers are proposing to use a saliency
map to represent the depth of a given scene in such a way that the most interesting
objects, from the point of view of the viewer, are brought to the front of the scene
and less interesting objects are pushed to the back as background.
In general, these methods of using different types of surrogates as a source of
depth information are simple to use and provide reasonably good results for the
conversion of 2D content to 3D format. Because they are based on the robustness
of the HVS for consolidating all available visual information to derive the perception of depth, they tend to produce more natural-looking images than
methods relying on predefined models. These surrogate depth-based methods are
mainly used for fast (real-time) conversion of video material because of their
simplicity and performance.


4.3.3 Depth Estimation from Motion Parallax


The third solution for 2D-to-3D video conversion is to extract scene depths from
motion parallax found in video sequences [34]. For a moving observer, the perceived relative motion of stationary objects against a background gives cues about
their relative distances. If information about the direction and velocity of movement is known, motion parallax can provide absolute depth information [35].
Motion parallax may be seen as a form of disparity over time and allows
perceiving depth from spatial differences between two temporally consecutive
frames in a video sequence captured with a moving camera. These differences are
observed in the video as image motion. By extracting this image motion, motion
parallax could be recovered.
Image motion may relate to the whole image or specific parts of an image, such as
rectangular blocks, arbitrarily shaped patches or even per pixel. For a static scene, the
image motion is caused only by camera motion and is related to all whole image
regions. This camera-induced image motion depends on camera motion parameters
and also on the depths of objects in the scene. Different camera motions will lead to
different strengths of depth perception. A freely moving camera, e.g., a panning camera, can provide information about the depth in the scene since it captures scene
information from different perspectives, while a camera that only rotates around its
optical axis does not provide any information about the depth. For a dynamic scene,
the image motion is induced not only by camera motion, if it is in motion, but also by
independently moving objects (IMOs). The IMOs have a relative motion with
respect to the camera that is different from that of the background. Theoretically,
object-induced image motion is independent of the depth of objects in the scene.
The existence of IMOs makes the motion-to-depth conversion ambiguous. Interestingly, if the motion is along the optical axis, the sizes of IMOs in the
image change as a function of the depth position. As they move farther away
they appear to become smaller and vice versa. In this case, it is the relative size of
objects that provide hints about their relative distance.
The approach for determining depth from motion parallax usually consists of
two steps: (1) estimating image motion, and (2) mapping of image motion into
depth information.

4.3.3.1 Estimation of Image Motion


Image motion can be represented by a motion model, such as an affine motion
model, translational motion model, and so on, which approximates the motion of a
moving object or a real video camera. Motion model parameters are determined
by motion estimation methods. Motion estimation is an ill-posed problem as
the motion is in 3D, but the images are a projection of the 3D scene onto a 2D
plane. Feature matching/tracking and block matching are two popular techniques
for motion estimation.
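
For illustration, a minimal exhaustive block-matching routine (sum of absolute differences over a fixed search range) is sketched below; practical converters use faster hierarchical and variable-block-size variants such as the HVSBM scheme discussed later in this subsection. The block and search sizes are arbitrary example values.

    import numpy as np

    def block_matching(prev, curr, block=16, search=8):
        """Per-block motion vectors (dy, dx) from prev to curr (grayscale frames)."""
        h, w = curr.shape
        mv = np.zeros((h // block, w // block, 2), dtype=np.int32)
        for by in range(h // block):
            for bx in range(w // block):
                y0, x0 = by * block, bx * block
                target = curr[y0:y0 + block, x0:x0 + block].astype(np.int32)
                best_sad, best_v = None, (0, 0)
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        y1, x1 = y0 + dy, x0 + dx
                        if y1 < 0 or x1 < 0 or y1 + block > h or x1 + block > w:
                            continue
                        cand = prev[y1:y1 + block, x1:x1 + block].astype(np.int32)
                        sad = np.abs(target - cand).sum()
                        if best_sad is None or sad < best_sad:
                            best_sad, best_v = sad, (dy, dx)
                mv[by, bx] = best_v
        return mv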


Matsumoto et al. [36] developed a 2D-to-3D method to convert a monocular
image sequence of a static scene to S3D by estimation of camera motion from an
image sequence. Grid point correspondences between two consecutive images
from the sequence are established first by an energy minimization framework,
where the energy is defined as the sum of the squared intensity difference between
two block areas around potential correspondent grids and the quadratic differential
of motion parallax. An iterative algorithm is proposed to solve the energy minimization problem. After the grid point correspondences are found, camera translation parameters between the two images are then determined by using the
epipolar plane constraint.
Zhang et al. [37] compared the performance of depth estimation from motion
parallax using feature matching and block matching techniques for a video
sequence of a static scene with a camera-induced motion. The algorithm for
feature matching consisted of four steps [38]: (1) detecting significant feature
points in each image; (2) matching the significant features in one image to their
corresponding features in the consecutive image; (3) calculating the fundamental
matrix, which defines the motion geometry between two views of a static scene;
(4) determining correspondence for each pixel between consecutive images. The
block matching algorithm uses a hierarchical variable size block matching technique (HVSBM) to determine block correspondences between two images [39].
Multi-resolution pyramids and quad-tree block partition techniques are used in the
HVSBM algorithm. Experiments confirmed that the feature matching approach is
superior in terms of both the accuracy of generated depth maps and the quality of
rendered stereoscopic images for the image sequences that were tested, although it
requires more computing time than a simple block-matching approach [37].
There are also several 2D-to-3D video conversion methods that do not distinguish
differences between camera-induced and object-induced image motions. Kim et al.
[40] proposed to divide each video frame into blocks and directly calculate a motion
vector per block. Those calculated motion vectors are used as motion parallax for the
current frame. Ideses et al. [41] developed a real-time implementation of 2D-to-3D
video conversion using compressed video. They directly used the motion information provided in H.264 bit streams as depth cues for the purpose of 2D-to-3D video
conversion. Pourazad et al. [42, 43] also presented a similar method for generating
depth maps from the motion information of H.264-encoded 2D video sequences.
Different from the method [41], solutions for recovering the depth map of I-frames
and for refining motion vectors in H.264 were proposed. The absolute horizontal
values of the refined motion vectors are used to approximate scene depth values. Kim
et al. [44], Po et al. [45], and Xu et al. [46] improved the quality of depth from motion
by using object region information, which is obtained by color segmentation. The
differences between these three methods were that Kim et al. used a bidirectional
Kanade–Lucas–Tomasi (KLT) feature tracker to estimate the motion, Po et al. used a
block matching method, and Xu et al. utilized an optical flow method. These motion-based 2D-to-3D video conversion methods directly convert the motion vectors into
depth and do not differentiate between camera-induced and object-induced image
motions. Compared to the real depth of IMOs in the captured scene, the IMOs will


Fig. 4.7 Schematic of shearing motion: scene structure and camera motion shown on the left,
and shearing motion in the captured image shown on the right

visually pop out more when the direction of motion is opposite to that of the camera
(see Fig. 4.7). Future studies will be required to examine the effects of IMOs on the
visual comfort experienced by viewers when they view such conversions.

4.3.3.2 Mapping of Image Motion into Depth Information


A depth map associated with an image can be generated by using its image motion
recovered from an image sequence through a motion estimation algorithm. When
the image sequence is acquired using a panning or close to panning camera the
magnitude of motion vectors within each video frame can be directly used as
values in the depth map. Such a motion-to-depth mapping might not result in
accurate absolute depth magnitudes, but, in the absence of IMOs in the scene,
would produce a correct depth order for objects in the scene.
Several motion-to-depth mappings have been proposed in the literature. They
usually do not distinguish between image motion produced by camera motion and by
object motion. A linear mapping method is proposed in [40] that directly uses the
magnitude of motion vectors as its depth value for each block of pixels. A modification to the linear mapping method is described in [37, 41]. It allows scaling of the
magnitudes of motion vectors so that the maximal depth value remains constant
across all video frames. A more flexible method utilizing nonlinear mapping is
proposed in [42, 43] to enhance the perceptual depth. With this method the whole
scene is divided into several depth layers and each depth layer is assigned a different
scaling factor. Beside the magnitude of the motion vectors, camera motion and scene
complexity can also be taken into account, as is described in [44]. Camera motion is
important because it can be exploited. For example, when the camera moves around a

particular object in the scene, shearing motion in the image plane is created [34].
Shearing motion is schematically shown in Fig. 4.7, where the camera moves around
the tree in the scene. In such a case, objects that lie between the tree and the moving
camera will have a motion in the image plane that is opposite in direction to that of the
camera. Objects that lie beyond the tree move in the same direction as the camera.
Furthermore, the farther in depth the objects are with respect to the tree, the larger the
magnitude of movement. To tap this depth information from shearing motion, the
dominant camera motion direction is included in the computation for the motion-to-depth mapping [37].
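
A sketch of the scaled linear mapping in the spirit of [37, 41] is shown below: motion magnitudes are normalized so that the largest value in each frame maps to a fixed maximum depth level, keeping the depth range constant across frames. The maximum level of 255 and the handling of motionless frames are example choices, not prescriptions from those references.

    import numpy as np

    def motion_to_depth(mv, max_level=255):
        """mv: H x W x 2 array of motion vectors. Returns an 8-bit depth map."""
        mag = np.sqrt(mv[..., 0].astype(np.float32) ** 2 +
                      mv[..., 1].astype(np.float32) ** 2)
        peak = mag.max()
        if peak == 0:                         # no motion: no parallax cue available
            return np.zeros(mag.shape, dtype=np.uint8)
        # Faster apparent motion is interpreted as being closer to the camera.
        return (max_level * mag / peak).astype(np.uint8)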
For mapping of image motion into depth values when camera motion is
complicated or when more realistic depth values are desired, the paths taken by a
moving camera, called tracks, and the camera parameters are required [36, 47].
The tracks and camera parameters, including intrinsic and extrinsic parameters,
can be estimated by using structure-from-motion techniques [38]. Once the camera
parameters are obtained, the depth value for each pixel is calculated by triangulation of the corresponding pixels between two images that are selected from the
sequence [36]. Such a mapping generates a depth map that more closely
approximates the correct relative depths of the scene.

4.3.4 Hybrid Approaches


Video sequences tend to have a wide variety of contents and, therefore, they can be
quite different from one another. They may have camera-induced and/or object-induced motion, and even no motion. The solutions for depth map generation
discussed in previous subsections mainly rely on one specific depth cue. This
dependency on a single cue imposes limitations because each cue can provide
correct depth information, but only under specific circumstances in which the
depth cue that is chosen is present and when the assumptions based on it are valid.
Violation of the assumptions will result in error-prone depth information. As an
example, motion parallax can provide accurate scene depths only when the scene
is static and when the camera motion is restricted to translation. When the scene
contains moving objects, the motion parallax will provide ambiguous depth.
Imagine that two objects are located at the same distance to the camera, but have
different motions. According to the assumption of depth from motion parallax, the
object with fast motion will be assigned a depth that is closer to the camera than
the other object, and this would clearly be wrong.
A natural way to deal with these issues is to develop a 2D-to-3D video conversion
approach that exploits all possible depth cues that are contained in a video sequence.
The HVS is a good model in that it exploits a variety of depth cues to perceive the
world in 3D. Typically video sequences contain various depth cues that can help
observers perceive the depth structure of the depicted scene. Thus, utilizing all
possible depth cues is a desirable approach. We call a conversion that uses more
than one depth cue a hybrid approach.


Several hybrid 2D-to-3D video conversion approaches have been developed.
Chang et al. [48] developed a hybrid approach that combines three depth cues,
namely: motion parallax, scene geometry, and depth from textures, to generate the
depth information in a given visual scene. The depth information from the three depth
cues is linearly integrated, with weighting factors that were determined based on the
perceived importance of each depth cue. Cheng et al. [49] and Huang et al. [23]
developed an approach that exploits the motion and the geometrical perspective that
are contained in a scene. Two depth maps, one from motion and another from
geometrical perspective, are integrated into a final depth map based on averaging
[49] or according to the binarization results from a module that detects moving
objects [23, 49]. Chen et al. [50] developed an approach that combines the depth that
is derived from motion and that derived from color information.
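
A minimal sketch of such linear cue integration is given below; the cue names and weights in the usage comment are placeholders rather than the values used in [48] or [49].

    import numpy as np

    def fuse_depth_cues(depth_maps, weights):
        """depth_maps: list of H x W arrays in [0, 255]; weights: one per map."""
        w = np.asarray(weights, dtype=np.float32)
        w = w / w.sum()                                    # normalise the weights
        fused = np.zeros_like(depth_maps[0], dtype=np.float32)
        for d, wi in zip(depth_maps, w):
            fused += wi * d.astype(np.float32)
        return np.clip(fused, 0, 255).astype(np.uint8)

    # e.g. fused = fuse_depth_cues([d_motion, d_geometry, d_texture], [0.5, 0.3, 0.2])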
Although several hybrid 2D-to-3D video conversions have been developed, the
integration of various depth cues is still a difficult problem that remains to be
solved. The important challenge is how to integrate all the extracted depths from
different cues to form, not only spatially but also temporally, a stable and
reasonable depth structure of objects in a depicted scene. Even more of a challenge
is that whatever method is developed should be versatile enough to handle a
wide variety of video contents. In conclusion, more investigations in this area are
required to provide spatially and temporally consistent depths that can be applied
to all types of video material.

4.4 New View Synthesis


As pointed out in Sect. 4.2.5, a DIBR process consists of three steps: (1) depth map
preprocessing and disparity computation; (2) 3D image rendering or pixel shifting
and (3) disocclusion handling (hole-filling). In this section, the necessary steps
for the rendering of one or several virtual views from a video-plus-depth sequence
are presented. A block diagram of this process is shown in Fig. 4.8.

4.4.1 Depth Map Preprocessing


Preprocessing of the depth, or the disparity, facilitates the rendering process
by preparing the depth or disparity information so as to improve the quality of
the finally rendered results. For example, a filtering operation of the depth can
remove imperfections, reduce the variations of disparity inside objects and, as a
consequence, reduce visual artifacts inside objects. Preprocessing can also enhance
special features in the image such as depth edges to improve the depth contrast
between objects in the newly generated image; however, this preprocessing should
be used with care since it will also increase the size of the disocclusions. One
practical method to reduce or even remove all the disocclusions that could be


Fig. 4.8 Block diagram of the view synthesis process

generated within a newly rendered image is to heavily smooth the depth information before DIBR [28]. In other words, filtering simplifies the rendering by
removing possible sources of artifacts and improves the quality of the rendered
scene. In [51], it was demonstrated that smoothing potentially sharp depth
transitions across object boundaries reduces the number of holes generated during
the virtual view synthesis and helps to improve the image quality of virtual views.
Furthermore, it was shown in [52] that the use of an asymmetrical Gaussian filter,
with a stronger filter along the vertical than the horizontal dimension of the image,
is better at preserving the vertical structures in the rendered images.
Whereas a smoothing effect reduces the number of holes generated, it has the
effect of diminishing the perceived depth contrast in the scene because it softens
the depth transitions at boundaries of objects. It is interesting to note that although
smoothing across object boundaries reduces the depth contrast between object and
background, it has the beneficial effect of reducing the extent of the cardboard
effect in which objects appear flat [53]. In [54], a bilateral filter is used to smooth
the areas inside the objects, reducing depth-related artifacts inside objects while
keeping sharp depth transitions at object boundaries, and thus preserving the depth
contrast. Nevertheless, this is hardly an ideal solution, since large holes between
objects are still present and need to be filled. In general, when applying a
smoothing filter, lowering the strength of the smoothing leaves large holes that are
difficult to fill, which in turn, degrades the quality of the rendered scene [55]
because any artifact would encompass a wider region and would likely be more
visible. For these reasons, a tradeoff is essential to balance between a small
quantity of holes and a good depth impression without causing the rendered
scene's quality to degrade too much.
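
As an illustration of the asymmetric smoothing idea of [52], the sketch below applies a Gaussian filter whose standard deviation is larger along the vertical axis than along the horizontal one; the sigma values are arbitrary examples and would in practice be tuned against the tradeoff between hole size and depth impression discussed above.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def smooth_depth(depth, sigma_v=20.0, sigma_h=5.0):
        """Smooth an 8-bit depth map more strongly vertically than horizontally."""
        smoothed = gaussian_filter(depth.astype(np.float32), sigma=(sigma_v, sigma_h))
        return np.clip(smoothed, 0, 255).astype(np.uint8)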

4.4.2 3D Rendering


3D rendering or pixel shifting is the next main operation in the DIBR process. Let
us consider the case in which a parallel-camera configuration is used to generate
the virtual S3D image pair, the original content is used as the left-eye view and the
right-eye view is the rendered one. Under these circumstances, pixel shifting


Fig. 4.9 The diagram on the left shows the horizontal disparity value of each pixel in a row of
pixels contained in the left image. The diagram on the right shows how pixels in the row from the
left image are horizontally shifted, based on the pixel depth value, to render a right image and
form a stereoscopic image pair

becomes a one-dimensional operation that is applied row-wise in the camera-captured images. Pixels have only to be shifted horizontally from the left image
position to the right image position to generate the right-eye view.

4.4.2.1 Horizontal Disparity


Images taken with a parallel-camera configuration have the appealing feature that
they do not introduce vertical disparities, as compared to a toed-in camera
configuration. Thus, the advantage of converting images that have been captured
with a parallel-camera configuration is that the 3D-rendering process involves only
horizontal displacement of the source image pixels to generate images with a different camera viewpoint. Figure 4.9 shows a schematic of how this shifting of pixels
takes place. The diagram on the right indicates how pixels in a row of the left image
(top row) are shifted horizontally from a column position in the original left image to
the corresponding column position in the new right image (bottom row) based on the
horizontal disparity value indicated by the diagram on the left. Notice that pixels 2, 3,
and 4, counting from the left, are all displaced to the same column position in the right
image, but since pixel 4 has the largest disparity, this is the one that is kept. Also note
that pixel 6 lies on a depth discontinuity and its location in the new image depends on
the sharpness of the discontinuity. Pixel 6 represents a particular case: since it is lying
on the discontinuity, its disparity value would locate the rendered pixel 6 inside the
disoccluded region leaving it isolated (white dot in green rectangle). This situation is
represented by the solid line in the right diagram and the white dot on the left. This
situation should be avoided because isolated pixels become a visible artifact. One
way of avoiding this situation is to remove the pixel from the process. Since the pixel
is in the transition zone, it probably also contains color information from both sides of
the discontinuity. It would add an inconsistency to any side it is rendered with.
Another possible solution is the one represented with dotted lines: to deconvolve the
color and the disparity images and to shift the resulting pixels to the appropriate side

of the discontinuity, as represented by the dotted lines in the diagram. The
difference in depth between pixels 5 and 7 defines a region in which no pixel, aside
from pixel 6 which lies on the depth edge, is located; this constitutes a disoccluded region or
hole. This region is indicated in the figure by the green rectangle.
In most cases, pixels in the source image have a corresponding pixel in the new
image with the location of each of the pixels reflecting its associated horizontal
disparity. The generated new image can be either a left-eye or a right-eye view, or
both. Figure 4.10 is a schematic of a model of a stereoscopic viewing system
shown with a display screen and the light rays from a depicted object being
projected to the left eye and the right eye [56].
With respect to Fig. 4.10, the following expression provides the horizontal
disparity, p, presented by the display according to the perceived depth of a pixel
at z_p:

    p = x_R - x_L = t_c \left( 1 - \frac{D}{D - z_p} \right)                (4.1)

Here t_c corresponds to the interpupillary distance, i.e., the human eye separation,
which is usually assumed to be 63 mm, and D represents the viewing distance from
the display. Hence, the pixel disparity is expressed as:

    p_{pix} = \frac{p \, N_{pix}}{W},                                       (4.2)

where N_pix is the horizontal pixel resolution of the display and W its width in the
same units as p.

In this particular case, it is important to note that the perceived depth z_p is
taken relative to the location of the screen, which is at the Zero Parallax Plane. This
means that z_p = 0 for objects lying in the screen plane, 0 < z_p < D for objects in front of
the screen, and z_p < 0 for objects behind the screen. A positive parallax means that
the object will be perceived behind the display and a negative parallax means that
the object will be perceived in front of the display.
There are other viewing system models that exist where the distance is taken
from the camera to the position of the object. Such models are generally better
adapted for a multi-camera setup used for the generation of stereoscopic or MV
images [10].
4.4.2.2 Depth Mapping
A depth map can be construed as a collection of values m \in \{0, \ldots, 2^N - 1\},
where N \in \mathbb{N} is the number of bits encoding a given value. Conventionally,
the higher the value, the closer an object is to the viewer and the brighter is the
appearance of the object in a gray-level image of the depth map. These depth maps
need to be converted into disparity so the pixels of one image can be shifted to new
positions based on the disparity values to generate the new image.


Fig. 4.10 Model of a stereoscopic viewing system. Details are described in the main text

For a given pixel disparity, the depth impression can be different from one person
to another because viewers have different inter-pupillary distances. Viewers might
also have different personal preferences and different tolerance levels in terms of
viewing comfort. The individual differences suggest that viewers might differ in their
preference for a more reduced or a more exaggerated depth structure [57]. Therefore,
it is desirable to have the option of being able to control the magnitude of depth in a
2D-to-3D video conversion process. One method to accomplish this is to have each
depth value m mapped to a real depth according to a set of parameters provided by the
viewer. As an example, the volume of the scene can be controlled directly by
adjusting the farthest and closest distances, Z_far and Z_near. If the depth is taken relative
to the viewer, the range of depth R is defined as:

    R = Z_{far} - Z_{near}                                                  (4.3)

The depth mapping function is then:

    z_p = R \left( 1 - \frac{m}{2^N - 1} \right) + Z_{near}                 (4.4)

Here we can easily verify that z_p = Z_far for m = 0, and z_p = Z_near for
m = 2^N - 1.
A similar technique is used in [56]. However, the mapping function is created
relative to the screen position as in Fig. 4.10. Given two parameters k_near and k_far,
which represent the distance (as a percentage of the display width W) that is in
front of and at the back of the screen, respectively, we can define the range of
depth R produced as:

    R = W \, (k_{near} + k_{far})                                           (4.5)

Hence, z_p can be defined as:

    z_p = W \left( \frac{m}{2^N - 1} (k_{near} + k_{far}) - k_{far} \right) (4.6)

The parameters k_near and k_far control the amount of depth perceived in the scene.
The advantage of this method is that the volume of the scene is defined relative to
the viewing conditions, such as size and position of the display with respect to the
viewer instead of being defined relative to the capturing conditions. This makes it
relatively simple to change the amount of depth on the scene by changing
parameters under control of the viewer.
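
The chain from a stored depth value to a pixel disparity can be summarized by combining Eqs. (4.6), (4.1), and (4.2), as in the sketch below; all numeric defaults (display width, viewing distance, k_near, k_far, resolution) are illustrative viewing conditions rather than recommended values.

    import numpy as np

    def depth_to_pixel_disparity(m, W=1.0, D=3.0, t_c=0.063,
                                 k_near=0.1, k_far=0.2, N=8, N_pix=1920):
        """m: uint8 depth map; W, D, t_c in metres. Returns signed pixel disparity."""
        m = m.astype(np.float32)
        z_p = W * (m / (2 ** N - 1) * (k_near + k_far) - k_far)   # Eq. (4.6)
        p = t_c * (1.0 - D / (D - z_p))                           # Eq. (4.1)
        return p * N_pix / W                                      # Eq. (4.2)

With these defaults, the closest depth level (m = 255) yields a small negative disparity of a few pixels, placing the object slightly in front of the screen, while m = 0 yields a positive disparity that places it behind the screen.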
There are also other possible mappings of the value in the depth map to the real
depth in the scene. In the previously presented mapping procedures, the conversions were linear. Nonlinear mappings of depth and nonlinear processing of the
disparity have also been proposed [58] to improve the depth perception of the
scene or to improve visual comfort levels for incorrectly captured video material.

4.4.2.3 Horizontal Shifting


Once the viewing conditions and the control parameters are set, a virtual view is
created by shifting the horizontal position of the source image pixels according to
their disparity values. However, there are situations in which complications can
arise. One common situation is that spatial interpolation might be necessary for
some of the pixels during pixel shifting because the pixel positions in the new
virtual image must lie on the regular sampling grid, and not all shifts result in
pixels falling on the regular grid. Furthermore, if two pixels have the same final
location, only the pixel that has the larger Z value (m value) is adopted in order to
ensure that pixels representing objects closest to the viewer are not overwritten and
blocked from view in the rendering process.
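
A minimal row-wise shifting sketch with such a depth test is shown below; it rounds each shifted position to the nearest column of the sampling grid, keeps the pixel with the larger depth value m whenever two source pixels compete for the same target position, and returns a mask of the unfilled positions for the later hole-filling step. Real implementations additionally handle sub-pixel resampling; the function name and interface are illustrative assumptions.

    import numpy as np

    def render_right_view(left, depth, disparity):
        """left: H x W x 3 uint8; depth: H x W uint8; disparity: H x W in pixels."""
        h, w = depth.shape
        right = np.zeros_like(left)
        zbuf = np.full((h, w), -1, dtype=np.int32)     # depth of the pixel kept so far
        for y in range(h):
            for x in range(w):
                xt = int(round(x + disparity[y, x]))   # target column in the new view
                if 0 <= xt < w and int(depth[y, x]) > zbuf[y, xt]:
                    right[y, xt] = left[y, x]
                    zbuf[y, xt] = depth[y, x]
        holes = zbuf < 0                               # cracks and disocclusions
        return right, holes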
In some instances it is possible to have a situation in which three objects at the
same position, but at different depths, produce an erroneous rendering result. This
situation arises when the closest object is shifted, uncovering a previously
occluded region. When the middle object is itself shifted, the recently uncovered
area could land in front of a visible patch of the farthest object. Since the
uncovered area has no pixels in it from the middle object, it could be wrongly
filled with the pixels from the farthest object instead of leaving the area unfilled
and classified as a disocclusion. This situation will produce visible artifacts that
need to be removed from the image. This problem generally occurs when an object
possesses a large disparity value relative to objects behind it. Similar problems
could also arise with objects containing fine structures. Figure 4.11 shows an
illustration of this phenomenon where the disoccluded area behind the triangle is
filled by the green background instead of the blue color of the rectangle. Solving
this issue is very complex as it requires a contextual analysis of the image where


Fig. 4.11 Horizontal shifting problem. The disoccluded area behind the triangle is incorrectly
filled. Correctly labeled disoccluded areas are in black

the objects have to be extracted. However, in most cases, the disparity is small
enough to avoid this problem.

4.4.3 Hole Filling


As discussed in the previous subsection, a pixel in the virtual view does not
necessarily have a corresponding pixel in the source image. These unfilled pixels,
commonly called holes, can be classified into two categories: (1) cracks and (2)
disocclusions.
The shifting process may produce an object surface deformation since the
object projected in the virtual view may not have the same area as in the source
image. Therefore, cracks in the image are created and are generally one or two
pixels wide. They can be easily filled by linear interpolation, within the virtual
view, along the horizontal dimension.
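
A sketch of this crack-filling step is given below: holes of at most a couple of pixels in width are interpolated from their nearest valid neighbours to the left and right within the same row, while wider holes are left untouched for the disocclusion-filling methods discussed next. The width threshold and the interface are illustrative assumptions.

    import numpy as np

    def fill_cracks(image, holes, max_width=2):
        """image: H x W x 3 uint8; holes: H x W bool mask of unfilled pixels."""
        h, w = holes.shape
        out = image.copy()
        for y in range(h):
            x = 0
            while x < w:
                if holes[y, x]:
                    start = x
                    while x < w and holes[y, x]:
                        x += 1                          # x now indexes past the hole run
                    if x - start <= max_width and start > 0 and x < w:
                        left_px = out[y, start - 1].astype(np.float32)
                        right_px = out[y, x].astype(np.float32)
                        for i, xi in enumerate(range(start, x), 1):
                            t = i / (x - start + 1)     # linear weight across the crack
                            out[y, xi] = ((1 - t) * left_px + t * right_px).astype(np.uint8)
                else:
                    x += 1
        return out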
Disocclusions appear when an object in the virtual view is in front of a more
distant one. They are located along the boundaries of objects as well as at the
edges of the images, and they correspond to areas behind objects that cannot
be seen from the viewpoint of the source image [10]. How to patch disocclusions is
probably the most challenging problem related to DIBR since no information
related to disoccluded regions is available. It is interesting to note that when
disocclusion information is available, it has been proven in [59] that the virtual
view synthesis quality is improved drastically.
So far, no ideal solution has been found to fill the disocclusions, when
disocclusion information is not explicitly available. Moreover, finding a reliable
solution to this problem is complex because both spatial and temporal consistencies are critical in image quality assessment by human observers. For real-time
processing, disocclusion filling presents an extra challenge. In [60] and [61],
comparative studies of disocclusion filling methods are presented showing that the
hole-filling problem is still unsolved and is a main source of artifacts in the DIBR
process.


Fig. 4.12 The top row shows an original stereoscopic image pair. The bottom row shows the
stereoscopic image pair after modification with a floating window, which is visible as black bars
in the left and right borders of the images

One of the best and simplest techniques known so far for disocclusion filling
is the patented Exemplar-based image inpainting by A. Criminisi et al. as
presented in [62, 63]. It has been adapted later to DIBR [64] using the depth
information as a guide to maintain consistency along objects boundaries next to a
disoccluded region. In this case, only pixels that share the same depth value are
used to fill disocclusions. In [65] and [66], methods are proposed to improve the
spatial and temporal consistencies across video frames.
With respect to disocclusions that are located along the left and right edges of
an image boundary, the commonly used solution is to scale the image up in such
a way that the disoccluded part is removed from the visible image. Another
efficient way to mask them would be to add a floating window around the image.
It is usually only a few pixels wide along each side to be effective,
as is shown in Fig. 4.12. (N.B. the black bars that constitute the floating window are
shown much wider than they actually are, to make them more visible in the figure.)


Fig. 4.13 Floating window in front of the stereoscopic image

Aside from filling in the blank regions, a floating window is also helpful for
avoiding conflicting depth cues that can occur at the screen borders. For
instance, when an object with a negative parallax, located in front of the screen,
is in contact with the display border, an edge violation is produced (see Fig. 4.12
top row). The object is occluded in this case by the physical display border that
is located behind the virtual object (right arm of girl). The response of the
Human Visual System to this impossible situation is to push the object back
to the depth of the screen, resulting in a reduction of the perceived depth of the
object. A floating window resolves such a situation if it is placed in front of
the object that is in contact with the display border (see Fig. 4.12 bottom row).
The viewer will then have the impression that a window is floating in front
of the object and partially occluding it in a natural way. In Fig. 4.12, the portion
of the girl's arm that is not visible in the right image is covered by the black
floating window, effectively removing the discrepancy between left and right
images. The floating window does not need to be strictly vertical; it could lie
in a plane that is closer at the bottom than at the top. Figure 4.13 shows
schematically how the floating window should look once placed in the image.
A frame with black borders will occlude the closest part of the image, in this
case the girl, and prevent the border violation problem.

4.4.4 Multi-View Considerations


The synthesis algorithm described in the previous sections presents the steps for
generating one synthesised view from an original image such that together they form a
stereoscopic image pair. For MV image generation, the same view synthesis algorithm
can be used to create multiple new images, such as for auto-stereoscopic 3D displays,

by repeating those steps and with each synthesised view having an appropriately
different inter-camera baseline distance with respect to the original image.
Given that multiple images are generated in the MV case, the new images could
be generated with the original image positioned as either the first or the last image
in the range of views to be synthesised. However, this procedure is not recommended because the new image that is farthest from the original one would contain
larger areas of disoccluded regions than it would if the new images were created with
the original image positioned in the middle of the range of views. That is, the farther
the synthesised view is from the original image, the more likely rendering
artifacts, specifically from the
hole-filling process in disoccluded regions, would be noticeable.
This problem is unique to generating MV images versus stereoscopic images
from DIBR and must be taken into consideration. That is, researchers often point
out that a larger inter-camera baseline is generally required for MV than for
stereoscopic 3D displays. Multi-view is intended for providing motion parallax
information through self-motion and/or for multiple viewers standing side-by-side
when viewing the S3D scene offered by MV auto-stereoscopic displays [67].
A wider baseline is required to handle the wider viewing angle that has to be
covered, and the wider baseline between the synthesised view and the original
image will lead to larger disoccluded regions that have to be filled when generating
the synthesised view.
Cheng et al. [68] proposed an algorithm to address this specific issue of large
baseline for view synthesis from a depth image. The large disoccluded regions in
the color image are filled using a depth-guided, exemplar-based image inpainting
algorithm. An important aspect of the algorithm is that it combines the structural
strengths of the color gradient to preserve the image structure in the restored filled-in regions.

4.5 Conversion Quality and Evaluation


Conversion quality and evaluation are also important issues that are pertinent to the
success of 2D-to-3D video conversion. There are many factors that can affect the
quality of the conversion of 2D video or film material to S3D, such as spatial and
temporal characteristics of the original contents, and the actual workflow and technologies that are adopted for the conversion process. More global factors include
budget, time constraints, inter-personal communications and working relationships
among supervisors, vendors, and production staff, as well as whether the original 2D
assets were originally shot and preplanned for conversion to S3D [69]. With respect
to conversion quality related to the DIBR approach, there are three major contributing factors: (1) accuracy and quality of the depth maps, (2) preprocessing methods
and choice of parameters for the camera-viewer configuration required for
converting the 8-bit digitized range of depth information to the depth for the depicted
scene, and (3) the specific processes in the rendering step of horizontal pixel

displacements and hole filling. Factors (1) and (3) mainly contribute to the range
and magnitude of conversion artifacts, and factor (2) can contribute significantly to
viewers' visual discomfort and to perceived distortions in the global scene
(e.g., space curvature) and of objects in the scene (i.e., puppet-theater and cardboard
effects). These unintended changes in the images and their contents can lead to
reduced image and depth quality, as well as in other perceptual dimensions of
naturalness and immersion.

4.5.1 Conversion Artifacts


A key factor in determining the image quality of S3D video or film that has been
converted from 2D is the presence of conversion artifacts. The conversion artifacts
are highly dependent on the approach and method used. DIBR-based conversion of
2D material to S3D involves the process of rendering new images consisting of
one or more different camera viewpoints from an original 2D image and its
corresponding depth map. Hence, the types of artifacts that are potentially visible
in a DIBR-based conversion reflect this reliance on depth maps. That is, the
rendering quality is highly dependent on the quality of the depth information.
If the depth map is inaccurate, there could be errors in the relative depth
between rendered objects and/or the depth order between objects could be wrong.
For the former the errors are generally less noticeable because there is no actual
reference scene for the viewer to compare the unintended changes in relative depth
between objects. For the latter, the outcome will depend on the viewing conditions.
For naturalistic S3D image sequences, the visual system appears to be able to
handle wrong depth order of objects relatively well during viewing of a moving
scene since a normal depth scene is perceived [27]. In the same referenced
study, ratings of visual comfort were obtained from viewers and no negative
impact on visual comfort was found. However, for static images or during
freeze frame mode, inaccuracies in depth order are more readily detected and
become uncomfortable to view.
In the process of extracting depth information from a 2D image, the extraction
process can be pixel or block-based. In the latter case, the depth map would be
blocky. Related to this, if there is coarse quantization of the depth information
contained in a dense depth map and if there is no preprocessing, then there is a
strong potential of a misalignment between the contours of objects in the original
2D image and the contours (depth transitions) of the same objects in the depth
map. This mismatch will lead to ringing artifacts at the boundaries of objects in the
newly rendered images and these are referred to as depth ringing [70].
Given that areas around object boundaries are exposed when pixels are shifted
to create a new image in accordance with a change of camera viewpoint, there are
exposed regions in the new image that consist of holes or cracks which have
to be filled in. The methods for filling these holes can be as simple as interpolation between neighboring pixels or as computationally intensive as analyzing

the neighboring pixels that are part of the background and extend structures and/or
textures into the exposed regions [52]. In any case, there is no actual information
for properly patching these regions, and so inaccuracies in filling these regions
have a high probability of occurrence. When visible, the artifacts will appear as a
halo around the left or right borders of objects, depending on whether the left-eye or right-eye image is being rendered, respectively.
As suggested earlier, aside from rendering artifacts that are visible even when
viewed monocularly, there are depth distortion artifacts that can be perceived only
binocularly. The advantage of having a depth map for rendering an image as if
taken from a new camera viewpoint is also a feature that can lead to depth
distortion artifacts. There are two types of such perceptual artifacts. There is the
cardboard effect, in which objects appear to be flattened in the z-plane. There is
also the puppet-theater effect, in which objects are perceived to be much smaller
than in real life. In both cases, the higher level artifacts are created by an incongruent match of the depth reproduction magnification and the lateral reproduction magnification compared to the ratio observed in the real world [71].
Given that the depth information can be scaled to an arbitrary range in the rendering process, the manifestation of the cardboard effect and/or the puppet-theater
effect can occur.
Selection of where to locate objects in a depth scene for a given display-viewer
configuration can lead to perceptual effects that can diminish the depth quality.
In particular, objects that are cut off by the left or right edges of a display should be
located behind the screen plane; otherwise there would be an inconsistency that can
cause visual discomfort because objects that are blocked by the edge of the screen
in a real world cannot be in front of the screen. This is referred to as edge
violation [72]. Furthermore, poor choice of rendering parameter values, defined
by the viewing conditions (see Sect. 4.4.2), can lead to large disparity magnitudes
that are difficult to fuse binocularly by viewers [6, 73]. This will lead to visual
discomfort and, at the extreme, to headaches and dizziness.
In summary, errors contained in the depth map, poor choice of rendering
parameters, and improper processing of the depth information can translate into
visible artifacts that can occur around boundary regions of objects, visual
discomfort, and perceptual effects in terms of size and depth distortions.

4.5.2 Effects and Evaluation of Conversion Quality


When different methods are utilized for conversion of 2D-to-S3D material, it is
very natural to ask about, and make comparisons between, the qualities of the
different conversion methods. However, this is not an easy task because the quality
of a conversion is multidimensional in terms of image quality, depth quality,
sharpness, naturalness, sense of presence, and visual comfort. When comparing
the quality of conversions that are based on DIBR, the main concerns should reflect the
(1) accuracy and quality of the depth maps, (2) preprocessing methods and choice

of parameters for the camera-viewer configuration, and (3) the rendering quality.
Thus, it is the presence, magnitude, extent, and frequency of visible artifacts in the
monocular images from these processes (such as depth ringing and halos)
that determine the conversion quality. There are also conversion artifacts that
occur after binocular integration of the left-eye and right-eye inputs consisting of
space distortion artifacts such as curvature, puppet, and cardboard effects. These
are dependent on the choice of rendering parameters and conscious placement of
objects in the depth scene. These also contribute to the conversion quality. Thus,
because of the wide range of factors contributing to and the multidimensional
nature of perceptual tasks, the evaluation of conversion quality is difficult to
conduct with an objective method.
Evaluation of conversion quality is not straightforward because there is no
reference by which to compare against. That is, the original scene with all its depth
information in stereoscopic view is not available. The conversion quality is ultimately based on the absence of noticeable and/or annoying artifacts and the quality
of the depth experience. With DIBR, artifacts around objects give rise to annoying
artifacts, which degrade the quality of the images. The quality of the depth
experience can suffer if content creators generate depth maps that have noticeable
errors in them and if poor parameter values are chosen for the displayed depth.
"Noticeable" is very subjective and is dependent on how the images are viewed.
That is, the artifacts will be more visible if they are scrutinized under freeze frame
mode as opposed to being played back naturally. This is because the depth percept
can take time to build up [74].
Nevertheless, subjective tests involving viewers' ratings of image quality, depth
quality, sense of presence, naturalness and visual comfort provide a basis for
evaluating conversion quality. This can be done using a five-section continuous
quality rating scale [75]. The labels for the different sections could be changed
based on the perceptual dimension that is to be assessed. Examples are shown in
Fig. 4.14. Another possible measurement of conversion quality that does not
require a reference and that does not require the assessment of any of the
perceptual dimensions indicated so far is that of pseudo stereopsis. For a
standard stereoscopic image pair, when the left and right-eye designated images
are inadvertently reversed during display, viewers often report a strange experience
that they find hard to pin-point and describe. They would simply say: "Yes, there is
depth, but there is something wrong." This experience of depth, but at the same time
a feeling that something is wrong, coupled with a strange feeling in the
eye is the detection of errors in the scene. Thus, one could conduct an evaluation
on conversion quality by examining ease of detection of pseudo stereopsis. The
assumption being that it would be easier to detect if it was a real stereo image pair
that has been reversed than if the stereo image pair were converted. So, it is the
determination of the viewers' detection rate. If it is 100 % detection when the
converted image pair is reversed it would indicate that the conversion has been
very effective. If the detection rate is 50 % it means that the conversion was poorly
done. In the latter case, the conversion is such that the human visual system cannot
really tell whether the converted view is for the left or the right eye.


Fig. 4.14 Rating scales with different labels for assessing visual comfort (a), (b), and (c) and for
rating image or depth quality of a converted video or film sequence (d)

Finally, it should be noted that an evaluation of the conversion quality of video
sequences should also take into consideration the application of the converted material.
Mobile video, home-viewing, and cinematic experiences are the three major types of
applications. The requirements can be quite different depending on the application.
Viewers have come to expect high image quality, depth quality and visual comfort for
cinematic experiences. For the home-viewing environment the depth quality can usually
be reduced because of the viewing distance and display size restrictions, with visual
comfort being central. For mobile applications, the visual experience is generally not
expected to have such a high impact as with cinematic or home-viewing experiences.
However, it is still desirable because it does provide an enhanced sense of presence if
done properly [76]. Artifacts can negatively contribute to all these perceptual dimensions.

4.6 Discussions and Conclusions


An overview of the basic principles and the various methods of utilizing depth-image-based rendering for converting 2D video to stereoscopic 3D was presented
in this manuscript. The advantages and disadvantages of the different methods
were elaborated upon. Practical applications for the conversion from monoscopic
to stereoscopic and MV video were mentioned.


The overview indicates that the underlying principle for the conversion of 2D
program material to stereoscopic and MV 3D is simple and straightforward. On the
other hand, the strategies and methods that have been proposed and employed to
implement 2D-to-S3D conversion are quite varied. Despite the fact that progress
has been made, there is no one singular strategy that stands out from the rest as
being most effective. Although it seems easy for the human visual system to utilize
the depth information in 2D images, there is still much to be researched in terms of
how best to extract and utilize the pictorial and motion information that is
available in the 2D sources, for transformation into new stereoscopic views for
depth enhancement. In addition, more effective methods that do not require
intensive manual labor are required to fill in newly exposed areas in the
synthesised stereoscopic views. Beyond the study of different strategies and the
details and effectiveness of different methods of S3D conversion technologies, at a
higher level there are more general issues. Importantly, are 2D-to-3D video
conversion technologies here to stay?
Although S3D conversion technologies are useful in producing contents to
augment the trickle of stereoscopic material that is created in the initial stages of
3D-TV deployment, conversion technologies can be exploited for rejuvenating the
vast amount of existing 2D program material as a new source for revenue. For this
reason alone, 2D-to-3D video conversion is going to stay for quite some time
beyond the early stages of 3D-TV. Given that 2D-to-3D video conversion allows
for depth adjustments of video and film contents to be made that follow visual
comfort guidelines, it is highly likely that the 3D conversion process will become
ingrained as one of the tools or stages of the production process for prevention of
visual discomfort. This is true even for source video and film material that has
been originally captured with actual stereoscopic camera rigs. For example, a
stereoscopic scene might have been unwisely captured with a much larger baseline
than typically used, and it is too late to do a re-shoot. The only method to save the scene might be to create a 2D-to-3D converted version; otherwise, the scene would have to be deleted altogether (if horizontal shifting of the stereoscopic image pairs in opposite directions is not an acceptable alternative). As another
example, when a significant number of frames for only one eye are inadvertently
damaged, they must be restored or regenerated. A practical method of restoring the
damaged stereoscopic clip would be to make use of available 3D conversion
technologies. Lastly, DIBR conversion technologies are likely to stay as part of the
post-production tools for repurposing, as an example, movies that were originally created for a 10-m wide screen for display on a 60-inch television screen or a 7 × 4 cm mobile device. The same technologies can be artfully used to manipulate objects in depth to realize the creative intent of the video or film director for different perceptual effects of personal space, motion, and object size.
Other than the issue of whether 2D-to-3D video conversion is going to stay after
the S3D industry has matured, there has been an ongoing and intense debate as to
whether movies should be originally shot with stereoscopic camera rigs or whether
they should be shot in 2D with the intention of conversion afterwards to 3D as a
post-production process. The main argument for the latter is that the director and


post-production crews can have full control of the 3D effects without surprises of
either visual discomfort or unintended distracting perceptual effects. Another
major advantage is that the production equipment during image capture can be
reduced to that of standard video or film capture, thus providing savings in terms
of human resources and equipment costs. Nevertheless, there have been strong
advocates against 2D-to-3D video conversion of video and film material. This
largely stems from the poor conversions of hastily generated S3D movies that have
been rushed out to meet unrealistic deadlines. Perceptual artifacts, such as size
(e.g., puppet effect) and depth distortions (e.g., cardboard effect) can be rampant if
the 3D conversion is not done properly. As well, loss of depth details and even
distracting depth errors can turn off not only the ardent S3D movie buff, but also
the general public who might have been unwittingly foxed into expecting more
through aggressive advertisements by stakeholders in the 3D entertainment field.
In summary, 2D-to-3D video conversion can generate outstanding movies and
video program material, but the conversion quality is highly dependent on the
contents and the amount of resources (of both time and money) put into the
process. It is for this very same reason that the quality of conversion for real-time applications is not the same as that for off-line production.
Finally, as discussed under the various approaches and methods on 2D-to-3D
video conversion, there are pros and cons to each methodology used. For example,
the use of predefined scene models can be very useful for reducing computation
time and increasing accuracy when the contents are known in advance. Surrogate
depth maps with heavy filtering can significantly reduce both computation time and rendering complexity, but the conversion quality might not be suitable for demanding applications, such as cinematic requirements. Thus, future research requires a novel combination of the best of these different approaches. Importantly, there is a lack of
experimental data on how the human visual system combines the output of the
hypothesized depth processing modules involving depth cues, such as texture,
motion, linear perspective, and familiar size, in arriving at a final and stable
percept of the visual world around us. The depth cues that are more stable and relevant for most situations need to be identified and selected. Recent
studies have also started investigating the efficacy of utilizing higher level
modules, such as attention, in deriving more reliable and more accurate estimation of depth for conversion of monoscopic to stereoscopic and MV video.
However, the solutions do not appear to be achievable in the foreseeable future.
Even the relatively simple use of motion information for depth perception, which humans are so good at, is still far from reach in computational-algorithmic form.
In conclusion, future research should not only focus on algorithm implementation
of various conversion methodologies, but also try to better understand human
depth perception to find clues that will enable a much faster, more reliable, and
more stable 2D-to-3D video conversion methodology.
Acknowledgment We would like to express our sincere thanks to Mr. Robert Klepko for
constructive suggestions during the preparation of this manuscript. Thanks are also due to NHK
for providing the Balloons, Tulips, and Redleaf sequences.


References
1. Advanced Television Systems Committee (ATSC) (2011) Final report of the ATSC planning team on 3D-TV. PT1-049r1. Advanced Television Systems Committee (ATSC), Washington DC, USA
2. International Organisation for Standardisation (ISO) (2009) Vision on 3D video. ISO/IEC
JTC1/SC29/WG11 N10357, International Organisation for Standardisation (ISO), Lausanne,
Switzerland
3. Society of Motion Picture and Television Engineers (SMPTE) (2009) Report of SMPTE task
force on 3D to the home. TF3D, Society of Motion Picture and Television Engineers
4. Smolic A, Mueller K, Merkle P, Vetro A (2009) Development of a new MPEG standard for
advanced 3D video applications. TR2009-068, Mitsubishi Electric Research Laboratories,
Cambridge, MA, USA
5. Valentini VI (2011) Legend3D sets the transformers 2D-3D conversion record straight.
In: indiefilm3D. Available at: http://indiefilm3d.com/node/518
6. Tam WJ, Speranza F, Yano S, Ono K, Shimono H (2011) Stereoscopic 3D-TV: visual comfort. IEEE Trans Broadcast 57(2):335–346 part II
7. Kauff P, Atzpadin N, Fehn C, Müller M, Schreer O, Smolic A, Tanger R (2007) Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability. Signal Processing: Image Communication (Special issue on three-dimensional video and television) 22(2):217–234
8. Zhang L, Vázquez C, Knorr S (2011) 3D-TV content creation: automatic 2D-to-3D video conversion. IEEE Trans Broadcast 57(2):372–383
9. Tam WJ, Zhang L (2006) 3D-TV content generation: 2D-to-3D conversion. In IEEE
International Conference on Multimedia and Expo, Toronto, Canada
10. Fehn C (2003) A 3D-TV approach using depth-image-based rendering (DIBR). In: 3rd conference on visualization, imaging and image processing, Benalmadena, Spain
11. Ostnes R, Abbott V, Lavender S (2004) Visualisation techniques: an overview - Part 1. Hydrogr J 113:47
12. Shimono K, Tam WJ, Nakamizo S (1999) Wheatstone-panum limiting case: occlusion, camouflage, and vergence-induced disparity cues. Percept Psychophys 61(3):445–455
13. Ens J, Lawrence P (1993) An investigation of methods for determining depth from focus. IEEE Trans Pattern Anal Mach Intell 15(2):97–107
14. Battiato S, Curti S, La Cascia M, Tortora M, Scordato E (2004) Depth map generation by image classification. Proc SPIE 5302:95–104
15. Hudson W (1967) The study of the problem of pictorial perception among unacculturated groups. Int J Psychol 2(2):89–107
16. Knorr S, Kunter M, Sikora T (2008) Stereoscopic 3D from 2D video with super-resolution capabilities. Signal Process: Image Commun 23(9):665–676
17. Mancini A (1998) Disparity estimation and intermediate view reconstruction for novel
applications in stereoscopic video, McGill University, Canada
18. Scharstein D, Szeliski R (2003) High-accuracy stereo depth maps using structured light. In: IEEE Computer society conference on computer vision and pattern recognition (CVPR 2003), vol 1. Madison, WI, USA, pp 195–202
19. Yamada K, Suehiro K, Nakamura H (2005) Pseudo 3D image generation with simple depth models. In: International conference in consumer electronics, Las Vegas, NV, pp 277–278
20. Battiato S, Capra A, Curti S, La Cascia M (2004) 3D Stereoscopic image pairs by depth-map generation. In: 3D Data processing, visualization and transmission, pp 124–131
21. Nedovic V, Smeulders AWM, Redert A, Geusebroek JM (2007) Depth information by stage classification. In: International conference on computer vision
22. Yamada K, Suzuki Y (2009) Real-time 2D-to-3D conversion at full HD 1080P resolution. In: 13th IEEE International symposium on consumer electronics, Las Vegas, NV, pp 103–106


23. Huang X, Wang L, Huang J, Li D, Zhang M (2009) A depth extraction method based on
motion and geometry for 2D-to-3D conversion. In: Third international symposium on
intelligent information technology application
24. Jung Y-J, Baik A, Park D (2009) A novel 2D-to-3D conversion technique based on relative height-depth-cue. In: SPIE Conference on stereoscopic displays and applications XX, San Jose, CA, vol 7237, p 72371U
25. Tam WJ, Speranza F, Zhang L (2009) Depth map generation for 3-D TV: importance of edge and boundary information. In: Javidi B, Okano F, Son J-Y (eds) Three-dimensional imaging, visualization and display. Springer, New York, pp 153–181
26. Tam WJ, Yee AS, Ferreira J, Tariq S, Speranza F (2005) Stereoscopic image rendering based on depth maps created from blur and edge information. In: Proceedings of the stereoscopic displays and applications, vol 5664, pp 104–115
27. Tam WJ, Vázquez C, Speranza F (2009) 3D-TV: a novel method for generating surrogate depth maps using colour information. In: SPIE Conference stereoscopic displays and applications XX, San José, USA, vol 7237, p 72371A
28. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV. IEEE Trans Broadcast 51:191–199
29. Ernst FE (2003) 2D-to-3D video conversion based on time-consistent segmentation. In:
Proceedings of the immersive communication and broadcast systems workshop, Berlin,
Germany
30. Chang Y-L, Fang C-Y, Ding L-F, Chen S-Y, Chen L-G (2007) Depth map generation for 2D-to-3D conversion by short-term motion assisted color segmentation. In: IEEE International conference on multimedia and expo, pp 1958–1961
31. Vázquez C, Tam WJ (May 2010) CRC-CSDM: 2D to 3D conversion using colour-based surrogate depth maps. In: International conference on 3D systems and applications (3DSA 2010), Tokyo, Japan
32. Kim J, Baik A, Jung YJ, Park D (2010) 2D-to-3D conversion by using visual attention analysis. In: Proceedings SPIE, vol 7524, p 752412
33. Nothdurft H (2000) Salience from feature contrast: additivity across dimensions. Vis Res 40:1183–1201
34. Rogers B-J, Graham M-E (1979) Motion parallax as an independent cue for depth perception. Perception 8:125–134
35. Ferris S-H (1972) Motion parallax and absolute distance. J Exp Psychol 95(2):258–263
36. Matsumoto Y, Terasaki H, Sugimoto K, Arakawa T (1997) Conversion system of monocular image sequence to stereo using motion parallax. In: SPIE Conference in stereoscopic displays and virtual reality systems IV, San Jose, CA, vol 3012, pp 108–112
37. Zhang L, Lawrence B, Wang D, Vincent A (2005) Comparison study of feature matching and block matching for automatic 2D to 3D video conversion. In: 2nd IEEE European conference on visual media production, London, UK, pp 122–129
38. Hartley R, Zisserman A (2000) Multiple view geometry in computer vision. Cambridge University Press, Cambridge, UK
39. Choi S, Woods J (1999) Motion-compensated 3-D subband coding of video. IEEE Trans Image Process 8(2):155–167
40. Kim MB, Song MS (1998) Stereoscopic conversion of monoscopic video by the transformation of vertical to horizontal disparity. Proc SPIE 3295:65–75
41. Ideses I, Yaroslavsky LP, Fishbain B (2007) Real-time 2D to 3D video conversion. J Real-Time Image Process 2(1):37
42. Pourazad M-T, Nasiopoulos P, Ward R-K (2009) An H.264-based scheme for 2D-to-3D video conversion. IEEE Trans Consum Electron 55(2):742–748
43. Pourazad M-T, Nasiopoulos P, Ward R-K (2010) Generating the depth map from the motion information of H.264-encoded 2D video sequence. EURASIP J Image Video Process
44. Kim D, Min D, Sohn K (2008) A stereoscopic video generation method using stereoscopic display characterization and motion analysis. IEEE Trans Broadcast 54(2):188–197


45. Po L-M, Xu X, Zhu Y, Zhang S, Cheung K-W, Ting C-W (2010) Automatic 2D-to-3D video conversion technique based on depth-from-motion and color segmentation. In: IEEE International conference on signal processing, Hong Kong, China, pp 1000–1003
46. Xu F, Er G, Xie X, Dai Q (2008) 2D-to-3D conversion based on motion and color mergence.
In: 3DTV Conference, Istanbul, Turkey
47. Zhang G, Jia J, Wong TT, Bao H (2009) Consistent depth maps recovery from a video sequence. IEEE Trans Pattern Anal Mach Intell 31(6):974–988
48. Chang YL, Chang JY, Tsai YM, Lee CL, Chen LG (2008) Priority depth fusion for 2D-to-3D
conversion systems. In: SPIE Conference on three-dimensional image capture and
applications, San Jose, CA, vol 6805, p 680513
49. Cheng C-C, Li C-T, Tsai Y-M, Chen L-G (2009) Hybrid depth cueing for 2D-to-3D conversion system. In: SPIE Conference on stereoscopic displays and applications XX, San Jose, CA, USA, vol 7237, p 723721
50. Chen Y, Zhang R, Karczewicz M (2011) Low-complexity 2D-to-3D video conversion. In:
SPIE Conference on stereoscopic displays and applications XXII, vol 7863, p 78631I
51. Tam WJ, Alain G, Zhang L, Martin T, Renaud R (2004) Smoothing depth maps for improved
stereoscopic image quality. In: Three-dimensional TV, video and display III (ITCOM04),
Philadelphia, PA, vol 5599, p 162
52. Vázquez C, Tam WJ, Speranza F (2006) Stereoscopic imaging: filling disoccluded areas in depth image-based rendering. In: SPIE Conference on three-dimensional TV, video and display V, Boston, MA, vol 6392, p 63920D
53. Shimono K, Tam WJ, Speranza F, Vázquez C, Renaud R (2010) Removing the cardboard effect in stereoscopic images using smoothed depth maps. In: Stereoscopic displays and applications XXI, San José, CA, vol 7524, p 75241C
54. Mori Y, Fukushima N, Yendo T, Fujii T, Tanimoto M (2009) View generation with 3D warping using depth information for FTV. Signal Process: Image Commun 24(1–2):65–72
55. Chen W-Y, Chang Y-L, Lin S-F, Ding L-F, Chen L-G (2005) Efficient depth image based rendering with edge dependent filter and interpolation. In: IEEE International conference on multimedia and expo, Amsterdam, The Netherlands
56. International Organization for Standardization / International Electrotechnical Commission (2007) Representation of auxiliary video and supplemental information. ISO/IEC FDIS 23002-3:2007(E), International organization for standardization / International electrotechnical commission, Lausanne
57. Daly SJ, Held RT, Hoffman DM (2011) Perceptual issues in stereoscopic signal processing. IEEE Trans Broadcast 57(2):347–361
58. Lang M, Hornung A, Wang O, Poulakos S, Smolic A, Gross M (2010) Nonlinear disparity
mapping for stereoscopic 3D. In: ACM SIGGRAPH, Los Angeles, CA
59. Vázquez C, Tam WJ (2008) 3D-TV: coding of disocclusions for 2D+Depth representation of multi-view images. In: Tenth international conference on computer graphics and imaging (CGIM), Innsbruck, Austria
60. Tauber Z, Li Z-N, Drew M-S (2007) Review and preview: disocclusion by inpainting for image-based rendering. IEEE Trans Syst Man Cybernetics Part C: Appl Rev 37(4):527–540
61. Azzari L, Battisti F, Gotchev A (2010) Comparative analysis of occlusion-filling techniques
in depth image-based rendering for 3D videos. In: 3rd Workshop on mobile video delivery,
Firenze, Italy
62. Criminisi A, Perez P, Toyama K (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Trans Image Process 13:1200–1212
63. Criminisi A, Perez P, Toyama K, Gangnet M, Blake A (2006) Image region filling by exemplar-based inpainting. Patent No: 6,987,520, United States
64. Daribo I, Pesquet-Popescu B (2010) Depth-aided image inpainting for novel view synthesis. In: International Workshop on Multimedia Signal Processing, Saint-Malo, France, pp 167–170
65. Gunnewiek R-K, Berrety R-PM, Barenbrug B, Magalhaes J-P (2009) Coherent spatial and
temporal occlusion generation. In: Proceedings SPIE, vol 7237, p 723713


66. Cheng C-M, Lin S-J, Lai S-H (2011) Spatio-temporal consistent novel view synthesis algorithm from video-plus-depth sequences for autostereoscopic displays. IEEE Trans Broadcast 57(2):523–532
67. Holliman NS, Dodgson NA, Favalora GE, Pockett L (2011) Three-dimensional displays: a review and applications analysis. IEEE Trans Broadcast 57(2):362–371
68. Cheng CM, Lin SJ, Lai SH, Yang JC (2003) Improved novel view synthesis from depth image with large baseline. In: International conference on pattern recognition, Tampa, FL
69. Seymour M (2011) Art of stereo conversion: 2D-to-3D. In: fxguide. Available at: http://
www.fxguide.com/featured/art-of-stereo-conversion-2d-to-3d/
70. Boev A, Hollosi D, Gotchev A (2008) Classification of stereoscopic artefacts. Mobile3DTV (Project No. 216503) http://sp.cs.tut.fi/mobile3dtv/results/tech/D5.1_Mobile3DTV_v1.0.pdf. Accessed 22 Jun 2011
71. Yamanoue H, Okui M, Okano F (2006) Geometrical analysis of puppet-theater and cardboard effects in stereoscopic HDTV images. IEEE Trans Circuits Syst Video Technol 16(6):744–752
72. Mendiburu B (2009) Fundamentals of stereoscopic imaging. In: Digital cinema summit, NAB
Las Vegas. Available at: http://www.3dtv.fr/NAB09_3D-Tutorial_BernardMendiburu.pdf
73. Yeh Y-Y, Silverstein LD (1990) Limits of fusion and depth judgment in stereoscopic color displays. Hum Factors: J Hum Factors Ergon Soc 32:45–60
74. Tam WJ, Stelmach LB (1998) Display duration and stereoscopic depth discrimination. Can J Exp Psychol 52(1):56–61
75. International Telecommunication Union (2010) Methodology for the subjective assessment
of the quality of television pictures, ITU-R
76. Tam WJ, Vincent A, Renaud R, Blanchfield P, Martin T (2003) Comparison of stereoscopic and non-stereoscopic video images for visual telephone systems. In: Stereoscopic displays and virtual reality systems X, San José, CA, vol 5006, pp 304–312

Chapter 5
Virtual View Synthesis and Artifact Reduction Techniques

Yin Zhao, Ce Zhu and Lu Yu

Abstract: With texture and depth data, virtual views are synthesized to produce a
disparity-adjustable stereo pair for stereoscopic displays, or to generate multiple
views required by autostereoscopic displays. View synthesis typically consists of
three steps: 3D warping, view merging, and hole filling. However, simple synthesis
algorithms may yield some visual artifacts, e.g., texture flickering, boundary artifact, and smearing effect, and many efforts have been made to suppress these synthesis artifacts. Some employ spatial/temporal filters to smooth depth maps, which
mitigate depth errors and enhance temporal consistency; some use a cross-check
technique to detect and prevent possible synthesis distortions; some focus on
removing boundary artifacts and others attempt to create natural texture patches for
the disoccluded regions. In addition to rendering quality, real-time implementation
is necessary for view synthesis. So far, the basic three-step rendering process has been realized in real time through GPU programming and dedicated hardware designs.

Keywords 3D warping · Boundary artifact · Hole filling · Quality enhancement technique · Real-time implementation · SMART · Splatting · Spatial filtering · Smearing effect · Synthesis artifact · Texture flickering · Temporal consistency · Temporal filtering · View merging · View synthesis · Virtual view
Y. Zhao · L. Yu
Department of Information Science and Electronic Engineering, Zhejiang University,
310027 Hangzhou, China
e-mail: zhaoyin@zju.edu.cn
L. Yu
e-mail: yul@zju.edu.cn
C. Zhu
School of Electronic Engineering, University of Electronic Science
and Technology of China, 611731 Chengdu, People's Republic of China
e-mail: eczhu@uestc.edu.cn



5.1 Introduction
View synthesis is an important component of 3D content generation, which is
employed to create virtual views as if the views are captured by virtual cameras at
different positions from several real cameras, as shown in Fig. 5.1a. A camera
view contains both color and depth information of the captured scene. Based on
the depth information, pixels in a camera view can be projected into a novel
viewpoint, and a virtual view is synthesized. Then, the generated virtual views as
well as the captured camera views will be presented on stereoscopic or multi-view
autostereoscopic displays to visualize 3D effect.
View synthesis contributes to a more natural 3D viewing experience with stereoscopic displays. New stereo pairs with flexible baseline distances can be formed with the synthesized views, which enables changes of the disparity range (corresponding to the intensity of perceived depth) of the displayed stereo videos. Besides, view synthesis can provide the series of views required by multi-view autostereoscopic displays, which is much more efficient than transmitting all the required views.
View synthesis employs a Depth-Image-Based Rendering (DIBR) technique [2]
which utilizes both texture and depth information to create novel views. Currently,
there are two prevalent texture plus depth data formats, Multi-view Video plus Depth
(MVD) and Layered Depth Video (LDV). MVD includes multiple views of texture
videos and their corresponding depth maps that record the distance of each pixel to
the camera. LDV contains multiple layers of texture plus depth data, as introduced in
Chap. 1. Accordingly, view synthesis procedures for the two types of 3D representations differ slightly. This chapter only covers the MVD-based rendering, and the
view synthesis procedure with LDV data will be elaborated in Chap. 7.
Given M (M ≥ 1) input camera views (also called reference views), a virtual
view can be synthesized through the following three major steps: (1) project pixels
in a reference view to a target virtual view, which is termed as 3D warping; (2)
merge pixels projected to the same position in the virtual view from different
reference views (if M ≥ 2), called view merging; and (3) make up the remaining
holes (i.e., positions without any projected pixel) in the virtual view by creating
texture patterns that visually match the neighborhood, known as hole filling. More
details of view synthesis algorithms are provided in Sect. 5.2.
However, the basic DIBR scheme may not guarantee a synthesized view of perfect quality, especially when errors are present in the depth maps. Depth errors cause the
associated pixels to be warped to wrong positions in the virtual view, yielding geometric
distortions. Most noticeable geometric distortions appear at object boundaries due to
error-prone depth data along these areas, shown as broken edges and spotted background.
Besides, the appearance and disappearance of a synthesis artifact over time also evokes texture flickering in the virtual view, which greatly degrades visual quality. Moreover,
simple hole filling algorithms using interpolation with the surrounding pixels may fail to
recover the missing texture information in holes, especially at highly textured regions.
To alleviate those rendering artifacts, many algorithms have been proposed to enhance
different stages in the rendering process, which will be reviewed in Sect. 5.3.


Fig. 5.1 a Virtual view generation with view synthesis. b Illustration of the framework of basic view synthesis using two input camera views A and B to synthesize a target virtual view C

View synthesis typically operates at the user end of the 3D-TV system
(as mentioned in Chap. 1), which requires the virtual views to be synthesized in
real time. 3D-TV terminals include TV sets, computers, mobile phones, and so on.
On the computer platform, the GPU has been used to assist the CPU in carrying out parallel processing for view rendering; for devices without flexible computational capability, e.g., a TV set (or set-top box), specific hardware accelerators have been designed for real-time view synthesis. Detailed information is given in Sect. 5.4.

5.2 Basic View Synthesis


Based on the DIBR technique, view synthesis employs 3D warping, view merging,
and hole filling to create a realistic virtual view. The framework of basic view
synthesis is illustrated in Fig. 5.1b, and is to be elaborated in this section.

5.2.1 3D Warping
Based on input depth values and predetermined camera models (which are
obtained in the multi-view video acquisition process as mentioned in Chap. 2),
3D warping maps pixels in a reference camera view to a target virtual view [13].
Assuming that 3D surfaces in the captured scene exhibit Lambertian reflection, the


Fig. 5.2 Illustration of a 3D point projected into the reference view and virtual view

virtual-view positions will possess the same color values as their correspondences in the reference view. For a point P(x, y, z) in the 3D space, it is projected into both the reference view and the virtual view, and we denote the pair of projection positions in the image planes of the two cameras as p_r = (u_r, v_r, 1)^T and p_v = (u_v, v_v, 1)^T in homogeneous notation, respectively, as illustrated in Fig. 5.2. Thus, we have two perspective projection equations that warp the 3D point in the world coordinate system into the two camera systems,

z_r p_r = A_r (R_r P + t_r)    (5.1)

z_v p_v = A_v (R_v P + t_v)    (5.2)

where R_r (3 × 3 orthogonal matrix) and t_r (3 × 1 vector) denote the rotation matrix and translation vector that transform pixels in the world coordinate to the reference camera coordinate, respectively; z_r is the depth value of the pixel indicated by the input depth map; and A_r (3 × 3 upper triangular matrix) specifies the intrinsic parameters of the reference camera,

A_r = [ f_x  0  o_x ;  0  f_y  o_y ;  0  0  1 ]    (5.3)

where f_x and f_y are the focal lengths in the horizontal and vertical directions, respectively, and (o_x, o_y) denotes the position difference between the principal point (i.e., the intersection point of the optical axis and the image plane) and the origin of the image plane (usually defined at the upper-left corner of the image), called the principal point offset [3], as shown in Fig. 5.2. Similar notations are applied for R_v, t_v, z_v, and A_v.

3D image warping aims to find in the virtual view the corresponding position of each pixel from the reference view, which can be separated into two steps. First, each reference-view pixel is reversely projected into the 3D world, deriving its position in the world coordinate system by solving Eq. (5.1),

P = R_r^{-1} (z_r A_r^{-1} p_r - t_r).    (5.4)


Then, we project the 3D point into the target virtual view, and obtain the pixel position in the virtual camera coordinate system, which is equivalent to substituting Eq. (5.4) into Eq. (5.2). Accordingly, we have

p_v = (1 / z_v) A_v [ R_v R_r^{-1} (z_r A_r^{-1} p_r - t_r) + t_v ].    (5.5)

It is clear that the projection becomes incorrect, once the depth value or the
camera parameters are inaccurate.
There is an equivalent way for 3D warping, which uses multiple homographies. Two images of the same plane in space are related through a homography (also known as a projective transformation). A 3D scene can be sliced into multiple planes with the same distance (i.e., depth value) to the reference camera. For 3D points in such a plane (called a depth plane), their projected points in the reference view and the virtual view can be associated via a homography,

p_v = A_v H_{vr,z} A_r^{-1} p_r    (5.6)

where H_{vr,z} (a non-singular 3 × 3 matrix) is called a homography matrix (for a depth plane of distance z). Compared with the scheme of twice projection, the homography in Eq. (5.6) is the composition of a pair of perspective projections of a plane, without introducing the world coordinate. Since there are 256 possible depth planes for an 8-bit depth map, up to 256 homography matrices are needed to transform all reference-view pixels in different depth planes to the virtual view.
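To make the homography formulation concrete, the following Python sketch (an illustrative derivation under the pinhole model of Eqs. (5.1)-(5.6), not code from any particular renderer; the function and variable names are assumptions) builds the combined 3 × 3 mapping A_v H_{vr,z} A_r^{-1} for one depth plane and applies it to a reference-view pixel.

import numpy as np

def depth_plane_mapping(z, A_r, R_r, t_r, A_v, R_v, t_v):
    # Combined matrix A_v * H_{vr,z} * A_r^{-1} of Eq. (5.6) for a depth plane at distance z.
    # Sketch of the derivation: with m_r = A_r^{-1} p_r (third component 1), Eq. (5.5) gives
    # p_v ~ A_v (z * R_vr * m_r + (t_v - R_vr * t_r)) with R_vr = R_v * R_r^{-1},
    # which is a single 3x3 homography acting on p_r.
    R_vr = R_v @ np.linalg.inv(R_r)
    b = (t_v - R_vr @ t_r).reshape(3, 1)
    H_vr_z = z * R_vr + b @ np.array([[0.0, 0.0, 1.0]])
    return A_v @ H_vr_z @ np.linalg.inv(A_r)

def warp_pixel(u, v, M):
    # Apply a precomputed 3x3 mapping M to pixel (u, v) of the reference view.
    p = M @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]   # non-homogeneous coordinates in the virtual view

With an 8-bit depth map, such a matrix can be precomputed for each of the 256 depth levels (given the mapping from quantized level to metric distance z), so that warping reduces to one table lookup and one matrix-vector product per pixel.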
To simplify the coordinate systems in Eq. (5.5), it is common to attach the world coordinate system to the reference camera system; thus, R_r = I_{3×3} and t_r = 0_{3×1}. When generating a stereo pair or parallel views for the autostereoscopic display, the virtual views are usually set up in the 1D parallel arrangement. In this setting, each virtual view is placed on the horizontal axis of the reference view, with its optical axis parallel to that of the reference view, i.e., t_v = (t_x, 0, 0)^T and R_v = R_r. The horizontal translation t_x is the horizontal baseline between the reference and virtual cameras, and t_x < 0 (or t_x > 0) when the virtual view is at the right (or left) side of the reference view. Besides, the reference and virtual views share the same intrinsic camera model, i.e., A_r = A_v. In this case, the 3D warping equation turns into a simple form as

p_v = (z_r / z_v) p_r + (1 / z_v) A_v t_v,    (5.7)

that is,

z_v [u_v, v_v, 1]^T = z_r [u_r, v_r, 1]^T + [ f_x  0  o_x ;  0  f_y  o_y ;  0  0  1 ] [t_x, 0, 0]^T.    (5.8)


Accordingly, we have z_v = z_r and v_v = v_r (i.e., the vertical disparity is always zero), and the horizontal disparity can be obtained from the real depth value by

d_x = u_v - u_r = f_x t_x / z_r.    (5.9)

With the 3D warping (also known as forward warping), each pixel in the
original view is projected to a floating point coordinate in the virtual view. Then,
this point is commonly rounded to the nearest position of an integer or a subpixel
sample raster (if subpixel mapping precision is available) [35]. If several pixels
are mapped to the same position on the raster, which implies occlusion, the pixel
that is closest to the camera will occlude the others and be selected for this
position, which is known as the z-buffer method [35]. After the forward warping,
most pixels (typically over 90 % in a narrow baseline case) in a warped view can
be determined, as shown in Fig. 5.3c and d, and the remaining positions on the
image grid without corresponding pixels from the reference view are called holes.
Holes are generated mostly from disocclusion or non-overlap visual field of the
cameras, either of which means some regions in the virtual view are not visible in
the reference view. Thus, the warped reference view lacks the information of the
newly exposed areas. In addition, limited by the precision of pixel position
rounding in forward warping or insufficient sampling rate of the reference views, some individual pixels may be left blank, causing typically one-pixel-wide cracks. Moreover, depth errors cause associated pixels to deviate from their target positions, also leaving those positions as holes. Holes in warped views are to be eliminated by
view merging and hole filling techniques.
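As a minimal illustration of forward warping in the 1D parallel setting, the Python sketch below shifts each reference-view pixel by the disparity of Eq. (5.9), rounds to the nearest column, and resolves many-to-one mappings with a z-buffer. The array layout, the function name, and the focal_length/baseline parameters are assumptions made for this example; splatting and subpixel precision (Sect. 5.3.5) are omitted, and strictly positive depth values are assumed.

import numpy as np

def forward_warp_1d(texture, depth, focal_length, baseline):
    # texture: (H, W, 3) color image of the reference view
    # depth:   (H, W) per-pixel depth z_r, in the same units as `baseline`
    # Returns the warped color image and a boolean hole mask.
    h, w = depth.shape
    warped = np.zeros_like(texture)
    z_buffer = np.full((h, w), np.inf)             # closest depth seen so far at each target
    holes = np.ones((h, w), dtype=bool)            # positions that receive no projection
    disparity = focal_length * baseline / depth    # Eq. (5.9): d_x = f_x * t_x / z_r

    for v in range(h):
        for u in range(w):
            u_v = int(round(u + disparity[v, u]))  # rounded target column
            if 0 <= u_v < w and depth[v, u] < z_buffer[v, u_v]:
                # z-buffer: the pixel closest to the camera occludes the others
                z_buffer[v, u_v] = depth[v, u]
                warped[v, u_v] = texture[v, u]
                holes[v, u_v] = False
    return warped, holes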

5.2.2 View Merging


In typical 3D-TV scenarios, virtual cameras are placed between two reference cameras, which is known as view interpolation. In this case, holes (of any type) in a warped
view can often be complemented by the corresponding non-hole region in the other
warped view. After overlapping all the warped views and merging them into one
image, the holes are reduced significantly, as shown in Fig. 5.3e. However, in the
view extrapolation case, the holes induced by disocclusion or non-overlap visual
field cannot be mitigated by compositing warped views.
View merging algorithms are mainly based on one of or a combination of the
three strategies below:
1. Blend available pixels from two warped views with a linear weighting function [46],

V_i = ( |t_{x,LV}| R_i + |t_{x,RV}| L_i ) / ( |t_{x,LV}| + |t_{x,RV}| )    (5.10)


Fig. 5.3 View synthesis with two-view MVD data. a Texture image of the left reference view. b Associated depth map of the left reference view (a higher luminance value means that the object is closer to the camera). c Warped left view (with holes induced by disocclusion and non-overlap visual field, marked in white). d Warped right view. e Blended view with remaining holes. f The final synthesized view after hole filling

where V_i, R_i, and L_i denote the ith pixel in the virtual view, right reference view, and left reference view, respectively; |t_{x,LV}| is the baseline distance between the left and virtual views, and |t_{x,RV}| is that between the right and virtual views.


As a result, the baseline-based blending scheme gives more weight to the


reference view closer to the virtual view. In addition, artifacts in the two warped
views may be introduced into the merged view, although the distortion intensities
are generally decreased with the weighted blending.
2. Take one warped view as a dominant view, and use pixels from the other
warped view to fill the holes in the dominant view [7]. Compared with the
blending scheme, this dominance-complement strategy may provide a higher
quality synthesized view if the dominant view has fewer artifacts than the
complementary view. Besides, the virtual view will also have higher contrast
than that by blending, since blending the two warped views may introduce
blurring effects when texture patterns from the two warped views are not well
aligned.
3. Select the closest pixel based on the z-buffer method [3]. This strategy works well
with perfect depth maps. However, it is prone to increase flickering artifacts when
the depth data is temporally inconsistent. The varying depth values drive the
rendering engine to select pixels from one view at one time instant and choose
those from the other view at another time instant. Since pixels from two views
may slightly differ in color due to inter-view color difference, the alternating
appearance of two-view pixels, due to the depth-based pixel competition, brings
about temporal texture inconsistency in the synthesized view.
In view merging, when a position receives only one pixel from all the warped views, the rendering engine has no option but to select the only candidate.
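A minimal sketch of baseline-weighted blending (strategy 1, Eq. (5.10)) combined with the complement of positions that receive only one candidate is given below; the hole-mask convention and the names follow the warping sketch above and are assumptions rather than part of any standard tool.

import numpy as np

def merge_views(warped_left, holes_left, warped_right, holes_right,
                dist_left_virtual, dist_right_virtual):
    # dist_left_virtual / dist_right_virtual: baseline distances |t_x,LV| and |t_x,RV|.
    total = dist_left_virtual + dist_right_virtual
    w_right = dist_left_virtual / total       # weight of R_i in Eq. (5.10)
    w_left = dist_right_virtual / total       # weight of L_i (the closer view weighs more)

    merged = np.zeros_like(warped_left, dtype=np.float64)
    both = (~holes_left) & (~holes_right)
    only_left = (~holes_left) & holes_right
    only_right = holes_left & (~holes_right)

    merged[both] = w_left * warped_left[both] + w_right * warped_right[both]
    merged[only_left] = warped_left[only_left]        # single candidate: no choice
    merged[only_right] = warped_right[only_right]

    remaining_holes = holes_left & holes_right        # left for the hole filling step
    return merged.astype(warped_left.dtype), remaining_holes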

5.2.3 Hole Filling


After view merging, the remaining small holes in the virtual view are handled with
hole filling algorithms which are generally based on linear interpolation using
neighboring pixels. Afterwards all pixels in the virtual view are determined, as
shown in Fig. 5.3f. However, for view synthesis with single view input, view
merging is not available, and hole filling algorithms have to make up all the large
holes in the warped image (e.g., Fig. 5.3c). In this case, a simple texture interpolation tends to be insufficient to complete natural texture patterns for the missing
areas, especially for holes at highly textured background. Thus, more sophisticated
inpainting techniques [8–10] are introduced to solve this problem.
In addition, holes by disocclusion always belong to the background instead of
the foreground. Simply averaging the foreground and background will blur object
boundaries, hence compromising the synthesis quality. Therefore, the foreground texture at one side of a hole should be used less, or not at all, in simple linear interpolation-based hole filling. Based on this observation, some directional hole filling
methods [6, 11] first detect the foreground and background areas around holes and
then fill the holes by extending the background texture, which often produces more
realistic boundary regions.
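The toy sketch below illustrates the background-directed principle in its simplest form: for each hole pixel the nearest valid neighbors on the same row are located, and the one with the larger depth (i.e., the background side) is extended into the hole. It is only an illustration of the idea, not the method of [6] or [11].

import numpy as np

def fill_holes_background_directed(image, depth, holes):
    # image/depth: merged virtual view and its depth map; holes: boolean hole mask.
    out = image.copy()
    h, w = holes.shape
    for v in range(h):
        for u in np.where(holes[v])[0]:
            left = u - 1
            while left >= 0 and holes[v, left]:
                left -= 1
            right = u + 1
            while right < w and holes[v, right]:
                right += 1
            candidates = [c for c in (left, right) if 0 <= c < w]
            if not candidates:
                continue       # an entirely empty row; leave it to a stronger inpainter
            # prefer the neighbor with the larger depth, i.e., the background side
            src = max(candidates, key=lambda c: depth[v, c])
            out[v, u] = image[v, src]
    return out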


5.3 Quality Enhancement Techniques


The standard three-step view synthesis is prone to yield synthesis artifacts in a virtual
view (as shown in Fig. 5.4), which demands enhancement techniques to eliminate
the visual distortions. Synthesis artifacts arise from two sources: (1) depth errors and
(2) limitations of the view synthesis mechanism discussed above. Depth errors
beyond certain thresholds [39] will shift texture patterns to incorrect positions (i.e.,
geometric distortions), make background visible (i.e., occlusion variations) due to
z-buffer-based occlusion handling, and consequently induce temporal flickering
[12]. Depth map filtering before view synthesis [5, 13–15] is the most frequently used
technique to suppress depth variations and hence reduce flickering artifacts, as to be
introduced in Sect. 5.3.1. Another powerful method to alleviate synthesis artifacts is
the cross-check [17, 18] on reliability of depth maps to prevent erroneous
3D warping, which will be elaborated in Sect. 5.3.2. Besides, depth values generated
by stereo-matching algorithms are often inaccurate at object boundaries [19] due to
insufficient texture features in stereo matching [4], and even user assistance [20] in
depth estimation or manual editing on depth maps cannot assure perfect alignment
between texture-depth edges at object boundaries. Synthesis artifacts appear with
texture-depth misalignment [21, 22], and several methods have been proposed to
remove them within or right after 3D warping [4, 6, 17, 22–24], as to be discussed in
Sect. 5.3.3.
Limitations of view synthesis also result in unnatural synthetic texture. First, a
reference view does not contain all the information in a virtual view, and more
information is missing with an increasing baseline distance between the two views.
Holes then appear due to disocclusion and non-overlap visual field, and the missing
texture patches are sometimes hard to be estimated. With multiple input views for
view interpolation, the estimation becomes relatively easy, whereas it is still difficult
in single-view synthesis or view extrapolation. Advanced hole filling algorithms
[25–27] provide more realistic solutions, as to be briefly introduced in Sect. 5.3.4
(more details in Chap. 6). Second, each pixel is assumed to belong to one object
surface, and assigned with a single depth value. However, some pixels may take
colors of multiple objects, e.g., half-transparent glass manifests itself as blended
color of the glass and its background. In this case, view synthesis may also fail to
produce satisfactory results, and approaches to the problem will not be addressed in
this chapter. Interested readers may refer to [24, 38] for some more information.

5.3.1 Preprocessing on Depth Map


5.3.1.1 Spatial Filtering
Depth map filtering is effective to reduce large disocclusion holes in warped
images [15, 16]. This operation is especially useful in the single-view rendering or
view extrapolation case when the large holes have to be completed only by hole


Fig. 5.4 Samples of view synthesis artifacts. a The magnified chair leg presents eroded edges (one type of boundary artifacts) and smearing effect at instant t. Though the leg turns intact in the following instants, the quality variations temporally yield a flicker. The smearing effect, appearing evident in t and t + 3 while less noticeable in t + 1 and t + 2, also evokes a flicker. b Background noises around the phone (another type of boundary artifact). c Unnatural synthetic texture by hole filling in single-view synthesis, also shown as the smearing effect

filling. It is also shown that a symmetric Gaussian filter causes certain geometric distortions in which vertically straight object boundaries become curved, depending on the depth in the neighboring regions [16]. Accordingly, asymmetric smoothing with stronger filtering strength in the vertical direction overcomes the disocclusion problem while reducing bent lines. More details can be found in Chap. 6.
Besides, depth maps are often noisy with irregular changes on the same object,
which may cause unnatural-looking pixels in the synthesized view [5]. Smoothing
the depth map with a low-pass filter can suppress the noises and improve the
rendering quality. However, low-pass filtering will blur the sharp depth edges along


object boundaries which are critical for high-quality view synthesis. Therefore,
the bilateral filter [28], known for its effectiveness in smoothing plain regions while preserving discontinuities, is demonstrated to be a superior alternative [5, 13],
h(x) = (1 / k(x)) ∬_D f(ξ) c(ξ, x) s(f(ξ), f(x)) dξ    (5.11)

k(x) = ∬_D c(ξ, x) s(f(ξ), f(x)) dξ    (5.12)

where k(x) is the normalization factor and D represents the filtering domain; the factors c(ξ, x) and s(f(ξ), f(x)) measure the geometric closeness and the photometric similarity between the neighborhood center x and a nearby point ξ, respectively.
Another edge-preserving filter is proposed in [14], in which the filter coefficient w(ξ, x) is designed as the inverse of the image gradient from the center point x to the filter tap position ξ. Thus, pixels in the homogeneous region around x are assigned large weights, while the edge-crossing ones are almost bypassed in filtering. It shall be noted that complex filtering schemes (e.g., with multiple iterations to finalize a depth map), though showing superior performance, will greatly compromise real-time implementation of view synthesis. Therefore, they are preferably deployed in depth production instead of in user-end view rendering.
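A direct, if slow, discretization of Eqs. (5.11) and (5.12) is sketched below, with Gaussian kernels assumed for both the closeness factor c and the similarity factor s; the window radius and the two sigma values are illustrative choices, not values prescribed by [5] or [13].

import numpy as np

def bilateral_filter_depth(depth, radius=3, sigma_spatial=2.0, sigma_range=8.0):
    # Smooths plain regions of a depth map while preserving sharp depth edges.
    depth = depth.astype(np.float64)
    h, w = depth.shape
    out = np.zeros_like(depth)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    closeness = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma_spatial ** 2))   # c(xi, x)

    padded = np.pad(depth, radius, mode='edge')
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            similarity = np.exp(-((window - depth[y, x]) ** 2)
                                / (2.0 * sigma_range ** 2))                 # s(f(xi), f(x))
            weights = closeness * similarity
            out[y, x] = np.sum(weights * window) / np.sum(weights)          # Eqs. (5.11)-(5.12)
    return out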

5.3.1.2 Temporal Filtering


Besides spatial filter, a simple temporal filter is proposed for depth denoising and
flickering effect reduction [15]. Depth values of stationary objects (with respect to
the camera) shall be constant over time, while those of moving objects are not
certain. Therefore, the temporal filter is applied at regions with a high probability
to be motionless,
d'(i, n) = [ λ d(i, n) + Σ_{k=1}^{N} a_n(i, n−k) d'(i, n−k) ] / [ λ + Σ_{k=1}^{N} a_n(i, n−k) ]    (5.13)

where d(i, n) and d'(i, n) are the original and enhanced depth values of the ith pixel in the nth frame, respectively; N (N ≥ 1) is the filter order and λ is a constant weight for the current original depth value; the remaining filter parameter a_n(i, n−k) (0 ≤ a_n(i, n−k) ≤ 1) is determined as the product of a temporal decay factor w(n−k) for frame n−k and a temporal stationary probability p_n(i, n−k) for pixel (i, n−k). The decay factor is introduced on the assumption that the correlation between two frames decreases with their temporal interval, and the probability factor is monotonically increasing with the texture structural similarity (SSIM)


Fig. 5.5 Comparison on temporal consistency of the original and enhanced depth sequences of Book arrival [37]: a frame 3 of the texture sequence at view 9; b, c, and d differences between frames 2 and 3 of the texture sequence, of the original depth sequence, and of the enhanced depth sequence, respectively (×25 for printing purpose). Fake depth variations in (c) are eliminated by the temporal filtering. (Reproduced with permission from [15])

[29] between a 5 × 5 block centered at (i, n−k) and its co-located area in frame n,

w(n−k) = e^{1−k},  k ≥ 1, k ∈ Z    (5.14)

p_n(i, n−k) = 1 − 1 / ( 1 + e^{ SSIM_n(i, n−k)/C_2 − C_1 } )    (5.15)

where C_1 and C_2 are two empirical constants. It is assumed that a local area is very likely to be stationary (i.e., p_n(i, n−k) is near one) if the two corresponding texture patterns have a high structural similarity (e.g., SSIM_n(i, n−k) > 0.9), and an area is probably moving (i.e., p_n(i, n−k) approaches zero) with a lower structural similarity (e.g., SSIM_n(i, n−k) < 0.7). Thus, the two constants are determined as C_1 = 20 and C_2 = 0.04 to make the sigmoid function p_n(i, n−k) greater than 0.9 when SSIM_n(i, n−k) > 0.9 and smaller than 0.1 when SSIM_n(i, n−k) < 0.7.
The adaptive temporal filter assigns high weights at stationary regions in
adjacent frames, and it significantly enhances temporal consistency of the depth
maps, as shown in Fig. 5.5. Accordingly, flickering artifacts in the synthesized
view are suppressed due to the removal of temporal noise in the depth maps.
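The sketch below implements Eqs. (5.13)-(5.15) for a single depth frame. It assumes that the per-pixel block-SSIM maps between the current texture frame and the N previous ones have been computed beforehand (e.g., over the 5 × 5 windows mentioned above), and it writes the constant weight of the current depth value as lam; the exact symbol and value used in [15] are not reproduced here.

import numpy as np

C1, C2 = 20.0, 0.04      # empirical constants of Eq. (5.15)

def stationary_probability(ssim_map):
    # Eq. (5.15): sigmoid of the block SSIM between frame n and frame n-k.
    return 1.0 - 1.0 / (1.0 + np.exp(ssim_map / C2 - C1))

def temporal_filter_depth(d_current, d_prev_enhanced, ssim_maps, lam=1.0):
    # d_current:       original depth map d(i, n), shape (H, W)
    # d_prev_enhanced: enhanced maps d'(i, n-1), ..., d'(i, n-N)
    # ssim_maps:       block-SSIM maps between the textures of frame n and frames n-1..n-N
    num = lam * d_current.astype(np.float64)
    den = np.full(d_current.shape, lam, dtype=np.float64)
    for k, (d_prev, ssim) in enumerate(zip(d_prev_enhanced, ssim_maps), start=1):
        decay = np.exp(1.0 - k)                       # Eq. (5.14): w(n-k) = e^(1-k)
        a = decay * stationary_probability(ssim)      # a_n(i, n-k) = w(n-k) * p_n(i, n-k)
        num += a * d_prev
        den += a
    return num / den                                  # Eq. (5.13)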


Fig. 5.6 The Inter-View Cross Check (IVCC) approach [17]. The dashed lines denote the cross check to determine pixels with unreliable depth values. Afterwards, projections of the unreliable
pixels to the virtual view are withdrawn to avoid synthesis artifacts. (Reproduced with permission
from [22])

5.3.2 Advanced 3D Warping and View Merging


5.3.2.1 Backward Warping
Forward warping has a drawback that it produces cracks when the object is
enlarged in the virtual view [6]. It means we have to reconstruct a K_v-pixel area with the original K_c (K_c < K_v) pixels, and empty samples turn up as a result of the one-to-one mapping (and many-to-one mapping in the case of occlusion). This
phenomenon is most common in synthesizing a rotated virtual view, and also takes
place for horizontal slopes in the parallel camera setup. It is known that backward
warping can locate each pixel in the destination image at a corresponding position
in the source image. Although the virtual view depth values are not available at
first, they can be obtained through forward warping, as shown in Eq. (5.5). On this
concept, Mori et al. [5] developed an enhanced 3D warping scheme. It first
employs forward warping to initialize depth values in the virtual view, and fills the
cracks with a median filter. The virtual-view depth map is further enhanced by
bilateral filtering as mentioned in Sect. 5.3.1. Afterwards, the color information is
retrieved from the reference view using backward warping based on the generated
depth map. Cracks are therefore removed from the warped images.
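A simplified sketch of this forward-then-backward scheme under the 1D parallel model of Eq. (5.9) is given below; it assumes strictly positive depth values, uses a 3 × 3 median filter in place of the full crack repair and bilateral refinement described in [5], and all names are illustrative.

import numpy as np
from scipy.ndimage import median_filter

def backward_warp_1d(texture_ref, depth_ref, focal_length, baseline):
    # Pass 1: forward-warp the depth map to the virtual view with a z-buffer.
    h, w = depth_ref.shape
    depth_virt = np.zeros((h, w))                    # 0 marks an empty position
    for v in range(h):
        for u in range(w):
            u_v = int(round(u + focal_length * baseline / depth_ref[v, u]))
            if 0 <= u_v < w and (depth_virt[v, u_v] == 0
                                 or depth_ref[v, u] < depth_virt[v, u_v]):
                depth_virt[v, u_v] = depth_ref[v, u]

    depth_virt = median_filter(depth_virt, size=3)   # close isolated one-pixel cracks

    # Pass 2: for each virtual-view pixel, fetch its color from the reference view.
    virt = np.zeros_like(texture_ref)
    holes = depth_virt == 0
    for v in range(h):
        for u in np.where(~holes[v])[0]:
            u_r = int(round(u - focal_length * baseline / depth_virt[v, u]))
            if 0 <= u_r < w:
                virt[v, u] = texture_ref[v, u_r]     # backward color fetch
    return virt, holes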

5.3.2.2 Reliability-Based Warping and View Merging


Depth errors cause the associated pixels to be mapped to wrong positions. Based on
MVD data, it is possible to find suspicious depth values and hence remove those
erroneous projections. Yang et al. [17] proposed a scheme of reliability reasoning
on 3D warping, using Inter-View Cross Check (IVCC), as shown in Fig. 5.6.
Specifically, each pixel in the left (or the right) reference view is warped to the right


(or the left) reference view, and the color difference between the projected pixel and
the corresponding original pixel in the other camera view is checked. A pixel with a
suprathreshold color difference is considered unreliable (e.g., the pixel 1R in
Fig. 5.6), because the color mismatch is probably induced by an incorrect depth
value; otherwise, the pixel is reliable (e.g., the pixel 1L). Finally, virtual-view pixels
projected from unreliable reference-view pixels are discarded from the warped
images, i.e., to withdraw all unreliable projections to the virtual view.
Though IVCC improves view synthesis quality significantly [17], it still has
some limitations. First, the color difference threshold in the cross check must be large enough to accommodate either illumination differences between the original
views or color distortions due to video coding; otherwise, many pixels will be
wrongly treated as unreliable owing to a small threshold, which may in turn result
in a great number of holes in the warped view. Thus, the IVCC method is unable to
detect the unreliable pixels below the color difference threshold. In addition, at
least two views are required for the cross check, which is not applicable in view
synthesis with a single view input.
In addition to intelligently detecting and excluding unreliable pixels in view
synthesis, IVCC can also benefit view merging [18]. Conventional view blending
employs the weight for each view mainly based on the baseline distance between
the original and the virtual views [5], as mentioned in Sect. 5.2.2. With the reliability check, depth quality can be inferred from the number of wrong projections.
It is common with MVD data that one view's depth map may be more accurate than another's, since the better depth map is generated with user assistance or even
manual editing. Therefore, it is advisable in view blending to assign a higher
weight to pixels from the more reliable view [18].
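The following sketch illustrates the cross-check idea in the 1D parallel case: each left-view pixel is projected into the right view using its own depth, and a large color mismatch at the landing position marks the pixel as unreliable, so that its projection to the virtual view can be withdrawn. The threshold, the mean absolute color difference, and the baseline sign convention are assumptions for this example rather than the exact rules of [17].

import numpy as np

def cross_check_reliability(tex_left, depth_left, tex_right,
                            focal_length, baseline_lr, threshold=20.0):
    # Returns a boolean map: True where the left-view depth passes the cross check.
    h, w = depth_left.shape
    reliable = np.zeros((h, w), dtype=bool)
    disparity = focal_length * baseline_lr / depth_left
    for v in range(h):
        for u in range(w):
            u_r = int(round(u - disparity[v, u]))     # landing column in the right view
            if 0 <= u_r < w:
                diff = np.abs(tex_left[v, u].astype(np.float64)
                              - tex_right[v, u_r].astype(np.float64)).mean()
                reliable[v, u] = diff <= threshold
    return reliable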

5.3.3 Boundary Artifact Reduction


Generally, the most noticeable synthesis artifacts appear at object boundaries, manifesting themselves with two major visual effects: (1) slim silhouettes of foreground objects are scattered into the background, called background noises, and (2) the foreground boundaries are eroded by background texture, termed foreground erosion, as shown in Fig. 5.8. Boundary artifacts arise from inaccurate depth values along object boundaries. An example is shown in Fig. 5.7a, where the pixels at the left (or right) side of the depth edge have foreground (or background) depth values. In that case, the pixels in area a (or c) are wrongly projected into area b (or d) in the warped view due to the incorrect depth values, while the positions where those pixels would land with correct depth values turn into holes (i.e., area a and c). In view merging, the holes are usually filled by background pixels from the other view (see a proof in [22]). Therefore, we can see that, on one hand, some pieces of foreground texture are scattered into the background (e.g., area b and d), causing the visual artifacts of background noises, while on the other hand, the background texture is punched into the foreground object, yielding the foreground


Fig. 5.7 a Illustration of boundary artifacts in the warped view due to incorrect depth values of
some foreground pixels [22]. (FG: Foreground, BG: Background, and H: Hole). Pixels at area
a and c are misaligned with background depth values. After warping, foreground area a (or c) is
separated from foreground and is projected to the background area b (or d) due to the incorrect
background depth values, which yields background noises as well as foreground erosion [22].
b Illustration of the Background Contour Region Replacement (BCRR) method [23]. Background
contour region with a predetermined width is replaced by pixels from the other view, and
background noises in b and e are eliminated. Failures (e.g., area f) occur when the background
noises are beyond the background contour region. (Reproduced with permission from [22])

erosion artifacts, as shown in Fig. 5.7a. Several techniques for boundary artifact
removal have been developed, which are reviewed as follows.

5.3.3.1 Background Contour Region Replacement


On the observation that background noises usually exist on the background side of
disocclusion holes, Lee & Ho [23] proposed a Background Contour Region
Replacement (BCRR) method to clean background noises in the warped views.
First, contours around holes in the warped views are detected and categorized into
foreground contours (on the foreground neighboring to the holes) and background
contours (on the background neighboring to the holes) by simply checking the
depth values around the holes, as shown in Fig. 5.7b. Empirically, the background
contour regions are probably spotted by some noises like fractions of the


Fig. 5.8 Samples of synthesized images of Art [31] with depth maps generated by DERS [32]. From left to right: (1) basic view synthesis by VSRS 1D mode [3], (2) with BCRR [23], (3) with IVCC [17], and (4) with SMART [22]. BCRR and IVCC clean part of the background noises but ignore the foreground erosion. SMART tackles both foreground erosion and background noises, making the boundary of the pen straighter and smoother. (Reproduced with permission from [22])

foreground object, whereas the corresponding areas in the other warped view are
prone to be free from distortions. Thus, the background contour regions are
intentionally replaced by more reliable pixels from the other view, and most
background noises are eliminated (e.g., area b and e in Fig. 5.7b). A similar explanation and solution appear in [6]. Limitations of BCRR are clear: (1) it fails to clean the background noises beyond the predefined background contour regions (e.g., area f in Fig. 5.7b); and (2) foreground erosion artifacts are ignored.

5.3.3.2 Prioritized Multi-Layer Projection


Müller et al. [4] proposed a Prioritized Multi-Layer Projection (PMLP) scheme (a variant of the two-layer representation by Zitnick et al. [24]) to reduce boundary artifacts. Depth edges are located with a Canny edge detector [30], and 7-sample-wide areas along the edges are marked as unreliable pixels. Each unreliable region is
split into a foreground boundary layer and a background boundary layer, and the rest
of the image is called a base layer. Both base layers in the two reference views are first
warped to the virtual view and merged into a common main layer. The foreground
boundary layers are projected with the second priority, and merged with the common
main layer with the z-buffer method. The background boundary layers are treated as
the least reliable, and only used to fill the remaining holes in the merged image.
Basically, PMLP adopts the idea of reducing unreliable pixels in warped views. The
IVCC method (as mentioned in Sect. 5.3.2), analyzing reliability in a different way
(with cross-check), is also useful to clean boundary artifacts.


5.3.3.3 SMART Algorithm


Recently, Zhao et al. [22] made a more in-depth investigation into the underlying
causes of boundary artifacts from a perspective of texture and depth alignment
around object boundaries. It is found that inaccurate texture and depth information
at object boundaries introduces texture-depth (T-D) misalignment, which consequently yields the boundary artifacts. As an example, foreground misalignment (FM) (referring to foreground color being associated with a background depth value) in the left original view may produce both foreground erosion and strong background noises after 3D warping, while the color transition (CT) region along an object boundary (with blended foreground and background colors) contributes to strong or weak background noises, as illustrated in Fig. 5.9a. Another two cases of
T-D misalignments that yield less annoying boundary artifacts are discussed in
[22] as well. Based on the in-depth analysis, they propose a novel and effective
method with the functional block diagram shown in Fig. 5.9b, Suppression of
Misalignment and Alignment Regulation Technique (denoted as SMART), to
remove boundary artifacts by two means: (1) mitigate background noises by
suppressing misalignments in CT regions; and (2) reduce foreground erosion by
regulating foreground texture and depth alignment. More specifically, the major
discontinuities on the depth maps, which probably imply object boundaries, are
first detected with a Canny edge operator [30]. Corresponding texture edges are
then located within a small window centered at the depth edges. After analyzing
the T-D edge misalignment based on the detected texture and depth edges, depth of
a FM region is modified by the foreground depth value, enforcing the foreground
T-D alignment and thus reducing foreground erosion. Besides, pixels in the CT
regions which are prone to yield background noises are marked by an unreliable
pixel mask, and are then prevented from being projected into the virtual view in
3D warping. Different from the conventional solutions, SMART exploits the
inherent coherence between texture and depth variations along object boundaries
to predict and suppress boundary artifacts, and demonstrates consistently superior
performance with both original and compressed MVD data (corresponding to
worse texture-depth alignment due to lossy data compression). Performance of
BCRR, IVCC, and SMART are compared in Fig. 5.7. In contrast with BCRR and
IVCC, SMART handles the background noises more thoroughly and makes the
foreground boundaries more natural.

5.3.4 Advanced Hole Filling


Simple interpolation-based hole filling propagates the foreground texture into the
holes, thus introducing a smearing effect. Based on the fact that the disocclusion
holes belong to the background, directional interpolation schemes that fill the holes
mostly with background texture [6, 11] perform better for plain background
regions. However, they fail to recover complex texture realistically, which evokes


Fig. 5.9 a Boundary artifacts due to texture-depth misalignment at object boundaries can be
reduced by SMART [22]. In this example of a left original view, foreground misalignment (FM)
and color transition (CT) regions are associated with background depth values. Accordingly, the
pixels in the two regions are erroneously warped to the background, yielding foreground erosion
and background noise. SMART corrects the depth values in the FM to enforce foreground
alignment, while it suppresses the warping of the unreliable pixels in the CT. As a result, the
wrong pixel projections are eliminated, and foreground boundaries are kept intact. b The
framework of the SMART algorithm. (Reproduced with permission from [22])

the idea of using directional image inpainting [25] to recover patterns of holes in a
non-homogeneous background region. Some algorithms [27] also utilize spatial
similarity for texture synthesis. It is assumed that the background texture patterns
may be duplicated (e.g., wallpaper), and thus it is possible to find a similar patch
for the hole at a nearby background region. Accordingly, the best continuation
patch out of all the candidate patches is obtained by minimizing a cost function of
texture closeness.
There is another approach to hole filling: the disoccluded region missing in the
current frame may be found in another frame [26, 27]. This is plausible when the scene is
recorded by a fixed camera, such that regions occluded by a moving foreground object


Fig. 5.10 Illustration of splatting: a pixel is mapped to two neighboring positions on the sample raster

will appear at another time instant. Thus, the background information is accumulated
over time to build up a complete background layer (or sprite), which will be copied
to complement the missing texture in the holes of a virtual view.
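A minimal sketch of such temporal accumulation for a fixed camera is given below; it assumes one depth map per frame and the convention that a smaller depth value means a more distant point (invert the comparison for the opposite convention).

```python
import numpy as np

class BackgroundSprite:
    """Accumulate a background layer (sprite) over time for a fixed camera (sketch).
    Assumes the depth convention 'smaller value = farther away'."""
    def __init__(self, height, width, channels=3):
        self.color = np.zeros((height, width, channels), dtype=np.uint8)
        self.depth = np.full((height, width), np.inf, dtype=np.float32)

    def update(self, frame, depth):
        # Keep, per pixel, the most distant (background) observation seen so far.
        farther = depth < self.depth
        self.color[farther] = frame[farther]
        self.depth[farther] = depth[farther]

    def fill_holes(self, synthesized, hole_mask):
        # Copy the accumulated background texture into the disocclusion holes.
        out = synthesized.copy()
        out[hole_mask] = self.color[hole_mask]
        return out
```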

5.3.5 Other Enhancement Techniques


As discussed in Sect. 5.3.2, insufficient sampling points in the reference view
result in cracks after warping. Splatting, well known in the field of computer
graphics, is introduced to handle the cracks [3]. With splatting, each reference-view
pixel is allowed to be projected to a small area centered at the original projected
position, as if a snowball hits a wall and scatters, as shown in Fig. 5.10. It is a trick
to increase sampling rate by using one-to-many mapping, and the cracks are
removed eventually. However, splatting may blur the synthesized view, since each
pixel in the virtual view may be a mixture of several pixels warped from the
reference view. To tackle this side effect, splatting is only applied at depth
boundaries (called boundary-aware splatting in [3]), which account for a small
portion of the image but produce most of the holes after warping.
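The sketch below illustrates boundary-aware splatting for the simplified case of a purely horizontal disparity shift: pixels near a depth discontinuity are splatted to the two neighboring integer positions, while all other pixels are simply rounded. The edge test, the z-buffer policy, and the depth convention (larger value = nearer) are assumptions for illustration, not the exact formulation of [3].

```python
import numpy as np

def warp_row_with_boundary_splatting(colors, disparities, depths, width, edge_thresh=10):
    """Warp one image row by horizontal disparity, splatting only at depth edges (sketch)."""
    out = np.zeros((width, 3), dtype=np.float32)
    zbuf = np.full(width, -np.inf)               # keep the nearest pixel (largest depth value)
    near_edge = np.zeros(len(colors), dtype=bool)
    d = np.abs(np.diff(depths.astype(np.float32)))
    near_edge[:-1] |= d > edge_thresh
    near_edge[1:] |= d > edge_thresh

    for x in range(len(colors)):
        xv = x + disparities[x]                  # warped (real-valued) position
        if near_edge[x]:
            targets = (int(np.floor(xv)), int(np.ceil(xv)))   # splat to both neighbors
        else:
            targets = (int(round(xv)),)                       # plain rounding elsewhere
        for t in targets:
            if 0 <= t < width and depths[x] > zbuf[t]:
                zbuf[t] = depths[x]
                out[t] = colors[x]
    return out, zbuf
```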
It is shown that subpixel rendering precision can improve the quality of the
synthesized view [33]. When pixels are allowed to be mapped to subpixel
positions on the integer sample raster (i.e., refining the grid to increase the
resolution of the virtual view), the geometric errors caused by position rounding are
reduced, and the subpixel information is exploited. Finally, the super-resolution
intermediate image is downsampled to generate the virtual view at the target resolution.
Color errors from inter-view cross check (as mentioned in Sect. 5.3.2) can
provide additional clues beyond the reliability of the corresponding pixels. Furihata
et al. [34] proposed an elegant way to reduce synthesis errors by feeding back the
color errors found in the cross check. In this way, the color error in a smoothly varying area
of the virtual view can be predicted according to the geometric relationship
illustrated in Fig. 5.11. In the figure, the color value of pixel B (denoted as IB) is
obtained by adding the estimated error EV to IA, which makes IB closer to the
true value. Two more error feedback models for other texture patterns are also
provided in [34]. Compared with the IVCC, this algorithm further makes use of
more color information in the reference views rather than just ignoring it once the


Fig. 5.11 Reducing energy of synthesis color error with error feedback on smooth texture. Color
intensity of pixel A is denoted as I(A). With an inaccurate depth value, pixel A is warped to B and
C in the virtual view and right view, respectively. From the right view, the color error due to
position deviation is known. Assuming smooth texture varies almost linearly, the corresponding
color error in the virtual view can be estimated. More accurate color value is then obtained by
adding the estimated error

pixels are associated with unreliable depth values. Accordingly, holes in warped
views are reduced due to fewer pixels being discarded in warping.
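The fragment below sketches this error-feedback idea for a single warped pixel under a simplified linear-texture model: the virtual view is assumed to lie at a fraction alpha of the left-to-right baseline, so the position deviation at B, and hence the color error, is roughly alpha times that observed at C. This is only an illustration of the principle, not the exact models of [34].

```python
def feedback_corrected_value(i_a, i_c_right, alpha):
    """Sketch of residual-error feedback for smoothly varying texture.

    i_a       : color warped from the left view to the virtual position B
    i_c_right : color actually observed in the right view at the position C
                where the same left-view pixel lands after warping
    alpha     : assumed virtual-view position within the left-right baseline, in (0, 1)
    """
    # Error observed at C in the right view; under a locally linear texture model
    # the (unknown) error at B scales with the smaller position deviation.
    e_c = i_c_right - i_a
    e_v = alpha * e_c            # estimated error at the virtual position B
    return i_a + e_v
```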

5.4 Real-time Implementation of View Synthesis


After a review of various techniques to enhance the basic view synthesis framework,
let us look back at another critical issue besides the visual quality, that is, the
complexity. Running on 3D video terminals, view synthesis is required to achieve
real-time performance. However, sequential view synthesis that processes pixel by
pixel usually fails to meet this requirement. For instance, the MPEG View Synthesis
Reference Software (VSRS) [3], a C program adopting backward warping [5],
two-view merging and image inpainting [10], takes around one second to render
a virtual image (1024 × 768 resolution) on an Intel E5335 (2.00 GHz) CPU.
To speed up view synthesis, one way is to simplify the general algorithm (or a
portion of it) or to process it in parallel. The 3D warping and view merging
treat each pixel independently with the same operations, which makes them suitable for
parallel processing with a single instruction multiple data (SIMD) structure. On the
computer platform, the GPU is employed for parallel computation of 3D warping
[35, 36]. In addition, hole filling is simplified into a linewise scheme (though
performance is compromised) in which the background pixel next to a hole is spread
toward the foreground side line by line, allowing line-based parallel hole
filling [35].
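A minimal sketch of such a linewise scheme is shown below: each hole run within a row is filled from the neighboring valid pixel on its background side, and rows are processed independently, which is what makes the scheme parallelizable. The depth convention (smaller value = farther) is an assumption.

```python
import numpy as np

def linewise_hole_fill(color, depth, hole_mask):
    """Fill each hole run in a row from its background-side neighbor (sketch).
    Assumes 'smaller depth value = farther away'."""
    out = color.copy()
    h, w = hole_mask.shape
    for r in range(h):                       # rows are independent -> parallelizable
        c = 0
        while c < w:
            if not hole_mask[r, c]:
                c += 1
                continue
            start = c
            while c < w and hole_mask[r, c]:
                c += 1
            left, right = start - 1, c       # first valid pixels on either side
            if left < 0 and right >= w:
                continue                     # whole row is a hole; leave it untouched
            if left < 0:
                src = right
            elif right >= w:
                src = left
            else:                            # pick the more distant (background) side
                src = left if depth[r, left] <= depth[r, right] else right
            out[r, start:c] = color[r, src]
    return out
```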
An Application-Specific Integrated Circuit (ASIC) is an alternative solution for
devices without flexible or sufficient computational capabilities, such as TV sets
(or set-top boxes) and mobile phones. Recently, an ASIC hardware implementation


of the view synthesis algorithm in VSRS has been proposed [36]. The chip design
tackles the high computational complexity by substituting a hardware-friendly
bilinear hole filling algorithm for the image inpainting in VSRS as well as by
employing multiple optimization techniques. As a result, the hardware view
synthesis engine achieves a throughput of more than 30 frames per second for
HD1080p sequences. Interested readers may further refer to Chap. 3, which
elaborates real-time implementation of view synthesis on FPGA.
Applications always highlight the balance between performance and cost.
The aforementioned implementations, whether by GPU assistance or ASIC design,
target the basic view synthesis framework. Synthesis quality can be further
improved by integrating the enhancement techniques introduced in Sect. 5.3, at the
cost of reduced rendering speed due to the added computational complexity.
However, for portable devices with constrained power and processing capability,
the complexity should be carefully controlled within a reasonable range, and some
power-hungry algorithms may not be appropriate for such application
scenarios (e.g., 3D-TV on mobile phones).

5.5 Conclusion
View synthesis is the last component of 3D content generation, generating multi-view
images required by 3D displays. Deployed at 3D-TV terminals, the synthesis task needs
to be completed in real time. The basic view synthesis framework, which consists of 3D
warping, view merging, and hole filling, can meet the real-time constraint through
techniques such as GPU acceleration or dedicated circuits. However, the simple
scheme is prone to yield visual distortions in the synthesized view, e.g., flickering and
boundary artifacts. Accordingly, many enhancement algorithms have been developed
to adapt different components in the synthesis framework, which significantly suppress
visual artifacts and improve the synthesis quality.
View synthesis is changing from a passive, standard three-step process into an
intelligent engine with versatile analysis tools. Such an engine need not blindly trust the input depth
information and simply transform the pixels into a virtual view; instead, advanced
synthesis schemes may examine texture and depth information to realize reliable
projections and to minimize or eliminate warping errors. Besides, they may figure
out more natural texture patterns for missing samples in the virtual view by
further considering color patches and the foreground-background relationship. On
the other hand, in the conventional scheme different frames are rendered individually and
independently. New engines are expected to make good use of texture and
depth information over time for more powerful and dynamic view synthesis with
the highest rendering quality.
Acknowledgement The authors would like to thank Middlebury College, Fraunhofer Institute
for Telecommunications, Heinrich Hertz Institute (HHI), and Philips for kindly providing the
multi-view images and the Book_arrival and Mobile sequences. This work is partially supported by


the National Basic Research Program of China (973) under Grant No.2009CB320903 and
Singapore Ministry of Education Academic Research Fund Tier 1 (AcRF Tier 1 RG7/09).

References
1. Mark WR, McMillan L, Bishop G (1997) Post-rendering 3D warping. In: Proceedings of the symposium on interactive 3D graphics, Providence, Rhode Island, Apr 1997, pp 7–16
2. Fehn C (2003) A 3D-TV approach using depth-image-based rendering (DIBR). In: Proceedings of visualization, imaging and image processing (VIIP), pp 482–487
3. Tian D, Lai P, Lopez P, Gomila C (2009) View synthesis techniques for 3D video. In: Proceedings of applications of digital image processing XXXII, vol 7443, pp 74430T-1–11
4. Müller K, Smolic A, Dix K, Merkle P, Kauff P, Wiegand T (2008) View synthesis for advanced 3D video systems. EURASIP Journal on Image and Video Processing, vol 2008, Article ID 438148
5. Mori Y, Fukushima N, Yendo T, Fujii T, Tanimoto M (2009) View generation with 3D warping using depth information for FTV. Sig Processing: Image Commun 24(1–2):65–72
6. Zinger S, Do L, de With PHN (2010) Free-viewpoint depth image based rendering. J Vis Commun Image Represent 21:533–541
7. Domanski M, Gotfryd M, Wegner K (2009) View synthesis for multiview video transmission. In: International conference on image processing, computer vision, and pattern recognition, Las Vegas, USA, Jul 2009, pp 13–16
8. Bertalmio M, Sapiro G, Caselles V, Ballester C (2000) Image inpainting. In: Proceedings of ACM conference on computer graphics (SIGGRAPH), New Orleans, LA, July 2000, pp 417–424
9. Criminisi A, Perez P, Toyama K (2004) Region filling and object removal by exemplar-based inpainting. IEEE Trans Image Process 13(9):1200–1212
10. Telea A (2004) An image inpainting technique based on the fast marching method. J Graph Tools 9(1):25–36
11. Oh K, Yea S, Ho Y (2009) Hole-filling method using depth based in-painting for view synthesis in free viewpoint television (FTV) and 3D video. In: Proceedings of the picture coding symposium (PCS), Chicago, pp 233–236
12. Zhao Y, Yu L (2010) A perceptual metric for evaluating quality of synthesized sequences in 3DV system. In: Proceedings of visual communications and image processing (VCIP), Jul 2010
13. Daribo I, Saito H (2010) Bilateral depth-discontinuity filter for novel view synthesis. In: Proceedings of IEEE international workshop on multimedia signal processing (MMSP), Saint-Malo, France, Oct 2010, pp 145–149
14. Park JK, Jung K, Oh J, Lee S, Kim JK, Lee G, Lee H, Yun K, Hur N, Kim J (2009) Depth-image-based rendering for 3DTV service over T-DMB. J Vis Commun Image Represent 24(1–2):122–136
15. Fu D, Zhao Y, Yu L (2010) Temporal consistency enhancement on depth sequences. In: Proceedings of picture coding symposium (PCS), Dec 2010, pp 342–345
16. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV. IEEE Trans Broadcast 51(2):191–199
17. Yang L, Yendo T, Tehrani MP, Fujii T, Tanimoto M (2010) Artifact reduction using reliability reasoning for image generation of FTV. J Vis Commun Image Represent 21:542–560, Jul–Aug 2010
18. Yang L, Yendo T, Tehrani MP, Fujii T, Tanimoto M (2010) Error suppression in view synthesis using reliability reasoning for FTV. In: Proceedings of 3DTV conference (3DTV-CON), Tampere, Finland
19. Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vision 47(1–3):7–42


20. Vandewalle P, Gunnewiek RK, Varekamp C (2010) Improving depth maps with limited user input. In: Proceedings of the stereoscopic displays and applications XXI, vol 7524
21. Fieseler M, Jiang X (2009) Registration of depth and video data in depth image based rendering. In: Proceedings of 3DTV conference (3DTV-CON), pp 1–4
22. Zhao Y, Zhu C, Chen Z, Tian D, Yu L (2011) Boundary artifact reduction in view synthesis of 3D video: from perspective of texture-depth alignment. IEEE Trans Broadcast 57(2):510–522
23. Lee C, Ho YS (2008) Boundary filtering on synthesized views of 3D video. In: Proceedings of international conference on future generation communication and networking symposia, Sanya, pp 15–18
24. Zitnick CL, Kang SB, Uyttendaele M, Winder S, Szeliski R (2004) High-quality video view interpolation using a layered representation. In: Proceedings of ACM SIGGRAPH, pp 600–608
25. Daribo I, Pesquet-Popescu B (2010) Depth-aided image inpainting for novel view synthesis. In: Proceedings of IEEE international workshop on multimedia signal processing (MMSP)
26. Schmeing M, Jiang X (2010) Depth image based rendering: a faithful approach for the disocclusion problem. In: Proceedings of 3DTV conference, pp 1–4
27. Ndjiki-Nya P, Köppel M, Doshkov D, Lakshman H, Merkle P, Müller K, Wiegand T (2011) Depth image-based rendering with advanced texture synthesis for 3-D video. IEEE Trans Multimedia 13(3):453–465
28. Tomasi C, Manduchi R (1998) Bilateral filtering for gray and color images. In: Proceedings of IEEE international conference on computer vision, pp 839–846
29. Wang Z, Sheikh HR (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
30. Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 8(6):679–698
31. Middlebury Stereo Vision Page (2007) Available: http://vision.middlebury.edu/stereo/
32. Tanimoto M, Fujii T, Tehrani MP, Suzuki K, Wildeboer M (2009) Depth estimation reference software (DERS) 3.0. ISO/IEC JTC1/SC29/WG11 Doc. M16390, Apr 2009
33. Tong X, Yang P, Zheng X, Zheng J, He Y (2010) A sub-pixel virtual view synthesis method for multiple view synthesis. In: Proceedings of picture coding symposium (PCS), Dec 2010, pp 490–493
34. Furihata H, Yendo T, Tehrani MP, Fujii T, Tanimoto M (2010) Novel view synthesis with residual error feedback for FTV. In: Proceedings of stereoscopic displays and applications XXI, vol 7524, Jan 2010, pp 75240L-1–12
35. Shin H, Kim Y, Park H, Park J (2008) Fast view synthesis using GPU for 3D display. IEEE Trans Consum Electron 54(4):2068–2076
36. Rogmans S, Lu J, Bekaert P, Lafruit G (2009) Real-time stereo-based view synthesis algorithms: a unified framework and evaluation on commodity GPUs. Sig Processing: Image Commun 24(1–2):49–64
37. Feldmann I et al (2008) HHI test material for 3D video. ISO/IEC JTC1/SC29/WG11 Doc. M15413, Apr 2008
38. Schechner YY, Kiryati N, Basri R (2000) Separation of transparent layers using focus. Int J Comput Vis 39(1):25–39
39. Zhao Y (2011) Depth no-synthesis-error model for view synthesis in 3-D video. IEEE Trans Image Process 20(8):2221–2228

Chapter 6

Hole Filling for View Synthesis


Ismael Daribo, Hideo Saito, Ryo Furukawa, Shinsaku Hiura
and Naoki Asada

Abstract The depth-image-based rendering (DIBR) technique is recognized as a


promising tool for supporting advanced 3D video services required in multi-view
video (MVV) systems. However, an inherent problem with DIBR is to deal with the
newly exposed areas that appear in synthesized views. This occurs when parts of the
scene are not visible in every viewpoint, leaving blank spots, called disocclusions.
These disocclusions may grow larger as the distance between cameras increases.
This chapter addresses the disocclusion problem in two manners: (1) the preprocessing of the depth data, and (2) the image inpainting of the synthesized view. To
deal with small disocclusions, a hole filling strategy is designed by preprocessing the depth video before DIBR, while for larger disocclusions an inpainting approach is proposed to retrieve the missing pixels by leveraging the given depth information.

I. Daribo (&)
Division of Digital Content and Media Sciences,
National Institute of Informatics, 2-1-2 Hitotsubashi,
Chiyoda-ku 101-8430, Tokyo, Japan
e-mail: daribo@nii.ac.jp
H. Saito
Department of Information and Computer Science,
Keio University, Minato, Tokyo, Japan
e-mail: saito@hvrl.ics.keio.ac.jp
I. Daribo · R. Furukawa · S. Hiura · N. Asada
Faculty of Information Sciences,
Hiroshima City University,
Hiroshima, Japan
e-mail: ryo-f@hiroshima-cu.ac.jp
S. Hiura
e-mail: hiura@hiroshima-cu.ac.jp
N. Asada
e-mail: asada@hiroshima-cu.ac.jp



Keywords 3D warping · Bilateral filter · Contour of interest · Depth-image-based rendering · Disocclusion · Distance map · Hole filling · Multi-view video · Patch matching · Priority computation · Smoothing filter · Structural inpainting · Texture inpainting · View synthesis




6.1 Introduction
A well-suited 3D video data representation and its multi-view extension
are known, respectively, as video-plus-depth and multi-view video-plus-depth (MVD).
They provide regular 2D videos enriched with their associated depth data. The 2D
video provides the texture information, i.e., the color intensity, whereas the depth video
represents the per-pixel Z-distance between the camera and a 3D point in the visual
scene. With the recent evolution of acquisition technologies, including 3D depth cameras
(time-of-flight and Microsoft Kinect) and multi-camera systems, to name a few,
depth-based systems have gained significant interest, particularly in terms
of view synthesis approaches. In particular, the depth-image-based rendering (DIBR)
technique is recognized as a promising tool for supporting advanced 3D video services
by synthesizing novel views from either the video-plus-depth data representation or its multi-view extension. Let us distinguish two scenarios: (1) generate a
second shifted view from one reference viewpoint, and (2) synthesize any desired intermediate view from at least two neighboring reference viewpoints for free viewpoint
scene observation. The problem, however, is that every pixel does not necessarily
exist in every view, which results in the occurrence of holes when a novel view is
synthesized from another one. View synthesis then exposes parts of the scene that
are occluded in the reference view and makes them visible in the targeted views. This
process is known as disocclusion, a consequence of the occlusion of points in the
reference viewpoint, as illustrated in Fig. 6.1.
One solution would be to increase the captured camera viewpoints to make every
point visible from at least one captured viewpoint. For example, in Fig. 6.1 the point
B4 is visible from neither camera cam1 nor cam2. However, that increases the amount
of captured data to process, transmit, and render. This chapter gives
more attention to the single video-plus-depth scenario; more details on the multi-view
case can be found in Chap. 5. Another solution may consist in relying on more
complex multi-dimensional data representations, like the layered depth video (LDV)
representation, which allows storing additional depth and color values for pixels that are
occluded in the original view. These extra data provide the information needed
to fill in the disoccluded areas of the synthesized views. This means,
however, increasing the overhead complexity of the system. In this chapter, we first
investigate two camera configurations: small and large baseline, i.e., small and large
distance between the cameras. The baseline affects the disocclusion size: the larger the


Fig. 6.1 Stereo configuration wherein not all pixels are visible from all camera viewpoints. For
example, when transferring the view from camera cam2 to cam1, points B2 and B3 are (a) occluded
in camera cam2 and (b) disoccluded in camera cam1

baseline is, the bigger the disocclusions become. We then address the disocclusion
problem through a framework that consists of two strategies at different places of the
DIBR flowchart: (1) disocclusion removal is achieved by applying a low-pass filter to
preprocess the depth video before DIBR, and (2) the synthesized view is postprocessed to
fill in larger missing areas with plausible color information. The process of filling in
the disocclusions is also known as hole filling.
This chapter starts by introducing the general formulation of the 3D image
warping view synthesis equations in Sect. 6.2. Afterwards, the disocclusion problem
is discussed along with related works.
Section 6.3 introduces a prefiltering framework based on the local properties of
the depth map to remove discontinuities that provoke the aforementioned disocclusions. Those discontinuities are identified and smoothed through an adaptive filter.
The recovery problem of the larger disoccluded regions is addressed in Sect. 6.4.
To this end, an inpainting-based postprocessing of the warped image is proposed.
Moreover, a texture and structure propagation process improves the novel view
quality and preserves object boundaries.

6.2 Disocclusion Problem in Novel View Synthesis


Novel view synthesis includes a function for mapping points from one view (the
reference image plane) to another one (the targeted image plane) as illustrated in
Fig. 6.2 and described in the next subsection.

6.2.1 Novel View Synthesis


First, we introduce some notations. The intensity of the reference view image I1 at
pixel coordinates (u1, v1) is denoted by I1(u1, v1). The pinhole camera model is


Fig. 6.2 3D image warping: projection of a 3D point on two image planes in homogeneous
coordinates

used to project I1 into the second view I2(u2, v2) with the given depth data
Z(u1, v1). Conceptually, the 3D image warping process can be separated into two
steps: a backprojection of the reference image into the 3D world, followed by a
projection of the backprojected 3D scene into the targeted image plane [1]. If we
look at the pixel location (u1, v1), first, a per-pixel backprojection is performed
from the 2D reference camera image plane I1 to the 3D-world coordinates. Next, a
second projection is performed from the 3D world to the image plane I2 of the
target camera at pixel location (u2, v2), and so on for each pixel location.

To perform these operations, three quantities are needed: K1, R1, and t1, which
denote the 3 × 3 intrinsic matrix, the 3 × 3 orthogonal rotation matrix, and
the 3 × 1 translation vector of the reference view I1, respectively. The 3D-world
backprojected point M = (x, y, z)^T is expressed in non-homogeneous coordinates as

$$
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
= R_1^{-1} K_1^{-1} \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} \lambda_1 \;-\; R_1^{-1} t_1 ,
\qquad (6.1)
$$

where λ1 is a positive scaling factor. Looking at the target camera quantities K2, R2, and t2, the backprojected 3D-world point M = (x, y, z)^T is then mapped into the targeted 2D image coordinates (u'2, v'2, w'2)^T in homogeneous coordinates as

$$
\begin{pmatrix} u'_2 \\ v'_2 \\ w'_2 \end{pmatrix}
= K_2 R_2 \begin{pmatrix} x \\ y \\ z \end{pmatrix} + K_2 t_2 .
\qquad (6.2)
$$

We can therefore express the targeted coordinates as a function of the reference coordinates by

$$
\begin{pmatrix} u'_2 \\ v'_2 \\ w'_2 \end{pmatrix}
= K_2 R_2 R_1^{-1} K_1^{-1} \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} \lambda_1
\;-\; K_2 R_2 R_1^{-1} t_1 + K_2 t_2 .
\qquad (6.3)
$$

It is common to attach the world coordinate system to the first camera system, so
that R1 = I3 and t1 = 03, which simplifies Eq. 6.3 into

$$
\begin{pmatrix} u'_2 \\ v'_2 \\ w'_2 \end{pmatrix}
= K_2 R_2 K_1^{-1} \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} \lambda_1 + K_2 t_2 ,
\qquad (6.4)
$$

where (u'2, v'2, w'2)^T are the homogeneous coordinates of the 2D image point m2, and
the positive scaling factor λ1 is equal to

$$
\lambda_1 = \frac{z}{c}, \quad \text{where} \quad
\begin{pmatrix} a \\ b \\ c \end{pmatrix} = K_1^{-1} \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} .
\qquad (6.5)
$$

In the final step, the homogeneous result is converted into the pixel location
(u2, v2) = (u'2/w'2, v'2/w'2). Note that z is the third component of the 3D-world point M,
which indicates the depth information at pixel location (u1, v1) of image I1. These
data are considered as key side information to retrieve the corresponding pixel
location in the other image I2.
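As a sketch of how Eqs. 6.4 and 6.5 translate into a per-pixel forward warping loop (occlusion handling and hole marking are omitted, and consistent units for the depth map and camera parameters are assumed):

```python
import numpy as np

def warp_view(K1, K2, R2, t2, Z, I1):
    """Forward 3D image warping of view I1 into the target camera (Eqs. 6.4-6.5, sketch).
    The world frame is attached to camera 1, i.e. R1 = I, t1 = 0; t2 has shape (3,)."""
    h, w = Z.shape
    K1_inv = np.linalg.inv(K1)
    I2 = np.zeros_like(I1)
    for v1 in range(h):
        for u1 in range(w):
            m1 = np.array([u1, v1, 1.0])
            a, b, c = K1_inv @ m1
            lam1 = Z[v1, u1] / c                          # Eq. (6.5)
            m2h = K2 @ R2 @ K1_inv @ m1 * lam1 + K2 @ t2  # Eq. (6.4)
            u2 = int(round(m2h[0] / m2h[2]))              # homogeneous -> pixel location
            v2 = int(round(m2h[1] / m2h[2]))
            if 0 <= u2 < w and 0 <= v2 < h:
                I2[v2, u2] = I1[v1, u1]
    return I2
```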

6.2.2 Disocclusion Problem


An inherent problem of the previously described 3D image warping algorithm is the
fact that each pixel does not necessarily exist in both views. Consequently, due to
the sharp discontinuities in the depth data (i.e., strong edges), the 3D image
warping can expose areas of the scene that are occluded in the reference view and
become visible in the synthesized view. Figure 6.3 shows an example of a warped
picture by DIBR. Disoccluded regions are colored in white.
Although the disocclusion problem has been studied only recently, many
solutions have already been proposed. Existing solutions mainly follow two general lines
of research: depth preprocessing and hole filling. While depth preprocessing takes
place before DIBR, hole filling works on the synthesized view after DIBR.
Preprocessing the depth video allows reducing the number and size of the disoccluded regions by smoothing the depth discontinuities, commonly
with a Gaussian filter [2, 3]. Considering that smoothing the
whole depth video causes more damage than simply applying a correction around
the edges, various adaptive filters [4–6] have been proposed to reduce both disocclusions and filtering-induced distortions in the depth video. More recently, the
bilateral filter [7] has been used to enhance the depth video for its edge-preserving
capability [8, 9]. In comparison with conventional averaging or Gaussian-based
filters, the bilateral filter operates in both the spatial domain and the color intensity domain,
which results in better preservation of sharp depth changes in conjunction with the


Fig. 6.3 In magenta: example of disoccluded regions (from the ATTEST test sequence
Interview). (a) reference texture (b) reference depth (c) synthesized view

intensity variation in color space, and in consistent boundaries between texture and
depth images.
Nevertheless, after depth preprocessing, larger disocclusions may still remain,
which requires a further stage that consists in interpolating the missing values. To
this end, an average filter has been commonly used [3]. The average filter, however, does
not preserve edge information in the interpolated area, which results in obvious
artifacts in highly textured areas. Mark et al. [10] and Zhan-wei et al. [11]
proposed merging two warped images provided by two different spatial or
temporal viewpoints, where the pixels in reference images are processed in an
occlusion-compatible order. In a review of inpainting techniques, Tauber et al.
argue that inpainting can provide an attractive and efficient framework to fill the
disocclusions with texture and structure propagation, and should be integrated with
image-based rendering (IBR) techniques [12].
Although these methods are effective to some extent, there still exist
problems such as the degradation of non-disoccluded areas in the depth map, the
depth-induced distortion of warped images, and undesirable artifacts in inpainted
disoccluded regions. To overcome these issues, we propose an adaptive depth map
preprocessing that operates mainly on the edges, and an inpainting-based postprocessing that uses depth information in the case of a large distance between the
cameras.

6.3 Preprocessing of the Depth Video


In this section, we address the problem of filling in the disocclusions in the case of
a small camera inter-distance (around the human eye inter-distance). As a result,
smaller disocclusions are revealed. As previously discussed, one way to address
the disocclusion problem consists in preprocessing the depth video, by smoothing
the depth discontinuities. Instead of smoothing the whole depth video, we propose
here an adaptive filter that takes into account the distance to the edges.
The proposed scheme is summarized in Fig. 6.4. First we apply a preliminary
preprocessing stage to extract the edges of the depth map capable of revealing the


Fig. 6.4 Pre-processing flow-chart of the depth before DIBR

disoccluded areas. In the following, we refer to these edges as contours of interest
(CI). This spatial information then permits computing the distance data and, from it,
inferring the weight information for the proposed filtering operation.

6.3.1 Smoothing Filters


Let us first briefly introduce two common filters that will be utilized in the rest of
this chapter for their smoothing properties.
6.3.1.1 Gaussian Filter
A Gaussian filter modifies the input depth image Z by convolution with a discrete
Gaussian function g2D such that

$$
(Z * g_{2D})(u, v) = \sum_{x=-\frac{w}{2}}^{\frac{w}{2}} \; \sum_{y=-\frac{h}{2}}^{\frac{h}{2}}
Z(u - x, v - y) \cdot g_{2D}(x, y),
\qquad (6.6)
$$

where the two-dimensional approximation of the discrete Gaussian function g2D is
separable into x and y components, and expressed as follows:

$$
g_{2D}(x, y) = g_{1D}(x) \cdot g_{1D}(y)
= \frac{1}{\sqrt{2\pi}\,\sigma_x} e^{-\frac{x^2}{2\sigma_x^2}}
\cdot \frac{1}{\sqrt{2\pi}\,\sigma_y} e^{-\frac{y^2}{2\sigma_y^2}},
\qquad (6.7)
$$

where g1D is the one-dimensional discrete approximation of the Gaussian function.
The parameters w and h are the width and height of the convolution kernel,
respectively. The parameters σx and σy are the standard deviations of the Gaussian
distribution along the horizontal and vertical directions, respectively. In the case
of a symmetric Gaussian distribution, Eq. 6.7 is updated to

$$
g_{2D}(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}},
\quad \text{where } \sigma = \sigma_x = \sigma_y .
\qquad (6.8)
$$

Nonetheless, the asymmetric nature of the distribution may help reduce the
geometric distortion induced in the warped picture by applying a stronger
smoothing in one direction.

6.3.1.2 Bilateral Filter


A bilateral filter is an edge-preserving smoothing filter. While many filters are
convolutions in the spatial domain, a bilateral filter also operates in the intensity
domain. Rather than simply replacing a pixel value with a weighted average of its
neighbors, as for instance a low-pass Gaussian filter does, the bilateral filter replaces
a pixel value by a weighted average of its neighbors in both the space and intensity
domains. As a consequence, sharp edges are preserved by systematically excluding
pixels across discontinuities from consideration. The new depth value in the filtered depth map Z̃ at the pixel location s = (u, v) is then defined by:

$$
\tilde{Z}(s) = \frac{1}{k(s)} \sum_{p \in \Omega} f(p - s)\, g\big(Z(p) - Z(s)\big)\, Z(p),
\qquad (6.9)
$$

where Ω is the neighborhood around s under the convolution kernel, and k(s) is a
normalization term:

$$
k(s) = \sum_{p \in \Omega} f(p - s)\, g\big(Z(p) - Z(s)\big).
\qquad (6.10)
$$

In practice, a discrete Gaussian function is used for the spatial filter f in the spatial
domain and for the range filter g in the intensity domain, as follows:

$$
f(p - s) = e^{-\frac{d(p - s)^2}{2\sigma_d^2}},
\quad \text{with } d(p - s) = \| p - s \|_2 ,
\qquad (6.11)
$$

where ‖·‖2 is the Euclidean distance, and

$$
g\big(Z(p) - Z(s)\big) = e^{-\frac{\delta(Z(p) - Z(s))^2}{2\sigma_r^2}},
\quad \text{with } \delta\big(Z(p) - Z(s)\big) = \big| Z(p) - Z(s) \big| ,
\qquad (6.12)
$$


Fig. 6.5 Example of contours of interest (CI) from the depth map discontinuities. (a) depth map
(b) CI

where σd and σr are the standard deviations of the spatial filter f and the range filter
g, respectively. The filter extent is then controlled by these two input parameters.
Therefore, a bilateral filter can be considered as a product of two Gaussian filters,
where the value at a pixel location s is computed as a weighted average of its
neighbors, with a spatial component that favors close pixels and a range component
that penalizes pixels with different intensity.
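A direct, unoptimized implementation of Eqs. 6.9-6.12 on a depth map could look as follows (a brute-force sketch; the kernel radius and the two standard deviations are illustrative parameters):

```python
import numpy as np

def bilateral_filter_depth(Z, radius=3, sigma_d=2.0, sigma_r=10.0):
    """Brute-force bilateral filter of a depth map, following Eqs. (6.9)-(6.12)."""
    Z = Z.astype(np.float32)
    h, w = Z.shape
    out = np.zeros_like(Z)
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    spatial = np.exp(-(xx**2 + yy**2) / (2 * sigma_d**2))        # f, Eq. (6.11)
    Zp = np.pad(Z, radius, mode='edge')
    for v in range(h):
        for u in range(w):
            patch = Zp[v:v + 2 * radius + 1, u:u + 2 * radius + 1]
            rng = np.exp(-((patch - Z[v, u])**2) / (2 * sigma_r**2))  # g, Eq. (6.12)
            wgt = spatial * rng
            out[v, u] = np.sum(wgt * patch) / np.sum(wgt)        # Eqs. (6.9)-(6.10)
    return out
```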

6.3.1.3 Temporal Filtering


The aforementioned filters can also take temporal consistency measures into
account. Readers are invited to read Chap. 5 for more details.

6.3.2 Extraction of the Contours of Interest


As previously discussed, the DIBR technique may expose areas of the scene wherein
the reference camera has no information. These areas are located around specific
depth edges, which we denote in what follows as contours of interest (CI). Furthermore, it is
possible to infer these regions before DIBR. The CI are generated from the depth
data through a directional edge detector, such that only one edge side is detected,
as illustrated in Fig. 6.5. The problem of choosing an appropriate threshold is
handled by a hysteresis approach,1 wherein multiple thresholds are used to find an
edge. The output binary map reveals the depth discontinuities, around which a
strong smoothing is necessary; elsewhere, a lighter smoothing is applied

1 Hysteresis is used to track the more relevant pixels along the contours. Hysteresis uses two
thresholds: if the magnitude is below the low threshold, it is set to zero (made a non-edge). If
the magnitude is above the high threshold, it is made an edge. If the magnitude is between the two
thresholds, then it is set to zero unless the pixel is located near an edge detected by the high
threshold.


Fig. 6.6 Example of distance map derived from the contours of interest (CI). (a) CI (b) distance
map

according to the distance from the closest CI. The distance information has then to
be computed (Fig. 6.5).
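A minimal sketch of such a directional edge detector is given below for a horizontal camera shift: only the edge side where the depth falls off is marked, using a single threshold instead of the hysteresis described above. The direction of the test and the depth convention (larger value = nearer) are assumptions that depend on the actual setup.

```python
import numpy as np

def contours_of_interest(Z, threshold=10):
    """Directional depth-edge detector (sketch, single threshold, no hysteresis).
    Marks only one side of each discontinuity: pixels whose right-hand neighbor is
    significantly farther away, assuming a horizontal shift of the virtual camera
    and the convention 'larger depth value = nearer'."""
    Z = Z.astype(np.float32)
    ci = np.zeros(Z.shape, dtype=bool)
    drop = Z[:, :-1] - Z[:, 1:]            # positive where depth falls off to the right
    ci[:, :-1] = drop > threshold
    return ci
```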

6.3.3 Distance Map


Distance map computation is commonly used in shape analysis to generate, for
example, skeletons of objects [13]. Here, we propose utilizing the distance map
computation to infer the shortest distance from a pixel location to a CI. The
distance information is then utilized as a weight to adapt the smoothing filters
introduced previously. In a distance map context, a zero value indicates that the
pixel belongs to a CI. Subsequently, non-zero values represent the shortest distances from a pixel location to a CI. Among all possible discrete distances, we only
consider in this study the city-block (or 4-neighbors) distance d(·, ·), a special case
of the Minkowski distance.2 The city-block distance is defined for two pixels
p1 = (u1, v1) and p2 = (u2, v2) by:

$$
d(p_1, p_2) = |u_1 - u_2| + |v_1 - v_2|,
\quad \text{where } p_1, p_2 \in \mathbb{Z}^2 .
\qquad (6.13)
$$

We define the distance map D of the depth map Z, with respect to the given input
CI, by the following function:

$$
D(u, v) = \min_{p \in CI} \{ d\big((u, v), p\big) \} .
\qquad (6.14)
$$

It is possible to take into account the spatial propagation of the distance, and
compute it successively from neighboring pixels within a reasonable computing
time, with an average complexity that is linear in the number of pixels. The propagation
of the distance relies on the assumption that it is possible to deduce the distance of

2 For two 2D points (u1, v1) and (u2, v2), the Minkowski distance of order k is defined as
$\sqrt[k]{|u_1 - u_2|^k + |v_1 - v_2|^k}$.


a pixel location from the values of its neighbors, which fits both sequential and
parallel algorithms well. One example of a distance map is shown in Fig. 6.6.
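This propagation can be realized with the classical two-pass scan shown below for the city-block distance of Eq. 6.13 (a sketch; a large constant plays the role of infinity for non-CI pixels).

```python
import numpy as np

def cityblock_distance_map(ci_mask):
    """Two-pass city-block distance transform to the nearest CI pixel (Eq. 6.13)."""
    h, w = ci_mask.shape
    big = h + w                      # upper bound on any city-block distance in the image
    D = np.where(ci_mask, 0, big).astype(np.int32)
    # Forward pass: propagate distances from the top-left neighbors.
    for v in range(h):
        for u in range(w):
            if v > 0:
                D[v, u] = min(D[v, u], D[v - 1, u] + 1)
            if u > 0:
                D[v, u] = min(D[v, u], D[v, u - 1] + 1)
    # Backward pass: propagate distances from the bottom-right neighbors.
    for v in range(h - 1, -1, -1):
        for u in range(w - 1, -1, -1):
            if v < h - 1:
                D[v, u] = min(D[v, u], D[v + 1, u] + 1)
            if u < w - 1:
                D[v, u] = min(D[v, u], D[v, u + 1] + 1)
    return D
```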

6.3.4 Edge Distance-Dependent Filtering


Given the per-pixel distance information (obtained in the previous subsection), it
becomes feasible to compute an adaptive filter capable of shaping itself according to
the CI neighborhood. The result is the ability to apply a strong smoothing near a
CI, and a lighter one as the distance increases. The new depth value in the depth
map Z̃ at the pixel location (u, v) is then defined by:

$$
\tilde{Z}(u, v) = \alpha(u, v) \cdot Z(u, v)
+ \big(1 - \alpha(u, v)\big) \cdot (Z * h_{2D})(u, v),
\qquad (6.15)
$$

where (u, v) ∈ Z, and with

$$
\alpha(u, v) = \frac{\min\{ D(u, v), D_{\max} \}}{D_{\max}},
$$

where α ∈ [0, 1], normalized by the maximum distance Dmax, controls the
smoothing impact on the depth map by means of the distance map D. The quality
of the depth map is thus preserved for the regions far from the object boundaries.
Note that, over the whole discrete image support, the case α = 1 corresponds to not
filtering the depth map, whereas the case α = 0 refers to applying a smoothing filter of equal strength all over the depth map. The function h2D is the impulse
response of one of the smoothing filters previously described.
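Putting the pieces together, Eq. 6.15 reduces to a per-pixel blend between the original depth map and a fully smoothed version of it, weighted by the distance map; the helper below assumes the smoothed map and the distance map have already been computed with any of the filters and the distance transform described above.

```python
import numpy as np

def ci_dependent_filtering(Z, smoothed_Z, D, D_max=50):
    """Edge distance-dependent depth filtering (Eq. 6.15, sketch).
    Z          : original depth map
    smoothed_Z : Z filtered with one of the smoothing filters of Sect. 6.3.1
    D          : distance map to the contours of interest (Sect. 6.3.3)
    """
    alpha = np.minimum(D, D_max).astype(np.float32) / D_max   # 0 at a CI, 1 far away
    return alpha * Z + (1.0 - alpha) * smoothed_Z
```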

6.3.5 Experimental Results


For the experiments, we have considered the ATTEST video-plus-depth test
sequence Interview (720 × 576, 25 fps) [14]. The experimental parameters for
the camera are tx = 60 mm for the horizontal camera translation and f = 200 mm
for the focal length. Concerning the smoothing process, the parameters σ and Dmax
have been chosen as 20 and 50, respectively. In theory, the Gaussian distribution is
nonzero everywhere, which would require an infinitely large convolution kernel,
but in practice the discrete convolution depends on the kernel width w and on σ. It is
common to choose w = 3σ or w = 5σ; in our experiments we set w to 3σ.
We can see in Fig. 6.7 examples of preprocessed depth maps obtained with an
all-smoothing strategy and with the proposed CI-dependent filter. The Gaussian filter is
used as the smoothing filter. While the all-smoothing strategy smoothes the whole depth map
uniformly, the proposed CI-dependent approach focuses only on the areas likely
to be revealed in the synthesized view. As a consequence, fewer depth-filtering-induced distortions are introduced in the warped picture (see Fig. 6.8), and in the


Fig. 6.7 Example of different depth prefiltering strategies. (a) original (b) all-smoothing
(c) CI-dependent

Fig. 6.8 Error comparison between the original depth map and the preprocessed one. Gray color
corresponds to a no-error value. (a) all-smoothing (b) CI-dependent

Fig. 6.9 Novel synthesized images after depth preprocessing. (top) Disocclusion removal is
achieved by depth preprocessing. (bottom) Right-side information is preserved and vertical lines
do not bend, due to better depth data preservation. (a) original (b) all-smoothing (c) CI-dependent


meantime the disoccluded regions are removed in the warped image, as we can see in
Fig. 6.9. In addition, a major gain of the adaptive approach comes from the
preservation of the non-disoccluded regions. Indeed, by introducing the concept of
CI, unnecessary filtering-induced distortion is limited.
6.3.5.1 PSNR Comparison
Let us consider the original depth map Z and the preprocessed depth map Z̃ and,
also, the warped pictures Ivirt and Ĩvirt obtained, respectively, from Z and Z̃. In order
to measure the filtering-induced distortion in the depth map and in the warped
pictures, we define two objective peak signal-to-noise ratio (PSNR) measurements
that take place before and after the 3D image warping (see Fig. 6.10), as follows:

$$
\mathrm{PSNR}_{depth} = \mathrm{PSNR}\big(Z, \tilde{Z}\big)
\quad \text{and} \quad
\mathrm{PSNR}_{virt} = \mathrm{PSNR}\big(I_{virt}, \tilde{I}_{virt}\big)\Big|_{D \setminus (O \,\cup\, \tilde{O})}
$$

where D ⊂ ℕ² is the discrete image support, and O and Õ are the disocclusion
image supports of the 3D image warping using Z and Z̃, respectively.

PSNRdepth is calculated between the original depth map and the filtered one.
Hence, PSNRdepth only considers the filtering artifacts in the depth map; it does not,
however, reflect the overall quality of the synthesized view. PSNRvirt is then
calculated between the view warped with the original depth map and the view warped
with the preprocessed one. In order not to introduce the warping-induced distortion
into the PSNRvirt measurement, PSNRvirt is computed only on the non-disoccluded
areas D \ (O ∪ Õ).
We can observe in Fig. 6.11 the important quality improvement obtained with
the proposed method. Subjectively, one can also notice less degradation in
the reconstructed images, owing to the fact that the proposed method preserves more
details in the depth map.

6.4 Video Inpainting Aided by Depth Information


In this section, in contrast to the previous one, we consider a large baseline
camera setup, which corresponds more to Free Viewpoint Video (FVV) applications than to stereoscopic vision. As a consequence, the disoccluded areas
become larger in the warped image. One solution, suggested by Tauber et al. [12],
consists of combining 3D image warping with inpainting techniques to deal with
large disocclusions, due to the natural similarity between damaged holes in
paintings and disocclusions in 3D image warping (Fig. 6.12).
Image inpainting, also known as image completion [15], aims at filling in pixels
in a large missing region of an image with the surrounding information. Image and
video inpainting serve a wide range of applications, such as removing overlaid text
and logos, restoring scans of deteriorated images by removing scratches or stains,
image compression, or creating artistic effects. State-of-the-art methods are


Fig. 6.10 Practical disposition of the two PSNR computations

Fig. 6.11 PSNR comparison (a) depth video (b) synthesized view

broadly classified as structural inpainting or as textural inpainting. Structural inpainting reconstructs using prior assumptions about the smoothness of structures in
the missing regions and boundary conditions, while textural inpainting considers
only the available data from texture exemplars or other templates in non-missing
regions. Initially introduced by Bertalmio et al. [16], structural inpainting uses
either isotropic diffusion or more complex anisotropic diffusion to propagate
boundary data in the isophote3 direction, and prior assumptions about the
smoothness of structures in the missing regions. Textural inpainting considers a
statistical or template knowledge of patterns inside the missing regions, commonly
modeled by markov random fields (MRF). Thus, Levin et al. suggest in [17]
extracting some relevant statistics about the known part of the image, and combine
them in a MRF framework. Besides spatial image inpainting, other works that
combine both spatial and temporal consistency can be found in the literature [18, 19].
In this chapter, we start from the Criminisi et al. work [20], in which they attempted to

3 Isophotes are level lines of equal gray levels. Mathematically, the isophote direction can be
interpreted as $\nabla^{\perp} I$, where $\nabla^{\perp} = (-\partial_y, \partial_x)$ is the direction of the smallest change.


Fig. 6.12 Removing large objects from photographs using Criminisi's inpainting algorithm.
(a) Original image, (b) the target region (10 % of the total image area) has been blanked out,
(c–e) intermediate stages of the filling process, (f) the target region has been completely filled
and the selected object removed (from [20])

combine the advantages of structural and textural inpainting by using a very insightful
principle, whereby the texture is inpainted in the isophote direction according to its
strength. We propose extending this idea by adding depth information to distinguish
pixels belonging to the foreground from those belonging to the background. Let us first briefly review
Criminisi's inpainting algorithm.

6.4.1 Criminisi's Inpainting Algorithm


Criminisi et al. [20] first reported that exemplar-based texture synthesis contains
the process necessary to replicate both texture and structure. They used the sampling concept from Efros and Leung's approach [21], and demonstrated that the
quality of the output image synthesis is greatly influenced by the order in which
the inpainting is processed.

Let us first define some notation. Given the input image I and the missing
region Ω, the source region Φ is defined as Φ = I − Ω (see Fig. 6.13). Criminisi's
algorithm consists in using a best-first filling strategy that entirely depends on the
priority values assigned to each patch on the boundary δΩ. Given the patch
Ψp centered at a point location p ∈ δΩ (see Fig. 6.13), the patch priority P(p) is
defined as the product of two terms:

$$
P(p) = C(p) \cdot D(p),
\qquad (6.16)
$$

where C(p) is the confidence term that indicates the reliability of the current patch,
and D(p) is the data term that gives special priority to the isophote direction.
These terms are defined as follows:


Fig. 6.13 Notation diagram (from [20])

$$
C(p) = \frac{1}{|\Psi_p|} \sum_{q \in \Psi_p \cap \Phi} C(q)
\quad \text{and} \quad
D(p) = \frac{\big| \nabla^{\perp} I_p \cdot n_p \big|}{\alpha},
\qquad (6.17)
$$

where |Ψp| is the area of Ψp (in terms of the number of pixels within the patch Ψp), α is a
normalization factor (e.g., α = 255 for a typical gray-level image), np is a unit
vector orthogonal to δΩ at point p, and ∇⊥ = (−∂y, ∂x) gives the isophote
direction. C(q) represents the percentage of non-missing pixels in patch Ψp and is
set at initialization to C(q) = 0 for pixels in Ω and C(q) = 1 everywhere
else. Once all the priorities on δΩ are computed, a block-matching algorithm
derives the best exemplar Ψq̂ to fill in the missing pixels under the highest-priority
patch Ψp̂, previously selected, as follows:

$$
\Psi_{\hat{q}} = \arg\min_{\Psi_q \subset \Phi} d\big(\Psi_{\hat{p}}, \Psi_q\big),
\qquad (6.18)
$$

where d(·, ·) is the distance between two patches, defined as the sum of squared
differences (SSD). After finding the optimal source exemplar Ψq̂, the value of each
pixel-to-be-filled p̂ ∈ Ψp̂ ∩ Ω is copied from its corresponding pixel in Ψq̂. After
the patch Ψp̂ has been filled, the confidence term C(p) is updated as follows:

$$
C(p) = C(\hat{p}), \quad \forall p \in \Psi_{\hat{p}} \cap \Omega .
\qquad (6.19)
$$
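The priority computation of Eqs. 6.16 and 6.17 can be sketched as follows for a single boundary pixel, using finite differences for the isophote and the gradient of the missing-region mask for the front normal; these are common approximations and border handling is ignored, so this is an illustration rather than the reference implementation of [20].

```python
import numpy as np

def patch_priority(confidence, gray, mask_missing, p, half=4, alpha=255.0):
    """Priority P(p) = C(p) * D(p) of a patch centred at boundary pixel p = (row, col).
    Sketch only: assumes p lies at least `half` pixels away from the image border."""
    y, x = p
    ys, xs = slice(y - half, y + half + 1), slice(x - half, x + half + 1)
    known = ~mask_missing[ys, xs]
    C_p = confidence[ys, xs][known].sum() / known.size        # Eq. (6.17), confidence term
    # Isophote: image gradient rotated by 90 degrees (finite differences).
    gy, gx = np.gradient(gray.astype(np.float32))
    isophote = np.array([-gx[y, x], gy[y, x]])
    # Fill-front normal: normalized gradient of the (0/1) missing-region mask.
    my, mx = np.gradient(mask_missing.astype(np.float32))
    n = np.array([my[y, x], mx[y, x]])
    n = n / (np.linalg.norm(n) + 1e-8)
    D_p = abs(isophote @ n) / alpha                           # Eq. (6.17), data term
    return C_p * D_p                                          # Eq. (6.16)
```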

6.4.2 Depth-Aided Texture and Structure Propagation


Despite the conceptual similarities between missing areas in paintings and disocclusions from 3D IBR, disocclusions present some specific a priori characteristics.
Disocclusions are the result of a displaced foreground object that reveals a part of the
background. Filling in the disoccluded regions with background pixels therefore
makes more sense than with foreground ones. In this way, Cheng et al. developed a view
synthesis framework [8], in which the depth information constrains the search range


of the texture matching, and then a trilateral filter utilizes the spatial and depth
information to filter the depth map, thus enhancing the view synthesis quality. In a
similar study, Oh et al. [22] proposed replacing the foreground boundaries with
background ones located on the opposite side. They intentionally manipulated the
disocclusion boundary so that it only contained pixels from the background; existing
inpainting techniques are then applied. Based on these works, we propose a
depth-aided texture inpainting method, built on the principles of Criminisi's algorithm, that
gives background pixels higher priority than foreground ones.
6.4.2.1 Priority Computation
Given the associated depth patch Zp in the targeted image plane, in our definition of
priority computation we suggest weighting the previous priority computation of
Eq. 6.16 by adding a third multiplicative term:

$$
P(p) = C(p) \cdot D(p) \cdot L(p),
\qquad (6.20)
$$

where L(p) is the level regularity term, defined as the inverse variance of the depth
patch Zp:

$$
L(p) = \frac{|Z_p|}{\,|Z_p| + \sum_{q \in \Psi_p \cap \Phi} \big( Z_q - \bar{Z}_q \big)^2\,},
\qquad (6.21)
$$

where |Zp| is the area of Zp (in terms of the number of pixels) and Z̄q is the mean value.
We then give more priority to a patch that lies at a single depth level,
which naturally favors background pixels over foreground ones.
6.4.2.2 Patch Matching
Considering the depth information, we update Eq. 6.18 as follows:

$$
\Psi_{\hat{q}} = \arg\min_{\Psi_q \subset \Phi}
\Big[ d\big(\Psi_{\hat{p}}, \Psi_q\big) + \beta \cdot d\big(Z_{\hat{p}}, Z_q\big) \Big],
\qquad (6.22)
$$

where the block-matching algorithm operates in the texture and depth domains
through the parameter β, which allows us to control the importance given to the depth
distance minimization. By updating the distance measure in this way, we favor the search for
patches located at the same depth level, which naturally makes more sense.
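The two depth-aided modifications can be sketched compactly as helpers on top of the previous fragment; beta plays the role of the weighting parameter of Eq. 6.22, and the base priority of Eq. 6.16 (for example from the patch_priority sketch in Sect. 6.4.1) is passed in.

```python
import numpy as np

def level_regularity(Z_patch, known):
    """L(p) of Eq. (6.21): inverse variance of the known depth samples in the patch."""
    z = Z_patch[known].astype(np.float32)
    return Z_patch.size / (Z_patch.size + np.sum((z - z.mean()) ** 2))

def depth_aided_priority(base_priority, Z_patch, known):
    """P(p) = C(p) * D(p) * L(p) of Eq. (6.20): multiply the Criminisi priority
    (Eq. 6.16) by the level regularity term of the associated depth patch."""
    return base_priority * level_regularity(Z_patch, known)

def depth_aided_patch_distance(tex_p, tex_q, Z_p, Z_q, known, beta=0.5):
    """d(Psi_p, Psi_q) + beta * d(Z_p, Z_q) of Eq. (6.22), SSD over the known pixels."""
    d_tex = np.sum((tex_p[known].astype(np.float32) - tex_q[known].astype(np.float32)) ** 2)
    d_dep = np.sum((Z_p[known].astype(np.float32) - Z_q[known].astype(np.float32)) ** 2)
    return d_tex + beta * d_dep
```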

6.4.3 Experimental Results


The multi-view video-plus-depth (MVD) sequence Ballet provided by
Microsoft [23] was used to test the proposed method. Calibration parameters are


Fig. 6.14 Recovery example of large disocclusions. (a) large disocclusions (b) conventional
inpainting (c) depth-aided inpainting

Fig. 6.15 Objective quality measure

supplied with the sequences. The depth video provided for each camera was
estimated via a color-based segmentation algorithm [24]. The Ballet sequence
represents two dancers at two different depth levels. Due to the large baseline
between the central camera and the side cameras, more disocclusions appear
during the 3D warping process.
Criminisi's algorithm [20] has been utilized to demonstrate the necessity of
considering the boundaries of the disocclusions. In the case of inpainting from
boundary pixels located in different parts of the scene (i.e., foreground and


background), it can be observed that the foreground part of the scene suffers from
strong geometric deformations, as shown in Fig. 6.14b. As we can see in
Fig. 6.14a, disocclusion boundaries belong to both the foreground and the background
part of the scene, which makes conventional inpainting methods less effective.
Comparing the two methods clearly demonstrates that the proposed
depth-based framework better preserves the contours of foreground objects and can
enhance the visual quality of the inpainted images. This is achieved by prioritizing
the propagation of texture and structure from background regions, while
conventional inpainting techniques, such as Criminisi's algorithm, make no
distinction between boundaries. The objective performance is investigated through the
quality-loss curves plotted in Fig. 6.15, measured by the PSNR; a significant
quality improvement can be observed.

6.5 Chapter Summary


This chapter has addressed the problem of restoring the disoccluded regions in the
novel synthesized view through two approaches that take place before and after
DIBR. While the preprocessing of the depth data may be enough for small disocclusions, hole filling better addresses larger ones.
Preprocessing the depth video allows reducing the number and the size of the
disoccluded regions. To that purpose, well-known smoothing filters can be utilized,
such as the average, Gaussian, and bilateral filters, which act as low-frequency
filters. Nonetheless, filtering the whole depth video inevitably results in a
degradation of the user's depth perception and in undesirable geometric distortions in
the synthesized view. To overcome this issue, we proposed limiting the filtering
area through an adaptive depth preprocessing that operates on the edges and takes
into account the distance to the edges. First, we apply a preliminary stage to extract
the depth edges capable of revealing disoccluded areas. This spatial information
then permits computing the weight information for the proposed filtering operation. As a result, it becomes feasible to apply a stronger smoothing near an edge
and a lighter one far from an edge. Unnecessary filtering-induced distortions are
then limited while the number and size of the disocclusions are reduced.
Nevertheless, after the preprocessing of the depth video, disocclusions may still
remain, mainly due to the large distance between cameras. A further stage, which
interpolates the missing values in the synthesized view, is then required. One
solution consists in utilizing inpainting techniques, since there is a natural similarity between damaged holes in paintings and disocclusions in view synthesis by
DIBR. Image inpainting techniques aim at filling in pixels in a large missing
region of an image with the surrounding information. An attempt to combine the
advantages of both structural and textural inpainting approaches has then been
proposed by using a very insightful principle, whereby the texture is inpainted in
the isophote direction according to its strength and the depth value. Pixels
belonging to either the foreground or the background are distinguished, which


allows disocclusion boundaries located in either the foreground or the background
part of the scene to be processed differently. Clearly indicating which disocclusion
contour is close to the object of interest and which one lies in the background
neighborhood significantly improves the inpainting algorithm in this context.
Acknowledgments This work is partially supported by the National Institute of Information and
Communications Technology (NICT), Strategic Information and Communications R&D Promotion
Programme (SCOPE) No. 101710002, Grant-in-Aid for Scientific Research No. 21200002 in Japan,
Funding Program for Next Generation World-Leading Researchers No. LR030 (Cabinet Office,
Government of Japan) in Japan, and the Japan Society for the Promotion of Science (JSPS) Program
for Foreign Researchers.

References
1. McMillan L Jr (1997) An image-based approach to three-dimensional computer graphics. PhD thesis, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
2. Tam WJ, Alain G, Zhang L, Martin T, Renaud R (2004) Smoothing depth maps for improved stereoscopic image quality. In: Proceedings of the SPIE international symposium ITCOM on three-dimensional TV, video and display III, Philadelphia, USA, vol 5599, pp 162–172
3. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV. IEEE Trans Broadcast 51(2):191–199
4. Chen W-Y, Chang Y-L, Lin S-F, Ding L-F, Chen L-G (2005) Efficient depth image based rendering with edge dependent depth filter and interpolation. In: Proceedings of the IEEE international conference on multimedia and expo (ICME), 6–8 July 2005, pp 1314–1317
5. Daribo I, Tillier C, Pesquet-Popescu B (2007) Distance dependent depth filtering in 3D warping for 3DTV. In: Proceedings of the IEEE workshop on multimedia signal processing (MMSP), Crete, Greece, Oct 2007, pp 312–315
6. Lee S-B, Ho Y-S (2009) Discontinuity-adaptive depth map filtering for 3D view generation. In: Proceedings of the 2nd international conference on immersive telecommunications (IMMERSCOM), ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium
7. Tomasi C, Manduchi R (1998) Bilateral filtering for gray and color images. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 839–846
8. Cheng C-M, Lin S-J, Lai S-H, Yang J-C (2008) Improved novel view synthesis from depth image with large baseline. In: Proceedings of the international conference on pattern recognition (ICPR), Tampa, FL, USA, Dec 2008, pp 1–4
9. Gangwal OP, Berretty RP (2009) Depth map post-processing for 3D-TV. In: Digest of technical papers, international conference on consumer electronics (ICCE), Jan 2009, pp 1–2
10. Mark WR, McMillan L, Bishop G (1997) Post-rendering 3D warping. In: Proceedings of the symposium on interactive 3D graphics (SI3D), ACM Press, New York, pp 7–16
11. Zhan-wei L, Ping A, Su-xing L, Zhao-yang Z (2007) Arbitrary view generation based on DIBR. In: Proceedings of the international symposium on intelligent signal processing and communication systems (ISPACS), pp 168–171
12. Tauber Z, Li Z-N, Drew MS (2007) Review and preview: disocclusion by inpainting for image-based rendering. IEEE Trans Syst Man Cybern Part C Appl Rev 37(4):527–540
13. Ge Y, Fitzpatrick JM (1996) On the generation of skeletons from discrete Euclidean distance maps. IEEE Trans Pattern Anal Mach Intell 18(11):1055–1066
14. Fehn C, Schüür K, Feldmann I, Kauff P, Smolic A (2002) Distribution of ATTEST test sequences for EE4 in MPEG 3DAV. ISO/IEC JTC1/SC29/WG11 Doc. M9219, Dec 2002


15. Furht B (2008) Encyclopedia of multimedia, 2nd edn. Springer, NY
16. Bertalmio M, Sapiro G, Caselles V, Ballester C (2000) Image inpainting. In: Proceedings of the annual conference on computer graphics and interactive techniques (SIGGRAPH), New Orleans, USA, pp 417–424
17. Levin A, Zomet A, Weiss Y (2003) Learning how to inpaint from global image statistics. In: Proceedings of the IEEE international conference on computer vision (ICCV), vol 1, Nice, France, Oct 2003, pp 305–312
18. Chen K-Y, Tsung P-K, Lin P-C, Yang H-J, Chen L-G (2010) Hybrid motion/depth-oriented inpainting for virtual view synthesis in multiview applications. In: Proceedings of the true vision - capture, transmission and display of 3D video (3DTV-CON), Tampere, Finland, June 2010, pp 1–4
19. Ndjiki-Nya P, Köppel M, Doshkov D, Lakshman H, Merkle P, Müller K, Wiegand T (2011) Depth image-based rendering with advanced texture synthesis for 3D video. IEEE Trans Multimedia 13(3):453–465
20. Criminisi A, Perez P, Toyama K (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Trans Image Process 13(9):1200–1212
21. Efros AA, Leung TK (1999) Texture synthesis by non-parametric sampling. In: Proceedings of the IEEE international conference on computer vision (ICCV), vol 2, Kerkyra, Greece, Sept 1999, pp 1033–1038
22. Oh K-J, Yea S, Ho Y-S (2009) Hole filling method using depth based in-painting for view synthesis in free viewpoint television and 3D video. In: Proceedings of the picture coding symposium (PCS), Chicago, IL, USA, May 2009, pp 1–4
23. Microsoft sequences Ballet and Breakdancers (2004) [Online] Available: http://research.microsoft.com/en-us/um/people/sbkang/3dvideodownload/
24. Zitnick CL, Kang SB, Uyttendaele M, Winder S, Szeliski R (2004) High-quality video view interpolation using a layered representation. In: Proceedings of the annual conference on computer graphics and interactive techniques (SIGGRAPH), vol 23(3), Aug 2004, pp 600–608

Chapter 7
LDV Generation from Multi-View Hybrid Image and Depth Video
Anatol Frick and Reinhard Koch

Abstract The technology around 3D-TV is evolving rapidly. There are already
different stereo displays available and auto-stereoscopic displays promise 3D
without glasses in the near future. All of the commercially available content today
is purely image-based. Depth-based content on the other hand provides better
flexibility and scalability regarding future 3D-TV requirements and in the long
term is considered to be a better alternative for 3D-TV production. However, depth
estimation is a difficult process, which threatens to become the main bottleneck in
the whole production chain. There are already different sophisticated depth-based
formats such as LDV (layered depth video) or MVD (multi-view video plus depth)
available, but no reliable production techniques for these formats exist today.
Usually camera systems, consisting of multiple color cameras, are used for capturing. These systems however rely on stereo matching for depth estimation, which
often fails in the presence of repetitive patterns or textureless regions. Newer, hybrid
systems offer a better alternative here. Hybrid systems incorporate active sensors
in the depth estimation process and allow overcoming difficulties of the standard
multi-camera systems. In this chapter a complete production chain for the 2-layer
LDV format, based on a hybrid camera system of 5 color cameras and 2 time-of-flight cameras, is presented. It includes real-time preview capabilities for
quality control during the shooting and post-production algorithms to generate
high-quality LDV content consisting of foreground and occlusion layers.

A. Frick (✉) · R. Koch
Computer Science Department, Christian-Albrechts-University of Kiel,
Hermann-Rodewald-Street 3, 24118 Kiel, Germany
e-mail: africk@mip.informatik.uni-kiel.de
R. Koch
e-mail: rk@mip.informatik.uni-kiel.de



Keywords 3D-TV · Alignment · Depth estimation · Depth image-based rendering (DIBR) · Foreground layer · Grab-cut · Hybrid camera system · Layered depth video (LDV) · LDV compliant capturing · Multi-view video plus depth (MVD) · Occlusion layer · Post-production · Bilateral filtering · Stereo matching · Thresholding · Time-of-flight (ToF) camera · Warping





7.1 Introduction
The 3D-TV production pipeline consists of three main blocks: content acquisition,
transmission, and display. While all of the blocks are important and the display
technology certainly plays a crucial role in user acceptance, the content acquisition
is currently the most problematic part in the whole pipeline.
Most of the content today is shot with standard stereo systems of two cameras,
often combined through a mirror system to provide narrow baseline, and is purely
image-based. While relatively easy to produce, such content does not fulfill all of
today's 3D-TV requirements and will certainly not fulfill those of future 3D-TV.
The main drawbacks are its poor scalability across different display types and its
inflexibility with respect to modifications. For
example, the content shot for a big cinema display cannot be easily adapted to a
small display in the home environment, without distorting or destroying the 3D
effect. This makes it necessary to consider all possible target geometries during the
shooting, for example, by using several baselines, one for the cinema and one for
the home environment. Such practices however make the content acquisition more
complex and expensive. Moreover, the 3D effect cannot easily be changed after capturing; it is, for example, not possible to adapt the content to a
viewer's personal settings, such as the distance to the screen or the eye distance. Another
problem is that the content can only be viewed on a stereo display. It is not
possible to view the content on a multi-view display, because only two views are
available. The naive solution would be to shoot the content with as many cameras
as views are required, but different displays require different numbers of views, so
such content would not scale well either. Additionally, the complexity and cost of
shooting with a multi-camera rig grow with the number of cameras, which makes this approach
infeasible for real production.
An alternative to the pure image-based content is depth-based content, where
additional information in the form of depth images, which describe the scene geometry, is
used. With this information, virtual views can be constructed from the existing ones
using depth-image-based rendering methods [1–3]. This makes this kind of content
independent of display type and offers flexibility for later modifications. One can
produce a variable number of views, required by the specific displays, and render these
views according to user preferences and display geometry. Depth estimation however
is a difficult problem. Traditional capture systems use multiple color cameras and rely
on stereo matching [4] for depth estimation. However, although much progress was


achieved in this area in recent years, stereo matching still remains unreliable in the
presence of repetitive patterns and textureless regions, and it is slow if acceptable quality
is required. Hybrid systems [5, 6], by contrast, use additional active sensors for
depth measurement. Such sensors are for instance laser scanners or time-of-flight
(TOF) cameras. While laser scanners are restricted to static scenes only, the operational
principle of ToF cameras allows handling of dynamic scenes, which makes them very
suitable for 3D-TV production. ToF cameras measure distance to the scene points by
emitting infrared light and by determining the time this light needs to travel to the
object and back to the camera. Different types of ToF cameras are
available; some are based on pulse measurements and some on the correlation between the
emitted and reflected light. The current resolution of ToF cameras varies between
64 × 48 and 204 × 204 pixels. ToF cameras measure depth from a single point
of view. Therefore, the accuracy of the depth measurement does not depend on the
resolution of the cameras or a baseline between the cameras as is the case with stereo
matching. In addition to the depth images ToF cameras provide a reflectance image,
which can be used to estimate the intrinsic and extrinsic camera parameters, allowing a
straightforward integration of ToF cameras in a multi-camera setup. However, the
operational range of ToF cameras is limited, for example, to about 7.5 m for correlation-based cameras, so that a 3D-TV system using time-of-flight cameras is typically
limited to indoor scenarios. ToF cameras also suffer from low resolution and low
signal-to-noise ratio, so that for high-quality results additional processing is required.
For more information on ToF cameras refer to [7–9].
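The distance measurement of a correlation-based ToF camera can be illustrated with a few lines of code. The following Python sketch is not part of the described system; the 20 MHz modulation frequency is an assumed example value, which yields roughly the 7.5 m unambiguous range mentioned above.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def phase_to_distance(phase, f_mod=20e6):
    """Convert the measured phase shift (radians) of a correlation ToF
    camera into a distance: the emitted infrared light is modulated with
    frequency f_mod, and the phase shift between emitted and reflected
    signal encodes the round-trip time, so d = c * phase / (4 * pi * f_mod)."""
    return C * np.asarray(phase) / (4.0 * np.pi * f_mod)

# Unambiguous range at 20 MHz: c / (2 * f_mod) = 7.5 m, matching the indoor
# limitation mentioned above; a phase of pi corresponds to half that range.
print(phase_to_distance(np.pi))  # ~3.75 m
```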
One of the first European research projects demonstrating the use of depth-based content in 3D-TV was ATTEST [1, 10]. This was also one of the first
projects to use a ToF camera (ZCam) in a 3D-TV capture system. The main goal of
ATTEST was to establish a complete production chain based on a video format
consisting of one color and corresponding depth image (single video plus depth).
However, with single video plus depth, no proper occlusion handling could be
performed, which results in decreasing quality of rendered views with increasing
view distance.
As a follow-up, the European project 3D4YOU [11] focused on developing a
3D-TV production chain for a more sophisticated LDV [11–15] format, extending
the work of ATTEST. Also the MVD [11–16] format and conversion techniques from
MVD to LDV were investigated. Both formats are a straightforward extension to
the single video plus depth format but conceptually quite different. MVD represents the scene from different viewpoints, each consisting of a color and a corresponding depth image. LDV represents the scene from a single viewpoint, the
so-called reference viewpoint. It consists of multiple layers representing different
parts of the scene, each consisting of a color and a corresponding depth image. The
first layer of the LDV format is called the foreground layer. All successive layers
are called occlusion layers and represent the parts of the scene, which are
uncovered by rendering a previous layer to the new viewpoint. While multiple
layers are possible, two are in most cases sufficient to produce high-quality results.
Therefore, in practice a 2-layer LDV format, consisting of one foreground and one
occlusion layer, is used.


Compared to LDV, MVD contains a lot of redundant information, because
normally large parts of the scene are seen from many different viewpoints. In the
LDV format on the contrary each of these parts would be ideally represented once
in a corresponding layer. Rendering from MVD is in general more complex than
rendering from LDV, because all of the occlusion layers have to be handled during
the rendering inside the multi-view display. LDV, however, has the relevant
occlusion information already contained in the occlusion layers, so that filling in
the missing scene parts by a viewpoint change becomes an easier process. LDV
can also be used in a simplified form, where only horizontal parallax information is
used. In this form rendering of novel views can be implemented through a simple
pixel shift along the baseline and is easily realized in hardware. In fact, in this form
LDV is the only depth-based format that has been implemented in an auto-stereoscopic display (the Philips WoW display [12, 13]). MVD can be converted to LDV by
choosing a reference view and by transforming depth and color from other views
to the reference view. The main drawback of LDV compared to MVD is that with
increasing distance to the reference view the quality of the rendered views may
decrease due to suboptimal sampling. With MVD, however, all of the original
views are still present, which allows choosing an optimal set of views for rendering.
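As a rough illustration of the horizontal pixel-shift rendering mentioned above, the following Python sketch renders a novel view from a single LDV layer. It assumes the depth has already been converted to a per-pixel disparity in pixels and that `alpha` scales the virtual baseline; it is an illustrative CPU loop, not the hardware implementation used in auto-stereoscopic displays.

```python
import numpy as np

def render_by_pixel_shift(color, disparity, alpha):
    """Render a novel view from one LDV layer by shifting every pixel
    horizontally by alpha * disparity (alpha scales the virtual baseline).
    Pixels are drawn from far to near so that foreground wins; unfilled
    pixels are the disocclusions to be taken from the occlusion layer."""
    h, w, _ = color.shape
    out = np.zeros_like(color)
    filled = np.zeros((h, w), dtype=bool)
    order = np.argsort(disparity, axis=1)          # far-to-near per scanline
    for y in range(h):
        for x in order[y]:
            xs = int(round(x + alpha * disparity[y, x]))
            if 0 <= xs < w:
                out[y, xs] = color[y, x]
                filled[y, xs] = True
    return out, ~filled                            # ~filled marks the holes
```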
In the remainder of this chapter a complete processing scheme for generation of
a 2-layer LDV format will be discussed. In Sect. 7.2, the camera rig will be
introduced and an overview of the capturing system given. Section 7.3 describes
the preview capabilities of the system. In Sect. 7.4, the approach for generation of
the 2-layer LDV is presented. Section 7.5 targets an important topic in depth-based
content creation, the depth discontinuities. Here two approaches for alignment of
depth discontinuities to color will be presented. Section 7.6 gives an overview and
conclusions.

7.2 Capture System Setup and Hardware


While designing a system for LDV compliant capturing, several issues have to be
taken into consideration. These are not only technical issues such as number and
positioning of cameras, but also issues about the usability of the system in an
actual shooting. All these issues are highly correlated and cannot be dealt with
separately.
LDV represents the scene from a single viewpoint, the reference viewpoint. The
reference viewpoint acts like a window through which the scene will be later
presented to the viewer and should be considered as the central viewpoint of the
system. While an LDV capture system can be designed with the reference viewpoint
to be purely virtual, better quality and usability can be achieved if the reference
viewpoint corresponds to a real camera.
The proposed system was developed explicitly for generation of 2-layer LDV
content and consists of 5 color cameras C1-C5 and 2 time-of-flight (ToF) cameras
T1 and T2 (see Fig. 7.1) [5, 17].


Fig. 7.1 Picture of the camera system: (left) real picture (right) schematic representation.
(Reproduced with permission from [5])

Fig. 7.2 (Left) A schematic representation of the ToF cameras with the reference camera in the
center. Left with parallel aligned optical axis, no full coverage of the reference camera view and
right with the ToF cameras rotated outwards, full coverage of the reference camera view. (Right)
A diagram representing the workflow of the system. (Reproduced with permission from [5])

The system is constructed modularly, consisting of a reference camera C5 in the
center and two side modules M1 and M2 to the left and right of the reference
camera. The reference camera is a Sony X300 with a resolution of 1920 × 1080
pixels. The side modules consist of two color cameras and one ToF camera. These
modules are placed 180 mm apart. Moreover, the ToF cameras and the reference
camera are positioned to approximately lie on a horizontal line, the so-called
baseline. The module color cameras are Point Grey Grasshopper cameras with a
resolution of 1600 × 1200 pixels. The ToF cameras are CamCube 3.0 cameras
from PmdTec with a resolution of 200 × 200 pixels. The ToF cameras and the
reference camera have different viewing frustums. In order to provide the optimal
coverage of the viewing frustum of the reference camera the modules can be
adjusted. In the used configuration the modules are slightly rotated outwards, so
that the coverage of the reference camera view in the range from 1.5 to 7.5 m can
be provided (see Figs. 7.2 and 7.3). The ability to adjust the modules to the view of
the reference camera makes the system flexible in the choice of the reference
camera or camera lens, which is an important point in an actual 3D-TV production.
Positioning the modules to the left and right of the reference camera allows estimating
depth and color information not only in the foreground but also in occlusion areas not
directly seen from the view of the reference camera. Additionally several baselines


Fig. 7.3 Image of the reference camera, together with the ToF depth images. Both ToF cameras
are rotated outwards to provide optimal scene coverage for the view of the reference camera.
(Reproduced with permission from [5])

are contained in this constellation: (C1, T1, C2), (C3, T2, C4), (C1, C3), (C2, C4),
(C1, C5, C4) and (C2, C5, C3). These baselines can be used to combine ToF depth
measurements with traditional stereo matching techniques.
Due to different baseline orientations, ambiguities in stereo matching can be
resolved better. The system was designed with the modules symmetrically around the
reference camera so that the occlusion information in the left and right direction can
be estimated by a symmetric process. This allows generation of virtual views from
the LDV content in both directions (like in the Philips WoW-display).
During the shooting one has to concentrate on the reference camera only,
while the other cameras play only a supporting role. This makes the 3D shooting
comparable to a standard 2D shooting.
To use the captured data for the LDV generation the camera system has to be
calibrated. Due to their poor signal-to-noise ratio, low resolution, and systematic
depth measurement errors, ToF cameras are difficult to calibrate with traditional
calibration methods. To provide a reliable calibration the approach from [9] can be
used, where ToF cameras are calibrated together with the color cameras incorporating depth measurements in the calibration process.
The camera system is operated from a single PC with Intel(R) Core(TM) i7-860
CPU and 8 GB RAM. The cameras C1, C2, C3, and C4 are connected through a
FireWire interface and deliver images in Bayer pattern format. The reference
camera C5 is connected through the HD-SDI interface and delivers images in
YUV format. The ToF cameras are connected through USB2.0 and provide phase
images, from which depth images can be calculated in an additional step.
The system's processing architecture is described in Fig. 7.2 (right). It consists
of active elements in form of arrows (processes) and passive elements, such as
image buffers, which are used for communication between the active elements.
Data processing starts with image acquisition, where the data from all cameras is
captured in parallel through multiple threads at 25 frames/s, which results in a
total data rate of ca. 300 MB/s. Following the image acquisition, two processes run
in parallel: the image conversion process and the save-to-disk process. The
image conversion process converts the color images to RGB and the phase images
from the ToF cameras to depth images. The save-to-disk process is responsible for
storing the captured data permanently. To handle the large throughput of data and


ensure the required frame rate, a solid-state drive (OCZ Z-Drive p84) of 256 GB
capacity, connected over an x8 PCI-Express interface, is used, with a writing speed of up
to 640 MB/s. This allows capturing ca. 14 min of content in a single shot. The
filtering process is an optional step. If the filtering process is running, a number of
predefined filters are applied to selected image streams, before copying images to
the preview buffer. Due to the poor signal-to-noise ratio of the PMD cameras, the
filtering step is important for further processing. For filtering, a 3 × 3 or 5 × 5
median filter can be currently applied to ToF depth images, but more sophisticated
filtering techniques are possible, for example, filtering over time [18, 19]. The
median filter also runs in parallel to ensure real-time performance. In the last step,
the data are transferred through the GPU-Processing process to the GPU memory, where multiple shaders are run for preview generation. Currently the GeForce
GTX 285 is used as the GPU. The whole system is capable of capturing and
simultaneously providing a 3D preview at 25 frames/s.

7.3 Live LDV Generation


Because the production of 3D content is very expensive, a preview capability of
the capturing system during the shooting becomes a very important issue. It is
essential to know on set whether the desired 3D effects have been achieved or whether additional
adjustment is necessary. Taking this into consideration a preview mechanism for
the foreground layer of the LDV format was implemented [18]. While the LDV
preview is not free of errors, it provides enough feedback to allow the assessment
of the quality of the captured data and of what can be achieved during a full-fledged
LDV generation process. In addition to the LDV preview, several other real-time preview mechanisms to guide the capturing process were implemented.
Figure 7.4 shows an overview of the preview system.
The preview processing starts with loading the images to the GPU memory in
form of textures. This is indicated in Fig. 7.4 by the arrow Load Textures. The
GPU memory is logically organized into four blocks: original, warped, interpolated, and filtered. Each part consists of a number of textures where the results of
different shader operations are stored. Images loaded from the CPU memory are
for example located in the original block. Below the GPU memory box one can see
the shader pipeline box. Shaders are programs executed on the GPU. Each shader
program has a memory block as input and one as output. The system allows
presenting the preview in different modes.
Currently the following preview mechanisms are available:
Original 2D images: Original color and depth images can be viewed on a
standard 2D display.
Overlay preview: The produced depth image is transparently overlaid with the
image of the central camera by alpha blending, providing feedback on edge
alignment and on the calibration quality between depth and color. The result can be
viewed on a 2D display.


Fig. 7.4 Overview of the preview system. (Reproduced with permission from [5])

Auto-stereoscopic 3D preview: The foreground layer of the LDV frame is
generated from the filtered depth image in the filtered memory block and the
original color image from the reference camera. The result can be directly
viewed on the auto-stereoscopic WoW-Display from Philips.
The rendering to the displays is performed by different shaders not included in
the shader pipeline. In the following the shader pipeline will be introduced in more
detail.

7.3.1 Warping
The purpose of warping is to transform the left and right ToF depth images into the
view of the reference camera. Changing the viewpoint, however, causes new
regions to become visible, the disocclusion regions, where no depth information is
available in the reference view. When warping both ToF cameras to the reference
view, such regions appear on the left and right side of the foreground objects, on
the right side for the left and on the left side for the right ToF camera (Fig. 7.5).
As one can clearly see from Fig. 7.5, both ToF cameras complement each other in
the overlap area, so that the disocclusion regions caused by the warping of the left
ToF image can be partially filled through depth information from the right ToF
image and vice versa. For the warping of the ToF depth images to the view of the
reference camera the method from [5, 7] is applied, where several depth images
can be warped to the view of the target camera by a 3D mesh warping technique.
Special care has to be taken when combining the different meshes from the left
and right ToF cameras. The problem when warping two meshes from different
viewpoints simultaneously is that the disocclusions caused through the viewpoint
change of one of the ToF cameras are filled linearly because of the triangle mesh
properties and cannot be filled correctly through the other ToF camera, causing
annoying rubber band effects on the object boundaries, see Fig. 7.6 (right). While
the artifacts may seem to be minor, they become worse with the shorter distance to


Fig. 7.5 ToF depth images warped to the reference camera. Left, ToF depth image warped from
the left ToF camera and right, ToF depth image warped from the right ToF camera. (Reproduced
with permission from [5])

Fig. 7.6 Combined ToF depth images warped to the view of the reference camera, left, with
mesh cutting and right, without mesh cutting. Notice the rubber effects, when no mesh cutting is
performed. (Reproduced with permission from [5])

the camera and are difficult to control. To avoid this, the mesh is cut on object
boundaries in a geometry shader before the transformation, by calculating the
normal for each triangle and discarding the triangle if the angle between the
normal and the viewing direction of the ToF camera exceeds a predefined
threshold.
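The mesh-cutting criterion can be sketched as follows. The snippet below is a Python/NumPy illustration of the per-triangle test (the actual system performs it in a geometry shader on the GPU); the 75° threshold is an assumed example value.

```python
import numpy as np

def keep_triangle(p0, p1, p2, view_dir, max_angle_deg=75.0):
    """Boundary test for one mesh triangle (three 3D points in the ToF
    camera frame).  The triangle is discarded when the angle between its
    normal and the ToF viewing direction exceeds a threshold, which removes
    the 'rubber band' triangles spanning depth discontinuities."""
    n = np.cross(p1 - p0, p2 - p0)
    n_norm = np.linalg.norm(n)
    if n_norm == 0.0:
        return False                               # degenerate triangle
    cos_angle = abs(np.dot(n, view_dir)) / (n_norm * np.linalg.norm(view_dir))
    return cos_angle >= np.cos(np.radians(max_angle_deg))
```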

7.3.2 Interpolation and Filtering


After the warping step, disocclusion regions can still be visible in some areas, see
Fig. 7.6 (left). To fill these regions with reasonable depth values two simple
methods are currently available in the system: linear interpolation and background
extrapolation. The user can freely switch between the two at runtime.
Activation of one of these methods during image acquisition ensures a completely
filled depth image for the central view in later processing steps. Figure 7.7 (left)
shows the depth image from Fig. 7.6 (left) after the background extrapolation.
After the interpolation step, depth and texture information is available only for
the foreground layer of the LDV format. Without the occlusion layer, however,
disocclusion regions cannot be filled correctly. This results in annoying artifacts
when rendering novel views, which can significantly disturb the 3D experience.


Fig. 7.7 (Left) Depth image after background extrapolation. (Right) Depth image from left after
low-pass filtering. (Reproduced with permission from [5])

To reduce these artifacts, the depth image can be low-pass filtered as proposed in [3],
where the authors use a Gaussian filter. In the proposed system an asymmetric box filter is used, because it provides results of sufficient quality at lower
computational cost. The result of filtering the depth image after the
interpolation step with an asymmetric box filter of size 20 × 30 is shown in
Fig. 7.7 (right). These data are fed to the rendering stage as a 1-layer LDV preview
and used to assess the quality of depth acquisition during shooting.
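The preview filtering step can be approximated with a few lines of Python, assuming SciPy is available; the 20 × 30 kernel corresponds to the size quoted above, and the separable box filter keeps the cost low, which is the motivation given for preferring it over a Gaussian.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def smooth_depth_asymmetric(depth, size=(20, 30)):
    """Low-pass filter the hole-filled preview depth with an asymmetric
    box filter (here 20 rows x 30 columns).  The box filter is separable
    and can be realized with running sums, which keeps the cost low
    compared to the Gaussian filter of [3]."""
    return uniform_filter(depth.astype(np.float32), size=size)
```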

7.4 Occlusion Layer Generation for 2-layer LDV


The presence of an occlusion layer is essential for high-quality 3D content.
Without an occlusion layer, invalid depth or, in the case of interpolation, rubber band
effects will occur on object boundaries, which can significantly decrease the
perceived quality. For the occlusion layer generation the modular camera setup
plays an important role, especially the chosen position of the side modules left and
right from the reference camera. Through this configuration the module cameras
capture depth and color information not only in the foreground, but also in regions
behind the foreground objects, hidden from the view of the reference camera.
Figure 7.8 shows an overview of the data provided by the system. The reference
camera provides color for the foreground layer. The color cameras of the modules
provide color for the occlusion layer and the ToF cameras provide depth for the
foreground and for the occlusion layer.
Notice that the person in the background is holding a mask. The mask is
completely hidden from the view of the reference camera, but is visible from the
view of the right module, which is providing both color and depth information for
it. Therefore, after a viewpoint change to the right, the missing information can be
correctly filled in (Fig. 7.10).
To create the LDV format, all the relevant data has to be transformed to the
view of the reference camera [17]. To reduce errors due to the large resolution
difference in conjunction with noise and the viewpoint change, the ToF depth
images are not directly transformed to the view of the reference camera, but are


Fig. 7.8 Data provided by the hybrid camera system. Notice that the modules provide not only
color and depth for the foreground but also for the occlusion layer

instead first warped to the corresponding color cameras in the modules. The high
resolution color images are then used to refine depth discontinuities of the warped
depth images, which is described in more detail in Sect. 7.5. This step has two
motivations. First, the module color cameras are much closer to the ToF
cameras than the reference camera is; therefore, errors due to the viewpoint change
are smaller and easier to correct. Second, through the alignment of color and
depth, more consistent occlusion layer information can be created.
To create the occlusion layer, data for three virtual views are created, consisting
of color and corresponding depth images (Fig. 7.9 (left)). The original reference
camera view is used as the central virtual view. Thereby the color image of the
reference camera is directly used, while the corresponding depth image is constructed from the refined depth images of both side modules. The left virtual view
is constructed from the refined depth images of the left-side module and the right
virtual view from the refined depth images of the right-side module. The transformation of depth from a module's corner cameras to one of the virtual views
is done by the 3D warping technique described in Sect. 7.3. The corresponding
color is transformed through backward mapping, from the virtual views to the
corner cameras. All virtual views have the same internal camera parameters as the
reference camera, i.e., resolution, aspect ratio, and so on. Notice, that the left and
right virtual views are incomplete. This is due to different viewing frustums of the
reference and the side module cameras. However, all the necessary information for


Fig. 7.9 (Left) Three virtual views. The depth image from the central virtual view is warped to
the left and right virtual view to get the occlusion information. Black regions in the warped depth
images correspond to disoccluded areas. (Right) Generated LDV Frame, with central virtual view
as foreground layer and occlusion layer cut out from the left and right virtual images and
transformed back to the central view

Fig. 7.10 Views rendered from the LDV frame from Fig. 7.9 (right). Center view is the original
view, left and right views are purely virtual. Notice that the right virtual view shows an object
held by the person in the background, which is completely occluded by the foreground person in
the original view

occlusion layer generation is contained. Depth and color are also well aligned by
means of the previously applied refinement step. Before further processing, the depth
for the central virtual view is again aligned to the corresponding color images
yielding the foreground layer of the LDV frame (Fig. 7.9 (right)). To create the
occlusion layer, the depth map from the foreground layer is warped to the left and
right virtual views, see Fig. 7.9 (left). By doing so, the disocclusion regions are
revealed. However, due to the viewpoint difference between C1–C4 and C5 all the
necessary information is now available in the left and right virtual views. The
relevant depth and texture information can be simply masked out by the disocclusion areas coded in black and transformed to the view of the reference camera.
This information is then combined into the occlusion layer. Note that when rendering a
virtual view from the LDV frame, the reverse operation is performed, so that
disocclusions will be filled with the correct texture information.
Figure 7.9 (right) shows the full LDV frame constructed with the proposed method.
From this LDV frame multiple views can then be rendered along the baseline.
Figure 7.10 shows three views rendered from the LDV frame in Fig. 7.9 (right).
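The occlusion-layer extraction described above can be summarized by the following Python sketch. The warping routines `warp_to_side` and `warp_to_central` are placeholders for the 3D mesh warping of Sect. 7.3 and are assumptions of this sketch, not functions of the actual system.

```python
import numpy as np

def occlusion_layer_from_side_view(fg_depth_central, side_color, side_depth,
                                   warp_to_side, warp_to_central):
    """Conceptual occlusion-layer extraction for one side view.
    warp_to_side / warp_to_central are placeholder warping routines
    (e.g. the 3D mesh warping of Sect. 7.3) and are assumptions of this
    sketch, not part of the described system's interface."""
    # 1. Warp the foreground-layer depth to the side virtual view; pixels
    #    that receive no depth are the disocclusions (coded black / zero).
    warped_depth = warp_to_side(fg_depth_central)
    disoccluded = warped_depth == 0
    # 2. Mask out color and depth of the side view at the disoccluded pixels.
    occ_color = np.where(disoccluded[..., None], side_color, 0)
    occ_depth = np.where(disoccluded, side_depth, 0)
    # 3. Transform the masked data back to the reference view, where it is
    #    combined into the occlusion layer of the LDV frame.
    return warp_to_central(occ_color), warp_to_central(occ_depth)
```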


7.5 Refinement of Depth Discontinuities


In high-quality LDV creation depth discontinuities play an important role. A
misalignment of depth and color discontinuities in an LDV frame can cause
spectators great discomfort and in general decreases the content quality dramatically. For the occlusion layer generation approach discussed in Sect. 7.4, well-aligned depth and color discontinuities are essential. Without precisely defined
object boundaries no consistency between foreground and occlusion layer can be
guaranteed. Different approaches for depth map refinement exist in the literature
[20–23]. In this section two approaches will be presented. The first is based on
bilateral filtering of a distance cost volume [20] and is purely local, the second is a
global approach consisting of two steps: binary segmentation via iterative graph
cuts [24] and a subsequent refinement through restricted bilateral filtering. Both
approaches use high resolution color images to guide the refinement process, but
the second approach does it in a global way, which also allows incorporation of
time consistency constraints.

7.5.1 Local Approach (Bilateral Filtering)


The approach for depth map refinement based on bilateral filtering was originally
proposed in [20]. For the refinement, the depth map $D(x, y)$ is considered as a surface
in the hypothesis space $H(d, x, y)$, with $d$ being the depth hypothesis for the pixel $(x, y)$,
see Fig. 7.11. From this space a cost volume $C(d, x, y)$, consisting of the squared
distances of the hypotheses $d$ to the surface $D(x, y)$ for each pixel $(x, y)$, is constructed:

$$C(d, x, y) = \min\big(a \cdot L,\ (D(x, y) - d)^2\big), \qquad (7.1)$$

with $L$ being the search range and $a$ a weighting constant.


The refinement is then performed iteratively through filtering the cost volume
using the bilateral filter constructed from the corresponding color image and by
updating the depth map according to the best cost hypothesis. The refinement
scheme consists of 4 steps:
1. Filter each d-slice (hypothesis plane) of the cost volume with the bilateral
filter constructed from the corresponding color image.
2. For each pixel $(x, y)$ find $\hat{d} = \arg\min_d C(d, x, y)$.
3. Perform parabolic fitting over $\hat{d} - 1$, $\hat{d}$, and $\hat{d} + 1$ and update $D(x, y)$ with the
parabola minimum (important for sub-pixel accuracy).
4. Update the cost volume based on the new depth map.
The bilateral filter used in step 1 is defined as follows:

$$B_{x,y}(u, v) = e^{-\frac{\lVert I(x, y) - I(x+u, y+v) \rVert}{\sigma_c}}\, e^{-\frac{\sqrt{u^2 + v^2}}{\sigma_s}}, \qquad (7.2)$$


Fig. 7.11 Cost volume with the depth surface $D(x, y)$

whereby $(x, y)$ is the pixel in the current d-slice, $(u, v)$ an offset, $I$ the corresponding
color image, and $\sigma_c$, $\sigma_s$ weighting constants for smoothness in color and space.
The filter is applied to a pixel $(x, y)$ on a d-slice in the cost volume as follows:

$$\hat{C}(d, x, y) = \frac{\sum_{(u, v) \in N} C(d, x + u, y + v)\, B_{x,y}(u, v)}{\sum_{(u, v) \in N} B_{x,y}(u, v)}, \qquad (7.3)$$

whereby $N$ is the neighborhood defined on the pixel grid and $\hat{C}(d, x, y)$ the updated
cost value. To reduce filtering complexity a separable approximation of the
bilateral filtering may be used [25]. Figure 7.12 (left) shows the result after
refinement of the depth image through the bilateral filter approach. To the right of the
refined depth image is the corresponding color image with the background removed (black). The background was removed by applying a threshold to the pixel
depth values: all pixels whose depth value was below the threshold were considered background and marked black. The bilateral filtering approach provides good results (compare the result to the originally warped depth image), but
has several limitations.
Due to the local nature of the bilateral filters some oversmoothing artifacts may
appear on the object boundaries if the color contrast is poor. Another problem is
that fine details, once lost, cannot be recovered. For example, notice that the arms
of the chair after refinement are still missing. Another problem occurs during
processing of video sequences. If all images in the sequence are processed separately, no temporal consistency can be guaranteed, which can result in annoying
flickering artifacts.
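For reference, a minimal (and deliberately slow) Python/NumPy sketch of the cost-volume refinement of Eqs. (7.1)–(7.3) is given below. It is not the authors' GPU implementation: the separable approximation of [25] and the parabolic sub-pixel fit are omitted, and all parameter values are illustrative.

```python
import numpy as np

def refine_depth(depth, color, d_values, a=0.05, radius=5,
                 sigma_c=10.0, sigma_s=5.0, iterations=2):
    """Iterative cost-volume refinement following Eqs. (7.1)-(7.3):
    build the truncated squared-distance cost volume, smooth every
    depth-hypothesis slice with a joint bilateral filter steered by the
    (H, W, 3) color image, and take the per-pixel cost minimum.
    The parabolic sub-pixel fit (step 3) is omitted for brevity."""
    depth = depth.astype(np.float32)
    d_values = np.asarray(d_values, dtype=np.float32)
    L = float(d_values[-1] - d_values[0])               # search range
    h, w = depth.shape
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-np.sqrt(xs ** 2 + ys ** 2) / sigma_s)  # spatial term of Eq. (7.2)

    for _ in range(iterations):
        # Eq. (7.1): truncated squared distance to the current depth surface
        cost = np.minimum(a * L, (depth[None] - d_values[:, None, None]) ** 2)
        filtered = np.empty_like(cost)
        for y in range(h):
            for x in range(w):
                y0, y1 = max(0, y - radius), min(h, y + radius + 1)
                x0, x1 = max(0, x - radius), min(w, x + radius + 1)
                diff = np.linalg.norm(
                    color[y0:y1, x0:x1].astype(np.float32) - color[y, x], axis=-1)
                weight = np.exp(-diff / sigma_c) * spatial[
                    y0 - y + radius:y1 - y + radius,
                    x0 - x + radius:x1 - x + radius]          # Eq. (7.2)
                filtered[:, y, x] = (cost[:, y0:y1, x0:x1] * weight).sum((1, 2)) \
                    / weight.sum()                            # Eq. (7.3)
        depth = d_values[np.argmin(filtered, axis=0)]         # best hypothesis per pixel
    return depth
```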

7.5.2 Global Approach (Grab-Cut)


To deal with problems like oversmoothing, loss of fine details, and temporal
inconsistency, an approach based on the powerful grab-cut optimization technique can
be used. In contrast to the bilateral filter approach it does not modify depth maps
directly, but first solves a binary segmentation problem separating foreground and


Fig. 7.12 (Left) Refined depth image, (Right) Color image segmented through depth thresholding. (Reproduced with permission from [5])

Fig. 7.13 (Left) Depth image corrected after Grab-Cut segmentation, (Right) result of Grab-Cut
segmentation

background in the scene. The transition between the foreground and background
regions then identifies the correct object borders. Using the foreground as a mask
the discontinuities in the depth map can be adjusted to the discontinuities defined
through the mask borders. Figure 7.13 (right) shows a segmentation result from the
same image as used for the local approach. The depth map to the left of the segmented
image was produced by cutting out the foreground depth corresponding to the
segmentation result and by filling the missing depth values from the neighborhood.
One can clearly see that no oversmoothing is present and also fine details like arms
of the chair could be restored. Another way to improve depth images, based on
segmentation results, is to apply a restricted bilateral filtering. This means that the
bilateral filtering is performed only on the mask. Therefore, precise object borders
stay preserved even in regions with poor local contrast. In the following, first a grab-cut segmentation in a video volume and a depth-based foreground extraction
technique are presented. Next an application of the presented foreground extraction
method to the depth refinement in the 2-layer LDV generation process is introduced. Figure 7.14 shows a diagram of the global refinement scheme for one depth
map and corresponding color image. Note that the application of the global
refinement to the 2-layer LDV generation process is a bit more complicated,
because of warping operations between the refinement steps and multiple depth
and color images involved (see the depth map refinement in Sect. 7.5.2.3).


Fig. 7.14 A diagram of the proposed global depth map refinement scheme

7.5.2.1 Initial Foreground Segmentation


Many applications, for example, in TV or 3D-TV, are concerned with foreground
extraction in indoor scenarios, where all objects inside a room are considered to be
foreground, and a room's walls, ceiling, and floor as background. What is
considered foreground and what background is, however, application
specific and cannot be answered in general. The proposed system is limited to
indoor scenarios by the operational range of the ToF cameras. Therefore, it can be
assumed that the scene is confined to a room. Under this assumption the outer
boundaries of the room, like the floor, the walls, or the ceiling are defined here as
the background and the interior as the foreground.
The basic idea behind the automatic initial foreground segmentation is simple:
threshold by defining a plane in space. All points behind the plane are then
background points and all points in front belong to the foreground. However, in the presence
of slanted surfaces like walls, the floor, or the ceiling, the separation with only one
plane is difficult and in some cases impossible. In order to overcome these limitations the proposed approach [24] uses multiple thresholding planes that are fitted
to bounding walls, which are positioned in the scene automatically. The approach
for the initial foreground extraction consists of 3 main steps: initial plane estimation, refinement of the plane orientations, and final thresholding.

Initial Plane Estimation


The thresholding planes should confine the interior of the room, so ideally the
orientation of a thresholding plane should correspond to the orientation of the floor, a
wall, or the ceiling. The naive approach would be to use all possible orientations in
the scene, which in practice is not feasible. Instead, one can restrict the set of possible
candidates to the set of normals found in the scene. It is important that the normals of
the walls, the floor, and the ceiling are contained in this restricted set, if present.
Surface normals in the scene can be estimated from the depth image. To estimate a
normal for a point in the depth image, each neighboring point in a window around the
current point is first projected into 3D space. After that, principal component
analysis (PCA) is applied. Principal component analysis is a common technique for
eigenvector-based multivariate data analysis and is often used as a least squares
estimate for multidimensional data sets. PCA is applied by calculating the covariance


Fig. 7.15 (Left) Result from the normal estimation through the PCA, color coded; (Right)
Normals for the pixels in the most outer histogram bin after refinement, color coded. (Reproduced
with permission from [24])

Fig. 7.16 (Left) Image from the reference camera; (Right) Combined ToF images, warped to the
view of the reference camera. (Reproduced with permission from [24])

matrix from the given point cloud and by performing singular value decomposition.
The calculated eigenvector with the smallest eigenvalue is then the plane normal in a
least squares sense. Figure 7.15 (left) shows the results from the normal estimation
through the PCA, from the depth image from Fig. 7.16 (right).
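The per-pixel normal estimation via PCA can be sketched as follows, assuming the depth image has already been back-projected to an (H, W, 3) array of 3D points; the window size and the orientation test are illustrative choices.

```python
import numpy as np

def estimate_normals(points, window=5):
    """Estimate a surface normal for every pixel of a ToF depth image.
    `points` is an (H, W, 3) array of back-projected 3D points; for each
    pixel the covariance of the points inside a local window is built and
    the eigenvector with the smallest eigenvalue (PCA) is taken as the
    least-squares plane normal.  Border pixels are skipped."""
    h, w, _ = points.shape
    r = window // 2
    normals = np.zeros_like(points, dtype=np.float32)
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = points[y - r:y + r + 1, x - r:x + r + 1].reshape(-1, 3)
            cov = np.cov(patch.T)                  # 3 x 3 covariance matrix
            eigval, eigvec = np.linalg.eigh(cov)   # eigenvalues in ascending order
            n = eigvec[:, 0]                       # smallest eigenvalue -> normal
            if np.dot(n, points[y, x]) > 0:        # orient towards the camera
                n = -n
            normals[y, x] = n
    return normals
```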
After normal estimation the set of normals is large; in the worst case the
number of different normals equals the product of the width and height of the depth image. To
further reduce the set of potential candidates for the plane orientations the clustering
method from [26] is used. The method performs iterative splitting of the clusters
orthogonal to their greatest variance axis. The clustering is performed hierarchically,
starting with one cluster constructed from all image points and proceeding to split a
cluster with the biggest variance until a maximal cluster number is reached. To
determine the optimal cluster number adaptively, the splitting is stopped if no cluster
with variance over a certain threshold exists. At the end the normals are oriented to
point to the inside of the room, which is important for later processing.

Refinement of the Plane Orientations


After the initial plane estimation step we have a discrete set of normals as candidates
for the orientations of the thresholding planes. Due to noisy data and possible errors in
depth measurements the normals of outer walls contained in the set are not


Fig. 7.17 (Left) A schematic representation of a room with objects; (Right) Histogram
corresponding to the schematic room to the left, with Bin0 being the most outer bin of the
histogram. (Reproduced with permission from [24])

guaranteed to be correct. Therefore an additional normal refinement step is performed. To increase robustness in the refinement process, the normals are ordered in
decreasing order based on the size of the corresponding clusters (biggest cluster first).
After that, all points in the image are projected into 3D space using the depth values from
the depth map, and the following steps are applied iteratively for each normal in the set:
A discrete histogram is built by projecting all 3D points in the scene to the line
defined through the normal. Because the normal is oriented inside the room the
first bin of the histogram (Bin0) will be the most outer bin in the room.
Therefore, if the normal corresponds to the normal of the outer wall, all the wall
points will be projected to the first bin (Fig. 7.17). The bin size is a manually set
parameter introduced to compensate for depth measurement errors and noise.
For the results in this section it was set to 20 cm.
Using RANSAC [27] a plane is estimated from the 3D points projected to the first bin.
After that, the points are classified into inliers and outliers, and an optimal fitting plane is
estimated from the inlier set using PCA. The cluster normal is then substituted
by the normal of the fitted plane. The idea here is to refine the normal orientation to be
perfectly perpendicular to the planes defined by the walls, ceiling, and floor.
The number of iterations in the refinement process has to be defined by the user,
but three iterations are generally sufficient. After the refinement, all 3D points in
the first bin are projected to the fitting plane of the last iteration. Figure 7.15 (right)
shows the normal vectors for the pixels corresponding to the 3D points in the first
bin for each refined normal. The pixels colored white correspond to the 3D points,
which are not lying in the first bin of any normal and are foreground candidates.

Final Thresholding
For the actual thresholding a histogram is constructed for each refined normal by
projecting all 3D points to the line defined by the normal. All the pixels corresponding to the 3D points in the first bin of the histogram (Bin0) are then removed
from the depth image. The bin size thereby defines the position of the thresholding plane along the histogram direction. Figure 7.18 (left) shows the result after
thresholding the depth image from Fig. 7.16 (right). To identify the foreground for


Fig. 7.18 (Left) Thresholded depth image; (Right) Trimap constructed from the binarized
thresholded depth image from the left. (Reproduced with permission from [24])

further processing a binary mask can be constructed from the thresholded depth
image, with 255 being the foreground and 0 being the background.
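The final thresholding along one refined normal can be summarized by the short Python sketch below; `bin_size` corresponds to the 20 cm bin mentioned earlier, and operating on back-projected 3D points is an assumption of this illustration.

```python
import numpy as np

def remove_outer_bin(points, valid, normal, bin_size=0.2):
    """Final thresholding along one refined, inward-pointing normal:
    project all 3D points onto the line defined by the normal and discard
    every valid pixel whose point falls into the outermost histogram bin
    (Bin0), i.e. onto the wall, floor, or ceiling.  `points` is (H, W, 3),
    `valid` a boolean mask of pixels with a depth measurement, and
    bin_size is 0.2 m (20 cm) as in the text."""
    proj = points @ normal                  # signed position along the normal
    bin0_start = proj[valid].min()          # outermost position in the room
    background = valid & (proj < bin0_start + bin_size)
    return valid & ~background              # remaining foreground candidates
```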
7.5.2.2 Grab-Cut: Segmentation in a Video Volume
The grab-cut algorithm was originally proposed in [28]. It requires a trimap, which divides
an image into definitive foreground, definitive background, and an uncertainty region, whereby
the definitive foreground region can be omitted in the initialization. Based on the provided
trimap, two Gaussian mixture color models, one for the foreground and one for the
background, are constructed. After that, all pixels in the uncertainty region are classified as
foreground or background by iteratively updating the corresponding color models. The
transition between the foreground and background pixels inside the uncertainty region
defines the object border. In order to work properly, the grab-cut algorithm requires that the
correct object borders be contained inside the uncertainty region and that the definitive
foreground and background regions contain no falsely classified pixels.
The approach presented here [24] extends the standard grab-cut algorithm to the
video volume. Similar to the standard grab-cut, trimaps are used to divide an image into
definitive foreground, definitive background, and an uncertainty region. The Gaussian
mixture color models are used to represent foreground and background regions, but
are built and updated for a batch $B$ of images simultaneously. To achieve temporal
consistency, a 3D graph is constructed from all images in $B$, connecting pixels in
different images through temporal edges. To handle the complexity, an image pyramid is used and only pixels in the uncertainty region are included in the segmentation process, as well as their direct neighbors in definitive foreground and
background. In the following the segmentation scheme is described in more detail.

Trimap Generation
In the original paper [28] the trimap creation is performed by the user, so that the
quality of the segmentation depends on the user's accuracy. Interactive trimap
generation is, however, a time-consuming operation, especially for long video
sequences. Based on the binary foreground mask $M^k$ from the initial foreground


segmentation step, the trimap $T^k$ can be created automatically for each image $k$ in
the batch $B$. For the automatic trimap generation, the morphological operations erosion and dilation are applied to each binary mask $M^k$. Let $E^k$ be the $k$-th binary
image after erosion, $D^k$ the $k$-th binary image after dilation, and $DD^k$ the binary
image after dilation applied to $D^k$. The definitive foreground for the $k$-th batch
image is then defined as $T_f^k = E^k$, the uncertainty region as $T_u^k = D^k - E^k$, and the
definitive background as $T_b^k = DD^k - D^k$. The trimap for the batch image $k$ is
defined as $T^k = T_f^k \cup T_b^k \cup T_u^k$. Additionally, a reduced trimap is defined as
$\tilde{T}^k = T_u^k \cup \{ p^k \in T_b^k \cup T_f^k \mid \exists\, q^k \in T_u^k : \operatorname{dist}(p^k, q^k) \le \sqrt{2} \}$, which comprises the
pixels in the uncertainty region together with their direct neighbors in background
and foreground. The number of erosion and dilation operations defines the size of
$T_u^k$, $T_b^k$, and $T_f^k$ in the trimap. Figure 7.18 (right) shows the generated trimap for the
thresholded image from Fig. 7.18 (left).
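The automatic trimap construction maps directly onto standard morphological operations, e.g. with SciPy as sketched below; the iteration counts are illustrative and control the band widths as described above.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def make_trimap(mask, n_erode=5, n_dilate=5):
    """Automatic trimap from a binary foreground mask M^k:
    T_f = E (eroded mask), T_u = D - E, T_b = DD - D, where D is the
    dilated mask and DD the dilation applied to D.  The iteration counts
    are illustrative and control the band widths."""
    E = binary_erosion(mask, iterations=n_erode)
    D = binary_dilation(mask, iterations=n_dilate)
    DD = binary_dilation(D, iterations=n_dilate)
    trimap = np.full(mask.shape, -1, dtype=np.int8)   # -1: not part of the trimap
    trimap[DD & ~D] = 0        # definitive background T_b
    trimap[D & ~E] = 1         # uncertainty region    T_u
    trimap[E] = 2              # definitive foreground T_f
    return trimap
```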
Hierarchical Segmentation Scheme
To reduce computational complexity the segmentation is performed in an image
pyramid of N levels. In the first level the original color images are used. In each
successive level the resolution of the images from the previous level is reduced by
a factor of 2. The segmentation starts at the lowest level $N$. In this level the
trimaps are generated from the thresholded depth images (Fig. 7.18) as described
before. In each level $j < N$ the segmentation result from level $j + 1 \le N$ is then
used for trimap creation. The sizes of the regions $T_u$, $T_b$, and $T_f$ in level $N$
should be set appropriately to compensate for depth measurement errors. In each
level $j < N$ the sizes of the trimap regions are fixed to compensate for the upscaling.
All images in a level are processed in batches of size $|B|$ as follows.
For each batch B:
1. A trimap is created for each image Ik in the batch.
2. Using the definitive foreground and definitive background from all trimaps in
the batch, two Gaussian mixture color models are created, one for the foreground,
$GM_f$, and one for the background, $GM_b$. To create a Gaussian mixture model, the
clustering method from [26] with a specified variance threshold as a stopping
criterion is used (compare to the initial plane estimation). The clusters calculated
by this algorithm are used to determine the individual components of a Gaussian
mixture model. By using a variance threshold as a stopping criterion, the number
of clusters and hence the number of Gaussian components can be determined
automatically, instead of setting it to a fixed number as in the original paper [28].
3. A 3D graph is created from all batch images, and pixels in the uncertainty
regions are classified as foreground or background using graph-cut optimization
[29]. The classification of a pixel is stored in a map $A$, where $A(p^k) = FG$ if $p^k$
is classified as a foreground pixel and $A(p^k) = BG$ otherwise.


4. The color models are updated based on the new pixel classification.
5. Steps 3 and 4 are repeated until the number of pixels changing classification
is under a certain threshold or until a fixed number of iterations is reached.

Graph-Cut Optimization
The classification of the pixels in a video volume (batch $B$) into foreground and
background can be formulated as a binary labeling problem on the map $A$ and
expressed in the form of the following energy functional:

$$F(A, GM_f, GM_b) = V(A, GM_f, GM_b) + \lambda_S E_S(A) + \lambda_T E_T(A). \qquad (7.4)$$

$V(A, GM_f, GM_b)$ is the data term penalizing the pixel affiliation to the color
models and is defined as:

$$V(A, GM_f, GM_b) = \sum_{p^k \in \tilde{T}^k,\ k \le |B|} D(p^k, A, GM_f, GM_b). \qquad (7.5)$$

For $p^k \in T_u^k$, $D(p^k, A, GM_f, GM_b)$ is defined as

$$D(p^k, A, GM_f, GM_b) = \begin{cases} -\ln \Pr(p^k, GM_b), & A(p^k) = BG \\ -\ln \Pr(p^k, GM_f), & A(p^k) = FG \end{cases} \qquad (7.6)$$

and for $p^k \in T_f^k \cup T_b^k$, $D(p^k, A, GM_f, GM_b)$ is defined as

$$D(p^k, A, GM_f, GM_b) = \begin{cases} 0, & \big(A(p^k) = FG \wedge p^k \in T_f^k\big) \vee \big(A(p^k) = BG \wedge p^k \in T_b^k\big) \\ \infty, & \big(A(p^k) = BG \wedge p^k \in T_f^k\big) \vee \big(A(p^k) = FG \wedge p^k \in T_b^k\big) \end{cases} \qquad (7.7)$$

$E_S(A)$ and $E_T(A)$ are spatial and temporal smoothness terms, defined as:

$$E_S(A) = \sum_{(p^k, q^k) \in N_S} \exp\!\left(-\frac{\lVert I^k(p^k) - I^k(q^k) \rVert^2}{2\sigma_S^2}\right) \frac{\delta\big(A(p^k), A(q^k)\big)}{d(p^k, q^k)} \qquad (7.8)$$

$$E_T(A) = \sum_{(p^i, q^j) \in N_T} \exp\!\left(-\frac{\lVert I^i(p^i) - I^j(q^j) \rVert^2}{2\sigma_T^2}\right) \delta\big(A(p^i), A(q^j)\big) \qquad (7.9)$$

$N_S$ defines the spatial 4- or 8-neighborhood in the image plane and $N_T$ the
temporal neighborhood between two successive frames:

$$N_S = \left\{ (p^k, q^k) \,\middle|\, p^k, q^k \in \tilde{T}^k \wedge d(p^k, q^k) \le \sqrt{2} \right\} \qquad (7.10)$$

$$N_T = \left\{ (p^k, q^j) \,\middle|\, p^k \in \tilde{T}^k \wedge q^j \in \tilde{T}^j \wedge \big(p^k \in T_u^k \vee q^j \in T_u^j\big) \wedge d(p^k, q^j) \le n \right\} \qquad (7.11)$$

A pixel $p^k \in I^k$ is considered a temporal neighbor of a pixel $p^{k+1} \in I^{k+1}$ if the
distance between the pixel coordinates does not exceed a maximum distance $n$ and
either $p^k$ or $p^{k+1}$ belongs to an uncertainty region. For simplicity the neighborhood
is chosen as a rectangular window. The function $\delta(x, y)$ is 0 if $x = y$ and 1 otherwise.
$d(p^k, q^k)$ is the Euclidean distance between the two pixels in the image coordinate
space. The parameters $\lambda_S$ and $\lambda_T$ from Eq. (7.4) are weighting factors for
the balance between the data and smoothness terms, and the parameters $\sigma_S^2$ and $\sigma_T^2$ model the
spatial and temporal color variance.
The energy functional $F(A, GM_f, GM_b)$ can be mapped onto a 3D graph
(Fig. 7.19) and efficiently minimized using one of the min-cut/max-flow algorithms from the literature [30]. The 3D graph is defined as $G = (V, E)$, with

$$V = \bigcup_{k=0}^{|B|-1} \tilde{T}^k \cup \{s, t\}, \qquad (7.11)$$

$$E = N_S \cup N_T \cup \underbrace{\left\{ (s, p^k), (t, p^k) \,\middle|\, p^k \in V \setminus \{s, t\} \right\}}_{N_D}, \qquad (7.12)$$

whereby $s$ and $t$ are two additional nodes representing the background ($t$) and the
foreground ($s$). The capacity function for an edge $e = \{p, q\} \in E$ is then defined
as follows:
$$\operatorname{cap}(e) = \begin{cases}
\exp\!\big(-\lVert I^k(p) - I^k(q) \rVert^2 / 2\sigma_S^2\big), & e \in N_S \\
\exp\!\big(-\lVert I^k(p) - I^{k+1}(q) \rVert^2 / 2\sigma_T^2\big), & e \in N_T \\
-\ln \Pr(p, GM_b), & e \in N_D \wedge q = s \wedge p \in T_u^k \\
-\ln \Pr(p, GM_f), & e \in N_D \wedge q = t \wedge p \in T_u^k \\
0, & \big(q = s \wedge p \in T_b^k\big) \vee \big(q = t \wedge p \in T_f^k\big) \wedge e \in N_D \\
\infty, & \big(q = s \wedge p \in T_f^k\big) \vee \big(q = t \wedge p \in T_b^k\big) \wedge e \in N_D
\end{cases} \qquad (7.13)$$
To minimize the functional a minimum cut is calculated on the graph G using
the algorithm from [30], where the minimum cut capacity is equivalent to the minimum of the functional.
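As an illustration of how the capacity assignment translates into a min-cut problem, the following single-frame Python sketch builds the graph with the PyMaxflow package (assumed to be available; it implements the Boykov–Kolmogorov max-flow algorithm of [30]). The temporal edges of the full 3D graph are omitted, and `prob_fg`/`prob_bg` stand for the per-pixel likelihoods under the two Gaussian mixture models.

```python
import numpy as np
import maxflow  # PyMaxflow package (assumed available); wraps the BK max-flow of [30]

def segment_frame(color, prob_fg, prob_bg, trimap, lambda_s=1.0, sigma_s=10.0):
    """Single-frame graph-cut labelling (Eq. 7.4 without the temporal term).
    prob_fg / prob_bg are per-pixel likelihoods under the foreground and
    background Gaussian mixture models; trimap uses 0 = definitive
    background, 1 = uncertain, 2 = definitive foreground.  Pixels outside
    the reduced trimap are simply forced to background in this sketch."""
    h, w, _ = color.shape
    img = color.astype(np.float32)
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))
    inf = 1e9

    for y in range(h):
        for x in range(w):
            # spatial n-links to the right and bottom neighbour (4-neighbourhood)
            for dy, dx in ((0, 1), (1, 0)):
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    cap = lambda_s * np.exp(
                        -np.sum((img[y, x] - img[yy, xx]) ** 2) / (2 * sigma_s ** 2))
                    g.add_edge(nodes[y, x], nodes[yy, xx], cap, cap)
            # t-links following the capacity rules of Eq. (7.13); s = foreground terminal
            if trimap[y, x] == 1:
                g.add_tedge(nodes[y, x],
                            -np.log(prob_bg[y, x] + 1e-12),   # edge to s
                            -np.log(prob_fg[y, x] + 1e-12))   # edge to t
            elif trimap[y, x] == 2:
                g.add_tedge(nodes[y, x], inf, 0.0)            # forced foreground
            else:
                g.add_tedge(nodes[y, x], 0.0, inf)            # forced background

    g.maxflow()
    # get_grid_segments returns True for sink-side (background) nodes
    return ~g.get_grid_segments(nodes)
```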

Segmentation Results
For the experimental evaluation a sequence of 800 frames (color + depth) was
used.


Fig. 7.19 (Left) Grab-Cut segmentation result; (Right) Schematic representation of the 3D graph
for grab-cut segmentation. (Reproduced with permission from [24])

Fig. 7.20 Results from the video segmentation (frames: 0, 200, 430, 700). (Reproduced with
permission from [24])

Figure 7.20 shows the results after the grab-cut segmentation for images 0, 200,
430, and 700 of the sequence. One can see that in almost all cases the foreground was
reliably identified throughout the whole sequence. Some artifacts remain, for
example, small parts of the plant are missing, but most object borders are detected
accurately.
Notice that the camera was moved from left to right and back during the
capturing. This demonstrates that the proposed approach can be applied to general
dynamic scenes with non-stationary cameras. The results from the automatic segmentation were additionally compared to the manual segmentation results for
images 286–296. Without temporal edges the percentage of falsely classified pixels
compared to the manual segmentation is 1.97 %; with temporal edges the number of falsely


Fig. 7.21 From left to right: binary foreground mask from the left ToF depth image and from the
right ToF depth image; combined binary foreground mask warped to the view of the top color
camera of the left module

classified pixels decreases to 1.57 %. While the difference is only about 0.4 %, the
flickering on the object borders decreases significantly.

7.5.2.3 Depth Map Refinement


The refinement of depth images in the 2-layer LDV generation process is performed twice: first after warping of the ToF depth images to the module color
cameras and then again after generation of the central virtual view. In the local
approach the refinement is performed through bilateral filtering in the cost volume.
In the global approach, presented here, the refinement is performed in two steps. In
the first step, the foreground mask is constructed, based on depth image analysis
and grab-cut segmentation. In the second step, a constrained bilateral filtering in
the cost volume (see local approach) is performed using the foreground mask to
guide the filtering process. In the following the whole refinement scheme is
described in more detail.
The refinement starts with the initial foreground extraction in the view of the
ToF cameras using original ToF depth images. Figure 7.21 shows the results of the
initial foreground extraction.
Using corresponding depth values the produced binary foreground masks are
then transformed from the view of the ToF cameras to the views of the module
color cameras and combined into a unified foreground mask (Fig. 7.21).
After that, the segmentation is performed for each of the module color cameras.
Thereby the transformed masks are used for the trimap generation in the segmentation process (see Fig. 7.14). Figure 7.22 shows the top color image of the
left module (left) and refined binary foreground mask after the grab-cut segmentation (right). Compare the refined mask to the original mask in Fig. 7.21 (right).
One can clearly see that the refined mask fits the object boundaries more precisely.
Due to the temporal constraints in the segmentation process, temporally consistent object boundaries can be obtained.


Fig. 7.22 (Left) Color image from the top camera of the left module. (Right) Refined binary
foreground mask

Fig. 7.23 (Left) Depth image warped to the top color camera of the left module. (Right) Depth
image refined through restricted bilateral filtering

After an accurate foreground mask is extracted, the actual refinement of the
depth images is performed. To refine the depth images, the bilateral filtering of the
distance cost volume introduced in the local optimization approach is applied.
However, instead of operating on the whole image, the refinement process is
applied to the white-marked regions only. This restricted bilateral filtering effectively prevents oversmoothing on the object borders and supports the reconstruction of
fine details. Such filtering is performed twice. First the depth image is filtered with
the binary foreground mask. Second, the mask is inverted and the depth image is
filtered with the inverted binary mask (background mask).
This two-step filtering process is important, because the errors on the object
borders lie on the foreground as well as on the background. Filtering only the
foreground will leave border artifacts in the background; therefore, object
boundaries will not be correctly aligned in depth and color. Figure 7.23 shows the
depth image of the top camera of the left module before and after refinement.
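The following Python sketch illustrates the restriction idea on a depth map guided by a grayscale image. It is a simplification of the method described above, which filters the distance cost volume rather than the depth values directly, and all parameter values are placeholders.

import numpy as np

def restricted_joint_bilateral(depth, gray, mask, radius=5, sigma_s=3.0, sigma_c=10.0):
    """Simplified mask-restricted joint bilateral filter on a depth map.

    Only pixels inside `mask` are updated, and only neighbors inside `mask`
    contribute, so depth values cannot be smoothed across the mask border.
    `gray` is a grayscale guidance image of the same size.
    """
    h, w = depth.shape
    out = depth.astype(np.float64).copy()
    for y, x in zip(*np.nonzero(mask)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        win_d = depth[y0:y1, x0:x1].astype(np.float64)
        win_g = gray[y0:y1, x0:x1].astype(np.float64)
        win_m = mask[y0:y1, x0:x1] > 0
        yy, xx = np.mgrid[y0:y1, x0:x1]
        w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
        w_c = np.exp(-((win_g - float(gray[y, x])) ** 2) / (2 * sigma_c ** 2))
        weights = w_s * w_c * win_m        # neighbors outside the mask get weight 0
        out[y, x] = np.sum(weights * win_d) / np.sum(weights)
    return out

# Applied twice, as described above: once with the foreground mask and once
# with its inverse, so that the borders of both layers align with the color edges.
# refined = restricted_joint_bilateral(depth, gray, fg_mask)
# refined = restricted_joint_bilateral(refined, gray, fg_mask == 0)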
The next step of the 2-layer LDV generation process, after warping to the corner cameras and refinement, is the generation of three virtual views on the baseline of the reference camera. Figure 7.24 shows the color images of the left and right virtual


Fig. 7.24 Left and right virtual views

Fig. 7.25 (Left) Warped depth image from the central virtual view. (Right) Refined depth image
from the central virtual view (depth from the foreground layer)

views. Notice that the images are not complete due to the different viewing frustums of the ToF cameras, module cameras, and the reference camera. Although the images are purely virtual, the texture quality is quite good, which demonstrates the quality of the refinement process.
The central virtual view corresponds to the view of the reference camera. Therefore, the original color image of the reference camera is used as the color image for this view. The corresponding depth image is constructed through the combination of the refined images of the four corner cameras. Figure 7.25 (left) shows the constructed central depth image. One can see that the object borders already fit well, but some artifacts are still present. These can occur due to discrepancies in depth between different views, imperfections of the depth refinement in the previous step, or calibration errors in conjunction with the viewpoint change. To reduce the remaining artifacts, the global depth refinement is performed again in the view of the reference camera. The foreground mask for this refinement is constructed as a unified foreground mask from all refined foreground masks of the corner cameras, transformed to the view of the reference camera. The result of the refinement is shown in Fig. 7.25 (right).
Figure 7.26 shows the 2-layer LDV frame constructed from the three virtual
views.
Using this frame, novel views can be rendered to the left and to the right of the
view of the reference camera. Figure 7.27 shows two novel views left (-180 mm)
and right (+180 mm) from the reference camera view.


Fig. 7.26 2-layer LDV frame

Fig. 7.27 Two views rendered left (-180 mm) and right (+180 mm) from the view of the
reference camera

The quality of the novel views is already good, but there are still some issues, such as the color discrepancy between the foreground and the occlusion layer or the artifacts on the right side of the box, which diminish the achieved quality. The color discrepancy is a technical problem that can be solved by using similar cameras. The rendering artifacts on the right border of the box are, however, a more fundamental problem. While depth and color discontinuities in the virtual views are precisely aligned, depth values on the object borders are determined through a filtering process, i.e., essentially propagated from the neighborhood. In most cases this is sufficient, but sometimes no correct depth is available in the neighborhood or it cannot be correctly propagated, which leads to discrepancies between the foreground and the occlusion layer during the occlusion layer creation process. This may result in imperfect alignment of discontinuities between the two layers and subsequently in rendering artifacts.


7.6 Discussion and Conclusions


In this chapter, the requirements for LDV-format production were discussed and a complete system for 2-layer LDV generation was presented. The system consists of a hybrid camera rig and software algorithms for LDV generation. It can capture 14 min of content in a single shot at 25 frames per second while providing a real-time preview of the captured content. One of the advantages of the system is that the content production is comparable to a normal 2D production: during shooting one has to concentrate only on the reference camera, while the cameras in the modules play only a supportive role. This makes the system very suitable for actual 3D-TV productions. However, the system is limited to indoor scenarios due to the small operational range of the ToF cameras (7.5 m).
A processing scheme for 2-layer LDV generation was introduced as a postprocessing step, and two approaches for the alignment of depth and color discontinuities were presented. Both approaches use a bilateral filtering technique in a distance cost volume. While the first approach performs the refinement through a local filtering of the cost volume, the second approach uses grab-cut segmentation and depth image analysis in a first step to define the object borders globally. This strong segmentation information is then used in a second step to guide the filtering process. It was demonstrated that such restricted filtering prevents oversmoothing on the object borders and allows better reconstruction of fine details in the scene. Through the incorporation of time-consistency constraints in the segmentation process, object boundaries that are more stable over time can also be obtained. The local approach, however, is less computationally expensive than the global approach and offers a good balance between quality and complexity.
The results presented in this chapter show that the generation of a 2-layer LDV format from the content provided by the hybrid camera system is possible in acceptable quality. However, in some cases, correction through filtering (global and local) is not sufficient. To obtain better depth correction results, a combination of ToF sensing with stereo matching may be used [31]. For this purpose, the system was also designed with multiple baselines for stereo matching, which reduces matching ambiguities. Combinations of the proposed depth refinement techniques with stereo matching are also possible; one can, for example, apply stereo matching to the masked regions only. This would give better control of the smoothness constraints across the boundaries and decrease the ambiguities at the same time.

References
1. Fehn C (2004) Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV. In: Stereoscopic displays and virtual reality systems XI, Proceedings of the SPIE 5291, pp 93–104, May 2004
2. Kauff P, Atzpadin N, Fehn C et al (2007) Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability. Sig Process: Image Commun 22:217–234


3. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV. IEEE Trans Broadcast 51(2):191–199
4. Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vision 47(1/2/3):7–42
5. Frick A, Bartczak B, Koch R (2010) Real-time preview for layered depth video in 3D-TV. In: Proceedings of SPIE, vol 7724, p 77240F
6. Lee E-K, Ho Y-S (2011) Generation of high-quality depth maps using hybrid camera system for 3-D video. J Visual Commun Image Represent (JVCI) 22:73–84
7. Bartczak B, Schiller I, Beder C, Koch R (2008) Integration of a time-of-flight camera into a mixed reality system for handling dynamic scenes, moving viewpoints and occlusions in real-time. In: Proceedings of the 3DPVT workshop, 2008
8. Kolb A, Barth E, Koch R, Larsen R (2009) Time-of-flight sensors in computer graphics. In: Proceedings of Eurographics 2009—state of the art reports, pp 119–134
9. Schiller I, Beder C, Koch R (2008) Calibration of a PMD-camera using a planar calibration pattern together with a multi-camera setup. In: Proceedings of the XXXVII International Society for Photogrammetry
10. Fehn C, Kauff P, Op de Beeck M et al (2002) An evolutionary and optimised approach on 3D-TV. In: IBC 2002, International Broadcast Convention, Amsterdam, Netherlands, Sept 2002
11. Bartczak B, Vandewalle P, Grau O, Briand G, Fournier J, Kerbiriou P, Murdoch M et al (2011) Display-independent 3D-TV production and delivery using the layered depth video format. IEEE Trans Broadcast 57(2):477–490
12. Barenbrug B (2009) Declipse 2: multi-layer image and depth with transparency made practical. In: Proceedings of SPIE, vol 7237, p 72371G
13. Klein Gunnewiek R, Berretty R-PM, Barenbrug B, Magalhães JP (2009) Coherent spatial and temporal occlusion generation. In: Proceedings of SPIE, vol 7237, p 723713
14. Shade J, Gortler S, He L, Szeliski R (1998) Layered depth images. In: Proceedings of the 25th annual conference on computer graphics and interactive techniques (SIGGRAPH 98). ACM, New York, pp 231–242
15. Smolic A, Mueller K, Merkle P, Kauff P, Wiegand T (2009) An overview of available and emerging 3D video formats and depth enhanced stereo as efficient generic solution. In: Picture coding symposium, 2009, PCS 2009
16. Merkle P, Smolic A, Müller K, Wiegand T (2007) Multi-view video plus depth representation and coding. In: Image processing, 2007. ICIP 2007. IEEE international conference on, vol 1, pp 201–204
17. Frick A, Bartczak B, Koch R (2010) 3D-TV LDV content generation with a hybrid ToF-multicamera rig. In: 3DTV-Conference: the true vision—capture, transmission and display of 3D video, June 2010
18. Frick A, Kellner F, Bartczak B, Koch R (2009) Generation of 3D-TV LDV-content with time-of-flight camera. In: 3D-TV Conference: the true vision—capture, transmission and display of 3D video, May 2009
19. Xu M, Ellis T (2001) Illumination-invariant motion detection using colour mixture models. In: British machine vision conference (BMVC 2001), Manchester, pp 163–172
20. Yang Q, Yang R, Davis J, Nister D (2007) Spatial-depth super resolution for range images. In: Computer vision and pattern recognition, CVPR '07, IEEE conference on, pp 1–8, June 2007
21. Kim S-Y, Cho J-H, Koschan A, Abidi MA (2010) 3D video generation and service based on a TOF depth sensor in MPEG-4 multimedia framework. IEEE Trans Consum Electron 56(3):1730–1738
22. Diebel J, Thrun S (2005) An application of Markov random fields to range sensing. In: Advances in neural information processing systems, pp 291–298
23. Chan D, Buisman H, Theobalt C, Thrun S (2008) A noise-aware filter for real-time depth upsampling. In: Workshop on multicamera and multi-modal sensor fusion, M2SFA2
24. Frick A, Franke M, Koch R (2011) Time-consistent foreground segmentation of dynamic content from color and depth video. In: DAGM 2011, LNCS 6835. Springer, Heidelberg, pp 296–305


25. Pham TQ, van Vliet LJ (2005) Separable bilateral filtering for fast video preprocessing. In: Multimedia and Expo, 2005. ICME 2005. IEEE international conference on, 2005
26. Orchard M, Bouman C (1991) Color quantization of images. IEEE Trans Signal Process 39(12):2677–2690
27. Fischler M, Bolles R (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 24(6):381–395
28. Rother C, Kolmogorov V, Blake A (2004) GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans Graph 23(3):309–314
29. Boykov Y, Jolly M (2000) Interactive organ segmentation using graph cuts. In: Medical image computing and computer-assisted intervention (MICCAI), pp 276–286
30. Boykov Y, Kolmogorov V (2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans Pattern Anal Mach Intell 26(9):1124–1137
31. Bartczak B, Koch R (2009) Dense depth maps from low resolution time-of-flight depth and high resolution color views. In: Advances in visual computing, vol 5876. Springer, Berlin, pp 228–239

Part III

Data Compression and Transmission

Chapter 8

3D Video Compression
Karsten Müller, Philipp Merkle and Gerhard Tech

Abstract In this chapter, compression methods for 3D video (3DV) are presented.
This includes data formats, video and depth compression, evaluation methods, and
analysis tools. First, the fundamental principles of video coding for classical 2D
video content are reviewed, including signal prediction, quantization, transformation, and entropy coding. These methods are extended toward multi-view video
coding (MVC), where inter-view prediction is added to the 2D video coding
methods to gain higher coding efficiency. Next, 3DV coding principles are
introduced, which are different from previous coding methods. In 3DV, a generic
input format is used for coding and a dense number of output views are generated
for different types of autostereoscopic displays. This influences the format selection, encoder optimization, and evaluation methods, and requires new modules, such as decoder-side view generation, as discussed in this chapter. Finally, different 3DV
formats are compared and discussed for their applicability for 3DV systems.





Keywords 3D video (3DV) · Analysis tool · Correlation histogram · Data format · Depth-image-based rendering methods (DIBR) · Depth-enhanced stereo (DES) · Distortion measure · Entropy coding · Evaluation method · Inter-view prediction · Layered depth video (LDV) · Multi-view video coding · Multi-view video plus depth (MVD) · Rate-distortion optimization · Transform · Video coding

K. Müller (✉) · P. Merkle · G. Tech
Image Processing Department, Fraunhofer Institute for Telecommunications,
Heinrich-Hertz-Institut, Einsteinufer 37, 10587 Berlin, Germany
e-mail: karsten.mueller@hhi.fraunhofer.de
P. Merkle
e-mail: philipp.merkle@hhi.fraunhofer.de
G. Tech
e-mail: gerhard.tech@hhi.fraunhofer.de

C. Zhu et al. (eds.), 3D-TV System with Depth-Image-Based Rendering,


DOI: 10.1007/978-1-4419-9964-1_8,
© Springer Science+Business Media New York 2013


8.1 Introduction
3D video (3DV) systems have entered different application areas in recent years, such as digital cinema, home entertainment, and mobile services. Especially for 3D-TV home entertainment, a variety of 2-view (stereoscopic) and N-view autostereoscopic displays have been developed [1, 2]. For these displays, 3DV formats are required that are generic enough to provide any number of views at dense spatial positions. As the majority of 3DV content is currently produced with few (mostly two) cameras, mechanisms for generating the required additional views at the display are needed. Here, the usage of geometry information of a recorded 3D scene, e.g., in the form of depth or range data, has been studied and found to be a suitable and generic approach. Therefore, depth-enhanced video formats, such as multi-view video plus depth (MVD), have been studied, where 2 or 3 recorded views are augmented with per-pixel depth data. The depth data may originate from different sources, such as time-of-flight cameras, depth estimation from the original videos, or graphics rendering for computer-generated synthetic content. Such formats thus include the additional information needed to generate the required high number of views for autostereoscopic displays.
For the compression of such formats, new approaches beyond current 2D video
coding methods need to be investigated. Therefore, this chapter explains the
requirements for the compression of video and depth data and looks into new
principles of 3DV coding.
First, a short review of the basic principles of video coding, which form the basis for today's advanced coding methods, is given in Sect. 8.2.
Next, the extension toward multi-view video coding (MVC) is explained in
Sect. 8.3. To achieve higher coding gains, inter-view prediction was added in this
extension. For this, evaluation of coding results showed that no major changes are
required, such that the coding mechanisms for 2D can directly be applied to MVC.
For autostereoscopic displays with different number and position of views,
3DV coding methods are required, which efficiently transmit stereo and multi-view
video content and are able to provide any number of output views for the variety
of autostereoscopic displays. For this, new assumptions and coding principles
are required, which are explained in Sect. 8.4. One of the novelties in the new
3DV coding is the usage of supplementary data, e.g., in the form of depth maps, in
order to generate additional intermediate views at the display. For depth information, different coding methods are required, as shown by the depth data analysis
in Sect. 8.4.
Besides video and depth data, additional supplementary information, like
occlusion data, may also be used. Therefore, different 3DV formats exist, which
are explained and discussed in terms of coding efficiency in Sect. 8.5.


8.2 Principles of Video Coding


A main goal of video coding is to achieve the highest possible quality at the lowest possible data rate of the compressed video [3]. For practical applications, this means that either a maximum quality of the video data at a given bit rate or, vice versa, the lowest possible bit rate at a given quality is sought. For this, rate-distortion theory [4] and rate-distortion optimization are used in video compression (cf. Eq. 8.1). Here, a distortion is specified between the original input data and its approximated version. This distortion is inversely proportional to video quality: the closer an approximated version of the video data is to the original input, the smaller the distortion and the higher the quality.
The rate-distortion optimization problem is often constrained by additional requirements of a certain practical application area, such as algorithm complexity due to hardware restrictions in mobile devices, or transmission errors and delay, where additional data rate for error protection is required. Such additional constraints can also be included in the video coding optimization process.
For the tradeoff between rate R and distortion D, the Lagrange optimization is used, as shown in Eq. 8.1:

$p_{\mathrm{opt}} = \arg\min_{p_b} \big( D(p_b) + \lambda \cdot R(p_b) \big)$  (8.1)

Here, rate R and distortion D are related via the Lagrange multiplier λ to form the functional D(p_b) + λ·R(p_b). Rate and distortion depend on the parameter vector p_b of a coding mode b. The optimal coding mode is determined by the parameter vector p_opt, which minimizes the functional. For this, the best mode among a number of candidates is selected, as explained in the coding principles below.
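A minimal sketch of this Lagrangian mode decision, assuming the encoder has already measured distortion and rate for a set of candidate modes:

def select_coding_mode(candidates, lam):
    """Lagrangian mode decision per Eq. (8.1): pick the candidate whose
    parameters minimize D(p_b) + lambda * R(p_b).

    `candidates` is a list of (parameters, distortion, rate) tuples.
    """
    best_params, best_cost = None, float("inf")
    for params, distortion, rate in candidates:
        cost = distortion + lam * rate        # rate-distortion functional
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost

# Example with three hypothetical modes and measured (D, R) values:
modes = [("intra", 120.0, 96), ("inter", 80.0, 64), ("skip", 200.0, 2)]
print(select_coding_mode(modes, lam=0.85))   # the 'inter' mode minimizes D + lambda*R here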
A video coding system consists of encoder and decoder, as shown in Fig. 8.1.
The encoder applies video compression to the input data and generates a bit
stream. This bit stream can then be transmitted and is used by the decoder to
reconstruct the video. The compressed bit stream has to meet system requirements,
like a maximum given transmission bandwidth (and thus maximum bit rate).
Therefore, the encoder has to omit certain information within the original video, a
process, called lossy compression. Accordingly, the reconstructed video after
decoding has a lower quality in comparison with the original input video at the
encoder. The quality comparison between reconstructed output and original video
input, as well as the bit rate of the compressed bit stream, as shown in Fig. 8.1, is
used by the rate-distortion-optimization process in the encoder.
Video coding systems incorporate fundamental principles of signal processing,
such as signal prediction, quantization, transformation, and entropy coding to
reduce temporal as well as spatial signal correlations, as described in detail in [5].
As a result, a more compact video representation can be achieved, where important
information is carried by fewer bits, than in the original video in order to reduce
the data rate of the compressed bit stream.


Fig. 8.1 Video coding system with encoder and decoder, objective quality measure of
reconstructed output versus original input video at certain bit rates

The first fundamental principle in video compression is signal prediction.


Digital video data mostly consist of a two-dimensional grid of discrete pixels. In signal prediction, a sample pixel value x(n) is approximated by a value x̂(n). This value may be obtained from other pixel values at the same time instance (spatial prediction) or at different time instances (temporal prediction). For optimal signal prediction, a variety of different prediction filters are described in the literature [6], in order to minimize the error e(n) between a sample value and its predicted value:

$e(n) = x(n) - \hat{x}(n)$  (8.2)

Typically, the error e(n) is much smaller than the original pixel value x(n); therefore, a transmission of e(n) is usually more efficient. The original signal is then reconstructed from the error and prediction signal: x(n) = e(n) + x̂(n).
The second fundamental principle in video compression is quantization. One example for this is already given by the most widely used YUV 8-bit video format, where each pixel is represented by a luminance value (Y) and two chrominance values (U and V), each with 8-bit precision. Therefore, each pixel value component is quantized into 256 possible values. In video compression, the error signal e(n) is quantized into a discrete number of values e_q(n). The combination of signal prediction and error signal quantization is shown in the encoder in Fig. 8.2.
With the quantization of the error signal, the original and quantized versions differ, such that e_q(n) ≠ e(n). Consequently, the reconstructed signal x̃(n) also differs from the original signal, as x̃(n) = e_q(n) + x̂(n) ≠ x(n). In prediction scenarios, previously coded and decoded data are often used as an input to the predictor. These data contain quantization errors from the previous coding cycle, such that an error accumulation occurs over time, which is known as drift. Therefore, the typical basic encoder structure consists of a differential pulse code modulation (DPCM) loop, shown in Fig. 8.2. This backward prediction loop guarantees that the current quantized prediction error e_q(n) is used for locally reconstructing the output signal x̃(n), which is then used for a new prediction, such that quantization errors cannot accumulate over time. Furthermore, the encoder uses x̃(n) to calculate the distortion for the rate-distortion optimization.
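The following short Python sketch illustrates the DPCM principle on a 1-D sample sequence with a hypothetical uniform quantizer; real encoders operate on blocks and use far more elaborate prediction and quantization.

def dpcm_encode(samples, step=8):
    """Minimal DPCM loop (cf. Fig. 8.2): predict each sample from the previously
    reconstructed one, quantize the prediction error, and rebuild the local
    reconstruction so that quantization errors cannot accumulate (no drift).
    """
    recon_prev = 0.0                      # predictor state (previous reconstruction)
    quantized_errors, recon = [], []
    for x in samples:
        x_hat = recon_prev                # prediction from reconstructed data
        e = x - x_hat                     # prediction error e(n), Eq. (8.2)
        e_q = step * round(e / step)      # quantized error e_q(n)
        x_tilde = e_q + x_hat             # local reconstruction x~(n)
        quantized_errors.append(e_q)
        recon.append(x_tilde)
        recon_prev = x_tilde              # feed reconstruction back into the predictor
    return quantized_errors, recon

eq, xr = dpcm_encode([100, 104, 115, 140, 180])
print(eq)   # transmitted quantized errors
print(xr)   # reconstruction stays within half a quantizer step of each input sample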
The third fundamental principle in video compression is signal transformation.
Here, a number of adjacent signal samples are transformed from their original


Fig. 8.2 Differential pulse code modulation structure with encoder and decoder. The decoder
structure is also part of the encoder (gray box)

domain into another domain. For images and videos, usually a frequency transformation is applied to a rectangular block of pixels. Thus, an image block is
transformed from its original spatial domain into the frequency domain. The
purpose of this frequency transformation is to concentrate most of a signal's energy, and thus the most relevant information, in a few large-value frequency coefficients. As the samples of an original image block usually have a high correlation, mostly low-frequency coefficients have large values. Consequently, data reduction can be carried out by simply omitting high-frequency coefficients with small values while still preserving a certain visual quality for the entire image block. Adding a frequency transformation to signal prediction and quantization leads to the basic structure of today's 2D video encoders, shown in Fig. 8.3.
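As a small illustration of this energy compaction, the following Python sketch applies an orthonormal 2-D DCT-II to a smooth 8×8 block; the level shift by 128 and the block content are arbitrary choices.

import numpy as np

def dct2_block(block):
    """2-D DCT-II of a square image block (orthonormal basis), illustrating how
    the transform concentrates the block energy in a few low-frequency coefficients."""
    n = block.shape[0]
    k = np.arange(n)
    # Orthonormal DCT-II basis matrix C, so that coefficients = C @ block @ C.T
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C @ (block.astype(np.float64) - 128.0) @ C.T

# A smooth (highly correlated) block: almost all energy ends up in the DC and the
# first few AC coefficients, so the small high-frequency ones could be dropped.
smooth = np.tile(np.linspace(100, 140, 8), (8, 1))
print(np.round(dct2_block(smooth), 1))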
In such an encoder, an image is tiled into a number of blocks, which are sequentially processed. For each block x(n), a predicted block x̂(n) is obtained by advanced prediction mechanisms. Here, the encoder selects among a number of different prediction modes, e.g., intra-frame prediction, where spatially neighboring content is used, or motion-compensated temporal prediction, where similar content in preceding pictures is used. The difference between the original and predicted block, e(n), is transformed (T) and quantized. This residual block Q_q(n) is inversely transformed (T⁻¹) into e_q(n) and used to generate the locally reconstructed output block x̃(n) (see Fig. 8.3). A deblocking filter is further applied to reduce visible blocking artifacts. The locally reconstructed output image is used for the temporal prediction of the following encoder cycle. The temporal prediction consists of the motion compensation as well as the motion estimation. Here, motion vectors between input x(n) and reconstructed image blocks x̃(n) are estimated. Each encoder cycle produces a set of residual block data Q_q(n) and motion vectors, required for decoding. These data are fed into the entropy coder. The coder control is responsible for testing different prediction modes and determining the overall optimal prediction mode in terms of minimum distortion and bit rate after entropy coding.
The fourth fundamental principle in video compression is entropy coding. If a
signal is represented by a set of different source symbols, an entropy coding


Fig. 8.3 Block-based hybrid video encoder with DPCM loop, transformation (T), quantization, inverse transformation (T⁻¹), prediction (intra-frame prediction, motion compensation, motion estimation), entropy coding, and deblocking filter modules

algorithm exploits the different occurrence probabilities of these symbols. Here, the set of source symbols is represented by a set of code words, where the length of a code word decreases with increasing probability of occurrence of its source symbol. Thus, fewer bits are required to represent source symbols with high occurrence, such that very low average rates of about 1 bit per source symbol or less can be achieved. Examples for entropy coding are Huffman codes [7] or arithmetic codes [8]. The latter have been further refined and adapted to video content [5]. Entropy coding provides lossless compression, such that an entropy-coded signal can be fully reconstructed. Therefore, entropy coding considerably reduces the bit rate while maintaining the distortion and is essential for the overall rate-distortion optimization.
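As a small illustration of the principle, the following Python sketch builds a Huffman code for a hypothetical set of source symbols; practical video coders use context-adaptive variable-length or arithmetic coding instead, as noted above.

import heapq

def huffman_code(frequencies):
    """Build a binary Huffman code: frequent symbols receive short code words,
    rare symbols long ones, approaching the source entropy.
    `frequencies` maps each source symbol to its occurrence count.
    """
    heap = [[freq, idx, [sym, ""]] for idx, (sym, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    idx = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)          # two least frequent subtrees
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]       # prepend bit for the lighter subtree
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], idx] + lo[2:] + hi[2:])
        idx += 1
    return dict(heap[0][2:])

# Hypothetical mode symbols and their counts in a coded sequence:
codes = huffman_code({"skip": 60, "inter": 25, "intra": 10, "pcm": 5})
print(codes)   # frequent 'skip' gets a 1-bit code word, rare 'pcm' a 3-bit one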
All fundamental principles of video coding are jointly applied in order to find the best overall rate-distortion functional, as shown in Eq. (8.1). For this, the distortion D is measured as a signal deviation, e.g., as the mean squared error (MSE) between the original block x(n) and the locally reconstructed image block x̃(n), as shown in Eq. (8.3):

$\mathrm{MSE} = \| x(n) - \tilde{x}(n) \|^2 = \frac{1}{I} \sum_{i=1}^{I} \big( x_i(n) - \tilde{x}_i(n) \big)^2$  (8.3)

Here, x_i(n) and x̃_i(n) represent the single pixel values of the blocks x(n) and x̃(n) respectively, with I being the number of pixels per block. Thus, the MSE is calculated from the averaged squared pixel differences in Eq. (8.3). For the calculation of objective image and video quality, the peak signal-to-noise ratio (PSNR) is calculated from the MSE, as shown in Eq. (8.4). The PSNR gives the logarithmic ratio between the squared maximum value and the MSE. As an example, the PSNR for 8-bit signals (maximum value: 255) is also given in (8.4):

$\mathrm{PSNR} = 10 \cdot \log_{10} \frac{\mathrm{MaxValue}^2}{\mathrm{MSE}}, \qquad \mathrm{PSNR}_{8\,\mathrm{bit}} = 10 \cdot \log_{10} \frac{255^2}{\| x(n) - \tilde{x}(n) \|^2}$  (8.4)
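A direct implementation of Eqs. (8.3) and (8.4) for 8-bit data might look as follows; this is only a sketch that ignores block partitioning and separate color components.

import numpy as np

def psnr_8bit(original, reconstructed):
    """PSNR between an original and a reconstructed 8-bit image or block,
    following Eqs. (8.3) and (8.4)."""
    x = original.astype(np.float64)
    x_rec = reconstructed.astype(np.float64)
    mse = np.mean((x - x_rec) ** 2)            # Eq. (8.3)
    if mse == 0:
        return float("inf")                    # identical signals
    return 10.0 * np.log10(255.0 ** 2 / mse)   # Eq. (8.4), MaxValue = 255

# Usage: psnr = psnr_8bit(original_frame, decoded_frame)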

For image and video coding, the PSNR is widely used, as it represents subjective quality very well and can easily be calculated. When a new coding technology is developed, extensive subjective assessment tests are carried out with a large number of participants. Here, measures such as the mean opinion score (MOS) are used. In MOS, a quality scale is specified, ranging, e.g., from 0 to 5 for very bad to perfect quality. Participants are then asked to rate their subjective impression within this scale. For (2D) video coding, the MOS value is highly correlated with the objective PSNR: a higher PSNR value is also rated higher in MOS value. Thus, the automatic encoder rate-distortion optimization can maximize the objective PSNR value and thereby also guarantee a high perceived subjective quality.

8.3 Coding of Multi-view Video Data


2D video coding has been extended toward the coding of multiple videos with similar content. One example is the development of the MVC standard [9–13] based on the 2D advanced video coding (AVC) standard [14]. In addition to temporal redundancies, redundancies between the views are also exploited for higher compression efficiency.
Basic signal prediction was introduced in Sect. 8.2 as one of the principles used in video coding. Especially the motion-compensated temporal prediction is applied to reduce the video signal correlation in temporal direction. For block-based video encoding, the signal x(n) in (8.2) becomes a vector or block of pixels x(n) that is centered at image position n = (i j)^T with horizontal and vertical coordinates i and j. For a simple temporal prediction, an image block x_{t2}(n) at time instance t2 shall be predicted by a block x_{t1}(n) of a previous time instance t1, such that the prediction error becomes e(n) = x_{t2}(n) − x_{t1}(n). Here, the block has the same local image coordinates n at both time instances. This prediction can be further improved in image regions with object motion, where the same image position does not lead to a good prediction. Therefore, a motion vector m = (m_i m_j)^T is estimated for x_{t2}(n) in order to find a more similar block in the image at time instance t1 for a better prediction. Hence, the motion-compensated prediction becomes

$e(n) = x_{t2}(n) - x_{t1}(n + m)$  (8.5)


Note that Eq. (8.5) represents a simple prediction by a temporally preceding pixel value. More advanced prediction mechanisms may include a weighted averaging of values. Also, further improvements in temporal prediction can be achieved by analyzing multiple reference pictures [15] in addition to the picture at t1 for finding the best prediction candidate.
When multiple video sequences from adjacent cameras or camera arrays are coded, correlations exist within each sequence in temporal direction as well as between the sequences of neighboring views. An example is shown in Fig. 8.4 for pictures of a stereo sequence with two views v1 and v2 and two time instances t1 and t2.
Here, a block x_{v2,t2}(n) is highlighted for view v2 at time t2. Corresponding image content can be found in temporal as well as in inter-view direction. As shown in Fig. 8.4, this block is related to its corresponding block x_{v2,t1}(n + m_{v2}) via the motion vector m_{v2} in temporal direction. Accordingly, a corresponding block x_{v1,t2}(n + d_{t2}) is related via the disparity vector d_{t2} in inter-view direction. If both corresponding blocks can be mapped to a single block in the top-left image [v1, t1] in Fig. 8.4, the motion and disparity data of the associated blocks between the images are consistent. In reality, this is often prevented by occlusions, where image content is not visible in all four images. Also note that the motion-compensated prediction in (8.5) only finds the optimal combination of error block e(n) and motion data m that yields the overall minimum of the rate-distortion functional. This motion data can differ significantly from the true motion of a block.
For higher coding efficiency, the block relations in inter-view direction are exploited in MVC. Therefore, the most significant innovation in MVC is inter-view prediction. It is based on the concept of motion estimation, where temporally neighboring pictures are used as reference pictures for finding the best matching prediction for the current block by compensating the temporal motion of the scene. In addition, inter-view prediction exploits the similarities in neighboring camera views by disparity estimation. Given the fact that the disparity describes the displacement of regions or objects between different camera views, the best matching prediction from a neighboring camera view is achieved by compensating the disparity. According to Fig. 8.4, a multi-view video encoder can thus select between the temporal and inter-view prediction modes:

$e_{\mathrm{temporal}}(n) = x_{v2,t2}(n) - x_{v2,t1}(n + m_{v2})$
$e_{\mathrm{inter\text{-}view}}(n) = x_{v2,t2}(n) - x_{v1,t2}(n + d_{t2})$  (8.6)
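The following Python sketch illustrates the choice between the two prediction modes of Eq. (8.6) for a single block, using a simple SSD block search. A real MVC encoder bases this decision on the full rate-distortion cost rather than the residual energy alone, and the block size and search range used here are arbitrary.

import numpy as np

def best_block_prediction(cur, ref_temporal, ref_interview, y, x, bs=8, search=8):
    """Compare temporal and inter-view prediction for one block (cf. Eq. 8.6):
    search a motion vector in the previous picture of the same view and a
    (horizontal-only) disparity vector in the neighboring view at the same time,
    and return whichever reference gives the smaller residual energy."""
    block = cur[y:y + bs, x:x + bs].astype(np.float64)

    def best_match(ref, vertical_search):
        best = (np.inf, (0, 0))
        for dy in range(-vertical_search, vertical_search + 1):
            for dx in range(-search, search + 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy and yy + bs <= ref.shape[0] and 0 <= xx and xx + bs <= ref.shape[1]:
                    cand = ref[yy:yy + bs, xx:xx + bs].astype(np.float64)
                    ssd = np.sum((block - cand) ** 2)   # residual energy ||e(n)||^2
                    if ssd < best[0]:
                        best = (ssd, (dy, dx))
        return best

    ssd_t, m = best_match(ref_temporal, search)   # motion vector candidate
    ssd_v, d = best_match(ref_interview, 0)       # disparity candidate, horizontal only
    return ("temporal", m, ssd_t) if ssd_t <= ssd_v else ("inter-view", d, ssd_v)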

Inter-view prediction could be applied in both directions, i.e., from view v1 to


view v2 and vice versa. In contrast, only view v2 in Fig. 8.4 uses reference pictures from view v1. This one-directional prediction is also used in existing MVC
systems for backward compatibility: Here, v1 is called the base or independent
view and can be decoded by a legacy 2D or single view AVC video decoder.
A typical prediction structure in MVC for two views is shown in Fig. 8.5.
Here, horizontal arrows indicate temporal dependencies, e.g., an image at time
instance 3 requires the images at time instances 1 and 5 as reference for temporal


Fig. 8.4 Block relations in stereo video with respect to the lower right block at (t2,v2): temporal
relation via motion vectors m, inter-view relation via disparity vectors d

Fig. 8.5 Multi-view video coding prediction structure for stereo video: inter-view prediction
(vertical-arrow relations) combined with hierarchical B pictures for temporal prediction
(horizontal-arrow relations), source [20]

prediction. The temporal prediction structure in MVC uses a hierarchy of bipredictive pictures, known as hierarchical B pictures [16]. Here, different quantization
parameters can be applied to each hierarchy level of B pictures in order to further
increase the coding efficiency. Vertical arrows indicate inter-view dependencies,
i.e., video v2 requires v1 as reference. Note that the dependency arrows in Fig. 8.5
are inversely in direction to the motion and disparity vectors from Fig. 8.4.
The coding efficiency of MVC gains from using both temporal and inter-view
reference pictures. However, the inter-view decorrelation loses some efficiency, as
the algorithms and methods optimized for temporal signal decorrelation are
applied unmodified. Although motion and disparity both reflect content displacement between images at different temporal and spatial positions, they also differ in
their statistics. Consider a static scene, recorded with a stereo camera: Here, no
motion occurs, while the disparity between the two views is determined by the
intrinsic setting of the stereo cameras, such as spacing, angle, and depth of the
scene. Therefore, an encoder is optimized for a default motion of m = 0. This is
not optimal for disparity, where default values vary according to the parameters,
mentioned before. This is also reflected by the occurrence of coding decisions for


temporally predicted picture blocks versus inter-view predicted blocks, where the latter only occur in up to 20 % of the cases [11]. This leads to only a moderate increase in coding efficiency of current MVC versus single-view coding, where coding gains of up to 40 % have been measured in experiments for multiple views [11]. Although such gains can be achieved by MVC, the bit rate is still linearly dependent on the number of views.
The MVC encoder applies the same rate-distortion optimization principles as 2D video coding. Here, the PSNR is used as an objective measure between the locally reconstructed and the original uncoded image. For two views of a stereo sequence, a similar quality is usually aimed for, although unequal quality coding has also
been investigated. For the PSNR calculation of two or more views, the mean
squared error (MSE) is separately calculated for each view. The MSE values of all
views are then averaged and the PSNR value calculated, as shown for the stereo
(2-view) and N-view case in (8.7):
$\mathrm{PSNR}_{2\,\mathrm{Views}} = 10 \cdot \log_{10} \frac{\mathrm{MaxValue}^2}{0.5\,(\mathrm{MSE}_{\mathrm{view1}} + \mathrm{MSE}_{\mathrm{view2}})}$
$\mathrm{PSNR}_{N\,\mathrm{Views}} = 10 \cdot \log_{10} \frac{\mathrm{MaxValue}^2}{\frac{1}{N}\,(\mathrm{MSE}_{\mathrm{view1}} + \mathrm{MSE}_{\mathrm{view2}} + \cdots + \mathrm{MSE}_{\mathrm{view}N})}$  (8.7)

Note that averaging the individual PSNR values of all views instead would lead to a wrong encoder optimization toward maximizing the PSNR in one view only, while neglecting the others, especially for unequal view coding.
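A small sketch of the pooling in Eq. (8.7), averaging the per-view MSE values before converting to a single PSNR:

import numpy as np

def multiview_psnr(originals, reconstructions, max_value=255.0):
    """PSNR over N views per Eq. (8.7): the per-view MSE values are averaged
    first and a single PSNR is computed from that average (not the other way
    around, which would bias the optimization toward a single view)."""
    mses = [np.mean((o.astype(np.float64) - r.astype(np.float64)) ** 2)
            for o, r in zip(originals, reconstructions)]
    return 10.0 * np.log10(max_value ** 2 / np.mean(mses))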
Based on the PSNR calculation in (8.7), MVC gives the best possible quality at a
given bit rate. Similar to 2D video coding, the PSNR correlates with subjective
MOS, although initially for the 2D quality of the separate views. In addition, tests
have also been carried out, where subjective 3D perception in stereo video was
assessed and a correlation with PSNR values was shown [17]. Therefore, rate-distortion optimization methods from 2D video coding are also directly applicable to stereo and MVC.

8.4 3D Video Coding


With the introduction of multi-view displays with different number and positions
of views, coding methods are required that support the specific formats for these
displays. The only solution with MVC-based methods would be to encode a dense range with a high number of views (e.g., 50). At the decoder, a specific display would then select a subset of these views according to its number of views and viewing range. This approach would require a high data rate due to the linear dependency of MVC on the number of views, as shown in [11]. Furthermore, 3DV content is produced with only a few cameras, and most of today's content is recorded with stereo camera


systems, such that only two views are available. Therefore, new coding methods
are required for 3DV coding, which decouple the production and coding format
from the display format.

8.4.1 Assumptions and Requirements


In order to bridge the gap between a production format from very few cameras and
the multi-view display requirements of high numbers of dense output views, 3DV
coding (3DVC) was introduced [18].
In 3DVC, some of the basic coding principles and assumptions change considerably in comparison to previous video coding methods. First, the display output is very different from the input data in 3DVC. Moreover, the output format is display dependent, such that the number and positions of the display views are typically unknown at the 3DV encoder side. Therefore, objective evaluation methods between output and input data, such as the MSE, may not be meaningful. Furthermore, the quality of the output views is determined by the compression method of the encoder as well as by the view generation at the decoder side. Consequently, interdependencies between compression technology and view synthesis methods occur. In addition, the type and quality of the supplementary data are also important for the output view quality.
The 3DV system, shown in Fig. 8.6 addresses these specific conditions in
3DVC. It assumes an input format containing stereo or multi-view video data. In
addition, depth information is provided as supplementary data, shown in Fig. 8.6
left for the 3DV encoder input format. This format with its specific video and
depth data needs to be fixed in order to guarantee similar viewing conditions and
depth perception for different autostereoscopic displays: Fixing the depth data for
3DV is comparable to fixing the chrominance information for 2D video and thus
providing similar viewing perception for different displays.
The 3DV format with video and depth data is then encoded at a certain bit rate.
At the receiver side, the compressed bit stream is decoded and the 3DV format
reconstructed before further processing, such as view synthesis, as shown in
Fig. 8.6. In this case, the coded video data at original viewing positions is
reconstructed, such that it can be compared against the original video data, using
objective PSNR calculation from (8.7). Thus, the compression efficiency of the
compression technology can be evaluated. In contrast to previous coding methods,
an additional view generation module is used to synthesize the required number of
output views at required spatial positions from the reconstructed 3DV format. As
this needs to be adapted to the specific display, the view generation module is part
of the display in practical applications. For overall system evaluation, the dense range of output views is subjectively assessed, e.g., by MOS evaluation. This
may include viewing of all views on a multi-view display or a selection of stereo
pairs from the views on a stereoscopic display.


Fig. 8.6 3D video coding system with single generic input format and multi-view output range
for N-view displays. For overall evaluation of different 3DVC methods with compression and
view generation methods, quality and data rate measurements are indicated by dashed boxes

In the 3DV system, the encoder has to be designed in such a way that a high
quality of the dense range of output views is ensured and that the rate-distortion
optimization can be carried out automatically. Therefore, some of the view generation functionality also has to be emulated at the encoder. For comparison, the
3DV encoder can generate uncompressed intermediate views from the original
video and depth data. These views can then be used in the rate-distortion optimization as a reference, as shown in Sect. 8.4.3.

8.4.2 Depth-Enhanced Multi-View Coding Format


As pointed out in the previous section, the input format needs to be fixed in order
to provide comparable 3D depth perception for different displays in 3DV systems.
One possible format is MVD, where video data from multiple adjacent cameras is
provided together with depth information [19]. An example for two cameras is
shown in Fig. 8.7.
Here, pixel-wise depth information was estimated from the video data and
manually refined for small structures. The original depth data has been converted
into the shown 8 bit depth representation, where lighter pixels indicate foreground
and darker pixels background information. A detailed conversion of the scene
depth range into the 8 bit representation can be found in [20].
High-quality depth maps are important for the output view reconstruction
quality at all intermediate positions [21, 22]. Therefore, depth estimation in general has been studied intensively in the literature [23]. Here, the corresponding
content in neighboring camera views is identified and matched [24, 25].
Depending on the video data, different matching criterions have been investigated
[2630]. These methods have also been specifically adapted for depth extraction in
3DVC applications [3133]. As shown in Fig. 8.7, estimated depth data is not
always accurate, e.g., in homogeneous regions, the matching of corresponding


Fig. 8.7 Multi-view video plus depth format for two input cameras v1 and v2, Book_Arrival
sequence

content may lead to erroneous depth assignments. This, however, is only problematic, if visible artifacts occur in synthesized intermediate views. Here,
advanced view synthesis methods can reduce depth estimation errors to some
degree [34, 35].
For existing stereo content, where only two color views are available, depth
data has to be estimated. Newer recording devices are able to record depth
information by sending out light beams, which are reflected back by a scene
object, as described in detail in [36]. These cameras then measure the time of flight in order to calculate depth information.
In contrast to recorded natural 3DV data, animated films are computer generated. For these, extensive scene models with 3D geometry are available [37, 38].
From the 3D geometry, depth information with respect to any given camera
viewpoint can be generated and converted into the required 3DV depth format.
For 3DVC, a linear parallel camera setup in horizontal direction is typically
used. In addition, preprocessing of the video data is carried out for achieving exact
vertical alignment [39] in order to avoid viewing discomfort and fatigue. Consequently, a simple relation between the depth information z(n) at a pixel position
n and the associated disparity vector d(n) can be formulated:

$\| d(n) \| = \frac{f \cdot \Delta s}{z(n)}$  (8.8)

Here, f is the focal length of the cameras (assuming identical focal lengths) and Δs the camera distance between the two views v1 and v2. Due to the vertical alignment, the disparity vector d only has a horizontal component, and its length is determined by the inverse scene depth value, as also shown in detail in [20].
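The following sketch converts an 8-bit depth map into disparity magnitudes according to Eq. (8.8). It assumes the common inverse-depth quantization convention described in [20], and all camera parameters in the usage example are hypothetical.

import numpy as np

def disparity_from_depth(depth8, f, delta_s, z_near, z_far):
    """Convert an 8-bit depth map into per-pixel disparity magnitudes, Eq. (8.8).

    Assumes the convention (cf. [20]) that the 8-bit value v encodes inverse
    depth linearly between z_near (v = 255) and z_far (v = 0):
        1/z = (v/255) * (1/z_near - 1/z_far) + 1/z_far
    """
    v = depth8.astype(np.float64) / 255.0
    inv_z = v * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return f * delta_s * inv_z            # ||d(n)|| = f * delta_s / z(n)

# Hypothetical setup: focal length in pixels, baseline and depth range in meters.
# disp = disparity_from_depth(depth_map, f=1000.0, delta_s=0.05, z_near=2.0, z_far=10.0)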

8.4.3 3D Video Coding Principles


In Sect. 8.3, the relation between similar picture content in multi-view data via the
disparity information was shown (see Fig. 8.4). In MVC, this information is only
implicitly used for good inter-view prediction by the encoder. In 3DVC, a number
of output views need to be generated after decoding. These views are generated by
depth-image-based rendering methods (DIBR) [40, 41], where video data from one
original view is shifted to an intermediate viewing position, using the disparity
vector d. In Fig. 8.8, an example of a synthesized intermediate view is shown,
which was generated between two original views v1 and v2.
Here, a picture block x_{v1}(n_{v1}) in v1 and a corresponding picture block x_{v2}(n_{v2}) in v2 are related in their positions n_{v1} and n_{v2} via the disparity vector d, such that n_{v1} = n_{v2} + d. Note that all positions relate to the same local image coordinates in all views, such that the associated picture blocks and their disparity relations are shown in the synthesized view vS in Fig. 8.8. Due to the vertical alignment of views in 3DV, positions only differ in the horizontal direction. Hence, the vertical component of d is zero. If an intermediate view is to be synthesized between v1 and v2, picture content from both original views is usually projected into the synthesized view vS and weighted by the intermediate view position parameter κ ∈ [0…1] in order to obtain the intermediate picture block x_{vS}(n_{vS}):

$x_{vS}(n_{vS}) = \kappa \cdot x_{v1}(n_{v1}) + (1-\kappa) \cdot x_{v2}(n_{v2})$  (8.9)

Here, κ specifies the spatial position between the original views v1 and v2. For instance, a value of κ = 0.5 indicates that vS is positioned in the middle between both original views, as also shown in Fig. 8.8. The two values κ = 0 and κ = 1 determine the original positions, where vS = v2 and vS = v1 respectively. The positions of corresponding blocks in original and synthesized views are related via the disparity vector d, such that n_{vS} = n_{v1} − (1−κ)·d as well as n_{vS} = n_{v2} + κ·d. Thus, Eq. (8.9) can be modified by substituting the corresponding positions, using the κ-scaled disparity shifts:

$x_{vS}(n_{vS}) = \kappa \cdot x_{v1}(n_{vS} + (1-\kappa) \cdot d) + (1-\kappa) \cdot x_{v2}(n_{vS} - \kappa \cdot d)$  (8.10)

The weighted averaging of original color information in (8.10) is also known as texture blending in computer graphics applications. This method provides a


Fig. 8.8 Associated block relations in synthesized view with scaled disparity vector d

gradual adaptation of the color information when navigating across the viewing range of dense intermediate views from v1 to v2. Equation (8.10) assumes that both original blocks x_{v1} and x_{v2} are visible. As the original cameras have recorded the scene from slightly different viewpoints, background areas exist which are only visible in one view, while being occluded by foreground objects in the second view. Therefore, (8.10) has to be adapted for both occlusion cases in order to obtain the intermediate picture block, i.e., x_{vS}(n_{vS}) = x_{v1}(n_{vS} + (1−κ)·d) if x_{v2} is occluded, and x_{vS}(n_{vS}) = x_{v2}(n_{vS} − κ·d) if x_{v1} is occluded.
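A simplified scanline implementation of the blending in Eq. (8.10), including the one-sided fallback for occluded samples, is sketched below. It rounds disparities to integer positions and treats out-of-image references as occlusions, whereas practical renderers use sub-pel interpolation and explicit depth-based visibility tests.

import numpy as np

def blend_intermediate_row(v1_row, v2_row, disp_row, kappa):
    """Blend one image row of a synthesized view at position kappa in [0, 1]
    following Eq. (8.10). `disp_row` holds the horizontal disparity (in pixels)
    between v1 and v2 for each target sample."""
    width = disp_row.shape[0]
    out = np.zeros(width, dtype=np.float64)
    for n in range(width):
        d = float(disp_row[n])
        p1 = n + int(round((1.0 - kappa) * d))    # position in v1
        p2 = n - int(round(kappa * d))            # position in v2
        vis1 = 0 <= p1 < width
        vis2 = 0 <= p2 < width
        if vis1 and vis2:
            out[n] = kappa * v1_row[p1] + (1.0 - kappa) * v2_row[p2]
        elif vis1:
            out[n] = v1_row[p1]                   # x_v2 contribution occluded
        elif vis2:
            out[n] = v2_row[p2]                   # x_v1 contribution occluded
    return out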
Equation (8.10) represents the general case of intermediate view generation for uncompressed data. As shown in Sect. 8.2, video compression introduces coding errors due to quantization, such that an original data block x(n) and its reconstructed version after decoding, x̃(n), are different. In previous coding approaches, the color difference between both values determined the reconstruction quality for 2D and MVC [see (8.4) and (8.7) respectively]. In 3DVC, color and depth data are encoded. This leads to different reconstructed color values x̃(n) ≠ x(n) as well as different reconstructed depth values, where the latter cause different disparity values d̃(n) ≠ d(n). As shown in (8.10), disparity data is used for intermediate view synthesis and causes a position shift between original and intermediate views. Therefore, depth coding errors result in a disparity offset or shift error Δd = d̃(n) − d(n). Consequently, Eq. (8.10) is subject to color as well as depth coding errors and becomes for the coding case:

$\tilde{x}_{vS}(n_{vS}) = \kappa \cdot \tilde{x}_{v1}(n_{vS} + (1-\kappa) \cdot (d + \Delta d)) + (1-\kappa) \cdot \tilde{x}_{v2}(n_{vS} - \kappa \cdot (d + \Delta d))$  (8.11)

Equation (8.11) shows that coding of the color data changes the interpolation value, while coding of the depth data causes disparity offsets, which reference neighboring coded blocks at different positions in the horizontal direction. These neighboring blocks may have a very different color contribution for the interpolation of x̃_{vS}(n_{vS}) in the color blending process (8.11). Especially at color edges, completely different color values are thus used, which lead to strong sample scattering and color bleeding in synthesized views.
For the compression of 3DV data, color and depth need to be jointly coded and evaluated with respect to the intermediate views. One possibility for a joint coding optimization is to compare the original color contributions from v1 and v2 in (8.10) with their coded and reconstructed versions in (8.11). Thus, the following MSE distortion measure could be derived:

$\mathrm{MSE}_{v1}(\kappa) = \| x_{v1}(n_{vS} + (1-\kappa) \cdot d) - \tilde{x}_{v1}(n_{vS} + (1-\kappa) \cdot (d + \Delta d)) \|^2$
and
$\mathrm{MSE}_{v2}(\kappa) = \| x_{v2}(n_{vS} - \kappa \cdot d) - \tilde{x}_{v2}(n_{vS} - \kappa \cdot (d + \Delta d)) \|^2$  (8.12)
This method of MSE calculation contains a superposition of color as well as depth errors. Both types of errors could cancel each other out, such that reconstructed blocks x̃_{v1} or x̃_{v2} are found which originate from wrong positions in v1 and v2 but, by coincidence, have the same color values. Such blocks would minimize the MSE for a particular intermediate position κ, while causing visible artifacts at other interpolation positions. Therefore, color and depth coding errors have to be analyzed separately in order to obtain their individual influence:

$\mathrm{MSE}_{C,v1}(\kappa) = \| x_{v1}(n_{vS} + (1-\kappa) \cdot d) - \tilde{x}_{v1}(n_{vS} + (1-\kappa) \cdot d) \|^2$
$\mathrm{MSE}_{D,v1}(\kappa) = \| \tilde{x}_{v1}(n_{vS} + (1-\kappa) \cdot d) - \tilde{x}_{v1}(n_{vS} + (1-\kappa) \cdot (d + \Delta d)) \|^2$
and
$\mathrm{MSE}_{C,v2}(\kappa) = \| x_{v2}(n_{vS} - \kappa \cdot d) - \tilde{x}_{v2}(n_{vS} - \kappa \cdot d) \|^2$
$\mathrm{MSE}_{D,v2}(\kappa) = \| \tilde{x}_{v2}(n_{vS} - \kappa \cdot d) - \tilde{x}_{v2}(n_{vS} - \kappa \cdot (d + \Delta d)) \|^2$  (8.13)
As a result, MSE_C for color coding is first calculated between original and reconstructed blocks at the same position for v1 and v2. This approach is similar to 2D video coding, as shown in (8.3). Following this, MSE_D evaluates the disparity displacement errors and finally optimizes the depth coding based on already coded and reconstructed color data. Consequently, color and depth coding errors are decoupled.
The 3DV encoder optimization has to ensure a good reconstruction quality for all intermediate views within the entire viewing range κ ∈ [0…1]. For this, the largest MSE value within this range can be determined:

$\mathrm{MSE}_{C,v1} = \max_{\forall \kappa} \{ \mathrm{MSE}_{C,v1}(\kappa) \}, \quad \mathrm{MSE}_{D,v1} = \max_{\forall \kappa} \{ \mathrm{MSE}_{D,v1}(\kappa) \}$
and
$\mathrm{MSE}_{C,v2} = \max_{\forall \kappa} \{ \mathrm{MSE}_{C,v2}(\kappa) \}, \quad \mathrm{MSE}_{D,v2} = \max_{\forall \kappa} \{ \mathrm{MSE}_{D,v2}(\kappa) \}$  (8.14)

In practical applications, a subset of, e.g., 10 intermediate positions is taken, depending on the targeted encoder complexity. Another approach is to only analyze the intermediate view in the middle, at position κ = 0.5. Here, experiments have shown that this viewing position typically has the lowest quality [42]. With the provision of the objective MSE distortion measure, the 3DV encoder can automatically optimize color and depth coding toward the best quality of the viewing range at a given bit rate for the compressed 3DV format. Further information on virtual view distortion computation by MSE can also be found in [43].
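The following sketch evaluates the decoupled color and depth distortions of Eqs. (8.13) and (8.14) for the v1 contribution along one scanline and a small set of κ positions. It is a strong simplification of an actual encoder-side synthesized-view distortion estimation; disparities are rounded to integer positions and occlusion handling is reduced to an in-bounds check.

import numpy as np

def view_synthesis_distortions(x1, x1_rec, disp, disp_rec, kappas=(0.25, 0.5, 0.75)):
    """Decoupled color (MSE_C) and depth (MSE_D) distortions for the v1
    contribution of a synthesized row, simplified to 1-D.

    MSE_C compares original and reconstructed color at positions given by the
    original disparity; MSE_D compares reconstructed color warped with the
    original versus the reconstructed disparity. The maximum over the tested
    kappa positions is returned, as in Eq. (8.14)."""
    width = disp.shape[0]
    mse_c_max, mse_d_max = 0.0, 0.0
    for kappa in kappas:
        ec, ed = [], []
        for n in range(width):
            p  = n + int(round((1.0 - kappa) * disp[n]))       # shift with original d
            pr = n + int(round((1.0 - kappa) * disp_rec[n]))   # shift with coded d + delta d
            if 0 <= p < width and 0 <= pr < width:
                ec.append((float(x1[p]) - float(x1_rec[p])) ** 2)       # color error
                ed.append((float(x1_rec[p]) - float(x1_rec[pr])) ** 2)  # shift error
        if ec:
            mse_c_max = max(mse_c_max, float(np.mean(ec)))
            mse_d_max = max(mse_d_max, float(np.mean(ed)))
    return mse_c_max, mse_d_max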


In the literature, sometimes also the PSNR of the reconstructed intermediate views x̃_{vS}(n_{vS}) is calculated, assuming the uncoded synthesized view x_{vS}(n_{vS}) as a reference. This, however, does not reflect the subjective quality adequately enough, as synthesis errors in the uncoded intermediate view are not considered. Therefore, the final reconstruction quality has to be subjectively evaluated, as shown in the 3DV system output in Fig. 8.6.

8.4.4 Depth Data Coding


In the previous section, the 3DV coding principles have been discussed and the distortion measure for optimized video and depth encoding for a dense range of high-quality output views was derived. In existing coding methods, such as AVC or MVC [14], the compression technology is optimized for the statistical characteristics of video data, including prediction structures, transformation, and entropy coding. Therefore, coding methods for depth data need to be adapted to the characteristics of the depth data. Also, the specific usage of depth data in the view generation process needs to be considered. Here, reconstructed depth data is used to shift reconstructed color data to generate intermediate views, and consequently depth reconstruction errors translate into shift errors for intermediate video data.
In order to develop depth-adapted coding methods, depth data needs to be
analyzed. For this, correlation histograms (CHs) can be used [44]. CHs are an
extension of image histograms, which are a well-known tool in image analysis and
processing, e.g., for color and contrast adjustment [36]. An image histogram
contains bins H(k), where each bin counts the number of occurrences of a value
k. An example is luminance histograms, where each bin H(k) contains the number


of image samples with color value k at image position x(n) = x((i j)^T), with horizontal and vertical coordinates i and j. These histograms can be extended toward two-dimensional histograms with an array of bins H(k, l). As the depth and color edges are of special importance for 3DV coding, relations between neighboring or corresponding samples in the video and depth signal can be analyzed by bin arrays H(k, l). For creating a CH, two neighboring samples are considered as a sample pair. For this, a bin H(k, l) represents the number of neighboring sample pairs, where the first and second samples have a value of k and l respectively. As an example, two neighboring pixels in an image shall have luminance values of 210 and 212. Then, H(210, 212) = 20 would indicate that the image contains 20 pixel pairs with these luminance values. For edge detection in 3DV data, especially the spatial correlations between neighboring pixels in horizontal and vertical direction are important. Therefore, a spatial CH contains the number H_spatial(k, l) of sample pairs with k = x((i j)^T), l = x((i−1 j)^T) and k = x((i j)^T), l = x((i j−1)^T) respectively. H_spatial(k, l) is related to the co-occurrence matrix C_{Δi,Δj}(k, l) of a picture, the latter being defined as:

$C_{\Delta i, \Delta j}(k, l) = \sum_{i=\Delta i}^{I-1} \sum_{j=\Delta j}^{J-1} \begin{cases} 1, & \text{if } x((i\ j)^T) = k \text{ and } x((i-\Delta i\ \ j-\Delta j)^T) = l \\ 0, & \text{else} \end{cases}$  (8.15)
Here, Δi ≥ 0 and Δj ≥ 0 are the distances between pixel positions in the vertical and horizontal directions of a picture with I × J pixels. With this, H_spatial(k, l) accumulates both first-order neighborhood co-occurrence matrices for the vertical and horizontal directions over the entire sequence of pictures from t = 1…T:

$H_{\mathrm{spatial}}(k, l) = \sum_{t=1}^{T} \big( C_{0,1}(k, l, t) + C_{1,0}(k, l, t) \big)$  (8.16)

Thus, important features are detected by H_spatial(k, l) in both spatial directions.
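A compact NumPy sketch of Eqs. (8.15) and (8.16), accumulating the first-order horizontal and vertical co-occurrences of 8-bit sample pairs over a sequence of pictures:

import numpy as np

def spatial_correlation_histogram(frames, bins=256):
    """Spatial correlation histogram H_spatial(k, l): accumulate the counts of
    vertically and horizontally adjacent 8-bit sample pairs over all pictures."""
    H = np.zeros((bins, bins), dtype=np.int64)
    for frame in frames:                      # t = 1 ... T
        x = frame.astype(np.int64)
        # first-order vertical neighbors, C_{1,0}(k, l, t)
        np.add.at(H, (x[1:, :].ravel(), x[:-1, :].ravel()), 1)
        # first-order horizontal neighbors, C_{0,1}(k, l, t)
        np.add.at(H, (x[:, 1:].ravel(), x[:, :-1].ravel()), 1)
    return H

# A logarithmic display, as used for Figs. 8.9 and 8.10, can be obtained,
# e.g., via np.log10(1 + H) before normalization.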


The CH analysis is carried out for the luminance component of the video data (as it
contains the relevant edge information), as well as for the depth data. For an 8 bit
resolution of these components, a CH contains an array of 256 × 256 bins.
For analyzing the characteristics of depth-enhanced 3DV, CHs are superior to
other methods (e.g. spectrum analysis), as they are able to detect sharp edges and
constant regions, which are both typical features of depth maps and important for
the coding process.


The normalized logarithmic CHs log(H_spatial(k, l)) for the video and depth data of the two view sequences v1 and v2 of the Book_Arrival sequence are shown in Fig. 8.9. They correspond to the image example shown in Fig. 8.7. Here, the maximum value max(H_spatial(k, l)) is normalized to log(10).
The gray-value coded bin values H_spatial(k, l) are shown in logarithmic scale in order to better differentiate between the high values along the diagonal and the low values. Values on the CH main diagonal show bins for equal sample pairs, e.g., H_spatial(50, 50). Values close to the diagonal count pixel pairs with small differences in values and thus indicate smooth image areas. In contrast, large pixel differences in a sample pair lead to histogram values far from the diagonal. Here, clusters of significant values indicate important edges. Regarding the video CHs in Fig. 8.9, top, the expected characteristic of a compact distribution around the diagonal can be observed. Here, the diagonal shows the highest values, indicated by darker values. The values decrease with increasing distance from the diagonal, indicated first by the lighter gray values and then darker values again. In contrast, the depth CHs in Fig. 8.9, bottom, show that the values only occur at discrete positions and thus that the original depth data uses a limited number of the available 256 values. Comparing the spatial CHs for the original video with the corresponding depth results shows significant differences: depth CHs are much more frayed, with some relevant areas off the diagonal with medium values (lighter gray) that represent depth edges between foreground and background objects.


Fig. 8.9 Normalized logarithmic correlation histograms log(Hspatial(k, l)) for original luminance
and depth data of views v1 and v2, Book_Arrival sequence (see Fig. 8.7). The grayscale codes are
in logarithmic scale. The gray values at the main diagonal refer to the higher values up to 10.
With increasing distance from the diagonal, the histogram values decrease down to 0

CHs can be used to analyze the coded data and compare it to the original data.
Especially, the preservation of edge information and larger homogeneous areas
can be studied. An example for MVC coding is shown in Fig. 8.10 for the
Book_Arrival sequence at a medium data rate of 650 kbit/s.
The comparison of the decoded CHs in Fig. 8.10 with the original CHs in
Fig. 8.9 shows the differences in luminance and depth coding using MVC. The
luminance CHs show a stronger concentration along the diagonal due to typical
video coding artifacts, like low pass filtering. For the depth data, the discrete CH
values in the original data in Fig. 8.9 have spread out, such that a more continuous
CH is obtained for the coded version in Fig. 8.10. CH values and value clusters
toward the upper left and lower right corner of the CH represent important depth
edges. In the CH of the coded version, these values have disappeared or enlarged,
such as the lighter gray isolated areas of discrete values in Fig. 8.9, bottom.
This indicates that important details have been altered. The different changes in


Fig. 8.10 Normalized logarithmic correlation histograms log(Hspatial(k, l)) for decoded luminance and depth data of views v1 and v2, Book_Arrival sequence (see Fig. 8.7). The grayscale
codes are in logarithmic scale. The gray values at the main diagonal refer to the higher values up
to 10. With increasing distance from the diagonal, the histogram values decrease down to 0

luminance and depth CHs show that MVC was optimized for video coding, while
its application to depth coding removes important features.
Consequently, a number of alternative depth coding approaches have been
investigated for preserving the most important depth features for good intermediate view synthesis [45–49, 52]. These are discussed in detail in Chap. 9.

8.5 Alternative Formats and Methods for 3D Video


In 3DVC, an input format needs to be specified that guarantees a certain quality for
the dense range of output views. For this, the generic MVD format has been
introduced and shown for two views in Fig. 8.7. Besides MVD, alternative formats


have been used, such as layered depth video (LDV) and depth-enhanced stereo
(DES) [50]. These formats are shown in Fig. 8.11.
Both LDV and DES can be derived from the MVD format. The LDV format
consists of one (central) video and depth map, as well as associated occlusion
information for video and depth data. In Fig. 8.11, this is shown as background
information, which needs to be extracted beforehand in a preprocessing step.
Alternatively, the background information can be reduced to pure occlusion or
difference information, which only contains background information behind
foreground objects. The LDV view synthesis generates the required output views
for stereoscopic and autostereoscopic displays by projecting the central video with
the depth information to the required viewing positions and filling the disoccluded
areas in the background with the occlusion information [51].
For LDV, higher compression efficiency might be obtained, as only one full
view and occlusion information at the same position needs to be encoded. In
comparison, a minimum of two views from different viewpoints and thus differentiated by disparity offset needs to be coded for MVD. On the other hand, the
background data in LDV that is revealed behind foreground objects has no original
reference. This information is originally obtained from the second view of an
MVD format or further original views, if available. Therefore, both formats
originate from the same input data. For an optimized encoding process, inter-view
correlation between adjacent input views is only exploited at different stages in the
encoding process: In LDV, a redundancy reduction between the input views
already takes place at a preprocessing stage for the creation of this format, such
that the encoding cannot significantly reduce the compressed data size, especially
in disoccluded areas. For MVD, the redundancy reduction takes place in the
encoding process, where a precise inter-view prediction can better reduce the
compressed data size for areas, which are similar in adjacent views. Thus, similar
compression results are expected for LDV and MVD by fully optimized 3DV
coding methods.
As LDV only contains one original view, view synthesis always needs to be
carried out in order to obtain the second view. This can be problematic for pure
stereo applications that want to use two original or decoded views without further
view generation. Here, an alternative format is DES, which combines MVD and
LDV into a two-view representation with video, depth, occlusion video, and
occlusion depth signals for two views. The DES format enables pure stereo processing on one hand and additionally contains occlusion information for additional
view extrapolation toward either side of the stereo pair for multi-view displays on
the other hand. For pure stereo content, only limited occlusion information is
available, such that the MVD2 format is most suitable for stereo content. If, however, original multi-view content is available from a number of input cameras, the
creation of occlusion information from additional cameras is rather useful to generate a compact format that can be used to synthesize a wide range of output views.
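To make the relationship between these formats concrete, the following Python sketch organizes them as simple data containers; the class and field names are illustrative assumptions, not part of any standard, and simply mirror the signal counts listed for Fig. 8.11.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class View:
    video: np.ndarray            # color picture, H x W x 3
    depth: np.ndarray            # per-pixel depth map, H x W

@dataclass
class MVD:                       # multi-view video plus depth: N full views (N = 2 in Fig. 8.7)
    views: List[View]

@dataclass
class LDV:                       # layered depth video: one central view plus occlusion layers
    central: View
    occlusion_video: np.ndarray  # background color revealed behind foreground objects
    occlusion_depth: np.ndarray  # corresponding occlusion depth

@dataclass
class DES:                       # depth-enhanced stereo: two views, each with occlusion layers
    left: LDV
    right: LDV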


Fig. 8.11 Different 3D video formats: MVD with 2 video and 2 depth signals, LDV with 1 video,
1 depth, 1 occlusion video and 1 occlusion depth signal, DES with 2 video, 2 depth, 2 occlusion
video, and 2 occlusion depth signals

8.6 Chapter Summary


In this chapter, 3DV coding (3DVC) with data format selection, video and depth
compression, evaluation methods, and analysis tools have been discussed. First,
the fundamental principles of video coding for classical 2D content have been
reviewed, including signal prediction, quantization, transformation, and entropy
coding. The encoder uses rate-distortion-optimization in order to obtain the best
reconstruction quality at a given bit rate, or vice versa, to find the minimum bit rate
at a given quality. For quality evaluation, the PSNR measure is used as an
objective metric, derived from MSE, while subjective assessment is carried out by
MOS evaluation.
Next, the extension of 2D video coding methods for stereo and multi-view video
content has been shown. Especially, inter-view prediction between neighboring
views enables higher coding gains. The encoder-side rate-distortion-optimization
can be applied for MVC in the same way as single view coding. Here, an MSE value
is first calculated for each individual view and all MSE values are weighted for
calculating the final objective PSNR measure. In addition, MOS is used to assess the
subjective quality of multi-view data.
For 3DVC, some of the basic coding principles and assumptions considerably
change in comparison with previous video coding methods. A 3DV system has to
support a variety of multi-view displays with different number and spatial position
of views. This can only be achieved by providing an encoder-side generic input
format and decoder-side view generation to extract the required number of views.
One 3DV format is MVD, which provides per-pixel depth information.


When encoding video and depth data, a high quality of the entire viewing range of
synthesized views needs to be provided. Therefore, this chapter has shown how the
3DV encoder optimization emulates view synthesis functionality by analyzing the
coding distortion for intermediate views. Accordingly, color and depth MSE
consider the different types of video and depth coding errors. For the overall 3DV
system evaluation, subjective MOS needs to be obtained, either for the entire
viewing range or for stereo pairs out of this range. In addition, the generic 3DV
format is reconstructed before view synthesis for evaluating the coding efficiency
only by objective PSNR measures. Consequently, interdependencies between
compression technology and view synthesis methods are resolved. For the
assessment of depth coding methods, correlation histograms (CHs) have been
introduced. The differences between original and reconstructed CH reveal whether
a specific coding algorithm is capable of preserving important features. For depth
data coding, especially the important edges between foreground and background
need to be preserved, as they can lead to pixel displacement and consequently
wrong color shifts in intermediate views.
Finally, alternative formats, such as LDV and DES, have been discussed with
respect to application areas and coding efficiency in comparison with the MVD
format.
This chapter has summarized the fundamentals of 3DV coding, where coding
and view generation methods are applied to provide a range of high-quality output
views for any stereoscopic and multi-view display from a generic 3DV input at a
limited bit rate.

References
1. Benzie P, Watson J, Surman P, Rakkolainen I, Hopf K, Urey H, Sainov V, von Kopylow C (2007) A survey of 3DTV displays: techniques and technologies. IEEE Trans Circuits Syst Video Technol 17(11):1647–1658
2. Konrad J, Halle M (2007) 3-D displays and signal processing: an answer to 3-D ills? IEEE Signal Proces Mag 24(6):21
3. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):2163–2177
4. Berger T (1971) Rate distortion theory. Prentice-Hall, Englewood Cliffs
5. Wiegand T, Schwarz H (2011) Source coding: part I of fundamentals of source and video coding. Found Trends Signal Proces 4(1–2):1–222, Jan 2011. http://dx.doi.org/10.1561/2000000010
6. Jayant NS, Noll P (1994) Digital coding of waveforms. Prentice-Hall, Englewood Cliffs
7. Huffman DA (1952) A method for the construction of minimum redundancy codes. In: Proceedings IRE, pp 1098–1101, Sept 1952
8. Said A (2003) Arithmetic coding. In: Sayood K (ed) Lossless compression handbook. Academic Press, San Diego
9. Chen Y, Wang Y-K, Ugur K, Hannuksela M, Lainema J, Gabbouj M (2009) The emerging MVC standard for 3D video services. EURASIP J Adv Sign Proces 2009(1)
10. ISO/IEC JTC1/SC29/WG11 (2008) Text of ISO/IEC 14496-10:200X/FDAM 1 multiview video coding. Doc. N9978, Hannover, Germany, July 2008


11. Merkle P, Smolic A, Mueller K, Wiegand T (2007) Efficient prediction structures for multiview video coding, invited paper. IEEE Trans Circuits Syst Video Technol 17(11):1461–1473
12. Shimizu S, Kitahara M, Kimata H, Kamikura K, Yashima Y (2007) View scalable multi-view video coding using 3-D warping with depth map. IEEE Trans Circuits Syst Video Technol 17(11):1485–1495
13. Vetro A, Wiegand T, Sullivan GJ (2011) Overview of the stereo and multiview video coding extensions of the H.264/AVC standard. Proc IEEE, Special issue on 3D Media and Displays 99(4):626–642
14. ITU-T and ISO/IEC JTC 1 (2010) Advanced video coding for generic audiovisual services. ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG-4 AVC), Version 10, March 2010
15. Wiegand T, Sullivan GJ, Bjøntegaard G, Luthra A (2003) Overview of the H.264/AVC video coding standard. IEEE Trans Circuits Syst Video Technol 13(7):560–576
16. Schwarz H, Marpe D, Wiegand T (2006) Analysis of hierarchical B pictures and MCTF. In: IEEE international conference on multimedia and expo (ICME 2006), Toronto, July 2006
17. Strohmeier D, Tech G (2010) On comparing different codec profiles of coding methods for mobile 3D television and video. In: Proceedings 3D systems and applications, Tokyo, May 2010
18. ISO/IEC JTC1/SC29/WG11 (2009) Vision on 3D video. Doc. N10357, Lausanne, Feb 2009
19. Müller K, Smolic A, Dix K, Merkle P, Wiegand T (2009) Coding and intermediate view synthesis of multi-view video plus depth. In: Proceedings IEEE international conference on image processing (ICIP'09), Cairo, pp 741–744, Nov 2009
20. Müller K, Merkle P, Wiegand T (2011) 3D video representation using depth maps. Proc IEEE, Special issue on 3D media and displays 99(4):643–656
21. Faugeras O (1993) Three-dimensional computer vision: a geometric viewpoint. MIT Press, Cambridge
22. Hartley R, Zisserman A (2000) Multiple view geometry in computer vision. Cambridge University Press, Cambridge
23. Scharstein D, Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vision 47(1):7–42
24. Bleyer M, Gelautz M (2005) A layered stereo matching algorithm using image segmentation and global visibility constraints. ISPRS J Photogrammetry Remote Sens 59(3):128–150
25. Szeliski R, Zabih R, Scharstein D, Veksler O, Kolmogorov V, Agarwala A, Tappen M, Rother C (2006) A comparative study of energy minimization methods for Markov random fields. In: European conference on computer vision (ECCV 2006), vol 2, pp 16–29, Graz, May 2006
26. Atzpadin N, Kauff P, Schreer O (2004) Stereo analysis by hybrid recursive matching for real-time immersive video conferencing. IEEE Trans Circuits Syst Video Technol, Special issue on immersive telecommunications 14(3):321–334
27. Cigla C, Zabulis X, Alatan AA (2007) Region-based dense depth extraction from multi-view video. In: Proceedings IEEE international conference on image processing (ICIP'07), San Antonio, USA, pp 213–216, Sept 2007
28. Felzenszwalb PF, Huttenlocher DP (2006) Efficient belief propagation for early vision. Int J Comp Vision 70(1):41
29. Kolmogorov V (2006) Convergent tree-reweighted message passing for energy minimization. IEEE Trans Pattern Anal Mach Intell 28(10):1568
30. Kolmogorov V, Zabih R (2002) Multi-camera scene reconstruction via graph cuts. In: European conference on computer vision, May 2002
31. Lee S-B, Ho Y-S (2010) View consistent multiview depth estimation for three-dimensional video generation. In: Proceedings IEEE 3DTV conference, Tampere, Finland, June 2010
32. Min D, Yea S, Vetro A (2010) Temporally consistent stereo matching using coherence function. In: Proceedings IEEE 3DTV conference, Tampere, June 2010


33. Tanimoto M, Fujii T, Suzuki K (2008) Improvement of depth map estimation and view synthesis. ISO/IEC JTC1/SC29/WG11, M15090, Antalya, Jan 2008
34. Müller K, Smolic A, Dix K, Merkle P, Kauff P, Wiegand T (2008) View synthesis for advanced 3D video systems. EURASIP J Image Video Proces, Special issue on 3D Image and Video Processing, vol 2008, Article ID 438148, 11 pages, 2008. doi:10.1155/2008/438148
35. Zitnick CL, Kang SB, Uyttendaele M, Winder S, Szeliski R (2004) High-quality video view interpolation using a layered representation. ACM SIGGRAPH and ACM Transactions on Graphics, Los Angeles, Aug 2004
36. Gokturk S, Yalcin H, Bamji C (2004) A time-of-flight depth sensor - system description, issues and solutions. In: Proceedings of IEEE computer vision and pattern recognition workshop, vol 4, pp 35–43
37. ISO/IEC DIS 14772-1 (1997) The virtual reality modeling language. April 1997
38. Würmlin S, Lamboray E, Gross M (2004) 3D video fragments: dynamic point samples for real-time free-viewpoint video. Computers and Graphics, Special issue on coding, compression and streaming techniques for 3D and multimedia data, Elsevier, pp 3–14
39. Fusiello A, Trucco E, Verri A (2000) A compact algorithm for rectification of stereo pairs. Mach Vis Appl 12(1):16–22
40. Kauff P, Atzpadin N, Fehn C, Müller M, Schreer O, Smolic A, Tanger R (2007) Depth map creation and image based rendering for advanced 3DTV services providing interoperability and scalability. Signal Processing: Image Communication, Special issue on 3DTV, Feb 2007
41. Redert A, de Beeck MO, Fehn C, Ijsselsteijn W, Pollefeys M, Van Gool L, Ofek E, Sexton I, Surman P (2002) ATTEST: advanced three-dimensional television system techniques. In: Proceedings of international symposium on 3D data processing, visualization and transmission, pp 313–319, June 2002
42. Merkle P, Morvan Y, Smolic A, Farin D, Müller K, de With PHN, Wiegand T (2009) The effects of multiview depth video compression on multiview rendering. Signal Proces: Image Commun 24(12):73–88
43. Liu Y, Huang Q, Ma S, Zhao D, Gao W (2009) Joint video/depth rate allocation for 3D video coding based on view synthesis distortion model. Signal Proces: Image Commun 24(8):666–681
44. Merkle P, Singla J, Müller K, Wiegand T (2010) Correlation histogram analysis of depth-enhanced 3D video coding. In: Proceedings IEEE international conference on image processing (ICIP'10), Hong Kong, pp 2605–2608, Sept 2010
45. Choi J, Min D, Ham B, Sohn K (2009) Spatial and temporal up-conversion technique for depth video. In: Proceedings IEEE international conference on image processing (ICIP'09), Cairo, Egypt, pp 741–744, Nov 2009
46. Daribo I, Tillier C, Pesquet-Popescu B (2008) Adaptive wavelet coding of the depth map for stereoscopic view synthesis. In: Proceedings IEEE international workshop on multimedia signal processing (MMSP'08), Cairns, Australia, pp 34–39, Oct 2008
47. Kim S-Y, Ho Y-S (2007) Mesh-based depth coding for 3D video using hierarchical decomposition of depth maps. In: Proceedings IEEE international conference on image processing (ICIP'07), San Antonio, pp V-117–V-120, Sept 2007
48. Kim W-S, Ortega A, Lai P, Tian D, Gomila C (2010) Depth map coding with distortion estimation of rendered view. In: Visual information processing and communication, Proceedings of the SPIE, vol 7543
49. Oh K-J, Yea S, Vetro A, Ho Y-S (2009) Depth reconstruction filter and down/up sampling for depth coding in 3-D video. IEEE Signal Proces Lett 16(9):747–750
50. Smolic A, Müller K, Merkle P, Kauff P, Wiegand T (2009) An overview of available and emerging 3D video formats and depth enhanced stereo as efficient generic solution. In: Proceedings picture coding symposium (PCS 2009), Chicago, May 2009


51. Müller K, Smolic A, Dix K, Kauff P, Wiegand T (2008) Reliability-based generation and view synthesis in layered depth video. In: Proceedings IEEE international workshop on multimedia signal processing (MMSP 2008), Cairns, pp 34–39, Oct 2008
52. Maitre M, Do MN (2009) Shape-adaptive wavelet encoding of depth maps. In: Proceedings picture coding symposium (PCS'09), Chicago, USA, May 2009

Chapter 9

Depth Map Compression for Depth-Image-Based Rendering

Gene Cheung, Antonio Ortega, Woo-Shik Kim, Vladan Velisavljevic and Akira Kubota

Abstract In this chapter, we discuss unique characteristics of depth maps, review recent depth map coding techniques, and describe how texture and depth map compression can be jointly optimized.




Keywords Bit allocation · Characteristics of depth · Depth-image-based rendering (DIBR) · Depth map coding · Distortion model · Don't care region · Edge-adaptive wavelet · Graph-based transform · Geometric error · Joint coding · Quadratic penalty function · Rate-distortion optimization · Rendered view distortion · Sparse representation




G. Cheung (✉)
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku,
Tokyo 101-8430, Japan
e-mail: cheung@nii.ac.jp
A. Ortega
University of Southern California, Los Angeles, CA, USA
e-mail: antonio.ortega@sipi.usc.edu
W.-S. Kim
Texas Instruments Inc., Dallas, TX, USA
e-mail: wskim@ti.com
V. Velisavljevic
University of Bedfordshire, Bedfordshire, UK
e-mail: vladan.velisavljevic@beds.ac.uk
A. Kubota
Chuo University, Hachio-ji, Tokyo, Japan
e-mail: kubota@elect.chuo-u.ac.jp


9.1 Introduction
As described in other chapters, 3D-TV systems often require multiple video channels
as well as depth information in order to enable synthesis of a large number of views at
the display. For example, some systems are meant to enable synthesis of any arbitrary
intermediate view chosen by the observer between two captured views, a media
interaction commonly known as free viewpoint. Synthesis can be performed via
depth-image-based rendering (DIBR).
Clearly, compression techniques will be required for these systems to be used in
practice. Extensive work has been done to investigate approaches for compression
of multiview video, showing that some gains can be achieved by jointly encoding
multiple views, using extensions of well-known techniques developed for single-view video. For example, well-established motion compensation methods can be
extended to provide disparity compensation (predicting a frame in a certain view
based on corresponding frames in neighboring views), thus reducing overall rate
requirements for multiview systems.
In this chapter, we consider systems where multiple video channels are transmitted, along with corresponding depth information for some of them. Because
techniques for encoding of multiple video channels, multiview video coding
(MVC), are based on well-known methods,1 here we focus on problems where new
coding tools are showing more promise.
In particular, we provide an overview of techniques for coding of depth data.
Depth data may require coding techniques different from those employed for
standard video coding, because of its different characteristics. For example, depth
signals tend to be piecewise smooth, unlike typical video signals which can
include significant texture information. Moreover, depth images are not displayed
directly and are instead used to synthesize intermediate views. Because of this,
distortion metrics that take the synthesis process into account need to be used in
the context of any rate-distortion (RD) optimization of encoding for these images.
Finally, since both depth and texture are encoded together, techniques for joint
encoding must be considered, including, for example, techniques to optimize the
bit allocation between video and depth. Note that some authors have suggested to
replace or supplement depth information in order to achieve better view synthesis
[2, 3]. These studies fall outside of the scope of this chapter.
This chapter is organized as follows. We start by discussing briefly, in Sect. 9.2,
those characteristics that make depth signals different from standard video signals.
This will help us provide some intuition to explain the different approaches that
have been proposed to deal with depth signals. In Sect. 9.3, we discuss approaches
that exploit specific characteristics of depth in the context of existing coding
algorithms. Then, in Sect. 9.4, we provide an overview of new methods that have
been proposed to specifically target depth signals. Because depth and texture are
¹ As an example, the MPEG committee defined an MVC extension to an existing standard [1] where no new coding tools were introduced.


usually transmitted together, in Sect. 9.5, we consider techniques to jointly encode


texture video data and depth information.

9.2 Unique Characteristics of Depth Maps


Depth maps are obviously closely related to their corresponding video scenes,
which indicates that coding techniques that have been designed to encode standard
video can be readily applied to depth sequences. Indeed, encoding depth using
H.264/AVC [4–6] should be considered a baseline for any other proposed depth
encoding approach.
At the same time, there are major differences between video and depth signals;
methods that exploit those differences have been shown to lead to substantial gains
versus direct applications of existing tools.

9.2.1 Synthesized View Distortion Sensitivity to Depth Pixel Errors


We first note that distortion in depth maps is not directly observed. In standard
video compression, quantization errors directly affect the rendered view quality by
adding noise to the luminance or chrominance level of each pixel. In contrast, the
distortion in the depth map will affect the rendered view quality only indirectly;
the depth map itself is not displayed but is used to provide additional geometric
information to help the decoder's view synthesis process. Consequently, it affects
the overall 3D video quality including depth perception [7].
While specific view synthesis techniques may differ, in general, depth data
provide geometric information about which pixel(s) in the reference view(s)
correspond to a target pixel in the view to be synthesized. These pixels are then
used to construct this synthesized frame. One example is the view synthesis
reference software [8], in which left and right reference views are warped to the
target view position. Leftover holes due to disocclusion are filled using in-painting
techniques. Note that depth information itself could be incorrect (e.g., if depth is
estimated from video) and that depending on the scene structure and camera
positions, in-painting methods cannot guarantee perfect pixel extrapolation in
disoccluded areas. In this chapter, we are not concerned with these sources of error
and instead consider only errors due to lossy encoding.
Clearly, since depth is used to identify the position of reference pixels, an error in
the depth (or disparity) signal will lead to a geometric error, i.e., a pixel at the
wrong position may be used for interpolation. For a given depth error, the magnitude
of this displacement in the reference frame depends on camera parameters such as
baseline distance and focal length. These are static factors for a given camera setup
and thus could be taken into account at the sequence level.

Fig. 9.1 Temperature images of the absolute difference between the synthesized view with and without depth map coding, Champagne Tower
A more challenging problem arises in how these displacement errors lead to view
synthesis errors. Observe that a geometric error leads to a view synthesis error only
if the pixels used for interpolation have in fact different intensity. In an extreme case, if
pixel intensity is constant, even relatively large geometric errors may cause no view
synthesis error. Alternatively, if the intensity signal has a significant amount of texture,
even a minimal geometric error can lead to significant view synthesis error. This also
applies to occlusion. For example, some depth map areas will be occluded after
mapping to the target view. Depth map areas corresponding to occluding areas will
obviously have more impact on the rendered view quality than those in the occluded
area. Thus, the effect of depth errors on view synthesis is inherently local.
Moreover, this mapping is inherently non-linear, that is, we cannot expect
increases in view synthesis distortion to be proportional to increases in depth
distortion. Consider, for example, a target pixel within an object having constant
intensity. As the geometry error increases, the view synthesis error will remain
small and nearly constant, until the error is large enough that a pixel outside the
object is selected for interpolation, leading to a sudden large increase in view
synthesis distortion. Thus, depth information corresponding to pixels closer to the
object boundary will tend to be more sensitive; i.e., large increase in synthesis
error will be reached with a smaller depth error.
Figure 9.1 illustrates these nonlinear and localized error characteristics. The
absolute difference between the views synthesized with and without depth map
coding is represented as a temperature image [9]. Although quantization is applied
uniformly throughout the frame, its effect on view interpolation is much more
significant near edges. As the quantization step size Q increases, the geometric error
increases, and thus the rendered view distortion also increases. Note, however, that
even though quantization error in depth map increases, the rendered view distortion
still remains localized to a relatively small portion of pixels in the frame.


These characteristics suggest that using depth error as a distortion metric may
lead to suboptimal behavior. Moreover, since the actual distortion is local in
nature, any alternative metrics to be developed for encoding are likely to be local
as well. These ideas will be further explored in Sect. 9.3.

9.2.2 Signal Characteristics of Depth Maps


Depth maps reflect only the distance between physical objects being captured and
the camera (and thus contain no object surface texture information), so that pixels
of a captured object will have very similar depth values. Therefore, typical depth
images can be seen as essentially piecewise smooth, that is, depth information will
vary smoothly within objects and change at object boundaries, where sharp edges
in the depth image may be observed. There are obviously exceptions where the
depth map is perhaps not as smooth (think, for example, of the depth information
generated by a tree), but one can generally assume that depth images will include
significantly less texture than typical video signals.
Video coding tools for standard signals are designed to provide an efficient
representation for a wide range of video information, including textured images.
Designing tools optimized for depth images allows us to consider approaches that
may outperform existing methods for piecewise smooth signals, at the cost of
inefficiency when textures are considered (since these are rare in depth images
anyway). A popular philosophy in designing these approaches is to signal
explicitly the location of (at least some of) the edges in the depth image, leading to
a lower bit rate required to encode the regions within those edges that have been
signaled. Various approaches along these lines will be presented in Sect. 9.4.
Finally, it is important to note that both video and depth information are
transmitted, and these two signals do share common characteristics (e.g., their
corresponding motion fields may be similar). Thus, it will be useful to explore
techniques that can exploit the correlation between these signals when it exists.
Also, keep in mind that any video coding system aims at optimally assigning bits
to different parts of the video signal (blocks, slices, frames, etc) to maximize
overall performance. Thus, rate should be divided between depth and video
information in a way that considers their relative importance. As an example, it
would not be efficient to achieve lossless depth map coding if this came at the cost
of a substantial increase in the distortion in decoded video signals: there would be
no geometric error but error in view synthesis would be dominated by the errors in
the reference pixel intensities. This problem is discussed in Sect. 9.5.


9.3 Depth Map Coding Optimization Using Existing


Coding Tools
In this section, we discuss how one can encode depth maps using existing compression tools, exploiting the characteristics discussed in the previous section for
coding gain. We start by providing a detailed distortion model to capture the effect
of depth distortion on view synthesis. We then provide an overview of various
proposed methods.

9.3.1 Distortion Model


The impact of depth error on the rendered view distortion has been considered by
numerous authors. The main challenge is, as discussed above, that this distortion
varies locally as a function of scene characteristics (scene geometry as well as
specific values of the intensity image). Most methods do not consider these local
characteristics or how the interpolation techniques affect rendered view quality in
the presence of depth distortion. For example, Müller et al. [10] studied experimentally the effect on rendered view quality of bit allocation to texture and depth
maps, but did not consider how this was related to the view synthesis mechanism.
Instead, Merkle et al. [6] measured geometric error caused by depth map error by
calculating the distance between the 3D surfaces constructed from the original and
the coded depth map, without considering the effect of local texture characteristics
on rendering distortion.
Nguyen and Do [11] and Ramanathan and Girod [12] both proposed techniques
to estimate (at a frame level) the relationship between the geometric error and the
rendered view distortion. These approaches as well as others (e.g., [13]) do not
provide block-by-block estimation of rendered view quality. Because block-based
techniques are popular for depth coding, approaches that can estimate locally the
impact of depth distortion on view rendering will be useful in practice. In what
follows, we provide a derivation of geometry error as a function of depth error and
use this to discuss local techniques for rendered view distortion estimation.

9.3.1.1 Derivation of Geometric Error from Depth Map Distortion


Denote z as the depth value at a given pixel, and assume the camera parameters are
known. Then it is possible to map this pixel position first into a point in world
coordinates, and then that position can be remapped into the view observed by a
camera with different parameters, so that a new view can be rendered. When a
depth map, L, is encoded using lossy compression, the resulting distortion in the
decoded map naturally leads to geometric errors in the view interpolation process.


The relationship between the actual depth value, z, and the 8-bit depth map value, L_p(x_im, y_im), is given by

$$\frac{1}{z} = \frac{L_p(x_{im}, y_{im})}{255}\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right) + \frac{1}{Z_{far}} \qquad (9.1)$$

where Z_near and Z_far are the nearest and the farthest clipping planes, which correspond to value 255 and 0 in the depth map, respectively, with the assumption that z, Z_near and Z_far have all positive or all negative values [14].
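As a small worked example, the following Python function (an illustrative sketch, not part of any reference software) inverts (9.1) to recover the metric depth z from an 8-bit depth level.

def depth_level_to_z(L, z_near, z_far):
    """Invert (9.1): map an 8-bit depth map value L (255 = nearest clipping plane,
    0 = farthest) back to the actual depth z, assuming z, z_near and z_far share
    the same sign."""
    inv_z = (L / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z

# Example: with z_near = 1.0 and z_far = 100.0, L = 255 gives z = 1.0 and L = 0 gives z = 100.0.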
Kim et al. [15] derived the geometric error due to a depth map error in the p-th view, ΔL_p, using intrinsic and extrinsic camera parameters as follows:

$$\begin{pmatrix} \Delta x_{im} \\ \Delta y_{im} \\ 1 \end{pmatrix} = \Delta L_p(x_{im}, y_{im}) \, \frac{1}{255}\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right) A_{p'} R_{p'} \left(T_p - T_{p'}\right) \qquad (9.2)$$

where Δx_im and Δy_im are the translation errors in horizontal and vertical direction, respectively, and A_{p'}, R_{p'}, and T_{p'} are the camera intrinsic matrix, the rotation matrix, and a translation vector in the p'-th view, respectively. This reveals that there is a linear relationship between the depth map distortion ΔL and the translational rendering position error ΔP in the rendered view, which can be represented as

$$\Delta P = \begin{pmatrix} \Delta x_{im} \\ \Delta y_{im} \end{pmatrix} = \Delta L_p(x_{im}, y_{im}) \begin{pmatrix} k_x \\ k_y \end{pmatrix} \qquad (9.3)$$

where k_x and k_y are the scale factors determined from the camera parameters and the depth ranges as shown in (9.2).
When the cameras are arranged in a parallel setting, a further simplification (rectification) can be made, so that there will be no translation other than in the horizontal direction [16]. In this case, there will be a difference only in horizontal direction between the translation vectors, i.e., k_y = 0. In addition, the rotation matrix in (9.2) becomes an identity matrix. Neglecting radial distortion in the camera intrinsic matrix, the scaling factor, k_x, in (9.3) becomes

$$k_x = \frac{1}{255}\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right) f_x \, \Delta t_x \qquad (9.4)$$

where f_x is the focal length divided by the effective pixel size in horizontal direction, and Δt_x is the camera baseline distance. Note that the geometric distortion incurred will depend on both the camera parameters and on the depth characteristics of the scene: there will be increased distortion if the distance between cameras is larger or if the depth range is greater. Furthermore, the impact of a given geometric distortion on the view interpolation itself will depend on the local characteristics of the video. We describe coding techniques that take this into account next.
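The rectified case in (9.3)-(9.4) reduces to a single horizontal scale factor, which the following sketch computes; the variable names are assumptions chosen to mirror the symbols above, and the numbers in the comment are only an illustrative setup.

def disparity_error_parallel(delta_L, f_x, delta_t_x, z_near, z_far):
    """Horizontal rendering-position error caused by a depth map error delta_L for
    rectified, parallel cameras, per (9.3)-(9.4): delta_x = k_x * delta_L, with k_y = 0."""
    k_x = (1.0 / 255.0) * (1.0 / z_near - 1.0 / z_far) * f_x * delta_t_x
    return k_x * delta_L

# Example: a coding error of delta_L = 4 depth levels with f_x = 1000 px, a 5 cm baseline
# (delta_t_x = 0.05 m), z_near = 1 m and z_far = 10 m shifts the warped pixel by about
# 4 * (1/255) * 0.9 * 1000 * 0.05, i.e. roughly 0.7 px.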


Fig. 9.2 Illustration of rendered view distortion estimation process using reference video

9.3.1.2 Estimation of Distortion in Rendered View due to Depth Coding


In a DIBR system, a view can be rendered using a set of reference video frames and
their corresponding depth maps. The exact amount of distortion in the rendered view
can be measured if we compare the rendered view with the ground truth, i.e., the
captured video by a camera at that position. However, the ground truth may not be
available in general, since views can be rendered at arbitrary viewpoints. Thus, an
alternative to evaluate the impact of depth coding on view rendering would be to
render at the encoder two versions of the target intermediate view, obtained using the
original and decoded depth values, respectively. By comparing these two rendered
views, it is possible to measure directly the impact of depth coding on rendered
quality, but at the cost of a fairly complex encoding procedure, especially if there is a
need to perform this operation for multiple alternative encodings of the depth map.
As an alternative, techniques have been proposed that compute the geometric
error and then use information in the reference video frame in order to estimate view
interpolation error [15]. Assume a quantization error, ΔL, causing a geometric error, ΔP, for the depth map value at position (x_im, y_im). Then, the basic idea in these methods (see also Fig. 9.2) is to model or compute the distortion incurred in the reference frame when a pixel at that position (x_im, y_im) is replaced by a pixel in the same reference frame, but displaced by a translation corresponding to the geometric error, i.e., the one at (x_im + d_x, y_im + d_y). Thus, for pixel positions where the geometric error is zero, the error in interpolation will be estimated to be zero as well.
Also, note that this approach directly takes into consideration the characteristics of
the video signal, since the same displacement in the reference frame will lead to
lower distortion where the video signal is smooth and without textures.
In [15] two alternative methods are proposed. In the first method, the distortion
corresponding to these displacements in the reference frame is directly computed.
This approach ignores the impact on distortion of performing interpolation from
two different views or the effect of approaches to deal with occlusions, but it still
provides an accurate and local estimation of rendering distortion. Its main drawback is its complexity, especially as it requires memory accesses that can have a
fairly irregular pattern (e.g., as illustrated in Fig. 9.2, two neighboring pixels may
be mapped to two pixels that are not close together).
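A minimal sketch of this direct computation, under the simplifying assumptions of a single reference frame and purely horizontal displacements, could look as follows; it is an illustration of the idea, not the method of [15].

import numpy as np

def direct_rendering_distortion(ref, delta_P):
    """Estimate the synthesized-view SSE caused by per-pixel geometric errors delta_P
    (horizontal displacement, in pixels) by comparing each reference pixel with the
    displaced reference pixel it would be replaced by (cf. Fig. 9.2)."""
    H, W = ref.shape
    cols = np.arange(W)[None, :] + np.rint(delta_P).astype(int)
    cols = np.clip(cols, 0, W - 1)                    # ignore displacements falling outside the frame
    displaced = np.take_along_axis(ref, cols, axis=1)
    return float(np.sum((ref.astype(np.float64) - displaced) ** 2))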
To reduce computation complexity, a model-based approach is proposed in [15]. In this method, the sum squared error (SSE) distortion is estimated as

$$\text{SSE} \approx 2(N-1)\left(1 - \frac{1}{N}\sum_{n=1}^{N} \rho_1^{|\Delta P_n|}\right)\sigma^2_{X_n} \qquad (9.5)$$

where N represents the number of pixels in the block, ρ_1 represents the video correlation when translated by one pixel, and σ²_{X_n} is the variance of the video block, X_n. With this model, the estimated distortion increases as the correlation ρ_1 decreases or the variance increases, where these characteristics of the video signal are estimated on a block-by-block basis. As expected, distortion estimates are not as accurate as those obtained from the first method, but the accuracy is often sufficient, given that encoding decisions are performed at the block level [15].
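A corresponding sketch of the model-based estimate in (9.5) is given below; the leading constant follows the reconstruction of the formula above, so treat this as an illustrative reading rather than a verified implementation of [15].

import numpy as np

def model_based_block_sse(delta_P, rho1, var_block):
    """Model-based rendered-view SSE for one block, following (9.5):
    SSE ~= 2(N-1) * (1 - (1/N) * sum_n rho1**|delta_P_n|) * var_block,
    where rho1 is the one-pixel video correlation and var_block the block variance."""
    dP = np.abs(np.asarray(delta_P, dtype=np.float64)).ravel()
    N = dP.size
    return 2.0 * (N - 1) * (1.0 - np.mean(rho1 ** dP)) * var_block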

9.3.2 Proposed Methods


9.3.2.1 Rate-Distortion Optimization for Depth Map Coding
State-of-the-art video codecs make use of various coding modes in order to
improve coding efficiency. For example, in addition to skip mode and intra-coding
mode (no prediction), H.264 provides various spatial prediction directions for
intraprediction modes and various block sizes and temporal prediction directions
for interprediction modes. To achieve the best coding performance, it is important
to select the optimal coding mode by considering bit rate and distortion at the same
time. For this purpose, Lagrangian optimization has been widely used [17, 18] to
optimize video coding performance.
When Lagrangian techniques are applied to depth map coding, it is appropriate
to consider the rendered view distortion caused by depth map error rather than the
distortion in the depth map itself, since, as previously discussed, a depth map's sole
purpose is to provide geometric information of the captured scene to assist in view
synthesis via DIBR. Therefore, instead of using the depth map distortion directly,
it is proposed to use the estimation of the distortion in the rendered view to select
the optimal coding mode for depth map compression [15]. In [15], the rendered
view distortion from each depth map was considered separately, so that each depth
map can be encoded independently.
It is also possible to consider the optimal selection of Lagrange multiplier for
depth map coding. In the case of video coding, the optimal Lagrange multiplier
can be selected by considering the rate of change in distortion versus bit rate for
the given compressed video [19, 20]. When this is applied to depth map coding, it
is necessary to consider the rendered view distortion instead of the depth map
distortion itself. Kim et al. [15] derived the Lagrange multiplier for depth map
coding as a function of quantization step size based on the estimated rendered view
distortion using an autoregressive model given in (9.5). Refer to [15] for the
details.
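The resulting mode decision itself is a standard Lagrangian comparison, only with a rendered-view distortion estimate in place of the depth-map distortion, as in this hypothetical sketch (the Mode structure and the numbers in the comment are illustrative, not from [15]).

from collections import namedtuple

Mode = namedtuple("Mode", ["name", "rendered_view_distortion", "rate"])

def select_mode(candidates, lam):
    """Lagrangian mode decision sketch for depth map coding: each candidate mode carries
    an estimate of the rendered-view distortion it would cause (e.g., from the model in
    (9.5)) together with its bit cost; the mode minimizing J = D + lam * R is chosen."""
    return min(candidates, key=lambda m: m.rendered_view_distortion + lam * m.rate)

# Example: select_mode([Mode("SKIP", 120.0, 2), Mode("INTRA_16x16", 40.0, 95)], lam=0.9)
# picks SKIP (J = 121.8) over INTRA_16x16 (J = 125.5).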


9.3.2.2 Transform Domain Sparsification


As previously discussed, depth maps are encoded solely for the purpose of
synthesizing intermediate views via DIBR [21]. Given that different depth pixels
in general have varying degree of impact on the resulting synthesized view
distortion, one can assign them different levels of importance. Then, during coding
optimization, one would preserve values of more important depth pixels, while
manipulating values of less important depth pixels (within a well defined extent)
for coding gain. In this section, we discuss one depth value manipulation strategy
called transform domain sparsification (TDS) [22, 23] in more detail.
An orthogonal transform coder maps a signal s ∈ ℝ^N to a set of N predefined basis functions φ_i's spanning the same signal space ℝ^N of dimension N. In other words, a given signal s in ℝ^N can be written as a linear sum of those basis functions using coefficients α_i's:

$$s = \sum_{i=1}^{N} \alpha_i \phi_i \qquad (9.6)$$

Only non-zero quantized transform coefficients α̂_i's are encoded and transmitted to the receiver for reconstruction of the approximate signal ŝ. The α_i's are obtained using a complementary set of basis functions φ̄_i's, i.e., α_i = ⟨s, φ̄_i⟩, where ⟨x, y⟩ denotes a well-defined inner product between two signals x and y in Hilbert space ℝ^N.
In general, coding gain can be achieved if the quantized transform coefficients α̂_i's are sufficiently sparse relative to the original signal s, i.e., the number of non-zero coefficients α̂_i's is small. Representations of typical image and video signals in popular transform domains such as the Discrete Cosine Transform (DCT) and Wavelet Transform (WT) have been shown to be sparse in a statistical sense, given appropriate quantization used, resulting in excellent compression performance.
In the case of depth map encoding, to further improve the representation sparsity of a given signal s in the transform domain, one can explicitly manipulate values of less important depth pixels to sparsify the signal s, resulting in even fewer non-zero transform coefficients α̂_i's, optimally trading off its representation sparsity and its adverse effect on synthesized view distortion.
In particular, in this section we discuss two methods to define depth-pixel
importance and associated coding optimizations given defined per-depth-pixel
importance to improve representation sparsity in the transform domain. In the first
method called don't care region (DCR) [22], one can define, for a given code block, a range of depth values for each pixel in the block, where any depth value within range will worsen synthesized view distortion by no more than a threshold value T. Given defined per-pixel DCRs, one can formulate a sparsity maximization
problem, where the objective is to find the most sparse representation in the
transform domain while constraining the search space to be inside per-pixel DCRs.
In the second method, for each depth pixel in a code block, [23] first defines a
quadratic penalty function, where larger deviation from its nadir (ground truth


depth value) leads to a larger penalty. A synthesized view's distortion sensitivity to the pixel's depth value determines the sharpness of the constructed parabola. To induce a proper RD tradeoff, [23] then defines an objective for a depth signal in a code block as a weighted sum of: (1) the signal's sparsity in the transform domain
(proxy for rate), and (2) per-pixel synthesized view distortion penalties for the
chosen signal in the pixel domain.

9.3.2.3 Defining Don't Care Region


We first discuss how DCRs are derived in [22]. Each DCR, specified in the pixel
domain, defines the search space of depth signals in which a sparse representation
in transform domain is sought.
Assume we are given left and right texture maps I_l and I_r, captured by a horizontally shifted camera, and corresponding depth maps D_l and D_r at the same viewpoints and of the same spatial resolution. A pixel I_l(m, n) in the left texture map, where m is the pixel row and n is the pixel column, can then be mapped to a shifted pixel I_r(m, n − D_l(m, n)·c) in the right texture map, where D_l(m, n) is the depth value² in the left depth map corresponding to the left texture pixel I_l(m, n), and c is the camera-shift scaling factor for this camera setup. To derive the synthesized view's distortion sensitivity to left depth pixel D_l(m, n), [22] defines an error function E_l(e; m, n) given depth error e: it is the difference in texture pixel values between left pixel I_l(m, n) and incorrectly mapped right pixel I_r(m, n − (D_l(m, n) + e)·c) due to depth error e. We write:

$$E_l(e; m, n) = \left| I_l(m, n) - I_r\big(m, n - (D_l(m, n) + e)\cdot c\big) \right| \qquad (9.7)$$

Given the above definition, one can now determine a DCR for left pixel D_l(m, n) given threshold T as follows: find the smallest lower bound depth value f(m, n) and largest upper bound depth value g(m, n), f(m, n) < D_l(m, n) < g(m, n), such that the resulting synthesized error E_l(e; m, n), for any depth error e with f(m, n) ≤ e + D_l(m, n) ≤ g(m, n), does not exceed E_l(0; m, n) + T.
See Fig. 9.3a for an example of a depth map D_l for the multiview sequence teddy [24] and Fig. 9.3b for an example of ground truth depth D_l(m, n) (blue), DCR lower and upper bounds f(m, n) (red) and g(m, n) (black) for a pixel row in an 8 × 8 block in teddy. A similar procedure can be derived to find the DCR for the right depth map.
In general, a larger threshold T offers a larger subspace for an algorithm to
search for sparse representations in transform domain leading to compression gain,
at the expense of larger resulting synthesized distortion.
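A simplified sketch of this per-pixel DCR search, assuming integer depth levels, purely horizontal warping and the error function of (9.7), is given below; it grows the interval outward from the ground truth value until the threshold is first violated, and is only an illustration of the definition, not the algorithm of [22].

import numpy as np

def dont_care_region(I_l, I_r, D_l, c, T, d_max=255):
    """Per-pixel don't care region [f, g] for the left depth map: the widest contiguous
    interval of integer depth levels around D_l(m, n) whose synthesis error, per (9.7),
    stays within E_l(0; m, n) + T."""
    H, W = D_l.shape
    f = D_l.astype(int)
    g = D_l.astype(int)
    for m in range(H):
        for n in range(W):
            d0 = int(D_l[m, n])
            def err(d):                               # E_l(d - d0; m, n) for candidate depth d
                col = int(round(n - d * c))
                if col < 0 or col >= W:
                    return np.inf                     # mapped outside the right view
                return abs(float(I_l[m, n]) - float(I_r[m, col]))
            bound = err(d0) + T
            while f[m, n] - 1 >= 0 and err(f[m, n] - 1) <= bound:
                f[m, n] -= 1                          # extend the lower bound downwards
            while g[m, n] + 1 <= d_max and err(g[m, n] + 1) <= bound:
                g[m, n] += 1                          # extend the upper bound upwards
    return f, g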

² D_l(m, n) is more commonly called the disparity value, which is technically the inverse of the depth value. For simplicity of presentation, we assume this is understood from context and will refer to D_l(m, n) as depth value.


Fig. 9.3 Depth map D_l (view 2) and DCR for the first pixel row in an 8 × 8 block for teddy at T = 7: a depth map for teddy, b don't care region

9.3.2.4 Finding Sparse Representation in DCR


Given the pixel-level DCR described earlier, a depth pixel s(m, n) at location (m, n) must be within range f(m, n) and g(m, n), i.e., f(m, n) ≤ s(m, n) ≤ g(m, n), for the resulting synthesized view distortion not to exceed the ground truth error plus threshold T. Consequently, the block-level DCR, B, is simply a concatenation of pixel-level DCRs for all pixel locations in a code block. In other words, a signal s is in B if all its pixels s(m, n)'s fall within the permissible bounds.
Given a well-defined block-level DCR B, the goal is then to find a signal s ∈ B such that the sparsity of its representation using basis φ_i's in the transform domain is maximized. More precisely, given the matrix Φ containing basis functions φ_i's as rows:

$$\Phi = \begin{bmatrix} \phi_1^T \\ \vdots \\ \phi_N^T \end{bmatrix} \qquad (9.8)$$

the sparsity optimization can be written as follows:

$$\min_{s \in B} \|\alpha\|_{l_0} \quad \text{s.t.} \quad \alpha = \Phi s \qquad (9.9)$$

where α = [α_1, ..., α_N] are the transform coefficients and ‖α‖_{l_0} is the l_0-norm, essentially counting the number of non-zero coefficients in α.
Minimizing the l_0-norm in (9.9), which is combinatorial in nature, is in general difficult. An alternative approach, as discussed in [25, 26], is to iteratively solve a weighted version of the corresponding l_1-norm minimization instead:

$$\min_{s \in B} \|\alpha\|_{l_1^w} \quad \text{s.t.} \quad \alpha = \Phi s \qquad (9.10)$$

where the l_1^w-norm sums up all weighted coefficients in α:

$$\|\alpha\|_{l_1^w} = \sum_i w_i |\alpha_i| \qquad (9.11)$$

It is clear that if weights w_i = 1/|α_i| (for α_i ≠ 0), then the weighted l_1^w-norm is the same as the l_0-norm, and an optimal solution to (9.10) is also optimal for (9.9). Having fixed weights means (9.10) can be solved using one of several known linear programming algorithms such as Simplex [27]. Thus, it seems that if one can set appropriate weights w_i's for (9.10) a priori, the weighted l_1^w-norm can promote sparsity, just like the l_0-norm. [25, 26] have indeed used an iterative algorithm so that the solution α_i's of the previous iteration of (9.10) is used as weights for the optimization in the current iteration. See [22] for how the iterative algorithm in [25] is adapted to solve (9.9).
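The following Python sketch illustrates one such re-weighting loop using SciPy's general-purpose linear programming solver; it is a simplified stand-in for the algorithm of [22, 25] (which it does not reproduce exactly), with the weighted l_1 problem (9.10) rewritten using auxiliary variables u ≥ |α| and the DCR expressed as per-pixel box bounds.

import numpy as np
from scipy.optimize import linprog

def sparse_in_dcr(Phi, f, g, iters=5, eps=1e-3):
    """Iteratively re-weighted l1 sketch of (9.9)-(9.11): search inside the block-level
    DCR (per-pixel bounds f <= s <= g, both flattened to length N) for a signal s whose
    transform coefficients alpha = Phi @ s are sparse."""
    N = Phi.shape[0]
    w = np.ones(N)
    I = np.eye(N)
    A_ub = np.block([[Phi, -I], [-Phi, -I]])          # encodes alpha - u <= 0 and -alpha - u <= 0
    b_ub = np.zeros(2 * N)
    bounds = [(lo, hi) for lo, hi in zip(f, g)] + [(0, None)] * N
    s = None
    for _ in range(iters):
        c = np.concatenate([np.zeros(N), w])          # minimize sum_i w_i * u_i
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        s = res.x[:N]
        alpha = Phi @ s
        w = 1.0 / (np.abs(alpha) + eps)               # re-weight toward the l0-norm
    return s, Phi @ s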

9.3.2.5 Defining Quadratic Penalty Function


Instead of DCRs, we now discuss how a quadratic penalty function can be defined per-pixel to reflect the importance of the depth pixel in synthesized view distortion [23].
Looking closer at the error function E_l(e; m, n) (or similarly E_r(e; m, n)), one can see that, as a general trend, as the depth value deviates from the ground truth depth value D_l(m, n), the error increases. As an example, the blue curve in Fig. 9.4a is the resulting E_r(e; m, n) for the right view (view 6) of the multi-view sequence Teddy [24]. One sees that as the depth value deviates from the ground truth value (denoted by the red circle), the error in general goes up.
To construct a penalty function for this depth pixel, [23] fits a per-pixel quadratic penalty function g_i(s_i) to the error function:

$$g_i(s_i) = \tfrac{1}{2} a_i s_i^2 + b_i s_i + c_i \qquad (9.12)$$

where s_i is the depth value corresponding to pixel location i, and a_i, b_i and c_i are the quadratic function parameters. The procedure used in [23] to fit g_i(s_i) to the error function is as follows. Given threshold q, first seek the nearest depth value D_l(m, n) − e below ground truth D_l(m, n) that results in error E_l(e; m, n) exceeding q + E_l(0; m, n). Using only two data points at D_l(m, n) − e and D_l(m, n), and assuming g_i(s_i) has its minimum at the ground truth depth value D_l(m, n), one can construct one quadratic function. A similar procedure is applied to construct another penalty function using two data points at D_l(m, n) + e and D_l(m, n) instead. The sharper of the two constructed functions (larger a_i) is the chosen penalty function for this pixel.
Continuing with our earlier example, we see in Fig. 9.4a that two quadratic
functions (in dashed lines) with minimum at ground truth depth value are constructed. The narrower of the two is chosen as the penalty function. In Fig. 9.4b,
the per-pixel curvature (parameter a) of the penalty functions of the right depth
map of Teddy is shown. We can clearly see that larger curvatures (larger penalties in white) occur at object boundaries, agreeing with our intuition that a synthesized view is more sensitive to depth pixels at object boundaries.

Fig. 9.4 Error and quadratic penalty functions constructed for one pixel in the right view (view 6), and curvature of penalty functions for the entire right view in Teddy. a Per-pixel penalty function, b penalty curvature for Teddy

9.3.2.6 Finding Sparse Representation with Quadratic Penalty Functions


To maximize sparsity in the transform domain without incurring a large penalty in synthesized view distortion, [23] defines the following objective function:

$$\min_{\alpha} \|\alpha\|_{l_0} + \lambda \sum_i g_i\left(\phi_i^{-1} \alpha\right) \qquad (9.13)$$

where φ_i^{-1} is the i-th row of the inverse transform Φ^{-1}, and λ is a weight parameter to trade off transform domain sparsity and resulting synthesized view distortion.
As mentioned earlier, minimizing the l_0-norm is combinatorial and non-convex, and so (9.13) is difficult to solve efficiently. Instead, [23] replaces the l_0-norm in (9.13) with a weighted l_2-norm [28]:

$$\min_{\alpha} \sum_i w_i \alpha_i^2 + \lambda \sum_i g_i\left(\phi_i^{-1} \alpha\right) \qquad (9.14)$$

For a fixed set of weights w_i's, (9.14) can be efficiently solved as an unconstrained quadratic program [29] (see [23] for details). The challenge again is how to choose weights w_i's such that, when (9.14) is solved iteratively, minimizing the weighted l_2-norm is sparsity promoting. To accomplish this, [23] adopts the iterative re-weighted least squares (IRLS) approach [26, 28]. The key point is that after obtaining a solution α° in one iteration, each weight w_i is assigned 1/|α°_i|² if |α°_i| is sufficiently larger than 0, so that the contribution of the i-th non-zero coefficient w_i|α°_i|² is roughly 1. See [23] for details of the iterative algorithm.
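Because each IRLS iteration of (9.14) is an unconstrained quadratic program, it can be solved in closed form by a linear system, as in the following sketch (an illustrative reading of the approach in [23], not its reference implementation; Psi, a and b stand for the inverse transform and the fitted penalty parameters).

import numpy as np

def tds_quadratic_irls(Psi, a, b, lam, iters=10, eps=1e-3):
    """IRLS sketch of (9.14): minimize sum_i w_i * alpha_i**2 + lam * sum_i g_i(s_i)
    with s = Psi @ alpha and quadratic penalties g_i(s_i) = 0.5*a_i*s_i**2 + b_i*s_i + c_i
    (the constants c_i do not affect the minimizer)."""
    N = Psi.shape[1]
    w = np.ones(N)
    alpha = np.zeros(N)
    A = np.diag(a)
    for _ in range(iters):
        # stationarity: 2*diag(w)*alpha + lam * Psi^T (A @ Psi @ alpha + b) = 0
        M = 2.0 * np.diag(w) + lam * (Psi.T @ A @ Psi)
        alpha = np.linalg.solve(M, -lam * (Psi.T @ b))
        w = 1.0 / np.maximum(np.abs(alpha), eps) ** 2  # re-weighted l2, sparsity promoting
    return alpha, Psi @ alpha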


Fig. 9.5 Use of DCR in motion compensation to reduce prediction residual energy

9.3.2.7 Depth Video Coding Using DCR


Though DCR was described in the context of TDS previously, the general notion
that a depth pixel is only required to be reconstructed at decoder to be within a
well-defined range (DCR) is useful in other coding optimization contexts. As one
example, [30] proposed to use DCR to reduce the energy of prediction residuals in
motion compensation during depth video coding for coding gain. See Fig. 9.5 for a
simplified example of a two-pixel block, where the first and second pixel have DCRs [2, 6] and [1, 4], respectively. For a given predictor block with pixel values (5, 5), if the ground truth (3, 2) is the target, then the corresponding prediction residuals (−2, −3) must be encoded, resulting in large energy and large bit overhead. On the other hand, if prediction residuals are only required to bring the predictor block inside the DCR, then (0, −1), with much smaller energy, is already sufficient.
Valenzise et al. [30] showed that using DCR during motion compensation in depth
video coding, up to 28 % bitrate saving can be achieved. More generally, DCR
provides a well-defined degree of freedom (in the sense that the resulting distortion
is carefully bounded for all signals within the DCR) that coding optimizations can exploit; coding optimization using DCR is still an active area of research.
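The residual selection in this example reduces to clipping the predictor into the per-pixel DCR, as the following sketch shows (a toy illustration of the idea in [30], not its actual mode decision).

import numpy as np

def dcr_residual(predictor, f, g):
    """Smallest-magnitude residual that brings each predictor pixel inside its DCR [f, g]
    (cf. Fig. 9.5); pixels already inside their DCR need no correction at all."""
    target = np.clip(predictor, f, g)
    return target - predictor

# Two-pixel example from the text: DCRs [2, 6] and [1, 4], predictor (5, 5).
# dcr_residual(np.array([5, 5]), np.array([2, 1]), np.array([6, 4])) -> array([ 0, -1])
# versus residuals (-2, -3) needed to hit the ground truth (3, 2).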

9.4 New Coding Tools for Depth Map Compression


Having discussed depth map coding optimization using existing compression tools
in the previous section, we now turn our attention to the development of new
compression tools designed specifically for depth map. Given the importance of
edge preservation in depth maps, as previously argued, we first overview edge-adaptive wavelets for depth maps (a full treatment is presented in the next chapter).
We then discuss a new block transform called graph-based transform (GBT), to
complement the traditional DCT commonly used in compression standards.


9.4.1 Edge-Adaptive Wavelets


Improvement of image coding performance using shape-adaptive 2D transforms
has been analyzed in the past [31–33] and more recently extended to wavelet-based coding. An early attempt to adapt the WT to object shapes in images was
made by Li and Li [34]. Here, wavelet filtering was adapted to boundaries in
images, so that the pixels convolved with the wavelet filters and located out of the
object boundaries were filled with values from the inner regions using symmetric
boundary extension. That is, the filled pixels' values were symmetric with respect
to the object boundaries. As a result, the high-pass wavelet coefficients produced
by filtering across the discontinuities along these boundaries had reduced magnitudes. Hence, the distortion of encoded images was smaller at the same bit rate
and the coding efficiency was improved.
However, this improvement was counteracted by an additional overhead bit rate
required for coding the edge maps used by the shape-adaptive wavelets (e.g.,
encoded by the classical Freeman contour coding algorithm [35]). These edge
maps had to be conveyed to the decoder to apply the identical inverse shape-adaptive transform. Thus, bits must be allocated between the edge maps and
wavelet coefficients so that the RD performance of the shape-adaptive WT image
coding is optimized.
Depth images are particularly well suited for these edge-adaptive approaches,
as depth information tends to be smooth within each object. Thus, after encoding
the edge maps, few bits would be needed to represent the output of an edge-adaptive transform. As an example, a modified version of the shape-adaptive WTs was proposed in [36], targeting specifically depth image encoding in the multi-view imaging setup. As will be discussed in Sect. 9.5.1, edges in texture and depth
images can be encoded jointly to save bit rate. We also refer to the next chapter for
a more detailed review of wavelet-based techniques for depth coding.

9.4.2 Graph-Based Transform


Since most international coding standards (e.g., MPEG-2, H.264/AVC) make use
of block-based transforms, block-based approaches to represent depth maps have
become popular. In this section, we review block-based edge-adaptive coding
techniques, and study in more detail the graph-based transform (GBT) [37].
DCT has been widely used for block-based image and video compression.
It provides an efficient way to represent the signal both in terms of coding
efficiency and computational complexity. However, it is known to be inefficient for
coding blocks containing arbitrarily shaped edges. For example, if DCT is applied
to a block containing an object boundary which is neither a horizontal nor vertical
line, e.g., diagonal or round shape, or mixture of these, the resulting transform
coefficients tend not to be sparse and high frequency components can have


significant energy. This leads to higher bit rate, or potentially highly visible coding
artifacts when operating at low rate due to coarse quantization of transform
coefficients.
To solve this problem, variations of the DCT have been proposed, such as the
shape-adaptive DCT [38], directional DCT [39–41], spatially varying transform [42, 43],
variable block-size transform [44], direction-adaptive partitioned block transform
[45], etc., in which the transform block size is changed according to the edge location or
the signal samples are rearranged to be aligned with the main direction of dominant
edges in a block. The Karhunen–Loève transform (KLT) has also been used for shape-adaptive
transforms [33] or intra-prediction direction-adaptive transforms [46]. These approaches
can be applied efficiently to certain patterns of edge shapes, such as straight lines
with preset orientation angles; however, they are not efficient with edges having
arbitrary shapes. The Radon transform has also been used for image coding [47, 48],
but perfect reconstruction is only possible for binary images. Platelets [49] have been
applied to depth map coding [50], approximating depth map images as piecewise planar
signals. Since depth maps are not exactly piecewise planar, this representation
exhibits a fixed approximation error.
To solve these problems, a GBT has been proposed as an edge-adaptive block
transform that represents signals using graphs, where no connection between nodes
(or pixels) is set across an image edge.³ The GBT works well for depth map coding
since a depth map consists of smooth regions with sharp edges between objects at
different depths. Now, we describe how to construct the transform and apply it to
depth map coding. Refer to [37, 51] for detailed properties and analysis of the
transform.
The transform construction procedure consists of three steps: (1) edge detection
on a residual block, (2) generation of a graph from pixels in the block using the
edge map, and (3) construction of transform matrix from the graph.
In the first step, after the intra/inter-prediction, edges are detected in a residual
block based on the difference between the neighboring residual pixel values.
A simple thresholding technique can be used to generate the binary edge map.
Then, the edge map is compressed and included into a bitstream, so that the same
transform matrix can be constructed at the decoder side.
In the second step, each pixel position is regarded as a node in a graph G, and
neighboring nodes are connected either by 4-connectivity or 8-connectivity, unless
there is an edge between them. From the graph, the adjacency matrix A is formed,
where A(i, j) = A(j, i) = 1 if pixel positions i and j are immediate neighbors not
separated by an edge. Otherwise, A(i, j) = A(j, i) = 0. The adjacency matrix is
then used to compute the degree matrix D, where D(i, i) equals the number of non-zero entries in the ith row of A, and D(i, j) = 0 for all i ≠ j.

³ Note that while edge can refer to a link or connection between nodes in graph theory, we
only use the term edge to refer to an image edge, to avoid confusion.


Fig. 9.6 Example of a 2 × 2 block. Pixels 1 and 2 are separated from pixels 3 and 4 by a single
vertical edge (shown as the thinner dotted line). With pixels ordered column-wise, the corresponding
adjacency matrix is A = [0 1 0 0; 1 0 0 0; 0 0 0 1; 0 0 1 0], the degree matrix is D = diag(1, 1, 1, 1),
and the Laplacian matrix is L = D − A. The resulting GBT matrix E^T consists of the normalized
averaging and differencing vectors of the two connected pixel pairs, e.g., rows (1/√2)(1, 1, 0, 0),
(1/√2)(0, 0, 1, 1), (1/√2)(1, −1, 0, 0) and (1/√2)(0, 0, 1, −1)

In the third step, from the adjacency and the degree matrices, the Laplacian
matrix is computed as L = D − A [52]. Figure 9.6 shows an example of these
three matrices.
Then, projecting a signal defined on the graph G onto the eigenvectors of the Laplacian L yields a
spectral decomposition of the signal, i.e., it provides a frequency-domain
interpretation of the signal on the graph. Thus, a transform matrix can be constructed
from the eigenvectors of the Laplacian of the graph. Since the Laplacian
L is symmetric, the eigenvector matrix E can be efficiently computed using the
well-known cyclic Jacobi method [53], and its transpose, E^T, is taken as the GBT
matrix. Note that the eigenvalues are sorted in ascending order, and the corresponding
eigenvectors are put in the matrix in that order. This leads to transform
coefficients ordered in ascending order in the frequency domain.
It is also possible to combine the first and second steps together [51]. Instead of
generating the edge map explicitly, we can find the best transform kernel for the
given block signal by searching for the optimal adjacency matrix. While the
number of possible matrices is large, it is possible to use a greedy search to obtain
adjacency matrices that lead to better RD performance [51].
Transform coefficients are computed as follows. For an N × N block of residual
pixels, form a one-dimensional input vector x by concatenating the columns of the
block into a single N² × 1 vector, i.e., x(Nj + i) = X(i, j) for
all i, j = 0, 1, ..., N − 1. The GBT transform coefficients are then given by
y = E^T x, where y is also an N² × 1 vector. The coefficients are
quantized with a uniform scalar quantizer followed by entropy coding. Unlike
the DCT, which uses a zigzag scan of transform coefficients for entropy coding, the GBT
does not need any such rearrangement since its coefficients are already arranged in
ascending order in the frequency domain.
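As an illustration of the construction just described, the following Python sketch builds the adjacency, degree and Laplacian matrices and the GBT basis for a block, and applies it to the 2 × 2 example of Fig. 9.6. For simplicity, the edge map is represented as a label map (4-adjacent pixels are connected only if they share a label), and NumPy's symmetric eigensolver is used in place of the cyclic Jacobi method mentioned above; the variable names are ours, not from [37].

```python
import numpy as np

def gbt_from_labels(labels):
    """Build the graph-based transform of an N x N block.

    labels: N x N integer array; two 4-adjacent pixels are connected only
    if they carry the same label (i.e., no image edge lies between them).
    Returns (E_T, A, D, L), with pixels ordered column by column so that
    x[N*j + i] = X[i, j], as in the text."""
    n = labels.shape[0]
    idx = lambda i, j: n * j + i                  # column-wise vectorization
    A = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            for di, dj in ((1, 0), (0, 1)):       # 4-connectivity
                ii, jj = i + di, j + dj
                if ii < n and jj < n and labels[i, j] == labels[ii, jj]:
                    A[idx(i, j), idx(ii, jj)] = A[idx(ii, jj), idx(i, j)] = 1
    D = np.diag(A.sum(axis=1))                    # degree matrix
    L = D - A                                     # graph Laplacian
    _, E = np.linalg.eigh(L)                      # eigenvalues in ascending order
    return E.T, A, D, L                           # rows of E^T form the GBT basis

# 2 x 2 block of Fig. 9.6: a vertical edge separates the two columns.
labels = np.array([[0, 1],
                   [0, 1]])
E_T, A, D, L = gbt_from_labels(labels)
block = np.array([[4.0, 9.0],
                  [4.0, 8.0]])                    # residual block X
x = block.flatten(order='F')                      # x[N*j + i] = X[i, j]
y = E_T @ x                                       # GBT coefficients
```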


To achieve the best RD performance, one can select between DCT and GBT on
a per-block basis in an optimal fashion. For example, for each block the RD cost
can be calculated for both DCT and GBT, and the transform with the smaller cost can be selected.
Overhead indicating the chosen transform is encoded into the bitstream for each block,
and the edge map is provided only for blocks coded using
GBT. It has been reported in [54] that an average coding efficiency improvement of 14 %
can be achieved when GBT is used to compress various depth map
sequences.
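A per-block selection of this kind can be sketched as follows; the two coder callbacks are placeholders standing in for actual DCT-based and GBT-based block coders, and the Lagrangian form J = D + λR is one common way to compare the two modes.

```python
def select_transform(block, lam, code_with_dct, code_with_gbt):
    """Choose DCT or GBT for one block by comparing Lagrangian RD costs
    J = D + lam * R. The two coder callbacks are placeholders returning
    (distortion, rate); for GBT the rate is assumed to include the
    edge-map and signalling overhead."""
    d_dct, r_dct = code_with_dct(block)
    d_gbt, r_gbt = code_with_gbt(block)
    j_dct = d_dct + lam * r_dct
    j_gbt = d_gbt + lam * r_gbt
    return ('GBT', j_gbt) if j_gbt < j_dct else ('DCT', j_dct)
```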

9.5 Approaches for Joint Coding of Texture and Depth


Having described compression techniques for depth maps only, we now discuss
the more general problem of joint compression of texture and depth maps. We first
discuss how the inherent correlation in texture and depth maps from the same
captured viewpoint can be exploited for coding gain. We then discuss the bit
allocation problem of how budgeted coding bits can be divided between texture
and depth maps for RD-optimal performance.

9.5.1 Exploiting Correlation Between Texture and Depth


Texture-plus-depth format is the coding of both texture and depth maps at the
same resolution from the same camera viewpoint for multiple views. As such,
there exists correlation between texture and depth maps of the same viewpoint that
can be exploited for coding gain. In this section, we overview coding schemes that
exploit these correlations for compression gain during joint compression of texture
and depth maps.
For stereoscopic video coding, where a slightly different viewpoint is presented
to each of the left and right eyes, instead of coding both left and right texture maps, it
is also popular to code one texture map and one depth map at one viewpoint, and then
synthesize the texture map of the other view at the decoder using the aforementioned DIBR.
Given this single-view texture-plus-depth representation, [55] proposed to reuse
exactly the same motion vectors used for texture map coding in MPEG-2 for the depth
maps as well. Doing so means only one set of motion information needs to be
searched and encoded, reducing both overall motion compensation complexity and
motion coding overhead. The gain is particularly pronounced at low bit rates, where
motion information makes up a larger percentage of the coded bits than coding
residuals.
Using the same idea, [4] also considers sharing of motion information between
texture and depth maps when compressed using H.264. In more detail, if the
corresponding code block sizes in the texture and depth maps are the same, then
the same motion vectors are used. If not, merge and split operations are performed
to derive the appropriate motion vectors for code blocks in the depth maps from
motion vectors of code blocks of different sizes in the texture maps. Daribo et al. [5]
also employed this idea of motion information sharing, but the motion vectors are
searched by minimizing the prediction error energy of both the texture and depth maps,
where a parameter α is used to tune the relative weight of the two maps. Further,
[5] included optimal bit allocation as well, where optimal quantization parameters
are chosen for the texture and depth maps (to be discussed further in the next section).
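The shared motion search can be sketched as a block-matching cost that mixes the texture and depth prediction errors through the weight α; the SAD-based cost below is only illustrative, as the exact cost function used in [5] may differ.

```python
import numpy as np

def joint_motion_cost(tex_cur, tex_ref, dep_cur, dep_ref, alpha):
    """Block-matching cost mixing the texture and depth prediction errors.

    tex_cur/dep_cur: current texture and depth blocks.
    tex_ref/dep_ref: candidate reference blocks displaced by the motion
    vector under test.
    alpha in [0, 1]: relative weight of the two maps (cf. the parameter
    used in [5]; the exact cost function there may differ)."""
    sad_texture = np.abs(tex_cur - tex_ref).sum()
    sad_depth = np.abs(dep_cur - dep_ref).sum()
    return alpha * sad_texture + (1.0 - alpha) * sad_depth
```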
Instead of sharing motion information between texture and depth maps, edge
information can also be shared, if an edge-adaptive coding scheme is adopted. For
compression of multi-view images (no time dimension) for DIBR at decoder, [36]
proposed to use edge-adaptive WT to code texture and depth maps at multiple
capturing viewpoints. Furthermore, the authors exploit the correlation of the edge
locations in texture and depth maps, relying on the fact that the edges in a depth
map reappear in the corresponding texture map of the same viewpoint and, hence,
these edges are encoded only once to save bit rate.
More recently, [56] proposed to use encoded depth maps as side information to
assist in the coding of texture maps. Specifically, observing that pixels of similar
depth have similar motion (first observed in [57]), a texture block containing pixels
of two different levels of depth (e.g., foreground object and background) was
divided into two corresponding subblocks, so that each can perform its own motion
compensation separately. Introducing this new sub-block motion compensation
mode, in which the sub-blocks have arbitrary shapes following the depth boundary in the block,
showed up to 0.7 dB PSNR gain over a native H.264 implementation with variable, but rectangular, block sizes for motion compensation.
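A minimal sketch of such a depth-driven block split is given below; a simple mean threshold on the co-located depth block stands in for the actual classification rule of [56], and the convention that larger depth values denote nearer pixels is an assumption.

```python
import numpy as np

def depth_split_masks(depth_block):
    """Split a texture block into two arbitrarily shaped sub-blocks according
    to the co-located depth block. A simple mean threshold stands in for the
    classification rule of [56]; larger depth values are assumed to denote
    nearer pixels (foreground). Each returned mask can then run its own
    motion compensation."""
    threshold = depth_block.mean()
    foreground = depth_block >= threshold
    return foreground, ~foreground
```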

9.5.2 Bit Allocation for Joint Texture/Depth Map Coding


In this section, we discuss the bit allocation problem: how to optimally allocate
bits out of a fixed budget for encoding texture and depth maps to minimize
synthesized view distortion at decoder.
While the RD optimized bit allocation based on exhaustive search in the parameter
space achieves optimal performance, it typically comes with an unacceptably high
computational cost. Sampling the RD curves for the entire coding system or its
components usually requires rerunning the corresponding coding process for each
sample. The computational complexity of finding the optimal bit allocation grows
quickly (typically exponentially) with the size of the search space.
Instead, modeling the entire RD relation or particular factors can significantly
reduce this complexity. The modeled relations are assumed to be known a priori
up to a set of unknown model parameters. These remaining parameters are still
computed by running the coding process for a handful of iterations, but the total
number of such operations is substantially smaller than in the case of exhaustive
search. The optimization of rate allocation is then performed by minimizing the
chosen distortion-related objective function with respect to the rate distribution among


the encoded signal components for a given maximal bit budget. Depending on the
assumed model, the optimization can be analytical (leading to a closed-form
solution), numerical (resulting in a numerical optimal solution), or a mixture of the
two.
The distortion-related objective function can reflect the distortions of individual
captured and encoded views, the weighted sum (or average) of these distortions, or
the distortions of synthesized views. In the last case, the deployed view synthesis
tools need additional information about the scene to be encoded along with the
captured views (e.g., disparity or depth maps for DIBR), which is penalized by the
required overhead coding rate.
While model-based coding of traditional single-camera image and video
signals has been well studied in the literature [58–61], it has become popular in
MVC only recently. In [62], the correlation between multi-view texture and depth
video frames is exploited in the context of free viewpoint television (FTV). The
depth video frames are first encoded using the joint multiview video model
(JMVM) and view synthesis prediction [63], and then the multi-view texture
video sequences are processed using the same intermediate prediction results as
side information. However, the rate allocation strategy in this work is
completely inherited from the JMVM coding, without any adaptive rate distribution
between texture and depth frames.
The correlation between texture and depth frames has also been analyzed in [5]
such that the motion vector fields for these two types of sequences are estimated
jointly as a unique field. The rate is adaptively distributed between texture and
depth multi-view video sequences in order to minimize the objective function,
which is a weighted sum of the distortions of compressed texture and depth
frames. The separate RD relations for the compressed texture and depth frames are
adopted from [18] and they are verified for high rates and high spatial resolutions.
Finally, the MPEG coder is applied to the texture and depth sequences with the
optimized rates as an input.
Another method for optimally distributing the bit rate between the texture and
depth video stream has been proposed in [64]. The minimized objective function is
a view synthesis distortion at a specific virtual viewpoint, where this distortion is
influenced by the additive factors obtained by texture and depth compression and
geometric error in the reconstructions. The distortion induced by depth compression is characterized by a piecewise linear function with respect to the motion
warping error. Given this model and a fixed set of synthesis viewpoints, the
method computes the optimal quantization points for texture and depth sequences.
Within the concept of 3D-TV, with only two captured and coded views, the
work in [65] has proposed a solution for bit allocation among the texture and depth
sequences such that the synthesized view distortion at a fixed virtual viewpoint is
minimized. Here, the distortions of the left and right texture and depth frames are
modeled as linear functions of the corresponding quantization step sizes [66],
whereas the allocated rates are characterized by a fractional model [67]. The
Lagrangian multiplier-based optimization then results in the optimal quantization
step sizes for the texture and depth sequences.
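Under the assumed models, a linear distortion model D_k(Q_k) = a_k + b_k·Q_k per component [66] and a fractional rate model R_k(Q_k) = c_k/Q_k + d_k [67], the Lagrangian solution can be written in closed form, as in the following Python sketch. This is an illustrative derivation under those model assumptions, not the exact procedure of [65].

```python
import numpy as np

def allocate_quant_steps(b, c, d, rate_budget):
    """Closed-form Lagrangian bit allocation under the assumed models
    D_k(Q_k) = a_k + b_k*Q_k (linear distortion, cf. [66]) and
    R_k(Q_k) = c_k/Q_k + d_k (fractional rate, cf. [67]).
    Setting d(D_k + lambda*R_k)/dQ_k = 0 gives Q_k = sqrt(lambda*c_k/b_k),
    and lambda is chosen so that the total rate meets the budget."""
    b, c, d = map(np.asarray, (b, c, d))
    r_eff = rate_budget - d.sum()            # rate left after constant terms
    if r_eff <= 0:
        raise ValueError("budget too small for the assumed rate model")
    sqrt_lam = np.sqrt(b * c).sum() / r_eff  # enforces sum_k c_k/Q_k = r_eff
    return sqrt_lam * np.sqrt(c / b)         # Q_k = sqrt(lambda * c_k / b_k)

# toy example: left/right texture and depth components
Q = allocate_quant_steps(b=[0.8, 0.5, 0.9, 0.6],
                         c=[2.0, 1.0, 2.5, 1.2],
                         d=[0.05, 0.02, 0.05, 0.02],
                         rate_budget=2.0)
```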


In [68], the authors exploited the fact that depth video sequences consist mostly
of large homogeneous areas separated by sharp edges, exhibiting significantly
different characteristics as compared to their texture counterparts. Two techniques
for depth video sequence compression are proposed to reduce the coding artifacts
that appear around sharp edges. The first method uses trilateral filtering of depth
frames, where the filters are designed by taking into account the correlation among
neighboring pixels such that the edges preserve their sharpness after compression. In the
second approach, the depth frames are segmented into blocks with approximately
the same depth and the entire block is represented by a single depth value that
minimizes the mean absolute error. For efficient encoding of the block shapes and
sizes, the edge locations in the depth frames are predicted from the same blocks in
the texture frames, assuming a high correlation between the two types of edges.
A detailed model for rate control in the H.264 MVC coder [69, 70] and for rate
allocation between texture and depth frames across the view and temporal
dimensions is presented in [71]. In the model, three levels of rate control are
adopted. The view-level rate control inherits the allocation across views from the
predictive coding used in the H.264/MVC coder, where, statistically, the I frames in
the view dimension are commonly encoded with the most bits, followed by the B frames
and P frames. For the texture/depth-level rate control, the rates allocated to the depth
and texture frames are linearly related such that the depth rate is smaller than the
texture rate. Such a relation reflects the chosen quantization levels for both types of
frames in the H.264/MVC coder. Finally, in the frame-level rate control, the rates
allocated to the I, B, and P frames across the temporal dimension are linearly
scaled, where the scaling factors are chosen empirically. Such a model allows for
fine tuning of the rate allocation across multiple dimensions in the MVC coder and
for providing the optimal solutions given various optimization criteria, not only the
synthesized view distortion.
When model-based rate allocation is applied to multi-view imaging instead of
video, the lack of a temporal dimension leads to a simpler and smaller data structure.
In turn, the models and allocation methods can become more complex and also more
accurate. Davidoiu et al. [72] introduce a model of the error variance between the
captured and disparity-compensated left and right reference views. They decompose
this error into three decorrelated additive terms and approximate the relation
of each with respect to the operational quantization step size. Finally, the available
rate is distributed among the encoded images such that the objective distortion,
which is the sum of the left and right reference view distortions, is minimized. The
source RD relation is adopted from [73]. The rate allocation algorithm is also
applied to multi-view videos, where each subset of frames at the same temporal
instant is considered as a multi-view image data set.
Another method for rate allocation in multi-view imaging [74] approximates
the scene depth by a finite set of planar layers with associated constant depths or
disparity compensation vectors. The multi-dimensional WT is then applied to the
captured texture layers taking into account the disparity compensation across
views for each layer. Both the wavelet coefficients and layer contours are encoded,
where the corresponding RD relations are analytically modeled. The available bits
are distributed among encoding the wavelet coefficients and layer contours so that
the aggregate distortion of the encoded images is minimized.

Fig. 9.7 Two examples of sampled virtual view distortion D_s(x) and the cubic model estimated
with a linear least-squares estimator, for the Middlebury [24] data sets Bowling2
(left, [Dt0, Dd0, Dt1, Dd1] = [29.0, 1.50, 37.3, 1.94]) and Rocks2
(right, [Dt0, Dd0, Dt1, Dd1] = [74.9, 0.40, 64.2, 1.30])
In [75, 76], the authors model the distortion of synthesized views using DIBR at
any continuous viewpoint between the reference views as a cubic polynomial with
respect to the distance between the synthesis and reference viewpoints. Following [6],
the synthesized texture pixels are obtained in DIBR by blending the
warped pixels from the left and right reference views, linearly weighted by the
distance between the synthesized and reference viewpoints. Further, they show
that, after compression of the reference views, the resulting mean-square quantization error of the blended synthesized pixels consists of two multiplicative factors
related to the quantization errors of the corresponding reference texture and depth
images, respectively. The first factor depends on the quantization of the reference
textures and, due to the linear blending and the mean square error evaluation, it
can be expressed as a quadratic function of the distance between synthesis and
reference viewpoints. The second factor reflects the quantization of the reference
depth images, which results in a geometrical distortion in the synthesized image
because of erroneous disparity information. This geometrical distortion is
estimated by assuming a linear spatial correlation among the texture pixels [15]
leading to a linear relation between the synthesized view mean square error and the
view distances. Finally, multiplying these two factors gives the cubic behavior of
the distortion across the synthesis viewpoints. The experiments show an accurate
match between the measured distortions and the model, as illustrated in Fig. 9.7.
Furthermore, the model is used to optimally allocate the rate among the encoded
texture and depth images and the resulting RD coding performance is compared in
Fig. 9.8 to the other related methods applied to two data sets.
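The model fitting itself is a plain linear least-squares problem: with a few sampled synthesized-view distortions D_s(x_k), the four cubic coefficients follow from a Vandermonde system, as in the Python sketch below (the sampled values are placeholders for illustration, not measured data).

```python
import numpy as np

# sampled synthesized-view distortions D_s(x_k) at normalized view
# distances x_k (placeholder values for illustration only)
x = np.array([0.2, 0.4, 0.6, 0.8])
d = np.array([31.0, 46.0, 58.0, 66.0])

# least-squares fit of the cubic model D_s(x) = p0 + p1*x + p2*x^2 + p3*x^3
V = np.vander(x, N=4, increasing=True)        # columns: 1, x, x^2, x^3
p, *_ = np.linalg.lstsq(V, d, rcond=None)

d_hat = V @ p                                 # model evaluated at the samples
```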


Fig. 9.8 RD performance of the optimal rate allocation compression algorithm compared to the
performance of a simple uniform allocation and H.264/AVC. The results are obtained for two
data sets: Bowling2 (left) and Rocks2 (right) from the Middlebury database [24]. The optimal
allocation compression always outperforms the compression with the uniform allocation and has
a better RD performance than H.264/AVC at mid- and high-range rates. However, at lower rates,
the sophisticated motion compensation tools in H.264/AVC have a key influence on the RD
performance and thus lead to a better quality for the virtual views

9.6 Summary
In this chapter, toward the goal of a compact representation of texture and depth
maps for free viewpoint image synthesis at the decoder via DIBR, we discuss the
problem of depth map compression. We first study the unique characteristics of
depth maps. We then study, in turn, new coding optimizations for depth maps using
existing coding tools and new compression algorithms designed specifically for
depth maps. Finally, we overview proposed techniques in joint texture/depth map
coding, including the important bit allocation problem for joint texture and depth
map coding.
As depth map coding is still an emerging topic, many unresolved questions remain
for future research. First, the question of how many depth maps should be encoded,
and at what resolution, for a given desired view synthesis quality at the decoder
remains open. For example, if the communication bit budget is scarce, it is not clear
whether densely sampled depth maps across views, which are themselves only auxiliary
data providing geometric information, should be transmitted at full resolution.
This is especially questionable given that depth maps are in many cases estimated in the
first place from texture maps (available at the decoder, albeit in lossily compressed form)
using stereo-matching algorithms. Further, whether depth maps are the sole
appropriate auxiliary data for view synthesis remains an open question. Alternatives
to depth maps [2, 3] for view synthesis have already been proposed in the literature.
Further investigation into more efficient representations of 3D scenes in motion is
warranted.


References
1. Vetro A, Wiegand T, Sullivan GJ (2011) Overview of the stereo and multiview video coding
extensions of the H.264/MPEG-4 AVC standard. Proc IEEE 99(4):626642
2. Kim WS, Ortega A, Lee J, Wey H (2011) 3D video quality improvement using depth
transition data. In: IEEE international workshop on hot topics in 3D. Barcelona, Spain
3. Farre M, Wang O, Lang M, Stefanoski N, Hornung A, Smolic A (2011) Automatic content
creation for multiview autostereoscopic displays using image domain warping. In: IEEE
international workshop on hot topics in 3D. Barcelona, Spain
4. Oh H, Ho YS (2006) H.264-based depth map sequence coding using motion information of
corresponding texture video. In: The Pacific-Rim symposium on image and video technology.
Hsinchu, Taiwan
5. Daribo I, Tillier C, Pesquet-Popescu B (2009) Motion vector sharing and bit-rate allocation
for 3D video-plus-depth coding. In: EURASIP: special issue on 3DTV in Journal on
Advances in Signal Processing, vol 2009
6. Merkle P, Morvan Y, Smolic A, Farin D, Muller K, de With P, Wiegand T (2009) The effects
of multiview depth video compression on multiview rendering. Signal Process Image
Commun 24:7388
7. Leon G, Kalva H, Furht B (2008) 3D video quality evaluation with depth quality variations.
In: Proceedings of 3DTV-conference: the true vision - capture, transmission and display of
3D video, 3DTV-CON 2008. Istanbul, Turkey
8. Tanimoto M, Fujii T, Suzuki K (2009) View synthesis algorithm in view synthesis reference
software 2.0 (VSRS2.0). Document M16090, ISO/IEC JTC1/SC29/WG11
9. Kim WS, Ortega A, Lee J, Wey H (2010) 3-D video coding using depth transition data. In:
IEEE picture coding symposium. Nagoya, Japan
10. Mller K, Smolic A, Dix K, Merkle P, Wiegand T (2009) Coding and intermediate view
synthesis of multiview video plus depth. In: Proceedings of IEEE international conference on
image processing, ICIP 2009. Cairo, Egypt
11. Nguyen HT, Do MN (2009) Error analysis for image-based rendering with depth information.
IEEE Trans Image Process 18(4):703716
12. Ramanathan P, Girod B (2006) Rate-distortion analysis for light field coding and streaming.
Singal Process Image Commun 21(6):462475
13. Kim WS, Ortega A, Lai P, Tian D, Gomila C (2009) Depth map distortion analysis for view
rendering and depth coding. In: IEEE international conference on image processing. Cairo, Egypt
14. Video (2010) Report on experimental framework for 3D video coding. Document N11631,
ISO/IEC JTC1/SC29/WG11
15. Kim WS, Ortega A, Lai P, Tian D, Gomila C (2010) Depth map coding with distortion
estimation of rendered view. In: SPIE visual information processing and communication. San
Jose, CA
16. Lai P, Ortega A, Dorea C, Yin P, Gomila C (2009) Improving view rendering quality and coding
efficiency by suppressing compression artifacts in depth-image coding. In: Proceedings of SPIE
visual communication and image processing, VCIP 2009. San Jose, CA, USA
17. Ortega A, Ramchandran K (1998) Rate-distortion techniques in image and video
compression. IEEE Signal Process Mag 15(6):2350
18. Sullivan G, Wiegand T (1998) Rate-distortion optimization for video compression. IEEE
Signal Process Mag 15(6):74–90
19. Wiegand T, Girod B (2001) Lagrange multiplier selection in hybrid video coder control. In:
IEEE international conference on image processing. Thessaloniki, Greece
20. Wiegand T, Sullivan G, Bjontegaard G, Luthra A (2003) Overview of the H.264/AVC video
coding standard. IEEE Trans Circuits Syst Video Technol 13(7):560576
21. Mark W, McMillan L, Bishop G (1997) Post-rendering 3D warping. In: Symposium on
interactive 3D graphics. New York, NY


22. Cheung G, Kubota A, Ortega A (2010) Sparse representation of depth maps for efficient
transform coding. In: IEEE picture coding symposium. Nagoya, Japan
23. Cheung G, Ishida J, Kubota A, Ortega A (2011) Transform domain sparsification of depth
maps using iterative quadratic programming. In: IEEE international conference on image
processing. Brussels, Belgium
24. (2006) stereo datasets. http://vision.middlebury.edu/stereo/data/scenes2006/
25. Candes EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted l1 minimization.
J Fourier Anal Appl 14(5):877905
26. Wipf D, Nagarajan S (2010) Iterative reweighted l1 and l2 methods for finding sparse
solutions. IEEE J Sel Top Sign Process 4(2):317329
27. Papadimitriou CH, Steiglitz K (1998) Combinatorial optimization: algorithms and
complexity. Dover, NY
28. Daubechies I, Devore R, Fornasier M, Gunturk S (2010) Iteratively re-weighted least squares
minimization for sparse recovery. Commun Pure Appl Math 63(1):138
29. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press,
Cambridge
30. Valenzise G, Cheung G, Galvao R, Cagnazzo M, Pesquet-Popescu B, Ortega A (2012)
Motion prediction of depth video for depth-image-based rendering using dont care regions.
In: Picture coding symposium. Krakow, Poland
31. Gilge M, Engelhardt T, Mehlan R (1989) Coding of arbitrarily shaped image segments based
on a generalized orthogonal transform. Signal Process Image Commun 1:153180
32. Chang SF, Messerschmitt DG (1993) Transform coding of arbitrarily-shaped image
segments. In: Proceedings of 1st ACM international conference on multimedia. Anaheim,
CA, pp 8390
33. Sikora T, Bauer S, Makai B (1995) Efficiency of shape-adaptive 2-D transforms for coding of
arbitrarily shaped image segments. IEEE Trans Circuits Syst Video Technol 5(3):254258
34. Li S, Li W (2000) Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual
object coding. IEEE Trans Circuits Syst Video Technol 10(5):725743
35. Freeman H (1961) On the encoding of arbitrary geometric configurations. IRE Trans Electron
Comput 10(2):260268
36. Maitre M, Shinagawa Y, Do M (2008) Wavelet-based joint estimation and encoding of depth-image-based representations for free-viewpoint rendering. IEEE Trans Image Process
17(6):946–957
37. Shen G, Kim WS, Narang S, Ortega A, Lee J, Wey H (2010) Edge-adaptive transforms for
efficient depth map coding. In: IEEE picture coding symposium. Nagoya, Japan
38. Philips W (1999) Comparison of techniques for intra-frame coding of arbitrarily shaped video
object boundary blocks. IEEE Trans Circuits Syst Video Technol 9(7):10091012
39. Zeng B, Fu J (2006) Directional discrete cosine transforms for image coding. In: Proceedings of
IEEE international conference on multimedia and expo, ICME 2006. Toronto, Canada, pp 721724
40. Fu J, Zeng B (2007) Directional discrete cosine transforms: a theoretical analysis. In:
Proceedings of IEEE international conference on acoustics, speech and signal processing,
ICASSP 2007, vol I. Honolulu, HI, USA, pp 11051108
41. Zeng B, Fu J (2008) Directional discrete cosine transforms: a new framework for image
coding. IEEE Trans Circuits Syst Video Technol 18(3):305–313
42. Zhang C, Ugur K, Lainema J, Gabbouj M (2009) Video coding using spatially varying
transform. In: Proceedings of 3rd Pacific Rim symposium on advances in image and video
technology, PSIVT 2007. Tokyo, Japan, pp 796806
43. Zhang C, Ugur K, Lainema J, Gabbouj M (2009) Video coding using variable block-size
spatially varying transforms. In: Proceedings of IEEE international conference on acoustics,
speech and signal processing, ICASSP 2009, Taipei, Taiwan, pp 905908
44. Wien M (2003) Variable block-size transforms for H.264/AVC. IEEE Trans Circuits Syst
Video Technol 13(7):604613
45. Chang CL, Makar M, Tsai SS, Girod B (2010) Direction-adaptive partitioned block transform
for color image coding. IEEE Trans Image Proc 19(7):17401755


46. Ye Y, Karczewicz M (2008) Improved H.264 intra coding based on bi-directional intra
prediction, directional transform, and adaptive coefficient scanning. In: Proceedings of IEEE
international conference on image processing, ICIP 2008. San Diego, CA, USA,
pp 21162119
47. Soumekh M (1988) Binary image reconstruction from four projections. In: Proceedings of
IEEE international conference on acoustics, speech and signal processing, ICASSP 1988.
New York, NY, USA, pp 12801283
48. Ramesh GR, Rajgopal K (1990) Binary image compression using the radon transform. In:
Proceedings of XVI annual convention and exhibition of the IEEE in India, ACE 90.
Bangalore, India, pp 178182
49. Willett R, Nowak R (2003) Platelets: a multiscale approach for recovering edges and surfaces
in photon-limited medical imaging. IEEE Trans Med Imaging 22(3):332350
50. Morvan Y, de With P, Farin D (2006) Platelets-based coding of depth maps for the transmission
of multiview images. In: SPIE stereoscopic displays and applications. San Jose, CA
51. Kim WS (2011) 3-D video coding system with enhanced rendered view quality. Ph.D. thesis,
University of Southern California
52. Hammond D, Vandergheynst P, Gribonval R (2010) Wavelets on graphs via spectral graph
theory. Elsevier: Appl Comput Harmonic Anal 30:129150
53. Rutishauser H (1966) The Jacobi method for real symmetric matrices. Numer Math 9(1):
54. Kim WS, Narang SK, Ortega A (2012) Graph based transforms for depth video coding. In:
Proceedings of IEEE international conference on acoustics, speech and signal processing,
ICASSP 2012. Kyoto, Japan
55. Grewatsch S, Muller E (2004) Sharing of motion vectors in 3D video coding. In: IEEE
International Conference on Image Processing, Singapore
56. Daribo I, Florencio D, Cheung G (2012) Arbitrarily shaped sub-block motion prediction in
texture map compression using depth information. In: Picture coding symposium. Krakow,
Poland
57. Cheung G, Ortega A, Sakamoto T (2008) Fast H.264 mode selection using depth information
for distributed game viewing. In: IS&T/SPIE visual communications and image processing
(VCIP08). San Jose, CA
58. Gray RM, Hashimoto T (2008) Rate-distortion functions for nonstationary Gaussian
autoregressive processes. In: IEEE data compression conference, pp 5362
59. Sagetong P, Ortega A (2002) Rate-distortion model and analytical bit allocation for wavelet-based region of interest coding. In: IEEE international conference on image processing, vol 3,
pp 97100
60. Hang HM, Chen JJ (1997) Source model for transform video coder and its application - part
I: fundamental theory. IEEE Trans Circuits Syst Video Technol 7:287–298
61. Lin LJ, Ortega A (1998) Bit-rate control using piecewise approximated rate-distortion
characteristics. IEEE Trans Circuits Syst Video Technol 8:446459
62. Na ST, Oh KJ, Ho YS (2008) Joint coding of multi-view video and corresponding depth map.
In: IEEE international conference on image processing, pp 24682471
63. Ince S, Martinian E, Yea S, Vetor A (2007) Depth estimation for view synthesis in multiview
video coding. In: IEEE 3DTV conference
64. Liu Y, Huang Q, Ma S, Zhao D, Gao W (2009) Joint video/depth rate allocation for 3D video
coding based on view synthesis distortion model. Elsevier, Signal Process Image Commun
24(8):666681
65. Yuan H, Chang Y, Huo J, Yang F, Lu Z (2011) Model-based joint bit allocation between
texture videos and depth maps for 3-D video coding. IEEE Trans Circuits Syst Video Technol
21(4):485497
66. Wang H, Kwong S (2008) Rate-distortion optimization of rate control for H.264 with
adaptive initial quantization parameter determination. IEEE Trans Circuits Syst Video
Technol 18(1):140144
67. Ma S, Gao W, Lu Y (2005) Rate-distortion analysis for H.264/AVC video coding and its
application to rate control. IEEE Trans Circuits Syst Video Technol 15(12):15331544


68. Liu S, Lai P, Tian D, Chen CW (2011) New depth coding techniques with utilization of
corresponding video. IEEE Trans Broadcast 57(2):551561 part 2
69. Merkle P, Smolic A, Muller K, Wiegand T (2007) Efficient prediction structures for
multiview video coding. IEEE Trans Circuits Syst Video Technol 17(11):14611473
70. Shen LQ, Liu Z, Liu SX, Zhang ZY, An P (2009) Selective disparity estimation and variable
size motion estimation based on motion homogeneity for multi-view coding. IEEE Trans
Broadcast 55(4):761766
71. Liu Y, Huang Q, Ma S, Zhao D, Gao W, Ci S, Tang H (2011) A novel rate control technique
for multiview video plus depth based 3D video coding. IEEE Trans Broadcast 57(2):562571
(part 2)
72. Davidoiu V, Maugey T, Pesquet-Popescu B, Frossard P (2011) Rate distortion analysis in a
disparity compensated scheme. In: IEEE international conference on acoustics, speech and
signal processing. Prague, Czech Republic
73. Fraysse A, Pesquet-Popescu B, Pesquet JC (2009) On the uniform quantization of a class of
sparse source. IEEE Trans Inf Theory 55(7):32433263
74. Gelman A, Dragotti PL, Velisavljevic V (2012) Multiview image coding using depth layers and
an optimized bit allocation. In: IEEE Transactions on Image Processing (to appear in 2012)
75. Velisavljevic V, Cheung G, Chakareski J (2011) Bit allocation for multiview image
compression using cubic synthesized view distortion model. In: IEEE international workshop
on hot topics in 3D (in conjunction with ICME 2011). Barcelona, Spain
76. Cheung G, Velisavljevic V, Ortega A (2011) On dependent bit allocation for multiview
image coding with depth-image-based rendering. IEEE Trans Image Process 20(11):
31793194

Chapter 10

Effects of Wavelet-Based Depth Video Compression

Ismael Daribo, Hideo Saito, Ryo Furukawa, Shinsaku Hiura and Naoki Asada

Abstract Multi-view video (MVV) representation based on depth data, such as
multi-view video plus depth (MVD), is emerging as a new type of 3D video
communication service. In the meantime, the problem of coding and transmitting the
depth video arises in addition to that of the classical texture video. Depth video is
considered key side information in novel view synthesis within MVV systems,
such as three-dimensional television (3D-TV) or free viewpoint television (FTV).
Nonetheless, the influence of depth compression on the novel synthesized view is
still a contentious issue. In this chapter, we propose to discuss and investigate the
impact of wavelet-based compression of the depth video on the quality of the
view synthesis. After the analysis, different frameworks are presented to reduce the
disturbing depth compression effects on the novel synthesized view.

I. Daribo (&) · R. Furukawa · S. Hiura · N. Asada
Faculty of Information Sciences, Hiroshima City University,
Hiroshima, Japan
e-mail: daribo@nii.ac.jp
R. Furukawa
e-mail: ryo-f@hiroshima-cu.ac.jp
S. Hiura
e-mail: hiura@hiroshima-cu.ac.jp
N. Asada
e-mail: asada@hiroshima-cu.ac.jp
I. Daribo
Division of Digital Content and Media Sciences,
National Institute of Informatics, Tokyo, Japan
H. Saito
Department of Information and Computer Science,
Keio University, Minato, Japan
e-mail: saito@hvrl.ics.keio.ac.jp



Keywords 3D-TV · Adaptive edge-dependent lifting · Depth video compression · Edge detector · Graph-based wavelet · Haar filter bank · Multi-view video plus depth (MVD) · Lifting scheme · Multiresolution wavelet decomposition · Scaling · Shape-adaptive wavelet · Side information · View synthesis · Wavelet coding · Wavelet transform · Wavelet filter bank







10.1 Introduction
Three-dimensional television (3D-TV) has a long history, and over the years a
consensus has been reached that the introduction of 3D-TV broadcast services can
only succeed if the perceived image quality and the viewing comfort are at least
comparable to those of conventional two-dimensional television (2D-TV). The
improvement of 3D technologies raises further interest in 3D-TV [1] and in free
viewpoint television (FTV) [2]. While 3D-TV offers depth perception of
entertainment programs without the need to wear special additional glasses, FTV
allows the user to freely change his or her viewpoint position and direction around
a reconstructed 3D scene.
Although there is no doubt that high-definition television (HDTV) has succeeded
in largely increasing the realism of television, it still lacks one very important
feature: the representation of a natural depth sensation. At present, 3D-TV and FTV
can be considered the logical next step complementing HDTV by incorporating 3D
perception into the viewing experience. In that sense, multi-view video (MVV)
systems have gained significant interest recently, more specifically for the novel
view synthesis enabled by depth-image-based rendering (DIBR) approaches, also
called 3D image warping in the computer graphics literature.
A well-suited associated 3D video data representation is known as multi-view
video plus depth (MVD), which provides regular two-dimensional (2D) videos
enriched with their associated depth videos (see Fig. 10.1). The 2D video provides
the texture information, the color intensity and the structure of the scene, whereas the
depth video represents the per-pixel Z-distance between the camera optical center
and a 3D point in the visual scene. In the following, the 2D video is denoted as
texture video, as opposed to the depth video. In addition, we represent depth data in
the color domain in order to highlight wavelet-induced distortions.
The benefit of this representation is that it can still satisfy the stereoscopic
viewing needs at the receiver side, as illustrated in Fig. 10.2. After decoding,
intermediate views can be reconstructed from the transmitted MVD data by means
of DIBR techniques [3, 4]. Therefore, the 3D impression and the viewpoint can be
adjusted and customized after transmission. However, the rendering process does
not allow creating perfect novel views in general. It is still prone to errors, in
particular those introduced by the coding and transmission of the depth video,
which is key side information in novel view synthesis.


Fig. 10.1 Example of texture image and its associated depth map (Microsoft Ballet MVV sequence)

Fig. 10.2 Efficient support of multi-view autostereoscopic displays based on MVD content

A first study of the impact of depth compression on the view synthesis was
conducted within the MPEG 3DAV AHG activities, in which an MPEG-4
compression scheme was used. One proposed solution consists in applying, after
decoding, a median filter to the decoded depth video to limit the coding-induced
artifacts on the view synthesis, similarly to what is done in H.264/AVC with the
deblocking filter. Afterwards, a comparative study of H.264/AVC intra coding and
platelet-based depth coding with respect to the quality of the view synthesis was
proposed [5]. The platelet-based depth coding algorithm models smooth regions by
using piecewise-linear functions and sharp boundaries by straight lines. The results
indicate that a worse depth coding PSNR does not necessarily imply a worse
synthesis PSNR. Indeed, platelet-based depth coding leads to the conclusion that
preserving the depth discontinuities in the depth compression scheme yields a
higher rendering quality than H.264/AVC intra coding.


In this chapter, we propose to extend these studies to the wavelet domain. Due to
its unique spatial-frequency characteristics, the wavelet-based compression approach
is considered an alternative to traditional video coding standards based on the
discrete cosine transform (DCT) (e.g., H.264/AVC). The discrete wavelet transform
(DWT) has two main advantages over the DCT that are important for compression:
(1) the multiresolution representation of the signal by wavelet decomposition, which
greatly facilitates sub-band coding, and (2) the wavelet transform reaches a good
compromise between the frequency and time (or space, for images) resolutions of
the signal. Although wavelets are less widely used in broadcasting than the DCT,
they can be considered a promising alternative for compressing depth information
within a 3D-TV framework. For still image compression, the DWT outperforms the
DCT by around 1 dB, and by less for video coding [6]. However, in the case of 3D
video coding, the DWT presents worse performance than the DCT with regard to
the 3D quality of the synthesized view. We then intend to understand the reason
behind these poor 3D quality results by studying the effects of wavelet-based
compression on the quality of the novel view synthesis by DIBR. This study
includes the analysis of the wavelet transforms for compressing the depth video,
which tries to answer the question:
Which wavelet transform should be used to improve the final 3D quality?

To properly answer this question, a brief review of the basic concepts of wavelets
is first introduced in Sect. 10.2, followed by the lifting mechanism, which provides a
framework for implementing classical wavelet transforms and the flexibility for
developing adaptive wavelet transforms.
Then, Sect. 10.3 investigates the impact of the choice of different classical
wavelet transforms on the depth map and its influence on the novel view synthesis
in terms of both compression efficiency and quality.
Finally, adaptive wavelet transforms are proposed to illustrate the result of the
aforementioned investigation. To this end, Sects. 10.4 and 10.5 address wavelet-based
depth compression through different adaptive wavelet transforms based on
the local properties of the depth map.

10.2 Wavelet Basics


10.2.1 Introduction to Wavelet Theory
The main idea behind wavelet analysis is to decompose complex information such
as music, speech, images, and patterns into elementary forms at different positions
and scales. A signal f can then be decomposed using a basis of functions ψ_i as
follows:

f = Σ_i a_i ψ_i                    (10.1)


Fig. 10.3 Two-band filter bank diagram

To have an efficient representation of the signal f using only a few coefficients a_i,
it is important to use a suitable family of functions ψ_i that efficiently matches the
features of the data to be represented. Since there is an indefinite number of possible
wavelet transform basis functions, the compression efficiency is greatly influenced
by the choice of the wavelet. Wavelets with a smaller support perform better for
signals with many discontinuities or a lot of high frequencies, while longer and
smoother ones perform better for smoother signals. On the other hand, signals
usually have the following features: they are both limited in time (or space, for
images) and in frequency. A compromise is then needed between purely time-limited
and purely band-limited basis functions, which combines the best of both worlds:
wavelets. The wavelet transform may be seen as an improved version of the Fourier
transform, succeeding where the Fourier transform fails: analyzing non-stationary
signals. The result is the well-known ability of wavelet transforms to pack the
main signal information into a very small number of wavelet coefficients.
Mallat successfully connected this theory with the concept of multiresolution [7].
Wavelet decomposition thus allows the analysis of a signal at different
resolution levels (or scales). He also showed that the wavelet analysis can be
performed using simple signal filtering through a very practical algorithm based on
multiresolution analysis [8]. Let us consider the two-channel filter bank for a
one-dimensional (1D) discrete signal x[n] as shown in Fig. 10.3. The idea is to separate
the signal in the frequency domain into two sub-bands: (1) a low-pass band and (2) a
high-pass band. The analysis filter bank consists of a low-pass filter H0 and a high-pass
filter H1. The output of the low-pass channel represents the coarse approximation of
the signal, while the output of the high-pass channel contains the fine
signal details. To reconstruct the original signal, the synthesis filter bank is used as
shown in Fig. 10.3. Each sub-band can be downsampled for compression and
transmission, and then upsampled and combined for reconstruction of the original
signal. Perfect reconstruction is possible if no information is lost during
compression and transmission. The above procedure can be iteratively applied to
the low-pass channel output with the same filter bank until the desired level of
decomposition is reached.
Traditional wavelet transforms implemented with filter banks have been
widely accepted and used in many applications; however, introducing adaptivity to
such transforms or dealing with irregularly spaced data is a non-trivial task. Nonetheless, the lifting scheme, formally introduced by Sweldens [9], enables an easy and


Fig. 10.4 Lifting scheme composed of the analysis (left) and the synthesis (right) steps

efficient construction of wavelet transforms and the flexibility to design an adaptive
transform.

10.2.2 Lifting Scheme


The lifting scheme is a computationally efficient way of implementing the wavelet
transform, and overcomes the shortcomings of the usual filter bank approach. In addition to
the extra flexibility, every filter bank based on lifting automatically satisfies the perfect
reconstruction property. The lifting scheme starts with a set of well-known filters,
after which lifting steps are used in an attempt to improve (a.k.a. lift) the properties
of the corresponding wavelet decomposition. Every 1D wavelet transform can be
factored into one or more lifting stages. The 2D transform is carried out as a separable
transform by cascading two 1D transforms in the horizontal and vertical directions.

10.2.2.1 Lifting Steps: Splitting, Predict, Update and Scaling


A typical DWT by lifting scheme consists of four steps: splitting, predict, update,
and scaling, as illustrated in Fig. 10.4.

Splitting
The first step consists of splitting the input signal into two polyphase components: the
even and odd samples x_{2i} and x_{2i+1}, by means of a lazy wavelet transform (LWT).

Predict
As the two components x_{2i} and x_{2i+1} are correlated, the next stage predicts the odd
values x_{2i+1} from the even ones x_{2i}, using a prediction operator P, and produces the residue

h_i = x_{2i+1} − P(x_{2i}),  i ∈ ℕ,                    (10.2)

where h denotes the detail sub-band coefficients.


Update
An update stage U of the even values follows, such that

l_i = x_{2i} + U(h_i),  i ∈ ℕ,                    (10.3)

where l denotes the approximation sub-band coefficients.


Scaling
The output of each channel is weighted to normalize the energy of the underlying
scaling and wavelet functions.

These four steps can be repeated by iterating on the approximation sub-band l,
thus creating a multi-level transform or a multiresolution decomposition. The
perfect reversibility of the lifting scheme is one of its most important properties. The
reconstruction is done straightforwardly by inverting the order of the operations,
inverting the signs in the lifting steps, and replacing the splitting step by a merging
step. Thus, inverting the procedure above results in:

Undo Scaling
Simply apply the inverse weights.

Undo Update
x_{2i} = l_i − U(h_i),  i ∈ ℕ                    (10.4)

Undo Predict
x_{2i+1} = h_i + P(x_{2i}),  i ∈ ℕ                    (10.5)

Merging
x = (x_{2i}) ∪ (x_{2i+1})                    (10.6)

10.2.2.2 Lifting Advantages


Some of the advantages of the lifting wavelet implementation with respect to the
classical DWT are:
• simplicity: it is easier to understand and implement,
• the inverse transform is obvious to find and has exactly the same complexity as the forward transform,
• the in-place lifting computation avoids auxiliary memory requirements, since lifting outputs from one channel may be saved directly in the other channel,
• Daubechies and Sweldens proved that every biorthogonal DWT can be factorized into a finite chain of lifting steps [10],
• it can be used on arbitrary geometries and irregular samplings.

10.2.2.3 Lifting Implementations of Some Wavelet Filter Banks

In this chapter we will use three filter banks: the Haar, Le Gall's 5/3 [11], and
Daubechies 9/7 [12, 13] filter banks. Their corresponding lifting steps for one
transform level of a discrete 1D signal x = [x_k] are presented in the following:
Haar analysis lifting steps
• Predict: h_i = (1/√2)(x_{2i+1} − x_{2i})
• Update: l_i = √2·x_{2i} + h_i

Le Gall's 5/3 analysis lifting steps
• Predict: h_i = x_{2i+1} − (1/2)(x_{2i} + x_{2i+2})
• Update: l_i = x_{2i} + (1/4)(h_{i−1} + h_i)
• Scaling: h_i = (1/√2)·h_i and l_i = √2·l_i

Daubechies 9/7 analysis lifting steps
• Predict1: h_i = x_{2i+1} − a·(x_{2i} + x_{2i+2})
• Update1: l_i = x_{2i} − b·(h_{i−1} + h_i)
• Predict2: h_i = h_i + c·(l_i + l_{i+1})
• Update2: l_i = l_i + d·(h_{i−1} + h_i)
• Scaling: h_i = (1/f)·h_i and l_i = f·l_i

with

a = 1.586134342
b = 0.05298011854
c = 0.8829110762
d = 0.4435068522
f = 1.149604398
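As a minimal illustration of these steps, the following Python sketch implements one level of the Le Gall 5/3 lifting transform and its inverse for an even-length 1D signal. Symmetric extension at the boundaries and floating-point arithmetic are simplifying assumptions (the integer 5/3 transform of JPEG2000 additionally uses rounding).

```python
import numpy as np

def legall53_forward(x):
    """One level of the Le Gall 5/3 lifting transform of an even-length 1D
    signal, following the predict/update/scaling steps listed above.
    Symmetric extension is used at the signal boundaries."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2].copy(), x[1::2].copy()
    # Predict: h_i = x_{2i+1} - (x_{2i} + x_{2i+2}) / 2
    h = odd - 0.5 * (even + np.roll(even, -1))
    h[-1] = odd[-1] - even[-1]                 # boundary: x_{2i+2} := x_{2i}
    # Update: l_i = x_{2i} + (h_{i-1} + h_i) / 4
    l = even + 0.25 * (np.roll(h, 1) + h)
    l[0] = even[0] + 0.5 * h[0]                # boundary: h_{-1} := h_0
    # Scaling
    return np.sqrt(2) * l, h / np.sqrt(2)

def legall53_inverse(l, h):
    """Inverse transform: undo scaling, update, predict, then merge."""
    l, h = l / np.sqrt(2), h * np.sqrt(2)
    even = l - 0.25 * (np.roll(h, 1) + h)
    even[0] = l[0] - 0.5 * h[0]
    odd = h + 0.5 * (even + np.roll(even, -1))
    odd[-1] = h[-1] + even[-1]
    x = np.empty(l.size + h.size)
    x[0::2], x[1::2] = even, odd
    return x

signal = np.array([10, 12, 14, 13, 9, 8], dtype=float)
low, high = legall53_forward(signal)
assert np.allclose(legall53_inverse(low, high), signal)   # perfect reconstruction
```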
In conclusion, the lifting scheme provides a framework for implementing the
classical wavelet transform. It has several advantages over the classical filter bank
scheme and provides additional features, such as implementation simplicity and
in-place computation. This motivates the choice of many researchers to consider the
lifting scheme for still image compression and, in the scope of this chapter, for the
compression of depth data.


10.3 Problem Statement: Impact of Wavelet-Based Depth Compression on View Synthesis
10.3.1 Wavelet-Based Coding Results
Unlike a texture image, a depth map has a very singular, texture-less structure, where
the singularities are mostly located along the edges of the objects. After the
wavelet transform, the quantization of the wavelet coefficients and the thresholding
of the quantized coefficients, one can notice on the decoded depth map the
appearance of artifacts, referred to as Gibbs (ringing) artifacts, along the edges,
as shown in Fig. 10.5.
Let us first define the experimental conditions. The depth coding experiments
are performed with the test MVD dataset Ballet (1,024 × 768 @ 15 fps)
produced by Microsoft Research [14]. We use the lifting implementations of the
Haar, Le Gall's 5/3 and Daubechies 9/7 filter banks at a multiresolution level equal to
4. These filter banks differ in particular in their lifting operators, which utilize
different support widths: Haar utilizes the smallest support width, and
Daubechies 9/7 the widest one. As we can see in Fig. 10.5, the shortest Haar
filter bank is more efficient in reducing the Gibbs (ringing) effects in the
decoded depth video, while Le Gall's 5/3 and Daubechies 9/7 work better in
smooth regions. Larger lifting operators can be approximated by polynomials of
higher degrees, which correspond to smoother basis functions. These lifting
operators then work better when the underlying signal is smooth, i.e., consists
of low frequencies, which is the case for the depth map. Nonetheless, the depth
edges cannot be well represented by those smooth functions. A poor approximation
along the edges yields large wavelet coefficients, and an increase of
the data entropy along these edges.
The objective performance of each wavelet transform is investigated in the
rate-distortion (RD) curves plotted in Fig. 10.6, through the average number of
bits per pixel (bpp) in relation to the loss of quality, measured by the peak
signal-to-noise ratio (PSNR). The rate is computed using the JPEG2000 codec for
entropy coding. The singular texture-less nature of the depth data clearly
indicates a better compression performance of wider-support filters such as Le
Gall's 5/3 and Daubechies 9/7 over Haar, however at the cost of the aforementioned
localized errors along the edges. One may be satisfied with this
preliminary result, but when the whole DIBR-based framework is taken into
consideration, it is important to note that the depth map is not directly rendered,
but utilized as key side information to synthesize novel views by DIBR. It is
then important to consider not only the depth compression efficiency, but also
the quality of the synthesized view. In the following, we therefore study how the
depth compression performance affects the quality of the novel synthesized view,
and more specifically how the poorly preserved structures localized at the edges
may influence the synthesis of a novel view.


Fig. 10.5 Appearance of the Gibbs (ringing) effects along Ballet depth contours at 0.08 bpp

Fig. 10.6 RD comparison of depth compression with different wavelet filter banks

10.3.2 Effect on Novel View Synthesis


In this section, we attempt to evaluate the quality of the novel synthesized view
according to wavelet-based compression of the depth map. Recall that the depth map
is considered as key side information in the generation of the novel views by DIBR.
A poorly adapted depth map compression therefore leads to a poor quality of the
novel views synthesized by DIBR.

Fig. 10.7 Effect of the wavelet-based depth compression on the novel view synthesis at 0.08 bpp

The novel view synthesis experiments are partly carried out with the software
provided by the Tanimoto Lab. of Nagoya University [15]. In this study, the novel
view is generated between camera 3 and camera 5, such that the viewpoint of
camera 4 is reconstructed by DIBR from the reference cameras 3 and 5. After
transmission and decoding, the depth videos are preprocessed, first with a median
filter with an aperture of linear size 3, and then with a bilateral filter whose sigma
parameters are set to 4 in color space and 40 in coordinate space. Disocclusions are
filled in using the reference views from cameras 3 and 5. Finally, the remaining
disocclusions are inpainted [16].
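A minimal sketch of this depth preprocessing step using OpenCV is given below; the helper name and the random placeholder frame are illustrative, and the actual experiments rely on the Nagoya reference software pipeline rather than this exact code.

```python
import cv2
import numpy as np

def preprocess_decoded_depth(depth, median_aperture=3, sigma_color=4, sigma_space=40):
    """Clean a decoded 8-bit depth frame: a small median filter removes isolated
    coding errors, then a bilateral filter smooths ringing while keeping the
    depth discontinuities sharp."""
    depth = cv2.medianBlur(depth, median_aperture)
    # A non-positive diameter lets OpenCV derive the neighborhood size from sigma_space.
    depth = cv2.bilateralFilter(depth, -1, sigma_color, sigma_space)
    return depth

# Usage with a placeholder decoded depth frame of the Ballet resolution:
decoded = np.random.randint(0, 256, (768, 1024), dtype=np.uint8)
clean = preprocess_decoded_depth(decoded)
```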
At low bitrates, the Gibbs phenomenon becomes more pronounced, leading to a
visible degradation of the novel view, as shown in Fig. 10.7, in particular near the
object boundaries, which correspond to the aforementioned edge-localized errors
in the depth map.
Previously, we observed that the shorter the wavelet filter support is, the better the
edges of the disoccluded areas are preserved, and thus the better the quality of the
novel view. Despite its lower depth compression ratio, the Haar filter bank can
therefore be considered the most efficient wavelet transform among those tested
with respect to the rendering quality of the novel synthesized image (see Fig. 10.8).
As a conclusion, depth edges highlight the weakness of classical wavelet-based
coding methods in preserving structures that are well localized in the spatial
domain but have a large frequency band, which leads to a degradation of the novel
synthesis quality. A good depth compression does not necessarily yield better
synthesized views. It can then be expected that the smaller the support of the
wavelet transform is, the better the Gibbs (ringing) effects are reduced, and thus the
better the 3D quality, despite a lower compression ratio of the depth map compared
with longer filters. Indeed, the depth data are mainly composed of smooth areas,
which favors longer filters in terms of RD performance, at the cost of edge-localized
errors on the depth video. A tradeoff between depth compression efficiency and
novel view synthesis quality has to be found.

Fig. 10.8 RD comparison of the novel synthesized image using decoded wavelet-based depth map with different wavelet filter banks

10.4 Adaptive Edge-Dependent Lifting Scheme


As previously observed, a wavelet transform capable of better fitting the local
properties of the signal can be expected to improve the compactness of the
wavelet coefficients along the edges, and thus the quality of the novel synthesized
view. It is then desirable to design a DWT that is capable of shaping itself according
to the neighborhood of the depth discontinuities. This can be achieved by allowing
the lifting scheme to adapt its prediction and update operators to the local properties of the signal. Since all calculations of the lifting framework are done in the
spatial domain, it is easy to incorporate the aforementioned adaptivity and nonlinear operators into the transform by means of the depth edges used as side information,
as illustrated in Fig. 10.9.
Fig. 10.9 An adaptive lifting scheme by using depth edges as side information

There have recently been various approaches to designing efficient wavelet-based
image transforms [17-20] that seek an efficient representation of the geometrical
structure of still texture images. These geometric wavelet transforms better
capture the local geometrical structure of a 2D image by using non-separable
wavelets. The result is the ability to represent that structure with only a small
number of significant wavelet coefficients. Inspired by these still-image techniques,
various depth coding methods have been proposed that require an edge-detection
stage followed by an adaptive transform [21-24].

10.4.1 Shape-Adaptive Wavelet


There have been some proposals for coding arbitrarily shaped images using wavelet
transforms. Notable examples are the Shape-Adaptive Discrete Wavelet Transform
(SA-DWT) [18] for still texture images, and its lifting extension [22] for depth map
coding. For the latter, the depth edges are encoded explicitly and the regions on
opposite sides of these edges are processed independently, which prevents wavelet
bases from crossing edges. The SA-DWT clearly generates fewer large wavelet
coefficients around edges than the classical DWT.

10.4.2 Graph-Based Wavelet


In a similar manner to the SA-DWT, graph-based transforms [23] seek a DWT that
avoids filtering across edges. The main difference with respect to the SA-DWT is
a more general set of filtering kernels. In addition, the graph-based representation
has the advantage of providing a more general representation that can capture
complex edge shapes more easily. The basic idea is to map pixels onto a graph in
which each pixel is connected to its immediate neighbors only if they are not
separated by an edge. Beyond wavelets, related graph-based transform
work can also be found in DCT-based codecs [24]. One example can be
found in Chap. 9.
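The graph construction itself is simple; the sketch below is a simplified illustration of the idea (it treats any marked pixel as blocking its links, which is a coarser rule than the edge definition used in [23]), with hypothetical names.

```python
import numpy as np

def build_pixel_graph(edge_map):
    """Link every pixel to its 4-neighbors, but never across a marked depth edge,
    producing the adjacency structure a graph-based transform would operate on."""
    h, w = edge_map.shape
    adjacency = {(r, c): [] for r in range(h) for c in range(w)}
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):            # right and down neighbors
                nr, nc = r + dr, c + dc
                if nr < h and nc < w and not (edge_map[r, c] or edge_map[nr, nc]):
                    adjacency[(r, c)].append((nr, nc))
                    adjacency[(nr, nc)].append((r, c))
    return adjacency

# Toy 4 x 4 depth-edge mask with a vertical edge in column 2.
edges = np.zeros((4, 4), dtype=bool)
edges[:, 2] = True
graph = build_pixel_graph(edges)
print(graph[(1, 1)])    # no link to (1, 2): the transform never crosses the edge column
```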


10.4.3 Support-Adaptive Wavelet


Instead of independently processing the opposite sides of an edge, another solution
consists in adaptively applying different wavelets that better approximate the local
geometrical structure [21]. In simple terms, it is a transform that allows choosing the
suitable operator based on the properties of the depth discontinuities. Such an adaptive
DWT applies (1) a shorter filter support, such as Haar, over an edge, which
reduces the Gibbs artifacts, and (2) a longer one in homogeneous areas. As a result,
better compression efficiency is achieved in homogeneous areas, while depth edges
are preserved for a better novel view synthesis quality.
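The following 1D sketch shows what such a support-adaptive predict step could look like; it is a simplified illustration under the assumption that a binary edge map is available, not the exact operator of [21], and the matching update step and inverse transform (which mirror the same switch) are omitted for brevity.

```python
import numpy as np

def adaptive_predict(x, edge):
    """Edge-adaptive lifting predict step: use the Le Gall 5/3 predictor in smooth
    areas, but fall back to a one-sided Haar-like prediction whenever a depth edge
    lies inside the prediction support, so the wavelet never crosses the edge."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    e_edge, o_edge = edge[0::2], edge[1::2]
    d = np.empty_like(odd)
    for i in range(len(odd)):
        left = even[i]
        right = even[i + 1] if i + 1 < len(even) else even[i]
        right_is_cut = o_edge[i] or (e_edge[i + 1] if i + 1 < len(even) else False)
        if right_is_cut:
            d[i] = odd[i] - left                    # Haar-like: same-side sample only
        else:
            d[i] = odd[i] - 0.5 * (left + right)    # Le Gall 5/3 prediction
    return even, d

# Depth scanline with one sharp edge and its (transmitted or inferred) edge map.
x = np.concatenate([np.full(8, 50.0), np.full(8, 210.0)])
edge = np.zeros(16, dtype=bool)
edge[8] = True                                      # first sample of the second object
even, detail = adaptive_predict(x, edge)
print(float(np.abs(detail).max()))                  # 0.0: no large coefficient across the edge
```

Reversibility is preserved as long as the decoder can reproduce exactly the same edge map, which is the topic of the next section.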

10.5 Side Information: Edges


The adaptive DWTs described above still require an edge detection stage and the
compression/transmission of the edges. The lifting operators P and U then become
edge-dependent, and thus the wavelet transform becomes nonlinear. The lifting
scheme, however, guarantees that the transform remains reversible under the
assumption that it is possible to obtain the same edge information at the encoder
and decoder side as illustrated in Fig. 10.9. In the following, we discuss different
ways to represent and transmit the required edge information.

10.5.1 Edge Detector


A common example of an edge detector is the symmetric separable derivative of the
image (but any edge detector can be used instead), defined in 1D as follows:

$x'_i = x_i - \tfrac{1}{2}\,(x_{i-1} + x_{i+1})$    (10.7)

A threshold is then applied to the coefficients to find the relevant edges. To handle
the problem of choosing an appropriate threshold, the approach by hysteresis1 is
commonly used, wherein multiple thresholds are used to find an edge. A rate
constraint has also been proposed, in which the most important edges are encoded first
[22]. Finally, the edges are encoded using a simple differential chain code [25].
The main difficulty in such an adaptive scheme is to retrieve the same edges at
the encoder and decoder side, and thus to maintain the reversibility of the spatial
transform. To fulfill this condition, several approaches are reviewed hereafter.
1
Hysteresis is used to track the more relevant pixels along the contours. Hysteresis uses two
thresholds: if the magnitude is below the low threshold, it is set to zero (made a non-edge); if the
magnitude is above the high threshold, it is made an edge; and if the magnitude is between the two
thresholds, it is set to zero unless the pixel is located near an edge detected by the high threshold.
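A generic implementation of this hysteresis tracking (in the spirit of the Canny edge detector, with illustrative thresholds and names, not the authors' exact detector) is sketched below.

```python
import numpy as np
from collections import deque

def hysteresis_edges(magnitude, low, high):
    """Keep pixels above 'high' as edges, discard pixels below 'low', and keep
    in-between pixels only if they are 8-connected to a strong edge pixel."""
    strong = magnitude >= high
    candidate = magnitude >= low
    edges = strong.copy()
    queue = deque(zip(*np.nonzero(strong)))
    h, w = magnitude.shape
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w and candidate[nr, nc] and not edges[nr, nc]:
                    edges[nr, nc] = True
                    queue.append((nr, nc))
    return edges

# Toy magnitude map: a strong contour segment followed by a weaker continuation.
mag = np.zeros((5, 7))
mag[2, 1:4] = 100.0        # strong part of the contour
mag[2, 4:6] = 30.0         # weak continuation, kept thanks to hysteresis
print(hysteresis_edges(mag, low=20.0, high=50.0)[2])
```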


Fig. 10.10 Adaptive lifting scheme using depth edges as side information

10.5.1.1 Depth Edges as Side Information


A straightforward solution consists in utilizing the depth edges themselves as side
information (see Fig. 10.10), at the cost of increasing the bandwidth by losslessly
transmitting the key edge locations required to operate the inverse spatial transform. Even with the extra overhead of sending the edge map, reductions can be
achieved in the overall transmitted rate.
Under the assumption that both texture and depth data are captured from the
same viewpoint, one can notice the intuitive edge correlation between the texture
and the depth image. In the following, we present different approaches that reduce
the transmission cost of the depth edge information by leveraging the texture
and depth edge correlation.
10.5.1.2 Texture Edges as Side Information
As previously said, from the observation that the texture and depth images share
common edges, it is possible to some extent to infer the depth edges from the
texture edges. As a result, the texture edges are utilized as side information, and
thus no additional bits need to be sent (see Fig. 10.11). The texture is
independently encoded and transmitted beforehand to the decoder.
The texture, however, contains many more edges than the depth map (see
Fig. 10.12), which leads to unnecessary special filtering (e.g., short filter support,
or non-edge-crossing) of the depth map, and thus a loss of compression efficiency.
10.5.1.3 Interpolated Depth Edges as Side Information
Here, the spatial locations of the edges are extracted, to some extent, from the
approximation coefficients of the depth map. The idea is to assume that the
approximation coefficients sufficiently preserve the edge information after the wavelet
transform, such that the edge information can be retrieved properly. In that way, the
side information is obtained from an upsampling of the quantized approximation
coefficients. In order to simulate at the encoder side the edge detection stage of the
decoder side, the depth map first goes through an encoding/decoding step
at the target bitrate, wherein the change of bitrate is achieved by changing the

quantization step. The encoding/decoding process consists in applying a Le Gall's
5/3 DWT, (1) setting all the detail coefficients to zero, and (2) quantizing the
approximation coefficients. In what follows, we denote the decoded depth map as
the interpolated depth map. The edge detector is then applied to this interpolated
depth map (Fig. 10.13).

Fig. 10.11 Adaptive lifting scheme using texture edges as side information

Fig. 10.12 Example of edges of the texture image (left) and depth map (right)

Fig. 10.13 Adaptive lifting scheme using the approximated depth edges as side information
At the decoder side, the interpolated depth map is built from upsampling the
quantized approximation coefficients, while at the encoder, a dummy linear
transform based on the long filters is used. The reconstruction is still possible thanks
to the particular smoothness of the depth map, which preserves the location of the
edges when using the two slightly different decompositions. The edges of the
interpolated depth map are, however, very sensitive to the bitrate, as can be seen in
Fig. 10.14, and the slight difference between the two sets of interpolated depth
edges does not allow a perfect reversibility of the adaptive lifting scheme.

Fig. 10.14 Interpolated depth edges at different bitrates (top) at the encoder, (middle) at the decoder, (bottom) difference between both
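The sketch below imitates the construction of such an interpolated depth map in a deliberately crude way: the Le Gall 5/3 approximation band is replaced by a simple block average, quantization by rounding to a step, and the upsampling by pixel repetition. It only illustrates that the main depth edges survive this low-resolution, quantized approximation; the function names, level count, and quantization step are assumptions.

```python
import numpy as np

def interpolated_depth(depth, levels=1, qstep=8):
    """Stand-in for the interpolated depth map: keep only a quantized low-resolution
    approximation of the depth (as if all detail subbands were set to zero) and
    upsample it back to full size."""
    approx = depth.astype(float)
    for _ in range(levels):
        h, w = approx.shape
        approx = approx[:h - h % 2, :w - w % 2]
        approx = 0.25 * (approx[0::2, 0::2] + approx[1::2, 0::2] +
                         approx[0::2, 1::2] + approx[1::2, 1::2])    # low-pass and decimate
    approx = np.round(approx / qstep) * qstep                         # coarse quantization
    for _ in range(levels):
        approx = np.repeat(np.repeat(approx, 2, axis=0), 2, axis=1)   # nearest-neighbor upsampling
    return approx

def row_edges(img, threshold=10.0):
    """Row-wise edge map using the high-pass filter of Eq. (10.7)."""
    response = np.abs(img[:, 1:-1] - 0.5 * (img[:, :-2] + img[:, 2:]))
    return response > threshold

depth = np.tile(np.concatenate([np.full(32, 60.0), np.full(32, 180.0)]), (16, 1))
print(int(np.count_nonzero(row_edges(interpolated_depth(depth)))))    # edge locations are preserved
```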

10.5.1.4 Mixed Texture-Depth Edges as Side Information


As previously discussed, texture and depth edges present strong similarities. One
of the previous ideas consists in using the texture edges, which are independently
encoded. Here, we suggest strengthening the previous ideas by jointly using the
texture and the interpolated depth edges. Based on the previous observation that
the texture image contains many more edges than the depth map, only the relevant
texture edges are validated, as illustrated in Fig. 10.15.
The contours in the texture image are validated if they have, in a close neighborhood, a corresponding edge in the interpolated depth map. This is possible due
to the correlation between the texture image and the depth map. Then, a pixel is
validated as belonging to the final mixed texture-depth edge E if it belongs both to the
original edge of the decoded texture Ĩ and to a neighborhood N of the edges of the
interpolated depth map D̃, as described in the pseudo-code below:


Fig. 10.15 Adaptive lifting scheme using mixed texture-depth edges as side information

Ĩ: decoded texture image
D̃: interpolated depth image
for all (i, j) ∈ Ĩ ∗ h do
    if (i, j) ∈ N(D̃ ∗ h) then
        E ← E ∪ {(i, j)}
    end if
end for
where h denotes the impulse response of the high-pass filter described by Eq. 10.7.
Moreover, the real differences between the two interpolated maps are not crucial
for the edge detection in the texture image, since only a neighborhood of the edges
is used to validate the texture contours. As shown in Fig. 10.16, this allows us to
retrieve, from the texture plus interpolated depth map pair, the location of the
original edges of the depth map.
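In image terms, the pseudo-code amounts to a logical AND between the texture edge map and a dilation of the interpolated depth edge map; the sketch below illustrates this with a square neighborhood whose radius is an arbitrary choice.

```python
import numpy as np

def dilate(mask, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square, i.e., the neighborhood N."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            shifted = np.zeros_like(mask)
            src_r, src_c = slice(max(dr, 0), h + min(dr, 0)), slice(max(dc, 0), w + min(dc, 0))
            dst_r, dst_c = slice(max(-dr, 0), h + min(-dr, 0)), slice(max(-dc, 0), w + min(-dc, 0))
            shifted[dst_r, dst_c] = mask[src_r, src_c]
            out |= shifted
    return out

def mixed_edges(texture_edges, interp_depth_edges, radius=2):
    """Validate a texture edge pixel only if it falls inside a neighborhood of the
    interpolated depth edges, as in the pseudo-code above."""
    return texture_edges & dilate(interp_depth_edges, radius)

tex = np.zeros((6, 6), dtype=bool); tex[:, 3] = True; tex[0, :] = True   # texture has extra edges
dep = np.zeros((6, 6), dtype=bool); dep[:, 3] = True                      # depth edge: column 3 only
print(int(mixed_edges(tex, dep).sum()), int(tex.sum()))                   # fewer pixels are validated
```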

10.5.1.5 Rate-Distortion Comparison


This section is devoted to evaluating the coding efficiency of the adaptive lifting
scheme [21] against the linear Le Gall's 5/3 wavelet transform, with the JPEG2000
codec as entropy coding. For these results, the multiresolution wavelet decomposition is performed on four levels, applying the described adaptive procedure at
each decomposition level.
Figure 10.17 compares the coding efficiency between the different proposed
side information strategies and the linear Le Gall's 5/3 wavelet filter bank. As seen in
Sect. 10.3, Le Gall's 5/3 performs better than Daubechies 9/7, contrary to the case of
natural images. This is due to the very particular features of the depth map, which is
much smoother than natural images and presents sharp edges. The allocated bitrate
used to encode the depth map is equal to 20 % of the bitrate of the texture image.
Note that in Fig. 10.17 the increase in bitrate related to the depth edge side
information has been neglected when reporting the coding rate, which is equivalent
to perfectly retrieving the depth edges at the decoder side without any extra bits to
be sent.

Fig. 10.16 Mixed depth edges at different bitrates (top) at the encoder, (middle) at the decoder, (bottom) difference between both

Fig. 10.17 Rate-distortion results of the depth maps and novel synthesized images. The DWT Le Gall's 5/3 is compared with the adaptive support DWT [21]. The labels depth edges, texture edges, interpolated depth edges, and mixed edges denote the different strategies presented in Sect. 10.5 to send the edges as side information


The gain in the depth map coding becomes more perceptible when measuring
the PSNR of the warped image. Actually, the warped image PSNR measurements
indicate a quality gain of around 1.8 dB. The adaptive schemes do not necessarily improve the overall quality of the transmitted depth map over a classical
linear lifting scheme. However, at similar PSNR values, the support adaptivity
clearly provides a better preservation of the depth edges, and consequently an
improvement of the quality of the synthesized view.
It can be observed in Fig. 10.17 that, if the side information rate cost does not
exceed 0.02 bpp, the strategy of directly sending the depth edges as side information
provides the best RD performance. However, with state-of-the-art edge coders, such
a rate is still difficult to obtain. For example, the lossless encoding
of boundary shapes costs an average of 1.4 bits/boundary pel [26]. As a conclusion,
the mixed texture-depth edges side information provides the best performance.

10.6 Chapter Summary


In this chapter, the study of the impact of depth compression on view synthesis
has been extended to the wavelet domain. Due to its unique spatial-frequency
characteristics, the wavelet-based compression approach can be considered as an
alternative to traditional DCT-based video coding standards. We then studied the
effects of wavelet-based compression on the quality of the novel view synthesis by
DIBR. This study includes the analysis of the DWT for compressing the depth
video through an adaptive DWT that leads to a better depth compression and a
better 3D rendering quality. As a result, it has been observed that a DWT that can
shape itself to the local geometrical structure has the ability to reduce the edge-localized
errors, and thus the 3D quality is improved. The depth edges, however, still
have to be losslessly transmitted as side information at the cost of increasing the
bandwidth. An alternative is to leverage the existing correlation between the texture
and depth information. It can be observed that, under the assumption that the texture
and the depth data are captured from the same viewpoint, the texture and depth edges
are strongly correlated. It is then possible to jointly utilize the edges from the texture
and depth video, which enables an adaptive DWT that optimizes not only the
RD performance with respect to the depth video distortion, but also the distortion
of the novel synthesized views.
Acknowledgments This work is partially supported by the National Institute of Information
and Communications Technology (NICT), Strategic Information and Communications R&D
Promotion Programme (SCOPE) No. 101710002, Grant-in-Aid for Scientific Research
No. 21200002 in Japan, Funding Program for Next Generation World-Leading Researchers No.
LR030 (Cabinet Office, Government of Japan) in Japan, and the Japan Society for the Promotion
of Science (JSPS) Program for Foreign Researchers.


References
1. Fehn C, Cooke E, Schreer O, Kauff P (2002) 3D analysis and image-based rendering for immersive TV applications. Signal Process Image Commun 17(9):705–715
2. Tanimoto M (2006) Overview of free viewpoint television. Signal Process Image Commun 21:454–461
3. McMillan L Jr (1997) An image-based approach to three-dimensional computer graphics. PhD thesis, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
4. Oliveira MM (2000) Relief texture mapping. PhD thesis, University of North Carolina at Chapel Hill, NC, USA
5. Morvan Y, Farin D, de With PHN (2005) Novel coding technique for depth images using quadtree decomposition and plane approximation. In: Visual communications and image processing, vol 5960, Beijing, China, pp 1187–1194
6. Xiong Z, Ramchandran K, Orchard MT, Zhang Y-Q (1999) A comparative study of DCT- and wavelet-based image coding. IEEE Trans Circuits Syst Video Technol 9(5):692–695
7. Mallat SG (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11(7):674–693
8. Rioul O, Duhamel P (1992) Fast algorithms for discrete and continuous wavelet transforms. IEEE Trans Inf Theory 38(2):569–586
9. Sweldens W (1995) The lifting scheme: a new philosophy in biorthogonal wavelet constructions. In: Proceedings of the SPIE, wavelet applications in signal and image processing III, vol 2569, pp 68–79
10. Daubechies I, Sweldens W (1998) Factoring wavelet transforms into lifting steps. J Fourier Anal Appl 4:247–269
11. Le Gall D, Tabatabai A (1988) Sub-band coding of digital images using symmetric short kernel filters and arithmetic coding techniques. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 761–764, 11–14 Apr 1988
12. Antonini M, Barlaud M, Mathieu P, Daubechies I (1992) Image coding using wavelet transform. IEEE Trans Image Process 1(2):205–220
13. Cohen A, Daubechies I, Feauveau J-C (1992) Biorthogonal bases of compactly supported wavelets. Commun Pure Appl Math 45:485–500
14. Microsoft sequence Ballet and Breakdancers (2004) [Online] Available: http://research.microsoft.com/en-us/um/people/sbkang/3dvideodownload/
15. Tanimoto M, Fujii T, Suzuki K, Fukushima N, Mori Y (2008) Reference softwares for depth estimation and view synthesis, M15377 doc., Archamps, France, Apr 2008
16. Telea A (2004) An image inpainting technique based on the fast marching method. J Graph GPU Game Tools 9(1):23–34
17. Do MN, Vetterli M (2005) The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans Image Process 14(12):2091–2106
18. Li S, Li W (2000) Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding. IEEE Trans Circuits Syst Video Technol 10(5):725–743
19. Peyré G, Mallat S (2000) Surface compression with geometric bandelets. In: Proceedings of the annual conference on computer graphics and interactive techniques (SIGGRAPH), New York, NY, USA, pp 601–608. ACM
20. Shukla R, Dragotti PL, Do MN, Vetterli M (2005) Rate-distortion optimized tree-structured compression algorithms for piecewise polynomial images. IEEE Trans Image Process 14(3):343–359
21. Daribo I, Tillier C, Pesquet-Popescu B (2008) Adaptive wavelet coding of the depth map for stereoscopic view synthesis. In: Proceedings of the IEEE workshop on multimedia signal processing (MMSP), Cairns, Queensland, Australia, pp 413–417, Oct 2008
22. Maitre M, Do MN (2009) Shape-adaptive wavelet encoding of depth maps. In: Proceedings of the picture coding symposium (PCS), Chicago, USA, pp 1–4, May 2009
23. Sanchez A, Shen G, Ortega A (2009) Edge-preserving depth-map coding using graph-based wavelets. In: Proceedings of the asilomar conference on signals, systems and computers record, Pacific Grove, CA, USA, pp 578–582, Nov 2009
24. Shen G, Kim W-S, Narang SK, Ortega A, Lee J, Wey H (2010) Edge-adaptive transforms for efficient depth map coding. In: Proceedings of the picture coding symposium (PCS), Nagoya, Japan, pp 566–569, Dec 2010
25. Freeman H (1961) On the encoding of arbitrary geometric configurations. IRE Trans Electron Comput 2:260–268
26. Eden M, Kocher M (1985) On the performance of a contour coding algorithm in the context of image coding part I: contour segment coding. Signal Process 8(4):381–386

Chapter 11

Transmission of 3D Video
over Broadcasting
Pablo Angueira, David de la Vega, Javier Morgade
and Manuel María Vélez

Abstract This chapter provides a general perspective of the feasibility options of


the digital broadcasting networks for delivering three-dimensional TV (3D-TV)
services. It discusses factors (e.g., data format) that need to be accounted for in the
deployment stages of 3D-TV services over broadcast networks, with special
emphasis on systems based on Depth-Image-Based Rendering (DIBR)
techniques.







Keywords 3D broadcasting · 3D-TV system · 3D video coding · Cable · Digital broadcast · DVB · Frame compatible 3D format · H.264/AVC · ITU-R · ISDB · MPEG-2 · Multi-view video coding (MVC) · Multi-view video plus depth (MVD) · Network requirement · Satellite · Scalable Video Coding (SVC) · Standardization · Terrestrial · Transport









P. Angueira (&) · D. de la Vega · J. Morgade · M. M. Vélez


Department of Electronics and Telecommunications, Bilbao Faculty of Engineering,
University of the Basque Country (UPV/EHU), Alda Urkijo s/n, 48013 Bilbao, Spain
e-mail: pablo.angueira@ehu.es
D. de la Vega
e-mail: david.delavega@ehu.es
J. Morgade
e-mail: javier.morgade@ehu.es
M. M. Vélez
e-mail: manuel.velez@ehu.es



11.1 Introduction
This chapter provides a general perspective of the feasibility options of the digital
broadcasting networks for delivering three-dimensional TV (3D-TV) services.
Currently, the production and consumption of 3D video is a reality in cinema and Blu-ray
formats, but the general deployment of commercial 3D-TV services is not a reality
yet. Satellite and some terrestrial broadcasters in countries such as the United Kingdom have
shown interest in providing these services in the short or mid-term. Specifically, several
broadcasters continue to carry out experiments in stereoscopic 3D-TV production in
various European countries. The pay-television (TV) operator BSkyB (UK) started a
stereoscopic 3D-TV channel in 2010, and some consumer electronics manufacturers
also announced stereoscopic TV receivers during 2010.
In parallel, the digital broadcast standards are being redesigned. The second
generation systems in the different International Telecommunication Union (ITU)
Regions will have better spectral efficiency and higher throughput. Good examples
of this second generation of standards can be found in the DVB family, with
the second generation of satellite (DVB-S2), cable (DVB-C2) and terrestrial
(DVB-T2) systems.
Additionally, the advances in video coding, and specifically in 3D video coding,
have enabled the convergence between the bitrates associated with 3D streams and
the capacity of wired and wireless broadcast systems under specific conditions.
Finally, the activity in the technical and commercial committees of different
standardization bodies and industry associations has been very intense in the last
couple of years focusing mainly on production and displays. Examples of this
work can be found in committees of the Society of Motion Picture and Television
Engineers (SMPTE), the European Broadcasting Union (EBU), the Advanced
Television Systems Committee (ATSC), and the Digital Video Broadcasting
Consortium (DVB). In all cases, the initial approach for short and mid-term
deployment is based on stereoscopic 3D, with different formats and coding options.
Except for a few autostereoscopic prototypes, 3D-TV services will
require a specific display based on wearing special glasses.
The following sections provide a general description of the factors that need to
be accounted for in the deployment stages of 3D-TV services over broadcast
networks with special emphasis made on systems based on Depth-Image-Based
Rendering (DIBR) techniques.

11.2 Standardization Activities in 3D Broadcasting


This section summarizes the recent and current standardization activities developed by different organizations. The first institution that should be mentioned is
the ITU, which started the work on 3D as early as 1990. The relevant contributions
by this organization are listed in Table 11.1.


Table 11.1 Summary of ITU-R contributions to 3D-TV

Document        Reference        Title                                                                     Year
Report          ITU-R BT.312-5   Constitution of stereoscopic television                                   1990
Recommendation  ITU-R BT.1198    Stereoscopic television based on R- and L-eye two channel signals         1995
Report          ITU-R BT.2017    Stereoscopic television MPEG-2 multi-view profile                         1998
Recommendation  ITU-R BT.1438    Subjective assessment of stereoscopic television pictures                 2000
Report          ITU-R BT.2088    Stereoscopic television                                                   2006
Question        WP6A 128/6       Digital three-dimensional (3D) TV broadcasting                            2008 (a)
Report          ITU-R BT.2160    Features of three-dimensional television video systems for broadcasting   2010

(a) Question 128/6 has been updated in March 2011

Currently, the target of ITU-R activities on 3D-TV is described by Question
128/6, "Digital three-dimensional (3D) TV broadcasting". This question proposes
the following research areas:
1. What are the user requirements for digital 3D-TV broadcasting systems?
2. What are the requirements for image viewing and sound listening conditions
for 3D-TV?
3. What 3D-TV broadcasting systems currently exist or are being developed for
the purposes of TV programme production, post-production, TV recording,
archiving, distribution and transmission for realization of 3D-TV broadcasting?
4. What new methods of image capture and recording would be suitable for the
effective representation of three-dimensional scenes?
In 2010, the Study Group 6 (Working Party 6A) released a report that provides
more details about the views of ITU-R and the technical problems that need to be
addressed in all the building blocks of a 3D-TV broadcast chain [1]. This document proposes a format evolution scenario underlining non-technical challenges
that 3D-TV systems will face in the short and mid-term. A major milestone of this
report is the proposal of a uniform terminology for the plethora of existing 3D
video formats. Current work in SG6 focuses on production formats and subjective
evaluation of 3D content.
The ITU also participates in the well-known ITU/ISO/IEC JTC1/SC29/WG11
group (MPEG Group). This team is in charge of the development of international
standards for compression, processing, and coded representation of moving pictures, audio, and associated data. This committee has carried out the standardization of the coding and multiplexing aspects of 3D video up to date. Relevant
contributions produced in previous years have been the MPEG-C extension,
which enables the representation of auxiliary video and supplemental information
(e.g., depth information, L/R difference) [2], and the completion of the MVC
multi-view coding extension of H.264/AVC, carried out in 2009.


This committee is now in the preliminary stages of developing a new family of
coding tools for 3D video called 3DVideo. The target of this new task is to define a
data format and associated compression technology to enable the high-quality
reconstruction of synthesized views for 3D displays. The call for technology proposals was released in 2011 [3].
The SMPTE has also been one of the key organizations promoting 3D standardization worldwide. This organization produces standards (ANSI), reports, and
studies, as well as technical journals and conferences in the generic field of TV. In
August 2008, SMPTE formed a Task Force on 3D to the Home. The report
produced by this committee contains the definition of the 3D Home Master and
provides a description of the terminology and use cases for 3D in the home
environment [4].
The Digital Video Broadcasting (DVB) consortium activities should be mentioned here, considering the number of viewers that access TV services worldwide
by means of its standards. The DVB specifications relating to 3D-TV are currently
described by the DVB-3D-TV specification (DVB BlueBook A154 and A151)
[5, 6] and a 3D Subtitling addendum. The work in the technical committees has
focused on Stereoscopic Frame Compatible formats guided by the commercial
strategy of the DVB Commercial Module. The DVB plans propose to accomplish
the 3D-TV introduction in two phases. Phase 1 will be based on Frame Compatible
formats, and Phase 2 could be based either on a Frame Compatible Compatible or
a Service Compatible approach (the format definitions will be explained in Sect.
11.3.2).
Finally, it is worth mentioning that almost every commercial and non-commercial
broadcasting-related organization in all ITU regions has a working committee that
either develops or monitors aspects of 3D-TV policies and technology. Examples can
be found in the Advanced Television Standards Committee (ATSC), EBU, Asian
Broadcasting Union (ABU), Association of Radio Industries and Businesses (ARIB)
and Consumer Electronics Association (CEA).

11.3 A Generic 3D-TV System Architecture


Any 3D-TV system based on digital broadcast networks can be studied using the
traditional building blocks associated with a generic 2D broadcast system: production/processing, transport, transmission, and reception. The concept is shown in Fig. 11.1.

Fig. 11.1 General architecture of a 3D-TV broadcast system

The system described in Fig. 11.1 assumes that HDTV and 3D (whatever the
format) will coexist for a long time. Given that the final goal of an immersive TV
system is to provide a full-color, full-resolution, free-viewpoint 3D image in a way similar to
today's holography, there is still a long way to go to achieve this target, and so
the compatibility between different 3D technology generations and the 2D HDTV
infrastructure is crucial. Current research focuses mainly on the first and last
blocks of the chain: production/coding and displays. The multiplex and broadcast
blocks will not be drastically modified in the mid-term (and probably not in the long
term either), especially taking into consideration that the second generation of digital
broadcast systems has already been standardized [7-9]. These new systems have
driven the performance curves close to Shannon's limit, and thus the challenge for
the following decades will be the commercial rollout of the technology.

11.3.1 Production
The first step in a 3D TV transmission chain is the generation of suitable content.
Although the topic is not within the scope of this chapter, it is worth mentioning here
the relevance of 3D production storytelling (also called production grammar). The
production grammar of 3D should differ from that of 2D productions to ensure a good 3D
viewing experience. This can lead to some compromises for the 2D viewer, and
thus compatibility issues might be more related to the content and how to show it
rather than to formats, coding, or other technical matters.
Capturing 3D video scenes is one of the most complex aspects of the whole 3D
chain. Currently, the production of contents for broadcasting is based on 2D
cameras. The output is a 2D image which is the basis for different processing
techniques that produce the additional views or layers required by each specific 3D
format. The simplest options are based on two camera modules that produce a
stereoscopic output composed of two 2D images, one per eye. An example of
this system is the specification from the SMPTE that recommends a production
format with a resolution of 1,920 × 1,080 pixels at 60 frames per second per eye.
Another family of techniques is based on multi-camera systems, which provide
a higher flexibility, in the sense that they allow capturing content that can be processed into virtually any 3D format. This advantage is limited by the extreme
complexity of the multi-camera arrangements, which restricts their application to very
special situations (e.g., specific studios).
Focusing on systems based on depth information for constructing 3D images,
there are different approaches to obtain the depth data of the scene, depending on
the production case. In the case of 3D production of new material, some camera
systems have the capacity of simultaneously capturing video and associated pixel
depth information by a system based on infrared sensing of the scene depth [10]. If
the camera system is not furnished with the modules that obtain the depth information
directly, the depth maps can be obtained by processing the different components of
the 3D image (stereoscopic or multi-view), as explained in Chaps. 2 and 7.
The case of computer generated video streams is less problematic, as full scene
models with 3D geometry are available and thus the depth information with
respect to any given camera view-point can be generated and converted into the
required 3DV depth format.
Finally, and especially during the introductory phases of 3D services, existing
adequate 2D material can be processed to obtain equivalent 3D content. In this
case, there have been different proposals that extract depth information from a 2D
image. A detailed survey of the state of the art of the conversion methods can be
found in [11].
Once a 3D video stream, in whatever format, is captured or computer generated, it has to be coded to be either delivered or stored on different media. The
coding algorithm will remove temporal, spatial, or inter-stream redundancies, making
delivery more efficient. This efficiency is a critical issue in digital terrestrial broadcast systems, where the throughput is always limited. Depth data-based systems show an interesting trade-off between the achieved 3D perception
quality and the required bitrate.

11.3.2 Formats
The representation format used to produce a 3D image will influence the video
coder choice and thus the bitrate that will be delivered to the user, which has
obvious implications on the transport and broadcast networks. Last but not least,
the video format will have a close dependency with the consumer display.

11.3.2.1 Format Terminology


A 3D format classification for broadcasting can be found in ITU-R Report BT.2160
[1]. This report defines a list of different video formats for 3D-TV based on
compatibility factors. The list of formats applies to different compatibility scenarios of existing standard and 2D HDTV set-top boxes and displays, and does not
imply any preferred delivery system (cable, terrestrial, or satellite). The formats
and summarized descriptions are shown in Table 11.2.
The classification in Table 11.2 does not specify further restrictions on the
format details except for the mentioned compatibility of broadcast receivers (set-top boxes) and displays. If we consider the different 3D display techniques, the
ITU-R formats can be associated with the display options, and these options can in
turn be related to the 3D representation and coding of the images (see Table 11.3).


Table 11.2 3D format terminology as in ITU-R Report BT.2160-1

Conventional HD display compatible: The signals transmitted are based on a complementary primary color separation and matrixing of the left-eye and right-eye signals.

Conventional HD frame compatible: The two stereoscopic images are multiplexed into a single HD image. The current set-top box can receive this format, but the 3D display needed is new and must have the capability to interpret an HD frame as left-eye and right-eye pictures.

HD frame compatible: The generic set-top box (or IRD) for which the signal is intended here is a new set-top box which is able to decode a frame compatible image, and also to decode a resolution enhancement layer, using, for example, MPEG SVC.

Conventional HD service compatible: The generic set-top box here is also a new set-top box able to decode an MPEG MVC or a 2D HD plus depth stream conforming to the ISO/IEC JTC1 MPEG specification. The signal is arranged so that a conventional set-top box sees a single 2D HD signal. New set-top boxes (or integrated receiver/displays) recognize the additional information in order to decode a second view and provide two output signals, L and R, to the display. The envisaged display is a multi-view auto-stereoscopic display.

Table 11.3 Usual 3D format terminology used for different system features description

Format (compatibility): HD display compatible
3D display type: system based on anaglyph glasses
3D representation/coding: anaglyph

Format (compatibility): HD frame compatible
3D display type: system based on glasses (limited auto-stereoscopic possible)
3D representation/coding: frame compatible stereo; 2D + depth; multi-view plus depth (2 views)

Format (compatibility): HD frame compatible compatible
3D display type: systems based on glasses (limited auto-stereoscopic possible)
3D representation/coding: multi-view plus depth (2 views); layered depth video; depth enhanced stereo

Format (compatibility): HD service compatible
3D display type: systems based on glasses / auto-stereoscopic
3D representation/coding: full resolution stereo (simulcast); multi-view plus depth; depth enhanced stereo
It should be mentioned that there is no general agreement in the literature on
the correspondence between formats, displays, and 3D representation techniques. The
3D-TV specifications made so far by SMPTE and DVB [5, 6] recommend an HD
Frame Compatible format, based on frame compatible stereo images, and it is
assumed that the general case of displays will be based on wearing glasses. An
example can be found in Korea, where the 3D representation and coding is
based on 2D plus depth (3D is built on DIBR techniques). The system is compatible with existing receivers and displays, as the baseline image is the same 2D
video sequence.


11.3.2.2 Formats for 3D-TV Broadcasting


3D video coding and formats were already described in Chaps. 8 and 9. This
section summarizes the possibilities that are being considered for broadcasting.
Regardless of the delivery medium used to reach the consumer (cable, satellite, or
terrestrial broadcast networks), there are currently different format families to
represent a 3D video sequence:
1. Full Resolution Stereo or Multi-view
2. Frame Compatible Stereo
3. 2D plus Difference
4. Depth-based Representation

Full resolution or multi-view image formats represent different views of the
same sequence of video scenes. These views are captured with the resolution of the
equivalent 2D system. The usual resolutions in the case of storage and broadcast
systems are 1,920 × 1,080 interlaced or 1,280 × 720 progressive, while the mid- and
long-term format will probably be 1,920 × 1,080 progressive. The main
advantage of the full resolution format lies in its conceptual simplicity for
different coder options and the lack of any restriction on the choice of coding
algorithm. Additionally, this format has the advantage, for a production environment, of a full resolution video image. On the contrary, the major disadvantage
is the bitrate implied for storage or delivery. In the general case, the throughput
requirement will be N times the base image bitrate, N being the number of views.
The second format family is composed of the formats called Frame Compatible
Stereo. In this case, the original video information comes from two full resolution
video sequences, usually left- and right-eye views, which are multiplexed into a single
image flow. The multiplexing of both streams into a single one implies a decimation
of each sequence to half the original full resolution. There are different options for the
decimation and multiplexing, as shown in Fig. 11.2. In the case of the Side-by-Side and
Line Interleaving formats, the subsampling is carried out in the horizontal plane,
whereas for the Top-Bottom option the decimation is performed on the vertical plane.
Finally, the Checkerboard technique distributes the loss of resolution on both planes.
The obvious drawback of these systems is the inherent loss in resolution for
both 2D and 3D consumers. This disadvantage has not been considered a serious
obstacle for the use of this format, at least in the first rollout stages of 3D-TV, and it
is the preferred choice in most of the experiments, trials, and showcase 3D-TV
activities to date. This format has very little impact on current transport,
broadcast, and reception systems, a fact that has been prioritized over full resolution for new 3D consumers in most studies so far.

Fig. 11.2 Multiplexing techniques for frame-compatible representations
There is an alternative option based on assigning different resolutions to the
left- and right-eye views, and assuming that the perceived quality will be close to
the quality associated to the view delivered in full resolution. This approach is
called 2D plus Difference or 2D plus Delta. In this approach, one of the views
is taken as the baseline stream and encoded conventionally. With appropriate
signaling a viewer with the proper 2D receiver will be able to access this service,

11

Transmission of 3D Video over Broadcasting

307

Fig. 11.2 Multiplexing techniques for frame-compatible representations

while the owners of a 3D decoder will be able to extract from the multiplex the
difference signal that will be used to modify the 2D video and create the complementary view. The output from the set-top box to the display would normally
be an L and R stereo pair with different resolutions depending on each case. The
difference signal can be compressed using either a standard video encoder e.g.,
using the MPEG-4 Stereo High Profile [12].
Finally, the fourth group of representation formats adds one or multiple auxiliary views of depth information to be used by Depth-Image-Based Rendering (DIBR)
techniques. The depth map associated with a specific scene contains geometric
information for each image pixel. Since this information is geometry data, the
reconstruction of the 3D image can be adapted to the display type, size, and
spectator conditions [13, 14].
There are different formats based on image depth information. The simplest
version contains a 2D image sequence and a depth information layer, also similar in
shape to a 2D image. The problem associated with this format is the representation of
occlusions by objects in a closer plane from the viewer perspective. The problem is
solved by formats called 2D plus DOT (Depth, Occlusion, and Transparency) or
MVD (Multi-view Video plus Depth) [15]. MVD formats contain additional layers
with occlusion areas of the scene, and have the disadvantage of the redundancy
intrinsic to them. Some versions, such as Layered Depth Video (LDV) [16], contain the
information of the 2D original scene, the image depth information, and a third layer
with occlusion information, only considering first-order occlusions. Similar versions of depth information rendering formats are available in the literature. The
depth information itself can be captured with specially designed cameras or produced by post-processing a multi-view image stream.
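To make the role of the depth layer concrete, the following minimal sketch converts an 8-bit depth map into per-pixel horizontal disparities and forward-warps the texture to a shifted virtual camera. The znear/zfar mapping, the focal and baseline parameters, and the lack of occlusion handling or hole filling are simplifying assumptions for illustration, not the procedure mandated by any of the formats above.

```python
import numpy as np

def render_virtual_view(texture, depth, znear, zfar, focal, baseline):
    """Very small DIBR sketch: recover metric depth from 8-bit values with the
    usual inverse-depth mapping, turn it into a horizontal disparity for a camera
    shifted by 'baseline', and forward-warp the texture row by row."""
    d = depth.astype(float) / 255.0
    z = 1.0 / (d / znear + (1.0 - d) / zfar)              # 255 -> znear, 0 -> zfar
    disparity = np.round(focal * baseline / z).astype(int)
    h, w = depth.shape
    virtual = np.zeros_like(texture)
    hole = np.ones((h, w), dtype=bool)                     # disocclusions are left unfilled
    for r in range(h):
        for c in range(w):
            nc = c + disparity[r, c]
            if 0 <= nc < w:
                virtual[r, nc] = texture[r, c]
                hole[r, nc] = False
    return virtual, hole

# Tiny usage example with placeholder data and illustrative camera parameters.
tex = np.full((4, 64, 3), 128, dtype=np.uint8)
dep = np.tile(np.arange(64, dtype=np.uint8) * 4, (4, 1))
view, holes = render_virtual_view(tex, dep, znear=1.0, zfar=50.0, focal=100.0, baseline=0.05)
```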

11.3.3 Transport
This section describes the existing methods for multiplexing 3D-TV (and 2D
HDTV) services into a bit stream structure. The objective of creating these
structures is two-fold. On the one hand, these structures enable a comprehensive
association of the different components of a multimedia service: the video component(s), the audio component(s), and associated data streams. Furthermore, this
association is extended to aggregate several individual multimedia services into a
multi-programme structure. The well-known MPEG Transport Stream is the most
widespread example.
The second objective of these multiplex structures is to allow the transportation
of the programme structure through different networks (microwave line of sight
links, fiber optic links, satellite contribution links) to the broadcast nodes (transmitters in the case of terrestrial networks, head-ends in cable networks and earth
satellite stations in satellite broadcasting). In some cases, these streams are further
encapsulated into higher level structures for transportation. An example of the
latter is IP encapsulation of MPEG-TS packets [17].
The contents of this section summarize the ways of conveying different 3D
video formats into transport structures. The information is organized into two
differentiated groups: solutions for Conventional Frame Compatible formats and
solutions for systems based on DIBR and thus conveying different versions of the
scene depth data.

11.3.3.1 Transport Techniques for Conventional Stereo Video


The aim of the conventional stereo video transport techniques is to provide a
structure that conveys the two components of a stereoscopic 3D video (left- and
right-eye views). The alternatives that will be described in the following subsections are Simulcast MPEG-2 Multi-view Profile, H.264/AVC Simulcast and
H.264/AVC SEI Messages. Although the coding of 3D streams was described in
Chaps. 8-10, in some transportation schemes coding and multiplexing are
somewhat mixed into a single stage. This is due to the fact that some of the transport
solutions are difficult to separate from the coding stages, the multiplexing
of the two views of the 3D image being inherent to the coding algorithm.

MPEG-2 Multi-View Profile


The MPEG-2 Multi View Profile (MVP) is a technique that provides efficient
coding of a stereo video stream (left- and right-eye views) based on two different
layers. The main layer is one of the stereoscopic video components (usually the left-eye view component), which is coded using the usual MPEG-2 main profile tools.
This profile allows intra and predicted frames encoding and decoding with bidirectional prediction. The other view is coded as an enhancement layer video which
is coded with temporal scalability tools and exploits the correlation between the two
views to improve the compression efficiency. Both streams can be multiplexed into
an MPEG Transport Stream or an RTP (Real-Time Transport Protocol) stream [18].
This option has not been commercially successful so far, despite the fact that it
would provide backward compatibility with existing MPEG-2 receivers.


Fig. 11.3 Simulcast H.264/AVC

Fig. 11.4 H.264/AVC MVC

H.264/AVC Simulcast
H.264/AVC Simulcast is specified as the individual application of an H.264/
AVC [19] conforming coder to several video sequences in a generic way. The
process is illustrated in Fig. 11.3.
Thus, this solution for delivering a stereoscopic 3D video stream would consist
of independent coding of each view and aggregation of the resulting bit streams into the transport structure
being used, either MPEG-TS or RTP protocols.

H.264/AVC Multi-View Video Coding


H.264/AVC MVC is an update of the H.264/AVC coding standard. It is based on the
High Profile of the standard, which provides the tools for coding two or more views
using inter-picture (temporal) and inter-view prediction [19]. The MVC has two
profiles for supporting either multi-view (2 or more) and stereoscopic inputs (leftand right-eye views). The Multi-view High Profile does not support interlace coding
tools, whereas the Stereo High Profile allows it, but limited to two views [20].
For stereoscopic transmission, the input left- and right-eye view components of
the stereo pair will be the input to the coder, which is applied to both sequences
simultaneously (see Fig. 11.4).
The result is two dependent encoded bit-streams. The streams, and
optional camera parameters as auxiliary information, are interleaved frame-by-frame, resulting in one MVC elementary stream that can be conveyed in a
standard MPEG-TS.


Fig. 11.5 H.264/AVC SEI message transport

H.264/AVC SEI Message


The Stereo Frame Packing Arrangement SEI (Supplemental Enhancement Information) message is an addition to MPEG-4 AVC that informs the decoder that the
left- and right-eye stereo views are packed into a single high-resolution video [21,
22]. Packing both left- and right-eye stereo views into a single video frame makes
possible the use of existing encoders and decoders to distribute 3D-TV immediately without having to wait for MVC & Stereo High Profile hardware to be
deployed widely. This technique is summarized in Fig. 11.5. The codec used for
H.264/AVC Stereo SEI Message is H.264/AVC, which is applied to the interlaced
sequence. This sequence is multiplexed into a single suitable transport structure
(MPEG-TS or RTP/UDP).

11.3.3.2 Transport Techniques for Video Plus Depth Coding


The second group of techniques is associated with video formats that use the
depth information to build the stereoscopic (or multi-view) 3D image. The multiplexing will in all cases be based on different options to code and transport the
depth-related data. The alternatives covered in the following subsections are ISO/
IEC 23002-3 (MPEG-C Part 3), H.264/AVC with auxiliary picture syntax,
MPEG-4 MAC, and H.264/AVC SVC.

MPEG-C PART 3 (ISO/IEC 23002-3)


The MPEG group launched the ISO/IEC 23002-3 specification (also known as
MPEG-C part 3), which was standardized in 2007 [23]. MPEG-C Part 3 specifies
the representation of auxiliary 2D + depth video and supplementary information.
In particular, it provides the rules for signaling those auxiliary streams. Fig. 11.6
shows the concept behind this standard.
MPEG-C Part 3 specifies an Auxiliary Video Data format that could be used to
convey other information, in addition to just depth maps. The standard is based on
an array of N-bit values that are associated with the individual pixels of the 2D
video stream. Depth maps and parallax maps are the first specified types of auxiliary video streams, relating to stereoscopic-view video content. Parallax maps
can be seen as a hyperbolic representation of depth that conveys the difference in
the apparent position of an object viewed along two different lines of sight. New
values for additional data representations could be added to accommodate future
coding technologies. MPEG-C Part 3 is directly applicable to broadcast video
because it allows specifying the video stream as a 2D stream plus associated depth,
where the single-channel video is augmented by the per-pixel depth attached as
auxiliary data [21, 22, 24].

Fig. 11.6 ISO/IEC 23002-3 (MPEG-C PART 3)

Fig. 11.7 H.264/AVC with auxiliary picture syntax
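As a minimal illustration of such per-pixel auxiliary data, the sketch below quantizes metric depth into N-bit values using an inverse-depth mapping, which is hyperbolic in depth in the same sense as a parallax map; the function name, the znear/zfar range, and the bit depth are illustrative assumptions, not values from the specification.

```python
import numpy as np

def depth_to_auxiliary(z, znear, zfar, nbits=8):
    """Map metric depth to N-bit auxiliary values: 1/Z is normalized so that the
    nearest plane gets the maximum code and the farthest plane gets zero, giving
    nearby objects finer quantization than the distant background."""
    levels = (1 << nbits) - 1
    inv = (1.0 / z - 1.0 / zfar) / (1.0 / znear - 1.0 / zfar)
    return np.round(np.clip(inv, 0.0, 1.0) * levels).astype(np.uint16)

# A scene spanning 1 m to 50 m: one meter near the camera spans far more code
# values than one meter at the back of the scene.
z = np.array([1.0, 2.0, 49.0, 50.0])
print(depth_to_auxiliary(z, znear=1.0, zfar=50.0))    # e.g. [255 125 0 0]
```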

H.264/AVC with Auxiliary Picture Syntax


H.264/AVC Auxiliary Picture Syntax provides a means to deliver image plus
depth formatted 3D streams [19]. This feature of the H.264/AVC standard allows
sending auxiliary pictures associated with a video sequence without specifying how
this side video should be decoded. For the purpose of sending depth information, the
2D and depth images are combined into a single video input coded by H.264/AVC
(see Fig. 11.7). The coding process has a primary coded image (2D Video) and an
auxiliary coded stream (depth data).
The coder is then applied to both sequences simultaneously but independently.
The output of the coder is a single video stream that can be directly multiplexed
into an MPEG-TS. A single encoder output is advantageous, compared to MPEG-C Part 3, because it will not affect the end-to-end communication chain, as it does
not require any additional signaling.
MPEG-4 MAC
Multiple Auxiliary Component (MAC) is a mechanism within the MPEG-4 coder
that specifies the encoding of auxiliary components, such as a depth map, in
addition to the Y, U, V components present in 2D video [25, 26]. The additional


layers are defined on a pixel-by-pixel basis providing additional data related to the
video object, such as disparity, depth, and additional texture. Up to three auxiliary
components are possible. Like H.264/AVC, MAC produces one bitstream output
which avoids the different multiplexing and demultiplexing stages.
H.264/AVC SVC
H.264/AVC Scalable Video Coding (SVC) is an extension of H.264/AVC that
allows different users, with different displays and connected through different
delivery media, to use a single bit stream. H.264/AVC SVC defines a bit stream
containing several quality levels, several resolutions and several frame rates. SVC
generates a H.264/AVC compatible base layer and one or several enhancement
layer(s). The base layer bit stream corresponds to a minimum quality, frame rate, and
resolution whereas the enhancement layer bit streams represent the same video at
gradually increased quality and/or increased resolution and/or increased frame rate.
The base layer bit stream could be a 2D H.264/AVC video, with the complementary layers carrying different levels of depth, occlusion, transparency, etc. This
standard has shown better performance than MPEG-4 MAC for the same bitrate
and similar behavior as H.264/AVC [26]. From a service compatibility point of
view, the base layer will be compatible with H.264/AVC 2D receivers, whereas
users with an SVC decoder would be able to decode the depth/disparity information for 3D DIBR-based receivers.

11.3.4 Bitrate Requirements


This section discusses the bitrate requirements for the service types that can be
considered as relevant targets for a 3D broadcasting environment scenario. The
offer will include 2D High Definition services and 3D services (full or partial
resolution). In the case of terrestrial broadcasting, 3D services for portable
receivers might also be considered [27].
There are three different possible formats for delivering HDTV content:
1,280 × 720 progressive scanning (720p), 1,920 × 1,080 interlaced scanning
(1,080i), and 1,920 × 1,080 progressive scanning (1,080p). Currently, the first two
options have been adopted or recommended by different countries and consortia. There
have been exhaustive tests in previous years in order to compare the subjective
quality provided by each format for different contents and coding rates (MPEG-4/AVC
and MPEG-2) [28, 29]. The 720p format has provided better subjective performance for a
given bitrate when compared to 1,080i (for bitrates up to 18 Mbps). Most studies conclude
that the best performance is provided by the 1,080p format, which at the same time will
require, depending on the content type, 30-50 % more bitrate. Considering the HDTV
services currently on air (terrestrial delivery and cable/satellite), a summary of required
bitrates is presented in Table 11.4.


Table 11.4 HDTV MPEG-4/AVC coding bitrates (in Mbps)

Format    Bitrate min.   Bitrate max.   Reference value
720p      6              11             8
1,080i    7.5            13             10
1,080p    7.8            13             13

Table 11.5 Bitrate requirement estimations (3D/2D %) according to different sources

                                                               Bitrate (% of required 2D)
Format           Resolution                                    Ref. [33]   Ref. [34]   Ref. [5]
Simulcast        1,080p/50                                     200         200         200
Stereoscopic     1,080p/100 or 2,160p/50 (reduced resolution)  170–190     110–160     100–160
2D + Difference  1,080p/50 + additional layer                  140–180     130–180     160 (a)
2D + Depth       1,080p/50 + additional layer                  120–160     –           160 (a)
2D + DOT         1,080p/50 + additional layer                  180–220     130         160 (a)

(a) The DVB consortium is currently considering a maximum of 60 % overhead for phase 2 first generation 3D-TV formats

The coding output bitrate is still a matter of discussion in the research community [28–32]. Values from 6 to 18 Mbps can be found in different reports as a function of the perceptual quality and delivery infrastructure (terrestrial, cable or satellite). The values proposed here are a summary of the values found in these references, including the performance gains forecasted by the same authors for the following 5–7 years; in the case of 1,080p the coding algorithm is MPEG-4/AVC.
The reference 3D-TV bitrate values used in the following sections for the terrestrial standards will be close to the maximum values, considering expected gains in statistical multiplexing and advances in coding technologies. The bitrate requirements associated with 3D services will depend on the format and coding choice, as described in Sect. 11.3.2 and Chaps. 8–10. Foresight studies in Europe have suggested the bitrate ranges of Table 11.5.
Table 11.5 shows the clear advantages of the depth information-based formats. Although still an area of challenging research, the 2D + Depth approach offers the prospect of considerable bitrate saving.
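As a quick way of turning the percentages of Table 11.5 into absolute figures, the Python sketch below multiplies a 2D reference bitrate by an overhead factor. The factors are illustrative midpoints of the ranges reported in the table, not values endorsed by any of the cited studies.

# Illustrative midpoints of the 3D/2D ratios in Table 11.5 (fraction of the 2D bitrate)
OVERHEAD_3D = {
    "simulcast": 2.0,
    "stereoscopic (reduced resolution)": 1.5,
    "2D + Difference": 1.6,
    "2D + Depth": 1.4,
    "2D + DOT": 1.8,
}

def required_3d_bitrate(bitrate_2d_mbps: float, fmt: str) -> float:
    """Estimate the bitrate of a 3D service from the equivalent 2D bitrate."""
    return bitrate_2d_mbps * OVERHEAD_3D[fmt]

# Example: a 1,080p service at 13 Mbps (Table 11.4 reference) delivered as 2D + Depth
print(round(required_3d_bitrate(13.0, "2D + Depth"), 1))   # -> about 18 Mbps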

11.3.5 Broadcast Networks


This section provides a first introduction to this part of the 3D delivery chain from a general system architecture point of view. The technical details of TV standards and the implications for 3D-TV broadcasting will be described in the sections dedicated to satellite, terrestrial, and cable (Sects. 11.5–11.7).
Each delivery network (satellite, cable or terrestrial) will impose its own constraints. The first of these constraints is the structure of the multiplex conveying the multimedia services (audio, video, and associated data).
Table 11.6 Summary of expected capacities associated with different delivery media

System          Throughput (Mbps)
DVB-T           24
DVB-T2          35
DVB-S2          45
Cabled system   80

The second constraint is the capacity, understood either as the number of services or the maximum bitrate associated with an individual service.
The first constraint depends entirely on the standard under study. Some standards define the coding and multiplex very precisely (as in DVB-T, DVB-S or ATSC), leaving little room for 3D compatibility. Other systems leave room for service evolution, with a flexible multiplex architecture and without a predefined coding method (DVB-C2, DVB-T2 or DTMB).
The second constraint is additionally influenced by the medium. Usually, the maximum system capacity is achieved by cabled systems, followed by satellite and terrestrial. In cabled systems, the bitrate should not be a relevant constraint for the introduction of 3D services, a fact that is even truer after the publication of the recent DVB-C2 standard, with rates of 80 Mbps per 7/8 MHz channel and using all VHF and UHF bands.
Satellite systems would not face serious capacity problems in conveying 3D services, provided the number of services offered is limited. It should be considered that the bandwidth currently allocated to DBS (Direct Broadcast Satellite) systems will be used with improved efficiency with the new standards such as DVB-S2 (45–55 Mbps per 30 MHz RF channel).
The terrestrial case is more challenging. The capacity achieved by the latest standards (DVB-T2) is in the range of 30–40 Mbps per 8 MHz RF channel. The reduction in the bands available for terrestrial broadcasting as a consequence of the analog switch-off makes the problem more difficult to solve. 3D-TV systems that require only a moderate increase in bandwidth with respect to their 2D counterparts will be of key importance in the first introductory stages. A summary of the expected throughput for each of the delivery options can be found in Table 11.6.
The following Sects. 11.3.5.1–11.3.5.3 introduce the terrestrial, cable, and satellite networks.

11.3.5.1 Terrestrial
Terrestrial broadcast networks are composed of a number of transmitting centers
that send the DTV signal using UHF channels. These transmitter sites are usually
high power stations, with radiating systems placed well above the average terrain
height of the service area. In principle, the network configuration is independent of
the service being delivered, provided the throughput associated with the transmission

parameters is high enough to accommodate the service. Currently, there are four main
digital terrestrial TV standards around the world:
1. ATSC. This is a North America-based DTV standards organization, which
developed the ATSC terrestrial DTV series of standards.
2. DVB Consortium. A European-based standards organization, which developed the
DVB series of DTV standards DVB-T/H, and more recently the second generation
systems like DVB-T2. The systems designed by DVB are all standardized by the
European Telecommunication Standard Institute (ETSI). Although initially
European, the scope of DVB encompasses today the three ITU Regions.
3. The ISDB standards, a series of DTV standards developed and standardized by
the ARIB and by the Japan Cable Television Engineering Association
(JCTEA).
4. DTMB (Digital Terrestrial Multimedia Broadcast), which is the TV standard for mobile and fixed terminals used in the People's Republic of China, Hong Kong, and Macau.

11.3.5.2 Cable
A cable broadcast network is a communication network that is composed of a number
of head-ends, a transport and distribution section based on fiber technologies, and a
final access section usually relying on coaxial cable. Historically, these networks have
been referred to as HFC (Hybrid Fiber Coaxial) networks. The head-ends of the system
aggregate contents from different sources and content providers. These services are
then fed to the fiber distribution network, usually based on SDH/SONET switching.
The distribution network transports the services to a node close to the final user area
(usually a few blocks in a city). Within this local area, the services are delivered by a
coaxial network to the consumer premises. The coaxial section has a tree structure,
based on power splitters and RF amplifiers. The signals on the cable are transmitted
using the frequencies that range from a few MHz to the full UHF band. The system total
bandwidth is limited by the effective bandwidth of the coaxial cable section.
Unlike the terrestrial market, there are only two cable TV standards. ISDB-C (Integrated Services Digital Broadcasting-Cable) is used exclusively in Japan and was standardized by the JCTEA [35]. The DVB consortium standardized the DVB-C specification in 1994. DVB-C is the widespread cable standard used in all ITU-R Regions
1, 2, and 3. Recently, the second generation system DVB-C2 [36] has been approved
and endorsed by the European Telecommunications Standards Institute.

11.3.5.3 Satellite
A Broadcast Satellite Service network, also called Direct to the Home (DTH) or
Direct Broadcast Satellite (DBS), is a communications network (currently with
some degree of interactivity through dedicated return channels) that is based on a
geostationary satellite which broadcasts the services received from a terrestrial
contribution center. The signals from the satellite are received directly by the
consumers using a relatively small parabolic antenna and a coaxial cable home
distribution system to a set-top box receiver. Thus the system is similar to a
terrestrial broadcast network, in the sense that the terrestrial transmitter networks
are substituted by one or several geostationary satellites that provide the signals
that will be directly received by the users. The frequency bands are in the vicinity
of 12 GHz, with different ranges depending on the ITU Region (11.7–12.2 GHz in ITU Region 3, 10.7–12.75 GHz in ITU Region 1 and 12.2–12.7 GHz in ITU Region 2).
DVB-S is the digital satellite transmission standard used in most countries (with the exception of Japan, where the ISDB-S standard prevails). Again, this system was
developed by the DVB consortium as early as 1994. In 2006, the next generation
of the standard for satellite delivery DVB-S2 [9] was approved by the DVB
consortium with the aim of progressively replacing DVB-S. The transition toward
the second generation is expected in the long term.

11.3.6 Reception and Displays


The display is the last step in the 3D-TV chain associated with reception and, to a great extent, the key element for widespread consumer acceptance. As in production and coding, there is a diversity of techniques and proposals.
The systems based on wearing glasses are the first stage toward the final objective of glasses-free, full-angle, high-definition 3D content. Currently, the majority of commercially available displays rely on one of the following technical approaches: Anaglyph, Polarized Glasses, and Shuttered Glasses.
The Anaglyph system presents two differently color-filtered images (typically red for the right-eye image and cyan for the left-eye image) that are viewed through correspondingly colored glasses. This technique provides backward compatibility with legacy hardware but has a relatively poor color rendition. The solution based on polarized glasses uses cross-polarization for the right-eye and left-eye images of a stereo pair, which are independently shown to each eye. With this system, the presentation of stereo images at HDTV resolution requires a more expensive display providing at least twice the horizontal resolution of HDTV. Finally, in systems with Shuttered Glasses, the two images of a stereo pair are time-interleaved and viewed through special glasses in which the left- and right-eye lenses are shuttered synchronously with the presentation frames.
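The anaglyph composition described above is simple enough to express in a few lines of Python/NumPy. The sketch below follows the channel assignment mentioned in the text (red channel from the right-eye image, green and blue from the left-eye image); it is only a rough illustration, since commercial tools apply additional color correction.

import numpy as np

def make_anaglyph(left_rgb: np.ndarray, right_rgb: np.ndarray) -> np.ndarray:
    """Build a red/cyan anaglyph frame from a stereo pair (H x W x 3 arrays)."""
    anaglyph = np.empty_like(left_rgb)
    anaglyph[..., 0] = right_rgb[..., 0]    # red channel carries the right-eye view
    anaglyph[..., 1:] = left_rgb[..., 1:]   # green and blue (cyan) carry the left-eye view
    return anaglyph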
Auto-stereoscopic techniques do not require any user-mounted device. This
family could also include volumetric and holographic systems. Nevertheless, the
traditional use of this terminology is restricted to techniques reproducing 3D
images within the viewing field [37]. There are three subcategories of auto-stereoscopic systems: binocular, multi-view, and holoform. Binocular systems are the simplest approach, and generate 3D images in a fixed viewing zone.


Fig. 11.8 General architecture of a 3D-TV broadcast system

Multi-view systems have a discrete number of views within the viewing field,
generating different regions where the different perspectives of the video scene can
be appreciated. In this case, some motion parallax is provided but it is restricted by
the limited number of views available. Nevertheless, there are adjacent view
fusion strategies that try to smooth the transition from one viewing position to the next [38].
Holoform techniques try to provide a smooth motion parallax for a viewer moving
along the viewing field. Volumetric displays are based on generating the image
content inside a volume in space. These techniques are based on a visual representation of the scene in three dimensions [39]. One of the main difficulties associated with
volumetric systems is the required high resolution of the source material. Finally,
holographic techniques aim at representing exact replicas of the scenes that cannot be
differentiated from the original. These techniques try to capture the light field of the
scene, including all the associated physical light attributes, so that the spectator's eyes
receive the same light conditions as the original scene. These last techniques are still
in a very preliminary study phase [40].

11.4 Advantages of a DIBR-Based 3D-TV System


Previous sections have described the main characteristics of a 3D-TV system from
a generic perspective. Special emphasis has been placed on the specifics of
systems that send 2D video streams along with depth data that would allow the
receiver to build the 3D stereoscopic image (or multi-view in second generation
systems) using DIBR techniques. Fig. 11.8 shows the specific architecture of a
DIBR-based 3D-TV system.
The DIBR approach has some advantages with respect to others based on
sending any version of the stereoscopic left- and right-eye views. The first, already mentioned in previous sections, is backwards compatibility with 2D systems. Another general advantage is the independence from the display and capture
technologies. DIBR-based systems would also provide direct compatibility with
most 2D-to-3D video conversion algorithms. Finally, it should not be forgotten that
the compression efficiency of this approach makes it very attractive for current
transport and delivery systems.

11.4.1 Production
The advantages of depth-based systems for production stem from the pixel-by-pixel matching of the different layers of the picture. This structure facilitates 3D post-processing. An example can be found in object segmentation based on depth-keying. This technique allows an easy integration of synthetic 3D objects into real sequences, for example for real-time 3D special effects [41]. Another important aspect is the suppression of the photometrical asymmetries associated with left- and right-eye view stereoscopic distribution, which create visual discomfort and might degrade the stereoscopic sensation [42].
The most widespread methods for 2D-to-3D conversion are based on extracting
depth information: Depth from blur, Vanishing Point-based Depth Estimation and
Depth from Motion Parallax [43, 44]. This is a major advantage of DIBR systems
as the success of any 3D-TV broadcast system will depend strongly on the
availability of attractive 3D video material.
Despite these advantages, there are also challenges that need to be addressed for
commercial implementation. The first one is associated with the occluded areas of the original image. Suitable occlusion-concealment algorithms are required in the post-production stages. Also, in the simplest versions of DIBR-based systems, the effects associated with transparencies (atmospheric effects like fog or smoke, semi-transparent glass, shadows, refractions, reflections) cannot be handled adequately by a single depth layer, and additional transparency layers (or production
techniques processing multiple views with associated depth) are required. Finally,
it should be mentioned that creating depth maps with great accuracy is still a
challenge, particularly for real-time events.

11.4.2 Coding and Transport


The advantage from the coding and transport perspective is clear: a reduced
bandwidth requirement as compared to other 3D formats. There is also an intrinsic
advantage in the fact that the structure of the depth information is usually the same
as the original 2D image, with a pixel-per-pixel association between depth and
base reference 2D layers. In this way, the depth layers might be coded using the
same general-purpose coder as the 2D component (H.264/AVC). In some cases,
due to the local smoothness of most real-world object surfaces, the per-pixel
depth information can be compressed much more efficiently than the 2D component. Nevertheless, this is not a general rule. Depth information will also have very sharp transitions, which might be distorted by coding and decoding processes. Finally, the fact that some of the transport mechanisms (see Sect. 11.3.3)
have been designed specifically to convey various depth information layers makes
this option very attractive for broadcasters, with a limited impact on transport and
distribution networks.

11.4.3 Broadcast Network


The advantage of this system from the transmission perspective is the little impact that the additional image depth layers would have on the broadcast network. Once the bitrate requirements to convey the additional depth in the service multiplex have been met, the new services would be transparent to the network. This advantage is crucial for free-to-air operators with limited transmission bandwidth capacity. These broadcasters might need to continue using existing transmission channels and network infrastructure to reach the general 2D audience.
In this situation, a frame-based approach would not be suitable. In a DIBR 3D-TV
system, new set-top boxes (or integrated receiver/displays) could recognize the
additional information in order to decode a second view and provide two output
signals, left- and right-eye views, to the display.
The aspect that requires attention in the broadcast part of the distribution chain is Quality of Service. Transmission impairments could produce
errors in the depth information and may lead to severe depth image artifacts after
the DIBR rendering process.

11.4.4 Receiver and Displays


In a DIBR system, the left- and right-eye views are only generated at the 3D-TV
receiver. Their appearance in terms of parallax (and thus perceived depth
impression) can be adapted to the particular viewing conditions of the consumer.
Moreover, the approach allows the viewer to adjust the reproduction of depth to
suit his or her own personal preferences, in a similar fashion to today's controls for color saturation, brightness, and other image characteristics. In consequence, it would be possible to provide the viewer with a customized 3D experience regardless of the type of stereoscopic or auto-stereoscopic 3D-TV display.
Finally, considering that the degree of 3D depth perception is related to visual
fatigue, a control over the depth intensity is also a relevant advantage of DIBR
systems.
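As a rough illustration of how a DIBR receiver can regenerate a view and expose a depth control to the user, the Python/NumPy sketch below shifts each pixel horizontally in proportion to its depth value. The function name and the max_disparity and zpp parameters are our own illustrative choices, and the sketch deliberately ignores proper occlusion ordering and realistic hole filling, both of which real DIBR implementations must handle.

import numpy as np

def render_view(texture: np.ndarray, depth: np.ndarray,
                max_disparity: float = 12.0, zpp: float = 0.5) -> np.ndarray:
    """Warp a 2D frame (H x W x 3) with its depth map (H x W, values in [0, 1],
    1 = nearest) to synthesise one virtual view. Adjusting max_disparity plays
    the role of the user depth-intensity control; negating it gives the other eye."""
    h, w = depth.shape
    view = np.zeros_like(texture)
    filled = np.zeros((h, w), dtype=bool)
    disparity = np.round(max_disparity * (depth - zpp)).astype(int)
    for y in range(h):
        for x in range(w):                    # simple forward warp
            xt = x + disparity[y, x]
            if 0 <= xt < w:
                view[y, xt] = texture[y, x]
                filled[y, xt] = True
        for x in range(1, w):                 # crude hole filling: copy the left neighbour
            if not filled[y, x]:
                view[y, x] = view[y, x - 1]
    return view

# left_view  = render_view(frame, depth, max_disparity=12.0)
# right_view = render_view(frame, depth, max_disparity=-12.0)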


11.5 3D-TV Satellite Broadcasting: DVB-S/S2


The prevailing satellite broadcasting standard nowadays is the system developed by the DVB Project. Since the end of 1994, when the first digital satellite TV services started, most satellite DTV services have used the DVB-S standard, with more than 100 million receivers deployed around the world.
DVB-S is a broadcasting system for satellite digital multi-programme TV/High
Definition Television (HDTV) services to be used for primary and secondary
distribution in Fixed Satellite Service (FSS) and Broadcast Satellite Service (BSS)
bands. The system is intended to provide DTH services for consumer Integrated Receiver Decoders (IRD), as well as collective antenna systems (Satellite Master Antenna Television, SMATV) and cable TV head-end stations [45, 46].
The DVB-S2 system is based on the previous DVB-S, incorporating new
modulation schemes and channel coding techniques that improve the spectral efficiency and reduce the threshold carrier-to-noise ratio necessary for good reception.
DVB-S2 was quickly adopted in Europe, the Americas, Asia, the Middle East,
and Africa for the delivery of new services since the standard was published in
2005. In 2006 the ITU recommended DVB-S2 as a suitable option for a Digital
Satellite Broadcasting System with Flexible Configuration [47].
DVB-S2 is not foreseen to completely replace DVB-S in the mid-term, but it will make new transmission possibilities and services available. To allow legacy DVB-S receivers to continue to operate, optional backwards-compatible modes with hierarchical modulation are available, while providing additional capacity and services to newer receivers. For this reason, although both systems are briefly described here, DVB-S2 will be the system considered for 3D content delivery in this chapter.

11.5.1 System Overview


The DVB-S standard uses QPSK modulation along with channel coding and error correction techniques based on a convolutional code concatenated with a shortened Reed-Solomon code. MPEG-2 services, a transmission structure synchronous with the packet multiplex, and the flexibility of the multiplex allow the transmission capacity to be used for a variety of TV service configurations, including sound and data services [45, 46].
The DVB-S2 system [9, 48, 49] incorporates the use of MPEG-4 advanced video
coding (AVC) and additional modulation schemes. Four modulation modes are
available: QPSK and 8PSK are typically proposed for broadcast applications, since
they are virtually constant envelope modulations and can be used in nonlinear
satellite transponders driven near saturation, and 16APSK and 32APSK (requiring a
higher level of C/N) for professional applications such as news gathering and
interactive services. DVB-S2 can operate at carrier-to-noise ratios ranging from -2 dB (below the noise floor) with QPSK to 16 dB using 32APSK.

Fig. 11.9 3D Encapsulation options in DVB-S2

A Forward Error Correction scheme based on a concatenation of BCH (Bose-Chaudhuri-Hocquenghem) and LDPC (Low Density Parity Check) coding is used for better performance in the presence of high levels of noise and interference. The introduction of two FEC code block lengths (64,800 and 16,200 bits) was dictated by two opposite needs: the C/N performance is higher for long block lengths, but the end-to-end modem latency increases as well. For applications where end-to-end delay is not critical, such as TV broadcasting, the long frames are the best solution, given their improved C/N performance.
Additionally, when used for different hierarchies of services (like TV and HDTV) or for interactive point-to-point applications (like IP unicasting), the Variable Coding and Modulation (VCM) functionality allows different modulations and error protection levels that can be modified on a frame-by-frame basis. This may be combined with the use of a return channel to achieve closed-loop Adaptive Coding and Modulation (ACM), thus allowing the transmission parameters to be
optimized for each individual user, depending on the particular conditions of the
delivery path. Optional backwards-compatible modes have been defined in
DVB-S2, intended to send two Transport Streams on a single satellite channel: the
High Priority (HP) TS, compatible with DVB-S and DVB-S2 receivers, and the
Low Priority (LP) TS, compatible with DVB-S2 receivers only. HP and LP
Transport Streams are synchronously combined by using a hierarchical modulation
on a non-uniform 8PSK constellation. The LP DVB-S2 compliant signal is BCH
and LDPC encoded, with LDPC code rates 1/4, 1/3, 1/2, or 3/5.

11.5.2 Services and Multiplex Options


One of the main novelties of DVB-S2 is the possibility of transmitting different streams of video, voice, and data as independent streams with their own parameters, allowing a better allocation of the system capacity. DVB-S2 considers four
(GSE), Generic Continuous Stream (GCS), or Generic Fixed-length Packetized

Streams (GFPS). Based on these options, it is possible to accommodate any input
stream format, including continuous bit-streams, IP as well as ATM packets.
Additionally, the VCM functionality may be applied on multiple transport streams
to achieve a differentiated error protection for different services (TV, HDTV,
audio, multimedia). The use of the VCM functionality will depend on the application scenario and the need of broadcasting DVB-S backwards-compatible.
Figure 11.9 shows the most common encapsulation approaches that could be
directly applied in a 3D scenario. The simplest approach to encapsulate services
would be similar to the architecture currently used for DVB-T, DVB-S, and DVB-C. The video services (regardless of the format: stereoscopic, depth-based, SVC) would be embedded into an MPEG-TS structure, maintaining the 188-byte packet synchronization. The second approach would usually involve a first step in which MPEG frames would be encapsulated in IP packets for distribution from the production centers.
These IP packets would then be fed into the DVB-S2 structure following the
GSE option. This second approach might seem more complicated, but it simplifies
the overall network architecture considering the fact that current microwave LOS
links, satellite feeding channels, and fiber optic links will be based on IP transport
protocols.
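To give a feel for the first, MPEG-TS-based option, the Python sketch below splits an arbitrary elementary-stream buffer into 188-byte transport packets with the 0x47 sync byte and a chosen PID. It is a deliberately simplified illustration (no PES layer, no adaptation field or PCR) and does not describe any particular encoder; the PID value is hypothetical.

SYNC_BYTE = 0x47
TS_PACKET_SIZE = 188
TS_PAYLOAD_SIZE = 184   # 188 bytes minus the 4-byte header

def packetize(payload: bytes, pid: int) -> list:
    """Split an elementary-stream buffer into fixed-size MPEG-TS packets."""
    packets = []
    continuity = 0                              # 4-bit continuity counter
    for offset in range(0, len(payload), TS_PAYLOAD_SIZE):
        chunk = payload[offset:offset + TS_PAYLOAD_SIZE].ljust(TS_PAYLOAD_SIZE, b"\xff")
        pusi = 1 if offset == 0 else 0          # payload_unit_start_indicator
        header = bytes([
            SYNC_BYTE,
            (pusi << 6) | ((pid >> 8) & 0x1F),  # flags plus the 5 upper PID bits
            pid & 0xFF,                         # 8 lower PID bits
            0x10 | (continuity & 0x0F),         # payload only, not scrambled
        ])
        packets.append(header + chunk)
        continuity = (continuity + 1) % 16
    return packets

# Example: a hypothetical depth elementary stream mapped onto PID 0x101
ts_packets = packetize(b"\x00" * 1000, pid=0x101)
print(len(ts_packets), len(ts_packets[0]))      # -> 6 packets of 188 bytes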

11.5.3 Network Requirements


As a first approach, the network architecture for 3D-TV delivery under the DVB-S2 standard would be similar to that currently used for DVB-S 2D services. Evolution is nevertheless required both in the terrestrial head-end and in the user equipment needed for the reception of the new services.
The wide bandwidth available in satellite broadcasting transmissions provides significant flexibility for the deployment of different 3D transmission scenarios,
maintaining 2D backwards compatibility by simulcasting. A 1,080p HD quality
could be part of all these application scenarios, as this is the format most likely to
prevail in the future.
A significant proportion of satellite broadcasting operators have a pay-TV business model. A satellite TV operator with a multichannel offer might want to prioritize the exploitation of the existing infrastructure in order to deliver 3D-TV content to a group of subscribers and provide a simulcast programme for non-3D-TV displays. The simplest way to introduce 3D content is then to deliver left-view and right-view images using either a simulcast approach or a frame-compatible format, depending on the available bitrate and the 3D resolution target. A possible configuration to provide simulcast in DVB-S2 would be the use of two Transport Streams combined by means of hierarchical modulation. An HP TS, compatible with DVB-S and DVB-S2 receivers, would deliver HDTV content, while the LP TS, only compatible with DVB-S2 receivers, would transmit the 3D-TV content. In this case, an 8PSK constellation would be used, and this would be possible only for C/N ratios above 7 dB.

Table 11.7 Example of configuration for DVB-S2 HDTV broadcasting (Source: [9])

                        Case 1              Case 2              Case 3
Modulation and coding   QPSK 3/4            8PSK 2/3            8PSK 3/4 and QPSK 2/3
Symbol rate (Mbaud)     30.9 (α = 0.2)      29.7 (α = 0.25)     27.5
Useful bitrate (Mbps)   46                  58.8                40 and 12
Number of programmes    10 SDTV or 2 HDTV   13 SDTV or 3 HDTV   2 HDTV and 2–3 SDTV

Table 11.8 Example of configuration for DVB-S2 3D-TV broadcasting

                        Case 4            Case 5                    Case 6
Modulation and coding   QPSK 3/4          8PSK 2/3                  DVB-S (HP) 8PSK 7/8 and DVB-S2 (LP) 8PSK 3/5
Symbol rate (Mbaud)     30.9 (α = 0.2)    29.7 (α = 0.25)           27.5
Useful bitrate (Mbps)   46                58.8                      44 and 16
Number of programmes    2 2D + Depth      2 2D + Depth and 1 HDTV   3 HDTV and 1 2D + Depth
In cases where backwards compatibility is sought, DIBR-based formats (2D + Depth, 2D + DOT) could be adequate. Compatibility with legacy 2D receivers and displays could be possible provided a general upgrade of the software hosted in the user decoders. In the long term, advanced 3D content, such as HD + MVC, can be delivered through dedicated channels.
Table 11.7 shows an example of configurations for DVB-S2 TV broadcasting
services via 36 MHz satellite transponders in Europe. The video coding bitrates in
this configuration are 4.4 Mbps using traditional MPEG-2 coding. The DVB Project is currently defining the use of AVC systems for future applications. The video coding rates are approximately half of those required with MPEG-2 and, consequently, the number of programmes in a satellite channel doubles. Cases 1 and 2 of the table show typical configurations for MPEG-2 TV broadcasting of programmes of the same type (SDTV or HDTV), with the same configuration for all of them. Case 3 is an example of broadcasting over multiple Transport Streams, providing differentiated error protection per multiplex (VCM mode). A typical application is the broadcasting of a highly protected multiplex for MPEG-2 SDTV and of a less protected multiplex for MPEG-2 HDTV. Assuming a transmission of 27.5 Mbaud and the use of 8PSK 3/4 and QPSK 2/3, a throughput of 40 Mbps would be available for two HDTV programmes and 12 Mbps for two to three SDTV programmes. The difference in C/N requirements would be around 5 dB.
Table 11.8 shows the same transmission conditions considering possible scenarios of 3D-TV content delivery with 2D receiver and display compatibility. Cases 4 and 5 are estimations assuming 2D + Depth or 2D + Difference. Both rely on MPEG-4 coding and the other assumptions described in Sect. 11.3.4. The last scenario (Case 6) is based on the use of hierarchical modulation, assuming a C/N link budget of 10.8 dB.
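The useful bitrates quoted in Tables 11.7 and 11.8 can be approximated from the symbol rate, the bits per symbol of the constellation and the FEC code rate. The small Python sketch below does this with an assumed allowance of roughly 1 % for BCH parity and physical-layer framing; the exact overhead depends on the DVB-S2 framing options, so the result is only an approximation.

BITS_PER_SYMBOL = {"QPSK": 2, "8PSK": 3, "16APSK": 4, "32APSK": 5}

def approx_useful_bitrate(symbol_rate_mbaud: float, modulation: str,
                          code_rate: float, framing_overhead: float = 0.01) -> float:
    """Rough DVB-S2 useful bitrate in Mbps: symbol rate x bits/symbol x LDPC rate,
    reduced by an assumed fraction for BCH parity, headers and pilots."""
    gross = symbol_rate_mbaud * BITS_PER_SYMBOL[modulation] * code_rate
    return gross * (1.0 - framing_overhead)

# Roughly reproduces Cases 1 and 2 of Table 11.7
print(round(approx_useful_bitrate(30.9, "QPSK", 3 / 4), 1))   # -> about 46 Mbps
print(round(approx_useful_bitrate(29.7, "8PSK", 2 / 3), 1))   # -> about 58.8 Mbps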


11.6 3D-TV Terrestrial Broadcasting


This section describes in detail the digital terrestrial TV standards in the different ITU Regions. Specifically, the DVB standards DVB-T/T2, the ATSC family, the ISDB-T-related standards and the Chinese DTMB standard are presented. For each system, a summary of the major technical aspects is provided, together with a description of the coding and multiplexing techniques used by each specification, these aspects being key for delivering 3D-TV services over a terrestrial network. A discussion of the specific conditions and factors associated with each standard for delivering 3D content is also included.

11.6.1 DVB-T2
The predecessor of DVB-T2 is DVB-T. DVB-T was the first European standard for digital terrestrial broadcasting, and was approved by the European Telecommunications Standards Institute in 1997 [50, 51]. DVB-T is based on OFDM modulation and was designed to be broadcast using 6, 7, and 8 MHz wide channels in the VHF and UHF bands. The standard has two options for the number of carriers, namely 2K and 8K, and different carrier modulation schemes (QPSK, 16QAM and 64QAM) that can be selected by the broadcaster. The channel coding is based on FEC techniques, with a combination of convolutional coding (1/2, 2/3, 3/4, 5/6, and 7/8 rates) and Reed-Solomon (204, 188) coding. The spectrum includes pilot carriers that enable simple channel estimation in both time and frequency. Amongst all the usual configuration options, the most used ones targeting fixed receivers provide a throughput between 19 and 25 Mbps. DVB-T is based on the MPEG-TS structure, and the synchronization and signaling are strongly dependent on the multiplex timing. The video coder associated with DVB-T is MPEG-2.
DVB-T2 has been designed with the objective of increasing the throughput of DVB-T by 30 %, a figure considered to be the requirement for HDTV service rollout in countries where DVB-T does not provide enough capacity.

11.6.1.1 System Overview


DVB-T2 [7, 52] uses OFDM (Orthogonal Frequency Division Multiplexing) modulation. The system includes larger FFT modes (16K, 32K) and high-order 256-QAM constellations, which increase the number of bits transmitted per symbol and, in consequence, the throughput of the system. DVB-T2 uses LDPC in combination
with BCH codes. Figure 11.10 shows a block diagram of the DVB-T2 signal
generation process.
The specification includes scattered pilot patterns where the number of patterns
available has been increased with respect to DVB-T, providing higher flexibility
Fig. 11.10 DVB-T2 modulation block diagram (stream generation, input processing module, BCH/LDPC encoding, QAM modulation and cell/time interleaving per PLP, frame builder, cell multiplexer, IFFT and guard interval insertion)

and maximizing the data payload depending on the FFT size and Guard Interval
values. The DVB-T2 physical layer data is divided into logical entities called
physical layer pipes (PLP). Each PLP will convey one logical data stream with
specific coding and modulation characteristics. The PLP architecture is designed to
be flexible so that arbitrary adjustments to robustness and capacity can be easily
done [7].

11.6.1.2 Services and Multiplex Options


The multiplex options in DVB-T2 are in fact shared with the already described DVB-S2: MPEG-TS containers, GSE, GCS, or GFPS. With these choices in mind, the same approaches as for DVB-S2 can be taken to feed 3D-TV streams into the DVB-T2 structure. In the simplest approach, the video services (regardless of the format: stereoscopic, depth-based, SVC) would be embedded into an MPEG-TS structure, maintaining the 188-byte packet synchronization. The second option would rely on IP and would require a two-step encapsulation.

11.6.1.3 Network Requirements


Considering the flexibility of the DVB-T2 standard, there is no restriction in terms of which of the formats could be the most adequate, either for HD or for 3D. In the case of HD, advances in the field of video coding suggest that the format to prevail in the future is 1,080p.

Table 11.9 Service configuration and bitrate requirements in a 2D HDTV compatible 3D-TV deployment scenario (COFDM 32K, 1/128 guard interval). The table gives, for several service mixes, the number of services of each type (HD 720p, HD 1080i, HD 1080p, 3D in 2D, 2D + Delta), the required bitrate and the DVB-T2 mode (modulation order and code rate); the listed mixes require between 25.6 and 46 Mbps, using 64-QAM 5/6 and 256-QAM 3/4, 4/5 and 5/6 modes. Bitrates are calculated according to the assumptions in Sect. 11.3.4.

In the first stage of a 3D service introduction scenario, stereoscopic and 2D + Depth (2D + DOT) formats will probably be the best choice. This scenario does not distinguish service robustness between HD and 3D, so using a single PLP to deliver HD and 3D services, or providing completely separate PLPs for each type of content, would not make any significant difference from the planning perspective, except perhaps for signaling management and the associated behavior of existing DVB-T2 receivers prior to 3D deployments. Possible configuration choices are shown in Table 11.9 [53].
A long-term scenario could also include dedicated 3D DVB-T2 channels. In this case the HD compatibility restrictions would be avoided, and all the transport capacity of the RF signal would be dedicated to 3D content. This approach assumes that there would be enough broadcasting spectrum available for resources dedicated to 3D programming. In this case, a certain amount of the system capacity could be reserved for 3D portable services. The content for portable users would be multiplexed in a different PLP with increased robustness in the modulation and coding schemes. This scenario could theoretically be implemented with any of the three formats to deliver 3D. Nevertheless, if compatibility with 2D receivers is not sought, the simple 2D + Depth format is less probable, and richer formats involving several views and different depth layers would be more realistic.
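The PLP concept described above lends itself to a simple configuration sketch. In the Python snippet below the names, modes and capacities are illustrative assumptions only: one high-capacity PLP carries the fixed HD/3D services, while a second, more robust PLP carries the portable 3D content.

from dataclasses import dataclass
from typing import List

@dataclass
class Plp:
    """One DVB-T2 physical layer pipe with its own robustness settings."""
    name: str
    constellation: str
    code_rate: str
    capacity_mbps: float
    services: List[str]

multiplex = [
    Plp("fixed", "256-QAM", "3/4", 40.0, ["HD 1080p", "3D 2D + Depth"]),
    Plp("portable", "QPSK", "1/2", 7.0, ["portable 3D service"]),
]

for plp in multiplex:
    print(f"{plp.name}: {plp.constellation} {plp.code_rate}, "
          f"{plp.capacity_mbps} Mbps -> {', '.join(plp.services)}")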

11.6.2 ATSC
ATSC is a set of standards developed by the Advanced Television Systems Committee for digital TV transmission over terrestrial, cable, and satellite networks. The ATSC DTV standard [54, 55], established in 1995, was the world's first standard for DTV, and it was included as System A in ITU-R Recommendations BT.1300 [56] and BT.1306 [57]. The ATSC standard for terrestrial broadcasting was adopted by the Federal Communications Commission (FCC) in the USA in December 1996, and the first commercial DTV/HDTV transmissions were launched in November 1998. In November 1997 this standard was also adopted in Canada and Korea, and some time later in other countries.

Table 11.10 ATSC technical specifications

Parameter                      Value
Occupied bandwidth             5.38 MHz
Pilot frequency                310 kHz
Modulation                     8-VSB (8-level vestigial side-band)
Transmission payload bitrate   19.39 Mbps
Symbol rate                    10.762 Mbaud
Channel coding (FEC)           Trellis coded (inner) / Reed-Solomon (outer)

11.6.2.1 System Overview


The ATSC digital TV standard was designed to transmit high-quality video and audio,
and ancillary data, within a single 6 MHz terrestrial TV broadcast channel. ATSC
employs the MPEG-2 video stream syntax (Main Profile at High Level) for the coding
of video [58] and the ATSC standard Digital Audio Compression (AC-3) for the
coding of audio [59]. The standard defines several video formats for HDTV and SDTV.
In July 2008, ATSC was updated to support the H.264/AVC video codec [60].
The modulation in ATSC is 8-VSB. This scheme provides a good compromise
between spectral efficiency and a low receiver carrier-to-noise (C/N) threshold
requirement, high immunity to both co-channel and adjacent channel interference and
a high robustness to transmission errors [61]. Being a single-carrier modulation, the
system relies on complex equalization modules to overcome multipath and dynamic
channels. The baseband data segment uses Trellis coded 8-VSB signals, based on eight
discrete equi-probable data levels [62]. In order to protect against both burst and
random errors, the packet data are interleaved before transmission and Reed-Solomon
forward error correcting codes are added. The error correction coding and the larger
interleaver with respect to other DTV standards provide good behavior against noise
and interference. The spectrum efficiency allows a transport stream of up to
19.39 Mbps over 6 MHz channels (see Table 11.10). The VSB spectrum is flat
throughout most of the channel due to the noise-like attributes of the randomized data, and a low-level constant RF pilot carrier is added to the noise-like data signal at the lower
band edge. Two optional modes use higher order data coding, called Enhanced 8-VSB,
which allow the broadcaster to allocate a portion of the base 19.39 Mbps data rate to
Enhanced data transmission. Enhanced data is designed to have higher immunity to
certain channel impairments than the Main Service but delivers data at a reduced
information rate selected by the broadcaster from the specified options.
Finally, it is worth mentioning that an upgrade of ATSC, ATSC 2.0, is under development for interactive and video-on-demand services. It will be based on the H.264/AVC compression standard and will coexist with the current ATSC standard, but will require new receivers. It is expected to be adopted by 2012. The standard is being specified for full HD quality (left- and right-eye views in HD) and full-resolution 3D-TV services.


11.6.2.2 Services and Multiplex Options


The ATSC transport layer is based on the MPEG-TS format, as defined by the
MPEG-2 Systems standard [63]. Video, audio, and data bitstreams are divided into
packets of information and multiplexed into a single transport bitstream. The
ATSC system employs the MPEG TS syntax for the packetization and multiplexing of video, audio, and data signals, which defines system coding at two
hierarchical layers: the packetized elementary stream (PES) layer and the system
layer, in TS format [64].
Timing information in the form of timestamps enables the real-time reproduction and precise synchronization of video, audio, and data. The current 2D
service typically allocates up to 17.5 Mbps for video streaming, and the rest of the 19.39 Mbps for audio streams and other purposes. The system can currently
accommodate one HDTV programme (MPEG-2 coding) or a mixture of HD and
SD programmes. Although the ATSC standard defines several formats for HDTV image resolution, a common configuration to be considered for the 3D service is 1,080i (1,920 × 1,080 at 60 Hz, interlaced) [65, 66].

11.6.2.3 Network Requirements


The 3D-TV service introduction studies for ATSC have been based on compatibility with existing 2D services. The frame-compatible format has been proposed for the transition period, in coexistence with 2D services, due to its simplicity and easy deployment. Left- and right-eye views are decimated by a factor of 2 and arranged into one common frame-compatible format, such as side-by-side or top-and-bottom. The resulting video is then encoded with the Main Profile of MPEG-2 and transmitted as an auxiliary stream alongside the MPEG-2 bitstream for the 2D programme. The use of MPEG-2 allows 2D/3D compatibility. The channel bandwidth and transmission bitrate need to be kept within the existing 6 MHz and 19.39 Mbps, respectively, and the HD image quality should be maintained in both 2D and 3D viewing (1,080i, that is, 1,920 × 1,080 interlaced). If existing 2D viewers are to keep their consumer equipment, the use of the MPEG-2 video codec, at least in the 2D service, is also mandatory. Therefore, if MPEG-2 HD quality is sought for both 2D and 3D viewing, it will be quite challenging to multiplex two video elementary streams into one portion of 17.5 Mbps while maintaining HD resolution and satisfactory image quality [67].
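The frame-compatible packing just described can be illustrated with a few lines of Python/NumPy: each view is horizontally decimated by a factor of 2 and the two halves are placed side by side so that the container keeps the original HD raster. Real encoders low-pass filter before subsampling; the naive decimation and upsampling below are only a sketch.

import numpy as np

def pack_side_by_side(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Pack a stereo pair (H x W x 3 each) into one side-by-side frame of size H x W."""
    half_left = left[:, ::2]     # keep every second column of the left view
    half_right = right[:, ::2]   # keep every second column of the right view
    return np.concatenate((half_left, half_right), axis=1)

def unpack_side_by_side(frame: np.ndarray):
    """Receiver side: split the packed frame and upsample each half back to full width."""
    h, w = frame.shape[:2]
    half_left, half_right = frame[:, : w // 2], frame[:, w // 2:]
    return (np.repeat(half_left, 2, axis=1),   # crude horizontal upsampling
            np.repeat(half_right, 2, axis=1))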
One of the solutions to overcome the above-mentioned constraints is to use
H.264/AVC as an advanced codec together with the MPEG-2 video codec as a
hybrid format. The MPEG-2 video codec can be used to encode the left-view
image as the primary video data stream, and the H.264/AVC video codec to
encode the right-view image as an additional video stream. This scheme ensures
the backward compatibility with current ATSC services.
Two elementary streams generated by the 3D video codec are then multiplexed
to create a single MPEG-2 transport stream, which is fully compliant with the
MPEG-2 standard. The total bitrate of the two video streams is maintained within the overall video bitrate of an ATSC channel (about 17.5 Mbps). In this way, both current DTV and new 3D-TV services are accessible, and 2D and 3D programmes can be displayed normally, according to the capabilities of each receiver, solving the loss of resolution of the frame-compatible method. A 3D-TV display can show both left- and right-eye views in HD resolution (1,080i for each view), and 2D displays can show the left view with a small sacrifice in image quality. A typical bitrate assignment could be 12 Mbps (MPEG-2) for the left (primary) stream and an additional 6 Mbps
(H.264/AVC) stream for the right channel. It is important to design a signaling
mechanism in accordance with the multiplexing scheme used for each video
elementary stream. Other issues such as synchronization between MPEG-2
encoded left image and H.264/AVC encoded right image are under development.
This format has been adopted in Korea for providing 3D-TV services [66].
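The dual-stream budget described above can be summarized in a small Python sketch. The stream descriptions and the roughly 1.4 Mbps left over for audio, data and signaling are illustrative figures derived from the numbers quoted in this section, not part of any ATSC specification.

ATSC_PAYLOAD_MBPS = 19.39

# Hybrid 2D/3D service: MPEG-2 left view (2D compatible) plus H.264/AVC right view
video_streams = [
    {"view": "left (primary, 2D compatible)", "codec": "MPEG-2", "bitrate_mbps": 12.0},
    {"view": "right (additional)", "codec": "H.264/AVC", "bitrate_mbps": 6.0},
]

video_total = sum(s["bitrate_mbps"] for s in video_streams)
remaining = ATSC_PAYLOAD_MBPS - video_total
print(f"video: {video_total} Mbps, left for audio/data/signaling: {remaining:.2f} Mbps")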
Depth map information may complement a single view to form a 3D programme, or it may be supplemental to the stereo pair. The depth maps may be
encoded with either MPEG-2 or advanced video codecs (AVC/MVC). In this
option, the encoding format can be integrated into the broadcast multiplex [67].

11.6.3 ISDB-T
The ISDB system was designed to provide integrated digital information services
consisting of reliable high-quality audio/video/data via satellite, terrestrial, or
cable TV network transmission channels. Currently, the ISDB-T digital terrestrial
TV broadcasting system provides HDTV-based multimedia services, including services for fixed receivers and moving vehicles, and TV services to cellular phones [68–72]. The ISDB-T system was developed for Japan. Afterwards, other countries adopted the system, such as Brazil, which in 2006 adopted a modified version of the standard that uses H.264/AVC coding for video compression (High Profile for SDTV and HDTV, and Baseline Profile for One-Seg). Other South American countries have followed Brazil's decision, such as Peru, Argentina, and Chile in 2009. The ISDB-T transmission system was included as System C in ITU-R
Recommendation BT.1306 [57].

11.6.3.1 System Overview


The ISDB-T standard is based on MPEG-2 for audio/video coding and multiplexing [73] but it also includes H.264/AVC video coding for handheld services.
The system uses coded orthogonal frequency division multiplexing (COFDM)
modulation, frequency-domain and time-domain interleaving, and concatenated
error-correcting codes [74], as outlined in Table 11.11.
The BST-OFDM (Band Segmented Transmission-OFDM) modulation scheme was designed for hierarchical transmission and partial spectrum reception.
Table 11.11 ISDB-T technical specifications

Parameter                Value
Channel bandwidth        6/8 MHz
Modulation               Band Segmented Transmission (BST)-OFDM
Modes                    2K, 4K and 8K FFT (2K and 4K for mobile service)
Sub-carrier modulation   DPSK, QPSK, 16QAM and 64QAM
Guard interval           1/4, 1/8, 1/16 and 1/32 of symbol duration
Channel coding (FEC)     Convolutional (inner) / Reed-Solomon (outer)
Source coding            MPEG-2 / H.264/AVC

The BST-OFDM modulation is based on OFDM segments, dividing a 6 MHz bandwidth into 14 segments, of which 13 are used for signal transmission and the remaining segment serves as a guard band between channels (each OFDM segment has a bandwidth of 6/14 MHz). The spectrum within each segment is allocated to
both data and reference signal carriers. The transmission parameters may be
individually set for each segment for flexible channel composition according to
the requirements of each service and the interference conditions. The system has three
transmission modes (Modes 1, 2, and 3) to enable the use of a wide range of
transmitting frequencies, and it has four choices of guard-interval length to enable
better design of a single-frequency network (SFN). The system supports hierarchical transmissions (Layers A, B, and C). Each layer can be configured according
to different target receivers. As an example, for handheld reception, the consumer
device accesses the programme transmitted on the center segment of a full-band
ISDB-T signal.
The Brazilian version of the standard (SBTVD-T or ISDB-TB) has a key difference for the purpose of this chapter. The ISDB-TB standard admits H.264/AVC
coding for video compression (High Profile for SDTV and HDTV, and Baseline
Profile for One-Seg). Consequently, the channel capacity is considerably higher.

11.6.3.2 Services and Multiplex Options


ISDB-T uses a modification of MPEG-2 for encapsulating the data stream (MPEG-2 TS) to enable hierarchical transmission and partial reception [70, 73]. The transport streams corresponding to each service are re-multiplexed into a single TS. Then, the single TS is separated into specific hierarchical layers, each of which goes through its own channel coding and modulation process (Table 11.12).
The number of theoretical multiplex configurations is rather high, considering
all the modes available, the hierarchical options, and other system parameters.
Nevertheless, the most common services are [74]:
1 HDTV programme (12 segments) and 1 high quality audio programme (1
segment)

Table 11.12 ISDB-T system capacity

Channel bandwidth (MHz)   Throughput range (Mbps)
6                         3.6–23.3
8                         4.9–31

1 SDTV programme for fixed reception (5 segments), 1 SDTV programme for mobile reception (7 segments), and 1 high quality audio programme (1 segment)
In Japan, HDTV programmes are broadcast to fixed receivers using 12 OFDM segments and transmission parameters of 64QAM modulation, 3/4 inner-coding rate, and 0.2 s time interleaving. Trials for mobile reception have been carried out in Japan using 64QAM modulation, a convolutional coding rate of 3/4 (bitrate of 18 Mbps), and diversity reception [75].

11.6.3.3 Network Requirements


Japan has traditionally been one of the most active countries in 3D research, and its national broadcaster NHK has been leading many developments in this area. In fact, the 3D roadmap in Japan has been planned in advance with a 15- or 20-year perspective, focusing on new scenarios that are not compatible with ISDB-T, with advanced high-resolution display systems that do not require the user to wear special glasses.
According to the options provided by the ISDB-T multiplex structure, a compatible option including 3D and 2D services is presented in the following paragraphs. The current configuration of an MPEG-2 HDTV programme and its associated transmission parameters in 6 MHz, for a mixture of fixed and mobile reception services in ISDB-T, provides 18 Mbps.
If a frame-compatible stereo service were delivered in this scenario, the decimation of the right and left images by a factor of 2 would reduce the necessary bitrate to 9 Mbps (6 OFDM segments), with a noteworthy impact on the image quality, which would be significantly lower than HDTV.
Another alternative would consist of offering HDTV service quality only for fixed reception, with H.264/AVC coding for the other stereoscopic component. The bitrate would then be divided between an MPEG-2-encoded left-view image as the primary video data stream (8 OFDM segments) and an H.264/AVC-encoded right-view image as an additional video stream (4 OFDM segments). In this way, the two elementary streams generated by the 3D video codec would be multiplexed to create a single MPEG-2 transport stream, fully compliant with the MPEG-2 standard. This dual-stream method is a solution to overcome the drawback of the loss of resolution in the frame-compatible method, but it sacrifices robustness, limiting the 3D-TV service to fixed reception.


As shown, the MPEG-2 coder imposes a limitation on image quality in a backwards-compatible scenario. In the ISDB-T version where the H.264/AVC video coder is included (the Brazilian standard), the delivery of two HDTV programmes is feasible in a 6 MHz channel (6 OFDM segments each). This option would allow the transmission of the right and left images in HDTV for 3D composition.
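The segment arithmetic behind the dual-stream ISDB-T scenario above is straightforward. The Python sketch below assumes the 18 Mbps / 12-segment HDTV configuration quoted earlier and simply scales the bitrate with the number of segments assigned to each stream.

FULL_HD_SEGMENTS = 12       # OFDM segments used by the fixed-reception HDTV service
FULL_HD_BITRATE = 18.0      # Mbps carried by those 12 segments (64QAM, rate 3/4)

def segment_bitrate(num_segments: int) -> float:
    """Approximate bitrate carried by a group of ISDB-T segments in this configuration."""
    return FULL_HD_BITRATE / FULL_HD_SEGMENTS * num_segments

# Dual-stream split: 8 segments for the MPEG-2 left view, 4 for the AVC right view
print(segment_bitrate(8), segment_bitrate(4))   # -> 12.0 and 6.0 Mbps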

11.6.4 DTMB
After carrying out test trials to evaluate different DTV standards (the ATSC, DVB-T and ISDB-T systems, in 2000), China developed its own system for fixed, mobile, and high-speed mobile reception: Digital Terrestrial Multimedia Broadcasting (DTMB). The standard was ratified in 2006 and adopted as the GB20600 standard [76].

11.6.4.1 System Overview


DTMB offers both single-carrier and multi-carrier options and supports various multi-programme SDTV and HDTV services. In the multi-carrier working mode, time-domain synchronous OFDM (TDS-OFDM) technology is used. In a TDS-OFDM system, Pseudo-Noise (PN) sequences are inserted as the guard interval and also serve as time-domain pilots. The use of PN sequences as a prefix reduces the transmission overhead and provides higher spectrum efficiency [77]. In this standard, the signal is transmitted in frames, and a mixture of time and frequency interleavers is defined. In both the single-carrier and multi-carrier options, the input data is coded with low-density parity check (LDPC) rates of 0.4, 0.6, or 0.8. The constellation mapping schemes for each mode are 64-QAM, 32-QAM, 16-QAM, 4-QAM, and 4-QAM-NR [78, 79]. The source coding is not specified, and both MPEG-2 and H.264/AVC are possible. The current configuration of the system for SD is MPEG-2 MP@ML. A probable configuration for future HDTV services would be H.264/AVC Main Profile at Level 3.0 [80]. The specification defines SDTV as 576i at 25 fps and HDTV as 720p at 50 Hz or 1,080i at 25 Hz [81].

11.6.4.2 Services and Multiplex Options


The DTMB standard specifies neither the multiplexing scheme nor the coding algorithms, but currently all products include interfaces based on the MPEG-2 TS. The data rates in this standard range from 4.8 Mbps to 32.5 Mbps depending on the FEC coding rate, the modulation scheme, and the PN sequence used. A usual configuration can deliver 20–26 Mbps of useful information within each 8 MHz RF channel. The combination of 64-QAM modulation with an FEC code rate of 0.6, providing a throughput beyond 24.3 Mbps, is the primary working mode to support HDTV services.

In contrast, the combination of 4-QAM with an FEC code rate of 0.4, providing a throughput beyond 5.4 Mbps, is a good option to support mobile reception. Consequently, the high and ultra-high data rate modes are used for fixed reception, transmitting 10–12 SDTV programmes or 1–2 HDTV programmes in one 8 MHz channel, while the low and middle data rate modes are used for mobile reception, transmitting 2–5 SDTV programmes in one 8 MHz channel.

11.6.4.3 Network Requirements


The rollout and deployment of standard definition digital TV services and, as a consequence, the development of more advanced ones, such as HDTV and future 3D-TV, has not been accomplished as fast as in the DVB and ATSC markets, for different reasons. To date, 3D-TV service plans have not been published. Nevertheless, there is no doubt that, as HDTV services become a reality in major cities in China, 3D-TV will also become a reality. The flexibility of the multiplexing stages of the DTMB standard will be a major advantage for introducing these services without any relevant change to the standard.
Considering that the DTMB capacity is around 20–26 Mbps for typical configurations, the rollout scenarios can be very similar to those of other standards that also use H.264/AVC as the baseline video coding. Under the assumption that compatibility with current 2D services is sought, the base video layer (e.g., the left view) could be coded with MPEG-2, while the other view could be compressed using other techniques, as described in previous sections.
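One way of reading the DTMB working modes in a 3D rollout context is as a lookup from a required service bitrate to the most robust mode that still provides it. The Python sketch below does exactly that, using only throughput figures quoted in this section; the pairing of the 32.5 Mbps maximum with 64-QAM and code rate 0.8 is our assumption.

# (modulation, LDPC code rate, approximate throughput in Mbps); figures quoted above,
# except that the modulation/code-rate pairing for the 32.5 Mbps maximum is assumed.
DTMB_MODES = [
    ("4-QAM", 0.4, 5.4),
    ("64-QAM", 0.6, 24.3),
    ("64-QAM", 0.8, 32.5),
]

def most_robust_mode(required_mbps: float):
    """Return the most robust listed mode whose throughput covers the target bitrate."""
    for modulation, rate, throughput in DTMB_MODES:
        if throughput >= required_mbps:
            return modulation, rate, throughput
    raise ValueError("no listed DTMB mode offers the required throughput")

# e.g. an MPEG-2 HD base view (~12 Mbps) plus an H.264/AVC second view (~6 Mbps)
print(most_robust_mode(18.0))   # -> ('64-QAM', 0.6, 24.3)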

11.6.5 3D-TV to Mobile Receivers


Research activity on 3D-TV for mobile receivers has been concentrated in two regions. In Europe, a few research consortia have investigated the possibilities and requirements of the DVB family of standards. In Asia, South Korea has led the research activity on 3D services to portable receivers using the T-DMB standard.
The trials in Europe have been carried out around the DVB-H (Digital Video Broadcasting-Handheld) standard [82–84]. This system, which will be superseded by the new DVB-NGH (Digital Video Broadcasting-New Generation Handheld) in late 2011, has been the DVB approach to portable and mobile TV services since 2004. It is worth mentioning that DVB-H has not had a successful development worldwide.
The research with DVB-H has focused on different formats and coders. A first option in the experiments carried out has been based on sending the stereoscopic components in different Elementary Streams, using separate DVB-H bursts. This option enables backwards compatibility with 2D receivers. The video coders tested have been based on H.264/AVC. Specifically, a new technique called Advanced Mixed Resolution Stereo Coding (AMRSC) was tested on stereo and video-plus-depth formats. AMRSC is based on optimized downsampling, inter-view prediction,
and view enhancement. One of the major outcomes of the trials was the quantification of the required additional bitrate for video-plus-depth formats: a 600 kbps 2D video service would require an increase in the range of 10–30 % (depending on the specific content) to provide acceptable 3D image quality [85, 86].
The mobile 3D activities in Korea have been focused on the T-DMB standard [87, 88]. The Telecommunications Technology Association (TTA) has started the standardization of a 3D video service for T-DMB and S-DMB, and the major remaining issue is how to determine the proper stereoscopic video codec. The options currently being studied are MVC (Multi-view Video Coding) [89], independent H.264/AVC coding, and HEVC (High Efficiency Video Coding) [90]. In the trials carried out so far, different formats have been analyzed (frame-compatible 3D, dual channel, video plus depth and partial 3D) and different MPEG-4-based coders have been tested [91–95]. Nevertheless, the deployment scenario assumes that the services will be 2D compatible and one of the views will be regarded as the base layer and coded using MPEG-2.

11.7 3D-TV over Cable Networks: DVB-C/C2


The DVB-C [96] specification was developed in 1994 by the DVB Consortium for
broadcast delivery over Hybrid Fiber Coax (HFC) cable networks and Master
Antenna Television (MATV) installations. At the moment, this standard is
deployed worldwide in cable systems ranging from the larger CATV networks
down to smaller SMATV systems.
The demand for advanced services and the development of second-generation standards for satellite and terrestrial broadcasting led to the publication of the DVB-C2 specification in April 2010 [8, 97]. The specification is based on an increase in capacity (at least 30 %), support for different input protocols, and improved error performance. The new standard was not required to be backwards compatible with DVB-C, although DVB-C2 receivers will also be able to handle
DVB-C services. Hence, the DVB-C2 system will initially be used for the delivery
of new services, such as HDTV and video-on-demand on a commercial scale; in
the longer term the migration of current DVB-C services to DVB-C2 is foreseen.

11.7.1 System Overview


The DVB-C standard is based on the MPEG-2 System Layer and single carrier
QAM modulation. It allows the transport of a single input MPEG transport stream
on 16, 32, 64, 128, or 256-QAM constellations, thus achieving a maximum
payload of about 50 Mbps per cable channel. Reed-Solomon (204, 188) channel coding
along with convolutional interleaving is applied to improve BER values in the
receiver, ensuring Quasi Error Free (QEF) operation with approximately one
uncorrected error event per transmission hour.
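The 50 Mbps figure can be approximately reproduced from the parameters above. The sketch below assumes a symbol rate of 6.952 Mbaud for an 8 MHz European cable channel, which is an illustrative value not stated in this section; 256-QAM then carries 8 bits per symbol and the RS(204, 188) code leaves 188/204 of the gross bitrate as payload:

# Approximate DVB-C payload for an 8 MHz cable channel.
# The 6.952 Mbaud symbol rate is an illustrative assumption for a European
# 8 MHz channel; the RS(204, 188) overhead comes from the text above.

import math

symbol_rate = 6.952e6              # symbols per second (assumed)
bits_per_symbol = math.log2(256)   # 256-QAM carries 8 bits per symbol
rs_efficiency = 188 / 204          # useful bytes per RS-coded byte

gross = symbol_rate * bits_per_symbol    # about 55.6 Mbps on the channel
payload = gross * rs_efficiency          # about 51.3 Mbps of MPEG-2 TS
print(f"Gross bitrate: {gross/1e6:.1f} Mbps, payload: {payload/1e6:.1f} Mbps")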
Table 11.13 Main features of DVB-C and DVB-C2 systems (source: DVB-C2 standard)

Input interface: DVB-C: single transport stream (TS); DVB-C2: multiple transport streams and generic stream encapsulation (GSE)
Modes: DVB-C: constant coding and modulation; DVB-C2: variable coding and modulation, and adaptive coding and modulation
FEC: DVB-C: Reed-Solomon (RS); DVB-C2: LDPC + BCH
Interleaving: DVB-C: bit-interleaving; DVB-C2: bit-, time- and frequency-interleaving
Modulation: DVB-C: single carrier QAM; DVB-C2: COFDM
Pilots: DVB-C: not applicable; DVB-C2: scattered and continual pilots
Guard interval: DVB-C: not applicable; DVB-C2: 1/64 or 1/128
Modulation schemes: DVB-C: 16- to 256-QAM; DVB-C2: 16- to 4,096-QAM
DVB-C2 represents the second generation transmission system for digital TV
broadcasting via HFC cable networks and MATV installations. This system offers a
range of modes and options that can be optimized for the different network characteristics and the requirements of the different services. It offers more than 30 % higher
spectrum efficiency under the same conditions as current DVB-C deployments,
coming close to the Shannon limit, the theoretical maximum information transfer rate
in a channel for a given noise level. DVB-C2 is characterized by the following:
• A flexible input stream adapter, suitable for operation with single and multiple
input streams of various formats (packetized or continuous).
• A powerful FEC system based on LDPC inner codes concatenated with BCH
(Bose-Chaudhuri-Hocquenghem) outer codes.
• A wide range of code rates (from 2/3 up to 9/10) and five QAM constellation schemes
(16, 64, 256, 1,024 and 4,096-QAM), giving spectrum efficiencies from 1 to 10.8 bit/s/Hz
(see the short calculation after this list).
• ACM functionality on a frame-by-frame basis, for dynamic link adaptation to
propagation conditions.
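The 10.8 bit/s/Hz figure mentioned in the list above follows directly from the densest constellation and the highest code rate, and Shannon's capacity formula indicates the order of magnitude of the SNR such an operating point requires. The following sketch ignores pilot, guard-interval and framing overheads, so it gives idealized upper-bound numbers rather than net throughput:

# Peak modulation efficiency of DVB-C2 and the SNR an ideal (Shannon-capacity)
# channel would need to support it. Pilot, guard-interval and framing overheads
# are ignored, so these are idealized upper-bound numbers.

import math

modulation_order = 4096          # 4,096-QAM
code_rate = 9 / 10               # highest LDPC code rate listed above

bits_per_symbol = math.log2(modulation_order)   # 12 bits
efficiency = bits_per_symbol * code_rate         # 10.8 bit/s/Hz

# Shannon: C/B = log2(1 + SNR)  ->  SNR = 2**(C/B) - 1
snr_db = 10 * math.log10(2 ** efficiency - 1)

print(f"Peak efficiency: {efficiency:.1f} bit/s/Hz")
print(f"Minimum SNR on an ideal channel: {snr_db:.1f} dB")
# Roughly 32.5 dB, which is why 4,096-QAM is only usable on very clean cable plant.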
COFDM modulation was chosen to reduce the vulnerability to echoes caused by
typical in-house coaxial networks and to increase the robustness to impulsive noise
interference. 4K-mode OFDM modulation is used within 8 MHz (European cable
networks) or 6 MHz channels (US-type cable networks). In both cases the number
of carriers is the same, while the spacing between carriers differs.
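Since the same number of 4K-mode carriers is fitted into the narrower channel, the carrier spacing simply scales with the channel bandwidth. A minimal sketch, assuming the 2.232 kHz spacing quoted for 8 MHz channels in Sect. 11.7.3 and purely proportional scaling:

# Carrier spacing scaling between European (8 MHz) and US-type (6 MHz) DVB-C2
# channels, assuming the same number of 4K-mode carriers in both cases and the
# 2.232 kHz spacing quoted for 8 MHz channels later in this section.

SPACING_8MHZ_KHZ = 2.232   # OFDM carrier spacing in an 8 MHz channel (kHz)

spacing_6mhz = SPACING_8MHZ_KHZ * 6 / 8
print(f"6 MHz channel carrier spacing: about {spacing_6mhz:.3f} kHz")
# About 1.674 kHz: same carrier count, proportionally tighter spacing.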
DVB-C2 will increase the bandwidth of the transmitted signal beyond 8 MHz to
improve the spectrum efficiency. Signals from 8 to 64 MHz (6 to 48 MHz in US
networks) will be allowed, in order to offer larger pipes with a very efficient
sharing of the available resources.
Although the widespread penetration of the DVB-C2 system will require some
years, this development can coincide in time with the rollout of commercial
3D-TV services. Furthermore, the benefits of this second generation system match
the requirements of 3D-TV content delivery. For these reasons, only
DVB-C2 will be considered in this section.

11.7.2 Services and Multiplex Options


Following the path set up for DVB-S2 and DVB-T2, the encapsulation techniques
in DVB-C2 can be based on MPEG Transport Stream (TS), packetized and continuous input formats as well as the so-called GSE [8, 27]. The description provided in previous sections for DVB-S2 and DVB-T2 applies here.

11.7.3 Network Requirements


The high transmission capacity of the DVB-C2 system makes it well suited to the
delivery of 3D-TV services over cable networks. The payload capacity, using a
guard interval of 1/128 and a 2.232 kHz OFDM carrier spacing, ranges for European-type
(8 MHz) cable networks from 23.56 Mbps (16-QAM 4/5) to 79.54 Mbps
(4,096-QAM 9/10). If channel bandwidths wider than the standard 6-8 MHz (applying channel
bonding techniques) are used, the throughput can be increased.
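The two payload figures quoted above are mutually consistent: their ratio is simply the ratio of coded bits per symbol in the two configurations. A small sanity check, assuming that the payload scales linearly with bits per constellation symbol times code rate and that all other parameters are unchanged:

# Sanity check on the DVB-C2 payload range quoted above (8 MHz channel,
# guard interval 1/128): payload should scale roughly with
# bits_per_symbol * code_rate, all other parameters being equal.

import math

low_mbps = 23.56                      # 16-QAM, code rate 4/5 (from the text)
low_eff = math.log2(16) * 4 / 5       # 3.2 coded-payload bits per symbol
high_eff = math.log2(4096) * 9 / 10   # 10.8 coded-payload bits per symbol

predicted_high = low_mbps * high_eff / low_eff
print(f"Predicted top payload: {predicted_high:.2f} Mbps (quoted: 79.54 Mbps)")
# About 79.5 Mbps, in good agreement with the figure given in the text.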
Cable networks have no severe restrictions on bandwidth when compared to
other networks such as terrestrial or cellular networks. This fact, along with the
high transmission capacity of the DVB-C2 system, allows the use of multiple options
and formats for 3D-TV delivery. Additionally, multicast of 2D and 3D content is
feasible due to the wide frequency range available in cable networks.

11.8 Evolution Scenarios for 3D-TV Broadcast


Some of the basic technologies associated with a stable 3D solution for terrestrial
broadcasting, including formats, display technologies, and baseband coding, are still
a matter of discussion (and ongoing technical improvement) in the broadcast community. This section sets out some boundary conditions that frame the choice of
rollout scenarios proposed in this book. The boundary conditions are related to the
target receivers (fixed, portable, or mobile), backwards compatibility requirements
with existing terrestrial broadcast receivers (simultaneous HDTV-3D services),
content quality requirements (mostly, but not only, the bitrate allocated to each HDTV
and 3D content), service coverage areas, and minimum required field strength values.
Table 11.14 ITU-R 3D-TV evolution model

First generation: Display: glasses / autostereoscopic with limited views; Views: L + R
Second generation: Display: autostereoscopic; Views: multi-view
Third generation: Display: similar to today's holography; Views: continuous object wave

11.8.1 3D Format Evolution and Compatibility


It is widely agreed that the ideal 3D-TV system would be one that provides full
resolution 3D video without requiring the consumer to wear any type of head-mounted
device (including glasses). This target, currently named third generation
3D, is still far from being reached with current state-of-the-art technology and will
probably take decades to achieve [1].
Meanwhile, most standardization organizations have structured the evolution
from 2D to 3D into two intermediate technology generations. The ITU-R evolution
model of 3D-TV systems is shown in Table 11.14.
Associated with the evolution steps in Table 11.14, the ITU-R provides a classification of possible image formats according to four different compatibility
levels. The compatibility levels are defined according to whether a new
set-top box (broadcast receiver) is required for accessing 3D services. Each level is compatible with the previous ones, except for the specific case of Layer 4 compatibility
with Layer 3. Compatibility with existing 2D services will depend on the specific details of
each scenario.
Level 1 is associated with systems compatible with current HD displays and is called
HD conventional display compatible (CDC). It uses anaglyph techniques to
achieve the 3D video sensation and has been dismissed by broadcasters.
Level 2 is compatible with existing 2D set-top boxes but requires a new display.
This level is associated with conventional HD frame compatible (CFC) formats
explained in Sect. 11.3.2. The resolution of the left- and right-eye views is lower than
that of conventional 2D HD, for obvious compatibility reasons with the receiver. Level 2
does not provide service compatibility with existing 2D HD displays, so if a 2D
service of the same programme is needed, some type of simulcast would be
required.
Level 3 is based on a new set-top box and a new display. This option aims at
providing additional resolution to the left- and right-eye views of Level 2 by
adding a resolution enhancement layer, using, for example, MPEG SVC. Using this
additional information, the receiver can output a resolution equivalent to that of
2D HD images. For existing 2D legacy displays it would still be necessary to
simulcast a 2D version of the programme.
Level 4 is also based on a new set-top box, which is able to decode an MPEG
MVC signal conforming to the ISO/IEC JTC1 MPEG specification. This level will
be conventional HD service compatible (CSC) because an existing 2D set-top box
will find, in the incoming multiplex, a conventional 2D HD signal which it can
pass to a conventional display as a 2D picture. Level 4 (first-generation profile)
will include capability for Level 2 decoding (but, depending on market conditions,
not complete Level 3 decoding including extension).
Level 4 is at the moment the only format envisaged for the second generation
by the ITU-R. The generic receiver of this second generation Level 4 format is also
a new set-top box which is able to decode the 2D HD plus depth format as
specified by the IEC/ISO JTC1 MPEG specification. The display is normally a
multi-view auto-stereoscopic display. Such set-top boxes would also decode
Levels 1, 2, and 4 of the first-generation profile.
Another relevant evolution proposal has been described by the DVB Consortium, which envisages the introduction of 3D-TV services in
two phases. The first phase is based on stereoscopic 3D-TV (called plano-stereoscopic 3D by DVB), where two images (left- and right-eye views) are delivered
to be seen simultaneously. Both images are multiplexed into a single HD frame,
enabling service providers to use their existing HDTV infrastructure. This case is equivalent to
the first generation Frame Compatible Layer 2 of ITU-R Report BT.2160. The
second phase has been divided into two possible scenarios. The first scenario is a
Service Compatible case. The format conveys full HDTV resolution 3D-TV services for those who have 3D-TV capable displays and at the same time provides a
full resolution HD 2D version of the same programme, to viewers who have
conventional 2D HDTV receivers and displays. This scenario is equivalent to first
generation Layer 4 as defined by ITU-R Report 2160. The second scenario, called
Frame Compatible Compatible, is equivalent to ITU-R Layer 3 and aims at providing full HDTV resolution 3D-TV services to 3D-TV capable displays, at the
same time as delivering backwards compatible half resolution 3D images for phase
1 receivers. The DVB proposals are restricted to two views and do not specify the
techniques used by the display to construct the left- and right-eye views [5].

11.8.2 DIBR and 3D-TV Broadcast Evolution Roadmap


The future of 3D-TV broadcasting will be strongly influenced by the viability and
commercial success of first-generation 3D-TV systems. Under this assumption, the
business models for pay TV and for free-to-air broadcasters will be completely
different, and so will the strategies for first-generation 3D-TV.
In the pay TV business model, the service addresses only
viewers with 3D displays. Here, the Frame Compatible approach seems the
most adequate (Layer 2 and Layer 3 formats according to ITU-R terminology). In
this case, DIBR techniques can be used to enhance the resolution that would
otherwise be lost by the intrinsic packaging of two images on the same frame. The
depth information required by DIBR would be conveyed following any of the
encapsulation options described in Sect. 11.3. For broadcasters based on a multichannel pay TV offer, the exploitation of the existing infrastructure will
probably become a major priority and, due to the relatively high bitrate capacity
available in this scenario, they would deliver 3D-TV content to a group of subscribers and provide a simulcast programme for non-3D-TV displays. Satellite and
cable operators are most likely to implement this option.
The terrestrial free-to-air TV business model is completely different. Given the scarce
frequency resources, the major concern is to keep the services backward compatible
with existing 2D HD receivers. The 3D-TV content offer would then necessarily be
based on an additional information channel that would provide the data to reconstruct
the second image for suitably equipped 3D receivers (Layer 4 model as described by
ITU-R). There are two format choices for building this 3D scenario:
1. Formats following the 2D + delta scheme. This option would be based on
SVC coders or even on a mixture of MPEG-2 and H.264/AVC coders, where
one of the channels is the baseline for legacy receivers and the additional layer
would provide service access for 3D receivers.
2. Formats based on DIBR techniques using any type of auxiliary depth information. In this case either a 2D + DOT (data to represent depth, occlusion, and
transparency information) or a 2D + depth coding scheme could allow
multiple views to be generated for presentation on auto-stereoscopic displays.
Korea is an example of the coexistence of the two scenarios described above.
On the one hand, the terrestrial broadcasters KBS, MBC (Munhwa Broadcasting
Corp.), SBS and EBS (Educational Broadcasting System) have prepared for 3D
trial broadcasting from October 2010 using dual-stream coding (left image with
MPEG-2, right image with H.264/AVC) at a resolution of 1,920 × 1,080 interlaced 30 fps. On the other hand, the pay TV segment formed by the cable broadcasters
CJHelloVision and HCN and by Korea Digital Satellite Broadcasting will also take
part in the 3D trial broadcasting service, using in this case a frame-based
solution.

11.9 Conclusions


The rollout of mass market 3D-TV services over the broadcast networks is still
under development. During the last couple of years, remarkable standardization
and research activity has been carried out, with different degrees of success.
Among all the possibilities to represent 3D video content, the choice will be
strongly dependent on the business model. Free-to-air TV (mainly terrestrial) will
have backwards compatibility as one of the major requirements due to
the scarce capacity resources inherent to terrestrial networks. Here, DIBR-based
approaches are a good compromise between simplicity, perceived 3D quality and
2D receiver and display compatibility. The requirement of 2D backwards compatibility is being assumed by different consortia and standardization bodies
(DVB, ATSC and SMPTE among others).
In the case of pay TV broadcasters (especially satellite and cable) it is clear that
the short-term deployments will be based on one of the versions of Frame-
Compatible 3D video. The next step would then be the enhancement of the content
resolution to achieve a quality close to that of the Full Resolution format.
Again, DIBR techniques are of special interest in this area, as a way to produce a
complementary data layer that could upgrade a limited-resolution 3D stream to an
HDTV 3D service.
As relevant as the format, at least from the broadcaster side, the coding algorithm will be either an enabling technology or an obstacle for the fast
deployment of 3D services. Currently, the bitrate needed to provide a 3D service is
assumed to be around 60 % higher than that of the equivalent 2D material. This increase is a
challenge for the current broadcast standards if today's services are to be
maintained. The solution to this obstacle might rely on the new generation of
video coding standards.

References
1. International Telecommunications Union. Radiocommunications Sector (2010) Report
ITU-R BT.2160, Features of three-dimensional television video systems for broadcasting
2. International Organization for Standardization (2007) ISO/IEC 23002-3:2007, Information
technology - MPEG video technologies - Part 3: Representation of auxiliary video and
supplemental information
3. International Organization for Standardization (2011) Call for proposals on 3D video coding
technology, ISO/IEC JTC1/SC29/WG11 MPEG2011/N12036
4. Society of Motion Picture and Television Engineers (2009) Report of SMPTE Task Force on
3D to the Home
5. Digital Video Broadcasting (2010) DVB BlueBook A151 Commercial requirements for
DVB-3DTV
6. Digital Video Broadcasting (2011) DVB Frame compatible plano-stereoscopic 3DTV
(DVB-3DTV), DVB BlueBook A154
7. European Telecommunications Standard Institute (2011) ETSI EN 302 755 V1.2.1. Frame
structure channel coding and modulation for a second generation digital terrestrial television
broadcasting system (DVB-T2)
8. European Telecommunications Standard Institute (2011) ETSI EN 302 769 V1.2.1 Frame
structure channel coding and modulation for a second generation digital transmission system
for cable systems (DVB-C2)
9. European Telecommunications Standard Institute (2009) ETSI EN 302 307 V1.2.1. Second
generation framing structure, channel coding and modulation systems for broadcasting,
interactive services, news gathering and other broadband satellite applications
10. Müller K et al (2009) Coding and intermediate view synthesis of multiview video plus depth.
In: 16th IEEE International conference on image processing (ICIP), pp 741-744
11. Li Sisi et al (2010) The overview of 2D to 3D conversion system. In: IEEE 11th International
conference on computer-aided industrial design and conceptual design (CAIDCD), vol 2,
pp 1388-1392
12. International Organization for Standardization (2009) ISO/IEC JTC1/SC29/WG11 N10540:
Text of ISO/IEC 14496-10:2009 FDAM 1 (including stereo high profile)
13. Fehn C (2004) 3D-TV using depth-image-based rendering (DIBR). In: Proceedings of picture
coding symposium
14. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for
3D-TV. IEEE Trans Broadcast 51(2):191-199
15. Merkle P et al (2007) Multi-View Video Plus Depth Representations and Coding
Technologies. IEEE Conference on Image Processing, pp 201-204
16. Bartczak B et al (2011) Display-independent 3D-TV production and delivery using the
layered depth video format. IEEE Trans Broadcast 57(2):477-490
17. European Telecommunications Standard Institute (2009) ETSI TS 102 034 V1.4.1. Transport
of MPEG-2 TS based DVB services over IP based networks (and associated XML)
18. Internet Engineering Task Force (IETF) (2011) RTP Payload format for H.264 video, RFC
6184, proposed standard
19. International Organization for Standardization (2010) ISO/IEC 14496-10:2010. Information
technology - coding of audio-visual objects - Part 10: advanced video coding
20. Vetro A et al (2011) Overview of the stereo and multiview video coding extensions of the
H.264/MPEG-4 AVC standard. Proc IEEE 99(4):626-642
21. Smolic A et al (2009) Development of a new MPEG standard for advanced 3D video
applications. In: Proceedings of 6th international symposium on image and signal processing
and analysis, pp 400-407
22. Merkle P et al (2008) Adaptation and optimization of coding algorithms for mobile 3DTV.
MOBILE 3DTV, Technical report D2.2
23. International Organization for Standardization (2007) Text of ISO/IEC 13818-1:2003/
FPDAM2 Carriage of auxiliary video data streams and supplemental information, Doc No.
8799
24. Bourge A et al (2006) MPEG-C Part 3: Enabling the introduction of video plus depth
contents. In: Proceeding workshop content generation coding 3D-Television
25. Sukhee C et al (2005) Disparity-compensated stereoscopic video coding using the MAC in
MPEG-4. ETRI J 27(3):326-329
26. Hewage CTER et al (2007) Comparison of stereo video coding support in MPEG-4 MAC,
H.264/AVC and H.264/SVC. In: 4th IET international conference on visual information
engineering
27. Hur N (2010) 3D DMB: A portable/mobile 3D-TV system. 3D-TV workshop, Shanghai
28. Hoffmann H et al (2006) Studies on the bit rate requirements for an HDTV format with
1,920 × 1,080 pixel resolution, progressive scanning at 50 Hz frame rate targeting large flat panel
displays. IEEE Trans Broadcast 52(4)
29. Hoffmann H et al (2008) A novel method for subjective picture quality assessment and
further studies of HDTV formats. IEEE Trans Broadcast 54(1)
30. European Broadcasting Union (2006) Digital terrestrial HDTV broadcasting in Europe. EBU
Tech 3312
31. Klein K et al (2007) Advice on spectrum usage, HDTV and MPEG-4. http://www.bbc.co.uk/
bbctrust/assets/files/pdf/consult/hdtv/sagentia.pdf
32. Brugger R et al (2009) Spectrum usage and requirements for future terrestrial broadcast
applications. EBU technical review
33. McCann K et al (2009) Beyond HDTV: implications for digital delivery. An independent
report by Zeta cast Ltd commissioned by ofcom
34. Husak W (2009) Issues in broadcast delivery of 3D. EBU 3D TV workshop
35. Tagiri S et al (2006) ISDB-C: cable television transmission for digital broadcasting in Japan.
Proc IEEE 94(1):303-311
36. Robert J et al (2009) DVB-C2 - The standard for next generation digital cable transmission.
In: IEEE international symposium on broadband multimedia systems and broadcasting,
BMSB
37. Benzie P et al (2007) A survey of 3DTV displays: techniques and technologies. IEEE Trans
Circuits Syst Video Technol 17(11)
38. Onural L et al (2006) An assessment of 3DTV technologies. NAB BEC proceedings,
pp 456-467
39. Surman P et al (2008) Chapter 13: Development of viable domestic 3DTV displays.
In: Ozaktas HM, Onural L (eds) Three-dimensional television, capture, transmission, display.
Springer, Berlin
40. Onural L, Ozaktas HM (2008) Chapter 1: Three-dimensional television: from science-fiction
to reality. In: Ozaktas HM, Onural L (eds) Three-dimensional television, capture,
transmission, display. Springer, Berlin
41. Gvili R et al (2003) Depth keying. Proc SPIE-IS&T Electron Imaging 50(6):564-574
42. Pastoor S (1991) 3D Television: a survey of recent research results on subjective
requirements. Signal Process Image Commun 4(1):21-32
43. Sung-Fang Tsai et al (2011) A real-time 1080p 2D-to-3D video conversion system. In: IEEE
International conference on consumer electronics (ICCE) proceedings, pp 803-804
44. Zhang Liang (2011) 3D-TV content creation: automatic 2D-to-3D video conversion. IEEE
Trans Broadcast 57(2):372-383
45. European Telecommunications Standards Institute (1997) Framing structure, channel coding
and modulation for 11/12 GHz satellite services. EN 300 421 V1.1.2
46. European Telecommunications Standards Institute (1997) Implementation of binary phase
shift keying (BPSK) modulation in DVB satellite transmission systems. TR 101 198 V1.1.1
47. International Telecommunications Union. Radiocommunications Sector (2007)
Recommendation BO.1784. Digital satellite broadcasting system with flexible configuration
(television, sound and data)
48. European Telecommunications Standards Institute (2005) User guidelines for the second
generation system for Broadcasting, Interactive Services, News Gathering and other
broadband satellite applications. TR 102 376 V1.1.1
49. European Telecommunications Standards Institute (2005) ETSI. TS 102 441 V1.1.1 DVB-S2
Adaptive coding and modulation for broadband hybrid satellite dialup applications
50. European Telecommunications Standard Institute (1997) ETSI EN 300 744 Framing
structure, channel coding and modulation for digital terrestrial television
51. Reimers U (2006) The family of international standards for digital video broadcasting. Proc
IEEE 94(1):173-182
52. European Telecommunications Standards Institute (2010) ETSI TS 102 831 V1.1.1 DVB -
Implementation guidelines for a second generation digital terrestrial television broadcasting
system (DVB-T2)
53. Morgade J et al (2011) 3DTV roll-out scenarios: a DVB-T2 approach. IEEE Trans
Broadcast 57(2):582-592
54. Advanced Television Systems Committee (2007) ATSC digital television standard A/53D
55. Advanced Television Systems Committee (2003) Guide to the use of the digital television
standard, ATSC A/54A
56. International Telecommunication Union. Radiocommunications Sector (2005) Rec. ITU-R
BT.1300-3. Service multiplex, transport, and identification methods for digital terrestrial
television broadcasting
57. International Telecommunication Union. Radiocommunications Sector (2011) Rec. ITU-R
BT.1306. Error correction, data framing, modulation and emission methods for digital
terrestrial television broadcasting
58. International Organization for Standardization (2005) ISO/IEC 13818-2. Information
technology - generic coding of moving pictures and associated audio: video
59. Advanced Television Systems Committee (2001) Digital audio compression (AC-3) standard,
ATSC: A/52B
60. International Telecommunications Union. Telecommunications Sector (2003) Rec. ITU-T
H.264 | ISO/IEC 14496-10 AVC. Advanced video coding for generic audiovisual services
61. Richer MS et al (2006) The ATSC digital television system. Proc IEEE 94:37-42
62. Bretl W et al (2006) ATSC RF, modulation and transmission. Proc IEEE 94:44-59
63. International Organization for Standardization (2005) Information technology - Generic
coding of moving pictures and associated audio - Part 1: Systems. ISO/IEC 13818-1
64. Lechner BJ et al (2006) The ATSC transport layer, including program and system
information protocol (PSIP). Proc IEEE 94:77-101
65. Hur N et al (2011) 3DTV broadcasting and distribution systems. IEEE Trans Broadcast
57(2):395-407
66. Park S et al (2010) A new method of terrestrial 3DTV broadcasting system. IEEE broadcast
symposium
67. Advanced Television Systems Committee (2011) ATSC planning team interim report. Part 2:
3D technology
68. Association of Radio Industries and Businesses (2005) Transmission system for digital
terrestrial television broadcasting, ARIB Standard STD-B31
69. Association of Radio Industries and Businesses (2006) Operational guidelines for digital
terrestrial television broadcasting, ARIB Tech. Rep. TR-B14
70. Association of Radio Industries and Businesses (2007) Video coding, audio coding and
multiplexing specifications for digital broadcasting, ARIB standard STD-B32
71. Association of Radio Industries and Businesses (2008) Data coding and transmission
specification for digital broadcasting, ARIB standard STD-B24
72. Asami H, Sasaki M (2006) Outline of ISDB systems. Proc IEEE 94:248-250
73. Uehara M (2006) Application of MPEG-2 systems to terrestrial ISDB (ISDB-T). Proc IEEE
94:261-268
74. Takada M, Saito M (2006) Transmission system for ISDB-T. Proc IEEE 94:251-256
75. Itoh N, Tsuchida K (2006) HDTV mobile reception in automobiles. Proc IEEE 94:274-280
76. Standardization Administration of the People's Republic of China (2006) Frame structure,
channel coding and modulation for a digital television terrestrial broadcasting system,
Chinese national standard GB 20600
77. Wu J et al (2007) Robust timing and frequency synchronization scheme for DTMB system.
IEEE Trans Consum Electron 53(4):1348-1352
78. Zhang W et al (2007) An introduction of the Chinese DTTB standard and analysis of the
PN595 working modes. IEEE Trans Broadcasting 53(1):8-13
79. Song J et al (2007) Technical review on Chinese digital terrestrial television broadcasting
standard and measurements on some working modes. IEEE Trans Broadcasting 53(1):1-7
80. OFTA (2009) Technical specifications for digital terrestrial television baseline receiver
requirements
81. Ong C (2009) White paper on latest development of digital terrestrial multimedia broadcasting
(DTMB) technologies. Hong Kong Applied Science and Technology Research Institute
(ASTRI), Hong Kong
82. European Telecommunications Standards Institute (2009) ETSI TR 102 377 v1.3.1 digital
video broadcasting (DVB): DVB-H implementation guidelines
83. European Telecommunications Standards Institute (2004) ETSI EN 302 304 v1.1.1 Digital
video broadcasting (DVB); transmission system for handheld terminals (DVB-H)
84. Faria G et al (2006) DVB-H: digital broadcast services to handheld devices. Proc IEEE
94(1):194209
85. Atanas G et al (2010) Complete end-to-end 3DTV system over DVB-H. Mobile3DTV project
86. Atanas G et al (2011) Mobile 3DTV content delivery optimization over DVB-H system. Final
public summary. Mobile3DTV project
87. European Telecommunications Standards Institute (2006) ETSI EN 300 401 V1.4.1, radio
broadcasting systems, digital audio broadcasting (DAB) to mobile, portable and fixed receivers
88. Telecommunications Technology Association (2005) TTASKO-07.0024 radio broadcasting
systems, Specification of the video services for VHF digital multimedia broadcasting (DMB)
to mobile, portable and fixed receivers
89. International Organization for Standardization (2008) ISO/IEC JTC1/SC29/WG11 joint draft
7.0 on multiview video coding
90. Baroncini V, Sullivan GJ and Ohm JR (2010) Report of subjective testing of responses to
joint call for proposals on video coding technology for high efficiency video coding (HEVC).
Document JCTVC-A204 of JCT-VC
91. Yun K et al (2008) Development of 3D video and data services for T-DMB. SPIE
Stereoscopic Disp Appl XIX 6803:2830
92. International Organization for Standardization (2010) ISO/IEC. JTC1/SC29/WG11 a frame
compatible system for 3D delivery. Doc. M17925
93. Lee H et al (2008) A backward-compatible, mobile, personalized 3DTV broadcasting system
based on T-DMB. Three-dimensional television capture, transmission, display. Springer,
New York
94. Park YK et al (2009) Depth-image-based rendering for 3DTV service over T-DMB. Signal
Process: Image Commun 24:122-136 (Elsevier)
95. Kauff P et al (2007) Depth map creation and image-based rendering for advanced 3DTV
services providing interoperability and scalability. Signal Process: Image Commun
22(2):217-234 (Elsevier)
96. European Telecommunications Standards Institute (1998) EN 300 429 V1.2.1. Framing
structure, channel coding and modulation for cable systems
97. European Telecommunications Standards Institute (2011) TS 102 991 v1.2.1. DVB-C2
implementation guidelines

Part IV

3D Visualization and Quality Assessment

Chapter 12

The Psychophysics of Binocular Vision


Philip M. Grove

Abstract This chapter reviews psychophysical research on human stereoscopic
processes and their relationship to a 3D-TV system with DIBR. Topics include
basic physiology, binocular correspondence and the horopter, stereoacuity and
fusion limits, non-corresponding inputs and rivalry, dynamic cues to depth and
their interactions with disparity, and development and adaptability of the binocular
system.

Keywords 3D-TV · Binocular correspondence · Binocular development · Binocular
rivalry · Binocular visual system · Depth-image-based rendering (DIBR) ·
Disparity scaling · Dynamic depth cue · Fusion limit · Horopter · Monocular
occlusion zone · Motion parallax · Size scaling · Stereoacuity · Visual cortex




12.1 Introduction
In the last decade, stereoscopic media such as 3D¹ movies, 3D television (3D-TV),
and mobile devices have become increasingly common. This has been facilitated
by technical advances and reduced cost. Moreover, viewers report a preference for
3D content over the same 2D content in many contexts. Viewing media content in
stereoscopic 3D increases viewers' sense of presence (e.g. [1]), and 3D content
viewed on large screens and mobile devices is scored more favorably than 2D
content viewed on the same displays [2].

¹ Throughout this chapter, 3D refers to stereoscopic imaging.

P. M. Grove (&)
School of Psychology, The University of Queensland, Brisbane, Australia
e-mail: p.grove@psy.uq.edu.au

With the increasing demand for 3D media content, it is necessary to find efficient
modes of delivery. Indeed, Fehn et al. [3] assert that for 3D-TV to be
commercially viable in the current broadcast
environment, bandwidth requirements must approach those of regular 2D digital
TV. Depth-Image-Based Rendering (DIBR) is a promising method for meeting the
challenge of efficiently delivering 3D content. 3D content based on DIBR involves
the transmission of a 2D image and an accompanying depth map that specifies the
depth of each pixel in the 2D image with respect to the camera position [4]. Two or
more virtual stereoscopic images can be synthesized at the receiver end based on
the depth map. The bandwidth for transmitting these two signals is significantly
less than would be required to transmit two full quality images, one for the left and
one for the right eye.
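To make the idea concrete, the sketch below synthesizes one virtual view from a 2D image and its depth map by shifting each pixel horizontally in proportion to its depth value. It is a deliberately minimal illustration rather than the rendering pipeline of any particular DIBR system: the max_disparity scale factor is a hypothetical stand-in for camera baseline, focal length, and display geometry, holes are left unfilled, and shifts are rounded to whole pixels.

# Minimal depth-image-based rendering (DIBR) sketch: warp a 2D image into a
# virtual view by shifting pixels horizontally according to a per-pixel depth map.
# Assumptions (not from the text): depth is an 8-bit map with 255 = nearest, and
# max_disparity lumps together baseline, focal length and display geometry.
# Real systems add hole filling and sub-pixel interpolation, which are omitted here.

import numpy as np

def render_virtual_view(image, depth, max_disparity=16):
    """image: H x W x 3 uint8; depth: H x W uint8 (255 = nearest). Returns a warped view."""
    h, w, _ = image.shape
    view = np.zeros_like(image)            # unfilled pixels stay black (disocclusion holes)
    disparity = (depth.astype(np.float32) / 255.0 * max_disparity).astype(int)
    for y in range(h):
        for x in range(w):                 # left-to-right scan, so pixels with larger
            x_new = x - disparity[y, x]    # disparity (nearer) overwrite farther ones
            if 0 <= x_new < w:             # that land on the same target column
                view[y, x_new] = image[y, x]
    return view

# Tiny synthetic example: a flat mid-gray background with a near square object.
img = np.full((120, 160, 3), 128, dtype=np.uint8)
dep = np.zeros((120, 160), dtype=np.uint8)
dep[40:80, 60:100] = 255
virtual_view = render_virtual_view(img, dep)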
Any delivery method must ensure that viewers' 3D experiences are safe,
without strain, and perceptually compelling. A significant challenge for 3D-TV
systems is to maximize viewer comfort. A large body of research in this area has
accumulated in recent years. Major topics of research include perceptual issues
such as accommodation/vergence conflict (e.g. [4]) and the optimal range of
disparities and their distributions across the display (e.g. [5, 6]). Higher cognitive
factors have also been explored [7]. Moreover, a number of reviews
summarizing this literature have been published recently [8-11], indicating a
heightened interest in this area.
Image artifacts and distortions are additional problems that remain to be solved for
all 3D media including 3D-TV systems. Some examples include keystone distortions, the puppet theater and cardboard cutout effects, and the presence of unmatched
features in the two eyes arising from compression techniques and some DIBR
algorithms. Some aspects of DIBR contribute to these problems (see for example,
[12]), and other features of DIBR contribute to their resolution [2]. Therefore, the two
major challenges for workers in DIBR are to facilitate efficient transmission of 3D
media and to maintain a high level of viewer comfort and satisfaction.
In order to meet the challenges mentioned above, it is necessary to have some
understanding of the human binocular system. Therefore, this chapter reviews a
selection of the literature on human binocular vision and stereopsis relevant to 3D
media developers. The goal is to link some of the literature on basic binocular
visual processes to the applied problems facing 3D media developers. Understandably, much of the applied work on visual fatigue, distortions, and artifacts
focuses on modifications to the technology for their remedy, rather than probing
the underlying binocular processes related to these issues. This chapter focuses on
the binocular visual system with the hope of illuminating the perceptual processes
related to some of the negative experiences with 3D media as well as those
perceptual processes that might be exploited to enrich the 3D experience.
This chapter is organized as follows. Sections 12.2 through 12.8 each review
a specific topic and its relevance to a 3D-TV System with Depth-Image-Based
Rendering. Beginning with the basic physiology of the human binocular system
(Sect. 12.2), the chapter then goes on to cover binocular correspondence and the
horopter (Sect. 12.3), the performance limits of the binocular system (Sect. 12.4),
binocular rivalry and monocular occlusion zones (Sect. 12.5), size scaling and
Fig. 12.1 Major structures of the human right eye. See text for details

disparity scaling with distance (Sect. 12.6), dynamic depth cues and their interaction
with binocular disparity (Sect. 12.7), and concludes with binocular development and
plasticity (Sect. 12.8). Section 12.9 concludes the chapter. This chapter aims to help
workers in 3D media to understand the biological side of the 3D media-consumer
relationship with the goal of eliminating artifacts and distortions from their displays,
and maximizing user comfort and presence.

12.2 Basic Physiology of the Binocular System


The human binocular field is approximately 200°² in angular extent. The left and
right eyes' visual fields overlap in the central 120°. Adjacent to the binocular field are
two monocular crescents of approximately 40° in extent. In spite of the large region
of binocular overlap, binocular processes such as stereopsis are limited to a portion of
this region. For example, psychophysical data show that binocular processing of
dynamic stimuli occurs within the central 63° of the binocular field [14].
The eyes are the front-end of the binocular system and contain an optical system
and a neural system. The optical system consists of the cornea, lens, and an adjustable
opening called the pupil (Fig. 12.1). The pupil is formed by the muscles in the iris
(the pigmented area of our eyes) and can vary in size between approximately 2 and
8 mm [15]. As pupil size decreases, depth of focus increases. That is, objects
increasingly nearer or farther from the focal point remain in focus. For very small
artificial pupils, as in a pinhole camera, all distances are in focus [16]. 3D-TV
researchers have exploited depth of focus to define a viewing zone in which
accommodation remains constant, eliminating conflicts with other sensory processes

such as convergence (see below), in an attempt to reduce viewer fatigue and
discomfort [11].

² Researchers in human visual perception specify the size of distal objects, the extent of visual
space, and binocular disparities in terms of their angular extent at the eye rather than linear
measurements such as display-screen units like pixels. See Harris [13] for a description. For
reference, a 1 cm wide object 57 cm from the eye subtends approximately 1 degree of visual
angle.
A transparent membrane, called the cornea covers the pupil. Light passes through
the cornea and pupil, then through an adjustable lens to the back of the eye where it is
absorbed by photoreceptors in the retina, the first stage of the neural system. A key
feature of the retina is the fovea (Fig. 12.1), a small region approximately 0.33 mm in
diameter in the center of the retina, where color-sensitive photoreceptors, called cones,
are densely packed and exposed directly to incoming light unimpeded by neural
tissue and blood vessels. This provides the best possible resolution. Monochromatic
photoreceptors, called rods, are located in the peripheral retina.
The visual axis (Fig. 12.1) is an imaginary line that connects the object that is
being looked at directly (fixated) and the center of the fovea via the optical nodal
point of the eye. The left and right eyes' visual axes intersect at points in space
where the two eyes are looking simultaneously.
In humans, the eyes move together in a precise coordinated fashion. This is
necessary for at least two reasons. First, the resolution of fine spatial detail is limited
to the fovea. Therefore, the eye must move if it is to inspect any selected portions of
the visual scene with the highest fidelity. Second, the lateral separation of the eyes
necessitates eye movements in order to image an object of interest on the two foveas
simultaneously. When the visual axes intersect on an object and it is imaged on the
foveas, it is said to be bifixated. The point in space that is bifixated is called the fixation
point.
Eye movements are mediated by six extra-ocular muscles, which are arranged
in three pairs. The medial and lateral rectus muscles connect to the nasal (the side
closest to the nose) and temporal (closest to the temple) side of each eye,
respectively, and rotate the eyes in the horizontal plane of the head. Moving the
two eyes in the same direction in the horizontal plane is called version. For
example, visually tracking a train across the horizon. Moving the eyes in opposite
directions in the horizontal plane is called vergence. This typically occurs when
one tracks an object that is approaching or receding in depth. The eyes can also
move vertically, both in the same direction and to a lesser extent, and under special
conditions, in opposite directions. The superior and inferior rectus muscles,
attached to the tops and bottoms of the two eyes, respectively, mediate vertical eye
movements. Vertical version eye movements entail both eyes rotating upwards or
downwards through equal extents together. Vertical vergence eye movements
occur when one eye rotates upwards and the other rotates downwards. These eye
movements can correct misalignments of stereoscopic images [17]. Lastly, each
eye can rotate about the visual axis. These are called torsional eye movements and
are mediated by the superior and inferior oblique muscles. Cycloversion eye
movements occur, for example, when one tilts the head to the left or right. The
eyes counterrotate in the same direction in an effort to keep the horizontal
meridians of the eyes parallel with the horizon. Cyclovergence eye movements
entail rotations of the eyes in opposite directions and can be induced by large
stereoscopic displays that are counterrotated [18].
Fig. 12.2 The major pathways of the binocular visual system. Solid gray and dashed black lines indicate the path of axons from the retinas, via the optic chiasm and LGN to the visual cortex

Photoreceptors connect to single neurons in the retina called ganglion cells.
As many as 600 rods will input into one retinal ganglion cell but as few as 3 cones
converge onto one ganglion cell [19, 15]. In the case of rods, this results in greater
sensitivity to faint light at the cost of spatial acuity. In the case of cones, the result
is higher spatial acuity at the cost of lower sensitivity to faint light. About one
million ganglion cell axons bundle together (referred to as the optic nerve) and
leave the back of the eye at the blind spot (Fig. 12.1), so named because this point
of exit has no photoreceptors and so vision cannot occur. Mariotte [21] was the
first to publish observations of the blind spot.
After the optic nerve exits the eye, it courses backwards in the brain toward the
primary visual cortex. However, before reaching the cortex, the axons from each
retina undergo hemidecussation. That is, half of the axons from each retina cross
over to the other side of the brain at the optic chiasm (Fig. 12.2). This is an orderly
and systematic process that results in the axons from the nasal side of the right
eye's retina crossing to the left side of the brain and coursing back with the axons
from the temporal side of the left eye's retina. Similarly, the axons from the nasal
side of the left eye's retina cross over and course backwards with axons from the
temporal side of the right eye's retina. This arrangement is critical because it
brings inputs originating from the same points in space to the same or adjacent
neurons in the visual cortex. This mapping is called retinotopic mapping and is
maintained all the way through the visual system. That is, adjacent regions on the
retina connect to adjacent neurons in the visual cortex.
After the optic chiasm, sections of the visual pathway are referred to as optic
tracts. The optic tracts still consist of axons from retinal ganglion cells. The first
synapse in the visual processing stream is at the lateral geniculate nucleus (LGN),
consisting of a left and right nucleus about mid-way between the eyes and the back
of the brain. The inputs from the two eyes insert into alternate layers here with
corresponding locations in the two eyes represented in columns of cells [20].
Although there is minimal interaction between the two eyes in the LGN, the first
step in retinotopic organization of the left and right hemifields occurs here. Relay
axons connect with those projecting from each eye and course backwards and
terminate in the visual cortex on the same side. Thus, owing to hemidecussation,
the right visual cortex processes input from the left visual field and the left visual
cortex processes input from right visual field (Fig. 12.2).
The visual cortex is a convoluted sheet of neural tissue comprising six layers.
Axons from the LGN synapse with cortical neurons in layer four. These first cortical
cells are still monocular. At the next synapse, however, monocular neurons, one from
each eye, converge to synapse with individual neurons in adjacent layers (layers 13
or 5 and 6), which represent the first binocular cells in the visual system [21].
Early physiological studies on cats demonstrated that cells in the primary visual
cortex (V1) are selectively tuned to specific binocular disparities [22]. However, not
all binocular processing takes place in the primary visual cortex. This area connects
laterally to the secondary visual cortex, which is also organized retinotopically in
layers that process coarse and fine disparities [23]. Proceeding to later visual brain
areas, more complex attributes are analyzed. Approximately 40-60 % of cells in the
first five visual brain areas (V1, V2, V3, V4 and V5/MT) are sensitive to binocular
disparity [15]. Therefore, the recovery of stereoscopic depth is not restricted to a
single brain area. A general principle is that information flows from simple to more
complex analysis with progression along the visual pathway. For example, in the
middle temporal cortex (V5/MT), motion and depth information are integrated
yielding sensitivity to motion in depth [24].

12.3 Binocular Correspondence and the Horopter


Stereopsis is a process of depth recovery based on comparing the subtly different
images that project to laterally separated eyes viewing a three-dimensional world.
However, for accurate depth recovery from binocular disparities, there must be a zero
disparity reference point for the binocular system. The fovea provides one such
reference. When an observer binocularly fixates on a particular object, its image falls
on the fovea in each eye and by definition stimulates corresponding points. Assuming
no cyclorotation of the eyes, the theoretical set of loci stimulating corresponding
points in the two eyes, called the geometric horopter, comprises two components.
The first component was originally argued to be a circle intersecting the fixation point
and the optical nodal points of the eyes, called the Vieth-Müller circle after Vieth [25]
and Müller [26]. Recently, Howarth [27] corrected this model, pointing out that
objects on the smaller arc of the circle between the nodal points of the two eyes would
not project images to corresponding points. Therefore, the loci comprising the first
component of the geometric horopter fall on the larger arc of a circle intersecting the
fixation point and the optical nodal points of the eyes [27], referred to as the
geometric horizontal horopter. The second component of the geometric horopter is a
Fig. 12.3 In (a) the two eyes are fixating on point P. The larger arc of the circle intersects the
fixation point and the optical nodal points of the two eyes. Geometry dictates that the angle
subtended at Q is equal to the angle subtended at the fixation point (P) (the angles labeled x).
It follows that the image of object Q is an equal angular distance from the fovea in the two eyes.
This is true for the binocular images of any object lying on this arc. In (b) the geometric vertical
horopter is a line perpendicular to the plane containing the horizontal horopter that intersects the
fixation point. The object at R is equally distant from the two eyes and therefore its angular
elevation from the plane containing the horizontal horopter is the same for both eyes (the angles
labeled v). Therefore, the image of the square at R projects to a location on the retina an equal
angular extent below the fovea in each eye

line perpendicular to the plane containing the geometric horizontal horopter, intersecting the fixation point, referred to as the geometric vertical horopter. The geometric horopter (Fig. 12.3) is limited to these loci because locations away from the
median plane of the head and the horizontal plane of regard give rise to vertical image
size differences owing to the fact that eccentrically located objects are different
distances from the two eyes, precluding stimulation of corresponding points.
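The equal-angle property illustrated in Fig. 12.3a can also be verified numerically. The short sketch below builds the circle through the two nodal points and the fixation point and shows that another point on the larger arc subtends the same binocular angle, hence zero horizontal disparity; the 6.3 cm interocular separation and 2 m fixation distance are illustrative values, not figures taken from the text.

# Numerical check of the Vieth-Mueller circle geometry in Fig. 12.3a: points on
# the larger arc through the fixation point and the two nodal points subtend
# equal binocular angles, i.e. zero horizontal disparity. The 6.3 cm interocular
# separation and 2 m fixation distance are illustrative values only.

import math

a = 0.063                                   # interocular separation (m), assumed
D = 2.0                                     # fixation distance straight ahead (m)
left, right = (-a / 2, 0.0), (a / 2, 0.0)   # nodal points of the two eyes

# Circle through (+/- a/2, 0) and the fixation point (0, D): centre lies on the y-axis.
k = (D**2 - (a / 2)**2) / (2 * D)           # centre (0, k)
R = D - k                                   # radius

def binocular_subtense(p):
    """Angle subtended at point p by the two nodal points (radians)."""
    def direction(eye):
        return math.atan2(p[1] - eye[1], p[0] - eye[0])
    return abs(direction(left) - direction(right))

fixation = (0.0, D)
theta = math.radians(30)                    # another point on the larger arc
q = (R * math.sin(theta), k + R * math.cos(theta))

print(f"Subtense at fixation: {math.degrees(binocular_subtense(fixation)):.4f} deg")
print(f"Subtense at Q:        {math.degrees(binocular_subtense(q)):.4f} deg")
# The two angles agree (inscribed-angle theorem), so Q carries zero disparity.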
The empirical horopter is a map of the locations in space that observers perceive
in identical directions from both eyes for a given point of fixation. In general,
the empirical horopter is determined by having an observer fixate at a specific distance straight ahead and make judgments about the relative alignment of dichoptic
targets, presented at various eccentricities. The inference from these measurements
is that aligned targets stimulate corresponding points in the two retinas.
The loci of points comprising the empirical horopter differ from those making up
the geometric horopter. The portion of the empirical horopter in the horizontal plane
is characterized by a shallower curve than the geometric horopter [28, 29]
(Fig. 12.4). More striking, however, is the difference between the empirical vertical
horopter and its geometric counterpart. The vertical component of the empirical
horopter is inclined top away [30-35]. Moreover, the inclination of the empirical
vertical horopter increases with viewing distance such that it corresponds with the
ground plane at viewing distances greater than about 2 m.
Fig. 12.4 The empirical horizontal horopter (solid black curve) and the geometric horizontal horopter (dashed arc). The empirical vertical horopter (solid inclined line) is tilted top away relative to the geometric vertical horopter (dashed line). See text for details

Knowing the loci making up the geometric and empirical horopters is useful for
at least two reasons. First, these loci predict the regions in space at which we would
expect superior stereoacuity. As discussed in Sect. 12.4, stereoacuity degrades quickly
as the pair of objects is displaced farther in front or behind the fixation point. Therefore,
for best stereoscopic performance, it would be desirable to present objects close to zero
disparity loci. Second, it is very likely that zero disparity loci are close to the middle
of the range of fusible disparities [36]. Knowing this locus can guide the positioning
of the zone of comfortable viewing [9].
As mentioned above, corresponding points cannot be determined for locations
in space away from the horizontal and vertical meridians. However, with fixation
straight ahead, a surface inclined top away will come close to stimulating corresponding points in the two eyes [35, 36]. One implication from these data is that
stereo performance should be superior around these loci and that the zone of
comfort for binocular disparities should be centered on the empirical horopters.
The shapes of the horizontal and vertical empirical horopters differ from flat
consumer 3D displays. Deviations from the horizontal horopter are likely to be small
for normal viewing distances of approximately 2 m. For example, a large
60-inch flat TV display viewed from 2 m would be beyond the horizontal horopter
for all locations except the fixation point. With fixation in the center of the display,
the outer edges would be approximately 10 cm behind the geometric horopter. The
discrepancy between the TV and the empirical horizontal horopter would be less
owing to the latter's shallower curve. However, TVs are usually upright and vertical.
Therefore, the typical 3D display deviates markedly from the inclination of the
empirical vertical horopter. This deviation grows with increasing viewing distance as
the backward inclination of the vertical horopter increases. Psychophysical studies
have shown that observers prefer to orient displays top away approaching the
empirical vertical horopter [33, 38]. Indeed, basic human perception studies have
shown that observers have a perceptual bias to see vertical lines as tilted top away
[39]. These findings combine nicely with the applied work by Nojiri et al. [6] who
reported that viewers find images with uncrossed disparities in the upper visual field
more comfortable to view than other distributions, broadly consistent with the noted
backward inclination of the empirical vertical horopter.

12.4 The Performance Limits of the Binocular System


12.4.1 Stereoacuity
Stereoacuity refers to the smallest disparity an individual can detect. Under
optimal conditions a trained observer can reliably resolve a disparity between 2
and 6 arc sec. A disparity of 2 arc sec corresponds to the depth interval of 4 mm
viewed from 5 m away [40]. Considering HDTV displays, the optimal viewing
distance is specified as 3.1 times the picture height. At this viewing distance, one
pixel on the display subtends approximately 1 min arc at the eye. This is about the
resolution limit for 20/20 acuity. However, 1 min arc is about 10 times the best
stereoscopic threshold of the binocular system. Media developers can be confident
that the minimum disparity simulated on a screen (without sub-pixel sampling)
should be clearly visible to most viewers. However, there is the potential for
artifacts where motion in depth could appear jerky owing to the difference in
resolution of the technology and the human visual system.
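Both numerical claims in the preceding paragraph can be checked with simple geometry: a small disparity (in radians) at viewing distance D corresponds to a depth interval of approximately disparity × D²/a, where a is the interocular separation, and one pixel at 3.1 picture heights subtends atan((H/1080)/(3.1H)). The sketch below assumes a = 6.5 cm, an illustrative value not given in the text:

# Two quick checks of the figures in this paragraph.
# (1) Depth interval corresponding to a 2 arc sec disparity at 5 m, using the
#     small-angle approximation depth = disparity * D**2 / a, with an assumed
#     interocular separation a = 6.5 cm.
# (2) Angular size of one HDTV pixel at the recommended viewing distance of
#     3.1 picture heights (1080 rows per picture height).

import math

ARCSEC = math.pi / (180 * 3600)      # radians per arc second

# (1) disparity-to-depth
a = 0.065                            # interocular separation (m), assumed
D = 5.0                              # viewing distance (m)
depth_interval = (2 * ARCSEC) * D**2 / a
print(f"2 arc sec at 5 m -> {depth_interval*1000:.1f} mm depth interval")
# About 3.7 mm, close to the 4 mm figure quoted above.

# (2) pixel subtense at 3.1 picture heights
pixel_pitch_in_heights = 1 / 1080    # one pixel, expressed in picture heights
viewing_distance_heights = 3.1
pixel_angle = math.atan(pixel_pitch_in_heights / viewing_distance_heights)
print(f"One pixel subtends {pixel_angle / (ARCSEC * 60):.2f} arc min")
# About 1.03 arc min, i.e. roughly the 1 min arc stated in the text.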
Stereoacuity is relatively robust to motion of the images across the retina.
Stereo thresholds are unaffected by motion of up to 2 degrees/second [41] and
depth can still be reliably reported for stimuli translating at 640 degrees/second
[42]. Stereoacuity does, however, degrade rapidly in the periphery. Rawlings
and Shipley [43] found that viewers could discriminate disparities of just a few
minutes of arc in the central visual field but performance dropped off rapidly as the
stimuli were moved to more eccentric locations. At 8 on either side of fixation,
thresholds were over 350 arc min. Fendick and Westheimer [44] reported better
stereoacuity in the periphery than Rawlings and Shipley and showed that stereoacuity improves with practice.
Often the viewer is fixating at a specific distance but must discern the depth
between a pair of objects at a different depth either in front of or beyond the
fixation point. This separation in depth between the fixation point and the pair of
objects is referred to as a disparity pedestal (Fig. 12.5). The ability to discriminate
between a pair of objects decreases exponentially as they are moved in front of or
beyond the point of fixation [45]. Again, the high resolution of the human
binocular system minimizes the impact of disparity pedestals in 3D displays. For
eccentricities up to 5° and disparity pedestals of approximately 30 arc min,
stereoacuity is approximately 1 arc min, about the minimum resolution of a typical
HDTV.
There is considerable variability in stereoscopic performance in the general
population. Some of the variability can be attributed to different testing methods
[46]. However, individual differences also contribute. For example, Tam and
Fig. 12.5 With the eyes converged at point P, objects A and B, separated by a relative depth interval, are located beyond the fixation point. This distance is called a disparity pedestal. From Howard and Rogers [40] Seeing in Depth, vol. 2, with permission from I Porteous. Copyright 2002

Stelmach [47] reported large individual differences in depth discrimination as a
function of display duration. Approximately half of the 100 observers tested could
reliably discriminate depth defined by disparities of 15-22.5 min arc at durations
of 20 ms. The remaining half of the observers required as much as 1,000 ms to
reliably discriminate the same depth intervals.

12.4.2 Disparity Fusion Limits


An image on one retina will perceptually fuse with a similar image presented to the
other eye so long as both images stimulate similar retinal areas. The extent of these
areas is called Panum's fusional area. Alternatively, the range of disparities that
give rise to single vision is called Panum's fusional range [48]. Larger disparities
outside Panum's fusional range do not fuse and the object is seen as double
(diplopia) (Fig. 12.6). Depth is still perceived with diplopia but it is a major source
of viewer discomfort in 3D media [49, 50].
Ogle [51] investigated responses to a wide range of disparities between a thin
vertical line and a central fixation point. Smaller disparities yielded a strong
impression of precise depth with sharply fused images. He referred to this as
patent stereopsis. At larger disparities, diplopia is experienced but depth is still
perceived, though less precise. Ogle referred to this as qualitative stereopsis.
Grove et al. [52] measured fusion thresholds for thin bars and extended
surfaces. They reasoned that larger disparities might be fused for larger images.
This does not seem to be the case. They found the same disparity fusion limits
(approximately 10 arc min) for thin and thick bars. Schor et al. [53] showed that
the lower spatial frequency stimuli remain fused at larger disparities than high
Fig. 12.6 Panum's fusional range around the empirical horizontal horopter. The empirical horopter intersects the fixation point (F). Objects within the gray region (points A, B and F) fall within Panum's fusional range and will appear in precise depth and single. Objects outside this region (points C and D) will appear double

spatial frequency stimuli. It seems that the upper fusion limit is determined by the
highest spatial frequency contained in an image.
If eye movements are not controlled, as is the case for commercial 3D displays,
very large disparities can be introduced and a vergence eye movement can
effectively reduce them. For example, Howard and Duke [54], and later Grove
et al. [55] showed that observers could reliably match the depth of a square
displaced in depth with crossed disparities as large as 3°. Although it is possible to
fuse very large screen disparities with accompanying vergence movements, it is
demanding and generally leads to fatigue and discomfort. Therefore, one recommendation for entertainment 3D media is to limit the range of disparities in the
display to [56].
Under conditions where observers must fixate on a static stimulus, the range of
fusible horizontal disparities is larger than the range of fusible vertical disparities.
For example, Grove et al. [52] found diplopia thresholds in central vision for
horizontal disparities were approximately 10 arc min but they were only 5 arc min
for vertical disparity.
The binocular system responds to vertical disparities resulting from vertical
misalignments of stereoscopic images by making compensatory eye movements to
align the eyes and eliminate the vertical disparities as much as possible. The
mechanism for vertical vergence eye movements integrates visual information
over a large part of the visual field [57]. This is probably why viewers can tolerate
rather large vertical offsets between left and right eye video sequences presented
on large displays [58, 17] but not on small displays [59]. The large displays will
stimulate compensatory vertical vergence eye movements, but the smaller displays
do not.
Vertical disparities arise in 3D media when the stereoscopic images are
acquired with two converged (toed-in) cameras. For example, filming a square object placed directly in front of two converged cameras will result in two trapezoidal images, one in each eye. The left side of the trapezoid will be taller in the left eye than the right eye and vice versa, introducing a gradient of vertical disparities across the width of the image. If these distortions are large enough, they could lead to double images. However, viewers seem to tolerate these vertical
disparities with little reduction in viewing comfort [60].
The maximum disparity that can be fused depends on the horizontal and vertical
spacing between the objects in depth. This has been operationally defined as the
disparity gradient, the disparity between two images divided by their angular
separation. Burt and Julesz [61] reported that for small dots a disparity gradient of
one was the boundary between fusion and diplopia. That is, when the angular
separation between two dots was equal to or less than the angular disparity,
diplopia was experienced. The concept of disparity gradient is somewhat problematic, however, because it is not clear how to specify the separation between
images wider than a small dot. Nevertheless, the presence of adjacent objects in a
3D scene affects the fusion of disparate images.
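To make the definition concrete, the calculation can be written out in a few lines of Python. This is only a minimal illustrative sketch (the function names and the example values are mine, not taken from the chapter); it simply applies the definition above and the gradient limit of one reported by Burt and Julesz.

def disparity_gradient(disparity_arcmin, separation_arcmin):
    # The disparity between two features divided by their angular separation
    # (both in the same angular units), as defined above.
    return disparity_arcmin / separation_arcmin

def likely_diplopic(disparity_arcmin, separation_arcmin, gradient_limit=1.0):
    # Burt and Julesz reported fusion breaking down for small dots once the
    # gradient reaches approximately one.
    return disparity_gradient(disparity_arcmin, separation_arcmin) >= gradient_limit

# Two dots 20 arc min apart carrying 10 arc min of relative disparity: gradient 0.5, fusible.
print(likely_diplopic(10.0, 20.0))   # False
# The same disparity with only 8 arc min of separation: gradient 1.25, likely diplopic.
print(likely_diplopic(10.0, 8.0))    # True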

12.5 Binocular Rivalry and Monocular Occlusion Zones


12.5.1 Binocular Rivalry
Images may fall on corresponding points in the two eyes but owing to photometric
differences in luminance or texture between the images, they will not fuse. Instead,
they engage in competition for conscious visibility, a process called binocular
rivalry [62, 63]. At a given point in time, the visible image is called the dominant
stimulus and the image that cannot be seen is called the suppressed image. Smaller
images of less than 2° in diameter tend to alternate in their entirety. Larger images
alternate in a piecemeal fashion called mosaic dominance (Fig. 12.7).
Binocular images that are equal in luminance contrast, but differ in orientation
(as in Fig. 12.7) will alternate such that each eye's image is visible approximately
50 % of the time. However, reducing the contrast of one of the eyes' images will
result in that image being visible less than 50 % and the higher contrast image
being visible more than 50 % [23].
A defocused image tends to be suppressed by a sharp image [65, 66]. The latter
group of researchers hypothesized that this has the adaptive consequence that one
eye's sharp image of objects closer in depth to the point of fixation (and point of
focus) tends to suppress blurry images of objects nearer or more distant than the
fixation point. This happens in the real world when peeking around corners.
Readers can demonstrate this for themselves by holding their hand in front of one
eye but keeping both eyes open while reading this text. With fixation on the page of
text, the image of the near hand is blurry and is suppressed but the text remains
visible (see also [67]).
The noted suppression of one eye's blurry image by a sharp image in the other
eye in basic visual perception research is directly relevant to the compression of
stereoscopic media for transmission. JPEG compression tends to introduce high
spatial frequency artifacts (see [66]).

Fig. 12.7 a Stimuli for binocular rivalry: left eye views vertical stripes and the right eye views horizontal stripes or vice versa. In b different perceptions depending on image size are shown. Small images alternate in their entirety. Larger images alternate in a piecemeal fashion (right panel in b). The reader can experience rivalry by free fusing the images in a

Low-pass filtering is another compression strategy that removes high spatial frequencies from the image at the cost of fine
detail. Meegan et al. [67] showed that when an uncompressed image was presented
to one eye and a compressed image was presented to the other eye, the perceived
quality of the fused image was dependent on the type of compression strategy.
When one image was blurred, the perceived quality of the fused image was close
to the uncompressed image. When JPEG compression was applied to one eye's
image, the fused image was degraded compared to the uncompressed image. This
is likely due to high spatial frequency artifacts introduced in the blocky image
suppressing corresponding regions in the uncompressed image. Therefore, this
applied research study combines nicely with the visual perception studies on
binocular rivalry discussed above, suggesting that low-pass filtering of one eye's
image is a promising compression strategy for 3D media.
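As a minimal sketch of what such an asymmetric scheme might look like, the fragment below low-pass filters one view with a Gaussian blur and leaves the other untouched. The function name, the choice of filter, and the sigma value are illustrative assumptions, not the procedure used in the studies cited above.

import numpy as np
from scipy.ndimage import gaussian_filter

def asymmetric_lowpass(left, right, sigma=1.5):
    # Blur only the right view; the rivalry literature above suggests the fused
    # percept will be dominated by the sharper (left) view.
    blurred = gaussian_filter(right.astype(np.float32), sigma=(sigma, sigma, 0))
    return left, blurred.astype(right.dtype)

# Example with random RGB images standing in for a real stereo pair.
rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
right = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
left_out, right_out = asymmetric_lowpass(left, right)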

12.5.2 Monocular Occlusion Zones


Situations where one object occludes a more distant object give rise to regions on
the more distant object that are visible to one eye but not the other, called
monocular occlusion zones. Consider Fig. 12.8 in which an opaque surface is in
front of a background. Of note here are the regions on the background to the left
and right of the near surface. The region on the background just to the right of the
near surface is visible to the right eye but not the left eye because it is blocked
from view by the near surface. A similar region exists to the left of the near surface
that is visible to the left eye but not to the right. Features in these areas have no
match in the other eye and therefore disparity is undefined.

Fig. 12.8 Monocular occlusion zones. The near surface blocks portions of the far surface from view for one of the black eyes but not the other. The gray eye in the center illustrates that translating between the two eyes' positions results in portions of the background becoming visible to that eye while other portions become invisible (See Sect. 12.7 for a discussion of dynamic depth cues). Adapted from Vision Research, vol. 30, 11, Nakayama and Shimojo [76] Da Vinci stereopsis: depth and subjective occluding contours from unpaired image points, p. 1811–1825, with permission from Elsevier. Copyright 1990

Moreover, these
monocular regions may differ in texture or luminance from the corresponding
region in the other eye. Considering the previous discussion on binocular rivalry,
monocular regions should constitute an impediment to stereopsis and potentially
induce rivalry.
Psychophysical investigations since the late 1980s have shown that monocular
occlusion zones consistent with naturally occurring occlusion surface arrangements resist rivalry [68] and contribute to binocular depth perception. For example
the presence, absence, or type of texture in monocular regions impacts the speed of
depth perception and the magnitude of depth perceived in relatively simple
laboratory stimuli as well as complex real world images [69]. When monocular
texture is present and can be interpreted as a continuation of the background
surface or object to which it belongs, perceived depth is accelerated [70] relative to
when that region is left blank. If, on the other hand, the monocular texture is very
different from the surrounding background texture, perceived depth is retarded or
even destroyed [71–73]. Furthermore, monocular occlusions can also elicit the
perception of depth between visible surfaces in the absence of binocular disparity
[74, 75]. Even more striking are demonstrations in which monocular features elicit
a 3D impression and the generation of an illusory surface created in perception
possibly to account for the absence of that feature in one eye's view [76–80]. For a
recent review, see Harris and Wilcox [81].

Fig. 12.9 A slanted line (a) or a partially occluded frontoparallel line (b) can generate identical horizontal disparities. Black rectangles below each eye illustrate the difference in width between the left and right eyes' images. The gray region in the right figure is a monocular occlusion zone. From Vision Research, vol. 44, 20, Gillam and Grove [86] Slant or occlusion: global factors resolve stereoscopic ambiguity in sets of horizontal lines, p. 2359–2366, with permission from Elsevier. Copyright 2004

In addition to affecting the latency for stereopsis and the magnitude of depth,
differential occlusion in the two eyes can have a dramatic effect on how existing
binocular disparities in the scene are resolved [82–85]. In some scenes, local horizontal
disparities are equally consistent with a slanted object and a flat object that is partially
occluded by a nearer surface. If a plausible occlusion solution is available, it is preferred.
For example, consider Fig. 12.9a. A contour that is slanted in depth generates
images of different widths in the two eyes. The corresponding endpoints are
matched and their disparities are computed. However, note that the same image
width differences of a frontoparallel line are generated when it is differentially
occluded in the two eyes by a nearer surface, as in Fig. 12.9b. In the latter case,
when an occluder is present, the right end of the line is noncorresponding.
Disparity computations are discarded here and the binocular images are interpreted as resulting from partial occlusion by the nearer surface [82, 83].
These human visual perception experiments complement applied work on 3D
displays and in 3D media production to solve the problem of unmatched features in
the two eyes resulting from objects entering and exiting at the edges of the 3D
display [86]. Depending on the image content, these unmatched features can elicit
rivalry, as discussed previously, or they can introduce spurious disparities resulting
in misperceived depth. A common technique is to apply masks to the left and right
sides of the display and stereoscopically move them forward in depth [87]. The
resulting floating window partially occludes the right eye's view of the right side of
the display and the left eye's view of the left side of the display.
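A floating window of this kind can be sketched very simply: black borders of different widths are applied to the two views so that the frame edges acquire crossed disparity. The fragment below is only an illustrative sketch under these assumptions (function name, border width, and a static, hard-edged mask); production floating windows are usually animated and feathered.

import numpy as np

def apply_floating_window(left, right, width_px=12):
    # Occlude the left eye's view of the left side and the right eye's view of the
    # right side, as described above; the masked frame then appears in front of
    # the screen and hides objects that would otherwise be cut off at the edges.
    left_out, right_out = left.copy(), right.copy()
    left_out[:, :width_px] = 0      # left edge of the left view
    right_out[:, -width_px:] = 0    # right edge of the right view
    return left_out, right_out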
In the context of DIBR, monocular occlusion zones are referred to as holes and
are a major source of display artifacts [12]. The synthesis of a second virtual
stereoscopic image from a 2D image and a depth map is possible for all parts of the image that are visible to both eyes in the fused scene. However, there is no
information in the original 2D image or the depth map about the content of
monocular occlusion zones. Indeed, these regions present as blank holes, known as
disocclusions in the computer graphics literature, in the synthesized image and
must be filled with texture. However, it is not clear how to fill these regions.
Algorithms that interpolate pixel information in the background produce visible
artifacts of varying severity depending on image content [2]. Furthermore, visual
resolution in monocular regions is equal to that in binocularly visible regions [88].
Therefore, blurring these regions or reducing their contrast is not a viable strategy
to conceal the artifacts.
The choice of texture to fill the holes could be informed by additional information provided in the depth map such that more than one depth and color is
stored for each image location [2]. However, this increases the computational load
and bandwidth requirements for transmission. Zhang and Tam [12] proposed a
method in which the depth map is preprocessed to smooth out sharp depth
discontinuities at occlusion boundaries. With increased smoothing, disocclusions
can be nearly eliminated without a significant decrease in subjective image quality.
Nevertheless, their informal observations revealed that object boundaries were less
sharp in images with the smoothed depth map and depth quality was somewhat
compromised.
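The general idea of this kind of preprocessing can be sketched as below, assuming a plain Gaussian blur of the depth map and a simple linear depth-to-disparity mapping; the filter choice, parameter values, and function names are illustrative assumptions rather than the specific algorithm evaluated in the study above.

import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_depth_map(depth, sigma=5.0):
    # Blurring sharp depth discontinuities reduces the size of the disocclusion
    # holes produced by warping, at the cost of softer object boundaries.
    return gaussian_filter(depth.astype(np.float32), sigma=sigma)

def depth_to_disparity(depth, max_disparity_px=16.0):
    # Map normalized depth (0 = far, 1 = near) to a per-pixel horizontal shift.
    return max_disparity_px * depth

# Toy example: a near square (depth 1) on a far background (depth 0).
depth = np.zeros((240, 320), dtype=np.float32)
depth[80:160, 120:200] = 1.0
shift_raw = depth_to_disparity(depth)                       # abrupt jump -> large holes
shift_smooth = depth_to_disparity(smooth_depth_map(depth))  # gradual transition -> fewer holes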
As reviewed above, however, monocular occlusion zones have a major impact
on human stereopsis. Therefore, more research is needed to determine the optimal
balance between the benefits accrued from depth map smoothing and the costs
associated with reducing the visibility of monocular zone information from these
displays.

12.6 Size Scaling and Disparity Scaling with Viewing Distance


12.6.1 Size Scaling
Our perception in artificial 3D environments is often characterized by illusions and
distortions that are not present in real-world viewing situations. One common
distortion in stereoscopic displays is the change in apparent size of an object as it is
stereoscopically displaced in depth though its size is unchanged on the display.
Some understanding of this distortion in artificial environments is gained if we
consider how the image on the retina is related to both the size of the object and
the registered distance between the viewer and object. This is referred to as the size–distance relation [89] and can be expressed mathematically as:
$h = 2d\,\tan(\alpha/2)$        (12.1)

where h is the linear height of the object, d is the distance between the object and the eye, and α is the angular subtense of the object at the retina. As can be deduced from Eq. (12.1), for an object of a given size, changes in distance lead to changes in the angular extent of the retinal image, such that a doubling of the viewing distance results in halving the angular extent of the retinal image. Thought of another way, to maintain a retinal image of a given size, the real object must shrink or grow as it approaches or recedes from the viewer. Consider Fig. 12.10. The inverted retinal image of the arrow corresponds to both the arrow at distance d and the arrow that is twice the height at distance 2d. For a fixed retinal image size, changing the distance of the object requires a change in the size of the object. Size scaling is a perceptual process leading to correctly perceiving an object's size as constant despite large changes in the size of the retinal image due to changes in the viewing distance [89].

Fig. 12.10 The geometric relationships among retinal size, perceived distance, and object size. The arrow at distance d generates the inverted image on the back of the eye. An object twice as tall generates an identical image from double the distance
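As a small numeric illustration of Eq. (12.1) (a minimal sketch; the function names are not from the chapter), the fragment below shows that doubling the distance roughly halves the angular subtense, and that keeping the angular subtense constant at double the distance requires doubling the object's height.

import math

def angular_size_deg(height, distance):
    # Invert Eq. (12.1): alpha = 2 * atan(h / (2d)), returned in degrees.
    return math.degrees(2.0 * math.atan(height / (2.0 * distance)))

def height_for_angle(alpha_deg, distance):
    # Eq. (12.1): h = 2d * tan(alpha / 2).
    return 2.0 * distance * math.tan(math.radians(alpha_deg) / 2.0)

print(angular_size_deg(1.0, 10.0))   # ~5.72 deg
print(angular_size_deg(1.0, 20.0))   # ~2.86 deg at double the distance
print(height_for_angle(5.72, 20.0))  # ~2.0: twice the height for the same angle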
Size scaling is a robust process in the real three-dimensional environment,
though it often breaks down in artificial environments. Indeed, when an object is
stereoscopically displaced nearer in depth, d is perceptually reduced but the retinal
image remains the same and the object perceptually shrinks. When the object is
stereoscopically displaced farther in depth, d perceptually increases and the
object's height is perceptually overestimated. The direction of the size distortions
is in the opposite direction to the expected change in size with distance. Therefore,
viewers could experience incorrect depth if they base their depth judgments on
size rather than disparity.
A well-known illusion related to size constancy is the puppet theater effect in
which familiar images, usually of humans, appear unusually small in 3D displays
as though they are puppets. The magnitude of this illusion is linked to the type of
camera configuration, with a greater illusion occurring for toed-in cameras than
parallel cameras [90, 91]. However, the flexibility of 3D reproduction afforded by
DIBR offers a solution to size distortions that can be implemented at any time after
capturing the images [2]. Potentially, size scaling of objects displaced in depth can
be programmed into the rendering algorithm and distortions associated with
camera positioning and configuration can also be corrected.

Fig. 12.11 The geometric relationship between perceived depth from disparity and viewing distance. The same binocular disparity corresponds to 4 times the perceived depth (Δd) when the viewing distance (D) is doubled

12.6.2 Disparity Scaling


Binocular disparity arising from a fixed depth interval is inversely proportional to
the viewing distance squared. The relationship among disparity, relative depth,
and distance is expressed mathematically as:
$\gamma = \dfrac{a\,\Delta d}{D^{2} + D\,\Delta d}$ (in radians)        (12.2)

where γ is the disparity in radians, a is the interocular distance, Δd is the relative depth between two objects, and D is the distance to the closer of the two objects.
For example, assuming an interocular distance of 6.5 cm, a depth interval of 2 cm
between a fixated object at 60 cm and a more distant object yields a binocular
disparity of 12.41 arc min. Doubling the viewing distance to 120 cm, that same
angular disparity corresponds to a linear depth difference of 8 cm, four times the
depth at the near viewing distance. This relationship is illustrated in Fig. 12.11.
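The worked example can be checked with a short calculation; the sketch below is illustrative only (variable and function names are not from the chapter). Note that the chapter's figures of 12.41 arc min and exactly 8 cm correspond to the common small-depth approximation in which the disparity is taken as aΔd/D²; the full expression of Eq. (12.2) gives values very close to these.

import math

ARCMIN_PER_RAD = 180.0 * 60.0 / math.pi

def disparity_arcmin(a_cm, delta_d_cm, D_cm):
    # Eq. (12.2): disparity = a * delta_d / (D**2 + D * delta_d), converted to arc min.
    disparity_rad = (a_cm * delta_d_cm) / (D_cm ** 2 + D_cm * delta_d_cm)
    return disparity_rad * ARCMIN_PER_RAD

def depth_for_disparity_cm(disp_arcmin, a_cm, D_cm):
    # Solve Eq. (12.2) for the relative depth that produces a given disparity.
    disparity_rad = disp_arcmin / ARCMIN_PER_RAD
    return disparity_rad * D_cm ** 2 / (a_cm - disparity_rad * D_cm)

disp = disparity_arcmin(6.5, 2.0, 60.0)
print(disp)                                      # ~12.0 arc min
print(depth_for_disparity_cm(disp, 6.5, 120.0))  # ~8.3 cm, roughly four times the 2 cm interval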
In order for a viewer to receive correct stereoscopic depth information, the
geometry of the acquisition system, including camera shooting distance and
separation between the cameras, must match that of the viewer's eyes and the
geometry of the viewing conditions [13]. These conditions are rarely achieved in
practice. For example, the viewing geometries for individuals in a 3D movie
audience differ depending on their position in the theater and can therefore differ
considerably from the original shooting conditions. Deviations from the geometry during acquisition are likely to be even greater when viewing 3D media on mobile
devices with small hand held displays.
Differences between parameters during image capture and those during image
viewing, combined with the fact that the relative depth from a given disparity
scales with the inverse of the square of viewing distance are likely contributing
factors to the so-called cardboard cutout phenomenon [40]. For example, a stereoscopic photograph of a group of people may, when viewed in a stereoscope,
yield the perception of a number of depth planes. However the volume of the
individual people is perceptually reduced such that they appear as cardboard
cutouts instead of their real volume. Typically, such photographs are taken from a
greater distance than the photos are viewed from. Therefore, when viewing these
photos, the vergence, accommodation, and other cues such as perspective signal a
shorter viewing distance. Depth is still appreciated between the individuals, but the
striking distortion, that the subjects appear as cardboard cutouts, results from the
disparities signaling the volumetric depth of their bodies being scaled down with
viewing distance. Attempts to eliminate this illusion by making the viewing
conditions as close to the capture conditions as possible have been partly
successful [92, 93] suggesting that at least part of the phenomenon is due to cue
conflicts arising from differences between shooting and viewing conditions.
A promising feature of DIBR is its flexibility in 3D reproduction [2]. When
combined with stereoscopic image capture using a parallel camera configuration,
the main system variables such as camera separation and convergence distance
need not be fixed from the time of shooting. Instead, these parameters can be
optimized in the rendering process to adapt to specific viewing conditions. This
would enable the same 3D media to be optimally presented in varying contexts
from cinemas to mobile devices.
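One way such an optimization could look is sketched below, using a standard pinhole, shift-sensor stereo model in which screen disparity is proportional to the virtual camera baseline. The function names, parameter values, and the simple proportional solve are illustrative assumptions, not a published DIBR rendering algorithm.

def max_screen_disparity_px(baseline_cm, focal_px, z_near_cm, z_far_cm, convergence_cm):
    # Largest pixel disparity produced for content between z_near and z_far when
    # the virtual cameras converge (via sensor shift) at the convergence distance.
    d_near = baseline_cm * focal_px * (1.0 / z_near_cm - 1.0 / convergence_cm)
    d_far = baseline_cm * focal_px * (1.0 / z_far_cm - 1.0 / convergence_cm)
    return max(abs(d_near), abs(d_far))

def baseline_for_budget(budget_px, focal_px, z_near_cm, z_far_cm, convergence_cm):
    # Disparity scales linearly with baseline in this model, so the baseline that
    # exactly meets a display-specific disparity budget follows by proportion.
    per_cm = max_screen_disparity_px(1.0, focal_px, z_near_cm, z_far_cm, convergence_cm)
    return budget_px / per_cm

# The same scene rendered for a small mobile screen with a tight disparity budget.
print(baseline_for_budget(budget_px=10.0, focal_px=1200.0,
                          z_near_cm=150.0, z_far_cm=1000.0, convergence_cm=300.0))  # 2.5 cm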
Harris [13] makes an important observation about the appropriateness of
modifying 3D content to maximize comfort versus accurate representation. In
entertainment settings such as movies, TV, and mobile devices it is reasonable to
sacrifice accuracy for viewing comfort. However, it is possible that DIBR
technology may be implemented in contexts where accuracy is very important
such as remote medical procedures and design CAD/CAM applications. Nevertheless, DIBR algorithms could be customized to support either entertainment or
industrial/medical applications.

12.7 Dynamic Cues to Depth and Their Interactions with Disparity
Stereopsis is based on the simultaneous comparison of the left and right eyes' images
with depth being coded from the disparities between the images. Depth from motion
parallax is captured sequentially either from an eye that translates through a path
perpendicular to the line of sight or a stationary eye viewing a translating three-dimensional object or scene. If the translation of the eye is equal in length to the individual's interocular distance, depth from motion parallax is geometrically identical to depth from stereopsis [94].

Fig. 12.12 The dynamic images cast on a translating eye contain parallax information about the relative depth of objects. For a right to left translation, the fixated dot remains stationary on the retina but the image of the more distant square travels from left to right across the retina. The opposite pattern is generated for left to right translations. The solid black eyes illustrate the binocular disparity when viewing this scene from the left and right eyes simultaneously

Figure 12.12 shows the relative motion of the
retinal images of two objects separated in depth as the viewer moves laterally. With
fixation on the closer black dot, as the eye translates from right to left, the image of the
dot remains on the fovea while the image of the more distant square moves across the
retina from its original position on the nasal side toward the temporal side. The total
distance moved by the image of the square is equal to the binocular disparity
generated viewing these two objects simultaneously with two stationary eyes. Note
that with fixation on the closer of the two objects, the motion of the image of the more
distant object across the retina is against the motion of the eye. With fixation on the
far object, the image of the closer object moves across the retina in the same direction
as the movement of the eye.
The minimum detectable depth from motion parallax is about the same as for
binocular disparity [95] if the parallax is due to active lateral head motion. Depth
thresholds are slightly higher when passively viewing a translating display
containing a depth interval. This could be because the retinal image is more
accurately stabilized during self-motion than when tracking a translating image
[96]. This is an important consideration for 3D media developers when adding
motion parallax as a cue to depth in their content. Greater visual fidelity will be
achieved if the relative motion is yoked to the viewers head movement than if
they passively view the display.
Like binocular disparity, motion parallax has an upper limit to the relative
motion of images on the retina after which perceived depth is degraded. Depth
from motion parallax is metrical and precise for relative motions up to approximately 20 arc min (of equivalent disparity) [97]. Increasing the relative motion
beyond this point results in the perception of relative motion and depth. Still further increasing the relative motion destroys the depth percept and only
relative motion is perceived. Viewers may be able to intuit the depth order from
the large relative motions, but depth is not directly perceived [98].
A closely related source of dynamic information is accretion and deletion of
background texture at the vertical edges of a near surface. For example, Fig. 12.8
shows monocular regions on a far surface owing to partial occlusion by a near
surface. Consider an eye starting at the position of the right eye in this figure and
translating to the position of the left eye. During this translation, parts of the
background that were visible to the eye in its initial position will gradually be
hidden from view by the closer occluder. Conversely, regions of the background
that were initially hidden from view gradually come into view. At a position
midway between the eyes (the gray eye in Fig. 12.8), part of the monocular zone
that was visible to the right is now partially hidden by the near occluder and part of
the monocular zone that was completely hidden to the right eye (the left eye
monocular zone) is now partially visible. Therefore, in the real 3D environment,
motion of the head yields an additional cue to depth, the gradual accretion and
deletion of background texture near the vertical edges of foreground objects and
surfaces [89, 99]. Dynamic accretion and deletion of texture is an additional
challenge for DIBR because filling in partially occluded texture regions will need
to be computed online. This introduces additional computational problems and
also requires an efficient strategy for choosing appropriate textures to avoid visible
artifacts (see [100]).

12.8 Binocular Development and Plasticity


Development of the human visual system occurs in stages with different acuities
maturing at different times. Light detection thresholds in dim light conditions
approach adult values within the first month after birth. Detection thresholds in all
luminance conditions approach adult levels by 10 weeks of age. A similar
development schedule is observed for temporal resolution as measured by critical
flicker frequency. Color vision, fine spatial acuity, orientation discrimination, and
discrimination of the direction of motion are all present by 12 weeks of age, but
performance improves and approaches adult levels by 6 months [20].
It is difficult to determine stereoacuity thresholds for infants since they cannot
perform the verbal and motor tasks used to test adults. Preferential looking is a
common paradigm. For example, Birch et al. [101] tested a large group of infants
between two and 12 months of age. They presented two displays side by side each
with three vertical bars. In one display all three bars were at the same depth while
in the other display the two outer bars carried a crossed disparity relative to the
central bar. They measured the smallest disparity for which the infants showed
preferential looking to the disparity stimulus 75 % of the time. Most five-month-olds showed this preference for disparities of 1 arc min. These data suggest that
stereoacuity quickly approaches adult levels in the first year of life. Depth perception from disparity alone has been demonstrated in children just over 3 months of age. Infants as young as 14 weeks will visually follow a disparity-defined contour moving from side to side. This test was conducted with dynamic
random-dot stereo displays in which the field of random dots was replaced on
every frame of the sequence, thus eliminating any monocular cues to the contours
motion [102].
Visual development is marked by critical periods during which normal inputs
are required for normal development. In the case of binocular vision, it is assumed
that normal binocular inputs early in life are required for the appropriate cortical
development to take place and normal stereopsis to arise. A common example of
compromised binocular input is strabismus, the turning of one eye relative to the
other such that the visual axes do not intersect at the point of fixation. Often input
from the turned eye is suppressed. Surgical correction of the turned eye is
beneficial but it is widely accepted that this should occur before stereopsis is fully
developed. Otherwise, stereopsis will not develop even if normal binocular input is
achieved later in life. Fawcett et al. [103] reported that compromised binocular
input that occurs as early as 2.4 months of age and as late as 4.6 years severely
disrupts development of stereopsis.
Although there is less research on changes in binocular vision as a function of
normal aging, the evidence suggests that motor control of the eyes and stereoacuity
remain stable after maturity. There is a slight decline in stereoacuity after
approximately 45 or 50 years of age [23, 20]. Very recently Susan Barry,
a research scientist, who had strabismus as a child that was later corrected through
surgery, published her account of how she recovered stereopsis through an intense
regimen of visual therapy at age 50 [104]. This case implies that the binocular
system has the capacity to reorganize in response to correlated inputs, a neural
plasticity later in life [105].

12.9 Conclusion
The goal of this chapter was to introduce the reader to psychophysical research on
human binocular vision and link this research with issues related to stereoscopic
imaging and processing. Beginning with a brief overview of the physiology of the
binocular system, this chapter then discussed the theoretical and empirical loci in
space from which corresponding points are stimulated in the two eyes. These loci
have implications for the ergonomics of 3D display shape and position as well as
defining the distributions of binocular disparities in displays to maximize viewer
comfort. Next, the minimum and maximum disparities that can be processed by
the binocular system were discussed. These should be considered by 3D media
developers in order to avoid visible artifacts and user discomfort. This chapter then
defined and explored binocular rivalry, highlighting implications for the choice of
compression algorithm when transmitting stereoscopic media. Mismatches due to
artifacts in the rendering process such as holes were analyzed in the context of the psychophysical literature on monocular occlusion zones. This chapter then


explored common size and depth illusions, highlighting that differences between
the conditions of shooting and those of presentation are a major cause. DIBR offers
a possible remedy to these illusions by optimizing the main shooting parameters
(i.e. camera separation) for the target display (e.g. cinema, TV or mobile device) in
the rendering process. Interactions between dynamic and static depth cues were
next considered highlighting the common geometry between disparity and motion
parallax as well as similar issues associated with DIBR (i.e. disocclusions). This
chapter concluded with a short discussion on the development of the binocular
visual system featuring a case suggesting the brain is able to reorganize after early
development. At the time of writing, however, no systematic studies to the
author's knowledge have documented permanent changes in viewers' visual
systems as a result of viewing 3D content.
Acknowledgments Parts of this chapter were written while the author was on Special Study
Leave from the School of Psychology, The University of Queensland Australia. The author
thanks Peter Howarth and a second anonymous reviewer for helpful comments on earlier versions
of this chapter. Thanks to Nonie Finlayson for editorial help and assistance with the figures.

References
1. Freeman J, Avons S (2000) Focus group exploration of presence through advanced
broadcast services. Proc SPIE 3959:530539
2. Shibata T, Kurihara S, Kawai T, Takahashi T, Shimizu T, Kawada R, Ito A, Häkkinen J,
Takatalo J, Nyman G (2009) Evaluation of stereoscopic image quality for mobile devices
using interpretation based quality methodology. Proc SPIE 7237:72371E. doi:10.1117/
12.807080
3. Fehn C, De La Barre R, Pastoor S (2006) Interactive 3D-TVconcepts and key
technologies. Proc IEEE 94(3):524538. doi:10.1109/JPROC.2006.870688
4. Zhang L, Vázquez C, Knorr S (2011) 3D-TV content creation: automatic 2D-3D video
conversion. IEEE T Broad 57(2):372383
5. Yano S, Ide S, Mitsuhashi T, Thwaites H (2002) A study of visual fatigue and visual comfort
for 3D HDTV/HDTV images. Displays 23:191201. doi:10.1016/S0141-9382(02)00038-0
6. Nojiri Y, Yamanoue H, Hanazato A, Okana F (2003) Measurement of parallax distribution,
and its application to the analysis of visual comfort for stereoscopic HDTV. Proc SPIE
5006:195205. doi:10.1117/12.474146
7. Nojiri Y, Yamanoue S, Ide S, Yano S, Okana F (2006) Parallax distributions and visual
comfort on stereoscopic HDTV. Proc IBC 2006:373380
8. Patterson R, Silzars A (2009) Immersive stereo displays, intuitive reasoning, and cognitive
engineering. J SID 17(5):443448. doi:10.1889/JSID17.5.443
9. Meesters LMJ, IJsselsteijn WA, Seuntiëns PJH (2004) A survey of perceptual evaluations
and requirements of three-dimensional TV. IEEE T Circuits Syst 14(3):381391.
doi:10.1109/TCSVT.2004.823398
10. Lambooij M, IJsselsteijn W, Fortuin M, Heynderickx I (2009) Visual discomfort and visual
fatigue of stereoscopic displays: a review. J Imaging Sci Tech 53(3):0302010302014.
doi:10.2352/J.ImagingSci.Technol.2009.53.3.030201
11. Daly SJ, Held R, Hoffman DM (2011) Perceptual issues in stereoscopic signal processing.
IEEE T Broadcast 57(2):347361. doi:10.1109/TBC.2011.2127630


12. Tam WJ, Speranza F, Yano S, Shimono K, Ono H (2011) Stereoscopic 3D-TV: visual
comfort. IEEE T Broad 57(2):335346. doi:10.1109/TBC.2005.846190
13. Zhang L, Tam WJ (2005) Stereoscopic image generation based on depth images for 3D TV.
IEEE T Broad 51(2):191199. doi:10.1109/TBC.2005.846190
14. Harris JM (2010) Monocular zones in stereoscopic scenes: a useful source of information
for human binocular vision? Proc SPIE 7524:111. doi:10.1117/12.837465
15. Grove PM, Ashida H, Kaneko H, Ono H (2008) Interocular transfer of a rotational motion
aftereffect as a function of eccentricity. Percept 37:11521159. doi:10.1068/p5771
16. Mather G (2006) Foundations of perception. Psychology Press, New York
17. Hennessy RT, Iida T, Shina K, Leibowitz HW (1976) The effect of pupil size on
accommodation. Vis Res 16:587589. doi:10.1016/0042-6989(76)90004-3
18. Allison RS (2007) Analysis of the influence of vertical disparities arising in toed-in
stereoscopic cameras. J Imaging Sci and Tech 51(4):317327
19. Kertesz AE, Sullivan MJ (1978) The effect of stimulus size on human cyclofusional
response. Vis Res 18(5):567571. doi:10.1016/0042-6989(78)90204-3
20. Curcio CA, Allen KA (1990) Topography of ganglion cells in human retina. J Comp Neurol
300:525. doi:10.1002/cne.903000103
21. Mariotte E (1665) A new discovery touching vision. Philos Trans v 3:668669
22. Steinman SB, Steinman BA, Garzia RP (2000) Foundations of binocular vision: a clinical
perspective. McGraw-Hill, New York
23. Hubel DH, Wiesel TN (1959) Receptive fields of single neurons in the cat's visual cortex.
J Physiol 148:574591
24. Barlow HB, Blakemore C, Pettigrew JD (1967) The neural mechanisms of binocular depth
discrimination. J Physiol 193:327342
25. Howard IP (2002) Seeing in depth vol 1 basic mechanisms. Porteous, Toronto
26. Nagahama Y, Takayama Y, Fukuyama H, Yamauchi H, Matsuzaki S, Magata MY,
Shibasaki H, Kimura J (1996) Functional anatomy on perception of position and motion in
depth. Neuroreport 7(11):17171721
27. Vieth G (1818) Über die Richtung der Augen. Ann Phys 58(3):233253
28. Müller J (1826) Zur vergleichenden Physiologie des Gesichtssinnes des Menschen und der
Thiere. Cnobloch, Leipzig
29. Howarth PA (2011) The geometric horopter. Vis Res 51:397399. doi:10.1016/
j.visres.2010.12.018
30. Ames A, Ogle KN, Gliddon GH (1932) Corresponding retinal points, the horopter, and size
and shape of ocular images. J Opt Soc Am 22:575631
31. Shipley T, Rawlings SC (1970) The nonius horopterII. An experimental report. Vis Res
10(11):12631299. doi:10.1016/0042-6989(70)90040-4
32. Helmholtz H (1925) Helmholtz's treatise on physiological optics. In: Southall JPC (ed) Handbuch der physiologischen Optik, vol 3. Optical Society of America, New York
33. Ledgeway T, Rogers BJ (1999) The effects of eccentricity and vergence angle upon the
relative tilt of corresponding vertical and horizontal meridian revealed using the minimum
motion paradigm. Percept 28:143153. doi:10.1068/p2738
34. Siderov J, Harwerth RS, Bedell HE (1999) Stereopsis, cyclovergence and the backwards tilt
of the vertical horopter. Vis Res 39(7):12471357. doi:10.1016/S0042-6989(98)00252-1
35. Grove PM, Kaneko H, Ono H (2001) The backward inclination of a surface defined by
empirical corresponding points. Percept 30:411429. doi:10.1068/p3091
36. Schreiber KM, Hillis JM, Filippini HR, Schor CM, Banks MS (2008) The surface of the
empirical horopter. J Vis 8(3):120. doi:10.1167/8.3.7
37. Cooper EA, Burge J, Banks MS (2011) The vertical horopter is not adaptable, but it may be
adaptive. J Vis 11(3):119. doi:10.1167/11.3.20
38. Fischer FP (1924) III. Experimentelle Beiträge zum Begriff der Sehrichtungsgemeinschaft der Netzhäute auf Grund der binokularen Noniusmethode. In: Tschermak A (ed) Fortgesetzte Studien über Binokularsehen. Pflügers Archiv für die gesamte Physiologie des Menschen und der Tiere vol 204, pp 234246


39. Ankrum DR, Hansen EE, Nemeth KJ (1995) The vertical horopter and the angle of view. In:
Greico A, Molteni G, Occhipinti E, Picoli B (eds) Work with display units 94. Elsevier,
New York
40. Cogen A (1979) The relationship between the apparent vertical and the vertical horopter.
Vis Res 19(6):655665. doi:10.1016/0042-6989(79)90241-4
41. Howard IP, Rogers BJ (2002) Seeing in depth vol 2 depth perception. Porteous, Toronto
42. Westheimer G, McKee SP (1978) Stereoscopic acuity for moving retinal images. J Opt Soc
Am 68(4):450455. doi:10.1364/JOSA/68.000450
43. Morgan MJ, Castet E (1995) Stereoscopic depth perception at high velocities. Nature
378:380383. doi:10.1038/378380a0
44. Rawlings SC, Shipley T (1969) Stereoscopic activity and horizontal angular distance from
fixation. J Opt Soc Am 59:991993
45. Fendick M, Westheimer G (1983) Effects of practice and the separation of test targets on
foveal and peripheral stereoacuity. Vis Res 23(2):145150. doi:10.1016/00426989(83)90137-2
46. Blakemore C (1970) The range and scope of binocular depth discrimination in man.
J Physiol 211:599622
47. Patterson R, Fox R (1984) The effect of testing method on stereoanomoly. Vis Res
24(5):403408. doi:10.1016/0042-6989(84)90038-5
48. Tam WJ, Stelmach LB (1998) Display duration and stereoscopic depth discrimination. Can
J Exp Psychol 52(1):5661
49. Panum PL (1858) Physiologische Untersuchungen über das Sehen mit zwei Augen. Schwers, Kiel
50. Speranza F, Tam WJ, Renaud R, Hur N (2006) Effect of disparity and motion on visual
comfort of stereoscopic images. Proc SPIE 6055:60550B. doi:10.1117/12.640865
51. Wöpking M (1995) Visual comfort with stereoscopic pictures: an experimental study on the
subjective effects of disparity magnitude and depth of focus. J SID 3:10101103.
doi:10.1889/1.1984948
52. Ogle KN (1952) On the limits of stereoscopic vision. J Exp Psychol 44(4):253259.
doi:10.1037/h0057643
53. Grove PM, Finlayson NJ, Ono H (2011) The effect of stimulus size on stereoscopic fusion
limits and response criteria. i-Percept 2(4):401. doi:10.1068/ic401
54. Schor CM, Wood IC, Ogawa J (1984) Binocular sensory fusion is limited by spatial
resolution. Vis Res 24(7):661665. doi:10.1016/0042-6989(84)90207-4
55. Howard IP, Duke PA (2003) Monocular transparency generates quantitative depth. Vis Res
43(25):26152621. doi:10.1016/S0042-6989(03)00477-2
56. Grove PM, Sachtler WL, Gillam BJ (2006) Amodal completion with the background
determines depth from monocular gap stereopsis. Vis Res 46:37713774. doi:10.1016/
j.visres.2006.06.020
57. Seigel M, Nagata S (2000) Just enough reality: comfortable 3-D viewing via
microstereopsis. IEEE T Circuits Syst 10(3):387396. doi:10.1109/76.836283
58. Howard IP, Fang X, Allison RS, Zacher JE (2000) Effects of stimulus size and eccentricity
on horizontal and vertical vergence. Exp Brain Res 130:124132. doi:10.1007/
s002210050014
59. Speranza F, Wilcox LM (2002) Viewing stereoscopic images comfortably: the effects of
whole-field vertical disparity. Proc SPIE 4660:1825. doi:10.1117/12.468047
60. Kooi FL, Toet A (2004) Visual comfort of binocular and 3D displays. Displays
25(23):99108. doi:10.1016/j.displa.2004.07.004
61. Stelmach L, Tam WJ, Speranza F, Renaud R, Martin T (2003) Improving the visual comfort
of stereoscopic images. Proc SPIE 5006:269282. doi:10.1117/12.474093
62. Burt P, Julesz B (1980) A disparity gradient limit for binocular fusion. Science
208(4444):615617. doi:10.1126/science.7367885
63. Levelt WJM (1968) On binocular rivalry. Mouton, The Hague
64. Alais D, Blake R (2005) Binocular rivalry. MIT Press, Cambridge


65. Humphriss D (1982) The psychological septum. An investigation into its function. Am J
Optom Physiol Opt 59(8):639641
66. Ono H, Lillakas L, Grove PM, Suzuki M (2003) Leonardo's constraint: two opaque objects
cannot be seen in the same direction. J Exp Psychol: Gen 132(2):253265. doi:10.1037/
0096-3445.132.2.253
67. Arnold DH, Grove PM, Wallis TSA (2007) Staying focused: a functional account of
perceptual suppression during binocular rivalry. J Vis 7(7):18. doi:10.1167/7.7.7
68. Seuntiens P, Meesters L, IJsselsteijn W (2006) Perceived quality of compressed
stereoscopic images: effects of symmetric and asymmetric JPEG coding and camera
separation. ACM Trans Appl Percept 3(2):95109. doi:10.1145/1141897.1141899
69. Meegan DV, Stelmach LB, Tam WJ (2001) Unequal weighting of monocular inputs in
binocular combination: implications for the compression of stereoscopic imagery. J Exp
Psychol Appl 7:143153. doi:10.1037/1076-898X.7.2.143
70. Shimojo S, Nakayama K (1990) Real world occlusion constraints and binocular rivalry. Vis
Res 30:6980. doi:10.1016/0042-6989(90)90128-8
71. Wilcox L, Lakra DC (2007) Depth from binocular half-occlusions in stereoscopic images of
natural scenes. Percept 36:830839. doi:10.1068/p5708
72. Gillam B, Borsting E (1988) The role of monocular regions in stereoscopic displays. Percept
17(5):603608. doi:10.1068/p170603
73. Grove PM, Ono H (1999) Ecologically invalid monocular texture leads to longer perceptual
latencies in random-dot stereograms. Percept 28:627639. doi:10.1068/p2908
74. Grove PM, Gillam B, Ono H (2002) Content and context of monocular regions determine
perceived depth in random dot, unpaired background and phantom stereograms. Vis Res
42(15):18591870. doi:10.1016/S0042-6989(02)00083-4
75. Grove PM, Brooks K, Anderson BL, Gillam BJ (2006) Monocular transparency and
unpaired stereopsis. Vis Res 46(18):30423053. doi:10.1016/j.visres.2006.05.003
76. Gillam B, Blackburn S, Nakayama K (1999) Stereopsis based on monocular gaps: metrical
encoding of depth and slant without matching contours. Vis Res 39(3):493502.
doi:10.1016/S0042-6989(98)00131-X
77. Forte J, Peirce JW, Lennie P (2002) Binocular integration of partially occluded surfaces. Vis
Res 42(10):12251235. doi:10.1016/S0042-6989(02)00053-6
78. Nakayama K, Shimojo S (1990) Da Vinci stereopsis: depth and subjective occluding
contours from unpaired image points. Vis Res 30:18111825. doi:10.1016/00426989(90)90161-D
79. Anderson BL (1994) The role of partial occlusion in stereopsis. Nature 367:365368.
doi:10.1038/367365a0
80. Liu L, Stevenson SB, Schor CM (1994) Quantitative stereoscopic depth without binocular
correspondence. Nature 267(6458):6669. doi:10.1038/367066a0
81. Gillam B, Nakayama K (1999) Quantitative depth for a phantom surface can be based on
cyclopean occlusion cues alone. Vis Res 39:109112. doi:10.1016/S0042-6989(98)00052-2
82. Tsirlin I, Wilcox LM, Allison RS (2010) Monocular occlusions determine the perceived
shape and depth of occluding surfaces. J Vis 10(6):112. doi:10.1167/10.6.11
83. Harris JM, Wilcox LM (2009) The role of monocularly visible regions in depth and surface
perception. Vis Res 49:26662685. doi:10.1016/j.visres.2009.06.021
84. Häkkinen J, Nyman G (1997) Occlusion constraints and stereoscopic slant. Percept
26:2938. doi:10.1068/p260029
85. Grove PM, Kaneko H, Ono H (2003) T-junctions and perceived slant of partially occluded
surfaces. Percept 32:14511464. doi:10.1068/p5054
86. Gillam B, Grove PM (2004) Slant or occlusion: global factors resolve stereoscopic
ambiguity in sets of horizontal lines. Vis Res 44(20):23592366. doi:10.1016/
j.visres.2004.05.002
87. Grove PM, Byrne JM, Gillam B (2005) How configurations of binocular disparity determine
whether stereoscopic slant or stereoscopic occlusion is seen. Percept 34:10831094.
doi:10.1068/p5274


88. Ohtsuka S, Ishigure Y, Janatsugu Y, Yoshida T, Usui S (1996) Virtual window: a technique
for correcting depth-perception distortion in stereoscopic displays. Soc Inform Disp Symp
Dig 27:893898
89. Mendiburu B (2009) 3D movie making: stereoscopic digital cinema from script to screen.
Focal Press, Oxford
90. Liu J (1995) Stereo image compressionthe importance of spatial resolution in half
occluded regions. Proc SPIE 2411:271276. doi:10.1117/12.207545
91. Palmer SE (1999) Vision science: photons to phenomenology. MIT Press, Cambridge
92. Yamanoue H (1997) The relation between size distortion and shooting conditions for
stereoscopic images. SMPTE J 106:225232. doi:10.5594/L00566
93. Yamanoue H, Okui M, Okano F (2006) Geometrical analysis of puppet-theatre and
cardboard effects in stereoscopic HDTV images. IEEE T Circuits Tech 16(6):744752.
doi:10.1109/TCSVT.2006.875213
94. Sato T, Kitazaki M (1999) Cardboard cut-out phenomenon in virtual-reality environment.
Percept 28:125 ECVP abstract supplement
95. Rogers BJ (2002) Charles wheatstone and the cardboard cut-out phenomenon. Percept 31:58
ECVP abstract supplement
96. Gillam B, Palmisano SA, Govan DG (2011) Depth interval estimates from motion parallax
and binocular disparity beyond interaction space. Percept 40:3949. doi:10.1068/p6868
97. Bradshaw MF, Rogers BJ (1999) Sensitivity to horizontal and vertical corrugations defined
by binocular disparity. Vis Res 39(18):30493056. doi:10.1016/S0042-6989(99)00015-2
98. Cornilleau-Pérès V, Droulez J (1994) The visual perception of three-dimensional shape
from self-motion and object-motion. Vis Res 34(18):23312336. doi:10.1016/00426989(94)90279-8
99. Ono H, Ujike H (2005) Motion parallax driven by head movements: Conditions for visual
stability, perceived depth, and perceived concomitant motion. Percept 24:477490.
doi:10.1068/p5221
100. Ono H, Wade N (2006) Depth and motion perceptions produced by motion parallax. Teach
Psychol 33:199202
101. Ono H, Rogers BJ, Ohmi M (1988) Dynamic occlusion and motion parallax in depth
perception. Percept 17:255266. doi:10.1068/p170255
102. Wilcox L, Tsirlin I, Allison RS (2010) Sensitivity to monocular occlusions in stereoscopic
imagery: Implications for S3D content creation, distribution and exhibition. In: Proceedings
of SMPTE international conference on stereoscopic 3D for media and entertainment
103. Birch EE, Gwiazda J, Held R (1982) Stereoacuity development for crossed and uncrossed
disparities in human infants. Vis Res 22(5):507513. doi:10.1016/0042-6989(82)90108-0
104. Fox R, Aslin RN, Shea SL, Dumais ST (1980) Stereopsis in human infants. Science
207(4428):323324. doi:10.1126/science.7350666
105. Fawcett SL, Wang Y, Birch EE (2005) The critical period for susceptibility of human stereopsis. Invest Ophth Vis Sci 46(2):521525. doi:10.1167/iovs.04-0175
106. Barry SR (2009) Fixing my gaze: a scientist's journey into seeing in three dimensions. Basic
Books, New York
107. Blake R, Wilson H (2011) Binocular vision. Vis Res 51(7):754770. doi:10.1016/
j.visres.2010.10.009

Chapter 13
Stereoscopic and Autostereoscopic Displays

Phil Surman

Abstract This chapter covers the state of the art in stereoscopic and autostereoscopic displays. The coverage is not exhaustive but is intended to provide, in the relatively limited space available, a reasonably comprehensive snapshot of the current state of the art. In order to give a background to this, a brief
introduction to stereoscopic perception and a short history of stereoscopic displays
is given. Holography is not covered in detail here as it is really a separate area of
study and also is not likely to be the basis of a commercially viable display within
the near future.




Keywords Autostereoscopic display · Binocular parallax · Crosstalk · Depth cue · Disparity · Geometrical distortion · Glasses · Head-tracked display · Image artifact · Integral imaging · Light field display · Monocular cue · Multi-view display · Stereoscopic display · Stereoscopic perception · Viewing zone · Volumetric display

13.1 Stereoscopic Perception


There are many factors involved in stereoscopic perception, and the relative
importance of contributory perceptual factors in natural vision is not necessarily
the same as for an artificial image. Consider a natural scene being observed; when
one eye is covered the image appears to lose very little realism. However, images
reproduced on a display appear to be considerably more realistic when stereo is
applied to the image.

P. Surman
Imaging and Displays Research Group, De Montfort University, Leicester, UK
e-mail: psurman@dmu.ac.uk

Fig. 13.1 Oculomotor and visual cues: oculomotor cues involve muscles controlling the eyes. Visual cues relate to monocular where depth information is determined by image content and binocular where disparity provides information

This is particularly well demonstrated if an image is viewed through Pulfrich glasses; when the camera goes from being static to panning the
image appears to come alive as the viewer sees the same image in 3D (provided
the camera is moving in the correct direction). Pulfrich stereo is considered in
more detail later in this section. The first reference to the richness of monoscopic
cues was in the book The Theory of Stereoscopic Transmission first published in
1953 [1].
This section describes the ways in which stereo is perceived. It covers the
physical oculomotor cues of accommodation and convergence and then the visual
cues. The visual cues can be monocular where the images received by the brain are
interpreted using their content alone, or they can be binocular where the brain
utilizes the differences in the eyes' images as these are captured at two different
viewpoints.
The oculomotor cues are accommodation and convergence and involve the
actual positioning and focus of the eyes. Accommodation is the ability of the lens
of the eye to change its power in accordance with distance so that the region of
interest in the scene is in focus. Convergence is the ability of the eyes to adjust
their visual axes in order for the region of interest to focus on to the fovea of each
eye. Figure 13.1 shows the complete range of cues.
Monocular cues are determined only by the content of the images. Two-dimensional images provide a good representation of actual scenes, and paintings,
photographs, television, and cinema all provide realistic depictions of the real
world due to their wealth of monocular cues. There are many of these and a
selection of the principal cues is given below.
The most obvious monoscopic cue is that of occlusion where nearby objects
hide regions of objects behind them. Figure 13.2 shows some of the monocular
cues and it can be seen in Fig. 13.2a that the cube A is at the front as this occludes
sphere B which in turn occludes sphere C. In the upper figure the depth order
(from the front) has been changed to C A B by altering the occlusion.

Fig. 13.2 Monocular cues. a Relative distances of objects inferred from occlusions. b Parallel lines converge at horizon. c Relative distance inferred by assuming all these objects are around same actual size

In Fig. 13.2b the effect of linear parallax is shown where the lower end of the
beam appears closer to the observer as it is wider than the top end. This is in
accordance with the rules of perspective where parallel lines converge at a point on
the horizon known as the vanishing point and where it can be inferred that the
closer together the lines appear to be, the closer they are to the vanishing point.
In the example of size constancy shown in Fig. 13.2c the smallest of the
matchstick figures appears to be the furthest as the assumption is made by the
observer that each of the figures is around the same actual size as the others so that
the one subtending the smallest angle is furthest away.
In each of the above examples certain assumptions are made by the observer
that are generally correct, but not necessarily true at all times. In the case of
occlusion for instance, a real object could be specially constructed to fool the
visual system from one viewpoint but not when observed from another. With
linear parallax the rules of perspective are assumed to apply; however, before
humans lived in structures comprising rectangular sides and were surrounded by
rectangular objects, the so-called rules of perspective probably had little meaning.
There are many other monocular cues that are too numerous to describe here;
these include: texture gradient, aerial perspective, lighting and shading, and
defocus blur. The most important other monocular cue that is not dependent on the
static content of the image is motion parallax. This is the effect of a different
perspective being observed at different lateral positions of the head [2]. This
causes the images of objects that are closer to appear to move more rapidly on the
retina as the head is moved. In this way the relative position of an object can be
inferred by observing it against the background while moving head position.
Figure 13.3 shows the effect of observer position where it is apparent that depth
information can be implied from the series of observed images.
The strongest stereoscopic cue is binocular parallax where depth is perceived
by each eye receiving a different perspective of the scene so that fusion of the images by the brain gives the effect of depth.

Fig. 13.3 Motion parallax: closer objects show greater displacement in apparent position as observation position moves laterally so that relative distances of objects are established

Different perspectives are illustrated
in Fig. 13.4a where in this case the appearance of the cube is slightly different for
each eye. Binocular parallax results in disparity in a stereoscopic image pair where
points in the scene are displayed at different lateral positions in each image.
Although the monocular cues mentioned previously provide an indication of
depth, the actual sensation of depth is provided by disparity. This sensation is
perceived even in the absence of any monocular cues and can be convincingly
illustrated with the use of Julesz random dot stereograms [3]. An example of these
is shown in Fig. 13.4b; in this figure there are no monocular cues whatsoever;
however, when the image pair is observed with a stereoscopic viewer a square can
be seen in front of a flat background.
It is also possible to see the square without the use of a stereoscopic viewer by
using the technique known as cross-eyed stereo. With this technique the eyes are
crossed so that the left and right images are combined in the brain. The ability to
do this can be quite difficult to acquire and can cause considerable eyestrain. The
stereoscopic effect is also reversed (pseudoscopic) so that in this case a square
located behind a flat background is observed.
The sensation of 3D can be seen by other means, among these being the Pulfrich
effect and the kinetic depth effect. The Pulfrich effect was first reported by the
German physicist Carl Pulfrich in 1922 [4]. It was noticed that when a pendulum is
observed with one eye covered by a dark filter the pendulum appears to follow an
elliptical path. The explanation for this is that the dimmer image effectively takes
longer to reach the brain. In Fig. 13.5, it can be seen that the actual position of the
pendulum is A but at that instant the apparent position lies at position B on line XX due to the delay in the visual system. This gives an apparent position at point C where the lines XX and YY intersect.

Fig. 13.4 Binocular parallax. a Different perspectives of the cube seen by each eye to give the effect of depth. b Even with no content cues, depth is seen in random dot stereograms (in this case a square)

Although Pulfrich stereo can be shown on a
two-dimensional image without the loss of resolution or color rendition and without
the use of special cameras, its use is limited and it is only suitable for novelty
applications as either the cameras have to be continually panning or the subject must
keep moving in relation to its background.
The kinetic depth effect is an illusion where the three-dimensional structure of
an object in a two-dimensional image is revealed by movement of the image [5].
The illusion is particularly interesting as images appear to be three dimensional
even when viewed with one eye. Unlike Pulfrich stereo, the kinetic depth effect has
some useful applications, for example, it can be used to enable images from airport
security X-ray scanners to be seen in 3D. The virtual position of the object is
moved in a rocking motion in order to reveal objects that might otherwise be
missed by merely observing the customary static false color images [6].
To summarize the usefulness of the depth effects on stereoscopic displays,
motion parallax is useful but not absolutely necessary and it should be borne in
mind that in many viewing situations, viewers tend to be seated and hence fairly
static so that the look-around capability provided by motion parallax adds very
little to the effect. The sensation of depth is not particularly enhanced by the
oculomotor cues of accommodation and convergence but where there is conflict
between these two, discomfort can occur and this is considered in more detail in
the following section. The kinetic depth effect has restricted use and the Pulfrich
effect has no apparent practical use. Even if all of the aforementioned cues are
present, 3D will not be perceived unless binocular parallax is present.

13.2 Stereoscopic Display


In principle a stereoscopic display can take many forms ranging from two-image
stereo, multi-view with a relatively small number of views, super multi-view
(SMV) with a large number of views, and full parallax systems such as integral imaging or holography where the image is a faithful reproduction of the original scene. A detailed description of the various methods is given in Sect. 13.4.

Fig. 13.5 Pulfrich effect: a pendulum swinging in a straight line appears to follow an elliptical path when viewed through one dark filter; the message in this channel effectively takes longer to reach the brain

13.2.1 Disparity
As described previously, binocular parallax is the most important visual depth cue.
In a display where a stereo pair is presented to the users eyes, if a point in the
image is not located in the plane of the screen then it must appear in two different
lateral positions on the screen, one position for the left eye and one for the right
eye. This is referred to as disparity.
When an image point occupies the same position on the screen for both the left
and right eyes it has what is referred to as zero disparity and this is the condition
when a normal two-dimensional image is observed (Fig. 13.6a). In this case the
image point will obviously appear to be located at the plane of the screen
(Point Z).
When an object appears behind the plane of the screen as in Fig. 13.6b the
disparity is referred to as uncrossed (or positive). It can be seen that the object will
appear to be point U behind the screen where the lines passing through the pupil
centers and the displayed positions on the screen intersect. Similarly, for crossed
(or negative disparity) shown in Fig. 13.6c the apparent position of the image point
will be at the intersection at position C in front of the screen.
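The apparent depth follows directly from similar triangles through the two pupils. The short sketch below illustrates this relationship; the 65 mm interocular separation, the 2 m viewing distance and the function name are illustrative assumptions rather than values taken from this chapter.

    # Apparent distance of a fused image point from its on-screen disparity.
    # Minimal sketch assuming simple pinhole geometry: d is the disparity on the
    # screen in metres (positive = uncrossed, negative = crossed), e the
    # interocular separation and D the viewing distance. Values are illustrative.
    def perceived_distance(d, e=0.065, D=2.0):
        if d >= e:
            raise ValueError("disparity must be smaller than the eye separation")
        return D * e / (e - d)  # similar triangles through the two pupils

    if __name__ == "__main__":
        for d_mm in (0.0, 10.0, -10.0):  # zero, uncrossed and crossed disparity
            print(f"{d_mm:+.0f} mm disparity -> {perceived_distance(d_mm / 1000):.2f} m")

With these assumed values, zero disparity places the point in the screen plane, +10 mm of uncrossed disparity places it about 2.4 m away (behind the screen, as in Fig. 13.6b), and -10 mm of crossed disparity places it about 1.7 m away (in front of the screen, as in Fig. 13.6c).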

13.2.2 Accommodation/Convergence Conflict (Mismatch)


The oculomotor cues of accommodation and convergence provide an indication of
the distance of the region of interest. When observing a natural scene the eyes
focus at the same distance as they converge. When a stereoscopic pair is viewed on a display the eyes will always focus on the screen; however, the apparent distance and hence the convergence (also referred to as vergence) will invariably be different, as shown in Fig. 13.7.

Fig. 13.6 Disparity: disparity gives the appearance of depth when two 2D images are fused. In uncrossed disparity the eyes converge behind the screen so the object appears to be further away than the screen; the opposite applies for crossed disparity (a zero disparity, b uncrossed disparity, c crossed disparity)

In this chapter accommodation/convergence
conflict will be referred to as A/C conflict. If this conflict is too great visual
discomfort occurs and this is the case of cross-eyed stereo where the eyes are
converging at half the focusing distance.
There are various criteria mentioned in the literature regarding the acceptable
maximum level of conflict and the tolerance between individuals will vary. Also, if
objects occasionally jump excessively out of the screen this may be tolerable but
if it happens frequently viewer fatigue can occur.
There are two widely published criteria for the acceptable difference between accommodation and convergence. The first is the one degree difference between accommodation and convergence rule [7]. This rule of thumb states that the angular difference between the convergence and the accommodation (the angle θ in Fig. 13.7, equal to the convergence angle minus the accommodation angle) should not exceed 1°.
The other criterion is the one half to one third diopter rule [8]. In this case
diopters, the reciprocal of the distance in meters, are used as the unit of measurement.
Applying the rule in Fig. 13.7 gives:


1/C - 1/A < 1/3                                                    (13.1)

where C is the convergence distance and A is the accommodation distance, both in meters.
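As an illustration of how the two criteria might be checked in practice, the sketch below evaluates both rules for a viewer whose screen is at 2 m; the 65 mm eye separation and the example convergence distances are assumptions for illustration and are not taken from the chapter.

    import math

    # Check the two published A/C comfort criteria for eyes focused at distance A
    # (the screen) while converging at distance C, both in metres. Illustrative
    # sketch only; the eye separation and example distances are assumed values.
    def ac_conflict_ok(A, C, eye_sep=0.065):
        vergence = lambda dist: math.degrees(2.0 * math.atan(eye_sep / (2.0 * dist)))
        one_degree_rule = abs(vergence(C) - vergence(A)) <= 1.0
        dioptre_rule = abs(1.0 / C - 1.0 / A) < 1.0 / 3.0  # magnitude of Eq. (13.1)
        return one_degree_rule, dioptre_rule

    if __name__ == "__main__":
        for C in (1.6, 1.0):  # object rendered to appear at 1.6 m and at 1.0 m
            print(f"screen 2.0 m, object {C} m:", ac_conflict_ok(A=2.0, C=C))

In this assumed example an object rendered at 1.6 m satisfies both rules, whereas one rendered at 1.0 m in front of a 2 m screen violates both.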

13.2.3 Geometrical Distortions


Producing the 3D effect with the use of a stereo pair gives rise to certain image
geometry distortions. These distortions do not appear to be particularly disturbing
and are akin to distortions that occur when two-dimensional representation of a
natural scene is observed. Also, the effects are far less noticeable if the viewers
head remains fairly static. Two principal distortions are described below.

Fig. 13.7 Accommodation/convergence conflict: the viewer's eyes focus on the screen but converge at the apparent distance of the object on which the eyes are fixated

False rotation is the effect where the image appears to move with movement of the viewer's head. This is inevitable with a stereo pair as an apparent image point must always lie on the axis between P, the center of the two image points on the screen, and the mid-point between the viewer's eyes. Figure 13.8a indicates how
this effect arises. As the viewer moves from position P1 to P2, the apparent image
point moves from position M1 to M2. At each viewer position the image point lies
on the line between the eye-center and the screen at approximately 0.4 of the
distance. Although the figure shows virtual images in front of the screen, the same
considerations apply to images behind the screen where the points appear on
the axis extended into this region.
Image plane pivoting is the effect where the apparent angle of a surface alters with viewpoint. In Fig. 13.8b consider surface A′A″ that is perceived by viewer PA: point A′ is located around 0.3 of the distance from the screen and point A″ half the distance. Similarly, for viewer PB the same relationships apply for points B′ and B″ that are at the ends of surface B′B″. It can be seen that as the viewer moves from PA to PB the apparent angle of the surface pivots in a clockwise direction.

13.2.4 Other Image Artifacts


There are many other image artifacts and three of the more important are described below; these are: edge violation (truncated objects in front of the screen), keystoning, and the puppet theater effect.
The volume of the viewing field that can be portrayed by a rectangular display
is effectively a pyramid whose apex is located at the center of the user's eyes as
depicted in Fig. 13.9a. In this case the figure shows a viewer on the axis but as the
viewer moves away from this position the apex will always follow the mid-point of
the eyes. Consider Fig. 13.9 where there are three displayed objects; sphere A that
appears behind the screen, rectangular rod B that is contained behind the screen
and within the virtual image pyramid in front of the screen, and cylinder C that is
cut off by one side of the pyramid. The appearance of the sphere will be natural as the part that is not seen is obscured by the left side of the screen as if by a window.

Fig. 13.8 Geometrical distortions. a False rotation is caused by the apparent point always being located on the line between the eye center and point P. b Pivoting is the result of a combination of false rotation and variation of apparent distance
The rectangular rod will also appear natural as it is completely contained within a
volume where it is not unnaturally truncated. The cylinder will appear unnatural as
it is cut off in space by an imaginary plane in front of the screen that could not be
there in practice. Occasional display of this truncation may be acceptable but it is
something that should be avoided if possible.
When stereo is captured with a camera pair there is a rule of thumb frequently
applied stating that the camera separation should be around 1/30th the distance of
the nearest object of interest in the scene. If this is the case and images are
captured with no post processing, for the subject to appear near to the plane of the
screen the cameras will have to be toed-in so that their axes intersect at around the
distance of the subject. This will cause geometrical distortion in the image known
as keystoning. For example, if the object in the scene is a rectangle at right angles
to the axis its images will be trapezoidal with parallel sides, hence the term
keystone. The detrimental effect of this is that the same point in the scene can
appear at different heights when being viewed thus giving difficulty in fusing the
images. In the literature there are different schools of thought as to what is
acceptable [9].
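Putting the 1/30th rule into numbers gives an idea of the geometry involved; the 3 m subject distance below is an assumed example and the toe-in calculation is a simple sketch, not a production camera-alignment procedure.

    import math

    # Rule-of-thumb stereo rig: camera separation of about 1/30th of the distance
    # to the nearest object of interest, with the cameras toed in so that their
    # axes intersect at roughly that distance. Illustrative sketch only.
    def rig_geometry(nearest_object_m):
        separation = nearest_object_m / 30.0
        toe_in_deg = math.degrees(math.atan((separation / 2.0) / nearest_object_m))
        return separation, toe_in_deg

    if __name__ == "__main__":
        sep, toe = rig_geometry(3.0)  # nearest object of interest 3 m away
        print(f"separation ~{sep * 100:.0f} cm, toe-in ~{toe:.2f} degrees per camera")

For a nearest object 3 m away this gives a separation of roughly 10 cm and a toe-in of under a degree per camera, which is the kind of small rotation that nevertheless introduces the keystone distortion described above.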
When we observe a two-dimensional representation of a scene, the fact that the
angle subtended by an object in the scene may be considerably less than it would
be by direct viewing does not concern us. However, when the representation is
three dimensional the binocular cue conflict can make objects appear unnaturally
small, thus giving rise to the puppet theater effect [10]. It is possible that with
greater familiarity with 3D, viewers will become accustomed to this effect as they
have done with flat images.
There are many other artifacts and a useful list of these is given in a publication
by the 3D@Home Consortium [11].

Fig. 13.9 Viewing zone: the left side of sphere A is cut off by the window of the screen edge. Bar B protrudes into the virtual image pyramid in front of the screen. Rod C is cut off unnaturally by the edge of the pyramid

13.2.5 Crosstalk
In displays where the 3D effect is achieved by showing a stereo pair it is essential that as little as possible of the image intended for the left eye reaches the right eye, and vice versa. This unwanted bleeding of images is referred to as crosstalk and creates difficulty in fusing the images, which can cause discomfort and headaches. Crosstalk is expressed as a percentage and is most simply defined as:

Crosstalk (%) = (leakage / signal) × 100

where leakage is defined as the maximum luminance of light from the unintended channel into the intended channel and signal is the maximum luminance of the intended channel.
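The definition translates directly into a measurement calculation; the luminance figures in the sketch below are assumed example values, not measurements from the chapter.

    # Crosstalk percentage from luminance measurements, following the definition
    # above. The example values are illustrative assumptions.
    def crosstalk_percent(leakage_luminance, signal_luminance):
        return leakage_luminance / signal_luminance * 100.0

    if __name__ == "__main__":
        # e.g. 3 cd/m2 of the right image leaking into the left channel against a
        # 200 cd/m2 intended signal gives 1.5 %, close to the tolerable level
        print(f"{crosstalk_percent(3.0, 200.0):.1f} %")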
The simplified representation of crosstalk is shown in Fig. 13.10 where part of
each image is shown as bleeding into the other channel. The mechanism for this
can be due to several causes; in an anaglyph display it would be due to imperfect matching of the spectral characteristics of the glasses to the display. With shuttered
glasses it could be due to incomplete extinction of transmission by a lens when it
should be opaque or due to timing errors between the display and the glasses, and
in an autostereoscopic display it is due to some of the light rays from one of the
image pair traveling to the eye for which they were not intended.
A tolerable level of crosstalk is generally considered as being in the region of
2 %. The actual subjective effect of crosstalk is dependent on various factors; for
example, the dimmer the images, the higher the level of crosstalk that can be
accepted. Image content is also an important factor. When the contrast is greater,
the subjective effect of crosstalk is increased. If the image has vertical edges where
one side is black and the other white, the effect of crosstalk is very pronounced and
in this case it should be in the region of 1 % or less.

Fig. 13.10 Crosstalk: simplified representation showing some of the left image bleeding into the right eye and some of the right image bleeding into the left eye. This occurs in free space in an autostereoscopic display and in the lenses of a glasses display

13.2.6 Multi-View Displays


The majority of this section has been devoted to stereo-pair displays. The other
important class of 3D display is multi-view where a series of images, each
showing a slightly different perspective, is displayed across the viewing field. This
enables freedom of viewer movement but there are particular considerations that
apply to these displays. Two of the principal considerations are discussed in this
section; one is the depth of field and the other is the number of views required for
the presentation of continuous motion parallax.
In stereo pair displays depth of field as such is not a problem and the limitation
is set by the human factors considerations regarding the maximum tolerable disparity for comfortable viewing. In multi-view displays where there is a relatively
small number of views, say in the order of 15 or fewer, the depth of field is fairly
limited for the following reasons.
Consider Fig. 13.11a where the typical luminosity profile of a series of overlapping adjacent viewing zones is shown. It is not easy to achieve, and also not
desirable, to have a series of zones that have a top-hat profile where the image
abruptly changes at the boundaries. In practice, as the eye moves across the
viewing field varying proportions of two or more adjacent images are seen at any
one time. In the eye position shown the eye receives around 80 % of the maximum
intensity destined for Zone 1, 60 % of Zone 2, and 4 % of Zone 4.
In Fig. 13.11b the eye located at position Y observes a virtual image at X. The
contributions to this are from the three regions on the screen marked as image 1,
image 2, and image 3. The object is seen as three displaced discrete images in
the region of X. The displacement between the images is proportional to the
distance from the screen and manifests itself as an apparent blurring. This puts a
limitation on the distance an object can appear from the plane of the screen and the
same considerations apply for virtual object positions behind the plane of the
screen.

Fig. 13.11 Multi-view viewing zones. a Viewing zones overlap in a multi-view display. b Overlap causes softening of image points away from the plane of the screen, which causes a reduction in depth of field

The amount of displacement of the multiple images is dependent on the pitch and width of the viewing zones, so the smaller the zone width, the greater the depth
of field. With regard to the number of views required for the presentation of
continuous motion parallax, there have been various proposals put forward over
the years. Some of these are described below.
A research group in the 3D project at the Telecommunications Advancement Organization of Japan (TAO) has identified the need for a large number of views in
order to overcome problems caused by the difference between accommodation and
convergence [12]. Their approach is to provide what they term SMV. Under these
conditions, the eye pupil receives two or more parallax images. The authors claim
this will cause the eye to focus at the same distance as the convergence. This is a
significant finding regarding the minimum amount of information that has to be
displayed in order for the A/C conflict to be acceptable but the paper does not state
where this finding originates.
The SMV display itself is implemented by using a focused light array (FLA) in
order to obtain the necessary horizontal spatial resolution required for the production of 45 views. High resolution is obtained by modulating the output of an
array of light-emitting diodes (LEDs) or laser diodes, and mechanically scanning
the light in a similar manner to the TAO 32-image holographically derived display
which is in turn inspired by the MIT electro-holographic system [13].
Holographic stereograms, where multiple views across the viewing field are produced holographically, are analyzed in a paper by Pierre St Hilaire [14]. The
effect of the image appearing to jump between adjacent views is considered
and the phenomenon is likened to aliasing when a waveform is undersampled,
i.e., when the sampling rate is less than double the maximum frequency in the
original signal. The optimum view spacing implied by this sampling argument is of the same order as the figure obtained from
research at Fraunhofer HHI where it has been determined that typically, 20
views per interocular distance are required for the appearance of smooth motion
parallax [15].
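Taking the figure of around 20 views per interocular distance at face value, the total number of views implied by a given lateral viewing range can be estimated as in the sketch below; the 65 mm interocular distance and the 0.3 m viewing range are assumptions for illustration.

    # Rough view-count estimate for smooth motion parallax, based on the figure of
    # about 20 views per interocular distance quoted above. The interocular
    # distance and the lateral viewing range are assumed example values.
    def views_for_smooth_parallax(viewing_range_m, views_per_iod=20, iod_m=0.065):
        return round(views_per_iod * viewing_range_m / iod_m)

    if __name__ == "__main__":
        print(views_for_smooth_parallax(0.3))  # roughly 92 views for a 0.3 m range

Under these assumptions even a modest 0.3 m head box already calls for on the order of 90 views, which illustrates why super multi-view systems require so much displayed information.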

Fig. 13.12 Early stereoscopes. a Wheatstone stereoscope (1833): two inverted virtual images are formed in space behind the combining mirrors and fused in the brain. b Brewster stereoscope (1849): convex viewing lenses enable the eyes to focus at the convergence distance

The criteria above apply to horizontal parallax only displays where there is no
parallax in the vertical direction. This could conceivably produce an effect similar
to astigmatism where the image on the retina can be focused in one direction but
not at right angles to this. None of the references mention this potential effect so it
is not known at present whether or not it would be an issue.

13.3 Brief History


Although the subject of stereoscopy has been considered throughout history,
including comments by Leonardo da Vinci, the first recorded apparatus for viewing
stereoscopic images was built by Charles Wheatstone in 1833. A schematic diagram
of this is shown in Fig. 13.12a. The original images were line drawings as photography was in its infancy in the 1830s. Two inverted images are viewed via a pair of
mirrors such that the virtual images appear in space behind the mirrors as shown in
the figure.
In 1849 David Brewster demonstrated a stereoscope of the design in
Fig. 13.12b where the left and right images are viewed through a pair of convex
lenses. The lenses enable the viewers eyes to focus at a distance further than the
actual distance of the image pair, thus avoiding the A/C conflict problem. By the
time the stereoscope was introduced, photography was well established with
stereoscopic photographs being very popular during the Victorian period.
Integral imaging is a technique to display full parallax images and was first
proposed by Gabriel Lippmann in 1908 [16]. In this method the image is captured
by an array of lenses as in Fig. 13.13a. An elemental image is formed behind every
lens and this enables the light emerging in each direction from the lens to vary in
such a way that a reconstructed image is built up as shown in the figure.

Fig. 13.13 Integral imaging and parallax barrier. a Elemental images enable reconstruction of the input ray pattern; without correction this is pseudoscopic. b Vertical apertures in the barrier direct the left and right images to the appropriate eyes

In early integral imaging the same elemental images that were captured also
reproduced the image seen by the viewer. If a natural orthoscopic scene is captured
a certain pattern of elemental images is formed and when the image is reproduced,
light beams travel back in the opposite direction to reproduce the shape of the
original captured surface. The viewer however sees this surface from the opposite
direction to which it was captured and effectively sees the surface from the inside
thus giving a pseudoscopic image. Methods have been developed to reverse this
effect either optically [17] or more recently by reversing the elemental images
electronically [18].
Another approach developed in the early twentieth century is the parallax barrier
where light directions are controlled by a series of vertical apertures. In 1903 Frederic
Ives patented the parallax stereogram where left and right images are directed to the
appropriate left and right eyes as in Fig. 13.13b.
The problem of limited head movement was addressed in a later development
of the display known as a parallax panoramagram. The first method of capturing a
series of images was patented by Clarence Kanolt in 1918 [19]. This used a camera
that moved laterally in order to capture a series of viewpoints. As the camera
moved a barrier changed position in front of the film so that the images were
recorded as a series of vertical strips.
Another recording method was developed by Frederic Ives' son Herbert, who
captured the parallax information with a large diameter lens. A parallax barrier
located in front of the film in the camera was used in order to separate the
directions of the rays incident upon it.
It is not widely known that one of the pioneers of television, the British inventor John Logie Baird, also pioneered 3D television before the Second World War. The apparatus used was an adaptation of his mechanically scanned system (Fig. 13.14a).

Fig. 13.14 Baird's mechanical system: adaptation of the standard 30-line mechanical system. The scanning discs have two sets of apertures, one for the left eye and one for the right eye

As in his standard 30-line system the image was captured by illuminating it with
scanning light. In this case two scans were carried out sequentially, one for the left
image and one for the right. This was achieved with a scanning disk having two
sets of spiral apertures as shown in Fig. 13.14b. Images were reproduced by using
a neon tube illumination source as the output of this could be modulated sufficiently rapidly. Viewing was carried out by observing this through another double
spiral scanning disk running in synchronism with the capture disk.
Possibly of greater interest is Baird's work on a volumetric display that he referred
to as a Phantoscope. Image capture involved the use of the inverse square law to
determine the range of points on the scene surface and reproduction was achieved by
projecting an image on to a surface that moved at right angles to its plane [20, 21].
A 3D movie display system called the Stereoptiplexer was developed by Robert
Collender in the latter part of the twentieth century. This operates on the principle of laterally scanning a slit whose appearance varies with viewing angle in the same way as it would if the natural scene were located behind the slit. As the slit
moves across the screen a complete 3D representation is built up. The display can
operate in two modes; these are inside looking out and outside looking in [22].
Figure 13.15a shows the former case where a virtual image within the cylinder can
be seen over 360°. The mechanism operates in a similar manner to the zoetrope
[23] that was an early method of viewing moving images.
The outside looking in display uses the method of so-called aerial exit pupils
where virtual apertures are generated in free space [24]. This embodiment of the
display gives the appearance of looking at the scene through a window.
A variant of this principle that does not require the use of film but generates the
images electronically is Homer Tilton's parallactiscope [25] where the images are produced on a cathode ray tube (CRT). This does not produce real video images as does Collender's display but has the advantage of having a reduced number of
moving parts. In order to produce an effectively laterally moving aperture with the
minimum mass, a half wave retarder is moved between crossed polarizers as
shown in Fig. 13.15b. The retarder is moved with a voice coil actuator.

Fig. 13.15 Stereoptiplexer and parallactiscope. a Rapidly moving images from a film projector viewed through a rotating slit. b The same principle used in the parallactiscope, where images are formed on a CRT

Both Collender's and Tilton's displays, and also integral imaging, are early examples of what are termed light field displays; these are described in more detail in Sect. 13.4.3.

13.4 3D Display Types: Principle of Operation


In this section the principle of operation of various basic 3D display types is
described. These different types are categorized as shown in Fig. 13.16 which
gives a convenient classification; there are other classification systems, for
example, that of the 3D@Home Consortium [26], but they all follow a similar
pattern. All 3D displays, apart from those where apparatus is required, such as the Wheatstone and Brewster stereoscopes or head-mounted displays, can be divided into autostereoscopic and glasses types.
Although not related to the principle of operation, 3D displays can also be
categorized according to their size and number of users. The following categories
provide a convenient classification:
Handheld: these will usually be single user and, being handheld, the user can move both head position and display orientation in order to readily obtain the viewing sweet spot.
Monitor-sized: up to around 20″ diagonal. Generally single user, so that head tracking is a viable option.
Television-sized: up to around 80″ diagonal, up to around six users located at distances from one to four meters from the screen with an opening angle of 90° or more. These are suitable for television and multi-user interactive gaming applications.

Cinema-like: greater than six users and greater than 2 m diagonal. Autostereoscopic presentation is difficult, or maybe impossible, for a large number of users. However, the wearing of special glasses is acceptable in this viewing environment.

Fig. 13.16 3D display types: autostereoscopic displays are holographic, employing wavefront reconstruction; volumetric, with images formed in a volume of space; and multiple image, where images pass through some form of screen

13.4.1 Glasses
The earliest form of 3D glasses viewing was anaglyph introduced by Wilhelm
Rollmann in 1853; this is the familiar red/green glasses method. Anaglyph operates by separating the left and right image channels by their color. Imagine a red
line printed on a white background; when the line is viewed through a red filter the
line is barely perceptible as it appears to be around the same color as the background. However, when the line is viewed through a green filter the background appears green and the line virtually black. The opposite effect occurs when a green line on a white background is viewed through red and green filters. In this
way different images can be seen by each eye, albeit with different colors. This
color difference does not prevent the brain fusing a stereoscopic pair.
Early anaglyph systems used red and green or red and blue. When liquid crystal
displays (LCD) are used where colors are produced by additive color mixing,
better color separation is obtained with red/cyan glasses. Typical spectral transmission curves for anaglyph glasses and LCD color filters are shown in Fig. 13.17.
It should be noted that the curves for the LCD filter transmission only give an
indication of the spectral output of the panel. In practice the spectra of the cold
cathode fluorescent lamp (CCFL) illumination sources generally used are quite spiky.

Fig. 13.17 Typical anaglyph transmission spectra: plots showing typical separation for red/cyan glasses and LCD filters. These only give an indication, as the separation is also affected by the spectral output of the CCFL backlight

Examination of the curves shows that crosstalk can occur. For example, in
Fig. 13.17b it can be seen that some of the red light emitted by the LCD between
400 and 500 nm will pass to the left eye through the cyan filter. Extensive
investigation into anaglyph crosstalk has been carried out by Curtin University in
Australia [27].
Another method of separating the left and right image channels is with the use
of polarized light that can be either linearly or circularly polarized. In linearly
polarized light the electric field is oriented in one direction only and orthogonally
oriented polarizing filters in front of the left and right eyes can be used to separate
the left and right image channels.
Light can also be circularly polarized where the electric field describes a circle
as time progresses. The polarization can be left or right handed depending on the
relative phase of the field components. Again, the channels can be separated with
the use of left- and right-hand polarizers. Circular polarization has the advantage
that the separation of the channels is independent of the orientation of the polarizers so that rotation of the glasses does not introduce crosstalk. It should noted
that although crosstalk does not increase with rotation of the head, the fusion of
stereoscopic images becomes more difficult as corresponding points in the two
images do not lie in the same image plane in relation to the axis between the
centers of the users eyes.
The other method of separating the channels is with the use of liquid crystal
shutter glasses, where left and right images are time multiplexed. In order to avoid
image flicker images must be presented at 60 Hz per eye, that is, a combined frame
rate of 120 Hz. The earliest shutter glasses displays used a CRT to produce the
images as these can run at the higher frame rate required. It is only relatively
recently that LCD displays have become sufficiently fast and in 2008 the NVIDIA
3D vision gaming kit was introduced.

Fig. 13.18 Shutter glasses timing: at 0 ms the LCD is addressed with the left image, at 4.17 ms the left image is displayed over the complete LCD, at 8.33 ms the LCD is addressed with the right image, and at 12.5 ms the right image is displayed over the complete LCD. Over the 16.7 ms period of a 120 Hz display there are two periods when both cells are opaque; these are when the LCD is being addressed and parts of both left and right images are displayed at one time

The switching of the LC cells must be synchronized with the display as shown
in Fig. 13.18. This is usually achieved with an infrared link. When the display is
being addressed, both of the cells in the glasses must be switched off otherwise
parts of both images will be seen by one eye thus causing crosstalk.
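The timing in Fig. 13.18 can be reproduced with a short sketch; it assumes, purely for illustration, that the panel spends the first half of each 8.33 ms eye period being addressed (both shutters closed) and the second half showing a complete image.

    # Shutter-glasses schedule for a 120 Hz frame-sequential display, reproducing
    # the instants shown in Fig. 13.18. Simplified sketch with an assumed 50/50
    # split between addressing time and full-image display time.
    FRAME_RATE_HZ = 120
    EYE_PERIOD_MS = 1000.0 / FRAME_RATE_HZ  # 8.33 ms per eye

    def schedule(cycles=1):
        events = []
        for n in range(cycles):
            t0 = n * 2 * EYE_PERIOD_MS
            events += [
                (t0, "address left image, both shutters closed"),
                (t0 + 0.5 * EYE_PERIOD_MS, "left image complete, left shutter open"),
                (t0 + EYE_PERIOD_MS, "address right image, both shutters closed"),
                (t0 + 1.5 * EYE_PERIOD_MS, "right image complete, right shutter open"),
            ]
        return events

    if __name__ == "__main__":
        for t, action in schedule():
            print(f"{t:5.2f} ms  {action}")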

13.4.2 Volumetric
Volumetric displays reproduce the surface of the image within a volume of space.
The 3D elements of the surface are referred to as voxels, as opposed to pixels on
a 2D surface. As volumetric displays create a picture in which each point of light
has a real point of origin in a space, the images may be observed from a wide range
of viewpoints. Additionally, the eye can focus at a real point within the image thus
giving the sense of ocular accommodation.
Volumetric displays are usually more suited for computer graphics than video
applications due to the difficulties in capturing the images. However, their most
important drawback, in particular with regard to television displays is that they
generally suffer from image transparency where parts of an image that are normally occluded are seen through the foreground object. Another limitation that
could give an unrealistic appearance to natural images is the inability in general to
display surfaces with a non-Lambertian intensity distribution. Some volumetric
displays do have the potential to overcome these problems and these are described
in Sect. 13.5.2.
Volumetric displays can be of two basic types; these are virtual image where
the voxels are formed by a moving or deformable lens or mirror and real image
where the voxels are on a moving screen or are produced on static regions. An
early virtual image method is that described by Traub [28], where a mirror of varying focal length (varifocal) is used to produce a series of images at differing apparent distances (Fig. 13.19a). The variable surface curvature of the mirror entails smaller movement for a given depth effect than would be required from a
moving flat surface on to which an image is projected.

Fig. 13.19 Volumetric display types. a Moving virtual image plane formed by a deformable mirror. b Image built up of slices projected on to a stack of switchable screens. c Image projected on to a moving screen

In static displays voxels are produced on stationary regions in the image space
(Fig. 13.19b). A simple two-plane method by Floating Images Inc. [29] uses a partially reflecting mirror to combine the real foreground image with the reflected
background image behind it. The foreground image is brighter in order for it to appear
opaque. This type of display is not suitable for video but does provide a simple and
inexpensive display that can be effective in applications such as advertising.
The DepthCube Z1024 3D display consists of 20 stacked LCD shutter panels
that allow viewers to see objects in three dimensions without the need for glasses.
It enhances depth perception and creates a volumetric display where the viewer
can focus on the planes so the accommodation is correct [30].
A solid image can be produced in a volume of space by displaying slices of the
image on a moving screen. In Fig. 13.19c the display comprises a fast projector
projecting an image on a rotating spiral screen. The screen could also be flat and moving
in a reciprocating motion and if, for example, a sphere is to be displayed this can be
achieved by projecting in sequence a series of circles of varying size on to the surface.
Most attempts to create volumetric 3D images are based on swept volume techniques
because they can be implemented with currently available hardware and software.
There are many volumetric display methods and a useful reference work on the
subject has been written by B G Blundell [31].

13.4.3 Light Field


Light field displays emit light from each point on the screen that varies with
direction in order to faithfully reproduce a natural 3D scene but without the use of
holography. These can take several forms including integral imaging, optical modules, and dynamic aperture.

Fig. 13.20 Light field displays. a Optical modules create image points with intersecting beams (in front of or behind the screen). b Scanning slit display operates on the same principle as the stereoptiplexer and parallactiscope

Light field displays require large amounts of
information to be displayed. In methods that use projectors or light modules an
insufficient amount of information can produce effects such as lack of depth of
field or shearing of the images.
NHK in Japan has carried out research into integral imaging for several years
including one approach using projection [32]. In 2009, NHK announced they had
developed an integral 3D TV achieved by using a 1.34 mm pitch lens array that
covers an ultra-high definition panel. Hitachi has demonstrated a 10″ Full Parallax 3D display that has a resolution of 640 × 480. This uses 16 projectors in conjunction with a lens array sheet [33] and provides vertical and horizontal parallax. There is a tradeoff between the number of viewpoints and the resolution; Hitachi uses sixteen 800 × 600 resolution projectors and in total there are 7.7 million pixels (equivalent to 4000 × 2000 resolution).
Some light field displays employ optical modules to provide multiple beams that
either intersect in front of the screen to form real image voxels (Fig. 13.20a) or
diverge to produce virtual voxels behind the screen. The term voxel is used here as
it best describes an image point formed in space whether this point is either real or
virtual. The screen diffuses the beams in the vertical direction only, therefore
allowing viewers vertical freedom of movement without altering horizontal beam
directions. As the projectors/optical modules are set back from the screen, mirrors are
situated either side in order to extend the effective width of the actual array.
The dynamic aperture type of light field display uses a fast frame rate projector
in conjunction with a horizontally scanned dynamic aperture as in Fig. 13.20b.
Although the actual embodiment appears to be totally different to the optical
module approach, the results they achieve are similar. In the case of the dynamic
aperture the beams are formed in temporal manner. The Collender and Tilton
displays mentioned earlier are examples of this type.

Fig. 13.21 Multi-view display. a Views formed by collimated beams emerging from a lenticular screen where the LCD image is located at the focal plane. b Pixel and lens configuration for a 7-view slanted lenticular display

13.4.4 Multi-View
Multi-view displays present a series of discrete views across the viewing field. One
eye will lie in a region where one perspective is seen, and the other eye in a position
where another perspective is seen. The number of views is too small for continuous
motion parallax. Current methods use either lenticular screens or parallax barriers to
direct images in the appropriate directions. There have been complex embodiments
of this type of display in the past [34] where the views are produced in a dynamic
manner with the use of a fast display; this type of display employs a technique
referred to as Fourier-plane shuttering [35]. However, recent types using lenticular
screens [36] or parallax barriers provide a simple solution that is potentially inexpensive. These displays have the advantages of providing the look-around capability
(motion parallax) and a reasonable degree of freedom of movement.
Lenticular screens with the lenses running vertically can be used to direct the light
from columns of pixels on an LCD into viewing zones across the viewing field. The
principle of operation is shown in Fig. 13.21a. The liquid crystal layer lies in the focal
plane of the lenses and the lens pitch is slightly less than the horizontal pitch of the
pixels in order to give viewing zones at the chosen optimum distance from the screen.
In the figure three columns of pixels contribute to three viewing zones.


A simple multi-view display with this construction suffers from two quite
serious drawbacks. First, the mask between the columns of pixels in the LCD gives
rise to the appearance of vertical banding on the image known as the picket fence
effect. Second, when a viewers eye traverses the region between two viewing
zones the image appears to flip between views. These problems were originally
addressed by Philips Research Laboratories in the UK by simply slanting the
lenticular screen in relation to the LCD as in Fig. 13.21b where a 7-view display is
shown. An observer moving sideways in front of the display always sees a constant
amount of black mask. This renders it invisible and eliminates the appearance of
the picket fence effect, which is a moiré-like artifact where the LCD mask is
magnified by the lenticular screen.
The slanted screen also enables the transition between adjacent views to be
softened so that the appearance to the viewer is closer to the continuous motion
parallax of natural images. Additionally, it enables the reduction of perceived resolution against the display's native resolution to be spread between the vertical and
horizontal directions. For example, in the Philips Wow display the production of nine
views reduces the resolution in each direction by a factor of three. The improvements
obtained with a slanted lenticular screen also apply to a slanted parallax barrier.
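The way a slanted lenticular (or barrier) spreads the views over the RGB subpixel grid can be illustrated with a simplified mapping. The sketch below is a generic illustration with assumed parameters (7 views, a lens pitch of 3.5 subpixel columns and a slant tangent of 1/6); it is not the exact formula used in any particular commercial display.

    # Simplified assignment of views to RGB subpixels under a slanted lenticular
    # screen. Generic sketch with assumed parameters; the slant adds a
    # row-dependent horizontal phase so that neither whole pixel columns nor
    # whole rows are lost to a single view.
    NUM_VIEWS = 7
    LENS_PITCH_SUBPIX = 3.5  # lenticule width measured in subpixel columns
    SLANT = 1.0 / 6.0        # tangent of the slant angle, in pixel widths per row

    def view_of_subpixel(col, row):
        # horizontal offset of this subpixel under its lenticule; one pixel row
        # down shifts the lenticule axis by 3 * SLANT subpixel columns
        phase = (col - 3.0 * row * SLANT) % LENS_PITCH_SUBPIX
        return int(phase / LENS_PITCH_SUBPIX * NUM_VIEWS)

    if __name__ == "__main__":
        for row in range(4):
            print([view_of_subpixel(col, row) for col in range(12)])

Printing a few rows shows the view index varying with both column and row, which is the mechanism that hides the black mask and softens the transitions between adjacent views.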

13.4.5 Head Tracked


The use of head tracking can overcome many of the limitations of other display
types. The amount of information displayed is kept to a minimum as only two
views have to be displayed if all the viewers see the same image pair, and only
2N views if motion parallax is supplied to N viewers. Head tracked displays
produce exit pupils; these are regions in the viewing field where a particular image
is seen over the complete area of the screen. When an observer is not in this region
an image is not seen at all or is seen over only part of the screen. Figure 13.22a
shows the formation of a single image pupil pair. The head tracker ensures that the
pupils are always located in the region of the eyes so the exit pupil pairs shown in
Fig. 13.22b follow the viewers' head positions.
Head tracked displays have been built for many years and go back to the work
of Schwartz in 1985 [37]. Sharp Laboratories of Europe developed a system where
the images from two LCDs are combined with a semi-silvered mirror [38]. The
Massachusetts Institute of Technology Media Lab described the use of a spatial
light modulator (SLM) rotates the polarization by 90 in order to produce exit
pupils [39]. The simplest lenticular screen method is to place a screen with vertical
lenses in front of an LCD and swap the images on the pixel rows to ensure that the
eyes never see pseudoscopic images; this was an early head tracked method
developed by the Nippon Telegraph and Telephone Corporation (NTT) [40].
In the Dimension Technologies Inc. display [41] a lenticular sheet is used to
form a series of vertical illumination lines for an LCD. In the Varrier™ display of the University of Illinois [42], a physical parallax barrier with vertical apertures is located in front of the screen.

Fig. 13.22 Exit pupils. a Exit pupil pair located at the positions of the viewer's eyes. b Multiple exit pupil pairs under the control of the head tracker follow the positions of the viewers' heads

The German company SeeReal produces a head
tracked display where light is steered through an LCD with a prismatic arrangement [43].

13.5 State of the Art


13.5.1 Glasses
13.5.1.1 Anaglyph: ColorCode3D
Anaglyph glasses methods have been used since the mid-nineteenth century and
although they are a very simple means of providing 3D they suffer from poor color
reproduction. Recently this shortcoming has been addressed and the first method
described here employs the more traditional approach of using a standard display
and conventional broadband filters in the glasses.
ColorCode3D is a patented method of separating the left and right channels
using blue and amber filters [44]. A blue filter centered at 450 nm covers the left
eye and this provides monochrome information for the depth effect. An amber
filter in the right eye path allows wavelengths greater than around 500 nm to pass
to supply the color information. The overall subjective effect of this is to give
much better color reproduction than red/cyan anaglyph and with problems only
encountered in the extreme blue.
The light is attenuated to a much greater extent than traditional anaglyph types
and best results are obtained with the ambient lighting level reduced. In November
2009, Channel 4 in the UK transmitted 3D programs during 3D Week with free
glasses being distributed at supermarkets. The programs could still be viewed
without the glasses but with some horizontal fringing artifacts.


13.5.1.2 Anaglyph: Infitec
A rather different approach is taken by the German company Infitec GmbH. This is
a projection system where narrowband dichroic interference filters are used to
separate the channels. The filter for each eye has three narrow pass bands that
provide the primary colors for each channel. In order to give separation the
wavelengths have to be slightly different for each channel and are typically: left
eye, Red 629 nm, Green 532 nm, Blue 446 nm; right eye, Red 615 nm, Green
518 nm, Blue 432 nm. The effect of the slight differences in the primary colors is
relatively small as it can be corrected to a certain extent and principally affects
highly saturated colors.
Ideally the light sources would be lasers but in practice an ultra-high pressure
(UHP) lamp is used [45] where spectral peaks are close to the desired wavelengths.
One advantage of this system is that polarized light is not used so that the screen
does not have to be metallized but can be white. A disadvantage is that the glasses
are relatively expensive (around $38 in early 2012) and are therefore not disposable. This requires additional manpower at the theater as the glasses have to be
collected afterwards and they also have to be sterilised before reuse.
The Infitec system has been adopted by Dolby where the projector can show
either 2D or 3D. When 3D is shown the regular color wheel used is replaced by
one with two sets of filters, one set for each channel.

13.5.1.3 Linearly Polarized


3D images can be projected by simply using two synchronized projectors, one for
the left image and one for the right, with orthogonally aligned polarizers in front of
each lens. When viewed on a metallized screen through correctly matched linearly
polarized glasses 3D is seen. The earliest demonstration of this was by Edwin
Land, the inventor of Polaroid film and the Polaroid instant camera, in 1936.
Analog IMAX 3D also operates on this principle.
A 120 Hz digital light processor (DLP) projector or a 120 Hz LCD projector can
be adapted for use with passive glasses using a converter that changes the display
mode from frame sequential to polarity sequential. This can be carried out with a kit
available from Tyrell Innovations at a cost of approximately $400. The kit comprises
a polarization rotation cell that can be synchronized with the projector. With the use
of a metallized screen 3D can be seen with linearly polarized glasses. Circularly
polarized glasses can also be used with the use of a slightly modified system.

13.5.1.4 Circularly Polarized


Circular polarization is used in the RealD cinema system where a circularly
polarized liquid crystal filter is placed in front of the projector lens. Only one
projector is necessary as the left and right images are displayed alternately.


The active polarization filter called a ZScreen was invented by Lenny Lipton
of the Stereographics Corporation and was first patented in 1988 [46]. It was
originally used in shutter glasses systems using CRT displays as at the time this
was the only technology sufficiently fast to support this mode of operation.
In the RealD XLS system used by Sony a 4K projector (4096 × 2160 resolution) displays two 2K (2048 × 858 resolution) images. Only a single projector is necessary as the two images are displayed one above the other in the 4K frame. These
are combined by a special optical assembly that incorporates separating prisms. The
lenses for each channel are located one above the other at the projector output.
Displaying the images simultaneously avoids the artifacts produced by the alternative triple flash method where left and right images are presented sequentially.
Another polarization switching system is produced by DepthQ where a range of
modules is available that can support projector output up to 30 K Lumens. The
modules have a fast switching time of 50 µs and will give low crosstalk for frame rates in excess of 280 FPS. The system includes the liquid crystal modulator and its control unit with a standard synchronization input. Output from the graphics card or the projector's GPIO (general purpose input/output) port is supplied to the control unit,
which then converts the signal to match the required input of the modulator.
A mechanical version of the switched filter cell is produced by Masterimage where
a filter wheel in the output beam of the projector running at 4320 rpm enables triple
flash sequential images to be shown. Triple flash is where within a 1/24 s period the
left image is displayed three times and the right image displayed three times.
It is possible to combine a screen-sized active retarder with a direct-view LCD
to provide a passive glasses system so that there is no loss of vertical resolution.
Samsung and RealD showed prototypes of this system in 2011. The additional
weight and cost of the active shutter panel was seen as an issue but RealD stated
that they had a plan to make the panel affordable.
The left and right images can be spatially multiplexed with the use of a film
pattern retarder (FPR) where alternate rows of pixels have orthogonal polarization.
LG led the way with this technology and other manufacturers have followed suit.
There is an argument that says the perceived vertical resolution is halved but this
is countered with another argument suggesting that as each eye sees half the
resolution, the overall effect is of minimal resolution loss [47].

13.5.2 Volumetric
There are many different methods for producing a volumetric display and a large
proportion of these have been available for some years. As most of these suffer from
the disadvantages of having transparent image surfaces and the inability to display
anisotropic light distribution (cannot show a non-Lambertian distribution) from their
component voxels, these are not particularly relevant in a state-of-the-art survey.
It is clear that if the image is produced by any of the three basic methods
described in Sect. 13.4.2 the light distribution from the voxels will be isotropic, as the real surfaces from which the light radiates are diffusing and have a Lambertian
distribution. Images will therefore lack the realism of real-world scenes and this
isotropic distribution is also the cause of image surface transparency. Making the
display hybrid so that it is effectively a combination of volumetric and multi-view
or light field types can overcome this deficiency [48].
The limitations of a conventional volumetric display, referred to as an IEVD
(isotropically emissive volumetric display), can be partially mitigated by
employing a technique using lumi-volumes in the rendering. This has been
reported by a team comprising groups from Swansea University in the UK and
Purdue University in the US [49].
Isotropic emission limitations can be overcome with a display developed at
Purdue University that is effectively a combination of volumetric and multi-view
display [50]. This uses a Perspecta volumetric display that gives a spherical
viewing volume 250 mm in diameter. The principal modification to the display
hardware is the replacement of the standard diffusing rotating screen with a mirror
and a 60° FWHM (full width half maximum) Luminit vertical diffuser. This
provides a wide vertical viewing zone. The results published in the paper clearly
show occlusions and the stated advantage of this approach over conventional
integral imaging is that the viewing angle is 180°. Although this is an interesting
approach the Perspecta display unfortunately is no longer available.
Two displays employing rotating screens have been reported. In the Transpost
system [51] multiple images of the object, taken from different angles, are projected onto a directionally reflective spinning screen with a limited viewing angle.
In another system called Live Dimension [52] the display comprises projectors
arranged in a circle. The Lambertian surfaces of the screens are covered by vertical
louvers that direct the light towards the viewer.
A completely different type of display, but one that is of interest is the
laser-produced plasma display of the National Institute of Advanced Industrial Science
and Technology (AIST) [53]. This produces luminous dots in free space by using
focussed infrared pulsed lasers with a pulse duration of around 1 nanosecond and a
repetition frequency of approximately 100 Hz. Objects can be formed in air by moving
the beam with an XYZ scanner and this enables around 100 dots per second to be
produced. The AIST paper shows images of various simple three-dimensional dot
matrix images. The display is marketed by the Japanese company Burton Inc. who has
announced that an RGB version will become available in the near future.

13.5.3 Light Field


13.5.3.1 ICT Graphics Lab
The Institute of Creative Technologies (ICT) Graphics Lab at the University of
Southern California has developed a system that reflects the images off an
anisotropic spinning mirror [54]. The mirror has an adjacent vertical diffuser in order to enable vertical movement of the viewer. Although the display is inherently horizontal parallax only, vertical viewer position tracking can be used to
render the images in accordance with position so that the appearance of vertical
parallax, with minor distortions, could be provided. Although this is referred to as a light field display, its operating principle appears to be similar to the modified
Perspecta display. This demonstrates the point that the distinction between the
different types of display when hybrid technology is employed can become
blurred. The same group has also proposed another display where the screen
comprises a two-dimensional pico projector array [55]. The paper describes a
display that would actually work but with poor performance; the pixels of the
display would be the projector lenses so that the resolution would be very low.
13.5.3.2 Holografika
The Holografika display [56] uses optical modules to provide multiple beams that converge and intersect in front of the screen to form real image voxels, or diverge to produce
virtual voxels behind the screen. The screen diffuses the beams in the vertical direction
only without altering horizontal beam directions. Holografika can currently supply a
range of products including a 32″ display using 9.8 megapixels and a 72″ version with
34.5 megapixels [57]. A version of the display that has a higher angular resolution of the
emergent light field will become available and will provide a 120° field of view.
13.5.3.3 Other
Hitachi has demonstrated a version of its display where a combination of mirrors
and lenses is used to overlay stereoscopic images within a real object [58]. The
images are made up by views from 24 projectors and enable 3D to be seen by
several users over a wide viewing angle. The dynamic aperture type of light field
display uses a fast frame rate projector in conjunction with a horizontally scanned
aperture. An early version of this was the Tilton display with the mechanically
scanned aperture. More recently this principle has been developed further at
Cambridge University with the use of a fast digital micromirror device (DMD)
projector and a ferroelectric shutter [59]. This is currently available from Setred as
a 20″ XGA display intended for medical applications.

13.5.4 Multi-View
13.5.4.1 Two-View
The simplest type of multi-view display is one where a single image pair is
displayed. If the display hardware is only capable of showing a single exit pupil
pair, the sweet spot occupies only a very small part of the viewing field.


However, in many two-view systems, for example simple lenticular screen or
parallax barrier types, the exit pupils are repeated across the viewing region so that
a series of sweet spots is formed.
Sharp were pioneers of parallax barrier 3D displays and brought out the Actius
RD3D notebook in 2004. This was 2D/3D switchable with the use of an active
parallax barrier and was produced for around 2 years. They also made a 3D mobile
phone that sold for a short time in Japan and was so popular that at one point in the
mid-2000s it had achieved higher sales than the total of all other
autostereoscopic displays combined up to that time. Another parallax barrier
product by Sharp that is in current use is a display mounted in a
vehicle that enables the driver to see the satellite navigation and other essential
information and the front passenger to see entertainment that would be distracting
to the driver.

13.5.4.2 Projector
Although projection techniques do not provide the most compact form of multi-view
display, they do have the potential to offer the image source where a large
number of views are required. This approach has become more viable recently as
relatively inexpensive pico projectors are now available. There are two simple
ways projectors can be used. The first method is the use of a double lenticular
screen [60] where the region of contact between the sheets is the flat common focal
plane of the two outer layers of vertically aligned cylindrical lenses. This is an
example of a Gabor superlens [61]. The lenses of the projectors form real images
in the viewing field that are the exit pupils and the light must be diffused vertically
in order to enable vertical movement of the viewers.
Another method is the use of a retroreflecting screen that returns the light rays
back in the same direction as they enter the screen [62]. The screen could comprise a
lenticular arrangement to operate in the same manner as the cat's eyes that are used
as reflectors on UK roads. An alternative method is to use a corner cube structure
where three orthogonal reflecting surfaces return light entering back in the same
direction. Again, vertical diffusing is required to enable vertical viewer movement.
Hitachi has announced a 24-view system [63] where the images from 24 projectors are combined with the use of a half-silvered mirror that enables the 3D
image to be overlaid with an actual object. The viewing region of 30° in the
vertical direction and 60° horizontally enables several people to view the images.

13.5.4.3 Lenticular/Barrier
Currently the most common form of multi-view display uses a slanted lenticular
view-directing screen. Philips [64] was active in this area for around 15 years but
pulled out in 2009. The work is being continued by a small company called
Dimenco [65] that consists of a few ex-Philips engineers. It appears that Philips


has again taken up marketing this display. An alternative method is the slanted
parallax barrier [66], where the slanting has the same effect as for the lenticular
screen, that is, the elimination of visible banding and the spreading of the
resolution reduction over both the horizontal and vertical directions.
The Masterimage display [67] uses an active twisted nematic (TN) LCD parallax barrier located in front of an image-forming thin film transistor (TFT) LCD
panel. The barrier is switchable to enable 3D to be seen in the portrait and the
landscape orientation and to switch between the 2D and 3D modes. Information
released by the company states that the system has no crosstalk. The display is
incorporated into the Hitachi Wooo mobile phone that was launched in 2009.
Light directions exiting the screen can also be controlled by using a color filter
array in a component that is referred to as a wavelength selective filter [68]. This
technique was developed around 2000 by 4D-Vision GmbH in Germany where it
was used to support an 8-view display.
There are many multi-view displays on the market and new products frequently
appear with others becoming discontinued. This is a relatively specialized market
and a convenient means of finding out what is still available, at least in the UK, is
to contact or to visit the website of Inition [69] who supply a wide range of 3D
display products.

13.5.5 Head Tracked


13.5.5.1 Free2C
A single-user head tracked display has been developed by Fraunhofer HHI in
Germany [70]. This uses a lenticular screen, mounted in front of an LCD, that moves in
accordance with the position of the user. Both lateral and frontal head movement
can be accommodated by moving the lenticular screen in the X and Z directions
with the use of voice coil actuators. The display is in the portrait orientation and
the nominal viewing distance is 700 mm. Tracking is carried out with a nonintrusive system where the output from a pair of cameras located above the screen
is processed to control the actuators. The tracking provides a comfortably large
viewing region and image quality is good with crosstalk less than 2 %.

13.5.5.2 SeeReal
SeeReal is developing a display that combines the advantages of head tracking and
holography [71]. When a holographic image is produced a large amount of
redundant information is required in order to provide images over a large region
that is only sparsely populated with viewers' eye pupils. This requires a screen
resolution that is in the order of a wavelength of light and even if vertical parallax
is not displayed the screen resolution is still too high to be handled by current


display technology. SeeReal's solution to this is to produce holograms with a small
diffraction angle that serve only a small viewing window and require a resolution
that is in the order of around only ten times that of HDTV. As opposed to conventional holography where the complete screen area contributes to each image
point, the SeeReal display uses proprietary sub-holograms that reconstruct the
point that is seen over the small viewing window region. By reconstructing the
points in this fashion the A/C conflict problem is overcome.

13.5.5.3 Microsoft
Microsoft Applied Sciences Group has announced a system where the Wedge
waveguide technology developed at Cambridge University is adapted to provide
steering of light beams from an array of LEDs [72]. The display has a thin form factor
of around 50 mm. Images are produced view sequentially so that the focussing of the
rays exiting the screen changes position each time a new view is displayed.
The display in operation can be seen via the link in the following Ref. [73].

13.5.5.4 Apple
Another means of directing light to viewers' eyes is described in a patent assigned
to Apple [74]. This uses a corrugated reflecting screen where the angle of the
exiting beam is a function of its X position. The text of the patent mentions that
head tracking can be applied to this display. Proposed interactive applications for
the display, including gesture and presence detection, can be found on the
Patently Apple blog [75].

13.5.5.5 MUTED, HELIUM3D


Displays that have the aim of being true multi-user head tracked types with a large
viewing region where viewers are not restricted to being at a set distance from the
screen have been developed in the MUTED and HELIUM3D projects that have
been funded by the European Union. In MUTED [76] this is achieved with the use
of an optical array where the off-axis aberrations produced by forming an exit
pupil with a single large lens component are eliminated. A multi-user head tracker
is used to generate an illumination pattern that produces a series of collimated
beams that intersect at the viewers' eyes after passing through a direct-view LCD
that has its backlight removed.
The HELIUM3D display [77] does not incorporate a flat panel display and image
information is supplied by a fast light valve (LV) that controls the output of red,
green, and blue (RGB) lasers. Multiple steerable exit pupils are created dynamically
where an image column scans the screen horizontally and the directions of the light
emerging from the column are controlled by an SLM. A laser source is used for its


low étendue properties that enable accurate control of the light ray directions. It is not
used for its high coherence properties, as holographic techniques are not employed in
the display. The front screen assembly includes a Gabor superlens, which is a novel
type of lens that can provide angular magnification. This was invented by Denis
Gabor in the 1940s. The current HELIUM3D setup was intended to incorporate a fast
light engine that would have provided motion parallax to several users. A suitable
light engine is not yet available and the system is currently being adapted for operation
as the backlight of a 120 Hz LCD where the display provides the same stereo
pair to several users. It will have a compact housing size, therefore making it suitable
for 3D television use.

13.6 Conclusion
Currently the most common type of autostereoscopic display is multi-view;
however, these suffer from resolution loss, restricted depth of field, and a relatively
limited viewing region. These effects are reduced with increasing number of views
but this creates a tradeoff with resolution; unfortunately, you cannot get something
for nothing. It is possible that eventually SMV displays could become viable with
ultra-high resolution panels used in conjunction with slanted lenticular screens in
the event of such panels becoming available.
If the criteria for the appearance of continuous motion parallax given in
Sect. 13.2.6 are adhered to, then it would appear that a view pitch in the order of
2.5 mm would suffice. As an example, 100 parallax images would be required for a
continuous 250 mm viewing width. With a slanted lenticular screen the resolution
loss in each direction is approximately equal to the square root of the number of
views, so each individual view would have a tenth of the native display resolution
both horizontally and vertically. Consider the example of an 8K display
(7680 × 4320); the resolution of each view image would only be in the order of
768 × 432. Although the addition of 3D enhances the perceived image quality, it is
not clear that this resolution reduction would be mitigated.
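As a rough check of this arithmetic, the short Python sketch below derives the number of views and the approximate per-view resolution under the square-root rule stated above; the function and variable names are illustrative only, and the panel resolution and view pitch are taken from the example in the text.

import math

def per_view_resolution(native_w, native_h, num_views):
    # Approximate per-view resolution of a slanted lenticular display,
    # assuming the resolution loss in each direction is sqrt(num_views).
    loss = math.sqrt(num_views)
    return int(native_w / loss), int(native_h / loss)

viewing_width_mm = 250    # continuous viewing width from the example
view_pitch_mm = 2.5       # view pitch for apparently continuous parallax
num_views = int(viewing_width_mm / view_pitch_mm)

print(num_views)                                    # 100
print(per_view_resolution(7680, 4320, num_views))   # (768, 432)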
The performance of a multi-view display can be improved with the addition of
head tracking. Toshiba demonstrated a 55″ autostereoscopic display at the 2012
Consumer Electronics Show (CES). Before the show the display had already been
available in Japan and Germany. The native resolution of the display is 4K
(3840 × 2160), which enables a series of nine 1280 × 720 images to be formed.
The addition of head tracking permits the viewing zones to be located in the
optimum position for the best viewing condition for several viewers. Toshiba
states that this can be up to nine viewers, but best performance is obtained with four
or fewer. In addition to the resolution loss, the other disadvantage of this display is
that viewers have to be close to a fixed distance from the display.
If SMV displays do become widely used it must be borne in mind that image
capture will be more complex. If the continuous viewing width is, for example,
250 mm, then the capture width should be of the same order. It is most likely that a


camera array would be used with intermediate views being produced by interpolation if necessary. Plenoptic techniques [78] are probably not suitable as a single
lens used at the first stage would have to be extremely large.
3D televisions initially used active glasses, where the left and right images are
separated sequentially as described in Sect. 13.4.1. In early 2011 the Display Daily
blog, which is prepared by the well-informed Insight Media, proposed the "Wave
Theory of 3D" [79]. In this proposal the first wave of 3D television is active
glasses, the second wave passive glasses, and the third wave glasses-free. Although
there are pros and cons regarding glasses displays, for example, cost and weight of
glasses, display weight, and image resolution, ultimately the acceptance of 3D
television will be determined by whether or not glasses will need to be worn at all.
Although glasses-free is shown as the third wave, interestingly there is no time
scale given in Insight Media's report.
Head tracking can provide an autostereoscopic display that does not have to
compromise the image resolution. In its present form it does not appear that the
Microsoft head tracked display can control the exit pupil positions in the Z
direction so the use of this display could be limited in a television viewing situation. It is not apparent in the Apple patent how readily tracking could be controlled in the Z direction. The HELIUM3D display is fairly complex but the
complexity is principally due to its inherent ability to produce exit pupils over a
large region allowing several viewers to move freely over a room-sized area.
In September 2009, SeeReal announced that they had produced an 8″ holographic
demonstrator with 35 micron pixels forming a phase hologram and that a
24″ product with a slightly smaller pixel size was planned. It does not appear at the
time of writing (February 2012) that a product is actually available yet.
Holographic displays may be some years away, but research is being carried out
that is leading the way to their inception. The University of Arizona is developing a
holographic display that is updatable so that moving images can be shown [80].
In order to be suitable for this application the material displaying the hologram must
have a high diffraction efficiency, fast writing time, hours of image persistence,
capability for rapid erasure, and the potential for large display area; this is a combination
of properties that had not been realized before. A material has been developed
that is a photorefractive polymer written by a 50 Hz nanosecond pulsed laser. A
proof-of-concept demonstrator has been built where the image is refreshed every 2 s.
Another holographic approach is taken by the object-based media group at
MIT. In this display a multi-view holographic stereogram is produced in what is
termed a diffraction specific coherent (DSC) panoramagram [81]. This provides a
sufficiently large number of discrete views to provide the impression of smooth
motion parallax. Frame rates of 15 frames per second have been achieved. So far
horizontal-only parallax has been demonstrated but the system can be readily
extended to full parallax.
To summarize the current situation with stereoscopic displays, it is fairly
certain that glasses will not be acceptable for television for much longer and that
autostereoscopic displays will replace glasses types. At present multi-view
displays are the most common autostereoscopic type but do not provide sufficiently


good quality for television use. The use of head tracking in conjunction with a
multi-view display with a high native resolution is one means of making multi-view
displays more acceptable; however, the limitations of resolution loss and
limited depth of usable viewing field could still be an issue.
Head tracking applied to two-view displays offers a solution to the resolution and
restricted viewing issues, but these displays, for example the HELIUM3D display,
still require further development before they become a viable consumer television product.
It is likely that glasses systems will continue to be used in cinemas for the
foreseeable future. The reasons for this are twofold: first, the wearing of glasses is
more acceptable in a cinema where the audience tend not to keep putting on and
taking off the glasses. Second, the technical difficulties in placing images in the
correct positions over a large area would be extremely hard to overcome.
Niche markets where cost is of less importance will be suitable for the other
types of display, for example, light field and volumetric. Eventually all 3D displays could be either true holographic or possibly hybrid conventional/holographic
as work is proceeding in this area; however, it is not likely to happen within the
next 10 years. If holographic displays eventually become the principal display
method it is still unlikely that the capture would be holographic. Illuminating a
scene with coherent light would be extremely difficult and also would not create
natural lighting conditions. It is more likely that the hologram would be synthesized from images captured on a camera array.

References
1. Spottiswoode N, Spottiswoode R (1953) The theory of stereoscopic transmission. University of California Press, p 13
2. Gibson EJ, Gibson JJ, Smith OW, Flock H (1959) Motion parallax as a determinant of perceived depth. J Exp Psychol 58(1):40–51
3. Julesz B (1960) Binocular depth perception of computer-generated patterns. Bell Syst Tech J 39:1125–1162
4. Scotcher SM, Laidlaw DA, Canning CR, Weal MJ, Harrad RA (1997) Pulfrich's phenomenon in unilateral cataract. Br J Ophthalmol 81(12):1050–1055
5. Wallach H, O'Connell DN (1953) The kinetic depth effect. J Exp Psychol 45(4):205–217
6. Evans JPO (2003) Kinetic depth effect X-ray (KDEX) imaging for security screening visual information engineering. In: VIE 2003 international conference, 7–9 July 2003
7. Lambooij M, Ijsselsteijn W, Fortuin M, Heynderickx I (2009) Visual discomfort and visual fatigue of stereoscopic displays: a review. J Imaging Sci Tech 53(3):030201
8. Hoffman DM, Girshick AR, Akeley K, Banks MS (2008) Vergence–accommodation conflicts hinder visual performance and cause visual fatigue. J Vis 8(3), Article 33
9. Woods A, Docherty T, Koch R (1993) Image distortions in stereoscopic video systems. In: Proceedings of the SPIE, stereoscopic displays and applications IV, vol 1915. San Jose
10. Pastoor S (1991) 3D-television: a survey of recent research results on subjective requirements. Signal Processing: Image Communication. Elsevier Science Publishers BV, Amsterdam, pp 21–32
11. McCarthy S (2010) Glossary for video & perceptual quality of stereoscopic video. White paper prepared by the 3D@Home Consortium and the MPEG Forum 3DTV working group. 17 Aug 2010. http://www.3dathome.org/files/ST1-01-01_Glossary.pdf. Accessed 24 Jan 2012


12. Kajiki Y (1997) Hologram-like video images by 45-view stereoscopic display. SPIE Proc Stereosc Disp Virtual Real Syst IV 3012:154–166
13. Lucente M, Benton SA, Hilaire P St (1994) Electronic holography: the newest. In: International symposium on 3D imaging and holography, Osaka
14. Hilaire P St (1995) Modulation transfer function of holographic stereograms. In: Proceedings of the SPIE, applications of optical holography
15. Pastoor S (1992) Human factors of 3DTV: an overview of current research at Heinrich-Hertz-Institut Berlin. IEE colloquium on stereoscopic television: Digest No 1992/173:11/3
16. Lippmann G (1908) Épreuves réversibles. Photographies intégrales. Comptes-Rendus de l'Académie des Sciences 146(9):446–451
17. McCormick M, Davies N, Chowanietz EG (1992) Restricted parallax images for 3D TV. IEE colloquium on stereoscopic television: Digest No 1992/173:3/1–3/4
18. Arai J, Kawai H, Okano F (2006) Microlens arrays for integral imaging system. Appl Opt 45(36):9066–9078
19. Kanolt CW (1918) US Patent 1260682
20. Brown D (2009) Images across space. Middlesex University Press, London
21. Baird JL (1945) Improvements in television. UK Patent 573,008. Applied for 26 Aug 1943 to 9 Feb 1944, Accepted 1 Nov 1945
22. Funk (2008) Stereoptiplexer cinema system – outside-looking in. Veritas et Visus
23. Bordwell D, Thompson K (2010) Film history: an introduction, 3rd edn. McGraw-Hill, New York. ISBN 978-0-07-338613-3
24. Collender R (1986) 3D television, movies and computer graphics without glasses. IEEE Trans Cons Electro CE-32(1):56–61
25. Tilton HB (1987) The 3D oscilloscope – a practical manual and guide. Prentice Hall Inc., New Jersey
26. 3D Display Technology Chart (2012) http://www.3dathome.org/. Accessed 24 Jan 2012
27. Woods AJ, Yuen KL, Karvinen KS (2007) Characterizing crosstalk in anaglyphic stereoscopic images on LCD monitors and plasma displays. J SID 15(11):889–898
28. Traub AC (1967) Stereoscopic display using rapid varifocal mirror oscillations. Appl Opt 6(6):1085–1087
29. Dolgoff G (1997) Real-depth™ imaging: a new imaging technology with inexpensive direct-view (no glasses) video and other applications. SPIE Proc Stereosc Disp Virtual Real Syst IV 3012:282–288
30. Sullivan A (2004) DepthCube solid-state 3D volumetric display. Proc SPIE 5291(1), ISSN 0277-786X, pp 279–284. doi:10.1117/12.527543
31. Blundell BG, Schwartz AJ (2000) Volumetric three-dimensional display systems. Wiley-IEEE Press, New York. ISBN 0-471-23928-3
32. Okui M, Arai J, Okano F (2007) New integral imaging technique uses projector. doi:10.1117/2.1200707.0620. http://spie.org/x15277.xml?ArticleID=x15277. Accessed 28 Jan 2012
33. Hitachi (2010) Hitachi shows 10″ glasses-free 3D display. Article published in 3D-display-info website: www.3d-display-info.com/hitachi-shows-10-glasses-free-3d-display. Accessed 24 Jan 2012
34. Travis ARL, Lang SR (1991) The design and evaluation of a CRT-based autostereoscopic display. Proc SID 32/4:279–283
35. Dodgson N (2011) Multi-view autostereoscopic 3D display. Presentation given at the Stanford workshop on 3D imaging. 27 Jan 2011. http://scien.stanford.edu/pages/conferences/3D%20Imaging/presentations/Dodgson%20%20%20Stanford%20Workshop%20on%203D%20Imaging.pdf. Accessed 29 Jan 2012
36. Berkel C, Parker DW, Franklin AR (1996) Multiview 3D-LCD. SPIE Proc Stereosc Disp Virtual Real Syst IV 2653:32–39
37. Schwartz A (1985) Head tracking stereoscopic display. In: Proceedings of IEEE international display research conference, pp 141–144
38. Woodgate GJ, Ezra D, Harrold J, Holliman NS, Jones GR, Moseley RR (1997) Observer tracking autostereoscopic 3D display systems. SPIE Proc Stereosc Disp Virtual Real Syst IV 3012:187–198


39. Benton SA, Slowe TE, Kropp AB, Smith SL (1999) Micropolarizer-based multiple-viewer autostereoscopic display. SPIE Proc Stereosc Disp Virtual Real Syst IV 3639:76–83
40. Ichinose S, Tetsutani N, Ishibashi M (1989) Full-color stereoscopic video pickup and display technique without special glasses. Proc SID 3014:319–323
41. Eichenlaub J (1994) An autostereoscopic display with high brightness and power efficiency. SPIE Proc Stereosc Disp Virtual Real Syst IV 2177:143–149
42. Sandlin DJ, Margolis T, Dawe G, Leigh J, DeFanti TA (2001) Varrier autostereographic display. SPIE Proc Stereosc Disp Virtual Real Syst IV 4297:204–211
43. Schwerdtner A, Heidrich H (1998) The Dresden 3D display (D4D). SPIE Proc Stereosc Disp Appl IX 3295:203–210
44. Sorensen SEB, Hansen PS, Sorensen NL (2001) Method for recording and viewing stereoscopic images in color using multichrome filters. US Patent 6687003. Free patents online. http://www.freepatentsonline.com/6687003.html
45. Jorke H, Fritz M (2012) Infitec – a new stereoscopic visualization tool by wavelength multiplexing. http://jumbovision.com.au/files/Infitec_White_Paper.pdf. Accessed 25 Jan 2012
46. Lipton L (1988) Method and system employing a push–pull liquid crystal modulator. US Patent 4792850
47. Soneira RM (2012) 3D TV display technology shoot-out. http://www.displaymate.com/3D_TV_ShootOut_1.htm. Accessed 27 Jan 2012
48. Favalora GE (2009) Progress in volumetric three-dimensional displays and their applications. Opt Soc Am. http://www.greggandjenny.com/gregg/Favalora_OSA_FiO_2009.pdf. Accessed 28 Jan 2012
49. Mora B, Maciejewski R, Chen M (2008) Visualization and computer graphics on isotropically emissive volumetric displays. IEEE Comput Soc. doi:10.1109/TVCG.2008.99. https://engineering.purdue.edu/purpl/level2/papers/Mora_LF.pdf. Accessed 28 Jan 2012
50. Cossairt O, Napoli J, Hill SL, Dorval RK, Favalora GE (2007) Occlusion-capable multiview volumetric three-dimensional display. Appl Opt 46(8). https://engineering.purdue.edu/purpl/level2/papers/Mora_LF.pdf. Accessed 28 Jan 2012
51. Otsuka R, Hoshino T, Horry Y (2006) Transpost: 360 deg-viewable three-dimensional display system. Proc IEEE 94(3). doi:10.1109/JPROC.2006.870700
52. Tanaka K, Aoki S (2006) A method for the real-time construction of a full parallax light field. In: Proceedings of SPIE, stereoscopic displays and virtual reality systems XIII 6055, 605516. doi:10.1117/12.643597
53. Shimada S, Kimura T, Kakehata M, Sasaki F (2006) Three dimensional images in the air. Translation of the AIST press release of 7 Feb 2006. http://www.aist.go.jp/aist_e/latest_research/2006/20060210/20060210.html. Accessed 28 Jan 2012
54. Jones A, McDowall I, Yamada H, Bolas M, Debevec P (2007) Rendering for an interactive 360° light field display. Siggraph 2007 Emerging Technologies. http://gl.ict.usc.edu/Research/3DDisplay/. Accessed 28 Jan 2012
55. Jurik J, Jones A, Bolas M, Debevec P (2011) Prototyping a light field display involving direct observation of a video projector array. In: Proceedings of IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW), pp 15–20
56. Baloch T (2001) Method and apparatus for displaying three-dimensional images. US Patent 6,201,565 B1
57. http://www.holografika.com/. Accessed 28 Jan 2012
58. http://www.3d-display-info.com/hitachi-shows-10-glasses-free-3d-display. Accessed 28 Jan 2012
59. Moller C, Travis A (2004) Flat panel time multiplexed autostereoscopic display using an optical wedge waveguide. In: Proceedings of 11th international display workshops, Niigata, pp 1443–1446
60. Okoshi T (1976) Three dimensional imaging techniques. Academic Press, New York, p 129
61. Hembd C, Stevens R, Hutley M (1997) Imaging properties of the Gabor superlens. EOS topical meetings digest series: 13 microlens arrays. NPL Teddington, pp 101–104
62. Okoshi T (1976) Three dimensional imaging techniques. Academic Press, New York, p 140


63. Hitachi (2011) Stereoscopic display technology to display stereoscopic images superimposed on real space. News release 30 Sept 2011. http://www.hitachi.co.jp/New/cnews/month/2011/09/0930.html. Accessed 29 Jan 2012
64. IJzerman W et al (2005) Design of 2D/3D switchable displays. Proc SID 36(1):98–101
65. Dimenco (2012) Products – displays – 3D stopping power 5200 professional 3D display. http://www.dimenco.eu/displays/. Accessed 29 Jan 2012
66. Boev A, Raunio K, Gotchev A, Egiazarian K (2008) GPU-based algorithms for optimized visualization and crosstalk mitigation on a multiview display. In: Proceedings of SPIE-IS&T electronic imaging. SPIE, vol 6803, pp 24. http://144.206.159.178/FT/CONF/16408309/16408328.pdf. Accessed 29 Jan 2012
67. Masterimage (2012) Autostereoscopic 3D LCD. http://www.masterimage3d.com/products/3d-lcd0079szsxdrf. Accessed 29 Jan 2012
68. Schmidt A, Grasnick A (2002) Multi-viewpoint autostereoscopic displays from 4D-vision. In: Proceedings of SPIE photonics west 2002: electronic imaging, vol 4660, pp 212–221
69. Inition (2012) http://www.inition.co.uk/search/node/autostereoscopic. Accessed 29 Jan 2012
70. Surman P, Hopf K, Sexton I, Lee WK, Bates R (2008) Solving the problem – the history and development of viable domestic 3DTV displays. In: Three-dimensional television, capture, transmission, display (edited book). Springer Signals and Communication Technology
71. Haussler R, Schwerdtner A, Leister N (2008) Large holographic displays as an alternative to stereoscopic displays. In: Proceedings of SPIE, stereoscopic displays and applications XIX, vol 6803(1)
72. Travis A, Emerton N, Large T, Bathiche S, Rihn B (2010) Backlight for view-sequential autostereo 3D. SID 2010 Digest, pp 215–217
73. Microsoft (2010) The wedge – seeing smart displays through a new lens. http://www.microsoft.com/appliedsciences/content/projects/wedge.aspx. Accessed 4 Feb 2012
74. Krah C (2010) Three-dimensional display system. US Patent 7,843,449. www.freepatentsonline/7843339.pdf. Accessed 4 Feb 2012
75. Purcher J (2011) Apple wins a surprise 3D display and imaging patent stunner. http://www.patentlyapple.com/patently-apple/2011/09/whoa-apple-wins-a-3d-display-imagingsystem-patent-stunner.html
76. Surman P, Sexton I, Hopf K, Lee WK, Neumann F, Buckley E, Jones G, Corbett A, Bates R, Talukdar S (2008) Laser-based multi-user 3D display. J SID 16(7):743–753
77. Erden E, Kishore VC, Urey H, Baghsiahi H, Willman E, Day SE, Selviah DR, Fernandez FA, Surman P (2009) Laser scanning based autostereoscopic 3D display with pupil tracking. Proc IEEE Photonics 2009:10–11
78. Ng R, Levoy M, Bredif M, Duval G, Horowitz M, Hanrahan P (2005) Light field photography with a hand-held plenoptic camera. Stanford Tech Report CTSR 2005-02. http://hci.stanford.edu/cstr/reports/2005-02.pdf. Accessed 6 Feb 2012
79. Chinnock C (2011) Here comes the second wave of 3D. Display Daily. http://displaydaily.com/2011/01/05/here-comes-the-second-wave-of-3d/. Accessed 6 Feb 2012
80. Blanche P-A et al (2010) Holographic three-dimensional telepresence using large-area photorefractive polymer. Nature 468(7320):80–83. http://www.optics.arizona.edu/pablanche/images/Articles/1010_Blanche_Nature468.pdf. Accessed 6 Feb 2012
81. Barabas J, Jolly S, Smalley DE, Bove VM (2011) Diffraction specific coherent panoramagrams of real scenes. Proceedings of SPIE, Practical Holography XXV, vol 7957

Chapter 14

Subjective and Objective Visual Quality


Assessment in the Context of Stereoscopic
3D-TV
Marcus Barkowsky, Kjell Brunnström, Touradj Ebrahimi,
Lina Karam, Pierre Lebreton, Patrick Le Callet, Andrew Perkis,
Alexander Raake, Mahesh Subedar, Kun Wang, Liyuan Xing
and Junyong You

Abstract Subjective and objective visual quality assessment in the context of


stereoscopic three-dimensional TV (3D-TV) is still in the nascent stage and needs
to consider the effect of the added depth dimension. As a matter of fact, quality
assessment of 3D-TV cannot be considered as a trivial extension of two-dimensional (2D) cases. Furthermore, it may also introduce negative effects not experienced in 2D, e.g., discomfort or nausea. Based on efforts initiated within the COST
Action IC1003 QUALINET, this chapter discusses current challenges in relation
to subjective and objective visual quality assessment for stereo-based 3D-TV. Two
case studies are presented to illustrate the current state of the art and some of the
remaining challenges.

M. Barkowsky · P. Le Callet (&)
LUNAM Université, Université de Nantes, IRCCyN UMR CNRS 6597,
Polytech Nantes, 44306 Nantes Cedex 3, France
e-mail: patrick.lecallet@univ-nantes.fr
M. Barkowsky
e-mail: marcus.barkowsky@univ-nantes.fr
K. Brunnström · K. Wang
Department of NetLab, Acreo AB, Electrum 236, SE 16440 Kista, Sweden
e-mail: kjell.brunnstrom@acreo.se
T. Ebrahimi
Multimedia Signal Processing Group (MMSPG),
École Polytechnique Fédérale de Lausanne (EPFL),
EPFL STI IEL GR-EB, CH-1015 Lausanne, Switzerland
e-mail: touradj.ebrahimi@epfl.ch
L. Karam · M. Subedar
School of ECEE, Arizona State University, MC 5706,
Tempe, AZ 85287-5706 USA
e-mail: karam@asu.edu







Keywords 3D-TV · ACR · Correlation · Crosstalk perception · Coding error · Depth map · Human visual system · IPTV · Mean opinion score · Objective visual quality · Quality assessment · Quality of experience (QoE) · Reliability · Stereoscopic 3D-TV · Subjective visual quality · SSIM

14.1 Introduction
There is a growing demand for three-dimensional TV (3D-TV) in various applications including entertainment, gaming, training, and education, to name a few.
This in turn has spurred the need for methodologies that can reliably assess the
quality of the user experience in the context of 3D-TV-based applications.
Subjective and objective quality assessment has been widely studied in the
context of two-dimensional (2D) video and television. These assessment methods
provide metrics to quantify the perceived visual quality of the 2D video. Subjective
quality assessment requires human observers to rate different aspects of the video
quality, such as compression artifacts, sharpness level, color and contrast, and the
overall visual experience, in a controlled experimental setup. The average score
provided by the subjects is a typical output of such experiments. This is considered
as a measurement of subjective quality, also called the mean opinion score (MOS).
Subjective quality assessment methods are time-consuming and are hard to deploy

P. Lebreton · A. Raake
Assessment of IP-Based Applications, Telekom Innovation Laboratories,
TU Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany
e-mail: pierre.lebreton@telekom.de
A. Raake
e-mail: Alexander.raake@telekom.de
A. Perkis · L. Xing · J. You
Centre for Quantifiable Quality of Service (Q2S) in Communication Systems,
Norwegian University of Science and Technology (NTNU),
N-7491 Trondheim, Norway
e-mail: andrew.perkis@iet.ntnu.no
L. Xing
e-mail: liyuan.xing@q2s.ntnu.no
J. You
e-mail: junyong.you@q2s.ntnu.no
M. Subedar
Intel Corporation and Arizona State University, Chandler, AZ 85226 USA
e-mail: mahesh.subedar@asu.edu
K. Wang
Department of Information Technology and Media (ITM), Mid Sweden University,
Holmgatan 10, 85170 Sundsvall, Sweden
e-mail: kun.wang@acreo.se


on a regular basis; hence there is a need for objective quality assessment. Objective
quality assessment methods provide metrics or indices, that is, computational
models for predicting the visual quality perceived by humans, and aim to provide a
high degree of correlation with the subjective scores. For extensive reviews on 2D
objective quality metrics, the reader might refer to the following papers [1, 2].
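As a minimal illustration of these two notions, the Python sketch below computes a MOS from a matrix of opinion scores and the Pearson linear correlation between the MOS values and the outputs of a hypothetical objective metric; the numerical data and names are invented for the example and are not taken from this chapter.

import numpy as np

# scores[i, j]: opinion score given by subject i to stimulus j (invented data)
scores = np.array([[5, 4, 2, 1],
                   [4, 4, 3, 2],
                   [5, 3, 2, 1]], dtype=float)

mos = scores.mean(axis=0)        # mean opinion score per stimulus

# Outputs of a hypothetical objective metric for the same four stimuli
objective = np.array([0.95, 0.80, 0.55, 0.30])

# Pearson linear correlation between objective scores and MOS
pearson = np.corrcoef(objective, mos)[0, 1]
print(mos, pearson)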
3D video adds several new challenges compared to 2D video, from both the
human-factors and the technological points of view. 3D video needs to capture the
sense of presence or ultimate reality. Hence, the overall quality of the 3D visual
experience is referred to as quality of experience (QoE). Consequently, both
subjective and objective 2D visual quality assessments require revisiting and
augmenting to measure the 3D QoE. Toward this goal, several efforts have been
initiated in many academic labs working on 3D video QoE assessment.
Simultaneously, standardization bodies and related groups (ITU, VQEG, IEEE P3333)
have identified several questions on 3D QoE assessment, witnessing the current
difficulties in going beyond 2D quality assessment. The COST Action IC1003
QUALINET (Network on QoE in Multimedia Systems and Services, www.qualinet.eu),
which gathers effort from more than 25 countries, is also strongly considering
3D-TV QoE assessment as an important challenge to tackle for wider acceptance of
3D video technology. It is, so far, one of the largest efforts to integrate and coordinate
activities related to 3D-TV QoE assessment from different research groups. This
chapter reflects some of the activities related to QUALINET. A discussion on
current challenges related to subjective and objective visual quality assessment for
stereo 3D video and a brief review of the current state of the art are presented.
These challenges are then illustrated by two case studies to measure 3D subjective
and objective quality metrics. The first case study deals with how to measure the
effects of crosstalk, as an aspect of display technologies, on 3D video QoE. The
second case study is focused on examining the effects of both transmission and
coding errors on 3D video QoE.

14.2 From 2D to 3D Quality Assessment: New Challenges


to Address
The emergence of 3D video technologies imposes new requirements and consequently opens new perspectives for quality assessment. New paradigms, compared
to 2D image and video quality assessment [3], are needed for two main reasons:
(a) The QoE of 3D videos triggers multidimensional factors from the human point
of view.
(b) Numerous use cases require quality assessment for different needs all along
the content delivery chain (from capturing to production, coding, transmission,
decoding, and rendering) as shown in Fig. 14.2. This is all the more true that at
every stage of the chain, the technology is currently not completely mature.
The artifacts introduced during each stage have different effects on the overall


Fig. 14.1 Factors affecting the 3D video QoE

QoE. Separate methods of quality assessment need to be developed for different application scenarios and to measure and improve the current
technologies.
3D video quality is highly dependent on the sense of true presence or ultimate
reality and can be significantly affected if discomfort is experienced. With the
current stereo 3D displays, visual discomfort is experienced by some viewers,
which typically induces symptoms such as eye soreness, headache, and dizziness.
3D QoE should also measure the depth perception and the added value of the depth
dimension to the video. Hence, in order to more reliably measure the QoE of a 3D
video system, the three components shown in Fig. 14.1 need to be considered.
Subjective and objective metrics should measure the contribution of these factors
to the 3D QoE.

14.2.1 Depth Perception and Stereoscopic 3D TV


In order to understand the challenges that are associated with 3D video quality
evaluation, one needs to know 3D visual perception basics. The reader might refer
to a previous chapter of this book for further details on binocular vision. A brief
description of how current 3D-TV simulates our perception of depth is provided
here. The human visual system (HVS) can perceive depth from monocular cues
which are derived from the image of only one eye. These monocular cues include
head movement parallax, linear perspective, occlusion of more distant objects by
near ones, shading and texture illumination gradient, and lens accommodation
(muscular tension to focus objects). For depth estimation, however, our brain also
uses binocular vision, which is based on the fact that we sense the world through
two 2D images that are projected on the retinas of the left eye and right eye,
respectively. Since our left and right eyes are separated by a lateral distance,


Fig. 14.2 3D content delivery chain

Fig. 14.3 Fixation curve

known as the inter-pupillary distance (IPD), each eye sees a slightly different
projection of the same scene. Binocular vision corresponds to the process in which
our brain computes the disparity, which is the difference in distance between the
center of fixation and two corresponding points in the retinal image of the left and
right eyes.
The images projected on the fixation center have zero disparity and they correspond to objects on the fixation curve (Fig. 14.3). Objects with positive or
negative disparity appear in front or behind the fixation curve. The binocular
vision is used in stereoscopic 3D displays, where the left-eye image is horizontally
shifted with respect to the right-eye image. The brain tries to fuse these disparity


Fig. 14.4 Depth perception in stereoscopic displays

images to create a sense of depth. As shown in Fig. 14.4, depending on the


disparity the object is perceived as a near or far object. It is to be noted that the left
and right eyes are always focusing on the display screen, but because of the lateral
shift in the left and right eye images, the brain perceives it as a near/far object.
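The geometry sketched in Fig. 14.4 can be written down with similar triangles. The short Python function below does this for a given on-screen parallax; the 3 m viewing distance and 65 mm IPD used as defaults are illustrative assumptions of this sketch, not values from this chapter.

def perceived_depth(parallax_mm, viewing_distance_mm=3000.0, ipd_mm=65.0):
    # Distance from the viewer to the fused point for an on-screen parallax
    # (positive = uncrossed, point behind the screen; negative = crossed,
    # point in front of the screen), obtained by similar triangles.
    if parallax_mm >= ipd_mm:
        return None   # parallel or diverging rays: no fusion
    return viewing_distance_mm * ipd_mm / (ipd_mm - parallax_mm)

# Zero parallax is perceived on the screen plane; positive and negative
# parallax push the fused point behind and in front of the screen.
for p_mm in (-10.0, 0.0, 10.0):
    print(p_mm, perceived_depth(p_mm))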

14.2.2 3D-TV QoE: On the Perceptual Maturity


of the Delivery Chain
3D video QoE is affected by various factors along the delivery chain such as content
acquisition/production, compression, transmission, and display. Figure 14.2 shows
the delivery chain for a 3D-TV system. On the acquisition and production side,
shooting conditions should be managed to better assess the user experience [4, 5].
Proper 3D shooting has to be aware of the final presence factor (ratio of screen width
and viewing distance). This depends on displays and is restricted, compared to the
real 3D world. A set of 3D video acquisition factors directly affect the QoE, such as
overall depth range, convergence, binocular rivalry. During the capture of multiple
views of the scene, depending on the camera setup, there will be misalignments and
differences in luminance and/or chrominance values. These differences not only
result in quality degradation, but also lead to incorrect or loss of depth information
[6]. Subsequently, two or more views of a scene need to be stored or transmitted
through a broadcast chain. In order to reduce the required doubling of the bandwidth
for the left and right eye images (one pair of images is taken for each view of the
scene), compression techniques are applied [7]. This may result in artifacts such as


blur, blockiness, graininess, and ringing. Transmission errors can also cause
degradations in quality. As the compression and transmission artifact impacts may be
different on the left and right eye images, spatial and temporal inconsistencies
between the left and right eye views may occur.
Finally, the 3D video signal needs to be rendered on a display. Depending on
the display technology, various artifacts may impair the perceived picture quality.
A reduction in spatial and/or temporal resolution, loss in luminance and/or color
gamut, and the occurrence of crosstalk are typical artifacts related to display
technologies. Visual discomfort is a physiological problem experienced by many
subjects when viewing 3D video. There are many reasons leading to these visual
complaints. One of the most important causes of discomfort is the accommodation–vergence
conflict. Accommodation refers to the oculomotor ability of the human
eye to adapt its lens to focus at different distances. Vergence refers to the oculomotor
ability to rotate both eyes toward the focused object. In
natural viewing conditions, accommodation and vergence are always matched. But
on (most) 3D displays, they are mismatched as a consequence of the fact that the
eyes focus at the screen, while they converge at a location in front of or behind the
screen where an object is rendered.
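To make the mismatch concrete, the sketch below expresses the accommodation distance (the screen) and the vergence distance (the fused point) as optical powers in diopters and reports their difference; it reuses the perceived-depth geometry above, and the 3 m viewing distance and 65 mm IPD remain illustrative assumptions of this sketch.

def av_mismatch_diopters(parallax_mm, viewing_distance_mm=3000.0, ipd_mm=65.0):
    # Accommodation stays on the screen; vergence follows the fused point.
    # The conflict is often expressed as the difference between the two
    # distances in diopters (1 / distance in meters). Valid for parallax
    # smaller than the IPD.
    vergence_mm = viewing_distance_mm * ipd_mm / (ipd_mm - parallax_mm)
    accommodation_d = 1000.0 / viewing_distance_mm
    vergence_d = 1000.0 / vergence_mm
    return abs(accommodation_d - vergence_d)

# A 10 mm crossed parallax at 3 m produces a mismatch of roughly 0.05 D.
print(av_mismatch_diopters(-10.0))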

14.3 3D-TV Quality Assessment: Trends and Challenges


14.3.1 Subjective Quality Assessment
Subjective quality assessment of 3D video is crucial to develop 3D-TV infrastructure, which will provide optimum 3D QoE. Subjective evaluation of 3D QoE
is multifaceted, and should include measurement of depth perception, visual
comfort, along with 2D video quality (as shown in Fig. 14.1). 3D QoE tries to
measure the sense of ultimate reality or true presence, along with physiological
issues such as eye fatigue.
2D video quality is an important factor in the assessment of quality of 3D video.
Methodologies for subjective quality assessment for 2D video have been extensively studied and standardized as ITU-T Rec. P.910 and ITU-R Rec. BT.500. 3D
video subjective quality assessment followed similar efforts and was standardized
as ITU-R Rec. BT.1438, which adopted the same methodology described in ITU-R
Rec. BT.500. These efforts are not sufficient to capture the overall 3D QoE.
Depth perception tries to measure the added value of stereoscopic depth in 3D-TV systems. Subjective experiments were conducted in the literature to predict the
depth perception. In [8], the relationship among the quality, depth perception, and
naturalness is studied through subjective assessment of 3D video. It was concluded
that the depth perception decreases as the quality decreases. When the quality is
low, the perceived depth of the 3D video sequences gets closer to that of 2D video
sequences.


Visual discomfort is important to measure as part of the overall 3D QoE. Previous
experiments were conducted to study visual fatigue in the context of multi-view
acquisition. Standard subjective quality assessment protocols were used, but
requesting observers to rate visual fatigue instead of overall quality. In these
experiments, video clips of 8–10 s length have been used, which is quite short to
stress visual fatigue.
Besides measurement methods, in order to conduct subjective quality assessment of 3D video, one has to carefully consider the following factors:
(a) Content selection: The content selection is very important in order to correctly
interpret the data obtained from the conducted 3D video subjective assessment
experiments. When shooting or generating the content, the camera baseline
width and disparity measurements will affect the comfortable viewing conditions. These viewing conditions need to be matched during the conducted
subjective experiments. The viewing conditions here represent display size and
the viewing distance. The content has to be re-authored for different viewing
conditions.
(b) Display technology: Currently, there are three main 3D display technologies
available for 3D-TV, including active displays, passive displays, and autostereoscopic displays. The underlying display technology affects the spatial/
and temporal resolutions, crosstalk level, comfortable viewing position, and
need to be considered in the subjective assessment.
(c) 3D video pipe: The subjective quality assessment needs to consider the
underlying technologies in a 3D-TV video pipe. The selection of different
formats (frame compatible, full-frame), compression methods, distribution
channels will introduce different kinds of artifacts. These factors have to be
defined and reflected in the subjective assessment.
(d) Viewing conditions: Viewing conditions are strongly coupled with content
selection and display technology. Display size and viewing distance need to be
defined based on the created content. Different luminance levels, lighting
conditions, and display calibration have to be considered based on the display
technology.
There is a need for developing subjective assessment methodologies that
would enable the study of overall 3D QoE. The existing subjective quality
assessment methodologies for 3D video are largely the same as those for 2D
video, which do not measure the multifaceted nature of 3D video quality.
Additionally, there are no standardized or well-known methodologies for the
evaluation of the quality of depth perception or visual comfort, which are important
for evaluating the overall 3D QoE. Such methodologies are indispensable for the
planning and management of 3D-TV services and for the development of
objective 3D video quality assessment metrics. Considering these challenges, a
current way to success is to restrict the scope of studies, validating step-by-step
some methodological issues and exploring progressively more and more aspects
of the technological issues.


14.3.2 Objective Quality Assessment: Where are We?


Objective quality assessment for 3D-TV applications tries to predict the 3D QoE that
would be obtained from subjective quality assessment. The objective metrics help in providing solutions that
will attempt to maximize the QoE under given constraints. Current efforts in objective
quality metrics for 3D video are targeted toward specific applications and attempt to
assess the quality with respect to one or two aspects of the overall 3D video chain, such
as video compression, depth map quality, view synthesis, and visual discomfort.
Following the 2D video objective quality assessment trends, 3D video objective
quality assessment efforts started with the main focus on coding artifacts. The possibility to adapt 2D metrics in the context of stereoscopic images has been studied, by
calculating 2D metrics for each view [10]. In [11], 2D video quality metrics were
extended adding considerations for disparity distortions that could be associated with
depth distortions. Computational models of the HVS have been recently considered for
3D video quality assessment. In [12], computational models of the HVS were adapted
to implement a new stereoscopic image quality metric, named stereo band limited
contrast (SBLC) algorithm. In [13], a new model was proposed for 3D asymmetric
video coding based on the observation that the spatial frequency determines the view
domination under the action of the HVS. In [14], a no-reference 3D video quality
metric was proposed considering the strong dependency of the perceived distortion and depth
of stereoscopic images on local features, such as edge, flat, and texture regions.
In [15], this method was extended to video by adding a temporal feature measure.
A continuous quality rating scale was used for the subjective quality assessment.
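The simplest of the approaches above, applying a 2D metric to each view and combining the results [10], can be sketched as follows; PSNR is used here purely as a placeholder for whichever 2D metric is adapted, and the equal-weight average of the two views is an assumption of this example rather than the method of [10].

import numpy as np

def psnr(reference, distorted, peak=255.0):
    # Standard 2D PSNR between two frames of the same size.
    mse = np.mean((reference.astype(float) - distorted.astype(float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def stereo_score(ref_left, dist_left, ref_right, dist_right):
    # Apply the 2D metric to each view and average the two results.
    return 0.5 * (psnr(ref_left, dist_left) + psnr(ref_right, dist_right))

# Illustrative use with random 8-bit frames standing in for decoded views.
rng = np.random.default_rng(0)
ref_l = rng.integers(0, 256, (480, 640), dtype=np.uint8)
ref_r = rng.integers(0, 256, (480, 640), dtype=np.uint8)
dist_l = np.clip(ref_l + rng.normal(0, 3, ref_l.shape), 0, 255)
dist_r = np.clip(ref_r + rng.normal(0, 3, ref_r.shape), 0, 255)
print(stereo_score(ref_l, dist_l, ref_r, dist_r))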
Objective depth map quality assessment has been recently considered in the
literature. The depth map itself is meaningless for an observer; this topic is lacking
in valid subjective quality assessment protocols. Consequently, one should assess
its value through its use in a rendering process, which makes the related quality
assessment metrics application- and technology-dependent. In [16], the quality of
the view synthesis and depth map compression is evaluated in the context of a
video-plus-depth coding framework. It is concluded that view synthesis errors
need to be considered when measuring the quality of compressed depth maps.
All the aforementioned methods mainly consider the picture quality aspects of the
3D video, and no explicit validation is performed to identify the relationship between
measured quantities and the overall QoE in its multidimensional form. Actually, current
objective quality assessment techniques are mainly validated comparing their outputs
with subjective scores obtained following ITU-R Rec. BT.500.12 or similar methods.
This is insufficient to capture the real QoE (including depth perception and comfort).

14.4 Case Study: Measuring the Perception of Crosstalk


As mentioned before, QoE can be degraded by various binocular artifacts introduced
in each stage from the acquisition to the restitution in a typical 3D processing chain.
For example, depth plane curvature, keystone distortion, and cardboard effect can be


introduced in the capturing stage and convergence-accommodation rivalry, intraocular crosstalk, shear distortion, puppet theater effect, picket fence effect in the
displaying stage, as reviewed in [17–19]. Most of these artifacts cannot be eliminated
completely because of the limitation of current 3D imaging techniques, although the
causes and nature of these binocular artifacts have been extensively investigated.
In particular, crosstalk is produced by an imperfect view separation causing a
small proportion of one eye image to be seen by the other eye as well. It is one of
the most annoying artifacts in the visualization stage of stereoscopic imaging and
exists in almost all stereoscopic displays [20]. Although it is well known that
crosstalk artifacts are usually perceived as ghosts, shadows, or double contours by
human subjects, it is still unclear which factors influence crosstalk perception, and
how, both qualitatively and quantitatively. Subjective quality testing
methodologies are often utilized to study these mechanisms. Some efforts have
been devoted to this topic and several factors have been discovered to have an
impact on crosstalk perception. Specifically, the binocular parallax (camera
baseline) [20, 21], crosstalk level [20], image contrast [21, 22], and hard edges
[22] have an intensified impact on users' perception of crosstalk, while texture and
details of scene content [22] are helpful to conceal crosstalk. Moreover, other
factors (e.g. monocular cues of images [23]) and artifacts (e.g. blur and vertical
disparity [24]) have also been found to play an important role in crosstalk perception. In this section, we report results from a case study on the perception of the
crosstalk effect; both subjective and objective assessments are considered.

14.4.1 Subjective Tests on Crosstalk Perception


Laboratory Environment:
An aligned projected polarization technique was used to present 3D images,
forming a screen region with a width of 1.12 m, a height of 0.84 m, and a resolution
of 1,280 × 960. Images up-sampled by a bicubic interpolation method were
displayed in full screen mode. The subjects, equipped with polarized glasses, were
asked to view the 3D images at a distance of five times the image height (0.84 m × 5)
in a dark room, except for a reading lamp in front of the subject.
The system-introduced crosstalk of the display system P was measured according
to the Definition 2 in [25] by taking the effect of black level into consideration,
which was approximately 3 % in our experiments. For a polarized display, the
system-introduced crosstalk is consistent over the display, between projectors, and
among different combinations of brightness between left and right test images.
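The measurement procedure itself is not reproduced here, but a common black-level-compensated crosstalk ratio, in the spirit of the definition referenced above (the exact Definition 2 of [25] should be consulted for the study's procedure), can be written as follows; the luminance readings are hypothetical values chosen only to reproduce a figure of about 3 %.

def crosstalk_ratio(leakage_lum, signal_lum, black_lum):
    # Black-level-compensated crosstalk: luminance leaking into the unintended
    # eye, relative to the intended-eye luminance, both above the black level.
    return (leakage_lum - black_lum) / (signal_lum - black_lum)

# Hypothetical luminance readings in cd/m^2: leakage seen by the unintended
# eye, full-white in the intended eye, and the display black level.
print(100 * crosstalk_ratio(leakage_lum=9.0, signal_lum=203.0, black_lum=3.0))  # about 3 %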
Test Stimuli:
In total, 72 stimuli (6 scene contents × 3 camera baselines × 4 crosstalk levels)
were included in the test.
The scene contents include Book Arrival, Champagne, Dog, Love Bird, Outdoor,
Pantomime, and Newspaper from the MPEG [26], among which Book Arrival was used
for the training. These contents cover a wide range of depth structures, contrasts, colors,


edges, and textures. For each content, four
consecutive cameras were selected from the multi-view sequences, which forms three
camera baselines in a way that the leftmost camera always served as the left eye view
and the other three cameras took turns as the right eye views for stereoscopic images.
Finally, in order to simulate different levels of system-introduced crosstalk for
different displays, four levels of crosstalk artifacts were added to each 3D image
pairs using the algorithm developed in [27] as follows:
Rpc = Ro + p · Lo
Lpc = Lo + p · Ro                                                    (14.1)
where Lo and Ro denote the original left and right views, Lpc and Rpc are the distorted views
by simulating the system-introduced crosstalk distortions, and the parameter p is to
adjust the level of crosstalk distortion, which was set to 0, 5, 10, and 15 %, respectively. By taking the system-introduced crosstalk of the display system P (3 %) into
consideration, the actual crosstalk level viewed by the subjects was thereby 3, 8, 13,
and 18 %, respectively. The additive rule between the system-introduced crosstalk of
display system P and simulated system-introduced crosstalk p was justified in [28].
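A minimal NumPy sketch of this simulation step is given below; the clipping to the displayable range is an added assumption of the sketch and not part of Eq. (14.1), and the array names and random frames are illustrative only.

import numpy as np

def add_crosstalk(left, right, p):
    # Eq. (14.1): each distorted view is the original view plus a fraction p
    # of the opposite view. `left` and `right` are float images in [0, 1].
    left_pc = np.clip(left + p * right, 0.0, 1.0)
    right_pc = np.clip(right + p * left, 0.0, 1.0)
    return left_pc, right_pc

# Illustrative use with random frames; in the test p was 0, 5, 10, and 15 %.
left = np.random.rand(480, 640, 3)
right = np.random.rand(480, 640, 3)
for p in (0.00, 0.05, 0.10, 0.15):
    left_pc, right_pc = add_crosstalk(left, right, p)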
Test Methodology:
The single stimulus (SS) method was adopted in the test, since it is difficult to
choose an original 3D image as the reference point when there is a combination of
different camera baselines. Moreover, a minor modification was made to the SS
method so that the subjects could freely decide the viewing time for each image,
and a test interface was implemented accordingly. A total of 28 subjects
participated in the tests. Before the training sessions, visual perception-related
characteristics of the subjects were collected, including pupillary distance, normal
or corrected binocular vision, color vision, and stereo vision.
During the training session, an example of each of the five categorical adjectival
levels (Imperceptible; Perceptible but not annoying; Slightly annoying; Annoying;
Very annoying) was shown to the subjects in order to benchmark and harmonize
their measurement scales. The content Book Arrival, with camera baseline and
crosstalk level combinations of (0 mm, 3 %), (58 mm, 3 %), (114 mm, 8 %),
(114 mm, 13 %), and (172 mm, 18 %), was used in the training. When each
image was displayed, the operator verbally explained the corresponding quality
level to the subjects until they completely understood.
During the test sessions, the subjects were first presented with three dummy 3D
images from the content Book Arrival, which had not been used in the training
session, in order to stabilize the subjects' judgment. Following the dummy images,
the 72 test images were shown to the subjects in random order. A new 3D image
was shown after a subject finished scoring the previous image.
Subjective Results Analysis:
Before computing the MOS values and performing a statistical analysis of the
subjective scores, a normality test and outlier detection were first carried
out, as defined in ITU-R Rec. BT.500. A β2 (kurtosis) test showed that the subjective scores


followed a normal distribution, and one outlier was also detected and the corresponding results were excluded from the subsequent analysis.
According to the MOS values, two conclusions can be drawn: (1) crosstalk level
and camera baseline have an impact on crosstalk perception; (2) the impact of
crosstalk level and camera baseline on crosstalk perception varies with the scene
content. These conclusions were further verified by an ANOVA analysis. The
ANOVA results confirmed that the main effects of crosstalk level, camera baseline,
and scene content were significant at the 0.05 level. The two-way interaction terms
were also significant at the same level, whereas the three-way interaction was not.
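For illustration, a minimal sketch of this kind of analysis is shown below. It assumes the raw votes are stored in a long-format pandas table with one row per rating and hypothetical column names; it is only one of several ways to reproduce the MOS computation and the three-way ANOVA described above.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format table: one rating per row, with columns
# 'content', 'baseline' (mm), 'crosstalk' (%), and 'score' (1-5).
scores = pd.read_csv("crosstalk_ratings.csv")  # hypothetical file

# MOS per stimulus: average of the votes over subjects.
mos = (scores.groupby(["content", "baseline", "crosstalk"])["score"]
             .mean().reset_index(name="MOS"))

# Three-way ANOVA with all interaction terms.
model = ols("score ~ C(crosstalk) * C(baseline) * C(content)",
            data=scores).fit()
print(anova_lm(model, typ=2))
```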

14.4.2 Objective Metric for Crosstalk Perception


Perceptual Attributes of Crosstalk:
Perceptual attributes of crosstalk describe, from a perceptual point of view, the
sensorial responses of the HVS. They can bridge the gap between low-level significant factors and high-level user perception of crosstalk.
We carefully studied the test stimuli with different amplitudes of the significant
factors and summarized the visual variations influencing user perception of
crosstalk as the perceptual attributes of crosstalk, namely the shadow degree,
the separation distance, and the spatial position of crosstalk. Specifically, we have defined
the shadow degree of crosstalk as the distinctness of crosstalk against the original
view and the separation distance of crosstalk as the distance of crosstalk separated
from the original view. They are both 2D perceptual attributes existing in a single
eye view but still maintained in 3D perception. The spatial position of crosstalk is
a 3D perceptual attribute, defined as the impact of crosstalk position in 3D space
on perception when the left and right views are fused.
Moreover, the relationship among crosstalk perception, perceptual attributes,
and significant factors can be summarized as follows: (1) 2D perceptual attributes
can characterize low-level significant factors at a more perceptual level; (2) 2D
perceptual attributes alone are not sufficient to explain the visual perception of
crosstalk. Thus, the 3D perceptual attribute should also be taken into account
when predicting user perception of crosstalk.
Perceptual Attribute Map:
A map reflecting the perceptual attributes to some extent is referred to as a perceptual
attribute map. Since the shadow degree, the separation distance of crosstalk, and their
interaction are most visible in edge regions with high contrast, we believe that the
Structural SIMilarity (SSIM) quality index [29] can describe the 2D perceptual
attributes of crosstalk to some extent. In our case, the original image L_o is the one
shown on the stereoscopic display, while the distorted image L_c is degraded by
both the system-introduced and the simulated crosstalk. Thus, the SSIM map is
derived as follows:


Fig. 14.5 Illustration of SSIM maps of Champagne (right: 100 mm, 3 %; left: 50 mm, 13 %)

$L_s = \mathrm{SSIM}(L_o, L_c)$    (14.2)

where SSIM denotes the SSIM algorithm proposed in [29] and Ls is the derived
SSIM map of the left-eye view. Figure 14.5 shows a representative illustration of the SSIM
map derived from the crosstalk-distorted Champagne presentations.
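A minimal sketch of how such a per-pixel SSIM map can be obtained with scikit-image is given below; the variable names and the grayscale conversion are our assumptions and are not part of the metric definition in [29].

```python
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

def ssim_map(original_left, distorted_left):
    """Return the per-pixel SSIM map L_s between L_o and L_c (Eq. 14.2)."""
    lo = rgb2gray(original_left)   # luminance-only comparison (assumption)
    lc = rgb2gray(distorted_left)
    # full=True returns the SSIM map in addition to the mean SSIM value.
    _, ls = structural_similarity(lo, lc, data_range=1.0, full=True)
    return ls
```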
The spatial position of crosstalk can characterize user perception of crosstalk in
3D space, because visible crosstalk on foreground objects might have a stronger
impact on perception than that on background objects. Therefore, in order to form a 3D
perceptual attribute map, the depth structure of scene content and regions of
visible crosstalk should be combined. We defined a filtered depth map as the 3D
perceptual attribute map:
$R_{dep} = \mathrm{DERS}(R_o)$    (14.3)

$R_{pdep}(i,j) = \begin{cases} R_{dep}(i,j) & \text{if } L_s(i,j) < 0.977 \\ 0 & \text{if } L_s(i,j) \ge 0.977 \end{cases}$    (14.4)

where DERS denotes the depth estimation reference software (DERS) algorithm
proposed in [30], R_dep is the generated depth map of the right view, i and j are the
pixel indices, and R_pdep denotes the filtered depth map corresponding to the visible
crosstalk regions of the left-eye image. In particular, the threshold of 0.977 was
obtained empirically from our experiments. Figure 14.6 gives an example of the
filtered depth maps of Champagne and Dog.
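The thresholding of Eq. (14.4) reduces to a simple element-wise selection; a minimal NumPy sketch is shown below, with the 0.977 threshold taken from the text and the array names assumed.

```python
import numpy as np

def filtered_depth_map(depth_right, ssim_left, threshold=0.977):
    """Keep depth only where the SSIM map signals visible crosstalk (Eq. 14.4)."""
    # Low SSIM values (< threshold) mark regions where crosstalk is visible.
    return np.where(ssim_left < threshold, depth_right, 0)
```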
Objective Metric for Crosstalk Perception:
The overall crosstalk perception is assumed to be an integration of the 2D and
3D perceptual attributes, which are represented by the SSIM map and the filtered
depth map, respectively. Since the 3D perceptual attribute indicates that visible
crosstalk on foreground objects has a stronger impact on perception than on
background objects, larger weights should be assigned to visible crosstalk in the
foreground than in the background. In other words, the SSIM map should be
weighted by the filtered depth map. The integration is performed as follows:


Fig. 14.6 Illustration of filtered depth maps of Champagne and Dog when the camera baseline is
150 mm and the crosstalk level is 3 %




$C_{pdep} = L_s \cdot \left(1 - R_{pdep}/255\right)$    (14.5)

$V_{pdep} = \mathrm{AVG}(C_{pdep})$    (14.6)

where C_pdep and V_pdep denote the combined map and the quality value predicted by
the objective metric, respectively, and AVG denotes an averaging operation over the
whole map. In Eq. (14.5), the filtered depth map R_pdep is first normalized into the
interval [0, 1] by the maximum depth value 255, and then subtracted from one to
comply with the meaning of the SSIM map, in which a lower pixel value indicates a
larger crosstalk distortion.
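Putting Eqs. (14.5) and (14.6) together, the objective score can be computed with a few NumPy operations. The sketch below assumes an 8-bit depth map (maximum value 255) and the MPEG convention where larger depth values are closer to the camera; it can reuse the ssim_map and filtered_depth_map helpers sketched above.

```python
import numpy as np

def crosstalk_metric(ssim_left, filtered_depth):
    """Predict crosstalk perception from the 2D and 3D attribute maps."""
    # Normalize depth to [0, 1] and invert it, so that visible crosstalk on
    # foreground regions (large depth values, assuming the MPEG convention)
    # pulls the combined map further down (Eq. 14.5).
    weight = 1.0 - filtered_depth.astype(np.float64) / 255.0
    combined = ssim_left * weight          # C_pdep
    return float(np.mean(combined))        # V_pdep = AVG(C_pdep), Eq. (14.6)
```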
Experimental Results:
The performance of an objective quality metric can be evaluated by comparing its
predictions with the MOS values obtained in subjective tests. The proposed metric
shows promising performance: the Pearson correlation between V_pdep and the
subjective quality scores obtained in the crosstalk subjective test described above is
88.4 %, which is much higher than that of traditional 2D metrics, e.g., PSNR
(82.1 %) and plain SSIM (82.5 %). In addition, the scatter plot in Fig. 14.7
shows that the predicted and subjective scores are clearly correlated, which further
demonstrates the performance of the objective metric.
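To reproduce this kind of evaluation, the predicted values are typically mapped to the subjective scale by a monotonic (e.g., logistic) regression before computing the correlation. A minimal sketch with scipy is shown below, where the three-parameter logistic form is our assumption about the fitting function behind Fig. 14.7.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

def logistic(v, b1, b2, b3):
    # Monotonic mapping from metric values to the MOS scale (assumed form).
    return b1 / (1.0 + np.exp(-b2 * (v - b3)))

def evaluate_metric(v_pdep, mos):
    """Fit the mapping and report the Pearson correlation with the MOS."""
    params, _ = curve_fit(logistic, v_pdep, mos,
                          p0=[5.0, 1.0, float(np.median(v_pdep))], maxfev=10000)
    mos_predicted = logistic(np.asarray(v_pdep), *params)
    corr, _ = pearsonr(mos_predicted, mos)
    return corr, params
```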

14.4.3 Conclusion
Subjective tests for stereoscopic crosstalk perception have been conducted
with varying parameters of scene content, camera baseline, and crosstalk level.
Limiting the scope of the study to the perception of specific artifacts, it has been
possible to obtain reliable data with an ad hoc methodology for the subjective
experiment.
According to a statistical analysis, we found that the above three factors have major
impacts on the perception of crosstalk. By observing the visual variations of stimuli


Fig. 14.7 Scatter plot of MOS of crosstalk perception versus predicted values MOS_p, where
MOS_p is obtained from V_pdep through a logistic fitting function with parameters b_1, b_2, and b_3

when varying the three significant factors, three perceptual attributes of crosstalk were
summarized, including shadow degree, separation distance, and spatial position of
crosstalk. These perceptual attributes can be represented by the SSIM map and filtered
depth map. An integration of these two maps can form an effective objective metric for
crosstalk perception, achieving more than 88 % correlation with the MOS results.

14.5 Case Study: Influence of Coding and Transmission Errors, Adapting Standard Subjective Assessment Methodologies

In the context of assessing coding and transmission artifacts for IPTV, several
aspects of QoE have been analyzed by three distinct assessment laboratories. The
testing methodology absolute category rating (ACR) with hidden reference
(ACR-HR) has been widely used in subjective experiments dealing with multimedia quality. It allows for the fast and reliable testing of a wide variety of source
content (SRC) and degradation conditions, which will be called hypothetical
reference circuits (HRC) in the following, according to the notation of the video
quality experts group (VQEG). By using either the same degradation method or
testing exactly the same video in several different labs, the influence of the laboratory environment has been studied.
The employed subjective assessment methodology for 3D stereoscopic video quality
needs to be evaluated and validated according to the scope of interest. The present case


study summarizes a series of tests conducted in the context of severely degraded video
sequences targeted at a typical Full-HD service chain. Further, it details a procedure
which may be used to analyze the validity of the ACR-HR methodology in this context.

14.5.1 Considered Scenarios


In IPTV transmission chains for stereoscopic Full-HD transmission, three influencing factors are predominant:
• The choice of the video coding algorithm
• Determining the rate-distortion optimal spatial and/or temporal resolution on the transmission channel
• For the case of transmission errors, the choice of the optimal method for error concealment on the decoder side
These factors have been addressed in three subjective experiments as shown in
Table 14.1, which were carried out by three different laboratories in three different
countries: France, Germany, and Sweden. Note that one of the subjective experiments, Exp. 2, was carried out with the same conditions and setup in two different
laboratories in two different countries, France and Sweden. Details about the
different conditions and the properties of the source sequences can be found in
[31–33], as indicated in Table 14.1.
The content data sets were chosen to spread a large variety of typical 3D IPTV
contents, especially considering spatial detail and temporal variability. The data set 1b
used in Exp. 2 was identical to data set 1a except that one content was added. Four of the
source contents were not used in Full-HD resolution, as lower source content resolution
is still a typical condition. The minimum resolution was 1,024 × 576 pixels.
All coding-only scenarios used several quality levels which were chosen by
fixed quantization parameters (QPs). Using a fixed QP instead of a fixed bitrate in
a subjective experiment is advantageous as the perceived quality is more homogeneous across different contents, while the bitrate for similar quality varies largely as a function of motion and spatial detail of the source content. Packet loss
was inserted using different simulation systems that resemble scenarios typically
considered in IPTV.

14.5.2 Subjective Experiments: Methodology and Environment


All tests were conducted on the same type of stereoscopic display, a 23-inch Alienware OptX Full-HD screen in combination with the NVidia 3D Vision System,
which uses shutter glasses at a refresh rate of 120 Hz, thus 60 Hz per eye.
The experimental lab setup follows the guidelines of the relevant ITU standards. The illumination did not interfere with the shutter glasses. A screening for


Table 14.1 Conditions evaluated in the presented subjective experiments


Exp. 1 [31] Exp. 2 [32]

Exp. 3 [33]

Subjective assessment laboratory

IRCCyN

T-Labs

Source contents (dataset)


2D reference condition
MVC full-HD
H.264 full-HD
H.264 side by side (half resolution)
H.264 quarter resolution
H.264 1/16 resolution
H.264 2D coded
H.264 packet loss, using 2D error concealment
H.264 packet loss, continuous display in 2D
H.264 packet loss, pausing in 3D including/excluding
slow down process

10 (1a)
Yes
Yes
Yes

Acreo,
IRCCyN
11 (1b)
Yes
Yes
Yes
Yes
Yes
Yes

Alg. A1
Yes
Yes

7 (2)
Yes
Yes
Yes
Yes

Yes
Alg. A2

visual acuity in 2D and 3D using Snellen charts and the Randot test, as well as an
Ishihara color plate test, was performed on each subject. A training session preceded each subjective evaluation. The experiment session was split into two or
three parts of approximately 15 min each, with a break after each part.
The standardized ACR method was used for measuring video quality [34]. A QoE
scale was presented and the subjects were instructed to rate on overall perceived
QoE. In addition, the observers provided ratings of visual discomfort, with different
scales used in the different experiments. In Exp. 1, the observers could optionally
indicate a binary opinion on visual discomfort. In Exp. 2, a balanced rating scale was
used that allowed observers to indicate both higher and lower comfort as compared to the
well-known 2D TV condition, by using a forced choice on a five-point scale: much
more comfortable/more comfortable/as comfortable as/less comfortable/much less
comfortable than watching 2D television. In Exp. 3, an adaptation of the five point
ACR scale was used by indicating excellent/good/fair/poor/bad visual comfort.
The two scales were presented at the same time to each observer, who voted
on the two scales in parallel. The presence of the two scales may have led to the effect
that the subjects did not include visual discomfort in their QoE voting.

14.5.3 Reliability of the ACR-HR Methodology for 3D QoE Assessment

Three different aspects need to be considered when using a subjective experiment
methodology in a new scope: The intra-lab reliability, the inter-lab correlation
(reproducibility), and the response validity related to the judgment task.
Intra-lab reliability can be analyzed by studying the distribution of all observer
votes for each judged video sequence. A uni-modal Gaussian distribution is


Fig. 14.8 Correlation between different labs on the QoE scale. a Cross-lab Exp. 2, b cross-experiment Exp. 1 vs. Exp. 2

expected. In addition, small confidence intervals indicate that the observers do not
disagree about the meaning of the terms found on the scale. The results found in all
three experiments indicate that the intra-lab variability corresponds to the limits
known from the ACR-HR methodology in 2D.
The inter-lab correlation is best measured by performing the exact same experiment in two different labs as it has been done for Exp. 2. Slightly less reliable is a
comparison of a subset of processed video sequences that have been evaluated in two
different experiments, since the corpus effect comes into play [35]. Such a comparison can be made between Exp. 1 and Exp. 2. A third method compares the mean
subjective ratings for HRCs that are common to several experiments. This method
may be applied even if the source content differs, provided that the source contents
span approximately the same quality range; it can thus be used between all the
different experiments.
Figure 14.8 shows scatter plots for the obtained MOS for processed video
sequences. Figure 14.8a shows the results of the fully reproduced test Exp. 2
between the two laboratories in Sweden and France. The regression curve indicates that there is a slight offset and gradient. One explanation could be that this
effect is due to different cultural influences such as the translation of the scale
items and the general acceptance of 3D in the two countries. The linear correlation
of 0.95 corresponds to the correlation found in ACR tests for 2D [36]. Figure 14.8b compares the MOS of common PVSs between Exp. 1 and the linearly
fitted results of the two experiments of Exp. 2. The correlation of 0.97 compares
well to those found for the common set in the VQEG studies [37, 38]. The HRCs
of the coding scenarios can be compared pairwise between all three experiments as
shown in Table 14.2. Taking into account that Exp. 3 used a different setup of
source content, the correlation may be considered as acceptable.
While the ACR methodology seems to provide stable results across labs and
experiments, questions arise regarding the correct interpretation of the scale


Table 14.2 Linear correlation coefficient for the common HRCs in the four experiments

          Exp. 1           Exp. 2a          Exp. 2b          Exp. 3
Exp. 1    1.00 (20 HRC)    0.987 (6 HRC)    0.987 (6 HRC)    0.964 (5 HRC)
Exp. 2a   0.987 (6 HRC)    1.00 (15 HRC)    0.988 (15 HRC)   0.909 (9 HRC)
Exp. 2b   0.987 (6 HRC)    0.988 (15 HRC)   1.00 (15 HRC)    0.939 (9 HRC)
Exp. 3    0.964 (5 HRC)    0.909 (9 HRC)    0.939 (9 HRC)    1.00 (23 HRC)

QoE. Since 3D video has not been on the market long enough to establish a
solid category of user perception like traditional television has, it can be assumed
that features of the content play a large role for QoE judgments. All experiments
included the 3D reference videos and as a hidden reference a condition where only
one of the two 3D views was shown to both eyes in 3D mode, thus with zero
disparity corresponding to displaying 2D, but using the same viewing setup. In
experiments 1 and 2, in which the 10 common SRC contents were judged by
72 observers, the 2D presentation was preferred over the 3D presentation in a
statistically significant way for three of the contents, while the opposite did not
occur. It remains an open question whether this result stems from the ACR
methodology which lacks an explicit reference so that subjects might judge only
on the 2D QoE scale in the case of 2D presentation, or whether the reference 3D
sequences did not excite the viewers, or whether the subjects preferred the 2D presentation even in the 3D context.

14.5.4 Reliability of the Different Methodologies for 3D Visual Discomfort Assessment

Three different methods were used for subjectively measuring visual discomfort for
each scene. The first method was a non-forced choice indication of discomfort on the
same voting screen as used for the quality judgments. Individual subjects used this
feedback to different extents, ranging from 0 % to over 30 % of the viewed
sequences. A major drawback of the methodology is that it remains an open question
whether observers who did not make discomfort ratings did not perceive visual
discomfort, or whether they only ignored the possibility to indicate it.
The second method was to perform a comparison to 2D television in terms of
visual discomfort on a five point scale. A high linear inter-laboratory correlation
between Exp. 2a and Exp. 2b of 0.96 seems to indicate that this method may be a
better choice compared to Exp. 1, since subjects seem to better understand and use
the scale in this case. The third method used an absolute scale of visual discomfort
as in case of the ACR method. As the HRCs that are common in both tests only
comprise pure coding and thus do not cover a significant part of the visual discomfort scale, a cross-experiment analysis is not indicative.
The usage of the additional discomfort scale may be analyzed by calculating the
inter-observer correlations, thus investigating whether the observers use the scale


Fig. 14.9 Histogram of pairwise correlation between observers in each experiment, comparing
quality and comfort scale

in the same way. For comparison, the inter-observer correlation for the established
quality scale is calculated. Histograms showing the achieved correlation between
any two observers for the three tests and the two scales are presented in Fig. 14.9,
for example, for Exp. 2a, 133 pairs of observers have reached a correlation in the
interval of 0.9–1.0 on the QoE scale.
It can be seen that the agreement among different observers on the quality scale
is much higher than on the discomfort scale, indicating that the scale is not used
equally by all observers, although the mean value of two observer distributions is
well correlated.
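A minimal sketch of such a pairwise analysis is shown below. It assumes a score matrix with one row per observer and one column per processed video sequence, which is our assumption about how the raw votes are organized.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def pairwise_observer_correlations(scores):
    """Pearson correlation between every pair of observers.

    scores: 2D array of shape (n_observers, n_sequences).
    Returns one correlation coefficient per observer pair.
    """
    return [pearsonr(scores[a], scores[b])[0]
            for a, b in combinations(range(scores.shape[0]), 2)]

# Histogram bins as in Fig. 14.9, e.g. counting pairs falling in [0.9, 1.0].
# corrs = pairwise_observer_correlations(quality_scores)
# print(np.histogram(corrs, bins=np.arange(-1.0, 1.01, 0.1)))
```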
Another question that arises is whether the subjects were able to clearly distinguish between the two scales. A certain correlation may be expected as a
sequence with a low comfort level may not be voted with a high QoE level. On the
opposite side, a low QoE level may not necessarily induce a low comfort level, so
a scatter plot may show a triangular behavior [31]. This effect was not observed in
this study. In Fig. 14.10, the correlation between the two scales for each observer
in the three experiments demonstrates that for some observers the distinction was
not clear, and thus the correlation is high, while others voted completely independently
on the two scales. For example, 3 observers in Exp. 2a had a very high correlation,
larger than 0.9, between the QoE and the visual comfort scale indicating that they
did not distinguish clearly between the two scales.
The presented analysis does not show a clear preference for either asking for
comfort on an immediate scale, as done in Exp. 3, or asking for a comparison to the 2D
case, as in Exp. 2a and 2b.


Fig. 14.10 Histogram of linear correlation between each observer's votes on QoE and comfort

14.5.5 Comparison of Video Coding Schemes


Figure 14.11 gives an overview of bitrate gain and quality gain for MVC at QP 32 and
38, for side-by-side (SBS) at QP 26, 32, and 28, resolution reductions by a factor of 4
and 16 as well as frame rate reduction by a factor of 2 or 3. The bars indicate the mean
gain over all SRCs, and the error bars indicate the 95 % confidence intervals,
which show how efficient the corresponding process is, and if it saves bandwidth
compared to MVC or H.264 simulcast coding. If the bitrate gain (Fig. 14.11 left) is
below 100 % it indicates that this process saves bandwidth compared to H.264/AVC.
For the quality gain (Fig. 14.11 right) a positive gain means better quality for the
same bitrate. Details about this analysis can be found in [30].
As a conclusion, it may be stated that for an HD 3D-TV transmission system a
reduction of the resolution either by using SBS or by a factor of four before the video
encoding will result in a better quality. It could not only help the service provider to
save bitrate but also to save some amount of hardware processing which would be
needed for encoding and decoding of two times a full-HD video. Frame rate reduction,
on the other hand, will not save bandwidth and will result in poor quality. The comparisons with MVC were not conclusive, since the results differed between the experiments.
On the visual discomfort scale, it was noted that at higher compression levels,
i.e., lower bitrate, the observers were increasingly more eager to indicate visual
discomfort. This effect might be explained by the fact that the coding artifacts
affect the left and right views differently. Another explanation might be that
the quantization error is spatially shaped by the transformation grid which may
superpose a strong eye-to-eye correlation at zero disparity.

14.5.6 Comparison of Error Concealment Methods


As shown in Table 14.1, different error concealment methods were tested. It was
seen in [31] that in terms of QoE, the observers preferred switching to 2D instead
of concealing the error with a spatio-temporal algorithm as implemented in the


Fig. 14.11 Comparison of MVC, spatial, and temporal reduction performance with AVC coding,
concerning bitrate gain at the same quality level (left) and quality gain at the same bitrate level (right)

Fig. 14.12 Mean bitrate redundancy factor of all packet loss scenarios compared to packet-loss-free
H.264 scenarios. The average and standard deviation were determined using data points which did
not require strong extrapolation. This was necessary as in some rare cases even the minimum quality
tested for the coding scenario in the subjective experiment was significantly higher than the quality
obtained for the packet loss condition

reference software of H.264. Pausing the video was considered worst. Considering visual discomfort, pausing was the best solution, followed by switching to
2D. Using an error concealment algorithm on the impacted view led to the highest
visual discomfort.
Overall, it was noted that packet loss had a very high influence on the perceived
quality. An additional analysis is thus provided in Fig. 14.12. It illustrates a bitrate
redundancy factor comparison of all packet loss scenarios. This factor is introduced
in a similar way as the bitrate gain in Fig. 14.11, but in this case the H.264 coding
performance with error concealment is compared across all packet loss conditions:
for each packet loss scenario, the resulting MOS is matched to a similar quality in the
pure coding scenario. Obviously, the bitrate that is required to transmit an error-free
video at the same perceived quality is much lower.
Figure 14.12 shows the factor of the bitrate used in the packet loss case divided by
the bitrate of a similar coding-only quality. The higher the factor, the stronger the


impact due to packet loss. In other words, in an error-prone transmission environment with a fixed bandwidth limit, the tested scenario corresponds to transmitting a high-quality encoded video and performing error concealment on the
decoder side. Alternatively, it can be seen that there may be an interest in using
error protection and correction methods such as automatic repeat request (ARQ) or
forward error correction (FEC) while transmitting the video content at a lower
bitrate.
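For illustration, the redundancy factor can be computed by interpolating the coding-only rate-quality curve at the MOS obtained under packet loss. The sketch below assumes monotone MOS-versus-bitrate samples for the coding-only case and is only one way to realize the comparison described above; all values are hypothetical.

```python
import numpy as np

def bitrate_redundancy_factor(bitrate_pl, mos_pl, bitrates_coding, mos_coding):
    """Ratio of the packet-loss bitrate to the coding-only bitrate
    that reaches the same MOS (higher means stronger loss impact)."""
    # np.interp needs increasing x values: sort the coding-only samples by MOS.
    order = np.argsort(mos_coding)
    equivalent_bitrate = np.interp(mos_pl,
                                   np.asarray(mos_coding)[order],
                                   np.asarray(bitrates_coding)[order])
    return bitrate_pl / equivalent_bitrate

# Hypothetical example: 8 Mbit/s with losses is rated MOS 2.5, while the
# coding-only curve reaches MOS 2.5 at about 2 Mbit/s -> factor of about 4.
print(bitrate_redundancy_factor(8.0, 2.5, [1, 2, 4, 8], [2.0, 2.5, 3.5, 4.5]))
```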

14.6 Conclusions
The purpose of this chapter was to present the status of 3D QoE assessment in the
context of 3D-TV. It appears that new objective methods need to be developed
which can comprehend the multidimensional nature of 3D video QoE. The
main initial effort should be focused on revisiting subjective experiment protocols
for 3D-TV. Even if there are urgent needs in the 3D-TV industry, it is mandatory to be
very cautious when manipulating and using objective and subjective quality
assessment tools. A wise approach is to limit the scope of the studies (e.g. coding
conditions, display rendering, etc.) with a deep understanding of the underlying
mechanisms. Following this statement, the two case studies reported in this chapter
reached relevant results, overcoming current limitations in terms of subjective
assessment protocols.
The first case study made it possible to define a good model of the perception of
the crosstalk effect after conducting ad hoc psychophysical experiments. This model
represents a comprehensive way to measure one of the components of the overall
QoE of 3D-TV services, taking into account visualization conditions. In the second
case study, methodological aspects have been further addressed. The ACR
methodology has been evaluated for use in 3D subjective experiments using a QoE
and a visual discomfort scale. For the validation of the methodology, a critical
analysis has been presented showing that the QoE scale seems to provide reliable
results. The visual discomfort scale has demonstrated some weaknesses which may
be counteracted by a larger number of observers in subjective experiments. The
evaluation itself revealed some interesting preliminary conclusions for industry
such as: at lower coding qualities, visual discomfort increased; spatial down
sampling is advantageous on the transmission channel in a rate-distortion sense
(except for very high quality content with fine details); when transmission errors
occur, the error concealment strategies that are commonly used in 2D might cause
visual discomfort when applied to one channel of a 3D video. Switching to 2D when
only one channel is affected was found to provide superior quality and comfort.
DIBR, as part of the 3D-TV context, is no exception to this rule of cautiousness.
As for the two reported case studies, it requires a fine understanding in order to use
ad hoc methodologies for subjective quality assessment and ad hoc objective quality
metrics. Chapter 15 is fully dedicated to this context.


Acknowledgments This work was partially supported by the COST IC1003 European Network
on QoE in Multimedia Systems and Services, QUALINET (http://www.qualinet.eu/).

References
1. Chikkerur S, Vijay S, Reisslein M, Karam LJ (2011) Objective video quality assessment methods: a classification, review, and performance comparison. IEEE Trans Broadcast 57(2):165–182
2. You J, Reiter U, Hannuksela MM, Gabbouj M, Perkis A (2010) Perceptual-based objective quality metrics for audio-visual services – a survey. Signal Process Image Commun 25(7):482–501
3. Huynh-Thu Q, Le Callet P, Barkowsky M (2010) Video quality assessment: from 2D to 3D – challenges and future trends. In: IEEE ICIP, pp 4025–4028, Sept 2010
4. Mendiburu B (2009) 3D movie making: stereoscopic digital cinema from script to screen. Focal Press, Burlington
5. Zilly F, Muller M, Eisert P, Kauff P (2010) The stereoscopic analyzer – an image-based assistance tool for stereo shooting and 3D production. In: IEEE ICIP, pp 4029–4032, Sept 2010
6. Pastoor S (1991) 3D-television: a survey of recent research results on subjective requirements. Signal Process Image Commun 4(1):21–32
7. Vetro A, Wiegand T, Sullivan G (2011) Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard. Proc IEEE 99:626–642
8. Yamagishi K, Karam L, Okamoto J, Hayashi T (2011) Subjective characteristics for stereoscopic high definition video. In: IEEE QoMEX, Sept 2011
9. Kim D, Min D, Oh J, Jeon S, Sohn K (2009) Depth map quality metric for three-dimensional video. Image, 5(6):7
10. Campisi P, Le Callet P, Marini E (2007) Stereoscopic images quality assessment. In: Proceedings of 15th European signal processing conference (EUSIPCO)
11. Benoit A, Le Callet P, Campisi P, Cousseau R (2008) Quality assessment of stereoscopic images. EURASIP J Image Video Process 2008:13
12. Gorley P, Holliman N (2008) Stereoscopic image quality metrics and compression. In: Proceedings of SPIE, vol 6803
13. Lu F, Wang H, Ji X, Er G (2009) Quality assessment of 3D asymmetric view coding using spatial frequency dominance model. In: IEEE 3DTV conference, pp 1–4
14. Sazzad Z, Yamanaka S, Kawayokeita Y, Horita Y (2009) Stereoscopic image quality prediction. In: IEEE QoMEX, pp 180–185
15. Sazzad Z, Yamanaka S, Horita Y (2010) Spatio-temporal segmentation based continuous no-reference stereoscopic video quality prediction. In: International workshop on quality of multimedia experience, pp 106–111
16. Ekmekcioglu E, Worrall S, De Silva D, Fernando W, Kondoz A (2010) Depth based perceptual quality assessment for synthesized camera viewpoints. In: Second international conference on user centric media, Sept 2010
17. Boev A, Hollosi D, Gotchev A, Egiazarian K (2009) Classification and simulation of stereoscopic artifacts in mobile 3DTV content. In: Proceedings of SPIE, pp 72371F-12
18. Woods A, Docherty T, Koch R (1993) Image distortions in stereoscopic video systems. In: Proceedings of SPIE, San Jose, pp 36–48
19. Meesters LMJ, IJsselsteijn WA, Seuntiëns PJH (2004) A survey of perceptual evaluations and requirements of three-dimensional TV. IEEE Trans Circuits Syst Video Technol 14(3):381–391
20. Seuntiëns PJH, Meesters LMJ, IJsselsteijn WA (2005) Perceptual attributes of crosstalk in 3D images. Displays 26(4-5):177–183
21. Pastoor S (1995) Human factors of 3D images: results of recent research at Heinrich-Hertz-Institut Berlin. In: International display workshop, vol 3, pp 69–72
22. Lipton L (1987) Factors affecting ghosting in time-multiplexed plano-stereoscopic CRT display systems. In: True 3D imaging techniques and display technologies, vol 761, pp 75–78
23. Huang KC, Yuan JC, Tsai CH et al (2003) A study of how crosstalk affects stereopsis in stereoscopic displays. In: Stereoscopic displays and virtual reality systems X, vol 5006, pp 247–253
24. Kooi FL, Toet A (2004) Visual comfort of binocular and 3D displays. Displays 25:99–108
25. Woods A (2011) How are crosstalk and ghosting defined in the stereoscopic literature? In: Proceedings of SPIE, vol 7863, p 78630Z
26. ISO/IEC JTC1/SC29/WG11, M15377, M15378, M15413, M15419, Archamps, France, 2008
27. Boev A, Hollosi D, Gotchev A. Software for simulation of artefacts and database of impaired videos. Mobile3DTV project report, no. 216503. Available: http://mobile3dtv.eu
28. Xing L, You J, Ebrahimi T, Perkis A (2011) Assessment of stereoscopic crosstalk perception. IEEE Trans Multimedia, no 99, p 1
29. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
30. Tanimoto M, Fujii T, Suzuki K et al (2008) Reference softwares for depth estimation and view synthesis. In: ISO/IEC JTC1/SC29/WG11 MPEG2008/M15377, Archamps, France
31. Barkowsky M, Wang K, Cousseau R, Brunnström K, Olsson R, Le Callet P (2010) Subjective quality assessment of error concealment strategies for 3DTV in the presence of asymmetric transmission errors. In: IEEE packet video workshop, pp 193–200
32. Wang K, Barkowsky M, Cousseau R, Brunnström K, Olsson R, Le Callet P et al (2011) Subjective evaluation of HDTV stereoscopic videos in IPTV scenarios using absolute category rating. In: Proceedings of SPIE, vol 7863
33. Lebreton P, Raake A, Barkowsky M, Le Callet P (2011) A subjective evaluation of 3D IPTV broadcasting implementations considering coding and transmission degradation. In: IEEE international workshop on multimedia quality of experience: modeling, evaluation and directions, pp 506–511
34. ITU-T Study Group 12 (1997) ITU-T P.910 subjective video quality assessment methods for multimedia applications
35. Zielinski S, Rumsey F, Bech S (2008) On some biases encountered in modern audio quality listening tests – a review. J AES 56(6):427–451
36. Huynh-Thu Q, Garcia M-N, Speranza F, Corriveau PJ, Raake A (2011) Study of rating scales for subjective quality assessment of high-definition video. IEEE Trans Broadcast 57(1):1–14
37. Webster A, Speranza F (2008) Final report from the video quality experts group on the validation of objective models of multimedia quality assessment, phase I. http://www.its.bldrdoc.gov/vqeg/projects/multimedia/, ITU Study Group 9, TD 923
38. Webster A, Speranza F (2010) Video quality experts group: report on the validation of video quality models for high definition video content. http://www.its.bldrdoc.gov/vqeg/projects/hdtv/, version 2.0, 30 June 2010

Chapter 15
Visual Quality Assessment of Synthesized Views in the Context of 3D-TV

Emilie Bosc, Patrick Le Callet, Luce Morin and Muriel Pressigout

Abstract Depth-image-based rendering (DIBR) is fundamental to 3D-TV applications because the generation of new viewpoints is recurrent. Like any tool, DIBR
methods are subject to evaluation, through the assessment of the visual quality
of the resulting generated views. This assessment task is peculiar because DIBR
can be used for different 3D-TV applications: either in a 2D context (Free
Viewpoint Television, FTV), or in a 3D context (3D displays reproducing
stereoscopic vision). Depending on the context, the factors affecting the visual
experience may differ. This chapter concerns the case of the use of DIBR in the 2D
context. It addresses two particular cases of use: visualization of still images and
visualization of video sequences, in FTV in the 2D context. Through these two
cases, the main issues of DIBR are presented in terms of visual quality assessment.
Two experiments are proposed as case studies addressing the problem considered in this
chapter: the first one concerns the assessment of still images, and the second one
concerns the assessment of video sequences. The two experiments question the reliability of the usual subjective and objective tools when assessing the visual quality of
synthesized views in a 2D context.
E. Bosc (&)  L. Morin  M. Pressigout
IETR UMR CNRS 6164, INSA de Rennes,
35708 RENNES CEDEX 7, France
e-mail: emilie.bosc@insa-rennes.fr
L. Morin
e-mail: luce.morin@insa-rennes.fr
M. Pressigout
e-mail: muriel.pressigout@insa-rennes.fr
P. Le Callet
LUNAM Université, Université de Nantes,
IRCCyN UMR CNRS 6597, Polytech
Nantes, 44306 Nantes Cedex 3, France
e-mail: patrick.lecallet@univ-nantes.fr








Keywords 3D-TV · Absolute categorical rating (ACR) · Blurry artifact · Correlation · DIBR · Distortion · Human visual system · ITU-R BT.500 · Objective metric · Paired Comparisons · PSNR · Shifting effect · Subjective assessment · SSIM · Subjective test method · Synthesized view · UQI · Visual quality · VQM

15.1 Introduction
3D-TV technology has brought out new challenges, such as the question of synthesized view evaluation. The success of the two main applications referred to as
3D video, namely 3D Television (3D-TV), which provides depth to the scene, and
free viewpoint video (FVV), which enables interactive navigation inside the scene
[1], relies on their ability to provide added value (depth or immersion) coupled with
high-quality visual content. Depth-image-based rendering (DIBR) algorithms are
used for virtual view generation, which is required in both applications. From depth
and color data, novel views are synthesized with DIBR. This process induces new
types of artifacts. Consequently, its impact on the quality has to be identified considering various contexts of use. While many efforts have been dedicated to visual
quality assessment in the last twenty years, some issues still remain unsolved in the
context of 3D-TV. Actually, DIBR opens new challenges because it mainly deals
with geometric distortions, which have barely been addressed so far.
Virtual views synthesized either from decoded and distorted data, or from original data, need to be assessed. The best assessment tool remains human judgment, as long as the right protocol is used. Subjective quality assessment is still
delicate when addressing new types of conditions, because one has to define the
optimal way to obtain reliable data. Tests are time-consuming, and consequently one
should issue precise guidelines on how to conduct such experiments in order to save
time and limit the number of observers. Since DIBR introduces new parameters, the right protocol
to assess the visual quality with observers is still an open question. The adequate
assessment protocol might vary according to the targeted objective that researchers
focus on (impact of compression, DIBR techniques comparison, etc.).
Objective metrics are meant to predict human judgment and their reliability is
based on their correlation to subjective assessment results. As the way to conduct
the subjective quality assessment protocols is already questionable in a DIBR
context, their correlation with objective quality metrics (which is what validates the
reliability of the objective quality metrics) in the same context is also questionable.
Yet, trustworthy working groups partially base their future specifications
concerning new strategies for 3D video on the outcome of objective metrics.
Considering that the test conditions may rely on usual subjective and objective
protocols (because of their availability), the outcome of wrong choices could result
in poor quality experience for users. Therefore, new tests should be carried out to


determine the reliability of subjective and objective quality assessment tools, in
order to make the best use of their results.
This chapter is organized as follows: first, Sect. 15.2 refers to the new challenges related to the DIBR process. Section 15.3 gives an overview of two experiments that we propose in order to evaluate the suitability of usual subjective assessment
methods and the reliability of the usual objective metrics, in the context of view
synthesis via DIBR. Section 15.4 presents the results of the first experiment,
concerning the evaluation of still images. Section 15.5 presents the results of the
second experiment, concerning the evaluation of video sequences. Section 15.6
addresses the new trends regarding the assessment of synthesized views. Finally,
Sect. 15.7 concludes the chapter.

15.2 New Challenges in the DIBR Context in Terms of Quality Assessment

15.2.1 Sources of Distortion
The major issue in DIBR consists in filling in the disoccluded regions of the novel
viewpoint: when generating a new viewpoint, regions that were not visible in the
previous viewpoint become visible in the new one [2]. However, the appropriate
color information related to these discovered regions is often unknown. Inpainting
methods, which are either extrapolation or interpolation techniques, are meant to fill in
the disoccluded regions. However, distortions from inpainting are specific and
dependent on a given hole filling technique, as observed in [3].
Another noticeable problem refers to the numerical rounding of pixel positions
when projecting the color information in the target viewpoint (3D warping process): the pixels mapped in the target viewpoint may not be located at an integer
position. In this case the position is either rounded to the nearest integer or
interpolated.
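As a simplified illustration of this rounding issue, the sketch below forward-warps pixels by a purely horizontal disparity derived from depth, which is a common simplification for rectified camera setups. The variable names and this naive warping (without z-buffering) are assumptions, not the exact projection used by the DIBR algorithms discussed later.

```python
import numpy as np

def forward_warp_horizontal(color, disparity):
    """Naive forward warping with rounding of the target pixel position."""
    h, w, _ = color.shape
    warped = np.zeros_like(color)
    for y in range(h):
        for x in range(w):
            # The target position is generally non-integer; round to the
            # nearest pixel (interpolation would be the alternative).
            xt = int(round(x - disparity[y, x]))
            if 0 <= xt < w:
                warped[y, xt] = color[y, x]
    # Pixels never written remain holes (disocclusions) to be inpainted.
    return warped
```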
Finally, another source of distortion lies in depth map uncertainties.
Errors in depth map estimation cause visual distortions in the synthesized views
because the color pixels are not correctly mapped. The problem is similar
when depth maps suffer from coarse quantization introduced by compression methods [4].

15.2.2 Examples of Distortions


In this section, typical DIBR artifacts are described. As explained above, the
sources of distortions are various and their visual effects on the synthesized views
are perceptible both in the spatial domain and in the temporal domain. In most
cases, these artifacts are located around large depth discontinuities, but they are


Fig. 15.1 Shifting/resizing artifacts. Left: original frame. Right: synthesized frame. The shape of
the leaves, in this figure, is slightly modified (thinner or bigger). The vase is also moved

more noticeable in case of high texture contrast between background and


foreground.
Object shifting: A region may be slightly translated or resized, depending on the
chosen extrapolation method (if the method chooses to assign the background
values to the missing areas, the object may be resized), or on the encoding method
(blocking artifacts in depth data result in object shifting in the synthesis). Figure 15.1
depicts this type of artifact.
Blurry regions: This may be due to the inpainting method used to fill in the
disoccluded areas. It is evident around background/foreground transitions.
These remarks are confirmed in Fig. 15.2 around the disoccluded areas. Behind the
head and around the arms of the chair, thin blurry regions are perceptible.
Incorrect rendering of textured areas: inpainting (or hole-filling) methods may
fail to fill in complex textured areas.
Flickering: Errors occurring randomly in depth data along the sequence imply
that color pixels are wrongly projected: some pixels suffer slight changes of depth,
which appear as flickers in the resulting synthesized pixels. To avoid this,


Fig. 15.2 Blurry artifacts (Book Arrival). a Original frame. b Synthesized frame

methods (such as [5]) propose to acquire background knowledge along the


sequence and to consequently improve the synthesis process.
Tiny distortions: In synthesized sequences, a large number of tiny geometric
distortions and illumination differences are temporally constant and perceptually
invisible. Due to the position rounding problem mentioned in Sect. 15.2.1
and to depth inaccuracy, slight errors may occur when assigning a color value to a
pixel in the target viewpoint. This leads to tiny illumination errors that may not be
perceptible to the human eye. However, pixel-based metrics may penalize these
distorted zones.
When encoding either depth data or color sequences before performing the
synthesis, compression-related artifacts are combined with synthesis artifacts.
Artifacts from data compression are generally scattered over the whole image,
while artifacts inherent to the synthesis process are mainly located around the


disoccluded areas. The combination of both types of distortion, depending on the


compression method, affects the synthesized view to varying extents. Actually, most of the
used compression methods are based on 2D video coding methods, and are thus
optimized for the human perception of color. As a result, artifacts occurring
especially in depth data induce severe distortions in the synthesized views. In the
following, a few examples of such distortions are presented.
Shifting effect: This shifting effect is due to the staircase or blocking effect in the
quantized depth map. It occurs when a DCT-based compression method deals
with diagonal edges and features. Coarse quantization of blocks containing a diagonal
edge results in either a horizontal or vertical reconstruction, depending on its
original orientation. In the synthesized views, whole blocks of the color image seem to be
translated. Figure 15.3 illustrates the distortion. The staircase effect is perceptible in the
depth map, and it results in geometric distortions of the projected objects: the face and
the arms have distorted shapes. The diagonal line in the background is also degraded.
The staircase effect modifies the depth plane values of the color pixels, and objects
are consequently wrongly projected during the synthesis.
Crumbling effect: When artifacts occur in depth data around strong discontinuities, appearing like erosion, the edges of objects appear distorted in the synthesized
view. This occurs when applying wavelet-based compression on depth data.
Figures 15.4 and 15.5 depict this artifact. It is perceptible around the arms of the chair.

15.2.3 The Peculiar Task of Assessing the Synthesized View


The evaluation of DIBR systems is a difficult task because the type of evaluation differs
depending on the application (FTV or 3D-TV). It is not the same factors that are
involved in the two applications. The main difference between the two applications is
the stereopsis phenomenon [fusion of left and right views in the human visual system
(HVS)]. This is used by 3D-TV and this reproduces stereoscopic vision. This includes
psycho-physiological mechanisms which are not completely understood. A FTV
application is not necessarily used in the context of stereoscopic display. FTV can be
applied in a 2D context. Consequently, the quality assessments protocols differ as they
address the quality of the synthesized view in two different contexts (2D visualization
and 3D stereoscopic visualization): it is obvious that stereoscopic impairments (such as
cardboard effect, crosstalk, etc., as described in [6] and [7]), which occur in stereoscopic conditions, are not assessed in 2D conditions. Also, distortions detected in 2D
conditions may not be perceptible in a 3D context.
Finally, artifacts in DIBR are mainly geometric distortions. These distortions are
different from those commonly encountered in video compression, and assessed by
usual evaluation methods: most video coding standards rely on DCT, and the resulting
artifacts are specific (some of them are described in [8]). These artifacts are often
scattered over the whole image, whereas DIBR-related artifacts are mostly located
around the disoccluded regions. Thus, since most of the usual objective quality metrics


Fig. 15.3 Shifting effect from depth data compression results in distorted synthesized views
(Breakdancers). a Original depth frame (up) and color original frame (bottom). b Distorted depth
frame (up), synthesized view (bottom)


Fig. 15.4 Crumbling effect in depth data leads to distortions in the synthesized views. a Original depth
frame (up) and original color frame (bottom). b Distorted depth frame (up) and synthesized frame (bottom)

were initially created to address usual specific distortions, they may be unsuitable to
the problem of DIBR evaluation. This will be discussed in Sect. 15.3.
Another limitation of usual objective metrics concerns the need for no-reference quality metrics. In particular cases of use, like FTV, references are
unavailable because the generated viewpoint is virtual. In other words, there is no
ground truth allowing a full comparison with the distorted view.
The next section addresses two case studies that question the validity of subjective and objective quality assessment methods for the evaluation of synthesized
views in 2D conditions.

15.3 Two Case Studies to Question the Evaluation of Synthesized Views

In this section, we first present the aim of the studies, and the experimental
material. Then, we present the two subjective assessment methods whose suitability has been questioned in our experiments. We also justify the choice of these
two methods. Finally, we present a selection of the most commonly used metrics
that were also tested in our experiments.

15.3.1 Objective of the Studies


We conducted two different studies. The first one addresses the evaluation of still
images. A scenario worth studying is the case where, while watching a video, the user presses the


Fig. 15.5 Synthesized frames (Lovebird1 sequence)

pause button; the case of 3D display advertising is also imaginable. These very
likely cases are interesting since the image can be subject to meticulous observation.
The second study addresses the evaluation of video sequences.
The two studies question the reliability of subjective and objective assessment
methods when evaluating the quality of synthesized views. Most of the proposed
metrics for assessing 3D media are based on 2D quality metrics. Previous studies


[9–11] already considered the reliability of usual objective metrics. In [12], You
et al. studied the assessment of stereoscopic images in stereoscopic conditions with
usual 2D image quality metrics, but the distorted pairs did not include any DIBR-related artifacts. In such studies, experimental protocols often involve depth and/or
color compression, different 3D displays, and different 3D representations
(2D + Z, stereoscopic video, MVD, etc.). In these cases, the quality scores
obtained from subjective assessments are compared to the quality scores obtained
through objective measurements, in order to find a correlation and validate the
objective metric. The experimental protocols often assess both compression distortion and synthesis distortion, at the same time without distinction. This is
problematic because there may be a combination of artifacts from various sources
(compression and synthesis) whose effects are neither understood nor assessed.
The studies presented in this chapter concern only views synthesized using DIBR
methods with uncompressed image and depth data, observed in 2D conditions.
The rest of this section presents the experimental material, the subjective
methodologies, and the objective quality metrics used in the studies.

15.3.2 Experimental Material


Three different multi-view video plus depth (MVD) sequences are used in the two
studies. The sequences are Book Arrival (1,024 × 768, 16 cameras with 6.5 cm
spacing), Lovebird 1 (1,024 × 768, 12 cameras with 3.5 cm spacing) and Newspaper (1,024 × 768, 9 cameras with 5 cm spacing).
Seven DIBR algorithms processed the three sequences to generate four different
viewpoints per sequence.
These seven DIBR algorithms are labeled from A1 to A7:
A1: based on Fehn [13], where the depth map is preprocessed by a low-pass
filter. Borders are cropped, and then an interpolation is processed to reach the
original size.
A2: based on Fehn [13]. Borders are inpainted by the method proposed by
Telea [14].
A3: Tanimoto et al. [15]; this is the reference software recently adopted for the
experiments of the 3D Video group of MPEG.
A4: Müller et al. [16], who proposed a hole-filling method aided by depth
information.
A5: Ndjiki-Nya et al. [17], whose hole-filling method is a patch-based texture
synthesis.
A6: Köppel et al. [5], which uses temporal depth information to improve the synthesis
in the disoccluded areas.
A7: corresponds to the unfilled sequences (i.e. with holes).
The test was conducted in an ITU conforming test environment. For the subjective assessments, the stimuli were displayed on a TVLogic LVM401W, and


according to ITU-R BT.500 [18]. In the following, the subjective methodologies


are first presented, and then the objective metrics are addressed.
Objective measurements were obtained by using MetriX MuX Visual Quality
Assessment Package [19].

15.3.3 Subjective Assessment Methodologies


Subjective tests are used to measure image or video quality. The International
Telecommunications Union (ITU) [20] is in charge of the recommendations of the
most commonly used subjective assessment methods. Several methods exist, but
there is no 3D-dedicated protocol, because the technology is not mature yet. The
available protocols have both advantages and drawbacks, and they are usually
chosen according to the desired task. This depends on the distortion and on the
type of evaluation [21]. They differ according to the type of pattern presentation
(single-stimulus, double-stimulus, multi-stimulus), the type of voting (quality,
impairment, or preference), the voting scale (discrete or continuous), and the number
of rating points or categories. Figure 15.6 depicts the proposed classification of
subjective methods in [21]. The abbreviations of the methods classified in
Fig. 15.6 are referenced in Table 15.1.
In the absence of any better 3D-adapted subjective quality assessment methodologies, the evaluation of synthesized views is mostly obtained through 2D-validated assessment protocols. The aim of our two experiments is to question the
suitability of a selection of subjective quality assessment methods. This selection
is based on the comparison of methods in the literature. Considering the aim of the
two experiments that we proposed, the choice of a subjective quality assessment
method should be based on consideration of reliability, accuracy, efficiency, and
easiness of implementation of the available methods.
Brotherton et al. [22] investigated the suitability of ACR and SAMVIQ
methods when assessing 2D media. The study showed that the ACR method allowed
more test sequences (at least twice as many) to be presented for assessment
compared to the SAMVIQ method. The ACR method also proved to be reliable in the
test conditions. Rouse et al. also studied the tradeoff of these two methods in [23],
in the context of high-definition still images and video sequences. They concluded
that the suitability of the two methods could depend on specific applications.
A study conducted by Huynh-Thu et al. [24] compared different methods
according to their voting scales (5-point discrete, 9-point discrete, 5-point
continuous, and 11-point continuous scales). The tests were carried out in the
context of high-definition video. The results showed that the ACR method
produced reliable subjective results, even across different scales.
Considering the analyses of the methods in the literature, we selected the single-stimulus pattern presentation ACR-HR (with five quality categories) and, for its accuracy, the double-stimulus pattern presentation PC. They are described and commented on in the following.


Fig. 15.6 Commonly used subjective test methods, as depicted in [21]


Table 15.1 Overview of subjective test methods

Abbreviation  Full meaning                                            References
DSIS          Double-stimulus impairment scale                        [18]
DSQS          Double-stimulus quality scale                           [18]
SSNCS         Single-stimulus numerical categorical scale             [18]
SSCQE         Single-stimulus continuous quality evaluation           [18]
SDSCE         Simultaneous double-stimulus for continuous evaluation  [18]
ACR           Absolute category rating                                [20]
ACR-HR        Absolute category rating with hidden reference removal  [20]
DCR           Degradation category rating                             [20]
PC            Pair comparison                                         [20]
SAMVIQ        Subjective assessment methodology for video quality     [20]

15.3.3.1 Absolute Categorical Rating with Hidden Reference Removal (ACR-HR)

The ACR-HR methodology consists in presenting test objects (i.e., images or sequences) to observers, one at a time. The objects are rated independently on a category scale. The reference version of each object must be included in the test procedure and rated like any other stimulus, which explains the term hidden reference. From the scores obtained, a differential score (DMOS, for Differential Mean Opinion Score) is computed between the mean opinion scores (MOS) of each test object and its associated hidden reference. The ITU recommends the 5-level quality scale depicted in Table 15.2.

Table 15.2 ACR-HR quality scale

5  Excellent
4  Good
3  Fair
2  Poor
1  Bad
ACR-HR requires many observers to minimize contextual effects (previously presented stimuli influence the observer's opinion, i.e., the presentation order influences the opinion ratings). Accuracy increases with the number of participants.
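For illustration, the differential scoring can be sketched as follows; this is a minimal Python sketch under our own assumptions (per-observer votes aligned between each processed stimulus and its hidden reference, and the usual +5 offset that keeps DMOS on the positive part of the 5-level scale), not the exact script used in the experiments:

    # Minimal sketch of ACR-HR differential scoring (assumptions noted above).
    def acr_hr_dmos(votes_test, votes_ref):
        # votes_test / votes_ref: per-observer votes (1..5) for the processed
        # stimulus and its hidden reference, in the same observer order.
        diffs = [t - r + 5 for t, r in zip(votes_test, votes_ref)]
        return sum(diffs) / len(diffs)

    # Example: seven observers rate a synthesized view and its hidden reference.
    dmos = acr_hr_dmos([4, 3, 4, 3, 3, 4, 2], [5, 5, 4, 5, 4, 5, 5])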

15.3.3.2 Paired Comparisons


Paired Comparisons (PC) methodology is an assessment protocol in which stimuli
are presented by pairs to the observers: it is a double-stimulus method. The
observer selects one out of the pair that best satisfies the specified judgment
criterion, i.e., image quality.
The results of a paired comparison test are recorded in a matrix: each element
corresponds to the frequencies a stimulus is preferred over another stimulus. These
data are then converted to scale values using the Thurstone-Mosteller model or the Bradley-Terry model [25]. This leads to a hypothetical perceptual continuum.
The presented experiments follow the Thurstone-Mosteller model, where naive observers are asked to choose the preferred item from each pair. Although the method is known to be highly accurate, it is time-consuming, since the number of comparisons grows considerably when the number of images to be compared increases.
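A minimal sketch of the conversion from a preference matrix to scale values under the Thurstone-Mosteller (Case V) assumption is given below. The clipping of unanimous proportions is a practical workaround we add, not part of the model, and the example counts are invented:

    # Hedged sketch: Thurstone-Mosteller (Case V) scaling of a paired-comparison
    # matrix; wins[i][j] counts how often stimulus i was preferred over j.
    from statistics import NormalDist

    def tm_scale(wins):
        n = len(wins)
        inv = NormalDist().inv_cdf
        z = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                total = wins[i][j] + wins[j][i]  # assumes every pair was compared
                # Clip proportions to avoid infinite z-scores for unanimous pairs.
                p = min(max(wins[i][j] / total, 0.01), 0.99)
                z[i][j] = inv(p)
        # The scale value of a stimulus is its mean z-score against the others.
        return [sum(row) / (n - 1) for row in z]

    # Example with three stimuli, each pair judged by 30 observers.
    scale_values = tm_scale([[0, 22, 27],
                             [8, 0, 18],
                             [3, 12, 0]])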
ACR-HR and PC differ in several respects. First, with ACR-HR, even though the reference sequences may be included in the stimuli, they are not identified as such by the observers: observers assign an absolute grade without any reference. In PC, observers only need to indicate their preference within a pair of stimuli. The requested task is therefore different: while observers assess the quality of the stimuli in ACR-HR, they only express a preference in PC.
The quality scale is another issue. ACR-HR scores provide knowledge on the
perceived quality level of the stimuli. However, the voting scale is coarse, and
because of the single-stimulus presentation, observers cannot remember previous
stimuli and precisely evaluate small impairments. PC scores (i.e., preference matrices) are scaled to a hypothetical perceptual continuum. However, they do not provide knowledge on the quality level of the stimuli, but only on the order of preference. On the other hand, PC is very well suited to small impairments,
thanks to the fact that only two conditions are compared each time. This is why PC
tests are often coupled with ACR-HR tests.
Another aspect concerns the complexity and the feasibility of the test: PC is simple because observers only need to express a preference for each presented pair. However, when the number of stimuli increases, the test becomes difficult to carry out, since the number of comparisons grows as N(N-1)/2, where N is the number of stimuli. In the case of video sequence assessment, a double-stimulus
method, such as PC, involves the use of either one split-screen environment (or
two full screens), with the risk of distracting the observer (as explained in [26]),
or one screen but the length of the test increases as sequences are displayed one
after the other. On the other hand, the ease of handling of ACR-HR allows the assessment of a larger number of stimuli, but the results of this assessment are reliable only as long as the group of participants is large enough.

15.3.4 Objective Quality Metrics


The experiments proposed in this chapter require the use of objective quality metrics. The choice of the objective metrics used in these experiments is motivated by their availability. This section presents an overview of the metrics involved, covering both still image and video sequence metrics.
Objective metrics are meant to predict human perception of the quality of images
and thus avoid wasting time in subjective quality assessment tests. They are therefore
supposed to be highly correlated with human opinion. In the absence of approved
metrics for assessing synthesized views, most of the studies rely on the use of 2D
validated metrics, or on adaptations of such. There are different types of objective
metrics, depending on their requirement for reference images. The objective metrics
can be classified in three different categories of methods according to the availability
of the reference image: full reference methods (FR), reduced reference (RR), and no
reference (NR). Most of the existing metrics are FR methods, which require reference images. RR methods require only elements of the reference images. NR methods do not require any reference image; they mostly rely on HVS models to predict the human opinion of the quality. Also, prior knowledge of the expected artifacts greatly improves the design of such methods.
All the objective metrics, FR, RR, or NR, can be classified according to a
different criterion than the requirement of the reference image. As proposed in
[27], we use a classification relying on tools used in the methods presented
hereafter. Table 15.3 lists a selection of commonly used objective metrics and
Fig. 15.7 depicts the proposed classification.

15.3.4.1 Signal-Based Methods


PSNR is a widely used method because of its simplicity. PSNR belongs to the
category of signal-based methods. It measures the signal fidelity of a distorted
image compared to a reference. It is based on the measure of the Mean Squared
Error (MSE). Because of the pixel-based approach of such a method, the amount
of distorted pixels is summed, but their perceptual impact on the quality is not
considered: PSNR does not take into account the visual masking phenomenon.

Table 15.3 Overview of commonly used objective metrics

Category             Objective metric                              Abbreviation  Tested
Signal-based         Peak signal to noise ratio                    PSNR          X
Perception-oriented  Universal quality index                       UQI           X
                     Information fidelity criterion                IFC           X
                     Video quality metric                          VQM           X
                     Perceptual video quality measure              PVQM
Structure-based      Single-scale structural similarity            SSIM          X
                     Multi-scale SSIM                              MSSIM         X
                     Video structural similarity measure           V-SSIM        X
                     Motion-based video integrity evaluation       MOVIE
HVS-based            PSNR-human visual system                      PSNR-HVS      X
                     PSNR-human visual system masking model        PSNR-HVSM     X
                     Visual signal to noise ratio                  VSNR          X
                     Weighted signal to noise ratio                WSNR          X
                     Visual information fidelity                   VIF           X
                     Moving pictures quality metric                MPQM

Fig. 15.7 Overview of quality metrics

Thus, even if an error is not perceptible, it contributes to the decrease of the quality
score. Studies (such as [28]) showed that in the case of synthesized views, PSNR is
not reliable, especially when comparing two images with low PSNR scores. PSNR
cannot be used in very different scenarios as explained in [29].
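As a reminder of how simple the underlying computation is, a minimal MSE/PSNR sketch is given below (assuming 8-bit images stored as NumPy arrays of identical size; all names are illustrative):

    import numpy as np

    def psnr(reference, distorted, peak=255.0):
        # Mean squared error between reference and distorted images.
        mse = np.mean((reference.astype(np.float64)
                       - distorted.astype(np.float64)) ** 2)
        if mse == 0:
            return float("inf")  # identical images
        # PSNR in dB: every distorted pixel contributes, whether visible or not.
        return 10.0 * np.log10(peak ** 2 / mse)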

15.3.4.2 Perception-Oriented Methods


Considering that signal-based methods are unable to correctly predict the perceived quality, perception-oriented metrics have been introduced. They make use
of perceptual criteria such as luminance or contrast distortion.


UQI [30] is a perception-oriented metric. The quality score is the product of the
correlation between the original and the degraded image, a term defining
the luminance distortion and a term defining the contrast distortion. The quality
score is computed within a sliding window and the final score is defined as the
average of all local scores.
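A minimal sketch of this computation is given below; for brevity it uses non-overlapping B x B blocks instead of the original sliding window, and the block size and function names are illustrative assumptions:

    # Hedged sketch of the universal quality index (UQI) over 8 x 8 blocks.
    import numpy as np

    def uqi(x, y, block=8):
        x = x.astype(np.float64); y = y.astype(np.float64)
        scores = []
        h, w = x.shape
        for i in range(0, h - block + 1, block):
            for j in range(0, w - block + 1, block):
                a = x[i:i+block, j:j+block].ravel()
                b = y[i:i+block, j:j+block].ravel()
                ma, mb = a.mean(), b.mean()
                va, vb = a.var(), b.var()
                cov = ((a - ma) * (b - mb)).mean()
                denom = (va + vb) * (ma ** 2 + mb ** 2)
                if denom > 0:
                    # Product of correlation, luminance, and contrast terms.
                    scores.append(4 * cov * ma * mb / denom)
        return float(np.mean(scores))  # final score: average of local scores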
IFC [31] uses a distortion model to evaluate the information shared between the
reference image and the degraded image. IFC indicates the image fidelity rather
than the distortion. IFC is based on the hypothesis that, given a source channel and
a distortion channel, an image is made of multiple independently distorted subbands. The quality score is the sum of the mutual information between the source
and the distorted images for all the subbands.
VQM was proposed by Pinson and Wolf in [32]. It is a RR video metric that
measures the perceptual effects of numerous video distortions. It includes a calibration step (to correct spatial/temporal shift, contrast, and brightness according to
the reference video sequence) and an analysis of perceptual features. The VQM score combines all the computed perceptual parameters. The VQM method is complex, but its correlation to subjective scores is good according to [33]. The method is validated in video display conditions.
Perceptual Video Quality Measure (PVQM) [34] is meant to detect perceptible
distortions in video sequences. Various indicators are used. First, an edge-based
indicator allows the detection of distorted edges in the images. Second, a motion-based indicator analyzes two successive frames. Third, a color-based indicator
detects non-saturated colors. Each indicator is pooled separately across the video
and incorporated in a weighting function to obtain the final score. As this method
was not available, it was not tested in our experiments.

15.3.4.3 Structure-Based Methods


Structure-based methods are also included in the perception-oriented metrics.
They rely on the assumption that human perception is based on the extraction of
structural information. Thus, they measure the structural information degradation.
SSIM [35] was the first method of this category. It is considered an extension of UQI. It combines image structural information (mean, variance, and covariance of pixels) within a local patch. The block size depends on the viewer's distance from the screen. However, a low variation of the SSIM measure can lead to an important error in MOS prediction.
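The local statistics described above can be sketched as follows; this minimal example uses B x B blocks instead of the original 11 x 11 Gaussian-weighted window, with the common default constants for 8-bit images (these simplifications are ours):

    # Hedged sketch of single-scale SSIM computed on non-overlapping blocks.
    import numpy as np

    def ssim_blocks(x, y, block=8, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
        x = x.astype(np.float64); y = y.astype(np.float64)
        vals = []
        h, w = x.shape
        for i in range(0, h - block + 1, block):
            for j in range(0, w - block + 1, block):
                a = x[i:i+block, j:j+block].ravel()
                b = y[i:i+block, j:j+block].ravel()
                ma, mb = a.mean(), b.mean()
                va, vb = a.var(), b.var()
                cov = ((a - ma) * (b - mb)).mean()
                # Luminance/contrast/structure comparison in one expression.
                vals.append(((2 * ma * mb + c1) * (2 * cov + c2)) /
                            ((ma ** 2 + mb ** 2 + c1) * (va + vb + c2)))
        return float(np.mean(vals))  # mean SSIM over the image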
Therefore, many improvements to SSIM were proposed, and adaptations to
video assessment were introduced. MSSIM is the average of the SSIM scores of all
patches of the image. V-SSIM [36] is an FR video quality metric which uses
structural distortion as an estimate of perceived visual distortion. At patch level,
the score is a weighted function of SSIM for the different color components of the
image (i.e. luminance and chrominance). At frame level, the score is a weighted
function of patches SSIM scores (the weights depend on the mean value of the
luminance in the patch [36]). Finally, at sequence level, VSSIM score is a


weighted function of the frames' SSIM scores (based on motion). The choice of the weights relies on the assumption that dark regions are less salient. However, this is questionable because the relative luminance may depend on the screen used.
MOVIE [37] is an FR video metric that uses several steps before computing the
quality score. It includes the decomposition of both reference and distorted video
by using a multi-scale spatio-temporal Gabor filter-bank. An SSIM-like method is
used for spatial quality analysis. An optical flow calculation is used for motion
analysis. Spatial and temporal quality indicators determine the final score.

15.3.4.4 Human Visual-System (HVS)-Based Methods


HVS-based methods rely on HVS modeling from psychophysics experiments. Due
to the complexity of the human vision, studies are still in progress. HVS-based
models are the result of tradeoffs between computational feasibility and accuracy
of the model. HVS-based models can be classified into two categories: neurobiological models and models based on the psychophysical properties of human
vision.
The models based on neurobiology estimate the actual low-level process in
HVS including retina and optical nerve. However, these models are not widely
used because of their complexity [38].
Psychophysical HVS-based models are implemented in a sequential process
that includes luminance masking, color perception analysis, frequency selection,
contrast sensitivity implementation (based on the contrast sensitivity function
(CSF) [39]), and modeling of masking and facilitation effects [40].
PSNR-HVS [41], based on PSNR and UQI, takes into account HVS properties such as its sensitivity to contrast changes and to low-frequency distortions. In [41], the method proved to be correlated to subjective scores, but its performances were tested on a variety of distortions specific to 2D image compression, which are different from distortions related to DIBR.
PSNR-HVSM [42] is based on PSNR but takes into account CSF and
between-coefficient contrast masking of DCT basis functions. The performances of the method are validated on a set of images containing Gaussian noise or spatially correlated additive Gaussian noise, at different locations (uniformly over the entire image, mostly in regions with a high masking effect, or mostly in regions with a low masking effect).
VSNR [43] is also a perception-oriented metric: it is based on a visual detection
of the distortion criterion, helped by CSF. VSNR metric is sensitive to geometric
distortions such as spatial shifting and rotations, transformations which are typical
in DIBR applications.
WSNR denotes a Weighted Signal-to-Noise Ratio that uses a weighting function adapted to the HVS, as applied in [44]. It is an improvement on PSNR that uses a CSF-based weighting function. However, although WSNR is more accurate because it takes perceptual properties into account, the problem of accumulating degradation errors even in non-perceptible areas remains, as with the PSNR method.
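The general idea of a CSF-weighted SNR can be sketched as follows. This is not the metric of [44]: the Mannos-Sakrison-style CSF and the mapping from FFT bins to cycles per degree are illustrative assumptions only:

    # Hedged sketch of a frequency-weighted SNR in the spirit of WSNR.
    import numpy as np

    def wsnr_like(reference, distorted, cycles_per_degree_max=30.0):
        ref = reference.astype(np.float64)
        err = ref - distorted.astype(np.float64)
        h, w = ref.shape
        fy = np.fft.fftfreq(h)[:, None]
        fx = np.fft.fftfreq(w)[None, :]
        # Assumed mapping from normalized frequency to cycles per degree.
        f = np.sqrt(fx ** 2 + fy ** 2) * 2.0 * cycles_per_degree_max
        # Illustrative contrast sensitivity function (Mannos-Sakrison form).
        csf = 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)
        ref_spec = np.fft.fft2(ref) * csf
        err_spec = np.fft.fft2(err) * csf
        # CSF-weighted signal-to-noise ratio in dB.
        return 10.0 * np.log10(np.sum(np.abs(ref_spec) ** 2) /
                               np.sum(np.abs(err_spec) ** 2))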


IFC has been improved by the introduction of an HVS model; the resulting method is called VIF [45]. VIFP is a pixel-based version of VIF. It uses a wavelet decomposition and computes the parameters of the distortion models, which increases the computational complexity. In [45], five distortion types are used to validate the performances of the method (JPEG- and JPEG 2000-related distortions, white and Gaussian noise over the entire image), which are quite different from the DIBR-related artifacts.
MPQM [46] uses an HVS model. In particular it takes into account the masking
phenomenon and contrast sensitivity. It has high complexity and its correlation to
subjective scores varies according to [33]. Since the method is not available, it is
not tested in our experiments.
Only a few commonly used algorithms (in the 2D context) have been described above. Since they are all dedicated to 2D applications, they are optimized to detect and penalize the specific distortions of 2D image and video compression methods. As explained in Sect. 15.2, distortions related to DIBR are very different from known 2D artifacts. Many other algorithms for visual quality assessment exist that are not covered here.

15.3.5 Experimental Protocol


Two experiments were conducted. The first one addresses the evaluation of still
images. The second study addresses the evaluation of video sequences.
Figure 15.8 depicts the overview of the two experiments.
The material for both experiments comes from the same set of synthesized
views as described in Sect. 15.3.2. However, in the case of the first experiment, on still images, the test images are key frames (randomly chosen) from the same set of synthesized views, due to the complexity of PC tests when the number of items increases. That is to say, for each of the three reference sequences, only one frame was selected out of each synthesized viewpoint.
In both experiments, the suitability of subjective quality assessment methods
and the reliability of objective metrics are addressed.
Concerning the subjective tests, two sessions were conducted. The first one
addressed the assessment of still images. Forty-three naive observers participated
in this test. The second session addressed the assessment of video sequences.
Thirty-two naive observers participated in this test.
In the case of video sequences, only an ACR-HR test was conducted, but both
ACR-HR and PC were carried out for the still image context. A PC test with video
sequences would have required either two screens, or switching between items. In
the case of the use of two screens, it involves the risk of missing frames of the
tested sequences because one cannot watch two different video sequences simultaneously. In the case of the switch, it would have considerably increased the
length of the test.


Fig. 15.8 Experimental protocol for fixed image experiment and for video experiment

The objective measurements were performed over 84 synthesized views by means of the MetriX MuX Visual Quality Assessment Package [19], except for two metrics: VQM and VSSIM. VQM was available at [47]; VSSIM was implemented by the authors according to [36]. The reference was the originally acquired image. It should be noted that the still image quality metrics used in the study on still images are also used to assess the visual quality of the video sequences, by applying them to each frame separately and averaging the per-frame scores.
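A minimal sketch of this per-frame pooling is given below (assuming the frames are already loaded as arrays, and that metric is any full-reference still image measure, such as the PSNR sketch above; the names are illustrative):

    # Hedged sketch: apply a still image metric frame by frame and average.
    def video_score(metric, ref_frames, syn_frames):
        scores = [metric(r, s) for r, s in zip(ref_frames, syn_frames)]
        return sum(scores) / len(scores)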
Table 15.4 summarizes the experimental framework. The next sections present
the results of the first experiment assessing the quality of still images, and then the
results of the second experiment assessing the quality of video sequences.

15.4 Results on Still Images (Experiment 1)


This section addresses the context of still images. Section 15.4.1 addresses the
results of the subjective assessments and Sect. 15.4.2 presents the results of the
objective evaluations. These experiments are meant to determine whether the subjective protocols are appropriate for the assessment of different DIBR algorithms; the number of participants required in such a subjective assessment test; and whether the results of the subjective assessments are consistent with the objective evaluations.

15.4.1 Subjective Tests


The seven DIBR algorithms are ranked according to the obtained ACR-HR and PC
scores, as depicted in Table 15.5. This table indicates that the rankings obtained by
both testing methods are consistent. For the ACR-HR test, the first line gives the


Table 15.4 Overview of the experiments

                          Experiment 1 (still images)            Experiment 2 (video sequences)
Data                      Key frames of each synthesized view    Synthesized video sequences
Subjective tests
  Number of participants  43                                     32
  Methods                 ACR-HR, PC                             ACR-HR
Objective measures        All available metrics of MetriX MuX    VQM, VSSIM, still image metrics

Table 15.5 Rankings of algorithms according to subjective scores


A1
A2
A3
A4
A5

A6

A7

ACR-HR
Rank order
PC
Rank order

3.32
5
0.454
5

2.278
7
-2.055
7

3.539
1
1.038
1

3.386
4
0.508
4

3.145
6
0.207
6

3.40
3
0.531
3

3.496
2
0.936
2

DMOS scores obtained through the MOS scores. For the PC test, the first line
gives the hypothetical MOS scores obtained through the comparisons. For both
tests, the second line gives the ranking of the algorithms, obtained through the first
line.
In Table 15.5, although the algorithms can be ranked from the scaled scores, there
is no information concerning the statistical significance of the quality difference of
two stimuli (one preferred to another one). Therefore, statistical analyses were
conducted over the subjective measurements: a Student's t-test was performed over
ACR-HR scores, and over PC scores, for each algorithm. This provides knowledge
on the statistical equivalence of the algorithms. Tables 15.6 and 15.7 show the results
of the statistical tests over ACR-HR and PC values respectively. In both tables, the
number in parentheses indicates the minimum required number of observers that
allows statistical distinction (VQEG recommends 24 participants as a minimum [48],
values in bold are higher than 24 in the table).
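For illustration, the pairwise significance check between two algorithms' ACR-HR score distributions can be sketched as follows; this simplified example uses SciPy's two-sample t-test, and the significance level is an assumption of ours:

    from scipy.stats import ttest_ind

    def significantly_different(scores_a, scores_b, alpha=0.05):
        # Two-sample Student t-test on per-observer scores of two algorithms.
        result = ttest_ind(scores_a, scores_b)
        return result.pvalue < alpha, result.statistic, result.pvalue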
A first analysis of these two tables indicates that the PC method leads to clear-cut decisions, compared to the ACR-HR method: indeed, the distributions of the
algorithms are statistically distinguished with less than 24 participants in 17 cases
with PC (only 11 cases with ACR-HR). In one case (between A2 and A5), less
than 24 participants are required with PC, and more than 43 participants are
required to establish the statistical difference with ACR-HR. The latter case can
be explained by the fact that the visual quality of the synthesized images may be
perceived as very similar by non-expert observers. That is to say, the distortions, though different from one algorithm to another, are difficult to assess. The absolute
rating task is more delicate for observers than the comparison task. These results
indicate that it seems more difficult to assess the quality of synthesized views than
to assess the quality of degraded images in other contexts (for instance, quality


Table 15.6 Results of Student's t-test with ACR-HR results


A1
A2
A3
A4
A5
A1
A2
A3
A4
A5
A6
A7

:(32)
;(32)
;(\24)
;(32)
O
(>43)
;(30)
;(\24)

:(\24)
:(\24)

;(\24)
(>43)
O
(>43)
O
(>43)
;(\24)
O

:(\24)
:(\24)
:(\24)
;(\24)

:(32)
O
(>43)
;(\24)

(>43)
O
(>43)
;(\24)
O
(>43)

(>43)
(>43)
;(\24)
O

;(28)
;(\24)


A6

A7

:(30)
O
(>43)
;(\24)
O
(>43)
:(28)

:(\24)
:(\24)
:(\24)
:(\24)
:(\24)
:(\24)

;(\24)

Legend: : superior, ; inferior, O statistically equivalent


Reading: line 1 is statistically superior to column 2 Distinction is stable when 32
observers participate

Table 15.7 Results of Student's t-test with PC results


A1
A2
A3
A4
A1
A2
A3
A4
A5
A6
A7

: (\24)
;(\24)
;(\24)
;(\24)
;(\24)
;(\24)
;(\24)

:(\24)
:(28)

;(28)
(>43)
:(\24)
O
(>43)
;(\24)
O

:(\24)
:(\24)
:(\24)
;(\24)

:(\24)
O
(<43)
;(\24)
:(\24)
;(>43)
;(\24)

A5

A6

A7

:(\24)
;(\24)
;(\24)
;(\24)

:(\24)
O
(>43)
;(\24)
:(>43)
:(\24)

:(\24)
:(\24)
:(\24)
:(\24)
:(\24)
:(\24)

;(\24)
;(\24)

;(\24)

statistically equivalent
Legend: : superior, ; inferior,
Reading: Line 1 is statistically superior to column 2. Distinction is stable when less than
24 observers participate

assessment of images distorted through compression). The results with the ACR-HR method, in Table 15.6, confirm this observation: in most of the cases, more than 24 participants (or even more than 43) are required to distinguish the classes (note that A7 is the synthesis with holes around the disoccluded areas).
However, as seen with rankings results above, methodologies give consistent
results: when the distinction between algorithms is clear, the ranking is the same
with either methodology.
Finally, these experiments show that fewer participants are required for a PC
test than for an ACR-HR test. However, as stated before, PC tests, while efficient,
are feasible only with a limited number of items to be compared. Another problem,
pointed out by these experiments, concerns the assessment of similar items: with
both methods, 43 participants were not always sufficient to obtain a clear and
reliable decision. Results suggest that observers had difficulties assessing the
different types of artifacts.
As a conclusion, this first analysis reveals that more than 24 participants may be
necessary for still image quality assessment.
Regarding the evaluation of the PC and ACR-HR methods, PC gives clear-cut decisions, due to its mode of assessment (preference), while the statistical distinction of algorithms with ACR-HR is slightly less accurate. With ACR-HR, the task is not

Table 15.8 Correlation coefficients between objective scores in percentage

            PSNR   SSIM   MSSIM  VSNR   VIF    VIFP   UQI    IFC    NQM    WSNR   PSNR-HVSM  PSNR-HVS
PSNR        -      83.9   79.6   87.3   77.0   70.6   53.6   71.6   95.2   98.2   99.2       99.0
SSIM        83.9   -      96.7   93.9   93.4   92.4   81.5   92.9   84.9   83.7   83.2       83.5
MSSIM       79.6   96.7   -      89.7   88.8   90.2   86.3   89.4   85.6   81.1   77.9       78.3
VSNR        87.3   93.9   89.7   -      87.9   83.3   71.9   84.0   85.3   85.5   86.1       85.8
VIF         77.0   93.4   88.8   87.9   -      97.5   75.2   98.7   74.4   78.1   79.4       80.2
VIFP        70.6   92.4   90.2   83.3   97.5   -      85.9   99.2   73.6   75.0   72.2       72.9
UQI         53.6   81.5   86.3   71.9   75.2   85.9   -      81.9   70.2   61.8   50.9       50.8
IFC         71.6   92.9   89.4   84.0   98.7   99.2   81.9   -      72.8   74.4   73.5       74.4
NQM         95.2   84.9   85.6   85.3   74.4   73.6   70.2   72.8   -      97.1   92.3       91.8
WSNR        98.2   83.7   81.1   85.5   78.1   75.0   61.8   74.4   97.1   -      97.4       97.1
PSNR-HVSM   99.2   83.2   77.9   86.1   79.4   72.2   50.9   73.5   92.3   97.4   -          99.9
PSNR-HVS    99.0   83.5   78.3   85.8   80.2   72.9   50.8   74.4   91.8   97.1   99.9       -


Table 15.9 Correlation coefficients between objective and subjective scores in percentage

         PSNR   SSIM   MSSIM  VSNR   VIF    VIFP   UQI    IFC    NQM    WSNR   PSNR-HVSM  PSNR-HVS
ACR-HR   31.1   19.9   11.4   22.9   19.6   21.5   18.4   21.0   29.5   37.6   31.7       31.0
PC       40.0   23.8   34.9   19.7   16.2   22.0   32.9   20.1   37.8   36.9   42.2       41.9

Fig. 15.9 Difference between correlation and agreement [49]

easy for the observers because the impairments among the tested images are small, though each DIBR algorithm induces specific artifacts. Thus, this aspect should be taken into
account when evaluating the performances of different DIBR algorithms with this
methodology.
However, ACR-HR and PC are complementary: when assessing similar items,
like in this case study, PC can provide a ranking, while ACR-HR gives the overall
perceptual quality of the items.

15.4.2 Objective Measurements


The results of this subsection concern the measurements conducted over the same
selected key frames as those in Sect. 15.4.1. The objective is to determine the
consistency between the subjective assessments and the objective evaluations, and
the most consistent objective metric.
The first step consists in comparing the objective scores with the subjective
scores (presented in Sect. 15.4.1). The consistency between objective and


Table 15.10 Rankings according to measurements


A1
A2
A3

A4

A5

A6

A7

ACR-HR
Rank order
PC
Rank order
PSNR
Rank order
SSIM
Rank order
MSSIM
Rank order
VSNR
Rank order
VIF
Rank order
VIFP
Rank order
UQI
Rank order
IFC
Rank order
NQM
Rank order
WSNR
Rank order
PSNR HSVM
Rank order
PSNR HSV
Rank order

2.250
3
0.531
3
26.117
3
0.859
1
0.950
1
22.004
3
0.425
2
0.448
1
0.577
1
2.587
2
17.074
3
21.597
3
21.428
3
20.938
3

2.345
2
0.936
2
26.171
2
0.859
1
0.949
2
22.247
1
0.425
2
0.448
1
0.576
3
2.586
3
17.198
2
21.697
2
21.458
2
20.958
2

2.169
5
0.454
5
26.177
1
0.858
3
0.949
2
22.195
2
0.426
1
0.448
1
0.577
1
2.591
1
17.201
1
21.716
1
21.491
1
20.987
1

1.126
7
-2.055
7
20.307
6
0.821
5
0.883
5
21.055
4
0.397
4
0.420
4
0.558
4
2.423
4
10.291
6
15.588
6
15.714
6
15.407
6

2.388
1
1.038
1
18.75
7
0.638
7
0.648
7
13.135
7
0.124
7
0.147
7
0.237
7
0.757
7
8.713
7
13.817
7
13.772
7
13.530
7

2.234
4
0.508
4
24.998
4
0.843
4
0.932
4
20.530
5
0.394
5
0.416
5
0.556
5
2.420
5
16.334
4
20.593
4
19.959
4
19.512
4

1.994
6
0.207
6
23.18
5
0.786
6
0.826
6
18.901
6
0.314
6
0.344
6
0.474
6
1.959
6
13.645
5
18.517
5
18.362
5
17.953
5

subjective measures is evaluated by calculating the correlation coefficients over the whole set of fitted measured points. The coefficients are presented in Table 15.9. In the results of our test, none of the tested metrics reaches 50 % correlation with human judgment. This reveals that the tested objective metrics do not reliably predict human appreciation in the case of synthesized views, even though their efficiency has been shown for the quality assessment of conventional 2D media.
The whole set of objective metrics gives the same trends. Table 15.8 provides the correlation coefficients between the obtained objective scores. It reveals that they are highly correlated, i.e., the tested metrics have the same behavior when assessing images containing DIBR-related artifacts. Note the high correlation scores between pixel-based and more perception-oriented metrics such as PSNR and SSIM (83.9 %).
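A minimal sketch of this consistency check is given below (Pearson correlation between per-image objective scores and subjective DMOS values, reported as a percentage; whether a fitting step is applied beforehand is left out of this illustration):

    import numpy as np

    def correlation_percent(objective_scores, subjective_scores):
        # Pearson linear correlation coefficient, expressed in percent.
        r = np.corrcoef(objective_scores, subjective_scores)[0, 1]
        return abs(r) * 100.0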
Since it is established in [49, 50] that correlation is different from agreement
(as illustrated in Fig. 15.9), we check the agreement of the tested metrics by


Table 15.11 Ranking of algorithms according to subjective scores


A1
A2
A3
A4
A5

A6

A7

ACR-HR
Rank order

2.956
4

2.104
7

3.523
1

3.237
2

2.966
3

2.865
5

2.789
6

Table 15.12 Results of Student's t-test with ACR-HR results


A1
A2
A3
A4
A5
A1
A2
A3
A4
A5
A6
A7

:(7)
;(7)
;(3)
;(3)
;(2)
;(3)
;(1)

:(3)
:(2)

;(2)
;(2)
;(1)
;(2)
;(1)

Legend: : superior, ; inferior,

:(3)
:(2)
O
(>32)

(>32)
;(9)
O
(>32)
;(1)
O

:(2)
:(1)
:(9)
O
(>32)

(>32)
(>32)
;(1)
O

:(15)
;(1)

A6

A7

:(3)
:(2)
O
(>32)
O
(>32)
;(15)

:(1)
:(1)
:(1)
:(1)
:(1)
:(1)

;(1)

statistically equivalent

comparing the ranks assigned to the algorithms. Table 15.10 presents the rankings
of the algorithms obtained from the objective scores. Rankings from subjective
scores are mentioned for comparison. They present a noticeable difference concerning the ranking order of A1: ranked as the best algorithm out of the seven by
the subjective scores, it is ranked as the worst by the whole set of objective
metrics. Another comment refers to the assessment of A6: often regarded as the
best algorithm, it is ranked as one of the worst algorithms by the subjective tests.
The ensuing assumption is that objective metrics detect and penalize non-annoying
artifacts.
As a conclusion, none of the tested metrics reaches 50 % correlation with human judgment. The tested metrics have the same response when assessing DIBR-related artifacts.
Given the inconsistencies with the subjective assessments, it is assumed that
objective metrics detect and penalize non-annoying artifacts.

15.5 Results on Video Sequences (Experiment 2)


This section addresses the context of video sequences. Section 15.5.1 addresses
the results of the subjective assessments and Sect. 15.5.2 presents the results of the
objective evaluations. In these experiments, the objective and subjective methods are now evaluated with the temporal dimension included. In these conditions, the objective is to determine whether ACR-HR is appropriate for the assessment of different DIBR algorithms; the number of participants required in such a subjective assessment test;
whether the results of the subjective assessments are consistent with the objective
evaluations.


Table 15.13 Rankings according to measurements


A1
A2
A3

A4

A5

A6

A7

ACR-HR
Rank order
PSNR
Rank order
SSIM
Rank order
MSSIMM
Rank order
VSNR
Rank order
VIF
Rank order
VIFP
Rank order
UQI
Rank order
IFC
Rank order
NQM
Rank order
WSNR
Rank order
PSNR HSVM
Rank order
PSNR HSV
Rank order
VSSIM
Rank
VQM
Rank order

2.03
5
25.994
3
0.859
1
0.948
1
21.786
3
0.423
2
0.446
1
0.598
3
2.562
2
16.635
3
21.76
3
21.278
3
20.795
3
0.899
1
0.572
3

1.96
6
26.035
2
0.859
1
0.948
1
21.965
2
0.423
2
0.446
1
0.598
3
2.562
2
16.739
1
21.839
2
21.318
2
20.823
2
0.898
2
0.556
1

2.13
4
26.04
1
0.859
1
0.948
1
21.968
1
0.424
1
0.446
1
0.598
3
2.564
1
16.739
1
21.844
1
21.326
1
20.833
1
0.893
3
0.557
2

1.28
7
20.89
6
0.824
5
0.888
5
20.73
4
0.396
4
0.419
4
0.667
1
2.404
4
10.63
6
16.46
6
16.23
6
15.91
6
0.854
5
0.652
6

2.70
1
19.02
7
0.648
7
0.664
7
13.14
7
0.129
7
0.153
7
0.359
7
0.779
7
8.66
7
14.41
7
13.99
7
13.74
7
0.662
7
0.888
7

2.41
2
24.99
4
0.844
4
0.932
4
20.41
5
0.393
5
0.415
5
0.664
5
2.399
5
15.933
4
20.85
4
19.37
4
19.52
4
0.879
4
0.623
5

2.14
3
23.227
5
0.786
6
0.825
6
18.75
6
0.313
6
0.342
6
0.58
6
1.926
6
13.415
5
18.853
5
18.361
5
17.958
5
0.809
6
0.581
4

Table 15.14 Correlation coefficients between objective and subjective scores in percentage

         PSNR   SSIM   MSSIM  VSNR   VIF    VIFP   UQI    IFC    NQM    WSNR   PSNR-HVSM  PSNR-HVS  VSSIM  VQM
ACR-HR   34.5   45.2   27     47.3   43.9   46.9   20.2   45.6   36.6   32.9   34.5       33.9      33     33.6

15.5.1 Subjective Tests


In the case of video sequences, only the ACR-HR test was conducted, as explained in Sect. 15.3.5. Table 15.11 shows the ranking of the algorithms obtained from the subjective scores. The ranking order differs from the one obtained with the ACR-HR test on still images.


Fig. 15.10 Ranking of used metrics according to their correlation to human judgment

Although the values allow the ranking of the algorithms, they do not directly provide knowledge on the statistical equivalence of the results. Table 15.12 depicts the results of the Student's t-test performed on these values. Compared to the ACR-HR test with still images detailed in Sect. 15.4.1, the distinctions between algorithms seem to be more obvious. The statistical significance of the differences between the algorithms, based on the ACR-HR scores, seems clearer for video sequences than for still images. This can be explained by the exhibition time of the video sequences: watching the whole video, observers can refine their judgment, contrary to still images. Note that the same algorithms were not statistically differentiated: A4, A3, A5, and A6.
As a conclusion, though more than 32 participants are required to perform all
the distinctions in the tested conditions, the ACR-HR test with video sequences
gives clearer statistical differences between the algorithms than the ACR-HR test
with still images. This suggests that new elements allow the observers to make a
decision: existence of flickering, exhibition time, etc.

15.5.2 Objective Measurements


The results of this subsection concern the measurements conducted over the entire
synthesized sequences. The objective is to determine the consistency between the
subjective assessments and the objective evaluations, and the most consistent
objective metric, in the context of video sequences.
As in the case of still images studied in the previous section, the rankings of the
objective metrics (Table 15.13) are consistent with each other: the correlation
coefficients between objective metrics are very close to the figures depicted in
Table 15.8, and so they are not presented here. As with still images, the difference
between the subjective-test-based ranking and the ranking from the objective
scores is noticeable. Again, the algorithm given as the worst (A1) by the objective
measurements is the one observers preferred. This can be explained by the fact that


A1 performs the synthesis on a cropped image, and then enlarges it to reach the
original size. Consequently, signal-based metrics penalize it though it gives good
perceptual results.
Table 15.14 presents the correlation coefficients between objective scores and
subjective scores, based on the whole set of measured points. None of the tested objective metrics reaches 50 % correlation with the subjective scores. The metric obtaining the highest correlation coefficient is VSNR, with 47.3 %. Figure 15.10 shows the ranking of the metrics according to the correlation scores of Table 15.14. It is observed that the top metrics are perception-oriented metrics (they include psychophysical approaches).
To conclude, the performances of the objective metrics, with respect to subjective scores, are different for video sequences and for still images. Correlation coefficients between objective and subjective scores were higher in the case of video sequences, when comparing Table 15.14 with Table 15.9. However, human opinion also differed in the case of video sequences. For video sequences, perception-oriented metrics were the most correlated to subjective scores (also in video conditions). However, in either context, none of the tested metrics reached 50 % statistical correlation with human judgment.

15.6 Discussion and Future Trends


This section discusses the future directions regarding the quality assessment of
views synthesized with DIBR systems. The results presented in the previous
sections showed that the subjective quality assessment protocols commonly used in 2D conditions may be used in the context of DIBR, provided careful adjustments are made (regarding the number of observers, for example). The results also showed that
objective metrics are not sufficient for assessing synthesized views obtained with
DIBR. This section addresses the issues related to the design of a new subjective
quality assessment method, in the context of DIBR and more generally in the
context of 3D video, and the new trends for the objective metrics.

15.6.1 Subjective Protocols


ACR-HR and PC are known for their efficiency in 2D conditions, though they
showed their limitations in the two case studies presented in Sect. 15.3. Moreover,
one may need to assess the quality of 3D media in 3D conditions. Defining a new
subjective video quality assessment framework is a tough task, given the new
complexity involved in 3D media. The difficulty of 3D-image quality evaluation,
compared to 2D conventional images, is now better considered. Seuntiens [51]
introduced new parameters to be assessed in addition to image quality: naturalness,
presence and visual experience. Thus, a multi-dimensional quality indicator may


allow a reliable assessment of 3D-TV media. However, it may be difficult to define


such terms in the context of a subjective quality assessment protocol. And since
the technology is not mature yet, currently recommended protocols are still to be
improved. The ITU-R BT.1438 recommendation [52] describes the subjective assessment
of stereoscopic television pictures and the methods are described in [18].
Chen et al. [53] revisited the question of subjective video quality assessment protocols for 3D-TV. This work points out the complexity of 3D media quality assessment. Chen et al. proposed to reconsider several conditions in this context, such as the
viewing conditions (viewing distance, monitor resolution), the test material (depth
rendering according to the chosen 3D display), viewing duration, etc. In the following,
some of the requirements proposed by Chen et al. in [53] are mentioned:
General viewing conditions: first, the luminance and contrast ratio are considered, because of the crosstalk introduced by 3D-TV screens and because of the glasses used (both active and polarized glasses reduce the luminance). Second, the resolution of depth has to be defined. Third, the viewing distance recommended by ITU standards may differ according to the 3D display used. Moreover, as the authors of the study claim, depth perception should be considered as a new parameter for evaluating the Preferred Viewing Distance, alongside human visual acuity and picture resolution.
Source signals: the video format issue is mentioned. It refers to the numerous 3D representations (namely Layered Depth Video (LDV), Multi-view Video-plus-Depth (MVD), or video-plus-depth (2D + Z)) whose reconstruction or conversion leads to different types of artifacts.
Test methods: as mentioned previously, new aspects have to be considered such
as naturalness, presence, visual experience, and visual comfort as well. The
latter refers to the visual fatigue that should be measured to help in a standardization process.
Observers: first, an adapted protocol should involve the measurement of viewers' stereopsis ability. Second, the authors of [53] mention that the required number of participants may differ from the 2D case, so further experiments should define this number.
Test duration and results analysis: the duration of the test is still to be determined, taking visual comfort into account. Analysis of the results refers to the definition of a criterion for rejecting incoherent viewers and to the analysis of the assessed parameters (depth, image quality, etc.).

15.6.2 Objective Quality Assessment Metrics


The experiments presented in this chapter showed the need for better adapted tools
to correctly assess the quality of synthesized views. The most recent proposed 3D
quality metrics suggest to take into account the new modes brought by 3D.
Among the proposed metrics, many of them target stereoscopic video, for instance,


but none of them target views synthesized from DIBR in 2D viewing conditions.
Therefore they do not apply to the issue targeted in this chapter.
Most of the proposed metrics for assessing 3D media are inspired from 2D
quality metrics. It should be noted that experimental protocols validating the
proposed metrics often involve depth and/or color compression, different 3D
displays, and different 3D representations (2D + Z, stereoscopic video, MVD,
etc.). The experimental protocols often assess at the same time both compression
distortion and synthesis distortion, without distinction. This is problematic because
there may be a combination of artifacts from various sources (compression and
synthesis) whose effects are not clearly specified and assessed.
In the following, we present the new trends regarding new objective metrics for
3D media assessment, by distinguishing whether they make use of depth data in
the quality score computation or not.

15.6.2.1 2D-Like Metrics


The Perceptual Quality Metric (PQM) [54] was proposed by Joveluro et al. Although the authors assess the quality of decoded 3D data (2D + Z), the metric is applied on left and right views synthesized with a DIBR algorithm (namely [13]). Thus, this method may also be applied to synthesized views. The quality score is a weighted function of the contrast distortion and the luminance differences between the reference and distorted color views. Therefore, the method can be classified as HVS-based. The method is sensitive to slight changes in image degradation and error
quantification. In [54] PQM method performances are validated by evaluating
views synthesized from compressed data (both color and depth data are encoded at
different bit-rates). Subjective scores are obtained by an SAMVIQ test, on a 3D
42-inch Philips multi-view auto-stereoscopic display. Note that compression,
synthesis, and factors inherent to the display are assessed at the same time without
distinction in the experiments.
Zhao and Yu [55] proposed an FR metric, Peak Signal to Perceptible Temporal
Noise Ratio. The metric evaluates the quality of synthesized sequences by measuring the perceptible temporal noise within these impaired sequences.

15.6.2.2 Depth-Aided Methods


Ekmekcioglu et al. [56] proposed a depth-based PQM. It is a tool that can be
applied to PSNR or SSIM. The method uses a weighting function based on depth
data at the target viewpoint, and a temporal consistency function to take the
motion activity into account. The final score includes a factor that considers nonmoving background objects during view synthesis. The inputs of the method are
the original depth map (uncompressed), the original color view (originally
acquired, uncompressed), and the synthesized view. The validation of the performances is achieved by synthesizing different viewpoints from distorted data:


color views suffer two levels of quantization distortion; depth data suffer four
different types of distortion (quantization, low pass filtering, borders shifting, and
artificial local spot errors in certain regions). The study [56] shows that the proposed method enhances the correlation of PSNR and SSIM to subjective scores.
Yasakethu et al. [57] proposed an adapted VQM for measuring 3D Video
quality. It combines 2D color information quality and depth information quality.
Depth quality measurement includes an analysis of the depth planes. The final
depth quality measure combines 1) the measure of distortion of the relative distance within each depth plane, 2) the measure of the consistency of each depth
plane, and 3) the structural error of the depth. The color quality is based on the
VQM score. In [57], the metric is evaluated on left and right views (rendered from 2D + Z encoded data) and compared to subjective scores obtained using an auto-stereoscopic display. Results show a higher correlation than plain VQM.
Solh et al. [58] introduced the 3D Video Quality Measure (3VQM) to predict the quality of views synthesized by DIBR algorithms. The method analyzes the quality of the depth map against an ideal depth map. Three different analyses lead to three distortion measures: spatial outliers, temporal outliers, and temporal inconsistencies. These measures are combined to provide the final quality score. To validate the
method, subjective tests were run in stereoscopic conditions. The stereoscopic pairs
included views synthesized from compressed depth and color video, depth from stereo matching, and depth from 2D-to-3D conversion. Results showed accurate and
consistent scores compared to subjective assessments.
As a conclusion, subjective and objective methods tend to take the added value of depth more into account. This makes depth an additional feature to assess, just like image quality. The proposed objective metrics still partially rely on usual 2D methods. They tend to include more tools allowing
the analysis of the depth component. These tools focus either on depth structure, or
on depth accuracy. Temporal consistency is also taken into account.

15.7 Conclusion

This chapter proposed a reflection on both subjective quality assessment protocols and the reliability of objective quality assessment methods in the context of DIBR-based media.
Typical distortions related to DIBR were introduced. They are geometric distortions, mainly located around the disoccluded areas. When compression-related distortions and synthesis-related distortions are combined, the errors are generally scattered over the whole image, increasing the perceptible annoyance.
Two case studies were presented answering the two questions relating, first, to the
suitability of two efficient subjective protocols (in 2D), and second, to the reliability
of commonly used objective metrics. Experiments considered commonly used
methods for assessing conventional images, either subjectively or objectively, to
assess DIBR-based synthesized images, from seven different algorithms.


Concerning the suitability of the tested subjective protocols, the number of participants required for establishing a statistical difference between the algorithms was almost double the number required by VQEG (24), which reinforces the requirements of Chen et al. [53]. Both methodologies agreed on the performance ranking of the view synthesis algorithms. Experiments also showed that the observers' opinion was not as stable when assessing still images as when assessing video sequences with ACR-HR. PC gave stable results with fewer participants than ACR-HR in the case of still images. Both methodologies have their advantages and drawbacks, and they are complementary: assigning an absolute rating to distortions such as those of the synthesized views seemed a tough task to observers, although it provides knowledge on the perceived quality of the set. Small impairments are better evaluated with PC.
Concerning the reliability of the tested objective metrics, the results showed that the objective metrics did not correlate with the observers' opinion. Objective measures did not reach 50 % correlation with human judgment, and they were all correlated with each other. The results suggest that tiny distortions are penalized by the objective metrics even when not perceptible by humans. Therefore, objective metrics inform on the existence of distortions but not on their perceptible annoyance.
Using the tested metrics is not sufficient for assessing virtual synthesized views.
The simple experiments that have been presented in this chapter reveal that the
reliability of the tested objective metrics is uncertain when assessing intermediate
synthesized views, in the tested conditions. New standards are under investigation
considering the new aspects brought by DIBR. The new trends for the design of
DIBR-adapted metrics include analyses of depth accuracy and depth structure, and
analyses of potential disocclusions, to enhance the prediction of the degree of
annoyance of artifacts.
Acknowledgments We would like to thank the experts who provided the synthesized sequences of the presented experiments, as well as the algorithm specifications: Martin Köppel and Patrick Ndjiki-Nya, from the Fraunhofer Institute for Telecommunications, HHI (Berlin).
We would like to acknowledge the Interactive Visual Media Group of Microsoft Research for
providing the Breakdancers data set, the MPEG Korea Forum for providing the Lovebird1 data
set, the GIST for providing the Newspaper data set, and HHI for providing Book Arrival.

References
1. Smolic A et al (2006) 3D video and free viewpoint video-technologies, applications and
mpeg standards. In: Proceedings of the IEEE international conference on multimedia and
expo (ICME06), pp 21612164
2. Fehn C (2004) Depth-image-based rendering (DIBR), compression and transmission for a
new approach on 3D-TV. In: Proceedings of SPIE stereoscopic displays and virtual reality
systems XI, vol 5291, pp 93104
3. Bosc E et al (2011) Towards a new quality metric for 3-D synthesized view assessment. IEEE
J Sel Top in Sign Proces 5(7):13321343
4. De Silva DVS et al (2010) Intra mode selection method for depth maps of 3D video based on
rendering distortion modeling. IEEE Trans Consum Electron 56(4):27352740


5. Köppel M et al (2010) Temporally consistent handling of disocclusions with texture synthesis


for depth-image-based rendering. In: Proceedings of IEEE international conference on image
processing (ICIP), Hong Kong, China
6. Meesters M et al (2004) A survey of perceptual evaluations and requirements of three
dimensional TV. IEEE Trans Circuits Syst Video Technol 14(3):381391
7. Boev A et al (2011) Classification of stereoscopic artefacts. Mobile3DTV project report,
available online at http://mobile3dtv.eu/results
8. Yuen M et al (1998) A survey of hybrid MC/DPCM/DCT video coding distortions. Sign
Process 70(3):247278
9. Yasakethu SLP et al (2008) Quality analysis for 3D video using 2D video quality models.
IEEE Trans Consum Electron 54(4):19691976
10. Tikanmaki A et al (2008) Quality assessment of 3D video in rate allocation experiments. In:
IEEE international symposium on consumer electronics, 1416 April, Algarve, Portugal
11. Hewage CTER et al (2009) Quality evaluation of color plus depth map-based stereoscopic
video. IEEE J Sel Top Sign Proces 3(2):304318
12. You J et al (2010) Perceptual quality assessment for stereoscopic images based on 2D image
quality metrics and disparity analysis. In: Proceedings of international workshop video
processing and quality metrics, Scottsdale, Arizona, USA
13. Fehn C (2004) Depth-image-based rendering (DIBR), compression and transmission for a
new approach on 3D-TV. In: Proceedings of SPIE conference on stereoscopic displays and
virtual reality systems X, San Jose, USA
14. Telea A (2004) An image inpainting technique based on the fast marching method. J Graph
GPU Game Tools 9(1):2334
15. Tanimoto M et al (2008) Reference softwares for depth estimation and view synthesis. In:
Presented at the ISO/IEC JTC1/SC29/WG11 MPEG 2008/M15377
16. Müller K et al (2008) View synthesis for advanced 3D video systems. EURASIP Journal on
Image and Video Processing, vol 2008
17. Ndjiki-Nya P et al (2010) Depth image based rendering with advanced texture synthesis. In:
Proceedings of IEEE international conference on multimedia and expo (ICME), Singapore
18. ITU-R BT (1993) 500, Methodology for the subjective assessment of the quality of television
pictures, November 1993
19. MetriX MuX Home Page [Online]. Available at http://foulard.ece.cornell.edu/gaubatz/
metrix_mux/. Accessed on 18 Jan 2011
20. ITU-T (2008) Subjective video quality assessment methods for multimedia applications.
Geneva, Recommendation P 910, 2008
21. Barkowsky M (2009) Subjective and objective video quality measurement in low-bitrate
multimedia scenarios. ISBN: 978-3-86853-142-8, Verlag Dr. Hut, 2009
22. Brotherton MD et al (2006) Subjective multimedia quality assessment. IEICE Trans Fundam
Electron Commun Comp Sci ES E SERIES A 89(11):2920
23. Rouse DM et al (2010) Tradeoffs in subjective testing methods for image and video quality
assessment. Human Vision and Electronic Imaging XV, vol 7527, p 75270F
24. Huynh-Thu Q et al (2011) Study of rating scales for subjective quality assessment of highdefinition video. IEEE Trans Broadcast, pp 114, Mar 2011
25. Handley JC (2001) Comparative analysis of Bradley-Terry and Thurstone-Mosteller paired
comparison models for image quality assessment. In: IS and TS PICS conference, pp 108112
26. Pinson M et al (2003) Comparing subjective video quality testing methodologies. In: SPIE
video communications and image processing conference, Lugano, Switzerland
27. Péchard S (2008) Qualité d'usage en télévision haute définition: évaluations subjectives et métriques objectives
28. Bosc E et al (2010) Focus on visual rendering quality through content-based depth map
coding. In: Proceedings of picture coding symposium (PCS), Nagoya, Japan
29. Ebrahimi T et al (2004) JPEG vs. JPEG2000: an objective comparison of image encoding
quality. In: Proceedings of SPIE, 2004, vol 5558, pp 300308
30. Wang Z et al (2002) A universal image quality index. IEEE Sign Process Lett 9(3):8184


31. Sheikh HR et al (2005) An information fidelity criterion for image quality assessment using
natural scene statistics. IEEE Trans Image Process 14(12):21172128
32. Pinson MH et al (2004) A new standardized method for objectively measuring video quality.
IEEE Trans Broadcast 50(3):312322
33. Wang Y (2006) Survey of objective video quality measurements. EMC Corporation
Hopkinton, MA, vol 1748
34. Hekstra AP et al (2002) PVQM-A perceptual video quality measure. Sign Process Image
Commun 17(10):781798
35. Wang Z et al (2004) Image quality assessment: From error visibility to structural similarity.
IEEE Trans Image Process 13(4):600612
36. Wang Z et al (2004) Video quality assessment based on structural distortion measurement.
Sign Process Image Commun 19(2):121132
37. Seshadrinathan K et al (2010) Motion tuned spatio-temporal quality assessment of natural
videos. IEEE Trans Image Process 19(2):335350
38. Boev A et al (2009) Modelling of the stereoscopic HVS
39. Yang J et al (1994) Spatiotemporal separability in contrast sensitivity. Vision Res
34(19):25692576
40. Winkler S (2005) Digital video quality: vision models and metrics. Wiley
41. Egiazarian K et al (2006) New full-reference quality metrics based on HVS. In: CD-ROM
proceedings of the second international workshop on video processing and quality metrics,
Scottsdale, USA
42. Ponomarenko N et al (2007) On between-coefficient contrast masking of DCT basis
functions. In: CD-ROM proceedings of the third international workshop on video processing
and quality metrics, vol 4
43. Chandler DM et al (2007) VSNR: a wavelet-based visual signal-to-noise ratio for natural
images. IEEE Trans Image Process 16(9):22842298
44. Damera-Venkata N et al (2002) Image quality assessment based on a degradation model.
IEEE Trans Image Process 9(4):636650
45. Sheikh HR et al (2006) Image information and visual quality. IEEE Trans Image Process
15(2):430444
46. Van C et al (1996) Perceptual quality measure using a spatio-temporal model of the human
visual system
47. Video Quality Research (2011) [Online]. Available at http://www.its.bldrdoc.gov/vqm/.
Accessed on 19 Jul 2011
48. VQEG 3DTV Group (2010) VQEG 3DTV test plan for crosstalk influences on user quality of
experience, 21 Oct 2010
49. Engelke U et al (2011) Towards a framework of inter-observer analysis in multimedia quality
assessment
50. Haber M et al (2006) Coefficients of agreement for fixed observers. Stat Methods Med Res 15(3):255
51. Seuntiens P (2006) Visual experience of 3D TV. Doctoral thesis, Eindhoven University of
Technology
52. ITU (2000) Subjective assessment of stereoscopic television pictures. Recommendation ITU-R BT.1438
53. Chen W et al (2010) New requirements of subjective video quality assessment methodologies for 3DTV. In: Fifth international workshop on video processing and quality metrics for consumer electronics – VPQM 2010, Scottsdale, Arizona, USA
54. Joveluro P et al (2010) Perceptual video quality metric for 3D video quality assessment. In: 3DTV-conference: the true vision – capture, transmission and display of 3D video (3DTV-CON), pp 1–4
55. Zhao Y et al (2010) A perceptual metric for evaluating quality of synthesized sequences in
3DV system. In: Proceedings of SPIE, vol 7744, p 77440X
56. Ekmekcioglu E et al (2010) Depth based perceptual quality assessment for synthesized
camera viewpoints. In: Proceedings of second international conference on user centric media,
UCMedia 2010, Palma de Mallorca

57. Yasakethu SLP et al (2011) A compound depth and image quality metric for measuring the
effects of packet loss on 3D video. In: Proceedings of 17th international conference on digital
signal processing, Corfu, Greece
58. Solh M et al (2011) 3VQM: a vision-based quality measure for DIBR-based 3D videos. In: IEEE international conference on multimedia and expo (ICME) 2011, pp 1–6

Index

2D-to-3D video conversion, 107, 109–113, 116–118, 120, 121, 123, 124, 128, 133, 138, 139
3D
broadcasting, 300, 312
display, 39–41, 43–46, 48, 58, 61
format, 303, 304, 318
production, 39, 40, 44, 46, 48, 56, 65
representation, 39, 40, 55, 60, 63, 65, 66
video (3DV), 170, 223
video coding, 17, 26, 300, 306
video transmission, 17, 26, 32
videoconferencing, 42
visualization, 3, 4, 18, 41
warping, 145–147, 149, 150, 153, 157, 161, 164, 165, 170, 186
3D-TV, 108, 109, 138, 191–193, 195, 206, 218, 277, 278, 280, 347–349, 439, 440, 444, 467
3D-TV chipset, 73
3D-TV system, 301, 302, 314, 317, 319, 337, 338

A
Absolute Categorical Rating (ACR), 427–431, 449, 450
Adaptive edge-dependent lifting, 288
Alignment, 194, 197, 201, 217, 218, 235
Analysis tool, 165, 223, 244
Autostereoscopic, 69, 72, 73, 94

Autostereoscopic display, 224, 243, 391, 403, 407
Auto-stereoscopic multi-view display, 39, 40, 44, 61

B
Bilateral filter, 173, 176, 177, 187, 203
Bilateral filtering, 203–205, 214, 215, 218
Binocular correspondence, 348, 352
Binocular development, 349, 367
Binocular parallax, 19, 377, 379, 380, 422
Binocular rivalry, 348, 358–360, 368
Binocular visual system, 348, 351
Bit allocation, 250, 268
Blurry artifact, 43
Boundary artifact, 158

C
Cable, 300, 304, 306, 308, 312, 313–316, 320, 326, 329, 334–336, 339
Census Transform, 76–78, 84–87
Challenge, 7, 48, 104, 347
Characteristics of depth, 240, 251, 253
Coding error, 237, 238
Computational complexity, 73, 75, 77, 78, 80, 84, 87, 92, 97, 165, 200, 268, 456
Content creation, 194

Content generation, 3, 8, 22, 24, 25
Contour of interest, 175, 177
Conversion artifact, 109, 134, 136
Correlation, 440, 448, 454, 456, 462, 465, 466, 469, 470
Correlation histogram, 239, 241, 242, 245
Cross-check, 82, 90, 145, 153
Crosstalk, 384, 392, 393, 400, 404, 419–421
Crosstalk perception, 422, 424, 425

D
Data format, 223, 244
Depth camera, 10
Depth cue, 109, 111–113, 116–118, 121, 123, 124, 132, 139, 380
Depth estimation, 39–41, 47, 48, 51, 53, 55, 62, 109, 121, 191, 192
Depth map, 3, 6–9, 10, 11, 16, 17, 24, 29, 31, 33, 39, 40, 48, 51–53, 55–57, 61–64, 66, 69, 71, 86, 89, 95–100, 102, 103
Depth map coding, 253, 257, 263, 265, 296
Depth map preprocessing
Depth perception, 19, 20, 27
Depth video compression, 277
Depth-based 3D-TV, 3, 7, 27
Depth-enhanced stereo (DES), 223, 243
Depth-image-based rendering (DIBR), 3, 4, 28, 107, 192, 223, 236
Depth-image-based rendering methods (DIBR), 170
Depth-of-field, 112, 113, 385, 406
DIBR, 440, 441, 444, 448, 455–457, 461–463, 466, 468–470
Digital broadcast, 299, 300, 302, 303
Disocclusion, 115, 124, 129–131, 169–171, 173, 174, 181, 185, 187, 188
Disparity, 73–75, 77–94, 96, 102–104, 378, 380, 385
Disparity scaling, 349, 362
Display-agnostic production, 41, 44, 46, 65
Distance map, 170, 178, 179
Distortion, 441, 444, 448, 449, 453–456, 468, 469
Distortion measure, 223, 237–239
Distortion model, 237–239
Don't care region, 258, 259
DVB, 300, 302, 305, 314–316, 320–326, 332–336, 338, 339
Dynamic depth cue, 349
Dynamic programming, 80

E
Edge detector, 160, 177, 290, 292
Edge-adaptive wavelet, 264
Entropy coding, 228, 266, 285, 294
Evaluation method, 223, 233, 244
Extrapolation, 55, 59

F
Focus, 113, 139, 349
Foreground layer, 193, 197–200, 202
FPGA, 69, 70, 73, 83–85, 103, 104
Frame compatible, 306, 307, 321, 328, 329, 331, 337, 338
Free viewpoint TV, 70, 72
Fusion limit, 347, 356, 357

G
Geometric error, 251, 252, 254–256
Geometrical distortion, 271, 381, 383
Glasses, 376, 384, 390–394, 398–400, 407, 408
GPU, 83, 102, 164, 197
Grab-cut, 204, 209
Graph-based transform, 263, 264, 289
Graph-based wavelet, 289

H
H.264/AVC, 301, 308–312, 318, 327–334, 339
Haar filter bank, 285, 287
Hamming distance, 85, 86
Hardware implementation
Head-tracked display
Hole filling, 96, 97, 146, 147, 150, 161, 162, 164, 165, 187
Horopter, 348, 352–355
Human visual system, 11, 108, 111, 112, 136, 138, 139, 455
Hybrid approach, 123, 124
Hybrid camera system, 191, 201, 218

I
Image artifact, 348, 382
Image inpainting, 169, 170, 182
Integral imaging, 380, 388, 390, 394, 395, 401
Inter-view prediction, 223, 224, 230, 236, 243, 244, 309
IPTV, 427, 428
ISDB, 315, 316, 324, 329–332
ITU-R, 301, 304, 315, 326, 329, 337–339, 419, 427, 423, 469
ITU-R BT.500

J
Joint coding

L
Layered depth video (LDV), 3, 7, 191, 243
LDV compliant capturing, 194
Lifting scheme, 281–284, 288, 293, 294, 296
Light field display, 390, 394, 395, 402
Linear perspective, 113, 139

M
Matching cost, 75–81, 86, 87, 89
Mean opinion score, 21, 229, 414
Monocular cue, 11, 376–378
Monocular occlusion zone, 348, 358–362, 369
Motion parallax, 112, 120, 121, 123, 124, 133, 366
MPEG-2, 308, 312, 320, 323, 324, 327–334, 339
Multiresolution wavelet decomposition, 294
Multi-view display, 385, 396, 397, 402–404, 406, 408
Multi-view video, 170, 448, 467
Multi-view video coding (MVC), 9, 16, 224, 231, 269, 309, 334
Multi-view video plus depth (MVD), 3, 146, 170, 191, 224

N
Network requirement, 322, 325, 327, 331, 336

O
Objective metric, 440, 441, 446, 448, 449, 452, 456, 461–463, 465, 466, 468–470
Objective visual quality, 413
Occlusion layer, 191, 193, 194, 199–203, 217

P
Paired comparisons, 451
Patch matching, 185
Perceptual issue, 18, 348
Pictorial depth cue, 112
Post-production, 11, 40, 138, 191
Priority computation, 185
PSNR, 452, 455, 462, 468

Q
Quadric penalty function
Quality assessment, 21, 414, 419, 462
Quality enhancement technique, 153
Quality evaluation, 3, 4, 21, 25, 28, 244
Quality of experience (QOE), 21, 415

R
Rate-distortion optimization, 225, 228, 229, 232, 234, 244, 257
Real-time, 70, 73, 75, 76, 84, 95, 104, 117
Real-time implementation, 65, 121, 145, 155, 165
Reliability, 157, 160
Rendered view distortion, 252–254, 257, 262

S
Satellite, 300, 304, 306, 308, 312–314, 316, 320–323, 326, 329, 334, 339
Scalable video coding (SVC), 312
Scaling, 282, 283
Shape-adaptive wavelet, 289
Shifting effect, 444
Side information, 277, 278, 285, 286, 288, 291, 293, 294, 296
Size scaling, 348, 363
Smart, 161
Smearing effect, 145, 161
Smoothing filter, 170, 176
Sparse representation, 259, 260, 262
Spatial filtering, 153
Splatting, 163
SSIM, 454, 455, 462, 468
Standardization, 3, 22, 25–27, 300–302, 334, 337, 339
Stereo display, 39, 40, 44, 48, 61
Stereo matching, 52, 53, 73, 75, 78–80, 83, 85, 90, 94, 95, 98, 100, 102, 104, 191, 192, 196, 218
Stereoacuity, 347, 354, 355, 367, 368
Stereoscopic 3D (S3D), 108
Stereoscopic 3D-TV, 3, 4, 300, 319, 338
Stereoscopic display, 19, 32, 375, 379, 407
Stereoscopic perception, 375
Stereoscopic video, 41
Structural inpainting, 170, 182
Subjective assessment, 440, 441, 446, 448, 449, 457, 461, 463, 465, 467, 469
Subjective test method, 450
Subjective visual quality, 413
Support region builder, 85, 87
Surrogate depth map, 118, 119

Synthesis artifact, 145, 146, 153, 158
Synthesized view, 440, 441, 444, 446, 447, 449, 452, 453, 456–458, 462, 466–468, 470

T
Temporal consistency, 53, 54, 157, 182, 468, 469
Temporal filtering, 155, 177
Terrestrial, 300, 304, 306, 308, 312–316, 322, 324, 326, 327, 329, 334, 336, 339
Texture flickering, 145, 146
Texture inpainting, 185
Thresholding, 206–208, 265
Time-of-flight (TOF) camera, 193–196
Transform, 148, 164, 198, 258
Transport, 302, 304, 306, 308–310, 315, 318, 322, 326–328, 330, 331, 334

U
UQI, 454, 455

V
Video coding, 223–225, 228, 229, 232, 233, 238, 239, 241, 243–245
View merging, 145–147, 150, 152, 158, 164, 165
View synthesis, 13, 94, 99, 100, 145–147, 153, 165, 171, 286
Viewing zone, 385, 386, 396, 401, 406
Virtual view, 145–153, 157, 158, 160, 161, 163, 165
Visual cortex, 351, 352
Visual quality, 440, 456, 458
Volumetric display, 389, 393, 394, 400, 401
VQM, 454, 457, 469

W
Warping, 72, 96, 97, 114, 198, 199, 201, 205, 214, 215, 269
Wavelet coding, 17
Wavelet filter bank, 286, 294
Wavelet transform, 280–282, 285, 287, 288, 290, 294
Wavelet transforms, 280–282
