
Beyond Controllers

Human Segmentation, Pose, and Depth Estimation as Game Input Mechanisms
Glenn Sheasby
Thesis submitted in partial fulfilment of the requirements of the award of
Doctor of Philosophy
Oxford Brookes University
in collaboration with Sony Computer Entertainment Europe
December 2012
Abstract
Over the past few years, video game developers have begun moving away from the tradi-
tional methods of user input through physical hardware interactions such as controllers or
joysticks, and towards acquiring input via an optical interface, such as an infrared depth
camera (e.g. the Microsoft Kinect) or a standard RGB camera (e.g. the PlayStation Eye).
Computer vision techniques form the backbone of both input devices, and in this thesis,
the latter method of input will be the main focus.
In this thesis, the problem of human understanding is considered, combining segment-
ation and pose estimation. While focussing on these tasks, we examine the stringent
challenges associated with the implementation of these techniques in games, noting par-
ticularly the speed required for any algorithm to be usable in computer games. We also
keep in mind the desire to retain information wherever possible: algorithms which put
segmentation and pose estimation into a pipeline, where the results of one task are used
to help solve the other, are prone to discarding potentially useful information at an early
stage, and by sharing information between the two problems and depth estimation, we
show that the results of each individual problem can be improved.
We adapt Wang and Koller's dual decomposition technique to take stereo information
into account, and tackle the problems of stereo, segmentation and human pose estimation
simultaneously. In order to evaluate this approach, we introduce a novel, large dataset
featuring nearly 9,000 frames of fully annotated humans in stereo.
Our approach is extended by the addition of a robust stereo prior for segmenta-
tion, which improves information sharing between the stereo correspondence and human
segmentation parts of the framework. This produces an improvement in segmentation
results. Finally, we increase the speed of our framework by a factor of 20, using a highly
efficient filter-based mean field inference approach. The results of this approach compare
favourably to the state of the art in segmentation and pose estimation, improving on the
best results in these tasks by 6.5% and 7% respectively.
Acknowledgements
Okay... now what?
(Mike Slackenerny, PhD comic #844)
It is finished. Although the PhD thesis is a beast that must be tamed in solitude, I
don't believe it's something that can be done entirely alone, and there are many people
to whom I owe a debt of gratitude.
My supervisor, Phil Torr, made it possible for me to get started in the first place,
and gave me immeasurable help along the way. While we're talking about how I came to
be doing a PhD, I should also thank my old boss, Andrew Stoddart, who recommended
that I apply, and the recession for costing me the software job I was doing after leaving
student life for the first time. I guess my escape velocity wasn't high enough, moving
only two miles from my first alma mater. I'm about 770 miles away now, so that should
be enough!
My colleagues at Brookes also helped immensely, from those who helped me settle
in: David Jarzebowski, Jon Rihan, Chris Russell, Ľubor Ladický, Karteek Alahari, Sam
Hare, Greg Rogez, and Paul Sturgess; to those who saw me off at the end of it: Paul
Sturgess, Sunando Sengupta, Michael Sapienza, Ziming Zhang, Kyle Zheng, and Ming-
Ming Cheng. Special thanks are due to Morten Lindegaard, who proof-read large chunks
of this thesis, and to my co-authors: Julien Valentin, Vibhav Vineet, Jonathan Warrell,
and my second supervisor, Nigel Crook.
Financial support from the EPSRC partnership with Sony is gratefully acknowledged,
and weekly meetings and regular feedback from Diarmid Campbell helped to guide and
focus my research. Furthermore, Amir Saffari and the rest of the crew at SCEE London
Studio provided a dataset, as well as feedback from a professional perspective.
I'd also like to thank my examiners, Teo de Campos, Mark Bishop, and David Duce,
for taking the time to read my thesis, and for providing useful feedback and engaging
discussion during the viva.
While struggling through my PhD years, I was kept sane in Oxford by a variety of
groups, including the prayer group at St. Mary Magdalene's, Brookes Ultimate Frisbee,
and of course, the Oxford University Bridge Club, where I spent many Monday even-
ings exercising my mind (and liver), and where I met my wonderful fiancée, the future
Dr. Mrs. Dr. Sheasby, Aleksandra: all of our successes are shared, but this one I owe
entirely to you. You believed in me even when I did not believe in myself, and for that
I will be grateful to you for the rest of my life.
Lastly, but most importantly, I'd like to thank my parents, for raising me, for sup-
porting me in all of my endeavours, and for teaching me to question everything.
Contents
List of Figures 7
List of Tables 9
List of Algorithms 11
1 Introduction 13
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Vision in Computer Games: A Brief History 19
2.1 Motion Sensors: Nintendo Wii . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 RGB Cameras: EyeToy and Playstation Eye . . . . . . . . . . . . . . . . 22
2.2.1 Early Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Antigrav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 Eye of Judgment . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.4 EyePet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.5 Wonderbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Depth Sensors: Microsoft Kinect . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3 Vision Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Overall Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 State of the Art in Selected Vision Algorithms 33
3.1 Inference on Graphs: Energy Minimisation . . . . . . . . . . . . . . . . . 34
3.1.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Submodular Terms . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.3 The st-Mincut Problem . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1.4 Application to Image Segmentation . . . . . . . . . . . . . . . . . 38
3.2 Inference on Trees: Belief Propagation . . . . . . . . . . . . . . . . . . . 40
3.2.1 Message Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Belief Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Human Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Pictorial Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.2 Flexible Mixtures of Parts . . . . . . . . . . . . . . . . . . . . . . 47
3.3.3 Unifying Segmentation and Pose Estimation . . . . . . . . . . . . 50
3.4 Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Humans in Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 3D Human Pose Estimation in a Stereo Pair of Images 57
4.1 Joint Inference via Dual Decomposition . . . . . . . . . . . . . . . . . . . 58
4.1.1 Introduction to Dual Decomposition . . . . . . . . . . . . . . . . 59
4.1.2 Related Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Humans in Two Views (H2view) Dataset . . . . . . . . . . . . . . . . . . 67
4.2.1 Evaluation Metrics Used . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Inference Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Segmentation Term . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Pose Estimation Term . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.3 Stereo Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.4 Joint Estimation of Pose and Segmentation . . . . . . . . . . . . . 81
4.3.5 Joint Estimation of Segmentation and Stereo . . . . . . . . . . . . 82
4.4 Dual Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Binarisation of Energy Functions . . . . . . . . . . . . . . . . . . 84
4.4.2 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Solving Sub-Problem L_1 . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.4 Solving Sub-Problem L_2 . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.5 Solving Sub-Problem L_3 . . . . . . . . . . . . . . . . . . . . . . . 89
4.5 Weight Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.1 Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.2 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 A Robust Stereo Prior for Human Segmentation 103
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.1.1 Range Move Formulation . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Flood Fill Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 Application: Human Segmentation . . . . . . . . . . . . . . . . . . . . . 112
5.3.1 Original Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3.2 Stereo Term f_D . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.3 Segmentation Terms f_S and f_SD . . . . . . . . . . . . . . . . . . . 114
5.3.4 Pose Estimation Terms f_P and f_PS . . . . . . . . . . . . . . . . . 115
5.3.5 Energy Minimisation . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.6 Modifications to D Vector . . . . . . . . . . . . . . . . . . . . . . 118
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.2 Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4.3 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6 An Efficient Mean Field Based Method for Joint Estimation of Human
Pose, Segmentation, and Depth 125
6.1 Mean Field Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.1.1 Introduction to Mean-Field Inference . . . . . . . . . . . . . . . . 128
6.1.2 Simple Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.1.3 Performance Comparison: Mean Field vs Graph Cuts . . . . . . . 131
6.2 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.1 Joint Energy Function . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3 Inference in the Joint Model . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.4.1 Segmentation Performance . . . . . . . . . . . . . . . . . . . . . . 137
6.4.2 Pose Estimation Performance . . . . . . . . . . . . . . . . . . . . 137
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7 Conclusions and Future Work 143
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . 144
Bibliography 147
List of Figures
2.1 Duck Hunt screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Wii Sensor Bar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Putting action from Wii Sports . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 EyeToy and Playstation Eye cameras. . . . . . . . . . . . . . . . . . . . . 22
2.5 EyeToy: Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Antigrav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Eye of Judgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.8 EyePet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9 Wonderbook design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.10 A Wonderbook scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.11 Kinect games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.12 Furniture removal guidelines in Kinect instruction manual . . . . . . . . 30
3.1 Image for our toy example. . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Segmentation results on the toy image . . . . . . . . . . . . . . . . . . . 40
3.3 Skeleton Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Part models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Yang-Ramanan skeleton model . . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Stereo example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Subgradient example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Dual functions versus the cost variable . . . . . . . . . . . . . . . . . . . 66
4.3 Values of the dual function g() . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Accuracy of Part Proposals . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 CRF Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Foreground weightings on a cluttered image from the Parse dataset . . . 76
4.7 Results using just f_S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.8 Part selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.9 Limb recovery due to J_1 term . . . . . . . . . . . . . . . . . . . . . . . . 83
4.10 Master-slave update process . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.11 Decision tree: parameter optimisation . . . . . . . . . . . . . . . . . . . . 94
4.12 Sample stereo and segmentation results . . . . . . . . . . . . . . . . . . . 97
4.13 Segmentation results on H2view . . . . . . . . . . . . . . . . . . . . . . . 98
4.14 Results from H2view dataset . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Flood fill example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Three successive range expansion iterations . . . . . . . . . . . . . . . . . 109
5.3 The new master-slave update process . . . . . . . . . . . . . . . . . . . . 117
5.4 Segmentation results on H2View . . . . . . . . . . . . . . . . . . . . . . . 119
5.5 Comparison of segmentation results on H2View . . . . . . . . . . . . . . 120
5.6 Failure cases of segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.1 Segmentation of the Tree image . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Basic 6-part skeleton model . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3 Segmentation results on H2view compared to other methods . . . . . . . 138
6.4 Further segmentation results . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.5 Qualitative results on H2View dataset . . . . . . . . . . . . . . . . . . . 141
List of Tables
4.1 Table of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Evaluation of f_S only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Evaluation of f_S combined with f_PS . . . . . . . . . . . . . . . . . . . . . 82
4.4 List of weights learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Evaluation of segmentation performance . . . . . . . . . . . . . . . . . . 96
4.6 Dual Decomposition results . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1 Segmentation results on the H2View dataset . . . . . . . . . . . . . . . . 121
5.2 Results (given in % PCP) on the H2view test sequence. . . . . . . . . . . 122
6.1 Evaluation of mean field on the MSRC-21 dataset . . . . . . . . . . . . . 131
6.2 Quantitative segmentation results on the H2View dataset . . . . . . . . . 137
6.3 Pose estimation results on the H2View dataset . . . . . . . . . . . . . . . 140
List of Algorithms
4.1 Parameter optimisation algorithm for dual decomposition framework. . . 93
5.1 Generic flood fill algorithm for an image I of size W × H. . . . . . . . . 110
5.2 doLinearFill: perform a linear fill from the seed point (s_x, s_y). . . . . . 111
6.1 Naïve mean field algorithm for fully connected CRFs . . . . . . . . . . . 130
Chapter 1
Introduction
Over the past several years, a wide range of commercial applications of computer vision
have begun to emerge, such as face detection in cameras, augmented reality (AR) in shop
displays, and the automatic construction of image panoramas. Another key application
of computer vision that has become popular recently is computer games, with commercial
products such as Sony's Playstation Eye and Microsoft's Kinect selling millions of units
[98].
In creating these products, video game developers have been able to partially expand
the demographic of players. They have done this by moving away from the traditional
controller pad method of user input, and enabling the player to control the game using
other objects, such as books or AR markers, and even their own bodies. Some of the
most popular games that are either partially or completely driven using human motion
include sports games such as Wii Sports and Kinect Sports, and party games such as
the EyeToy: Play series. More recent games, such as EyePet and Wonderbook: Book of
Spells, combine motion information with object detection. A more thorough description
of these games can be found in Chapter 2.
Three main computer vision techniques are used to obtain input instructions for these
games: motion detection, object detection, and human pose estimation. The first of these,
motion detection, involves detecting changes in image intensity across several frames; in
video games, motion detection is used in particular areas of the screen as the player
attempts to complete tasks. Secondly, object detection involves determining the presence,
position and orientation of particular objects in the frame. The object can be a simple
shape (e.g. a quadrilateral) or a complex articulated object, such as a cat. In certain
video games, the detection of AR markers is used to add computer graphics to an image
of the player's surroundings. Finally, the goal of human pose estimation is to determine
the position and orientation of each of a person's body parts. Using images obtained via
an infrared depth sensor, Kinect games can track human poses over several frames, in
order to detect actions [110].
Theoretically, an image contains a lot more information than a controller can sup-
ply. However, the player can only provide information via a relatively limited set of
actions, either with their own body, or using some kind of peripheral object which can
be recognised.
The main aim of this thesis is to explore and expand the applicability of human
pose estimation to video games. After an analysis of the techniques that have already
been used, and of the current state of these techniques in research, our main application
will be presented. Using a stereo pair of cameras, we will develop a system that unifies
human segmentation, pose estimation, and depth estimation, solving the three tasks
simultaneously. In order to evaluate this system, we will present a large dataset containing
stereo images of humans in indoor environments where video games might be played.
1.1 Contributions
In summary, the principal contributions of this thesis are as follows:
A system for the simultaneous segmentation and pose estimation of humans, as
well as depth estimation of the entire scene. This system is further developed by
the introduction of a stereo-based prior; the speed of the system is subsequently
improved by applying a state-of-the-art approximate inference technique.
The introduction of a novel 9,000-image dataset of humans in two views.
Throughout the thesis, the pronoun "we" is used instead of "I". This is done to follow
scientific convention; the contents of this thesis are the work of the author. Where others
have contributed towards the work, their collaborations will be attributed in a short
section at the end of each chapter.
1.2 Outline of the Thesis
Chapter 2 contains a description of some of the various attempts that games developers
have made to provide alternatives to controllers, and the impact that these games have
had on the video games community. Starting with the accelerometer and infrared detec-
tion based solutions provided by the Nintendo Wii, we observe the increasing amount
of integration of vision techniques, with this trend demonstrated by the methods used
by Sony's EyeToy and PlayStation Eye-based games over the past several years. Finally,
we consider the impact that depth information can have in enabling the software to
determine the pose of the players body, as shown by the Microsoft Kinect.
Following on from that, Chapter 3 contains an appraisal of related work in computer
vision that might be applied in computer games. We consider the different approaches
commonly used to solve the problems of segmentation and human pose estimation, and
give an overview of some of the approaches that have been used to provide 3D information
given a pair of images from a stereo camera.
Chapter 4 describes a novel framework for the simultaneous depth estimation of a
scene, and segmentation and pose estimation of the humans within that scene. Using a
stereo pair of images as input provides us with the ability to compute the distance of each
pixel from the camera; additionally, we can use standard approaches to find the pixels
occupied by the human, and predict its pose. In order to share information between
these three approaches, we employ a dual decomposition framework [62, 127]. Finally, to
evaluate the results obtained by our method, we introduce a new dataset, called Humans
in Two Views, which contains almost 9,000 stereo pairs of images of humans.
In Chapter 5, we extend this approach to improve the quality of information shared
between the segmentation and depth estimation parts of the algorithm. Observing that
the human occupies a continuous region of the camera's field of view, we infer that the
distance of human pixels from the camera will vary only in certain ways, without sharp
boundaries (we say that the depth is smooth). Therefore, starting from pixels that we are
very confident lie within the human, we can extract a reliable initial segmentation from
the depth map, significantly improving the overall segmentation results.
The drawback of the dual decomposition-based approach, however, is that it is much
too slow to be used in computer games. In Chapter 6, we adapt our framework in order
to apply an approximate, but very fast, inference approach based on mean field [64]. Our
use of this new inference approach enables us to improve the information sharing between
the three parts of the framework, providing an improvement in accuracy, as well as an
order-of-magnitude speed improvement.
While the mean-field inference approach is much quicker than the dual decomposition-
based approach, its speed (close to 1 fps) is still not fast enough for real-time applications
such as computer games. In Chapter 7, the thesis concludes with some suggestions for
how to further improve the speed, as well as some other promising possible directions for
future research. The concluding chapter also contains a summary of the work presented
and contributions made.
1.3 Publications
Several chapters of this thesis first appeared as conference publications, as follows:
G. Sheasby, J. Warrell, Y. Zhang, N. Crook, and P.H.S. Torr. Simultaneous hu-
man segmentation, depth and pose estimation via dual decomposition. In British
Machine Vision Conference, Student Workshop, 2012. (Chapter 4, [108])
G. Sheasby, J. Valentin, N. Crook, and P.H.S. Torr. A robust stereo prior for human
segmentation. In Asian Conference on Computer Vision (ACCV), 2012. (Chapter
5, [107])
The contributions of co-authors are acknowledged in the corresponding chapters. The first
paper [108] received the best student paper award at the BMVC workshop. Addition-
ally, some sections of Chapter 6 form part of a paper that is currently under submission
at a major computer vision conference.
Chapter 2
Vision in Computer Games: A Brief History
Figure 2.1: A screenshot [84] from Duck Hunt, an early example of a game that used
sensing technology.
The purpose of a game controller is to convey the user's intentions to the game. A wide
variety of input methods, for instance a mouse and keyboard, a handheld controller,
or a joystick, have been employed for this purpose. Video games using some sort of
sensing technology (instead of, or in addition to, those listed above) have been available
for several decades. In 1984, Nintendo released a light gun, which detects light emitted by
CRT monitors; this release was made popular by the game Duck Hunt for the Nintendo
Figure 2.2: The sensor bar, which emits infrared light that is detected by Wii remotes.
The picture [81] was taken with a camera sensitive to infrared light; the LEDs are not
visible to the human eye.
Entertainment System (NES), in which the player aimed the gun at ducks that appeared
on the screen (Figure 2.1). When the trigger is fired, the screen is turned black for one
frame, and then the target area is turned white in the next frame. If it is pointed at the
correct place, the gun detects this change in intensity, and registers a hit.
Over the past few years, technological developments have made it easier for video
game developers to incorporate sensing devices to augment, or in some cases replace, the
traditional controller pad method of user input. These devices include motion sensors,
RGB cameras, and depth sensors. The following sections give a brief summary of the
applications of each in turn.
2.1 Motion Sensors: Nintendo Wii
The Wii is a seventh-generation games console that was released by Nintendo in late 2006.
Unlike previous consoles, the unique selling point of the Wii was a new form of player
interaction, rather than greater power or graphics capability. This new form of interaction
was the Wii Remote, a wireless controller with motion sensing capabilities. The controller
contains an accelerometer, enabling it to sense acceleration in three dimensions, and an
infrared sensor, which is used to determine where the remote is pointing [49].
Unlike light guns, which sense light from CRT screens, the remote detects light from
the console's sensor bar, which features ten infrared LEDs (Figure 2.2). The light from
each end of the bar is detected by the remote's optical sensor as two bright lights. Trian-
gulation is used to determine the distance between the remote and the sensor bar, given
Figure 2.3: An example of the use of the Wii Remote's motion sensing capabilities to
control game input. Here, the player moves the remote as he would move a putter when
playing golf. The power of the putt is determined by the magnitude of the swing [74].
the observed distance between the two bright lights and the known distance between the
LED arrays.
The capability of the Wii to track position and motion enables the player to mimic
actual game actions, such as swinging a sword or tennis racket. This capability is demon-
strated by games such as Wii Sports, which was included with the games console in the
rst few years after its release. The remote can be used to mimic the action of bowling
a ball, or swung like a tennis racket, a baseball bat, or a golf club (Figure 2.3).
2.1.1 Impact
The player still uses a controller, although in some games, like Wii Sports, it is now
the position and movement of the remote that is used to influence events in-game. This
(a) EyeToy [114] (b) PlayStation Eye [32]
Figure 2.4: The two webcam peripherals released by Sony.
makes playing the games more tiring than before, especially if they are played with
vigour. However, a positive effect is that the control system is more intuitive, meaning
that people who don't normally play traditional video games might still be interested in
owning a Wii console [103].
2.2 RGB Cameras: EyeToy and Playstation Eye
While Nintendo's approach uses the position and motion of the controller to enhance
gameplay, other games developers have made use of the RGB images provided by cameras.
The first camera released as a games console peripheral and used as an input device for a
computer game was the EyeToy, which was released for the PlayStation 2 (PS2) in 2003.
This was followed in 2007 by the PlayStation Eye (PS Eye) for the PlayStation 3 (PS3).
Some of Sony's recent games have used the PlayStation Move (PS Move) in addition
to the PS Eye. The Move is a handheld plastic controller which has a large, bright ball
on the top; the hue of this ball can be altered by the software. During gameplay, the ball
is easily detectable by the software, and is used as a basis for determining the position,
orientation and motion of the PS Move [113].
The degree to which vision techniques have been applied to EyeToy games has varied
widely. Some games only use the camera to allow the user to see themselves, whereas
others require significant levels of image processing. The following sections contain de-
(a) Ghost Catcher [48] (b) Keep Up [117] (c) Kung Foo [42]
Figure 2.5: Screenshots of three mini-games from EyeToy: Play.
scriptions of some of the games that have used image processing to enhance gameplay.
2.2.1 Early Games
The EyeToy was originally released in a bundle with EyeToy: Play, which
features twelve mini-games. The gameplay is simplistic, as is common with party-oriented
video games. Many of them rely on motion detection; for instance, the object of Ghost
Catcher is to fill ghosts with air and then pop them, and this is done by repeatedly
waving your hands over them. Others, such as Keep Up, use human detection; the
player is required to keep a ball in the air. Therefore, the game needs to determine
whether there is a person in the area where the ball is.
A third use of vision in this game occurs in Kung Foo; in this mini-game, the
player stands in the middle of the camera's field of view, and is instructed to hit ninjas
that fly onto the screen from various directions. Again, motion detection can be used to
determine whether a hit has been registered, as it doesn't matter which body part was
used to perform the hit.
Impact
As the mini-games in EyeToy: Play only require simplistic image understanding tech-
niques, specifically the detection of motion within a small portion of the camera's field
of view, the underlying techniques seemed to work well. As with Wii Sports, the game
was aimed at casual gamers rather than traditional, or "hardcore", gamers.
(a) Antigrav screenshot (b) Close-up of user display
Figure 2.6: A screenshot [91] from Antigrav, where the player has extended their right
arm to grab an object (the first of three) and thus score some points. The user display
shows where the game has detected the player's hands to be.
2.2.2 Antigrav
Antigrav, a PS2 game that utilises the EyeToy, is a futuristic trick-based snowboarding
game, and was brought out by Harmonix in late 2004. The player takes control of a
character in the game, and guides them down a linear track. The game uses face tracking
to control the character's movements, enabling the player to increase the character's speed
by ducking, and change direction by leaning. In addition, the player's hands are tracked,
and their hand position is used to infer a pose, enabling the player to literally grab for
collectible objects on-screen. The player can see what the computer calculates their head
and hand positions to be in the form of a small diagram in the corner of the screen, as
shown in Figure 2.6. A GameSpot review [24] points out:
this is good for letting you know when the EyeToy is misreading your move-
ments, which takes place more often than it ought to.
The review, like other reviews of PS2 EyeToy releases, hints at further technological
limitations impairing the enjoyment of the game:
Harmonix pushes the limits of what you should expect from an EyeToy
entry... unfortunately, EyeToy pushes back, and its occasional inconsistency
hobbles an otherwise bold and enjoyable experience.
Impact
The reviews above imply that the head and hand detection techniques employed by
the game were not completely effective, meaning that users are often frustrated by their
actions not being recognised by the game due to failure of the tracking system. This high-
lights the importance of accuracy when developing vision algorithms for video games: if
your tracking algorithm fails around 5% of the time, then the 95% accuracy is, quantit-
atively, extremely good. However, during a 3-minute run down a track on Antigrav, this
could result in a failure of the tracking system that takes several seconds to recover from.
This would be clearly noticeable to gamers.
2.2.3 Eye of Judgment
Figure 2.7: An image [3] showing the set-up of Eye of Judgment. The camera is pointed
at a cloth, on which several cards are placed. These cards are recognised by the game,
and the on-screen display shows the objects or creatures that the cards represent.
In 2007, Sony released Eye of Judgment, a role-playing card-game simulation that can be
compared to the popular card game Magic: The Gathering. The PS3 game comes with
a cloth, and a set of cards with patterns on them that the computer can easily recognise
(Figure 2.7). It can recognise the orientation as well as the identity of the cards, enabling
them to have different functions when oriented differently. A review reported very
few hardware-related issues, principally because of the pattern-based card recognition
system [122].
Since then, the PS3 saw very little PS Eye-related development before the release of
EyePet in October 2009; in the two years between the releases of Eye of Judgment and
EyePet, the use of the PS Eye was generally limited to uploading personalised images for
game characters.
2.2.4 EyePet
(a) EyePet's AR marker (b) EyePet with trampoline
Figure 2.8: An example of augmented reality being used in EyePet [4].
EyePet features a virtual pet, which interacts with people and objects in the real world
using fairly crude motion sensing. For example, if the player rolls a ball towards the pet,
it will jump out of the way. Another major feature of the game is the use of augmented
reality: a card with a specic pattern is detected in the cameras eld of view, and a
magic toy (a virtual object that the pet can interact with, such as a trampoline, a
bubble-blowing monkey, or a tennis player) is shown on top of the card (see Figure 2.8).
Impact
Again, EyePet uses fairly simplistic vision techniques, with marker detection and a motion
buffer being used throughout the game. This prevents it from receiving the sort of
criticism that was associated with Antigrav.
Although it was generally well-received, even EyePet did not escape criticism for the
limitations of its technology, which "can't help but creak at times" according to a review
published in Eurogamer [129]. The review goes on to say that performance is robust
under strong natural light, but patchy under electric light in the evening.
This sort of comment shows how unforgiving video gamers, or at least video game
reviewers, can be: for a vision technique to be useful in a game, it needs to work
under a very wide variety of environments and lighting conditions.
2.2.5 Wonderbook
(a) (b)
Figure 2.9: The Wonderbook (a) [82] is used with the PlayStation Move controller. The
interior (b) [115] features AR markers, as well as markings on the border to identify the
edge of the book, and ones near the edge of the page, which help to identify the page
quickly.
Wonderbook: Book of Spells, released by Sony in November 2012, is the first in an up-
coming series of games that will use computer vision methods to enhance gameplay. The
games will be centred upon a book whose pages contain augmented reality markers and
other patterns (Figure 2.9). These are detected by various pattern recognition techniques,
in order to determine where the book is, and which pages are currently visible. Once
Figure 2.10: After the Wonderbook is detected, gameplay objects can be overlaid on-
screen. In this image [116], a 3D stage is superimposed onto the book.
this is known, augmented reality can be used to replace the image of the book with, for
example, a burning stage (Figure 2.10).
In Book of Spells, the book becomes a spell book, and through the gameplay, spells
from the Harry Potter series are introduced [53]. At various points in the game, the
player must interact with the book, for example to put out fires by patting the book.
Skin detection algorithms are used to ensure that the player's hands appear to occlude
the spellbook, rather than going through it.
The generality of the book enables it to be used in multiple different kinds of games.
BBC's Walking with Dinosaurs will be made into an interactive documentary, with the
player first excavating and completing dinosaur skeletons, and then feeding the dinosaurs
using the PS Move [55]. It remains to be seen how the final versions of these games will
be appraised by reviewers and customers, and thus whether the Wonderbook franchise
will have a significant impact on the video gaming market.
2.3 Depth Sensors: Microsoft Kinect
While RGB cameras can be useful in enhancing gameplay with vision techniques, the
extra information provided by depth cameras makes it signicantly easier to determine
the structure of a scene. This enables games developers to provide a new way of playing.
With Kinect, which was released in November 2010, Microsoft offer a system where you
(a) (b) (c)
Figure 2.11: A selection of different games available for Kinect. (a) Kinect Sports [78]: two
players compete against each other at football. (b) Dance Central [7]: players perform
dance moves, which are tracked and judged by the game. (c) Kinect Star Wars [99]:
players swing a lightsabre by making sweeping movements with their arm.
are the controller. Using an infrared depth sensor to track 3D movement, they generate
a detailed map of the scene, signicantly simplifying the task of, for example, tracking
the movement of a person.
2.3.1 Technical Details
The Kinect provides a 320 × 240 16-bit depth image, and a 640 × 480 32-bit RGB image,
both running at 30 frames per second (fps); the depth sensor has an active range of 1.2 to
3.5 metres [93]. The skeletal detection system, used for detecting the human body in each
frame, is based on random forest classifiers [110], and is capable of tracking a twenty-link
skeleton of up to two active players in real-time.¹
The software also provides an
object-specific segmentation of the people in the scene (i.e. different people are segmented
separately), and further enhances the player's experience by using person recognition to
provide greetings and content.

¹ In an articulated body, a link is defined as an inflexible part of the body. For example, if the
flexibility of fingers and thumbs is ignored, each arm could be treated as three links, with one link each
for hand, forearm and upper arm.
2.3.2 Games
As with the Nintendo Wii, the Kinect was launched along with a sports game, namely
Kinect Sports. The controls are intuitive: the player makes a kicking motion in order
Figure 2.12: The furniture removal guidelines in the Kinect instruction manual [79] advise
the player to move tables etc. that might block the camera's view, which may cause a
problem for some users.
to kick a football, or runs on the spot in the athletics mini-games. No controllers or
buttons are required, which makes the games very easy to adapt to, although some of
the movements need to be exaggerated in order for the game to recognise them [18].
Another intuitive game is Dance Central, which uses the Kinect's full body tracking
capabilities to compare the player's dance moves to those shown by an on-screen in-
structor. The object of the game is to imitate these moves in time with the music. This
can be compared to classic games like Dance Dance Revolution, with the difference that
the player's whole body is now used, enabling a greater variety of moves and adding an
element of realism [128].
Up until now, games developers have struggled to produce a game that uses the
Kinect's capabilities, yet still appeals to the serious gamer. One attempt was made
in the 2011 release Kinect Star Wars, in which the player uses their arms to control
a lightsaber, making sweeping or chopping motions to remove obstacles, and to defeat
enemies. However, this game was criticised due to the game's inability to keep up with
fast and frantic arm motions [126].
A common problem with the Kinect model of gaming is that it is necessary to stand
a reasonable distance away from the camera (2 to 3 metres is the recommended range),
which makes gaming very difficult in small rooms, especially as any furniture will need
to be moved away (Figure 2.12).
2.3.3 Vision Applications
Since its release, and the subsequent release of an open-source software development
kit [83], the Kinect has been used in a wide variety of non-gaming related work by
computer vision researchers. Oikonomidis et al. [87] developed a hand-tracking system
capable of running at 15 fps, while Izadi et al. [54] perform real-time 3D reconstruction of
indoor scenes by slowly moving the Kinect camera around the room. The Kinect has also
been shown to be a useful tool for easily collecting large amounts of training data [47].
However, due to IR interference, the depth sensor does not work in direct sunlight,
making it unsuitable for outdoor applications such as pedestrian detection [39].
2.3.4 Overall Impact
The Kinect has had a huge impact, selling 19 million units worldwide in its first
eighteen months. This has helped Microsoft improve sales of the Xbox 360 year-on-year,
despite the console now being in its seventh year. This is the reverse of the trend shown
by competing consoles [98]. The method of controlling games using the human body
rather than a controller is revolutionary, and the technology has also had a significant
effect on vision research, as mentioned in Section 2.3.3 above.
2.4 Discussion
To date, a number of vision methods that use RGB cameras have been introduced to
the video gaming community. However, these tend to be low-level (motion detection or
marker detection) rather than high-level: if the only information given is an RGB signal,
unconstrained object detection and human pose estimation are neither accurate nor fast
enough to be useful in video games.
The depth camera used in the Microsoft Kinect has provided a huge leap forward in
this area, although the cost of this peripheral (which had a recommended retail price of
£129.99 at release, around four times more than the PS Eye) means that an improvement
in the RGB-based techniques would be desirable. The next chapter contains an appraisal
of related work in computer vision that might be of interest to games developers, and
provides background for this thesis.
Chapter 3
State of the Art in Selected Vision
Algorithms
While we have seen in Chapter 2 that computer vision techniques are beginning to have a
profound effect on computer games, there are a number of research areas which could be
applied to further transform the gaming industry. Accurate object segmentation would
allow actual objects, or even people, to be taken directly from the player's surroundings
and put into the virtual environment of the game. Human motion tracking could be
used to allow the player to navigate a virtual world, for instance by steering a vehicle.
Finally, human pose estimation could be used to allow the player to control an avatar in
a platform or role-playing game. In this chapter, we will discuss the current state of the
art in energy minimisation, human pose estimation, segmentation, and stereo vision.
In order for computer vision techniques like localisation and pose estimation to be
suitable for use in computer games, the algorithm that applies the technique needs to
respond in real time as well as being accurate. A fast algorithm is necessary because
the results (e.g. pose estimates) need to be used in real-time so that they can affect
the game in-play; very high accuracy is a requirement because mistakes made by the
game will undoubtedly frustrate the user (see [24] and Section 2.2.2). The problem is
to find a suitable balance between these two requirements (a faster algorithm might
involve approximate solutions, and hence could be less accurate). This may involve
tweaking existing algorithms to produce significant speed increases without any loss in
accuracy, or developing novel and significantly more accurate algorithms that still have
speed comparable to current state-of-the-art algorithms.
3.1 Inference on Graphs: Energy Minimisation
Many of the most popular problems in computer vision can be framed as energy minim-
isation problems. This requires the definition of a function, known as an energy function,
which expresses the suitability of a particular solution to the problem. Solutions that are
more probable should give the energy function a lower value; hence, we wish to find the
solution that gives the lowest value.
3.1.1 Conditional Random Fields
Suppose we have a finite set V of random variables, to which we wish to assign labels from
a label set L. If all the variables are independent, then this problem is easily solvable:
just find the best label for each variable. However, in general we have relationships
between variables. Let E be the set of pairs of variables {v_1, v_2} ⊆ V which are related to
one another.
We can then construct a graph G = (V, E) which specifies both the set of variables,
and the relationships between those variables. G is a directed graph if the pairs in E are
ordered; this enables us to construct graphs where, for some v_1, v_2, we have (v_1, v_2) ∈ E
but (v_2, v_1) ∉ E.
Given some observed data X, we can assign a set {y_i : v_i ∈ V} of values to the
variables in V. Let f denote a function that assigns a label f(v_i) = y_i to each v_i ∈ V.
Now, suppose that we also have a probability function p that gives us the probability of
a particular labelling {f(v_i) : v_i ∈ V} given observed data X. Then:

Definition 3.1
(X, V) is a conditional random field if, when conditioned on X, the variables
V obey the Markov property with respect to G:

    p(f(v_i) = y_i | X, {f(v_j) : j ≠ i}) = p(f(v_i) = y_i | X, {f(v_j) : (v_i, v_j) ∈ E}).   (3.1)

In other words, each output variable y_i only depends on its neighbours [72].
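For concreteness, the conditional distribution of such a pairwise CRF is usually written in the Gibbs form below; this is the generic textbook notation rather than the specific models defined later in the thesis:

    p(y \mid X) = \frac{1}{Z(X)} \exp\big(-E(y, X)\big),
    E(y, X) = \sum_{v_i \in V} \phi_i(y_i; X) + \sum_{(v_i, v_j) \in E} \psi_{ij}(y_i, y_j; X),

where Z(X) is the partition function, \phi_i are unary potentials and \psi_{ij} are pairwise potentials. Finding the most probable labelling under p(y | X) is then equivalent to minimising the energy E(y, X), which is exactly the energy-minimisation view introduced at the start of this section.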
3.1.2 Submodular Terms
Now we consider set functions, which are functions whose input is a set. For example,
suppose we have a set Y of possible variable values, and a set V, with size N = |V|,
of variables v_i which each take a value y_i ∈ Y. A function f which takes as input an
assignment of these variables {y_i : v_i ∈ V} is a set function.
Energy functions are set functions f : Y^N → ℝ⁺ ∪ {0}, which take as input the variable
values {y_i : v_i ∈ V}, and output some non-negative real number. If the variable values
are binary, then this f is a binary set function f : 2^N → ℝ⁺ ∪ {0}.

Definition 3.2
A binary set function f : 2^N → ℝ⁺ ∪ {0} is submodular if and only if for every
ordered set S, T ⊆ V we have that:

    f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T).   (3.2)
For example, if N = 2, S = [1, 0] and T = [0, 1], a submodular function will satisfy the
following inequality [104]:

    f([1, 0]) + f([0, 1]) ≥ f([1, 1]) + f([0, 0]).   (3.3)
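As a quick numerical illustration of condition (3.3), the short Python sketch below (with made-up cost values) checks the inequality for a contrast-weighted, Potts-style pairwise cost of the kind used in the segmentation example of Section 3.1.4, and for a cost that rewards label disagreement, which violates it:

    import math

    def is_submodular(theta):
        # Condition (3.3) for a 2x2 table of pairwise costs theta[z1][z2].
        return theta[1][0] + theta[0][1] >= theta[1][1] + theta[0][0]

    # Potts-style cost: zero for equal labels, a positive contrast-dependent
    # value otherwise (two illustrative pixel intensities, 120 and 95).
    w = math.exp(-abs(120 - 95) / 10.0)
    print(is_submodular([[0.0, w], [w, 0.0]]))      # True: amenable to graph cuts
    # A cost that instead rewards disagreement violates the condition.
    print(is_submodular([[1.0, 0.0], [0.0, 1.0]]))  # False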
From Schrijver [104], we also have the following proposition:
Proposition 3.1 The sum of submodular functions is submodular.
Proof It is sufficient to prove that, given two submodular functions f : A → ℝ⁺ ∪ {0}
and g : B → ℝ⁺ ∪ {0}, h = f + g : A ∪ B → ℝ⁺ ∪ {0} is submodular.

    h(S) + h(T) = (f + g)(S) + (f + g)(T)
                = f(S|_A) + g(S|_B) + f(T|_A) + g(T|_B)
                = (f(S|_A) + f(T|_A)) + (g(S|_B) + g(T|_B))
                ≥ (f((S ∪ T)|_A) + f((S ∩ T)|_A)) + (g((S ∪ T)|_B) + g((S ∩ T)|_B))
                = f((S ∪ T)|_A) + g((S ∪ T)|_B) + f((S ∩ T)|_A) + g((S ∩ T)|_B)
                = h(S ∪ T) + h(S ∩ T).
As shown by Kolmogorov and Zabih [61], one way of minimising energy functions, par-
ticularly submodular energy functions, is via graph cuts, which we will now introduce.
3.1.3 The st-Mincut Problem
In this section, we will consider directed graphs G = (V, E) that have special nodes
s, t ∈ V such that for all v_i ∈ V \ {s, t}, we have (s, v_i) ∈ E, (v_i, t) ∈ E, (v_i, s) ∉ E, and
(t, v_i) ∉ E. We say that s is the source node and t is the sink node of the graph. Such a
graph is also known as a flow network. Let c be a function c : E → ℝ⁺ ∪ {0}, where for
each (v_1, v_2) ∈ E, c(v_1, v_2) represents the capacity, or maximum amount of flow, of the
edge.
Max Flow
Definition 3.3
A flow function is a function f : E → ℝ⁺ ∪ {0} which satisfies the following
constraints:

    1. f(v_1, v_2) ≤ c(v_1, v_2)   ∀(v_1, v_2) ∈ E;

    2. Σ_{v_1 : (v_1, v) ∈ E} f(v_1, v) = Σ_{v_2 : (v, v_2) ∈ E} f(v, v_2)   ∀v ∈ V \ {s, t}.
The definition given above gives us two guarantees: first, that the flow passing along a
particular edge does not exceed that edge's capacity; and second, that the flow entering
a vertex (other than the source or sink) is equal to the flow leaving that vertex. From this
second constraint, we can derive the following:

Definition 3.4
The flow of a flow function is the total amount passing from the source to the
sink, and is equal to Σ_{(s,v) ∈ E} f(s, v).

The objective of the max flow problem is to maximise the flow of a network, i.e. to find
a flow function f with the highest flow.
Min Cut
Definition 3.5
An s-t cut C = (S, T) is a partition of the variables v ∈ V into two disjoint
sets S and T, with s ∈ S and t ∈ T.
Let E* be the set of edges that connect a variable v_1 ∈ S to a variable v_2 ∈ T. Formally:

    E* = {(v_1, v_2) ∈ E : v_1 ∈ S, v_2 ∈ T}.   (3.4)

Note that there are at least |V| − 2 edges in E*, as if v ∈ S \ {s}, then (v, t) ∈ E*, and
if v ∈ T \ {t}, then (s, v) ∈ E*. Depending on the connectivity of G, there may be up to
(|S| − 1) × (|T| − 1) additional edges.
Definition 3.6
The capacity of an s-t cut is the sum of the capacities of the edges connecting
S to T, and is equal to Σ_{(v_1, v_2) ∈ E*} c(v_1, v_2).
The objective of the min cut problem is to find an s-t cut which has minimal capacity
(there may be more than one solution).
In 1956, it was shown independently by Ford and Fulkerson [41] and by Elias et al. [30]
that the two problems above are equivalent. Therefore, to find a flow function that has
maximal flow, one needs only to find an s-t cut with minimal capacity. Algorithms that
seek to obtain such an s-t cut are known as graph cut algorithms. Submodular functions
can be efficiently minimised via graph cuts [15, 61]; C++ code is available that performs
this minimisation using an augmenting-path algorithm [14, 58, 61]. This code is often used
as a basis for image segmentation algorithms, for example [9, 16, 71, 100, 101, 130].
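As a small, self-contained illustration of this equivalence, the Python sketch below uses the networkx library (rather than the C++ code referenced above) on a toy four-vertex flow network invented for the example, and confirms that the maximum flow value equals the capacity of the minimum s-t cut:

    import networkx as nx

    # A small flow network: source s, sink t, and two internal vertices.
    G = nx.DiGraph()
    G.add_edge("s", "a", capacity=3.0)
    G.add_edge("s", "b", capacity=2.0)
    G.add_edge("a", "b", capacity=1.0)
    G.add_edge("a", "t", capacity=2.0)
    G.add_edge("b", "t", capacity=3.0)

    flow_value, _ = nx.maximum_flow(G, "s", "t")
    cut_value, (S, T) = nx.minimum_cut(G, "s", "t")

    print(flow_value, cut_value)   # equal, by the max-flow / min-cut theorem
    print(S, T)                    # the partition realising the minimum cut

On this particular network both values are 5.0; several distinct cuts achieve that capacity, and the partition returned is just one of them.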
3.1.4 Application to Image Segmentation
To illustrate the use of energy minimisation in image segmentation, consider the following
example. We have an image, shown in Figure 3.1, with just 9 pixels (3 × 3). To construct
a graph, we create a set of vertices V = {s, t, v_1, v_2, . . . , v_9}, and a set of edges E, with
(s, v_i) and (v_i, t) ∈ E for i = 1 to 9, and (v_i, v_j) ∈ E if v_i and v_j are adjacent in the image,
as shown in Figure 3.1. The vertices v_i have pixel values p_i between 0 and 255 inclusive
(i.e. the image is 8-bit greyscale, with 0 corresponding to black, and 255 to white). Our
objective is to separate the pixels into foreground and background sets, i.e. to define a
labelling z = {z_1, z_2, . . . , z_9}, where z_i = 1 if and only if v_i is assigned to the foreground
set.

Figure 3.1: Image for our toy example.
We wish to separate the light pixels in the image from the dark ones, with the light
pixels in the foreground, so we create foreground and background penalties φ_F and φ_B
respectively for pixels v_i as follows:

    φ_F(v_i) = 255 − p_i;   (3.5)
    φ_B(v_i) = p_i.   (3.6)

These are known as unary pixel costs. The total unary cost of a labelling z is:

    Φ(z) = Σ_{i=1}^{9} ( z_i φ_F(v_i) + (1 − z_i) φ_B(v_i) ).   (3.7)
We also want the boundary of the foreground set to align with edges in the image. There-
fore, we wish to penalise cases where adjacent pixels have similar values, but different
labels. This is done by including a pairwise cost Ψ:

    Ψ(z) = Σ_{(v_i, v_j) ∈ E} 1(z_i ≠ z_j) exp(−|p_i − p_j|),   (3.8)

where 1 is the indicator function, which has a value of 1 if the statement within the
brackets is true, and zero otherwise. The overall energy function is:
    f(z) = Φ(z) + λ Ψ(z),   (3.9)

where λ is a weight parameter; higher values of λ will make it more likely that adjacent
pixels have similar labels.
The energy function in (3.9) is submodular, and can therefore be minimised efficiently
using the max flow code available at [58]. The segmentation results obtained for different
values of λ are shown in Figure 3.2. The ratio between the unary and pairwise weights
influences the segmentation result produced.

(a) λ = 0.1 (b) λ = 0.2 (c) λ = 1
Figure 3.2: Segmentation results for different values of λ. A higher value punishes seg-
mentations with large boundaries; a high enough value (as in (c)) will make the result
either all foreground or all background.
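To make the toy example concrete, the following Python sketch evaluates the energy (3.9) directly and minimises it by enumerating all 2^9 labellings (feasible only because the image is tiny; a real image would use the st-mincut construction instead). The 3 × 3 pixel values stand in for Figure 3.1, whose actual values are not reproduced here, the symbol names follow the equations above, and the weight λ = 0.2 is an arbitrary illustrative choice:

    import itertools
    import numpy as np

    # Hypothetical 3x3 greyscale image standing in for Figure 3.1.
    pixels = np.array([[200, 210,  30],
                       [190, 220,  40],
                       [ 20,  35,  25]], dtype=float).ravel()

    # 4-connected neighbour pairs over the 3x3 grid (pixel indices 0..8).
    edges = [(3 * r + c, 3 * r + c + 1) for r in range(3) for c in range(2)]
    edges += [(3 * r + c, 3 * (r + 1) + c) for r in range(2) for c in range(3)]

    def energy(z, lam):
        # f(z) = Phi(z) + lambda * Psi(z), equations (3.5)-(3.9).
        phi_f = 255.0 - pixels                                 # eq. (3.5)
        phi_b = pixels                                         # eq. (3.6)
        unary = np.sum(z * phi_f + (1 - z) * phi_b)            # eq. (3.7)
        pairwise = sum(np.exp(-abs(pixels[i] - pixels[j]))
                       for i, j in edges if z[i] != z[j])      # eq. (3.8)
        return unary + lam * pairwise                          # eq. (3.9)

    lam = 0.2                                                  # illustrative weight
    best = min((np.array(z) for z in itertools.product((0, 1), repeat=9)),
               key=lambda z: energy(z, lam))
    print(best.reshape(3, 3))   # 1 = foreground, 0 = background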
3.2 Inference on Trees: Belief Propagation
While vision problems such as segmentation require a large number of variables (one per
image pixel), others, such as pose estimation, only require a smaller number of variables,
and hence a smaller graph. An important type of graph that is useful for this problem is
the tree structure.
Definition 3.7
Let G = (V, E) be any graph. We define a path to be an ordered set of
vertices {v_1, . . . , v_n} ⊆ V (where n ≥ 2), such that (v_i, v_{i+1}) ∈ E for all
i ∈ {1, . . . , n − 1}.
Denition 3.8
A tree is an undirected graph T = (V, E) in which any two vertices are con-
nected by exactly one path.
In order to perform inference on this tree, we require a set of labels L_i that each vertex
v_i can take a value z_i from. Note that we allow each variable to have a different label set.
Let z be a labelling of T, i.e. an assignment of a label z_i ∈ L_i to each vertex v_i ∈ V.
As with inference in graphs, we may define an energy function f(z), and this energy
function may consist of unary and pairwise terms. Throughout this section, we shall
assume the existence of a unary potential φ(v_i = z_i), which provides the cost associated
with v_i taking a label z_i, and a pairwise potential ψ(v_i = z_i, v_j = z_j), providing the cost
of the variables v_i and v_j such that (v_i, v_j) ∈ E taking labels z_i and z_j.
As described in the remainder of this section, inference on trees may be performed
using belief propagation, a message-passing algorithm which was first proposed by Pearl
in 1982 [90]. Belief propagation proves to be computationally efficient, as it exploits the
structure of the tree.
3.2.1 Message Passing
A message passing method is an inference method where vertices are considered separ-
ately, and each vertex v_i receives information from connected vertices v_j (where (v_i, v_j)
∈ E) in the form of a message. Here, a message can be as simple as a scalar value, or a
matrix of values. This message is then combined with information relevant to the vertex
itself, to form a new message for the next vertex.
Messages are passed between vertices in a series of pre-dened updates. The number
of updates required to nd an overall solution depends on the complexity of the graph.
If the graph has a simple structure, such as a chain (where each vertex is connected to
at most two other vertices, and the graph is not a cycle), then only one set of updates is
required to nd the optimal set of values. This set can be found by an algorithm such as
the Viterbi algorithm [125].
3.2.2 Belief Propagation
Belief propagation can be viewed as a variation of the Viterbi algorithm that is applicable to trees. To use this process to perform inference on a tree T, we must choose a vertex of the tree to be the root vertex, denoted v_0.

Since T is a tree, v_0 is connected to each of the other vertices by exactly one path. We can therefore re-order the vertices such that, for any vertex v_i, the path from v_0 to v_i proceeds via vertices with indices in ascending order (there will typically be multiple ways to do this). Once we have done this, we can introduce the notions of parent-child relations between vertices, defined here for clarity.

Definition 3.9
We say that a vertex v_i is the parent of v_j if (v_i, v_j) ∈ E and i < j. If this is the case, we say that v_j is a child of v_i.

Note that the root node has no parents, and each other vertex has exactly one parent, since if a vertex v_j had two parents, then there would be more than one path from v_0 to v_j, which contradicts the definition of a tree. However, a vertex may have multiple children, or none at all.
Definition 3.10
A vertex v_i with no children is known as a leaf vertex.
We now describe the general form of belief propagation on our tree T. The vertices in V are considered in two passes: a down pass, where the vertices are processed in descending order of index, so that each vertex is processed after its children, but before its parent, and an up pass, where the order is reversed.

For each leaf vertex v_i, we have a set L_i = {l_i^1, l_i^2, . . . , l_i^{K_i}} of possible labels, where K_i is the number of labels in the set L_i. Then the score associated with assigning a particular label l_i^p to vertex v_i is:

score_i(l_i^p) = φ(v_i = l_i^p).  (3.10)

This score is the message that is passed to the parent of v_i.
For a vertex v_j with at least one child, we need to combine these messages with the unary and pairwise energies, in order to produce a message for the parent of v_j. Again, we have a finite set L_j = {l_j^1, l_j^2, . . . , l_j^{K_j}} of possible labels for v_j. The score associated with assigning a particular label l_j^q is:

score_j(l_j^q) = φ(v_j = l_j^q) + Σ_{i > j : (v_i, v_j) ∈ E} m_i(l_j^q),  (3.11)

where:

m_i(l_j^q) = max_{l_i^p} [ ψ(v_i = l_i^p, v_j = l_j^q) + score_i(l_i^p) ].  (3.12)
When the root vertex is reached, the optimal label for v_0 can be found by maximising score_0, defined in (3.11). Finally, the globally optimal configuration can be found by keeping track of the arg max indices, and then tracing back through the tree on the up pass to collect them. The up pass can be avoided if the arg max indices are recorded along with the messages during the down pass.
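As an illustration, the following sketch runs this max-sum recursion on a small hand-built tree with randomly generated unary and pairwise scores (all values and the tree itself are invented for the example); vertices are assumed to be indexed so that every child has a higher index than its parent.

```python
import numpy as np

# Toy tree: vertex 0 is the root; parent[i] gives the parent of vertex i.
parent = {1: 0, 2: 0, 3: 1}
n_labels = {0: 3, 1: 2, 2: 4, 3: 3}

rng = np.random.default_rng(0)
unary = {i: rng.normal(size=n_labels[i]) for i in n_labels}          # phi(v_i = l)
pairwise = {i: rng.normal(size=(n_labels[i], n_labels[parent[i]]))   # psi(v_i, v_parent)
            for i in parent}

score = {}
backptr = {}
# Down pass: process vertices in descending order, so children before parents.
for i in sorted(n_labels, reverse=True):
    s = unary[i].copy()
    for c in [c for c, p in parent.items() if p == i]:
        # Message from child c: max over child labels of pairwise + child score.
        table = pairwise[c] + score[c][:, None]
        s += table.max(axis=0)
        backptr[c] = table.argmax(axis=0)      # record arg max for the trace back
    score[i] = s

# Trace back from the root to recover the globally optimal labelling.
labels = {0: int(score[0].argmax())}
for i in sorted(parent):
    labels[i] = int(backptr[i][labels[parent[i]]])
print(labels)
```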
One of the vision problems that is both interesting for computer games and suitable
for the application of belief propagation is human pose estimation, and it is this problem
that is described in the next section.
3.3 Human Pose Estimation
In layman's terms, the problem of human pose estimation can be stated as follows: given an image containing a person, the objective is to correctly classify the person's pose. This pose can either take the form of a selection from a constrained list, or a free estimate of the locations of a person's limbs and the angles of their joints (for example, the location of their left arm, and the angle at which their elbow is bent). This is often formalised by defining a skeleton model, which is to be fitted to the image. It is quite common to describe the human body as an articulated object, i.e. one formed of a connected set of rigid parts. Such a formalisation gives rise to a family of parts-based models known as pictorial structure models.

These models typically consist of six parts (if the objective is restricted to upper body pose estimation) or ten parts (full body) [10, 35–37]. The upper body model consists of head, torso, and upper and lower arms; to extend this to the full body, upper and lower leg parts are added. Having divided the human body into parts, one can then learn a separate detector for each part, taking advantage of the fact that the parts have both a simpler shape (not being articulated), and a simpler colour distribution.

Indeed, pose estimation can be formulated as an energy minimisation problem. In contrast to segmentation problems, which require a different variable for each pixel, the number of variables required is equal to the number of parts in the skeleton model. However, the number of possible values that the variables can take is large (a part can, in theory, occupy any position in the image, with any orientation and extent).
Figure 3.3: A depiction of the ten-part skeleton model used by Felzenszwalb and Huttenlocher [35].
3.3.1 Pictorial Structures
A pictorial structure model can be expressed as a graph G = (V, E), with the vertices V = {v_1, v_2, . . . , v_n} corresponding to the parts, and the edges E specifying which pairs of parts (v_i, v_j) are connected. A typical graph is shown in Figure 3.3.

In Felzenszwalb and Huttenlocher's pictorial structure model [35], a particular labelling of the graph is given by a configuration L = {l_1, l_2, . . . , l_n}, where each l_i specifies the location (x_i, y_i) of part v_i, together with its orientation, and degree of foreshortening (i.e. the degree to which the limb appears to be shorter than it actually is, due to its angle relative to the camera). The energy of this labelling is then given by:

E(L) = Σ_{i=1}^{n} φ(l_i) + Σ_{(v_i, v_j) ∈ E} ψ(l_i, l_j),  (3.13)
where, as in the previous section, φ represents the unary energy on a part configuration, and ψ the pairwise energy. These energies relate to the likelihood of a configuration, given the image data, and given prior knowledge of the parts; more realistic configurations will have lower energy. Despite the large number of possible configurations, a globally optimal configuration can be found efficiently. This can be done by using simple appearance models for each part, explained in the following section, and then applying belief propagation.
Appearance Model
For each part, appearance models can be learned from training data, and can be based on edges [95], colour-invariant features such as HOG [22, 37], or the position of the part within the image [38]. Another approach for video sequences is to apply background subtraction, and define a unary potential based on the number of foreground pixels around the object location [35].

Given an image, this appearance model can be evaluated over a dense grid [1]; to speed this process up, a feature pyramid can be defined, so that a number of promising locations are found from a coarse grid, and then higher resolution part filters produce more precise matching scores [34, 36]. In order to reduce the time taken by the inference process, it might be desirable to reduce the set of possible part locations. Two ways to do this are:
1. Thresholding, where part locations whose score is worse than some predefined value, or falls outside the top N values for some N, are discarded.
2. Non-maximal suppression, which involves the removal of part locations that are
similar, but inferior, to other part locations.
Optimisation
After applying these techniques, we now have a small set of possible locations {l_i^1, l_i^2, . . . , l_i^k} for each vertex v_i. For a leaf vertex v_i, the score of each location l_i^p is:

score_i(l_i^p) = φ(l_i^p).  (3.14)
Now, for a vertex v_j with at least one child, the score is defined in terms of the children:

score_j(l_j^p) = φ(l_j^p) + Σ_{v_i : (v_i, v_j) ∈ E} m_i(l_j^p),  (3.15)
where:

m_i(l_j^p) = max_{l_i^q} [ ψ(l_i^q, l_j^p) + score_i(l_i^q) ].  (3.16)

Finally, the top-scoring part configuration is found by finding the root location with the highest score, and then tracing back through the tree, keeping track of the arg max indices. Multiple detections can be generated by thresholding this score and using non-maximal suppression.
3.3.2 Flexible Mixtures of Parts
Yang and Ramanan [131] extend these approaches by introducing a flexible mixture of parts model, allowing for greater intra-limb variation.

Rather than using a classical articulated limb model such as that of Marr and Nishihara [75], they introduce a new representation: a mixture of non-orientable pictorial structures. Instead of having ten rigid parts, as the methods described in Section 3.3.1 do, their model has twenty-six rigid parts, which can be combined to form limbs and produce an estimate for the ten parts, as shown in Figure 3.4. Each part has a number T of possible types, learned from training data. Types may include orientations of a part (e.g. horizontal or vertical hand), and may also span semantic classes (e.g. open versus closed hand).
Model
Let us denote an image by I, the location of part i by p_i = (x, y), and its mixture component by t_i, with i ∈ {1, . . . , K}, p_i ∈ {1, . . . , L}, and t_i ∈ {1, . . . , T}, where K is the number of parts, L is the number of possible part locations, and T is the number of mixture components per part.
Figure 3.4: Evolution of part models, from the basic 10-part model to Yang and Ramanan's 26-part model. (a) The basic model features ten parts, representing limbs. (b) Instead, [131] uses the endpoints of limbs (orange), giving 14 parts. (c) The midpoints (white) of the eight limb parts are added, giving a total of 22 parts. (d) Four parts are added to model the torso, producing a final total of 26 parts.
Figure 3.5: Graph showing the relations between parts in Yang and Ramanan's part model.

Certain configurations of parts, such as those where adjacent parts have the same orientation, are more likely to occur than others. To model this, a compatibility function S is defined on mixture choices t = {t_1, . . . , t_K}:

S(t) = Σ_{i ∈ V} b_i^{t_i} + Σ_{(i,j) ∈ E} b_{(i,j)}^{(t_i, t_j)},  (3.17)

where V is the set of vertices and E is the set of edges in the K-vertex relational graph specifying which pairs of parts are adjacent, represented graphically in Figure 3.5.
This score is added to energy terms including unary potentials on parts, and pairwise potentials between parts:

S(I, p, t) = S(t) + Σ_{i ∈ V} w_i^{t_i} · φ(I, p_i) + Σ_{(i,j) ∈ E} w_{(i,j)}^{(t_i, t_j)} · ψ(p_i − p_j).  (3.18)

Here, the appearance model φ(I, p_i) computes the local score at p_i of placing a template w_i^{t_i} for part i, tuned for type t_i. In the final term, we have ψ(p_i − p_j) = [dx, dx², dy, dy²]^T, where dx and dy represent the relative location of part i with respect to j. The parameter w_{(i,j)}^{(t_i, t_j)} encodes the expected values for dx and dy, tailored for types t_i and t_j. So if i is the elbow and j is the forearm, with t_i and t_j specifying vertically-oriented parts (i.e. the arm is at the person's side), we would expect p_j to be below p_i in the image.
Inference
To perform inference on this model, Yang and Ramanan maximise S(I, p, t) over p and t. Since the graph G in Figure 3.5 is a tree, belief propagation (see Section 3.2.2) can again be used. The score of a particular leaf node p_i with mixture t_i is:

score_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} · φ(I, p_i),  (3.19)

and for all other nodes, we take into account the messages passed from the node's children:

score_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} · φ(I, p_i) + Σ_{c ∈ children(i)} m_c(t_i, p_i),  (3.20)

where:

m_c(t_i, p_i) = max_{t_c} [ b_{(i,c)}^{(t_i, t_c)} + max_{p_c} ( w_{(i,c)}^{(t_i, t_c)} · ψ(p_i − p_c) + score_c(t_c, p_c) ) ].  (3.21)

Once the messages reach the root part (i = 1), score_1(t_1, p_1) contains the score of the best skeleton model given the root part location p_1. Multiple detections can be generated by thresholding this score and using non-maximal suppression.
3.3.3 Unifying Segmentation and Pose Estimation
So far, a number of methods for solving either human segmentation or pose estimation
have been discussed. Some recent work has also been done that attempts to solve both
tasks together. In this section, we discuss PoseCut [16].
PoseCut
Bray et al. [16] tackle the segmentation problem by introducing a pose-specific Markov random field (MRF), which encourages the segmentation result to look human-like. This prior differs from image to image, as it depends on which pose the human is in. Given an image, they find the best pose prior Θ_opt by solving:

Θ_opt = arg min_Θ [ min_x Ψ_3(x, Θ) ],  (3.22)
where x specifies the segmentation result, and Ψ_3 is the Object Category Specific MRF from [65], which defines how well a pose prior fits a segmentation result x. It is defined as follows:

Ψ_3(x, Θ) = Σ_i [ φ(I|x_i) + φ(x_i|Θ) + Σ_j ( φ(I|x_i, x_j) + ψ(x_i, x_j) ) ],  (3.23)

where I is the observed (image) data, φ(I|x_i) is the unary segmentation energy, φ(x_i|Θ) is the cost of the segmentation given the pose prior (penalising pixels near to the shape being labelled background, and pixels far from the shape being labelled foreground), and the ψ term is a pairwise smoothness energy. Finally, φ(I|x_i, x_j) is a contrast-sensitive term, defined as:

φ(I|x_i, x_j) = γ(i, j) if x_i ≠ x_j, and 0 if x_i = x_j,  (3.24)

where γ(i, j) decreases as the difference in RGB values of pixels i and j increases; pixels with similar values will have a high value of γ(i, j), since we wish to encourage such pixels to have the same label.
Given a particular pose prior Θ, the optimal configuration x* = arg min_x Ψ_3(x, Θ) can be found using a single graph cut. The outer minimisation over Θ, which yields Θ_opt and hence the final solution arg min_x Ψ_3(x, Θ_opt), is performed using the Powell minimisation algorithm [94].
3.4 Stereo Vision
Stereo correspondence algorithms typically denote one image as the reference image and the other as the target image. A dense set of patches is extracted from the reference image, and for each of these patches, the best match is found in the target image. The displacement between the two patches is known as the disparity; the disparity for each pixel in the reference image is stored in a disparity map. It can easily be shown that the disparity of a pixel is inversely proportional to its distance from the camera, or its depth [105]. A typical disparity map is shown in Figure 3.6.

A plethora of stereo correspondence algorithms have been developed over the years. Scharstein and Szeliski [102] note that earlier methods can typically be divided into four stages: (i) matching cost computation; (ii) cost aggregation; (iii) disparity computation; and (iv) disparity refinement; later methods can be described in a similar fashion [20, 60, 76, 92, 97].

It is quite common to use the sum of absolute differences measure when finding the matching cost for each pixel. A patch with a height and width of 2n + 1 pixels, for some n ≥ 0, is extracted from the reference image. Then, for each disparity value d, a patch is extracted from the target image, and the pixelwise intensity values of the two patches are compared.
With L and R representing the reference (left) and target (right) images respectively, the cost of assigning disparity d to a pixel (x, y) in L is as follows:

ψ(x, y, d) = Σ_{Δx=−n}^{n} Σ_{Δy=−n}^{n} |L(x + Δx, y + Δy) − R(x + Δx − d, y + Δy)|.  (3.25)

Evaluating this cost over all pixels and disparities provides a cost volume, on which aggregation methods such as smoothing can be applied in order to reduce noise. Disparity values for each pixel can then be computed. The simplest method for doing this is just to find, for each pixel (x, y), the disparity value d which minimises ψ(x, y, d).
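As a sketch of this winner-take-all baseline, the cost volume of (3.25) can be built and minimised as follows; the images, block size and disparity range are arbitrary stand-ins for the example.

```python
import numpy as np

def sad_disparity(left, right, max_disp=16, n=2):
    """Winner-take-all disparity from a sum-of-absolute-differences cost volume."""
    h, w = left.shape
    win = 2 * n + 1
    cost = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        # Absolute difference between L(x, y) and R(x - d, y), valid for x >= d.
        diff = np.abs(left[:, d:] - right[:, :w - d])
        # Aggregate over a (2n+1)x(2n+1) window by summing shifted copies
        # (a box filter via cumulative sums would be faster in practice).
        padded = np.pad(diff, n, mode='edge')
        agg = np.zeros_like(diff)
        for dy in range(win):
            for dx in range(win):
                agg += padded[dy:dy + diff.shape[0], dx:dx + diff.shape[1]]
        cost[:, d:, d] = agg
    return cost.argmin(axis=2)  # disparity minimising the SAD cost at each pixel

# Illustrative use with random "images"; real rectified stereo pairs would be used.
rng = np.random.default_rng(0)
left = rng.random((60, 80))
right = np.roll(left, -5, axis=1)  # crude horizontal shift as a stand-in
print(sad_disparity(left, right).mean())
```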
However, such a method is likely to result in a high degree of noise. Additionally, pixels immediately outside a foreground object are often given disparities that are higher than the actual disparity value, producing what are known as halo artefacts in the result [20].

Figure 3.6: An example disparity map produced by a stereo correspondence algorithm, showing (a) the left image, (b) the right image and (c) the disparity map. Pixels that are closer to the camera appear brighter in colour. Note the dark gaps produced in areas with little or no texture.
To overcome these problems, many methods either perform a post-processing step to refine the output, or use more complex methods of computing the disparity map. For example, Criminisi et al. [20] use an approach based on dynamic programming to reduce halo artefacts. Rhemann et al. [97] smooth the cost volume with a weighted box filter, with weights chosen so that edges in the input image correspond with edges in the disparity image, while in a similar vein, Hirschmüller et al. [51] use patches of multiple sizes to correct errors at object boundaries.
3.4.1 Humans in Stereo
A number of methods have been developed that use stereo correspondence as a pre-processing step for a wider task, for example foreground-background segmentation. Kolmogorov et al. [60] achieve real-time segmentation of stereo video sequences. Stereo correspondence is used together with colour and contrast information to provide an improved result.

Often in these works, the foreground object of interest is a human. This is the case in [20], where the final objective is to improve video conferencing by using cameras on either side of the monitor to generate a view from the centre of the monitor, therefore increasing eye contact. More ambitiously, Pellegrini and Iocchi [92] use stereo as a basis for posture classification, although they omit arms from their body model and only choose from a limited set of postures.
3.5 Discussion
In this chapter, several existing techniques in computer vision have been introduced; we have considered general problems such as energy minimisation on graphs and belief propagation on trees, and shown how they have been applied to segmentation and human pose estimation. These tasks are of particular interest to video game developers, especially after the advent of the Kinect (see Section 2.3) gave the general public a showcase of the possibility of controlling video games with the human body. To this end, in the next chapter, we will begin exploring the possibility of providing a human scene understanding framework, combining pose estimation with segmentation and depth estimation.
Chapter 4
3D Human Pose Estimation in a Stereo Pair
of Images
The problem of human pose estimation has been widely studied in the computer vision
literature; a survey of recent work is provided in Section 3.3. Despite the large body of
research focussing on 2D human pose estimation, relatively little work has been done to
estimate pose in 3D, and in particular, annotated datasets featuring frontoparallel stereo
views of humans are non-existent.
In recent years, some research has focussed on combining segmentation and pose
estimation to produce a richer understanding of a scene [10, 16, 66, 89]. Many of these
approaches simply put the algorithms into a pipeline, where the result of one algorithm
is used to drive the other [10, 16, 89]. The problem with this is that it often proves
impossible to recover from errors made in the early stages of the process. Therefore,
a joint inference framework, as proposed by Wang and Koller [127] for 2D human pose
estimation, is desired.
This chapter describes a new algorithm for estimating human pose in 3D, while simultaneously solving the problems of stereo matching and human segmentation. The algorithm uses an optimisation method known as dual decomposition, of which we give an overview in Section 4.1.

Following that, a new dataset for two-view human segmentation and pose estimation, H2view, is presented in Section 4.2, and our inference framework is described in Section 4.3. Our application of dual decomposition is discussed in Section 4.4, while our learning approach can be found in Section 4.5, and we evaluate our algorithm in Section 4.6. Finally, we discuss the outcomes of this chapter in Section 4.7.
The contributions of this chapter can be summarised as follows: we present a novel
dual decomposition framework to combine stereo algorithms with pose estimation and
segmentation. Our system is fully automated, and is applicable to more general tasks
involving object segmentation, stereo and pose estimation. Drawing these together, we
demonstrate a proof of concept that the achievements of the Kinect, covered in Section
2.3, can be matched using a stereo pair of images, instead of using infrared depth data.
We also provide an extensive new dataset of humans in stereo, featuring nearly 9,000
annotated images.
An earlier version of part of this chapter previously appeared as [108]. Graph cuts are performed using the max-flow code of Boykov and Kolmogorov [14, 58].
4.1 Joint Inference via Dual Decomposition
A decomposition method is an optimisation method that minimises a complex function by splitting it into easier subfunctions. One easy class of functions is the class of convex functions, of which we give a definition below:

Definition 4.1
A function f : R^N → R is convex if it satisfies the following inequality for all x, y ∈ R^N and all α, β ∈ R with α ≥ 0, β ≥ 0, and α + β = 1:

f(αx + βy) ≤ α f(x) + β f(y).
Equivalently, the line segment between (x, f(x)) and (y, f(y)) never lies below the graph of f [13]. An important property of convex functions that makes them easier to optimise is that any local minimum is also a global minimum. The counterpart of a convex function is a concave function, which is simply a function f : R^N → R such that −f is convex; any local maximum of a concave function is also a global maximum.
4.1.1 Introduction to Dual Decomposition
Let x be a vector in R^N, which can be split into three subvectors, x_1, x_2, and y. Consider the following optimisation problem:

minimise_{x ∈ R^N}  f(x) = f_1(x_1, y) + f_2(x_2, y).  (4.1)

If the shared (or complicating) variable y were fixed, we could use a divide-and-conquer type approach, and solve f_1 and f_2 separately. Suppose that for fixed y, we have solutions φ_1 and φ_2 as follows:

φ_1(y) = inf_{x_1} f_1(x_1, y);  (4.2)
φ_2(y) = inf_{x_2} f_2(x_2, y).  (4.3)

Now, the original problem can be restated as:

minimise_y  φ_1(y) + φ_2(y),  (4.4)

which is convex provided the original function f is convex. This problem is referred to as the master problem.
which is convex provided the original function f is convex. This problem is referred to
as the master problem.
A decomposition method is a method that solves the master problem in order to solve
the problem in (4.1); it is so called because the problem is decomposed into smaller
subproblems. If the original problem is decomposed, then the method is known as primal
decomposition, since the shared variable is being updated directly. The master problem
(4.4) can be solved via an iterative process; one such method, which is described next, is
the subgradient method.
Figure 4.1: Example showing a subgradient (red) of the function f(x) = x + 0.5 |x − 2| (blue) at x = 2.
Subgradient Method
Definition 4.2
A subgradient of a function f at x is any vector g such that for all y, the inequality f(y) ≥ f(x) + g^T (y − x) is satisfied.

An example of a subgradient is shown in Figure 4.1: the function g(x) = x is a subgradient of f(x) = x + 0.5 |x − 2| at the point x = 2.
Definition 4.3
A sequence {α_k}_{k=0}^∞ is summable if:

Σ_{k=0}^∞ α_k < ∞.

The sequence is square summable if:

Σ_{k=0}^∞ α_k² < ∞.

Non-differentiable convex functions can be minimised using a method known as the subgradient method. This can be thought of as similar to Newton's method for differentiable functions; one difference that warrants a mention is the speed: the subgradient method is generally far slower than Newton's method. The subgradient method, originally developed by Shor [109], uses predetermined step sizes that can be set by a variety of methods, discussed in more detail in [11]. In this chapter, a square summable but not summable step size is used: α_k = a/(b + k), with parameters a, b > 0.
To minimise a given convex function f : R^n → R, the subgradient method uses the following iteration:

x_{k+1} = x_k − α_k g_k,  (4.5)

where x_k ∈ R^n is the k-th iterate, α_k is the step size, and g_k is any subgradient of f at x_k. By keeping track of the best point x_i found so far, this method can be used to find a decreasing sequence f_k^best of objective values. It was shown in [11] that if the step size α_k is square summable, but not summable, then the sequence f_k^best of objective values converges to the optimal value f*.
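As a small illustration, the iteration (4.5) with a square summable but not summable step size might look as follows; the test function and the parameters a = b = 1 are arbitrary choices for the example.

```python
import numpy as np

def f(x):
    # Arbitrary non-differentiable convex test function.
    return abs(x[0] - 3.0) + 2.0 * abs(x[1] + 1.0)

def subgradient(x):
    # A valid subgradient of f at x (taking sign(0) = 0 is an acceptable choice).
    return np.array([np.sign(x[0] - 3.0), 2.0 * np.sign(x[1] + 1.0)])

x = np.zeros(2)
best_x, best_f = x.copy(), f(x)
a, b = 1.0, 1.0                      # step-size parameters: alpha_k = a / (b + k)
for k in range(2000):
    alpha = a / (b + k)              # square summable but not summable
    x = x - alpha * subgradient(x)   # iteration (4.5)
    if f(x) < best_f:                # keep track of the best point found so far
        best_x, best_f = x.copy(), f(x)
print(best_x, best_f)                # approaches the minimiser (3, -1)
```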
Primal Decomposition
To apply the subgradient method to primal decomposition, we apply an iterative process, whereby the complicating variable y is fixed, so that we can solve the subproblems f_1 and f_2 as in (4.2) and (4.3). At iteration k, we find the x_1 and x_2 that minimise f_1 and f_2 respectively, given y_k. The next step is to find a subgradient g_1 of φ_1 at y_k (which can be obtained from f_1 evaluated at the minimising x_1), and analogously a subgradient g_2 of φ_2. Once these have been found, we can update the complicating variable by the following rule:

y_{k+1} = y_k − α_k (g_1 + g_2).  (4.6)

This process can be repeated until convergence of the y_k to some fixed value y*. It is quite common to instead decompose the problem's Lagrangian dual (dual decomposition); this is the method that is followed in this chapter.
Lagrangian Duality
An alternative approach to solving the problem in (4.1) is to introduce a copy of the shared variable y, and add an equality constraint:

minimise_{x ∈ R^N}  f(x) = f_1(x_1, y_1) + f_2(x_2, y_2)  (4.7)
subject to: y_1 = y_2.

Note that the minimisation problem is now separable, although we must take the equality constraint into account. We do this by forming the Lagrangian dual problem.

In Lagrangian duality, the constraints in (4.7) are accounted for by augmenting the objective functions f_1 and f_2 by constraint functions [13]. Constraint functions have the property that they are equal to zero when the constraint is satisfied, and non-zero otherwise. The Lagrangian L associated with the problem in (4.7) is as follows:

L(x_1, y_1, x_2, y_2, λ) = f_1(x_1, y_1) + f_2(x_2, y_2) + λ(y_1 − y_2),  (4.8)

where λ is a cost vector of the same dimensionality as y, and is known as the Lagrange multiplier associated with the constraint y_1 = y_2. This problem is separable, and for a given λ, we can find solutions for x_1, y_1, x_2 and y_2 in parallel.
Optimisation via Dual Decomposition
The Lagrangian (4.8) can be separated into two sub-problems:

L_1(x_1, y_1, λ) = f_1(x_1, y_1) + λ y_1;  (4.9)
L_2(x_2, y_2, λ) = f_2(x_2, y_2) − λ y_2.  (4.10)

The dual problem is then as follows:

maximise_λ  g(λ) = g_1(λ) + g_2(λ),  (4.11)

where:

g_1(λ) = inf_{x_1, y_1} ( f_1(x_1, y_1) + λ y_1 );  (4.12)
g_2(λ) = inf_{x_2, y_2} ( f_2(x_2, y_2) − λ y_2 ).  (4.13)
While the master problem in (4.11) is not typically differentiable, it is known to be concave, and therefore we can solve it using an iterative method.

To apply the subgradient method to our problem (4.11), suppose that after the k-th iteration, we have the current value λ_k for λ. We denote by (x̂_1(λ_k), ŷ_1(λ_k)) the values of x_1 and y_1 that give us the minimum value for g_1(λ_k); the solution for g_2(λ_k) is denoted analogously. Then:

Proposition 4.1  A subgradient with respect to λ of the master problem (4.11) at λ_k is given by:

∇g(λ_k) = ŷ_1(λ_k) − ŷ_2(λ_k).  (4.14)
Proof  For any λ, let x̂_1(λ) and ŷ_1(λ) be the values of x_1 and y_1 that give the minimum value for g_1(λ), and define x̂_2(λ) and ŷ_2(λ) analogously. Then we have:

g(λ_k) = f_1(x̂_1(λ_k), ŷ_1(λ_k)) + λ_k ŷ_1(λ_k) + f_2(x̂_2(λ_k), ŷ_2(λ_k)) − λ_k ŷ_2(λ_k)
  = inf_{x_1, y_1} ( f_1(x_1, y_1) + λ_k y_1 ) + inf_{x_2, y_2} ( f_2(x_2, y_2) − λ_k y_2 )
  ≤ f_1(x̂_1(λ), ŷ_1(λ)) + λ_k ŷ_1(λ) + f_2(x̂_2(λ), ŷ_2(λ)) − λ_k ŷ_2(λ)
  = f_1(x̂_1(λ), ŷ_1(λ)) + f_2(x̂_2(λ), ŷ_2(λ)) + λ_k ( ŷ_1(λ) − ŷ_2(λ) )
  = f_1(x̂_1(λ), ŷ_1(λ)) + f_2(x̂_2(λ), ŷ_2(λ)) + λ ( ŷ_1(λ) − ŷ_2(λ) ) + (λ_k − λ)( ŷ_1(λ) − ŷ_2(λ) )
  = g(λ) + (λ_k − λ)( ŷ_1(λ) − ŷ_2(λ) ).

We have shown that for any λ, g(λ_k) ≤ g(λ) + (λ_k − λ)( ŷ_1(λ) − ŷ_2(λ) ), which is equivalent to g(λ) ≥ g(λ_k) + (λ − λ_k)( ŷ_1(λ) − ŷ_2(λ) ), as required.
The subgradient in (4.14) is a vector of the same dimensionality as λ. We then update λ according to the following rule:

λ_{k+1} = λ_k − α_k ∇g(λ_k),  (4.15)

and by keeping track of the best value of g found so far, we can find the optimal λ-value.
It is perhaps helpful to consider the subgradient method for updating λ from an economic viewpoint. There are two copies of the y variables: the values of the y_1 copy can be thought of as the amounts of a set of resources generated by g_1, that is, the supply of y. Then, the y_2 copy can be thought of as the demand for the variables by g_2. At iteration k, the λ_k can be taken to be the costs of these resources.

If for some index i the supply exceeds the demand (i.e. y_1^i > y_2^i), then the corresponding component of the subgradient will be positive. Hence, by the update rule (4.15), we will decrease the cost λ_k. Note that this makes sense from an economic perspective: if supply exceeds demand, then we would like to decrease the cost of the resource, in order to stimulate interest in it. Conversely, if the demand for a resource exceeds its supply (i.e. y_1^i < y_2^i), we would like to increase the cost. The increased cost causes a reduced demand, but while the demand exceeds the supply, the revenue is increased. The optimal point comes where supply and demand are equal. In this way, the update rule encourages the two copies y_1 and y_2 to agree with each other.
Example
As a simple example, suppose that x_1 and x_2 are each vectors in R^10, y ∈ R, g_1, g_2, . . . , g_50 are affine functions of x_1 and y, and h_1, h_2, . . . , h_50 are affine functions of x_2 and y, for example:

g_1(x_1, y) = A_{(1,0)} x_1(0) + . . . + A_{(1,9)} x_1(9) + A_{(1,10)} y + A_{(1,11)};  (4.16)
h_1(x_2, y) = B_{(1,0)} x_2(0) + . . . + B_{(1,9)} x_2(9) + B_{(1,10)} y + B_{(1,11)},  (4.17)

where A_{(i,j)}, B_{(i,j)} ∈ R for all (i, j) (i.e. A and B are real-valued 50-by-12 matrices).
Now, the functions g_i and h_i are convex, and the maximum of a set of convex functions is convex, so if we set f_1 and f_2 as:

f_1(x_1, y) = max_{i=1,...,50} g_i(x_1, y);  (4.18)
f_2(x_2, y) = max_{i=1,...,50} h_i(x_2, y),  (4.19)

then f_1 and f_2 are also convex. Thus, using the procedure outlined above and CVX, a package for specifying and solving convex programs [45, 46], we can iteratively solve (4.11) for a given λ, and then update λ using the update rule (4.15). Figure 4.2 shows a plot of the dual functions in (4.11) with respect to the λ used during the optimisation.
In Figure 4.3, the progress of a bisection method for maximising g(λ) is shown. Also shown are two bounds: the larger (worse) bound is found by evaluating f(x), using the current solutions x̂_1, x̂_2 and ŷ, and (4.1). The smaller (better) bound is found by using these solutions to evaluate the dual function g(λ) from (4.11). Note the high magnitude of the deviations in g(λ) for the first few iterations, where λ is changing by a relatively large amount. After a few iterations, the dual function converges to the better bound.
Figure 4.2: Dual functions versus the cost variable λ.

Figure 4.3: Value of the dual function g(λ) after each iteration k, compared to the bounds obtained by equations (4.1) and (4.11).
4.1.2 Related Approaches
The principle of dual decomposition was originally developed by the optimisation community, and can be traced back to work on linear programs in 1960 [23]. More than four decades later, the principle was combined with Markov random field (MRF) optimisation by Komodakis et al. [62, 63], who decompose an NP-hard MRF problem on a general graph G into a set of easier subproblems defined on trees T ⊆ G.

Dual decomposition was applied by Wang and Koller [127] to jointly solve the problems of human segmentation and pose estimation. They defined two slave problems: one provides a pose estimate by using belief propagation to select from a finite set of estimates; the other finds a foreground segmentation via graph cuts.

In the remainder of this chapter, we extend the formulation of [127] to include stereo, so that we can provide segmentation and pose estimation in 3D, not just 2D. While existing stand-alone stereo correspondence algorithms are not sufficiently accurate to compensate for the lack of an infrared sensor, our multi-level inference framework aids us in segmenting objects despite errors in the disparity map.
4.2 Humans in Two Views (H2view) Dataset
While many datasets [22, 36, 37] exist for evaluation of 2D human pose estimation, there is a lack of multi-view human pose datasets; the only well-known such dataset is the HumanEva dataset [112], which consists of four image sequences obtained from seven cameras, of which three are in colour. However, no two of the cameras are frontoparallel, so rectified images cannot be obtained.

Another related dataset is the H3D (Humans in 3D) dataset created by Bourdev et al. [10], which gives 3D ground-truth human pose for a set of 1,000 images (doubled to 2,000 by mirroring), but these images are only available in one view.

In order to evaluate the algorithms in this chapter, a new dataset, called Humans in Two Views (H2view for short), was created; it is available at http://cms.brookes.ac.uk/research/visiongroup/h2view/h2view.html. This dataset contains a total of 8,741 images of humans standing, walking, crouching or gesticulating in front of a stereo camera, divided up into 25 video sequences of between 149 and 430 images, with eight subjects and two locations. Video files are not included, but the frames were captured at a rate of 15 fps.

The images are annotated with ground-truth human segmentation, and 3D positions of fifteen joints, from which a ten-part skeleton is obtained. Ground truth data was collected using a Microsoft Kinect, which was synchronised with the stereo camera for each sequence. Since the Kinect's pose estimation algorithm is not 100% accurate, the pose estimates were corrected manually. Camera calibration was also performed, and Python code was used to translate the data obtained using the Kinect into the image plane of the left image from the stereo camera.

Each annotated frame has six files associated with it: the RGB and depth images from the Kinect, the interlaced image from the stereo camera, the left and right rectified images from the stereo camera, and the ground truth skeleton.

The dataset is split into training and testing portions. The training set contains 7,143 images, while the test set features 1,598 images, with a different location and different subjects from the training set. The subjects were not given particular instructions regarding how to act in front of the camera; they mostly perform simple gestures at random in front of the camera, and occasionally interact with inanimate objects. This means that the difficulty of the test set is varied: some sequences contain simple, slow actions, while others feature more energetic movement. A downside to this is that the dataset is not readily suitable for use with action recognition algorithms.
4.2.1 Evaluation Metrics Used
Segmentation
In order to evaluate segmentation, we give results on four metrics. These are calculated based on the counts of three classes of pixel. The first, true positives (TP), are pixels that are classed as human in both the segmentation result and the ground truth. If a pixel is wrongly classed as human, it is a false positive (FP), and if a pixel is wrongly classed as background, it is a false negative (FN). The metrics used are as follows:

Precision: P = TP/(TP + FP),
Recall: R = TP/(TP + FN),
F-Score: F = 2PR/(P + R),
Overlap: O = TP/(TP + FP + FN).
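These four metrics can be computed directly from a pair of binary masks, as in the following minimal sketch (function and variable names are chosen here purely for illustration):

```python
import numpy as np

def segmentation_metrics(result, ground_truth):
    """Precision, recall, F-score and overlap for binary foreground masks.
    Assumes both masks contain at least one foreground pixel."""
    result = result.astype(bool)
    ground_truth = ground_truth.astype(bool)
    tp = np.logical_and(result, ground_truth).sum()
    fp = np.logical_and(result, ~ground_truth).sum()
    fn = np.logical_and(~result, ground_truth).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    overlap = tp / (tp + fp + fn)
    return precision, recall, f_score, overlap
```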
Pose Estimation
To evaluate pose estimation, we follow the standard criterion of probability of correct pose (PCP) [37] to measure the percentage of correctly localised body parts.

In this criterion, the detected endpoints of each limb are compared to the annotated ground truth endpoints. If both detected endpoints are within 50% of the ground truth limb's length from the ground truth endpoints, then the part detection is considered to be correct. The threshold of 50% is consistent with previous works that evaluate pose estimation using PCP [1, 37, 131].
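A sketch of the PCP test for a single limb is given below; the endpoint coordinates and names are illustrative only.

```python
import numpy as np

def pcp_correct(det_a, det_b, gt_a, gt_b, threshold=0.5):
    """True if both detected endpoints lie within threshold * limb length
    of the corresponding ground-truth endpoints."""
    det_a, det_b, gt_a, gt_b = map(np.asarray, (det_a, det_b, gt_a, gt_b))
    limb_length = np.linalg.norm(gt_a - gt_b)
    return (np.linalg.norm(det_a - gt_a) <= threshold * limb_length and
            np.linalg.norm(det_b - gt_b) <= threshold * limb_length)

# e.g. a lower-arm detection 4 pixels off at each end, for a 40-pixel limb:
print(pcp_correct((10, 14), (10, 54), (10, 10), (10, 50)))  # True
```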
4.3 Inference Framework
Table 4.1: Notation used in this chapter.

Symbol              Meaning
L                   Left input image
R                   Right input image
D                   Set of disparity labels
S                   Set of segmentation labels
B                   Set of body part labels
Z                   Set of pixels
Z_m                 Individual pixel
B                   Set of body parts
B_i                 Individual part
C                   Set of neighbouring pixels
T                   Set of neighbouring parts
N                   Number of pixels in image
N_E                 Number of proposals per body part
N_P                 Number of body parts
K                   Maximum disparity label
i                   Part index
j                   Proposal index
k                   Disparity value
m = (x, y)          Pixel location
z                   CRF labelling
z̄                   Binarised CRF labelling
z_m = [d_m, s_m]    Label for one pixel Z_m
d_m                 Disparity label for Z_m
s_m                 Segmentation label for Z_m
b_i                 Label for one body part B_i
E(z)                Energy function
L                   Lagrangian function
f_•                 Individual term in an energy function
f̃_•                 Binarised term in an energy function
J_•                 Joint energy term (combining two of segmentation, pose and depth)
θ_•                 Weight of a term or sub-term
φ                   Unary term
ψ                   Pairwise term
Δx, Δy, etc.        Difference in x, y, etc.
∇L, ∇R              Gradient images of L and R (in the x-direction only)
w_{ij}^m            Weight of a pixel m given by proposal j for part i
W_F                 Overall foreground weight of a pixel
p_m                 Posterior foreground probability of a pixel m
λ                   Dual variable
τ                   Threshold
ξ                   Slack variable
F                   Feasible set of labels
t                   Iteration index
α_t                 Step size
For ease of reference, a list of symbols used in this chapter can be found in Table 4.1. The problem which we wish to optimise consists of three main elements: stereo, segmentation, and human pose estimation. Each of these is represented by one term in the energy function. We introduce two additional terms, hereafter referred to as joining terms, which combine information from two of the elements, encouraging them to be consistent with one another. We take as input a stereo pair of images L and R, and as a preprocessing step, we use the algorithm of Yang and Ramanan [131] to obtain a number N_E of proposals for N_P different body parts. In this chapter, we use N_P = 10, with two parts for each of the four limbs, plus one each for the head and torso. Each proposal j for each part i comes with a pair of endpoints corresponding to a line segment in the image, representing the limb (or skull, or spine).
Needless to say, these proposals are not guaranteed to be accurate. As might be expected, the top-ranked proposal returned is the most likely to be correct; however, the maximum possible accuracy (that is, the accuracy that an oracle could obtain by picking the most accurate proposal for each part) does increase as the number of proposals is increased, as shown in Figure 4.4.
Our approach is formulated as a conditional random field (CRF) with two sets of random variables: one set covering the image pixels Z = {Z_1, Z_2, . . . , Z_N}, and one covering the body parts B = {B_1, B_2, . . . , B_{N_P}}. Any possible assignment z of labels to the random variables will be called a labelling.

The label sets for the pixels are defined as follows:

D = {0, 1, . . . , K − 1};  (4.20)
S = {0, 1}.  (4.21)

In a particular labelling z, each pixel variable Z_m takes a label z_m = [d_m, s_m] from the product space of disparity and segmentation labels D × S, where d_m represents the disparity assigned to the pixel, and s_m is set to 1 if and only if the pixel is assigned to the foreground. Additionally, each part B_i takes a label b_i ∈ B = {1, 2, . . . , N_E}, denoting which proposal for part i has been selected. In general, the energy of z can be written as:

E(z) = f_D(z) + f_S(z) + f_P(z) + f_PS(z) + f_SD(z).  (4.22)
The structure of the variable sets and labellings is shown graphically in Figure 4.5. It is important to note that we have two separate sets of variables, with three labellings obtained, and the five terms in the equation above each give an energy score for a different aspect of this labelling structure:

1. f_D gives the cost of the disparity label assignment {d_m}_{m=1}^N (Figure 4.5(c), centre);

2. f_S gives the cost of the segmentation label assignment {s_m}_{m=1}^N (Figure 4.5(c), left);

3. f_P gives the cost of the part proposal selection {b_i}_{i=1}^{N_P} (Figure 4.5(c), right);

4. f_SD gives the joint cost of the disparity and segmentation labellings; and

5. f_PS gives the joint cost of the segmentation labelling and the part proposal selection.

Figure 4.4: (a): Part proposal accuracy decreases as the proposal rank is lowered, with the fourth proposal onwards less than half as accurate as the first proposal. However, there is still information of value in these lower-ranked proposals, as the best possible accuracy, shown in (b), increases slightly. Note that a different scale is used on the two graphs.
In the following sections, we describe in turn each of the five terms in (4.22). Where appropriate, terms are weighted by parameters θ_•; the process of setting the values of these parameters is given in Section 4.5.
4.3.1 Segmentation Term
In order to build unary potentials for the segmentation term, we create a foreground weight map based on the pose detections obtained from Yang and Ramanan's algorithm [131]. For each pixel Z_m, each part proposal (i, j) contributes a weight w_{ij}^m, where w_{ij}^m = 1 if Z_m lies directly on the line segment representing the limb, and the weight decreases exponentially as we move away from the segment. We then have a foreground weight W_F = Σ_{i,j} θ_j w_{ij}^m and a background weight Σ_{i,j} θ_j (1 − w_{ij}^m) for each pixel. The θ_j term represents our confidence in the detections, which decreases as the proposal ranking lowers. An example of why this weighting can be important is shown in Figure 4.6.
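A sketch of how such a weight map might be built is given below; the exponential decay rate is not specified in the text, so the sigma value (and the example proposals and confidences) are purely illustrative.

```python
import numpy as np

def proposal_weight_map(shape, endpoint_a, endpoint_b, sigma=5.0):
    """Weight map for one part proposal: 1 on the limb's line segment,
    decaying exponentially with distance from it. sigma is an assumed decay scale."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    p = np.stack([xs, ys], axis=-1).astype(float)
    a, b = np.asarray(endpoint_a, float), np.asarray(endpoint_b, float)
    ab = b - a
    # Project each pixel onto the segment, clamping to its endpoints.
    t = np.clip(((p - a) @ ab) / max(ab @ ab, 1e-9), 0.0, 1.0)
    closest = a + t[..., None] * ab
    dist = np.linalg.norm(p - closest, axis=-1)
    return np.exp(-dist / sigma)   # equals 1 on the segment itself

# Foreground weight W_F as a confidence-weighted sum over proposals (i, j);
# theta[j] stands for the per-rank confidence, chosen arbitrarily here.
theta = [1.0, 0.5, 0.25]
proposals = [((10, 20), (40, 20)), ((12, 22), (42, 18)), ((5, 50), (30, 55))]
W_F = sum(theta[j] * proposal_weight_map((60, 80), a, b)
          for j, (a, b) in enumerate(proposals))
```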
These weights are then used to fit Gaussian mixture models (GMMs) for the foreground and background regions, which together give us a probability p_m of each pixel Z_m being a foreground pixel. From this, we obtain unary costs φ_F and φ_B, which store the costs of assigning each pixel m to foreground and background respectively:

φ_F(m) = −log(p_m);  (4.23)
φ_B(m) = −log(1 − p_m).  (4.24)

Figure 4.5: A graphical representation of the CRF formulation given in this chapter. (a): the inputs are a stereo pair of images (left), and a skeleton model (right). (b): we have two variable sets, one covering pixels and one covering body parts. (c): these variables are assigned labellings. The pixels get segmentation (binary) and disparity (integer) labels, while the parts are given integer labels that represent the index of the proposal selected for that part.

Figure 4.6: In this cluttered image ((a) original image, (b) W_F computed by averaging, (c) W_F computed with rank-relative weights), there are many pose estimates that are extremely inaccurate. However, weighting the GMMs according to our confidence in the detections results in the incorrect detections, which had a lower rank, being assigned a less significant weight.

The unary segmentation cost for Z_m, given the binary segmentation label s_m, is then the following:

φ_S(m) = φ_F(m), if s_m = 1;  (4.25a)
φ_S(m) = φ_B(m), if s_m = 0.  (4.25b)
We also have pairwise costs ψ_S, which store the cost of assigning adjacent pixels to different labels. Denoting the set of neighbouring pixels by C, we follow equation (11) in Rother et al. [100], and write the pairwise energy for each pair of pixels (m_1, m_2) ∈ C as:

ψ_S(m_1, m_2) = exp(−θ_3 ||L(m_1) − L(m_2)||), if s(m_1) ≠ s(m_2);
ψ_S(m_1, m_2) = 0, otherwise.  (4.26)

The energy we have to minimise is then the following:

f_S(z) = θ_1 Σ_{Z_m ∈ Z} φ_S(m) + θ_2 Σ_{(m_1, m_2) ∈ C} ψ_S(m_1, m_2).  (4.27)
Results
When only the unary segmentation energy is used, a high recall value is obtained, although the precision is very low, as nothing is done to exclude pixels with a similar colour to those in the foreground. Adding in the pairwise term increases both the precision and the recall, although there are still a large number of false positives. Full results are given in Table 4.2, while a sample result is shown in Figure 4.7.

Table 4.2: Results on the H2view test set, using just the segmentation term.

Method             Precision   Recall    F-Score   Overlap
Unary only         27.91%      83.53%    41.84%    27.45%
Unary + Pairwise   55.11%      87.28%    67.56%    56.59%

Figure 4.7: Sample result using just f_S, showing (a) the original image, (b) W_F (relative weighting), (c) the result with the unary term only (θ_1 = 1, θ_2 = 0), and (d) the result with unary and pairwise terms (θ_1 = θ_2 = 1). There are many false positives in both results. The unary-only result in (c) also contains many speckles, which the pairwise term in (d) removes; however, the head is also lost.
4.3.2 Pose Estimation Term
Recall that Yang and Ramanan's algorithm provides us with a discrete set of part proposals. An example set of proposals is shown in Figure 4.8. Each proposal j for each part i has a unary cost φ_P(i, j) associated with it, whose value is based on the weights w_{ij}^m defined in the previous section. This cost function penalises the selection of a part where the pixels Z_m close to the limb (so w_{ij}^m is close to 1) have low foreground probability (so p_m is close to 0). We write:

φ_P(i, j) = Σ_{Z_m ∈ Z} w_{ij}^m (1 − p_m).  (4.28)

Figure 4.8: Several complete pose estimations are obtained from Yang and Ramanan's algorithm ((a) top-ranked estimates, (b) second-ranked estimates, (c) 10th-ranked estimates, (d) final result), which we split into ten body parts. We then select each part individually (one head, one torso, etc.). In this example, the parts highlighted in white are selected, enabling us to recover from errors such as the left forearm being misplaced in the first estimate.
A pairwise term ψ_{i_1,i_2} is introduced to penalise the case where, for two parts that should be connected (e.g. the upper and lower left leg), two proposals are selected that are distant from one another in image space. We define a tree-structured set of edges T over the set of parts, where (i_1, i_2) ∈ T if and only if parts i_1 and i_2 are connected. For each connected pair of parts (i_1, i_2) ∈ T, we model the joint by a three-dimensional Gaussian distribution over the relative position and angle between the two parts. This term is computed using the same method as in [1]:

ψ_{i_1,i_2}(b_{i_1}, b_{i_2}) = exp( (Δx − μ_x)² / (2σ_x²) ) + exp( (Δy − μ_y)² / (2σ_y²) ) + exp( (Δθ − μ_θ)² / (2σ_θ²) ),  (4.29)

where Δx is the difference in x between the two selected proposals, and μ_x and σ_x² are respectively the mean and variance of this difference; the terms for y and θ are analogous. Multiplying the unary and pairwise terms by weights θ_4 and θ_5, we minimise the following cost function:

f_P(z) = θ_4 Σ_{i=1}^{10} φ_P(i, b_i) + θ_5 Σ_{(i_1,i_2) ∈ T} ψ_{i_1,i_2}(b_{i_1}, b_{i_2}).  (4.30)
4.3.3 Stereo Term
A particular disparity label d corresponds to matching the pixel (x, y) in L to the pixel (x − d, y) in R. Note that the disparity will always be non-negative, due to the fact that the two cameras are aligned, with R coming from a point directly to the right of L. In other words, points always appear in R at a position closer to the left of the image than the corresponding position in L. These disparity labels are integer-valued, and range from 0 to K − 1.
We define a cost volume ψ_D, which for each pixel m specifies the cost of assigning a disparity label d_m. These costs incorporate the gradients in the x-direction (∇L and ∇R), which means that we do not need to adopt a pairwise cost. The following energy function then needs to be minimised over labellings z:

f_D(z) = θ_6 Σ_m ψ_D(m, d_m),  (4.31)

where the cost ψ_D is calculated for each pixel m = (x, y) as follows:

ψ_D(m, d_m) = Σ_{Δx=−4}^{4} Σ_{Δy=−4}^{4} [ |L(x + Δx, y + Δy) − R(x + Δx − d_m, y + Δy)| + |∇L(x + Δx, y + Δy) − ∇R(x + Δx − d_m, y + Δy)| ].  (4.32)
4.3.4 Joint Estimation of Pose and Segmentation
Here, we encode the concept that foreground pixels should be explained by some body part; conversely, each selected body part should explain some part of the foreground. We use the same weights w_{ij}^m as defined in Section 4.3.1, and incur a cost if the part candidate (i, j) is selected and the pixel m is labelled as background. For a particular labelling z, the cost is:

J_1(z) = Σ_{i,j} Σ_m [ 1(b_i = j) (1 − s_m) w_{ij}^m ].  (4.33)

Secondly, we formulate the cost for the case where a pixel m is assigned to the foreground, but not explained by any body part. We set a threshold τ = 0.1 (value determined empirically), and a cost is accrued for m when w_{ij}^m < τ for all parts (i, j):

J_2(z) = Σ_m 1( max_{i,j} w_{ij}^m < τ ) s_m.  (4.34)
With weighting factors θ_7 and θ_8, the overall cost f_PS can be written as:

f_PS(z) = θ_7 J_1(z) + θ_8 J_2(z).  (4.35)
Results
Combining this term with the segmentation term f_S provides an improvement in results, as the J_1 and J_2 terms make it possible for information obtained by the pose estimation algorithm to be used to aid segmentation. Results on the H2view test set are given in Table 4.3. Note that adding the joint segmentation and pose term f_PS yields a large increase in precision, as pixels far from any part are excluded by the J_2 term above. Additionally, limbs that were missing from the original segmentation can be recovered, as shown in Figure 4.9.

Table 4.3: Results on the H2view test set, after combining the segmentation term with the joint segmentation and pose term f_PS.

Method             Precision   Recall    F-Score   Overlap
Unary only         27.91%      83.53%    41.84%    27.45%
Unary + Pairwise   55.11%      87.28%    67.56%    56.59%
f_S + f_PS         63.47%      90.41%    74.59%    62.65%
4.3.5 Joint Estimation of Segmentation and Stereo
Here, we encode the intuition that, assuming that the objects closest to the camera are body parts, foreground pixels should have a higher disparity than background pixels. To do this, we use the foreground weights W_F obtained in Section 4.3.1 to obtain an expected value E_F for the foreground disparity:

E_F = ( Σ_m w_m d_m ) / ( Σ_m w_m ).  (4.36)

Using a hinge loss with a non-negative slack variable ξ = 2 to allow small deviations to occur, we then have the following cost measure to penalise pixels with a high disparity being assigned to the background:

f_SD(z) = θ_9 Σ_m (1 − s_m) max(d_m − E_F − ξ, 0).  (4.37)

Figure 4.9: In this example, the person's lower legs are missing from the initial segmentation estimate ((a) pose estimation result, (b) segmentation after the first iteration, (c) segmentation after the second iteration), although the pose estimation is correct. Once the two results are combined using the f_PS energy term above, the lower legs are correctly segmented.
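A sketch of this joint term, computed from a binary segmentation mask, a disparity map and the foreground weights, is given below; the default weight θ_9 is an arbitrary placeholder, while the slack value follows the text.

```python
import numpy as np

def f_sd(seg, disp, weights, theta9=1.0, slack=2.0):
    """Hinge penalty on background pixels whose disparity exceeds the
    expected foreground disparity E_F by more than the slack (eqs. 4.36-4.37).

    seg     : binary segmentation mask (1 = foreground)
    disp    : integer disparity map
    weights : foreground weight map W_F
    """
    E_F = (weights * disp).sum() / weights.sum()   # expected foreground disparity (4.36)
    hinge = np.maximum(disp - E_F - slack, 0.0)    # only penalise large excesses
    return theta9 * ((1 - seg) * hinge).sum()      # background pixels only (4.37)
```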
While the individual functions f_S, f_P and f_D do reasonably well by themselves at solving each of their tasks, the functions f_PS and f_SD can improve the results further. However, the addition of these functions means that the problems are no longer separable, and therefore an optimal labelling z* is harder to find. In order to minimise the energy function (4.22) efficiently, we apply the dual decomposition technique that was introduced in Section 4.1.
4.4 Dual Decomposition
Many of the minimisation problems defined in Section 4.3 are multiclass problems, and are therefore intractable to solve in their current forms. However, we can binarise the multiclass label sets D and B.
4.4.1 Binarisation of Energy Functions
For pixels, we extend the labelling space so that each pixel takes a vector of binary labels z̄_m = [d_{(m,0)}, d_{(m,1)}, . . . , d_{(m,K−1)}, s_m], with each d_{(m,k)} equal to 1 if and only if disparity value k is selected for pixel m. For parts, we extend the labelling space so that each part takes a vector of binary labels b̄_i = [b_{(i,1)}, b_{(i,2)}, . . . , b_{(i,N_E)}], where each b_{(i,j)} is equal to 1 if and only if the j-th proposal for part i is selected.

A particular solution to this binary labelling problem is denoted by z̄. Only a subset of the possible binary labellings will correspond directly to multiclass labellings z; these are those such that each pixel has exactly one disparity turned on, and each part has exactly one proposal selected. The set of solutions for which all pixels and parts satisfy this constraint is called the feasible set F. We can write:

F = { z̄ : Σ_{k=0}^{K−1} d_{(m,k)} = 1 ∀ Z_m ∈ Z;  Σ_{j=1}^{N_E} b_{(i,j)} = 1 ∀ B_i ∈ B }.  (4.38)
We rewrite the cost functions from (4.30) and (4.31) as follows:

f̃_P(z̄) = θ_4 Σ_{(i,j)} b_{(i,j)} φ_P(i, j) + θ_5 Σ_{(i_1,i_2) ∈ T} Σ_{j_1,j_2} b_{(i_1,j_1)} b_{(i_2,j_2)} ψ_{i_1,i_2}(j_1, j_2);  (4.39)

f̃_D(z̄) = θ_6 Σ_{m=1}^{N} Σ_{k=0}^{K−1} d_{(m,k)} ψ_D(m, k).  (4.40)

The joining functions given in Sections 4.3.4 and 4.3.5 can be binarised in a similar fashion. The energy minimisation problem (4.22) can be restated in terms of these binary functions, giving us:

E(z̄) = f̃_D(z̄) + f_S(z̄) + f̃_P(z̄) + f̃_PS(z̄) + f̃_SD(z̄)  (4.41)
subject to: z̄ ∈ F.
4.4.2 Optimisation
Minimising this energy function across all labellings z̄ simultaneously is intractable, so in order to simplify the problem, we use dual decomposition, which was described in Section 4.1.

We introduce duplicate variables z̄_1 and z̄_2, and only enforce the feasibility constraints on these duplicates. Our energy function thus becomes:

E(z̄, z̄_1, z̄_2) = f̃_D(z̄_1) + f_S(z̄) + f̃_P(z̄_2) + f̃_PS(z̄) + f̃_SD(z̄)  (4.42)
subject to: z̄_1, z̄_2 ∈ F, z̄_1 = z̄, z̄_2 = z̄.
We remove the equality constraints by adding Lagrangian multipliers:

L(z̄, z̄_1, z̄_2) = f̃_D(z̄_1) + f_S(z̄) + f̃_P(z̄_2) + f̃_PS(z̄) + f̃_SD(z̄) + λ_D(z̄ − z̄_1) + λ_P(z̄ − z̄_2).  (4.43)

Hence, the original problem, min_z̄ E(z̄), is converted to:

max_{λ_D, λ_P} [ min_{z̄, z̄_1, z̄_2} L(z̄, z̄_1, z̄_2) ],  (4.44)

which is the dual problem, with dual variables λ_D and λ_P. λ_D is a vector with N × K elements, one for each binary variable d_{(m,k)}; λ_P is a vector with N_P × N_E elements, one for each binary variable b_{(i,j)}.
Note that a labelling z̄ gives us an upper bound for the solution to the primal problem E(z̄); on the other hand, the dual problem (4.44) gives us a lower bound, as we have relaxed the feasibility constraints by adding λ_D and λ_P. We can decompose this dual problem into three subproblems L_1, L_2 and L_3, as follows:

L(z̄, z̄_1, z̄_2) = L_1(z̄_1, λ_D) + L_2(z̄_2, λ_P) + L_3(z̄, λ_D, λ_P),  (4.45)

where:

L_1(z̄_1, λ_D) = f̃_D(z̄_1) − λ_D z̄_1;  (4.46)
L_2(z̄_2, λ_P) = f̃_P(z̄_2) − λ_P z̄_2;  (4.47)
L_3(z̄, λ_D, λ_P) = f_S(z̄) + f̃_SD(z̄) + f̃_PS(z̄) + λ_D z̄ + λ_P z̄;  (4.48)
are the three slave problems, which can be optimised independently and efficiently, while treating the dual variables λ_D and λ_P as constant. This process is shown graphically in Figure 4.10. Intuitively, the role of the dual variables is to encourage the labellings z̄, z̄_1, and z̄_2 to agree with each other. Since λ_D and λ_P remain constant during each iteration (only changing between iterations), we do not actually need to explicitly minimise over them within the slaves.

Figure 4.10: Diagram showing the two-stage update process. (a): the slaves find labellings z̄, z̄_1, z̄_2 and pass them to the master; (b): the master updates the dual variables λ_D and λ_P and passes them to the slaves.
Given the current values of λ_D and λ_P, we solve the slave problems L_1, L_2, and L_3, denoting the solutions by ẑ_{L_1}(λ_D), ẑ_{L_2}(λ_P), and ẑ_{L_3}(λ_D, λ_P) respectively. We concatenate ẑ_{L_1}(λ_D) and ẑ_{L_2}(λ_P) to form a vector of the same dimensionality as ẑ_{L_3}(λ_D, λ_P). The master then calculates the subgradient of the relaxed dual function at (λ_D, λ_P), given by:

∇L(λ_D, λ_P) = ẑ_{L_3}(λ_D, λ_P) − [ẑ_{L_1}(λ_D), ẑ_{L_2}(λ_P)].  (4.49)
The master problem can then update the dual variables using the subgradient method, similar to that of [127]:
[θ_D, θ_P] ← [θ_D, θ_P] + α_t ∇L(θ_D, θ_P) ,   (4.50)
and pass the new vectors back to the slaves. Here, α_t is the step size indexed by iteration t, which we set as follows:
α_t = λ_10 / (1 + t) ,   (4.51)
ensuring that the α_t form a decreasing sequence, so that we make progressively finer refinements with each iteration. Consider a particular proposal j for a part i. At each iteration, two solutions b^2_{ij} and b^3_{ij} are generated for B_{ij}, from L_2 and L_3 respectively. If these candidates are the same, then no penalty is suffered, and the corresponding element of θ_P will not change. If, however, there was a disagreement, for example b^2_{ij} = 1 but b^3_{ij} = 0, then for problem L_2 we wish to increase the penalty of selecting b_{ij}, while doing the reverse for problem L_3.
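To make one cycle of this process concrete, the sketch below (illustrative Python, not the thesis C++ implementation; the three slave solvers are assumed to be supplied as functions, and solve_L3 is assumed to return the disparity and part components of its labelling concatenated in that order) performs a single master iteration of (4.49)-(4.51):

import numpy as np

def master_iteration(theta_D, theta_P, solve_L1, solve_L2, solve_L3, lambda_10, t):
    # Solve the three slave problems with the dual variables held fixed.
    z1 = solve_L1(theta_D)              # binary disparity labelling from L_1
    z2 = solve_L2(theta_P)              # binary part labelling from L_2
    z = solve_L3(theta_D, theta_P)      # disparity and part components of the L_3 labelling

    # Subgradient (4.49): disagreement between the L_3 solution and the concatenated [L_1, L_2] solution.
    grad_D = z[:theta_D.size] - z1
    grad_P = z[theta_D.size:] - z2

    # Decreasing step size (4.51) and subgradient step (4.50).
    alpha_t = lambda_10 / (1.0 + t)
    return theta_D + alpha_t * grad_D, theta_P + alpha_t * grad_P

Elements of the subgradient are zero wherever the slaves agree, so only the dual variables of disputed pixels and proposals are changed.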
4.4.3 Solving Sub-Problem L_1
Since problem L_1 contains terms that only depend on the disparity variables, we can relax the feasibility constraint in (4.38) to only depend on these variables. We call this expanded feasible set F_D:
F_D = { z : ∑_{k=0}^{K−1} d_{(m,k)} = 1 ∀ Z_m ∈ Z } .   (4.52)
Then, we can write L_1 in terms of the binary function f′_D as:
L_1(z_1, θ_D) = f′_D(z_1) − θ_D·z_1   (4.53)
subject to: z_1 ∈ F_D .
Since f′_D includes only unary terms, this equation can be solved independently for each pixel. As we have the feasibility constraint attached to this equation, we solve the following for each pixel m, where θ_D(m, k) is the element of the θ_D vector corresponding to pixel m and disparity k:
min_{k=0,…,K−1} ( max( λ_6 Φ_D(m, k) − θ_D(m, k), 0 ) ) .   (4.54)
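Because the choice decouples across pixels, (4.54) amounts to an independent argmin over the K disparity labels for every pixel. A schematic solver (illustrative Python; phi_D and theta_D are assumed to be N × K arrays holding the unary costs and the dual variables):

import numpy as np

def solve_L1(phi_D, theta_D, lambda_6):
    # phi_D, theta_D : arrays of shape (N, K) -- one row per pixel, one column per disparity label
    cost = np.maximum(lambda_6 * phi_D - theta_D, 0.0)   # per-label cost from (4.54)
    best_k = np.argmin(cost, axis=1)                     # independent minimisation per pixel
    z1 = np.zeros_like(cost)
    z1[np.arange(cost.shape[0]), best_k] = 1.0           # exactly one label per pixel, satisfying F_D
    return z1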
4.4.4 Solving Sub-Problem L_2
L_2 contains functions that depend only on the pose variables b_{(i,j)}, so we can again relax the feasibility constraint. This expanded feasible set is called F_P:
F_P = { z : ∑_{j=1}^{N_E} b_{(i,j)} = 1 ∀ B_i ∈ B } .   (4.55)
L_2 can be written in terms of f′_P as follows:
L_2(z_2, θ_P) = f′_P(z_2) − θ_P·z_2   (4.56)
subject to: z_2 ∈ F_P ,
with f′_P as in (4.39). Ordering the parts B_i such that (i_1, i_2) ∈ T only if i_1 < i_2, we find the optimal solution using belief propagation, as introduced in Section 3.2.2. The score of each leaf vertex is the following, calculated for each estimate j:
score_i(j) = Φ_P(i, j) − θ_P(i, j) .   (4.57)
For a vertex i with children, we can compute the following:
score_i(j) = Φ_P(i, j) − θ_P(i, j) + ∑_{(i_1,i)∈T} min_{j_1} ( Ψ_{i_1,i}(j_1, j) + score_{i_1}(j_1) ) ,   (4.58)
and the globally optimal solution is found by keeping track of the arg min indices, and then selecting the root (torso) estimate with minimal score.
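As an illustration, the min-sum recursion of (4.58) can be written as follows (a Python sketch under simplifying assumptions: part 0 is taken to be the torso/root, children are assumed to have higher indices than their parents, and the unary scores are assumed to already include the dual term as in (4.57)):

import numpy as np

def solve_L2(unary, pairwise, children):
    # unary[i]          : array of length N_E with Phi_P(i, j) - theta_P(i, j)
    # pairwise[(i1, i)] : N_E x N_E array with Psi_{i1,i}(j1, j)
    # children[i]       : list of child part indices of part i
    num_parts = len(unary)
    score = [None] * num_parts
    best_child_choice = {}

    # Process parts from the leaves towards the root.
    for i in reversed(range(num_parts)):
        score[i] = unary[i].copy()
        for c in children[i]:
            msg = pairwise[(c, i)] + score[c][:, None]       # Psi(j_c, j_i) + score_c(j_c)
            best_child_choice[(c, i)] = np.argmin(msg, axis=0)
            score[i] += np.min(msg, axis=0)                  # min over child proposals, as in (4.58)

    # Select the root (torso) proposal with minimal score, then backtrack the arg min indices.
    selection = {0: int(np.argmin(score[0]))}
    stack = [0]
    while stack:
        i = stack.pop()
        for c in children[i]:
            selection[c] = int(best_child_choice[(c, i)][selection[i]])
            stack.append(c)
    return selection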
4.4.5 Solving Sub-Problem L_3
Sub-problem L_3 is significantly more complex, as it contains the joining terms f′_PS and f′_SD. Since we have rewritten L_1 and L_2 in terms of the binary variables z_1 and z_2, we need to do the same to the joining terms. J_1(z) penalised background pixels being assigned a body part. For parts i, estimates j, and pixels m, this becomes:
J′_1(z) = ∑_{i,j} ∑_m b_{(i,j)} (1 − s_m) w^m_{ij} .   (4.59)
The function J_2(z), penalising foreground pixels not being explained by any body part, becomes the following for pixels m:
J′_2(z) = ∑_m ( 1 − 1[ max_{i,j} w^m_{ij} > 0 ] ) s_m .   (4.60)
Bringing these costs together, we get:
f′_PS(z) = λ_7 J′_1(z) + λ_8 J′_2(z) .   (4.61)
Function f_SD(z), which penalises background pixels with a higher disparity than the foreground region, becomes the following for pixels m and disparities k:
f′_SD(z) = λ_9 ∑_m ∑_k (1 − s_m) d_{(m,k)} max(k − E_F, 0) .   (4.62)
Since all the terms in the energy function L_3 are submodular, we can efficiently minimise the function via a graph cut.
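Before the graph is built, the joining costs can be read directly off a candidate labelling; the following sketch (illustrative Python; w is assumed to hold the per-pixel weight maps w^m_{ij} of the currently selected proposals, and s the binary segmentation) evaluates (4.59)-(4.61):

import numpy as np

def joint_pose_segmentation_cost(w, s, lambda_7, lambda_8):
    # w : array of shape (P, N) -- one row of pixel weights per selected part proposal
    # s : binary segmentation array of length N (1 = foreground)
    J1 = np.sum(w * (1.0 - s)[None, :])              # part weight falling on background pixels (4.59)
    covered = (w.max(axis=0) > 0).astype(float)      # pixels explained by at least one selected part
    J2 = np.sum((1.0 - covered) * s)                 # foreground pixels explained by no part (4.60)
    return lambda_7 * J1 + lambda_8 * J2             # f'_PS in (4.61)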
4.5 Weight Learning
The formulation described in the previous section contains a large number of parameters, denoted by λ, which dictate how much influence the various energy terms have on the final result. A particular term can be made hugely important by setting its value, hereafter its weight, to be very high. Alternatively, setting the weight to zero will mean that the term is ignored completely.
The number of parameters is sufficient that it is not feasible to select the optimal values by hand; additionally, it is not clear which terms should have a higher weight. This section contains a description of an algorithm to determine the best choices from a finite set of weights.
4.5.1 Assumption
In order to improve the efficiency of the algorithm, the following assumption is made:
Proposition 4.2 When a variable is varied, the result of the algorithm will vary in a convex fashion; in other words, there is only one local optimum.
Assuming this makes it possible to reduce the runtime of the algorithm by up to 50%: without it, we would have to run the test for each possible value in each iteration, rather than performing a binary search for the best value. To test this assumption empirically, we fixed all but one of the variables, and evaluated the performance of the framework for each of a sequence of values. In each case, the above proposition held.
This optimum point is hereafter referred to as λ_opt. Define the score of a given parameter value x ∈ P to be s_x. Formally,
Definition 4.4 For a given set P of parameter values, the optimal value λ_opt is the one such that:
∀ x ∈ P :  s_x ≥ s_{λ_opt}  ⟹  x = λ_opt .
Further, the following two propositions directly follow from the assumed convexity of the result:
Proposition 4.3 If λ_opt < x < y, then s_x > s_y.
Proposition 4.4 If x < y < λ_opt, then s_x < s_y.
4.5.2 Method
The parameters are optimised based on their performance on the training set. To avoid overfitting, it might be better to use the entire training set, but since the learning time is linear in relation to the number of images, a subset of 100 images is used.
In each iteration, we fix all but one of the parameters, and find the best value by a method that is itself iterative. The process for optimising one variable is described next, and this subsection is concluded by a discussion of the overall algorithm for finding the best parameter set.
Optimisation of One Variable
Suppose we have fixed the parameters λ_j for j ≠ i. Our goal is to find the best value λ_opt for λ_i from the following set of values: P = {0.001, 0.01, 0.1, 1, 10, 100}.
We find this optimal value by performing a binary search. At each step, we choose a value from this set, and evaluate the performance of our dual decomposition algorithm from Section 4.4 on 100 images from the training set. Since there are only six values to choose from, the algorithm is written in full, although a more general algorithm can be extrapolated from the procedure below.
First, we set λ_i = 0.1, and find the score s_0.1 of the algorithm for this parameter set. At this stage we can't eliminate any parameter values, as we only have one score. So, we choose the middle value from the longest set of unchecked values, i.e. λ_i = 10, and find the score s_10.
Now, we can eliminate some values from the parameter set based on the scores found so far:
Lemma 4.5 Denote by λ_opt the (as yet unknown) optimal parameter value. Then:
1. if s_0.1 > s_10, then λ_opt < 10.
2. if s_0.1 ≤ s_10, then λ_opt ≥ 0.1.
Proof 1. For a contradiction, suppose that s_0.1 > s_10, but λ_opt > 10 (by definition, we can't have λ_opt = 10, since s_10 < s_0.1). Then, since λ_opt > 10 > 0.1, by Proposition 4.4, we have s_10 > s_0.1, which is a contradiction.
2. Again seeking a contradiction, suppose that s_0.1 ≤ s_10, but λ_opt < 0.1. Then Proposition 4.3 applies, and we have that s_0.1 > s_10, a contradiction.
We now have a reduced set of possible values for λ_opt: either {0.001, 0.01, 0.1, 1}, or {0.1, 1, 10, 100}. In both cases, we check the second possible value. After this point, there are four possible cases:
1. s_0.1 > s_10 and s_0.01 > s_0.1: the optimum value must be either 0.001 or 0.1, so we find s_0.001, and thus determine λ_opt.
2. s_0.1 > s_10, but s_0.01 < s_0.1: then we know that 0.1 ≤ λ_opt < 10, so we find s_1, and thus determine λ_opt.
3. s_0.1 < s_10 and s_1 > s_10: then we've already found a local maximum, so by Proposition 4.2, it must be the only local maximum.
4. s_0.1 < s_10 and s_1 < s_10: the optimum value must be either 10 or 100, so we find s_100, and thus determine λ_opt.
This is the set of cases for the first set of possible values for λ_opt; the cases for the second set are analogous. The whole process is represented by a decision tree in Figure 4.11.
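The same search can also be written generically for any small, ordered candidate set. The sketch below (illustrative Python; evaluate(value) is assumed to run the dual decomposition framework on the 100 training images with the candidate weight and return its score) relies only on the unimodality guaranteed by Proposition 4.2:

def find_best_weight(candidates, evaluate):
    # candidates : sorted list of possible weight values, e.g. [0.001, 0.01, 0.1, 1, 10, 100]
    scores = {}
    def score(idx):
        if idx not in scores:
            scores[idx] = evaluate(candidates[idx])
        return scores[idx]

    lo, hi = 0, len(candidates) - 1
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if score(m1) > score(m2):
            hi = m2 - 1     # by Proposition 4.3, the optimum cannot lie above m2
        else:
            lo = m1 + 1     # by Proposition 4.4, the optimum cannot lie below m1
    best = max(range(lo, hi + 1), key=score)
    return candidates[best]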
Optimisation of All Variables
To optimise all variables, we use Algorithm 4.1:
Algorithm 4.1 Parameter optimisation algorithm for the dual decomposition framework.
 1: Input: initial variable values λ_1, . . . , λ_10
 2: Initialise "has been optimised" vector: opt(i) ← FALSE for all i.
 3: while ∃ i : opt(i) = FALSE do
 4:   Choose a variable λ_i s.t. opt(i) = FALSE, and store the previous value λ_prev.
 5:   Find the optimal value λ_opt and score s_{λ_opt}.
 6:   if λ_opt ≠ λ_prev then
 7:     for all j ≠ i do
 8:       opt(j) ← FALSE
 9:     end for
10:   end if
11:   opt(i) ← TRUE
12: end while
In theory, this algorithm will converge in a finite number of iterations (as there are a finite number of possible values for the set of parameters), but in practice, we limit the number of iterations to 100. It is possible to do fine-tuning by repeating the process with additional parameter sets, e.g. if the best value for a variable turns out to be 0.1, we could use the set {0.025, 0.05, 0.1, 0.2, 0.4, 0.8}.
Figure 4.11: The optimisation process for one parameter, represented as a decision tree.
Table 4.4: The various weights learned by the weight learning process and used in the
experiments in this chapter.
Symbol   Usage                                                            Weight
λ_1      Unary segmentation weight                                        1
λ_2      Pairwise segmentation weight                                     1
λ_3      Factor in (4.26)                                                 0.1
λ_4      Unary pose weight                                                10
λ_5      Pairwise pose weight                                             10
λ_6      Stereo weight                                                    1
λ_7      Joint weight penalising background pixels close to body parts    10
λ_8      Joint weight penalising foreground pixels far from body parts    10
λ_9      Joint stereo-segmentation weight                                 0.1
Table 4.4 specifies the weights learned for our framework. In general, the terms relating to pose were found to be most important, and the joint stereo-segmentation term least important, perhaps to compensate for the fact that this term is used in far more places than other terms (W × H × D pairwise connections).
4.6 Experiments
The dual decomposition formulation described in this chapter is evaluated using the
H2view dataset, which was described in Section 4.2. For each experiment, the training
set is used for two tasks: to train the pose estimation algorithm we use to generate
proposals, and to learn the weights that we attach to the energy terms. These weights
are learned by coordinate ascent, as detailed in Section 4.5.
4.6.1 Performance
Some qualitative stereo results are given in Figure 4.12. Due to the difference between the fields of view of the stereo and Kinect cameras, ground truth disparity values are only available for the foreground objects, some background objects, and surrounding floor space. Comparing our stereo results (Figure 4.12(b)) to the ground truth disparity
values for these objects (Figure 4.12(c)) shows a good correspondence between the two.
Table 4.5: Results on the H2view test set. The entire framework achieves a significant performance improvement compared to the joint segmentation and pose terms.
Method              Precision   Recall    F-Score   Overlap
GrabCut [100]       43.03%      93.64%    58.96%    43.91%
ALE [71]            83.19%      73.58%    78.09%    64.29%
f_S                 55.11%      87.28%    67.56%    56.59%
f_S + f_PS          63.47%      90.41%    74.59%    62.65%
Entire framework    69.33%      89.61%    78.18%    67.62%
Segmentation
We compare the segmentation performance of our framework against the methods of
GrabCut [100] and ALE [71], and quantitative results are given in Table 4.5. While our
segmentation term alone does not perform as well as ALE does, the inclusion of pose
and depth information allows us to surpass the result of ALE. In Figure 4.13, we give
examples of how our performance improves as more terms are added to the framework.
Pose Estimation
To evaluate pose estimation, we measure the percentage of correctly localised body parts
(PCP) [37]. Quantitative results are given in Table 4.6, while we include qualitative
results in Figure 4.14.
Our model exhibits a significant performance increase for the upper body, where the segmentation cues are the strongest. However, there is a slight reduction in performance for the upper and lower legs. The baseline performance from Yang and Ramanan's algorithm (taking the highest scoring pose for each image) is 69.85% PCP; our joint inference model allows us to improve this performance to 75.37%. Our model can successfully correct mistakes made by Yang and Ramanan's algorithm; an example of this is shown in Figure 4.14(b). It should be noted that the other algorithms do not use stereo information, so it is likely that some of the improvement obtained by our algorithm is due to this extra information.
(a) RGB image (b) Disparity map (c) Ground truth depth (d) Segmentation result (e) Ground truth segmentation
Figure 4.12: Sample stereo and segmentation results. The Kinect depth data used to generate the ground truth depth in (c) is only available for some pixels, due to the slightly different field of view of the camera.
(a) Original image (b) Segmentation term only (c) Joint segmentation and pose (d) Joint segmentation, pose, and depth estimation (e) Ground truth segmentation
Figure 4.13: Segmentation results comparing the three stages of the algorithm. Left: some head pixels are recovered, although the arms are still missing. Middle: the new terms remove progressively more false positives. Right: the segmentation term already achieves a good result, which is maintained when more terms are added.
(a) Pose, segmentation and stereo results together. (b) Error correction: the first estimate (first image) misclassifies the left leg (red), while the second estimate (second image) gets it right; our segmentation (third image) and stereo (fourth image) cues enable us to recover (fifth image).
Figure 4.14: Some sample results from our new dataset.
Table 4.6: Results (given in % PCP) on the H2view test sequence, compared with Yang and Ramanan's algorithm.
Method          Torso   Head   Upper arm   Forearm   Upper leg   Lower leg   Total
Ours            94.9    88.7   74.4        43.4      88.4        78.8        75.37
Yang [131]      72.0    87.3   61.5        36.6      88.5        83.0        69.85
Andriluka [1]   80.5    69.2   60.2        35.2      83.9        76.0        66.03
4.6.2 Runtime
Our entire framework requires around 2.75 minutes per frame, using an unoptimised single-threaded algorithm on a 2.67GHz processor. The long time required is due to the graph construction for problem L_3, which requires millions of vertices to be added to capture the relationship between segmentation and disparity values. This runtime does not include the observed runtime of Yang and Ramanan's pose estimation algorithm (about 10 seconds per frame using the Linux implementation provided on their website [132]; a faster run-time was quoted in their paper, albeit for smaller images) [131]. This is much slower than the speed of 15 fps required for real-time applications. However, the entire algorithm (including using Yang and Ramanan's method to obtain pose estimates) is quicker than that of Andriluka et al. [1], which requires around 3 minutes per frame. On the other hand, the framework is still slower than ALE and GrabCut, which require around 1.5 and 3 seconds per frame, respectively [71, 100].
4.7 Discussion
In this chapter, we have described a novel formulation for solving the problems of human segmentation, pose estimation and depth estimation, using a single energy function. The algorithm we have presented is self-contained, and exhibits high performance in all three tasks. In addition, we have introduced an extensive, fully annotated dataset for 3D human pose estimation.
The algorithm is modular in design, which means that it would be straightforward to substitute alternative approaches for each slave problem; a thorough survey of the efficacy of these combinations would be a promising direction for future research. Further, the formulation could be extended to incorporate cues from video, for example motion tracks. This data could be processed sequentially, or alternatively, entire sequences could be optimised offline.
The most significant drawback of this approach is the speed; at an average of 2.75 minutes per frame, it takes just over three days for a single-threaded test run. One of the main aims of the next two chapters is to improve the speed of the framework. We have noted the large number of vertices and edges needed to allow the disparity map to influence the segmentation result. In the next chapter, we will restructure our framework in order to bypass this issue.
Acknowledgments
The basis for the H2view dataset was provided by SCEE London Studio. This included
camera calibration, data capture, and python code for translating the segmentation
masks, joint locations and depth data obtained via Kinect into the image co-ordinates
corresponding to the (left) stereo camera. The correction of the pose estimates in the
dataset was performed by myself, using an annotation tool in MATLAB which I created
for the task.
The main codebase was written by myself, although the segmentation algorithm used
was (heavily) adapted from code released by Talbot and Xu [118, 119].
Credit is also due to the co-authors of the BMVC student workshop publication:
Jonathan Warrell contributed significantly to the discussions in which the formulation followed in this chapter was refined;
Yuhang Zhang, a visiting student at the time of these discussions, also made some contribution;
Phil Torr, my PhD supervisor, and Nigel Crook, my second supervisor, provided
feedback based on results and paper drafts throughout the project.
Chapter 5
A Robust Stereo Prior for Human
Segmentation
We demonstrated in the previous chapter that using a dual decomposition framework can achieve good results on human pose estimation, segmentation and depth estimation. However, the main problem with the framework was the time required to run the algorithm. The most significant bottleneck in the framework of the previous chapter was found to be caused by the addition of millions of vertices in the graph to unify the stereo and segmentation tasks. In this chapter, we show how a high-quality segmentation result can be obtained directly from the disparity map.
In human segmentation, the goal is to obtain a binary labelling of an image, showing which pixels belong to the person of interest. The human will generally occupy a continuous region of the scene, and the depth of the human will be smooth within this region. In contrast, the background region tends to have a different depth to that of the human. This property motivates our method of generating a segmentation prior: after finding a small set of seed pixels which have high probability of belonging to the foreground object, we use a flood fill algorithm to find a continuous region of the image which has a smooth disparity.
As shown in Figure 5.1, a flood fill prior can generate promising human segmentations by itself. We show how to integrate this prior into our inference framework from Section 4.3, improving the speed, and also providing a slight increase in performance.
The chapter is structured as follows: related work is summarised in the next section, describing the range expansion formulation. Then, the flood fill prior is introduced in Section 5.2. Our modified inference framework is described in Section 5.3. Experimental results follow in Section 5.4, and concluding remarks are given in Section 5.5.
An earlier version of this chapter previously appeared as [107].
5.1 Related Work
The use of priors for segmentation has made a significant difference for a variety of computer vision problems in recent years. Many of these approaches can be traced back to the introduction of GrabCut [100], an interactive method in which the user specifies a rectangular region of interest, and then the foreground region is refined based on that region using graph cuts. Larlus and Jurie [73] used a similar approach, using object detection to supply bounding boxes and thus automate the process. More sophisticated priors have been used in works such as ObjCut and PoseCut, which use the output of object detection and pose estimation algorithms respectively to provide a shape prior term [16, 66].
Lower-level priors are used by Gulshan et al. [47], who formulate the task of segmenting humans as a learning problem, using linear classifiers to predict segmentation masks from HOG descriptors. Ladický et al. [70] showed how multiple such features and priors can be combined, avoiding the need to make an a priori decision of which is most appropriate. Their Associative Hierarchical Random Field (AHRF) model has demonstrated state of the art results for several semantic segmentation problems.
In this chapter, we use the intuition that foreground objects will occupy a continuous region of 3D space, and will therefore exist within a smooth range of depth values. Given the bijective relation between depth and disparity, we show how disparity can be used as a discriminative tool to aid segmentation, and propose a novel segmentation prior based on the disparity map produced by a stereo correspondence algorithm.
Figure 5.1: The seeds (in red) on the disparity map in (a) are used to generate the flood fill prior in (b), which is a good approximation of the ground truth segmentation (c).
Many approaches exist for solving the stereo correspondence problem, and a review of recent methods can be found in Section 3.4. In this chapter, we follow the approach of Kumar et al. [67], who proposed an efficient way to solve the stereo correspondence problem using graph cuts. They note that the disparity labels form a discrete range of possible values, and that therefore the stereo correspondence problem is suitable for the application of range moves. A range move is a move-making approach where rather than only considering one or two labels at each iteration, a range of several possible values is considered. Kumar et al. define two different range move formulations; the one we use in this chapter is range expansion. In a single range expansion move, each pixel can either keep its old label, or choose from a range of consecutive labels, in this case disparity values.
5.1.1 Range Move Formulation
Given a stereo pair of images L and R, our aim is to solve the stereo correspondence problem, and find a disparity map. In order to solve this problem via graph cuts, we construct a set of pixels Z = {Z_1, Z_2, . . . , Z_N}, and a set of disparity labels D = {0, 1, . . . , M−1}. A solution obtained via graph cuts is referred to as a labelling z, where each pixel variable Z_m is assigned a label d_m ∈ D. The set of possible labellings is referred to as the label space.
Move-making algorithms find a local minimum solution to a graph cut problem by making a series of moves in the label space, where each move is governed by a set of specific rules. One of the more popular moves is α-expansion [15], where each variable can either retain its current label, or change its label to α. When the set of labels is ordered (as D is), it is possible to efficiently obtain a local minimum by considering a range of labels rather than just one. This can be done by using the range expansion algorithm, as proposed by Kumar et al. [67].
At each iteration k of the range expansion algorithm, let z^(k) be the current labelling of pixels; each pixel Z_m has a disparity label d_m^(k). We consider an interval D^(k) of labels, where D^(k) = {α^(k), α^(k)+1, . . . , β^(k)}. The option is provided for each pixel to either keep its current label, d_m^(k+1) = d_m^(k), or choose a new label d_m^(k+1) ∈ D^(k). Formally, we find a labelling z^(k+1) satisfying:
z^(k+1) = arg min_z f_D(z) ,   (5.1)
such that ∀ Z_m ∈ Z : d_m^(k+1) = d_m^(k) OR d_m^(k+1) ∈ D^(k) ,
where f_D is the energy function of the stereo correspondence problem, with unary potentials Φ_D, pairwise potentials Ψ_D, and weight parameters λ_1 and λ_2:
f_D(z) = λ_1 ∑_{Z_m} Φ_D(Z_m, d_m) + λ_2 ∑_{(Z_{m_1}, Z_{m_2}) ∈ C} Ψ_D(d_{m_1}, d_{m_2}) ,   (5.2)
where C is the set of neighbouring pairs of pixels in Z, Ψ_D is the L_1 norm on the depth difference, and d_{m_1} and d_{m_2} are the disparity values of the pixels Z_{m_1} and Z_{m_2} respectively.
To minimise this energy function, we follow the formulation of [67]. A graph G^(k) = {V^(k), E^(k)} with source s and sink t is created, and for each pixel Z_m, nodes v_{(m,α^(k))}, . . . , v_{(m,β^(k))} are added.
The edges (s, v_{(m,α^(k))}) and (v_{(m,j)}, v_{(m,j+1)}) ∈ E^(k), where α^(k) ≤ j < β^(k), have as their capacities the unary potentials in f_D. If the current disparity d_m^(k) is outside the interval D^(k), then the unary potential is used as the capacity of the edge (s, v_{(m,α^(k))}):
c^(k)(s, v_{(m,α^(k))}) = Φ_D(Z_m, d_m^(k)) , if d_m^(k) ∉ D^(k) ;   (5.3a)
c^(k)(s, v_{(m,α^(k))}) = ∞ , otherwise ,   (5.3b)
while the capacities of the edges (v_{(m,j)}, v_{(m,j+1)}) are given by:
c^(k)(v_{(m,j)}, v_{(m,j+1)}) = Φ_D(Z_m, j) .   (5.4)
Finally, the unary potential for disparity β^(k) is used as the capacity of the edge (v_{(m,β^(k))}, t):
c^(k)(v_{(m,β^(k))}, t) = Φ_D(Z_m, β^(k)) .   (5.5)
For neighbouring pixels Z_m and Z_{m′}, edges are added between all nodes v_{(m,j)} and v_{(m′,j′)} for j and j′ in the range {α^(k), . . . , β^(k)}, except for the case where j = j′ = α^(k). These edges store the pairwise potentials:
c^(k)(v_{(m,j)}, v_{(m′,j′)}) = Ψ_{(Z_m, Z_{m′})}(j, j′) .   (5.6)
If at least one of d_m^(k) and d_{m′}^(k) is not in D^(k), then an additional edge is added. If d_m^(k) ∈ D^(k) and d_{m′}^(k) ∉ D^(k), the edge (v_{(m,α^(k))}, v_{(m′,α^(k))}) is added, with capacity L + κ/2, where L = β^(k) − α^(k) is the length of the interval, and κ is a constant. If d_m^(k) ∉ D^(k) and d_{m′}^(k) ∈ D^(k), then the edge (v_{(m′,α^(k))}, v_{(m,α^(k))}) is added with the same capacity. If d_m^(k) ∉ D^(k) and d_{m′}^(k) ∉ D^(k), then an additional node ω_{(m,m′)} is added, with edges:
c^(k)(v_{(m,α^(k))}, ω_{(m,m′)}) = c^(k)(ω_{(m,m′)}, v_{(m,α^(k))}) = L + κ/2 ;   (5.7)
c^(k)(v_{(m′,α^(k))}, ω_{(m,m′)}) = c^(k)(ω_{(m,m′)}, v_{(m′,α^(k))}) = L + κ/2 ;   (5.8)
c^(k)(s, ω_{(m,m′)}) = Ψ_{(Z_m, Z_{m′})}(d_m^(k), d_{m′}^(k)) + κ .   (5.9)
The formulation is given in full in [67], and an example of a sequence of range move
iterations is shown in Figure 5.2.
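At a high level, the algorithm simply sweeps candidate intervals over the disparity range and accepts each move if it lowers the energy; one graph cut of the form above is solved per interval. A schematic outer loop (illustrative Python; solve_range_move and energy are assumed helper functions wrapping the graph construction of (5.3)-(5.9) and the energy (5.2)):

def range_expansion(initial_labelling, num_labels, interval_length, solve_range_move, energy, num_sweeps=5):
    # initial_labelling : array mapping each pixel to a starting disparity
    # solve_range_move(labelling, interval) : one graph cut -- each pixel either keeps its
    #     current label or takes one from `interval`
    # energy(labelling) : evaluates f_D from (5.2)
    labelling = initial_labelling
    for _ in range(num_sweeps):
        for start in range(0, num_labels, interval_length):
            interval = list(range(start, min(start + interval_length, num_labels)))
            candidate = solve_range_move(labelling, interval)
            if energy(candidate) < energy(labelling):   # accept only energy-decreasing moves
                labelling = candidate
    return labelling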
5.2 Flood Fill Prior
After using range expansions to find the disparity map, we apply a flood fill to generate a prior for human segmentation. Flood fill is an algorithm that, given a list of seeds, fills connected parts of a multidimensional array, e.g. an image, with a given label. Starting with two pixels which have a high probability of belonging to the human, we find a continuous region of the image which has a smooth disparity. This region is specified by a binary image F.
The flood fill algorithm is a standard method that has many applications: for example, it can be used for bucket filling in graphics editors such as GIMP or Microsoft Paint;
Figure 5.2: The results of three successive range expansion iterations. At first, there are very few disparities to choose from, but with further iterations, the person becomes more visible against the background. (a): disparity map after the first range move iteration; (b): after the second; (c): after the third.
Algorithm 5.1 Generic flood fill algorithm for an image I of size W × H.
Input: non-empty list S of seed pixels
initialise a zero-valued matrix F of size W × H, and an empty queue R of new ranges
for all seeds s = (s_x, s_y) ∈ S do
  v ← I(s_x, s_y)
  (F, x_min, x_max) ← doLinearFill(F, I, s_x, s_y, v)
  if x_min < x_max then
    add (x_min, x_max, s_y, v) to back of queue R
  end if
  while R is not empty do
    remove (x_min, x_max, y, v) from front of R
    if y > 0 then
      (R, F) ← multiLinearFill(F, I, x_min, x_max, y − 1, v)
    end if
    if y < H − 1 then
      (R, F) ← multiLinearFill(F, I, x_min, x_max, y + 1, v)
    end if
  end while
end for
return F
a memory-efficient version is given in general form in Algorithm 5.1.¹ The function doLinearFill(F, I, s_x, s_y, v), given in Algorithm 5.2, finds the largest range (x_min, x_max), where x_min ≤ s_x ≤ x_max, such that for all x, x+1 within the range, |I(x, s_y) − v| < τ, where v is the intensity value of the seed pixel and τ is a preset matching threshold. In other words, it finds the largest horizontal region of the image containing (s_x, s_y) that has an intensity value close to that of the seed pixel.
The function multiLinearFill(F, x_min, x_max, y, v) executes doLinearFill starting from each pixel (x, y) where x_min ≤ x ≤ x_max, provided |I(x, y) − v| < τ and F(x, y) = 0, i.e. (x, y) has not already been filled by the algorithm. Each execution of doLinearFill generates a new range (x′_min, x′_max); if x′_min < x′_max, then (x′_min, x′_max, y, v) is added to R.
We use the flood fill algorithm to generate a segmentation of the human from the disparity map. We run the pose estimation algorithm of Yang and Ramanan [131] on the original RGB image, and then use the endpoints of the top-ranked torso estimate as seeds.
¹ The algorithm was written by Julien Valentin, a member of the Brookes vision group, and was based on the implementation of Dunlap [29].
Algorithm 5.2 doLinearFill: perform a linear fill from the seed point (s_x, s_y).
Input: image I, current flood fill map F, starting points s_x and s_y, and initial seed value v.
Require: 0 ≤ s_x < W, 0 ≤ s_y < H
if |I(s_x, s_y) − v| > τ then
  return (F, s_x, s_x)
end if
F(s_x, s_y) ← 1
x ← s_x
while x < W − 1 and |I(x + 1, s_y) − v| < τ do
  F(x + 1, s_y) ← 1
  increment x
end while
x_max ← x, x ← s_x
while x > 0 and |I(x − 1, s_y) − v| < τ do
  F(x − 1, s_y) ← 1
  decrement x
end while
x_min ← x
return (F, x_min, x_max)
To determine whether a value I(x, y) matches its neighbour I(x + Δx, y + Δy) (where |Δx| + |Δy| = 1), we consider the difference between the two values; the pair of values match if the difference is below a preset threshold.
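For intuition, an equivalent queue-based flood fill over the disparity map can be written compactly (illustrative Python, not the C++ implementation used in this work; tau is the assumed matching threshold):

from collections import deque
import numpy as np

def flood_fill_prior(disparity, seeds, tau):
    # disparity : H x W array; seeds : list of (x, y) pixels assumed to lie on the person
    H, W = disparity.shape
    prior = np.zeros((H, W), dtype=np.uint8)
    queue = deque((x, y, disparity[y, x]) for (x, y) in seeds)
    while queue:
        x, y, v = queue.popleft()
        if not (0 <= x < W and 0 <= y < H) or prior[y, x]:
            continue
        if abs(float(disparity[y, x]) - float(v)) >= tau:
            continue                                   # disparity jump: stop growing the region
        prior[y, x] = 1
        v_here = disparity[y, x]
        # grow into the 4-connected neighbours, each compared against the current pixel
        queue.extend([(x + 1, y, v_here), (x - 1, y, v_here),
                      (x, y + 1, v_here), (x, y - 1, v_here)])
    return prior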
If the person is fully visible, then we will also be able to see the point where their feet touch the floor. At this point, the disparity value of the floor is often very similar to the disparity value of the foot, and so it is possible for the segmentation prior to leak on to the floor. In order to prevent this, we estimate the position of the floor plane.
The disparity d_m of a pixel Z_m has an inverse relation to its depth z_m,
z_m = b (f_x + f_y) / (2 d_m) ,   (5.10)
where b is the baseline (the distance between the camera centres) of the stereo camera, and f_x and f_y are the focal lengths in x and y respectively (diagonal elements of the camera matrix).
Once we have the depth z_m, the real-world height can be obtained via a projective transformation. We also assume that the camera is parallel to the floor plane, and we know its height above the ground at capture time, so by thresholding the height values, we can obtain an estimate for the floor plane. A typical flood fill prior is shown in Figure 5.1.
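A small sketch of this floor test (illustrative Python; the camera height, principal point and a height tolerance are assumed known from calibration) converts disparities to depths via (5.10) and flags pixels whose reconstructed height lies near the floor:

import numpy as np

def floor_mask(disparity, fx, fy, cy, baseline, camera_height, tol=0.05):
    # disparity : H x W array (zeros treated as invalid)
    H, W = disparity.shape
    d = np.where(disparity > 0, disparity, np.nan)
    depth = baseline * (fx + fy) / (2.0 * d)            # equation (5.10)
    rows = np.arange(H).reshape(-1, 1)
    # back-project each row to a height above the floor, assuming a level camera
    # mounted camera_height metres above the ground
    y_camera = (rows - cy) * depth / fy                 # vertical offset below the optical axis
    height_above_floor = camera_height - y_camera
    return np.abs(height_above_floor) < tol             # True where the pixel lies on the floor

Pixels flagged by such a mask can then be excluded from the flood fill prior, preventing the leak described above.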
5.3 Application: Human Segmentation
Here, we discuss our modifications to the inference framework of Section 4.3. To recap, the objective is to minimise an energy function which consists of five terms: three representing the distinct problems that the function unifies (i.e. stereo correspondence, segmentation, and pose estimation), and two joining terms, which encourage consistency between the solutions of these problems. As input, we have a stereo pair of images L and R. The parts-based pose estimation algorithm of Yang and Ramanan [131] is used as a preprocessing step to obtain a number N_E of proposals for K = 10 body parts: head, torso, and two for each of the four limbs. For each part i, each proposal j consists of a pair of image co-ordinates, representing the endpoints of the limb (or skull, or spine).
5.3.1 Original Formulation
The approach is formulated as a conditional random field (CRF), with two sets of random variables. The set Z = {Z_1, Z_2, . . . , Z_N} represents the image pixels; in addition to a disparity label from the multi-class label set D defined in Section 5.1.1, each pixel Z_m is given a binary segmentation label s_m from S = {0, 1}. The set B = {B_1, B_2, . . . , B_K} represents the body parts; a label b_i is assigned to each part B_i from the multi-class label set B. Any possible assignment of labels to these variables is called a labelling, and denoted by z. As it was in (4.22), the energy of a labelling is written as:
E(z) = f_D(z) + f_S(z) + f_P(z) + f_PS(z) + f_SD(z) ,   (5.11)
with each term containing weights λ_i ∈ R^+.
In order to utilise range moves and apply the new flood fill prior, we need to make some amendments to the formulation given in Section 4.3. We modify three of the terms: f_D, which gives the cost of the disparity label assignment {d_m}_{m=1}^{N}; f_S, which gives the cost of the segmentation label assignment {s_m}_{m=1}^{N}; and f_SD, which unifies the two. We now describe in turn each of the terms in (5.11), along with our modifications to these terms. For clarity, the terms as they appeared in Chapter 4 will be denoted by f, while our new functions will be denoted by g. Weights carried over from the original formulation will be denoted by λ (with the same index), and new weights will be denoted by w.
5.3.2 Stereo Term f_D
The energy f_D(z) represented the disparity map. In (4.31), the energy was simply given in terms of a cost volume Φ_D, which for each pixel Z_m specified the cost of assigning a disparity label d_m. This cost incorporated the gradient in the x-direction, so a pairwise term was not added:
f_D(z) = λ_1 ∑_{Z_m} Φ_D(Z_m, d_m) .   (5.12)
In order to apply range moves, we introduce a truncated pairwise term to formalise the notion that adjacent pixels should have similar disparities. After the k-th iteration, we denote by d_m^(k) the labelling of each pixel Z_m. We define C to be the set of pairs of neighbouring pixels (Z_m, Z_{m′}). Given a labelling z, the pairwise cost associated with a pair of pixels (Z_m, Z_{m′}) ∈ C is:
Ψ_{(Z_m, Z_{m′})}(d_m^(k), d_{m′}^(k)) = |d_m^(k) − d_{m′}^(k)| ( γ_1 + γ_2 exp( −Δ_D(Z_m, Z_{m′}) / γ_3 ) ) ,   (5.13)
where
Δ_D(Z_m, Z_{m′}) = min( ‖L(m) − L(m′)‖ , γ_4 ) ,   (5.14)
for some given maximum distance γ_4. The overall energy of the labelling is:
g_D(z) = λ_1 ∑_{Z_m} Φ_D(Z_m, d_m^(k)) + w_1 ∑_{(Z_m, Z_{m′}) ∈ C} Ψ_{(Z_m, Z_{m′})}(d_m^(k), d_{m′}^(k)) ,   (5.15)
where λ_1, w_1, γ_1, γ_2 and γ_3 are preset weight parameters.
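As a concrete reading of (5.13)-(5.14), the cost for one neighbouring pair can be computed directly (illustrative Python; intensity_diff is the colour difference between the two pixels in the left image, and gamma_1 to gamma_4 are the preset parameters):

import math

def pairwise_disparity_cost(d_m, d_m2, intensity_diff, gamma_1, gamma_2, gamma_3, gamma_4):
    # Contrast-sensitive penalty: disparity jumps are cheaper across strong image edges.
    delta = min(intensity_diff, gamma_4)                                        # truncation (5.14)
    return abs(d_m - d_m2) * (gamma_1 + gamma_2 * math.exp(-delta / gamma_3))   # (5.13)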
5.3.3 Segmentation Terms f_S and f_SD
As introduced in (4.27), f_S(z) gave the segmentation energy. The part estimates obtained from Yang and Ramanan's algorithm were used to create a foreground weight map W_F, from which Gaussian mixture models based on RGB values were fitted for the foreground and background respectively, built using the Orchard-Bouman clustering algorithm [88].² These were used to obtain unary costs Φ_F and Φ_B of assigning each pixel to foreground and background respectively. Additionally, a pairwise cost Ψ_S was included to penalise the case where adjacent pixels are assigned to different labels:
f_S(z) = λ_2 Φ_S(z) + λ_3 Ψ_S(z) ,   (5.16)
where:
Φ_S(z) = ∑_{Z_m ∈ Z} [ s_m Φ_F(Z_m) + (1 − s_m) Φ_B(Z_m) ] ;   (5.17)
Ψ_S(z) = ∑_{(Z_{m_1}, Z_{m_2}) ∈ C} 1(s_{m_1} ≠ s_{m_2}) exp( −‖L(m_1) − L(m_2)‖² ) .   (5.18)
² This algorithm was used in the GrabCut implementation on which the code used in this work was based [119], and was found empirically to give the best separation between foreground and background.
f_SD(z), detailed in (4.37), was a cost function that penalised the case where pixels with high disparity were assigned to the background, contrary to the notion that the foreground object is in front of the background. Using the foreground weight map W_F defined above, a set F of pixels with a high probability of being foreground was obtained, from which the expected foreground disparity E_F was calculated. Background pixels with disparity greater than E_F + Δ, where Δ is a non-negative slack variable, were then penalised:
f_SD(z) = λ_4 ∑_{Z_m ∈ Z} (1 − s_m) max(d_m − E_F, 0) .   (5.19)
We find that a flood fill prior proves to be much more reliable in discriminating between high-disparity pixels that belong to the object, and those far from the object. Therefore, we replace f_SD with a flood fill-based term. Define F^(k) = {φ_m^(k) : Z_m ∈ Z} to be the flood fill prior at iteration k. We can penalise disagreements between F^(k) and the segmentation s using a Hamming distance between this prior and the current segmentation mask: g_SD(z) = w_2 ∑_{Z_m ∈ Z} 1(s_m ≠ φ_m), where w_2 is a preset weighting term. g_SD now depends only on the flood fill prior and the segmentation. Since the flood fill prior at each iteration is fixed, we can merge this term with f_S, with the former being subsumed by the unary segmentation potential. This becomes:
Φ_S(z, φ) = ∑_{Z_m ∈ Z} [ λ_2 ( s_m Φ_F(Z_m) + (1 − s_m) Φ_B(Z_m) ) + w_2 1(s_m ≠ φ_m) ] ,   (5.20)
where Φ_F and Φ_B are as defined in (5.16). The overall segmentation energy is:
g_S(z, φ) = Φ_S(z, φ) + λ_3 Ψ_S(z) .   (5.21)
5.3.4 Pose Estimation Terms f_P and f_PS
The terms relating to pose estimation are unchanged from the framework described in Section 4.3, and appear here for ease of reference. f_P(z), given in (4.30), denotes the cost of a particular selection of parts {b_i}_{i=1}^{10}. Each proposal j for each part i has an associated unary cost Φ_P(i, j); connected parts i_1 and i_2 have associated pairwise costs, to penalise the case where parts that should be connected are distant from one another in image space. Defining T as the set of connected parts (i_1, i_2), the cost function is:
g_P(z) = f_P(z) = λ_5 ∑_{i=1}^{10} Φ_P(i, b_i) + λ_6 ∑_{(i_1, i_2) ∈ T} Ψ_{i_1,i_2}(b_{i_1}, b_{i_2}) .   (5.22)
f_PS(z), given in (4.35), encodes the relation between segmentation and pose estimation: we expect foreground pixels to be close to some body part, and we expect pixels close to some body part to be foreground. The energy is written as:
g_PS(z) = f_PS(z) = λ_7 ∑_{i,j} ∑_{Z_m} [ 1(b_i = j) (1 − s_m) w^m_{ij} ] + λ_8 ∑_{Z_m} 1( max_{i,j} w^m_{ij} < ε ) s_m .   (5.23)
5.3.5 Energy Minimisation
Our new energy function, with the terms defined in Sections 5.3.2 and 5.3.3, is as follows:
E(z) = g_D(z) + g_S(z) + g_P(z) + g_PS(z) .   (5.24)
In order to efficiently minimise this energy function, the label set B is binarised. Each body part B_i takes a vector of binary labels b_i = [b_{(i,1)}, b_{(i,2)}, . . . , b_{(i,N_E)}], where each b_{(i,j)} is equal to 1 if and only if the j-th proposal for part i is selected. Note that since we solve g_D efficiently using range moves, and the term linking stereo and segmentation is now included in g_S, we no longer need to binarise D, as we can efficiently find a multi-class solution. This has an effect on the cost vector θ_D, which we discuss in Section 5.3.6.
An assignment of binary labels to all pixels and parts is denoted by z. We introduce duplicate variables z_1 and z_2, and add Lagrangian multipliers to penalise disagreement between z and its two copies, forming the Lagrangian dual of the energy function (5.24).
(a) Message-passing structure from Chapter 4. (b) New message-passing structure.
Figure 5.3: Diagram showing the two-stage update process. Left: the slaves find labellings z, z_1, z_2 and pass them to the master; note that this is the same for both the old and new formulations. Right: the master updates the cost vectors θ_D and θ_P and passes them to the slaves. In the new formulation, the flood fill prior is passed to the slave L_3 instead of the cost vector θ_D.
This is then divided into three slave problems using dual decomposition:
L(z, z_1, z_2, φ) = g_D(z_1) + g_S(z, φ) + g_P(z_2) + g_PS(z) + θ_D·(z − z_1) + θ_P·(z − z_2)   (5.25)
                  = L_1(z_1, θ_D) + L_2(z_2, θ_P) + L_3(z, θ_D, θ_P, φ) ,
where:
L_1(z_1, θ_D) = g_D(z_1) − θ_D·z_1 ;   (5.26)
L_2(z_2, θ_P) = g_P(z_2) − θ_P·z_2 ;   (5.27)
L_3(z, θ_D, θ_P, φ) = g_S(z, φ) + g_PS(z) + θ_D·z + θ_P·z .   (5.28)
Figure 5.3 shows the decomposition structure used in Section 4.4, and how it relates to our new decomposition structure. The change in L_3 means that the messages passed between the slaves and the master also change. The new message-passing structure sees the flood fill prior passed from the master to the slave, instead of θ_D.
5.3.6 Modifications to the θ_D Vector
Removing the explicit statement of f_SD means that the loss function L_3 no longer depends on finding values for stereo. Recall that an expected foreground disparity E_F, and a non-negative slack variable Δ, were defined in Section 5.3.1. This gives us a range [E_F − Δ, E_F + Δ] of disparity values which we associate with the foreground region.
When a segmentation result has been obtained, the θ_D cost vector can be adjusted by comparing the flood fill prior φ^(k) with the segmentation result s^(k+1), and altering the cost for those pixels for which the segmentation result disagrees with the flood fill prior. If a pixel is segmented as foreground despite the flood fill prior encouraging it to be segmented as background, this implies that the disparity costs within this range should be reduced, and the costs outside the range increased; conversely, if a pixel is background despite the flood fill insisting it is foreground, then the costs for disparities within the range should be increased. Therefore, we set the cost update vector Δθ_D as follows:
Δθ_D(x, y, d) = φ^(k)(x, y) − s^(k+1)(x, y) ,  if |d − E_F| < Δ ;   (5.29)
Δθ_D(x, y, d) = s^(k+1)(x, y) − φ^(k)(x, y) ,  otherwise .   (5.30)
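This update translates directly into code; a sketch (illustrative Python; prior and segmentation are binary H × W arrays, and the returned volume is added to the disparity costs before the next range-move pass):

import numpy as np

def theta_D_update(prior, segmentation, num_disparities, E_F, slack):
    # prior        : flood fill prior phi^(k), binary H x W array
    # segmentation : current segmentation s^(k+1), binary H x W array
    H, W = prior.shape
    disagreement = prior.astype(np.float32) - segmentation.astype(np.float32)   # phi - s
    update = np.zeros((H, W, num_disparities), dtype=np.float32)
    for d in range(num_disparities):
        if abs(d - E_F) < slack:
            update[:, :, d] = disagreement        # inside the foreground disparity range: (5.29)
        else:
            update[:, :, d] = -disagreement       # outside the range: (5.30)
    return update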
5.4 Experiments
To test the effects that our modifications to the formulation of Section 4.3 had on the results, we evaluate our algorithm on the H2view dataset. To evaluate pose estimation, we use the standard probability of correct pose (PCP) criterion; in this section, we also present quantitative segmentation results, using the overlap (intersection over union) metric.
5.4.1 Segmentation
We used the H2view test sequence of 1598 images to evaluate our approach; for comparison, we also tested GrabCut [100], ALE [71], and our original algorithm. We achieve a segmentation accuracy of 69.23%; some qualitative results are shown in Figure 5.4.
(a) RGB image (b) Result from Chapter 4 (c) Our result (d) Ground truth segmentation
Figure 5.4: Sample segmentation results. Note that the flood fill prior correctly segments the outstretched arm in the third example. The prior also leads to a less blocky appearance of the segmentation results. In the first two examples, part of the window is segmented as foreground by the framework of Chapter 4, but the flood fill prior prevents this.
(a) Original image (b) Disparity map (c) Our result (Overlap: 83.46%) (d) Ground truth segmentation (e) ALE [71] (51.51%) (f) [100] on RGB (64.18%) (g) [100] on depth (51.84%) (h) [100]: two-stage (69.61%)
Figure 5.5: Sample segmentation results, with overlap scores for this image. Note that ALE and GrabCut [100] on RGB both classify parts of the window and other background areas as foreground, while GrabCut based on the disparity map fails to capture thin structures such as the arm and head.
Table 5.1: Segmentation results on the H2view dataset, compared with previous results. Using the RGB image to refine the result obtained by GrabCut on depth improves the result slightly, but our joint approach is superior.
Method              Precision   Recall    F-Score   Overlap
Two-stage GrabCut   60.00%      73.53%    66.08%    49.34%
GrabCut on depth    69.96%      53.24%    60.47%    45.14%
GrabCut on RGB      43.03%      93.64%    58.96%    43.91%
ALE [71]            83.19%      73.58%    78.09%    64.29%
Chapter 4           69.33%      89.61%    78.18%    67.62%
Ours                79.59%      83.23%    81.37%    69.23%
The accuracy is reasonably consistent over the sequence, with most images achieving accuracy close to the average, but there is room for improvement in temporal consistency via video-based cues. It should be noted that the segmentation results are quite sensitive to initialisation; if the torso is classified incorrectly, the flood fill prior is likely to be incorrect, leading to a bad segmentation result (Figure 5.6). Developing a resistance to this pitfall is a promising direction for future research.
These results compare favourably to those obtained by the other approaches we evaluated (a qualitative comparison is shown in Figure 5.5). The addition of the prior has improved the precision of our algorithm by 10% compared to the framework of Chapter 4, meaning that although the recall has dropped, the overall performance (overlap) increased by 1.6%. Although the precision achieved is still not quite as good as ALE, the recall is much better, meaning the overlap is almost 5% better.
In order to test GrabCut, we obtained initial bounding boxes by thresholding the foreground weight map W_F defined in Section 5.3.3. We ran the GrabCut algorithm on three passes through our test set: first, using solely the RGB images; second, on disparity values; and finally, running GrabCut on the disparity map and then refining the result with the RGB information. The results, given in Table 5.1, show that combining information from depth and RGB (as we do in two-stage GrabCut) can improve the segmentation results, but the accuracy can be improved by a large margin by using a flood fill prior.
Table 5.2: Results (given in % PCP) on the H2view test sequence.
Method          Torso   Head   Upper arm   Forearm   Upper leg   Lower leg   Total
Ours            96.3    92.4   77.3        42.0      89.3        81.4        76.86
Chapter 4       94.9    88.7   74.4        43.4      88.4        78.8        75.37
Yang [131]      72.0    87.3   61.5        36.6      88.5        83.0        69.85
Andriluka [1]   80.5    69.2   60.2        35.2      83.9        76.0        66.03
5.4.2 Pose Estimation
While we did not directly alter the pose estimation algorithm used in Chapter 4, our improved segmentation results also lead to a slight improvement in pose estimation results. Full quantitative results are given in Table 5.2.
5.4.3 Runtime
The framework presented has been implemented in a single-threaded C++ implementation, and runs at 25 seconds per frame on a 2.67GHz processor. This represents a significant speed improvement over the previous framework: the computational time per image has been decreased by a factor of 6.6. Again, this does not include the overhead caused by needing to run Yang and Ramanan's pose estimation to get the initial pose estimates [131].
5.5 Conclusion
In this chapter, we have demonstrated the applicability of the flood fill algorithm for generating segmentation priors. Disparity maps generated by stereo correspondence algorithms can be used as input, and given a set of seeds with high probability of belonging to the foreground object, accurate segmentation results can be obtained. We have also incorporated our prior into the dual decomposition framework of Chapter 4, greatly improving upon the segmentation results. Our segmentation results show that combining information from depth and RGB can improve the segmentation results, but the accuracy can be improved by a large margin by using a flood fill prior.
The results presented in this chapter, both for segmentation and for human pose estimation, can be further improved by adding video-based cues, such as motion tracks. This should improve temporal consistency, in particular giving the algorithm a chance to recover in the rare cases where the seeds used in flood fill were incorrect (shown in Figure 5.6). Indeed, the centroid of the segmented object could be used as a seed for the next frame.
Finally, we note that while the speed of the updated framework is significantly faster than that of Chapter 4, the dual decomposition algorithm is still much too slow for real-time applications such as video games. Recently, a powerful new approximation technique, mean field inference [57], has emerged, and has proven to be very effective in combining object segmentation and disparity computation [124]. In the next chapter, we show that our framework is also suitable for the application of mean field inference.
Acknowledgements
As with the previous chapter, this section contains attributions to the other co-authors
of the paper on which this chapter was based, as follows:
Julien Valentin coded an initial version of the flood fill algorithm used in this
chapter, and provided valuable feedback and comments while the formulation was
being adapted.
Phil Torr, my PhD supervisor, and Nigel Crook, my second supervisor, provided
feedback based on results and paper drafts throughout the project.
(a) The original image (b) Result of Chapter 4 (c) Our result (d) Ground truth
Figure 5.6: Failure cases of segmentation. In (c), incorrect torso detection means that the flood fill prior misleads the segmentation algorithm, resulting in the wrong region being segmented.
Chapter 6
An Efficient Mean Field Based Method for
Joint Estimation of Human Pose,
Segmentation, and Depth
While the methods developed in Chapters 4 and 5 show state-of-the-art results for segmentation and pose estimation, the slow speed (some 25 seconds per frame) makes them infeasible for development in video game applications, where real-time performance is required. Therefore, it is desirable to find an efficiently solvable approximation of the original problem.
One such method that can be applied here is mean field inference [57]. For a certain class of pairwise terms, mean-field inference has been shown to be very powerful in solving the object class segmentation and object-stereo correspondence problems in CRF frameworks, providing an order-of-magnitude speedup [124].
The main contribution of this chapter is the proposition of a highly efficient filter-based mean field approach to perform joint estimation of human segmentation, pose and disparity in the product label space. The application of mean-field inference produces a significant improvement in speed, as well as noticeable qualitative improvements.
As with the previous two chapters, the efficiency and accuracy of the new model is evaluated using the H2view dataset, introduced in Section 4.2. We show results for segmentation and pose estimation; disparity computation is used to improve these results, but is not quantitatively evaluated as it is not feasible to obtain dense ground truth data. We achieve a 20 times speedup compared to the current state-of-the-art methods [1, 70, 131], as well as achieving better accuracy in all cases.
The remainder of this chapter is structured as follows: in the next section, we give an introduction to mean-field inference. Our body part formulation is discussed in Section 6.2, while we describe our joint inference framework in Section 6.3. Results follow in Section 6.4, and the chapter concludes with a discussion in Section 6.5.
An earlier version of this chapter forms part of a paper that is under submission at a
major computer vision conference.
6.1 Mean Field Inference
While graph cut-based inference methods are fairly efficient for the task of object class segmentation, they fail to capture long-range interactions, which, as shown in Figure 6.1(b), can result in oversmoothed boundaries between object classes. An alternative approach is to use a CRF that is fully connected, i.e. one that has pairwise connections between every pair of pixels in the image, regardless of the distance between them. Despite the improvement in results that this gives, this is not a feasible way to improve results on large sets of images, since the time taken by a graph cut based inference method rises from 1 second (for the Robust P^n CRF [56]) to around 36 hours (for the fully connected CRF) [64]. Instead, mean-field inference can be used to solve this problem; the inference algorithm of Krähenbühl and Koltun [64] successfully recovers fine boundaries between object classes (as shown in Figure 6.1(c)), and only takes around 0.2 seconds to converge. The image on which these timings are based is 237 by 356 pixels; the performance of the algorithms on images of other sizes was not specified in [64].
(a) Image (b) Robust P^n CRF (c) Mean field inference for fully connected pairwise CRF (d) Ground truth
Figure 6.1: The Robust P^n CRF [56] produces an oversmoothed result on this image of a tree, from the MSRC-21 dataset [77, 111]. Using a fully connected CRF recovers the fine detail, but is prohibitively slow. Mean field inference is much quicker, and still achieves similar performance. (Images taken from [64].)
6.1.1 Introduction to Mean-Field Inference
Define the energy of a CRF with fully connected pairwise terms as:
E(x) = ∑_i ψ_u(x_i) + ∑_{i<j} ψ_p(x_i, x_j) ,   (6.1)
where i and j range from 1 to N, and N is the number of vertices in the graph. In the formulation of Krähenbühl and Koltun [64], the pairwise potential is defined as a linear combination k of Gaussian kernels:
ψ_p(x_i, x_j) = μ(x_i, x_j) k(f_i, f_j) ,   (6.2)
where the label compatibility function μ can be, for example, a Potts model:
μ(x_i, x_j) = 1(x_i ≠ x_j) ;   (6.3)
and k is a linear combination of bilateral and spatial Gaussian kernels:
k(f_i, f_j) = w^(1) exp( −|p_i − p_j|² / (2θ_α²) − |I_i − I_j|² / (2θ_β²) ) + w^(2) exp( −|p_i − p_j|² / (2θ_γ²) ) ,   (6.4)
where θ_α, θ_β and θ_γ are the kernel width parameters. We can use the Gibbs distribution to transform the energy function (6.1) into the probability of a given labelling x:
P(x|I) = P̃(x|I) / Z = (1/Z) exp(−E(x, I)) ,   (6.5)
where Z is a normalising constant.
where Z is a normalising constant.
In the mean eld approach, given the true probability distribution P(x|I) dened
above, we wish to nd an approximate distribution Q(x) which is tractable, and which
closely resembles the original distibution. We measure this closeness in terms of the
128
6.1. Mean Field Inference
KL-divergence between P and Q:
D
KL
(Q(x)||P(x|I)) =

xX
Q(x) log
Q(x)
P(x|I)
. (6.6)
Our goal is to nd the closest such Q from the tractable family of probability distributions
Q. This is dened as:
Q

= arg min
QQ
D
KL
(Q(x)||P(x|I)). (6.7)
One way of approximating the true distribution is to make independence assumptions,
i.e. to partition the set of variables into a collection of subsets, which are then assumed
to be independent from each other. The simplest possible such assumption is to assume
that all of the variables are independent:
Q(x) =

iV
Q
i
(x
i
). (6.8)
In this case, (6.6) can be evaluated as follows:
D_KL(Q(x) || P(x|I)) = ∑_{x∈X} Q(x) log ( Q(x) / P(x|I) )   (6.9)
 = ∑_{x∈X} Q(x) log Q(x) − ∑_{x∈X} Q(x) log P(x|I)   (6.10)
 = ∑_{i∈V} ∑_{x_i} Q_i(x_i) log Q_i(x_i) − ∑_{x∈X} Q(x) log P̃(x|I) + log Z(I)   (6.11)
 = ∑_{i∈V} ∑_{x_i} Q_i(x_i) log Q_i(x_i) + ∑_{x∈X} Q(x) E(x, I) + log Z(I) ,   (6.12)
where the final step is due to (6.5).
The marginal Q_i(x_i) that minimises (6.6) can then be found by analytically minimising a Lagrangian consisting of all terms in D_KL(Q(x)||P(x|I)); for a detailed derivation, the reader is referred to Koller and Friedman, Chapter 11.5 [57]. From Krähenbühl and Koltun [64], the marginal update equation is:
Q_i(x_i = l) = (1/Z_i) exp( −ψ_u(x_i) − ∑_{j≠i} ∑_{x_j} Q_j(x_j) ψ_p(x_i, x_j) ) ,   (6.13)
where the normalising constant Z_i is:
Z_i = ∑_{x_i} exp( −ψ_u(x_i) − ∑_{j≠i} ∑_{x_j} Q_j(x_j) ψ_p(x_i, x_j) ) .   (6.14)
The inference algorithm from [64] is given in Algorithm 6.1.
Algorithm 6.1 Naive mean field algorithm for fully connected CRFs
Initialise Q: Q_i(x_i = l) ← (1/Z_i) exp( −ψ_u(x_i) )
while not converged do
  Message passing: Q̃_i^(m)(l) ← ∑_{j≠i} k^(m)(f_i, f_j) Q_j(l) for all kernels m
  Compatibility transform: Q̂_i(x_i) ← ∑_{l∈L} μ^(m)(x_i, l) ∑_m w^(m) Q̃_i^(m)(l)
  Local update: Q_i(x_i) ← exp( −ψ_u(x_i) − Q̂_i(x_i) )
  Normalise Q_i(x_i)
end while
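To make the update concrete, a brute-force dense mean-field iteration with a Potts compatibility and a single spatial Gaussian kernel can be sketched as follows (illustrative Python/NumPy; a practical implementation would replace the quadratic-cost matrix product with the efficient permutohedral-lattice filtering of [64]):

import numpy as np

def naive_mean_field(unary, positions, theta, w, num_iters=5):
    # unary     : N x L array of unary potentials psi_u(x_i = l)
    # positions : N x 2 array of pixel coordinates (the features f_i of a spatial kernel)
    N, L = unary.shape
    Q = np.exp(-unary)
    Q /= Q.sum(axis=1, keepdims=True)                    # initialisation step of Algorithm 6.1
    sq_dist = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    k = w * np.exp(-sq_dist / (2.0 * theta ** 2))
    np.fill_diagonal(k, 0.0)                             # message passing excludes j = i
    for _ in range(num_iters):
        msg = k @ Q                                      # sum_j k(f_i, f_j) Q_j(l)
        pairwise = msg.sum(axis=1, keepdims=True) - msg  # Potts compatibility: mass on labels != l
        Q = np.exp(-unary - pairwise)                    # local update, as in (6.13)
        Q /= Q.sum(axis=1, keepdims=True)                # normalisation, as in (6.14)
    return Q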
6.1.2 Simple Illustration
As a simple example, consider the simplified human skeleton model shown in Figure 6.2. This skeleton model is represented by a graph G = (V, E) with six vertices (V = {x_1, . . . , x_6}, with x_1 representing the torso), and five edges (E = {(x_1, x_2), . . . , (x_1, x_6)}). That is, each of the variables x_2 to x_6 is connected only to x_1.
Applying the independence assumption and the derivation above, the mean field update for x_1 is:
Q_1(x_1) = (1/Z_1) exp( − ∑_{i=2}^{6} ∑_{x_i} Q_i(x_i) ψ(x_1, x_i) ) ,   (6.15)
where ψ is a kernel on the distance between the parts x_1 and x_i.
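For instance, with two candidate locations per part the update can be evaluated directly (illustrative Python; the arrays psi and Q_children are toy values invented purely for this example):

import numpy as np

def update_torso_marginal(psi, Q_children):
    # psi        : list of 5 arrays, each L x L; psi[i][a, b] is the kernel between torso label a
    #              and label b of the (i+2)-th part
    # Q_children : list of 5 length-L arrays, the current marginals of x_2 ... x_6
    L = Q_children[0].shape[0]
    energy = np.zeros(L)
    for psi_i, q_i in zip(psi, Q_children):
        energy += psi_i @ q_i              # expected pairwise cost with each child, as in (6.15)
    Q1 = np.exp(-energy)
    return Q1 / Q1.sum()                   # normalise by Z_1

psi = [np.array([[0.0, 1.0], [1.0, 0.0]])] * 5
Q_children = [np.array([0.9, 0.1])] * 5
print(update_torso_marginal(psi, Q_children))   # the torso strongly prefers its first candidate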
Figure 6.2: Our basic six-part skeleton model, for the simple example.
Table 6.1: Comparing the performance of the Robust P^n CRF [56] and mean field inference [64] on the MSRC-21 dataset. In this table, Global refers to the percentage of all pixels that are correctly labelled, and Average is the average of the per-class accuracy figures.
Method        Runtime (s/frame)   Global (%)   Average (%)
Robust P^n    30                  84.9         77.5
Mean field    0.2                 86.0         78.3
6.1.3 Performance Comparison: Mean Field vs Graph Cuts
Krähenbühl and Koltun evaluated their mean field inference framework on the MSRC-21 dataset [77]. Compared to the robust CRF method of Kohli et al. [56], the runtime decreases from 30 seconds per image to 0.2 seconds per image (a 150× speed-up factor); at the same time, the global accuracy also improves. A brief summary of results is given in Table 6.1; a more thorough evaluation can be found in [64].
6.2 Model Formulation
As has been the case in the previous two chapters, the goal of our joint optimisation
framework is to estimate human segmentation and pose, and perform stereo reconstruction.
We formulate the problem of joint labelling in a conditional random field (CRF) framework in a product label space. We first define two sets of random variables, X = [X^S, X^D], covering the segmentation and disparity variables, and Y, to represent the part variables. X takes a label from the product label space \mathcal{L} = (\mathcal{L}^S \times \mathcal{L}^D)^N, and Y takes a label from (\mathcal{L}^P)^M. Here, X^S = {X^S_1, ..., X^S_N} and X^D = {X^D_1, ..., X^D_N} are the per-pixel human segmentation and disparity variables. We assume that each of these random variables is associated with a pixel x_i in the image, with i \in {1, ..., N}. Further, each X^S_i takes a label from the segmentation label set, \mathcal{L}^S = {0, 1}, and X^D_i takes a label from the disparity label set, \mathcal{L}^D.

The body parts are represented by the set of latent variables Y = {Y_1, Y_2, ..., Y_M}, corresponding to the M body parts, each taking labels from \mathcal{L}^P = {0, ..., K}, where 1, 2, ..., K correspond to the K part proposals generated for each body part, and zero represents the background class. We generate K part proposals using the model of Yang and Ramanan [131].
6.2.1 Joint Energy Function
Given the above model, our joint energy function takes the following form:
E(x^S, x^D, y) = E^S(x^S) + E^D(x^D) + E^P(y) + E^{PS}(x^S, y) + E^{SD}(x^S, x^D) + E^{PD}(x^D, y) \qquad (6.16)

We define each of these terms individually below.
Per-Pixel Terms
We first define E^S and E^D, which take the following forms:

E^S(x^S) = \sum_{i \in \mathcal{V}} \psi^S(x_i) + \sum_{i \in \mathcal{V}, j \in \mathcal{N}_i} \psi^S(x_i, x_j) + \sum_{c \in \mathcal{C}} \psi^S_c(x_c), \qquad (6.17)

E^D(x) = \sum_{i \in \mathcal{V}} \psi^D(x_i) + \sum_{i \in \mathcal{V}, j \in \mathcal{N}_i} \psi^D(x_i, x_j), \qquad (6.18)
where \mathcal{N}_i represents the neighbourhood of the variable i, \psi^S(x_i) and \psi^D(x_i) represent unary terms corresponding to human segmentation class and depth labels respectively, and \psi^S(x_i, x_j) and \psi^D(x_i, x_j) are pairwise terms capturing the interaction between a pair of segment and depth variables respectively. The human object specific unary cost \psi^S(x_i) is computed based on a boosted unary classifier on image-specific appearance. The cost \psi^D(x_i) is computed based on the sum of absolute differences, as in Section 4.3.3. The higher order term \psi^S_c(x_c) describes a cost defined over cliques containing more than two pixels, as introduced by Ladický et al. [70].

The pairwise term between human object variables, \psi^S, takes the form of a Potts model, weighted by edge-preserving Gaussian kernels [64]. The depth consistency term \psi^D(x_i, x_j) encourages pixels which are next to each other to take the same disparity level, and takes a similar form to that of \psi^S.
Per-Part Term
Similar to the energy function g^P(x) defined in (5.22) (Section 5.3.4), the energy term E^P(y) covers the human part variables Y. This energy function involves a per-part unary cost \psi^P(y_j = k) for associating the j-th part to the k-th proposal or to the background, and a pairwise term \psi^P(y_i, y_j), which penalises the case where parts that should be connected are distant from one another in image space. These terms are the same as in (5.22). Defining the set of connected parts as \mathcal{E}, the pose estimation energy is as follows:

E^P(y) = \sum_{j \in Y} \psi^P(y_j = k) + \sum_{(i,j) \in \mathcal{E}} \psi^P(y_i, y_j). \qquad (6.19)
Joint Terms
We now give details of our joint energy terms, E^{PS}, E^{SD}, and E^{PD}.
The joint human segmentation and part proposal term, E^{PS}, encodes the relation between segmentation and part proposals. Specifically, we expect pixels close to a selected part proposal to belong to the foreground class, and pixels that are far from any body part to belong to the background class. We pay a cost of C^{PS} for violation of this constraint, incorporated through a pairwise interaction between the segmentation and part proposal variables; this interaction takes the following form:
E^{PS} = \psi^{PS}_p(x^S, y) = \sum_{i=1}^{N} C^{PS}\, [(x^S_i = 1) \wedge (\max_{j,k}(\mathrm{dist}(x_i, y_{(j,k)})) \geq \tau)]
    + \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} C^{PS}\, [(x^S_i = 0) \wedge (\mathrm{dist}(x_i, y_{(j,k)}) < \tau) \wedge (y_j = k)]. \qquad (6.20)
Here, dist(x_i, y_{(j,k)}) gives the distance from the pixel x_i to the k-th proposal for the j-th part; this distance is measured by modelling the part proposal as a line segment between its two endpoints, and finding the Euclidean distance from the point to the line segment. The threshold \tau is set to half of the average width of each part, as determined by typical images from the training set where the parts are well-separated. The values range from around 10 pixels for the lower arms, to around 25 pixels for the torso. However, it should be noted that the torso width in particular is heavily dependent on the orientation of the person, and indeed their clothing and body shape. In future research, it would be interesting to consider finding the orientation of the person, and adjusting the expected torso width accordingly. Adjusting the width based on the length of the part could also improve results, although the possibility of foreshortening should be taken into account.
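For completeness, a standard point-to-line-segment distance of the kind assumed for dist(x_i, y_{(j,k)}) can be computed as in the following sketch (the pixel and endpoint coordinates are placeholder values).

    import numpy as np

    def dist_point_to_segment(p, a, b):
        """Euclidean distance from pixel p to the segment with endpoints a and b."""
        p, a, b = np.asarray(p, float), np.asarray(a, float), np.asarray(b, float)
        ab = b - a
        denom = np.dot(ab, ab)
        if denom == 0.0:                      # degenerate proposal: a single point
            return np.linalg.norm(p - a)
        # Project p onto the line through a and b, clamped to the segment.
        t = np.clip(np.dot(p - a, ab) / denom, 0.0, 1.0)
        return np.linalg.norm(p - (a + t * ab))

    # Example: distance from a pixel to a lower-arm proposal.
    print(dist_point_to_segment([50, 40], [30, 30], [60, 30]))   # -> 10.0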
Our joint object-depth cost E^{SD} encourages pixels with a high disparity to be classed as foreground, and pixels with a low disparity to be classified as background. We penalise the violation of this constraint by a cost C^{SD}. Using the flood fill method introduced in Chapter 5, we first generate a segmentation map F = {\gamma_1, \gamma_2, ..., \gamma_N} by thresholding the disparity map; thus each \gamma_i takes a label from \mathcal{L}^S. We would expect the prior map F to agree with the segmentation result, so that pixels labelled as human by the flood fill prior (\gamma_i = 1) are classified as human, and vice versa; otherwise we pay a cost C^{SD} for violation of this constraint:
E^{SD} = \psi^{SD}_p(x^S, x^D) = \sum_{i=1}^{N} C^{SD}\, [(x^S_i = 1) \wedge (\gamma_i = 0)]
    + \sum_{i=1}^{N} C^{SD}\, [(x^S_i = 0) \wedge (\gamma_i = 1)] \qquad (6.21)
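A minimal sketch of how a prior map of this kind could be produced is given below; the disparity threshold, the seed location, and the use of connected-component labelling from scipy are illustrative assumptions rather than the exact flood fill procedure of Chapter 5.

    import numpy as np
    from scipy import ndimage

    def flood_fill_prior(disparity, seed, disp_threshold):
        """Binary prior map F: pixels connected to the seed whose disparity
        exceeds the threshold are treated as human (gamma_i = 1)."""
        mask = disparity >= disp_threshold            # threshold the disparity map
        labels, _ = ndimage.label(mask)               # 4-connected components
        seed_label = labels[seed]
        if seed_label == 0:                           # seed fell on background
            return np.zeros_like(mask, dtype=np.uint8)
        return (labels == seed_label).astype(np.uint8)

    # Example with a synthetic disparity map and a seed on the "person".
    disp = np.zeros((80, 60))
    disp[20:70, 20:40] = 30.0                         # near (high-disparity) region
    F = flood_fill_prior(disp, seed=(40, 30), disp_threshold=10.0)
    print(F.sum())                                    # number of pixels with gamma_i = 1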
Finally, the joint energy term E^{PD} encodes the relationship between the part proposals and the disparity variables. Again, we use the flood fill prior F defined above. We expect pixels classed as human by this prior (so \gamma_i = 1) to be close to a selected body part, so that for some part j and proposal index k, y_j = k and dist(x_i, y_{(j,k)}) < \tau. Conversely, pixels classed as background (\gamma_i = 0) should not be close to selected body parts, so we pay a cost if, for some j, k, y_j = k and dist(x_i, y_{(j,k)}) < \tau. Therefore, the energy term has the following form:
E^{PD} = \psi^{PD}_p(y, x^D) = \sum_{i=1}^{N} C^{PD}\, [(\max_{j,k}(\mathrm{dist}(x_i, y_{(j,k)})) \geq \tau) \wedge (\gamma_i = 1)]
    + \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} C^{PD}\, [(y_j = k) \wedge (\mathrm{dist}(x_i, y_{(j,k)}) < \tau) \wedge (\gamma_i = 0)] \qquad (6.22)
The weights C^{PS}, C^{SD}, and C^{PD} capturing the relationships between the different sets of variables are set through cross-validation.
6.3 Inference in the Joint Model
We now provide details of the mean-field update for the segmentation variables X^S, the depth variables X^D, and the part variables Y^P.
Given the energy function detailed in Section 6.2.1, the marginal update for the human segmentation variable X^S_i takes the following form:
Q^S_i(x^S_{[i,l]}) = \frac{1}{Z^S_i} \exp\Big\{ -\psi^S(x_i)
    - \sum_{l' \in \mathcal{L}^S} \sum_{j \neq i} Q^S_j(x_{[j,l']})\, \psi^S(x_i, x_j)
    - \sum_{l' \in \mathcal{L}^D} Q^D_i(x^D_{[i,l']})\, \psi^{SD}(x^S_i, x^D_i)
    - \sum_{j=1}^{M} \sum_{l' \in \mathcal{L}^P} Q^P_j(y^P_{[j,l']})\, \psi^{PS}(x^S_i, y^P_j) \Big\} \qquad (6.23)
Similarly, the marginal update for the per-pixel depth variables X^D_i takes the following form:
Q^D_i(x^D_{[i,l]}) = \frac{1}{Z^D_i} \exp\Big\{ -\psi^D(x_i)
    - \sum_{l' \in \mathcal{L}^D} \sum_{j \neq i} Q^D_j(x_{[j,l']})\, \psi^D(x_i, x_j)
    - \sum_{l' \in \mathcal{L}^S} Q^S_i(x^S_{[i,l']})\, \psi^{SD}(x^D_i, x^S_i)
    - \sum_{j=1}^{M} \sum_{l' \in \mathcal{L}^P} Q^P_j(y^P_{[j,l']})\, \psi^{PD}(y^P_j, x^D_i) \Big\}. \qquad (6.24)
Finally, the marginal update for the part variables Y^P_j is as follows:
Q^P_j(y^P_{[j,l]}) = \frac{1}{Z^P_j} \exp\Big\{ -\psi^P(y_j)
    - \sum_{l' \in \mathcal{L}^P} \sum_{j' \neq j} Q^P_{j'}(y_{[j',l']})\, \psi^P(y_j, y_{j'})
    - \sum_{i=1}^{N} \sum_{l' \in \mathcal{L}^S} Q^S_i(x_{[i,l']})\, \psi^{PS}(x^S_i, y^P_j)
    - \sum_{i=1}^{N} \sum_{l' \in \mathcal{L}^D} Q^D_i(x_{[i,l']})\, \psi^{PD}(y^P_j, x^D_i) \Big\}. \qquad (6.25)
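The following sketch illustrates only the structure of these coupled updates: three sets of marginals are repeatedly re-estimated from their unaries and from the expected cross-field costs. The within-field pairwise and higher-order messages are omitted for brevity, and all potentials are random placeholders rather than the terms defined in Section 6.2.1.

    import numpy as np

    def normalise(neg_energy, axis=-1):
        """Turn negative energies into a normalised distribution (softmax)."""
        Q = np.exp(neg_energy - neg_energy.max(axis=axis, keepdims=True))
        return Q / Q.sum(axis=axis, keepdims=True)

    # Toy sizes: N pixels, two segmentation labels, D disparity labels,
    # M parts with K proposals plus a background label.
    rng = np.random.default_rng(3)
    N, D, M, K = 64, 8, 6, 3

    # Placeholder unary potentials for each field (illustrative values only).
    phi_S = rng.normal(size=(N, 2))
    phi_D = rng.normal(size=(N, D))
    phi_P = rng.normal(size=(M, K + 1))

    # Placeholder cross potentials psi^{SD}, psi^{PS}, psi^{PD}: tables giving
    # the cost of each joint assignment (in the thesis these come from the
    # distance and flood-fill constructions of Section 6.2.1).
    psi_SD = rng.normal(size=(N, 2, D))
    psi_PS = rng.normal(size=(N, 2, M, K + 1))
    psi_PD = rng.normal(size=(N, D, M, K + 1))

    # Initialise the three sets of marginals from their unaries.
    Q_S, Q_D, Q_P = normalise(-phi_S), normalise(-phi_D), normalise(-phi_P)

    for _ in range(5):                       # fixed number of full update rounds
        # Segmentation update, cf. (6.23): unary plus expected cross terms.
        msg = (np.einsum('nd,nsd->ns', Q_D, psi_SD)
               + np.einsum('mk,nsmk->ns', Q_P, psi_PS))
        Q_S = normalise(-phi_S - msg)

        # Disparity update, cf. (6.24).
        msg = (np.einsum('ns,nsd->nd', Q_S, psi_SD)
               + np.einsum('mk,ndmk->nd', Q_P, psi_PD))
        Q_D = normalise(-phi_D - msg)

        # Part update, cf. (6.25): expected cross terms summed over all pixels.
        msg = (np.einsum('ns,nsmk->mk', Q_S, psi_PS)
               + np.einsum('nd,ndmk->mk', Q_D, psi_PD))
        Q_P = normalise(-phi_P - msg)

    print(Q_S.shape, Q_D.shape, Q_P.shape)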
6.4 Experiments
In this section, we demonstrate the efficiency and accuracy provided by our approach on the challenging H2View dataset, introduced in Section 4.2. In all experiments, timings are based on single-threaded code run on an Intel Xeon 3.33 GHz processor, and we fix the number of full mean-field update iterations to 5 for all models. As a baseline, we compare our approach for the human segmentation problem against the graph-cuts based AHRF method of Ladický et al. [70], the dual-decomposition based model of Chapter 5, and the mean-field model of Krähenbühl et al. [64]. We assess the overall percentage of pixels correctly labelled, and the average recall and intersection/union scores per class. For pose estimation, we compare our results with those of Chapter 5, Yang and Ramanan [131], and Andriluka et al. [1], using the probability of correct pose (PCP) criterion.

Table 6.2: Quantitative results for human segmentation on the H2View dataset. The table compares timing and accuracy of our approach (last line) against three baselines. Note the significant improvement in inference time, recall, F-score and overlap performance of our approach against the baselines.

Method      Time (s)   Precision   Recall    F-Score   Overlap
Unary       0.36       79.84%      73.55%    76.57%    62.21%
ALE [70]    1.5        83.19%      73.58%    78.09%    64.29%
MF [64]     0.48       84.22%      73.50%    78.50%    64.61%
Chapter 5   25         79.59%      83.23%    81.37%    69.23%
Ours        1.07       79.89%      87.05%    83.32%    71.17%
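For reference, the segmentation measures reported in Table 6.2 (precision, recall, F-score, and the overlap, or intersection-over-union, score) can be computed from a pair of binary masks as in the following sketch.

    import numpy as np

    def segmentation_scores(pred, gt):
        """Precision, recall, F-score and overlap (IoU) for binary masks."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        overlap = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        return precision, recall, f_score, overlap

    # Example with tiny masks.
    pred = np.array([[1, 1, 0], [0, 1, 0]])
    gt = np.array([[1, 0, 0], [0, 1, 1]])
    print(segmentation_scores(pred, gt))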
6.4.1 Segmentation Performance
We first evaluate the performance of our method on the human segmentation problem. A quantitative comparison can be found in Table 6.2. Our method shows a clear improvement in recall compared to the other approaches, with an increase of 3.82% compared to Chapter 5, and around a 13.5% increase compared to the other methods. A similar improvement is shown in the overlap (intersection over union) score, with our method showing an improvement of almost 2%. Significantly, we observe an order of magnitude speed-up (close to 25×) over the model of Chapter 5, and a slight speed-up over ALE. Sample results comparing the performance of our mean field formulation with that of previous chapters are given in Figure 6.3, while further results are shown in Figure 6.4.
6.4.2 Pose Estimation Performance
Further, we observe an improvement of about 3.3% over Yang and Ramanan, and 7% over Andriluka et al., in the PCP scores for the pose estimation problem. Although we do slightly worse than Chapter 5 in the PCP score, we observe a speed-up of 20×, as well as a speed-up of 8× over the Yang and Ramanan model, and of almost 30× over the model of Andriluka, as shown in Table 6.3. However, in some cases a qualitative improvement can be observed, as shown in Figure 6.5.

Figure 6.3: Segmentation results. (a) Original image; (b) result of Chapter 4; (c) result of Chapter 5; (d) our segmentation result; (e) ground truth segmentation. Outstretched arms are difficult for segmentation methods to capture, but the dense CRF approach used in this chapter successfully retrieves the arms.

Figure 6.4: Results showing consistency of segmentation performance. (a) Original image; (b) segmentation result; (c) ground truth. Shown here are five successive frames from a difficult part of the test set. Qualitatively, the segmentation results are of a consistently high standard.

Table 6.3: The table compares timing and accuracy of our approach (last line) against the baselines for the pose estimation problem on the H2View dataset. Observe that our approach achieves a 20× speedup, and performs better than the baseline in estimating the limbs. U/LL represents the average of the upper and lower legs, and U/FA represents the average of the upper and fore arms.

Method          T(s)   U/LL   U/FA   Torso   Head   Overall
Andriluka [1]   35     80.0   47.7   80.5    69.2   66.03
Yang [131]      10     85.8   49.0   87.4    72.0   69.85
Chapter 5       25     85.4   59.7   96.3    92.4   76.86
Ours            1.2    82.9   55.2   89.1    86.2   73.12
6.5 Discussion
In this chapter, we proposed an efficient mean-field based method for the joint estimation of human segmentation, pose, and disparity. Our new inference approach yields excellent results, with a substantial improvement in inference speed with respect to the current state-of-the-art methods, while also achieving a good improvement in both human segmentation and pose estimation accuracy.
Mean-field inference produces faster performance than the dual decomposition-based method due to the simplifications made in the pairwise part of the equations: for instance, the segmentation update equation (6.23) can be evaluated separately for each pixel, whereas a graph cut-based approach, with a complexity depending on the number of pairwise edges, was used for the corresponding step in the dual decomposition-based method.
The proposed method runs at 1.07 seconds per frame, meaning that a speed-up factor of 15× is still needed for real-time applications such as computer games. Directions for future research include investigating ways to further improve the efficiency of the approach. For instance, some parts of the algorithm are parallelisable. It would also be extremely interesting to adapt the algorithm to a GPU framework. While the results given are state of the art, they could perhaps be further improved by adopting a hierarchical approach. This approach could capture image-level attributes, such as the person's orientation, or a high-level understanding of the person's pose, e.g. standing, sitting, gesticulating.

Figure 6.5: Qualitative results on the H2View dataset. (a) Pose estimation from Chapter 5; (b) our pose estimation result. Our method is able to correct some mistakes from the previous chapter (left side), but some of the same mistakes still remain (right side).
Acknowledgements
The paper that this chapter is based on was jointly authored by myself and Vibhav
Vineet, a PhD student in Brookes Vision Group. I extended the formulation to include
higher-order terms, and adapted it to mean-field inference, also putting the terms into
the C++ code used to run the experiments. Vibhav provided contributions to the paper
which are not included in this chapter.
Chapter 7
Conclusions and Future Work
This thesis has explored the applicability to video games of human segmentation and pose
estimation. However, owing to the high levels of articulation that the human body can
exhibit, together with the variability in shape, size, colour, and appearance, it is necessary
to use a complex approach to find accurate pose estimates, and the result unfortunately
falls short of the real-time aim. In Section 7.2, however, we identify a few methods which
could help further speed up our system.
7.1 Summary of Contributions
The major contributions of this thesis are as follows:
In Chapter 4, we presented a unified framework for human segmentation, pose
estimation, and depth estimation. These three problems can be solved separately,
but the results can be improved by sharing information between the three solutions.
To do this, we defined a five-term energy function, with three terms to give the
costs of the solution for each individual problem, and two joining terms, which
facilitated information sharing. For example, we encoded the notion that if a human
body part is located in a particular region of the image, then that region should
be segmented as foreground. In order to minimise the energy function, we applied
dual decomposition [62].
In order to evaluate our dual decomposition framework, we created a novel dataset,
which was presented in Section 4.2. This dataset contains almost 9,000 stereo pairs
of images, which each feature a human standing, walking, crouching or gesticulating.
To encourage future research on solving multi-view human understanding problems,
the dataset has been released (http://cms.brookes.ac.uk/research/visiongroup/h2view/).
In Chapter 5, we introduced a stereo-based prior for human segmentation. Using
the endpoints of the top-ranked torso estimate as starting points, we run a flood fill algorithm on the disparity map, obtaining a flood fill prior, F. We then incorporated
this prior into our framework from the previous chapter, yielding an improvement
in segmentation performance, and reducing the runtime by a factor of 6.
The main drawback of the dual decomposition-based approach is the speed: the
algorithms presented took around 20-30 seconds per frame to converge. Our final contribution, in Chapter 6, was to reformulate the problem into a highly efficient inference framework based on mean field [64]. The use of mean field inference allows
us to use dense pairwise interactions, which provides a further improvement in
segmentation performance. The resulting algorithm only requires around 1 second
per frame.
7.2 Directions for Future Research
The final version of our unified human segmentation, depth, and pose estimation frame-
work, presented in Chapter 6, requires around 1 second per frame on a desktop computer,
running in a single thread. It would be tremendously interesting to extend this framework
to run on multiple threads, which would bring the holy grail of real-time human under-
standing within sight. With this goal in mind, it would also be worthwhile to consider
adapting the framework to run on architectures such as the GPU, or the internal SPU
processor on the PlayStation 3. With the latter solution, care would have to be taken to
ensure that the memory limitations of such processors are taken into account.
Besides such solutions, there are additional methods that could also speed up the
performance; for instance, using integral images (via which the colour values of rectangular regions in an image can be summed quickly and efficiently), removing low-weighted
components of the energy function, and using a faster algorithm to generate the pose es-
timates. Indeed, it would be very useful to remove the constraint that our pose estimates
depend on those of another algorithm, and instead to generate new pose candidates at
each iteration.
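As a brief illustration of the first of these ideas, an integral image can be built with two cumulative sums, after which the sum over any rectangular region needs only four look-ups; a minimal sketch (with made-up values) follows.

    import numpy as np

    def integral_image(img):
        """Integral image with a zero top row and left column for easy indexing."""
        ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
        ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
        return ii

    def box_sum(ii, r0, c0, r1, c1):
        """Sum of img[r0:r1, c0:c1] using four look-ups into the integral image."""
        return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

    img = np.arange(12, dtype=float).reshape(3, 4)
    ii = integral_image(img)
    print(box_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())   # both give 30.0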
We conclude this thesis with the following observations: while the controller is likely
to remain the standard input method for many regular players of computer games, the
market for casual gamers is growing. There is clear demand in this demographic for
affordable systems where games can be controlled using the human body instead of a
physical controller, as shown by the sales of the Kinect [98]; we note that if such a system
relied on a stereo pair of cameras rather than a depth sensor, the cost to the consumer
would be lower.
Finally, it is worth noting that a human understanding framework based on a stereo
pair of images would have applications beyond computer games. In particular, a major
advantage that stereo camera systems have over depth sensors is their ability to work
outdoors during the daytime, allowing them to be used in, for example, pedestrian de-
tection systems; however, it should be noted that infra-red based methods would work
better at night than RGB stereo camera-based methods. With vision research, including
this thesis, bringing the goal of real-time human understanding closer to reality, the next
few years promise exciting developments.
Bibliography
[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People de-
tection and articulated pose estimation. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 10141021, 2009.
[2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape:
shape completion and animation of people. ACM Transactions on Graphics (TOG),
24(3):408416, 2005.
[3] B. Ashcraft. Eye of Judgment. http://kotaku.com/343541/more-of-eye-of-
judgment-coming-to-japan?tag=the-eye-of-judgment, 2008. This photo, re-
trieved on 22nd August 2012, contains an image from a copyrighted game (Eye of
Judgment, Sony, 2007), whose appearance here for research purposes qualifies as
fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[4] Press Association. EyePet bounce. http://www.telegraph.co.uk/technology/
6410150/The-Sony-EyePet-an-electronic-pet-for-people-too-lazy-for-
the-real-thing.html, 2009. This photo, retrieved on 22nd August 2012, is a
screenshot from a copyrighted game (EyePet, SCEE, 2009) provided by the Press
Association, and appears here for illustrative purposes, qualifying as fair dealing
under Section 29 of the Copyright, Designs and Patents Act 1988.
[5] A.O. Balan, L. Sigal, M.J. Black, J.E. Davis, and H.W. Haussecker. Detailed human
shape and pose from images. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 18, 2007.
[6] D.P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
[7] J. Best. Kinect Dance demonstration. http://www.techrepublic.com/
blog/cio-insights/why-microsofts-kinect-gaming-tech-matters-to-
business/39746574, 2010. This photo, retrieved on 22nd August 2012, appears
here for illustrative purposes, qualifying as fair dealing under Section 29 of the
Copyright, Designs and Patents Act 1988.
[8] C.M. Bishop. Pattern recognition and machine learning, volume 4. Springer New
York, 2006.
[9] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segment-
ation using an adaptive GMMRF model. In European Conference of Computer
Vision (ECCV), pages 428441. Springer, 2004.
[10] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human
pose annotations. In IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 13651372, 2009.
[11] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. Lecture notes of EE392o,
Stanford University, Autumn Quarter, 2003.
[12] S. Boyd, L. Xiao, A. Mutapcic, and J. Mattingley. Notes on decomposition methods.
Notes for EE364B, Stanford University, 2007.
[13] S.P. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press,
2004.
[14] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-
flow algorithms for energy minimization in vision. IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI), 26(9):11241137, 2004.
[15] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via
graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(11):1222–1239, 2001.
[16] M. Bray, P. Kohli, and P. Torr. Posecut: Simultaneous segmentation and 3D
pose estimation of humans using dynamic graph-cuts. In European Conference of
Computer Vision (ECCV), pages 642655, 2006.
[17] L. Breiman. Random forests. Machine learning, 45(1):532, 2001.
[18] J. Calvert. Kinect Sports review. http://www.gamespot.com/kinect-sports/
reviews/kinect-sports-review-6283473/, 2010. Retrieved 22nd August 2012.
[19] J. Canny. A computational approach to edge detection. IEEE Transactions on
Pattern Analysis and Machine Intelligence (PAMI), 6:679698, 1986.
[20] A. Criminisi, A. Blake, C. Rother, J. Shotton, and P.H.S. Torr. Efficient dense
stereo with occlusions for new view-synthesis by four-state dynamic programming.
International Journal of Computer Vision (IJCV), 71(1):89110, 2007.
[21] N. Dalal. Inria object detection and localization toolkit, 2008. Software available
at http://pascal.inrialpes.fr/soft/olt.
[22] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
886893, 2005.
[23] G.B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations
research, pages 101111, 1960.
[24] R. Davis. EyeToy: Antigrav review. http://uk.gamespot.com/eyetoy-
antigrav/reviews/eyetoy-antigrav-review-6112714/, 2004. Retrieved 26th
July 2012.
[25] D. DeMenthon and L.S. Davis. Exact and approximate solutions of the perspective-
three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 14(11):1100–1105, 1992.
[26] R. Deriche. Using Canny's criteria to derive a recursively implemented optimal edge
detector. International Journal of Computer Vision (IJCV), 1(2):167187, 1987.
[27] P. Dollar, Z. Tu, and S. Belongie. Supervised learning of edges and object bound-
aries. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
volume 2, pages 19641971, 2006.
[28] R.O. Duda and P.E. Hart. Use of the Hough transformation to detect lines and
curves in pictures. Communications of the ACM, 15(1):1115, 1972.
[29] J. Dunlap. Queue-linear flood fill: A fast flood fill algorithm. http:
//www.codeproject.com/Articles/16405/Queue-Linear-Flood-Fill-A-
Fast-Flood-Fill-Algorith. Retrieved 1st April 2013.
[30] P. Elias, A. Feinstein, and C. Shannon. A note on the maximum flow through a
network. IEEE Transactions on Information Theory, 2(4):117119, 1956.
[31] M. Enzweiler, A. Eigenstetter, B. Schiele, and D.M. Gavrila. Multi-cue pedestrian
classification with partial occlusion handling. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 990997, 2010.
[32] PlayStation Europe. PSEye. http://www.flickr.com/photos/
playstationblogeurope/4989764343/, 2007. This photo, retrieved on 22nd Au-
gust 2012, is used under a Creative Commons Attribution-NonCommercial 2.0 Gen-
eric licence: http://creativecommons.org/licenses/by-nc/2.0/deed.en_GB.
[33] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman.
The PASCAL Visual Object Classes (VOC) challenge. International Journal of
Computer Vision, 88(2):303338, June 2010.
[34] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection
with discriminatively trained part based models. IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI), 32(9):16271645, 2010.
[35] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition.
International Journal of Computer Vision (IJCV), 61(1), 2005.
[36] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained,
multiscale, deformable part model. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 18, 2008.
[37] V. Ferrari, M. Marín-Jiménez, and A. Zisserman. Progressive search space reduction
for human pose estimation. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 18, 2008.
[38] V. Ferrari, M. Marín-Jiménez, and A. Zisserman. Pose search: retrieving people
using their pose. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2009.
[39] A. Ferworn, J. Tran, A. Ufkes, and A. D'Souza. Initial experiments on 3D mod-
eling of complex disaster environments using unmanned aerial vehicles. In IEEE
International Symposium on Safety, Security, and Rescue Robotics (SSRR), pages
167171. IEEE, 2011.
[40] M.A. Fischler and R.C. Bolles. Random sample consensus: a paradigm for model
fitting with applications to image analysis and automated cartography. Commu-
nications of the ACM, 24(6):381395, 1981.
[41] L.R. Ford and D.R. Fulkerson. Maximal flow through a network. Canadian Journal
of Mathematics, 8(3):399404, 1956.
[42] J. Fraser. Kung Foo. http://www.thunderboltgames.com/reviews/article/
eye-toy-play-review-for-ps2.html, 2003. This photo, retrieved on 22nd Au-
gust 2012, is a screenshot from a copyrighted game (EyeToy: Play, SCEE, 2003),
and appears here for illustrative purposes, qualifying as fair dealing under Section
29 of the Copyright, Designs and Patents Act 1988.
[43] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical
view of boosting. Annals of Statistics, 28(2):337407, 2000.
[44] K. Golder. Why isn't Kinect kinecting with hardcore gamers? http:
//www.hardcoregamer.com/2012/04/08/gears-of-war-exile-in-permanent-
diskinect/, 2012. Retrieved 17th September 2012.
[45] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In
Recent Advances in Learning and Control, Lecture Notes in Control and Information
Sciences, pages 95110. Springer-Verlag Limited, 2008.
[46] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming,
version 1.21. http://cvxr.com/cvx/, April 2011.
[47] V. Gulshan, V. Lempitsky, and A. Zisserman. Humanising grabcut: Learning to
segment humans using the kinect. In International Conference on Computer Vision
(ICCV) Workshops, pages 11271133, 2011.
[48] G. Gupta. Ghost Catcher. http://archive.techtree.com/techtree/jsp/
article.jsp?print=1&article_id=51303&cat_id=544, 2004. This photo, re-
trieved on 22nd August 2012, is a screenshot from a copyrighted game (EyeToy:
Play, SCEE, 2003), and appears here for illustrative purposes, qualifying as fair
dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[49] S. Hay, J. Newman, and R. Harle. Optical tracking using commodity hardware. In
IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR),
pages 159160, 2008.
[50] K. He, J. Sun, and X. Tang. Guided image filtering. In European Conference of
Computer Vision (ECCV), pages 114. Springer, 2010.
[51] H. Hirschmüller, P.R. Innocent, and J. Garibaldi. Real-time correlation-based ste-
reo vision with reduced border errors. International Journal of Computer Vision
(IJCV), 47(1):229246, 2002.
[52] P.V.C. Hough. Machine analysis of bubble chamber pictures. In International
Conference on High Energy Accelerators and Instrumentation, volume 73. CERN,
1959.
[53] G. Howitt. Wonderbook - hands-on preview. http://www.guardian.co.uk/
technology/gamesblog/2012/aug/16/wonderbook-hands-on-preview-ps3,
2012. Retrieved 22nd August 2012.
[54] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton,
S. Hodges, D. Freeman, A. Davison, et al. Kinectfusion: real-time 3D reconstruction
and interaction using a moving depth camera. In Proceedings of the 24th annual
ACM symposium on User interface software and technology, pages 559568. ACM,
2011.
[55] M. Jackson. BBC's Walking with Dinosaurs coming to PS3 Wonder-
book. http://www.computerandvideogames.com/363100/bbcs-walking-with-
dinosaurs-coming-to-ps3-wonderbook/, 2012. Retrieved 22nd August 2012.
[56] P. Kohli, L. Ladický, and P.H.S. Torr. Robust higher order potentials for enforcing
label consistency. International Journal of Computer Vision (IJCV), 82(3):302324,
2009.
[57] D. Koller and N. Friedman. Probabilistic graphical models: principles and tech-
niques. MIT press, 2009.
[58] V. Kolmogorov. maxflow-v3.01 – C++ code library for solving the max-flow/min-
cut problem. http://vision.csd.uwo.ca/code/maxflow-v3.01.zip, 2010. Re-
trieved 9th August 2012.
[59] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Probabilistic fusion of stereo with color and contrast for bilayer segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28(9):1480–1492, 2006.
[60] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Bi-layer seg-
mentation of binocular stereo video. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), volume 2, pages 407414, 2005.
[61] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph
cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),
26(2):147159, 2004.
[62] N. Komodakis, N. Paragios, and G. Tziritas. MRF optimization via dual decompos-
ition: Message-passing revisited. In IEEE International Conference on Computer
Vision (ICCV), pages 18, 2007.
[63] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond
via dual decomposition. IEEE Transactions on Pattern Analysis and Machine
Intelligence (PAMI), 33(3):531552, 2011.
[64] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with
Gaussian edge potentials. In NIPS, pages 109117, 2011.
[65] M.P. Kumar, P.H.S. Torr, and A. Zisserman. Obj cut. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1825, 2005.
[66] M.P. Kumar, P.H.S. Torr, and A. Zisserman. Objcut: Efficient segmentation us-
ing top-down and bottom-up cues. IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI), 32(3):530545, 2010.
[67] M.P. Kumar, O. Veksler, and P.H.S. Torr. Improved moves for truncated convex
models. The Journal of Machine Learning Research, 12:3167, 2011.
[68] M.P. Kumar, A. Zisserman, and P.H.S. Torr. Efficient discriminative learning of
parts-based models. In IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 552559, 2009.
[69] L. Ladický. Global structured models towards scene understanding. PhD thesis,
Oxford Brookes University, 2011.
[70] L. Ladický, C. Russell, P. Kohli, and P.H.S. Torr. Associative hierarchical CRFs for
object class image segmentation. In International Conference on Computer Vision
(ICCV), pages 739746, 2009.
[71] L. Ladický and P.H.S. Torr. The automatic labelling environment. http://cms.
brookes.ac.uk/staff/PhilipTorr/ale.htm. Retrieved 1st November 2012.
[72] J.D. Lafferty, A. McCallum, and F.C.N. Pereira. Conditional random fields: Prob-
abilistic models for segmenting and labeling sequence data. In Proceedings of the
Eighteenth International Conference on Machine Learning, pages 282289. Morgan
Kaufmann Publishers Inc., 2001.
[73] D. Larlus and F. Jurie. Combining appearance models and Markov random fields
for category level object segmentation. In Computer Vision and Pattern Recognition
(CVPR), pages 17, 2008.
[74] Nintendo Power magazine. Wii Sports: WiiRemote example. http://upload.
wikimedia.org/wikipedia/en/f/f6/WS-WiiRemote_Example.jpg, 2008. This
photo, retrieved on 22nd August 2012, is an annotated screenshot from a copy-
righted game (Wii Sports, Nintendo, 2006), whose appearance here for research
purposes qualifies as fair dealing under Section 29 of the Copyright, Designs and
Patents Act 1988.
[75] D. Marr and H.K. Nishihara. Representation and recognition of the spatial organ-
ization of three-dimensional shapes. Proceedings of the Royal Society of London.
Series B. Biological Sciences, 200(1140):269294, 1978.
[76] Y. Matsumoto and A. Zelinsky. An algorithm for real-time stereo vision implement-
ation of head pose and gaze direction measurement. In Fourth IEEE International
Conference on Automatic Face and Gesture Recognition, pages 499504, 2000.
[77] Microsoft. Image understanding. http://research.microsoft.com/en-us/
projects/objectclassrecognition/, 2009. Retrieved 22nd November 2012.
[78] Microsoft. Kinect Sports. http://g-ecx.images-amazon.com/images/G/01/
videogames/detail-page/kinectsports.04.lg.jpg, 2010. This photo, retrieved
on 22nd August 2012, is promotional material for a copyrighted game (Kinect
Sports, Microsoft, 2010), and appears here for illustrative purposes, qualifying as
fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[79] Microsoft. Microsoft Kinect instruction manual. http://download.
microsoft.com/download/f/6/6/f6636beb-a352-48ee-86a3-abd9c0d4492a/
kinectmanual.pdf, 2010. Retrieved 22nd August 2012. This photo appears for
illustrative purposes, qualifying as fair dealing under Section 29 of the Copyright,
Designs and Patents Act 1988.
[80] Microsoft. Xbox Kinect. full body game controller. http://www.xbox.com/kinect,
2010. Retrieved 16th June 2012.
[81] J. Mikesell. Nintendo Wii sensor bar. http://upload.wikimedia.org/wikipedia/
commons/2/20/Nintendo_Wii_Sensor_Bar.jpg, 2007. This photo, retrieved on
22nd August 2012, is used under a Creative Commons Attribution-Share Alike 3.0
Unported license: http://creativecommons.org/licenses/by-sa/3.0/deed.
en.
[82] G. Miller. Wonderbook and PlayStation Move. http://uk.ign.com/articles/
2012/06/06/e3-2012-can-wonderbook-be-successful, 2012. This photo, re-
trieved on 22nd August 2012, appears here for illustrative purposes, qualifying
as fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[83] Open NI. Kinect SDK. http://openni.org/Downloads/OpenSources.aspx, 2010.
Retrieved 22nd August 2012.
[84] Nintendo. Duck Hunt screenshot. http://pressthebuttons.typepad.com/
photos/uncategorized/duckhunt.png, 2006. Retrieved 14th September 2012.
The image is a screenshot from a copyrighted game (Nintendo, 1984), whose ap-
pearance here for research purposes qualifies as fair dealing under Section 29 of the
Copyright, Designs and Patents Act 1988.
[85] W. Niu, J. Long, D. Han, and Y.F. Wang. Human activity detection and recognition
for video surveillance. In IEEE International Conference on Multimedia and Expo
(ICME), volume 1, pages 719722, 2004.
[86] F. O'Gorman and M.B. Clowes. Finding picture edges through collinearity of feature
points. IEEE Transactions on Computers, 100(4):449456, 1976.
[87] I. Oikonomidis, N. Kyriazis, and A. Argyros. Efficient model-based 3D tracking of
hand articulations using kinect. In British Machine Vision Conference (BMVC),
pages 101111, 2011.
[88] M.T. Orchard and C.A. Bouman. Color quantization of images. IEEE Transactions
on Signal Processing, 39(12):26772690, 1991.
[89] B. Packer, S. Gould, and D. Koller. A unified contour-pixel model for figure-ground
segmentation. In European Conference of Computer Vision (ECCV), pages 338
351, 2010.
[90] J. Pearl. Reverend Bayes on inference engines: a distributed hierarchical approach.
Cognitive Systems Laboratory, School of Engineering and Applied Science, Univer-
sity of California, Los Angeles, 1982.
[91] D. Pelfrey. EyeToy: Antigrav screenshot. http://www.dignews.com/platforms/
ps2/ps2-reviews/eye-toy-antigrav-review/, 2005. This photo, retrieved on
18th September 2012, is a screenshot from a copyrighted game (EyeToy: Antigrav,
Sony/Harmonix, 2004), whose appearance here for research purposes qualifies as
fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[92] S. Pellegrini and L. Iocchi. Human posture tracking and classification through
stereo vision and 3D model matching. Journal on Image and Video Processing,
2008:112, 2008.
[93] L. Plunkett. Report: Here are Kinects technical specs. http://kotaku.com/
5576002/here-are-kinects-technical-specs, 2010. Retrieved 22nd August
2012.
[94] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical recipes in C.
Cambridge University Press, 1988.
[95] D. Ramanan. Learning to parse images of articulated bodies. Advances in neural
information processing systems (NIPS), 19:11291136, 2007.
[96] C. Rhemann. Fast cost-volume filtering code for stereo matching, 2011. Software
available at http://www.ims.tuwien.ac.at/research/costFilter/.
[97] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume
filtering for visual correspondence and beyond. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 30173024, 2011.
[98] B. Rigby. Microsoft Kinect: CEO releases stunning sales figures during CES key-
note. http://www.huffingtonpost.com/2012/01/09/microsoft-kinect-ces-
keynote_n_1195735.html, 2012. Retrieved 17th September 2012.
[99] C. Rodgers. Kinect Star Wars demonstration. http://www.guardian.co.uk/
technology/2011/jun/08/kinect-star-wars-game-preview, 2011. This photo,
retrieved on 22nd August 2012, was released by the Associated Press, and appears
here for illustrative purposes, qualifying as fair dealing under Section 29 of the
Copyright, Designs and Patents Act 1988.
[100] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extrac-
tion using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309
314, 2004.
[101] C. Rother, V. Kolmogorov, Y. Boykov, and A. Blake. Interactive foreground extrac-
tion using graph cut. In Markov Random Fields for Vision and Image Processing,
pp. 111-126, pages 111126. MIT Press, 2011.
[102] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame ste-
reo correspondence algorithms. International Journal of Computer Vision (IJCV),
47(1):742, 2002.
[103] S. Schiesel. Getting everybody back in the game. http://www.nytimes.com/2006/
11/24/arts/24wii.html?_r=1, 2006. Retrieved 17th September 2012.
[104] A. Schrijver. Combinatorial optimization: polyhedra and efficiency. Springer, 2003.
[105] L. Shapiro and G.C. Stockman. Computer Vision. Prentice Hall, 2001.
[106] Shcha. R theta line. http://en.wikipedia.org/wiki/File:R_theta_line.GIF,
2011. This photo, retrieved on 13th May 2010, was released into the public domain
by Wikipedia user Shcha, who created it.
[107] G. Sheasby, J. Valentin, N. Crook, and P.H.S. Torr. A robust stereo prior for human
segmentation. Asian Conference on Computer Vision (ACCV), 2012.
[108] G. Sheasby, J. Warrell, Y. Zhang, N. Crook, and P.H.S. Torr. Simultaneous human
segmentation, depth and pose estimation via dual decomposition. British Machine
Vision Conference, Student Workshop (BMVW), 2012.
[109] N.Z. Shor, K.C. Kiwiel, and A. Ruszczynski. Minimization methods for non-
differentiable functions. Springer-Verlag Berlin, 1985.
[110] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman,
and A. Blake. Real-time human pose recognition in parts from single depth images.
Communications of the ACM, 56(1):116124, 2013.
[111] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image un-
derstanding: Multi-class object recognition and segmentation by jointly modeling
texture, layout, and context. International Journal of Computer Vision (IJCV),
81(1):223, 2009.
[112] L. Sigal and M.J. Black. Humaneva: Synchronized video and motion capture data-
set for evaluation of articulated human motion. Brown University TR, 120, 2006.
[113] B. Sinclair. Sony reveals what makes PlayStation Move tick. http://uk.gamespot.
com/news/sony-reveals-what-makes-playstation-move-tick-6253435, 2010.
Retrieved 29th November 2012.
[114] Sony. Eye Toy. http://ecx.images-amazon.com/images/I/31PQsJIk2VL.jpg,
2003. This photo, retrieved on 22nd August 2012, appears here for illustrative
purposes, qualifying as fair dealing under Section 29 of the Copyright, Designs and
Patents Act 1988.
[115] Sony. Wonderbook. http://blog.problematicgamer.com/2012/06/e3-
wonderbook-announced.html, 2012. This photo, retrieved on 22nd August 2012,
is promotional material for an upcoming video game (Wonderbook, Sony, 2012),
and appears here for illustrative purposes, qualifying as fair dealing under Section
29 of the Copyright, Designs and Patents Act 1988.
[116] Sony. Wonderbook: Fire. http://www.digitalspy.co.uk/gaming/news/
a400867/wonderbook-pricing-revealed-hands-on-event-coming-to-
oxford-street.html, 2012. This photo, retrieved on 22nd August 2012, is
promotional material for an upcoming video game (Wonderbook, Sony, 2012), and
appears here for illustrative purposes, qualifying as fair dealing under Section 29
of the Copyright, Designs and Patents Act 1988.
[117] Spong. Keep Up. http://spong.com/asset/117573/2/11034546, 2004. This
photo, retrieved on 22nd August 2012, is a screenshot from a copyrighted game
(EyeToy: Play, SCEE, 2003), and appears here for illustrative purposes, qualifying
as fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[118] J. Talbot. Implementing GrabCut. http://www.justintalbot.com/course-
work/. Retrieved 1st April 2013.
[119] J. Talbot and X. Xu. Implementing GrabCut. Brigham Young University, 2006.
[120] A. Torralba, K.P. Murphy, and Freeman W.T. Sharing visual features for multiclass
and multiview object detection. IEEE Transactions on Pattern Recognition and
Machine Learning, 29(5):854869, 2007.
[121] Z. Tu. Probabilistic boosting-tree: Learning discriminative models for classification,
recognition, and clustering. In IEEE International Conference on Computer Vision
(ICCV), volume 2, pages 15891596, 2005.
[122] K. VanOrd. Eye of Judgment review. http://www.gamespot.com/the-eye-of-
judgment-legends/reviews/the-eye-of-judgment-review-6181426/, 2007.
Retrieved 21st August 2012.
[123] V. Vineet, J. Warrell, P. Sturgess, and P.H.S. Torr. Improved initialization and
Gaussian mixture pairwise terms for dense random fields with mean-field inference.
In British Machine Vision Conference (BMVC), pages 114, 2012.
[124] V. Vineet, J. Warrell, and P.H.S. Torr. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. In European Conference
of Computer Vision (ECCV), pages 3144, 2012.
[125] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm. IEEE Transactions on Information Theory, 13(2):260269,
1967.
[126] M. Walton. Kinect Star Wars review. http://www.gamespot.com/kinect-star-
wars/reviews/kinect-star-wars-review-6369636/, 2011. Retrieved 22nd Au-
gust 2012.
[127] H. Wang and D. Koller. Multi-level inference by relaxed dual decomposition for
human pose segmentation. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 24332440, 2011.
[128] C. Watters. Dance Central review. http://www.gamespot.com/dance-central/
reviews/dance-central-review-6283598/, 2010. Retrieved 22nd August 2012.
[129] D. Whitehead. EyePet review. http://www.eurogamer.net/articles/eyepet-
review, 2009. Retrieved 21st August 2012.
[130] J. Winn and J. Shotton. The layout consistent random eld for recognizing and
segmenting partially occluded objects. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 3744, 2006.
[131] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-
parts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 13851392, 2011.
[132] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-
parts: pose-release version 1.2. http://phoenix.ics.uci.edu/software/pose/,
2011. Retrieved 1st April 2013. Version 1.2 of the code was used throughout this
thesis.