I. INTRODUCTION
3D scene understanding is a fundamental problem in
robotics and perception, as knowledge of the environment
is a prerequisite for complex robotic tasks and interactions.
Thus far, most robotic scene understanding work has focused
on outdoor scenes captured with laser scanners, where technologies have matured and enabled high-impact applications
such as mapping and autonomous driving [2], [25], [21], [7],
[30]. Indoor scenes, in comparison, prove more challenging
and cover a wider range of objects, scene layouts, and scales.
A variety of approaches have been proposed for labeling 3D
indoor scenes and detecting objects in them [29], [14], [12],
[4], [19], [26], typically on a single camera view or laser
scan. Purely shape-based labeling is often very successful
on large architectural elements (e.g. wall, chair) but less so
on small objects (e.g. coffee mug, bowl). Robust labeling
of objects requires the use of rich visual information in
addition to shape. While image-only object matching can
work very well for textured objects [4], recent works on
RGB-D perception combine both shape and color and show
promising results for generic object detection [26], [19], [20].
Advances in RGB-D matching and scene reconstruction
make it feasible to go beyond single-view labeling and allow
the continuous capture of a scene, where RGB-D videos
can be robustly and efficiently merged into consistent and
detailed 3D point clouds [18]. One way to label a multi-frame scene would be to work directly with the merged point cloud. Point cloud labeling has worked very well for outdoor scenes [21], [7], [30], and the recent work of [1] illustrates how labeling can be done on indoor 3D scenes, where local features are computed from segments of point clouds and integrated in a Markov Random Field (MRF).
Fig. 1. An illustration of our detection-based approach versus a 3D segmentation approach. (a) A typical approach to 3D labeling segments the scene
into components, inevitably causing some objects to be oversegmented. (b)
We train object detectors and use them to localize entire objects in individual
frames, taking advantage of rich visual information.
Fig. 2. Each voxel in the scene (center) contains 3D points projected from
multiple RGB-D frames. The points and their projections are known, so we
can compute average likelihoods for each voxel based on the likelihoods
of constituent points. Combining detections from multiple frames greatly
improves the robustness of object detection.
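As a concrete sketch of this per-voxel averaging (not our exact implementation; the grid resolution and all names here are illustrative assumptions):

import numpy as np
from collections import defaultdict

def voxel_likelihoods(points, point_scores, voxel_size=0.01):
    """Average per-point detection likelihoods inside each voxel.

    points:       (N, 3) array of 3D points merged from all RGB-D frames.
    point_scores: (N, C) array; point_scores[i, c] is the likelihood that
                  point i belongs to category c, taken from the 2D detector
                  response in the frame the point was projected from.
    Returns a dict mapping voxel index (i, j, k) -> (C,) mean likelihood.
    """
    sums = defaultdict(lambda: np.zeros(point_scores.shape[1]))
    counts = defaultdict(int)
    # Quantize each point into its voxel and accumulate its scores.
    for p, s in zip(points, point_scores):
        key = tuple(np.floor(p / voxel_size).astype(int))
        sums[key] += s
        counts[key] += 1
    # Averaging over detections from many frames makes each voxel's label
    # estimate far more robust than any single-frame detection.
    return {k: sums[k] / counts[k] for k in sums}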
[Equations (9)-(12) are garbled in extraction; only fragments of the pairwise MRF term survive: an indicator 1_{y_i != y_j}, a normal-agreement term (I(n_i, n_j) + \epsilon), and a point distance d(n_i, n_j).]
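A plausible reconstruction of this pairwise potential, assuming a standard contrast-sensitive Potts form (the exact weighting and the role of the small constant \epsilon are assumptions, since only fragments survive):

    \psi_{ij}(y_i, y_j) = \mathbf{1}_{y_i \neq y_j} \cdot \frac{I(n_i, n_j) + \epsilon}{d(n_i, n_j)}

Under this form, \mathbf{1}_{y_i \neq y_j} is 1 only when neighboring voxels take different labels, I(n_i, n_j) measures the agreement of the surface normals, and d(n_i, n_j) is the distance between the corresponding points, so label boundaries are encouraged to fall along normal and depth discontinuities.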
captured with a freely moving Kinect-style camera (RGB-D Scenes Dataset). We train our object detectors using
individually captured objects and evaluate scene labeling on
point clouds reconstructed from the RGB-D scene videos. We
show how to use 3D scene labeling results to improve the
performance of bounding box based object detection. Finally,
we show examples of 3D labeling using only web image data
for training.
A. Experimental Setup
The RGB-D Object Dataset [19] contains 250,000 segmented RGB-D images of 300 objects in 51 categories. It
also includes the RGB-D Scenes Dataset, which consists of
eight video sequences of office and kitchen environments.
The videos are recorded with a Kinect-style sensor from
Primesense and each contain from 1700 to 3000 RGB-D
frames. We evaluate our system's ability to detect and label objects in five categories (bowl, cap, cereal box, coffee mug, and soda can) and to distinguish them from everything else, which we label as background. Each scene used in our evaluation contains everything from walls, doors, and tables to a variety of objects placed on multiple supporting surfaces. Considering only the five categories listed above, each scene contains between 2 and 11 objects spanning up to all five
categories. There are a total of 19 unique objects appearing
in the eight scenes.
We train a linear SVM sliding window detector for each
of the five object categories using views of objects from the
RGB-D Object Dataset. Each view is an RGB-D image of
the object as it is rotated on a turntable. This provides dense
coverage of the object as it is rotated about its vertical axis.
The data contains views from three different camera angles
at 30, 45, and 60 degrees with the horizon. Each detector is
trained using all but one of the object instances in the dataset,
so 5 of the 19 objects that appear in the evaluation were
never seen during training. The aspect ratio of each detector's bounding box is set to be the tightest box that can be resized to contain all training images. In total, the positive training set of each category detector consists of around 500 RGB-D images. We use the 11 background videos in the RGB-D Scenes Dataset as negative training data.
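For illustration, a minimal single-scale sketch of scoring such a linear SVM sliding-window detector (a simplification under assumed names; the actual detectors also use depth features and search over an image pyramid):

import numpy as np

def sliding_window_scores(feature_map, w, b, win_h, win_w, stride=1):
    """Score every fixed-aspect-ratio window with a linear SVM.

    feature_map: (H, W, D) dense per-cell features (e.g., HOG cells).
    w, b:        learned SVM weights; w has shape (win_h, win_w, D).
    Returns a list of (score, row, col) for each window position.
    """
    H, W, _ = feature_map.shape
    detections = []
    for r in range(0, H - win_h + 1, stride):
        for c in range(0, W - win_w + 1, stride):
            window = feature_map[r:r + win_h, c:c + win_w, :]
            # Linear SVM score: dot product of weights and window features.
            score = float(np.sum(w * window)) + b
            detections.append((score, r, c))
    return detections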
Technique     Micro F-score   Macro F-score
Random             16.7             7.4
AllBG              94.9            15.8
DetOnly            72.5            71.3
PottsMRF           87.7            87.4
ColMRF             88.6            88.5
Det3DMRF           89.9            89.8

Fig. 4. Micro and Macro F-scores for two naïve baselines (random labeling and labeling everything as background) and for the proposed detection-based 3D scene labeling approach and its variants.
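For reference, the standard definitions behind these two scores: the macro F-score is the unweighted mean of the per-category F1 scores, while the micro F-score computes a single F1 from counts pooled over all points,

    F_{macro} = \frac{1}{C} \sum_{c=1}^{C} F1_c, \qquad F_{micro} = F1\Big(\sum_c TP_c,\; \sum_c FP_c,\; \sum_c FN_c\Big).

This is why AllBG, which labels every point as background, reaches 94.9 micro (background dominates the point count) but only 15.8 macro.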
Technique    Bowl        Cap         Cereal Box   Coffee Mug   Soda Can    Background   Overall
DetOnly      46.9/90.7   54.1/90.5   76.1/90.7    42.7/74.1    51.6/87.4   98.8/93.9    61.7/87.9
PottsMRF     84.4/90.7   74.8/91.8   88.8/94.1    87.2/73.4    87.6/81.9   99.0/98.3    86.9/88.4
ColMRF       93.7/86.9   81.3/92.2   91.2/89.0    88.3/73.6    83.5/86.5   98.7/98.8    89.4/87.8
Det3DMRF     91.5/85.1   90.5/91.4   93.6/94.9    90.0/75.1    81.5/87.4   99.0/99.1    91.0/88.8
(All entries are Precision/Recall.)
Fig. 5. Per-category and overall (macro-averaged across categories) precisions and recalls for the proposed detection-based 3D scene labeling approach
and its variants. Our approach works very well for all object categories in the RGB-D Scene Dataset.
Fig. 6. Close-up of a scene containing a cap (green), a cereal box (blue), and a soda can (cyan). From left to right, the 3D scene; Detection-only leaves
patches of false positives; Potts MRF removes isolated patches but cannot cleanly segment objects from the table; Color MRF includes part of the table
with the cap because it is similar in color due to shadows; the proposed detection-based scene labeling obtains clean segmentations. Best viewed in color.
Fig. 7. 3D scene labeling results for three complex scenes in the RGB-D Scenes Dataset. 3D reconstruction (top), our detection-based scene labeling
(bottom). Objects colored by their category: bowl is red, cap is green, cereal box is blue, coffee mug is yellow, and soda can is cyan. Best viewed in color.
Although it is relatively easy to collect RGB-D object data using the approach proposed in [19], it is still time-consuming to get enough data to train good object detectors.
Given that large labeled image collections are available
online now [6], in this experiment we try to train object
detectors using an existing image dataset (here we use
ImageNet) and use the resulting detector responses in our
system to perform scene labeling. This can be done in exactly
the same framework, except that now the probability map is
obtained from detectors that use only HOG features extracted
from the RGB image. Since only the detectors need to
be changed, we can still use the depth map to refine the
probability map and perform 3D scene reconstruction.
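A minimal sketch of producing such an RGB-only probability map with HOG features, using scikit-image and scikit-learn (the function names and the logistic calibration are assumptions for illustration, not the exact pipeline; inputs are equal-size grayscale crops resized to the detector's canonical window):

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def train_hog_detector(pos_images, neg_images):
    """Train a linear SVM on HOG features from RGB-derived images alone."""
    X = [hog(img, orientations=9, pixels_per_cell=(8, 8),
             cells_per_block=(2, 2)) for img in pos_images + neg_images]
    y = [1] * len(pos_images) + [0] * len(neg_images)
    return LinearSVC(C=1.0).fit(np.array(X), np.array(y))

def window_probability(clf, window):
    """Map the SVM margin to a pseudo-probability with a logistic squash
    (an assumed calibration; Platt scaling would be the principled choice)."""
    f = hog(window, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2))
    margin = clf.decision_function(f.reshape(1, -1))[0]
    return 1.0 / (1.0 + np.exp(-margin))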
We retrieve the coffee mug category from ImageNet and use these images to train an image-only coffee mug detector.
V. DISCUSSION
RGB-D mapping techniques have made it feasible to
capture and reconstruct 3D indoor scenes from RGB-D
videos. In this work we pursue an object-centric view of
scene labeling, utilizing view-based object detectors to find
objects in frames, project detections into 3D and integrate
them in a voxel-based MRF. We show that our detection-based approach, reasoning about whole objects in the source RGB-D frames (instead of 3D point clouds), leads to very accurate labelings of objects and can be used to make object detection in still images much more robust.
Labeling objects in indoor scenes is very challenging,
partly because there exists a huge range of objects, scene
types, and scales. One needs to utilize the full resolution
of sensor data and detailed object knowledge. We see a promising direction forward in focusing on objects and learning about them. There exists a large amount of object data, such as
in ImageNet [6] and Google 3D Warehouse [13]. We have
shown an example where we reliably label objects in 3D
scenes using only ImageNet for training. There are many
such opportunities where object knowledge may be acquired
and applied to scene understanding.
REFERENCES
[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In Advances in Neural Information Processing Systems (NIPS), 2011.
[2] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A. Ng. Discriminative learning of Markov random fields for segmentation of 3D scan data. In Proc. of CVPR, 2005.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE PAMI, 23(11):1222-1239, 2001.
[4] A. Collet, M. Martinez, and S. Srinivasa. Object recognition and full pose registration from a single image for robotic manipulation. IJRR, 30(10), 2011.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. of CVPR, 2005.