
Rapid 2D to 3D Conversion

Phil Harman*, Julien Flack, Simon Fox, Mark Dowley
Dynamic Digital Depth Research Pty Ltd, Perth, Western Australia

ABSTRACT

The conversion of existing 2D images to 3D is proving commercially viable and fulfills the growing need for high quality stereoscopic images. This approach is particularly effective when creating content for the new generation of autostereoscopic displays that require multiple stereo images. The dominant technique for such content conversion is to develop a depth map for each frame of 2D material. The use of a depth map as part of the 2D to 3D conversion process has a number of desirable characteristics:

1. The resolution of the depth map may be lower than that of the associated 2D image;

2. It can be highly compressed;

3. 2D compatibility is maintained; and

4. Real time generation of stereo, or multiple stereo pairs, is possible.

The main disadvantage has been the laborious nature of the manual conversion techniques used to create depth maps from existing 2D images, which results in a slow and costly process. An alternative, highly productive technique has been developed based upon the use of Machine Learning Algorithms (MLAs). This paper describes the application of MLAs to the generation of depth maps and presents the results of the commercial application of this approach.

Keywords: 2D to 3D conversion, autostereoscopic displays, machine learning

1. INTRODUCTION

The last few years have seen a dramatic increase in the demand for stereo content. This has largely been driven by the commercial availability of multiviewer autostereoscopic displays, such as those manufactured by Stereographics [1], 4D-Vision [2] and Philips [3].

Such displays require a number of adjacent views of the scene, typically eight or nine, rather than the simple left and right eye views of previous stereoscopic display technologies. Whilst original content can be created for such displays using CGI-based material, consumer demand appears strongest in video formats. Recording multiple views live using synchronised cameras has been attempted but, particularly for indoor shots, has proven both cumbersome and time-consuming.

We have previously presented the advantages of 2D to 3D conversion by generating a depth map from the original 2D image [4]. This technique enables the conversion of existing content, as well as live broadcasting and recording, to be undertaken at a commercial level of service.

The use of a depth map as part of the 2D to 3D conversion process has a number of desirable characteristics. Consumer testing has indicated that, with currently available autostereoscopic displays, the resolution of the depth map may be substantially less than that of the associated 2D image before any degradation of the stereo image becomes apparent. Typically, for NTSC video resolution 2D images, a reduction of 4:1 may be used.

Since the depth map is of lower resolution and only contains luminance information, its bandwidth and storage requirements are lower than those of the associated 2D image. Optimum compression techniques can be used to reduce the depth map to less than 2% of the size of its associated 2D image [4]. This enables the depth map to be embedded in the original 2D image with minimal overhead and with the ability to deliver a 2D-compatible 3D™ image.

Software or hardware decoders can subsequently render, in real time, either a single stereo pair, or a series of perspective images, suitable for driving a wide range of stereoscopic displays [4][2][5].

2. DEPTH MAP GENERATION

A number of devices capable of capturing depth maps in real time, in synchronism with the 2D source, are now commercially available. These include 3DV's 'Z-Cam' and other sensors based on scanning lasers [6][7]. These systems enable live broadcasting and eliminate the need for content conversion. Whilst live recording will almost certainly be the dominant process in the future, there are still significant challenges in educating existing 2D content creators in this new art, as well as the costs associated with equipping studios with such technology.

In the meantime the conversion of 2D content, either pre-existing or recorded specifically for the purpose of display on a 3D screen, is a commercially viable alternative. Given the vast library of existing 2D material, the consumer is assured of both compelling and current content. Conversion of pre-existing content from 2D to 3D based on the generation of depth maps is now an established process [4]. The main disadvantage of the technique has been the manual nature of the majority of methods used to create depth maps, which results in a slow and costly process.

There are a number of manual techniques that are currently used to produce depth maps, which include:

1. Hand-drawn object outlines manually associated with an artistically chosen depth value; and

2. Semi-automatic outlining with corrections made manually by an operator.

Each of these has a number of drawbacks. Hand drawing produces high quality depth maps but is very time consuming and expensive. Semi-automatic outlining is generally unreliable where complex outlines are encountered.

Although the fully automated recovery of depth from monocular image sequences is possible under certain conditions, the operational constraints associated with such techniques limit their commercial viability. These approaches generally fall into one of the following two categories:

1. Depth from motion: The relationship between the motion of an object (relative to the camera) and its distance from the camera can be used to calculate depth maps by analyzing optic flow [8]. This technique can only recover relative depth accurately if the motion of all objects is directly proportional to their distance from the camera. This assumption only holds in a relatively small proportion of footage encountered (for example, a camera panning across a stationary scene). This principle, which exploits motion parallax, is also the basis of single lens stereo systems [9] and stereopsis by binocular delay [10].

2. Structure from motion (SFM): SFM is an active area of computer vision research in which correspondences between subsequent frames (or similar views of the same scene) are used to determine depth and recover camera parameters [11][12]. A restriction of this approach is that the 3D scene must be predominantly static – that is, objects must remain stationary. Furthermore, the camera must be moving relative to this static scene. Although this technique is used in the special effects industry for compositing live action footage with CGI, its application to depth recovery appears limited.

It should also be noted that these techniques rely on finding correspondences between frames, a process that is unreliable in the presence of low-textured, fast-moving objects. These fully automated techniques cannot recover depth in the absence of any motion.

3. IMPROVED DEPTH MAP GENERATION

The research presented in this paper describes a more pragmatic approach to the problem of 2D to 3D conversion. We have developed an efficient interactive or semi-automated process in which a special effects artist guides the generation of depth maps using a Machine Learning Algorithm (MLA).

3.1 Machine Learning Algorithms

An MLA can be considered as a black box that is trained to learn the relationships between a set of inputs and a set of outputs. As such, most MLAs consist of two stages: training and classification. In our application of MLAs the inputs relate to the position and colour of individual pixels. For the purpose of this paper, we define the five inputs of a pixel as x, y, r, g, b, where x and y represent the Cartesian coordinates and r, g, b respectively represent the red, green and blue colour components of any given pixel. The output of the MLA is the depth of the pixel, which we denote by z.
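As a minimal illustration of this representation, the per-pixel feature vectors can be assembled as follows (a sketch using NumPy; the array layout and the function name pixel_features are our own, not part of the original system):

```python
import numpy as np

def pixel_features(image):
    """Build an (N, 5) array of (x, y, r, g, b) feature vectors,
    one row per pixel, from an RGB image of shape (H, W, 3)."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]      # Cartesian coordinates of every pixel
    rgb = image.reshape(-1, 3)       # flattened colour components
    return np.column_stack([xs.ravel(), ys.ravel(), rgb])
```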

3.1.1 Training

During the training stage samples are presented to the MLA along with the known depth:

[Diagram: during training, each input sample (x, y, r, g, b) is presented to the MLA together with its known depth z.]

The MLA will adjust its internal configuration to “learn” the relationships between the samples and their associated depth. The details of this learning process vary according to the algorithm used. Popular algorithms include Decision Trees and Neural Networks [13], the specifics of which are beyond the scope of this paper.
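For concreteness, a minimal training sketch is given below. It uses a decision-tree regressor from scikit-learn purely as an example of the kind of MLA the paper leaves unspecified; the sparse samples would come from the artist's annotations.

```python
from sklearn.tree import DecisionTreeRegressor

# train_xyrgb: (N, 5) array of (x, y, r, g, b) samples annotated by the artist
# train_z:     (N,)   array of the corresponding known depths
def train_mla(train_xyrgb, train_z):
    """Fit an MLA (here, a decision tree) mapping (x, y, r, g, b) -> depth z."""
    mla = DecisionTreeRegressor()
    return mla.fit(train_xyrgb, train_z)
```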

3.1.2 Classification

During classification samples with unknown depth values are presented to the MLA, which uses the relationship established during training to determine an output depth value.

[Diagram: during classification, samples (x, y, r, g, b) with unknown depth are presented to the trained MLA, which outputs a depth value for each.]
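Continuing the sketch above (and reusing the hypothetical pixel_features helper defined earlier), classification simply evaluates the trained MLA over every pixel of a frame:

```python
def classify_frame(mla, image):
    """Predict a dense depth map for an RGB image using a trained MLA."""
    h, w, _ = image.shape
    z = mla.predict(pixel_features(image))   # one depth value per pixel
    return z.reshape(h, w)                   # reassemble as a depth map
```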

The learning process is applied in two related phases of the rapid 2D to 3D conversion process:

1. Depth mapping: assigning depths to key frames

2. Depth tweening: generating depth maps for frames between the key frames mapped in the previous phase

3.1.3 Depth Mapping

During the depth mapping phase of rapid 2D to 3D conversion the MLA is applied to a single key frame. Manual depth mapping techniques traditionally require the user to associate a depth with every pixel of the source image, typically by manipulating some geometric objects (such as Bezier curves). By using an MLA we can significantly reduce the amount of effort required to produce a depth map.


Figure 1: (left) An example source frame 1 – the dots indicate the position of training samples. The colour of the dots indicates the depth associated with the pixel. A horizon line may be used to add depth ramps. (right) The completed depth map derived from the MLA with added depth ramp.

1 The images used in this test are taken from “Ultimate G’s: Zac’s Flying Dream” ©Copyright 1999, Sky High Entertainment. All rights reserved.

Figure 1 indicates how an MLA provided with a relatively small number of training samples, as indicated by the depth coloured dots on the source frame, can generate an accurate depth map. In this instance 623 samples were used – this represents approximately 0.2% of the total number of pixels in the image. In more complex scenes additional training data is required, but it is rarely necessary to supply more than 5% of the image as training samples to achieve an acceptable result.

In this example, the results from the MLA are composited on top of a perspective depth ramp by adding a horizon line. Depth maps are median filtered and smoothed to reduce stereoscopic rendering artifacts.
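This post-processing step can be sketched as follows (our own illustration; the horizon position, filter sizes, and the max-blend compositing rule are assumptions, not the paper's published parameters):

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

def postprocess_depth(z, horizon_row, near=255.0):
    """Composite an MLA depth map over a perspective ramp, then clean it up.
    Assumes larger values are nearer and horizon_row < h - 1."""
    h, w = z.shape
    # Linear depth ramp: far (0) at the horizon row, nearest at the bottom.
    rows = np.arange(h, dtype=float)
    ramp = np.clip((rows - horizon_row) / (h - 1 - horizon_row), 0, 1) * near
    composite = np.maximum(z, ramp[:, None])   # keep the nearer value
    # Median filtering suppresses outliers; Gaussian smoothing reduces
    # stereoscopic rendering artifacts at depth discontinuities.
    return gaussian_filter(median_filter(composite, size=5), sigma=1.0)
```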

3.1.4 Depth Tweening

Depth maps are generated for key frames using the process described above. These frames are strategically located at points in an image sequence where there is significant change in the colour and/or position of objects. Key frames may be identified manually, or techniques used for detecting shot transitions [14] may be used to automate this process.


Figure 2: An illustration of the depth tweening process. At each key frame an MLA is trained using the known depth map of the source image. At any given "tween" frame the results of these MLAs are combined to generate a tweened depth map.

During the depth tweening phase of the rapid conversion process, MLAs are used to generate depth maps for each frame between any two existing key frames. This process is illustrated in figure 2. As indicated, a separate MLA is trained for each key frame source and depth pair. For any other frame in the sequence the x, y, r, g, b values are input into both MLAs and the resulting depths (z_1 and z_2) are combined using a normalised time-weighted sum:

$$w_1 = \left(\frac{1}{f - k_1}\right)^{P}, \qquad w_2 = \left(\frac{1}{k_2 - f}\right)^{P}$$

$$\mathrm{Depth} = \frac{w_1 z_1 + w_2 z_2}{w_1 + w_2}$$
where f is the timecode of the frame under consideration, k_1 is the timecode of the first key frame and k_2 is the timecode of the second key frame. The parameter P is used to control the rate at which the influence of an MLA decays with time. Figure 3 illustrates an example of the MLA weighting functions for P = 2.
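A direct transcription of this weighting scheme is shown below (a sketch; the epsilon guard at the key frames is our addition to avoid division by zero):

```python
def tween_depth(z1, z2, f, k1, k2, p=2.0, eps=1e-6):
    """Combine the depth predictions of two key-frame MLAs for frame f
    (k1 <= f <= k2) using the normalised time-weighted sum."""
    w1 = (1.0 / max(f - k1, eps)) ** p   # influence of key frame 1
    w2 = (1.0 / max(k2 - f, eps)) ** p   # influence of key frame 2
    return (w1 * z1 + w2 * z2) / (w1 + w2)
```

At f = k1 the first weight dominates, so the tweened result converges to the key frame's own depth map, consistent with the zero error observed at key frames in section 4.1.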


Figure 3: Plot showing the relative MLA weighting over time – in this example P=2.

4. RESULTS

The rapid conversion process was tested on a short sequence of 43 frames. This sequence is relatively challenging as it contains fast motion and overlapping regions of similar colour (the oarsman's head and the cliffs on the left-hand side of the image). Three key frames (at frames 1, 14 and 43) were depth mapped and the remaining frames were converted by depth tweening.

Figure 4: Source (left) and depth map (right) generated by depth tweening at frame 6.

Figure 4 shows the depth map generated by tweening at frame 6 using the key frames at positions 1 and 14. The frames furthest from a key frame generally contain the most errors, as the difference between the source at training and at classification is greatest. The depth map in figure 4 accurately represents the major structure of the scene, although there are misclassification errors between the oarsman's head and the background. Similarly, figure 5 shows the depth map generated by tweening at frame 32 using the key frames at frames 14 and 43.

Figure 5: Source (left) and depth map (right) generated by depth tweening at frame 32.

This 43 frame sequence was successfully depth mapped by providing around 8,000 training samples over the 3 key frames. This represents only 0.05% of the total number of pixels depth mapped in this sequence.

4.1 Quantitative Analysis

In order to evaluate the accuracy of depth maps generated by the tweening process, a CGI sequence was used as "ground truth" in a quantitative comparison. Pixel-accurate depth maps can be generated for CGI scenes using most commercial CG packages. We used these depths to measure the root mean square error of the depth tweening results.

The graph in figure 6 shows the RMS analysis on a 30-frame sequence with 4 key frames. The RMS error is shown as a percentage of the total depth range. At the key frames the RMS error drops to zero and, as expected, the error increases with distance from the nearest key frame. The higher RMS errors at the end of the sequence are due to the presence of very fast motion in the scene.
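This error metric can be computed as below (our sketch; depth_range is the full range of the depth representation, e.g. 255 for 8-bit depth maps):

```python
import numpy as np

def rms_error_percent(predicted, ground_truth, depth_range=255.0):
    """Root mean square depth error as a percentage of the total depth range."""
    rmse = np.sqrt(np.mean((predicted.astype(float) - ground_truth) ** 2))
    return 100.0 * rmse / depth_range
```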

[Plot: RMS depth tweening error vs. frame number (frames 639-669), compared against CG ground truth.]

Figure 6: Root mean square error as a percentage of the depth resolution for a 30-frame sequence with 4 key frames

The average root mean square error for this sequence was 7.5% of the total depth range. These results indicate that we can reconstruct the depth maps with better than 90% accuracy given just over 10% of the frames as training data. It should be noted that although the frame-by-frame RMS error is a useful tool for evaluating such techniques, other factors such as edge accuracy, smoothness and temporal consistency are crucial for generating an effective stereoscopic display.

The rapid conversion process, as described, has been successfully deployed in a commercial content conversion service for the last two years.

5. CONCLUSIONS

The application of MLAs to the generation of depth maps has resulted in a substantial reduction in both the time and manual effort required for converting 2D images to 3D. Training MLAs for content conversion is simple, intuitive and easy to learn. The depth maps generated by this rapid conversion process are of a high enough quality for commercial applications using autostereoscopic displays.

REFERENCES

1. L. Lipton, "SynthaGram™: an autostereoscopic display technology", to be published in Proc. SPIE Vol. 4660, Stereoscopic Displays and Virtual Reality Systems IX, Jan. 2002.

2. 4D-Vision GmbH homepage: http://www.4d-vision.de

3. C. van Berkel, D. W. Parker and A. R. Franklin, "Multiview 3D-LCD", Proc. SPIE Vol. 2653, Stereoscopic Displays and Virtual Reality Systems III, ed. S. S. Fisher, J. O. Merritt, M. T. Bolas, pp. 32-39, Apr. 1996.

4. P. V. Harman, "Home-based 3D Entertainment – An Overview", Proc. IEEE International Conference on Image Processing, pp. 1-4, Vancouver, 2000.

5. J. Eichenlaub, "A Lightweight, Compact 2D/3D Autostereoscopic LCD Backlight for Games, Monitor and Notebook Applications", Proc. SPIE, Stereoscopic Displays and Applications, San Jose, California, pp. 180-185, 1998.

6. J. Berg, "3D Vision for Autonomous Robot-Based Security Operations", Advanced Imaging, pp. 20-24, Jan. 2001.

7. J. A. Beraldin, F. Blais, L. Cournoyer, G. Godin and M. Rioux, "Active 3D Sensing", Modelli e Metodi per lo studio e la conservazione dell'architettura storica, Scuola Normale Superiore, Pisa, NRC 44159, pp. 22-46, 2000.

8. Y. Matsumoto, H. Terasaki, K. Sugimoto and T. Arakawa, "Conversion System of Monocular Image Sequence to Stereo Using Motion Parallax", Proc. SPIE Vol. 3012, Stereoscopic Displays and Virtual Reality Systems IV, ed. S. S. Fisher, J. O. Merritt, M. T. Bolas, pp. 108-112, May 1997.

9. B. J. Garcia, "Approaches to stereoscopic video based on spatio-temporal interpolation", Proc. SPIE Vol. 2653, Stereoscopic Displays and Virtual Reality Systems III, ed. S. S. Fisher, J. O. Merritt, M. T. Bolas, pp. 85-95, Apr. 1996.

10. J. Ross, "Stereopsis by binocular delay", Nature, Vol. 248, pp. 363-364, 1974.

11. C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: A factorization approach", International Journal of Computer Vision (IJCV), 9(2), pp. 137-154, Nov. 1992.

12. A. Zisserman, A. W. Fitzgibbon and G. Cross, "VHS to VRML: 3D Graphical Models from Video Sequences", Proc. International Conference on Multimedia Systems, pp. 51-57, 1999.

13. T. M. Mitchell, "Machine Learning", McGraw-Hill, 1997.

14. S. Porter, M. Mirmehdi and B. Thomas, "Detection and Classification of Shot Transitions", British Machine Vision Conference (BMVC), pp. 73-82, 2001.

* P. V. Harman, Chief Technology Officer, Dynamic Digital Depth Research Pty Ltd, 6a Brodie Hall Drive, Bentley, Western Australia 6102. Tel: +61 8 9355688. Fax: +61 8 93556988. Email: pharman@ddd.com. Web: www.ddd.com