Semi-Automatic Depth Map Generation in Unconstrained Images and Video Sequences for 2D to Stereoscopic 3D Conversion

Raymond Phan

doi:10.32920/ryerson.14652426.v1

Abstract

In this work, we describe a system for accurately estimating depth, in the form of synthetic depth maps, in unconstrained conventional monocular images and video sequences, in order to semi-automatically convert them into their stereoscopic 3D counterparts. In currently accepted industry practice, this conversion is either performed automatically in a black-box fashion, or performed manually by human operators, known as rotoscopers, who extract features and objects on a frame-by-frame basis. Automatic conversion is the least labour intensive, but it allows little to no user intervention, and error correction can be difficult. Manual conversion is the most accurate and provides the most control, but it is very time consuming and is prohibitive for all but the largest production studios. A semi-automatic method blends the merits of the two, allowing for faster yet accurate conversion and decreasing the time needed to release 3D content to viewers. Semi-automatic methods require the user to place strokes over the image, or over several keyframes in the case of video, each corresponding to a rough estimate of the scene depth at that stroke. The remaining depths are then determined, producing a depth map, and Depth Image Based Rendering (DIBR) is employed to generate the artificial views that make up the stereoscopic 3D content. Here, depth map estimation can be considered a multi-label image segmentation problem in which each class is a depth value. Additionally, for video, we allow the option of labeling only the first frame, with the strokes propagated using one of two techniques: a modified computer vision object tracking algorithm, or edge-aware, temporally consistent optical flow.

Fundamentally, this work combines the merits of two well-respected segmentation algorithms: Graph Cuts and Random Walks. The diffusion of depths with smooth gradients from Random Walks, combined with the edge-preserving properties of Graph Cuts, can create the best possible result. To demonstrate that the proposed framework generates good-quality stereoscopic content with minimal effort, we create results and compare them to those of the current best-known semi-automatic conversion framework. We also show that our results are more suitable for human perception than those of this framework.
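To make the idea of diffusing stroke depths concrete, the following is a minimal sketch, not the method of this work. It assumes a grayscale image with intensities in [0, 1], treats depth as a single continuous value per pixel, and fills in the unlabeled pixels by repeatedly averaging each pixel's four neighbours with edge-aware weights (a Jacobi-style solve of the kind of weighted Dirichlet problem that underlies Random Walks). The function name `diffuse_depth`, the parameters `beta` and `iterations`, and the weight formula are illustrative assumptions, and the Graph Cuts component is not represented at all.

```python
import numpy as np

def diffuse_depth(image, seeds, seed_mask, beta=50.0, iterations=500):
    """Spread user-stroke depths to unlabeled pixels (illustrative sketch).

    image:     (H, W) grayscale array, assumed to lie in [0, 1]
    seeds:     (H, W) array holding depth values where seed_mask is True
    seed_mask: (H, W) boolean array marking stroke pixels
    """
    image = np.asarray(image, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    seed_mask = np.asarray(seed_mask, dtype=bool)

    # Start unlabeled pixels at the mean stroke depth.
    depth = np.where(seed_mask, seeds, seeds[seed_mask].mean())

    # Edge-aware weights: near 1 in flat regions, near 0 across strong edges.
    # The small constant keeps every pixel weakly connected to its neighbours.
    w_h = np.exp(-beta * (image[:, 1:] - image[:, :-1]) ** 2) + 1e-6
    w_v = np.exp(-beta * (image[1:, :] - image[:-1, :]) ** 2) + 1e-6

    for _ in range(iterations):
        num = np.zeros_like(depth)
        den = np.zeros_like(depth)
        # Accumulate weighted neighbour depths: right, left, below, above.
        num[:, :-1] += w_h * depth[:, 1:]
        den[:, :-1] += w_h
        num[:, 1:] += w_h * depth[:, :-1]
        den[:, 1:] += w_h
        num[:-1, :] += w_v * depth[1:, :]
        den[:-1, :] += w_v
        num[1:, :] += w_v * depth[:-1, :]
        den[1:, :] += w_v
        # Stroke pixels stay fixed; all others take the weighted average.
        depth = np.where(seed_mask, seeds, num / den)
    return depth
```

On a uniform image the result is a smooth ramp between strokes, while a strong intensity edge between two strokes keeps their depths from bleeding into each other. A production system would more likely solve the sparse linear system directly rather than iterate.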

Highlights

Despite its surge in popularity, some viewers are very skeptical about the enhanced experience of watching stereoscopic 3D films.
Because the TLD tracker incorporates the KLT tracker, if the stroke lies on a uniform background, the online learning, randomized fern detector (RFD) and nearest neighbour classifier (NNC) should be disabled so that only the KLT operates.
To create the stereoscopic image, the original image serves as the left view, while the right view is created using simple Depth Image Based Rendering (DIBR).
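As a rough illustration of the last highlight, the sketch below renders a right view from a left view and a depth map by shifting each pixel horizontally. It is a generic, simplified DIBR and not necessarily the procedure used in this work. The function name `render_right_view`, the `max_disparity` parameter, the convention that depth lies in [0, 1] with 1 being nearest, and the naive hole filling are all assumptions made for the example.

```python
import numpy as np

def render_right_view(left, depth, max_disparity=8):
    """Synthesize a right view from a left view and its depth map (sketch).

    left:  (H, W) or (H, W, C) image array
    depth: (H, W) array assumed to lie in [0, 1], with 1 the nearest depth
    """
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    # Nearer pixels receive a larger horizontal shift, in whole pixels.
    disparity = np.rint(depth * max_disparity).astype(int)

    right = np.zeros_like(left)
    best_depth = np.full((h, w), -np.inf)  # nearest depth written per target

    for y in range(h):
        for x in range(w):
            xt = x - disparity[y, x]  # near objects move left in the right view
            # When two pixels land on the same target, keep the nearer one.
            if 0 <= xt < w and depth[y, x] > best_depth[y, xt]:
                right[y, xt] = left[y, x]
                best_depth[y, xt] = depth[y, x]

        # Naive hole filling: copy from the right neighbour first (usually the
        # uncovered background), then from the left for any remaining holes.
        filled = np.isfinite(best_depth[y])
        for x in range(w - 2, -1, -1):
            if not filled[x] and filled[x + 1]:
                right[y, x] = right[y, x + 1]
                filled[x] = True
        for x in range(1, w):
            if not filled[x] and filled[x - 1]:
                right[y, x] = right[y, x - 1]
                filled[x] = True
    return right
```

With a depth map of all zeros, the right view is identical to the left view. Practical DIBR pipelines typically also smooth the depth map beforehand and use more careful inpainting for the disoccluded regions.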

Summary

Introduction

Despite its surge in popularity, some viewers are very skeptical about the enhanced experience of watching stereoscopic 3D films. In particular, Roger Ebert, a well-known film critic for the Chicago Sun-Times, wrote an article describing his abhorrence of 3D and why it should not be used as a means of entertainment. He lists various reasons, which include nausea and headaches, as well as visual discomfort [3]. Humans view the world using two views (known as binocular vision), where the left view is one view of the scene, and the right view is a slightly horizontally shifted version of the left view (sometimes with the addition of some vertical components). These views are presented to each eye and processed by the visual cortex in our brain, and we perceive depth [9][10]. Stereoscopy is quite different from viewing 3D displays, which display an image in three full dimensions, allowing the observer to take in more information about the three-dimensional objects being displayed by performing head and eye movements.

Methods

Results

Conclusion