Abstract

Binocular disparity and motion parallax are the most important cues for depth estimation in human and computer vision. Here, we present an experimental study that evaluates the accuracy of these two cues for estimating the distance to stationary objects in a static environment. Depth estimation via binocular disparity is most commonly implemented using stereo vision, which triangulates distances from images taken by two or more cameras. We use a commercial stereo camera mounted on a wheeled robot to create a depth map of the environment. The sequence of images obtained by one of these two cameras, together with the measured camera motion parameters (translational and angular velocities), serves as the input to our motion parallax-based depth estimation algorithm. Reference distances to the tracked features are provided by a LiDAR. Overall, our results show that stereo vision is more accurate at short distances, but at large distances the combination of parallax and camera motion provides better depth estimates. Therefore, by combining the two cues, one obtains depth estimation over a greater range than is possible with either cue alone.
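The two depth cues compared above reduce to the same triangulation geometry: stereo vision uses the fixed baseline between two cameras, while motion parallax uses the camera's own translation as a baseline. A minimal sketch of both relations follows; the focal length, baseline, and motion values are illustrative assumptions, not parameters from this study.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Stereo triangulation: Z = f * B / d, with disparity d in pixels,
    focal length f in pixels, and baseline B in meters."""
    return focal_px * baseline_m / disparity_px


def depth_from_parallax(translation_m: float, focal_px: float, image_shift_px: float) -> float:
    """Motion parallax for a purely lateral camera translation T:
    Z = f * T / dx, where dx is the image shift of the tracked feature.
    This is the same triangulation with the motion serving as the baseline."""
    return focal_px * translation_m / image_shift_px


# Illustrative values (assumed): f = 600 px, stereo baseline = 0.12 m.
print(depth_from_disparity(12.0, 600.0, 0.12))  # 6.0 (meters)
# A 0.5 m lateral move producing a 50 px feature shift gives the same depth.
print(depth_from_parallax(0.5, 600.0, 50.0))    # 6.0 (meters)
```

Because disparity shrinks inversely with depth, a fixed stereo baseline loses precision at range, whereas the motion baseline can be made arbitrarily large by moving farther, which is consistent with the trade-off reported above.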

Highlights

  • The human visual system relies on several different cues that provide depth information in static and dynamic environments: binocular disparity, motion parallax, the kinetic depth effect, looming, perspective cues from linear image elements, occlusion, smooth shading, blur, etc.

  • In the experiments with lower resolution, we considered three cases: two with favorable geometry for motion parallax, where the features were far from the focus of expansion, and one with poor geometry, where the features were close to the focus of expansion.

  • The performance of the stereo camera in depth estimation depends on the following factors: the distance and angle to the point features, texture, and camera resolution.


Introduction

The human visual system relies on several different cues that provide depth information in static and dynamic environments: binocular disparity, motion parallax, the kinetic depth effect, looming, perspective cues from linear image elements, occlusion, smooth shading, blur, etc. Information from multiple cues is combined to provide the viewer with a unified estimate of depth [1]. In this combination, the cues are weighted dynamically depending on the scene, observer motion, lighting conditions, etc. Computer vision approaches that take into account the combination of multiple cues can be implemented using semi-supervised deep neural networks [2]. In this approach, the depth of each pixel in an image is predicted directly from models trained offline on large collections of ground-truth depth data. Practical implementations usually incorporate monocular cues into a stereo system.

