Depth estimation for a road scene using a monocular image sequence based on fully convolutional neural network

Haixia Wang,Zhiguo Zhang,Chunyang Sheng,Xiao Lu,Yehao Sun

doi:10.1177/1729881420925305

Abstract

Advanced driving assistant systems are among the most active research topics today, and depth estimation is an important cue for them. Depth prediction is a key problem in understanding the geometry of a road scene. Compared with depth estimation methods that use stereo depth perception, determining depth relations with a monocular camera is considerably more challenging. In this article, a fully convolutional neural network with skip connections, operating on a monocular video sequence, is proposed. By combining skip connections, a fully convolutional network, and the consistency between consecutive frames of the input sequence in one framework, high-resolution depth maps are obtained with lightweight network training and fewer computations. The proposed method models depth estimation as a regression problem and trains the network using a scale-invariant optimization based on the L2 loss function, which measures the relationships between points in consecutive frames. The proposed method can be used for depth estimation of a road scene without any extra information or geometric priors. Experiments on road scene data sets demonstrate that the proposed approach outperforms previous methods for monocular depth estimation in dynamic scenes. Compared with existing methods, our method achieves good results under the Eigen split evaluation. Most notably, the linear root mean squared error is 3.462 and the δ < 1.25 accuracy is 0.892.
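The two figures quoted at the end of the abstract are standard monocular depth metrics. As a minimal sketch, assuming the usual definitions (root mean squared error in linear depth units, and the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25), they could be computed as follows. The function names and the sample depth values are hypothetical and are not taken from the article.

```python
import math


def rmse_linear(pred, gt):
    """Root mean squared error in linear depth units (e.g. metres)."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    squared = sum((p - g) ** 2 for p, g in zip(pred, gt))
    return math.sqrt(squared / len(gt))


def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


# Hypothetical depths in metres for four valid pixels (illustration only).
gt = [10.0, 20.0, 40.0, 80.0]
pred = [11.0, 19.0, 52.0, 78.0]

print(rmse_linear(pred, gt))
print(threshold_accuracy(pred, gt))
```

Lower is better for the root mean squared error, while the threshold accuracy lies in [0, 1] and higher is better.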

Highlights

Estimating depth from a single image is a very important problem in the computer vision field
By combining skip connections, a fully convolutional network (FCN), and the consistency between consecutive frames of the input sequence in one framework, high-resolution depth maps are obtained with lightweight network training and fewer computations
The proposed method models depth estimation as a regression problem and trains the network using a scale-invariant optimization based on the L2 loss function, which measures the relationships between points in consecutive frames
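To illustrate what scale invariance means for an L2-style depth loss, the sketch below uses a widely known log-space formulation: the mean squared log difference minus a term that cancels any global scale offset. This is an assumption made for illustration only. The article's own loss additionally relates points across consecutive frames and may differ in form; the function name and the `lam` parameter are hypothetical.

```python
import math


def scale_invariant_l2(pred, gt, lam=1.0):
    """Scale-invariant squared error in log-depth space.

    With lam = 1.0, multiplying every prediction by the same positive
    constant leaves the loss unchanged; with lam = 0.0 this reduces to a
    plain mean squared error on log depths.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    diffs = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    offset = (sum(diffs) ** 2) / (n * n)  # squared mean log difference
    return mean_sq - lam * offset


# Hypothetical depths in metres (illustration only).
gt = [5.0, 12.0, 30.0, 60.0]
pred = [5.5, 11.0, 33.0, 70.0]

loss = scale_invariant_l2(pred, gt)
loss_rescaled = scale_invariant_l2([2.0 * p for p in pred], gt)
print(loss, loss_rescaled)
```

The two printed values agree up to floating-point rounding, which is the property that makes such a loss suitable when a monocular network can only recover depth up to an unknown global scale.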

Summary

Introduction

Estimating depth from a single image is a very important problem in the computer vision field. Depth estimation is a key problem for many research topics such as three-dimensional (3-D) modeling, 3-D reconstruction, scene understanding, object detection and robotics, semantic segmentation, human activity recognition, and so on. Most depth estimation methods predict depth from stereo images and achieve good performance. Stereo methods rely on stereo images captured from multiple cameras to ensure that the problem of depth prediction is well-posed, where depths are estimated using geometrical computations,[1] additional sensors,[2] and photometric or consistency checks.[3] Although the stereo image method can obtain relatively accurate scene depth information, the depth result tends to be sparse. The estimated depth also tends to be inaccurate when the distances considered are large, and a small matching error often causes a large depth estimation error.
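The sensitivity of stereo depth to matching errors at long range can be made concrete with a short worked example. The sketch assumes an idealized rectified stereo pair, for which depth is Z = f·B/d (focal length f in pixels, baseline B in metres, disparity d in pixels). The camera parameters and the 0.5-pixel matching error are hypothetical values chosen for illustration and do not come from the article.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px, baseline_m, true_depth_m, disparity_error_px):
    """Absolute depth error when the disparity is underestimated."""
    true_disparity = focal_px * baseline_m / true_depth_m
    estimated = depth_from_disparity(
        focal_px, baseline_m, true_disparity - disparity_error_px
    )
    return abs(estimated - true_depth_m)


# Hypothetical camera: 720 px focal length, 0.54 m baseline.
FOCAL_PX = 720.0
BASELINE_M = 0.54
MATCH_ERROR_PX = 0.5  # assumed half-pixel matching error

err_near = depth_error(FOCAL_PX, BASELINE_M, 10.0, MATCH_ERROR_PX)
err_far = depth_error(FOCAL_PX, BASELINE_M, 80.0, MATCH_ERROR_PX)
print(f"error at 10 m: {err_near:.2f} m")
print(f"error at 80 m: {err_far:.2f} m")
```

Because disparity is inversely proportional to depth, the same half-pixel error costs roughly a tenth of a metre at 10 m but several metres at 80 m under these assumptions; to first order the error grows with the square of the distance.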

Methods

Results

Conclusion