Self-Supervised Correlational Monocular Depth Estimation using ResVGG Network

Kuo Shiuan Peng, Gregory Ditzler, Jerzy Rozenblit

doi:10.12792/icisip2019.019

Abstract

Self-supervised monocular depth estimation (SMDE) has recently received significant attention in computer vision. Leveraging advances in deep learning, SMDE provides a solution for applications in automation, navigation, and scene understanding. In this paper, we propose a novel training objective and learning network that perform single-image depth estimation with a convolutional neural network without ground-truth depth data. The proposed training objective enables the network to learn the stereo image correlation during training and to estimate depth from a single input image at prediction time. The proposed learning network, ResVGG, is a hybrid structure of Resnet50 and VGG-16. ResVGG performs similarly to Resnet50 but at a much lower computational cost. We demonstrate that the proposed method achieves accuracy competitive with the current state of the art on the KITTI dataset and runs at 33 frames per second (FPS) in prediction on a single NVIDIA GTX 1080 GPU. Furthermore, the proposed method can potentially support visual odometry depth estimation.

Highlights

Depth estimation is one of the fundamental problems with a long history in computer vision
We argue that the network can learn the image correlation information from the stereo training image set, even though image correlation is unavailable from the single input image in prediction
We evaluate the performance of the proposed method on the KITTI benchmark

Summary

Introduction

Depth estimation is one of the fundamental problems with a long history in computer vision. It serves as the cornerstone for many machine perception applications, such as 3D reconstruction, autonomous driving systems, industrial machine vision, and robotic interaction. In monocular depth estimation, the input is a single image (e.g., the left image of a stereo pair). The corresponding other view (e.g., the right image) can be reconstructed from the estimated right depth and the input left image using a warping function [4]. The reconstructed right view is then supervised by the actual right image.
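The reconstruction step described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `warp_left_to_right` and `photometric_l1` are hypothetical, the depth is assumed to be expressed as a horizontal disparity in pixels on a rectified stereo pair, the sign convention (right pixel x samples the left image at x + d) is an assumption, and the plain L1 difference stands in for whatever training objective the paper actually uses.

```python
import numpy as np


def warp_left_to_right(left, disp_right):
    """Reconstruct the right view by sampling the left image.

    Assumed convention (rectified stereo): the right-view pixel at column x
    corresponds to the left-view pixel at column x + disp_right[y, x].
    `left` is (H, W) or (H, W, C); `disp_right` is (H, W), in pixels.
    """
    h, w = disp_right.shape
    # Sampling positions in the left image, clamped to the image border.
    xs = np.arange(w, dtype=np.float64)[None, :] + disp_right
    xs = np.clip(xs, 0.0, w - 1.0)

    # Linear interpolation between the two neighbouring columns.
    x0 = np.floor(xs).astype(np.int64)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0

    rows = np.arange(h)[:, None]
    left = left.astype(np.float64)
    if left.ndim == 3:
        frac = frac[..., None]  # broadcast the weights over colour channels
    return (1.0 - frac) * left[rows, x0] + frac * left[rows, x1]


def photometric_l1(reconstructed, target):
    """Mean absolute difference between reconstructed and actual views."""
    return float(np.mean(np.abs(reconstructed - target)))


if __name__ == "__main__":
    # Toy example: a horizontal intensity ramp and a constant 2-pixel disparity.
    left = np.tile(np.arange(8.0), (4, 1))
    disp = np.full((4, 8), 2.0)
    right_hat = warp_left_to_right(left, disp)

    # Synthetic "actual" right image that agrees with the same disparity.
    right = np.clip(left + 2.0, 0.0, 7.0)
    print(photometric_l1(right_hat, right))
```

In a training setting the disparity would come from the network and the warp would need to be differentiable (for example, a bilinear sampler in a deep learning framework); the NumPy version only shows the geometry of the supervision signal.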

Methods

Results

Conclusion