Abstract

This paper presents a new deep visual-inertial odometry and depth estimation framework for improving the accuracy of depth estimation and ego-motion from image sequences and inertial measurement unit (IMU) raw data. The proposed framework predicts ego-motion and depth with absolute scale in a self-supervised manner. We first capture dense features and solve for the pose with deep visual odometry (DVO), and then combine this pose estimation pipeline with deep inertial odometry (DIO) using an extended Kalman filter (EKF) to produce sparse depth and pose with absolute scale. We then couple deep visual-inertial odometry (DeepVIO) with depth estimation, using the sparse depth and the pose from the DeepVIO pipeline to align the scale of the depth prediction with the triangulated point cloud and to reduce image reconstruction error. Specifically, we exploit the strengths of learning-based visual-inertial odometry (VIO) and depth estimation to build an end-to-end self-supervised learning architecture. We evaluated the framework on the KITTI datasets and compared it with previous techniques. We show that our approach improves ego-motion estimation and achieves comparable results for depth estimation, especially in detailed regions.
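The EKF fusion step described above can be illustrated with a minimal sketch: an IMU-integrated motion increment propagates the pose state, and a visual-odometry pose measurement corrects it. The linear state model, the function names, and all numerical values below are illustrative assumptions, not the paper's actual formulation (which operates on full DVO/DIO network outputs).

```python
import numpy as np

def ekf_predict(x, P, imu_delta, Q):
    """Propagate the pose state with an IMU-integrated motion increment (DIO)."""
    x_pred = x + imu_delta          # state transition F = I in this sketch
    P_pred = P + Q                  # covariance grows by the process noise
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z_vo, R):
    """Correct the predicted pose with a visual-odometry measurement (DVO)."""
    H = np.eye(len(x_pred))               # direct observation of the pose state
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x_pred + K @ (z_vo - H @ x_pred)  # fused pose estimate
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P

# Usage: fuse a noisy inertial prediction with a VO position measurement.
x, P = np.zeros(3), np.eye(3) * 0.1
x, P = ekf_predict(x, P, imu_delta=np.array([1.0, 0.0, 0.0]), Q=np.eye(3) * 0.05)
x, P = ekf_update(x, P, z_vo=np.array([0.9, 0.1, 0.0]), R=np.eye(3) * 0.02)
print(np.round(x, 3))   # fused pose lies between the IMU and VO estimates
```

Because the VO measurement noise (R) is smaller than the accumulated prediction uncertainty, the fused state is pulled most of the way toward the VO measurement while the covariance shrinks.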

Highlights

  • Dense depth estimation from an RGB image is the fundamental issue for 3D scene reconstruction that is useful for computer vision applications, such as automatic driving [1], simultaneous localization and mapping (SLAM) [2], and 3D scene understanding [3]

  • We present a new deep visual-inertial odometry (DeepVIO) based ego-motion and depth prediction system that combines the strengths of learning-based VIO and geometrical depth estimation [16,19,20]

  • The results show that our proposed DeepVIO method can improve the accuracy of depth estimation and enhance the detail of depth estimation at the edge of objects

Introduction

Dense depth estimation from an RGB image is a fundamental problem in 3D scene reconstruction, which is useful for computer vision applications such as automatic driving [1], simultaneous localization and mapping (SLAM) [2], and 3D scene understanding [3]. With the rapid development of monocular depth estimation, many supervised and unsupervised learning methods have been proposed. Instead of traditional supervised methods that depend on expensively collected ground truth, unsupervised learning from stereo images or monocular videos is a more universal solution [4,5]. To overcome the lack of geometric constraints in unsupervised depth estimation training, recent works have used sparse LiDAR data [6–8] to guide depth estimation during image feature extraction and to improve the quality of the supervised depth map generation. These methods, however, depend on sparse LiDAR data, which are relatively expensive to acquire. A recent trend in depth estimation involves traditional SLAM [9], which can provide an accurate sparse point cloud, to learn to predict monocular depth and odometry in a self-supervised manner [10,11].
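The role a sparse point cloud can play in self-supervised training can be sketched as follows: a scale-ambiguous depth prediction is rescaled so that it agrees with sparse metric depths at the pixels where triangulated points exist, and a photometric loss supervises the remaining pixels. The median-ratio scheme, function names, and toy arrays below are illustrative assumptions, not the exact method of the paper or of the cited works.

```python
import numpy as np

def align_scale(pred_depth, sparse_depth, mask):
    """Rescale a relative depth map to match sparse metric depths at masked pixels."""
    scale = np.median(sparse_depth[mask] / pred_depth[mask])
    return pred_depth * scale

def photometric_l1(target, reconstructed):
    """Mean absolute photometric error between the target and a warped source view."""
    return np.mean(np.abs(target - reconstructed))

# Usage: a 2x2 relative depth map that is off by a global factor of ~2.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
sparse = np.array([[2.0, 0.0], [6.0, 0.0]])   # zeros mark pixels with no triangulated point
mask = sparse > 0
aligned = align_scale(pred, sparse, mask)
print(aligned)   # pred scaled by the median ratio of 2
```

Using the median of the per-pixel ratios (rather than a least-squares fit) keeps the estimated scale robust to a few badly triangulated outlier points.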
