Abstract

Self-supervised monocular depth estimation and visual odometry (VO) are often cast as coupled tasks: accurate depth contributes to precise pose estimation and vice versa. Existing architectures typically rely on stacked convolutional layers and long short-term memory (LSTM) units to capture long-range dependencies. However, the intrinsic locality of these operations prevents such models from achieving the expected performance gains. In this article, we propose a Transformer-based architecture, named Transformer-based self-supervised monocular depth and VO (TSSM-VO), to tackle these problems. It comprises two main components: 1) a depth generator that leverages the powerful capability of multihead self-attention (MHSA) in modeling long-range spatial dependencies and 2) a pose estimator built upon a Transformer to learn long-range temporal correlations of image sequences. Moreover, a new data augmentation loss based on structural similarity (SSIM) is introduced to further constrain the structural similarity between the augmented depth and the augmented predicted depth. Rigorous ablation studies and exhaustive performance comparisons on the KITTI and Make3D datasets demonstrate the superiority of TSSM-VO over other self-supervised methods. We expect TSSM-VO to enhance the ability of intelligent agents to understand their surrounding environments.
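The abstract only names the SSIM-based data augmentation loss without giving its form; the paper's exact formulation may differ. As a minimal sketch, assuming depth maps normalized to [0, 1] and the common (1 - SSIM)/2 dissimilarity term with a 3x3 averaging window (the function names and window size here are illustrative, not taken from the paper), the loss could look like:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    # x, y: (B, 1, H, W) depth maps scaled to [0, 1].
    # Local statistics are computed with a 3x3 average-pooling window.
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    ssim_n = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    ssim_d = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(ssim_n / ssim_d, 0, 1)

def augmentation_ssim_loss(depth_aug, depth_pred_aug):
    # Hypothetical augmentation-consistency term: penalize structural
    # dissimilarity between the augmented depth and the augmented
    # predicted depth, averaged over all pixels.
    return ((1 - ssim(depth_aug, depth_pred_aug)) / 2).mean()
```

This would be added to the usual self-supervised photometric and smoothness objectives with a weighting factor chosen by the authors.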
