Abstract

Depth information plays an important role in vision-related tasks for robots and autonomous vehicles. Self-supervised monocular depth estimation is an effective way to obtain 3D scene information, as it trains on large and diverse monocular video datasets without requiring ground-truth depth. A novel multi-task learning strategy is proposed that uses semantic information to guide monocular depth estimation while preserving self-supervision. An improved differentiable direct visual odometry (DDVO) module combined with Pose-Net is applied to achieve more accurate pose prediction. A minimum reprojection loss with auto-masking and semantic masking is used to suppress the effects of low-texture regions and moving dynamic-class objects in the scene. The semantic masking is also introduced into the DDVO pose predictor to filter out moving objects and reduce the matching error between consecutive monocular frames. In addition, PackNet is employed as the backbone of the multi-task network to further improve the accuracy of depth prediction. The proposed method produces state-of-the-art results for monocular depth estimation on the KITTI Eigen split benchmark, even outperforming supervised methods trained with ground-truth depth.
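To make the masking scheme concrete, the sketch below shows a Monodepth2-style per-pixel minimum reprojection loss combined with auto-masking and an additional semantic mask, as described above. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the tensor shapes, the `photometric_error` helper (plain L1 standing in for the usual SSIM+L1 mix), and the convention that the semantic mask is 1 on dynamic-class pixels are all illustrative.

```python
# Minimal PyTorch sketch of minimum reprojection loss with auto-masking
# and semantic masking. Names and shapes are illustrative assumptions.
import torch

def photometric_error(pred, target):
    """Per-pixel photometric error. Plain L1 is used here for brevity;
    the usual formulation mixes SSIM and L1."""
    return (pred - target).abs().mean(dim=1, keepdim=True)  # (B,1,H,W)

def masked_min_reprojection_loss(target, warped_srcs, raw_srcs, dynamic_mask):
    """
    target:       (B,3,H,W) current frame I_t
    warped_srcs:  list of (B,3,H,W) source frames warped into I_t using the
                  predicted depth and pose
    raw_srcs:     list of (B,3,H,W) unwarped source frames (for auto-masking)
    dynamic_mask: (B,1,H,W) binary mask, 1 where the semantic segmentation
                  flags a potentially moving (dynamic-class) object
    """
    # Per-source reprojection error, reduced with a per-pixel minimum so that
    # occluded pixels take the error of the view that actually observes them.
    reproj = torch.stack([photometric_error(w, target) for w in warped_srcs], 0)
    min_reproj, _ = reproj.min(dim=0)  # (B,1,H,W)

    # Auto-masking: discard pixels where the *unwarped* source already matches
    # the target at least as well as the warped one, which happens for static
    # cameras, low-texture regions, and objects moving with the camera.
    identity = torch.stack([photometric_error(s, target) for s in raw_srcs], 0)
    min_identity, _ = identity.min(dim=0)
    auto_mask = (min_reproj < min_identity).float()

    # Semantic masking: additionally suppress dynamic-class regions.
    valid = auto_mask * (1.0 - dynamic_mask)
    return (min_reproj * valid).sum() / valid.sum().clamp(min=1.0)
```

The same semantic mask can, in principle, be applied when accumulating photometric residuals inside the DDVO pose refinement, so that moving objects do not corrupt the frame-to-frame matching; the exact integration point is an assumption here.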
