Abstract

Supervised monocular depth estimation has long been one of the most important tasks in computer vision. With convolution as the basic operator, the U-shaped network architecture has become the de facto standard and has achieved tremendous success. However, because the convolution operation has a limited receptive field, CNNs are generally weak at explicitly modeling long-range dependencies. Transformers, originally proposed for natural language processing, perform sequence-to-sequence prediction with a global self-attention mechanism and can therefore capture long-range dependencies. However, their localization ability is limited because they lack low-level detail. In this work, we propose TSD-Depth, a model that combines the merits of transformers and CNNs, as a strong alternative for self-supervised monocular depth estimation. The proposed model simultaneously extracts global contextual information and local spatial detail features. Furthermore, a hybrid encoder connection method and a properly sized transformer module allow the global and local information to interact more effectively. In addition, we propose a local multi-scale fusion block to refine fine-grained details. More importantly, self-distillation transfers the knowledge of the multi-scale fusion block attached to the encoder, so the block is computed only during training and skipped at inference time, which keeps the overhead minimal. Experimental results on the NYU-v2 and ScanNet datasets show that the proposed TSD-Depth outperforms previous state-of-the-art methods.
