Abstract

Monocular depth estimation has received more and more attention due to its wide range of application scenarios. In this paper, we propose a novel simple framework, called CATNet, which treats monocular depth estimation as an ordinal regression problem. At present, in order to obtain higher performance, the research on monocular depth estimation is achieved by increasing the amount of calculation and parameters of the model. Based on this, we propose a novel simple encoder–decoder architecture that aims to reduce the SOTA model parameters and complexity while keeping the depth estimation accuracy as high as possible rather than aiming for extremely lightweight. Meanwhile, in order to further refine the multi-scale information extracted by the encoder, we propose a Multi-dimensional Convolutional Attention (MCA) module. To enhance the extraction of global information for accurate pixel classification, we propose a Dual Attention Transformer (DAT) module to extract global features of images. Furthermore, experimental results on the KITTI and NYU datasets demonstrate that the advantage of our proposed framework is that it achieves almost equivalent depth estimation performance to the current SOTA with fewer parameters and lower complexity. To the best of our knowledge, CATNet is the first work that achieves nearly the same depth estimation accuracy as Transformer-based large model encoders with so few parameters.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call