Abstract

Monocular depth estimation has received increasing attention because of its wide range of applications. In this paper, we propose a novel, simple framework, called CATNet, which treats monocular depth estimation as an ordinal regression problem. Current research on monocular depth estimation typically obtains higher performance by increasing the computational cost and parameter count of the model. In response, we design a simple encoder–decoder architecture that aims to reduce the parameters and complexity of SOTA models while keeping depth estimation accuracy as high as possible, rather than aiming for an extremely lightweight design. To further refine the multi-scale information extracted by the encoder, we propose a Multi-dimensional Convolutional Attention (MCA) module. To enhance the extraction of global information for accurate pixel classification, we propose a Dual Attention Transformer (DAT) module that extracts global image features. Experimental results on the KITTI and NYU datasets show that our framework achieves depth estimation performance almost equivalent to the current SOTA with fewer parameters and lower complexity. To the best of our knowledge, CATNet is the first work to achieve nearly the same depth estimation accuracy as models with large Transformer-based encoders while using so few parameters.
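
To make the ordinal regression formulation concrete, the following is a minimal sketch of how a continuous depth value can be mapped to an ordinal label and back. It is not CATNet's implementation: the abstract does not specify the discretization or decoding scheme, so the log-spaced bins, the threshold-counting decoder, and all function names here (`log_spaced_edges`, `depth_to_label`, `ordinal_targets`, `decode_depth`) are assumptions chosen for illustration.

```python
import math

# Hypothetical helpers; the bin layout and decoding rule are assumptions,
# not details taken from the paper.

def log_spaced_edges(d_min, d_max, num_bins):
    """Return num_bins + 1 bin edges spaced uniformly in log-depth."""
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / num_bins
    return [math.exp(log_min + step * i) for i in range(num_bins + 1)]

def depth_to_label(depth, edges):
    """Return the index of the bin containing depth, clamped to [0, num_bins - 1]."""
    num_bins = len(edges) - 1
    label = 0
    for k in range(1, num_bins):
        if depth >= edges[k]:
            label = k
    return label

def ordinal_targets(label, num_bins):
    """Encode a label as num_bins - 1 binary targets: target[k] = 1 if label > k."""
    return [1 if label > k else 0 for k in range(num_bins - 1)]

def decode_depth(probs, edges):
    """Count thresholds predicted as exceeded (p > 0.5) and return
    the geometric center of the resulting bin."""
    label = sum(1 for p in probs if p > 0.5)
    return math.sqrt(edges[label] * edges[label + 1])

edges = log_spaced_edges(1.0, 81.0, 4)   # edges near 1, 3, 9, 27, 81
label = depth_to_label(10.0, edges)      # 10.0 lies between 9 and 27
targets = ordinal_targets(label, 4)
depth = decode_depth([0.9, 0.8, 0.1], edges)
```

Under this encoding a network predicts, for each pixel, one probability per threshold instead of a single continuous value, which is why the task can be described as pixel classification over ordered depth bins.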
