Abstract

Self-supervised depth estimation methods forgo depth ground truth and typically rely on a convolutional U-Net, whose fixed receptive field restricts attention to nearby spatial context. These limitations weaken the supervision available from image reconstruction and hinder accurate depth estimation, particularly in complex indoor scenes. Pure transformer frameworks can model global context and supply richer semantic information, but at significant computational cost. To tackle these challenges, we introduce GDM-Depth, which uses global dependency modelling to provide more precise depth guidance from the network itself. First, we propose integrating learnable tree filters with unary terms, exploiting the structural properties of spanning trees to enable efficient long-range interactions. Second, rather than replacing the convolutional framework entirely, we use a transformer to design a scale-aware global feature extractor that establishes global relationships among local features at multiple scales, achieving both efficiency and cost-effectiveness. Furthermore, we observe inter-class disparities between global and local depth features, and introduce a global feature injector to further strengthen the representation. GDM-Depth's effectiveness is demonstrated on the NYUv2, ScanNet, and InteriorNet depth datasets, where it achieves 87.2%, 83.1%, and 76.1% on the key indicator δ < 1.25 on the respective test sets.
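
To make the spanning-tree idea concrete, the sketch below illustrates the basic mechanism behind tree filtering: a minimum spanning tree is built over a 4-connected feature grid using feature dissimilarity as edge weights, and each position then aggregates all others with weights that decay with tree distance, giving long-range interactions along content-adaptive paths. This is only a minimal illustration of the underlying principle; the function name `tree_filter`, the L2 dissimilarity, and the O(N²) Dijkstra-based aggregation are illustrative assumptions, not the paper's learnable, linear-time formulation.

```python
# Minimal sketch of spanning-tree filtering (illustrative, not GDM-Depth's
# actual implementation): build an MST over a 4-connected pixel grid with
# feature-dissimilarity edge weights, then aggregate features with weights
# that decay exponentially with distance along the tree.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, dijkstra

def tree_filter(feat, sigma=1.0):
    """feat: (H, W, C) feature map -> tree-filtered (H, W, C)."""
    H, W, C = feat.shape
    n = H * W
    flat = feat.reshape(n, C)

    # 4-connected grid graph; edge weight = L2 feature dissimilarity.
    # The small epsilon keeps edges nonzero (zero means "no edge" in sparse graphs).
    graph = lil_matrix((n, n))
    for y in range(H):
        for x in range(W):
            i = y * W + x
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < H and nx < W:
                    j = ny * W + nx
                    graph[i, j] = np.linalg.norm(flat[i] - flat[j]) + 1e-8

    # The spanning tree keeps only n - 1 edges, so any two positions are
    # connected by a unique path; long-range interaction follows that path.
    mst = minimum_spanning_tree(graph.tocsr())
    sym = mst + mst.T  # treat the tree as undirected

    # All-pairs tree distances; affinity decays exponentially with distance.
    dist = dijkstra(sym, directed=False)
    weights = np.exp(-dist / sigma)
    weights /= weights.sum(axis=1, keepdims=True)  # normalize per position

    return (weights @ flat).reshape(H, W, C)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out = tree_filter(rng.standard_normal((8, 8, 16)).astype(np.float32))
    print(out.shape)  # (8, 8, 16)
```

Because the tree has only n − 1 edges, filtering respects object boundaries (high-dissimilarity edges are pruned by the MST) while still propagating information across the whole image, which is what makes it an efficient alternative to full attention.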
