Monocular image depth estimation is crucial for indoor scene reconstruction and plays a significant role in optimizing building energy efficiency, indoor environment modeling, and smart space design. However, the small depth variability of indoor scenes leads to weakly distinguishable detail features. Moreover, indoor scenes contain diverse object types, and the correlations among different objects are difficult to represent. In addition, the robustness of recent models in such environments still requires improvement. To address these problems, a detail-semantic collaborative network (DSCNet) is proposed for monocular depth estimation of indoor scenes. First, the contextual features contained in the images are fully captured via a hierarchical transformer structure. Second, a detail-semantic collaborative structure is designed, which builds a selective attention feature map to extract detail and semantic information from the feature maps; the extracted features are then fused to improve the perception ability of the network. Finally, the complex correlations among indoor objects are handled by aggregating semantic and detail features at different levels, which effectively improves model accuracy without increasing the number of parameters. The proposed model is evaluated on the NYU and SUN datasets and achieves state-of-the-art results compared with 14 recent top-performing methods. In addition, the proposed approach is fully discussed and analyzed in terms of stability, robustness, ablation experiments, and applicability to indoor scenes.
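As a rough illustration of the detail-semantic collaborative idea sketched in the abstract, the following minimal PyTorch module shows one way a selective attention map could split a feature map into detail and semantic branches before fusing them. The module name, layer choices, and attention-based split are assumptions for illustration only; the abstract does not specify the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DetailSemanticFusion(nn.Module):
    """Hypothetical sketch of a detail-semantic collaborative block:
    a selective attention map weights the input feature map, routing
    attended features to a detail branch and the complement to a
    semantic branch, then fuses both. Not the paper's actual design."""

    def __init__(self, channels: int):
        super().__init__()
        # Selective attention feature map in [0, 1], one weight per location.
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Detail branch: small receptive field for fine local structure.
        self.detail_branch = nn.Conv2d(channels, channels, 3, padding=1)
        # Semantic branch: dilated conv for wider context (assumed choice).
        self.semantic_branch = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        # 1x1 conv fuses the concatenated branches back to `channels`.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.attn(x)                               # selective attention map
        detail = self.detail_branch(a * x)             # attend to detail cues
        semantic = self.semantic_branch((1.0 - a) * x) # complementary context
        return self.fuse(torch.cat([detail, semantic], dim=1))

# Usage on a hypothetical 1/8-resolution indoor feature map:
block = DetailSemanticFusion(64)
y = block(torch.randn(1, 64, 60, 80))  # output shape: (1, 64, 60, 80)
```

Splitting by an attention map and its complement keeps the two branches complementary without adding parameters for a separate gating path per branch, which is consistent with the abstract's claim of improving accuracy without increasing the parameter count; the specific mechanism, however, is assumed here.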