Abstract

In recent years, convolutional neural networks have substantially improved the accuracy of semantic segmentation. However, semantic segmentation of indoor scenes remains challenging because of the complexity of indoor environments. With the advent of depth sensors, depth information has been exploited to improve semantic segmentation. Most previous studies simply fuse RGB and depth features with equal-weight concatenation or summation, failing to make full use of the complementary information between the two modalities. In this paper, we propose an attention-aware multimodal fusion network for RGB-D indoor semantic segmentation. A dedicated attention-aware multimodal fusion module effectively fuses multilevel RGB and depth features. A cross-modal attention mechanism is introduced into the fusion module, and features are fused at multiple scales, so that RGB and depth features guide and refine each other with complementary information and yield feature representations rich in spatial location information. The method effectively promotes the synergistic interaction of multimodal features and further improves segmentation quality. Experimental results on the publicly available SUN RGB-D, NYU Depth v2 and ScanNet datasets show that the proposed algorithm outperforms other RGB-D semantic segmentation algorithms.
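As a rough illustration of the idea described above (not the paper's exact module), the PyTorch sketch below shows one common way to realize cross-modal attention fusion at a single encoder scale: each modality produces channel-attention weights that re-weight the other modality's features, and the two gated streams are then summed. The module names, SE-style channel attention, and element-wise summation are all assumptions made for illustration.

```python
# Hypothetical sketch of cross-modal attention fusion for RGB-D features.
# Layer choices (SE-style channel attention, element-wise sum) are assumptions,
# not the paper's actual fusion module.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                               # global spatial pooling
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                          # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


class CrossModalFusion(nn.Module):
    """Fuse same-scale RGB and depth feature maps with cross-modal attention.

    Each modality computes channel weights that gate the *other* modality,
    so the two streams guide each other before being summed.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.rgb_att = ChannelAttention(channels)    # weights derived from RGB
        self.depth_att = ChannelAttention(channels)  # weights derived from depth

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        rgb_guided = rgb * self.depth_att(depth)     # depth guides RGB
        depth_guided = depth * self.rgb_att(rgb)     # RGB guides depth
        return rgb_guided + depth_guided             # fused representation


if __name__ == "__main__":
    # One fusion stage at a single encoder scale (e.g. 1/8 resolution).
    fuse = CrossModalFusion(channels=256)
    rgb_feat = torch.randn(1, 256, 30, 40)
    depth_feat = torch.randn(1, 256, 30, 40)
    print(fuse(rgb_feat, depth_feat).shape)          # torch.Size([1, 256, 30, 40])
```

In a multi-scale design such as the one the abstract describes, one such fusion block would typically be instantiated at each encoder stage, and the fused maps passed on to the decoder.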
