Abstract

Most existing RGB-D salient object detection (SOD) methods either extract features from the two modalities in parallel or treat depth features as supplementary information, allowing only unidirectional interaction from the depth modality to the RGB modality in the encoder stage. Such methods ignore the influence of low-quality depth maps, and there remains room for improvement in effectively fusing RGB and depth features. To address these problems, this paper proposes a Feature Interaction Network (FINet), which performs bi-directional interaction through a feature interaction module (FIM) in the encoder stage. The FIM consists of two parts: a depth enhancement module (DEM), which filters noise in the depth features through an attention mechanism, and a cross enhancement module (CEM), which enables effective interaction between RGB and depth features. In addition, this paper proposes a two-stage cross-modal fusion strategy: high-level fusion exploits high-level semantic information to coarsely localize salient regions, while low-level fusion makes full use of low-level detail through boundary fusion; the high-level and low-level cross-modal features are then progressively refined to obtain the final saliency prediction map. Extensive experiments show that the proposed model outperforms eight state-of-the-art models on five standard datasets.
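The abstract names the DEM and CEM but does not specify their internals. The following is a minimal PyTorch sketch of one encoder-stage FIM under stated assumptions: the DEM is approximated as channel attention over the depth features, and the CEM as mutual spatial gating between the two modalities. All layer choices, shapes, and operators beyond the module names (DEM, CEM, FIM) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DEM(nn.Module):
    """Depth enhancement module (sketch): channel attention that
    re-weights depth features to suppress noisy channels. The paper
    only states that DEM filters depth noise via attention; this
    particular design is an assumption."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, depth_feat):
        weights = self.fc(self.pool(depth_feat))  # per-channel gates in (0, 1)
        return depth_feat * weights               # attenuate noisy channels

class CEM(nn.Module):
    """Cross enhancement module (sketch): bi-directional exchange in
    which each modality is gated by the other. The exact fusion
    operator is an assumption."""
    def __init__(self, channels):
        super().__init__()
        self.rgb_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
        self.dep_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, rgb_feat, depth_feat):
        rgb_out = rgb_feat + rgb_feat * self.dep_gate(depth_feat)    # depth -> RGB
        dep_out = depth_feat + depth_feat * self.rgb_gate(rgb_feat)  # RGB -> depth
        return rgb_out, dep_out

class FIM(nn.Module):
    """Feature interaction module: depth enhancement followed by
    cross enhancement, matching the two-part split in the abstract."""
    def __init__(self, channels):
        super().__init__()
        self.dem = DEM(channels)
        self.cem = CEM(channels)

    def forward(self, rgb_feat, depth_feat):
        depth_feat = self.dem(depth_feat)          # filter depth noise first
        return self.cem(rgb_feat, depth_feat)      # then interact bi-directionally

# Usage with hypothetical encoder features at one stage:
rgb = torch.randn(1, 64, 56, 56)
dep = torch.randn(1, 64, 56, 56)
rgb_enh, dep_enh = FIM(64)(rgb, dep)
print(rgb_enh.shape, dep_enh.shape)  # torch.Size([1, 64, 56, 56]) twice
```

In a full FINet-style encoder, one such FIM would sit at each backbone stage, with the enhanced RGB and depth features feeding the two-stage cross-modal fusion described above.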
