Abstract

Depth cues have proven useful for indoor scene understanding of RGB-D images by providing a geometric counterpart to the RGB representation. However, because of the modality differences within RGB-D image pairs, effectively utilizing cross-modal data is a key issue. Most methods exclusively leverage depth data to unilaterally complement RGB data for better feature representation; they invariably ignore the fact that RGB and depth data can bilaterally complement each other. Herein, a novel RGB-D scene-understanding network called BCINet is presented, in which RGB and depth data bilaterally complement each other via a proposed bilateral cross-modal interaction module (BCIM). The BCIM captures cross-modal complementary cues by cross-fusing features, enhanced through a feature enhancement module, from one modality into the counterpart modality. Meanwhile, exploiting the long-range dependencies of RGB-D features is also essential for accurate scene understanding. Specifically, we design a hybrid pyramid dilated convolution module to enlarge the receptive fields along both the vertical and horizontal spatial directions, adaptively capturing diverse contexts with different shapes. Additionally, we propose a context-guided module to aggregate these diverse higher-level contexts with lower-level encoder features, guiding the information flow to progressively refine the segmentation map. Experimental results on two indoor scene datasets demonstrate the superiority and effectiveness of the proposed BCINet over several state-of-the-art approaches.
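The following is a minimal sketch of what a bilateral cross-modal interaction block of this kind might look like, assuming the BCIM roughly follows an "enhance each modality, then fuse crosswise" pattern; the module names, the channel-attention enhancement, and the convolutional fusion layers are illustrative assumptions, not the authors' exact design.

```python
# Hypothetical sketch of bilateral cross-modal interaction (not the paper's code).
import torch
import torch.nn as nn


class FeatureEnhance(nn.Module):
    """Assumed feature-enhancement step: simple channel attention."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Reweight channels by globally pooled statistics.
        return x * self.fc(self.pool(x))


class BilateralCrossModalInteraction(nn.Module):
    """Enhance each modality, then fuse the enhanced features crosswise."""

    def __init__(self, channels):
        super().__init__()
        self.enhance_rgb = FeatureEnhance(channels)
        self.enhance_depth = FeatureEnhance(channels)
        self.fuse_rgb = nn.Conv2d(channels * 2, channels, 3, padding=1)
        self.fuse_depth = nn.Conv2d(channels * 2, channels, 3, padding=1)

    def forward(self, rgb, depth):
        rgb_enhanced = self.enhance_rgb(rgb)
        depth_enhanced = self.enhance_depth(depth)
        # Each stream receives the enhanced features of its counterpart,
        # so RGB and depth complement each other bilaterally.
        rgb_out = self.fuse_rgb(torch.cat([rgb, depth_enhanced], dim=1))
        depth_out = self.fuse_depth(torch.cat([depth, rgb_enhanced], dim=1))
        return rgb_out, depth_out


if __name__ == "__main__":
    rgb = torch.randn(1, 64, 60, 80)
    depth = torch.randn(1, 64, 60, 80)
    rgb_out, depth_out = BilateralCrossModalInteraction(64)(rgb, depth)
    print(rgb_out.shape, depth_out.shape)  # both: torch.Size([1, 64, 60, 80])
```

In this kind of design, both fused streams keep the same spatial resolution and channel count as their inputs, so the block can be dropped between corresponding encoder stages of the two modalities.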
