RGB-D scene recognition has achieved promising performance because depth can provide geometric information complementary to RGB images. However, the limited availability of depth sensors severely restricts RGB-D applications. In this paper, we focus on the depth-privileged setting, in which depth information is available during training but not during testing. Since the information in RGB and depth images is complementary while attention is both informative and transferable, our idea is to use the RGB input to hallucinate depth attention. We build our model upon the modulated deformable convolutional layer and hallucinate dual attention: a post-hoc importance weight and a trainable spatial transformation. Specifically, we use the modulation (resp., offset) learned from RGB to mimic the Grad-CAM (resp., offset) learned from depth, combining the strengths of both forms of attention. We also design a loss weighted by the quality of depth attention to avoid negative transfer. Extensive experiments on two benchmarks, i.e., SUN RGB-D and NYUDv2, demonstrate that our method outperforms state-of-the-art methods for depth-privileged scene recognition.
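To make the mechanism concrete, below is a minimal PyTorch sketch of how such dual-attention hallucination could be wired up, assuming torchvision's `DeformConv2d` as the modulated deformable convolution. All names (`DualAttentionHallucination`, `hallucination_loss`, `quality`, etc.) are illustrative placeholders, not the authors' implementation, and the depth-teacher signals are stand-ins for outputs of a pretrained depth branch.

```python
# A minimal sketch (not the paper's code) of hallucinating depth attention
# with a modulated deformable convolution, assuming torchvision >= 0.9.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d


class DualAttentionHallucination(nn.Module):
    """RGB branch that predicts offsets and modulation for a 3x3 modulated
    deformable convolution; both are trained to mimic depth attention."""

    def __init__(self, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        k2 = kernel_size * kernel_size
        # Offsets: 2 values (dx, dy) per kernel sampling location.
        self.offset_head = nn.Conv2d(channels, 2 * k2, 3, padding=1)
        # Modulation: one scalar importance weight per kernel location.
        self.modulation_head = nn.Conv2d(channels, k2, 3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=1)

    def forward(self, rgb_feat: torch.Tensor):
        offset = self.offset_head(rgb_feat)
        modulation = torch.sigmoid(self.modulation_head(rgb_feat))
        out = self.deform_conv(rgb_feat, offset, mask=modulation)
        return out, offset, modulation


def hallucination_loss(offset, modulation, depth_offset, depth_cam, quality):
    """Dual-attention mimicry, down-weighted by an estimate of depth-attention
    quality (in [0, 1]) to avoid negative transfer."""
    # Spatial transformation: RGB offsets mimic the depth branch's offsets.
    loss_offset = F.mse_loss(offset, depth_offset)
    # Importance weight: modulation averaged over kernel locations mimics
    # the (normalized) Grad-CAM map from the depth teacher.
    loss_cam = F.mse_loss(modulation.mean(dim=1, keepdim=True), depth_cam)
    return quality * (loss_offset + loss_cam)


if __name__ == "__main__":
    model = DualAttentionHallucination(channels=64)
    rgb_feat = torch.randn(2, 64, 32, 32)
    out, offset, modulation = model(rgb_feat)
    # Hypothetical depth-teacher signals (would come from a depth branch).
    depth_offset = torch.randn_like(offset)
    depth_cam = torch.rand(2, 1, 32, 32)
    loss = hallucination_loss(offset, modulation, depth_offset, depth_cam, quality=0.8)
    print(out.shape, loss.item())
```

At test time only the RGB branch would run, so the hallucinated modulation and offsets stand in for the unavailable depth attention.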