Abstract

Utilizing multi-level features has been proven to improve RGB-D scene recognition performance. However, simply fusing features after processing RGB and depth data separately may not preserve multi-modal integrity. In this work, we propose an effective multi-modal RGB-D scene recognition model that integrates global and local multi-scale/multi-semantic features. The proposed approach is built on two key components. In the first stage, multiple random recursive neural networks (RNNs) are applied to a baseline CNN model to obtain multi-scale encoded features from its multi-level feature hierarchy. In the second stage, multi-layer perceptrons (MLPs) learn global/local features at multiple levels while encouraging the correlation of mutual multi-modal features. Our learning design is based on the insight that correlated multi-modal features capture the complementary relation between the two modalities, which promotes better RGB-D scene recognition performance. In addition, the network is trained with a decision-level fusion based on modality prediction confidence weights to yield the final RGB-D multi-modal recognition. Experiments on three RGB-D scene datasets verify the effectiveness of the proposed approach, achieving superior or highly competitive results compared to state-of-the-art methods. Evaluation code and models are available at https://github.com/acaglayan/MMSNet.
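
The confidence-weighted decision fusion mentioned above can be illustrated with a minimal sketch. The weighting scheme shown here (maximum softmax probability as the confidence weight) and all names (fuse_predictions, rgb_logits, depth_logits) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch (assumption, not the authors' implementation) of
# confidence-weighted decision fusion between RGB and depth predictions.
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_predictions(rgb_logits, depth_logits):
    """Weight each modality's class probabilities by its own prediction
    confidence (here, the maximum softmax probability) and combine."""
    p_rgb = softmax(rgb_logits)
    p_depth = softmax(depth_logits)
    w_rgb = p_rgb.max(axis=-1, keepdims=True)      # RGB confidence weight
    w_depth = p_depth.max(axis=-1, keepdims=True)  # depth confidence weight
    fused = w_rgb * p_rgb + w_depth * p_depth
    return fused.argmax(axis=-1)                   # predicted scene class

# Example: two samples, 10 scene categories.
rgb = np.random.randn(2, 10)
depth = np.random.randn(2, 10)
print(fuse_predictions(rgb, depth))
```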
