RGB-D Scene Recognition Research Articles

Object recognition and scene recognition are two challenging but essential tasks in image understanding. In particular, the use of RGB-D sensors for these tasks has emerged as an important area of focus for better visual understanding. Meanwhile, deep neural networks, specifically convolutional neural networks (CNNs), have become widespread and have been applied to many visual tasks by replacing hand-crafted features with effective deep features. However, how to exploit deep features from a multi-layer CNN model effectively remains an open problem. In this paper, we propose a novel two-stage framework that extracts discriminative feature representations from multi-modal RGB-D images for object and scene recognition tasks. In the first stage, a pretrained CNN model is employed as a backbone to extract visual features at multiple levels. The second stage efficiently maps these features into high-level representations with a fully randomized structure of recursive neural networks (RNNs). To cope with the high dimensionality of CNN activations, a random weighted pooling scheme is proposed by extending the idea of randomness in RNNs. Multi-modal fusion is performed through a soft voting approach that computes weights based on the individual recognition confidences (i.e., SVM scores) of the RGB and depth streams separately. This produces consistent class label estimation in the final RGB-D classification performance. Extensive experiments verify that the fully randomized structure in the RNN stage successfully encodes CNN activations into discriminative solid features. Comparative experimental results on the popular Washington RGB-D Object and SUN RGB-D Scene datasets show that the proposed approach achieves superior or on-par performance compared to state-of-the-art methods in both object and scene recognition tasks. Code is available at https://github.com/acaglayan/CNN_randRNN.
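
The soft voting fusion mentioned in the abstract above can be illustrated with a minimal sketch. The abstract only states that the weights come from the recognition confidences (SVM scores) of the RGB and depth streams; the specific choices here (softmax normalization of the scores, using the peak probability as the confidence, and the names `softmax` and `soft_vote`) are assumptions made for illustration and may differ from the authors' implementation in the linked repository.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(rgb_scores, depth_scores):
    """Fuse per-class scores of two streams for a single sample.

    Assumption: each stream's confidence is its highest class probability,
    and the fusion weights are these confidences normalized to sum to one.
    """
    rgb_prob = softmax(rgb_scores)
    depth_prob = softmax(depth_scores)

    rgb_conf, depth_conf = max(rgb_prob), max(depth_prob)
    w_rgb = rgb_conf / (rgb_conf + depth_conf)
    w_depth = 1.0 - w_rgb

    # Weighted average of the two probability vectors.
    fused = [w_rgb * r + w_depth * d for r, d in zip(rgb_prob, depth_prob)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Hypothetical scores for three classes: the depth stream is far more
# confident than the RGB stream, so its vote dominates the fused result.
label, fused = soft_vote([0.2, 0.1, 0.3], [0.0, 3.0, 0.1])
print(label, [round(p, 3) for p in fused])
```

In this sketch the more confident stream receives the larger weight per sample, so a weak, ambiguous prediction from one modality is less likely to override a strong prediction from the other.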


Existing RGB-D scene recognition approaches typically employ two separate, modality-specific networks to learn effective RGB and depth representations, respectively. This independent training scheme fails to capture the correlation between the two modalities, and thus may be suboptimal for RGB-D scene recognition. To address this issue, this paper proposes a general and flexible framework, coined TRecgNet, that enhances RGB-D representation learning with a customized cross-modal pyramid translation branch. This framework unifies the tasks of cross-modal translation and modality-specific recognition with a shared feature encoder, and aims to leverage the correspondence between the two modalities to regularize the representation learning of each modality. Specifically, we present a cross-modal pyramid translation strategy that performs multi-scale image generation with carefully designed layer-wise perceptual supervision. To improve the complementarity of cross-modal translation to modality-specific scene recognition, we devise a feature selection module that adaptively enhances the discriminative information during the translation procedure. In addition, we train multiple auxiliary classifiers to further regularize the generated data so that its label predictions are consistent with those of its paired data. Meanwhile, our translation branch enables us to generate cross-modal data for training data augmentation and further improve single-modality scene recognition. Extensive experiments on the SUN RGB-D and NYU Depth V2 benchmarks demonstrate the superiority of the proposed method over state-of-the-art RGB-D scene recognition methods. We also generalize TRecgNet to the single-modality scene recognition benchmark MIT Indoor, automatically synthesizing a depth view to boost the final recognition accuracy.
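
The combination of recognition, translation, and auxiliary classification described in the abstract above can be written as a joint objective. The following is only a generic sketch of such a multi-task loss, not the formulation from the paper: the symbols $E$ (shared encoder), $G_s$ (generator output at pyramid scale $s$), $\phi_l$ (layer $l$ of a fixed perceptual network), and the weights $\lambda$ and $\mu$ are all assumptions introduced for illustration.

```latex
% x: input modality (e.g. RGB), y: paired target modality (e.g. depth), c: scene label
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{cls}}\bigl(E(x),\, c\bigr)
  + \lambda \sum_{s} \sum_{l}
      \bigl\lVert \phi_l\bigl(G_s(E(x))\bigr) - \phi_l\bigl(y_s\bigr) \bigr\rVert_1
  + \mu \sum_{s} \mathcal{L}_{\text{aux}}\bigl(G_s(E(x)),\, c\bigr)
```

Here the first term is the modality-specific recognition loss, the second compares generated and real images of the other modality layer by layer at each scale $s$ (with $y_s$ the target resized to that scale), and the third asks auxiliary classifiers to assign the generated data the same label as its paired data.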


Articles published on RGB-D Scene Recognition

CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

MMSNet: Multi-modal scene recognition using multi-scale encoded features

When CNNs meet random RNNs: Towards multi-level analysis for RGB-D object and scene recognition

RGB-D Scene Recognition based on Object-Scene Relation (Student Abstract)

Cross-Modal Pyramid Translation for RGB-D Scene Recognition

ASK: Adaptively Selecting Key Local Features for RGB-D Scene Recognition

Depth Privileged Scene Recognition via Dual Attention Hallucination

An Efficient RGB-D Scene Recognition Method Based on Multi-Information Fusion

MSN: Modality separation networks for RGB-D scene recognition

Image Representations with Spatial Object-to-Object Relations for RGB-D Scene Recognition

ACM: Adaptive Cross-Modal Graph Convolutional Neural Networks for RGB-D Scene Recognition

RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning

Learning Effective RGB-D Representations for Scene Recognition

Viewpoint invariant semantic object and scene categorization with RGB-D sensors

Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better than Transferring from RGB-CNNs

Feature representation of RGB-D images using joint spatial-depth feature pooling
