Abstract

Scene understanding is a fundamental problem in computer vision and is critical to applications in robotics and augmented reality. It encompasses many tasks, such as scene labeling, object recognition, and scene classification. Most previous scene understanding methods focus on outdoor scenes; indoor scene understanding is more challenging due to poor illumination and cluttered objects. With the wide availability of affordable RGB-D cameras such as the Kinect, indoor scene analysis has advanced substantially, thanks to the rich 3D geometric information provided by depth measurements.

Feature extraction is a key component of scene understanding tasks. Most early methods extract hand-crafted features, but the performance of such extractors depends heavily on how the features are designed and combined. The design process requires empirical understanding of the data and is therefore hard to extend systematically to different modalities. In addition, hand-crafted features capture only a subset of the recognition cues in the raw data and may discard useful information. In this research, we therefore focus on feature learning directly from raw data. In particular, we explore feature learning for three indoor scene understanding tasks with RGB-D input.

Scene labeling: The aim is to densely assign a category label (e.g. table, TV) to each pixel in an image. Inspired by the success of unsupervised feature learning, we start by adapting an existing unsupervised feature learning technique to learn features directly from RGB-D images. Better performance can typically be achieved by further applying feature encoding over the learned features to build bag-of-words style representations. However, feature learning and feature encoding are usually performed separately, which may yield a suboptimal solution. We propose to jointly optimize the two processes to derive more discriminative features (the conventional two-stage pipeline is sketched below).

Object recognition: Most feature learning methods for RGB-D object recognition either learn features for each modality independently or treat RGB-D simply as undifferentiated four-channel data; neither adequately exploits the complementary relationship between the two modalities. To address this, we propose a general Convolutional Neural Network (CNN) based multi-modal learning method for RGB-D object recognition. Our multi-modal layer is designed not only to discover the most discriminative features for each modality, but also to harness the complementary relationship between the two modalities (see the second sketch below).

Scene classification: Methods that leverage local information for scene classification share a similar pipeline: densely extract CNN features at multiple locations and scales of an image, then apply a feature encoding method such as the Fisher Vector (FV, sketched below). However, for state-of-the-art encoding techniques such as FV, the components of the underlying Gaussian Mixture Model (GMM) are derived from densely sampled local features, so many components are likely to be noisy…
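As a point of reference for the scene labeling discussion, the following is a minimal sketch of the conventional two-stage pipeline whose separate optimization the joint approach targets: unsupervised k-means feature learning over RGB-D patches followed by soft-assignment ("triangle") encoding and pooling. The patch size, dictionary size, and choice of k-means are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def extract_patches(image, patch_size=8, stride=4):
    """Densely extract flattened patches from an (H, W, C) image."""
    H, W, _ = image.shape
    return np.asarray([
        image[y:y + patch_size, x:x + patch_size].ravel()
        for y in range(0, H - patch_size + 1, stride)
        for x in range(0, W - patch_size + 1, stride)])

# Stage 1: unsupervised feature learning -- a k-means codebook on RGB-D patches.
rgbd = np.random.rand(96, 128, 4)                  # toy 4-channel RGB-D image
patches = extract_patches(rgbd)
patches -= patches.mean(axis=1, keepdims=True)     # per-patch normalization
codebook = KMeans(n_clusters=64, n_init=4).fit(patches)

# Stage 2: feature encoding -- soft "triangle" assignment to codewords, then
# average pooling into a bag-of-words style descriptor. Optimizing stages 1
# and 2 separately is the potential suboptimality the thesis addresses.
dists = cdist(patches, codebook.cluster_centers_)  # (num_patches, 64)
activations = np.maximum(0.0, dists.mean(axis=1, keepdims=True) - dists)
descriptor = activations.mean(axis=0)              # one 64-d vector per image
```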
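For the object recognition task, the sketch below shows the general shape of a two-stream CNN with a shared fusion layer, in the spirit of the multi-modal method described above. The stream depths, layer widths, and the 51-way output (matching the common RGB-D object dataset) are assumptions for illustration; the thesis's exact multi-modal layer may differ.

```python
import torch
import torch.nn as nn

class MultiModalRGBD(nn.Module):
    """Two modality-specific CNN streams joined by a shared fusion layer."""

    def __init__(self, num_classes=51):
        super().__init__()

        def stream(in_channels):
            # A small modality-specific feature extractor (illustrative sizes).
            return nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())

        self.rgb_stream = stream(3)    # learns RGB-specific features
        self.depth_stream = stream(1)  # learns depth-specific features
        # The fusion layer sees both modalities at once, so its weights can
        # model the complementary relationship between them.
        self.fusion = nn.Sequential(nn.Linear(64 + 64, 128), nn.ReLU())
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, rgb, depth):
        joint = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.classifier(self.fusion(joint))

# Toy forward pass: a batch of 2 RGB images and their paired depth maps.
logits = MultiModalRGBD()(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
```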
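Finally, for scene classification, here is a small sketch of standard Fisher Vector encoding over a diagonal-covariance GMM (the usual Perronnin-style formulation), which makes concrete where noisy components can arise: every densely sampled local feature contributes to fitting every GMM component. The feature dimensionality and component count are toy values.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(local_feats, gmm):
    """Standard first- and second-order FV encoding (diagonal-covariance GMM)."""
    q = gmm.predict_proba(local_feats)                 # (N, K) soft assignments
    mu, sigma = gmm.means_, np.sqrt(gmm.covariances_)  # (K, D) each for 'diag'
    pi, N = gmm.weights_, len(local_feats)
    parts = []
    for k in range(gmm.n_components):
        diff = (local_feats - mu[k]) / sigma[k]        # (N, D) whitened residual
        parts.append((q[:, k:k + 1] * diff).sum(0) / (N * np.sqrt(pi[k])))
        parts.append((q[:, k:k + 1] * (diff ** 2 - 1)).sum(0)
                     / (N * np.sqrt(2 * pi[k])))
    return np.concatenate(parts)                       # length 2 * K * D

# The GMM "vocabulary" is fit on densely sampled local features (toy data here);
# components estimated from such dense samples are what the thesis flags as noisy.
dense_feats = np.random.randn(2000, 64)
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(dense_feats)
fv = fisher_vector(np.random.randn(100, 64), gmm)      # 2 * 8 * 64 = 1024 dims
```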
