Abstract

Depth-modal information has recently proven effective in the computer vision community, especially for scene analysis tasks. However, its use still suffers from the scarcity of depth data and from the mismatch introduced when pre-trained RGB models are transferred directly to depth-modal data. In this study, we propose a novel two-step training strategy to address these problems, focusing on enhancing recognition performance on depth images in the RGB-D scene recognition task. Specifically, we build an effective “Res-U” architecture for a GAN (generative adversarial network) based RGB-to-depth modality translation model, endowed with both short and long skip connections for residual learning. On one hand, this lets us pre-train a depth-specific discriminator network from scratch in an unsupervised manner and then transfer it to the subsequent recognition task, instead of directly fine-tuning a pre-trained RGB model into a depth-specific one. On the other hand, new depth images with helpful perturbations, generated by the modality translation model, augment the original training set and regularize the learning process. This two-step training strategy makes it easier to train a modality-specific network to discriminate depth scenes. In addition, we extensively analyze the modality translation network to investigate its effect on recognizing depth-modal scenes, which suggests a reasonable way to take full advantage of multiple modalities. The proposed method achieves state-of-the-art accuracy on the NYU Depth v2 and SUN RGB-D benchmark datasets, especially when evaluated on depth data alone.
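
To make the “Res-U” idea concrete, below is a minimal PyTorch sketch of a generator that combines short skips (identity paths inside residual blocks) with long skips (encoder features added back to decoder features), mapping an RGB image to a single-channel depth prediction. The layer sizes, depth of the network, and the additive fusion of the long skip are illustrative assumptions, not the authors' published configuration.

```python
# Sketch of a "Res-U" style translation generator: short skips via residual
# blocks, long skips from encoder to decoder. Shapes and channel counts are
# hypothetical and chosen only to keep the example small.
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block: the identity path is the short skip."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))  # short skip


class ResUGenerator(nn.Module):
    """Encoder-decoder with residual blocks and a U-Net style long skip."""

    def __init__(self, in_ch=3, out_ch=1, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), ResBlock(base))
        self.down = nn.Conv2d(base, base * 2, 4, stride=2, padding=1)
        self.enc2 = ResBlock(base * 2)
        self.up = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)
        self.dec1 = ResBlock(base)
        self.out = nn.Conv2d(base, out_ch, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        d1 = self.dec1(self.up(e2) + e1)  # long skip: encoder features reused in decoder
        return torch.tanh(self.out(d1))   # predicted depth-modal image


# Usage: translate a batch of RGB images into depth predictions.
rgb = torch.randn(2, 3, 224, 224)
depth_pred = ResUGenerator()(rgb)
print(depth_pred.shape)  # torch.Size([2, 1, 224, 224])
```

In the two-step strategy described above, a discriminator trained adversarially against such a generator would then be reused as the starting point for the depth scene classifier, while the generator's outputs serve as perturbed training samples; the sketch covers only the generator side.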
