CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

Jiange Yang, Limin Wang, Sheng Guo, Gangshan Wu

doi:10.1609/aaai.v37i3.25419

Abstract

Current RGB-D scene recognition approaches often train two standalone backbones for the RGB and depth modalities, both starting from the same Places or ImageNet pre-training. However, the pre-trained depth network is then still biased by RGB-based models, which may lead to a suboptimal solution. In this paper, we present a single-model self-supervised hybrid pre-training framework for the RGB and depth modalities, termed CoMAE. CoMAE adopts a curriculum learning strategy to unify two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling. Specifically, we first build a patch-level alignment task to pre-train a single encoder shared by both modalities via cross-modal contrastive learning. Then, the pre-trained contrastive encoder is passed to a multi-modal masked autoencoder to capture finer context features from a generative perspective. Moreover, because our single-model design requires no fusion module, it is flexible and generalizes robustly to unimodal scenarios in both the training and testing phases. Extensive experiments on the SUN RGB-D and NYUDv2 datasets demonstrate the effectiveness of CoMAE for RGB and depth representation learning. Our results also reveal that CoMAE is a data-efficient representation learner: although we pre-train only on the small-scale, unlabeled training set, our CoMAE pre-trained models remain competitive with state-of-the-art methods that use additional large-scale, supervised RGB pre-training. Code will be released at https://github.com/MCG-NJU/CoMAE.
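The patch-level cross-modal alignment step can be illustrated with a minimal pure-Python sketch. It rests on assumptions that are not taken from the paper: the shared encoder has already produced one embedding per patch for each modality, the RGB and depth embeddings at the same patch index form a positive pair, every other patch acts as a negative, and the objective is a standard symmetric InfoNCE loss with temperature `tau`. The function names and the default `tau=0.07` are hypothetical choices for illustration, not CoMAE's released code.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries: List[Vector], keys: List[Vector], tau: float) -> float:
    """Mean InfoNCE loss: queries[i] and keys[i] are the positive pair,
    all other keys act as negatives for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]  # equals -log softmax(logits)[i]
    return total / len(queries)


def patch_alignment_loss(rgb_patches: List[Vector],
                         depth_patches: List[Vector],
                         tau: float = 0.07) -> float:
    """Hypothetical symmetric cross-modal contrastive loss over patch embeddings.

    Assumes rgb_patches[i] and depth_patches[i] are the shared encoder's
    embeddings of the same spatial patch in the two modalities.
    """
    if len(rgb_patches) != len(depth_patches):
        raise ValueError("both modalities must yield the same number of patches")
    rgb = [l2_normalize(p) for p in rgb_patches]
    depth = [l2_normalize(p) for p in depth_patches]
    # Average the RGB->depth and depth->RGB matching directions.
    return 0.5 * (info_nce(rgb, depth, tau) + info_nce(depth, rgb, tau))


if __name__ == "__main__":
    rgb = [[1.0, 0.0], [0.0, 1.0]]
    aligned = patch_alignment_loss(rgb, [[0.9, 0.1], [0.1, 0.9]])
    swapped = patch_alignment_loss(rgb, [[0.1, 0.9], [0.9, 0.1]])
    print(aligned < swapped)  # True: matching patches give a lower loss
```

In this toy example, depth embeddings that point in the same direction as their RGB counterparts yield a much lower loss than the swapped assignment, which is the behavior a patch-level alignment objective is meant to reward. Because a single encoder is shared by both modalities in the abstract's design, the same function would be applied to outputs of one network rather than two.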

More From: Proceedings of the AAAI Conference on Artificial Intelligence

Similar Papers

Improving RGB-D salient object detection by addressing inconsistent saliency problems
Kun Zuo ... Hao Wen
Knowledge-Based Systems | VOL. 299
28 May 2024

Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation
Xinhua Xu ... Hong Liu
Electronics | VOL. 12
19 Sep 2023

Multi-Task and Multi-Modal Learning for RGB Dynamic Gesture Recognition
Dinghao Fan ... Hengjie Lu
IEEE Sensors Journal | VOL. 21
01 Dec 2021

Semi-Supervised Cross-Modality Action Recognition by Latent Tensor Transfer Learning
Chengcheng Jia ... Zhengming Ding
IEEE Transactions on Circuits and Systems for Video Technology | VOL. 30
12 Sep 2019
