Abstract

Multimodal feature fusion aims to draw complementary information from different modalities to achieve better performance. Contrastive learning is effective at discriminating coexisting semantic features (positives) from irrelevant ones (negatives) in multimodal signals. However, positive and negative pairs learn at different rates, which undermines the overall performance of multimodal contrastive learning (MCL). Moreover, the learned representation model is not robust, as MCL relies on supervision signals from potentially noisy modalities. To address these issues, a novel multimodal contrastive learning objective, Pace-adaptive and Noise-resistant Noise-Contrastive Estimation (PN-NCE), is proposed for multimodal fusion that operates directly on unimodal features. PN-NCE encourages positive and negative pairs to reach their optimal similarity scores at adaptive paces and is less susceptible to noisy inputs during training; a theoretical analysis of its robustness is provided. Since maximizing modality-invariant information in the fused representation is expected to benefit overall performance, an estimator that measures the difference between the fused representation and its unimodal representations is integrated into MCL to obtain a more modality-invariant fusion output. The proposed method is model-agnostic and can be adapted to various multimodal tasks. It also suffers less performance degradation when the number of training samples is reduced at the linear-probing stage. Experiments with different networks and modality inputs on three multimodal datasets show that PN-NCE achieves consistent improvements over previous state-of-the-art approaches.
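For context, the noise-contrastive estimation objective that PN-NCE builds on scores matched multimodal pairs (positives) against mismatched ones (negatives) within a batch. The sketch below shows the standard InfoNCE baseline only; the pace-adaptive weighting and noise-resistant terms that distinguish PN-NCE are not specified in the abstract, so they are not reproduced here. All function names and the `temperature` value are illustrative assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Standard InfoNCE loss over paired multimodal embeddings.

    z_a, z_b: [N, D] arrays of unimodal embeddings where row i of
    each array comes from the same sample (a positive pair); all
    cross-row combinations serve as negatives. PN-NCE (per the
    abstract) extends this kind of objective; this is only the
    common baseline, not the proposed method.
    """
    # L2-normalise so dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # [N, N] similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal (matched pairs); maximise their
    # log-probability relative to the in-batch negatives.
    return -np.mean(np.diag(log_prob))
```

A quick sanity check: identical embeddings in both "modalities" should yield a lower loss than misaligned ones, since every positive pair then has the maximal similarity.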
