Abstract

Multimodal sequence learning aims to exploit information from different modalities to improve overall performance. Mainstream works often follow an intermediate-fusion pipeline that exploits both modality-specific and modality-supplementary information for fusion. However, unaligned and heterogeneously distributed multimodal sequences pose two significant challenges to the fusion task: 1) extracting effective unimodal and crossmodal representations, and 2) overcoming overfitting in joint multimodal sequence optimization. In this work, we propose regularized expressive representation distillation (RERD), which seeks effective multimodal representations and enhances the generalization of fusion. First, to improve unimodal representation learning, unimodal representations are assigned to multi-head distillation encoders, where they are iteratively updated through distillation attention layers. Second, to alleviate overfitting in joint crossmodal optimization, a multimodal Sinkhorn distance regularizer is proposed to reinforce expressive representation extraction and to adaptively reduce the modality gap before fusion. Together, these representations provide a comprehensive view of the multimodal sequences and are used for downstream fusion tasks. Experimental results on several popular benchmarks demonstrate that the proposed method achieves state-of-the-art performance compared with widely used baselines for deep multimodal sequence fusion. Source code will be shared publicly.
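
The Sinkhorn distance regularizer builds on entropy-regularized optimal transport, which measures the cost of aligning two empirical distributions of representations. The sketch below is only a minimal illustration of that general idea, not the paper's exact regularizer: the function name sinkhorn_distance, the uniform marginals, the squared-Euclidean cost, and the weight lambda_reg are assumptions introduced here for illustration.

```python
import torch

def sinkhorn_distance(x, y, epsilon=0.1, n_iters=50):
    """Entropy-regularized Sinkhorn distance between two sets of representations.

    x: (n, d) tensor of representations from one modality
    y: (m, d) tensor of representations from another modality
    epsilon: entropic regularization strength
    n_iters: number of Sinkhorn scaling iterations
    """
    n, m = x.size(0), y.size(0)
    # Pairwise squared Euclidean cost between the two representation sets.
    cost = torch.cdist(x, y, p=2) ** 2                  # (n, m)
    # Uniform marginals over the samples of each modality.
    mu = torch.full((n,), 1.0 / n, device=x.device)
    nu = torch.full((m,), 1.0 / m, device=x.device)
    # Gibbs kernel and alternating scaling (standard Sinkhorn iterations).
    K = torch.exp(-cost / epsilon)                      # (n, m)
    u = torch.ones_like(mu)
    v = torch.ones_like(nu)
    for _ in range(n_iters):
        v = nu / (K.t() @ u + 1e-8)
        u = mu / (K @ v + 1e-8)
    # Approximate transport plan and the resulting transport cost.
    pi = torch.diag(u) @ K @ torch.diag(v)              # (n, m)
    return torch.sum(pi * cost)


# Hypothetical usage: penalize the gap between two unimodal representation
# batches before fusion, added to the task loss with weight lambda_reg.
text_repr = torch.randn(32, 128)
audio_repr = torch.randn(32, 128)
lambda_reg = 0.1
reg_loss = lambda_reg * sinkhorn_distance(text_repr, audio_repr)
```

In such a setup the regularization term would be added to the fusion task loss, encouraging the unimodal representation distributions to move closer before they are fused; the actual formulation and weighting used by RERD are given in the full paper.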
