Abstract

In the field of Multimodal Sentiment Analysis (MSA), prevailing methods focus on developing intricate network architectures to capture intra- and inter-modal dynamics, which requires numerous parameters and makes multimodal modeling harder to interpret. Moreover, the heterogeneous nature of the modalities (text, audio, and vision) introduces significant modality gaps, making multimodal representation learning an ongoing challenge. To address these issues, we treat the learning process of each modality as a subtask and propose a novel approach named Multi-Task Momentum Distillation (MTMD), which reduces the gap among different modalities. Specifically, given the richer semantic information they carry, we treat the subtasks of textual and multimodal representations as teacher networks and the subtasks of acoustic and visual representations as student networks, and apply knowledge distillation to transfer sentiment-related knowledge guided by the regression and classification subtasks. Additionally, we adopt unimodal momentum models to mine modality-specific knowledge and employ adaptive momentum fusion factors to learn a robust multimodal representation. Furthermore, we provide a theoretical perspective based on mutual information maximization by interpreting MTMD as generating sentiment-related views in various ways. Extensive experiments demonstrate the superiority of our approach over state-of-the-art MSA methods.
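The sketch below illustrates the core ideas named in the abstract: unimodal heads with regression and classification outputs, an EMA-style momentum update, teacher-to-student distillation, and adaptive fusion factors. All module names, dimensions, loss weights, and the momentum coefficient are assumptions for illustration only, not the authors' released implementation.

```python
# Illustrative sketch of multi-task momentum distillation for MSA.
# Assumptions (not from the paper): layer sizes, temperature T, momentum m,
# and the simple weighted-sum fusion of hidden states.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnimodalHead(nn.Module):
    """Encodes one modality and predicts sentiment intensity (regression) and polarity (classification)."""
    def __init__(self, in_dim, hid_dim=128, num_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.reg_head = nn.Linear(hid_dim, 1)
        self.cls_head = nn.Linear(hid_dim, num_classes)

    def forward(self, x):
        h = self.encoder(x)
        return h, self.reg_head(h), self.cls_head(h)


@torch.no_grad()
def momentum_update(online: nn.Module, momentum_model: nn.Module, m: float = 0.995):
    """EMA update of the momentum copy of a unimodal model."""
    for p_o, p_m in zip(online.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)


def distill_losses(teacher_reg, teacher_cls, student_reg, student_cls, T: float = 2.0):
    """Transfer regression- and classification-level knowledge from teacher to student."""
    reg_loss = F.mse_loss(student_reg, teacher_reg.detach())
    cls_loss = F.kl_div(
        F.log_softmax(student_cls / T, dim=-1),
        F.softmax(teacher_cls.detach() / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return reg_loss + cls_loss


# Toy usage: text (and the fused representation) act as teachers, audio and vision as students.
text_net, audio_net, vision_net = UnimodalHead(300), UnimodalHead(74), UnimodalHead(35)
audio_momentum = copy.deepcopy(audio_net)          # momentum counterpart of a student
fusion_weights = nn.Parameter(torch.zeros(3))      # adaptive fusion factors (softmax-normalized)

x_t, x_a, x_v = torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35)
(h_t, r_t, c_t) = text_net(x_t)
(h_a, r_a, c_a) = audio_net(x_a)
(h_v, r_v, c_v) = vision_net(x_v)

# Fused multimodal representation built with the adaptive factors.
w = F.softmax(fusion_weights, dim=0)
fused = w[0] * h_t + w[1] * h_a + w[2] * h_v

# Distill sentiment-related knowledge from the textual teacher to both students,
# then refresh the momentum copy of the audio student.
loss = distill_losses(r_t, c_t, r_a, c_a) + distill_losses(r_t, c_t, r_v, c_v)
momentum_update(audio_net, audio_momentum)
```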
