Abstract

Multimodal sentiment analysis (MSA) predicts the sentiment polarity of an unlabeled utterance that carries multiple modalities, such as text, vision and audio, by learning from labeled utterances. Existing fusion methods mainly focus on modeling the relationships among features of different modalities to enhance emotion recognition. However, they often overlook the full range of interactions between modalities, especially the cross-modal interaction, which is critical to the sentiment decision for multimodal data. To address these issues, we propose a novel hybrid cross-modal interaction learning (HCIL) framework that jointly learns intra-modal, inter-modal, interactive-modal and cross-modal interactions, allowing the model to fully exploit the sentiment information of all modalities and strengthen the sentiment assistance between them. Specifically, we propose two core substructures to learn discriminative multimodal features. The first is a comparative learning interaction structure that tracks class dynamics within each modality, reduces the gap between modalities and establishes emotional communication across them; the second is a cross-modal prediction structure that builds sentiment relationships between cross-modal pairs, in particular exploring the auxiliary sentiment effect of audio on vision and text. Furthermore, we adopt a hierarchical feature fusion structure to generate the multimodal feature used for the final sentiment prediction. Extensive experiments on three benchmark datasets show that our HCIL approach achieves strong performance on the MSA task and that the cross-modal interaction structure directly improves sentiment classification.
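
To make the overall structure concrete, the following is a minimal sketch of the kind of pipeline the abstract describes: unimodal encoders, a cross-modal prediction head in which audio assists the text and vision branches, and a hierarchical fusion step before classification. It is not the authors' implementation; all module names, feature dimensions and loss weights are illustrative assumptions, and the comparative learning interaction structure is omitted for brevity.

```python
# Illustrative sketch only, assuming pre-extracted utterance-level features
# for text, vision and audio. Dimensions and weights are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HCILSketch(nn.Module):
    """Toy three-branch model: unimodal encoders, a cross-modal prediction
    head (audio -> text/vision), and hierarchical fusion for sentiment."""

    def __init__(self, dims=(768, 35, 74), hidden=128, num_classes=3):
        super().__init__()
        d_t, d_v, d_a = dims  # assumed text / vision / audio feature sizes
        self.enc_t = nn.Sequential(nn.Linear(d_t, hidden), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(d_v, hidden), nn.ReLU())
        self.enc_a = nn.Sequential(nn.Linear(d_a, hidden), nn.ReLU())
        # cross-modal prediction: audio features predict text and vision features
        self.a2t = nn.Linear(hidden, hidden)
        self.a2v = nn.Linear(hidden, hidden)
        # hierarchical fusion: pairwise fusion first, then all three modalities
        self.fuse_tv = nn.Linear(2 * hidden, hidden)
        self.fuse_all = nn.Linear(2 * hidden, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x_t, x_v, x_a):
        h_t, h_v, h_a = self.enc_t(x_t), self.enc_v(x_v), self.enc_a(x_a)
        # cross-modal prediction loss: encourage the audio representation to
        # carry sentiment cues that are predictive of the other two modalities
        loss_cross = F.mse_loss(self.a2t(h_a), h_t.detach()) + \
                     F.mse_loss(self.a2v(h_a), h_v.detach())
        # hierarchical fusion: combine text and vision, then add audio
        h_tv = torch.relu(self.fuse_tv(torch.cat([h_t, h_v], dim=-1)))
        h_all = torch.relu(self.fuse_all(torch.cat([h_tv, h_a], dim=-1)))
        logits = self.classifier(h_all)
        return logits, loss_cross


# usage with random stand-in utterance features (batch of 4)
model = HCILSketch()
x_t, x_v, x_a = torch.randn(4, 768), torch.randn(4, 35), torch.randn(4, 74)
logits, loss_cross = model(x_t, x_v, x_a)
labels = torch.randint(0, 3, (4,))
loss = F.cross_entropy(logits, labels) + 0.1 * loss_cross  # 0.1 is an assumed weight
```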
