Abstract

Multimodal sentiment analysis (MSA) has become a popular field of research in recent years. The aim is to combine the three modalities of text, video, and audio to obtain comprehensive emotional information. However, current research often treats these three modalities equally, downplaying the crucial role of the text modality in MSA and ignoring the redundant information generated during multimodal fusion. To address these problems, we propose the Text-Centric Hierarchical Fusion Network (TCHFN), which employs a hierarchical fusion strategy. In this framework, low-level fusion involves cross-modal interactions between pairs of modalities, while high-level fusion extends these interactions to all three modalities. Through the design of the Cross-modal Reinforced Transformer (CRT), we achieve cross-modal enhancement of the target modality, facilitating a nuanced fusion process with text serving as its core. Additionally, we design Text-Centric Contrastive Learning (TCCL) to align the non-text modalities with the text modality, emphasising the central role of text in the fusion process. After fusion, a multimodal fusion output gate mitigates redundant information within the multimodal fusion representation, which is subsequently processed by a linear layer for prediction. Simultaneously, to fully leverage limited labelled datasets, we introduce knowledge distillation: the model parameters that yield the best performance during training are preserved as a teacher model. The teacher model aids in capturing rich emotional information, enabling the model to escape local optima and discover better parameters, thereby enhancing overall performance. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate the superiority of our model over state-of-the-art methods in MSA tasks.
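To make the text-centric fusion idea concrete, the following is a minimal sketch (not the authors' code) of two components the abstract describes: cross-modal attention that reinforces the text representation with a non-text modality, and a sigmoid output gate that suppresses redundant information in the fused representation. The layer sizes, the use of `nn.MultiheadAttention`, the residual connection, and the pooling/gating details are illustrative assumptions, since the abstract does not specify them.

```python
# Illustrative sketch of text-centric cross-modal fusion with an output gate.
# All architectural details here are assumptions, not the published TCHFN.
import torch
import torch.nn as nn

class TextCentricCrossAttention(nn.Module):
    """Reinforce the text representation with one non-text modality:
    text provides the queries; the other modality provides keys and values."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        enhanced, _ = self.attn(query=text, key=other, value=other)
        return self.norm(text + enhanced)  # residual keeps text central

class FusionOutputGate(nn.Module):
    """Sigmoid gate that down-weights redundant dimensions of the fused vector."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return fused * self.gate(fused)

# Toy usage: batch of 8 sequences, 20 time steps, 128-d features per modality.
text, audio, video = (torch.randn(8, 20, 128) for _ in range(3))
cross = TextCentricCrossAttention()
gate = FusionOutputGate()
text_audio = cross(text, audio)                       # low-level pairwise fusion
text_video = cross(text, video)
fused = gate((text_audio + text_video).mean(dim=1))   # pooled, gated representation
score = nn.Linear(128, 1)(fused)                      # linear head for sentiment prediction
```

In this sketch, using text as the query in both pairwise fusions is what makes the process "text-centric": the non-text modalities only contribute information that attends to, and is organised around, the textual sequence.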
