In multimodal sentiment analysis (MSA), the strategy used to fuse multimodal features strongly influences model performance. Previous works often struggle to integrate heterogeneous data without fully exploiting the rich semantic content of text, resulting in weak cross-modal information association. We propose an MSA model based on Text-Driven Crossmodal Fusion and Mutual Information Estimation, called TeD-MI. TeD-MI comprises a Stacked Text-Driven Crossmodal Fusion (STDC) module, which efficiently fuses the three modalities under the guidance of the text modality to optimize the fused feature representation and enhance semantic understanding. In addition, TeD-MI includes a mutual information estimation module that balances preserving task-related information against filtering out irrelevant noise. Comprehensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that the proposed model achieves improvements of varying degrees on most evaluation metrics.
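The sketch below illustrates one plausible form of text-driven crossmodal fusion, where text features act as queries attending over audio and visual features so the fused representation stays anchored to textual semantics; it is a minimal illustration, not the authors' implementation, and all module names, dimensions, and the stacking depth are assumptions.

```python
# Minimal sketch of stacked text-driven crossmodal fusion (illustrative only).
import torch
import torch.nn as nn


class TextDrivenFusionLayer(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Text queries attend over audio and over visual sequences separately.
        self.attn_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, text, audio, visual):
        t2a, _ = self.attn_audio(text, audio, audio)    # text -> audio attention
        t2v, _ = self.attn_visual(text, visual, visual)  # text -> visual attention
        fused = self.norm(text + t2a + t2v)              # residual fusion anchored on text
        return self.norm(fused + self.ffn(fused))


class StackedTextDrivenFusion(nn.Module):
    """Stack several fusion layers; each layer refines the text-anchored representation."""

    def __init__(self, dim: int = 128, depth: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(TextDrivenFusionLayer(dim) for _ in range(depth))

    def forward(self, text, audio, visual):
        for layer in self.layers:
            text = layer(text, audio, visual)
        return text


# Usage: batch of 8 utterances, 20 time steps, feature dim 128 per modality (assumed shapes).
fusion = StackedTextDrivenFusion(dim=128, depth=3)
t = torch.randn(8, 20, 128)
a = torch.randn(8, 20, 128)
v = torch.randn(8, 20, 128)
out = fusion(t, a, v)   # (8, 20, 128) fused, text-driven representation
print(out.shape)
```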