Multimodal sentiment analysis aims to extract the sentiment information users express in multimodal data spanning linguistic, acoustic, and visual cues. However, the heterogeneity of multimodal data leads to disparities in modal distributions, which hampers a model's ability to effectively exploit the complementary and redundant information across modalities. Additionally, existing approaches often fuse modalities directly after obtaining their representations, overlooking potential emotional correlations between them. To tackle these challenges, we propose a Multiview Collaborative Perception (MVCP) framework for multimodal sentiment analysis. The framework consists of two main modules: Multimodal Disentangled Representation Learning (MDRL) and Cross-Modal Context Association Mining (CMCAM). The MDRL module employs a joint learning layer comprising a common encoder and an exclusive encoder; this layer maps multimodal data onto a hypersphere and learns common and exclusive representations for each modality, mitigating the semantic gap arising from modal heterogeneity. To further bridge semantic gaps and capture complex inter-modal correlations, the CMCAM module applies multiple attention mechanisms to mine cross-modal and contextual sentiment associations, yielding joint representations with rich multimodal semantic interactions. At this stage, the CMCAM module mines correlations only among the common representations, so that the exclusive representations of the different modalities are preserved. Finally, a multitask learning framework is adopted to share parameters across single-modal tasks and improve sentiment prediction performance. Experimental results on the MOSI and MOSEI datasets demonstrate the effectiveness of the proposed method.
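The sketch below is a minimal illustration, not the authors' implementation, of the two ideas named in the abstract: common/exclusive encoders whose outputs are L2-normalized onto a unit hypersphere (MDRL), and attention applied only to the common representations to mine cross-modal associations (CMCAM). The module structure, hidden size, feature dimensions, and the use of PyTorch with utterance-level features are all assumptions made for illustration.

```python
# Hedged sketch of the MDRL and CMCAM ideas; architecture details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledEncoders(nn.Module):
    """Joint learning layer: one shared (common) encoder and one exclusive encoder per modality."""

    def __init__(self, in_dims, hidden_dim=128):
        super().__init__()
        # Per-modality projection to a shared feature size (assumed preprocessing step).
        self.project = nn.ModuleDict({m: nn.Linear(d, hidden_dim) for m, d in in_dims.items()})
        # Common encoder shared across modalities; exclusive encoder specific to each modality.
        self.common = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.exclusive = nn.ModuleDict({
            m: nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
            )
            for m in in_dims
        })

    def forward(self, feats):
        common, exclusive = {}, {}
        for m, x in feats.items():
            h = torch.relu(self.project[m](x))
            # L2 normalization maps both representations onto the unit hypersphere,
            # narrowing the distribution gap caused by modality heterogeneity.
            common[m] = F.normalize(self.common(h), dim=-1)
            exclusive[m] = F.normalize(self.exclusive[m](h), dim=-1)
        return common, exclusive


class CrossModalAttention(nn.Module):
    """Attention over the common representations only; exclusive ones are left untouched."""

    def __init__(self, hidden_dim=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, common):
        # Stack the common representations as a 3-token sequence (language, acoustic, visual)
        # so each modality can attend to the others, then pool into a joint representation.
        tokens = torch.stack(list(common.values()), dim=1)   # (batch, 3, hidden)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)


# Toy usage with assumed feature sizes: language 768, acoustic 74, visual 35.
if __name__ == "__main__":
    dims = {"language": 768, "acoustic": 74, "visual": 35}
    enc, attn = DisentangledEncoders(dims), CrossModalAttention()
    batch = {m: torch.randn(8, d) for m, d in dims.items()}
    common, exclusive = enc(batch)
    joint = attn(common)
    print(joint.shape)  # torch.Size([8, 128])
```

The exclusive representations returned by the encoder would feed the single-modal tasks of the multitask setup, while the joint representation drives the multimodal sentiment prediction; how these heads and losses are combined is not specified in the abstract.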