Abstract
Multimodal sentiment analysis (MSA) integrates textual, visual, and audio information from videos to identify human emotional states. This study proposes a multimodal feature decoupling strategy that separates sentiment features into common and private features. The private features capture what is unique to each modality, thereby increasing feature diversity. The common features capture what is shared across modalities, thereby reducing potential information loss during decoupling. To achieve this, we design dedicated encoders and loss function constraints for both types of features. Additionally, to mitigate information redundancy and prevent the loss of key information during decoupled representation learning, we introduce a dual feature reconstruction mechanism comprising unimodal feature reconstruction (UFR) and multimodal feature reconstruction (MFR). Extensive experiments on three datasets show that our method improves accuracy by approximately 1%–3% over existing advanced techniques.
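As a rough illustration of the decoupling idea, the sketch below computes three loss terms for common and private features. Everything in it is an assumption made for illustration, not the formulation used in this work: the encoders and decoders are plain linear maps, the common features share one encoder (so all modalities are assumed to be pre-projected to the same dimension), the similarity term is a pairwise mean squared error, the difference term is a squared dot product, and only a unimodal reconstruction term is shown. The names `decoupling_losses`, `common_enc`, `private_encs`, and `decoders` are hypothetical.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def random_matrix(rows, cols, rng):
    """Small random weight matrix standing in for a trained layer."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def decoupling_losses(features, common_enc, private_encs, decoders):
    """Return illustrative loss terms for one sample (all choices are assumptions)."""
    names = list(features)
    # One shared encoder yields common features; one encoder per modality yields private ones.
    common = {m: linear(common_enc, features[m]) for m in names}
    private = {m: linear(private_encs[m], features[m]) for m in names}

    # Similarity term (assumed): pull the common features of each modality pair together.
    similarity = sum(
        mse(common[a], common[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    # Difference term (assumed): push common and private features toward orthogonality.
    difference = sum(dot(common[m], private[m]) ** 2 for m in names)
    # Unimodal reconstruction term (assumed): decode [common; private] back to the input.
    reconstruction = sum(
        mse(linear(decoders[m], common[m] + private[m]), features[m])
        for m in names
    )
    return {
        "similarity": similarity,
        "difference": difference,
        "reconstruction": reconstruction,
    }


rng = random.Random(0)
dim, hidden = 4, 3
modalities = ("text", "audio", "visual")
features = {m: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for m in modalities}
common_enc = random_matrix(hidden, dim, rng)
private_encs = {m: random_matrix(hidden, dim, rng) for m in modalities}
decoders = {m: random_matrix(dim, 2 * hidden, rng) for m in modalities}

losses = decoupling_losses(features, common_enc, private_encs, decoders)
```

In a trainable model these terms would typically be weighted and added to the sentiment prediction loss; the weights and the exact form of each term here are placeholders.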