Abstract

Representation learning is a crucial and challenging task in multimodal sentiment analysis. Effective multimodal sentiment representations must capture two key aspects: consistency and difference across modalities. However, state-of-the-art multimodal sentiment analysis approaches fail to capture both the consistency and the difference of sentiment information across diverse modalities. To address this representation problem, we propose an autoencoder-based self-supervised learning framework. In the pre-training stage, an autoencoder is designed for each modality and trained on unlabeled data to learn richer modality-specific sentiment representations through sample reconstruction and modality consistency detection tasks. In the fine-tuning stage, the pre-trained autoencoders are injected into MulT (yielding AE-MT), and a contrastive learning auxiliary task further enhances the model's ability to extract deep sentiment information. Experiments on the popular Chinese sentiment analysis benchmark CH-SIMS v2.0 and the English sentiment analysis benchmark MOSEI demonstrate significant gains over baseline models.
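
The following is a minimal, hypothetical sketch (not the authors' code) of the per-modality pre-training objective outlined above: each modality has its own autoencoder trained with a sample reconstruction loss and a binary modality consistency detection loss on paired representations. Feature dimensions (300 for text, 74 for audio), the hidden size, the loss weight `alpha`, and the way positive/negative pairs are formed are all illustrative assumptions.

```python
# Sketch of per-modality self-supervised pre-training:
# reconstruction + modality consistency detection (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAutoencoder(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, in_dim)
        # Binary head: do two modality representations come from the same sample?
        self.consistency_head = nn.Sequential(
            nn.Linear(2 * hidden_dim, 1), nn.Sigmoid()
        )

    def forward(self, x):
        z = self.encoder(x)        # latent sentiment representation
        x_hat = self.decoder(z)    # reconstruction of the input features
        return z, x_hat


def pretrain_losses(ae_a, ae_b, x_a, x_b, same_sample, alpha=1.0):
    """Reconstruction loss for each modality plus a consistency-detection loss.
    `same_sample` is 1 when (x_a, x_b) belong to the same utterance and
    0 for shuffled (negative) pairs; `alpha` weights the two terms."""
    z_a, x_a_hat = ae_a(x_a)
    z_b, x_b_hat = ae_b(x_b)
    recon = F.mse_loss(x_a_hat, x_a) + F.mse_loss(x_b_hat, x_b)
    pred = ae_a.consistency_head(torch.cat([z_a, z_b], dim=-1)).squeeze(-1)
    consistency = F.binary_cross_entropy(pred, same_sample.float())
    return recon + alpha * consistency


# Toy usage with random tensors standing in for text/audio features.
ae_text, ae_audio = ModalityAutoencoder(300), ModalityAutoencoder(74)
x_text, x_audio = torch.randn(8, 300), torch.randn(8, 74)
pair_labels = torch.randint(0, 2, (8,))
loss = pretrain_losses(ae_text, ae_audio, x_text, x_audio, pair_labels)
loss.backward()
```

In the fine-tuning stage described in the abstract, the encoders of these pre-trained autoencoders would supply the modality representations fed into MulT, with the contrastive auxiliary loss added to the supervised sentiment objective.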
