Abstract

Developing an effective multimodal representation has always been the crux of multimodal sentiment analysis. Different modalities carry distinct sentiment attributes across the modality-invariant and modality-specific representation spaces. Prior studies have concentrated on using intricate networks to directly generate joint representations of the three modalities, and have not exploited the relationships between the two representation spaces. To mitigate this, (1) we introduce a novel framework, the Co-space Representation Interaction Network (CRNet), which leverages different acoustic and visual representation subspaces to interact with the linguistic modality; (2) to construct a joint representation by coordinating the acoustic and visual spaces with the linguistic modality, we propose a novel module named Gradient-based Representation Enhancement (GRE), which is effective at extracting the significant variation of representation matrices; and (3) we design a novel multi-task strategy that optimizes the training process and improves the performance of the different representation combinations drawn from the two spaces. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets.
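The abstract does not detail how GRE operates, but one plausible reading of "gradient-based representation enhancement" is to use the gradient of a task loss with respect to a representation matrix as a saliency signal, amplifying the components with the largest variation. The sketch below is a hypothetical illustration under that assumption; the function name `gre_enhance`, the `alpha` scaling parameter, and the masking scheme are all assumptions, not the authors' published implementation.

```python
# Hypothetical sketch of a gradient-based enhancement step (assumed mechanism,
# not the paper's actual GRE module).
import torch
import torch.nn.functional as F

def gre_enhance(rep: torch.Tensor, loss: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Amplify the components of a representation matrix whose gradients
    w.r.t. a task loss are largest, i.e. its directions of significant
    variation."""
    # Gradient of the loss w.r.t. the representation; retain the graph so
    # the surrounding training step can still backpropagate normally.
    grad, = torch.autograd.grad(loss, rep, retain_graph=True)
    # Normalized gradient magnitude serves as a per-element saliency mask.
    saliency = grad.abs()
    saliency = saliency / (saliency.amax(dim=-1, keepdim=True) + 1e-8)
    # Re-weight the representation toward its most salient components.
    return rep * (1.0 + alpha * saliency)

# Minimal usage: a joint representation matrix and a stand-in task loss.
rep = torch.randn(8, 128, requires_grad=True)        # (batch, hidden)
loss = F.mse_loss(rep.sum(dim=-1), torch.zeros(8))   # dummy regression loss
enhanced = gre_enhance(rep, loss)
```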
