Abstract

To exploit the heterogeneous sentiment information carried by each modality and improve the accuracy of sentiment analysis, this paper proposes a multimodal sentiment analysis model based on Text-Centric Sharing-Private Affective Semantics (TCSP). First, the Deep Canonical Time Warping (DCTW) algorithm is employed to align the temporal deviations between the audio and picture modalities. Then, a cross-modal shared mask matrix is designed and a mutual attention mechanism is introduced to compute the shared affective semantic features of audio/picture-to-text. Next, the private affective semantic features of the audio and picture modalities are derived through a self-attention mechanism combined with an LSTM. Finally, an improved Transformer encoder structure achieves deep interaction and fusion of the cross-modal emotional features, on which the sentiment analysis is performed. Experiments are conducted on the IEMOCAP and MELD datasets. Compared with current state-of-the-art models, the TCSP model reaches an accuracy of 82.02%, validating its effectiveness. In addition, ablation experiments verify the rationality of each structural design choice within the model.
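To make the text-centric shared/private design concrete, the following minimal sketch (not the authors' code) shows one plausible reading of the pipeline in PyTorch: text features act as queries over an auxiliary modality (audio or picture) to obtain shared affective semantics, an LSTM followed by self-attention extracts the private semantics of that modality, and a Transformer encoder layer fuses both. All dimensions, module choices, and the name shared_mask are illustrative assumptions, not the published architecture.

import torch
import torch.nn as nn

class TextCentricCrossAttention(nn.Module):
    """Hypothetical sketch of text-centric shared/private feature extraction."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # shared branch: text queries attend over the auxiliary modality
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # private branch: LSTM followed by self-attention, as described above
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # fusion: a single Transformer encoder layer over both feature sets
        self.fuse = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, text, aux, shared_mask=None):
        # shared affective semantics: text as query, audio/picture as key/value;
        # shared_mask stands in for the cross-modal shared mask matrix (assumption)
        shared, _ = self.cross_attn(text, aux, aux, attn_mask=shared_mask)
        # private affective semantics of the auxiliary modality
        h, _ = self.lstm(aux)
        private, _ = self.self_attn(h, h, h)
        # fuse shared and private features before the downstream classifier
        return self.fuse(torch.cat([shared, private], dim=1))

# usage: 2 samples, 10 text tokens, 20 audio/picture frames, feature dim 128
text = torch.randn(2, 10, 128)
aux = torch.randn(2, 20, 128)
out = TextCentricCrossAttention()(text, aux)
print(out.shape)  # torch.Size([2, 30, 128])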
