Abstract

Cross-modal sentiment analysis is an emerging research area in natural language processing. The core of cross-modal fusion lies in extracting cross-modal relationships and learning joint features. Existing methods for cross-modal sentiment analysis focus on static text, video, audio, and other modality data but ignore the fact that, in practical applications, data from different modalities are often unaligned. Unaligned data sequences exhibit long-term temporal dependencies, which makes it difficult to model the interactions between modalities. This paper proposes UA-BFET, a sentiment analysis model based on feature enhancement for unaligned data scenarios, which performs sentiment analysis on unaligned text and video modality data from social media. First, the model adds a cyclic memory enhancement network across time steps: the interaction-aware cross-modal fusion features obtained at one time step are fed into the unimodal feature extraction of the next time step in a bi-directional gated recurrent unit (Bi-GRU), so that the progressively enhanced unimodal features and the cross-modal fusion features continuously complement each other. Second, the extracted unimodal text and video features, together with the enhanced cross-modal fusion features, undergo canonical correlation analysis (CCA) and are fed into a fully connected layer and a Softmax function for sentiment classification. In experiments on the unaligned public datasets MOSI and MOSEI, UA-BFET matches or exceeds the sentiment analysis performance of models that fuse text, video, and audio modalities, showing clear advantages for cross-modal sentiment analysis in unaligned data scenarios.
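The second stage above applies canonical correlation analysis to the text and video feature sets. As a rough illustration of that step only, the following numpy sketch computes canonical correlations via the standard whitened-SVD formulation of CCA; the feature matrices, dimensions, and regularization value are illustrative placeholders, not details taken from the UA-BFET paper.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Canonical correlation analysis via SVD of the whitened
    cross-covariance. X (n x dx) and Y (n x dy) are two feature sets
    (e.g. text and video features); reg is a small ridge term that
    keeps the within-set covariances invertible."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Singular values of the whitened cross-covariance are the
    # canonical correlations, sorted in descending order.
    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(K)
    return s, inv_sqrt(Sxx) @ U, inv_sqrt(Syy) @ Vt.T
```

In a pipeline like the one the abstract describes, the projected features (via the returned projection matrices) would then be passed to a fully connected layer and Softmax for classification.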

