Existing multimodal sentiment analysis approaches suffer from inadequate extraction of unimodal features, redundancy among independent modal features, insufficient analysis of the semantic correlations between modalities, and insufficient fusion. To address these shortcomings, a web-semantic enhanced multimodal sentiment analysis model using multilayer cross-attention fusion (MCFMSA) is proposed. The model uses deep learning components (including XLNet, ResNeSt, and convolutional neural networks) with self-attention mechanisms to extract high-level features from the text, audio, and visual modalities, and improves sentiment classification accuracy through multimodal fusion. Experimental results demonstrate that the proposed MCFMSA achieves Acc-2, Acc-3, F1, and MAE values of 89.7%, 85.2%, 89.3%, and 0.466 on the CMU-MOSI dataset, and 88.7%, 82.5%, 86.5%, and 0.475 on the CMU-MOSEI dataset, respectively. These results are significant improvements over several other advanced multimodal sentiment analysis methods.
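The following is a minimal sketch of the multilayer cross-attention fusion idea summarized above, assuming a PyTorch implementation; the module and parameter names (CrossAttentionFusion, d_model, n_layers, etc.) are hypothetical illustrations, not taken from the paper.

```python
# Hypothetical sketch: text features repeatedly attend to audio and visual
# features across several cross-attention layers, then a pooled representation
# is used for sentiment prediction. Not the paper's exact architecture.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleDict({
                "attn_audio": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "attn_visual": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "norm": nn.LayerNorm(d_model),
            })
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, 1)  # regression head for a sentiment score

    def forward(self, text, audio, visual):
        # text/audio/visual: (batch, seq_len, d_model) unimodal features,
        # e.g. produced by XLNet, CNN, and ResNeSt encoders respectively.
        fused = text
        for layer in self.layers:
            a, _ = layer["attn_audio"](fused, audio, audio)     # text queries attend to audio
            v, _ = layer["attn_visual"](fused, visual, visual)  # text queries attend to visual
            fused = layer["norm"](fused + a + v)                # residual fusion per layer
        return self.head(fused.mean(dim=1))                     # pooled sentiment prediction


if __name__ == "__main__":
    model = CrossAttentionFusion()
    t, a, v = (torch.randn(2, 20, 256) for _ in range(3))
    print(model(t, a, v).shape)  # torch.Size([2, 1])
```

The design choice illustrated here is that each fusion layer lets one modality's queries draw complementary information from the other modalities, which is one common way to realize cross-attention fusion.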