Abstract

Multimodal sentiment analysis aims to infer the sentiment of video bloggers from the features of multiple input modalities. However, these features suffer from signal noise and signal loss in the input phase and from inefficient utilization in the modality fusion phase. To address these issues, this study proposes a feature-based restoration dynamic interaction network for multimodal sentiment analysis. First, a resampler-and-integration strategy is employed to enhance visual and textual features during the input phase. Second, in the modal interaction phase, a dynamic routing network centered on the text modality dynamically fuses visual and audio features. Finally, in the classification phase, the multimodal representations are combined to guide sentiment prediction. Experiments were conducted on the MOSI, MOSEI, and UR-FUNNY datasets, which contain 2,199, 22,856, and 16,514 video segments, respectively. The results show that, compared with state-of-the-art methods, the proposed method achieves an average improvement of about 1 point across three metrics on MOSI and about 0.5 points on individual metrics on MOSEI, and about a 1-point improvement on individual metrics on UR-FUNNY.
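
To illustrate the text-centered dynamic fusion described above, the following is a minimal PyTorch-style sketch. The module name (`TextCentricRouter`), the gating scheme, and the feature dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class TextCentricRouter(nn.Module):
    """Sketch: dynamically weights visual and audio features around a text anchor."""

    def __init__(self, dim: int = 128):
        super().__init__()
        # Per-sample gates decide how much of each auxiliary modality to fuse with text.
        self.visual_gate = nn.Sequential(nn.Linear(dim * 2, 1), nn.Sigmoid())
        self.audio_gate = nn.Sequential(nn.Linear(dim * 2, 1), nn.Sigmoid())
        self.fuse = nn.Linear(dim * 3, dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # text, visual, audio: (batch, dim) utterance-level features
        g_v = self.visual_gate(torch.cat([text, visual], dim=-1))  # (batch, 1)
        g_a = self.audio_gate(torch.cat([text, audio], dim=-1))    # (batch, 1)
        fused = torch.cat([text, g_v * visual, g_a * audio], dim=-1)
        return self.fuse(fused)  # joint representation passed to the sentiment classifier


if __name__ == "__main__":
    router = TextCentricRouter(dim=128)
    t, v, a = (torch.randn(4, 128) for _ in range(3))
    print(router(t, v, a).shape)  # torch.Size([4, 128])
```

The key design point is that the text features act as the routing anchor: the visual and audio streams are admitted into the fused representation only to the degree the gates judge them useful for the current sample.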
