Abstract

Deciphering sentiments or emotions in face-to-face human interaction is an inherent capability of human intelligence, and thus a natural goal of artificial intelligence. The proliferation of multimedia data on video sites gives rise to multimodal sentiment analysis across applications and research fields such as movie and product reviews, opinion polling, and affective computing. To improve performance on the multimodal sentiment analysis task, this paper proposes a novel neural network with a multiple stacked attention mechanism (MSAM) that operates on multimodal data containing text, video, and audio at the utterance level. We conduct experiments on two benchmark datasets, the CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) corpus and the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) corpus. Compared with a comprehensive set of state-of-the-art baselines, the evaluation results demonstrate the effectiveness of the proposed MSAM network.
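To make the idea of stacking attention over utterance-level multimodal features concrete, below is a minimal PyTorch sketch. It assumes pre-extracted text, audio, and video feature vectors per utterance; the dimensions, layer counts, pooling, and head counts are illustrative assumptions and do not reproduce the authors' exact MSAM architecture.

```python
# Sketch: stacked attention fusion over utterance-level text/audio/video
# features. All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn


class StackedAttentionFusion(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, video_dim=35,
                 hidden_dim=128, num_layers=2, num_classes=2):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, hidden_dim),
            "audio": nn.Linear(audio_dim, hidden_dim),
            "video": nn.Linear(video_dim, hidden_dim),
        })
        # A stack of attention layers applied over the three modality tokens.
        self.attn_layers = nn.ModuleList([
            nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norms = nn.ModuleList(
            [nn.LayerNorm(hidden_dim) for _ in range(num_layers)])
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text, audio, video):
        # Each input: (batch, modality_dim) utterance-level features.
        tokens = torch.stack(
            [self.proj["text"](text),
             self.proj["audio"](audio),
             self.proj["video"](video)], dim=1)   # (batch, 3, hidden)
        for attn, norm in zip(self.attn_layers, self.norms):
            attended, _ = attn(tokens, tokens, tokens)
            tokens = norm(tokens + attended)      # residual + layer norm
        fused = tokens.mean(dim=1)                # pool the modality tokens
        return self.classifier(fused)


if __name__ == "__main__":
    model = StackedAttentionFusion()
    text = torch.randn(4, 300)   # e.g. averaged word embeddings per utterance
    audio = torch.randn(4, 74)   # e.g. acoustic features per utterance
    video = torch.randn(4, 35)   # e.g. facial-expression features per utterance
    print(model(text, audio, video).shape)        # torch.Size([4, 2])
```

Stacking several attention layers lets later layers re-weight cross-modal interactions discovered by earlier layers, which is the general intuition behind a stacked attention mechanism for fusion.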
