Abstract

Multi-modal sentiment analysis poses several challenges, one being the effective combination of the different input modalities, namely text, visual and acoustic. In this paper, we propose a recurrent neural network based multi-modal attention framework that leverages contextual information for utterance-level sentiment prediction. The proposed approach applies attention over multi-modal multi-utterance representations and tries to learn the contributing features among them. We evaluate our proposed approach on two multi-modal sentiment analysis benchmark datasets, viz. the CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) corpus and the recently released CMU Multi-modal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) corpus. Evaluation results show the effectiveness of our proposed approach, with accuracies of 82.31% and 79.80% on the MOSI and MOSEI datasets, respectively. These are improvements of approximately 2 and 1 points over the state-of-the-art models on the respective datasets.
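
To make the idea concrete, below is a minimal PyTorch sketch of this kind of architecture: a bidirectional GRU per modality encodes the sequence of utterances in a video (so each utterance sees its neighbouring context), a simple self-attention layer re-weights the utterance representations, and the attended modalities are concatenated for per-utterance classification. The module name, layer sizes, feature dimensions (300-d text, 35-d visual, 74-d acoustic) and the dot-product form of self-attention are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of an RNN-based multi-modal multi-utterance attention
# classifier, loosely following the description in the abstract. Sizes,
# fusion order and attention form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalAttentionSentiment(nn.Module):
    def __init__(self, text_dim=300, visual_dim=35, acoustic_dim=74,
                 hidden=100, n_classes=2):
        super().__init__()
        # One bidirectional GRU per modality encodes the utterance
        # sequence of a video, providing contextual information.
        self.text_rnn = nn.GRU(text_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True,
                                 bidirectional=True)
        self.acoustic_rnn = nn.GRU(acoustic_dim, hidden, batch_first=True,
                                   bidirectional=True)
        self.classifier = nn.Linear(2 * hidden * 3, n_classes)

    @staticmethod
    def attend(x):
        # Self-attention over utterances: score every pair of utterance
        # representations, then re-weight them by the softmaxed scores.
        scores = torch.matmul(x, x.transpose(1, 2))   # (B, U, U)
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, x)               # (B, U, D)

    def forward(self, text, visual, acoustic):
        # Each input: (batch, n_utterances, modality_dim).
        t, _ = self.text_rnn(text)
        v, _ = self.visual_rnn(visual)
        a, _ = self.acoustic_rnn(acoustic)
        # Attention per modality, then fusion by concatenation.
        fused = torch.cat([self.attend(t), self.attend(v),
                           self.attend(a)], dim=-1)
        return self.classifier(fused)                 # per-utterance logits

# Toy usage: a batch of 4 videos, each with 20 utterances.
model = MultiModalAttentionSentiment()
logits = model(torch.randn(4, 20, 300), torch.randn(4, 20, 35),
               torch.randn(4, 20, 74))
print(logits.shape)  # torch.Size([4, 20, 2])
```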

Highlights

  • We propose a novel method that employs a recurrent neural network based multi-modal multi-utterance attention framework for sentiment prediction. We hypothesize that applying attention to contributing neighbouring utterances and/or multi-modal representations may assist the network to learn more effectively

  • The main contributions of our proposed work are three-fold: a) we propose a novel technique for multi-modal sentiment analysis; b) we propose an effective attention framework that leverages contributing features across multiple modalities and neighbouring utterances for sentiment analysis; and c) we report state-of-the-art results for sentiment analysis on two benchmark datasets


Introduction

Sentiment analysis (Pang and Lee, 2005, 2008) has been applied to a wide variety of texts (Hu and Liu, 2004; Liu, 2012; Turney, 2002; Akhtar et al., 2016, 2017; Mohammad et al., 2013). Multi-modal sentiment analysis, in contrast, depends on the information that can be obtained from more than one modality (e.g. text, visual and acoustic) for the analysis. The motivation is to leverage the varieties of (often distinct) information from multiple sources for building an efficient system. For example, it is a non-trivial task to detect the sentiment of a sarcastic sentence such as "My neighbours are home!!" from its text alone; the accompanying acoustic and visual cues have important roles to play in the correctness of the system. Combining this information in an effective manner is a non-trivial task that researchers often have to face (Zadeh et al., 2017; Chen et al., 2017).
