Abstract

Multi-modal analysis has recently emerged as a highly sought-after field at the intersection of natural language processing, computer vision, and speech processing. The prime objective of such studies is to leverage diversified information (e.g., textual, acoustic, and visual) for learning a model. Effective interaction among these modalities often leads to better system performance. In this paper, we introduce a recurrent neural network based approach for multi-modal sentiment and emotion analysis. The proposed model learns the inter-modal interaction among the participating modalities through an auto-encoder mechanism, and employs a context-aware attention module to exploit the correspondence among neighboring utterances. We evaluate the proposed approach on five standard multi-modal affect analysis datasets. Experimental results demonstrate the efficacy of the proposed model for both sentiment and emotion analysis over various existing state-of-the-art systems.
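To make the auto-encoder based interaction concrete, the following is a minimal PyTorch sketch, under our own assumptions, of how one modality's utterance representation can be encoded into a shared bottleneck and decoded toward another modality, so that the bottleneck captures cross-modal correspondence. All layer sizes, dimensions, and names are illustrative and not taken from the paper.

import torch
import torch.nn as nn

class InterModalAutoEncoder(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, hidden: int = 128):
        super().__init__()
        # Encode modality A (e.g., text) into a shared bottleneck ...
        self.encoder = nn.Sequential(nn.Linear(dim_a, hidden), nn.Tanh())
        # ... and decode toward modality B (e.g., acoustic).
        self.decoder = nn.Linear(hidden, dim_b)

    def forward(self, feats_a: torch.Tensor):
        shared = self.encoder(feats_a)         # cross-modal representation
        reconstruction = self.decoder(shared)  # predicted modality-B features
        return shared, reconstruction

# Training would minimize the reconstruction error against the true
# modality-B features; `shared` then serves as the interaction feature.
model = InterModalAutoEncoder(dim_a=300, dim_b=74)
text_batch = torch.randn(32, 300)              # 32 utterance-level text vectors
shared, recon = model(text_batch)
loss = nn.functional.mse_loss(recon, torch.randn(32, 74))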

Highlights

  • The main contributions of our current research are as follows: (1) we propose an Inter-modal Interactive Module (IIM) that aims to learn the interaction among the diverse and distinct features of the input modalities, i.e., text, acoustic, and visual; (2) we employ a Context-aware Attention Module (CAM) that identifies and assigns weights to the neighboring utterances based on their contributing features (a hedged sketch of such an attention step follows this list).

  • We propose a Context-aware Interactive Attention framework that aims to capture the interaction between the input modalities for multi-modal sentiment and emotion prediction.
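As noted in the contributions above, the following is a minimal sketch of the kind of context-aware attention the CAM performs: a target utterance is scored against the neighboring utterances of the same video, and a softmax-weighted summary of the neighbors is appended to the target representation. The dot-product scoring, dimensions, and all names below are illustrative assumptions rather than the authors' exact formulation.

import torch
import torch.nn.functional as F

def context_attention(target: torch.Tensor, context: torch.Tensor):
    # target: (d,) vector of the utterance being classified.
    # context: (n, d) matrix of neighboring utterance vectors.
    scores = context @ target              # relevance of each neighbor
    weights = F.softmax(scores, dim=0)     # attention weights over neighbors
    pooled = weights @ context             # weighted context summary
    return torch.cat([target, pooled]), weights

# Example: attend over the 10 utterances of one video.
utterances = torch.randn(10, 64)
fused, attn = context_attention(utterances[4], utterances)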

Introduction

The world has witnessed tremendous growth of various social media platforms, e.g., YouTube, Instagram, Twitter, Facebook, etc. People treat these platforms as a communication medium and freely express themselves with the help of a diverse set of input sources, e.g., videos, images, audio, text, etc. In some cases, text can provide a better clue for the prediction, whereas in others, acoustic or visual sources can be more informative; all of these modalities have important roles to play in determining the correctness of the system. Effectively combining this information is a nontrivial task that researchers often face (Poria et al., 2016; Ranganathan et al., 2016; Lee et al., 2018).
