Abstract

Multi-modal sentiment and emotion analysis has emerged as a prominent field at the intersection of natural language processing, deep learning, machine learning, computer vision, and speech processing. A sentiment and emotion prediction model identifies the attitude of a speaker or writer towards a discussion, debate, event, document, or topic. This attitude can be expressed in different ways, such as the words spoken, the energy and tone of delivery, and the accompanying facial expressions and gestures. Moreover, related and similar tasks generally depend on each other and are predicted more accurately when solved within a joint framework. In this paper, we present a multi-task gated contextual cross-modal attention framework that considers all three modalities (viz. text, acoustic, and visual) and multiple utterances jointly for sentiment and emotion prediction. We evaluate our proposed approach on the CMU-MOSEI dataset for sentiment and emotion prediction. Evaluation results show that our proposed approach captures the correlation among the three modalities and improves over the previous state-of-the-art models.
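
To make the core idea concrete, below is a minimal, illustrative sketch of gated cross-modal attention feeding two task heads (sentiment and emotion), written in PyTorch. It is not the authors' exact architecture: the module names, feature dimension, pairwise text-to-acoustic / text-to-visual attention pattern, and head shapes are all assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedCrossModalAttention(nn.Module):
    """Pairwise cross-modal attention with a sigmoid gate (illustrative).

    One modality (e.g. text) attends over another (e.g. acoustic) across the
    utterances of a video; a learned gate controls how much of the attended
    signal is mixed back into the source representation.
    """

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, src, ctx):
        # src, ctx: (batch, num_utterances, dim)
        scores = torch.matmul(src, ctx.transpose(1, 2))        # (B, U, U)
        attn = F.softmax(scores, dim=-1)
        attended = torch.matmul(attn, ctx)                     # (B, U, dim)
        g = torch.sigmoid(self.gate(torch.cat([src, attended], dim=-1)))
        return g * attended + (1.0 - g) * src                  # gated fusion


class MultiTaskFusionModel(nn.Module):
    """Fuses text/acoustic/visual utterance features and emits two heads:
    a sentiment score and multi-label emotion logits (multi-task)."""

    def __init__(self, dim=128, num_emotions=6):
        super().__init__()
        self.text_to_acoustic = GatedCrossModalAttention(dim)
        self.text_to_visual = GatedCrossModalAttention(dim)
        self.sentiment_head = nn.Linear(3 * dim, 1)
        self.emotion_head = nn.Linear(3 * dim, num_emotions)

    def forward(self, text, acoustic, visual):
        fused = torch.cat([text,
                           self.text_to_acoustic(text, acoustic),
                           self.text_to_visual(text, visual)], dim=-1)
        return self.sentiment_head(fused), self.emotion_head(fused)


if __name__ == "__main__":
    model = MultiTaskFusionModel(dim=128)
    B, U, D = 2, 10, 128                    # 2 videos, 10 utterances each
    text, acoustic, visual = (torch.randn(B, U, D) for _ in range(3))
    sentiment, emotions = model(text, acoustic, visual)
    print(sentiment.shape, emotions.shape)  # (2, 10, 1) (2, 10, 6)
```

Sharing the fused representation between the two heads is what makes the setup multi-task: both sentiment and emotion losses back-propagate through the same gated cross-modal attention layers.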
