Abstract

Open-domain conversational agents, also known as chatbots, are gaining popularity in the natural language processing community. One of the difficulties is getting them to communicate in an emotionally intelligent manner. To generate dialogues, current neural response generation methods depend solely on end-to-end learning from large-scale conversation data. To address this, we introduce a large-scale multi Emotion and Intent guided Multimodal Dialogue (EmoInt-MD) dataset containing 32k dialogues taken from different movie genres and labelled with 32 emotions and 15 empathetic intents. We propose a novel multi-task multimodal contextual Transformer framework that simultaneously identifies the emotions and intents in a given utterance, using audio and visual features in addition to the textual information. Experimental analysis shows that the proposed framework outperforms several unimodal and multimodal baselines on the EmoInt-MD dataset. The dataset, along with our baseline and proposed framework implementations, will be made publicly available for research purposes.
