Abstract
Open-domain conversational agents, also known as chatbots, are gaining popularity in the natural language processing community. One of the difficulties is getting them to communicate in an emotionally intelligent manner. Current neural response generation methods depend solely on end-to-end learning from large-scale conversation data to generate dialogues. To address this, we introduce a large-scale multi Emotion and Intent guided Multimodal Dialogue (EmoInt-MD) dataset comprising 32k dialogues taken from different movie genres and labelled with 32 emotions and 15 empathetic intents. We propose a novel multi-task multimodal contextual Transformer framework for simultaneously identifying the emotions and intents in a given utterance, utilizing audio and visual features in addition to the textual information. Experimental analysis shows that the proposed framework outperforms several unimodal and multimodal baselines on the EmoInt-MD dataset. This dataset, along with our baseline and proposed framework implementations, will be made publicly available for research purposes.
More From: IEEE/ACM Transactions on Audio, Speech, and Language Processing