Abstract

An essential skill for effective communication is the ability to express specific sentiment and emotion in a conversation. A robust dialogue system should therefore handle the combined effect of sentiment and emotion while generating responses, which is expected to improve the user experience and, in turn, user satisfaction. Prior research on either emotion- or sentiment-controlled dialogue generation has shown great promise for developing next-generation conversational agents, but the simultaneous effect of both remains unexplored. Existing dialogue systems are largely built on unimodal sources, predominantly text, and therefore cannot exploit the information present in other modalities such as video, audio, and images. In this article, we first present a large-scale benchmark, the Sentiment Emotion aware Multimodal Dialogue (SEMD) dataset, for the task of sentiment- and emotion-controlled dialogue generation. The SEMD dataset consists of 55k conversations from 10 TV shows with text, audio, and video information. To exploit this multimodal information, we propose a multimodal attention-based conditional variational autoencoder (M-CVAE) that outperforms several baselines. Quantitative and qualitative analyses show that multimodality, together with contextual information, plays an essential role in generating coherent and diverse responses for any given emotion and sentiment.
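The abstract describes the M-CVAE only at a high level. The sketch below illustrates one plausible reading of such a model: a conditional VAE whose condition vector fuses text, audio, and video features through attention and adds sentiment and emotion label embeddings. It is not the authors' implementation; all dimensions, module choices, and the fusion scheme are illustrative assumptions.

```python
# Minimal sketch of a multimodal attention-based conditional VAE (M-CVAE-style).
# Feature sizes, the fusion scheme, and the decoder are assumptions, not the paper's spec.
import torch
import torch.nn as nn

class MultimodalCVAE(nn.Module):
    def __init__(self, d_model=256, latent_dim=64, n_emotions=7, n_sentiments=3, vocab_size=10000):
        super().__init__()
        # Project each modality's utterance-level features into a shared space.
        self.text_proj = nn.Linear(300, d_model)    # e.g. averaged word embeddings (assumed)
        self.audio_proj = nn.Linear(128, d_model)   # e.g. acoustic features (assumed)
        self.video_proj = nn.Linear(512, d_model)   # e.g. CNN visual features (assumed)
        # Attention that lets the text query attend over the audio/video context.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Embeddings for the desired sentiment and emotion of the response.
        self.emo_emb = nn.Embedding(n_emotions, d_model)
        self.sent_emb = nn.Embedding(n_sentiments, d_model)
        # Recognition and prior networks of the CVAE.
        self.post_net = nn.Linear(2 * d_model, 2 * latent_dim)   # q(z | response, condition)
        self.prior_net = nn.Linear(d_model, 2 * latent_dim)      # p(z | condition)
        # A GRU decoder generates the response conditioned on [z; condition].
        self.decoder_init = nn.Linear(latent_dim + d_model, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def fuse_condition(self, text, audio, video, emotion, sentiment):
        t = self.text_proj(text).unsqueeze(1)                                       # (B, 1, d)
        av = torch.stack([self.audio_proj(audio), self.video_proj(video)], dim=1)   # (B, 2, d)
        fused, _ = self.cross_attn(t, av, av)              # text attends to audio/video
        return fused.squeeze(1) + self.emo_emb(emotion) + self.sent_emb(sentiment)

    def forward(self, text, audio, video, emotion, sentiment, response_feat, response_tokens):
        c = self.fuse_condition(text, audio, video, emotion, sentiment)
        # Posterior sees the gold response representation; prior sees the condition alone.
        mu_q, logvar_q = self.post_net(torch.cat([response_feat, c], dim=-1)).chunk(2, dim=-1)
        mu_p, logvar_p = self.prior_net(c).chunk(2, dim=-1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()   # reparameterisation trick
        h0 = torch.tanh(self.decoder_init(torch.cat([z, c], dim=-1))).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(response_tokens), h0)
        logits = self.out(dec_out)
        # KL(q || p) between diagonal Gaussians, averaged over the batch.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1).mean()
        return logits, kl
```

Training would minimise the reconstruction cross-entropy on `logits` plus the KL term; at inference, sampling z from the prior given a chosen emotion and sentiment label yields controlled, diverse responses.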
