Abstract

Conversation is an important form of human communication and is rich in emotion. Discovering emotions and their causes in conversations is therefore an interesting problem. Conversation in its natural form is multimodal. While many studies have addressed multimodal emotion recognition in conversations, there is still a lack of work on multimodal emotion cause analysis. In this work, we introduce a new task named Multimodal Emotion-Cause Pair Extraction in Conversations, which aims to jointly extract emotions and their corresponding causes from conversations reflected in multiple modalities (i.e., text, audio, and video). We accordingly construct a multimodal conversational emotion-cause dataset, Emotion-Cause-in-Friends, which contains 9,794 multimodal emotion-cause pairs among 13,619 utterances in the Friends sitcom. We benchmark the task by establishing two baseline systems: a heuristic approach that exploits inherent patterns in the relative positions of emotions and their causes, and a deep learning approach that incorporates multimodal features for emotion-cause pair extraction; we also conduct a human performance test for comparison. Furthermore, we investigate the effect of multimodal information, explore the potential of incorporating commonsense knowledge, and perform the task under both Static and Real-time settings.
