Abstract

Multimodal data, when available, is key to enhanced emotion recognition in conversation: the text, audio, and video of a dialogue can reinforce and complement one another in analyzing speakers’ emotions. However, effectively fusing multimodal features to capture the detailed contextual information in conversations remains very challenging. In this work, we focus on the dynamic interactions that occur during information fusion and propose a Dynamic Interactive Multiview Memory Network (DIMMN) to integrate interaction information for emotion recognition. Specifically, DIMMN fuses information from multiple views, each corresponding to a different combination of modalities. We design multiview attention layers that enable the model to mine cross-modal dynamic dependencies between modality groups during dynamic modal interaction. To capture long-term dependencies, temporal convolutional networks are introduced to synthesize the contextual information of each individual speaker. Gated recurrent units and memory networks then model the global conversation, detecting contextual dependencies in multi-turn, multi-speaker emotional interactions. Experimental results on IEMOCAP and MELD demonstrate that DIMMN achieves performance better than or comparable to state-of-the-art methods, with accuracies of 64.7% and 60.6%, respectively.
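
For concreteness, the following is a minimal PyTorch sketch of the pipeline described above: multiview attention over modality combinations, a temporal convolution for per-speaker context, and a GRU over the global conversation. The class names, the way views are constructed, the dimensions, and all hyperparameters are illustrative assumptions, and the explicit memory-network component is omitted; this is not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class MultiViewAttention(nn.Module):
    """Attention over modality 'views' (assumed here to be simple
    modality combinations); the paper's exact view design differs."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, views):            # views: (B, n_views, dim)
        fused, _ = self.attn(views, views, views)
        return fused.mean(dim=1)         # (B, dim)


class DIMMNSketch(nn.Module):
    """Sketch of the abstract's pipeline: multiview fusion ->
    speaker-level TCN -> GRU over the conversation (memory network omitted)."""
    def __init__(self, dim=128, n_classes=6):
        super().__init__()
        self.view_fusion = MultiViewAttention(dim)
        # Dilated temporal convolution for longer-range per-speaker context
        self.tcn = nn.Conv1d(dim, dim, kernel_size=3, padding=2, dilation=2)
        # GRU models the global, multi-speaker conversation
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, text, audio, video):    # each: (B, T, dim)
        B, T, D = text.shape
        # Assumed views: the three unimodal features plus their average
        views = torch.stack(
            [text, audio, video, (text + audio + video) / 3], dim=2
        )                                      # (B, T, 4, D)
        fused = self.view_fusion(views.reshape(B * T, 4, D)).reshape(B, T, D)
        ctx = self.tcn(fused.transpose(1, 2))[..., :T].transpose(1, 2)
        out, _ = self.gru(ctx)                 # global conversational context
        return self.classifier(out)            # per-utterance emotion logits


# Usage on random features: 2 dialogues, 10 utterances, 128-dim per modality
model = DIMMNSketch()
logits = model(torch.randn(2, 10, 128),
               torch.randn(2, 10, 128),
               torch.randn(2, 10, 128))
print(logits.shape)  # torch.Size([2, 10, 6])
```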
