Emotion recognition in conversations (ERC) has gained increasing research attention in recent years due to its wide applications in emerging tasks such as social media analysis, dialog generation, and recommender systems. Since the constituent utterances in a conversation are closely related semantically, their emotional states are closely related as well. We argue that this correlation can serve as a guide for the emotion recognition of the constituent utterances. Accordingly, we propose a novel approach named Semantic-correlation Graph Convolutional Network (SC-GCN) that exploits this correlation for the ERC task in multimodal scenarios. Specifically, we first introduce a hierarchical fusion module to model the dynamics among the textual, acoustic, and visual features and fuse the multimodal information. We then construct a graph structure based on the speaker and temporal dependencies of the dialog. We further put forward a novel multi-loop architecture that explores semantic correlations via the self-attention mechanism and enhances the correlation information over multiple loops. Through the graph convolution process, SC-GCN finally obtains a refined representation of each utterance, which is used for the final prediction. Extensive experiments on two benchmark datasets demonstrate the superiority of our SC-GCN.
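
The abstract gives no implementation details, so the following is a minimal, hypothetical PyTorch sketch of the pipeline it outlines: hierarchical multimodal fusion, a dialog graph over utterances, and multi-loop self-attention feeding graph convolution. All names (SCGCNSketch, num_loops, fuse, etc.), the single-relation adjacency, and the fusion-by-projection shortcut are our own assumptions, not the authors' code.

```python
# Hypothetical sketch of the SC-GCN pipeline described in the abstract.
# Simplifications assumed here: fusion is a single projection of the
# concatenated modalities, and the dialog graph is one binary adjacency
# matrix rather than typed speaker/temporal edges.
import torch
import torch.nn as nn

class SCGCNSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, num_loops=3, n_classes=6):
        super().__init__()
        # stand-in for the hierarchical fusion module
        self.fuse = nn.Linear(3 * d_model, d_model)
        # self-attention used to explore semantic correlations among utterances
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # one graph-convolution weight per loop
        self.gcn = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_loops))
        self.classify = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, visual, adj):
        # text/audio/visual: (batch, seq, d_model) utterance features
        # adj: (batch, seq, seq) adjacency encoding speaker/temporal links
        h = self.fuse(torch.cat([text, audio, visual], dim=-1))
        for layer in self.gcn:
            # one "loop": attend over the dialog to capture semantic
            # correlations, then aggregate neighbors via the graph
            c, _ = self.attn(h, h, h)
            h = torch.relu(layer(torch.bmm(adj, h + c)))
        return self.classify(h)  # per-utterance emotion logits
```

Under these assumptions, stacking the loops lets correlation information from attention be repeatedly mixed with the speaker/temporal graph structure before the refined utterance representations reach the classifier; a row-normalized adjacency would keep the aggregation scale stable across utterances with different numbers of neighbors.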