With the rapid development of social media and human–computer interaction, multimodal emotion recognition in conversations (MERC) has attracted widespread research attention. The MERC task is to extract and fuse complementary semantic information from different modalities in order to classify the speaker's emotion. However, existing feature fusion methods usually map the features of the other modalities directly into the same feature space, which cannot eliminate the heterogeneity between modalities and makes the subsequent learning of emotion class boundaries more difficult. In addition, existing graph contrastive learning methods obtain consistent feature representations by maximizing the mutual information between multiple views, which may lead to overfitting. To tackle these problems, we propose a novel Adversarial Alignment and Graph Fusion via Information Bottleneck for Multimodal Emotion Recognition in Conversations (AGF-IB) method. Firstly, we input the video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for each of the three modalities and use adversarial representation learning to achieve information interaction between modalities and eliminate inter-modal heterogeneity. Thirdly, we introduce graph contrastive representation learning to capture intra-modal and inter-modal complementary semantic information and to learn intra-class and inter-class boundary information of the emotion categories. Furthermore, instead of maximizing the mutual information (MI) between multiple views, we use information bottleneck theory to minimize the MI between views. Specifically, we construct a graph structure for each of the three modalities and perform contrastive representation learning on nodes with different emotions in the same modality and on nodes with the same emotion in different modalities, which improves the representation ability of the node features. Finally, we use an MLP to classify the speaker's emotion. Extensive experiments show that AGF-IB improves emotion recognition accuracy on the IEMOCAP and MELD datasets. Furthermore, since AGF-IB is a general multimodal fusion and contrastive learning method, it can be applied to other multimodal tasks, e.g., humor detection, in a plug-and-play manner.
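To make the pipeline described above more concrete, the following is a minimal PyTorch sketch, not the authors' implementation: modality-specific MLP encoders map text, audio, and video features into separate spaces, a modality discriminator drives adversarial alignment of the three spaces, and an MLP classifier predicts the emotion. All dimensions, loss weights, and module names are hypothetical placeholders; the graph construction, graph contrastive learning, and information-bottleneck terms of AGF-IB are omitted here and described in the full paper.

```python
# Hedged sketch of the high-level AGF-IB pipeline (assumed components, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """MLP that maps raw utterance features of one modality into a common-size space."""
    def __init__(self, in_dim, hid_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, hid_dim))
    def forward(self, x):
        return self.net(x)

class ModalityDiscriminator(nn.Module):
    """Predicts which modality (text / audio / video) an encoded feature came from."""
    def __init__(self, hid_dim=256, n_modalities=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, n_modalities))
    def forward(self, z):
        return self.net(z)

class EmotionClassifier(nn.Module):
    """MLP head over the concatenated, aligned modality features."""
    def __init__(self, hid_dim=256, n_classes=6):
        super().__init__()
        self.net = nn.Linear(3 * hid_dim, n_classes)
    def forward(self, z_t, z_a, z_v):
        return self.net(torch.cat([z_t, z_a, z_v], dim=-1))

# Hypothetical input sizes for text / audio / video utterance features.
enc_t, enc_a, enc_v = ModalityEncoder(768), ModalityEncoder(128), ModalityEncoder(512)
disc, clf = ModalityDiscriminator(), EmotionClassifier()

def adversarial_alignment_loss(z_t, z_a, z_v):
    """Generator-side objective: push the discriminator toward a uniform modality
    posterior, so the three feature spaces become indistinguishable. The alternating
    discriminator update (with true modality labels) is omitted for brevity."""
    z = torch.cat([z_t, z_a, z_v], dim=0)
    log_probs = F.log_softmax(disc(z), dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
    return F.kl_div(log_probs, uniform, reduction="batchmean")

# One illustrative training step on a toy batch; the graph contrastive and
# information-bottleneck losses of AGF-IB would be added to `loss` here.
x_t, x_a, x_v = torch.randn(8, 768), torch.randn(8, 128), torch.randn(8, 512)
y = torch.randint(0, 6, (8,))
z_t, z_a, z_v = enc_t(x_t), enc_a(x_a), enc_v(x_v)
loss = F.cross_entropy(clf(z_t, z_a, z_v), y) + 0.1 * adversarial_alignment_loss(z_t, z_a, z_v)
loss.backward()
```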