Emotion Recognition In Conversations Research Articles

With the rapid development of social media and human–computer interaction, multimodal emotion recognition in conversations (MERC) has begun to receive widespread research attention. The MERC task is to extract and fuse complementary semantic information from different modalities in order to classify the speaker’s emotion. However, existing feature fusion methods usually map the features of the different modalities directly into the same feature space for fusion, which cannot eliminate the heterogeneity between modalities and makes the subsequent learning of emotion class boundaries more difficult. In addition, existing graph contrastive learning methods obtain consistent feature representations by maximizing the mutual information (MI) between multiple views, which may lead to overfitting. To tackle these problems, we propose a novel Adversarial Alignment and Graph Fusion via Information Bottleneck for Multimodal Emotion Recognition in Conversations (AGF-IB) method. Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for each of the three modal features and, through adversarial representation, achieve information interaction between modalities and eliminate the heterogeneity among them. Thirdly, we introduce graph contrastive representation learning to capture intra-modal and inter-modal complementary semantic information and to learn intra-class and inter-class boundary information of emotion categories. Furthermore, instead of maximizing the MI between multiple views, we use information bottleneck theory to minimize the MI between views. Specifically, we construct a graph structure for each of the three modal features and perform contrastive representation learning on nodes with different emotions in the same modality and nodes with the same emotion in different modalities, to improve the feature representation ability of the nodes. Finally, we use an MLP to complete the emotion classification of the speaker. Extensive experiments show that AGF-IB can improve emotion recognition accuracy on the IEMOCAP and MELD datasets. Furthermore, since AGF-IB is a general multimodal fusion and contrastive learning method, it can be applied to other multimodal tasks in a plug-and-play manner, e.g., humor detection.
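The node-level contrastive step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the loss function, so the InfoNCE-style form below, the cosine similarity, the `temperature` value, and the `(modality, emotion, embedding)` node layout are all assumptions made for illustration, not the AGF-IB implementation. The sketch only shows the pairing rule: nodes with the same emotion in different modalities act as positives, and nodes with different emotions in the same modality act as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_modal_contrastive_loss(nodes, temperature=0.5):
    """InfoNCE-style loss over graph nodes (illustrative assumption).

    nodes: list of (modality, emotion, embedding) tuples (hypothetical layout).
    Positives of an anchor: same emotion, different modality.
    Negatives of an anchor: different emotion, same modality.
    Anchors that lack positives or negatives are skipped.
    """
    losses = []
    for i, (mod_i, emo_i, emb_i) in enumerate(nodes):
        pos = [math.exp(cosine(emb_i, e) / temperature)
               for j, (m, y, e) in enumerate(nodes)
               if j != i and m != mod_i and y == emo_i]
        neg = [math.exp(cosine(emb_i, e) / temperature)
               for j, (m, y, e) in enumerate(nodes)
               if j != i and m == mod_i and y != emo_i]
        if not pos or not neg:
            continue
        denom = sum(pos) + sum(neg)
        # Average the negative log-probability over all positives of this anchor.
        losses.append(-sum(math.log(p / denom) for p in pos) / len(pos))
    return sum(losses) / len(losses) if losses else 0.0
```

Under these assumptions, the loss is lower when embeddings of the same emotion agree across modalities than when they do not, which is the behavior the abstract attributes to its contrastive objective. The information bottleneck term that minimizes MI between views is not modeled here.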

Emotion Recognition in Conversations (ERC) aims to accurately identify the emotional label of each utterance in a conversation and holds significant application value in human–computer interaction. Existing research suggests that introducing commonsense knowledge (CSK) and multimodal information enhances model performance in ERC tasks. However, several challenges persist: (1) the neglect of complex psychological influences between utterances; (2) noise within modal information; (3) the difficulty of predicting emotion labels that have few samples, particularly when utterances are semantically similar but belong to distinct emotional categories. To address these problems, we propose a Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning (MKIN-MCL). Firstly, we establish a knowledge aggregation graph to capture the dependencies of CSK between utterances during a conversation, and we actively aggregate relevant knowledge to enhance text features. Simultaneously, we apply feature filters to the acoustic and visual modalities to eliminate noise and enhance feature quality. Furthermore, we implement an interactive attention module by stacking purpose-designed Cross-modal Interactive Transformers (CITs) to continuously explore the relevance between the interacting parties in their respective semantic spaces, thus improving the effectiveness of modality interaction while reducing the noise generated during the interaction. Lastly, we employ the Mixed Contrastive Learning (MCL) strategy to enhance the model’s ability to handle few-shot labels. This strategy uses unsupervised contrastive learning to improve the representation capability of the multimodal fusion features and supervised contrastive learning to extract information from few-shot labels. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, validate the effectiveness and superiority of the proposed model.
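The idea of combining unsupervised and supervised contrastive learning can be sketched as follows. The abstract does not specify how the two terms are defined or mixed, so everything here is an assumption for illustration: the two-view setup (`view_a`, `view_b`), the softmax-style loss, the `weight` mixing coefficient, and the `temperature` are hypothetical and are not taken from MKIN-MCL.

```python
import math

def _sim(u, v, temperature):
    """Temperature-scaled cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm * temperature)

def contrastive_loss(embeddings, is_positive, temperature=0.5):
    """Pull each anchor toward its positives, contrasted against all other samples."""
    n = len(embeddings)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if is_positive(i, j)]
        if not positives:
            continue  # anchors without positives contribute nothing
        log_denom = math.log(sum(
            math.exp(_sim(embeddings[i], embeddings[j], temperature))
            for j in others))
        losses.append(-sum(
            _sim(embeddings[i], embeddings[j], temperature) - log_denom
            for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

def mixed_contrastive_loss(view_a, view_b, labels, weight=0.5, temperature=0.5):
    """Weighted sum of an unsupervised and a supervised contrastive term."""
    n = len(view_a)
    both = list(view_a) + list(view_b)
    both_labels = list(labels) + list(labels)
    # Unsupervised term: the only positive of a sample is its other view.
    unsup = contrastive_loss(both, lambda i, j: i % n == j % n, temperature)
    # Supervised term: every other sample with the same label is a positive.
    sup = contrastive_loss(
        both, lambda i, j: both_labels[i] == both_labels[j], temperature)
    return weight * unsup + (1.0 - weight) * sup
```

In this sketch the supervised term lets a rare label draw positives from every same-label sample in the batch rather than only from its own second view, which is one plausible reading of how supervised contrastive learning helps with few-shot labels.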

Articles published on Emotion Recognition In Conversations

Deep emotion recognition in textual conversations: a survey

MLGAT: multi-layer graph attention networks for multimodal emotion recognition in conversations

HiMul-LGG: A hierarchical decision fusion-based local–global graph neural network for multimodal emotion recognition in conversation

Correlation mining of multimodal features based on higher-order partial least squares for emotion recognition in conversations

Multimodal graph learning with framelet-based stochastic configuration networks for emotion recognition in conversation

Adversarial alignment and graph fusion via information bottleneck for multimodal emotion recognition in conversations

Multi-modal graph context extraction and consensus-aware learning for emotion recognition in conversation

Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation

Disentangled Variational Autoencoder for Emotion Recognition in Conversations

DC-BVM: Dual-channel information fusion network based on voting mechanism

Adaptive Graph Learning for Multimodal Conversational Emotion Detection

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

MRSLN: A Multimodal Residual Speaker-LSTM Network to alleviate the over-smoothing issue for Emotion Recognition in Conversation

Multi-Modal Attentive Prompt Learning for Few-shot Emotion Recognition in Conversations

Hypergraph Neural Network for Emotion Recognition in Conversations

Self-supervised utterance order prediction for emotion recognition in conversations

PIRNet: Personality-Enhanced Iterative Refinement Network for Emotion Recognition in Conversation

ERNetCL: A novel emotion recognition network in textual conversation based on curriculum learning strategy

Modeling Hierarchical Uncertainty for Multimodal Emotion Recognition in Conversation

GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition
