RMER-DT: Robust multimodal emotion recognition in conversational contexts based on diffusion and transformers
- Conference Article
6
- 10.1109/icsp48669.2020.9321008
- Dec 6, 2020
Continuous emotion recognition is a challenging task and a key component of human-computer interaction; multimodal emotion recognition in particular can effectively improve recognition accuracy and robustness. However, emotion datasets are limited and emotional features are difficult to extract. We present a multi-level segmented, decision-level fusion emotion recognition model to improve recognition performance. In this paper, we predict multimodal dimensional emotional states on the AVEC2017 dataset. Our model uses a Bidirectional Long Short-Term Memory (BLSTM) network as the multi-level segmented emotional feature learning model and an SVR model as the decision-layer fusion model. The BLSTM models different forms of emotional information over time and accounts for the influence of both preceding and subsequent emotional features on the current result, while the SVR model compensates for redundant information in emotion recognition. We also account for annotation delay and temporal pooling in our multimodal dimensional emotion recognition model. The resulting model achieves significant recognition improvements and provides robustness. Finally, we compare against baseline methods that used the same dataset and find that our method achieves the best CCC performance on arousal, at 0.685. Our research shows that the proposed multi-level segmented, decision-level fusion emotion recognition model helps improve performance.
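The abstract reports performance as the concordance correlation coefficient (CCC), the standard metric for dimensional emotion prediction on the AVEC benchmarks. As a minimal sketch (not the authors' code), the CCC between a predicted and a gold arousal trace can be computed as follows; the synthetic traces are purely illustrative:

```python
import numpy as np

def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two 1-D arrays."""
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)

# Toy usage with synthetic arousal traces (illustrative values only)
t = np.linspace(0, 10, 500)
gold = np.sin(t)
pred = 0.9 * np.sin(t) + 0.05 * np.random.randn(500)
print(f"CCC = {concordance_cc(pred, gold):.3f}")
```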
- Research Article
13
- 10.1016/j.neucom.2023.126649
- Aug 4, 2023
- Neurocomputing
A multimodal fusion emotion recognition method based on multitask learning and attention mechanism
- Research Article
- 10.54097/37zncv36
- Dec 11, 2024
- Highlights in Science, Engineering and Technology
Multimodal emotion recognition explores emotion recognition by integrating facial expressions, voice intonation, text analysis, and other multi-source data, so as to improve the naturalness and accuracy of human-computer interaction. Focusing on this emerging field, this paper introduces three single-modal emotion recognition methods (text, face, and voice), discusses the problem of multimodal emotion fusion in particular, and reviews fusion methods with high recognition success rates from recent years. Through comparative analysis, it concludes that current fusion methods have become more complex and that fusion success rates have improved to some extent. However, datasets for multimodal emotion analysis remain few, and research on gesture and other modalities for emotion recognition is also scarce. Future work should enrich the datasets and add new modalities to improve the accuracy and robustness of multimodal emotion recognition and analysis systems.
- Research Article
12
- 10.1016/j.engappai.2024.108348
- Apr 6, 2024
- Engineering Applications of Artificial Intelligence
Token-disentangling Mutual Transformer for multimodal emotion recognition
- Research Article
58
- 10.1016/j.eswa.2023.122946
- Dec 15, 2023
- Expert Systems with Applications
MSER: Multimodal speech emotion recognition using cross-attention with deep fusion
- Research Article
- 10.1371/journal.pone.0333674
- Oct 27, 2025
- PLOS One
Multimodal emotion recognition leverages multiple modalities to capture emotional cues more comprehensively, thereby improving the accuracy and robustness of emotion recognition. From the perspective of multimodal data and feature learning, reducing information redundancy in multimodal data and enhancing the discriminability of deep feature co-learning can effectively boost recognition performance. Based on this, this paper proposes a multimodal emotion recognition method based on an Adaptive High-order Transformer Network (AHOT). The method constructs an Adaptive Selection Transformer (AST) block and a Cross-modal Feature Fusion (CMFF) block for each modality branch, aiming to fully capture non-redundant feature representations from each modality and the interactions between modalities. In addition, a sparse high-order feature learning module is designed to enable the learning of highly discriminative high-order features across modalities. Experimental results on two multimodal emotion recognition datasets (IEMOCAP and CMU-MOSEI) demonstrate that, compared with several related methods, the proposed AHOT effectively improves emotion recognition accuracy. Moreover, ablation studies and parameter analyses further validate the effectiveness of AHOT.
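The CMFF block is not specified in code in the abstract; as a hedged illustration of the general cross-modal attention pattern it builds on (tokens of one modality attending to another), a minimal PyTorch sketch might look like the following. The module name, dimensions, and residual layout are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One modality (query) attends to another (key/value); a generic sketch,
    not the AHOT paper's CMFF block."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)  # residual connection + normalization

# Toy usage: text tokens attending to audio tokens
text = torch.randn(2, 20, 256)   # (batch, text_len, dim)
audio = torch.randn(2, 50, 256)  # (batch, audio_len, dim)
print(CrossModalFusion()(text, audio).shape)  # torch.Size([2, 20, 256])
```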
- Research Article
- 10.61091/jcmcc127a-127
- Apr 15, 2025
- Journal of Combinatorial Mathematics and Combinatorial Computing
In recent years, as psychological pressure on students has increased, psycho-pedagogical methods have received considerable attention. This article takes students' multimodal emotion recognition as its research perspective. It first studies unimodal emotion recognition methods for facial expression, text, and speech, then proposes a multimodal emotion recognition algorithm based on a dual-attention mechanism and a gated memory network, and conducts emotion recognition experiments to validate the method. The article further proposes an intervention pathway to help address students' mental health problems by designing a virtual reality mental health intervention system. In unimodal emotion recognition experiments on a multimodal database, the proposed network achieves better experimental results, verifying the effectiveness of the method, with an emotion recognition accuracy of 60.65%. After testing the mental health of 8,000 students at a school, it was found that the number of above-threshold cases and the screening rate were low, except for a high score on the compulsion dimension, from which it is concluded that the school's students were in good mental health overall after the method was applied.
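The gated memory fusion mentioned above is not detailed in the abstract; as a hedged sketch of the general gating idea (a learned gate weighting two modality representations), the following PyTorch snippet is an illustrative assumption rather than the paper's model:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Blend two modality vectors with a learned sigmoid gate (illustrative only)."""
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a, b):
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b  # convex combination controlled by the gate

# Toy usage: fuse face and speech utterance embeddings
face = torch.randn(4, 128)
speech = torch.randn(4, 128)
print(GatedFusion()(face, speech).shape)  # torch.Size([4, 128])
```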
- Research Article
34
- 10.1609/aaai.v35i16.17686
- May 18, 2021
- Proceedings of the AAAI Conference on Artificial Intelligence
As an important research issue in the affective computing community, multi-modal emotion recognition has become a hot topic in the last few years. However, almost all existing studies perform multiple binary classifications, one per emotion, with a focus on complete time-series data. In this paper, we focus on multi-modal emotion recognition in a multi-label scenario. In this scenario, we consider not only the label-to-label dependency, but also the feature-to-label and modality-to-label dependencies. In particular, we propose a heterogeneous hierarchical message passing network to effectively model the above dependencies. Furthermore, we propose a new multi-modal multi-label emotion dataset based on partial time-series content to demonstrate the generalization ability of our model. Detailed evaluation demonstrates the effectiveness of our approach.
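Multi-label emotion recognition treats each utterance as potentially carrying several emotions at once. As a minimal, generic sketch (not the paper's heterogeneous message passing network), one can train a shared encoder with an independent sigmoid output per emotion label; the label set and dimensions below are assumptions:

```python
import torch
import torch.nn as nn

EMOTIONS = ["happy", "sad", "angry", "surprise", "disgust", "fear"]  # assumed label set

class MultiLabelHead(nn.Module):
    """Shared fused-feature encoder followed by independent per-label logits."""
    def __init__(self, in_dim=256, hidden=128, n_labels=len(EMOTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_labels)
        )

    def forward(self, fused_feats):
        return self.net(fused_feats)  # raw logits, one per emotion

model = MultiLabelHead()
criterion = nn.BCEWithLogitsLoss()       # standard multi-label loss
x = torch.randn(8, 256)                  # toy fused multimodal features
y = torch.randint(0, 2, (8, len(EMOTIONS))).float()
loss = criterion(model(x), y)
loss.backward()
print(float(loss))
```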
- Research Article
9
- 10.1002/eat.23854
- Nov 14, 2022
- International Journal of Eating Disorders
Interpersonal difficulties are evidenced in Anorexia Nervosa (AN) and are thought to contribute to disease onset and maintenance; however, research within the framework of emotional competence is currently limited. Previous studies have often used only static images for emotion recognition tasks, and evidence is lacking on the relationships between performance-based emotional abilities and self-reported intra- and interpersonal emotional traits. This study aimed to test multimodal dynamic emotion recognition ability in AN and analyze its correlation with psychometric scores of self- and other-related emotional competence. A total of 268 participants (128 individuals with AN and 140 healthy controls) completed the Geneva Emotion Recognition Test, the Profile of Emotional Competence, the Reading the Mind in the Eyes Test, and measures of general and eating psychopathology. Scores were compared between the two groups. Linear mixed effects models were used to examine the relationship between emotion recognition ability, self-reported measures, and clinical variables. Individuals with AN showed significantly poorer recognition of emotions of both negative and positive valence and significantly lower scores in all emotional competence dimensions. Besides emotion type and group, the linear mixed models evidenced significant effects of interpersonal comprehension on emotion recognition ability. Individuals with AN show impairment in multimodal emotion recognition and report their difficulties accordingly. Notably, among all emotional competence dimensions, interpersonal comprehension emerges as a significant correlate of emotion recognition in others and could represent a specific area of intervention in the treatment of individuals with AN. In this study, we show that the ability to recognize the emotions displayed by others is related to the level of interpersonal emotional competence reported by individuals with anorexia nervosa. This result helps in understanding the social impairments of people with anorexia nervosa and could contribute to advancements in applying emotional competence training in the treatment of this disorder.
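The abstract reports linear mixed effects models relating recognition scores to group, emotion type, and emotional-competence measures. As a generic, hedged sketch (the variable names and synthetic data are assumptions, not the study's data), such a model could be fit with statsmodels as follows:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic long-format data: one recognition score per participant per emotion type
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n), 2),
    "group": np.repeat(rng.choice(["AN", "HC"], n), 2),
    "emotion_type": ["negative", "positive"] * n,
    "interpersonal_comprehension": np.repeat(rng.normal(3.5, 0.5, n), 2),
})
df["recognition"] = rng.normal(0.6, 0.1, 2 * n) + 0.05 * (df["group"] == "HC")

# Random intercept per participant; fixed effects for group, emotion type, and competence
model = smf.mixedlm(
    "recognition ~ group + emotion_type + interpersonal_comprehension",
    df, groups=df["participant"],
)
print(model.fit().summary())
```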
- Research Article
3
- 10.7717/peerj-cs.1977
- Apr 19, 2024
- PeerJ Computer Science
Emotion recognition is a pivotal research domain in computer and cognitive science. Recent advancements have led to various emotion recognition methods, leveraging data from diverse sources like speech, facial expressions, electroencephalogram (EEG), electrocardiogram, and eye tracking (ET). This article introduces a novel emotion recognition framework, primarily targeting the analysis of users' psychological reactions and stimuli. It is important to note that the stimuli eliciting emotional responses are as critical as the responses themselves. Hence, our approach synergizes stimulus data with physical and physiological signals, pioneering a multimodal method for emotional cognition. Our proposed framework unites stimulus source data with physiological signals, aiming to enhance the accuracy and robustness of emotion recognition through data integration. We conducted an emotional cognition experiment to gather EEG and ET data alongside recorded emotional responses. Building on this, we developed the Emotion-Multimodal Fusion Neural Network (E-MFNN), optimized for multimodal data fusion to process both stimulus and physiological data. We conducted extensive comparisons between our framework's outcomes and those from existing models, also assessing various algorithmic approaches within our framework. This comparison underscores our framework's efficacy in multimodal emotion recognition. The source code is publicly available at https://figshare.com/s/8833d837871c78542b29.
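The E-MFNN architecture itself is not given in the abstract; a hedged, minimal sketch of the general idea (separate encoders for EEG, eye-tracking, and stimulus features, concatenated before a classifier) might look like the following, with all layer sizes being assumptions:

```python
import torch
import torch.nn as nn

class SimpleMultimodalFusion(nn.Module):
    """Per-source encoders + late concatenation; an illustrative stand-in, not E-MFNN."""
    def __init__(self, eeg_dim=310, et_dim=24, stim_dim=512, hidden=64, n_classes=4):
        super().__init__()
        self.eeg_enc = nn.Sequential(nn.Linear(eeg_dim, hidden), nn.ReLU())
        self.et_enc = nn.Sequential(nn.Linear(et_dim, hidden), nn.ReLU())
        self.stim_enc = nn.Sequential(nn.Linear(stim_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(3 * hidden, n_classes)

    def forward(self, eeg, et, stim):
        z = torch.cat([self.eeg_enc(eeg), self.et_enc(et), self.stim_enc(stim)], dim=-1)
        return self.classifier(z)

# Toy usage with random feature vectors
model = SimpleMultimodalFusion()
logits = model(torch.randn(8, 310), torch.randn(8, 24), torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 4])
```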
- Research Article
78
- 10.1016/j.eij.2020.07.005
- Aug 13, 2020
- Egyptian Informatics Journal
A 3D-convolutional neural network framework with ensemble learning techniques for multi-modal emotion recognition
- Research Article
- 10.54254/2755-2721/2025.po24716
- Jul 4, 2025
- Applied and Computational Engineering
The advent of multimedia technology has precipitated a paradigm shift in human-computer interaction and affective computing, rendering multimodal emotion recognition a pivotal domain. However, missing modalities, resulting from equipment failure or environmental interference in practical applications, significantly impact the accuracy of emotion recognition. The objective of this paper is to analyse multimodal emotion recognition methods designed for missing modalities, focusing on comparing the advantages and disadvantages of generative and joint-representation techniques. Experimental findings demonstrate that these methods surpass conventional baselines on diverse datasets, including IEMOCAP, CMU-MOSI, and others. Notably, CIF-MMIN improves mean accuracy by 0.92% under missing-modality conditions while reducing UniMF's parameters by 30%, thus preserving SOTA performance. Key challenges currently facing researchers in multimodal emotion recognition with missing modalities include cross-modal dependencies and semantic consistency, model generalisation ability, and dynamic scene adaptation. These challenges may be addressed in the future through lightweight solutions that do not require full-modal pre-training and by combining contrastive learning with generative modelling to enhance semantic fidelity. The paper provides both theoretical support and practical guidance for developing highly robust and efficient emotion recognition systems.
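One simple way to make a fusion model tolerant of missing modalities (a generic technique, not a reimplementation of the CIF-MMIN or UniMF methods discussed above) is to zero out an absent modality's features and pass an explicit presence mask to the fusion layer; the sketch below illustrates that idea with assumed dimensions:

```python
import torch
import torch.nn as nn

class MaskedFusion(nn.Module):
    """Fuse audio/vision/text vectors; missing ones are zeroed and flagged by a mask.
    Illustrative only -- not a reimplementation of the surveyed methods."""
    def __init__(self, dim=128, n_classes=6):
        super().__init__()
        self.classifier = nn.Linear(3 * dim + 3, n_classes)  # +3 for presence flags

    def forward(self, feats, present):
        # feats: list of 3 tensors (batch, dim); present: (batch, 3) float mask
        masked = [f * present[:, i:i + 1] for i, f in enumerate(feats)]
        return self.classifier(torch.cat(masked + [present], dim=-1))

batch, dim = 4, 128
feats = [torch.randn(batch, dim) for _ in range(3)]
present = torch.tensor([[1., 1., 1.], [1., 0., 1.], [0., 1., 1.], [1., 1., 0.]])
print(MaskedFusion(dim)(feats, present).shape)  # torch.Size([4, 6])
```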
- Research Article
22
- 10.1155/2023/9645611
- Jan 1, 2023
- Computational Intelligence and Neuroscience
Humans express their emotions in a variety of ways, which inspires research on multimodal fusion-based emotion recognition that utilizes different modalities to achieve information complementation. However, extracting deep emotional features from different modalities and fusing them remains a challenging task, and it is essential to exploit the advantages of different extraction and fusion approaches to capture the emotional information contained within and across modalities. In this paper, we present a novel multimodal emotion recognition framework based on cascaded multichannel and hierarchical fusion (CMC-HF), where visual, speech, and text signals are simultaneously utilized as multimodal inputs. First, three cascaded channels based on deep learning perform feature extraction for the three modalities separately, enhancing information extraction within each modality and improving recognition performance. Second, an improved hierarchical fusion module is introduced to promote inter-modality interactions among the three modalities and further improve recognition and classification accuracy. Finally, to validate the effectiveness of the designed CMC-HF model, experiments are conducted on two benchmark datasets, IEMOCAP and CMU-MOSI. The results show an increase of roughly 2%–3.2% in four-class accuracy on the IEMOCAP dataset and an improvement of 0.9%–2.5% in average class accuracy on the CMU-MOSI dataset compared with existing state-of-the-art methods. The ablation results indicate that the cascaded feature extraction method and the hierarchical fusion method make significant contributions to multimodal emotion recognition, suggesting that the three modalities contain deeper inter- and intra-modality information interactions. Hence, the proposed model has better overall performance and achieves higher recognition efficiency and better robustness.
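The cascaded-channel plus hierarchical-fusion idea can be illustrated generically (this is an assumed sketch, not the CMC-HF architecture): each modality gets its own encoder, pairs of modalities are fused first, and the pairwise results are then combined for classification:

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Per-modality encoders, pairwise fusion, then global fusion (illustrative sketch)."""
    def __init__(self, dim=64, n_classes=4):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])  # V, A, T
        self.pair_fuse = nn.Linear(2 * dim, dim)     # shared pairwise fusion layer
        self.global_fuse = nn.Linear(3 * dim, n_classes)

    def forward(self, visual, audio, text):
        v, a, t = (torch.relu(enc(x)) for enc, x in zip(self.encoders, (visual, audio, text)))
        va = torch.relu(self.pair_fuse(torch.cat([v, a], -1)))
        vt = torch.relu(self.pair_fuse(torch.cat([v, t], -1)))
        at = torch.relu(self.pair_fuse(torch.cat([a, t], -1)))
        return self.global_fuse(torch.cat([va, vt, at], -1))

# Toy usage with random utterance-level features per modality
x = [torch.randn(8, 64) for _ in range(3)]
print(HierarchicalFusion()(*x).shape)  # torch.Size([8, 4])
```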
- Book Chapter
- 10.1007/978-3-031-05484-6_41
- Jan 1, 2022
With the emergence of artificial intelligence, realizing more humane and intelligent human-computer interaction has continued to attract attention, and emotion recognition has become a global research hot spot. Traditional language translation systems focus on translating voice and text messages into English. However, communication between people is not simply the exchange of textual information; it also involves rich emotional exchange. Recognizing the emotion of English language has therefore become an indispensable part of a natural language translation system. For this reason, this paper proposes an English linguistics multimodal emotion recognition system based on the BOOSTING framework, with the aim of improving the accuracy of the emotion recognition system. The paper mainly uses comparison and experiment to analyze single-modal and multimodal English language emotion recognition technology. Experimental data show that the accuracy of multimodal emotion recognition after feature-level fusion can reach more than 47%, and the recognition level remains essentially stable, with little change.
Keywords: BOOSTING framework, English linguistics, Multimodal recognition, Emotion recognition system
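A generic way to apply a boosting framework to fused multimodal features (an assumed illustration, not the chapter's system) is to train a gradient-boosted classifier on concatenated per-modality feature vectors:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic per-modality features for illustration: audio, text, and facial vectors
rng = np.random.default_rng(42)
n = 400
audio = rng.normal(size=(n, 20))
text = rng.normal(size=(n, 30))
face = rng.normal(size=(n, 10))
labels = rng.integers(0, 4, size=n)  # four toy emotion classes

X = np.concatenate([audio, text, face], axis=1)  # simple feature-level fusion
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```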