Articles published on Multimodal Emotion Recognition
599 Search results
Sorted by recency
- New
- Research Article
- 10.1016/j.patcog.2025.111963
- Jan 1, 2026
- Pattern Recognition
- Liangfei Zhang + 4 more
Multimodal latent emotion recognition from micro-expression and physiological signal
- New
- Research Article
- 10.1016/j.eswa.2025.129109
- Jan 1, 2026
- Expert Systems with Applications
- Fang’Ai Liu + 4 more
INN-based dual-generator adversarial contrastive learning network for multi-modal multi-label emotion recognition
- New
- Research Article
- 10.1016/j.dsp.2025.105679
- Jan 1, 2026
- Digital Signal Processing
- Rongyue Zhao + 7 more
EEG-based multimodal emotion recognition framework with supervised contrastive learning and spatial–temporal convolutional attention network
- New
- Research Article
- 10.1016/j.specom.2025.103332
- Jan 1, 2026
- Speech Communication
- Weijie Lu + 2 more
Adaptive weighting in a transformer framework for multimodal emotion recognition
- New
- Research Article
- 10.1007/s40998-025-00968-2
- Dec 27, 2025
- Iranian Journal of Science and Technology, Transactions of Electrical Engineering
- Samiddha Chakrabarti + 1 more
Three-Stream Feature-Level Fusion Based Multimodal Emotion Recognition from Vocal and Facial Expressions for Human-Computer Interaction
- New
- Research Article
- 10.1145/3786588
- Dec 25, 2025
- ACM Transactions on Asian and Low-Resource Language Information Processing
- Cheng Cheng + 4 more
Electroencephalogram (EEG) has shown great potential in multi-modal emotion recognition (MER) due to its ability to directly capture emotional states. However, the nonstationarity of EEG signals leads to significant variations across subjects and sessions, posing challenges for subject-independent MER. While previous methods have made significant progress, they often fail to integrate multimodal signals into transfer learning frameworks effectively. To address this limitation, we propose a Multi-source Domain Adaptive Network (MSDA-Net) for MER, designed to mitigate cross-subject and cross-session distribution shifts and enhance recognition performance. Specifically, we first design a feature alignment module to integrate features from different modalities, generating cross-modal feature representations and extracting representative shared features. To further improve generalization, we incorporate domain-specific feature extractors to capture domain-invariant emotional representations. Additionally, we introduce an adapter module to adjust the feature representations between different modalities, so as to better capture inter-individual differences and cross-modal correlations. Finally, we unify classification loss, discrepancy loss, and maximum mean discrepancy (MMD) loss into a joint optimization framework. Extensive experiments on the SEED and SEED-IV datasets demonstrate the superiority of MSDA-Net, highlighting its effectiveness in improving MER performance.
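The joint objective described in this abstract combines classification, discrepancy, and maximum mean discrepancy (MMD) losses. For reference only, the sketch below shows a minimal RBF-kernel MMD term added to a cross-entropy loss, the general shape such transfer-learning objectives take; the kernel bandwidth, the weight `lambda_mmd`, and the function names are illustrative assumptions, not details taken from MSDA-Net.

```python
# Illustrative sketch (not the authors' code): an RBF-kernel MMD penalty of the
# kind domain-adaptation frameworks combine with a classification loss.
import torch
import torch.nn.functional as F


def rbf_mmd(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Maximum mean discrepancy between two feature batches using an RBF kernel."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances mapped through a Gaussian kernel.
        d2 = torch.cdist(a, b, p=2).pow(2)
        return torch.exp(-d2 / (2.0 * sigma ** 2))

    return (kernel(source, source).mean()
            + kernel(target, target).mean()
            - 2.0 * kernel(source, target).mean())


def joint_loss(logits, labels, feats_src, feats_tgt, lambda_mmd=0.5):
    """Hypothetical joint objective: cross-entropy on labelled source data plus
    an MMD penalty aligning source and target feature distributions."""
    return F.cross_entropy(logits, labels) + lambda_mmd * rbf_mmd(feats_src, feats_tgt)
```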
- New
- Research Article
- 10.62051/jhhx1w14
- Dec 25, 2025
- Transactions on Computer Science and Intelligent Systems Research
- Junze Lyu
With the rapid development of artificial intelligence technology, service robots are increasingly being integrated into daily life, and achieving efficient, natural human-robot interaction has become a key challenge. Emotion recognition, a core factor in enhancing the interaction experience, has attracted extensive attention from both academia and industry. This paper focuses on emotion recognition for service robots: it identifies three key problems that must be solved before multimodal emotion recognition technology can be applied in service robots, surveys several advanced machine learning algorithms that can help address them, evaluates each in turn, and summarizes their common limitations together with possible directions for future improvement. This research not only provides theoretical support and algorithmic paths for affective computing in service robots, but also helps to build more reliable human-robot collaboration. It holds significant application value for promoting the practical deployment of service robots in real-life scenarios.
- New
- Research Article
- 10.1142/s021984362540016x
- Dec 24, 2025
- International Journal of Humanoid Robotics
- Qaisar Abbas + 5 more
Multimodal emotion recognition has become an important function for intelligent human-robot interaction, especially in assistive robotics scenarios for individuals with communication disabilities. The paper presents a Hybrid BERT-Vision Transformer (BERT-ViT) system constructed to combine complementary information obtained from natural language and facial expression analysis for enhanced emotional comprehension in assistive response systems. By employing BERT to produce contextual embeddings for language and the Vision Transformer to obtain fine-grained visual features, the proposed BERT-ViT framework performs robust cross-modal emotion classification. Contrastive learning is applied for multimodal feature alignment, and the fusion step then concatenates the [CLS] token embeddings from both modalities. The model was evaluated on the MELD dataset, where it achieved promising mean accuracy across seven emotion classes and outperformed conventional unimodal and early-fusion counterparts. These results also indicate that the model performs relatively well at detecting the subtle and complex affective states relevant to ASD support, empathetic interaction, speech impairments, and socially assistive robotics. The modular nature of the model’s architecture allows seamless integration into assistive platforms while also supporting feedback and personalization features. The experimental validation includes linguistic profiling, spatio-temporal emotion processing, and classification performance measures of accuracy (98.71%), precision (98.34%), recall (98.00%), and F1-score (98.38%). Also noteworthy is the strong resistance to overfitting, with training-validation convergence and low overall error rates (FPR: 1.58%, FNR: 0.98%). This hybrid framework contributes to realizing advanced multimodal emotion recognition and to developing a scalable, responsive, and privacy-responsible solution for emotionally aware assistive robotic applications in sensitive healthcare and educational contexts.
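The fusion step described in this abstract concatenates the [CLS] token embeddings of the text and vision encoders. A minimal sketch of that kind of late fusion is shown below, assuming precomputed [CLS] vectors; the hidden size, dropout, and seven-class head (matching MELD's seven emotion labels) are illustrative choices, not the paper's hyperparameters, and the contrastive alignment stage is omitted.

```python
# Illustrative sketch (not the authors' implementation): late fusion by
# concatenating the text and vision [CLS] embeddings before classification.
import torch
import torch.nn as nn


class ClsConcatFusion(nn.Module):
    def __init__(self, text_dim: int = 768, vision_dim: int = 768,
                 hidden_dim: int = 512, num_emotions: int = 7):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + vision_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_emotions),
        )

    def forward(self, text_cls: torch.Tensor, vision_cls: torch.Tensor) -> torch.Tensor:
        # text_cls: [CLS] embedding from a BERT-style text encoder (batch, text_dim)
        # vision_cls: [CLS] embedding from a ViT-style image encoder (batch, vision_dim)
        fused = torch.cat([text_cls, vision_cls], dim=-1)
        return self.classifier(fused)


# Usage with dummy tensors standing in for BERT/ViT [CLS] outputs.
model = ClsConcatFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 768))  # (4, 7) emotion logits
```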
- New
- Abstract
- 10.1002/alz70857_100061
- Dec 24, 2025
- Alzheimer's & Dementia
- Greta Keller + 7 more
Background: The early detection of amnestic mild cognitive impairment (aMCI) is essential for effective preventive interventions. Artificial Intelligence (AI) offers innovative methods to identify markers of aMCI, complementing traditional approaches. However, research in this field remains in its early stages. In this context, this study focuses on designing AI multimodal neuropsychological instruments to differentiate healthy individuals from aMCI subjects, emphasizing local data calibration and validation. Method: We recruited 59 participants from Fleni, Argentina, including 30 healthy controls and 29 individuals diagnosed with aMCI based on Petersen's criteria (2020). All participants underwent comprehensive assessments, including neuropsychological testing (Uniform Data Set 3) and magnetic resonance imaging (MRI) following the ADNI 3 protocol. Participants were video- and audio-recorded while performing language tasks on a web platform, which involved describing two target images (“Cookie Theft” and “Firefighter-Oasis”) and completing two additional tasks without images (describing their favorite sandwich and reading a story). Multimodal markers were extracted from five modalities: language processing (automated speech transcription), speech acoustics (audio), face mesh analysis (video), blend shapes (video), and emotion recognition (video). Each modality provided a variety of features, including expert-derived metrics and embedding representations. These features were used to train machine learning classifiers to differentiate individuals with aMCI from healthy controls. Result: Participants ranged in age from 60 to 89 years (mean ± SD: 70.95 ± 6.8). Unimodal analysis was performed to study shared information between the proposed AI-markers and traditional neurocognitive tests. Of 432 AI-markers in total, 204 (47%) correlated significantly with traditional tests. Univariate AUC for aMCI diagnosis was measured for all markers, yielding an average performance above chance (0.57 ± 0.062). However, combining all modalities in a multivariate random forest classifier achieved an outstanding AUC of 0.91, highlighting its excellent diagnostic performance. Conclusion: This study demonstrates that AI-based multimodal markers, including language, speech acoustics, facial analysis, and emotion recognition, can effectively differentiate aMCI from healthy controls in an Argentine population. Validating these tools using Spanish-language data and cost-effective, non-invasive methods is crucial for their broader applicability.
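For orientation only, the sketch below shows the general shape of a multivariate pipeline like the one reported in this abstract: per-modality feature blocks concatenated and fed to a random forest scored by cross-validated AUC. The feature dimensions and labels are random placeholders, not the study's data.

```python
# Illustrative sketch (not the study's pipeline): a multivariate random forest
# over concatenated per-modality features, evaluated by cross-validated AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical feature blocks, one per modality (language, speech acoustics,
# face mesh, blend shapes, emotion recognition), for 59 participants.
n_subjects = 59
modalities = [rng.normal(size=(n_subjects, d)) for d in (40, 60, 30, 20, 10)]
X = np.concatenate(modalities, axis=1)      # multimodal feature matrix
y = rng.integers(0, 2, size=n_subjects)     # 0 = control, 1 = aMCI (placeholder labels)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUC: {auc:.2f}")
```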
- New
- Research Article
- 10.1145/3786343
- Dec 23, 2025
- ACM Transactions on Information Systems
- Yuntao Shou + 5 more
Multi-modal conversation emotion recognition (MCER) aims to recognize and track the speaker's emotional state using text, speech, and visual information. Compared with traditional single-utterance multi-modal emotion recognition or single-modal conversation emotion recognition, MCER is more challenging. It requires modeling complex emotional interactions and learning consistent and complementary semantics across multiple modalities. Although many deep learning-based approaches have been proposed for MCER, there is still a lack of systematic reviews summarizing existing modeling methods. Therefore, a timely and comprehensive overview of MCER's recent advances in deep learning is of great significance. In this survey, we provide a comprehensive overview of MCER modeling methods and roughly divide MCER methods into four categories, i.e., context-free modeling, sequential context modeling, speaker-differentiated modeling, and speaker-relationship modeling. Unlike conventional taxonomies based on modality combinations or task-stage decomposition, our framework focuses on how models structurally capture conversational dynamics, speaker roles, and emotional dependencies. In addition, we further discuss popular publicly available MCER datasets, multi-modal feature extraction methods, application areas, existing challenges, and future development directions. We hope this review provides valuable insights into the current state of MCER research and inspires the development of more effective models.
- Research Article
- 10.61173/2vy9ez95
- Dec 19, 2025
- Science and Technology of Engineering, Chemistry and Environmental Protection
- Pengyu Chen + 2 more
As the intelligent cockpit evolves toward a human-machine emotional interaction center, driver emotion recognition has become a key task for enhancing active safety and the interaction experience. This paper systematically studies the applicability and optimization paths of deep-learning-based multimodal fusion emotion recognition methods in the intelligent cockpit environment. The findings show that visual methods (such as YOLO and MobileNetV3) offer non-intrusiveness and high real-time performance (response time < 40 ms), but are susceptible to lighting changes and facial occlusion and recognize high-risk emotions (such as anger) poorly; physiological signals (electroencephalogram, electrooculogram) are highly objective, but suffer from high equipment costs, poor wearing comfort, and limited generalization and robustness due to individual differences; and speech methods (combined with psychoacoustic models) offer natural interaction and resistance to expression disguise, but are strongly affected by in-vehicle noise, which reduces recognition stability. To address these problems, this paper proposes improvement strategies such as lightweight network structures, sample augmentation for high-risk emotions, and hardware-algorithm collaborative anti-interference, which significantly improve the model's adaptability and real-time performance in extreme environments. The research shows that a single modality can hardly meet the complex requirements of the intelligent cockpit; in the future, deep multimodal fusion should be adopted to jointly optimize accuracy, robustness, and real-time performance while preserving user experience, providing key support for building an integrated perception-understanding-intervention emotion computing framework for the intelligent cockpit.
- Research Article
- 10.70267/ic-aimees.202516
- Dec 19, 2025
- Exploring Science Academic Conference Series
- Zihui Zhao
Multimodal fusion technology enables information exchange between humans and computers by integrating information from different modalities such as vision, speech, and touch, and has become an important research direction in human-computer interaction. This paper focuses on four mainstream multimodal fusion methods: graph-based feature fusion, cross-modal attention, cross-correlation attention architectures, and multimodal emotion recognition technology. It compares and analyzes their technical principles, advantages, disadvantages, and application scenarios, and systematically maps out the differences in their technical characteristics. By integrating multiple input methods, these approaches significantly improve the user interface interaction experience, optimize the efficiency of multi-source information processing, and provide new ideas for interaction design in complex scenes. Research shows that multimodal fusion human-computer interaction technology can effectively reduce user cognitive load and improve operating efficiency, with important application value in education, healthcare, smart homes, and other fields. In the future, the challenges of insufficient cross-modal data alignment accuracy and demanding real-time requirements must be addressed, and the deep combination of affective computing and multimodal fusion should be explored.
- Research Article
- 10.3390/electronics14244972
- Dec 18, 2025
- Electronics
- Da-Eun Chae + 1 more
Multimodal emotion recognition (MER) often relies on single-scale representations that fail to capture the hierarchical structure of emotional signals. This paper proposes a Dual Routing Mixture-of-Experts (MoE) model that dynamically selects between local (fine-grained) and global (contextual) representations extracted from speech and text encoders. The framework first obtains local–global embeddings using WavLM and RoBERTa, then employs a scale-aware routing mechanism to activate the most informative expert before bidirectional cross-attention fusion. Experiments on the IEMOCAP dataset show that the proposed model achieves stable performance across all folds, reaching an average unweighted accuracy (UA) of 75.27% and weighted accuracy (WA) of 74.09%. The model consistently outperforms single-scale baselines and simple concatenation methods, confirming the importance of dynamic multi-scale cue selection. Ablation studies highlight that neither local-only nor global-only representations are sufficient, while routing behavior analysis reveals emotion-dependent scale preferences—such as strong reliance on local acoustic cues for anger and global contextual cues for low-arousal emotions. These findings demonstrate that emotional expressions are inherently multi-scale and that scale-aware expert activation provides a principled approach beyond conventional single-scale fusion.
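A rough sketch of the scale-aware routing idea, not the authors' model, is shown below: a small router weights local and global speech embeddings before a single cross-attention step over text tokens. The soft gating, dimensions, and four-class head are simplifying assumptions; the paper activates the most informative expert and fuses bidirectionally.

```python
# Illustrative sketch (not the authors' model): a scale-aware router that
# softly weights "local" and "global" experts before cross-attention fusion.
import torch
import torch.nn as nn


class DualRoutingFusion(nn.Module):
    def __init__(self, dim: int = 256, num_emotions: int = 4):
        super().__init__()
        self.router = nn.Linear(2 * dim, 2)          # scores for local vs. global expert
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, num_emotions)

    def forward(self, local_speech, global_speech, text_seq):
        # local_speech / global_speech: (batch, dim) pooled speech embeddings at two scales
        # text_seq: (batch, seq_len, dim) token-level text embeddings
        gate = torch.softmax(self.router(torch.cat([local_speech, global_speech], dim=-1)), dim=-1)
        speech = gate[:, :1] * local_speech + gate[:, 1:] * global_speech
        # The gated speech representation queries the text sequence (one direction shown).
        fused, _ = self.cross_attn(speech.unsqueeze(1), text_seq, text_seq)
        return self.head(fused.squeeze(1))


model = DualRoutingFusion()
logits = model(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 20, 256))  # (2, 4)
```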
- Research Article
- 10.1007/s40747-025-02198-9
- Dec 18, 2025
- Complex & Intelligent Systems
- Xiaoyu Liu + 3 more
COLIN: complementary and competitive balanced learning network for multi-modal multi-label emotion recognition
- Research Article
- 10.1007/s44163-025-00671-5
- Dec 17, 2025
- Discover Artificial Intelligence
- Fanghai Gong
Design and implementation of an intelligent educational interaction system with integrated multimodal emotion recognition and adaptive content delivery
- Research Article
- 10.1038/s41597-025-06214-y
- Dec 15, 2025
- Scientific Data
- Xin Huang + 7 more
We introduce a novel multimodal emotion recognition dataset designed to enhance the precision of valence-arousal modeling while incorporating individual differences. This dataset includes electroencephalogram (EEG), electrocardiogram (ECG), and pulse interval (PI) data from 64 participants. Data collection employed two emotion induction paradigms: video stimuli targeting different valence levels (positive, neutral, and negative) and the Mannheim Multicomponent Stress Test (MMST) inducing high arousal through cognitive, emotional, and social stressors. To enrich the dataset, participants’ personality traits, anxiety, depression, and emotional states were assessed using validated questionnaires. By capturing a broad spectrum of affective responses and systematically accounting for individual differences, this dataset provides a robust resource for precise emotion modeling. The integration of multimodal physiological data with psychological assessments lays a strong foundation for personalized emotion recognition. We anticipate this resource will support the development of more accurate, adaptive, and individualized emotion recognition systems across diverse applications.
- Research Article
- 10.1145/3774880
- Dec 10, 2025
- ACM Transactions on Asian and Low-Resource Language Information Processing
- Abdelhamid Haouhat + 3 more
Multimodal sentiment analysis and emotion recognition have attracted significant interest in multimodal learning. Naturally, humans express their feelings and emotions through nuanced expressions across various verbal and non-verbal modalities. Despite this, there remains a critical gap in publicly accessible multimodal datasets for the Arabic language. To address this issue, we posited that creating a large and high-quality Arabic multimodal dataset would significantly improve sentiment analysis and emotion recognition in Arabic contexts. We aimed to develop a large, high-quality Arabic Multimodal Sentiment Analysis and Emotion Recognition (Amd’SaEr) dataset by building upon our AMSA dataset, increasing its size to 1,037 samples, and adding emotional labels. Leveraging a novel methodology, we carefully selected and annotated data across audio, text, and visual modalities, and proposed a hybrid inter-annotator agreement strategy. Extensive analyses were conducted to validate the robustness of the dataset. We experimented with the Amd’SaEr dataset using a customized MERBench framework, which demonstrated the dataset’s efficacy and reliability. Our findings indicate the high quality of the dataset and underscore the importance of multimodal context for accurate sentiment analysis and emotion recognition in Arabic. We recommend further research and application of the Amd’SaEr dataset in broader Arabic contexts, as it provides a valuable resource for advancing multimodal analysis in this language.
- Research Article
- 10.48084/etasr.14864
- Dec 8, 2025
- Engineering, Technology & Applied Science Research
- L Monish + 1 more
Emotion recognition from physiological signals is a promising approach in affective computing because it is accurate and less affected by external conditions. This paper proposes a new hybrid model that combines fuzzy logic and deep learning to improve Electroencephalogram (EEG) and Electrocardiogram (ECG)-based multimodal emotion recognition. The system performs feature-level fusion of EEG and ECG, coupled with fuzzy logic–based membership scoring to handle uncertainty and subject variability. These fuzzy-enhanced representations are then fed into a hybrid Convolutional Neural Network (CNN)–LSTM model, allowing automatic extraction of spatial associations and temporal emotional dynamics. When tested on the DREAMER dataset, the proposed method achieves an overall accuracy of 92%, outperforming current machine learning and deep learning models. Precision, recall, F1-score, confusion matrix, and ROC-AUC analyses show stable classification across the four affective classes. The findings validate that the fuzzy-deep hybrid model not only enhances prediction accuracy but also improves interpretability and robustness to noisy physiological signals, making it appropriate for application in healthcare monitoring, adaptive learning, and human–computer interaction.
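To illustrate the kind of combination described in this abstract, the sketch below applies Gaussian fuzzy membership scoring to fused EEG/ECG features and passes the result through a small CNN-LSTM; the membership centers, layer sizes, and four-class head are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch (not the paper's model): Gaussian fuzzy membership scores
# over fused EEG/ECG features, followed by a small CNN-LSTM classifier.
import torch
import torch.nn as nn


def fuzzy_membership(x: torch.Tensor, centers: torch.Tensor, width: float = 1.0) -> torch.Tensor:
    """Gaussian membership of each feature to a set of fuzzy centers.
    x: (batch, time, feat); centers: (num_sets,). Returns (batch, time, feat * num_sets)."""
    diff = x.unsqueeze(-1) - centers               # (batch, time, feat, num_sets)
    mu = torch.exp(-(diff ** 2) / (2 * width ** 2))
    return mu.flatten(start_dim=2)


class FuzzyCnnLstm(nn.Module):
    def __init__(self, in_feats: int, num_sets: int = 3, num_classes: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(in_feats * num_sets, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(64, 32, batch_first=True)
        self.head = nn.Linear(32, num_classes)

    def forward(self, eeg, ecg, centers):
        x = torch.cat([eeg, ecg], dim=-1)                      # feature-level fusion
        x = fuzzy_membership(x, centers)                       # uncertainty-aware scoring
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # spatial associations
        _, (h, _) = self.lstm(x)                               # temporal emotional dynamics
        return self.head(h[-1])


centers = torch.tensor([-1.0, 0.0, 1.0])                       # assumed fuzzy set centers
model = FuzzyCnnLstm(in_feats=12)
logits = model(torch.randn(2, 128, 8), torch.randn(2, 128, 4), centers)  # (2, 4)
```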
- Research Article
- 10.4114/intartif.vol29iss77pp28-39
- Dec 8, 2025
- Inteligencia Artificial
- Marcelo Alejandro Huerta-Espinoza + 2 more
Depression and anxiety disorders affect millions of individuals globally and are commonly addressed through psychological interventions. A growing technological approach to support such treatments involves the use of embodied conversational agents that employ motivational interviewing, a method that promotes behavioral change through empathic engagement. Despite its critical role in therapeutic efficacy, empathy remains a significant challenge for virtual agents to emulate. Emotion Recognition (ER) technologies offer a potential solution by enabling agents to perceive and respond appropriately to users' emotional states. Given the inherently multimodal nature of human emotion, unimodal ER approaches often fall short in accurately interpreting affective cues. In this work, we propose a multimodal emotion recognition model that integrates verbal and non-verbal signals (text and video) using a Cross-Modal Attention fusion strategy. Trained and evaluated on the IEMOCAP dataset, our approach leverages Ekman's taxonomy of basic emotions and demonstrates superior performance over unimodal baselines across key metrics such as accuracy and F1-score. By prioritizing text as the main modality and dynamically incorporating complementary visual cues, the model proves effective in complex emotion classification tasks. The proposed model is designed for integration into an existing conversational agent aimed at supporting individuals experiencing emotional and psychological distress. Future work will involve embedding the model in the conversational agent platform for emotionally distressed users, aiming to assess its real-world impact on engagement, user experience, and perceived empathy.
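A minimal sketch of cross-modal attention with text as the primary modality is given below; the feature dimension, mean pooling, and six-class head (matching Ekman's basic emotions) are illustrative assumptions rather than the authors' architecture.

```python
# Illustrative sketch (not the authors' architecture): text as the primary
# modality, with video features folded in through cross-modal attention.
import torch
import torch.nn as nn


class TextPrimaryCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, num_emotions: int = 6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_emotions)

    def forward(self, text_seq, video_seq):
        # text_seq: (batch, t_len, dim) token embeddings (primary modality)
        # video_seq: (batch, v_len, dim) frame-level visual features
        attended, _ = self.cross_attn(text_seq, video_seq, video_seq)
        fused = self.norm(text_seq + attended)        # residual keeps text dominant
        return self.head(fused.mean(dim=1))           # pool over tokens, then classify


model = TextPrimaryCrossModalAttention()
logits = model(torch.randn(2, 30, 256), torch.randn(2, 50, 256))  # (2, 6) emotion logits
```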
- Research Article
- 10.48175/ijarsct-30156
- Dec 4, 2025
- International Journal of Advanced Research in Science Communication and Technology
- Shreyas V + 4 more
The exponential growth of multimedia data across digital platforms has created an ever-increasing need for intelligent, automated video summarization systems capable of generating concise, emotionally engaging, and contextually relevant summaries. State-of-the-art practices for creating trailers and editing videos still rely on highly manual workflows, in which editors go through hours of footage to identify significant scenes. This process is time-consuming, labor-intensive, and biased by human judgment, making it impractical for large-scale or real-time applications. This paper provides an extensive survey and in-depth analysis of human-in-the-loop, AI-assisted video summarization frameworks, with a focus on emotion-based scene extraction and collaborative editing. The paper proposes a combined scheme: MTCNN for face detection, FaceNet for identity recognition, and CNNs for emotion classification. These deep learning models detect, track, and analyze emotional expressions across frames to identify the scenes with the most narratively and affectively important content. Frame-level processing and trailer compilation are handled with OpenCV, while a Flask-based interactive interface lets human editors review and refine the AI-generated summaries, balancing automation with creative input. This survey brings together thirteen key research works spanning predictive modeling, multimodal emotion recognition, and AI-human collaboration. The results demonstrate how human intuition, coupled with machine precision, can improve efficiency by reducing editing time by as much as 70% without sacrificing quality or emotional depth. They also establish that emotion-aware hybrid systems will eventually turn traditional video editing into an adaptive, scalable, intelligent process and open a new dimension for next-generation media production frameworks capable of producing emotionally resonant and narratively cohesive video summaries.
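For readers unfamiliar with the components named in this abstract, the sketch below strings together OpenCV frame sampling, MTCNN face detection, and FaceNet embeddings via the facenet_pytorch package; the emotion classifier is a hypothetical placeholder, and the sampling rate and function names are assumptions, not the surveyed system's code.

```python
# Illustrative sketch (not the surveyed system): sample video frames with OpenCV,
# detect faces with MTCNN, and embed them with a FaceNet model. The emotion
# classifier passed in is a hypothetical placeholder.
import cv2
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(keep_all=True)                               # face detector
facenet = InceptionResnetV1(pretrained="vggface2").eval()  # identity embeddings


def score_frames(video_path: str, emotion_model, every_n: int = 30):
    """Yield (frame_index, face_embeddings, emotion_logits) for sampled frames."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            faces = mtcnn(rgb)                             # (num_faces, 3, 160, 160) or None
            if faces is not None:
                with torch.no_grad():
                    embeddings = facenet(faces)            # (num_faces, 512) identity vectors
                    emotions = emotion_model(faces)        # assumed CNN emotion classifier
                yield idx, embeddings, emotions
        idx += 1
    cap.release()
```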