Articles published on Emotional Speech
2553 Search results
- Research Article
- 10.1016/j.ijcce.2024.11.008
- Dec 1, 2025
- International Journal of Cognitive Computing in Engineering
- Xueliang Kang
Speech Emotion Recognition Algorithm of Intelligent Robot Based on ACO-SVM
- Research Article
- 10.1016/j.engappai.2025.112152
- Dec 1, 2025
- Engineering Applications of Artificial Intelligence
- Yonghong Fan + 3 more
Temporal-frequency joint hierarchical transformer with dynamic windows for speech emotion recognition
- Research Article
- 10.1016/j.asoc.2025.113915
- Dec 1, 2025
- Applied Soft Computing
- Feng Li + 2 more
Improving speech emotion recognition using gated cross-modal attention and multimodal homogeneous feature discrepancy learning
- Research Article
- 10.1016/j.artmed.2025.103279
- Dec 1, 2025
- Artificial intelligence in medicine
- Xinran Li + 5 more
TSFNet: A Temporal-Spectral Fusion Network for advanced speech emotion recognition in medical applications
- Research Article
- 10.1016/j.measurement.2025.118165
- Dec 1, 2025
- Measurement
- Ravi + 1 more
A filtering approach for speech emotion recognition using wavelet approximation coefficient
- Research Article
- 10.1016/j.ins.2025.122956
- Dec 1, 2025
- Information Sciences
- Bao Thang Ta + 3 more
EmoDim: An independent dimensional contrastive learning with pseudo-labeling for speech emotion recognition
- Research Article
- 10.1016/j.apacoust.2025.110905
- Dec 1, 2025
- Applied Acoustics
- Astha Tripathi + 1 more
Multilingual speech emotion recognition using IGRFXG – Ensemble feature selection approach
- Research Article
- 10.1016/j.eswa.2025.128605
- Dec 1, 2025
- Expert Systems with Applications
- Chang Wang + 3 more
Bimodal speech emotion recognition via contrastive self-alignment learning
- Research Article
- 10.54097/bk7f6783
- Nov 27, 2025
- Academic Journal of Science and Technology
- Tian Jing
Artificial Intelligence (AI) has become a useful tool in human emotion recognition, with a broad range of applications. To serve these applications better, numerous studies have been conducted, and the related technologies have developed rapidly. This review broadly explores the main AI-based methods in emotion recognition. It begins with facial emotion recognition (FER), analyzing its general workflow (from database construction to preprocessing, feature extraction, and machine learning); as shown later, this flow applies to the other three modalities as well. Then, speech emotion recognition (SER) is briefly discussed, mainly with respect to feature extraction and classification (classification being part of the machine learning stage). Subsequently, emotion recognition from physiological signals is explored in depth, owing to its passive nature and resistance to deliberate manipulation. Among the variety of physiological signals, the review concentrates on electroencephalographic (EEG) and electrocardiographic (ECG) signals. Afterwards, textual emotion recognition (TER) is briefly introduced, outlining four basic methods based on it. Finally, the review summarizes the challenges that arise in nearly every emotion recognition experiment. The strengths and limitations of each modality are presented in the discussion section. The highlight of the review is its systematic analysis of the basic methods for recognizing emotion with AI.
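The database → preprocessing → feature extraction → machine learning flow described in this review is easy to make concrete. Below is a minimal, hypothetical sketch of such a pipeline using scikit-learn; the random feature matrix `X`, labels `y`, and the SVM classifier are illustrative stand-ins, not the review's specific setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: rows are samples, columns are pre-extracted features
# (e.g., facial landmarks or acoustic descriptors); labels are emotion ids.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = rng.integers(0, 4, size=200)  # 4 hypothetical emotion classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing (normalization) followed by a classical learner,
# mirroring the generic flow the review describes.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", SVC(kernel="rbf")),
])
pipeline.fit(X_train, y_train)
print("test accuracy:", pipeline.score(X_test, y_test))
```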
- Research Article
- 10.4218/etrij.2025-0058
- Nov 26, 2025
- ETRI Journal
- Seyyed Mahdi Hassani + 1 more
Given the importance of emotions in social interactions, emotional speech synthesis has attracted significant attention in the field of human-computer interaction. Remarkable advancements have been made in emotional text-to-speech synthesis, but most previous studies have concentrated on imitating styles associated with a specific primary emotion, neglecting secondary emotions that arise from mixtures of primary emotions. Therefore, there is a need to leverage both primary and secondary emotions in speech synthesis to facilitate more engaging, realistic, and natural interactions among artificial social agents. To address this gap, we propose a text-to-emotional speech synthesis model designed to generate nuanced mixtures of emotions that effectively convey secondary emotions during interactions. By adjusting the values of each basic emotion, we can control the mix of emotions in the synthetic speech. Our proposed method distinguishes between primary emotions and variations in mixed emotions while learning emotional styles. The effectiveness of the proposed framework was validated through both objective and subjective evaluations.
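The abstract states that mixed emotions are controlled by adjusting the value of each basic emotion. One plausible realization is a convex combination of learned per-emotion style embeddings used to condition the synthesizer; the sketch below illustrates only that mixing step, with the embedding table, emotion set, and dimensionality all hypothetical (the paper's actual conditioning mechanism is not specified here).

```python
import torch

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical primary set
STYLE_DIM = 128                                  # hypothetical embedding size

# Learned style embedding per primary emotion (randomly initialized here).
style_table = torch.nn.Embedding(len(EMOTIONS), STYLE_DIM)

def mixed_style(weights: dict) -> torch.Tensor:
    """Blend primary-emotion styles into one conditioning vector."""
    w = torch.tensor([weights.get(e, 0.0) for e in EMOTIONS])
    w = w / w.sum()  # normalize so the mix is a convex combination
    ids = torch.arange(len(EMOTIONS))
    return (w.unsqueeze(1) * style_table(ids)).sum(dim=0)

# e.g., a "bittersweet" secondary emotion: mostly sad with some happiness
style = mixed_style({"sad": 0.7, "happy": 0.3})
print(style.shape)  # torch.Size([128]); fed to the TTS decoder as conditioning
```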
- Research Article
- 10.3390/e27121201
- Nov 26, 2025
- Entropy
- Michael Norval + 1 more
We evaluate a hybrid quantum–classical pipeline for speech emotion recognition (SER) on a custom Afrikaans corpus using MFCC-based spectral features with pitch and energy variants, explicitly comparing three quantum approaches—a variational quantum classifier (VQC), a quantum support vector machine (QSVM), and a Quantum Approximate Optimisation Algorithm (QAOA)-based classifier—against a CNN–LSTM (CLSTM) baseline. We detail the classical-to-quantum data encoding (angle embedding with bounded rotations and an explicit feature-to-qubit map) and report test accuracy, weighted precision, recall, and F1. Under ideal analytic simulation, the quantum models reach 41–43% test accuracy; under a realistic 1% NISQ noise model (100–1000 shots) this degrades to 34–40%, versus 73.9% for the CLSTM baseline. Despite the markedly lower empirical accuracy—expected in the NISQ era—we provide an end-to-end, noise-aware hybrid SER benchmark and discuss the asymptotic advantages of quantum subroutines (Chebyshev-based quantum singular value transformation, quantum walks, and block encoding) that become relevant only in the fault-tolerant regime.
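The classical-to-quantum encoding named above (angle embedding with bounded rotations and an explicit feature-to-qubit map) can be sketched directly. The snippet below, assuming PennyLane as the simulator, rescales each feature into [0, π] and applies it as an RY rotation on its own qubit, followed by a simple CNOT entangling ring; the qubit count, entangling pattern, and measurement choice are illustrative assumptions, not the paper's exact circuit.

```python
import numpy as np
import pennylane as qml

N_QUBITS = 4  # hypothetical: one qubit per selected feature
dev = qml.device("default.qubit", wires=N_QUBITS)

def to_angles(x: np.ndarray) -> np.ndarray:
    """Bounded rotations: rescale features into [0, pi]."""
    x = np.asarray(x, dtype=float)
    return np.pi * (x - x.min()) / (x.max() - x.min() + 1e-12)

@qml.qnode(dev)
def circuit(angles):
    # Feature-to-qubit map: feature i drives an RY rotation on wire i.
    for i, theta in enumerate(angles):
        qml.RY(theta, wires=i)
    # Minimal entangling ring (an assumption; the paper's ansatz may differ).
    for i in range(N_QUBITS):
        qml.CNOT(wires=[i, (i + 1) % N_QUBITS])
    return [qml.expval(qml.PauliZ(w)) for w in range(N_QUBITS)]

mfcc_features = np.array([12.3, -4.1, 7.8, 0.5])  # stand-in MFCC values
print(circuit(to_angles(mfcc_features)))
```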
- Research Article
- 10.1038/s41598-025-25874-9
- Nov 25, 2025
- Scientific reports
- Jun Zhao + 2 more
The precise identification and understanding of human emotions by computers is crucial for natural interaction between humans and machines. This research presents a novel approach to identifying emotions in speech through the integration of deep learning and metaheuristic techniques. The approach uses Deep Maxout Networks (DMN) as the primary framework and enhances them with a modified version of the Water Cycle Algorithm (MWCA). The MWCA tunes the architectural parameters of the DMN and optimizes its capability to recognize emotions from speech signals. The model employs Mel-Frequency Cepstral Coefficients (MFCC) to extract features from the speech input, enabling effective differentiation between numerous emotional states. Its efficiency has been assessed on two datasets, CASIA and Emo-DB, achieving an average accuracy of 93.1% and an F1-score of 92.4% on Emo-DB and outperforming baseline models with statistically significant improvements (p < 0.01). This research contributes to emotional interaction design by providing a robust tool that lets computers understand and react to users' emotions, ultimately improving the overall user experience.
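For readers unfamiliar with the Deep Maxout Network (DMN) used as the primary framework here: a maxout unit replaces a fixed activation with the maximum over k learned affine pieces. Below is a minimal PyTorch sketch of one maxout layer; the layer sizes and k are hypothetical, and the paper's full DMN (and the MWCA that tunes it) is not reproduced.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """One maxout layer: elementwise max over k parallel linear pieces."""
    def __init__(self, in_features: int, out_features: int, k: int = 3):
        super().__init__()
        self.k = k
        self.out_features = out_features
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)                        # (batch, out_features * k)
        z = z.view(-1, self.out_features, self.k)
        return z.max(dim=2).values                # max over the k pieces

# Stand-in: 39-dim MFCC feature vectors -> 64 maxout units
layer = Maxout(in_features=39, out_features=64, k=3)
print(layer(torch.randn(8, 39)).shape)  # torch.Size([8, 64])
```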
- Research Article
- 10.1038/s41598-025-28686-z
- Nov 24, 2025
- Scientific reports
- Eman Abdulrahman Alkhamali + 3 more
Federated learning for speech emotion recognition faces fundamental challenges in simultaneously achieving high performance, privacy preservation, and model interpretability. This paper introduces FedSER-XAI, a novel framework that integrates Particle Swarm Optimization (PSO)-based feature selection, multi-stream cross-attention mechanisms, and graph-based feature extraction within an explainable federated learning architecture. Our approach combines Vision Transformer processing of mel-spectrograms with temporal-spatial graph convolutional networks to capture both contextual and structural speech relationships. The PSO algorithm achieves 78.1% dimensionality reduction (228→50 features) while improving discriminative power. The multi-stream architecture processes traditional acoustic features alongside novel graph-based representations derived from visibility and correlation graphs, fused through Transformer-based cross-attention mechanisms. Extensive evaluation on the EMODB and SAVEE datasets demonstrates strong performance: 99.9% and 97.2% accuracy in centralized settings, and global federated model accuracies of 99.7% (EMODB) and 97.2% (SAVEE) across 8 emotion-specialized clients, only 0.2% and 0.0% below centralized training. The framework converges rapidly, within 10 communication rounds, while preserving privacy. Cross-dataset evaluation on CREMA-D yields 68% accuracy, demonstrating reasonable generalization. The comprehensive explainability framework using SHAP and LIME provides global and local interpretations, validating that graph-based features contribute significantly to emotion discrimination. FedSER-XAI represents the first explainable federated speech emotion recognition system, advancing trustworthy AI for sensitive healthcare and human-computer interaction applications.
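The 78.1% dimensionality reduction quoted above corresponds to keeping 50 of 228 features (1 - 50/228 ≈ 0.781). A binary PSO for feature selection, sketched below under generic assumptions (sigmoid transfer function, cross-validated classifier accuracy as fitness, random stand-in data), shows the idea; the hyperparameters and classifier are illustrative, not FedSER-XAI's actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 228))      # stand-in for the 228 candidate features
y = rng.integers(0, 4, size=300)     # stand-in emotion labels

def fitness(mask: np.ndarray) -> float:
    """Cross-validated accuracy of a simple classifier on the selected subset."""
    if mask.sum() == 0:
        return 0.0
    clf = LogisticRegression(max_iter=200)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

n_particles, n_feat, iters = 10, X.shape[1], 15
pos = (rng.random((n_particles, n_feat)) < 0.5).astype(float)  # binary positions
vel = rng.normal(scale=0.1, size=(n_particles, n_feat))
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, n_feat))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    # Sigmoid transfer: higher velocity -> higher chance the feature is kept.
    pos = (rng.random((n_particles, n_feat)) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved] = pos[improved]
    pbest_fit[improved] = fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("features kept:", int(gbest.sum()), "of", n_feat)
```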
- Research Article
- 10.1038/s41598-025-27871-4
- Nov 22, 2025
- Scientific reports
- Thejan Rajapakshe + 4 more
Quantum machine learning (QML) offers a promising avenue for advancing representation learning in complex signal domains. In this study, we investigate the use of parameterised quantum circuits (PQCs) for speech emotion recognition (SER), a challenging task due to the subtle temporal variations and overlapping affective states in vocal signals. We propose a hybrid quantum-classical architecture that integrates PQCs into a conventional convolutional neural network (CNN), leveraging quantum properties such as superposition and entanglement to enrich emotional feature representations. Experimental evaluations on three benchmark datasets (IEMOCAP, RECOLA, and MSP-IMPROV) demonstrate that our hybrid model achieves improved classification performance relative to a purely classical CNN baseline, with over 50% fewer trainable parameters. Furthermore, Adjusted Rand Index (ARI) analysis demonstrates that the quantum model yields feature representations better aligned with the true emotion classes than the classical model's, reinforcing the observed performance gains. This work provides early evidence of the potential of QML to enhance emotion recognition and lays the foundation for future quantum-enabled affective computing systems.
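The Adjusted Rand Index (ARI) analysis mentioned above measures how well clusters formed in a model's feature space agree with the true emotion labels. A minimal sketch of such an analysis, assuming scikit-learn and random stand-in feature matrices for the two models (the paper's actual features would come from the hybrid and classical networks):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=400)  # true emotion classes

# Stand-ins for penultimate-layer features from each model; the offset
# makes the "quantum" features artificially better separated for the demo.
feats_quantum = rng.normal(size=(400, 32)) + labels[:, None]
feats_classic = rng.normal(size=(400, 32)) + 0.5 * labels[:, None]

def ari(features: np.ndarray) -> float:
    """Cluster the features, then score agreement with the true labels."""
    clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
    return adjusted_rand_score(labels, clusters)

# Higher ARI -> cluster structure aligns better with true emotion classes.
print("quantum features ARI:", ari(feats_quantum))
print("classical features ARI:", ari(feats_classic))
```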
- Research Article
- 10.1007/s00034-025-03410-4
- Nov 16, 2025
- Circuits, Systems, and Signal Processing
- Swapna Mol George + 1 more
Assessing the Effectiveness of Feature Normalization and Dataset Quality in Speech Emotion Recognition Across Diverse Emotional and Linguistic Contexts
- Research Article
- 10.1007/s00034-025-03408-y
- Nov 15, 2025
- Circuits, Systems, and Signal Processing
- Yuanyuan Wei + 4 more
A Multi-branch Interactive Attention Network Based on Self-Distillation for Speech Emotion Recognition
- Research Article
- 10.54097/hq3rzm08
- Nov 13, 2025
- Academic Journal of Science and Technology
- Zhizhou Lu
With the rapid advancement of artificial intelligence, sensor technology, and big data analytics, the way humans interact with machines is shifting from an "instruction-based" mode to a "perception-based" one. Because emotions are the core driving force behind human behavior and decisions, their objective and quantitative detection has become a focus of interdisciplinary research. Studies have shown that emotions can be detected and analyzed through four methods: facial expression recognition, speech emotion recognition, text sentiment analysis, and physiological signal recognition. Reducing errors in emotion recognition greatly aids the development and widespread application of human-computer interaction. Research on emotion detection is not only an inevitable outcome of technological development but also a key to addressing practical needs. This paper analyzes deep learning models such as CNN, LSTM, and SENN, and summarizes their advantages and disadvantages. It challenges the traditional perception that "emotions cannot be quantified," enabling machines to move from "understanding language" to "understanding the human heart," and ultimately promoting the harmonious coexistence of technology and human society.
- Research Article
- 10.3390/bdcc9110285
- Nov 12, 2025
- Big Data and Cognitive Computing
- Elena Ryumina + 4 more
Bimodal emotion recognition based on audio and text is widely adopted in video-constrained real-world applications such as call centers and voice assistants. However, existing systems suffer from limited cross-domain generalization and monolingual bias. To address these limitations, a cross-lingual bimodal emotion recognition method is proposed, integrating Mamba-based temporal encoders for audio (Wav2Vec2.0) and text (Jina-v3) with a Transformer-based cross-modal fusion architecture (BiFormer). Three corpus-adaptive augmentation strategies are introduced: (1) Stacked Data Sampling, in which short utterances are concatenated to stabilize sequence length; (2) Label Smoothing Generation based on Large Language Model, where the Qwen3-4B model is prompted to detect subtle emotional cues missed by annotators, producing soft labels that reflect latent emotional co-occurrences; and (3) Text-to-Utterance Generation, in which emotionally labeled utterances are generated by ChatGPT-5 and synthesized into speech using the DIA-TTS model, enabling controlled creation of affective audio–text pairs without human annotation. BiFormer is trained jointly on the English Multimodal EmotionLines Dataset and the Russian Emotional Speech Dialogs corpus, enabling cross-lingual transfer without parallel data. Experimental results show that the optimal data augmentation strategy is corpus-dependent: Stacked Data Sampling achieves the best performance on short, noisy English utterances, while Label Smoothing Generation based on Large Language Model better captures nuanced emotional expressions in longer Russian utterances. Text-to-Utterance Generation does not yield a measurable gain due to current limitations in expressive speech synthesis. When combined, the two best performing strategies produce complementary improvements, establishing new state-of-the-art performance in both monolingual and cross-lingual settings.
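Of the three augmentation strategies described above, Stacked Data Sampling is the most mechanical: short utterances are concatenated to stabilize sequence length. A minimal sketch under simple assumptions (same-label concatenation and a fixed target length in samples; the paper's exact pairing policy is not specified here):

```python
import numpy as np

def stacked_data_sampling(utterances, labels, target_len: int):
    """Concatenate short same-label utterances up to roughly target_len samples."""
    by_label = {}
    for wav, lab in zip(utterances, labels):
        by_label.setdefault(lab, []).append(wav)

    stacked, stacked_labels = [], []
    for lab, wavs in by_label.items():
        buf = []
        for wav in wavs:
            buf.append(wav)
            if sum(len(w) for w in buf) >= target_len:
                stacked.append(np.concatenate(buf))
                stacked_labels.append(lab)
                buf = []
        if buf:  # flush the remainder as one (possibly short) sample
            stacked.append(np.concatenate(buf))
            stacked_labels.append(lab)
    return stacked, stacked_labels

# e.g., 1-2 s clips at 16 kHz stacked toward ~4 s sequences
clips = [np.zeros(16000), np.zeros(24000), np.zeros(20000), np.zeros(30000)]
labs = ["happy", "happy", "sad", "sad"]
out, out_labs = stacked_data_sampling(clips, labs, target_len=4 * 16000)
print([len(o) for o in out], out_labs)
```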
- Research Article
- 10.1007/s10586-025-05830-y
- Nov 11, 2025
- Cluster Computing
- Zineddine Sarhani Kahhoul + 6 more
Automatic speech emotion recognition for Arabic dialects: a new dataset and machine learning framework
- Research Article
- 10.3390/fi17110509
- Nov 5, 2025
- Future Internet
- Seounghoon Byun + 1 more
Speech Emotion Recognition (SER) is important for applications such as Human–Computer Interaction (HCI) and emotion-aware services. Traditional SER models rely on utterance-level labels, aggregating frame-level representations through pooling operations. However, emotional states can vary across frames within an utterance, making it difficult for models to learn consistent and robust representations. To address this issue, we propose two auxiliary loss functions, Emotional Attention Loss (EAL) and Frame-to-Utterance Alignment Loss (FUAL). The proposed approach uses a Classification token (CLS) self-attention pooling mechanism, where the CLS summarizes the entire utterance sequence. EAL encourages frames of the same emotion to align closely with the CLS while separating frames of different classes, and FUAL enforces consistency between frame-level and utterance-level predictions to stabilize training. Model training proceeds in two stages: Stage 1 fine-tunes the wav2vec 2.0 backbone with Cross-Entropy (CE) loss to obtain stable frame embeddings, and Stage 2 jointly optimizes CE, EAL, and FUAL within the CLS-based pooling framework. Experiments on the IEMOCAP four-class dataset demonstrate that our method consistently outperforms baseline models, showing that the proposed losses effectively address representation inconsistencies and improve SER performance. This work advances Artificial Intelligence by improving the ability of models to understand human emotions through speech.
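The CLS self-attention pooling at the heart of this method can be sketched compactly: a learned classification token attends over the frame sequence, and its output summarizes the utterance. The PyTorch sketch below shows that pooling plus one plausible (assumed) form of FUAL, a consistency term between averaged frame-level predictions and the utterance-level prediction; the paper's exact EAL and FUAL formulations are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSAttentionPooling(nn.Module):
    """A learned CLS token attends over frame embeddings to pool an utterance."""
    def __init__(self, dim: int, n_heads: int = 4, n_classes: int = 4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.frame_head = nn.Linear(dim, n_classes)  # frame-level predictions
        self.utt_head = nn.Linear(dim, n_classes)    # utterance-level prediction

    def forward(self, frames: torch.Tensor):
        # frames: (batch, n_frames, dim), e.g., wav2vec 2.0 outputs
        cls = self.cls.expand(frames.size(0), -1, -1)
        pooled, _ = self.attn(cls, frames, frames)   # CLS queries the frames
        return self.frame_head(frames), self.utt_head(pooled.squeeze(1))

def fual(frame_logits, utt_logits):
    """Assumed consistency loss: KL between mean frame distribution and utterance."""
    frame_prob = F.softmax(frame_logits, dim=-1).mean(dim=1)
    return F.kl_div(F.log_softmax(utt_logits, dim=-1), frame_prob,
                    reduction="batchmean")

model = CLSAttentionPooling(dim=768)
frame_logits, utt_logits = model(torch.randn(2, 100, 768))
print(fual(frame_logits, utt_logits))
```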