Systematic Evaluation of Deep Learning Paradigms for Speech Emotion Recognization Using Diverse Audio Sources
Speech emotion identification is one of the most difficult areas of human-computer interaction, with significant ramifications for assistive technologies, customer support, and mental health monitoring. Despite significant advances in machine learning, accurately identifying emotional states from speech remains difficult due to the complex, nuanced nature of vocal emotional expressions across diverse speakers and contexts. This study presents a comprehensive evaluation of Speech Emotion Recognition (SER) systems across multiple machine learning paradigms using four benchmark datasets (CREMA-D, RAVDESS, SAVEE, and TESS). We implement a multi-feature extraction approach incorporating prosodic, spectral, and voice quality features, while employing data augmentation techniques to enhance model robustness. Our investigation spans traditional machine learning algorithms, ensemble methods, and deep learning architectures including CNN and RNN implementations. Performance evaluation reveals the superiority of the Stacking Classifier (accuracy: 72.54%, F1-score: 72.47%), with strong performances from Random Forest (68.31% accuracy) and ResNet (66% accuracy). This comparative analysis advances affective computing by providing detailed insights into the effectiveness of various approaches for emotion recognition in speech, with significant implications for developing more sophisticated emotional intelligence systems.
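The accuracy and F1-score figures quoted above can be reproduced from a list of true and predicted labels. The sketch below is a minimal illustration, not the authors' evaluation code: the abstract does not say which F1 averaging was used, so macro averaging is an assumption here, and the emotion labels and predictions are invented toy data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted emotion matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, purely illustrative.
y_true = ["angry", "happy", "sad", "sad", "neutral", "happy"]
y_pred = ["angry", "sad", "sad", "sad", "neutral", "happy"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro-F1={f1:.4f}")
```

Accuracy and F1 can diverge noticeably when emotion classes are imbalanced, which may be why the abstract reports both.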
- Research Article
51
- 10.1007/s00521-013-1377-z
- Mar 29, 2013
- Neural Computing and Applications
Emotion recognition in speech signals is currently a very active research topic and has attracted much attention within the engineering application area. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. By using a weighted sparse representation model based on maximum likelihood estimation, an enhanced sparse representation classifier is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers: the linear discriminant classifier, K-nearest neighbor, the C4.5 decision tree, radial basis function neural networks, support vector machines, and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, that is, the Berlin database and the Polish database, demonstrate the promising performance of the proposed method on the task of robust emotion recognition in noisy speech, outperforming the other methods used.
- Research Article
1
- 10.3390/jimaging11080273
- Aug 14, 2025
- Journal of Imaging
Emotion recognition in speech is essential for enhancing human–computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model’s robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time–frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Sequentially, we utilize a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. 
For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets—SUBESCO, BanglaSER, and a merged version of both—as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves impressive results with accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech.
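The soft voting step described above can be sketched as averaging the class probabilities produced by the three streams and picking the class with the highest mean. This is a minimal sketch under assumptions: equal weights per stream, three toy emotion classes, and invented probability vectors. The abstract does not specify weights or class sets.

```python
def soft_vote(stream_probs):
    """Average class probabilities across models.

    Returns (index of the winning class, averaged probability vector).
    Equal model weights are an assumption of this sketch.
    """
    n_models = len(stream_probs)
    n_classes = len(stream_probs[0])
    avg = [sum(p[k] for p in stream_probs) / n_models for k in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg


# Hypothetical per-stream outputs for one utterance.
emotions = ["angry", "happy", "sad"]
cnn = [0.50, 0.30, 0.20]          # stream 1: 1D CNN
cnn_lstm = [0.20, 0.60, 0.20]     # stream 2: 1D CNN + LSTM
cnn_bilstm = [0.30, 0.45, 0.25]   # stream 3: 1D CNN + Bi-LSTM

winner, avg = soft_vote([cnn, cnn_lstm, cnn_bilstm])
print(emotions[winner], [round(p, 4) for p in avg])
```

In this toy case the first stream alone would predict "angry", but the averaged scores favour "happy", which illustrates how soft voting lets confident streams outweigh a single dissenting one.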
- Research Article
34
- 10.1016/j.iswa.2024.200351
- Mar 11, 2024
- Intelligent Systems with Applications
In-depth investigation of speech emotion recognition studies from past to present –The importance of emotion recognition from speech signal for AI–
- Research Article
104
- 10.1016/j.apacoust.2023.109492
- Jun 28, 2023
- Applied Acoustics
Emotional speech Recognition using CNN and Deep learning techniques
- Research Article
- 10.25258/ijddt.16.17s.87
- Apr 10, 2026
- International Journal of Drug Delivery Technology
Speech Emotion Recognition (SER) plays a significant role in improving human–computer interaction by enabling systems to identify and interpret emotions expressed through speech. While extensive research has been conducted for languages such as English, limited work exists for Hindi, one of the most widely spoken languages in the world. The linguistic diversity, dialectal variations, and cultural differences in emotional expression make Hindi speech emotion recognition a challenging task. This study explores the development of a Speech Emotion Recognition system for Hindi speech using machine learning techniques. The proposed approach focuses on extracting relevant acoustic features, including Mel-Frequency Cepstral Coefficients (MFCCs) and prosodic features, which capture important characteristics of speech signals related to emotional expression. Various machine learning algorithms, such as Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forest, Decision Trees, and Multilayer Perceptron, are employed to classify emotions from speech data. The study also discusses challenges associated with Hindi speech, including limited availability of annotated datasets, variations in pronunciation across dialects, and overlapping acoustic characteristics among emotions. The findings suggest that effective feature extraction and appropriate machine learning models can significantly improve the performance of Hindi Speech Emotion Recognition systems. The research contributes toward the development of intelligent systems capable of understanding emotional cues in Hindi speech, which can be applied in areas such as virtual assistants, customer service automation, and mental health monitoring.
- Research Article
27
- 10.3390/s22197561
- Oct 6, 2022
- Sensors
Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research or Human Computer Interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets have been used with machine-learning (ML) algorithms for discrete emotion classification. However, there is no consensus on which low-level descriptors and classifiers are optimal. Therefore, we aimed to compare the performance of machine-learning algorithms with several different feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech: Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG) and Multinomial Logistic Regression (MLR) with 10-fold cross validation using four openSMILE feature sets (i.e., IS-09, emobase, GeMAPS and eGeMAPS). Results indicated that SMO, MLP and LOG showed better performance (reaching 87.85%, 84.00% and 83.74% accuracies, respectively) compared to RF, DT, MLR and KNN (with minimum 73.46%, 53.08%, 70.65% and 58.69% accuracies, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention or HCI.
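The 10-fold cross validation mentioned above partitions the data into ten folds and uses each fold once as the test set while training on the other nine. The sketch below shows one way to generate such splits. It is an illustration only: the abstract does not say whether the folds were stratified or speaker-independent, so plain shuffled folds are an assumption, and `k_fold_indices`, the sample count, and the seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds.

    Plain (non-stratified) folds are an assumption of this sketch.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train_idx, test_idx


n = 53  # placeholder sample count, not taken from the study
splits = list(k_fold_indices(n, k=10))
for fold_no, (train_idx, test_idx) in enumerate(splits):
    print(fold_no, len(train_idx), len(test_idx))
```

A classifier's reported accuracy would then be the mean of its accuracies over the ten test folds.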
- Conference Article
4
- 10.1145/3395035.3425252
- Oct 25, 2020
While both speech emotion recognition and music emotion recognition have been studied extensively in different communities, little research went into the recognition of emotion from mixed audio sources, i.e. when both speech and music are present. However, many application scenarios require models that are able to extract emotions from mixed audio sources, such as television content. This paper studies how mixed audio affects both speech and music emotion recognition using a random forest and deep neural network model, and investigates if blind source separation of the mixed signal beforehand is beneficial. We created a mixed audio dataset, with 25% speech-music overlap without contextual relationship between the two. We show that specialized models for speech-only or music-only audio were able to achieve merely 'chance-level' performance on mixed audio. For speech, above chance-level performance was achieved when trained on raw mixed audio, but optimal performance was achieved with audio blind source separated beforehand. Music emotion recognition models on mixed audio achieve performance approaching or even surpassing performance on music-only audio, with and without blind source separation. Our results are important for estimating emotion from real-world data, where individual speech and music tracks are often not available.
- Research Article
26
- 10.1016/j.apacoust.2020.107519
- Jul 22, 2020
- Applied Acoustics
Investigation of multilingual and mixed-lingual emotion recognition using enhanced cues with data augmentation
- Conference Article
19
- 10.1109/tencon.2015.7372840
- Nov 1, 2015
Emotional information in the speech signal is an important information resource. When verbal expression is combined with human emotion, emotional speech processing is no longer a matter of a simple mathematical model or pure calculation. Fluctuations of mood are controlled by the brain's perception; speech signal processing based on cognitive psychology can therefore capture emotion better. In this paper, the relevance analysis between speech emotion and human cognition is introduced first. Recent progress in speech emotion recognition is then summarized, including a review of speech emotion databases, feature extraction and emotion recognition networks. Second, a fuzzy cognitive map network based on cognitive psychology is introduced into emotional speech recognition. In addition, the mechanism of the human brain for cognitive emotional speech is explored. To improve the recognition accuracy, this report also tries to integrate event-related potentials into speech emotion recognition. This idea represents a prospect for future speech emotion recognition combined with cognitive psychology.
- Research Article
29
- 10.1177/2059204318762650
- Jan 1, 2018
- Music & Science
The acoustic cues that convey emotion in speech are similar to those that convey emotion in music, and recognition of emotion in both of these types of cue recruits overlapping networks in the brain. Given the similarities between music and speech prosody, developmental research is uniquely positioned to determine whether recognition of these cues develops in parallel. In the present study, we asked 60 children aged 6 to 11 years, and 51 university students, to judge the emotions of 10 musical excerpts, 10 inflected speech clips, and 10 affect burst clips. We presented stimuli intended to convey happiness, sadness, anger, fear, and pride. Each emotion was presented twice per type of stimulus. We found that recognition of emotions in music and speech developed in parallel, and adult-levels of recognition develop later for these stimuli than for affect bursts. We also found that sad stimuli were most easily recognised, followed by happiness, fear, and then anger. In addition, we found that recognition of emotion in speech and affect bursts can predict emotion recognition in music stimuli independently of age and musical training. Finally, although proud speech and affect bursts were not well recognised, children aged eight years and older showed adult-like responses in recognition of proud music.
- Dissertation
1
- 10.51415/10321/3801
- Oct 15, 2021
The use of customer call centres has increased exponentially in the modern business world and is the heart of marketing in the customer services industry. Previous studies have shown that the quality of services that customers receive from the call centres paints a picture of how they view the company. Reliance on the use of suggestion boxes to crowdsource customer views on call centre services is not adequate and at times, may not give a correct record about the services in question. Therefore, speech emotion recognition has been applied in customer call centres as a tool for evaluating customer service perception, emotion, and sentiment. This approach presents several advantages, for instance, the performance of call centre agents can adequately be scrutinised because their emotions can be automatically classified based on machine learning methods for emotion recognition. In recent times, various techniques and methods have been used to develop robust speech emotion recognition systems for customer call centres, but the primary problem associated with these novel applications is that most of them do not perform well in multilingual environments. In addition, most of the proposed models do not properly recognise the fear archetype of emotion. The effectiveness of a speech emotion recognition system depends largely on the strength of the features used. Consequently, the purpose of this research was to discover the most efficacious features in recognising speech emotion in call centre conversations. Therefore, this thesis reports on the development of hybrid acoustic features based on spectral and prosodic descriptors. The set of hybrid features proposed in this study comprises the logarithm of energy, fundamental frequency, zero-crossing rate, spectral roll-off point, spectral flux, spectral centroid, spectral compactness, spectral variability, fast Fourier transform, Mel frequency cepstral coefficients, and linear prediction cepstral coefficients.
Furthermore, this thesis reports on the development of a novel stacked ensemble machine learning algorithm based on a combination of inducers and ensemble classifiers. The discovery of effective speech emotion features and the development of an efficient machine learning algorithm are essential stages of effective speech emotion recognition in call centre conversations. The verification and validation of the proposed speech emotion recognition methods based on feature extraction and feature classification for applications in call centre conversations were done using a series of experiments. This was accomplished by testing the crafted hybrid acoustic features on five distinct speech emotion databases. The acoustic features were evaluated against deep learning auto-generated features and a hybrid of popular acoustic features. In addition, a set of four ensemble algorithms was evaluated against the newly invented stacked ensemble algorithm. The performance of the developed stacked ensemble algorithm in this study was analysed based on the widely used statistical evaluation metrics of accuracy, precision, F-score, area under the receiver operating characteristic curve and computation time. The results have indeed demonstrated that the newly developed stacked ensemble algorithm coupled with the crafted hybrid acoustic features has consistently performed better than many other state-of-the-art algorithms and speech features across various standard speech corpora.
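Two of the hybrid features listed in this thesis abstract, the logarithm of energy and the zero-crossing rate, are simple enough to sketch directly. The code below is a minimal illustration and not the thesis implementation: the frame length, hop size, sample rate, and the synthetic 100 Hz test tone are all assumptions, and the helper names are hypothetical.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a signal into overlapping frames, dropping the incomplete tail."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def log_energy(frame, eps=1e-12):
    """Natural log of the summed squared amplitude; eps guards against log(0)."""
    return math.log(sum(x * x for x in frame) + eps)


sr = 8000  # assumed sample rate in Hz
tone = [math.sin(2 * math.pi * 100 * n / sr) for n in range(sr)]  # 1 s, 100 Hz
frames = frame_signal(tone, frame_len=400, hop=200)  # 50 ms frames, 50% overlap
zcr = [zero_crossing_rate(f) for f in frames]
energy = [log_energy(f) for f in frames]
print(len(frames), round(zcr[0], 4), round(energy[0], 4))
```

Per-frame values like these are typically summarised (for example by mean and variance over an utterance) before being passed to a classifier; the abstract does not state which summary the thesis used.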
- Book Chapter
48
- 10.1007/978-3-540-87734-9_52
- Sep 24, 2008
Nowadays, recognition of human emotion is a challenging yet important speech technology. In this paper, based on deriving prosody features from emotional speech, some voice quality features are proposed to be extracted as new emotional features to improve emotion recognition. Utilizing support vector machines classifier, four emotions from Chinese natural emotional speech corpus including anger, joy, sadness and neutral are discriminated by combining prosody and voice quality features. The experiment results show that combining prosody and voice quality features yields an overall accuracy of 76% for emotion recognition, which makes approximately 10% improvement compared with using the single prosody features. It also shows that voice quality features in speech are effective emotional features and can promote prosody features for improving emotion recognition results.
- Research Article
163
- 10.3390/s20185212
- Sep 12, 2020
- Sensors
Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, the speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine. In addition, making it smarter so that the emotions are efficiently recognized by AI is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions, such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for the SER, the success rates are very low according to the languages, the emotions, and the databases. In this paper, we propose a new lightweight effective SER model that has a low computational complexity and a high recognition accuracy. The suggested method uses the convolutional neural network (CNN) approach to learn the deep frequency features by using a plain rectangular filter with a modified pooling strategy that has more discriminative power for the SER. The proposed CNN model was trained on the extracted frequency features from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated over two benchmarks, which included the interactive emotional dyadic motion capture (IEMOCAP) and the Berlin emotional speech database (EMO-DB) speech datasets, and it obtained 77.01% and 92.02% recognition results, respectively. The experimental results demonstrated that the proposed CNN-based SER system can achieve a better recognition performance than the state-of-the-art SER systems.
- Research Article
1
- 10.1121/10.0036812
- Jun 1, 2025
- The Journal of the Acoustical Society of America
Prosodic and voice quality modulations of the speech signal offer acoustic cues to the emotional state of the speaker. In quiet, listeners are highly adept at identifying not only a speaker's words but also the underlying emotional context. Given that distinct vocal emotions possess varying acoustic characteristics, background noise level may differentially impact speech recognition, emotion recognition, or their interaction. To investigate this question, we assessed the effects of three emotional speech styles (angry, happy, neutral) on speech intelligibility and emotion recognition across four different SNR levels. High-arousal emotional speech styles (happy and angry speech) enhanced both speech intelligibility and emotion recognition in noise. However, emotion recognition behavior was not a reliable predictor of speech recognition behavior. Instead, we found a strong correspondence between speech recognition scores and the relative power of the speech-in-noise signal in critical bands derived from the Speech Intelligibility Index. Unsupervised dimensional scaling analysis of emotion recognition patterns revealed that different noise baselines elicit different perceptual cue weighting strategies. Further dimensional scaling analysis revealed that emotion recognition patterns were best predicted by emotion-level differences in harmonic-to-noise ratio and variability around the fundamental frequency. Listeners may thus weight acoustic features differently for recognizing speech versus emotional patterns.
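Presenting speech at a fixed signal-to-noise ratio, as in the SNR conditions described above, usually means scaling the noise so that 10·log10(P_speech / P_noise) equals the target value in dB. The sketch below illustrates that arithmetic only. It is not the study's stimulus pipeline: the 5 dB target, the synthetic tone standing in for speech, the Gaussian noise, and the helper names are all assumptions.

```python
import math
import random


def mean_power(x):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so speech-to-noise power ratio equals snr_db, then add it.

    Returns (mixture, scaled_noise).
    """
    target_ratio = 10 ** (snr_db / 10)
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(16000)]  # stand-in signal
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)  # 5 dB is a placeholder
achieved = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
print(round(achieved, 6))
```

Lower SNR values correspond to a larger noise gain, so the same utterance can be rendered at several noise levels by changing only `snr_db`.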
- Research Article
1
- 10.1007/s11055-011-9421-x
- Apr 20, 2011
- Neuroscience and Behavioral Physiology
The experimental-theoretical aims of the present study were to investigate the ability of humans to evaluate emotions in speech in relation to individual EEG characteristics and to compare clinical and electrophysiological data. Profound impairments to the recognition of emotions in speech were seen in subjects with lesions to the right temporal area, while the most significant defects in recognition were associated with frontal-temporal focal lesions. EEG studies of two groups of subjects, with high and low levels of recognition of emotions in speech, showed high levels of activation of the posterior temporal area of the right hemisphere and anterior leads of the left hemisphere in subjects with poor discrimination of the emotional tone of speech. Clinical and electrophysiological data lead to the conclusion that the recognition of emotions in speech may involve not only the temporal area of the right hemisphere, but also the speech centers in the left hemisphere.