Speech emotion recognition using multimodal LLMs and quality-controlled TTS-based data augmentation for Iberian languages
- Research Article
38
- 10.1007/s00521-013-1377-z
- Mar 29, 2013
- Neural Computing and Applications
Emotion recognition in speech signals is currently a very active research topic and has attracted much attention within the engineering application area. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. Using a weighted sparse representation model based on maximum likelihood estimation, an enhanced sparse representation classifier is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers: linear discriminant classifier, K-nearest neighbor, C4.5 decision tree, radial basis function neural networks, support vector machines, and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, demonstrate the promising performance of the proposed method on robust emotion recognition in noisy speech, outperforming the other methods.
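The weighted sparse representation idea in this abstract can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the iteratively reweighted residual heuristic stands in for the maximum-likelihood-based weighting, feature extraction is assumed to have been done already, and the Lasso solver and its parameters are arbitrary choices.

```python
# Sketch of a weighted sparse representation classifier (SRC): training
# utterances form a dictionary D, a test vector y is sparsely coded, residual
# weights are re-estimated in an iteratively reweighted loop, and the class
# with the smallest class-wise reconstruction residual is returned.
import numpy as np
from sklearn.linear_model import Lasso

def weighted_src_predict(D, labels, y, n_iter=3, alpha=0.01):
    """D: (n_features, n_train) training vectors column-wise,
    labels: (n_train,) class of each column, y: (n_features,) test vector."""
    w = np.ones(D.shape[0])                      # per-dimension residual weights
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        Dw, yw = D * w[:, None], y * w           # apply current weights
        coder = Lasso(alpha=alpha, max_iter=5000, fit_intercept=False)
        coder.fit(Dw, yw)
        x = coder.coef_                          # sparse code over training samples
        r = y - D @ x                            # unweighted residual
        w = 1.0 / (np.abs(r) + 1e-3)             # heuristic robust reweighting
        w /= w.max()
    # class-wise residuals: keep only the coefficients of one class at a time
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - D[:, labels == c] @ x[labels == c]) for c in classes]
    return classes[int(np.argmin(residuals))]
```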
- Book Chapter
- 10.1007/978-3-319-24033-6_11
- Jan 1, 2015
Speech signals are non-stationary processes that change in both time and frequency. The structure of a speech signal is also affected by several paralinguistic phenomena such as emotions, pathologies, and cognitive impairments, among others. Non-stationarity can be modeled using several parametric techniques. A novel approach based on time-dependent auto-regressive moving average (TARMA) models is proposed here to model the non-stationarity of speech signals. The model is tested on the recognition of fear-type emotions in speech. The proposed approach is applied to model syllables and unvoiced segments extracted from recordings of the Berlin and eNTERFACE'05 databases. The results indicate that TARMA models can be used for the automatic recognition of emotions in speech.
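A rough sketch of the time-dependent modelling idea follows. It is my own simplification, not the chapter's TARMA formulation: only the AR part is shown, the coefficients are expanded on a small polynomial basis in time, and the signal and model orders are placeholders.

```python
# Time-varying AR model: the AR coefficients a_k(t) are expanded on a
# polynomial basis in time and estimated jointly by least squares, capturing
# the non-stationary structure of a short speech segment (e.g., a syllable).
import numpy as np

def fit_tar(signal, order=4, basis_deg=2):
    """Fit a time-varying AR(order) model with polynomial-in-time coefficients.
    Returns the basis weights, shape (order, basis_deg + 1)."""
    N = len(signal)
    t = np.linspace(0.0, 1.0, N)
    rows, targets = [], []
    for n in range(order, N):
        past = signal[n - order:n][::-1]                      # x[n-1], ..., x[n-order]
        basis = np.array([t[n] ** d for d in range(basis_deg + 1)])
        rows.append(np.outer(past, basis).ravel())
        targets.append(signal[n])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta.reshape(order, basis_deg + 1)

# Usage on a synthetic chirp-like segment standing in for a syllable:
x = np.sin(2 * np.pi * np.cumsum(np.linspace(0.01, 0.05, 800)))
weights = fit_tar(x)     # basis weights of the time-varying AR coefficients
print(weights.shape)     # (4, 3); such parameters could feed an emotion classifier
```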
- Research Article
20
- 10.1177/2059204318762650
- Jan 1, 2018
- Music & Science
The acoustic cues that convey emotion in speech are similar to those that convey emotion in music, and recognition of emotion in both types of cue recruits overlapping networks in the brain. Given the similarities between music and speech prosody, developmental research is uniquely positioned to determine whether recognition of these cues develops in parallel. In the present study, we asked 60 children aged 6 to 11 years, and 51 university students, to judge the emotions of 10 musical excerpts, 10 inflected speech clips, and 10 affect burst clips. We presented stimuli intended to convey happiness, sadness, anger, fear, and pride. Each emotion was presented twice per type of stimulus. We found that recognition of emotions in music and speech developed in parallel, and that adult levels of recognition developed later for these stimuli than for affect bursts. We also found that sad stimuli were most easily recognised, followed by happiness, fear, and then anger. In addition, we found that recognition of emotion in speech and affect bursts can predict emotion recognition in music stimuli independently of age and musical training. Finally, although proud speech and affect bursts were not well recognised, children aged eight years and older showed adult-like responses in recognition of proud music.
- Research Article
14
- 10.1016/j.iswa.2024.200351
- Mar 11, 2024
- Intelligent Systems with Applications
In-depth investigation of speech emotion recognition studies from past to present –The importance of emotion recognition from speech signal for AI–
- Conference Article
14
- 10.1109/tencon.2015.7372840
- Nov 1, 2015
Emotional information in the speech signal is an important information resource. When verbal expression is combined with human emotion, emotional speech processing is no longer a simple mathematical model or pure calculation. Fluctuations of mood are controlled by the brain's perception; speech signal processing based on cognitive psychology can therefore capture emotion better. This paper first introduces a relevance analysis between speech emotion and human cognition. Recent progress in speech emotion recognition is summarized, including a review of speech emotion databases, feature extraction, and emotion recognition networks. Second, a fuzzy cognitive map network based on cognitive psychology is introduced into emotional speech recognition. In addition, the mechanism of the human brain for cognitive emotional speech is explored. To improve recognition accuracy, this report also tries to integrate event-related potentials into speech emotion recognition. This idea represents the conception and prospect of speech emotion recognition combined with cognitive psychology in the future.
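For readers unfamiliar with fuzzy cognitive maps, a toy sketch of the update rule such a network builds on is shown below. The concept names and weights are invented for illustration; the paper's actual map and features are not reproduced.

```python
# Fuzzy cognitive map inference: concept activations are repeatedly propagated
# through a signed weight matrix and squashed with a sigmoid until they settle.
import numpy as np

def fcm_infer(W, a0, steps=20):
    """W[i, j] = causal influence of concept i on concept j, a0 = initial activations."""
    a = a0.copy()
    for _ in range(steps):
        a = 1.0 / (1.0 + np.exp(-(a + a @ W)))   # sigmoid(A + A·W)
    return a

# Hypothetical concepts: [pitch mean, energy, speaking rate, arousal, "anger" output]
W = np.array([[0, 0, 0, 0.6, 0.3],
              [0, 0, 0, 0.5, 0.4],
              [0, 0, 0, 0.4, 0.2],
              [0, 0, 0, 0.0, 0.7],
              [0, 0, 0, 0.0, 0.0]], dtype=float)
a0 = np.array([0.8, 0.7, 0.6, 0.0, 0.0])         # observed acoustic cues
print(fcm_infer(W, a0))                           # settled activations, incl. anger score
```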
- Research Article
467
- 10.1016/j.neunet.2017.02.013
- Mar 21, 2017
- Neural Networks
Evaluating deep learning architectures for Speech Emotion Recognition
- Conference Article
- 10.12792/iciae2023.045
- Jan 1, 2023
Speaking is the main medium of communication in our daily life. It can convey thoughts and express emotional states between humans. The goal of speech emotion recognition is to recognize human emotional states from speech. Speech emotion recognition mainly includes two steps: feature extraction and classifier construction. Speech features usually refer to spectral features. The speech signal is an approximately continuous input, and its spectral features carry a considerable amount of information, such as speech content, rhythm, tone, and intonation. However, feature extraction for emotional speech is still an immature research direction. Influenced by the success of computer vision, the visualization of speech signals has become a new method for emotion recognition based on the acoustic features of speech. Based on the Gramian Angular Field method, this research uses a variety of neural network models to extract speech features and recognize speech emotions. According to the experiments, we find that it is feasible to visualize speech signals and use the resulting representations for emotion recognition. In the future, we will further optimize the network model and combine it with other acoustic features, such as speech content and rhythm, to perform speech emotion recognition in real-life settings.
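A minimal sketch of the Gramian Angular (summation) Field encoding this line of work relies on is given below: a rescaled series is mapped to polar angles and the pairwise cosine of summed angles forms an image a 2-D CNN can classify. The choice of input contour and its length are placeholders.

```python
# Gramian Angular Summation Field: rescale to [-1, 1], map to angles,
# take cos(phi_i + phi_j) to build an image from a 1-D series.
import numpy as np

def gramian_angular_field(x):
    x = np.asarray(x, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))                     # polar angles
    return np.cos(phi[:, None] + phi[None, :])                 # GASF image

# Usage: an energy contour (here synthetic) becomes an image for a CNN classifier.
contour = np.abs(np.sin(np.linspace(0, 6, 128)))
img = gramian_angular_field(contour)
print(img.shape)   # (128, 128)
```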
- Conference Article
12
- 10.23919/apsipaasc55919.2022.9979844
- Nov 7, 2022
Speech emotion recognition (SER) helps achieve better human-computer interaction and has thus attracted extensive attention from industry and academia. Speech emotion intensity plays an important role in describing emotion, but to the best of our knowledge its effect on emotion recognition has rarely been studied in SER. Previous studies have shown that there is a relationship between speech emotion intensity and emotion category, so the recognition tasks in a multi-task learning setting are expected to benefit each other. We propose a multi-task learning framework with a self-supervised speech representation extractor based on Wav2Vec 2.0 to detect speech emotion and intensity at the same time in downstream networks. Experimental results show that the multi-task learning framework outperforms SOTA SER models, achieving 5% and 7% SER performance improvements on IEMOCAP and RAVDESS respectively, thanks to the auxiliary task of emotion intensity recognition.
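A hedged sketch of the multi-task idea (not the authors' code) is shown below: a shared Wav2Vec 2.0 encoder feeds two small heads, one for emotion category and one for emotion intensity, trained with a weighted sum of cross-entropy losses. The checkpoint name, head sizes, pooling, and loss weight are assumptions.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultiTaskSER(nn.Module):
    def __init__(self, n_emotions=4, n_intensities=2, lambda_int=0.3):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.encoder.config.hidden_size           # 768 for the base model
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.intensity_head = nn.Linear(hidden, n_intensities)
        self.lambda_int = lambda_int
        self.ce = nn.CrossEntropyLoss()

    def forward(self, waveforms, emo_labels=None, int_labels=None):
        feats = self.encoder(waveforms).last_hidden_state  # (B, T, hidden)
        pooled = feats.mean(dim=1)                          # simple mean pooling
        emo_logits = self.emotion_head(pooled)
        int_logits = self.intensity_head(pooled)
        if emo_labels is None:
            return emo_logits, int_logits
        # joint loss: main SER task plus auxiliary intensity task
        return self.ce(emo_logits, emo_labels) + self.lambda_int * self.ce(int_logits, int_labels)

# Usage with a dummy batch of 1-second, 16 kHz waveforms:
model = MultiTaskSER()
loss = model(torch.randn(2, 16000), torch.tensor([0, 2]), torch.tensor([1, 0]))
```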
- Research Article
1
- 10.3390/jimaging11080273
- Aug 14, 2025
- Journal of Imaging
Emotion recognition in speech is essential for enhancing human–computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model’s robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time–frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Sequentially, we utilize a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent. For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets—SUBESCO, BanglaSER, and a merged version of both—as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves impressive results with accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech.
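The handcrafted 1-D feature vector described in this abstract can be condensed into the sketch below, with each feature summarized by its per-frame mean; the exact frame settings, augmentation, and the multi-stream CNN/LSTM/Bi-LSTM networks are not reproduced here, and the soft-voting helper assumes each model already outputs class probabilities.

```python
# ZCR, chromagram, spectral centroid/roll-off/contrast/flatness, MFCCs, RMS,
# and Mel-spectrogram, averaged over frames into one utterance-level vector.
import numpy as np
import librosa

def handcrafted_features(path, sr=22050, n_mfcc=20):
    y, sr = librosa.load(path, sr=sr)
    feats = [
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),
        librosa.feature.rms(y=y),
        librosa.feature.melspectrogram(y=y, sr=sr),
    ]
    # mean over frames -> one fixed-length 1-D vector per utterance
    return np.concatenate([f.mean(axis=1) for f in feats])

def soft_vote(prob_matrices):
    """Average per-model class-probability matrices (soft voting), return class ids."""
    return np.mean(np.stack(prob_matrices), axis=0).argmax(axis=1)
```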
- Research Article
3
- 10.2298/fuee1403425b
- Jan 1, 2014
- Facta universitatis - series: Electronics and Energetics
Due to the advance of speech technologies and their increasing usage in various applications, automatic recognition of emotions in speech represents one of the emerging fields in human-computer interaction. This paper deals with several topics related to automatic emotional speech recognition, most notably with the improvement of recognition accuracy by lowering the dimensionality of the feature space and evaluation of the relevance of particular feature types. The research is focused on the classification of emotional speech into five basic emotional classes (anger, joy, fear, sadness and neutral speech) using a recorded corpus of emotional speech in Serbian.
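A generic sketch of the two steps this abstract discusses follows; the specific techniques (ANOVA F-scores, SelectKBest, LDA, SVM) are assumptions of mine, not taken from the paper, and the data are placeholders for utterance-level features over the five classes.

```python
# Score per-feature relevance, reduce feature-space dimensionality, then classify
# into five emotion classes (anger, joy, fear, sadness, neutral).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X = np.random.randn(200, 120)            # placeholder utterance-level features
y = np.random.randint(0, 5, size=200)    # placeholder labels for 5 emotion classes

relevance = f_classif(X, y)[0]           # per-feature ANOVA F-score as a relevance measure
clf = make_pipeline(
    SelectKBest(f_classif, k=40),                 # keep the 40 most relevant features
    LinearDiscriminantAnalysis(n_components=4),   # at most n_classes - 1 components
    SVC(kernel="rbf"))
clf.fit(X, y)
```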
- Research Article
9
- 10.1016/j.heliyon.2022.e09196
- Mar 1, 2022
- Heliyon
Vector learning representation for generalized speech emotion recognition
- Conference Article
- 10.1109/tencon50793.2020.9293820
- Nov 16, 2020
The technological capabilities of computers continue to improve in ways that seemed impossible before. It is common knowledge that most people use computers to make everyday life easier. Therefore, it is vital to bridge the gap between humans and computers to provide more suitable aid to the user. One way to do this is to use emotion recognition as a tool so that the computer can understand and analyze how it can help its user on a much deeper level. This paper proposes a way to use both face and speech emotion recognition as a basis for selecting appropriate music that can improve or relieve one's emotion or stress. To accomplish this, Support Vector Machines with different kernels are used to create the models for validation and testing of both face and speech emotion recognition. The final integrated system yielded an accuracy rate of 78.5%.
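A small sketch of the classifier side of such a setup is given below: separate SVMs (tried with different kernels) for the face and speech modalities, with their class probabilities averaged to pick an emotion that then drives the music-selection step. Feature extraction for both modalities and the fusion rule are assumptions; the data are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def train_modality_svm(X, y, kernel="rbf"):
    # probability=True enables predict_proba for score-level fusion
    return SVC(kernel=kernel, probability=True).fit(X, y)

# Placeholder features for the two modalities (same samples, same labels)
rng = np.random.default_rng(0)
X_face, X_speech = rng.normal(size=(100, 64)), rng.normal(size=(100, 40))
y = rng.integers(0, 4, size=100)

face_svm = train_modality_svm(X_face, y, kernel="rbf")
speech_svm = train_modality_svm(X_speech, y, kernel="poly")
fused = (face_svm.predict_proba(X_face) + speech_svm.predict_proba(X_speech)) / 2
predicted_emotion = fused.argmax(axis=1)   # feeds the music-selection logic
```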
- Research Article
52
- 10.1109/access.2020.2990974
- Jan 1, 2020
- IEEE Access
Driven by the vision of the Internet of Things, some research efforts have already focused on designing efficient speech recognition networks for edge computing. Other approaches (such as tpool2) do not make full use of the spatial and temporal information in the acoustic features of speech. In this paper, we propose a compact speech recognition network with spatio-temporal features for edge computing, named EdgeRNN. EdgeRNN uses a 1-dimensional convolutional neural network (1-D CNN) to process the overall spatial information of each frequency domain of the acoustic features, and a recurrent neural network (RNN) to process the temporal information of each frequency domain. In addition, we propose a simplified attention mechanism to enhance the portions of the network that contribute to the final identification. The overall performance of EdgeRNN has been verified on speech emotion and keyword recognition. The IEMOCAP dataset is used for speech emotion recognition, reaching an unweighted average recall (UAR) of 63.98%. Speech keyword recognition uses Google's Speech Commands Dataset V1, with a weighted average recall (WAR) of 96.82%. Compared with the experimental results of related efficient networks on a Raspberry Pi 3B+, EdgeRNN improves accuracy on both speech emotion and keyword recognition.
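A hedged PyTorch sketch of the EdgeRNN idea as described in this abstract follows: a 1-D CNN over the frequency axis of each frame, a GRU over time, and a simplified attention that reweights time steps before classification. All layer sizes, the GRU choice, and the pooling are my assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class EdgeRNNSketch(nn.Module):
    def __init__(self, n_mels=40, n_classes=4, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                        # spatial info within each frame
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(n_mels // 2))
        self.rnn = nn.GRU(16 * (n_mels // 2), hidden, batch_first=True)  # temporal info
        self.attn = nn.Linear(hidden, 1)                 # simplified attention scores
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                                # x: (B, T, n_mels)
        B, T, F = x.shape
        z = self.cnn(x.reshape(B * T, 1, F)).reshape(B, T, -1)
        h, _ = self.rnn(z)                               # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)           # (B, T, 1) attention weights
        return self.fc((w * h).sum(dim=1))               # attention-pooled logits

logits = EdgeRNNSketch()(torch.randn(2, 100, 40))        # 2 utterances, 100 frames
```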
- Conference Article
1
- 10.21437/interspeech.2009-585
- Sep 6, 2009
As an active research field, speech emotion recognition has attracted increasing attention from both academia and industry. In this paper, we propose a method to recognize speech emotions using ANNs and to fuse two kinds of recognizers built on different features at the decision level. Each emotional utterance is first recognized by several individual recognizers; the outputs of these recognizers are then fused with a voting strategy. Furthermore, the dimensionality of supervectors constructed from spectral features is reduced through PCA. Experimental results demonstrate that the proposed decision fusion is effective and the dimensionality reduction is feasible.
Index Terms: speech emotion recognition, ANN, decision fusion
1. Introduction. Speech is a dominant tool for communication, and it is also an important and effective approach for transmitting information and human emotions. With the increasing role of speech interfaces in human-machine interaction applications, speech emotion recognition has become more and more important. It is an interesting and challenging speech technology that can be applied to broad areas, such as call centers [1], treatment of mental and psychological diseases [2], and the development of education and entertainment software [3]. Speech emotion recognition deals with how to make the computer automatically recognize various emotions in the speech signal by extracting and analyzing acoustic features. A key problem in speech emotion recognition is which kinds of speech features can be used to represent human emotions. Some researchers have investigated the relations between features and emotions, and through their efforts many speech features have been found useful for emotion recognition. Statistical features based on prosody and voice quality have been widely used in speech emotion recognition and have demonstrated considerable recognition success [4, 5]. Besides statistical features, spectral or cepstral features are another effective group for describing emotional states [6, 7]. Since both kinds of features have played significant roles in speech emotion recognition, it is necessary to explore an effective way to fuse them complementarily to further enhance recognition performance. Another key issue is how to choose an effective method to classify speech emotions. So far, many pattern classification methods have been used for speech emotion recognition [6-9], such as Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN). These methods are all feasible, but their performances differ considerably. SVM-based methods have been shown to be robust and to perform well, but recent research indicates that comparable performance may be obtained with ANNs as well. However, it is difficult to determine which kind of ANN is suitable for emotion recognition, and it is necessary to compare its performance with SVMs. In this paper, ANN-based decision fusion for speech emotion recognition is presented. First, four different ANNs are used to recognize various emotions; then a voting scheme is adopted to fuse recognitions using the two kinds of features at the decision level.
Experimental results demonstrate that the proposed approach improves the performance of ANN-based recognition and that its accuracy is comparable with the SVM-based method. The remainder of this paper is organized as follows. The features used for speech emotion recognition are introduced in Section 2. The principles of PCA and ANN are briefly described in Sections 3 and 4, respectively. The proposed decision fusion is depicted in Section 5. In Section 6, experiments and discussions of the results are presented. In Section 7, conclusions are drawn and future work is suggested.
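A compact sketch of the pipeline outlined above is given below: PCA-reduced spectral supervectors and prosodic/voice-quality statistics each feed their own small neural networks, and the per-utterance decisions are fused by majority voting. The network sizes, the number of recognizers, and the placeholder data are assumptions; the paper's specific ANN types are not reproduced.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_prosodic = rng.normal(size=(300, 60))      # placeholder prosody/voice-quality statistics
X_spectral = rng.normal(size=(300, 1000))    # placeholder spectral supervectors
y = rng.integers(0, 5, size=300)

X_spectral_pca = PCA(n_components=50).fit_transform(X_spectral)   # reduce supervector dim

nets, feats = [], []
for X in (X_prosodic, X_spectral_pca):
    for hidden in (32, 64):                  # a couple of ANNs per feature kind
        nets.append(MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500).fit(X, y))
        feats.append(X)

votes = np.stack([net.predict(X) for net, X in zip(nets, feats)])
fused = stats.mode(votes, axis=0, keepdims=False).mode             # majority vote per utterance
```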
- Conference Article
4
- 10.1145/3395035.3425252
- Oct 25, 2020
While both speech emotion recognition and music emotion recognition have been studied extensively in different communities, little research has gone into the recognition of emotion from mixed audio sources, i.e. when both speech and music are present. However, many application scenarios, such as television content, require models that are able to extract emotions from mixed audio sources. This paper studies how mixed audio affects both speech and music emotion recognition using a random forest and a deep neural network model, and investigates whether blind source separation of the mixed signal beforehand is beneficial. We created a mixed audio dataset with 25% speech-music overlap and no contextual relationship between the two. We show that specialized models for speech-only or music-only audio achieved merely chance-level performance on mixed audio. For speech, above-chance performance was achieved when training on raw mixed audio, but optimal performance was achieved when the audio was blind-source-separated beforehand. Music emotion recognition models on mixed audio achieve performance approaching or even surpassing performance on music-only audio, with and without blind source separation. Our results are important for estimating emotion from real-world data, where individual speech and music tracks are often not available.