Speech Emotion Recognition in Arabic Language: A Review
This review examines speech emotion recognition in Arabic, highlighting challenges such as limited emotional databases and feature selection. It surveys existing approaches, emphasizing the nascent stage of Arabic SER research and aiming to support future developments in this area.
Interpreting human emotions through speech has attracted great attention in human–computer interaction and artificial intelligence, and speech emotion recognition (SER) systems have become a significant field of research. SER, one of the interesting directions in speech processing, aims to predict the expressed emotional state. SER systems encounter numerous challenges, such as the availability of appropriate emotional databases, the identification of suitable speech features, and the choice of an appropriate classification method. SER systems are mostly implemented in English, French, German, Indian, and Chinese languages. However, SER for the Arabic language is still in the growing phase. This work presents a literature review on SER in Arabic in terms of emotional databases, speech features, and classification algorithms. This review contributes to filling the gap in the works on emotion recognition available in the Arabic language and constitutes a valuable resource for researchers in this field.
- Research Article
26
- 10.1016/j.apacoust.2020.107519
- Jul 22, 2020
- Applied Acoustics
Investigation of multilingual and mixed-lingual emotion recognition using enhanced cues with data augmentation
- Research Article
8
- 10.1007/s10462-024-10760-z
- May 21, 2024
- Artificial Intelligence Review
Speech emotion recognition (SER) systems leverage information derived from sound waves produced by humans to identify the concealed emotions in utterances. Since 1996, researchers have placed effort on improving the accuracy of SER systems, their functionalities, and the diversity of emotions that can be identified by the system. Although SER systems have become very popular in a variety of domains in modern life and are highly connected to other systems and types of data, the security of SER systems has not been adequately explored. In this paper, we conduct a comprehensive analysis of potential cyber-attacks aimed at SER systems and the security mechanisms that may prevent such attacks. To do so, we first describe the core principles of SER systems and discuss prior work performed in this area, which was mainly aimed at expanding and improving the existing capabilities of SER systems. Then, we present the SER system ecosystem, describing the dataflow and interactions between each component and entity within SER systems and explore their vulnerabilities, which might be exploited by attackers. Based on the vulnerabilities we identified within the ecosystem, we then review existing cyber-attacks from different domains and discuss their relevance to SER systems. We also introduce potential cyber-attacks targeting SER systems that have not been proposed before. Our analysis showed that only 30% of the attacks can be addressed by existing security mechanisms, leaving SER systems unprotected in the face of the other 70% of potential attacks. Therefore, we also describe various concrete directions that could be explored in order to improve the security of SER systems.
- Research Article
163
- 10.3390/s20185212
- Sep 12, 2020
- Sensors
Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, the speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine. In addition, making it smarter so that the emotions are efficiently recognized by AI is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions, such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for SER, the success rates are very low, depending on the languages, the emotions, and the databases. In this paper, we propose a new lightweight, effective SER model that has a low computational complexity and a high recognition accuracy. The suggested method uses the convolutional neural network (CNN) approach to learn deep frequency features, which have more discriminative power for SER, by using a plain rectangular filter with a modified pooling strategy. The proposed CNN model was trained on the extracted frequency features from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated over two benchmarks, the interactive emotional dyadic motion capture (IEMOCAP) and the Berlin emotional speech database (EMO-DB) speech datasets, and it obtained recognition results of 77.01% and 92.02%, respectively. The experimental results demonstrated that the proposed CNN-based SER system can achieve a better recognition performance than the state-of-the-art SER systems.
- Research Article
104
- 10.1016/j.apacoust.2023.109492
- Jun 28, 2023
- Applied Acoustics
Emotional speech Recognition using CNN and Deep learning techniques
- Conference Article
- 10.1109/bharat53139.2022.00042
- Apr 1, 2022
The speech emotion recognition (SER) system categorizes human emotions based on contextual features. However, it is seriously affected during signal transmission, in which the quality of real-time speech processing in the SER system is degraded. This paper presents refined feature vectors for human emotion classifiers based on multiple learning strategies combined with recurrent neural networks (RefineHERNN). It extracts spatial emotional vectors by observing speech signals for contextual feature dependency through the multiple learning (ML) approach. It computes signal interpretation, emotional cues, and input correction by using the skip connection (SC) module in the residual block of the ML strategy. The fused layer is simple to concentrate derived features that support automatic learning of classifying different human emotions. For experimental purposes, the standard IEMOCAP and MSP-IMPROV datasets are considered for validating the proposed method. Results convey that the proposed method has significant improvement (in terms of percentage closer to 80% higher than the existing CNN result) in the feature recognition and is flexible for real-time implementation in the SER system. Moreover, it can extend to automatic sensing of human emotion with the help of a lightweight RNN framework.
- Conference Article
5
- 10.1109/asyu52992.2021.9598956
- Oct 6, 2021
Data scarcity and speech degradation due to environmental noise are two significant issues in the modelling and deployment of speech emotion recognition (SER) systems. Deep learning-based SER systems overfit during modelling because of scarce training samples. Although recent attempts to tackle these issues simultaneously using data augmentation have yielded promising results, they are not robust enough to handle speech degradation due to real environmental noise. Thus, there is a need to further improve the classification performance of deployed SER systems. This work proposes an SER system based on a novel robust multi-window spectrogram augmentation (RMWSaug) scheme and transfer learning to handle these issues simultaneously. First, the RMWSaug scheme utilizes the concept of multi-window and multi-noise conditioning of clean speech samples to create the additional speech spectrograms required for training. Then, pretrained networks are adapted for speech emotion recognition and fine-tuned with the generated training datasets to develop a model robust to speech degradation due to noise, thereby improving the classification performance in the wild. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database was selected as the benchmark dataset for evaluating the proposed SER system. Experimental results show that the proposed SER system outperformed existing methods when deployed in the wild. The proposed SER system can be deployed to predict the emotions of speakers conversing virtually on online platforms.
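The noise-conditioning idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the window sizes, noise types, or SNR levels used by RMWSaug, so the function names (`mix_at_snr`, `augment`, `measured_snr_db`) and the SNR levels below are assumptions chosen for illustration. The sketch only covers mixing a clean waveform with noise at a target signal-to-noise ratio; computing spectrograms with multiple windows is left out.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean waveform, scaled so the mixture has the given SNR in dB."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise must have non-zero power")
    # Choose gain g so that p_clean / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(clean, noise)]


def augment(clean, noise, snr_levels_db=(0, 10, 20)):
    """Create one noisy copy of the clean waveform per SNR level (levels are assumed)."""
    return {snr: mix_at_snr(clean, noise, snr) for snr in snr_levels_db}


def measured_snr_db(clean, noisy):
    """Recover the SNR of a mixture, given the clean reference."""
    p_clean = sum(x * x for x in clean)
    p_resid = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10 * math.log10(p_clean / p_resid)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
    for snr, noisy in augment(clean, noise).items():
        print(snr, round(measured_snr_db(clean, noisy), 3))
```

Each noisy copy would then be converted to a spectrogram and added to the training set; in a real pipeline the noise would come from recorded environmental sounds rather than a random generator.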
- Research Article
105
- 10.1007/s11042-020-09874-7
- Jan 2, 2021
- Multimedia Tools and Applications
Speech emotion recognition (SER) systems identify emotions from the human voice in the areas of smart healthcare, driving a vehicle, call centers, automatic translation systems, and human-machine interaction. In the classical SER process, discriminative acoustic feature extraction is the most important and challenging step because discriminative features influence the classifier performance and decrease the computational time. Nonetheless, current handcrafted acoustic features suffer from limited capability and accuracy in constructing a SER system for real-time implementation. Therefore, to overcome the limitations of handcrafted features, in recent years, a variety of deep learning techniques have been proposed and employed for automatic feature extraction in the field of emotion prediction from speech signals. However, to the best of our knowledge, no in-depth review study is available that critically appraises and summarizes the existing deep learning techniques with their strengths and weaknesses for SER. Hence, this study aims to present a comprehensive review of deep learning techniques, their uniqueness, benefits, and limitations for SER. Moreover, this review study also presents speech processing techniques, performance measures and publicly available emotional speech databases. Furthermore, this review also discusses the significance of the findings of the primary studies. Finally, it also presents open research issues and challenges that need significant research efforts and enhancements in the field of SER systems.
- Research Article
103
- 10.1016/j.specom.2019.09.002
- Sep 19, 2019
- Speech Communication
Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO
- Dissertation
- 10.32657/10356/200448
- Jan 1, 2024
Emotions are an integral part of social interactions, serving vital functions in the management of our interpersonal relationships. Cues to emotional meaning can be conveyed through a variety of modalities, from facial expressions and body gestures to the linguistic and non-linguistic, or prosodic aspects of speech. As automatic speech systems continue to grow in use, the ability to recognize emotional cues from these modalities is no longer solely expected in human communication, but also in human-computer interactions with the goal of developing more user-friendly and effective applications. This thesis explores how speech emotion recognition (SER) systems can be more robust to factors of language, culture, and context through the lens of emotional speech and emotional prosody from bilingual English-speaking Singaporeans of Chinese, Malay, or Indian ethnicity. Specifically, the thesis aims to examine how Singaporeans from the three ethnic groups may vary in their perception of emotions in speech, and to explore how they express emotions via speech prosody. The recognition and production processes of emotional communication in Singapore English were investigated through two emotion classes – anger and happiness. An emotion perception test was conducted using acted angry and happy speech as stimuli to understand how Singapore English speakers recognized emotions in speech. The results of the perception test revealed that respondents displayed very similar patterns in their rating of emotional speech stimuli. The study also found that respondents tended to rate anger with low dominance. At the same time, a spontaneous speech corpus known as the Singapore English Spontaneous Emotional Speech Corpus (SESESC) was also developed, featuring audio recordings of 44 Singapore English speakers from the three ethnic groups playing a co-operative video game with either a familiar partner or a confederate.
More than 36,000 intervals were annotated in-context of the recordings by each annotator based on a modified emotion annotation scheme that included information regarding the emotion category and arousal levels. Intervals that were labelled as conveying joy or anger with full agreement from the annotators were selected following a series of criteria to facilitate comparable prosodic analyses given the inherent dynamicity of spontaneous speech. The analysis focused on exploring the F0 movements of the selected utterances in tandem with periodic energy. Overall, it was observed that utterances conveying happiness were associated more with level contours, while angry utterances were produced with more steep falling contours especially on the utterance-final syllable. A tendency for an F0 peak to be produced on the second syllable of the utterance was also observed in both sets of data analyzed, alluding to the interactions between emotional prosody and the intonational structure of Singapore English. Finally, an emotion classification was conducted using a subset of the corpus as test data for a fine-tuned SER model. Results showed that while utterances conveying happiness could be automatically recognized to a high percentage of 86%, there was also a high percentage of misclassification of anger. This result appears to counter findings in past SER studies, which typically reported high levels of recognition accuracy for anger. Taken together, the thesis contributes an understanding of how spontaneous speech and emotional cues on the prosodic level interact even in task-oriented conversations, with further considerations of how these patterns may vary among bilinguals in multicultural contexts, so as to enable more robust speech emotion recognition systems in the future.
- Conference Article
29
- 10.1109/icicict1.2017.8342835
- Jul 1, 2017
In this paper, a smart music system is designed by recognizing emotion using the voice speech signal as an input. The objective of the speech emotion recognition (SER) system is to determine the state of emotion of a human being's voice. This study recognizes five emotions: anger, anxiety, boredom, happiness and sadness. The important aspects in implementing this SER system include speech processing using the Berlin emotional database, then extracting suitable features and selecting appropriate pattern recognition or classifier methods to identify the emotional states. Once the emotion of the speech is recognized, the system platform automatically selects a piece of music as a cheer-up strategy from the stored database of song playlists. The analysis results show that this SER system implemented over five emotions provides successful emotional classification performance of 76.31% using a GMM model and an overall better accuracy of 81.57% with an SVM model.
- Research Article
- 10.14445/23488549/ijece-v12i1p117
- Jan 30, 2025
- International Journal of Electronics and Communication Engineering
Speech Emotion Recognition (SER) is the technique of recognizing and classifying emotions expressed in spoken language using audio features. Human-computer interaction must enable machines to accurately perceive and respond to human emotions. Numerous challenges, like capturing both spatial and temporal features in speech signals, impact the accuracy of emotion recognition models. Conventional emotion recognition systems heavily depend on manual feature extraction and classification, which require significant effort and often lead to errors in detection. Advances in image processing and Artificial Intelligence (AI) have introduced hybrid Deep Learning (DL) approaches to improve SER tasks. This study developed an efficient SER system utilizing a hybrid DL model combined with an ensemble approach to accurately classify emotions expressed through speech. The models were evaluated on the CREMA dataset, which contains 7,442 audio samples across six different emotions. After preprocessing and data augmentation, Mel Frequency Cepstral Coefficients (MFCC) were captured as features from speech data. The proposed models include CNN-LSTM and CNN-GRU to extract both spatial and temporal features. Outputs from these frameworks were combined using an ensemble learning approach with a Support Vector Machine (SVM) classifier as the meta-learner. Experimental results indicate that the suggested model attained improved performance with an accuracy of 98.69%, precision of 98.70%, recall of 98.72% and an F1 score of 98.70%. The results highlight the effectiveness of combining advanced neural networks for achieving high performance in emotion detection from speech signals, providing valuable information for developing real-time emotion recognition systems and enhancing human-computer interaction.
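The four figures reported in this abstract (accuracy, precision, recall, F1) can be computed from predicted and true labels as in the sketch below. The abstract does not say whether the multi-class scores are macro-averaged, weighted, or micro-averaged, so macro-averaging is an assumption here, and the function name `macro_metrics` and the toy labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 for multi-class labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,  # unweighted mean over classes
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    y_true = ["angry", "angry", "happy", "happy", "sad", "sad"]
    y_pred = ["angry", "happy", "happy", "happy", "sad", "angry"]
    print(macro_metrics(y_true, y_pred))
```

Macro-averaging weights every emotion class equally, which matters when classes are imbalanced; a weighted average would instead scale each class by its number of samples.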
- Conference Article
5
- 10.1109/lt58159.2023.10092295
- Jan 26, 2023
The Speech Emotion Recognition (SER) system is an approach to identify individuals' emotions. This is important for human-machine interface applications and for the emerging Metaverse. This work presents a bilingual Arabic-English speech emotion recognition system based on the EYASE and RAVDESS datasets. A novel feature set was composed by using spectral and prosodic parameters to obtain high performance at a low computational cost. Different classification models were applied. These machine learning classifiers are Random Forest, Support Vector Machine, Logistic Regression, Multi-Layer Perceptron, and Ensemble learning. The proposed feature set performance was compared to the "Interspeech 2009" challenge feature set, which is considered a benchmark in the field. Promising results were obtained using the proposed feature sets. SVM resulted in the best emotion recognition rate and execution performance. The best accuracies achieved were 85% on RAVDESS and 64% on EYASE. Ensemble learning detected the valence emotion with 90% on RAVDESS and 87.6% on EYASE.
- Conference Article
3
- 10.1109/netact.2017.8076811
- Jul 1, 2017
The growth in human-computer interaction has necessitated accurate recognition of emotion from speech data. This paper presents a novel feature called TEO (Teager Energy Operator) Slope for emotion recognition. The feature is obtained by applying a least-squares fit instead of applying the DCT to the TEO feature. The feature was tested on the publicly available Berlin Emotion Database (EMO-DB) using a GMM classifier. The TEO Slope feature-based emotion recognition system shows a significant improvement over the MFCC-based baseline system by 2% and the TEO feature-based system by 6% in terms of overall accuracy. The feature set obtained by fusion of MFCC, its delta and TEO Slope was also evaluated, and 60% accuracy was obtained. The results show that the TEO Slope is a promising feature in speech emotion recognition systems.
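The two building blocks named in this abstract can be sketched as follows. The discrete Teager Energy Operator has a standard definition, psi[n] = x[n]^2 - x[n-1] * x[n+1]. The abstract does not say which quantities the line is fitted to, so fitting a least-squares line to the per-frame TEO values over the sample index is an assumption, and the function names are hypothetical.

```python
import math


def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def least_squares_slope(values):
    """Slope of the least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to fit a line")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def teo_slope(frame):
    """Assumed form of the feature: slope of a line fitted to the frame's TEO values."""
    return least_squares_slope(teager_energy(frame))


if __name__ == "__main__":
    # For a pure tone A*cos(w*n) the TEO is constant at A^2 * sin(w)^2,
    # so the fitted slope is (numerically) zero.
    tone = [2.0 * math.cos(0.3 * n) for n in range(64)]
    print(teager_energy(tone)[:3], teo_slope(tone))
```

A single slope per frame or band is a much more compact summary than a set of DCT coefficients, which is one plausible reading of why the paper replaces the DCT step with a line fit.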
- Research Article
13
- 10.1016/j.procs.2022.09.345
- Jan 1, 2022
- Procedia Computer Science
Multiple Models Fusion for Multi-label Classification in Speech Emotion Recognition Systems
- Research Article
2
- 10.4114/intartif.vol28iss76pp85-123
- Jun 17, 2025
- Inteligencia Artificial
This study employs Explainable Artificial Intelligence (XAI) techniques, including SHAP, LIME, and XGBoost, to interpret speech-emotion recognition (SER) models. Unlike previous work focusing on generic datasets, this research integrates these tools to explore the unique emotional nuances within an Afrikaans speech corpus. The complexity of architectures poses significant challenges regarding model interpretability. This paper explicitly aims to bridge the gaps in existing SER systems by integrating advanced XAI techniques. The objective is to develop an ensemble stacking model that combines CNN, CLSTM, and XGBoost, augmented by SHAP and LIME, to enhance the interpretability, accuracy, and adaptability of SER systems, particularly for underrepresented languages like Afrikaans. Our research methodology involves utilising XAI methods to explain the decision-making processes of CNN and CLSTM models in SER to enhance trust, diagnostic insight, and theoretical understanding. We train the models for SER using a comprehensive dataset of emotional speech samples. Post-training, we apply SHAP and LIME to these models to generate explanations for their predictions, focusing on the importance of features and the models’ decision logic. By comparing the explanations generated by SHAP and LIME, we assess the efficacy of each method in providing meaningful insights into the models’ operations. The comparative study of various models in SER demonstrates their capability to discern complex emotional states through diverse analytical approaches, from spatial feature extraction to temporal dynamics. Our research reveals that XAI techniques improve the interpretability of complex SER models. This enhanced transparency builds end-user trust and provides valuable insights.
This study contributes to the importance of explainability in deploying AI technologies in emotionally sensitive applications, paving the way for more accountable and user-centric SER systems.
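The feature attributions that SHAP reports are estimates of Shapley values. As a rough illustration of the underlying quantity, the sketch below computes exact Shapley values for a tiny model by enumerating feature subsets. It does not use the SHAP library or its API, and the linear toy model, its weights, and the three-feature input are all hypothetical; the study's CNN and CLSTM models would need the library's approximations, since exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for value_fn, a function of a frozenset of feature indices."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def make_value_fn(model, x, baseline):
    """Features in the coalition keep their real value; the rest use the baseline."""
    def value_fn(coalition):
        masked = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
        return model(masked)
    return value_fn


if __name__ == "__main__":
    # Hypothetical linear scorer over three acoustic features (illustration only).
    weights = [2.0, -1.0, 0.5]
    model = lambda feats: sum(w * f for w, f in zip(weights, feats))
    x = [1.0, 3.0, 4.0]
    baseline = [0.0, 1.0, 2.0]
    print(shapley_values(make_value_fn(model, x, baseline), len(x)))
```

For a linear model each attribution reduces to the weight times the feature's deviation from the baseline, and the attributions sum to the difference between the prediction and the baseline prediction, which makes this toy case easy to check by hand.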