Speech Emotion Recognition in Arabic Language: A Review

TL;DR

This review examines speech emotion recognition in Arabic, highlighting challenges such as limited emotional databases and feature selection. It surveys existing approaches, emphasizing the nascent stage of Arabic SER research and aiming to support future developments in this area.

Abstract

Interpreting human emotions through speech has attracted great attention in human–computer interaction and artificial intelligence, and speech emotion recognition (SER) systems have become a significant field of research. SER, one of the interesting directions in speech processing, aims to predict the emotional state expressed by a speaker. SER systems encounter numerous challenges, such as the availability of appropriate emotional databases, the identification of suitable speech features, and the choice of an appropriate classification method. SER systems have mostly been implemented for English, French, German, Indian, and Chinese languages, whereas SER for the Arabic language is still in a growing phase. This work presents a literature review of SER in Arabic in terms of emotional databases, speech features, and classification algorithms. The review helps fill the gap in the work on emotion recognition available for the Arabic language and constitutes a valuable resource for researchers in this field.

Similar Papers
  • Research Article
  • Cite Count Icon 26
  • 10.1016/j.apacoust.2020.107519
Investigation of multilingual and mixed-lingual emotion recognition using enhanced cues with data augmentation
  • Jul 22, 2020
  • Applied Acoustics
  • S Lalitha + 3 more

  • Research Article
  • Cite Count Icon 8
  • 10.1007/s10462-024-10760-z
Speech emotion recognition systems and their security aspects
  • May 21, 2024
  • Artificial Intelligence Review
  • Itzik Gurowiec + 1 more

Speech emotion recognition (SER) systems leverage information derived from sound waves produced by humans to identify the concealed emotions in utterances. Since 1996, researchers have placed effort on improving the accuracy of SER systems, their functionalities, and the diversity of emotions that can be identified by the system. Although SER systems have become very popular in a variety of domains in modern life and are highly connected to other systems and types of data, the security of SER systems has not been adequately explored. In this paper, we conduct a comprehensive analysis of potential cyber-attacks aimed at SER systems and the security mechanisms that may prevent such attacks. To do so, we first describe the core principles of SER systems and discuss prior work performed in this area, which was mainly aimed at expanding and improving the existing capabilities of SER systems. Then, we present the SER system ecosystem, describing the dataflow and interactions between each component and entity within SER systems and explore their vulnerabilities, which might be exploited by attackers. Based on the vulnerabilities we identified within the ecosystem, we then review existing cyber-attacks from different domains and discuss their relevance to SER systems. We also introduce potential cyber-attacks targeting SER systems that have not been proposed before. Our analysis showed that only 30% of the attacks can be addressed by existing security mechanisms, leaving SER systems unprotected in the face of the other 70% of potential attacks. Therefore, we also describe various concrete directions that could be explored in order to improve the security of SER systems.

  • Research Article
  • Cite Count Icon 163
  • 10.3390/s20185212
Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features.
  • Sep 12, 2020
  • Sensors
  • Tursunov Anvarjon + 2 more

Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, the speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine. In addition, making it smarter so that the emotions are efficiently recognized by AI is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions, such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for SER, the success rates are very low depending on the language, the emotions, and the database. In this paper, we propose a new lightweight, effective SER model that has a low computational complexity and a high recognition accuracy. The suggested method uses the convolutional neural network (CNN) approach to learn the deep frequency features by using a plain rectangular filter with a modified pooling strategy that have more discriminative power for SER. The proposed CNN model was trained on the frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated on two benchmark speech datasets, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Berlin emotional speech database (EMO-DB), and it obtained recognition results of 77.01% and 92.02%, respectively. The experimental results demonstrated that the proposed CNN-based SER system can achieve a better recognition performance than the state-of-the-art SER systems.

  • Research Article
  • Cite Count Icon 104
  • 10.1016/j.apacoust.2023.109492
Emotional speech Recognition using CNN and Deep learning techniques
  • Jun 28, 2023
  • Applied Acoustics
  • C Hema + 1 more

  • Conference Article
  • 10.1109/bharat53139.2022.00042
Refined Feature Vectors for Human Emotion Classifier by combining multiple learning strategies with Recurrent Neural Networks
  • Apr 1, 2022
  • K Swetha + 1 more

The speech emotion recognition (SER) system categorizes human emotions based on contextual features. However, it is seriously affected during signal transmission, in which the quality of real-time speech processing in the SER system is degraded. This paper presents refined feature vectors for human emotion classifiers based on multiple learning strategies combined with recurrent neural networks (Refine-HE-RNN). It extracts spatial emotional vectors by observing speech signals for contextual feature dependency through the multiple learning (ML) approach. It computes signal interpretation, emotional cues, and input correction by using the skip connection (SC) module in the residual block of the ML strategy. The fused layer is simple to concentrate derived features that support automatic learning of classifying different human emotions. For experimental purposes, the standard IEMOCAP and MSP-IMPROV datasets are considered for validating the proposed method. Results convey that the proposed method has significant improvement (in terms of percentage closer to 80% higher than the existing CNN result) in the feature recognition and is flexible for real-time implementation in the SER system. Moreover, it can extend to automatic sensing of human emotion with the help of a lightweight RNN framework.

  • Conference Article
  • Cite Count Icon 5
  • 10.1109/asyu52992.2021.9598956
RMWSaug: Robust Multi-window Spectrogram Augmentation Approach for Deep Learning based Speech Emotion Recognition
  • Oct 6, 2021
  • Shehu Mohammed Yusuf + 4 more

Data scarcity and speech degradation due to environmental noise are two significant issues in the modelling and deployment of speech emotion recognition (SER) systems. Deep learning-based SER systems overfit during modelling because of scarce training samples. Although recent attempts to tackle these issues simultaneously using data augmentation have yielded promising results, they are not robust enough to handle speech degradation due to real environmental noise. Thus, there is a need to further improve the classification performance of deployed SER systems. This work proposes an SER system based on a novel robust multi-window spectrogram augmentation (RMWSaug) scheme and transfer learning to handle these issues simultaneously. First, the RMWSaug scheme utilizes the concept of multi-window and multi-noise conditioning of clean speech samples to create additional speech spectrograms required for training. Then, pretrained networks are adapted for speech emotion recognition and fine-tuned with the generated training datasets to develop a model robust to speech degradation due to noise, thereby improving the classification performance in the wild. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database was selected as the benchmark dataset for evaluating the proposed SER system. Experimental results show that the proposed SER system outperformed existing methods when deployed in the wild. The proposed SER system can be deployed to predict the emotions of speakers conversing virtually on online platforms.
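
The abstract above does not spell out how the "multi-noise conditioning" of clean speech is done. As a rough illustration of one common ingredient of noise-based augmentation, the sketch below mixes a noise signal into a clean signal at a chosen signal-to-noise ratio (SNR). This is an assumption for illustration only, not the RMWSaug scheme itself; the function names `power` and `mix_at_snr` are hypothetical, and signals are plain Python lists of samples.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal given as a list of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the result has the requested SNR (dB).

    Hypothetical helper: illustrates generic SNR-controlled mixing only.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean, p_noise = power(clean), power(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("signals must have non-zero power")
    # Choose gain g so that p_clean / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * v for c, v in zip(clean, noise)]


# Usage: create several noisy copies of one clean sample at different SNRs.
random.seed(0)
clean = [math.sin(0.1 * n) for n in range(1000)]    # stand-in for clean speech
noise = [random.gauss(0, 1) for _ in range(1000)]   # stand-in for recorded noise
augmented = {snr: mix_at_snr(clean, noise, snr) for snr in (20, 10, 0)}
```

Each noisy copy would then be turned into a spectrogram before training; that step is omitted here because it depends on windowing choices the abstract does not give.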

  • Research Article
  • Cite Count Icon 105
  • 10.1007/s11042-020-09874-7
Deep learning approaches for speech emotion recognition: state of the art and research challenges
  • Jan 2, 2021
  • Multimedia Tools and Applications
  • Rashid Jahangir + 3 more

Speech emotion recognition (SER) systems identify emotions from the human voice in the areas of smart healthcare, driving a vehicle, call centers, automatic translation systems, and human-machine interaction. In the classical SER process, discriminative acoustic feature extraction is the most important and challenging step because discriminative features influence the classifier performance and decrease the computational time. Nonetheless, current handcrafted acoustic features suffer from limited capability and accuracy in constructing an SER system for real-time implementation. Therefore, to overcome the limitations of handcrafted features, a variety of deep learning techniques have been proposed and employed in recent years for automatic feature extraction in the field of emotion prediction from speech signals. However, to the best of our knowledge, no in-depth review study is available that critically appraises and summarizes the existing deep learning techniques with their strengths and weaknesses for SER. Hence, this study aims to present a comprehensive review of deep learning techniques, their uniqueness, benefits and limitations for SER. Moreover, this review study also presents speech processing techniques, performance measures and publicly available emotional speech databases. Furthermore, this review also discusses the significance of the findings of the primary studies. Finally, it also presents open research issues and challenges that need significant research efforts and enhancements in the field of SER systems.

  • Research Article
  • Cite Count Icon 103
  • 10.1016/j.specom.2019.09.002
Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO
  • Sep 19, 2019
  • Speech Communication
  • Leila Kerkeni + 5 more

  • Dissertation
  • 10.32657/10356/200448
Prosodic features of spontaneous emotional Singapore English speech
  • Jan 1, 2024
  • Rae Jia Xin Koh

Emotions are an integral part of social interactions, serving vital functions in the management of our interpersonal relationships. Cues to emotional meaning can be conveyed through a variety of modalities, from facial expressions and body gestures to the linguistic and non-linguistic, or prosodic aspects of speech. As automatic speech systems continue to grow in use, the ability to recognize emotional cues from these modalities is no longer solely expected in human communication, but also in human-computer interactions with the goal of developing more user-friendly and effective applications. This thesis explores how speech emotion recognition (SER) systems can be more robust to factors of language, culture, and context through the lens of emotional speech and emotional prosody from bilingual English-speaking Singaporeans of Chinese, Malay, or Indian ethnicity. Specifically, the thesis aims to examine how Singaporeans from the three ethnic groups may vary in their perception of emotions in speech, and to explore how they express emotions via speech prosody. The recognition and production processes of emotional communication in Singapore English were investigated through two emotion classes: anger and happiness. An emotion perception test was conducted using acted angry and happy speech as stimuli to understand how Singapore English speakers recognized emotions in speech. The results of the perception test revealed that respondents displayed very similar patterns in their rating of emotional speech stimuli. The study also found that respondents tended to rate anger with low dominance. At the same time, a spontaneous speech corpus known as the Singapore English Spontaneous Emotional Speech Corpus (SESESC) was also developed, featuring audio recordings of 44 Singapore English speakers from the three ethnic groups playing a co-operative video game with either a familiar partner or a confederate.
More than 36,000 intervals were annotated in-context of the recordings by each annotator based on a modified emotion annotation scheme that included information regarding the emotion category and arousal levels. Intervals that were labelled as conveying joy or anger with full agreement from the annotators were selected following a series of criteria to facilitate comparable prosodic analyses given the inherent dynamicity of spontaneous speech. The analysis focused on exploring the F0 movements of the selected utterances in tandem with periodic energy. Overall, it was observed that utterances conveying happiness were associated more with level contours, while angry utterances were produced with more steep falling contours especially on the utterance-final syllable. A tendency for an F0 peak to be produced on the second syllable of the utterance was also observed in both sets of data analyzed, alluding to the interactions between emotional prosody and the intonational structure of Singapore English. Finally, an emotion classification was conducted using a subset of the corpus as test data for a fine-tuned SER model. Results showed that while utterances conveying happiness could be automatically recognized to a high percentage of 86%, there was also a high percentage of misclassification of anger. This result appears to counter findings in past SER studies, which typically reported high levels of recognition accuracy for anger. Taken together, the thesis contributes an understanding of how spontaneous speech and emotional cues on the prosodic level interact even in task-oriented conversations, with further considerations of how these patterns may vary among bilinguals in multicultural contexts, so as to enable more robust speech emotion recognition systems in the future.

  • Conference Article
  • Cite Count Icon 29
  • 10.1109/icicict1.2017.8342835
Music player based on emotion recognition of voice signals
  • Jul 1, 2017
  • Sneha Lukose + 1 more

In this paper, a smart music system is designed that recognizes emotion using the voice signal as input. The objective of the speech emotion recognition (SER) system is to determine the emotional state conveyed by a human being's voice. This study recognizes five emotions: anger, anxiety, boredom, happiness and sadness. The important aspects in implementing this SER system include speech processing using the Berlin emotional database, then extracting suitable features and selecting appropriate pattern recognition or classifier methods to identify the emotional states. Once the emotion of the speech is recognized, the system platform automatically selects a piece of music as a cheer-up strategy from the stored database of song playlists. The analysis results show that this SER system, implemented over five emotions, provides a successful emotion classification performance of 76.31% using a GMM model and a better overall accuracy of 81.57% with an SVM model.

  • Research Article
  • 10.14445/23488549/ijece-v12i1p117
English
  • Jan 30, 2025
  • International Journal of Electronics and Communication Engineering
  • Manolekshmi I + 1 more

Speech Emotion Recognition (SER) is the technique of recognizing and classifying emotions expressed in spoken language using audio features. Human-computer interaction must enable machines to accurately perceive and respond to human emotions. Numerous challenges, like capturing both spatial and temporal features in speech signals, impact the accuracy of emotion recognition models. Conventional emotion recognition systems heavily depend on manual feature extraction and classification, which require significant effort and often lead to errors in detection. Advances in image processing and Artificial Intelligence (AI) have introduced hybrid Deep Learning (DL) approaches to improve SER tasks. This study developed an efficient SER system utilizing a hybrid DL model combined with an ensemble approach to accurately classify emotions expressed through speech. The models were evaluated on the CREMA dataset, which contains 7,442 audio samples across six different emotions. After preprocessing and data augmentation, Mel Frequency Cepstral Coefficients (MFCC) were extracted as features from the speech data. The proposed models include CNN-LSTM and CNN-GRU to extract both spatial and temporal features. Outputs from these frameworks were combined using an ensemble learning approach with a Support Vector Machine (SVM) classifier as the meta-learner. Experimental results indicate that the proposed model attained improved performance with an accuracy of 98.69%, precision of 98.70%, recall of 98.72% and an F1 score of 98.70%. The results highlight the effectiveness of combining advanced neural networks for achieving high performance in emotion detection from speech signals, providing valuable information for developing real-time emotion recognition systems and enhancing human-computer interaction.

  • Conference Article
  • Cite Count Icon 5
  • 10.1109/lt58159.2023.10092295
Arabic English Speech Emotion Recognition System
  • Jan 26, 2023
  • Mai El Seknedy + 1 more

The Speech Emotion Recognition (SER) system is an approach to identify individuals' emotions. This is important for human-machine interface applications and for the emerging Metaverse. This work presents a bilingual Arabic-English speech emotion recognition system based on the EYASE and RAVDESS datasets. A novel feature set was composed by using spectral and prosodic parameters to obtain high performance at a low computational cost. Different classification models were applied. These machine learning classifiers are Random Forest, Support Vector Machine, Logistic Regression, Multi-Layer Perceptron, and Ensemble learning. The proposed feature set performance was compared to the "Interspeech 2009" challenge feature set, which is considered a benchmark in the field. Promising results were obtained using the proposed feature sets. SVM resulted in the best emotion recognition rate and execution performance. The best accuracies achieved were 85% on RAVDESS and 64% on EYASE. Ensemble learning detected the valence emotion with 90% on RAVDESS and 87.6% on EYASE.

  • Conference Article
  • Cite Count Icon 3
  • 10.1109/netact.2017.8076811
Significance of teo slope feature in speech emotion recognition
  • Jul 1, 2017
  • P S Drisya + 1 more

The growth in human-computer interaction has created a need for accurate recognition of emotion from speech data. This paper presents a novel feature called TEO (Teager Energy Operator) Slope for emotion recognition. The feature is obtained by applying a least-squares fit instead of the DCT in the TEO feature. The feature was tested on the publicly available Berlin Emotion Database (EMO-DB) using a GMM classifier. The TEO Slope feature-based emotion recognition system shows a significant improvement in overall accuracy of 2% over the MFCC-based baseline system and 6% over the TEO feature-based system. In addition, the feature set obtained by fusing MFCC, its delta and TEO Slope was evaluated, and 60% accuracy was obtained. The results show that TEO Slope is a promising feature for speech emotion recognition systems.
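
The two building blocks named in the abstract above, the Teager Energy Operator and a least-squares line fit, can be sketched as follows. The discrete TEO used here is the standard definition, ψ[n] = x[n]² − x[n−1]·x[n+1]. How the paper frames the signal, splits it into bands, or combines slopes into a feature vector is not stated in the abstract, so `teo_slope` below is a hypothetical simplification that fits one line to the TEO values of a single frame.

```python
import math


def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns len(x) - 2 values, since the two end samples lack a neighbour.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def least_squares_slope(values):
    """Slope of the least-squares line fitted to values against indices 0..N-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def teo_slope(frame):
    """Hypothetical per-frame feature: slope of a line fitted to the TEO profile."""
    return least_squares_slope(teager_energy(frame))


# Usage: a steady tone has a constant TEO, while a growing tone has a rising one.
steady = [2.0 * math.cos(0.3 * n) for n in range(50)]
growing = [(1.0 + 0.05 * n) * math.cos(0.3 * n) for n in range(50)]
steady_slope = teo_slope(steady)
growing_slope = teo_slope(growing)
```

For a pure sinusoid A·cos(ωn) the TEO equals the constant A²·sin²(ω), so the fitted slope is essentially zero; a slope away from zero indicates that the energy profile is changing within the frame.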

  • Research Article
  • Cite Count Icon 13
  • 10.1016/j.procs.2022.09.345
Multiple Models Fusion for Multi-label Classification in Speech Emotion Recognition Systems
  • Jan 1, 2022
  • Procedia Computer Science
  • Anwer Slimi + 3 more

  • Research Article
  • Cite Count Icon 2
  • 10.4114/intartif.vol28iss76pp85-123
Explainable Artificial Intelligence Techniques for Speech Emotion Recognition: A Focus on XAI Models
  • Jun 17, 2025
  • Inteligencia Artificial
  • Michael Norval + 1 more

This study employs Explainable Artificial Intelligence (XAI) techniques, including SHAP, LIME, and XGBoost, to interpret speech emotion recognition (SER) models. Unlike previous work focusing on generic datasets, this research integrates these tools to explore the unique emotional nuances within an Afrikaans speech corpus. The complexity of architectures poses significant challenges regarding model interpretability. This paper explicitly aims to bridge the gaps in existing SER systems by integrating advanced XAI techniques. The objective is to develop an ensemble stacking model that combines CNN, CLSTM, and XGBoost, augmented by SHAP and LIME, to enhance the interpretability, accuracy, and adaptability of SER systems, particularly for underrepresented languages like Afrikaans. Our research methodology involves utilising XAI methods to explain the decision-making processes of CNN and CLSTM models in SER to enhance trust, diagnostic insight, and theoretical understanding. We train the models for SER using a comprehensive dataset of emotional speech samples. Post-training, we apply SHAP and LIME to these models to generate explanations for their predictions, focusing on the importance of features and the models’ decision logic. By comparing the explanations generated by SHAP and LIME, we assess the efficacy of each method in providing meaningful insights into the models’ operations. The comparative study of various models in SER demonstrates their capability to discern complex emotional states through diverse analytical approaches, from spatial feature extraction to temporal dynamics. Our research reveals that XAI techniques improve the interpretability of complex SER models. This enhanced transparency builds end-user trust and provides valuable insights.
This study contributes to the importance of explainability in deploying AI technologies in emotionally sensitive applications, paving the way for more accountable and user-centric SER systems.
