Abstract

Speech Emotion Recognition (SER) systems are widely regarded as essential components of human-computer interfaces. Extracting emotional content from voice signals enhances communication between humans and machines. Despite the rapid advancement of SER systems for several languages, there is still a gap in SER research for the Arabic language. The goal of this research is to build an Arabic SER system using a feature set that combines high performance with low computational cost. Two novel feature sets were created from a mix of spectral and prosodic features and evaluated on EYASE, an Arabic corpus constructed from a drama series. EYASE, the Egyptian Arabic Semi-natural Emotion speech dataset, consists of 579 utterances representing happy, sad, angry, and neutral emotions, spoken by 3 male and 3 female professional actors. To verify the emotion recognition results, surveys of Arabic and non-Arabic speakers were conducted to analyze the dataset constituents. The survey results show that anger, sadness, and happiness are sometimes misclassified as neutral. Several machine learning classifiers were applied: Multi-Layer Perceptron, Support Vector Machine, Random Forest, Logistic Regression, and ensemble learning. For valence (happy/angry) classification, ensemble learning achieved the best result, 87.59%, using the 2 proposed feature sets. Featureset-2 yielded the highest recognition accuracy with all classifiers. For multi-emotion classification, the Support Vector Machine achieved the highest recognition accuracy, 64%, using featureset-2 and the benchmark Interspeech feature sets. The computational cost of featureset-2 was the lowest for all classifiers, for both training and testing.
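The pipeline the abstract describes can be sketched as follows. This is not the authors' code: the feature vectors here are synthetic stand-ins for the spectral and prosodic statistics (e.g., MFCC, pitch, and energy summaries) the paper extracts per utterance, the 40-dimensional size is an assumption, and the classifiers are the scikit-learn counterparts of those named in the abstract.

```python
# Hedged sketch of an SER classification pipeline: spectral/prosodic feature
# vectors classified with an SVM and a soft-voting ensemble of the classifiers
# named in the abstract. Feature values below are random placeholders, so the
# resulting accuracy is near chance and does not reflect the paper's results.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_utterances, n_features = 579, 40          # 579 utterances as in EYASE; 40 dims is illustrative
X = rng.normal(size=(n_utterances, n_features))  # stand-in for per-utterance feature vectors
y = rng.integers(0, 4, size=n_utterances)        # 0=angry, 1=happy, 2=neutral, 3=sad

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
ensemble = VotingClassifier(
    estimators=[
        ("svm", svm),
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("mlp", make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
print(f"ensemble accuracy on synthetic data: {acc:.2f}")
```

With real inputs, `X` would hold the proposed featureset-2 statistics computed from each audio file, and for the binary valence task `y` would be restricted to the happy/angry labels.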
