Extracting a speaker's emotional state from speech has recently become an active research topic, driven by demand for more human-like interactive applications. The field has advanced significantly, especially for English, owing to the availability of large labeled speech corpora. Progress on comparable methods for Arabic, however, is still in its infancy. In this paper, we present a speech recognition model for the Arabic language that identifies both the emotional state and the gender of the speaker through voice analysis. Three primary emotion labels were selected: low, standard, and high levels of emotion. Various spectral features, such as the mel-frequency cepstral coefficients (MFCCs), were extracted and tested to determine the optimal feature set. In addition, three machine learning models (SVM, KNN, and HMM) and two deep learning models (LSTM and CNN) were evaluated. The five models were compared across the different extracted features, which led to the selection of MFCC, root-mean-square (RMS), mel-scaled spectrogram, spectral, and zero-crossing rate as the extracted features, and the CNN as the classification model. This configuration achieved an accuracy of 93% for emotion recognition and 99% for gender recognition.
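Two of the features named above, RMS energy and zero-crossing rate, are simple enough to illustrate directly. The following is a minimal sketch in plain Python, not the authors' pipeline: the function names, the 16 kHz sample rate, the 400-sample frame length, and the 160-sample hop are all assumptions chosen for illustration, and a synthetic tone stands in for real speech.

```python
import math


def frame_signal(samples, frame_length, hop_length):
    """Split a sequence of samples into overlapping frames (no padding)."""
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frames.append(samples[start:start + frame_length])
    return frames


def rms(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def extract_features(samples, frame_length=400, hop_length=160):
    """Return one (rms, zcr) pair per frame.

    The defaults correspond to 25 ms frames with a 10 ms hop at 16 kHz,
    an assumed setting that is not taken from the paper.
    """
    frames = frame_signal(samples, frame_length, hop_length)
    return [(rms(f), zero_crossing_rate(f)) for f in frames]


# Synthetic input (assumption): one second of a 100 Hz sine at 16 kHz.
sample_rate = 16000
tone = [
    math.sin(2 * math.pi * 100 * n / sample_rate)
    for n in range(sample_rate)
]
features = extract_features(tone)
```

In practice, per-frame values like these would typically be combined with MFCC and mel-spectrogram features before being passed to a classifier; an audio library is normally used for that step rather than hand-written code.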