Emotional expressions are a fundamental aspect of human communication, and speech is one of the most natural modes of interaction. Speech Emotion Recognition (SER) is a significant research topic in Natural Language Processing (NLP), aimed at identifying emotions such as satisfaction, frustration, and anger from speech audio. This paper presents a method for emotion recognition from spontaneous Tunisian Dialect (TD) speech, the first work in the SER field to use spontaneous speech for this dialect. The dataset was built from freely available YouTube videos spanning multiple domains and labeled with four perceived emotions: anger, satisfaction, frustration, and neutral. To address data scarcity, we applied data augmentation, specifically Vocal Tract Length Perturbation (VTLP). Preprocessing of the speech signals involved removing ambient and other unwanted noise. We extracted and selected several spectral features, including Mel-Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC). We then applied five classification methods: Support Vector Machine (SVM), Bidirectional Long Short-Term Memory (BiLSTM), Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Random Forest. Our experiments showed that the Random Forest classifier achieved the highest F-score of 58.75%. The results were thoroughly discussed, analyzed, and compared across the five models and the different feature sets. This study provides valuable insights and advancements in the SER field, particularly for TD, and outlines future research directions for improving emotion recognition systems.
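To make the described pipeline concrete, the sketch below shows how time-averaged MFCC features could be extracted and fed to a Random Forest classifier, the best-performing model in the abstract. This is a minimal illustration assuming librosa and scikit-learn; the file paths, sampling rate, pooling strategy, train/test split, and hyperparameters are assumptions, not the authors' released setup.

```python
# Illustrative sketch (not the paper's code): MFCC extraction + Random Forest
# baseline for four-class emotion recognition, as summarized in the abstract.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

EMOTIONS = ["anger", "satisfaction", "frustration", "neutral"]  # labels from the abstract

def mfcc_features(wav_path, sr=16000, n_mfcc=13):
    """Load one utterance and return its time-averaged MFCC vector."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # shape: (n_mfcc,)

def run_baseline(wav_paths, labels):
    """Train a Random Forest on MFCC features and report the macro F-score."""
    X = np.stack([mfcc_features(p) for p in wav_paths])
    y = np.array([EMOTIONS.index(lab) for lab in labels])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)  # assumed split
    clf = RandomForestClassifier(n_estimators=300, random_state=42)  # assumed settings
    clf.fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te), average="macro")
```

In practice, the same feature matrix could be passed to the other classifiers mentioned (SVM, LSTM, BiLSTM, CNN), and LPCC or augmented (VTLP) utterances could be added before training; those steps are omitted here for brevity.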