Enhancing speech emotion detection with Windowed Long-Term Average Spectrum and Logistic-Rectified Linear Unit

P. Rajesh Kanna, V. Kumararaja

doi:10.1016/j.engappai.2024.109103

Abstract

Speech emotion detection can help monitor people's mental health by analyzing their speech patterns and recognizing emotional indicators. Some feature extraction methods produce only a limited set of features and can miss important information present in the data, which reduces accuracy and makes it harder to capture complex patterns and variations in speech signals. In this work, the raw data is first augmented using shifting, pitch shifting, noise injection, and time stretching. The proposed work then consists of two phases: feature extraction and classification. In the first phase, Windowed Long-Term Average Spectrum (WLTAS) feature extraction derives spectral information from speech signals, providing a compact and representative feature set. It captures the frequency content and energy distribution of speech over time, enabling the analysis of important spectral characteristics. In the second phase, the Logistic-Rectified Linear Unit (LoRLU) introduces sparsity: it enhances the discriminative power of the classifier by allowing only positive values to pass through while discarding negative values. The Gibbs Restricted Boltzmann Machine (GRBM) is a generative probabilistic model that learns a joint probability distribution over the input data and is particularly effective at capturing dependencies and patterns in high-dimensional data. The performance of WLTAS-L-GRBM (Windowed Long-Term Average Spectrum - Logistic-Rectified Linear Unit - Gibbs Restricted Boltzmann Machine) was analyzed using the following evaluation metrics: accuracy (99%), precision (0.97), recall (1), F1 score (0.98), MSE (0.01), and AUC-ROC (0.97). The model was also evaluated on various speech emotion datasets.
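
As a rough illustration of the feature extraction idea described in the abstract, the following minimal sketch computes a long-term average spectrum from Hann-windowed frames, together with two of the augmentations mentioned (noise injection and shifting). It is not the authors' implementation: the abstract does not define WLTAS precisely, so the function names (`windowed_ltas`, `add_noise`, `time_shift`), the Hann window, the frame length, the hop size, the dB scaling, and the circular shift are all assumptions chosen for illustration.

```python
import numpy as np


def windowed_ltas(signal, frame_len=512, hop=256):
    """Average the power spectra of Hann-windowed frames, returned in dB.

    Assumption: "windowed long-term average spectrum" is read here as the
    mean power spectrum over overlapping windowed frames.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        # Zero-pad short signals so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mean_power = power.mean(axis=0)  # average over time (frames)
    return 10.0 * np.log10(mean_power + 1e-12)  # small offset avoids log(0)


def add_noise(signal, snr_db=20.0, rng=None):
    """Noise injection: add white Gaussian noise at an assumed target SNR."""
    rng = np.random.default_rng(0) if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    noise_power = np.mean(signal ** 2) / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)


def time_shift(signal, shift):
    """Shifting: circularly shift the waveform by `shift` samples (assumed)."""
    return np.roll(signal, shift)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 1000.0 * t)  # synthetic 1 kHz test tone
    features = windowed_ltas(add_noise(time_shift(tone, 100)))
    peak_hz = np.argmax(features) * sr / 512
    print(features.shape, peak_hz)
```

The resulting vector has one value per frequency bin (`frame_len // 2 + 1` entries), which is the kind of compact, fixed-length spectral summary that could be passed to a downstream classifier.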

Full Text