Abstract

Speech Emotion Recognition (SER) has conventionally relied solely on acoustic data. The Convolutional Neural Network (CNN), a cutting-edge Deep Learning (DL) technique, has been applied effectively in diverse fields, including speech emotion analysis. This study uses the Mel Frequency Magnitude Coefficient (MFMC), a modified variant of the Mel Frequency Cepstral Coefficient (MFCC) feature, in conjunction with a one-dimensional (1D) CNN to improve SER performance. SER accuracy is evaluated with three proposed models: model 1 (MFCC-1D-CNN), model 2 (MFMC-1D-CNN), and model 3, a concatenation of models 1 and 2. The models were evaluated on four datasets covering the emotions anger, happiness, sadness, boredom, fear, surprise, disgust, neutral, and calm, using 12, 24, and 30 coefficients. Model 1 achieved accuracies of 88.1%, 87.5%, 75%, and 97.7% on EMO-DB, EMOVO, SAVEE, and RAVDESS, respectively, while model 2 achieved 93.8%, 98.3%, 90.3%, and 97.5% on the same datasets. The concatenated model 3 outperformed both, reaching 95.6%, 99.4%, 91.7%, and 98.1% on EMO-DB, EMOVO, SAVEE, and RAVDESS. Experimental results indicate that the concatenated model 3 improves accuracy and helps prevent overfitting.
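The abstract does not give implementation details, so the following is only an illustrative numpy sketch of how the two feature types differ: MFCC applies log compression and a DCT to mel-filtered spectral energies, whereas MFMC is reported in the SER literature as keeping the mel-filtered magnitude spectrum directly, omitting the log and DCT stages. All concrete parameter values here (`n_fft`, `hop`, filter counts, sampling rate) are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def hz_to_mel(h):
    return 2595.0 * np.log10(1.0 + h / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filterbank, shape (n_filters, n_fft // 2 + 1)."""
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising slope of triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope of triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def frames(x, n_fft, hop):
    """Split the signal into Hamming-windowed overlapping frames."""
    n = 1 + (len(x) - n_fft) // hop
    f = np.stack([x[i * hop: i * hop + n_fft] for i in range(n)])
    return f * np.hamming(n_fft)

def dct2(v, n_out):
    """DCT-II along the last axis, keeping the first n_out coefficients."""
    N = v.shape[-1]
    k = np.arange(n_out)[:, None]
    n = np.arange(N)[None, :]
    return v @ np.cos(np.pi * k * (2.0 * n + 1.0) / (2.0 * N)).T

def mfcc(x, sr=16000, n_fft=512, hop=256, n_mels=26, n_coeff=12):
    mag = np.abs(np.fft.rfft(frames(x, n_fft, hop), axis=1))
    mel = mag ** 2 @ mel_filterbank(n_mels, n_fft, sr).T   # mel-filtered energies
    return dct2(np.log(mel + 1e-10), n_coeff)              # log + DCT -> cepstrum

def mfmc(x, sr=16000, n_fft=512, hop=256, n_coeff=12):
    mag = np.abs(np.fft.rfft(frames(x, n_fft, hop), axis=1))
    return mag @ mel_filterbank(n_coeff, n_fft, sr).T      # magnitudes; no log, no DCT

# Usage on a synthetic 1 s, 440 Hz tone (stand-in for a speech utterance)
sr = 16000
x = np.sin(2.0 * np.pi * 440.0 * np.arange(sr) / sr)
print(mfcc(x).shape, mfmc(x).shape)   # -> (61, 12) (61, 12)
```

Either feature matrix (frames × coefficients) could then feed a 1D CNN; in the paper's model 3, the two branches are concatenated before classification.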
