Abstract

Automated speech emotion recognition (SER) is a machine-based method for identifying emotion from speech signals. SER has many practical applications, including man-machine interaction (MMI), online customer support, healthcare services, and online marketing. Because of this wide range of applications, SER has attracted growing research interest for three decades. Numerous studies have employed various combinations of features and classifiers to improve emotion classification performance. In this study, we pursue the same goal using variational mode decomposition (VMD)-based features. We extract features such as MFCC, mel-spectrogram, approximate entropy (ApEn), and permutation entropy (PrEn) from each VMD mode. Emotion classification performance is evaluated using a deep neural network (DNN) classifier with the proposed VMD-based features both individually (MFCC, mel-spectrogram, ApEn, and PrEn) and in combination (MFCC + mel-spectrogram + ApEn + PrEn). We used two datasets, RAVDESS and EMO-DB, to evaluate emotion classification performance and obtained classification accuracies of 91.59% and 80.83% for the EMO-DB and RAVDESS datasets, respectively. We compared our experimental results with other methods and found that the proposed VMD-based feature combinations with a DNN classifier outperformed state-of-the-art SER approaches.
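To make one of the abstract's features concrete, the following is a minimal sketch of permutation entropy (PrEn), computed here on a raw 1-D sequence. It assumes the VMD decomposition itself comes from an external implementation (not shown); in the described pipeline this function would be applied to each VMD mode. The function name and parameter defaults are illustrative, not taken from the paper.

```python
# Sketch of permutation entropy (PrEn), one of the per-mode features
# named in the abstract. In the full pipeline this would be applied to
# each VMD mode; the VMD step itself is assumed external and omitted.
from math import log, factorial
from collections import Counter

def permutation_entropy(signal, m=3, delay=1):
    """Normalized permutation entropy of a 1-D sequence.

    m     -- embedding (pattern) dimension
    delay -- time delay between samples within each pattern
    """
    patterns = Counter()
    n = len(signal) - (m - 1) * delay
    for i in range(n):
        window = [signal[i + j * delay] for j in range(m)]
        # Ordinal pattern: the ranking of samples inside the window.
        pattern = tuple(sorted(range(m), key=lambda k: window[k]))
        patterns[pattern] += 1
    total = sum(patterns.values())
    # Shannon entropy over pattern frequencies, normalized to [0, 1].
    h = -sum((c / total) * log(c / total) for c in patterns.values())
    return h / log(factorial(m))

# A monotone signal has a single ordinal pattern, so its PrEn is 0;
# more irregular signals yield values closer to 1.
prEn_flat = permutation_entropy(list(range(100)))
prEn_mixed = permutation_entropy([0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9])
```

A low PrEn indicates a regular mode while a high PrEn indicates an irregular one, which is why it can complement spectral features such as MFCC and mel-spectrograms.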
