Abstract

The objective of automated speech emotion recognition (SER) is to identify the emotion conveyed by a speech signal accurately and efficiently using machines such as computers and mobile devices. SER has gained wide popularity among researchers because of its extensive applicability in practical contexts; it has proven advantageous in domains including medical treatment, security systems, surveillance, online marketing, online education, search engines, personal communication, customer relationship management, and human-machine interaction. Numerous authors have combined multiple features (acoustic, non-acoustic, or both) with machine learning and deep learning classifiers to improve emotion categorization. In the present work, our objective is to enhance the performance of emotion classification (PECL) by combining variational mode decomposition (VMD) with the Hilbert transform (HT). Using VMD, we decompose each speech frame into several sub-signals, or intrinsic mode functions (IMFs). HT is then applied to each VMD-based IMF to obtain the mode instantaneous amplitude (MIA) and mode instantaneous frequency (MIF) signal vectors. From each MIA and MIF vector we extract the proposed features: HT-based approximate entropy (HTAE), HT-based permutation entropy (HTPE), HT-based increment entropy (HTIE), and HT-based sample entropy (HTSE). The combination of these proposed features is called the HT-based entropy (HTE) feature set. We then assess the PECL using the HTE and MFCC features, both individually and in combination, with a deep neural network (DNN) classifier. The experimental results show that the combined features (MFCC + HTE) with the DNN classifier outperform the individual feature sets, achieving SER accuracies of 86.92% on the EMOVO dataset and 91.63% on the EMO-DB dataset.
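The feature-extraction pipeline described above can be sketched compactly. The following Python fragment is a minimal illustration, not the authors' implementation: it assumes the vmdpy package for VMD and the antropy package for the entropy measures, uses placeholder parameter values (K = 4 modes, alpha = 2000), and omits increment entropy (HTIE), for which no standard off-the-shelf routine is assumed here.

```python
import numpy as np
from scipy.signal import hilbert
from vmdpy import VMD                                           # pip install vmdpy
from antropy import app_entropy, perm_entropy, sample_entropy   # pip install antropy


def hte_features(frame, fs, K=4):
    """Extract HT-based entropy (HTE) features from one speech frame."""
    # Step 1: decompose the frame into K intrinsic mode functions (IMFs)
    # via VMD. alpha, tau, and tol are illustrative values, not the
    # paper's settings.
    imfs, _, _ = VMD(frame, alpha=2000, tau=0.0, K=K, DC=0, init=1, tol=1e-7)

    feats = []
    for imf in imfs:
        # Step 2: the Hilbert transform of each IMF yields the analytic
        # signal, whose magnitude is the mode instantaneous amplitude (MIA)
        # and whose unwrapped phase derivative is the mode instantaneous
        # frequency (MIF).
        analytic = hilbert(imf)
        mia = np.abs(analytic)
        phase = np.unwrap(np.angle(analytic))
        mif = np.diff(phase) * fs / (2.0 * np.pi)

        # Step 3: entropy measures of each MIA and MIF vector
        # (HTAE, HTPE, and HTSE in the paper's naming; HTIE omitted).
        for sig in (mia, mif):
            feats.extend([
                app_entropy(sig),                    # HTAE
                perm_entropy(sig, normalize=True),   # HTPE
                sample_entropy(sig),                 # HTSE
            ])
    return np.asarray(feats)  # length: K modes * 2 vectors * 3 entropies
```

In a full system, these per-frame HTE vectors would be concatenated with MFCC features (e.g., computed with librosa) and fed to the DNN classifier; those stages are not shown here.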
