Speech emotion recognition (SER) remains a challenging research problem in human–computer interaction systems. This paper proposes a nonlinear feature extraction technique to improve the classification performance of the SER system. The proposed method combines variational mode decomposition (VMD) with the Teager-Kaiser energy operator (TKEO) for SER. First, VMD decomposes a speech signal into modes, and then the nonlinear TKEO is applied to each mode to obtain an energy time series. The VMD-TKEO preprocessed signal is used to extract global features based on energy, pitch frequency, and Mel-frequency cepstral coefficients. The features are statistically screened using the Kruskal-Wallis test. The resultant feature set is evaluated with the support vector machine and its variants for emotion classification. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is used for the experiments, and several emotion classification problems are formulated. Finally, the accuracy of the proposed SER architecture is quantitatively analyzed and shown to outperform existing architectures.
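For illustration only, the sketch below outlines a VMD-TKEO feature pipeline of the kind the abstract describes; it is not the authors' code. It assumes the third-party `vmdpy` package for VMD and `librosa` for pitch and MFCC extraction, and the parameter choices (number of modes `K`, bandwidth penalty `alpha`, pitch range, significance threshold) are hypothetical, as the abstract does not specify them.

```python
# Hedged sketch of a VMD-TKEO feature pipeline (assumptions noted; not the paper's implementation).
import numpy as np
import librosa
from vmdpy import VMD                 # assumed VMD implementation
from scipy.stats import kruskal
from sklearn.svm import SVC

def tkeo(x):
    """Teager-Kaiser energy operator: psi[n] = x[n]**2 - x[n-1]*x[n+1]."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def vmd_tkeo_features(signal, sr, K=5, alpha=2000):
    """Decompose with VMD, apply TKEO per mode, extract global features."""
    # VMD(signal, alpha, tau, K, DC, init, tol) -> modes, spectra, centre frequencies
    modes, _, _ = VMD(signal, alpha, 0.0, K, 0, 1, 1e-7)
    # Sum the per-mode TKEO series to form the preprocessed signal
    enhanced = np.sum([tkeo(m) for m in modes], axis=0).astype(float)

    feats = [np.sum(enhanced ** 2)]                            # global energy
    f0 = librosa.yin(enhanced, fmin=50, fmax=400, sr=sr)       # pitch contour
    feats.extend([np.nanmean(f0), np.nanstd(f0)])              # pitch statistics
    mfcc = librosa.feature.mfcc(y=enhanced, sr=sr, n_mfcc=13)
    feats.extend(mfcc.mean(axis=1))                            # mean MFCCs
    return np.asarray(feats)

def screen_features(X, y, p_thresh=0.05):
    """Keep feature columns whose Kruskal-Wallis p-value is below the threshold."""
    keep = []
    for j in range(X.shape[1]):
        groups = [X[y == c, j] for c in np.unique(y)]
        _, p = kruskal(*groups)
        if p < p_thresh:
            keep.append(j)
    return keep

# Hypothetical usage: classify the screened features with an SVM
# keep = screen_features(X_train, y_train)
# clf = SVC(kernel="rbf").fit(X_train[:, keep], y_train)
```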