Abstract

Automatic Speech Emotion Recognition (SER) aims to recognize the emotional state of a speaker from speech. SER performs well in laboratory environments, but real-time emotion recognition is affected by variations in the gender, age, and cultural and acoustic background of the speaker. The acoustic resemblance between emotional expressions further increases the complexity of recognition. Much recent research has concentrated on addressing these effects individually. Instead of addressing each influencing attribute separately, we design a single system that reduces the effect arising from any such factor. We propose a two-level hierarchical classifier named Interpreter of Responses (IR). The first level of IR is realized using Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) classifiers. In the second level, a discriminative SVM classifier is trained and tested on the meta-information produced by the first-level classifiers, concatenated with the acoustic feature vector that serves as input to those classifiers. To train the system on a corpus of versatile nature, an integrated emotion corpus has been composed from emotion samples of five speech corpora, namely EMO-DB, IITKGP-SESC, the SAVEE corpus, the Spanish emotion corpus, and CMU's Woogle corpus. The hierarchical classifier has been trained and tested using Mel-Frequency Cepstral Coefficients (MFCC) and Low-Level Descriptors (LLDs). The empirical analysis shows that the proposed classifier outperforms traditional classifiers. The proposed ensemble design is generic and can be adapted even when the number and nature of the features change; the first-level GMM or SVM classifiers may be replaced with any other learning algorithm.
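To make the two-level architecture concrete, the following is a minimal sketch in Python using scikit-learn, under stated assumptions: feature extraction (MFCC/LLDs) is done elsewhere, synthetic placeholder data stands in for the acoustic feature matrix, and the helper names fit_gmms, gmm_scores, and meta_features are illustrative, not taken from the paper.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.mixture import GaussianMixture

    # Placeholder data: in practice X_* would hold MFCC/LLD feature vectors
    # and y_* the emotion labels from the integrated corpus.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 13))    # 13-dim MFCC-like features (assumed)
    y_train = rng.integers(0, 4, size=200)  # 4 emotion classes (assumed)
    X_test = rng.normal(size=(50, 13))

    def fit_gmms(X, y, n_components=2):
        # One GMM per emotion class; a sample is scored by its log-likelihood
        # under each class model.
        return {c: GaussianMixture(n_components=n_components, random_state=0)
                   .fit(X[y == c])
                for c in np.unique(y)}

    def gmm_scores(gmms, X):
        # Per-class log-likelihoods, used as GMM meta-information for level two.
        return np.column_stack([gmms[c].score_samples(X) for c in sorted(gmms)])

    # Level 1: an SVM producing class posteriors, and per-class GMMs.
    svm1 = SVC(probability=True).fit(X_train, y_train)
    gmms = fit_gmms(X_train, y_train)

    def meta_features(X):
        # Concatenate the input acoustic features with both first-level
        # outputs, mirroring the second-level input described above.
        return np.hstack([X, svm1.predict_proba(X), gmm_scores(gmms, X)])

    # Level 2: a discriminative SVM over meta-information + acoustic features.
    svm2 = SVC().fit(meta_features(X_train), y_train)
    y_pred = svm2.predict(meta_features(X_test))

In a practical stacking setup of this kind, the first-level outputs used to train the second level would usually be generated on held-out folds to avoid information leakage; the sketch above omits this for brevity.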
