Abstract

Speech emotion recognition plays an increasingly important role in affective computing, yet it remains a challenging task due to its complexity. In this study, we developed a framework integrating three distinct classifiers: a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN). The framework was used for categorical recognition of four discrete emotions (angry, happy, neutral, and sad). Frame-level low-level descriptors (LLDs), segment-level mel-spectrograms (MS), and utterance-level outputs of high-level statistical functions (HSFs) applied to the LLDs were passed to the RNN, CNN, and DNN, respectively, yielding three individual models: LLD-RNN, MS-CNN, and HSF-DNN. In the MS-CNN and LLD-RNN models, an attention-based weighted-pooling method was used to aggregate the CNN and RNN outputs. To exploit the interdependencies between the two approaches to emotion description (discrete emotion categories and continuous emotion attributes), a multi-task learning strategy was implemented in all three models: generalized features were acquired by simultaneously performing classification of discrete categories and regression of continuous attributes. Finally, a confidence-based fusion strategy was developed to combine the strengths of the different classifiers in recognizing different emotional states. We conducted three emotion recognition experiments on the IEMOCAP corpus. The results show that attention-based weighted pooling enabled the networks to focus on emotionally salient parts of an utterance, and that the generalized features learned through multi-task learning helped the networks achieve higher classification accuracy. Furthermore, the proposed fusion system achieved a weighted accuracy of 57.1% and an unweighted accuracy of 58.3%, both significantly higher than those of any individual classifier, validating the effectiveness of the proposed classifier-fusion approach.
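The abstract does not give the exact formulation of the attention-based weighted pooling, so the following is a minimal PyTorch sketch of one common variant: a learnable query vector scores each frame- or segment-level output, the scores are softmax-normalized into attention weights, and the sequence is collapsed into a single utterance-level vector by a weighted sum. The module name, the tanh nonlinearity, and the example dimensions are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Aggregates a sequence of frame/segment features (e.g., RNN or
    CNN outputs) into one utterance-level vector via learned weights."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # Learnable query vector that scores each time step
        self.u = nn.Parameter(torch.randn(feat_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, feat_dim)
        scores = torch.matmul(torch.tanh(h), self.u)   # (batch, time)
        alpha = torch.softmax(scores, dim=1)           # attention weights
        # Weighted sum over time -> (batch, feat_dim)
        return torch.sum(alpha.unsqueeze(-1) * h, dim=1)

# Example: pool 120 time steps of 256-dim features for a batch of 8
pool = AttentionPooling(256)
utt = pool(torch.randn(8, 120, 256))   # -> shape (8, 256)
```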
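The multi-task setup described above (classification of discrete categories plus regression of continuous attributes) can be sketched as a shared utterance embedding feeding two task-specific heads with a weighted joint loss. The number of continuous attributes (two, e.g., valence and activation) and the trade-off weight `alpha` are hypothetical choices for illustration; the paper's actual head sizes and loss weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """Two heads on a shared utterance embedding: a 4-way classifier
    for discrete emotions and a regressor for continuous attributes."""
    def __init__(self, feat_dim: int, num_classes: int = 4, num_attrs: int = 2):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.regressor = nn.Linear(feat_dim, num_attrs)

    def forward(self, z: torch.Tensor):
        # z: (batch, feat_dim) shared embedding (e.g., pooled RNN output)
        return self.classifier(z), self.regressor(z)

def multitask_loss(logits, attrs_pred, labels, attrs_true, alpha=0.5):
    # Cross-entropy for the discrete task, MSE for the continuous task;
    # alpha is an assumed trade-off weight, not taken from the paper.
    return F.cross_entropy(logits, labels) + alpha * F.mse_loss(attrs_pred, attrs_true)
```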
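The abstract states only that the fusion is confidence-based, not how confidence is computed or combined, so the sketch below is one plausible reading under stated assumptions: each model's confidence is taken as the maximum of its class posteriors, and the fused posterior is the confidence-weighted average across the three models. The function name and weighting rule are hypothetical.

```python
import numpy as np

def confidence_fusion(posteriors):
    """Hypothetical confidence-based fusion of per-model class posteriors.
    `posteriors` is a list of (num_classes,) probability vectors, one per
    model (e.g., HSF-DNN, MS-CNN, LLD-RNN)."""
    posteriors = np.asarray(posteriors)           # (num_models, num_classes)
    conf = posteriors.max(axis=1, keepdims=True)  # per-model confidence
    fused = (conf * posteriors).sum(axis=0) / conf.sum()
    return int(np.argmax(fused))                  # index of the fused label

# Example with angry/happy/neutral/sad posteriors from three models
p_dnn = [0.10, 0.20, 0.60, 0.10]
p_cnn = [0.05, 0.70, 0.15, 0.10]
p_rnn = [0.25, 0.40, 0.25, 0.10]
print(confidence_fusion([p_dnn, p_cnn, p_rnn]))   # -> 1 (happy)
```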
