Abstract

In recent years, speech emotion recognition has received widespread attention from both academia and industry. However, current approaches still suffer from low recognition rates and a reliance on a single type of emotion feature. To address these issues, a multi-feature speech emotion recognition method is proposed that improves the recognition rate by exploiting the complementarity of multiple features. First, statistical functions are applied to low-level descriptor features to obtain statistical features. Second, the mel spectrogram is cropped into slices, which are fed into a deep convolutional neural network (DCNN), and a BiLSTM network with a self-attention mechanism is used to extract slice features; deep spatial and temporal features are then obtained through temporal pyramid pooling. Finally, the three kinds of features are fused at the feature level and passed to a Softmax layer for emotion classification. Experiments on three datasets, CASIA, EMO-DB, and IEMOCAP, show that accuracy improves by 3.8%, 4.8%, and 17.41%, respectively, compared with using a single feature.
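
To make the fusion pipeline concrete, the following is a minimal PyTorch sketch of the three-branch architecture described above. All layer sizes, feature dimensions, and the number of classes are assumptions not specified in the abstract, and the temporal pyramid pooling step is simplified to plain mean pooling over slices; this is an illustrative sketch, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions -- not given in the abstract.
N_STATS = 384        # statistical features from low-level descriptors (assumed)
N_MELS = 128         # mel bins per spectrogram slice (assumed)
SLICE_FRAMES = 64    # frames per cropped slice (assumed)
N_CLASSES = 6        # e.g. CASIA defines six emotion categories

class MultiFeatureSER(nn.Module):
    # Three branches: statistical features, DCNN spatial features, and
    # BiLSTM/self-attention temporal features, fused at the feature level.
    def __init__(self):
        super().__init__()
        # DCNN branch over mel-spectrogram slices (layer sizes assumed)
        self.dcnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.spatial_proj = nn.Linear(64 * (N_MELS // 4) * (SLICE_FRAMES // 4), 256)
        # BiLSTM + self-attention branch over the slice sequence
        self.bilstm = nn.LSTM(N_MELS, 128, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(256, 1)
        # Fusion + classifier (returns logits; Softmax applied by the caller)
        self.classifier = nn.Linear(N_STATS + 256 + 256, N_CLASSES)

    def forward(self, stats, slices):
        # stats: (B, N_STATS); slices: (B, T, N_MELS, SLICE_FRAMES)
        B, T = slices.shape[:2]
        # Spatial features: run the DCNN on each slice, then average over slices
        # (stand-in for the paper's temporal pyramid pooling)
        x = self.dcnn(slices.reshape(B * T, 1, N_MELS, SLICE_FRAMES))
        spatial = self.spatial_proj(x.reshape(B, T, -1)).mean(dim=1)
        # Temporal features: BiLSTM over per-slice mel means, then
        # self-attention pooling over time (a simplification)
        seq = slices.mean(dim=3)                      # (B, T, N_MELS)
        h, _ = self.bilstm(seq)                       # (B, T, 256)
        w = F.softmax(self.attn(h), dim=1)            # attention weights over time
        temporal = (w * h).sum(dim=1)                 # (B, 256)
        # Feature-level fusion of the three branches
        fused = torch.cat([stats, spatial, temporal], dim=1)
        return self.classifier(fused)

model = MultiFeatureSER()
logits = model(torch.randn(2, N_STATS), torch.randn(2, 5, N_MELS, SLICE_FRAMES))
probs = F.softmax(logits, dim=1)  # per-emotion probabilities
print(probs.shape)                # torch.Size([2, 6])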
