Abstract
Speech emotion recognition (SER) is a challenging task in the field of emotion recognition. The performance of SER largely depends on the emotional features extracted from speech. However, different emotional features are unevenly distributed and are typically combined only linearly, and their sensitivity to emotion also varies, which largely limits recognition accuracy. To address this problem, a multi-channel 2-D convolutional recurrent neural network is proposed: per-channel convolutions map the different features to a common dimension, the channel outputs are combined and fed into a bidirectional long short-term memory (Bi-LSTM) network to extract global features, and an attention mechanism is then applied to suppress the influence of silent segments. The model is evaluated on two benchmark corpora, and the results show that the designed network performs well in SER, achieving average accuracies of 69.51% on IEMOCAP and 86.42% on EMO-DB.
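The pipeline described above (per-channel projection to a shared dimension, channel merging, then attention pooling over frames) can be sketched numerically. This is a minimal NumPy illustration, not the paper's implementation: the feature types, dimensions, merge rule, and attention parameters are placeholders, and the Bi-LSTM stage is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the abstract does not specify the actual feature
# sets or dimensions, so these are illustrative placeholders.
T = 50                    # number of speech frames per utterance
feat_dims = [40, 12, 3]   # e.g. mel-spectrogram, MFCC, pitch-like features
D = 64                    # shared dimension each channel is mapped to

# One input sequence per feature type, shape (T, d_i) each.
features = [rng.standard_normal((T, d)) for d in feat_dims]

# Per-channel linear projection standing in for the per-channel 2-D
# convolutions: each channel maps its feature type into the same D-dim space.
proj = [rng.standard_normal((d, D)) / np.sqrt(d) for d in feat_dims]
channel_out = [f @ W for f, W in zip(features, proj)]

# Combine channels (summed here; the exact merge rule is not given in the
# abstract). A Bi-LSTM would process `merged` next; it is omitted here.
merged = np.sum(channel_out, axis=0)           # (T, D)

# Frame-level attention: a scalar score per frame, softmax-normalized, so
# uninformative (e.g. silent) frames can receive near-zero weight.
w = rng.standard_normal(D) / np.sqrt(D)        # attention vector (placeholder)
scores = merged @ w                            # (T,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                           # weights sum to 1

utterance_vec = alpha @ merged                 # (D,) attention-pooled summary
print(merged.shape, alpha.shape, utterance_vec.shape)
```

The attention-pooled `utterance_vec` would then feed a classifier over the emotion categories; the key point is that all channels contribute in the same D-dimensional space, and frame weights `alpha` let the model discount silence.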