Abstract

Speaker-independent speech emotion recognition (SER) is a challenging task because variations among speakers, such as gender, age, and other emotion-irrelevant factors, can lead to large differences in the distribution of emotional features. To alleviate the adverse effects of these emotion-irrelevant factors, we propose an SER model that consists of a convolutional neural network (CNN), an attention-based bidirectional long short-term memory network (BLSTM), and multiple linear support vector machines (SVMs). The log Mel-spectrogram, together with its velocity (delta) and acceleration (double-delta) coefficients, is adopted as the input to our model, since these features provide sufficient information for feature learning. Several groups of speaker-independent SER experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database to improve the credibility of the results. Experimental results show that our method achieves an unweighted average recall of 61.50% and a weighted average recall of 62.31% for speaker-independent SER on the IEMOCAP database.
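
As an illustration, the input representation described above (the log Mel-spectrogram stacked with its delta and double-delta coefficients as three channels) can be computed with librosa as in the following minimal sketch. The sampling rate, window, hop, and Mel-band settings here are assumptions chosen for the example, not the configuration reported in the paper.

    import numpy as np
    import librosa

    def log_mel_3channel(path, sr=16000, n_fft=400, hop_length=160, n_mels=40):
        # NOTE: all parameter values above are illustrative assumptions,
        # not the settings reported in the paper.
        y, _ = librosa.load(path, sr=sr)
        # Mel power spectrogram, converted to the log (dB) scale.
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
        )
        log_mel = librosa.power_to_db(mel, ref=np.max)
        # Velocity (delta) and acceleration (double-delta) coefficients.
        delta = librosa.feature.delta(log_mel, order=1)
        delta2 = librosa.feature.delta(log_mel, order=2)
        # Stack as channels: shape (3, n_mels, n_frames).
        return np.stack([log_mel, delta, delta2], axis=0)

The resulting array plays the same role as a multi-channel image, so a CNN front end can consume the static, velocity, and acceleration planes jointly before the attention-based BLSTM models the temporal dynamics.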
