Abstract

Speech Emotion Recognition (SER) is of great significance in the research fields of human-computer interaction and affective computing. One of the major challenges for SER lies in extracting effective emotional features from lengthy utterances. However, most existing deep-learning-based SER models take log-Mel spectrograms as their only input, which cannot fully convey the emotional information in speech. Furthermore, the limited extraction capacity of such models makes it difficult to capture key emotional representations. To address these issues, we propose a new multiple-attention convolutional recurrent network comprising a convolutional neural network (CNN) module and a bidirectional long short-term memory (BiLSTM) module, which take extracted Mel-spectrogram and Fourier-coefficient features respectively, so that the two inputs complement each other's emotional information. The multiple attention mechanisms in our model are as follows: spatial and channel attention are added to the CNN module to focus on key emotional regions and locate more effective features, while temporal attention weights the features of different time-series segments after the BiLSTM extracts sequence information. Experimental results show that the model achieves weighted accuracies (WA) of 87.9%, 76.5%, and 75.2% and unweighted accuracies (UA) of 87.6%, 73.5%, and 70.1% on the EMODB, IEMOCAP, and EESDB speech datasets respectively, outperforming most state-of-the-art methods.
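
The following is a minimal PyTorch sketch of the multiple-attention architecture described above: a CNN branch with channel and spatial attention over the Mel-spectrogram, a BiLSTM branch with temporal attention over frame-level Fourier-coefficient features, and a fused classifier. Layer sizes, feature dimensions, and the fusion scheme are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed form)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # global average pool -> (B, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention over the time-frequency map (assumed form)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class TemporalAttention(nn.Module):
    """Weights each BiLSTM time step before pooling."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                       # h: (B, T, D)
        alpha = torch.softmax(self.score(h), dim=1)   # attention over time
        return (alpha * h).sum(dim=1)                 # (B, D)


class MultiAttentionCRNN(nn.Module):
    def __init__(self, n_fourier=40, n_classes=4):
        super().__init__()
        # CNN branch on the Mel-spectrogram, refined by channel + spatial attention.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.chan_att = ChannelAttention(64)
        self.spat_att = SpatialAttention()
        # BiLSTM branch on frame-level Fourier-coefficient features.
        self.bilstm = nn.LSTM(n_fourier, 128, batch_first=True, bidirectional=True)
        self.temp_att = TemporalAttention(256)
        self.classifier = nn.Linear(64 + 256, n_classes)

    def forward(self, mel, fourier):
        # mel: (B, 1, n_mels, frames); fourier: (B, frames, n_fourier)
        c = self.spat_att(self.chan_att(self.cnn(mel)))
        c = c.mean(dim=(2, 3))                  # global pooling -> (B, 64)
        h, _ = self.bilstm(fourier)             # (B, T, 256)
        t = self.temp_att(h)                    # (B, 256)
        return self.classifier(torch.cat([c, t], dim=1))


# Usage sketch with dummy tensors:
model = MultiAttentionCRNN()
mel = torch.randn(2, 1, 64, 128)
fourier = torch.randn(2, 128, 40)
print(model(mel, fourier).shape)                # torch.Size([2, 4])
```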
