Abstract

Speech emotion recognition (SER) is a challenging task due to the complexity of emotions, and SER performance depends heavily on the effectiveness of the emotional features extracted from speech. However, most emotional features are sensitive to emotionally irrelevant factors, such as the speaker, speaking style, and environment. In this letter, we assume that computing deltas and delta-deltas of personalized features not only preserves the effective emotional information but also reduces the influence of emotionally irrelevant factors, leading to fewer misclassifications. In addition, SER often suffers from silent frames and emotionally irrelevant frames. Meanwhile, attention mechanisms have shown outstanding performance in learning relevant feature representations for specific tasks. Inspired by this, we propose a three-dimensional attention-based convolutional recurrent neural network to learn discriminative features for SER, where the Mel-spectrogram with its deltas and delta-deltas is used as input. Experiments on the IEMOCAP and Emo-DB corpora demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance in terms of unweighted average recall.
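
As a minimal illustrative sketch (not the authors' exact pipeline), the three-dimensional input can be formed by stacking the Mel-spectrogram with its deltas and delta-deltas along a channel axis. The librosa calls and the frame, filterbank, and sampling parameters below are assumptions for illustration only.

# Sketch: build a 3-channel (static, delta, delta-delta) Mel-spectrogram input.
# Assumptions: librosa for feature extraction; n_mels, n_fft, hop_length are
# illustrative values, not taken from the letter.
import librosa
import numpy as np

def three_d_mel_input(wav_path, sr=16000, n_mels=40, n_fft=400, hop_length=160):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)               # static log Mel-spectrogram
    delta = librosa.feature.delta(log_mel, order=1)  # first-order differences
    delta2 = librosa.feature.delta(log_mel, order=2) # second-order differences
    # Stack into a (3, n_mels, n_frames) tensor for a 3-D CNN/CRNN front end.
    return np.stack([log_mel, delta, delta2], axis=0)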
