Abstract

Speech Emotion Recognition (SER) is a challenging task for deep learning algorithms, because human judgments of emotion are not absolute: different listeners may assign different labels to the same utterance. At the same time, SER plays an important role in many real-time applications. With the rapid development of deep learning in recent years, many works use convolutional neural networks (CNNs) to extract high-dimensional features from speech spectrograms, thereby improving the accuracy of speech emotion recognition. In contrast, we propose a new speech emotion recognition model. The model takes as input the eGeMAPS feature set extracted with the openSMILE toolkit and learns both the correlations and the temporal dependencies between features. In addition, we apply intra-class normalization to the input features, which yields more accurate recognition and faster convergence. Through its convolutional layers, the model selects the key speech segments, further improving recognition accuracy. We evaluate the model experimentally on the IEMOCAP dataset. Experimental results show that our unweighted accuracy (UA) and weighted accuracy (WA) on the test set reach 60.9% and 63.0%, respectively.

Keywords: Speech emotion recognition · Convolutional neural networks · Attention mechanism
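As a minimal sketch of the feature-extraction step described above, the snippet below pulls eGeMAPS features from an utterance with openSMILE's Python wrapper and applies a z-score normalization. The choice of the LowLevelDescriptors level, the file name, and the `zscore` helper are illustrative assumptions; the abstract does not specify whether LLDs or functionals are used, nor the exact form of the intra-class normalization.

```python
# Sketch only: assumes the openSMILE Python wrapper (pip install opensmile).
import numpy as np
import opensmile

# eGeMAPS low-level descriptors give a (frames x 25) time series per utterance,
# preserving the temporal structure the model is said to learn from.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

feats = smile.process_file("utterance.wav")  # pandas DataFrame, one row per frame
x = feats.to_numpy()

def zscore(x, mean, std, eps=1e-8):
    """Normalize features given precomputed statistics.

    For the paper's intra-class normalization, mean/std would be computed
    within each emotion class on the training set (assumed detail).
    """
    return (x - mean) / (std + eps)

x_norm = zscore(x, x.mean(axis=0), x.std(axis=0))
```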
