Abstract

Speech emotion recognition (SER) is a challenging task in the field of emotion recognition. The performance of SER depends largely on the emotional features extracted from speech. However, different emotional features are unevenly distributed and typically combined only linearly, and their sensitivity to emotion also differs, which limits recognition accuracy. To address this problem, a multi-channel 2-D convolutional recurrent neural network model is proposed. Identical convolutions in each channel map the different features to a common dimension, the channel outputs are combined and fed to a bidirectional long short-term memory (Bi-LSTM) network to extract global features, and an attention mechanism is then applied to suppress the influence of silent segments. The model is evaluated on two benchmark corpora, and the results show that the designed network performs well in SER, achieving average accuracies of 69.51% on IEMOCAP and 86.42% on EMO-DB.
