Abstract
Speech emotion recognition (SER) is a challenging task in the field of emotion recognition. The performance of SER largely depends on the emotional features extracted from speech. However, different emotional features are unevenly distributed and are typically combined only linearly, and their sensitivity to emotion also varies, which largely limits recognition accuracy. To address this problem, a multi-channel 2-D convolutional recurrent neural network is proposed: per-channel convolutions map the different features to a common dimension, the channel outputs are combined and fed into a bidirectional long short-term memory (Bi-LSTM) network to extract global features, and an attention mechanism is then applied to suppress the influence of silent segments. The model is evaluated on two benchmark corpora, and the results show that the designed network performs well in SER, achieving average accuracies of 69.51% on IEMOCAP and 86.42% on EMO-DB.
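The pipeline described above (per-channel projection to a shared dimension, channel merging, then attention pooling over frames) can be sketched numerically. This is a minimal NumPy illustration, not the paper's implementation: the feature types, dimensions, merge rule, and attention parameters are placeholders, and the Bi-LSTM stage is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the abstract does not specify the actual feature
# sets or dimensions, so these are illustrative placeholders.
T = 50                    # number of speech frames per utterance
feat_dims = [40, 12, 3]   # e.g. mel-spectrogram, MFCC, pitch-like features
D = 64                    # shared dimension each channel is mapped to

# One input sequence per feature type, shape (T, d_i) each.
features = [rng.standard_normal((T, d)) for d in feat_dims]

# Per-channel linear projection standing in for the per-channel 2-D
# convolutions: each channel maps its feature type into the same D-dim space.
proj = [rng.standard_normal((d, D)) / np.sqrt(d) for d in feat_dims]
channel_out = [f @ W for f, W in zip(features, proj)]

# Combine channels (summed here; the exact merge rule is not given in the
# abstract). A Bi-LSTM would process `merged` next; it is omitted here.
merged = np.sum(channel_out, axis=0)           # (T, D)

# Frame-level attention: a scalar score per frame, softmax-normalized, so
# uninformative (e.g. silent) frames can receive near-zero weight.
w = rng.standard_normal(D) / np.sqrt(D)        # attention vector (placeholder)
scores = merged @ w                            # (T,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                           # weights sum to 1

utterance_vec = alpha @ merged                 # (D,) attention-pooled summary
print(merged.shape, alpha.shape, utterance_vec.shape)
```

The attention-pooled `utterance_vec` would then feed a classifier over the emotion categories; the key point is that all channels contribute in the same D-dimensional space, and frame weights `alpha` let the model discount silence.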