Abstract

In recent years, speech emotion recognition has received widespread attention from both academia and industry. However, current approaches still suffer from low recognition rates and a reliance on a single type of emotion feature. To address these issues, a multi-feature speech emotion recognition method is proposed that improves the recognition rate by exploiting the complementarity of multiple features. First, statistical functions are applied to low-level descriptor features to obtain statistical features. Second, the mel spectrogram is cropped into slices, which are fed into a deep convolutional neural network (DCNN), and a BiLSTM network with a self-attention mechanism is used to extract slice features; deep spatial and temporal features are then obtained through temporal pyramid pooling. Finally, the three kinds of features are fused at the feature level and passed to a Softmax layer for emotion classification. Experiments on three datasets, CASIA, EMO-DB, and IEMOCAP, show that accuracy improves by 3.8%, 4.8%, and 17.41%, respectively, compared with using a single feature.
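
To make the fusion pipeline concrete, the following is a minimal PyTorch sketch of the three-branch architecture described above. All layer sizes, feature dimensions, and the number of classes are assumptions not specified in the abstract, and the temporal pyramid pooling step is simplified to plain mean pooling over slices; this is an illustrative sketch, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions -- not given in the abstract.
N_STATS = 384        # statistical features from low-level descriptors (assumed)
N_MELS = 128         # mel bins per spectrogram slice (assumed)
SLICE_FRAMES = 64    # frames per cropped slice (assumed)
N_CLASSES = 6        # e.g. CASIA defines six emotion categories

class MultiFeatureSER(nn.Module):
    # Three branches: statistical features, DCNN spatial features, and
    # BiLSTM/self-attention temporal features, fused at the feature level.
    def __init__(self):
        super().__init__()
        # DCNN branch over mel-spectrogram slices (layer sizes assumed)
        self.dcnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.spatial_proj = nn.Linear(64 * (N_MELS // 4) * (SLICE_FRAMES // 4), 256)
        # BiLSTM + self-attention branch over the slice sequence
        self.bilstm = nn.LSTM(N_MELS, 128, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(256, 1)
        # Fusion + classifier (returns logits; Softmax applied by the caller)
        self.classifier = nn.Linear(N_STATS + 256 + 256, N_CLASSES)

    def forward(self, stats, slices):
        # stats: (B, N_STATS); slices: (B, T, N_MELS, SLICE_FRAMES)
        B, T = slices.shape[:2]
        # Spatial features: run the DCNN on each slice, then average over slices
        # (stand-in for the paper's temporal pyramid pooling)
        x = self.dcnn(slices.reshape(B * T, 1, N_MELS, SLICE_FRAMES))
        spatial = self.spatial_proj(x.reshape(B, T, -1)).mean(dim=1)
        # Temporal features: BiLSTM over per-slice mel means, then
        # self-attention pooling over time (a simplification)
        seq = slices.mean(dim=3)                      # (B, T, N_MELS)
        h, _ = self.bilstm(seq)                       # (B, T, 256)
        w = F.softmax(self.attn(h), dim=1)            # attention weights over time
        temporal = (w * h).sum(dim=1)                 # (B, 256)
        # Feature-level fusion of the three branches
        fused = torch.cat([stats, spatial, temporal], dim=1)
        return self.classifier(fused)

model = MultiFeatureSER()
logits = model(torch.randn(2, N_STATS), torch.randn(2, 5, N_MELS, SLICE_FRAMES))
probs = F.softmax(logits, dim=1)  # per-emotion probabilities
print(probs.shape)                # torch.Size([2, 6])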
