Abstract

Speaker-independent speech emotion recognition (SER) is a challenging task because variations among speakers, such as gender, age, and other emotion-irrelevant factors, can lead to large differences in the distribution of emotional features. To alleviate the adverse effects of these emotion-irrelevant factors, we propose an SER model that consists of a convolutional neural network (CNN), an attention-based bidirectional long short-term memory network (BLSTM), and multiple linear support vector machines (SVMs). The log Mel-spectrogram, together with its velocity (delta) and acceleration (double-delta) coefficients, is adopted as the input to our model, since these features provide sufficient information for feature learning. Several groups of speaker-independent SER experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database to improve the credibility of the results. Experimental results show that our method achieves an unweighted average recall of 61.50% and a weighted average recall of 62.31% for speaker-independent SER on the IEMOCAP database.
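
As an illustration, the input representation described above (the log Mel-spectrogram stacked with its delta and double-delta coefficients as three channels) can be computed with librosa as in the following minimal sketch. The sampling rate, window, hop, and Mel-band settings here are assumptions chosen for the example, not the configuration reported in the paper.

    import numpy as np
    import librosa

    def log_mel_3channel(path, sr=16000, n_fft=400, hop_length=160, n_mels=40):
        # NOTE: all parameter values above are illustrative assumptions,
        # not the settings reported in the paper.
        y, _ = librosa.load(path, sr=sr)
        # Mel power spectrogram, converted to the log (dB) scale.
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
        )
        log_mel = librosa.power_to_db(mel, ref=np.max)
        # Velocity (delta) and acceleration (double-delta) coefficients.
        delta = librosa.feature.delta(log_mel, order=1)
        delta2 = librosa.feature.delta(log_mel, order=2)
        # Stack as channels: shape (3, n_mels, n_frames).
        return np.stack([log_mel, delta, delta2], axis=0)

The resulting array plays the same role as a multi-channel image, so a CNN front end can consume the static, velocity, and acceleration planes jointly before the attention-based BLSTM models the temporal dynamics.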
