Abstract
Speech Emotion Recognition (SER) plays an important role in human-computer interaction and assistive technologies. In this paper, a new method is proposed that uses distributed Convolutional Neural Networks (CNN) to automatically learn affect-salient features from raw spectral information, and then applies a Bidirectional Recurrent Neural Network (BRNN) to capture temporal information from the CNN output. Finally, an attention mechanism is applied to the BRNN output sequence to focus on the emotion-pertinent parts of an utterance. This attention mechanism not only improves classification accuracy but also makes the model more interpretable. Experimental results show that this approach achieves 64.08% weighted accuracy and 56.41% unweighted accuracy on four-emotion classification on the IEMOCAP dataset, outperforming previously reported results on this dataset.
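The attention pooling step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simple dot-product scoring function with a learned weight vector `w` (the paper's exact scoring function and parameterization are not specified in the abstract). Given the BRNN output `H` (one hidden state per time step), the attention weights form a distribution over time steps, and the weighted sum yields a fixed-size utterance representation for the emotion classifier.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(H, w):
    """Attention pooling over BRNN outputs.

    H: (T, d) array of BRNN hidden states, one per time step.
    w: (d,) hypothetical learned scoring vector (an assumption here).
    Returns the (d,) context vector and the (T,) attention weights.
    """
    scores = H @ w            # relevance score for each time step
    alpha = softmax(scores)   # attention weights, sum to 1
    context = alpha @ H       # weighted sum of hidden states
    return context, alpha

# Toy example: 4 time steps, 3-dimensional hidden states.
H = np.array([[0.1, 0.2, 0.0],
              [1.0, 0.5, 0.3],
              [0.2, 0.1, 0.9],
              [0.0, 0.0, 0.1]])
w = np.array([1.0, 0.5, 0.2])
context, alpha = attention_pool(H, w)
```

Because `alpha` is an explicit distribution over time steps, it can be inspected directly, which is the source of the interpretability the abstract mentions: high-weight frames are the ones the model treats as emotion-pertinent.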
DEStech Transactions on Computer Science and Engineering