Abstract
Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from their speech. SER benefits Human-Computer Interaction (HCI), but there are still many open problems in SER research, e.g., the lack of high-quality data, insufficient model accuracy, and little research under noisy environments. In this paper, we propose a method called Head Fusion based on the multi-head attention mechanism to improve the accuracy of SER. We implemented an attention-based convolutional neural network (ACNN) model and conducted experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. The accuracy is improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this dataset (76.4% WA and 70.1% UA), we achieve an absolute UA improvement of about 6% while achieving a similar WA. Furthermore, we conducted empirical experiments by injecting the speech data with 50 types of common noise. We inject the noise by altering the noise intensity, time-shifting the noise, and mixing different noise types, to identify their varied impacts on SER accuracy and to verify the robustness of our model. This work may also help researchers and engineers augment their training data with speech containing appropriate types of noise, alleviating the problem of insufficient high-quality data.
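To make the noise-injection procedure described above concrete, here is a minimal Python sketch of adding a noise clip to an utterance while varying its intensity, time shift, and type. The function and parameter names (mix_at_snr, snr_db, shift) are illustrative assumptions rather than the paper's implementation; in particular, a signal-to-noise ratio in dB is used here as a stand-in for "noise intensity".

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float,
               shift: int = 0) -> np.ndarray:
    """Add a (possibly time-shifted) noise clip to speech at a target SNR.

    speech, noise: 1-D float arrays at the same sample rate.
    snr_db: desired speech-to-noise ratio in dB (lower = noisier).
    shift: number of samples to rotate the noise clip before mixing.
    """
    # Time-shift the noise by rotating it, then tile/crop to the speech length.
    noise = np.roll(noise, shift)
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that the mixture has the requested SNR.
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def mix_noise_types(speech, noise_a, noise_b, snr_db):
    # Combine two noise types into one clip, then add it at the target SNR.
    n = min(len(noise_a), len(noise_b))
    combined = 0.5 * (noise_a[:n] + noise_b[:n])
    return mix_at_snr(speech, combined, snr_db)
```

Sweeping snr_db and shift over a grid, and pairing different noise types, would produce the kind of varied noisy copies of each utterance that the robustness experiments describe.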
Highlights
Emotion recognition plays an important role in Human-Computer Interaction (HCI)
Emotion judgment is influenced by many factors; language and culture in particular have an important influence on the judgment of emotions in speech [19], which increases the cost of data labeling
We proposed a method called Head Fusion based on multi-head self-attention and designed an attention-based convolutional neural network (ACNN) model
Summary
Emotion recognition plays an important role in Human-Computer Interaction (HCI). With the development of deep learning technology, it has become possible to recognize human emotions from speech [1]–[7], text [8], [9], and facial expressions [10], [11]. Deep learning has accelerated progress in recognizing human emotions from speech, but there are still deficiencies in SER research, such as data shortage and insufficient model accuracy. Emotion judgment is influenced by many factors; language and culture in particular have an important influence on the judgment of emotions in speech [19], which increases the cost of data labeling. Our model improves accuracy to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA) on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which is state-of-the-art.
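As a rough illustration of the attention mechanism the summary refers to, the following PyTorch sketch applies multi-head self-attention to frame-level features (e.g., produced by convolutional layers over a spectrogram) and pools the result for emotion classification. The class name, feature dimension, number of heads, number of classes, and mean pooling are all assumptions made for illustration; the paper's Head Fusion combines attention heads in its own specific way, which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttentionPool(nn.Module):
    """Illustrative multi-head self-attention over CNN feature frames.

    Takes a (batch, time, dim) feature map, applies multi-head
    self-attention, pools over time, and outputs emotion logits.
    """

    def __init__(self, dim: int = 128, num_heads: int = 4, num_classes: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Self-attention: queries, keys, and values all come from the features.
        attended, _ = self.attn(feats, feats, feats)
        # Pool over the time axis and map to emotion logits.
        pooled = attended.mean(dim=1)
        return self.classifier(pooled)

# Example: a batch of 8 utterances, 100 frames, 128-dim CNN features.
model = MultiHeadSelfAttentionPool()
logits = model(torch.randn(8, 100, 128))  # -> shape (8, 4)
```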