Abstract

Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from their speech. SER benefits Human-Computer Interaction (HCI). However, many problems remain in SER research, e.g., the lack of high-quality data, insufficient model accuracy, and little research under noisy environments. In this paper, we propose a method called Head Fusion based on the multi-head attention mechanism to improve the accuracy of SER. We implemented an attention-based convolutional neural network (ACNN) model and conducted experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. The accuracy improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this dataset (76.4% WA and 70.1% UA), we achieved a UA improvement of about 6% absolute while achieving a similar WA. Furthermore, we conducted empirical experiments by injecting the speech data with 50 types of common noise. We injected the noise by altering the noise intensity, time-shifting the noise, and mixing different noise types, to identify their varied impacts on SER accuracy and to verify the robustness of our model. This work will also help researchers and engineers properly augment their training data with speech data containing appropriate types of noise, alleviating the problem of insufficient high-quality data.
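The noise-injection scheme described above (altering noise intensity, time-shifting the noise, and mixing it into speech) can be sketched as below. This is a minimal illustration, not the paper's exact procedure: the function name and the SNR-based intensity control are assumptions.

```python
import numpy as np

def mix_noise(speech, noise, snr_db, shift=0):
    """Mix a noise clip into a speech signal at a target
    speech-to-noise ratio (dB), optionally time-shifting
    (circularly rolling) the noise first.
    Illustrative sketch; names and SNR control are assumptions."""
    noise = np.roll(noise, shift)
    # Tile or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    # Scale the noise so the power ratio matches snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Mixing different noise types then amounts to calling `mix_noise` repeatedly with different noise clips, SNRs, and shifts on the same utterance.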

Highlights

  • Emotion recognition plays an important role in Human-Computer Interaction (HCI)

  • Emotion judgment in speech is influenced by many factors; language and culture have an important influence on the judgment of emotions in speech [19], which increases the cost of data labeling

  • We proposed a method called Head Fusion based on multi-head self-attention and designed an attention-based convolutional neural network (ACNN) model


Summary

INTRODUCTION

Emotion recognition plays an important role in Human-Computer Interaction (HCI). With the development of deep learning technology, it has become possible to recognize human emotions from speech [1]–[7], text [8], [9], and facial expressions [10], [11]. Deep learning has accelerated the progress of recognizing human emotions from speech, but there are still deficiencies in SER research, such as data shortage and insufficient model accuracy. Emotion judgment in speech is influenced by many factors; language and culture have an important influence on the judgment of emotions in speech [19], which increases the cost of data labeling. The model improves accuracy to 76.18% (Weighted Accuracy, WA) and 76.36% (Unweighted Accuracy, UA) on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which is state-of-the-art.
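The idea of fusing multiple attention heads into a single attention map can be sketched as follows. This is a minimal single-layer NumPy illustration under stated assumptions: the averaging fusion rule and all function names are assumptions for illustration, and the paper's Head Fusion may combine heads differently.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fused_multi_head_attention(x, wq, wk, wv, n_heads):
    """Self-attention over a (time, dim) feature matrix whose per-head
    attention maps are fused (here: averaged) into one map before
    weighting the values. Averaging is an illustrative assumption."""
    t, d = x.shape
    hd = d // n_heads  # per-head dimension
    q, k, v = x @ wq, x @ wk, x @ wv
    maps = []
    for h in range(n_heads):
        qh = q[:, h * hd:(h + 1) * hd]
        kh = k[:, h * hd:(h + 1) * hd]
        maps.append(softmax(qh @ kh.T / np.sqrt(hd)))  # (t, t) per head
    fused = np.mean(maps, axis=0)  # single fused (t, t) attention map
    return fused @ v               # (t, d) attended features
```

Because each per-head map is row-stochastic, the averaged map remains a valid attention distribution over time steps.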

RELATED WORK
EXPERIMENTAL SETUP
Findings
1) EXPERIMENTS ON THE CLEAN DATA SET