Abstract
Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from their speech. SER benefits Human-Computer Interaction (HCI), but there are still many open problems in SER research, e.g., the lack of high-quality data, insufficient model accuracy, and little research under noisy environments. In this paper, we propose a method called Head Fusion based on the multi-head attention mechanism to improve the accuracy of SER. We implemented an attention-based convolutional neural network (ACNN) model and conducted experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. The accuracy is improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this dataset (76.4% WA and 70.1% UA), we achieve an absolute UA improvement of about 6% while achieving a similar WA. Furthermore, we conducted empirical experiments by injecting the speech data with 50 types of common noise. We inject the noise by altering the noise intensity, time-shifting the noise, and mixing different noise types, to identify their varied impacts on SER accuracy and to verify the robustness of our model. This work may also help researchers and engineers augment their training data with speech containing appropriate types of noise, alleviating the problem of insufficient high-quality data.
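To make the noise-injection procedure described above concrete, here is a minimal Python sketch of adding a noise clip to an utterance while varying its intensity, time shift, and type. The function and parameter names (mix_at_snr, snr_db, shift) are illustrative assumptions rather than the paper's implementation; in particular, a signal-to-noise ratio in dB is used here as a stand-in for "noise intensity".

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float,
               shift: int = 0) -> np.ndarray:
    """Add a (possibly time-shifted) noise clip to speech at a target SNR.

    speech, noise: 1-D float arrays at the same sample rate.
    snr_db: desired speech-to-noise ratio in dB (lower = noisier).
    shift: number of samples to rotate the noise clip before mixing.
    """
    # Time-shift the noise by rotating it, then tile/crop to the speech length.
    noise = np.roll(noise, shift)
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that the mixture has the requested SNR.
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def mix_noise_types(speech, noise_a, noise_b, snr_db):
    # Combine two noise types into one clip, then add it at the target SNR.
    n = min(len(noise_a), len(noise_b))
    combined = 0.5 * (noise_a[:n] + noise_b[:n])
    return mix_at_snr(speech, combined, snr_db)
```

Sweeping snr_db and shift over a grid, and pairing different noise types, would produce the kind of varied noisy copies of each utterance that the robustness experiments describe.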
Highlights
Emotion recognition plays an important role in Human-Computer Interaction (HCI)
Emotion judgment is influenced by many factors; language and culture in particular have an important influence on the judgment of emotions in speech [19], which increases the cost of data labeling
We proposed a method called Head Fusion based on multi-head self-attention and designed an attention-based convolutional neural network (ACNN) model
Summary
Emotion recognition plays an important role in Human-Computer Interaction (HCI). With the development of deep learning technology, it has become possible to recognize human emotions from speech [1]–[7], text [8], [9], and facial expressions [10], [11]. Deep learning has accelerated progress in recognizing human emotions from speech, but there are still deficiencies in SER research, such as data shortage and insufficient model accuracy. Emotion judgment is influenced by many factors; language and culture in particular have an important influence on the judgment of emotions in speech [19], which increases the cost of data labeling. Our model improves accuracy to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA) on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which is state-of-the-art.
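As a rough illustration of the attention mechanism the summary refers to, the following PyTorch sketch applies multi-head self-attention to frame-level features (e.g., produced by convolutional layers over a spectrogram) and pools the result for emotion classification. The class name, feature dimension, number of heads, number of classes, and mean pooling are all assumptions made for illustration; the paper's Head Fusion combines attention heads in its own specific way, which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttentionPool(nn.Module):
    """Illustrative multi-head self-attention over CNN feature frames.

    Takes a (batch, time, dim) feature map, applies multi-head
    self-attention, pools over time, and outputs emotion logits.
    """

    def __init__(self, dim: int = 128, num_heads: int = 4, num_classes: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Self-attention: queries, keys, and values all come from the features.
        attended, _ = self.attn(feats, feats, feats)
        # Pool over the time axis and map to emotion logits.
        pooled = attended.mean(dim=1)
        return self.classifier(pooled)

# Example: a batch of 8 utterances, 100 frames, 128-dim CNN features.
model = MultiHeadSelfAttentionPool()
logits = model(torch.randn(8, 100, 128))  # -> shape (8, 4)
```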