Abstract

Speech emotion recognition (SER) offers a natural way to recognize individual emotions in everyday life. To deploy SER models in real-world applications, key challenges must be overcome, such as the scarcity of datasets tagged with emotion labels and the weak generalization of SER models to unseen target domains. This study proposes a multi-path and group-loss-based network (MPGLN) for SER that supports multi-domain adaptation. The proposed model includes a bidirectional long short-term memory (BLSTM)-based temporal feature generator and a feature extractor transferred from the pre-trained VGG-like audio classification model (VGGish), and it is trained simultaneously on multiple losses derived from the association between discrete and dimensional emotion labels. To evaluate the MPGLN SER on multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), comprising KESDy18 and KESDy19, is constructed, and the English-language Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluations of multi-domain adaptation and domain generalization showed F1-score improvements of 3.7% and 3.5%, respectively, over a baseline SER model that uses only the temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
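
As a rough illustration of the multi-path arrangement described above (not the authors' released implementation), the PyTorch sketch below fuses a BLSTM temporal embedding with a transferred 128-d embedding such as VGGish output, and feeds the fused vector to two heads: one for discrete emotion classes and one for dimensional labels. All layer sizes and names (N_MELS, EMB_DIM, N_CLASSES, N_DIM) are illustrative assumptions.

```python
# Illustrative sketch only -- dimensions and layer sizes are assumptions.
import torch
import torch.nn as nn

N_MELS, EMB_DIM, N_CLASSES, N_DIM = 64, 128, 4, 2  # e.g., 4 emotions; arousal/valence

class MPGLNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Path 1: BLSTM temporal feature generator over log-mel frames
        self.blstm = nn.LSTM(N_MELS, 64, batch_first=True, bidirectional=True)
        # Path 2: projection of a transferred embedding (e.g., 128-d VGGish output)
        self.proj = nn.Linear(EMB_DIM, 128)
        # Two output heads: discrete emotion class and dimensional labels
        self.cls_head = nn.Linear(128 + 128, N_CLASSES)
        self.dim_head = nn.Linear(128 + 128, N_DIM)

    def forward(self, mel_frames, transferred_emb):
        _, (h, _) = self.blstm(mel_frames)           # h: (2, B, 64)
        temporal = torch.cat([h[0], h[1]], dim=-1)   # (B, 128)
        transferred = torch.relu(self.proj(transferred_emb))
        fused = torch.cat([temporal, transferred], dim=-1)
        return self.cls_head(fused), self.dim_head(fused)

# Usage: logits, dims = MPGLNSketch()(torch.randn(8, 100, N_MELS), torch.randn(8, EMB_DIM))
```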

Highlights

  • Human speech is a natural communication method in human–computer interaction (HCI) and human–robot interaction (HRI)

  • For the evaluation of the multi-path and group-loss-based network (MPGLN) for speech emotion recognition (SER) on multi-cultural datasets, two Korean Emotional Speech Database (KESD) corpora (i.e., KESDy18 and KESDy19) constructed for this study and the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) are used

  • Speech samples of each class in the multi-domain datasets were fed to the SER model in units of speech segments, each consisting of voiced parts produced by vocal-cord vibration and unvoiced parts, such as silence sections between voiced parts [53] (a minimal segmentation sketch follows this list)
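
The snippet below sketches one common heuristic for the voiced/unvoiced split mentioned above: framing the waveform and thresholding short-time energy. This is a generic illustration, not necessarily the paper's exact segmentation method; the frame lengths and rel_threshold value are assumptions.

```python
# Minimal voiced/unvoiced split by short-time energy -- a common heuristic,
# not necessarily the segmentation used in the paper. Parameters are assumptions.
import numpy as np

def voiced_mask(wave, sr=16000, frame_ms=25, hop_ms=10, rel_threshold=0.1):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(wave) - frame) // hop)
    energy = np.array([np.mean(wave[i*hop : i*hop + frame] ** 2) for i in range(n)])
    # Frames whose energy exceeds a fraction of the peak are treated as voiced.
    return energy > rel_threshold * energy.max()
```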


Summary

Introduction

Human speech is a natural communication method in human–computer interaction (HCI) and human–robot interaction (HRI). Prior to deploying SER models in real applications, the lack of SER databases tagged with emotion labels must be addressed, because existing corpora are insufficient for training deep SER models. Another challenge is the limited generalization of SER models, owing to the high variability of the acoustic signals in emotional speech samples. We propose a multi-path and group-loss-based network (MPGLN) for SER, which supports supervised domain adaptation on multi-domain datasets acquired from multiple environments. The transferred feature extractor creates feature vectors using the pre-trained VGG-like audio classification model (VGGish) [17], and the proposed MPGLN SER is trained on multiple losses defined by the association between the discrete and continuous dimensional emotion labels [1] of the multi-domain samples.
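
To make the multi-loss training concrete, the sketch below combines a cross-entropy loss over discrete emotion classes with a regression loss over dimensional labels (e.g., arousal/valence). The fixed weighting via alpha is an assumption for illustration, not the paper's exact group-loss formulation.

```python
# Hedged sketch of a combined loss over discrete and dimensional labels.
# The fixed-alpha weighting is an assumption, not the paper's formulation.
import torch.nn.functional as F

def group_loss(cls_logits, dim_pred, y_class, y_dims, alpha=0.5):
    loss_discrete = F.cross_entropy(cls_logits, y_class)  # discrete emotion classes
    loss_dimensional = F.mse_loss(dim_pred, y_dims)       # e.g., arousal/valence
    return loss_discrete + alpha * loss_dimensional
```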

Related Works
Ensemble Learning Model for SER in Multi-Domain Datasets
Multi-Path Embedding Features
Group Loss
Evaluation of the BLSTM-Based Baseline SER
Evaluation of Multi-Domain Adaptation
Conclusions