A comprehensive study on bilingual and multilingual speech emotion recognition using a two-pass classification scheme.

Panikos Heracleous, Akio Yoneyama

doi:10.1371/journal.pone.0220386

Open Access

Journal: PLOS ONE
Publication Date: Aug 15, 2019
Citations: 27
License type: CC BY 4.0

Affiliation: KDDI Research (Japan)

Abstract

Emotion recognition plays an important role in human-computer interaction. Many past and ongoing studies have focused on speech emotion recognition using a variety of classifiers and feature extraction methods. Most of these studies, however, consider emotions solely from the perspective of a single language. In contrast, the current study extends monolingual speech emotion recognition to emotions expressed in several languages that are recognized simultaneously by a single, complete system. To address this issue, an effective method for bilingual speech emotion recognition is proposed and evaluated. The proposed method is based on a two-pass classification scheme consisting of spoken language identification and speech emotion recognition. In the first pass, the spoken language is identified; in the second pass, emotion recognition is conducted using the emotion models of the identified language. Based on deep learning and the i-vector paradigm, bilingual emotion recognition experiments were conducted using the state-of-the-art English IEMOCAP (four emotions) and German FAU Aibo (five emotions) corpora. Two classifiers, both using i-vector features, were compared: fully connected deep neural networks (DNN) and convolutional neural networks (CNN). With DNN, unweighted average recalls (UARs) of 64.0% and 61.14% were obtained on the IEMOCAP and FAU Aibo corpora, respectively. With CNN, UARs of 62.0% and 59.8% were achieved on the IEMOCAP and FAU Aibo corpora, respectively. These results are very promising and superior to those obtained in similar studies on multilingual or even monolingual speech emotion recognition.
Furthermore, an additional baseline approach for bilingual speech emotion recognition was implemented and evaluated. In the baseline approach, six common emotions were considered, and bilingual emotion models were trained on data from the two languages. In this case, UARs of 51.2% and 51.5% for six emotions were obtained using DNN and CNN, respectively. The results of the baseline method were reasonable and promising, showing the effectiveness of i-vectors and deep learning in bilingual speech emotion recognition. The proposed two-pass method based on language identification, however, showed significantly superior performance. Finally, the study was extended to multilingual speech emotion recognition using corpora collected under similar conditions. Specifically, the English IEMOCAP, the German Emo-DB, and a Japanese corpus were used to recognize four emotions with the proposed two-pass method. The results obtained were very promising, and the differences in UAR compared to the monolingual classifiers were not statistically significant.
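
The two-pass scheme described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy nearest-centroid classifier stands in for the DNN and CNN models, synthetic 3-dimensional vectors stand in for i-vectors, and the names `NearestCentroid`, `make_toy_corpus`, and `two_pass_predict`, as well as the German labels `de_emotion_0` to `de_emotion_4`, are hypothetical placeholders.

```python
import math
import random


class NearestCentroid:
    """Toy stand-in for the DNN/CNN classifiers (an assumption, not the paper's model)."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            # Column-wise mean of all training vectors with this label.
            self.centroids_[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        # Return the label whose centroid is closest in Euclidean distance.
        return min(self.centroids_, key=lambda lab: math.dist(x, self.centroids_[lab]))


def make_toy_corpus(lang_offset, emotions, rng, n_per_class=20):
    """Synthetic stand-ins for i-vectors: dim 0 encodes language, dim 1 encodes emotion."""
    X, y = [], []
    for k, emotion in enumerate(emotions):
        for _ in range(n_per_class):
            X.append([lang_offset + rng.gauss(0, 0.1),
                      5.0 * k + rng.gauss(0, 0.1),
                      rng.gauss(0, 0.1)])
            y.append(emotion)
    return X, y


def two_pass_predict(x, lang_clf, emotion_clfs):
    lang = lang_clf.predict(x)               # pass 1: spoken language identification
    emotion = emotion_clfs[lang].predict(x)  # pass 2: emotion model of the identified language
    return lang, emotion


rng = random.Random(0)
corpora = {
    "en": make_toy_corpus(0.0, ["angry", "happy", "neutral", "sad"], rng),
    "de": make_toy_corpus(50.0, [f"de_emotion_{i}" for i in range(5)], rng),
}

# One language classifier over all data, one emotion classifier per language.
lang_X = [x for X, _ in corpora.values() for x in X]
lang_y = [lang for lang, (X, _) in corpora.items() for _ in X]
lang_clf = NearestCentroid().fit(lang_X, lang_y)
emotion_clfs = {lang: NearestCentroid().fit(X, y) for lang, (X, y) in corpora.items()}

print(two_pass_predict([0.0, 15.0, 0.0], lang_clf, emotion_clfs))   # ('en', 'sad')
print(two_pass_predict([50.0, 20.0, 0.0], lang_clf, emotion_clfs))  # ('de', 'de_emotion_4')
```

Because each language keeps its own emotion model, the two corpora may use different emotion inventories (four and five classes here), which a single shared model could not do without first mapping them to a common label set.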

Highlights

Automatic recognition of human emotions is of vital importance in human-computer interaction and its applications [1].
Deep neural networks (DNN) and convolutional neural networks (CNN) trained on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU Aibo databases are used.
The angry and sad emotions have the highest recalls with both DNN and CNN, followed by neutral and happy.
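
The per-emotion recalls mentioned above combine into the unweighted average recall (UAR) reported in the abstract: UAR is the plain mean of the per-class recalls, so every emotion counts equally regardless of how many samples it has. The sketch below illustrates the metric on made-up labels; the function name `unweighted_average_recall` and the example data are assumptions for illustration, not results from the paper.

```python
from collections import Counter


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class has equal weight."""
    totals = Counter(y_true)  # number of reference samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Made-up, imbalanced example: 8 "angry" samples and 2 "sad" samples.
y_true = ["angry"] * 8 + ["sad"] * 2
y_pred = ["angry"] * 8 + ["angry", "sad"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar)       # 0.75 (recall 1.0 for angry, 0.5 for sad)
print(accuracy)  # 0.9
```

On imbalanced data the two measures diverge: accuracy is dominated by the majority class, whereas UAR drops as soon as any single class is recognized poorly.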

Summary

Introduction

Automatic recognition of human emotions is of vital importance in human-computer interaction and its applications [1]. Applications include human-robot communication, in which robots respond to humans according to the detected emotions; call centers, where the caller's emotional state can be detected in emergencies; identifying a customer's level of satisfaction; medical analysis; and education. Emotion recognition can be conducted using facial expressions, verbal communication, text, electroencephalography (EEG) signals, or a combination of multiple modalities. It can identify emotions in relation to a single language only, or it can simultaneously recognize emotions expressed in several languages. The current study reports comprehensive experiments on, and analysis of, bilingual and multilingual speech-based emotion recognition using English, German, and Japanese corpora. Deep neural networks fed with i-vector [2] features are used.
