A comprehensive study on bilingual and multilingual speech emotion recognition using a two-pass classification scheme.

Panikos Heracleous, Akio Yoneyama, Seyed Reza Shahamiri

doi:10.1371/journal.pone.0220386

Open Access

Journal: PLOS ONE
Publication Date: Aug 15, 2019
Citations: 23
License type: CC BY 4.0

Affiliation: KDDI Research (Japan)

Abstract

Emotion recognition plays an important role in human-computer interaction. Many past and ongoing studies have addressed speech emotion recognition using a variety of classifiers and feature extraction methods. Most of these studies, however, consider emotions solely from the perspective of a single language. In contrast, the current study extends monolingual speech emotion recognition to cover emotions expressed in several languages that are recognized simultaneously by a complete system. To this end, a method for bilingual speech emotion recognition is proposed and evaluated. The proposed method is based on a two-pass classification scheme consisting of spoken language identification and speech emotion recognition. In the first pass, the language spoken is identified; in the second pass, emotion recognition is conducted using the emotion models of the language identified. Based on deep learning and the i-vector paradigm, bilingual emotion recognition experiments were conducted using the state-of-the-art English IEMOCAP (four emotions) and German FAU Aibo (five emotions) corpora. Two classifiers were used with i-vector features and compared, namely, fully connected deep neural networks (DNN) and convolutional neural networks (CNN). With DNN, unweighted average recalls (UARs) of 64.0% and 61.14% were obtained on the IEMOCAP and FAU Aibo corpora, respectively. With CNN, UARs of 62.0% and 59.8% were achieved on the IEMOCAP and FAU Aibo corpora, respectively. These results are very promising and superior to those obtained in similar studies on multilingual or even monolingual speech emotion recognition. Furthermore, an additional baseline approach for bilingual speech emotion recognition was implemented and evaluated.
In the baseline approach, six common emotions were considered, and bilingual emotion models were created, trained on data from the two languages. In this case, 51.2% and 51.5% UARs for six emotions were obtained using DNN and CNN, respectively. The results using the baseline method were reasonable and promising, showing the effectiveness of using i-vectors and deep learning in bilingual speech emotion recognition. On the other hand, the proposed two-pass method based on language identification showed significantly superior performance. Furthermore, the current study was extended to also deal with multilingual speech emotion recognition using corpora collected under similar conditions. Specifically, the English IEMOCAP, the German Emo-DB, and a Japanese corpus were used to recognize four emotions based on the proposed two-pass method. The results obtained were very promising, and the differences in UAR were not statistically significant compared to the monolingual classifiers.
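
The two-pass scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier is a stand-in for the DNN and CNN classifiers, the 2-D vectors are made-up toy data rather than i-vectors, and the names `CentroidClassifier`, `TwoPassRecognizer`, and `recognizer` are hypothetical. It only shows the control flow: identify the language first, then apply the emotion model of that language.

```python
import numpy as np


class CentroidClassifier:
    """Nearest-centroid classifier (assumed stand-in for the paper's DNN/CNN)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.labels_ = sorted(set(y.tolist()))
        # One mean vector per class.
        self.centroids_ = np.stack([X[y == label].mean(axis=0) for label in self.labels_])
        return self

    def predict_one(self, x):
        distances = np.linalg.norm(self.centroids_ - np.asarray(x, dtype=float), axis=1)
        return self.labels_[int(np.argmin(distances))]


class TwoPassRecognizer:
    """Pass 1: language identification. Pass 2: language-specific emotion model."""

    def __init__(self, language_clf, emotion_clfs):
        self.language_clf = language_clf
        self.emotion_clfs = emotion_clfs  # dict: language -> emotion classifier

    def predict(self, x):
        language = self.language_clf.predict_one(x)            # first pass
        emotion = self.emotion_clfs[language].predict_one(x)   # second pass
        return language, emotion


# Toy data (assumption): dimension 0 separates the languages, dimension 1 the
# emotions, and the emotion layout differs between the two languages.
X_en = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 5.0], [0.0, 6.0]])
X_de = np.array([[10.0, 5.0], [10.0, 6.0], [10.0, 0.0], [10.0, 1.0]])
emotions = ["angry", "angry", "sad", "sad"]

language_clf = CentroidClassifier().fit(np.vstack([X_en, X_de]), ["en"] * 4 + ["de"] * 4)
emotion_clfs = {
    "en": CentroidClassifier().fit(X_en, emotions),
    "de": CentroidClassifier().fit(X_de, emotions),
}
recognizer = TwoPassRecognizer(language_clf, emotion_clfs)

print(recognizer.predict([9.0, 5.5]))
print(recognizer.predict([1.0, 5.5]))
```

In this toy setup the same second coordinate maps to different emotions depending on the identified language, which is the situation in which routing through language-specific emotion models matters.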

Highlights

Automatic recognition of human emotions is of vital importance in human-computer interaction and its applications [1].
Deep neural networks (DNN) and convolutional neural networks (CNN) trained on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU Aibo databases are used.
The angry and sad emotions have the highest recalls with both DNN and CNN, followed by the neutral and happy emotions.
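
The per-emotion recalls and the unweighted average recall (UAR) reported in this study follow the standard definition: recall is computed separately for each class, and UAR is the plain mean of those recalls, so every emotion counts equally regardless of how many samples it has. The sketch below uses made-up labels, not results from the paper, and the function names are hypothetical.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / number of true samples."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def unweighted_average_recall(y_true, y_pred):
    """UAR: mean of the per-class recalls, ignoring class frequencies."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy labels (assumption): four "angry" and two "sad" utterances.
y_true = ["angry", "angry", "angry", "angry", "sad", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "sad", "sad"]

recalls = per_class_recall(y_true, y_pred)      # angry: 3/4, sad: 2/2
uar = unweighted_average_recall(y_true, y_pred) # (0.75 + 1.0) / 2 = 0.875
print(recalls, uar)
```

Plain accuracy on the same toy labels is 5/6, which differs from the UAR because the two classes are unequal in size.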

Summary

Introduction

Automatic recognition of human emotions is of vital importance in human-computer interaction and its applications [1]. Applications include human-robot communication, in which robots respond to humans according to the detected emotions; call-center deployments that detect the caller’s emotional state in cases of emergency; identification of a customer’s level of satisfaction; medical analysis; and education. Emotion recognition can be conducted using facial expressions, verbal communication, text, electroencephalography (EEG) signals, or a combination of multiple modalities. It can identify emotions solely in relation to a single language, or it can simultaneously recognize emotions expressed in several languages. The current study reports comprehensive experiments on and analysis of bilingual and multilingual speech-based emotion recognition using English, German, and Japanese corpora. Deep neural networks fed with i-vector [2] features are used.

Methods

Results

Discussion

Conclusion