Emotion recognition plays an important role in human-computer interaction. Many studies have focused on speech emotion recognition using a variety of classifiers and feature extraction methods. The majority of such studies, however, address speech emotion recognition solely from the perspective of a single language. In contrast, the current study extends monolingual speech emotion recognition to cover emotions expressed in several languages that are recognized simultaneously by a single, complete system. To address this issue, an effective method for bilingual speech emotion recognition is proposed and evaluated. The proposed method is based on a two-pass classification scheme consisting of spoken language identification and speech emotion recognition. In the first pass, the spoken language is identified; in the second pass, emotion recognition is conducted using the emotion models of the identified language. Based on deep learning and the i-vector paradigm, bilingual emotion recognition experiments were conducted using the state-of-the-art English IEMOCAP (four emotions) and German FAU Aibo (five emotions) corpora. Two classifiers, both using i-vector features, were compared: fully connected deep neural networks (DNN) and convolutional neural networks (CNN). With DNN, unweighted average recalls (UARs) of 64.0% and 61.14% were obtained on the IEMOCAP and FAU Aibo corpora, respectively. With CNN, UARs of 62.0% and 59.8% were achieved on the IEMOCAP and FAU Aibo corpora, respectively. These results are very promising and superior to those obtained in similar studies on multilingual or even monolingual speech emotion recognition. Furthermore, an additional baseline approach for bilingual speech emotion recognition was implemented and evaluated.
In the baseline approach, six common emotions were considered, and bilingual emotion models were trained on data from both languages. In this case, UARs of 51.2% and 51.5% for six emotions were obtained using DNN and CNN, respectively. The results of the baseline method were reasonable and promising, showing the effectiveness of i-vectors and deep learning in bilingual speech emotion recognition. Nevertheless, the proposed two-pass method based on language identification performed significantly better. Finally, the current study was extended to multilingual speech emotion recognition using corpora collected under similar conditions. Specifically, the English IEMOCAP, the German Emo-DB, and a Japanese corpus were used to recognize four emotions with the proposed two-pass method. The results obtained were very promising: the differences in UAR relative to the monolingual classifiers were not statistically significant.
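The two-pass scheme and the UAR metric described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the names `two_pass_emotion`, `unweighted_average_recall`, `toy_language_id`, and `toy_emotion_models`, the language codes `"en"` and `"de"`, and the threshold rules are all hypothetical stand-ins for the trained DNN or CNN models operating on i-vectors. UAR is taken here in its usual sense, the mean of the per-class recalls.

```python
from collections import defaultdict
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical type: a classifier maps a feature vector (e.g. an i-vector)
# to a label. In the study these would be trained DNN or CNN models.
Classifier = Callable[[Sequence[float]], str]


def two_pass_emotion(
    features: Sequence[float],
    identify_language: Classifier,
    emotion_models: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Pass 1: identify the language. Pass 2: apply that language's emotion model."""
    language = identify_language(features)
    emotion = emotion_models[language](features)
    return language, emotion


def unweighted_average_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_language_id(x: Sequence[float]) -> str:
    return "en" if x[0] >= 0 else "de"


toy_emotion_models: Dict[str, Classifier] = {
    "en": lambda x: "angry" if x[1] > 0.0 else "neutral",
    "de": lambda x: "angry" if x[1] > 0.5 else "neutral",
}

if __name__ == "__main__":
    print(two_pass_emotion([1.0, 0.2], toy_language_id, toy_emotion_models))
    print(two_pass_emotion([-1.0, 0.2], toy_language_id, toy_emotion_models))
    print(unweighted_average_recall(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```

Because the second pass only ever consults the emotion model of the identified language, each language can keep its own emotion inventory (four emotions for IEMOCAP, five for FAU Aibo), whereas the baseline must map both corpora onto one shared label set.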