Abstract

Speech emotion recognition continues to attract substantial research interest, especially under mixed-language scenarios. Here, we show that emotion is language dependent and that enhanced emotion recognition systems can be built when the language is known. We propose a two-stage emotion recognition system that first identifies the language and then applies a dedicated language-dependent recognition system to identify the type of emotion. The system accurately recognizes the four main emotion types, namely neutral, happy, angry, and sad, which are widely used in practical setups. To keep the computational complexity low, we identify the language using a feature vector consisting of energies from a basic wavelet decomposition. A hidden Markov model (HMM) is then used to track the changes of this vector and identify the language, achieving recognition accuracy close to 100%. Once the language is identified, a set of speech processing features including pitch and MFCCs is used with a neural network (NN) architecture to identify the emotion type. The results show that identifying the language first can substantially improve the overall accuracy of emotion recognition: the overall accuracy achieved with the proposed system exceeds 93%. To test the robustness of the proposed methodology, we also used a Gaussian mixture model (GMM) for both language identification and emotion recognition. Our proposed HMM-NN approach outperformed the GMM-based approach. More importantly, we tested the proposed algorithm with six emotions and found that its overall accuracy remains excellent, while the performance of the GMM-based approach deteriorates substantially. It is worth noting that the performance we achieved is close to that attained by single-language emotion recognition systems and far outperforms recognition systems without language identification (around 60%).
This work demonstrates the strong correlation between language and emotion type, and the approach can be extended to other scenarios, including gender-based, facial-expression-based, and age-based emotion recognition.
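To illustrate the low-complexity language-identification features described above, the following is a minimal sketch of a wavelet-energy feature extractor. It assumes a Haar wavelet decomposition and per-subband energies normalized to sum to one; the abstract does not specify the wavelet family, decomposition depth, or framing, so those choices (4 levels, Haar, frame-level normalization) are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def haar_step(x):
    """One level of the Haar wavelet transform: returns (approximation, detail)."""
    x = x[: len(x) // 2 * 2]  # truncate to even length
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def wavelet_energy_features(signal, levels=4):
    """Compact feature vector for language ID: the energy of each detail
    subband plus the final approximation band, normalized to sum to 1.
    A sequence of such vectors (one per frame) would be tracked by an HMM."""
    energies = []
    approx = np.asarray(signal, dtype=float)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        energies.append(np.sum(detail ** 2))
    energies.append(np.sum(approx ** 2))
    e = np.array(energies)
    return e / e.sum()

# Example on a synthetic frame (220 Hz tone plus noise, 16 kHz sampling assumed)
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 220 * np.arange(1024) / 16000)
frame += 0.1 * rng.standard_normal(1024)
feats = wavelet_energy_features(frame)
print(feats.shape)  # (5,): 4 detail bands + 1 approximation band
```

In a two-stage system like the one proposed, each utterance would be split into frames, each frame mapped to such an energy vector, and one HMM per candidate language scored over the resulting sequence; the most likely language then selects the language-specific pitch/MFCC + NN emotion classifier.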
