Abstract

Automatic speech recognition and emotion recognition have been research hotspots in human-computer interaction in recent years. However, despite significant recent advances, robust recognition of emotional speech remains an unresolved problem. In this research we address this gap by exploiting the multimodality of speech and using visual information to increase both recognition accuracy and robustness. We present an extensive experimental investigation of how different emotions (anger, disgust, fear, happiness, neutral, and sadness) affect automatic lip-reading. We train a 3D ResNet-18 model on the CREMA-D emotional speech database, experimenting with different model parameters. To the best of our knowledge, this is the first study to investigate the influence of human emotions on automatic lip-reading. Our results demonstrate that speech with the emotion of disgust is the most difficult to recognize correctly, because the speaker strongly curves the lips and articulation becomes distorted. We experimentally confirm that models trained on all types of emotions (mean UAR 94.04%) significantly outperform models trained only on neutral speech (mean UAR 65.81%) or on any other single emotion (mean UAR from 54.82% for disgust to 68.62% for sadness). We carefully analyze the visual manifestations of the various emotions and assess their impact on the accuracy of automatic lip-reading. This research is a first step toward the creation of emotion-robust speech recognition systems.
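As an illustration of the setup described above, the sketch below (not the authors' implementation) instantiates a 3D ResNet-18 video classifier via torchvision and computes UAR, the unweighted average recall metric quoted in the results. The vocabulary size, clip dimensions, and labels are hypothetical placeholders.

```python
# A minimal sketch under stated assumptions, not the authors' code:
# a 3D ResNet-18 video classifier (torchvision's r3d_18) applied to a
# dummy lip-region clip, plus the UAR metric on hypothetical labels.
import torch
from torchvision.models.video import r3d_18
from sklearn.metrics import recall_score

NUM_WORD_CLASSES = 12  # hypothetical lip-reading vocabulary size
model = r3d_18(weights=None, num_classes=NUM_WORD_CLASSES)

# Dummy batch: 2 clips, 3 channels, 16 frames, 112x112 mouth crops
clips = torch.randn(2, 3, 16, 112, 112)
logits = model(clips)          # -> shape (2, NUM_WORD_CLASSES)
preds = logits.argmax(dim=1)   # predicted word class per clip
print(logits.shape, preds)

# UAR = macro-averaged recall over classes (hypothetical predictions)
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.2%}")
```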
