Abstract

Automatic audio-visual speech recognition (AVSR) systems have recently achieved remarkable success, surpassing human performance on limited-vocabulary tasks, particularly in acoustically noisy conditions. Speech recognition systems that jointly process audio and video information are being actively researched and developed worldwide. However, no studies have analyzed how a speaker's emotional state (anger, disgust, fear, happy, neutral, and sad) and, most importantly, the intensity level of the emotion (low - LO, medium - MD, high - HI) affect automatic lip-reading. The relevance of this research topic is therefore difficult to overstate, and it requires detailed study. In this paper, we present a novel approach to emotional speech lip-reading that includes estimation of a speaker's emotion and its intensity level. The proposed approach uses visual speech data to detect the type and intensity of a person's emotion and, based on this information, assigns the input to one of the trained emotional lip-reading models. This essentially addresses the multi-emotional lip-reading problem that arises in most real-life scenarios. By taking the intensity of the pronounced audio-visual speech into account, the proposed approach improves state-of-the-art results by up to 8.2% in terms of accuracy. This research is a first step toward the creation of emotion-robust speech recognition systems and leaves open a wide field for further research.
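The routing idea described in the abstract (detect the emotion type and intensity from visual speech, then dispatch the clip to the matching emotion-specific lip-reading model) can be illustrated with a minimal sketch. All names, signatures, and the fallback policy below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: an emotion/intensity classifier selects which
# emotion-specific lip-reading model decodes the visual speech.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

EMOTIONS = ["anger", "disgust", "fear", "happy", "neutral", "sad"]
INTENSITIES = ["LO", "MD", "HI"]

# A "model" is abstracted as a callable from lip-region frames to a phrase.
LipReadingModel = Callable[[List[bytes]], str]


@dataclass
class EmotionEstimate:
    emotion: str    # one of EMOTIONS
    intensity: str  # one of INTENSITIES


def classify_emotion(frames: List[bytes]) -> EmotionEstimate:
    """Placeholder visual emotion/intensity classifier (assumed component)."""
    return EmotionEstimate(emotion="neutral", intensity="MD")


def recognize(frames: List[bytes],
              models: Dict[Tuple[str, str], LipReadingModel]) -> str:
    """Route the clip to the lip-reading model trained for the detected
    emotion and intensity level; fall back to a neutral model if missing."""
    estimate = classify_emotion(frames)
    key = (estimate.emotion, estimate.intensity)
    model = models.get(key, models[("neutral", "MD")])
    return model(frames)


if __name__ == "__main__":
    # Dummy per-(emotion, intensity) models for illustration only.
    models = {(e, i): (lambda f, e=e, i=i: f"<phrase decoded by {e}/{i} model>")
              for e in EMOTIONS for i in INTENSITIES}
    print(recognize([b"frame0", b"frame1"], models))
```

In practice the per-emotion models would be separately trained lip-reading networks, and the classifier's confidence could determine whether to trust the emotion-specific model or the neutral one; that policy is an assumption here.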
