Abstract

Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker’s face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person’s face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN) and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although the natural facial motions yield a still higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.
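As a minimal sketch of the end-to-end setup described above, the code below illustrates the interface of a speech-driven facial-animation model: a still face image and a speech waveform go in, a sequence of video frames comes out. The class and method names (SpeechDrivenAnimator, generate) and the placeholder output are our own assumptions for illustration and are not the actual API or model of Vougioukas et al. (2020).

```python
# Illustrative sketch of a speech-driven facial-animation interface.
# All names are hypothetical placeholders, not the API of Vougioukas et al. (2020).
import numpy as np


class SpeechDrivenAnimator:
    """Hypothetical wrapper around a pretrained speech-to-video GAN generator."""

    def __init__(self, checkpoint_path: str, fps: int = 25):
        # A real implementation would load the pretrained generator weights here.
        self.checkpoint_path = checkpoint_path
        self.fps = fps

    def generate(self, still_image: np.ndarray, speech: np.ndarray,
                 sample_rate: int) -> np.ndarray:
        """Return video frames of shape (n_frames, H, W, 3),
        one frame per 1/fps seconds of audio."""
        n_frames = int(round(len(speech) / sample_rate * self.fps))
        # Placeholder output: a real GAN conditions each frame on the identity
        # image and a window of the speech signal; here we simply repeat the
        # still image to illustrate the expected input/output shapes.
        return np.repeat(still_image[None, ...], n_frames, axis=0)


# Example usage with dummy data (96x96 RGB face image, 3 s of 16 kHz audio).
animator = SpeechDrivenAnimator("generator.ckpt")
face = np.zeros((96, 96, 3), dtype=np.uint8)
audio = np.zeros(3 * 16_000, dtype=np.float32)
video = animator.generate(face, audio, sample_rate=16_000)
print(video.shape)  # (75, 96, 96, 3) at 25 fps
```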

Highlights

  • Real-world listening environments are often noisy: many people talk simultaneously in a busy pub or restaurant, background music plays frequently, and traffic noise is omnipresent in cities

  • The 18 participants in the speech-in-noise comprehension experiment were not told about the nature of half of the videos, and none reported finding anything unusual about the videos in a questionnaire completed after the experiment

  • We proceeded to assess the potential benefits of the synthesized talking faces on speech-in-noise comprehension


Summary

INTRODUCTION

Real-world listening environments are often noisy: many people talk simultaneously in a busy pub or restaurant, background music plays frequently, and traffic noise is omnipresent in cities. Seeing a speaker’s face makes it considerably easier to understand them (Sumby and Pollack, 1954; Ross et al., 2007), and this is especially true for people with hearing impairments (Puschmann et al., 2019) or for listeners in background noise. This phenomenon, termed inverse effectiveness, is characterized by a more pronounced audiovisual comprehension gain in challenging hearing conditions (Meredith and Stein, 1986; Stevenson and James, 2009; Crosse et al., 2016). Recent machine-learning methods can synthesize talking faces directly from speech; most are based on generative adversarial networks (GANs) and can produce high-quality visual signals that can even reflect the speaker’s emotion (Chung et al., 2017; Chen et al., 2019; Vougioukas et al., 2020). Employing such facial animations to improve speech-in-noise comprehension would represent a significant step forward in the development of audiovisual hearing aids. In the synthesized videos, the offset between the audio and the visual component is below one frame (under 40 ms; Table 6, Vougioukas et al., 2020).
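To make the listening conditions above concrete, the sketch below shows one standard way to mix a speech signal with background noise at a chosen signal-to-noise ratio, together with the frame-to-milliseconds arithmetic behind the sub-40-ms audio-visual offset. It is a minimal illustration with our own function name (mix_at_snr), not code from the paper.

```python
# Minimal sketch: construct a speech-in-noise stimulus at a target SNR.
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the speech-to-noise power ratio equals snr_db,
    then add it to the speech."""
    noise = noise[: len(speech)]               # truncate noise to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(p_speech / (gain**2 * p_noise))  =>  solve for gain
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise


# Dummy usage: 1 s of "speech" and noise at 16 kHz, mixed at a challenging SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)
babble = rng.standard_normal(16_000)
noisy = mix_at_snr(clean, babble, snr_db=-6)

# One frame at 25 fps corresponds to 1/25 s, i.e. the sub-40-ms offset above.
frame_offset_ms = 1 / 25 * 1000
print(frame_offset_ms)  # 40.0
```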

