Abstract

The ability to see a talker's face improves speech intelligibility in noise, provided that the auditory and visual speech signals are approximately aligned in time. However, the importance of spatial alignment between corresponding faces and voices remains unresolved, particularly in multi-talker environments. In a series of online experiments, we investigated this using a task that required participants to selectively attend a target talker in noise while ignoring a distractor talker. In experiment 1, we found improved task performance when the talkers' faces were visible, but only when corresponding faces and voices were presented in the same hemifield (spatially aligned). In experiment 2, we tested for possible influences of eye position on this result. In auditory-only conditions, directing gaze toward the distractor voice reduced performance, but this effect could not fully explain the cost of audio-visual (AV) spatial misalignment. Lowering the signal-to-noise ratio (SNR) of the speech from +4 to -4 dB increased the magnitude of the AV spatial alignment effect (experiment 3), but accurate closed-set lipreading caused a floor effect that influenced results at lower SNRs (experiment 4). Taken together, these results demonstrate that spatial alignment between faces and voices contributes to the ability to selectively attend AV speech.

Highlights

  • In experiment 1, performance was worst overall by far in the A-only co-located condition, in which neither visual information nor spatial separation of the talkers was available to help participants segregate the speech streams. This indicates that the benefits of spatial release from masking (SRM) in a multi-talker environment were preserved in the online experiment format (Kidd et al., 1998; Marrone et al., 2008; Shinn-Cunningham et al., 2005).

Introduction

In a study that introduced a more complex version of the sound-induced flash paradigm with multiple competing streams of auditory and visual stimuli, the strength of the effect was modulated by spatial alignment within each AV stream (Bizley et al., 2012). The main benefits of spatially separating the competing talkers arise from facilitating perceptual segregation of the voices, thereby allowing listeners to focus attention selectively on the target talker based on its location (Durlach et al., 2003; Watson, 1987; Wu et al., 2005). In one previous study that combined SRM with the ability to see a target talker's face, the presence of visual input was found to provide a greater speech recognition benefit when the target and masker speech signals were spatially coincident (Helfer and Freyman, 2005).
