Abstract

Humans are adept at selectively listening to a specific target conversation, even in the presence of multiple concurrent speakers. In our research, we study how auditory-visual cues modulate this selective listening, using immersive Virtual Reality technologies with spatialized audio. Exposing 32 participants to an Information Masking Task with concurrent speakers, we find that asynchronous audiovisual speech cues trigger significantly more errors in the decision-making process. More precisely, the results show that lip movements on the Target speaker matched to a secondary (Mask) speaker’s audio severely increase the participants’ comprehension error rates. In a control experiment (n = 20), we further explore the influence of the visual modality on auditory selective attention. The results show a dominance of visual-speech cues, which effectively turn the Mask into the Target and vice versa. These results reveal a disruption of selective attention triggered by bottom-up multisensory integration. We frame the findings within theories of sensory perception and cognitive neuroscience, and we validate the VR setup in a supplementary experiment by replicating previous results from this literature.

Highlights

  • Humans often interact in noisy environments, where unintelligible noise or concurrent speakers mask the target speech

  • Previous studies have shown that, under specific circumstances, the visual input in an auditory-visual speech (AVS) task can modulate the perception of sounds, producing the well-known phonemic restoration [18] and McGurk [17] effects

  • We show that these effects may have broader ramifications, affecting selective attention and semantic interpretation beyond multimodal integration

Introduction

Humans often interact in noisy environments, where unintelligible noise or concurrent speakers mask the target speech. Energetic masking, which may originate from both speech and non-speech sounds, occupies the same frequency and amplitude range as the target speech; it can hinder target-speech perception by interfering with peripheral auditory processing. Informational masking, which consists of babble from irrelevant yet intelligible speech, interferes more strongly with target-speech perception. This masking is likely related to stages of processing beyond the auditory periphery, such as attention, perceptual grouping, short-term memory, and cognitive abilities [4].
