Abstract

Audiovisual speech integration combines information from auditory speech (talker’s voice) and visual speech (talker’s mouth movements) to improve perceptual accuracy. However, if the auditory and visual speech emanate from different talkers, integration decreases accuracy. Therefore, a key step in audiovisual speech perception is deciding whether auditory and visual speech have the same source, a process known as causal inference. A well-known illusion, the McGurk effect, consists of incongruent audiovisual syllables, such as auditory “ba” + visual “ga” (AbaVga), that are integrated to produce a fused percept (“da”). This illusion raises two fundamental questions: first, given the incongruence between the auditory and visual syllables in the McGurk stimulus, why are they integrated; and second, why does the McGurk effect not occur for other, very similar syllables (e.g., AgaVba). We describe a simplified model of causal inference in multisensory speech perception (CIMS) that predicts the perception of arbitrary combinations of auditory and visual speech. We applied this model to behavioral data collected from 60 subjects perceiving both McGurk and non-McGurk incongruent speech stimuli. The CIMS model successfully predicted both the audiovisual integration observed for McGurk stimuli and the lack of integration observed for non-McGurk stimuli. An identical model without causal inference failed to accurately predict perception for either form of incongruent speech. The CIMS model uses causal inference to provide a computational framework for studying how the brain performs one of its most important tasks, integrating auditory and visual speech cues to allow us to communicate with others.
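To make the causal-inference step concrete, the sketch below applies a generic Bayesian formulation (not the authors’ fitted CIMS model): it computes the posterior probability that a pair of auditory and visual cues share a single source, then averages an integrated (“fused”) percept with an auditory-only percept, weighted by that posterior. The syllable set, likelihood values, flat syllable prior, and common-cause prior are illustrative assumptions.

    # A minimal sketch of Bayesian causal inference for audiovisual syllables,
    # intended only to illustrate the computation described in the abstract.
    # The syllable set, likelihood values, flat prior, and common-cause prior
    # are assumptions for this example, not the fitted CIMS model.
    import numpy as np

    SYLLABLES = ["ba", "da", "ga"]   # assumed candidate percepts
    P_COMMON = 0.5                   # assumed prior probability of a single talker

    def causal_inference(p_aud, p_vis, p_common=P_COMMON):
        """p_aud, p_vis: likelihoods p(cue | syllable), ordered as SYLLABLES."""
        p_aud = np.asarray(p_aud, dtype=float)
        p_vis = np.asarray(p_vis, dtype=float)
        prior = np.full(len(SYLLABLES), 1.0 / len(SYLLABLES))  # flat syllable prior

        # Evidence under C = 1 (one syllable generated both cues).
        ev_common = np.sum(prior * p_aud * p_vis)
        # Evidence under C = 2 (each cue generated by an independent source).
        ev_separate = np.sum(prior * p_aud) * np.sum(prior * p_vis)

        # Posterior probability that the auditory and visual cues share a source.
        post_common = (p_common * ev_common) / (
            p_common * ev_common + (1 - p_common) * ev_separate
        )

        # Percept if integrated (fused) versus if kept separate (auditory only),
        # combined by model averaging with the causal posterior as the weight.
        fused = prior * p_aud * p_vis / ev_common
        auditory_only = prior * p_aud / np.sum(prior * p_aud)
        percept = post_common * fused + (1 - post_common) * auditory_only
        return post_common, dict(zip(SYLLABLES, percept))

    # Example: auditory evidence favoring "ba", visual evidence favoring "ga".
    post, percept = causal_inference(p_aud=[0.80, 0.15, 0.05],
                                     p_vis=[0.05, 0.25, 0.70])
    print(f"p(common cause) = {post:.2f}, percept = {percept}")

In this formulation, strong disagreement between the auditory and visual likelihoods lowers the evidence for a common cause, so the fused percept receives little weight; this is the kind of mechanism by which incongruent, non-McGurk stimuli can escape integration.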

Highlights

  • Speech is the most important method of human communication and is fundamentally multisensory, with both auditory cues and visual cues contributing to perception

  • A major challenge for models of multisensory speech perception is deciding which voices and faces should be integrated. Our solution to this problem is based on the idea of causal inference: given a particular pair of auditory and visual speech cues, infer whether they were produced by the same talker (and should be integrated) or by different sources (and should be kept separate)

  • Our results suggest a fundamental role for a causal-inference-type calculation in multisensory speech perception



Introduction

Speech is the most important method of human communication and is fundamentally multisensory, with both auditory cues (the talker’s voice) and visual cues (the talker’s face) contributing to perception. Because auditory and visual speech cues can be corrupted by noise, integrating the cues allows subjects to perceive the speech content more accurately [1,2,3]. However, when the cues are incongruent, integrating them can make perception less accurate. A striking example is the McGurk effect, in which an auditory syllable paired with an incongruent visual syllable (e.g., auditory “ba” + visual “ga”) is integrated into a fused percept (“da”). The McGurk effect is surprising because the incongruent speech tokens are easy to identify as physically incompatible: it is impossible for the open-mouth velar articulation seen in visual “ga” to produce the closed-mouth bilabial sound heard in auditory “ba”. The effect raises fundamental questions about the computations underlying multisensory speech perception: Why would the brain integrate two incompatible speech components to produce an illusory percept? And if the illusion happens at all, why does it not happen more often?
