Abstract

Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is exemplified by the McGurk illusion, in which an auditory stimulus such as “ba” dubbed onto a visual stimulus such as “ga” produces the illusion of hearing “da”. Bayesian models of multisensory perception suggest that both the enhancement and the illusion can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the reliability of each sensory cue). However, to date no study has accounted for how binding and fusion each contribute to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and fusion stages simultaneously by varying both the temporal offset between the cues (binding) and the auditory and visual signal-to-noise ratios (fusion). We fit two Bayesian models to the behavioural data and show that both can account for the enhancement effect in congruent audiovisual speech as well as the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced-fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of both the McGurk illusion and congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
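
As a rough illustration of the fusion stage only (this is not the specific models fitted in this study), the sketch below implements standard reliability-weighted cue combination, in which each cue is weighted by its inverse variance. The internal phoneme axis, the cue positions and the variances are hypothetical values chosen for illustration.

```python
def fuse(mu_a, var_a, mu_v, var_v):
    """Reliability-weighted (maximum-likelihood) fusion of an auditory and a
    visual cue, each modelled as a Gaussian estimate on a hypothetical internal
    phoneme axis; the weights are the normalised inverse variances."""
    w_a = (1.0 / var_a) / (1.0 / var_a + 1.0 / var_v)
    w_v = 1.0 - w_a
    mu_fused = w_a * mu_a + w_v * mu_v
    var_fused = 1.0 / (1.0 / var_a + 1.0 / var_v)
    return mu_fused, var_fused

# Example: a noisy auditory "ba" (position 0.0, high variance) combined with a
# clearer visual "ga" (position 2.0, low variance) gives a fused estimate that
# is pulled toward the visual cue, i.e. an intermediate, "da"-like percept.
print(fuse(mu_a=0.0, var_a=4.0, mu_v=2.0, var_v=1.0))  # -> (1.6, 0.8)
```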

Highlights

  • When we see the face of a person speaking to us, our brains use both the auditory and visual input to understand what is being said

  • Asynchronous versions of the stimuli were made by temporally shifting the audio to create a 500 ms audio lead. This stimulus onset asynchrony (SOA) is substantially larger than the “temporal window of integration” reported previously [14, 25]; however, we found in pilot trials that it was necessary to extend the offset to 500 ms in order to produce a reliable effect of SOA

  • There was a significant SOA × visual signal-to-noise ratio (SNR) interaction, which, while not part of our initial hypothesis, is in line with a binding-and-fusion model: since a low-SNR visual stimulus has a negligible influence on perception, we would not expect an effect of temporally offsetting such stimuli (see the sketch below)
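
The following sketch shows, purely illustratively, how such an interaction can arise under a binding-and-fusion scheme. It is not the fitted models from this study: the Gaussian temporal fall-off, the sigma_soa value and all stimulus parameters are assumptions made for the example.

```python
import math

def percept(mu_a, var_a, mu_v, var_v, soa_ms, sigma_soa=150.0):
    """Two-step sketch: a binding term that decays with the audiovisual
    temporal offset (SOA), followed by reliability-weighted fusion in which
    the visual weight is scaled by that binding term."""
    p_bind = math.exp(-0.5 * (soa_ms / sigma_soa) ** 2)  # hypothetical fall-off
    w_a = 1.0 / var_a
    w_v = p_bind / var_v
    return (w_a * mu_a + w_v * mu_v) / (w_a + w_v)

# High-SNR (low-variance) visual cue: the percept shifts markedly between the
# synchronous and the 500 ms offset presentations ...
print(percept(0.0, 4.0, 2.0, 1.0, soa_ms=0), percept(0.0, 4.0, 2.0, 1.0, soa_ms=500))
# ... whereas a low-SNR (high-variance) visual cue has only a small influence
# at either offset, so little effect of SOA is expected.
print(percept(0.0, 4.0, 2.0, 25.0, soa_ms=0), percept(0.0, 4.0, 2.0, 25.0, soa_ms=500))
```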

Introduction

When we see the face of a person speaking to us, our brains use both the auditory and the visual input to understand what is being said. According to the information reliability principle, the two inputs are combined by weighting each cue according to its reliability. However, numerous studies have suggested that this principle alone may not account for all aspects of audiovisual speech perception. Factors such as attention [8, 9, 10], audiovisual context [11], top-down expectations [12, 13], the temporal offset between cues [14] and even spontaneous pre-stimulus brain activity [15] have all been shown to modulate the McGurk illusion. A strong fusion model cannot account for any of these effects.
