Abstract

During speech perception, humans integrate auditory information from the voice with visual information from the face. This multisensory integration increases perceptual precision, but only if the two cues come from the same talker; this requirement has been largely ignored by current models of speech perception. We describe a generative model of multisensory speech perception that includes this critical step of determining the likelihood that the voice and face information have a common cause. A key feature of the model is that it is based on a principled analysis of how an observer should solve this causal inference problem using the asynchrony between the two cues and the reliability of the cues. This allows the model to make predictions about the behavior of subjects performing a synchrony judgment task, predictive power that does not exist in other approaches, such as post-hoc fitting of Gaussian curves to behavioral data. We tested the model predictions against the performance of 37 subjects who performed a synchrony judgment task while viewing audiovisual speech under a variety of manipulations, including varying asynchronies, intelligibility, and visual cue reliability. The causal inference model outperformed the Gaussian model across two experiments, providing a better fit to the behavioral data with fewer parameters. Because the causal inference model is derived from a principled understanding of the task, model parameters are directly interpretable in terms of stimulus and subject properties.
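As an illustration of the approach described above, the Python sketch below computes the posterior probability of a common cause from a noisy asynchrony measurement and uses it to predict the proportion of "synchronous" reports at each physical offset. All parameter values (MU_AV, SIGMA_AV, SIGMA_SENS, ASYNC_RANGE, P_COMMON), the distributional choices, and the decision rule are illustrative assumptions, not the fitted values or exact equations of the model in the paper.

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters (assumptions, not the paper's fitted values):
MU_AV       = 40.0    # ms: typical natural audio lag when voice and face share a cause
SIGMA_AV    = 60.0    # ms: physical spread of asynchronies under a common cause (C = 1)
SIGMA_SENS  = 80.0    # ms: subject's sensory noise on the measured asynchrony
ASYNC_RANGE = 1000.0  # ms: width of a uniform asynchrony prior under separate causes (C = 2)
P_COMMON    = 0.5     # prior probability of a common cause

def p_common_given_measurement(x):
    """Posterior probability of a common cause given a noisy internal
    measurement x (ms) of the audiovisual asynchrony (Bayes' rule)."""
    like_c1 = norm.pdf(x, loc=MU_AV, scale=np.hypot(SIGMA_AV, SIGMA_SENS))
    like_c2 = 1.0 / ASYNC_RANGE
    return like_c1 * P_COMMON / (like_c1 * P_COMMON + like_c2 * (1.0 - P_COMMON))

def predicted_prop_synchronous(soa_ms, n_sim=20000, seed=0):
    """Predicted proportion of 'synchronous' reports at a physical asynchrony,
    assuming the observer reports 'synchronous' whenever P(C = 1 | x) > 0.5."""
    rng = np.random.default_rng(seed)
    x = soa_ms + rng.normal(0.0, SIGMA_SENS, size=n_sim)  # noisy measurements of the offset
    return float(np.mean(p_common_given_measurement(x) > 0.5))

# Predicted psychometric curve over audiovisual offsets (negative = auditory lead)
for soa in (-400, -200, 0, 200, 400):
    print(f"SOA {soa:+5d} ms -> P('synchronous') = {predicted_prop_synchronous(soa):.3f}")
```

Because the predicted curve arises from a posterior over causal structures, its width and position follow from interpretable quantities (sensory noise, natural audiovisual lag, prior over a common cause), whereas a post-hoc Gaussian fit only describes the shape of the measured curve.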

Highlights

  • When an observer hears a voice and sees mouth movements, there are two potential causal structures (Figure 1A)

  • The causal inference of multisensory speech (CIMS) model makes trial-to-trial behavioral predictions about synchrony perception using a limited number of parameters that capture physical properties of speech, the sensory noise of the subject, and the subject’s prior assumptions about the causal structure of the stimuli in the experiment

  • Audiovisual spatial localization likely occurs in the parietal lobe (Zatorre et al., 2002) while multisensory speech perception is thought to occur in the superior temporal sulcus (Beauchamp et al., 2004)


Introduction

When an observer hears a voice and sees mouth movements, there are two potential causal structures (Figure 1A). In the first causal structure, the events have a single cause (C = 1): one talker produces both the auditory voice and the seen mouth movements. In the second causal structure, the events have two different causes (C = 2): one talker produces the auditory voice and a different talker produces the seen mouth movements. A critical step in audiovisual integration during speech perception is estimating the likelihood that the speech arises from a single talker. This process, known as causal inference (Kording et al., 2007; Schutz and Kubovy, 2009; Shams and Beierholm, 2010; Buehner, 2012), has provided an excellent tool for understanding the behavioral properties of tasks requiring spatial localization of simple auditory beeps and visual flashes (Kording et al., 2007; Sato et al., 2007). We set out to determine whether the causal inference model could explain the behavior of humans perceiving multisensory speech.
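To make the two causal structures concrete, the short Python sketch below simulates the physical asynchronies each structure would produce: under a common cause the asynchrony stays near the natural audiovisual lag of speech, while under separate causes any asynchrony is roughly equally likely. The distributions and numerical values are illustrative assumptions, not the stimulus statistics of the experiments reported here.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_asynchrony(common_cause: bool, n: int = 5) -> np.ndarray:
    """Sample physical audiovisual asynchronies (ms) under each causal structure."""
    if common_cause:
        # C = 1: one talker produces both voice and face,
        # so asynchrony clusters near the natural audio lag
        return rng.normal(40.0, 60.0, n)
    # C = 2: voice and face come from different talkers,
    # so asynchrony is spread broadly and uninformatively
    return rng.uniform(-500.0, 500.0, n)

print("C = 1:", np.round(sample_asynchrony(True), 1))
print("C = 2:", np.round(sample_asynchrony(False), 1))
```

Inverting this generative description with Bayes' rule, as sketched after the Abstract above, yields the estimated likelihood that the voice and face arise from a single talker.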

