Abstract
An increasing number of neuroscience papers capitalize on the assumption, published in this journal, that visual speech is typically 150 ms ahead of auditory speech. It turns out that this estimate of audiovisual asynchrony in the reference paper is valid only in very specific cases: for isolated consonant-vowel syllables, or at the beginning of a speech utterance, in what we call “preparatory gestures”. However, when syllables are chained in sequences, as they typically are in most of a natural speech utterance, asynchrony should be defined differently. These are what we call “comodulatory gestures”, which provide auditory and visual events more or less in synchrony. We provide audiovisual data on sequences of plosive-vowel syllables (pa, ta, ka, ba, da, ga, ma, na) showing that audiovisual synchrony is actually rather precise, varying between 20 ms of audio lead and 70 ms of audio lag. We show how more complex speech material should result in a range typically varying between 40 ms of audio lead and 200 ms of audio lag, and we discuss how this natural coordination is reflected in the so-called temporal integration window for audiovisual speech perception. Finally, we present a toy model of auditory and audiovisual predictive coding, showing that visual lead is actually not necessary for visual prediction.
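The abstract's last claim, that visual lead is not required for visual prediction, can be illustrated with a minimal Python sketch. This is not the authors' toy model: it is an assumption-laden stand-in in which a visual stream exactly synchronous with the audio (zero lead) still improves linear prediction of the upcoming auditory envelope, because the prediction exploits shared past context rather than a temporal head start. The 4 Hz pseudo-syllabic carrier, the noise levels and the least-squares predictor are all illustrative choices; under these assumptions the audio-plus-video predictor should yield a lower error even though the video never precedes the audio.

import numpy as np

rng = np.random.default_rng(0)
fs = 100                                  # 10-ms analysis frames (assumed)
t = np.arange(0.0, 20.0, 1 / fs)          # 20 s of pseudo-syllabic signal

carrier = np.sin(2 * np.pi * 4 * t)       # ~4 Hz syllabic modulation (assumed)
audio = carrier + 0.3 * rng.standard_normal(t.size)   # auditory envelope
video = carrier + 0.3 * rng.standard_normal(t.size)   # lip aperture, zero lag

def prediction_rmse(use_video: bool, context: int = 10, horizon: int = 5) -> float:
    """Least-squares prediction of audio[i + horizon] from the last `context`
    audio frames, optionally adding the last `context` video frames; the video
    never leads, it only contributes past, synchronous samples."""
    rows, targets = [], []
    for i in range(context, t.size - horizon):
        feats = list(audio[i - context:i])
        if use_video:
            feats += list(video[i - context:i])
        rows.append(feats)
        targets.append(audio[i + horizon])
    X, y = np.asarray(rows), np.asarray(targets)
    half = len(y) // 2                                       # fit on the first half,
    w, *_ = np.linalg.lstsq(X[:half], y[:half], rcond=None)
    return float(np.sqrt(np.mean((X[half:] @ w - y[half:]) ** 2)))   # test on the rest

print("audio-only prediction RMSE :", round(prediction_rmse(False), 3))
print("audio+video prediction RMSE:", round(prediction_rmse(True), 3))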
Highlights
Audiovisual interactions in the human brain: Sensory processing has long been conceived as modular and hierarchical, beginning with monosensory cue extraction in the primary sensory cortices before higher-level multisensory interactions take place in associative areas, preparing the route for the final decision and an appropriate behavioral response.
The view that vision leads audition is largely oversimplified and often wrong. It should be replaced by the acknowledgement that the temporal relationship between auditory and visual cues is complex, covering a range of configurations more or less reflected by the temporal integration window, which extends from 30 to 50 ms of auditory lead up to 170 to 200 ms of visual lead.
This is in line with the 100 to 300 ms visual lead proposed in [21], all the more so considering that the estimate of the visible onset of lip closure in [21] is taken at the mid-closing phase, whereas we prefer to detect the first visible event, at the very beginning of the lip-closing phase, typically 100 ms earlier.
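For concreteness, here is a minimal Python sketch, run on a synthetic lip-aperture trajectory rather than on the paper's measurements, of the two onset criteria contrasted above: a mid-closing criterion (50% of the closing excursion, as assumed for [21]) versus a first-visible-movement criterion (closing velocity crossing a small threshold). The logistic gesture shape and the 5 mm/s velocity threshold are assumptions, chosen so that the two timestamps differ by roughly 100 ms, the order of magnitude discussed above.

import numpy as np

# Illustrative only: a synthetic lip-aperture trajectory, not measured data.
fs = 1000                                  # 1 kHz sampling of lip aperture (assumed)
t = np.arange(0.0, 0.6, 1 / fs)            # 600 ms window around the closure
# Synthetic closing gesture: aperture falls from ~15 mm to ~0 mm, centred at 300 ms.
aperture = 15.0 / (1.0 + np.exp((t - 0.30) / 0.02))

velocity = np.gradient(aperture, 1 / fs)   # mm/s, negative during closing

# Criterion 1: first visible movement = closing velocity exceeds a small threshold.
first_move_idx = int(np.argmax(velocity < -5.0))
# Criterion 2: mid-closing phase = aperture crosses 50% of its total excursion.
half_level = aperture.min() + 0.5 * (aperture.max() - aperture.min())
mid_close_idx = int(np.argmax(aperture < half_level))

print(f"first-movement onset: {t[first_move_idx] * 1000:.0f} ms")
print(f"mid-closing onset   : {t[mid_close_idx] * 1000:.0f} ms")
print(f"difference          : {(t[mid_close_idx] - t[first_move_idx]) * 1000:.0f} ms")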
Summary
Audiovisual interactions in the human brain: Sensory processing has long been conceived as modular and hierarchical, beginning with monosensory cue extraction in the primary sensory cortices before higher-level multisensory interactions take place in associative areas, preparing the route for the final decision and an appropriate behavioral response. The specific role of direct connections between primary sensory cortices, as opposed to that of the associative cortex and the superior temporal sulcus, in these early interactions is still under debate [6,7,8]. Capitalizing on the natural rhythmicity of the auditory speech input, it has been suggested [9,10] that the visual input could enhance neuronal oscillations through a phase-resetting mechanism across sensory modalities. This has led to various experimental demonstrations that visual speech improves the tracking of audiovisual speech information in the auditory cortex through phase coupling of the auditory and visual cortices [11,12].
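The phase-resetting idea summarized above can be caricatured with a small Python simulation. The sketch below is purely illustrative and is not a model from the cited work: a noisy ~4 Hz oscillator is sampled at quasi-rhythmic syllable onsets, and a visual-driven phase reset at each onset makes the phase measured at the following onsets concentrate, which shows up as a higher phase-locking index (resultant vector length). The onset statistics, noise levels and reset rule are all assumptions.

import numpy as np

rng = np.random.default_rng(1)
fs, f0, dur = 1000, 4.0, 30.0               # sample rate (Hz), ~4 Hz oscillator, 30 s
n = int(dur * fs)

# Quasi-rhythmic syllable onsets every ~250 ms with jitter; auditory and visual
# onsets are taken as synchronous, in line with the comodulatory-gesture view.
onsets = np.cumsum(rng.normal(0.25, 0.04, size=int(dur / 0.25)))
onset_idx = set((onsets[onsets < dur - 0.25] * fs).astype(int))

def phase_locking(with_reset: bool) -> float:
    """Resultant vector length of the oscillator phase sampled at syllable onsets.
    The phase is sampled before the (optional) reset, so the index measures how
    well a reset at one onset aligns the oscillator with the next onset."""
    phase, sampled = 0.0, []
    for i in range(1, n):
        phase += 2 * np.pi * f0 / fs + rng.normal(0.0, 0.02)   # noisy phase advance
        if i in onset_idx:
            sampled.append(phase)
            if with_reset:
                phase = 0.0                 # visual input resets the oscillator phase
    return float(np.abs(np.mean(np.exp(1j * np.asarray(sampled)))))

print("phase locking, no visual reset  :", round(phase_locking(False), 2))
print("phase locking, with visual reset:", round(phase_locking(True), 2))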