Abstract
Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time-varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input-space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found that this correlation was strongest between the area of the mouth opening and the acoustic energy associated with vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the onset of the voice is consistently delayed by 100 to 300 ms relative to the corresponding mouth movements. We interpret these data in the context of recent neural theories of speech, which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
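The statistics summarized above suggest a simple analysis recipe. The Python sketch below is a minimal illustration, not the authors' actual pipeline, of how the first and third features could be estimated from a paired recording: correlating the mouth-opening area with the acoustic (Hilbert) envelope, and measuring how much of the envelope's modulation power falls in the 2–7 Hz band. The array names `audio` and `mouth_area`, the 16 kHz audio rate, and the 30 Hz video rate are assumptions for illustration; real data would replace the random placeholders.

```python
# A minimal sketch (not the authors' pipeline) of estimating two of the
# reported statistics from a paired audio-video recording. `audio` and
# `mouth_area` are hypothetical placeholders; the 16 kHz / 30 Hz sampling
# rates are assumptions, not values taken from the paper.
import numpy as np
from scipy.signal import hilbert, resample, welch
from scipy.stats import pearsonr

fs_audio, fs_video, dur = 16000, 30, 10           # assumed rates, 10 s clip
rng = np.random.default_rng(0)
audio = rng.standard_normal(fs_audio * dur)       # placeholder speech waveform
mouth_area = rng.standard_normal(fs_video * dur)  # placeholder mouth-area track

# 1. Acoustic (Hilbert) envelope, resampled to the video frame rate so the
#    two signals can be compared frame by frame.
envelope = resample(np.abs(hilbert(audio)), len(mouth_area))

# 2. Correlation between the area of the mouth opening and the envelope.
r, p = pearsonr(mouth_area, envelope)
print(f"mouth-envelope correlation: r = {r:.2f} (p = {p:.3f})")

# 3. Modulation spectrum of the envelope; for real speech, most of the power
#    is expected in the 2-7 Hz band highlighted in the text.
f, pxx = welch(envelope, fs=fs_video, nperseg=128)
band = (f >= 2) & (f <= 7)
print(f"fraction of modulation power in 2-7 Hz: {pxx[band].sum() / pxx.sum():.2f}")
```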
Highlights
When we watch someone speak, how much work is our brain doing? How much of this work is facilitated by the structure of speech itself? Our work shows that the visual and auditory components of speech are tightly locked, and that this temporal coordination has a distinct rhythm between 2 and 7 Hz.
During speech production, the onset of the voice occurs with a delay of 100 to 300 ms relative to the initial, visible movements of the mouth. These temporal parameters of audiovisual speech are intriguing because they match known properties of neuronal oscillations in the auditory cortex.
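As an illustration of the mouth-to-voice lag described in the last highlight, the following sketch (again an assumption-laden toy, not the paper's method) cross-correlates a mouth-area track with an acoustic envelope at an assumed 30 Hz video frame rate. A known 200 ms lead is built into the synthetic placeholder signals so that the estimator has something to recover.

```python
# Illustrative sketch (not the paper's method): estimate how far mouth
# movement leads the voice by cross-correlating a mouth-area track with the
# acoustic envelope. Both signals are synthetic placeholders at an assumed
# 30 Hz video rate, with a 200 ms mouth lead built in.
import numpy as np
from scipy.signal import correlate, correlation_lags

fs = 30                                    # assumed video frame rate (Hz)
rng = np.random.default_rng(0)
mouth_area = rng.standard_normal(fs * 10)  # placeholder mouth-opening track
lead = int(0.2 * fs)                       # 200 ms expressed in frames
envelope = np.roll(mouth_area, lead) + 0.1 * rng.standard_normal(fs * 10)

def mouth_lead_ms(mouth, env, fs):
    """Lag (in ms) at which the mouth track best predicts the envelope."""
    a = (mouth - mouth.mean()) / mouth.std()
    b = (env - env.mean()) / env.std()
    xcorr = correlate(b, a, mode="full")
    lags = correlation_lags(len(b), len(a), mode="full")
    return 1000.0 * lags[np.argmax(xcorr)] / fs  # positive: mouth leads

print(f"estimated mouth lead: {mouth_lead_ms(mouth_area, envelope, fs):.0f} ms")
# For natural audiovisual speech, the text reports values of roughly 100-300 ms.
```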
Summary
Organisms are exposed to a continuous stream of signals which are dynamic, multimodal, extended, and time-varying in nature, and which are typically characterized by sequences of inputs with particular time constants, durations, repetition rates, and so on [1]. This complex input space is transduced and sampled by the respective sensory systems and transmitted to the brains of organisms, where these signals modulate both neural activity and behavior over multiple time scales [2]. Barlow [3], for example, suggested that exploiting statistical regularities in the stimulus space may be evolutionarily adaptive. According to this view, sensory processing would encode incoming sensory information in the most efficient form possible by exploiting the redundancies and correlation structure of the input; neural systems should therefore be optimized to process the statistical structure of the sensory signals that they encounter most often [4]. Often overlooked in these studies is the recognition that an organism's experience of the world is profoundly multisensory, and it is likely that multiple overlapping and time-locked sensory systems enable it to perceive events and interact with the world [5].
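Barlow's proposal can be made concrete with a toy example of redundancy reduction: a linear whitening transform removes the correlation structure of its input, so that each output channel carries non-redundant information. The sketch below is purely illustrative, using synthetic two-channel data rather than anything analyzed in this study.

```python
# Toy illustration (not from the paper) of Barlow-style redundancy reduction:
# a linear whitening transform removes the correlation structure of its input.
import numpy as np

rng = np.random.default_rng(1)

# Strongly correlated two-channel "sensory" input (hypothetical data).
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

# Whitening matrix from the eigendecomposition of the empirical covariance.
c = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(c)
w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
y = x @ w.T

print(np.round(np.cov(x, rowvar=False), 2))  # off-diagonals near 0.9: redundant
print(np.round(np.cov(y, rowvar=False), 2))  # near identity: redundancy removed
```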