Abstract

Watching a speaker's facial movements can dramatically enhance our ability to comprehend words, especially in noisy environments. From a general doctrine of combining information from different sensory modalities (the principle of inverse effectiveness), one would expect that visual signals would be most effective at the highest levels of auditory noise. In contrast, we find, in accord with a recent paper, that visual information improves performance more at intermediate levels of auditory noise than at the highest levels, and we show that a novel visual stimulus containing only temporal information does the same. We present a Bayesian model of optimal cue integration that can explain these apparent conflicts with inverse effectiveness. In this model, words are regarded as points in a multidimensional space and word recognition is a probabilistic inference process. When the dimensionality of the feature space is low, the Bayesian model predicts inverse effectiveness; when the dimensionality is high, the enhancement is maximal at intermediate auditory noise levels. When the auditory and visual stimuli differ slightly in high noise, the model makes a counterintuitive prediction: as sound quality increases, the proportion of reported words corresponding to the visual stimulus should first increase and then decrease. We confirm this prediction in a behavioral experiment. We conclude that auditory-visual speech perception obeys the same notion of optimality previously observed only for simple multisensory stimuli.
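The following is a minimal, illustrative simulation of this idea (not the paper's actual implementation: the lexicon, noise levels, feature dimensionalities, and decision rule below are simplifying assumptions chosen for illustration). Words are random points in a d-dimensional feature space, the observer receives Gaussian-corrupted auditory and visual measurements of the spoken word and reports the candidate with the highest posterior probability under a uniform prior, and the visual gain (audiovisual minus auditory-only accuracy) is tabulated as a function of auditory noise for a low- and a high-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(d, n_words=500, sigma_v=1.5,
             sigma_a_values=(0.25, 0.5, 1.0, 2.0, 4.0, 8.0), n_trials=500):
    """Accuracy of an ideal observer with auditory-only vs. audiovisual input."""
    lexicon = rng.standard_normal((n_words, d))          # each row is one word
    rows = []
    for sigma_a in sigma_a_values:
        n_correct_a = n_correct_av = 0
        for target in rng.integers(n_words, size=n_trials):
            w = lexicon[target]
            x_a = w + sigma_a * rng.standard_normal(d)   # noisy auditory measurement
            x_v = w + sigma_v * rng.standard_normal(d)   # noisy visual measurement
            # Log likelihood of each candidate word under isotropic Gaussian noise
            ll_a = -np.sum((lexicon - x_a) ** 2, axis=1) / (2 * sigma_a ** 2)
            ll_v = -np.sum((lexicon - x_v) ** 2, axis=1) / (2 * sigma_v ** 2)
            n_correct_a += int(np.argmax(ll_a) == target)           # auditory-only report
            n_correct_av += int(np.argmax(ll_a + ll_v) == target)   # optimal cue combination
        rows.append((sigma_a, n_correct_a / n_trials, n_correct_av / n_trials))
    return rows

for d in (2, 50):  # low- vs. high-dimensional feature space
    print(f"dimensionality d = {d}")
    for sigma_a, acc_a, acc_av in simulate(d):
        print(f"  sigma_a = {sigma_a:5.2f}   A-only: {acc_a:.2f}   "
              f"AV: {acc_av:.2f}   visual gain: {acc_av - acc_a:+.2f}")
```

Under the account sketched above, the gain column would be expected to keep growing toward the highest noise level when d is small and to peak at an intermediate noise level when d is large; the exact numbers depend entirely on the assumed parameters.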

Highlights

  • Vision often plays a crucial role in understanding speech

  • We present a first-principles model of multisensory word recognition built on the concept of a stimulus neighborhood in a high-dimensional feature space, in which the reliability of the signal determines the size of the neighborhood and lexical information is represented by the distribution of words

  • The model uses performance measured on individual phonemes to estimate performance at identifying full words (a simple sketch of this idea follows below)
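As a rough illustration of the last point (the phoneme inventory and accuracy values below are hypothetical, and this sketch assumes phonemes are perceived independently, whereas the model also exploits the lexical distribution of words), per-phoneme identification probabilities measured at a given noise level can be combined into an estimate of whole-word accuracy:

```python
import numpy as np

# Hypothetical probabilities of correctly identifying each phoneme at one
# auditory noise level (e.g., measured in a separate phoneme-identification task).
p_phoneme_correct = {"k": 0.85, "ae": 0.90, "t": 0.80, "d": 0.75, "o": 0.92, "g": 0.70}

def estimated_word_accuracy(phonemes):
    """Estimate P(word identified) as the product of per-phoneme accuracies,
    i.e., assuming each phoneme is perceived independently."""
    return float(np.prod([p_phoneme_correct[ph] for ph in phonemes]))

toy_lexicon = {"cat": ["k", "ae", "t"], "dog": ["d", "o", "g"]}
for word, phonemes in toy_lexicon.items():
    print(f"{word}: estimated accuracy {estimated_word_accuracy(phonemes):.3f}")
```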

Introduction

Vision often plays a crucial role in understanding speech. Watching a speaker's facial movements, especially lip movements, provides input that can supplement the information from the speaker's voice. “Lip-reading” or “speech-reading” allows hearing-impaired individuals to understand speech (e.g., [1,2]) and, in subjects with intact hearing, substantially facilitates speech perception under noisy environmental conditions [3,4,5,6,7]. This benefit has been quantified by measuring the performance enhancement due to visual input as a function of auditory noise [8,9,10,11]. The benefit from the visual information, measured in percent correct, was found to be greatest when the auditory stimulus was most noisy (but see [12,13]). This seems to be evidence for inverse effectiveness, a widely cited concept stating that the largest multisensory enhancement is expected when a unisensory stimulus is weakest [14]. When multisensory word recognition was tested under more natural conditions (without a checklist), however, maximal gain was found not at low but at intermediate signal-to-noise ratios (SNRs) [15], in apparent contradiction to inverse effectiveness.
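To make the quantity at issue concrete (the accuracy values below are invented, not data from any of the cited studies), the visual benefit is the difference between audiovisual and auditory-only percent correct at each SNR; inverse effectiveness would place the largest difference at the lowest SNR, whereas the finding in [15] places it at intermediate SNRs.

```python
import numpy as np

snr_db = [-24, -18, -12, -6, 0]                      # auditory signal-to-noise ratios (dB)
acc_a  = np.array([0.02, 0.10, 0.45, 0.80, 0.95])    # hypothetical auditory-only accuracy
acc_av = np.array([0.15, 0.40, 0.85, 0.95, 0.98])    # hypothetical audiovisual accuracy

visual_gain = acc_av - acc_a                         # enhancement due to the visual input
for snr, gain in zip(snr_db, visual_gain):
    print(f"SNR {snr:+3d} dB: visual gain = {gain:.2f}")
print(f"Largest gain at SNR {snr_db[int(np.argmax(visual_gain))]} dB")
```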

