Abstract

Language is essentially multimodal in its sensory origin; daily conversation depends heavily on audio-visual (AV) information. Although the perception of spoken language is dominated by audition, the perception of facial movement, particularly that of the mouth, helps us comprehend speech. The McGurk effect is a striking phenomenon in which the perceived phoneme is altered by simultaneous observation of lip movement, and it probably reflects the underlying AV integration process. Elucidating the principles behind this perceptual anomaly poses an interesting problem. Here we study the nature of the McGurk effect by means of neural networks (self-organizing maps, SOMs) designed to extract patterns inherent in audio and visual stimuli. We show that a McGurk-effect-like classification of incoming information occurs without any additional constraint or procedure added to the network, suggesting that the anomaly is a consequence of the AV integration process itself. Within this framework, we explain the asymmetric effect of AV pairs in causing the McGurk effect (fusion or combination) based on the 'distance' relationship between audio and visual information within the SOM. Our results reveal generic features of the cognitive process of phoneme perception, and of AV sensory integration in general.
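
For readers unfamiliar with the technique, the sketch below illustrates how a Kohonen SOM of the kind the abstract refers to can be trained on concatenated audio-visual feature vectors, and how a 'distance' between stimuli arises on the trained map. The feature dimensions, map size, learning schedule, and the use of NumPy are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal Kohonen SOM sketch trained on concatenated audio-visual
# feature vectors. All dimensions and hyperparameters are
# hypothetical choices for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AV stimuli: each row concatenates an audio feature
# vector with a visual (lip-shape) feature vector.
n_samples, audio_dim, visual_dim = 200, 12, 8
stimuli = rng.standard_normal((n_samples, audio_dim + visual_dim))

# A 10x10 map; each unit holds a weight vector in stimulus space.
rows, cols, dim = 10, 10, audio_dim + visual_dim
weights = rng.standard_normal((rows, cols, dim))
grid = np.stack(
    np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
)

n_iters = 2000
for t in range(n_iters):
    x = stimuli[rng.integers(n_samples)]
    # Best-matching unit (BMU): the unit whose weight vector is
    # closest to the stimulus in Euclidean distance.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Learning rate and neighborhood radius decay over training.
    lr = 0.5 * (1.0 - t / n_iters)
    sigma = 3.0 * (1.0 - t / n_iters) + 0.5
    # A Gaussian neighborhood on the map grid pulls units near the
    # BMU toward the stimulus, so similar AV patterns end up mapped
    # to nearby units (self-organization).
    grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
    influence = np.exp(-grid_dist2 / (2.0 * sigma**2))
    weights += lr * influence[..., None] * (x - weights)

# After training, the grid distance between the BMUs of two stimuli
# is a proxy for how similarly the network classifies them; this is
# the kind of 'distance' relationship the abstract invokes to explain
# fusion versus combination responses.
```

In such a setup, a mismatched pair (e.g. an audio /ba/ dubbed onto a visual /ga/) would be encoded as a single concatenated vector, and its BMU can fall between the regions that pure audio-visual-congruent stimuli occupy, which is one way a McGurk-like classification can emerge without any extra mechanism.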
