Waseda Institute for Advanced Study, 1-6-1 Nishi Waseda, Shinjuku-ku, Tokyo, 169-8050 Japan
(Received 29 October 2010, Accepted for publication 5 January 2011)
Keywords: Audiovisual speech integration, Synchrony perception, Sine-wave speech
PACS number: 43.71.+m [doi:10.1250/ast.32.125]

1. Introduction

Audiovisual synchrony is important for comfortable speech communication. We occasionally encounter a temporal mismatch between a speaker's face and speech sound, for instance, in a satellite broadcast or in video streaming via the Internet. Human observers perceive physically desynchronized audiovisual signals as synchronous within a certain temporal tolerance [1]. This audiovisual synchrony perception may be affected by both structural factors (i.e., bottom-up factors) and cognitive factors (i.e., top-down factors). For example, audiovisual spatial congruency [2] and stimulus complexity [3] are considered to be structural factors, while cognitive factors include, for instance, an instruction ("imagine" in Arnold et al. [4]) that audiovisual stimuli originate from the same source, and the "assumption of unity." This assumption of unity means the following: when multimodal inputs have highly consistent properties, observers are more likely to treat them as originating from a single source [5-7] (see [5,6] for review).

However, the respective contributions of structural and cognitive factors to multisensory integration remain unclear [6], because these two factors are often intermingled and it is not easy to distinguish between them [6,8]. Vatakis and Spence [7] tried to dissociate them and to control the structural factors (i.e., matched stimulus complexity) in order to investigate the assumption of unity. They showed that participants were less sensitive to audiovisual asynchrony when the auditory and visual stimuli originated from the same speech event than when they originated from different speech events. They speculated that the strength of the observers' assumption of unity depends on whether or not the stimulus origin differs between audition and vision, and that it is this assumption of unity that influences audiovisual synchrony perception. In their study, the audiovisual structural factors were nearly controlled. Nonetheless, their stimuli differed in terms of structural factors as well as cognitive factors. Thus, the influence of purely cognitive factors on audiovisual synchrony perception is not clear, especially in the case of speech signals.

In this study, we attempted to investigate this cognitive effect on audiovisual synchrony perception. For this purpose, we used a simplified speech sound called "sine-wave speech (SWS)" [9]. In SWS, a natural speech signal is replaced with three sinusoids corresponding to the first three formant frequencies (Fig. 1). SWS is heard as either speech or non-speech depending on instruction; thus, manipulation of only the cognitive factor is possible. Listeners given no information about SWS typically perceive this sound as non-speech, such as a whistle or an electronic sound. In contrast, once they are informed that SWS is a speech sound synthesized from natural speech, they perceive SWS as speech [9,10]. This procedure, which uses physically identical SWS sounds in both groups, enables us to examine whether an instruction about the sound (i.e., a cognitive factor) modulates audiovisual synchrony perception.

We measured audiovisual temporal resolution as an index of audiovisual synchrony perception [3,7].
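To make the SWS idea concrete, the following is a minimal sketch of resynthesizing a sound as the sum of three sinusoids that follow formant tracks. It is not the synthesis procedure used in this study (which involved STRAIGHT and an interactive method; see Sect. 2.1.2); the function name and the assumption that frame-wise formant frequencies and amplitudes are already available are hypothetical, for illustration only.

    # Illustrative sketch: sine-wave speech from three formant tracks.
    # formant_freqs, formant_amps: hypothetical arrays of shape (n_frames, 3),
    # giving per-frame formant frequencies (Hz) and amplitudes.
    import numpy as np

    def sws_synthesize(formant_freqs, formant_amps, frame_rate, fs=48000):
        n_frames = formant_freqs.shape[0]
        n_samples = int(n_frames * fs / frame_rate)
        t_frames = np.arange(n_frames) / frame_rate   # frame times (s)
        t = np.arange(n_samples) / fs                  # sample times (s)
        out = np.zeros(n_samples)
        for k in range(3):  # one sinusoid per formant
            # Interpolate the frame-wise track to sample resolution.
            f = np.interp(t, t_frames, formant_freqs[:, k])
            a = np.interp(t, t_frames, formant_amps[:, k])
            # Integrate instantaneous frequency to obtain a smooth phase.
            phase = 2.0 * np.pi * np.cumsum(f) / fs
            out += a * np.sin(phase)
        return out / np.max(np.abs(out))  # normalize to avoid clipping

Because the three sinusoids carry only the gross formant structure, the same waveform can be heard as a whistle-like sound or as speech, depending on what the listener is told.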
We hypothesized that a cognitive factor, namely that the multisensory inputs refer to the same event, would enhance multisensory integration.

2. Experiment 1
2.1. Method
2.1.1. Participants

In Experiment 1, thirty-three participants with normal hearing and normal or corrected-to-normal visual acuity took part. All were native Japanese speakers. The experiments were approved by the Ethics Committee of the Research Institute of Electrical Communication, Tohoku University.

2.1.2. Stimuli

The stimuli consisted of video clips of the faces and voices of three Japanese female speakers (frontal view, including head and shoulders) uttering the monosyllables /pa/, /ta/, and /ka/. Consequently, nine tokens were used in total. The video clips (640 x 480 pixels, Cinepak Codec video compression, 30 frames/s, 16-bit and 48-kHz audio signal digitization) were edited with Adobe Premiere Pro 1.5 (Adobe Systems Inc.) to create the stimulus onset asynchronies (SOAs). The SWS sound was synchronized with the corresponding video clip by replacing the original speech sound.

The SWS sounds of these monosyllables were created using an interactive method as follows: First, the formant frequencies were estimated by using STRAIGHT [11] to conduct spectro-