People use prior knowledge and in situ judgment to produce speech with a vocal effort appropriate to a given environment’s acoustics. To test how people integrate auditory and visual cues in speech production, we employed a three-by-three cross-conditional audiovisual match-mismatch paradigm. Three visually distinct environments with three different room acoustics were selected: a gymnasium, a classroom, and a hemi-anechoic room. The visual environment was presented through a Virtual Reality (VR) headset, and the auditory environment was a diffuse room impression created by playing the participants’ speech back through in-room loudspeakers with different reverberation times. Participants were prompted to speak in all nine combinations of the audiovisual conditions, three congruent and six incongruent. Linear mixed-effects regression modeling was used to evaluate the effects of the audiovisual manipulations and time course on mean intensity. Preliminary results indicate that participants initially spoke at a level matching the visual expectation and then adapted to the audio condition; detailed analysis of the time course of adaptation is ongoing and will be presented. This study furthers our understanding of multimodal integration and the sensorimotor adaptation of speech production, with applications in fields including communication in noise and VR soundscape design.
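As an illustrative sketch only (not the authors' actual model specification), a linear mixed-effects regression of the kind described above could relate mean intensity to the visual condition, the audio condition, their interaction, and time, with a random intercept per participant. The column names (intensity, visual_env, audio_env, time, participant) and the input file are hypothetical placeholders.

```python
# Illustrative sketch with hypothetical column names and model structure,
# not the study's actual specification.
import pandas as pd
import statsmodels.formula.api as smf

# Long-format data: one row per utterance, with mean intensity (dB),
# the visual and auditory room conditions, a time index, and a participant ID.
data = pd.read_csv("speech_intensity_long.csv")  # hypothetical file

# Mean intensity modeled on visual condition, audio condition, their
# interaction, and time course, with a random intercept per participant.
model = smf.mixedlm(
    "intensity ~ C(visual_env) * C(audio_env) + time",
    data,
    groups=data["participant"],
)
result = model.fit()
print(result.summary())
```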