Abstract

Audio-visual scenes were collected in a medium-sized reverberant conference hall through in-field 3rd-order ambisonics impulse response recordings and 360-degree stereoscopic videos. The visual scenes included cues about the room and the locations of the sound sources, but no lip-sync-related cues. Speech intelligibility tests based on seven audio-visual scenes were administered inside an immersive virtual 3D environment reproduced through a spherical 16-loudspeaker array synchronized with a head-mounted display. Forty normal-hearing subjects were recruited to test the effects on speech intelligibility of a talker positioned in front of the listener and amplified by two lateral symmetrical loudspeakers, under (i) different listener-to-talker distances, (ii) one-talker noise at various azimuth angles around the listener, (iii) high reverberation with a –5 dB signal-to-noise ratio, (iv) self-motion, and (v) visual cues. Tests were conducted in four configurations: audio-visual and audio-only, each with self-motion and in the static condition. The static audio-only tests yielded the highest speech intelligibility, followed by a tie between the audio-visual tests with self-motion and in the static condition. Speech intelligibility decreased as the target-to-listener distance increased in all the noisy scenes. Additionally, speech intelligibility increased when the noise azimuth was 120° compared with both 180° and 0°, with the talker at approximately 8 m from the listener. The advantage of spatially separating the noise signal in reverberation was most evident in the audio-visual test with self-motion. This suggests a spatial release from masking in the presence of reverberation and one-talker interfering noise within a more ecological scene.
