Abstract

This paper presents an active audio-visual integration framework that combines audio and visual information with a robot's active motion for noise-robust Voice Activity Detection (VAD). VAD is crucial for noise-robust Automatic Speech Recognition (ASR) because speech captured by a robot's microphones is usually contaminated by other noise sources. To realize such noise-robust VAD, we propose an Active Audio-Visual (AAV) integration framework that integrates auditory, visual, and motion information using a Causal Bayesian Network (CBN). A CBN is a subclass of Bayesian networks that can estimate the effect of active motions on VAD performance. Since a CBN is a general framework for information integration, various types of information that affect VAD performance, such as the locations of a speaker and a noise source, can be introduced naturally, and the CBN selects the optimal active motion for better robot perception via its “intervention” mechanism. We implemented a prototype system based on the proposed framework on a humanoid robot called Hearbo. The proposed AAV-VAD is compared with three baselines: simple AAV-VAD, multi-regression-based AAV-VAD, and stationary (non-active) AV-VAD. A preliminary experiment using the prototype system showed that the VAD performance of the proposed AAV-VAD was 14.4, 26.0, and 30.3 points higher than that of the simple active, multi-regression-based active, and stationary AV-VAD, respectively.
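The intervention-based motion selection described in the abstract can be sketched in miniature as follows. This is a minimal illustration under assumed structure and probabilities (a tiny chain motion → SNR → VAD correctness with made-up conditional probability tables), not the paper's actual network: since the motion variable is directly intervened upon, do(motion = m) here simply fixes the motion and the expected VAD accuracy is obtained by marginalizing over SNR.

```python
# Hypothetical sketch of selecting an active motion via intervention in a
# tiny causal Bayesian network: motion -> SNR -> VAD correctness.
# All variable names and probability values below are illustrative
# assumptions, not taken from the paper.

# P(SNR = high | do(motion)) for each candidate active motion (assumed)
P_SNR_HIGH = {
    "stay":      0.40,
    "turn_head": 0.65,
    "approach":  0.80,
}

# P(VAD correct | SNR) (assumed)
P_VAD_GIVEN_SNR = {"high": 0.95, "low": 0.60}

def expected_vad_accuracy(motion: str) -> float:
    """Expected P(VAD correct | do(motion)), marginalizing over SNR."""
    p_high = P_SNR_HIGH[motion]
    return (p_high * P_VAD_GIVEN_SNR["high"]
            + (1.0 - p_high) * P_VAD_GIVEN_SNR["low"])

def select_motion() -> str:
    """Pick the active motion that maximizes expected VAD accuracy."""
    return max(P_SNR_HIGH, key=expected_vad_accuracy)

if __name__ == "__main__":
    for m in P_SNR_HIGH:
        print(f"{m}: {expected_vad_accuracy(m):.3f}")
    print("selected:", select_motion())
```

Under these assumed numbers, moving toward the speaker ("approach") yields the highest expected VAD accuracy, so it would be selected as the active motion.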
