Abstract

In this paper, a method of detecting speech events in a multiple-sound-source condition using sound and vision information is proposed. Detection of speech events is an important issue for automatic speech recognition operated in a real environment. Furthermore, as stated in this paper, the performance of sound source separation using adaptive beamforming is greatly improved by knowing when and where the target speech event occurs. For this purpose, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, the time and location of speech events can be inferred in a multiple-sound-source condition. Results of an off-line experiment in a real environment with TV and music interference are shown.
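To make the fusion idea concrete, the following is a minimal sketch (not the paper's actual model) of a two-observation Bayesian network in the spirit of the proposed combination: a binary speech-event variable S generates an audio cue A (microphone-array localization fires at a location) and a vision cue V (stereo vision tracks a person there), and the posterior P(S | A, V) is computed under the usual conditional-independence assumption. All probability values are illustrative placeholders, not figures from the paper.

```python
# Minimal Bayesian-network sketch: S -> A and S -> V, where
#   S = "a speech event occurs at this location" (binary, hidden)
#   A = "the microphone array localized a sound source here"
#   V = "stereo vision tracks a person here"
# The numbers below are illustrative assumptions, not the paper's values.

P_S = 0.2                                # prior P(S = 1)
P_A_given_S = {True: 0.9, False: 0.3}    # P(A = 1 | S)
P_V_given_S = {True: 0.95, False: 0.1}   # P(V = 1 | S)

def posterior_speech(audio_cue: bool, vision_cue: bool) -> float:
    """Posterior P(S = 1 | A, V), assuming A and V are conditionally
    independent given S (the naive-Bayes factorization)."""
    def lik(table, observed, s):
        p = table[s]
        return p if observed else 1.0 - p

    joint_s1 = P_S * lik(P_A_given_S, audio_cue, True) * lik(P_V_given_S, vision_cue, True)
    joint_s0 = (1.0 - P_S) * lik(P_A_given_S, audio_cue, False) * lik(P_V_given_S, vision_cue, False)
    return joint_s1 / (joint_s1 + joint_s0)

# Sound localized where a tracked person stands: strong speech evidence.
print(posterior_speech(audio_cue=True, vision_cue=True))   # ~0.88
# Sound localized with no person there (e.g. a TV or loudspeaker):
print(posterior_speech(audio_cue=True, vision_cue=False))  # ~0.04
```

The contrast between the two queries mirrors the multiple-sound-source setting described in the abstract: acoustic localization alone cannot distinguish a talker from a TV or loudspeaker, but a coincident visual track of a person sharply raises the posterior that a target speech event is occurring there.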
