Abstract

In this paper, a method of detecting speech events in a multiple-sound-source condition using sound and vision information is proposed. Detection of speech events is an important issue for automatic speech recognition operated in a real environment. Furthermore, as stated in this paper, the performance of sound source separation using adaptive beamforming is greatly improved by knowing when and where the target speech event occurs. For this purpose, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, the time and location of speech events can be inferred in a multiple-sound-source condition. Results of an off-line experiment in a real environment with TV and music interference are shown.
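To make the fusion idea concrete, the following is a minimal sketch (not the paper's actual model) of a two-observation Bayesian network in the spirit of the proposed combination: a binary speech-event variable S generates an audio cue A (microphone-array localization fires at a location) and a vision cue V (stereo vision tracks a person there), and the posterior P(S | A, V) is computed under the usual conditional-independence assumption. All probability values are illustrative placeholders, not figures from the paper.

```python
# Minimal Bayesian-network sketch: S -> A and S -> V, where
#   S = "a speech event occurs at this location" (binary, hidden)
#   A = "the microphone array localized a sound source here"
#   V = "stereo vision tracks a person here"
# The numbers below are illustrative assumptions, not the paper's values.

P_S = 0.2                                # prior P(S = 1)
P_A_given_S = {True: 0.9, False: 0.3}    # P(A = 1 | S)
P_V_given_S = {True: 0.95, False: 0.1}   # P(V = 1 | S)

def posterior_speech(audio_cue: bool, vision_cue: bool) -> float:
    """Posterior P(S = 1 | A, V), assuming A and V are conditionally
    independent given S (the naive-Bayes factorization)."""
    def lik(table, observed, s):
        p = table[s]
        return p if observed else 1.0 - p

    joint_s1 = P_S * lik(P_A_given_S, audio_cue, True) * lik(P_V_given_S, vision_cue, True)
    joint_s0 = (1.0 - P_S) * lik(P_A_given_S, audio_cue, False) * lik(P_V_given_S, vision_cue, False)
    return joint_s1 / (joint_s1 + joint_s0)

# Sound localized where a tracked person stands: strong speech evidence.
print(posterior_speech(audio_cue=True, vision_cue=True))   # ~0.88
# Sound localized with no person there (e.g. a TV or loudspeaker):
print(posterior_speech(audio_cue=True, vision_cue=False))  # ~0.04
```

The contrast between the two queries mirrors the multiple-sound-source setting described in the abstract: acoustic localization alone cannot distinguish a talker from a TV or loudspeaker, but a coincident visual track of a person sharply raises the posterior that a target speech event is occurring there.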
