Abstract

For robots to work cooperatively with humans in the real world, a speech-based communicative function is indispensable. To realize such a function in a noisy real environment, it is essential that robots be able to extract target speech spoken by humans from a mixture of sounds using their own onboard resources. We have developed a method for detecting and extracting speech events based on the fusion of audio and video information. In this method, audio information (sound localization using a microphone array) and video information (human tracking using a camera) are fused by a Bayesian network to detect speech events. The information from detected speech events is then used for sound separation by adaptive beamforming. This paper reports basic investigations into applying the above system to the humanoid robot HRP-2. The input devices, a microphone array and a camera, were mounted on the head of HRP-2, and the acoustic characteristics relevant to sound localization and separation performance were investigated. The human tracking system was also improved so that it can operate in dynamic situations. Finally, the overall performance of the system was evaluated in off-line experiments.
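
The abstract does not specify the structure of the Bayesian network, so the sketch below illustrates only the underlying idea with a minimal two-cue naive Bayes model: a binary audio cue (a localized sound onset) and a binary video cue (a tracked frontal face) are fused into a posterior probability of a speech event. All names and likelihood values here are illustrative assumptions, not figures from the paper.

```python
def speech_posterior(audio_onset: bool, face_frontal: bool,
                     prior: float = 0.5) -> float:
    """Posterior P(speech event | audio cue, video cue) via naive Bayes.

    The cue likelihoods below are assumed for illustration; in practice
    they would be learned or calibrated from data.
    """
    # P(cue observed | speech) vs. P(cue observed | no speech), assumed.
    p_a_s, p_a_n = (0.8, 0.3) if audio_onset else (0.2, 0.7)
    p_v_s, p_v_n = (0.7, 0.2) if face_frontal else (0.3, 0.8)
    num = p_a_s * p_v_s * prior
    den = num + p_a_n * p_v_n * (1.0 - prior)
    return num / den

# Both cues present -> strong evidence; audio alone -> ambiguous.
print(speech_posterior(True, True))   # ~0.90
print(speech_posterior(True, False))  # ~0.50
```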
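
The abstract says the separation stage uses adaptive beamforming but does not name the algorithm. As one concrete instance, the sketch below uses an MVDR (minimum variance distortionless response) beamformer for a single frequency bin, steered toward the direction reported by the detection stage; the array geometry, look angle, and loading constant are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def steering_vector(freq_hz, mic_x_m, theta_rad, c=343.0):
    # Far-field steering vector for a linear array along the x-axis;
    # theta is the arrival angle measured from the array axis.
    delays = mic_x_m * np.cos(theta_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def mvdr_weights(R, d, loading=1e-3):
    # MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin,
    # with diagonal loading for numerical stability.
    R_l = R + loading * (np.trace(R).real / len(d)) * np.eye(len(d))
    Rinv_d = np.linalg.solve(R_l, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# Toy usage: 8-mic linear array with 4 cm spacing (assumed geometry),
# one 1 kHz bin, noise covariance estimated from random snapshots.
rng = np.random.default_rng(0)
mic_x = np.arange(8) * 0.04
d = steering_vector(1000.0, mic_x, np.deg2rad(60.0))
noise = rng.standard_normal((8, 200)) + 1j * rng.standard_normal((8, 200))
R = noise @ noise.conj().T / 200
w = mvdr_weights(R, d)
print(abs(w.conj() @ d))  # distortionless constraint toward the target: 1.0
```

Applied per frequency bin as y = w^H x on STFT frames, such a beamformer passes the look direction unchanged while minimizing noise power; in the system described above, the speech-event direction from the audio-visual detector would supply the steering angle.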
