Abstract

Locating sound sources is one of the most important capabilities in robot audition. In recent years, single-source localization techniques have matured considerably. However, localizing and tracking a specific sound source in multi-source scenarios, known as the cocktail party problem, remains unresolved. To address this challenge, we propose a system for dynamically localizing and tracking sound sources based on audio–visual information that can be deployed on a mobile robot. Our system first locates specific targets using pre-registered voiceprint and face features. Subsequently, guided by the motion module, the robot moves to track the target while keeping away from other sound sources in its surroundings, which helps it gather clearer audio of the target and thus better perform downstream tasks. The system's effectiveness has been verified through extensive real-world experiments, showing a 20% improvement in the success rate of specific-speaker localization and a 14% reduction in word error rate in speech recognition compared to its counterparts.
