Abstract

Neither a monocular RGB camera nor a small-size microphone array alone is capable of accurate three-dimensional (3D) speaker localization. By taking advantage of accurate visual object detection and complementary audio-visual sensor fusion, we formulate the 3D speaker localization problem as a visual scaling-factor estimation problem. As a result, we effectively reduce traditional audio-only 3D speaker localization from an exhaustive grid search to a one-dimensional (1D) optimization problem. We propose a multi-modal perception system with two optimization approaches. We show that the proposed methods are effective, accurate, and robust against interference and, as corroborated by empirical results on a real dataset, competitive with conventional uni-modal and state-of-the-art audio-visual speaker localization approaches.

Highlights

  • Multimodal perception is fertile research ground that merits further investigation; it has been used extensively in cognitive science, behavioral science, and neuroscience owing to its capability of enabling brains to learn meaningful information from different sensory modalities, including sound and sight [1]

  • Stochastic Region Contraction (SRC) [11], hierarchical search [12, 13], and vectorization [14] have been proposed to speed up the scanning; these usually restrict the search to a two-dimensional (2D) space

  • We propose two objective functions, namely the Multi-channel Cost Function (MCF) and the Global Likelihood Function (GLF), and adopt a grid-search method to select the optimal visual scaling factor κ from pre-defined hypotheses
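The 1D search described in the last highlight can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the visual detector is assumed to yield a unit bearing vector, each candidate 3D position is `kappa * direction`, and `objective` stands in for an MCF/GLF-style acoustic scoring function.

```python
import numpy as np

def estimate_kappa(direction, kappas, objective):
    """Hypothetical 1D grid search over the visual scaling factor kappa.

    direction: unit bearing vector from visual object detection (shape (3,)).
    kappas:    iterable of candidate scaling-factor hypotheses.
    objective: callable scoring a candidate 3D position (higher is better),
               standing in for an acoustic objective such as MCF or GLF.
    """
    # Score every hypothesized 3D position kappa * direction.
    scores = [objective(k * direction) for k in kappas]
    best = int(np.argmax(scores))
    return kappas[best], scores[best]

# Toy usage: an objective that peaks at a known distance of 2.5 m.
direction = np.array([0.0, 0.0, 1.0])
true_position = 2.5 * direction
objective = lambda p: -np.linalg.norm(p - true_position)
kappa, score = estimate_kappa(direction, np.linspace(0.5, 5.0, 10), objective)
```

Because the bearing is fixed by the visual detection, only the scalar κ remains unknown, which is what collapses the 3D grid search to a 1D one.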


Summary

Introduction

Multimodal perception is fertile research ground that merits further investigation; it has been used extensively in cognitive science, behavioral science, and neuroscience owing to its capability of enabling brains to learn meaningful information from different sensory modalities, including sound and sight [1]. Acoustic Sound Source Localization (SSL) with multichannel microphones [6, 7], one of the most profound localization techniques, has spurred ongoing interest in many far-field speech applications, such as automatic speech recognition [8] and speech separation [9]. However, acoustic SSL works well only in relatively clean conditions and suffers from noise and reverberation distortions. Moreover, it relies on an exhaustive grid-search algorithm, e.g. Steered Response Power (SRP) [10], to find the 3D position, which is computationally demanding. As pointed out by [15], a static small-size microphone array is incapable of localizing the speaker in the 3D domain.
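To make the computational cost of the exhaustive search concrete, a minimal SRP-PHAT-style sketch is shown below. This is a simplified illustration, not the cited SRP implementation: it scores every candidate grid point by the PHAT-weighted cross-power over all microphone pairs, so the cost grows with the number of grid points times the number of pairs.

```python
import numpy as np

def srp_phat(signals, mic_positions, grid, fs=16000, c=343.0):
    """Sketch of SRP-PHAT grid search.

    signals:       (n_mics, n_samples) array of time-aligned microphone frames.
    mic_positions: (n_mics, 3) array of microphone coordinates in meters.
    grid:          (n_points, 3) array of candidate 3D source positions.
    Returns the best-scoring grid point and the per-point scores.
    """
    n_mics, n_samples = signals.shape
    spectra = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    scores = np.zeros(len(grid))
    for k, point in enumerate(grid):
        # Propagation delay from this candidate point to each microphone.
        delays = np.linalg.norm(mic_positions - point, axis=1) / c
        for i in range(n_mics):
            for j in range(i + 1, n_mics):
                cross = spectra[i] * np.conj(spectra[j])
                phat = cross / (np.abs(cross) + 1e-12)  # PHAT whitening
                tau = delays[i] - delays[j]
                # Steer the pair toward the candidate and accumulate power.
                scores[k] += np.real(np.sum(phat * np.exp(2j * np.pi * freqs * tau)))
    return grid[np.argmax(scores)], scores
```

Every candidate point costs a full pass over all microphone pairs and frequency bins, which is why a dense 3D grid quickly becomes prohibitive and why reducing the search to 1D is attractive.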


