Abstract

In many speech communication applications, robust localization and tracking of multiple speakers in noisy and reverberant environments are of major importance. Several algorithms to tackle this problem have been proposed in the last decades. In this paper, we propose several extensions to a recently presented joint direction of arrival (DOA) and pitch estimation method, increasing its robustness in multi-speaker scenarios, noise, and reverberation. First, a spectral comb filter is added to the original algorithm to better cope with concurrent speakers. Second, the well-known generalized cross-correlation with phase transform (GCC-PHAT) is used as an additional weighting function to improve the DOA estimation accuracy in terms of correct hits. Third, using multiple microphone pairs, the multi-channel cross-correlation approach is incorporated to improve the robustness against noise and reverberation. In order to improve tracking for moving and even intersecting speakers, a particle filter is used. Experiments with real-world recordings in realistic acoustic conditions show that the proposed extensions increase the DOA hit rate by about 33% compared to the original algorithm for two step-wise moving sources at a signal-to-noise ratio (SNR) of 15 dB and a reverberation time RT60 of 560 ms.

Highlights

  • Automatic detection, localization, and tracking of speakers are of high interest in several applications such as hands-free speech communication and video conferencing, as well as for computational auditory scene analysis and human-machine interfaces

  • We explain three novel extensions: a spectral comb filter to better cope with concurrent speakers, a generalized cross-correlation with phase transform (GCC-PHAT) weighting function to improve the direction of arrival (DOA) estimation accuracy, and a multi-channel cross-correlation approach to improve the robustness against noise and reverberation

  • A performance comparison between the core algorithm discussed in Section 2.2 and the extensions proposed in Section 2.3 will be presented in terms of DOA estimation hit rate A_φ and pitch estimation hit rate A_f, as well as root-mean-square error (RMSE) of the DOA estimates
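
To make the GCC-PHAT weighting named in the highlights more concrete, the following is a minimal sketch of the textbook GCC-PHAT delay estimator for a single microphone pair. It is not the paper's implementation: the function names `gcc_phat` and `tdoa_to_doa`, the far-field model, the speed of sound of 343 m/s, and the regularization constant are all assumptions made for illustration.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay of x2 relative to x1 (in seconds) with GCC-PHAT.

    Hypothetical helper for illustration; a positive result means x2 lags x1.
    """
    n = len(x1) + len(x2)              # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12     # PHAT: keep the phase, discard the magnitude
    cc = np.fft.irfft(cross, n=n)      # generalized cross-correlation over lags

    max_shift = n // 2
    if max_tau is not None:
        # Restrict the search to physically possible lags (at least one sample).
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs

def tdoa_to_doa(tau, mic_distance, speed_of_sound=343.0):
    """Convert a time delay to a far-field DOA in degrees (assumed geometry)."""
    # Clip to [-1, 1] so that small estimation errors do not break arcsin.
    s = np.clip(speed_of_sound * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

# Usage: a noise signal delayed by 5 samples at an assumed 16 kHz sampling rate.
fs = 16000
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1024)
delay = 5
x2 = np.concatenate((np.zeros(delay), x1[:-delay]))
tau = gcc_phat(x1, x2, fs)
```

The PHAT normalization whitens the cross-power spectrum, which sharpens the correlation peak; this property is why the weighting is widely used in reverberant rooms. How the paper combines this weighting with the joint DOA and pitch estimator is not shown here.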


Summary

Introduction

Localization and tracking of speakers are of high interest in several applications such as hands-free speech communication and video conferencing, as well as for computational auditory scene analysis and human-machine interfaces.

The final DOA and pitch estimate for time frame λ is obtained by summing all weighted particles.

2.3 Methods to increase the robustness

For a single-speaker scenario and clean speech recordings, the basic DOA and pitch estimation algorithm in [23] performs quite well.
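
The weighted particle sum mentioned in the summary can be illustrated with a short sketch. The exact formula is not reproduced on this page, so the code below assumes the standard particle filter point estimate: each particle carries a (DOA, pitch) hypothesis, and the estimate is the weight-normalized sum over all particles. The function name and the tuple layout are hypothetical.

```python
def weighted_particle_estimate(particles, weights):
    """Return the (DOA, pitch) estimate as the weighted sum of all particles.

    particles: list of (doa_degrees, pitch_hz) tuples (assumed layout)
    weights:   non-negative importance weights, one per particle
    """
    if len(particles) != len(weights):
        raise ValueError("particles and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize inside the sum so that unnormalized weights are also accepted.
    doa = sum(w * p[0] for p, w in zip(particles, weights)) / total
    pitch = sum(w * p[1] for p, w in zip(particles, weights)) / total
    return doa, pitch

# Usage: two particles for one time frame.
doa_hat, pitch_hat = weighted_particle_estimate(
    [(30.0, 100.0), (40.0, 120.0)], [0.75, 0.25]
)
```

A plain weighted mean is adequate while the DOA hypotheses stay away from the wrap-around of the angular range; near the wrap-around a circular mean would be the safer choice.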

