Abstract

In this study, we present a deep neural network-based online multi-speaker localization algorithm for a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An elaborate experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by applying the obtained TF masks.
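The separation byproduct mentioned above follows directly from W-disjoint orthogonality: once each TF bin is assigned a DOA, a binary mask per speaker recovers that speaker from the mixture STFT. The sketch below illustrates this masking step only; the function names, the tolerance parameter, and the per-bin DOA map are illustrative assumptions, not the paper's network or its actual post-processing.

```python
import numpy as np

def doa_masks(doa_map, candidate_doas, tol=10.0):
    """Build one binary TF mask per candidate speaker DOA from a
    per-bin DOA estimate (e.g., the output of a per-bin classifier).
    Under W-disjoint orthogonality, each bin belongs to one speaker."""
    return [np.abs(doa_map - d) <= tol for d in candidate_doas]

def separate(stft_mix, doa_map, candidate_doas, tol=10.0):
    """Recover each speaker by applying its binary mask to the mixture STFT."""
    return [mask * stft_mix for mask in doa_masks(doa_map, candidate_doas, tol)]
```

For a mixture whose bins are labeled, say, 30° and 120°, `separate` returns one masked STFT per speaker; each bin contributes to exactly one output as long as the candidate DOAs are farther apart than `2 * tol`.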

Highlights

  • Localizing multiple sound sources recorded with a microphone array in an acoustic environment is an essential component in various cases such as source separation and scene analysis

  • In [11], a convolutional neural network (CNN)-based classification method was applied in the short-time Fourier transform (STFT) domain for broadband direction of arrival (DOA) estimation, assuming that only a single speaker is active per time frame

  • We present a multi-speaker DOA estimation algorithm that is based on the U-net architecture that infers the DOA of each TF bin


Summary

Introduction

Localizing multiple sound sources recorded with a microphone array in an acoustic environment is an essential component in various applications, such as source separation and scene analysis. The steered response power with phase transform (SRP-PHAT) algorithm [3] uses a generalization of cross-correlation methods for DOA estimation. These methods are still widely in use for both single- and multi-speaker localization tasks. More recently, deep networks trained on audio features have demonstrated improved performance as compared with classical approaches; such methods, however, are mainly designed to deal with a single sound source at a time. In [11], a CNN-based classification method was applied in the short-time Fourier transform (STFT) domain for broadband DOA estimation, assuming that only a single speaker is active per time frame. The proposed method (Hammer et al., EURASIP Journal on Audio, Speech, and Music Processing (2021) 2021:16) improves the DOA estimation performance with respect to (w.r.t.) the state-of-the-art (SOTA) approaches, which are frame-based, and facilitates simultaneous tracking of multiple moving speakers.
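To make the classical baseline concrete, the sketch below implements SRP-PHAT for a single two-microphone pair: the generalized cross-correlation with phase-transform (PHAT) weighting is steered over a grid of candidate angles, and the angle with maximal response power is returned. The 1° grid, function names, and two-microphone restriction are simplifying assumptions for illustration; the algorithm in [3] sums the steered power over all microphone pairs of the array.

```python
import numpy as np

def gcc_phat(x, y):
    """GCC-PHAT: circular cross-correlation with phase-transform
    weighting, which whitens the spectrum and sharpens the peak."""
    n = len(x)
    R = np.fft.rfft(x) * np.conj(np.fft.rfft(y))
    R /= np.abs(R) + 1e-12            # PHAT weighting
    return np.fft.irfft(R, n)         # peak near index (-lag of y) mod n

def srp_phat_doa(x, y, mic_dist, fs, c=343.0):
    """Steer the GCC-PHAT correlation over candidate DOAs (0..180 deg,
    1 deg grid) for one mic pair and return the max-power angle."""
    cc = gcc_phat(x, y)
    angles = np.arange(0.0, 180.5, 1.0)
    # far-field TDOA of mic y relative to mic x for each candidate angle
    tdoa = mic_dist * np.cos(np.deg2rad(angles)) / c
    lags = np.round(tdoa * fs).astype(int)
    power = cc[(-lags) % len(cc)]
    return angles[np.argmax(power)]
```

For example, with 0.3 m spacing at 16 kHz, a source near 60° broadside produces a delay of about 7 samples between the microphones, and the steered power peaks at the matching candidate angle.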

Multi-microphone time-frequency features
U-Net for DOA estimation
Experimental study
Findings
Conclusions