Abstract
The problem of blind and online speaker localization and separation using multiple microphones is addressed based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is proposed: (1) multi-speaker direction of arrival (DOA) estimation and (2) multi-speaker relative transfer function (RTF) estimation. The DOA estimation task uses only the time-frequency (TF) bins dominated by a single speaker and does not require the entire frequency range. In contrast, the RTF estimation task requires the entire frequency range, since an RTF must be estimated for each frequency bin. Accordingly, a different statistical model is used for each task. The first REM model assumes that the speech signal is sparse in the TF domain and utilizes a mixture of Gaussians (MoG) model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are estimated using these bins. The second REM model assumes that the speakers are concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate the speakers. As a result of the concurrent-speaker assumption, a more precise TF map of the speakers’ activity is obtained. The RTFs are estimated using the outputs of the MCWF-beamformer (BF), which is constructed using the DOAs obtained in the first stage. Finally, the speech signals are separated by a linearly constrained minimum variance (LCMV)-BF that utilizes the estimated RTFs. The algorithm is evaluated in real-life scenarios with two speakers. Evaluation of the mean absolute error (MAE) of the estimated DOAs and of the separation capabilities demonstrates a significant improvement w.r.t. a baseline DOA estimation and speaker separation algorithm.
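The abstract's first stage hinges on an MoG model whose E-step assigns each TF bin a posterior probability of belonging to each speaker, so that bins dominated by a single speaker can be selected for DOA estimation. A minimal sketch of such an E-step is below; the one-dimensional per-bin feature, the function name, and the toy parameters are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def mog_responsibilities(x, mus, sigmas, priors):
    """E-step of a 1-D MoG: posterior probability that each TF-bin
    feature x[n] (e.g. a phase-difference feature) was generated by
    each speaker's Gaussian component.  Returns an (N, J) matrix
    whose rows sum to one."""
    x = np.asarray(x)[:, None]                      # (N, 1)
    # Log-density of each component, evaluated per bin (N, J).
    log_p = (np.log(priors)
             - 0.5 * np.log(2 * np.pi * sigmas**2)
             - 0.5 * ((x - mus) / sigmas) ** 2)
    log_p -= log_p.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

# Toy example: two well-separated components; a bin with feature -1.0
# is attributed almost entirely to speaker 0, and vice versa.
r = mog_responsibilities(np.array([-1.0, 1.0]),
                         mus=np.array([-1.0, 1.0]),
                         sigmas=np.array([0.2, 0.2]),
                         priors=np.array([0.5, 0.5]))
```

Bins whose maximum responsibility is close to one are treated as single-speaker dominated; the recursive (online) variant updates `mus`, `sigmas`, and `priors` frame by frame rather than over a whole batch.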
Highlights
Multi-speaker separation techniques utilizing microphone arrays have attracted the attention of the research community and industry over the last three decades, especially in the context of hands-free communication systems
A commonly used technique for source extraction is the linearly constrained minimum variance (LCMV)-BF [4, 5], which is a generalization of the minimum variance distortionless response (MVDR)-BF [6]
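The LCMV-BF has a well-known closed form per frequency bin: minimizing the output power subject to the linear constraints C^H w = g yields w = R^{-1} C (C^H R^{-1} C)^{-1} g, which reduces to the MVDR-BF when there is a single distortionless constraint. A minimal numerical sketch (variable names and the toy covariance/RTF values are mine, not from the paper):

```python
import numpy as np

def lcmv_weights(R, C, g):
    """Closed-form LCMV beamformer weights for one frequency bin.
    R: (M, M) noise covariance matrix; C: (M, J) constraint matrix
    whose columns are the speakers' RTFs; g: (J,) desired responses
    (e.g. g = e_j to extract speaker j while nulling the others).
    With J = 1 and g = 1 this is the MVDR beamformer."""
    Rinv_C = np.linalg.solve(R, C)                         # R^{-1} C
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, g)

# Toy check with M = 4 mics, identity noise covariance, and two
# orthogonal "RTF" columns: the beamformer should pass speaker 1
# with unit gain and place a null on speaker 2.
M = 4
C = np.stack([np.ones(M) / 2,
              np.array([1.0, -1.0, 1.0, -1.0]) / 2], axis=1)
w = lcmv_weights(np.eye(M), C, np.array([1.0, 0.0]))
```

The constraints are satisfied exactly by construction (C^H w = g), which is what makes the LCMV-BF attractive for extracting one speaker while cancelling competing ones.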
In [15], the speech sparsity in the short-time Fourier transform (STFT) domain was utilized to track the direction of arrival (DOA) of multiple speakers using a convolutional neural network (CNN) applied to the instantaneous relative transfer function (RTF) estimate
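An instantaneous RTF estimate at a TF bin can be formed, in its simplest version, as the ratio of each microphone's STFT coefficient to a reference microphone's coefficient; in bins dominated by one source this ratio equals that source's RTF. The sketch below illustrates this simplified ratio-based feature only; it is an assumption for illustration, not necessarily the exact feature computed in [15]:

```python
import numpy as np

def instantaneous_rtf(X, ref=0):
    """Instantaneous RTF estimate per TF bin: the ratio of each
    microphone's STFT coefficient to the reference microphone's.
    X: (M, F, N) multichannel STFT (mics x freqs x frames).
    Returns an (M, F, N) complex array of ratios."""
    return X / X[ref]

# Toy example: a single source with a fixed steering vector h
# (normalized so h[0] = 1) reaches all mics; the per-bin ratio
# then recovers h exactly in every TF bin.
rng = np.random.default_rng(0)
h = np.array([1.0, 0.5 * np.exp(1j * 0.3), 0.8 * np.exp(-1j * 1.1)])
S = rng.standard_normal((1, 4, 6)) + 1j * rng.standard_normal((1, 4, 6))
X = h[:, None, None] * S        # (3 mics, 4 freqs, 6 frames)
H = instantaneous_rtf(X)
```

In practice such per-bin ratios are noisy in bins where several sources or noise are active, which is why they are used as input features (e.g. to the CNN of [15]) rather than as final RTF estimates.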
Summary
Common DOA estimators are based on the steered response power with phase transform (SRP-PHAT) [23], the multiple signal classification (MUSIC) algorithm [24], or model-based expectation-maximization source separation and localization (MESSL) [25]. The LCMV-BF is then re-employed using the estimated RTFs. The direct-path phase differences are set using the speakers’ DOAs, which are estimated in an online preliminary stage of multiple concurrent DOA estimation. In this stage, assuming J speakers, J dominant DOAs are estimated in each frame using a novel version of the MoG-REM.
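The two-microphone building block of the SRP-PHAT estimator mentioned above is the GCC-PHAT cross-correlation, whose peak gives the time-difference of arrival for a mic pair (and hence, with known geometry, a DOA). A minimal sketch, with function name and toy signals of my own choosing, not taken from the paper:

```python
import numpy as np

def gcc_phat(x, y, fs, n_fft=1024):
    """GCC-PHAT time-delay estimate (in seconds) of signal x
    relative to signal y.  The PHAT weighting whitens the
    cross-spectrum so that only its phase drives the peak."""
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    G = X * np.conj(Y)
    G = G / (np.abs(G) + 1e-12)            # PHAT weighting
    cc = np.fft.irfft(G, n_fft)
    max_lag = n_fft // 2
    # Rearrange so index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return (np.argmax(np.abs(cc)) - max_lag) / fs

# Toy example: y is x delayed by 5 samples; GCC-PHAT should
# recover a delay of 5 / fs seconds.
rng = np.random.default_rng(0)
x = rng.standard_normal(512)
y = np.concatenate((np.zeros(5), x))[:512]
tau = gcc_phat(y, x, fs=16000)
```

SRP-PHAT sums such whitened correlations over all mic pairs and candidate steering directions; the MoG-REM stage described above replaces this grid search with a per-frame statistical model over the single-speaker-dominated TF bins.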
Published in: EURASIP Journal on Audio, Speech, and Music Processing