Abstract
Robustly estimating the number of speakers (NoS) from audio mixtures recorded in a reverberant environment is crucial. Some popular time-frequency (TF) methods approach NoS estimation by assuming that only one speech component is active at each TF slot. However, this assumption is violated in many scenarios where the speech signals are convolved with long room impulse responses, which degrades NoS estimation performance. To tackle this problem, a density-based clustering strategy is proposed that estimates NoS under a local dominance assumption on the speech signals. The method proceeds in three steps, from clustering to the classification of speakers, and is designed with robustness in mind. First, the leading eigenvectors are extracted from the local covariance matrices of the mixture TF components and ranked by a combination of their local density and their minimum distance to other leading eigenvectors of higher density. Second, a gap-based method is employed to determine the cluster centers from the ranked leading eigenvectors at each frequency bin. Third, a criterion based on the averaged volume of the cluster centers is proposed to select the frequency bins with reliable clustering results, on which the classification decision of NoS is based. The experimental results demonstrate that the proposed algorithm outperforms existing methods under various reverberation conditions, both with and without noise.
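The ranking and gap steps described in the abstract can be illustrated with a minimal density-peak sketch. It is not the authors' implementation. It assumes real-valued feature vectors, a Gaussian-kernel density with a hand-picked `cutoff`, a ranking score equal to the product of density and distance, and a gap measured on the logarithm of the sorted scores. The function names `rank_by_density_peaks` and `count_centers_by_gap` are hypothetical, and none of these choices are specified in the abstract.

```python
import numpy as np


def rank_by_density_peaks(points, cutoff):
    """Rank points by gamma = local density * distance to the nearest denser point.

    Assumption: Euclidean distance and a Gaussian kernel of width `cutoff`.
    Returns (order, gamma), where `order` sorts gamma in descending order.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Local density: Gaussian-weighted neighbour count, without the self term.
    rho = np.exp(-((dist / cutoff) ** 2)).sum(axis=1) - 1.0

    # Minimum distance to any point of higher density. The densest point has
    # no such neighbour, so it receives its largest distance instead.
    delta = np.empty(len(points))
    for i in range(len(points)):
        denser = rho > rho[i]
        delta[i] = dist[i, denser].min() if denser.any() else dist[i].max()

    gamma = rho * delta
    order = np.argsort(gamma)[::-1]
    return order, gamma


def count_centers_by_gap(gamma_sorted, max_centers=10, eps=1e-12):
    """Pick the number of centers at the largest log-gap in the sorted scores.

    Assumption: the gap is taken between consecutive log-scores among the
    top `max_centers + 1` candidates.
    """
    top = np.log(np.asarray(gamma_sorted[: max_centers + 1]) + eps)
    gaps = top[:-1] - top[1:]
    return int(np.argmax(gaps)) + 1
```

In this sketch, points that are both dense and far from any denser point receive a large score, so candidate cluster centers rise to the top of the ranking and the gap marks where the scores of ordinary points begin. In the setting of the abstract, this procedure would be applied at each frequency bin to the leading eigenvectors, which are generally complex-valued and would need a distance that ignores their phase ambiguity.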
Highlights
Audio source separation (ASS) aims to recover multiple mixed speech sources recorded by multiple microphones [1]–[5]
Model selection in convolutive ASS seeks the best classification of speakers from the recorded mixtures, where multiple speech sources are mixed through a multi-delay (convolutive) mixing system
A new number of speakers (NoS) detector for reverberant environments is proposed in this paper
Summary
Audio source separation (ASS) aims to recover multiple mixed speech sources recorded by multiple microphones [1]–[5]. Because echoes are present in a real recording environment, convolutive ASS is usually employed to describe the physical mixing mechanism of multiple speech source signals: the sources are convolved with a sequence of delayed versions of a linear mixing system [1], [3], [6]. Estimating NoS from the recorded mixture signals is essential in convolutive ASS [7], [8]. Model selection in convolutive ASS seeks the best classification of speakers from the recorded mixtures, where multiple speech sources are mixed through a multi-delay mixing system. We mainly focus on the NoS estimation problem in the time-frequency (TF) domain.
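For reference, the convolutive mixing model and its usual TF-domain approximation can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the source: $M$ microphones, $K$ sources, mixing filters $\mathbf{H}(l)$ of length $L$, frequency bin $f$, and frame index $n$.

```latex
% Time-domain convolutive mixture (assumed notation):
\mathbf{x}(t) = \sum_{l=0}^{L-1} \mathbf{H}(l)\, \mathbf{s}(t-l),
\qquad \mathbf{x}(t) \in \mathbb{R}^{M},\ \mathbf{s}(t) \in \mathbb{R}^{K}

% Narrowband approximation after a short-time Fourier transform,
% valid when the analysis window is long relative to the filter length L:
\mathbf{x}(f,n) \approx \mathbf{H}(f)\, \mathbf{s}(f,n)
  = \sum_{k=1}^{K} \mathbf{h}_k(f)\, s_k(f,n)

% If source k dominates a local TF neighbourhood, the local covariance is
% approximately rank one, so its leading eigenvector is proportional to h_k(f):
\mathbf{R}(f,n) \approx \sigma_k^2(f,n)\, \mathbf{h}_k(f)\, \mathbf{h}_k^{\mathsf{H}}(f)
```

Under these assumptions, the number of distinct directions $\mathbf{h}_k(f)$ observed at a frequency bin corresponds to the number of speakers, which is why clustering leading eigenvectors can serve as an NoS estimator.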