Multi-Channel Talker-Independent Speaker Separation Through Location-Based Training

Hassan Taherian, Deliang Wang, Ke Tan

doi:10.1109/taslp.2022.3202129

Abstract

Permutation ambiguity is a crucial issue for deep-learning-based talker-independent speaker separation. Deep clustering and permutation invariant training (PIT) have been widely used to address the permutation ambiguity problem in monaural scenarios. Although both approaches have been extended to multi-microphone scenarios, we believe that the permutation ambiguity problem can be naturally avoided by leveraging the spatial relations of multiple speakers. In this study, we present location-based training (LBT), a new approach to achieving talker independency in multi-channel speaker separation. Unlike PIT, which examines all possible permutations, LBT assigns speakers according to their positions in physical space. With a training complexity that is linear in the number of concurrent speakers, LBT is computationally much more efficient than PIT, whose complexity is factorial, particularly when a large number of overlapping speakers need to be separated. Specifically, we propose two training criteria, azimuth-based and distance-based training, which use speaker azimuths and speaker distances relative to a microphone array, respectively. Evaluation results show that LBT significantly outperforms PIT on two-speaker and three-speaker mixtures with different array geometries and in various acoustic conditions. In addition, we propose a joint training strategy that integrates azimuth-based and distance-based training and further improves separation performance.
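To make the contrast between the two assignment schemes concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a mean-squared-error loss on toy signal vectors, and it assumes that outputs are paired with targets sorted in ascending order of a location value (azimuth in degrees, or distance). The function names `mse`, `pit_loss`, and `location_based_loss`, the ascending-order convention, and the toy data are all hypothetical choices made for illustration.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(outputs, targets):
    """PIT-style loss: the minimum over all N! output-to-speaker assignments."""
    n = len(outputs)
    return min(
        sum(mse(outputs[i], targets[p[i]]) for i in range(n)) / n
        for p in permutations(range(n))
    )


def location_based_loss(outputs, targets, locations):
    """LBT-style loss: one fixed assignment, N pairwise loss terms.

    Targets are ordered by their location value (assumed here: ascending
    azimuth or ascending distance), and output i is paired with the i-th
    target in that order. No permutation search is performed.
    """
    n = len(outputs)
    order = sorted(range(n), key=lambda k: locations[k])
    return sum(mse(outputs[i], targets[order[i]]) for i in range(n)) / n


# Toy example: two speakers, speaker 0 at 120 degrees, speaker 1 at 30 degrees.
targets = [[1.0, 0.0], [0.0, 1.0]]
azimuths_deg = [120.0, 30.0]

# Outputs that follow the azimuth ordering (speaker 1 first, then speaker 0).
ordered_outputs = [[0.0, 1.0], [1.0, 0.0]]
# The same estimates emitted in the opposite order.
swapped_outputs = [[1.0, 0.0], [0.0, 1.0]]

loss_lbt_ordered = location_based_loss(ordered_outputs, targets, azimuths_deg)
loss_lbt_swapped = location_based_loss(swapped_outputs, targets, azimuths_deg)
loss_pit_swapped = pit_loss(swapped_outputs, targets)
```

In this sketch, PIT accepts either output order because it searches every permutation, whereas the location-based loss penalizes the swapped order and therefore ties each output to a fixed spatial rank. Passing distances instead of azimuths as `locations` gives the distance-based variant under the same assumptions.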

Full Text