Abstract

We address the problem of speaker counting and separation from a noisy, single-channel, multi-source recording. Most works in the literature assume mixtures containing two to five speakers. In this work, we consider noisy speech mixtures with one to five speakers as well as noise-only recordings. We propose a deep neural network (DNN) architecture that predicts a speaker count of zero for noise-only recordings and predicts the individual clean speaker signals and the speaker count for mixtures of one to five speakers. The DNN is composed of transformer layers and processes the recordings using a long-time and short-time sequence modeling approach to masking in a learned time-feature domain. The network uses an encoder-decoder attractor module with long short-term memory (LSTM) units to generate a variable number of outputs. The network is trained on simulated noisy speech mixtures composed of speech recordings from the WSJ0 corpus and noise recordings from the WHAM! corpus. We show that the network achieves 99% speaker counting accuracy and more than 19 dB improvement in scale-invariant signal-to-noise ratio for mixtures of up to three speakers.
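
To make the variable-output mechanism in the abstract concrete, the sketch below shows one common way an LSTM-based encoder-decoder attractor (EDA) can produce a data-dependent number of speaker attractors, with the count of "active" attractors serving as the speaker-count estimate. This is an illustrative sketch, not the authors' implementation; the module name, the dimensions (`feature_dim`, `max_speakers`), and the existence-threshold logic are assumptions.

```python
# Illustrative sketch of an LSTM encoder-decoder attractor (EDA) module.
# Not the paper's code; hyperparameters and the stopping rule are assumptions.
import torch
import torch.nn as nn


class EncoderDecoderAttractor(nn.Module):
    def __init__(self, feature_dim: int = 256, max_speakers: int = 5):
        super().__init__()
        self.encoder = nn.LSTM(feature_dim, feature_dim, batch_first=True)
        self.decoder = nn.LSTM(feature_dim, feature_dim, batch_first=True)
        # Existence probability for each generated attractor.
        self.exists = nn.Linear(feature_dim, 1)
        self.max_speakers = max_speakers

    def forward(self, features: torch.Tensor, threshold: float = 0.5):
        """features: (batch, time, feature_dim) sequence from the separator."""
        # Summarize the whole recording into the encoder's final state.
        _, state = self.encoder(features)
        batch = features.size(0)
        # Feed zero vectors step by step; each decoder step emits one attractor.
        zeros = features.new_zeros(batch, self.max_speakers + 1, features.size(-1))
        attractors, _ = self.decoder(zeros, state)
        # An attractor is "active" while its existence probability stays above
        # the threshold; the number of leading active attractors is the
        # estimated speaker count (zero for a noise-only recording).
        probs = torch.sigmoid(self.exists(attractors)).squeeze(-1)
        active = probs > threshold
        return attractors, probs, active


# Usage sketch: estimate the speaker count for a random feature sequence.
if __name__ == "__main__":
    eda = EncoderDecoderAttractor()
    feats = torch.randn(1, 200, 256)
    attractors, probs, active = eda(feats)
    # Count leading active attractors (stop at the first inactive one).
    count = int(active[0].long().cumprod(dim=0).sum().item())
    print("estimated speaker count:", count)
```

In a full system of the kind the abstract describes, the attractors would additionally condition the masking network so that one separated signal is produced per active attractor; that interaction is omitted here for brevity.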
