Abstract

We address the problem of speaker counting and separation from a noisy, single-channel, multi-source recording. Most works in the literature assume mixtures containing two to five speakers. In this work, we consider noisy speech mixtures with one to five speakers as well as noise-only recordings. We propose a deep neural network (DNN) architecture that predicts a speaker count of zero for noise-only recordings and predicts the individual clean speaker signals and the speaker count for mixtures of one to five speakers. The DNN is composed of transformer layers and processes the recordings using a long-time and short-time sequence modeling approach to masking in a learned time-feature domain. The network uses an encoder-decoder attractor module with long short-term memory (LSTM) units to generate a variable number of outputs. The network is trained on simulated noisy speech mixtures composed of speech recordings from the WSJ0 corpus and noise recordings from the WHAM! corpus. We show that the network achieves 99% speaker counting accuracy and more than 19 dB improvement in scale-invariant signal-to-noise ratio for mixtures of up to three speakers.
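
To make the variable-output mechanism in the abstract concrete, the sketch below shows one common way an LSTM-based encoder-decoder attractor (EDA) can produce a data-dependent number of speaker attractors, with the count of "active" attractors serving as the speaker-count estimate. This is an illustrative sketch, not the authors' implementation; the module name, the dimensions (`feature_dim`, `max_speakers`), and the existence-threshold logic are assumptions.

```python
# Illustrative sketch of an LSTM encoder-decoder attractor (EDA) module.
# Not the paper's code; hyperparameters and the stopping rule are assumptions.
import torch
import torch.nn as nn


class EncoderDecoderAttractor(nn.Module):
    def __init__(self, feature_dim: int = 256, max_speakers: int = 5):
        super().__init__()
        self.encoder = nn.LSTM(feature_dim, feature_dim, batch_first=True)
        self.decoder = nn.LSTM(feature_dim, feature_dim, batch_first=True)
        # Existence probability for each generated attractor.
        self.exists = nn.Linear(feature_dim, 1)
        self.max_speakers = max_speakers

    def forward(self, features: torch.Tensor, threshold: float = 0.5):
        """features: (batch, time, feature_dim) sequence from the separator."""
        # Summarize the whole recording into the encoder's final state.
        _, state = self.encoder(features)
        batch = features.size(0)
        # Feed zero vectors step by step; each decoder step emits one attractor.
        zeros = features.new_zeros(batch, self.max_speakers + 1, features.size(-1))
        attractors, _ = self.decoder(zeros, state)
        # An attractor is "active" while its existence probability stays above
        # the threshold; the number of leading active attractors is the
        # estimated speaker count (zero for a noise-only recording).
        probs = torch.sigmoid(self.exists(attractors)).squeeze(-1)
        active = probs > threshold
        return attractors, probs, active


# Usage sketch: estimate the speaker count for a random feature sequence.
if __name__ == "__main__":
    eda = EncoderDecoderAttractor()
    feats = torch.randn(1, 200, 256)
    attractors, probs, active = eda(feats)
    # Count leading active attractors (stop at the first inactive one).
    count = int(active[0].long().cumprod(dim=0).sum().item())
    print("estimated speaker count:", count)
```

In a full system of the kind the abstract describes, the attractors would additionally condition the masking network so that one separated signal is produced per active attractor; that interaction is omitted here for brevity.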
