Abstract

This paper proposes an autoregressive approach that harnesses deep learning for multi-speaker monaural speech separation. The approach exploits a causal temporal context in both the mixture and previously estimated separated signals, and performs online separation compatible with real-time applications. It adopts a learned listening-and-grouping architecture motivated by computational auditory scene analysis, with a grouping stage that effectively addresses the label permutation problem at both the frame and segment levels. Experimental results on the WSJ0-2mix benchmark show that the new approach achieves better signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) scores than most state-of-the-art methods in both closed-set and open-set evaluations, including methods that exploit whole-utterance statistics for separation, while requiring fewer model parameters.
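
To make the label permutation problem mentioned above concrete, the sketch below shows a generic permutation-invariant error computed over a segment of spectral frames: the loss is evaluated under every assignment of estimated sources to reference speakers and the lowest-error assignment is kept. This is a minimal illustration of the general idea, not the paper's grouping stage; the function name `pit_mse`, the MSE criterion, and the toy shapes are assumptions for illustration only.

```python
import itertools
import numpy as np

def pit_mse(estimates, references):
    """Permutation-invariant MSE between estimated and reference sources.

    estimates, references: arrays of shape (num_speakers, num_frames, num_bins).
    Returns the lowest MSE over all speaker assignments and the best permutation.
    """
    num_speakers = estimates.shape[0]
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(num_speakers)):
        # Reorder the estimated sources according to this candidate assignment.
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy usage: two speakers, 100 frames, 129 frequency bins (hypothetical sizes).
rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 100, 129))
ests = refs[::-1] + 0.01 * rng.standard_normal((2, 100, 129))  # estimates in swapped order
loss, perm = pit_mse(ests, refs)
print(loss, perm)  # small loss; best permutation is (1, 0)
```

Applying such a criterion per frame or per segment, as the abstract describes, lets the grouping stage assign outputs to speakers consistently without fixing the output order in advance.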
