Abstract

Speech separation systems typically operate in the time-frequency (T-F) domain, enhancing the magnitude response while leaving the phase response unaltered. Recent studies, however, suggest that phase is important for perceptual quality, which has led some researchers to enhance both the magnitude and phase spectra. Estimating the complex ideal ratio mask (cIRM) with a deep neural network (DNN) has been shown to be an effective way to improve speech separation. At the same time, the label ambiguity (or permutation) problem remains a major barrier to speaker-independent multi-talker source separation and calls for new solutions. In this paper, we address speaker-independent monaural source separation with a novel method called pcIRM, which combines cIRM estimation with utterance-level permutation invariant training (uPIT). Specifically, pcIRM is implemented with a deep bidirectional LSTM (Bi-LSTM) network and evaluated on the WSJ0-2mix dataset. We report separation results for the proposed method and compare them with those of existing state-of-the-art methods. Extensive experimental results demonstrate the advantages of pcIRM in terms of the signal-to-distortion ratio (SDR) metric.
