Abstract

Speech separation systems typically operate in the time-frequency (T-F) domain, enhancing the magnitude response while leaving the phase response unaltered. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to enhance both the magnitude and phase spectra. Combining complex ideal ratio mask (cIRM) estimation with deep neural network (DNN) training has proven to be an effective way to improve speech separation. Meanwhile, the label ambiguity (or permutation) problem remains a major barrier to speaker-independent multi-talker source separation and calls for new solutions. In this paper, we address speaker-independent monaural source separation with a novel method called pcIRM, which combines cIRM estimation with utterance-level permutation invariant training (uPIT). Specifically, pcIRM is implemented with a deep bidirectional LSTM (Bi-LSTM) recurrent neural network and evaluated on the WSJ0-2mix dataset. We report separation results for the proposed method and compare them with those of existing state-of-the-art methods. Extensive experimental results demonstrate the advantages of pcIRM in terms of the signal-to-distortion ratio (SDR) metric.
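
As an illustration of the two ingredients named above, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the standard cIRM definition (the complex mask M satisfying M * Y = S for mixture spectrum Y and source spectrum S) and a mean-squared-error loss on the masked spectra, minimized over output-to-speaker assignments for the whole utterance. The function names, the array shapes, and the choice of loss are assumptions made for illustration; the abstract does not specify them.

```python
import itertools
import numpy as np


def complex_ideal_ratio_mask(source_stft, mixture_stft, eps=1e-8):
    """Return the complex mask M with M * mixture ~= source (element-wise).

    Both inputs are complex arrays of shape (frames, bins).
    eps is a small constant (an assumption) that avoids division by zero.
    """
    y_r, y_i = mixture_stft.real, mixture_stft.imag
    s_r, s_i = source_stft.real, source_stft.imag
    denom = y_r ** 2 + y_i ** 2 + eps
    mask_real = (y_r * s_r + y_i * s_i) / denom
    mask_imag = (y_r * s_i - y_i * s_r) / denom
    return mask_real + 1j * mask_imag


def upit_loss(estimated_masks, mixture_stft, source_stfts):
    """Utterance-level permutation-invariant MSE (illustrative loss choice).

    estimated_masks, source_stfts: complex arrays, shape (speakers, frames, bins).
    mixture_stft: complex array, shape (frames, bins).
    Returns (loss, perm), where perm[s] is the output index assigned to speaker s.
    """
    num_speakers = source_stfts.shape[0]
    estimates = estimated_masks * mixture_stft  # broadcasts over the speaker axis
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        # One assignment is scored over the entire utterance, not per frame.
        loss = np.mean(np.abs(estimates[list(perm)] - source_stfts) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

In a trained system the masks would come from the network's outputs; here they can be replaced by oracle cIRMs in shuffled order to check that the loss recovers the correct assignment.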
