PcIRM: Complex Ideal Ratio Masking for Speaker-Independent Monaural Source Separation with Utterance Permutation Invariant Training

Wen Zhang, Xiaoyong Li, Aolong Zhou, Kaijun Ren, Junqiang Song

doi:10.1109/ijcnn48605.2020.9207440


https://doi.org/10.1109/ijcnn48605.2020.9207440


Abstract


Speech separation systems typically operate in the time-frequency (T-F) domain, enhancing the magnitude response while leaving the phase response unaltered. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to enhance both the magnitude and phase spectra. Estimating the complex ideal ratio mask (cIRM) with a deep neural network (DNN) has been shown to be an effective way to improve speech separation. At the same time, the label ambiguity (or permutation) problem remains a major barrier for speaker-independent multi-talker source separation and calls for new solutions. In this paper, to address speaker-independent monaural source separation, we propose a novel method called pcIRM, which combines cIRM estimation with utterance-level permutation invariant training (uPIT). Specifically, pcIRM is implemented with a deep bidirectional LSTM (Bi-LSTM) recurrent neural network and evaluated on the WSJ0-2mix dataset. We report separation results for the proposed method and compare them with those of existing state-of-the-art methods. Extensive experimental results demonstrate the advantages of the proposed pcIRM method in terms of the signal-to-distortion ratio (SDR) metric.
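
The two ingredients named in the abstract, the cIRM and utterance-level permutation selection, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `cirm` and `upit_loss`, the array shapes, and the choice of a mean squared error on complex spectra are assumptions made for illustration only, and the abstract does not specify the loss or any mask compression used in the paper.

```python
import itertools

import numpy as np


def cirm(mixture, source, eps=1e-8):
    """Complex ideal ratio mask M with M * mixture ~= source in each T-F bin.

    Hypothetical helper: computes M = S / Y as S * conj(Y) / |Y|^2,
    with a small eps (an assumption) to avoid division by zero.
    """
    yr, yi = mixture.real, mixture.imag
    sr, si = source.real, source.imag
    denom = yr ** 2 + yi ** 2 + eps
    mask_real = (yr * sr + yi * si) / denom
    mask_imag = (yr * si - yi * sr) / denom
    return mask_real + 1j * mask_imag


def upit_loss(est_masks, mixture, sources):
    """Utterance-level permutation invariant loss (illustrative MSE variant).

    est_masks: complex array, shape (n_src, T, F), one estimated mask per output
    mixture:   complex array, shape (T, F), STFT of the mixed signal
    sources:   complex array, shape (n_src, T, F), STFTs of the reference speakers

    The error is averaged over the WHOLE utterance for each output-to-speaker
    assignment, and the single best assignment is kept for all frames.
    Returns (loss, perm), where output i is matched to speaker perm[i].
    """
    n_src = sources.shape[0]
    estimates = est_masks * mixture  # broadcasts (n_src, T, F) * (T, F)
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n_src)):
        err = estimates - sources[list(perm)]
        loss = float(np.mean(np.abs(err) ** 2))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "speakers" as random complex spectrograms (T=5, F=4).
    srcs = rng.standard_normal((2, 5, 4)) + 1j * rng.standard_normal((2, 5, 4))
    mix = srcs.sum(axis=0)
    # Ideal masks, deliberately handed over in swapped speaker order.
    masks = np.stack([cirm(mix, srcs[1]), cirm(mix, srcs[0])])
    print(upit_loss(masks, mix, srcs))
```

Because the permutation is chosen once per utterance rather than per frame, each network output stays tied to one speaker for the whole utterance, which is what distinguishes the utterance-level variant from frame-level permutation invariant training.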

Full Text