Abstract

Recent advances in time-domain speech separation methods, particularly those that use attention mechanisms to model sequences, have significantly improved separation performance. In this paper, we address monaural (single-microphone) speaker separation, mainly for the case of two concurrent speakers. We propose a dual-path hybrid attention network (DPHA-Net) for monaural speech separation in the time domain. The critical component of DPHA-Net, the DPHA module, combines several attention mechanisms and is designed to capture both short-term and long-term contextual dependencies. The DPHA module consists of multi-head self-attention (MHSA), element-wise attention (EA), and adaptive feature fusion (AFF) units. We also propose an improved multi-stage aggregation training strategy, which proves very effective for audio separation in our experiments. Experiments on the benchmark WSJ0-2mix, WHAM!, and Libri2Mix datasets show that DPHA-Net achieves competitive performance. For two-speaker separation on the WSJ0-2mix dataset, DPHA-Net outperforms the state of the art by 0.3 dB absolute in SI-SNRi and 0.4 dB absolute in SDRi under the same conditions.
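
The SI-SNRi figure reported above is the improvement in scale-invariant signal-to-noise ratio (SI-SNR) of the separated signal over the unprocessed mixture. The following is a minimal sketch of how that metric is commonly computed; it follows the standard definition and is not code from the paper. The function names `si_snr` and `si_snr_improvement` and the `eps` stabiliser are illustrative assumptions.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Illustrative helper (not from the paper); eps guards against
    division by zero and log of zero.
    """
    n = len(target)
    # Remove the mean of each signal so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target: the "target" part of the estimate.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

For a two-speaker mixture, `si_snr_improvement` would typically be evaluated once per speaker (after resolving the output-to-speaker assignment) and averaged; a 0.3 dB gain then means the average of these per-utterance improvements is 0.3 dB higher than that of the comparison system.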
