Abstract

Recent advances in time-domain speech separation methods, particularly those that use attention mechanisms to model sequences, have significantly improved speech separation performance. In this paper, we address monaural (single-microphone) speaker separation, mainly in the case of two concurrent speakers. We propose a dual-path hybrid attention network (DPHA-Net) for monaural speech separation in the time domain. The critical component of DPHA-Net, the DPHA module, combines multiple attention mechanisms and is designed to capture both short-term and long-term contextual dependencies. The DPHA module consists of multi-head self-attention (MHSA), element-wise attention (EA), and adaptive feature fusion (AFF) units. We also propose an improved multi-stage aggregation training strategy, which proves very effective for speech separation in our experiments. Experimental results on the benchmark WSJ0-2mix, WHAM! and Libri2Mix datasets show that the proposed DPHA-Net achieves competitive performance. For two-speaker separation on the WSJ0-2mix dataset, DPHA-Net outperforms the state of the art by an absolute margin of 0.3 dB in SI-SNRi and 0.4 dB in SDRi under the same conditions.
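
The abstract reports gains in SI-SNRi without defining it. As a point of reference, the sketch below computes the scale-invariant signal-to-noise ratio and its improvement over the unprocessed mixture, following the definition commonly used in the time-domain separation literature. It is an illustrative assumption, not the authors' evaluation code: the function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are hypothetical and do not come from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean of each signal so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target: the projection is the "signal"
    # part, and whatever is left over is treated as noise.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    e_noise = [e - s for e, s in zip(est, s_target)]
    signal_power = sum(s * s for s in s_target)
    noise_power = sum(x * x for x in e_noise)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy example with synthetic tones standing in for two speakers.
target = [math.sin(0.1 * i) for i in range(200)]
interferer = [math.cos(0.37 * i) for i in range(200)]
mixture = [t + v for t, v in zip(target, interferer)]
estimate = [t + 0.1 * v for t, v in zip(target, interferer)]
improvement_db = si_snr_improvement(estimate, mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.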
