Abstract

Voice activity detection (VAD) is an essential first step in speech signal processing, and its quality strongly affects the latency and accuracy of the overall system. Although many methods have been proposed to improve VAD performance under low signal-to-noise ratios (SNRs), robustness to very low SNRs and unseen noisy environments still needs improvement. In this paper, we propose a model that fuses local attention and global attention to strengthen the attention mechanisms currently applied in VAD. First, building on self-attention, the local attention works together with long short-term memory networks (LSTMs) to make efficient use of local contextual information; then, the global attention weighs global contextual information to focus on the most relevant contextual frames. Experimental results show that, compared with state-of-the-art VAD methods, the proposed approach performs better at low SNRs such as −15 dB and under non-stationary noise conditions.
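
As a rough illustration of this kind of local/global attention fusion for frame-level VAD, the PyTorch sketch below restricts one self-attention branch to a sliding window over BiLSTM features (local context) and lets a second branch attend over all frames (global context) before a per-frame speech/non-speech classifier. The layer sizes, window length, masking scheme, and concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class LocalGlobalAttentionVAD(nn.Module):
    # Hypothetical sketch: exact layer sizes and fusion are assumptions.
    def __init__(self, feat_dim=64, hidden=128, heads=4, local_window=16):
        super().__init__()
        self.local_window = local_window
        # Local branch: BiLSTM over frames, then self-attention restricted to a
        # sliding window so each frame attends only to nearby context.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.local_attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # Global branch: self-attention over all contextual frames.
        self.global_attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # Fusion (simple concatenation) and per-frame classifier.
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        # x: (batch, frames, feat_dim), e.g. log-mel features
        h, _ = self.lstm(x)                        # (batch, frames, 2*hidden)
        T = h.size(1)
        # Band mask: True entries are blocked, so frame i attends only to
        # frames within +/- local_window.
        idx = torch.arange(T, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() > self.local_window
        local, _ = self.local_attn(h, h, h, attn_mask=band)
        glob, _ = self.global_attn(h, h, h)        # unrestricted global context
        fused = torch.cat([local, glob], dim=-1)
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # speech prob per frame

if __name__ == "__main__":
    model = LocalGlobalAttentionVAD()
    feats = torch.randn(2, 200, 64)                # 2 utterances, 200 frames each
    print(model(feats).shape)                      # torch.Size([2, 200])

Concatenating the two branches is only one simple fusion choice; the point of the sketch is that the masked branch uses nearby frames only, while the unmasked branch weighs the whole utterance.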
