Abstract

Deep-learning-based full-band speech enhancement methods have proliferated in recent years. To balance denoising performance against computational complexity, mainstream full-band approaches typically rely on compressed, perceptually motivated features with relatively low frequency resolution in the middle and high frequencies to recover the full-band spectrum, which limits the upper bound of speech quality. Recently, sub-band fusion-based approaches have been developed, in which the low-frequency and high-frequency bands are processed separately, thereby neglecting the full-band spectral pattern and cross-band dependencies. This paper proposes a dual-stage full- and sub-band integration network, dubbed FSI-Net, to simultaneously leverage the coarse-grained full-band spectral pattern and the fine-grained sub-band spectral details for full-band speech enhancement. Concretely, the first stage performs only coarse denoising on the compressed ERB-scaled spectrum to capture the global full-band spectral context while keeping the computational overhead low. In the second stage, because the spectral characteristics of speech differ across frequency bands, we devise two sub-networks that refine the low-frequency and high-frequency bands separately in the complex domain. To fully exploit cross-band guidance, a band-guided encoder provides external knowledge to the high-frequency bands. Extensive experiments show that the proposed method consistently outperforms state-of-the-art one-stage full-band and sub-band fusion baselines across various evaluation metrics.
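
As a rough illustration of the dual-stage structure described above, the following PyTorch sketch wires a coarse full-band stage to two complex-domain sub-band refiners, with a band-guided encoder feeding low-band context to the high-band path. All module names, hidden sizes, the band split point, and the ERB-to-linear expansion are illustrative assumptions; this is not the authors' FSI-Net implementation.

```python
# Minimal structural sketch of a dual-stage full-/sub-band enhancement
# pipeline. Module names, layer choices, and dimensions are placeholders.
import torch
import torch.nn as nn


class CoarseFullBandStage(nn.Module):
    """Stage 1: predict a coarse denoising gain over compressed ERB bands."""

    def __init__(self, n_erb=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_erb, hidden, batch_first=True)
        self.gain = nn.Sequential(nn.Linear(hidden, n_erb), nn.Sigmoid())

    def forward(self, erb_mag):                  # (B, T, n_erb)
        h, _ = self.rnn(erb_mag)
        return self.gain(h)                      # per-band gains in [0, 1]


class SubBandRefiner(nn.Module):
    """Stage 2: complex-domain residual refinement of one frequency band."""

    def __init__(self, n_freq, hidden=128, guidance_dim=0):
        super().__init__()
        in_dim = 2 * n_freq + guidance_dim       # real + imag (+ optional guidance)
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2 * n_freq)

    def forward(self, band_ri, guidance=None):   # band_ri: (B, T, 2 * n_freq)
        x = band_ri if guidance is None else torch.cat([band_ri, guidance], dim=-1)
        h, _ = self.rnn(x)
        return band_ri + self.out(h)             # residual real/imag estimate


class DualStageNet(nn.Module):
    """Coarse full-band gain first, then band-wise refinement; a band-guided
    encoder passes low-band context to the high-band refiner."""

    def __init__(self, n_bins=481, n_low=256, n_erb=64, guide_dim=64):
        super().__init__()
        self.n_low = n_low
        self.stage1 = CoarseFullBandStage(n_erb)
        # Placeholder ERB -> linear interpolation matrix (columns sum to 1);
        # a real system would derive this from the ERB filterbank.
        w = torch.rand(n_erb, n_bins)
        self.register_buffer("erb_to_lin", w / w.sum(dim=0, keepdim=True))
        self.guide = nn.GRU(2 * n_low, guide_dim, batch_first=True)
        self.low = SubBandRefiner(n_low)
        self.high = SubBandRefiner(n_bins - n_low, guidance_dim=guide_dim)

    def forward(self, erb_mag, spec):            # spec: (B, T, n_bins, 2) real/imag
        gain = self.stage1(erb_mag) @ self.erb_to_lin        # (B, T, n_bins)
        coarse = spec * gain.unsqueeze(-1)                   # stage-1 coarse estimate
        low = coarse[:, :, : self.n_low].flatten(2)          # (B, T, 2 * n_low)
        high = coarse[:, :, self.n_low :].flatten(2)
        g, _ = self.guide(low)                               # cross-band guidance
        return self.low(low), self.high(high, guidance=g)


if __name__ == "__main__":
    net = DualStageNet()
    erb = torch.rand(1, 100, 64)                 # compressed ERB magnitudes
    spec = torch.randn(1, 100, 481, 2)           # complex STFT as real/imag pairs
    low_out, high_out = net(erb, spec)
    print(low_out.shape, high_out.shape)         # (1, 100, 512), (1, 100, 450)
```

The sketch mirrors only the data flow implied by the abstract (coarse full-band gain, then band-wise complex refinement with cross-band guidance), not the concrete FSI-Net architecture.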
