Abstract

Deep neural network-based speech enhancement systems have achieved promising results. However, state-of-the-art (SOTA) models usually have too many parameters and are too computationally expensive to deploy on devices in practical applications. In this paper, we propose a novel lightweight complex spectral mask-based neural network with a two-stage pipeline for monaural speech enhancement. The network decouples the primary problem into several simpler sub-problems, which reduces both the computational burden and the number of model parameters. Specifically, the network contains two mask-based sub-networks, CoarseNet and FineNet, implemented in the complex domain to improve enhancement performance progressively. CoarseNet takes coarse-grained compact features as input and estimates the corresponding full-band complex mask. FineNet further removes residual noise in the low-frequency components of the CoarseNet output by predicting a fine-grained mask. The transforms between the coarse and fine scales are based on a novel learnable complex-valued rectangular bandwidth (LCRB) filter bank. Furthermore, we propose a lightweight and general complex-valued attention mechanism to improve the modeling capability of the network's convolutional encoder/decoder, and use cross-stage skip connections (CSSC) to facilitate information flow between the sub-networks. Extensive experiments on two standard corpora demonstrate that the proposed approach outperforms previous SOTA systems under various conditions while maintaining a relatively small model size and low computational complexity.
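
As a rough illustration of the two-stage masking idea only, the sketch below applies a full-band complex mask to a noisy spectrogram and then a second complex mask restricted to the lowest frequency bins. It is not the proposed network: the masks are supplied as plain arrays rather than predicted by CoarseNet and FineNet, the LCRB filter bank, attention mechanism, and CSSC are omitted, and the function names, the `(frequency, time)` array layout, and the `n_low_bins` parameter are assumptions made for this example.

```python
import numpy as np


def apply_complex_mask(spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Multiply a complex spectrogram by a complex mask, element-wise."""
    return spec * mask


def two_stage_enhance(
    noisy_spec: np.ndarray,   # complex array, shape (n_freq, n_frames)
    coarse_mask: np.ndarray,  # complex array, shape (n_freq, n_frames)
    fine_mask: np.ndarray,    # complex array, shape (n_low_bins, n_frames)
    n_low_bins: int,          # hypothetical cutoff: bins refined in stage 2
) -> np.ndarray:
    # Stage 1 (stand-in for CoarseNet): full-band complex masking.
    coarse_out = apply_complex_mask(noisy_spec, coarse_mask)

    # Stage 2 (stand-in for FineNet): refine only the low-frequency bins
    # of the stage-1 output; higher bins pass through unchanged.
    enhanced = coarse_out.copy()
    enhanced[:n_low_bins, :] = apply_complex_mask(
        coarse_out[:n_low_bins, :], fine_mask
    )
    return enhanced
```

Because both masks are complex-valued, each stage can adjust magnitude and phase together, which a real-valued magnitude mask cannot do.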
