Multi-scale informative perceptual network for monaural speech enhancement

Tian Lan, Jiajia Li, Yujia Feng, Wenxin Tai, Yixiang Wang, Cong Chen, Jun Kang, Qiao Liu

doi:10.1016/j.apacoust.2022.108787

Abstract

In speech enhancement, contextual information is important for accurate speech spectrum estimation. Conventional convolution layers are generally used to mine implicit correlations within a local neighborhood. However, fixed convolutions cannot capture non-local information well, such as correlations between the pitch and its overtones, or full-band noise. To better capture dependencies along the temporal and frequency dimensions, we introduce a multi-scale informative perceptual network (MIPNet) for monaural speech enhancement, whose feature extraction incorporates both localized patterns and global correlations. MIPNet is based on an encoder-decoder composed of multi-scale perceptual modules (MPMs) that extract local patterns. Each MPM has two branches, built from dilated convolution and stacked fully convolutional layers. The MPM is designed to be sensitive to long-term context so that it can detect multi-scale adjacent information, which helps to rectify informative features and improves the efficiency and accuracy of feature coding. In addition, non-local modules are applied as bottleneck layers to obtain a global information flow. By combining MPMs and non-local modules, the proposed network aggregates multi-scale contextual information, which allows it to model implicit acoustic features and eliminate noise components. On the Voice Bank + DEMAND dataset, MIPNet obtains a 14.34% improvement in SSNR, reflecting its superiority in noise suppression. Experimental results on WSJ0 and TIMIT demonstrate that the proposed model, with few parameters, exhibits strong robustness and good performance in terms of objective speech intelligibility and quality under various noise conditions.
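The contrast the abstract draws between local (dilated) convolution and non-local aggregation can be illustrated with a small sketch. This is not MIPNet's implementation: it uses plain Python lists, no learned weights, and a simplified softmax-weighted non-local operation. The function names `dilated_conv1d`, `receptive_field`, and `non_local` are hypothetical and chosen only for this example.

```python
import math


def dilated_conv1d(x, kernel, dilation=1):
    """Zero-padded, same-length 1-D dilated convolution (cross-correlation).

    Assumes an odd kernel length so the kernel is centred on each position.
    """
    k = len(kernel)
    out = []
    for i in range(len(x)):
        acc = 0.0
        for t, w in enumerate(kernel):
            # Taps are spaced `dilation` samples apart around position i.
            j = i + (t - k // 2) * dilation
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def non_local(features):
    """Simplified non-local operation without learned projections.

    Each output vector is a softmax-weighted sum over ALL positions:
    y_i = sum_j softmax_j(f_i . f_j) * f_j
    """
    out = []
    for fi in features:
        scores = [sum(a * b for a, b in zip(fi, fj)) for fj in features]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        dim = len(fi)
        out.append(
            [sum(w * fj[d] for w, fj in zip(weights, features)) / z
             for d in range(dim)]
        )
    return out


# A unit impulse shows which positions a dilated kernel reaches.
impulse = [0, 0, 0, 1, 0, 0, 0]
local_response = dilated_conv1d(impulse, [1, 1, 1], dilation=2)

# Three stacked 3-tap layers with dilations 1, 2, 4 cover 15 positions.
stacked_rf = receptive_field(3, [1, 2, 4])

# Non-local aggregation mixes every position with every other position.
global_response = non_local([[1.0, 2.0], [1.0, 2.0]])
```

In this sketch, the dilated convolution only reaches positions at fixed offsets from the centre, and stacking layers grows the receptive field gradually, whereas the non-local operation relates every position to every other position in a single step. That is the kind of dependency (for example, between a pitch and its distant overtones) that the abstract says fixed convolutions capture poorly.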

Full Text