Abstract

For speech enhancement, contextual information is important for accurate estimation of the speech spectrum. Conventional convolution layers are generally used to mine implicit correlations within a local neighborhood. However, fixed convolutions cannot capture non-local information well, such as the correlations between the pitch and its overtones or full-band noise. To better capture dependencies along the temporal and frequency dimensions, we introduce a multi-scale informative perceptual network (MIPNet) that extracts features by combining localized patterns with global correlations for monaural speech enhancement. MIPNet is an encoder-decoder network composed of multi-scale perceptual modules (MPMs), which extract local patterns through two branches built from dilated convolutions and stacked fully convolutional layers. The MPM is designed to be sensitive to long-term context so that it can detect adjacent information at multiple scales, which helps to rectify informative features and improves the efficiency and accuracy of feature coding. In addition, non-local modules are applied as bottleneck layers to obtain a global information flow. By incorporating MPMs and non-local modules, the proposed network aggregates multi-scale contextual information, allowing it to model implicit acoustic features and suppress noise components. On the Voice Bank + DEMAND dataset, MIPNet obtains a 14.34% improvement in SSNR, reflecting its strength in noise suppression. Experimental results on WSJ0 and TIMIT demonstrate that the proposed model, with relatively few parameters, exhibits strong robustness and good performance in terms of objective speech intelligibility and quality under various noise conditions.
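
As a rough illustration of why dilated convolutions help with multi-scale context, the sketch below computes the receptive field of a stack of stride-1 1-D convolutions. The function name, the kernel size of 3, and the dilation rates are illustrative assumptions, not values taken from MIPNet.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames or bins) of stacked stride-1 1-D convolutions.

    Each layer with dilation d widens the receptive field by
    (kernel_size - 1) * d. All values here are hypothetical examples.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Four plain layers versus four layers with exponentially growing dilation.
plain = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8])
print(plain, dilated)
```

With the same number of layers and weights, the dilated stack covers a much wider span, though still a bounded one. A non-local (attention-style) bottleneck, by contrast, relates every position to every other position, which is the kind of global correlation the abstract attributes to the non-local modules.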
