Abstract

The U-shaped Network (UNet) has shown excellent performance in a variety of speech enhancement tasks. However, because of the intrinsic limitations of the convolution operation, a traditional UNet built with convolutional neural networks (CNNs) cannot learn global and long-term information well. In this work, we propose a new Swin-UNet-based speech enhancement method. Unlike the traditional UNet model, ours replaces all CNN blocks with Swin-Transformer blocks to exploit more multi-scale contextual information. The Swin-UNet model employs a shifted-window mechanism, which not only avoids the high computational complexity of the standard Transformer but also enhances global information interaction by exploiting the Transformer's strong global modeling capability. Through hierarchical Swin-Transformer blocks, global and local speech features can be fully leveraged to improve speech reconstruction. Experimental results confirm that the proposed method removes more background noise while maintaining good objective speech quality.
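To make the shifted-window idea concrete, the sketch below shows windowed self-attention over a spectrogram-like tensor, with a cyclic shift between consecutive blocks so that information can cross window boundaries. This is a minimal illustration, not the authors' implementation: the window size, embedding dimension, head count, and input shape are assumptions, and the relative position bias and shifted-window attention mask of the full Swin-Transformer block are omitted for brevity.

```python
# Minimal sketch of shifted-window self-attention (Swin-style), illustrative only.
import torch
import torch.nn as nn

def window_partition(x, ws):
    # (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # Inverse of window_partition.
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class ShiftedWindowAttention(nn.Module):
    # Hyperparameters (dim, heads, ws) are illustrative assumptions.
    def __init__(self, dim=96, heads=3, ws=8, shift=0):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm(x)
        if self.shift:                          # cyclic shift lets windows exchange information
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)      # attention restricted to local windows
        win, _ = self.attn(win, win, win)
        x = window_reverse(win, self.ws, H, W)
        if self.shift:                          # undo the shift
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        return shortcut + x                     # residual connection

# Two consecutive blocks: regular windows, then shifted windows.
spec = torch.randn(1, 64, 64, 96)               # (batch, time, freq, channels), assumed shape
out = ShiftedWindowAttention(shift=4)(ShiftedWindowAttention(shift=0)(spec))
print(out.shape)                                 # torch.Size([1, 64, 64, 96])
```

Because attention is computed only within fixed-size windows, the cost grows linearly with the number of time-frequency bins rather than quadratically, which is the complexity advantage the abstract refers to.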
