Abstract
The U-shaped Network (UNet) has shown excellent performance in a variety of speech enhancement tasks. However, because of the intrinsic locality of the convolution operation, a traditional UNet built with convolutional neural networks (CNNs) cannot learn global and long-term information well. In this work, we propose a new Swin-UNet-based speech enhancement method. Unlike the traditional UNet model, the CNN blocks are all replaced with Swin-Transformer blocks to explore more multi-scale contextual information. The Swin-UNet model employs a shifted-window mechanism, which not only overcomes the high computational complexity of the standard Transformer but also enhances global information interaction by exploiting the Transformer's strong global modeling capability. Through hierarchical Swin-Transformer blocks, global and local speech features can be fully leveraged to improve speech reconstruction. Experimental results confirm that the proposed method eliminates more background noise while maintaining good objective speech quality.
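To make the architecture described above concrete, the following is a minimal PyTorch sketch of the general idea: a U-shaped encoder/decoder over a magnitude spectrogram in which the usual CNN blocks are replaced by window-attention (Swin-style) blocks with an optional cyclic shift. It is not the authors' implementation; the class names (`SwinBlock`, `SwinUNet`), window size, embedding dimensions, patch-merging/expanding layers, and the mask-based output are illustrative assumptions, and the cyclic shift omits the attention mask of the full Swin-Transformer for brevity.

```python
# Minimal sketch of a Swin-UNet-style speech enhancer (assumed design, not the paper's code).
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(win, ws, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


def patch_merge(x):
    """Concatenate each 2x2 neighbourhood along channels: (B, H, W, C) -> (B, H/2, W/2, 4C)."""
    return torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                      x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)


class SwinBlock(nn.Module):
    """Window-based multi-head self-attention + MLP, optionally with a cyclic shift (simplified, no mask)."""

    def __init__(self, dim, heads=4, ws=8, shift=False):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                    # x: (B, H, W, C)
        B, H, W, C = x.shape
        s = self.ws // 2 if self.shift else 0
        h = torch.roll(x, (-s, -s), dims=(1, 2)) if s else x  # cyclic shift for cross-window interaction
        win = window_partition(self.norm1(h), self.ws)        # attention restricted to each local window
        attn, _ = self.attn(win, win, win)
        h = window_reverse(attn, self.ws, H, W)
        h = torch.roll(h, (s, s), dims=(1, 2)) if s else h    # undo the shift
        x = x + h                                             # residual connection
        return x + self.mlp(self.norm2(x))


class SwinUNet(nn.Module):
    """U-shaped encoder/decoder built from Swin blocks instead of CNN blocks (illustrative two-level version)."""

    def __init__(self, dim=32, ws=8):
        super().__init__()
        self.embed = nn.Linear(1, dim)                        # 1 input channel: noisy magnitude spectrogram
        self.enc1 = nn.Sequential(SwinBlock(dim, ws=ws), SwinBlock(dim, ws=ws, shift=True))
        self.down = nn.Linear(4 * dim, 2 * dim)               # patch merging (2x2 patches -> 1)
        self.enc2 = nn.Sequential(SwinBlock(2 * dim, ws=ws), SwinBlock(2 * dim, ws=ws, shift=True))
        self.up = nn.Linear(2 * dim, 4 * dim)                 # patch expanding (1 -> 2x2 patches)
        self.skip = nn.Linear(2 * dim, dim)                   # fuse decoder features with the encoder skip
        self.dec1 = nn.Sequential(SwinBlock(dim, ws=ws), SwinBlock(dim, ws=ws, shift=True))
        self.head = nn.Linear(dim, 1)                         # predict a time-frequency mask

    def forward(self, spec):                                  # spec: (B, 1, H, W), H and W divisible by 2*ws
        x = spec.permute(0, 2, 3, 1)                          # channels-last for the window operations
        e1 = self.enc1(self.embed(x))                         # (B, H, W, dim)
        e2 = self.enc2(self.down(patch_merge(e1)))            # (B, H/2, W/2, 2*dim) bottleneck
        B, h, w, _ = e2.shape
        u = self.up(e2).view(B, h, w, 2, 2, -1)
        u = u.permute(0, 1, 3, 2, 4, 5).reshape(B, 2 * h, 2 * w, -1)   # upsample back to (B, H, W, dim)
        d = self.dec1(self.skip(torch.cat([u, e1], dim=-1)))  # UNet skip connection, then Swin decoding
        mask = torch.sigmoid(self.head(d)).permute(0, 3, 1, 2)
        return mask * spec                                    # enhanced magnitude = mask * noisy magnitude


if __name__ == "__main__":
    net = SwinUNet()
    noisy = torch.randn(2, 1, 64, 64).abs()                   # toy magnitude spectrograms
    print(net(noisy).shape)                                    # torch.Size([2, 1, 64, 64])
```

Masking the noisy magnitude is one common enhancement formulation; a direct-mapping output (predicting the clean spectrogram itself) would only change the final `head` and the return line.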