Abstract

Waveform-based speech enhancement solutions have become increasingly popular in recent years. They commonly adopt a fully convolutional network (FCN) architecture that relies on convolutions to capture discriminative features at multiple scales. We found that FCNs with standard convolutions tend to overfit the training data, likely due to the large number of trainable parameters introduced by channel integration operations. Handling short speech frames also poses a challenge for FCNs, as the observable context is often limited, resulting in boundary discontinuities in the concatenated outputs. In this work, we propose remedies for these practical issues. With the Wave-U-Net as the baseline model, we replace the standard convolutions with depthwise and depthwise separable convolutions to compress the FCN models. With the reduced model complexity, these replacements lead to significantly improved network efficiency and generalization. To address the short-frame issue, we propose connecting the depthwise FCNs with a recurrent neural network (RNN), allowing temporal information to propagate along the networks across individual frames. Our FCN + RNN model demonstrates an excellent smoothing effect on short frames, enabling speech enhancement systems with very short delays. The effectiveness of the proposed models is validated with experiments on the AzBio sentences and VCTK datasets.
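
For concreteness, the sketch below illustrates the parameter savings behind the compression step. It is written in PyTorch, which the abstract does not name, and the channel counts and kernel size are illustrative rather than taken from the paper: a depthwise separable convolution factorizes a standard convolution into a per-channel depthwise convolution followed by a 1x1 pointwise convolution that mixes channels.

    # Minimal sketch (PyTorch assumed; channel counts and kernel size are
    # illustrative, not the paper's actual configuration).
    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv1d(nn.Module):
        """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
        def __init__(self, in_ch, out_ch, kernel_size, padding=0):
            super().__init__()
            # groups=in_ch means each filter sees exactly one input channel (depthwise)
            self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                       padding=padding, groups=in_ch)
            # 1x1 convolution integrates information across channels (pointwise)
            self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    in_ch, out_ch, k = 64, 128, 15
    standard = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)
    separable = DepthwiseSeparableConv1d(in_ch, out_ch, k, padding=k // 2)

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(standard))   # 64*128*15 + 128            = 123,008 parameters
    print(count(separable))  # (64*15 + 64) + (64*128 + 128) = 9,344 parameters

Under these illustrative settings, the separable block uses roughly 13x fewer parameters than the standard convolution, which is the kind of reduction the abstract credits for the improved efficiency and generalization. The RNN connection would then carry hidden state across successive short frames processed by such compressed FCNs to smooth frame boundaries; its exact placement within the Wave-U-Net is not specified here.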
