Abstract

In recent years, with the application of deep learning to speech synthesis, waveform generation models based on generative adversarial networks have achieved quality comparable to natural speech. In most waveform generators, a neural upsampling unit plays an essential role, as it upsamples acoustic features to the sample-point level. However, aliasing artifacts are observed in the generated speech regardless of whether transposed convolution, subpixel convolution, or nearest-neighbor interpolation is used as the temporal upsampling layer. According to the Shannon-Nyquist sampling theorem, non-ideal upsampling filters produce aliasing. This paper systematically analyzes how aliasing artifacts arise in waveform generators built on non-ideal upsampling. We investigate the generation processes of HiFi-GAN and VITS and find that high-frequency spectral details are generated from low-frequency structures through nonlinear transformations. However, these nonlinear transformations cannot completely remove the low-frequency spectral imprint, which eventually manifests as spectral artifacts in the generated waveforms. Applying a low-pass filter after the upsampling layer suppresses the aliasing artifacts, but it results in significant performance drops. The experimental results also show that aliasing speeds up training by filling high-frequency vacancies. We therefore propose mixing high-frequency components into the low-pass filtered features, which allows models to converge faster while naturally avoiding artifacts. In addition, to assess the efficacy of our method, we develop an artifact-detection algorithm based on structural similarity.
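
The effect of a non-ideal upsampling filter can be illustrated on a toy signal. The following is a minimal NumPy sketch, not the HiFi-GAN or VITS pipeline analyzed in the paper: it upsamples a single sine tone by nearest-neighbor repetition, measures how much spectral energy lands above the original Nyquist frequency, and then applies a windowed-sinc low-pass filter. The sampling rates, tone frequency, filter length, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbor upsampling: repeat every sample `factor` times."""
    return np.repeat(x, factor)

def windowed_sinc_lowpass(cutoff_hz, fs, num_taps=255):
    """Hamming-windowed sinc FIR low-pass filter, normalized to unity DC gain."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff_hz / fs * n) * np.hamming(num_taps)
    return h / h.sum()

def band_energy_ratio(x, fs, f_lo):
    """Fraction of spectral energy at or above f_lo (Hann-windowed FFT)."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return spec[freqs >= f_lo].sum() / spec.sum()

# Illustrative values (assumptions): an 8 kHz signal upsampled 4x to 32 kHz.
fs_low, factor, f0 = 8000, 4, 1000
fs_high = fs_low * factor

t = np.arange(fs_low) / fs_low            # one second of signal
x_low = np.sin(2 * np.pi * f0 * t)        # 1 kHz tone, below the 4 kHz Nyquist limit

# Non-ideal upsampling leaves spectral images around multiples of fs_low.
x_up = nearest_upsample(x_low, factor)

# Low-pass filtering at the original Nyquist frequency suppresses those images.
h = windowed_sinc_lowpass(fs_low / 2, fs_high)
x_filt = np.convolve(x_up, h, mode="same")

ratio_before = band_energy_ratio(x_up, fs_high, fs_low / 2)
ratio_after = band_energy_ratio(x_filt, fs_high, fs_low / 2)
print(f"energy above {fs_low // 2} Hz before filtering: {ratio_before:.4f}")
print(f"energy above {fs_low // 2} Hz after filtering:  {ratio_after:.2e}")
```

In this toy setting the energy above the original Nyquist frequency consists only of copies of the low-frequency tone, which loosely mirrors the "low-frequency spectral imprint" described above. The filtered signal carries almost no high-frequency content, which corresponds to the "high-frequency vacancies" the abstract mentions.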
