In recent years, with the application of deep learning to speech synthesis, waveform generation models based on generative adversarial networks have achieved quality comparable to natural speech. In most waveform generators, a neural upsampling unit plays an essential role, as it upsamples acoustic features to the sample-point level. However, aliasing artifacts are observed in the generated speech regardless of whether transposed convolution, subpixel convolution, or nearest-neighbor interpolation is used as the upsampling layer. According to the Shannon-Nyquist sampling theorem, non-ideal upsampling filters produce aliasing. This paper aims to systematically analyze how aliasing artifacts arise in waveform generators built on non-ideal upsampling. We investigate the generation processes of HiFi-GAN and VITS and find that high-frequency spectral details are generated from low-frequency structures through nonlinear transformations. However, the nonlinear transformations cannot completely remove the low-frequency spectral imprint, which eventually manifests as spectral artifacts in the generated waveforms. Applying a low-pass filter after the upsampling layer suppresses these artifacts, but it also causes a significant performance drop. The experimental results further show that aliasing speeds up training by filling high-frequency vacancies. Accordingly, we propose mixing high-frequency components into the low-pass-filtered features, allowing models to converge faster while naturally avoiding artifacts. In addition, to assess the efficacy of our method, we devise an artifact-detection algorithm based on structural similarity.
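The idea of low-pass filtering the upsampled features and then mixing high-frequency content back in can be illustrated with a minimal sketch. The code below is not the paper's implementation: the nearest-neighbor upsampling, the windowed-sinc filter design, the mixing ratio `alpha`, and the choice of the unfiltered residual as the high-frequency source are all illustrative assumptions.

```python
# Hypothetical sketch, not the paper's method: upsample a feature stream,
# low-pass filter it to suppress aliasing images, then mix a small amount
# of high-frequency content back in so the high band is not left empty.
import torch
import torch.nn.functional as F


def sinc_lowpass_kernel(cutoff: float, taps: int = 63) -> torch.Tensor:
    """Windowed-sinc low-pass kernel; `cutoff` is a fraction of Nyquist (0..1)."""
    n = torch.arange(taps) - (taps - 1) / 2
    h = cutoff * torch.sinc(cutoff * n)              # ideal low-pass impulse response
    h = h * torch.hann_window(taps, periodic=False)  # Hann window to reduce ripple
    return (h / h.sum()).view(1, 1, -1)


def upsample_lpf_mix(x: torch.Tensor, factor: int, alpha: float = 0.1) -> torch.Tensor:
    """x: (batch, 1, time). Upsample by `factor`, low-pass filter, mix high band back."""
    # Naive nearest-neighbor upsampling (one of the layers analyzed above).
    up = x.repeat_interleave(factor, dim=-1)
    # Low-pass at the original Nyquist frequency to remove aliasing images.
    kernel = sinc_lowpass_kernel(cutoff=1.0 / factor).to(up.dtype)
    low = F.conv1d(up, kernel, padding=kernel.shape[-1] // 2)
    # Mix a fraction of the removed high-frequency residual back in
    # (the actual source of the high-frequency components is an assumption here).
    high = up - low
    return low + alpha * high
```

In this sketch, `alpha = 0` corresponds to pure low-pass filtering (artifact-free but slower to converge), while `alpha = 1` recovers the unfiltered upsampled signal with its aliasing intact; intermediate values trade off the two regimes described in the abstract.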