Abstract

Voice interfaces have become increasingly popular in recent years, and speech synthesis technology plays a pivotal role in their functionality. In practical applications, however, speech synthesis is susceptible to noise interference, which can degrade the quality of the synthesized speech. This paper investigates the noise robustness of the Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS) model, which has shown promising results in speech synthesis tasks. Experiments were conducted on six different texts, and the synthesized speech was evaluated with three metrics: Mean Opinion Score (MOS), Disfluency Prediction (DIS), and Colorfulness Prediction (COL). The experiments comprised a control group and six experimental groups covering two types of noise, Additive White Gaussian Noise (AWGN) and real-world noise, at three different signal-to-noise ratios (SNRs). The results showed that both types of noise significantly reduce the MOS of the synthesized speech, with the decrease becoming more severe at lower SNRs. In terms of DIS and COL, the VITS model performed better under real-world noise than under AWGN, especially at lower SNRs. Moreover, even at an SNR of 3, the VITS model still generated intelligible speech, demonstrating its high noise robustness. These findings have important implications for the design of robust speech synthesis models for noisy environments. Future work may explore more advanced noise-robust models or investigate the application of such models in practical voice interfaces.
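The abstract does not describe how the noise conditions were constructed. As a point of reference only, the sketch below shows the standard way to mix a noise signal into a clean waveform at a target SNR (in dB), which is how AWGN and real-world noise conditions like those above are typically produced. The function name `add_noise_at_snr` and the 16 kHz placeholder waveform are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio (in dB).

    Hypothetical helper for illustration; not from the paper.
    """
    # Tile or trim the noise so it matches the speech length
    # (needed for recorded real-world noise clips).
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# AWGN condition: the "noise" is simply standard Gaussian samples.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # placeholder for a 1 s waveform at 16 kHz
awgn = rng.standard_normal(len(clean))
noisy = add_noise_at_snr(clean, awgn, snr_db=3.0)
```

For a real-world noise condition, the same helper would be called with a recorded noise clip in place of the Gaussian samples; only the `noise` argument changes, so the two conditions differ solely in noise type, as in the experimental design described above.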
