Abstract

In recent years, voice conversion (VC) has made great strides with the application of deep learning. Many approaches to the many-to-many VC task require parallel training data and multiple generative adversarial networks. One successful model for VC is the star generative adversarial network (StarGAN-VC), which requires no parallel training data and uses a single generator and discriminator for all pairs of speaker domains. However, training a high-quality StarGAN is not an easy task: using a single generator and discriminator with mini-batch statistics to perform feature matching leaves an insurmountable gap between the converted distribution and the real distribution. In this paper, we propose a feature quantization module plugged into the discriminator of StarGAN-VC2, which quantizes continuous features into a discrete embedding space to mitigate this feature matching problem and improve the quality of the converted speech. Experiments show that the proposed model improves the MCD scores of baseline methods, which confirms its effectiveness.
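To make the feature quantization idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation; the class name, codebook size, and feature dimension are illustrative assumptions). It shows the standard mechanism such modules use: each continuous discriminator feature is snapped to its nearest entry in a learned discrete codebook, with a straight-through estimator so gradients still flow back to the continuous features.

```python
# Minimal sketch of discriminator feature quantization (assumption: PyTorch;
# hyperparameters are illustrative, not taken from the paper).
import torch
import torch.nn as nn


class FeatureQuantizer(nn.Module):
    def __init__(self, num_codes=256, code_dim=64):
        super().__init__()
        # Learned dictionary of discrete embeddings (the codebook).
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, features):
        # features: (batch, code_dim) continuous discriminator features.
        # L2 distance from each feature to every codebook entry.
        distances = torch.cdist(features, self.codebook.weight)  # (batch, num_codes)
        indices = distances.argmin(dim=1)                        # nearest code per feature
        quantized = self.codebook(indices)                       # (batch, code_dim)
        # Straight-through estimator: forward pass outputs the quantized
        # values; backward pass copies gradients to the continuous features.
        return features + (quantized - features).detach()
```

Because every batch is represented by entries drawn from the same finite codebook, the discriminator's feature statistics become comparable across mini-batches, which is how quantization addresses the feature matching gap the abstract describes.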
