Abstract

One-shot voice conversion, which converts the timbre of an utterance from a source speaker to a target speaker even when the speakers are unseen in the training data, has attracted growing research attention because of its low data-collection cost. In our previous work, a two-level nested U-structure, called U2-VC, was developed for one-shot voice conversion. It was shown that the multi-scale features extracted by the U2-Net structure are effective in improving the naturalness of the converted speech. However, the speech converted by U2-VC still suffers from source-speaker timbre leakage, because instance normalization alone is used for disentanglement. To solve this problem, vector quantization is introduced to disentangle the content and speaker-identity features from the extracted multi-scale features. In addition, instead of taking a single segment from one utterance as both the source and the target, two non-overlapping segments cut from the same utterance are used as the source and the target during training. Both objective and subjective evaluations show that the proposed method effectively improves the quality of the converted speech compared with the original U2-VC and other state-of-the-art baselines.
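The paper itself does not provide code, but a minimal PyTorch-style sketch may help illustrate the two changes the abstract describes: a vector-quantization bottleneck on the content path and the sampling of two non-overlapping segments from one utterance. The names (VectorQuantizer, sample_source_target) and all hyperparameters (codebook size, commitment weight, segment length) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck (VQ-VAE style): each content frame is snapped to
    its nearest codebook entry, discarding residual speaker timbre that
    instance normalization alone may leave in the content features."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z):  # z: (batch, frames, code_dim)
        # Squared L2 distance from every frame to every codebook entry.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(-1)            # nearest code index per frame
        z_q = self.codebook(idx)      # quantized content features
        # Codebook loss pulls codes toward the encoder output;
        # commitment loss keeps the encoder close to its chosen codes.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: gradients bypass the quantization step.
        z_q = z + (z_q - z).detach()
        return z_q, loss


def sample_source_target(mel, seg_len: int):
    """Cut two non-overlapping segments from one utterance, so the model
    cannot simply copy source frames to reconstruct the target."""
    n = mel.size(0)  # number of frames
    assert n >= 2 * seg_len, "utterance too short for two segments"
    s1 = torch.randint(0, n - 2 * seg_len + 1, (1,)).item()
    s2 = torch.randint(s1 + seg_len, n - seg_len + 1, (1,)).item()
    return mel[s1:s1 + seg_len], mel[s2:s2 + seg_len]
```

The intent behind both pieces: a finite codebook acts as an information bottleneck that removes fine-grained timbre cues from the content stream, while drawing source and target from disjoint frames of the same utterance blocks the shortcut of identity reconstruction during training.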
