Abstract

One-shot voice conversion, which converts the timbre of an utterance from a source speaker to a target speaker even when the speakers are unseen in the training data, has attracted growing research attention because of its low data-collection cost. In our previous work, a two-level nested U-structure, called U2-VC, was developed for one-shot voice conversion. It was shown that the multi-scale features extracted by the U2-Net structure are effective in improving the naturalness of the converted speech. However, the speech converted by U2-VC still suffers from source-speaker timbre leakage, because instance normalization alone is used for disentanglement. To solve this problem, vector quantization is introduced to disentangle the content and speaker-identity features from the extracted multi-scale features. In addition, instead of taking a single segment from one utterance as both the source and the target, two non-overlapping segments cut from the same utterance are used as the source and the target during training. Both objective and subjective evaluations show that the proposed method effectively improves the quality of the converted speech compared with the original U2-VC and other state-of-the-art baselines.
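The paper itself does not provide code, but a minimal PyTorch-style sketch may help illustrate the two changes the abstract describes: a vector-quantization bottleneck on the content path and the sampling of two non-overlapping segments from one utterance. The names (VectorQuantizer, sample_source_target) and all hyperparameters (codebook size, commitment weight, segment length) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck (VQ-VAE style): each content frame is snapped to
    its nearest codebook entry, discarding residual speaker timbre that
    instance normalization alone may leave in the content features."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (assumed value)

    def forward(self, z):  # z: (batch, frames, code_dim)
        # Squared L2 distance from every frame to every codebook entry.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(-1)            # nearest code index per frame
        z_q = self.codebook(idx)      # quantized content features
        # Codebook loss pulls codes toward the encoder output;
        # commitment loss keeps the encoder close to its chosen codes.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: gradients bypass the quantization step.
        z_q = z + (z_q - z).detach()
        return z_q, loss


def sample_source_target(mel, seg_len: int):
    """Cut two non-overlapping segments from one utterance, so the model
    cannot simply copy source frames to reconstruct the target."""
    n = mel.size(0)  # number of frames
    assert n >= 2 * seg_len, "utterance too short for two segments"
    s1 = torch.randint(0, n - 2 * seg_len + 1, (1,)).item()
    s2 = torch.randint(s1 + seg_len, n - seg_len + 1, (1,)).item()
    return mel[s1:s1 + seg_len], mel[s2:s2 + seg_len]
```

The intent behind both pieces: a finite codebook acts as an information bottleneck that removes fine-grained timbre cues from the content stream, while drawing source and target from disjoint frames of the same utterance blocks the shortcut of identity reconstruction during training.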
