Abstract

Voice cloning aims to synthesize speech with a new speaker's timbre from a small amount of that speaker's speech. Current voice cloning methods, which focus on modeling speaker timbre, can synthesize speech with a similar timbre, but the prosody of the resulting speech is flat: it lacks expressiveness, and the expressiveness of the cloned speech cannot be controlled. To solve this problem, we propose ZSE-VITS (zero-shot expressive VITS), a novel method built on the end-to-end speech synthesis model VITS. Specifically, we use VITS as the backbone network and add the speaker recognition model TitaNet as the speaker encoder to achieve zero-shot voice cloning. We use explicit prosody information to avoid interference from speaker information, and we adjust speech prosody directly through prosody prediction and prosody fusion. We widen the pitch distribution of the training data with pitch augmentation to improve the generalization of the prosody model, and we fine-tune the prosody predictor alone on an emotion corpus so that it learns to predict prosody in various styles. Objective and subjective evaluations on open datasets show that our method generates more expressive speech and allows prosody to be adjusted manually without affecting the similarity of speaker timbre.
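The pitch-augmentation step described above can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it assumes librosa for duration-preserving pitch shifting, and the shift range, function name, and file layout are hypothetical choices made for illustration. It shows how random pitch shifts could widen the pitch distribution of a training corpus.

```python
# Minimal sketch of pitch augmentation for widening the pitch distribution
# of a training corpus. Illustrative only, not the paper's code; the shift
# range (max_semitones) and file naming are assumptions.
# Requires: librosa, soundfile.
import random

import librosa
import soundfile as sf

def augment_pitch(in_path: str, out_path: str, max_semitones: float = 4.0) -> None:
    """Load a waveform, shift its pitch by a random amount, and save the result."""
    y, sr = librosa.load(in_path, sr=None)  # keep the original sample rate
    n_steps = random.uniform(-max_semitones, max_semitones)
    # pitch_shift changes pitch without changing duration, so alignments
    # between audio and transcripts are preserved.
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_path, y_shifted, sr)

# Example usage: produce one pitch-shifted copy per utterance in a file list.
# for path in utterance_paths:
#     augment_pitch(path, path.replace(".wav", "_aug.wav"))
```

A semitone-based shift is used here because it keeps the augmentation perceptually uniform across speakers with different base pitches; the actual range and sampling scheme in the paper may differ.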
