Abstract

Voice cloning aims to synthesize speech with a new speaker's timbre from a small amount of that speaker's speech. Current voice cloning methods, which focus on modeling speaker timbre, can synthesize speech with a similar timbre, but the prosody of the resulting speech is flat: it lacks expressiveness, and the expressiveness of the cloned speech cannot be controlled. To solve this problem, we propose ZSE-VITS (zero-shot expressive VITS), a novel method built on the end-to-end speech synthesis model VITS. Specifically, we use VITS as the backbone network and add the speaker recognition model TitaNet as the speaker encoder to achieve zero-shot voice cloning. We use explicit prosody information to avoid interference from speaker information, and we adjust speech prosody directly through prosody prediction and prosody fusion. We widen the pitch distribution of the training data with pitch augmentation to improve the generalization of the prosody model, and we fine-tune the prosody predictor alone on an emotion corpus so that it learns to predict prosody in various styles. Objective and subjective evaluations on open datasets show that our method generates more expressive speech and allows prosody to be adjusted manually without affecting the similarity of speaker timbre.
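The pitch-augmentation step described above can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it assumes librosa for duration-preserving pitch shifting, and the shift range, function name, and file layout are hypothetical choices made for illustration. It shows how random pitch shifts could widen the pitch distribution of a training corpus.

```python
# Minimal sketch of pitch augmentation for widening the pitch distribution
# of a training corpus. Illustrative only, not the paper's code; the shift
# range (max_semitones) and file naming are assumptions.
# Requires: librosa, soundfile.
import random

import librosa
import soundfile as sf

def augment_pitch(in_path: str, out_path: str, max_semitones: float = 4.0) -> None:
    """Load a waveform, shift its pitch by a random amount, and save the result."""
    y, sr = librosa.load(in_path, sr=None)  # keep the original sample rate
    n_steps = random.uniform(-max_semitones, max_semitones)
    # pitch_shift changes pitch without changing duration, so alignments
    # between audio and transcripts are preserved.
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_path, y_shifted, sr)

# Example usage: produce one pitch-shifted copy per utterance in a file list.
# for path in utterance_paths:
#     augment_pitch(path, path.replace(".wav", "_aug.wav"))
```

A semitone-based shift is used here because it keeps the augmentation perceptually uniform across speakers with different base pitches; the actual range and sampling scheme in the paper may differ.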
