Abstract

This paper proposes VStyclone, a novel Chinese speech cloning model that operates in three stages: multi-speaker training, target speaker encoding, and target speaker synthesis. We design an efficient tone extractor that reallocates computational resources across the sequences of log-mel spectrogram frames obtained from multiple speakers' speech, allowing the network to weight different speakers' features differently. This lets the network focus on the voice characteristics of the target speaker and extract the target features accurately. To cluster utterances of the same speaker while dispersing those of different speakers, we build an optimal softmax loss to optimize the model. We then develop a style synthesizer that adopts a transformer architecture in place of a recurrent neural network, so the model can both process text in parallel and better handle long-range contextual information. A style extraction module embedded in the style synthesizer dynamically captures style ranges in an unsupervised manner. In addition, VStyclone uses a generative adversarial network as the base generation model of its vocoder to improve generation speed, synthesizing speech 1.2 times faster than real time on a CPU, and the full model achieves state-of-the-art results.
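The abstract does not specify the exact form of the speaker-clustering softmax loss. As a rough, non-authoritative illustration of the idea (pulling same-speaker embeddings together and pushing different speakers apart via a softmax over speaker centroids), the sketch below follows the well-known GE2E-style formulation; the function name, tensor shapes, and the use of PyTorch are all assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a softmax-based speaker-clustering loss.
# emb is assumed to hold speaker-encoder outputs arranged as
# (N speakers, M utterances per speaker, D-dim embeddings).
import torch
import torch.nn.functional as F

def speaker_softmax_loss(emb: torch.Tensor) -> torch.Tensor:
    N, M, D = emb.shape
    emb = F.normalize(emb, dim=-1)                    # unit-length embeddings
    centroids = F.normalize(emb.mean(dim=1), dim=-1)  # (N, D) speaker centroids
    # Cosine similarity of every utterance to every speaker centroid: (N, M, N)
    sim = torch.einsum('nmd,kd->nmk', emb, centroids)
    # Softmax cross-entropy over centroids: each utterance should be closest
    # to its own speaker's centroid, clustering same-speaker voices and
    # separating different speakers.
    labels = torch.arange(N).unsqueeze(1).expand(N, M).reshape(-1)
    return F.cross_entropy(sim.reshape(N * M, N), labels)

# Example: 4 speakers x 5 utterances each, 256-dim embeddings
loss = speaker_softmax_loss(torch.randn(4, 5, 256))
```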
