A Voice Cloning Method Based on the Improved HiFi-GAN Model

Zeyu Qiu, Jun Tang, Jiaxin Li, Xishan Bai, Yaxin Zhang

doi:10.1155/2022/6707304

Abstract

Voice cloning provides a personalized Text to Speech (TTS) service: it adapts a source TTS model to synthesize a target speaker's voice from only a few speech samples. Although a Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector do not capture the voice information of the entire utterance, which reduces the similarity of the cloned voice. In addition, WaveNet as a vocoder sacrifices speech generation speed. To balance model parameters, inference speed, and voice quality, this paper proposes a voice cloning method based on an improved HiFi-GAN. (1) To improve the feature representation ability of the speaker encoder, the x-vector is used as the embedding vector that characterizes the target speaker. (2) To improve the performance of the HiFi-GAN vocoder, the input Mel spectrum is processed by a competitive multiscale convolution strategy. (3) One-dimensional depth-wise separable convolutions replace all standard one-dimensional convolutions, significantly reducing the model parameters and increasing the inference speed. The improved HiFi-GAN model reduces the number of vocoder parameters by about 68.58%, and inference speed on the GPU and CPU increases by 11.84% and 30.99%, respectively. Voice quality also improves marginally: MOS increases by 0.13 and PESQ by 0.11. The improved HiFi-GAN model exhibits outstanding performance and remarkable compatibility in the voice cloning task. Combined with the x-vector embedding, the proposed model achieves the highest score among all models on all test sets.
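To make point (3) concrete, the following minimal sketch counts the parameters of a single standard one-dimensional convolution and of its depth-wise separable replacement (a per-channel depth-wise filter followed by a 1x1 point-wise convolution). The function names and the layer shape (256 channels, kernel size 7) are assumptions chosen for illustration only; they are not taken from the paper, and a single-layer saving should not be expected to match the 68.58% reduction reported for the whole vocoder.

```python
def conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a standard 1-D convolution."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a depth-wise conv followed by a 1x1 point-wise conv."""
    # Depth-wise stage: one kernel per input channel, no channel mixing.
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    # Point-wise stage: a kernel-size-1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def reduction_percent(c_in, c_out, kernel_size):
    """Relative parameter saving of the separable layer, in percent."""
    standard = conv1d_params(c_in, c_out, kernel_size)
    separable = depthwise_separable_conv1d_params(c_in, c_out, kernel_size)
    return 100.0 * (standard - separable) / standard


if __name__ == "__main__":
    # Hypothetical layer shape, not a value from the paper.
    c_in, c_out, k = 256, 256, 7
    print("standard :", conv1d_params(c_in, c_out, k))
    print("separable:", depthwise_separable_conv1d_params(c_in, c_out, k))
    print("saving % :", round(reduction_percent(c_in, c_out, k), 2))
```

The saving grows with the kernel size and the channel count, because the standard layer scales with the product of input channels, output channels, and kernel size, while the separable layer scales with their sum-like combination (input channels times kernel size, plus input channels times output channels).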

Journal: Computational Intelligence and Neuroscience | Publication Date: Oct 11, 2022
Citations: 3 | License type: CC BY 4.0

Similar Papers

Voice Cloning Using Transfer Learning with Audio Samples
Usman Nawaz ... Ammara Tariq
UMT Artificial Intelligence Review | VOL. 3
20 Dec 2023

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
Yihan Wu ... Tao Qin
18 Sep 2022

Comparison of Voice Cloning Algorithms in Zero-shot and Few-shot Scenarios
Olga Hovhannisyan ... Artur Malajyan
Proceedings of the Institute for System Programming of the RAS | VOL. 36
01 Jan 2024
