MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning

Yi Ting Chen, Buzhou Tang, Wanting Li

doi:10.1145/3649501

Abstract

Voice cloning in text-to-speech (TTS) is the process of replicating the voice of a target speaker from limited data. Among the various voice cloning techniques, this article focuses on zero-shot voice cloning. Although existing TTS models can generate high-quality speech for seen speakers, cloning the voice of an unseen speaker remains challenging. The key step in zero-shot voice cloning is obtaining a speaker embedding for the target speaker. Previous works have used a speaker encoder to obtain a fixed-size speaker embedding from a single reference audio in an unsupervised manner, but they suffer from insufficient speaker information and from leakage of content information into the speaker embedding. To address these issues, this article proposes MRMI-TTS, a FastSpeech2-based framework that uses the speaker embedding as a conditioning variable to provide speaker information. MRMI-TTS extracts a speaker embedding and a content embedding from multiple reference audios using a speaker encoder and a content encoder. To obtain sufficient speaker information, the reference audios are selected based on sentence similarity. The model applies mutual information minimization to the two embeddings to remove the information entangled between them. Experiments on the public English dataset VCTK show that the method improves synthesized speech in terms of both similarity and naturalness, even for unseen speakers. Compared with state-of-the-art methods that learn reference embeddings, the method achieves the best performance on the zero-shot voice cloning task. Furthermore, the authors demonstrate that the proposed method is better at preserving the speaker embedding across different languages. Sample outputs are available on the demo page.
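The abstract states that the reference audios are selected based on sentence similarity but does not say how that similarity is computed. The sketch below is one possible reading, not the authors' implementation: it assumes similarity is measured between the text to be synthesized and the transcripts of candidate reference audios, and it substitutes a simple bag-of-words cosine similarity for whatever sentence representation the paper actually uses. The names `bow_cosine` and `select_references`, the file names, and the value of `k` are all hypothetical.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words counts of two sentences.

    Assumption: a stand-in for the paper's sentence similarity measure.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def select_references(target_text, candidates, k=3):
    """Return the k (transcript, audio_path) pairs most similar to target_text."""
    ranked = sorted(
        candidates, key=lambda c: bow_cosine(target_text, c[0]), reverse=True
    )
    return ranked[:k]


# Hypothetical candidate pool for one target speaker.
candidates = [
    ("please call stella", "ref_001.wav"),
    ("the weather is nice today", "ref_002.wav"),
    ("ask her to bring these things", "ref_003.wav"),
    ("call her today please", "ref_004.wav"),
]
selected = select_references("please call her today", candidates, k=2)
for transcript, path in selected:
    print(path, transcript)
```

In a setup like this, the selected audios would then be passed together to the speaker encoder, so that the speaker embedding is computed from several utterances rather than a single one.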

More From: ACM Transactions on Asian and Low-Resource Language Information Processing

Similar Papers

Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder
Hongqiang Du ... Lei Xie
30 Aug 2021

Zero-Shot Voice Cloning Using Variational Embedding with Attention Mechanism
Jaeuk Lee ... Jiye Kim
17 Nov 2021

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
Ju-Chieh Chou ... Hung-Yi Lee
02 Sep 2018

Voice Cloning Applied to Voice Disorders: a Study of Extreme Phonetic Content in Speaker Embeddings
Lily Wadoux ... Nelly Barbot
Proceedings of the Canadian Conference on Artificial Intelligence
27 May 2022
