Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

Zhiyong Chen, Zhiqi Ai, Youxuan Ma, Xinnuo Li, Shugong Xu

doi:10.1186/s13636-024-00351-9

Abstract

In the era of advanced text-to-speech (TTS) systems capable of generating high-fidelity, human-like speech from a reference utterance, voice cloning (VC), or zero-shot TTS (ZS-TTS), stands out as an important subtask. A primary challenge in VC is maintaining speech quality and speaker similarity when only limited reference data is available for a specific speaker. However, existing VC systems often rely on naive combinations of embedded speaker vectors for speaker control, which compromises the capture of speaking style, voice print, and semantic accuracy. To overcome this, we introduce the Two-branch Speaker Control Module (TSCM), a novel and highly adaptable voice cloning module designed to precisely process speaker or style control for a target speaker. Our method fuses local-level features from a gated convolutional network (GCN) with utterance-level features from a gated recurrent unit (GRU) to enhance speaker control. We demonstrate the effectiveness of TSCM by integrating it into advanced TTS architectures such as FastSpeech 2 and VITS, significantly improving their performance. Experimental results show that TSCM enables accurate voice cloning for a target speaker with minimal data through either zero-shot adaptation or few-shot fine-tuning of pretrained TTS models. Furthermore, our TSCM-based VITS (TSCM-VITS) outperforms existing state-of-the-art VC systems in zero-shot scenarios, even with basic dataset configurations. Our method’s superiority is validated through comprehensive subjective and objective evaluations. A demonstration of our system is available at https://great-research.github.io/tsct-tts-demo/, providing practical insights into its application and effectiveness.
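
The two-branch idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' TSCM implementation: the abstract does not specify layer sizes, the fusion operator, or the training procedure, so the function names (`gated_conv_branch`, `gru_branch`, `two_branch_embedding`), the random weights, the mean pooling, and fusion by concatenation are all assumptions made for illustration only.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    # Untrained random weights; a real module would learn these.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def gated_conv_branch(frames, W_a, W_b, kernel=3):
    """Local branch (assumed form): 1-D gated convolution, a * sigmoid(b),
    over a sliding window of frames, then mean pooling over time."""
    dim = len(frames[0])
    pad = kernel // 2
    zeros = [0.0] * dim
    padded = [zeros] * pad + frames + [zeros] * pad
    outputs = []
    for t in range(len(frames)):
        # Flatten the `kernel` frames of the window starting at t into one vector.
        window = [v for frame in padded[t:t + kernel] for v in frame]
        a = matvec(W_a, window)   # content path
        b = matvec(W_b, window)   # gate path
        outputs.append([ai * sigmoid(bi) for ai, bi in zip(a, b)])
    n = len(outputs)
    return [sum(col) / n for col in zip(*outputs)]

def gru_branch(frames, p):
    """Utterance branch: run a GRU cell over all frames and return the
    final hidden state as an utterance-level summary."""
    h = [0.0] * len(p["Uz"])
    for x in frames:
        z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
        # One common GRU convention: z weights the new candidate state.
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]
    return h

def two_branch_embedding(frames, hidden=4, kernel=3, seed=0):
    """Fuse both branches by concatenation (an assumption, not the paper's operator)."""
    rng = random.Random(seed)
    dim = len(frames[0])
    W_a = rand_matrix(hidden, kernel * dim, rng)
    W_b = rand_matrix(hidden, kernel * dim, rng)
    # W* matrices act on the input frame, U* matrices on the hidden state.
    p = {name: rand_matrix(hidden, dim if name[0] == "W" else hidden, rng)
         for name in ("Wz", "Wr", "Wh", "Uz", "Ur", "Uh")}
    local = gated_conv_branch(frames, W_a, W_b, kernel)
    utterance = gru_branch(frames, p)
    return local + utterance

if __name__ == "__main__":
    # Five toy "acoustic frames" of dimension 3 standing in for reference speech.
    reference = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2],
                 [0.3, 0.1, 0.0], [0.2, 0.2, 0.2]]
    print(len(two_branch_embedding(reference)))
```

In this sketch the resulting vector has a fixed length regardless of how many reference frames are supplied, which is the property a speaker-control vector would need before conditioning a TTS backbone such as FastSpeech 2 or VITS.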

Journal: EURASIP Journal on Audio, Speech, and Music Processing
Publication Date: May 28, 2024
License type: CC BY 4.0

Similar Papers

Neural Fusion for Voice Cloning
Bo Chen ... Chenpeng Du
IEEE/ACM Transactions on Audio, Speech, and Language Processing | VOL. 30
01 Jan 2021

Dian: Duration Informed Auto-Regressive Network for Voice Cloning
Wei Song ... Youzheng Wu
06 Jun 2021

ZSE-VITS: A Zero-Shot Expressive Voice Cloning Method Based on VITS
Jiaxin Li ... Lianhai Zhang
Electronics | VOL. 12
06 Feb 2023

Enhanced Air Quality Prediction through Spatio-temporal Feature Extraction and Fusion: A Self-tuning Hybrid Approach with GCN and GRU
Bao Liu ... Lei Gao
Water, Air, & Soil Pollution | VOL. 235
17 Jul 2024
