Cross-lingual multi-speaker speech synthesis with limited bilingual training data

Zexin Cai, Yaogen Yang, Ming Li

doi:10.1016/j.csl.2022.101427

Abstract

Modeling voices for multiple speakers and multiple languages with a single speech synthesis system has long been a challenge, especially in low-resource cases. This paper presents two approaches to cross-lingual multi-speaker text-to-speech (TTS) and code-switching synthesis under two training scenarios: (1) cross-lingual synthesis with sufficient data per speaker, and (2) cross-lingual synthesis with limited data per speaker. Accordingly, a novel TTS model and a non-autoregressive multi-speaker voice conversion model are proposed. The TTS model, designed for the sufficient-data scenario, has a Tacotron-based structure that uses shared phonemic representations associated with numeric language ID codes. For the data-limited scenario, we adopt a framework that cascades several speech modules. In particular, we propose a non-autoregressive many-to-many voice conversion module to address multi-speaker synthesis when data is insufficient. Experimental results on speaker similarity show that the proposed voice conversion module preserves voice characteristics well in data-limited cases. Both approaches use limited bilingual data and perform well in cross-lingual synthesis, delivering fluent foreign-language speech and even code-switching speech for monolingual speakers.
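As a way to picture the "shared phonemic representations associated with numeric language ID codes" mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The toy phoneme lists, the `LANG_IDS` values, and the function names `build_shared_inventory` and `encode` are all assumptions made for illustration. It shows one plausible input encoding: phonemes from both languages share a single symbol table, and every token carries the numeric ID of the language it belongs to, so a code-switching utterance is just a sequence whose language ID changes partway through.

```python
# Hypothetical sketch: inventories, ID values, and names are illustrative only.
from typing import Dict, List, Tuple

# Assumed numeric language ID codes (not taken from the paper).
LANG_IDS: Dict[str, int] = {"en": 0, "zh": 1}


def build_shared_inventory(*phoneme_sets: List[str]) -> Dict[str, int]:
    """Merge per-language phoneme lists into one shared symbol table."""
    shared = sorted(set().union(*phoneme_sets))
    return {ph: idx for idx, ph in enumerate(shared)}


def encode(
    segments: List[Tuple[str, List[str]]], inventory: Dict[str, int]
) -> List[Tuple[int, int]]:
    """Turn (language, phonemes) segments into (phoneme_id, language_id) pairs."""
    pairs = []
    for lang, phonemes in segments:
        lang_id = LANG_IDS[lang]
        for ph in phonemes:
            pairs.append((inventory[ph], lang_id))
    return pairs


# Toy phoneme lists; "a" and "n" appear in both and therefore share one ID each.
english = ["a", "n", "th"]
mandarin = ["a", "n", "zh"]
inventory = build_shared_inventory(english, mandarin)

# A code-switching utterance: an English segment followed by a Mandarin segment.
utterance = [("en", ["th", "a", "n"]), ("zh", ["zh", "a", "n"])]
encoded = encode(utterance, inventory)
```

In a Tacotron-style model, each pair would typically be looked up in a phoneme embedding table and a language embedding table before the encoder; how the paper combines the two is not stated in the abstract, so that step is left out here.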

Journal: Computer Speech & Language
Publication Date: Jul 14, 2022
Citations: 6

