Abstract

This paper proposes a deep learning-based Mandarin-Tibetan cross-lingual speech synthesis method that realizes both Mandarin and Tibetan speech synthesis within a unified framework. Because a Tibetan training corpus is hard to record, we train the acoustic models with a large-scale Mandarin multi-speaker corpus and a small-scale Tibetan single-speaker corpus. The acoustic models are trained with a deep neural network (DNN), a hybrid long short-term memory (LSTM) network, and a hybrid bi-directional long short-term memory (BLSTM) network. We further extend our Chinese text analyzer with a Tibetan text analyzer that generates context-dependent labels from input Chinese or Tibetan sentences. The Tibetan text analyzer includes text normalization, a novel Tibetan word segmentation method that combines a BLSTM with a conditional random field, prosodic boundary prediction, and grapheme-to-phoneme conversion. We select the initials and finals of both Mandarin and Tibetan as the speech synthesis units and train a speaker-independent mixed-language average voice model (AVM) with the DNN, hybrid LSTM, and hybrid BLSTM on the mixed Mandarin and Tibetan corpus. Speaker adaptation is then applied to the AVM to train speaker-dependent DNN, hybrid LSTM, or hybrid BLSTM models of Mandarin or Tibetan with a small target-speaker corpus. Finally, we synthesize Mandarin or Tibetan speech through the speaker-dependent Mandarin or Tibetan models. The experiments show that the hybrid BLSTM-based cross-lingual speech synthesis framework outperforms the other two cross-lingual frameworks and the Tibetan monolingual framework. Mixing in the Tibetan training corpus does not affect the voice quality of the synthesized Mandarin speech. Furthermore, the hybrid BLSTM-based cross-lingual framework needs only 60% of the training corpus to synthesize a voice similar to that of the Tibetan monolingual framework. Therefore, the proposed method can be used for speech synthesis of low-resource languages by borrowing the corpus of a high-resource language.

Highlights

  • In recent years, cross-lingual speech synthesis has been a popular topic in text-to-speech synthesis (TTS) research [1], [2]

  • We further developed a Tibetan text analyzer, based on the Mandarin text analyzer we had already built, for Mandarin-Tibetan cross-lingual speech synthesis

  • The acoustic model of the hybrid bi-directional long short-term memory (BLSTM)-based TTS framework is better than other frameworks

Summary

INTRODUCTION

Cross-lingual speech synthesis has been a popular topic in text-to-speech synthesis (TTS) research [1], [2]. Although DNN-based speaker adaptation [22], [23] has been adopted in various neural network-based TTS applications [20], [21], the adaptation method has not been applied to Mandarin-Tibetan cross-lingual speech synthesis. To address this limitation, we adopt a deep learning-based framework to realize Mandarin-Tibetan cross-lingual speech synthesis. We use the initials and finals of Mandarin and Tibetan as the speech synthesis units to train a speaker-independent mixed-language average voice model (AVM) with a DNN, a hybrid LSTM, and a hybrid BLSTM on a large Mandarin multi-speaker corpus and a small Tibetan single-speaker corpus. LSTM can effectively capture long-term temporal dependencies and has been widely used in natural language processing and in acoustic modeling for speech synthesis. The WORLD [26] vocoder is used to generate the speech waveforms from the acoustic parameters.
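
To make the choice of initials and finals as synthesis units concrete, the following is a minimal Python sketch of how a toneless Mandarin pinyin syllable can be split into an initial and a final. It is an illustration only and not the authors' text analyzer: the names `INITIALS`, `split_syllable`, and `to_units` are hypothetical, the inventory is the standard set of 21 pinyin initials, and treating `y` and `w` as part of the final (rather than as initials) is a simplifying assumption. The unit inventory actually used in the paper, including the Tibetan units, is not given in this summary.

```python
from typing import List, Tuple

# Standard pinyin initials (assumed inventory, not taken from the paper).
# Two-letter initials come first so that "zh" is matched before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(pinyin: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if pinyin.startswith(initial):
            return initial, pinyin[len(initial):]
    # Zero-initial syllable such as "an": the whole syllable is the final.
    return "", pinyin


def to_units(syllables: List[str]) -> List[str]:
    """Flatten a syllable sequence into a sequence of initial/final units."""
    units = []
    for syllable in syllables:
        initial, final = split_syllable(syllable)
        if initial:
            units.append(initial)
        units.append(final)
    return units


if __name__ == "__main__":
    print(to_units(["pu", "tong", "hua"]))
```

In a mixed-language system of the kind described above, unit sequences like this one (together with the corresponding Tibetan units) would be the labels on which the shared average voice model is trained.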

TIBETAN TEXT ANALYSIS
PROSODIC BOUNDARY PREDICTION
TIBETAN GRAPHEME-TO-PHONEME CONVERSION
DISCUSSION
Findings
CONCLUSION AND FUTURE WORK
