Abstract
This paper proposes a deep learning-based Mandarin-Tibetan cross-lingual speech synthesis method that realizes both Mandarin and Tibetan speech synthesis under a unified framework. Because a Tibetan training corpus is hard to record, we train the acoustic models with a large-scale Mandarin multi-speaker corpus and a small-scale Tibetan single-speaker corpus. The acoustic models are trained with a deep neural network (DNN), a hybrid long short-term memory (LSTM) network, and a hybrid bi-directional long short-term memory (BLSTM) network. We also extend our Chinese text analyzer with a Tibetan text analyzer that generates context-dependent labels from input Chinese or Tibetan sentences. The Tibetan text analyzer includes text normalization, a novel Tibetan word segmentation method that combines a BLSTM with a conditional random field, prosodic boundary prediction, and grapheme-to-phoneme conversion. We select the initials and finals of both Mandarin and Tibetan as the speech synthesis units and train a speaker-independent mixed-language average voice model (AVM) with the DNN, hybrid LSTM, and hybrid BLSTM on the mixed Mandarin and Tibetan corpus. Speaker adaptation is then applied to the AVM with a small target-speaker corpus to train speaker-dependent DNN, hybrid LSTM, or hybrid BLSTM models of Mandarin or Tibetan. Finally, we synthesize Mandarin or Tibetan speech through the speaker-dependent Mandarin or Tibetan models. The experiments show that the hybrid BLSTM-based cross-lingual speech synthesis framework is better than the other two cross-lingual frameworks and the Tibetan monolingual framework. Mixing in the Tibetan training corpus does not affect the voice quality of the synthesized Mandarin speech. Furthermore, the hybrid BLSTM-based cross-lingual framework needs only 60% of the training corpus to synthesize a voice similar to that of the Tibetan monolingual framework. Therefore, the proposed method can be used for speech synthesis of low-resource languages by borrowing the corpus of a high-resource language.
Highlights
In recent years, cross-lingual speech synthesis has been a popular topic in text-to-speech synthesis (TTS) research [1], [2].
Tibetan text analysis: we further developed a Tibetan text analyzer, based on the Mandarin text analyzer we had already built, for Mandarin-Tibetan cross-lingual speech synthesis.
The acoustic model of the hybrid bi-directional long short-term memory (BLSTM)-based TTS framework is better than those of the other frameworks.
Summary
Cross-lingual speech synthesis has been a popular topic in text-to-speech synthesis (TTS) research [1], [2]. DNN-based speaker adaptation [22], [23] has been adopted in various neural network-based TTS applications [20], [21], but the adaptation method has not been applied to Mandarin-Tibetan cross-lingual speech synthesis. To address these limitations, we adopt a deep learning-based framework to realize Mandarin-Tibetan cross-lingual speech synthesis. We use the initials and finals of Mandarin and Tibetan as the speech synthesis units to train a speaker-independent mixed-language average voice model (AVM) with DNN, hybrid LSTM, and hybrid BLSTM from a large Mandarin multi-speaker corpus and a small Tibetan single-speaker corpus. LSTM can effectively capture long-term temporal dependencies; it has been widely used in natural language processing and acoustic modeling for speech synthesis. The WORLD [26] vocoder is used to generate the speech waveforms from the acoustic parameters.
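To make the choice of synthesis units concrete, the following is a minimal sketch of splitting toneless Mandarin pinyin syllables into initials and finals. It is an illustration only, not the paper's text analyzer: the function names `split_syllable` and `to_units` are hypothetical, the initial inventory is the standard Hanyu Pinyin one rather than the unit set used in the paper, `y` and `w` are left inside the final as a simplification, and Tibetan is not handled.

```python
from typing import List, Tuple

# Standard Hanyu Pinyin initials (an assumption; not the paper's unit inventory).
# Two-letter initials come first so that "zh" is matched before "z".
INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
)


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final).

    Syllables with no listed initial (e.g. "an") return an empty initial.
    """
    s = syllable.lower()
    for initial in INITIALS:
        if s.startswith(initial):
            return initial, s[len(initial):]
    return "", s


def to_units(syllables: List[str]) -> List[str]:
    """Flatten a syllable sequence into a sequence of initial/final units."""
    units: List[str] = []
    for syllable in syllables:
        initial, final = split_syllable(syllable)
        if initial:
            units.append(initial)
        units.append(final)
    return units


if __name__ == "__main__":
    print(split_syllable("zhong"))   # ('zh', 'ong')
    print(to_units(["xi", "zang"]))  # ['x', 'i', 'z', 'ang']
```

In a full system, each unit would additionally carry contextual information (neighbouring units, tone, prosodic boundaries) to form the context-dependent labels described above; that step is omitted here.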