Tibetan Speech Synthesis Research Articles

The paper proposes a meta-learning-based Mandarin-Tibetan cross-lingual text-to-speech (TTS) to realize both Mandarin and Tibetan speech synthesis under a unique framework. First, we build two kinds of Tacotron2-based Mandarin-Tibetan cross-lingual baseline TTS. One is a shared encoder Mandarin-Tibetan cross-lingual TTS, and another is a separate encoder Mandarin-Tibetan cross-lingual TTS. Both baseline TTS use the speaker classifier with a gradient reversal layer to disentangle speaker-specific information from the text encoder. At the same time, we design a prosody generator to extract prosodic information from sentences to explore syntactic and semantic information adequately. To further improve the synthesized speech quality of the Tacotron2-based Mandarin-Tibetan cross-lingual TTS, we propose a meta-learning-based Mandarin-Tibetan cross-lingual TTS. Based on the separate encoder Mandarin-Tibetan cross-lingual TTS, we use an additional dynamic network to predict the parameters of the language-dependent text encoder that could realize better cross-lingual knowledge sharing in the sequence-to-sequence TTS. Lastly, we synthesize Mandarin or Tibetan speech through the unique acoustic model. The baseline experimental results show that the separate encoder Mandarin-Tibetan cross-lingual TTS could handle the input of different languages better than the shared encoder Mandarin-Tibetan cross-lingual TTS. The experimental results further show that the proposed meta-learning-based Mandarin-Tibetan cross-lingual speech synthesis method could effectively improve the voice quality of synthesized speech in terms of naturalness and speaker similarity.

Read full abstract

This paper proposes a deep learning-based Mandarin-Tibetan cross-lingual speech synthesis to realize both Mandarin speech synthesis and Tibetan speech synthesis under a unique framework. Because Tibetan training corpus is hard to record, we train the acoustic models with a large scale Mandarin multi-speaker corpus and a small scale Tibetan one-speaker corpus. The acoustic models are trained with deep neural network (DNN), hybrid long short-term memory (LSTM), and hybrid bi-directional long short-term memory (BLSTM). We also further extend our Chinese text analyzer by adding a Tibetan text analyzer for generating context-dependent labels from input Chinese or Tibetan sentences. The Tibetan text analyzer includes a text normalization, a novel Tibetan word segmentation that combines a BLSTM with conditional random field, a prosodic boundary prediction, and a grapheme-to-phoneme conversion. We select the initials and the finals of both Mandarin and Tibetan as the speech synthesis units to train a speaker-independent mixed language average voice model (AVM) with DNN, hybrid LSTM, and hybrid BLSTM from Mandarin and Tibetan mixed corpus. Then the speaker adaptation is applied to train speaker-dependent DNN, hybrid LSTM, or hybrid BLSTM models of Mandarin or Tibetan with a small target speaker corpus from an AVM. Finally, we synthesize the Mandarin speech, or Tibetan speech though the speaker-dependent Mandarin or Tibetan models. The experiments show that the hybrid BLSTM-based cross-lingual speech synthesis framework is better than the other two cross-lingual frameworks and the Tibetan monolingual framework. The mixed Tibetan training corpus does not influence the voice quality of synthesized Mandarin speech. Furthermore, the hybrid BLSTM-based cross-lingual speech synthesis framework only needs 60% of the training corpus to synthesize a similar voice as the Tibetan monolingual framework. Therefore, the proposed method can be used for speech synthesis of low resource languages by borrowing the same tremendous resource language's corpus.

Read full abstract

Tibetan Speech Synthesis Research Articles

Related Topics

Articles published on Tibetan Speech Synthesis

Tibetan Speech Synthesis Based on Pre-Traind Mixture Alignment FastSpeech2

Improving Sequence-to-sequence Tibetan Speech Synthesis with Prosodic Information

Meta-Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

End-to-End Speech Synthesis for Tibetan Multidialect

Tibetan speech synthesis based on an improved neural network

An open speech resource for Tibetan multi-dialect and multitask recognition

Deep Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

Research and Realization the Method of Pronunciation Conversion for Speech Synthesis of the Lhasa Dialect of Tibetan

Research on Tibetan Language Synthesis System Front-End Text Processing Technology Based on HMM

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Tibetan Speech Synthesis Research Articles

Related Topics

Articles published on Tibetan Speech Synthesis

Tibetan Speech Synthesis Based on Pre-Traind Mixture Alignment FastSpeech2

Improving Sequence-to-sequence Tibetan Speech Synthesis with Prosodic Information

Meta-Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

End-to-End Speech Synthesis for Tibetan Multidialect

Tibetan speech synthesis based on an improved neural network

An open speech resource for Tibetan multi-dialect and multitask recognition

Deep Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

Research and Realization the Method of Pronunciation Conversion for Speech Synthesis of the Lhasa Dialect of Tibetan

Research on Tibetan Language Synthesis System Front-End Text Processing Technology Based on HMM