Abstract

There are about 6,000 languages worldwide, most of which are low-resource languages. Although current speech synthesis (text-to-speech, TTS) systems for major languages (e.g., Mandarin, English, French) achieve good results, the voice quality of TTS for low-resource languages (e.g., Tibetan) still needs further improvement. Because prosody plays a significant role in natural speech, this article proposes two sequence-to-sequence (seq2seq) Tibetan TTS models with prosodic information fusion to improve the quality of synthesized Tibetan speech. We first constructed a large-scale Tibetan corpus for seq2seq TTS. We then designed a prosody generator to extract prosodic information from Tibetan sentences. Finally, we trained two seq2seq Tibetan TTS models that fuse prosodic information at the feature level and at the model level, respectively. The experimental results showed that both proposed models effectively improve the voice quality of the synthesized speech. Furthermore, the model-level prosodic information fusion needs only 60% to 70% of the training data to synthesize speech of quality comparable to the baseline seq2seq Tibetan TTS. Therefore, the proposed prosodic information fusion methods can improve the voice quality of synthesized speech for low-resource languages.
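To make the idea of feature-level prosodic information fusion concrete, the sketch below shows one common way such fusion can be wired into a seq2seq TTS encoder: prosodic features produced by a prosody generator are projected and concatenated with the text embeddings before encoding. This is only an illustrative assumption; the module names, dimensions, and prosody-feature layout are hypothetical and do not reproduce the paper's actual architecture.

```python
# Minimal sketch of feature-level prosody fusion for a seq2seq TTS encoder.
# All names and dimensions here are assumptions for illustration only.
import torch
import torch.nn as nn

class ProsodyFusedEncoder(nn.Module):
    def __init__(self, n_tokens=80, token_dim=256, prosody_dim=8, hidden=256):
        super().__init__()
        self.token_emb = nn.Embedding(n_tokens, token_dim)      # text (phoneme/letter) embedding
        self.prosody_proj = nn.Linear(prosody_dim, 32)          # project prosodic features
        self.rnn = nn.LSTM(token_dim + 32, hidden,
                           batch_first=True, bidirectional=True)  # encoder over fused features

    def forward(self, token_ids, prosody_feats):
        # token_ids: (B, T) indices; prosody_feats: (B, T, prosody_dim) from a prosody generator
        fused = torch.cat([self.token_emb(token_ids),
                           torch.tanh(self.prosody_proj(prosody_feats))], dim=-1)
        enc_out, _ = self.rnn(fused)                            # (B, T, 2 * hidden)
        return enc_out

# Toy usage with random inputs
enc = ProsodyFusedEncoder()
out = enc(torch.randint(0, 80, (2, 10)), torch.randn(2, 10, 8))
print(out.shape)  # torch.Size([2, 10, 512])
```

Model-level fusion, by contrast, would inject the prosodic information into the model itself (e.g., conditioning the decoder or attention) rather than concatenating it with the input features; the paper evaluates both strategies.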
