Abstract
Recent advances in deep learning have facilitated the development of end-to-end Vietnamese text-to-speech (TTS) systems that produce Vietnamese voices with high intelligibility and naturalness. However, enabling these systems to fluently speak Vietnamese and English words in the same utterance remains a challenge, known as the code-switching (CS) problem in speech synthesis. The main reason is that it is difficult to obtain a large, high-quality CS corpus from a single Vietnamese speaker. In this paper, we explore the efficacy of three approaches, all based on the Tacotron-2 end-to-end framework, for building such a Vietnamese TTS system when code-switched data is limited: (1) CS synthesis based on grapheme-to-syllable (G2S) conversion, (2) CS synthesis based on speaker embedding, and (3) CS synthesis based on speaker embedding and language embedding. In the G2S-based approach, English and Vietnamese words in the code-switched input text are converted into Vietnamese syllables using our G2S model. In the speaker-embedding based approach, we combine the Vietnamese monolingual data in our dataset with a public English dataset to train a multi-speaker Tacotron-2 system. The experimental results show that adding language embedding is effective and that character input representations outperform phoneme input representations. Thus, the speaker and language-embedding based approach achieves strong naturalness for CS speech. In addition, the G2S-based CS synthesis also performs well, with near-perfect English pronunciation accuracy.