Abstract

As a core technology of human-computer interaction, speech synthesis plays an important role in education, daily life, and science and technology. As a research hotspot in artificial intelligence, speech synthesis has achieved strong results not only in Mandarin but also in minority languages such as Tibetan. At present, Tibetan speech synthesis research is mainly based on autoregressive models, which are far superior to traditional models and can synthesize high-quality speech. However, autoregressive inference is slow, and the acoustic model represents duration alignment, pitch, and energy only implicitly. As a result, synthesis is slow, words may be repeated or skipped, and speech rate and prosody cannot be controlled in a fine-grained manner. To address these problems, this paper studies Tibetan text-to-speech alignment and Tibetan speech synthesis based on a combination of a non-autoregressive acoustic model and a vocoder. First, Tibetan speech is aligned with its phoneme sequence using a hidden Markov model with Gaussian mixture emissions (HMM-GMM). Second, the phoneme durations of real speech, together with variance information such as pitch and energy, are introduced into the FastSpeech 2 acoustic model, and the Variance Adapter is used to address the one-to-many mapping problem of traditional speech synthesis, reducing word skipping and repetition. Finally, to balance synthesis speed and synthesis quality, a pre-trained HiFi-GAN vocoder is used to convert the mel spectrogram to speech.
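
The role of explicit phoneme durations can be illustrated with a minimal sketch of the length-regulation step used in FastSpeech-style models: each phoneme's hidden vector is repeated for as many mel frames as its duration, so the output length is fixed up front and no phoneme can be skipped or repeated by a drifting attention alignment. The function name `length_regulate`, the `speed` parameter, and the toy values below are assumptions for illustration only, not the paper's implementation.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_states: Sequence[Sequence[float]],
    durations: Sequence[int],
    speed: float = 1.0,
) -> List[List[float]]:
    """Expand phoneme-level vectors to frame level (illustrative sketch).

    phoneme_states: one hidden vector per phoneme.
    durations: number of mel frames per phoneme, e.g. from forced alignment.
    speed: hypothetical rate control; values above 1.0 shorten every phoneme.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if speed <= 0:
        raise ValueError("speed must be positive")

    frames: List[List[float]] = []
    for state, duration in zip(phoneme_states, durations):
        # Scale the duration, then round half up to a whole number of frames.
        n_frames = max(0, int(duration / speed + 0.5))
        # Copy the vector once per frame so frames do not share storage.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Toy input: three phonemes with 2-dimensional hidden vectors.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]

normal = length_regulate(states, durations)
faster = length_regulate(states, durations, speed=2.0)
print(len(normal), len(faster))
```

Because the frame count is the sum of the (scaled) durations, changing `speed` alters speech rate directly, which is the kind of fine-grained control the abstract contrasts with autoregressive models.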
