Abstract

In concatenative speech synthesis systems, speech is generated by concatenating acoustic units, so the selection of these units directly affects the quality of the synthetic speech. In our previous text-to-speech (TTS) system [8], all units were of a single type, such as diphones or half-syllables. Thanks to recent improvements in CPU speed and memory capacity, we can now increase the database size and perform more complex searches. In this paper, we develop a method of non-uniform unit selection that uses many different types of units. We find that the quality of speech is directly related to the size of the units used. This method has been applied in different ways to different languages. This paper describes how to apply it to Vietnamese TTS in order to improve the quality of the speech synthesis system.

