Abstract
In concatenative speech synthesis, speech is generated by joining acoustic units, so the selection of these units directly affects the quality of the synthetic speech. Our previous text-to-speech (TTS) system [8] used units of a single type, such as diphones or half-syllables. Thanks to recent improvements in CPU speed and memory capacity, we can now enlarge the unit database and perform more complex searches. In this paper, we develop a non-uniform unit selection method that draws on many different types of units. We find that the quality of the synthesized speech is directly related to the size of the units used. Non-uniform unit selection has been applied in different ways to different languages; this paper describes how we apply it to Vietnamese TTS to improve the quality of the speech synthesis system.