Abstract
It has been conjectured that articulatory synthesis possesses the greatest potential for generating high-quality synthetic speech. However, for text-to-speech (TTS), waveform concatenation techniques have proven more practical, due in part to the challenge of generating appropriate trajectories of articulatory parameters. A waveform generation method for TTS that combines the practical success of concatenative methods with the quality potential of articulatory synthesis is under development. The system concatenates articulatory units derived from natural speech using an articulatory voice mimic. The mimic estimates articulatory parameters by minimizing a cost function that includes a spectral distance between natural and synthetic speech and a geometric distance that penalizes rapid or discontinuous changes in articulator positions. A database of articulatory trajectories representing phonetic units is constructed from the estimated parameters. For TTS, phonetic units generated by text analysis are used to select the corresponding articulatory units from the database. Duration modification, concatenation, and smoothing across units are performed in the articulatory domain, resulting in a single articulatory trajectory for the complete utterance. Speech is synthesized from the trajectory using a two-mass model for voicing, achieving a high degree of acoustic continuity across unit boundaries while also allowing for source–tract interaction.
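The abstract's cost function, combining a spectral distance with a geometric continuity penalty, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weighting terms `alpha` and `beta`, the mean-squared-error distances, and the `synthesize_spectra` callable (mapping an articulatory trajectory to synthetic spectra) are all assumptions introduced here for clarity.

```python
import numpy as np

def mimic_cost(articulatory_traj, natural_spectra, synthesize_spectra,
               alpha=1.0, beta=0.1):
    """Hypothetical mimic cost: spectral distance plus geometric penalty.

    articulatory_traj : (frames, params) array of articulator positions
    natural_spectra   : (frames, bins) spectra of the natural utterance
    synthesize_spectra: assumed callable mapping a trajectory to spectra
    alpha, beta       : assumed weights balancing the two terms
    """
    synth_spectra = synthesize_spectra(articulatory_traj)
    # spectral distance between natural and synthetic speech
    spectral = np.mean((natural_spectra - synth_spectra) ** 2)
    # geometric distance: penalize rapid or discontinuous changes
    # in articulator positions via frame-to-frame differences
    geometric = np.mean(np.diff(articulatory_traj, axis=0) ** 2)
    return alpha * spectral + beta * geometric
```

In a full system this cost would be minimized per frame or per segment by a numerical optimizer; the sketch only shows how the two terms of the cost might be combined.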