Abstract
Frame-by-frame representation is not appropriate for prosodic features, which are tightly related to speech units spreading a wide time span, such as words, phrases and so on. This causes an inherit problem in fundamental frequency (F <inf xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">0</inf> ) contour generation by HMM-based speech synthesis. A method is developed to modify F <inf xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">0</inf> contours in the framework of a generation process model by referring to linguistic information of input text (word boundary and accent type). It takes F <inf xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">0</inf> variances obtained through HMM-based speech synthesis into account during the process. Through a listening experiment on synthetic speech, the method is proved to generate better quality as compared to the HMM-based speech synthesis on average. Since the generation process model can clearly relate its commands and linguistic (and para-/non- linguistic) information, the method has an additional advantage; changing speech styles, and /or adding further information (such as emphasis) can be easily done through manipulating the commands.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.