Abstract
Speaker adaptive speech synthesis based on Hidden Semi-Markov Model (HSMM) has been demonstrated to be dramatically effective in the presence of confined amount of speech data. However, we could intensify this effectiveness by training the average voice model appropriately. Hence, this study presents a new method for training the average voice model. This method guarantees that data from every speaker contributes to all the leaves of decision tree. We considered this fact that small training data and highly diverse contexts of training speakers are considered as disadvantages which degrade the quality of average voice model impressively, and further influence the adapted model and synthetic speech unfavorably. The proposed method takes such difficulties into account in order to train a tailored average voice model with high quality. Consequently, as the experiments indicate, the proposed method outweighs the conventional one not only in the quality of synthetic speech but also in similarity to the natural voice. Our experiments show that the proposed method increases the CMOS test score by 0.6 to the conventional one.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.