Abstract

This paper proposes a method for modeling and generating pitch in hidden Markov model (HMM)-based Mandarin speech synthesis by exploiting prosody hierarchy and dynamic pitch features. The prosodic structure of a sentence is represented by a prosody hierarchy, which is constructed from the predicted prosodic breaks using a supervised classification and regression tree (S-CART). The S-CART is trained by maximizing the proportional reduction of entropy to minimize the errors in the prediction of the prosodic breaks. The pitch contour of a speech sentence is estimated using the STRAIGHT algorithm and decomposed into the prosodic features (static features) at prosodic word, syllable, and frame layers, based on the predicted prosodic structure. Dynamic features at each layer are estimated to preserve the temporal correlation between adjacent units. A hierarchical prosody model is constructed using an unsupervised CART (U-CART) for generating pitch contour. Minimum description length (MDL) is adopted in U-CART training. Objective and subjective evaluations with statistical hypothesis testing were conducted, and the results compared to corresponding results for HMM-based pitch modeling. The comparison confirms the improved performance of the proposed method.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.