Abstract
This paper proposes a multi-level Gaussian process regression (GPR)-based method for duration prediction that combines phone- and syllable-level duration models. In this method, we first train the syllable-level model and predict syllable durations from the input context labels. We then use the predicted syllable duration as an additional context for the phone-level model, which predicts phone durations. To apply multi-level duration prediction to the GPR-based speech synthesis framework, we designed phone- and syllable-level context sets for Thai that include linguistic information and the relative positions of speech units. We also examined a multi-level deep neural network (DNN)-based duration-prediction method built with the same approach as the proposed multi-level GPR-based one. We conducted objective and subjective evaluations using two hours of training data to compare the proposed method with single-level ones. The results indicate that the proposed multi-level duration-prediction method outperformed the single-level ones in the DNN- and GPR-based frameworks. They also indicate that the proposed multi-level GPR-based method can provide better performance than the multi-level HMM-based duration-prediction method.
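The two-stage prediction described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the class `TinyGPR` (a bare-bones GP posterior mean with an RBF kernel), the helper `predict_multilevel`, the numeric toy contexts, and the toy durations are all hypothetical. In particular, the sketch trains the phone-level model on predicted syllable durations, which is one possible reading of the method.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

class TinyGPR:
    """Hypothetical minimal GP regressor (posterior mean only)."""
    def __init__(self, length=1.0, noise=1e-4):
        self.length, self.noise = length, noise

    def fit(self, X, y):
        self.X = [list(x) for x in X]
        self.mean = sum(y) / len(y)
        n = len(X)
        # Kernel matrix plus observation noise on the diagonal.
        K = [[rbf(X[i], X[j], self.length) + (self.noise if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        self.alpha = solve(K, [yi - self.mean for yi in y])
        return self

    def predict(self, X):
        return [self.mean + sum(a * rbf(x, xi, self.length)
                                for a, xi in zip(self.alpha, self.X))
                for x in X]

def augment(phone_ctx, phone_to_syl, syl_dur):
    """Append each phone's parent-syllable duration as an extra context."""
    return [ctx + [syl_dur[s]] for ctx, s in zip(phone_ctx, phone_to_syl)]

def predict_multilevel(syl_model, phone_model, syl_ctx, phone_ctx, phone_to_syl):
    """Stage 1: syllable durations. Stage 2: phone durations given stage 1."""
    syl_pred = syl_model.predict(syl_ctx)
    phone_pred = phone_model.predict(augment(phone_ctx, phone_to_syl, syl_pred))
    return syl_pred, phone_pred

# Toy data (made up): contexts are small numeric vectors, durations in seconds.
syl_ctx = [[0.0, 2.0], [1.0, 3.0], [0.0, 3.0], [1.0, 2.0]]
syl_dur = [0.18, 0.26, 0.24, 0.20]
phone_to_syl = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
phone_ctx = [[0.0, 0.0], [1.0, 1.0],
             [0.0, 0.0], [1.0, 1.0], [2.0, 0.0],
             [0.0, 0.0], [1.0, 1.0], [2.0, 0.0],
             [0.0, 0.0], [1.0, 1.0]]
phone_dur = [0.07, 0.11, 0.06, 0.13, 0.07, 0.06, 0.12, 0.06, 0.08, 0.12]

# Train the syllable model first, then the phone model on augmented contexts.
syl_model = TinyGPR().fit(syl_ctx, syl_dur)
train_aug = augment(phone_ctx, phone_to_syl, syl_model.predict(syl_ctx))
phone_model = TinyGPR().fit(train_aug, phone_dur)

syl_pred, phone_pred = predict_multilevel(
    syl_model, phone_model, syl_ctx, phone_ctx, phone_to_syl)
```

The key design point is that the phone-level input vector has one more dimension than the phone contexts alone, carrying the syllable-level prediction. A practical system would use real linguistic context labels and tuned kernel hyperparameters in place of these toy vectors and defaults.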