Abstract

This paper proposes a multi-level Gaussian process regression (GPR)-based method for duration prediction that incorporates phone- and syllable-level duration models. In this method, we first train the syllable-level model and predict syllable durations for a given input of context labels. We then use the predicted syllable duration as an additional context for the phone-level model to predict phone durations. To apply multi-level duration prediction to the GPR-based speech synthesis framework, we designed phone- and syllable-level context sets for Thai that include linguistic information and the relative positions of speech units. We also examined a multi-level deep neural network (DNN)-based duration-prediction method built with the same approach as the proposed multi-level GPR-based one. We conducted objective and subjective evaluations using two hours of training data to compare the proposed method with single-level ones. The results indicate that the proposed multi-level duration-prediction method outperformed single-level ones in both the DNN- and GPR-based frameworks. They also indicate that the proposed multi-level GPR-based method can provide better performance than the multi-level hidden Markov model (HMM)-based duration-prediction method.
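The two-stage idea described above (predict syllable durations first, then feed each predicted syllable duration to the phone-level model as an extra context feature) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's implementation: the tiny NumPy-only `SimpleGPR` class with a fixed RBF kernel, the synthetic context vectors and durations, the `phone_to_syl` index array, and the choice to train the phone-level model on predicted (rather than observed) syllable durations.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale ** 2)


class SimpleGPR:
    """Toy GP regressor (posterior mean only, fixed hyperparameters)."""

    def __init__(self, length_scale=1.0, noise=1e-2):
        self.length_scale = length_scale
        self.noise = noise

    def fit(self, x, y):
        self.x = x
        self.mean = y.mean()
        k = rbf_kernel(x, x, self.length_scale) + self.noise * np.eye(len(x))
        # Solve (K + noise * I) alpha = y - mean for the posterior-mean weights.
        self.alpha = np.linalg.solve(k, y - self.mean)
        return self

    def predict(self, x_new):
        return rbf_kernel(x_new, self.x, self.length_scale) @ self.alpha + self.mean


# Synthetic stand-in data (hypothetical; real inputs would be context labels).
rng = np.random.default_rng(0)
n_syl, phones_per_syl = 40, 2
syl_ctx = rng.normal(size=(n_syl, 3))                     # syllable-level contexts
syl_dur = 0.2 + 0.05 * np.tanh(syl_ctx[:, 0])             # syllable durations (s)
phone_to_syl = np.repeat(np.arange(n_syl), phones_per_syl)  # parent syllable index
phone_ctx = rng.normal(size=(n_syl * phones_per_syl, 2))  # phone-level contexts
phone_dur = syl_dur[phone_to_syl] / phones_per_syl + 0.01 * np.tanh(phone_ctx[:, 0])

# Stage 1: syllable-level model predicts syllable durations.
syl_model = SimpleGPR().fit(syl_ctx, syl_dur)
pred_syl = syl_model.predict(syl_ctx)

# Stage 2: append each phone's predicted parent-syllable duration
# to its context vector, then train and apply the phone-level model.
phone_in = np.column_stack([phone_ctx, pred_syl[phone_to_syl]])
phone_model = SimpleGPR().fit(phone_in, phone_dur)
pred_phone = phone_model.predict(phone_in)
```

In this sketch the only coupling between the two levels is the extra input column, so the same wiring would apply if either regressor were swapped for a different model family.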
