Abstract

This paper introduces a novel model-constrained, data-driven method to generate fundamental frequency (F0) contours for Japanese text-to-speech synthesis. In the training phase, the relationship between linguistic features and the parameters of a command–response F0 contour generation model is learned by a prediction module, which is represented by either a neural network or a set of binary regression trees. Input features consist of linguistic information related to accentual phrases that can be automatically derived from text, such as the position of the accentual phrase in the utterance, the number of morae, the accent type, and morphological information. In the synthesis phase, the prediction module is used to generate appropriate values of the model parameters. The use of the parametric model restricts the degrees of freedom of the problem, which facilitates the mapping between linguistic and prosodic features. Experimental results show that the method makes it possible to generate quite natural F0 contours with a relatively small training database.
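
To make the role of the parametric model concrete, the sketch below generates an F0 contour from a handful of command parameters. It assumes the command–response model takes the widely used Fujisaki form, in which log F0 is a baseline value plus phrase components (impulse responses) and accent components (step responses). The abstract does not give the exact formulation, so the function names, the time constants alpha = 3.0 and beta = 20.0, the ceiling gamma = 0.9, and the example command values are all illustrative assumptions rather than values taken from the paper.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, fb, phrase_cmds, accent_cmds,
               alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 values (Hz) at the given times (s).

    fb          -- baseline frequency in Hz (assumed value)
    phrase_cmds -- list of (onset_time, magnitude)
    accent_cmds -- list of (onset_time, offset_time, amplitude)
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        # Each phrase command adds a scaled impulse response.
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0, alpha)
        # Each accent command adds a scaled step response between onset and offset.
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1, beta, gamma)
                            - accent_response(t - t2, beta, gamma))
        contour.append(math.exp(log_f0))
    return contour


# Hypothetical example: one phrase command and one accent command over 2 s.
times = [i * 0.01 for i in range(200)]
contour = f0_contour(times, fb=120.0,
                     phrase_cmds=[(0.0, 0.5)],
                     accent_cmds=[(0.3, 0.8, 0.4)])
```

Under these assumptions, an entire accentual phrase is described by a few numbers (command timings and magnitudes) rather than by a frame-by-frame F0 trajectory, which is the reduction in degrees of freedom that the prediction module is described as exploiting.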
