A statistical model with hierarchical structure for predicting prosody in a Mandarin text‐to‐speech system

Ming‐Shing Yu, Neng‐Huang Pan

doi:10.1080/02533839.2005.9671006

Abstract

In this paper we propose a statistical prosody model with a hierarchical structure for Mandarin text‐to‐speech (TTS) systems. The model has four levels: the syllable level, the word level, the breath group (prosodic phrase) level, and the utterance level. Here “hierarchy” means that each unit at a lower level is contained in a unit at the next higher level. Prosodic information is first estimated at each level, and the per‐level results are then combined to obtain the predicted prosody. The advantages of our model are as follows: (1) It relieves the data sparsity problem: since each level has only a few parameters, the training corpus need not be very large. (2) The appropriateness of the output values of each level is easy to verify. (3) It has low prediction error: the experimental results show that the predicted prosodic values match the original values very well. (4) The prosody generator predicts all of the prosodic information considered, namely syllable duration, pause length, energy, and pitch contours.
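The idea of estimating prosodic information at each level and then combining it can be illustrated with a minimal sketch. The abstract does not state how the levels are combined, so the additive form used here is an assumption, not the authors' method. The feature names (`syllable`, `word`, `breath_group`, `utterance`), their category values, and the toy durations are likewise hypothetical. Each level contributes one small offset table, fitted by averaging the residual left over by the levels before it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical level order, following the four levels named in the abstract.
LEVELS = ("syllable", "word", "breath_group", "utterance")


def fit_additive_levels(samples, target="duration"):
    """Fit a global mean plus one offset table per level.

    ASSUMPTION: levels combine additively; the abstract does not specify this.
    samples: list of dicts holding one categorical value per level plus the target.
    """
    base = mean(s[target] for s in samples)
    residuals = [s[target] - base for s in samples]
    tables = {}
    for level in LEVELS:
        # Group the current residuals by this level's category value.
        groups = defaultdict(list)
        for s, r in zip(samples, residuals):
            groups[s[level]].append(r)
        table = {key: mean(vals) for key, vals in groups.items()}
        tables[level] = table
        # Remove what this level explains before fitting the next level.
        residuals = [r - table[s[level]] for s, r in zip(samples, residuals)]
    return base, tables


def predict(base, tables, features):
    """Sum the per-level offsets; an unseen category contributes 0."""
    return base + sum(tables[lv].get(features[lv], 0.0) for lv in LEVELS)


# Toy data (invented for illustration): syllable durations in milliseconds.
toy = [
    {"syllable": "tone1", "word": "initial", "breath_group": "medial",
     "utterance": "short", "duration": 180.0},
    {"syllable": "tone4", "word": "initial", "breath_group": "medial",
     "utterance": "short", "duration": 220.0},
]
base, tables = fit_additive_levels(toy)
print(predict(base, tables, toy[0]))
```

Because each table holds only one value per category, the number of parameters grows with the sum of the category counts across levels rather than with their product, which is one way to read the data sparsity advantage claimed above. Each table can also be inspected on its own, in line with advantage (2).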
