This paper investigates the stylized invariance and local variability of prosody patterns using a speech database that contains two repetitions of 1000 sentences. The two repetitions, recorded six months apart by a single professional speaker, were read in the same reading style as instructed. A statistical comparison shows that the two repetitions vary widely in their prosodic features, with differences of up to 50% of the speaker's full dynamic range. This exposes the inadequacy of traditional prosody models, which focus on capturing the universal invariance of prosody as precisely as possible. We propose instead to model prosody by capturing its stylized invariance while retaining local variability through a soft prediction strategy, which predicts an acceptable region rather than a single fixed point in the multi-dimensional prosody space. A prosodic-constrained unit selection algorithm is devised under this soft prediction strategy.
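The soft prediction idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the axis-aligned box shape of the region, the squared penalty outside it, the `join_cost` term, and all names (`Region`, `select_unit`) are assumptions made for illustration. The point it shows is that every candidate unit inside the acceptable region incurs no prosodic penalty, so the choice among such candidates is left to other criteria.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Region:
    """Hypothetical acceptable region: an axis-aligned box in prosody space."""
    center: Sequence[float]      # predicted prosody values, e.g. pitch (Hz), duration (s)
    half_width: Sequence[float]  # tolerated deviation per dimension, must be > 0

    def cost(self, point: Sequence[float]) -> float:
        """Zero inside the region; squared normalized excess outside it."""
        total = 0.0
        for x, c, w in zip(point, self.center, self.half_width):
            excess = max(0.0, abs(x - c) - w)  # distance beyond the box edge
            total += (excess / w) ** 2
        return total


# A candidate is (prosody vector, join cost); the join cost is an assumed
# stand-in for whatever non-prosodic cost the selection also considers.
Candidate = Tuple[Sequence[float], float]


def select_unit(candidates: List[Candidate], region: Region) -> int:
    """Return the index of the candidate with the lowest combined cost."""
    totals = [region.cost(prosody) + join for prosody, join in candidates]
    return totals.index(min(totals))


region = Region(center=[200.0, 0.10], half_width=[20.0, 0.02])
candidates = [
    ([200.0, 0.10], 0.8),  # exactly at the predicted point, poor join
    ([215.0, 0.11], 0.2),  # inside the region, good join
    ([260.0, 0.10], 0.0),  # best join, but far outside the region
]
best = select_unit(candidates, region)
```

Under a single-point prediction, the first candidate would be favored for matching the target exactly; with a region, the second candidate is equally acceptable prosodically and wins on its lower join cost.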