Abstract

This paper presents a stochastic model of intonation contours for use in text-to-speech synthesis. The model has two modules: a linguistic module that generates abstract prosodic labels from text, and a phonetic module that generates an F0 curve from those labels. The model differs from previous work in the abstract prosodic labels it uses, which can be derived automatically from the training corpus. This makes it possible to use large corpora, or several corpora of different speech styles, and makes the model easy to adapt to new languages. The present paper focuses on the linguistic module, which does not require full syntactic analysis of the text and relies only on part-of-speech tagging. The results were validated on French by means of a perception test. Listeners did not perceive a significant difference in quality between sentences synthesised using the phonetic module only, with prosodic labels derived from the original recordings as input, and sentences synthesised directly from text using the linguistic module followed by the phonetic module. The proposed model thus appears to capture most of the grammatical information needed to generate F0.

