Abstract

Text-to-speech systems generally consist of two components. The first converts the input text into an abstract, linguistically relevant representation. Usually, this is a phonemic transcription of the input text, with markers for word, morpheme, and syllable boundaries, word stress, and sentence accent. The second component converts this transcription into a physical speech sound. Two aspects of natural speech are the most important to imitate in this second step: (a) natural prosody (speech rate, segment duration, pitch, etc.), and (b) the phonetic adjustment between phonemes. The resulting synthetic speech is mainly used in special-purpose applications, although wider use is foreseen in the future.
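
The two-component structure described above can be illustrated with a minimal sketch. It is not taken from the paper: the lexicon, the marker symbols ("'" for stress, "." for syllable boundaries, "#" for word boundaries), the function names, and the duration and pitch values are all hypothetical and chosen only to make the data flow visible. The second stage stops at a per-phoneme prosody plan and does not generate audio or model the adjustment between phonemes.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: word -> syllables as (stressed, phonemes).
# The entries and phoneme symbols are illustrative assumptions only.
TOY_LEXICON = {
    "hello": [(False, ["h", "@"]), (True, ["l", "oU"])],
    "world": [(True, ["w", "3", "l", "d"])],
}


def text_to_transcription(text):
    """Component 1 (sketch): text -> phonemes with boundary and stress markers."""
    words = []
    for word in text.lower().split():
        syllables = TOY_LEXICON[word]  # raises KeyError for unknown words
        parts = [("'" if stressed else "") + " ".join(phonemes)
                 for stressed, phonemes in syllables]
        words.append(" . ".join(parts))  # "." marks a syllable boundary
    return " # ".join(words)             # "#" marks a word boundary


@dataclass
class Segment:
    phoneme: str
    duration_ms: int
    pitch_hz: int


def transcription_to_segments(transcription):
    """Component 2 (sketch): transcription -> per-phoneme prosody plan.

    Assumed toy rule: phonemes in stressed syllables are longer and
    higher-pitched. The numbers are placeholders, not measured values.
    """
    segments = []
    for word in transcription.split(" # "):
        for syllable in word.split(" . "):
            stressed = syllable.startswith("'")
            for phoneme in syllable.lstrip("'").split():
                segments.append(Segment(
                    phoneme,
                    120 if stressed else 80,
                    140 if stressed else 120,
                ))
    return segments


transcription = text_to_transcription("Hello world")
plan = transcription_to_segments(transcription)
```

Keeping the transcription as an explicit intermediate value mirrors the separation the abstract draws: the first stage can be inspected or corrected on its own before any prosodic decisions are made.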
