Adapting Prosody in a Text-to-Speech System

Janez Stergar, Caglayan Erdem

doi:10.5772/10398

Abstract

The requirements of evolving information and communication technologies (ICT) place new demands on text-to-speech (TTS) systems. A modern high-quality TTS system has to be capable of fast, high-quality adaptation to a new language, a new voice, or even expressive speech. Adaptation to new voices with different prosodic characteristics is therefore desirable. This chapter surveys recent and past approaches to prosodic processing in text-to-speech synthesis. Although many different approaches have been proposed, ranging from generating prosody by rule to huge databases covering almost all prosodic patterns of a specific speaker, there is clearly still much work to be done (van Santen et al., 2008). Automatic learning techniques seem to offer the fastest way of adapting a TTS system to a new language, voice or application. They allow automatic extraction of specific features (e.g. non-uniform unit selection, extraction of prosodic regularities) from an appropriate database of natural speech. Such techniques depend on the construction of large pre-processed corpora (properly segmented, labelled with appropriate prosody labels, etc.). Despite the overall impression that TTS is a lesser task than speech recognition, the TTS research and development community has not been able to produce a massive series of consumer products since the early 1980s (Dutoit, 2008). Since then a broad spectrum of systems has been developed and successfully implemented, and prosody has been one of the major tasks to tackle in such systems. The term “prosody” covers a wide range of features characterizing “the musical qualities” of speech, including phrasing, pitch, loudness, tempo and rhythm. A number of studies suggest that prosody has a great impact on the intelligibility and naturalness of speech perception.
Although synthesized speech is nowadays mostly intelligible and in some cases sounds indistinguishable from human speech, it still lacks flexibility and an appropriate rendering of expressivity in the synthesized voice. Text-to-prosody systems based on prosodic databases extracted from natural speech are key to the development of new TTS systems. One of the major problems in TTS synthesis is the automatic generation of natural and intelligible prosody. The preparation of suitable speech corpora for automatic prosodic feature extraction is therefore essential.

Full Text