Abstract

Text-to-speech (TTS) systems now adapt the output voice to the user and the application. The aim is personalisation: placing the user in familiar surroundings increases the acceptance of TTS. For example, an e-mail client may read incoming messages in a synthesised voice that sounds like the sender. Building such a personalised TTS system is costly, so voice-conversion (VC) techniques are used to save resources. VC transforms the voice of a "source speaker" so that the converted voice sounds like that of another "target speaker". The converted voice sounds natural only if it includes all features relevant to the true target voice. A main problem here is the mapping of prosody, which is one of the essential features. This contribution introduces a statistical prosodic model for voice conversion. It is based on Gaussian mixture models (GMMs) trained on the pitch and duration of diphones. To ensure sufficient data for GMM training, the diphones are grouped into seven classes based on the International Phonetic Alphabet. The suitability of the model for VC is discussed, along with its limitations, necessary extensions (stress) and open problems.

