A method for registration of a new voice in a text-to-speech synthesizer

Takashi Saito, Masaharu Sakamoto

doi:10.1121/1.416354

Abstract

In conventional text-to-speech (TTS) systems, synthesis unit inventories are prepared in advance through a laborious process of speech data gathering, analysis, and manual segmentation, and users of such systems cannot add unit inventories for new voices. This paper describes a method for registering a new speaker’s voice in the synthesis unit inventories of a text-to-speech system, giving users a voice registration function similar to the training function commonly used in speech recognition systems. The approach taken here is as follows: (1) create an initial unit inventory by manual segmentation, to serve as the reference unit inventory; (2) have each new speaker produce utterances following guidance synthetic speech of the reference speaker, whose pitch range is adjusted to that of the new speaker; (3) segment the new speech database automatically by a phonemic alignment technique combined with constraints given by the segment information of the reference unit inventory. Experimental results are shown for a waveform-concatenation-based TTS system recently developed by the authors [Saito et al., Proc. ICASSP’96, 381–384 (1996)].
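Step (3) can be illustrated with a minimal sketch. The abstract does not specify the alignment technique, so the code below swaps in plain dynamic time warping (DTW) restricted to a band around the diagonal, which stands in for the constraints derived from the reference segment information. The function names `dtw_path` and `transfer_boundaries`, the `band` parameter, and the use of one scalar feature per frame are all assumptions made for illustration; a real system would compare spectral feature vectors.

```python
import math

def dtw_path(ref, new, band):
    """Band-constrained DTW between two 1-D feature sequences.

    Hypothetical stand-in for the paper's phonemic alignment; returns
    the warping path as a list of (ref_frame, new_frame) pairs.
    """
    n, m = len(ref), len(new)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only search new-speaker frames near the position predicted
        # by the reference timing (scaled for the length difference).
        centre = i * m / n
        lo = max(1, math.floor(centre - band))
        hi = min(m, math.ceil(centre + band))
        for j in range(lo, hi + 1):
            d = abs(ref[i - 1] - new[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    if math.isinf(cost[n][m]):
        raise ValueError("band too narrow: no alignment path exists")
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

def transfer_boundaries(ref, new, ref_boundaries, band=2):
    """Map segment start frames of the reference onto the new utterance."""
    first_match = {}
    for i, j in dtw_path(ref, new, band):
        first_match.setdefault(i, j)  # earliest new frame for each ref frame
    return [first_match[b] for b in ref_boundaries]

# Toy data: three "phonemes" with feature values 0, 5, and 1.
ref_frames = [0, 0, 0, 5, 5, 5, 1, 1]              # reference speaker
new_frames = [0, 0, 0, 0, 5, 5, 5, 5, 5, 1, 1, 1]  # new speaker, slower
ref_starts = [3, 6]  # manually segmented boundaries in the reference
new_starts = transfer_boundaries(ref_frames, new_frames, ref_starts, band=3)
```

Because the new speaker follows the guidance speech, the two utterances share the same phoneme sequence and similar timing, which is what makes a narrow search band plausible in this sketch.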
