This paper presents the new French text-to-speech (TTS) system being developed at AT&T. The system, based on diphone and triphone concatenation, follows the general framework of the Bell Laboratories English TTS system [J. Olive (1977); J. Olive and M. Liberman (1979)]. Concatenative units were recorded both within real words extracted from a dictionary and within logatomes (nonexistent words). Researchers disagree on whether logatomes or real words should be used for synthesis. Proponents of logatomes argue that neutral words allow the diphone to be recorded as neutrally as possible, free of any real-word stress. Opponents argue that the diphone is over-articulated in a logatome environment, which reduces the naturalness of the synthesized speech. Experiments are presented that demonstrate the strengths and weaknesses of the two approaches. Methods for automatic segmentation and for the choice of concatenative units are presented, along with durational and prosodic issues. Directions for future improvement are described, including morphological analysis, part-of-speech tagging, and partial syntactic analysis. Several problems specific to French text-to-speech are also discussed: nasalization, liaison, the realization of schwas, the lengthening of silences, prosodic groupings, and stable versus unstable phonemes.