Formant-based speech synthesis has a major advantage: it can generate speech with a wide range of voice qualities and talker individualities. It has nevertheless suffered from unnatural speech quality, not because of theoretical limitations but because the synthesis rules are incomplete; any insufficient approximation of the acoustics degrades the perceived quality of the synthetic speech. A novel formant-based synthesizer for Japanese has been investigated that concatenates CV (consonant-vowel) formant-source templates obtained from natural utterances, using multiple sets of formant and voice source parameter values for each CV syllable. This paper describes an automatic method for creating the CV formant-source templates from a speech corpus. ARX (autoregressive with exogenous input) analysis is first used to extract formant and voice source parameters automatically, and HMM-based segmentation is then performed to locate the CV segments. The segments are further analyzed to detect the starting point of each syllable. A distance measure is used to decide the number of templates needed for each syllable. Experiments on 503 Japanese sentences show that the method is useful for creating CV templates.
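As a rough illustration of how a distance measure could determine the number of templates per syllable, the sketch below keeps a token as a new template only when it lies farther than a threshold from every template already kept. The abstract does not specify the distance measure, the selection procedure, or the threshold, so the Euclidean distance, the greedy rule, the names `distance` and `select_templates`, and the (F1, F2) values are all assumptions made for illustration only.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length parameter vectors.

    Assumed measure; the abstract does not say which distance is used.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_templates(tokens, threshold):
    """Greedy selection (hypothetical procedure).

    A token becomes a new template only if its distance to every
    template kept so far exceeds `threshold`.
    """
    templates = []
    for tok in tokens:
        if all(distance(tok, t) > threshold for t in templates):
            templates.append(tok)
    return templates


# Hypothetical (F1, F2) values in Hz for four tokens of one CV syllable.
tokens = [
    (700.0, 1200.0),
    (710.0, 1190.0),
    (500.0, 1500.0),
    (505.0, 1510.0),
]
templates = select_templates(tokens, threshold=100.0)
print(len(templates), templates)
```

With this rule, the number of templates for a syllable is simply the length of the returned list: a larger threshold yields fewer templates, a smaller one yields more. A real system would compare full formant and voice source parameter trajectories rather than a single pair of formant values.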