Multiple-prosody speech databases and their effectiveness in high-quality speech synthesis at arbitrary rates

Tsuyoshi Masuda, Hiroshi Saruwatari, Kiyohiro Shikano, Tomoki Toda, Hiromichi Kawanami

doi:10.1002/ecjb.20215

Abstract

This paper discusses a method of high-quality speech synthesis in which the speech rate can be set arbitrarily. When the prosody is adjusted by the PSOLA method or by the synthesis-by-analysis method in the waveform segment connection process, quality declines as the extent of modification increases. To deal with this problem, this paper proposes using a separate speech database for each speech rate, which reduces the required modification of segment duration and thereby alleviates quality degradation. The proposed method has the following features. (1) Synthesized speech with the target speech rate is produced for each utterance, and is recorded. (2) Speech databases of the same text at different speech rates are constructed. In this study, speech databases at three different speech rates (fast, medium, and slow) were acquired. Speech at two different speech rates (fast and slow) was synthesized both with the acquired speech databases and with the conventional method (using a speech database at the standard speech rate). Listening experiments showed that the proposed method can synthesize higher-quality speech than the conventional method. When speech databases with different speech rates are combined, there is a risk that speech quality is degraded by differences in voice quality among the databases. The effect of these voice-quality differences was investigated in a listening experiment and was found to be within the tolerable range.

© 2005 Wiley Periodicals, Inc. Electron Comm Jpn Pt 2, 88(9): 38–47, 2005; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjb.20215
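
The core idea of the abstract, choosing a rate-specific database so that segment durations need less modification, can be illustrated with a minimal sketch. It is not the authors' implementation. The rate values (in morae per second), the log-ratio distance, and the names `DATABASE_RATES`, `select_database`, and `conventional_scale` are all assumptions introduced here for illustration only.

```python
import math

# Hypothetical average speech rates (morae per second) for three databases.
# These values are illustrative assumptions, not figures from the paper.
DATABASE_RATES = {"slow": 5.0, "medium": 8.0, "fast": 11.0}


def select_database(target_rate, database_rates=DATABASE_RATES):
    """Return (name, duration_scale) for the database needing least modification.

    duration_scale is the factor by which segment durations taken from the
    chosen database would be multiplied to reach the target rate
    (1.0 means the segments are used unmodified).
    """
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    # Compare rates on a log scale so that speeding up by a factor of 2 and
    # slowing down by a factor of 2 count as equally large modifications.
    name = min(
        database_rates,
        key=lambda n: abs(math.log(target_rate / database_rates[n])),
    )
    return name, database_rates[name] / target_rate


def conventional_scale(target_rate, standard_rate=DATABASE_RATES["medium"]):
    """Duration scale needed when only the standard-rate database is used."""
    return standard_rate / target_rate


if __name__ == "__main__":
    for rate in (5.5, 8.0, 10.0):
        name, scale = select_database(rate)
        print(rate, name, round(scale, 3), round(conventional_scale(rate), 3))
```

Under these assumed rates, a fast target is served from the fast database with a duration scale closer to 1.0 than the standard-rate database would require, which is the reduction in modification that the abstract attributes to the proposed method.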
