Abstract

The increasing availability of large speech databases makes it possible to construct speech synthesis systems, often referred to as corpus-based, data-driven, or trainable approaches, by applying statistical learning algorithms. These systems, which can be trained automatically, not only generate natural, high-quality synthetic speech but can also reproduce the voice characteristics of the original speaker. This talk presents one such approach, HMM-based speech synthesis. The basic idea is very simple: train HMMs (hidden Markov models) and generate speech directly from them. Realizing such a speech synthesis system, however, requires several techniques: algorithms for speech parameter generation from HMMs and a mel-cepstrum-based vocoding technique are reviewed, and an approach to the simultaneous modeling of phonetic and prosodic parameters (spectrum, F0, and duration) is also presented. The main feature of the system is the use of dynamic features: by including dynamic coefficients in the feature vector, the speech parameter sequence generated at synthesis time is constrained to be realistic, as defined by the parameters of the HMMs. A key attraction of this approach is that the voice characteristics of the synthesized speech can easily be changed by transforming the HMM parameters. Indeed, it has been shown that the voice characteristics of synthetic speech can be changed by applying speaker adaptation techniques originally developed for speech recognition systems. The relationship between the HMM-based approach and concatenative speech synthesis approaches is also discussed. The talk presents not only the technical description but also recent results and demonstrations.
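To make the dynamic-feature constraint concrete, the following is a brief sketch of the widely used maximum-likelihood parameter generation formulation; the notation (c, W, q, \mu_q, \Sigma_q) is introduced here for illustration and does not appear in the abstract itself. Stacking the static parameter vectors c into observation vectors of static and dynamic coefficients via a window matrix, o = Wc, and maximizing the HMM output likelihood for a fixed state sequence q gives

\hat{c} = \arg\max_{c} \, \mathcal{N}\!\left(Wc;\, \mu_q,\, \Sigma_q\right)
\quad\Longrightarrow\quad
\left(W^{\top} \Sigma_q^{-1} W\right) \hat{c} = W^{\top} \Sigma_q^{-1} \mu_q .

Solving this banded linear system yields a smooth trajectory consistent with both the static means and the delta statistics of the HMMs, which is the sense in which the generated parameter sequence is "constrained to be realistic."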
