Abstract

To build spoken dialog systems that realize natural human-computer interaction, speech synthesis systems must be able to generate speech in an arbitrary speaker's voice and with various speaking styles and/or emotional expressions. Although state-of-the-art speech synthesis systems based on unit selection and concatenation can generate natural-sounding speech, it is still difficult to synthesize a variety of voices flexibly because such systems need a large-scale speech corpus for each voice. In recent years, a corpus-based speech synthesis approach based on hidden Markov models (HMMs) has been developed, which has the following features: (1) the original speaker's characteristics can easily be reproduced because all speech features, not only spectral parameters but also fundamental frequencies and durations, are modeled in a unified HMM framework and then generated from the trained HMMs themselves; (2) using a very small amount of adaptation data, voice characteristics can easily be modified by transforming the HMM parameters with a speaker adaptation technique used in speech recognition systems. Owing to these features, the HMM-based speech synthesis approach is expected to serve as a tool for constructing communicative spoken dialog systems. With this viewpoint in mind, the basic algorithms and techniques of HMM-based speech synthesis are reviewed.
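As an illustration of feature (2), the minimal sketch below shows how voice characteristics might be altered by applying an MLLR-style linear transform (each Gaussian mean mu mapped to A mu + b) to the mean vectors of a trained HMM set. The `adapt_state_means` helper, the array shapes, and the transform values are illustrative assumptions rather than the system described in the paper, and the estimation of A and b from adaptation data is omitted.

```python
import numpy as np

def adapt_state_means(means, A, b):
    """MLLR-style speaker adaptation sketch: map each Gaussian mean mu to A @ mu + b.

    means : (n_states, dim) state mean vectors of the source ("average voice") HMMs
    A, b  : regression matrix and bias, which would be estimated from a small
            amount of target-speaker adaptation data (estimation not shown here)
    """
    return means @ A.T + b

# Toy example with 3 HMM states and 4-dimensional mean vectors.
rng = np.random.default_rng(0)
source_means = rng.normal(size=(3, 4))          # source-speaker model parameters
A = np.eye(4) + 0.05 * rng.normal(size=(4, 4))  # illustrative transform matrix
b = 0.1 * rng.normal(size=4)                    # illustrative bias vector

adapted_means = adapt_state_means(source_means, A, b)
print(adapted_means.shape)  # (3, 4): same model topology, shifted voice characteristics
```

Because spectrum, fundamental frequency, and duration are all represented as HMM parameters in the unified framework, the same kind of transform would in principle apply to each of these streams, not only to the spectral means shown here.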
