Abstract

This paper investigates several issues in spontaneous speech synthesis. Although state‐of‐the‐art synthesis systems can achieve highly intelligible speech, their naturalness remains low, so much work must still be done to achieve the goal of synthesizing natural, spontaneous speech. To model spontaneous speech with a limited amount of data, we used an HMM‐based speech synthesizer that models three feature streams: cepstral features, modeled by HMMs, and duration and fundamental frequency (F0), modeled using Quantification Theory Type I. The models were trained on approximately 17 min of spontaneous lecture speech by a single speaker, extracted from the Corpus of Spontaneous Japanese (CSJ). For comparison, utterances by the same speaker reading a transcription of the same lecture were used to train analogous models for read speech. The spontaneity of the synthesized speech was evaluated by subjective pair comparison tests. Results from 18 subjects showed that the preference score for the synthesized spontaneous speech was significantly higher than that for the synthesized read speech. This implies that HMM‐based speech synthesis trained on actual spontaneous utterances is effective at producing natural-sounding speech. Additional subjective evaluation tests were also conducted to analyze the effects of individual features on the impression of spontaneity.
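As background for the modeling choice above: Quantification Theory Type I is, in essence, least-squares linear regression on dummy-coded (one-hot) categorical factors, which is why it suits prosodic features such as duration and F0 that depend on linguistic context. The sketch below illustrates the technique only; the factor names (accent type, phrase-final position) and the synthetic data are illustrative assumptions, not the paper's actual control factors or corpus.

```python
import numpy as np

def qt1_fit(factor_levels, y):
    """Quantification Theory Type I as dummy-coded least squares.

    factor_levels: list of 1-D integer arrays, one per categorical factor;
                   entry i gives the level index of sample i for that factor.
    y: 1-D array of observed values (e.g. phone durations in seconds).
    Returns the design matrix X and the fitted coefficients.
    """
    n = len(y)
    cols = [np.ones((n, 1))]           # global bias term
    for levels in factor_levels:
        k = int(levels.max()) + 1
        onehot = np.zeros((n, k))
        onehot[np.arange(n), levels] = 1.0
        cols.append(onehot[:, 1:])     # drop one level per factor (reference level)
    X = np.hstack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X, coef

# Toy example with two hypothetical factors and synthetic durations.
rng = np.random.default_rng(0)
f1 = rng.integers(0, 3, 200)           # e.g. accent type (3 levels) -- assumed factor
f2 = rng.integers(0, 2, 200)           # e.g. phrase-final or not    -- assumed factor
y = 0.10 + 0.02 * f1 + 0.05 * f2 + rng.normal(0.0, 0.005, 200)
X, coef = qt1_fit([f1, f2], y)
residual_mse = float(np.mean((X @ coef - y) ** 2))
print(residual_mse)                    # close to the noise variance (0.005**2)
```

The design choice of dropping one level per factor avoids collinearity with the bias column; the dropped level acts as the reference against which the other levels' offsets are estimated.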
