Synthesis of F0 contours using generation process model parameters predicted from unlabeled corpora: application to emotional speech synthesis

Keikichi Hirose, Kentaro Sato, Yasufumi Asano, Nobuaki Minematsu

doi:10.1016/j.specom.2005.03.014

Abstract

A corpus-based method of generating fundamental frequency (F0) contours from text was developed for Japanese. Instead of directly predicting F0 values, the method predicts command values of the F0 contour generation process model using binary decision trees. Since the model controls the F0 movement over word-length or longer units, sudden undulations, which are unlikely in natural utterances, can be avoided even when predictions are erroneous. The method includes a scheme for extracting the model commands from given F0 contours, which makes it possible to prepare the training corpora for the binary decision trees automatically. Since the accuracy of the extracted model commands in the training corpora is crucial for the method, constraints are applied to the locations of commands. Although the method can generate any speaking style if corpora of that style are available, this paper aims at realizing three types of emotional speech (anger, joy, and sadness) in addition to calm speech. The mismatches between the predicted and target contours for angry speech were similar to those for calm speech. Synthesis of emotional speech was then conducted. Phoneme durations were predicted by a similar corpus-based method, and segmental features were generated using an HMM-based speech synthesizer. A perceptual experiment was conducted on the synthesized speech, and the results indicated that anger could be conveyed well by the developed method. The results were less satisfactory for joy and sadness.
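To illustrate why predicting model commands instead of raw F0 values yields smooth contours, the following minimal sketch assumes the generation process model takes its commonly cited form (often called the Fujisaki model): the logarithm of F0 is a baseline plus phrase components (impulse responses of a critically damped second-order system) and accent components (step responses with a ceiling). The abstract does not give the equations or constants, so the function names, the constants `alpha`, `beta`, `gamma`, and the command values below are illustrative assumptions, not values from the paper.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def f0_contour(times, fb, phrase_cmds, accent_cmds,
               alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 values (Hz) at the given times (s).

    fb          : baseline frequency in Hz
    phrase_cmds : list of (onset_time, magnitude) impulses
    accent_cmds : list of (onset_time, offset_time, amplitude) steps
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0, alpha)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1, beta, gamma)
                            - accent_response(t - t2, beta, gamma))
        contour.append(math.exp(log_f0))
    return contour

# Hypothetical commands: one phrase command and one accent command.
times = [i * 0.01 for i in range(200)]          # 2 s sampled every 10 ms
contour = f0_contour(times, fb=120.0,
                     phrase_cmds=[(0.0, 0.5)],
                     accent_cmds=[(0.3, 0.7, 0.4)])
```

Because every command is passed through a smooth response function, a mispredicted magnitude or timing shifts the contour gradually over a word-length or longer span rather than producing a frame-level jump.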

Full Text