Abstract

During the period 2000–2005, a Japanese female speaker recorded her everyday conversations with many different interlocutors using a head-set microphone, yielding 600 hours of natural Japanese speech data. This study describes a DNN-based speech synthesis system trained on 300 hours of that data, focusing on two efforts to make it more expressive in a human-like way: (1) allowing for disfluencies, and (2) accounting for the category of interlocutor. Incorporating frequently observed disfluent patterns of Japanese speech, such as fillers, phrase-final rising intonation, and word-internal prolongation or suspension, is believed to be effective in practical applications, since certain disfluencies convey a speaker's attitude in Japanese communication. For example, word-internal prolongation can signal hesitation or politeness, and word-internal suspension can signal surprise. Interlocutors in the original data were categorized into four groups: family, friend, child, and others. This information was used in training, so the synthesizer can generate different speaking styles according to the interlocutor setting. Being able to generate disfluent speech and to change the speaking style depending on the interlocutor makes the synthesizer even more expressive.
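The abstract does not specify how the interlocutor category enters the model, but a common approach in DNN-based synthesis is to append a one-hot category code to the frame-level input features. The following is a minimal sketch under that assumption; the category names come from the abstract, while the feature layout and function names are invented for illustration.

```python
# Hedged sketch, not the paper's actual implementation: one plausible way to
# condition an acoustic model on interlocutor category is to append a one-hot
# code to every frame of linguistic input features.

INTERLOCUTORS = ["family", "friend", "child", "others"]  # categories from the abstract


def interlocutor_onehot(category):
    """Map an interlocutor category to a 4-dimensional one-hot vector."""
    vec = [0.0] * len(INTERLOCUTORS)
    vec[INTERLOCUTORS.index(category)] = 1.0
    return vec


def condition_features(linguistic_features, category):
    """Append the interlocutor code to each frame so the acoustic model
    can learn interlocutor-dependent speaking styles during training."""
    code = interlocutor_onehot(category)
    return [frame + code for frame in linguistic_features]


# Example: two dummy 3-dimensional frames, conditioned "as if talking to a child".
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
conditioned = condition_features(frames, "child")
```

At synthesis time, the same code vector would be swapped out to select a different speaking style without retraining.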
