Towards Speech Entrainment: Considering ASR Information in Speaking Rate Variation of TTS Waveform Generation

Mayuko Okamoto, Sakriani Sakti, Satoshi Nakamura

doi:10.1109/o-cocosda50338.2020.9295020

Abstract

State-of-the-art text-to-speech (TTS) systems produce speech with a high degree of intelligibility. However, unlike natural utterances, their synthesized speech is often monotonous. Several existing studies have addressed the modeling of speaking style variations in TTS, but few have considered the dialog and entrainment context. In this paper, we address TTS waveform generation toward speech entrainment in human-machine communication and focus on the synchronization of speaking rates that may vary within an utterance, e.g., slowing down to emphasize specific words and to set apart the elements being highlighted. We assume a dialog system exists and concentrate on its speech processing part. To perform such a task, we develop (1) a multi-task automatic speech recognition (ASR) system that listens to the conversation partner and recognizes both the content and the speaking rate and (2) a generative adversarial network (GAN)-based TTS system that produces the synthesized speech of the response while entraining to the partner's speaking rate. The evaluation is performed on a dialog corpus. Our results show that it is possible to entrain to the input speech by synchronizing the speaking rate.
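
As a rough illustration of what a speaking rate that varies within an utterance could look like in practice, the sketch below computes a per-word rate from word-level timings and turns the partner's average rate into a duration scaling factor for a response. The abstract does not state how the paper measures or transfers speaking rate, so the characters-per-second proxy, the `TimedWord` input format, the `default_rate` value, and both function names are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedWord:
    """A recognized word with start/end times in seconds (hypothetical format)."""
    text: str
    start: float
    end: float


def local_rates(words: List[TimedWord]) -> List[float]:
    """Per-word speaking rate in characters per second (a crude proxy)."""
    rates = []
    for w in words:
        duration = w.end - w.start
        if duration <= 0:
            raise ValueError(f"non-positive duration for {w.text!r}")
        rates.append(len(w.text) / duration)
    return rates


def duration_scale(partner_rate: float, default_rate: float) -> float:
    """Factor by which synthesized durations would be stretched (>1 means slower)."""
    if partner_rate <= 0 or default_rate <= 0:
        raise ValueError("rates must be positive")
    return default_rate / partner_rate


# Toy input: the partner slows down on the second word.
partner = [
    TimedWord("please", 0.0, 0.5),
    TimedWord("slowly", 0.5, 1.5),
]
rates = local_rates(partner)
mean_rate = sum(rates) / len(rates)
# default_rate is an assumed nominal rate for the synthesizer, not a value from the paper.
scale = duration_scale(mean_rate, default_rate=12.0)
```

In this toy input the second word is spoken at half the rate of the first, which is the kind of within-utterance variation the abstract describes. A scale above 1 would mean lengthening the synthesized response to move toward the partner's slower pace.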
