Abstract

State-of-the-art text-to-speech (TTS) systems successfully produce speech with a high degree of intelligibility. But TTS systems still often generate monotonous synthesized speech, unlike natural utterances. Several existing studies have addressed the issue of modeling speaking style variations in TTSs. Unfortunately, scant research has discussed the dialog and entrainment context. In this paper, we address TTS waveform generation toward speech entrainment in human-machine communication and focus on the synchronization of speaking rates that may vary within an utterance, i.e., slowing down to emphasize specific words and distinguish elements to highlight. We assume a dialog system exists and concentrate on its speech processing part. To perform such a task, we develop (1) a multi-task automatic speech recognition (ASR) that listens to the conversation partner and recognizes the content and the speaking rate and (2) a generative adversarial network (GAN)-based TTS that produces the synthesized speech of the response while entraining with the partner's speaking rate. The evaluation is performed on a dialog corpus. Our results reveal that it is possible to entrain the input speech by synchronizing the speaking rate.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.