Abstract
Recent advanced text-to-speech (TTS) systems synthesize natural-sounding speech. In many applications, however, it is desirable to synthesize utterances in a specific style. In this paper, we investigate synthesizing audio in three styles (newscasting, public speaking, and storytelling) for a speaker who provides only neutral speech data. First, a considerable amount of speech data was collected from the neutral speaker, and small amounts of speech in the desired styles were collected from other speakers, such that no speaker uttered in more than one style. All of these data were used to train a basic multi-style, multi-speaker TTS model. Second, augmented audio was created on the fly with the latest TTS model during its training and was used to further train the model. Specifically, augmented data were created by 'forcing' a speaker to imitate the stylish speech of the other three speakers, requiring their attention alignment matrices to be as similar as possible. Objective evaluation of the rhythm and pitch profiles of the synthesized speech shows that the TTS model trained with our proposed data augmentation successfully transfers speech styles in these respects. Subjective ABX evaluation also shows that stylish speech synthesized by our proposed method is overwhelmingly preferred over that from a baseline TTS model, by margins of 40-60%.
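The abstract states only that the augmentation 'forces' attention alignment matrices to be as similar as possible, without specifying the similarity measure. A minimal sketch of one plausible choice, a mean-squared distance between two alignment matrices of equal shape, is shown below; the function name and the MSE criterion are illustrative assumptions, not the paper's confirmed formulation.

```python
import numpy as np

def alignment_loss(align_neutral, align_style):
    """Hypothetical alignment-similarity loss (MSE between matrices).

    Both inputs are attention alignment matrices of shape
    (decoder_steps, encoder_steps) with rows summing to 1; in practice
    the two utterances would need matching decoder lengths or an
    interpolation step, which this sketch omits.
    """
    assert align_neutral.shape == align_style.shape
    return float(np.mean((align_neutral - align_style) ** 2))

# Toy example: identical (diagonal) alignments give zero loss.
a = np.eye(4)
print(alignment_loss(a, a))  # → 0.0
```

Minimizing such a loss during augmentation would push the neutral speaker's synthesized alignment toward that of the stylish reference, which is one way to read the 'as similar as possible' requirement.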