Abstract

Adaptation and manipulation techniques for creating various characteristics of synthetic speech are important research topics in the speech synthesis field. In this work, we investigate the performance of a DNN-based text-to-speech synthesis system that takes speaker, gender, and age codes in addition to text inputs (1) for building speaker-independent models called “average voice models,” (2) for performing speaker adaptation using a small amount of adaptation data, and (3) for manipulating characteristics of synthetic speech based on the codes. For these purposes, we extracted a set of studio-quality speech data uttered by 68 males and 70 females, whose ages range from 10 to 80, from our large-scale Japanese corpus and carried out three experiments. (1) We constructed a DNN-based speaker-independent model using one-hot vectors representing the above speakers. (2) We performed speaker adaptation by estimating a code vector for a new speaker via back-propagation. (3) We manually manipulated the code vector to modify the perceived speaker characteristics, gender, and/or age of the synthetic speech. Experimental results showed that high-performance speaker-independent models can be constructed using the proposed code vectors and that adaptation and manipulation using the codes can also be performed effectively.
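As a rough illustration of experiments (2) and (3), the following Python sketch estimates a code vector for a new speaker by back-propagating the error to the code inputs while the network weights stay frozen, and then edits one entry of the code by hand. It is not the authors' implementation. The layer sizes, the random stand-in weights, the code layout (speaker one-hot, then one gender value and one age value), the learning rate, the synthetic data, and the names `forward`, `loss`, and `adapt_code` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; a real system uses far larger inputs and networks.
N_TEXT, N_SPK, N_HID, N_OUT = 8, 3, 16, 3
N_CODE = N_SPK + 2  # assumed layout: [speaker one-hot | gender | age]

# Stand-in for a trained average voice model. The weights are random here
# and stay frozen during adaptation.
W1 = rng.normal(scale=0.3, size=(N_TEXT + N_CODE, N_HID))
b1 = np.zeros(N_HID)
W2 = rng.normal(scale=0.3, size=(N_HID, N_OUT))
b2 = np.zeros(N_OUT)


def forward(text_feats, code):
    """Map per-frame text features plus one shared code vector to acoustic features."""
    codes = np.tile(code, (text_feats.shape[0], 1))
    inputs = np.concatenate([text_feats, codes], axis=1)
    hidden = np.tanh(inputs @ W1 + b1)
    return hidden @ W2 + b2, hidden


def loss(text_feats, targets, code):
    """Half the squared error per frame, averaged over frames."""
    pred, _ = forward(text_feats, code)
    return 0.5 * float(np.mean(np.sum((pred - targets) ** 2, axis=1)))


def adapt_code(text_feats, targets, init_code, steps=1000, lr=0.02):
    """Gradient descent on the code vector only; network weights are untouched."""
    code = np.array(init_code, dtype=float)
    for _ in range(steps):
        pred, hidden = forward(text_feats, code)
        err = pred - targets
        grad_hidden = (err @ W2.T) * (1.0 - hidden ** 2)   # back through tanh
        grad_inputs = grad_hidden @ W1.T                   # back to the input layer
        code -= lr * grad_inputs[:, N_TEXT:].mean(axis=0)  # keep only the code part
    return code


# Synthetic "adaptation data" from an unseen speaker with a hidden code.
text_feats = rng.normal(size=(50, N_TEXT))
hidden_code = np.array([0.2, 0.7, 0.1, 1.0, 0.4])
targets, _ = forward(text_feats, hidden_code)

init_code = np.array([1 / 3, 1 / 3, 1 / 3, 0.5, 0.5])  # neutral starting point
adapted_code = adapt_code(text_feats, targets, init_code)
loss_before = loss(text_feats, targets, init_code)
loss_after = loss(text_feats, targets, adapted_code)

# Manual manipulation: overwrite the assumed age entry and resynthesize.
older_code = adapted_code.copy()
older_code[-1] = 0.9
older_output, _ = forward(text_feats, older_code)
```

Because only the few code entries are updated, this style of adaptation has very few free parameters, which is consistent with the abstract's setting of a small amount of adaptation data. Comparing `loss_before` and `loss_after` shows whether the estimated code fits the new speaker better than the neutral starting point.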
