Abstract
Conventional approaches to statistical parametric speech synthesis usually require a large amount of speech data. For persons with articulation disorders, however, uttering a large amount of speech is very difficult, and their utterances are often unstable or unclear, making them hard to understand. In this paper, we propose a hybrid approach for a person with an articulation disorder that uses two models, one of a physically unimpaired person and one of a person with an articulation disorder, to generate an intelligible voice while preserving the individuality of the speaker with the articulation disorder. Our method consists of two stages: a speech synthesis stage and a voice conversion stage. Speech synthesis is used to obtain a speech signal in the voice of a physically unimpaired person, trained on a large amount of that person's speech data. Voice conversion (VC) is then used to convert the voice of the physically unimpaired person into that of the person with the articulation disorder, and only a small amount of speech data from the person with the articulation disorder is needed to train the VC model. For VC, we employ a cycle-consistent adversarial network (CycleGAN), which does not require parallel data. An objective evaluation showed that the mel-cepstra obtained with our method are close to the target in terms of global variance (GV) and modulation spectrum (MS).
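To make the two-stage structure concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of the conversion stage: a CycleGAN-style generator that maps mel-cepstral features produced by a text-to-speech model of the unimpaired speaker toward the target speaker with the articulation disorder. All module names, layer sizes, and the feature dimension are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Toy 1-D convolutional generator G: unimpaired -> disordered mel-cepstra."""
    def __init__(self, mcep_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(mcep_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),  # CycleGAN-based VC typically uses gated units; ReLU keeps the sketch simple
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, mcep_dim, kernel_size=5, padding=2),
        )

    def forward(self, mcep: torch.Tensor) -> torch.Tensor:
        # mcep: (batch, mcep_dim, frames) mel-cepstral sequence from the synthesis stage
        return self.net(mcep)

# Usage sketch: a feature sequence from a hypothetical TTS model trained on the
# unimpaired speaker is passed through the generator to obtain features that
# carry the disordered speaker's individuality (stage 2, non-parallel VC).
tts_mcep = torch.randn(1, 40, 200)   # stand-in for the synthesis-stage output
converted = Generator()(tts_mcep)
print(converted.shape)               # torch.Size([1, 40, 200])
```

In training, such a generator would be paired with a reverse generator and two discriminators under adversarial and cycle-consistency losses, which is what removes the need for parallel utterances between the two speakers.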