Abstract

Conventional approaches to statistical parametric speech synthesis usually require a large amount of speech data. It is very difficult, however, for a person with an articulation disorder to utter a large amount of speech, and such utterances are often so unstable or unclear that they cannot be understood. In this paper, we propose a hybrid approach for a person with an articulation disorder that uses two models, one of a physically unimpaired person and one of a person with an articulation disorder, to generate an intelligible voice while preserving the individuality of the speaker with the articulation disorder. Our method consists of two parts: speech synthesis and voice conversion. Speech synthesis is used to obtain a speech signal of a physically unimpaired person and is trained on a large amount of that person's speech data. Voice conversion (VC) is then used to convert the voice of the physically unimpaired person into that of the person with an articulation disorder, and only a small amount of speech data from the person with the articulation disorder is needed to train the VC model. For VC, we employ a cycle-consistent adversarial network (CycleGAN), which does not require parallel data. An objective evaluation showed that the mel-cepstra obtained with our method are close to the target in terms of global variance (GV) and modulation spectrum (MS).
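To make the non-parallel VC step concrete, below is a minimal PyTorch sketch of the CycleGAN objective (least-squares adversarial losses plus a cycle-consistency loss) applied to frame-level mel-cepstral features. This is an illustrative sketch, not the paper's implementation: the network architectures (simple MLPs here), the feature dimension MCEP_DIM, and the weight LAMBDA_CYCLE are all assumptions.

import torch
import torch.nn as nn

MCEP_DIM = 24          # assumed mel-cepstrum order, not from the paper
LAMBDA_CYCLE = 10.0    # assumed cycle-consistency weight

def mlp(in_dim, out_dim):
    # Illustrative stand-in for the paper's generator/discriminator networks.
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, out_dim))

# Two generators: X -> Y (unimpaired -> articulation disorder) and Y -> X.
G_xy, G_yx = mlp(MCEP_DIM, MCEP_DIM), mlp(MCEP_DIM, MCEP_DIM)
# Two discriminators, one per speaker domain (scalar LSGAN output).
D_x, D_y = mlp(MCEP_DIM, 1), mlp(MCEP_DIM, 1)

def generator_loss(x, y):
    """Adversarial + cycle-consistency losses on unpaired batches x, y."""
    fake_y, fake_x = G_xy(x), G_yx(y)
    # Least-squares adversarial loss: each generator tries to make its
    # output look real to the corresponding discriminator.
    adv = ((D_y(fake_y) - 1) ** 2).mean() + ((D_x(fake_x) - 1) ** 2).mean()
    # Cycle consistency: X -> Y -> X and Y -> X -> Y must reconstruct the
    # input, which is what removes the need for parallel (paired) data.
    cyc = (G_yx(fake_y) - x).abs().mean() + (G_xy(fake_x) - y).abs().mean()
    return adv + LAMBDA_CYCLE * cyc

# Toy unpaired batches of mel-cepstral frames (random stand-in data).
x = torch.randn(32, MCEP_DIM)   # frames from the physically unimpaired speaker
y = torch.randn(32, MCEP_DIM)   # frames from the speaker with an articulation disorder
loss = generator_loss(x, y)
loss.backward()

Because the cycle-consistency term only requires that a frame survive a round trip through both generators, the two speakers never need to utter the same sentences, which is what allows training from the small, unpaired corpus described above.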
