Abstract
Conventional approaches to statistical parametric speech synthesis usually require a large amount of speech data. For persons with articulation disorders, however, uttering a large amount of speech is very difficult, and their utterances are often unstable or unclear, making them hard to understand. In this paper, we propose a hybrid approach for a person with an articulation disorder that uses two models, one of a physically unimpaired person and one of a person with an articulation disorder, to generate an intelligible voice while preserving the individuality of the speaker with the articulation disorder. Our method consists of two stages: a speech synthesis stage and a voice conversion stage. Speech synthesis is used to obtain a speech signal in the voice of a physically unimpaired person, trained on a large amount of that person's speech data. Voice conversion (VC) is then used to convert the voice of the physically unimpaired person into that of the person with the articulation disorder, and only a small amount of speech data from the person with the articulation disorder is needed to train the VC model. For VC, we employ a cycle-consistent adversarial network (CycleGAN), which does not require parallel data. An objective evaluation showed that the mel-cepstra obtained with our method are close to the target in terms of global variance (GV) and modulation spectrum (MS).
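To make the two-stage structure concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of the conversion stage: a CycleGAN-style generator that maps mel-cepstral features produced by a text-to-speech model of the unimpaired speaker toward the target speaker with the articulation disorder. All module names, layer sizes, and the feature dimension are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Toy 1-D convolutional generator G: unimpaired -> disordered mel-cepstra."""
    def __init__(self, mcep_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(mcep_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),  # CycleGAN-based VC typically uses gated units; ReLU keeps the sketch simple
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, mcep_dim, kernel_size=5, padding=2),
        )

    def forward(self, mcep: torch.Tensor) -> torch.Tensor:
        # mcep: (batch, mcep_dim, frames) mel-cepstral sequence from the synthesis stage
        return self.net(mcep)

# Usage sketch: a feature sequence from a hypothetical TTS model trained on the
# unimpaired speaker is passed through the generator to obtain features that
# carry the disordered speaker's individuality (stage 2, non-parallel VC).
tts_mcep = torch.randn(1, 40, 200)   # stand-in for the synthesis-stage output
converted = Generator()(tts_mcep)
print(converted.shape)               # torch.Size([1, 40, 200])
```

In training, such a generator would be paired with a reverse generator and two discriminators under adversarial and cycle-consistency losses, which is what removes the need for parallel utterances between the two speakers.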