Development of an Indonesian audiovisual speech synthesis system for assisting children with delayed speech

Elok Anggrayni, Dhany Arifianto, Sangsaka Wira, Joko Sarwono, Nyilo Purnami

doi:10.1121/1.5146835

Abstract

Hearing impairment is one of the congenital conditions frequently found in children, and it is often followed by delayed speech. Moreover, the number of available speech therapists is limited. In this research, we outline the development of an Indonesian audio-visual speech synthesis system to support learning in deaf children with delayed speech. First, we developed two Indonesian corpora: a speech corpus and an audio-visual corpus. The speech corpus contains recordings of professional speech therapists, and the recorded Indonesian speech database totals more than 18 hours of audio. The audio-visual corpus contains visual phonemes (visemes), i.e., the lip shapes corresponding to Indonesian phonemes. Segmentation and labeling were performed to create transcriptions. We varied the number and type of sentences used to train the speech synthesizer. The audio-visual synthesis used a viseme concatenation method. Objective evaluation using mel-cepstral distortion yielded a score of 2.8, and subjective evaluation yielded a Mean Opinion Score (MOS) of 3.71. These results indicate that the new Indonesian audio-visual speech synthesis system, designed for learning to produce single meaningful words, could serve hospitals as an alternative tool in the therapy of patients with delayed speech.
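The abstract reports a mel-cepstral distortion of 2.8 but does not state which formulation was used. The sketch below assumes the commonly used definition, MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²), computed per frame and averaged, with the energy coefficient c0 excluded. It further assumes the reference and synthesized mel-cepstra are already time-aligned (for example by dynamic time warping); the function name and its `skip_c0` parameter are illustrative, not taken from the paper.

```python
import math
from typing import Sequence

# Constant of the common MCD definition: (10 / ln 10) * sqrt(2), giving dB.
_MCD_CONST = 10.0 / math.log(10.0) * math.sqrt(2.0)


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]],
    syn: Sequence[Sequence[float]],
    skip_c0: bool = True,
) -> float:
    """Mean frame-wise mel-cepstral distortion (dB) between two sequences.

    Assumption: `ref` and `syn` are already time-aligned (same number of
    frames, e.g. after DTW) and every frame has the same dimension.
    """
    if not ref or len(ref) != len(syn):
        raise ValueError("sequences must be non-empty and pre-aligned")
    start = 1 if skip_c0 else 0  # c0 (frame energy) is usually excluded
    total = 0.0
    for r, s in zip(ref, syn):
        if len(r) != len(s):
            raise ValueError("frames must have the same dimension")
        # Squared Euclidean distance over the retained coefficients.
        sq = sum((a - b) ** 2 for a, b in zip(r[start:], s[start:]))
        total += _MCD_CONST * math.sqrt(sq)
    return total / len(ref)
```

For instance, identical cepstra give 0.0, and a single frame differing by 1.0 in c1 gives about 6.14 dB. Lower values indicate synthesized spectra closer to the reference recordings.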
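The abstract names viseme concatenation as the audio-visual synthesis method without further detail. As a rough illustration only, the sketch below maps each phoneme to a viseme keyframe of lip-shape parameters, holds it for the phoneme's duration, and linearly blends the first few frames of each segment from the previous lip shape to smooth the joins. The viseme classes, phoneme mapping, parameter values, and the `blend` window are all hypothetical placeholders, not the paper's Indonesian viseme inventory.

```python
from typing import Dict, List, Optional, Sequence, Tuple

LipShape = Tuple[float, ...]

# HYPOTHETICAL lip-shape keyframes (mouth_open, mouth_width) in [0, 1].
# Placeholder values for illustration; not the paper's viseme set.
VISEME_KEYFRAMES: Dict[str, LipShape] = {
    "sil": (0.0, 0.5),       # neutral, closed
    "bilabial": (0.0, 0.4),  # lips pressed together
    "open": (1.0, 0.6),      # wide-open jaw
    "rounded": (0.5, 0.2),   # rounded lips
    "spread": (0.4, 0.9),    # spread lips
}

# HYPOTHETICAL many-to-one phoneme-to-viseme mapping, illustration only.
PHONEME_TO_VISEME: Dict[str, str] = {
    "sil": "sil", "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "a": "open", "o": "rounded", "u": "rounded", "i": "spread",
}


def concatenate_visemes(
    phones: Sequence[Tuple[str, int]],  # (phoneme, duration in frames)
    blend: int = 2,
) -> List[LipShape]:
    """Concatenate viseme keyframes into a per-frame lip trajectory."""
    frames: List[LipShape] = []
    prev: Optional[LipShape] = None
    for phone, dur in phones:
        target = VISEME_KEYFRAMES[PHONEME_TO_VISEME[phone]]
        for k in range(dur):
            if prev is not None and k < blend:
                # Weight grows from prev toward target over the blend window.
                w = (k + 1) / (blend + 1)
                frames.append(tuple(p + w * (t - p) for p, t in zip(prev, target)))
            else:
                frames.append(target)  # hold the viseme keyframe
        if frames:
            prev = frames[-1]  # next segment blends from the last emitted shape
    return frames
```

For example, `concatenate_visemes([("sil", 2), ("a", 3)])` yields five frames: two neutral frames, two interpolated frames, and then the open-mouth keyframe. Linear blending is only the simplest choice; a real system might instead use recorded transition units or smoother interpolation curves.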
