Abstract

Text-to-speech (TTS) synthesis generates speech from input text. TTS is important for assisting people with impairments and for supporting teaching and learning. However, implementing TTS for the Afaan Oromoo language poses many challenges, including text processing, time-to-phoneme mapping, and acoustic modeling. Text-to-speech synthesis is therefore much needed for the development of the language. Using natural language processing, paired text and speech inputs are used to generate the desired speech waveforms from a prepared text corpus. The text was normalized, and linguistic features were extracted from it with the Festival toolkit. Festival was also used to label the text and to generate utterances from scheme-file parameters, and the resulting phoneme-aligned labels were matched with the speech corpus for training and testing. Forced alignment was performed with the HTK toolkit, which was used to prepare the environment, check the data, and provide state-level alignment timestamps for acoustic feature extraction. This study focuses on a deep learning approach to TTS for Afaan Oromoo based on a bidirectional long short-term memory recurrent neural network (BLSTM-RNN). Given an input feature sequence, the RNN is used to build a duration model and an acoustic model. The BLSTM-RNN was implemented with the PyTorch library in a Jupyter notebook to create the duration model and to generate speech samples from the trained acoustic model. We prepared a corpus of 1,000 sentences with matching transcriptions from an Afaan Oromoo speech corpus recorded by a single female speaker (speaker-dependent), using 700 sentences for training and 300 sentences for testing from the dataset domains. Two evaluation techniques were used. First, the Mean Opinion Score (MOS) was used to evaluate the intelligibility and naturalness of the synthesized speech.
Second, Mel Cepstral Distortion (MCD), which is widely used for objective evaluation of TTS models, was computed. In the subjective evaluation, the synthesized speech scored 3.77 for intelligibility and 3.76 for naturalness. In the objective evaluation, on the 16 kHz speech corpus, the average MCD was 3.89 for the BLSTM-RNN model and 3.71 for the Merlin-generated waveforms.

Keywords: Text To Speech Synthesis, Mel Cepstral Distortion (MCD), Mean Opinion Score (MOS), Bidirectional Long Short Term Memory Recurrent Neural Network (BLSTM-RNN)

DOI: 10.7176/NMMC/101-02

Publication date: April 30th, 2022
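For readers unfamiliar with the objective metric named in the abstract, the following is a minimal sketch of how Mel Cepstral Distortion is commonly computed. It is not the authors' code. It assumes the reference and synthesized mel-cepstral frames are already time-aligned and of equal length, and it follows the common convention of skipping the 0th (energy) coefficient. The function name `mel_cepstral_distortion` and the list-of-lists input layout are illustrative choices, not taken from the paper.

```python
import math


def mel_cepstral_distortion(reference, synthesized):
    """Return the mean MCD in dB over time-aligned frames.

    Assumptions (not from the paper): each frame is a sequence of
    mel-cepstral coefficients, both inputs have the same number of
    frames, and coefficient 0 (energy) is excluded from the distance.
    """
    if len(reference) != len(synthesized):
        raise ValueError("frame sequences must be aligned to equal length")
    if not reference:
        raise ValueError("at least one frame is required")

    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        # Squared Euclidean distance over coefficients 1..D
        squared = sum((r - s) ** 2 for r, s in zip(ref_frame[1:], syn_frame[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(reference)
```

Lower values indicate that the synthesized spectrum is closer to the reference. In practice the two utterances usually differ in length, so an alignment step such as dynamic time warping would be applied before calling a function like this one.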

