Abstract

Statistical parametric speech synthesis techniques, such as hidden Markov model (HMM)- and deep neural network (DNN)-based synthesis, have grown in popularity over the last decade relative to concatenative approaches; they model the excitation and spectral parameters of speech to synthesize waveforms from written text. Owing to inadequate acoustic modelling, speech synthesized by HMM-based systems sounds muffled. DNN-based synthesis improves the acoustic model by replacing the decision trees of HMM systems with a powerful regression model, and the performance of a deep neural network is further enhanced by pre-training with either restricted Boltzmann machines (RBMs) or autoencoders. RBMs can capture the multi-modal nature of speech, but because they do not account for reconstruction error, they introduce spectral distortion in the synthesized waveforms. This article proposes a deep neural network model, pre-trained with stacked denoising autoencoders, to map the speech parameters of the Punjabi language. Denoising autoencoders work by adding noise to the training data and then reconstructing the original measurements so as to minimize the reconstruction error. The voice synthesized with the proposed model achieved a VARN of 0.82, an F0 RMSE of 9.03 Hz, and a V/UV error rate of 4.04%.
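To make the pre-training idea concrete, the following is a minimal sketch of a single denoising autoencoder layer in NumPy: the input is corrupted with Gaussian noise, the network reconstructs the clean input, and the weights are updated to reduce the mean squared reconstruction error. This is illustrative only, not the authors' implementation; the architecture (one tied-weight layer, linear decoder), noise level, and learning rate are all assumptions, and the random vectors stand in for real acoustic feature frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """One layer of a stacked denoising autoencoder (illustrative sketch)."""

    def __init__(self, n_in, n_hidden, noise_std=0.1, lr=0.1):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))  # tied weights
        self.b = np.zeros(n_hidden)                      # encoder bias
        self.c = np.zeros(n_in)                          # decoder bias
        self.noise_std = noise_std
        self.lr = lr

    def step(self, x):
        # Corrupt the input, then reconstruct the *clean* input.
        x_noisy = x + rng.normal(0.0, self.noise_std, x.shape)
        h = sigmoid(x_noisy @ self.W + self.b)   # encoder
        x_hat = h @ self.W.T + self.c            # linear decoder, tied weights
        err = x_hat - x                          # reconstruction error

        # Gradients of the mean squared reconstruction error
        dh = (err @ self.W) * h * (1.0 - h)
        grad_W = (x_noisy.T @ dh + err.T @ h) / len(x)
        grad_b = dh.mean(axis=0)
        grad_c = err.mean(axis=0)

        self.W -= self.lr * grad_W
        self.b -= self.lr * grad_b
        self.c -= self.lr * grad_c
        return float((err ** 2).mean())

# Toy stand-ins for acoustic parameter vectors (20-dimensional frames)
X = rng.normal(0.0, 1.0, (256, 20))
dae = DenoisingAutoencoder(n_in=20, n_hidden=10)
losses = [dae.step(X) for _ in range(200)]
print(losses[0], "->", losses[-1])  # reconstruction error shrinks over training
```

In a stacked setting, the hidden activations `h` of one trained layer would be fed as input to the next denoising autoencoder, and the resulting weights would initialize the corresponding layers of the DNN acoustic model before supervised fine-tuning.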
