Abstract

Statistical parametric speech synthesis techniques, such as hidden Markov model (HMM)- and deep neural network (DNN)-based synthesis, have grown in popularity over the last decade relative to concatenative approaches; they model the excitation and spectral parameters of speech to synthesize waveforms from written text. Owing to inadequate acoustic modelling, speech synthesized by HMM-based systems sounds muffled. DNN-based synthesis improves the acoustic model by replacing the decision trees of HMM systems with a powerful regression model, and the performance of a deep neural network is further enhanced by pre-training with either restricted Boltzmann machines (RBMs) or autoencoders. RBMs can capture the multi-modal nature of speech, but because they do not account for reconstruction error, they introduce spectral distortion in the synthesized waveforms. This article proposes a deep neural network model, pre-trained with stacked denoising autoencoders, to map the speech parameters of the Punjabi language. Denoising autoencoders work by adding noise to the training data and then reconstructing the original measurements so as to minimize the reconstruction error. The voice synthesized with the proposed model achieved a VARN of 0.82, an F0 RMSE of 9.03 Hz, and a V/UV error rate of 4.04%.
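To make the pre-training idea concrete, the following is a minimal sketch of a single denoising autoencoder layer in NumPy: the input is corrupted with Gaussian noise, the network reconstructs the clean input, and the weights are updated to reduce the mean squared reconstruction error. This is illustrative only, not the authors' implementation; the architecture (one tied-weight layer, linear decoder), noise level, and learning rate are all assumptions, and the random vectors stand in for real acoustic feature frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """One layer of a stacked denoising autoencoder (illustrative sketch)."""

    def __init__(self, n_in, n_hidden, noise_std=0.1, lr=0.1):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))  # tied weights
        self.b = np.zeros(n_hidden)                      # encoder bias
        self.c = np.zeros(n_in)                          # decoder bias
        self.noise_std = noise_std
        self.lr = lr

    def step(self, x):
        # Corrupt the input, then reconstruct the *clean* input.
        x_noisy = x + rng.normal(0.0, self.noise_std, x.shape)
        h = sigmoid(x_noisy @ self.W + self.b)   # encoder
        x_hat = h @ self.W.T + self.c            # linear decoder, tied weights
        err = x_hat - x                          # reconstruction error

        # Gradients of the mean squared reconstruction error
        dh = (err @ self.W) * h * (1.0 - h)
        grad_W = (x_noisy.T @ dh + err.T @ h) / len(x)
        grad_b = dh.mean(axis=0)
        grad_c = err.mean(axis=0)

        self.W -= self.lr * grad_W
        self.b -= self.lr * grad_b
        self.c -= self.lr * grad_c
        return float((err ** 2).mean())

# Toy stand-ins for acoustic parameter vectors (20-dimensional frames)
X = rng.normal(0.0, 1.0, (256, 20))
dae = DenoisingAutoencoder(n_in=20, n_hidden=10)
losses = [dae.step(X) for _ in range(200)]
print(losses[0], "->", losses[-1])  # reconstruction error shrinks over training
```

In a stacked setting, the hidden activations `h` of one trained layer would be fed as input to the next denoising autoencoder, and the resulting weights would initialize the corresponding layers of the DNN acoustic model before supervised fine-tuning.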
