Abstract

With the rapid development of neural networks and deep learning, speech synthesis has improved significantly. End-to-end speech synthesis systems based on deep learning can now synthesize speech whose naturalness approaches that of human recordings. However, existing end-to-end models are complex and cannot synthesize speech in real time on devices with low computing power. This paper proposes a multi-band discriminative autoregressive speech synthesis model based on natural language processing. The model uses an encoder-decoder architecture with an attention mechanism and is built mainly from DSC-GRN modules. Stacking multiple dilated convolutions with different dilation rates inside a gated residual structure enlarges the receptive field, so the encoder and decoder can attend to context over a longer time span, which improves model performance. The whole model is fully convolutional and can be trained in parallel. Compared with existing autoregressive models, the proposed model greatly reduces the number of parameters and improves synthesis speed while preserving the quality of the synthesized speech.
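
To illustrate the gated residual structure described above, here is a minimal PyTorch sketch of one such block. It assumes "DSC" denotes a depthwise separable convolution and that the gating follows the common tanh/sigmoid form; the class name, channel count, and kernel size are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DSCGatedResidualBlock(nn.Module):
    """Sketch of one gated residual block over a dilated depthwise
    separable 1-D convolution (assumed reading of "DSC-GRN")."""

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        padding = (kernel_size - 1) * dilation // 2  # preserve sequence length
        # Depthwise separable convolution: a per-channel (depthwise) conv
        # followed by a 1x1 (pointwise) conv; the pointwise conv doubles
        # the channels so they can be split into filter and gate halves.
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   dilation=dilation, padding=padding,
                                   groups=channels)
        self.pointwise = nn.Conv1d(channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        h = self.pointwise(self.depthwise(x))
        filt, gate = h.chunk(2, dim=1)
        # Gated activation: tanh(filter) * sigmoid(gate), then residual add.
        return x + torch.tanh(filt) * torch.sigmoid(gate)

# Stacking blocks with dilations 1, 2, 4, 8 grows the receptive field
# exponentially with depth at a modest parameter cost.
stack = nn.Sequential(*[DSCGatedResidualBlock(64, 3, 2 ** i) for i in range(4)])
out = stack(torch.randn(1, 64, 200))  # (batch, channels, time)
```

Because each block preserves the sequence length and channel count, blocks compose freely into deep stacks, and the depthwise separable factorization keeps the parameter count far below that of an ordinary dense convolution of the same receptive field.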
