Abstract

This letter proposes an improved statistical parametric speech synthesis (SPSS) method that uses auxiliary information for acoustic modeling under the generalized distillation framework. In conventional SPSS, acoustic models are trained with context features as input and acoustic features as output. The proposed method involves two acoustic models, a teacher and a student, both of which are recurrent neural networks (RNNs) with bidirectional long short-term memory (BLSTM) units. The teacher, which aims to provide the student with additional knowledge, uses auxiliary features (e.g., articulatory features or short-time Fourier transform spectra) in addition to the conventional input and output of acoustic models. The student, which serves as the final acoustic model for synthesis, adopts a multitask learning architecture whose secondary task takes the teacher's output as its target. Experimental results show that this method predicts acoustic features more accurately and produces more natural synthetic speech than conventional BLSTM-RNN-based acoustic modeling with single-task or multitask learning.
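
To make the student's training objective concrete, the following is a minimal sketch of a two-task loss in which the secondary task is supervised by the teacher. It is an illustration under stated assumptions, not the letter's implementation: the abstract does not specify the loss function, the task weighting, or which teacher quantity serves as the target. Here both tasks use mean squared error, the hypothetical parameter `secondary_weight` balances them, and `teacher_output` stands in for whatever the teacher provides. All function and parameter names are hypothetical.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def student_multitask_loss(
    primary_pred: Sequence[float],
    acoustic_target: Sequence[float],
    secondary_pred: Sequence[float],
    teacher_output: Sequence[float],
    secondary_weight: float = 0.5,  # assumed weighting, not taken from the letter
) -> float:
    """Weighted sum of the primary and secondary task losses.

    Primary task: predict the ground-truth acoustic features.
    Secondary task: match the teacher's output (assumed to be a vector
    aligned with the student's secondary output layer).
    """
    primary_loss = mse(primary_pred, acoustic_target)
    secondary_loss = mse(secondary_pred, teacher_output)
    return primary_loss + secondary_weight * secondary_loss
```

Under this sketch, setting `secondary_weight` to 0 reduces the objective to single-task training, and the secondary output would simply be ignored at synthesis time, since only the primary output yields acoustic features.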
