Abstract

This letter proposes an improved statistical parametric speech synthesis (SPSS) method that utilizes auxiliary information for acoustic modeling under the generalized distillation framework. In conventional SPSS, acoustic models are trained with context features as input and acoustic features as output. The proposed method involves two acoustic models, referred to as the teacher and the student, both of which are recurrent neural networks (RNNs) with bidirectional long short-term memory (BLSTM) units. The teacher, which aims to provide the student with additional knowledge, employs auxiliary features (e.g., articulatory features or short-time Fourier transform spectra) in addition to the conventional input and output of acoustic models. The student, which serves as the final acoustic model for synthesis, adopts a multitask learning architecture that uses the teacher's output as the target of its secondary task. Experimental results show that the proposed method achieves more accurate acoustic feature prediction and produces more natural synthetic speech than conventional BLSTM-RNN-based acoustic modeling with either single-task or multitask learning.
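
For concreteness, the following is a minimal sketch, not the authors' implementation, of the teacher-student arrangement described in the abstract, written with PyTorch BLSTM layers. Feature dimensions, hidden sizes, the interpolation weight alpha, and the use of mean squared error for both tasks are illustrative assumptions.

    # Minimal sketch of generalized-distillation acoustic modeling (assumed details).
    import torch
    import torch.nn as nn


    class TeacherBLSTM(nn.Module):
        """Teacher: maps context + auxiliary features to acoustic features."""
        def __init__(self, context_dim, aux_dim, acoustic_dim, hidden=256):
            super().__init__()
            self.blstm = nn.LSTM(context_dim + aux_dim, hidden,
                                 batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, acoustic_dim)

        def forward(self, context, aux):
            h, _ = self.blstm(torch.cat([context, aux], dim=-1))
            return self.out(h)


    class StudentBLSTM(nn.Module):
        """Student: multitask BLSTM with a primary head for the ground-truth
        acoustic features and a secondary head for the teacher's predictions."""
        def __init__(self, context_dim, acoustic_dim, hidden=256):
            super().__init__()
            self.blstm = nn.LSTM(context_dim, hidden,
                                 batch_first=True, bidirectional=True)
            self.primary = nn.Linear(2 * hidden, acoustic_dim)    # natural target
            self.secondary = nn.Linear(2 * hidden, acoustic_dim)  # teacher target

        def forward(self, context):
            h, _ = self.blstm(context)
            return self.primary(h), self.secondary(h)


    def student_loss(primary_pred, secondary_pred, acoustic_target, teacher_pred,
                     alpha=0.5):
        """Weighted multitask loss; alpha is an assumed interpolation weight."""
        mse = nn.functional.mse_loss
        return (1 - alpha) * mse(primary_pred, acoustic_target) \
            + alpha * mse(secondary_pred, teacher_pred.detach())

Under this reading, the teacher is trained first on context plus auxiliary inputs, and its frame-level predictions then serve as the secondary target for the student, which is trained from context features alone, so no auxiliary features are required at synthesis time.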
