Abstract

We propose an auxiliary categorization framework for training speech synthesis systems based on deep neural networks (DNNs) and recurrent neural networks (RNNs). The adopted artificial neural networks (ANNs) are regression models comprising a few hidden layers and an affine-transform layer that maps contextual features to a set of speech synthesis parameters. To incorporate categorical information into ANN training, similar to DNN-based speech recognition, the proposed approach stacks a secondary classification layer on top of the hidden layers of the regression ANN and trains it jointly with the primary affine-transform layer. Four categorization tasks are considered: classification of voicing, phonation position, phone identity, and hidden Markov model (HMM) state. Experimental results show that the proposed framework reduces the root mean square error (RMSE) of the generated log fundamental frequency by about 10.8% and 4.3% for the DNN- and RNN-based synthesis systems, respectively. With the extra classification layers, subjective listening tests also favor the DNN- and RNN-generated speech by about 24% and 15%, respectively, over ANN baselines that use no categorical information.
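A minimal sketch of the joint-training idea described above, written in PyTorch: shared hidden layers feed both the primary affine regression head and a secondary classification head, and the two losses are summed during training. The layer sizes, activation choice, loss weighting, and label setup are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AuxClassificationSynthesisNet(nn.Module):
    def __init__(self, n_context, n_hidden, n_speech_params, n_classes):
        super().__init__()
        # Shared hidden layers over the contextual input features.
        self.hidden = nn.Sequential(
            nn.Linear(n_context, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_hidden), nn.Tanh(),
        )
        # Primary head: affine transform to speech synthesis parameters.
        self.regress = nn.Linear(n_hidden, n_speech_params)
        # Secondary head: classification layer stacked on the hidden layers
        # (e.g. voicing, phone identity, or HMM-state labels).
        self.classify = nn.Linear(n_hidden, n_classes)

    def forward(self, x):
        h = self.hidden(x)
        return self.regress(h), self.classify(h)

# Dimensions below are assumed for illustration only.
model = AuxClassificationSynthesisNet(n_context=300, n_hidden=512,
                                      n_speech_params=60, n_classes=5)
mse, xent = nn.MSELoss(), nn.CrossEntropyLoss()
aux_weight = 0.1  # assumed relative weight of the auxiliary loss

x = torch.randn(8, 300)               # contextual features (batch of 8)
y_params = torch.randn(8, 60)         # target speech parameters
y_labels = torch.randint(0, 5, (8,))  # categorical targets, e.g. voicing

pred_params, pred_logits = model(x)
# Joint training: the auxiliary classification loss shapes the shared
# hidden layers alongside the primary regression objective.
loss = mse(pred_params, y_params) + aux_weight * xent(pred_logits, y_labels)
loss.backward()
```

A recurrent variant would replace the shared feed-forward stack with an RNN over frame sequences while keeping both output heads unchanged.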
