Abstract

This paper studies methods for emotional statistical parametric speech synthesis (SPSS) using recurrent neural networks (RNNs) with long short-term memory (LSTM) units. Two modeling approaches, emotion-dependent modeling and unified modeling with emotion codes, are implemented and compared experimentally. In the first approach, LSTM-RNN-based acoustic models are built separately for each emotion type. A speaker-independent acoustic model estimated from the speech data of multiple speakers is adopted to initialize the emotion-dependent LSTM-RNNs. Inspired by the speaker code techniques developed for speech recognition and speech synthesis, the second approach builds a unified LSTM-RNN-based acoustic model using training data from a variety of emotion types. In the unified LSTM-RNN model, an emotion code vector is input to all model layers to indicate the emotion characteristics of the current utterance. Experimental results on an emotional speech synthesis database with four emotion types (neutral style, happiness, anger, and sadness) show that both approaches achieve significantly better naturalness of synthetic speech than HMM-based emotion-dependent modeling. The emotion-dependent modeling approach outperforms both the unified modeling approach and HMM-based emotion-dependent modeling in terms of subjective emotion classification rates for synthetic speech. Furthermore, the emotion codes used by the unified modeling approach can effectively control the emotion type and intensity of synthetic speech by interpolating and extrapolating the codes in the training set.
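
The interpolation and extrapolation of emotion codes mentioned above can be illustrated with a minimal sketch. It assumes one-hot codes over the four emotion types and a simple linear blend; the abstract does not specify the code format or the blending rule, so the names `EMOTIONS`, `one_hot`, and `blend_codes` are hypothetical and are not the authors' implementation. In the unified model, the resulting vector would be supplied to every layer alongside that layer's usual input.

```python
# Assumption: one-hot emotion codes; the abstract does not give the actual code format.
EMOTIONS = ["neutral", "happiness", "anger", "sadness"]


def one_hot(emotion):
    """Return an assumed one-hot code for one of the training-set emotions."""
    return [1.0 if name == emotion else 0.0 for name in EMOTIONS]


def blend_codes(source, target, alpha):
    """Linearly blend two emotion codes (an illustrative rule, not the paper's).

    alpha in [0, 1] interpolates between the two codes;
    alpha > 1 extrapolates past the target to exaggerate its intensity.
    """
    src, tgt = one_hot(source), one_hot(target)
    return [(1.0 - alpha) * s + alpha * t for s, t in zip(src, tgt)]


# A code halfway between neutral and happy (reduced intensity).
half_happy = blend_codes("neutral", "happiness", 0.5)

# A code beyond the happy training code (increased intensity).
extra_happy = blend_codes("neutral", "happiness", 1.5)

print(half_happy)
print(extra_happy)
```

Under these assumptions, values of `alpha` between 0 and 1 correspond to weaker versions of the target emotion, while values above 1 push the code outside the range seen in training.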
