Abstract

Fine-grained emotion strength control and prediction has recently been studied in text-to-speech to adjust local emotion intensity within an utterance. Due to the lack of fine-grained emotion strength labelling data, an emotion or style strength extractor is usually learned at the whole-utterance scale through a ranking function. However, such an utterance-level extractor is then used to provide fine-grained emotion strength labels, on which a fine-grained emotional speech synthesis model is separately conditioned and trained. To bridge this granularity gap between emotion strength extraction and emotional speech generation, we design a simple yet effective component called Emotion Gate that learns fine-grained emotion strengths in an end-to-end way; these strengths are then used to create scaled emotion representations that condition emotional speech synthesis. Furthermore, besides being predicted by a jointly trained emotion strength predictor, the fine-grained emotion strengths can also be manually assigned and controlled during inference. The proposed method is evaluated in both non-transferred emotional speech synthesis and cross-speaker transfer scenarios. Both objective and subjective evaluations show the effectiveness and superiority of the proposed method over state-of-the-art baseline systems. The audio samples from our experiments can be found on the demo page: https://kingstorm.github.io/emotiongate/ .
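The abstract describes the Emotion Gate only at a high level: it learns fine-grained emotion strengths end-to-end and uses them to scale an emotion representation that conditions synthesis. Below is a minimal, hypothetical sketch of that idea in PyTorch; the class name, dimensions, per-phoneme scalar gating via a sigmoid, and the manual-override interface are all our assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the Emotion Gate idea: predict a per-phoneme
# strength from text encoder states and use it to scale an utterance-level
# emotion embedding before it conditions the synthesis decoder.
import torch
import torch.nn as nn


class EmotionGate(nn.Module):
    def __init__(self, text_dim: int = 256, emotion_dim: int = 128):
        super().__init__()
        # Predicts one scalar strength per phoneme position (assumed design).
        self.strength_predictor = nn.Sequential(
            nn.Linear(text_dim, text_dim // 2),
            nn.Tanh(),
            nn.Linear(text_dim // 2, 1),
            nn.Sigmoid(),  # keep strengths in [0, 1]
        )

    def forward(self, text_hidden: torch.Tensor, emotion_emb: torch.Tensor):
        # text_hidden: (batch, num_phonemes, text_dim) text encoder outputs
        # emotion_emb: (batch, emotion_dim) utterance-level emotion representation
        strength = self.strength_predictor(text_hidden)       # (B, T, 1)
        scaled_emotion = strength * emotion_emb.unsqueeze(1)   # (B, T, emotion_dim)
        return scaled_emotion, strength.squeeze(-1)


if __name__ == "__main__":
    gate = EmotionGate()
    text_hidden = torch.randn(2, 10, 256)
    emotion_emb = torch.randn(2, 128)
    scaled, strength = gate(text_hidden, emotion_emb)
    # At inference, predicted strengths could be replaced with user-assigned
    # values to control local emotion intensity (manual control path).
    manual_strength = torch.full((2, 10), 0.8)
    scaled_manual = manual_strength.unsqueeze(-1) * emotion_emb.unsqueeze(1)
    print(scaled.shape, strength.shape, scaled_manual.shape)
```

In this sketch the scaled, position-wise emotion representations would be added to (or concatenated with) the text encoder outputs before decoding; the actual conditioning mechanism in the paper may differ.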
