Abstract

Emotional speech synthesis converts text into speech that expresses a target emotion. At present, mainstream deep learning-based emotional speech synthesis networks rely on single-speaker emotional speech datasets for training, but such specially designed high-quality datasets are difficult to obtain in practice. In this paper, we propose a novel two-stage training strategy for end-to-end emotional speech synthesis. In the first stage, a text-to-Mel-spectrogram alignment is trained on a single-speaker neutral speech dataset. In the second stage, a limited multi-speaker emotional speech dataset is mixed with part of the single-speaker neutral speech dataset for non-parallel training, decoupling emotion category, speaker identity, and text information to achieve multi-speaker emotional speech synthesis. Experiments show that this strategy can synthesize emotional speech from a limited emotional speech dataset, and that the synthesized emotional speech is better than that of the mainstream speech conversion model under the same resource conditions.
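
To make the two-stage idea concrete, the following is a minimal sketch, assuming a simplified text-to-Mel model with separate speaker and emotion embedding tables; all module names, dimensions, and data loaders here are hypothetical illustrations, not the paper's actual architecture or code.

```python
import torch
import torch.nn as nn

class EmotionalTTS(nn.Module):
    def __init__(self, vocab_size=64, n_speakers=10, n_emotions=5,
                 d_model=256, n_mels=80):
        super().__init__()
        # Text encoder: learns the text-to-Mel alignment in stage 1.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Separate lookup tables keep emotion category and speaker identity
        # decoupled from the text representation (fitted mainly in stage 2).
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        self.decoder = nn.Linear(d_model, n_mels)  # stand-in for a Mel decoder

    def forward(self, text_ids, speaker_id, emotion_id):
        enc, _ = self.encoder(self.embed(text_ids))           # (B, T, d_model)
        cond = self.speaker_emb(speaker_id) + self.emotion_emb(emotion_id)
        return self.decoder(enc + cond.unsqueeze(1))           # (B, T, n_mels)


def train_two_stage(model, neutral_loader, mixed_loader, steps_per_stage=1000):
    """Stage 1: single-speaker neutral data only, so the model first learns a
    stable text-to-Mel alignment. Stage 2: a limited multi-speaker emotional
    dataset mixed (non-parallel) with part of the neutral data, so the emotion
    and speaker embeddings are learned on top of that alignment."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.L1Loss()
    for loader in (neutral_loader, mixed_loader):
        for step, (text_ids, speaker_id, emotion_id, mel_target) in enumerate(loader):
            loss = loss_fn(model(text_ids, speaker_id, emotion_id), mel_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
            if step + 1 >= steps_per_stage:
                break
```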
