Abstract
Emotional speech synthesis is the task of converting text into speech that conveys a specified emotion. Current mainstream deep-learning-based emotional speech synthesis networks rely on single-speaker emotional speech datasets for training, but such purpose-built, high-quality datasets are difficult to obtain in practice. In this paper, we propose a novel two-stage training strategy for end-to-end emotional speech synthesis. In the first stage, a text-to-Mel-spectrogram alignment is learned from a single-speaker neutral speech dataset. In the second stage, a limited multi-speaker emotional speech dataset and part of the single-speaker neutral speech dataset are used for non-parallel mixed training, decoupling emotion category, speaker identity, and text information to achieve multi-speaker emotional speech synthesis. Experiments show that this strategy can synthesize emotional speech from a limited emotional speech dataset, and that the quality of the synthesized emotional speech surpasses that of mainstream voice conversion models trained under the same resource constraints.
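To make the two-stage strategy concrete, the following is a minimal PyTorch sketch of the training flow. The toy model, the synthetic batches, and all names (ToyAcousticModel, fake_batch, and so on) are illustrative assumptions, not the paper's actual implementation; the point is only the shape of the schedule: stage one fits a text-to-Mel mapping on neutral single-speaker data, and stage two mixes in the limited multi-speaker emotional data through separate emotion and speaker embeddings.

```python
# Illustrative sketch of the two-stage training strategy described above.
# All module and data names are hypothetical placeholders.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Text encoder + frame-wise Mel decoder, conditioned on
    separate (decoupled) emotion and speaker embeddings."""
    def __init__(self, vocab=64, n_emotions=5, n_speakers=10, d=128, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d)
        self.emotion_emb = nn.Embedding(n_emotions, d)  # emotion category code
        self.speaker_emb = nn.Embedding(n_speakers, d)  # speaker identity code
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.Linear(d, n_mels)             # per-frame Mel projection

    def forward(self, text, emotion, speaker):
        h, _ = self.encoder(self.text_emb(text))        # (B, T, d)
        # Add the conditioning codes to every encoder frame.
        h = h + self.emotion_emb(emotion)[:, None] + self.speaker_emb(speaker)[:, None]
        return self.decoder(h)                          # (B, T, n_mels)

def fake_batch(n_emotions, n_speakers, B=4, T=20):
    """Synthetic stand-in for a real (text, Mel, emotion, speaker) batch."""
    text = torch.randint(0, 64, (B, T))
    mel = torch.randn(B, T, 80)
    emotion = torch.randint(0, n_emotions, (B,))
    speaker = torch.randint(0, n_speakers, (B,))
    return text, mel, emotion, speaker

model = ToyAcousticModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

def step(text, emotion, speaker, mel_target):
    loss = loss_fn(model(text, emotion, speaker), mel_target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 1: single-speaker neutral data only -> learn text-to-Mel alignment.
# Emotion is pinned to a "neutral" id and speaker to the single source speaker.
for _ in range(100):
    text, mel, _, _ = fake_batch(n_emotions=5, n_speakers=10)
    neutral = torch.zeros(len(text), dtype=torch.long)
    step(text, neutral, neutral, mel)

# Stage 2: non-parallel mix of the limited multi-speaker emotional data with
# part of the neutral data, so emotion, speaker, and text stay decoupled.
for _ in range(100):
    text, mel, emotion, speaker = fake_batch(n_emotions=5, n_speakers=10)
    step(text, emotion, speaker, mel)
```

Because the emotion and speaker codes are injected only as additive conditioning, stage two can recombine them freely at inference time, which is what allows emotional speech for speakers whose emotional recordings are scarce.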