Abstract
Emotional speech synthesis is the task of converting text into speech that conveys a specified emotion. Current mainstream deep-learning-based emotional speech synthesis networks rely on single-speaker emotional speech datasets for training, but such purpose-built, high-quality datasets are difficult to obtain in practice. In this paper, we propose a novel two-stage training strategy for end-to-end emotional speech synthesis. In the first stage, a text-to-Mel-spectrogram alignment is learned from a single-speaker neutral speech dataset. In the second stage, a limited multi-speaker emotional speech dataset and part of the single-speaker neutral speech dataset are used for non-parallel mixed training, decoupling emotion category, speaker identity, and text information to achieve multi-speaker emotional speech synthesis. Experiments show that this strategy can synthesize emotional speech from a limited emotional speech dataset, and that the quality of the synthesized emotional speech surpasses that of mainstream voice conversion models trained under the same resource constraints.
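To make the two-stage strategy concrete, the following is a minimal PyTorch sketch of the training flow. The toy model, the synthetic batches, and all names (ToyAcousticModel, fake_batch, and so on) are illustrative assumptions, not the paper's actual implementation; the point is only the shape of the schedule: stage one fits a text-to-Mel mapping on neutral single-speaker data, and stage two mixes in the limited multi-speaker emotional data through separate emotion and speaker embeddings.

```python
# Illustrative sketch of the two-stage training strategy described above.
# All module and data names are hypothetical placeholders.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Text encoder + frame-wise Mel decoder, conditioned on
    separate (decoupled) emotion and speaker embeddings."""
    def __init__(self, vocab=64, n_emotions=5, n_speakers=10, d=128, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d)
        self.emotion_emb = nn.Embedding(n_emotions, d)  # emotion category code
        self.speaker_emb = nn.Embedding(n_speakers, d)  # speaker identity code
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.Linear(d, n_mels)             # per-frame Mel projection

    def forward(self, text, emotion, speaker):
        h, _ = self.encoder(self.text_emb(text))        # (B, T, d)
        # Add the conditioning codes to every encoder frame.
        h = h + self.emotion_emb(emotion)[:, None] + self.speaker_emb(speaker)[:, None]
        return self.decoder(h)                          # (B, T, n_mels)

def fake_batch(n_emotions, n_speakers, B=4, T=20):
    """Synthetic stand-in for a real (text, Mel, emotion, speaker) batch."""
    text = torch.randint(0, 64, (B, T))
    mel = torch.randn(B, T, 80)
    emotion = torch.randint(0, n_emotions, (B,))
    speaker = torch.randint(0, n_speakers, (B,))
    return text, mel, emotion, speaker

model = ToyAcousticModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

def step(text, emotion, speaker, mel_target):
    loss = loss_fn(model(text, emotion, speaker), mel_target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 1: single-speaker neutral data only -> learn text-to-Mel alignment.
# Emotion is pinned to a "neutral" id and speaker to the single source speaker.
for _ in range(100):
    text, mel, _, _ = fake_batch(n_emotions=5, n_speakers=10)
    neutral = torch.zeros(len(text), dtype=torch.long)
    step(text, neutral, neutral, mel)

# Stage 2: non-parallel mix of the limited multi-speaker emotional data with
# part of the neutral data, so emotion, speaker, and text stay decoupled.
for _ in range(100):
    text, mel, emotion, speaker = fake_batch(n_emotions=5, n_speakers=10)
    step(text, emotion, speaker, mel)
```

Because the emotion and speaker codes are injected only as additive conditioning, stage two can recombine them freely at inference time, which is what allows emotional speech for speakers whose emotional recordings are scarce.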