Abstract

With the rapid development of deep learning, speech synthesis research has shifted toward artificial neural network methods; speech quality has improved substantially, and synthesis systems have been deployed in many application scenarios. However, existing synthesis systems require large, high-quality parallel datasets for training, and the synthesized speech offers little speaker personalization. This paper describes an improved Mel spectrogram acoustic feature sequence prediction model based on Tacotron2, combined with a StarGAN-VC model. The StarGAN-VC model takes the predicted Mel spectrogram as input, generates the Mel spectrogram sequence of a designated speaker, and synthesizes speech from it. Because StarGAN-VC can be trained on a small non-parallel dataset and convert Mel spectrogram sequences to the designated speaker in real time, the combined system addresses the lack of parallel datasets and enriches the speech content that the StarGAN-VC model can generate. Experimental results show that, given the Mel spectrogram sequences predicted by the improved model, StarGAN-VC generates relatively smooth Mel spectrograms and reproduces spectrogram detail more expressively, yielding smooth and highly intelligible speech. Using only about 27 minutes of speech data from the designated speaker, the model can be trained to synthesize personalized speech, providing an effective reference for personalized speech synthesis.
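To make the described pipeline concrete, the following is a minimal sketch in PyTorch, assuming all components are pretrained; the names (synthesize_personalized, tacotron2.infer, stargan_generator, vocoder) are hypothetical placeholders, not the authors' actual code. It shows the flow the abstract describes: Tacotron2 predicts a Mel spectrogram from text, the StarGAN-VC generator converts it to the designated speaker conditioned on a one-hot speaker code, and a vocoder renders the waveform.

import torch

def synthesize_personalized(text_ids, tacotron2, stargan_generator,
                            vocoder, target_speaker_id, num_speakers):
    # All components are assumed pretrained; Mel spectrograms are taken
    # to have shape (batch, n_mels, frames).
    with torch.no_grad():
        # 1. Predict the source Mel spectrogram from the input text sequence.
        mel_source = tacotron2.infer(text_ids)              # (1, n_mels, T)

        # 2. Condition the single StarGAN-VC generator on the designated
        #    speaker with a one-hot domain (speaker) code.
        domain = torch.zeros(1, num_speakers)
        domain[0, target_speaker_id] = 1.0
        mel_target = stargan_generator(mel_source, domain)  # (1, n_mels, T)

        # 3. Invert the converted Mel spectrogram to a waveform.
        return vocoder(mel_target)

A single generator conditioned on a speaker code, rather than one model per speaker pair, is what lets StarGAN-VC train on non-parallel data from multiple speakers at once.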
