Abstract

In recent years, speech synthesis based on machine learning has become increasingly popular, and many neural network models can now generate synthetic audio that closely imitates the human voice. The quality of such generated audio is usually evaluated by the mean opinion score (MOS). A voiceprint captures the distinguishing features of a speaker's voice, so generating speech with specific voiceprint features is important for broadening the applications of speech synthesis. However, existing speech synthesis models seldom consider preserving specific voiceprint features. In this paper, we propose D-MelGAN, a speech synthesis model that targets high-quality speech carrying the voiceprint features of a specific speaker. The model is based on a non-autoregressive feed-forward convolutional neural network trained as a generative adversarial network (GAN). By embedding d-vectors, which are used to identify specific voiceprints, into the GAN, the model generates raw audio waveforms with the voiceprint characteristics of a target speaker. Experimental results show that the new model strengthens the voiceprint features of the generated audio while maintaining synthesis quality, so the generated speech carries the specific style of a speaker and text-to-speech technology can be applied to more fields.
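The abstract does not describe implementation details, but the core idea of a d-vector is a fixed-length, L2-normalized speaker embedding obtained by averaging per-frame embeddings, which can then condition a vocoder's input frames. The following is a minimal pure-Python sketch of that conditioning idea; the function names, embedding dimensions, and concatenation-based conditioning are illustrative assumptions, not the paper's actual architecture.

```python
import math
import random

def l2_normalize(vec):
    # d-vectors are conventionally L2-normalized so that cosine
    # similarity between speakers reduces to a dot product
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def d_vector(frame_embeddings):
    # A d-vector averages per-frame speaker embeddings into one
    # fixed-length vector per utterance, then normalizes it
    dim = len(frame_embeddings[0])
    mean = [sum(f[i] for f in frame_embeddings) / len(frame_embeddings)
            for i in range(dim)]
    return l2_normalize(mean)

def condition_mel_frames(mel_frames, speaker_dvec):
    # Broadcast the utterance-level d-vector onto every mel frame,
    # so the generator sees speaker identity at each timestep
    # (one plausible way to inject a speaker embedding into a GAN vocoder)
    return [frame + speaker_dvec for frame in mel_frames]

# Toy example: 5 frames of hypothetical 4-dim speaker embeddings
frames = [[random.random() for _ in range(4)] for _ in range(5)]
dvec = d_vector(frames)

# 3 mel frames of 8 mel bins each, conditioned on the d-vector
mels = [[0.0] * 8 for _ in range(3)]
conditioned = condition_mel_frames(mels, dvec)
```

In a real D-MelGAN-style system the conditioned frames would feed a transposed-convolution generator that upsamples them to a raw waveform; this sketch only illustrates how speaker identity can be attached to the generator's input.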
