Abstract

In recent years, speech synthesis based on machine learning has become increasingly popular, and many neural network models can now generate synthetic audio that closely imitates the human voice. The quality of such generated audio is usually evaluated by the mean opinion score (MOS). A voiceprint captures the distinguishing features of a speaker's voice, so generating speech with specific voiceprint features is important for broadening the applications of speech synthesis. However, existing speech synthesis models seldom consider preserving specific voiceprint features. In this paper, we propose D-MelGAN, a speech synthesis model that targets high-quality speech carrying the voiceprint features of a specific speaker. The model is based on a non-autoregressive feed-forward convolutional neural network trained as a generative adversarial network (GAN). By embedding d-vectors, which are used to identify specific voiceprints, into the GAN, the model generates raw audio waveforms with the voiceprint characteristics of a target speaker. Experimental results show that the new model strengthens the voiceprint features of the generated audio while maintaining synthesis quality, so the generated speech carries the specific style of a speaker and text-to-speech technology can be applied to more fields.
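The abstract does not describe implementation details, but the core idea of a d-vector is a fixed-length, L2-normalized speaker embedding obtained by averaging per-frame embeddings, which can then condition a vocoder's input frames. The following is a minimal pure-Python sketch of that conditioning idea; the function names, embedding dimensions, and concatenation-based conditioning are illustrative assumptions, not the paper's actual architecture.

```python
import math
import random

def l2_normalize(vec):
    # d-vectors are conventionally L2-normalized so that cosine
    # similarity between speakers reduces to a dot product
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def d_vector(frame_embeddings):
    # A d-vector averages per-frame speaker embeddings into one
    # fixed-length vector per utterance, then normalizes it
    dim = len(frame_embeddings[0])
    mean = [sum(f[i] for f in frame_embeddings) / len(frame_embeddings)
            for i in range(dim)]
    return l2_normalize(mean)

def condition_mel_frames(mel_frames, speaker_dvec):
    # Broadcast the utterance-level d-vector onto every mel frame,
    # so the generator sees speaker identity at each timestep
    # (one plausible way to inject a speaker embedding into a GAN vocoder)
    return [frame + speaker_dvec for frame in mel_frames]

# Toy example: 5 frames of hypothetical 4-dim speaker embeddings
frames = [[random.random() for _ in range(4)] for _ in range(5)]
dvec = d_vector(frames)

# 3 mel frames of 8 mel bins each, conditioned on the d-vector
mels = [[0.0] * 8 for _ in range(3)]
conditioned = condition_mel_frames(mels, dvec)
```

In a real D-MelGAN-style system the conditioned frames would feed a transposed-convolution generator that upsamples them to a raw waveform; this sketch only illustrates how speaker identity can be attached to the generator's input.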
