Abstract

Voice conversion (VC) has emerged as a significant research domain within speech synthesis in recent years, driven by its applications in voice-assistive technologies such as automated movie dubbing and speech-to-singing conversion, to name a few. VC converts the vocal style of one speaker to that of another while keeping the linguistic content unchanged. Generative adversarial network (GAN) models are now widely used to map speech features from a source speaker to a target speaker. In this article, we propose an adaptive-learning-based GAN model, called ALGAN-VC, to improve one-to-one VC between speakers. The ALGAN-VC framework combines several techniques to improve speech quality and voice similarity between the source and target speakers: a dense residual network architecture in the generator network for efficient speech feature learning between source and target speakers, an adaptive learning mechanism to compute the loss function of the proposed model, and a boosted learning rate approach to enhance the model's learning capability. The proposed model is tested on the Voice Conversion Challenge 2016, 2018, and 2020 datasets, along with our self-prepared Indian regional-language speech dataset; an emotional speech dataset is also considered for evaluating the model's performance. Objective and subjective evaluations of the generated speech samples indicate that the proposed model performs the voice conversion task well, achieving high speaker similarity and good speech quality.
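To illustrate the dense residual connectivity the generator relies on, here is a minimal NumPy sketch of one densely connected block with a residual skip. This is an assumption-laden illustration of the general technique (each layer receives the concatenation of the block input and all previous layer outputs, and the block adds its input back at the end), not the paper's exact architecture; the layer count, widths, and activation are hypothetical choices.

```python
import numpy as np

def dense_residual_block(x, weights):
    """Sketch of a dense residual block (illustrative, not the paper's spec).

    x:       (batch, F) input features.
    weights: list of matrices; layer i has shape ((i + 1) * F, F), since it
             consumes the concatenation of the input and all earlier outputs.
    """
    feats = [x]
    for W in weights:
        h = np.concatenate(feats, axis=1) @ W  # dense connectivity
        h = np.maximum(h, 0.0)                 # ReLU activation (assumed)
        feats.append(h)
    return x + feats[-1]                       # residual skip over the block

# Hypothetical usage with 3 layers of width F = 8:
rng = np.random.default_rng(0)
F = 8
weights = [rng.standard_normal(((i + 1) * F, F)) * 0.1 for i in range(3)]
x = rng.standard_normal((2, F))
y = dense_residual_block(x, weights)  # shape (2, 8), same as the input
```

Because of the residual skip, the block reduces to the identity mapping when all layer weights are zero, which is what makes such blocks easy to stack deeply in the generator.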
