Abstract
Many-to-many voice conversion (VC) is a technique that maps speech features between multiple speakers during training and transfers the vocal characteristics from a source speaker to a target speaker while keeping the linguistic content of the source speech unchanged. Existing research highlights a notable gap between the original and generated speech samples in terms of naturalness within many-to-many VC. Therefore, there is substantial room for improvement in achieving more natural-sounding speech samples for both parallel and nonparallel VC scenarios. In this study, we introduce a generative adversarial network (GAN) system with a guided loss (GLGAN-VC) designed to enhance many-to-many VC by focusing on architectural improvements and the integration of alternative loss functions. Our approach includes a pair-wise downsampling and upsampling (PDU) generator network for effective speech feature mapping (FM) in multidomain VC. In addition, we incorporate an FM loss to preserve content information and a residual connection (RC)-based discriminator network to improve learning. A guided loss (GL) function is introduced to efficiently capture differences in latent feature representations between source and target speakers, and an enhanced reconstruction loss is proposed for better contextual information preservation. We evaluate our model on various datasets, including VCC 2016, VCC 2018, VCC 2020, and an emotional speech dataset (ESD). Our results, based on both subjective and objective evaluation metrics, demonstrate that our model outperforms state-of-the-art (SOTA) many-to-many GAN-based VC models in terms of speech quality and speaker similarity in the generated speech samples.
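To make the loss formulations concrete, the following is a minimal, hedged sketch of the two auxiliary objectives named in the abstract: a feature-matching (FM) loss over paired discriminator feature maps and a guided loss over latent representations. The function names (`fm_loss`, `guided_loss`) and the flat-list representation of feature maps are illustrative assumptions, not the paper's actual implementation, which operates on neural network tensors.

```python
# Illustrative sketch only: the paper's FM and guided losses are defined
# over deep-network feature tensors; here we use plain lists of floats.

def fm_loss(real_feats, fake_feats):
    """Feature-matching loss: mean L1 distance between the discriminator's
    intermediate feature maps for real and generated speech.
    Each argument is a list of layers, each layer a list of floats."""
    total, count = 0.0, 0
    for r_layer, f_layer in zip(real_feats, fake_feats):
        for r, f in zip(r_layer, f_layer):
            total += abs(r - f)
            count += 1
    return total / count

def guided_loss(src_latent, tgt_latent):
    """Guided loss (toy form): mean squared difference between latent
    feature representations of the source and target speakers."""
    return sum((s - t) ** 2 for s, t in zip(src_latent, tgt_latent)) / len(src_latent)
```

In practice both terms would be weighted and added to the adversarial and reconstruction losses when training the generator; the weighting scheme here is left unspecified, as the abstract does not give it.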
Published in: IEEE Transactions on Neural Networks and Learning Systems