Abstract

Voice conversion (VC) refers to the technique of modifying one speaker's voice to mimic another's while retaining the original linguistic content. This technology has applications in fields such as speech synthesis, accent modification, medicine, security, privacy, and entertainment. Among the deep generative models used for voice conversion, including variational autoencoders (VAEs) and generative adversarial networks (GANs), diffusion models (DMs) have recently gained attention as promising methods owing to their training stability and strong performance in data generation. Nevertheless, traditional DMs, like VAEs, focus mainly on learning reconstruction paths rather than conversion paths as GANs do, which restricts the quality of the converted speech. To overcome this limitation and enhance voice conversion performance, we propose a cycle-consistent diffusion (CycleDiffusion) model comprising two DMs: one converts the source speaker's voice to the target speaker's voice, and the other converts it back to the source speaker's voice. By employing two DMs and enforcing a cycle-consistency loss, CycleDiffusion effectively learns both reconstruction and conversion paths, producing high-quality converted speech. The effectiveness of the proposed model in voice conversion is validated through experiments on the VCTK (Voice Cloning Toolkit) dataset.

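To make the abstract's training objective concrete, the following is a minimal PyTorch-style sketch of how two diffusion models might be coupled through a cycle-consistency term. All names here (`CycleDiffusionTrainer`, `diffusion_loss`, `convert`, `lambda_cyc`) are illustrative assumptions for exposition, not the authors' actual implementation or API.

```python
import torch.nn.functional as F

class CycleDiffusionTrainer:
    """Hypothetical sketch of pairing two DMs with a cycle-consistency loss.

    dm_src2tgt and dm_tgt2src are assumed to expose:
      - diffusion_loss(mel): the standard denoising (reconstruction-path) loss
      - convert(mel): a full sampling pass producing a converted mel-spectrogram
    Both methods are assumed interfaces, not a real library API.
    """

    def __init__(self, dm_src2tgt, dm_tgt2src, lambda_cyc=10.0):
        self.dm_src2tgt = dm_src2tgt  # DM: source speaker -> target speaker
        self.dm_tgt2src = dm_tgt2src  # DM: target speaker -> source speaker
        self.lambda_cyc = lambda_cyc  # assumed weight on the cycle term

    def training_loss(self, mel_src, mel_tgt):
        # Reconstruction paths: each DM is trained with its usual
        # denoising objective on its own target-domain data.
        loss_rec = (self.dm_src2tgt.diffusion_loss(mel_tgt)
                    + self.dm_tgt2src.diffusion_loss(mel_src))

        # Conversion path: round trip source -> target -> source.
        fake_tgt = self.dm_src2tgt.convert(mel_src)
        cycled_src = self.dm_tgt2src.convert(fake_tgt)

        # Cycle-consistency loss ties the round trip back to the input,
        # so the conversion path itself receives a training signal.
        loss_cyc = F.l1_loss(cycled_src, mel_src)

        return loss_rec + self.lambda_cyc * loss_cyc
```

The design intuition, under these assumptions, mirrors CycleGAN-style training: the denoising terms alone would only teach each DM a reconstruction path, while the round-trip penalty forces the sampled conversions themselves to preserve the source content.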