Abstract
With the development of deep learning, nonparallel voice conversion (VC) has recently achieved significant progress. Leveraging knowledge from automatic speech recognition (ASR) and from text-to-speech (TTS) are the two mainstream approaches in VC research. In this paper, we demonstrate that the bottleneck features (BNFs) used in these two approaches are complementary: ASR-BNFs are more robust, especially in any-to-many tasks, but leak the source speaker's timbre information, whereas TTS-BNFs are less likely to reveal timbre information but lack robustness. We therefore propose a nonparallel any-to-many voice conversion model that combines ASR-BNFs and TTS-BNFs. All modules of the proposed model can be trained jointly without any pre-trained models. Experiments on a private multi-speaker TTS dataset demonstrate that the proposed model achieves the best balance among speech quality, timbre similarity, and robustness compared with baseline models.