Abstract

Voice conversion (VC) is the task of changing the speech of a source speaker to a target voice while preserving the linguistic information of the source speech. Existing VC methods typically use the mel-spectrogram as both input and output, so a separate vocoder is required to transform the mel-spectrogram into a waveform. As a result, VC performance varies with vocoder performance, and noisy speech can be generated due to problems such as train-test mismatch. In this paper, we propose a speech- and fundamental-frequency-consistent raw-audio voice conversion method called WaveVC. Unlike other methods, WaveVC does not require a separate vocoder and performs VC directly on the raw audio waveform using 1D convolutions. This eliminates the performance degradation caused by the train-test mismatch of the vocoder. In the training phase, WaveVC employs a speech loss and an F0 loss, computed with pre-trained networks, to preserve the content of the source speech and to generate F0-consistent speech. WaveVC thus converts voices while maintaining consistency in both speech content and fundamental frequency. In the test phase, the F0 feature of the source speech is concatenated with a content embedding vector so that the converted speech follows the fundamental frequency contour of the source speech. WaveVC achieves higher performance than baseline methods in both many-to-many and any-to-any VC. Converted samples are available online.
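
To make the test-phase idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of decoding a raw waveform with 1D convolutions from a channel-wise concatenation of a content embedding, a source F0 contour, and a speaker embedding. All module names, dimensions, and the upsampling factor are illustrative assumptions.

```python
# Hypothetical sketch only: channel-wise concatenation of content, F0, and speaker
# features, decoded to a raw waveform with 1D convolutions. Not the WaveVC code.
import torch
import torch.nn as nn

class Toy1DWaveformDecoder(nn.Module):
    def __init__(self, content_dim=256, f0_dim=1, speaker_dim=64, hidden=256, upsample=256):
        super().__init__()
        in_dim = content_dim + f0_dim + speaker_dim
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=7, padding=3),
            nn.LeakyReLU(0.2),
            # Upsample frame-rate features toward the sample rate (toy factor).
            nn.ConvTranspose1d(hidden, hidden, kernel_size=upsample, stride=upsample),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform constrained to [-1, 1]
        )

    def forward(self, content, f0, speaker):
        # content: (B, content_dim, T), f0: (B, 1, T), speaker: (B, speaker_dim)
        speaker = speaker.unsqueeze(-1).expand(-1, -1, content.size(-1))
        x = torch.cat([content, f0, speaker], dim=1)  # concatenate along channels
        return self.net(x)  # (B, 1, T * upsample)

# Usage: decode 100 frames of toy features into 25,600 waveform samples.
dec = Toy1DWaveformDecoder()
wave = dec(torch.randn(2, 256, 100), torch.randn(2, 1, 100), torch.randn(2, 64))
print(wave.shape)  # torch.Size([2, 1, 25600])
```

Because the F0 contour of the source utterance is an explicit input channel, the generated waveform can follow the source's pitch trajectory while the speaker embedding controls the target voice identity.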
