Abstract

In recognition-synthesis based any-to-one voice conversion (VC), an automatic speech recognition (ASR) model is employed to extract content-related features, and a synthesizer is built to predict the acoustic features of the target speaker from the content-related features of any source speaker at the conversion stage. Since source speakers are unknown at the training stage, we have to use the content-related features of the target speaker to estimate the parameters of the synthesizer. This inconsistency between the conversion and training stages constrains the speaker similarity of converted speech. To address this issue, a cyclic training method is proposed in this paper. This method constructs pseudo-source acoustic features by converting the training data of the target speaker towards multiple speakers in a reference corpus. These pseudo-source acoustic features are then used as the input of the synthesizer at the training stage to predict the acoustic features of the target speaker, and a cyclic reconstruction loss is derived. Experimental results show that our proposed method achieved more consistent accuracy of acoustic feature prediction across various source speakers than the baseline method. It also achieved better similarity of converted speech, especially for pairs of source and target speakers with distant speaker characteristics.
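As a rough illustrative sketch (the notation below is assumed for exposition and is not taken from the paper): write y_t for the target speaker's acoustic features, G_s for a conversion toward a reference speaker s in the reference set \mathcal{S}, R for the ASR content extractor, and S for the synthesizer. A cyclic reconstruction loss of the kind described could then take a form such as

  \mathcal{L}_{\mathrm{cyc}} = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \mathbb{E}_{y_t}\!\left[ \left\| S\!\big( R\big( G_s(y_t) \big) \big) - y_t \right\|_2^2 \right],

i.e., the target features are first converted toward each reference speaker to form pseudo-source features, passed back through the recognition-synthesis pipeline, and the reconstruction is compared against the original target features.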
