CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion

Yuhang Xue, Ning Chen, Yixin Luo, Hongqing Zhu, Zhiying Zhu

doi:10.1016/j.specom.2024.103139

Journal: Speech Communication

Publication Date: Sep 10, 2024

Abstract

One-shot voice conversion (VC) has attracted growing attention because of its broad prospects for practical application. The task hinges on two factors: the representational power of the speech features and the model’s ability to generalize. This paper proposes CLESSR-VC, a model that enhances pre-trained self-supervised learning (SSL) representations through contrastive learning for one-shot VC. First, to ensure generalization, SSL features from the 23rd and 9th layers of the pre-trained WavLM are used to extract the content embedding and the SSL speaker embedding, respectively. Then, mel-spectrograms, a conventional acoustic feature, and contrastive learning are introduced to strengthen the representational power of the speech features. Specifically, contrastive learning combined with pitch-shift augmentation is applied to disentangle content information from the SSL features accurately, and mel-spectrograms are used to extract a mel speaker embedding. AM-Softmax and cross-architecture contrastive learning are then applied between the SSL and mel speaker embeddings to obtain a fused speaker embedding that helps improve speech quality and speaker similarity. Both objective and subjective evaluations on the VCTK corpus confirm that the proposed model achieves outstanding performance with few trainable parameters.
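
The abstract names two loss families, AM-Softmax and contrastive learning, without giving their exact form. The sketch below illustrates the standard textbook versions of both on plain Python lists; it is not the authors' implementation. The function names, the InfoNCE formulation of the contrastive term, and the values of `scale`, `margin`, and `temperature` are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def am_softmax_loss(embedding, class_weights, label, scale=30.0, margin=0.35):
    """Additive-margin softmax loss for one speaker embedding.

    The margin is subtracted from the cosine of the true class only,
    then all cosines are scaled before the usual cross-entropy.
    scale and margin are assumed values, not taken from the paper.
    """
    logits = []
    for j, weight in enumerate(class_weights):
        cos = cosine(embedding, weight)
        if j == label:
            cos -= margin
        logits.append(scale * cos)
    return log_sum_exp(logits) - logits[label]


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss between two sets of embeddings.

    anchors[i] and positives[i] are treated as a matching pair (for
    example, an SSL and a mel speaker embedding of the same utterance);
    every other entry of positives serves as a negative for anchors[i].
    This generic form is an assumption standing in for the paper's
    cross-architecture contrastive objective.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        total += log_sum_exp(logits) - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    # Toy two-speaker example with hypothetical 2-D embeddings.
    weights = [[1.0, 0.0], [0.0, 1.0]]
    print(am_softmax_loss([1.0, 1.0], weights, label=0))
    ssl_emb = [[1.0, 0.0], [0.0, 1.0]]
    mel_emb = [[1.0, 0.1], [0.1, 1.0]]
    print(info_nce_loss(ssl_emb, mel_emb))
```

In this toy setup the margin makes the true class harder to score, so the AM-Softmax loss is larger than the plain softmax loss (`margin=0.0`) for the same embedding, and the InfoNCE loss is small when each anchor is paired with its own counterpart and large when the pairs are swapped.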