Self-supervised speech representations (S3R) have succeeded in many downstream tasks, such as speaker recognition and voice conversion, thanks to the high-level information they encode. Voice conversion (VC) is the task of converting source speech into a target speaker's voice. Although S3R features effectively encode content and speaker information, spectral features contain low-level acoustic information that is complementary to S3R. As a result, relying solely on S3R features for VC may not be optimal. To obtain a speech representation that carries both high-level learned information and low-level spectral details for VC, we propose a three-level attention mechanism to combine the Mel-spectrogram (Mel) and S3R, denoted Mel-S3R. In particular, S3R features are high-level representations extracted by a network pre-trained with self-supervised learning, whereas Mel is a spectral feature representing low-level acoustic information. The proposed Mel-S3R is then used as the input to an any-to-any VQ-VAE-based VC framework, and experiments are conducted as a downstream task. Objective metrics and subjective listening tests demonstrate that the proposed Mel-S3R representation enables the VC framework to achieve robust performance in terms of both speech quality and speaker similarity.
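The abstract does not detail the three attention levels, but the core idea of fusing low-level Mel frames with high-level S3R frames can be illustrated with a minimal cross-attention sketch. The module name `MelS3RFusion`, the feature dimensions, the fusion direction (Mel frames querying the S3R sequence), and the single attention level shown here are all assumptions for illustration, not the paper's specification.

```python
# Hypothetical sketch of attention-based Mel/S3R fusion; dimensions, the
# fusion direction, and the number of attention levels are assumptions.
import torch
import torch.nn as nn


class MelS3RFusion(nn.Module):
    """Fuse Mel (low-level spectral) and S3R (high-level learned) frames
    with cross-attention, producing one fused representation per frame."""

    def __init__(self, mel_dim=80, s3r_dim=768, d_model=256, n_heads=4):
        super().__init__()
        self.mel_proj = nn.Linear(mel_dim, d_model)   # project Mel frames
        self.s3r_proj = nn.Linear(s3r_dim, d_model)   # project S3R frames
        # Assumed direction: Mel frames attend over the S3R sequence
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(2 * d_model, d_model)    # combine both streams

    def forward(self, mel, s3r):
        # mel: (batch, T, mel_dim), s3r: (batch, T', s3r_dim)
        q = self.mel_proj(mel)
        kv = self.s3r_proj(s3r)
        attended, _ = self.cross_attn(q, kv, kv)      # (batch, T, d_model)
        return self.out(torch.cat([q, attended], dim=-1))


if __name__ == "__main__":
    fusion = MelS3RFusion()
    mel = torch.randn(2, 100, 80)    # dummy Mel-spectrogram frames
    s3r = torch.randn(2, 100, 768)   # dummy S3R frames from a pre-trained model
    print(fusion(mel, s3r).shape)    # torch.Size([2, 100, 256])
```

In such a setup, the fused frames would then serve as the encoder input of the VQ-VAE-based VC model in place of Mel or S3R alone.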