Abstract

We explore the use of cosine similarity between x-vector speaker embeddings as an objective metric to evaluate the effectiveness of singing voice conversion. Our system preprocesses a source singer’s audio to extract the F0 contour, loudness curve, and phonetic posteriorgram. These features are input to a denoising diffusion probabilistic acoustic model, conditioned on a target voice’s speaker embedding, to generate a mel spectrogram, which a HiFi-GAN vocoder then converts into audio of the source song in the target timbre. We use cosine similarity between the converted audio’s speaker embedding and that of the target voice as an objective metric in two tasks. First, we show that we can morph between two voices: a smooth transition between two speaker embeddings in latent space results in a smooth transition of timbre in the generated audio. This suggests that the speaker embedding latent space can be used creatively to represent new voices. Second, we use cosine similarity to compare our diffusion acoustic model with a model based on DurIAN. We find that the DurIAN-based model achieves better conversion results with fewer parameters and less training time. Overall, we conclude that cosine similarity is a helpful objective metric for voice morphing and conversion.
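As an illustration of the metric and the morphing idea described above, the following is a minimal sketch and not the authors' implementation. It assumes speaker embeddings are available as plain lists of floats, and it assumes linear interpolation as the way to move between two embeddings; the abstract only states that the transition is smooth. The function names `cosine_similarity`, `interpolate`, and `morph_path` are hypothetical, and the two-dimensional vectors stand in for real x-vectors.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def interpolate(a: Sequence[float], b: Sequence[float], alpha: float) -> List[float]:
    """Linear blend of two embeddings (an assumed morphing scheme).

    alpha = 0 returns a, alpha = 1 returns b.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def morph_path(a: Sequence[float], b: Sequence[float], steps: int) -> List[List[float]]:
    """Evenly spaced embeddings from a to b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [interpolate(a, b, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    # Toy stand-ins for two speakers' embeddings (hypothetical values).
    voice_a = [1.0, 0.0]
    voice_b = [0.0, 1.0]
    for point in morph_path(voice_a, voice_b, steps=5):
        sim_to_b = cosine_similarity(point, voice_b)
        print(f"embedding={point} similarity_to_b={sim_to_b:.3f}")
```

In a real evaluation, each interpolated embedding would condition the acoustic model, and the similarity would be computed on an embedding extracted from the converted audio rather than on the conditioning vector itself.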
