Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors

Yuki Saito,Kyosuke Nishida,Yusuke Ijima,Shinnosuke Takamichi

doi:10.1109/icassp.2018.8461384

Abstract

This paper proposes novel frameworks for non-parallel voice conversion (VC) using variational autoencoders (VAEs). Although conventional VAE-based VC models can be trained using non-parallel speech corpora with given speaker representations, phonetic contents of the converted speech tend to vanish because of an over-regularization issue often observed in latent variables of the VAEs. To overcome the issue, this paper proposes a VAE-based non-parallel VC conditioned by not only the speaker representations but also phonetic contents of speech represented as phonetic posteriorgrams (PPGs). Since the phonetic contents are given during the training, we can expect that the VC models effectively learn speaker-independent latent features of speech. Focusing on the point, this paper also extends the conventional VAE-based non-parallel VC to many-to-many VC that can convert arbitrary speakers' characteristics into another arbitrary speakers' ones. We investigate two methods to estimate speaker representations for speakers not included in speech corpora used for training VC models: 1) adapting conventional speaker codes, and 2) using d-vectors for the speaker representations. Experimental results demonstrate that 1) PPGs successfully improve both naturalness and speaker similarity of the converted speech, and 2) both speaker codes and d-vectors can be adopted to the VAE-based many-to-many non-parallel VC.

Full Text