Abstract

Considering the ease of data collection, it is desirable to build a voice conversion system (VCS) from non-parallel voice data, in which the source and target speakers read different sentences. Previous non-parallel VCSs have used either the mel-cepstrum or the spectral envelope as the input feature. However, these two features capture different acoustic characteristics, each of which plays an important role in speaker recognition. We therefore propose a non-parallel VCS that efficiently uses both the mel-cepstrum and the spectral envelope as input features. Our method is based on three key strategies: 1) we use generative adversarial networks for voice conversion; 2) we add noise to facilitate training of the formant part; and 3) we integrate the acoustic features to generate high-quality converted voices. Subjective evaluations on the Voice Conversion Challenge 2016 (VCC 2016) corpus revealed that our model outperformed previous approaches in terms of the naturalness and similarity of the converted voice.
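
As a rough illustration only (not the paper's actual pipeline), the idea of integrating both acoustic features and adding noise to aid adversarial training might be sketched as follows; the function name, frame shapes, and `noise_std` value are assumptions for the example:

```python
import numpy as np

def prepare_features(mcep, spec_env, noise_std=0.01, rng=None):
    """Hypothetical feature-integration step: concatenate per-frame
    mel-cepstrum and spectral-envelope features, then add Gaussian
    noise (a common trick to stabilize GAN training).

    mcep:     (T, D1) mel-cepstral coefficients per frame
    spec_env: (T, D2) spectral-envelope values per frame
    returns:  (T, D1 + D2) noisy, integrated feature matrix
    """
    rng = np.random.default_rng(rng)
    feats = np.concatenate([mcep, spec_env], axis=1)
    return feats + rng.normal(0.0, noise_std, feats.shape)
```

The integrated matrix would then be fed to the generator, while the noise injection plays the role described in strategy 2; the actual network architecture and training objective are defined in the full paper.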
