Pitch transformation in neural network based voice conversion

Feng-Long Xie, Haifeng Li, Frank K Soong, Yao Qian

doi:10.1109/iscslp.2014.6936599

Abstract

In voice conversion, prosody conversion, and pitch conversion in particular, is a challenging research topic because the pitch contour is discontinuous. Conventionally, pitch conversion is achieved by adjusting the mean and variance of the source pitch distribution to match those of the target pitch distribution. This method discards most of the detail in the speaker's prosody and preserves only the global F0 contour. In this paper, we propose a neural network based pitch conversion system that converts F0 and spectral features jointly, frame by frame. Experimental results show that neural network based pitch conversion significantly reduces both the unvoiced/voiced (U/V) error and the RMSE of F0 between converted and target pitch, compared with the conventional Gaussian normalized transformation method. Wavelet decomposition of F0 can further improve voice conversion performance.
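To make the baseline and the two evaluation measures concrete, the following is a minimal sketch of a Gaussian normalized F0 transformation together with U/V error and F0 RMSE. It is not the authors' implementation. Several details are assumptions not stated in the abstract: the normalization is done in the log-F0 domain, unvoiced frames are marked by F0 = 0, the converted and target contours are already time-aligned and of equal length, and RMSE is computed in Hz over frames voiced in both contours. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    voiced = [math.log(v) for v in f0 if v > 0]
    return mean(voiced), pstdev(voiced)


def gaussian_normalize_f0(src_f0, src_stats, tgt_stats):
    """Shift and scale source log-F0 to the target mean and variance.

    Assumption: unvoiced frames (F0 == 0) pass through unchanged, so the
    voiced/unvoiced decisions of the source are kept as they are.
    """
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    return [
        math.exp((math.log(v) - mu_s) / sd_s * sd_t + mu_t) if v > 0 else 0.0
        for v in src_f0
    ]


def uv_error_rate(converted, target):
    """Fraction of frames whose voiced/unvoiced decision differs."""
    mismatches = sum((c > 0) != (t > 0) for c, t in zip(converted, target))
    return mismatches / len(target)


def f0_rmse(converted, target):
    """RMSE in Hz over frames that are voiced in both contours."""
    sq = [(c - t) ** 2 for c, t in zip(converted, target) if c > 0 and t > 0]
    return math.sqrt(mean(sq))


# Toy, already-aligned contours in Hz (illustrative values only).
src = [0.0, 100.0, 110.0, 120.0, 0.0]
tgt = [0.0, 200.0, 220.0, 240.0, 230.0]

converted = gaussian_normalize_f0(src, log_f0_stats(src), log_f0_stats(tgt))
print(uv_error_rate(converted, tgt), f0_rmse(converted, tgt))
```

Because the mapping is a single global shift and scale, the converted contour keeps the shape of the source contour and inherits its voicing decisions, which illustrates why this baseline retains only the global F0 characteristics of the target.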

Full Text