Voice conversion (VC) is a technique that aims to map the individuality of a source speaker to that of a target speaker, and Gaussian mixture model (GMM) based methods are prevalent for this task. Despite their wide use, two major problems remain to be resolved: over-smoothing and over-fitting. The latter arises naturally when the model structure is too complicated for the limited amount of training data available. Recently, a new voice conversion method based on Gaussian processes (GPs) was proposed, whose nonparametric nature significantly alleviates the over-fitting problem. In addition, the GP framework makes it straightforward to perform non-linear mapping by introducing sophisticated kernel functions. This paper therefore explores this kind of method thoroughly. To further improve the performance of the GP-based method, a strategy for mapping prosodic and spectral features coherently is adopted, making the best use of the correlations between excitation and vocal tract features. Moreover, the accuracy of the GP kernel function computation can be improved by an asymmetric training strategy that allows the dimensionality of the input vectors to be reasonably higher than that of the output vectors without additional computational cost. Objective and subjective experiments confirm the effectiveness of the proposed method and demonstrate that the GP-based method improves on the traditional GMM-based approach.
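The GP mapping and the asymmetric input/output dimensionality described above can be illustrated with a minimal sketch. It is not the authors' implementation: the squared-exponential kernel, the noise level, the length scale, the feature dimensions, and all function names (`rbf_kernel`, `gp_fit`, `gp_predict`) are assumptions chosen for illustration, and random toy vectors stand in for real source and target speech features.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between rows of a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_fit(x_train, y_train, noise=1e-2, length_scale=1.0):
    """Return alpha = (K + noise * I)^-1 Y, used by the GP posterior mean."""
    k = rbf_kernel(x_train, x_train, length_scale)
    k[np.diag_indices_from(k)] += noise
    return np.linalg.solve(k, y_train)


def gp_predict(x_test, x_train, alpha, length_scale=1.0):
    """GP posterior mean at x_test: K(x_test, x_train) @ alpha."""
    return rbf_kernel(x_test, x_train, length_scale) @ alpha


# Toy data (assumption): 50 frames, 6-dim source input, 2-dim target output.
rng = np.random.default_rng(0)
n_frames, dim_in, dim_out = 50, 6, 2
source = rng.normal(size=(n_frames, dim_in))
weights = rng.normal(size=(dim_in, dim_out))
target = np.tanh(source @ weights)  # hypothetical non-linear relation

alpha = gp_fit(source, target, noise=1e-2, length_scale=2.0)
converted = gp_predict(source, source, alpha, length_scale=2.0)

# The Gram matrix is n_frames x n_frames whatever dim_in is.
gram = rbf_kernel(source, source, length_scale=2.0)
print(gram.shape, converted.shape)
```

In this sketch the input dimensionality only enters through the pairwise distances inside the kernel, so the linear system that is solved has size `n_frames` by `n_frames` regardless of `dim_in`. That is one way to read the abstract's remark that a higher-dimensional input need not increase the cost of the GP solve; the paper's actual feature design and training procedure may differ.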