Abstract

Media conversion of images, text, speech, etc., generally requires a large amount of parallel data for training a conversion model. Recently, methods that train such models with little or no parallel data have drawn researchers' attention. In many-to-many voice conversion, since it is often hard to collect parallel data from every pair of speakers, conversion models requiring no parallel data are desirable. Conventional many-to-many voice conversion models required a large amount of prestored parallel data to acquire prior knowledge of the entire speaker space; a conversion model from an arbitrary source speaker to an arbitrary target speaker could then be realized by adapting a small number of model parameters. Although these models certainly use no parallel data in the adaptation step, they still rely on parallel data for prior training. In this study, we aim at realizing completely parallel-data-free many-to-many voice conversion. The proposed method combines eigenvoice Gaussian mixture models (EVGMM) and deep neural networks (DNNs). EVGMM is a many-to-many conversion model that constructs the entire speaker space, called the eigenspace, by analyzing the mean vectors of Gaussian mixture models; in our method, it is used to decompose training speakers' features into their eigenspace components. Using the speaker features and the obtained components as pseudo parallel data, multiple DNNs are trained to realize conversion between them. With these DNNs, the features of any target speaker can be represented as a weighted sum of the components. It should be noted that none of the processes in our proposal requires parallel data; a key technique is to estimate the covariance terms of EVGMM with no parallel data. Experiments indicate that the individuality scores of the proposed method, trained with no parallel data, are comparable to those of a baseline system trained with parallel data.
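
To make the "weighted sum of components" idea concrete, the following minimal NumPy sketch illustrates the standard eigenvoice GMM formulation, in which each mixture's mean vector for a target speaker is a bias vector plus a weighted sum of eigenvectors, so the speaker is characterized by the weight vector alone. The shapes and variable names here are hypothetical and chosen for illustration; this is not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: M mixtures, D-dimensional features, J eigenvoices.
M, D, J = 64, 24, 8
bias = rng.normal(size=(M, D))            # b_m: bias (average-voice) mean vectors
eigenvoices = rng.normal(size=(M, D, J))  # B_m: per-mixture eigenvector matrices

def target_means(weights):
    """Adapted mean vectors for a speaker given eigenvoice weights w.

    mu_m = b_m + B_m @ w: any target speaker is located in the
    eigenspace by the J-dimensional weight vector alone.
    """
    return bias + eigenvoices @ weights   # broadcasts over the M mixtures

w = rng.normal(size=J)    # speaker-dependent weights (random stand-in here)
mu = target_means(w)      # shape (M, D): the speaker's adapted mean vectors
print(mu.shape)
```

In this view, adapting to a new speaker only requires estimating the small weight vector w, which is why the approach needs so few speaker-specific parameters.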
