Abstract
In recent years, voice conversion (VC) becomes a popular technique since it can be applied to various speech tasks. Most existing approaches on VC must use aligned speech pairs (parallel data) of the source speaker and the target speaker in training, which makes hard to handle it. Furthermore, VC methods proposed so far require to specify the source speaker in conversion stage, even though we just want to obtain the speech of the target speaker from the other speakers in many cases of VC. In this paper, we propose a VC method where it is not necessary to use any parallel data in the training, nor to specify the source speaker in the conversion. Our approach models a joint probability of acoustic, phonetic, and speaker features using a three-way restricted Boltzmann machine (3WRBM). Speaker-independent (SI) and speaker-dependent (SD) parameters in our model are simultaneously estimated under the maximum likelihood (ML) criteria using a speech set of multiple speakers. In conversion stage, phonetic features are at first estimated in a probabilistic manner given a speech of an arbitrary speaker, then a voice-converted speech is produced using the SD parameters of the target speaker. Our experimental results showed not only that our approach outperformed other non-parallel VC methods, but that the performance of the arbitrary-source VC was close to those of the traditional source-specified VC in our approach.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have