Abstract
Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory and practice, we are now able to produce human-like voice quality with high speaker similarity. In this article, we provide a comprehensive overview of the state of the art in voice conversion techniques and their performance evaluation methods, from the statistical approaches to deep learning, and discuss their promise and limitations. We also report on the recent Voice Conversion Challenges (VCC) and the performance of the current state of technology, and summarize the available resources for voice conversion research.
Highlights
Voice conversion (VC) is a significant aspect of artificial intelligence
One study uses a hidden Markov model (HMM) trained for the target speaker: the parameters of a Gaussian mixture model (GMM)-based linear transformation function are estimated so that the converted source vectors have maximum likelihood with respect to the target HMM [152]
The techniques for text-independent speaker characterization are readily available for non-parallel training data, where a speaker can be modeled by a set of parameters, such as a GMM or i-vector
Summary
Voice conversion (VC) is a significant aspect of artificial intelligence. It is the study of how to convert one’s voice to sound like that of another without changing the linguistic content. Early studies of voice conversion focused on spectrum mapping using parallel training data, where speech with the same linguistic content is available from both the source and target speakers; examples include vector quantization (VQ) [8] and fuzzy vector quantization [9]. Wu and Li [6], and Mohammadi and Kain [35], provided an overview of voice conversion systems from the perspective of time alignment of speech features followed by feature mapping, which represents the statistical modeling school of thought. The advent of deep learning techniques represents an important technology milestone in voice conversion research [36]. It has greatly advanced the state of the art and transformed the way we formulate voice conversion research problems.
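To make the idea of spectrum mapping with parallel training data concrete, the following is a minimal sketch of a simplified codebook mapping. It is an illustration under stated assumptions, not the exact procedure of [8]: it assumes the source and target frames have already been time-aligned, it uses a plain k-means codebook on the source side, and it maps each source codeword to the mean of the target frames paired with it. The function names `train_vq_mapping` and `convert`, and the toy two-dimensional features, are hypothetical and chosen only for this example.

```python
import numpy as np


def _nearest_code(frames, codebook):
    # Index of the closest codeword (Euclidean distance) for every frame.
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)


def train_vq_mapping(src, tgt, n_codes=2, n_iter=20, seed=0):
    """Learn a source codebook and a paired table of target vectors.

    src, tgt: (n_frames, dim) arrays of time-aligned parallel features
    (assumption: alignment was done beforehand).
    """
    rng = np.random.default_rng(seed)
    # Initialise the source codebook from randomly chosen source frames.
    codebook = src[rng.choice(len(src), size=n_codes, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = _nearest_code(src, codebook)
        # Move each codeword to the mean of the frames assigned to it.
        for k in range(n_codes):
            if np.any(labels == k):
                codebook[k] = src[labels == k].mean(axis=0)
    labels = _nearest_code(src, codebook)
    # Each source codeword maps to the mean of its paired target frames;
    # an unused codeword falls back to the global target mean.
    mapped = np.tile(tgt.mean(axis=0), (n_codes, 1)).astype(float)
    for k in range(n_codes):
        if np.any(labels == k):
            mapped[k] = tgt[labels == k].mean(axis=0)
    return codebook, mapped


def convert(frames, codebook, mapped):
    # Replace each input frame by the target vector of its nearest codeword.
    return mapped[_nearest_code(frames, codebook)]


# Toy parallel data: the "target speaker" is the source shifted by +5.
src = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
tgt = src + 5.0
codebook, mapped = train_vq_mapping(src, tgt, n_codes=2)
converted = convert(np.array([[0.2, 0.4], [9.0, 10.0]]), codebook, mapped)
```

Because every converted frame is drawn from a small table of target vectors, the output of such a hard mapping is quantized. Softer variants such as the fuzzy vector quantization cited as [9] are one way to address that kind of discontinuity.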