Abstract
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, which keeps them far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks for which large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech. In this work, we examine our proposed method in a parallel, one-to-one setting. We employed recurrent neural network (RNN)-based and Transformer-based models, and through systematic experiments, we demonstrate the effectiveness of the pretraining scheme and the superiority of Transformer-based models over RNN-based models in terms of intelligibility, naturalness, and similarity.
Highlights
Voice conversion (VC) aims to convert speech from a source speaker to that of a target speaker without changing the linguistic content [1]
We compare two model architectures for seq2seq VC, recurrent neural networks (RNNs) and Transformers, and we show the superiority of the latter over the former, which is consistent with the findings in most speech processing tasks [30]
A unified, two-stage training strategy was proposed: the decoder is pretrained first and the encoder second, after which the VC model is initialized with the pretrained model parameters (a sketch of this schedule follows below)
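The sketch below illustrates one way such a two-stage schedule could be organized in PyTorch-style Python. The toy Encoder/Decoder modules, the fit helper, and the dummy data generator are illustrative assumptions, not the authors' released code, and they omit attention, stop-token prediction, and other details of a real seq2seq model.

```python
# A minimal sketch of the two-stage pretraining strategy, assuming simple
# feed-forward stand-ins for the real seq2seq encoder and decoder.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Toy encoder: input feature sequence -> hidden representations."""
    def __init__(self, in_dim=80, hid_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())

    def forward(self, x):
        return self.net(x)


class Decoder(nn.Module):
    """Toy decoder: hidden representations -> acoustic features."""
    def __init__(self, hid_dim=256, out_dim=80):
        super().__init__()
        self.net = nn.Linear(hid_dim, out_dim)

    def forward(self, h):
        return self.net(h)


def fit(encoder, decoder, data, steps=10, freeze_decoder=False):
    """Generic regression training loop shared by all stages."""
    params = list(encoder.parameters())
    if not freeze_decoder:
        params += list(decoder.parameters())
    opt, loss_fn = torch.optim.Adam(params, lr=1e-3), nn.L1Loss()
    for _ in range(steps):
        x, y = data()                       # one (input, target) minibatch
        opt.zero_grad()
        loss = loss_fn(decoder(encoder(x)), y)
        loss.backward()
        opt.step()


# Stand-in for real minibatches of 80-dim acoustic feature sequences.
dummy_batch = lambda: (torch.randn(8, 100, 80), torch.randn(8, 100, 80))

# Stage 1: pretrain a decoder (with an auxiliary encoder) on a large corpus,
# e.g. a TTS corpus, so the decoder learns to generate well-formed speech.
aux_encoder, pretrained_decoder = Encoder(), Decoder()
fit(aux_encoder, pretrained_decoder, dummy_batch)

# Stage 2: pretrain the encoder against the now-fixed pretrained decoder,
# so it produces hidden representations the decoder already understands.
pretrained_encoder = Encoder()
fit(pretrained_encoder, pretrained_decoder, dummy_batch, freeze_decoder=True)

# Initialization: copy the pretrained parameters into the VC model, then
# fine-tune on the small parallel VC corpus (fine-tuning not shown).
vc_encoder, vc_decoder = Encoder(), Decoder()
vc_encoder.load_state_dict(pretrained_encoder.state_dict())
vc_decoder.load_state_dict(pretrained_decoder.state_dict())
```

The key design point captured here is that both stages train against the same decoder, so the hidden-representation interface between encoder and decoder stays consistent when the pretrained parameters are copied into the VC model.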
Summary
Voice conversion (VC) aims to convert speech from a source speaker to that of a target speaker without changing the linguistic content [1]. As pointed out in [11], the converted speech in seq2seq VC systems still suffers from mispronunciations and other linguistic-inconsistency problems, such as inserted, repeated, and skipped phonemes. This is mainly due to the failure in attention alignment learning, which can be seen as a consequence of insufficient data.

As one popular solution to limited training data, pretraining has been regaining attention in recent years; it transfers knowledge from massive, out-of-domain data to aid learning in the target domain. This concept is usually realized by learning universal, high-level feature representations. TTS and ASR both aim to find a mapping between text and speech: the former tries to add speaker information to its input, while the latter tries to remove it. We suspect that the intermediate hidden representation spaces of these two tasks contain relatively little speaker information and therefore serve as a suitable fit for VC
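As a concrete illustration of this idea, the following is a minimal sketch of conversion-time inference with a Transformer-based seq2seq VC model whose encoder and decoder would be initialized from pretrained checkpoints. The module names, checkpoint paths, dimensions, and the simplified greedy decoding loop are illustrative assumptions, not the authors' implementation.

```python
# A sketch of inference: source features -> (pretrained-initialized) encoder
# -> hidden representation -> (pretrained-initialized) decoder -> converted
# features. Shapes and hyperparameters are illustrative only.
import torch
import torch.nn as nn

d_model, n_mels = 256, 80

prenet = nn.Linear(n_mels, d_model)        # source mel frames -> model dim
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4), num_layers=3)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4), num_layers=3)
postnet = nn.Linear(d_model, n_mels)       # model dim -> converted mel frames

# Under the pretraining scheme, the encoder/decoder weights would be loaded
# from pretrained checkpoints here, e.g. (hypothetical paths):
#   encoder.load_state_dict(torch.load("pretrained_encoder.pt"))
#   decoder.load_state_dict(torch.load("pretrained_decoder.pt"))

with torch.no_grad():
    src_mel = torch.randn(120, 1, n_mels)  # (frames, batch, mels) source utterance
    memory = encoder(prenet(src_mel))      # hidden representation of the source

    # Greedy autoregressive decoding, heavily simplified: the decoder's last
    # hidden output is fed back directly, and a fixed frame budget replaces
    # stop-token prediction.
    out_frames = [torch.zeros(1, 1, d_model)]  # "go" frame
    for _ in range(50):
        tgt = torch.cat(out_frames, dim=0)
        out_frames.append(decoder(tgt, memory)[-1:])
    converted_mel = postnet(torch.cat(out_frames[1:], dim=0))

print(converted_mel.shape)                 # torch.Size([50, 1, 80])
```

In a complete system the converted mel spectrogram would then be passed to a vocoder to synthesize the target-speaker waveform.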