Abstract
We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or from a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices from untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data available for the target speaker, the cloning strategy can be adjusted to take advantage of additional data and to modify the behaviors of the text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolutional layers to model the encoders, the decoders, and a WaveNet vocoder. Evaluations show that it achieves quality comparable to state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, we demonstrate that the proposed framework can switch between TTS and VC with high speaker consistency, which will be useful for many applications.
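To make the cloning procedure concrete, the sketch below illustrates the idea the abstract describes: encoders and decoders are pretrained on a multi-speaker corpus, the speaker-independent speech encoder is then frozen, and only the decoder is fine-tuned on the target speaker's untranscribed speech by backpropagation. This is a minimal PyTorch sketch under assumed simplifications; the module names (SpeechEncoder, SpeechDecoder), layer sizes, and the L1 reconstruction loss are illustrative placeholders, not the actual NAUTILUS architecture.

```python
# Minimal sketch of untranscribed-speech voice cloning: pretrain encoder and
# decoder on a multi-speaker corpus, freeze the encoder, and adapt only the
# decoder to the target speaker by backpropagation. All names and sizes here
# are illustrative placeholders.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps acoustic features to a (speaker-independent) linguistic latent."""
    def __init__(self, n_mels=80, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=5, padding=2),
        )

    def forward(self, mel):  # mel: (batch, n_mels, frames)
        return self.net(mel)

class SpeechDecoder(nn.Module):
    """Maps the linguistic latent back to acoustic features."""
    def __init__(self, n_mels=80, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, z):
        return self.net(z)

# Assume both modules were pretrained on the multi-speaker corpus.
encoder, decoder = SpeechEncoder(), SpeechDecoder()

# Cloning: freeze the encoder so the latent stays speaker-independent,
# and fine-tune only the decoder toward the target voice.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

def adaptation_step(mel_target):
    """One gradient step on a batch of the target speaker's untranscribed speech."""
    z = encoder(mel_target)                      # no transcript needed
    recon = decoder(z)
    loss = nn.functional.l1_loss(recon, mel_target)
    optimizer.zero_grad()
    loss.backward()                              # backpropagation-based cloning
    optimizer.step()
    return loss.item()
```

Because the frozen encoder's latent is speaker-independent, the adapted decoder could in principle be driven by either a text encoder (TTS) or the speech encoder (VC), which is in the spirit of the TTS/VC switching the abstract claims.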
Highlights
Speech synthesis is the technology of generating speech from an input interface.
One possible explanation is that the utterances generated by XV had the characteristics of the speakers in the LibriTTS corpus instead of those of the target speakers, which makes them more compatible with an automatic speech recognition (ASR) model trained on LibriSpeech.
Our TTS system had a lower word error rate (WER) than our voice conversion (VC) systems.
Even though our system scored lower on quality than N10 did, its speaker similarity appeared to be higher.
Our TTS and VC systems produced highly consistent results, whereas there was a gap between the same-gender and cross-gender subsystems of N10.
Summary
Speech synthesis is the technology of generating speech from an input interface. It can refer to all kinds of speech generation interfaces, such as voice conversion (VC) [2], video-to-speech [3], [4], and others [5], [6], [7]. Deep neural networks are used in various components of these speech synthesis systems, and because such networks are data-hungry, many hours of speech from a target speaker are typically needed to train a model. This limits the ability to scale the technology to many different voices.