Abstract

We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or from a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices from untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data circumstances of the target speaker, the cloning strategy can be adjusted to take advantage of additional data and to modify the behaviors of the text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolutional layers to model the encoders, the decoders, and the WaveNet vocoder. Evaluations show that it achieves quality comparable to that of state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, the proposed framework is shown to switch between TTS and VC with high speaker consistency, which will be useful for many applications.
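To make the overall design concrete, below is a minimal PyTorch sketch of this kind of architecture: a text encoder and a speech encoder map their inputs into a shared linguistic latent space, and a single decoder generates acoustic features from that latent, so the same decoder serves both the TTS path (text in) and the VC path (speech in). This is not the authors' implementation; the module name `Conv1dEncoder`, the layer sizes, and the feature dimensions (80-dimensional mel features, a 64-dimensional latent) are illustrative assumptions.

```python
# Conceptual sketch (not the authors' code) of a NAUTILUS-style system:
# two encoders share one latent space; one decoder produces acoustic features.
import torch
import torch.nn as nn

class Conv1dEncoder(nn.Module):
    """Stack of 1-D convolutions, standing in for the paper's deep
    convolutional encoders/decoders (illustrative structure only)."""
    def __init__(self, in_dim, out_dim, hidden=256, layers=4):
        super().__init__()
        blocks, dim = [], in_dim
        for _ in range(layers):
            blocks += [nn.Conv1d(dim, hidden, kernel_size=5, padding=2), nn.ReLU()]
            dim = hidden
        blocks.append(nn.Conv1d(hidden, out_dim, kernel_size=1))
        self.net = nn.Sequential(*blocks)

    def forward(self, x):      # x: (batch, in_dim, time)
        return self.net(x)     # (batch, out_dim, time)

text_enc   = Conv1dEncoder(in_dim=512, out_dim=64)   # phoneme embeddings -> latent
speech_enc = Conv1dEncoder(in_dim=80,  out_dim=64)   # mel features -> latent
decoder    = Conv1dEncoder(in_dim=64,  out_dim=80)   # latent -> mel features

# TTS path: text -> shared latent -> acoustic features
mel_from_text   = decoder(text_enc(torch.randn(1, 512, 100)))
# VC path: source speech -> shared latent -> acoustic features
mel_from_speech = decoder(speech_enc(torch.randn(1, 80, 100)))
```

The abstract's cloning step, adapting to a target voice from untranscribed speech via backpropagation, can then be sketched as auto-encoding the target speaker's speech and updating the decoder. Which modules are frozen or updated, and the L1 reconstruction loss, are assumptions for illustration; the paper's actual adaptation strategy depends on the target speaker's data circumstances.

```python
# Voice cloning from untranscribed speech (sketch): freeze the speech
# encoder, auto-encode the target speaker's mels, and backpropagate the
# reconstruction loss into the decoder only.
target_mels = torch.randn(8, 80, 100)        # stand-in for real target speech
optim = torch.optim.Adam(decoder.parameters(), lr=1e-4)
for step in range(100):
    with torch.no_grad():
        latent = speech_enc(target_mels)     # linguistic content (ideally speaker-independent)
    recon = decoder(latent)
    loss = nn.functional.l1_loss(recon, target_mels)
    optim.zero_grad()
    loss.backward()
    optim.step()
```

Because both input paths share the latent space, switching between TTS and VC after cloning amounts to choosing which encoder feeds the adapted decoder, which is consistent with the high speaker consistency reported in the evaluations.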

Highlights

  • Speech synthesis is the technology of generating speech from an input interface

  • One possible explanation is that the utterances generated by XV had the characteristics of the speakers in the LibriTTS corpus instead of those of the target speakers, which makes its utterances more compatible with an automatic speech recognition (ASR) model trained on LibriSpeech

  • Our TTS system had a lower word error rate (WER) than our voice conversion (VC) systems. Even though our system scored lower on quality than N10, its similarity seemed to be higher. Moreover, our TTS and VC systems produced highly consistent results, while there was a gap between the same-gender and cross-gender subsystems of N10

Introduction

Speech synthesis is the technology of generating speech from an input interface. The term covers many kinds of speech generation interfaces, such as voice conversion (VC) [2], video-to-speech [3], [4], and others [5], [6], [7]. Deep neural networks are used in various components of these speech synthesis systems, and because such networks are data-hungry, many hours of speech from a target speaker are typically needed to train a model. This limits the ability to scale the technology to many different voices.
