Abstract
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, which keeps them far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks for which large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech. In this work, we examine our proposed method in a parallel, one-to-one setting. We employed recurrent neural network (RNN)-based and Transformer-based models, and through systematic experiments, we demonstrate the effectiveness of the pretraining scheme and the superiority of Transformer-based models over RNN-based models in terms of intelligibility, naturalness, and similarity.
Highlights
Voice conversion (VC) aims to convert speech from a source speaker to that of a target speaker without changing the linguistic content [1]
We compare two model architectures for seq2seq VC, recurrent neural networks (RNNs) and Transformers, and we show the superiority of the latter over the former, which is consistent with the findings in most speech processing tasks [30]
A unified, two-stage training strategy was proposed: the decoder is pretrained first and the encoder second, after which the VC model is initialized with the pretrained model parameters (a sketch of this schedule follows below)
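The sketch below illustrates one way such a two-stage schedule could be organized in PyTorch-style Python. The toy Encoder/Decoder modules, the fit helper, and the dummy data generator are illustrative assumptions, not the authors' released code, and they omit attention, stop-token prediction, and other details of a real seq2seq model.

```python
# A minimal sketch of the two-stage pretraining strategy, assuming simple
# feed-forward stand-ins for the real seq2seq encoder and decoder.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Toy encoder: input feature sequence -> hidden representations."""
    def __init__(self, in_dim=80, hid_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())

    def forward(self, x):
        return self.net(x)


class Decoder(nn.Module):
    """Toy decoder: hidden representations -> acoustic features."""
    def __init__(self, hid_dim=256, out_dim=80):
        super().__init__()
        self.net = nn.Linear(hid_dim, out_dim)

    def forward(self, h):
        return self.net(h)


def fit(encoder, decoder, data, steps=10, freeze_decoder=False):
    """Generic regression training loop shared by all stages."""
    params = list(encoder.parameters())
    if not freeze_decoder:
        params += list(decoder.parameters())
    opt, loss_fn = torch.optim.Adam(params, lr=1e-3), nn.L1Loss()
    for _ in range(steps):
        x, y = data()                       # one (input, target) minibatch
        opt.zero_grad()
        loss = loss_fn(decoder(encoder(x)), y)
        loss.backward()
        opt.step()


# Stand-in for real minibatches of 80-dim acoustic feature sequences.
dummy_batch = lambda: (torch.randn(8, 100, 80), torch.randn(8, 100, 80))

# Stage 1: pretrain a decoder (with an auxiliary encoder) on a large corpus,
# e.g. a TTS corpus, so the decoder learns to generate well-formed speech.
aux_encoder, pretrained_decoder = Encoder(), Decoder()
fit(aux_encoder, pretrained_decoder, dummy_batch)

# Stage 2: pretrain the encoder against the now-fixed pretrained decoder,
# so it produces hidden representations the decoder already understands.
pretrained_encoder = Encoder()
fit(pretrained_encoder, pretrained_decoder, dummy_batch, freeze_decoder=True)

# Initialization: copy the pretrained parameters into the VC model, then
# fine-tune on the small parallel VC corpus (fine-tuning not shown).
vc_encoder, vc_decoder = Encoder(), Decoder()
vc_encoder.load_state_dict(pretrained_encoder.state_dict())
vc_decoder.load_state_dict(pretrained_decoder.state_dict())
```

The key design point captured here is that both stages train against the same decoder, so the hidden-representation interface between encoder and decoder stays consistent when the pretrained parameters are copied into the VC model.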
Summary
Voice conversion (VC) aims to convert speech from a source speaker to that of a target speaker without changing the linguistic content [1]. As pointed out in [11], the converted speech in seq2seq VC systems still suffers from mispronunciations and other linguistic-inconsistency problems, such as inserted, repeated, and skipped phonemes. This is mainly due to the failure in attention alignment learning, which can be seen as a consequence of insufficient data.

As one popular solution to limited training data, pretraining has been regaining attention in recent years; it transfers knowledge from massive, out-of-domain data to aid learning in the target domain. This concept is usually realized by learning universal, high-level feature representations. TTS and ASR both aim to find a mapping between text and speech: the former tries to add speaker information to its input, while the latter tries to remove it. We suspect that the intermediate hidden representation spaces of these two tasks contain relatively little speaker information and therefore serve as a suitable fit for VC
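As a concrete illustration of this idea, the following is a minimal sketch of conversion-time inference with a Transformer-based seq2seq VC model whose encoder and decoder would be initialized from pretrained checkpoints. The module names, checkpoint paths, dimensions, and the simplified greedy decoding loop are illustrative assumptions, not the authors' implementation.

```python
# A sketch of inference: source features -> (pretrained-initialized) encoder
# -> hidden representation -> (pretrained-initialized) decoder -> converted
# features. Shapes and hyperparameters are illustrative only.
import torch
import torch.nn as nn

d_model, n_mels = 256, 80

prenet = nn.Linear(n_mels, d_model)        # source mel frames -> model dim
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4), num_layers=3)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4), num_layers=3)
postnet = nn.Linear(d_model, n_mels)       # model dim -> converted mel frames

# Under the pretraining scheme, the encoder/decoder weights would be loaded
# from pretrained checkpoints here, e.g. (hypothetical paths):
#   encoder.load_state_dict(torch.load("pretrained_encoder.pt"))
#   decoder.load_state_dict(torch.load("pretrained_decoder.pt"))

with torch.no_grad():
    src_mel = torch.randn(120, 1, n_mels)  # (frames, batch, mels) source utterance
    memory = encoder(prenet(src_mel))      # hidden representation of the source

    # Greedy autoregressive decoding, heavily simplified: the decoder's last
    # hidden output is fed back directly, and a fixed frame budget replaces
    # stop-token prediction.
    out_frames = [torch.zeros(1, 1, d_model)]  # "go" frame
    for _ in range(50):
        tgt = torch.cat(out_frames, dim=0)
        out_frames.append(decoder(tgt, memory)[-1:])
    converted_mel = postnet(torch.cat(out_frames[1:], dim=0))

print(converted_mel.shape)                 # torch.Size([50, 1, 80])
```

In a complete system the converted mel spectrogram would then be passed to a vocoder to synthesize the target-speaker waveform.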