Abstract

This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to be used for all the hidden layers in the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers using only a single model instead of separately learning mappings between each speaker pair using a different model. This enables the model to fully utilize available training data collected from multiple speakers by capturing common latent features that can be shared across different speakers. Owing to this structure, our model works reasonably well even without source speaker information, thus making it able to handle any-to-many conversion tasks. Third, we introduce a mechanism called conditional batch normalization, which switches the batch normalization layers in accordance with the target speaker. This mechanism has been found to be extremely effective for our many-to-many conversion model. We conducted speaker identity conversion experiments and found that ConvS2S-VC obtained higher sound quality and speaker similarity than baseline methods. We also found from audio examples that it could perform well in various tasks including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.
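
To make the conditional batch normalization idea concrete, the following is a minimal PyTorch sketch of one plausible realization: a shared normalization layer whose scale and shift parameters are switched according to the target-speaker index. The class and parameter names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of conditional batch normalization: a single shared normalization,
# with per-speaker scale (gamma) and shift (beta) selected by the
# target-speaker index. Names here are illustrative, not the paper's code.
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    def __init__(self, num_channels: int, num_speakers: int):
        super().__init__()
        # Normalize without a built-in affine transform ...
        self.bn = nn.BatchNorm1d(num_channels, affine=False)
        # ... and learn a separate scale and shift per target speaker.
        self.gamma = nn.Embedding(num_speakers, num_channels)
        self.beta = nn.Embedding(num_speakers, num_channels)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); speaker: (batch,) target-speaker indices
        h = self.bn(x)
        g = self.gamma(speaker).unsqueeze(-1)  # (batch, channels, 1)
        b = self.beta(speaker).unsqueeze(-1)
        return g * h + b

# Usage: the same conversion network is steered toward a different target
# speaker simply by switching which normalization parameters are applied.
layer = ConditionalBatchNorm1d(num_channels=256, num_speakers=4)
x = torch.randn(8, 256, 100)               # a batch of feature sequences
y = layer(x, torch.randint(0, 4, (8,)))    # per-example target speakers
```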

Highlights

  • Voice conversion (VC) is a technique for converting para/non-linguistic information contained in a given utterance, such as the perceived identity of a speaker, while preserving linguistic information

  • While recurrent neural networks (RNNs) are a natural choice for modeling long sequential data, recent work has shown that convolutional neural networks (CNNs) with gating mechanisms have excellent potential for capturing long-term dependencies [31], [32]

  • Inspired by its success in these tasks, we propose a voice conversion (VC) method based on the ConvS2S model, which we call ConvS2S-VC, along with an architecture tailored for use with VC

Summary

INTRODUCTION

Voice conversion (VC) is a technique for converting para/non-linguistic information contained in a given utterance, such as the perceived identity of a speaker, while preserving linguistic information. Since prosodic features such as the pitch contour and duration play as important a role as local spectral features in characterizing speaker identities and speaking styles, it would be desirable if these features could be converted more flexibly. To achieve this, we need a model that can learn to convert entire feature sequences by capturing and utilizing long-term dependencies in source and target speech. While recurrent neural networks (RNNs) are a natural choice for modeling long sequential data, recent work has shown that convolutional neural networks (CNNs) with gating mechanisms have excellent potential for capturing long-term dependencies [31], [32]. In addition, unlike RNNs, they are suitable for parallel computations using GPUs; a sketch of such a gated convolution layer is given below. To exploit this advantage of CNNs, an S2S model was recently proposed that adopts a fully convolutional architecture [33]. Inspired by its success, we propose a VC method based on the ConvS2S model, called ConvS2S-VC, together with the conditional batch normalization mechanism, which was experimentally found to work very well for our many-to-many conversion model.
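
As a concrete illustration of the gating mechanism referred to above, here is a minimal PyTorch sketch of a gated 1-D convolution layer (a gated linear unit in the style of [31], [32]). It is a hedged sketch, not the exact ConvS2S-VC architecture: the point is that every timestep is produced in a single convolution pass, which is what makes such layers amenable to GPU parallelization, in contrast to the stepwise recurrence of an RNN.

```python
# Sketch of a gated 1-D convolution (gated linear unit, GLU).
# All timesteps are computed in one convolution pass, so the layer
# parallelizes on GPUs; stacking layers grows the receptive field,
# letting the model capture long-term dependencies without recurrence.
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        # Produce 2x channels: half for the linear path, half for the gate.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)  # gated linear unit

x = torch.randn(8, 128, 200)
h = GatedConv1d(128)(x)  # all 200 timesteps processed in parallel
```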

RELATED WORK
Feature Extraction and Normalization
Constraints on Attention Matrix
Training Loss
Impact of Batch Normalization
Model and Training Loss
Conditional Batch Normalization
Any-to-Many Conversion
Experimental Settings
Network Architectures
Objective Performance Measures
Baseline Methods
Objective Evaluations
Subjective Listening Tests
Audio Examples of Various Conversion Tasks
Findings
CONCLUSION