Abstract

This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to be used for all the hidden layers in the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers using only a single model instead of separately learning mappings between each speaker pair using a different model. This enables the model to fully utilize available training data collected from multiple speakers by capturing common latent features that can be shared across different speakers. Owing to this structure, our model works reasonably well even without source speaker information, thus making it able to handle any-to-many conversion tasks. Third, we introduce a mechanism called conditional batch normalization, which switches the batch normalization layers in accordance with the target speaker. This mechanism has been found to be extremely effective for our many-to-many conversion model. We conducted speaker identity conversion experiments and found that ConvS2S-VC obtained higher sound quality and speaker similarity than baseline methods. We also found from audio examples that it could perform well in various tasks including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.
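
To make the conditional batch normalization idea concrete, the following is a minimal PyTorch sketch of one plausible realization: a shared normalization layer whose scale and shift parameters are switched according to the target-speaker index. The class and parameter names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of conditional batch normalization: a single shared normalization,
# with per-speaker scale (gamma) and shift (beta) selected by the
# target-speaker index. Names here are illustrative, not the paper's code.
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    def __init__(self, num_channels: int, num_speakers: int):
        super().__init__()
        # Normalize without a built-in affine transform ...
        self.bn = nn.BatchNorm1d(num_channels, affine=False)
        # ... and learn a separate scale and shift per target speaker.
        self.gamma = nn.Embedding(num_speakers, num_channels)
        self.beta = nn.Embedding(num_speakers, num_channels)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); speaker: (batch,) target-speaker indices
        h = self.bn(x)
        g = self.gamma(speaker).unsqueeze(-1)  # (batch, channels, 1)
        b = self.beta(speaker).unsqueeze(-1)
        return g * h + b

# Usage: the same conversion network is steered toward a different target
# speaker simply by switching which normalization parameters are applied.
layer = ConditionalBatchNorm1d(num_channels=256, num_speakers=4)
x = torch.randn(8, 256, 100)               # a batch of feature sequences
y = layer(x, torch.randint(0, 4, (8,)))    # per-example target speakers
```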

Highlights

  • Voice conversion (VC) is a technique for converting para/non-linguistic information contained in a given utterance, such as the perceived identity of a speaker, while preserving linguistic information

  • While recurrent neural networks (RNNs) are a natural choice for modeling long sequential data, recent work has shown that convolutional neural networks (CNNs) with gating mechanisms have excellent potential for capturing long-term dependencies [31], [32]

  • Inspired by its success in these tasks, we propose a voice conversion (VC) method based on the ConvS2S model, which we call ConvS2S-VC, along with an architecture tailored for use with VC

Summary

INTRODUCTION

Voice conversion (VC) is a technique for converting para/non-linguistic information contained in a given utterance, such as the perceived identity of a speaker, while preserving linguistic information. Since prosodic features such as the pitch contour and duration play as important a role as local spectral features in characterizing speaker identities and speaking styles, it would be desirable if these features could be converted more flexibly. To achieve this, we need a model that can learn to convert entire feature sequences by capturing and utilizing long-term dependencies in source and target speech. While recurrent neural networks (RNNs) are a natural choice for modeling long sequential data, recent work has shown that convolutional neural networks (CNNs) with gating mechanisms have excellent potential for capturing long-term dependencies [31], [32]. In addition, unlike RNNs, they are suitable for parallel computations using GPUs; a sketch of such a gated convolution layer is given below. To exploit this advantage of CNNs, an S2S model was recently proposed that adopts a fully convolutional architecture [33]. Inspired by its success, we propose a VC method based on the ConvS2S model, called ConvS2S-VC, together with the conditional batch normalization mechanism, which was experimentally found to work very well for our many-to-many conversion model.
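
As a concrete illustration of the gating mechanism referred to above, here is a minimal PyTorch sketch of a gated 1-D convolution layer (a gated linear unit in the style of [31], [32]). It is a hedged sketch, not the exact ConvS2S-VC architecture: the point is that every timestep is produced in a single convolution pass, which is what makes such layers amenable to GPU parallelization, in contrast to the stepwise recurrence of an RNN.

```python
# Sketch of a gated 1-D convolution (gated linear unit, GLU).
# All timesteps are computed in one convolution pass, so the layer
# parallelizes on GPUs; stacking layers grows the receptive field,
# letting the model capture long-term dependencies without recurrence.
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        # Produce 2x channels: half for the linear path, half for the gate.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)  # gated linear unit

x = torch.randn(8, 128, 200)
h = GatedConv1d(128)(x)  # all 200 timesteps processed in parallel
```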

RELATED WORK
Feature Extraction and Normalization
Constraints on Attention Matrix
Training Loss
Impact of Batch Normalization
Model and Training Loss
Conditional Batch Normalization
Any-to-Many Conversion
Experimental Settings
Network Architectures
Objective Performance Measures
Baseline Methods
Objective Evaluations
Subjective Listening Tests
Audio Examples of Various Conversion Tasks
Findings
CONCLUSION