Abstract

Spoken languages are phonetically similar because humans share a common vocal production system. However, each language has a unique phonetic repertoire and its own phonotactic rules. In cross-lingual voice conversion, the source and target speakers speak different languages. The challenge is how to project the speaker identity of the source speaker onto that of the target across two different phonetic systems. A typical voice conversion system employs a generator-vocoder pipeline, where the generator is responsible for conversion and the vocoder for waveform reconstruction. We propose a novel Multi-Task WaveRNN with an integrated architecture for cross-lingual voice conversion. The WaveRNN is trained on two sets of monolingual data via two-task learning. The integrated architecture takes linguistic features as input and outputs the speech waveform directly. Voice conversion experiments conducted between English and Mandarin confirm the effectiveness of the proposed method in terms of speech quality and speaker similarity.
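To make the two-task idea concrete, here is a minimal, hypothetical sketch in plain Python/numpy (not the authors' implementation): a single set of shared parameters is updated by alternating gradient steps on batches from two monolingual datasets. Both toy "languages" share one underlying feature-to-sample mapping, loosely mirroring the shared vocal production system; the names `english`, `mandarin`, and the linear "generator" `w` are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared mapping from linguistic features to waveform samples.
d = 8
w_true = rng.normal(size=d)

def make_task(n=200):
    """A toy monolingual corpus: (features, targets) pairs."""
    x = rng.normal(size=(n, d))
    return x, x @ w_true

english = make_task()
mandarin = make_task()

w = np.zeros(d)  # shared parameters, trained jointly on both tasks
for step in range(500):
    # Alternate the two tasks, as in two-task learning over shared weights.
    x, y = english if step % 2 == 0 else mandarin
    grad = 2 * x.T @ (x @ w - y) / len(y)  # MSE gradient
    w -= 0.05 * grad                        # SGD step on the shared weights

err_en = float(np.mean((english[0] @ w - english[1]) ** 2))
err_zh = float(np.mean((mandarin[0] @ w - mandarin[1]) ** 2))
```

Because the shared weights receive gradients from both datasets, the converged model fits both tasks; the real system replaces the linear map with a WaveRNN and the toy corpora with English and Mandarin speech data.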

Highlights

  • Voice conversion (VC) is a technique to convert one’s voice to sound like that of another [1], where we change the speaker identity while carrying over the linguistic content

  • A modularized neural network (NN) with multi-task learning (MTL) [25] was recently studied for xVC between English and Mandarin

  • We find that MTWaveRNN consistently achieves lower average MCD and RMSE than STWaveRNN (7.14 dB vs. 7.69 dB and 13.90 dB vs. 14.49 dB), which confirms the advantage of multi-task learning over single-task learning
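The MCD figures in the highlight above are mel-cepstral distortions. As a reference for how such numbers are typically obtained, here is a small sketch of the standard per-frame MCD formula, (10 / ln 10) · sqrt(2 · Σ (c_ref − c_conv)²), averaged over aligned frames; the function name and the toy arrays are illustrative, not the authors' evaluation code.

```python
import numpy as np

def mel_cepstral_distortion(ref, conv):
    """Average mel-cepstral distortion in dB between aligned mel-cepstral
    sequences of shape (frames, dims)."""
    diff = np.asarray(ref, dtype=float) - np.asarray(conv, dtype=float)
    # Standard MCD formula: 10/ln(10) * sqrt(2 * sum of squared diffs) per frame
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))
    return float(np.mean(per_frame))

# Toy example: two frames, two cepstral dimensions
ref = np.array([[1.0, 0.0], [0.5, 0.5]])
conv = np.array([[0.0, 0.0], [0.5, 0.5]])
mcd = mel_cepstral_distortion(ref, conv)
```

Lower MCD means the converted cepstra are closer to the reference, which is why the 7.14 dB vs. 7.69 dB comparison favors MTWaveRNN.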

Summary

INTRODUCTION

Voice conversion (VC) is a technique to convert one’s voice (source) to sound like that of another (target) [1], where we change the speaker identity while carrying over the linguistic content. In cross-lingual voice conversion (xVC), the source and target speakers speak different languages, so it is difficult to obtain parallel data of the same content [2]–[4]. Cycle-consistent generative adversarial networks (CycleGANs) [17] have been studied to deal with the non-parallel training data problem; they generally work for a specific source-target pair and require a large amount of training data. Another idea is to disentangle the speaker-dependent component (speaker identity) from the speaker-independent component (speech content).
ZHOU et al.: MULTI-TASK WaveRNN WITH AN INTEGRATED ARCHITECTURE FOR CROSS-LINGUAL VOICE CONVERSION
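The cycle-consistency idea behind CycleGAN-based conversion can be illustrated with a toy sketch (not the paper's model): mapping source features to the target space and back should reconstruct the original features, and the reconstruction error is the cycle-consistency loss. Here the two "generators" are hypothetical linear maps instead of trained neural networks.

```python
import numpy as np

# Toy linear "generators" between two speakers' feature spaces; a real
# CycleGAN learns neural mappings with adversarial + cycle-consistency losses.
A2B = np.array([[2.0, 0.0],
                [0.0, 0.5]])
B2A = np.linalg.inv(A2B)  # an ideal inverse mapping back to the source space

def cycle_loss(x, f, g):
    """L1 cycle-consistency loss: mean |g(f(x)) - x| over all frames."""
    return float(np.mean(np.abs(x @ f @ g - x)))

x = np.random.default_rng(1).normal(size=(100, 2))  # source-speaker features
loss = cycle_loss(x, A2B, B2A)
```

With a perfect round trip the loss is near zero; during training, minimizing this loss is what lets CycleGANs learn from non-parallel data, since no paired target utterance is needed.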

Multi-Task Generator-Vocoder Pipeline
Neural Vocoder
MULTI-TASK WAVERNN WITH AN INTEGRATED ARCHITECTURE
EXPERIMENTS
Database and Feature Extraction
Results and Discussions
Findings
CONCLUSION