Abstract

Voice conversion (VC) transforms the speaking style of a source speaker into that of a target speaker while keeping the linguistic information unchanged. Traditional VC techniques rely on parallel recordings of multiple speakers uttering the same sentences: earlier approaches mainly learn a mapping between a given source–target speaker pair from pairs of similar utterances spoken by the two speakers. However, parallel data are expensive and difficult to collect, and non-parallel VC remains an interesting but challenging speech processing task. To address this limitation, we propose a method that enables non-parallel many-to-many voice conversion using a generative adversarial network. To the best of the authors' knowledge, our study is the first to employ a sinusoidal model with continuous parameters to generate the converted speech signal. Our method requires only a few minutes of training examples, without parallel utterances or time-alignment procedures, and the source–target speakers are entirely unseen in the training dataset. An empirical study is carried out on the publicly available CSTR VCTK corpus. Our results indicate that the proposed method reaches state-of-the-art performance in speaker similarity to utterances produced by the target speaker.

Highlights

  • Voice conversion (VC) is a task developed to convert the observed identity of a source speaker to sound like a different target speaker, while retaining the linguistic or phonetic content unchanged

  • Instance normalization [36] is used in the generator, as this greatly improves the stability of training, while no normalization is employed in the discriminator. All hidden layers are followed by a rectified linear unit (ReLU) activation function, and the output layer by a sigmoid activation function

  • It was confirmed that the proposed approach converts the voice of the source speaker to sound similar to the target speaker
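
The normalization scheme described in the highlights can be sketched as follows. This is a minimal NumPy illustration of instance normalization followed by a ReLU hidden activation and a sigmoid output, not the authors' actual implementation; the tensor shapes and the toy forward pass are placeholders chosen for demonstration.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (sample, channel) slice over its time axis.

    x: array of shape (batch, channels, time).
    """
    mean = x.mean(axis=2, keepdims=True)
    var = x.var(axis=2, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy forward pass: hidden layers use instance norm + ReLU,
# and the output layer uses a sigmoid, as the highlight describes.
x = np.random.randn(4, 8, 100)        # (batch, channels, frames)
h = relu(instance_norm(x))            # hidden activation
out = sigmoid(h.mean(axis=(1, 2)))    # one score in (0, 1) per sample
```

Because the statistics are computed per sample and per channel, instance normalization removes instance-specific scale and offset, which is what stabilizes GAN generator training in practice.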


Summary

Introduction

Voice conversion (VC) is a task developed to convert the observed identity of a source speaker to sound like that of a different target speaker, while retaining the linguistic or phonetic content unchanged. Following the advent of the WaveNet architecture, it was applied in [16] as a vocoder, and the quality of synthesized speech was greatly increased. Most conventional VC methods require accurately aligned parallel data for training, where the source and target speakers read the same sentences and their pairs of frames must be aligned. There is another major problem: if the source and target speakers speak different languages or have different accents, building parallel data becomes unviable. To overcome these limitations, numerous attempts have been made to develop non-parallel voice conversion methods. This narrow focus, together with a lack of attention to phase information, has left important gaps in the literature. Sinusoidal modeling is another signal processing technique that offers promise for improving the performance of voice systems.
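To illustrate the kind of representation a sinusoidal model works with, the sketch below synthesizes one frame of a signal as a sum of sinusoids with given amplitudes, frequencies, and phases. The parameter values (a 200 Hz fundamental with two harmonics, 16 kHz sampling rate) are made up for demonstration and do not come from the paper.

```python
import numpy as np

def synthesize_frame(amps, freqs, phases, n_samples, fs=16000):
    """Render one frame as a sum of sinusoids:

    s[n] = sum_k a_k * cos(2*pi*f_k*n/fs + phi_k)
    """
    n = np.arange(n_samples)
    frame = np.zeros(n_samples)
    for a, f, p in zip(amps, freqs, phases):
        frame += a * np.cos(2.0 * np.pi * f * n / fs + p)
    return frame

# Example: three harmonics of a 200 Hz fundamental, a 10 ms frame at 16 kHz.
frame = synthesize_frame(amps=[1.0, 0.5, 0.25],
                         freqs=[200.0, 400.0, 600.0],
                         phases=[0.0, 0.0, 0.0],
                         n_samples=160)
```

In an analysis/synthesis pipeline these per-frame amplitude, frequency, and phase tracks are estimated from the input speech and then re-rendered, which is what makes such parameters a natural intermediate representation for conversion.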

Related Work and Motivation
Proposed Continuous Sinusoidal Model
Analysis Phase
Synthesis Phase
Modified
Full Objective
Conversion Process
Network Architecture
Experimental Setup
Statistical Evaluations
Empirical cumulative distribution function
Qualitative Evaluations
Conclusions
