Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder

Patrick Lumban Tobing,Tomoki Hayashi,Yi-Chiao Wu,Kazuhiro Kobayashi,Tomoki Toda

doi:10.1109/access.2019.2955978

Open Access

https://doi.org/10.1109/access.2019.2955978

Abstract

In this paper, we present a novel framework for a voice conversion (VC) system based on a cyclic recurrent neural network (CycleRNN) and a finely tuned WaveNet vocoder. Even though WaveNet is capable of producing natural speech waveforms when fed with natural speech features, it still suffers from speech quality degradation when fed with oversmoothed features, such as spectral parameters estimated from a statistical model. One way to address this problem is to introduce oversmoothed features while developing a WaveNet model. However, in a VC framework, providing oversmoothed spectral features of a target speaker for WaveNet modeling is not straightforward owing to the difference in the time-sequence alignment from that of a source speaker. To overcome this problem, we propose the use of a cyclic spectral conversion network, i.e., CycleRNN, capable of performing a conversion flow, i.e., source-to-target, and a cyclic flow, i.e., to generate self-predicted target spectra. The CycleRNN spectral model is trained using both conversion and weighted cyclic losses. To finely tune WaveNet, a pretrained multispeaker WaveNet model is optimized using the self-predicted features of the corresponding target speaker of a speaker conversion pair. The experimental results demonstrate that 1) the proposed CycleRNN-based spectral model for WaveNet fine-tuning significantly improves the naturalness of the converted speech waveforms, giving an overall mean opinion score of 3.50; and 2) the proposed model yields the highest speaker conversion accuracy, giving an overall speaker similarity score of 78.33%, which is a significant improvement compared with conventional WaveNet fine-tuning using natural target features.

Highlights

Voice conversion (VC) [1]–[4] is a framework for transforming the voice characteristics of a source speaker into those of a particular target speaker.
In this paper, we have presented a novel parallel voice conversion (VC) framework based on the cyclic structure of a recurrent neural network (CycleRNN) and a finely tuned WaveNet vocoder.
The CycleRNN architecture consists of two concatenated RNN modules, where the reconstructed target spectral features are obtained by feeding the original target spectral features through the first RNN and then the second RNN.
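
The two-module cyclic structure described above, together with the conversion and weighted cyclic losses named in the abstract, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the mean squared error stands in for whatever spectral distance the authors use, the mappings `src_to_tgt` and `tgt_to_src` are hypothetical placeholders for the two RNN modules, the weight `cycle_weight` and the feature dimension of 35 are arbitrary, and the source and target sequences are assumed to be time-aligned already.

```python
import numpy as np


def mse(a, b):
    """Mean squared error, used here as a stand-in spectral distance (assumption)."""
    return float(np.mean((a - b) ** 2))


def cycle_loss(src, tgt, src_to_tgt, tgt_to_src, cycle_weight=0.5):
    """Combine a conversion loss with a weighted cyclic loss.

    src, tgt     : time-aligned feature matrices of shape (frames, dims)
    src_to_tgt   : hypothetical callable standing in for one RNN module
    tgt_to_src   : hypothetical callable standing in for the other RNN module
    cycle_weight : hypothetical weight on the cyclic term
    """
    # Conversion flow: source features mapped toward the target speaker.
    converted = src_to_tgt(src)
    # Cyclic flow: target features passed through both modules in sequence,
    # giving reconstructed (self-predicted) target features.
    reconstructed = src_to_tgt(tgt_to_src(tgt))

    conversion_loss = mse(converted, tgt)
    cyclic_loss = mse(reconstructed, tgt)
    return conversion_loss + cycle_weight * cyclic_loss, reconstructed


# Toy data: the "target" is simply a scaled copy of the "source".
rng = np.random.default_rng(0)
src = rng.standard_normal((100, 35))
tgt = 2.0 * src

# Mappings that exactly invert the toy relationship give zero loss.
total, self_predicted = cycle_loss(src, tgt, lambda x: 2.0 * x, lambda y: 0.5 * y)

# Identity mappings leave the conversion error in place.
total_identity, _ = cycle_loss(src, tgt, lambda x: x, lambda y: y)
```

In this sketch, `self_predicted` plays the role of the self-predicted target features that the abstract describes as the input for WaveNet fine-tuning: they are aligned frame by frame with the natural target speech, which is what makes them usable as vocoder training input.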

Summary

INTRODUCTION

Voice conversion (VC) [1]–[4] is a framework for transforming the voice characteristics of a source speaker into those of a particular target speaker.

WAVENET FINE-TUNING WITH OVERSMOOTHED FEATURES TO OVERCOME QUALITY DEGRADATION PROBLEM IN VC

In a statistical VC framework, as illustrated in the top diagram of Fig. 3, there are mismatches between the converted spectral features of the source speaker and the natural spectral features of the target speaker. These mismatches degrade the quality of the converted speech waveform generated using the WaveNet vocoder, because the vocoder is developed with natural spectral features. This inconsistency is addressed using the proposed CycleRNN-based spectral modeling described in the following subsection.

PROPOSED CYCLERNN SPECTRAL MAPPING MODEL TO IMPROVE WAVENET FINE-TUNING

Findings

DISCUSSION

CONCLUSION