Abstract

In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, making WN generate some very noisy speech segments called collapsed speech. To tackle the problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) to avoid the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is only applied to the problematic segments to limit the loss of quality to short periods. Objective and subjective evaluations are conducted, and the experimental results confirm the effectiveness of the proposed method, which further improves the speech quality of our previous non-parallel VC system submitted to Voice Conversion Challenge 2018.

Highlights

  • Voice conversion is a technique to change speech characteristics, such as the speaker identity and emotion of input speech, while maintaining the same linguistic content

  • We propose a collapsed speech segment detector (CSSD) [42] that applies the linear predictive coding distribution constraint (LPCDC) only to the detected collapsed segments, which limits the negative effects of the LPCDC to a few speech segments and markedly eases the oversmoothing and quality degradation problems of the LPCDC

  • Experimental evaluations: we present collapsed speech detection, spectral conversion, and perceptual quality evaluations to confirm, respectively, the effectiveness of the proposed CSSD module, the baseline non-parallel voice conversion (VC) model, and the proposed WN vocoder with the LPCDC and CSSD


Summary

INTRODUCTION

Voice conversion is a technique to change speech characteristics, such as the speaker identity and emotion of input speech, while maintaining the same linguistic content. The discontinuous waveform signals and unexpected noisy speech segments caused by the acoustic and temporal mismatches and the exposure bias are called the collapsed speech problem [42]. To address this problem, we propose a linear predictive coding distribution constraint (LPCDC) [42] that directly refines the predicted probability distribution of each speech sample from the output of the WN vocoder, which significantly alleviates the collapsed speech problem. However, the LPCDC also introduces negative effects such as oversmoothing, and these degrade the quality of the WN-generated speech when the LPCDC is applied.
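The idea of refining the per-sample output distribution only inside detected collapsed segments can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an 8-bit mu-law categorical output, a Gaussian mask centred on a linear prediction from past samples, and a detector decision supplied from outside as a boolean. The names `constrain`, `refine`, `sigma`, and `is_collapsed` are hypothetical.

```python
import numpy as np

MU = 255  # assumed 8-bit mu-law quantization (256 levels)


def mulaw_encode(x, mu=MU):
    """Map a waveform value in [-1, 1] to an integer level in [0, mu]."""
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return int(np.round((y + 1.0) / 2.0 * mu))


def lpc_predict(past, coeffs):
    """Linear prediction: x_hat[t] = sum_k coeffs[k] * x[t-1-k].

    Requires len(past) >= len(coeffs).
    """
    order = len(coeffs)
    recent = np.asarray(past[-order:], dtype=float)[::-1]  # newest sample first
    return float(np.dot(coeffs, recent))


def constrain(probs, past, coeffs, sigma=8.0):
    """Reweight a categorical distribution with a Gaussian mask (assumed form)
    centred on the quantized linear prediction, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    center = mulaw_encode(lpc_predict(past, coeffs))
    levels = np.arange(len(probs))
    mask = np.exp(-0.5 * ((levels - center) / sigma) ** 2)
    constrained = probs * mask
    total = constrained.sum()
    if total <= 0.0:
        # Fall back to the mask alone if the product underflows to zero.
        return mask / mask.sum()
    return constrained / total


def refine(probs, past, coeffs, is_collapsed, sigma=8.0):
    """Apply the constraint only where a detector flagged the segment."""
    if is_collapsed:
        return constrain(probs, past, coeffs, sigma)
    return np.asarray(probs, dtype=float)
```

In this sketch the prediction coefficients would come from the reference speech, and leaving undetected segments untouched is what keeps any oversmoothing confined to the flagged samples.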

RELATED WORKS
DNN-BASED VC
WAVENET VOCODER WITH COLLAPSED SPEECH SUPPRESSION AND DETECTION
EXPERIMENTAL EVALUATIONS
EXPERIMENTAL SETTINGS
Nature
Findings
CONCLUSION
