Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

Yi Zhao, Junichi Yamagishi, Shinji Takaki, Daisuke Saito, Nobuaki Minematsu, Hieu-Thi Luong

doi:10.1109/access.2018.2872060

Abstract

Recent neural networks such as WaveNet and SampleRNN, which learn directly from speech waveform samples, have achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. Such networks are used as an alternative to vocoders and are hence often called neural vocoders. A neural vocoder uses acoustic features as local conditioning parameters, and these parameters need to be accurately predicted by a separate acoustic model. However, it is not yet clear how to train this acoustic model, which is problematic because the final quality of synthetic speech is significantly affected by the performance of the acoustic model. Significant degradation occurs especially when the predicted acoustic features have characteristics that are mismatched with those of natural ones. To reduce this mismatch between natural and generated acoustic features, we propose frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis that uses the WaveNet vocoder. We also extend the GAN frameworks and use the discretized-mixture-of-logistics (DML) loss of a well-trained WaveNet, in addition to mean squared error and adversarial losses, as part of the objective function. Experimental results show that acoustic models trained using the WGAN-GP framework with the back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
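The following is a minimal sketch of how a WGAN-GP critic loss and a generator loss combining mean squared error with an adversarial term could be computed. It is not the authors' implementation. The toy critic (one tanh hidden layer), the function names, the penalty weight `lam=10.0`, and the adversarial weight `w_adv` are all assumptions made for illustration; the abstract does not state the network architectures or loss weights. The DML term from the WaveNet vocoder is omitted because it would require a trained WaveNet.

```python
import math
import random

def _hidden(x, W1, b1):
    # Hidden activations of the toy critic: tanh(x @ W1 + b1).
    return [math.tanh(sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
            for j in range(len(b1))]

def critic(x, W1, b1, w2):
    # Hypothetical critic: maps one acoustic feature vector to a scalar score.
    return sum(hj * w2[j] for j, hj in enumerate(_hidden(x, W1, b1)))

def critic_input_grad(x, W1, b1, w2):
    # Analytic gradient of the critic score with respect to the input x.
    h = _hidden(x, W1, b1)
    return [sum((1.0 - h[j] ** 2) * w2[j] * W1[i][j] for j in range(len(b1)))
            for i in range(len(x))]

def gradient_penalty(real, fake, params, rng):
    # WGAN-GP penalty: push the critic's gradient norm toward 1 at random
    # points on the lines between natural and generated feature vectors.
    total = 0.0
    for r, f in zip(real, fake):
        eps = rng.random()
        x_hat = [eps * ri + (1.0 - eps) * fi for ri, fi in zip(r, f)]
        g = critic_input_grad(x_hat, *params)
        total += (math.sqrt(sum(gi * gi for gi in g)) - 1.0) ** 2
    return total / len(real)

def critic_loss(real, fake, params, rng, lam=10.0):
    # Wasserstein critic loss plus gradient penalty (lam is an assumed weight).
    d_real = sum(critic(r, *params) for r in real) / len(real)
    d_fake = sum(critic(f, *params) for f in fake) / len(fake)
    return d_fake - d_real + lam * gradient_penalty(real, fake, params, rng)

def generator_loss(real, fake, params, w_adv=1.0):
    # Acoustic-model loss: MSE plus weighted adversarial term.
    # A DML term from the vocoder would be added here in the full framework.
    n = sum(len(r) for r in real)
    mse = sum((fi - ri) ** 2 for r, f in zip(real, fake)
              for ri, fi in zip(r, f)) / n
    adv = -sum(critic(f, *params) for f in fake) / len(fake)
    return mse + w_adv * adv
```

In this sketch, `real` and `fake` are lists of equal-length feature vectors standing in for natural and predicted acoustic features. In practice the gradient with respect to the interpolated input would come from an automatic differentiation framework instead of a hand-derived formula.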


Journal: IEEE Access
Publication Date: Jan 1, 2018
Citations: 100
License type: cc-by-nc-nd


