Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra

Yuki Saito,Shinnosuke Takamichi,Hiroshi Saruwatari

doi:10.1016/j.csl.2019.05.008

Open Access

https://doi.org/10.1016/j.csl.2019.05.008

Journal: Computer Speech & Language
Publication Date: Jun 1, 2019
Citations: 13
License type: cc-by

Affiliation: The University of Tokyo

Abstract

This paper proposes novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate for short-term Fourier transform (STFT) amplitude spectra in low/multi frequency resolution. Vocoder-free TTS using STFT amplitude spectra can avoid the degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS. Our previous work on vocoder-based TTS proposed a method for incorporating GAN-based distribution compensation into acoustic model training to improve synthetic speech quality. This paper extends that algorithm to vocoder-free TTS and proposes a GAN-based training algorithm using low-frequency-resolution amplitude spectra to overcome the difficulty of modeling the complicated distribution of the high-dimensional spectra. In the proposed algorithm, amplitude spectra are transformed into low-frequency-resolution amplitude spectra by applying an average pooling function along the frequency axis; the GAN-based distribution compensation is then performed in the low-frequency-resolution domain. Because the low-frequency-resolution amplitude spectra approximately emulate filter banks, the proposed algorithm is expected to improve synthetic speech quality by reducing the differences between the spectral envelopes of natural and synthetic speech. Furthermore, various frequency scales related to human speech perception (e.g., mel and inverse mel frequency scales) can be introduced into the proposed training algorithm by applying a frequency warping function to the amplitude spectra. This paper also proposes a GAN-based training algorithm using multi-frequency-resolution amplitude spectra, which uses both low- and original-frequency-resolution amplitude spectra to reduce the differences in not only spectral envelopes but also fine structures.
Experimental results demonstrate that (1) GANs using low-frequency-resolution amplitude spectra improve speech quality and work robustly against the settings of the frequency resolution and hyperparameters, (2) among low-, original-, and multi-frequency-resolution amplitude spectra, the low-frequency-resolution ones work best for improving synthetic speech quality, and (3) using the inverse mel frequency scale to obtain the low-frequency-resolution amplitude spectra further improves synthetic speech quality.
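
The pooling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy spectrum values, and the choice of non-overlapping windows (stride equal to pool width) are all assumptions made for illustration, since the abstract does not state the pooling parameters. The sketch only shows how one frame of an amplitude spectrum might be reduced to a low-frequency-resolution view, and how that view could be paired with the original-resolution frame for the multi-frequency-resolution case. The GAN discriminator that would consume these features is not shown.

```python
from typing import List, Sequence, Tuple


def average_pool_frequency(
    spectrum: Sequence[float], pool_width: int, stride: int
) -> List[float]:
    """Average-pool one frame of an amplitude spectrum along the frequency axis.

    pool_width and stride are hypothetical parameter names; the abstract
    does not specify the values used in the paper.
    """
    if pool_width <= 0 or stride <= 0:
        raise ValueError("pool_width and stride must be positive")
    pooled = []
    # Slide a window over the frequency bins; an incomplete trailing
    # window is dropped rather than padded.
    for start in range(0, len(spectrum) - pool_width + 1, stride):
        window = spectrum[start:start + pool_width]
        pooled.append(sum(window) / pool_width)
    return pooled


def multi_resolution_features(
    frame: Sequence[float], pool_width: int, stride: int
) -> Tuple[List[float], List[float]]:
    """Return the original- and low-frequency-resolution views of one frame."""
    return list(frame), average_pool_frequency(frame, pool_width, stride)


# Toy 8-bin amplitude spectrum for a single frame (values are made up).
frame = [1.0, 3.0, 2.0, 4.0, 0.0, 2.0, 5.0, 7.0]
low_res = average_pool_frequency(frame, pool_width=4, stride=4)
print(low_res)  # [2.5, 3.5]
```

Each pooled value averages a band of adjacent bins, which is why the result behaves roughly like the output of a bank of rectangular filters: fine harmonic structure is smoothed away while the overall spectral envelope remains.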

Similar Papers

Text-to-Speech Synthesis Using STFT Spectra Based on Low-/Multi-Resolution Generative Adversarial Networks
Yuki Saito ... Hiroshi Saruwatari
01 Apr 2018

Quality of synthetic speech and auditory working memory performance: neuroergonomic perspectives from fNIRS
Adrian Curtin ... Hasan Ayaz
Frontiers in Human Neuroscience | VOL. 12
01 Jan 2018

Singing voice synthesizing method
Hideki Kenmochi
The Journal of the Acoustical Society of America | VOL. 120
01 Jan 2006

Improve the Quality of Synthetic Speech Trained with Found Data using Silence Cutter
Lau Chee Yong ... Tan Tian Swee
Research Journal of Applied Sciences, Engineering and Technology | VOL. 8
10 Oct 2014
