Abstract

Text-to-speech synthesis is a computational technique for producing synthetic, human-like speech by a computer. In recent years, speech synthesis techniques have advanced and have been employed in many applications, such as automatic translation applications and car navigation systems. End-to-end text-to-speech synthesis has gained considerable research interest because, compared to traditional models, end-to-end models are easier to design and more robust. Tacotron 2 is an integrated state-of-the-art end-to-end speech synthesis system that can directly predict close-to-natural human speech from raw text. However, there remains a gap between synthesized speech and natural speech. Suffering from an over-smoothness problem, Tacotron 2 produces 'averaged' speech, making the synthesized speech sound unnatural and inflexible. In this work, we first propose an estimated network (Es-Network), which captures general features from a raw mel spectrogram in an unsupervised manner. Then, we design Es-Tacotron2 by employing the Es-Network to calculate the estimated mel spectrogram residual and setting it as an additional prediction task of Tacotron 2, allowing the model to focus more on predicting the individual features of the mel spectrogram. Experiments show that, compared to the original Tacotron 2 model, Es-Tacotron2 can produce more variable decoder output and synthesize more natural and expressive speech.
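To make the residual prediction idea concrete, the sketch below shows one plausible multi-task loss in PyTorch: a pre-trained Es-Network produces a general (averaged) estimate of the target mel spectrogram, the residual between the target and that estimate becomes a second prediction target, and the two losses are summed. The module names, tensor layout, and the equal loss weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def es_tacotron2_loss(mel_target, mel_pred, residual_pred, es_network,
                      residual_weight=1.0):
    """Sketch of a multi-task loss: predict both the mel spectrogram and its
    residual w.r.t. the Es-Network estimate (names/weighting are assumptions)."""
    with torch.no_grad():
        # The Es-Network yields a "general", averaged estimate of the target mel.
        mel_estimated = es_network(mel_target)

    # The residual carries the individual, utterance-specific detail.
    residual_target = mel_target - mel_estimated

    mel_loss = F.mse_loss(mel_pred, mel_target)
    residual_loss = F.mse_loss(residual_pred, residual_target)
    return mel_loss + residual_weight * residual_loss
```

In this reading, the extra residual branch discourages the decoder from regressing toward the Es-Network's averaged estimate, which is one way to interpret the "focus on individual features" described in the abstract.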

Highlights

  • Speech synthesis is the process of converting input text into corresponding speech, and it is known as text-to-speech (TTS)

  • We present a comparison of the natural mel spectrogram, the mel spectrogram synthesized by the original Tacotron 2, and that synthesized by the proposed Es-Tacotron2 model

  • In order to measure the average distance of the estimated spectrogram and the estimated residual from the original spectrogram, we introduce three statistics: average cross-entropy, average cosine similarity, and average variance
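The article does not reproduce the exact formulas here, so the following NumPy sketch only illustrates one plausible way to compute the three statistics frame by frame over mel spectrograms; the frame-wise normalization used for cross-entropy and the assumption of non-negative mel energies are illustrative choices, not the authors' definitions.

```python
import numpy as np

def average_statistics(reference, estimate, eps=1e-8):
    """Frame-averaged statistics over two mel spectrograms of shape
    (frames, mel_bins). Non-negative linear-scale energies are assumed."""
    # Treat each frame as a distribution over mel bins for cross-entropy.
    p = reference / (reference.sum(axis=1, keepdims=True) + eps)
    q = estimate / (estimate.sum(axis=1, keepdims=True) + eps)
    cross_entropy = -(p * np.log(q + eps)).sum(axis=1).mean()

    # Cosine similarity between corresponding frames.
    num = (reference * estimate).sum(axis=1)
    den = (np.linalg.norm(reference, axis=1)
           * np.linalg.norm(estimate, axis=1) + eps)
    cosine_similarity = (num / den).mean()

    # Variance of the estimate per frame, averaged over frames
    # (a larger value suggests less over-smoothing).
    variance = estimate.var(axis=1).mean()
    return cross_entropy, cosine_similarity, variance
```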


Summary

Introduction

Speech synthesis is the process of converting input text into corresponding speech, and it is known as text-to-speech (TTS). Some speech parameter generation algorithms [4,5] consider the global variance (GV) [6] to enhance the details of over-smoothed spectra in HMM-based speech synthesis and thereby alleviate the over-smoothness problem. Such methods work by maximizing the likelihood of the global variance, which can be thought of as introducing an additional global constraint to the model. The speech synthesized by these models departs from the averaged result and tends to preserve more of the characteristics of natural speech. However, these methods introduce redundant distortion into the predicted spectrogram, making the generated speech noisy.
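For context, the global variance of the d-th spectral parameter over an utterance of T frames is conventionally defined as follows; this is the standard definition from the GV literature, supplied here for readability rather than taken from this article:

```latex
% Global variance of the d-th spectral parameter over T frames
v(d) = \frac{1}{T} \sum_{t=1}^{T} \bigl( c_t(d) - \bar{c}(d) \bigr)^2,
\qquad
\bar{c}(d) = \frac{1}{T} \sum_{t=1}^{T} c_t(d)
```

GV-based parameter generation then maximizes a weighted combination of the usual output-probability term and the likelihood of v(d) under a trained GV model, which is how the global constraint mentioned above enters the objective and penalizes trajectories whose variance collapses toward the mean.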

End-to-End Speech Synthesis
Multi-Task Learning
Attention Mechanism
Estimated Network
Multi-Task Tacotron 2 with Pre-Trained Estimated Network
Conv Layer
Initialization
Objective Evaluation
Subjective Evaluation
Effect of the Number of Heads n of the Es-Network
Findings
Conclusion