Abstract

Text-to-speech synthesis is a computational technique for producing synthetic, human-like speech by a computer. In recent years, speech synthesis techniques have advanced and have been employed in many applications, such as automatic translation applications and car navigation systems. End-to-end text-to-speech synthesis has gained considerable research interest because, compared to traditional models, end-to-end models are easier to design and more robust. Tacotron 2 is an integrated state-of-the-art end-to-end speech synthesis system that can directly predict close-to-natural human speech from raw text. However, there remains a gap between synthesized speech and natural speech. Suffering from an over-smoothness problem, Tacotron 2 produces 'averaged' speech, making the synthesized speech sound unnatural and inflexible. In this work, we first propose an estimated network (Es-Network), which captures general features from a raw mel spectrogram in an unsupervised manner. Then, we design Es-Tacotron2 by employing the Es-Network to calculate the estimated mel spectrogram residual and setting it as an additional prediction task of Tacotron 2, allowing the model to focus more on predicting the individual features of the mel spectrogram. Experiments show that, compared to the original Tacotron 2 model, Es-Tacotron2 can produce more variable decoder output and synthesize more natural and expressive speech.
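To make the residual prediction idea concrete, the sketch below shows one plausible multi-task loss in PyTorch: a pre-trained Es-Network produces a general (averaged) estimate of the target mel spectrogram, the residual between the target and that estimate becomes a second prediction target, and the two losses are summed. The module names, tensor layout, and the equal loss weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def es_tacotron2_loss(mel_target, mel_pred, residual_pred, es_network,
                      residual_weight=1.0):
    """Sketch of a multi-task loss: predict both the mel spectrogram and its
    residual w.r.t. the Es-Network estimate (names/weighting are assumptions)."""
    with torch.no_grad():
        # The Es-Network yields a "general", averaged estimate of the target mel.
        mel_estimated = es_network(mel_target)

    # The residual carries the individual, utterance-specific detail.
    residual_target = mel_target - mel_estimated

    mel_loss = F.mse_loss(mel_pred, mel_target)
    residual_loss = F.mse_loss(residual_pred, residual_target)
    return mel_loss + residual_weight * residual_loss
```

In this reading, the extra residual branch discourages the decoder from regressing toward the Es-Network's averaged estimate, which is one way to interpret the "focus on individual features" described in the abstract.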

Highlights

  • Speech synthesis is the process of converting input text into corresponding speech, and it is known as text-to-speech (TTS)

  • We present a comparison of the natural mel spectrogram, the mel spectrogram synthesized by the original Tacotron 2, and that synthesized by the proposed Es-Tacotron2 model

  • In order to measure the average distance of the estimated spectrogram and the estimated residual from the original spectrogram, we introduce three statistics: average cross-entropy, average cosine similarity, and average variance
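The article does not reproduce the exact formulas here, so the following NumPy sketch only illustrates one plausible way to compute the three statistics frame by frame over mel spectrograms; the frame-wise normalization used for cross-entropy and the assumption of non-negative mel energies are illustrative choices, not the authors' definitions.

```python
import numpy as np

def average_statistics(reference, estimate, eps=1e-8):
    """Frame-averaged statistics over two mel spectrograms of shape
    (frames, mel_bins). Non-negative linear-scale energies are assumed."""
    # Treat each frame as a distribution over mel bins for cross-entropy.
    p = reference / (reference.sum(axis=1, keepdims=True) + eps)
    q = estimate / (estimate.sum(axis=1, keepdims=True) + eps)
    cross_entropy = -(p * np.log(q + eps)).sum(axis=1).mean()

    # Cosine similarity between corresponding frames.
    num = (reference * estimate).sum(axis=1)
    den = (np.linalg.norm(reference, axis=1)
           * np.linalg.norm(estimate, axis=1) + eps)
    cosine_similarity = (num / den).mean()

    # Variance of the estimate per frame, averaged over frames
    # (a larger value suggests less over-smoothing).
    variance = estimate.var(axis=1).mean()
    return cross_entropy, cosine_similarity, variance
```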


Summary

Introduction

Speech synthesis is the process of converting input text into corresponding speech, and it is known as text-to-speech (TTS). Some speech parameter generation algorithms [4,5] consider the global variance (GV) [6] to enhance the details of over-smoothed spectra in HMM-based speech synthesis and thereby alleviate the over-smoothness problem. Such methods work by maximizing the likelihood of the global variance, which can be thought of as introducing an additional global constraint to the model. The speech synthesized by these models departs from the averaged result and tends to preserve more of the characteristics of natural speech. However, these methods introduce redundant distortion into the predicted spectrogram, making the generated speech noisy.
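For context, the global variance of the d-th spectral parameter over an utterance of T frames is conventionally defined as follows; this is the standard definition from the GV literature, supplied here for readability rather than taken from this article:

```latex
% Global variance of the d-th spectral parameter over T frames
v(d) = \frac{1}{T} \sum_{t=1}^{T} \bigl( c_t(d) - \bar{c}(d) \bigr)^2,
\qquad
\bar{c}(d) = \frac{1}{T} \sum_{t=1}^{T} c_t(d)
```

GV-based parameter generation then maximizes a weighted combination of the usual output-probability term and the likelihood of v(d) under a trained GV model, which is how the global constraint mentioned above enters the objective and penalizes trajectories whose variance collapses toward the mean.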

End-to-End Speech Synthesis
Multi-Task Learning
Attention Mechanism
Estimated Network
Multi-Task Tacotron 2 with Pre-Trained Estimated Network
Conv Layer
Initialization
Objective Evaluation
Subjective Evaluation
Effect of the Number of Heads n of the Es-Network
Findings
Conclusion