Abstract

Speech synthesis has advanced to the point where attention-based end-to-end text-to-speech synthesis (TTS) models produce natural, human-level speech. However, these models struggle to form stable attention alignments when synthesizing text longer than that seen during training, such as document-level text. In this paper, we propose a neural speech synthesis model that can synthesize more than 5 min of speech at once using training data comprising short speech clips of less than 10 s each. The model can be used for tasks that require synthesizing document-level speech in one pass, such as a singing voice synthesis (SVS) system or a book-reading system. First, through curriculum learning, our model automatically increases the length of the speech trained in each epoch while reducing the batch size, so that long sentences can be trained within a limited graphics processing unit (GPU) memory budget. During synthesis, document-level text is synthesized through an attention-masking mechanism that attends only to the contexts needed at the current time step and masks the rest. A Tacotron2-based speech synthesis model and a duration predictor were used in the experiments, and the results showed that the proposed method synthesizes document-level speech with substantially lower character and attention error rates, and higher quality, than the existing model.
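As a rough illustration of the curriculum described above, the sketch below grows the maximum clip length each epoch while shrinking the batch size so that the total audio per batch stays within a fixed GPU budget. All schedule values here (starting length, growth step, seconds-per-batch budget) are illustrative assumptions, not the paper's actual hyperparameters.

def curriculum_schedule(epoch,
                        base_len_s=10.0,      # hypothetical starting clip length (s)
                        len_step_s=5.0,       # hypothetical per-epoch length growth (s)
                        base_batch=32,        # hypothetical maximum batch size
                        gpu_budget_s=320.0):  # hypothetical seconds-of-audio-per-batch budget
    """Return (max utterance length in seconds, batch size) for a given epoch.

    As clips grow longer, the batch size shrinks so the total audio per
    batch never exceeds the fixed GPU budget.
    """
    max_len = base_len_s + epoch * len_step_s
    batch = max(1, min(base_batch, int(gpu_budget_s // max_len)))
    return max_len, batch

for epoch in range(5):
    max_len, batch = curriculum_schedule(epoch)
    print(f"epoch {epoch}: clips up to {max_len:.0f}s, batch size {batch}")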

Highlights

  • Speech synthesis, which produces natural speech from text, is an active research area

  • With the advent of end-to-end speech synthesis models based on deep neural networks (DNNs), the quality of synthesized speech has improved significantly over that of the earlier concatenative synthesis models [1], [2] and statistical parametric speech synthesis models [3]–[6]

  • Tacotron [7] is a representative DNN-based end-to-end speech synthesis model that simplifies the complex structure used to generate linguistic and acoustic features in previous models; this is achieved by generating a mel spectrogram from the text sequence through a single neural network and synthesizing speech using the Griffin and Lim [8] algorithm as a vocoder (see the sketch after this list)
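As a minimal illustration of the Griffin-Lim vocoding step mentioned in the last highlight, the sketch below recovers a waveform from a mel spectrogram by iterative phase estimation using librosa. The mel spectrogram is derived from an example clip bundled with librosa purely as a stand-in for a Tacotron decoder output.

import librosa

# Load a short example clip; in a real TTS pipeline the mel spectrogram
# below would instead come from the synthesis model's decoder.
y, sr = librosa.load(librosa.ex("trumpet"))
mel = librosa.feature.melspectrogram(y=y, sr=sr)

# mel_to_audio inverts the mel filterbank and runs Griffin-Lim internally,
# estimating the missing phase over n_iter iterations.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_iter=32)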

Introduction

Speech synthesis (text-to-speech synthesis, TTS), which produces natural speech from text, is an active research area. However, attention-based end-to-end TTS models cannot synthesize sentences longer than the speech lengths seen during training, and various problems, such as missing or repeated words and incomplete synthesis, occur when document-level speech synthesis is attempted. Current end-to-end natural speech synthesis systems use a sequence-to-sequence model comprising two structures: an encoder and a decoder. The original Tacotron [7] system uses the content-based attention mechanism introduced in [10] to align the target text with the output spectrogram. With this mechanism, it is difficult to synthesize speech longer than the trained text length.
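To make the mechanism concrete, below is a small PyTorch sketch of content-based attention combined with the kind of local masking the abstract describes: at each decoder step, only encoder states near the current alignment position are scored, and the rest are masked out. The dot-product scoring, the window size, and the way the alignment center is supplied are all illustrative assumptions; the original Tacotron uses additive (MLP-based) scoring, and the paper's exact masking rule may differ.

import torch
import torch.nn.functional as F

def masked_content_attention(query, keys, values, center, window=20):
    """query: (B, d), keys/values: (B, T, d), center: (B,) alignment index."""
    # Dot-product scores as a stand-in for Tacotron's additive scoring.
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)       # (B, T)
    positions = torch.arange(keys.size(1), device=keys.device)      # (T,)
    # Keep only encoder positions within `window` steps of the current center.
    inside = (positions.unsqueeze(0) - center.unsqueeze(1)).abs() <= window
    scores = scores.masked_fill(~inside, float("-inf"))             # hide distant contexts
    weights = F.softmax(scores, dim=-1)                             # (B, T)
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)    # (B, d)
    return context, weights

# Example usage with random tensors and hypothetical alignment centers.
B, T, d = 2, 100, 8
q, k, v = torch.randn(B, d), torch.randn(B, T, d), torch.randn(B, T, d)
ctx, w = masked_content_attention(q, k, v, center=torch.tensor([10, 50]))

Because distant positions are masked, the softmax at each decoder step only ever covers a fixed-size window, so the per-step attention cost stays constant no matter how long the input document is.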
