Abstract

Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.

Highlights

  • With the advent of deep learning, end-to-end text-to-speech (TTS) has shown many advantages over the conventional TTS techniques [1], [2]

  • We propose a novel two-task learning scheme for the Tacotron-based TTS model to improve prosodic phrasing: 1) the main task learns to predict the speech spectrum parameters from the character-level embedding representation, and 2) the secondary task learns to predict a word-level prosody embedding

  • WE-Tacotron serves as the contrastive model for MTL-Tacotron (multi-task learning) and PE-Tacotron, to show the advantage of the proposed prosody embedding

Summary

INTRODUCTION

With the advent of deep learning, end-to-end text-to-speech (TTS) has shown many advantages over conventional TTS techniques [1], [2]. We apply multi-task learning to Tacotron-based TTS for prosody modeling. The main contributions of this paper are: 1) a novel Tacotron-based TTS architecture that explicitly models prosodic phrasing; and 2) a multi-task learning scheme that optimizes the model for a high-quality speech spectrum and adequate prosodic phrasing at the same time. There have been attempts [8], [32] to use word embeddings as input to improve the expressiveness of Tacotron-based TTS models, which suggest that word embeddings are prosody-informing. We propose a novel multi-task learning framework that optimizes the system to generate the Mel spectrum and, at the same time, accurately predict phrase breaks; this is the focus of Section III. We study a two-task learning strategy: 1) the main task generates the Mel spectrum from the input character sequence; and 2) the secondary task predicts an appropriate prosodic phrasing.
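
The two-task objective described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a mean-squared-error loss on the Mel spectrum, a cross-entropy loss over two phrase-break classes (0 = no break, 1 = break), and a hypothetical weight `secondary_weight` that balances the two tasks. The function names are also illustrative only.

```python
import math


def mel_loss(mel_pred, mel_target):
    """Main task (assumed form): mean squared error over all Mel bins."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(mel_pred, mel_target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def phrase_break_loss(break_probs, break_labels):
    """Secondary task (assumed form): mean cross-entropy per word.

    break_probs[i][k] is the predicted probability of class k for word i;
    break_labels[i] is the reference class index for word i.
    """
    eps = 1e-12  # avoids log(0) for degenerate predictions
    losses = [
        -math.log(max(probs[label], eps))
        for probs, label in zip(break_probs, break_labels)
    ]
    return sum(losses) / len(losses)


def multi_task_loss(mel_pred, mel_target, break_probs, break_labels,
                    secondary_weight=0.5):
    """Weighted sum of both task losses; the weight is a hypothetical knob."""
    return (mel_loss(mel_pred, mel_target)
            + secondary_weight * phrase_break_loss(break_probs, break_labels))


# Toy example: one Mel frame with two bins, one word with two break classes.
loss = multi_task_loss(
    mel_pred=[[0.0, 1.0]],
    mel_target=[[0.0, 0.0]],
    break_probs=[[0.5, 0.5]],
    break_labels=[1],
    secondary_weight=1.0,
)
```

In a real system both losses would be computed on tensors and back-propagated through a shared encoder, so that gradients from the phrase-break task also shape the representation used for spectrum generation.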

Main Task
Secondary Task
Multi-task Learning
Databases Speech Data
Contrastive Systems
Experimental Setup
Phrase Break Prediction
Subjective Listening Test
CONCLUSIONS
