Modeling Prosodic Phrasing With Multi-Task Learning in Tacotron-Based TTS

Rui Liu, Guanglai Gao, Berrak Sisman, Haizhou Li, Feilong Bao

doi:10.1109/lsp.2020.3016564

Open Access

https://doi.org/10.1109/lsp.2020.3016564

Abstract

Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.

Highlights

With the advent of deep learning, end-to-end text-to-speech (TTS) has shown many advantages over the conventional TTS techniques [1], [2]
We propose a novel two-task learning scheme for the Tacotron-based TTS model to improve prosodic phrasing: 1) the main task learns to predict the speech spectrum parameters from a character-level embedding representation, and 2) the secondary task learns to predict a word-level prosody embedding
WE-Tacotron serves as the contrastive model for multi-task learning (MTL) Tacotron and PE-Tacotron to show the advantage of the proposed prosody embedding

Summary

INTRODUCTION

With the advent of deep learning, end-to-end text-to-speech (TTS) has shown many advantages over conventional TTS techniques [1], [2]. We apply multi-task learning to Tacotron-based TTS for prosody modeling. The main contributions of this paper are: 1) a novel Tacotron-based TTS architecture that explicitly models prosodic phrasing; and 2) a multi-task learning scheme that optimizes the model for a high-quality speech spectrum and adequate prosodic phrasing at the same time. There have been attempts [8], [32] to use word embeddings as input to improve the expressiveness of Tacotron-based TTS models, which show that word embeddings are prosody-informing. We propose a novel multi-task learning framework that optimizes the system to generate the Mel spectrum and, at the same time, to accurately predict phrase breaks, which will be the focus of Section III. We study a two-task learning strategy: 1) the main task generates the Mel spectrum from the input character sequence; and 2) the secondary task predicts an appropriate prosodic phrasing.
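The two-task objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the L1 spectrum loss, the binary cross-entropy over break/no-break labels, the weighted-sum combination, and every name in the code (`mel_l1_loss`, `break_cross_entropy`, `multi_task_loss`, `break_weight`) are assumptions made for illustration only. The page does not state the actual loss functions or task weights.

```python
import math
from typing import Sequence


def mel_l1_loss(predicted: Sequence[Sequence[float]],
                target: Sequence[Sequence[float]]) -> float:
    """Mean absolute error over all frames and Mel bins.

    Assumed stand-in for the main-task (spectrum) loss.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(predicted, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


def break_cross_entropy(break_probs: Sequence[float],
                        break_labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over tokens.

    Assumed stand-in for the secondary-task loss; label 1 means a
    phrase break follows the token, label 0 means no break.
    """
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for prob, label in zip(break_probs, break_labels):
        prob = min(max(prob, eps), 1.0 - eps)
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(break_labels)


def multi_task_loss(mel_pred, mel_target, break_probs, break_labels,
                    break_weight: float = 0.5) -> float:
    """Weighted sum of both task losses; break_weight is a hypothetical value."""
    main = mel_l1_loss(mel_pred, mel_target)
    secondary = break_cross_entropy(break_probs, break_labels)
    return main + break_weight * secondary


if __name__ == "__main__":
    # Toy data: two frames with two Mel bins each, and two tokens.
    mel_pred = [[0.0, 1.0], [2.0, 3.0]]
    mel_target = [[0.0, 1.5], [2.0, 3.0]]
    print(multi_task_loss(mel_pred, mel_target, [0.9, 0.2], [1, 0]))
```

In a sketch like this, a single backward pass through the combined value would update the shared parameters for both tasks, which is the general idea of optimizing the spectrum and the phrasing at the same time.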

Main Task

Secondary Task

Multi-task Learning

Databases Speech Data

Contrastive Systems

Experimental Setup

Phrase Break Prediction

Subjective Listening Test

CONCLUSIONS

Journal: IEEE Signal Processing Letters	Publication Date: Jan 1, 2020
Citations: 42	License type: CC BY 4.0

Similar Papers

Prosodic phrase prediction using an embedded hierarchy model of speech.
Mari Ostendorf ... Nanette Veilleux
The Journal of the Acoustical Society of America | VOL. 90
01 Oct 1991

Incorporating second-order information into two-step major phrase break prediction for Korean
Seungwon Kim ... Jinsik Lee
17 Sep 2006

Using multiple linguistic features for Mandarin phrase break prediction in maximum-entropy classification framework
Yu Zheng ... Byeongchang Kim
04 Oct 2004

Exclusives, equatives and prosodic phrases in Samoan
Sasha Calhoun
Glossa: a journal of general linguistics | VOL. 2
23 Feb 2017
