Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis

Xiaolian Zhu, Yuchao Zhang, Lei Xie, Liumeng Xue, Shan Yang

doi:10.1109/access.2019.2914149

Open Access

https://doi.org/10.1109/access.2019.2914149

Abstract

Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, offering both simplified system-building pipelines and high-quality speech. With its encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text analysis front-end, duration model, acoustic model, and audio synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder and serves as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability, which hinder the wide deployment of E2E approaches. In this paper, we address these problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject readily available prior knowledge (accurate phoneme durations) into the neural network loss function to bias attention learning more accurately toward the desired direction. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be obtained by forced alignment with an automatic speech recognizer (ASR). Experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG-updated version of Tacotron2 can quickly obtain the attention alignment using only 500 (text, audio) pairs, which the original Tacotron2 evidently cannot do. A series of subjective experiments also shows that the PAG-Tacotron2 approach can synthesize more stable and natural speech.
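
The idea of biasing attention with phoneme durations can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss term, so everything below is an assumption made for illustration: durations are taken to be already converted to whole decoder frames, the target alignment is a hard one-hot matrix, and the penalty is a mean absolute difference. The function names `durations_to_alignment` and `guided_attention_loss` are hypothetical and do not come from the paper.

```python
from typing import List

Matrix = List[List[float]]


def durations_to_alignment(durations: List[int]) -> Matrix:
    """Build a hard alignment matrix from per-phoneme frame counts.

    Row t is decoder frame t, column n is phoneme n. An entry is 1.0 when
    frame t lies inside the span assigned to phoneme n, otherwise 0.0.
    Assumption: durations are already expressed in decoder frames.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    num_phonemes = len(durations)
    alignment: Matrix = []
    for n, frames in enumerate(durations):
        for _ in range(frames):
            row = [0.0] * num_phonemes
            row[n] = 1.0
            alignment.append(row)
    return alignment


def guided_attention_loss(attention: Matrix, target: Matrix) -> float:
    """Mean absolute difference between attention weights and the target.

    Assumption: this L1-style penalty is only one plausible choice; the
    paper's actual loss term may differ.
    """
    if len(attention) != len(target):
        raise ValueError("attention and target need the same number of frames")
    total = 0.0
    count = 0
    for a_row, t_row in zip(attention, target):
        if len(a_row) != len(t_row):
            raise ValueError("attention and target need the same number of phonemes")
        for a, t in zip(a_row, t_row):
            total += abs(a - t)
            count += 1
    return total / count if count else 0.0


# Three phonemes lasting 2, 1, and 3 frames give a 6 x 3 target matrix.
target = durations_to_alignment([2, 1, 3])
# An untrained model might spread attention uniformly over all phonemes.
uniform = [[1.0 / 3.0] * 3 for _ in range(6)]
penalty = guided_attention_loss(uniform, target)
```

In a training loop, such a penalty would presumably be added to the usual spectrogram reconstruction loss with some weighting factor, so that attention is pushed toward the forced-alignment path early in training; the weighting scheme is likewise not specified in the abstract.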

Journal: IEEE Access	Publication Date: Jan 1, 2019
Citations: 30	License type: cc-by-nc-nd

Similar Papers

Acoustic and pronunciation model adaptation for context-independent and context-dependent pronunciation variability of non-native speech
Yoo Rhee Oh ... Mina Kim
01 Mar 2008

Acoustic Modelling for Croatian Speech Recognition and Synthesis
Sanda Martinčić–Ipšić ... Ivo Ipšić
Informatica | VOL. 19
01 Jan 2008

Cross-lingual speech recognition under runtime resource constraints
Dong Yu ... Li Deng
01 Apr 2009

Acoustic Modelling for Croatian Speech Recognition and Synthesis
Informatica (Lithuanian Academy of Sciences)
01 Apr 2008
