Improving Seq2Seq TTS Frontends With Transcribed Speech Audio

Siqi Sun, Korin Richmond, Hao Tang

doi:10.1109/taslp.2023.3273414

Abstract

Due to the data inefficiency and low speech quality of grapheme-based end-to-end text-to-speech (TTS), a separate high-performance TTS linguistic frontend is still commonly regarded as necessary. However, a TTS frontend is itself difficult to build and maintain, since its construction requires abundant linguistic knowledge. In this paper, we start by bootstrapping an integrated sequence-to-sequence (Seq2Seq) TTS frontend using a pre-existing pipeline-based frontend and large amounts of unlabelled normalized text, achieving promising memorization and generalization abilities. To overcome the performance limitation imposed by the pipeline-based frontend, we propose a Forced Alignment (FA) method that decodes pronunciations from transcribed speech audio and then uses them to update the Seq2Seq frontend. Our experiments demonstrate the effectiveness of the proposed FA method, which significantly improves word token accuracy for out-of-dictionary words from 52.6% to 91.2%. It can also correct the pronunciation of homographs from transcribed speech audio and potentially improve the homograph disambiguation performance of the Seq2Seq frontend.

Full Text