Abstract

Text-to-speech alignment, also known as time alignment, is essential for automatic speech recognition (ASR) systems used for speech retrieval tasks, such as keyword search and speech segment extraction. Previous works have used Gaussian mixture model-hidden Markov model (GMM-HMM) forced alignment to improve alignment performance. However, when used with end-to-end (E2E) ASR, GMM-HMM forced alignment introduces extra reliance on expert resources such as pronunciation lexica. It also increases system complexity because GMM-HMMs are very dissimilar to E2E models. To tackle these two problems, we propose an E2E-ASR-based iteratively-trained timestamp estimator (ITSE), which performs alignment between a token-level transcription and speech. We first train ITSE with coarse initial alignment targets generated from connectionist temporal classification (CTC) posteriors. During training, we iteratively perform realignment to update the targets. We attribute the effectiveness of this iterative training to two vital features of ITSE. First, ITSE performs alignment using similarities between token and speech embeddings instead of frame-wise token classification posteriors. Second, ITSE uses speech embeddings that are aware of left context rather than global context. ITSE significantly outperforms CTC-based baselines in word alignment accuracy and is comparable to a GMM-HMM forced aligner. In short, ITSE is an accurate, lightweight text-to-speech alignment module implemented without expert resources such as pronunciation lexica.
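To make the similarity-based alignment idea concrete, the following is a minimal sketch (not the paper's actual model or training objective) of how token-to-frame alignment can be recovered from a token-frame cosine-similarity matrix via monotonic dynamic programming. The function name, the DTW-style recursion, and the toy embeddings are all illustrative assumptions, not part of ITSE.

```python
import numpy as np

def align_tokens_to_frames(token_emb, frame_emb):
    """Monotonically partition frames among tokens by maximizing total
    token-frame cosine similarity (a DTW-style sketch, not ITSE itself).

    token_emb: (num_tokens, dim) token embeddings
    frame_emb: (num_frames, dim) speech-frame embeddings
    Returns a list of (start_frame, end_frame) spans, one per token.
    """
    # Row-normalize, then cosine similarity matrix (tokens x frames).
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    sim = t @ f.T
    n_tok, n_frm = sim.shape

    # dp[i, j]: best total similarity with frame j assigned to token i,
    # frames split monotonically among tokens (each token gets >= 1 frame).
    dp = np.full((n_tok, n_frm), -np.inf)
    dp[0] = np.cumsum(sim[0])
    for i in range(1, n_tok):
        for j in range(i, n_frm):
            # Either frame j-1 also belonged to token i, or token i starts at j.
            dp[i, j] = sim[i, j] + max(dp[i, j - 1], dp[i - 1, j - 1])

    # Backtrack: recover which token each frame was assigned to.
    assign = np.zeros(n_frm, dtype=int)
    i, j = n_tok - 1, n_frm - 1
    while j > 0:
        assign[j] = i
        if i > 0 and dp[i - 1, j - 1] >= dp[i, j - 1]:
            i -= 1  # token boundary: frame j-1 belongs to the previous token
        j -= 1
    assign[0] = 0

    # Convert the frame-to-token assignment into per-token frame spans.
    return [(int(np.argmax(assign == k)),
             int(n_frm - 1 - np.argmax(assign[::-1] == k)))
            for k in range(n_tok)]
```

With two orthogonal toy tokens and four frames (two matching each token), the sketch assigns frames 0-1 to the first token and frames 2-3 to the second. The paper's iterative training would additionally refine the embeddings so that these similarity-derived spans improve over successive realignment rounds.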
