Abstract

Text-to-speech alignment, also known as time alignment, is essential for automatic speech recognition (ASR) systems used for speech retrieval tasks, such as keyword search and speech segment extraction. Previous work has used Gaussian mixture model-hidden Markov model (GMM-HMM) forced alignment to improve alignment performance. However, when used with end-to-end (E2E) ASR, GMM-HMM forced alignment introduces an additional dependence on expert knowledge such as pronunciation lexica. It also increases system complexity because GMM-HMMs differ substantially from E2E models. To tackle these two problems, we propose an E2E-ASR-based iteratively-trained timestamp estimator (ITSE), which performs alignment between token-level transcription and speech. We first train ITSE with coarse initial alignment targets generated using connectionist temporal classification (CTC) posteriors. During training, we iteratively perform realignment to update the targets. We attribute the effectiveness of the iterative training to two key features of ITSE. First, ITSE performs alignment using similarities between token and speech embeddings instead of frame-wise token classification posteriors. Second, ITSE uses speech embeddings that are aware of left context rather than global context. ITSE significantly outperforms CTC-based baselines in word alignment accuracy and is comparable to a GMM-HMM forced aligner. In short, ITSE is an accurate, lightweight text-to-speech alignment module implemented without expert knowledge such as pronunciation lexica.
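
To make the idea of aligning by embedding similarity concrete, the following is a minimal sketch of a generic monotonic alignment over a token-by-frame similarity matrix. It is not the ITSE model described above: the function name `monotonic_align`, the use of plain dot-product similarity, and the toy one-hot embeddings are all assumptions made for illustration, and the sketch covers neither the CTC-based initial targets, the iterative retraining, nor the left-context speech encoder.

```python
def monotonic_align(token_emb, frame_emb):
    """Assign every frame to exactly one token, left to right.

    token_emb: list of N token vectors, frame_emb: list of T frame vectors
    (T >= N). Returns one (start_frame, end_frame) pair per token, inclusive.
    Hypothetical helper: dynamic programming that maximizes the summed
    dot-product similarity along a monotonic path.
    """
    n_tok, n_frm = len(token_emb), len(frame_emb)
    if n_frm < n_tok:
        raise ValueError("need at least one frame per token")

    # sim[t][n]: similarity between frame t and token n.
    sim = [[sum(a * b for a, b in zip(f, k)) for k in token_emb]
           for f in frame_emb]

    neg_inf = float("-inf")
    score = [[neg_inf] * n_tok for _ in range(n_frm)]
    # advanced[t][n] is True if the best path reached (t, n) from token n - 1.
    advanced = [[False] * n_tok for _ in range(n_frm)]
    score[0][0] = sim[0][0]

    for t in range(1, n_frm):
        for n in range(min(t + 1, n_tok)):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            if move > stay:
                score[t][n] = move + sim[t][n]
                advanced[t][n] = True
            else:
                score[t][n] = stay + sim[t][n]

    # Backtrace from the last frame and last token to recover token spans.
    spans = [[None, None] for _ in range(n_tok)]
    n = n_tok - 1
    for t in range(n_frm - 1, -1, -1):
        if spans[n][1] is None:
            spans[n][1] = t
        spans[n][0] = t
        if t > 0 and advanced[t][n]:
            n -= 1
    return [(s, e) for s, e in spans]


# Toy example: three tokens with one-hot embeddings and six frames.
tokens = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
frames = [tokens[0], tokens[0], tokens[1], tokens[2], tokens[2], tokens[2]]
print(monotonic_align(tokens, frames))  # [(0, 1), (2, 2), (3, 5)]
```

Multiplying the frame indices by the encoder's frame shift would turn these spans into timestamps; the frame shift is a property of the specific model and is not given in the abstract.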
