Abstract

End-to-end (E2E) models are widely used in automatic speech recognition (ASR) because they significantly improve recognition performance. However, owing to the limitations of existing hardware, previous studies have focused mainly on short utterances: utterances used for ASR training typically last no more than 15 s, so the resulting models often fail to generalize to longer utterances at inference time. To address the challenge of long-form speech recognition, we propose a novel Context-Association Architecture with Simulated Long-utterance Training (CAST), which consists of a Context-Association RNN-Transducer (CARNN-T) and a simulated long-utterance training (SLUT) strategy. The CARNN-T obtains sentence-level contextual information by attending to cross-sentence historical utterances and incorporates this information at inference time, which improves the robustness of long-form speech recognition. The SLUT strategy simulates long-form audio training by updating the recurrent state, which alleviates the length mismatch between training and test utterances. Experiments on synthetic long-utterance test sets built from the Aishell-1 and aidatatang_200zh corpora show that our model achieves the best recognition performance on long utterances, with character error rates (CER) of 12.0% and 12.6%, respectively.
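The core idea behind simulating long-utterance training, as described above, is to process short segments while carrying the recurrent state across segment boundaries instead of resetting it. A minimal sketch of that state-carrying mechanism is given below; the toy scalar RNN, its weights, and the 10-frame segment length are illustrative assumptions, not the paper's actual model or hyperparameters.

```python
import math

# Toy scalar RNN step: h_t = tanh(0.5 * h_{t-1} + 0.3 * x_t).
# The weights 0.5 and 0.3 are arbitrary illustrative values.
def rnn(xs, h0=0.0):
    h = h0
    for x in xs:
        h = math.tanh(0.5 * h + 0.3 * x)
    return h

long_utt = [0.1 * i for i in range(30)]  # one "long" utterance (30 frames)

# SLUT-style simulation: iterate over short segments, but initialize each
# segment with the final state of the previous one (no reset).
h = 0.0
for start in range(0, len(long_utt), 10):      # 10-frame "short" segments
    h = rnn(long_utt[start:start + 10], h0=h)  # recurrent state carried over

# Carrying the state across segments reproduces full-length processing
# exactly, so the model sees long-form dynamics during short-segment training.
assert abs(h - rnn(long_utt)) < 1e-12
```

Because the recurrence is exact under state carry-over, this trick exposes the model to long-range temporal dynamics without ever feeding a full long utterance through the hardware in one pass.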
