Abstract

End-to-end (E2E) models are widely used because they significantly improve the performance of automatic speech recognition (ASR). However, owing to the limitations of existing hardware, previous studies have focused mainly on short utterances: utterances used for ASR training typically last no longer than about 15 s, so models often fail to generalize to longer utterances at inference time. To address the challenge of long-form speech recognition, we propose a novel Context-Association architecture with Simulated long-utterance Training (CAST), which consists of a Context-Association RNN-Transducer (CARNN-T) and a simulated long-utterance training (SLUT) strategy. The CARNN-T obtains sentence-level contextual information by attending to cross-sentence historical utterances and incorporates it at inference time, which improves the robustness of long-form speech recognition. The SLUT strategy simulates long-form audio training by updating the recursive state, alleviating the length mismatch between training and test utterances. Experiments on test sets synthesized from the Aishell-1 and aidatatang_200zh corpora show that our model achieves the best recognition performance on long utterances, with character error rates (CER) of 12.0% and 12.6%, respectively.
