Abstract

End-to-end (E2E) models are widely used in automatic speech recognition (ASR) because they significantly improve recognition performance. However, owing to the limitations of existing hardware, previous studies have focused mainly on short utterances: utterances used for ASR training typically last no more than 15 s, so the resulting models often fail to generalize to longer utterances at inference time. To address the challenge of long-form speech recognition, we propose a novel Context-Association Architecture with Simulated Long-utterance Training (CAST), which consists of a Context-Association RNN-Transducer (CARNN-T) and a simulated long-utterance training (SLUT) strategy. The CARNN-T obtains sentence-level contextual information by attending to cross-sentence historical utterances and incorporates this information at inference time, which improves the robustness of long-form speech recognition. The SLUT strategy simulates long-form audio training by updating the recurrent state, which alleviates the length mismatch between training and test utterances. Experiments on synthetic long-utterance test sets built from the Aishell-1 and aidatatang_200zh corpora show that our model achieves the best recognition performance on long utterances, with character error rates (CER) of 12.0% and 12.6%, respectively.
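The core idea behind simulating long-utterance training, as described above, is to process short segments while carrying the recurrent state across segment boundaries instead of resetting it. A minimal sketch of that state-carrying mechanism is given below; the toy scalar RNN, its weights, and the 10-frame segment length are illustrative assumptions, not the paper's actual model or hyperparameters.

```python
import math

# Toy scalar RNN step: h_t = tanh(0.5 * h_{t-1} + 0.3 * x_t).
# The weights 0.5 and 0.3 are arbitrary illustrative values.
def rnn(xs, h0=0.0):
    h = h0
    for x in xs:
        h = math.tanh(0.5 * h + 0.3 * x)
    return h

long_utt = [0.1 * i for i in range(30)]  # one "long" utterance (30 frames)

# SLUT-style simulation: iterate over short segments, but initialize each
# segment with the final state of the previous one (no reset).
h = 0.0
for start in range(0, len(long_utt), 10):      # 10-frame "short" segments
    h = rnn(long_utt[start:start + 10], h0=h)  # recurrent state carried over

# Carrying the state across segments reproduces full-length processing
# exactly, so the model sees long-form dynamics during short-segment training.
assert abs(h - rnn(long_utt)) < 1e-12
```

Because the recurrence is exact under state carry-over, this trick exposes the model to long-range temporal dynamics without ever feeding a full long utterance through the hardware in one pass.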
