Abstract

Segmentation of long utterances is crucial for end-to-end (E2E) streaming automatic speech recognition (ASR). However, the commonly used voice activity detection (VAD)-based and fixed-length segmentation methods may produce overly long or semantically incomplete segments, degrading both the user experience and ASR performance. In this paper, we propose a speech segmentation method for streaming E2E ASR that addresses these issues. Segment boundaries are judged using both the decoder's dependence on acoustic information and the average human breathing frequency. The frame-level dependence information is provided by the Continuous Integrate-and-Fire (CIF) predictor, which is optimized jointly with the ASR model to yield segmentation better suited to recognition. Moreover, the proposed method increases neither the number of model parameters nor the real-time factor (RTF). Experimental results show that our method accurately detects pauses in speech, and the resulting segments usually contain relatively complete semantic information. Compared with VAD-based segmentation, a 53.5% relative latency reduction and a 3.7% relative CER reduction are achieved.
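To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of how frame-level CIF weights could be combined with a minimum-duration constraint to place segment boundaries. It assumes a hypothetical function find_segment_boundaries, a fixed frame shift, and illustrative threshold values; in the paper, the duration constraint is derived from the average human breathing frequency and the weights come from the jointly trained CIF predictor.

import numpy as np

def find_segment_boundaries(cif_weights, frame_shift=0.04,
                            pause_weight_thresh=0.05,
                            min_pause_frames=5,
                            min_segment_sec=3.0):
    # A frame with a near-zero CIF weight contributes little new acoustic
    # information to the decoder, so a run of such frames is treated as a
    # candidate pause. A boundary is accepted only once the running segment
    # is at least min_segment_sec long (a stand-in for the breath-interval
    # constraint). All thresholds here are illustrative assumptions.
    boundaries = []
    seg_start = 0
    pause_len = 0
    for t, w in enumerate(cif_weights):
        pause_len = pause_len + 1 if w < pause_weight_thresh else 0
        seg_dur = (t - seg_start + 1) * frame_shift
        if pause_len >= min_pause_frames and seg_dur >= min_segment_sec:
            boundaries.append(t)   # cut inside the detected pause
            seg_start = t + 1
            pause_len = 0
    return boundaries

# Toy usage: low weights around frames 100-110 yield one boundary.
weights = np.full(200, 0.5)
weights[100:110] = 0.01
print(find_segment_boundaries(weights))   # e.g. [104]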
