Abstract

With recent advances in technology, automatic speech recognition (ASR) has been widely adopted in real-world applications, and converting large amounts of speech into text accurately with limited resources has become more important than ever. This paper proposes a method to rapidly recognize a large speech database with a Transformer-based end-to-end model. Transformers have advanced the state of the art in many fields, but they are difficult to apply to long sequences. This paper proposes and evaluates several techniques to speed up the recognition of real-world speech, including decoding with multiple-utterance batched beam search, detecting end-of-speech based on connectionist temporal classification (CTC), restricting the CTC prefix score, and splitting long recordings into short segments. Experiments on the LibriSpeech English task and a real-world Korean ASR task verify the proposed methods. The proposed system converts 8 hours of speech recorded at real-world meetings into text in less than 3 minutes, with a character error rate of 10.73%, which is 27.1% lower (relative) than that of a conventional system.
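As a rough illustration of the segmentation idea mentioned above, the following Python sketch splits a long recording into short, slightly overlapping segments and groups them into batches for batched beam-search decoding. The function names, segment length, and overlap are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def split_into_segments(waveform, sample_rate=16000, max_sec=20.0, overlap_sec=0.5):
    """Split a long waveform into short, slightly overlapping segments.

    Hypothetical helper: the 20 s segment length and 0.5 s overlap are
    illustrative assumptions, not values taken from the paper.
    """
    max_len = int(max_sec * sample_rate)
    hop = max_len - int(overlap_sec * sample_rate)
    segments = []
    for start in range(0, len(waveform), hop):
        segments.append(waveform[start:start + max_len])
        if start + max_len >= len(waveform):
            break
    return segments

def make_batches(segments, batch_size=8):
    """Group segments so the decoder can run batched beam search over them."""
    return [segments[i:i + batch_size] for i in range(0, len(segments), batch_size)]

# Usage: a long recording becomes a list of short segments that are decoded
# batch by batch instead of as one very long sequence.
waveform = np.zeros(16000 * 60)  # placeholder: 1 minute of silence
batches = make_batches(split_into_segments(waveform))
```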
