Abstract
Although attention-based end-to-end (E2E) automatic speech recognition (ASR) models yield state-of-the-art recognition accuracy, they still fall behind many ASR models deployed in industry in crucial functionalities such as online processing and precise timestamp generation. This weakness prevents attention-based E2E ASR models from being applied to several essential speech tasks, such as online speech recognition and keyword search (KWS). In this paper, we describe ETEH, our proposed unified attention-based E2E ASR and KWS architecture, which supports both online and offline ASR decoding modes in one model, thus allowing for precise and reliable KWS. “ETE” stands for attention-based E2E modeling, whereas “H” represents the hybrid Gaussian mixture model and hidden Markov model (GMM-HMM). As a combination of both, ETEH is an attention-based E2E ASR architecture that utilizes the frame-wise time alignment (FTA) generated by GMM-HMM ASR models. The FTA improves the model in two ways: first, it helps the monotonic attentions of ETEH models capture more accurate word timestamps, resulting in lower latency for online decoding; second, it helps ETEH models provide accurate and reliable KWS results. Furthermore, by adopting the universal training strategy, we are able to combine the offline and online modes in one ETEH model and establish a concise system. ETEH is unique in its functionality; to the best of our knowledge, no comparable single attention-based E2E ASR system is available as a baseline. To evaluate the ASR accuracy and latency of ETEH, we use our previously proposed monotonic truncated attention (MTA) based online CTC/attention (OCA) ASR models as baselines. Experimental results show that ETEH ASR models significantly reduce ASR latency compared to the baseline. To evaluate KWS performance, we compare ETEH models with CTC-based KWS models. Results demonstrate that our ETEH models achieve significantly better KWS performance than the CTC baselines.