Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Haoran Miao, Pengyuan Zhang, Yonghong Yan, Gaofeng Cheng

doi:10.1109/taslp.2020.2987752

Abstract

Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architectures, which transcribe speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture, which utilizes the advantages of both CTC and attention. Hybrid CTC/attention ASR systems exhibit performance comparable to that of conventional deep neural network (DNN)/hidden Markov model (HMM) ASR systems. However, deploying hybrid CTC/attention systems for online speech recognition remains a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of the conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve its training-and-decoding mismatch problem. Secondly, we propose the truncated CTC (T-CTC) prefix score to stream CTC prefix score calculation. Thirdly, we design the dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely used offline bidirectional encoder network. Experiments on the LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real time factor in human-computer interaction services while incurring only moderate performance degradation. To the best of our knowledge, this is the first work to provide a full-stack online solution for the CTC/attention end-to-end ASR architecture.
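The abstract does not spell out how the CTC and attention predictions are combined. As a rough illustration only, the sketch below assumes the log-linear interpolation commonly used in hybrid CTC/attention decoding, where each hypothesis is scored by a weighted sum of its CTC and attention log-probabilities. The function names, the weight `lam`, and the toy probabilities are all hypothetical and are not taken from the paper. The sketch also does not implement T-CTC, MTA, or DWJD.

```python
import math


def joint_score(log_p_ctc: float, log_p_att: float, lam: float = 0.5) -> float:
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    lam is an assumed interpolation weight in [0, 1]: lam = 1 uses only the
    CTC score, lam = 0 uses only the attention score.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * log_p_ctc + (1.0 - lam) * log_p_att


def rank_hypotheses(hyps, lam=0.5):
    """Sort (text, log_p_ctc, log_p_att) tuples by joint score, best first."""
    return sorted(hyps, key=lambda h: joint_score(h[1], h[2], lam), reverse=True)


# Toy hypotheses with made-up probabilities, for illustration only.
hyps = [
    ("hello world", math.log(0.30), math.log(0.20)),
    ("hollow world", math.log(0.10), math.log(0.35)),
]

best = rank_hypotheses(hyps, lam=0.5)[0][0]
print(best)
```

With equal weighting, the first hypothesis wins because the product of its two probabilities is larger. Setting `lam=0.0` ranks by the attention score alone and flips the order, which shows how the weight trades off the two models.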

Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Publication Date: Jan 1, 2020
Citations: 80

Similar Papers

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
Shinji Watanabe ... Tomoki Hayashi
IEEE Journal of Selected Topics in Signal Processing | VOL. 11
01 Dec 2017

Hybrid End-to-End Architecture for Hindi Speech Recognition System
A Kumar ... M Dua
01 Jan 2021

Chapter 2 - End-to-End Acoustic Modeling Using Convolutional Neural Networks
Vishal Passricha ... Rajesh Kumar Aggarwal
Intelligent Speech Signal Processing
01 Jan 2019

Performance Analysis of various Front-end and Back End Amalgamations for Noise-robust DNN-based ASR
Mohit Dua ... Vinam Agrawal
Recent Advances in Computer Science and Communications | VOL. 14
01 Dec 2021
