Abstract
Automatic speech recognition, especially large-vocabulary continuous speech recognition, is an important problem in the field of machine learning. For a long time, the hidden Markov model (HMM)-Gaussian mixture model (GMM) framework was the mainstream approach to speech recognition. Recently, however, the HMM-deep neural network (DNN) model and end-to-end models based on deep learning have achieved performance beyond that of HMM-GMM. Both using deep learning techniques,
Highlights
The popularity of smart devices has made people increasingly appreciate the convenience of voice interaction
The rest of this paper is organized as follows: In Section 2, we briefly review the history of automatic speech recognition, focusing on the basic ideas and characteristics of the hidden Markov model (HMM)-Gaussian mixture model (GMM), HMM-deep neural network (DNN), and end-to-end models, and comparing their strengths and weaknesses; in Sections 3–5, we summarize and analyze the principles, progress, and research focus of three different end-to-end models: connectionist temporal classification (CTC)-based, recurrent neural network (RNN)-transducer, and attention-based models, respectively
For speaker-independent, large-vocabulary continuous speech recognition tasks, HMM-based technology and deep learning technology are the keys to breakthroughs
Summary
The popularity of smart devices has made people increasingly appreciate the convenience of voice interaction. Compared with HMM-based models, the end-to-end model uses a single network to map audio directly to characters or words. It replaces hand-engineered pipelines with a learned process and requires no domain expertise, so an end-to-end model is simpler to construct and train. These advantages have made the end-to-end model a hot research direction in large vocabulary continuous speech recognition (LVCSR).
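To make the "audio to characters" mapping concrete, the snippet below is a minimal sketch of CTC greedy decoding, the post-processing step used by CTC-based end-to-end models mentioned in this paper: the network emits one label per audio frame (including a special blank), and the decoder collapses repeats and removes blanks to obtain the output text. The function name and the toy frame sequence are illustrative assumptions, not from the paper.

```python
BLANK = "_"  # CTC's special blank symbol (an illustrative choice of marker)

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive repeats, then drop blanks: h h _ e _ l l _ l o -> 'hello'."""
    out = []
    prev = None
    for lab in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# Hypothetical per-frame argmax labels from an acoustic model:
frames = ["h", "h", BLANK, "e", BLANK, "l", "l", BLANK, "l", "o"]
print(ctc_greedy_decode(frames))  # hello
```

Note how the blank lets the model output the repeated "l" in "hello": without a blank between them, consecutive identical labels would be collapsed into one.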