Abstract

Automatic speech recognition, especially large vocabulary continuous speech recognition, is an important issue in the field of machine learning. For a long time, the hidden Markov model (HMM)-Gaussian mixture model (GMM) has been the mainstream speech recognition framework. Recently, however, the HMM-deep neural network (DNN) model and end-to-end models based on deep learning have achieved performance beyond HMM-GMM. Both using deep learning techniques, these two models have comparable performance.

Highlights

  • The popularity of smart devices has made people increasingly appreciate the convenience of voice interaction

  • The remainder of this paper is organized as follows: In Section 2, we briefly review the history of automatic speech recognition, focusing on the basic ideas and characteristics of the hidden Markov model (HMM)-Gaussian mixture model (GMM), HMM-deep neural network (DNN), and end-to-end models, and comparing their strengths and weaknesses; in Sections 3–5, we summarize and analyze the principles, progress, and research focus of the connectionist temporal classification (CTC)-based, recurrent neural network (RNN)-transducer, and attention-based end-to-end models, respectively

  • For speaker-independent, large vocabulary continuous speech recognition tasks, HMM-based technology and deep learning technology are the keys to breakthroughs

Summary

Introduction

The popularity of smart devices has made people increasingly appreciate the convenience of voice interaction. Compared with the HMM-based model, the end-to-end model uses a single network to directly map audio to characters or words. It replaces the engineering pipeline with a learning process and requires no domain expertise, so the end-to-end model is simpler to construct and train. These advantages have quickly made the end-to-end model a hot research direction in large vocabulary continuous speech recognition (LVCSR).
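
To make the contrast concrete, the following is a minimal sketch, in PyTorch, of what "a single model that directly maps audio to characters" can look like: a recurrent encoder plus a linear classifier trained with the connectionist temporal classification (CTC) loss surveyed in Section 3. All names, shapes, and hyperparameters here are illustrative assumptions, not the implementation of any specific system discussed in the paper.

```python
# Minimal sketch (assumed setup, not the paper's implementation): one network
# maps an audio feature sequence directly to per-frame character scores.
import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        # n_chars: e.g., 26 letters + space + apostrophe + CTC blank (index 0)
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_chars)

    def forward(self, feats):            # feats: (batch, time, n_mels)
        out, _ = self.encoder(feats)     # (batch, time, 2 * hidden)
        return self.classifier(out)      # per-frame character logits

model = EndToEndASR()
ctc_loss = nn.CTCLoss(blank=0)

feats = torch.randn(4, 200, 80)           # dummy batch of log-mel features
logits = model(feats).log_softmax(-1)     # (batch, time, n_chars)
targets = torch.randint(1, 29, (4, 30))   # dummy character indices (0 = blank)
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)

# CTC marginalizes over all frame-to-character alignments, which is why no
# forced segmentation alignment between audio and transcript is needed.
loss = ctc_loss(logits.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()
```

Note how the whole pipeline is one differentiable model trained end to end, in contrast to the separately trained acoustic, pronunciation, and language modules of the HMM-based framework.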

History of ASR
Models for LVCSR
HMM-Based Model
End-to-End Model
CTC-Based End-to-End Model
Key Ideas of CTC
Path Probability Calculation
Path Aggregation
Model Structure
Large-Scale Data Training
Language Model
RNN-Transducer End-to-End Model
Key Ideas of RNN-Transducer
Works on RNN-Transducer
Attention-Based End-to-End Model
Works on the Encoder
Delay and Information Redundancy
Network Structure
Works on Attention
Continuity Problem
Monotonic Problem
Inaccurate Extraction of Key Information
Works on the Decoder
Model Characteristics Comparison
Model Recognition Performance Comparison
Findings
Future Works
