Abstract

Automatic speech recognition, especially large vocabulary continuous speech recognition, is an important issue in the field of machine learning. For a long time, the hidden Markov model (HMM)-Gaussian mixture model (GMM) has been the mainstream speech recognition framework. Recently, however, the HMM-deep neural network (DNN) model and end-to-end models based on deep learning have achieved performance beyond HMM-GMM. Both using deep learning techniques, these two models have comparable performance.

Highlights

  • The popularity of smart devices has made people increasingly appreciate the convenience of voice interaction

  • The remainder of this paper is organized as follows: In Section 2, we briefly review the history of automatic speech recognition, focusing on the basic ideas and characteristics of the hidden Markov model (HMM)-Gaussian mixture model (GMM), HMM-deep neural network (DNN), and end-to-end models, and comparing their strengths and weaknesses; in Sections 3–5, we summarize and analyze the principles, progress, and research focus of the connectionist temporal classification (CTC)-based, recurrent neural network (RNN)-transducer, and attention-based end-to-end models, respectively

  • For speaker-independent, large vocabulary continuous speech recognition tasks, HMM-based technology and deep learning technology are the keys to breakthroughs

Summary

Introduction

The popularity of smart devices has made people increasingly appreciate the convenience of voice interaction. Compared with the HMM-based model, the end-to-end model uses a single network to directly map audio to characters or words. It replaces the engineering pipeline with a learning process and requires no domain expertise, so the end-to-end model is simpler to construct and train. These advantages have quickly made the end-to-end model a hot research direction in large vocabulary continuous speech recognition (LVCSR).
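
To make the contrast concrete, the following is a minimal sketch, in PyTorch, of what "a single model that directly maps audio to characters" can look like: a recurrent encoder plus a linear classifier trained with the connectionist temporal classification (CTC) loss surveyed in Section 3. All names, shapes, and hyperparameters here are illustrative assumptions, not the implementation of any specific system discussed in the paper.

```python
# Minimal sketch (assumed setup, not the paper's implementation): one network
# maps an audio feature sequence directly to per-frame character scores.
import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        # n_chars: e.g., 26 letters + space + apostrophe + CTC blank (index 0)
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_chars)

    def forward(self, feats):            # feats: (batch, time, n_mels)
        out, _ = self.encoder(feats)     # (batch, time, 2 * hidden)
        return self.classifier(out)      # per-frame character logits

model = EndToEndASR()
ctc_loss = nn.CTCLoss(blank=0)

feats = torch.randn(4, 200, 80)           # dummy batch of log-mel features
logits = model(feats).log_softmax(-1)     # (batch, time, n_chars)
targets = torch.randint(1, 29, (4, 30))   # dummy character indices (0 = blank)
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)

# CTC marginalizes over all frame-to-character alignments, which is why no
# forced segmentation alignment between audio and transcript is needed.
loss = ctc_loss(logits.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()
```

Note how the whole pipeline is one differentiable model trained end to end, in contrast to the separately trained acoustic, pronunciation, and language modules of the HMM-based framework.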

History of ASR
Models for LVCSR
HMM-Based Model
End-to-End Model
CTC-Based End-to-End Model
Key Ideas of CTC
Path Probability Calculation
Path Aggregation
Model Structure
Large-Scale Data Training
Language Model
RNN-Transducer End-to-End Model
Key Ideas of RNN-Transducer
Works on RNN-Transducer
Attention-Based End-to-End Model
Works on the Encoder
Delay and Information Redundancy
Network Structure
Works on Attention
Continuity Problem
Monotonic Problem
Inaccurate Extraction of Key Information
Works on the Decoder
Model Characteristics Comparison
Model Recognition Performance Comparison
Findings
Future Works
