Abstract

Since conventional Automatic Speech Recognition (ASR) systems often contain many modules and require varied expertise, such models are hard to build and train. Recent research shows that end-to-end ASR systems can significantly simplify the speech recognition pipeline and achieve performance competitive with conventional systems. However, most end-to-end ASR systems are neither reproducible nor comparable because they use specific language models and in-house training databases that are not freely available. This is especially common in Mandarin speech recognition. In this paper, we propose a CNN+BLSTM+CTC end-to-end Mandarin ASR model. It uses a Convolutional Neural Network (CNN) to learn local speech features, a Bidirectional Long Short-Term Memory (BLSTM) network to learn past and future contextual information, and Connectionist Temporal Classification (CTC) for decoding. Our model is trained entirely on AISHELL-1, the largest open-source Mandarin speech corpus to date, using neither in-house databases nor external language models. Experiments show that our CNN+BLSTM+CTC model achieves a WER of 19.2%, outperforming the existing best work. Because all the corpora we use are freely available, our model is reproducible and comparable, providing a new baseline for further Mandarin ASR research.
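To make the architecture concrete, below is a minimal, illustrative sketch of a CNN+BLSTM+CTC model in PyTorch. The layer counts, kernel sizes, hidden width, 80-dimensional mel-filterbank input, and the 4,232-token output vocabulary are assumptions for illustration only, not the paper's exact configuration.

    # Illustrative CNN+BLSTM+CTC sketch; hyperparameters are assumptions,
    # not the paper's reported configuration.
    import torch
    import torch.nn as nn

    class CNNBLSTMCTC(nn.Module):
        def __init__(self, n_mels=80, hidden=256, n_tokens=4232):
            super().__init__()
            # CNN front end: learns local time-frequency patterns.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(32),
                nn.Hardtanh(0, 20, inplace=True),      # clipped ReLU: min(max(x, 0), 20)
                nn.MaxPool2d(kernel_size=2, stride=2)  # halves time and frequency axes
            )
            # BLSTM: captures past and future context at each frame.
            self.blstm = nn.LSTM(32 * (n_mels // 2), hidden,
                                 num_layers=3, bidirectional=True, batch_first=True)
            self.proj = nn.Linear(2 * hidden, n_tokens)

        def forward(self, x):                  # x: (batch, time, n_mels)
            x = self.conv(x.unsqueeze(1))      # -> (batch, 32, time/2, n_mels/2)
            b, c, t, f = x.shape
            x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
            x, _ = self.blstm(x)
            return self.proj(x).log_softmax(-1)  # per-frame token log-probs for CTC

    # Training with the CTC loss (dummy data for illustration):
    model = CNNBLSTMCTC()
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    feats = torch.randn(4, 200, 80)             # 4 utterances, 200 frames, 80 mel bins
    logp = model(feats)                         # -> (4, 100, 4232) after pooling
    targets = torch.randint(1, 4232, (4, 30))   # dummy character label sequences
    loss = ctc(logp.transpose(0, 1),            # CTCLoss expects (time, batch, tokens)
               targets,
               torch.full((4,), logp.size(1), dtype=torch.long),
               torch.full((4,), 30, dtype=torch.long))

CTC removes the need for frame-level alignments: the loss marginalizes over all alignments of the label sequence to the per-frame outputs, which is what allows the whole pipeline to be trained end to end.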

Highlights

  • With the rapid development of smart devices such as mobile phones and robots, users increasingly interact with man–machine interfaces via speech recognition

  • Our model is trained entirely on AISHELL-1, the largest open-source Mandarin speech corpus to date, using neither in-house databases nor external language models

  • We propose an end-to-end Mandarin Automatic Speech Recognition (ASR) model that combines Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BLSTM), and Connectionist Temporal Classification (CTC)

Summary

Introduction

With the rapid development of smart devices such as mobile phones and robots, users increasingly interact with man–machine interfaces via speech recognition. Large Vocabulary Continuous Speech Recognition (LVCSR) systems often contain several separate modules, including acoustic, phonetic, and language models, as well as special lexicons, and all these modules are trained separately. Our model is reproducible and comparable for other researchers because it is trained on the openly accessible Mandarin speech dataset AISHELL-1, using neither in-house data nor external language models. We benchmark our CNN+BLSTM+CTC model on the AISHELL-1 test set and compare it to existing works. Experimental results show that our model achieves a WER of 19.2%, outperforming the existing methods in [6,7]. The development and test data we use also come from AISHELL-1 and can be freely acquired, which makes our results comparable for other researchers.
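Since the benchmark figure is a WER, the following is a minimal sketch of how such an error rate is computed: the Levenshtein edit distance between the reference and hypothesis token sequences, divided by the reference length. Scoring over characters, as shown here, is an assumption; the paper's exact scoring setup may differ.

    # Minimal WER/CER sketch via Levenshtein edit distance.
    def wer(ref, hyp):
        """Error rate: (substitutions + insertions + deletions) / len(ref)."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    # For Mandarin, the rate is commonly computed over characters:
    print(wer(list("今天天气很好"), list("今天天很好")))  # ~0.167 (one deletion)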

Related Works
End-to-End Model for Mandarin ASR
Convolution Layer
Batch Normalization
Activations
Clipped ReLU
Max Pooling
Bidirectional LSTM
Stacking Up LSTMs of Opposite Directions
Datasets and Input Features
Convolutional Neural Network
Comparison with Existing Works
Findings
Conclusions
Future Work