Abstract
Since conventional Automatic Speech Recognition (ASR) systems often contain many modules and require a variety of expertise, such models are hard to build and train. Recent research shows that end-to-end ASR systems can significantly simplify the speech recognition pipeline and achieve performance competitive with conventional systems. However, most end-to-end ASR systems are neither reproducible nor comparable because they rely on specific language models and in-house training databases that are not freely available. This is especially common for Mandarin speech recognition. In this paper, we propose a CNN+BLSTM+CTC end-to-end Mandarin ASR system. It uses a Convolutional Neural Network (CNN) to learn local speech features, a Bidirectional Long Short-Term Memory (BLSTM) network to learn past and future contextual information, and Connectionist Temporal Classification (CTC) for decoding. Our model is trained entirely on AISHELL-1, by far the largest open-source Mandarin speech corpus, using neither in-house databases nor external language models. Experiments show that our CNN+BLSTM+CTC model achieves a WER of 19.2%, outperforming the existing best work. Because all the data corpora we used are freely available, our model is reproducible and comparable, providing a new baseline for further Mandarin ASR research.
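To make the described architecture concrete, the following is a minimal sketch of a CNN+BLSTM+CTC acoustic model in PyTorch. It is not the authors' exact configuration: the layer counts, channel sizes, hidden dimension, feature dimension (80 filterbank features), and output vocabulary size are illustrative assumptions.

```python
# Minimal CNN + BLSTM + CTC sketch (assumed hyperparameters, not the paper's).
import torch
import torch.nn as nn

class CnnBlstmCtc(nn.Module):
    def __init__(self, n_feats=80, n_classes=4231):  # n_classes = labels + CTC blank (assumed)
        super().__init__()
        # CNN front-end: learns local time-frequency patterns from the spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(1, 2), padding=1),
            nn.ReLU(),
        )
        cnn_out_dim = 32 * (n_feats // 4)
        # BLSTM: models both past and future context along the time axis.
        self.blstm = nn.LSTM(cnn_out_dim, 256, num_layers=3,
                             bidirectional=True, batch_first=True)
        # Linear projection to per-frame label posteriors for CTC.
        self.fc = nn.Linear(2 * 256, n_classes)

    def forward(self, feats):              # feats: (batch, time, n_feats)
        x = feats.unsqueeze(1)             # (batch, 1, time, n_feats)
        x = self.cnn(x)                    # (batch, 32, time', n_feats // 4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.blstm(x)
        return self.fc(x).log_softmax(-1)  # (batch, time', n_classes)

model = CnnBlstmCtc()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
feats = torch.randn(4, 200, 80)                  # dummy batch of 4 utterances
log_probs = model(feats).transpose(0, 1)         # CTCLoss expects (time, batch, classes)
targets = torch.randint(1, 4231, (4, 20))        # dummy label sequences
input_lens = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((4,), 20, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lens, target_lens)
```

The CTC loss allows the network to be trained directly on (audio, transcript) pairs without frame-level alignments, which is what removes the separate alignment and lexicon stages of a conventional pipeline.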
Highlights
With the rapid development of smart devices such as mobile phones and robots, users increasingly interact with man–machine interfaces via speech recognition
Our model is completely trained on the by-far-largest open-source Mandarin speech corpus AISHELL-1, using neither any in-house databases nor external language models
We propose an end-to-end Mandarin Automatic Speech Recognition (ASR) model that combines Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BLSTM), and Connectionist Temporal Classification (CTC)
Summary
With the rapid development of smart devices such as mobile phones and robots, users increasingly interact with man–machine interfaces via speech recognition. Large Vocabulary Continuous Speech Recognition (LVCSR) systems often contain several separate modules, including acoustic, phonetic, and language models, as well as special lexicons, and all of these modules are trained separately. Our model is reproducible and comparable for other researchers because it is trained on the openly accessible Mandarin speech dataset AISHELL-1, using neither in-house data nor external language models. We benchmark our CNN+BLSTM+CTC model on the AISHELL-1 test set and compare it to existing work; a sketch of the evaluation metric follows below. Experimental results show that our model achieves a WER of 19.2%, outperforming the methods in [6,7]. The development and test data we used come from the AISHELL-1 dataset, which can be freely acquired; this makes our results comparable for other researchers.
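For reference, an error rate such as the reported WER is typically computed as the Levenshtein (edit) distance between the reference and hypothesis token sequences, divided by the reference length. The snippet below is a generic sketch of that computation, not the authors' scoring script; for Mandarin the tokens are often characters.

```python
# Generic error-rate computation via edit distance (illustrative, not the paper's tooling).
def edit_distance(ref, hyp):
    # Dynamic-programming Levenshtein distance between two token sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1]

def error_rate(refs, hyps):
    # Total edits over all utterances divided by total reference tokens.
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total

# Toy example with Mandarin character sequences: 2 edits over 6 reference characters.
print(error_rate([list("今天天气很好")], [list("今天天汽好")]))  # -> 0.333...
```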