Improving Deep Learning based Automatic Speech Recognition for Gujarati

  • TL;DR
  • Abstract
  • Literature Map
  • Similar Papers
TL;DR

This study enhances Gujarati end-to-end speech recognition by integrating CNN, BiLSTM, Dense layers, and CTC loss, combined with a prefix decoding technique and BERT-based post-processing, resulting in a 5.87% reduction in Word Error Rate on the Microsoft Speech Corpus.

Abstract

We present a novel approach for improving the performance of an end-to-end speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network, Bi-directional Long Short Term Memory layers, and Dense layers, with Connectionist Temporal Classification as the loss function. To improve the performance of the system given the limited size of the dataset, we present a prefix decoding technique based on a combined language model (a word-level and a character-level language model) and a post-processing technique based on Bidirectional Encoder Representations from Transformers. To gain key insights from our Automatic Speech Recognition (ASR) system, we used the inferences from the system and proposed different analysis methods. These insights help us understand and improve the ASR system, and also provide intuition into the language targeted by the ASR system. We trained the model on the Microsoft Speech Corpus, and we observe a 5.87% decrease in Word Error Rate (WER) with respect to the base-model WER.
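
As a concrete reference, the sketch below shows one plausible way to assemble the CNN + BiLSTM + Dense + CTC acoustic model described above, in PyTorch. All layer sizes, the mel-feature dimension, and the character-set size are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class GujaratiASR(nn.Module):
    """CNN -> BiLSTM -> Dense acoustic model trained with CTC.
    Hyperparameters here are illustrative, not taken from the paper."""

    def __init__(self, n_mels=80, n_chars=70, hidden=256):
        super().__init__()
        # 2D convolutions learn local time-frequency patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 2), padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=(1, 2), padding=1), nn.ReLU(),
        )
        # BiLSTM layers add left and right temporal context.
        self.rnn = nn.LSTM(32 * (n_mels // 4), hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        # Dense projection to per-frame label posteriors; index 0 = CTC blank.
        self.fc = nn.Linear(2 * hidden, n_chars + 1)

    def forward(self, feats):                     # feats: (batch, time, n_mels)
        x = self.cnn(feats.unsqueeze(1))          # (batch, 32, time/2, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)      # (batch, time/2, 32*n_mels/4)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)     # log-probs for CTC

model = GujaratiASR()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
# nn.CTCLoss expects (time, batch, classes):
# loss = ctc(model(feats).transpose(0, 1), targets, in_lens, tgt_lens)
```

Reserving index 0 for the blank symbol keeps the label mapping compatible with `nn.CTCLoss`'s default.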

Similar Papers
  • Research Article
  • Cited by 16
  • 10.1109/taslp.2014.2303295
Theoretical Analysis of Diversity in an Ensemble of Automatic Speech Recognition Systems
  • Mar 1, 2014
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • Kartik Audhkhasi + 3 more

Diversity or complementarity of automatic speech recognition (ASR) systems is crucial for achieving a reduction in word error rate (WER) upon fusion using the ROVER algorithm. We present a theoretical proof explaining this often-observed link between ASR system diversity and ROVER performance. This is in contrast to many previous works that have only presented empirical evidence for this link or have focused on designing diverse ASR systems using intuitive algorithmic modifications. We prove that the WER of the ROVER output approximately decomposes into a difference of the average WER of the individual ASR systems and the average WER of the ASR systems with respect to the ROVER output. We refer to the latter quantity as the diversity of the ASR system ensemble because it measures the spread of the ASR hypotheses about the ROVER hypothesis. This result explains the trade-off between the WER of the individual systems and the diversity of the ensemble. We support this result through ROVER experiments using multiple ASR systems trained on standard data sets with the Kaldi toolkit. We use the proposed theorem to explain the lower WERs obtained by ASR confidence-weighted ROVER as compared to word frequency-based ROVER. We also quantify the reduction in ROVER WER with increasing diversity of the N-best list. We finally present a simple discriminative framework for jointly training multiple diverse acoustic models (AMs) based on the proposed theorem. Our framework generalizes and provides a theoretical basis for some recent intuitive modifications to well-known discriminative training criterion for training diverse AMs.
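
In symbols, the decomposition stated in this abstract can be written as follows, where \(h_i\) are the \(N\) individual system hypotheses, \(r\) is the reference, and \(h_{\mathrm{ROVER}}\) is the fused output (a schematic rendering, not the paper's exact notation):

```latex
\mathrm{WER}(h_{\mathrm{ROVER}}, r) \;\approx\;
\underbrace{\frac{1}{N}\sum_{i=1}^{N} \mathrm{WER}(h_i,\, r)}_{\text{average system WER}}
\;-\;
\underbrace{\frac{1}{N}\sum_{i=1}^{N} \mathrm{WER}(h_i,\, h_{\mathrm{ROVER}})}_{\text{ensemble diversity}}
```

This makes the stated trade-off explicit: fusion helps when the hypotheses spread widely around the ROVER output without being individually much worse.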

  • Research Article
  • Cited by 9
  • 10.1007/s10278-018-0085-8
Retrospective Analysis of Clinical Performance of an Estonian Speech Recognition System for Radiology: Effects of Different Acoustic and Language Models
  • Apr 30, 2018
  • Journal of Digital Imaging
  • A Paats + 3 more

The aim of this study was to retrospectively analyze the influence of different acoustic and language models in order to determine the effects most important to the clinical performance of an Estonian-language, non-commercial, radiology-oriented automatic speech recognition (ASR) system. An ASR system was developed for the Estonian language in the radiology domain by utilizing open-source software components (the Kaldi toolkit, Thrax). The ASR system was trained with real radiology text reports and dictations collected during the development phases. The final version of the ASR system was tested by 11 radiologists who dictated 219 reports in total, in a spontaneous manner in a real clinical environment. The audio files collected in the final phase were used to retrospectively measure the performance of different versions of the ASR system. ASR system versions were evaluated by word error rate (WER) for each speaker and modality, and by the WER difference between the first and last versions of the ASR system. The total average WER across all material improved from 18.4% for the first version (v1) to 5.8% for the last version (v8), which corresponds to a relative improvement of 68.5%. WER improvement was strongly related to modality and radiologist. In summary, the performance of the final ASR system version was close to optimal, delivering similar results across all modalities and being independent of the user, the complexity of the radiology reports, user experience, and speech characteristics.
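
Since every study in this list reports WER, here is a minimal self-contained Python implementation of the metric (word-level Levenshtein distance normalized by reference length); the example strings are hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                    # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                    # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```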

  • Research Article
  • Cited by 3
  • 10.2174/2210327911666210118143758
An Investigation of Multilingual TDNN-BLSTM Acoustic Modeling for Hindi Speech Recognition
  • Jan 1, 2022
  • International Journal of Sensors, Wireless Communications and Control
  • Ankit Kumar + 1 more

Background: In India, thousands of languages or dialects are in use, and most of them are low-resource languages. A well-performing Automatic Speech Recognition (ASR) system for Indian languages is unavailable due to a lack of resources. Hindi is one of them, as large-vocabulary Hindi speech datasets are not freely available; only a few hours of transcribed Hindi speech exist. Creating a well-transcribed speech dataset involves a great deal of time and money, so developing a real-time ASR system from a few hours of training data is a most challenging task. Techniques such as data augmentation, semi-supervised training, multilingual architectures, and transfer learning have been reported in the past to tackle the scarcity of speech data. In this paper, we examine the effect of multilingual acoustic modeling in ASR systems for the Hindi language. Objective: This article's objective is to develop a Hindi ASR system with high accuracy and a reasonable computational load using only a few hours of training data. Method: To achieve this goal, we used multilingual training with Time Delay Neural Network-Bidirectional Long Short Term Memory (TDNN-BLSTM) acoustic modeling. Multilingual acoustic modeling has significantly improved ASR performance for low- and limited-resource languages. The common practice is to train the acoustic model by merging data from similar languages. In this work, we use three Indian languages, namely Hindi, Marathi, and Bengali: Hindi with 2.5 hours of training data, Marathi with 5.5 hours, and Bengali with 28.5 hours of transcribed data. Results: The Kaldi toolkit was used to perform all the experiments. The investigation covers three main points. First, we present the monolingual ASR system using various Neural Network (NN) based acoustic models. Second, we show that Recurrent Neural Network (RNN) language modeling helps to improve ASR performance further. Finally, we show that a multilingual ASR system significantly reduces the Word Error Rate (WER): an absolute 2% WER reduction for Hindi and 3% for Marathi. In all three languages, the proposed TDNN-BLSTM-A multilingual acoustic model yields the lowest WER. Conclusion: The multilingual hybrid TDNN-BLSTM-A architecture shows a 13.67% relative improvement over the monolingual Hindi ASR system. The best WER recorded for Hindi ASR was 8.65%. For Marathi and Bengali, the proposed TDNN-BLSTM-A acoustic model achieves best WERs of 30.40% and 10.85%, respectively.
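
The 13.67% figure above is a relative (not absolute) reduction; a small helper makes the distinction concrete. The ~10.02% monolingual baseline below is implied by the reported numbers, not stated in the abstract.

```python
def relative_improvement(wer_base: float, wer_new: float) -> float:
    """Relative WER reduction between a baseline and an improved system."""
    return (wer_base - wer_new) / wer_base

# The best Hindi WER of 8.65% with a 13.67% relative gain implies a
# monolingual baseline of about 8.65 / (1 - 0.1367) ≈ 10.02% WER.
print(relative_improvement(10.02, 8.65))  # ≈ 0.1367
```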

  • Research Article
  • Cited by 5
  • 10.1044/2019_jslhr-s-18-0313
Mandarin Electrolaryngeal Speech Recognition Based on WaveNet-CTC.
  • Jun 14, 2019
  • Journal of Speech, Language, and Hearing Research
  • Zhaopeng Qian + 4 more

Purpose: The application of Chinese Mandarin electrolaryngeal (EL) speech for laryngectomees has been limited by drawbacks such as a single fundamental frequency, mechanical sound, and large radiation noise. To improve the intelligibility of Chinese Mandarin EL speech, a new perspective using an automatic speech recognition (ASR) system was proposed, which can convert EL speech into healthy speech if combined with text-to-speech. Method: An ASR system was designed to recognize EL speech based on a deep learning model, WaveNet, and connectionist temporal classification (WaveNet-CTC). This system mainly consists of three parts: the acoustic model, the language model, and the decoding model. Acoustic features are extracted during speech preprocessing, and 3,230 utterances of EL speech mixed with 10,000 utterances of healthy speech are used to train the ASR system. A comparative experiment was designed to evaluate the performance of the proposed method. Results: The results show that the proposed ASR system has higher stability and generalizability than the traditional methods, manifesting superiority in terms of Chinese characters, Chinese words, short sentences, and long sentences. Phoneme confusion occurs more easily in the stops and affricates of EL speech than in healthy speech. The highest accuracy of the ASR reached 83.24% when 3,230 utterances of EL speech were used to train the system. Conclusions: This study indicates that EL speech can be recognized effectively by ASR based on WaveNet-CTC. The proposed method has higher generalization performance and better stability than the traditional methods. A high accuracy of the WaveNet-CTC-based ASR system can be obtained, which means that EL speech can be converted into healthy speech. Supplemental Material: https://doi.org/10.23641/asha.8250830.

  • Research Article
  • Cited by 30
  • 10.1080/10400435.2022.2061085
Interaction between people with dysarthria and speech recognition systems: A review
  • Apr 16, 2022
  • Assistive Technology
  • Aisha Jaddoh + 2 more

In recent years, rapid advancements have been made in automatic speech recognition (ASR) systems and devices. Though ASR technologies have improved, the accessibility of these novel interaction systems is underreported, and they may present difficulties for people with speech impediments. In this article, we attempt to identify gaps in current research on the interaction between people with dysarthria and ASR systems and devices. We cover the period from 2011, when Siri (the first and leading commercial voice assistant) was launched, to 2020. The review employs an interaction framework in which each element (user, input, system, and output) contributes to the interaction process. To select the articles for review, we conducted a search of scientific databases and academic journals. A total of 36 studies met the inclusion criteria, which included use of the word error rate (WER) as a measurement for evaluating ASR systems. This review determines that challenges in interacting with ASR systems persist even in light of the most recent commercial technologies. Further, understanding of the entire interaction process remains limited; thus, to improve this interaction, the recent progress of ASR systems must be elucidated.

  • Conference Article
  • Cited by 12
  • 10.23919/apsipa.2018.8659622
Neural Speech-to-Text Language Models for Rescoring Hypotheses of DNN-HMM Hybrid Automatic Speech Recognition Systems
  • Nov 1, 2018
  • Tomohiro Tanaka + 3 more

In this paper, we propose to leverage end-to-end automatic speech recognition (ASR) systems for assisting deep neural network-hidden Markov model (DNN-HMM) hybrid ASR systems. The DNN-HMM hybrid ASR system, which is composed of an acoustic model, a language model, and a pronunciation model, is known to be the most practical architecture in the ASR field. On the other hand, much attention has been paid in recent studies to end-to-end ASR systems that are fully composed of neural networks. It is known that they can yield comparable performance without introducing heuristic operations. However, one problem is that end-to-end ASR systems sometimes suffer from redundant generation and omission of important words in the text generation phase. This is because these systems cannot explicitly consider the connection between the input speech and the output text. Therefore, our idea is to regard end-to-end ASR systems as neural speech-to-text language models (NS2TLMs) and to use them for rescoring hypotheses generated by the DNN-HMM hybrid ASR systems. This enables us to leverage the end-to-end ASR systems while avoiding the generation issues, because the DNN-HMM hybrid ASR systems can generate speech-aligned hypotheses. It is expected that the NS2TLMs improve the DNN-HMM hybrid ASR systems because the end-to-end ASR systems correctly handle short-duration utterances. In our experiments, we use state-of-the-art DNN-HMM hybrid ASR systems with convolutional and long short-term memory recurrent neural network acoustic models, and end-to-end ASR systems based on an attentional encoder-decoder. We demonstrate that our proposed method can yield better ASR performance than both the DNN-HMM hybrid ASR system and the end-to-end ASR system.

  • Conference Article
  • Cited by 27
  • 10.1109/icassp.2019.8683086
Analyzing Uncertainties in Speech Recognition Using Dropout
  • May 1, 2019
  • Apoorv Vyas + 3 more

The performance of Automatic Speech Recognition (ASR) systems is often measured using the Word Error Rate (WER), which requires time-consuming and expensive manually transcribed data. In this paper, we use state-of-the-art ASR systems based on Deep Neural Networks (DNN) and propose a novel framework which uses "Dropout" at test time to model uncertainty in prediction hypotheses. We systematically exploit this uncertainty to estimate WER without the need for explicit transcriptions. In addition, we show that the predictive uncertainty can also be used to accurately localize the errors made by the ASR system. We study the performance of our approach on the Switchboard database, where it predicts WER accurately within 2.6% and 5.0% for HMM-DNN and Connectionist Temporal Classification (CTC) ASR systems, respectively.
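
A minimal sketch of the test-time dropout idea in PyTorch, assuming a CTC acoustic model like the ones above: only the dropout modules are switched back to training mode, and the spread of the sampled hypotheses serves as the uncertainty signal. `greedy_ctc_decode` is a placeholder here (a minimal version appears near the end of this list).

```python
import torch
import torch.nn as nn

def mc_dropout_hypotheses(model, feats, n_samples=20):
    """Sample decoding hypotheses with dropout active at test time.
    Hypothesis disagreement can then be calibrated against WER."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                      # keep only dropout stochastic
    hyps = []
    with torch.no_grad():
        for _ in range(n_samples):
            log_probs = model(feats)       # stochastic forward pass
            hyps.append(greedy_ctc_decode(log_probs[0]))  # batch of one
    return hyps
```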

  • Book Chapter
  • Cited by 1
  • 10.1007/978-981-19-0095-2_56
Robust Feature Extraction and Recognition Model for Automatic Speech Recognition System on News Report Dataset
  • Jun 23, 2022
  • Sunanda Mendiratta + 2 more

Information processing has become ubiquitous. The process of deriving a transcription from speech is known as automatic speech recognition (ASR). In recent times, many real-time applications, such as home computer systems, mobile telephones, and various public and private telephony services, have been deployed with ASR systems. Inspired by commercial speech recognition technologies, the study of ASR systems has generated immense interest among researchers. This paper enhances convolutional neural networks (CNNs) via a robust feature extraction model and an intelligent recognition system. First, the news report dataset is collected from a public repository. The collected dataset is subject to different noises and is preprocessed by min-max normalization, a technique that linearly transforms the data into an understandable range. Then, the best sequence of words corresponding to the audio, based on the acoustic and language model, undergoes feature extraction using Mel-frequency Cepstral Coefficients (MFCCs). The transformed features are then fed into convolutional neural networks, whose hidden layers perform a limited number of iterations to obtain a robust recognition system. Experimental results show a better accuracy of 96.17% than the existing ANN. Keywords: Speech recognition, Text, Mel features, Recognition accuracy, Convolutional neural networks
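
A compact sketch of the described preprocessing pipeline (min-max normalization followed by MFCC extraction), using librosa; the sampling rate and coefficient count are assumptions, not values from the chapter.

```python
import librosa
import numpy as np

def mfcc_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Min-max normalize the waveform, then extract MFCCs."""
    y, sr = librosa.load(wav_path, sr=16000)
    y = (y - y.min()) / (y.max() - y.min() + 1e-8)   # linear map to [0, 1]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                    # (frames, n_mfcc)
```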

  • Research Article
  • Cited by 24
  • 10.1007/s10772-020-09671-5
Enhancements in automatic Kannada speech recognition system by background noise elimination and alternate acoustic modelling
  • Jan 22, 2020
  • International Journal of Speech Technology
  • G Thimmaraja Yadava + 1 more

In this paper, the improvements to the recently implemented Kannada speech recognition system are demonstrated in detail. The Kannada automatic speech recognition (ASR) system consists of ASR models created using Kaldi, an IVRS call flow, and databases of weather and agricultural commodity price information. The task-specific speech data used in the recently developed spoken dialogue system had high levels of various background noises, which adversely affected both online and offline speech recognition performance. Therefore, to improve the speech recognition accuracy of the Kannada ASR system, a noise reduction algorithm was developed which fuses spectral subtraction with voice activity detection (SS-VAD) and a minimum mean square error spectrum power estimator based on zero crossing (MMSE-SPZC). The noise elimination algorithm is applied before the feature extraction stage. Alternative ASR models were created using subspace Gaussian mixture model (SGMM) and deep neural network (DNN) modeling techniques. The experimental results show that the fusion of the noise elimination technique with SGMM/DNN-based modeling gives a relative improvement of 7.68% in accuracy compared to the recently developed GMM-HMM based ASR system. The acoustic models with the lowest word error rate (WER) can be used in the spoken dialogue system. The developed spoken query system was tested by Karnataka farmers in an uncontrolled environment.
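
For flavor, here is a bare-bones spectral-subtraction sketch in Python; it implements only the simplest ingredient of the SS-VAD + MMSE-SPZC fusion described above (the noise-segment length and STFT defaults are assumptions).

```python
import numpy as np
import librosa

def spectral_subtraction(y: np.ndarray, sr: int, noise_secs: float = 0.25):
    """Estimate the noise magnitude from a leading noise-only segment and
    subtract it from the STFT magnitude (half-wave rectified)."""
    stft = librosa.stft(y)                            # default hop = 512
    mag, phase = np.abs(stft), np.angle(stft)
    noise_frames = max(1, int(noise_secs * sr / 512))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    return librosa.istft(clean_mag * np.exp(1j * phase))
```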

  • Book Chapter
  • Cited by 1
  • 10.1007/978-981-16-5747-4_23
Hybrid End-to-End Architecture for Hindi Speech Recognition System
  • Jan 1, 2022
  • A Kumar + 3 more

Traditional automatic speech recognition (ASR) systems are based on statistical models like the hidden Markov model-Gaussian mixture model (HMM-GMM). Deep neural network (DNN)-based ASR systems have a complex structure containing modules such as lexicon, acoustic, and language models. Each module has its limitations, and their combination can greatly reduce performance. Recently, end-to-end ASR models have largely removed this limitation by introducing a single deep network architecture without any additional language resources. Connectionist temporal classification (CTC), the attention mechanism, and the hybrid CTC/attention mechanism are the three major types of end-to-end models. These end-to-end models require a massive amount of training data to work well, and such data is not available for the majority of Indian languages; Hindi is one of them. This work is part of the Hindi speech challenge (https://sites.google.com/view/asr-challenge) conducted by IIT Madras in September 2020. In this paper, we trained an end-to-end ASR system on the Hindi dataset. We found that the hybrid CTC/attention model with \(\lambda = 0.5\) performs better than the other models. The end-to-end ASR model was trained without an additional language model. Keywords: Automatic speech recognition, Hybrid CTC/attention architecture, End-to-end models, Acoustic modeling
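
The hybrid objective is a convex combination of the two losses; here is a one-line sketch with the \(\lambda = 0.5\) setting reported above (variable names are mine):

```python
def hybrid_ctc_attention_loss(ctc_loss, attention_loss, lam=0.5):
    """Hybrid CTC/attention objective: lam * CTC + (1 - lam) * attention."""
    return lam * ctc_loss + (1 - lam) * attention_loss
```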

  • Book Chapter
  • Cited by 9
  • 10.5772/6380
Statistical Language Modeling for Automatic Speech Recognition of Agglutinative Languages
  • Nov 1, 2008
  • Ebru Arsoy + 6 more

Automatic Speech Recognition (ASR) systems utilize statistical acoustic and language models to find the most probable word sequence given the speech signal. Hidden Markov Models (HMMs) are used as acoustic models, and language model probabilities are approximated using n-grams, where the probability of a word is conditioned on the n-1 previous words. The n-gram probabilities are estimated by Maximum Likelihood Estimation. One of the problems in n-gram language modeling is data sparseness, which results in non-robust probability estimates, especially for rare and unseen n-grams; therefore, smoothing is applied to produce better estimates for these n-grams. Traditional n-gram word language models are commonly used in state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. These systems achieve reasonable recognition performance for languages such as English and French. For instance, broadcast news (BN) in English can now be recognized with about a ten percent word error rate (WER) (NIST, 2000), which yields mostly quite understandable text. Some rare and new words may be missing from the vocabulary, but the result has proven sufficient for many important applications, such as browsing and retrieval of recorded speech and information retrieval from speech (Garofolo et al., 2000). However, LVCSR attempts with similar systems in agglutinative languages, such as Finnish, Estonian, Hungarian, and Turkish, have so far not achieved performance comparable to the English systems. The main reason for this performance deterioration is the rich morphological structure of these languages. In agglutinative languages, words are formed mainly by concatenating several suffixes to the roots, and together with compounding and inflection this leads to millions of different, but still frequent, word forms. It is therefore practically impossible to build a word-based vocabulary for speech recognition in agglutinative languages that would cover all the relevant words. If words are used as language modeling units, there will be many out-of-vocabulary (OOV) words due to the limited vocabulary sizes used in ASR systems. It was shown that with an optimized 60K lexicon …
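
The n-gram recipe the chapter builds on, in toy Python form: maximum-likelihood bigram counts with add-one smoothing as the simplest fix for unseen n-grams (real systems use stronger smoothing, such as Kneser-Ney).

```python
from collections import Counter

def bigram_lm(sentences):
    """MLE bigram model with add-one (Laplace) smoothing."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    vocab_size = len(unigrams)

    def prob(w_prev, w):
        # P(w | w_prev) = (count(w_prev, w) + 1) / (count(w_prev) + V)
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

    return prob

prob = bigram_lm(["the cat sat", "the cat ran"])
print(prob("the", "cat"))   # seen bigram: relatively high probability
print(prob("cat", "flew"))  # unseen bigram: small but non-zero
```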

  • Research Article
  • Cited by 92
  • 10.1109/tmm.2020.2975922
End-to-End Audiovisual Speech Recognition System With Multitask Learning
  • Mar 6, 2020
  • IEEE Transactions on Multimedia
  • Fei Tao + 1 more

An automatic speech recognition (ASR) system is a key component in current speech-based systems. However, surrounding acoustic noise can severely degrade the performance of an ASR system. An appealing solution to this problem is to augment conventional audio-based ASR systems with visual features describing lip activity. This paper proposes a novel end-to-end, multitask learning (MTL), audiovisual ASR (AV-ASR) system. A key novelty of the approach is the use of MTL, where the primary task is AV-ASR and the secondary task is audiovisual voice activity detection (AV-VAD). We obtain a robust and accurate audiovisual system that generalizes across conditions. By detecting segments with speech activity, AV-ASR performance improves, as its connectionist temporal classification (CTC) loss function can leverage the AV-VAD alignment information. Furthermore, the end-to-end system learns from the raw audiovisual inputs a discriminative high-level representation for both speech tasks, providing the flexibility to mine information directly from the data. The proposed architecture considers the temporal dynamics within and across modalities, providing an appealing and practical fusion scheme. We evaluate the proposed approach on a large audiovisual corpus (over 60 hours) that contains different channel and environmental conditions, comparing the results with competitive single-task learning (STL) and MTL baselines. Although our main goal is to improve the performance of the ASR task, the experimental results show that the proposed approach achieves the best performance across all conditions for both speech tasks. In addition to state-of-the-art performance in AV-ASR, the proposed solution can also provide valuable information about speech activity, solving two of the most important tasks in speech-based applications.

  • Book Chapter
  • Cited by 3
  • 10.5772/10112
Non-Native Pronunciation Variation Modeling for Automatic Speech Recognition
  • Aug 16, 2010
  • Hong Kook + 2 more

Communication using speech is inherently natural, an ability acquired unconsciously and step by step throughout life. To bring the benefits of speech communication to devices, a great deal of research has been carried out over the past several decades. As a result, automatic speech recognition (ASR) systems have been deployed in a range of applications, including automatic reservation systems, dictation systems, navigation systems, etc. Due to increasing globalization, the need for effective interlingual communication has also been growing. However, because most people tend to speak foreign languages with variant or non-fluent pronunciations, there is an increasing demand for the development of non-native ASR systems (Goronzy et al., 2001). In other words, a conventional ASR system is optimized for native speech, but non-native speech has different characteristics: it tends to reflect the pronunciations and syntactic characteristics of the speaker's mother tongue, as well as the wide range of fluency among non-native speakers. Therefore, the performance of an ASR system evaluated on non-native speech tends to degrade severely compared to native speech, due to the mismatch between the native training data and the non-native test data (Compernolle, 2001). A simple way to improve the performance of an ASR system for non-native speech would be to train it on a non-native speech database, though in reality the number of available non-native speech samples is not currently sufficient to train an ASR system. Thus, techniques for improving non-native ASR performance using only a small amount of non-native speech are required. There have been three major approaches to handling non-native speech in ASR: acoustic modeling, language modeling, and pronunciation modeling. First, acoustic modeling approaches find pronunciation differences and transform and/or adapt acoustic models to include the effects of non-native speech (Gruhn et al., 2004; Morgan, 2004; Steidl et al., 2004). Second, language modeling approaches deal with the grammatical effects or speaking style of non-native speech (Bellegarda, 2001). Third, pronunciation modeling approaches derive pronunciation variation rules from non-native speech and apply the derived rules to pronunciation models for non-native speech (Amdal et al., 2000; Fosler-Lussier, 1999; Goronzy et al., 2004; Gruhn et al., 2004; Raux, 2004; Strik et al., 1999). Source: Advances in Speech Recognition, edited by Noam R. Shabtai, ISBN 978-953-307-097-1, pp. 164, Sciyo, Croatia, September 2010.

  • Research Article
  • Cited by 16
  • 10.3390/sym12020290
Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results
  • Feb 17, 2020
  • Symmetry
  • Huseyin Polat + 1 more

To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step in developing an ASR system for a language with few available transcribed speech resources. Turkish is a language with limited resources for ASR. Therefore, development of a Turkish transcribed speech corpus on par with the corpora of high-resource languages is crucial for improving and promoting Turkish speech recognition activities. In this study, we constructed a viable alternative to classical transcribed-corpus preparation techniques for collecting Turkish speech data. In the presented approach, three different methods were used. In the first step, subtitles, which are mainly supplied for people with hearing difficulties, were used as transcriptions for speech utterances obtained from movies. In the second step, data were collected via a mobile application. In the third step, a transfer learning approach was applied to the Grand National Assembly of Turkey session records (videotext). We also provide initial speech recognition results for artificial neural network and Gaussian mixture-model-based acoustic models for Turkish. For training the models, the newly collected corpus and other existing corpora published by the Linguistic Data Consortium were used. In light of the test results on the other existing corpora, the current study showed the relative contribution of corpus variability in a symmetric speech recognition task. The decrease in WER after including the new corpus was more evident with increased verified data size, compensating for the status of Turkish as a low-resource language. For further studies, the importance of the corpus and language model to the success of a Turkish ASR system is shown.

  • Research Article
  • Cited by 37
  • 10.3390/sym11050644
End-to-End Mandarin Speech Recognition Combining CNN and BLSTM
  • May 7, 2019
  • Symmetry
  • Dong Wang + 2 more

Since conventional Automatic Speech Recognition (ASR) systems often contain many modules and draw on many kinds of expertise, such models are hard to build and train. Recent research shows that end-to-end ASR can significantly simplify speech recognition pipelines and achieve performance competitive with conventional systems. However, most end-to-end ASR systems are neither reproducible nor comparable because they use specific language models and in-house training databases which are not freely available. This is especially common for Mandarin speech recognition. In this paper, we propose a CNN+BLSTM+CTC end-to-end Mandarin ASR system. It uses a Convolutional Neural Network (CNN) to learn local speech features, a Bidirectional Long Short-Term Memory (BLSTM) network to learn past and future contextual information, and Connectionist Temporal Classification (CTC) for decoding. Our model is trained entirely on the by-far-largest open-source Mandarin speech corpus, AISHELL-1, using neither any in-house databases nor external language models. Experiments show that our CNN+BLSTM+CTC model achieves a WER of 19.2%, outperforming the existing best work. Because all the data corpora we used are freely available, our model is reproducible and comparable, providing a new baseline for further Mandarin ASR research.
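
Since several entries above decode with CTC, here is the minimal best-path decoder referenced earlier: take the per-frame argmax, collapse repeats, then drop blanks. The papers themselves use stronger beam/prefix decoders; this is only the greedy baseline.

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """Best-path CTC decoding for a (time, classes) log-probability matrix."""
    ids = log_probs.argmax(dim=-1).tolist()   # frame-wise best labels
    out, prev = [], blank
    for i in ids:
        if i != prev and i != blank:          # collapse repeats, skip blanks
            out.append(i)
        prev = i
    return out
```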
