Abstract

Automatic Speech Recognition (ASR) has achieved its best results for English with end-to-end, supervised neural network models. These supervised models need huge amounts of labeled speech data to generalize well, which is a challenge for low-resource languages like Urdu. Most models proposed for Urdu ASR are based on Hidden Markov Models (HMMs). This paper proposes an end-to-end neural network model for Urdu ASR, regularized with dropout, ensemble averaging, and Maxout units. Dropout and ensembling are averaging techniques over multiple neural network models, while Maxout units adapt their activation functions during training. Because labeled data is limited, Semi-Supervised Learning (SSL) techniques are also incorporated to improve model generalization. Speech features are projected onto a lower-dimensional manifold using an unsupervised dimensionality-reduction technique called Locally Linear Embedding (LLE), and the transformed data is used alongside the original higher-dimensional features to train the neural networks. The proposed model also utilizes label-propagation-based self-training of the initially trained models and achieves a Word Error Rate (WER) 4% lower than the benchmark reported on the same Urdu corpus using HMMs. The decrease in WER after incorporating SSL becomes more pronounced as the validation data size grows.
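The LLE step described in the abstract can be sketched as follows. This is a minimal illustration using scikit-learn's `LocallyLinearEmbedding`, with randomly generated feature vectors standing in for real speech features; the frame count, feature dimensionality, neighbor count, and embedding size are all hypothetical choices, not values from the paper.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Hypothetical stand-in for extracted speech features: 200 frames x 39 dims
# (e.g. MFCC-like vectors); real features would come from the speech corpus.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 39))

# Project the features onto a lower-dimensional manifold with LLE.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=12)
low_dim = lle.fit_transform(features)

# Use the transformed data alongside the original higher-dimensional
# features, as the abstract describes, by concatenating per frame.
combined = np.hstack([features, low_dim])
print(combined.shape)  # (200, 51)
```

The concatenated matrix would then serve as input to the neural network; the neighbor and component counts are the main knobs controlling how much local manifold structure LLE preserves.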

Highlights

  • Automatic Speech Recognition (ASR) can be a vital component in artificially-intelligent interactive systems

  • Word Error Rate (WER) down to 22% is achieved by the proposed Semi-Supervised Learning (SSL)-Neural Network (SSL-NN) model in a speaker-independent setup, compared to 25.42% achieved by Hidden Markov Models (HMMs) on the same corpus [3]

Summary

Introduction

Automatic Speech Recognition (ASR) can be a vital component in artificially-intelligent interactive systems. Unsupervised learning generally performs clustering, density estimation, and dimensionality-reduction tasks; utilizing both supervised and unsupervised techniques for data classification is called Semi-Supervised Learning (SSL). A benchmark of 25.42% Word Error Rate (WER) has been reported using HMM models for speech recognition on the corpus in a speaker-independent setup, with 90% of the speech used as training data and 10% as test data [3]. This paper describes the performance of an end-to-end neural network-based speech recognition model tested on the same corpus. The model is tested using as little as 50% of the available corpus as training data for the first time, and because of SSL its performance does not deteriorate drastically with the limited training portion. This is quite significant for low-resource languages like Urdu. The conclusion and scope for future work are presented at the end.

Deep Learning
Semi-Supervised Learning
System Model
Neural
Results and Analysis
Neural Network Architecture Analysis
Evaluation of LLE and Self-Training
Discussion and Conclusions
