Abstract

This paper reports the development of a multi-genre Urdu Broadcast (BC) speech corpus and a Large Vocabulary Continuous Speech Recognition (LVCSR) system. A BC speech corpus of 98 hours from 453 speakers is collected and annotated. For acoustic modeling, a Time-delay Neural Network (TDNN) is developed, using alignments from prior Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) training. For language modeling, 3-gram, 4-gram and Recurrent Neural Network (RNN) based models are developed on a text corpus of 188 million words. The developed models are tested on 4.3 hours of unseen multi-genre BC speech, and the best Word Error Rate (WER) of 18.59% is achieved using the RNN-based Language Model (LM). Moreover, a detailed word error analysis is carried out to compare the errors made by humans with those made by the Automatic Speech Recognition (ASR) system. The results show similar patterns of word misrecognition for both humans and ASR.
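
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in this work; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. Real scoring pipelines
    usually apply text normalization first, which is not shown here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 18.59% corresponds to roughly 19 word errors per 100 reference words. Because insertions are counted, the value can exceed 1.0 for very poor hypotheses.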
