Abstract

Different approaches have been used to estimate language models from a given corpus. Recently, researchers have applied a variety of neural network architectures to this task, exploiting the ability of neural networks to learn from a corpus in an unsupervised manner. In general, neural network language models have proved more effective than conventional n-gram language models. For languages with a rich morphological system and a very large vocabulary, however, the major trade-off with neural network language models is the size of the network. This paper presents a recurrent neural network language model based on tokenizing each word into three parts: the prefix, the stem, and the suffix. The proposed model is evaluated on the English AMI speech recognition dataset, where it outperforms the baseline n-gram model, the basic recurrent neural network language model (RNNLM), and the GPU-based recurrent neural network language model (CUED-RNNLM) in both perplexity and word error rate. On an Arabic misspelling dataset, automatic spelling correction accuracy improved by approximately 3.5%.
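
To make the tokenization idea concrete, the Python sketch below illustrates splitting words into prefix, stem, and suffix tokens before they are fed to a language model. The affix inventories, the "+" boundary markers, and the length thresholds are hypothetical placeholders for illustration only, not the paper's actual tokenizer (which targets morphologically rich languages such as Arabic).

```python
# Minimal sketch (not the authors' implementation) of splitting each word
# into up to three sub-word tokens -- prefix, stem, suffix -- so the LM
# vocabulary holds sub-word units instead of full surface forms.

PREFIXES = ["un", "re", "dis"]          # hypothetical prefix inventory
SUFFIXES = ["ing", "ed", "s", "ly"]     # hypothetical suffix inventory

def split_word(word):
    """Return the (prefix, stem, suffix) tokens of a word; empty affixes are omitted."""
    prefix, stem, suffix = "", word, ""
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) > len(p) + 2:
            prefix, stem = p, stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) > len(s) + 2:
            stem, suffix = stem[:-len(s)], s
            break
    return [t for t in ((prefix + "+") if prefix else "", stem,
                        ("+" + suffix) if suffix else "") if t]

def tokenize_sentence(sentence):
    """Map a sentence to the sub-word token sequence an RNN LM would be trained on."""
    tokens = []
    for word in sentence.split():
        tokens.extend(split_word(word))
    return tokens

print(tokenize_sentence("the speakers rejoined unexpectedly"))
# ['the', 'speaker', '+s', 're+', 'join', '+ed', 'un+', 'expected', '+ly']
```

Splitting words this way keeps the token vocabulary far smaller than the surface-form vocabulary, which is the size trade-off the abstract refers to.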

Highlights

  • Statistical language models estimate the probability of a given sequence of words. Given a sentence s with n words, s = (w1, w2, ..., wn), the language model assigns a probability P(s)

  • The results reported in this work show that recurrent neural network-based language models outperform traditional models on two different tasks: English automatic speech recognition and Arabic automatic spelling error correction

  • The results show that the proposed token-based recurrent neural network language model outperforms the n-gram LM by approximately 3% and improves on the basic recurrent neural network language model (RNNLM) and its GPU version (CUED-RNNLM) by approximately 1.5% when using the ...


Summary

Introduction

Statistical language models estimate the probability of a given sequence of words. Given a sentence s with n words, s = (w1, w2, ..., wn), the language model assigns a probability P(s), which can be decomposed by the chain rule as P(s) = P(w1) P(w2 | w1) ... P(wn | w1, ..., wn-1). The CUED-RNNLM toolkit [11] provides an implementation of the recurrent neural network-based model with GPU support for more efficient training. Neither the basic feed-forward nor the recurrent neural network language model includes any word-level morphological features, although some researchers have added such features explicitly through input layer factorization. The complexity of these factorized models is higher than that of the original models because the word features are added explicitly to the input layer. While adding these features improves network performance, it increases the cost of model estimation and of applying the model, especially for applications with large vocabularies or for languages with rich morphological features.
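
As a concrete, hedged illustration of the chain-rule view and the perplexity metric used to evaluate language models, the sketch below substitutes a toy unigram table for a trained RNN; the function names and probabilities are assumptions for illustration only, not part of any of the toolkits mentioned above.

```python
import math

# Sketch of how any LM (n-gram or RNN) assigns P(s) via the chain rule.
# `next_word_prob` stands in for the model's conditional distribution;
# here it is a toy unigram table with a uniform fallback.

UNIGRAMS = {"the": 0.07, "cat": 0.01, "sat": 0.005}  # toy probabilities
VOCAB_SIZE = 10_000

def next_word_prob(history, word):
    """P(word | history); a real RNN LM would condition on the full history."""
    return UNIGRAMS.get(word, 1.0 / VOCAB_SIZE)

def sentence_log_prob(words):
    """log P(s) = sum_i log P(w_i | w_1 .. w_{i-1})."""
    return sum(math.log(next_word_prob(words[:i], w)) for i, w in enumerate(words))

def perplexity(words):
    """Perplexity = exp(-log P(s) / n), the intrinsic metric reported in the paper."""
    return math.exp(-sentence_log_prob(words) / len(words))

print(perplexity(["the", "cat", "sat"]))  # roughly 66 for this toy table
```

A lower perplexity means the model assigns higher probability to the observed word sequence, which is why it is used alongside word error rate to compare the n-gram, RNNLM, and CUED-RNNLM baselines.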

[Table residue: columns listed the input word, its Buckwalter transliteration, English translation, and stemmer output, together with correction accuracy figures; the table contents are not recoverable here.]
