Abstract

Different approaches have been used to estimate language models from a given corpus. Recently, researchers have applied a variety of neural network architectures to this task, exploiting the ability of neural networks to learn from a corpus in an unsupervised manner. In general, neural network language models have proved more effective than conventional n-gram language models. For languages with a rich morphological system and a very large vocabulary, however, the major trade-off with neural network language models is the size of the network. This paper presents a recurrent neural network language model based on tokenizing each word into three parts: the prefix, the stem, and the suffix. The proposed model is evaluated on the English AMI speech recognition dataset, where it outperforms the baseline n-gram model, the basic recurrent neural network language model (RNNLM), and the GPU-based recurrent neural network language model (CUED-RNNLM) in both perplexity and word error rate. On an Arabic misspelling dataset, automatic spelling correction accuracy improved by approximately 3.5%.
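
To make the tokenization idea concrete, the Python sketch below illustrates splitting words into prefix, stem, and suffix tokens before they are fed to a language model. The affix inventories, the "+" boundary markers, and the length thresholds are hypothetical placeholders for illustration only, not the paper's actual tokenizer (which targets morphologically rich languages such as Arabic).

```python
# Minimal sketch (not the authors' implementation) of splitting each word
# into up to three sub-word tokens -- prefix, stem, suffix -- so the LM
# vocabulary holds sub-word units instead of full surface forms.

PREFIXES = ["un", "re", "dis"]          # hypothetical prefix inventory
SUFFIXES = ["ing", "ed", "s", "ly"]     # hypothetical suffix inventory

def split_word(word):
    """Return the (prefix, stem, suffix) tokens of a word; empty affixes are omitted."""
    prefix, stem, suffix = "", word, ""
    for p in PREFIXES:
        if stem.startswith(p) and len(stem) > len(p) + 2:
            prefix, stem = p, stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) > len(s) + 2:
            stem, suffix = stem[:-len(s)], s
            break
    return [t for t in ((prefix + "+") if prefix else "", stem,
                        ("+" + suffix) if suffix else "") if t]

def tokenize_sentence(sentence):
    """Map a sentence to the sub-word token sequence an RNN LM would be trained on."""
    tokens = []
    for word in sentence.split():
        tokens.extend(split_word(word))
    return tokens

print(tokenize_sentence("the speakers rejoined unexpectedly"))
# ['the', 'speaker', '+s', 're+', 'join', '+ed', 'un+', 'expected', '+ly']
```

Splitting words this way keeps the token vocabulary far smaller than the surface-form vocabulary, which is the size trade-off the abstract refers to.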

Highlights

  • Statistical language models estimate the probability of a given sequence of words. Given a sentence s with n words, s = (w1, w2, ..., wn), the language model assigns a probability P(s)

  • The results reported in this work show that recurrent neural network-based language models outperform traditional models on two different tasks: English automatic speech recognition and Arabic automatic spelling error correction

  • The results show that the proposed token-based recurrent neural network language model outperforms the n-gram LM by approximately 3% and improves on the basic recurrent neural network language model (RNNLM) and its GPU version (CUED-RNNLM) by approximately 1.5% when using the ...


Summary

Introduction

Statistical language models estimate the probability of a given sequence of words. Given a sentence s with n words, s = (w1, w2, ..., wn), the language model assigns a probability P(s), which can be decomposed by the chain rule as P(s) = P(w1) P(w2 | w1) ... P(wn | w1, ..., wn-1). The CUED-RNNLM toolkit [11] provides an implementation of the recurrent neural network-based model with GPU support for more efficient training. Neither the basic feed-forward nor the recurrent neural network language model includes any word-level morphological features, although some researchers have added such features explicitly through input layer factorization. The complexity of these factorized models is higher than that of the original models because the word features are added explicitly to the input layer. While adding these features improves network performance, it increases the cost of model estimation and of applying the model, especially for applications with large vocabularies or for languages with rich morphological features.
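
As a concrete, hedged illustration of the chain-rule view and the perplexity metric used to evaluate language models, the sketch below substitutes a toy unigram table for a trained RNN; the function names and probabilities are assumptions for illustration only, not part of any of the toolkits mentioned above.

```python
import math

# Sketch of how any LM (n-gram or RNN) assigns P(s) via the chain rule.
# `next_word_prob` stands in for the model's conditional distribution;
# here it is a toy unigram table with a uniform fallback.

UNIGRAMS = {"the": 0.07, "cat": 0.01, "sat": 0.005}  # toy probabilities
VOCAB_SIZE = 10_000

def next_word_prob(history, word):
    """P(word | history); a real RNN LM would condition on the full history."""
    return UNIGRAMS.get(word, 1.0 / VOCAB_SIZE)

def sentence_log_prob(words):
    """log P(s) = sum_i log P(w_i | w_1 .. w_{i-1})."""
    return sum(math.log(next_word_prob(words[:i], w)) for i, w in enumerate(words))

def perplexity(words):
    """Perplexity = exp(-log P(s) / n), the intrinsic metric reported in the paper."""
    return math.exp(-sentence_log_prob(words) / len(words))

print(perplexity(["the", "cat", "sat"]))  # roughly 66 for this toy table
```

A lower perplexity means the model assigns higher probability to the observed word sequence, which is why it is used alongside word error rate to compare the n-gram, RNNLM, and CUED-RNNLM baselines.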

[Table residue: columns listed the input word, its Buckwalter transliteration, English translation, and stemmer output, together with correction accuracy figures; the table contents are not recoverable here.]
