Abstract
In this paper we investigate different n-gram language models that are defined over an open lexicon. We introduce a character-level language model and combine it with a standard word-level language model in a back-off fashion. The character-level language model is redefined and renormalized to assign zero probability to words from a fixed vocabulary. Furthermore, we present a way to interpolate language models created at the word and character levels. The computation of character-level probabilities incorporates the across-word context. We compare perplexities on all words of the test set, as well as separately on in-lexicon and out-of-vocabulary (OOV) words, on corpora of English and Arabic text.
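The back-off scheme the abstract describes can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration, not the paper's implementation: toy unigram models stand in for the paper's n-gram models, and both the renormalization that zeroes out in-lexicon character sequences and the across-word character context are omitted. All names here (`UnigramLM`, `backoff_log_prob`, the `</w>` end-of-word marker) are illustrative assumptions. The idea shown is that an in-lexicon word is scored by the word-level model directly, while an OOV word receives the word model's unknown-word mass, redistributed over its character sequence by the character-level model.

```python
import math
from collections import Counter

class UnigramLM:
    """Toy add-one-smoothed unigram model; a stand-in for the
    n-gram models used in the paper."""
    def __init__(self, tokens):
        self.counts = Counter(tokens)
        self.total = sum(self.counts.values())
        self.vsize = len(self.counts) + 1   # +1 reserves mass for unseen tokens

    def log_prob(self, token):
        return math.log((self.counts[token] + 1) / (self.total + self.vsize))

def backoff_log_prob(word, word_lm, char_lm, vocab, unk="<unk>"):
    """Score in-lexicon words with the word-level model; back off to
    the character-level model for OOV words."""
    if word in vocab:
        return word_lm.log_prob(word)
    # OOV: the word model's unknown-word mass is redistributed over
    # character sequences by the character-level model.  The paper's
    # renormalization (assigning zero probability to character
    # sequences that spell in-lexicon words) and its across-word
    # context are omitted here for brevity.
    lp = word_lm.log_prob(unk)
    for c in list(word) + ["</w>"]:   # "</w>" marks the word end
        lp += char_lm.log_prob(c)
    return lp

# Usage on a toy corpus:
words = "the cat sat on the mat".split()
vocab = set(words)
word_lm = UnigramLM(words + ["<unk>"])
char_lm = UnigramLM(list("".join(words)) + ["</w>"])

print(backoff_log_prob("cat", word_lm, char_lm, vocab))  # in-lexicon word
print(backoff_log_prob("dog", word_lm, char_lm, vocab))  # OOV word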