Sentence Level N-Gram Context Feature in Real-Word Spelling Error Detection and Correction: Unsupervised Corpus Based Approach

Tsegay Mullu Kassa

doi:10.7176/jiea/10-4-02

Abstract

Spell checking is the process of finding misspelled words and, where possible, correcting them. Most modern commercial spell checkers take a straightforward approach to finding misspellings: a word is considered erroneous when it is not found in the dictionary. However, this approach cannot check whether a word is correct in its context; an error of this kind is called a real-word spelling error. To address it, state-of-the-art approaches use context features from fixed-size n-grams (i.e., tri-grams), which limits the available features and reduces the effectiveness of the model. In this paper, we address this problem by adopting sentence-level n-gram features for real-word spelling error detection and correction. In this technique, all possible word n-grams are used to teach the proposed model the properties of the target language, which enhances its effectiveness. The only corpus required to train the proposed model is an unsupervised corpus (raw text), which makes the model adaptable to any natural language. For demonstration purposes, we apply it to under-resourced languages: Amharic, Afaan Oromo, and Tigrigna. The model was evaluated in terms of recall, precision, and F-measure, and compared with the literature (i.e., fixed n-gram context features) to assess how well the technique performs. The experimental results indicate that the proposed model with sentence-level n-gram context features achieves better results: for real-word error detection and correction, it achieves average F-measures of 90.03%, 85.95%, and 84.24% for Amharic, Afaan Oromo, and Tigrigna, respectively.

Keywords: sentence-level n-gram, real-word spelling error, spell checker, unsupervised corpus-based spell checker

Publication date: September 30th, 2020

Highlights

Poor spelling is a common challenge that people face in their day-to-day lives; spell checkers are an essential tool for countering it
Most modern commercial spell checkers take a straightforward approach to finding misspellings: a word is considered erroneous when it is not found in the dictionary
We address this problem by adopting sentence-level n-gram features for real-word spelling error detection and correction
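
The contrast between a fixed-size n-gram context and a sentence-level one can be illustrated with a small sketch. It only enumerates word n-grams; it is not the paper's model, and the function name `sentence_ngrams`, the whitespace tokenization, and the English example sentence are assumptions made for illustration.

```python
def sentence_ngrams(sentence, max_n=None):
    """Return every contiguous word n-gram of a sentence, for n = 1 up to
    the sentence length (or up to max_n if given).

    Hypothetical helper: whitespace tokenization is an assumption here,
    not necessarily the tokenization used in the paper.
    """
    tokens = sentence.split()
    limit = len(tokens) if max_n is None else min(max_n, len(tokens))
    grams = []
    for n in range(1, limit + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


# Sentence-level context: n-grams of every size, up to the whole sentence.
all_grams = sentence_ngrams("I want a piece of cake")

# Fixed-size context for comparison: only the tri-grams.
trigrams_only = [g for g in all_grams if len(g) == 3]

print(len(all_grams), len(trigrams_only))
```

For this six-word sentence the sentence-level view yields 21 n-grams, of which only 4 are tri-grams, which gives a rough sense of how much more context is available to a model that is not restricted to a single n-gram size.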

Summary

Introduction

Poor spelling is a common challenge that people face in their day-to-day lives, and spell checkers are an essential tool for countering it. In the straightforward approach, a spell checker relies on a dictionary of words to detect errors and on a corpus-based probabilistic model to correct them. In this approach, a word that is not in the dictionary is considered misspelled; this type of error is called a non-word spelling error (Pirinen et al. 2014). To correct a detected error, the approach searches the lexicon for words that resemble the erroneous word. However, it cannot check whether a word is correct in its context. An error of this kind is called a real-word spelling error: a word that is found in the language lexicon but is not correct in its context.
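
A minimal sketch of dictionary-lookup detection shows why it misses real-word errors. The tiny lexicon, the English example sentences, and the function name `find_non_word_errors` are assumptions chosen for readability; the paper itself works with Amharic, Afaan Oromo, and Tigrigna.

```python
def find_non_word_errors(sentence, lexicon):
    """Flag every token that is absent from the lexicon.

    Hypothetical helper illustrating plain dictionary lookup; it looks at
    each word in isolation and never consults the surrounding context.
    """
    return [word for word in sentence.lower().split() if word not in lexicon]


# Toy lexicon (an assumption for illustration only).
LEXICON = {"i", "want", "a", "piece", "peace", "of", "cake"}

# Non-word error: "peice" is not in the lexicon, so it is detected.
non_word = find_non_word_errors("I want a peice of cake", LEXICON)

# Real-word error: "peace" is a valid word used in the wrong context,
# so dictionary lookup reports nothing.
real_word = find_non_word_errors("I want a peace of cake", LEXICON)

print(non_word)   # ['peice']
print(real_word)  # []
```

The second call returns an empty list even though the sentence is wrong, which is the gap that context-based detection is meant to close.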

Objectives

Results

Conclusion