Abstract

This study addresses the problem of context-sensitive spelling errors in English documents. There are two types of spelling errors in English: non-word spelling errors and context-sensitive spelling errors. Non-word spelling errors are comparatively simple to handle because they can be detected just by matching the words in a sentence against those in a dictionary; context-sensitive spelling errors are harder to correct because the relationship between the word to be corrected and its surrounding context must be known. Spelling errors act as noise in every field that uses text information, and preprocessing via document correction is necessary to minimize this problem. Context-sensitive spelling errors include homophone errors (which arise from the incorrect use of words that sound the same but are spelled differently), typographical errors (caused by striking an incorrect key on a keyboard), grammatical errors (which occur when the user does not know the correct grammatical rules), and cross-word-boundary errors (which arise from incorrect spacing between words). This study focuses on typographical errors. The context-sensitive spelling error problem is solved using deep learning rather than the existing statistical methods. The deep learning language model-based correction approach is divided into four parts: correction based on word embedding information, contextual embedding information, an auto-regressive (AR) language model, and an auto-encoding (AE) language model. In this study, the AE language model-based approach achieved the best correction performance, and we verified its performance through a detailed correction test.

Highlights

  • Spelling errors can be classified into two categories: nonword and context-sensitive spelling errors

  • This paper is structured as follows: Section 2 presents related research, Section 3 discusses the context-sensitive spelling errors considered in this study, Section 4 elucidates the correctional language model, Section 5 presents an analysis of the experiment and results, and Section 6 presents the conclusion and future research

  • In the context-sensitive spelling error correction process, it is difficult to obtain correct answers to spelling errors for all words, so we chose a deep learning language model based on unsupervised learning

Summary

INTRODUCTION

Spelling errors can be classified into two categories: nonword and context-sensitive spelling errors. The former occur when a word is written with a non-conventional spelling, such as “fron”; these errors are easy to detect by analyzing a word morphologically. The methods used to correct context-sensitive spelling errors can be separated into three categories: rule-based, statistical, and deep learning-based methods. We apply various recently developed deep learning language models to context-sensitive spelling error correction and suggest the direction of a correction experiment. This paper is structured as follows: Section 2 presents related research, Section 3 discusses the context-sensitive spelling errors considered in this study, Section 4 elucidates the correctional language model, Section 5 presents an analysis of the experiment and results, and Section 6 presents the conclusion and future research.
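
The difference between the two error categories can be illustrated with a minimal Python sketch. It is not the method of this study: the toy dictionary, the example sentences, and the function name `find_nonword_errors` are all assumptions made for illustration. A plain dictionary lookup flags the nonword “fron”, whereas the typographical error “form” (typed in place of “from”) is itself a valid word and passes the lookup, which is why the surrounding context is needed to detect it.

```python
# Hypothetical toy dictionary; a real system would use a full lexicon.
DICTIONARY = {"a", "letter", "from", "form", "my", "friend", "arrived"}


def find_nonword_errors(tokens, dictionary):
    """Return the tokens that do not appear in the dictionary."""
    return [tok for tok in tokens if tok.lower() not in dictionary]


# "fron" is not a word, so the lookup catches it.
nonword_sentence = "a letter fron my friend arrived".split()

# "form" is a valid word used in the wrong context, so the lookup misses it.
context_sentence = "a letter form my friend arrived".split()

print(find_nonword_errors(nonword_sentence, DICTIONARY))  # ['fron']
print(find_nonword_errors(context_sentence, DICTIONARY))  # []
```

The second call returns an empty list even though the sentence contains an error, so a corrector for such cases has to score each word against its context instead of relying on dictionary membership.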

RELATED RESEARCH
CONTEXT-SENSITIVE SPELLING CORRECTION TECHNIQUE
COMPARISON OF EMBEDDING-BASED CORRECTION PERFORMANCE
Findings
CONCLUSION