Abstract

Any mistake in writing a document causes its information to be conveyed incorrectly. These days, most documents are written on a computer, so spelling correction is needed to resolve writing mistakes. This design process discusses building a spelling corrector for Indonesian-language document text, taking a document's text as input and producing a .txt file as output. For the realization, 5 000 news articles were used as training data. The methods used include Finite State Automata (FSA), Levenshtein distance, and N-grams. The results of the design process are evaluated by perplexity, correction hit rate, and false positive rate. The unigram model yields the smallest perplexity, 1.14. The highest correction hit rate, 71.20 %, is achieved by both the bigram and trigram models, but the bigram is superior in average processing time at 01:21.23 min. The unigram, bigram, and trigram models share the same false positive rate, 4.15 %. Due to the disadvantages of the FSA method, a modification was made, raising the bigram's correction hit rate to 85.44 %.

Highlights

  • Language is one of the most important components in human life; it can be expressed as either spoken word or written text

  • The easiest way to calculate the probability is Maximum Likelihood Estimation (MLE): counts are taken from the corpus and normalized by division so that the result lies in the interval [0, 1]

  • There are many smoothing techniques for Maximum Likelihood Estimation (MLE), from the simplest to sophisticated ones such as Good-Turing discounting or back-off models. Some of these smoothing methods work by determining a distribution value over N-grams and using Bayesian inference to calculate the probability of the N-grams produced
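
The MLE estimate and the simplest smoothing technique mentioned in the paper, add-one (Laplace) smoothing, can be sketched for bigrams as follows. This is a minimal illustration on a toy token list standing in for the 5 000-article news corpus; the token strings are hypothetical, not from the actual training data.

```python
from collections import Counter

# Toy corpus standing in for the news-article training data (hypothetical tokens).
corpus = "saya makan nasi saya makan roti saya minum teh".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size, used by add-one smoothing

def mle(w1, w2):
    """MLE bigram probability: P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

def add_one(w1, w2):
    """Add-one (Laplace) smoothing: (count(w1 w2) + 1) / (count(w1) + V),
    so unseen bigrams get a small non-zero probability."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(mle("saya", "makan"))      # 2/3: "saya makan" occurs twice, "saya" three times
print(add_one("saya", "nasi"))   # non-zero even though "saya nasi" never occurs
```

Add-one smoothing redistributes probability mass to unseen bigrams, which is why the paper applies it before computing perplexity over held-out text.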


Summary

Introduction

Language is one of the most important components in human life; it can be expressed as either spoken word or written text, and it is an essential element of document writing. Any mistake in document writing causes the information to be conveyed incorrectly. Some mistakes happen because of human error: striking a letter on an adjacent keyboard key, errors due to mechanical failure, or a slip of the hand or finger. For that reason, spelling correction is needed to resolve writing mistakes. This research aims to realize spelling correction on Indonesian text documents, to overcome non-word errors. The FSA method is used to determine which letter caused the error in a word, and the order of word suggestions is determined by N-gram probability.
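
The Levenshtein distance named in the outline ranks candidate corrections by edit distance from the misspelled word. A minimal sketch of the standard dynamic-programming formulation (not the paper's exact implementation) is:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b, via dynamic programming
    over one rolling row of the edit-distance table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion from a
                           cur[j - 1] + 1,       # insertion into a
                           prev[j - 1] + cost))  # substitution (or match)
        prev = cur
    return prev[-1]

print(levenshtein("dokumen", "dokumne"))  # 2: a transposition costs two edits
```

In a spelling corrector, dictionary words with the smallest distance to the non-word token become the suggestion candidates, which the N-gram model then reorders by probability.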

Pre-processing
Levenshtein distance
N-gram
Add-one smoothing
Perplexity
Correction hit rate and false positive rate
System plan
Initiation
Design process
Interface design
Implementation system
Website testing
Correction hit rate and false positive rate testing
Findings
Conclusions