Abstract

Any mistake in writing a document causes its information to be conveyed incorrectly. These days, most documents are written on a computer, so spelling correction is needed to resolve writing mistakes. This design process discusses building a spelling corrector for Indonesian-language document text, taking a document's text as input and producing a .txt file as output. For the realization, 5 000 news articles were used as training data. The methods used include Finite State Automata (FSA), Levenshtein distance, and N-grams. The results of the design process are evaluated by perplexity, correction hit rate, and false positive rate. The unigram model yields the smallest perplexity, 1.14. The highest correction hit rate, 71.20 %, is achieved by both the bigram and trigram models, but the bigram is superior in average processing time at 01:21.23 min. The unigram, bigram, and trigram models share the same false positive rate, 4.15 %. Due to the disadvantages of the FSA method, a modification was made, raising the bigram's correction hit rate to 85.44 %.

Highlights

  • Language is one of the most important components in human life; it can be expressed as either spoken word or written text

  • The easiest way to calculate the probability is Maximum Likelihood Estimation (MLE): counts are taken from the corpus and normalized by division so that the result lies in the interval [0, 1]

  • There are many smoothing techniques for Maximum Likelihood Estimation (MLE), from the simplest to sophisticated ones such as Good-Turing discounting or back-off models. Some of these smoothing methods work by determining a distribution value over N-grams and using Bayesian inference to calculate the probability of the N-grams produced
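
The MLE estimate and the simplest smoothing technique mentioned in the paper, add-one (Laplace) smoothing, can be sketched for bigrams as follows. This is a minimal illustration on a toy token list standing in for the 5 000-article news corpus; the token strings are hypothetical, not from the actual training data.

```python
from collections import Counter

# Toy corpus standing in for the news-article training data (hypothetical tokens).
corpus = "saya makan nasi saya makan roti saya minum teh".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size, used by add-one smoothing

def mle(w1, w2):
    """MLE bigram probability: P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

def add_one(w1, w2):
    """Add-one (Laplace) smoothing: (count(w1 w2) + 1) / (count(w1) + V),
    so unseen bigrams get a small non-zero probability."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(mle("saya", "makan"))      # 2/3: "saya makan" occurs twice, "saya" three times
print(add_one("saya", "nasi"))   # non-zero even though "saya nasi" never occurs
```

Add-one smoothing redistributes probability mass to unseen bigrams, which is why the paper applies it before computing perplexity over held-out text.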


Summary

Introduction

Language is one of the most important components in human life; it can be expressed as either spoken word or written text, and it is an essential element of document writing. Any mistake in document writing causes the information to be conveyed incorrectly. Some mistakes happen because of human error: striking a letter on an adjacent keyboard key, errors due to mechanical failure, or a slip of the hand or finger. For that reason, spelling correction is needed to resolve writing mistakes. This research aims to realize spelling correction on Indonesian text documents, to overcome non-word errors. The FSA method is used to determine which letter caused the error in a word, and the order of word suggestions is determined by N-gram probability.
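
The Levenshtein distance named in the outline ranks candidate corrections by edit distance from the misspelled word. A minimal sketch of the standard dynamic-programming formulation (not the paper's exact implementation) is:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b, via dynamic programming
    over one rolling row of the edit-distance table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion from a
                           cur[j - 1] + 1,       # insertion into a
                           prev[j - 1] + cost))  # substitution (or match)
        prev = cur
    return prev[-1]

print(levenshtein("dokumen", "dokumne"))  # 2: a transposition costs two edits
```

In a spelling corrector, dictionary words with the smallest distance to the non-word token become the suggestion candidates, which the N-gram model then reorders by probability.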

Pre-processing
Levenshtein distance
N-gram
Add-one smoothing
Perplexity
Correction hit rate and false positive rate
System plan
Initiation
Design process
Interface design
Implementation system
Website testing
Correction hit rate and false positive rate testing
Findings
Conclusions