OCR correction for Indonesian historic newspapers using word repetition, stemmer and n-gram

D Purwantoro, H Akbar, A Hidayati, Sfenrianto Sfenrianto

doi:10.1088/1742-6596/1193/1/012032

Open Access

https://doi.org/10.1088/1742-6596/1193/1/012032

Abstract

Most digital archives of old newspapers in Indonesia are available only as microfilm image files without their textual content. Manual transcription is ineffective and tiring for publishers that hold large archives, so more automated transcription is required. As part of that effort, this paper proposes OCR error correction for news articles written in the old spelling, using databases built for the new spelling. Pattern-based spelling conversions are used to bridge the differences between the two spellings. Error detection relies on dictionary lookup, on the phenomenon of word repetition, and on the fact that most OCR errors are non-word errors. The dictionary is built from KBBI and enriched with derivative words, English words, and entity names from validated news archives. A confix-stripping stemmer is used to validate derivative words, while an English dictionary is used to validate English words in the news archive. Error correction is context-based: for each error, the surrounding trigram or bigram phrase is searched in Google, and the Google spelling suggestions are used as the correction. Experiments on 9 OCR texts of KOMPAS daily articles published between 1965 and 1966 yield an improvement ratio (the comparison of the error rate before and after correction) of 193.01%, compared with 61.47% for the Hunspell spell checker.
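The detection stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The mapping table assumes the well-known differences between the pre-1972 Indonesian spelling and the current one (dj, tj, nj, sj, ch, j); the abstract does not list the paper's actual patterns. The function names, the toy dictionary, and the `min_repeats` threshold are hypothetical, and treating a repeated out-of-dictionary token as probably valid is one reading of the "word repetition" idea. The stemmer and the Google-based correction step are not shown.

```python
import re
from collections import Counter

# Assumed old-to-new grapheme mappings (pre-1972 spelling to current spelling).
OLD_TO_NEW = {"dj": "j", "tj": "c", "nj": "ny", "sj": "sy", "ch": "kh", "j": "y"}
# Digraphs come before the single "j" so that "dj" is not read as "d" + "j".
_PATTERN = re.compile(r"dj|tj|nj|sj|ch|j")


def convert_spelling(word):
    """Convert an old-spelling word to new spelling in one left-to-right pass."""
    return _PATTERN.sub(lambda m: OLD_TO_NEW[m.group()], word.lower())


def detect_errors(tokens, dictionary, min_repeats=2):
    """Return (index, token) pairs that look like non-word OCR errors.

    A token is accepted if its converted form is in the dictionary, or if
    it occurs at least `min_repeats` times in the text (hypothetical rule:
    OCR noise rarely produces the same wrong string twice).
    """
    counts = Counter(t.lower() for t in tokens)
    suspects = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if convert_spelling(low) in dictionary:
            continue  # valid word once the spelling is modernized
        if counts[low] >= min_repeats:
            continue  # repeated unknown token, likely a name or valid word
        suspects.append((i, tok))
    return suspects


if __name__ == "__main__":
    # Toy new-spelling dictionary standing in for KBBI plus enrichment.
    toy_dictionary = {"presiden", "tiba", "di", "jakarta", "kemarin",
                      "dan", "yang", "datang", "bersama"}
    text = ("Presiden tiba di Djakarta kemarin dan Djakarta rnenjambut "
            "Sukarno jang datang bersama Sukarno")
    print(detect_errors(text.split(), toy_dictionary))
```

In this toy text, "Djakarta" and "jang" pass the lookup after conversion, the repeated name "Sukarno" is accepted by the repetition rule, and only the garbled token "rnenjambut" is flagged for correction.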

Journal: Journal of Physics: Conference Series
Publication Date: Apr 1, 2019
Citations: 1
License type: cc-by

Similar Papers

An In-depth Analysis of OCR Errors for Unconstrained Vietnamese Handwriting
Quoc-Dung Nguyen ... Ivan Zelinka
01 Jan 2020

An Analysis of the Performance of Named Entity Recognition over OCRed Documents
Ahmed Hamdi ... Mickael Coustaty
01 Jun 2019

Counting OCR errors in typeset text
Jonathan S Sandberg
30 Mar 1995

Language Translation
A F R Brown
Journal of the ACM | VOL. 5
01 Jan 1958
