Abstract

Most digital archives of old Indonesian newspapers are available only as microfilm image files, without their textual content. Manual transcription is slow and tiring for publishers with large archives, so more automated transcription is required. As part of that effort, this paper proposes OCR error correction for news articles written in the old spelling, using new-spelling databases. Pattern-based spelling conversions bridge the differences between the two spellings. Error detection uses dictionary lookup and exploits two observations: words tend to be repeated within a text, and most OCR errors are non-word errors. The dictionary is built from KBBI and enriched with derivative words, English words, and entity names from validated news archives. A confix-stripping stemmer is used to validate derivative words, while an English dictionary is used to validate English words in the news archive. Error correction is context-based: the word trigram or bigram containing each error is searched in Google, and Google's spelling suggestion is used as the correction. In experiments on nine OCR-output texts of KOMPAS daily articles published between 1965 and 1966, the improvement ratio, which compares the error rate before and after correction, is 193.01%, compared with 61.47% for the Hunspell spell checker.
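
The conversion and detection steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the mapping below assumes the well-known letter changes between the pre-1972 Indonesian spelling and the current one (dj to j, tj to c, nj to ny, sj to sy, ch to kh, j to y), the function names are hypothetical, and the tiny lexicon stands in for the KBBI-based dictionary.

```python
import re

# Assumed old-to-new spelling patterns (pre-1972 spelling to current spelling).
# The paper's actual pattern list is not given in the abstract.
OLD_TO_NEW = {"dj": "j", "tj": "c", "nj": "ny", "sj": "sy", "ch": "kh", "j": "y"}

# Digraphs are listed before the single letter "j" so that, at any position,
# "dj" is matched as a unit instead of "j" alone. A single pass also ensures
# that a "j" produced from "dj" is not converted again to "y".
PATTERN = re.compile(r"dj|tj|nj|sj|ch|j")


def convert_old_spelling(word):
    """Convert a lowercase old-spelling word to its new-spelling form."""
    return PATTERN.sub(lambda m: OLD_TO_NEW[m.group(0)], word.lower())


def detect_errors(tokens, lexicon):
    """Return tokens whose converted form is not in the lexicon (non-word errors)."""
    return [t for t in tokens if convert_old_spelling(t) not in lexicon]


# Hypothetical stand-in for the enriched KBBI dictionary.
lexicon = {"yang", "cinta", "jakarta"}
suspects = detect_errors(["jang", "tjinta", "tjlnta", "Djakarta"], lexicon)
print(suspects)
```

Here "tjlnta" (an OCR misreading of "tjinta") is the only token flagged, because its converted form "clnta" is absent from the lexicon. A flagged token would then be passed to the context-based correction step.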
