OCR Error Correction of an Inflectional Indian Language Using Morphological Parsing

Umapada Pal, Pulak K Kundu, B B Chaudhuri

doi:10.6688/jise.2000.16.6.6

Abstract

This paper presents an OCR (Optical Character Recognition) error detection and correction technique for Bangla, a highly inflectional Indian language that is the second-most popular language in India and the fifth-most popular language in the world. The technique is based on morphological parsing: using two separate lexicons of root words and suffixes, the candidate root-suffix pairs of each input string are detected, their grammatical agreement is tested, and the part (root or suffix) in which the error has occurred is noted. The correction is made to the corresponding erroneous part of the input string by means of a fast dictionary access technique. To do so, the information about the error patterns generated by the OCR system is examined, and some alternative strings are generated for an erroneous word. Among the alternative strings, those satisfying grammatical agreement between root and suffix are finally chosen as suggested words. The desired word appears in the list of suggested words generated by the system in 84.22% of cases.
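
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The lexicon entries, the grammatical categories, the confusion table, and the function names `parses`, `alternatives`, and `suggest` are all hypothetical, and romanized placeholder strings stand in for Bangla text. As a simplification, the sketch substitutes single confusable characters anywhere in the word rather than locating the erroneous root or suffix part first, and it uses plain dictionary lookups in place of the fast dictionary access technique.

```python
# Hypothetical toy lexicons (romanized placeholders, not data from the paper).
# Each root maps to a grammatical category.
ROOTS = {"kor": "verb", "bol": "verb", "boi": "noun"}
# Each suffix maps to the set of root categories it may attach to.
SUFFIXES = {"": {"verb", "noun"}, "che": {"verb"}, "lam": {"verb"}, "ta": {"noun"}}

# Hypothetical OCR error patterns: recognized character -> characters it may stand for.
CONFUSIONS = {"l": ["i"], "i": ["l"], "c": ["e"], "e": ["c"]}


def parses(word):
    """Return all (root, suffix) splits whose root and suffix agree grammatically."""
    result = []
    for cut in range(1, len(word) + 1):
        root, suffix = word[:cut], word[cut:]
        if root in ROOTS and suffix in SUFFIXES and ROOTS[root] in SUFFIXES[suffix]:
            result.append((root, suffix))
    return result


def alternatives(word):
    """Yield strings obtained by replacing one character with a confusable one."""
    for pos, ch in enumerate(word):
        for alt in CONFUSIONS.get(ch, []):
            yield word[:pos] + alt + word[pos + 1:]


def suggest(word):
    """Return the word itself if it parses, else the alternatives that parse."""
    if parses(word):
        return [word]
    return sorted({alt for alt in alternatives(word) if parses(alt)})


print(suggest("korehe"))  # ['korche']: 'e' read in place of 'c' in the suffix
print(suggest("bolta"))   # ['boita']: 'bol' + 'ta' fails agreement, 'boi' + 'ta' passes
```

The second call shows why the agreement test matters in this toy setup: `bol` and `ta` are both valid lexicon entries, but a verb root with a noun-only suffix is rejected, so the string is treated as erroneous and only the grammatically consistent alternative is suggested.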

Full Text