Improving IR Performance from OCRed Text using Cooccurrence

Kripabandhu Ghosh, Swapan Kumar Parui, Anirban Chakraborty

doi:10.1145/2701336.2701648

Abstract

OCR errors substantially degrade Information Retrieval (IR) performance. Much research has been reported on modelling and correcting OCR errors. However, all existing systems rely on language-dependent resources or training texts to study the nature of the errors. No research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose a novel algorithm for automatically detecting OCR errors and improving retrieval performance on the erroneous corpus. Our algorithm uses neither training data nor language-specific resources such as a thesaurus. It also assumes no knowledge of the language beyond the fact that words are delimited by blank spaces. We tested our algorithm on the OCRed Bangla FIRE collection offered in the RISOT 2012 track and obtained an improvement of about 9% over the OCRed baseline. However, the improvement is not statistically significant.

Full Text