Abstract

OCR errors substantially degrade retrieval performance. Prior work has addressed the modelling and correction of OCR errors, but most existing systems rely on language-dependent resources or training texts to study the nature of the errors. Little research has been reported on improving retrieval from erroneous text when no training data is available. We propose a novel algorithm for detecting OCR errors and improving retrieval performance on an E-Discovery corpus. Our contribution is two-fold: (1) identifying erroneous variants of query terms to improve retrieval performance, and (2) opening up the possibility of error modelling in an erroneous corpus for which no clean ground-truth text is available for comparison. The algorithm uses no training data and no language-specific resources such as a thesaurus. It also assumes no knowledge of the language other than that words are delimited by blank space. The proposed approach obtained statistically significant improvements in recall over state-of-the-art baselines.
