Abstract

A significant portion of currently available documents exist in the form of images, for instance, as scanned documents. Electronic documents produced by scanning and OCR software contain recognition errors. This paper uses an automatic approach to examine the selection and the effectiveness of searching techniques for possible erroneous terms for query expansion. The proposed method consists of two basic steps. In the first step, confused characters in erroneous words are located and editing operations are applied to create a collection of erroneous error-grams in the basic unit of the model. The second step uses query terms and error-grams to generate additional query terms, identify appropriate matching terms, and determine the degree of relevance of retrieved document images to the user's query, based on a vector space IR model. The proposed approach has been trained on 979 document images to construct about 2,822 error-grams and tested on 100 scanned Web pages, 200 advertisements and manuals, and 700 degraded images. The performance of our method is evaluated experimentally by determining retrieval effectiveness with respect to recall and precision. The results obtained show its effectiveness and indicate an improvement over standard methods such as vectorial systems without expanded query and 3-gram overlapping.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.