Abstract

This paper presents a new model that aims to improve the retrieval effectiveness of Arabic OCR-degraded text. The proposed model simulates Arabic OCR recognition errors using both word-based and character N-gram approaches. We then expand the user's search query with the expected OCR errors. The expanded query gives higher precision and recall than the original query when searching Arabic OCR-degraded text. The proposed model showed a significant increase in degraded-text retrieval effectiveness over previous models: the retrieval effectiveness of the new model is 93%, while the best published effectiveness was 84% for the word-based approach and 56% for the character-based approach.
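
As a rough illustration of the query expansion idea, the sketch below adds expected OCR-degraded forms to each query term. It is not the paper's implementation: the names `expand_term`, `expand_query`, `word_variants`, and `char_confusions` are hypothetical, the confusion tables are assumed to come from a training step, and single-character substitution stands in for the character N-gram model as a simplification.

```python
def expand_term(term, word_variants, char_confusions, max_variants=20):
    """Return the term followed by candidate OCR-degraded forms.

    word_variants:   hypothetical dict mapping a clean word to degraded
                     forms observed for it (word-based model).
    char_confusions: hypothetical dict mapping a character to characters
                     the OCR may output instead (simplified stand-in for
                     the character N-gram model).
    """
    variants = [term]
    # Word-based: reuse degraded forms already seen for this exact word.
    for variant in word_variants.get(term, []):
        if variant not in variants:
            variants.append(variant)
    # Character-based: substitute one character at a time.
    for i, ch in enumerate(term):
        for repl in char_confusions.get(ch, []):
            candidate = term[:i] + repl + term[i + 1:]
            if candidate not in variants:
                variants.append(candidate)
    # Cap the list so the expanded query stays manageable.
    return variants[:max_variants]


def expand_query(query, word_variants, char_confusions):
    """Expand each whitespace-separated term into a list of alternatives."""
    return [expand_term(t, word_variants, char_confusions)
            for t in query.split()]
```

Each inner list can then be treated as a group of alternatives for one query term (for example, combined with OR) when the expanded query is sent to the search engine.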

Highlights

  • Printed documents have never lost their importance. The number of documents available as character-coded text is increasing as a result of electronic publishing, but this is not the case for Arabic

  • The alternative degraded forms of a word are obtained by the word-based model, which aligns the Optical Character Recognition (OCR) degraded words with the clean text words. The alignment is done by calculating the edit distance between the two words. The edit distance is usually the number of operations needed to convert one word into the other; these operations are insertions, deletions, and substitutions

  • To test and verify that the OCR degradation search model we built increases the accuracy of degraded-text information retrieval, we have to run a group of tests. Evaluating retrieval effectiveness requires the availability of a test document
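
The edit-distance alignment mentioned in the highlights can be sketched as follows. This is a minimal example using the standard Levenshtein dynamic program with unit costs, which is an assumption: the source does not state the exact costs or alignment procedure used. The function name `edit_operations` is hypothetical.

```python
def edit_operations(clean, degraded):
    """Align a clean word with its OCR-degraded form.

    Returns (distance, ops), where ops lists the insertions, deletions,
    and substitutions that turn `clean` into `degraded`.
    """
    m, n = len(clean), len(degraded)
    # dist[i][j] = edit distance between clean[:i] and degraded[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if clean[i - 1] == degraded[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution

    # Walk back through the table to recover the operations.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if clean[i - 1] == degraded[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("substitute", clean[i - 1], degraded[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", clean[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", degraded[j - 1]))
            j -= 1
    ops.reverse()
    return dist[m][n], ops
```

For instance, aligning the clean word "كتاب" with a made-up degraded form "كناب" yields one substitution. Counting such operations over many aligned word pairs is one way a table of likely OCR errors could be collected.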

Summary

INTRODUCTION

Printed documents have never lost their importance. The number of documents available as character-coded text is increasing as a result of electronic publishing, but this is not the case for Arabic. Searching character-coded documents is the easiest way to search, so automating the process means generating character-coded representations of the documents. We can generate these representations by rekeying the documents' text or by creating metadata about the documents, such as titles, summaries, or keywords. These approaches would be labor intensive and impractical for large numbers of documents. Alternatively, we can produce the character-coded representation by scanning the documents and using Optical Character Recognition (OCR), an automated process that converts document images into character-coded text. The OCR process is inexpensive and well suited for large document collections.

Orthographic Properties of Arabic
ElGhazaly OCR-Degraded Synthesizing Model
Darwish OCR-Degraded Synthesizing Model
The Accuracy of the Models
The OCR-Degradation Synthesizing Model
The Word-Based OCR Degradation Synthesizing Model
Character-Based Model
Orthographic Query Expansion
Training the OCR Degradation Synthesizing Model
Testing the Orthographic Query Expansion Model
Testing the Orthographic Query Expansion Model Accuracy
Findings
CONCLUSION
