A Word & Character N-Gram based Arabic OCR Error Simulation model

Mostafa Ezzat, Mervat Gheith, Tarek Ahmed Elghazaly

doi:10.24297/ijct.v12i8.2999

Open Access

https://doi.org/10.24297/ijct.v12i8.2999

Abstract

This paper presents a new model that aims to improve the retrieval effectiveness of Arabic OCR-degraded text. The proposed model simulates Arabic OCR recognition mistakes using both word-based and character N-gram approaches. The user's search query is then expanded with the expected OCR errors. The expanded query gives higher precision and recall than the original query when searching Arabic OCR-degraded text. The proposed model showed a significant increase in degraded-text retrieval effectiveness over previous models: the retrieval effectiveness of the new model is 93%, while the best published effectiveness was 84% for the word-based approach and 56% for the character-based approach.
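The query expansion step described in the abstract can be illustrated with a minimal sketch. It assumes a small confusion table that maps a character N-gram to the strings an OCR engine might output in its place. The table contents, the function names `expand_term` and `build_or_query`, and the `OR` query syntax are illustrative assumptions, not the paper's trained model or its search engine.

```python
from typing import Dict, List


def expand_term(term: str, confusions: Dict[str, List[str]],
                max_variants: int = 10) -> List[str]:
    """Return the term plus variants with one simulated OCR error each.

    `confusions` is a hypothetical table: character N-gram -> possible
    OCR outputs. Only unigrams and bigrams are looked up here.
    """
    variants = [term]
    for i in range(len(term)):
        for n in (1, 2):
            gram = term[i:i + n]
            if len(gram) < n:
                continue  # ran past the end of the term
            for replacement in confusions.get(gram, []):
                variant = term[:i] + replacement + term[i + n:]
                if variant not in variants:
                    variants.append(variant)
    return variants[:max_variants]


def build_or_query(term: str, confusions: Dict[str, List[str]]) -> str:
    """Join the original term and its simulated variants with OR."""
    return " OR ".join(expand_term(term, confusions))


# Illustrative confusion: the letter taa misread as noon (dots differ).
example_confusions = {"ت": ["ن"]}
expanded = expand_term("كتاب", example_confusions)
```

Here `expanded` holds the original word and one degraded variant, so a search over OCR output can match either form. A trained model would presumably also weight each variant by how likely the error is, which this sketch leaves out.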

Highlights

Printed documents have never lost their importance. The number of documents available as character-coded text is increasing as a result of electronic publishing, but this is not the case for Arabic
The word-based model produces the alternative degraded shapes of a word. It aligns the Optical Character Recognition (OCR) degraded words with the clean-text words by calculating the edit distance between each pair. The edit distance is usually the number of operations needed to convert one word into the other; these operations are insertions, deletions, and substitutions
To test and verify that the OCR degradation search model we built increases the accuracy of degraded-text information retrieval, we have to run a group of tests. Evaluating retrieval effectiveness requires the availability of a test document
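The edit-distance alignment mentioned in the highlights can be sketched as follows. This is the standard Levenshtein distance with unit costs, not necessarily the exact cost scheme the authors use, and `closest_clean_word` is a hypothetical helper that pairs a degraded word with its nearest clean-text candidate.

```python
from typing import List


def edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `source` into `target` (unit cost per operation)."""
    prev = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_clean_word(degraded: str, clean_words: List[str]) -> str:
    """Hypothetical alignment step: pick the clean word with the
    smallest edit distance to the OCR-degraded word."""
    return min(clean_words, key=lambda word: edit_distance(degraded, word))


# One substitution separates the degraded form from the clean word.
distance = edit_distance("كناب", "كتاب")
aligned = closest_clean_word("كناب", ["كتاب", "باب"])
```

Python strings are sequences of Unicode code points, so the Arabic letters are compared one at a time without extra handling.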

Summary

INTRODUCTION

Printed documents have never lost their importance. The number of documents available as character-coded text is increasing as a result of electronic publishing, but this is not the case for Arabic. Searching character-coded documents is the easiest approach, so automating retrieval means generating character-coded representations of the documents. These representations can be generated by rekeying the documents' text or by creating metadata about the documents, such as titles, summaries, or keywords, but these approaches are labor intensive and impractical for large numbers of documents. Alternatively, the character-coded representation can be produced by scanning the documents and applying Optical Character Recognition (OCR), an automated process that converts document images into character-coded text. The OCR process is inexpensive and well suited for large document collections.

Orthographic Properties of Arabic

ElGhazaly OCR-Degraded Synthesizing Model

Darwish OCR-Degraded Synthesizing Model

The Accuracy of the Models

The OCR-Degradation Synthesizing Model

The Word based OCR Degradation Synthesizing Model

Character based Model

Orthographic Query Expansion

Training the OCR Degradation Synthesizing Model

Testing the Orthographic Query Expansion Model

Testing the Orthographic Query Expansion Model Accuracy

Findings

CONCLUSION

Journal: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY	Publication Date: Feb 22, 2014
Citations: 1	License type: CC BY 4.0
