Abstract

Improving the accuracy of Arabic text recognition in imagery requires a large, modern dataset, since data is the fuel for many modern machine learning models. This paper proposes a new dataset called QTID (Quran Text Image Dataset), the first Arabic dataset that includes Arabic marks. It consists of 309,720 different 192x64 annotated Arabic word images, taken from the Holy Quran, that contain 2,494,428 characters in total. These finely annotated images were randomly divided into training, validation, and test sets of 90%, 5%, and 5%, respectively. To analyze QTID, various dataset statistics are presented. Experimental evaluation shows that the current best Arabic text recognition engines, such as Tesseract and ABBYY FineReader, do not perform well on word images from the proposed dataset.
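As a rough illustration of the 90%/5%/5% random split, the subset sizes for 309,720 images can be worked out with a short sketch. The function name `split_indices`, the fixed seed, and the use of integer percentages are assumptions made for this example; the abstract states only that the division was random.

```python
import random


def split_indices(n_items, percents=(90, 5, 5), seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test lists.

    Hypothetical helper: the seed and the integer-percent rounding are
    illustrative choices, not details taken from the paper.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_items * percents[0] // 100
    n_val = n_items * percents[1] // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = split_indices(309_720)
print(len(train), len(val), len(test))  # 278748 15486 15486
```

Under these assumptions, the training set holds 278,748 word images and the validation and test sets hold 15,486 each.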

Highlights

  • Optical character recognition (OCR) is the process of converting an image that contains text into machine-readable text

  • This paper presents a new Arabic image dataset that can help machine learning models master Arabic text recognition

  • The dataset is generated from the Holy Quran, which contains handwritten Arabic text, including Arabic language marks

Summary

INTRODUCTION

Optical character recognition (OCR) is the process of converting an image that contains text into machine-readable text. Although OCR is an old problem, Arabic text recognition is still under development, especially for handwritten text [1], [2]. There are many reasons for this, including special characteristics of the Arabic language. Some of these characteristics are: a character may have up to four different shapes depending on its position in a word; a character's width and height might change relative to its location within a word; the Arabic language is written from right to left; and some characters have the same shape except for the presence or location of dots above or below that shape. The proposed dataset is generated from the Holy Quran, which contains handwritten Arabic text, including Arabic language marks. As data is the fuel for many machine learning models, creating a large, modern dataset can help data-hungry machine learning models master Arabic text recognition. The dataset can also be used as a benchmark to measure the current state of Arabic text recognition.
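The "up to four shapes" property can be seen directly in Unicode, which encodes the isolated, final, initial, and medial presentation forms of each Arabic letter as separate code points. The sketch below uses Python's standard `unicodedata` module and the letter beh as an example; the choice of letter and the helper name `base_letter` are illustrative assumptions and are not part of the paper.

```python
import unicodedata

# The four positional presentation forms of the Arabic letter beh (U+0628).
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def base_letter(presentation_form):
    """Return (position tag, base letter) for an Arabic presentation form.

    The Unicode decomposition of such a form looks like "<isolated> 0628".
    """
    tag, codepoint = unicodedata.decomposition(presentation_form).split()
    return tag.strip("<>"), chr(int(codepoint, 16))


for position, glyph in BEH_FORMS.items():
    tag, base = base_letter(glyph)
    # All four glyphs map back to the same underlying letter.
    print(position, "|", unicodedata.name(glyph), "->", unicodedata.name(base))
```

All four glyphs decompose to the same base letter, which is why a recognizer has to treat visually different shapes as one character. The dot-based ambiguity mentioned above is the mirror image of this: beh (U+0628), teh (U+062A), and theh (U+062B) share one base shape and differ only in the number and placement of dots.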

RELATED WORK
Image Generation
HDF5 files creation
DATASET STATISTICS
EXPERIMENTAL EVALUATION
Findings
DISCUSSION AND CONCLUSIONS