Abstract
Improving the accuracy of Arabic text recognition in imagery requires a large, modern dataset, as data is the fuel for many modern machine learning models. This paper proposes a new dataset, called QTID (Quran Text Image Dataset), the first Arabic dataset that includes Arabic marks. It consists of 309,720 different 192x64 annotated Arabic word images, taken from the Holy Quran, that contain 2,494,428 characters in total. These finely annotated images were randomly divided into training, validation, and testing sets of 90%, 5%, and 5%, respectively. To analyze QTID, various dataset statistics are presented. Experimental evaluation shows that the best current Arabic text recognition engines, such as Tesseract and ABBYY FineReader, do not perform well on word images from the proposed dataset.
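The random 90%/5%/5% division described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `split_indices`, the fixed seed, and the use of integer indices in place of image files are all assumptions made for the example.

```python
import random


def split_indices(n_items, train_pct=90, val_pct=5, seed=0):
    """Shuffle item indices and cut them into train/validation/test lists.

    Hypothetical helper: the paper only states that the split is random
    and uses 90%/5%/5% proportions; the seed and rounding rule are assumed.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    # Integer arithmetic avoids floating-point rounding on the set sizes.
    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# 309,720 is the number of word images reported for QTID.
train, val, test = split_indices(309_720)
print(len(train), len(val), len(test))  # 278748 15486 15486
```

With the reported dataset size, these proportions give 278,748 training images and 15,486 images each for validation and testing.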
Highlights
Optical character recognition (OCR) is the process of converting an image that contains text into machine-readable text
This paper presents a new Arabic image dataset that can help machine learning models master Arabic text recognition
The dataset is generated from the Holy Quran, which contains handwritten Arabic text including Arabic language marks
Summary
Optical character recognition (OCR) is the process of converting an image that contains text into machine-readable text. Although OCR is an old problem, Arabic text recognition is still under development, especially for handwritten text [1], [2], for many reasons, including special characteristics of the Arabic language. Some of these characteristics are: a character may have up to four different shapes; a character's width and height may change relative to its location within a word; Arabic is written from right to left; and some characters have the same shape except for the presence or location of dots above or below that shape. The dataset is generated from the Holy Quran, which contains handwritten Arabic text including Arabic language marks. Because data is the fuel for many machine learning models, creating a large, modern dataset can help data-hungry machine learning models master Arabic text recognition. The dataset can also be used as a benchmark to measure the current state of Arabic text recognition.
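The point that an Arabic character may take up to four shapes can be illustrated with the Unicode Arabic Presentation Forms, which encode isolated, final, initial, and medial glyph variants as separate code points. The sketch below uses only Python's standard `unicodedata` module. The helper `contextual_forms` is a hypothetical name introduced for this example, and Unicode presentation forms are an encoding-level illustration rather than part of the dataset or the paper's method.

```python
import unicodedata

# The four positional variants named in the Unicode character database.
FORMS = ("ISOLATED", "FINAL", "INITIAL", "MEDIAL")


def contextual_forms(letter_name):
    """Return the presentation-form characters that exist for a letter.

    Hypothetical helper: looks up names such as
    'ARABIC LETTER BEH INITIAL FORM' and skips variants Unicode does not define.
    """
    forms = {}
    for form in FORMS:
        try:
            forms[form] = unicodedata.lookup(f"{letter_name} {form} FORM")
        except KeyError:
            pass  # this letter has no such positional variant
    return forms


beh = contextual_forms("ARABIC LETTER BEH")    # joins on both sides: four shapes
alef = contextual_forms("ARABIC LETTER ALEF")  # joins on one side only: two shapes

for form, ch in beh.items():
    # Compatibility normalization maps every variant back to the base letter.
    base = unicodedata.normalize("NFKC", ch)
    print(form, f"U+{ord(ch):04X}", "->", f"U+{ord(base):04X}")
```

A recognizer therefore has to map several visually distinct glyphs to one underlying character, which is part of what makes Arabic OCR harder than OCR for scripts whose letters keep a single shape.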
More From: International Journal of Advanced Computer Science and Applications