Abstract

Deep learning-based Optical Character Recognition (OCR) is an active area of research in which deep neural networks are trained on data to extract the text in an image. Although many advances are being made in this area in general, the Arabic OCR domain notably lacks a dataset of ancient manuscripts. Here, we fill this gap by providing both the images and the textual ground truth for a collection of ancient Arabic manuscripts. This rare dataset was collected from the central library of the Islamic University of Madinah, and it encompasses rich text spanning different geographies and centuries. Specifically, the dataset comprises eight ancient books with a total of forty pages, each provided as an image together with its text as transcribed by experts. The dataset holds significant value because such data is not publicly available; it supports researchers and practitioners in developing, augmenting, validating, and testing deep learning models, and in improving their generalization, for both Arabic OCR and Arabic text correction.
