QTID: Quran Text Image Dataset

Mahmoud Badry, Hanaa Bayomi, Hussien Oakasha, Hesham Hassan

doi:10.14569/ijacsa.2018.090351

Abstract

Improving the accuracy of Arabic text recognition in imagery requires a large, modern dataset, since data is the fuel for many modern machine learning models. This paper proposes a new dataset called QTID (Quran Text Image Dataset), the first Arabic dataset that includes Arabic marks. It consists of 309,720 different 192x64 annotated Arabic word images, taken from the Holy Quran, that contain 2,494,428 characters in total. These finely annotated images were randomly divided into 90%, 5%, and 5% sets for training, validation, and testing, respectively. To analyze QTID, several dataset statistics are presented. Experimental evaluation shows that the current best Arabic text recognition engines, such as Tesseract and ABBYY FineReader, do not work well with word images from the proposed dataset.
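As a rough illustration of the 90%/5%/5% random split described in the abstract, the sketch below shuffles image indices and cuts them into three parts. The function name `split_indices`, the fixed seed, and the use of integer percentages are assumptions made for this example; the paper does not state how the authors implemented the split.

```python
import random

def split_indices(n_items, train_pct=90, val_pct=5, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test lists.

    Hypothetical helper: the percentages follow the abstract, but the seed
    and the rounding rule (integer floor) are assumptions of this sketch.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# 309,720 is the number of word images reported in the abstract.
train, val, test = split_indices(309_720)
print(len(train), len(val), len(test))
```

With the reported image count, this rule gives 278,748 training images and 15,486 images each for validation and testing.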

Highlights

Optical character recognition (OCR) is the process of converting an image that contains text into machine-readable text.
This paper presents a new Arabic image dataset that can help machine learning models master Arabic text recognition.
The dataset is generated from the Holy Quran, which contains handwritten Arabic text including Arabic language marks.

Summary

INTRODUCTION

Optical character recognition (OCR) is the process of converting an image that contains text into machine-readable text. Although OCR is an old problem, Arabic text recognition is still under development, especially for handwritten text [1], [2], for many reasons, including special characteristics of the Arabic language. Some of these characteristics are: a character may have up to four different shapes, a character's width and height might change relative to its location within a word, the Arabic language is written from right to left, and some characters have the same shape except for the presence or location of dots above or below that shape. The dataset is generated from the Holy Quran, which contains handwritten Arabic text including Arabic language marks. Because data is the fuel for many machine learning models, creating a large, modern dataset can help data-hungry machine learning models master Arabic text recognition. It can also be used as a benchmark to measure the current state of recognizing Arabic text.
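The point about a character having up to four shapes can be seen in Unicode itself, which encodes isolated, final, initial, and medial presentation forms for Arabic letters. The sketch below is only an illustration using Python's standard `unicodedata` module and the letter BEH; the helper name `describe_forms` is hypothetical and this is not part of the QTID pipeline.

```python
import unicodedata

# The four Unicode presentation forms of the Arabic letter BEH (U+0628).
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def describe_forms(forms):
    """Return (Unicode name, base letter) pairs for presentation-form characters.

    NFKC normalization folds each positional form back to its base letter.
    """
    return [(unicodedata.name(ch), unicodedata.normalize("NFKC", ch)) for ch in forms]

for name, base in describe_forms(BEH_FORMS):
    print(f"{name} -> U+{ord(base):04X}")
```

All four forms normalize to the same base letter, which is why a recognizer has to map several visually distinct glyphs to one character label.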

RELATED WORK

Image Generation

HDF5 files creation

DATASET STATISTICS

EXPERIMENTAL EVALUATION

Findings

DISCUSSION AND CONCLUSIONS

Journal: International Journal of Advanced Computer Science and Applications
Publication Date: Jan 1, 2018
License type: cc-by

