This paper presents a novel Arabic dataset that accounts for the characteristics of the Arabic language and fills gaps not covered by existing datasets. Conventional datasets treat Arabic in much the same way as Latin-script languages. They either delete diacritics and supplement marks, considering them defects, or keep them without considering their actual meaning. Yet more than half of all Arabic characters carry diacritics above or below them. To bridge these gaps, this work presents the Detailed Arabic Dataset (DAD). The additional marks included in this dataset are the single dot, two dots (-), three dots (^), Hamza, and two supplement marks: the bar for Tah or Zah and the complement bar for Kaf. A dedicated application, OFMArabicDatasetBuilder, was built to generate the dataset for online Arabic recognition and writer identification. In total, the ground truth contains 93064 entries based on sub-words and letter parts (not on words or lines, as in other datasets). This dataset gives researchers a strong tool for online Arabic text recognition, especially in the segmentation phase, and for writer identification. This paper also presents benchmarking results obtained with k-nearest neighbours classification on DAD.