MOJ-DB: A new database of Arabic historical handwriting and a novel approach for subwords extraction

Abdelhay Zoizou, Arsalane Zarghili, Ilham Chaker

doi:10.1016/j.patrec.2022.04.040

Abstract

The digitization of historical documents is vital to preserving their content and the historical memory of nations. However, the results of historical Arabic handwritten text recognition and word spotting are still unsatisfactory. The increasing research efforts of the last few years are still not sufficient, since handwriting recognition systems rely heavily on robust databases. In this paper, we present a new contour-based method of subword extraction from Arabic historical documents and a novel database of Arabic historical subwords, MOJ-DB. The proposed subword extraction method includes a process for resolving touching components. It showed high performance and consistency when tested on different databases and compared with other methods from the literature. The proposed database contains 560,000 subwords distributed over 5,600 different classes. It was built using 64 pages extracted from 10 books written in the 16th and 17th centuries. The MOJ-DB database is divided into three sets: 70%, 20%, and 10% for training, testing, and validation, respectively. Ground truth is established iteratively to keep errors to a minimum. It includes information about each subword, such as its source book and page. We conducted several experiments to verify the robustness of the proposed database as well as the validity of the segmentation process. The database is freely available to the public research community. It can be used for word and subword recognition, word spotting, subword extraction, and database construction.

Full Text