Arabic manuscripts are invaluable for historical research, but their physical preservation needs often make them difficult to access. Digitization coupled with handwriting recognition can make these resources readily available. However, accurate text recognition hinges on effective segmentation into lines and words. While line segmentation has received significant attention, word detection in historical Arabic script remains under-explored because annotated datasets are scarce. This paper addresses this gap by introducing a novel word-level annotated database designed specifically for historical Arabic manuscripts. We further present two neural network architectures: a Transformer-based model that relies on an Arabic language model, and a robust CNN-BLSTM with skip connections that help preserve the spatial information crucial for recognition. Validation on a dataset of 20 pages of historical manuscripts demonstrates the effectiveness of the proposed models, which achieve a Character Error Rate (CER) of 4.8% with data augmentation, surpassing the state of the art.