Learning-free, divide and conquer text-line extraction algorithm for printed Arabic text with diacritics

Aziz Qaroush, Abdalkarim Awad, Abualsoud Hanani, Khader Mohammad, Basam Jaber, Ala Hasheesh

doi:10.1016/j.jksuci.2022.04.021

Open Access

Abstract

The extraction of text-lines from document images is a critical step in optical character recognition, and it is still considered an open problem in document analysis. Numerous font variations, diacritics, and overlapping or touching text-lines pose a challenge to algorithms designed for machine-printed text. In this paper, we present a simple and robust text-line extraction algorithm for printed Arabic text. The presented method is divided into two stages: preprocessing and text-line extraction. It extracts text-lines efficiently, even at small font sizes, by utilizing baselines, projection profiles, and a top-down divide and conquer technique. The presented method is fast and efficient when dealing with non-uniform inter-line spacing and the text-line overlapping problem. A set of experiments was conducted on the collected dataset. The experiments revealed that the proposed method outperforms two baseline approaches, with an average error rate of 3% on Arabic text without diacritics and 11% on Arabic text with diacritics. Furthermore, the experiments demonstrate that the proposed algorithm is computationally inexpensive, with an average running time of 0.087 s per document image.
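The general idea of combining a horizontal projection profile with a top-down divide and conquer split can be illustrated with a minimal sketch. This is not the authors' algorithm: it ignores baselines, diacritics, and overlapping text-lines, and it simply assumes that text-lines are separated by at least one fully blank row. The names `row_profile` and `extract_lines`, and the representation of the binarized image as a list of rows of 0/1 values, are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed representation: a binarized image as rows of 0 (background) / 1 (ink).
Image = List[List[int]]


def row_profile(img: Image, top: int, bottom: int) -> List[int]:
    """Horizontal projection profile: ink-pixel count for each row in [top, bottom)."""
    return [sum(img[r]) for r in range(top, bottom)]


def extract_lines(img: Image, top: int = 0, bottom: int = None) -> List[Tuple[int, int]]:
    """Recursively split the row range [top, bottom) at its widest blank gap.

    Returns a list of (start_row, end_row) pairs, end exclusive, one per
    block of consecutive inked rows. Hypothetical helper, not the paper's method.
    """
    if bottom is None:
        bottom = len(img)
    profile = row_profile(img, top, bottom)

    # Ignore blank rows at the top and bottom of the current block.
    inked = [i for i, count in enumerate(profile) if count > 0]
    if not inked:
        return []
    first, last = inked[0], inked[-1]

    # Find the longest run of blank rows strictly inside the block.
    best_start, best_len = -1, 0
    run_start = None
    for i in range(first, last + 1):
        if profile[i] == 0:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None

    # No internal gap: the block is treated as a single text-line.
    if best_len == 0:
        return [(top + first, top + last + 1)]

    # Divide at the widest gap and conquer each half independently.
    cut = top + best_start
    return extract_lines(img, top, cut) + extract_lines(img, cut + best_len, bottom)
```

Splitting at the widest gap first means that non-uniform inter-line spacing does not require a global threshold in this sketch. Handling diacritics and touching lines, as the abstract describes, would need additional logic that is not shown here.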

Full Text