Abstract

Extracting text lines from document images is a critical step in optical character recognition, and it is still considered an open problem in document analysis. Numerous font variations, diacritics, and overlapping or touching text lines pose a challenge to algorithms designed for machine-printed text. In this paper, we present a simple and robust text-line extraction algorithm for printed Arabic text. The method has two stages: preprocessing and text-line extraction. It extracts text lines efficiently, even at small font sizes, by using baselines, projection profiles, and a top-down divide-and-conquer technique. It is fast and efficient in the presence of non-uniform inter-line spacing and overlapping text lines. A set of experiments was conducted on a collected dataset. The experiments show that the proposed method outperforms two baseline approaches, with an average error rate of 3% on Arabic text without diacritics and 11% on Arabic text with diacritics. They also show that the proposed algorithm is computationally inexpensive, with an average running time of 0.087 s per document image.
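
To make the role of projection profiles concrete, here is a minimal sketch in Python with NumPy. It is not the algorithm of the paper: the function names (`horizontal_profile`, `split_lines`, `baseline_row`), the `min_ink` threshold, and the rule that the baseline is the densest row of a line band are all assumptions chosen for illustration. It assumes an already binarized, deskewed image and does not cover the top-down divide-and-conquer step or the handling of overlapping lines.

```python
import numpy as np

def horizontal_profile(binary):
    """Count ink pixels in each row of a binary image (1 = ink, 0 = background)."""
    return binary.sum(axis=1)

def split_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per run of inked rows.

    min_ink is a hypothetical threshold: rows with fewer ink pixels
    are treated as inter-line gaps.
    """
    has_ink = horizontal_profile(binary) >= min_ink
    lines = []
    start = None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                      # a new text line begins
        elif not ink and start is not None:
            lines.append((start, y))       # the current text line ends
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(has_ink)))
    return lines

def baseline_row(binary, top, bottom):
    """Assumed baseline estimate: the row with the most ink inside the band."""
    profile = horizontal_profile(binary[top:bottom])
    return top + int(np.argmax(profile))
```

Calling `split_lines` on a page array yields one row range per candidate text line, and `baseline_row` can then be applied to each range. A plain profile split like this fails when diacritics or descenders bridge the gap between neighboring lines, which is the situation the abstract says the proposed method is designed to handle.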
