An approach to analysis of Arabic text documents into text lines, words, and characters

Hakim A Abdo, Ramesh Manza, Shobha Bawiskar, Ahmed Abdu

doi:10.11591/ijeecs.v26.i2.pp754-763

Abstract

Extracting text lines from a document image, segmenting each line into isolated words, and segmenting those words into individual characters are among the most critical steps in developing OCR systems and in converting a document into a searchable electronic representation. This paper presents a new approach to analyzing Arabic text documents. The proposed approach consists of four steps: preprocessing, text line segmentation, word segmentation, and character segmentation. The horizontal projection method is used to detect and extract text lines from the preprocessed document image. In the word segmentation step, a space threshold is computed to classify each space between connected components in a text line as either a within-word space or a between-words space, so that the line can be segmented into isolated words. Finally, a thinning method is applied to find the skeleton of each segmented word, and the geometric characteristics of the characters are analyzed to detect ligatures and characters. The proposed approach was tested and evaluated on a set of 115 text images, comprising images from the KFUPM Handwritten Arabic TexT (KHATT) database and some images produced by the authors. The experimental results are encouraging, with success rates of 98.6% for line segmentation, 96% for word segmentation, and 87.1% for character segmentation.
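The line and word segmentation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a binarized image stored as a NumPy array in which nonzero pixels are ink, it approximates connected components by runs of inked columns, and it uses the mean gap width as the space threshold because the abstract does not state how the threshold is computed. The function names `find_runs`, `segment_lines`, and `segment_words` are hypothetical.

```python
import numpy as np


def find_runs(mask):
    """Return (start, end) pairs for each run of True values; end is exclusive."""
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(changes[0::2].tolist(), changes[1::2].tolist()))


def segment_lines(binary):
    """Horizontal projection: rows that contain ink are grouped into text lines."""
    row_profile = (binary > 0).sum(axis=1)
    return find_runs(row_profile > 0)


def segment_words(line, gap_threshold=None):
    """Split one text line into words by thresholding the gaps between ink runs.

    Column runs stand in for connected components (a simplification).
    Returned column ranges are ordered left to right, so for Arabic the
    reading order is the reverse of the list order.
    """
    col_profile = (line > 0).sum(axis=0)
    ink_runs = find_runs(col_profile > 0)
    if not ink_runs:
        return []
    # Width of the blank gap between each pair of neighbouring ink runs.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ink_runs, ink_runs[1:])]
    if gap_threshold is None:
        # Assumption: mean gap width as the threshold; the paper's rule may differ.
        gap_threshold = float(np.mean(gaps)) if gaps else 0.0
    words = []
    start, end = ink_runs[0]
    for gap, (run_start, run_end) in zip(gaps, ink_runs[1:]):
        if gap > gap_threshold:
            # Between-words space: close the current word and start a new one.
            words.append((start, end))
            start = run_start
        # Within-word space (or new word): extend to the end of this run.
        end = run_end
    words.append((start, end))
    return words
```

A caller would first apply `segment_lines` to the preprocessed page and then pass each row slice, for example `binary[top:bottom]`, to `segment_words`. Character segmentation by thinning and geometric analysis is not covered by this sketch.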

Full Text