Abstract

Extracting text lines from a document image, segmenting each line into isolated words, and segmenting those words into individual characters are among the most critical steps in developing OCR systems and in turning a document into a searchable electronic representation. This paper presents a new approach to analyzing Arabic text documents. The proposed approach consists of four steps: preprocessing, text line segmentation, word segmentation, and character segmentation. The horizontal projection method is used to detect and extract text lines from the preprocessed document image. In the word segmentation step, a space threshold is computed to classify each space between connected components in a text line as either a within-word space or a between-words space, so that the line can be segmented into isolated words. Finally, a thinning method is applied to find the skeleton of each segmented word, and the geometric characteristics of the characters are analyzed to detect ligatures and characters. The proposed approach was tested and evaluated on a set of 115 text images, comprising images from the KFUPM Handwritten Arabic TexT (KHATT) database and images produced by the authors. The experimental results are encouraging, with success rates of 98.6% for line segmentation, 96% for word segmentation, and 87.1% for character segmentation.
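
To make the line and word segmentation steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binarized image stored as a list of rows (1 for ink, 0 for background). The function names are hypothetical, runs of inked columns stand in for true connected components, and the space threshold is taken to be the mean gap width, since the abstract does not state how the threshold is computed.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def find_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is non-zero."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(img: Image) -> List[Tuple[int, int]]:
    """Horizontal projection: count ink per row, split at empty rows."""
    profile = [sum(row) for row in img]
    return find_runs(profile)


def segment_words(line_img: Image) -> List[Tuple[int, int]]:
    """Group inked column runs into words using a gap threshold."""
    # Column-wise ink counts; runs of inked columns approximate
    # connected components (ASSUMPTION, a simplification).
    columns = [sum(col) for col in zip(*line_img)]
    comps = find_runs(columns)
    if len(comps) < 2:
        return comps
    gaps = [comps[i + 1][0] - comps[i][1] for i in range(len(comps) - 1)]
    # ASSUMPTION: the threshold is the mean gap width; the paper's
    # actual rule is not given in the abstract.
    threshold = sum(gaps) / len(gaps)
    words: List[Tuple[int, int]] = []
    start, end = comps[0]
    for gap, (s, e) in zip(gaps, comps[1:]):
        if gap > threshold:  # between-words space: close the current word
            words.append((start, end))
            start = s
        end = e  # within-word space: extend the current word
    words.append((start, end))
    return words


if __name__ == "__main__":
    blank = [0] * 12
    text_row = [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
    page = [blank, text_row, text_row, blank, [1, 1, 1] + [0] * 9, blank]
    for top, bottom in segment_lines(page):
        print((top, bottom), segment_words(page[top:bottom]))
```

With a mean-gap threshold, a line whose gaps are all equal is treated as a single word, so a practical system would likely need a more robust statistic. Character segmentation by thinning is not sketched here because the abstract gives too little detail about the geometric rules used.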
