Abstract

The recognition accuracy of ligature-based Urdu optical character recognition (OCR) systems depends heavily on the accuracy of the segmentation step that splits Urdu text into lines and ligatures. In general, line- and ligature-based Urdu OCR systems are more successful than character-based ones. This paper presents techniques for segmenting Urdu Nastaleeq text images into lines and subsequently into ligatures. The classical horizontal projection-based segmentation method is augmented with a curved-line-split algorithm to overcome problems such as locating the text line split position, overlapping, merged ligatures, and ligatures that cross line split positions. The ligature segmentation algorithm extracts connected components from text lines, categorizes them into primary and secondary classes, and allocates secondary components to primary ones by examining width, height, coordinates, overlap, centroids, and baseline information. The proposed line segmentation algorithm was tested on 47 pages and achieved 99.17% accuracy. The proposed ligature segmentation algorithm was tested mainly on a large data set of printed Urdu text images, which it segmented into 189 000 ligatures from 10 063 text lines containing 332 000 connected components. About 142 000 secondary components were successfully allocated to more than 189 000 primary ligatures, with an accuracy of 99.80%. Both proposed segmentation algorithms thus outperform the existing algorithms used for Urdu Nastaleeq text segmentation. Moreover, the proposed line segmentation algorithm was also tested on Arabic text, for which it also extracted lines correctly.
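For readers unfamiliar with the classical horizontal projection method that the abstract takes as its starting point, the following is a minimal sketch of the general idea only. It is not the authors' implementation: the function names, the `min_ink` threshold, and the toy `page` are assumptions made for illustration, and the curved-line-split augmentation described in the paper is not covered.

```python
from typing import List, Tuple


def horizontal_projection(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def split_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    A text line is taken to be a maximal run of consecutive rows whose
    ink count is at least `min_ink` (a hypothetical threshold).
    """
    profile = horizontal_projection(image)
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the line currently being scanned, if any
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                 # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))  # leaving a text line
            start = None
    if start is not None:             # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy binary page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(split_lines(page))  # [(1, 3), (4, 5)]
```

A purely horizontal cut like this one fails whenever ink from adjacent lines shares a row, which is the kind of situation (overlapping and line-crossing ligatures) that the abstract says motivates the curved-line-split step.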

