Abstract

Historical documents contain valuable heritage information. They are preserved in manuscript preservation centers and archaeological departments, but most are degraded, which makes their contents hard to read and understand. Text segmentation and feature extraction are therefore needed to convert these manuscripts into a machine-editable format. In this work, we present an effective way to segment historical document images into characters, a task made challenging by complex image backgrounds. The method combines horizontal histograms, vertical histograms, and connected component analysis. The input image is first converted to grayscale, the grayscale image is binarized using Otsu's method, and all objects containing fewer than a desired number of pixels are removed. Line and word segmentation are then performed using the horizontal and vertical histogram methods, respectively. Finally, the connected components are labeled, properties of the image regions are measured, and the individual characters are extracted. Simulation results show that the proposed segmentation method achieves an average accuracy of 93.37% on the HDLAC 2011 dataset. Moreover, the method is efficient and suitable for real-time tasks.
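
The segmentation stages described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes the page has already been converted to grayscale and binarized (for example with Otsu's method) into rows of 0/1 values with 1 as text, and the function names, the 8-connectivity choice, and the `min_pixels` and `word_gap` parameters are hypothetical, chosen only for illustration.

```python
from collections import deque

def row_histogram(img):
    """Horizontal histogram: foreground pixel count per row."""
    return [sum(row) for row in img]

def col_histogram(img):
    """Vertical histogram: foreground pixel count per column."""
    return [sum(col) for col in zip(*img)]

def runs(hist, min_gap=1):
    """(start, end) spans of non-zero bins, end exclusive.
    Spans separated by fewer than min_gap empty bins are merged."""
    spans, start = [], None
    for i, v in enumerate(hist):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(hist)))
    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

def connected_components(img):
    """8-connected components, each returned as a list of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(pixels)
    return comps

def remove_small(img, min_pixels):
    """Drop objects with fewer than min_pixels foreground pixels."""
    out = [[0] * len(img[0]) for _ in img]
    for pixels in connected_components(img):
        if len(pixels) >= min_pixels:
            for y, x in pixels:
                out[y][x] = 1
    return out

def segment(img, min_pixels=2, word_gap=2):
    """Return lines -> words -> character boxes (top, left, bottom, right),
    with inclusive coordinates in the original image."""
    clean = remove_small(img, min_pixels)
    lines = []
    for top, bottom in runs(row_histogram(clean)):
        line = clean[top:bottom]
        words = []
        for left, right in runs(col_histogram(line), min_gap=word_gap):
            word = [row[left:right] for row in line]
            boxes = []
            for pixels in connected_components(word):
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                boxes.append((top + min(ys), left + min(xs),
                              top + max(ys), left + max(xs)))
            words.append(sorted(boxes, key=lambda b: b[1]))
        lines.append(words)
    return lines

# Toy page: two text lines plus one isolated noise pixel.
PAGE = [
    "110110001100",
    "110110001100",
    "000000000000",
    "000000000001",
    "000000000000",
    "011000000000",
]
image = [[int(ch) for ch in row] for row in PAGE]
result = segment(image)
print(result)
```

On this toy page the isolated pixel is discarded as noise, the horizontal histogram yields two text lines, the vertical histogram splits the first line into two words, and connected component labeling separates the first word into two characters. On real manuscripts, touching or broken characters would need handling beyond this sketch.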

