Abstract

Inspired by human perception and by the characteristics that readability constraints impose on common text documents, we propose an Arabic text line segmentation approach based on seam carving. Taking the grayscale image as input, this technique offers better results when extracting handwritten text lines, without requiring a binary representation of the document image. In addition to its fast processing time, its versatility allows it to process many document types, especially documents with low text-to-background contrast, such as degraded historical manuscripts, or with complex writing styles, such as cursive handwriting. Although this paper focuses on Arabic text segmentation, the method is language independent. Tests on a public database of 123 handwritten Arabic documents showed a line detection rate of 97.5% for a matching score of 90%.

Highlights

  • This approach is based on the watershed technique, which relies on connected component analysis to partition the image in a way similar to a Voronoi diagram; the resulting partitions are then analyzed and merged into text lines

Summary

Introduction

With the advent of digital means of sharing information, people are slowly abandoning paper as a medium and using digital devices and technologies instead. With the volume of processed information growing every day, businesses, organizations, and public services are adopting new digital technologies in place of paper documents. This transition creates a real need for scalable optical character recognition (OCR) systems capable of converting paper documents, handwritten or printed, into digital formats. OCR systems process document images in many steps, which can be summarized in four main stages: text block identification, text line segmentation, word segmentation, and character recognition. Most solutions proposed in the literature are based on connected component (CC) analysis. Approaches of this type struggle with images that have low text-to-background contrast, such as historical manuscripts or damaged documents, because they rely on binarization as a pre-processing step; as shown in [1], binarization may cause critical information loss. The robustness of the seam carving approach is discussed with practical illustrations, leading to conclusions and perspectives for future work.
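To make the seam carving idea concrete, here is a minimal sketch of how a low-energy horizontal seam can be found directly on a grayscale image, with no binarization step. It assumes the classic dynamic-programming formulation of seam carving and a plain gradient-magnitude energy. This summary does not describe the paper's own energy map or medial seam computation, so the function names `energy_map` and `min_horizontal_seam` are hypothetical and the code should not be read as the authors' implementation.

```python
import numpy as np


def energy_map(gray):
    """Gradient-magnitude energy of a 2-D grayscale array.

    Assumption: a generic seam carving energy, not the paper's energy map.
    """
    g = np.asarray(gray, dtype=float)
    dy, dx = np.gradient(g)  # derivatives along rows (axis 0) and columns (axis 1)
    return np.abs(dx) + np.abs(dy)


def min_horizontal_seam(energy):
    """Return one row index per column for the minimum-energy left-to-right seam.

    The seam moves by at most one row between adjacent columns.
    """
    h, w = energy.shape
    cost = np.array(energy, dtype=float)   # cumulative cost, filled column by column
    back = np.zeros((h, w), dtype=int)     # best predecessor row for each pixel
    rows = np.arange(h)
    for x in range(1, w):
        prev = cost[:, x - 1]
        up = np.concatenate(([np.inf], prev[:-1]))    # predecessor at row r - 1
        down = np.concatenate((prev[1:], [np.inf]))   # predecessor at row r + 1
        stacked = np.stack((up, prev, down))          # shape (3, h)
        choice = np.argmin(stacked, axis=0)           # 0: r - 1, 1: r, 2: r + 1
        back[:, x] = rows + choice - 1
        cost[:, x] += stacked[choice, rows]
    # Trace the cheapest path back from the last column.
    seam = np.empty(w, dtype=int)
    seam[-1] = int(np.argmin(cost[:, -1]))
    for x in range(w - 1, 0, -1):
        seam[x - 1] = back[seam[x], x]
    return seam
```

Calling `min_horizontal_seam(energy_map(gray))` on a grayscale page returns a single seam. A line segmentation method would need several seams, for example separating seams that run through the background between neighboring text lines, and how those are selected is specific to the method and not covered by this sketch.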

Related Works
Seam Carving
Energy Map
Medial Seams
Methods
Robustness
Findings
Conclusion