Abstract

Document image processing has become an increasingly important technology in the automation of office documentation tasks. Automatic document scanners such as text readers and OCR (Optical Character Recognition) systems are an essential component of systems capable of those tasks. One of the problems in this field is that the document to be read is not always placed correctly on a flat-bed scanner. This means that the document may be skewed on the scanner bed, resulting in a skewed image. This skew has a detrimental effect on document analysis, document understanding, and character segmentation and recognition. Consequently, detecting the skew of a document image and correcting it are important issues in realizing a practical document reader. This paper presents the use of analyzing the connected components extracted from the binary image of a document page. Such an analysis provides a lot of useful information, and will be used to perform skew correction, segmentation and classification of the document. Moreover, we describe two new algorithms — one for skew detection and one for skew correction. The new skew correction algorithm we propose has been shown to be fast and accurate, with run times averaging under 1.5 CPU seconds and 30 seconds real time to calculate the angle on a 5000/20 DEC workstation. Experiments on over 100 pages show that the method works well on a wide variety of layouts, including sparse textual regions, mixed fonts, multiple columns, and even for documents with a high graphical content.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.