Abstract

Page segmentation, which locates text blocks, is an essential first step in document processing, in particular for understanding a journal's cover page. In most documents, text, graphics and images are spatially separated; on cover pages, however, text may be overlaid on graphics or images. This paper proposes a new adaptive page segmentation method that extracts text blocks from various types of color cover images from technical journals. Although color carries information that helps resolve the overlapping problem, color processing imposes a heavy computational load. A complexity analysis is therefore included to adjust the processing steps adaptively. Simple cover images, with few colors and no text–graphics/image overlapping, are treated as monochrome images to reduce processing time, whereas complex cover images, with many colors and text–graphics/image overlapping, can still be segmented correctly at the cost of more processing time. To realize this design, the method comprises several components. First, to reduce the processing complexity of true-color images, a new simple quantization method reduces the number of colors from 24-bit true color to 42 colors or fewer. In the block segmentation stage, smearing, labeling and complexity analysis techniques are used together with edge and color information to find coherent blocks adaptively. Then, in the block classification stage, a combination of conventional and new features is computed for each block to decide whether it is a text block. Finally, in the post-processing stage, spatial relations are used to correct the classification results. Experimental results demonstrate the feasibility and practicality of the proposed approach.
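The abstract names smearing as one ingredient of the block segmentation stage but gives no details. As an illustration only, the following is a minimal sketch of conventional horizontal run-length smearing on a binary image, assuming foreground pixels are 1 and background pixels are 0. The function names `smear_row` and `smear_image` and the `threshold` parameter are hypothetical and are not taken from the paper, whose actual procedure also uses edge and color information.

```python
# Illustrative sketch of classic run-length smearing, NOT the paper's method.
# Assumption: the image is a list of rows, each a list of 0 (background)
# and 1 (foreground) values.

def smear_row(row, threshold):
    """Fill background gaps of length <= threshold that lie between
    two foreground pixels in a single row."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen
    for i, px in enumerate(row):
        if px:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    # Short gap: merge the two foreground runs.
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def smear_image(image, threshold):
    """Apply horizontal smearing independently to every row."""
    return [smear_row(row, threshold) for row in image]


if __name__ == "__main__":
    sample = [
        [1, 0, 0, 1, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 0, 0, 0, 0],
    ]
    for row in smear_image(sample, threshold=2):
        print(row)
```

With a small threshold, neighboring characters in a text line merge into one connected run, which a later labeling step can treat as a single candidate block. Gaps wider than the threshold, and background touching the row borders, are left untouched.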

