Abstract

Semantic page segmentation of document images is a basic task in document layout analysis, which is key to document reconstruction and digitization. Previous work usually considers only a few semantic types per page (e.g., text and non-text) and is evaluated mainly on English document images; finer-grained semantic segmentation of Chinese and English document pages remains challenging. In this paper, we propose a deep learning based method for semantic page segmentation of Chinese and English documents that decomposes a document page into regions of four semantic types: text, table, figure, and formula. Specifically, a deep semantic segmentation neural network is designed to perform pixel-wise segmentation, in which each pixel of an input document page image is labeled as background or as one of the four categories above. We then obtain accurate locations for the regions of each type by applying Connected Component Analysis to the prediction mask. Moreover, a Non-Intersecting Region Segmentation Algorithm is designed to generate a series of regions that do not overlap one another, which improves the segmentation results and avoids possible location conflicts in page reconstruction. To train the neural network, we manually annotate a dataset whose documents come from Chinese and English language sources and contain various layouts. Experimental results on our collected dataset show that the proposed method outperforms existing methods. In addition, we apply transfer learning on the public POD dataset and obtain promising results in comparison with state-of-the-art methods.
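
The Connected Component Analysis step can be illustrated with a minimal sketch. This is not the authors' implementation: the label encoding (0 for background, 1 to 4 for text, table, figure, and formula), the choice of 4-connectivity, and the function names `extract_regions` and `boxes_overlap` are assumptions made for illustration only. The sketch takes a predicted label mask, groups same-label pixels into components, and reports one bounding box per component. The overlap check shows only how a location conflict between two boxes could be detected; it does not reproduce the Non-Intersecting Region Segmentation Algorithm, whose details the abstract does not give.

```python
from collections import deque

# Assumed label encoding; the abstract names the classes but not their ids.
BACKGROUND, TEXT, TABLE, FIGURE, FORMULA = 0, 1, 2, 3, 4


def extract_regions(mask):
    """Return (label, top, left, bottom, right) for each 4-connected
    component of same-label, non-background pixels. Bounds are inclusive."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            label = mask[r][c]
            if label == BACKGROUND or seen[r][c]:
                continue
            # Breadth-first flood fill over neighbours with the same label.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and mask[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((label, top, left, bottom, right))
    return regions


def boxes_overlap(a, b):
    """True if two regions' inclusive bounding boxes share at least one pixel."""
    _, a_top, a_left, a_bottom, a_right = a
    _, b_top, b_left, b_bottom, b_right = b
    return not (a_bottom < b_top or b_bottom < a_top
                or a_right < b_left or b_right < a_left)


# Toy 4x4 prediction mask: a text block, a table strip, one figure pixel.
toy_mask = [
    [TEXT, TEXT, BACKGROUND, BACKGROUND],
    [TEXT, TEXT, BACKGROUND, TABLE],
    [BACKGROUND, BACKGROUND, BACKGROUND, TABLE],
    [FIGURE, BACKGROUND, BACKGROUND, BACKGROUND],
]
toy_regions = extract_regions(toy_mask)
```

On a real page the mask would come from the segmentation network's per-pixel class prediction, and a library routine for connected component labeling would normally replace the hand-written flood fill.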
