Abstract

The successful use of deep learning solutions for document image segmentation typically relies on a large number of manually labeled ground truth examples, which are expensive to obtain for historical document images that have significant noise effects and variation. At the same time, successful applications of deep learning solutions for document image segmentation have rich potential to facilitate greater levels of description in archival collections (e.g., at and below the item level). These greater levels of description are critical to increasing access to and use of archival collections across an array of research domains. In response, this article investigates whether an augmentation-based approach to generating pseudo-ground truth can be effective with a limited number of labeled images in a document segmentation application. The rationale is that if we can decrease the cost of generating ground truth through augmentation-based approaches, we can use these approaches as part of the description and access pipelines for historical library and archival collections. In this initial exploration, we first generate synthetic images and corresponding pseudo-ground truth from a small number of labeled real images, using a set of existing degradation-based augmentation models. When generating synthetic images, we control the visual quality distortion based on OCR word-level confidence to avoid generating images that are unlikely to be present in the dataset. Then, we perform several investigations to examine the impact of incorporating pseudo-ground truth data in the training of the deep learning network dhSegment, and we further evaluate the use of multiple combinations of degradation models. We also assess the generalizability of the approach by applying the trained network to a larger dataset. Our investigations primarily use real-world datasets known to have significant noise effects.
Results show that augmentation-based pseudo-ground truth generation can improve segmentation performance when used with the full original dataset, and that it requires only 30% of the original dataset. Results also show that using more than three degradation models is likely to cause overfitting during training. Furthermore, we show that a segmentation network trained on pseudo-ground truth data generalizes when applied to a larger dataset.
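
The generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three degradation functions, the function `generate_pseudo_ground_truth`, and all parameter values are hypothetical, and the `ocr_confidence` argument stands in for whatever word-level OCR confidence measure is actually used. The sketch also assumes that the degradations change only pixel intensities, so the segmentation label of the source image can be reused unchanged as pseudo-ground truth.

```python
import numpy as np

# Hypothetical degradation models acting on a grayscale float image in [0, 1].
# They alter pixel intensities only, never the page geometry.

def add_gaussian_noise(img, rng, sigma=0.05):
    """Additive Gaussian noise, clipped back to the valid range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def fade_ink(img, rng, strength=0.3):
    """Pull every pixel toward white to mimic faded ink."""
    return np.clip(img + strength * (1.0 - img), 0.0, 1.0)

def add_salt_pepper(img, rng, amount=0.02):
    """Set a small random fraction of pixels to pure black or pure white."""
    out = img.copy()
    draw = rng.random(img.shape)
    out[draw < amount / 2] = 0.0
    out[draw > 1.0 - amount / 2] = 1.0
    return out

def generate_pseudo_ground_truth(image, label_mask, degradations,
                                 ocr_confidence, min_confidence, rng,
                                 max_tries=10):
    """Return (synthetic_image, label_mask) pairs that pass the confidence gate.

    ocr_confidence is an assumed callable mapping an image to one score;
    candidates scoring below min_confidence are treated as too distorted
    and are discarded.
    """
    pairs = []
    for degrade in degradations:
        for _ in range(max_tries):
            candidate = degrade(image, rng)
            if ocr_confidence(candidate) >= min_confidence:
                # Assumption: intensity-only degradation leaves labels valid.
                pairs.append((candidate, label_mask.copy()))
                break
    return pairs
```

In practice `ocr_confidence` would wrap an OCR engine and aggregate its word-level confidences, and the resulting pairs would be added to the labeled images used to train the segmentation network.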
