Abstract

Most digital libraries that provide user-friendly interfaces, enabling quick and intuitive access to their resources, are based on Document Image Analysis and Recognition (DIAR) methods. Such DIAR methods need ground-truthed document images to be evaluated, compared and, in some cases, trained. Especially with the advent of deep learning-based approaches, the required size of annotated document datasets seems to be ever-growing. Manually annotating real documents has many drawbacks, which often leads to small reliably annotated datasets. To circumvent those drawbacks and enable the generation of massive ground-truthed data with high variability, we present DocCreator, multi-platform and open-source software able to create many synthetic document images with controlled ground truth. DocCreator has been used in various experiments, which show the benefit of using such synthetic images to enrich the training stage of DIAR tools.

Highlights

  • Almost every researcher in the field of Document Image Analysis and Recognition (DIAR) has had to face the problem of obtaining a ground-truthed document image dataset.

  • In this paper we present DocCreator, open-source and multi-platform software that is able to create virtually unlimited amounts of different ground-truthed synthetic document images based on a small number of real images.

  • DocCreator gives DIAR researchers a simple and rapid way to extend existing document image databases or to create new ones, avoiding the tedious task of manual ground truth generation.

Summary

Introduction

Almost every researcher in the field of Document Image Analysis and Recognition (DIAR) has had to face the problem of obtaining a ground-truthed document image dataset. Despite the use of such software, manual annotation remains a costly task that cannot always be performed by a non-specialist. Another solution is available for obtaining large ground-truthed document image datasets quickly and at lower human cost. In this paper we present DocCreator, open-source and multi-platform software that is able to create virtually unlimited amounts of different ground-truthed synthetic document images based on a small number of real images. The user can change the baseline or the letter assigned to a character and smooth the border of a character. Via this semi-automatic font extraction method, the user is able to correct mistakes made by the OCR (frequent on old documents). This method has the advantage of a very low computational cost, without any preprocessing training required. At this point, the three characteristics used in the synthetic image generation process have been extracted (background, font and layout). One can combine fonts, background images and layouts from different images with various texts to generate many synthetic document images.
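The combinatorial idea in the last sentence can be illustrated with a minimal Python sketch. It is not DocCreator's code or API: the `SyntheticSpec` class, the `enumerate_specs` function and all resource names below are hypothetical, chosen only to show how a few extracted backgrounds, fonts and layouts multiply into many document specifications whose text, and therefore ground truth, is known in advance.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SyntheticSpec:
    """Hypothetical description of one synthetic document to render."""
    background: str  # identifier of an extracted background image
    font: str        # identifier of an extracted font
    layout: str      # identifier of an extracted layout
    text: str        # text to typeset; it doubles as the ground truth


def enumerate_specs(backgrounds, fonts, layouts, texts):
    """Return every combination of the given resources as a SyntheticSpec."""
    return [
        SyntheticSpec(b, f, l, t)
        for b, f, l, t in product(backgrounds, fonts, layouts, texts)
    ]


# Placeholder resource names, assumed for illustration only.
specs = enumerate_specs(
    backgrounds=["bg_a", "bg_b"],
    fonts=["font_a", "font_b", "font_c"],
    layouts=["one_column", "two_columns"],
    texts=["text_1"],
)
print(len(specs))  # 2 * 3 * 2 * 1 = 12 combinations
```

Even with these few placeholder resources the number of distinct specifications is the product of the list sizes, which is why a small number of real images can seed a much larger synthetic set.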

Document Degradation Models
Ink Degradation
Phantom Character
Paper Holes
Bleed-Through
Adaptive Blur
Nonlinear Illumination Model
Document Image Generation for Performance Evaluation
Document Image Generation for Retraining Task
Increase the Prediction Rate of Predictive Binarization Algorithm
Predict OCR Recognition Rate Using Synthetic Images
Conclusions