DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images

Nicholas Journet, Antoine Billy, Muriel Visani, Boris Mansencal, Kieu Van-Cuong

doi:10.3390/jimaging3040062

Abstract

Most digital libraries that provide user-friendly interfaces, enabling quick and intuitive access to their resources, are based on Document Image Analysis and Recognition (DIAR) methods. Such DIAR methods need ground-truthed document images to be evaluated, compared and, in some cases, trained. Especially with the advent of deep learning-based approaches, the required size of annotated document datasets seems to be ever-growing. Manually annotating real documents has many drawbacks, which often results in reliably annotated datasets that are small. To circumvent those drawbacks and enable the generation of massive ground-truthed data with high variability, we present DocCreator, a multi-platform and open-source software able to create many synthetic document images with controlled ground truth. DocCreator has been used in various experiments, showing the value of using such synthetic images to enrich the training stage of DIAR tools.

Highlights

Almost every researcher in the field of Document Image Analysis and Recognition (DIAR) has had to face the problem of obtaining a ground-truthed document image dataset
In this paper we present DocCreator, an open-source and multi-platform software that is able to create virtually unlimited amounts of different ground-truthed synthetic document images based on a small number of real images
DocCreator gives DIAR researchers a simple and rapid way to extend existing document image databases or to create new ones, avoiding the tedious task of manual ground truth generation

Summary

Introduction

Almost every researcher in the field of Document Image Analysis and Recognition (DIAR) has had to face the problem of obtaining a ground-truthed document image dataset. Even with dedicated annotation software, manual annotation remains a costly task that cannot always be performed by a non-specialist. Another solution is available for obtaining large ground-truthed document image datasets quickly and at lower human cost. In this paper we present DocCreator, an open-source and multi-platform software that is able to create virtually unlimited amounts of different ground-truthed synthetic document images based on a small number of real images. The user can change the baseline or the letter assigned to a character and smooth the border of a character. Via this semi-automatic font extraction method, the user is able to correct mistakes made by the OCR (frequent on old documents). This method has the advantage of a very low computational cost, without any preprocessing training required. At this point, the three characteristics used in the synthetic image generation process have been extracted (background, font and layout). One can combine fonts, background images and layouts from different images with various texts to generate many synthetic document images.
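To give a sense of how quickly such combinations multiply, the following minimal Python sketch enumerates every pairing of fonts, backgrounds, layouts and texts. It is only an illustration under stated assumptions: the `DocumentSpec` class, the `enumerate_specs` function and all resource names are hypothetical and are not part of DocCreator, and no rendering is performed.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence


@dataclass(frozen=True)
class DocumentSpec:
    """Recipe for one synthetic document (hypothetical structure, not DocCreator's)."""
    font: str
    background: str
    layout: str
    text: str


def enumerate_specs(
    fonts: Sequence[str],
    backgrounds: Sequence[str],
    layouts: Sequence[str],
    texts: Sequence[str],
) -> Iterator[DocumentSpec]:
    """Yield one spec per (font, background, layout, text) combination."""
    for font, background, layout, text in product(fonts, backgrounds, layouts, texts):
        # Because the text is chosen up front, the ground truth is known
        # before any image is rendered.
        yield DocumentSpec(font, background, layout, text)


# Placeholder resource names, assumed to have been extracted from a few real images.
specs = list(
    enumerate_specs(
        fonts=["font_a", "font_b"],
        backgrounds=["bg_1", "bg_2", "bg_3"],
        layouts=["one_column", "two_columns"],
        texts=["sample text 1", "sample text 2"],
    )
)
print(len(specs))  # 2 * 3 * 2 * 2 = 24
```

With only two fonts, three backgrounds, two layouts and two texts, the sketch already yields 24 distinct document recipes, which suggests why a small number of real images can seed a large synthetic dataset.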

Document Degradation Models

Ink Degradation

Phantom Character

Paper Holes

Bleed-Through

Adaptive Blur

Nonlinear Illumination Model

Document Image Generation for Performance Evaluation

Document Image Generation for Retraining Task

Increase the Prediction Rate of Predictive Binarization Algorithm

Predict OCR Recognition Rate Using Synthetic Images

Conclusions

Journal: Journal of Imaging	Publication Date: Dec 11, 2017
Citations: 43	License type: CC BY 4.0
