Using synthetic data for fine-tuning document segmentation models

Oksana Belyaeva, Andrey Perminov, Ilya Kozlov

doi:10.15514/ispras-2020-32(4)-14

Open Access

https://doi.org/10.15514/ispras-2020-32(4)-14

Abstract

In this paper, we propose an approach to document image segmentation for the case where only a limited set of real data is available for training. The main idea of our approach is to use artificially created data for training, combined with post-processing. The domain of the paper is PDF documents without a text layer, such as scanned contracts, commercial proposals and technical specifications. As part of the task of automatic document analysis, we solve the problem of document segmentation, known as Document Layout Analysis (DLA). We train the well-known FasterRCNN \cite{ren2015faster} model to segment text blocks, tables, stamps and captions in images from this domain. The aim of the paper is to generate synthetic data similar to the real data of the domain. This is necessary because the model needs a large dataset for training, and preparing such a dataset is highly labor-intensive. We also describe a post-processing stage that eliminates artifacts produced by the segmentation. We tested and compared the quality of a model trained on different datasets (with / without synthetic data, small / large set of real data, with / without the post-processing stage). As a result, we show that generating synthetic data and applying post-processing increase the quality of the model when the real training set is small.

Highlights

Keywords: physical document structure analysis; document segmentation; document layout analysis; object detection in images; model fine-tuning; active learning
We propose an approach to document image segmentation for the case where only a limited set of real data is available for training
The domain of the paper is PDF documents without a text layer, such as scanned contracts, commercial proposals and technical specifications

Summary

DLA solutions

The development of DLA began with early, well-known heuristic methods: Smearing [10], Recursive XY-cut [11], Docstrum [12], and segmentation by maximal white rectangles [13]. This is needed for the analysis of historical documents, where it is important to filter out only the textual information. The authors of [4,5] propose using a convolutional network to segment pages of historical documents. The authors of [4] use a simple network with only a single convolutional layer to recognize historical handwritten documents and compare the results with deeper, more complex networks. In [5], a Fully Convolutional Network (FCN) is proposed for segmenting historical documents; it uses a metric that counts only the foreground pixels of the binarized page and ignores background pixels. A comparative table of the segmentation quality of different architectures is given below.

Table 2. Segmentation accuracy of different architectures (on the COCO dataset)

[...] which make up the majority of our data, FasterRCNN outperforms most other models in segmentation accuracy on COCO data. According to [3], FasterRCNN achieves state-of-the-art quality in detecting tables in images.

ISP RAS, vol 32, issue 4, 2020. pp. 189–202
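To make the idea of a foreground-only metric concrete, here is a minimal sketch in plain Python. It is not the exact metric from [5], whose definition is not given on this page. It assumes the simplest variant: pixel accuracy computed only over pixels marked as foreground in a binarized page. The function name `foreground_pixel_accuracy` and the toy class ids (0 = background, 1 = text, 2 = table) are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def foreground_pixel_accuracy(
    predicted: Sequence[Sequence[int]],
    truth: Sequence[Sequence[int]],
    foreground: Sequence[Sequence[bool]],
) -> float:
    """Share of foreground pixels whose predicted class equals the true class.

    Assumption: "foreground" is the ink mask of the binarized page.
    Background pixels are skipped entirely, so they neither help nor hurt.
    """
    correct = 0
    total = 0
    for pred_row, true_row, fg_row in zip(predicted, truth, foreground):
        for pred, true, is_fg in zip(pred_row, true_row, fg_row):
            if not is_fg:
                continue  # ignore background pixels
            total += 1
            if pred == true:
                correct += 1
    if total == 0:
        raise ValueError("page has no foreground pixels")
    return correct / total


# Toy 3x4 page with hypothetical class ids: 0 = background, 1 = text, 2 = table.
foreground = [
    [True, True, False, False],
    [True, False, False, True],
    [False, False, False, True],
]
truth = [
    [1, 1, 0, 0],
    [1, 0, 0, 2],
    [0, 0, 0, 2],
]
predicted = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 2],
]

score = foreground_pixel_accuracy(predicted, truth, foreground)
print(score)
```

On this toy page the prediction mislabels most background pixels as text, yet only the five foreground pixels are scored, and four of them are correct. A metric computed over all pixels would penalize the background errors, which is the behavior a foreground-only metric is meant to avoid.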

Input data

Description of segmentation classes

Document generation

Features of the generated documents

Model training

Post-processing

Results

Conclusion

Journal: Proceedings of the Institute for System Programming of the RAS	Publication Date: Jan 1, 2020
Citations: 2	License type: cc-by
