Research of text recognition systems and data removal for Ukrainian-language documents

K O Hordiyenko, A B Koba, T P Dovzhenko

doi:10.31673/2412-9070.2020.066163

Abstract

This article reviews existing software whose main task is to extract information from digitized documents. From the available software, products based on neural network technology and deep learning were selected. Information can be extracted from documents manually by personal computer operators, which takes a long time and does not exclude the human factor, or by digitizing the documents and processing them in software that matches documents against templates and rules; the drawbacks of the latter approach are its data processing speed and the need to change the settings whenever the document type changes. The article aims to investigate existing neural-network-based software for extracting data from digital documents and its applicability to Ukrainian-language documents. To do this, a simple set of invoices was created and uploaded to the systems. Developing a system that extracts information from digitized Ukrainian-language documents using neural networks will speed up data processing and make it possible to process the data according to the user's field of activity. It is established that at present there are no systems that can independently determine which data need to be extracted from Ukrainian-language documents. Existing systems require additional software that acts as a wrapper around their functionality, since they deliver their results through a REST API. Google Form Parser is considered the best system, but it requires a constant Internet connection, which can be a serious obstacle to using such a product in certain fields of activity.
