Data pre-processing to increase the quality of optical text recognition systems

Konstantin Dergachov,Leonid Krasnov,Anatoly Zymovin,Vladislav Bilozerskyi

doi:10.32620/reks.2021.4.15

Abstract

The subject of study in the article is the formulation of a modern concept of improving the quality of work of optical recognition systems by using a set of various algorithms for preprocessing document images at the user's discretion. The research synthesizes algorithms that compensate for external negative influences (unfavorable geometric factor, poor lighting conditions when photographing, the effect of noise, etc.). The methods used imply a certain sequence of data preprocessing stages: geometric transformation of the original images, their processing with a set of various filters, image equalization without increasing the noise level to increase the contrast of images, the binarization of images with adaptive conversion thresholds to eliminate the influence of uneven photo illumination. The following results were obtained. A package of algorithms for preliminary processing of photographs of documentation has been created, in which, to increase the functionality of data identification, a face detection algorithm is also built in, intended for their further recognition (face recognition). A number of service procedures are provided to ensure the convenience of data processing and their information protection. In particular, interactive procedures for text segmentation with the possibility of anonymizing its individual fragments are proposed. It helps provide the confidentiality of the processed documents. The structure of the listed algorithms is described and the stability of their operation under various conditions is investigated. Based on the results of the research, a text recognition software was developed using the Tesseract version 4.0 optical character recognition (OCR) program. The program "HQ Scanner" is written in Python using the OpenCV library. An original technique for evaluating the effectiveness of the algorithms using the criterion of the maximum probability of correct text recognition has been implemented in software. A large number of examples of system operation and software testing results are provided. Conclusions. The results of the research conducted are a basis for developing software for creating cost-effective and easy-to-use OCR systems for commercial use.

Full Text