ON THE ACCURACY AND COMPUTATIONAL COMPLEXITY OF A MULTI-STAGE METHOD FOR CORRECTING DISTORTED TEXTS DEPENDING ON THE DEGREE OF DISTORTION

D. V. Vakhlakov, S. Y. Melnikov, A. V. Germanovich, N. N. Copkalo, V. A. Peresypkin

doi:10.18522/2311-3103-2021-7-130-142

Abstract

One of the main factors that significantly complicate the understanding, translation, and analysis of texts obtained by automatic recognition of speech or of text images is the presence of distortions in the form of erroneous symbols, words, and phrases. Until recently, there were no effective software tools for correcting heavily distorted texts, although this task is relevant both for Russian and for other widely used languages, given the active use of recognition systems in advanced augmented reality systems. The authors proposed a new multi-stage method for correcting distorted texts, which is based on the sequential detection and correction of errors and significantly increases correction accuracy (measured by the number of correctly corrected words in the text). In this paper, we evaluate the accuracy and computational complexity of the proposed method at various levels of distortion and determine its place among other modern approaches to correction. The most typical errors of recognition systems are: replacing a word with one that sounds or looks similar; replacing several words with one; replacing one word with several; omission of words; and insertion or deletion of short words (including prepositions and conjunctions). As a result of recognition, a distorted text is obtained that consists mainly of dictionary words, even at the points of distortion. With a large number of distortions, texts become almost unreadable. Because it is difficult to collect a sufficient number of texts covering a wide range of distortion levels from real machine recognition of speech and text images, software modeling of distortions was used. A text distortion technique that simulates the output of recognition systems over a wide range of distortion levels has been proposed and implemented, and distorted texts have been prepared in the required amount.
Within the framework of the proposed multi-stage correction method, non-dictionary word forms are considered distorted, as are words whose probability of occurrence in the text under the chosen language model is below a given threshold. For each such distorted word, a list of possible word variants is built, which includes only those dictionary word forms that lie within a certain Levenshtein distance of the word under study. The corrected text is obtained from the tables of word variants by searching for the most probable chain of word forms. The correction method consists of several stages; at each stage, only those fragments of the text that remain distorted after the previous stage are corrected. Based on the results of experiments on the correction of distorted texts, it was concluded that the proposed method performs well, with an average F-measure above 50% over the distortion range from 0 to 75%. Linguistic experts confirmed the effectiveness of the proposed approach to correction and its advantage over other modern approaches, noting that at distortion levels of up to 50% of words the corrected text is read with much less effort than the distorted one, and at distortion levels of up to 70% of words the corrected text still makes it possible to extract useful information about the content.
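
The detection, candidate generation, and chain search described above can be illustrated with a minimal single-stage sketch. It is not the authors' implementation: the toy vocabulary `VOCAB`, the bigram table `BIGRAMS`, the helper names, and the distance limit `max_dist=1` are all assumptions made for illustration, and the detection rule is simplified to a dictionary lookup (the probability threshold and the repeated stages of the method are omitted).

```python
import math

# Hypothetical toy dictionary and bigram probabilities (assumptions, not from the paper).
VOCAB = ["the", "cat", "cut", "sat", "on", "mat"]
BIGRAMS = {("<s>", "the"): 0.9, ("the", "cat"): 0.6, ("the", "cut"): 0.1,
           ("cat", "sat"): 0.7, ("sat", "on"): 0.8, ("on", "the"): 0.8,
           ("the", "mat"): 0.3}

def bigram_logp(prev, word):
    """Log-probability of `word` after `prev`; unseen pairs get a small floor."""
    return math.log(BIGRAMS.get((prev, word), 1e-4))

def is_suspect(word):
    """Simplified detection: only non-dictionary words are treated as distorted."""
    return word not in VOCAB

def levenshtein(a, b):
    """Edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def candidates(word, max_dist=1):
    """Dictionary word forms within `max_dist` edits of `word`."""
    return [w for w in VOCAB if levenshtein(word, w) <= max_dist]

def correct(tokens, max_dist=1):
    """Return the most probable chain of word forms (Viterbi over candidate lists)."""
    # Trusted words keep a single option; suspect words get a candidate list,
    # falling back to the original token when no candidate is found.
    lattice = [(candidates(t, max_dist) or [t]) if is_suspect(t) else [t]
               for t in tokens]
    # best[w] = (score of the best chain ending in w, that chain)
    best = {w: (bigram_logp("<s>", w), [w]) for w in lattice[0]}
    for options in lattice[1:]:
        step = {}
        for w in options:
            score, chain = max(((s + bigram_logp(p, w), c)
                                for p, (s, c) in best.items()),
                               key=lambda item: item[0])
            step[w] = (score, chain + [w])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

print(correct(["the", "cot", "sat", "on", "the", "mat"]))
```

In this toy run the non-dictionary token `cot` receives the candidates `cat` and `cut`, and the chain search prefers `cat` because the pair (`cat`, `sat`) is far more probable under the assumed bigram table. A multi-stage variant would re-run detection on the output and correct only the fragments that are still flagged.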
