A SEARCHING ALGORITHM FOR TEXT WITH MISTAKES

S Nasr, O V German

doi:10.35596/1729-7648-2020-18-1-29-34

Abstract

The paper presents a new text searching method, a modification of the Boyer–Moore algorithm, that lets a user find the places in a text where a given substring occurs, possibly with errors; that is, the string in the text and the query may not coincide exactly but are nevertheless treated as the same. The idea is to divide the search into two phases: in the first phase a fuzzy variant of the Boyer–Moore algorithm is performed; in the second phase the Dice metric is used. Compared with known methods that use a fixed number of mistakes, the suggested technique has two advantages: 1) it does not precompute an auxiliary table of a size comparable to that of the original text, and 2) it captures the semantics of erroneous text substrings more flexibly, even for a large number of mistakes. This extends the capabilities of the Boyer–Moore method by admitting a larger number of possible mistakes in the text while preserving text semantics. The suggested method also allows finer control of the upper bound on the number of text mistakes, which distinguishes it from known methods in which the maximum number of mistakes is fixed and does not depend on the text size. Moreover, in those methods the upper bound is defined as a Levenshtein distance, which is not suitable for evaluating the relevance of the found text to the query, whereas the Dice metric provides such a measure of relevance. Indeed, if the maximum Levenshtein distance is 3, how can one judge whether this value is large or small enough to ensure relevant search results? Consequently, the suggested method is more flexible and makes it possible to find relevant answers even when the text contains a large number of mistakes. The worst-case complexity of the suggested method is O(nc), where the constant c defines the largest allowable number of mistakes.

Highlights

Text searching methods are widely required in modern text-based applications
If the number of mistakes lies in the range [k1 + 1, k2], the text word is processed by fuzzy comparison based on the Dice metric [6]
If all symbols in the query have been compared with the CV-text fragment (cvf):
2.1 If the number of mistakes encountered is in the range Dp = [0, k1], cvf is accepted as successfully recognized and the search resumes from the symbol immediately to the right of cvf.
2.2 If the number of mistakes encountered is in the range Dp = [k1 + 1, k2], cvf is compared with the query by means of the Dice metric.
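
The decision in steps 2.1 and 2.2 can be sketched roughly as follows. This is only an illustration under several assumptions that are not taken from the paper: mistakes are counted as position-wise mismatches between the query and a fragment of the same length, the Dice metric is computed over sets of character bigrams, `dice_threshold` is a hypothetical acceptance cut-off, and a fragment with more than k2 mistakes is rejected. The names `count_mistakes`, `dice` and `classify` are illustrative.

```python
def count_mistakes(query: str, fragment: str) -> int:
    """Count position-wise mismatches, scanning right to left.

    Assumption: the fragment has the same length as the query and
    every mistake is a substitution.
    """
    if len(query) != len(fragment):
        raise ValueError("fragment must have the same length as the query")
    return sum(1 for q, f in zip(reversed(query), reversed(fragment)) if q != f)


def bigram_set(s: str) -> set:
    """Set of adjacent character pairs of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient 2|X & Y| / (|X| + |Y|) over bigram sets (an assumption)."""
    x, y = bigram_set(a), bigram_set(b)
    if not x and not y:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def classify(query: str, fragment: str, k1: int, k2: int,
             dice_threshold: float) -> str:
    """Return "match" or "reject" for one candidate fragment."""
    mistakes = count_mistakes(query, fragment)
    if mistakes <= k1:
        # Step 2.1: few enough mistakes, accept directly.
        return "match"
    if mistakes <= k2:
        # Step 2.2: borderline case, decide by the Dice metric.
        # dice_threshold is a hypothetical parameter.
        return "match" if dice(query, fragment) >= dice_threshold else "reject"
    # Assumption: more than k2 mistakes means no match.
    return "reject"
```

With k1 = 1 and k2 = 4, the fragment "enjineor" differs from the query "engineer" in two positions, so it falls into the second range and the outcome depends on the chosen Dice cut-off.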

Summary

Introduction

Text searching methods are widely required in modern text-based applications. Examples include the processing of CVs (curricula vitae) and paper abstracts, extracting the description of an invention formula from patents, e-mail filtering and so on. If the number of mistakes does not exceed a threshold k1, the searched text word is recognized by means of the modified Boyer–Moore method. If the number of mistakes lies in the range [k1 + 1, k2], the text word is processed by fuzzy comparison based on the Dice metric [6].
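
For reference, the Dice coefficient of two finite sets X and Y is usually written as below. Taking X and Y to be the sets of adjacent character pairs (bigrams) of the query and of the text word is an assumption made here for illustration; this excerpt does not state which units the authors compare.

```latex
D(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad 0 \le D(X, Y) \le 1
```

Under that assumption, "algorithm" has 8 distinct bigrams and "algoritm" has 7, of which 6 are shared, so D = 12 / 15 = 0.8. Unlike a raw mistake count, the value is normalized by the lengths of the compared strings.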

Journal: Doklady BGUIR
Publication Date: Mar 6, 2020
License type: cc-by
