Abstract

Approximate pattern matching has a wide range of applications, and numerous algorithms exist for it, depending on the type of approximation. In this article we focus on texts that originate from OCRed documents, whose errors often have a particular form and are far from random. We introduce a new variant of the edit distance metric in which, apart from the traditional edit operations, two new operations are supported. The combination operation allows two or more symbols from a string x to be interpreted as a single symbol and then "matched" (or aligned) against a single symbol of a second string y. Its dual is the split operation, where a single symbol from x is broken down into a sequence of two or more other symbols, which can then be matched against an equal number of symbols from y. Our algorithm requires O(L) time for preprocessing and O(mnk) time for computing the edit distance, where L is the total length of all the valid combinations/splits, m and n are the lengths of the two strings under comparison, and k is an upper bound on the number of valid splits for any single symbol. The expected running time is O(mn).
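To make the two new operations concrete, the following is a minimal sketch of how a dynamic-programming edit distance might be extended with splits and combinations. It is not the algorithm of the article. The function name `ocr_edit_distance`, the `splits` table format, the unit cost `op_cost`, the requirement that split or combined symbols match exactly, and the choice to treat a combination as the reverse of a split are all assumptions made for illustration. The sketch also compares substrings directly instead of using the O(L) preprocessing, so it does not attain the stated O(mnk) bound.

```python
def ocr_edit_distance(x, y, splits, op_cost=1):
    """Edit distance between x and y with hypothetical split/combination moves.

    splits maps a single symbol to the sequences it may be split into,
    e.g. {"m": ["rn"]}. Combinations are assumed to be the reverse mapping.
    op_cost is an assumed unit cost for one split or combination.
    """
    m, n = len(x), len(y)
    # D[i][j] holds the distance between x[:i] and y[:j].
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i  # delete all of x[:i]
    for j in range(1, n + 1):
        D[0][j] = j  # insert all of y[:j]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Traditional operations: deletion, insertion, match/substitution.
            best = min(
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),
            )
            # Split: the single symbol x[i-1] is broken into seq, which must
            # equal the last len(seq) symbols of y[:j].
            for seq in splits.get(x[i - 1], ()):
                k = len(seq)
                if k <= j and y[j - k:j] == seq:
                    best = min(best, D[i - 1][j - k] + op_cost)
            # Combination: the last len(seq) symbols of x[:i] are read as
            # the single symbol y[j-1].
            for seq in splits.get(y[j - 1], ()):
                k = len(seq)
                if k <= i and x[i - k:i] == seq:
                    best = min(best, D[i - k][j - 1] + op_cost)
            D[i][j] = best
    return D[m][n]


# Hypothetical table of typical OCR confusions.
table = {"m": ["rn"], "d": ["cl"]}
print(ocr_edit_distance("modern", "modem", table))  # combination "rn" -> "m"
print(ocr_edit_distance("modem", "modern", table))  # split "m" -> "rn"
```

With an empty table the function reduces to the ordinary Levenshtein distance. With the table above, "modern" and "modem" are one operation apart instead of two.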
