EDIT DISTANCE WITH COMBINATIONS AND SPLITS AND ITS APPLICATIONS IN OCR NAME MATCHING

Manolis Christodoulakis, Gerhard Brey

doi:10.1142/s0129054109007030

Abstract

Approximate pattern matching has a wide range of applications and, depending on the type of approximation, there exist numerous algorithms for solving it. In this article we focus on texts that originate from OCRed documents, whose errors quite often have a particular form and are far from random. We introduce a new variant of the edit distance metric in which, apart from the traditional edit operations, two new operations are supported. The combination operation allows two or more symbols from a string x to be interpreted as a single symbol and then "matched" (or aligned) against a single symbol of a second string y. Its dual is the split operation, where a single symbol from x is broken down into a sequence of two or more other symbols that can then be matched against an equal number of symbols from y. Our algorithm requires O(L) time for preprocessing and O(mnk) time for computing the edit distance, where L is the total length of all the valid combinations/splits, m and n are the lengths of the two strings under comparison, and k is an upper bound on the number of valid splits for any single symbol. The expected running time is O(mn).
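The two extra operations can be illustrated with a small dynamic program. What follows is a minimal sketch, not the authors' algorithm: it extends the standard Levenshtein recurrence with one extra case per operation, scans every combination rule at each cell, and therefore does not reproduce the O(L) preprocessing or the O(mnk) bound stated in the abstract. The function name `edit_distance_cs`, the rule tables `combinations` and `splits`, the unit cost `op_cost`, and the example rule "rn" to "m" are all assumptions made for illustration. The sketch also assumes that a combined or split symbol must match its counterpart in y exactly.

```python
def edit_distance_cs(x, y, combinations=None, splits=None, op_cost=1):
    """Levenshtein distance extended with combination and split rules.

    combinations: dict mapping a sequence of >= 2 symbols of x to the single
                  symbol it may be read as (hypothetical rule table).
    splits:       dict mapping a single symbol of x to a list of sequences of
                  >= 2 symbols it may be broken into (hypothetical rule table).
    op_cost:      cost charged for one combination or split (assumed to be 1).
    """
    combinations = combinations or {}
    splits = splits or {}
    m, n = len(x), len(y)

    # d[i][j] = distance between the prefixes x[:i] and y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete all i symbols of x
    for j in range(1, n + 1):
        d[0][j] = j  # insert all j symbols of y

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                d[i - 1][j] + 1,                            # deletion
                d[i][j - 1] + 1,                            # insertion
                d[i - 1][j - 1] + (x[i - 1] != y[j - 1]),   # match or substitution
            )
            # Combination: the last k symbols of x[:i] are read as one symbol
            # that is aligned against y[j-1].
            for seq, sym in combinations.items():
                k = len(seq)
                if k <= i and sym == y[j - 1] and x[i - k:i] == seq:
                    best = min(best, d[i - k][j - 1] + op_cost)
            # Split: x[i-1] is broken into k symbols that are aligned against
            # the last k symbols of y[:j].
            for seq in splits.get(x[i - 1], ()):
                k = len(seq)
                if k <= j and y[j - k:j] == seq:
                    best = min(best, d[i - 1][j - k] + op_cost)
            d[i][j] = best
    return d[m][n]


# An OCR-style confusion: "rn" misread where the true name has "m".
print(edit_distance_cs("Srnith", "Smith"))                             # 2
print(edit_distance_cs("Srnith", "Smith", combinations={"rn": "m"}))   # 1
print(edit_distance_cs("Smith", "Srnith", splits={"m": ["rn"]}))       # 1
```

Without any rules the function reduces to the ordinary edit distance, so the "rn" versus "m" confusion costs a deletion plus a substitution. With the rule in place the same confusion is charged as a single operation, which is the behaviour the abstract motivates for OCR name matching.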

Journal: International Journal of Foundations of Computer Science
Publication Date: Dec 1, 2009
Citations: 2

Similar Papers

A Contextual Normalised Edit Distance
Colin De La Higuera ... Luisa Micó
01 Apr 2008

Edit distance for timed automata
Krishnendu Chatterjee ... Rupak Majumdar
15 Apr 2014

Brief Announcement: Graph-Based and Probabilistic Discrete Models Used in Detection of Malicious Attacks
Sergey Frenkel ... Victor Zakharov
01 Jan 2018
