Improved algorithms for approximate string matching (extended abstract)

Dimitris Papamichail, Georgios Papamichail

doi:10.1186/1471-2105-10-s1-s10

Open Access

https://doi.org/10.1186/1471-2105-10-s1-s10

Journal: BMC Bioinformatics
Publication Date: Jan 1, 2009
Citations: 19
License type: CC BY 2.0

Abstract

Background: The problem of approximate string matching is important in many different areas such as computational biology, text processing and pattern recognition. A great effort has been made to design efficient algorithms addressing several variants of the problem, including comparison of two strings, approximate pattern identification in a string or calculation of the longest common subsequence that two strings share.

Results: We designed an output-sensitive algorithm solving the edit distance problem between two strings of lengths n and m respectively in time O((s - |n - m|)·min(m, n, s) + m + n) and linear space, where s is the edit distance between the two strings. This worst-case time bound sets the quadratic factor of the algorithm independent of the longest string length and improves existing theoretical bounds for this problem. The implementation of our algorithm also excels in practice, especially in cases where the two strings compared differ significantly in length.

Conclusion: We have provided the design, analysis and implementation of a new algorithm for calculating the edit distance of two strings with both theoretical and practical implications. Source code of our algorithm is available online.

Highlights

The problem of approximate string matching is important in many different areas such as computational biology, text processing and pattern recognition
Fast practical algorithms for approximate string matching are in high demand
There are several variants of the approximate string matching problem, including the problem of finding a pattern in a text allowing a limited number of errors and the problem of finding the number of edit operations that can transform one string to another

Summary

Introduction

Approximate string matching is a fundamental, challenging problem in Computer Science, often requiring a large amount of computational resources. It finds applications in different areas such as computational biology, text processing, pattern recognition and signal processing. For these reasons, fast practical algorithms for approximate string matching are in high demand. There are several variants of the problem, including finding a pattern in a text while allowing a limited number of errors, and finding the number of edit operations that transform one string into another. In this work we focus on the Levenshtein edit distance [1], where the allowed edit operations are insertion, deletion and substitution of a single character.
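
As a point of reference, the Levenshtein distance can be computed with the classical dynamic-programming recurrence. The Python sketch below is the textbook method, running in O(nm) time and linear space; it is not the output-sensitive algorithm described in the abstract. The function name levenshtein is a hypothetical choice for illustration, and the code assumes the standard unit-cost operations: insertion, deletion and substitution of a single character.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook unit-cost edit distance between a and b (illustrative sketch)."""
    # Keep the shorter string along the row so memory is O(min(len(a), len(b))).
    if len(a) < len(b):
        a, b = b, a

    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))  # empty prefix of a: j insertions
    for i, ca in enumerate(a, start=1):
        curr = [i]  # a[:i] versus the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match, or substitute ca with cb
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the dynamic-programming table are kept at any time, which is what gives the linear space bound. The running time is still proportional to the product of the two string lengths, whatever the actual distance, and that dependence is what an output-sensitive bound is meant to avoid.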

Methods

Results

Conclusion