Approximate String Matching Algorithms: A Brief Survey and Comparison

Fareal Ahmed,Rosina Surovi Khan,Syeda Shabnamhasan

doi:10.5120/21247-4048

Abstract

Many database applications require similarity based retrieval on stored text and/or multimedia objects. This is an area of increasing research interest in the sectors of database, data mining, information retrieval and knowledge discovery. This paper presents a brief survey on the existing approximate string matching algorithms by primarily demonstrating three families of algorithms — the Brute force, the Lipschitz Embeddings and the Ball Partitioning algorithms. While Brute Force performs approximate string matching based on distance measures of the query object from each string stored in the database, Lipschitz Embeddings uses a far more efficient approach which embeds the stored strings in database in vector space so that the distances of embedded strings approximates the actual distances. Ball Partitioning algorithm, much more efficient than Brute force but less efficient than Lipschitz algorithm, performs search in approximate string matching based on distances where queries operate on an arbitrary search hierarchy. The paper compares and makes an analysis of these three algorithms which are suitable for approximate matching of strings stored in database text files, an issue much required in the context of similarity based retrieval of objects. The work can be extended for future work by taking into account a larger number of algorithms suited to approximate string matching for the benefit of a wider scope of comparisons and picking out the most optimal one. General Terms Algorithms for Approximate String Matching.

Full Text