Fast string matching by using probabilities: On an optimal mismatch variant of Horspool's algorithm

Markus E Nebel

doi:10.1016/j.tcs.2006.05.028

Markus E Nebel

https://doi.org/10.1016/j.tcs.2006.05.028

Copy DOI

Journal: Theoretical Computer Science	Publication Date: Jun 30, 2006
Citations: 21	License type: elsevier-specific

Affiliation: University of Kaiserslautern

Abstract

The string matching problem, i.e. the task of finding all occurrences of one string as a substring of another one, is a fundamental problem in computer science. Recently, this problem received a great deal of attention due to numerous applications in computational biology. In this paper we address a modified version of Horspool's string matching algorithm using the probabilities of the different symbols to speed up the search. We show that the modified algorithm has a linear average running time; a precise asymptotical representation of the running time will be proven. A comparison of the average running time of the modified algorithm with well-known results for the original method shows that a substantial speed up for most of the symbol distributions has been achieved. Finally, we show that the distribution of the symbols can be approximated to a high precision using a random sample of sublinear size.

Full Text