A linear size index for approximate pattern matching

Ho-Leung Chan, Tak-Wah Lam, Wing-Kin Sung, Siu-Lung Tam, Swee-Seong Wong

doi:10.1016/j.jda.2011.04.004


Open Access


Abstract

This paper revisits the problem of indexing a text $S[1..n]$ for pattern matching with up to $k$ errors. A naive solution either has a worst-case matching time complexity of $\Omega(m^k)$ or requires $\Omega(n^k)$ space, where $m$ is the length of the pattern. Devising a solution with better performance has been a challenge until Cole et al. (2004) [5] showed an $O(n \log^k n)$-space index that can support $k$-error matching in $O(m + occ + \log^k n \log\log n)$ time, where $occ$ is the number of occurrences. Motivated by the indexing of long sequences like DNA, we have investigated the feasibility of devising a linear-size index that still has a time complexity linear in pattern length. This paper in particular presents an $O(n)$-space index that supports $k$-error matching in $O(m + occ + (\log n)^{k(k+1)} \log\log n)$ worst-case time. This index can be further compressed from $O(n)$ words into $O(n)$ bits with a slight increase in the time complexity.
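To make the problem statement concrete, the following is a minimal sketch of $k$-error matching without any index. It is not the index presented in this paper: it is the classical dynamic-programming scan, which takes $O(nm)$ time per query, and it only illustrates what an index has to report. It assumes that "errors" means edit distance (insertions, deletions and substitutions), and the function name `k_error_end_positions` is hypothetical.

```python
def k_error_end_positions(text: str, pattern: str, k: int) -> list[int]:
    """Return the 1-based positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`.

    Assumption: errors are counted as edit distance. This is an index-free
    O(n*m) scan for illustration, not the paper's O(n)-space index.
    """
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any substring
    # of the text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of col[i-1] for the previous text position
        # col[0] stays 0 because a match may start at any text position.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new_val = min(
                prev_diag + cost,  # match or substitute
                col[i] + 1,        # extra character in the text
                col[i - 1] + 1,    # pattern character missing from the text
            )
            prev_diag = col[i]
            col[i] = new_val
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(k_error_end_positions("abcdefg", "cde", 0))
    print(k_error_end_positions("abcdefg", "cde", 1))
```

The scan reads the whole text for every query, which is what an index avoids: the bounds above depend on $m$, $occ$ and a polylogarithmic term in $n$ rather than on $n$ itself.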
