Few matches or almost periodicity: faster pattern matching with mismatches in compressed texts

Karl Bringmann, Marvin Künnemann, Philip Wellnitz

doi:10.5555/3310435.3310504

Abstract

A fundamental problem on strings in the realm of approximate string matching is pattern matching with mismatches: Given a text t, a pattern p, and a number k, determine whether some substring of t has Hamming distance at most k to p; such a substring is called a k-match.

As real-world texts often come in compressed form, we study the case of searching for a small pattern p in a text t that is compressed by a straight-line program. This grammar compression is popular in the string community, since it is mathematically elegant and unifies many practically relevant compression schemes such as the Lempel-Ziv family, dictionary methods, and others. We denote by m the length of p and by n the compressed size of t. While exact pattern matching, that is, the case k = 0, is known to be solvable in near-linear time O(n + m) [Jez TALG'15], despite considerable interest in the string community, the fastest known algorithm for pattern matching with mismatches runs in time [MATH HERE] [Gawrychowski, Straszak ISAAC'13], which is far from linear even for very small k.

In this paper, we obtain an algorithm for pattern matching with mismatches running in time O((n + m) poly(k)). This is near-linear in the input size for any constant (or slightly superconstant) k. We obtain analogous running time for counting and enumerating all k-matches.

Our algorithm is based on a new structural insight for approximate pattern matching, essentially showing that either the number of k-matches is very small or both text and pattern must be almost periodic. While intuitive and simple for exact matches, such a characterization is surprising when allowing k mismatches.
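To make the problem statement concrete, the following is a minimal brute-force sketch of the definitions used above. It is not the algorithm of the paper: it fully decompresses the text and checks every window, so its running time depends on the uncompressed length rather than on n. The function names (`hamming`, `k_matches`, `expand_slp`) and the rule encoding (a nonterminal maps either to a single character or to a pair of nonterminals) are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical encoding of a straight-line program (SLP):
# each nonterminal maps to a terminal character or to a pair of nonterminals.
Rule = Union[str, Tuple[str, str]]


def hamming(a: str, b: str) -> int:
    """Number of positions at which the equal-length strings a and b differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def k_matches(t: str, p: str, k: int) -> List[int]:
    """Start positions i where t[i:i+len(p)] has Hamming distance <= k to p."""
    m = len(p)
    return [i for i in range(len(t) - m + 1) if hamming(t[i:i + m], p) <= k]


def expand_slp(rules: Dict[str, Rule], start: str) -> str:
    """Decompress the string derived from `start` (naive, memoized recursion)."""
    cache: Dict[str, str] = {}

    def expand(symbol: str) -> str:
        if symbol not in cache:
            rule = rules[symbol]
            if isinstance(rule, str):
                cache[symbol] = rule  # terminal rule
            else:
                left, right = rule
                cache[symbol] = expand(left) + expand(right)
        return cache[symbol]

    return expand(start)


if __name__ == "__main__":
    # Toy SLP with 5 rules deriving the periodic text "ababab".
    rules: Dict[str, Rule] = {
        "A": "a",
        "B": "b",
        "C": ("A", "B"),
        "D": ("C", "C"),
        "E": ("D", "C"),
    }
    t = expand_slp(rules, "E")
    print(t)                       # ababab
    print(k_matches(t, "abb", 0))  # []
    print(k_matches(t, "abb", 1))  # [0, 2]
```

In this toy instance the pattern "abb" has no exact occurrence in the text, but allowing one mismatch yields 1-matches at positions 0 and 2. An algorithm of the kind described in the abstract would avoid the explicit decompression step performed by `expand_slp`.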
