We study two pattern matching problems that are motivated by applications in computational biology. In the Closest Substring problem k strings $s_1,\dots, s_k$ are given, and the task is to find a string s of length L such that each string $s_i$ has a consecutive substring of length L whose distance is at most d from s. We present two algorithms that aim to be efficient for small fixed values of d and k: for some functions f and g, the algorithms have running time $f(d)\cdot n^{O(\log d)}$ and $g(d,k)\cdot n^{O(\log\log k)}$, respectively. The second algorithm is based on connections with the extremal combinatorics of hypergraphs. The Closest Substring problem is also investigated from the parameterized complexity point of view. Answering an open question from [P. A. Evans, A. D. Smith, and H. T. Wareham, Theoret. Comput. Sci., 306 (2003), pp. 407–430, M. R. Fellows, J. Gramm, and R. Niedermeier, Combinatorica, 26 (2006), pp. 141–167, J. Gramm, J. Guo, and R. Niedermeier, Lecture Notes in Comput. Sci. 2751, Springer, Berlin, 2003, pp. 195–209, J. Gramm, R. Niedermeier, and P. Rossmanith, Algorithmica, 37 (2003), pp. 25–42], we show that the problem is W[1]-hard even if both d and k are parameters. It follows as a consequence of this hardness result that our algorithms are optimal in the sense that the exponent of n in the running time cannot be improved to $o(\log d)$ or to $o(\log \log k)$ (modulo some complexity-theoretic assumptions). Consensus Patterns is the variant of the problem where, instead of the requirement that each $s_i$ has a substring that is of distance at most d from s, we have to select the substrings in such a way that the average of these k distances is at most $\delta$. By giving an $f(\delta)\cdot n^9$ time algorithm, we show that the problem is fixed-parameter tractable. This answers an open question from [M. R. Fellows, J. Gramm, and R. Niedermeier, Combinatorica, 26 (2006), pp. 141–167].
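To make the two problem definitions concrete, the following is a minimal sketch of checkers for a candidate string s, together with a naive exhaustive search. It is not one of the algorithms announced above. It assumes that "distance" means Hamming distance and that the alphabet is passed explicitly; all function names are hypothetical and chosen only for illustration. The exhaustive search tries all $|\Sigma|^L$ candidates, so it is useful only for tiny instances.

```python
from itertools import product
from math import inf


def hamming(a, b):
    """Number of positions where equal-length strings a and b differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def best_window_distance(center, s):
    """Smallest Hamming distance between center and any length-L window of s."""
    L = len(center)
    # default=inf covers the case where s is shorter than L (no window exists).
    return min((hamming(center, s[i:i + L]) for i in range(len(s) - L + 1)),
               default=inf)


def is_closest_substring(center, strings, d):
    """Closest Substring condition: every s_i has a window within distance d of center."""
    return all(best_window_distance(center, s) <= d for s in strings)


def is_consensus_pattern(center, strings, delta):
    """Consensus Patterns condition: the average of the k best distances is at most delta."""
    total = sum(best_window_distance(center, s) for s in strings)
    return total <= delta * len(strings)


def brute_force_closest_substring(strings, L, d, alphabet):
    """Naive search over all |alphabet|^L candidates; illustration only."""
    for cand in product(alphabet, repeat=L):
        center = "".join(cand)
        if is_closest_substring(center, strings, d):
            return center
    return None
```

For example, with the strings `abcab`, `bbcaa`, and `cacab`, the candidate `cab` satisfies the Closest Substring condition for d = 1 but not for d = 0, because `bbcaa` contains no exact occurrence of `cab`.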