On approximating string selection problems with outliers

Christina Boucher,Gad M Landau,Avivit Levy,David Pritchard,Oren Weimann

doi:10.1016/j.tcs.2013.05.037

Abstract

Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close to Most Strings problem is, given a set S of the same-length strings, and a parameter d, find a string x that maximizes the number of “non-outliers” within Hamming distance d of x. We prove that this problem has no polynomial-time approximation scheme (PTAS) unless NP has randomized polynomial-time algorithms, correcting a decade-old erroneous proof made previously in the literature. The Most Strings with Few Bad Columns problem is to find a maximum-size subset of input strings so that the number of non-identical positions is at most k; we show it has no PTAS unless P=NP. We also observe Closest to k Strings has no efficient PTAS (EPTAS) unless a parameterized complexity hierarchy collapses. In sum, outliers help model problems associated with using biological data, but we show the problem of finding an approximate solution is computationally difficult.

Full Text