Closest string with outliers.

Christina Boucher, Bin Ma

doi:10.1186/1471-2105-12-s1-s55

Abstract

Background: Given n strings s1, …, sn each of length ℓ and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many – but not necessarily all – input strings is an important task that plays a role in many applications in bioinformatics.

Results: Although the closest string model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the CLOSEST STRING WITH OUTLIERS (CSWO) problem, to overcome this limitation. This new model asks for a center string s that is within Hamming distance d of at least n – k of the n input strings, where k is a parameter describing the maximum number of outliers. A CSWO solution not only provides the center string as a representative for the set of strings but also reveals the outliers of the set. We provide fixed-parameter algorithms for CSWO when d and k are parameters, for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to n – k, ℓ, and d.

Conclusions: Our refined model is an abstraction of the task of finding common patterns in several, but not all, input strings. We initiate the study of the computability of this model and show that it is sensitive to different parameterizations. We conclude by suggesting several open problems that warrant further investigation.

Highlights

Given n strings s1, ..., sn each of length ℓ and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s.
We show that, when the alphabet is unbounded, CLOSEST STRING WITH OUTLIERS (CSWO) is W[1]-hard with respect to every combination of the parameters ℓ, d, and n* = n – k, and is therefore not fixed-parameter tractable under any subset of these parameters unless FPT = W[1].
Analogous parameterized complexity studies have been performed for the CLOSEST STRING problem and the CLOSEST SUBSTRING problem.

Summary

Introduction

Given n strings s1, ..., sn, each of length ℓ, and a nonnegative integer d, the CLOSEST STRING problem asks for a center string s such that no input string has Hamming distance greater than d from s. Finding a common pattern in many – but not necessarily all – input strings is an important task in many bioinformatics applications. The CLOSEST STRING problem formalizes this task as a decision problem: given a set S of n strings of length ℓ over an alphabet Σ and a parameter d, determine whether there exists a string s with Hamming distance at most d from every string in S. Since its introduction, both efficient polynomial-time approximation algorithms and exact exponential-time algorithms for the CLOSEST STRING problem have been studied extensively [2,11,12,13,14,15,16].
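
To make the definition concrete, the following is a minimal brute-force sketch in Python. It is not one of the paper's algorithms. The function names `hamming` and `find_center` are hypothetical, and the optional argument `k` (the number of tolerated outliers, as in the CSWO model described in the abstract) defaults to 0, which gives the plain CLOSEST STRING problem. The search enumerates every candidate string over the alphabet, so it is exponential in ℓ and only suitable for tiny instances.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_center(strings, d, k=0, alphabet=None):
    """Exhaustive search for a center string (illustrative only).

    Returns (center, outliers), where outliers are the input strings at
    Hamming distance greater than d from center and there are at most k
    of them, or None if no such center exists.
    """
    if not strings:
        return None
    length = len(strings[0])
    if alphabet is None:
        # Assumption: symbols absent from the input never help, so the
        # candidate alphabet is restricted to symbols that actually occur.
        alphabet = sorted(set("".join(strings)))
    for candidate in product(alphabet, repeat=length):
        center = "".join(candidate)
        outliers = [s for s in strings if hamming(center, s) > d]
        if len(outliers) <= k:
            return center, outliers
    return None


if __name__ == "__main__":
    data = ["ACGT", "ACGA", "ACTT", "TTTT"]
    print(find_center(data, d=1))        # no center covers all four strings
    print(find_center(data, d=1, k=1))   # one outlier is tolerated
```

With k = 0 the example instance has no solution, because "ACGT" and "TTTT" differ in three positions and no string can be within distance 1 of both. Allowing one outlier yields a center and identifies "TTTT" as the string left uncovered.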
