Constrained sequence analysis algorithms in computational biology

Effat Farhana, M. Sohel Rahman

doi:10.1016/j.ins.2014.10.019

Abstract

Knowledge of the similarity of two or more sequences is crucial in computational molecular biology. The longest common subsequence (LCS) is a well-known and widely used measure of sequence similarity. Constrained variants of the LCS problem have been studied in the literature, where knowledge of the functionalities or structures of the input sequences is provided in the form of inclusion/exclusion constraint patterns. In this paper, we focus on different variants of the LCS problem involving multiple input sequences and constraint patterns. Given L input sequences and ℓ constraint patterns, the goal is to construct an LCS of the given sequences such that each of the constraint patterns occurs/does not occur in the LCS as a subsequence/substring. We devise efficient finite-automata-based algorithms for all the variants of the problem that run in O(|Σ|(R+L)+nL+|Σ|Rnℓ) time, where R is the size of the resulting subsequence automaton, n is the length of each input sequence and Σ is the underlying alphabet. We also conduct an extensive experimental study to evaluate the practical performance of our algorithms. The experimental results suggest the superiority of our finite-automata-based algorithms. Therefore, we believe that our automata-based algorithms will be useful in practical sequence analysis in computational biology and will replace the existing algorithms, which are mostly based on memory-intensive dynamic programming.
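To make the problem statement concrete, the following is a minimal brute-force sketch of one variant (inclusion as a subsequence, exclusion as a substring). It is not the finite-automata-based algorithm described in this paper: it enumerates candidates exhaustively and is only practical for very short inputs. The function names and parameters (`constrained_lcs`, `include_subseq`, `exclude_substr`) are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def is_subsequence(pattern, text):
    """Return True if pattern occurs in text as a (not necessarily contiguous) subsequence."""
    it = iter(text)
    return all(ch in it for ch in pattern)


def subsequences(s):
    """All distinct subsequences of s. Exponential in len(s); for tiny inputs only."""
    return {
        "".join(s[i] for i in idx)
        for k in range(len(s) + 1)
        for idx in combinations(range(len(s)), k)
    }


def constrained_lcs(sequences, include_subseq=(), exclude_substr=()):
    """Brute-force constrained LCS (illustrative assumption, not the paper's method).

    Returns a longest string that is a subsequence of every input sequence,
    contains every pattern in include_subseq as a subsequence, and contains
    no pattern in exclude_substr as a substring. Ties are broken by taking
    the lexicographically largest candidate. Returns None if no candidate
    satisfies the constraints.
    """
    # Any common subsequence is a subsequence of the shortest input.
    candidates = subsequences(min(sequences, key=len))
    feasible = [
        c for c in candidates
        if all(is_subsequence(c, s) for s in sequences)
        and all(is_subsequence(p, c) for p in include_subseq)
        and all(p not in c for p in exclude_substr)
    ]
    return max(feasible, key=lambda c: (len(c), c), default=None)


if __name__ == "__main__":
    seqs = ["ABCBDAB", "BDCABA"]
    print(constrained_lcs(seqs))                          # unconstrained LCS
    print(constrained_lcs(seqs, exclude_substr=("DA",)))  # "DA" must not appear as a substring
    print(constrained_lcs(seqs, include_subseq=("DC",)))  # no feasible answer exists
```

The sketch only pins down what a valid answer looks like under the stated constraints; it says nothing about how the automaton-based algorithms achieve the running time quoted above.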

Full Text