An improved algorithm for the regular expression constrained multiple sequence alignment problem

Abdullah N Arslan, Dan He

doi:10.1109/bibe.2006.253324

Abstract

Constrained sequence alignment has been proposed as a way of incorporating biologists' knowledge about common structures or functions into the alignment process. For the alignment of protein sequences, several studies have suggested using motifs (restricted regular expressions) from the PROSITE database to guide alignments. Regular expression constrained sequence alignment was introduced for this purpose. An alignment satisfies the constraint if part of it matches a given regular expression in each dimension (i.e., in each aligned sequence). One existing method rewards alignments that include a region matching the given regular expression, but it does not always guarantee that the constraint is satisfied. Another method constructs a weighted finite automaton from the given regular expression and presents a dynamic programming solution that simulates copies of this automaton to find a maximum-score alignment satisfying the regular expression constraint. We propose a new algorithm for the regular expression constrained multiple sequence alignment problem. Our algorithm considers two layers, each of which corresponds to part of the dynamic programming matrix for the alignment of the given sequences. We compute each layer differently using dynamic programming. We propose the following modification to the definition of the problem: the region satisfying the constraint does not contribute to the total score. This modification is not necessary for correctness or performance in certain cases, such as when the constraint involves only one motif or when motif-matching regions span a short distance in each sequence, but we believe that with it we achieve the same goal with less work in practice.
Our algorithm is much more efficient than a previously proposed algorithm that uses weighted automata, and its performance in practice is comparable to (and under certain conditions even better than) that of the ordinary (unconstrained) multiple sequence alignment algorithm. Our experiments on real biological sequences and on regular expressions, each composed of a sequence of motifs, verify this.
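To make the constraint in the abstract concrete, the following is a minimal sketch of a checker that tests whether a finished alignment satisfies a regular expression constraint. It is not the algorithm proposed in the paper, only a brute-force illustration of the definition. The function name `satisfies_constraint`, the gap symbol `-`, and the example motif `C.{2}C` (a Python rendering of a PROSITE-style pattern such as C-x(2)-C) are assumptions made for illustration.

```python
import re


def satisfies_constraint(rows, pattern, gap="-"):
    """Return the first column window (start, end) in which every row,
    with gap symbols removed, fully matches `pattern`; None otherwise.

    `rows` is a list of equal-length strings, one per aligned sequence.
    This is an illustrative brute-force check, not the paper's method.
    """
    if not rows or len({len(r) for r in rows}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")

    regex = re.compile(pattern)
    width = len(rows[0])

    # Try every contiguous range of alignment columns.
    for start in range(width):
        for end in range(start + 1, width + 1):
            # The window qualifies only if it matches in each dimension,
            # i.e. in every aligned sequence.
            if all(regex.fullmatch(r[start:end].replace(gap, "")) for r in rows):
                return (start, end)
    return None


# Hypothetical two-sequence alignment and motif, chosen for illustration.
alignment = ["AC-GHCA", "ACWG-CT"]
window = satisfies_constraint(alignment, r"C.{2}C")
```

Here the window covers columns 1 to 5, where the gap-stripped rows read `CGHC` and `CWGC`. The check examines every column range, so it is only suitable for validating small alignments; finding an optimal alignment that satisfies the constraint requires the dynamic programming approach the abstract describes.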

Full Text