Instability in progressive multiple sequence alignment algorithms.

Kieran Boyce, Desmond G Higgins, Fabian Sievers

doi:10.1186/s13015-015-0057-1

Open Access

https://doi.org/10.1186/s13015-015-0057-1

Journal: Algorithms for Molecular Biology
Publication Date: Oct 9, 2015
Citations: 35
License type: cc-by

Affiliation: University College Dublin

Abstract

Background: Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time.

Results: We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable, and simply reversing the order of the sequences in the input file will cause a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets in the order of one hundred sequences. We also outline the means to determine the number of sequences in a data set beyond which the probability of instability will become more pronounced.

Conclusions: This has major ramifications for both the designers of large-scale multiple sequence alignment algorithms, and for the users of these alignments.

Highlights

Progressive alignment is the standard approach used to align large numbers of sequences
The power of progressive multiple sequence alignment may come from the fact that “more similar” sequences are aligned first: “...assuming that in progressive alignment, the best accuracy is obtained at each node by aligning the two profiles that have fewest differences, even if they are not evolutionary neighbours” [3]
This paper examines the impact of the tradeoff of accuracy for speed in the construction of the guide trees in protein progressive multiple sequence alignment

Summary

Introduction

Progressive alignment is the standard approach used to align large numbers of sequences. The creation of a multiple sequence alignment is a routine step in the analysis of homologous genes or proteins. For aligning more than a few hundred sequences, most methods use a heuristic approach termed “progressive alignment” by Feng and Doolittle [1]. This is a two-stage process: first, a guide tree [2] is created by clustering the sequences based on some distance or similarity measure; second, the branching structure of the guide tree is used to order the pairwise alignment of sequences. All sequences are compared to each other to generate a matrix of distance measures.
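The first stage described above (a distance matrix followed by clustering into a guide tree) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method of any particular alignment program: the distance is a simple Jaccard distance on k-mer sets, the clustering is a plain UPGMA-style average linkage, and the names `kmer_distance` and `guide_tree` are hypothetical. The sketch also shows one way in which a guide tree could come to depend on input order, namely when two pairs of sequences have exactly the same distance and the tie is resolved by position in the input. Whether this matches the loss of information examined in the paper is an assumption of the example.

```python
from itertools import combinations


def kmer_distance(a, b, k=2):
    """Jaccard distance between the k-mer sets of two sequences.

    Illustrative measure only; assumes both sequences have length >= k.
    """
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return 1.0 - len(ka & kb) / len(ka | kb)


def guide_tree(names, seqs, k=2):
    """UPGMA-style clustering of sequences with unique names.

    Returns the tree as nested frozensets, so two trees compare equal
    exactly when they have the same branching structure.
    """
    n = len(names)
    trees = {i: names[i] for i in range(n)}
    sizes = {i: 1 for i in range(n)}
    # Stage 1a: all-against-all distance matrix, keyed by (smaller id, larger id).
    dist = {(i, j): kmer_distance(seqs[i], seqs[j], k)
            for i, j in combinations(range(n), 2)}
    next_id = n
    # Stage 1b: repeatedly merge the closest pair of clusters.
    while len(trees) > 1:
        # min() returns the first minimal pair it meets, so tied distances
        # are resolved by the position of the sequences in the input.
        i, j = min(dist, key=dist.get)
        for c in [c for c in trees if c not in (i, j)]:
            d_i = dist.pop((min(i, c), max(i, c)))
            d_j = dist.pop((min(j, c), max(j, c)))
            # Size-weighted average distance from the merged cluster to c.
            dist[(c, next_id)] = (sizes[i] * d_i + sizes[j] * d_j) / (sizes[i] + sizes[j])
        del dist[(i, j)]
        trees[next_id] = frozenset({trees.pop(i), trees.pop(j)})
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))


# Made-up sequences chosen so that d(seqA, seqB) == d(seqB, seqC) == 0.5.
names = ["seqA", "seqB", "seqC"]
seqs = ["ACDE", "CDEF", "DEFG"]

forward = guide_tree(names, seqs)
backward = guide_tree(names[::-1], seqs[::-1])
print(forward == backward)  # False: the tie is broken differently
```

With these three made-up sequences, the forward order joins seqA with seqB first, while the reversed order joins seqC with seqB first. Because the second stage aligns sequences in the order given by the guide tree, a different tree can lead to a different final alignment.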

Methods

Results

Conclusion