Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments

Michael L Sierk,Michael E Smoot,Ellen J Bass,William R Pearson

doi:10.1186/1471-2105-11-146

Michael L Sierk, Michael E Smoot + Show 2 more

Open Access

https://doi.org/10.1186/1471-2105-11-146

Copy DOI

Abstract

BackgroundWhile the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins - proteins that share a common ancestor and a similar structure; pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate.ResultsWe compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance.ConclusionsThe pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.

Highlights

While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins - proteins that share a common ancestor and a similar structure; pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates
Since the quality of the final 3D-model depends on the alignment of the unknown sequence to the structural template, we focus on improving the quality of alignments between proteins that share statistically significant similarity, and 20% to 40% sequence identity [3,4]
Our results suggest that for proteins with moderately significant sequence similarity, sequence alignments can often be within the range of different structural alignments

Summary

Introduction

While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins - proteins that share a common ancestor and a similar structure; pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Homology -common evolutionary ancestry - can be reliably inferred for proteins that share statistically significant sequence similarity. While the inference of homology is quite robust (proteins that share significant similarity in pairwise alignments always have similar structures), [1] some of the more detailed functional inferences are critically dependent upon the quality of the alignment between the two sequences. For proteins that are very similar (>60% identity), functional inferences are usually very accurate, but for more distantly related proteins, ambiguity in the alignment of poorly conserved regions can lead to errors [2]. Since the quality of the final 3D-model depends on the alignment of the unknown sequence to the structural template, we focus on improving the quality of alignments between proteins that share statistically significant similarity, and 20% to 40% sequence identity [3,4]

Objectives

Methods

Results

Conclusion