Abstract

Background: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments.

Results: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving the best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB.

Conclusion: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.
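To make concrete what "gap penalties" and "substitution matrix" mean as tunable parameters of a pair-wise global alignment, the following is a minimal sketch of a standard affine-gap global alignment score (the Gotoh recurrences). It is not the POP procedure and is not taken from the paper's source code. The names `toy_score` and `global_score`, the match/mismatch values, and the default penalties are all hypothetical, and the sketch assumes that a gap of length k costs `gap_open + (k - 1) * gap_extend`.

```python
from typing import Callable

NEG_INF = float("-inf")


def toy_score(x: str, y: str) -> float:
    """Hypothetical stand-in for a substitution matrix lookup (not VTML/BLOSUM)."""
    return 2.0 if x == y else -1.0


def global_score(
    a: str,
    b: str,
    subst: Callable[[str, str], float] = toy_score,
    gap_open: float = 5.0,
    gap_extend: float = 1.0,
) -> float:
    """Best global alignment score of a and b under affine gap penalties.

    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column aligns a[i-1] with a gap
    # Y: last column aligns a gap with b[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = subst(a[i - 1], b[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            # Opening a new gap costs gap_open; continuing one costs gap_extend.
            X[i][j] = max(
                M[i - 1][j] - gap_open,
                Y[i - 1][j] - gap_open,
                X[i - 1][j] - gap_extend,
            )
            Y[i][j] = max(
                M[i][j - 1] - gap_open,
                X[i][j - 1] - gap_open,
                Y[i][j - 1] - gap_extend,
            )
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # The same pair scores differently as the gap-open penalty changes.
    print(global_score("ACGT", "AGT", gap_open=5.0))
    print(global_score("ACGT", "AGT", gap_open=1.0))
```

Under these assumptions, the score of a fixed pair of sequences depends on `subst`, `gap_open`, and `gap_extend` together, which is why gap penalties have to be tuned for each substitution matrix rather than chosen once.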

Highlights

  • While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties

  • The BLOSUM62 matrix in 1/3 bit units was used because it was hardcoded into the implementation of this algorithm (IPA)

  • POP was found to be from 0.2% to 1.3% more accurate than IPA; these improvements are typical


Summary

Introduction

While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Several heuristics are in common use, for example CLUSTALW's choice of low-identity matrices for aligning low-identity sequences [2], which to the best of my knowledge have not been empirically tested. One factor impeding such testing is the lack of effective automated methods for optimizing parameters for a given objective function. Previous work in this area has included unsupervised expectation maximization [3], discriminative train-
