Optimizing substitution matrix choice and gap parameters for sequence alignment

Robert C Edgar

doi:10.1186/1471-2105-10-396

Abstract

Background: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments.

Results: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB.

Conclusion: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.

Highlights

While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties.
The BLOSUM62 matrix in 1/3-bit units was used because it was hardcoded into the implementation of this algorithm (IPA).
POP was found to be from 0.2% to 1.3% more accurate than IPA; these improvements are typical
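
The accuracy figures quoted here are not defined in this excerpt. As a rough illustration only, the sketch below assumes that accuracy means the fraction of residue pairs in a reference alignment that a test alignment reproduces. The function names `aligned_pairs` and `pair_accuracy`, the `-` gap character, and the toy sequences are all hypothetical and are not taken from the paper or from POP.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue index pairs that share a column.

    row_a and row_b are the two gapped rows of a pairwise alignment;
    i and j are 0-based positions in the ungapped sequences.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0
    for char_a, char_b in zip(row_a, row_b):
        a_is_residue = char_a != gap
        b_is_residue = char_b != gap
        if a_is_residue and b_is_residue:
            pairs.add((i, j))
        if a_is_residue:
            i += 1
        if b_is_residue:
            j += 1
    return pairs


def pair_accuracy(test, reference):
    """Fraction of reference residue pairs also present in the test alignment.

    This definition of accuracy is an assumption made for illustration.
    """
    reference_pairs = aligned_pairs(*reference)
    if not reference_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    test_pairs = aligned_pairs(*test)
    return len(reference_pairs & test_pairs) / len(reference_pairs)


# Toy example: the same two sequences (ACDE and ADFE) aligned two ways.
reference = ("ACD-E", "A-DFE")
test = ("ACDE", "ADFE")
score = pair_accuracy(test, reference)
```

Under this assumed definition, an improvement of 0.2% to 1.3% would correspond to a slightly larger share of reference pairs being recovered on average across a benchmark.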

Summary

Introduction

While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Several heuristics are in common use, for example CLUSTALW's choice of low-identity matrices for aligning low-identity sequences [2], which have not, to the best of my knowledge, been empirically tested. One factor impeding such testing is the lack of effective automated methods for optimizing parameters for a given objective function. Previous work in this area has included unsupervised expectation maximization [3], discriminative train-



Journal: BMC Bioinformatics	Publication Date: Dec 1, 2009
Citations: 41	License type: CC BY 2.0


