Abstract

Statistical methods for phylogeny estimation, especially maximum likelihood (ML), offer high accuracy with excellent theoretical properties. However, RAxML, the current leading method for large-scale ML estimation, can require weeks or longer when used on datasets with thousands of molecular sequences. Faster methods for ML estimation, among them FastTree, have also been developed, but their performance relative to RAxML is not yet fully understood. In this study, we explore the performance of FastTree and RAxML with respect to ML score, running time, and topological accuracy on thousands of alignments (based on both simulated and biological nucleotide datasets) with up to 27,634 sequences. We find that when RAxML and FastTree are constrained to the same running time, FastTree produces topologically much more accurate trees in almost all cases. We also find that when RAxML is allowed to run to completion, it provides an advantage over FastTree in terms of the ML score, but does not produce substantially more accurate tree topologies. Interestingly, the relative accuracy of trees computed using FastTree and RAxML depends in part on the accuracy of the sequence alignment and on dataset size, so that FastTree can be more accurate than RAxML on large datasets with relatively inaccurate alignments. Finally, the running times of RAxML and FastTree are dramatically different: when run to completion, RAxML can take several orders of magnitude longer than FastTree. Thus, our study shows that very large phylogenies can be estimated very quickly using FastTree, with little (and in some cases no) degradation in tree accuracy compared to RAxML.

Highlights

  • Phylogeny estimation is an important part of much biological research

  • The study showed that RAxML produced better maximum likelihood (ML) scores than both FastTree and RAxML-Limited, and topologically more accurate trees than RAxML-Limited, in almost all cases

  • The relative performance of FastTree and RAxML depended upon the alignment and dataset, so that RAxML typically produced slightly more accurate trees than FastTree on the large datasets

Summary

Introduction

Phylogeny estimation is an important part of much biological research. Methods based upon stochastic models of sequence evolution (either Bayesian or maximum likelihood) have many desirable statistical properties, but are computationally the most challenging. Bayesian MCMC methods (e.g., MrBayes [1]) offer an advantage over maximum likelihood in that they provide a distribution of trees rather than a single point estimate. However, because the time needed for the MCMC analysis to converge can be very large, these methods are generally not used on datasets with more than a few hundred sequences. Large-scale statistical phylogeny estimation, with many hundreds or several thousand sequences, is therefore performed using maximum likelihood (ML). Of the many ML methods, RAxML [2,3] is the main method for large-scale ML estimation because it produces the best ML scores and does so faster than other ML methods that have comparable accuracy with respect to ML scores. Other widely used ML methods include GARLI [4], PhyML [5], and PAUP* [6], but these methods have generally not been used on very large datasets.
