Abstract

Motivation: With the rapid growth in the number of newly sequenced genomes, species tree inference from multiple genes has become a basic bioinformatics task in comparative and evolutionary biology. However, accurate species tree estimation is difficult in the presence of gene tree discordance, which is often due to incomplete lineage sorting (ILS), modelled by the multi-species coalescent. Several highly accurate coalescent-based species tree estimation methods have been developed over the last decade, including MP-EST. However, the running time of MP-EST increases rapidly as the number of species grows.

Results: We present divide-and-conquer techniques that improve the scalability of MP-EST so that it can run efficiently on large datasets. Surprisingly, these techniques also improve the accuracy of species trees estimated by MP-EST, as our study shows on a collection of simulated and biological datasets.

Highlights

  • A standard approach to species tree estimation uses multiple loci and concatenates alignments for each locus into a super-matrix, which is used to estimate the species tree

  • This technique improves the accuracy of species trees estimated by MP-EST, as our study shows on a collection of simulated and biological datasets

  • The vast majority of the running time for both the Disk-Covering Methods (DCMs)-boosted and short subtree graph (SSG)-boosted versions of MP-EST is spent computing the starting tree and running MP-EST on subsets; all the other steps completed in seconds when run sequentially


Summary

Introduction

A standard approach to species tree estimation uses multiple loci and concatenates alignments for each locus into a super-matrix, which is used to estimate the species tree. Some of these coalescent-based methods are fast enough to be used with phylogenomic datasets that contain hundreds or thousands of genes and more than 30 or so species. The other type of coalescent-based method is called a “summary method” because such methods estimate species trees by combining estimated gene trees. Summary methods tend to be much faster than the fully-parametric methods, and some of them (e.g., MP-EST [5]) can be used with hundreds to thousands of genes.
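
The concatenation step mentioned at the start of this paragraph can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `concatenate_alignments`, the dictionary-based alignment representation, the toy sequences, and the choice to pad taxa missing from a locus with gap characters are all assumptions made for illustration.

```python
def concatenate_alignments(loci, missing="-"):
    """Join per-locus alignments into one super-matrix.

    Each locus is a dict mapping taxon name -> aligned sequence.
    A taxon absent from a locus is padded with `missing` characters
    (an assumption of this sketch, not something the paper specifies).
    """
    # Every taxon that appears in at least one locus, in a stable order.
    taxa = sorted({taxon for alignment in loci for taxon in alignment})
    parts = {taxon: [] for taxon in taxa}

    for alignment in loci:
        # All sequences within one locus must share the same aligned length.
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        (length,) = lengths
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, missing * length))

    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy input: two loci, the second lacking one taxon.
loci = [
    {"human": "ACGT", "chimp": "ACGA", "gorilla": "ACTA"},
    {"human": "GG", "chimp": "GC"},
]
supermatrix = concatenate_alignments(loci)
```

Every row of the resulting super-matrix has the same length, so it can be passed to a single tree estimation run. Summary methods instead estimate one gene tree per locus and then combine those trees.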

