Abstract

Background

We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.

Results

We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees, which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle datasets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.

Conclusions

Although this heuristic is not guaranteed to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical datasets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
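To make the notion of a frequent agreement subtree concrete, the following minimal sketch measures how many input trees agree on the topology induced by a given set of taxa. It is an illustration under stated assumptions, not the heuristic described above: trees are assumed to be rooted and encoded as nested tuples of leaf names, and the helper names `restricted_clusters` and `support` are hypothetical. It only evaluates one candidate taxon set and does not perform the seed, combine, or post-processing phases.

```python
from collections import Counter


def restricted_clusters(tree, keep):
    """Clusters (leaf sets below each internal node) of `tree`, restricted to `keep`.

    Assumption: `tree` is a rooted tree given as nested tuples of leaf names,
    and `keep` is a frozenset of leaf names.
    """
    found = set()

    def visit(node):
        if isinstance(node, str):            # leaf: retained only if it is in `keep`
            return frozenset([node]) & keep
        leaves = frozenset().union(*(visit(child) for child in node))
        if len(leaves) >= 2:                 # singletons and empty sets carry no topology
            found.add(leaves)
        return leaves

    visit(tree)
    return frozenset(found)


def support(trees, taxa):
    """Fraction of `trees` that share the most common induced topology on `taxa`."""
    keep = frozenset(taxa)
    counts = Counter(restricted_clusters(t, keep) for t in trees)
    return max(counts.values()) / len(trees)


# Three toy rooted trees on the same four taxa (illustrative data only).
trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
]

print(support(trees, {"A", "B", "C"}))       # two of the three trees agree
print(support(trees, {"A", "B", "C", "D"}))  # no two trees agree on all four taxa
```

With an assumed frequency threshold of one half, the taxon set {A, B, C} would count as a frequent agreement subtree for these toy trees, while the full taxon set would not. Comparing restricted cluster sets works here because a rooted topology is determined by its clusters; unrooted trees would need a different comparison, such as bipartitions.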

Highlights

  • We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees

  • Although our heuristic is not guaranteed in theory to find all MFASTs or the largest MFAST, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result

  • We identify small subtrees from the input trees that have a potential to be a part of an MFAST

Summary

Introduction

We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Support for a tree is often determined by methods such as nonparametric bootstrapping [4], jackknifing [5], or Bayesian MCMC sampling (e.g., [6]), which generate a collection of trees with identical taxa representing the range of possible phylogenetic relationships. These trees can be summarized in a consensus tree (see [7]).
