Abstract
Motivation Species tree estimation from gene trees can be complicated by gene duplication and loss, and “gene tree parsimony” (GTP) is one approach for estimating species trees from multiple gene trees. In its standard formulation, the objective is to find a species tree that minimizes the total number of gene duplications and losses with respect to the input set of gene trees. Although much is known about GTP, little is known about how to treat inputs containing some incomplete gene trees (i.e., gene trees lacking one or more of the species).ResultsWe present new theory for GTP considering whether the incompleteness is due to gene birth and death (i.e., true biological loss) or taxon sampling, and present dynamic programming algorithms that can be used for an exact but exponential time solution for small numbers of taxa, or as a heuristic for larger numbers of taxa. We also prove that the “standard” calculations for duplications and losses exactly solve GTP when incompleteness results from taxon sampling, although they can be incorrect when incompleteness results from true biological loss. The software for the DP algorithm is freely available as open source code at https://github.com/smirarab/DynaDup.
Highlights
The estimation of species trees is often performed by estimating multiple sequence alignments for some collection of genes, concatenating these alignments into one supermatrix, and estimating a tree on the resultant supermatrix
This paper showed that different interpretations of incompleteness can impact the way that these reconciliation costs should be calculated, and need to be taken into account when using Gene Tree Parsimony to construct species trees from gene trees
We presented a dynamic programming algorithm that provably finds an optimal species tree given a set of gene trees under the GDL model within a constrained search space, treating incompleteness as due to true biological loss
Summary
The estimation of species trees is often performed by estimating multiple sequence alignments for some collection of genes, concatenating these alignments into one supermatrix, and estimating a tree (often using maximum likelihood or a Bayesian technique) on the resultant supermatrix. Since the gene must be lost on each of those edges, and the total number of losses is the sum of this value and the number of losses that occur within the subtree rooted at M(r(gt)), it follows that L∗bd(gt, ST ) = Lbd(gt, ST ) + |UMMC(gt, ST )|. We derive a different approach for the GTP problems, treating incomplete gene trees as due to true biological loss (i.e., minimizing Lbd(gt, ST ) or L∗bd(gt, ST )).
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.