OCTAL: Optimal Completion of gene trees in polynomial time

Sarah Christensen, Erin K. Molloy, Tandy Warnow, Pranjal Vachaspati

doi:10.1186/s13015-018-0124-5

Open Access

Abstract

Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable.

Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson–Foulds (RF) distance, and we prove that OCTAL runs in O(n^2) time, where n is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree.

Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL’s accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.
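To make the problem statement in the abstract concrete, the following is a minimal sketch of the RF distance and of a brute-force completion for a single missing leaf. It is not the OCTAL algorithm: it simply tries every edge, whereas OCTAL is the polynomial-time method described in the paper. The nested-tuple tree encoding and the function names (`leaves`, `bipartitions`, `rf_distance`, `insertions`, `complete_one_leaf`) are assumptions made for illustration, and the RF distance is taken here as the number of non-trivial bipartitions present in exactly one of the two trees.

```python
from itertools import chain


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the unrooted tree.

    Each bipartition is stored as the side that does NOT contain the
    alphabetically smallest leaf, so the arbitrary rooting of the
    nested-tuple encoding does not affect the result.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(t1, t2):
    """Number of bipartitions found in exactly one of the two trees."""
    assert leaves(t1) == leaves(t2), "trees must share the same leaf set"
    return len(bipartitions(t1) ^ bipartitions(t2))


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    if isinstance(tree, str):
        return
    for i, child in enumerate(tree):
        # Attach the new leaf on the edge directly above this child.
        yield tree[:i] + ((child, leaf),) + tree[i + 1:]
        # Or attach it on an edge deeper inside this child.
        for new_child in insertions(child, leaf):
            yield tree[:i] + (new_child,) + tree[i + 1:]


def complete_one_leaf(partial, reference, leaf):
    """Brute force: pick the insertion that minimizes RF distance to `reference`."""
    return min(insertions(partial, leaf), key=lambda t: rf_distance(t, reference))


# Hypothetical five-taxon example: the gene tree is missing species "E".
reference = (("A", "B"), ("C", ("D", "E")))
partial = (("A", "B"), ("C", "D"))
completed = complete_one_leaf(partial, reference, "E")
print(completed, rf_distance(completed, reference))
```

In this toy example the partial tree already agrees with the reference on the shared leaves, so one of the candidate insertions reproduces the reference topology. Trying every edge for every missing leaf quickly becomes expensive as the number of missing leaves grows, which is the motivation for a dedicated algorithm with a proven O(n^2) bound.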

Highlights

For a combination of reasons, estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest
The median matching distance was 18 for ASTRAL-II and 15 for OCTAL (Fig. 3), and the improvement obtained by using OCTAL over ASTRAL-II was statistically significant in 19 out of 20 of the replicates
OCTAL improves more genes than ASTRAL-II except in the higher incomplete lineage sorting (ILS) conditions with the RF distance criterion, in which case OCTAL and ASTRAL-II are more often equal in their performance
The median quartet distance was 7% for ASTRAL-II and 6% for OCTAL (Fig. 3), and the improvement obtained by using OCTAL over ASTRAL-II was statistically significant in 18 out of 20 of the replicates

Summary

Introduction

For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. Biological processes such as gene duplication and loss, incomplete lineage sorting (ILS), and gene flow produce heterogeneity across the genome but are still properly modeled by a single species tree [7, 8]. In such cases, species tree estimation methods should be robust to heterogeneity across the genome. One popular approach for estimating species trees under the multi-species coalescent (MSC) model is to estimate trees on individual loci and combine these gene trees into a species tree. Some of these “summary methods”, such as ASTRAL-II [10] and ASTRID [11], have been shown to scale well to datasets

Results

Discussion

Conclusion