Using Robinson-Foulds supertrees in divide-and-conquer phylogeny estimation

Xilin Yu, Erin K Molloy, Sarah A Christensen, Thien Le, Tandy Warnow

doi:10.1186/s13015-021-00189-2

Abstract

One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a “supertree method”. Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. Exact-RFS-2 is available in open source form on GitHub at https://github.com/yuxilin51/GreedyRFS.

Highlights

Supertree construction is a natural algorithmic problem that has important applications to computational biology; see [1] for a 2004 book on the subject and [2,3,4,5,6,7,8,9] for some of the recent papers on this subject
Supertree methods are important for large-scale phylogeny estimation, where they can be used as the final step in a divide-and-conquer pipeline [10]: the species set is divided into two or more overlapping subsets, unrooted leaf-labelled trees are constructed on each subset, and these subset trees are combined into a tree on the full dataset using the selected supertree method
We present Exact-RFS-2, a polynomial-time algorithm for the Robinson-Foulds Supertree (RFS) of two trees, which establishes that RFS is solvable in O(n²|X|) time for two trees, where n is the number of leaves and X is the set of shared leaves (Theorem 1)
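
To make the RFS criterion concrete, the following minimal Python sketch scores a candidate supertree against a set of input trees. It assumes the standard definition: the score is the sum, over the input trees, of the Robinson-Foulds distance between each input tree and the candidate restricted to that tree's leaf set. The nested-tuple tree encoding and the function names (`leaf_set`, `restrict`, `bipartitions`, `rf_distance`, `rfs_score`) are illustrative assumptions, not part of the authors' software, and the sketch only evaluates the criterion; it is not Exact-RFS-2 and does not search for an optimal supertree.

```python
# Trees are nested tuples of leaf names (strings); the outermost tuple is an
# arbitrary rooting of an unrooted tree. This encoding is a hypothetical
# convenience for the example, not the format used by the authors' code.

def leaf_set(tree):
    """Return the set of leaf labels below `tree`."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def restrict(tree, keep):
    """Restrict `tree` to the leaves in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # a node with one remaining child is suppressed
    return tuple(kept)

def bipartitions(tree):
    """Return the non-trivial bipartitions (both sides of size >= 2)."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            # An unordered pair of sides, so the rooting does not matter.
            found.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return found

def rf_distance(t1, t2):
    """Robinson-Foulds distance: bipartitions present in exactly one tree."""
    if leaf_set(t1) != leaf_set(t2):
        raise ValueError("RF distance requires identical leaf sets")
    return len(bipartitions(t1) ^ bipartitions(t2))

def rfs_score(candidate, input_trees):
    """Sum of RF distances between each input tree and the restricted candidate."""
    return sum(rf_distance(restrict(candidate, leaf_set(t)), t) for t in input_trees)

# Two input trees on overlapping leaf sets (shared leaves: c and d).
t1 = (("a", "b"), ("c", "d"))
t2 = (("c", "d"), ("e", "f"))

compatible = (("a", "b"), ("c", "d"), ("e", "f"))
conflicting = (("a", "c"), ("b", "d"), ("e", "f"))

print(rfs_score(compatible, [t1, t2]))   # 0
print(rfs_score(conflicting, [t1, t2]))  # 2
```

The first candidate displays both input trees and scores 0; the second replaces the split ab|cd with ac|bd on the leaves of the first input tree, which costs one missing and one extra bipartition.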

Summary

Introduction

Supertree construction (i.e., the combination of a collection of trees, each on a potentially different subset of the species, into a tree on the full set of species) is a natural algorithmic problem that has important applications to computational biology; see [1] for a 2004 book on the subject and [2,3,4,5,6,7,8,9] for some of the recent papers on this subject.

For a tree T, let V(T) and E(T) denote the set of vertices and edges of T, respectively. If π ∈ C(T1, T2, X), we let TR∗(π) refer to the set of extra subtrees that attach to edges in a backbone tree that induce π in either T1|X or

Results

Conclusion