Abstract
Many popular algorithms for searching the space of leaf-labelled (phylogenetic) trees are based on tree rearrangement operations. Under any such operation, the problem is reduced to searching a graph where vertices are trees and (undirected) edges are given by pairs of trees connected by one rearrangement operation (sometimes called a move). Most popular are the classical nearest neighbour interchange, subtree prune and regraft, and tree bisection and reconnection moves. The problem of computing distances, however, is NP-hard in each of these graphs, making tree inference and comparison algorithms challenging to design in practice. Although ranked phylogenetic trees are one of the central objects of interest in applications such as cancer research, immunology, and epidemiology, the computational complexity of the shortest path problem for these trees remained unsolved for decades. In this paper, we settle this problem for the ranked nearest neighbour interchange operation by establishing that the complexity depends on the weight difference between the two types of tree rearrangements (rank moves and edge moves), and varies from quadratic, which is the lowest possible complexity for this problem, to NP-hard, which is the highest. In particular, our result provides the first example of a phylogenetic tree rearrangement operation for which shortest paths, and hence the distance, can be computed efficiently. Specifically, our algorithm scales to trees with tens of thousands of leaves (and likely hundreds of thousands if implemented efficiently).
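To make the graph formulation concrete, the following is a minimal brute-force sketch, not the algorithm of the paper. It assumes the common cluster-list representation of a ranked tree (one leaf set per internal node, ordered by rank, root last) and unit weight for both rank moves and edge moves; the names `make_tree`, `neighbours`, `bfs_distances`, and `rnni_distance` are illustrative only. It enumerates the whole graph by breadth-first search, so it is usable only for a handful of leaves.

```python
from collections import deque


def make_tree(*clusters):
    """Build a ranked tree from leaf strings, e.g. make_tree("ab", "abc", "abcd").

    Assumed representation: clusters[i] is the leaf set below the internal
    node of rank i + 1; the last cluster is the root.
    """
    return tuple(frozenset(c) for c in clusters)


def neighbours(tree):
    """Yield every ranked tree that is one move away from `tree`."""
    for i in range(len(tree) - 1):
        lower, upper = tree[i], tree[i + 1]
        if not lower < upper:
            # Rank move: the two nodes share no edge, so swap their ranks.
            yield tree[:i] + (upper, lower) + tree[i + 2:]
        else:
            # Edge (NNI) move on the edge between `lower` and `upper`.
            sibling = upper - lower
            # One child of `lower` is its highest-ranked proper sub-cluster;
            # if there is none, `lower` is a cherry and any leaf will do.
            child = next((c for c in reversed(tree[:i]) if c < lower), None)
            if child is None:
                child = frozenset([min(lower)])
            for part in (child, lower - child):
                yield tree[:i] + (part | sibling, upper) + tree[i + 2:]


def bfs_distances(start):
    """Return a dict mapping every reachable tree to its distance from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        tree = queue.popleft()
        for nxt in neighbours(tree):
            if nxt not in dist:
                dist[nxt] = dist[tree] + 1
                queue.append(nxt)
    return dist


def rnni_distance(start, goal):
    """Unit-weight distance between two ranked trees on the same leaf set."""
    return bfs_distances(start)[goal]


caterpillar = make_tree("ab", "abc", "abcd")
balanced = make_tree("cd", "ab", "abcd")
print(rnni_distance(caterpillar, balanced))  # prints 2
print(len(bfs_distances(caterpillar)))       # prints 18
```

In the example, the caterpillar tree reaches the balanced tree in two steps: one edge move that produces the clusters ab, cd, abcd, followed by one rank move that swaps ab and cd. The second line counts all 18 ranked trees on four leaves, which is also why exhaustive search stops being an option almost immediately as the number of leaves grows.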
Highlights
We thank Alexei Drummond, David Bryant, and Kieran Elmes for useful discussions about the weight difference between RNNI moves, complexity, and applied aspects of our results
For example, in species evolution, where internal nodes of trees correspond to speciation events, the ranking of these nodes represents the order of divergence events in time
Most tree inference methods rely on various tree rearrangement operations (Semple and Steel 2003), the most popular of which are nearest neighbour interchange (NNI), subtree prune and regraft (SPR), and tree bisection and reconnection (TBR)
Summary
One of the major problems in computational biology is the reconstruction of evolutionary histories, known as phylogenetic trees, from sequence data such as RNA, DNA, or protein sequences. Most tree inference methods rely on various tree rearrangement operations (Semple and Steel 2003), the most popular of which are nearest neighbour interchange (NNI), subtree prune and regraft (SPR), and tree bisection and reconnection (TBR). Computing the NNI distance is known to be fixed-parameter tractable (DasGupta et al. 1999). Importantly, these algorithms remain impractical for large distances and are only applied to trees with a moderate number of leaves or those with small distances (Whidden and Matsen 2018). The Robinson–Foulds distance is not motivated by a biological process, unlike, for example, SPR, where the tree rearrangement operation can be used to model hybridisation and other horizontal events. This pattern is quite common: tree distance measures that are easy to compute lack biological interpretability, while those that are biologically meaningful are often hard to compute (Whidden and Matsen 2018). Because NNI can be seen as a special case of RNNI, we investigate whether there exists a threshold at which the complexity of the shortest path problem shifts from