Abstract

Many popular algorithms for searching the space of leaf-labelled (phylogenetic) trees are based on tree rearrangement operations. Under any such operation, the problem is reduced to searching a graph where vertices are trees and (undirected) edges are given by pairs of trees connected by one rearrangement operation (sometimes called a move). Most popular are the classical nearest neighbour interchange, subtree prune and regraft, and tree bisection and reconnection moves. The problem of computing distances, however, is NP-hard in each of these graphs, making tree inference and comparison algorithms challenging to design in practice. Although ranked phylogenetic trees are one of the central objects of interest in applications such as cancer research, immunology, and epidemiology, the computational complexity of the shortest path problem for these trees remained unsolved for decades. In this paper, we settle this problem for the ranked nearest neighbour interchange operation by establishing that the complexity depends on the weight difference between the two types of tree rearrangements (rank moves and edge moves), and varies from quadratic, which is the lowest possible complexity for this problem, to NP-hard, which is the highest. In particular, our result provides the first example of a phylogenetic tree rearrangement operation for which shortest paths, and hence the distance, can be computed efficiently. Specifically, our algorithm scales to trees with tens of thousands of leaves (and likely hundreds of thousands if implemented efficiently).
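
The graph view described above can be made concrete with a small brute-force sketch. It is not the paper's algorithm: it is a plain breadth-first search, exponential in the number of leaves, and is only meant to show what "vertices are trees, edges are moves" looks like for ranked trees. Several things here are assumptions rather than definitions taken from this page: a ranked tree is encoded as a tuple of leaf clusters ordered by rank, a rank move swaps two internal nodes of consecutive rank that are not joined by an edge, an edge move is a nearest neighbour interchange on an edge whose endpoints have consecutive ranks, and both move types have weight one. The names `ranked_tree`, `children`, `neighbours`, and `rnni_distance` are hypothetical.

```python
from collections import deque


def ranked_tree(clusters):
    """Build a ranked tree from clusters listed in increasing rank.

    Assumed encoding: entry i is the set of leaves below the internal
    node of rank i + 1; the last entry contains every leaf.
    """
    return tuple(frozenset(c) for c in clusters)


def children(tree, i):
    """Return the two child clusters of the internal node tree[i]."""
    parts, covered = [], set()
    # Scan lower-ranked nodes from the top down, so that maximal
    # sub-clusters are found before anything nested inside them.
    for j in range(i - 1, -1, -1):
        c = tree[j]
        if c <= tree[i] and not (c & covered):
            parts.append(c)
            covered |= c
    # Leaves not covered by an internal child are children themselves.
    parts.extend(frozenset([leaf]) for leaf in tree[i] - covered)
    return parts


def neighbours(tree):
    """List all trees one move away (assumed unit-weight moves)."""
    out = []
    for i in range(len(tree) - 1):
        lower, upper = tree[i], tree[i + 1]
        if lower <= upper:
            # The two nodes share an edge: two possible edge moves.
            a, b = children(tree, i)
            (x,) = [c for c in children(tree, i + 1) if c != lower]
            for new in (a | x, b | x):
                out.append(tree[:i] + (new,) + tree[i + 1:])
        else:
            # No edge between them: a rank move swaps their ranks.
            out.append(tree[:i] + (upper, lower) + tree[i + 2:])
    return out


def rnni_distance(start, goal):
    """Breadth-first search in the move graph (brute force)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        t = queue.popleft()
        if t == goal:
            return dist[t]
        for u in neighbours(t):
            if u not in dist:
                dist[u] = dist[t] + 1
                queue.append(u)
    raise ValueError("goal is not a ranked tree on the same leaf set")


t = ranked_tree(["ab", "abc", "abcd"])
r = ranked_tree(["cd", "ab", "abcd"])
print(rnni_distance(t, r))  # 2: one edge move, then one rank move
```

Because the number of ranked trees grows super-exponentially with the number of leaves, this search is only usable for a handful of leaves, which is exactly why a polynomial shortest-path algorithm matters.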

Highlights

  • We thank Alexei Drummond, David Bryant, and Kieran Elmes for useful discussions about the weight difference between RNNI moves, complexity, and applied aspects of our results

  • For example, in species evolution, where internal nodes of trees correspond to speciation events, the ranking of these nodes represents the order of divergence events in time

  • Most tree inference methods rely on various tree rearrangement operations (Semple and Steel 2003), the most popular of which are nearest neighbour interchange (NNI), subtree prune and regraft (SPR), and tree bisection and reconnection (TBR)

Summary

One of the major problems in computational biology is the reconstruction of evolutionary histories, known as phylogenetic trees, from sequence data such as RNA, DNA, or protein sequences. Most tree inference methods rely on various tree rearrangement operations (Semple and Steel 2003), the most popular of which are nearest neighbour interchange (NNI), subtree prune and regraft (SPR), and tree bisection and reconnection (TBR). Computing the NNI distance is known to be fixed parameter tractable (DasGupta et al. 1999). Importantly, these algorithms remain impractical for large distances and are only applied to trees with a moderate number of leaves or those with small distances (Whidden and Matsen 2018). The Robinson–Foulds distance is not motivated by a biological process, unlike, for example, SPR, where the tree rearrangement operation can be used to model hybridisation and other horizontal events. This pattern is quite common: tree distance measures that are easy to compute lack biological interpretability, while those that are biologically meaningful are often hard to compute (Whidden and Matsen 2018). Because NNI can be seen as a special case of RNNI, we investigate whether there exists a threshold at which the complexity of the shortest path problem shifts from

Definitions and background results
FINDPATH algorithm
FINDPATH computes shortest paths in optimal time
Additional open problems