Abstract

Accurate reconstruction of phylogenies remains a key challenge in evolutionary biology. Most biologically plausible formulations of the problem are formally NP-hard, with no known efficient solution. The standard in practice are fast heuristic methods that are empirically known to work very well in general, but can yield results arbitrarily far from optimal. Practical exact methods, which yield exponential worst-case running times but generally much better times in practice, provide an important alternative. We report progress in this direction by introducing a provably optimal method for the weighted multi-state maximum parsimony phylogeny problem. The method is based on generalizing the notion of the Buneman graph, a construction key to efficient exact methods for binary sequences, so as to apply to sequences with arbitrary finite numbers of states with arbitrary state transition weights. We implement an integer linear programming (ILP) method for the multi-state problem using this generalized Buneman graph and demonstrate that the resulting method is able to solve data sets that are intractable by prior exact methods in run times comparable with popular heuristics. We further show on a collection of less difficult problem instances that the ILP method leads to large reductions in average-case run times relative to leading heuristics on moderately hard problems. Our work provides the first method for provably optimal maximum parsimony phylogeny inference that is practical for multi-state data sets of more than a few characters.

Highlights

  • One of the fundamental problems in computational biology is that of inferring evolutionary relationships between a set of observed amino acid sequences or taxa

  • We provide a new generalization of the definition of Buneman graph for any finite number of states that guarantees the resulting graph will contain all distinct minimum Steiner trees of the multi-state input set

  • We compared the performance of our method with two popular heuristic methods for maximum parsimony phylogeny inference — pars, which is part of the freely-available PHYLIP package [13], and PAUP* [12], the leading commercial phylogenetics package

Read more

Summary

Introduction

One of the fundamental problems in computational biology is that of inferring evolutionary relationships between a set of observed amino acid sequences or taxa These evolutionary relationships are commonly represented by a tree (phylogeny) describing the descent of all observed taxa from a common ancestor, a reasonable model provided we are working with sequences over small enough regions or distant enough relationships that we can neglect recombination or other sources of reticulation [1]. Maximum parsimony defines the tree(s) with the fewest mutations as the optimum, generally a reasonable assumption for short time-scales or conserved sequences. It is a simple, non-parametric criterion, as opposed to common maximum likelihood models or various popular distance-based methods [2]. MP is known to be NP-hard [3] and practical implementations of MP are generally based on heuristics which do not guarantee optimal solutions

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call