Abstract

Background: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore most analyses use heuristics that lack any performance guarantee.

Results: We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP formulation can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP formulation to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics.

Conclusions: Although the GD problem is NP-hard, our novel ILP formulation can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved to optimality to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.

Highlights

  • The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees

  • In the simulation experiments, the size of the species tree has a major impact on running time, but we were able to find exact solutions for the GD problem for data sets with up to 14 taxa (Table 2)

  • Data sets with 1000 gene trees took longer to solve, on average, than data sets with fewer gene trees. In some cases with few gene trees (10 gene trees), it is difficult to determine an optimal solution when the optimal species tree is not unique


Summary

Introduction

The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. With recent advances in DNA sequencing technology, there is much interest in using genomic data sets to infer phylogenetic trees. Evolutionary events such as gene duplication and loss, incomplete lineage sorting (deep coalescence), and lateral gene transfer can produce discordance between gene trees and the phylogeny of the species in which the genes evolve (e.g., [1]). More generally, given a collection of gene trees, one can seek a species tree that implies the minimum reconciliation cost, i.e., the fewest evolutionary events that can explain the discordance among the gene phylogenies. We describe an ILP formulation that solves the GD problem exactly and demonstrate its performance on both simulated and empirical data sets.
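To make the objective concrete, the following is a minimal sketch of how the duplication cost of a candidate species tree could be computed and minimized by exhaustive search. It is not the ILP formulation described in this paper. It assumes rooted binary trees written as nested 2-tuples with species names at the leaves, uses the standard least-common-ancestor (LCA) mapping in which a gene tree node counts as a duplication when it maps to the same species tree node as one of its children, and assumes every gene tree leaf names a species present in the species tree. All function names are hypothetical and chosen for illustration.

```python
# Trees are nested 2-tuples; a leaf is a species name (str).
# Illustrative brute force only; NOT the paper's ILP formulation.

def clusters(tree):
    """Return (leaf set of `tree`, list of leaf sets of all its nodes)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    left_set, left_all = clusters(tree[0])
    right_set, right_all = clusters(tree[1])
    here = left_set | right_set
    return here, left_all + right_all + [here]


def count_duplications(gene_tree, species_tree):
    """Count duplications implied by the LCA mapping of one gene tree."""
    _, species_clusters = clusters(species_tree)

    def lca(leaves):
        # The smallest species-tree cluster containing `leaves` is the LCA.
        return min((c for c in species_clusters if leaves <= c), key=len)

    def visit(node):
        """Return (LCA mapping of node, duplications in its subtree)."""
        if isinstance(node, str):
            return lca(frozenset([node])), 0
        (ml, dl), (mr, dr) = visit(node[0]), visit(node[1])
        m = lca(ml | mr)
        # Duplication: the node maps to the same place as one of its children.
        return m, dl + dr + (1 if m in (ml, mr) else 0)

    return visit(gene_tree)[1]


def gd_cost(gene_trees, species_tree):
    """Total duplications over a collection of gene trees."""
    return sum(count_duplications(g, species_tree) for g in gene_trees)


def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)  # attach above the current root
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def all_rooted_trees(taxa):
    """Yield every rooted binary tree on `taxa` (feasible for few taxa only)."""
    taxa = list(taxa)
    if len(taxa) == 1:
        yield taxa[0]
        return
    for smaller in all_rooted_trees(taxa[1:]):
        yield from insert_leaf(smaller, taxa[0])


def best_species_tree(gene_trees, taxa):
    """Exhaustively search for a species tree of minimum duplication cost."""
    return min(all_rooted_trees(taxa), key=lambda s: gd_cost(gene_trees, s))


if __name__ == "__main__":
    genes = [(("A", "B"), "C"), (("A", "B"), "C"), (("A", "C"), "B")]
    best = best_species_tree(genes, ["A", "B", "C"])
    print(best, gd_cost(genes, best))
```

The number of rooted binary trees on n taxa is (2n-3)!!, so this kind of enumeration stops being practical after a handful of taxa, which is why an exact formulation that avoids enumerating every tree is of interest.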

