A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation.

Hussein A Hejase, Kevin J Liu

doi:10.1186/s12859-016-1277-1

Open Access

https://doi.org/10.1186/s12859-016-1277-1

Journal: BMC Bioinformatics
Publication Date: Oct 13, 2016
Citations: 42
License type: CC BY 4.0

Affiliation: Michigan State University

Abstract

Background: Branching events in phylogenetic trees reflect bifurcating and/or multifurcating speciation and splitting events. In the presence of gene flow, a phylogeny cannot be described by a tree but is instead a directed acyclic graph known as a phylogenetic network. Both phylogenetic trees and networks are typically reconstructed using computational analysis of multi-locus sequence data. The advent of high-throughput sequencing technologies has brought about two main scalability challenges: (1) dataset size in terms of the number of taxa and (2) the evolutionary divergence of the taxa in a study. The impact of both dimensions of scale on phylogenetic tree inference has been well characterized by recent studies; in contrast, the scalability limits of phylogenetic network inference methods are largely unknown.

Results: In this study, we quantify the performance of state-of-the-art phylogenetic network inference methods on large-scale datasets using empirical data sampled from natural mouse populations and a range of simulations using model phylogenies with a single reticulation. We find that, as in the case of phylogenetic tree inference, the performance of leading network inference methods is negatively impacted by both dimensions of dataset scale. In general, we found that topological accuracy degrades as the number of taxa increases; a similar effect was observed with increased sequence mutation rate. The most accurate methods were probabilistic inference methods which maximize either likelihood under coalescent-based models or pseudo-likelihood approximations to the model likelihood. The improved accuracy obtained with probabilistic inference methods comes at a computational cost in terms of runtime and main memory usage, which become prohibitive as dataset size grows past twenty-five taxa. None of the probabilistic methods completed analyses of datasets with 30 taxa or more after many weeks of CPU runtime.

Conclusions: We conclude that the state of the art of phylogenetic network inference lags well behind the scope of current phylogenomic studies. New algorithmic development is critically needed to address this methodological gap.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1277-1) contains supplementary material, which is available to authorized users.

Highlights

Branching events in phylogenetic trees reflect bifurcating and/or multifurcating speciation and splitting events
Of the full likelihood methods, maximum likelihood estimation (MLE)-length was consistently faster than MLE; the comparison of pseudo-likelihood-based methods revealed that Species Networks applying Quartets (SNaQ) was consistently faster than maximum pseudo-likelihood (MPL)
The observed growth in runtime is similar to previous performance studies [25, 65, 66], which suggest an increase in runtime as sampled dataset sizes grow

Summary

Introduction

Branching events in phylogenetic trees reflect bifurcating and/or multifurcating speciation and splitting events. Gene flow, the process by which genetic material is exchanged between different populations and/or species existing at the same point in time, has been shown to have played a major role in the evolution of a diverse array of metazoans, including humans and ancient hominins [1, 2], mice [3], and butterflies [4]. Each of these organisms (as well as many others across the Tree of Life [5, 6, 7]) has a phylogeny, or evolutionary history, that cannot be represented as a tree, in which a branching event reflects strict bifurcating and/or multifurcating speciation/splitting and subsequent genetic isolation of the resulting species/populations. We focus our attention on explicit phylogenetic networks and we hereafter omit the “explicit” qualifier for brevity.
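
To make the distinction between a tree and a network concrete, the following minimal Python sketch encodes a phylogeny as a list of directed parent-to-child edges, checks that it is acyclic, and counts reticulation nodes (nodes with more than one parent). It is an illustration only and is not taken from the study: the function name `count_reticulations`, the node labels, and both toy topologies are hypothetical, and none of the inference methods compared here (MLE, MLE-length, MPL, SNaQ) is implemented.

```python
from collections import defaultdict, deque


def count_reticulations(edges):
    """Count reticulation nodes (in-degree >= 2) in a directed graph.

    `edges` is an iterable of (parent, child) pairs. A ValueError is raised
    if the graph contains a directed cycle, since a phylogenetic network
    must be a directed acyclic graph.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Kahn's algorithm: repeatedly peel off nodes with no unvisited parents.
    remaining = {node: indegree[node] for node in nodes}
    queue = deque(node for node in nodes if remaining[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                queue.append(child)
    if visited != len(nodes):
        raise ValueError("graph contains a directed cycle")

    return sum(1 for node in nodes if indegree[node] >= 2)


# Hypothetical four-taxon tree: every non-root node has exactly one parent.
tree_edges = [
    ("root", "u"), ("root", "v"),
    ("u", "A"), ("u", "B"),
    ("v", "C"), ("v", "D"),
]

# Hypothetical network with a single reticulation: node "h" inherits from
# both "x" (on the lineage leading to B) and "v", modelling gene flow into
# the lineage leading to C.
network_edges = [
    ("root", "u"), ("root", "v"),
    ("u", "A"), ("u", "x"),
    ("x", "B"), ("x", "h"),
    ("v", "h"), ("v", "D"),
    ("h", "C"),
]

print(count_reticulations(tree_edges))     # tree: no reticulation nodes
print(count_reticulations(network_edges))  # network: one reticulation node
```

In this representation a tree is simply the special case with zero reticulation nodes, which is why a single gene flow event is enough to move a phylogeny outside the space of trees.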

Methods

Results

Discussion

Conclusion