A global test of hybrid ancestry from genome-scale data.
Methods based on the multi-species coalescent have been widely used in phylogenetic tree estimation using genome-scale DNA sequence data to understand the underlying evolutionary relationship between the sampled species. Evolutionary processes such as hybridization, which creates new species through interbreeding between two different species, necessitate inferring a species network instead of a species tree. A species tree is strictly bifurcating and thus fails to incorporate hybridization events which require an internal node of degree three. Hence, it is crucial to decide whether a tree or network analysis should be performed given a DNA sequence data set, a decision that is based on the presence of hybrid species in the sampled species. Although many methods have been proposed for hybridization detection, it is rare to find a technique that does so globally while considering a data generation mechanism that allows both hybridization and incomplete lineage sorting. In this paper, we consider hybridization and coalescence in a unified framework and propose a new test that can detect whether there are any hybrid species in a set of species of arbitrary size. Based on this global test of hybridization, one can decide whether a tree or network analysis is appropriate for a given data set.
- Book Chapter
2
- 10.1002/9780470015902.a0025781
- Nov 14, 2014
During the last decades, gene trees have been often interpreted as species phylogenies. However, the extensive gene tree discordance found in multi‐locus datasets has put into question this interpretation, and a variety of new methods that explicitly consider species trees have been proposed in recent years. Some of these explicitly consider evolutionary processes that can lead to true gene tree discordance, namely incomplete lineage sorting, gene duplication and loss and horizontal gene transfer. Choosing the most appropriate species tree method for the data at hand is not straightforward due to different data prerequisites, model assumptions, analytical strategies and computational implementations.Key Concepts:We could think of at least three different phylogenetic layers corresponding to species trees, locus trees and gene trees. These depict, respectively, the history of the sampled species, loci and genes copies.Traditional phylogenetic inference has focused on the reconstruction of gene trees, assumed to be accurate proxies for species history.True species, locus and gene trees can be incongruent due to the effect of evolutionary processes like incomplete lineage sorting, gene duplication and loss and horizontal gene transfer. This incongruence might appear larger due to estimation error.Extensive gene tree incongruence unveiled in multi‐locus datasets has encouraged the development of methods that explicitly reconstruct species trees. The supermatrix approach combines loci into a super‐alignment and estimates the corresponding supergene tree. Supertree methods combine gene trees to obtain an estimate of the species tree. Other methods co‐estimate gene and species trees in a single step using full probabilistic models.Different species tree methods require distinct data specifications, mainly related with the consideration of paralogs, number of sampled species, missing taxa and number of loci.Rooting and reconstruction uncertainty must be carefully considered before choosing a species tree method.
- Research Article
67
- 10.1186/s12862-019-1439-7
- May 30, 2019
- BMC Evolutionary Biology
BackgroundCoalescent-based species tree inference has become widely used in the analysis of genome-scale multilocus and SNP datasets when the goal is inference of a species-level phylogeny. However, numerous evolutionary processes are known to violate the assumptions of a coalescence-only model and complicate inference of the species tree. One such process is hybrid speciation, in which a species shares its ancestry with two distinct species. Although many methods have been proposed to detect hybrid speciation, only a few have considered both hybridization and coalescence in a unified framework, and these are generally limited to the setting in which putative hybrid species must be identified in advance.ResultsHere we propose a method that can examine genome-scale data for a large number of taxa and detect those taxa that may have arisen via hybridization, as well as their potential “parental” taxa. The method is based on a model that considers both coalescence and hybridization together, and uses phylogenetic invariants to construct a test that scales well in terms of computational time for both the number of taxa and the amount of sequence data. We test the method using simulated data for up 20 taxa and 100,000bp, and find that the method accurately identifies both recent and ancient hybrid species in less than 30 s. We apply the method to two empirical datasets, one composed of Sistrurus rattlesnakes for which hybrid speciation is not supported by previous work, and one consisting of several species of Heliconius butterflies for which some evidence of hybrid speciation has been previously found.ConclusionsThe proposed method is powerful for detecting hybridization for both recent and ancient hybridization events. The computations required can be carried out rapidly for a large number of sequences using genome-scale data, and the method is appropriate for both SNP and multilocus data.
- Research Article
54
- 10.1016/j.ympev.2013.01.004
- Jan 23, 2013
- Molecular Phylogenetics and Evolution
Genetic introgression and hybridization in Antillean freshwater turtles (Trachemys) revealed by coalescent analyses of mitochondrial and cloned nuclear markers
- Research Article
359
- 10.1016/j.ympev.2015.10.027
- Oct 27, 2015
- Molecular Phylogenetics and Evolution
Implementing and testing the multispecies coalescent model: A valuable paradigm for phylogenomics.
- Research Article
3
- 10.1007/s00285-022-01731-5
- Apr 1, 2022
- Journal of mathematical biology
Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene-the gene trees-often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated to each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard-but unsatisfactory-assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error-or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
- Research Article
32
- 10.1098/rspb.2020.2102
- Dec 9, 2020
- Proceedings of the Royal Society B: Biological Sciences
Genome-scale data have greatly facilitated the resolution of recalcitrant nodes that Sanger-based datasets have been unable to resolve. However, phylogenomic studies continue to use traditional methods such as bootstrapping to estimate branch support; and high bootstrap values are still interpreted as providing strong support for the correct topology. Furthermore, relatively little attention has been given to assessing discordances between gene and species trees, and the underlying processes that produce phylogenetic conflict. We generated novel genomic datasets to characterize and determine the causes of discordance in Old World treefrogs (Family: Rhacophoridae)-a group that is fraught with conflicting and poorly supported topologies among major clades. Additionally, a suite of data filtering strategies and analytical methods were applied to assess their impact on phylogenetic inference. We showed that incomplete lineage sorting was detected at all nodes that exhibited high levels of discordance. Those nodes were also associated with extremely short internal branches. We also clearly demonstrate that bootstrap values do not reflect uncertainty or confidence for the correct topology and, hence, should not be used as a measure of branch support in phylogenomic datasets. Overall, we showed that phylogenetic discordances in Old World treefrogs resulted from incomplete lineage sorting and that species tree inference can be improved using a multi-faceted, total-evidence approach, which uses the most amount of data and considers results from different analytical methods and datasets.
- Research Article
113
- 10.1371/journal.pgen.1006006
- May 4, 2016
- PLOS Genetics
The multispecies coalescent (MSC) is a statistical framework that models how gene genealogies grow within the branches of a species tree. The field of computational phylogenetics has witnessed an explosion in the development of methods for species tree inference under MSC, owing mainly to the accumulating evidence of incomplete lineage sorting in phylogenomic analyses. However, the evolutionary history of a set of genomes, or species, could be reticulate due to the occurrence of evolutionary processes such as hybridization or horizontal gene transfer. We report on a novel method for Bayesian inference of genome and species phylogenies under the multispecies network coalescent (MSNC). This framework models gene evolution within the branches of a phylogenetic network, thus incorporating reticulate evolutionary processes, such as hybridization, in addition to incomplete lineage sorting. As phylogenetic networks with different numbers of reticulation events correspond to points of different dimensions in the space of models, we devise a reversible-jump Markov chain Monte Carlo (RJMCMC) technique for sampling the posterior distribution of phylogenetic networks under MSNC. We implemented the methods in the publicly available, open-source software package PhyloNet and studied their performance on simulated and biological data. The work extends the reach of Bayesian inference to phylogenetic networks and enables new evolutionary analyses that account for reticulation.
- Research Article
7
- 10.1101/gr.277629.122
- May 17, 2023
- Genome Research
methods are widely used to estimate species trees from genome-scale data. However, they can fail to produce accurate species trees when the input gene trees are highly discordant because of estimation error and biological processes, such as incomplete lineage sorting. Here, we introduce TREE-QMC, a new summary method that offers accuracy and scalability under these challenging scenarios. TREE-QMC builds upon weighted Quartet Max Cut, which takes weighted quartets as input and then constructs a species tree in a divide-and-conquer fashion, at each step forming a graph and seeking its max cut. The wQMC method has been successfully leveraged in the context of species tree estimation by weighting quartets by their frequencies in the gene trees; we improve upon this approach in two ways. First, we address accuracy by normalizing the quartet weights to account for "artificial taxa" introduced during the divide phase so subproblem solutions can be combined during the conquer phase. Second, we address scalability by introducing an algorithm to construct the graph directly from the gene trees; this gives TREE-QMC a time complexity of [Formula: see text], where n is the number of species and k is the number of gene trees, assuming the subproblem decomposition is perfectly balanced. These contributions enable TREE-QMC to be highly competitive in terms of species tree accuracy and empirical runtime with the leading quartet-based methods, even outperforming them on some model conditions explored in our simulation study. We also present the application of these methods to an avian phylogenomics data set.
- Research Article
110
- 10.1371/journal.pone.0129183
- Jun 18, 2015
- PLoS ONE
Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.
- Preprint Article
3
- 10.7287/peerj.preprints.3489v1
- Dec 27, 2017
Despite recent efforts that have produced data sets with hundreds and thousands of gene regions to resolve regions of the tree of life, recalcitrant nodes persist and disagreement among genes as well as disagreement between individual gene trees and species trees are common. There are a number of evolutionary processes that contribute to these conflicts between gene trees and species trees, including deep coalescence (lineage sorting), horizontal gene transfer or hybridization, etc. While for some of these processes, we have very powerful and sophisticated models that uses the conflict in the gene trees as information that contributes materially to correctly inferring the species tree, such as the multispecies coalescent (MSC). However, usage of these models require a priori recognition of relevant processes, which is often unknown for empirical dataset. Here we propose a new perspective to not only identify the cause of discord among gene trees, but also use it to classify loci by the underlying cause of discord to identify subsets of loci for analysis with the goal of improving phylogenetic accuracy. This approach differs fundamentally from all other criteria used for making decisions about which loci to include in a phylogenetic analysis. In particular, the choice of loci in this framework is based on identifying those that reflect descent from a common ancestor (as opposed to other processes), and thereby can minimize problems with model misspecification. We present preliminary results that demonstrate the potential of this framework in distinguishing the lateral gene transfer (LGT) from incomplete lineage sorting (ILS) process, as implemented in a new software package CLASSIPHY, while also highlighting areas for further development and testing. We discussed why such methods (i) are critical to improving phylogenetic accuracy with the increased complexity of genomic/transcriptomic datasets, and that (ii) characterizing patterns of discordance and the contribution of different processes to this discordance is itself of interest for generating hypotheses about the role of lateral gene transfer, gene duplication, and incomplete lineage sorting during the divergence of different taxa.
- Research Article
28
- 10.1111/jse.12175
- Sep 1, 2015
- Journal of Systematics and Evolution
Phylogenomics comprises an interdisciplinary field of comparative biology that uses genomic data to produce phylogenetic relationships among organisms (Philippe et al., 2005; Chan & Ragan, 2013). The term “phylogenomics” was initially applied to studies on gene functions (Eisen, 1998) but subsequently has been widely used to infer the tree of life (Philippe et al., 2004,2005; Pennisi, 2008; Zou et al., 2008; Dos Reis et al., 2012; Liu et al., 2015a, 2015b). With the advent and rapid development of next-generation sequencing (NGS), phylogenomics has gained more popularity in the last few years and is now being employed by many members of the evolutionary and systematics community (Lemmon & Lemmon, 2013; Wen et al., 2013, 2015; Jarvis et al., 2014; Weitemier et al., 2014; Wickett et al., 2014; Zeng et al., 2014).
- Research Article
165
- 10.1093/sysbio/syv016
- Mar 25, 2015
- Systematic Biology
The estimation of species trees using multiple loci has become increasingly common. Because different loci can have different phylogenetic histories (reflected in different gene tree topologies) for multiple biological causes, new approaches to species tree estimation have been developed that take gene tree heterogeneity into account. Among these multiple causes, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is potentially the most common cause of gene tree heterogeneity, and much of the focus of the recent literature has been on how to estimate species trees in the presence of ILS. Despite progress in developing statistically consistent techniques for estimating species trees when gene trees can differ due to ILS, there is substantial controversy in the systematics community as to whether to use the new coalescent-based methods or the traditional concatenation methods. One of the key issues that has been raised is understanding the impact of gene tree estimation error on coalescent-based methods that operate by combining gene trees. Here we explore the mathematical guarantees of coalescent-based methods when analyzing estimated rather than true gene trees. Our results provide some insight into the differences between promise of coalescent-based methods in theory and their performance in practice.
- Research Article
56
- 10.1093/sysbio/syab056
- Jul 10, 2021
- Systematic biology
Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference.[Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.]
- Research Article
10
- 10.1007/s11538-021-00918-z
- Jul 23, 2021
- Bulletin of Mathematical Biology
Inference of the evolutionary histories of species, commonly represented by a species tree, is complicated by the divergent evolutionary history of different parts of the genome. Different loci on the genome can have different histories from the underlying species tree (and each other) due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. The multispecies coalescent is a commonly used model for performing inference on species and gene trees in the presence of ILS. This paper introduces Lily-T and Lily-Q, two new methods for species tree inference under the multispecies coalescent. We then compare them to two frequently used methods, SVDQuartets and ASTRAL, using simulated and empirical data. Both methods generally showed improvement over SVDQuartets, and Lily-Q was superior to Lily-T for most simulation settings. The comparison to ASTRAL was more mixed-Lily-Q tended to be better than ASTRAL when the length of recombination-free loci was short, when the coalescent population parameter [Formula: see text] was small, or when the internal branch lengths were longer.
- Conference Article
1
- 10.1109/bibe50027.2020.00021
- Oct 1, 2020
Species tree estimation from multi-locus data is complicated as biological processes can result in different loci having different evolutionary histories. Incomplete lineage sorting (ILS), modeled by the multi-species coalescent (MSC), is considered to be a dominant cause for gene tree incongruence. Various optimization criteria (e.g., quartet score, pseudo-likelihood, etc.) are statistically consistent under the MSC model, meaning that they return the true species tree with high probability given sufficiently large numbers of accurate gene trees. However, the number of genes is limited and estimating highly accurate gene trees is difficult. Therefore, even popular methods, optimizing a particular criterion, may fail to reconstruct highly accurate trees under practical model conditions with limited numbers of genes and in the presence of gene tree estimation error. In this study, we advocate multi-objective optimization to tackle this challenge. Particularly, we present a multi-objective metaheuristic (i.e., SNOGA), a modified version of the popular NSGAII, which combines various optimization criteria to find a suitable search space containing highly accurate species trees.
- Ask R Discovery
- Chat PDF
AI summaries and top papers from 250M+ research sources.