Species Tree Estimation Research Articles

The massive accumulation of genome sequences in public databases has promoted the proliferation of genome-level phylogenetic analyses in many areas of biological research. However, due to diverse evolutionary and genetic processes, many loci have undesirable properties for phylogenetic reconstruction. These, if undetected, can result in erroneous or biased estimates, particularly when estimating species trees from concatenated datasets. To deal with these problems, we developed GET_PHYLOMARKERS, a pipeline designed to identify high-quality markers to estimate robust genome phylogenies from the orthologous clusters, or the pan-genome matrix (PGM), computed by GET_HOMOLOGUES. In the first context, a set of sequential filters is applied to exclude recombinant alignments and those producing anomalous or poorly resolved trees. Multiple sequence alignments and maximum likelihood (ML) phylogenies are computed in parallel on multi-core computers. An ML species tree is estimated from the concatenated set of top-ranking alignments at the DNA or protein levels, using either FastTree or IQ-TREE (IQT). The latter is used by default due to its superior performance revealed in an extensive benchmark analysis. In addition, parsimony and ML phylogenies can be estimated from the PGM. We demonstrate the practical utility of the software by analyzing 170 Stenotrophomonas genome sequences available in RefSeq and 10 new complete genomes of Mexican environmental S. maltophilia complex (Smc) isolates reported herein. A combination of core-genome and PGM analyses was used to revise the molecular systematics of the genus. An unsupervised learning approach that uses a goodness of clustering statistic identified 20 groups within the Smc at a core-genome average nucleotide identity (cgANIb) of 95.9% that are perfectly consistent with strongly supported clades on the core- and pan-genome trees. In addition, we identified 16 misclassified RefSeq genome sequences, 14 of them labeled as S. maltophilia, demonstrating the broad utility of the software for phylogenomics and geno-taxonomic studies. The code, a detailed manual and tutorials are freely available for Linux/UNIX servers under the GNU GPLv3 license at https://github.com/vinuesa/get_phylomarkers. A Docker image bundling GET_PHYLOMARKERS with GET_HOMOLOGUES, which can be easily run on any platform, is available at https://hub.docker.com/r/csicunam/get_homologues/.

With the increasing availability of whole genome data, many species trees are being constructed from hundreds to thousands of loci. Although concatenation analysis using maximum likelihood is a standard approach for estimating species trees, it does not account for gene tree heterogeneity, which can occur due to many biological processes, such as incomplete lineage sorting. Coalescent species tree estimation methods, many of which are statistically consistent in the presence of incomplete lineage sorting, include Bayesian methods that coestimate the gene trees and the species tree, summary methods that compute the species tree by combining estimated gene trees, and site-based methods that infer the species tree from site patterns in the alignments of different loci. Due to concerns that poor quality loci will reduce the accuracy of estimated species trees, many recent phylogenomic studies have removed or filtered genes on the basis of phylogenetic signal and/or missing data prior to inferring species trees; however, little is known about the performance of species tree estimation methods when gene filtering is performed. We examine how incomplete lineage sorting, phylogenetic signal of individual loci, and missing data affect the absolute and the relative accuracy of species tree estimation methods and show how these properties affect methods' responses to gene filtering strategies. In particular, summary methods (ASTRAL-II, ASTRID, and MP-EST), a site-based coalescent method (SVDquartets within PAUP*), and an unpartitioned concatenation analysis using maximum likelihood (RAxML) were evaluated on a heterogeneous collection of simulated multilocus data sets, and the following trends were observed. Filtering genes based on gene tree estimation error improved the accuracy of the summary methods when levels of incomplete lineage sorting were low to moderate but did not benefit the summary methods under higher levels of incomplete lineage sorting, unless gene tree estimation error was also extremely high (a model condition with few replicates). Neither SVDquartets nor concatenation analysis using RAxML benefited from filtering genes on the basis of gene tree estimation error. Finally, filtering genes based on missing data was either neutral (i.e., did not impact accuracy) or else reduced the accuracy of all five methods. By providing insight into the consequences of gene filtering, we offer recommendations for estimating species trees in the presence of incomplete lineage sorting and reconcile seemingly conflicting observations made in prior studies regarding the impact of gene filtering.
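As a rough illustration of what "filtering genes based on missing data" can mean in practice, the sketch below drops loci whose fraction of absent taxa exceeds a threshold. It is not the procedure used in the study above: the function names, the `max_missing` parameter, and the toy loci are all hypothetical, and missing data is measured here only as taxon absence per locus.

```python
from typing import Dict, List, Set


def missing_fraction(locus_taxa: Set[str], all_taxa: Set[str]) -> float:
    """Fraction of the full taxon set that is absent from one locus."""
    present = len(locus_taxa & all_taxa)
    return 1.0 - present / len(all_taxa)


def filter_by_missing_data(
    loci: Dict[str, Set[str]], all_taxa: Set[str], max_missing: float
) -> List[str]:
    """Return the names of loci whose missing fraction is at most max_missing.

    max_missing is a hypothetical threshold chosen by the analyst.
    """
    return sorted(
        name
        for name, taxa in loci.items()
        if missing_fraction(taxa, all_taxa) <= max_missing
    )


# Toy data (invented for illustration): four taxa, three loci.
all_taxa = {"A", "B", "C", "D"}
loci = {
    "gene1": {"A", "B", "C", "D"},  # no taxa missing
    "gene2": {"A", "B"},            # half of the taxa missing
    "gene3": {"A", "B", "C"},       # one of four taxa missing
}

kept = filter_by_missing_data(loci, all_taxa, max_missing=0.25)
print(kept)  # ['gene1', 'gene3']
```

A stricter threshold such as `max_missing=0.0` would keep only `gene1`. The abstract's finding that this kind of filter was neutral or harmful for all five methods is a reason to compare filtered and unfiltered analyses rather than apply a threshold by default.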

Articles published on Species Tree Estimation

Assessing the Impacts of Positive Selection on Coalescent-Based Species Tree Estimation and Species Delimitation.

Taxonomic and functional diversity in Calogaya (lichenised Ascomycota) in dry continental Asia

The performance of coalescent-based species tree estimation methods under models of missing data

Molecular phylogenetic species in Alternaria pathogens infecting pistachio and wild relatives.

GET_PHYLOMARKERS, a Software Package to Select Optimal Orthologous Clusters for Phylogenomics and Inferring Pan-Genome Phylogenies, Used for a Critical Geno-Taxonomic Revision of the Genus Stenotrophomonas.

SIESTA: enhancing searches for optimal supertrees and species trees

OCTAL: Optimal Completion of gene trees in polynomial time

SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space

Allele phasing has minimal impact on phylogenetic reconstruction from targeted nuclear gene sequences in a case study of Artocarpus.

Testing for Polytomies in Phylogenetic Species Trees Using Quartet Frequencies.

Evidence that Myotis lucifugus "Subspecies" are Five Nonsister Species, Despite Gene Flow.

Phylogenomics of a rapid radiation: the Australian rainbow skinks

Phenotypic and Genetic Structure Support Gene Flow Generating Gene Tree Discordances in an Amazonian Floodplain Endemic Species.

Gene tree parsimony for incomplete gene trees: addressing true biological loss

Large-scale phylogenomic analysis resolves a backbone phylogeny in ferns.

Let's jump in: A phylogenetic study of the great basin springfishes and poolfishes, Crenichthys and Empetrichthys (Cyprinodontiformes: Goodeidae).

Pinniped Diphyly and Bat Triphyly: More Homology Errors Drive Conflicts in the Mammalian Tree.

Distance-based species tree estimation under the coalescent: Information-theoretic trade-off between number of loci and sequence length

To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods.

Early-branching euteleost relationships: areas of congruence between concatenation and coalescent model inferences.
