Abstract

Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species-level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species-level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free, singly labeled species-level trees that can be combined between loci. Second, the impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user’s query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database and are typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files, and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species-tree-oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.
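The idea of a decisive data set can be illustrated with a small check on a taxon-by-locus coverage pattern. The sketch below tests one simple sufficient condition for unrooted trees: every set of four taxa is sampled together by at least one locus. This is an illustration only, not the procedure STBase itself uses; the function names, locus names, and taxa are hypothetical.

```python
from itertools import combinations


def uncovered_quartets(coverage):
    """Return the four-taxon subsets not fully sampled by any single locus.

    coverage: dict mapping a (hypothetical) locus name to the set of taxa
    that have a sequence for that locus.
    """
    taxa = sorted(set().union(*coverage.values()))
    missing = []
    for quartet in combinations(taxa, 4):
        members = set(quartet)
        # A quartet is covered if some locus samples all four of its taxa.
        if not any(members <= sampled for sampled in coverage.values()):
            missing.append(quartet)
    return missing


def is_quartet_decisive(coverage):
    """True if every four-taxon subset is covered by at least one locus.

    This is a sufficient condition for decisiveness, assumed here for
    illustration; it is not claimed to be the test used by STBase.
    """
    return not uncovered_quartets(coverage)


# Hypothetical coverage patterns.
complete = {"locusA": {"a", "b", "c", "d", "e"}, "locusB": {"a", "b", "c"}}
patchy = {"locusA": {"a", "b", "c", "d"}, "locusB": {"b", "c", "d", "e"}}

print(is_quartet_decisive(complete))
print(uncovered_quartets(patchy))
```

In the patchy example, taxa `a` and `e` are never sampled for the same locus, so any quartet containing both is left uncovered.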

Highlights

PLOS ONE | DOI:10.1371/journal.pone.0117987 | February 13, 2015

  • Phylogenetic trees have greatly altered comparative biology by rearranging the context for comparison, enhancing statistical power of comparative tests, and broadening taxonomic ...

  • In this paper we describe a new database of precomputed phylogenetic trees of eukaryotes, STBase (“Species Tree Database”), optimized for use by comparative biologists

  • Suppose the database contains a large tree of 1200 taxa that shares 80 of the names on the query list, that the majority rule consensus tree (MRT) of 1000 bootstrapped trees, pruned to those 80 taxa, is fully resolved and has an average bootstrap value of 70%, and the user has selected an h value of 0.5
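The numbers in the last highlight can be plugged into a toy scoring function to show how a single weight might trade tree quality against taxonomic overlap. This is a sketch under stated assumptions: the linear form, the choice to apply `h` to quality, and the query size of 100 names are all hypothetical and are not taken from the paper; only the 80 shared taxa, the 70% average bootstrap value, and h = 0.5 come from the text.

```python
def rank_score(n_shared, n_query, mean_bootstrap, h):
    """Hypothetical linear score combining tree quality and overlap.

    NOT the published STBase formula. Assumes:
      - overlap is the fraction of query names found in the tree,
      - quality is the mean bootstrap support as a fraction (0..1),
      - h weights quality and (1 - h) weights overlap.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("h must lie between 0 and 1")
    if n_query <= 0 or not 0 <= n_shared <= n_query:
        raise ValueError("need 0 <= n_shared <= n_query and n_query > 0")
    overlap = n_shared / n_query
    return h * mean_bootstrap + (1.0 - h) * overlap


# 80 shared taxa, 70% mean bootstrap, and h = 0.5 are from the text;
# the query size of 100 is an assumed value for illustration.
score = rank_score(n_shared=80, n_query=100, mean_bootstrap=0.70, h=0.5)
print(round(score, 3))
```

Under these assumptions, setting `h` to 1 would rank trees by bootstrap support alone, and setting it to 0 would rank them by overlap alone.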


Summary

OPEN ACCESS

Data Availability Statement: The database can be accessed at http://STBase.org.

Introduction
Construction and Content
Tree Construction
Total database
The Database
User Interface
Tree quality
Rationale for Data Set Assembly Strategy
Species Trees and Gene Tree Conflict
Applications in Comparative Biology and Large Tree Construction
Findings
Big data
