Abstract
Comparing bacterial 16S rDNA sequences to GenBank and other large public databases via BLAST often provides results of little use for identification and taxonomic assignment of the organisms of interest. The human microbiome, and in particular the oral microbiome, includes many taxa, and accurate identification of sequence data is essential for studies of these communities. For this purpose, a phylogenetically curated 16S rDNA database of the core oral microbiome, CORE, was developed. The goal was to include a comprehensive and minimally redundant representation of the bacteria that regularly reside in the human oral cavity with computationally robust classification at the level of species and genus. Clades of cultivated and uncultivated taxa were formed based on sequence analyses using multiple criteria, including maximum-likelihood-based topology and bootstrap support, genetic distance, and previous naming. A number of classification inconsistencies for previously named species, especially at the level of genus, were resolved. The performance of the CORE database for identifying clinical sequences was compared to that of three publicly available databases, GenBank nr/nt, RDP and HOMD, using a set of sequencing reads that had not been used in creation of the database. CORE offered improved performance compared to other public databases for identification of human oral bacterial 16S sequences by a number of criteria. In addition, the CORE database and phylogenetic tree provide a framework for measures of community divergence, and the focused size of the database offers advantages of efficiency for BLAST searching of large datasets. The CORE database is available as a searchable interface and for download at http://microbiome.osu.edu.
Highlights
Large datasets consisting of hundreds of thousands and even millions of sequences are produced with high-throughput sequencing technologies, and developing methods for accurate and efficient analysis of these datasets is a growing challenge.
Two fundamentally different approaches have been used to make taxonomic divisions for large 16S rRNA gene datasets. In the first, 16S rDNA sequences from bacteria have been grouped into operational taxonomic units (OTUs) with distance-based agglomerative clustering approaches such as MOTHUR [1], CD-HIT [2], and QIIME [3].
In order to address these problems, we developed a 16S database of the core human oral microbiome (CORE).
Summary
Large datasets consisting of hundreds of thousands and even millions of sequences are produced with high-throughput sequencing technologies, and developing methods for accurate and efficient analysis of these datasets is a growing challenge. It is currently computationally intractable to make individual taxonomic assignments with de novo phylogenetic tree construction approaches for such large numbers of sequences. 16S rDNA sequences have been identified and classified by comparing novel sequences to a comprehensive reference database for which taxonomic assignments have previously been made. General reference databases include the GenBank nucleotide database (Nuccore) and the more highly curated and specialized Ribosomal Database Project (RDP) (rdp.cme.msu.edu/), SILVA (www.arb-silva.de) and Greengenes (greengenes.lbl.gov) databases. Tools for identification and assignment of sequences against databases include Basic Local Alignment Search Tool (BLAST) [4], BLAST-Like Alignment Tool (BLAT) [5], RDP Sequence Match [6] and the RDP Classifier [7].
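The idea of identifying a novel sequence by comparison to a reference database with known taxonomic assignments can be illustrated with a toy sketch. The example below is not BLAST, BLAT, or any RDP tool, and it is not the method used to build or query CORE. It substitutes a much simpler technique, Jaccard similarity over shared k-mers, for alignment-based scoring. The function names, the k-mer size, the similarity cutoff, and the two reference sequences are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Set, Tuple


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def assign_taxon(
    query: str,
    reference: Dict[str, str],
    k: int = 4,
    min_similarity: float = 0.5,  # hypothetical cutoff, not from the paper
) -> Tuple[Optional[str], float]:
    """Return (best matching taxon, score), or (None, score) below the cutoff."""
    query_kmers = kmers(query, k)
    best_name: Optional[str] = None
    best_score = 0.0
    for name, ref_seq in reference.items():
        score = jaccard(query_kmers, kmers(ref_seq, k))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_similarity:
        # No reference is similar enough: leave the query unassigned.
        return None, best_score
    return best_name, best_score


# Invented toy reference; these are not real 16S sequences.
REFERENCE = {
    "taxon_A": "ACGTACGTTAGCCGATAGGCTTAACGGATCC",
    "taxon_B": "TTGACCGGTAAGCTTCCGGAATTCGGCCTAA",
}

if __name__ == "__main__":
    # taxon_A with a single substitution.
    read = "ACGTACGTTAGCCGACAGGCTTAACGGATCC"
    print(assign_taxon(read, REFERENCE))
```

A smaller, curated reference keeps the loop over `reference` short, which is the efficiency argument made for a focused database. A real pipeline would replace the k-mer score with an alignment-based tool such as BLAST.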