Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets

Isabel F Escapa, Katherine P Lemon, Tsute Chen, Floyd E Dewhirst, Maoxuan Lin, Alexis Kokaras, Yanmei Huang

doi:10.1186/s40168-020-00841-w

Open Access

https://doi.org/10.1186/s40168-020-00841-w

Abstract

Background: The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Existing computational algorithms can resolve 16S rRNA gene sequences into high-resolution amplicon sequence variants (ASVs), which represent consistent labels comparable across studies. Assigning these ASVs to species-level taxonomy strengthens the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies and further facilitates data comparison across studies.

Results: To achieve this, we developed a broadly applicable method for constructing high-resolution training sets based on the phylogenic relationships among microbes found in a habitat of interest. When used with the naïve Bayesian Ribosomal Database Project (RDP) Classifier, this training set achieved species/supraspecies-level taxonomic assignment of 16S rRNA gene-derived ASVs. The key steps for generating such a training set are (1) constructing an accurate and comprehensive phylogenetic-based, habitat-specific database; (2) compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon in the database; (3) trimming the training set to match the sequenced regions, if necessary; and (4) placing species sharing closely related sequences into a training-set-specific supraspecies taxonomic level to preserve subgenus-level resolution. As proof of principle, we developed a V1–V3 region training set for the bacterial microbiota of the human aerodigestive tract using the full-length 16S rRNA gene reference sequences compiled in our expanded Human Oral Microbiome Database (eHOMD). We also overcame technical limitations to successfully use Illumina sequences for the 16S rRNA gene V1–V3 region, the most informative segment for classifying bacteria native to the human aerodigestive tract. Finally, we generated a full-length eHOMD 16S rRNA gene training set, which we used in conjunction with an independent PacBio single molecule, real-time (SMRT)-sequenced sinonasal dataset to validate the representation of species in our training set. This also established the effectiveness of a full-length training set for assigning taxonomy of long-read 16S rRNA gene datasets.

Conclusion: Here, we present a systematic approach for constructing a phylogeny-based, high-resolution, habitat-specific training set that permits species/supraspecies-level taxonomic assignment to short- and long-read 16S rRNA gene-derived ASVs. This advancement enhances the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies.
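Steps (3) and (4) of the abstract can be illustrated with a small Python sketch. It is not the authors' pipeline. It assumes, for simplicity, that two species are "closely related" only when their trimmed reference sequences are exactly identical, whereas the abstract does not state the criterion actually used. The function names, the joined-name label format, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def trim_region(seq, start, end):
    """Keep only the sequenced region (0-based, end-exclusive coordinates)."""
    return seq[start:end]


def group_supraspecies(references, start, end):
    """Merge species whose trimmed reference sequences are identical.

    references: dict mapping species name -> list of full-length sequences.
    Returns a dict mapping species name -> training-set label. Species that
    stay distinguishable keep their own name; merged species share a
    hypothetical supraspecies label built by joining their names.
    """
    parent = {sp: sp for sp in references}

    def find(sp):
        # Follow parent links to the root, compressing the path on the way.
        while parent[sp] != sp:
            parent[sp] = parent[parent[sp]]
            sp = parent[sp]
        return sp

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Record which species own each trimmed sequence.
    owners = defaultdict(list)
    for sp, seqs in references.items():
        for seq in seqs:
            owners[trim_region(seq, start, end)].append(sp)

    # Species sharing a trimmed sequence cannot be told apart in that region.
    for species_list in owners.values():
        for other in species_list[1:]:
            union(species_list[0], other)

    groups = defaultdict(list)
    for sp in references:
        groups[find(sp)].append(sp)

    labels = {}
    for members in groups.values():
        members = sorted(members)
        label = members[0] if len(members) == 1 else "_".join(members)
        for sp in members:
            labels[sp] = label
    return labels


# Toy data (invented): two reference sequences for one species, one for the others.
toy_references = {
    "Genus_a": ["AAACCCGGGT", "AAACCCGGGA"],
    "Genus_b": ["TTACCCGGGT"],
    "Genus_c": ["AAACTCGGGT"],
}
full_length_labels = group_supraspecies(toy_references, 0, 10)
trimmed_labels = group_supraspecies(toy_references, 2, 9)
```

In this toy case all three species remain distinct at full length, but trimming to positions 2 to 9 makes `Genus_a` and `Genus_b` identical, so they receive one shared label while `Genus_c` keeps its own. This mirrors why a supraspecies level is specific to a given training set and region.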

Highlights

The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies
Compiling closely related sequences for each taxon in a training set improves the accuracy of species-level taxonomic classification
Genus-level taxonomic assignment is not an inherent limitation of the naïve Bayesian Ribosomal Database Project (RDP) Classifier
The naïve Bayesian RDP Classifier algorithm indicates that a training set with a larger number of sequences representing each taxon will result in more confident taxonomic assignment [39]
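The effect of the number of sequences per taxon can be sketched numerically. The code below assumes the word-probability estimates published for the RDP Classifier, which are not given in this text: a word prior of (n + 0.5) / (N + 1) over the whole training set, and a taxon-conditional probability of (m + prior) / (M + 1), where M is the number of training sequences for the taxon and m is how many of them contain the word. The function names are hypothetical, and this is an illustration of the arithmetic, not the classifier itself.

```python
def word_prior(n_with_word, n_total):
    """Assumed prior: fraction of all training sequences containing the word,
    smoothed so that it is never exactly 0 or 1."""
    return (n_with_word + 0.5) / (n_total + 1)


def conditional_word_prob(m_with_word, m_taxon, prior):
    """Assumed taxon-conditional probability of observing the word."""
    return (m_with_word + prior) / (m_taxon + 1)


def presence_contrast(m_taxon, prior):
    """Ratio between the probability of a word found in every sequence of a
    taxon and the probability of a word found in none of them."""
    present = conditional_word_prob(m_taxon, m_taxon, prior)
    absent = conditional_word_prob(0, m_taxon, prior)
    return present / absent


# Compare taxa represented by 1, 5, and 20 sequences for a rare word.
rare_prior = word_prior(n_with_word=2, n_total=999)
for m in (1, 5, 20):
    print(m, conditional_word_prob(m, m, rare_prior), presence_contrast(m, rare_prior))
```

Under these assumed estimates, a word present in every sequence of a taxon gets a conditional probability closer to 1 as M grows, and a word absent from all of them gets one closer to 0. The contrast between matching and non-matching taxa therefore widens with more representative sequences per taxon, which is consistent with the highlight above.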

Summary

Introduction

The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Newer algorithms, many of which are not based on similarity thresholds, allow single-nucleotide resolution and can resolve 16S rRNA gene short-read sequences into species- or strain-level phylotypes, usually called amplicon sequence variants (ASVs). Examples include MED (minimal entropy decomposition) [8, 9], DADA2 (divisive amplicon denoising algorithm) [10, 11], and UNOISE2 [12, 13], among others [14, 15]. We have developed a method that combines the “reusability, reproducibility, and comprehensiveness” of ASVs, per Callahan and colleagues [11, 13], with the selection of highly informative regions of the 16S rRNA gene. This combination maximizes the potential of 16S rRNA gene short-read sequencing to achieve subgenus-level taxonomic assignment.

Methods

Results

Conclusion