GSR-DB: a manually curated and optimized taxonomical database for 16S rRNA amplicon analysis.

Leidy-Alejandra G Molano, Sara Vega-Abellaneda, Chaysavanh Manichanh

doi:10.1128/msystems.00950-23

Abstract

Amplicon-based 16S ribosomal RNA sequencing remains a widely used method to profile microbial communities, especially in low-biomass samples, because it is cost-effective and low in complexity. Taxonomic assignment depends on reference databases, typically popular ones such as SILVA, Greengenes, the Genome Taxonomy Database (GTDB), or the Ribosomal Database Project (RDP). However, inconsistent nomenclature across these databases and shortcomings in their annotations limit the resolution of the analysis. To overcome these limitations, we created the GSR database (Greengenes, SILVA, and RDP database), an integrated and manually curated database for bacterial and archaeal 16S amplicon taxonomy analysis. Unlike previous integration approaches, the pipeline that builds this database includes a taxonomy unification step to ensure consistent taxonomic annotations. The database was validated with three mock communities, two real data sets, and 10-fold cross-validation, and it was compared with existing 16S databases: Greengenes, Greengenes 2, GTDB, ITGDB, SILVA, RDP, and MetaSquare. In the mock-community evaluation, the GSR database improved the taxonomic annotation of 16S sequences and outperformed the current 16S databases at the species level. The 10-fold cross-validation confirmed this result against every database except Greengenes 2. The GSR database is available for full-length 16S sequences and for the most commonly used hypervariable regions: V4, V1-V3, V3-V4, and V3-V5.

Importance

Taxonomic assignment of microorganisms has long been hindered by inconsistent nomenclature and annotation issues in existing databases such as SILVA, Greengenes, Greengenes 2, the Genome Taxonomy Database, or the Ribosomal Database Project. To overcome these issues, we created the Greengenes-SILVA-RDP database (GSR-DB), which provides accurate and comprehensive taxonomic annotations of 16S amplicon data. Unlike previous approaches, our pipeline includes a taxonomy unification step that ensures consistent and reliable annotations. Our evaluation showed that GSR-DB outperforms existing databases in species-level resolution, particularly in the mock-community analysis, which makes it a valuable resource for microbiome studies. Moreover, GSR-DB is designed to be accessible to researchers with limited computational resources. It is available for full-length 16S sequences and for commonly used hypervariable regions, including V4, V1-V3, V3-V4, and V3-V5, and supports robust and accurate microbial taxonomy analysis.

Full Text