RESCRIPt: Reproducible sequence taxonomy reference database management.

Michael S Robeson, Devon R O’Rourke, Matthew R Dillon, Benjamin D Kaehler, Jeffrey T Foster, Nicholas A Bokulich, Michal Ziemski, Mihaela Pertea

doi:10.1371/journal.pcbi.1009581

Abstract

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and that reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though they may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications.
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

Highlights

Marker-gene amplicon and metagenome sequencing have become attractive methods for characterizing microbial community composition and function [1,2] in human health [3,4,5] and agriculture [6,7,8], as well as macroorganism diversity through diet metabarcoding studies [9,10,11] and environmental DNA surveys [12,13,14,15].
To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, Genome Taxonomy Database (GTDB)), environmental DNA (eDNA) and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison.
Generating and managing sequence and taxonomy reference data presents a bottleneck to many researchers, whether they are generating custom databases or attempting to format existing, curated reference databases for use with standard sequence analysis tools.

Summary

Introduction

Marker-gene amplicon and metagenome sequencing have become attractive methods for characterizing microbial community composition and function [1,2] in human health [3,4,5] and agriculture [6,7,8], as well as macroorganism diversity through diet metabarcoding studies [9,10,11] and environmental DNA (eDNA) surveys [12,13,14,15]. Taxonomic classification is often a primary goal in marker-gene and metagenome sequencing studies to identify the composition of a mixed community, or to detect species of interest (e.g., pathogens or invasive species). This is accomplished by comparing the observed sequences to a reference database consisting of target marker-gene or genome sequences from known species. Identification of Bacteria and Archaea is most commonly performed using the 16S rRNA gene, due to its historical use as a phylogenetic marker [18,19] and the existence of curated reference databases [20,21]. Non-16S genes are attractive targets for bacterial and archaeal species identification due to the degree of species resolution that they afford, but their application is limited by the relative lack of curated reference materials [27,28,29].
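The idea of comparing observed sequences to a reference database can be illustrated with a toy sketch. The example below is not RESCRIPt's or QIIME 2's classifier; it is a minimal illustration that assigns a query the taxonomy of the reference sequence sharing the largest fraction of k-mers with it. The function names (`kmers`, `classify`), the reference sequences, and the taxonomy labels are all hypothetical and chosen only for illustration.

```python
def kmers(seq, k=4):
    """Return the set of overlapping k-mers in a sequence (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(query, reference, k=4):
    """Return (taxonomy, score) for the best-matching reference sequence.

    The score is the Jaccard similarity between the k-mer sets of the
    query and a reference sequence. If no reference shares any k-mer
    with the query, the result is ("Unassigned", 0.0).
    """
    query_kmers = kmers(query, k)
    best_taxonomy, best_score = "Unassigned", 0.0
    for ref_seq, taxonomy in reference.values():
        ref_kmers = kmers(ref_seq, k)
        union = query_kmers | ref_kmers
        score = len(query_kmers & ref_kmers) / len(union) if union else 0.0
        if score > best_score:
            best_taxonomy, best_score = taxonomy, score
    return best_taxonomy, best_score


# Hypothetical toy reference database: ID -> (sequence, taxonomy string).
reference = {
    "ref1": ("ACGTACGTGGCCTTAAGGCT", "d__Bacteria; p__Firmicutes; g__Bacillus"),
    "ref2": ("TTGACCGATCGATTGACCAA", "d__Bacteria; p__Proteobacteria; g__Escherichia"),
}

# An "observed" sequence that differs from ref1 by a single base.
query = "ACGTACGTGGCCTTAAGGCA"
taxonomy, score = classify(query, reference)
print(taxonomy, round(score, 2))
```

Even in this toy setting the result depends entirely on which sequences and taxonomy labels the reference contains, which is why database composition and curation influence classification results.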

Journal: PLoS Computational Biology
Publication Date: Nov 8, 2021
Citations: 325
License type: CC BY 4.0
