Abstract

Background

Shotgun metagenomic sequencing reveals the functional potential of microbial communities. However, lower-cost 16S ribosomal RNA (rRNA) gene sequencing provides taxonomic, not functional, observations. To remedy this, we previously introduced Piphillin, a software package that predicts functional metagenomic content based on the frequency of detected 16S rRNA gene sequences corresponding to genomes in regularly updated, functionally annotated genome databases. Piphillin and similar tools have previously been evaluated on 16S rRNA data processed by clustering sequences into operational taxonomic units (OTUs). Newer techniques such as amplicon sequence variant error correction are increasingly used, but it is unknown whether they perform better in metagenomic content prediction pipelines, or whether they should be treated the same as OTU data with respect to optimal pipeline parameters.

Results

To evaluate the effect of the 16S rRNA sequence analysis method (clustering sequences into OTUs versus error correction into amplicon sequence variants (ASVs)) on the ability of Piphillin to predict functional metagenomic content, we compared Piphillin-predicted functional content from 16S rRNA sequence data processed by each method against corresponding shotgun metagenomic data. We show a strong correlation between metagenomic data and Piphillin-predicted functional content for both 16S rRNA sequence analysis methods. Differential abundance testing with Piphillin-predicted functional content exhibited a low false positive rate (< 0.05) while capturing a large fraction of the differentially abundant features found in the corresponding metagenomic data. However, Piphillin prediction performance was optimal at different cutoff parameters depending on the 16S rRNA sequence analysis method. Using data analyzed with amplicon sequence variant error correction, Piphillin outperformed comparable tools, for instance exhibiting 19% greater balanced accuracy and 54% greater precision compared to PICRUSt2.

Conclusions

Our results demonstrate that raw Illumina sequences should be processed for subsequent Piphillin analysis using amplicon sequence variant error correction (with DADA2 or similar methods) and run using a 99% ID cutoff for Piphillin, while sequences generated on platforms other than Illumina should be processed via OTU clustering (e.g., UPARSE) and run using a 96% ID cutoff for Piphillin. Piphillin is publicly available for academic users at http://piphillin.secondgenome.com/.
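
The false positive rate, balanced accuracy, and precision quoted above are standard confusion-matrix metrics. As a rough illustration only, the sketch below computes them by treating the features called differentially abundant in the shotgun metagenomic data as the reference set and the features called from predicted content as the test set. The function name, the feature identifiers, and the toy sets are hypothetical and are not taken from the paper or from Piphillin itself.

```python
def differential_abundance_metrics(predicted, reference, universe):
    """Compare two sets of differentially abundant features.

    predicted -- features called significant from predicted functional content
    reference -- features called significant from shotgun metagenomic data
    universe  -- all features tested (both sets are assumed to be subsets)
    """
    tp = len(predicted & reference)          # called in both
    fp = len(predicted - reference)          # called only from predictions
    fn = len(reference - predicted)          # missed by predictions
    tn = len(universe) - tp - fp - fn        # called in neither

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0

    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": precision,
        "false_positive_rate": false_positive_rate,
    }


# Hypothetical toy data: ten tested features, four truly differential.
universe = {f"K{i:05d}" for i in range(1, 11)}
reference = {"K00001", "K00002", "K00003", "K00004"}
predicted = {"K00001", "K00002", "K00003", "K00005"}

metrics = differential_abundance_metrics(predicted, reference, universe)
print(metrics)
```

Balanced accuracy averages sensitivity and specificity, so it is less affected than plain accuracy when only a small fraction of the tested features is truly differentially abundant.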

Highlights

  • Shotgun metagenomic sequencing reveals the functional potential of microbial communities

  • The 16S ribosomal RNA (rRNA) sequence analysis approach impacts the quantity of sequences kept for processing, correlation to metagenomic data, and detection of differentially abundant features

  • Traditionally, 16S rRNA gene sequence data has been analyzed via clustering sequences to an external reference, clustering sequences to an external reference followed by de novo operational taxonomic unit (OTU) clustering on the remaining reads, or de novo OTU clustering on all reads

  • We studied the impact of the 16S rRNA gene sequence analysis method, ASV error correction with DADA2 versus 97% de novo OTU clustering with UPARSE, on Piphillin results at varying identity cutoffs

Summary

Introduction

Shotgun metagenomic sequencing reveals the functional potential of microbial communities. However, lower-cost 16S ribosomal RNA (rRNA) gene sequencing provides taxonomic, not functional, observations. Piphillin and similar tools have previously been evaluated on 16S rRNA data processed by clustering sequences into operational taxonomic units (OTUs). Newer techniques such as amplicon sequence variant error correction are increasingly used, but it is unknown whether they perform better in metagenomic content prediction pipelines, or whether they should be treated the same as OTU data with respect to optimal pipeline parameters. Since Piphillin exploits nearest-neighbor matching of 16S rRNA gene sequences to genomes held in functionally annotated reference databases, the significant expansion of those databases increases the likelihood of finding a matching genome. These expansions enhance the integrity and accuracy of predicted genome contents. Considering these significant changes to the reference sequence databases, it is necessary to re-assess Piphillin using the same metrics and criteria described in the original paper.
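
To make the role of the identity cutoff concrete, the following is a simplified sketch of nearest-neighbor functional prediction. It is not Piphillin's actual implementation. It assumes each 16S sequence has already been aligned to its best-matching reference genome, that sequences below the cutoff are discarded, and that counts are normalized by the genome's 16S copy number before being multiplied by per-genome gene copy counts. All names, identifiers, and numbers are hypothetical.

```python
from collections import defaultdict


def predict_gene_abundance(seq_counts, best_hits, copy_number, gene_content, id_cutoff):
    """Sketch: infer gene abundances from 16S counts via best-hit genomes.

    seq_counts   -- {sequence_id: read count in one sample}
    best_hits    -- {sequence_id: (genome_id, percent identity of best match)}
    copy_number  -- {genome_id: number of 16S rRNA gene copies}
    gene_content -- {genome_id: {gene_id: copies of that gene in the genome}}
    id_cutoff    -- minimum percent identity for a match to be used
    """
    predicted = defaultdict(float)
    for seq_id, count in seq_counts.items():
        hit = best_hits.get(seq_id)
        if hit is None:
            continue  # no reference genome matched this sequence
        genome, identity = hit
        if identity < id_cutoff:
            continue  # match too distant to trust the genome's gene content
        # Convert 16S read counts to an estimated genome abundance.
        genome_abundance = count / copy_number[genome]
        for gene, copies in gene_content[genome].items():
            predicted[gene] += genome_abundance * copies
    return dict(predicted)


# Hypothetical inputs for a single sample.
seq_counts = {"asv1": 100, "asv2": 40, "asv3": 10}
best_hits = {
    "asv1": ("genomeA", 99.5),
    "asv2": ("genomeB", 97.0),
    "asv3": ("genomeA", 99.0),
}
copy_number = {"genomeA": 5, "genomeB": 2}
gene_content = {
    "genomeA": {"K00001": 1, "K00002": 2},
    "genomeB": {"K00001": 3},
}

strict = predict_gene_abundance(seq_counts, best_hits, copy_number, gene_content, 99.0)
relaxed = predict_gene_abundance(seq_counts, best_hits, copy_number, gene_content, 96.0)
print(strict)
print(relaxed)
```

In this toy case the stricter cutoff drops `asv2` entirely, while the relaxed cutoff keeps it and attributes the gene content of `genomeB` to the sample. The same trade-off between discarding reads and accepting more distant matches is why the choice of cutoff matters.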
