Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences.

Jai Ram Rideout, Yan He, William A Walters, Rob Knight, J Gregory Caporaso, Jose C Clemente, John Chase, Jack A Gilbert, Jose A Navas-Molina, Hong-Wei Zhou, Luke K Ursell, Adam Robbins-Pianka, Daniel Mcdonald, Antonio Gonzalez, Susan M Huse, Sean M Gibbons

doi:10.7717/peerj.545

Abstract

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate this by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that “classic” open-reference OTU picking would require. Through comparisons on three well-studied data sets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.

Highlights

Three high-level strategies for defining Operational Taxonomic Unit (OTU) cluster centroids have been widely applied for centroid-based greedy clustering (Li & Godzik, 2006; Edgar, 2010) of marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to facilitate microbial community analysis.
Minor differences likely arise from the non-deterministic step of rarefying all samples to even sampling depth before comparing samples. These results suggest that subsampled open-reference picking yields the same results as classic open-reference OTU picking, including identical numbers of sequences failing to hit the reference database, and is a suitable replacement.
Application to the Earth Microbiome Project dataset: In order to evaluate the effectiveness of the subsampled open-reference OTU picking method on an extremely large data set, the first 15,000 samples (1.3 billion V4 16S rRNA amplicons) from the Earth Microbiome Project (EMP, Gilbert et al, 2010) were processed on the Amazon Web Services (AWS) EC2 platform.
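The rarefaction step mentioned in the highlights above can be illustrated with a short sketch. It shows why rarefying to even sampling depth is non-deterministic: each sample's reads are drawn at random without replacement, so two runs can keep different reads unless a seed is fixed. The function name `rarefy`, the per-sample dictionary of OTU counts, and the `seed` parameter are assumptions made for illustration; this is not QIIME's implementation.

```python
import random

def rarefy(counts, depth, seed=None):
    """Randomly subsample one sample's OTU counts to `depth` reads.

    `counts` maps OTU identifier -> number of reads in the sample.
    Returns a new dict of counts summing to `depth`, or None when the
    sample holds fewer than `depth` reads (such samples are dropped).
    """
    # Expand the counts into one entry per read so every read is equally likely.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)
    picked = rng.sample(pool, depth)  # sampling without replacement
    rarefied = {}
    for otu in picked:
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied

# Hypothetical sample with 100 reads spread over three OTUs.
sample = {"otu_a": 60, "otu_b": 30, "otu_c": 10}
first = rarefy(sample, depth=50)
second = rarefy(sample, depth=50)
# `first` and `second` both sum to 50 but may differ read-for-read,
# which is one way small differences between otherwise identical runs can arise.
```

Passing the same `seed` to both calls makes the draw reproducible, which is useful when two OTU picking runs need to be compared on identical subsamples.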

Summary

Introduction

Three high-level strategies for defining Operational Taxonomic Unit (OTU) cluster centroids have been widely applied for centroid-based greedy clustering (Li & Godzik, 2006; Edgar, 2010) of marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to facilitate microbial community analysis. These are canonically described as de novo, closed-reference, and open-reference OTU picking (Navas-Molina et al, 2013). This approach cannot scale to modern-sized data sets.
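The three strategies named above can be contrasted with a toy sketch of centroid-based greedy clustering. Everything in it is an assumption made for illustration: the function names (`identity`, `closed_reference`, `de_novo`, `open_reference`), the position-by-position identity measure on equal-length strings, and the 0.9 threshold. Real tools such as uclust use alignment-based similarity and heuristics that are not reproduced here, and the sketch shows only the "classic" open-reference flow, not the subsampled variant.

```python
def identity(a, b):
    """Toy similarity: fraction of matching positions (equal-length reads only)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def closed_reference(reads, reference, threshold):
    """Assign each read to the first reference centroid it matches.

    Returns (otus, failures); reads that match no reference sequence are
    the failures that closed-reference picking leaves unclustered.
    """
    otus = {ref_id: [] for ref_id in reference}
    failures = []
    for read in reads:
        for ref_id, ref_seq in reference.items():
            if identity(read, ref_seq) >= threshold:
                otus[ref_id].append(read)
                break
        else:
            failures.append(read)
    return otus, failures

def de_novo(reads, threshold):
    """Greedy clustering with no reference: a read that matches no existing
    centroid becomes a new centroid. Each read depends on the centroids
    created before it, so this loop is inherently serial."""
    centroids = {}
    otus = {}
    for read in reads:
        for cid, cseq in centroids.items():
            if identity(read, cseq) >= threshold:
                otus[cid].append(read)
                break
        else:
            cid = f"denovo{len(centroids)}"
            centroids[cid] = read
            otus[cid] = [read]
    return otus

def open_reference(reads, reference, threshold):
    """Closed-reference first, then de novo clustering of the failures."""
    otus, failures = closed_reference(reads, reference, threshold)
    otus.update(de_novo(failures, threshold))
    return otus

# Hypothetical data: one reference sequence and five 10-base reads.
reference = {"ref0": "AAAAAAAAAA"}
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG", "GGGGGGGGGG"]
result = open_reference(reads, reference, threshold=0.9)
```

In this toy run the closed-reference step alone would cluster only the two reads that match `ref0`, while the open-reference flow also places the remaining three reads into new de novo OTUs, so every read ends up in a cluster. The closed-reference loop treats each read independently of the others, which is why that step can be split across processors, whereas the de novo loop cannot.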

Journal: PeerJ
Publication Date: Aug 21, 2014
Citations: 508
License type: cc-by
