A novel conceptual approach to read-filtering in high-throughput amplicon sequencing studies.

Fernando Puente-Sánchez, Jacobo Aguirre, Víctor Parro

doi:10.1093/nar/gkv1113

Abstract

Adequate read filtering is critical when processing high-throughput data in marker-gene-based studies. Sequencing errors can cause the mis-clustering of otherwise similar reads, artificially increasing the number of retrieved Operational Taxonomic Units (OTUs) and therefore leading to the overestimation of microbial diversity. Sequencing errors will also result in OTUs that are not accurate reconstructions of the original biological sequences. Herein we present the Poisson binomial filtering algorithm (PBF), which minimizes both problems by calculating the error-probability distribution of a sequence from its quality scores. In order to validate our method, we quality-filtered 37 publicly available datasets obtained by sequencing mock and environmental microbial communities with the Roche 454, Illumina MiSeq and IonTorrent PGM platforms, and compared our results to those obtained with previous approaches such as the ones included in mothur, QIIME and USEARCH. Our algorithm retained substantially more reads than its predecessors, while resulting in fewer and more accurate OTUs. This improved sensitivity produced more faithful representations, both quantitatively and qualitatively, of the true microbial diversity present in the studied samples. Furthermore, the method introduced in this work is computationally inexpensive and can be readily applied in conjunction with any existing analysis pipeline.
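
To illustrate how an error-probability distribution can be derived from quality scores, the following is a minimal Python sketch. It assumes the standard Phred conversion p = 10^(-Q/10) and treats each base as an independent Bernoulli trial, so that the number of errors in a read follows a Poisson binomial distribution. The function names, the `max_errors` and `min_confidence` parameters and their default values are illustrative assumptions, not the thresholds or the implementation used by the authors.

```python
from typing import List


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score into a per-base error probability."""
    return 10 ** (-q / 10)


def error_count_distribution(quals: List[int]) -> List[float]:
    """Return dist, where dist[k] = P(exactly k errors in the read).

    Each base is treated as an independent Bernoulli trial with its own
    error probability, so the error count is Poisson binomial. The
    distribution is built one base at a time by dynamic programming.
    """
    dist = [1.0]  # with zero bases, zero errors is certain
    for q in quals:
        p = phred_to_error_prob(q)
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)   # this base is correct
            new[k + 1] += mass * p     # this base is an error
        dist = new
    return dist


def prob_at_most(quals: List[int], max_errors: int) -> float:
    """Probability that the read contains at most max_errors errors."""
    dist = error_count_distribution(quals)
    return sum(dist[: max_errors + 1])


def keep_read(quals: List[int], max_errors: int = 1,
              min_confidence: float = 0.95) -> bool:
    """Hypothetical filter rule: keep the read if it has at most
    max_errors errors with probability >= min_confidence."""
    return prob_at_most(quals, max_errors) >= min_confidence
```

Under these assumptions, a 100-base read with a uniform quality of 40 would be kept even with `max_errors=0`, while a read with a uniform quality of 10 would be discarded. The dynamic-programming update is quadratic in read length, which is inexpensive for typical amplicon reads.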

Highlights

High-throughput sequencing of marker genes, such as the 16S ribosomal RNA, has become an invaluable tool for microbial ecologists, since it allows for a previously unreachable level of detail in the analysis of complex microbial communities.
We validated the Poisson binomial filtering algorithm and compared it with the different filtering approaches recommended by the authors of mothur [8,13,20], USEARCH/UPARSE [10,15,26] and QIIME [21] by quality-filtering datasets obtained by sequencing different mock and environmental microbial communities with the Roche 454 GS FLX Titanium, the Illumina MiSeq and the IonTorrent PGM platforms.
In order to evaluate the different methods on an equal footing, filtered reads were processed with a common downstream pipeline that included chimera filtering with UCHIME [27], sample size standardization and Operational Taxonomic Unit (OTU) clustering.
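
As a rough illustration of the sample size standardization step, the sketch below randomly subsamples every sample down to the depth of the smallest one. The text does not state how standardization was performed, so the subsampling strategy, the function name `standardize_sample_sizes` and the `seed` parameter are all assumptions made for this example.

```python
import random
from typing import Dict, List


def standardize_sample_sizes(samples: Dict[str, List[str]],
                             seed: int = 0) -> Dict[str, List[str]]:
    """Subsample each sample, without replacement, to the smallest depth.

    samples maps a sample name to its list of filtered reads. The target
    depth (size of the smallest sample) is an assumed convention.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    depth = min(len(reads) for reads in samples.values())
    return {name: rng.sample(reads, depth) for name, reads in samples.items()}
```

Equalizing depth before clustering keeps differences in OTU counts from simply reflecting how many reads each filtering method retained.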

Summary

Introduction

High-throughput sequencing of marker genes, such as the 16S ribosomal RNA, has become an invaluable tool for microbial ecologists, since it allows for a previously unreachable level of detail in the analysis of complex microbial communities. Alternatives to traditional clustering have recently been proposed, such as distribution-based clustering [16] or a clustering-free approach [17]. These novel methods are especially suited for subpopulation-level studies, but work only for moderate-to-high abundance sequences, making them unsuitable for population-level alpha or beta diversity studies [17]. Even though they can remove likely erroneous sequences and resolve subpopulations based on dynamic information, they still rely on a quality-filtering step for the preprocessing of raw reads [17].

Methods

Results

Conclusion