From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data.

Mohamed Mysara, Pieter Monsieurs, Mercy Njima, Jeroen Raes, Natalie Leys

doi:10.1093/gigascience/giw017

Abstract

The development of high-throughput sequencing technologies has provided microbial ecologists with an efficient approach to assess bacterial diversity at an unprecedented depth, particularly with the recent advances in the Illumina MiSeq sequencing platform. However, analyzing such high-throughput data poses important computational challenges, requiring specialized bioinformatics solutions at different stages of the processing pipeline, such as assembly of paired-end reads, chimera removal, correction of sequencing errors, and clustering of those sequences into Operational Taxonomic Units (OTUs). Individual algorithms addressing each of those challenges have been combined into various bioinformatics pipelines, such as mothur, QIIME, LotuS, and USEARCH. Using a set of well-described bacterial mock communities, state-of-the-art pipelines for Illumina MiSeq amplicon sequencing data are benchmarked in terms of the number of sequences retained, computational cost, error rate, and quality of the OTUs. In addition, a new pipeline called OCToPUS is introduced, which optimally combines different algorithms. Large variability is observed between the different pipelines with respect to the monitored performance parameters; in general, the number of retained reads is found to be inversely proportional to the quality of the reads. By contrast, OCToPUS achieves the lowest error rate, the minimum number of spurious OTUs, and the closest correspondence to the existing community, while retaining the highest number of reads when compared to other pipelines. The newly introduced pipeline translates Illumina MiSeq amplicon sequencing data into high-quality and reliable OTUs, with improved performance and accuracy compared to the currently existing pipelines.

Highlights

Line 290 - "amount of reads removed by USEARCH - the pipeline with the second best performance - is drastically lower compared with other pipelines..." - Doesn't USEARCH reject the greatest number of reads? Or am I reading this sentence incorrectly?
Line 338 - has a period at the end, whereas none of the others do.
Methods: Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Yes
Conclusions: Are the conclusions adequately supported by the data shown? Yes
Reporting Standards: Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? Yes
Statistics: Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? There are no statistics in the manuscript.

Summary

Introduction

Reviewer Comments to Author: Review of revised OCToPUS manuscript 9-29-16
Thank you for addressing the points from the previous review and making the edits in this revised manuscript.
Line 255-261 - The authors state "the amount of reads retained by each of the workflows was dramatically differing between each of them." However, the average percentages for 4 of them were very similar (23%, 24%, 26%, 26%).
Minor edits:
Line 119 - "contains of 21 species"; perhaps the "of" should be deleted so it reads "contains 21 species".

Results

Conclusion