Abstract

High-throughput DNA- or RNA-based 16S rRNA (gene) amplicon sequencing combined with bioinformatics analysis is one of the major methods for identifying microbial community composition, unraveling microbial population dynamics, and exploring microbial diversity in environmental samples. However, for environmental samples from contrasting habitats, it has not been systematically evaluated (i) which analysis methods provide results that reflect reality most accurately, (ii) how interpretations of microbial community studies are biased by different analysis methods, and (iii) whether the optimal analysis workflow can be implemented in an easy-to-use pipeline. Here, we compared the performance of 16S rRNA (gene) amplicon sequencing analysis tools (i.e., Mothur, QIIME1, QIIME2, and MEGAN) using three mock datasets with known microbial community composition that differed in sequencing quality, species number and abundance distribution (i.e., even or uneven), and phylogenetic diversity (i.e., closely related or well-separated amplicon sequences). Our results showed that QIIME2 outperformed all other investigated tools in sequence recovery (>10 times fewer false positives), taxonomic assignments (>22% better F-score), and diversity estimates (>5% better assessment), suggesting that this approach reflects the in situ microbial community most accurately. Further analysis of 24 environmental datasets obtained from four contrasting terrestrial and freshwater sites revealed dramatic differences in the resulting microbial community composition at genus level across the pipelines. For instance, at the investigated river water sites, Sphaerotilus was reported only when using QIIME1 (8% abundance) and Agitococcus only with QIIME1 or QIIME2 (2% or 3% abundance, respectively); both genera remained undetected when analyzed with Mothur or MEGAN. Since these abundant taxa probably have implications for important biogeochemical cycles (e.g., nitrate and sulfate reduction) at these sites, their detection and semi-quantitative enumeration are crucial for valid interpretations. A high-performance-computing-conformant workflow was constructed to allow FAIR (Findable, Accessible, Interoperable, and Re-usable) 16S rRNA (gene) amplicon sequence analysis starting from raw sequence files, using the optimal methods identified in our study. The presented workflow should be considered for future studies, as it substantially facilitates the analysis of high-throughput 16S rRNA (gene) sequencing data while maximizing reliability and confidence in microbial community data analysis.

Highlights

  • We found that (i) QIIME2 results reflected reality most accurately for mock communities and (ii) interpretations of microbial studies were biased by the analysis method with regard to sequence recovery, taxonomic identification, and diversity measures; (iii) we also implemented a high-quality analysis workflow using the lessons learned in this study

  • Mothur and QIIME1 recovered almost all 16S rRNA gene amplicon sequences and genera, but the number and abundance of false positives were relatively high, so that the true positive sequences were sometimes buried underneath false positives

Introduction

The ribosomal 16S rRNA gene is a phylogenetic marker that has been analyzed extensively within the last decade because of its presence in all microorganisms (Hugenholtz et al., 1998) and because it combines variable regions, which are shaped by the evolutionary clock and allow differentiation of taxa, with conserved regions that allow universal priming (Head et al., 1998). All current analysis methods suffer from imperfect recall (not all sequences or taxa are detected) or imperfect precision (additional false sequences or taxa are detected) (Callahan et al., 2016), which originate from a diverse set of frequent shortcomings across the entire workflow. These include biases in sample preparation (e.g., DNA extraction, PCR, sequencing library preparation), suboptimal experimental design (e.g., amplicon and primer selection), erroneous sequences produced by the sequencing method, and the bioinformatics analysis strategy (Kozich et al., 2013; Wesolowska-Andersen et al., 2014; de Muinck et al., 2017; Laursen et al., 2017; Almeida et al., 2018; Nearing et al., 2018).
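The recall and precision notions described above can be illustrated with a minimal sketch that compares the taxa reported by a pipeline against a mock community of known composition. The function name, the genus lists, and the use of the balanced F1 variant of the F-score are assumptions made for illustration only; they are not taken from the study.

```python
def precision_recall_f1(expected, detected):
    """Compare detected taxa against a known (mock) community.

    Returns (precision, recall, f1). The balanced F1 score is assumed here;
    the study's exact F-score definition may differ.
    """
    expected, detected = set(expected), set(detected)
    true_pos = len(expected & detected)    # taxa correctly reported
    false_pos = len(detected - expected)   # reported but not in the mock community
    false_neg = len(expected - detected)   # in the mock community but missed

    precision = true_pos / (true_pos + false_pos) if detected else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical genus lists, not data from the study.
mock_genera = ["Escherichia", "Bacillus", "Pseudomonas", "Staphylococcus"]
reported_genera = ["Escherichia", "Bacillus", "Pseudomonas", "Shigella", "Listeria"]

precision, recall, f1 = precision_recall_f1(mock_genera, reported_genera)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case, three of the four expected genera are recovered (recall 0.75), while two of the five reported genera are false positives (precision 0.60). A pipeline that reports many spurious taxa can therefore score poorly even when it recovers nearly everything that is truly present.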
