Abstract

BackgroundOne of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited.ResultsIn this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages.ConclusionsThis study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.

Highlights

  • One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples

  • To evaluate the extent of the problem of false positives with respect to specific tools, we calculated precision, recall, area under the precision-recall curve (AUPR), and F1 score based on detection of the presence or absence of a given genus, species, or subspecies at any abundance

  • For k-mer-based tools, adding an abundance threshold increased precision and F1 score, which is more affected than AUPR by false positives detected at low abundance, bringing both metrics to the same a d b c

Read more

Summary

Introduction

One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. The main objectives when sequencing a metagenomic community are to detect, identify, and describe its component taxa. Selective amplification (e.g. 16S, 18S, ITS) of specific gene regions has long been standard for microbial community sequencing, but it introduces bias and omits organisms and functional elements from analysis. Conserved regions within these genes permit the use of common primers for sequencing [7]. Certain species of archaea include introns with repetitive regions that interfere with the binding of the most common 16S primers [8, 9] and 16S amplification is unable to capture viral, plasmid, and eukaryotic members of a microbial community [10], which may represent pivotal drivers of an individual infection or epidemic. Conserved genes with higher evolutionary rates than 16S rRNA [11] or gene panels could improve discriminatory power among closely related strains of prokaryotes, these strategies suffer from low adoption and underdeveloped reference databases

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call