Abstract

Background. It is possible to detect bacterial species in shotgun metagenome datasets through the presence of only a few sequence reads. However, false positive results can arise, as was the case in the initial findings of a recent New York City subway metagenome project. False positives are especially likely when two closely related species are present in the same sample. Bacillus anthracis, the etiologic agent of anthrax, is a high-consequence pathogen that shares >99% average nucleotide identity with Bacillus cereus group (BCerG) genomes. Our goal was to create an analysis tool that used k-mers to detect B. anthracis, incorporating information about the coverage of BCerG in the metagenome sample.

Methods. Using public complete genome sequence datasets, we identified a set of 31-mer signatures that differentiated B. anthracis from other members of the BCerG, and another set that differentiated BCerG genomes (including B. anthracis) from other Bacillus strains. We also created a set of 31-mers for detecting the lethal factor gene, the key genetic diagnostic of the presence of anthrax-causing bacteria. We created synthetic sequence datasets based on existing genomes to test the accuracy of a k-mer based detection model.

Results. We found 239,503 B. anthracis-specific 31-mers (the Ba31 set), 10,183 BCerG 31-mers (the BCerG31 set), and 2,617 lethal factor k-mers (the lef31 set). We showed that false positive B. anthracis k-mers, which arise from random sequencing errors, are observable at high genome coverages of B. cereus. We also showed that there is a “gray zone” below 0.184× coverage of the B. anthracis genome sequence, in which lethal factor k-mers cannot be expected to be identified with high probability. We created a linear regression model to differentiate the presence of B. anthracis-like chromosomes from sequencing errors given the BCerG background coverage. We showed that while shotgun datasets from the New York City subway metagenome project had no matches to lef31 k-mers and hence were negative for B. anthracis, some samples showed evidence of strains very closely related to the pathogen.

Discussion. This work shows how extensive libraries of complete genomes can be used to create organism-specific signatures to help interpret metagenomes. We contrast “specialist” approaches to metagenome analysis, such as this work, with “generalist” software that seeks to classify all organisms present in the sample, and we note the more general utility of a k-mer filter approach when taxonomic boundaries lack clarity or high levels of precision are required.
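The signature idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`canonical`, `kmers`, `build_signature`, `count_hits`) are hypothetical, and the selection rule used here (k-mers shared by every target genome and absent from every background genome, with each k-mer and its reverse complement treated as one) is an assumption made for illustration only.

```python
from typing import Iterable, Iterator, Set

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq: str, k: int = 31) -> Iterator[str]:
    """Yield canonical k-mers from seq, skipping windows with ambiguous bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            yield canonical(window)


def build_signature(target: Iterable[str], background: Iterable[str],
                    k: int = 31) -> Set[str]:
    """Assumed rule: keep k-mers found in all target genomes and in no background genome."""
    target_sets = [set(kmers(genome, k)) for genome in target]
    signature = set.intersection(*target_sets) if target_sets else set()
    for genome in background:
        signature -= set(kmers(genome, k))
    return signature


def count_hits(reads: Iterable[str], signature: Set[str], k: int = 31) -> int:
    """Count k-mer occurrences in the reads that match the signature set."""
    return sum(1 for read in reads for km in kmers(read, k) if km in signature)
```

In use, a Ba31-like set would be built with B. anthracis genomes as `target` and other BCerG genomes as `background`, and `count_hits` would then be run over the metagenome reads. Because a single sequencing error in a read from a close relative can create a spurious match, raw hit counts need to be interpreted against the background coverage, which is the role of the regression model described in the Results.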

Highlights

  • There is great interest in the use of shotgun metagenome data to detect pathogens in clinical and environmental samples

  • We incorporate some of the results introduced informally on our blog and extend them to create a k-mer based approach—using recent public B. anthracis and Bacillus cereus group (BCerG) data—to analyze in greater detail how to search for traces of B. anthracis in shotgun metagenome data

  • The reads aligned along the entire length of the chromosome, forming a characteristic peak at the replication origin, a pattern often seen when other bacterial chromosomes have been recovered from metagenome samples (Brown et al, 2016)

Summary

Introduction

There is great interest in the use of shotgun metagenome data to detect pathogens in clinical and environmental samples. In 2015, Afshinnekoo et al (2015a) published initial findings from an extensive study of the New York City subway metagenome, in which they claimed to have detected the bacteria responsible for anthrax (Bacillus anthracis) and plague (Yersinia pestis). While these misidentifications were swiftly corrected (Mason, 2015; Afshinnekoo et al, 2015b), indistinct boundaries between species can produce many errors of this kind. We showed that while shotgun datasets from the New York City subway metagenome project had no matches to lef k-mers and were negative for B. anthracis, some samples showed evidence of strains very closely related to the pathogen. We contrast “specialist” approaches to metagenome analysis, such as this work, with “generalist” software that seeks to classify all organisms present in the sample, and we note the more general utility of a k-mer filter approach when taxonomic boundaries lack clarity or high levels of precision are required.
