Abstract

BackgroundA metagenome is a collection of genomes, usually in a micro-environment, and sequencing a metagenomic sample en masse is a powerful means for investigating the community of the constituent microorganisms. One of the challenges is in distinguishing between similar organisms due to rampant multiple possible assignments of sequencing reads, resulting in false positive identifications. We map the problem to a topological data analysis (TDA) framework that extracts information from the geometric structure of data. Here the structure is defined by multi-way relationships between the sequencing reads using a reference database.ResultsBased primarily on the patterns of co-mapping of the reads to multiple organisms in the reference database, we use two models: one a subcomplex of a Barycentric subdivision complex and the other a Čech complex. The Barycentric subcomplex allows a natural mapping of the reads along with their coverage of organisms while the Čech complex takes simply the number of reads into account to map the problem to homology computation. Using simulated genome mixtures we show not just enrichment of signal but also microbe identification with strain-level resolution.ConclusionsIn particular, in the most refractory of cases where alternative algorithms that exploit unique reads (i.e., mapped to unique organisms) fail, we show that the TDA approach continues to show consistent performance. The Čech model that uses less information is equally effective, suggesting that even partial information when augmented with the appropriate structure is quite powerful.

Highlights

  • A metagenome is a collection of genomes, usually in a micro-environment, and sequencing a metagenomic sample en masse is a powerful means for investigating the community of the constituent microorganisms

  • We applied the model on simulated shotgun sequencing reads from a collection of 36 recently published Salmonella genomes [15], in an effort to study the applicability of the approach to strain-level detection

  • Read simulation and mapping was performed with the Metagenomics Computation and Analytics Workbench (MCAW) [18]

Read more

Summary

Results

We applied the model on simulated shotgun sequencing reads from a collection of 36 recently published Salmonella genomes [15], in an effort to study the applicability of the approach to strain-level detection. We observed a simulated mixture of 3 strains where two false positives (FPs) had more unique reads (2,596 and 2,427) than one of the true positives (TPs) (2,105). In this case, relying only on the number of unique reads would miss the truly present strain and falsely indicate the presence of missing ones. As a comparison, ordering the organisms by read coverage (computed by bedtools [19]) of the same reads as in the TDA input, the TPs could be separated from FPs. as discussed earlier, using only the uniquely mapping reads would lead to false positives. We demonstrated that the TDA approach solves the problem of enriching the top of the list of voting values for truly present organisms

Conclusions
Background
Methods
Discussion and conclusions

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.