Abstract

Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering together the reads that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of the environments of interest to many metagenomics projects.

Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets and suggest a hybrid of SCIMM and the supervised learning method Phymm, called PHYSCIMM, that performs better when evolutionarily close training genomes are available.

Conclusions: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.

Highlights

  • Sequencing of environmental DNA has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods

  • This is useful for clustering of metagenomic sequences where the amount of sequence from each species may differ widely due to differential abundance of organisms and the amount of sequencing performed on the sample

  • Simulated reads: To assess the performance of SCIMM and PHYSCIMM, we simulated sequencing reads from mixtures of 1028

Summary

Introduction

Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to drive the discovery and understanding of the “unculturable majority” of species – the vast number of unknown microbes that cannot be cultured in the laboratory [3]. Successful metagenomics projects have sequenced DNA from ocean water sampled from around the world [4], microbial communities in and on humans [5,6,7,8], and acid drainage from an abandoned mine [9]. The output from an environmental shotgun sequencing project is a large set of DNA sequence “reads” of unknown origin. Because these reads come from a diverse population of microbial strains, assembly produces a large collection of small contigs (contiguous stretches of unambiguously overlapping reads) [13,14]. Advances in computational analysis techniques are essential to move the field forward.
