Abstract

Background

The new research field of metagenomics is providing exciting insights into various, previously unclassified ecological systems. Next-generation sequencing technologies are producing a rapid increase of environmental data in public databases. There is great need for specialized software solutions and statistical methods for dealing with complex metagenome data sets.

Methodology/Principal Findings

To facilitate the development and improvement of metagenomic tools and the planning of metagenomic projects, we introduce a sequencing simulator called MetaSim. Our software can be used to generate collections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets. Based on a database of given genomes, the program allows the user to design a metagenome by specifying the number of genomes present at different levels of the NCBI taxonomy, and then to collect reads from the metagenome using a simulation of a number of different sequencing technologies. A population sampler optionally produces evolved sequences based on source genomes and a given evolutionary tree.

Conclusions/Significance

MetaSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software.
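
As a rough illustration of the idea described above, the following minimal sketch draws error-free, fixed-length reads from a small set of genomes according to user-chosen abundances. It is not MetaSim's implementation: the function name `sample_reads`, its parameters, the toy genomes, and the choice to weight each genome by abundance times genome length are all assumptions made for this example, and no sequencing error model or evolutionary sampling is included.

```python
import random


def sample_reads(genomes, abundances, n_reads, read_len, seed=0):
    """Draw fixed-length, error-free reads from a toy metagenome.

    genomes:    dict mapping a genome name to its sequence (hypothetical input)
    abundances: dict mapping a genome name to its relative copy number
    Returns a list of (genome_name, start_position, read_sequence) tuples.
    """
    for name, seq in genomes.items():
        if len(seq) < read_len:
            raise ValueError(f"genome {name!r} is shorter than read_len")

    rng = random.Random(seed)
    names = list(genomes)
    # Assumption: a genome contributes reads in proportion to
    # (relative abundance) x (genome length).
    weights = [abundances[name] * len(genomes[name]) for name in names]

    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        seq = genomes[name]
        # Pick a uniformly random start so the read fits inside the genome.
        start = rng.randrange(len(seq) - read_len + 1)
        reads.append((name, start, seq[start:start + read_len]))
    return reads


if __name__ == "__main__":
    toy_genomes = {"A": "ACGT" * 50, "B": "TTGGCCAA" * 25}
    toy_abundances = {"A": 3, "B": 1}
    for name, start, read in sample_reads(toy_genomes, toy_abundances, 5, 20):
        print(name, start, read)
```

Because each read keeps the name of its source genome, a data set produced this way can serve as ground truth when checking how a binning tool assigns reads.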

Highlights

  • Metagenomics is based on the isolation and characterization of DNA from environmental samples without the need for prior cultivation of microorganisms

  • To show the utility of this simulation software, for example in benchmarking new software, we generated nine data sets using a range of parameters and used them to test how well the MEGAN software bins sequences taxonomically based on homology

  • Summary of MEGAN results: the analysis of the nine artificial data sets helps to reveal the pros and cons of taxonomical binning based on homology as done by MEGAN


Summary

Introduction

Metagenomics is based on the isolation and characterization of DNA from environmental samples without the need for prior cultivation of microorganisms. The field has been spurred by the recent development and improvement of next-generation sequencing technologies such as Roche’s 454 pyrosequencing [7]. Although these high-throughput technologies promise faster and relatively inexpensive generation of reads, Sanger sequencing is still used in environmental genome projects [5] to avoid the drawbacks of shorter read lengths. Studies show that algorithms developed for single-genome assembly are suitable for environmental sequences only under special conditions, for example in low-complexity populations [2,8]. The new research field of metagenomics is providing exciting insights into various, previously unclassified ecological systems. There is great need for specialized software solutions and statistical methods for dealing with complex metagenome data sets.
