Abstract
Background: Metagenomics is the study of genetic material derived directly from complex microbial samples rather than from culture. One of the crucial steps in metagenomic analysis, referred to as "binning", is to separate reads into clusters that represent genomes from closely related organisms. Among the existing binning methods, unsupervised methods base the classification on features extracted from the reads themselves, which is especially advantageous when reference databases are limited. However, their performance, under various aspects, is still being investigated by recent theoretical and empirical studies. The work in this paper is one such effort to improve classification accuracy.

Results: This paper presents an unsupervised algorithm, called BiMeta, for binning reads from different species in a metagenomic dataset. The algorithm consists of two phases. In the first phase, reads are clustered into groups based on overlap information between the reads. The second phase merges the groups by using an observation on the l-mer frequency distribution of sets of non-overlapping reads. The experimental results on simulated and real datasets showed that BiMeta outperforms three state-of-the-art binning algorithms on both short-read and long-read (≥700 bp) datasets.

Conclusions: This paper presents a novel and efficient algorithm for binning metagenomic reads that does not require any reference database. The software implementing the algorithm and all test datasets mentioned in this paper can be downloaded at http://it.hcmute.edu.vn/bioinfo/bimeta/index.htm.

Electronic supplementary material: The online version of this article (doi:10.1186/s13015-014-0030-4) contains supplementary material, which is available to authorized users.
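The two ingredients named in the Results paragraph, overlap-based grouping and l-mer frequency profiles, can be illustrated with a minimal Python sketch. This is not the BiMeta implementation: the function names, the use of a single shared q-mer as a stand-in for "overlap", the default values of `l` and `q`, and the Euclidean distance between profiles are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def lmer_profile(reads, l=4):
    """Return the normalized l-mer frequency vector of a set of reads.

    The vector has 4**l entries, ordered lexicographically over ACGT.
    l-mers containing other symbols (e.g. N) are skipped.
    """
    keys = ["".join(p) for p in product(ALPHABET, repeat=l)]
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            lmer = read[i:i + l]
            if all(c in ALPHABET for c in lmer):
                counts[lmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in keys]


def group_by_shared_qmers(reads, q=8):
    """Hypothetical stand-in for phase 1: put two reads in the same
    group if they share at least one q-mer (union-find over reads)."""
    parent = list(range(len(reads)))

    def find(x):
        # Path-halving find.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # q-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - q + 1):
            qmer = read[i:i + q]
            if qmer in first_seen:
                ra, rb = find(idx), find(first_seen[qmer])
                if ra != rb:
                    parent[ra] = rb
            else:
                first_seen[qmer] = idx

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


def profile_distance(p, q):
    """Euclidean distance between two l-mer profiles (an assumed
    measure; a merging phase would compare groups with some such score)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


reads = ["ACGTACGTAC", "GTACGTACGG", "TTTTTTTTTT"]
groups = group_by_shared_qmers(reads, q=8)
profiles = [lmer_profile([reads[i] for i in g], l=2) for g in groups]
```

In this toy input the first two reads share the 8-mer `GTACGTAC` and end up in one group, while the third read forms its own group. A merging phase in the spirit of the abstract would then compare the per-group profiles, for example with `profile_distance`, and merge groups whose profiles are close.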
Highlights
Metagenomics is the study of genetic materials derived directly from complex microbial samples, instead of from culture
Illumina reads range from 50 to 300 bp in length, while the Roche 454 system can produce reads with a length of 700 bp [8]. Analysis tools for both long reads and short reads are necessary for metagenomic projects
This paper presents a novel unsupervised algorithm to classify reads from different organisms in a metagenomic dataset, called BiMeta (i.e., A Binning algorithm for Metagenomic reads)
Summary
Metagenomics is the study of genetic material derived directly from complex microbial samples rather than from culture. Among the existing binning methods, unsupervised methods base the classification on features extracted from the reads themselves, which is especially advantageous when reference databases are limited. Their performance, under various aspects, is still being investigated by recent theoretical and empirical studies. Early efforts to study microbial samples usually used traditional methods that focus only on a single species in laboratory culture. Because a sample obtained from a microbial community may contain many species that interact both with each other and with their habitats, a clone culture cannot represent the true state of affairs. Sanger sequencing technology was used in some early metagenomic projects [4,5]. Analysis tools for both long reads and short reads are necessary for metagenomic projects.