Abstract

Background

With the rapid development of high-throughput technologies, researchers can sequence the whole metagenome of a microbial community sampled directly from the environment. Assigning these metagenomic reads to different species or taxonomic classes, referred to as binning of metagenomic data, is a vital step in metagenomic analysis.

Results

In this paper, we propose a new method, TM-MCluster, for binning metagenomic reads. First, we represent each metagenomic read as a set of "k-mers" together with the frequency of each k-mer in the read. Then, we apply a probabilistic topic model, the Latent Dirichlet Allocation (LDA) model, to the reads. The model generates a number of hidden "topics" such that each read can be represented by a distribution vector over the generated topics. Finally, as in the MCluster method, we apply SKWIC, a variant of the classical K-means algorithm with an automatic feature-weighting mechanism, to cluster the reads represented by their topic distributions.

Conclusions

Experiments show that the new method TM-MCluster outperforms major existing methods, including AbundanceBin, MetaCluster 3.0/5.0 and MCluster. This result indicates that exploiting topic modeling can effectively improve the binning performance for metagenomic reads.

Highlights

  • Due to the limitations of biological experiments, traditional microbial genomic studies focus on individual bacterial genomes

  • The series of MetaCluster algorithms can automatically determine the number of clusters, which is extremely important for binning of metagenomic reads as most samples are from unknown species in real datasets

  • The proposed method, TM-MCluster, consists of three major steps: 1) representing each read as a vector of k-mers with their occurrence frequencies; 2) transforming each read vector into a topic distribution vector using the Latent Dirichlet Allocation (LDA) model [15]; 3) clustering the vectorized reads with the SKWIC algorithm [16], as in the MCluster method [14]
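
The first of these steps, turning a read into a vector of k-mer counts, can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `kmer_vector`, the default `k=4`, the lexicographic ordering of the vocabulary, and the choice to skip windows containing ambiguous bases such as `N` are all assumptions made here, since this page does not state them.

```python
from collections import Counter
from itertools import product


def kmer_vector(read, k=4):
    """Count overlapping k-mers in a read and return a fixed-order count vector.

    Hypothetical helper: k and the handling of non-ACGT symbols are assumptions.
    """
    read = read.upper()
    alphabet = set("ACGT")
    counts = Counter()
    # Slide a window of length k over the read, one base at a time.
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        # Skip windows that contain anything other than A, C, G, T (e.g. N).
        if set(window) <= alphabet:
            counts[window] += 1
    # Fixed vocabulary of all 4**k possible k-mers, in lexicographic order,
    # so that every read maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] for kmer in vocabulary]


# Example: a short read with k = 2 gives a vector of length 4**2 = 16.
vector = kmer_vector("ACGTAC", k=2)
```

Stacking one such vector per read gives a read-by-k-mer count matrix, the same shape as the document-by-word matrix that a topic model such as LDA takes as input, with reads playing the role of documents and k-mers the role of words.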



