Abstract
Background: Metagenomics extracts microbial genetic material directly from environmental samples and sequences it; the resulting reads can then be assembled into contigs with assembly tools. Contig clustering methods are subsequently applied to recover complete genomes from the samples. The main problem with current clustering methods is that they struggle to recover high-quality genes from complex environments, for three reasons. First, multiple strains of the same species coexist, which leads to chimeric assemblies. Second, different strains of the same species are difficult to separate. Third, the number of strains is difficult to determine during clustering.

Results: To address these shortcomings, we propose an unsupervised clustering method that improves the recovery of genes from complex environments, together with a new method for selecting the number of strains in a sample during clustering. Sequence composition (tetranucleotide frequency) and co-abundance are combined to train the probability model used for clustering. A new recursive strategy that progressively reduces the complexity of the samples is proposed to improve gene recovery from complex environments. The new clustering method was tested on both simulated and real metagenomic datasets and compared with five state-of-the-art methods: CONCOCT, Maxbin2.0, MetaBAT, MyCC and COCACOLA. In terms of the number and quality of genes recovered from the metagenomic datasets, the results show that the proposed method is more effective.

Conclusions: A new contig clustering method is proposed that can recover more high-quality genes from complex environmental samples.
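The sequence composition feature named in the abstract, tetranucleotide frequency, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `tetranucleotide_frequency` and the choice to skip windows containing ambiguous bases are assumptions made for illustration, and, unlike many binning tools, the sketch does not merge reverse-complement 4-mers.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequency(contig):
    """Return the normalized 4-mer frequency vector of one contig.

    Hypothetical helper: windows containing characters other than
    A, C, G, T (for example N) are ignored, which is an assumption.
    """
    seq = contig.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only windows made of unambiguous bases contribute to the total.
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]
```

Each contig then maps to a fixed-length vector that a clustering model could combine with per-sample coverage (co-abundance) values.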
Highlights
Metagenomics can directly extract microbial genetic material from environmental samples to obtain sequencing reads, which can be further assembled into contigs with assembly tools
Metagenomics arose with the development of second-generation sequencing technology, which can obtain the genetic material of all microorganisms in a sample directly from the natural environment, without the pure culture on a medium that traditional methods require
We propose MetaCRS (unsupervised clustering of Contigs with the Recursive Strategy of reducing metagenomic dataset’s complexity), a new clustering method that progressively reduces the complexity of the samples through a recursive strategy to improve the recovery of genes from complex environments, together with a new method for determining the number of strains in the samples
Summary
To verify the effectiveness of the proposed method and its ability to recover genes from complex environments, we compared it with five state-of-the-art clustering methods, CONCOCT [20], Maxbin2.0 [23], MetaBAT [24], MyCC [25], and COCACOLA [26], on simulated and real datasets. The tests used three datasets of different complexity. Our method obtained the largest number of genes at almost every recall threshold on both the medium-complexity and high-complexity datasets. On the low-complexity dataset, CONCOCT outperformed our method at recall rates greater than 90%. A possible reason is that the K-means algorithm used in our method is affected by dirty data in the low-complexity dataset, resulting in poor clustering in the first stage.