Abstract

Background: Metagenomic sequencing is a powerful technology for studying the mixture of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and sequencing coverage that is usually insufficient.

Results: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in the sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to bridge the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations encountered in true metagenomic data. Results on real data indicate that the amount of uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.

Conclusions: We posed the question of how long the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.

Highlights

  • The k-mer redundancy index (KRI) is defined as the ratio of the total k-mer count (TKC) to the distinct k-mer count (DKC); it reflects the degree to which k-mers are repeated in the sequence

  • Results on simulated metagenomic datasets: we tested our method on all synthetic metagenomic samples
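
The KRI definition in the highlights can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `kmer_counts` and `kmer_redundancy_index` are hypothetical, and the sketch assumes k-mers are counted on the forward strand only, with no reverse-complement (canonical) merging, which may differ from how the paper counts them.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every k-mer on the forward strand of `sequence` (illustrative helper)."""
    seq = sequence.upper()
    # A sequence of length L yields L - k + 1 overlapping k-mers (none if L < k).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_redundancy_index(sequence: str, k: int) -> float:
    """KRI = TKC / DKC: total k-mer count divided by distinct k-mer count."""
    counts = kmer_counts(sequence, k)
    tkc = sum(counts.values())  # total k-mer count (TKC)
    dkc = len(counts)           # distinct k-mer count (DKC)
    if dkc == 0:
        raise ValueError("sequence is shorter than k")
    return tkc / dkc


# Toy sequence: 8 overlapping 3-mers, of which 4 are distinct.
toy = "ACGTACGTAC"
print(kmer_redundancy_index(toy, k=3))  # prints 2.0
```

A KRI of 1 means no k-mer is repeated. Under the same assumptions, a single linear sequence of length L has TKC = L − k + 1, so its length can be recovered as KRI × DKC + k − 1. This arithmetic is what lets an estimate of the distinct k-mer count be converted into a length.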

Summary

Introduction

Metagenomic sequencing is a powerful technology for studying the mixture of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and sequencing coverage that is usually insufficient. The basic task of a metagenomic study is to read out the underlying information about the microbiome from the metagenomic sample. The level of coverage of a metagenomic sample is of key importance for recovering that information.
