Abstract

The ever-increasing affordability of next-generation sequencing makes whole-metagenome sequencing an attractive alternative to traditional 16S rDNA, RFLP, or culturing approaches for the analysis of microbiome samples. The advantage of whole-metagenome sequencing is that it allows direct inference of the metabolic capacity and physiological features of the studied metagenome without relying on knowledge of the genotypes and phenotypes of the members of the bacterial community. It also makes it possible to overcome problems of 16S rDNA sequencing, such as the unknown copy number of the 16S gene and insufficient sequence similarity of the “universal” 16S primers to some of the target 16S genes. On the other hand, next-generation sequencing suffers from biases that result in non-uniform coverage of the sequenced genomes. To overcome this difficulty, we present a model of GC bias in sequencing metagenomic samples, as well as the filtration and normalization techniques necessary for accurate quantification of microbial organisms. While there has been substantial research on normalization and filtration of read-count data in techniques such as RNA-seq and ChIP-seq, to our knowledge this has not been the case for whole-metagenome shotgun sequencing. The presented methods assume that complete genome references are available for most microorganisms of interest present in the metagenomic samples. This is often a valid assumption in fields such as medical diagnostics of patient microbiota. Testing the model on two validation datasets showed a four-fold reduction in root-mean-square error compared to non-normalized data in both cases. The presented methods can be applied to any pipeline for whole-metagenome sequencing analysis that relies on complete microbial genome references. We demonstrate that such pre-processing reduces the number of false positive hits and increases the accuracy of abundance estimates.

Highlights

  • Metagenomics is the study of microbial communities in their natural habitat without isolation or cultivation of individual species [1]

  • To prevent reporting false positive bacterial hits that are identified exclusively by reads mapping to genomic islands, it is necessary to filter out such bacterial hits that only have a few clusters of mapped reads

  • This method can be augmented by heuristics that filter out bacterial hits either by setting an upper limit on the allowed percentage of short read distances (e.g., less than 10 bp), or by estimating the bacterial genome size as specified by Davenport et al. [9] and comparing it with the actual genome size (S1 File)


Summary

Introduction

Metagenomics is the study of microbial communities in their natural habitat without isolation or cultivation of individual species [1]. The boom of next-generation sequencing technologies makes it more affordable to sequence whole metagenomes of environmental samples with high coverage. This technique is known as whole-metagenome shotgun (WMS) sequencing. The WMS sequencing approach allows estimation of fungi and viruses in the sample, which is not possible with biomarker-based metagenomic techniques. The sequencing coverage of the individual bacterial genomes comprising the metagenome varies with two factors: their abundance in the sample and sequencing factors. Sequencing factors include GC bias, fragmentation bias, sequencing depth, and sequencing protocols. Normalizing for these biases allows correct estimation of bacterial abundance in the sample. In addition, some bacterial hits are identified exclusively by reads mapping to genomic islands. The true origin of these reads is the source of the genomic island rather than the organism containing it, so such bacterial hits must be filtered out prior to identification and quantification.
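The short-read-distance heuristic mentioned in the highlights can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of read start positions as the basis for "read distance", and the 50% cutoff are all assumptions made for illustration. Only the 10 bp example threshold comes from the text.

```python
from typing import Sequence


def fraction_short_distances(starts: Sequence[int], max_gap: int = 10) -> float:
    """Fraction of distances between consecutive mapped-read start
    positions that are shorter than max_gap bp.

    Assumption: "read distance" is taken as the gap between the start
    coordinates of neighbouring reads on one reference genome.
    """
    positions = sorted(starts)
    if len(positions) < 2:
        return 0.0  # no distances to evaluate
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(g < max_gap for g in gaps) / len(gaps)


def passes_cluster_filter(
    starts: Sequence[int],
    max_gap: int = 10,
    max_short_fraction: float = 0.5,  # hypothetical cutoff, not from the paper
) -> bool:
    """Keep a bacterial hit only if its reads are not dominated by
    tightly packed clusters (too many very short read distances)."""
    return fraction_short_distances(starts, max_gap) <= max_short_fraction


# Reads packed into two small clusters, as expected for a genomic island.
clustered = [100, 102, 105, 107, 110, 5000, 5003, 5004]
# Reads spread evenly across the reference.
spread = [0, 1000, 2000, 3000, 4000]

print(passes_cluster_filter(clustered))
print(passes_cluster_filter(spread))
```

In this sketch a hit whose reads sit in a few tight clusters yields a high fraction of short distances and is rejected, while a hit with reads spread along the genome is kept. In practice the cutoff would need to be tuned to the sequencing depth, since deep coverage also produces short distances between neighbouring reads.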
