Abstract

The DNA sequencing coverage problem has been analyzed theoretically with complex mathematical models such as Lander–Waterman expectation theory and Stevens’ theorem for randomly covering a domain. In metagenomic sequencing, several approaches have been developed to estimate the coverage of whole-genome shotgun sequencing, but surprisingly few studies have addressed the coverage problem for marker-gene amplicon sequencing, for which arguably the biggest challenge is the complexity or heterogeneity of microbial communities. Overall, much of the practice still relies variously on speculation, semi-empirical and ad hoc heuristic models. Conservatively raising coverage may ensure the success of a sequencing project, but often at undue cost. In this study, we borrow the principles and approaches of optimum sampling methodology, which originated in applied entomology, achieved equal success in plant pathology and parasitology, and has played a critical role in decision-making for global crop and forest protection against economic pests since the 1970s, when the pesticide crisis and food safety concerns forced reductions in pesticide usage, which in turn required reliable sampling techniques for monitoring pest populations. We realized that sequencing coverage is essentially an optimum sampling problem. Perhaps the only essential difference between sampling insects and sampling microbiomes is the “instrument” used. In traditional entomology, it is usually humans who visually count the insects, occasionally aided by a binocular microscope. In metagenomics research, it is DNA sequencers that count the DNA reads.
Furthermore, a key theoretical foundation for sampling insect pest populations, Taylor’s power law, which has achieved the rare status of an ecological law and captures population aggregation, has recently been extended to the community level to describe community heterogeneity and stability; these are the Taylor’s power law extensions (TPLEs). This theoretical advance enabled us to develop a novel approach to assessing the quality and determining the optimum number of reads (coverage) of amplicon sequencing operations. Specifically, two applications were developed. The first, applied in hindsight, assesses the quality of an amplicon sequencing operation in terms of its precision and confidence levels. The second, applied before sequencing, determines the minimum sequencing effort a project needs to achieve preset precision and confidence levels.
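
The two applications can be illustrated with the classical fixed-precision sample-size calculation from pest sampling. The sketch below is not the authors' TPLE procedure, which this excerpt does not spell out. It assumes the variance follows Taylor's power law, V = a·m^b, and uses a normal approximation for the confidence interval, so that the half-width of the interval equals a fraction D of the mean when n = (z/D)²·a·m^(b−2). The function names and the parameter values in the usage note are hypothetical.

```python
import math
from statistics import NormalDist


def _z_value(confidence):
    """Two-sided standard normal quantile for a confidence level in (0, 1)."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def min_sample_size(mean, a, b, precision, confidence=0.95):
    """Smallest n whose confidence-interval half-width is at most
    `precision` * mean, assuming variance = a * mean**b (Taylor's power law).
    This is the "prior to sequencing" style of calculation."""
    if mean <= 0 or precision <= 0:
        raise ValueError("mean and precision must be positive")
    z = _z_value(confidence)
    n = (z / precision) ** 2 * a * mean ** (b - 2)
    return max(1, math.ceil(n))


def achieved_precision(mean, a, b, n, confidence=0.95):
    """Half-width of the confidence interval as a fraction of the mean
    for a sample of size n. This is the "in hindsight" style of calculation."""
    if mean <= 0 or n <= 0:
        raise ValueError("mean and n must be positive")
    z = _z_value(confidence)
    return z * math.sqrt(a * mean ** (b - 2) / n)
```

For example, with hypothetical parameters a = 2 and b = 1.5 and a mean of 10, `min_sample_size(10, 2, 1.5, precision=0.1)` gives the number of sampling units needed for a 95% interval within 10% of the mean, and `achieved_precision(10, 2, 1.5, n)` reports the precision actually reached by a completed sample of size n. When b = 2 the required n no longer depends on the mean.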

Highlights

  • Microbiome researchers employ two types of DNA sequencing technologies

  • The first category of applications is based on the TPLEs (Ma, 2015)

  • Regarding the TPLE-based optimum sample sizes for addressing the 16S rRNA sequencing coverage problem, there are two additional important intricacies, which we briefly describe here; the detailed discussion is deferred to the section with illustrative examples


Introduction

Microbiome researchers employ two types of DNA sequencing technologies: whole-genome shotgun sequencing and marker-gene amplicon sequencing. Existing approaches to the sequencing coverage problem in microbiome research have focused on the former type, and surprisingly few studies have addressed amplicon sequencing. Rodriguez-R and Konstantinidis (2014) first distinguished two terms in microbiome research: sequencing coverage (the fraction of the metagenome represented in the metagenomic dataset) and sequencing depth (the repetition of features, which does not concern us in this study). In extreme cases, when small datasets from sequencing with insufficient coverage are used to describe complex communities, statistical inferences become unreliable and may even lead to misleading conclusions (Rodriguez-R and Konstantinidis, 2014). As Rodriguez-R and Konstantinidis (2014) rightly pointed out, coverage is not simply a function of dataset size: the relationship depends heavily on the complexity (i.e., heterogeneity) of the microbial communities sampled. Wendl et al. (2013) characterized current metagenomic project designs as relying variously on speculation, semi-empirical and ad hoc heuristic models, such as elementary extensions of single-sample Lander–Waterman expectation theory.
