Abstract

We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as “noise” or “error”) within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole-sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in studies that use data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or on quality scores (e.g., Phred). Here, DRISEE is applied to (non-amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and use it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.

Highlights

  • Accurate quantification of sequencing error is the single most essential consideration of sequence-dependent biological investigations

  • The limitations of reference-genome and score-based methods inspired the creation of Duplicate Read Inferred Sequencing Error Estimation (DRISEE)

  • DRISEE utilizes multiple alignment of groups of prefix-identical clusters of artifactual duplicate reads (ADRs) to create internal standards to which each individual duplicate read is compared
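
The idea in the last highlight can be illustrated with a minimal Python sketch. It is not the DRISEE implementation: it swaps the multiple alignment step for a plain column-by-column comparison of unaligned reads, so it only counts substitutions and ignores insertions and deletions. The function names and the `prefix_len` and `min_bin_size` parameters are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def bin_by_prefix(reads, prefix_len, min_bin_size):
    """Group reads that share an identical prefix (candidate duplicate reads).

    prefix_len and min_bin_size are illustrative parameters (assumptions),
    not values taken from the source text.
    """
    bins = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            bins[read[:prefix_len]].append(read)
    # Keep only bins large enough to give a meaningful consensus.
    return [group for group in bins.values() if len(group) >= min_bin_size]


def positional_error(bins, prefix_len):
    """Per-position mismatch rate against each bin's majority-vote consensus.

    Simplification: reads are compared column by column without alignment,
    so only substitutions are counted.
    """
    mismatches = Counter()
    totals = Counter()
    for group in bins:
        shared_len = min(len(read) for read in group)
        # Positions inside the prefix are identical by construction, so skip them.
        for pos in range(prefix_len, shared_len):
            column = [read[pos] for read in group]
            consensus_base, _ = Counter(column).most_common(1)[0]
            mismatches[pos] += sum(1 for base in column if base != consensus_base)
            totals[pos] += len(column)
    return {pos: mismatches[pos] / totals[pos] for pos in sorted(totals)}


# Toy example: four reads share the prefix "ACGT"; one differs at position 5.
reads = ["ACGTAC", "ACGTAC", "ACGTAT", "ACGTAC", "TTTTTT"]
bins = bin_by_prefix(reads, prefix_len=4, min_bin_size=3)
profile = positional_error(bins, prefix_len=4)
print(profile)  # {4: 0.0, 5: 0.25}
```

In this toy input the consensus of each bin acts as the internal standard: the single read ending in "T" contributes one mismatch out of four bases at position 5, while the lone "TTTTTT" read is dropped because its bin is too small.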


Summary

Introduction

Accurate quantification of sequencing error is the single most essential consideration of sequence-dependent biological investigations. Metagenomic studies produce biological inferences as the near-exclusive product of computational analyses of high-throughput sequence data that attempt to classify the taxonomic (through 16S ribosomal amplicon sequencing [MG-RAST [1], QIIME [2]]) and functional (through whole-genome shotgun sequencing [MG-RAST [1]]) content of entire microbial communities. The accuracy of these inferences rests largely on the fidelity of sequence data, and on the ability of existing methods to quantify and account for sequencing error. Two methods are commonly used: reference-genome and score-based methods.

