Abstract
BackgroundSequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introduced at various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing. In this study, we use current NGS technology to systematically investigate these questions.ResultsBy evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10−5 to 10−4, which is 10- to 100-fold lower than generally considered achievable (10−3) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution types, ranging from 10−5 for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10−4 for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR led to ~ 6-fold increase of overall error rate. We also find that more than 70% of hotspot variants can be detected at 0.1 ~ 0.01% frequency with the current NGS technology by applying in silico error suppression.ConclusionsWe present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.
Highlights
Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS)
We systematically investigated substitution error profiles by analyzing multiple sequencing datasets from five DNA sequencing providers: three a b c deep sequencing datasets generated by St
Jude), HudsonAlpha Institute of Biotechnology (HAIB), and WuXiNextCode and whole-exome sequencing datasets generated by Broad Institute (BI) and Baylor College of Medicine (BCM) on five different Illumina sequencing platforms (Additional file 1: Table S1)
Summary
Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). Errors acquired during next-generation sequencing (NGS) are key confounding factors of sensitive detection of low-frequency variants by deep sequencing. The substitution error rate by conventional NGS was first reported to be > 0.1% in 2011 [10] and was similar in later reports [11, 12] and in a recent review [1]. This presumed high error rate (> 0.1%) constrains further exploration of ways to improve sensitivity of low-frequency variant detection. With the rapid progress in sequencing technology and dramatic reductions in sequencing cost, there is a great need to systematically evaluate
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.