Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

Justin M Zook, Shurjo K Sen, Jennifer McDaniel, Daniel Samarov, Marc Salit

doi:10.1371/journal.pone.0041356

Abstract

While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
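To put the Phred-scaled figures in the abstract into perspective, the following minimal sketch converts between quality scores and error probabilities using the standard definition Q = -10 log10(p). The function names are illustrative and are not taken from the paper or from GATK. Under this definition, a 5-unit shift corresponds to roughly a 3.2-fold change in the implied error probability, and a 13-unit shift to roughly a 20-fold change.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality score."""
    return -10 * math.log10(p)


def error_fold_change(delta_q):
    """Fold change in implied error probability for a shift of delta_q Phred units."""
    return 10 ** (delta_q / 10)


if __name__ == "__main__":
    # Shifts quoted in the abstract: mean of 5 units, up to 13 units at CpG sites.
    for delta in (5, 13):
        print(f"{delta} Phred units -> {error_fold_change(delta):.1f}-fold change in error probability")
```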

Highlights

As sequencing costs drop, it is becoming cost-effective to sequence even whole genomes to a sufficient depth that random errors become insignificant.
These spike-in standards could be used for characterizing systematic sequencing errors (SSEs) in both DNA and RNA sequencing, but in this paper we focus on RNA spike-ins.
The base quality scores reported by the instrument are frequently not accurate measures of error rates, in part due to SSEs associated with covariates such as machine cycle and dinucleotide context.
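The comparison between reported and observed error rates per covariate can be sketched as follows. This is a simplified illustration, not GATK's implementation: it assumes each base call aligned to a spike-in has already been reduced to a dictionary with hypothetical keys `reported_q`, `cycle`, `dinuc`, and `mismatch`, that every mismatch against the known spike-in sequence is a sequencing error, and that a simple add-one smoothing of the observed error rate is acceptable.

```python
import math
from collections import defaultdict


def empirical_quality(observations, key):
    """Compare reported and empirical Phred quality for each covariate value.

    observations: iterable of dicts with 'reported_q' (number) and
                  'mismatch' (0 or 1), plus whatever fields `key` reads.
    key: function mapping an observation to its covariate value,
         e.g. lambda o: o["dinuc"] or lambda o: o["cycle"].
    Returns {covariate_value: (reported_q, empirical_q)}, rounded to 0.1.
    """
    # Per covariate value: [base count, mismatch count, sum of reported error probs]
    counts = defaultdict(lambda: [0, 0, 0.0])
    for obs in observations:
        c = counts[key(obs)]
        c[0] += 1
        c[1] += int(obs["mismatch"])
        c[2] += 10 ** (-obs["reported_q"] / 10)

    table = {}
    for value, (n, mismatches, prob_sum) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) when no mismatches are seen.
        empirical = -10 * math.log10((mismatches + 1) / (n + 2))
        # Average the reported scores in probability space, then convert back.
        reported = -10 * math.log10(prob_sum / n)
        table[value] = (round(reported, 1), round(empirical, 1))
    return table
```

A covariate value whose empirical quality falls well below its reported quality is one for which the instrument overestimates base quality. Calling the function with `key=lambda o: o["cycle"]` gives the same comparison by machine cycle.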

Summary

Introduction

It is becoming cost-effective to sequence even whole genomes to a sufficient depth that random errors become insignificant. At such depths systematic sequencing errors (SSEs) dominate, and compensating for them is critical for applications in which a variant might be expected to be in only a small fraction of the reads, such as samples containing RNA editing [6,7], cancer tissues and circulating tumor cells [8,9,10,11], fetal DNA in mother’s blood [12], mixtures of bacterial strains [13], mitochondrial heteroplasmy [14], mosaic disorders [15], and pooled samples [16,17]. We combine the advantages of recalibrating on part of the data set itself (as in GATK) and on a separate data set with special characteristics (as in SysCall) by using DNA or RNA spike-in standards that lack homology to almost all biological organisms.

Journal: PLoS ONE
Publication Date: Jul 31, 2012
Citations: 73
License type: CC0 1.0