Abstract

Background: Next-generation sequencing technologies have become important tools for genome-wide studies. However, the quality scores that are assigned to each base have been shown to be inaccurate. If the quality scores are used in downstream analyses, these inaccuracies can have a significant impact on the results.

Results: Here we present ReQON, a tool that recalibrates the base quality scores from an input BAM file of aligned sequencing data using logistic regression. ReQON also generates diagnostic plots showing the effectiveness of the recalibration. We show that ReQON produces quality scores that are both more accurate, in the sense that they more closely correspond to the probability of a sequencing error, and do a better job of discriminating between sequencing errors and non-errors than the original quality scores. We also compare ReQON to other available recalibration tools and show that ReQON is less biased and performs favorably in terms of quality score accuracy.

Conclusion: ReQON is an open source software package, written in R and available through Bioconductor, for recalibrating base quality scores for next-generation sequencing data. ReQON produces a new BAM file with more accurate quality scores, which can improve the results of downstream analysis, and produces several diagnostic plots showing the effectiveness of the recalibration.
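
The abstract's notion of accuracy, a quality score that corresponds to the probability of a sequencing error, can be made concrete with the Phred scale used for base qualities in BAM files, where Q = -10 log10(p). The following is a minimal sketch of that conversion only; the function names are illustrative and are not part of ReQON.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # A score near 40 claims roughly one error in 10,000 bases.
    print(phred_to_error_prob(40))
    # An error rate of about 1 in 1,000 corresponds to a score near 30.
    print(error_prob_to_phred(1 / 1000))
```

Under this scale, a perfectly accurate quality score of 30 would be assigned only to bases that are wrong about once in a thousand calls.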

Highlights

  • Next-generation sequencing technologies have become important tools for genome-wide studies

  • Plot B shows that the majority of quality scores before recalibration were larger than 35, with almost 50% of the bases receiving a quality score near 40

  • Plot C confirms that the original quality scores are not very accurate: the quality scores with the largest frequencies are far from the 45-degree line, and the Frequency-Weighted Squared Error (FWSE) is large
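
The comparison against the 45-degree line can be illustrated by computing, for each reported quality score, the empirical Phred score implied by the observed error rate. The sketch below is illustrative and is not ReQON's code. It assumes that FWSE is the sum over reported scores of relative frequency times the squared difference between reported and empirical quality; the exact definition is given in the full text. It also assumes each base has already been labeled as an error or non-error, and it skips scores with no observed errors, since their empirical quality cannot be estimated.

```python
import math
from collections import defaultdict


def empirical_quality(observations):
    """Map each reported score to (count, empirical Phred score).

    observations: iterable of (reported_q, is_error) pairs.
    Scores with zero observed errors are skipped (assumption of this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [bases, errors]
    for q, is_error in observations:
        counts[q][0] += 1
        counts[q][1] += int(is_error)
    table = {}
    for q, (n, errors) in counts.items():
        if errors == 0:
            continue
        table[q] = (n, -10 * math.log10(errors / n))
    return table


def frequency_weighted_squared_error(observations):
    """Assumed FWSE: frequency-weighted squared distance from the 45-degree line."""
    table = empirical_quality(observations)
    total = sum(n for n, _ in table.values())
    return sum(n / total * (q - emp_q) ** 2 for q, (n, emp_q) in table.items())


if __name__ == "__main__":
    # Hypothetical data: scores of 40 and 20 that both behave like a score of 20.
    data = [(40, i < 10) for i in range(1000)] + [(20, i < 10) for i in range(1000)]
    print(empirical_quality(data))
    print(frequency_weighted_squared_error(data))
```

A well-calibrated set of scores would place every point on the 45-degree line and give a value near zero, while frequently assigned scores that sit far from the line dominate the total.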

Summary

Introduction

Next-generation sequencing (NGS) technologies are important tools for studying genome-wide DNA and RNA expression, single nucleotide polymorphisms (SNPs), mutations and alternative splicing [1]. When a sequencer calls a base, there is a small chance that it will call an incorrect one. The sequencing error rate is machine-, run- and sample-specific, but errors occur at a rate of approximately 1 in 1,000 bases [2], resulting in tens of millions of errors in a single experiment. Many labs do not have the resources to store fluorescence intensity data and therefore cannot use calibration tools that require it. Other recalibration tools require only aligned sequence data; two of the most popular are GATK [3] and the BAQ option in SAMtools [7], which both run on Unix
