Abstract

Background: Minor allele detection in very high coverage sequence data (>1000X) has many applications such as detecting mtDNA heteroplasmy, somatic mutations in cancer or tumors, SNP calling in pool sequencing, etc., where reads with low frequency are not necessarily sequence error but may instead convey biological information. However, the suitability of common base quality recalibration tools for such applications has not been investigated in detail.

Results: We show that the widely used tool GATK BaseRecalibration has several limitations in minor allele detection. First, GATK IndelRealignment fails to work if the sequence coverage is above a certain level since it then becomes computationally infeasible. Second, the accuracy of the base quality largely depends on the database of known SNPs as the control, which limits the ability of de novo minor allele detection. Third, GATK reduces the base quality of sequence errors at the cost of reducing scores for true minor alleles. To overcome these limitations, we present a novel approach called SEGREG, which applies segmented regression to control sequences (e.g. phiX174 DNA) spiked into a sequencing run. Based on simulations, SEGREG improves both the accuracy of base quality scores and the detection of minor alleles. We further investigate sequence error and recalibration parameters by applying a Logarithm Likelihood Ratio (LLR) approach to SEGREG recalibrated base quality scores for phiX174 DNA sequenced to very high coverage, and for mtDNA genome sequences previously analyzed for heteroplasmic variants.

Conclusions: Our results suggest that SEGREG improves base recalibration without suffering the limitations discussed above, and the LLR approach benefits from SEGREG in identifying more true minor alleles, while avoiding false positives from sequencing error.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2463-2) contains supplementary material, which is available to authorized users.
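To make the idea of segmented regression on a spiked-in control concrete, the following is a minimal sketch and not the SEGREG implementation. It assumes that mismatches against a known control (such as phiX174) have already been counted per reported quality score, converts those counts to an empirical Phred-scaled quality, and fits a continuous two-segment line by grid search over candidate breakpoints. The function names (`empirical_phred`, `fit_two_segments`, `predict`), the single covariate, and the pseudocount are all illustrative assumptions; the model described in the article conditions on more than the reported quality alone.

```python
import math
import numpy as np


def empirical_phred(errors, total):
    """Phred-scaled empirical error rate: Q = -10 * log10(p).

    A (+1, +2) pseudocount is an assumption made here so that bins with
    zero observed errors do not produce log10(0).
    """
    return -10.0 * math.log10((errors + 1) / (total + 2))


def fit_two_segments(x, y):
    """Fit a continuous two-segment linear model by least squares.

    Each interior x value is tried as a breakpoint; the model is
    y = b0 + b1 * x + b2 * max(x - breakpoint, 0), so b2 is the change
    in slope after the breakpoint. Returns (breakpoint, coefficients).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for bp in np.unique(x)[1:-1]:
        design = np.column_stack([np.ones_like(x), x, np.maximum(x - bp, 0.0)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(bp), coef)
    return best[1], best[2]


def predict(x, bp, coef):
    """Map reported quality scores to recalibrated scores with the fitted model."""
    x = np.asarray(x, dtype=float)
    return coef[0] + coef[1] * x + coef[2] * np.maximum(x - bp, 0.0)
```

In use, `x` would hold the reported quality scores observed in the control reads and `y` the corresponding `empirical_phred` values; `predict` then applies the fitted mapping to the reported scores of the sample reads. Because the fit relies only on the control sequence, a sketch like this needs no database of known SNPs, which is the property the abstract contrasts with GATK.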

Highlights

  • Minor allele detection in very high coverage sequence data (>1000X) has many applications such as detecting mitochondrial DNA (mtDNA) heteroplasmy, somatic mutations in cancer or tumors, SNP calling in pool sequencing, etc., where reads with low frequency are not necessarily sequence error but may instead convey biological information

  • GATK4 has a larger Frequency-Weighted Squared Error (FWSE) than GATK1, which reflects the misalignment issue; with more knowledge of the actual genetic variants, GATK can improve the accuracy of recalibrated base quality as GATK2 has a lower FWSE than GATK1, and GATK3 has a lower FWSE than GATK2

  • SEGREG has the lowest FWSE, which probably reflects both the direct regression on the multiple conditional probability in our model and the simplicity of the error model generated from Simseq, which is based on mtDNA sequence data (Simseq reference)


Summary

Introduction

Minor allele detection in very high coverage sequence data (>1000X) has many applications such as detecting mtDNA heteroplasmy, somatic mutations in cancer or tumors, SNP calling in pool sequencing, etc., where reads with low frequency are not necessarily sequence error but may instead convey biological information. The raw base quality from the Illumina default basecaller (Bustard) is inaccurate [3], and a number of basecallers aimed at achieving better performance have been developed. They either apply a model-based strategy (e.g., AYB [4], naiveBayescall [5]) or use supervised learning approaches with an additional. We compare our base recalibration tool with others and discuss why they fail to accurately distinguish minor alleles from sequence errors.

