Abstract

Motivation: The increasing adoption of clinical whole-genome resequencing (WGS) demands highly accurate and reproducible variant calling (VC) methods. The observed discordance between state-of-the-art VC pipelines, however, indicates that current practice still suffers from non-negligible numbers of false positive and false negative SNV and INDEL calls, which were shown to be enriched among discordant calls and in genomic regions with low sequence complexity.

Results: Here, we describe our method ReliableGenome (RG) for partitioning genomes into high and low concordance regions with respect to a set of surveyed VC pipelines. Our method combines call sets derived by multiple pipelines from arbitrary numbers of datasets and interpolates expected concordance for genomic regions without data. By applying RG to 219 deep human WGS datasets, we demonstrate that VC concordance depends predominantly on genomic context rather than the actual sequencing data, which manifests in high recurrence of regions that can/cannot be reliably genotyped by a single method. This enables the application of pre-computed regions to other data created with comparable sequencing technology and software. RG outperforms comparable efforts in predicting VC concordance and false positive calls in low-concordance regions, which underlines its usefulness for variant filtering, annotation and prioritization. RG allows focusing resource-intensive algorithms (e.g. consensus calling methods) on the smaller, discordant share of the genome (20–30%), which might result in increased overall accuracy at reasonable costs. Our method and analysis of discordant calls may further be useful for development, benchmarking and optimization of VC algorithms and for the relative comparison of call sets between different studies/pipelines.

Availability and Implementation: RG was implemented in Java; source code and binaries are freely available for non-commercial use at https://github.com/popitsch/wtchg-rg/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Whole-genome resequencing (WGS) allows researchers to address a broad range of clinical and research questions at comparably low costs and with short turnaround times

  • The observed discordance between state-of-the-art variant calling (VC) pipelines indicates that current practice still suffers from non-negligible numbers of false positive and false negative SNV and INDEL calls, which were shown to be enriched among discordant calls and in genomic regions with low sequence complexity

  • By applying RG to 219 deep human whole-genome resequencing (WGS) datasets, we demonstrate that VC concordance depends predominantly on genomic context rather than the actual sequencing data, which manifests in high recurrence of regions that can/cannot be reliably genotyped by a single method

Summary

Introduction

One proposed practice to improve overall VC accuracy is to apply multiple VC pipelines to the same sequencing data and combine the results in order to reach a consensus from multiple algorithms (Cantarel et al, 2014; Gezsi et al, 2015). While this strategy may significantly increase VC accuracy, it greatly increases analysis costs and turnaround times, which may be infeasible in many real-world situations. Such a consensus approach was used for the development of the first genome-wide benchmarks, which enable us to determine VC accuracy and reproducibility and pave the way for systematically improving these measures (Goldfeder et al, 2016; Highnam et al, 2015; Zook et al, 2014). Removing all variant calls in such difficult regions is straightforward and did not compromise sensitivity significantly in the author’s evaluation (Li, 2014).
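To make the idea of comparing call sets from several pipelines more concrete, the following is a minimal Python sketch, not the RG implementation (which is written in Java). It assumes each pipeline's calls are available as a set of `(chrom, pos, ref, alt)` tuples, scores fixed-size windows by the fraction of distinct calls that every pipeline reports, and splits the windows at a threshold. The function names, the window size and the threshold are hypothetical choices made for illustration, and the sketch does not interpolate concordance for windows without calls, as RG is described as doing.

```python
from collections import defaultdict


def window_concordance(call_sets, window=1000):
    """Fraction of distinct calls per window that all pipelines agree on.

    call_sets: dict mapping a pipeline name to an iterable of
               (chrom, pos, ref, alt) tuples (hypothetical input format).
    Returns a dict mapping (chrom, window_start) to a value in [0, 1].
    """
    n_pipelines = len(call_sets)

    # Count how many pipelines report each distinct variant.
    support = defaultdict(int)
    for calls in call_sets.values():
        for variant in set(calls):  # de-duplicate within one pipeline
            support[variant] += 1

    # Per window: number of distinct calls, and calls reported by all pipelines.
    total = defaultdict(int)
    agreed = defaultdict(int)
    for (chrom, pos, _ref, _alt), count in support.items():
        key = (chrom, (pos // window) * window)
        total[key] += 1
        if count == n_pipelines:
            agreed[key] += 1

    return {key: agreed[key] / total[key] for key in total}


def partition(concordance, threshold=0.9):
    """Split windows into high- and low-concordance lists (illustrative threshold)."""
    high = sorted(k for k, c in concordance.items() if c >= threshold)
    low = sorted(k for k, c in concordance.items() if c < threshold)
    return high, low


# Toy example with two made-up pipelines that disagree on one INDEL allele.
calls = {
    "pipeline_a": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr1", 1500, "G", "GA")},
    "pipeline_b": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr1", 1500, "G", "GAA")},
}
scores = window_concordance(calls, window=1000)
high, low = partition(scores, threshold=0.9)
```

In this toy input, the window starting at position 0 contains only calls shared by both pipelines, while the window starting at 1000 contains two conflicting INDEL alleles and falls into the low-concordance list. A consensus method could then be restricted to the windows in `low`, which is the cost-saving idea the text describes.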

