Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery

Yury A Barbitoff, Andrey S Glotov, Ruslan Abasov, Alexander V Predeus, Varvara E Tvorogova

doi:10.1186/s12864-022-08365-3

Abstract

Background: Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. Although variant caller benchmarks are published constantly, no previous publications have leveraged the full extent of available gold standard whole-genome (WGS) and whole-exome (WES) sequencing datasets.

Results: In this work, we systematically evaluated the performance of 4 popular short read aligners (Bowtie2, BWA, Isaac, and Novoalign) and 9 novel and well-established variant calling and filtering methods (Clair3, DeepVariant, Octopus, GATK, FreeBayes, and Strelka2) using a set of 14 “gold standard” WES and WGS datasets available from the Genome In A Bottle (GIAB) consortium. Additionally, we have indirectly evaluated each pipeline’s performance using a set of 6 non-GIAB samples of African and Russian ethnicity. In our benchmark, Bowtie2 performed significantly worse than other aligners, suggesting it should not be used for medical variant calling. When other aligners were considered, the accuracy of variant discovery mostly depended on the variant caller and not the read aligner. Among the tested variant callers, DeepVariant consistently showed the best performance and the highest robustness. Other actively developed tools, such as Clair3, Octopus, and Strelka2, also performed well, although their efficiency had greater dependence on the quality and type of the input data. We have also compared the consistency of variant calls in GIAB and non-GIAB samples. With a few important caveats, best-performing tools have shown little evidence of overfitting.

Conclusions: The results show surprisingly large differences in the performance of cutting-edge tools even in high confidence regions of the coding genome. This highlights the importance of regular benchmarking of quickly evolving tools and pipelines. We also discuss the need for a more diverse set of gold standard genomes that would include samples of African, Hispanic, or mixed ancestry. Additionally, there is a need for better variant caller assessment in the repetitive regions of the coding genome.

Highlights

Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders
Data collection and analysis strategy: To dissect the factors that define the accuracy of variant calling, we selected a matching set of whole-exome sequencing (WES) and whole-genome sequencing (WGS) datasets available for the gold standard Genome In A Bottle (GIAB) samples, including NA12878 (HG001), three members of an Ashkenazi trio (HG002 - HG004), and three members of the Chinese Han trio (HG005 - HG007) (Fig. 1a, Table 1).
The following analysis strategy was employed for each sample (Fig. 1a): raw reads were aligned onto the GRCh37 human reference genome with one of four short read aligners: Bowtie2 (in either end-to-end (BT-E2E) or local (BT-LOC) alignment mode), BWA
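
The benchmark design described above (samples, data types, aligner configurations, variant callers) can be pictured as a grid of pipeline runs. The Python sketch below only enumerates such a grid; it is an illustration, not the authors' code. It assumes a full cross product of all factors, which the text does not confirm, and it lists the six caller tools named in the abstract without the filtering variants that bring the count of methods to 9. The labels "Isaac" and "Novoalign" and the function name enumerate_runs are hypothetical choices made for this example.

```python
from itertools import product

# GIAB samples named in the text: NA12878 (HG001), the Ashkenazi trio
# (HG002-HG004) and the Chinese Han trio (HG005-HG007).
SAMPLES = [f"HG{i:03d}" for i in range(1, 8)]

# Each sample has a matching WES and WGS dataset (7 x 2 = 14 datasets).
DATA_TYPES = ["WES", "WGS"]

# Four aligners; Bowtie2 is run in two modes (BT-E2E, BT-LOC), giving
# five aligner configurations. "Isaac" and "Novoalign" are labels assumed here.
ALIGNERS = ["BT-E2E", "BT-LOC", "BWA", "Isaac", "Novoalign"]

# Caller tools named in the abstract (filtering variants are not enumerated).
CALLERS = ["Clair3", "DeepVariant", "Octopus", "GATK", "FreeBayes", "Strelka2"]


def enumerate_runs(samples, data_types, aligners, callers):
    """Return one dict per (sample, data type, aligner, caller) combination.

    Assumes every combination is run, which is a simplification.
    """
    return [
        {"sample": s, "data_type": d, "aligner": a, "caller": c}
        for s, d, a, c in product(samples, data_types, aligners, callers)
    ]


runs = enumerate_runs(SAMPLES, DATA_TYPES, ALIGNERS, CALLERS)
datasets = {(r["sample"], r["data_type"]) for r in runs}
print(f"{len(datasets)} datasets, {len(runs)} pipeline runs in the full grid")
```

Keeping the factors as separate lists makes it easy to group results by aligner or by caller afterwards, which is the comparison the abstract reports.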

Summary

Introduction

Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. Millions of individual exomes and genomes have been sequenced across the globe, and large-scale variant datasets have been constructed from NGS data, including the Genome Aggregation Database (gnomAD) [2] and the UK Biobank exome sequencing dataset [3]. These datasets are extensively used in both clinical practice and basic human genetics research.



Journal: BMC Genomics
Publication Date: Feb 22, 2022
Citations: 34
License type: open-access


