Abstract

Background

Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK’s recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone.

Results

The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%; improve the percent of discordant genotypes removed from 10.5% to 69.5%; and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples based on different target capture and sequencing chemistry protocols results in a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies and is used to estimate common variant genotypes to generate additional markers for association analyses. As such, we demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes.

Conclusions

The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared to data processed through standard practices, these strategies result in substantially higher quality data for common and rare association analyses.
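The quality metrics quoted in the abstract (genotype concordance with array data, the fraction of discordant genotypes removed by filtering, and the Ti/Tv ratio) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the dictionary layout keyed by hypothetical (sample, variant) pairs, and the encoding of genotypes as alternate-allele counts with None for a filtered or missing call are all assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

# Assumed layout: (sample_id, variant_id) -> alt-allele count (0, 1, 2),
# or None when the genotype is missing or has been filtered out.
Genotypes = Dict[Tuple[str, str], Optional[int]]

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ti/Tv over biallelic SNVs given as (ref, alt) pairs with ref != alt."""
    ti = tv = 0
    for ref, alt in snvs:
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else math.nan


def concordance(seq: Genotypes, array: Genotypes) -> float:
    """Fraction of genotypes called in both data sets that agree."""
    shared = [k for k, g in seq.items()
              if g is not None and array.get(k) is not None]
    if not shared:
        return math.nan
    return sum(seq[k] == array[k] for k in shared) / len(shared)


def discordant_removed(before: Genotypes, after: Genotypes,
                       array: Genotypes) -> float:
    """Fraction of pre-filter discordant genotypes that are no longer called."""
    discordant = [k for k, g in before.items()
                  if g is not None and array.get(k) is not None
                  and g != array[k]]
    if not discordant:
        return math.nan
    removed = sum(after.get(k) is None for k in discordant)
    return removed / len(discordant)
```

Under these assumptions, a filter improves the data when `concordance` rises after filtering while `discordant_removed` is high, which is the pattern the reported figures describe.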

Highlights

  • Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests

  • We focus on showing improvements compared to Genome Analysis Toolkit (GATK)’s Best Practices because a recent publication has shown that GATK is the best variant caller for general NGS analyses [36]

  • Variant calling and standard GATK Variant Quality Score Recalibration (VQSR) filtering: As part of a large case-control study, we sequenced the exomes of 920 samples from a Norwegian population to an average depth of 100× in target regions, with an average of 82.5% of the target base pairs having at least 30× coverage

Summary

Introduction

Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and remain a source of errors in whole exome sequencing (WES) projects that follow GATK’s recommended best practices. While WES studies have many advantages over array-based analyses, they are susceptible to higher levels of genotyping errors [20,21,22,23]. These errors are generated throughout the sequencing process, especially at sites with low coverage or variants with low minor allele frequency (MAF). As the MAF increases, homozygote to heterozygote errors increase in likelihood.

