Identifying and mitigating batch effects in whole genome sequencing data

Jennifer A Tom, Jens Reeder, Timothy W Behrens, Julie Hunkapiller, Tushar R Bhangale, Robert R Graham, William F Forrest

doi:10.1186/s12859-017-1756-z

Abstract

Background: Large sample sets of whole genome sequencing with deep coverage are being generated; however, assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or in the bioinformatics tools used to process the data. No systematic algorithms or heuristics exist to detect and filter batch effects or to remove associations impacted by batch effects in whole genome sequencing data.

Results: We describe key quality metrics, provide a freely available software package to compute them, and demonstrate that identification of batch effects is aided by principal components analysis of these metrics. To mitigate batch effects, we developed new site-specific filters that identified and removed variants that falsely associated with the phenotype due to batch effect. These include filtering based on a haplotype-based genotype correction, a differential genotype quality test, and removal of sites with a missing genotype rate greater than 30% after setting genotypes with quality scores less than 20 to missing. This method removed 96.1% of unconfirmed genome-wide significant SNP associations and 97.6% of unconfirmed genome-wide significant indel associations. We performed analyses to demonstrate that: 1) these filters impacted variants known to be disease associated, as 2 out of 16 confirmed associations in an AMD candidate SNP analysis were filtered, representing a reduction in power of 12.5%; 2) in the absence of batch effects, these filters removed only a small proportion of variants across the genome (type I error rate of 3%); and 3) in an independent dataset, the method removed 90.2% of unconfirmed genome-wide SNP associations and 89.8% of unconfirmed genome-wide indel associations.

Conclusions: Researchers currently do not have effective tools to identify and mitigate batch effects in whole genome sequencing data. We developed and validated methods and filters to address this deficiency.
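
The missing-genotype filter described in the abstract (set genotypes with quality below 20 to missing, then drop sites whose missing rate exceeds 30%) can be illustrated with a minimal Python sketch. This is not the authors' software: the function name, the `-1` missing-value sentinel, the array layout (sites in rows, samples in columns), and the toy data are all assumptions made for illustration. Only the two thresholds come from the abstract.

```python
import numpy as np

MISSING = -1  # hypothetical sentinel for a missing genotype call


def filter_sites_by_missingness(genotypes, gq, min_gq=20, max_missing_rate=0.30):
    """Mask low-quality genotypes, then flag sites with too many missing calls.

    genotypes: (n_sites, n_samples) integer array, MISSING marks a missing call
    gq:        (n_sites, n_samples) genotype quality scores
    Returns the masked genotypes, the per-site missing rate, and a boolean
    array that is True for sites that pass the filter.
    """
    genotypes = np.asarray(genotypes)
    gq = np.asarray(gq)

    # Step 1: set genotypes with quality below the threshold to missing.
    masked = genotypes.copy()
    masked[gq < min_gq] = MISSING

    # Step 2: compute the fraction of missing genotypes at each site.
    missing_rate = (masked == MISSING).mean(axis=1)

    # Step 3: keep sites whose missing rate does not exceed the threshold.
    keep = missing_rate <= max_missing_rate
    return masked, missing_rate, keep


# Toy data: 3 sites x 4 samples (values are made up for illustration).
genotypes = np.array([[0, 1, 2, 0],
                      [0, 0, 1, 1],
                      [1, MISSING, 0, 2]])
gq = np.array([[99, 50, 30, 25],
               [10, 15, 40, 60],
               [99, 99, 19, 45]])

masked, missing_rate, keep = filter_sites_by_missingness(genotypes, gq)
print(missing_rate)
print(keep)
```

In this toy example the first site has no low-quality calls and is kept, while the other two sites each end up with half of their genotypes missing and are dropped.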

Highlights

Large sample sets of whole genome sequencing with deep coverage are being generated, but assembling datasets from different sources inevitably introduces batch effects.
The filters we developed do not remove all unconfirmed genome-wide significant associations (UGAs) impacted by batch effects and come at the cost of a 12.5% reduction in power, but when applied in conjunction with standard quality control measures they can substantially mitigate the impact of batch effects.
We showed that the quality metrics we developed can determine whether a batch effect exists within a dataset, and we released software that allows researchers to quickly assess the quality of their sequencing data.
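
As a rough illustration of how principal components analysis of per-sample quality metrics can expose a batch effect, the sketch below standardizes a small metrics table and projects it onto its leading components. It is not the released software package: the function name `pca_scores`, the three metric columns, and the simulated values are hypothetical and chosen only so that the two batches differ.

```python
import numpy as np


def pca_scores(metrics, n_components=2):
    """Project samples onto the leading principal components.

    metrics: (n_samples, n_metrics) array of per-sample quality metrics.
    Returns the component scores and the fraction of variance each explains.
    """
    X = np.asarray(metrics, dtype=float)

    # Standardize each metric so that scale differences do not dominate.
    X = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant columns
    X = X / std

    # PCA via singular value decomposition of the standardized matrix.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Simulated metrics for two batches of 5 samples each. The columns stand in
# for hypothetical metrics (for example a Ti/Tv ratio, a median read depth,
# and a mean genotype quality); the numbers are invented.
rng = np.random.default_rng(0)
batch_a = rng.normal([2.05, 30.0, 60.0], [0.005, 0.5, 1.0], size=(5, 3))
batch_b = rng.normal([1.95, 24.0, 45.0], [0.005, 0.5, 1.0], size=(5, 3))
metrics = np.vstack([batch_a, batch_b])

scores, explained = pca_scores(metrics)
for label, pc1 in zip(["A"] * 5 + ["B"] * 5, scores[:, 0]):
    print(label, round(float(pc1), 2))
```

With metrics shifted this strongly between batches, the first component separates the two groups by sign. In real data the separation is usually less clean, so plotting the first few components colored by batch is a more practical check.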

Summary

Introduction

Large sample sets of whole genome sequencing with deep coverage are being generated, but assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or in the bioinformatics tools used to process the data. QC strategies proposed for exome sequencing (WES) include empirically derived variant filtering [10] and methods for removing batch effects in copy number variation calling [11, 12]. These algorithms rely on read depth and either singular

Journal: BMC Bioinformatics
Publication Date: Jul 24, 2017
Citations: 47
License type: open-access
