Abstract
Background: Large sample sets of whole genome sequencing with deep coverage are being generated; however, assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or in the bioinformatics tools used to process the data. No systematic algorithms or heuristics exist to detect and filter batch effects or to remove associations impacted by batch effects in whole genome sequencing data.
Results: We describe key quality metrics, provide a freely available software package to compute them, and demonstrate that identification of batch effects is aided by principal components analysis of these metrics. To mitigate batch effects, we developed new site-specific filters that identified and removed variants that falsely associated with the phenotype due to batch effect. These filters comprise a haplotype-based genotype correction, a differential genotype quality test, and removal of sites with a missing genotype rate greater than 30% after setting genotypes with quality scores less than 20 to missing. This method removed 96.1% of unconfirmed genome-wide significant SNP associations and 97.6% of unconfirmed genome-wide significant indel associations. We performed analyses to demonstrate that: 1) these filters impacted variants known to be disease associated, as 2 out of 16 confirmed associations in an AMD candidate SNP analysis were filtered, representing a reduction in power of 12.5%; 2) in the absence of batch effects, these filters removed only a small proportion of variants across the genome (type I error rate of 3%); and 3) in an independent dataset, the method removed 90.2% of unconfirmed genome-wide SNP associations and 89.8% of unconfirmed genome-wide indel associations.
Conclusions: Researchers currently do not have effective tools to identify and mitigate batch effects in whole genome sequencing data. We developed and validated methods and filters to address this deficiency.
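The missingness filter described in the abstract (set genotypes with quality scores below 20 to missing, then drop sites whose missing genotype rate exceeds 30%) can be sketched as follows. This is a minimal illustration, not the authors' released software: the function names, the use of `None` for a missing genotype, and the per-site list layout are all assumptions made for the example. The haplotype-based correction and the differential genotype quality test are not shown.

```python
from typing import List, Optional, Sequence

GQ_THRESHOLD = 20        # genotypes with quality below this are set to missing
MAX_MISSING_RATE = 0.30  # sites with a missing rate above this are removed


def mask_low_quality(
    genotypes: Sequence[Optional[int]],
    qualities: Sequence[Optional[float]],
    gq_threshold: float = GQ_THRESHOLD,
) -> List[Optional[int]]:
    """Return one site's genotypes with low-quality calls set to None.

    Hypothetical layout: one entry per sample, None meaning "missing".
    """
    return [
        None if gt is None or gq is None or gq < gq_threshold else gt
        for gt, gq in zip(genotypes, qualities)
    ]


def missing_rate(genotypes: Sequence[Optional[int]]) -> float:
    """Fraction of samples with a missing genotype at this site."""
    if not genotypes:
        return 1.0  # treat a site with no samples as fully missing
    return sum(gt is None for gt in genotypes) / len(genotypes)


def keep_site(
    genotypes: Sequence[Optional[int]],
    qualities: Sequence[Optional[float]],
) -> bool:
    """True if the site passes the filter after low-quality masking."""
    masked = mask_low_quality(genotypes, qualities)
    return missing_rate(masked) <= MAX_MISSING_RATE


# Ten samples at one site: four calls fall below the quality threshold,
# so 40% of genotypes become missing and the site is removed.
site_genotypes = [0, 1, 0, 2, 1, 0, 0, 1, 2, 0]
site_qualities = [35, 40, 12, 50, 8, 33, 19, 45, 5, 60]
print(keep_site(site_genotypes, site_qualities))
```

The threshold comparison is written as "greater than 30% is removed", matching the wording of the abstract, so a site with exactly 30% missing genotypes is kept.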
Highlights
Large sample sets of whole genome sequencing with deep coverage are being generated, but assembling datasets from different sources inevitably introduces batch effects.
The filters we developed do not remove all unconfirmed genome-wide significant associations (UGAs) impacted by batch effects and come at the cost of a 12.5% reduction in power; however, when applied in conjunction with standard quality control measures, they can substantially mitigate the impact of batch effects.
We showed that the quality metrics we developed can determine whether a batch effect exists within a dataset, and we released software that allows researchers to quickly assess the quality of their sequencing data.
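As a rough illustration of how principal components analysis of per-sample quality metrics can expose a batch effect, the sketch below projects samples onto the first principal component and checks whether two batches separate. It is not the released software: the metric columns (mean depth, transition/transversion ratio, het/hom ratio) and all values are invented for the example, and power iteration is used here only as a dependency-free stand-in for a full PCA.

```python
import math
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Center each metric and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
        for j in range(d)  # "or 1.0" avoids dividing by zero for constant metrics
    ]
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]


def first_pc_scores(rows: List[List[float]], iterations: int = 200) -> List[float]:
    """Project samples onto the leading principal component.

    Uses power iteration on the correlation matrix of the metrics.
    The fixed start vector could fail to converge if it happened to be
    orthogonal to the leading eigenvector; that is ignored in this sketch.
    """
    z = standardize(rows)
    n, d = len(z), len(z[0])
    cov = [
        [sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return [sum(z[i][j] * v[j] for j in range(d)) for i in range(n)]


# Invented metrics per sample: [mean depth, Ti/Tv ratio, het/hom ratio].
batch_a = [[30.1, 2.10, 1.60], [29.8, 2.11, 1.62], [30.5, 2.09, 1.59], [30.0, 2.12, 1.61]]
batch_b = [[40.2, 2.00, 1.90], [39.7, 2.01, 1.88], [40.5, 1.99, 1.91], [40.0, 2.02, 1.89]]

scores = first_pc_scores(batch_a + batch_b)
scores_a, scores_b = scores[:4], scores[4:]
# If the batches occupy disjoint ranges on PC1, the metrics differ by batch.
separated = max(scores_a) < min(scores_b) or max(scores_b) < min(scores_a)
print(separated)
```

In practice one would plot the first two or three components and color samples by sequencing source; clear clustering by source suggests a batch effect worth investigating before association testing.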
Summary
Large sample sets of whole genome sequencing with deep coverage are being generated, but assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or in the bioinformatics tools used to process the data. QC strategies proposed for whole exome sequencing (WES) include empirically derived variant filtering [10] and methods for removing batch effects in copy number variation calling [11, 12]. These algorithms rely on read depth and either singular