A Novel Statistical Method to Diagnose, Quantify and Correct Batch Effects in Genomic Studies

Gift Nyamundanda, Anguraj Sadanandam, Yatish Patil, Pawan Poudel

doi:10.1038/s41598-017-11110-6

Open Access

https://doi.org/10.1038/s41598-017-11110-6

Abstract

Genome projects now generate large-scale data, often produced at various time points by different laboratories using multiple platforms, which increases the potential for batch effects. Several batch evaluation methods currently exist, such as principal component analysis (PCA; mostly based on visual inspection), but they sometimes fail to reveal all of the underlying batch effects. These methods also carry the risk of unintentionally correcting away biologically interesting factors that are attributed to batch effects. Here we propose a novel statistical method, finding batch effect (findBATCH), to evaluate batch effects based on probabilistic principal component and covariates analysis (PPCCA). The same framework also provides a new approach to batch correction, correcting batch effect (correctBATCH), which we show to be a better approach than traditional PCA-based correction. We demonstrate the utility of these methods using two examples (breast and colorectal cancers), merging gene expression data from different studies after diagnosing and correcting for batch effects while retaining the biological effects. These methods, along with conventional visual inspection-based PCA, are available as part of an R package, exploring batch effect (exploBATCH; https://github.com/syspremed/exploBATCH).

Highlights

Many approaches have been developed to remove batch effects from high-throughput genomic profiling datasets.
Although principal variance component analysis (PVCA) has been successfully applied to compare the performance of different batch correction methods, it has the following main limitations in diagnosing batch effects: (i) it involves multiple batch evaluation steps, which reduces statistical power; (ii) there is no standard approach for selecting the optimal number of principal components (PCs) associated with the data; and (iii) it does not use a formal statistical test to assess the significance of the batch effects.
The findBATCH function selects the optimal number of probabilistic principal components (pPCs; ref. 18) associated with probabilistic principal component and covariates analysis (PPCCA) and exploits the variability associated with the batch variable to quantify and test the effect of batch(es) in the data.
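
To make the idea of quantifying and formally testing a batch effect on a component concrete, here is a minimal sketch in Python. It is not the PPCCA model behind findBATCH: it assumes the component scores are already available, handles only two batches, and uses a simple difference in mean scores with a normal-approximation 95% confidence interval as a stand-in. The function name `batch_effect_ci` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def batch_effect_ci(scores, batch, z=1.96):
    """Difference in mean component score between batch 1 and batch 0,
    with an approximate 95% confidence interval (normal approximation)."""
    group0 = [s for s, b in zip(scores, batch) if b == 0]
    group1 = [s for s, b in zip(scores, batch) if b == 1]
    effect = mean(group1) - mean(group0)
    # Standard error of a difference in means with unequal variances
    se = sqrt(variance(group0) / len(group0) + variance(group1) / len(group1))
    return effect, effect - z * se, effect + z * se


# Hypothetical scores of ten samples on one component, five per batch
scores = [-1.2, -0.8, -1.0, -1.1, -0.9, 1.0, 0.9, 1.2, 0.8, 1.1]
batch = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

effect, lower, upper = batch_effect_ci(scores, batch)
# An interval that excludes zero suggests a batch effect on this component
has_batch_effect = lower > 0 or upper < 0
print(effect, lower, upper, has_batch_effect)
```

With so few samples the normal approximation is crude; the point is only that an interval on the estimated batch effect gives a test criterion, which visual inspection of a PCA plot does not.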

Summary

Introduction

Many approaches have been developed to remove batch effects from high-throughput genomic profiling datasets. Principal variance component analysis (PVCA) derives the proportion of variability associated with the batch effect from the batch variability estimated by a linear mixed model and the eigenvalues associated with each principal component (PC) from PCA (ref. 1). Although this method has been successfully applied to compare the performance of different batch correction methods, it has the following main limitations in diagnosing batch effects: (i) it involves multiple batch evaluation steps, which reduces statistical power; (ii) there is no standard approach for selecting the optimal number of PCs associated with the data; and (iii) it does not use a formal statistical test to assess the significance of the batch effects. There remains a need for methods that use formal statistical testing to evaluate and diagnose batch effects before and after batch correction.
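
The PVCA calculation described above can be sketched as an eigenvalue-weighted average. This is a minimal illustration, assuming the per-PC fraction of variance attributed to batch has already been estimated by a mixed model for each retained PC; the function name `weighted_batch_proportion` and the numbers are hypothetical, not taken from the paper.

```python
def weighted_batch_proportion(eigenvalues, batch_fraction_per_pc):
    """Eigenvalue-weighted average of the per-PC batch variance fractions."""
    if len(eigenvalues) != len(batch_fraction_per_pc):
        raise ValueError("one batch fraction is needed per retained PC")
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    weighted = sum(ev * frac for ev, frac in zip(eigenvalues, batch_fraction_per_pc))
    return weighted / total


# Hypothetical values for three retained PCs
eigenvalues = [6.0, 3.0, 1.0]            # variance explained by each PC
batch_fraction_per_pc = [0.5, 0.2, 0.1]  # batch share of each PC's variance

proportion = weighted_batch_proportion(eigenvalues, batch_fraction_per_pc)
print(round(proportion, 2))  # 0.37
```

The result depends on how many PCs are retained, and it is a descriptive proportion rather than a test statistic, which corresponds to limitations (ii) and (iii) listed above.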

Methods

Results

Conclusion

Journal: Scientific Reports	Publication Date: Sep 7, 2017
Citations: 43	License type: open-access

Similar Papers

Detecting and Correcting Batch Effects in High-Throughput Genomic Experiments
12 Jul 2014

Abstract 893: Batch effects in tumor biomarker studies using tissue microarrays: Extent, impact, and remediation
Konrad H Stopsack ... J Bailey Vaselkiv
Cancer Research | VOL. 81
01 Jul 2021

Protein complex-based analysis is resistant to the obfuscating consequences of batch effects --- a case study in clinical proteomics
Wilson Wen Bin Goh ... Limsoon Wong
BMC Genomics | VOL. 18
01 Mar 2017

How missing value imputation is confounded with batch effects and what you can do about it
Wilson Wen Bin Goh ... Limsoon Wong
Drug Discovery Today | VOL. 28
09 Jun 2023
