Abstract

Genome projects now generate large-scale data, often produced at various time points by different laboratories using multiple platforms. This increases the potential for batch effects. Current batch evaluation methods, such as principal component analysis (PCA), rely mostly on visual inspection and sometimes fail to reveal all of the underlying batch effects. They also carry the risk of unintentionally correcting biologically interesting factors that have been attributed to batch effects. Here we propose a novel statistical method, finding batch effect (findBATCH), to evaluate batch effect based on probabilistic principal component and covariates analysis (PPCCA). The same framework also provides a new approach to batch correction, correcting batch effect (correctBATCH), which we show to be a better approach than traditional PCA-based correction. We demonstrate the utility of these methods using two different examples (breast and colorectal cancers) by merging gene expression data from different studies after diagnosing and correcting for batch effects while retaining the biological effects. These methods, along with conventional visual inspection-based PCA, are available as part of an R package, exploring batch effect (exploBATCH; https://github.com/syspremed/exploBATCH).

Highlights

  • Many approaches have been developed to remove batch effects from high-throughput genomic profiling datasets

  • Although principal variance component analysis (PVCA) has been successfully applied to compare the performance of different batch correction methods, it has the following main limitations in diagnosing batch effects: (i) it involves multiple batch evaluation steps, which reduces statistical power; (ii) there is no standard approach for selecting the optimal number of principal components (PCs) associated with the data; and (iii) it does not use a formal statistical test to assess the significance of the batch effects

  • The findBATCH function selects the optimal number of probabilistic PCs (pPCs) [18] associated with probabilistic principal component and covariates analysis (PPCCA) and exploits the variability associated with the batch variable to quantify and test the effect of batch(es) in the data

Summary

Introduction

Many approaches have been developed to remove batch effects from high-throughput genomic profiling datasets. Principal variance component analysis (PVCA) derives the proportion of variability associated with batch effect using the estimated batch variability from a linear mixed model and the eigenvalues associated with each PC from PCA [1]. Although this method has been successfully applied to compare the performance of different batch correction methods, it has the following main limitations in diagnosing batch effects: (i) it involves multiple batch evaluation steps, which reduces statistical power; (ii) there is no standard approach for selecting the optimal number of PCs associated with the data; and (iii) it does not use a formal statistical test to assess the significance of the batch effects. There remains a need for methods that use formal statistical testing to evaluate and diagnose batch effect(s) before and after batch correction.
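
The eigenvalue-weighted idea behind a PVCA-style summary can be illustrated with a minimal sketch. This is not the PVCA implementation and not part of exploBATCH: the function name `pvca_like_batch_proportion`, the `n_pcs` parameter, and the simulated data are hypothetical, and a one-way ANOVA R² per PC is swapped in for the linear mixed model purely to keep the example short. Note that `n_pcs` is chosen by hand and no significance test is performed, which mirrors limitations (ii) and (iii).

```python
import numpy as np


def pvca_like_batch_proportion(X, batch, n_pcs=3):
    """Eigenvalue-weighted share of PC variance explained by batch.

    X     : samples x features matrix (e.g. samples x genes).
    batch : one batch label per sample.
    n_pcs : number of leading PCs to use (picked by hand here).

    Simplifying assumption: a one-way ANOVA R^2 per PC replaces the
    linear mixed model used by PVCA.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)

    # PCA via SVD of the column-centred data.
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_pcs] * s[:n_pcs]      # PC scores per sample
    eigvals = s[:n_pcs] ** 2               # proportional to PC variances
    weights = eigvals / eigvals.sum()      # relative weight of each PC

    # Per-PC fraction of variance lying between batches (R^2).
    r2 = np.zeros(n_pcs)
    for k in range(n_pcs):
        y = scores[:, k]
        total_ss = np.sum((y - y.mean()) ** 2)
        between_ss = sum(
            np.sum(batch == b) * (y[batch == b].mean() - y.mean()) ** 2
            for b in np.unique(batch)
        )
        if total_ss > 0:
            r2[k] = between_ss / total_ss

    # Weighted average across the retained PCs.
    return float(np.sum(weights * r2)), r2


if __name__ == "__main__":
    # Hypothetical simulated data: 40 samples, 50 features, two batches.
    rng = np.random.default_rng(0)
    labels = np.repeat(["a", "b"], 20)
    data = rng.normal(size=(40, 50)) + (labels == "b")[:, None] * 3.0
    proportion, per_pc = pvca_like_batch_proportion(data, labels)
    print(proportion, per_pc)
```

The returned proportion lies between 0 and 1; a value near 1 means the retained PCs are dominated by between-batch differences. Because the result depends on the hand-picked `n_pcs` and comes with no p-value, it is a descriptive summary only.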
