Abstract

Combining multiple data sets from the Gene Expression Omnibus (GEO) or other data repositories for an integrated analysis requires appropriate batch correction. ComBat, an empirical Bayesian method for batch correction of microarray data, is widely used and has been reported to be the best correction method. We combined cancer data from 16 public studies representing 8 tissue types and a total of 3,563 samples, applied ComBat from the R “sva” package for batch correction, and examined 6 gene sets representing positive and negative controls. As positive controls, we extracted 4 gene sets from the Human Protein Atlas that were found to be expressed at least 5-fold higher in one tissue than in any of 35 other tissues, and we matched these genes to their Affymetrix U133A probesets. This resulted in 16 probesets specific for stomach, 18 for lung, 37 for pancreas, and 27 for prostate. A fifth positive control is a group of 85 genes called BA80 that we have found to be expressed at much lower levels in blood than in solid tissues. As a negative control that we do not expect to change much between tissues, we used a list of 3,804 housekeeping (HK) genes that were reported to show less than a 4-fold expression change across 16 tissue types. We compared the ComBat results to a new method we call equal medians. The equal medians method assumes that the 22,277 genes measured on the Affymetrix U133A microarrays can vary widely between tissues and batches, but that the median of the 22,277 genes is the same for every sample. We created boxplots of each gene set across the 16 studies before and after each method of batch correction. The reduction in batch effects was scored using the change in standard deviation of the HK genes. The preservation of biological variability was scored using the fold change of the positive controls, comparing the target tissue’s median to the nearest alternate tissue’s median.
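The equal medians assumption can be illustrated with a minimal sketch. The abstract does not give the implementation, so the following makes several assumptions: expression values are on a log2 scale, each sample is corrected by an additive shift, and the common target is the mean of the per-sample medians. The function name `equal_medians` and the toy data are hypothetical.

```python
from statistics import mean, median


def equal_medians(samples, target=None):
    """Shift every sample so that all samples share one median.

    samples: dict mapping sample name -> list of log2 expression values
    target:  median to impose on every sample; defaults to the mean of
             the per-sample medians (an assumption, not from the abstract)
    """
    medians = {name: median(values) for name, values in samples.items()}
    if target is None:
        target = mean(list(medians.values()))
    # An additive shift leaves within-sample differences between genes intact.
    return {
        name: [v - medians[name] + target for v in values]
        for name, values in samples.items()
    }


# Toy data: sample_b carries a uniform offset of +2 plus one gene
# (the last one) that is higher for a non-batch reason.
raw = {
    "sample_a": [4.0, 6.0, 8.0, 10.0, 12.0],
    "sample_b": [6.0, 8.0, 10.0, 12.0, 20.0],
}
corrected = equal_medians(raw)
```

In this toy case the uniform offset is removed while the last gene remains higher in `sample_b`, which is the behavior the method relies on: only a per-sample shift is applied, so gene-level differences are not shrunk toward a common value.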
We used two GEO studies as independent representatives of each tissue type, so the two fold changes were averaged to create a single measure. The results using the HK genes showed that ComBat removed 99.90% of the batch effects visible in the raw data, while equal medians removed 61.58%. However, equal medians was better at preserving biological variability, with a fold change of 4.8 for stomach, 13.1 for lung, 42.3 for pancreas, 12.0 for prostate, and 3.9 for blood. The corresponding fold changes for ComBat were 1.4, 1.1, 2.2, 1.0, and 1.0. We conclude that ComBat was best at removing batch effects, but at the undesirable cost of minimizing biological variation. We believe this is due to known and unknown sources of variability that are confounded with batches, which is one of ComBat’s known risks. Equal medians showed the opposite performance, preserving biological variation better while only partially removing batch effects. We offer the equal medians method as an alternative batch correction method in cases where ComBat shows evidence of overcorrection.

Citation Format: John C. Obenauer, Thomas P. Stockfisch, Marcia V. Fournier. Overcorrection of batch effects by ComBat can be avoided by using an equal medians method [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 1659.
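The two scores described in the abstract can be sketched as follows. The exact formulas are not stated, so this is one plausible reading under labeled assumptions: medians are on a log2 scale (a difference of d corresponds to a fold change of 2^d), the "nearest alternate tissue" is the one whose median is closest to the target's, and the batch-effect score is the percent reduction in the standard deviation of per-study HK medians. All function names and numbers below are hypothetical toy values, not the study's data.

```python
from statistics import mean, pstdev


def fold_change(tissue_medians, target):
    """Fold change between the target tissue and its nearest alternate tissue.

    tissue_medians: dict mapping tissue -> median log2 expression of a gene set
    """
    target_median = tissue_medians[target]
    others = [m for tissue, m in tissue_medians.items() if tissue != target]
    # Nearest alternate tissue = smallest absolute gap to the target median.
    nearest = min(others, key=lambda m: abs(m - target_median))
    return 2 ** abs(target_median - nearest)


def percent_batch_removed(hk_before, hk_after):
    """Percent reduction in the spread of per-study HK medians (assumed formula)."""
    return 100.0 * (1.0 - pstdev(hk_after) / pstdev(hk_before))


# Toy example: two studies stand in for the two representatives of "lung".
study_1 = {"lung": 10.0, "stomach": 7.0, "prostate": 6.0}
study_2 = {"lung": 9.0, "stomach": 7.0, "prostate": 8.0}
lung_score = mean([fold_change(study_1, "lung"), fold_change(study_2, "lung")])

# Toy per-study medians of the HK gene set before and after correction.
removed = percent_batch_removed([8.0, 10.0, 12.0, 10.0], [9.5, 10.0, 10.5, 10.0])
```

Here `lung_score` averages fold changes of 8 and 2, and `removed` comes out at about 75 percent. Taking the nearest alternate tissue rather than the mean of all others makes the score conservative: a gene set only scores well if the target tissue stands apart from every other tissue.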
