Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.

Charlotte Soneson, Sarah Gerster, Mauro Delorenzi, Shu-Dong Zhang

doi:10.1371/journal.pone.0100335

Open Access

Abstract

Background: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences (“batch effects”) as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies.

Focus: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects.

Data: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., ‘control’) or group 2 (e.g., ‘treated’). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects.

Methods: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, are performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.

Highlights

Every day, large quantities of data are generated by biological and medical labs all over the world
We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors and Random Forests (RF)
The current study focuses on the impact of batch effects on the ability to build and evaluate the performance of a classifier based on gene expression data

Summary

Introduction

Large quantities of data are generated by biological and medical labs all over the world. Measured gene expression levels are very sensitive to external factors such as the technician running the experiment, the reagent batch, or the time of day when an assay was processed [5]. Such systematic errors, related to technical aspects, are often referred to as batch effects (see [6] for a comprehensive discussion). The bias introduced by a batch effect “may be defined as unintentional, systematic erroneous association of some characteristic with a group in a way that distorts a comparison with another group” [5]. These biases can appear between data sets from completely different studies, and also within single studies, where time and capacity restrictions may make it impossible to process all samples under identical conditions. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects.
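
To make the role of confounding more concrete, the following Python sketch compares a cross-validated accuracy estimate with the accuracy on independent data under two training designs: one where group and batch coincide completely, and one where they are balanced. It is only an illustration under simplifying assumptions, not the procedure used in the study: the data are Gaussian noise rather than resampled real expression values, a nearest-centroid rule stands in for the classifiers named in the abstract, plain 5-fold cross-validation replaces the nested scheme, and no feature selection is done. All function names, parameter values (`batch_shift`, `effect`, `n_de`) and sample sizes are hypothetical.

```python
import random


def simulate(design, rng, n_features=20, n_de=5, effect=1.0, batch_shift=1.5):
    """Draw samples for every (group, batch, count) cell of a design.

    Assumed toy model: unit-variance Gaussian noise, a constant shift on all
    features for batch 1, and a group effect on the first n_de features.
    """
    X, y = [], []
    for group, batch, count in design:
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) + batch_shift * batch
                   + (effect * group if j < n_de else 0.0)
                   for j in range(n_features)]
            X.append(row)
            y.append(group)
    return X, y


def fit(X, y):
    """Nearest-centroid classifier: one mean vector per group."""
    model = {}
    for g in set(y):
        rows = [x for x, lab in zip(X, y) if lab == g]
        model[g] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict(model, x):
    """Return the group whose centroid is closest in squared distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda g: dist2(model[g]))


def accuracy(model, X, y):
    return sum(predict(model, x) == lab for x, lab in zip(X, y)) / len(y)


def cv_accuracy(X, y, rng, k=5):
    """Plain k-fold cross-validation (not nested, no tuning)."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    correct = 0
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(y)


def run(train_design, seed=1):
    """Return (cross-validated accuracy, accuracy on independent data)."""
    rng = random.Random(seed)
    X, y = simulate(train_design, rng)
    # Independent data: group and batch are balanced, so batch membership
    # carries no information about the group label.
    X_ext, y_ext = simulate([(0, 0, 50), (0, 1, 50), (1, 0, 50), (1, 1, 50)], rng)
    return cv_accuracy(X, y, rng), accuracy(fit(X, y), X_ext, y_ext)


# Fully confounded: every group-0 sample is in batch 0, every group-1 in batch 1.
confounded = [(0, 0, 40), (1, 1, 40)]
# Balanced: each group is split evenly across the two batches.
balanced = [(0, 0, 20), (0, 1, 20), (1, 0, 20), (1, 1, 20)]

cv_conf, ext_conf = run(confounded)
cv_bal, ext_bal = run(balanced)
print(f"confounded: CV={cv_conf:.2f}  independent={ext_conf:.2f}")
print(f"balanced:   CV={cv_bal:.2f}  independent={ext_bal:.2f}")
```

In the confounded design the classifier can separate the groups through the batch shift alone, so the cross-validated estimate should come out far higher than the accuracy on independent data. In the balanced design the two numbers should lie much closer together, although with samples this small they will still differ by chance.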
