Abstract

One important application of microarrays in clinical settings is the construction of diagnosis or prognosis models. Batch effects are a well-known obstacle in this type of application. Recently, a prominent study examined how batch-effect removal techniques could potentially improve microarray prediction performance. However, the results were not very encouraging: prediction performance did not always improve, and in up to 20% of the cases prediction accuracy was actually reduced. Furthermore, the study noted that the techniques examined require sufficiently large sample sizes in both batches (train and test) to be effective, which is not a realistic situation, especially in clinical settings. In this paper, we propose a different approach that is able to overcome the limitations faced by conventional methods. Our approach uses the rank values of microarray data together with a bagging ensemble classifier, with sequential hypothesis testing to dynamically determine the number of classifiers required in the ensemble. Using data sets similar to those in the original study, we show that performance is reduced (by more than 0.05 AUC) in only one case (<2%) and is improved (by more than 0.05 AUC) in more than 60% of cases. In addition, our approach works even on much smaller training data sets and is independent of the sample size of the test data, making it feasible to apply in clinical studies.
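To make the two ingredients of the approach concrete, the sketch below shows (a) a within-sample rank transform, which replaces raw expression values with their ranks and is therefore unaffected by monotone, sample-wise batch distortions, and (b) a bagging loop that adds bootstrap classifiers one at a time and stops once performance stabilizes. The function names, the use of `DecisionTreeClassifier` as the base learner, and the stopping rule (out-of-bag accuracy varying by less than a tolerance over a sliding window) are illustrative assumptions; the paper's actual sequential hypothesis test is not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rank_transform(X):
    """Replace each sample's expression values with their within-sample
    ranks (0 = lowest), making the features invariant to any monotone,
    sample-wise distortion such as a batch-specific scale or shift."""
    return np.argsort(np.argsort(X, axis=1), axis=1).astype(float)

def dynamic_bagging(X, y, max_models=50, window=5, tol=0.01, seed=0):
    """Grow a bagging ensemble one bootstrap replicate at a time and stop
    once the out-of-bag (OOB) accuracy has stabilized.  This stopping rule
    is a simple stand-in for the paper's sequential hypothesis test.
    Labels are assumed to be integers 0..k-1."""
    rng = np.random.default_rng(seed)
    n_samples, n_classes = len(y), len(np.unique(y))
    votes = np.zeros((n_samples, n_classes))          # accumulated OOB votes
    models, oob_history = [], []
    for _ in range(max_models):
        idx = rng.integers(0, n_samples, n_samples)   # bootstrap draw
        oob = np.setdiff1d(np.arange(n_samples), idx) # left-out samples
        clf = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
        models.append(clf)
        if len(oob):
            votes[oob, clf.predict(X[oob])] += 1      # vote on OOB samples
        scored = votes.sum(axis=1) > 0                # samples with >= 1 vote
        if scored.any():
            acc = np.mean(np.argmax(votes[scored], axis=1) == y[scored])
            oob_history.append(acc)
        if len(oob_history) >= window and np.ptp(oob_history[-window:]) < tol:
            break                                     # ensemble has stabilized
    return models
```

Because the rank transform is applied independently to each sample, a classifier trained on ranks needs no information about the test batch, which is consistent with the abstract's claim that the approach is independent of the test-data sample size.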

Highlights

  • Noise has a negative connotation in the classical view of biology

  • Measurement errors in gene expression microarrays are studied by the MicroArray Quality Control (MAQC) project, a large-scale study led by FDA scientists involving 137 participants from 51 organizations, which showed that the median coefficient of variation of replicates is between 5% and 15% [5]

  • Complementing that with bagging brings the improvement one notch higher, while dynamic bagging is able to maintain high performance with a minimal number of bootstrap replicates. With this Duchenne Muscular Dystrophy (DMD) data set, we have shown that our methodology works well on a non-cancer-related data set, and it further suggests that our approach is able to overcome cross-platform prediction problems in addition to batch effects


Introduction

Noise has a negative connotation in the classical view of biology: one often attempts to remove “noise” from data using various statistical methods before any downstream analysis. There are two different types of noise in biological data: experimental noise and inherent cell variation. Distinguishing experimental noise from natural fluctuation due to inherent cell variation is a daunting task, and attempts to de-noise data often remove meaningful cell variation as well. Inherent cell variation can arise from intrinsic and extrinsic sources [1]. Intrinsic noise sources affect two equivalent, independent gene reporters placed in the same cell differently, whereas extrinsic noise sources affect both reporters in a given cell equally but affect reporters in another cell differently. Experimental noise in gene expression measurement data mainly takes two forms: measurement errors and batch effects. Batch effects are non-biological systematic biases that exist across batches of samples due to differences in experimental handling. Incorrect conclusions may be drawn, especially when batch effects are correlated with an outcome of interest [6].
