The impact of sample imbalance on identifying differentially expressed genes

Kun Yang,Hong Gao,Jianzhong Li

doi:10.1186/1471-2105-7-s4-s8

Open Access

Journal: BMC Bioinformatics, 7(Suppl 4):S8
Publication Date: Dec 1, 2006
Citations: 45
License type: CC BY 2.0

Affiliation: Harbin Institute of Technology

Abstract

Background

Several statistical methods have recently been proposed to identify genes that are differentially expressed between two conditions. However, very few studies consider the problem of sample imbalance, and no study has investigated its impact on identifying differentially expressed genes. In addition, it is not clear which method is most suitable for unbalanced data.

Results

Based on random sampling, two evaluation models are proposed to investigate the impact of sample imbalance on identifying differentially expressed genes. Using these evaluation models, the performance of six well-known methods is compared on unbalanced data. The experimental results indicate that sample imbalance has a great influence on the selection of differentially expressed genes. Furthermore, the methods perform very differently on unbalanced data. Among the six methods, the Welch t-test appears to perform best when the large-variance group contains more samples than the small-variance group, while the Regularized t-test and SAM outperform the others on unbalanced data in the remaining cases.

Conclusion

The two proposed evaluation models are effective, and sample imbalance should be taken into account in microarray experiment design and gene expression data analysis. The results and the two evaluation models can help in selecting a suitable method for processing unbalanced data.
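To make the role of group sizes concrete, the following minimal sketch computes the equal-variance and Welch t statistics for a single gene measured in two groups of unequal size. It uses the standard textbook formulas and is not the authors' code; the function names and the toy expression values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances (pooled variance)."""
    n1, n2 = len(x), len(y)
    # The pooled variance weights each sample variance by its degrees of freedom.
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def welch_t(x, y):
    """Welch t statistic: each group's variance is divided by its own size."""
    n1, n2 = len(x), len(y)
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / n1 + variance(y) / n2)


# Toy data (hypothetical): a small low-variance group and a larger high-variance group.
small_group = [5.0, 5.2, 4.8, 5.1]
large_group = [3.0, 7.5, 4.0, 8.0, 2.5, 6.5, 5.5, 3.5, 7.0, 4.5]

t_pooled = pooled_t(small_group, large_group)
t_welch = welch_t(small_group, large_group)
print(f"pooled t = {t_pooled:.3f}, Welch t = {t_welch:.3f}")
```

With equal group sizes the two statistics coincide. In this toy case the larger group also has the larger variance, and the pooled standard error comes out larger than Welch's, so the pooled statistic is smaller in magnitude.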

Highlights

Several statistical methods have been proposed to identify genes with differential expression between two conditions
Six methods are systematically compared on real and simulated data under two evaluation models: the two-sample t-test with equal variances [6], the two-sample t-test with unequal variances (i.e. the Welch t-test) [5,7], the Wilcoxon rank-sum test [10], Significance Analysis of Microarrays (SAM) [11], the Regularized t-test [8] and the permutation-based method of Pan [15]
The expected Overlap Rates of the six methods, together with their error limits, are reported on the prostate and liver datasets under evaluation model 1, where the sample size of Class C1 in the artificial data created from the liver and prostate data is fixed at 60
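This excerpt does not define the Overlap Rate. As a hedged sketch, assume it is the fraction of genes shared between the top-k list selected from a full dataset and the top-k list selected from a randomly subsampled, unbalanced version of it. Everything below (the function names, ranking by the Welch statistic, the simulated data, and all sizes other than the 60 samples in Class C1) is an assumption for illustration and not the paper's evaluation model.

```python
import math
import random
from statistics import mean, variance


def welch_t(x, y):
    """Welch t statistic for two groups of possibly unequal size."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))


def top_k_genes(c1, c2, k):
    """Return the set of k gene indices with the largest |Welch t|.

    c1 and c2 map a gene index to its list of expression values, one per sample.
    """
    scores = {g: abs(welch_t(c1[g], c2[g])) for g in c1}
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def overlap_rate(reference, selected):
    """Fraction of the reference gene list that is also in the selected list."""
    return len(reference & selected) / len(reference)


def subsample(c, sample_idx):
    """Keep only the samples at positions sample_idx for every gene."""
    return {g: [values[i] for i in sample_idx] for g, values in c.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_genes, n1, n2, k = 200, 60, 60, 20

# Simulated data (assumption): the first 20 genes are shifted in Class C2.
c1 = {g: [rng.gauss(0.0, 1.0) for _ in range(n1)] for g in range(n_genes)}
c2 = {g: [rng.gauss(1.5 if g < 20 else 0.0, 1.0) for _ in range(n2)]
      for g in range(n_genes)}

# Reference list from the full, balanced data.
reference = top_k_genes(c1, c2, k)

# Class C1 stays at 60 samples; Class C2 is randomly cut to 10 samples per draw.
rates = []
for _ in range(20):
    idx = rng.sample(range(n2), 10)
    rates.append(overlap_rate(reference, top_k_genes(c1, subsample(c2, idx), k)))

expected_overlap = mean(rates)
print(f"mean overlap rate over 20 draws: {expected_overlap:.2f}")
```

Averaging over repeated random draws gives an estimate of the expected overlap, and the spread of `rates` could serve as a rough error limit under the same assumptions.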

Summary

Introduction

Several statistical methods have been proposed to identify genes with differential expression between two conditions. However, very few studies consider the problem of sample imbalance, and no study has investigated its impact on identifying differentially expressed genes. It is not clear which method is more suitable for unbalanced data. Differentially expressed genes are strongly related to the conditions and truly change their expression levels according to the conditions. These differentially expressed genes are very useful in later research and clinical applications [2,3]. We are interested in identifying which of several thousand candidate genes have had their expression levels changed by condition, given a microarray dataset.

Methods

Results

Discussion

Conclusion