The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies.

Leming Shi, Lu Zhang, Yaron Turpaz, Wenjun Bao, Charles Wang, Vincent Bertholet, Shashi Amur, Huixiao Hong, Jing Han, Yunqing Ma, Sheng Zhong, Sue Jane Wang, Nan Mei, Qian Xie, Xu Guo, Yuling Luo, Richard Shippy, Cecilie Boysen, Liang Zhang, Jie Wu, Zhenqiang Su, Hongmei Sun, Hong Fang, Weida Tong, Damir Herman, Lei Guo, Feng Qian, Roderick V Jensen, Roger Perkins, Tzu Ming Chu, Janet A Warrington, Lisa J Croner, Patrick J Collins, Federico Goodsaid, Wendell Jones, Cátálin Bárbácioru, Felix W Frueh, Raj K Puri, Russell D Wolfinger, Stephen Harris, Ernest S Kawasaki, James C Fuscoe, Xiao Hui Fan, Xiaoxi Cao, Yongming Sun, James C Willey, Quan Zhen Li, Ron L Peterson, Brett T Thorn

doi:10.1186/1471-2105-9-s9-s10

Abstract

Background: Reproducibility is a fundamental requirement in scientific experiments. Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. Meanwhile, new statistical methods for identifying DEGs continue to appear in the scientific literature. The resultant variety of existing and emerging methods exacerbates confusion and continuing debate in the microarray community on the appropriate choice of methods for identifying reliable DEG lists.

Results: Using the data sets generated by the MicroArray Quality Control (MAQC) project, we investigated the impact on the reproducibility of DEG lists of a few widely used gene selection procedures. We present comprehensive results from inter-site comparisons using the same microarray platform, cross-platform comparisons using multiple microarray platforms, and comparisons between microarray results and those from TaqMan – the widely regarded "standard" gene expression platform. Our results demonstrate that (1) previously reported discordance between DEG lists could simply result from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion with a non-stringent P-value cutoff filtering, the DEG lists become much more reproducible, especially when fewer genes are selected as differentially expressed, as is the case in most microarray studies; and (3) the instability of short DEG lists solely based on P-value ranking is an expected mathematical consequence of the high variability of the t-values; the more stringent the P-value threshold, the less reproducible the DEG list is. These observations are also consistent with results from extensive simulation calculations.

Conclusion: We recommend the use of FC-ranking plus a non-stringent P cutoff as a straightforward and baseline practice in order to generate more reproducible DEG lists. Specifically, the P-value cutoff should not be stringent (too small) and FC should be as large as possible. Our results provide practical guidance to choose the appropriate FC and P-value cutoffs when selecting a given number of DEGs. The FC criterion enhances reproducibility, whereas the P criterion balances sensitivity and specificity.
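The recommended procedure (filter by a non-stringent P cutoff, then rank by FC) can be contrasted with pure P-value ranking in a minimal Python sketch. Everything specific here is an assumption made for illustration: the names `GeneStat`, `select_by_fc_with_p_filter`, and `select_by_p`, the default cutoff of 0.05, and the five toy genes, which are invented and not MAQC data. The P-values are taken as precomputed inputs (for example from a two-tailed t-test) rather than derived in the code.

```python
from typing import List, NamedTuple


class GeneStat(NamedTuple):
    gene: str       # gene identifier (toy placeholder)
    log2_fc: float  # log2 fold change between the two sample groups
    p: float        # P-value, assumed precomputed (e.g. two-tailed t-test)


def select_by_fc_with_p_filter(
    stats: List[GeneStat], n_genes: int, p_cutoff: float = 0.05
) -> List[str]:
    """Keep genes with P below a non-stringent cutoff, then take the
    n_genes with the largest absolute log2 fold change."""
    passed = [s for s in stats if s.p < p_cutoff]
    ranked = sorted(passed, key=lambda s: abs(s.log2_fc), reverse=True)
    return [s.gene for s in ranked[:n_genes]]


def select_by_p(stats: List[GeneStat], n_genes: int) -> List[str]:
    """Rank genes solely by P-value (smallest first) and take the top n_genes."""
    ranked = sorted(stats, key=lambda s: s.p)
    return [s.gene for s in ranked[:n_genes]]


# Invented toy values, chosen only to show that the two rules can disagree.
stats = [
    GeneStat("g1", 3.0, 0.010),
    GeneStat("g2", 0.2, 0.0001),  # tiny P, negligible fold change
    GeneStat("g3", -2.5, 0.030),
    GeneStat("g4", 2.8, 0.200),   # large fold change, fails the P filter
    GeneStat("g5", 1.0, 0.004),
]

fc_list = select_by_fc_with_p_filter(stats, n_genes=2, p_cutoff=0.05)
p_list = select_by_p(stats, n_genes=2)
print(fc_list)  # ['g1', 'g3']
print(p_list)   # ['g2', 'g5']
```

In this toy case the two rules return disjoint lists: P-ranking promotes a gene with a very small P but a negligible fold change, while the FC-based rule uses P only as a filter to exclude genes whose change is not statistically supported.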

Highlights

Reproducibility is a fundamental requirement in scientific experiments
While P could be computed with many different statistical methods, for simplicity and consistency P is calculated throughout this article with the two-tailed t-test that is widely employed in microarray data analysis
A fundamental step of microarray studies is the identification of a small subset of differentially expressed genes (DEGs) from among tens of thousands of genes probed on the microarray

Summary

Introduction

Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. A comparative neurotoxicological study by Miller et al. [5], using the same set of RNA samples, found only 11 DEGs in common between the 138 and 425 DEGs identified on the Affymetrix and CodeLink platforms, respectively. All these studies ranked genes by P-value from simple t-tests, used a P threshold to identify DEG lists, and applied the concept of the Percentage of Overlapping Genes (POG), or the Venn diagram, between DEG lists as the measure of reproducibility.
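To make the POG measure concrete, the following Python sketch computes the overlap for the counts quoted from Miller et al. [5]. This section does not spell out the POG formula, so the sketch assumes a directional convention: the number of shared genes divided by the size of the first list, times 100. The function name `pog` and the gene identifiers are hypothetical placeholders; only the counts (138, 425, and 11 in common) come from the text.

```python
from typing import Iterable


def pog(list_a: Iterable[str], list_b: Iterable[str]) -> float:
    """Percentage of genes in list_a that also appear in list_b.

    Assumed convention: 100 * |A intersect B| / |A|, so the value depends on
    which list is used as the reference when the lists differ in length.
    """
    a, b = set(list_a), set(list_b)
    if not a:
        raise ValueError("list_a must contain at least one gene")
    return 100.0 * len(a & b) / len(a)


# Placeholder IDs that reproduce only the list sizes and the overlap count.
affymetrix = [f"gene{i}" for i in range(138)]           # 138 DEGs
codelink = [f"gene{i}" for i in range(127, 127 + 425)]  # 425 DEGs, 11 shared

shared = len(set(affymetrix) & set(codelink))
print(shared)                                # 11
print(round(pog(affymetrix, codelink), 2))   # 7.97
print(round(pog(codelink, affymetrix), 2))   # 2.59
```

Under this convention the overlap is below 10% whichever list serves as the reference, which is the kind of figure that prompted the claims of poor reproducibility.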

Methods

Results

Discussion

Conclusion