Abstract

Background: Detection of disease-associated markers plays a crucial role in gene screening for biological studies. Two-sample test statistics, such as the t-statistic, are widely used to rank genes based on gene expression data. However, the resultant gene ranking is often not reproducible among different data sets. Such irreproducibility may be caused by disease heterogeneity.

Results: When we divided data into two subsets, we found that the signs of the two t-statistics were often reversed. Focusing on such instability, we proposed a sign-sum statistic that counts the signs of the t-statistics for all possible subsets. The proposed method excludes genes affected by heterogeneity, thereby improving the reproducibility of gene ranking. We compared the sign-sum statistic with the t-statistic by a theoretical evaluation of the upper confidence limit. Through simulations and applications to real data sets, we show that the sign-sum statistic exhibits superior performance.

Conclusion: We derive the sign-sum statistic to obtain a robust gene ranking. The sign-sum statistic gives a more reproducible ranking than the t-statistic. Using simulated data sets, we show that the sign-sum statistic effectively excludes hetero-type genes. On real data sets, the sign-sum statistic also performs well in terms of ranking reproducibility.

Electronic supplementary material: The online version of this article (doi:10.1186/s12920-016-0214-5) contains supplementary material, which is available to authorized users.
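The idea of counting t-statistic signs over subsets can be illustrated with a short Python sketch. This is not the authors' definition, which is not given in this excerpt. It is one possible reading under stated assumptions: random subsets are drawn instead of enumerating all possible subsets, and the function name `sign_sum_sketch` and the parameters `n_draws`, `frac`, and `seed` are hypothetical names chosen for illustration.

```python
import numpy as np


def sign_sum_sketch(x, y, n_draws=500, frac=0.5, seed=0):
    """Sum the signs of two-sample t-statistics over random subsets.

    x, y    : 1-D arrays with one gene's expression in cases and controls.
    n_draws : number of random subsets (an assumption; the abstract speaks
              of all possible subsets, which is infeasible to enumerate here).
    frac    : fraction of each group kept in a subset (hypothetical choice).
    """
    rng = np.random.default_rng(seed)
    kx = max(2, int(frac * x.size))
    ky = max(2, int(frac * y.size))
    total = 0
    for _ in range(n_draws):
        xs = rng.choice(x, size=kx, replace=False)
        ys = rng.choice(y, size=ky, replace=False)
        # The denominator of a two-sample t-statistic is positive, so its
        # sign equals the sign of the difference in group means.
        total += int(np.sign(xs.mean() - ys.mean()))
    return total


# Usage on synthetic data (illustrative values only, not from the paper).
rng = np.random.default_rng(1)
controls = rng.normal(0.0, 1.0, 20)
homogeneous_cases = rng.normal(5.0, 1.0, 20)
heterogeneous_cases = np.concatenate(
    [rng.normal(5.0, 1.0, 10), rng.normal(-5.0, 1.0, 10)]
)
print(sign_sum_sketch(homogeneous_cases, controls))
print(sign_sum_sketch(heterogeneous_cases, controls))
```

In this toy setting, a gene shifted consistently in all cases yields the same sign in every subset, so the absolute sum is large. A gene shifted in opposite directions in two case subgroups yields mixed signs and a sum closer to zero, which is the kind of instability the abstract attributes to heterogeneity.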

Highlights

  • Detection of disease-associated markers plays a crucial role in gene screening for biological studies

  • Two-sample test statistics, such as the t-statistic and the Wilcoxon rank-sum statistic, are widely used to rank genes based on gene expression data

  • In real data analysis, ranking irreproducibility may be caused by disease heterogeneity


Summary

Introduction

Detection of disease-associated markers plays a crucial role in gene screening for biological studies. In this field, statisticians seek to identify informative genes as candidates for further investigation. To this end, it is desirable to correctly rank genes according to their degree of differential expression. However, the resultant gene rankings are often not reproducible among different data sets. Such irreproducibility may be caused by disease heterogeneity [1]. We can confirm this ranking irreproducibility in

