Abstract

Background: With the growing number of public repositories for high-throughput genomic data, it is of great interest to combine the results produced by independent research groups. Such a combination allows the identification of common genomic factors across multiple cancer types and provides new insights into the disease process. In the framework of the proportional hazards model, classical procedures, which consist of ranking genes according to the estimated hazard ratio or the p-value obtained from a test statistic of no association between survival and gene expression level, are not suitable for gene selection across multiple genomic datasets with different sample sizes. We propose a novel index for identifying genes with a common effect across heterogeneous genomic studies. It is designed to remain stable regardless of the sample size and has a straightforward interpretation in terms of the percentage of separability between patients according to their survival times and gene expression measurements.

Results: The simulation results show that the proposed index is not substantially affected by the sample size of the study or by censoring. They also show that its separability performance is higher than that of indices of predictive accuracy relying on the likelihood function. A simulated example illustrates the good operating characteristics of our index. In addition, we demonstrate that it is linked to the score statistic and possesses a biologically relevant interpretation. The practical use of the index is illustrated by identifying genes with common effects across eight independent genomic cancer studies of different sample sizes. The meta-selection allows the identification of four genes (ESPL1, KIF4A, HJURP, LRIG1) that are biologically relevant to the carcinogenesis process and have a prognostic impact on survival outcome across various solid tumors.

Conclusion: The proposed index is a promising tool for identifying factors having a prognostic impact across a collection of heterogeneous genomic datasets of various sizes.

Highlights

  • With the growing number of public repositories for high-throughput genomic data, it is of great interest to combine the results produced by independent research groups

  • The proposed index is a promising tool for identifying factors having a prognostic impact across a collection of heterogeneous genomic datasets of various sizes

  • Choosing the estimated hazard ratio ignores the variability of the data, while quantities derived from test statistics tend to favor large datasets, since for a given non-null effect a test statistic typically grows with the sample size


Summary

Introduction

The growing number of public repositories for high-throughput genomic data facilitates the retrieval and combination of various datasets produced by independent research groups (to name a few: GEO [1], Oncomine [2], ArrayExpress [3]). These databases potentially represent valuable resources for identifying genomic factors that have a common prognostic impact on clinical outcome. Classical procedures, frequently used in practice, consist of ranking the genes in each dataset from the highest (or lowest) value to the lowest (or highest) value according to either the estimated hazard ratio or quantities derived from the test statistic (e.g. the p-value), and then selecting those that appear at the intersection of the lists using a defined thresholding procedure [9]. Choosing the estimated hazard ratio ignores the variability of the data, while quantities derived from test statistics tend to favor large datasets, since for a given non-null effect a test statistic typically grows with the sample size.
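The classical rank-and-intersect procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names (`top_genes`, `intersect_top_lists`), the fixed top-k threshold, the gene labels, and the p-values are all hypothetical, and in practice the per-gene scores would come from fitting a proportional hazards model in each dataset.

```python
from typing import Dict, List, Set


def top_genes(scores: Dict[str, float], k: int, smallest_first: bool = True) -> Set[str]:
    """Return the k best-ranked genes of one dataset.

    smallest_first=True suits p-values; use False for scores where a
    larger value means a stronger effect (e.g. absolute log hazard ratio).
    """
    ranked = sorted(scores, key=scores.get, reverse=not smallest_first)
    return set(ranked[:k])


def intersect_top_lists(datasets: List[Dict[str, float]], k: int,
                        smallest_first: bool = True) -> Set[str]:
    """Keep only the genes that appear in the top-k list of every dataset."""
    if not datasets:
        return set()
    lists = [top_genes(d, k, smallest_first) for d in datasets]
    return set.intersection(*lists)


# Hypothetical p-values for four made-up genes in two studies.
study_a = {"G1": 0.001, "G2": 0.20, "G3": 0.03, "G4": 0.50}
study_b = {"G1": 0.010, "G2": 0.002, "G3": 0.04, "G4": 0.60}

common = intersect_top_lists([study_a, study_b], k=2)
print(sorted(common))  # ['G1']
```

Because each list is cut at a threshold on a quantity that depends on the sample size (when p-values are used), the genes that survive the intersection can be driven by the largest studies, which is the weakness the text points out.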

